krivaltsevich.com Kafka Internals4.4

II · 06 · Durability, Availability & Consistency

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

Durability in Kafka is not a single switch; it is the product of three dials that live in three different places, acks on the producer (the durability request), min.insync.replicas on the broker (the durability enforcement), and replication.factor on the topic (the redundancy). Get the relationship between them right and a single broker loss costs you zero acknowledged data while writes keep flowing; get it wrong and you have either a cluster that stops accepting writes the moment one follower hiccups, or, worse, and silently, a topic that acknowledges writes it cannot survive. This chapter derives the canonical RF=3 / min.insync.replicas=2 / acks=all standard from the actual enforcement code, explains why Kafka deliberately does not fsync each message (and what that means for where your replicas must physically sit), maps the availability-vs-durability dial that is unclean leader election, and closes with the arithmetic of data-loss probability and where Kafka sits on the CAP plane. Every default and limit here is read from the 4.4 source; every benchmark is marked empirical and cited.

The three dials and where each one lives

Operators routinely conflate these because they all sound like "how durable is my topic." They are mechanically distinct, evaluated at different times, by different components, and a misconfiguration of any one is invisible until the failure that exercises it. Hold this table before touching anything.

DialConfig key / defaultOwned byWhat it controlsSource
Durability requestacks = all producer defaultProducerHow many replicas must confirm before the producer treats a write as done (0 / 1 / all)clients/.../producer/ProducerConfig.java:405
Durability enforcementmin.insync.replicas = 1Broker (partition leader)Minimum ISR size below which an acks=all write is refusedserver-common/.../ServerLogConfigs.java:155
Redundancydefault.replication.factor = 1Topic (set at create time)How many copies of each partition exist on distinct brokersserver/.../ReplicationConfigs.java:42
Availability vs durability last resortunclean.leader.election.enable = falseTopic / brokerMay an out-of-sync replica become leader (accepting possible data loss) when all ISR is downstorage/.../LogConfig.java:139
The two defaults that ship unsafe

Both min.insync.replicas and default.replication.factor default to 1 in the broker/controller configuration. A topic created with no explicit overrides therefore has one replica and accepts acks=all writes with an ISR of one, i.e. no durability guarantee at all beyond a single disk. The producer default is the safe one (acks=all), but a safe request against an RF=1 topic is still a single point of failure. Durability is something you must configure up, per topic, not something the defaults give you. See II · 01 Configuration for setting cluster-wide safe baselines.

Producer, the request
acks=0 fire-and-forget · acks=1 leader-only · acks=all full ISR
ProducerConfig.java:405
↓ ProduceRequest
Leader, the enforcement
if acks=all and |ISR| < effectiveMinIsr → reject with NotEnoughReplicasException before appending
Partition.scala:1234
↓ replicate to followers in ISR
ISR replicas, the redundancy
HW only advances once every ISR member has the record; record invisible to consumers until then
RF on N distinct brokers
The three dials act at three stages of one produce. The request is what the client asks for; the enforcement is the broker's veto; the redundancy is the physical copies that make the guarantee real.
producer/client · leader broker · log/replica copies · monospace = config key or source citation

acks, what the producer is willing to wait for

The acks config is defined with three legal values and a default of "all" (ProducerConfig.java:405, validator in("all","-1","0","1")). Its own documentation states the contract precisely (ProducerConfig.java:131): with acks=1 "should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost"; with acks=all "the leader will wait for the full set of in-sync replicas to acknowledge … This guarantees that the record will not be lost as long as at least one in-sync replica remains alive." Note the exact words: the full set of in-sync replicas, not min.insync.replicas of them. If RF=3 and all three are in the ISR, all three must ack, min.insync.replicas is the floor that gates whether the write is allowed, not a quorum count that lets the leader stop early.

acksProducer waits forLoss windowLatency costUse when
0Nothing, record into socket buffer, offset returned as -1Any broker/network blip loses in-flight records; retries don't applyLowestMetrics, logs, lossy telemetry where a dropped record is free
1Leader's local write onlyLeader dies after ack, before followers fetch → record lostOne leader writeThroughput-sensitive, single-failure-tolerant pipelines
all / -1Every replica currently in the ISROnly if every ISR replica loses unflushed data simultaneously (see flush philosophy)One replication round-trip to the slowest ISR followerAnything you cannot lose; the durable default
Why acks=all alone is not enough

acks=all says "wait for the full ISR." But the ISR can legitimately shrink to one member (just the leader) when followers fall behind, and an acks=all write to a single-member ISR is acknowledged after only the leader has it. That is not a durable write, yet the producer got its success. The dial that closes this hole is min.insync.replicas: it sets the floor on ISR size, below which the leader refuses the write entirely rather than acking a write that only one broker holds. acks=all and min.insync.replicas≥2 are a pair; neither delivers durability without the other. This mechanism is detailed in Part I · 08 Replication.

min.insync.replicas, the broker's veto, enforced before the write

This is the single most important durability mechanism to understand at the code level, because its timing is what makes it a guarantee rather than a hope. In appendRecordsToLeader, before the leader appends anything to its own log, it computes the effective minimum ISR and checks the current ISR size (Partition.scala:1230–1238):

val minIsr = effectiveMinIsr(leaderLog)
val inSyncSize = partitionState.isr.size
// Avoid writing to leader if there are not enough insync replicas to make it safe
if (inSyncSize < minIsr && requiredAcks == -1) {
  throw new NotEnoughReplicasException(...)
}

Three operationally critical facts fall out of this:

  • The check is pre-write. When the ISR has already shrunk below the floor, the leader does not append-then-fail; it rejects the produce with NotEnoughReplicasException and the record never enters the log. There is no partial state to clean up. (If the ISR shrinks after the append is in purgatory, the completion path instead returns NotEnoughReplicasAfterAppend, Partition.scala:977.)
  • The gate is acks=all only. The condition is requiredAcks == -1. With acks=1 or 0, min.insync.replicas is ignored, the leader writes regardless of ISR size. min.insync.replicas is meaningless unless the producer also asks for acks=all. This is why the two must be configured together.
  • The value is silently capped to RF. effectiveMinIsr returns min(configured min.insync.replicas, RF) (Partition.scala:246–248: leaderLog.config.minInSyncReplicas.min(remoteReplicasMap.size + 1)). Set min.insync.replicas=2 on an RF=1 topic and the effective floor becomes 1, the topic happily accepts acks=all writes with one replica, giving you no durability while you believe you configured some. Always verify min.insync.replicas ≤ replication.factor per topic (Conduktor's "silent min.insync.replicas trap", empirical).

The same floor governs consumer visibility. The high-watermark, the boundary below which records are readable by consumers, can only advance when the ISR is at or above the effective minimum. In maybeIncrementLeaderHW (Partition.scala:1010–1014): if (isUnderMinIsr) { return false }. So when a partition is under min-ISR, not only are new acks=all writes refused, the HW is frozen: already-replicated-but-not-yet-committed records stay invisible until the ISR heals. The config's own doc spells this out (TopicConfig.java:198): "the messages will not be visible to the consumers until they are replicated to all in-sync replicas and the min.insync.replicas condition is met." The HW machinery is the subject of Part I · 09 Fetch Path.

partition (RF=3, min.insync.replicas=2)
100
101
102HW
Leader LEO is 104, but ISR has just dropped to {leader} only. HW stays pinned at 102 (last offset all-ISR-acknowledged): offsets 103–104 are written on the leader but invisible to consumers, and new acks=all writes are now refused with NotEnoughReplicasException until a follower rejoins and lifts ISR back to 2.
HW = high-watermark, consumer-visible boundary · LEO = log-end offset, leader's newest record · green = committed/replicated, amber = appended-but-uncommitted

replication.factor, the redundancy that makes the guarantee physical

RF is set per topic at creation time and is the number of copies on distinct brokers; it is the only one of the three dials that is genuinely hard to change after the fact (it requires a partition reassignment, see II · 12 Lifecycle). The placement of those copies is the controller's job, performed by the StripedReplicaPlacer (metadata/.../placement/StripedReplicaPlacer.java), which we return to under rack awareness. One hard constraint is worth stating now, because it is compiled into the placer's contract: you cannot place two replicas of one partition on the same broker (StripedReplicaPlacer.java:61–64: "we can't place more than one replica on the same broker. This imposes an upper limit on replication factor, for example, a 3-node cluster can't have any topics with replication factor 4"). RF is therefore bounded above by your broker count, and any RF you choose consumes one broker's worth of placement per replica.

Deriving the standard: RF=3, min.insync.replicas=2, acks=all

This triad is the industry-standard durable-and-available baseline (Conduktor, Datadog, AWS MSK, all empirical). It is not folklore, it is the unique point that satisfies two requirements at once: survive the loss of one in-sync broker with zero acknowledged-data loss, and keep accepting writes through that loss. Walk the arithmetic.

The headroom equation

Three quantities drive every number in this section. Name them before using them:

RF, replication factor
copies of each partition on distinct brokers; a per-topic config you choose. Worked example uses RF = 3. config (default.replication.factor default 1, ReplicationConfigs.java:42, you must raise it)
M, min.insync.replicas
floor on ISR size below which an acks=all write is refused; a per-topic config. Worked example uses M = 2. config (default 1, ServerLogConfigs.java:155)
k, in-sync brokers lost
how many of a partition's ISR members are simultaneously down; the failure scenario we solve for, not a config. scenario

After losing k in-sync brokers the surviving ISR size is RF − k replicas. To keep accepting acks=all writes that surviving count must still meet the floor, so the write-availability condition is:

RF − k ≥ M  ⟺  k ≤ RF − M brokers

Substituting the worked-example values (RF = 3, M = 2): one loss gives RF − k = 3 − 1 = 2 replicas ≥ M = 2 ✓, you survive one loss and still write. A second loss gives RF − k = 3 − 2 = 1 replica < M = 2 ✗, so writes correctly stop (the broker refuses rather than risk a single-copy ack). The maximum tolerable loss is therefore k ≤ RF − M = 3 − 2 = 1 broker. That gap RF − M is your availability headroom: the number of ISR members you can lose and still produce. RF=3/M=2 buys 3 − 2 = 1; RF=5/M=3 buys 5 − 3 = 2, at the cost of 5 ÷ 3 ≈ 1.67× the storage and replication bandwidth of RF=3 (each replica is one full copy of the data, so cost scales linearly with RF).

Every "tolerates loss of …" cell below is derived, not asserted, from two formulas using the named quantities RF and M (= min.insync.replicas):

  • and keeps writing = the availability headroom RF − M from above (writes survive while RF − k ≥ M), floored at 0 since you cannot lose a negative number of brokers, note the effectiveMinIsr cap means M is first clamped to min(M, RF), so RF=2/M=2 uses M=2 but RF=1/M=2 would use M=1.
  • with zero acked loss = RF − 1: an acks=all write commits to the full ISR, so as long as one copy of a committed record survives it is not lost; you can therefore lose up to RF − 1 of the RF copies. (This assumes the lost brokers are not all the ISR at once before a re-replication, the page-cache-independence caveat below.)
RFMKeeps writing through loss of … = max(RF−M, 0)Zero acked loss through loss of … = RF−1Storage / cross-AZ cost = RF×Verdict
111 − 1 = 0 brokers1 − 1 = 0 brokersNo durability. Dev only.
222 − 2 = 0 brokers (any loss stops writes)2 − 1 = 1 brokerDurable but not available, one slow follower halts the topic.
212 − 1 = 1 broker1 − 1 = 0 brokers (single-copy acks: with M=1 a write can commit to one copy, so the "zero-loss" bound collapses to RF−M = 0)Available but a single-copy ack can be lost on leader failure.
323 − 2 = 1 broker3 − 1 = 2, but bounded by writes-stop at 3 − 2 = 1 brokerThe standard. Durable AND available under one failure.
535 − 3 = 2 brokers5 − 1 = 4, bounded by writes-stop at 5 − 3 = 2 brokersHigher tolerance for critical/financial data; meaningfully pricier.

The two "zero acked loss" columns reconcile thus: the raw redundancy bound is RF − 1 (lose all but one copy and committed data survives), but once losses reach RF − M the leader has already stopped acking new writes, so for the ongoing guarantee the operative number is min(RF − 1, RF − M) = RF − M whenever M ≥ 1. The RF=2/M=1 and RF=1/M=1 rows expose the trap: a single-copy ack (M=1) means RF − M = 0 committed-survivor guarantee, even though RF − 1 looks like 1.

Notice the two RF=2 rows: there is no setting of min.insync.replicas on an RF=2 topic that is simultaneously durable and available under a single failure. minISR=2 makes any single loss stop all writes; minISR=1 reopens the single-copy-ack loss window. RF=3 is the smallest RF that escapes this trap, that is the entire reason the standard is 3 and not 2. (RF=2/minISR=2 is occasionally chosen deliberately as "durable-or-nothing," accepting that a single broker bounce takes the topic read-only; know that you are choosing it.)

RF=3, min.insync.replicas=2, acks=all, broker B2 dies
ISR was {B1,B2,B3} → now {B1,B3}; size = 2
writes CONTINUE, leader B1 still has a full healthy ISR of 2; HW keeps advancing
1 < minISR(2): writes REFUSED with NotEnoughReplicasException; HW freezes; topic read-only until a replica returns
The one-failure case is a non-event: the cluster sheds the dead replica, ISR stays at the floor, production never pauses, and no acknowledged record is lost because every acked record was on ≥2 replicas. The two-failure case correctly fails safe rather than acking single-copy writes.
broker event · ISR/log state · writes proceed · writes refused (fail-safe) · path

The flush philosophy: durability from replication, not fsync

Here is the fact that surprises most operators and that every Kafka-vs-X benchmark argument hinges on: Kafka does not fsync each message to disk before acknowledging it. A write that returns success to an acks=all producer is in the OS page cache of every ISR replica, not necessarily on stable storage. The fsync-on-write knobs exist, but they are off by default, set to the maximum possible value so the time/count thresholds effectively never trip:

ConfigDefaultMeaningSource
flush.messages (log.flush.interval.messages)Long.MAX_VALUEfsync after this many messages, effectively neverserver-common/.../ServerLogConfigs.java:101
flush.ms (log.flush.interval.ms / log.flush.scheduler.interval.ms)Long.MAX_VALUEfsync after this much time, effectively neverserver-common/.../ServerLogConfigs.java:109

This is a deliberate design choice, and the config docs say so outright (TopicConfig.java:54, the flush.messages doc): "In general we recommend you not set this and use replication for durability and allow the operating system's background flush capabilities as it is more efficient." Kafka's durability model is: a record is safe because it lives in the page cache of multiple independent machines, and the OS will write each copy to disk in the background on its own schedule. The fsync path in the broker (UnifiedLog.java:2254, flush(offset, includingOffset)) is driven by segment rolls and recovery, not by the produce hot path; the comment at UnifiedLog.java:2212 is explicit that the broker "avoid[s] potentially-costly fsync call" on the write path. The storage engine's relationship to the page cache is covered in Part I · 03 Storage Log Engine and Part I · 04 Storage Management.

Why not fsync? Because replication is cheaper and stronger

An fsync is a slow, blocking, tail-latency-spiking operation. Forcing one per batch trades away Kafka's headline throughput for a guarantee that replication already provides more cheaply. The probability that one machine loses unflushed page-cache data (a crash) is some modest p; the probability that RF independent machines all lose it in the same instant is the product of RF independent small probabilities, pRF, vanishingly small if the machines are truly independent (this is derived with explicit numbers in The durability math below). Replication across RF failure-independent brokers is a stronger durability guarantee than fsync on one, and far faster. This is precisely the critique-and-rebuttal at the heart of vendor benchmarks: the OpenMessaging Benchmark misconfigured Kafka with log.flush.interval.messages=1 (fsync per batch) on Java 11, which crippled it; on identical hardware without that handicap Kafka matched or beat the challenger. Illustrative figure (not a guarantee, specific to that hardware and version): on 3× i3en.6xlarge at NVMe saturation, Kafka sustained ~1,900 MB/s vs the challenger's ~1,400 MB/s (Jack Vanlightly, empirical, ops-blueprint §9). "Kafka is unsafe because it doesn't fsync" is false: Kafka relies on replication + log recovery.

The corollary that dictates your topology: ISR replicas MUST be failure-independent

Because acknowledged data may exist only in page cache, a simultaneous power loss of all ISR replicas can lose not-yet-flushed acknowledged records. The entire durability guarantee therefore rests on one physical assumption: the ISR replicas do not fail together. If all three copies of a partition sit in one rack on one PDU, or one AZ on one power domain, a correlated power event defeats RF=3 entirely, the math assumed independence you didn't provide. This is why rack/AZ spread is not a nice-to-have but a load-bearing part of the durability model. The dial that delivers it is broker.rack, below. If you genuinely cannot tolerate any page-cache loss window (e.g. a single-AZ deployment, or regulatory fsync requirements), set flush.ms to a small value and accept the tail-latency cost, but prefer fixing failure-independence first.

Rack awareness: making "independent" true

Setting broker.rack (default null, unset, server-common/.../ServerConfigs.java:142) on each broker turns rack/AZ spread from an accident into a controller-enforced invariant. When racks are configured, the StripedReplicaPlacer treats spreading replicas across racks as its highest-priority goal (StripedReplicaPlacer.java:37–40: "we want to spread the replicas as evenly as we can across racks … it is the highest priority goal in multi-rack clusters"). It places replicas round-robin across racks with a random starting offset, producing the characteristic striped pattern, for racks A/B/C with three nodes each, partition 1 lands on A0,B0,C0; partition 2 on B1,C1,A1; and so on (StripedReplicaPlacer.java:98–105). For RF=3 across three AZs, this guarantees one replica per AZ: lose an entire AZ and every partition still has two surviving replicas, exactly the headroom RF=3/minISR=2 needs.

Racks must be roughly equal size, or placement skews

The placer prioritizes rack spread over even broker spread (StripedReplicaPlacer.java:48–53): "if you configure 10 brokers in rack A and B, and 1 broker in rack C … you will end up with a lot of partitions on that one broker in rack C." Every RF=3 partition wants a replica in C, so the lone C broker becomes a hotspot and a single point of failure for an outsized share of partitions. The doc calls unequal racks "a user error." Keep AZs balanced in broker count, or the rack-aware placement that was supposed to protect you concentrates risk.

Rack metadata also unlocks the consumer-side cost optimization KIP-392 fetch-from-follower (set broker.rack + replica.selector.class + client client.rack), which lets consumers read a same-AZ replica and eliminates cross-AZ fetch traffic, at the cost of up to ~500 ms added latency, since followers only serve up to the HW (Grab, empirical). That is a cost lever covered in II · 10 Cost; here the point is that the same broker.rack setting both hardens durability and cuts the cross-AZ bill.

How a replica leaves and rejoins the ISR

The ISR is not static; its membership is the live input to every guarantee above, so you must know what moves a replica in and out. A follower is dropped from the ISR when it falls behind by more than replica.lag.time.max.ms (default 30000 ms = 30 s, ReplicationConfigs.java:55). The mechanism is time-based, not offset-based: the leader tracks each follower's lastCaughtUpTimeMs and shrinks the ISR if a follower either stopped fetching (a "stuck" follower) or hasn't reached the leader's LEO within the window (a "slow" follower), both detected via getOutOfSyncReplicas (Partition.scala:1155–1166, comment at :1144–1152). A dropped follower rejoins automatically once it catches back up to the leader's log.

Keep replica.fetch.wait.max.ms below replica.lag.time.max.ms

replica.fetch.wait.max.ms defaults to 500 ms (ReplicationConfigs.java:75) and its own doc warns it "should always be less than the replica.lag.time.max.ms at all times to prevent frequent shrinking of ISR for low throughput topics." If a follower's max fetch wait approaches the lag window, a low-traffic topic can flap replicas in and out of the ISR purely because the follower legitimately had nothing to fetch. ISR thrash (watch IsrShrinksPerSec/IsrExpandsPerSec, both should be 0 in steady state) is one of the highest-signal symptoms of a sick cluster; investigate any shrink without a matching expand. See II · 08 Metrics & Signals.

The operational consequence: every shrink of the ISR erodes your availability headroom. An RF=3/minISR=2 topic whose ISR has shrunk to 2 is now one follower-hiccup away from refusing writes. The cluster will keep your acked data safe in this state, but it has spent its slack, and that is why slow followers (GC pauses, disk I/O saturation, network) are durability incidents in disguise, not mere performance noise.

Unclean leader election: the availability-vs-durability last resort

Everything above describes the system staying safe. Unclean leader election is what governs the moment it cannot, when every in-sync replica for a partition is down and the only surviving copy is one that was not in the ISR (i.e. behind the HW). You have exactly two choices, and they are mutually exclusive:

unclean.leader.election.enableBehavior when all ISR is downYou loseYou keep
false defaultPartition stays offline (no leader) until an ISR member returnsAvailability, the partition is unreadable/unwritable, possibly for a long timeEvery acknowledged record, zero data loss
trueAn out-of-sync replica is elected leader as a last resortAny records the new leader was missing (everything past its log end), permanent data lossAvailability, the partition serves again immediately

The default is false (LogConfig.java:139, DEFAULT_UNCLEAN_LEADER_ELECTION_ENABLE = false), and Kafka chooses consistency/durability over availability by default. The config's own doc names the trade exactly (TopicConfig.java:185): enabling it allows replicas not in the ISR to be elected "as a last resort, even though doing so may result in data loss." When an unclean election does occur, the controller marks the partition's recovery state as RECOVERING (LeaderRecoveryState.java:29–33: "the election for the partition was an unclean leader election and … the leader is recovering from it"), distinguishing it from the clean RECOVERED state, so you can detect after the fact that a partition went through a lossy election.

Any UncleanLeaderElectionsPerSec > 0 is an actual data-loss event

Treat this metric as binary: a single non-zero reading means committed data was discarded. Alert on it as critical (Datadog, Confluent, Grafana all ship critical alerts, empirical). The historical footgun: before Kafka 0.11.0 the default was true, and Datadog suffered temporary data loss on a pre-0.11.0 cluster, recovered only because they dual-wrote to a secondary cluster (Datadog, empirical). On 4.4 the default is safely false, verify it has not been overridden per topic if your data is critical.

Enabling it dynamically in KRaft is not instantaneous

In KRaft mode the controller only checks for needed unclean elections periodically, every unclean.leader.election.interval.ms, default 5 minutes (ReplicationConfigs.java:122–123, TimeUnit.MINUTES.toMillis(5)). If you flip the config during an incident to restore an offline partition, it can take up to five minutes to take effect. The config doc instructs you to run kafka-leader-election.sh with the unclean option to trigger it immediately (ReplicationConfigs.java:128–130). Know this before you need it at 3 a.m.

Online (leader in ISR) all ISR down Offline (no leader) ISR member returns Online · RECOVERED
Offline (no leader) unclean=true → elect stale replica Online · RECOVERING (data lost)
Top path (default, unclean=false): an all-ISR-down partition waits offline and recovers losslessly when any in-sync replica comes back. Bottom path (unclean=true): the partition is forced back online immediately by promoting a stale replica, entering RECOVERING and permanently discarding records past that replica's log end.
RECOVERED = clean election, no loss · RECOVERING = unclean election, data discarded · amber = unavailable · red = lossy

ELR, eligible leader replicas (KIP-966): a safer recovery

Classic unclean election is all-or-nothing because once every ISR member is down, Kafka has forgotten which out-of-ISR replicas were most recently in sync, so it may promote an arbitrarily stale one. KIP-966 Eligible Leader Replicas fixes this by persisting, in the partition's metadata, the set of replicas that were in the ISR when it shrank below min.insync.replicas, the elr and lastKnownElr fields on the partition registration (metadata/.../PartitionRegistration.java:158–159). These ELR members are guaranteed to hold everything up to the HW at the moment they left, so on recovery the controller can elect an ELR replica without data loss even though a strict ISR election is impossible, strictly better than an unclean election. ELR is a feature-gated capability: EligibleLeaderReplicasVersion.ELRV_1 enables it and requires metadata version IBP_4_1_IV0 (server-common/.../EligibleLeaderReplicasVersion.java:27); it is the LATEST_PRODUCTION level (:31), so it is available and production-ready in this 4.4 build.

ELR changes the meaning of min.insync.replicas

When ELR is enabled, the semantics of min.insync.replicas shift, the config doc points at the dedicated ELR documentation section for the new behavior (TopicConfig.java:204). Practically: min.insync.replicas becomes not just the write-acceptance floor but also the boundary that defines which replicas get remembered as eligible leaders. The operational win is that an RF=3/minISR=2 topic that loses both ISR members can now often recover losslessly from a remembered ELR replica, where pre-ELR it faced the offline-or-lose-data choice. Treat ELR as raising the bar on how bad an all-ISR-down event has to be before unclean election is even on the table. The metadata mechanism is covered in Part I · 11 KRaft Controller and propagation in Part I · 12 Metadata Propagation.

The durability math: probability of loss vs RF and independence

You can put numbers on all of this, but every number below rests on two assumptions that must be stated, with units, before the arithmetic runs. This is an illustrative model; substitute your own measured failure rate.

p, per-broker loss probability in the window
probability that a single broker (and its copy of a given partition) is unavailable, or has lost its unflushed page-cache data, in a chosen window. We use an annualized commodity-hardware failure rate; for this worked example, p = 2% = 0.02 per server-year (illustrative, the literature spans roughly 1–4% per server-year; substitute your measured value). illustrative / environment-dependent
independence of the RF copies
the RF brokers holding the copies fail independently, no shared rack, PDU, AZ, power event, or kernel rollout. This is the load-bearing assumption (see callout below); it is an operator-provided property, delivered by broker.rack + balanced AZs, not a Kafka guarantee. topology assumption

Derivation. An acknowledged acks=all write under min.insync.replicas ≥ 2 is committed to all RF copies, so losing the record requires all RF copies to be lost in the same window. If the copies fail independently with probability p each, the probability of the joint event is the product of RF independent probabilities:

P(all RF copies lost) = p × p × … (RF times) = pRF

Plugging in p = 0.02 and reading the safety multiplier relative to RF=1 as P(RF=1) ÷ P(RF=n) = p ÷ pn = p1−n = (1/p)n−1, with 1/p = 1/0.02 = 50:

RFP(all copies lost) = pRF with p=0.02, computedSafety multiplier vs RF=1 = (1/p)RF−1 = 50RF−1Interpretation
1p1 = 0.02 = 2×10−2500 = 1× (baseline)One bad disk loses the data.
2p2 = 0.02×0.02 = 4×10−4501 = 50× safer than RF=1Two independent copies must die together.
3p3 = 0.023 = 8×10−6502 = 2,500 ≈ 2,500× safer than RF=1The durable baseline.
5p5 = 0.025 = 3.2×10−9504 = 6,250,000× safer than RF=1Diminishing returns for most workloads.

Each cell is reproducible by hand: p3 = (2×10−2)3 = 23 × 10−6 = 8×10−6; the RF=3 safety multiplier is p1 ÷ p3 = 1/p2 = 1/(0.02)2 = 1/0.0004 = 2,500. These probabilities are per chosen window (here, per year) and only as real as the independence assumption.

The independence assumption is the whole ballgame

That pRF only holds if failures are independent. The instant your replicas share a failure domain, same rack, same PDU, same AZ, same power event, same bad kernel rollout, the exponent collapses toward 1 and RF=3 buys you almost nothing against that correlated event. Three copies in one rack have roughly the data-loss probability of one copy against a rack-power failure. This is the quantitative restatement of the flush philosophy's corollary: RF multiplies safety only to the degree your replicas are failure-independent. broker.rack + balanced AZs is what makes the exponent real. Spend your durability budget on independence before you spend it on a higher RF.

The same logic explains the cost/durability frontier. Two cost assumptions, then the arithmetic:

storage cost scales as RF×
each replica is one full copy of the log, so total stored bytes = ingress bytes × RF. structural
cross-AZ replication amplification = (RF − 1)× ingress
the leader receives 1 copy from the producer, then ships it to the other RF − 1 replicas; in a 3-AZ cluster with one replica per AZ, each of those RF − 1 hops crosses an AZ boundary. empirical, AWS 3-AZ (ops-blueprint §6)

So at RF=3, cross-AZ replication = (RF − 1) × ingress = (3 − 1) × 1 GB = 2 GB of cross-AZ traffic per 1 GB produced. Stepping RF=3 → RF=5 multiplies both storage and replication bandwidth by 5 ÷ 3 ≈ 1.67×, i.e. (5 − 3) ÷ 3 ≈ 67% more, in exchange for moving the per-window loss probability from the RF=3 figure derived above (p3 = 8×10−6) to the RF=5 figure (p5 = 3.2×10−9), a further p3 ÷ p5 = 1/p2 = 2,500× reduction. Worthwhile for financial ledgers and irreplaceable data; overkill for most. The economics are quantified in II · 10 Cost; the capacity arithmetic in II · 04 Capacity Planning.

Where Kafka sits on CAP

A partition in Kafka is a small replicated state machine, and its CAP behavior is configurable through exactly the dials above. Under a network partition that separates the leader from its followers:

  • With acks=all + min.insync.replicas=2 + unclean=false, Kafka behaves as a CP system: when the ISR can no longer meet the floor, it refuses writes and freezes the HW (consistency + partition-tolerance), sacrificing availability. The NotEnoughReplicasException path and the frozen-HW behavior are the "C over A" choice, in code.
  • Flip unclean.leader.election.enable=true and you shift that partition toward AP for the all-ISR-down case: it stays available by promoting a stale replica, at the cost of consistency (lost records). You are literally moving the partition along the CAP spectrum with one config.
  • With acks=1 or acks=0 the producer opts out of the strong-consistency contract entirely for the sake of latency, a per-write relaxation.
Operator's durability checklist

Set and verify, per topic: replication.factor ≥ 3 for anything you cannot lose; min.insync.replicas = 2 (and always ≤ RF, the effectiveMinIsr cap will silently weaken it otherwise); producers on acks=all (the default) with enable.idempotence=true for ordering. Leave unclean.leader.election.enable=false (the default) unless a specific topic explicitly prefers availability over its own data. Make independence real: set broker.rack on every broker, keep AZs balanced in broker count, and confirm the placer spread one replica per AZ. Watch: UnderMinIsrPartitionCount and OfflinePartitionsCount (alert > 0), UncleanLeaderElectionsPerSec (alert > 0 = data loss), IsrShrinks/ExpandsPerSec (steady-state 0). Know your CAP position and that it is a config, not a fixed property. The failure-mode runbooks for each of these symptoms are in II · 07 Failure Modes.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.