III · 02 · Design Decisions & Their Alternatives
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.
Every system you admire is a frozen argument. Behind each of Kafka's headline behaviours, consumers that poll, a leader that serves every read, a partition count you can never lower, a broker that trusts the OS page cache with your data, sits a fork in the design space where the authors chose one road and rejected several others that real systems took instead. This chapter walks eight of those forks and, for each, does three things: states the alternative roads honestly (Dynamo-style sloppy quorums, Raft majorities, LSM trees, multi-writer logs, push delivery, broker-tracked acknowledgement, managed memory), names the axes of the tradeoff, and tells you when each road is the right one. The single most instructive decision is not any one of them but the deliberate split: Kafka runs ISR replication for the data partitions and a Raft majority quorum for the metadata log, two different consistency protocols in one product, each matched to a different workload. Understand why that split is correct and you have learned the most transferable lesson Kafka has to teach. Throughout, we mark which side of each tradeoff is inherent (a structural consequence you cannot configure away) versus tunable (a dial or a feature shifts it), because conflating those two is the most common way architects misjudge a log.
How to read a design decision
A design decision is not "feature X is good." It is "for the forces we expected, road A dominates roads B and C, and here is the price we knowingly pay." To keep the analysis honest, every section below is structured around the same four questions. Hold them in mind as you read.
| Question | What it forces you to name |
|---|---|
| What is the fork? | The specific point where a credible alternative exists. If there is no alternative, it is not a decision, it is a constraint. |
| What are the axes? | The 2–3 quantities the choice trades against each other (throughput vs latency, consistency vs availability, control vs simplicity, write-cost vs read-cost). |
| When is each road right? | The workload signature that flips the answer. A decision that is "always right" is usually mis-stated. |
| Inherent or tunable? | Whether the downside is structural (carry it forever) or moved by a config/feature (budget for the dial). This is the line most analyses blur. |
When you evaluate any infrastructure, yours or a vendor's, refuse the binary "is it good." Ask instead: what fork did they stand at, what did they optimise, and what did they pay? A system that claims to have paid nothing at a fork is either lying or has not yet hit the workload that bills it. Kafka's authors were unusually explicit about their prices; this chapter makes them legible so you can carry the reasoning to systems whose authors were not.
Decision 1, Pull consumers vs push delivery
The fork. When committed data is available, who initiates its movement to the consumer? Kafka chose pull: the consumer issues Fetch requests on its own clock, and the broker never pushes records it was not asked for. There is, in fact, only one read path in the whole system, followers replicating and consumers draining walk the identical ReplicaManager.fetchMessages machinery, distinguished only by a replica-id field (I · 09 Fetch Path). KRaft's metadata replication is pull too: the class header is blunt that "replication is driven by replica fetching" (I · 10 KRaft Consensus). The rejected road is push, the model RabbitMQ and most classic message brokers take: the broker decides when a message leaves and shoves it down an established channel.
The axes. Pull trades per-message latency at low load against consumer-controlled flow, batching, and replay. Push wins the first; pull wins the rest, decisively.
Pull buys four properties that are awkward-to-impossible under push:
- Backpressure is free and correct. A slow consumer simply fetches less often; the broker never builds an unbounded per-consumer send buffer or has to decide whether to drop or block. Under push, "the consumer can't keep up" is the broker's emergency; under pull it is a non-event, the data just waits in the log it was already sitting in.
- Replay and time-travel. Because the consumer names the offset, "start again from yesterday" is the same operation as "read the next batch." A push broker that already handed you a message and forgot it cannot replay. This is the mechanism under event sourcing, Kappa reprocessing, and adding a brand-new consumer to a year-old topic (III · 01 When to Use the Log).
- Batching is consumer-optimal. The fetch carries
fetch.min.bytes/fetch.max.wait.ms; the consumer decides how much to amortise per round-trip. New Relic cut whole-cluster broker CPU 15% by raisingfetch.min.byteson low-throughput topics, a single client-side change, impossible to express if the broker set the cadence (empirical). - Fan-out is decoupled from the broker's work. Ten groups reading one partition are ten independent cursors over one log; the broker keeps no "delivered to group 7" state, so adding a reader costs it almost nothing (Decision 4 covers why).
The price, and whether it is inherent. A pull consumer with nothing to read must either return empty or wait, which adds latency at the tail of an idle stream. Kafka mitigates this with long-poll: fetch.max.wait.ms parks the request in purgatory and the broker completes it the instant data crosses the high watermark (I · 09), so an idle consumer does not busy-spin, and a record that arrives is served immediately, not on the next poll tick. That mitigation is real but does not fully erase the gap: at very low message rates a push broker can still deliver a single message in fewer microseconds because it never waited to be asked. This residual is inherent to pull. The empirical record confirms it: Kafka's per-message latency exceeds a push broker's at low load, a deliberate throughput-over-latency choice; RabbitMQ measures ~1 ms at low load but degrades past ~30 MB/s, while Kafka holds p99 ~5 ms up into the hundreds of MB/s (Confluent OMB, empirical, directional, vendor benchmark).
A log is a durable, replayable, fan-out abstraction first and a low-latency pipe second. Every one of pull's wins (backpressure, replay, consumer batching, cheap fan-out) is exactly a log property; the one thing it costs (sub-millisecond idle-stream latency) is exactly the thing a log is not optimising for. The decision is coherent: it pays in the currency the abstraction does not value to earn in the currency it does. The transferable rule, when consumers have heterogeneous, variable, or replay-capable consumption needs, make them pull; reserve push for homogeneous, latency-critical, fire-once delivery.
Pull does not rescue you from the structural cost of per-partition ordering plus one consumer per partition per group. If you use a Kafka topic as a work queue, a single slow or poison message blocks every later message in its partition, head-of-line blocking that pull cannot fix, because the consumer must process offset N before it is willing to commit past it (III · 01; Kreps, empirical). Parallelism is capped at partition count. The right tools are I · 15 Share Groups (which break the one-consumer-per-partition rule for queue-shaped workloads) or the Confluent Parallel Consumer, not "add partitions." This is covered as an inherent limit in III · 03.
Decision 2, Leader-based ISR replication vs leaderless / quorum
This is the heart of the chapter, and the decision that most distinguishes Kafka from the Dynamo lineage and from textbook Raft. It has three parts: the ISR mechanism itself, the alternatives it beat for data, and, the crucial twist, the fact that Kafka also runs a Raft majority quorum for metadata, on purpose, in the same cluster.
What ISR actually is
The fork. How do you keep N copies of a partition consistent under failure? Kafka chose leader-based ISR replication: one leader per partition serves all reads and writes, followers pull from it, and the leader tracks an in-sync replica set, the followers currently caught up. A write with acks=all is committed only when every member of the ISR has it; the high watermark (the committed, consumer-visible boundary) is the minimum log-end offset over the ISR, and it refuses to advance while the ISR is below min.insync.replicas (I · 08 Replication). The ISR is not the leader's private opinion, in KRaft it is controller-authoritative: the leader proposes membership changes via the AlterPartition RPC and the controller commits them to the metadata log, so leadership and membership have a single linearizable history (I · 08, KIP-497).
The genius of ISR is that it makes the durability/latency tradeoff into a tunable dial with a clear arithmetic, rather than a fixed protocol property. With RF=3 and min.insync.replicas=2, every acknowledged write sits on at least 2 independent brokers at ack time, so you tolerate f=1 failures with f+1=2 required copies and RF=3 for headroom, and you can move that dial per topic (II · 06 Durability). Contrast this with a majority quorum, where the failure tolerance is fixed by the protocol at ⌊N/2⌋.
| Property | Kafka ISR (leader serves) | Majority quorum (Paxos/Raft) | Dynamo sloppy quorum (W+R>N) |
|---|---|---|---|
| Who serves reads | Leader only (consistent, simple) | Leader only (Raft) / any (some) | Any replica (R of them, then reconcile) |
| Copies needed to commit | Tunable: the full ISR, floored by min.insync.replicas | Fixed: a majority, ⌊N/2⌋+1 | Tunable W, but no total order |
| Failures tolerated | RF − min.insync.replicas writably; RF−1 for read durability | ⌊N/2⌋ (e.g. 1 of 3, 2 of 5) | Up to N−W for writes (eventual) |
| Consistency | Linearizable per partition (single writer, HW gate) | Linearizable | Eventual, conflicts, read-repair, vector clocks |
| Latency cost | Slowest in-sync follower (laggards ejected, not waited on) | Median replica (the majority-th) | Fastest W replicas (low) but reads reconcile |
| Throughput at huge shard count | High, append-only, no per-write voting round | Lower, a round of votes per commit | High, but you inherit conflict resolution |
| Tail behaviour of one slow node | Ejected from ISR after replica.lag.time.max.ms; stops dragging commit | Irrelevant if it is not the majority-th | Irrelevant; W fastest win |
ISR vs majority quorum, the key difference. A majority quorum commits when the median replica acks; it is structurally robust to one slow node in a 3-node group because that node is simply not in the majority. ISR instead waits for everyone currently in-sync, but it actively manages who that is: a follower that falls behind replica.lag.time.max.ms is ejected, so it stops holding the high watermark back (I · 08). The two converge on the same goal (don't let one laggard stall commit) by opposite means: quorum ignores the slow node by counting; ISR evicts it by membership. The ISR approach has a subtle edge for the data plane: it lets you set min.insync.replicas to a value independent of RF, so a 5-replica topic can require only 2 in-sync (cheaper, faster) or all 5 (paranoid), where a Raft group of 5 is permanently locked to 3. That decoupling of durability floor from replica count is the dial that makes ISR right for data.
ISR vs Dynamo sloppy quorum, the key difference. Dynamo (and Cassandra, Riak) accept writes to any W replicas and read from any R, with W+R>N guaranteeing read/write overlap but not a total order, concurrent writes to the same key produce siblings that the application or read-repair must reconcile (vector clocks, last-write-wins). That is a brilliant fit for an always-available key-value store where the data model tolerates eventual convergence. It is the wrong model for a log, whose entire value proposition is a single, total, immutable order of records (III · 00). A log with siblings is not a log. ISR's single-leader-per-partition gives you that total order for free; the price (single writer, covered in Decision 7) is one Kafka happily pays because ordering is the product.
Data partitions want three things at once: very high write throughput, enormous partition counts (millions, post-KRaft), and a tunable durability/latency tradeoff that operators can set per topic. ISR delivers all three. Append-only single-leader writes avoid a per-record voting round (throughput); the ISR is cheap per-partition state, so a broker can host tens of thousands of them (scale); and decoupling min.insync.replicas from RF makes durability a dial, not a protocol constant (tunability). A majority quorum would cost a vote round per write and fix tolerance at ⌊N/2⌋; a sloppy quorum would surrender the total order that is the whole point. ISR is the road that keeps every property the data plane actually needs.
The deliberate split: ISR for data, Raft for metadata
Here is the lesson. Kafka does not use ISR for everything. The cluster's metadata, every topic, partition, leader, ISR, ACL, config, broker registration, lives in a single log, __cluster_metadata-0, replicated by a Raft majority quorum across a small set of controllers (I · 10 KRaft Consensus, I · 11 KRaft Controller). The same company, the same codebase, two different consistency protocols, chosen because the two workloads have opposite shapes.
__cluster_metadata-0 · 3 or 5 controllers · commit = majority acks (⌊N/2⌋+1) · always strongly consistent · self-managed leader election · sub-second failover via hot-standby controllersmin.insync.replicas · tunable durability/latency per topic · controller-authoritative ISR · throughput-optimised append| Force | Metadata log → Raft majority | Data partitions → ISR |
|---|---|---|
| How many of them? | One log, 3–5 voters | Millions of partitions across the fleet |
| Consistency requirement | Non-negotiable, always, a split-brain controller corrupts the whole cluster | Per-topic dial, telemetry may accept acks=1; payments demand acks=all |
| Throughput | Modest, metadata changes are rare vs data writes | Enormous, the product's whole job |
| Failover speed | Must be near-instant and self-managed (no external ZooKeeper) | Driven by the metadata plane; per-partition election is ~ms |
| Who decides membership? | The quorum itself (Raft pre-vote, epochs), self-contained | The controller (via the metadata log), i.e. delegated to the metadata plane |
| Right protocol | Raft majority: small, strongly consistent, self-electing, fast failover | ISR: scalable, throughput-first, tunable durability |
Why is a majority quorum right for metadata when ISR is right for data? Three reasons, each the inverse of the data-plane argument:
- There is exactly one metadata log, so the per-write vote round is affordable. The thing that makes a voting round too expensive for data, you'd pay it millions of times across millions of partitions, simply does not apply to a single, low-write-rate metadata log. You pay the round once per metadata change, rarely.
- Metadata consistency is not a dial, it must always be maximal. A tunable durability floor is exactly wrong here. If two controllers could both believe they are active (split-brain), they would issue conflicting leadership decisions and corrupt the cluster. Raft's majority rule structurally forbids two leaders in the same epoch: a candidate needs a majority to win, and two disjoint majorities cannot exist (I · 10). The metadata plane wants the strongest guarantee unconditionally, which is precisely what a majority quorum provides and what a tunable ISR floor would undermine.
- Metadata must self-manage its own leadership with fast failover, without depending on the thing it bootstraps. The pre-KRaft design delegated this to ZooKeeper, an external majority-quorum system, which is the historical proof that metadata wants a quorum. KRaft (KIP-500) folds that quorum into Kafka so there is no second system to operate, and adds hot-standby controllers so failover does not require reloading all state (empirical: ZooKeeper-era controller failover was O(partitions) and could take minutes at extreme scale; KRaft targets near-instant). A self-managing Raft quorum is the right shape for the one component that everything else depends on.
Match the consistency protocol to the workload, not to the product. Kafka uses a majority quorum where the data is small, must be always-strongly-consistent, and needs self-managed fast failover (metadata), and uses leader-based ISR where the data is vast, throughput-bound, and wants a tunable durability dial (partitions). The wrong move, the move many systems make, is to pick one consistency protocol and impose it everywhere. Had Kafka run Raft majority over every partition, it would never have reached millions of partitions or its throughput; had it run ISR over metadata, it would have had no self-consistent way to elect controllers and would have re-invented split-brain. When you build a system with both a small always-correct control plane and a large throughput-bound data plane, give them different replication. This is the reusable tactic, not a Kafka trivium.
KRaft replication is also pull-based and reuses Kafka's log machinery, "leader election is more or less pure Raft, but replication is driven by replica fetching" (I · 10). So the two planes share mechanism (a fetch loop over a segmented log) while differing in commit rule (majority vs full-ISR-floored-by-min-ISR). That is the elegant part: one storage/transport substrate, two commit semantics layered on top. Sharing the substrate kept the implementation small; differing on the commit rule kept each plane correct for its workload.
Decision 3, Partition as the unit of parallelism + ordering
The fork. How do you shard a topic so it scales horizontally and preserves order where order matters? Kafka chose the partition: a topic is a fixed set of independent logs; a record's partition is normally murmur2(key) % numPartitions (I · 16 Producer Client); each partition is totally ordered, has a single leader, and is the unit of both parallelism (one consumer per partition per group) and ordering (per-key order holds because a key always lands in the same partition). The rejected roads are consistent hashing (Dynamo/Cassandra: keys on a ring, virtual nodes, automatic rebalancing) and range sharding (HBase/Bigtable: contiguous key ranges that split as they grow).
| Axis | Kafka partition (hash mod N) | Consistent hashing (ring + vnodes) | Range sharding (split/merge) |
|---|---|---|---|
| Ordering | Per-partition total order, the design goal | Per-key only; no cross-key order | Per-range order; ranges reorder on split |
| Resharding | Painful: can't decrease; increasing remaps keys (hash%N changes) | Smooth: add a node, move a slice of the ring | Smooth: a hot range auto-splits |
| Parallelism model | Consumers ≤ partitions; explicit, predictable | Implicit; tied to token ownership | Implicit; tied to range ownership |
| Hot-spot handling | Cannot rebalance a hot key, it still hashes to one partition | vnodes spread load; a hot key still pins to one | Auto-split relieves a hot range (not a single hot key) |
| Operational simplicity | Very simple, count is a single number | Complex, ring state, token maps | Complex, split policy, region servers |
Why hash-mod-N is right for a log. The partition's job is to be the ordering unit, and ordering wants a stable, cheap, deterministic mapping from key to log. hash % N is exactly that: any producer, with no shared state, computes the same partition for a key, so per-key order is preserved without coordination. The simplicity compounds, a partition is a self-contained append-only log, which is why "log appends occur without coordination between shards" and "throughput scales linearly" with partition count (Kreps, empirical). Consistent hashing and range sharding both buy smooth resharding at the cost of ordering and simplicity: a consistent-hash ring has no notion of a total order across the keys it scatters, and a range that splits hands you two ranges whose interleaving is no longer the original order. For a key-value store that is the correct trade; for a log it surrenders the product.
This is the most under-appreciated constraint in Kafka, and it is inherent to hash % N. You cannot decrease partitions at all. Increasing them on a keyed topic silently breaks per-key ordering across the resize boundary, because 7654321 % 4 = 1 but 7654321 % 6 = 3, the same key now lands in a different partition, so its history is split and co-partitioned Streams state stores no longer follow it (empirical). The mitigation is operational, not a config: create a new topic at the target count and migrate (dual-write + drain), and over-partition up front for keyed topics. Adding partitions also does not fix a hot key, it still hashes to one partition; you must redesign or salt the key (II · 03 Partitioning). Consistent hashing's whole reason to exist is to make this smooth; Kafka chose ordering and simplicity and pays here forever.
hash % N remaps every key when N changes.If you adopt a hash-mod-N sharding scheme anywhere, Kafka, a sharded database, a cache ring built the naive way, you have bought immutable shard count. Either (a) pick a scheme that decouples logical from physical shards (hash into a large fixed number of logical buckets, map buckets to physical nodes, the trick that lets many systems resize without remapping keys), or (b) over-provision shards up front and accept you can only ever migrate, never resize. Decide which before production, because retrofitting is a topic migration.
Decision 4, Consumer-tracked offsets vs broker-tracked acknowledgement
The fork. Who remembers how far each consumer has read? Kafka chose consumer-tracked offsets, the "dumb broker, smart client" philosophy. The broker does not track per-message delivery or per-consumer acknowledgement at all. A consumer reads from an offset and periodically commits its position; that commit is just another record, written to the __consumer_offsets topic. The rejected road is broker-tracked acknowledgement, the classic message-broker model (RabbitMQ, JMS, SQS): the broker holds per-message state, marks each message delivered/acked/redelivered, and removes it once consumed.
| Axis | Consumer-tracked offsets (Kafka) | Broker-tracked ack (RabbitMQ / SQS) |
|---|---|---|
| Per-message broker state | None, broker stores a log; cursor is a number | Per-message, per-consumer (delivered? acked? in-flight?) |
| Fan-out cost | ~Free, each group is an independent cursor over one log | Per-subscriber copy or per-subscriber state |
| Replay | Trivial, commit/seek to any offset | Impossible once acked-and-removed |
| Where the complexity lives | Pushed to the client: commit cadence, dedup, idempotency | In the broker: redelivery, DLQ, visibility timeouts |
| Selective/out-of-order ack | Hard, an offset is a single high-water mark per partition | Native, ack message 5 while 3 is still in flight |
| Per-message TTL / priority / routing | Absent, topic-level retention only | Native, the broker owns each message |
Why dumb-broker is right for a log. Pushing read-position state to clients is what makes the broker's two best properties cheap: fan-out and replay. Because the broker stores one log and no per-consumer delivery state, ten groups reading a partition cost the broker one log and ten integers (their committed offsets), not ten copies or ten delivery tables. And because nothing is consumed-and-deleted, any consumer can rewind. The literal implementation is the punchline: the group coordinator has no separate database, its durable state is the __consumer_offsets log, every commit is a record, and failover is "replay the partition" (I · 13 Group Coordination). Offset tracking is itself an application of the log pattern. Broker-tracked acknowledgement buys the opposite set of properties, selective ack, per-message TTL/priority, content routing, at the cost of per-message broker state that does not fan out and cannot replay. For request/queue workloads that is the right trade (and it is exactly where RabbitMQ wins; empirical); for a replayable fan-out log it is the wrong one.
State that lives in the broker scales with messages × consumers; state that lives in the client scales with consumers only. By making the broker store a totally-ordered log and the client store a cursor, Kafka turned an O(messages × consumers) problem into an O(log) + O(consumers) one, which is why a single partition feeds arbitrarily many groups, and why retention is decoupled from consumption (data is deleted by time/size policy, not by "everyone acked it"). The transferable principle: if a piece of state grows with the cross-product of two dimensions, try to move it to whichever side keeps it on the smaller dimension.
"Smart client" means the durability of your processing is your problem. An offset is a single high-water mark per partition, so Kafka has no native concept of "message 5 acked, 3 still pending", you cannot selectively acknowledge out of order within a partition. Commit semantics are yours to get right: commit before processing and you risk loss (at-most-once); commit after and you risk duplicates on a crash between processing and commit (at-least-once); achieving exactly-once requires either idempotent consumers or the transactional read-process-write path (I · 14 Transactions & EOS). This complexity did not vanish when Kafka removed it from the broker, it moved to you. Share groups (I · 15) reintroduce per-message acknowledgement as an opt-in precisely for the queue-shaped workloads where the dumb-broker model is a poor fit, itself an admission that the original choice, while right for logs, is wrong for queues.
Decision 5, Page cache + zero-copy vs managed memory / direct IO
The fork. Who manages the memory that buffers your data between disk and network? Kafka chose to rely on the OS page cache and the sendfile zero-copy path: the broker keeps a tiny JVM heap and lets the kernel cache log data; a consumer fetch of cached data is served by FileChannel.transferTo straight from page cache to socket, never copying through user space or the JVM heap (I · 03 Storage Log Engine, I · 09). The rejected road is managed memory + direct IO: own your cache in application space, use O_DIRECT to bypass the kernel cache, and control exactly what is resident, the model Redpanda takes with a C++ thread-per-core, DMA-based runtime (empirical).
sendfile zero-copy (page cache → NIC, no user-space copy) · writes are sequential appends the OS flushes on its schedule · mechanical sympathy: free the cache to RAM, free the copy to the kernel| Axis | Page cache + zero-copy (Kafka) | Managed memory + direct IO (Redpanda) |
|---|---|---|
| Implementation surface | Small, the kernel does caching + zero-copy | Large, reimplement cache, scheduling, IO |
| Read of hot data | Zero-copy from page cache; no heap, no GC | From app cache; also fast, deterministic |
| Control over residency / placement | Low, kernel decides eviction; "catch-up tax" possible | High, pin exactly what each core holds |
| Tail latency under contention | Good, but page-cache eviction + GC can spike it | Marketed lower; workload-dependent, often reverses on equal HW |
| Memory model | Heap small on purpose; RAM → cache | Slice of RAM per core, app-owned |
| Failure: a big historical read | Evicts hot tail → p99 produce can jump (mitigated by tiered storage) | Isolated per core; less collateral |
Why mechanical sympathy is right for a log. A log's access pattern is the kindest case for the page cache: writes are sequential appends and the dominant read is the tail, the records just written, which are exactly the pages the kernel still has hot. So the page cache, with zero tuning, gives you near-RAM read throughput for the common case and a free zero-copy path to the socket, while letting Kafka keep a tiny heap and almost no caching code. This is "mechanical sympathy": align with what the hardware and OS already do well instead of fighting them. The empirical record is decisive against the headline claim that owning memory wins: on identical hardware Kafka often matched or beat Redpanda, ~1900 vs 1400 MB/s at NVMe saturation, and Redpanda's tail latency degraded under sustained load and TLS while Kafka's held or improved (Jack Vanlightly, empirical; vendor "10×/3×" claims used a crippled Kafka, forced per-batch fsync, Java 11, and "do not generalize"). The lesson is not "page cache always wins"; it is that delegating to a battle-tested kernel path is a strong default, and beating it requires more than a benchmark.
The control you give up is real. Inherent: you do not decide eviction. A large historical read can evict the hot tail, the "catch-up tax", spiking p99 produce latency (~2 ms → ~250 ms in documented cases) as the disk goes to 100% (empirical). The JVM is along for the ride: GC pauses on the broker can drag tail latency, which is why heap is capped ~6 GB and G1/ZGC tuning matters (II · 05 Performance Tuning). Tunable / mitigated: the catch-up tax is largely fixed by tiered storage (I · 05, KIP-405), which serves cold reads from object storage instead of evicting the page cache (tests showed ~30% better p99 produce, ~43% throughput drop avoided; empirical). So the structural downside (no eviction control) is real, but its worst symptom is addressable. A thread-per-core engine sidesteps the symptom by construction, at the cost of owning an entire memory and IO stack. Choose page-cache-reliance when your read pattern is tail-heavy and you value a small codebase and operational familiarity; choose managed memory when you need deterministic per-core tail latency and can fund the engineering.
Decision 6, Append-only segments vs LSM-tree / B-tree
The fork. What on-disk structure backs a partition? Kafka chose append-only segments: a partition is an ordered set of immutable segment files, each a .log plus sparse memory-mapped indexes (.index, .timeindex); a write is a sequential append to the active segment, a read-by-offset is a binary search of a sparse index then a sequential scan, and old data is reclaimed by dropping whole segments (I · 03, I · 04 Storage Management). The rejected roads are the two great mutable-store structures: the B-tree (in-place updates, the workhorse of relational databases) and the LSM-tree (buffered writes + background compaction, the workhorse of RocksDB/Cassandra/modern KV stores).
| Axis | Append-only segments (Kafka) | LSM-tree (RocksDB) | B-tree (RDBMS) |
|---|---|---|---|
| Write pattern | Pure sequential append, no rewrite | Sequential to memtable + SSTables; later compaction | Random in-place page writes |
| Write amplification | ~1× (append once; compaction is opt-in for compacted topics) | High, repeated compaction rewrites | Moderate, page splits, WAL |
| Read by key | No random key lookup, read by offset/range only | Good, point + range on key | Excellent, point + range on key |
| Update / delete a record | Not supported, records are immutable | Native (new version / tombstone) | Native (in-place) |
| Reclamation | Drop whole segments (time/size); compaction keeps last-per-key | Compaction merges + drops | Free-list / vacuum |
| Best at | Sequential throughput, retention by time, ordered replay | Write-heavy KV with reads by key | Read-heavy, transactional, ad-hoc query |
Why append-only is right for a log. The log's data model is immutable, totally-ordered records read by position, it does not have updates or random key lookups to support, so the machinery that B-trees and LSM-trees exist to provide is pure cost here. By renouncing mutation, Kafka gets the cheapest possible write (a sequential append, write-amplification ~1×) and the cheapest possible reclamation (unlink a file). Crucially, "accumulating more stored data doesn't make it slower" (Kreps, empirical), there is no tree to rebalance, no compaction debt that grows with dataset size, because the structure is a flat append. An LSM-tree pays continuous compaction to keep key-sorted order; a B-tree pays random IO to keep in-place updates; both are buying capabilities (read-by-key, update) that a log deliberately does not offer.
The most powerful design moves are often subtractions. By refusing to support updates and random key reads, Kafka's storage engine became radically simpler and faster at the one thing a log needs, ordered sequential throughput with cheap retention. This is the same family of move as "immutability simplifies concurrency": an append-only file lets a single writer coexist with lock-free readers, because readers only ever see committed (immutable) bytes below the high watermark, and the mmap'd index can be read without a lock (I · 03). Transferable tactic: when a data structure's general-purpose features (update, random lookup, sort-on-disk) are not in your access pattern, dropping them can unlock an order-of-magnitude simpler and faster design. Ask what you can refuse to support.
The exact property that makes append-only fast, no key index, no in-place update, is exactly why Kafka cannot serve a point lookup by key, an ad-hoc query, or a multi-key transaction (Kreps, Waehner; empirical). Log compaction (cleanup.policy=compact) gives you a weak table, "at least the last value per key", but it does not give you a queryable index, and it does not guarantee a single record per key at any instant (I · 04; empirical). The correct architecture is to treat the log as the source of truth and materialise queryable views into purpose-built stores (a B-tree RDBMS, an LSM KV store, a search index), the "database inside-out" pattern (III · 00). Trying to use the log as the database fights its storage engine and loses; this is a structural limit, not a missing feature, and it is treated in depth in III · 03.
Decision 7, Single-writer (leader) per partition vs multi-writer
The fork. Can more than one node accept writes to the same partition at once? Kafka chose single-writer: exactly one leader per partition accepts every append, assigns the monotonic offset, and serves every read; followers only replicate (I · 08). The rejected road is multi-writer, multiple nodes accepting writes to the same logical shard concurrently, as in multi-leader/active-active databases or leaderless KV stores, which requires conflict resolution (last-write-wins, CRDTs, vector clocks) or a consensus round on every write to agree an order.
| Axis | Single-writer leader (Kafka) | Multi-writer (multi-leader / leaderless) |
|---|---|---|
| Assigning a total order | Trivial, one writer hands out offsets sequentially | Hard, needs consensus per write or post-hoc conflict resolution |
| Write conflicts | Impossible by construction | Inherent, siblings, LWW, CRDTs, merge logic |
| Write availability under leader loss | Paused until a new leader is elected (~ms, controller-driven) | Higher, surviving writers keep accepting |
| Throughput ceiling per partition | Bounded by one leader's resources | Sum of all writers (but you pay reconciliation) |
| Reasoning / correctness | Simple, one source of truth for the offset | Complex, eventual consistency, anomalies |
Why single-writer is right for a log. A log is a total order, and the cheapest way to produce a total order is to have one writer assign positions. With a single leader, the offset is just a counter; there is never a conflict to resolve because there is never a second opinion about what comes next. Multi-writer would hand you exactly the problem ISR-vs-Dynamo already settled (Decision 2): concurrent writers produce an order that must be reconciled, and a reconciled "order" is not the immutable, total order a log promises. Single-writer also makes the acks=all durability story clean, there is one leader whose log is authoritative, so "committed" has a single, unambiguous meaning (the high watermark on that leader), and a returning ex-leader simply truncates to the new leader's epoch boundary rather than negotiating a merge (I · 08, leader epochs / KIP-101).
Single-writer means that when a leader fails, writes to that partition pause until the controller elects a successor, there is no surviving writer to absorb them. The mitigation is to make election fast (KRaft per-partition election is ~ms and controller-driven) and to spread leadership so no single broker's loss pauses too much at once (II · 07 Failure Modes), but the pause is inherent, you trade a brief write-availability gap for never having a write conflict. The second price is a per-partition throughput ceiling: one partition's writes are bounded by one leader, which is precisely why the partition is also the unit of parallelism (Decision 3), you scale write throughput by adding partitions (more leaders), not by adding writers to a partition. A multi-writer design would raise both ceilings at the cost of the total order; Kafka keeps the order and scales out instead.
Single-writer-per-shard is the cheapest way to get a per-shard total order, and it composes beautifully with hash-partitioning: shard by key, one writer per shard, and you have per-key total order with no consensus on the write path. Reach for it whenever order within a key matters more than write availability of a single key during a failover. Reach for multi-writer only when you genuinely cannot tolerate a failover pause and your data model can absorb conflict resolution, a much rarer requirement than it first appears.
Decision 8, Coordinators as replicated logs
The fork. Where does a stateful broker subsystem, group membership, committed offsets, transaction state, keep its durable state? Kafka chose, repeatedly and on purpose, to make these subsystems replicated state machines over an internal log rather than to give them a bespoke datastore. The group coordinator's state is the __consumer_offsets log: every join, every offset commit, every assignment is a record; in-memory state is a deterministic replay of those records; and failover is "the new coordinator loads the partition and replays it" (I · 13). The transaction coordinator does the same over __transaction_state (I · 14). The rejected road is the obvious one most systems take: give each subsystem its own store (an embedded database, a ZooKeeper subtree, a custom on-disk format) with its own replication, durability, and failover code.
__consumer_offsets / __transaction_state, replicated by ISRWhy coordinators-as-logs is right. Kafka already had to build a correct, durable, fast, replicated log, that is the product. Making every stateful subsystem ride that same machinery is massive code reuse of the hardest part of the system. The coordinator does not implement durability (the partition's ISR replication does), does not implement failover (loading-and-replaying a partition does), and does not implement the consistency boundary (the high watermark is exactly the trigger that releases a response once its record is committed; I · 13). Three guarantees the subsystem would otherwise have had to re-derive, durability, failover, read-your-committed-write, fall out for free because the state is a log. This is dogfooding as architecture: the abstraction is good enough that the system builds itself on it.
When your core abstraction is genuinely powerful, the highest-leverage move is to express the rest of the system in it. Kafka's coordinators are the proof: rather than N subsystems each with bespoke durability/failover/consistency code (N times the surface area, N times the bugs), there is one replication engine and several thin state machines that replay logs. The transferable principle, if you have built a strong primitive (a log, a consensus module, a versioned store), look hard for the internal components you can re-express as clients of that primitive instead of giving them their own machinery. The payoff is not elegance for its own sake; it is that every hardening of the core primitive (a durability fix, a failover speedup) automatically hardens every subsystem built on it.
Building coordinators on logs means they also inherit the log's limits, and some bite operationally. The __consumer_offsets and __transaction_state topics have a fixed partition count set at cluster creation, and because a group is mapped to a coordinator by abs(groupId.hashCode()) % numPartitions, changing that count after groups exist re-maps groups to different coordinators and orphans their state, exactly the hash-mod-N immutability of Decision 3 (I · 13). A hanging transaction on __transaction_state can pin the last-stable-offset and stall read_committed consumers, and on a compacted internal topic can block the cleaner (I · 14; II · 07; empirical). The reuse is a net win, but "it's a log all the way down" means the log's structural constraints are also all the way down, plan the internal-topic partition counts as carefully as any keyed topic.
The decisions as one coherent system
Read individually, the eight forks look like independent engineering calls. Read together, they are one bet repeated: optimise relentlessly for a durable, totally-ordered, replayable, high-throughput fan-out log, and pay wherever that abstraction does not value the currency. Each decision reinforces the others, pull consumers and consumer-tracked offsets together make fan-out and replay cheap; single-writer and hash-partitioning together make per-key total order cheap; append-only storage and page-cache reliance together make sequential throughput cheap; ISR-for-data and Raft-for-metadata together make the whole thing both scalable and self-consistent.
| # | Decision | Road not taken | Primary axis | Right when… | The inherent price |
|---|---|---|---|---|---|
| 1 | Pull consumers | Push delivery | Idle latency ↔ flow/replay control | Consumers are heterogeneous / replay-capable | Sub-ms idle-stream latency (mitigated by long-poll) |
| 2a | ISR replication (data) | Majority quorum / sloppy quorum | Throughput+scale ↔ fixed tolerance / eventual | Vast partitions, tunable durability, total order | Single leader serves all reads (a hotspot ceiling) |
| 2b | Raft majority (metadata) | ISR for metadata / external ZK | Always-consistent + self-managed failover | Small, must-never-split-brain control plane | Majority needed to make progress (3/5 quorum) |
| 3 | Partition = hash % N | Consistent hashing / range shards | Order+simplicity ↔ smooth resharding | Per-key order matters; counts are stable | Can't decrease; resizing keyed topics breaks order |
| 4 | Consumer-tracked offsets | Broker-tracked ack | Cheap fan-out/replay ↔ per-msg control | Replayable fan-out; no per-msg TTL/priority needed | No selective ack; processing durability is the client's |
| 5 | Page cache + zero-copy | Managed memory / direct IO | Small surface ↔ residency control | Tail-heavy reads; value simplicity & familiarity | No eviction control (catch-up tax; tiered storage helps) |
| 6 | Append-only segments | LSM-tree / B-tree | Sequential write ↔ random read/update | Immutable ordered records read by position | No key lookup / update, not a database |
| 7 | Single-writer leader | Multi-writer | Cheap total order ↔ write availability | Order within a key > failover-pause tolerance | Writes pause during failover; per-partition ceiling |
| 8 | Coordinators as logs | Bespoke per-subsystem store | Reuse the core ↔ none material | Always (when the log primitive is strong) | Internal topics inherit the log's constraints too |
Five reusable tactics outlive Kafka: (1) match the consistency protocol to the workload, not to the product, a small always-correct control plane and a vast throughput-bound data plane deserve different replication (the ISR/Raft split). (2) Move state to whichever side keeps it on the smaller dimension, push read-position to clients so broker state is O(log), not O(messages×consumers). (3) Subtraction is a design move, refusing updates and random lookups bought append-only's speed; ask what you can decline to support. (4) Delegating to a battle-tested OS/runtime path (page cache, zero-copy) is a strong default; beating it costs a whole subsystem you must then own. (5) Build the system out of its own strongest primitive, express internal subsystems as clients of the core (coordinators-as-logs), so every hardening of the core hardens them all. And the honesty rule that frames all five: name the price at every fork, and know whether it is inherent or tunable, the inherent ones you carry forever; the tunable ones you budget a dial for. The inherent prices collected here become the subject of III · 03 Inherent Limits; the tactics, of III · 04 The Tactics Toolkit; and the head-to-head against the systems that chose the other roads, of III · 06 Comparative Architectures.