III · 02 · Design Decisions & Their Alternatives

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.

Every system you admire is a frozen argument. Behind each of Kafka's headline behaviours, consumers that poll, a leader that serves every read, a partition count you can never lower, a broker that trusts the OS page cache with your data, sits a fork in the design space where the authors chose one road and rejected several others that real systems took instead. This chapter walks eight of those forks and, for each, does three things: states the alternative roads honestly (Dynamo-style sloppy quorums, Raft majorities, LSM trees, multi-writer logs, push delivery, broker-tracked acknowledgement, managed memory), names the axes of the tradeoff, and tells you when each road is the right one. The single most instructive decision is not any one of them but the deliberate split: Kafka runs ISR replication for the data partitions and a Raft majority quorum for the metadata log, two different consistency protocols in one product, each matched to a different workload. Understand why that split is correct and you have learned the most transferable lesson Kafka has to teach. Throughout, we mark which side of each tradeoff is inherent (a structural consequence you cannot configure away) versus tunable (a dial or a feature shifts it), because conflating those two is the most common way architects misjudge a log.

How to read a design decision

A design decision is not "feature X is good." It is "for the forces we expected, road A dominates roads B and C, and here is the price we knowingly pay." To keep the analysis honest, every section below is structured around the same four questions. Hold them in mind as you read.

Question	What it forces you to name
What is the fork?	The specific point where a credible alternative exists. If there is no alternative, it is not a decision, it is a constraint.
What are the axes?	The 2–3 quantities the choice trades against each other (throughput vs latency, consistency vs availability, control vs simplicity, write-cost vs read-cost).
When is each road right?	The workload signature that flips the answer. A decision that is "always right" is usually mis-stated.
Inherent or tunable?	Whether the downside is structural (carry it forever) or moved by a config/feature (budget for the dial). This is the line most analyses blur.

The transferable habit

When you evaluate any infrastructure, yours or a vendor's, refuse the binary "is it good." Ask instead: what fork did they stand at, what did they optimise, and what did they pay? A system that claims to have paid nothing at a fork is either lying or has not yet hit the workload that bills it. Kafka's authors were unusually explicit about their prices; this chapter makes them legible so you can carry the reasoning to systems whose authors were not.

Decision 1, Pull consumers vs push delivery

The fork. When committed data is available, who initiates its movement to the consumer? Kafka chose pull: the consumer issues Fetch requests on its own clock, and the broker never pushes records it was not asked for. There is, in fact, only one read path in the whole system, followers replicating and consumers draining walk the identical ReplicaManager.fetchMessages machinery, distinguished only by a replica-id field (I · 09 Fetch Path). KRaft's metadata replication is pull too: the class header is blunt that "replication is driven by replica fetching" (I · 10 KRaft Consensus). The rejected road is push, the model RabbitMQ and most classic message brokers take: the broker decides when a message leaves and shoves it down an established channel.

The axes. Pull trades per-message latency at low load against consumer-controlled flow, batching, and replay. Push wins the first; pull wins the rest, decisively.

Broker (push) Consumer A Consumer (pull) Broker (Kafka)

push msg (broker's clock)

push msg, A still busy → buffer bloat / drop

Fetch(offset, max.bytes), consumer's clock

batch up to HW, zero-copy

Fetch(offset+N), rate set by consumer

Push lets the broker set the pace (low latency, but the consumer's queue is the broker's problem). Pull lets the consumer set the pace: it asks for the next offset when ready, so backpressure is implicit and the broker holds no per-consumer send queue.

broker consumer request response / pushed data time flows top → down

Pull buys four properties that are awkward-to-impossible under push:

Backpressure is free and correct. A slow consumer simply fetches less often; the broker never builds an unbounded per-consumer send buffer or has to decide whether to drop or block. Under push, "the consumer can't keep up" is the broker's emergency; under pull it is a non-event, the data just waits in the log it was already sitting in.
Replay and time-travel. Because the consumer names the offset, "start again from yesterday" is the same operation as "read the next batch." A push broker that already handed you a message and forgot it cannot replay. This is the mechanism under event sourcing, Kappa reprocessing, and adding a brand-new consumer to a year-old topic (III · 01 When to Use the Log).
Batching is consumer-optimal. The fetch carries fetch.min.bytes/fetch.max.wait.ms; the consumer decides how much to amortise per round-trip. New Relic cut whole-cluster broker CPU 15% by raising fetch.min.bytes on low-throughput topics, a single client-side change, impossible to express if the broker set the cadence (empirical).
Fan-out is decoupled from the broker's work. Ten groups reading one partition are ten independent cursors over one log; the broker keeps no "delivered to group 7" state, so adding a reader costs it almost nothing (Decision 4 covers why).

The price, and whether it is inherent. A pull consumer with nothing to read must either return empty or wait, which adds latency at the tail of an idle stream. Kafka mitigates this with long-poll: fetch.max.wait.ms parks the request in purgatory and the broker completes it the instant data crosses the high watermark (I · 09), so an idle consumer does not busy-spin, and a record that arrives is served immediately, not on the next poll tick. That mitigation is real but does not fully erase the gap: at very low message rates a push broker can still deliver a single message in fewer microseconds because it never waited to be asked. This residual is inherent to pull. The empirical record confirms it: Kafka's per-message latency exceeds a push broker's at low load, a deliberate throughput-over-latency choice; RabbitMQ measures ~1 ms at low load but degrades past ~30 MB/s, while Kafka holds p99 ~5 ms up into the hundreds of MB/s (Confluent OMB, empirical, directional, vendor benchmark).

Why pull is the right default for a log

A log is a durable, replayable, fan-out abstraction first and a low-latency pipe second. Every one of pull's wins (backpressure, replay, consumer batching, cheap fan-out) is exactly a log property; the one thing it costs (sub-millisecond idle-stream latency) is exactly the thing a log is not optimising for. The decision is coherent: it pays in the currency the abstraction does not value to earn in the currency it does. The transferable rule, when consumers have heterogeneous, variable, or replay-capable consumption needs, make them pull; reserve push for homogeneous, latency-critical, fire-once delivery.

Where pull bites: the task-queue anti-pattern

Pull does not rescue you from the structural cost of per-partition ordering plus one consumer per partition per group. If you use a Kafka topic as a work queue, a single slow or poison message blocks every later message in its partition, head-of-line blocking that pull cannot fix, because the consumer must process offset N before it is willing to commit past it (III · 01; Kreps, empirical). Parallelism is capped at partition count. The right tools are I · 15 Share Groups (which break the one-consumer-per-partition rule for queue-shaped workloads) or the Confluent Parallel Consumer, not "add partitions." This is covered as an inherent limit in III · 03.

Decision 2, Leader-based ISR replication vs leaderless / quorum

This is the heart of the chapter, and the decision that most distinguishes Kafka from the Dynamo lineage and from textbook Raft. It has three parts: the ISR mechanism itself, the alternatives it beat for data, and, the crucial twist, the fact that Kafka also runs a Raft majority quorum for metadata, on purpose, in the same cluster.

What ISR actually is

The fork. How do you keep N copies of a partition consistent under failure? Kafka chose leader-based ISR replication: one leader per partition serves all reads and writes, followers pull from it, and the leader tracks an in-sync replica set, the followers currently caught up. A write with acks=all is committed only when every member of the ISR has it; the high watermark (the committed, consumer-visible boundary) is the minimum log-end offset over the ISR, and it refuses to advance while the ISR is below min.insync.replicas (I · 08 Replication). The ISR is not the leader's private opinion, in KRaft it is controller-authoritative: the leader proposes membership changes via the AlterPartition RPC and the controller commits them to the metadata log, so leadership and membership have a single linearizable history (I · 08, KIP-497).

The genius of ISR is that it makes the durability/latency tradeoff into a tunable dial with a clear arithmetic, rather than a fixed protocol property. With RF=3 and min.insync.replicas=2, every acknowledged write sits on at least 2 independent brokers at ack time, so you tolerate f=1 failures with f+1=2 required copies and RF=3 for headroom, and you can move that dial per topic (II · 06 Durability). Contrast this with a majority quorum, where the failure tolerance is fixed by the protocol at ⌊N/2⌋.

Property	Kafka ISR (leader serves)	Majority quorum (Paxos/Raft)	Dynamo sloppy quorum (W+R>N)
Who serves reads	Leader only (consistent, simple)	Leader only (Raft) / any (some)	Any replica (R of them, then reconcile)
Copies needed to commit	Tunable: the full ISR, floored by `min.insync.replicas`	Fixed: a majority, ⌊N/2⌋+1	Tunable W, but no total order
Failures tolerated	`RF − min.insync.replicas` writably; `RF−1` for read durability	⌊N/2⌋ (e.g. 1 of 3, 2 of 5)	Up to N−W for writes (eventual)
Consistency	Linearizable per partition (single writer, HW gate)	Linearizable	Eventual, conflicts, read-repair, vector clocks
Latency cost	Slowest in-sync follower (laggards ejected, not waited on)	Median replica (the majority-th)	Fastest W replicas (low) but reads reconcile
Throughput at huge shard count	High, append-only, no per-write voting round	Lower, a round of votes per commit	High, but you inherit conflict resolution
Tail behaviour of one slow node	Ejected from ISR after `replica.lag.time.max.ms`; stops dragging commit	Irrelevant if it is not the majority-th	Irrelevant; W fastest win

ISR vs majority quorum, the key difference. A majority quorum commits when the median replica acks; it is structurally robust to one slow node in a 3-node group because that node is simply not in the majority. ISR instead waits for everyone currently in-sync, but it actively manages who that is: a follower that falls behind replica.lag.time.max.ms is ejected, so it stops holding the high watermark back (I · 08). The two converge on the same goal (don't let one laggard stall commit) by opposite means: quorum ignores the slow node by counting; ISR evicts it by membership. The ISR approach has a subtle edge for the data plane: it lets you set min.insync.replicas to a value independent of RF, so a 5-replica topic can require only 2 in-sync (cheaper, faster) or all 5 (paranoid), where a Raft group of 5 is permanently locked to 3. That decoupling of durability floor from replica count is the dial that makes ISR right for data.

ISR vs Dynamo sloppy quorum, the key difference. Dynamo (and Cassandra, Riak) accept writes to any W replicas and read from any R, with W+R>N guaranteeing read/write overlap but not a total order, concurrent writes to the same key produce siblings that the application or read-repair must reconcile (vector clocks, last-write-wins). That is a brilliant fit for an always-available key-value store where the data model tolerates eventual convergence. It is the wrong model for a log, whose entire value proposition is a single, total, immutable order of records (III · 00). A log with siblings is not a log. ISR's single-leader-per-partition gives you that total order for free; the price (single writer, covered in Decision 7) is one Kafka happily pays because ordering is the product.

Why ISR for the data plane

Data partitions want three things at once: very high write throughput, enormous partition counts (millions, post-KRaft), and a tunable durability/latency tradeoff that operators can set per topic. ISR delivers all three. Append-only single-leader writes avoid a per-record voting round (throughput); the ISR is cheap per-partition state, so a broker can host tens of thousands of them (scale); and decoupling min.insync.replicas from RF makes durability a dial, not a protocol constant (tunability). A majority quorum would cost a vote round per write and fix tolerance at ⌊N/2⌋; a sloppy quorum would surrender the total order that is the whole point. ISR is the road that keeps every property the data plane actually needs.

The deliberate split: ISR for data, Raft for metadata

Here is the lesson. Kafka does not use ISR for everything. The cluster's metadata, every topic, partition, leader, ISR, ACL, config, broker registration, lives in a single log, __cluster_metadata-0, replicated by a Raft majority quorum across a small set of controllers (I · 10 KRaft Consensus, I · 11 KRaft Controller). The same company, the same codebase, two different consistency protocols, chosen because the two workloads have opposite shapes.

METADATA plane, Raft majority quorum (KRaft)

One log __cluster_metadata-0 · 3 or 5 controllers · commit = majority acks (⌊N/2⌋+1) · always strongly consistent · self-managed leader election · sub-second failover via hot-standby controllers

↕ controller publishes committed metadata; brokers fetch it as observers

DATA plane, ISR replication (per partition)

Millions of partitions · one leader each · commit = full ISR, floored by min.insync.replicas · tunable durability/latency per topic · controller-authoritative ISR · throughput-optimised append

Two consistency protocols, matched to two workloads. The metadata quorum is small, must be always strongly consistent, and prizes fast self-managed failover, Raft majority is ideal. The data plane is vast, throughput-hungry, and wants a per-topic durability dial, ISR is ideal. The metadata quorum even tells the data plane who its leaders and ISRs are.

metadata / control plane (Raft) data plane (ISR) ↕ = metadata flows down into data-plane leadership

Force	Metadata log → Raft majority	Data partitions → ISR
How many of them?	One log, 3–5 voters	Millions of partitions across the fleet
Consistency requirement	Non-negotiable, always, a split-brain controller corrupts the whole cluster	Per-topic dial, telemetry may accept `acks=1`; payments demand `acks=all`
Throughput	Modest, metadata changes are rare vs data writes	Enormous, the product's whole job
Failover speed	Must be near-instant and self-managed (no external ZooKeeper)	Driven by the metadata plane; per-partition election is ~ms
Who decides membership?	The quorum itself (Raft pre-vote, epochs), self-contained	The controller (via the metadata log), i.e. delegated to the metadata plane
Right protocol	Raft majority: small, strongly consistent, self-electing, fast failover	ISR: scalable, throughput-first, tunable durability

Why is a majority quorum right for metadata when ISR is right for data? Three reasons, each the inverse of the data-plane argument:

There is exactly one metadata log, so the per-write vote round is affordable. The thing that makes a voting round too expensive for data, you'd pay it millions of times across millions of partitions, simply does not apply to a single, low-write-rate metadata log. You pay the round once per metadata change, rarely.
Metadata consistency is not a dial, it must always be maximal. A tunable durability floor is exactly wrong here. If two controllers could both believe they are active (split-brain), they would issue conflicting leadership decisions and corrupt the cluster. Raft's majority rule structurally forbids two leaders in the same epoch: a candidate needs a majority to win, and two disjoint majorities cannot exist (I · 10). The metadata plane wants the strongest guarantee unconditionally, which is precisely what a majority quorum provides and what a tunable ISR floor would undermine.
Metadata must self-manage its own leadership with fast failover, without depending on the thing it bootstraps. The pre-KRaft design delegated this to ZooKeeper, an external majority-quorum system, which is the historical proof that metadata wants a quorum. KRaft (KIP-500) folds that quorum into Kafka so there is no second system to operate, and adds hot-standby controllers so failover does not require reloading all state (empirical: ZooKeeper-era controller failover was O(partitions) and could take minutes at extreme scale; KRaft targets near-instant). A self-managing Raft quorum is the right shape for the one component that everything else depends on.

The single best architectural lesson in Kafka

Match the consistency protocol to the workload, not to the product. Kafka uses a majority quorum where the data is small, must be always-strongly-consistent, and needs self-managed fast failover (metadata), and uses leader-based ISR where the data is vast, throughput-bound, and wants a tunable durability dial (partitions). The wrong move, the move many systems make, is to pick one consistency protocol and impose it everywhere. Had Kafka run Raft majority over every partition, it would never have reached millions of partitions or its throughput; had it run ISR over metadata, it would have had no self-consistent way to elect controllers and would have re-invented split-brain. When you build a system with both a small always-correct control plane and a large throughput-bound data plane, give them different replication. This is the reusable tactic, not a Kafka trivium.

A nuance that proves the rule

KRaft replication is also pull-based and reuses Kafka's log machinery, "leader election is more or less pure Raft, but replication is driven by replica fetching" (I · 10). So the two planes share mechanism (a fetch loop over a segmented log) while differing in commit rule (majority vs full-ISR-floored-by-min-ISR). That is the elegant part: one storage/transport substrate, two commit semantics layered on top. Sharing the substrate kept the implementation small; differing on the commit rule kept each plane correct for its workload.

Decision 3, Partition as the unit of parallelism + ordering

The fork. How do you shard a topic so it scales horizontally and preserves order where order matters? Kafka chose the partition: a topic is a fixed set of independent logs; a record's partition is normally murmur2(key) % numPartitions (I · 16 Producer Client); each partition is totally ordered, has a single leader, and is the unit of both parallelism (one consumer per partition per group) and ordering (per-key order holds because a key always lands in the same partition). The rejected roads are consistent hashing (Dynamo/Cassandra: keys on a ring, virtual nodes, automatic rebalancing) and range sharding (HBase/Bigtable: contiguous key ranges that split as they grow).

Axis	Kafka partition (hash mod N)	Consistent hashing (ring + vnodes)	Range sharding (split/merge)
Ordering	Per-partition total order, the design goal	Per-key only; no cross-key order	Per-range order; ranges reorder on split
Resharding	Painful: can't decrease; increasing remaps keys (`hash%N` changes)	Smooth: add a node, move a slice of the ring	Smooth: a hot range auto-splits
Parallelism model	Consumers ≤ partitions; explicit, predictable	Implicit; tied to token ownership	Implicit; tied to range ownership
Hot-spot handling	Cannot rebalance a hot key, it still hashes to one partition	vnodes spread load; a hot key still pins to one	Auto-split relieves a hot range (not a single hot key)
Operational simplicity	Very simple, count is a single number	Complex, ring state, token maps	Complex, split policy, region servers

Why hash-mod-N is right for a log. The partition's job is to be the ordering unit, and ordering wants a stable, cheap, deterministic mapping from key to log. hash % N is exactly that: any producer, with no shared state, computes the same partition for a key, so per-key order is preserved without coordination. The simplicity compounds, a partition is a self-contained append-only log, which is why "log appends occur without coordination between shards" and "throughput scales linearly" with partition count (Kreps, empirical). Consistent hashing and range sharding both buy smooth resharding at the cost of ordering and simplicity: a consistent-hash ring has no notion of a total order across the keys it scatters, and a range that splits hands you two ranges whose interleaving is no longer the original order. For a key-value store that is the correct trade; for a log it surrenders the product.

The inherent price: partition count is near-immutable for keyed topics

This is the most under-appreciated constraint in Kafka, and it is inherent to hash % N. You cannot decrease partitions at all. Increasing them on a keyed topic silently breaks per-key ordering across the resize boundary, because 7654321 % 4 = 1 but 7654321 % 6 = 3, the same key now lands in a different partition, so its history is split and co-partitioned Streams state stores no longer follow it (empirical). The mitigation is operational, not a config: create a new topic at the target count and migrate (dual-write + drain), and over-partition up front for keyed topics. Adding partitions also does not fix a hot key, it still hashes to one partition; you must redesign or salt the key (II · 03 Partitioning). Consistent hashing's whole reason to exist is to make this smooth; Kafka chose ordering and simplicity and pays here forever.

Need to reshard a topic?

Is the topic keyed (per-key order matters)?

Increase partitions in place, safe; round-robin keyless records just spread wider

Decrease or increase?

New topic at target count → dual-write → drain → cut over. Never resize in place.

The resharding decision is dominated by one question, is per-key order load-bearing? Keyless topics resize freely; keyed topics must migrate to a new topic because hash % N remaps every key when N changes.

trigger decision safe path ordering-breaking path path hazardous branch

Tactic to carry

If you adopt a hash-mod-N sharding scheme anywhere, Kafka, a sharded database, a cache ring built the naive way, you have bought immutable shard count. Either (a) pick a scheme that decouples logical from physical shards (hash into a large fixed number of logical buckets, map buckets to physical nodes, the trick that lets many systems resize without remapping keys), or (b) over-provision shards up front and accept you can only ever migrate, never resize. Decide which before production, because retrofitting is a topic migration.

Decision 4, Consumer-tracked offsets vs broker-tracked acknowledgement

The fork. Who remembers how far each consumer has read? Kafka chose consumer-tracked offsets, the "dumb broker, smart client" philosophy. The broker does not track per-message delivery or per-consumer acknowledgement at all. A consumer reads from an offset and periodically commits its position; that commit is just another record, written to the __consumer_offsets topic. The rejected road is broker-tracked acknowledgement, the classic message-broker model (RabbitMQ, JMS, SQS): the broker holds per-message state, marks each message delivered/acked/redelivered, and removes it once consumed.

Axis	Consumer-tracked offsets (Kafka)	Broker-tracked ack (RabbitMQ / SQS)
Per-message broker state	None, broker stores a log; cursor is a number	Per-message, per-consumer (delivered? acked? in-flight?)
Fan-out cost	~Free, each group is an independent cursor over one log	Per-subscriber copy or per-subscriber state
Replay	Trivial, commit/seek to any offset	Impossible once acked-and-removed
Where the complexity lives	Pushed to the client: commit cadence, dedup, idempotency	In the broker: redelivery, DLQ, visibility timeouts
Selective/out-of-order ack	Hard, an offset is a single high-water mark per partition	Native, ack message 5 while 3 is still in flight
Per-message TTL / priority / routing	Absent, topic-level retention only	Native, the broker owns each message

Why dumb-broker is right for a log. Pushing read-position state to clients is what makes the broker's two best properties cheap: fan-out and replay. Because the broker stores one log and no per-consumer delivery state, ten groups reading a partition cost the broker one log and ten integers (their committed offsets), not ten copies or ten delivery tables. And because nothing is consumed-and-deleted, any consumer can rewind. The literal implementation is the punchline: the group coordinator has no separate database, its durable state is the __consumer_offsets log, every commit is a record, and failover is "replay the partition" (I · 13 Group Coordination). Offset tracking is itself an application of the log pattern. Broker-tracked acknowledgement buys the opposite set of properties, selective ack, per-message TTL/priority, content routing, at the cost of per-message broker state that does not fan out and cannot replay. For request/queue workloads that is the right trade (and it is exactly where RabbitMQ wins; empirical); for a replayable fan-out log it is the wrong one.

Why push the state to the client

State that lives in the broker scales with messages × consumers; state that lives in the client scales with consumers only. By making the broker store a totally-ordered log and the client store a cursor, Kafka turned an O(messages × consumers) problem into an O(log) + O(consumers) one, which is why a single partition feeds arbitrarily many groups, and why retention is decoupled from consumption (data is deleted by time/size policy, not by "everyone acked it"). The transferable principle: if a piece of state grows with the cross-product of two dimensions, try to move it to whichever side keeps it on the smaller dimension.

The price the client pays, and it is inherent

"Smart client" means the durability of your processing is your problem. An offset is a single high-water mark per partition, so Kafka has no native concept of "message 5 acked, 3 still pending", you cannot selectively acknowledge out of order within a partition. Commit semantics are yours to get right: commit before processing and you risk loss (at-most-once); commit after and you risk duplicates on a crash between processing and commit (at-least-once); achieving exactly-once requires either idempotent consumers or the transactional read-process-write path (I · 14 Transactions & EOS). This complexity did not vanish when Kafka removed it from the broker, it moved to you. Share groups (I · 15) reintroduce per-message acknowledgement as an opt-in precisely for the queue-shaped workloads where the dumb-broker model is a poor fit, itself an admission that the original choice, while right for logs, is wrong for queues.

Decision 5, Page cache + zero-copy vs managed memory / direct IO

The fork. Who manages the memory that buffers your data between disk and network? Kafka chose to rely on the OS page cache and the sendfile zero-copy path: the broker keeps a tiny JVM heap and lets the kernel cache log data; a consumer fetch of cached data is served by FileChannel.transferTo straight from page cache to socket, never copying through user space or the JVM heap (I · 03 Storage Log Engine, I · 09). The rejected road is managed memory + direct IO: own your cache in application space, use O_DIRECT to bypass the kernel cache, and control exactly what is resident, the model Redpanda takes with a C++ thread-per-core, DMA-based runtime (empirical).

Kafka, lean on the kernel

Small JVM heap (~6 GB) · log data cached by the OS page cache · reads served by sendfile zero-copy (page cache → NIC, no user-space copy) · writes are sequential appends the OS flushes on its schedule · mechanical sympathy: free the cache to RAM, free the copy to the kernel

↕ same data, opposite ownership of the buffer + copy

Redpanda, own the memory

C++ thread-per-core, no JVM · application-managed cache · DMA / direct IO, bypass the page cache · pin a core+RAM slice per shard · control: deterministic placement, no GC, no double-buffering, at the cost of re-implementing what the kernel gives free

Two answers to "who owns the buffer and the copy." Kafka delegates both to the kernel (less code, no GC pressure on the data path, free zero-copy) and accepts less control. Redpanda re-implements both in user space for deterministic control and tail latency, and pays in engineering surface.

Kafka / kernel-delegated Redpanda / app-managed ↕ = same workload, different memory ownership

Axis	Page cache + zero-copy (Kafka)	Managed memory + direct IO (Redpanda)
Implementation surface	Small, the kernel does caching + zero-copy	Large, reimplement cache, scheduling, IO
Read of hot data	Zero-copy from page cache; no heap, no GC	From app cache; also fast, deterministic
Control over residency / placement	Low, kernel decides eviction; "catch-up tax" possible	High, pin exactly what each core holds
Tail latency under contention	Good, but page-cache eviction + GC can spike it	Marketed lower; workload-dependent, often reverses on equal HW
Memory model	Heap small on purpose; RAM → cache	Slice of RAM per core, app-owned
Failure: a big historical read	Evicts hot tail → p99 produce can jump (mitigated by tiered storage)	Isolated per core; less collateral

Why mechanical sympathy is right for a log. A log's access pattern is the kindest case for the page cache: writes are sequential appends and the dominant read is the tail, the records just written, which are exactly the pages the kernel still has hot. So the page cache, with zero tuning, gives you near-RAM read throughput for the common case and a free zero-copy path to the socket, while letting Kafka keep a tiny heap and almost no caching code. This is "mechanical sympathy": align with what the hardware and OS already do well instead of fighting them. The empirical record is decisive against the headline claim that owning memory wins: on identical hardware Kafka often matched or beat Redpanda, ~1900 vs 1400 MB/s at NVMe saturation, and Redpanda's tail latency degraded under sustained load and TLS while Kafka's held or improved (Jack Vanlightly, empirical; vendor "10×/3×" claims used a crippled Kafka, forced per-batch fsync, Java 11, and "do not generalize"). The lesson is not "page cache always wins"; it is that delegating to a battle-tested kernel path is a strong default, and beating it requires more than a benchmark.

Where the page-cache bet is genuinely worse, and what is inherent vs tunable

The control you give up is real. Inherent: you do not decide eviction. A large historical read can evict the hot tail, the "catch-up tax", spiking p99 produce latency (~2 ms → ~250 ms in documented cases) as the disk goes to 100% (empirical). The JVM is along for the ride: GC pauses on the broker can drag tail latency, which is why heap is capped ~6 GB and G1/ZGC tuning matters (II · 05 Performance Tuning). Tunable / mitigated: the catch-up tax is largely fixed by tiered storage (I · 05, KIP-405), which serves cold reads from object storage instead of evicting the page cache (tests showed ~30% better p99 produce, ~43% throughput drop avoided; empirical). So the structural downside (no eviction control) is real, but its worst symptom is addressable. A thread-per-core engine sidesteps the symptom by construction, at the cost of owning an entire memory and IO stack. Choose page-cache-reliance when your read pattern is tail-heavy and you value a small codebase and operational familiarity; choose managed memory when you need deterministic per-core tail latency and can fund the engineering.

Decision 6, Append-only segments vs LSM-tree / B-tree

The fork. What on-disk structure backs a partition? Kafka chose append-only segments: a partition is an ordered set of immutable segment files, each a .log plus sparse memory-mapped indexes (.index, .timeindex); a write is a sequential append to the active segment, a read-by-offset is a binary search of a sparse index then a sequential scan, and old data is reclaimed by dropping whole segments (I · 03, I · 04 Storage Management). The rejected roads are the two great mutable-store structures: the B-tree (in-place updates, the workhorse of relational databases) and the LSM-tree (buffered writes + background compaction, the workhorse of RocksDB/Cassandra/modern KV stores).

Axis	Append-only segments (Kafka)	LSM-tree (RocksDB)	B-tree (RDBMS)
Write pattern	Pure sequential append, no rewrite	Sequential to memtable + SSTables; later compaction	Random in-place page writes
Write amplification	~1× (append once; compaction is opt-in for compacted topics)	High, repeated compaction rewrites	Moderate, page splits, WAL
Read by key	No random key lookup, read by offset/range only	Good, point + range on key	Excellent, point + range on key
Update / delete a record	Not supported, records are immutable	Native (new version / tombstone)	Native (in-place)
Reclamation	Drop whole segments (time/size); compaction keeps last-per-key	Compaction merges + drops	Free-list / vacuum
Best at	Sequential throughput, retention by time, ordered replay	Write-heavy KV with reads by key	Read-heavy, transactional, ad-hoc query

Why append-only is right for a log. The log's data model is immutable, totally-ordered records read by position, it does not have updates or random key lookups to support, so the machinery that B-trees and LSM-trees exist to provide is pure cost here. By renouncing mutation, Kafka gets the cheapest possible write (a sequential append, write-amplification ~1×) and the cheapest possible reclamation (unlink a file). Crucially, "accumulating more stored data doesn't make it slower" (Kreps, empirical), there is no tree to rebalance, no compaction debt that grows with dataset size, because the structure is a flat append. An LSM-tree pays continuous compaction to keep key-sorted order; a B-tree pays random IO to keep in-place updates; both are buying capabilities (read-by-key, update) that a log deliberately does not offer.

Why renouncing mutation is a feature

The most powerful design moves are often subtractions. By refusing to support updates and random key reads, Kafka's storage engine became radically simpler and faster at the one thing a log needs, ordered sequential throughput with cheap retention. This is the same family of move as "immutability simplifies concurrency": an append-only file lets a single writer coexist with lock-free readers, because readers only ever see committed (immutable) bytes below the high watermark, and the mmap'd index can be read without a lock (I · 03). Transferable tactic: when a data structure's general-purpose features (update, random lookup, sort-on-disk) are not in your access pattern, dropping them can unlock an order-of-magnitude simpler and faster design. Ask what you can refuse to support.

The inherent flip side: this is why Kafka is not a database

The exact property that makes append-only fast, no key index, no in-place update, is exactly why Kafka cannot serve a point lookup by key, an ad-hoc query, or a multi-key transaction (Kreps, Waehner; empirical). Log compaction (cleanup.policy=compact) gives you a weak table, "at least the last value per key", but it does not give you a queryable index, and it does not guarantee a single record per key at any instant (I · 04; empirical). The correct architecture is to treat the log as the source of truth and materialise queryable views into purpose-built stores (a B-tree RDBMS, an LSM KV store, a search index), the "database inside-out" pattern (III · 00). Trying to use the log as the database fights its storage engine and loses; this is a structural limit, not a missing feature, and it is treated in depth in III · 03.

Decision 7, Single-writer (leader) per partition vs multi-writer

The fork. Can more than one node accept writes to the same partition at once? Kafka chose single-writer: exactly one leader per partition accepts every append, assigns the monotonic offset, and serves every read; followers only replicate (I · 08). The rejected road is multi-writer, multiple nodes accepting writes to the same logical shard concurrently, as in multi-leader/active-active databases or leaderless KV stores, which requires conflict resolution (last-write-wins, CRDTs, vector clocks) or a consensus round on every write to agree an order.

Axis	Single-writer leader (Kafka)	Multi-writer (multi-leader / leaderless)
Assigning a total order	Trivial, one writer hands out offsets sequentially	Hard, needs consensus per write or post-hoc conflict resolution
Write conflicts	Impossible by construction	Inherent, siblings, LWW, CRDTs, merge logic
Write availability under leader loss	Paused until a new leader is elected (~ms, controller-driven)	Higher, surviving writers keep accepting
Throughput ceiling per partition	Bounded by one leader's resources	Sum of all writers (but you pay reconciliation)
Reasoning / correctness	Simple, one source of truth for the offset	Complex, eventual consistency, anomalies

Why single-writer is right for a log. A log is a total order, and the cheapest way to produce a total order is to have one writer assign positions. With a single leader, the offset is just a counter; there is never a conflict to resolve because there is never a second opinion about what comes next. Multi-writer would hand you exactly the problem ISR-vs-Dynamo already settled (Decision 2): concurrent writers produce an order that must be reconciled, and a reconciled "order" is not the immutable, total order a log promises. Single-writer also makes the acks=all durability story clean, there is one leader whose log is authoritative, so "committed" has a single, unambiguous meaning (the high watermark on that leader), and a returning ex-leader simply truncates to the new leader's epoch boundary rather than negotiating a merge (I · 08, leader epochs / KIP-101).

The inherent price: per-partition write availability and throughput ceiling

Single-writer means that when a leader fails, writes to that partition pause until the controller elects a successor, there is no surviving writer to absorb them. The mitigation is to make election fast (KRaft per-partition election is ~ms and controller-driven) and to spread leadership so no single broker's loss pauses too much at once (II · 07 Failure Modes), but the pause is inherent, you trade a brief write-availability gap for never having a write conflict. The second price is a per-partition throughput ceiling: one partition's writes are bounded by one leader, which is precisely why the partition is also the unit of parallelism (Decision 3), you scale write throughput by adding partitions (more leaders), not by adding writers to a partition. A multi-writer design would raise both ceilings at the cost of the total order; Kafka keeps the order and scales out instead.

Tactic to carry

Single-writer-per-shard is the cheapest way to get a per-shard total order, and it composes beautifully with hash-partitioning: shard by key, one writer per shard, and you have per-key total order with no consensus on the write path. Reach for it whenever order within a key matters more than write availability of a single key during a failover. Reach for multi-writer only when you genuinely cannot tolerate a failover pause and your data model can absorb conflict resolution, a much rarer requirement than it first appears.

Decision 8, Coordinators as replicated logs

The fork. Where does a stateful broker subsystem, group membership, committed offsets, transaction state, keep its durable state? Kafka chose, repeatedly and on purpose, to make these subsystems replicated state machines over an internal log rather than to give them a bespoke datastore. The group coordinator's state is the __consumer_offsets log: every join, every offset commit, every assignment is a record; in-memory state is a deterministic replay of those records; and failover is "the new coordinator loads the partition and replays it" (I · 13). The transaction coordinator does the same over __transaction_state (I · 14). The rejected road is the obvious one most systems take: give each subsystem its own store (an embedded database, a ZooKeeper subtree, a custom on-disk format) with its own replication, durability, and failover code.

Offset commit / join / txn state change

Coordinator shardturns the mutation into a record

Internal log partition__consumer_offsets / __transaction_state, replicated by ISR

High watermark ≥ record offset?

release client response, survives failover

new coordinator loads partition → replays records → identical state

A coordinator built as a replicated log. The mutation is a record; durability and failover are inherited from the partition's ISR replication and high-watermark gate, not re-implemented. A response is released only once its record is committed, so any acknowledged change survives the coordinator's loss.

client request coordinator internal log failover / commit gate path failover branch

Why coordinators-as-logs is right. Kafka already had to build a correct, durable, fast, replicated log, that is the product. Making every stateful subsystem ride that same machinery is massive code reuse of the hardest part of the system. The coordinator does not implement durability (the partition's ISR replication does), does not implement failover (loading-and-replaying a partition does), and does not implement the consistency boundary (the high watermark is exactly the trigger that releases a response once its record is committed; I · 13). Three guarantees the subsystem would otherwise have had to re-derive, durability, failover, read-your-committed-write, fall out for free because the state is a log. This is dogfooding as architecture: the abstraction is good enough that the system builds itself on it.

Why build the system out of its own abstraction

When your core abstraction is genuinely powerful, the highest-leverage move is to express the rest of the system in it. Kafka's coordinators are the proof: rather than N subsystems each with bespoke durability/failover/consistency code (N times the surface area, N times the bugs), there is one replication engine and several thin state machines that replay logs. The transferable principle, if you have built a strong primitive (a log, a consensus module, a versioned store), look hard for the internal components you can re-express as clients of that primitive instead of giving them their own machinery. The payoff is not elegance for its own sake; it is that every hardening of the core primitive (a durability fix, a failover speedup) automatically hardens every subsystem built on it.

The price: the internal logs inherit the log's constraints too

Building coordinators on logs means they also inherit the log's limits, and some bite operationally. The __consumer_offsets and __transaction_state topics have a fixed partition count set at cluster creation, and because a group is mapped to a coordinator by abs(groupId.hashCode()) % numPartitions, changing that count after groups exist re-maps groups to different coordinators and orphans their state, exactly the hash-mod-N immutability of Decision 3 (I · 13). A hanging transaction on __transaction_state can pin the last-stable-offset and stall read_committed consumers, and on a compacted internal topic can block the cleaner (I · 14; II · 07; empirical). The reuse is a net win, but "it's a log all the way down" means the log's structural constraints are also all the way down, plan the internal-topic partition counts as carefully as any keyed topic.

The decisions as one coherent system

Read individually, the eight forks look like independent engineering calls. Read together, they are one bet repeated: optimise relentlessly for a durable, totally-ordered, replayable, high-throughput fan-out log, and pay wherever that abstraction does not value the currency. Each decision reinforces the others, pull consumers and consumer-tracked offsets together make fan-out and replay cheap; single-writer and hash-partitioning together make per-key total order cheap; append-only storage and page-cache reliance together make sequential throughput cheap; ISR-for-data and Raft-for-metadata together make the whole thing both scalable and self-consistent.

#	Decision	Road not taken	Primary axis	Right when…	The inherent price
1	Pull consumers	Push delivery	Idle latency ↔ flow/replay control	Consumers are heterogeneous / replay-capable	Sub-ms idle-stream latency (mitigated by long-poll)
2a	ISR replication (data)	Majority quorum / sloppy quorum	Throughput+scale ↔ fixed tolerance / eventual	Vast partitions, tunable durability, total order	Single leader serves all reads (a hotspot ceiling)
2b	Raft majority (metadata)	ISR for metadata / external ZK	Always-consistent + self-managed failover	Small, must-never-split-brain control plane	Majority needed to make progress (3/5 quorum)
3	Partition = hash % N	Consistent hashing / range shards	Order+simplicity ↔ smooth resharding	Per-key order matters; counts are stable	Can't decrease; resizing keyed topics breaks order
4	Consumer-tracked offsets	Broker-tracked ack	Cheap fan-out/replay ↔ per-msg control	Replayable fan-out; no per-msg TTL/priority needed	No selective ack; processing durability is the client's
5	Page cache + zero-copy	Managed memory / direct IO	Small surface ↔ residency control	Tail-heavy reads; value simplicity & familiarity	No eviction control (catch-up tax; tiered storage helps)
6	Append-only segments	LSM-tree / B-tree	Sequential write ↔ random read/update	Immutable ordered records read by position	No key lookup / update, not a database
7	Single-writer leader	Multi-writer	Cheap total order ↔ write availability	Order within a key > failover-pause tolerance	Writes pause during failover; per-partition ceiling
8	Coordinators as logs	Bespoke per-subsystem store	Reuse the core ↔ none material	Always (when the log primitive is strong)	Internal topics inherit the log's constraints too

What to take to your own systems

Five reusable tactics outlive Kafka: (1) match the consistency protocol to the workload, not to the product, a small always-correct control plane and a vast throughput-bound data plane deserve different replication (the ISR/Raft split). (2) Move state to whichever side keeps it on the smaller dimension, push read-position to clients so broker state is O(log), not O(messages×consumers). (3) Subtraction is a design move, refusing updates and random lookups bought append-only's speed; ask what you can decline to support. (4) Delegating to a battle-tested OS/runtime path (page cache, zero-copy) is a strong default; beating it costs a whole subsystem you must then own. (5) Build the system out of its own strongest primitive, express internal subsystems as clients of the core (coordinators-as-logs), so every hardening of the core hardens them all. And the honesty rule that frames all five: name the price at every fork, and know whether it is inherent or tunable, the inherent ones you carry forever; the tunable ones you budget a dial for. The inherent prices collected here become the subject of III · 03 Inherent Limits; the tactics, of III · 04 The Tactics Toolkit; and the head-to-head against the systems that chose the other roads, of III · 06 Comparative Architectures.