krivaltsevich.com Kafka Internals 4.4

III · 04 · The Tactics Toolkit

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.

Kafka is not one idea; it is a dossier of reusable engineering tactics, each solving a problem that recurs in almost every distributed system. Strip away the topics and the wire protocol and what remains is a small vocabulary of moves: ride the hardware instead of fighting it, amortise fixed costs by batching, make one log the coordination substrate for everything, fence stale actors with monotonic counters, serialise writes onto a single thread over copy-on-write structures, fold deltas onto an immutable image, schedule millions of timeouts in O(1), and gate behaviour behind feature flags so a fleet upgrades without a flag day. This chapter extracts fourteen such tactics, names the concrete Kafka mechanism behind each (cross-linked to the source-verified Part I), and generalises each into a tool you can carry to systems with nothing to do with messaging. For each we give the force it answers, the force it sacrifices, and the line between structural (inherited whether you want it or not) and tunable (a knob you choose). The companions bound this: bp01 decides whether the log fits at all, bp03 catalogues where these tactics turn into walls, and bp05 traces how the toolkit was discovered version by version.

How to read this: a tactic is not a feature

Each tactic follows three beats: Principle (the move in system-agnostic terms), In Kafka (the concrete mechanism, linking source-verified Part I), Beyond Kafka (where it pays off, and where it does not). They run from the physical (IO) through the logical (coordination, consistency) to the operational (evolution, extensibility), and several compose. Throughout we abstract above the feature list: "idempotent producer" is a feature; dedup by a monotonic (id, epoch, sequence) triple is the tactic, and it reappears in payment gateways, gRPC retries, and CRDT delta sync. You should be able to apply each section without mentioning Kafka.

Tactic 1, Mechanical sympathy: ride the hardware

Principle. The fastest code matches how the machine actually works. Modern hardware rewards sequential access (disks and prefetchers stream contiguous bytes far faster than they seek), not copying (every byte moved between kernel and user space costs a bus cycle and a cache line), and letting the OS do its job (the page cache is a battle-tested LRU you did not write). "Mechanical sympathy", Jackie Stewart's racing phrase, borrowed by Martin Thompson, is the discipline of designing software so the hardware's fast path is the common path.

In Kafka. The log engine (03) is the canonical demonstration. Writes are append-only to a segment file, so the disk (or the SSD's flash translation layer) sees a sequential stream, not random writes. Recent reads are served from the OS page cache; Kafka keeps no data cache of its own, which is why the JVM heap stays small (~6 GB) and the rest of RAM (~28–30 GB on a 32 GB box) goes to the kernel [Confluent, Running Kafka in Production]. Consumer reads use zero-copy via sendfile: bytes go disk → page cache → NIC without entering the JVM, skipping two copies and two context switches. The unit of all of it is the batch (Tactic 2): records are stored, replicated, and served as opaque compressed blocks, so per-record overhead approaches zero. The combined effect is the LinkedIn benchmark: 193 MB/s across three producers with 3× async replication, and an end-to-end p99 of 3 ms on 2014-era commodity hardware [Kreps, LinkedIn].

Beyond Kafka. Log-structured storage engines (LSM trees in RocksDB/Cassandra/ScyllaDB) turn random writes into sequential ones for exactly this reason; write-ahead logs in every relational database do the same. Append-only design is also what makes object-store-native systems (WarpStream, AutoMQ, the diskless Kafka of bp05) viable, S3 loves large sequential PUTs and hates small random ones. The "let the OS cache" half transfers to any read-heavy service whose working set fits in RAM: memory-mapped indexes, and not reinventing an LRU you will get subtly wrong.

Why this is the foundational tactic

Mechanical sympathy is why Kafka is fast without being clever. There is no exotic data structure on the hot path; a log is "perhaps the simplest possible storage abstraction" [Kreps, The Log]. The performance comes from alignment with hardware, not algorithmic ingenuity. The transferable lesson: before you optimise an algorithm, check whether you are fighting the machine.

Where riding the hardware turns into a wall

Two structural costs are inherited, not chosen. (1) Zero-copy only works when bytes pass through untouched: the moment you need broker-side filtering, transformation, or TLS encryption, the sendfile fast path is gone and bytes traverse user space, raising CPU and tail latency (part of why some benchmarks reverse under TLS [Vanlightly]). (2) Leaning on the page cache means cold reads (replay, a lagging consumer) evict the hot working set, the "catch-up tax" [azguards]; tiered storage (05) and fetch isolation mitigate but do not abolish it. The fast path is fast precisely because it is narrow; step off it and you pay full price.

Tactic 2, Batching: amortise the fixed cost

Principle. Almost every operation has a fixed per-call cost (a syscall, a round-trip, a header, a checksum, a lock) plus a variable per-byte cost. If the fixed cost dominates at small sizes, the single most effective optimisation in computing is to do more per call: group N items into one operation and divide the fixed cost by N. Batching is the universal amortiser, and its price is always the same, latency (you wait to accumulate) and memory (you buffer).

In Kafka. Batching is pervasive and recursive. The producer (16) accumulates records per partition into a RecordBatch, controlled by batch.size (default 16 KB) and linger.ms (how long to wait for it to fill). The batch is the unit of compression (01), one compressed block with one header set and one CRC, which is why tiny batches compress poorly, and the unit of the wire protocol, replication, the on-disk segment, and the zero-copy fetch. The tradeoff is concrete: with linger.ms=0, batches averaged ~1,215 bytes despite a 16 KB cap; at linger.ms=1500 they reached ~275 KB [Confluent producer hands-on]; roll-ups cite 10–50× throughput gains from batching [Confluent].

Beyond Kafka. Batching is everywhere fixed costs hide: SQL bulk inserts, the Nagle algorithm coalescing TCP segments, GPU kernels amortising launch overhead across a tensor, syscall batching (io_uring, writev), and database group-commit (fsync once per N transactions). The decision is always the same shape: does the fixed cost dominate at your item size, and can you tolerate the latency of accumulation?
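
The size-plus-time trigger can be sketched in a few lines. This is a minimal illustration, not Kafka's producer: the class name `Batcher`, its `poll()` method, and the injected `clock` are assumptions made for the example, and `max_bytes` / `linger_s` only play the roles that batch.size and linger.ms play in the text.

```python
import time


class Batcher:
    """Accumulate items; flush when the batch is full or the linger window expires.

    Hypothetical sketch: max_bytes and linger_s stand in for batch.size and linger.ms.
    """

    def __init__(self, flush, max_bytes=16_384, linger_s=0.005, clock=time.monotonic):
        self.flush = flush          # callback that receives one list of items per batch
        self.max_bytes = max_bytes  # size trigger
        self.linger_s = linger_s    # time trigger
        self.clock = clock          # injectable so the sketch can be tested without sleeping
        self.items, self.size, self.opened_at = [], 0, None

    def add(self, item: bytes):
        # Flush first if this item would overflow the batch that is currently open.
        if self.items and self.size + len(item) > self.max_bytes:
            self._flush()
        if not self.items:
            self.opened_at = self.clock()   # the linger window starts with the first item
        self.items.append(item)
        self.size += len(item)
        if self.size >= self.max_bytes:
            self._flush()

    def poll(self):
        """Call periodically: flushes a partial batch whose linger window has passed."""
        if self.items and self.clock() - self.opened_at >= self.linger_s:
            self._flush()

    def _flush(self):
        batch, self.items, self.size, self.opened_at = self.items, [], 0, None
        self.flush(batch)
```

Whichever trigger fires first pays the fixed cost once for the whole batch; the price is the buffer held in `items` and up to `linger_s` of added latency for the first item in each batch.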

Figure: batching decision tree. Batch only when fixed cost dominates; choose a size trigger, a time trigger, or both, by your latency budget.
1. Is the per-op fixed cost ≫ per-item cost (syscall · round-trip · header · CRC · lock)?
   No: do not batch. Batching only adds latency and buffering; ship items individually.
   Yes: go to 2.
2. Can you tolerate accumulation latency (how long to wait for a full batch)?
   No (latency-favouring path): size trigger only. linger≈0 · flush when full · accept small batches at low rate (Kafka default).
   Yes (throughput-favouring path): size + time trigger. linger.ms>0 · larger batch.size · flush on whichever fires first.

Batching is not free at low rate

The default trap: on a low-rate stream (200 rec/s of 1000 B), increasing linger.ms decreased throughput and increased latency, the defaults were optimal [Confluent producer hands-on]. linger.ms > 0 only pays when the rate can fill batch.size within the window; below that, linger is pure added latency with no amortisation. Always ask "does my rate fill the batch?" before raising linger.
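
The "does my rate fill the batch?" question is simple arithmetic. The sketch below assumes every record lands in the same partition and ignores compression and per-record overhead; the function names are hypothetical.

```python
def time_to_fill_ms(batch_size_bytes, records_per_s, record_bytes):
    """Milliseconds for a steady stream to accumulate one full batch."""
    bytes_per_s = records_per_s * record_bytes
    return 1000.0 * batch_size_bytes / bytes_per_s


def linger_fills_batch(linger_ms, batch_size_bytes, records_per_s, record_bytes):
    """True when the linger window is long enough for the rate to fill the batch."""
    return linger_ms >= time_to_fill_ms(batch_size_bytes, records_per_s, record_bytes)


# The low-rate stream from the text: 200 rec/s of 1000 B against a 16 KB batch.
fill_ms = time_to_fill_ms(16_384, 200, 1000)
print(round(fill_ms, 1))                            # 81.9
print(linger_fills_batch(5, 16_384, 200, 1000))     # False
```

At 200 KB/s a 16 KB batch takes about 82 ms to fill, so under these assumptions any smaller linger value expires before the size trigger can fire.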

Tactic 3, The log as a coordination primitive

Principle. A replicated, totally-ordered log is not just a place to put data, it is a coordination mechanism. The State Machine Replication principle says identical deterministic processes fed the same inputs in the same order reach the same state [Kreps, The Log]. So once you can build one consistent ordered log, you have reduced an enormous class of coordination problems, consensus, replication, config distribution, offset tracking, leader-election bookkeeping, to "append a record and let everyone replay it." The radical move: this primitive is so general you should stop building bespoke coordination subsystems and reuse the same log machinery for all of them.

In Kafka. This is the deepest architectural commitment in the system: everything is a log. Consider the five kinds of log Kafka maintains, user data plus four internal topics, all running on the same storage, replication, and fetch engine:

| Internal log | What it coordinates | Part I |
|---|---|---|
| user data topics | the application's events, the original purpose | 03 |
| __consumer_offsets | every group's committed position; a compacted topic keyed by (group, topic, partition) | 13 |
| __transaction_state | the state machine of every transactional producer (Ongoing → PrepareCommit → Complete) | 14 |
| __cluster_metadata | the entire cluster's configuration and topology, the KRaft Raft log replacing ZooKeeper | 10, 12 |
| __share_group_state | per-record acknowledgement state for share groups (queue semantics) | 15 |

The economy is striking. KRaft (10) did not invent a new consensus store; it modelled cluster metadata as a Kafka topic (__cluster_metadata) replicated by a Raft quorum, so the same log the system was built to serve now serves the system itself. The controller (11) turns every admin request into metadata records appended to that log; failover is "another node deciding to start reading from where the old one stopped." Offsets, transaction state, and share-group acknowledgements are likewise just records in compacted logs. One mechanism, hardened once, reused five ways.

Beyond Kafka. This is the "turning the database inside out" insight [Kleppmann, 2014]: make the replication log a first-class citizen and treat every queryable structure (indexes, caches, materialised views) as a deterministic projection of it. etcd and Consul are Raft logs with a key-value projection on top; ZooKeeper is an atomic-broadcast log; CockroachDB and TiDB replicate per-range Raft logs; CDC + the outbox pattern make a database's own WAL the integration log for the whole company [Morling, Debezium]. The reusable tactic: when you need ordering, durability, replication, and replay in more than one subsystem, build the log once and project from it rather than reimplementing those four properties per subsystem.
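
A minimal sketch of "build the log once and project from it", assuming an in-memory list as the log and records shaped as `(kind, key, value)` tuples. `Log`, `replay`, and the two `apply_*` functions are names invented for this illustration.

```python
class Log:
    """Append-only, totally ordered sequence of records (in-memory stand-in)."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the record just appended


def replay(log, apply, state, from_offset=0):
    """Fold records onto state in log order.

    With a deterministic apply function, every replica that replays the same
    log reaches the same state (state machine replication).
    """
    for record in log.records[from_offset:]:
        state = apply(state, record)
    return state


def apply_offsets(state, record):
    # Projection 1: latest committed offset per key, like a compacted topic.
    kind, key, value = record
    return {**state, key: value} if kind == "commit" else state


def apply_config(state, record):
    # Projection 2: current configuration, derived from the same log.
    kind, key, value = record
    return {**state, key: value} if kind == "config" else state


log = Log()
log.append(("commit", ("g1", 0), 5))
log.append(("config", "retention.ms", 1000))
log.append(("commit", ("g1", 0), 9))

offsets = replay(log, apply_offsets, {})   # {("g1", 0): 9}
config = replay(log, apply_config, {})     # {"retention.ms": 1000}
```

Both projections get ordering and replay from the one log; neither needs its own store.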

Tactic: collapse N coordination problems into 1 log

About to build a second durable, ordered, replicated store for "just this one piece of metadata"? Stop. The marginal cost of putting it on the existing log is a compacted topic and a small state machine; a second bespoke store is a second thing to make consistent, monitor, back up, and reason about under partition. Kafka pushed this to its end (KRaft) and deleted an entire external dependency, ZooKeeper (KIP-500).

Tactic 4, Idempotence by (id, epoch, sequence)

Principle. In any system with retries, and every distributed system has retries, the network cannot tell you whether a request that timed out was lost or merely slow. Retrying a slow-but-successful request causes duplicates. The fix is to make the operation idempotent: attach a stable identity to each logical item and have the receiver deduplicate. The robust identity for an ordered stream is a triple: a producer identity (who), an epoch (which incarnation of that producer, see Tactic 5), and a monotonic sequence number (which item in order). The receiver tracks the highest sequence seen per (id, epoch) and rejects anything not exactly one greater.

In Kafka. The idempotent producer (14) is exactly this. On first connect the producer is assigned a producerId (PID); each batch carries that PID, a producer epoch, and a per-partition sequence number. The partition leader caches the last five sequence numbers per PID and enforces strict succession: a duplicate (a retried batch already written) is silently acknowledged without re-appending; a gap (sequence too high) is rejected as OutOfOrderSequence. This is what lets enable.idempotence=true coexist with max.in.flight.requests.per.connection=5 while preserving order and exactly-once write semantics within a partition, five batches in flight, with sequence numbers reconstructing order and catching duplicates. Transactions build read-process-write exactly-once on this same dedup foundation.

Beyond Kafka. The (id, sequence) dedup tactic is the backbone of reliable messaging everywhere: TCP sequence numbers deduplicate retransmitted segments; Stripe requires an Idempotency-Key so a retried charge does not double-bill; gRPC/HTTP clients pair retries with idempotency tokens; CRDT and delta-sync protocols version each update so re-delivery is harmless. The epoch dimension matters whenever the producer itself can restart (Tactic 5).
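
The receiver side of the triple can be sketched as follows. This is a simplification under stated assumptions: it remembers only the last accepted sequence per producer rather than a window of recent batches, it assumes a new epoch restarts its sequence at 0, and the returned strings are illustrative labels, not Kafka error codes.

```python
class PartitionDedup:
    """Dedup for one partition, keyed by producer id (hypothetical sketch)."""

    def __init__(self):
        self.state = {}   # producer_id -> (epoch, last accepted sequence)
        self.log = []

    def append(self, producer_id, epoch, sequence, payload):
        cur_epoch, last_seq = self.state.get(producer_id, (epoch, -1))
        if epoch < cur_epoch:
            return "FENCED"          # an older incarnation of this producer
        if epoch > cur_epoch:
            last_seq = -1            # new incarnation: expect sequence 0 next
        if sequence <= last_seq:
            return "DUPLICATE"       # already written: acknowledge, do not re-append
        if sequence != last_seq + 1:
            return "OUT_OF_ORDER"    # gap: something before this one is missing
        self.log.append(payload)
        self.state[producer_id] = (epoch, sequence)
        return "APPENDED"


d = PartitionDedup()
d.append(7, 0, 0, "a")   # "APPENDED"
d.append(7, 0, 0, "a")   # "DUPLICATE" (a retry of a batch that already landed)
d.append(7, 0, 2, "c")   # "OUT_OF_ORDER" (sequence 1 has not arrived)
```

The retry is harmless because the receiver, not the sender, decides whether the write happened.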

The dedup window is finite, design for it

Kafka caches only the last five sequence numbers per PID, and the PID-to-state mapping is itself bounded and expirable, because no practical dedup store keeps unbounded history. So idempotence holds only within a window (in flight, recent). The transferable rule: an idempotency key is only as good as the retention of the dedup table behind it (a Stripe idempotency key expires after 24 hours for the same reason). Idempotence is a local, bounded guarantee, not a permanent global one.

Tactic 5, Epoch-fencing: the universal anti-zombie

Principle. The hardest failure in distributed systems is not the node that dies, it is the node that appears to die, gets replaced, and then comes back. A GC pause, a partition, or a slow disk makes the cluster declare a node dead and promote a successor; then the "dead" node wakes up, still believing it is the leader/owner/producer, and issues writes, a zombie that causes split-brain and corruption. The universal defence is an epoch (a.k.a. fencing token, term, generation): a monotonically increasing counter bumped on every ownership change. Every action carries the epoch under which it was authorised; the receiver remembers the highest epoch seen and rejects any action stamped with a stale one. The zombie's writes carry an old number and bounce off.

In Kafka. Epoch-fencing is not used once, it is a pattern applied independently at every layer, the strongest evidence of its generality. The same idea, four times:

| Epoch | Fences against | Bumped when | Part I |
|---|---|---|---|
| leader epoch | a deposed partition leader writing after a new leader is elected; truncates divergent suffixes on followers | leadership changes | 08 |
| producer epoch | a zombie transactional producer (old instance) committing after a new instance took its transactional.id | a new producer claims the transactional.id (raises ProducerFenced on the old one) | 14 |
| partition epoch | stale ISR/leader updates racing with the controller's view | partition state changes in the metadata log | 08 |
| member epoch | a consumer acting on a stale group assignment after a rebalance moved its partitions | group membership changes (KIP-848 protocol) | 13 |

The leader epoch is the cleanest example. Before it existed, a returning follower could not tell whether its log suffix was committed or belonged to a deposed leader, and truncation by high-watermark alone could lose or resurrect data; the leader epoch stamps each message range with the term that produced it, so a rejoining replica asks "what was the end offset of epoch N?" and truncates precisely (KIP-101). The producer epoch turns "two pods sharing a transactional.id" from silent double-commit into a clean ProducerFencedException on the loser [Kafka transactions; dev.to zombie fencing].

Beyond Kafka. Fencing tokens are the textbook fix for the lease/lock zombie problem (Kleppmann's critique of distributed locks): the lock service hands out a monotonic token and the protected resource rejects stale tokens, otherwise a GC-paused client holding an "expired" lock corrupts data. Raft/Paxos terms/ballots, ZooKeeper's zxid, primary-backup generation numbers, and Kubernetes resourceVersion are all the same tactic. The reusable rule: any time ownership can transfer, give ownership a monotonic number and make every downstream actor reject stale numbers, checking the token at the resource, not just at the lock service, is the part people forget.
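
Checking the token at the resource takes very little code. A minimal sketch, with `FencedResource` and `StaleEpochError` as hypothetical names:

```python
class StaleEpochError(Exception):
    """Raised when a write carries an epoch older than one already seen."""


class FencedResource:
    """A protected resource that remembers the highest epoch it has accepted."""

    def __init__(self):
        self.highest_epoch = -1
        self.value = None

    def write(self, epoch, value):
        # The check lives here, on the write path of the resource itself,
        # so it still holds when the lock service and the writer disagree.
        if epoch < self.highest_epoch:
            raise StaleEpochError(f"epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch
        self.value = value


store = FencedResource()
store.write(1, "from old leader")
store.write(2, "from new leader")      # ownership changed, epoch bumped
try:
    store.write(1, "from zombie")      # old leader wakes up after a pause
except StaleEpochError:
    pass                               # the stale write bounces off
```

The zombie never learns it was deposed before it writes; it does not need to, because the resource rejects the old number.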

Tactic: a monotonic counter at every ownership boundary

Epochs are cheap and composable, a single integer per ownership domain, checked on the write path. Wherever your design has a "current leader," "current owner," or "current configuration," ask: what stops the previous one from acting? "We declared it dead" is not an answer, declaring a node dead does not stop it running. Only a fencing token does. This is the most under-applied tactic in homegrown distributed systems.

Tactic 6, Single-threaded event loop over copy-on-write timelines

Principle. Concurrency control is the second-hardest problem in systems (after naming): locks are subtle and contended, lock-free structures fiendish to get right. When one component owns mutable state there is a radically simpler alternative, run all writes on a single thread (a serial event loop), eliminating write-write races by construction, and serve reads from immutable snapshots so readers never block writers and never see a torn state. One writer thread plus multi-version (copy-on-write) read structures gives the consistency of a global lock with the read concurrency of a lock-free design, and code you can actually reason about.

In Kafka. The KRaft controller (11) is built exactly this way. A single class, QuorumController, runs a one-thread event loop: every administrative request becomes an event, events process strictly in order, and each emits a list of metadata records. With one writer there are no locks on the controller's in-memory state. Reads must stay consistent as writes stream in, so the controller stores state in timeline data structures (TimelineHashMap in the SnapshotRegistry), copy-on-write / MVCC maps where a mutation creates a new version tagged with an offset, and a reader pins a consistent snapshot at any committed offset and reads it lock-free while the writer races ahead. The same property makes uncommitted state reversible: if records are written but not yet committed by Raft and the controller loses leadership, it reverts the timeline to the last committed offset, the snapshot machinery is the rollback mechanism.

Beyond Kafka. The single-writer loop is the core of Redis, Node.js, nginx, the LMAX Disruptor (millions of ops/sec on one thread by avoiding locks and cache contention), and the actor model (Akka, Erlang); copy-on-write/MVCC reads power PostgreSQL snapshot isolation, Clojure's persistent structures, and CopyOnWriteArrayList. The reusable tactic: if a component owns mutable state, prefer "one writer thread + immutable read snapshots" over "shared state + locks", simpler, often faster, and the snapshot gives point-in-time rollback for free.
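
A deliberately naive sketch of the idea: one writer creates a new read-only version per offset, readers pin a version, and rollback discards uncommitted versions. It copies the whole map on every mutation, which a real timeline structure would avoid by sharing unchanged data; `Timeline` and its methods are names invented for this example, not the API of TimelineHashMap.

```python
from types import MappingProxyType


class Timeline:
    """One read-only version of the state per offset (naive copy-on-write)."""

    def __init__(self):
        self.versions = {0: MappingProxyType({})}
        self.last_offset = 0
        self.committed = 0

    def apply(self, key, value):
        # Only the single writer (the event loop) calls this, so no lock is needed.
        new = dict(self.versions[self.last_offset])
        new[key] = value
        self.last_offset += 1
        self.versions[self.last_offset] = MappingProxyType(new)
        return self.last_offset

    def commit(self, offset):
        self.committed = offset

    def snapshot(self, offset=None):
        # Readers pin a version; later writes never change what they see.
        return self.versions[self.committed if offset is None else offset]

    def revert_to_committed(self):
        # Rollback: drop every version written after the committed offset.
        for off in [o for o in self.versions if o > self.committed]:
            del self.versions[off]
        self.last_offset = self.committed


t = Timeline()
t.commit(t.apply("topic-a", 1))
pinned = t.snapshot()          # reader pins the committed version
t.apply("topic-a", 2)          # writer races ahead, not yet committed
t.revert_to_committed()        # leadership lost: uncommitted write vanishes
```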

Figure: single-writer + MVCC. One thread mutates a copy-on-write timeline; unlimited readers pin consistent snapshots; reverting to an offset is the rollback path. Layers, top to bottom:
Many concurrent readers (lock-free): pin a snapshot at a committed offset · read lock-free · never block, never see a torn write
  ↕ snapshot at offset
Copy-on-write timeline (TimelineHashMap · SnapshotRegistry), the versioned (MVCC) state: each mutation = new version tagged by offset · old versions readable · revert-to-offset = rollback
  ↕ single writer appends
One event-loop thread (QuorumController), the single writer: processes events strictly in order · no locks on state · write → commit → apply discipline

The single thread is also the ceiling

Structural cost: one writer thread bounds the controller's write throughput to one core, fine because metadata mutation is low-rate relative to data, which is exactly why the controller is not on the data path. The single-writer tactic works when the serialised work is small per item and modest in rate. Redis hits the same wall (a single slow command stalls everything; KEYS * on a big keyspace is a known footgun). If your write rate or per-op cost is high, the loop becomes the bottleneck and you must shard state across loops, reintroducing cross-shard coordination.

Tactic 7, Optimistic concurrency via epoch compare-and-set

Principle. When conflicts are rare, pessimistic locking is wasteful, you pay coordination on every operation to guard against a collision that almost never happens. Optimistic concurrency inverts this: proceed without locking, but stamp each update with the version you read, and the authority accepts it only if the version still matches (compare-and-set). If someone changed it first, the CAS fails and you retry with the fresh version. You pay coordination only on the rare conflict.

In Kafka. The AlterPartition path (08) is optimistic concurrency in action. A leader that wants to shrink or expand the ISR does not lock the controller; it sends an AlterPartition request carrying the current leader and partition epochs it believes are in effect. The controller applies the change only if those epochs match its own view (a compare-and-set on the partition's version); if they are stale, because another change landed first, the request is rejected and the leader refreshes and retries. Many leaders can propose ISR changes concurrently against one controller without blocking, while the epoch CAS guarantees no stale update corrupts partition state. It is the same epoch from Tactic 5, used here as a concurrency token rather than a fencing token, one mechanism serving multiple tactics.
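
The compare-and-set can be sketched as below. This models only the version check described above, under the assumption of a single integer version per partition; `PartitionState`, `alter_isr`, and `shrink_isr` are hypothetical names, not the controller's API.

```python
class PartitionState:
    """Authority-side state with a version that every accepted change bumps."""

    def __init__(self, isr):
        self.partition_epoch = 0
        self.isr = frozenset(isr)

    def alter_isr(self, expected_epoch, new_isr):
        """Apply the change only if the caller read the current version."""
        if expected_epoch != self.partition_epoch:
            return False, self.partition_epoch      # stale: caller must refresh
        self.isr = frozenset(new_isr)
        self.partition_epoch += 1
        return True, self.partition_epoch


def shrink_isr(state, replica, max_retries=3):
    """Proposer side: read, compute, CAS; retry with a fresh read on conflict."""
    for _ in range(max_retries):
        epoch, isr = state.partition_epoch, state.isr
        ok, _ = state.alter_isr(epoch, isr - {replica})
        if ok:
            return True
    return False


p = PartitionState({1, 2, 3})
p.alter_isr(0, {1, 2})     # accepted, version becomes 1
p.alter_isr(0, {1})        # rejected: version 0 is stale
```

No lock is held between the read and the write; the version check is the only coordination, and it costs something only when it fails.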

Beyond Kafka. Optimistic concurrency is the default for high-read, low-conflict systems: HTTP ETag + If-Match, SQL version-column locking (JPA @Version), DynamoDB/Cosmos conditional writes, Kubernetes resourceVersion, Git's compare-and-swap of refs, and the hardware CAS under all lock-free programming. The decision tree below captures when to choose it.

Figure: concurrency-control decision. Optimistic CAS for low contention, pessimistic locking for high contention; a monotonic version protects both.
Decision: how often do writers actually conflict (contention on the same item)?
   Rarely (low-contention path): optimistic (epoch / version CAS). No lock on the hot path · retry on the rare conflict · e.g. AlterPartition, ETag, @Version.
   Often (high-contention path): pessimistic (lock / lease). Serialise up front · avoids retry storms when conflict is the norm · e.g. row locks, single-writer loop.
Shared rule: the version is the safety net. CAS for optimistic; fencing token for pessimistic; same counter.

Optimistic flips to pessimistic under contention

The tradeoff is structural and self-reinforcing: optimistic CAS is cheap when conflicts are rare but degrades into a retry storm when they are common, every writer reads, computes, fails the CAS, and retries, wasting work proportional to contention. The crossover is the design question. If a single item is hot (Tactic 6's single-writer loop is one answer; sharding the item is another), optimistic concurrency is the wrong choice. Measure the conflict rate before assuming optimism.

Tactic 8, Immutable image + delta

Principle. A single mutable structure everyone reads and writes creates two problems: readers see torn intermediate states, and you cannot cheaply ask "what changed?" The alternative keeps state as an immutable snapshot (image) plus a stream of deltas. To advance, fold a delta onto the image to produce a new image; readers always hold a complete, internally-consistent image and swap to a newer one atomically, and subscribers can be sent just the delta. The image is your checkpoint; the delta is your incremental update.

In Kafka. Metadata propagation (12) is built on MetadataImage and MetadataDelta. The cluster's entire metadata state, topics, partitions, configs, ACLs, broker registrations, is a MetadataImage, an immutable snapshot. As the controller appends records to __cluster_metadata, each broker's MetadataLoader accumulates a MetadataDelta and, at record boundaries, folds the delta onto the current image, publishing a new immutable image atomically to every component that reads metadata. Components never see a half-applied update, only image N, then image N+1. The same pair underlies KRaft snapshots: rather than replay the metadata log from the beginning of time, a new or lagging broker loads a recent image and folds only the deltas since. This makes KRaft failover fast, the standby controller already holds a current image and need only catch up the tail (contrast the ZooKeeper era, where failover reloaded all partition state at ~2 ms/partition, making 10,000 partitions cost ~20 s of unavailability [Jun Rao; Confluent 200K post]).

Beyond Kafka. Image + delta is the model of Git (commit = snapshot, diff = transmitted delta), Redux/Elm (immutable store + reducers), event-sourced aggregates (snapshot + events since), database checkpoint + WAL replay, and container image layering. The reusable tactic: checkpoint an immutable image periodically, apply deltas between checkpoints, and fold delta-onto-image to advance, consistent reads, cheap "what changed" subscriptions, and fast recovery (load nearest image, replay tail) at once. It composes with Tactic 3 (the log is the delta stream) and Tactic 6 (the image swap is the atomic publish to lock-free readers).
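
Folding a delta onto an image is a small pure function. The sketch assumes the image is a flat mapping and that a `None` value in a delta marks a deletion; `fold` and `recover` are names chosen for this illustration and are unrelated to the MetadataImage API.

```python
from types import MappingProxyType


def fold(image, delta):
    """Return a new read-only image; the old image is left untouched."""
    merged = dict(image)
    for key, value in delta.items():
        if value is None:
            merged.pop(key, None)   # None marks a deletion in this sketch
        else:
            merged[key] = value
    return MappingProxyType(merged)


def recover(snapshot, tail):
    """Fast recovery: start from the nearest image and fold only the tail."""
    image = snapshot
    for delta in tail:
        image = fold(image, delta)
    return image


image_0 = MappingProxyType({})
image_1 = fold(image_0, {"topic-a": 3})
image_2 = fold(image_1, {"topic-b": 1, "topic-a": None})
```

A reader holding `image_1` keeps a complete, consistent view while `image_2` is built; publishing is a single reference swap.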

Why image + delta beats "just replay the log"

A pure log gives you the deltas but no checkpoints, so recovery time grows without bound, replaying years of metadata records on every restart is untenable. A pure mutable image gives you fast reads but no history and no cheap incremental sync. Combining them, periodic immutable image plus the delta stream, bounds recovery to "nearest snapshot + tail" while keeping the full ordered history available for audit and replay. This is precisely why KRaft snapshots (10) exist on top of the metadata log, and why every serious event-sourced system snapshots aggregates.

Tactic 9, Purgatory and hierarchical timing wheels

Principle. Many server operations cannot complete immediately, a produce with acks=all waits for replicas; a fetch with fetch.min.bytes waits for data; a delayed join waits for a rebalance window. The naive approach (a thread per wait, or a sorted timer queue) is O(log n) per insertion and burns threads. You need two things: (1) park an operation and wake it the instant its completion condition becomes true, and (2) schedule its timeout in O(1) even with millions of pending timers. The combination is a holding area ("purgatory") whose timeouts ride hierarchical timing wheels.

In Kafka. The broker's request handling (06) parks delayed operations in Purgatory. A DelayedOperation (DelayedProduce, DelayedFetch, …) is registered against the keys (partitions) whose state could complete it; when an event changes that state, a follower fetch advances the high watermark, new bytes are appended, the relevant operations are re-checked and completed without polling. Timeouts ride a hierarchical timing wheel: a ring of buckets each holding timers for that interval, with a hierarchy of wheels (like clock hands, seconds, minutes, hours) covering a vast range with O(1) insert and O(1) tick. This lets one broker hold an enormous number of in-flight delayed requests (produce purgatory is non-empty by design under acks=all, normal, not an alert [reference, Appendix E]) without the per-timer cost of a sorted queue.

Beyond Kafka. Hierarchical timing wheels are the standard high-scale timer structure: the Linux kernel timer subsystem, Netty's HashedWheelTimer, connection-timeout management, and TCP retransmission timers (the idea is Varghese & Lauck, 1987). The "completion-condition watcher" half, park, register against state, wake on change, is the readiness model behind epoll/kqueue and reactor frameworks. The reusable tactic: when many operations each wait on (a) an event and (b) a deadline, separate the two, a watcher for events, a timing wheel for deadlines, for O(1) scheduling at massive scale.
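
The bucket-ring idea can be sketched with a single wheel. Note the swap: this is the simpler hashed variant that keeps a per-timer `rounds` counter for delays longer than one revolution, not the hierarchy of wheels described above, and the clock is advanced manually instead of by a timer thread. All names are hypothetical.

```python
class TimingWheel:
    """Single-level hashed timing wheel with a manually advanced clock."""

    def __init__(self, tick_ms=1, wheel_size=8):
        self.tick_ms = tick_ms
        self.wheel_size = wheel_size
        self.buckets = [[] for _ in range(wheel_size)]
        self.current_tick = 0

    def schedule(self, delay_ms, callback):
        ticks = max(1, delay_ms // self.tick_ms)
        slot = (self.current_tick + ticks) % self.wheel_size
        rounds = (ticks - 1) // self.wheel_size   # full revolutions to wait first
        timer = {"rounds": rounds, "callback": callback, "cancelled": False}
        self.buckets[slot].append(timer)          # O(1): no sorting, no heap
        return timer

    def cancel(self, timer):
        timer["cancelled"] = True                 # O(1): dropped lazily on expiry

    def advance(self):
        """Move the clock one tick and fire the timers due in that bucket."""
        self.current_tick += 1
        slot = self.current_tick % self.wheel_size
        due, keep = [], []
        for timer in self.buckets[slot]:
            if timer["cancelled"]:
                continue
            if timer["rounds"] == 0:
                due.append(timer)
            else:
                timer["rounds"] -= 1
                keep.append(timer)
        self.buckets[slot] = keep
        for timer in due:
            timer["callback"]()


fired = []
wheel = TimingWheel()
wheel.schedule(3, lambda: fired.append("a"))
completed_early = wheel.schedule(5, lambda: fired.append("b"))
wheel.schedule(9, lambda: fired.append("c"))     # longer than one revolution
wheel.cancel(completed_early)                    # operation finished before its deadline
for _ in range(9):
    wheel.advance()
```

A hierarchy replaces the `rounds` counter with coarser outer wheels, so a tick never has to touch timers that are not yet due.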

Tactic: O(1) timers + condition watchers, never thread-per-wait

Spawning a thread (or a sleeping future) per waiting operation does not survive scale, threads are expensive and a sorted timer queue is O(log n). Timing wheels trade a little timer precision (bucket granularity) for O(1) cost at any scale, almost always right for timeouts, since you rarely need microsecond-exact firing. Pair them with a state-change watcher so the common case (the operation completes before its timeout) never touches the timer beyond a cheap cancel.

Tactic 10, The watermark: a committed-boundary pointer

Principle. In a system where data is written ahead of being durable/replicated/visible, you need a single, cheap, advancing pointer that says "everything below this line is safe; everything above is not yet." A watermark is that pointer. It separates the written frontier from the committed frontier, lets readers see only safe data, and gives every participant a one-number summary of progress to compare against.

In Kafka. The high watermark (08) is the offset up to which records have been replicated to all in-sync replicas; consumers may only read below it, so they never see a record that could still be lost if the leader fails. The leader advances it as followers fetch and acknowledge: it is the committed-boundary pointer for durability. The same idea recurs: the log-end offset (LEO) is the written frontier; the last stable offset (LSO) is the visibility boundary for read_committed consumers (14), since records past it belong to in-flight transactions and stay invisible until commit; the consumer's committed offset in __consumer_offsets is the watermark of "how far this group has processed." Each is one offset cleanly partitioning "safe/done" from "pending."

Beyond Kafka. Watermarks appear wherever progress is summarised by a boundary: stream-processing event-time watermarks (Flink, Beam) mark "no more events older than T," gating window firing; a database's commit LSN/checkpoint marks durable-up-to-here; TCP's cumulative ACK is a watermark over the byte stream; GCs use a safe point. The reusable tactic: represent "progress" or "safety" as a single monotonically-advancing offset and expose it, lag, durability, and visibility all become subtractions.
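
Once progress is a single offset, the useful quantities fall out as a minimum and a few subtractions. A sketch with made-up replica names and offsets, taking the high watermark as the smallest log-end offset across the in-sync set:

```python
def high_watermark(leo_by_replica, isr):
    """Committed boundary: the smallest log-end offset among in-sync replicas."""
    return min(leo_by_replica[r] for r in isr)


# Hypothetical partition: f2 has fallen out of the in-sync set.
leo = {"leader": 120, "f1": 118, "f2": 95}
hw = high_watermark(leo, isr={"leader", "f1"})

uncommitted = leo["leader"] - hw      # written but not yet safe to read
consumer_committed = 100              # hypothetical committed offset of one group
consumer_lag = hw - consumer_committed

print(hw, uncommitted, consumer_lag)  # 118 2 18
```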

A watermark is a floor, not a guarantee of exactness

A watermark says "at least this much is safe," not "exactly this is the state." The LSO can stall indefinitely behind one hanging transaction, a crashed transactional producer pins it, and every later record on the partition becomes invisible to read_committed consumers, potentially for up to transaction.max.timeout.ms (15 min default) [reference; KIP-664]. Likewise an event-time watermark that advances too eagerly drops late data; too conservatively, it stalls all windows. The watermark is powerful precisely because it is a single number, but a single number is also a single point of stall.

Tactic 11, Pull-based backpressure

Principle. When a fast producer feeds a slow consumer, something must give. Push systems (the producer sends as fast as it can) must add an explicit flow-control protocol or risk overwhelming the consumer and dropping data. Pull systems invert control: the consumer requests data when it is ready, so the consumer's own rate is the flow-control signal, backpressure is implicit and automatic. A slow consumer simply pulls less often; no buffer overflows, no credit protocol, no data loss from overrun.

In Kafka. Consumption is pull-based (09, 17): the consumer issues Fetch requests in a poll() loop and the broker never pushes. A consumer that falls behind accumulates lag but does not destabilise the broker, the data sits durably on disk (or tiered storage) and is read when the consumer can. The broker long-polls with fetch.min.bytes/fetch.max.wait.ms to avoid busy-waiting (parking the fetch in Purgatory, Tactic 9), so pull stays efficient at low rate. The deeper backpressure is the bounded request queue (06): when the handler pool can't keep up, the bounded requestQueue fills, network threads stop reading sockets, and backpressure propagates to TCP.

Beyond Kafka. Pull-based flow control is the model of Reactive Streams (the request(n) demand signal in Reactor, RxJava, Akka Streams), gRPC streaming flow control, TCP's receive window, and any worker pool draining a bounded queue. Contrast push systems: RabbitMQ needs a prefetch/QoS limit and acks to avoid overrun. The reusable tactic: let the consumer's request rate be the flow-control signal, and bound every queue so a full queue blocks the upstream. Backpressure you get for free beats a flow-control protocol you have to design and debug.
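The "bound every queue so a full queue blocks the upstream" rule can be sketched as a small deterministic simulation. Everything here is hypothetical (the BoundedQueue class, the tick model, the rates); it is not how the broker's request queue is implemented, only an illustration that a refusing queue plus a pulling consumer loses nothing and never grows past its bound.

```python
from collections import deque


class BoundedQueue:
    """A queue that refuses new items when full instead of growing or dropping."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def offer(self, item):
        """Return False when full: the caller keeps the item and stops reading upstream."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        return True

    def pull(self, max_items):
        """The consumer asks for at most max_items, when it is ready."""
        n = min(max_items, len(self.items))
        return [self.items.popleft() for _ in range(n)]


def simulate(ticks, produce_per_tick, pull_per_tick, capacity):
    queue = BoundedQueue(capacity)
    upstream = deque()  # stands in for data parked durably at the source
    next_id, consumed, max_depth = 0, [], 0
    for _ in range(ticks):
        for _ in range(produce_per_tick):
            upstream.append(next_id)
            next_id += 1
        # Forward only while the queue accepts; a refusal pauses the upstream.
        while upstream and queue.offer(upstream[0]):
            upstream.popleft()
        max_depth = max(max_depth, len(queue.items))
        consumed.extend(queue.pull(pull_per_tick))
    return consumed, max_depth, len(upstream) + len(queue.items)


consumed, max_depth, pending = simulate(
    ticks=5, produce_per_tick=10, pull_per_tick=2, capacity=4
)
```

The producer outruns the consumer five to one, yet the queue depth never exceeds its capacity and every produced item is either consumed in order or still pending at the source.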

Pull buys stability by spending latency

A recurring Kafka critique: pull adds per-message latency at low load, because the consumer polls on its own schedule rather than being woken the instant data arrives. This is part of why push brokers like RabbitMQ can beat Kafka on latency at low throughput (~1 ms) [Confluent OMB; reference §8]. It is a deliberate throughput-over-latency choice (bp02). The transferable judgement: pull's automatic backpressure wins for throughput-oriented, batch-tolerant pipelines; for latency-critical fan-out to many idle consumers, push (or a hybrid) may win.

Tactic 12, The in-sync set as a tunable consistency dial

Principle. Replication forces a choice between consistency (don't acknowledge until enough copies are durable) and availability/latency (acknowledge fast, tolerate fewer copies). Rather than hard-code one point on this spectrum, expose it as a dial: maintain a dynamic set of replicas that are "caught up enough to count," and let the operator choose how many of that set a write must reach before it is acknowledged. The set adapts to failures (slow replicas drop out, recovered ones rejoin) and the threshold is a config, so the same mechanism spans "fast but fragile" to "slow but bulletproof."

In Kafka. The in-sync replica set (ISR) (08) is this dial. A replica is in the ISR if it has caught up to the leader within the last replica.lag.time.max.ms (default 30 s); fall behind and it is ejected, catch up and it rejoins, so the set is dynamic. The high watermark (Tactic 10) advances only once all ISR members hold the record, and an acks=all producer is acknowledged only then. The operator tunes the consistency point with three composing knobs: replication.factor (how many copies), min.insync.replicas (how many must be in-sync for an acks=all write), and acks (0 / 1 / all). The canonical durable config (RF=3, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false) makes data loss require ≥2 simultaneous failures, while acks=1 trades that for latency (the LinkedIn benchmark showed sync replication roughly halved single-producer throughput: 40.2 vs 78.3 MB/s [Kreps, LinkedIn]).
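A rough model of the dynamic set, assuming each replica is tracked by the last time it was fully caught up and by its log end offset. The function names, replica ids, and timestamps are illustrative assumptions; the broker's real bookkeeping is more involved.

```python
def in_sync_replicas(leader_id, last_caught_up_ms, now_ms, max_lag_ms=30_000):
    """Replicas caught up within max_lag_ms; the leader always counts as in sync."""
    return {
        replica
        for replica, caught_up in last_caught_up_ms.items()
        if replica == leader_id or now_ms - caught_up <= max_lag_ms
    }


def high_watermark(log_end_offsets, isr):
    """The high watermark is the highest offset that every ISR member holds."""
    return min(log_end_offsets[replica] for replica in isr)


now_ms = 100_000
last_caught_up_ms = {1: 100_000, 2: 95_000, 3: 60_000}  # replica 3 is 40 s behind
log_end_offsets = {1: 500, 2: 480, 3: 300}

isr = in_sync_replicas(leader_id=1, last_caught_up_ms=last_caught_up_ms, now_ms=now_ms)
hw = high_watermark(log_end_offsets, isr)
hw_if_slow_replica_counted = high_watermark(log_end_offsets, {1, 2, 3})
```

Ejecting the slow replica is what lets the high watermark keep advancing: with replica 3 still counted, commits would be held back to its offset.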

Beyond Kafka. The tunable-quorum idea is the heart of Dynamo-style systems (Cassandra/DynamoDB per-request R + W levels, where R + W > N gives strong consistency and lower values give speed), MongoDB's writeConcern/readConcern, and any primary-backup system with a configurable ack count. The reusable tactic: do not pick a single consistency point; maintain a dynamic "caught-up" set and expose the ack threshold as configuration, so different topics/tables/requests sit at different points on the consistency–latency curve without a different mechanism.
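The R + W > N rule can be checked by brute force for small clusters. This is a sketch with a hypothetical helper name; it enumerates every possible read set and write set and asks whether they always share at least one replica, which is the property that lets a read observe the latest acknowledged write.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every read set of size r intersects every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


strong = quorums_always_overlap(n=3, r=2, w=2)  # R + W = 4 > 3
fast = quorums_always_overlap(n=3, r=1, w=1)    # R + W = 2 <= 3
```

Here `strong` is true and `fast` is false: with R = W = 1 a read can land on a replica the write never reached.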

The dial has a footgun at the low end and a sharp edge at the high end

Three structural hazards. Low end: acks=all does not mean "on disk". Kafka acknowledges once the ISR holds the record in page cache, with no per-message fsync by default [reference §3 & §9; Vanlightly]. The dial controls replication, not flushing; for disk durability you must also tune flush.ms/flush.messages (and pay the latency). Silent cap: min.insync.replicas guarantees nothing if RF is lower. effectiveMinIsr silently caps it to the replica count, so RF=1 with min.insync.replicas=2 still accepts writes [Conduktor; reference §4]. High end: maximum durability (a high min.insync.replicas relative to RF) raises the chance an acks=all write stalls for lack of in-sync replicas, so availability drops as consistency rises. Verify min.insync.replicas ≤ replication.factor and decide explicitly about fsync.
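A simplified model of the silent cap and the high-end stall, following the description above rather than the broker source. The function names are hypothetical, and the acks=0/1 branch assumes only that a live leader is needed.

```python
def effective_min_isr(min_insync_replicas, replication_factor):
    """Model of the cap: the minimum cannot exceed the number of replicas that exist."""
    return min(min_insync_replicas, replication_factor)


def accepts_write(acks, isr_size, min_insync_replicas, replication_factor):
    """Whether a produce request is accepted in this simplified model."""
    if acks != "all":
        return isr_size >= 1  # assumption: acks=0/1 only needs a live leader
    return isr_size >= effective_min_isr(min_insync_replicas, replication_factor)


def misconfigured(min_insync_replicas, replication_factor):
    """Flag the case where the configured guarantee cannot actually be met."""
    return min_insync_replicas > replication_factor


durable_ok = accepts_write("all", isr_size=2, min_insync_replicas=2, replication_factor=3)
durable_stalled = accepts_write("all", isr_size=1, min_insync_replicas=2, replication_factor=3)
leader_only = accepts_write("1", isr_size=1, min_insync_replicas=2, replication_factor=3)
silently_capped = accepts_write("all", isr_size=1, min_insync_replicas=2, replication_factor=1)
```

The last case is the footgun: the write is accepted even though the topic can never hold two in-sync copies, which is why a check like `misconfigured(2, 1)` belongs in provisioning tooling.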

Tactic 13, Feature-flagging a distributed system

Principle. Upgrading a single process is easy; upgrading a fleet is not. During a rolling upgrade old and new code run simultaneously, and a new node that starts writing data or speaking a protocol the old nodes can't understand corrupts state or crashes peers. The safe pattern decouples "code is deployed" from "behaviour is enabled": ship the new code dormant, gate every new behaviour behind a cluster-wide version flag, and flip the flag, explicitly, once, only after every node is confirmed on the new code. The flag is the coordination point that makes a heterogeneous fleet safe.

In Kafka. KRaft makes this first-class: metadata.version (and the broader feature-level system) is a cluster-wide flag stored in the metadata log itself (12) [KIP-778; KIP-584]. Brokers and controllers gate new record types and behaviours on the active feature level; you deploy new binaries with the level unchanged, so they behave like the old version, and only after the whole cluster is upgraded does the operator run kafka-features.sh upgrade to bump it. That bump is an explicit, logged, cluster-wide transition every node observes by replaying the same metadata record. Because the flag lives in the log (Tactic 3), no separate consensus is needed to agree "what version are we"; it is just another committed record, and downgrade is gated to prevent enabling a feature the fleet can't uniformly support. This replaced the brittle ZooKeeper-era inter.broker.protocol.version static config.
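A toy version of "deploy dormant, flip once, agree through the log", assuming a shared in-memory list stands in for the metadata log. The Node class, the record shape, and the level numbers are all hypothetical; this does not reproduce what kafka-features.sh or the controller actually do.

```python
class Node:
    """Replays the shared log and gates behaviour on the committed feature level."""

    def __init__(self, name, max_supported_level):
        self.name = name
        self.max_supported_level = max_supported_level  # what the deployed binary can do
        self.active_level = 1                            # what the log says is enabled
        self.applied = 0                                 # log records replayed so far

    def replay(self, log):
        for record in log[self.applied:]:
            if record["type"] == "feature_level":
                self.active_level = record["level"]
        self.applied = len(log)

    def uses_new_record_format(self):
        # New behaviour stays dormant until the committed flag enables it.
        return self.active_level >= 2


def upgrade_feature_level(log, nodes, target_level):
    """Append the flag record only if every node's binary supports the target level."""
    laggards = [n.name for n in nodes if n.max_supported_level < target_level]
    if laggards:
        raise RuntimeError(f"cannot enable level {target_level}; not upgraded: {laggards}")
    log.append({"type": "feature_level", "level": target_level})


log = []
nodes = [Node("a", 2), Node("b", 1)]  # rolling upgrade in progress: only "a" is new
try:
    upgrade_feature_level(log, nodes, 2)
    refused = False
except RuntimeError:
    refused = True
dormant = [n.uses_new_record_format() for n in nodes]  # new code on "a" is inert

nodes[1].max_supported_level = 2  # "b" now runs the new binary too
upgrade_feature_level(log, nodes, 2)
for n in nodes:
    n.replay(log)
enabled = [n.uses_new_record_format() for n in nodes]
```

The half-upgraded fleet refuses the flip and keeps behaving like the old version; once both binaries are new, a single appended record enables the behaviour on every node that replays it.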

Beyond Kafka. "Deploy dark, enable behind a flag" is the backbone of safe continuous delivery: LaunchDarkly-style flags, database online schema migrations (expand/contract: add column dormant → dual-write → backfill → cut over → drop old, never a breaking change in one step), protocol negotiation (TLS, HTTP), and Kubernetes API graduation (alpha → beta → GA via feature gates). The reusable rule: where instances upgrade independently, every backward-incompatible change is a flag flipped after, never during, the rollout, with its state agreed cluster-wide. Putting the flag in the replicated log is the Kafka twist: the thing coordinating the upgrade is the same log being upgraded.

Tactic: separate deploy from enable; agree the flag in the log

The number-one cause of upgrade outages is a new behaviour that activates the moment new code starts, before peers can handle it. Feature-flagging structurally prevents this: the new code is provably inert until the flag flips, so a half-upgraded fleet is always valid and you roll back by redeploying without ever having changed behaviour. Kafka's refinement, storing the flag as a committed log record so the whole fleet observes the same value in the same order, removes the need for a side-channel agreement protocol. Directly transferable to any service mesh, schema migration, or wire-protocol evolution; see bp05.

Tactic 14, SPI / pluggability at the seams

Principle. A core system cannot anticipate every environment: every auth backend, storage tier, and placement policy. Hard-coding those decisions makes it rigid and makes forks inevitable. The alternative defines narrow interfaces (Service Provider Interfaces) at the points of variability and lets operators drop in implementations while the core stays small and policy-free. The art is choosing where the seams go: too few and it is inflexible, too many and it is a configuration swamp with no opinion.

In Kafka. Kafka exposes pluggability at exactly the seams where deployments genuinely differ:

| SPI | Seam (point of variability) | Part I |
| --- | --- | --- |
| Authorizer | authorization backend: the default StandardAuthorizer, or plug in OPA/Ranger/custom ACL stores | 18 |
| RemoteStorageManager / RemoteLogMetadataManager | where cold log segments live (S3, GCS, HDFS), without touching the log engine [KIP-405] | 05 |
| ConsumerPartitionAssignor | how partitions map to consumers: range, round-robin, sticky, cooperative-sticky, rack-aware, or bespoke | 17, 13 |
| Partitioner / serializers / interceptors | producer-side key→partition mapping, encoding, and cross-cutting hooks | 16 |

The discipline is visible in what is not pluggable: the log format, replication protocol, and consensus layer are fixed, because varying them fractures compatibility. Pluggability sits at the edges (auth, storage tier, placement) where environments differ, not the core where they must agree. Tiered storage is the strongest case: the whole feature is "define a RemoteStorageManager interface and let the cloud's object store implement it," spanning S3/GCS/HDFS without core changes.
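The shape of such a seam can be sketched in a few lines, assuming a deliberately tiny two-method interface. RemoteStore, InMemoryStore, and SegmentArchiver are hypothetical names for illustration and do not mirror the real RemoteStorageManager contract; the key layout is likewise an assumption.

```python
from typing import Protocol


class RemoteStore(Protocol):
    """Hypothetical narrow seam: the only operations the core needs from a cold tier."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """One pluggable implementation; an object-store client would be another."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


class SegmentArchiver:
    """Core logic: the key layout is fixed here and depends only on the interface."""

    def __init__(self, store: RemoteStore):
        self.store = store

    def _key(self, topic, base_offset):
        return f"{topic}/{base_offset:020d}.log"

    def archive(self, topic, base_offset, data):
        key = self._key(topic, base_offset)
        self.store.put(key, data)
        return key

    def fetch(self, topic, base_offset):
        return self.store.get(self._key(topic, base_offset))


archiver = SegmentArchiver(InMemoryStore())
key = archiver.archive("orders", 42, b"segment-bytes")
restored = archiver.fetch("orders", 42)
```

The policy (where bytes live) is swappable, while the part that must agree across installs (how a segment is named and looked up) stays in the core.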

Beyond Kafka. The SPI tactic appears wherever extensibility meets a stable core: the JDBC driver model, Kubernetes CRI/CNI/CSI, Envoy's filter chain, Java ServiceLoader, and OS device drivers. The reusable rule: identify the few axes along which deployments truly vary, define a narrow stable interface on each, and keep everything else opinionated and fixed. The failure modes are symmetric: too rigid (users fork to change one policy) or too pluggable (no defaults, every install a research project). Make the environment-specific parts pluggable and the correctness-critical parts fixed.

Seam placement, not seam count, is the design

Measuring extensibility by how many things are pluggable is backwards; a great SPI design is defined by where the seams are. A seam at a correctness-critical boundary yields incompatible clusters; a refused seam at a genuine point of variability yields forks (LinkedIn ran a custom branch for years partly because some seams did not yet exist [reference §7]). The transferable skill is telling policy boundaries (pluggable) from protocol boundaries (fixed).

The toolkit at a glance

The fourteen tactics, mapped from Kafka mechanism to transfer target. Read it as a checklist: when you hit the problem in the "answers" column, reach for the tactic.

| # | Tactic | Answers the problem of… | Kafka mechanism (Part I) | Applies elsewhere |
| --- | --- | --- | --- | --- |
| 1 | Mechanical sympathy | IO too slow; fighting the hardware | append-only log + page cache + zero-copy sendfile (03) | LSM engines, WALs, mmap, object-store-native systems |
| 2 | Batching | fixed per-op cost dominates | RecordBatch · batch.size/linger.ms · compressed block (16, 01) | group commit, bulk insert, Nagle, GPU kernels, io_uring |
| 3 | Log as coordination primitive | N bespoke ordered/durable stores | data + __consumer_offsets + __transaction_state + __cluster_metadata + __share_group_state (12) | etcd/Consul/ZK, Raft KV stores, CDC + outbox |
| 4 | Idempotence (id, epoch, seq) | retries create duplicates | idempotent producer: PID + epoch + sequence dedup (14) | TCP seq, Stripe idempotency keys, gRPC retries, CRDT deltas |
| 5 | Epoch-fencing | zombies / split-brain after failover | leader · producer · partition · member epochs (08, 13, 14) | fencing tokens, Raft/Paxos terms, zxid, K8s resourceVersion |
| 6 | Single-writer loop + MVCC | lock complexity; readers blocking writers | QuorumController event loop over TimelineHashMap snapshots (11) | Redis, LMAX Disruptor, actors, PostgreSQL MVCC |
| 7 | Optimistic concurrency (CAS) | locking wasteful when conflicts rare | AlterPartition epoch compare-and-set (08) | ETag/If-Match, @Version, DynamoDB conditional writes, Git refs |
| 8 | Immutable image + delta | torn reads; unbounded recovery; "what changed?" | MetadataImage + MetadataDelta + KRaft snapshots (12, 10) | Git, Redux, checkpoint+WAL, container layers |
| 9 | Purgatory + timing wheels | many ops waiting on event + deadline | DelayedOperation purgatory + hierarchical TimingWheel (06) | Linux kernel timers, Netty HashedWheelTimer, epoll/reactor |
| 10 | The watermark | summarise "safe vs pending" boundary | high watermark · LSO · committed offset (08, 14) | Flink event-time watermarks, commit LSN, TCP cumulative ACK |
| 11 | Pull-based backpressure | fast producer vs slow consumer | Fetch poll loop + bounded request queue (09, 06) | Reactive Streams request(n), gRPC flow control, TCP window |
| 12 | In-sync set as consistency dial | fixed consistency point too rigid | ISR + RF / min.insync.replicas / acks (08) | Dynamo R+W levels, Mongo write/readConcern |
| 13 | Feature-flag the fleet | rolling upgrade of heterogeneous nodes | metadata.version / feature levels in the metadata log (12) | LaunchDarkly, online schema migration, TLS/HTTP negotiation, K8s feature gates |
| 14 | SPI / pluggability | core can't anticipate every environment | Authorizer · RemoteStorageManager · assignors (18, 05, 17) | JDBC, CRI/CNI/CSI, Envoy filters, ServiceLoader |

How the tactics compose

The tactics are not a menu of independent choices; Kafka's power is in how they stack, and the same compositions recur in other well-built systems.

[Figure: composition diagram. Nodes: 1 · Mechanical sympathy (append-only log on disk); 3 · Coordination primitive (one log, many uses); 8 · Image + delta (fold onto immutable snapshot); 6 · Single-writer + MVCC (controller event loop); 12 · ISR dial + 10 · watermark; 13 · Feature flag (in the log); 5 · Epoch fencing + 4 · idempotence · 7 · CAS. Legend: storage / log foundation; state representation & evolution; execution model; consistency dial; correctness / anti-zombie guards; "builds on" dependency; cross-cutting guard / gate.]

The append-only log (1) becomes the coordination substrate (3) carrying deltas folded onto immutable images (8), published lock-free by a single-writer loop (6), with the ISR/watermark dial (12, 10) gating commits, feature flags (13) gating evolution, and epochs/idempotence/CAS (5, 4, 7) fencing every ownership and retry boundary.

Three couplings recur far beyond Kafka. Log (3) + image/delta (8) + single-writer (6) is the replicated state machine in full: an ordered log of deltas folded by a deterministic single writer onto a snapshot, with checkpoints to bound recovery. This triad is KRaft, etcd, CockroachDB ranges, and every Raft-backed store, and adopting one pulls in all three. Epoch (5) is the shared currency of 4, 7, 10, and 12: one monotonic counter fences zombies, powers optimistic CAS, dedups retries via the (id, epoch, seq) triple, and defines incarnations of the consistency dial, so "stamp ownership with a monotonic number" underwrites four tactics. Batching (2) + pull (11) + mechanical sympathy (1) are the throughput engine and the reason latency is the price: batching waits to fill, pull waits to poll, and both ride the sequential-IO path that rewards size over immediacy. That is the structural reason Kafka optimises throughput over latency (bp02).
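The first coupling (log + image/delta + checkpoint) fits in a short sketch, assuming deltas are plain dictionaries and None marks a deletion. These helper names and the record shapes are illustrative assumptions, not KRaft's MetadataImage/MetadataDelta classes.

```python
from types import MappingProxyType


def apply_delta(image, delta):
    """Fold one delta onto an image and return a new read-only image.
    The previous image is left untouched, so readers holding it see no torn state."""
    updated = dict(image)
    for key, value in delta.items():
        if value is None:
            updated.pop(key, None)  # None marks a deletion in this sketch
        else:
            updated[key] = value
    return MappingProxyType(updated)


def replay(log, snapshot=None, snapshot_offset=0):
    """Rebuild state from a snapshot plus only the log suffix after it."""
    image = snapshot if snapshot is not None else MappingProxyType({})
    for delta in log[snapshot_offset:]:
        image = apply_delta(image, delta)
    return image


log = [{"topic-a": 3}, {"topic-b": 1}, {"topic-a": None}, {"topic-c": 6}]

full = replay(log)                      # replay everything from offset 0
checkpoint = replay(log[:2])            # snapshot taken after two records
recovered = replay(log, checkpoint, 2)  # recovery reads only the suffix
```

Because the fold is deterministic, replaying from the checkpoint lands on the same image as replaying the whole log, which is what bounds recovery time.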

The meta-tactic: reuse one good mechanism many ways

The deepest lesson of the toolkit is not any single tactic; it is the discipline of reuse. Kafka builds one log and uses it five ways (3); defines one epoch idea and applies it at four layers (5); writes one timeline structure and gets both lock-free reads and rollback from it (6, 8). Each reuse is one mechanism to harden, test, and reason about instead of many. The highest-leverage design question is not "what new mechanism do I need" but "which mechanism I already have can I apply here." That is what turns a pile of features into an architecture, and it is the through-line to the architect's cheat-sheet.

Carry these as tools, not trivia. The next time you design a system that must coordinate, replicate, evolve safely, or survive a node coming back from the dead, reach into this dossier and ask which move fits, what it costs, and where it falls short. That judgement (tactic, tradeoff, limit) is the whole point of treating the log as a blueprint.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.