III · 00 · The Distributed Log as a Pattern
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.
Strip Kafka of its brand, its config keys, and its 700-page operator manual, and what remains is a single, surprisingly small idea: an append-only, per-shard totally-ordered, partitioned, replicated, offset-addressed, retention-bounded sequence of immutable records that consumers read at a position they own. That sentence is the whole pattern. Everything else in this book, the ISR, KRaft, tiered storage, exactly-once, is machinery in service of making that abstraction fast, durable, and operable at scale. This chapter abstracts the pattern away from the product. It defines the log's essence and its invariants; the universal problems a log solves that nothing else solves as cleanly (temporal decoupling, spatial fan-out, replay, ordering, durability, and collapsing integration sprawl); the sharp lines that separate a log from a queue and from a database; and why the same primitive keeps reappearing, in database redo logs, in Raft and Paxos, in event sourcing, even in blockchains. We ground each claim in Kafka's realization (cross-linked to the storage engine, replication, and partitioning), but the goal is the abstraction itself, a blueprint you can carry into systems that have nothing to do with Kafka.
The essence: seven words, each load-bearing
A definition is only useful if every word does work. Drop any one of these seven and you get a different, weaker primitive. Here is the anatomy of the distributed log, term by term, with the Kafka mechanism that realizes each and, more importantly, what each word buys you and what it costs.
| Property | What it means | Kafka realization | What it buys / what it costs |
|---|---|---|---|
| Append-only | Writes only ever add to the end; nothing in the middle is updated or moved. | The active segment takes appends at the log end offset (LEO); records are never rewritten in place (Storage Engine). | Buys sequential I/O, lock-free reads, and a stable history. Costs: no in-place update, mutation is modeled as a new record, and "current value" requires a fold or a compacted projection. |
| Per-shard total order | Within one shard, every record has a definite predecessor and successor; across shards, order is undefined. | Each partition is one totally-ordered log; there is no global order across partitions (Partitioning). | Buys deterministic replay and ordered processing per key. Costs: global ordering requires a single partition (sacrificing all parallelism), the central tension of the pattern. |
| Partitioned | The logical stream is split into N independent logs that can live on different machines. | A topic is N partitions; partition = murmur2(key) % N places keyed records deterministically (Producer). | Buys linear horizontal scale, appends on different shards need no coordination. Costs: N is near-immutable for keyed topics, and order is now only partition-local. |
| Replicated | Each shard is copied to several machines so the loss of one does not lose data. | Each partition has a leader and RF−1 followers; the in-sync replica set (ISR) tracks who is caught up (Replication). | Buys durability and availability across machine/disk/AZ failure. Costs: write amplification (RF× bytes), cross-AZ network, and a latency floor from waiting on replicas. |
| Offset-addressed | Each record's position is a stable, monotonic integer; that integer is its address forever. | A 64-bit offset assigned by the leader at append; immutable for the life of the record (Storage Engine). | Buys a cheap, durable cursor and random access by position, the "time machine." Costs: the address is positional, not content-based, no lookup by value. |
| Retention-bounded | The log is finite: old records age out by time, size, or are compacted to last-value-per-key. | Time/size retention or cleanup.policy=compact; tiered storage extends the bound to object stores (Storage Management, Tiered Storage). | Buys bounded cost and a tunable replay window. Costs: history is not infinite by default, "replay from the beginning" only works inside the retention horizon. |
| Consumer-owned position | The reader, not the broker, tracks where it is; many readers read independently at zero marginal broker cost. | Consumers commit offsets to __consumer_offsets; the broker does not track per-message delivery (Consumer, Group Coordination). | Buys fan-out and replay (rewind the cursor). Costs: the broker cannot do per-message acknowledgement, redelivery, or selective routing, that is a queue's job (and why share groups were added). |
Read the table as a ledger of forces, not features. Every property is simultaneously a strength and a constraint, and they are coupled: partitioning gives scale but breaks global order; replication gives durability but adds a latency floor; the consumer-owned cursor gives fan-out and replay but removes per-message delivery semantics. When you evaluate any log-shaped system, Kafka, Pulsar, Kinesis, a Raft library, your own event store, walk these seven properties and ask which the system keeps, which it relaxes, and what it pays for the relaxation. That walk is the whole of comparative architecture in miniature.
The invariants, the contract a log promises
Beneath the seven properties sit four invariants: statements the system guarantees to hold at all times, which downstream design is allowed to rely on absolutely. They are what makes the log composable, you can build a cache, an index, a join, or a ledger on top precisely because these never bend.
- I1 · Immutability
- Once a record is written and acknowledged, its bytes never change. A record is a fact that happened, not a mutable cell. Kafka enforces this physically: appended batches in a segment are never edited (Storage Engine). Corollary: there is no "update record 47", only "append record 92 that supersedes 47."
- I2 · Per-partition order
- If record A was appended before record B in the same partition, every reader sees A before B, always, on every replica, across every replay. This is the ordering invariant the leader-epoch machinery protects through leader changes (Replication). It does not extend across partitions, that non-guarantee is itself a load-bearing fact.
- I3 · Offset stability
- A record's offset is a permanent address. Reading offset 5,001 tomorrow returns the same record as today (until retention removes it). Crucially, compaction and retention remove records but never renumber survivors, a consumer's committed offset stays meaningful (Storage Management). This is what lets a cursor be a durable bookmark rather than a fragile pointer.
- I4 · Decoupling
- Producers and consumers share no clock and no liveness dependency. A producer can write while every consumer is offline; a consumer can read records written years (within retention) before it existed. They are decoupled in time (the log buffers) and in space (they never address each other, only the log).
Engineers often summarize the log as "append-only," but the deeper invariant is immutability of acknowledged records. Append-only is the mechanism; immutability is the contract. It is immutability that makes the log a valid system of record, if records could change after the fact, no derived view could trust its inputs, replay would be non-deterministic, and the offset could not be a stable address. Every powerful thing a log does (replay, CDC, event sourcing, audit) is downstream of "a written fact is final." When you design a log-shaped system, protect immutability the way a database protects durability: it is the invariant the rest of the architecture leans on.
The universal problems a log solves
A pattern earns the name by solving a recurring class of problems better than the alternatives. The distributed log solves six, and they recur in nearly every non-trivial data architecture. We take them in turn, and for each, name where it is the wrong tool, because an honest pattern catalog includes the boundaries.
1 · Temporal decoupling, a durable buffer that absorbs rate mismatches
Two systems rarely run at the same speed at the same instant. A burst of orders arrives faster than the fraud service can score them; a nightly batch dwarfs the daytime trickle; a downstream database is mid-failover. Without a buffer, the fast side must block on the slow side (back-pressure all the way to the user) or drop data. The log is a durable, ordered buffer: the producer writes at its own pace and returns; the consumer drains at its own pace, possibly hours later. The "buffer" is bounded only by retention and disk, not by RAM, Kafka can hold days of backlog on disk because reads are served from the page cache via zero-copy (Fetch Path), so a deep backlog does not degrade the write path the way an in-memory queue would.
A buffer that absorbs mismatches also hides them. Lag is silent until you measure it: a consumer falling behind looks healthy from the producer's side right up until the retention horizon eats un-consumed data (Failure Modes). And a deep historical read can evict the hot tail from the page cache, the "catch-up tax," where a back-filling consumer spikes producer p99 from ~2 ms to ~250 ms by forcing disk I/O (empirical: azguards; Pinterest; mitigated by tiered storage). Decoupling is not free; it relocates the failure from "drop on the floor immediately" to "lag, then mass-expire silently." Consumer-lag alerting is the price of admission.
2 · Spatial decoupling, fan-out to many readers at zero marginal broker cost
This is the property that most cleanly separates a log from a queue, and it is the quiet superpower of the pattern. Because the reader owns its cursor (I4) and the broker does not track per-message delivery, adding a consumer costs the broker almost nothing: a new consumer group is just another set of offsets reading the same immutable bytes. One payments topic can feed fraud scoring, the data warehouse, a real-time dashboard, an audit sink, and a search indexer, five independent readers, each at its own position, each immune to the others' speed or failure. In a point-to-point queue, fan-out means the producer must enumerate and write to every consumer (coupling), or the broker must copy the message per subscriber (cost that scales with subscribers).
The near-free fan-out applies to distinct consumer groups, each replaying the full stream. Adding a consumer within one group does not multiply reads, it splits the partitions, and is capped at one active consumer per partition (Partitioning). So the log fans out cheaply across use cases (groups) but parallelizes within a use case only up to N. Conflating the two is a common sizing error.
3 · Replay, the offset is a time machine
Because records are immutable (I1) and offsets are stable (I3), a consumer can move its cursor backward and reprocess history. This single capability dissolves whole categories of operational pain that are intractable in a fire-and-forget queue:
- Bug recovery. A consumer shipped a serialization bug and corrupted a day of derived state? Fix the code, reset the offset to before the bad window, reprocess. The source of truth was never mutated.
- New consumers join late. A team builds a search index six months after the topic launched; it reads from offset 0 (within retention) and bootstraps its entire state from history, no special backfill pipeline.
- Reprocessing with new logic. A better fraud model, a changed aggregation, run it over the retained log and rebuild the output. This is the basis of Kappa architecture: rather than maintain a separate batch path (Lambda), you reprocess by replaying the log into a fresh output and swapping (empirical: Kreps, "Questioning the Lambda Architecture").
Two honest limits. First, replay only reaches back to the log start offset, retention (or compaction) defines the horizon; "replay from the beginning of time" requires infinite retention, which is a deliberate, costly choice now made affordable, not free, by tiered storage. Second, the log rewinds but the world does not: reprocessing a payments stream re-emits "charge the card" unless your consumer is idempotent or writes to a fresh output you then swap in. Replay is a superpower for derived state; for external side effects it is a loaded gun. Design replayable consumers to be idempotent or output-swapping, never to re-fire effects.
4 · Ordering, deterministic, per key, for free
Many correctness properties reduce to "process these events in the order they happened": apply account debits before the closing balance read; honor "add to cart" before "checkout"; respect a CDC stream's insert-before-update. The log gives this per partition as invariant I2, and you get per-key ordering by partitioning on the key (all events for account 42 land in the same partition, in order). It is deterministic and costs nothing extra at write time. The honest caveat, which we return to throughout this part, is that this is per-partition, never global: across partitions there is no order. Global total order is achievable only by collapsing to a single partition, which surrenders the entire parallelism the pattern exists to provide.
5 · Durability, replicated, fsync-decoupled, tunable
The log persists every acknowledged record to multiple machines (replication, I4-adjacent). Kafka's durability is a tunable contract, not a fixed one, and understanding the dials is essential: acks, replication.factor, and min.insync.replicas together decide how many copies must accept a write before the producer is told "done" (Durability). The canonical safe configuration, RF=3, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false, makes data loss require multiple simultaneous failures (empirical: Conduktor; reference §4).
A subtle and frequently misstated point: Kafka acknowledges a write when the required replicas hold it in page cache, not when it is fsync'd to physical media. Durability comes from replication plus recovery, not per-message disk flush, by default there is no fsync on the produce path (Durability). This is a deliberate latency-over-disk-flush choice (forcing per-message fsync wrecks tail latency under load), and it is the central, valid critique in several competitor comparisons (empirical: Vanlightly; reference §3). The lesson is general: a "durable" log defines durability as a replication quorum, and you must know whether your system's "committed" means "on N machines' RAM" or "on N machines' disks." For most workloads the former is correct and faster; for a few it is not, and you reach for flush.ms/flush.messages knowing the cost.
6 · The integration backbone, collapsing O(N²) into O(N)
This is the problem the log was, in the end, invented to solve, and the one Jay Kreps put at the center of "The Log" (2013). An organization with N data systems that must share data point-to-point trends toward O(N²) bespoke pipelines: every source wired to every sink, each with its own format, schema drift, and failure modes. The log inverts this into a hub-and-spoke, O(N) topology: every source writes once to the log; every sink reads from the log. The log becomes the organization's shared, ordered, replayable system of record, the integration backbone. At LinkedIn this was the original motivation, and it scaled to "over 60 billion unique message writes per day" in 2013 (empirical: Kreps, "The Log") and to trillions per day a decade later (empirical: reference §7).
N × M integrations, each with its own format, schema, and failure modes. Adding one system means wiring it to all the others.Collapsing to a hub is not pure win. The log becomes a shared dependency every team relies on, its schemas become contracts (hence Schema Registry and the discipline around it), its outages are everyone's outages, and a poorly-chosen partition key or a hot topic now blasts the whole fleet (Cloudflare's worst recurring incident was exactly this, a client-library change funneling most traffic to one partition: empirical, reference §7). The O(N) topology trades many small, independent failure domains for one large, shared one. That is usually the right trade, but it is a trade, and it demands governance the point-to-point world did not.
Log vs queue vs database, three primitives, three jobs
The most common and most expensive architectural mistake around the log is using it as something it is not. The log sits between two older primitives, the message queue and the database, and borrows from both without being either. Drawing the lines precisely is the single most clarifying thing in this chapter.
| Dimension | Message queue (e.g. RabbitMQ) | Distributed log (Kafka) | Database (e.g. Postgres) |
|---|---|---|---|
| Primary record | A message in flight (transient) | An immutable event, retained | A mutable row (current state) |
| On read | Consumed & removed (destructive) | Read non-destructively at a cursor; re-readable | Queried; data stays |
| Who tracks position | The broker (per-message ack / redeliver) | The consumer (owns its offset) | N/A, random access by key |
| Addressing | None, next available message | By position (offset) | By value/key (indexed) |
| Ordering | Best-effort; weakens with redelivery | Strict per partition (I2) | Defined by query, not storage |
| Fan-out | Per-subscriber copy or competing consumers | Cheap, many groups, one copy | Many readers, but reads current state |
| Replay history | No, gone once acked | Yes, rewind the cursor (within retention) | No, only current state (history needs explicit modeling) |
| Per-message routing / TTL / priority | Yes, its core strength | No, topic-level only | N/A |
| Point lookup by key | No | No (only by offset; compaction approximates it) | Yes, its core strength |
| Best at | Work distribution, RPC-ish tasks, per-message control | Ordered history, fan-out, replay, integration backbone | Current-state queries, ad-hoc joins, multi-key transactions |
Read down the columns and the identities snap into focus. A queue is about work: each message is a task, consumed once, with the broker arbitrating delivery, which is exactly why a queue can offer per-message acknowledgement, redelivery, priority, TTL, and content routing. A database is about current state: rows are mutable, addressable by value, optimized for "what is true now" and ad-hoc queries. The log is about history: an ordered, immutable, replayable record of what happened, addressable by position, optimized for fan-out and reprocessing.
Kleppmann's reframe (2014) is the deepest insight here: a database already contains a log, its write-ahead/redo log, but treats mutable state as primary and the log as a hidden implementation detail. The log pattern turns the database inside out: it promotes the log of immutable events to the system of record, and demotes every queryable structure (indexes, caches, materialized views, the "current state" itself) to a derived, replayable projection of that log. Kreps' State Machine Replication Principle is the formal backbone: "if two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state", so consistency reduces to building one consistent log and feeding it to deterministic consumers (empirical: Kreps, "The Log"; Kleppmann, "Turning the Database Inside Out"). This is why stream-table duality holds: a table is the aggregation of a changelog stream, and a stream is the changelog of a table (empirical: Sax, BIRTE@VLDB 2018), the basis of KStream/KTable and of CDC. The log is not "a queue with retention"; it is the more primitive thing a database is built on.
Most log misuse is forcing it into a neighboring primitive's job. Log-as-database: there are no efficient key lookups, no ad-hoc SQL, no multi-key ACID, Kafka replicates into specialized stores, it does not replace them; "I don't think Kafka really benefits from random-access lookups directly against the log" (Kreps). Log-as-RPC: the log optimizes throughput over latency and offers no native request-reply, building synchronous request/response over it "returns the coupling we were trying to avoid" (Waehner). Log-as-task-queue: per-partition order plus one consumer per partition means throughput "collapses to the speed of the slowest message", one poison pill head-of-line-blocks every later record, and parallelism is capped at N (empirical: reference §8). The full decision framework, and the mitigations (share groups for the queue case, the Parallel Consumer, materializing into a DB for the query case), is the next chapter.
Why it is a pattern, the same primitive, everywhere
A solution is a pattern when the same structure recurs across independent problems and independent designers keep reinventing it. The append-only, totally-ordered log is one of the most reinvented structures in all of computing, it predates Kafka by decades and appears far outside messaging. Recognizing it as a pattern is what lets you transfer Kafka's hard-won lessons to systems that look nothing like Kafka, and vice versa.
fold over the log from offset 0, replay as a first-class feature (empirical: Fowler; Kleppmann).The variation across these is instructive, because it maps directly to the dials you tune in any log system:
| Incarnation | What a "record" is | Replication / agreement model | Who consumes, and how |
|---|---|---|---|
| DB WAL / redo log | A physical or logical change (changed bytes, or the SQL command) | Local durability + (optionally) primary→replica streaming | The recovery/apply process; replicas, internal to the DB |
| Raft / Paxos log | A state-machine command (e.g. "set leader of partition 7") | Quorum agreement (majority); crash-fault-tolerant | Every replica's state machine, applied in committed order |
| Kafka partition | A produced record/event (opaque bytes + key + headers) | Leader + ISR; tunable quorum via acks/min.insync.replicas (Replication) | External consumers at self-owned offsets; many independent groups |
| Event store | A domain event ("OrderPlaced") | Whatever the store provides (often a Kafka/DB-backed log) | Projections/aggregates that fold the log into read models |
| Blockchain | A signed transaction, hash-chained to its predecessor | BFT consensus + cryptographic linkage; adversary-tolerant | All nodes; validate-and-apply in agreed order |
When you see append-only + ordered + immutable + (often) replicated + replay-from-a-position, you are looking at the log pattern, whatever the domain calls it. This recognition is worth real money: it means Kafka's solutions to log problems, leader epochs to prevent divergence on leadership change (Replication), sparse indexes for cheap positional lookup (Storage Engine), compaction to derive a table from a stream, tiered storage to unbound retention (Tiered Storage), are candidate solutions for your log, and the limits of your Raft library or event store illuminate the limits of Kafka. Part III's tactics toolkit harvests exactly these transferable techniques; inherent limits harvests the transferable constraints.
Inherent vs tunable, the distinction that runs through this whole part
The single most valuable analytical habit for a log architect is to separate, for every limitation, what is inherent (structural, a consequence of the pattern itself, true of Pulsar and Kinesis and your hand-rolled event store too) from what is merely tunable (a config) or mitigated (a feature). Marketing blurs this line in both directions; good architecture depends on holding it sharp. Here is the keystone version of that distinction, each row is expanded in later chapters.
| Concern | Status | Why, structural cause or the dial/feature |
|---|---|---|
| No global order across shards | Inherent | Partitioning is the scale mechanism; independent appends cannot share a total order without coordination. Global order ⇒ N=1 ⇒ no parallelism. True of every partitioned log (Inherent Limits). |
| No per-message lookup by value | Inherent | The address is positional (offset), not content-based. Point queries require a derived index/store. Compaction approximates last-value-per-key but is not a lookup. |
| No native per-message routing / TTL / priority | Inherent | The consumer-owned cursor and shared single copy preclude per-message broker decisions, that is a queue's model, structurally distinct (Comparative). |
| A real produce latency floor | Inherent (magnitude tunable) | Waiting on a replication quorum has an irreducible floor; you tune how high via acks, placement, and fsync policy, but cannot remove it while keeping durability (Performance Tuning). |
| Cross-AZ replication cost | Inherent (mitigated) | RF copies must cross failure domains to be durable. Fetch-from-follower and diskless/object-store designs reduce it; the produce+replication floor is structural (Cost). |
| Partition-count ceiling | Mostly tunable now | The ZooKeeper-era O(partitions) controller failover was structural; KIP-500 KRaft moved metadata to an in-memory replicated log, raising the ceiling ~100×, but per-broker FDs, mmap regions, and rebalance time still bound it (Partitioning, Limits). |
| Durability = "in N page caches," not fsync'd | Tunable | A default, not a law. flush.ms/flush.messages force fsync at a latency cost; the default trades disk-flush for speed and relies on replication+recovery (Durability). |
| "Stop-the-world" rebalances | Mitigated → largely fixed | Historically structural pain; KIP-429 (incremental cooperative) and KIP-848 (broker-side assignment, GA in 4.0) removed the global sync barrier (Group Coordination). |
When a vendor says "Kafka can't do X," ask the second question: is X inherent to the log pattern (then their product can't really do it either, unless they abandoned the pattern) or is it a Kafka default/feature gap (then it is a config, a KIP, or a roadmap item)? Redpanda's "Kafka is unsafe because it doesn't fsync per message" conflates a tunable default with a structural flaw, and is false, because durability comes from replication (empirical: Vanlightly; reference §3, §9). WarpStream's "Kafka cross-AZ cost is huge" is real and partly inherent, but it attacks it by leaving the local-disk part of the pattern for object storage, trading latency for cost (empirical: reference §6, §9). The inherent-vs-tunable lens turns marketing into engineering. It is the through-line of design decisions, inherent limits, and comparative architecture.
A first decision: is your problem log-shaped at all?
We close the foundation chapter with the most basic question, posed as a decision tree. This is deliberately coarse, the full, force-weighted framework (with the anti-patterns and their mitigations) is the next chapter, but it captures the first cut: does your problem have the shape the log is good at? The shape is: history matters, fan-out or replay is valuable, and per-key ordering is enough. If instead you need current-state queries, per-message control, or synchronous replies, you are looking at a database, a queue, or RPC.
(log is the wrong primitive, no point lookup)
rethink the ordering requirement
(durable, ordered, replayable, fan-out)
You now hold the pattern in the abstract: its seven-word essence, its four invariants, the six universal problems it solves (and where each bites back), the queue/log/database trichotomy, the recognition that the same primitive underlies WALs, consensus, event sourcing, and ledgers, and the inherent-vs-tunable lens that keeps the rest of this part honest. From here, III·01 turns the coarse tree above into a full decision framework with the anti-patterns; III·02 dissects each Kafka choice as a tradeoff (pull vs push, ISR vs quorum, page cache vs managed memory); III·03 presses on where the pattern structurally fails; III·04 extracts the reusable tactics; and III·06 tests every tradeoff against Pulsar, Redpanda, Kinesis, Pub/Sub, RabbitMQ, and the diskless designs. Treat Kafka, throughout, as the best-documented instance of a pattern far older and far more general than itself.