III · 01 · When to Use the Log, and When Not To
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.
The distributed commit log is the most over-applied data structure of the last decade. It is also one of the most powerful, when the problem actually has the shape it solves. This chapter is a decision framework, not a sales pitch. bp00 defined the pattern (an append-only, totally-ordered-per-shard, replicated sequence of immutable records that is the system of record, with all queryable state derived as a replayable projection); here we draw the boundary around it. We name the forces that pull toward a log, sustained throughput, replay, fan-out, durable buffering, per-key ordering, decoupling, audit, and the forces that pull against it, request/response latency, per-message routing and priority, per-message expiry, multi-entity atomicity, tiny scale, and random access. Then we dissect the three canonical anti-patterns (Kafka-as-database, Kafka-as-RPC, Kafka-as-task-queue), each pinned to the concrete mechanism that makes it fail, and we are honest about what share groups (KIP-932) do and do not change. Throughout, the discipline is the same one Part II demanded of operators: separate what is inherent to the log from what is tunable by a config or mitigated by a feature, because the wrong fix for an inherent limit is the most expensive mistake an architect can make.
The shape of the problem the log solves
Before any force table, fix the mental image, because every "yes" and "no" below derives from it. A Kafka topic-partition is a single, ordered, immutable file-of-record-batches replicated across brokers (03, 08). Writers only ever append to the tail. Readers declare an offset and stream forward; the broker is a dumb, sequential pipe that hands back bytes with zero-copy sendfile and parks the request in a purgatory when there is nothing new to send (09). The broker keeps no per-message, per-consumer state on the classic path, a consumer group's entire progress is one committed offset per partition (13). Retention is a topic-wide sweep by time or size, not a per-record lifecycle (04). Everything good and everything bad about Kafka falls out of those five facts.
Kafka deliberately pushed delivery state out of the broker and onto the consumer's offset. A "smart broker / dumb consumer" design (RabbitMQ) tracks per-message acknowledgement, redelivery, priority, and TTL inside the broker, rich semantics, but the broker now holds mutable per-message state that caps throughput and complicates replication. Kafka inverted it: the broker maintains almost nothing per message, so a partition is just a file you append to and stream from, and the OS page cache plus sendfile do the heavy lifting. The reference's comparison is blunt, Kafka peaked at ~605 MB/s in Confluent's OpenMessaging benchmark versus RabbitMQ's ~38 MB/s, which degrades past ~30 MB/s (bp06; Confluent OMB, vendor benchmark, directional). That throughput is bought with the absences in red above. You do not get to keep both halves.
The forces FOR the log
Seven forces favor a log. They are not independent, they reinforce, and the more of them a workload exhibits, the more decisively the log is the right substrate. Each traces to a concrete mechanism in Part I.
| Force | Why the log is strong here | Mechanism (Part I) |
|---|---|---|
| High sustained throughput | Append is the cheapest possible write; partitions shard the write across brokers with no cross-shard coordination, so throughput scales horizontally. Reads are sequential and served zero-copy. A single partition absorbs tens of MB/s; a cluster reaches multi-GB/s. | Append-only segment engine + per-partition independence (03); zero-copy fetch (09); partition sharding (op03) |
| Replay / reprocess | Because records are immutable and retained, any consumer can reset its offset and re-read history, to rebuild a corrupted projection, backfill a new index, or run a new model over old events. The data is not consumed-and-gone; reading does not mutate the log. | Immutable segments + offset-addressed reads (03); offset is consumer-owned (17) |
| Multiple independent consumers (fan-out) | Each consumer group has its own cursor; adding a tenth reader costs the leader nothing but extra sequential reads (often still from page cache). One write feeds analytics, search indexing, audit, and a cache-warmer simultaneously, the canonical "data integration backbone." | Per-group offsets, no broker-side per-consumer message state (13); shared zero-copy reads (09) |
| Durable buffering & backpressure absorption | The log is a shock absorber between a fast producer and a slow or bursty consumer. Retention is the buffer; a consumer that falls behind catches up later instead of dropping data or back-pressuring the producer. Spikes become lag, not loss. | Disk-resident retention decoupled from consumption (04); pull-based consumers set their own pace (09) |
| Per-key ordering | All records with the same key land in the same partition and are delivered in append order. This gives a total order per entity (per user, per account, per device), exactly the guarantee state machines and changelogs need, without the cost of a single global order. | Key → partition hashing (16); partition is a total order (03) |
| Event-driven decoupling | Producers know nothing about consumers and vice versa. Teams ship and evolve independently; a new consumer joins by subscribing, with no change to the producer. The topic is the contract, the schema is the API. This is the architectural payoff, not a performance one. | Publish/subscribe over a shared log; no producer-consumer handshake (bp00) |
| Auditable source of truth | An append-only, ordered, immutable record of what happened is a natural audit log and the basis of event sourcing: state is a fold over the events, and the events themselves are never mutated. Compliance, debugging, and reconstruction all read the same history. | Immutability + ordering (03); compaction retains latest-per-key for changelogs (04) |
When evaluating any substrate for an event pipeline, score the workload against these seven. Two forces are near-decisive because nothing else does them as cheaply: replay (immutable retained history you can re-read) and fan-out (N independent consumers from one write). If a workload needs both, the log wins almost regardless of the others. If it needs neither, a single consumer that processes each message once and never replays, you are likely holding a queue problem, and the forces-against table below probably applies. This force-counting is itself the transferable artifact: it works for Pulsar, Kinesis, or a home-grown log just as well as for Kafka.
The forces AGAINST the log
Six forces push the other way. Crucially, each is a structural consequence of the same five facts that produced the strengths, which is why most cannot be tuned away, only mitigated or worked around. The table marks, for each, whether the limit is inherent (you cannot remove it without ceasing to be a log), mitigated (a feature softens it but does not erase it), or tunable (a config trades it against something else).
| Force against | The mechanism that causes it | Inherent / mitigated / tunable |
|---|---|---|
| Low-latency request/response | A durable produce waits for replication to min.insync.replicas before it is acknowledged (08); consumers pull and the broker batches reads behind fetch.min.bytes/fetch.max.wait.ms (09). Both add latency floor. The log optimizes throughput over latency by construction. | Inherent floor, partly tunable. You can push p99 e2e to single-digit ms with tuning (a tier-1 bank held sub-5 ms p99 at 1.6M msg/s; Confluent), but you cannot get below the replication + poll floor without giving up durability (acks=1), and you can never make it a synchronous round-trip. |
| Per-message routing / priority / selective consume | A partition is a sequence; a consumer reads it in order or not at all. The broker has no content filter, no priority lanes, no "give me only messages matching X." The classic consumer reads every record at its offset (17). | Inherent; partly mitigated by share groups. Share groups (KIP-932) add per-record acquire/ack and out-of-order completion, but still no priority and no content routing. RabbitMQ-style routing remains absent by design (bp06). |
| Per-message TTL / expiry | Retention is a whole-segment sweep by retention.ms/retention.bytes at the topic level (04). There is no per-record clock; record 5 and record 5,000,000 in a segment expire together when the segment ages out. | Inherent. No config makes one record expire before its segment-mates. The only granular delete is a compaction tombstone on a keyed/compacted topic, and even that is eventual and per-key, not a TTL (04). |
| Multi-entity transactional updates | Kafka transactions give atomicity for a read-process-write within Kafka (atomic writes across partitions + offset commit) via the coordinator and markers (14). They are not a 2PC across Kafka and an external database, and not a general multi-row update. | Inherent boundary. Mitigated only at the edges: the outbox/CDC pattern gets you at-least-once + eventual consistency across DB and Kafka (bp04), never cross-system exactly-once. |
| Tiny scale | A 3-broker KRaft cluster, partitions, RF, ISR, offsets, monitoring, and the operational model (op00) are fixed costs whether you push 10 msg/s or 10M. The machinery does not shrink to a hobby workload. | Inherent overhead, partly mitigated. Managed/serverless offerings amortize the ops cost; Kinesis/Pub-Sub remove it entirely for low volume (bp06). But the conceptual surface area is still large for a problem a single database table would solve. |
| Point queries / random access | Offsets are positions in a sequence, not keys in an index. "Fetch the current value for user 42" requires scanning, because the log has no key→offset map for arbitrary lookups (03). It is a write-optimized sequence, not a read-optimized index. | Inherent. The pattern's own answer is to not query the log: materialize a projection into a store built for lookups (Streams state stores, a KV DB) and query that (bp00). The log feeds the index; it is not the index. |
If your dominant access pattern is lookup-by-id with low latency, mutate-in-place, ad-hoc query, and per-row lifecycle, that is a database, and no amount of Kafka tuning will make a log into one. If your dominant pattern is synchronous request → compute → reply under tens of milliseconds, that is RPC/HTTP/gRPC, and routing it through a log re-introduces the coupling and latency you were trying to remove. If your pattern is a handful of events per minute that one service handles once, the operational surface of a cluster dwarfs the problem; use a managed queue or a DB-backed job table. The log is a specialist. Reach for it when the seven forces light up, not because it is the default streaming brand.
The decision tree
The framework collapses into a single traversal. Walk it top-down for any candidate workload; the first hard "no" routes you off the log. The tree encodes the central asymmetry: the forces against are mostly inherent, so a workload that trips one of the early gates will not be rescued by tuning, whereas a workload that clears them benefits more from the log the more of the later forces it accumulates.
Anti-pattern 1, Kafka as a database
The single most common misuse. It is seductive because Kafka persists data, durably, checksummed, replicated, on disk, so it looks like a store, and a compacted topic that retains the latest value per key looks like a table. But "persists data" and "is a database" are different claims, and the gap is exactly the set of mechanisms a database has that a log does not.
| You want a database for… | The log's mechanism instead | Concrete failure mode |
|---|---|---|
| Read a row by primary key, fast | Offsets address positions, not keys. There is no key→offset index for arbitrary lookups (03). | "Get user 42's current address" forces a scan from offset 0 (or from the last compaction point). O(N) over the topic. There is no point read. |
| Update a field in place | The log is append-only and immutable. "Update" = append a new record; old records remain (03). | To know the current value you must fold the whole key's history. Two readers at different offsets see different "current" values, there is no single mutable cell. |
| Keep data until I delete it | Retention is a topic-wide time/size sweep (04). On a non-compacted topic, old segments are deleted on a clock you set at the topic level. | Retention silently deletes your "database." Set retention.ms=7d and your system of record is gone after a week. Operators have lost data this way by treating a delete-policy topic as permanent storage. |
| Run an ad-hoc / multi-key query | No query engine. Compaction (cleanup.policy=compact) keeps the latest per key, but that is a retention rule, not an index or a WHERE clause (04). | "Sum balances where region=EU" cannot run against the log. Compaction does not give you secondary indexes, joins, or aggregation, you would scan everything in the application. |
| Multi-row ACID transaction | Transactions are atomic Kafka writes across partitions, not multi-key updates to a queryable dataset (14). | "Debit A, credit B, atomically, then query the result" is not what Kafka transactions do; and a hanging transaction pins the LSO and makes all later records on the partition invisible to read_committed readers (op07). |
The most sophisticated version of this anti-pattern is "use a compacted topic as a KV store." Compaction retains at least the last value per key, useful for a changelog you replay to rebuild state (Streams does exactly this), but the reference is explicit that it does not guarantee a single record per key at any instant (multiple values and tombstones can coexist until the cleaner runs; timing is non-deterministic). It gives you retention, not random access: you still read it sequentially to materialize a map. A compacted topic is a durable, replayable input to a KV store. It is not the KV store. The pattern's prescription is unambiguous, Kreps: "I don't think Kafka really benefits from trying to add any kind of random access lookups directly against the log"; Kafka replicates into specialized systems, it does not replace them.
The correct shape when you are tempted to query Kafka: keep the log as the write model / system of record, and materialize a read model into a store built for the query, a KV DB for point lookups, a search index for text, an OLAP store for aggregation, a Streams state store for stream joins (bp00, 20). The projection is derived, disposable, and rebuildable by replaying the log. You get the log's throughput/fan-out/audit and the database's query power, because you stopped asking one component to be both. This is the inside-out database, and it is the whole point of the pattern.
Anti-pattern 2, Kafka as RPC / request-reply
The second misuse routes a synchronous call through the log: service A writes a "request" record, service B consumes it, processes, and writes a "response" record to a reply topic that A consumes, correlating on an id. It works in a demo and rots in production, for reasons that are structural, not incidental.
- Latency floor, twice. Each leg pays the durable-produce floor (replication to
min.insync.replicas, 08) plus the consumer poll floor (fetch.max.wait.ms, 09). A round-trip is two legs, so the inherent latency is doubled, and the log is optimized for throughput, not the per-message latency RPC needs. The reference is blunt: Kafka is pull/poll-based, not push, which is part of why per-message latency exceeds push brokers at low load, a deliberate throughput-over-latency choice. - No correlation or addressing in the substrate. The log has no request id, no reply-to, no "this message is for caller X." You build correlation in the application and demultiplex responses yourself, re-implementing, badly, what an RPC framework gives you for free.
- Re-coupling. The whole appeal of the log was decoupling producers from consumers. Request-reply re-introduces the synchronous dependency, A now blocks on B's liveness and latency. As the reference puts it, request-reply over Kafka "returns the coupling we were trying to avoid."
- Reply-topic sprawl. Per-caller reply topics multiply partition count (a cost, op02) or you share a reply topic and every caller filters for its own correlation ids, scanning past everyone else's responses, because, again, the log cannot route.
Request-reply over Kafka is occasionally justified when the "reply" is not latency-sensitive and you specifically want the request and response persisted, auditable, and replayable, e.g., a long-running async job whose submission and result you want on the same durable backbone as everything else, with the log providing buffering and at-least-once retry. Even then, prefer an explicit async-job design (correlation id, idempotent handler, a result store you query) over pretending the log is a synchronous channel. If the caller is willing to block for an answer in real time, it is RPC; use RPC.
Anti-pattern 3, Kafka as a priority / task queue
The third misuse treats a topic as a work queue: producers enqueue tasks, a pool of workers dequeues and executes, with per-task acknowledgement, retry of just the failed task, priority lanes, and a visibility timeout. The classic consumer group cannot do this, and the reasons are mechanical.
| Task-queue feature | Classic consumer-group reality | Mechanism (Part I) |
|---|---|---|
| Ack/retry a single message | Progress is one committed offset per partition, not per message. You acknowledge a position, not a record. To "retry one task" you rewind the offset and re-process everything after it. | Offset commit model (13, 17) |
| Visibility timeout (lease a task, auto-redeliver on crash) | None. A consumer holds an offset range implicitly; if it dies mid-batch, redelivery happens only via rebalance + offset rewind, re-delivering the whole uncommitted range. | No per-message lease on the classic path (17) |
| Priority lanes | A partition is strict FIFO; there is no priority. You cannot make message B jump ahead of A in the same partition. | Partition is a total order (03) |
| Parallelism beyond partition count | At most one consumer per partition per group. Want 200 workers? You need ≥ 200 partitions, with all the cost that implies. | 1:1 partition→consumer in a group (13, op03) |
| Isolate a slow/poison task | Head-of-line blocking. Because a partition is consumed in order by one member, one slow or failing record stalls every record behind it. The reference: "partition throughput collapses to the speed of the slowest message"; a poison pill stalls every later message. | In-order single-consumer-per-partition (13) |
What share groups KIP-932 change, and what they don't
This is where the chapter must be precise, because share groups are exactly the feature that converts some of this anti-pattern into a supported pattern, and conflating "some" with "all" is the new mistake. A share group lets many consumers cooperatively consume the same partition; the partition leader holds per-record in-flight state in a SharePartition, and each record moves through an explicit state machine: AVAILABLE → ACQUIRED → (ACKNOWLEDGED | back to AVAILABLE | ARCHIVED). What that buys, and what it still cannot do:
| Classic task-queue gap | Share groups KIP-932 | Verdict |
|---|---|---|
| Per-message ack | Solved. A consumer ACCEPTs/RELEASEs/REJECTs individual records; state is per-record, persisted in __share_group_state (15). | Fixed |
| Visibility timeout | Solved. The acquisition lock is a time-bounded lease (share.record.lock.duration.ms); on expiry the record returns to AVAILABLE for redelivery, exactly a visibility timeout (15). | Fixed |
| Parallelism > partition count | Solved. Many members consume one partition cooperatively; you are no longer capped at one consumer per partition (15). | Fixed |
| Isolate a poison message (head-of-line) | Largely solved. Records complete out of order, and one stuck record no longer blocks the rest; a bounded delivery.count.limit then archives it, optionally to a DLQ KIP-1191 for inspection (15). | Fixed |
| Priority lanes | Not solved. No priority within or across records; acquisition still walks the log forward. A high-priority task cannot jump the queue. | Still absent |
| Content-based routing / selective consume | Not solved. No server-side filtering or routing; consumers still acquire whatever the share-partition hands out. | Still absent |
| Exactly-once consumption | Not solved. Share groups are at-least-once; redelivery is expected. Consumers must be idempotent. EOS is a different mechanism (14). | Still absent |
| Strict per-key order | Given up by design. Cooperative out-of-order consumption means per-key/partition order is not preserved, the opposite trade from a classic consumer (15). | Traded away |
The honest summary: KIP-932 makes Kafka a competent at-least-once work queue with acks, visibility timeout, bounded retry, and a DLQ, genuinely closing the historical "Kafka can't do queues" critique for the common case. But three classic broker features remain structurally absent because they conflict with the log: priority (a partition is FIFO), content-based routing (the broker is a dumb pipe), and strict per-key order under sharing (cooperative consumption gives it up). If your task queue genuinely needs priority lanes or routing, that requirement does not trace to a missing Kafka feature, it traces to the log structure itself, and you want a routing broker (bp06). Note also that share groups are a preview-maturing feature (preview in 4.1, maturing through 4.2; full in this 4.4 tree); weigh that operational maturity for production-critical queues.
Inherent vs. tunable, the architect's litmus test
The through-line of this chapter, and of all of Part III, is the discipline of asking "is this limit structural or configurable?" before reacting to it. Misclassifying it is how teams either abandon Kafka over a limit they could have tuned, or waste a quarter trying to tune away a limit that is the log itself. The table below classifies the limits this chapter raised; carry the question, not just the answers.
| Apparent limit | Class | What actually moves it |
|---|---|---|
| Per-message latency floor | Inherent floor, tunable margin | Tuning (batching off, fetch.min.bytes=1, fast disks/network, acks=1) shrinks it to single-digit ms; nothing makes a replicated append a synchronous round-trip (op05). |
| No per-message ack / visibility timeout | Mitigated by feature | Share groups (KIP-932) add both; classic consumer groups never will (15). |
| Head-of-line blocking on a partition | Mitigated by feature | Share groups consume out of order; classic groups block by design (15). |
| No priority / no content routing | Inherent | Nothing in Kafka, including share groups, a partition is FIFO and the broker does not filter. Use a routing broker (bp06). |
| No per-message TTL | Inherent | Retention is topic-wide by time/size; tombstones on compacted topics are eventual per-key, not a TTL (04). |
| No random read / point query | Inherent | Materialize a projection into a query store; the log feeds the index, it is not the index (bp00). |
| Retention deletes "stored" data | Tunable / mitigated | retention.ms=-1 + compaction, or tiered storage KIP-405 for cheap long retention, but a non-compacted delete-policy topic is still not a database (05). |
| Cross-system exactly-once | Inherent boundary | Kafka EOS is within-Kafka read-process-write; outbox/CDC gets at-least-once + eventual across an external DB, never 2PC (14, bp04). |
| Fixed operational cost at tiny scale | Mitigated by deployment | Managed/serverless amortizes ops; the conceptual surface remains. A small problem may want a small tool (bp06). |
| Global (cross-partition) ordering | Inherent (vs. throughput) | Total order forces a single partition, sacrificing throughput; design per-key order instead (op03). |
For any "Kafka can't do X" objection, ask three things in order: (1) Is X absent because of the log structure (append-only, per-partition order, offset-not-index, topic-wide retention, dumb broker)? If yes, it is inherent, design around it, do not fight it. (2) If not structural, is there a feature that supplies X at a stated cost (share groups for queue semantics, tiered storage for retention, transactions for atomic Kafka writes)? Then it is mitigated, adopt the feature and pay its price knowingly. (3) Otherwise it is a config trading X against throughput/latency/durability, it is tunable, so pick your point on the tradeoff. This three-step is the single most transferable habit in this part; it applies unchanged to any log-shaped system you evaluate.
Worked judgements
The framework is only useful if it produces clear calls on real workloads. A few, with the deciding force named:
- Clickstream / telemetry ingest feeding analytics, search, and ML
- Log, decisively. High sustained throughput, fan-out to ≥3 independent consumers, replay to backfill new models. All seven forces fire; zero gates trip. The textbook fit.
- Order-state changes that must stay ordered per order and feed several services
- Log. Per-key (per-order) ordering + fan-out + audit. Partition by order id; each order is a total order. Materialize current-order-state into a KV store for the lookup-by-id queries, do not query the log.
- "What is account 42's balance right now?" serving a UI at p99 < 20 ms
- Database (fed by the log). Point lookup + low latency = the store gate trips immediately. Keep balance-change events on a log, but serve reads from a materialized projection (bp00).
- Synchronous "validate this payment, reply yes/no in 50 ms"
- RPC/gRPC. The reply gate trips: doubled latency floor + no correlation + re-coupling. A reply topic is the anti-pattern.
- Background job processing: thousands of independent tasks, per-task retry, isolate poison tasks, scale workers freely
- Share groups if you can live without priority/routing and accept at-least-once; otherwise a routing broker. The queue gate trips, but KIP-932 now answers most of it, check the line-by-line table above against your needs.
- Tasks with strict priority lanes and content-based routing to specific workers
- RabbitMQ-style broker. Priority + routing are inherent absences in the log, share groups included. This requirement traces to the log structure, not a missing config (bp06).
- A handful of webhook events per minute, one consumer, no replay
- A managed queue or a DB job table. No replay, no fan-out, tiny scale, the fixed operational surface of a cluster is unjustified. The log works, but it is not the simplest correct tool.
The log is a specialist optimized for ordered, replayable, high-throughput, fanned-out streams that you decouple producers and consumers around. Its strengths and its absences are the same design decision. Reach for it when the seven forces light up; route off it the moment an early gate (store, RPC, classic-queue-with-priority/routing) trips, because those gates guard inherent limits no tuning will move. And when in doubt, apply the litmus question: structural, feature-mitigated, or config-tunable. Next, bp02 opens the design-decision space inside the "use the log" branch, partition count, keys, RF, ordering, and the rest, where the interesting tradeoffs begin.