III · 03 · Inherent Limitations
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.
Every architecture buys its strengths with a matching set of weaknesses, and the honest way to evaluate one is to study where it breaks by design. The distributed, partitioned, replicated commit log is exceptional at exactly one shape of problem, high-throughput, ordered, replayable, fan-out data movement, and it is exceptional precisely because it refuses to be a database, a router, a queue, or an RPC bus. This chapter is the catalogue of that refusal. For each limitation we ask the only question that matters to an architect: is this inherent (structural, it follows from the log abstraction itself and no feature can remove it), mitigated (a feature softens it but the underlying force remains), or tunable (a config trades it against something else)? Confusing these three categories is the single most expensive mistake teams make with Kafka, they tune what is structural, or assume structural what is merely a default. Get the taxonomy right and you will know, before you write a line of code, which of these walls you can move and which you must design around.
The discipline: inherent vs. mitigated vs. tunable
The log's power comes from a handful of radical simplifications: the broker is a dumb, append-only, sequential store; ordering is defined only within a partition; routing is reduced to a hash of a key; and almost all intelligence is pushed to the clients. Each simplification is the source of both a strength documented elsewhere in this book and a limitation documented here. The two are inseparable, you cannot keep the throughput and shed the constraint, because they are the same decision viewed from two sides. Before the catalogue, fix the vocabulary, because we will tag every limitation with one of these three labels.
| Class | Definition | What an architect does about it | Example in this chapter |
|---|---|---|---|
| Inherent | Follows from the log abstraction or the single-leader-per-partition model. No configuration or feature removes it; only a different architecture does. | Design around it. Choose the log only where the constraint is acceptable; otherwise pick or compose a different system. | Per-partition-only ordering; read-everything-and-filter; latency floor > sub-ms. |
| Mitigated | The structural force still exists, but a feature reduces its blast radius or shifts where it bites. The ceiling moves; it does not vanish. | Adopt the mitigation, but plan for the residual. Know what the feature does not fix. | Head-of-line blocking (share groups); rebalance stalls (KIP-848); partition ceiling (KRaft). |
| Tunable | A config knob trades the limitation against another property, usually latency vs. throughput vs. durability vs. cost. | Pick a point on the curve. Measure, then choose; revisit as load changes. | The exact latency number; cross-AZ cost (fetch-from-follower); per-topic retention. |
The comparative literature is full of "Kafka can't do X" claims that are really "Kafka's default for X is Y", and an equal number of "just tune it" hand-waves about things no tune can fix. The reference we cite is explicit that per-message TTL, priority, and content-based routing are "by design, not bugs" and that "stop-the-world rebalances are largely historical" (empirical: ops-blueprint-reference, §9 cautions). Those two sentences are an inherent and a mitigated verdict respectively. The rest of this chapter is that distinction, applied rigorously. See The Log Pattern for why these simplifications exist and Design Decisions for the forces that chose them.
1 · Ordering is per-partition only, global order means no parallelism
This is the deepest constraint, the one from which several others descend. Kafka guarantees a total order of records within a single partition and makes no promise whatsoever across partitions. This is true at the lowest level: a partition is one append-only log with one leader that serializes writes (Part I The Storage Log Engine, Replication); the offset is a per-partition monotonic counter; there is no global sequence number and no cross-partition clock. It has been true since the original 2011 design paper (empirical: Kreps, Narkhede, Rao 2011).
The architectural consequence is a hard trade you cannot escape: parallelism and global order are the same dial, turned in opposite directions. Partitions are the unit of parallelism (more partitions ⇒ more leaders absorbing writes, more consumers reading; see Part II Partitioning Strategy). If your requirement is a single global total order over all events in a topic, the only way to get it is a topic with one partition, at which point producer throughput collapses to one leader's write rate, consumer parallelism collapses to exactly one instance, and you have thrown away the entire reason to use a distributed log.
global total order ⟺ partitions = 1 ⟺ parallelism = 1. There is no configuration, no feature, no client trick that breaks this identity, because it is not a policy, it is what "the log is sharded by partition" means. Pulsar, Kinesis, and Pub/Sub make the identical trade (per-partition / per-shard / per-key only); the only systems that offer cheap global order are ones that do not shard, and they do not scale (empirical: ops-blueprint-reference, Appendix D).
The engineering escape, the reusable tactic, is to recognise that almost nobody actually needs global order; they need order within an entity. All events for one account, one device, one order id must be ordered relative to each other; events for different accounts need no relative order at all. That requirement maps perfectly onto the log: partition by the entity key, and Kafka's murmur2(key) % N routing puts every event for a key on one partition, giving per-key total order while preserving full cross-key parallelism (Part II Partitioning Strategy). The skill is to interrogate "we need ordering" down to the granularity that is actually required, then choose that granularity as the partition key.
| Ordering requirement | How the log serves it | Cost | Class |
|---|---|---|---|
| Order within an entity (per account/device/key) | Partition by the entity key; native, free | None, this is the design's sweet spot | , |
| Order across a small, fixed set of entities | Co-key them onto one partition deliberately | Reduced parallelism for that set; risk of a hot partition (Part II op03) | Tunable |
| Global total order over all events | One partition only | Parallelism = 1. Producer ≤ one leader's rate, exactly one consumer | Inherent |
| Order across partitions after the fact | Not provided, you must buffer + sort downstream by a timestamp/sequence you embed | Application complexity; watermark/late-data handling | Inherent |
"We need ordering" is almost always "we need ordering per X." Find X, make it the partition key, and the log gives you ordering and parallelism simultaneously. Reserve single-partition global order for genuinely sequential domains (a ledger's global sequence, a leader-election log, the metadata log itself) where throughput is inherently modest. If you need both high throughput and a global order, you need a different abstraction (e.g. a consensus log with a sequencer), not a tuned Kafka.
2 · The partition-count ceiling, quantized, capped, and hard to change
Because the partition is the unit of parallelism, parallelism is quantized: you scale consumers in units of one partition, and a consumer group can never usefully run more instances than the topic has partitions, the surplus members are assigned nothing (Part I Group Coordination; mechanism and the full sizing discussion in Part II Partitioning Strategy). Three distinct ceilings stack here, with three different classes, and conflating them is a common error.
| Ceiling | What it limits | Root mechanism | Class & why |
|---|---|---|---|
| Consumer parallelism = partition count | Max useful consumers in a group | ≤ 1 consumer per partition per group (assignor invariant) | Inherent. Structural to the consumer-group model; provision partitions for peak future fan-out (Part II op03) |
| Partitions per cluster (control-plane) | How many partitions the cluster can hold at all | ZK era: O(partitions) controller failover/reload. KRaft: in-memory, log-replicated metadata | Mitigated. KRaft raised it ~100× (lab ~2M) but per-broker limits remain (empirical: Confluent KRaft lab; Instaclustr) |
| Partitions per broker (data-plane) | Density before a node degrades | File descriptors; ~2 mmap regions/partition vs vm.max_map_count; fetcher threads; rebalance time | Tunable (raise nofile, vm.max_map_count) up to a wall; default ~32,765/broker on Linux (empirical: Instaclustr Part 3) |
| Repartitioning a keyed topic | Whether you can change the count later | murmur2(key) % N remaps keys when N changes; you cannot shrink at all | Inherent. Changing N breaks per-key order and co-partitioned state, there is no in-place fix (Part II op03) |
The first and fourth rows are the structural ones an architect must respect. KRaft (Part I KRaft Consensus, The KRaft Controller) is a genuine mitigation of the second, controller failover no longer requires reloading all partition state, which is what capped ZooKeeper-era clusters near ~200,000 partitions (empirical: Confluent 200K post; KIP-500). But the reference is emphatic that this is widely overstated:
KRaft's "millions of partitions" is a target plus a 2M lab result, not a per-broker production SLA. Real per-broker counts are still gated by file descriptors, fetcher overhead, RAM, rebalance time, and vm.max_map_count (~32,765/broker default until raised), and crucially, throughput does not scale with partition count: in the Instaclustr test producer rate peaked around ~100 partitions and degraded past ~1,000 (empirical: ops-blueprint-reference §2; Instaclustr Part 1 & 3). Scalability ≠ performance. KRaft raised the ceiling on how many partitions can exist; it did nothing for how many you can usefully drive.
The fourth row is the one teams discover too late. Partition count is, for keyed topics, effectively immutable: you can never decrease it (the controller rejects the request outright), and increasing it changes hash(key) % N for most keys, splitting each key's history across two partitions, which silently breaks per-key ordering, co-partitioned stream joins, and log-compaction semantics. The accepted fix is a full migration to a new topic at the target count (dual-write, replay, drain, cut over). The complete mechanism, source citations, and migration playbook live in Part II Partitioning Strategy; the architectural takeaway is simply:
Choosing N for a keyed topic is closer to choosing a primary-key encoding than to setting a thread-pool size: it is expensive and disruptive to change, so decide it deliberately and over-provision modestly (a small multiple of peak need, with a divisor-rich count like 12/24/60). This irreversibility is inherent, it is a direct consequence of routing-by-hash, the same mechanism that gives you cheap per-key ordering. You bought the ordering with the rigidity.
3 · No per-message routing, priority, TTL, or selective consumption
The broker is a dumb sequential log. It does not inspect message contents, it does not route by anything except the partition the producer chose, it has no notion of message priority, and a consumer reads a partition in offset order from where it left off, it cannot ask the broker for "only the messages matching this predicate" or "the high-priority ones first." This is the sharpest contrast with a message broker like RabbitMQ, which is a "smart broker / dumb consumer" with exchange-based routing (direct, topic, fanout, headers), per-message priority, and per-message/per-queue TTL; Kafka is deliberately the inverse, "dumb broker / smart consumer" (empirical: ops-blueprint-reference §9). The dumbness is the feature, it is what makes the broker a sequential-I/O machine that hits hundreds of MB/s per node via zero-copy sendfile (Part I The Fetch Path). A broker that filtered, prioritised, or expired individual messages could not do that.
| Capability | Message broker (e.g. RabbitMQ) | Kafka log | Class & the log's answer |
|---|---|---|---|
| Content-based routing | Exchange rules on headers/body | None, partition chosen by producer (key hash or explicit) | Inherent. Route by choosing the topic/partition at produce time; filter in the consumer or a stream job |
| Per-message priority | Priority queues reorder | None, strict offset order | Inherent. Use separate topics per priority tier; consume high-priority topic preferentially |
| Per-message TTL | Message/queue expiry | None, retention is per topic (see §6) | Inherent at message level; topic-level time/size retention only |
| Selective consumption | Consumer binds a routing key; gets only matches | Consumer reads the whole partition; filters client-side | Inherent for classic consumers; partially lifted by share groups + broker-side push-down is not offered |
| Individual ack / redelivery of one message | Per-message ack/nack/requeue | Classic: commit by offset (a high-water mark), not per message | Mitigated by share groups KIP-932 |
The partial lift, share groups. The one place Kafka deliberately moves toward queue semantics is Share Groups (KIP-932). A share group lets many consumers cooperatively read the same partitions, with each record delivered individually under a time-bounded acquisition lock and carrying a per-record delivery state (Available → Acquired → Acknowledged/Archived) and a delivery count. This is genuinely new: it provides per-record acknowledgement, redelivery, a max-delivery-count, and an optional dead-letter queue, the things a task queue needs. But it is a mitigation, not a removal of the constraint, and the boundaries matter:
Share groups decouple parallelism from partition count (more consumers than partitions can make progress on the same partition) and break head-of-line blocking by letting a stuck record be redelivered while others advance. They do not add content-based routing, priority, per-message TTL, or broker-side filtering, a share consumer still receives records the broker selected by offset and filters in the client. And the cost of per-record state is explicit: delivery is at-least-once with redelivery and ordering guarantees are relaxed, the design note states share groups give queue fan-out "without per-key ordering guarantees" (Part I Share Groups). You trade order for individual delivery. That is the same trade RabbitMQ makes; Kafka simply made you opt into it explicitly rather than baking it into the broker.
In a log, the routing decision is the topic/partition choice at write time, and "priority" is a separate topic you drain preferentially. This pushes a routing table into your topic topology (Part II Topologies). It is more rigid than a broker's runtime routing rules, but it is also visible, versionable, and free at read time. Reach for share groups when you need per-message ack/redelivery (work-queue dispatch, competing consumers on a hot partition); keep classic consumer groups when you need ordered, replayable, offset-based consumption.
4 · Per-partition head-of-line blocking, one slow record stalls the partition
This is the direct, painful consequence of §1 (per-partition order) combined with the consumer-group model (≤ 1 consumer per partition). A classic consumer processes a partition strictly in offset order and commits a single offset that means "everything before this is done." Therefore a record that is slow to process, or that cannot be processed at all (a poison pill: a malformed message, a record whose downstream dependency is down), blocks every later record on that partition for that group. The reference puts it bluntly: per-partition order plus one-consumer-per-partition means "partition throughput collapses to the speed of the slowest message," and a poison pill "stalls every later message" (empirical: ops-blueprint-reference §8).
This is inherent to ordered offset-based consumption and it is the reason the reference names "Kafka as a task queue" an anti-pattern. The mitigations are real but each gives something up:
| Mitigation | How it helps | What it costs | Class |
|---|---|---|---|
| Dead-letter the poison record | Consumer catches the failure, ships the record to a DLQ topic, commits past it | You must write the DLQ + reprocessing logic; the bad record leaves the ordered stream | Tactic (application-level) |
| More partitions | Shrinks the blast radius, only keys on the stalled partition are blocked | Does not eliminate it; a hot key still funnels to one partition (Part II op03) | Mitigated |
| Share groups KIP-932 | Per-record acquisition + redelivery: a stuck record is released/timed-out and the rest proceed; max-delivery-count + DLQ built in (Part I 15) | At-least-once redelivery; relaxed ordering, you give up per-key order to gain non-blocking dispatch | Mitigated |
| External parallel-consumer / push proxy | Decouples processing concurrency from partition count (e.g. Uber's uForwarder, Confluent Parallel Consumer) | Extra component; ordering semantics depend on the chosen key-level concurrency | Tactic (empirical: Uber uForwarder) |
For read_committed consumers the blocking can come from the producer side. A crashed transactional producer leaves a hanging transaction that pins the Last Stable Offset, so all later records on that partition become invisible to read_committed consumers until the transaction times out, bounded by the broker's transaction.max.timeout.ms (default 15 minutes) (Part I Transactions & EOS; empirical: ops-blueprint-reference §4). This is the same structural head-of-line property, strict offset order forces "everything before the LSO must resolve first", surfacing through the EOS machinery. Detect via LastStableOffsetLag; recover with the kafka-transactions.sh tooling (KIP-664).
Whenever you build ordered-stream processing, design the poison-pill path first: a try/catch that routes the failing record to a DLQ and commits past it converts an unbounded partition stall into a bounded, observable failure. The general lesson transfers far beyond Kafka: any strictly-ordered pipeline needs an escape hatch for the record that won't move, or the slowest/worst record sets the throughput of everything behind it.
5 · The latency floor, ms-scale, not sub-ms request/reply
Kafka is built to optimise throughput over latency, and several structural choices put a floor under per-message latency that no amount of tuning drops to the sub-millisecond range of an in-memory request/reply system. The floor has three structural contributors, and it is important to separate the parts that are inherent from the parts that are tunable.
linger.ms + batch.size deliberately wait to amortise per-request cost, the latency added equals the linger window. Set near 0 for latency, higher for throughput.acks=all)The result is a system whose floor is milliseconds, not microseconds. That floor is genuinely low when tuned, a tier-1 bank held sub-5 ms p99 end-to-end at 1.6M msg/s for trading, but only by tuning serialization (Protobuf), 10GbE, broker config, and minimising replication delay (empirical: ops-blueprint-reference §9). And much of the tail beyond the floor is OS/filesystem, not protocol: Allegro cut producer-latency outliers above a 65 ms SLO by 82% just by moving to XFS with tuned journaling (empirical: ops-blueprint-reference §9).
| Contributor | Adds | Class | Lever / note |
|---|---|---|---|
| Producer batching | ≤ linger.ms | Tunable | Trades latency for throughput; set 0–1 ms for latency (Part II Performance Tuning) |
ISR replication (acks=all) | One broker round-trip (cross-AZ if replicas span AZs) | Inherent for durable writes; tunable down to acks=1/0 by sacrificing durability | The durability ⇄ latency trade (Part II Durability) |
| Pull consumption | Fetch-cycle latency at low load | Inherent | fetch.min.bytes/fetch.max.wait.ms tune the batching, not the pull model |
| OS / filesystem tail | Variable (page-cache, fsync, GC) | Tunable | XFS, page-cache headroom, small heap + G1GC (empirical: Allegro; ops-blueprint-reference §3) |
If your requirement is synchronous, sub-millisecond request/reply, an in-memory cache lookup, an online feature fetch on the critical path of a user request, an order-matching inner loop, the log is the wrong abstraction, and no tuning changes that. Kafka has "no native point-to-point or on-the-fly reply topics," and request-reply over Kafka "returns the coupling we were trying to avoid" (empirical: ops-blueprint-reference §8). Use a gRPC/HTTP service or an in-memory store for the synchronous path; use the log for the durable, ordered, fan-out event stream that the synchronous path emits. The two compose; one does not replace the other.
The newest variants move the floor in the wrong direction on purpose: WarpStream and KIP-1150-style diskless designs write straight to object storage, eliminating cross-AZ replication cost but raising produce p99 to ~400 ms and end-to-end p99 to ~1 s (empirical: ops-blueprint-reference §9; Appendix C). AutoMQ's WAL-assisted variant recovers single-digit-ms. The point for an architect: there is a latency ⇄ cost curve here too, and the cheapest designs sit far above Kafka's already-non-sub-ms floor. See Evolution and Part II Cost.
6 · All-or-nothing topic retention, no per-record TTL
Retention in Kafka is a topic-level policy: retention.ms and retention.bytes govern when whole segments age out (Part I Storage Management). There is no per-record TTL. A record's lifetime is the lifetime of its topic's retention window; you cannot say "this message expires in 30 seconds, that one in 30 days." This falls directly out of the segment design, the broker deletes by dropping whole segment files, an O(1) filesystem operation, which is what keeps retention cheap; per-record expiry would require scanning and rewriting segments, destroying the sequential-I/O property.
The one mechanism that approximates per-key lifetime is log compaction with tombstones: on a compacted topic, writing a record with a key and a null value (a tombstone) eventually causes the compactor to remove all prior values for that key, and the tombstone itself is purged after delete.retention.ms (Part I Storage Management; The Log Pattern). But that is key-level deletion, not time-based per-record TTL, and it is non-deterministic in timing, compaction "does NOT guarantee one record per key at any instant," only "at least the last value per key" eventually (empirical: ops-blueprint-reference §8).
| You want… | Log provides | Class |
|---|---|---|
| "All data older than 7 days disappears" | retention.ms at topic level, native | Tunable (per topic) |
| "Keep only the latest value per key" | Compaction (cleanup.policy=compact) | Tunable (per topic) |
| "Delete this specific key now" | Tombstone (key + null) on a compacted topic, eventual, not immediate | Mitigated (key-level, non-deterministic timing) |
| "This message expires in 30s, that one in 30d" | Not provided, split by lifetime into separate topics | Inherent |
If records have materially different lifetimes, route them to separate topics with different retention, the same "topology as configuration" move as priority and routing. For the "table of latest values" use case (a changelog, a config store, a CDC snapshot), a compacted topic gives you durable current-state whose size is bounded by distinct live keys, independent of total throughput (Part I Storage Management). What the log structurally cannot give you is a GDPR-style "delete exactly this record at this moment" guarantee on a non-compacted topic, that is a known gap to design around (e.g. crypto-shredding: store the payload encrypted, delete the key).
7 · No server-side filtering, query, or secondary index
A consumer that needs a sparse subset of a topic must read everything and filter client-side. The broker offers exactly one access path: sequential read from an offset (Part I The Fetch Path). There is no WHERE clause, no secondary index, no point-lookup-by-value, no server-side projection. If 1% of records match your predicate, you still pay to transfer and decode 100%. This is the deliberate flip side of "Kafka is not a database", Kreps is explicit: "I don't think Kafka really benefits from trying to add any kind of random access lookups directly against the log"; Kafka replicates into specialised systems rather than replacing them (empirical: ops-blueprint-reference §8).
This is inherent, and the canonical answer is the architecture this whole part is about: the log is the system of record; every queryable view is a derived projection you build by consuming the stream into a store designed for the query (Part I Kafka Streams, Kafka Connect; The Log Pattern). Need point lookups? Materialise into a key-value store. Need full-text? Into Elasticsearch. Need ad-hoc SQL? Into a warehouse. The log feeds them all and keeps them consistent because they replay the same ordered events.
The read-everything cost is easy to under-estimate because it is invisible at design time and only shows up as cluster egress and consumer CPU at scale. If you find yourself adding a consumer group whose job is to read a firehose topic and discard 99% of it, that is a signal you have pushed a query into the transport. Either pre-split at the producer (route the 1% to its own topic), or accept that you should be querying a derived store, not scanning the log. Kafka does not push filtering to the broker, by design, so the broker can stay a zero-copy sequential machine (empirical: ops-blueprint-reference §8, §9).
8 · Complexity pushed to the clients, the fat-client burden
"Dumb broker / smart consumer" has a corollary the broker's simplicity hides: the intelligence did not disappear, it moved into the clients. The producer owns batching, compression, partitioning, idempotence sequencing, and ret/backoff; the consumer owns the fetch loop, offset management, deserialization, the rebalance protocol, and (for EOS) the read-process-write transaction. Kafka clients are among the fattest in the messaging world, and that fatness is a structural property of pushing logic out of the broker, not an accident to be refactored away.
| Responsibility the broker does NOT carry | Where it lives | The burden it creates |
|---|---|---|
| Batching / compression / partitioning | Producer (Part I 16) | Many tunables (batch.size, linger.ms, compression, partitioner) that interact; wrong choices silently cost latency or throughput |
| Offset management / commit cadence | Consumer (Part I 17) | At-least-once vs at-most-once vs effectively-once is a client decision; auto-commit is a footgun |
| Group membership / rebalance | Consumer + coordinator (Part I 13) | The rebalance protocol is intricate; poll-interval and heartbeat misconfig cause rebalance storms (see §9) |
| Read-process-write atomicity (EOS) | Producer + consumer + coordinator (Part I 14) | Transactions, fencing, read_committed, hanging-txn recovery, significant application-side complexity |
Putting logic in the client is what lets the broker be a zero-copy sequential-I/O machine and what lets each client tune its own trade-offs (a latency-sensitive producer and a throughput-sensitive one talk to the same broker with opposite settings). The cost is borne by every team that writes a client: a steep learning curve, language-specific client quality variance (the JVM client is the reference; others lag), and the reality that most production incidents are client misconfiguration, not broker faults, the reference's failure-mode catalogue is dominated by client-side issues (rebalance storms, fencing avalanches, auto-commit data loss; empirical: ops-blueprint-reference §4). This is inherent to the architecture; the mitigation is organisational: wrap the raw clients in a vetted internal library with safe defaults. LinkedIn, Datadog, and others all built exactly such abstractions over the raw client (empirical: ops-blueprint-reference §7).
If more than a handful of teams use your Kafka, the highest-leverage investment is a shared client wrapper that encodes durability defaults (acks=all, idempotence, sane commit cadence), the DLQ pattern, metrics, and the cooperative rebalance protocol, so individual teams cannot reinvent the footguns. The fat-client burden does not go away, but you pay it once, centrally, instead of once per team per incident.
9 · Rebalance disruption, historically stop-the-world
When a consumer group's membership changes (a consumer joins, leaves, or is evicted for missing a heartbeat), the group rebalances: partitions are reassigned across members (Part I Group Coordination). The original protocol was "stop-the-world", every consumer revoked all its partitions, the group synchronised on a barrier, then everyone got a new assignment; during that window the whole group made no progress. On large groups with frequent membership churn this was a serious, recurring stall, and it is the single most-improved area of the protocol. This makes it the textbook example of a mitigated limitation, the structural fact (membership change forces reassignment) remains, but its blast radius has shrunk dramatically across three KIPs:
KIP-429 (Kafka 2.4) introduced incremental cooperative rebalancing (the CooperativeStickyAssignor): a consumer keeps the partitions it will retain and only the genuinely-moving partitions are revoked, eliminating the global stop. KIP-848 (GA in Kafka 4.0) goes further, moving assignment computation to the broker-side group coordinator via a ConsumerGroupHeartbeat RPC, removing the client-side synchronization barrier entirely and reported up to ~20× faster (empirical: ops-blueprint-reference §9). The reference is explicit that the old critique is now mostly historical:
KIP-429 and KIP-848 "substantially reduce disruption"; critiques citing a full stop-the-world stall "apply to old clients/eager assignors" (empirical: ops-blueprint-reference §9 cautions). What they do not remove: membership change still triggers reassignment, partition movement still pauses processing of the moved partitions, and the most common production failure is still the self-sustaining rebalance storm, a slow consumer exceeds max.poll.interval.ms (default 5 min), gets evicted, rejoins, triggers another rebalance (empirical: ops-blueprint-reference §4). The mitigation toolkit, cooperative assignor, static membership (KIP-345, stable group.instance.id so a restart does not trigger reassignment), right-sized max.poll.interval.ms, smaller max.poll.records, addresses the residual but does not make membership changes free.
The deepest fix is to stop triggering rebalances in the first place: static membership keeps assignments across rolling restarts and deploys, and sizing max.poll.interval.ms above worst-case batch processing time stops slow consumers from self-evicting. Cheap rebalancing (cooperative/KIP-848) and rare rebalancing (static membership) are complementary, adopt both. The transferable lesson: in any membership-based distributed system, the cost of reassignment matters less than the frequency of reassignment, invest in stability of membership before cleverness of reassignment.
10 · Cross-AZ replication cost, failure-independence has a cloud bill
Kafka's durability comes from placing replicas on failure-independent brokers, and in the cloud "failure-independent" means "in different availability zones." That is exactly right for resilience, and it collides head-on with cloud network pricing, where inter-AZ traffic is billed in both directions. With 3-AZ RF=3, every 1 GB of ingress produces ~2 GB of cross-AZ replication traffic before you count producer and consumer cross-zone hops (empirical: ops-blueprint-reference §9). The result is the most material cost critique of self-managed Kafka:
Confluent's own model puts networking at ~70% of self-managed infra at 20 MBps and ~87% at 100 MBps; "after tiered storage, networking alone can comprise ~90%" of TCO (empirical: ops-blueprint-reference §6, §9). The replication multiplier is the driver: RF=3 across 3 AZs means the (RF−1) = 2 cross-AZ copies of every byte are unavoidable given the durability model. This is the structural tension, failure-independence and cross-AZ billing are the same placement decision.
Which parts are inherent and which are tunable is the crux, and the reference is precise about it:
| Cross-AZ component | Driver | Lever | Class |
|---|---|---|---|
| Consumer fetch cross-AZ | Consumers read the leader, often in another AZ | Fetch-from-follower KIP-392 + rack-awareness KIP-881: read a same-AZ replica → can cut ~all consumer cross-AZ, ~50% total cluster cost | Tunable (mitigation), at the cost of up to ~500 ms tail latency & broker skew |
| Replication cross-AZ (RF−1 copies) | Durability requires failure-independent replicas | Lower RF on non-critical topics; shorter retention. Classic Kafka cannot remove it without losing failure-independence | Inherent for the durability model; tunable only by lowering durability |
| Producer cross-AZ | Producers write the leader, often in another AZ | Not removed by FFF; only diskless/object-store designs eliminate it (write straight to S3, leaderless) | Inherent for classic Kafka; removed only by a different storage architecture |
| Storage (separate axis) | Local triple-replicated SSD is ~10–20× S3 per-GiB | Tiered storage KIP-405 offloads cold segments to S3 | Mitigated for storage, but does NOT touch networking (common misconception) |
A frequent and expensive misconception: tiered storage (KIP-405) does not reduce cross-AZ networking, it only moves cold storage to S3; after adopting it, networking can become an even larger share of TCO (empirical: ops-blueprint-reference Appendix C). Even with fetch-from-follower fully deployed, produce + replication cross-AZ traffic remains (~$13.8k of AutoMQ's $24k example), only diskless / object-store designs remove it, and they do so by trading the latency floor (§5). The architect's takeaway: cross-AZ cost is part inherent (the replication multiplier is the durability model) and part tunable (consumer-side via FFF). The full cost model, levers, and dollar figures are in Part II Cost; the structural point is that you cannot have failure-independent replicas and zero cross-AZ traffic in classic Kafka. See also Evolution for the diskless trajectory.
11 · Metadata as a single log, the control-plane scaling bound
KRaft's elegance is that it eats its own dog food: cluster metadata is itself a distributed commit log. All metadata, topics, partitions, leaders, ISR, configs, ACLs, is a stream of records in one Raft-replicated topic partition, __cluster_metadata-0, applied by a single-threaded event loop in QuorumController (Part I KRaft Consensus, The KRaft Controller). This is a massive improvement over ZooKeeper for failover, standbys replay the same log and converge on byte-identical state, so failover transfers no state and is near-instant (empirical: KIP-500; Part I 11). But notice that the metadata plane inherits the log's own §1 limitation: it is a single partition, single-writer, serially-applied log. That gives it a structural scaling bound that mirrors the data plane's.
__cluster_metadata-0QuorumController, single-threaded event loopreplay() in offset order, the same per-partition-order discipline as a data topic| Property of the metadata log | Consequence | Class |
|---|---|---|
| Single partition, single active writer | Metadata mutation throughput is one controller's serial apply rate, it does not scale horizontally | Inherent (same as §1, applied to metadata) |
| Every broker holds the full metadata image in memory | Total partitions/topics/configs are bounded by per-broker RAM, not just the controller's | Tunable up to a wall (RAM, vm.max_map_count) |
| Standbys replay the same log | Failover transfers no state, near-instant (the upside) | Mitigation of the ZK-era failover cost |
| Snapshots bound replay length | A new/rejoining controller loads a snapshot + tail, not the whole history (Part I 10) | Mitigation (keeps startup bounded) |
This single-log control plane is the real reason the "millions of partitions" claim is a ceiling on existence, not performance (§2). Every partition is metadata records that every broker holds in memory and that one controller applies serially; pile on enough and you bound metadata-change latency and per-broker RAM long before you bound disk. KRaft removed the ZooKeeper failover bottleneck, a genuine, large mitigation, but it replaced it with a single-writer log whose throughput and memory footprint are themselves the new bound. The metadata plane is fast at failover and finite at scale for the same reason: it is a single distributed log. The pattern's strength and its limit are, once again, the same decision.
The catalogue at a glance
The recurring shape of this chapter is that every limitation is the shadow of a strength, and the architect's job is to know which shadows move and which are fixed. The summary matrix:
| # | Limitation | Inherent / Mitigated / Tunable | The strength it pays for | Design-around / lever |
|---|---|---|---|---|
| 1 | Ordering per-partition only | Inherent | Sharded parallelism | Order per-key; reserve 1 partition for true global order |
| 2 | Partition-count ceiling & rigidity | Mixed: consumer-cap & keyed-rigidity inherent; cluster-count mitigated (KRaft); per-broker tunable | Parallelism = partitions; cheap per-key routing | Over-provision keyed counts; raise FD/mmap; see op03 |
| 3 | No routing/priority/TTL/selective read | Inherent; partially mitigated by share groups | Dumb broker = zero-copy throughput | Topology as routing; share groups KIP-932 for per-record dispatch |
| 4 | Per-partition head-of-line blocking | Inherent; mitigated (share groups, DLQ) | Strict ordered, replayable consumption | DLQ the poison record; share groups; parallel-consumer |
| 5 | Latency floor (ms, not sub-ms) | Inherent floor; tail tunable | Throughput-first batching + durable replication | Don't use for sync request/reply; tune linger/FS for the tail |
| 6 | All-or-nothing topic retention | Inherent (no per-record TTL); mitigated by compaction | O(1) segment-drop retention | Split by lifetime into topics; compaction + tombstones |
| 7 | No server-side filter/query/index | Inherent | Single sequential access path | Materialise derived query stores; pre-split at producer |
| 8 | Complexity in the clients | Inherent | Per-client tuning; thin broker | Shared vetted client wrapper with safe defaults |
| 9 | Rebalance disruption | Mitigated (KIP-429, KIP-848) | Elastic consumer-group membership | Cooperative assignor + static membership KIP-345 |
| 10 | Cross-AZ replication cost | Inherent (replication); consumer-side tunable | Failure-independent durability | Fetch-from-follower KIP-392; diskless trades latency; op10 |
| 11 | Metadata as a single log | Inherent bound; failover mitigated (KRaft) | Trivial, stateless controller failover | Bound partition counts; accept it as a ceiling on existence |
When the log is the wrong choice, the negative space
Stacking the inherent limitations gives the most useful output of this chapter: a crisp list of workloads where the distributed log is structurally the wrong tool, no matter how it is tuned. An architect who memorises this list will save a team from months of fighting the abstraction.
(1) Synchronous sub-millisecond request/reply, the latency floor (§5) and the absence of a native reply path make the log the wrong transport; use RPC/HTTP or an in-memory store. (2) Primary store for point lookups or ad-hoc queries, no index, no query language (§7); the log feeds derived query stores, it does not replace them. (3) Rich per-message routing, priority, or TTL with no ordering need, a smart-broker (RabbitMQ) is the better fit (§3); share groups close part but not all of the gap. Everything else, durable ordered history, decoupled high-throughput fan-out, replay, CDC, stream processing, is the log's home ground, and the limitations in this chapter are the price you knowingly pay for it. Carry this list to When To Use and the consolidated Architect's Cheat-Sheet.
The single most valuable habit this chapter teaches is not Kafka-specific: for any system you evaluate, separate the inherent from the mitigated from the tunable before you commit. Inherent limits define the problem class the system can serve; mitigated limits define how much operational maturity you must buy; tunable limits define the curve you will spend your career picking points on. Vendors and benchmarks blur these lines on purpose, the reference repeatedly flags claims that present a tunable default as an inherent flaw, or an inherent flaw as a "just tune it" non-issue (empirical: ops-blueprint-reference §9 cautions). The architect's edge is refusing that blur.