III · 03 · Inherent Limitations

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.

Every architecture buys its strengths with a matching set of weaknesses, and the honest way to evaluate one is to study where it breaks by design. The distributed, partitioned, replicated commit log is exceptional at exactly one shape of problem, high-throughput, ordered, replayable, fan-out data movement, and it is exceptional precisely because it refuses to be a database, a router, a queue, or an RPC bus. This chapter is the catalogue of that refusal. For each limitation we ask the only question that matters to an architect: is this inherent (structural, it follows from the log abstraction itself and no feature can remove it), mitigated (a feature softens it but the underlying force remains), or tunable (a config trades it against something else)? Confusing these three categories is the single most expensive mistake teams make with Kafka, they tune what is structural, or assume structural what is merely a default. Get the taxonomy right and you will know, before you write a line of code, which of these walls you can move and which you must design around.

The discipline: inherent vs. mitigated vs. tunable

The log's power comes from a handful of radical simplifications: the broker is a dumb, append-only, sequential store; ordering is defined only within a partition; routing is reduced to a hash of a key; and almost all intelligence is pushed to the clients. Each simplification is the source of both a strength documented elsewhere in this book and a limitation documented here. The two are inseparable, you cannot keep the throughput and shed the constraint, because they are the same decision viewed from two sides. Before the catalogue, fix the vocabulary, because we will tag every limitation with one of these three labels.

Class	Definition	What an architect does about it	Example in this chapter
Inherent	Follows from the log abstraction or the single-leader-per-partition model. No configuration or feature removes it; only a different architecture does.	Design around it. Choose the log only where the constraint is acceptable; otherwise pick or compose a different system.	Per-partition-only ordering; read-everything-and-filter; latency floor > sub-ms.
Mitigated	The structural force still exists, but a feature reduces its blast radius or shifts where it bites. The ceiling moves; it does not vanish.	Adopt the mitigation, but plan for the residual. Know what the feature does not fix.	Head-of-line blocking (share groups); rebalance stalls (KIP-848); partition ceiling (KRaft).
Tunable	A config knob trades the limitation against another property, usually latency vs. throughput vs. durability vs. cost.	Pick a point on the curve. Measure, then choose; revisit as load changes.	The exact latency number; cross-AZ cost (fetch-from-follower); per-topic retention.

Why this taxonomy is the whole point

The comparative literature is full of "Kafka can't do X" claims that are really "Kafka's default for X is Y", and an equal number of "just tune it" hand-waves about things no tune can fix. The reference we cite is explicit that per-message TTL, priority, and content-based routing are "by design, not bugs" and that "stop-the-world rebalances are largely historical" (empirical: ops-blueprint-reference, §9 cautions). Those two sentences are an inherent and a mitigated verdict respectively. The rest of this chapter is that distinction, applied rigorously. See The Log Pattern for why these simplifications exist and Design Decisions for the forces that chose them.

A limitation bites youclassify before you act

TUNABLEpick a point on the latency/throughput/durability/cost curve

MITIGATEDadopt it; design for the residual force

INHERENTdesign around it or choose a different system

The first question for any Kafka limitation: which of the three classes is it? The answer dictates whether you turn a knob, adopt a feature, or change your architecture.

tunable (config trade-off) mitigated (feature softens) inherent (structural) classify path

1 · Ordering is per-partition only, global order means no parallelism

This is the deepest constraint, the one from which several others descend. Kafka guarantees a total order of records within a single partition and makes no promise whatsoever across partitions. This is true at the lowest level: a partition is one append-only log with one leader that serializes writes (Part I The Storage Log Engine, Replication); the offset is a per-partition monotonic counter; there is no global sequence number and no cross-partition clock. It has been true since the original 2011 design paper (empirical: Kreps, Narkhede, Rao 2011).

The architectural consequence is a hard trade you cannot escape: parallelism and global order are the same dial, turned in opposite directions. Partitions are the unit of parallelism (more partitions ⇒ more leaders absorbing writes, more consumers reading; see Part II Partitioning Strategy). If your requirement is a single global total order over all events in a topic, the only way to get it is a topic with one partition, at which point producer throughput collapses to one leader's write rate, consumer parallelism collapses to exactly one instance, and you have thrown away the entire reason to use a distributed log.

The ordering ⇄ parallelism identity

global total order ⟺ partitions = 1 ⟺ parallelism = 1. There is no configuration, no feature, no client trick that breaks this identity, because it is not a policy, it is what "the log is sharded by partition" means. Pulsar, Kinesis, and Pub/Sub make the identical trade (per-partition / per-shard / per-key only); the only systems that offer cheap global order are ones that do not shard, and they do not scale (empirical: ops-blueprint-reference, Appendix D).

The engineering escape, the reusable tactic, is to recognise that almost nobody actually needs global order; they need order within an entity. All events for one account, one device, one order id must be ordered relative to each other; events for different accounts need no relative order at all. That requirement maps perfectly onto the log: partition by the entity key, and Kafka's murmur2(key) % N routing puts every event for a key on one partition, giving per-key total order while preserving full cross-key parallelism (Part II Partitioning Strategy). The skill is to interrogate "we need ordering" down to the granularity that is actually required, then choose that granularity as the partition key.

Ordering requirement	How the log serves it	Cost	Class
Order within an entity (per account/device/key)	Partition by the entity key; native, free	None, this is the design's sweet spot	,
Order across a small, fixed set of entities	Co-key them onto one partition deliberately	Reduced parallelism for that set; risk of a hot partition (Part II op03)	Tunable
Global total order over all events	One partition only	Parallelism = 1. Producer ≤ one leader's rate, exactly one consumer	Inherent
Order across partitions after the fact	Not provided, you must buffer + sort downstream by a timestamp/sequence you embed	Application complexity; watermark/late-data handling	Inherent

Reusable tactic, pick the smallest ordering scope

"We need ordering" is almost always "we need ordering per X." Find X, make it the partition key, and the log gives you ordering and parallelism simultaneously. Reserve single-partition global order for genuinely sequential domains (a ledger's global sequence, a leader-election log, the metadata log itself) where throughput is inherently modest. If you need both high throughput and a global order, you need a different abstraction (e.g. a consensus log with a sequencer), not a tuned Kafka.

2 · The partition-count ceiling, quantized, capped, and hard to change

Because the partition is the unit of parallelism, parallelism is quantized: you scale consumers in units of one partition, and a consumer group can never usefully run more instances than the topic has partitions, the surplus members are assigned nothing (Part I Group Coordination; mechanism and the full sizing discussion in Part II Partitioning Strategy). Three distinct ceilings stack here, with three different classes, and conflating them is a common error.

Ceiling	What it limits	Root mechanism	Class & why
Consumer parallelism = partition count	Max useful consumers in a group	≤ 1 consumer per partition per group (assignor invariant)	Inherent. Structural to the consumer-group model; provision partitions for peak future fan-out (Part II op03)
Partitions per cluster (control-plane)	How many partitions the cluster can hold at all	ZK era: O(partitions) controller failover/reload. KRaft: in-memory, log-replicated metadata	Mitigated. KRaft raised it ~100× (lab ~2M) but per-broker limits remain (empirical: Confluent KRaft lab; Instaclustr)
Partitions per broker (data-plane)	Density before a node degrades	File descriptors; ~2 mmap regions/partition vs `vm.max_map_count`; fetcher threads; rebalance time	Tunable (raise `nofile`, `vm.max_map_count`) up to a wall; default ~32,765/broker on Linux (empirical: Instaclustr Part 3)
Repartitioning a keyed topic	Whether you can change the count later	`murmur2(key) % N` remaps keys when N changes; you cannot shrink at all	Inherent. Changing N breaks per-key order and co-partitioned state, there is no in-place fix (Part II op03)

The first and fourth rows are the structural ones an architect must respect. KRaft (Part I KRaft Consensus, The KRaft Controller) is a genuine mitigation of the second, controller failover no longer requires reloading all partition state, which is what capped ZooKeeper-era clusters near ~200,000 partitions (empirical: Confluent 200K post; KIP-500). But the reference is emphatic that this is widely overstated:

"KRaft means partitions are free" is false

KRaft's "millions of partitions" is a target plus a 2M lab result, not a per-broker production SLA. Real per-broker counts are still gated by file descriptors, fetcher overhead, RAM, rebalance time, and vm.max_map_count (~32,765/broker default until raised), and crucially, throughput does not scale with partition count: in the Instaclustr test producer rate peaked around ~100 partitions and degraded past ~1,000 (empirical: ops-blueprint-reference §2; Instaclustr Part 1 & 3). Scalability ≠ performance. KRaft raised the ceiling on how many partitions can exist; it did nothing for how many you can usefully drive.

The fourth row is the one teams discover too late. Partition count is, for keyed topics, effectively immutable: you can never decrease it (the controller rejects the request outright), and increasing it changes hash(key) % N for most keys, splitting each key's history across two partitions, which silently breaks per-key ordering, co-partitioned stream joins, and log-compaction semantics. The accepted fix is a full migration to a new topic at the target count (dual-write, replay, drain, cut over). The complete mechanism, source citations, and migration playbook live in Part II Partitioning Strategy; the architectural takeaway is simply:

Reusable tactic, treat keyed-partition count as a schema decision

Choosing N for a keyed topic is closer to choosing a primary-key encoding than to setting a thread-pool size: it is expensive and disruptive to change, so decide it deliberately and over-provision modestly (a small multiple of peak need, with a divisor-rich count like 12/24/60). This irreversibility is inherent, it is a direct consequence of routing-by-hash, the same mechanism that gives you cheap per-key ordering. You bought the ordering with the rigidity.

3 · No per-message routing, priority, TTL, or selective consumption

The broker is a dumb sequential log. It does not inspect message contents, it does not route by anything except the partition the producer chose, it has no notion of message priority, and a consumer reads a partition in offset order from where it left off, it cannot ask the broker for "only the messages matching this predicate" or "the high-priority ones first." This is the sharpest contrast with a message broker like RabbitMQ, which is a "smart broker / dumb consumer" with exchange-based routing (direct, topic, fanout, headers), per-message priority, and per-message/per-queue TTL; Kafka is deliberately the inverse, "dumb broker / smart consumer" (empirical: ops-blueprint-reference §9). The dumbness is the feature, it is what makes the broker a sequential-I/O machine that hits hundreds of MB/s per node via zero-copy sendfile (Part I The Fetch Path). A broker that filtered, prioritised, or expired individual messages could not do that.

Capability	Message broker (e.g. RabbitMQ)	Kafka log	Class & the log's answer
Content-based routing	Exchange rules on headers/body	None, partition chosen by producer (key hash or explicit)	Inherent. Route by choosing the topic/partition at produce time; filter in the consumer or a stream job
Per-message priority	Priority queues reorder	None, strict offset order	Inherent. Use separate topics per priority tier; consume high-priority topic preferentially
Per-message TTL	Message/queue expiry	None, retention is per topic (see §6)	Inherent at message level; topic-level time/size retention only
Selective consumption	Consumer binds a routing key; gets only matches	Consumer reads the whole partition; filters client-side	Inherent for classic consumers; partially lifted by share groups + broker-side push-down is not offered
Individual ack / redelivery of one message	Per-message ack/nack/requeue	Classic: commit by offset (a high-water mark), not per message	Mitigated by share groups KIP-932

The partial lift, share groups. The one place Kafka deliberately moves toward queue semantics is Share Groups (KIP-932). A share group lets many consumers cooperatively read the same partitions, with each record delivered individually under a time-bounded acquisition lock and carrying a per-record delivery state (Available → Acquired → Acknowledged/Archived) and a delivery count. This is genuinely new: it provides per-record acknowledgement, redelivery, a max-delivery-count, and an optional dead-letter queue, the things a task queue needs. But it is a mitigation, not a removal of the constraint, and the boundaries matter:

What share groups do, and pointedly do not, fix

Share groups decouple parallelism from partition count (more consumers than partitions can make progress on the same partition) and break head-of-line blocking by letting a stuck record be redelivered while others advance. They do not add content-based routing, priority, per-message TTL, or broker-side filtering, a share consumer still receives records the broker selected by offset and filters in the client. And the cost of per-record state is explicit: delivery is at-least-once with redelivery and ordering guarantees are relaxed, the design note states share groups give queue fan-out "without per-key ordering guarantees" (Part I Share Groups). You trade order for individual delivery. That is the same trade RabbitMQ makes; Kafka simply made you opt into it explicitly rather than baking it into the broker.

Reusable tactic, model priority and routing as topics, not message attributes

In a log, the routing decision is the topic/partition choice at write time, and "priority" is a separate topic you drain preferentially. This pushes a routing table into your topic topology (Part II Topologies). It is more rigid than a broker's runtime routing rules, but it is also visible, versionable, and free at read time. Reach for share groups when you need per-message ack/redelivery (work-queue dispatch, competing consumers on a hot partition); keep classic consumer groups when you need ordered, replayable, offset-based consumption.

4 · Per-partition head-of-line blocking, one slow record stalls the partition

This is the direct, painful consequence of §1 (per-partition order) combined with the consumer-group model (≤ 1 consumer per partition). A classic consumer processes a partition strictly in offset order and commits a single offset that means "everything before this is done." Therefore a record that is slow to process, or that cannot be processed at all (a poison pill: a malformed message, a record whose downstream dependency is down), blocks every later record on that partition for that group. The reference puts it bluntly: per-partition order plus one-consumer-per-partition means "partition throughput collapses to the speed of the slowest message," and a poison pill "stalls every later message" (empirical: ops-blueprint-reference §8).

Partition (offset order)Consumer (single)

offset 100, ok

commit 101

offset 101, POISON / slow downstream

102,103,104… queued behind 101

no commit advances, partition stalled for this group

Head-of-line blocking: because the consumer must process offset 101 before 102 and commit is a single high-water mark, a poison or slow record at 101 stalls 102+ indefinitely. Other partitions are unaffected; this one partition is frozen for this group.

partition log consumer deliver/commit stalled / blocked

This is inherent to ordered offset-based consumption and it is the reason the reference names "Kafka as a task queue" an anti-pattern. The mitigations are real but each gives something up:

Mitigation	How it helps	What it costs	Class
Dead-letter the poison record	Consumer catches the failure, ships the record to a DLQ topic, commits past it	You must write the DLQ + reprocessing logic; the bad record leaves the ordered stream	Tactic (application-level)
More partitions	Shrinks the blast radius, only keys on the stalled partition are blocked	Does not eliminate it; a hot key still funnels to one partition (Part II op03)	Mitigated
Share groups KIP-932	Per-record acquisition + redelivery: a stuck record is released/timed-out and the rest proceed; max-delivery-count + DLQ built in (Part I 15)	At-least-once redelivery; relaxed ordering, you give up per-key order to gain non-blocking dispatch	Mitigated
External parallel-consumer / push proxy	Decouples processing concurrency from partition count (e.g. Uber's uForwarder, Confluent Parallel Consumer)	Extra component; ordering semantics depend on the chosen key-level concurrency	Tactic (empirical: Uber uForwarder)

A hanging transaction is head-of-line blocking with a 15-minute fuse

For read_committed consumers the blocking can come from the producer side. A crashed transactional producer leaves a hanging transaction that pins the Last Stable Offset, so all later records on that partition become invisible to read_committed consumers until the transaction times out, bounded by the broker's transaction.max.timeout.ms (default 15 minutes) (Part I Transactions & EOS; empirical: ops-blueprint-reference §4). This is the same structural head-of-line property, strict offset order forces "everything before the LSO must resolve first", surfacing through the EOS machinery. Detect via LastStableOffsetLag; recover with the kafka-transactions.sh tooling (KIP-664).

Reusable tactic, make the failure of one record local and recoverable

Whenever you build ordered-stream processing, design the poison-pill path first: a try/catch that routes the failing record to a DLQ and commits past it converts an unbounded partition stall into a bounded, observable failure. The general lesson transfers far beyond Kafka: any strictly-ordered pipeline needs an escape hatch for the record that won't move, or the slowest/worst record sets the throughput of everything behind it.

5 · The latency floor, ms-scale, not sub-ms request/reply

Kafka is built to optimise throughput over latency, and several structural choices put a floor under per-message latency that no amount of tuning drops to the sub-millisecond range of an in-memory request/reply system. The floor has three structural contributors, and it is important to separate the parts that are inherent from the parts that are tunable.

Batching on the producer (tunable)

linger.ms + batch.size deliberately wait to amortise per-request cost, the latency added equals the linger window. Set near 0 for latency, higher for throughput.

ISR replication round-trip (inherent for acks=all)

A durable write waits for the in-sync followers to fetch the batch before the leader acks, a network round-trip to other brokers/AZs. Sync replication roughly halves throughput and adds a hop (Part I Replication).

Pull-based consumption (inherent)

Consumers poll; the broker does not push. At low load this adds fetch-cycle latency a push broker avoids, a deliberate throughput-over-latency choice (empirical: ops-blueprint-reference §8).

The result is a system whose floor is milliseconds, not microseconds. That floor is genuinely low when tuned, a tier-1 bank held sub-5 ms p99 end-to-end at 1.6M msg/s for trading, but only by tuning serialization (Protobuf), 10GbE, broker config, and minimising replication delay (empirical: ops-blueprint-reference §9). And much of the tail beyond the floor is OS/filesystem, not protocol: Allegro cut producer-latency outliers above a 65 ms SLO by 82% just by moving to XFS with tuned journaling (empirical: ops-blueprint-reference §9).

Contributor	Adds	Class	Lever / note
Producer batching	≤ `linger.ms`	Tunable	Trades latency for throughput; set 0–1 ms for latency (Part II Performance Tuning)
ISR replication (`acks=all`)	One broker round-trip (cross-AZ if replicas span AZs)	Inherent for durable writes; tunable down to `acks=1`/`0` by sacrificing durability	The durability ⇄ latency trade (Part II Durability)
Pull consumption	Fetch-cycle latency at low load	Inherent	`fetch.min.bytes`/`fetch.max.wait.ms` tune the batching, not the pull model
OS / filesystem tail	Variable (page-cache, fsync, GC)	Tunable	XFS, page-cache headroom, small heap + G1GC (empirical: Allegro; ops-blueprint-reference §3)

When the latency floor is disqualifying

If your requirement is synchronous, sub-millisecond request/reply, an in-memory cache lookup, an online feature fetch on the critical path of a user request, an order-matching inner loop, the log is the wrong abstraction, and no tuning changes that. Kafka has "no native point-to-point or on-the-fly reply topics," and request-reply over Kafka "returns the coupling we were trying to avoid" (empirical: ops-blueprint-reference §8). Use a gRPC/HTTP service or an in-memory store for the synchronous path; use the log for the durable, ordered, fan-out event stream that the synchronous path emits. The two compose; one does not replace the other.

Object-store / diskless designs trade the floor for cost

The newest variants move the floor in the wrong direction on purpose: WarpStream and KIP-1150-style diskless designs write straight to object storage, eliminating cross-AZ replication cost but raising produce p99 to ~400 ms and end-to-end p99 to ~1 s (empirical: ops-blueprint-reference §9; Appendix C). AutoMQ's WAL-assisted variant recovers single-digit-ms. The point for an architect: there is a latency ⇄ cost curve here too, and the cheapest designs sit far above Kafka's already-non-sub-ms floor. See Evolution and Part II Cost.

6 · All-or-nothing topic retention, no per-record TTL

Retention in Kafka is a topic-level policy: retention.ms and retention.bytes govern when whole segments age out (Part I Storage Management). There is no per-record TTL. A record's lifetime is the lifetime of its topic's retention window; you cannot say "this message expires in 30 seconds, that one in 30 days." This falls directly out of the segment design, the broker deletes by dropping whole segment files, an O(1) filesystem operation, which is what keeps retention cheap; per-record expiry would require scanning and rewriting segments, destroying the sequential-I/O property.

The one mechanism that approximates per-key lifetime is log compaction with tombstones: on a compacted topic, writing a record with a key and a null value (a tombstone) eventually causes the compactor to remove all prior values for that key, and the tombstone itself is purged after delete.retention.ms (Part I Storage Management; The Log Pattern). But that is key-level deletion, not time-based per-record TTL, and it is non-deterministic in timing, compaction "does NOT guarantee one record per key at any instant," only "at least the last value per key" eventually (empirical: ops-blueprint-reference §8).

You want…	Log provides	Class
"All data older than 7 days disappears"	`retention.ms` at topic level, native	Tunable (per topic)
"Keep only the latest value per key"	Compaction (`cleanup.policy=compact`)	Tunable (per topic)
"Delete this specific key now"	Tombstone (key + null) on a compacted topic, eventual, not immediate	Mitigated (key-level, non-deterministic timing)
"This message expires in 30s, that one in 30d"	Not provided, split by lifetime into separate topics	Inherent

Reusable tactic, partition lifetime into topics, and lean on compaction for "current state"

If records have materially different lifetimes, route them to separate topics with different retention, the same "topology as configuration" move as priority and routing. For the "table of latest values" use case (a changelog, a config store, a CDC snapshot), a compacted topic gives you durable current-state whose size is bounded by distinct live keys, independent of total throughput (Part I Storage Management). What the log structurally cannot give you is a GDPR-style "delete exactly this record at this moment" guarantee on a non-compacted topic, that is a known gap to design around (e.g. crypto-shredding: store the payload encrypted, delete the key).

7 · No server-side filtering, query, or secondary index

A consumer that needs a sparse subset of a topic must read everything and filter client-side. The broker offers exactly one access path: sequential read from an offset (Part I The Fetch Path). There is no WHERE clause, no secondary index, no point-lookup-by-value, no server-side projection. If 1% of records match your predicate, you still pay to transfer and decode 100%. This is the deliberate flip side of "Kafka is not a database", Kreps is explicit: "I don't think Kafka really benefits from trying to add any kind of random access lookups directly against the log"; Kafka replicates into specialised systems rather than replacing them (empirical: ops-blueprint-reference §8).

Need a sparse subset?e.g. 1% of records match

access pattern?

read all, filter client-sidenative; transfer + decode 100% to keep 1%

materialise into a query storeproject the stream → Postgres / Elasticsearch / KV; query there

random lookup against the loganti-pattern, the log has no index

The log offers one access path: sequential scan from an offset. Sparse or ad-hoc reads belong in a derived store you build by consuming the stream, not in the log itself.

native scan derived query store anti-pattern path ec-data (materialise)

This is inherent, and the canonical answer is the architecture this whole part is about: the log is the system of record; every queryable view is a derived projection you build by consuming the stream into a store designed for the query (Part I Kafka Streams, Kafka Connect; The Log Pattern). Need point lookups? Materialise into a key-value store. Need full-text? Into Elasticsearch. Need ad-hoc SQL? Into a warehouse. The log feeds them all and keeps them consistent because they replay the same ordered events.

The wasteful-fan-out trap

The read-everything cost is easy to under-estimate because it is invisible at design time and only shows up as cluster egress and consumer CPU at scale. If you find yourself adding a consumer group whose job is to read a firehose topic and discard 99% of it, that is a signal you have pushed a query into the transport. Either pre-split at the producer (route the 1% to its own topic), or accept that you should be querying a derived store, not scanning the log. Kafka does not push filtering to the broker, by design, so the broker can stay a zero-copy sequential machine (empirical: ops-blueprint-reference §8, §9).

8 · Complexity pushed to the clients, the fat-client burden

"Dumb broker / smart consumer" has a corollary the broker's simplicity hides: the intelligence did not disappear, it moved into the clients. The producer owns batching, compression, partitioning, idempotence sequencing, and ret/backoff; the consumer owns the fetch loop, offset management, deserialization, the rebalance protocol, and (for EOS) the read-process-write transaction. Kafka clients are among the fattest in the messaging world, and that fatness is a structural property of pushing logic out of the broker, not an accident to be refactored away.

Responsibility the broker does NOT carry	Where it lives	The burden it creates
Batching / compression / partitioning	Producer (Part I 16)	Many tunables (`batch.size`, `linger.ms`, compression, partitioner) that interact; wrong choices silently cost latency or throughput
Offset management / commit cadence	Consumer (Part I 17)	At-least-once vs at-most-once vs effectively-once is a client decision; auto-commit is a footgun
Group membership / rebalance	Consumer + coordinator (Part I 13)	The rebalance protocol is intricate; poll-interval and heartbeat misconfig cause rebalance storms (see §9)
Read-process-write atomicity (EOS)	Producer + consumer + coordinator (Part I 14)	Transactions, fencing, `read_committed`, hanging-txn recovery, significant application-side complexity

Why the fat client is the right call, and what it costs

Putting logic in the client is what lets the broker be a zero-copy sequential-I/O machine and what lets each client tune its own trade-offs (a latency-sensitive producer and a throughput-sensitive one talk to the same broker with opposite settings). The cost is borne by every team that writes a client: a steep learning curve, language-specific client quality variance (the JVM client is the reference; others lag), and the reality that most production incidents are client misconfiguration, not broker faults, the reference's failure-mode catalogue is dominated by client-side issues (rebalance storms, fencing avalanches, auto-commit data loss; empirical: ops-blueprint-reference §4). This is inherent to the architecture; the mitigation is organisational: wrap the raw clients in a vetted internal library with safe defaults. LinkedIn, Datadog, and others all built exactly such abstractions over the raw client (empirical: ops-blueprint-reference §7).

Reusable tactic, own a thin internal client wrapper

If more than a handful of teams use your Kafka, the highest-leverage investment is a shared client wrapper that encodes durability defaults (acks=all, idempotence, sane commit cadence), the DLQ pattern, metrics, and the cooperative rebalance protocol, so individual teams cannot reinvent the footguns. The fat-client burden does not go away, but you pay it once, centrally, instead of once per team per incident.

9 · Rebalance disruption, historically stop-the-world

When a consumer group's membership changes (a consumer joins, leaves, or is evicted for missing a heartbeat), the group rebalances: partitions are reassigned across members (Part I Group Coordination). The original protocol was "stop-the-world", every consumer revoked all its partitions, the group synchronised on a barrier, then everyone got a new assignment; during that window the whole group made no progress. On large groups with frequent membership churn this was a serious, recurring stall, and it is the single most-improved area of the protocol. This makes it the textbook example of a mitigated limitation, the structural fact (membership change forces reassignment) remains, but its blast radius has shrunk dramatically across three KIPs:

eager Stop-the-world: revoke ALL partitions, global sync barrier, reassign. Whole group stalls. KIP-429 (2.4) Incremental cooperative: keep partitions you'll retain; only revoke/move what changes. No global stop. KIP-848 (4.0) Broker-coordinated: assignment computed at the coordinator via ConsumerGroupHeartbeat; no client-side sync barrier. Up to ~20× faster.

The rebalance protocol evolved from a full stop-the-world barrier to incremental cooperative reassignment to broker-driven assignment, progressively removing the global stall while the underlying need to reassign on membership change is unchanged.

eager → cooperative → broker-coordinated protocol evolution

KIP-429 (Kafka 2.4) introduced incremental cooperative rebalancing (the CooperativeStickyAssignor): a consumer keeps the partitions it will retain and only the genuinely-moving partitions are revoked, eliminating the global stop. KIP-848 (GA in Kafka 4.0) goes further, moving assignment computation to the broker-side group coordinator via a ConsumerGroupHeartbeat RPC, removing the client-side synchronization barrier entirely and reported up to ~20× faster (empirical: ops-blueprint-reference §9). The reference is explicit that the old critique is now mostly historical:

"Rebalances are stop-the-world" is largely historical, but the residual is real

KIP-429 and KIP-848 "substantially reduce disruption"; critiques citing a full stop-the-world stall "apply to old clients/eager assignors" (empirical: ops-blueprint-reference §9 cautions). What they do not remove: membership change still triggers reassignment, partition movement still pauses processing of the moved partitions, and the most common production failure is still the self-sustaining rebalance storm, a slow consumer exceeds max.poll.interval.ms (default 5 min), gets evicted, rejoins, triggers another rebalance (empirical: ops-blueprint-reference §4). The mitigation toolkit, cooperative assignor, static membership (KIP-345, stable group.instance.id so a restart does not trigger reassignment), right-sized max.poll.interval.ms, smaller max.poll.records, addresses the residual but does not make membership changes free.

Reusable tactic, make membership stable, not just rebalancing cheap

The deepest fix is to stop triggering rebalances in the first place: static membership keeps assignments across rolling restarts and deploys, and sizing max.poll.interval.ms above worst-case batch processing time stops slow consumers from self-evicting. Cheap rebalancing (cooperative/KIP-848) and rare rebalancing (static membership) are complementary, adopt both. The transferable lesson: in any membership-based distributed system, the cost of reassignment matters less than the frequency of reassignment, invest in stability of membership before cleverness of reassignment.

10 · Cross-AZ replication cost, failure-independence has a cloud bill

Kafka's durability comes from placing replicas on failure-independent brokers, and in the cloud "failure-independent" means "in different availability zones." That is exactly right for resilience, and it collides head-on with cloud network pricing, where inter-AZ traffic is billed in both directions. With 3-AZ RF=3, every 1 GB of ingress produces ~2 GB of cross-AZ replication traffic before you count producer and consumer cross-zone hops (empirical: ops-blueprint-reference §9). The result is the most material cost critique of self-managed Kafka:

Networking, not storage or compute, dominates the bill

Confluent's own model puts networking at ~70% of self-managed infra at 20 MBps and ~87% at 100 MBps; "after tiered storage, networking alone can comprise ~90%" of TCO (empirical: ops-blueprint-reference §6, §9). The replication multiplier is the driver: RF=3 across 3 AZs means the (RF−1) = 2 cross-AZ copies of every byte are unavoidable given the durability model. This is the structural tension, failure-independence and cross-AZ billing are the same placement decision.

Which parts are inherent and which are tunable is the crux, and the reference is precise about it:

Cross-AZ component	Driver	Lever	Class
Consumer fetch cross-AZ	Consumers read the leader, often in another AZ	Fetch-from-follower KIP-392 + rack-awareness KIP-881: read a same-AZ replica → can cut ~all consumer cross-AZ, ~50% total cluster cost	Tunable (mitigation), at the cost of up to ~500 ms tail latency & broker skew
Replication cross-AZ (RF−1 copies)	Durability requires failure-independent replicas	Lower RF on non-critical topics; shorter retention. Classic Kafka cannot remove it without losing failure-independence	Inherent for the durability model; tunable only by lowering durability
Producer cross-AZ	Producers write the leader, often in another AZ	Not removed by FFF; only diskless/object-store designs eliminate it (write straight to S3, leaderless)	Inherent for classic Kafka; removed only by a different storage architecture
Storage (separate axis)	Local triple-replicated SSD is ~10–20× S3 per-GiB	Tiered storage KIP-405 offloads cold segments to S3	Mitigated for storage, but does NOT touch networking (common misconception)

Tiered storage cuts storage, not the network, and a hard cross-AZ floor remains

A frequent and expensive misconception: tiered storage (KIP-405) does not reduce cross-AZ networking, it only moves cold storage to S3; after adopting it, networking can become an even larger share of TCO (empirical: ops-blueprint-reference Appendix C). Even with fetch-from-follower fully deployed, produce + replication cross-AZ traffic remains (~$13.8k of AutoMQ's $24k example), only diskless / object-store designs remove it, and they do so by trading the latency floor (§5). The architect's takeaway: cross-AZ cost is part inherent (the replication multiplier is the durability model) and part tunable (consumer-side via FFF). The full cost model, levers, and dollar figures are in Part II Cost; the structural point is that you cannot have failure-independent replicas and zero cross-AZ traffic in classic Kafka. See also Evolution for the diskless trajectory.

11 · Metadata as a single log, the control-plane scaling bound

KRaft's elegance is that it eats its own dog food: cluster metadata is itself a distributed commit log. All metadata, topics, partitions, leaders, ISR, configs, ACLs, is a stream of records in one Raft-replicated topic partition, __cluster_metadata-0, applied by a single-threaded event loop in QuorumController (Part I KRaft Consensus, The KRaft Controller). This is a massive improvement over ZooKeeper for failover, standbys replay the same log and converge on byte-identical state, so failover transfers no state and is near-instant (empirical: KIP-500; Part I 11). But notice that the metadata plane inherits the log's own §1 limitation: it is a single partition, single-writer, serially-applied log. That gives it a structural scaling bound that mirrors the data plane's.

All cluster metadata → one log: __cluster_metadata-0

single Raft partition; one active controller is the sole writer

↓

QuorumController, single-threaded event loop

every admin op becomes records, applied serially via replay() in offset order, the same per-partition-order discipline as a data topic

↓

Every broker is an observer replicating the same log

total metadata = (partitions + topics + configs + ACLs) records held in memory per broker

KRaft applies the log pattern to metadata itself: one partition, one writer, serial apply. This is why failover is trivial, and why the metadata plane has a single-log scaling ceiling, gated by serial apply throughput and per-broker memory for the full metadata image.

control plane (metadata log) broker/observer replicate / apply

Property of the metadata log	Consequence	Class
Single partition, single active writer	Metadata mutation throughput is one controller's serial apply rate, it does not scale horizontally	Inherent (same as §1, applied to metadata)
Every broker holds the full metadata image in memory	Total partitions/topics/configs are bounded by per-broker RAM, not just the controller's	Tunable up to a wall (RAM, `vm.max_map_count`)
Standbys replay the same log	Failover transfers no state, near-instant (the upside)	Mitigation of the ZK-era failover cost
Snapshots bound replay length	A new/rejoining controller loads a snapshot + tail, not the whole history (Part I 10)	Mitigation (keeps startup bounded)

Why this matters for "millions of partitions"

This single-log control plane is the real reason the "millions of partitions" claim is a ceiling on existence, not performance (§2). Every partition is metadata records that every broker holds in memory and that one controller applies serially; pile on enough and you bound metadata-change latency and per-broker RAM long before you bound disk. KRaft removed the ZooKeeper failover bottleneck, a genuine, large mitigation, but it replaced it with a single-writer log whose throughput and memory footprint are themselves the new bound. The metadata plane is fast at failover and finite at scale for the same reason: it is a single distributed log. The pattern's strength and its limit are, once again, the same decision.

The catalogue at a glance

The recurring shape of this chapter is that every limitation is the shadow of a strength, and the architect's job is to know which shadows move and which are fixed. The summary matrix:

#	Limitation	Inherent / Mitigated / Tunable	The strength it pays for	Design-around / lever
1	Ordering per-partition only	Inherent	Sharded parallelism	Order per-key; reserve 1 partition for true global order
2	Partition-count ceiling & rigidity	Mixed: consumer-cap & keyed-rigidity inherent; cluster-count mitigated (KRaft); per-broker tunable	Parallelism = partitions; cheap per-key routing	Over-provision keyed counts; raise FD/mmap; see op03
3	No routing/priority/TTL/selective read	Inherent; partially mitigated by share groups	Dumb broker = zero-copy throughput	Topology as routing; share groups KIP-932 for per-record dispatch
4	Per-partition head-of-line blocking	Inherent; mitigated (share groups, DLQ)	Strict ordered, replayable consumption	DLQ the poison record; share groups; parallel-consumer
5	Latency floor (ms, not sub-ms)	Inherent floor; tail tunable	Throughput-first batching + durable replication	Don't use for sync request/reply; tune linger/FS for the tail
6	All-or-nothing topic retention	Inherent (no per-record TTL); mitigated by compaction	O(1) segment-drop retention	Split by lifetime into topics; compaction + tombstones
7	No server-side filter/query/index	Inherent	Single sequential access path	Materialise derived query stores; pre-split at producer
8	Complexity in the clients	Inherent	Per-client tuning; thin broker	Shared vetted client wrapper with safe defaults
9	Rebalance disruption	Mitigated (KIP-429, KIP-848)	Elastic consumer-group membership	Cooperative assignor + static membership KIP-345
10	Cross-AZ replication cost	Inherent (replication); consumer-side tunable	Failure-independent durability	Fetch-from-follower KIP-392; diskless trades latency; op10
11	Metadata as a single log	Inherent bound; failover mitigated (KRaft)	Trivial, stateless controller failover	Bound partition counts; accept it as a ceiling on existence

When the log is the wrong choice, the negative space

Stacking the inherent limitations gives the most useful output of this chapter: a crisp list of workloads where the distributed log is structurally the wrong tool, no matter how it is tuned. An architect who memorises this list will save a team from months of fighting the abstraction.

Is the distributed log the right pattern?

Need synchronous sub-ms request/reply?

WRONG, use RPC / in-memory store§5 latency floor; no native reply path

Need point-lookup / ad-hoc query / index?

WRONG, use a database§7 no index; materialise a derived store instead

Need per-message routing/priority/TTL, no order?

WEAK, prefer a message broker§3 RabbitMQ-style routing; share groups only partial

RIGHT, the log is the sweet spotordered per-key history, decoupled fan-out, replay/CDC

The negative-space decision spine: three structural disqualifiers (sub-ms request/reply, primary-store ad-hoc query, rich per-message routing) branch off to the right and route you away from the log; what remains, ordered, replayable, high-throughput fan-out, is exactly where it wins.

log is right log is wrong/weak / decision path ec-err off-ramp (disqualifier)

The three structural disqualifiers

(1) Synchronous sub-millisecond request/reply, the latency floor (§5) and the absence of a native reply path make the log the wrong transport; use RPC/HTTP or an in-memory store. (2) Primary store for point lookups or ad-hoc queries, no index, no query language (§7); the log feeds derived query stores, it does not replace them. (3) Rich per-message routing, priority, or TTL with no ordering need, a smart-broker (RabbitMQ) is the better fit (§3); share groups close part but not all of the gap. Everything else, durable ordered history, decoupled high-throughput fan-out, replay, CDC, stream processing, is the log's home ground, and the limitations in this chapter are the price you knowingly pay for it. Carry this list to When To Use and the consolidated Architect's Cheat-Sheet.

The transferable meta-lesson

The single most valuable habit this chapter teaches is not Kafka-specific: for any system you evaluate, separate the inherent from the mitigated from the tunable before you commit. Inherent limits define the problem class the system can serve; mitigated limits define how much operational maturity you must buy; tunable limits define the curve you will spend your career picking points on. Vendors and benchmarks blur these lines on purpose, the reference repeatedly flags claims that present a tunable default as an inherent flaw, or an inherent flaw as a "just tune it" non-issue (empirical: ops-blueprint-reference §9 cautions). The architect's edge is refusing that blur.