II · 07 · Failure Modes & The Runbook
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.
Almost every Kafka incident is a downstream symptom of one of a small, knowable set of root causes: a broker or its disk goes away, a follower falls behind (network, slow disk, GC pause), the controller quorum loses majority, load skews onto one partition, or a client (consumer group, transactional producer) misbehaves. This chapter is the catalogue as a runbook. For each failure it gives you the same four things in the same order, CAUSE → SYMPTOM (the specific metric you will actually see) → MECHANISM (the source line that produces the behaviour) → MITIGATION (the dial, the command, the design fix). The discipline matters under pressure: UnderReplicatedPartitions > 0 is never a root cause, it is a consequence, and the runbook's job is to take you from the alert to the line of code that explains it, fast. Every default, limit, and metric here is verified against the 4.4 source; every benchmark or heuristic that the code cannot give is marked empirical and carries its named source. Wire each symptom to its alert in II · 08 Metrics & Signals; the durability arithmetic that makes most of these recoverable lives in II · 06 Durability.
The failure taxonomy: how to read this chapter
The failures below cluster into five families. Knowing the family tells you which metric to look at first and which Part I mechanism is in play. A single broker dying touches the replication layer; a hanging transaction touches the coordinator; a rebalance storm touches group coordination. The taxonomy is the index into the runbook.
| Family | Failures in this chapter | First metric to read | Part I mechanism |
|---|---|---|---|
| Replication / availability | Broker loss, slow follower, network partition, under-min-ISR, offline partitions, unclean election | UnderReplicatedPartitions, OfflinePartitionsCount | 08 Replication |
| Storage | Log-dir / disk failure, full disk, page-cache pressure / cold reads | OfflineLogDirectoryCount, LogFlushRateAndTimeMs | 03, 04 |
| Control plane | Controller quorum loss, metadata lag (stale routing) | ActiveControllerCount, metadata-lag gap | 10, 11, 12 |
| Saturation | GC pauses, request-handler & purgatory saturation, fetch-session eviction, hot/skewed partition | RequestHandlerAvgIdlePercent, RequestQueueTimeMs | 06, 07, 09 |
| Client / protocol | Rebalance storm, producer fencing, hanging transaction | records-lag-max, LastStableOffsetLag, PartitionsWithLateTransactionsCount | 13, 14 |
When a runbook says "raise X," know what X is. A HARD limit is a compiled-in constant, you change it only by changing source (e.g. the metadata-dir failure forces Exit.halt(1), ReplicaManager.scala:2190, not tunable). A SOFT limit is a config key with a default (e.g. replica.lag.time.max.ms=30000), tunable at runtime. An EMERGENT limit is a consequence of resources or architecture (e.g. recovery time ≈ partition bytes ÷ re-replication bandwidth), you change it by changing the deployment, not a setting. The mitigation column of each section tells you which lever you actually have.
Broker loss
CAUSE. A broker process dies, the host is rebooted, the NIC drops, or an operator stops it without controlled shutdown. Its leaderships must move; its replicas leave every ISR they were in.
SYMPTOM. UnderReplicatedPartitions (the kafka.server:type=ReplicaManager gauge) jumps to a steady non-zero value across the surviving brokers, every partition that had a replica on the dead node now has |ISR| < |replicas|. The controller emits a burst on LeaderElectionRateAndTimeMs. OfflineReplicaCount rises. Crucially the URP number is flat, not fluctuating, that constancy is how you distinguish "a broker is gone" from "a follower is struggling" (next section).
MECHANISM. isUnderReplicated is exactly isLeader && (assignmentState.replicationFactor - partitionState.isr.size) > 0 (Partition.scala:232). When the dead broker held leaderships, the KRaft controller elects new leaders from the surviving ISR, a clean election, no data loss, because by definition the survivors held every committed (high-watermark) record. The empirical cost of moving leadership is small per partition but adds up: ~5 ms to elect a leader for one partition, so a broker holding ~1,000 leaderships can take up to ~5 s of partial unavailability on an unclean loss (empirical, Jun Rao 2015, a ZooKeeper-era hardware figure; treat it as the mechanism, not a 2026 SLA). Under KRaft the controller does not reload metadata from ZooKeeper, so the second, ZK-era cost term, ~2 ms/partition of metadata init on controller failover, is gone (see Part I 10).
MITIGATION. With the production durability standard (replication.factor=3, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false, see II · 06), one broker loss is zero acknowledged-data loss and writes keep flowing: the partition still has two in-sync replicas, which satisfies the min-ISR gate. Action: bring the broker back; its followers re-replicate from the leaders' tails and rejoin each ISR, driving URP back to 0 (you will see IsrExpandsPerSec tick). If the broker is gone for good, reassign its replicas with kafka-reassign-partitions.sh and throttle the move (leader.replication.throttled.rate) so re-replication does not saturate NICs, the dominant recovery bottleneck at scale (empirical: Netflix documented the cascade slow-outlier → replication lag → leaders read from disk → buffer exhaustion → message drops). The EMERGENT rule: recovery time ≈ (bytes to re-replicate) ÷ (per-broker replication bandwidth); tiered storage (Part I 05) shrinks the numerator because cold data need not be re-replicated on rejoin (empirical, Uber).
Slow follower (ISR shrink without a broker loss)
CAUSE. A follower is alive but cannot keep up: its disk is slow, the inter-broker link is congested, it is mid-GC-pause, or its leader's write rate exceeds the follower's fetch rate. It has not died, so this is not broker loss, but it is no longer in sync.
SYMPTOM. UnderReplicatedPartitions fluctuates with every broker up; IsrShrinksPerSec ticks without a matching IsrExpandsPerSec shortly after, the canonical signature of a struggling replica (an isolated shrink with no re-expand is the one to chase). AtMinIsrPartitionCount may rise if the ISR is now exactly at the floor.
MECHANISM. The leader runs maybeShrinkIsr() on a timer (Partition.scala:1089). A follower is judged out of sync by isFollowerOutOfSync → isCaughtUp(leaderEndOffset, currentTimeMs, maxLagMs) (Partition.scala:1133, 1155). The window maxLagMs is replica.lag.time.max.ms, default 30000 ms (ReplicationConfigs.java:55, REPLICA_LAG_TIME_MAX_MS_DEFAULT = 30000L). Two cases are caught (comment at Partition.scala:1142): a stuck follower whose LEO has not advanced in 30 s, and a slow follower that has not reached the leader's LEO within 30 s. Either way it is dropped from the ISR and URP for that partition becomes true. This is a time-based, not offset-based, lag check, KIP-16, which is why a follower that is merely behind but steadily fetching stays in the ISR, while one that stalls is ejected.
An offset-count threshold (the pre-KIP-16 replica.lag.max.messages) was unworkable: the "right" number of messages-behind depends entirely on the topic's write rate, so a single cluster-wide value either evicted healthy followers on bursty topics or kept dead ones on slow topics. The 30 s wall-clock window is rate-independent, a follower is "in sync" if it caught up to the leader's LEO at some point in the last 30 s. The doc even warns (ReplicationConfigs.java:77) that low-throughput topics need replica.fetch.wait.max.ms < replica.lag.time.max.ms to avoid frequent ISR shrinking.
MITIGATION. Triage in order (empirical slow-follower order, ops reference §4): (1) Is a broker actually down? (no → continue.) (2) Is the shrink concentrated on one broker's partitions? → that broker's disk or NIC. (3) GC pauses on the follower? → see GC section. (4) Leader write rate > follower fetch rate? → raise num.replica.fetchers (default 1, ReplicationConfigs.java:96; recommend 4–8) so more fetcher threads pull from the busy leader, and/or fix the slow disk. Keep replica.fetch.wait.max.ms (default 500 ms, ReplicationConfigs.java:75) safely under the 30 s lag window. Do not reflexively raise replica.lag.time.max.ms to silence the alert, that widens the window in which a "still in ISR" follower is actually far behind, which is precisely the data you would lose in an unclean election.
Network partition
CAUSE. A network split isolates one or more brokers (or an entire AZ) from the leaders and/or the controller quorum. From the leader's perspective the isolated followers simply stop fetching.
SYMPTOM. On the majority side: IsrShrinksPerSec for partitions whose followers are across the split; UnderReplicatedPartitions rises. If a partition is left with no in-sync replica that can reach the controller, OfflinePartitionsCount > 0. If unclean.leader.election.enable=true (not the default), UncleanLeaderElectionsPerSec may fire on the majority side. On the minority side, brokers that cannot reach the controller stop receiving metadata updates.
MECHANISM. Same ISR machinery as the slow follower, the isolated followers cross the replica.lag.time.max.ms window and are ejected, but with an added control-plane hazard: KRaft requires a majority of the controller quorum to make progress (QuorumState.java:697, "Cannot become leader without majority votes granted"). The minority side cannot elect a controller and cannot accept metadata writes. Whether data loss is possible depends entirely on the unclean-election dial (next section).
MITIGATION. Topology is the mitigation, configured before the partition: spread replicas across racks/AZs (broker.rack) so a single-AZ split still leaves a majority ISR on the surviving side, and run controllers in an odd count across ≥3 AZs so a single-AZ loss never costs the quorum majority. Keep unclean.leader.election.enable=false (the default, LogConfig.java:139) so the majority side waits for an in-sync replica rather than electing a stale one. When the split heals, isolated followers re-fetch and rejoin ISRs automatically (IsrExpandsPerSec ticks).
Under-min-ISR: writes refused (the safety valve working)
CAUSE. Enough in-sync replicas have been lost (broker loss + slow follower, or two broker losses) that the ISR for an acks=all partition has dropped below min.insync.replicas. This is the durability machine doing its job: it would rather refuse a write than accept one it cannot make durable.
SYMPTOM. Producers with acks=all receive NotEnoughReplicasException (or NOT_ENOUGH_REPLICAS_AFTER_APPEND); the broker gauge UnderMinIsrPartitionCount (kafka.server:type=ReplicaManager) is > 0. The per-partition gauge UnderMinIsr flips to 1 (Partition.scala:223).
MECHANISM. The leader gates the write before appending. In appendRecordsToLeader (Partition.scala:1234): if (inSyncSize < minIsr && requiredAcks == -1) throw new NotEnoughReplicasException(...). There is a second gate at acknowledgement time: even if the record was appended just before the ISR shrank, NOT_ENOUGH_REPLICAS_AFTER_APPEND is returned if minIsr > curMaximalIsr.size (Partition.scala:977). isUnderMinIsr is partitionState.isr.size < effectiveMinIsr(...) && isLeader (Partition.scala:237).
min.insync.replicas ≤ replication.factor)It is the effective min-ISR that gates, and it is silently clamped to the replica count: effectiveMinIsr = leaderLog.config.minInSyncReplicas.min(remoteReplicasMap.size + 1) (Partition.scala:246–247). So a topic with replication.factor=1 but min.insync.replicas=2 has an effective min-ISR of 1, it will happily accept acks=all writes with a single copy, because min(2, 1) = 1. You believe you have a durability floor of 2; you have 1. The comment at Partition.scala:242 spells it out: "Even if the value does not make sense to be larger than the replication factor… the effective min ISR of min(replication factor, min ISR) is returned." Audit every topic for min.insync.replicas ≤ replication.factor; a mismatch is a data-loss footgun that no alert fires on. (Empirical corroboration: Conduktor, ops reference §4.)
MITIGATION. Under-min-ISR is a symptom of lost replicas, so the fix is to restore them, bring back the down broker or speed up the slow follower (above) until the ISR re-expands to ≥ min.insync.replicas. Do not "fix" it by lowering min.insync.replicas to 1 under pressure: that converts a write-availability incident into a potential silent-data-loss incident, because the leader will then ack writes it holds on a single copy. If write availability genuinely matters more than durability for a specific topic (a known, deliberate tradeoff), that is a topic design decision made in advance, not an incident-time panic edit.
Offline partitions: no leader, unavailable
CAUSE. A partition has no leader. Causes: the last in-sync replica was lost, a network split removed every reachable ISR member, the leader's log directory failed, or (the meta-cause) the controller cannot run an election because the quorum lost majority.
SYMPTOM. OfflinePartitionsCount > 0, the controller-level gauge at ControllerMetadataMetrics.java:62 (kafka.controller:type=KafkaController,name=OfflinePartitionsCount). Producers and consumers for that partition get LEADER_NOT_AVAILABLE / NOT_LEADER_OR_FOLLOWER and retry into a wall. This is a page-immediately signal: any value > 0 means active unavailability.
MECHANISM. Because the gauge lives on the active controller, it is only meaningful when a controller exists; aggregate it with SUM across the cluster and read the active controller's value. A partition is offline whenever the controller has no eligible replica to promote and unclean.leader.election.enable=false forbids promoting an out-of-sync one. The append path returns Errors.NOT_LEADER_OR_FOLLOWER for the no-leader case (Partition.scala:981).
MITIGATION. Identify why there is no leader (the branches below) and fix that root cause; the partition comes back online the moment the controller can elect an in-sync leader. If, and only if, you have accepted potential data loss in advance, enabling unclean election will restore availability by promoting a stale replica (see next). Prevention is RF≥3 across failure domains plus a healthy controller quorum, so the loss of any one broker or AZ never strands a partition without an eligible leader.
Unclean leader election: availability bought with data
CAUSE. Every in-sync replica for a partition is gone, and unclean.leader.election.enable=true, so the controller promotes an out-of-sync replica to leader. That replica is, by definition, missing records that were committed on the (now-dead) ISR. Those records are lost forever.
SYMPTOM. UncleanLeaderElectionsPerSec > 0 (kafka.controller:type=ControllerStats). Any non-zero value is a data-loss event, treat it as a critical, page-immediately alert. Consumers may see the log end offset go backwards; downstream systems silently miss records.
MECHANISM. The controller marks the election unclean and resets recovery state. In ReplicationControlManager.java:1140–1141, when uncleanLeaderElectionEnabledForTopic(...) is true, it sets PartitionChangeBuilder.Election.UNCLEAN. PartitionChangeBuilder.java:348–350 then sets LeaderRecoveryState.RECOVERING (value 1, LeaderRecoveryState.java:33), the partition is flagged as having recovered through an unclean path. The new leader's log becomes the source of truth; any records past its LEO that existed only on the lost ISR are gone. The frequency of the controller's unclean-election sweep is unclean.leader.election.interval.ms, default 5 minutes (ReplicationConfigs.java:123, TimeUnit.MINUTES.toMillis(5)).
In modern Kafka, including this KRaft 4.4 build, unclean.leader.election.enable defaults to false (LogConfig.java:139, DEFAULT_UNCLEAN_LEADER_ELECTION_ENABLE = false). But it was true before 0.11.0, a default flip that has bitten teams who carried old broker configs forward. (Empirical: Datadog suffered temporary data loss on a pre-0.11.0 cluster, recovered only because they dual-wrote to a secondary cluster, ops reference §4.) Never assume; verify the effective value per topic with kafka-configs.sh --describe. If you ever deliberately enable it for availability, you are accepting that an all-ISR-loss event will lose data, make that an explicit, documented topic policy, and prefer KIP-966 Eligible Leader Replicas (GA in this build; see II · 06) which narrows the set of replicas eligible to become leader so "unclean" is needed far less often.
MITIGATION. Prevention is the whole game: unclean.leader.election.enable=false + RF≥3 across AZs means all-ISR-loss requires ≥3 simultaneous independent failures. If unclean election has already happened, the lost data is unrecoverable from Kafka; recovery is an application concern (replay from an upstream source of truth, reconcile from a secondary cluster). The operational lesson is to alert on UncleanLeaderElectionsPerSec > 0 so you know it happened and can trigger reconciliation, rather than discovering it weeks later as a data-quality bug.
Log-directory / disk failure
CAUSE. A physical disk or log directory throws an IOException, a failing drive, a corrupted filesystem, an unmounted volume. Kafka supports JBOD (multiple log dirs), so one bad disk need not take the whole broker down.
SYMPTOM. The broker logs "Stopping serving replicas in dir <path> … because the log directory has failed". The partitions that lived on that dir go offline on that broker (their replicas leave the ISR cluster-wide). OfflineLogDirectoryCount > 0; if those replicas were leaders, you see leader elections and a URP bump elsewhere.
MECHANISM. On the first IOException against a dir, the offending code calls logDirFailureChannel.maybeAddOfflineLogDir(dir, msg, e), which records the dir once and enqueues it (LogDirFailureChannel.java:60–65). A dedicated thread, LogDirFailureHandler, blocks on takeNextOfflineLogDir() (ReplicaManager.scala:229) and runs handleLogDirFailure(dir) (ReplicaManager.scala:2157): it finds every online partition whose log's parentDir is the failed dir, removes their fetchers, and calls markPartitionOffline on each (ReplicaManager.scala:2176–2177). The dir "will stay offline until the broker is restarted" (the channel's own doc, LogDirFailureChannel.java:36), there is no auto-recovery of a failed dir within a running broker. Under KRaft the broker also notifies the controller of the directory failure (directoryEventHandler.handleFailure(uuid), ReplicaManager.scala:2195), this is the KIP-858 AssignReplicasToDirs reconciliation path. Mechanism originally KIP-112. See Part I 03 / 08.
If the failed dir is the KRaft metadata log dir, the broker does not survive it, handleLogDirFailure hits fatal("Shutdown broker because the metadata log dir … has failed"); Exit.halt(1) (ReplicaManager.scala:2188–2190). This is a HARD behaviour, an immediate, ungraceful halt, not a tunable. The same Exit.halt(1) fires if a failed dir has no directory UUID to propagate (ReplicaManager.scala:2197–2198). Implication: site the metadata log dir on your most reliable storage, and never co-locate it with churny data dirs whose disks fail first.
MITIGATION. With RF≥2 a single data-dir failure is non-data-losing: the partitions' other replicas keep leading and the cluster re-replicates onto healthy brokers/dirs. Action: replace the disk (or remove the dir from log.dirs), restart the broker, let it re-replicate. To bound recovery, raise num.recovery.threads.per.data.dir (default 1) so startup log scanning parallelizes. Put data dirs on independent physical disks (JBOD) rather than one big RAID volume so a single drive failure offlines one dir, not all of them, and keep the metadata dir separate and reliable per the warning above.
Full disk: the ungraceful stop
CAUSE. A log directory fills (retention misconfigured, an unexpected traffic spike, a stuck consumer preventing log deletion, a hanging transaction blocking compaction). A write fails with No space left on device, which surfaces as an IOException, i.e. the disk-failure path above, except the "disk" is fine and merely full.
SYMPTOM. The same log-dir-failure log line and OfflineLogDirectoryCount bump, often preceded by LogFlushRateAndTimeMs climbing and disk-usage metrics crossing 85–90%. In the worst case the broker halts (the metadata-dir Exit.halt(1) path, or a generic out-of-space crash). (Empirical: a full disk crashes the broker with no graceful shutdown, ops reference §4.)
MECHANISM. A full disk is an IOException source, so it flows through LogDirFailureChannel exactly like a hardware failure, Kafka does not distinguish "disk broken" from "disk full" at the failure-channel boundary. There is no compiled-in free-space reserve that pre-emptively stops writes (KIP-928 works toward making log dirs resilient to fullness); operationally you must keep headroom yourself.
The safe recovery (empirical, Conduktor, ops reference §4): stop the broker → reclaim space by lowering retention dynamically via kafka-configs.sh --alter --add-config retention.ms=… (preferred) or deleting only old closed segments, never the active/newest .log/.index/.timeindex for a partition, which corrupts it → restart at 10–20% free → verify with kafka-topics.sh --describe --under-replicated-partitions returning empty. Alert at 70% (warning) and 85% (critical) disk usage and on OfflineLogDirectoryCount > 0 so you act before the halt, not after.
MITIGATION. Prevention: size retention against real ingress (II · 04) with comfortable headroom; alert on disk usage well below full; enable tiered storage so the hot disk holds only recent data while the bulk ages out to object storage (Part I 05). With RF≥2 a single full disk is recoverable as above; the cluster stays available on the other replicas while you reclaim space.
Controller quorum loss: the cluster stops thinking
CAUSE. A majority of the KRaft controller quorum is unreachable (host failures, an AZ outage, a partition that splits off the majority). With no majority, no controller can be elected and no metadata write can commit.
SYMPTOM. SUM(ActiveControllerCount) == 0 across the cluster (the per-controller gauge at QuorumControllerMetrics.java:47, kafka.controller:type=KafkaController,name=ActiveControllerCount). Topic creation/deletion hang; leader elections do not happen (a failed broker's partitions stay offline because nothing can re-elect); kafka-metadata-quorum.sh describe shows lagging or unreachable voters. Existing leaders keep serving produce/fetch, the data plane survives a controller-quorum outage; the control plane is frozen.
MECHANISM. KRaft is a Raft quorum: a leader needs a majority of votes (QuorumState.java:697, "Cannot become leader without majority votes granted"), and progress requires a majority to have reached the latest offset (QuorumState.java:703). Voter liveness is tracked by controller.quorum.fetch.timeout.ms, default 2000 ms (QuorumConfig.java:83, DEFAULT_QUORUM_FETCH_TIMEOUT_MS = 2_000); a follower that misses the fetch window starts an election after controller.quorum.election.timeout.ms, default 1000 ms (QuorumConfig.java:77). With a 3-controller quorum you tolerate 1 loss (2 of 3 is a majority); with 5 you tolerate 2. Lose the majority and the math says no leader, by design, to prevent split-brain. The active controller's progress is observable via LastAppliedRecordOffset (QuorumControllerMetrics.java:55). See Part I 10 and 11.
Brokers cache the metadata they need to keep leading partitions; they do not consult the controller on the produce/fetch hot path (Part I 12). So a controller-quorum outage degrades gracefully: existing leaders keep accepting acks=all writes, but nothing can change, no new leaders, no new topics, no reassignments, no ISR membership commits. This is the opposite of a data-plane outage and is why losing the controller majority is "serious but not total." Restore the majority and the backlog of metadata changes applies.
MITIGATION. Run an odd number of controllers (3 or 5) spread across ≥3 AZs so no single AZ holds the majority, losing one AZ then leaves a majority alive. Recovery: bring failed controllers back so a majority can form; if a controller is permanently lost, replace it and let it catch up from the metadata log (the snapshot + tail). Do not run a 2-controller quorum (it tolerates zero losses, worse than a single controller) and do not co-locate all controllers in one rack/AZ.
Metadata lag: stale routing
CAUSE. A broker (or client) is acting on out-of-date metadata, it has not yet applied the latest leadership/ISR changes from the controller. Common after a leadership move, a reassignment, or under controller load.
SYMPTOM. Clients get NOT_LEADER_OR_FOLLOWER and retry; produce/fetch briefly route to a broker that is no longer the leader. The active controller's LastAppliedRecordOffset advancing far ahead of a broker's applied offset indicates that broker is behind. Usually transient.
MECHANISM. KRaft brokers consume the metadata log and apply records asynchronously; there is a propagation delay between the controller committing a change and every broker observing it (Part I 12). During that window a client's cached routing can be stale. The client's normal recovery is automatic: on NOT_LEADER_OR_FOLLOWER it refreshes metadata and retries against the new leader.
MITIGATION. Transient lag is self-healing, the client's metadata refresh + retry resolves it within the propagation window. Persistent lag points at a struggling broker (slow disk applying the log, GC) or controller overload, investigate that broker's saturation metrics. Do not lower client retry/backoff so aggressively that a brief leadership move turns into a retry storm; the defaults already absorb normal propagation delay.
GC pauses: the soft outage
CAUSE. A stop-the-world JVM garbage-collection pause freezes a broker for hundreds of milliseconds to seconds. During the pause the broker fetches nothing, heartbeats nothing, and answers nothing, it looks dead without being dead.
SYMPTOM. A cluster of correlated symptoms that all point back to one paused broker: IsrShrinksPerSec (the paused broker's followers stop fetching and cross the lag window), brief UnderReplicatedPartitions, consumer session.timeout.ms expiries triggering rebalances, and a spike in request latency on that broker. The GC logs show the pause directly.
MECHANISM. A pause longer than replica.lag.time.max.ms (30 s) would eject the broker's followers; in practice even sub-second pauses cause transient ISR shrink/expand churn, and multi-second Full GCs cause real drops. The page-cache architecture is why the heap can stay small: Kafka leans on the OS page cache for reads/writes (zero-copy sendfile), so a large heap buys little but costs long pauses. (Empirical: a 500 ms pause can trigger rebalances; multi-second Full GC causes ISR drops; LinkedIn's busy clusters run ~21 ms p90 GC pause with <1 young-GC/sec, while a 32 GB G1 heap can incur 100–200 ms pauses, ops reference §4.)
Keep the broker heap small, ~6 GB, so the rest of RAM is page cache, and use G1GC with a pause target: -Xms6g -Xmx6g -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35. Only consider ZGC for genuinely large heaps (>16 GB, >10k partitions) at ~5–10% CPU / ~20% memory overhead. Alert on GC pause p99 > 50 ms (warning) / > 200 ms (critical) and on HeapMemoryAfterGC > 60%. (Empirical, ops reference §3/§4.) The architectural why: a big heap does not make Kafka faster, the page cache does, so a big heap is pure downside (longer pauses → soft outages).
MITIGATION. Cap the heap as above; if pauses persist, the cause is usually too-large a heap or too many partitions per broker (more partitions = more index/metadata objects = more GC pressure, see II · 03). For the symptoms a pause produces (rebalances), size session.timeout.ms to ride out normal pauses and prefer the new consumer protocol (below), which is far less pause-sensitive.
Page-cache pressure & cold reads (the catch-up tax)
CAUSE. A consumer reads far back in history (a backfill, a recovered lagging group, a new consumer starting from earliest). Those historical reads miss the page cache and hit disk, evicting the hot tail data that producers and real-time consumers depend on.
SYMPTOM. Produce p99 latency spikes (the hot path now waits on disk), disk read I/O saturates, LocalTimeMs on the leader climbs (it is reading from disk, not cache). Real-time consumers that were served from cache suddenly see latency. (Empirical: historical reads can spike p99 produce ~2 ms → ~250 ms and drive disk I/O to 100%; KIP-405 tests showed historical consumers caused ~43% producer-throughput drop without tiered storage, ops reference §7.)
MECHANISM. Kafka's read performance is the page cache: a fetch for recent data is a zero-copy sendfile from cache to socket with no application copy (Part I 09). A cold read forces a disk seek and pollutes the cache by evicting tail pages, so the cost is borne not just by the slow reader but by everyone reading the hot tail. The architecture optimizes ferociously for the common case (read the tail) and degrades for the uncommon one (read cold history).
MITIGATION. Tiered storage (KIP-405, GA since 3.6; Part I 05) is the structural fix: cold segments live in object storage and are streamed from there, so historical reads do not evict the local page cache that serves the hot tail (empirical: improved p99 produce ~30% under historical-read load). Without it, the levers are: provision page cache for roughly write_throughput × 30s of headroom; isolate heavy backfill consumers to off-peak windows; consider KIP-392 fetch-from-follower so backfill load lands on followers, not leaders. Capacity-plan page cache deliberately (II · 04).
Hot / skewed partition: one leader saturated
CAUSE. Load concentrates on a single partition, usually a hot key (one key receives a disproportionate share of records and hash(key) % N always sends it to the same partition), or a producer-side bug that funnels most messages to one partition. That partition's leader broker saturates while the rest idle.
SYMPTOM. Per-partition BytesInPerSec wildly uneven; one broker's CPU/network at ceiling while peers are quiet; consumer lag growing on just the hot partition (one consumer can serve it, and it cannot keep up). Aggregate cluster throughput looks fine, the skew hides in per-partition metrics. (Empirical: Cloudflare's worst recurring incident was partition skew from a client-library abstraction funneling most messages to one partition, ops reference §4/§7.)
MECHANISM. Partitioning is the unit of parallelism, and a single partition is served by a single leader and consumed by at most one consumer in a group (II · 03, Part I 13). A hot key therefore creates a single-threaded bottleneck that no amount of broker hardware fixes, the work cannot be split below one partition.
The most common wrong fix is to add partitions. A hot key still hashes to exactly one partition (hash(key) % N is deterministic per key), so more partitions do nothing for it, and increasing partition count on a keyed topic remaps keys (7654321 % 4 = 1 but % 6 = 3), silently breaking per-key ordering and co-partitioned Kafka Streams state. You also cannot decrease partitions afterward. (Empirical, ops reference §2.) The right fixes: redesign the key for even distribution; or salt the hot key (append a bucket suffix) only when per-key ordering is not required; or use a custom partitioner. Treat partition count as near-immutable for keyed topics, see II · 03.
MITIGATION. As above: fix the key, not the partition count. If the skew is a producer bug (a misused custom partitioner or a default that collapses keys), fix the producer. Monitor per-partition BytesInPerSec and leader skew, aggregate dashboards will not show it. Cruise Control / topicmappr-style tools rebalance leadership across brokers, which helps even out broker load but does not fix a single over-hot partition (the work still cannot split below one partition).
Request-handler & purgatory saturation
CAUSE. The broker's I/O (request-handler) threads or network threads are maxed, too few threads for the load, slow disk making each request take longer, or a flood of tiny requests. Requests queue instead of being served.
SYMPTOM. RequestHandlerAvgIdlePercent (kafka.server:type=KafkaRequestHandlerPool; metric name at KafkaRequestHandler.scala:210) drops toward 0 (0 = fully busy, 1 = fully idle). RequestQueueTimeMs in the latency breakdown climbs (requests wait in the queue before a handler picks them up). NetworkProcessorAvgIdlePercent dropping signals network-thread saturation instead. (Empirical thresholds: RequestHandlerAvgIdlePercent <0.2 = potential problem, <0.1 = active problem, keep >0.3; NetworkProcessorAvgIdlePercent ideally >0.4, raise threads below ~0.3, ops reference §4/§5.)
MECHANISM. Requests land on network threads (default num.network.threads=3, SocketServerConfigs.java:152), are placed on a request queue bounded by queued.max.requests (default 500, SocketServerConfigs.java:144), and are processed by I/O threads (default num.io.threads=8, ServerConfigs.java:51). The latency decomposition is exact: TotalTimeMs = RequestQueueTimeMs + LocalTimeMs + RemoteTimeMs + ResponseQueueTimeMs + ResponseSendTimeMs (Part I 07, 06). Purgatory is where requests that cannot complete immediately wait: an acks=all produce sits in produce-purgatory until the ISR acks (so produce purgatory being non-zero is normal and expected under acks=all, do not alert on raw purgatory size), and a fetch with fetch.min.bytes unmet waits in fetch-purgatory.
Localize before you tune: high RequestQueueTimeMs → too few num.io.threads (handlers starved); high LocalTimeMs → leader disk / page-cache / GC; high RemoteTimeMs → slow followers or min.insync.replicas waits (check URP); high ResponseQueueTimeMs → too few num.network.threads. (Empirical, ops reference §3/§5.) Recommended starting points: num.io.threads ≈ 8 × #data-disks, num.network.threads 8–12. Confirm RequestHandlerAvgIdlePercent is a 0–1 fraction in your build before trusting thresholds, it has a history of being mis-reported (KAFKA-7295; anomalous in KRaft combined mode, KIP-1207).
MITIGATION. If idle% is chronically low and latency is queue-dominated, add threads (per the map) or add brokers (the work is genuinely too much for the box). If LocalTimeMs dominates, the bottleneck is the disk/cache, not threads, fix storage, not thread counts. queued.max.requests=500 is a backpressure bound, not a throughput dial; raising it lets more requests queue but does not make them faster.
Fetch-session cache eviction
CAUSE. More fetch sessions are active than the broker's incremental-fetch-session cache can hold. Each consumer and follower fetcher wants an incremental fetch session (it sends only changed partitions, not the full set every time); when the cache is full, new sessions cannot be cached and those clients fall back to full fetches.
SYMPTOM. Rising fetch request size and CPU on the broker for what should be cheap incremental fetches; fetch-session eviction/cache metrics climbing. The extra load is structural, not a crash, it manifests as elevated broker CPU and fetch latency under high client/fetcher counts. (Part I 09.)
MECHANISM. Incremental fetch sessions (KIP-227) let a fetcher register its partition set once and thereafter send only deltas, which is what makes thousands of partitions per fetcher affordable. The cache is bounded by max.incremental.fetch.session.cache.slots, default 1000 (ServerConfigs.java:102, MAX_INCREMENTAL_FETCH_SESSION_CACHE_SLOTS_DEFAULT = 1000). When slots are exhausted, the broker cannot grant a new incremental session, so that client sends full fetches every time, every partition, every poll, which is exactly the per-request overhead incremental fetch was built to avoid. Eviction is capacity-driven (the cache drops sessions to make room), not LRU-perfect.
MITIGATION. If you run many consumers and follower fetchers against a high-partition broker and see full-fetch fallback, raise max.incremental.fetch.session.cache.slots (it is a SOFT limit) to comfortably exceed the count of distinct fetch sessions (≈ consumers + follower fetcher threads). The EMERGENT bound is set by your client/fetcher fan-out: more consumers and more num.replica.fetchers mean more sessions competing for slots.
Consumer rebalance storm
CAUSE. Under the classic (eager) rebalance protocol, a consumer that is too slow processing a batch misses its poll deadline, gets evicted, rejoins, and triggers another rebalance, which stop-the-world revokes all partitions from all members each time, so the group thrashes and makes no progress. Deploys, scale events, and GC pauses all light the fuse.
SYMPTOM. Repeated rebalances in the consumer logs; consumer lag (records-lag-max) climbing because the group spends its time rebalancing instead of consuming; processing throughput collapses. (Empirical: vanilla MirrorMaker triggered cluster-wide rebalances on any broker hiccup → 5–10 min stalls, and after 32 failed rebalances permanently stuck consumers, "an outage almost every week" at Uber, ops reference §7.)
MECHANISM. A consumer must call poll() within max.poll.interval.ms (default 300000 ms / 5 min, ConsumerConfig.java:629) and heartbeat within session.timeout.ms (default 45000 ms, ConsumerConfig.java:438) with heartbeat.interval.ms 3000 ms (ConsumerConfig.java:443). Miss the poll deadline, typically because a batch of up to max.poll.records (default 500, ConsumerConfig.java:94) took too long to process, and the coordinator evicts the member, forcing a rebalance. Under eager assignment every rebalance revokes everything, so a single slow member stalls the whole group (Part I 13).
Classic-protocol fixes: adopt CooperativeStickyAssignor (it revokes only the partitions that must move, not all of them, it is in the default assignor list, ConsumerConfig.java); raise max.poll.interval.ms to cover your worst-case batch time; reduce max.poll.records so each poll() returns faster; use static membership (KIP-345: a stable group.instance.id + a session.timeout.ms large enough to ride out a deploy preserves assignments across restarts, avoiding the rebalance entirely). The structural fix is the new consumer group protocol KIP-848: the coordinator computes assignments server-side and members reconcile incrementally, so a slow or restarting member no longer stop-the-worlds the group. Note the new-protocol server-side session-timeout bounds differ, group.consumer.min/max.session.timeout.ms default 45000/60000 (GroupCoordinatorConfig.java:189,193). See Part I 13.
MITIGATION. Apply the recipe above; the single highest-leverage move on a modern cluster is migrating consumers to the KIP-848 protocol, which removes the self-sustaining-rebalance failure mode at the source. Make processing idempotent and commit in onPartitionsRevoked so a rebalance mid-batch does not double-process. Alert on rebalance rate and on lag trend (growing over a sustained window), not a static lag count.
Producer fencing (ProducerFenced)
CAUSE. Two producers share the same transactional.id (or an idempotent producer's epoch is superseded). When a new producer initializes with an existing transactional.id, the coordinator bumps the epoch; the older producer is now a "zombie" and is fenced. Most often this is two pods accidentally configured with the same transactional.id.
SYMPTOM. The fenced producer throws ProducerFencedException on its next transactional operation; its sends fail. A "fencing avalanche", two instances sharing one transactional.id repeatedly fencing each other, shows as alternating fence exceptions on both. (Empirical, ops reference §4.)
MECHANISM. Each transactional.id carries a monotonically increasing producer epoch in the coordinator's TransactionMetadata. On InitProducerId from a new producer, prepareIncrementProducerEpoch (TransactionMetadata.java:154) bumps the epoch; a request arriving with an older epoch hits the fencing branch and the coordinator throws Errors.PRODUCER_FENCED.exception() (TransactionMetadata.java:186). This is the zombie-fencing guarantee that makes exactly-once safe: only the newest producer for a given transactional.id may write (Part I 14). There is also a hard ceiling, the epoch is a short, and isEpochExhausted triggers at producerEpoch >= Short.MAX_VALUE - 1 (TransactionMetadata.java:64–65), after which a new producerId must be allocated.
MITIGATION. Give every producer instance a unique transactional.id, the fencing avalanche is purely a config collision. KIP-588 made some epoch errors recoverable so a producer can rejoin gracefully after a transient timeout rather than dying. Fencing is correct behaviour (a zombie was prevented from corrupting EOS); the runbook task is to find and fix the duplicate id, not to disable fencing.
Hanging transaction: the LSO stuck, read_committed blocked
CAUSE. A transactional producer starts a transaction, writes some records, then crashes or stalls without committing or aborting. The transaction stays open until the coordinator times it out, and until then it pins the partition's Last Stable Offset (LSO).
SYMPTOM. read_committed consumers stall on that partition, they cannot advance past the LSO, so all later records on the partition, even from committed transactions, become invisible until the hung transaction resolves. The signals: LastStableOffsetLag (per-partition gauge, Partition.scala:226) grows; PartitionsWithLateTransactionsCount (kafka.server:type=ReplicaManager, gauge at ReplicaManager.scala:245) goes > 0. On compacted topics the cleaner cannot advance past the open transaction either, so the log grows unbounded. (Empirical, ops reference §4.)
MECHANISM. The LSO is the offset below which all transactions have reached a terminal (commit/abort) state; a read_committed fetch returns data only up to the LSO (Partition.scala:1442, READ_COMMITTED → localLog.lastStableOffset). An open transaction holds the LSO at its first record, so nothing after it is "stable," hence invisible to read_committed. The broker flags a partition as having a late transaction via hasLateTransaction(currentTimeMs) (Partition.scala:230), which the cluster aggregates into lateTransactionsCount (ReplicaManager.scala:250). See Part I 14.
The producer's transaction.timeout.ms (default 60000 ms, ProducerConfig.java:548) is the window before the coordinator proactively aborts an open transaction, but it is capped by the broker's transaction.max.timeout.ms, default 900000 ms / 15 min (TransactionStateManagerConfig.java:33, (int) TimeUnit.MINUTES.toMillis(15)). A producer that sets a large timeout (up to the cap) and then crashes can block read_committed consumers on that partition for up to 15 minutes. Do not assume the 60 s default is your worst case, lower transaction.max.timeout.ms if your consumers cannot tolerate that stall, and detect early via PartitionsWithLateTransactionsCount / LastStableOffsetLag.
MITIGATION. Most hung transactions self-resolve when the coordinator aborts them at timeout. For one that is truly stuck, KIP-664 tooling (shipped in 3.0) is the clean abort path: kafka-transactions.sh --describe-producers to inspect, --find-hanging (requires --max-transaction-timeout) to locate, and --abort (by start offset, or by producerId+epoch+coordinatorEpoch) to clear it. Prevention: keep transaction.timeout.ms tight, ensure producers handle crashes by re-initializing cleanly, and alert on the LSO-lag / late-transaction metrics so a stuck transaction is caught in minutes, not when a read_committed backlog explodes.
The runbook at a glance
One table, all failures, sorted by what you will see first. Print it. The metric is the entry point; the source line is the explanation; the action is what to do at 03:00.
| Failure | First symptom (metric) | Mechanism (source) | Class | Primary mitigation |
|---|---|---|---|---|
| Broker loss | UnderReplicatedPartitions flat >0; LeaderElectionRateAndTimeMs | Partition.scala:232; clean election from ISR | EMERGENT | Restore broker; RF≥3 → zero loss; throttle re-replication |
| Slow follower | UnderReplicatedPartitions fluctuating; IsrShrinksPerSec w/o expand | replica.lag.time.max.ms=30000, ReplicationConfigs.java:55; Partition.scala:1133 | SOFT | Raise num.replica.fetchers; fix disk/NIC; do not widen lag window |
| Network partition | IsrShrinksPerSec; OfflinePartitionsCount | ISR window + quorum majority, QuorumState.java:697 | EMERGENT | Rack/AZ spread; controllers across ≥3 AZs |
| Under-min-ISR | UnderMinIsrPartitionCount >0; produce gets NotEnoughReplicas | Partition.scala:1234; effectiveMinIsr :246 | SOFT | Restore replicas; verify min.insync.replicas ≤ RF |
| Offline partitions | OfflinePartitionsCount >0 | ControllerMetadataMetrics.java:62; Partition.scala:981 | EMERGENT | Restore eligible leader / quorum; RF≥3 |
| Unclean election | UncleanLeaderElectionsPerSec >0 (= data loss) | ReplicationControlManager.java:1140; LeaderRecoveryState.RECOVERING | SOFT | Keep =false (LogConfig.java:139); reconcile from upstream if it fired |
| Log-dir / disk failure | OfflineLogDirectoryCount >0; "log directory has failed" | LogDirFailureChannel.java:60; ReplicaManager.scala:2157 | HARD (metadata dir → halt) | Replace disk + restart; JBOD isolation; metadata dir on reliable storage |
| Full disk | disk >85%; OfflineLogDirectoryCount; possible halt | IOException via LogDirFailureChannel | EMERGENT | Lower retention dynamically; never delete newest segment; tiered storage |
| Controller quorum loss | SUM(ActiveControllerCount)==0 | QuorumState.java:697; controller.quorum.fetch.timeout.ms=2000 :83 | EMERGENT | Odd controllers across ≥3 AZs; restore majority |
| Metadata lag | NOT_LEADER_OR_FOLLOWER retries; broker applied-offset behind | async metadata apply, Part I 12 | EMERGENT | Self-heals via client refresh; chase a lagging broker if persistent |
| GC pauses | correlated IsrShrinks + rebalances on one broker | STW pause vs 30 s ISR window | EMERGENT | Heap ~6 GB + G1GC; fewer partitions/broker |
| Page-cache / cold reads | produce p99 spike; LocalTimeMs up; disk read I/O 100% | cold read evicts hot tail, Part I 09 | EMERGENT | Tiered storage (KIP-405); off-peak backfills; fetch-from-follower |
| Hot / skewed partition | per-partition BytesInPerSec uneven; one broker maxed | 1 partition = 1 leader = 1 consumer, Part I 13 | EMERGENT | Fix the key (do NOT add partitions); salt only if order not needed |
| Handler / purgatory saturation | RequestHandlerAvgIdlePercent low; RequestQueueTimeMs up | num.io.threads=8 :51; queued.max.requests=500 :144 | SOFT | Add threads per latency map / add brokers; not purgatory size |
| Fetch-session eviction | full fetches; broker fetch CPU up | max.incremental.fetch.session.cache.slots=1000, ServerConfigs.java:102 | SOFT | Raise slots above #sessions (consumers + fetchers) |
| Rebalance storm | repeated rebalances; records-lag-max climbing | max.poll.interval.ms=300000 :629; session.timeout.ms=45000 :438 | SOFT | KIP-848 protocol; cooperative assignor; static membership |
| Producer fencing | ProducerFencedException on send | TransactionMetadata.java:186 (epoch bump) | HARD (epoch is correctness) | Unique transactional.id per instance |
| Hanging transaction | LastStableOffsetLag up; PartitionsWithLateTransactionsCount >0 | Partition.scala:1442 (LSO); cap transaction.max.timeout.ms=900000 :33 | SOFT | Tight transaction.timeout.ms; kafka-transactions.sh --abort (KIP-664) |
With replication.factor=3, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false, every single failure in this chapter, one broker, one disk, one slow follower, one AZ network split, costs zero acknowledged data and, for most, zero write availability. Data loss requires either ≥2 simultaneous independent failures, or a deliberate choice (enabling unclean election, lowering min-ISR under pressure, or carrying a silent min.insync.replicas > replication.factor mismatch). The runbook's deepest rule: do not, under incident pressure, edit a config that converts an availability incident into a durability incident. Restore the failed component; let the invariant hold. The full derivation is in II · 06 Durability; the signals to watch are in II · 08; what changes at scale is in II · 11.