krivaltsevich.com Kafka Internals 4.4

II · 14 · Proactive Monitoring: Leading Indicators & Capacity Runway

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

The golden signals in op08 are honest but cruel: OfflinePartitionsCount > 0 means partitions are already unavailable; UncleanLeaderElectionsPerSec > 0 means data is already gone. They are lagging: they fire at the cliff edge, not before it. This chapter is about the metrics that buy you lead time: the slopes that show a number marching toward a limit, the headroom that tells you how far away the limit still is, and the saturation bands that light up minutes to days before queuing and errors appear. Every leading indicator here is sourced to a real metric in the code, with a default-grounded danger band, a typical lead time, and the mechanism that makes it predictive. The payoff is a posture change: from responding to outages to spending runway deliberately.

Lagging vs. leading: the spine of the chapter

A lagging indicator is a state that only becomes true once damage has occurred. UnderReplicatedPartitions is the archetype: by the time the in-sync replica set has actually shrunk, the partition is already at reduced durability. A leading indicator is one of three shapes, each of which exists before the lagging signal:

  • Slope / rate-of-change. A level metric whose trend is what matters. Consumer records-lag of 50,000 is meaningless in isolation; a sustained positive slope of +2,000 records/s means the consumer is structurally below required throughput and will breach its SLA; the only question is when.
  • Headroom / runway. The distance to a limit, expressed in the unit that limit is measured in: free disk bytes converted to days-to-full, partition count as a percentage of the per-broker ceiling, file descriptors remaining. Runway turns an invisible threshold into a countdown.
  • Saturation band. A utilisation gauge crossing into a danger zone well before it pins. RequestHandlerAvgIdlePercent falling from 0.8 to 0.3 is not yet an outage, but it is the warning that the I/O thread pool is filling and request-queue time is about to climb non-linearly.

[Figure: lagging vs. leading decision flow. A metric is moving: is it a state or a trend? LAGGING (URP > 0, Offline > 0, UncleanLeaderElection > 0): page now; damage done. LEADING (slope, runway, or saturation band): ticket / forecast; lead time remains, so spend runway deliberately, before the cliff.]
The same underlying problem (e.g. a slow follower) surfaces first as a leading band (IsrShrinksPerSec ticking, idle% dipping) and only later as a lagging state (UnderReplicatedPartitions).

The reframe in one sentence

Alert on the derivative and the distance, not only the value. A threshold alarm tells you that you arrived; a slope/runway alarm tells you that you are going to arrive, and roughly when.

Three monitoring disciplines

(A) Alert on the slope: rate-of-change and burn-rate

For any metric trending toward a cliff, the alert should fire on the rate at which it approaches, not on a fixed level. Two implementations dominate:

  • Linear slope on a level metric. Compute the trend over a sliding window (e.g. PromQL deriv(kafka_consumer_records_lag[15m]) or predict_linear(...[1h], 4*3600)). Alert when the projected value crosses the limit within your response budget. LinkedIn's Burrow formalised exactly this for consumer lag: it classifies a group OK/WARNING/ERROR by evaluating the lag trend over a sliding window of committed offsets, so no per-topic threshold needs tuning (LinkedIn Burrow, empirical).
  • Burn-rate on an error budget. Borrowed from Google SRE: if your SLO permits X bad events per window, alert when the current consumption rate would exhaust the remaining budget before the window ends. A fast burn (e.g. 14.4× budget over 1 h) pages; a slow burn (e.g. 3× over 6 h) tickets. This is the canonical proactive pattern and is detailed in the SLO section below.
Why slope beats threshold for lag

Absolute lag conflates two unrelated things: a one-off backlog (a deploy, a batch job) that will self-drain, and a structural throughput deficit that will not. The deficit is visible only in the sign and magnitude of the slope. A positive, sustained slope on records-lag-max (FetchMetricsRegistry.java:102) means bytes-consumed-rate < production rate, a condition that the absolute number, paradoxically, hides while lag is still "small".
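
The slope test can be sketched in a few lines of Python. This is a minimal illustration, assuming lag samples have already been scraped into (timestamp, records_lag) pairs; the function names, the sample window, and the 10,000,000-record limit are hypothetical and are not part of Kafka, Burrow, or Prometheus. It fits an ordinary least-squares line, which is roughly what PromQL's deriv() does, and extrapolates from the latest sample.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_time_seconds, records_lag)


def lag_slope(samples: Sequence[Sample]) -> float:
    """Least-squares slope of lag over time, in records per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share one timestamp")
    return num / den


def seconds_until(samples: Sequence[Sample], limit: float) -> Optional[float]:
    """Seconds until lag reaches `limit`, extrapolating from the latest sample
    at the fitted slope; None when lag is flat or draining."""
    slope = lag_slope(samples)
    if slope <= 0:
        return None  # a backlog that is draining, not a structural deficit
    latest_lag = samples[-1][1]
    return max(0.0, (limit - latest_lag) / slope)


# Hypothetical 15-minute window, one sample per minute, growing 2,000 records/s.
window = [(60.0 * i, 50_000 + 2_000 * 60.0 * i) for i in range(16)]
print(lag_slope(window))                  # records per second
print(seconds_until(window, 10_000_000))  # seconds until the assumed limit
```

A None result is the useful case as much as a number: it separates the self-draining backlog from the structural deficit without any per-topic threshold.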

(B) Track headroom / runway

Runway converts a resource limit into a time or percentage you can put on a dashboard and trend. The pattern is always runway = remaining / consumption_rate. Examples developed below: disk days_to_full = free_bytes / (ingress_per_day × RF); partition headroom = 1 − (partitions_on_broker / ceiling); lag-time runway = records_lag / consume_rate. The discipline: pick the resource, measure both terms, alert on the quotient trending down, not the numerator alone.

(C) Watch saturation bands

Kafka exposes several utilisation gauges as fractions in [0,1]. They degrade gracefully, then suddenly: idle time falls linearly with load until the queue starts to fill, at which point latency rises super-linearly (classic queueing behaviour, M/M/1: wait time ∝ 1/(1−ρ)). Watching for a gauge to enter its warning band (say, 0.3–0.2) gives lead time before the knee. The two broker gauges, plus the controller and tiered-storage equivalents, with their exact source mechanics:

Saturation gauge | Source & mechanism | Healthy | Warning band | Saturated
RequestHandlerAvgIdlePercent | KafkaRequestHandler.scala:253, meter over idle time of the num.io.threads pool; each handler discounts idle time by thread count (:121-123). 1.0 = fully idle, 0 = fully busy. | > 0.6 | 0.3 → 0.2 | → 0 (requests queue, RequestQueueTimeMs climbs)
NetworkProcessorAvgIdlePercent | SocketServer.scala:120-132, mean of per-processor io-wait-ratio, capped at 1.0. Measures time network threads wait for socket readiness vs. doing work. | > 0.6 | 0.4 → 0.3 | → 0 (responses queue, ResponseQueueTimeMs climbs)
ControllerEventManager AvgIdleRatio | QuorumControllerMetrics.java:52, idle ratio of the single-threaded controller event loop. | > 0.7 | 0.5 → 0.3 | → 0 (EventQueueTimeMs climbs; metadata ops slow)
RemoteLogReaderAvgIdlePercent | RemoteStorageThreadPool.java:60-61, 1 − activeCount/corePoolSize of the tiered-storage read pool. | > 0.5 | 0.3 → 0.1 | → 0 (RemoteLogReaderTaskQueueSize grows; remote fetch latency spikes)

Confirm the idle gauge is a 0–1 fraction

RequestHandlerAvgIdlePercent has a long history of mis-reporting: mis-calculation (KAFKA-7295), some integrations exposing it as a rate rather than a 0–1 gauge (Datadog integrations-core #516), and an anomaly in KRaft combined mode (KIP-1207). Validate that your pipeline shows it in [0,1] before trusting any band, and prefer the per-pool meters (BrokerRequestHandlerAvgIdlePercent / ControllerRequestHandlerAvgIdlePercent, KafkaRequestHandler.scala:243-251) which separate broker and controller work on a combined node.
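
A small Python sketch of the range check and the band logic, under stated assumptions: the function names are hypothetical, the default thresholds are the RequestHandlerAvgIdlePercent bands quoted in this chapter, and the multiplier uses the M/M/1 relation with ρ = 1 − idle, which is a simplification for a multi-threaded pool.

```python
import math


def queueing_multiplier(idle_fraction: float) -> float:
    """M/M/1 wait-time factor 1/(1 - rho), taking utilisation rho = 1 - idle."""
    if not 0.0 <= idle_fraction <= 1.0:
        # A value such as 35.0 usually means a percentage or a rate was
        # exported instead of the 0-1 gauge.
        raise ValueError(f"idle gauge {idle_fraction!r} is not a 0-1 fraction")
    if idle_fraction == 0.0:
        return math.inf  # fully busy: the model's queue grows without bound
    return 1.0 / idle_fraction


def classify_idle(idle_fraction: float, warn: float = 0.3, act: float = 0.2) -> str:
    """Map an idle gauge onto ok / warning / act bands."""
    queueing_multiplier(idle_fraction)  # reuse the range check
    if idle_fraction <= act:
        return "act"
    if idle_fraction <= warn:
        return "warning"
    return "ok"


for idle in (0.8, 0.3, 0.2, 0.1):
    print(idle, classify_idle(idle), round(queueing_multiplier(idle), 2))
```

Rejecting out-of-range values loudly, rather than clamping them, keeps a mis-scaled exporter from silently reporting a healthy band.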

The leading-indicator catalogue

For each indicator: what it predicts, the early-warning band/trend, the typical lead time it buys, and the source. The lead times are operational rules of thumb (empirical), not guarantees; they scale with how aggressively you provisioned headroom.

1 · Consumer-lag trend (slope of records-lag + time-lag)

Predicts: an SLA breach or retention overrun, meaning the consumer falls permanently behind or lag exceeds the retention window so records age out unread. Mechanism: a sustained positive slope on records-lag-max (client-side, FetchMetricsRegistry.java:102) means the consumer group's net drain rate is below the production rate. Two runway quotients fall out, both remaining ÷ rate. (a) Recovery runway: recovery_time = current_lag ÷ net_drain_rate. With an illustrative lag of 500,000 records draining at a net 5,000 records/s (consume-rate minus produce-rate): 500,000 records ÷ 5,000 records/s = 100 s to catch up (the records cancel, leaving seconds); if net drain is ≤ 0 there is no recovery. (b) Time-lag, because retention is time-bound, not count-bound: time_lag = current_lag ÷ consume_rate, e.g. 500,000 records ÷ 20,000 records/s consumed = 25 s behind the tail; compare that against the retention window, not against the raw count. Band: alert when the 15-minute slope is positive for > N consecutive windows, or when projected time-lag will reach the retention limit within your response budget. Lead time: minutes-to-hours, set by how much lag-time runway you keep relative to retention. Cross-ref op03 (partitioning & parallelism), Part I 17.
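
The two quotients can be written down directly. This is a sketch with hypothetical function names; the inputs are the illustrative figures from the text, and the 15,000 records/s produce rate is an assumption derived from them (20,000 consumed minus the net drain of 5,000).

```python
import math


def recovery_time_s(current_lag: float, consume_rate: float, produce_rate: float) -> float:
    """Seconds to drain the backlog at the net drain rate; inf if it never drains."""
    net_drain = consume_rate - produce_rate
    if net_drain <= 0:
        return math.inf  # consumer is at or below production rate: no recovery
    return current_lag / net_drain


def time_lag_s(current_lag: float, consume_rate: float) -> float:
    """Approximate seconds behind the tail, for comparison with retention."""
    if consume_rate <= 0:
        return math.inf
    return current_lag / consume_rate


lag = 500_000        # records (illustrative)
consume = 20_000.0   # records/s (illustrative)
produce = 15_000.0   # records/s (assumed so that net drain is 5,000 records/s)

print(recovery_time_s(lag, consume, produce))  # 100.0
print(time_lag_s(lag, consume))                # 25.0
```

Returning infinity for a non-positive net drain makes the "no recovery" case explicit instead of producing a negative or undefined runway.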

2 · RequestHandlerAvgIdlePercent declining into a band

Predicts: I/O-thread saturation and rising RequestQueueTimeMs before produce/fetch requests start timing out. Mechanism: the meter at KafkaRequestHandler.scala:253 measures the fraction of time the num.io.threads pool spends blocked on requestChannel.receiveRequest(300) (:117) instead of handling work. As load rises the idle fraction falls; once idle → 0 the request queue fills and RequestQueueTimeMs rises super-linearly. Band: warning at 0.3, action at 0.2 (Instaclustr: "constantly below 0.2 → add capacity", empirical). Lead time: minutes to hours of advance notice depending on workload growth rate. Cross-ref Part I 06, 07.

3 · NetworkProcessorAvgIdlePercent declining

Predicts: network-thread saturation and rising ResponseQueueTimeMs. Mechanism: identical logic for the num.network.threads processors; the gauge (SocketServer.scala:120) is the mean per-processor io-wait-ratio. Band: Confluent guidance "ideally > 0.4"; below ~0.3 raise num.network.threads (empirical). Watch MemoryPoolAvailable (SocketServer.scala:134) alongside: if the network read buffer pool is draining to zero, the socket layer is back-pressuring. Lead time: minutes to hours. Cross-ref Part I 06.

4 · Page-cache pressure (disk-read rate while consumers are caught up)

Predicts: the working set outgrowing RAM, and consumers about to fall off the zero-copy/sendfile path onto disk. Mechanism: Kafka serves tail reads from the OS page cache via sendfile (Part I 09); if all consumers are caught up to the tail and the broker is still doing physical disk reads (node_disk_reads climbing), the hot set no longer fits in cache and reads will increasingly miss. The reference documents the "catch-up tax": historical reads evict the hot tail and spike p99 produce ~2 ms → ~250 ms with disk I/O to 100% (KIP-405 tests, empirical). Band: non-zero and rising disk-read rate while max consumer lag ≈ 0. Lead time: hours-to-days as the working set creeps up; sudden if a backfill consumer starts reading cold data. Mitigations: more RAM, or tiered storage (Part I 05) to keep local retention small. Cross-ref op04, Part I 03.

5 · IsrShrinksPerSec ticking up (shrinks that re-expand)

Predicts: brokers beginning to struggle (GC, disk, network) before sustained UnderReplicatedPartitions or offline partitions. Mechanism: a follower is dropped from the ISR if it fails to reach the leader's log-end-offset within replica.lag.time.max.ms, default 30000 ms (ReplicationConfigs.java:55). A follower that briefly stalls (a GC pause, a disk hiccup) shrinks then re-expands. A rising rate of IsrShrinksPerSec with matching IsrExpandsPerSec is the cluster flickering at the edge: each individual shrink self-heals, but the increasing frequency is the leading edge of trouble that will eventually produce a shrink with no matching expand (= true URP). Band: any shrink/expand churn above a near-zero steady-state baseline; investigate shrinks without a matching expand immediately. Lead time: minutes. Cross-ref Part I 08, op07.

[Figure: ISR state progression. ISR full (steady) → follower stalls < 30 s: shrink → expand (flicker) → rate rising: churn (leading signal) → stall > replica.lag.time.max.ms: shrink, no expand = URP. Legend: healthy = steady state; warn = flicker / churn (leading); err = URP (lagging).]
The window between "flicker" and "URP" is the lead time: IsrShrinksPerSec rising is leading; UnderReplicatedPartitions > 0 is lagging. replica.lag.time.max.ms default 30000 ms (ReplicationConfigs.java:55) is the dial.
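
One way to turn the flicker pattern into an alert condition is sketched below. It assumes you have already converted IsrShrinksPerSec and IsrExpandsPerSec into integer counts per window; the function names, the state labels, and the half-versus-half trend test are hypothetical choices, not anything Kafka provides.

```python
from typing import Sequence


def classify_isr_window(shrinks: int, expands: int, baseline: int = 0) -> str:
    """Classify one window of ISR churn from shrink and expand counts."""
    if shrinks < 0 or expands < 0:
        raise ValueError("counts must be non-negative")
    if shrinks > expands:
        return "unmatched-shrink"  # at least one replica has not rejoined yet
    if shrinks > baseline:
        return "flicker"  # every shrink re-expanded, but churn is above baseline
    return "steady"


def churn_is_rising(shrinks_per_window: Sequence[int]) -> bool:
    """True when the later half of the series holds more shrinks than the earlier half."""
    half = len(shrinks_per_window) // 2
    if half == 0:
        return False
    return sum(shrinks_per_window[-half:]) > sum(shrinks_per_window[:half])


print(classify_isr_window(shrinks=4, expands=4))  # flicker
print(classify_isr_window(shrinks=4, expands=3))  # unmatched-shrink
print(churn_is_rising([0, 1, 1, 3, 4, 6]))        # True
```

An expand can land in the window after its shrink, so windows comfortably longer than replica.lag.time.max.ms reduce false "unmatched-shrink" results.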

6 · Disk-free runway (days-to-full)

Predicts: a full-disk outage days ahead. Mechanism: on No space left on device the broker calls Exit.halt(1) with no graceful shutdown (empirical; reference §4). The runway is computable as a remaining ÷ rate quotient: a broker stores RF copies of the cluster's bytes-in (its share of leader and follower replicas), so it fills at RF × the cluster's per-broker ingress, and days_to_full = free_bytes ÷ (ingress_bytes_per_day_per_broker × RF). Band: alert when projected days-to-full < lead-time-to-add-capacity (commonly 7–14 days), and a hard backstop at 85% used (AWS MSK KafkaDataLogsDiskUsed ≥ 85%, empirical). Lead time: days. This is among the longest-lead indicators in the catalogue, and the one most worth automating. Cross-ref op04, op07.

Worked example: days-to-full with labeled inputs

The inputs are illustrative for this example; substitute your own measured values. The first two are read off a dashboard, RF is a topic config, and seconds/day is a fixed conversion constant.

Symbol | Value (with units) | Why / source | Kind
free_bytes | 2,000 GB free on the data dir | from node_filesystem_avail_bytes on the broker | illustrative (measured)
ingress_per_broker | 50 MB/s of producer bytes landing on this broker's leaders | from BytesInPerSec for this broker; pre-replication | illustrative (measured)
RF | 3 copies | topic replication.factor; durability baseline (reference §4) | config
seconds/day | 86,400 s/day | unit conversion constant | fixed

Derivation (units cancel to days):

Step 1. Convert ingress to bytes per day:
ingress_per_day = 50 MB/s × 86,400 s/day = 4,320,000 MB/day = 4,320 GB/day (the s cancels).

Step 2. Account for replication (this broker writes RF copies of every byte it is responsible for, leaders + followers, so disk fills RF× faster than raw ingress):
disk_fill_rate = ingress_per_day × RF = 4,320 GB/day × 3 = 12,960 GB/day.

Step 3. Divide remaining by rate (the GB cancels, leaving days):
days_to_full = free_bytes ÷ disk_fill_rate = 2,000 GB ÷ 12,960 GB/day ≈ 0.15 day ≈ 3.7 hours.

Takeaway: at 50 MB/s ingress with RF=3, a 2,000 GB free margin is only ~3.7 hours of runway, not days, because RF triples the fill rate and 50 MB/s is 4.3 TB/day of raw bytes before replication. The conclusion holds because the divisor is the replicated fill rate (× RF), not the raw produce rate; omitting × RF would over-state runway 3×. Two caveats fold in below: this assumes non-plateaued ingress (a retention-bound topic stops growing once bytes-aged-out offsets bytes-in), and it assumes the 50 MB/s is sustained, so trend the real slope rather than a single sample.
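
The same derivation as a Python sketch, using the illustrative inputs from the table. The function name is hypothetical, and decimal units (1 GB = 1,000 MB) are assumed to match the arithmetic above.

```python
import math

SECONDS_PER_DAY = 86_400


def days_to_full(free_gb: float, ingress_mb_per_s: float, replication_factor: int) -> float:
    """Days until the data dir fills, given pre-replication ingress and RF."""
    if ingress_mb_per_s <= 0:
        return math.inf  # nothing is being written, so the disk never fills
    ingress_gb_per_day = ingress_mb_per_s * SECONDS_PER_DAY / 1000  # MB -> GB (decimal)
    fill_gb_per_day = ingress_gb_per_day * replication_factor       # replicated fill rate
    return free_gb / fill_gb_per_day


days = days_to_full(free_gb=2_000, ingress_mb_per_s=50, replication_factor=3)
print(round(days, 2), "days")        # 0.15 days
print(round(days * 24, 1), "hours")  # 3.7 hours
```

Calling the function with replication_factor=1 reproduces the 3× over-statement of runway that comes from forgetting the RF term.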

Days-to-full must use the projected, not current, ingress

A retention-bound topic plateaus: bytes-in is offset by bytes-aged-out once steady state is reached, so a naive free / current_write_rate over-warns. But a growing ingress, a lengthened retention, or a new high-volume topic breaks the plateau. Trend free_bytes directly with predict_linear(node_filesystem_avail_bytes[6h], 7*86400) so the forecast captures the real slope, including retention and RF changes you may have forgotten you made.

7 · Partition-count headroom (% of per-broker ceiling)

Predicts: hitting the per-broker partition limit and the slow failover that accompanies high partition density. Mechanism: partition count is bounded by file descriptors, fetcher overhead, and on Linux by vm.max_map_count. Leader-election work also scales with leaderships held, so density predicts longer outages on the next failure. Band: headroom = 1 − (partitions_on_broker ÷ ceiling); warn at < 30% headroom, act at < 15%. Lead time: days-to-weeks (partition growth is usually deliberate). Cross-ref op02, op03, Part I 11.

Where the ~32,765 ceiling and the failover cost come from
Symbol | Value (with units) | Why / source | Kind
vm.max_map_count | 65,530 mmap areas/process | Linux kernel default for memory-mapped regions per process | OS default (cited: Instaclustr, empirical)
mmaps/partition | ≈ 2 mmap areas/partition | Kafka mmaps the .index + .timeindex of the active segment (Instaclustr KRaft Part 3) | empirical figure
election cost/partition | ≈ 5 ms/leader-partition | time to elect one leader on unclean broker loss (Jun Rao 2015) | research figure (ZK-era; illustrative)

Derivation 1. The per-broker partition ceiling (the mmap areas cancel, leaving partitions):
ceiling = vm.max_map_count ÷ mmaps_per_partition = 65,530 mmap ÷ 2 mmap/partition = 32,765 partitions/broker, unless vm.max_map_count is raised. This is a hard OS ceiling, distinct from the much lower practical limits set by FDs, fetcher overhead, and failover time discussed in op02.

Derivation 2. Headroom, worked (the partitions cancel, leaving a dimensionless fraction). For a broker holding 24,000 partitions against the default ceiling:
headroom = 1 − (24,000 partitions ÷ 32,765 partitions) = 1 − 0.73 ≈ 0.27 = 27% → below the 30% warn line, act soon.

Derivation 3. The failover cost that density predicts (the partitions cancel, leaving ms). If those 24,000 partitions are split RF=3 across enough brokers that one broker leads ~8,000 of them, an unclean loss of that broker costs:
outage ≈ 5 ms/leader-partition × 8,000 leader-partitions = 40,000 ms = 40 s of leaderless unavailability for those partitions until elections complete.

Takeaway: headroom warns you about the ceiling; the 5 ms/partition figure warns you that the cost of the next failure grows with density. These are two independent reasons to cap partitions/broker. Both constants are illustrative: the 5 ms is a 2015 ZooKeeper-era measurement that teaches the linear-in-leaderships mechanism, not a 2026 prediction (KRaft failover is far faster), and the 32,765 ceiling assumes 2 mmaps/partition and the default vm.max_map_count.
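
The three derivations fit in one short Python sketch. The function names are hypothetical, and the defaults are the illustrative constants from the table (65,530 map areas, 2 mmaps per partition, 5 ms per leader election), not measured values for your cluster.

```python
def partition_ceiling(max_map_count: int = 65_530, mmaps_per_partition: int = 2) -> int:
    """Hard per-broker partition ceiling implied by vm.max_map_count."""
    return max_map_count // mmaps_per_partition


def partition_headroom(partitions_on_broker: int, ceiling: int) -> float:
    """Fraction of the ceiling still unused."""
    return 1.0 - partitions_on_broker / ceiling


def failover_ms(leader_partitions: int, ms_per_election: float = 5.0) -> float:
    """Leaderless time on unclean broker loss, linear in leaderships held."""
    return leader_partitions * ms_per_election


ceiling = partition_ceiling()
print(ceiling)                                        # 32765
print(round(partition_headroom(24_000, ceiling), 2))  # 0.27
print(failover_ms(8_000) / 1000, "s")                 # 40.0 s
```

Raising vm.max_map_count moves the ceiling but leaves the failover term untouched, which is why the two are tracked separately.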

8 · p99 produce/fetch latency drift

Predicts: an SLA breach before the alert line is crossed. Mechanism: the request-time breakdown TotalTimeMs = RequestQueueTimeMs + LocalTimeMs + RemoteTimeMs + ResponseQueueTimeMs + ResponseSendTimeMs (+ ThrottleTimeMs) (reference §5, empirical). A slow upward drift in any component is a leading edge: rising RemoteTimeMs → followers slowing (links to indicator 5); rising LocalTimeMs → disk/flush/GC; rising RequestQueueTimeMs → handler starvation (indicator 2). Band: alert on a sustained week-over-week or day-over-day increase in the p99, not only the absolute crossing, e.g. p99 up > 50% vs. a 7-day baseline. Lead time: hours-to-days. Cross-ref Part I 07, op08.
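
The drift check can be sketched as a comparison of each component's current p99 against its 7-day baseline. The function name and the sample values are hypothetical; only the component names and the 50% threshold come from the text.

```python
from typing import Dict, List, Tuple


def drifting_components(
    current_p99_ms: Dict[str, float],
    baseline_p99_ms: Dict[str, float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (component, fractional increase) pairs above `threshold`, worst first."""
    drifts = []
    for name, now in current_p99_ms.items():
        base = baseline_p99_ms.get(name, 0.0)
        if base <= 0:
            continue  # no usable baseline, so the ratio is undefined
        increase = now / base - 1.0
        if increase > threshold:
            drifts.append((name, increase))
    return sorted(drifts, key=lambda pair: pair[1], reverse=True)


# Hypothetical p99 values in milliseconds.
baseline = {"RequestQueueTimeMs": 1.0, "LocalTimeMs": 4.0, "RemoteTimeMs": 12.0,
            "ResponseQueueTimeMs": 0.5, "ResponseSendTimeMs": 0.5}
current = {"RequestQueueTimeMs": 1.2, "LocalTimeMs": 4.1, "RemoteTimeMs": 30.0,
           "ResponseQueueTimeMs": 0.5, "ResponseSendTimeMs": 0.6}

print(drifting_components(current, baseline))  # only RemoteTimeMs exceeds +50%
```

Reporting the drifting component rather than only TotalTimeMs points straight at the related indicator: RemoteTimeMs at followers, LocalTimeMs at disk or GC, RequestQueueTimeMs at handler starvation.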

9 · GC pause frequency/duration trending up

Predicts: soft failures cascading into ISR drops, session timeouts, and rebalances. Mechanism: a stop-the-world pause that exceeds replica.lag.time.max.ms or a client's session.timeout.ms (default 45000 ms, ConsumerConfig.java:438) causes the broker to miss fetches and clients to miss heartbeats. The reference documents 500 ms pauses triggering rebalances and multi-second Full GCs causing ISR drops (empirical); LinkedIn's busy clusters run ~21 ms p90 pause, < 1 young GC/s. Band: p99 pause > 50 ms = warning, > 200 ms = critical (empirical); also alert on rising frequency, which precedes a creeping heap. Lead time: hours-to-days as heap pressure builds. The conventional cap of ~6 GiB heap (a planning heuristic, reference §3 / op05, not a hard limit) exists precisely so pauses stay short and the rest of RAM serves page cache (indicator 4): on a 32 GiB box, a 6 GiB heap leaves ~28–30 GiB for the OS cache. Cross-ref op05, op07.

10 · Quota throttle-time becoming non-zero

Predicts: client-visible slowdown, then timeouts and errors, as a producer/consumer approaches its quota. Mechanism: when a client exceeds a byte-rate or request quota the broker delays the response and reports the delay; the client sees produce-throttle-time-avg / produce-throttle-time-max (SenderMetricsRegistry.java:123-126) and fetch-throttle-time-avg/-max (FetchMetricsRegistry.java:107-110). Throttling is graceful at first (latency just rises), but as the client backs up it heads toward buffer exhaustion (indicators 16/17) and eventual send failures. Band: any sustained non-zero throttle-time on a client that is supposed to run unthrottled. Lead time: minutes-to-hours before the throttle turns into client errors. Cross-ref Part I 19, 13, op08.

11 · Controller event-queue time/backlog rising

Predicts: metadata-operation slowness, with topic creates, partition reassignments, and leader elections all queuing behind a busy controller. Mechanism: the KRaft controller processes events on a single-threaded loop; EventQueueTimeMs (QuorumControllerMetrics.java:48-49) is how long an event waits before processing, and AvgIdleRatio (:52) is the loop's headroom. As the controller saturates, queue time climbs and metadata propagation (Part I 12) slows cluster-wide. Band: rising EventQueueTimeMs p99, or AvgIdleRatio entering 0.5→0.3. Cross-check TimedOutBrokerHeartbeatCount (:62) and LastAppliedRecordLagMs (:60); a standby controller falling behind on the metadata log is itself a leading risk to failover speed. Lead time: minutes-to-hours. Cross-ref Part I 11, 10.

12 · Rebalance rate increasing

Predicts: consumer-group instability and processing stalls. Mechanism: each rebalance pauses consumption for the group; a rising rebalance-rate-per-hour (ConsumerRebalanceMetricsManager.java:70) or any non-zero failed-rebalance-rate-per-hour (:74) signals churn: flapping members, slow processing tripping max.poll.interval.ms (default 300000 ms, ConsumerConfig.java:629), or unstable membership. Watch last-rebalance-seconds-ago (:97): a value that keeps resetting to near-zero is a rebalance storm in progress. Band: rebalance-rate above the group's steady-state baseline, or failed-rebalance-rate > 0. Lead time: minutes. It is fast-moving, but still earlier than the lag spike the storm eventually produces. Cross-ref Part I 13, op07.

13 · Connection count / creation-rate approaching caps

Predicts: hitting max.connections / max.connections.per.ip limits, or file-descriptor exhaustion, causing new clients to be rejected. Mechanism: each connection consumes an FD and listener slot; the broker tracks connection-count (Selector.java:1260) and the standard connection-creation-rate. A climbing creation-rate often indicates clients in a connect/disconnect loop (mis-tuned pools, repeated auth failures) that will exhaust the cap. ExpiredConnectionsKilledCount (SocketServer.scala:136) rising shows the broker reaping idle connections under connections.max.idle.ms pressure. Band: connection-count headroom < 20% of the configured cap, or a sustained non-zero creation-rate on a stable client population. Lead time: minutes-to-hours. Cross-ref op02, Part I 06.

14 · Log-flush time creeping up = disk slowing

Predicts: disk degradation, which will surface as rising LocalTimeMs, ISR shrinks, and eventually produce stalls. Mechanism: LogFlushRateAndTimeMs (LogSegment.java:77, kafka.log:type=LogFlushStats) times the fsync of log segments. A slow upward creep means the disk is taking longer to accept writes: a failing SSD, a noisy neighbour on shared storage, or a saturated I/O queue. Band: any sustained increase over the disk's established baseline (the absolute number is hardware-specific). Lead time: hours-to-days for gradual degradation; correlate with rising LocalTimeMs and falling RequestHandlerAvgIdlePercent. Cross-ref Part I 03, op04.

15 · Leader-election time growing with partition count

Predicts: longer outages on the next broker failure. Mechanism: LeaderElectionRateAndTimeMs (controller) reports time spent without a leader during elections; unclean broker loss costs roughly 5 ms × #leader-partitions-on-broker (Jun Rao 2015, empirical), so as partition density grows the failover duration grows with it. This is a second-order leading indicator: it predicts the severity of a future incident, letting you rebalance or add brokers before density makes the next failure painful. Band: trend election time vs. partition count; act when projected unavailability on a single-broker loss exceeds your SLO. Lead time: weeks (it tracks slow partition growth). Cross-ref op02, op11, Part I 11.

Client-team leading indicators (no broker access required)

The part most guides miss: a producer or consumer application team can be proactive entirely from their own client JMX metrics, with zero broker access, and frequently see trouble building before the platform team's lagging broker alerts fire. The data lives in the client's org.apache.kafka.common.metrics.Metrics registry and is also pushed centrally via KIP-714 client telemetry (ClientMetricsManager.java handles GetTelemetrySubscriptions/PushTelemetry at :159:188; Part I 19), so a platform team can scrape app-side signals without app-side scrape endpoints.

Consumer-side leading indicators

Metric | Source | What the trend predicts | Watch for
records-lag-max + its slope | FetchMetricsRegistry.java:102 | structural throughput deficit → SLA breach / retention overrun | sustained positive slope (not absolute value)
fetch-latency-avg drift | FetchMetricsRegistry.java:93 | broker/network slowing on the fetch path | slow upward trend vs. baseline
bytes-consumed-rate | FetchMetricsRegistry.java:81 | compare to production rate; deficit → lag growth | consume-rate < produce-rate over a window
rebalance-rate-per-hour / failed-rebalance-rate-per-hour | ConsumerRebalanceMetricsManager.java:70,74 | group instability → repeated processing stalls | rate above baseline; failed-rate > 0
commit-latency-avg/-max | OffsetCommitMetricsManager.java:45,49 | coordinator slow → at-least-once replay risk on crash | upward drift
time-between-poll-max vs. max.poll.interval.ms | KafkaConsumerMetrics.java:64 vs. ConsumerConfig.java:629 (300000 ms) | processing too slow → consumer evicted → rebalance storm | poll interval approaching the configured max
poll-idle-ratio-avg | KafkaConsumerMetrics.java:70 | low idle = app is the bottleneck (processing-bound) | ratio → 0 (no slack to absorb a load increase)
last-poll-seconds-ago | KafkaConsumerMetrics.java:55 | a stuck consumer about to be kicked | climbing toward max.poll.interval.ms/1000

The poll-interval headroom check is the consumer's most actionable proactive signal

"Am I processing fast enough to avoid being kicked?" is answered directly by time-between-poll-max (KafkaConsumerMetrics.java:64) against max.poll.interval.ms, a real Kafka default of 300,000 ms (ConsumerConfig.java:629, verified). The headroom is a dimensionless fraction: headroom = 1 − (time-between-poll-max ÷ max.poll.interval.ms). Worked: if your slowest observed gap between polls is 210,000 ms, then headroom = 1 − (210,000 ms ÷ 300,000 ms) = 1 − 0.70 = 0.30 = 30% (the ms cancel), exactly at the warn line. When that headroom shrinks below ~30%, you are one slow batch away from eviction and a rebalance, the leading cause of self-inflicted rebalance storms (op07). Fix before the storm: lower max.poll.records, speed up processing, or raise the interval.

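The poll-interval headroom check above, as a Python sketch. The function names are hypothetical; the 300,000 ms default and the 30% warn line are the figures quoted in this section, and the input is whatever value you read from time-between-poll-max.

```python
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000  # max.poll.interval.ms default cited above


def poll_headroom(time_between_poll_max_ms: float,
                  max_poll_interval_ms: float = DEFAULT_MAX_POLL_INTERVAL_MS) -> float:
    """Fraction of the poll interval still unused by the slowest observed poll gap."""
    if max_poll_interval_ms <= 0:
        raise ValueError("max_poll_interval_ms must be positive")
    return 1.0 - time_between_poll_max_ms / max_poll_interval_ms


def eviction_risk(headroom: float, warn: float = 0.30) -> bool:
    """True once headroom has dropped below the warn line."""
    return headroom < warn


print(round(poll_headroom(210_000), 2))       # 0.3, at the warn line
print(eviction_risk(poll_headroom(240_000)))  # True: only 20% headroom left
```

Pass your configured max.poll.interval.ms explicitly if it differs from the default, since the headroom is only meaningful against the value the consumer is actually running with.
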
Producer-side leading indicators

The producer's leading-indicator story is a pipeline: the RecordAccumulator buffers records, the Sender drains them to brokers, and the BufferPool backstops memory. When the broker can't keep up (or batching is mis-tuned), pressure propagates backward through these stages in a predictable order, each with its own metric, giving a clean early-warning sequence before the terminal max.block.ms timeout.

[Figure: producer back-pressure pipeline. app calls send() → RecordAccumulator (record-queue-time-avg/-max ↑ first) → Sender → broker (request-latency-avg ↑; requests-in-flight) → BufferPool (buffer-available-bytes ↓ toward 0) → block for max.block.ms, then BufferExhaustedException. Legend: app; accumulator/sender; buffer pool; terminal failure; data flow; back-pressure.]
Producer back-pressure propagates right-to-left: queue-time rises first (leading), buffer-available falls next (blocking imminent), and only at the end does send() block for max.block.ms (default 60000 ms, ProducerConfig.java:451) and throw BufferExhaustedException (BufferPool.java:161, lagging).

Metric | Source | What the trend predicts | Watch for
record-queue-time-avg / -max | SenderMetricsRegistry.java:88,90 | RecordAccumulator backing up, broker can't keep up or batching mis-tuned → buffer exhaustion next | the earliest producer signal; rising = act now
buffer-available-bytes | RecordAccumulator.java:215 | blocking imminent once it nears 0; then max.block.ms timeouts | falling toward 0 (vs. buffer-total-bytes = buffer.memory, default 32 MB)
waiting-threads | RecordAccumulator.java:205 | app threads already blocked on buffer memory | any non-zero, sustained > 0
request-latency-avg | SenderMetricsRegistry.java:92 | broker-side slowdown reaching the producer | slow upward drift
record-retry-rate | SenderMetricsRegistry.java:102 | transient broker errors rising → approaching delivery.timeout.ms failures | creeping above ~0
record-error-rate | SenderMetricsRegistry.java:106 | permanent send failures, data loss if not handled | any non-zero
produce-throttle-time-avg | SenderMetricsRegistry.java:123 | you are approaching a broker quota | becoming non-zero (links to indicator 10)
buffer-exhausted-rate | BufferPool.java:88 | sends already being dropped on exhaustion, this is lagging | any non-zero = already failing
batch-split-rate | SenderMetricsRegistry.java:118 | batches exceeding max.request.size → re-splitting overhead | sustained > 0 (tune batch.size / record size)

Why record-queue-time is the producer's leading edge

The BufferPool.allocate() path (BufferPool.java:107) only blocks after the pool is empty, and only throws after max.block.ms (:159-163). By then you are already failing sends. But record-queue-time (SenderMetricsRegistry.java:88) measures how long batches sit in the accumulator before the Sender drains them; it rises the moment the drain rate falls below the append rate, which is strictly earlier than the buffer running dry. Trend it, and you get the warning while there is still buffer to absorb the burst. buffer-available-bytes falling is the second-stage confirmation; buffer-exhausted-rate > 0 is the post-mortem.

Producer / consumer proactive checklist

Producer: watch & do | Consumer: watch & do
Trend record-queue-time-max; rising → broker slow or batching too aggressive. Investigate before buffer drains. | Trend records-lag-max slope; positive & sustained → scale consumers or speed processing (lag-time runway, not absolute lag).
Alert buffer-available-bytes < 20% of buffer-total-bytes; raise buffer.memory or reduce send rate before max.block.ms hits. | Alert poll-interval headroom < 30% (1 − time-between-poll-max ÷ max.poll.interval.ms); lower max.poll.records before eviction.
Alert any non-zero record-error-rate; this is data loss unless the callback handles it. | Alert failed-rebalance-rate-per-hour > 0 and rising rebalance-rate-per-hour; stabilise membership (static IDs, cooperative assignor).
Alert non-zero produce-throttle-time-avg; you are nearing a quota, request a raise or compress. | Alert non-zero fetch-throttle-time-avg; nearing a fetch quota, coordinate with platform.
Watch request-latency-avg drift as an early read on broker health from your vantage point. | Watch commit-latency-max drift; slow commits widen the replay window on a crash.

Worked example: producer buffer headroom

The mirror of the consumer poll-headroom check. buffer-total-bytes equals the configured buffer.memory, a real Kafka default of 32 MB = 33,554,432 bytes (ProducerConfig.java:401, 32 * 1024 * 1024L, verified); treat the actual configured value as the assumption. Headroom is a dimensionless fraction: headroom = buffer-available-bytes ÷ buffer-total-bytes. Worked, with buffer-available-bytes = 6 MB observed against a 32 MB default: headroom = 6 MB ÷ 32 MB = 0.1875 ≈ 19% (the MB cancel), below the 20% alert line, so raise buffer.memory or cut the send rate before the pool empties and send() blocks for max.block.ms (60,000 ms, ProducerConfig.java:451, verified).
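
A Python sketch of the buffer check, extended with the chapter's remaining ÷ rate pattern to estimate how long the pool lasts. The function names and the append/drain rates are hypothetical; only the 32 MB default and the 20% alert line come from the text, and the producer does not expose append and drain rates under these names.

```python
import math

DEFAULT_BUFFER_MEMORY = 32 * 1024 * 1024  # buffer.memory default, in bytes


def buffer_headroom(available_bytes: float,
                    total_bytes: float = DEFAULT_BUFFER_MEMORY) -> float:
    """buffer-available-bytes as a fraction of buffer-total-bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return available_bytes / total_bytes


def seconds_until_empty(available_bytes: float,
                        append_bytes_per_s: float,
                        drain_bytes_per_s: float) -> float:
    """Runway before send() starts blocking, if appends keep outpacing drains."""
    deficit = append_bytes_per_s - drain_bytes_per_s
    if deficit <= 0:
        return math.inf  # the Sender is keeping up; the pool is not shrinking
    return available_bytes / deficit


MIB = 1024 * 1024
headroom = buffer_headroom(6 * MIB)
print(headroom, headroom < 0.20)                      # 0.1875 True
print(seconds_until_empty(6 * MIB, 3 * MIB, 2 * MIB)) # 6.0 seconds at the assumed rates
```

The runway figure is the more useful alert input when bursts are short: a low headroom with an infinite runway is a pool that is small but stable.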

Two client-metric blind spots

(1) Client-side records-lag-max only covers running consumers; a dead or stuck consumer reports nothing. An external lag monitor (Burrow, kafka_exporter) is still required to catch a group that has stopped entirely (reference §5, empirical). (2) records-lag-max is computed from the current position, not the committed offset (FetchMetricsRegistry.java:103); a consumer that fetches fast but commits slowly can look healthy on lag while risking large replay on a crash, so pair lag with commit-latency.

The runway / headroom dashboard

Assemble a single "how much runway is left" panel set. Each panel is a quotient (remaining ÷ rate) trended over time, with the alert on the quotient, not the numerator. This is the proactive counterpart to the golden-signal status board in op08.

Figure: the runway / headroom dashboard, five panels.

  • Throughput headroom: idle% bands, how much request capacity remains before queueing. Metrics: RequestHandlerAvgIdlePercent, NetworkProcessorAvgIdlePercent.
  • Disk runway: days-to-full = free_bytes / (ingress/day × RF). Inputs: predict_linear(avail_bytes), 85% backstop.
  • Partition & FD headroom: 1 − partitions/ceiling; 1 − conns/cap. Limits: vm.max_map_count, connection-count.
  • Consumer lag-time runway: records_lag / consume_rate vs. retention. Inputs: records-lag-max slope, Burrow trend.
  • Cache headroom: disk-read rate while lag≈0 (working set vs. RAM). Inputs: node_disk_reads, page-cache hit.

Five runway panels, each a remaining/rate quotient. Alert on the quotient trending down through your lead-time budget, not on the raw numerator crossing a fixed line.
| Runway panel | Formula | Alert when | Lead time |
|---|---|---|---|
| Throughput headroom | idle% gauges (0–1) | min(handler, network) idle% trending into 0.3→0.2 | min–hr |
| Disk runway | free / (ingress/day × RF) | projected days-to-full < lead-to-provision (7–14 d); hard 85% | days |
| Partition headroom | 1 − partitions/ceiling | < 30% (warn), < 15% (act) | days–wks |
| FD / connection headroom | 1 − conns/cap | < 20% headroom or rising creation-rate on stable population | min–hr |
| Consumer lag-time runway | records_lag / consume_rate | positive slope sustained; projected to reach retention | min–hr |
| Cache headroom | disk-read rate while lag≈0 | non-zero & rising while consumers caught up | hr–days |
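
Two of the quotients above, sketched in Python to show how a runway is compared against a lead-time budget. The function names and the sample figures (12 TB free, 500 GB/day ingress, RF 3, a 14-day provisioning lead time) are illustrative assumptions, not measurements. The sketch also assumes that free bytes and ingress are cluster-wide totals and that ingress is net growth, ignoring retention deletes.

```python
import math


def disk_days_to_full(free_bytes: float, ingress_bytes_per_day: float,
                      replication_factor: int) -> float:
    """days-to-full = free / (ingress/day x RF); infinite if nothing is growing."""
    growth_per_day = ingress_bytes_per_day * replication_factor
    if growth_per_day <= 0:
        return math.inf
    return free_bytes / growth_per_day


def lag_time_seconds(records_lag: float, consume_rate_per_s: float) -> float:
    """records_lag / consume_rate: how far behind the consumer is, in seconds."""
    if consume_rate_per_s <= 0:
        return math.inf  # a stalled consumer never catches up
    return records_lag / consume_rate_per_s


def runway_breached(runway: float, lead_time_budget: float) -> bool:
    """Alert on the quotient: runway shorter than the time needed to react."""
    return runway < lead_time_budget


if __name__ == "__main__":
    days = disk_days_to_full(12 * 10**12, 5 * 10**11, 3)
    print(days, runway_breached(days, lead_time_budget=14))  # 8.0 True
    print(lag_time_seconds(50_000, 2_500))                   # 20.0
```

In practice the inputs would come from trended series (for example a predict_linear projection) rather than single samples, so the alert follows the slope rather than an instantaneous reading.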

SLO error-budget & burn-rate framing

The most disciplined proactive pattern is the burn-rate alert: define an SLO (e.g. "99.9% of produce requests succeed within 50 ms over 30 days"), derive an error budget, and alert when the current consumption rate of the budget would exhaust it before the window closes. This unifies the catalogue: every leading indicator above is, ultimately, an early read on whether you are about to start burning budget.

Assumptions for the worked burn-rate numbers

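The 14.4× over 1 h fast-burn multiple is the canonical Google SRE multi-window page threshold. The 3× over 6 h slow-burn multiple follows the same pattern but is not a row in the SRE Workbook's own table, which pairs a 6-hour window with 6× (page) and a 3-day window with 1× (ticket). Both are heuristics, not Kafka limits; the SLO target and window are illustrative, so substitute your own.

| Symbol | Value (with units) | Why / source | Kind |
|---|---|---|---|
| S (SLO target) | 99.9% = 0.999 (dimensionless) | example produce-success objective | illustrative |
| W (window) | 30 days = 43,200 min | example budget window (30 × 24 × 60) | illustrative |
| fast-burn multiple | 14.4× budget over 1 h | Google SRE multi-window page threshold | cited heuristic (Google SRE) |
| slow-burn multiple | 3× budget over 6 h | multi-window ticket threshold, adapted from the Google SRE pattern | adapted heuristic (Google SRE style) |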
Error budget
For an availability SLO of S over window W: budget = (1 − S) × W. Worked: (1 − 0.999) × 43,200 min = 0.001 × 43,200 min = 43.2 min of allowed "bad" per 30 days (the % cancels; result is in minutes). So 0.1% of a 30-day window ≈ 43 min.
Burn rate
burn_rate = (bad events ÷ total events) ÷ (1 − S), a dimensionless multiple of the budget's steady drain. A burn rate of 1 exactly exhausts the budget at the window's end; > 1 exhausts it early. (E.g. failing 1.44% of requests over an hour = 0.0144 ÷ 0.001 = 14.4×.)
Fast-burn page
14.4× over a 1-hour window. Why "~2% of the budget in 1 h": at 14.4× the budget drains 14.4× faster than the 30-day baseline, and 1 h is 1 ÷ (30 × 24) = 1/720 of the window, so the fraction consumed is 14.4 × (1 h ÷ 720 h) = 14.4 ÷ 720 = 0.02 = 2% of the 30-day budget in one hour (the hours cancel). Burning 2%/hour empties the budget in ~50 h → page; the cliff is hours-to-days away.
Slow-burn ticket
3× over a 6-hour window. Fraction consumed: 3 × (6 h ÷ 720 h) = 18 ÷ 720 = 0.025 = 2.5% of the 30-day budget over 6 h → at that pace the budget lasts 720 h ÷ 3 = 240 h ≈ 10 days → ticket; the cliff is days away, but the trend is real.
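
The four definitions above reduce to a few lines of arithmetic. The following Python sketch reproduces the worked numbers; the function names are hypothetical, and the 14.4× and 3× thresholds are the ones used in this chapter, not Kafka settings.

```python
FAST_BURN = 14.4  # page threshold, evaluated over a 1-hour window
SLOW_BURN = 3.0   # ticket threshold, evaluated over a 6-hour window


def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """budget = (1 - S) x W, in the unit of W."""
    return (1.0 - slo) * window_minutes


def burn_rate(bad_events: float, total_events: float, slo: float) -> float:
    """(bad / total) / (1 - S): multiple of the budget's steady drain."""
    return (bad_events / total_events) / (1.0 - slo)


def budget_fraction_consumed(rate: float, alert_window_h: float,
                             slo_window_h: float) -> float:
    """Share of the whole budget burned during one alert window."""
    return rate * alert_window_h / slo_window_h


def hours_to_exhaustion(rate: float, slo_window_h: float) -> float:
    """How long a full budget lasts at a constant burn rate."""
    return slo_window_h / rate


def classify(fast_window_rate: float, slow_window_rate: float) -> str:
    """Fast burn pages, slow burn tickets, anything else needs no action."""
    if fast_window_rate >= FAST_BURN:
        return "page"
    if slow_window_rate >= SLOW_BURN:
        return "ticket"
    return "ok"


if __name__ == "__main__":
    window_h = 30 * 24  # 720 h
    print(f"{error_budget_minutes(0.999, 43_200):.1f} min")           # 43.2 min
    print(f"{burn_rate(144, 10_000, 0.999):.1f}x")                    # 14.4x
    print(f"{budget_fraction_consumed(14.4, 1, window_h):.3f}")       # 0.020
    print(f"{budget_fraction_consumed(3.0, 6, window_h):.3f}")        # 0.025
    print(f"{hours_to_exhaustion(3.0, window_h):.0f} h")              # 240 h
```

Floating-point rounding can land a computed rate a hair under a threshold (1.44% failures evaluates to just below 14.4), so compare with a small tolerance or set thresholds slightly below the nominal multiple if the exact boundary matters.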
Figure: multi-window burn-rate decision flow. Measure the SLI (e.g. p99 produce success) → compare the current burn rate against the budget → fast burn: PAGE (cliff in hours); slow burn: TICKET (cliff in days); within budget: no action.
Multi-window burn-rate alerting: the same SLI drives a paging fast-burn and a ticketing slow-burn. The slow-burn is the proactive arm: it fires while runway remains.
Pairing rule: every lagging golden signal should have a leading partner

Audit your alert set as pairs. UnderReplicatedPartitions > 0 (lagging) ↔ IsrShrinksPerSec rising (leading). OfflinePartitionsCount > 0 (lagging) ↔ partition-headroom + leader-election-time trend (leading). Producer buffer-exhausted-rate > 0 (lagging) ↔ record-queue-time rising (leading). Consumer SLA-miss (lagging) ↔ records-lag-max slope (leading). If a golden signal has no leading partner on the board, you have an outage class you can only respond to, never prevent.
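
The audit can be automated against whatever inventory of alert names you keep. The sketch below is one way to do it, under loud assumptions: the identifiers are simply the pair names from the rule above (your alerting system will use its own), and a lagging alert counts as paired only when all of its listed partners are present.

```python
from __future__ import annotations

# Lagging golden signal -> leading partner(s), as listed in the pairing rule.
# The alert names are illustrative labels, not identifiers from any tool.
LEADING_PARTNERS: dict[str, list[str]] = {
    "UnderReplicatedPartitions": ["IsrShrinksPerSec"],
    "OfflinePartitionsCount": ["partition-headroom", "leader-election-time-trend"],
    "buffer-exhausted-rate": ["record-queue-time"],
    "consumer-sla-miss": ["records-lag-max-slope"],
}


def unpaired_lagging_alerts(configured_alerts: set[str]) -> list[str]:
    """Return configured lagging alerts that lack one or more leading partners."""
    return sorted(
        lagging
        for lagging, partners in LEADING_PARTNERS.items()
        if lagging in configured_alerts
        and not all(p in configured_alerts for p in partners)
    )


if __name__ == "__main__":
    board = {"UnderReplicatedPartitions", "IsrShrinksPerSec",
             "OfflinePartitionsCount", "partition-headroom"}
    print(unpaired_lagging_alerts(board))  # ['OfflinePartitionsCount']
```

Each name in the returned list is an outage class the board can only react to.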

Master table: the leading-indicator catalogue

| # | Leading indicator | Predicts | Early-warning band / trend | Lead time | Source / metric |
|---|---|---|---|---|---|
| 1 | Consumer-lag trend (records + time) | SLA breach / retention overrun | sustained positive slope; projected time-lag → retention | min–hr | records-lag-max slope, FetchMetricsRegistry.java:102 + Burrow |
| 2 | RequestHandlerAvgIdlePercent ↓ | I/O-thread saturation; RequestQueueTimeMs | 0.3 (warn) → 0.2 (act) | min–hr | KafkaRequestHandler.scala:253 |
| 3 | NetworkProcessorAvgIdlePercent ↓ | network-thread saturation; ResponseQueueTimeMs | 0.4 → 0.3 | min–hr | SocketServer.scala:120 |
| 4 | Page-cache pressure | working set > RAM; consumers off zero-copy path | disk-read rate ↑ while lag ≈ 0 | hr–days | OS disk reads vs. Part I 09 |
| 5 | IsrShrinksPerSec ↑ (re-expanding) | brokers struggling (GC/disk/net) before URP | shrink/expand churn above ~0 baseline | min | IsrShrinksPerSec; replica.lag.time.max.ms=30000 (ReplicationConfigs.java:55) |
| 6 | Disk-free runway | full-disk outage (Exit.halt(1)) | days-to-full < lead-to-provision; 85% backstop | days | free/(ingress×RF); op04 |
| 7 | Partition-count headroom | per-broker ceiling; slow failover | headroom < 30% (warn), < 15% (act) | days–wks | vm.max_map_count ≈32k/broker; op02 |
| 8 | p99 produce/fetch latency drift | SLA breach before alert line | p99 ↑ > 50% vs. 7-day baseline | hr–days | RequestMetrics TotalTimeMs |
| 9 | GC pause freq/duration ↑ | soft failures: ISR drops, session timeouts | p99 pause > 50 ms (warn), > 200 ms (crit) | hr–days | JVM GC vs. session.timeout.ms=45000 (ConsumerConfig.java:438) |
| 10 | Quota throttle-time > 0 | client-visible slowdown → errors | any sustained non-zero on an unthrottled client | min–hr | produce-throttle-time-avg (SenderMetricsRegistry.java:123), fetch-throttle-time-avg (FetchMetricsRegistry.java:107) |
| 11 | Controller event-queue ↑ | metadata-operation slowness | EventQueueTimeMs p99 ↑; AvgIdleRatio 0.5→0.3 | min–hr | QuorumControllerMetrics.java:48,52 |
| 12 | Rebalance rate ↑ | group instability / processing stalls | rate above baseline; failed-rate > 0 | min | ConsumerRebalanceMetricsManager.java:70,74 |
| 13 | Connection count / creation-rate ↑ | connection caps / FD exhaustion | headroom < 20%; sustained creation-rate on stable pop. | min–hr | connection-count (Selector.java:1260); ExpiredConnectionsKilledCount (SocketServer.scala:136) |
| 14 | Log-flush time ↑ | disk slowing → LocalTimeMs ↑, ISR shrinks | sustained rise over hardware baseline | hr–days | LogFlushRateAndTimeMs (LogSegment.java:77) |
| 15 | Leader-election time vs. partitions | longer outage on the next failure | projected single-broker unavailability > SLO | wks | LeaderElectionRateAndTimeMs; ≈5 ms/partition (empirical) |
| C1 | Producer record-queue-time ↑ | accumulator backing up → buffer exhaustion | earliest producer signal; any sustained rise | min | SenderMetricsRegistry.java:88 |
| C2 | Producer buffer-available-bytes ↓ | blocking → max.block.ms timeout | < 20% of buffer-total-bytes | min | RecordAccumulator.java:215; max.block.ms=60000 (ProducerConfig.java:451) |
| C3 | Consumer poll-interval headroom ↓ | eviction → rebalance storm | headroom < 30% of max.poll.interval.ms | min | time-between-poll-max (KafkaConsumerMetrics.java:64) vs. 300000 (ConsumerConfig.java:629) |
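
Most bands in the catalogue share one shape: a gauge where lower is worse, with a warn line and an act line. A minimal Python sketch of that shape follows, with a guard that rejects gauges not reported as a 0–1 fraction. The function names are hypothetical, and treating the boundary as strictly "below the line" is an assumption; the table does not say whether the threshold value itself should fire.

```python
def validate_fraction(name: str, value: float) -> float:
    """Reject gauges that are not a 0-1 fraction (e.g. one reported as 80.0)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name}={value} is not a 0-1 fraction; check its unit")
    return value


def band(value: float, warn: float, act: float) -> str:
    """Classify a lower-is-worse gauge into 'ok', 'warn' or 'act'."""
    if act >= warn:
        raise ValueError("act threshold must sit below the warn threshold")
    if value < act:
        return "act"
    if value < warn:
        return "warn"
    return "ok"


if __name__ == "__main__":
    idle = validate_fraction("RequestHandlerAvgIdlePercent", 0.25)
    print(band(idle, warn=0.3, act=0.2))    # warn  (row 2 bands)
    print(band(0.12, warn=0.30, act=0.15))  # act   (row 7 partition headroom)
    print(band(0.80, warn=0.3, act=0.2))    # ok
```

A static band only says where the gauge is now. Pairing it with the trend of the same series is what gives the lead times quoted in the table.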
Lead times are provisioning-dependent, not guarantees

Every lead-time figure above is an operational rule of thumb (empirical) and assumes you provisioned some headroom. A cluster run at 95% disk and 0.25 handler-idle has effectively zero lead time: the leading indicators have already merged into the lagging ones. The point of headroom is to keep the leading and lagging signals separated by a usable interval. Benchmark numbers (e.g. the ~2 ms → ~250 ms catch-up tax, the ~5 ms/partition election cost) are version- and hardware-dependent illustrations of the mechanism, not predictions for your fleet.

What to do this week

  1. Build the runway dashboard. Five quotient panels (throughput idle%, disk days-to-full, partition headroom, FD/connection headroom, lag-time runway). Alert on the quotient trend, not the numerator.
  2. Convert your top lagging alerts to pairs. For each golden signal in op08, add the leading partner from the invariant box above. An unpaired golden signal is an outage you can only react to.
  3. Add multi-window burn-rate alerts on your produce-success and consumer-lag SLOs: a fast-burn page and a slow-burn ticket. The slow-burn is your proactive early warning.
  4. Push the client checklist to app teams. Producer: trend record-queue-time-max and buffer-available-bytes. Consumer: trend records-lag-max slope and poll-interval headroom. Wire app-side metrics centrally via KIP-714 so the platform team sees them too.
  5. Validate idle gauges are 0–1. Confirm RequestHandlerAvgIdlePercent reports a fraction before trusting any band (KAFKA-7295 / KIP-1207 footguns).

The discipline is simple to state and hard to keep: every metric that can break your cluster has a derivative and a distance, and the derivative and distance are visible before the break. Monitor those, and most pages become tickets you closed days earlier.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.