II · 14 · Proactive Monitoring: Leading Indicators & Capacity Runway

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

The golden signals in op08 are honest but cruel: OfflinePartitionsCount > 0 means partitions are already unavailable; UncleanLeaderElectionsPerSec > 0 means data is already gone. They are lagging, they fire at the cliff edge, not before it. This chapter is about the metrics that buy you lead time: the slopes that show a number marching toward a limit, the headroom that tells you how far the limit still is, and the saturation bands that light up minutes-to-days before queuing and errors appear. Every leading indicator here is sourced to a real metric in the code with a default-grounded danger band, a typical lead time, and the mechanism that makes it predictive. The payoff is a posture change: from responding to outages to spending runway deliberately.

Lagging vs. leading: the spine of the chapter

A lagging indicator is a state that only becomes true once damage has occurred. UnderReplicatedPartitions is the archetype: by the time the in-sync replica set has actually shrunk, the partition is already at reduced durability. A leading indicator is one of three shapes, each of which exists before the lagging signal:

Slope / rate-of-change. A level metric whose trend is what matters. Consumer records-lag of 50,000 is meaningless in isolation; a sustained positive slope of +2,000 records/s means the consumer is structurally below required throughput and will breach SLA, the question is only when.
Headroom / runway. The distance to a limit, expressed in the unit that limit is measured in: free disk bytes converted to days-to-full, partition count as a percentage of the per-broker ceiling, file descriptors remaining. Runway turns an invisible threshold into a countdown.
Saturation band. A utilisation gauge crossing into a danger zone well before it pins. RequestHandlerAvgIdlePercent falling from 0.8 to 0.3 is not yet an outage, but it is the warning that the I/O thread pool is filling and request-queue time is about to climb non-linearly.

a metric is moving

is it a state or a trend?

LAGGING, URP>0, Offline>0, UncleanLeaderElection>0. Page now; damage done.

LEADING, slope, runway, or saturation band. Ticket / forecast; lead time remains.

spend runway deliberately, before the cliff

The same underlying problem (e.g. a slow follower) surfaces first as a leading band (IsrShrinksPerSec ticking, idle% dipping) and only later as a lagging state (UnderReplicatedPartitions).

lagging / damage done leading / lead time remains the action window intended response path

The reframe in one sentence

Alert on the derivative and the distance, not only the value. A threshold alarm tells you that you arrived; a slope/runway alarm tells you that you are going to arrive, and roughly when.

Three monitoring disciplines

(A) Alert on the slope, rate-of-change and burn-rate

For any metric trending toward a cliff, the alert should fire on the rate at which it approaches, not on a fixed level. Two implementations dominate:

Linear slope on a level metric. Compute the trend over a sliding window (e.g. PromQL deriv(kafka_consumer_records_lag[15m]) or predict_linear(...[1h], 4*3600)). Alert when the projected value crosses the limit within your response budget. LinkedIn's Burrow formalised exactly this for consumer lag: it classifies a group OK/WARNING/ERROR by evaluating the lag trend over a sliding window of committed offsets, so no per-topic threshold needs tuning (LinkedIn Burrow, empirical).
Burn-rate on an error budget. Borrowed from Google SRE: if your SLO permits X bad events per window, alert when the current consumption rate would exhaust the remaining budget before the window ends. A fast burn (e.g. 14.4× budget over 1 h) pages; a slow burn (e.g. 3× over 6 h) tickets. This is the canonical proactive pattern and is detailed in the SLO section below.

Why slope beats threshold for lag

Absolute lag conflates two unrelated things: a one-off backlog (a deploy, a batch job) that will self-drain, and a structural throughput deficit that will not. The deficit is visible only in the sign and magnitude of the slope. A positive, sustained slope on records-lag-max (FetchMetricsRegistry.java:102) means bytes-consumed-rate < production rate, a condition that the absolute number, paradoxically, hides while lag is still "small".

(B) Track headroom / runway

Runway converts a resource limit into a time or percentage you can put on a dashboard and trend. The pattern is always runway = remaining / consumption_rate. Examples developed below: disk days_to_full = free_bytes / (ingress_per_day × RF); partition headroom = 1 − (partitions_on_broker / ceiling); lag-time runway = records_lag / consume_rate. The discipline: pick the resource, measure both terms, alert on the quotient trending down, not the numerator alone.

(C) Watch saturation bands

Kafka exposes several utilisation gauges as fractions in [0,1]. They degrade gracefully then suddenly: idle time falls linearly with load until the queue starts to fill, at which point latency rises super-linearly (classic queueing behaviour, M/M/1: wait time ∝ 1/(1−ρ)). Watching the band, entering, say, 0.3–0.2, gives lead time before the knee. The two broker bands and their exact source mechanics:

Saturation gauge	Source & mechanism	Healthy	Warning band	Saturated
`RequestHandlerAvgIdlePercent`	`KafkaRequestHandler.scala:253`, meter over idle time of the `num.io.threads` pool; each handler discounts idle time by thread count (`:121-123`). 1.0 = fully idle, 0 = fully busy.	> 0.6	0.3 → 0.2	→ 0 (requests queue, `RequestQueueTimeMs` climbs)
`NetworkProcessorAvgIdlePercent`	`SocketServer.scala:120-132`, mean of per-processor `io-wait-ratio`, capped at 1.0. Measures time network threads wait for socket readiness vs. doing work.	> 0.6	0.4 → 0.3	→ 0 (responses queue, `ResponseQueueTimeMs` climbs)
`ControllerEventManager AvgIdleRatio`	`QuorumControllerMetrics.java:52`, idle ratio of the single-threaded controller event loop.	> 0.7	0.5 → 0.3	→ 0 (`EventQueueTimeMs` climbs; metadata ops slow)
`RemoteLogReaderAvgIdlePercent`	`RemoteStorageThreadPool.java:60-61`, `1 − activeCount/corePoolSize` of the tiered-storage read pool.	> 0.5	0.3 → 0.1	→ 0 (`RemoteLogReaderTaskQueueSize` grows; remote fetch latency spikes)

Confirm the idle gauge is a 0–1 fraction

RequestHandlerAvgIdlePercent has a long history of mis-reporting: mis-calculation (KAFKA-7295), some integrations exposing it as a rate rather than a 0–1 gauge (Datadog integrations-core #516), and an anomaly in KRaft combined mode (KIP-1207). Validate that your pipeline shows it in [0,1] before trusting any band, and prefer the per-pool meters (BrokerRequestHandlerAvgIdlePercent / ControllerRequestHandlerAvgIdlePercent, KafkaRequestHandler.scala:243-251) which separate broker and controller work on a combined node.

The leading-indicator catalogue

For each indicator: what it predicts, the early-warning band/trend, the typical lead time it buys, and the source. The lead times are operational rules of thumb (empirical), not guarantees, they scale with how aggressively you provisioned headroom.

1 · Consumer-lag trend (slope of records-lag + time-lag)

Predicts: an SLA breach or retention overrun, the consumer falling permanently behind, or lag exceeding the retention window so records age out unread. Mechanism: a sustained positive slope on records-lag-max (client-side, FetchMetricsRegistry.java:102) means the consumer group's net drain rate is below the production rate. Two runway quotients fall out, both remaining ÷ rate. (a) Recovery runway: recovery_time = current_lag ÷ net_drain_rate, worked with an illustrative lag of 500,000 records draining at a net 5,000 records/s (consume-rate minus produce-rate): 500,000 records ÷ 5,000 records/s = 100 s to catch up (the records cancel, leaving seconds); if net drain is ≤ 0 there is no recovery. (b) Time-lag, because retention is time-bound not count-bound: time_lag = current_lag ÷ consume_rate, e.g. 500,000 records ÷ 20,000 records/s consumed = 25 s behind the tail; compare that against the retention window, not against the raw count. Band: alert when the 15-minute slope is positive for > N consecutive windows, or when projected time-lag will reach the retention limit within your response budget. Lead time: minutes-to-hours, set by how much lag-time runway you keep relative to retention. Cross-ref op03 (partitioning & parallelism), Part I 17.

2 · RequestHandlerAvgIdlePercent declining into a band

Predicts: I/O-thread saturation and rising RequestQueueTimeMs before produce/fetch requests start timing out. Mechanism: the meter at KafkaRequestHandler.scala:253 measures the fraction of time the num.io.threads pool spends blocked on requestChannel.receiveRequest(300) (:117) instead of handling work. As load rises the pool drains; once idle → 0 the request queue fills and RequestQueueTimeMs rises super-linearly. Band: warning at 0.3, action at 0.2 (Instaclustr: "constantly below 0.2 → add capacity", empirical). Lead time: minutes to hours of advance notice depending on workload growth rate. Cross-ref Part I 06, 07.

3 · NetworkProcessorAvgIdlePercent declining

Predicts: network-thread saturation and rising ResponseQueueTimeMs. Mechanism: identical logic for the num.network.threads processors; the gauge (SocketServer.scala:120) is the mean per-processor io-wait-ratio. Band: Confluent guidance "ideally > 0.4"; below ~0.3 raise num.network.threads (empirical). Watch MemoryPoolAvailable (SocketServer.scala:134) alongside, if the network read buffer pool is draining to zero, the socket layer is back-pressuring. Lead time: minutes to hours. Cross-ref Part I 06.

4 · Page-cache pressure (disk-read rate while consumers are caught up)

Predicts: the working set outgrowing RAM, and consumers about to fall off the zero-copy/sendfile path onto disk. Mechanism: Kafka serves tail reads from the OS page cache via sendfile (Part I 09); if all consumers are caught up to the tail and the broker is still doing physical disk reads (node_disk_reads climbing), the hot set no longer fits in cache and reads will increasingly miss. The reference documents the "catch-up tax": historical reads evict the hot tail and spike p99 produce ~2 ms → ~250 ms with disk I/O to 100% (KIP-405 tests, empirical). Band: non-zero and rising disk-read rate while max consumer lag ≈ 0. Lead time: hours-to-days as the working set creeps up; sudden if a backfill consumer starts reading cold data. Mitigations: more RAM, or tiered storage (Part I 05) to keep local retention small. Cross-ref op04, Part I 03.

5 · IsrShrinksPerSec ticking up (shrinks that re-expand)

Predicts: brokers beginning to struggle (GC, disk, network) before sustained UnderReplicatedPartitions or offline partitions. Mechanism: a follower is dropped from the ISR if it fails to reach the leader's log-end-offset within replica.lag.time.max.ms, default 30000 ms (ReplicationConfigs.java:55). A follower that briefly stalls, a GC pause, a disk hiccup, shrinks then re-expands. A rising rate of IsrShrinksPerSec with matching IsrExpandsPerSec is the cluster flickering at the edge: each individual shrink self-heals, but the increasing frequency is the leading edge of trouble that will eventually produce a shrink with no matching expand (= true URP). Band: any shrink/expand churn above a near-zero steady-state baseline; investigate shrinks without a matching expand immediately. Lead time: minutes. Cross-ref Part I 08, op07.

ISR full (steady) follower stalls < 30 s shrink → expand (flicker) rate rising churn (leading signal) stall > replica.lag.time.max.ms shrink, no expand = URP

The window between "flicker" and "URP" is the lead time: IsrShrinksPerSec rising is leading; UnderReplicatedPartitions > 0 is lagging. replica.lag.time.max.ms default 30000 ms (ReplicationConfigs.java:55) is the dial.

healthy steady state warn flicker / churn (leading) err URP (lagging)

6 · Disk-free runway (days-to-full)

Predicts: a full-disk outage days ahead. Mechanism: on No space left on device the broker calls Exit.halt(1) with no graceful shutdown (empirical; reference §4). The runway is computable as a remaining ÷ rate quotient: a broker stores RF copies of the cluster's bytes-in (its share of leader and follower replicas), so it fills at RF × the cluster's per-broker ingress, and days_to_full = free_bytes ÷ (ingress_bytes_per_day_per_broker × RF). Band: alert when projected days-to-full < lead-time-to-add-capacity (commonly 7–14 days), and a hard backstop at 85% used (AWS MSK KafkaDataLogsDiskUsed ≥ 85%, empirical). Lead time: days, the longest-lead indicator in the catalogue, and the one most worth automating. Cross-ref op04, op07.

Worked example: days-to-full with labeled inputs

All four inputs are illustrative for this example, substitute your own measured values. The first three are read off a dashboard; RF is a topic config.

Symbol	Value (with units)	Why / source	Kind
`free_bytes`	2,000 GB free on the data dir	from `node_filesystem_avail_bytes` on the broker	illustrative (measured)
`ingress_per_broker`	50 MB/s of producer bytes landing on this broker's leaders	from `BytesInPerSec` for this broker; pre-replication	illustrative (measured)
`RF`	3 copies	topic `replication.factor`; durability baseline (reference §4)	config
seconds/day	86,400 s/day	unit conversion constant	fixed

Derivation (units cancel to days):

Step 1, convert ingress to bytes/day:
ingress_per_day = 50 MB/s × 86,400 s/day = 4,320,000 MB/day = 4,320 GB/day (the s cancels).

Step 2, account for replication (this broker writes RF copies of every byte it is responsible for, leaders + followers, so disk fills RF× faster than raw ingress):
disk_fill_rate = ingress_per_day × RF = 4,320 GB/day × 3 = 12,960 GB/day.

Step 3, divide remaining by rate (the GB cancels, leaving days):
days_to_full = free_bytes ÷ disk_fill_rate = 2,000 GB ÷ 12,960 GB/day ≈ 0.15 day ≈ 3.7 hours.

Takeaway: at 50 MB/s ingress with RF=3, a 2,000 GB free margin is only ~3.7 hours of runway, not days, because RF triples the fill rate and 50 MB/s is 4.3 TB/day of raw bytes before replication. The conclusion holds because the divisor is the replicated fill rate (× RF), not the raw produce rate; omitting × RF would over-state runway 3×. Two caveats fold in below: this assumes non-plateaued ingress (a retention-bound topic stops growing once bytes-aged-out offsets bytes-in), and it assumes the 50 MB/s is sustained, trend the real slope rather than a single sample.

Days-to-full must use the projected, not current, ingress

A retention-bound topic plateaus: bytes-in is offset by bytes-aged-out once steady state is reached, so a naive free / current_write_rate over-warns. But a growing ingress, a lengthened retention, or a new high-volume topic breaks the plateau. Trend free_bytes directly with predict_linear(node_filesystem_avail_bytes[6h], 7*86400) so the forecast captures the real slope, including retention and RF changes you may have forgotten you made.

7 · Partition-count headroom (% of per-broker ceiling)

Predicts: hitting the per-broker partition limit and the slow failover that accompanies high partition density. Mechanism: partition count is bounded by file descriptors, fetcher overhead, and on Linux by vm.max_map_count. Leader-election work also scales with leaderships held, so density predicts longer outages on the next failure. Band: headroom = 1 − (partitions_on_broker ÷ ceiling); warn at < 30% headroom, act at < 15%. Lead time: days-to-weeks (partition growth is usually deliberate). Cross-ref op02, op03, Part I 11.

Where the ~32,765 ceiling and the failover cost come from

Symbol	Value (with units)	Why / source	Kind
`vm.max_map_count`	65,530 mmap areas/process	Linux kernel default for memory-mapped regions per process	OS default (cited: Instaclustr, empirical)
mmaps/partition	≈ 2 mmap areas/partition	Kafka mmaps the `.index` + `.timeindex` of the active segment (Instaclustr KRaft Part 3)	empirical figure
election cost/partition	≈ 5 ms/leader-partition	time to elect one leader on unclean broker loss (Jun Rao 2015)	research figure (ZK-era; illustrative)

Derivation 1, the per-broker partition ceiling (the mmap areas cancel, leaving partitions):
ceiling = vm.max_map_count ÷ mmaps_per_partition = 65,530 mmap ÷ 2 mmap/partition ≈ 32,765 partitions/broker, unless vm.max_map_count is raised. This is a hard OS ceiling, distinct from the much lower practical limits set by FDs, fetcher overhead, and failover time discussed in op02.

Derivation 2, headroom, worked (the partitions cancel, leaving a dimensionless fraction). For a broker holding 24,000 partitions against the default ceiling:
headroom = 1 − (24,000 partitions ÷ 32,765 partitions) = 1 − 0.73 ≈ 0.27 = 27% → below the 30% warn line, act soon.

Derivation 3, failover cost the density predicts (the partitions cancel, leaving ms). If those 24,000 partitions are split RF=3 across enough brokers that one broker leads ~8,000 of them, an unclean loss of that broker costs:
outage ≈ 5 ms/leader-partition × 8,000 leader-partitions = 40,000 ms = 40 s of leaderless unavailability for those partitions until elections complete.

Takeaway: headroom warns you about the ceiling; the 5 ms/partition figure warns you that the cost of the next failure grows with density, two independent reasons to cap partitions/broker. Both constants are illustrative: the 5 ms is a 2015 ZooKeeper-era measurement that teaches the linear-in-leaderships mechanism, not a 2026 prediction (KRaft failover is far faster), and the 2-mmaps/partition ceiling assumes the default vm.max_map_count.

8 · p99 produce/fetch latency drift

Predicts: an SLA breach before the alert line is crossed. Mechanism: the request-time breakdown TotalTimeMs = RequestQueueTimeMs + LocalTimeMs + RemoteTimeMs + ResponseQueueTimeMs + ResponseSendTimeMs (+ ThrottleTimeMs) (reference §5, empirical). A slow upward drift in any component is a leading edge: rising RemoteTimeMs → followers slowing (links to indicator 5); rising LocalTimeMs → disk/flush/GC; rising RequestQueueTimeMs → handler starvation (indicator 2). Band: alert on a sustained week-over-week or day-over-day increase in the p99, not only the absolute crossing, e.g. p99 up > 50% vs. a 7-day baseline. Lead time: hours-to-days. Cross-ref Part I 07, op08.

9 · GC pause frequency/duration trending up

Predicts: soft failures cascading, ISR drops, session timeouts, rebalances. Mechanism: a stop-the-world pause that exceeds replica.lag.time.max.ms internals or a client's session.timeout.ms (default 45000 ms, ConsumerConfig.java:438) causes the broker to miss fetches and clients to miss heartbeats. The reference documents 500 ms pauses triggering rebalances and multi-second Full GCs causing ISR drops (empirical); LinkedIn's busy clusters run ~21 ms p90 pause, < 1 young GC/s. Band: p99 pause > 50 ms = warning, > 200 ms = critical (empirical); also alert on rising frequency, which precedes a creeping heap. Lead time: hours-to-days as heap pressure builds. The conventional cap of ~6 GiB heap (a planning heuristic, reference §3 / op05, not a hard limit) exists precisely so pauses stay short and the rest of RAM serves page cache (indicator 4): on a 64 GiB box, a 6 GiB heap leaves ~28–30 GiB for the OS cache. Cross-ref op05, op07.

10 · Quota throttle-time becoming non-zero

Predicts: client-visible slowdown, then timeouts and errors, as a producer/consumer approaches its quota. Mechanism: when a client exceeds a byte-rate or request quota the broker delays the response and reports the delay; the client sees produce-throttle-time-avg / produce-throttle-time-max (SenderMetricsRegistry.java:123-126) and fetch-throttle-time-avg/-max (FetchMetricsRegistry.java:107-110). Throttling is graceful at first, latency just rises, but as the client backs up it heads toward buffer exhaustion (indicators 16/17) and eventual send failures. Band: any sustained non-zero throttle-time on a client that is supposed to run unthrottled. Lead time: minutes-to-hours before the throttle turns into client errors. Cross-ref Part I 19, 13, op08.

11 · Controller event-queue time/backlog rising

Predicts: metadata-operation slowness, topic creates, partition reassignments, leader elections all queuing behind a busy controller. Mechanism: the KRaft controller processes events on a single-threaded loop; EventQueueTimeMs (QuorumControllerMetrics.java:48-49) is how long an event waits before processing, and AvgIdleRatio (:52) is the loop's headroom. As the controller saturates, queue time climbs and metadata propagation (Part I 12) slows cluster-wide. Band: rising EventQueueTimeMs p99, or AvgIdleRatio entering 0.5→0.3. Cross-check TimedOutBrokerHeartbeatCount (:62) and LastAppliedRecordLagMs (:60), a standby controller falling behind on the metadata log is itself a leading risk to failover speed. Lead time: minutes-to-hours. Cross-ref Part I 11, 10.

12 · Rebalance rate increasing

Predicts: consumer-group instability and processing stalls. Mechanism: each rebalance pauses consumption for the group; a rising rebalance-rate-per-hour (ConsumerRebalanceMetricsManager.java:70) or any non-zero failed-rebalance-rate-per-hour (:74) signals churn, flapping members, slow processing tripping max.poll.interval.ms (default 300000 ms, ConsumerConfig.java:629), or unstable membership. Watch last-rebalance-seconds-ago (:97): a value that keeps resetting to near-zero is a rebalance storm in progress. Band: rebalance-rate above the group's steady-state baseline, or failed-rebalance-rate > 0. Lead time: minutes, fast-moving, but earlier than the lag spike the storm eventually produces. Cross-ref Part I 13, op07.

13 · Connection count / creation-rate approaching caps

Predicts: hitting max.connections / max.connections.per.ip limits, or file-descriptor exhaustion, causing new clients to be rejected. Mechanism: each connection consumes an FD and listener slot; the broker tracks connection-count (Selector.java:1260) and the standard connection-creation-rate. A climbing creation-rate often indicates clients in a connect/disconnect loop (mis-tuned pools, repeated auth failures) that will exhaust the cap. ExpiredConnectionsKilledCount (SocketServer.scala:136) rising shows the broker reaping idle connections under connections.max.idle.ms pressure. Band: connection-count headroom < 20% of the configured cap, or a sustained non-zero creation-rate on a stable client population. Lead time: minutes-to-hours. Cross-ref op02, Part I 06.

14 · Log-flush time creeping up = disk slowing

Predicts: disk degradation, which will surface as rising LocalTimeMs, ISR shrinks, and eventually produce stalls. Mechanism: LogFlushRateAndTimeMs (LogSegment.java:77, kafka.log:type=LogFlushStats) times the fsync of log segments. A slow upward creep means the disk is taking longer to accept writes, a failing SSD, a noisy neighbour on shared storage, or a saturated I/O queue. Band: any sustained increase over the disk's established baseline (the absolute number is hardware-specific). Lead time: hours-to-days for gradual degradation; correlate with rising LocalTimeMs and falling RequestHandlerAvgIdlePercent. Cross-ref Part I 03, op04.

15 · Leader-election time growing with partition count

Predicts: longer outages on the next broker failure. Mechanism: LeaderElectionRateAndTimeMs (controller) reports time spent without a leader during elections; unclean broker loss costs roughly 5 ms × #leader-partitions-on-broker (Jun Rao 2015, empirical), so as partition density grows the failover duration grows with it. This is a second-order leading indicator: it predicts the severity of a future incident, letting you rebalance or add brokers before density makes the next failure painful. Band: trend election time vs. partition count; act when projected unavailability on a single-broker loss exceeds your SLO. Lead time: weeks (it tracks slow partition growth). Cross-ref op02, op11, Part I 11.

Client-team leading indicators (no broker access required)

The part most guides miss: a producer or consumer application team can be proactive entirely from their own client JMX metrics, with zero broker access, and frequently see trouble building before the platform team's lagging broker alerts fire. The data lives in the client's org.apache.kafka.common.metrics.Metrics registry and is also pushed centrally via KIP-714 client telemetry (ClientMetricsManager.java handles GetTelemetrySubscriptions/PushTelemetry at :159:188; Part I 19), so a platform team can scrape app-side signals without app-side scrape endpoints.

Consumer-side leading indicators

Metric	Source	What the trend predicts	Watch for
`records-lag-max` + its slope	`FetchMetricsRegistry.java:102`	structural throughput deficit → SLA breach / retention overrun	sustained positive slope (not absolute value)
`fetch-latency-avg` drift	`FetchMetricsRegistry.java:93`	broker/network slowing on the fetch path	slow upward trend vs. baseline
`bytes-consumed-rate`	`FetchMetricsRegistry.java:81`	compare to production rate; deficit → lag growth	consume-rate < produce-rate over a window
`rebalance-rate-per-hour` / `failed-rebalance-rate-per-hour`	`ConsumerRebalanceMetricsManager.java:70,74`	group instability → repeated processing stalls	rate above baseline; failed-rate > 0
`commit-latency-avg/-max`	`OffsetCommitMetricsManager.java:45,49`	coordinator slow → at-least-once replay risk on crash	upward drift
`time-between-poll-max` vs. `max.poll.interval.ms`	`KafkaConsumerMetrics.java:64` vs. `ConsumerConfig.java:629` (300000 ms)	processing too slow → consumer evicted → rebalance storm	poll interval approaching the configured max
`poll-idle-ratio-avg`	`KafkaConsumerMetrics.java:70`	low idle = app is the bottleneck (processing-bound)	ratio → 0 (no slack to absorb a load increase)
`last-poll-seconds-ago`	`KafkaConsumerMetrics.java:55`	a stuck consumer about to be kicked	climbing toward `max.poll.interval.ms/1000`

The poll-interval headroom check is the consumer's most actionable proactive signal

"Am I processing fast enough to avoid being kicked?" is answered directly by time-between-poll-max (KafkaConsumerMetrics.java:64) against max.poll.interval.ms, a real Kafka default of 300,000 ms (ConsumerConfig.java:629, verified). The headroom is a dimensionless fraction: headroom = 1 − (time-between-poll-max ÷ max.poll.interval.ms). Worked: if your slowest observed gap between polls is 210,000 ms, then headroom = 1 − (210,000 ms ÷ 300,000 ms) = 1 − 0.70 = 0.30 = 30% (the ms cancel), exactly at the warn line. When that headroom shrinks below ~30%, you are one slow batch away from eviction and a rebalance, the leading cause of self-inflicted rebalance storms (op07). Fix before the storm: lower max.poll.records, speed up processing, or raise the interval.

Producer-side leading indicators

The producer's leading-indicator story is a pipeline: the RecordAccumulator buffers records, the Sender drains them to brokers, and the BufferPool backstops memory. When the broker can't keep up (or batching is mis-tuned), pressure propagates backward through these stages in a predictable order, each with its own metric, giving a clean early-warning sequence before the terminal max.block.ms timeout.

app calls send()

RecordAccumulatorrecord-queue-time-avg/-max ↑ first

Sender → brokerrequest-latency-avg ↑ ; requests-in-flight

BufferPoolbuffer-available-bytes ↓ toward 0

block max.block.msthen BufferExhaustedException

Producer back-pressure propagates right-to-left: queue-time rises first (leading), buffer-available falls next (blocking imminent), and only at the end does send() block for max.block.ms (default 60000 ms, ProducerConfig.java:451) and throw BufferExhaustedException (BufferPool.java:161, lagging).

app accumulator/sender buffer pool terminal failure data flow back-pressure

Metric	Source	What the trend predicts	Watch for
`record-queue-time-avg` / `-max`	`SenderMetricsRegistry.java:88,90`	RecordAccumulator backing up, broker can't keep up or batching mis-tuned → buffer exhaustion next	the earliest producer signal; rising = act now
`buffer-available-bytes`	`RecordAccumulator.java:215`	blocking imminent once it nears 0; then `max.block.ms` timeouts	falling toward 0 (vs. `buffer-total-bytes` = `buffer.memory`, default 32 MB)
`waiting-threads`	`RecordAccumulator.java:205`	app threads already blocked on buffer memory	any non-zero, sustained > 0
`request-latency-avg`	`SenderMetricsRegistry.java:92`	broker-side slowdown reaching the producer	slow upward drift
`record-retry-rate`	`SenderMetricsRegistry.java:102`	transient broker errors rising → approaching `delivery.timeout.ms` failures	creeping above ~0
`record-error-rate`	`SenderMetricsRegistry.java:106`	permanent send failures, data loss if not handled	any non-zero
`produce-throttle-time-avg`	`SenderMetricsRegistry.java:123`	you are approaching a broker quota	becoming non-zero (links to indicator 10)
`buffer-exhausted-rate`	`BufferPool.java:88`	sends already being dropped on exhaustion, this is lagging	any non-zero = already failing
`batch-split-rate`	`SenderMetricsRegistry.java:118`	batches exceeding `max.request.size` → re-splitting overhead	sustained > 0 (tune `batch.size` / record size)

Why record-queue-time is the producer's leading edge

The BufferPool.allocate() path (BufferPool.java:107) only blocks after the pool is empty, and only throws after max.block.ms (:159-163). By then you are already failing sends. But record-queue-time (SenderMetricsRegistry.java:88) measures how long batches sit in the accumulator before the Sender drains them, it rises the moment the drain rate falls below the append rate, which is strictly earlier than the buffer running dry. Trend it, and you get the warning while there is still buffer to absorb the burst. buffer-available-bytes falling is the second-stage confirmation; buffer-exhausted-rate > 0 is the post-mortem.

Producer / consumer proactive checklist

Producer, watch & do	Consumer, watch & do
Trend `record-queue-time-max`; rising → broker slow or batching too aggressive. Investigate before buffer drains.	Trend `records-lag-max` slope; positive & sustained → scale consumers or speed processing (lag-time runway, not absolute lag).
Alert `buffer-available-bytes` < 20% of `buffer-total-bytes`; raise `buffer.memory` or reduce send rate before `max.block.ms` hits.	Alert poll-interval headroom < 30% (`time-between-poll-max` ÷ `max.poll.interval.ms`); lower `max.poll.records` before eviction.
Alert any non-zero `record-error-rate`; this is data loss unless the callback handles it.	Alert `failed-rebalance-rate-per-hour` > 0 and rising `rebalance-rate-per-hour`; stabilise membership (static IDs, cooperative assignor).
Alert non-zero `produce-throttle-time-avg`; you are nearing a quota, request a raise or compress.	Alert non-zero `fetch-throttle-time-avg`; nearing a fetch quota, coordinate with platform.
Watch `request-latency-avg` drift as an early read on broker health from your vantage point.	Watch `commit-latency-max` drift; slow commits widen the replay window on a crash.

Worked example: producer buffer headroom

The mirror of the consumer poll-headroom check. buffer-total-bytes equals the configured buffer.memory, a real Kafka default of 32 MB = 33,554,432 bytes (ProducerConfig.java:401, 32 * 1024 * 1024L, verified); treat the actual configured value as the assumption. Headroom is a dimensionless fraction: headroom = buffer-available-bytes ÷ buffer-total-bytes. Worked, with buffer-available-bytes = 6 MB observed against a 32 MB default: headroom = 6 MB ÷ 32 MB = 0.1875 ≈ 19% (the MB cancel), below the 20% alert line, so raise buffer.memory or cut the send rate before the pool empties and send() blocks for max.block.ms (60,000 ms, ProducerConfig.java:451, verified).

Two client-metric blind spots

(1) Client-side records-lag-max only covers running consumers, a dead or stuck consumer reports nothing. An external lag monitor (Burrow, kafka_exporter) is still required to catch a group that has stopped entirely (reference §5, empirical). (2) records-lag-max is computed from the current position, not the committed offset (FetchMetricsRegistry.java:103), a consumer that fetches fast but commits slowly can look healthy on lag while risking large replay on a crash; pair lag with commit-latency.

The runway / headroom dashboard

Assemble a single "how much runway is left" panel set. Each panel is a quotient (remaining ÷ rate) trended over time, with the alert on the quotient, not the numerator. This is the proactive counterpart to the golden-signal status board in op08.

Throughput headroom

idle% bands, how much request capacity remains before queueing

RequestHandlerAvgIdlePercentNetworkProcessorAvgIdlePercent

↕

Disk runway

days-to-full = free_bytes / (ingress/day × RF)

predict_linear(avail_bytes)85% backstop

↕

Partition & FD headroom

1 − partitions/ceiling ; 1 − conns/cap

vm.max_map_countconnection-count

↕

Consumer lag-time runway

records_lag / consume_rate vs. retention

records-lag-max slopeBurrow trend

↕

Cache headroom

disk-read rate while lag≈0 (working set vs. RAM)

node_disk_readspage-cache hit

Five runway panels, each a remaining/rate quotient. Alert on the quotient trending down through your lead-time budget, not on the raw numerator crossing a fixed line.

throughput storage/cache metadata/FD limits consumer runway

Runway panel	Formula	Alert when	Lead time
Throughput headroom	idle% gauges (0–1)	min(handler, network) idle% trending into 0.3→0.2	min–hr
Disk runway	`free / (ingress/day × RF)`	projected days-to-full < lead-to-provision (7–14 d); hard 85%	days
Partition headroom	`1 − partitions/ceiling`	< 30% (warn), < 15% (act)	days–wks
FD / connection headroom	`1 − conns/cap`	< 20% headroom or rising creation-rate on stable population	min–hr
Consumer lag-time runway	`records_lag / consume_rate`	positive slope sustained; projected to reach retention	min–hr
Cache headroom	disk-read rate while lag≈0	non-zero & rising while consumers caught up	hr–days

SLO error-budget & burn-rate framing

The most disciplined proactive pattern is the burn-rate alert: define an SLO (e.g. "99.9% of produce requests succeed within 50 ms over 30 days"), derive an error budget, and alert when the current consumption rate of the budget would exhaust it before the window closes. This unifies the catalogue: every leading indicator above is, ultimately, an early read on whether you are about to start burning budget.

Assumptions for the worked burn-rate numbers

These are the canonical Google SRE multi-window thresholds (cited heuristic, not a Kafka limit); the SLO target and window are illustrative, substitute your own.

Symbol	Value (with units)	Why / source	Kind
`S` (SLO target)	99.9% = 0.999 (dimensionless)	example produce-success objective	illustrative
`W` (window)	30 days = 43,200 min	example budget window (30 × 24 × 60)	illustrative
fast-burn multiple	14.4× budget over 1 h	Google SRE multi-window page threshold	cited heuristic (Google SRE)
slow-burn multiple	3× budget over 6 h	Google SRE multi-window ticket threshold	cited heuristic (Google SRE)

Error budget: For an availability SLO of S over window W: budget = (1 − S) × W. Worked: (1 − 0.999) × 43,200 min = 0.001 × 43,200 min = 43.2 min of allowed "bad" per 30 days (the % cancels; result is in minutes). So 0.1% of a 30-day window ≈ 43 min.
Burn rate: burn_rate = (bad events ÷ total events) ÷ (1 − S), a dimensionless multiple of the budget's steady drain. A burn rate of 1 exactly exhausts the budget at the window's end; > 1 exhausts it early. (E.g. failing 1.44% of requests over an hour = 0.0144 ÷ 0.001 = 14.4×.)
Fast-burn page: 14.4× over a 1-hour window. Why "~2% of the budget in 1 h": at 14.4× the budget drains 14.4× faster than the 30-day baseline, and 1 h is 1 ÷ (30 × 24) = 1/720 of the window, so the fraction consumed is 14.4 × (1 h ÷ 720 h) = 14.4 ÷ 720 = 0.02 = 2% of the 30-day budget in one hour (the hours cancel). Burning 2%/hour empties the budget in ~50 h → page; the cliff is hours-to-days away.
Slow-burn ticket: 3× over a 6-hour window. Fraction consumed: 3 × (6 h ÷ 720 h) = 18 ÷ 720 = 0.025 = 2.5% of the 30-day budget over 6 h → at that pace the budget lasts 720 h ÷ 3 = 240 h ≈ 10 days → ticket; the cliff is days away, but the trend is real.

measure SLI (e.g. p99 produce success)

current burn rate vs. budget?

fast burn → PAGE (cliff in hours)

slow burn → TICKET (cliff in days)

within budget → no action

Multi-window burn-rate alerting: the same SLI drives a paging fast-burn and a ticketing slow-burn. The slow-burn is the proactive arm, it fires while runway remains.

fast burn / page slow burn / ticket within budget evaluation ,

Pairing rule: every lagging golden signal should have a leading partner

Audit your alert set as pairs. UnderReplicatedPartitions > 0 (lagging) ↔ IsrShrinksPerSec rising (leading). OfflinePartitionsCount > 0 (lagging) ↔ partition-headroom + leader-election-time trend (leading). Producer buffer-exhausted-rate > 0 (lagging) ↔ record-queue-time rising (leading). Consumer SLA-miss (lagging) ↔ records-lag-max slope (leading). If a golden signal has no leading partner on the board, you have an outage class you can only respond to, never prevent.

Master table, the leading-indicator catalogue

#	Leading indicator	Predicts	Early-warning band / trend	Lead time	Source / metric
1	Consumer-lag trend (records + time)	SLA breach / retention overrun	sustained positive slope; projected time-lag → retention	min–hr	`records-lag-max` slope, `FetchMetricsRegistry.java:102` + Burrow
2	RequestHandlerAvgIdlePercent ↓	I/O-thread saturation; `RequestQueueTimeMs` ↑	0.3 (warn) → 0.2 (act)	min–hr	`KafkaRequestHandler.scala:253`
3	NetworkProcessorAvgIdlePercent ↓	network-thread saturation; `ResponseQueueTimeMs` ↑	0.4 → 0.3	min–hr	`SocketServer.scala:120`
4	Page-cache pressure	working set > RAM; consumers off zero-copy path	disk-read rate ↑ while lag ≈ 0	hr–days	OS disk reads vs. Part I 09
5	IsrShrinksPerSec ↑ (re-expanding)	brokers struggling (GC/disk/net) before URP	shrink/expand churn above ~0 baseline	min	`IsrShrinksPerSec`; `replica.lag.time.max.ms`=30000 (`ReplicationConfigs.java:55`)
6	Disk-free runway	full-disk outage (`Exit.halt(1)`)	days-to-full < lead-to-provision; 85% backstop	days	`free/(ingress×RF)`; op04
7	Partition-count headroom	per-broker ceiling; slow failover	headroom < 30% (warn), < 15% (act)	days–wks	`vm.max_map_count` ≈32k/broker; op02
8	p99 produce/fetch latency drift	SLA breach before alert line	p99 ↑ > 50% vs. 7-day baseline	hr–days	`RequestMetrics TotalTimeMs`
9	GC pause freq/duration ↑	soft failures: ISR drops, session timeouts	p99 pause > 50 ms (warn), > 200 ms (crit)	hr–days	JVM GC vs. `session.timeout.ms`=45000 (`ConsumerConfig.java:438`)
10	Quota throttle-time > 0	client-visible slowdown → errors	any sustained non-zero on an unthrottled client	min–hr	`produce-throttle-time-avg` (`SenderMetricsRegistry.java:123`), `fetch-throttle-time-avg` (`FetchMetricsRegistry.java:107`)
11	Controller event-queue ↑	metadata-operation slowness	`EventQueueTimeMs` p99 ↑; `AvgIdleRatio` 0.5→0.3	min–hr	`QuorumControllerMetrics.java:48,52`
12	Rebalance rate ↑	group instability / processing stalls	rate above baseline; failed-rate > 0	min	`ConsumerRebalanceMetricsManager.java:70,74`
13	Connection count / creation-rate ↑	connection caps / FD exhaustion	headroom < 20%; sustained creation-rate on stable pop.	min–hr	`connection-count` (`Selector.java:1260`); `ExpiredConnectionsKilledCount` (`SocketServer.scala:136`)
14	Log-flush time ↑	disk slowing → `LocalTimeMs` ↑, ISR shrinks	sustained rise over hardware baseline	hr–days	`LogFlushRateAndTimeMs` (`LogSegment.java:77`)
15	Leader-election time vs. partitions	longer outage on the next failure	projected single-broker unavailability > SLO	wks	`LeaderElectionRateAndTimeMs`; ≈5 ms/partition (empirical)
C1	Producer record-queue-time ↑	accumulator backing up → buffer exhaustion	earliest producer signal; any sustained rise	min	`SenderMetricsRegistry.java:88`
C2	Producer buffer-available-bytes ↓	blocking → `max.block.ms` timeout	< 20% of `buffer-total-bytes`	min	`RecordAccumulator.java:215`; `max.block.ms`=60000 (`ProducerConfig.java:451`)
C3	Consumer poll-interval headroom ↓	eviction → rebalance storm	headroom < 30% of `max.poll.interval.ms`	min	`time-between-poll-max` (`KafkaConsumerMetrics.java:64`) vs. 300000 (`ConsumerConfig.java:629`)

Lead times are provisioning-dependent, not guarantees

Every lead-time figure above is an operational rule of thumb (empirical) and assumes you provisioned some headroom. A cluster run at 95% disk and 0.25 handler-idle has effectively zero lead time, the leading indicators have already merged into the lagging ones. The point of headroom is to keep the leading and lagging signals separated by a usable interval. Benchmark numbers (e.g. the ~2 ms → ~250 ms catch-up tax, the ~5 ms/partition election cost) are version- and hardware-dependent illustrations of the mechanism, not predictions for your fleet.

What to do this week

Build the runway dashboard. Five quotient panels (throughput idle%, disk days-to-full, partition headroom, FD/connection headroom, lag-time runway). Alert on the quotient trend, not the numerator.
Convert your top lagging alerts to pairs. For each golden signal in op08, add the leading partner from the invariant box above. An unpaired golden signal is an outage you can only react to.
Add multi-window burn-rate alerts on your produce-success and consumer-lag SLOs: a fast-burn page and a slow-burn ticket. The slow-burn is your proactive early warning.
Push the client checklist to app teams. Producer: trend record-queue-time-max and buffer-available-bytes. Consumer: trend records-lag-max slope and poll-interval headroom. Wire app-side metrics centrally via KIP-714 so the platform team sees them too.
Validate idle gauges are 0–1. Confirm RequestHandlerAvgIdlePercent reports a fraction before trusting any band (KAFKA-7295 / KIP-1207 footguns).

The discipline is simple to state and hard to keep: every metric that can break your cluster has a derivative and a distance, and the derivative and distance are visible before the break. Monitor those, and most pages become tickets you closed days earlier.