II · 05 · Performance Tuning: Throughput ⇄ Latency

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

Kafka tuning is not a search for one fast configuration; it is the deliberate choice of where on three axes, throughput, latency, and durability, you want to sit. Almost every knob trades one against another, and the trade is mechanical: batching amortizes per-request cost at the price of a fixed latency tax; acks=all buys durability with a replication round-trip; compression buys network and disk savings with producer CPU. This chapter maps each dial to the exact source mechanism that makes it a dial, gives you the compiled-in default and the file it lives in, and then breaks the end-to-end latency budget into the five broker stages you can actually measure. The cardinal skill is reading the signal, the request-queue/local/remote breakdown and the two idle-ratio gauges, so that you tune the stage that is actually slow rather than the one folklore blames.

The three-axis model: what you are really trading

Hold this picture before touching any config. A produce request's success has three independent costs. Latency is how long the producer waits for the ack. Throughput is how many bytes per second the pipeline sustains. Durability is how many simultaneous failures the data survives. You optimize for at most two; the third is the bill. Batching raises throughput and (under load) can even lower latency, but raises latency under light load. Compression raises effective throughput and cuts cost but spends CPU. acks=all raises durability but adds a remote round-trip to latency. Confluent's reference benchmark demonstrates the happy region, roughly 605 MB/s peak and p99 5 ms at 200 MB/s on i3en.2xlarge, RF=3, acks=all, fsync off (Confluent "Kafka Performance", empirical, 1 KB messages, reads served from page cache), but note the words "AND, not maxed simultaneously": you do not get peak throughput and minimum latency at the same instant.

Producer dials

batch.size + linger.ms (batching) · compression.type · acks · max.in.flight · buffer.memory

throughput ⇄ latency ⇄ durability

↕

Broker dials

num.io.threads · num.network.threads · num.replica.fetchers · socket buffers · queued.max.requests · log.segment.bytes

sized from idle-ratio gauges

↕

OS / page cache

small heap → OS caches the tail; zero-copy sendfile; vm.swappiness=1; FDs; vm.max_map_count; dirty ratios

the real read/write engine

↕

Consumer dials

fetch.min.bytes + fetch.max.wait.ms (read batching) · max.partition.fetch.bytes · max.poll.records + max.poll.interval.ms

processing budget ⇄ liveness

The tuning surface, top to bottom: where each knob lives and which axes it moves.

producer/client · broker · OS & page cache · consumer · code = a config key.

The tuning triangle (rule of thumb)

linger.ms↑ trades latency for throughput. compression.type trades producer CPU for network and disk. acks=all trades latency for durability. Pick two; pay for the third (synthesized from Confluent / Strimzi tuning guides, empirical). Before you change anything, decide which one your workload is allowed to spend.

Producer: the batching dial

The producer's dominant lever is batching, and its mechanism is the RecordAccumulator, a per-partition deque of ProducerBatch objects (clients/.../producer/internals/RecordAccumulator.java:76,78 hold batchSize and lingerMs; the class is documented at line 102 as trading "some latency for potentially better throughput due to more batching"). A batch becomes ready to send when either it fills to batch.size or linger.ms elapses since the first record landed (the ready/allBatchesFull logic, RecordAccumulator.java:416), whichever comes first. That single OR is the whole dial. The sender thread then drains all ready batches across partitions into per-broker requests, capped by max.request.size (default 1 MB, ProducerConfig.java:424). Part I chapter 16 · Producer Client walks the accumulator and sender threads in full.

Config	Default	Source	Mechanism / why
`batch.size`	16384 (16 KB)	`ProducerConfig.java:413`	Upper bound per-partition batch; a buffer of this size is pre-allocated. Larger ⇒ fewer requests, better compression, more throughput, more memory waste.
`linger.ms`	5	`ProducerConfig.java:418`	Artificial delay to let a batch fill. Changed from 0 to 5 in Apache Kafka 4.0 (doc, `ProducerConfig.java:100,162`) because larger batches usually yield similar or lower latency.
`compression.type`	`none`	`ProducerConfig.java:409`	Compresses whole batches; batching efficacy drives the ratio. Valid: `none`, `gzip`, `snappy`, `lz4`, `zstd`.
`acks`	`all`	`ProducerConfig.java:405`	How many replicas must ack. `all` waits for the full ISR. Idempotence requires `all`.
`max.in.flight.requests.per.connection`	5	`ProducerConfig.java:489`	Unacked requests per connection. Capped at 5 under idempotence (broker retains 5 batches/producer).
`buffer.memory`	33554432 (32 MB)	`ProducerConfig.java:401`	Total record-buffering memory; producer blocks when full.
`max.block.ms`	60000 (60 s)	`ProducerConfig.java:451`	How long `send()` blocks on a full buffer or metadata fetch before throwing.
`delivery.timeout.ms`	120000 (120 s)	`ProducerConfig.java:419`	Total deadline from `send()` to success/failure; must be ≥ `request.timeout.ms` + `linger.ms`.
`enable.idempotence`	`true`	`ProducerConfig.java:543`	Exactly-once-per-partition writes; on by default unless conflicting configs disable it.

Why linger.ms default became 5 in Kafka 4.0

With linger.ms=0 the accumulator sends as soon as the sender thread is free, so under light load batches are tiny, Confluent's producer hands-on course measured average batches of only ~1,215 bytes against a 16 KB limit at linger=0, versus ~275 KB at linger=1500ms (empirical, Confluent). Tiny batches mean more requests, worse compression, more broker CPU. The 4.0 default of 5 ms (ProducerConfig.java:100) bets that the efficiency gain typically lowers end-to-end latency despite the added 5 ms ceiling, because each request now does proportionally more useful work.

linger.ms only helps when the batch can fill

Raising linger.ms on a low-rate topic is counter-productive: Confluent's hands-on course showed that for a 200 rec/s stream of 1000-byte records, increasing linger decreased throughput and increased latency, the defaults were already optimal (empirical, Confluent). Set linger.ms > 0 meaningfully only when the produce rate is high enough to fill batch.size within the window. Otherwise you are just adding dead time.

The throughput recipe vs the latency recipe

Two named configurations bracket the dial. For maximum throughput: batch.size in the 64 KB–1 MB range, linger.ms 5–100 ms, compression.type=lz4 (or zstd when storage cost dominates), acks=all with min.insync.replicas=2 (Intel/Confluent learn page, empirical). For minimum latency: keep linger.ms at 0–1 ms, modest batch.size, and accept lower compression ratios. The historical "10–50× throughput from batching" figures are vendor roll-ups, directional, not guarantees, but the direction is real and rooted in the per-request fixed cost the accumulator amortizes.

buffer.memory is backpressure, not a buffer you can ignore

buffer.memory (32 MB, ProducerConfig.java:401) must be ≥ batch.size, and Kafka's own partition-count guidance is that a producer needs "at least a few tens of KB per partition being produced" (Jun Rao, empirical). When the buffer fills, because the brokers are slow or the network is saturated, send() blocks for up to max.block.ms (60 s) and then throws (ProducerConfig.java:225 doc). That block is your backpressure: if you see produce latency spikes, check whether the buffer is saturating before blaming the broker. Sizing rule: buffer.memory ≥ batch.size × partitions_produced + compression + in-flight headroom.

Ordering, idempotence, and max.in.flight

The interaction between max.in.flight.requests.per.connection (5) and idempotence is a correctness boundary, not a performance one. The constant MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION_FOR_IDEMPOTENCE = 5 (ProducerConfig.java:277) is a hard cap: the broker retains at most 5 batches per producer (aligned to ProducerStateEntry#NUM_BATCHES_TO_RETAIN, comment at line 276), so a 6th in-flight batch could be silently dropped on the broker side. The validation throws ConfigException if you set it above 5 with idempotence on (ProducerConfig.java:640). With idempotence, all five in-flight requests preserve order because the broker dedups and reorders by sequence number; without idempotence, max.in.flight > 1 plus retries can reorder records on a failed-then-retried send (ProducerConfig.java:301 doc). Part I chapter 14 · Transactions & EOS details the sequence-number mechanism.

ProducerRecordAccumulatorLeader

append(record)

future (not yet sent)

wait: batch full OR linger.ms (5)

ProduceRequest (≤ 5 in flight)

append + await ISR (acks=all)

ack ⇒ complete future

Producer send path: the accumulator holds the record until the batch fills or linger elapses; the latency tax is that wait plus the acks=all round-trip.

request · async reply/callback · producer · accumulator · leader.

Compression: a CPU-vs-ratio dial, applied per batch

Compression happens on the full batch (ProducerConfig.java:247 doc), which is why batching and compression compound: a bigger batch compresses better. The data travels and is stored compressed, Kafka replicates and persists the compressed batch, so a good ratio cuts produce bytes, replication bytes, disk, and consumer-fetch bytes all at once. The cost is producer CPU. Kafka 4.x exposes per-codec levels: compression.gzip.level, compression.lz4.level, compression.zstd.level (ProducerConfig.java:410-412, defaults are each codec's defaultLevel()).

Codec	Compress / Decompress (lzbench)	Ratio (HTTP data)	When
`lz4`	594 / 2428 MB/s	1.81×	Default choice for performance, fastest decompressor by far; lowest CPU. Often faster end-to-end than no compression.
`snappy`	446 / 1344 MB/s	2.35×	Similar niche to lz4; slightly higher ratio, slower decompress.
`zstd` (lvl -1)	409 / 844 MB/s	4.5× (lvl 6)	Best ratio for moderately more CPU; strong default when storage/network cost dominates (Kafka 2.1+, KIP-110).
`gzip`	slowest	3.58×	Highest CPU; usually avoided despite decent ratio.

Speeds and ratios: Cloudflare "Squeezing the firehose" (lzbench on ~1 MB / 600-record batches; HTTP-request data), empirical and data-dependent. Cloudflare chose Zstandard level 6, saving "hundreds of gigabits of internal traffic and terabytes of flash storage" and cancelling a hardware expansion; Trendyol chose zstd level 3 for ~70% message-size reduction (empirical).

"zstd is the default" is wrong, and tiny batches compress poorly

compression.type defaults to none (ProducerConfig.java:409), not producer and not zstd. And because compression is per-batch, the "4.5×" ratio is a property of large batches of compressible data, a stream of tiny, near-random, or already-compressed payloads will see little benefit from any codec. Get batching right first (raise linger.ms / batch.size), then compression pays off. For the cost angle, compression as the cheapest cross-AZ lever, see II · 10 · Cost.

Consumer: read batching and the processing budget

On the read side, the symmetric batching dial is fetch.min.bytes + fetch.max.wait.ms. The broker holds a fetch open until it has fetch.min.bytes available or fetch.max.wait.ms elapses (this is local-log fetch; remote/tiered fetch has its own remote.fetch.max.wait.ms, noted in ConsumerConfig.java:208). The crucial fact is the default: fetch.min.bytes=1 (ConsumerConfig.java:187) means Kafka does not batch reads by default, it answers as soon as a single byte is available, optimized for millisecond latency, not for cost. Part I chapter 09 · Fetch Path covers the broker-side delayed-fetch purgatory that implements this wait.

Config	Default	Source	Mechanism / why
`fetch.min.bytes`	1	`ConsumerConfig.java:187`	Min bytes before broker answers. =1 ⇒ no read batching; latency-optimized, NOT cost-optimized.
`fetch.max.wait.ms`	500	`ConsumerConfig.java:210`	Max the broker blocks waiting for `fetch.min.bytes` (local fetch only).
`fetch.max.bytes`	52428800 (50 MB)	`ConsumerConfig.java:200`	Cap per fetch response across all partitions (soft; one oversized batch still returns).
`max.partition.fetch.bytes`	1048576 (1 MB)	`ConsumerConfig.java:225`	Cap per partition per fetch (soft). Drives consumer fetch memory.
`max.poll.records`	500	`ConsumerConfig.java:94`	Records returned per `poll()`. Does NOT change fetching, caps records handed up from the cache.
`max.poll.interval.ms`	300000 (5 min)	`ConsumerConfig.java:629`	Max gap between `poll()` calls before the member is evicted ⇒ rebalance.
`session.timeout.ms`	45000	`ConsumerConfig.java:438`	Liveness window (classic protocol). Member dropped if no heartbeat within it.
`heartbeat.interval.ms`	3000	`ConsumerConfig.java:443`	Heartbeat cadence; keep ≈ ⅓ of `session.timeout.ms`.
`receive.buffer.bytes`	65536 (64 KB)	`ConsumerConfig.java:493`	SO_RCVBUF for the consumer socket; raise on high-BDP links.
`enable.auto.commit`	`true`	`ConsumerConfig.java:460`	Background offset commits; set `false` for at-least-once correctness control.

session.timeout.ms default is 45000, not ~10000

Older guides and the empirical reference quote a ~10 s session timeout; the actual compiled default in this tree is 45000 ms (ConsumerConfig.java:438), with heartbeat.interval.ms=3000 (line 443). Verify against the version you run, this is exactly the kind of number that drifts in folklore. Note also that session.timeout.ms and heartbeat.interval.ms are not used under the new consumer group protocol (KIP-848); they are listed as unsupported for that protocol at ConsumerConfig.java:407-413. See 13 · Group Coordination.

Raising fetch.min.bytes can cut whole-cluster CPU, the New Relic result

Because the default is fetch.min.bytes=1, a consumer of a low-throughput topic that polls in a tight loop sends a flood of near-empty fetch requests, each costing broker CPU to service. New Relic raised fetch.min.bytes (and tuned fetch.max.wait.ms) on the consumers of low-traffic topics and cut cluster-wide broker CPU by 15% from a single application change, enabling broker scale-down (New Relic, empirical). The mechanism is pure request-rate reduction: fewer, fatter fetches. This is the highest-leverage consumer knob most operators never touch.

max.poll.records is a liveness guard, not a throughput knob

A common misconception: max.poll.records does NOT change how much data is fetched, fetching is governed by fetch.max.bytes / max.partition.fetch.bytes. It only caps how many already-cached records each poll() hands to your code (ConsumerConfig.java:92 doc). Its real job is protecting against rebalance storms: if your per-record processing is slow, a large max.poll.records can push the loop past max.poll.interval.ms (5 min), the coordinator evicts the member, it rejoins, and you get a self-sustaining rebalance. Sizing rule: max.poll.records × worst-case per-record time ≤ max.poll.interval.ms, with margin. To go faster, fetch more and process in parallel, do not just raise this cap. Rebalance-storm recovery lives in II · 07 · Failure Modes.

Consumer fetch memory is bounded two ways and you must provision RAM for the larger: fetch_memory ≈ #brokers × fetch.max.bytes, and also ≈ #partitions × max.partition.fetch.bytes (Strimzi, empirical). With defaults, a consumer subscribed to 1,000 partitions can demand up to ~1 GB of fetch buffering by the second bound, a reason not to point one consumer at thousands of partitions. Partition-count economics are in II · 03 · Partitioning.

Broker: threads, queues, and how to size them from idle ratios

A broker has two thread pools on the request path. Network threads (num.network.threads, default 3 per listener, SocketServerConfigs.java:152) read bytes off sockets and write responses; each listener except the controller listener gets its own pool (doc, line 153). I/O threads a.k.a. request handlers (num.io.threads, default 8, ServerConfigs.java:51) do the actual work, log append, fetch, index lookups, including disk I/O. Between them sits the request queue, capped by queued.max.requests (default 500, SocketServerConfigs.java:144); when it fills, network threads stop reading, that is the broker's intake backpressure. Part I chapters 06 · Network & Threading and 07 · Request Processing describe this acceptor → processor → handler pipeline in detail.

Size threads from the idle-ratio gauges, not from a formula alone

Do not guess. Each pool publishes an idle fraction. RequestHandlerAvgIdlePercent (the I/O pool, metric name defined at core/.../KafkaRequestHandler.scala:210) and NetworkProcessorAvgIdlePercent (the network pool, gauge registered at core/.../SocketServer.scala:120) are both fractions in [0, 1]: 1 = fully idle, 0 = saturated. Raise the pool whose idle ratio is low.

Pool	Config (default · source)	Idle gauge	Action threshold
I/O / request handlers	`num.io.threads` = 8 · `ServerConfigs.java:51`	`RequestHandlerAvgIdlePercent`	< 0.2 ⇒ overloaded, add handlers/brokers; < 0.1 ⇒ active problem (Instaclustr/Confluent, empirical). Heuristic: `num.io.threads ≈ 8 × #data disks`, bounded by cores.
Network processors	`num.network.threads` = 3 · `SocketServerConfigs.java:152`	`NetworkProcessorAvgIdlePercent`	Ideal > 0.4; < 0.3 ⇒ raise `num.network.threads` (Confluent, empirical). Common production range 8–12.
Replica fetchers	`num.replica.fetchers` = 1 · `ReplicationConfigs.java:96`	(watch URP / `RemoteTimeMs`)	Raise to 4–8 when followers lag and `RemoteTimeMs` is high. Total fetchers = this × #brokers (doc, line 97).
Recovery	`num.recovery.threads.per.data.dir` = 2 · `ServerLogConfigs.java:147`	(startup time)	Raise to speed log recovery at startup / flush at shutdown on multi-disk hosts.

RequestHandlerAvgIdlePercent has a measurement history, confirm it is a 0–1 gauge

This metric has been misreported repeatedly: mis-calculated (KAFKA-7295), surfaced as a rate by some integrations (Datadog integrations-core #516), and anomalous in KRaft combined mode (KIP-1207). Before you trust the < 0.2 threshold, confirm your monitoring shows it as a fraction in [0, 1]. The defaults, 3 network, 8 I/O threads, are deliberately conservative starting points (SocketServerConfigs.java:152, ServerConfigs.java:51); real clusters routinely run higher, but only raise the pool the gauge says is starved.

Replication parallelism and the ISR-drop window

Followers replicate by issuing fetch requests through num.replica.fetchers threads (default 1). One thread per source broker fetches all that broker's leader partitions sequentially, so a single slow or hot leader can starve the rest. When a follower falls behind, the leader drops it from the ISR after replica.lag.time.max.ms (default 30000 ms, ReplicationConfigs.java:55) without a successful fetch reaching the leader's LEO. The follower fetch itself blocks up to replica.fetch.wait.max.ms (default 500 ms, ReplicationConfigs.java:75), and the doc is explicit that this must stay below replica.lag.time.max.ms to avoid frequent ISR shrinking on low-throughput topics (line 76). Each follower fetch pulls up to replica.fetch.max.bytes per partition (default 1 MB, line 68). The durability consequences of ISR membership are the subject of Part I 08 · Replication.

Why high RemoteTimeMs means "raise num.replica.fetchers or fix the follower's disk"

Under acks=all the leader cannot ack a produce until the ISR has the record. That wait shows up as RemoteTimeMs in the latency breakdown. If RemoteTimeMs is high, the replication pipeline is the bottleneck: either the leader's write rate exceeds the followers' fetch rate (raise num.replica.fetchers so more partitions replicate in parallel) or a specific follower's disk/network is slow (fix the host). Jun Rao's mental model, "replicating 1000 partitions from one broker to another can add about 20 ms latency" (empirical, 2015 hardware), explains why the single-fetcher default does not scale to thousands of partitions per broker.

Socket buffers and segment size

Config	Default	Source	Tuning
`socket.send.buffer.bytes`	102400 (100 KB)	`SocketServerConfigs.java:88`	SO_SNDBUF; raise toward 1 MB on high bandwidth-delay-product links. `-1` ⇒ OS default.
`socket.receive.buffer.bytes`	102400 (100 KB)	`SocketServerConfigs.java:92`	SO_RCVBUF; same logic. Set ≈ `bandwidth × RTT`.
`socket.request.max.bytes`	104857600 (100 MB)	`SocketServerConfigs.java:96`	Hard cap on a single socket request.
`queued.max.requests`	500	`SocketServerConfigs.java:144`	Request-queue depth before network threads block reading (intake backpressure).
`replica.socket.receive.buffer.bytes`	65536 (64 KB)	`ReplicationConfigs.java:64`	SO_RCVBUF for the follower→leader replication socket.
`log.segment.bytes`	1073741824 (1 GB)	`LogConfig.java:131`	Segment roll size. Smaller ⇒ more files/FDs/index churn; larger ⇒ coarser retention granularity.
`message.max.bytes`	1 MB + overhead	`ServerLogConfigs.java:177`	Largest batch the broker accepts (after compression). Must be coordinated with producer `max.request.size`.

The bandwidth-delay product is the formula for socket buffers: buffer = link_bandwidth × round-trip latency. On a 10 Gbps link with 1 ms RTT that is ~1.25 MB, so the 100 KB default throttles a single high-BDP connection, raise both socket buffers (and the OS net.core.rmem_max/wmem_max) for cross-region replication or fat WAN consumers. log.segment.bytes is a storage-engine knob (Part I 03 · Storage Log Engine): smaller segments give finer retention and faster individual-segment recovery but multiply open file descriptors and .index mmaps, which feeds directly into the file-descriptor and vm.max_map_count limits below.

The page cache is the performance engine

Kafka's read and write speed comes from the OS page cache, not the JVM heap. Writes go to the page cache and are flushed by the kernel; reads of the log tail are served from the page cache; and the fetch path uses zero-copy sendfile to ship bytes from page cache straight to the socket without copying through user space or the heap (Part I 03 · Storage Log Engine and 09 · Fetch Path). The operational consequence is counter-intuitive: keep the JVM heap small so the OS has more RAM to cache the tail. Kafka does not need a heap above ~6 GB; on a 64 GB box with a 6 GB heap, roughly 28–30 GB is left for page cache (Confluent, empirical). Note that by default Kafka does not fsync on every write, log.flush.interval.messages defaults to Long.MAX_VALUE (ServerLogConfigs.java:101), i.e. flushing is left to the OS, which is precisely why durability under acks=all means "in the ISR's page cache," not "on disk" (see II · 06 · Durability).

Cold reads evict the cache that serves your hot writes

A consumer that suddenly reads from the beginning of a large topic (a backfill, a new analytics job, a reset offset) streams cold segments off disk and into the page cache, evicting the hot tail that producers and real-time consumers depend on. The symptom is a latency spike on otherwise-healthy partitions, often with rising LocalTimeMs, while one consumer does a historical scan. Mitigations: isolate batch/backfill consumers, use tiered storage so cold data lives in object storage rather than thrashing the broker's cache, and provision page cache for roughly write_throughput × 30 seconds of working set (synthesized, empirical). This is the single most common "mysterious" broker latency event.

Why heap ≤ 6 GB and not "more to go faster"

A large heap does not speed Kafka up; it slows it down twice. First, it steals RAM from the page cache, shrinking the cache hit rate that actually drives throughput. Second, it lengthens GC pauses, a 32 GB G1GC heap can incur 100–200 ms pauses, and a multi-second Full GC drops the broker from the ISR and triggers rebalances (empirical; LinkedIn's busy clusters hold ~21 ms p90 GC pause at < 1 young-GC/sec). Use G1GC with a small heap: -Xms6g -Xmx6g -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35. Alert when GC P99 > 50 ms (warning) / > 200 ms (critical). Reserve ZGC for heaps > 16 GB (rare for brokers) at ~5–10% CPU overhead. GC-induced ISR thrash is a runbook in II · 07.

The end-to-end latency budget

When a request is slow, do not guess which stage, read it. The broker decomposes every request's TotalTimeMs into five additive stages, each a JMX histogram under kafka.network:type=RequestMetrics,name=…,request={Produce|FetchConsumer|FetchFollower}. The names are compiled in at server/.../network/metrics/RequestMetrics.java:48-54:

TotalTimeMs = sum of the below: The full server-side latency for the request (RequestMetrics.java:54). Plus, when throttled, ThrottleTimeMs (line 51).
RequestQueueTimeMs: Time waiting in the request queue for a free I/O thread (RequestMetrics.java:48). High ⇒ too few num.io.threads (handler starvation).
LocalTimeMs: Time the I/O thread spends doing the work locally, log append, page-cache/disk, index (RequestMetrics.java:49). High ⇒ disk/flush latency, page-cache misses, or GC.
RemoteTimeMs: Time waiting on other brokers, for a produce with acks=all, waiting for ISR acks; for a fetch, waiting in delayed-fetch purgatory (RequestMetrics.java:50). High ⇒ slow followers / min.insync.replicas waits.
ResponseQueueTimeMs: Time the built response waits for a network thread to send it (RequestMetrics.java:52). High ⇒ too few num.network.threads.
ResponseSendTimeMs: Time actually writing the response to the socket (RequestMetrics.java:53). High ⇒ slow client/network, small socket buffers.

Two more diagnostic gauges from the same class: MessageConversionsTimeMs (line 45) flags expensive down-conversion for old clients, and TemporaryMemoryBytes (line 46) tracks transient memory for conversion/compression. Throttle time (ThrottleTimeMs, line 51) is added when a client exceeds its quota, covered in Part I 19 · Quotas; from the latency-budget view it is "the broker deliberately delaying you," not a fault.

symptom: produce/fetch p99 high

which stage dominates TotalTimeMs?

too few io threads → raise num.io.threads (RequestHandlerAvgIdlePercent low)

disk / page-cache miss / GC → check flush latency, cold reads, GC pauses

slow followers / ISR wait → raise num.replica.fetchers, check URP

too few network threads → raise num.network.threads (NetworkProcessorAvgIdlePercent low)

client over quota → see Quotas (Part I 19)

Latency root-cause decision tree, driven by the RequestMetrics decomposition. Read the stage, fix that stage.

thread-pool fix · disk/cache · replication wait · decision · quota throttle · sync stage · remote/async wait.

The latency-localization cheat sheet

High RequestQueueTimeMs → too few I/O threads (confirm with low RequestHandlerAvgIdlePercent). High LocalTimeMs → leader disk, page-cache miss, or GC. High RemoteTimeMs → slow followers or acks=all/min.insync.replicas waits (confirm with URP). High ResponseQueueTimeMs → too few network threads (confirm with low NetworkProcessorAvgIdlePercent). This is the map in II · 08 · Metrics & Signals; the breakdown is the first thing to pull in any latency incident.

OS-level tuning: the kernel knobs that gate Kafka

Because Kafka leans on the page cache, zero-copy, and thousands of memory-mapped index files, a handful of kernel settings gate its performance and stability. These are emergent limits, consequences of the architecture, not Kafka configs, and getting them wrong produces failures that look like Kafka bugs.

OS knob	Recommended	Why (mechanism)
`vm.swappiness`	1 (not 0)	Minimize swapping the JVM/page-cache to disk, which would murder latency, but 1, not 0: 0 forbids swap entirely and removes the OOM safety net (Cloudera/Confluent, empirical).
`vm.max_map_count`	≥ 262144	Each partition uses ~2 mmap areas (`.index` + `.timeindex`). The Linux default 65530 ⇒ a hard ceiling around ~32,765 partitions/broker (Instaclustr, empirical). Must exceed your `.index` file count, see II · 02 · Limits.
open files (`nofile`)	≥ 100000	`FDs ≈ (#partitions × partition_size / segment_size) + 1 per connection`; production brokers run 30k+ open handles (Jun Rao/Confluent, empirical). Too low ⇒ "Too many open files" and a downed broker.
`vm.dirty_background_ratio`	~5	Start flushing dirty page-cache pages early so the kernel never has to do one giant blocking flush that spikes `LocalTimeMs`.
`net.core.rmem_max` / `wmem_max`	~16 MB+	Ceiling for the socket-buffer tuning above; without it, `socket.send/receive.buffer.bytes` cannot grow to the BDP on fat links.
filesystem	XFS, `noatime`, NVMe SSD	`noatime` avoids a metadata write on every read; XFS handles Kafka's large sequential I/O well; NVMe lowers `LocalTimeMs` floor.

vm.max_map_count is a silent partition ceiling

This one bites at scale, not in testing. With the Linux default of 65530 and ~2 mmaps per partition, a broker simply cannot host much beyond ~32k partitions, the JVM throws on the next mmap and the broker fails. The Instaclustr KRaft work hit exactly this when pushing toward 600k partitions/broker, and KAFKA-14204 (fixed 3.3.0) made high-partition creation painfully slow before the limit was even reached (empirical). Raise vm.max_map_count to ≥ 262144 before you grow partition counts, and treat it as a first-class capacity input in II · 04 · Capacity Planning.

Putting it together: two reference profiles

The knobs are not independent; they compose into a posture. Two endpoints, both grounded in the defaults and mechanisms above. Treat all benchmark numbers as empirical and version/hardware-dependent, the canonical LinkedIn 2014 result is 2,024,032 rec/s (193 MB/s) with three producers and async replication, but a single durable producer at acks=-1 drops to 421,823 rec/s (40.2 MB/s), sync replication roughly halved throughput (Jay Kreps/LinkedIn, empirical). That gap is the durability tax, measured.

Dial	Throughput / cost profile	Latency profile
`linger.ms`	10–100 ms (fill big batches)	0–1 ms (send immediately)
`batch.size`	64 KB – 1 MB	16 KB default
`compression.type`	`lz4` (or `zstd` for storage)	`none` or `lz4`
`acks`	`all` (durable), accept the round-trip	`all` still preferred; `1` only if loss is tolerable
consumer `fetch.min.bytes`	raise (fatter fetches, less CPU)	1 (answer immediately)
broker `num.io.threads`	raise until `RequestHandlerAvgIdlePercent` ≥ 0.3	same gauge target
broker `num.network.threads`	raise until `NetworkProcessorAvgIdlePercent` ≥ 0.4	same gauge target
JVM heap	~6 GB; rest → page cache	~6 GB; rest → page cache

What does not change between profiles

The durability floor and the resource posture are constant. Keep acks=all with min.insync.replicas=2 and RF=3 (see II · 06), keep the heap small so the page cache stays large, keep vm.swappiness=1 / vm.max_map_count ≥ 262144 / nofile ≥ 100000, and always read the RequestMetrics breakdown and the two idle gauges before turning a knob. Latency and throughput are dials; data loss is not a dial you want to discover you turned.