II · 05 · Performance Tuning: Throughput ⇄ Latency
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.
Kafka tuning is not a search for one fast configuration; it is the deliberate choice of where on three axes, throughput, latency, and durability, you want to sit. Almost every knob trades one against another, and the trade is mechanical: batching amortizes per-request cost at the price of a fixed latency tax; acks=all buys durability with a replication round-trip; compression buys network and disk savings with producer CPU. This chapter maps each dial to the exact source mechanism that makes it a dial, gives you the compiled-in default and the file it lives in, and then breaks the end-to-end latency budget into the five broker stages you can actually measure. The cardinal skill is reading the signal, the request-queue/local/remote breakdown and the two idle-ratio gauges, so that you tune the stage that is actually slow rather than the one folklore blames.
The three-axis model: what you are really trading
Hold this picture before touching any config. A produce request's success has three independent costs. Latency is how long the producer waits for the ack. Throughput is how many bytes per second the pipeline sustains. Durability is how many simultaneous failures the data survives. You optimize for at most two; the third is the bill. Batching raises throughput and (under load) can even lower latency, but raises latency under light load. Compression raises effective throughput and cuts cost but spends CPU. acks=all raises durability but adds a remote round-trip to latency. Confluent's reference benchmark demonstrates the happy region, roughly 605 MB/s peak and p99 5 ms at 200 MB/s on i3en.2xlarge, RF=3, acks=all, fsync off (Confluent "Kafka Performance", empirical, 1 KB messages, reads served from page cache), but note the words "AND, not maxed simultaneously": you do not get peak throughput and minimum latency at the same instant.
batch.size + linger.ms (batching) · compression.type · acks · max.in.flight · buffer.memorynum.io.threads · num.network.threads · num.replica.fetchers · socket buffers · queued.max.requests · log.segment.bytessendfile; vm.swappiness=1; FDs; vm.max_map_count; dirty ratiosfetch.min.bytes + fetch.max.wait.ms (read batching) · max.partition.fetch.bytes · max.poll.records + max.poll.interval.mslinger.ms↑ trades latency for throughput. compression.type trades producer CPU for network and disk. acks=all trades latency for durability. Pick two; pay for the third (synthesized from Confluent / Strimzi tuning guides, empirical). Before you change anything, decide which one your workload is allowed to spend.
Producer: the batching dial
The producer's dominant lever is batching, and its mechanism is the RecordAccumulator, a per-partition deque of ProducerBatch objects (clients/.../producer/internals/RecordAccumulator.java:76,78 hold batchSize and lingerMs; the class is documented at line 102 as trading "some latency for potentially better throughput due to more batching"). A batch becomes ready to send when either it fills to batch.size or linger.ms elapses since the first record landed (the ready/allBatchesFull logic, RecordAccumulator.java:416), whichever comes first. That single OR is the whole dial. The sender thread then drains all ready batches across partitions into per-broker requests, capped by max.request.size (default 1 MB, ProducerConfig.java:424). Part I chapter 16 · Producer Client walks the accumulator and sender threads in full.
| Config | Default | Source | Mechanism / why |
|---|---|---|---|
batch.size | 16384 (16 KB) | ProducerConfig.java:413 | Upper bound per-partition batch; a buffer of this size is pre-allocated. Larger ⇒ fewer requests, better compression, more throughput, more memory waste. |
linger.ms | 5 | ProducerConfig.java:418 | Artificial delay to let a batch fill. Changed from 0 to 5 in Apache Kafka 4.0 (doc, ProducerConfig.java:100,162) because larger batches usually yield similar or lower latency. |
compression.type | none | ProducerConfig.java:409 | Compresses whole batches; batching efficacy drives the ratio. Valid: none, gzip, snappy, lz4, zstd. |
acks | all | ProducerConfig.java:405 | How many replicas must ack. all waits for the full ISR. Idempotence requires all. |
max.in.flight.requests.per.connection | 5 | ProducerConfig.java:489 | Unacked requests per connection. Capped at 5 under idempotence (broker retains 5 batches/producer). |
buffer.memory | 33554432 (32 MB) | ProducerConfig.java:401 | Total record-buffering memory; producer blocks when full. |
max.block.ms | 60000 (60 s) | ProducerConfig.java:451 | How long send() blocks on a full buffer or metadata fetch before throwing. |
delivery.timeout.ms | 120000 (120 s) | ProducerConfig.java:419 | Total deadline from send() to success/failure; must be ≥ request.timeout.ms + linger.ms. |
enable.idempotence | true | ProducerConfig.java:543 | Exactly-once-per-partition writes; on by default unless conflicting configs disable it. |
With linger.ms=0 the accumulator sends as soon as the sender thread is free, so under light load batches are tiny, Confluent's producer hands-on course measured average batches of only ~1,215 bytes against a 16 KB limit at linger=0, versus ~275 KB at linger=1500ms (empirical, Confluent). Tiny batches mean more requests, worse compression, more broker CPU. The 4.0 default of 5 ms (ProducerConfig.java:100) bets that the efficiency gain typically lowers end-to-end latency despite the added 5 ms ceiling, because each request now does proportionally more useful work.
Raising linger.ms on a low-rate topic is counter-productive: Confluent's hands-on course showed that for a 200 rec/s stream of 1000-byte records, increasing linger decreased throughput and increased latency, the defaults were already optimal (empirical, Confluent). Set linger.ms > 0 meaningfully only when the produce rate is high enough to fill batch.size within the window. Otherwise you are just adding dead time.
The throughput recipe vs the latency recipe
Two named configurations bracket the dial. For maximum throughput: batch.size in the 64 KB–1 MB range, linger.ms 5–100 ms, compression.type=lz4 (or zstd when storage cost dominates), acks=all with min.insync.replicas=2 (Intel/Confluent learn page, empirical). For minimum latency: keep linger.ms at 0–1 ms, modest batch.size, and accept lower compression ratios. The historical "10–50× throughput from batching" figures are vendor roll-ups, directional, not guarantees, but the direction is real and rooted in the per-request fixed cost the accumulator amortizes.
buffer.memory (32 MB, ProducerConfig.java:401) must be ≥ batch.size, and Kafka's own partition-count guidance is that a producer needs "at least a few tens of KB per partition being produced" (Jun Rao, empirical). When the buffer fills, because the brokers are slow or the network is saturated, send() blocks for up to max.block.ms (60 s) and then throws (ProducerConfig.java:225 doc). That block is your backpressure: if you see produce latency spikes, check whether the buffer is saturating before blaming the broker. Sizing rule: buffer.memory ≥ batch.size × partitions_produced + compression + in-flight headroom.
Ordering, idempotence, and max.in.flight
The interaction between max.in.flight.requests.per.connection (5) and idempotence is a correctness boundary, not a performance one. The constant MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION_FOR_IDEMPOTENCE = 5 (ProducerConfig.java:277) is a hard cap: the broker retains at most 5 batches per producer (aligned to ProducerStateEntry#NUM_BATCHES_TO_RETAIN, comment at line 276), so a 6th in-flight batch could be silently dropped on the broker side. The validation throws ConfigException if you set it above 5 with idempotence on (ProducerConfig.java:640). With idempotence, all five in-flight requests preserve order because the broker dedups and reorders by sequence number; without idempotence, max.in.flight > 1 plus retries can reorder records on a failed-then-retried send (ProducerConfig.java:301 doc). Part I chapter 14 · Transactions & EOS details the sequence-number mechanism.
Compression: a CPU-vs-ratio dial, applied per batch
Compression happens on the full batch (ProducerConfig.java:247 doc), which is why batching and compression compound: a bigger batch compresses better. The data travels and is stored compressed, Kafka replicates and persists the compressed batch, so a good ratio cuts produce bytes, replication bytes, disk, and consumer-fetch bytes all at once. The cost is producer CPU. Kafka 4.x exposes per-codec levels: compression.gzip.level, compression.lz4.level, compression.zstd.level (ProducerConfig.java:410-412, defaults are each codec's defaultLevel()).
| Codec | Compress / Decompress (lzbench) | Ratio (HTTP data) | When |
|---|---|---|---|
lz4 | 594 / 2428 MB/s | 1.81× | Default choice for performance, fastest decompressor by far; lowest CPU. Often faster end-to-end than no compression. |
snappy | 446 / 1344 MB/s | 2.35× | Similar niche to lz4; slightly higher ratio, slower decompress. |
zstd (lvl -1) | 409 / 844 MB/s | 4.5× (lvl 6) | Best ratio for moderately more CPU; strong default when storage/network cost dominates (Kafka 2.1+, KIP-110). |
gzip | slowest | 3.58× | Highest CPU; usually avoided despite decent ratio. |
Speeds and ratios: Cloudflare "Squeezing the firehose" (lzbench on ~1 MB / 600-record batches; HTTP-request data), empirical and data-dependent. Cloudflare chose Zstandard level 6, saving "hundreds of gigabits of internal traffic and terabytes of flash storage" and cancelling a hardware expansion; Trendyol chose zstd level 3 for ~70% message-size reduction (empirical).
compression.type defaults to none (ProducerConfig.java:409), not producer and not zstd. And because compression is per-batch, the "4.5×" ratio is a property of large batches of compressible data, a stream of tiny, near-random, or already-compressed payloads will see little benefit from any codec. Get batching right first (raise linger.ms / batch.size), then compression pays off. For the cost angle, compression as the cheapest cross-AZ lever, see II · 10 · Cost.
Consumer: read batching and the processing budget
On the read side, the symmetric batching dial is fetch.min.bytes + fetch.max.wait.ms. The broker holds a fetch open until it has fetch.min.bytes available or fetch.max.wait.ms elapses (this is local-log fetch; remote/tiered fetch has its own remote.fetch.max.wait.ms, noted in ConsumerConfig.java:208). The crucial fact is the default: fetch.min.bytes=1 (ConsumerConfig.java:187) means Kafka does not batch reads by default, it answers as soon as a single byte is available, optimized for millisecond latency, not for cost. Part I chapter 09 · Fetch Path covers the broker-side delayed-fetch purgatory that implements this wait.
| Config | Default | Source | Mechanism / why |
|---|---|---|---|
fetch.min.bytes | 1 | ConsumerConfig.java:187 | Min bytes before broker answers. =1 ⇒ no read batching; latency-optimized, NOT cost-optimized. |
fetch.max.wait.ms | 500 | ConsumerConfig.java:210 | Max the broker blocks waiting for fetch.min.bytes (local fetch only). |
fetch.max.bytes | 52428800 (50 MB) | ConsumerConfig.java:200 | Cap per fetch response across all partitions (soft; one oversized batch still returns). |
max.partition.fetch.bytes | 1048576 (1 MB) | ConsumerConfig.java:225 | Cap per partition per fetch (soft). Drives consumer fetch memory. |
max.poll.records | 500 | ConsumerConfig.java:94 | Records returned per poll(). Does NOT change fetching, caps records handed up from the cache. |
max.poll.interval.ms | 300000 (5 min) | ConsumerConfig.java:629 | Max gap between poll() calls before the member is evicted ⇒ rebalance. |
session.timeout.ms | 45000 | ConsumerConfig.java:438 | Liveness window (classic protocol). Member dropped if no heartbeat within it. |
heartbeat.interval.ms | 3000 | ConsumerConfig.java:443 | Heartbeat cadence; keep ≈ ⅓ of session.timeout.ms. |
receive.buffer.bytes | 65536 (64 KB) | ConsumerConfig.java:493 | SO_RCVBUF for the consumer socket; raise on high-BDP links. |
enable.auto.commit | true | ConsumerConfig.java:460 | Background offset commits; set false for at-least-once correctness control. |
Older guides and the empirical reference quote a ~10 s session timeout; the actual compiled default in this tree is 45000 ms (ConsumerConfig.java:438), with heartbeat.interval.ms=3000 (line 443). Verify against the version you run, this is exactly the kind of number that drifts in folklore. Note also that session.timeout.ms and heartbeat.interval.ms are not used under the new consumer group protocol (KIP-848); they are listed as unsupported for that protocol at ConsumerConfig.java:407-413. See 13 · Group Coordination.
Because the default is fetch.min.bytes=1, a consumer of a low-throughput topic that polls in a tight loop sends a flood of near-empty fetch requests, each costing broker CPU to service. New Relic raised fetch.min.bytes (and tuned fetch.max.wait.ms) on the consumers of low-traffic topics and cut cluster-wide broker CPU by 15% from a single application change, enabling broker scale-down (New Relic, empirical). The mechanism is pure request-rate reduction: fewer, fatter fetches. This is the highest-leverage consumer knob most operators never touch.
A common misconception: max.poll.records does NOT change how much data is fetched, fetching is governed by fetch.max.bytes / max.partition.fetch.bytes. It only caps how many already-cached records each poll() hands to your code (ConsumerConfig.java:92 doc). Its real job is protecting against rebalance storms: if your per-record processing is slow, a large max.poll.records can push the loop past max.poll.interval.ms (5 min), the coordinator evicts the member, it rejoins, and you get a self-sustaining rebalance. Sizing rule: max.poll.records × worst-case per-record time ≤ max.poll.interval.ms, with margin. To go faster, fetch more and process in parallel, do not just raise this cap. Rebalance-storm recovery lives in II · 07 · Failure Modes.
Consumer fetch memory is bounded two ways and you must provision RAM for the larger: fetch_memory ≈ #brokers × fetch.max.bytes, and also ≈ #partitions × max.partition.fetch.bytes (Strimzi, empirical). With defaults, a consumer subscribed to 1,000 partitions can demand up to ~1 GB of fetch buffering by the second bound, a reason not to point one consumer at thousands of partitions. Partition-count economics are in II · 03 · Partitioning.
Broker: threads, queues, and how to size them from idle ratios
A broker has two thread pools on the request path. Network threads (num.network.threads, default 3 per listener, SocketServerConfigs.java:152) read bytes off sockets and write responses; each listener except the controller listener gets its own pool (doc, line 153). I/O threads a.k.a. request handlers (num.io.threads, default 8, ServerConfigs.java:51) do the actual work, log append, fetch, index lookups, including disk I/O. Between them sits the request queue, capped by queued.max.requests (default 500, SocketServerConfigs.java:144); when it fills, network threads stop reading, that is the broker's intake backpressure. Part I chapters 06 · Network & Threading and 07 · Request Processing describe this acceptor → processor → handler pipeline in detail.
Do not guess. Each pool publishes an idle fraction. RequestHandlerAvgIdlePercent (the I/O pool, metric name defined at core/.../KafkaRequestHandler.scala:210) and NetworkProcessorAvgIdlePercent (the network pool, gauge registered at core/.../SocketServer.scala:120) are both fractions in [0, 1]: 1 = fully idle, 0 = saturated. Raise the pool whose idle ratio is low.
| Pool | Config (default · source) | Idle gauge | Action threshold |
|---|---|---|---|
| I/O / request handlers | num.io.threads = 8 · ServerConfigs.java:51 | RequestHandlerAvgIdlePercent | < 0.2 ⇒ overloaded, add handlers/brokers; < 0.1 ⇒ active problem (Instaclustr/Confluent, empirical). Heuristic: num.io.threads ≈ 8 × #data disks, bounded by cores. |
| Network processors | num.network.threads = 3 · SocketServerConfigs.java:152 | NetworkProcessorAvgIdlePercent | Ideal > 0.4; < 0.3 ⇒ raise num.network.threads (Confluent, empirical). Common production range 8–12. |
| Replica fetchers | num.replica.fetchers = 1 · ReplicationConfigs.java:96 | (watch URP / RemoteTimeMs) | Raise to 4–8 when followers lag and RemoteTimeMs is high. Total fetchers = this × #brokers (doc, line 97). |
| Recovery | num.recovery.threads.per.data.dir = 2 · ServerLogConfigs.java:147 | (startup time) | Raise to speed log recovery at startup / flush at shutdown on multi-disk hosts. |
This metric has been misreported repeatedly: mis-calculated (KAFKA-7295), surfaced as a rate by some integrations (Datadog integrations-core #516), and anomalous in KRaft combined mode (KIP-1207). Before you trust the < 0.2 threshold, confirm your monitoring shows it as a fraction in [0, 1]. The defaults, 3 network, 8 I/O threads, are deliberately conservative starting points (SocketServerConfigs.java:152, ServerConfigs.java:51); real clusters routinely run higher, but only raise the pool the gauge says is starved.
Replication parallelism and the ISR-drop window
Followers replicate by issuing fetch requests through num.replica.fetchers threads (default 1). One thread per source broker fetches all that broker's leader partitions sequentially, so a single slow or hot leader can starve the rest. When a follower falls behind, the leader drops it from the ISR after replica.lag.time.max.ms (default 30000 ms, ReplicationConfigs.java:55) without a successful fetch reaching the leader's LEO. The follower fetch itself blocks up to replica.fetch.wait.max.ms (default 500 ms, ReplicationConfigs.java:75), and the doc is explicit that this must stay below replica.lag.time.max.ms to avoid frequent ISR shrinking on low-throughput topics (line 76). Each follower fetch pulls up to replica.fetch.max.bytes per partition (default 1 MB, line 68). The durability consequences of ISR membership are the subject of Part I 08 · Replication.
Under acks=all the leader cannot ack a produce until the ISR has the record. That wait shows up as RemoteTimeMs in the latency breakdown. If RemoteTimeMs is high, the replication pipeline is the bottleneck: either the leader's write rate exceeds the followers' fetch rate (raise num.replica.fetchers so more partitions replicate in parallel) or a specific follower's disk/network is slow (fix the host). Jun Rao's mental model, "replicating 1000 partitions from one broker to another can add about 20 ms latency" (empirical, 2015 hardware), explains why the single-fetcher default does not scale to thousands of partitions per broker.
Socket buffers and segment size
| Config | Default | Source | Tuning |
|---|---|---|---|
socket.send.buffer.bytes | 102400 (100 KB) | SocketServerConfigs.java:88 | SO_SNDBUF; raise toward 1 MB on high bandwidth-delay-product links. -1 ⇒ OS default. |
socket.receive.buffer.bytes | 102400 (100 KB) | SocketServerConfigs.java:92 | SO_RCVBUF; same logic. Set ≈ bandwidth × RTT. |
socket.request.max.bytes | 104857600 (100 MB) | SocketServerConfigs.java:96 | Hard cap on a single socket request. |
queued.max.requests | 500 | SocketServerConfigs.java:144 | Request-queue depth before network threads block reading (intake backpressure). |
replica.socket.receive.buffer.bytes | 65536 (64 KB) | ReplicationConfigs.java:64 | SO_RCVBUF for the follower→leader replication socket. |
log.segment.bytes | 1073741824 (1 GB) | LogConfig.java:131 | Segment roll size. Smaller ⇒ more files/FDs/index churn; larger ⇒ coarser retention granularity. |
message.max.bytes | 1 MB + overhead | ServerLogConfigs.java:177 | Largest batch the broker accepts (after compression). Must be coordinated with producer max.request.size. |
The bandwidth-delay product is the formula for socket buffers: buffer = link_bandwidth × round-trip latency. On a 10 Gbps link with 1 ms RTT that is ~1.25 MB, so the 100 KB default throttles a single high-BDP connection, raise both socket buffers (and the OS net.core.rmem_max/wmem_max) for cross-region replication or fat WAN consumers. log.segment.bytes is a storage-engine knob (Part I 03 · Storage Log Engine): smaller segments give finer retention and faster individual-segment recovery but multiply open file descriptors and .index mmaps, which feeds directly into the file-descriptor and vm.max_map_count limits below.
The page cache is the performance engine
Kafka's read and write speed comes from the OS page cache, not the JVM heap. Writes go to the page cache and are flushed by the kernel; reads of the log tail are served from the page cache; and the fetch path uses zero-copy sendfile to ship bytes from page cache straight to the socket without copying through user space or the heap (Part I 03 · Storage Log Engine and 09 · Fetch Path). The operational consequence is counter-intuitive: keep the JVM heap small so the OS has more RAM to cache the tail. Kafka does not need a heap above ~6 GB; on a 64 GB box with a 6 GB heap, roughly 28–30 GB is left for page cache (Confluent, empirical). Note that by default Kafka does not fsync on every write, log.flush.interval.messages defaults to Long.MAX_VALUE (ServerLogConfigs.java:101), i.e. flushing is left to the OS, which is precisely why durability under acks=all means "in the ISR's page cache," not "on disk" (see II · 06 · Durability).
A consumer that suddenly reads from the beginning of a large topic (a backfill, a new analytics job, a reset offset) streams cold segments off disk and into the page cache, evicting the hot tail that producers and real-time consumers depend on. The symptom is a latency spike on otherwise-healthy partitions, often with rising LocalTimeMs, while one consumer does a historical scan. Mitigations: isolate batch/backfill consumers, use tiered storage so cold data lives in object storage rather than thrashing the broker's cache, and provision page cache for roughly write_throughput × 30 seconds of working set (synthesized, empirical). This is the single most common "mysterious" broker latency event.
A large heap does not speed Kafka up; it slows it down twice. First, it steals RAM from the page cache, shrinking the cache hit rate that actually drives throughput. Second, it lengthens GC pauses, a 32 GB G1GC heap can incur 100–200 ms pauses, and a multi-second Full GC drops the broker from the ISR and triggers rebalances (empirical; LinkedIn's busy clusters hold ~21 ms p90 GC pause at < 1 young-GC/sec). Use G1GC with a small heap: -Xms6g -Xmx6g -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35. Alert when GC P99 > 50 ms (warning) / > 200 ms (critical). Reserve ZGC for heaps > 16 GB (rare for brokers) at ~5–10% CPU overhead. GC-induced ISR thrash is a runbook in II · 07.
The end-to-end latency budget
When a request is slow, do not guess which stage, read it. The broker decomposes every request's TotalTimeMs into five additive stages, each a JMX histogram under kafka.network:type=RequestMetrics,name=…,request={Produce|FetchConsumer|FetchFollower}. The names are compiled in at server/.../network/metrics/RequestMetrics.java:48-54:
TotalTimeMs= sum of the below- The full server-side latency for the request (
RequestMetrics.java:54). Plus, when throttled,ThrottleTimeMs(line 51). RequestQueueTimeMs- Time waiting in the request queue for a free I/O thread (
RequestMetrics.java:48). High ⇒ too fewnum.io.threads(handler starvation). LocalTimeMs- Time the I/O thread spends doing the work locally, log append, page-cache/disk, index (
RequestMetrics.java:49). High ⇒ disk/flush latency, page-cache misses, or GC. RemoteTimeMs- Time waiting on other brokers, for a produce with
acks=all, waiting for ISR acks; for a fetch, waiting in delayed-fetch purgatory (RequestMetrics.java:50). High ⇒ slow followers /min.insync.replicaswaits. ResponseQueueTimeMs- Time the built response waits for a network thread to send it (
RequestMetrics.java:52). High ⇒ too fewnum.network.threads. ResponseSendTimeMs- Time actually writing the response to the socket (
RequestMetrics.java:53). High ⇒ slow client/network, small socket buffers.
Two more diagnostic gauges from the same class: MessageConversionsTimeMs (line 45) flags expensive down-conversion for old clients, and TemporaryMemoryBytes (line 46) tracks transient memory for conversion/compression. Throttle time (ThrottleTimeMs, line 51) is added when a client exceeds its quota, covered in Part I 19 · Quotas; from the latency-budget view it is "the broker deliberately delaying you," not a fault.
num.io.threads (RequestHandlerAvgIdlePercent low)num.replica.fetchers, check URPnum.network.threads (NetworkProcessorAvgIdlePercent low)High RequestQueueTimeMs → too few I/O threads (confirm with low RequestHandlerAvgIdlePercent). High LocalTimeMs → leader disk, page-cache miss, or GC. High RemoteTimeMs → slow followers or acks=all/min.insync.replicas waits (confirm with URP). High ResponseQueueTimeMs → too few network threads (confirm with low NetworkProcessorAvgIdlePercent). This is the map in II · 08 · Metrics & Signals; the breakdown is the first thing to pull in any latency incident.
OS-level tuning: the kernel knobs that gate Kafka
Because Kafka leans on the page cache, zero-copy, and thousands of memory-mapped index files, a handful of kernel settings gate its performance and stability. These are emergent limits, consequences of the architecture, not Kafka configs, and getting them wrong produces failures that look like Kafka bugs.
| OS knob | Recommended | Why (mechanism) |
|---|---|---|
vm.swappiness | 1 (not 0) | Minimize swapping the JVM/page-cache to disk, which would murder latency, but 1, not 0: 0 forbids swap entirely and removes the OOM safety net (Cloudera/Confluent, empirical). |
vm.max_map_count | ≥ 262144 | Each partition uses ~2 mmap areas (.index + .timeindex). The Linux default 65530 ⇒ a hard ceiling around ~32,765 partitions/broker (Instaclustr, empirical). Must exceed your .index file count, see II · 02 · Limits. |
open files (nofile) | ≥ 100000 | FDs ≈ (#partitions × partition_size / segment_size) + 1 per connection; production brokers run 30k+ open handles (Jun Rao/Confluent, empirical). Too low ⇒ "Too many open files" and a downed broker. |
vm.dirty_background_ratio | ~5 | Start flushing dirty page-cache pages early so the kernel never has to do one giant blocking flush that spikes LocalTimeMs. |
net.core.rmem_max / wmem_max | ~16 MB+ | Ceiling for the socket-buffer tuning above; without it, socket.send/receive.buffer.bytes cannot grow to the BDP on fat links. |
| filesystem | XFS, noatime, NVMe SSD | noatime avoids a metadata write on every read; XFS handles Kafka's large sequential I/O well; NVMe lowers LocalTimeMs floor. |
This one bites at scale, not in testing. With the Linux default of 65530 and ~2 mmaps per partition, a broker simply cannot host much beyond ~32k partitions, the JVM throws on the next mmap and the broker fails. The Instaclustr KRaft work hit exactly this when pushing toward 600k partitions/broker, and KAFKA-14204 (fixed 3.3.0) made high-partition creation painfully slow before the limit was even reached (empirical). Raise vm.max_map_count to ≥ 262144 before you grow partition counts, and treat it as a first-class capacity input in II · 04 · Capacity Planning.
Putting it together: two reference profiles
The knobs are not independent; they compose into a posture. Two endpoints, both grounded in the defaults and mechanisms above. Treat all benchmark numbers as empirical and version/hardware-dependent, the canonical LinkedIn 2014 result is 2,024,032 rec/s (193 MB/s) with three producers and async replication, but a single durable producer at acks=-1 drops to 421,823 rec/s (40.2 MB/s), sync replication roughly halved throughput (Jay Kreps/LinkedIn, empirical). That gap is the durability tax, measured.
| Dial | Throughput / cost profile | Latency profile |
|---|---|---|
linger.ms | 10–100 ms (fill big batches) | 0–1 ms (send immediately) |
batch.size | 64 KB – 1 MB | 16 KB default |
compression.type | lz4 (or zstd for storage) | none or lz4 |
acks | all (durable), accept the round-trip | all still preferred; 1 only if loss is tolerable |
consumer fetch.min.bytes | raise (fatter fetches, less CPU) | 1 (answer immediately) |
broker num.io.threads | raise until RequestHandlerAvgIdlePercent ≥ 0.3 | same gauge target |
broker num.network.threads | raise until NetworkProcessorAvgIdlePercent ≥ 0.4 | same gauge target |
| JVM heap | ~6 GB; rest → page cache | ~6 GB; rest → page cache |
The durability floor and the resource posture are constant. Keep acks=all with min.insync.replicas=2 and RF=3 (see II · 06), keep the heap small so the page cache stays large, keep vm.swappiness=1 / vm.max_map_count ≥ 262144 / nofile ≥ 100000, and always read the RequestMetrics breakdown and the two idle gauges before turning a knob. Latency and throughput are dials; data loss is not a dial you want to discover you turned.