krivaltsevich.com Kafka Internals4.4

II · 04 · Capacity Planning & Sizing

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

Capacity planning is the act of turning a target workload, a message rate, an average size, a retention window, a fan-out, into a concrete bill of materials: how many brokers, how much disk, how much RAM, how big a NIC. Kafka makes this tractable because the cost model is almost entirely linear and analytic: every byte produced is amplified by replication, multiplied by retention, and copied to consumers in ways you can compute on paper before you provision a single instance. This chapter builds the four resource equations, ingress/egress, disk, memory, and network, from the mechanisms in the source, works a full example end to end, then dimensions the cluster with the broker-count formula and the headroom rules that keep it standing when a broker dies. The recurring theme: the NIC is usually the first ceiling, and replication is why.

The two numbers everything starts from

Every sizing exercise reduces to two measured inputs and a handful of multipliers. The inputs are the peak ingress rate and the average compressed record size:

msg_rate
Records per second at the peak you must serve without throttling, not the daily average. Plan against the 95th-percentile minute, not the mean hour.
avg_size
Average serialized, post-compression bytes per record. Kafka stores, replicates, and serves batches compressed (see Part I · 01), so the on-wire and on-disk size is the compressed size, not the in-app object size.
ingress (B/s)
= msg_rate × avg_size. This is the byte rate hitting partition leaders. It is the seed of every other number.
Measure, do not guess, avg_size

Read it from a representative cluster as BytesInPerSec / MessagesInPerSec, both are real broker meters: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec and name=MessagesInPerSec (storage/.../log/metrics/BrokerTopicMetrics.java:34, :33). Dividing them gives the true compressed average including batch and header overhead, which is what every formula below assumes. A 200-byte JSON object that compresses 4× in a batch costs you ~50 bytes of ingress, not 200.

The amplification chain

A single produced byte does not stay a single byte. It is fanned out three ways, and each path has a multiplier rooted in a concrete mechanism:

producer: 1 byte
leader appends to log
(RF−1) followers fetch
RF copies on disk
N consumer groups fetch
egress to apps
One produced byte, three amplified paths: replication writes (RF−1) follower copies, retention holds RF copies on disk, and consumer fan-out re-reads it N times.
client traffic · broker/replication · on-disk copy · byte flow
QuantityFormulaMechanism / source
Ingress (write to leaders)ingress = msg_rate × avg_sizeProducer append path (Part I · 16).
Replication traffic (inter-broker)repl = ingress × (RF − 1)Each follower runs fetcher threads that pull every committed byte from the leader; RF − 1 followers ⇒ RF − 1 copies of ingress traverse the cluster fabric. Part I · 08.
Disk written/secdisk_write = ingress × RFLeader writes 1 copy; each of RF − 1 followers writes its own copy. Total = RF copies appended per second across the cluster.
Egress (read off cluster)egress = ingress × fanout + replfanout = number of consumer groups reading the full stream; replication fetch is itself egress from the leader. Part I · 09.
Why replication is RF − 1, not RF

The leader already has the byte, it received it from the producer. Only the followers must fetch it. With the default default.replication.factor = 1 (server/.../config/ReplicationConfigs.java:42, REPLICATION_FACTOR_DEFAULT = 1) there is no replication traffic at all; at the production-standard RF=3 the cluster moves 2× ingress in inter-broker replication alone, dwarfing most consumer fan-out. This single term is why "egress ≥ 2× ingress" is the canonical planning floor (empirical: Confluent / Jun Rao sizing guidance) and why networking dominates the cloud bill (empirical: Confluent cost model, networking ~52–87% of self-managed infra; see op10).

Disk: retention × replication factor

Kafka is a log: it keeps bytes until a retention policy deletes them. Steady-state disk is therefore the byte rate × how long you keep it × how many copies exist:

Local disk (no tiering)
disk = ingress × retention_seconds × RF
Default retention
7 days = 604,800 s. The constant is DEFAULT_RETENTION_MS = 24 * 7 * 60 * 60 * 1000 (storage/.../log/LogConfig.java:134), surfaced as retention.ms / log.retention.hours.
Default retention bytes
retention.bytes = -1 (unbounded; time-based only), the size cap is off by default, so time governs disk unless you set it.

The RF factor is not optional accounting: a topic with RF=3 physically stores three full copies, one per replica broker, because each follower maintains a complete local log (Part I · 03, Part I · 08). Halving RF halves disk and halves replication network, the most direct dial you have, traded against fault tolerance.

Provision the segment, not just the bytes

Disk usage is quantized by the active segment: a partition cannot delete data in the segment it is still appending to. With log.segment.bytes default 1 GiB (DEFAULT_SEGMENT_BYTES = 1024*1024*1024, LogConfig.java:131) and segment.ms default 7 days (DEFAULT_SEGMENT_MS, LogConfig.java:132), a low-traffic partition can hold up to ~1 GiB beyond its nominal retention until the active segment rolls. There is also a hard ceiling per segment: a single log file cannot exceed Integer.MAX_VALUE (~2 GiB), the broker throws on open if a segment is larger (FileRecords.java:68-73). Index files add to disk too: log.index.size.max.bytes defaults to 10 MiB per segment (ServerLogConfigs.java:93). Budget total disk at ingress × retention × RF, then add ~10–20% headroom for the active-segment tail, indexes, and the hard-fail margin (never run a log dir past ~85%, a full dir halts the broker; see op07).

How tiered storage rewrites the local term

Tiered storage (Part I · 05, KIP-405) splits retention into a small local hot tier on broker disk and an unbounded remote tier in object storage. The broker keeps only local.retention.ms / local.retention.bytes worth of recent data locally; older segments are copied to S3/GCS/Azure Blob and deleted from disk. It is off by default, remote.log.storage.system.enable = false (DEFAULT_REMOTE_LOG_STORAGE_SYSTEM_ENABLE = false, storage/.../remote/storage/RemoteLogManagerConfig.java:58), and enabled per topic with remote.storage.enable=true. The local-retention defaults -2 mean "derive from the topic's retention.*" until you set an explicit, shorter local window (DEFAULT_LOCAL_RETENTION_MS = -2, LogConfig.java:146).

ModeLocal disk equationWhat changes
Classicingress × retention × RFFull history × RF on local disk.
Tieredingress × local.retention × RF (local) + ingress × total_retention (remote, ~1 logical copy in the object store)Local term collapses from days to hours; the RF multiplier disappears from the remote tier, object storage provides its own internal durability, so you store ~1 logical copy remotely instead of RF.
The economics, marked empirical

Object storage runs ~$0.02/GiB-month vs ~$0.08–0.10/GiB-month for replicated EBS, roughly 4–5× cheaper per byte, and Confluent reports storage cost reductions "over 90%" in favorable cases, ~30–40% typical (empirical: Confluent, Uber, WarpStream; see op10). Tiered storage does NOT reduce cross-AZ networking, replication still happens on the local hot tier. It buys you disk and faster broker recovery (no cold re-replication on rejoin), not network. Set local.retention.ms to cover your consumers' real-time lag plus replication catch-up, long enough that live consumers and rejoining followers read from local disk (cheap, zero-copy) rather than triggering remote fetches.

Memory: size the page cache, keep the heap small

This is the most misunderstood dimension, and the source makes the mechanism unambiguous. Kafka does not cache records in the JVM heap. A fetch returns a slice of the on-disk file: LogSegment.read ends with log.slice(startPosition, fetchSize) (storage/.../log/LogSegment.java:462), producing a FileRecords view, never a heap copy. That slice is then streamed to the consumer's socket with the sendfile zero-copy path: FileRecords.writeTo calls destChannel.transferFrom(channel, position, count) (clients/.../record/internal/FileRecords.java:302), which for a plaintext connection is fileChannel.transferTo(position, count, socketChannel) (clients/.../network/PlaintextTransportLayer.java:213-214), a single kernel syscall that moves bytes from the page cache straight to the NIC, never touching the JVM.

Consumer fetch
asks for records from offset O
FetchRequest
Broker fetch path
LogSegment.readFileRecords.slice(), a file view, no copy
LogSegment.java:462
OS page cache
hot tail of each partition lives here; transferTo reads it directly
sendfile syscall
Socket → NIC
transferFromfileChannel.transferTo(…, socketChannel)
PlaintextTransportLayer.java:213
The zero-copy fetch: page cache → socket in one syscall, bypassing the JVM heap entirely. This is why RAM, not heap, is the read accelerator.
client · broker logic · page cache / disk · chip source anchor
Therefore: RAM holds the hot tail, the heap stays modest (~6 GB)

A consumer that is caught up reads bytes that were written seconds ago, still resident in the page cache from the write. The kernel serves it with no disk seek. So the RAM you want is enough page cache to hold the "hot tail" of every active partition, the recent window that live consumers and catching-up followers re-read. The JVM heap, by contrast, holds only broker bookkeeping (request objects, the metadata image, purgatory, index structures) and stays modest: ~6 GB is the typical recommendation (empirical: Confluent, LinkedIn, heap >6 GB is unnecessary and risks long GC pauses). Do not oversize the heap at the page cache's expense: every GiB you hand the JVM is a GiB the kernel cannot use to keep partitions hot, which forces cold disk reads and spikes p99 (the "catch-up tax": historical reads evicting hot data have driven p99 produce from ~2 ms to ~250 ms, empirical: see op07 / Pinterest).

A working memory rule

The page-cache formula is cache_needed = ingress_per_broker × hot_window, where hot_window is the worst-case lag (in seconds) you want served from RAM rather than disk. The inputs below are illustrative planning constants, substitute your measured values.

heap, config, cited recommendation
Fixed, small: 6 GiB. -Xms6g -Xmx6g with G1GC is the canonical setting (empirical: Confluent / LinkedIn, heap > 6 GiB is unnecessary and risks long GC pauses); raise toward 8 GiB only above ~10k partitions per broker, where the metadata image and per-partition structures grow.
box_ram, hardware assumption
64 GiB per broker (empirical: Confluent reference hardware, 64 GiB RAM, 32 GiB minimum).
os_reserve, hardware assumption
28 GiB left for the OS, agents, and slack on a 64 GiB box (illustrative; measure on your image). What remains is page cache.
ingress_per_broker, workload-dependent (illustrative)
50 MB/s for this example (substitute your measured per-broker BytesInPerSec).
hot_window target, planning heuristic
~30 s of write throughput is the common starting heuristic (empirical: Confluent); if your slowest real-time consumer lags by up to T seconds, size for ingress_per_broker × T so even your laggards stay in cache.

Derivation, page cache available, then how much hot data it holds:

  • page cache ≈ box_ram − heap − os_reserve = 64 GiB − 6 GiB − 28 GiB30 GiB for the cache (the "~28–30 GiB" Confluent reference figure).
  • resident hot window = cache ÷ ingress_per_broker = 30 GiB ÷ 50 MB/s30,720 MiB ÷ 50 MB/s~600 s ≈ 10 minutes of hot data resident, comfortably above the ~30 s heuristic, so normal consumer lag is served entirely from RAM.

(Mixing GiB and MB here is deliberately conservative: 30 GiB ≈ 32,212 MB, so dividing by 50 MB/s gives ~644 s, we round down to ~600 s / 10 min. The takeaway is order-of-minutes, not the exact second.)

TLS breaks zero-copy on the read path

The transferTo sendfile path requires a plaintext socket. When the consumer connection is encrypted, bytes must be copied into the JVM, encrypted, and written normally, the zero-copy fast path is gone, CPU rises, and the page-cache-to-NIC shortcut no longer applies on egress. Budget extra CPU for TLS-heavy read fan-out, and note that historically per-SSL-channel memory has been non-trivial (~122 KB/channel forced one fleet's heap from 4 GB to 8 GB, empirical: Pinterest, see op07). See Part I · 18 and Part I · 06.

Network: the NIC is the first ceiling

Sum what crosses each broker's NIC and you usually find it, not disk or CPU, is the binding constraint. Per broker, in the steady state with balanced leadership:

Inbound
producer writes to its leader partitions + replication it pulls as a follower for replicas led elsewhere.
Outbound
consumer reads of its leader partitions (× fanout) + replication bytes it serves to followers as a leader.

Aggregated across the cluster, total network movement is approximately:

Cluster network ≈ ingress × RF (replication side) + egress (consumer side)

The replication term appears as both an in and an out copy across the fabric: ingress × (RF−1) leaves leaders and ingress × (RF−1) arrives at followers. Add producer ingress and ingress × fanout of consumer reads. So the cluster NIC amplification factor is 1 (produce) + 2(RF−1) (replication in+out) + fanout (consume). Assume RF=3 and fanout=1 (one consumer group), both illustrative, and it works out to 1 + 2×(3−1) + 1 = 1 + 4 + 1 = 6×… but most of that replication cost is symmetric in+out, so per byte produced the commonly quoted floor is 1× produce + 2× replication + 1× consume ≈ 4× ingress of distinct NIC traffic (the in and out copies of replication are the same 2×). A "100 MB/s" application workload is therefore really a 100 MB/s × 4 = ~400 MB/s networking problem. NIC line-rate conversion (assumption, hardware): a 10 GbE link is 10 Gbit/s ÷ 8 bit/byte = 1.25 GB/s, so the NIC is a hard per-broker wall at ~1,250 MB/s long before disk fills.

The broker exposes the replication slice of this directly: ReplicationBytesInPerSec and ReplicationBytesOutPerSec (BrokerTopicMetrics.java:37-38), watch these alongside BytesInPerSec/BytesOutPerSec to see how much of your NIC is consumed by replication you cannot avoid versus consumer traffic you can sometimes redirect.

Cross-AZ is the same bytes, priced

In a cloud 3-AZ deployment the network bytes above are the same bytes, but the cross-zone fraction is metered: ~⅔ of produce, ~⅔ of consumer fetch, and all inter-broker replication crosses AZ boundaries, charged ~$0.02/GiB effective on AWS. The ingress × (RF−1) replication term is the dominant cross-AZ cost. KIP-392 fetch-from-follower lets consumers read a same-AZ replica to kill consumer-side cross-AZ, but the produce+replication floor remains. Full treatment in op10 and Part I · 19.

CPU: compression, TLS, request handling

CPU is rarely the first ceiling for plaintext, lz4-compressed workloads, but three loads scale with traffic and can become binding:

CPU loadScales withDial / default
Compression / decompressioningress & egress bytes; producer compresses, broker may recompress, consumers decompresscompression.type default none (ProducerConfig.java:409, CompressionType.NONE). lz4/snappy fastest, zstd best ratio. Compression is per-batch, tiny batches compress poorly. (empirical: Cloudflare zstd-6 = 4.5× ratio; lz4 = 1.81×)
TLS encrypt/decryptevery encrypted byte, both directions; defeats zero-copy egressPlaintext by default. Allocate headroom for TLS-heavy fan-out.
Request handlingrequest rate (not just bytes), many tiny fetches cost CPU per requestnum.io.threads default 8 (ServerConfigs.java:51); num.network.threads default 3 (SocketServerConfigs.java:152); num.replica.fetchers default 1 (ReplicationConfigs.java:96); queued.max.requests default 500 (SocketServerConfigs.java:144).
Request rate can saturate CPU before byte rate

Kafka does not batch reads on the broker by default, fetch.min.bytes is 1 (ConsumerConfig.java:187, DEFAULT_FETCH_MIN_BYTES = 1), tuned for latency, not cost. A swarm of consumers polling tiny fetches generates request-handling CPU out of proportion to bytes moved. The documented fix is raising fetch.min.bytes on low-throughput-topic consumers: a single such change cut whole-cluster broker CPU 15% at New Relic (empirical; see op05/op08). Watch RequestHandlerAvgIdlePercent (keep >0.3) and NetworkProcessorAvgIdlePercent (keep >0.4), saturation there means add I/O or network threads, or brokers (see op08). Plan throughput per partition at a conservative ~10 MB/s, even though a single partition can sustain "tens of MB/s" (empirical: Jun Rao).

Broker count: the max of four constraints

The number of brokers is not a single calculation, it is the maximum of independent lower bounds, because the cluster must satisfy all of them at once:

The broker-count formula

brokers = max( target_throughput / per_broker_ceiling ,  total_partitions / per_broker_partition_limit ,  RF ,  3 )

BoundWhy it existsType
throughput / per_broker_ceilingThe throughput bound. per_broker_ceiling is the emergent NIC/disk/CPU limit of one broker, typically the NIC, computed as in §Network. Use a burst-headroom factor of 0.7 (planning heuristic, leave ~30% of NIC line rate free for spikes; see §Headroom) so per_broker_ceiling = NIC_line_rate × 0.7.Emergent (resource).
total_partitions / per_broker_partition_limitThe partition bound. Each partition costs file descriptors, two mmap'd index regions, fetcher work, and metadata. total_partitions = Σ (partitions × RF) across all topics, replicas count.Emergent, with a hard floor (below).
RFYou cannot place RF copies of a partition on fewer than RF brokers, each replica needs a distinct broker.Hard (placement invariant).
3Availability floor: RF=3 plus tolerating one broker down for maintenance while keeping a majority needs at least 3, and the KRaft controller quorum wants an odd ≥3 (Part I · 10).Operational floor.

The per-broker partition limit, precisely

"Partitions are free under KRaft" is wrong. KRaft (Part I · 11, KIP-500) removed the ZooKeeper-era ~4,000/broker and ~200,000/cluster ceilings, those were driven by O(partitions) controller failover and metadata reload, and are version-specific (empirical: Confluent 200K post, Kafka 1.1.0+; ZK removed in 4.0). But concrete per-broker limits remain:

LimitValue / sourceKind
Linux vm.max_map_countDefault 65,530 mmap regions; Kafka uses ~2 mmap regions/partition (offset + time index) ⇒ 65,530 regions ÷ 2 regions/partition ≈ 32,765 partitions/broker before mmap exhaustion. Raise to ≥262,144 (or 1,000,000+) for dense brokers. (empirical: Instaclustr KRaft Part 3)Hard (OS constant).
File descriptorsopen files ≈ Σ(partition_size / segment_size) + connections; set nofile100,000. Production brokers run >30k handles. (empirical: Jun Rao / Confluent)Emergent + OS limit.
Replication latency"Replicating 1000 partitions broker-to-broker adds ~20 ms latency" ⇒ per-broker cap ≈ 100 × brokers × RF to bound it. (empirical: Jun Rao, 2015, teach the mechanism, not the constant)Emergent.
Throughput peaks earlyProducer throughput peaked at ~100 partitions and degraded past ~1,000 in one test; scalability ≠ performance. (empirical: Instaclustr KRaft Part 1)Emergent (anti-pattern).

A safe planning figure for a well-resourced broker is on the order of 1,000–4,000 partitions (replicas included) for latency-sensitive workloads, with the hard mmap wall far above that. Pick per_broker_partition_limit from your latency tolerance and FD/mmap budget, not from the KRaft headline. Partition count itself is sized in op03; here it is an input to broker count.

A worked example, end to end

Take a realistic mid-size event pipeline and run every equation. Every number below is either a labeled assumption or is derived from earlier ones, nothing appears mid-calculation unintroduced. The workload figures are illustrative for this example; substitute your own measured values.

AssumptionValue (with units)Why / sourceKind
msg_rate1,000,000 records/s (peak)The peak minute this pipeline must serve without throttling.Workload (illustrative)
avg_size250 B/record (post-compression)Measured as BytesInPerSec / MessagesInPerSec; see §"Measure, do not guess".Workload (illustrative)
RF3 (copies)Production durability triad; the server default is 1 (ReplicationConfigs.java:42) so you must raise it.Config (production standard)
min.insync.replicas, acks2, allA committed write needs 2 in-sync copies, so one failure is survivable. Fixes RF=3 as the multiplier everywhere below.Config (production standard)
retention7 days = 604,800 sKafka default DEFAULT_RETENTION_MS = 24*7*60*60*1000 (LogConfig.java:134); 7 d × 86,400 s/d = 604,800 s.Cited Kafka default
fanout2 (consumer groups reading the full stream)Two independent consumer groups each re-read every byte.Workload (illustrative)
partitions30 (on the main topic)Sized in op03; here it is an input. Replica count = partitions × RF.Workload (from op03)
disk_overhead~15%Active-segment tail + indexes + hard-fail margin (see §"Provision the segment"); illustrative.Planning heuristic

Derivation, each row applies a formula from the tables above to these assumptions, carrying units so they cancel to the result's unit:

StepFormula & substitution (units cancel)Result
Ingressingress = msg_rate × avg_size = 1,000,000 records/s × 250 B/record = 250,000,000 B/s; ÷ 1,000,000 B/MB250 MB/s
Replication trafficrepl = ingress × (RF−1) = 250 MB/s × (3−1) = 250 MB/s × 2500 MB/s inter-broker
Egress (off cluster)egress = ingress × fanout + repl = 250 MB/s × 2 + 500 MB/s = 500 + 500 MB/s1,000 MB/s total read
Disk written/secdisk_write = ingress × RF = 250 MB/s × 3750 MB/s across cluster
Disk total (raw)disk = ingress × retention × RF = 250 MB/s × 604,800 s × 3 = 453,600,000 MB; ÷ 1,000,000 MB/TB453.6 TB raw
Disk total (provisioned)453.6 TB × (1 + disk_overhead) = 453.6 TB × 1.15522 TB usable plan
Cluster NIC trafficsum of every byte crossing a NIC = produce (250) + repl-in (500) + repl-out (500) + consume (500) MB/s, i.e. ingress × (1 + 2(RF−1) + fanout) = 250 MB/s × (1 + 2×2 + 2) = 250 MB/s × 71,750 MB/s aggregate (= 7× ingress; the in+out copies of replication are why it exceeds the "4×" distinct-bytes floor)

The here and the ~4× floor in §Network are the same model at different fanout: the floor counts distinct bytes at fanout=1 (1+2+1), while this aggregate counts every NIC crossing, replication in and out, at fanout=2, giving 1 + 2(RF−1) + fanout = 7.

Now dimension the cluster

Two more assumptions feed the broker-count formula, both illustrative, substitute your hardware:

NIC_line_rate, hardware
10 GbE10 Gbit/s ÷ 8 bit/byte = 1.25 GB/s = 1,250 MB/s per broker.
per_broker_ceiling, derived
= NIC_line_rate × 0.7 (burst-headroom factor, §Headroom) = 1,250 MB/s × 0.7 = 875 MB/s.
per_broker_partition_limit, planning target
2,000 replicas (well below the ~32,765 mmap wall; chosen for latency, see §"per-broker partition limit").

Plug the example's derived 1,750 MB/s cluster NIC traffic and 30 partitions × RF 3 = 90 replicas into the four bounds:

broker_count = max(…)
1,750 MB/s ÷ 875 MB/s ≈ 2
90 replicas ÷ 2,000 ≈ 1
3
3
max = 3 → size for failure & disk
Four lower bounds: throughput = 1,750 MB/s ÷ 875 MB/s/broker ≈ 2; partitions = (30 × RF 3) ÷ 2,000/broker ≈ 1; RF = 3; floor = 3. The raw maximum is 3 (RF and the availability floor dominate this small-partition workload). Disk and post-failure throughput then push the real count up.
compute · a single bound · decision
Three is the floor, not the answer, apply headroom and let disk decide

The raw max is 3, but spread the load over 3 brokers and each carries 1,750 MB/s ÷ 3 ≈ 583 MB/s of NIC and 453.6 TB ÷ 3 ≈ 151 TB of disk, and if one dies, the remaining two must absorb its load. Size for N−1: the survivors must serve the full 1,750 MB/s with one gone, so at the 875 MB/s/broker ceiling you need 1,750 MB/s ÷ 875 MB/s = 2 survivors at full tilt, making 3 (only 2 survivors) marginal even on throughput. Disk dominates. Assume disk_per_broker ≈ 15 TB usable (hardware assumption, fast NVMe; illustrative): 453.6 TB ÷ 15 TB/broker ≈ 30 brokers. On storage-dense instances assume disk_per_broker ≈ 48 TB usable: 453.6 TB ÷ 48 TB/broker ≈ 9.5 ⇒ ~10 brokers (round up for headroom ⇒ 10–12). The lesson: compute all four bounds, then let the dominant physical resource (here disk, then NIC) set the count, and add N−1 + burst headroom on top. More, smaller brokers also shrink blast radius and recovery time.

Headroom: failure and burst

A cluster sized to exactly its steady-state load has zero margin and will brown out the first time anything moves. Two headroom budgets are non-negotiable:

HeadroomRuleWhy
Failure (N−1)The cluster must serve full load with one broker down. Effective ceiling per broker = per_broker_ceiling × (N−1)/N. For RF=3 across 3 AZs you also want to survive a full AZ.A dead broker's leadership migrates to survivors (Part I · 08); its replicas must be re-fetched, adding transient replication load. Provision so survivors are ≤ ~70% utilized post-failover.
BurstPlan against peak minute, then leave ~20–30% above it. Producers buffer and retry, but the broker must drain it.Traffic is bursty; the producer's buffer.memory default 32 MiB (ProducerConfig.java:401) absorbs short spikes, but sustained over-ceiling backs up into client-side blocking and growing lag.
Recovery / re-replicationReserve NIC and disk-write headroom for catch-up after a broker rejoins.A rejoining broker re-replicates everything it missed at num.replica.fetchers × per-fetcher rate; this competes with live traffic. Tiered storage shrinks this (no cold re-fetch). Throttle with replication quotas (Part I · 19).
Durability is a capacity input, not an afterthought

The production durability triad, RF=3, min.insync.replicas=2, acks=all, is what makes the RF multiplier 3 in every disk and replication equation above. The server defaults are weaker than production needs: default.replication.factor=1 (ReplicationConfigs.java:42) and min.insync.replicas=1 (ServerLogConfigs.java:155, MIN_IN_SYNC_REPLICAS_DEFAULT = 1), while the producer defaults acks=all (ProducerConfig.java:405) and enable.idempotence=true (ProducerConfig.java:543) are already production-grade. You must raise RF and isr at topic creation, and the moment you do, you have tripled disk and doubled replication network. Size for the durability you will actually run, from day one. The full durability mechanism is op06; the rationale (min.insync.replicas=2 means a committed write needs 2 in-sync copies, so a single failure is survivable without data loss) lives in Part I · 08.

Per-broker network math (the binding worksheet)

Because the NIC is usually the wall, here is the per-broker inbound/outbound worksheet for a balanced RF=3 cluster of B brokers, ingress I, fanout F:

Per-broker ingress (leader writes)
I / B (its share of producer traffic).
Per-broker replication IN (as follower)
I / B × (RF−1), it follows replicas led elsewhere.
Per-broker replication OUT (as leader)
I / B × (RF−1), followers pull from it.
Per-broker consumer OUT
I / B × F.
Total per-broker NIC
(I/B) × (1 + 2(RF−1) + F), the sum of the four rows above. Substituting RF=3, F=2: 1 + 2×(3−1) + 2 = 1 + 4 + 2 = 7, so total ≈ (I/B) × 7 (the same the worked example derived for aggregate NIC, here divided across B brokers).
Decision rule

Set B ≥ (I × (1 + 2(RF−1) + F)) / (NIC_line_rate × 0.7 × (N−1)/N). The 0.7 is burst headroom; the (N−1)/N is failure headroom. Plug RF=3, F=2, and you will see the NIC bound dominate for any read-heavy or high-RF workload, confirming the chapter's thesis that networking is the first ceiling and replication is why.

Checklist: from workload to cluster

  1. Measure avg_size as BytesInPerSec / MessagesInPerSec; record peak msg_rate.
  2. Ingress = msg_rate × avg_size. Everything derives from this.
  3. Replication = ingress × (RF−1); egress = ingress × fanout + replication; disk/s = ingress × RF.
  4. Disk total = ingress × retention_seconds × RF (× ~1.15 overhead); collapse the local term if tiering.
  5. Memory: heap fixed ~6 GB; RAM ≥ ingress_per_broker × hot_window_seconds for page cache.
  6. Network: per-broker NIC ≈ (I/B) × (1 + 2(RF−1) + F); this is usually the binding bound.
  7. Broker count = max(throughput/ceiling, partitions/limit, RF, 3), then add N−1 failure and ~30% burst headroom; let the dominant physical resource (disk or NIC) set the final number.
  8. Validate against the empirical envelope: ~10 MB/s/partition planning rate; ~605 MB/s peak per small cluster at RF=3 acks=all (empirical: Confluent benchmark, illustrative, not a guarantee); ~32k partitions/broker mmap wall.
Benchmarks are directional, mechanisms are not

The headline numbers, 2M writes/s (LinkedIn, three producers + async), 605 MB/s peak (Confluent OMB, 1 KB msgs, fsync off), are specific tests on specific hardware, not SLAs. A single durable producer at acks=all in that same LinkedIn run did ~422k rec/s (40 MB/s), sync replication roughly halved throughput (empirical). Use benchmarks to bound expectations; use the equations in this chapter, grounded in the replication and zero-copy mechanisms in source, to actually size. When the two disagree, trust the math and the mechanism, then verify on your hardware. Next: turn this capacity into the right partition count (op03) and the tuning that realizes the per-broker ceiling (op05).

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.