II · 04 · Capacity Planning & Sizing
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.
Capacity planning is the act of turning a target workload, a message rate, an average size, a retention window, a fan-out, into a concrete bill of materials: how many brokers, how much disk, how much RAM, how big a NIC. Kafka makes this tractable because the cost model is almost entirely linear and analytic: every byte produced is amplified by replication, multiplied by retention, and copied to consumers in ways you can compute on paper before you provision a single instance. This chapter builds the four resource equations, ingress/egress, disk, memory, and network, from the mechanisms in the source, works a full example end to end, then dimensions the cluster with the broker-count formula and the headroom rules that keep it standing when a broker dies. The recurring theme: the NIC is usually the first ceiling, and replication is why.
The two numbers everything starts from
Every sizing exercise reduces to two measured inputs and a handful of multipliers. The inputs are the peak ingress rate and the average compressed record size:
msg_rate- Records per second at the peak you must serve without throttling, not the daily average. Plan against the 95th-percentile minute, not the mean hour.
avg_size- Average serialized, post-compression bytes per record. Kafka stores, replicates, and serves batches compressed (see Part I · 01), so the on-wire and on-disk size is the compressed size, not the in-app object size.
ingress(B/s)= msg_rate × avg_size. This is the byte rate hitting partition leaders. It is the seed of every other number.
avg_sizeRead it from a representative cluster as BytesInPerSec / MessagesInPerSec, both are real broker meters: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec and name=MessagesInPerSec (storage/.../log/metrics/BrokerTopicMetrics.java:34, :33). Dividing them gives the true compressed average including batch and header overhead, which is what every formula below assumes. A 200-byte JSON object that compresses 4× in a batch costs you ~50 bytes of ingress, not 200.
The amplification chain
A single produced byte does not stay a single byte. It is fanned out three ways, and each path has a multiplier rooted in a concrete mechanism:
| Quantity | Formula | Mechanism / source |
|---|---|---|
| Ingress (write to leaders) | ingress = msg_rate × avg_size | Producer append path (Part I · 16). |
| Replication traffic (inter-broker) | repl = ingress × (RF − 1) | Each follower runs fetcher threads that pull every committed byte from the leader; RF − 1 followers ⇒ RF − 1 copies of ingress traverse the cluster fabric. Part I · 08. |
| Disk written/sec | disk_write = ingress × RF | Leader writes 1 copy; each of RF − 1 followers writes its own copy. Total = RF copies appended per second across the cluster. |
| Egress (read off cluster) | egress = ingress × fanout + repl | fanout = number of consumer groups reading the full stream; replication fetch is itself egress from the leader. Part I · 09. |
RF − 1, not RFThe leader already has the byte, it received it from the producer. Only the followers must fetch it. With the default default.replication.factor = 1 (server/.../config/ReplicationConfigs.java:42, REPLICATION_FACTOR_DEFAULT = 1) there is no replication traffic at all; at the production-standard RF=3 the cluster moves 2× ingress in inter-broker replication alone, dwarfing most consumer fan-out. This single term is why "egress ≥ 2× ingress" is the canonical planning floor (empirical: Confluent / Jun Rao sizing guidance) and why networking dominates the cloud bill (empirical: Confluent cost model, networking ~52–87% of self-managed infra; see op10).
Disk: retention × replication factor
Kafka is a log: it keeps bytes until a retention policy deletes them. Steady-state disk is therefore the byte rate × how long you keep it × how many copies exist:
- Local disk (no tiering)
disk = ingress × retention_seconds × RF- Default retention
- 7 days =
604,800 s. The constant isDEFAULT_RETENTION_MS = 24 * 7 * 60 * 60 * 1000(storage/.../log/LogConfig.java:134), surfaced asretention.ms/log.retention.hours. - Default retention bytes
retention.bytes = -1(unbounded; time-based only), the size cap is off by default, so time governs disk unless you set it.
The RF factor is not optional accounting: a topic with RF=3 physically stores three full copies, one per replica broker, because each follower maintains a complete local log (Part I · 03, Part I · 08). Halving RF halves disk and halves replication network, the most direct dial you have, traded against fault tolerance.
Disk usage is quantized by the active segment: a partition cannot delete data in the segment it is still appending to. With log.segment.bytes default 1 GiB (DEFAULT_SEGMENT_BYTES = 1024*1024*1024, LogConfig.java:131) and segment.ms default 7 days (DEFAULT_SEGMENT_MS, LogConfig.java:132), a low-traffic partition can hold up to ~1 GiB beyond its nominal retention until the active segment rolls. There is also a hard ceiling per segment: a single log file cannot exceed Integer.MAX_VALUE (~2 GiB), the broker throws on open if a segment is larger (FileRecords.java:68-73). Index files add to disk too: log.index.size.max.bytes defaults to 10 MiB per segment (ServerLogConfigs.java:93). Budget total disk at ingress × retention × RF, then add ~10–20% headroom for the active-segment tail, indexes, and the hard-fail margin (never run a log dir past ~85%, a full dir halts the broker; see op07).
How tiered storage rewrites the local term
Tiered storage (Part I · 05, KIP-405) splits retention into a small local hot tier on broker disk and an unbounded remote tier in object storage. The broker keeps only local.retention.ms / local.retention.bytes worth of recent data locally; older segments are copied to S3/GCS/Azure Blob and deleted from disk. It is off by default, remote.log.storage.system.enable = false (DEFAULT_REMOTE_LOG_STORAGE_SYSTEM_ENABLE = false, storage/.../remote/storage/RemoteLogManagerConfig.java:58), and enabled per topic with remote.storage.enable=true. The local-retention defaults -2 mean "derive from the topic's retention.*" until you set an explicit, shorter local window (DEFAULT_LOCAL_RETENTION_MS = -2, LogConfig.java:146).
| Mode | Local disk equation | What changes |
|---|---|---|
| Classic | ingress × retention × RF | Full history × RF on local disk. |
| Tiered | ingress × local.retention × RF (local) + ingress × total_retention (remote, ~1 logical copy in the object store) | Local term collapses from days to hours; the RF multiplier disappears from the remote tier, object storage provides its own internal durability, so you store ~1 logical copy remotely instead of RF. |
Object storage runs ~$0.02/GiB-month vs ~$0.08–0.10/GiB-month for replicated EBS, roughly 4–5× cheaper per byte, and Confluent reports storage cost reductions "over 90%" in favorable cases, ~30–40% typical (empirical: Confluent, Uber, WarpStream; see op10). Tiered storage does NOT reduce cross-AZ networking, replication still happens on the local hot tier. It buys you disk and faster broker recovery (no cold re-replication on rejoin), not network. Set local.retention.ms to cover your consumers' real-time lag plus replication catch-up, long enough that live consumers and rejoining followers read from local disk (cheap, zero-copy) rather than triggering remote fetches.
Memory: size the page cache, keep the heap small
This is the most misunderstood dimension, and the source makes the mechanism unambiguous. Kafka does not cache records in the JVM heap. A fetch returns a slice of the on-disk file: LogSegment.read ends with log.slice(startPosition, fetchSize) (storage/.../log/LogSegment.java:462), producing a FileRecords view, never a heap copy. That slice is then streamed to the consumer's socket with the sendfile zero-copy path: FileRecords.writeTo calls destChannel.transferFrom(channel, position, count) (clients/.../record/internal/FileRecords.java:302), which for a plaintext connection is fileChannel.transferTo(position, count, socketChannel) (clients/.../network/PlaintextTransportLayer.java:213-214), a single kernel syscall that moves bytes from the page cache straight to the NIC, never touching the JVM.
LogSegment.read → FileRecords.slice(), a file view, no copytransferTo reads it directlytransferFrom → fileChannel.transferTo(…, socketChannel)A consumer that is caught up reads bytes that were written seconds ago, still resident in the page cache from the write. The kernel serves it with no disk seek. So the RAM you want is enough page cache to hold the "hot tail" of every active partition, the recent window that live consumers and catching-up followers re-read. The JVM heap, by contrast, holds only broker bookkeeping (request objects, the metadata image, purgatory, index structures) and stays modest: ~6 GB is the typical recommendation (empirical: Confluent, LinkedIn, heap >6 GB is unnecessary and risks long GC pauses). Do not oversize the heap at the page cache's expense: every GiB you hand the JVM is a GiB the kernel cannot use to keep partitions hot, which forces cold disk reads and spikes p99 (the "catch-up tax": historical reads evicting hot data have driven p99 produce from ~2 ms to ~250 ms, empirical: see op07 / Pinterest).
A working memory rule
The page-cache formula is cache_needed = ingress_per_broker × hot_window, where hot_window is the worst-case lag (in seconds) you want served from RAM rather than disk. The inputs below are illustrative planning constants, substitute your measured values.
heap, config, cited recommendation- Fixed, small: 6 GiB.
-Xms6g -Xmx6gwith G1GC is the canonical setting (empirical: Confluent / LinkedIn, heap > 6 GiB is unnecessary and risks long GC pauses); raise toward 8 GiB only above ~10k partitions per broker, where the metadata image and per-partition structures grow. box_ram, hardware assumption- 64 GiB per broker (empirical: Confluent reference hardware, 64 GiB RAM, 32 GiB minimum).
os_reserve, hardware assumption- ≈ 28 GiB left for the OS, agents, and slack on a 64 GiB box (illustrative; measure on your image). What remains is page cache.
ingress_per_broker, workload-dependent (illustrative)- 50 MB/s for this example (substitute your measured per-broker
BytesInPerSec). hot_windowtarget, planning heuristic- ~30 s of write throughput is the common starting heuristic (empirical: Confluent); if your slowest real-time consumer lags by up to
Tseconds, size foringress_per_broker × Tso even your laggards stay in cache.
Derivation, page cache available, then how much hot data it holds:
- page cache ≈
box_ram − heap − os_reserve=64 GiB − 6 GiB − 28 GiB≈ 30 GiB for the cache (the "~28–30 GiB" Confluent reference figure). - resident hot window =
cache ÷ ingress_per_broker=30 GiB ÷ 50 MB/s≈30,720 MiB ÷ 50 MB/s≈ ~600 s ≈ 10 minutes of hot data resident, comfortably above the ~30 s heuristic, so normal consumer lag is served entirely from RAM.
(Mixing GiB and MB here is deliberately conservative: 30 GiB ≈ 32,212 MB, so dividing by 50 MB/s gives ~644 s, we round down to ~600 s / 10 min. The takeaway is order-of-minutes, not the exact second.)
The transferTo sendfile path requires a plaintext socket. When the consumer connection is encrypted, bytes must be copied into the JVM, encrypted, and written normally, the zero-copy fast path is gone, CPU rises, and the page-cache-to-NIC shortcut no longer applies on egress. Budget extra CPU for TLS-heavy read fan-out, and note that historically per-SSL-channel memory has been non-trivial (~122 KB/channel forced one fleet's heap from 4 GB to 8 GB, empirical: Pinterest, see op07). See Part I · 18 and Part I · 06.
Network: the NIC is the first ceiling
Sum what crosses each broker's NIC and you usually find it, not disk or CPU, is the binding constraint. Per broker, in the steady state with balanced leadership:
- Inbound
- producer writes to its leader partitions + replication it pulls as a follower for replicas led elsewhere.
- Outbound
- consumer reads of its leader partitions (× fanout) + replication bytes it serves to followers as a leader.
Aggregated across the cluster, total network movement is approximately:
The replication term appears as both an in and an out copy across the fabric: ingress × (RF−1) leaves leaders and ingress × (RF−1) arrives at followers. Add producer ingress and ingress × fanout of consumer reads. So the cluster NIC amplification factor is 1 (produce) + 2(RF−1) (replication in+out) + fanout (consume). Assume RF=3 and fanout=1 (one consumer group), both illustrative, and it works out to 1 + 2×(3−1) + 1 = 1 + 4 + 1 = 6×… but most of that replication cost is symmetric in+out, so per byte produced the commonly quoted floor is 1× produce + 2× replication + 1× consume ≈ 4× ingress of distinct NIC traffic (the in and out copies of replication are the same 2×). A "100 MB/s" application workload is therefore really a 100 MB/s × 4 = ~400 MB/s networking problem. NIC line-rate conversion (assumption, hardware): a 10 GbE link is 10 Gbit/s ÷ 8 bit/byte = 1.25 GB/s, so the NIC is a hard per-broker wall at ~1,250 MB/s long before disk fills.
The broker exposes the replication slice of this directly: ReplicationBytesInPerSec and ReplicationBytesOutPerSec (BrokerTopicMetrics.java:37-38), watch these alongside BytesInPerSec/BytesOutPerSec to see how much of your NIC is consumed by replication you cannot avoid versus consumer traffic you can sometimes redirect.
In a cloud 3-AZ deployment the network bytes above are the same bytes, but the cross-zone fraction is metered: ~⅔ of produce, ~⅔ of consumer fetch, and all inter-broker replication crosses AZ boundaries, charged ~$0.02/GiB effective on AWS. The ingress × (RF−1) replication term is the dominant cross-AZ cost. KIP-392 fetch-from-follower lets consumers read a same-AZ replica to kill consumer-side cross-AZ, but the produce+replication floor remains. Full treatment in op10 and Part I · 19.
CPU: compression, TLS, request handling
CPU is rarely the first ceiling for plaintext, lz4-compressed workloads, but three loads scale with traffic and can become binding:
| CPU load | Scales with | Dial / default |
|---|---|---|
| Compression / decompression | ingress & egress bytes; producer compresses, broker may recompress, consumers decompress | compression.type default none (ProducerConfig.java:409, CompressionType.NONE). lz4/snappy fastest, zstd best ratio. Compression is per-batch, tiny batches compress poorly. (empirical: Cloudflare zstd-6 = 4.5× ratio; lz4 = 1.81×) |
| TLS encrypt/decrypt | every encrypted byte, both directions; defeats zero-copy egress | Plaintext by default. Allocate headroom for TLS-heavy fan-out. |
| Request handling | request rate (not just bytes), many tiny fetches cost CPU per request | num.io.threads default 8 (ServerConfigs.java:51); num.network.threads default 3 (SocketServerConfigs.java:152); num.replica.fetchers default 1 (ReplicationConfigs.java:96); queued.max.requests default 500 (SocketServerConfigs.java:144). |
Kafka does not batch reads on the broker by default, fetch.min.bytes is 1 (ConsumerConfig.java:187, DEFAULT_FETCH_MIN_BYTES = 1), tuned for latency, not cost. A swarm of consumers polling tiny fetches generates request-handling CPU out of proportion to bytes moved. The documented fix is raising fetch.min.bytes on low-throughput-topic consumers: a single such change cut whole-cluster broker CPU 15% at New Relic (empirical; see op05/op08). Watch RequestHandlerAvgIdlePercent (keep >0.3) and NetworkProcessorAvgIdlePercent (keep >0.4), saturation there means add I/O or network threads, or brokers (see op08). Plan throughput per partition at a conservative ~10 MB/s, even though a single partition can sustain "tens of MB/s" (empirical: Jun Rao).
Broker count: the max of four constraints
The number of brokers is not a single calculation, it is the maximum of independent lower bounds, because the cluster must satisfy all of them at once:
brokers = max( target_throughput / per_broker_ceiling , total_partitions / per_broker_partition_limit , RF , 3 )
| Bound | Why it exists | Type |
|---|---|---|
throughput / per_broker_ceiling | The throughput bound. per_broker_ceiling is the emergent NIC/disk/CPU limit of one broker, typically the NIC, computed as in §Network. Use a burst-headroom factor of 0.7 (planning heuristic, leave ~30% of NIC line rate free for spikes; see §Headroom) so per_broker_ceiling = NIC_line_rate × 0.7. | Emergent (resource). |
total_partitions / per_broker_partition_limit | The partition bound. Each partition costs file descriptors, two mmap'd index regions, fetcher work, and metadata. total_partitions = Σ (partitions × RF) across all topics, replicas count. | Emergent, with a hard floor (below). |
RF | You cannot place RF copies of a partition on fewer than RF brokers, each replica needs a distinct broker. | Hard (placement invariant). |
3 | Availability floor: RF=3 plus tolerating one broker down for maintenance while keeping a majority needs at least 3, and the KRaft controller quorum wants an odd ≥3 (Part I · 10). | Operational floor. |
The per-broker partition limit, precisely
"Partitions are free under KRaft" is wrong. KRaft (Part I · 11, KIP-500) removed the ZooKeeper-era ~4,000/broker and ~200,000/cluster ceilings, those were driven by O(partitions) controller failover and metadata reload, and are version-specific (empirical: Confluent 200K post, Kafka 1.1.0+; ZK removed in 4.0). But concrete per-broker limits remain:
| Limit | Value / source | Kind |
|---|---|---|
Linux vm.max_map_count | Default 65,530 mmap regions; Kafka uses ~2 mmap regions/partition (offset + time index) ⇒ 65,530 regions ÷ 2 regions/partition ≈ 32,765 partitions/broker before mmap exhaustion. Raise to ≥262,144 (or 1,000,000+) for dense brokers. (empirical: Instaclustr KRaft Part 3) | Hard (OS constant). |
| File descriptors | open files ≈ Σ(partition_size / segment_size) + connections; set nofile ≥ 100,000. Production brokers run >30k handles. (empirical: Jun Rao / Confluent) | Emergent + OS limit. |
| Replication latency | "Replicating 1000 partitions broker-to-broker adds ~20 ms latency" ⇒ per-broker cap ≈ 100 × brokers × RF to bound it. (empirical: Jun Rao, 2015, teach the mechanism, not the constant) | Emergent. |
| Throughput peaks early | Producer throughput peaked at ~100 partitions and degraded past ~1,000 in one test; scalability ≠ performance. (empirical: Instaclustr KRaft Part 1) | Emergent (anti-pattern). |
A safe planning figure for a well-resourced broker is on the order of 1,000–4,000 partitions (replicas included) for latency-sensitive workloads, with the hard mmap wall far above that. Pick per_broker_partition_limit from your latency tolerance and FD/mmap budget, not from the KRaft headline. Partition count itself is sized in op03; here it is an input to broker count.
A worked example, end to end
Take a realistic mid-size event pipeline and run every equation. Every number below is either a labeled assumption or is derived from earlier ones, nothing appears mid-calculation unintroduced. The workload figures are illustrative for this example; substitute your own measured values.
| Assumption | Value (with units) | Why / source | Kind |
|---|---|---|---|
msg_rate | 1,000,000 records/s (peak) | The peak minute this pipeline must serve without throttling. | Workload (illustrative) |
avg_size | 250 B/record (post-compression) | Measured as BytesInPerSec / MessagesInPerSec; see §"Measure, do not guess". | Workload (illustrative) |
RF | 3 (copies) | Production durability triad; the server default is 1 (ReplicationConfigs.java:42) so you must raise it. | Config (production standard) |
min.insync.replicas, acks | 2, all | A committed write needs 2 in-sync copies, so one failure is survivable. Fixes RF=3 as the multiplier everywhere below. | Config (production standard) |
retention | 7 days = 604,800 s | Kafka default DEFAULT_RETENTION_MS = 24*7*60*60*1000 (LogConfig.java:134); 7 d × 86,400 s/d = 604,800 s. | Cited Kafka default |
fanout | 2 (consumer groups reading the full stream) | Two independent consumer groups each re-read every byte. | Workload (illustrative) |
partitions | 30 (on the main topic) | Sized in op03; here it is an input. Replica count = partitions × RF. | Workload (from op03) |
disk_overhead | ~15% | Active-segment tail + indexes + hard-fail margin (see §"Provision the segment"); illustrative. | Planning heuristic |
Derivation, each row applies a formula from the tables above to these assumptions, carrying units so they cancel to the result's unit:
| Step | Formula & substitution (units cancel) | Result |
|---|---|---|
| Ingress | ingress = msg_rate × avg_size = 1,000,000 records/s × 250 B/record = 250,000,000 B/s; ÷ 1,000,000 B/MB | 250 MB/s |
| Replication traffic | repl = ingress × (RF−1) = 250 MB/s × (3−1) = 250 MB/s × 2 | 500 MB/s inter-broker |
| Egress (off cluster) | egress = ingress × fanout + repl = 250 MB/s × 2 + 500 MB/s = 500 + 500 MB/s | 1,000 MB/s total read |
| Disk written/sec | disk_write = ingress × RF = 250 MB/s × 3 | 750 MB/s across cluster |
| Disk total (raw) | disk = ingress × retention × RF = 250 MB/s × 604,800 s × 3 = 453,600,000 MB; ÷ 1,000,000 MB/TB | ≈ 453.6 TB raw |
| Disk total (provisioned) | 453.6 TB × (1 + disk_overhead) = 453.6 TB × 1.15 | ≈ 522 TB usable plan |
| Cluster NIC traffic | sum of every byte crossing a NIC = produce (250) + repl-in (500) + repl-out (500) + consume (500) MB/s, i.e. ingress × (1 + 2(RF−1) + fanout) = 250 MB/s × (1 + 2×2 + 2) = 250 MB/s × 7 | ≈ 1,750 MB/s aggregate (= 7× ingress; the in+out copies of replication are why it exceeds the "4×" distinct-bytes floor) |
The 7× here and the ~4× floor in §Network are the same model at different fanout: the floor counts distinct bytes at fanout=1 (1+2+1), while this aggregate counts every NIC crossing, replication in and out, at fanout=2, giving 1 + 2(RF−1) + fanout = 7.
Now dimension the cluster
Two more assumptions feed the broker-count formula, both illustrative, substitute your hardware:
NIC_line_rate, hardware- 10 GbE ⇒
10 Gbit/s ÷ 8 bit/byte = 1.25 GB/s = 1,250 MB/sper broker. per_broker_ceiling, derived- =
NIC_line_rate × 0.7(burst-headroom factor, §Headroom) =1,250 MB/s × 0.7= 875 MB/s. per_broker_partition_limit, planning target- 2,000 replicas (well below the ~32,765 mmap wall; chosen for latency, see §"per-broker partition limit").
Plug the example's derived 1,750 MB/s cluster NIC traffic and 30 partitions × RF 3 = 90 replicas into the four bounds:
The raw max is 3, but spread the load over 3 brokers and each carries 1,750 MB/s ÷ 3 ≈ 583 MB/s of NIC and 453.6 TB ÷ 3 ≈ 151 TB of disk, and if one dies, the remaining two must absorb its load. Size for N−1: the survivors must serve the full 1,750 MB/s with one gone, so at the 875 MB/s/broker ceiling you need 1,750 MB/s ÷ 875 MB/s = 2 survivors at full tilt, making 3 (only 2 survivors) marginal even on throughput. Disk dominates. Assume disk_per_broker ≈ 15 TB usable (hardware assumption, fast NVMe; illustrative): 453.6 TB ÷ 15 TB/broker ≈ 30 brokers. On storage-dense instances assume disk_per_broker ≈ 48 TB usable: 453.6 TB ÷ 48 TB/broker ≈ 9.5 ⇒ ~10 brokers (round up for headroom ⇒ 10–12). The lesson: compute all four bounds, then let the dominant physical resource (here disk, then NIC) set the count, and add N−1 + burst headroom on top. More, smaller brokers also shrink blast radius and recovery time.
Headroom: failure and burst
A cluster sized to exactly its steady-state load has zero margin and will brown out the first time anything moves. Two headroom budgets are non-negotiable:
| Headroom | Rule | Why |
|---|---|---|
| Failure (N−1) | The cluster must serve full load with one broker down. Effective ceiling per broker = per_broker_ceiling × (N−1)/N. For RF=3 across 3 AZs you also want to survive a full AZ. | A dead broker's leadership migrates to survivors (Part I · 08); its replicas must be re-fetched, adding transient replication load. Provision so survivors are ≤ ~70% utilized post-failover. |
| Burst | Plan against peak minute, then leave ~20–30% above it. Producers buffer and retry, but the broker must drain it. | Traffic is bursty; the producer's buffer.memory default 32 MiB (ProducerConfig.java:401) absorbs short spikes, but sustained over-ceiling backs up into client-side blocking and growing lag. |
| Recovery / re-replication | Reserve NIC and disk-write headroom for catch-up after a broker rejoins. | A rejoining broker re-replicates everything it missed at num.replica.fetchers × per-fetcher rate; this competes with live traffic. Tiered storage shrinks this (no cold re-fetch). Throttle with replication quotas (Part I · 19). |
The production durability triad, RF=3, min.insync.replicas=2, acks=all, is what makes the RF multiplier 3 in every disk and replication equation above. The server defaults are weaker than production needs: default.replication.factor=1 (ReplicationConfigs.java:42) and min.insync.replicas=1 (ServerLogConfigs.java:155, MIN_IN_SYNC_REPLICAS_DEFAULT = 1), while the producer defaults acks=all (ProducerConfig.java:405) and enable.idempotence=true (ProducerConfig.java:543) are already production-grade. You must raise RF and isr at topic creation, and the moment you do, you have tripled disk and doubled replication network. Size for the durability you will actually run, from day one. The full durability mechanism is op06; the rationale (min.insync.replicas=2 means a committed write needs 2 in-sync copies, so a single failure is survivable without data loss) lives in Part I · 08.
Per-broker network math (the binding worksheet)
Because the NIC is usually the wall, here is the per-broker inbound/outbound worksheet for a balanced RF=3 cluster of B brokers, ingress I, fanout F:
- Per-broker ingress (leader writes)
- ≈
I / B(its share of producer traffic). - Per-broker replication IN (as follower)
- ≈
I / B × (RF−1), it follows replicas led elsewhere. - Per-broker replication OUT (as leader)
- ≈
I / B × (RF−1), followers pull from it. - Per-broker consumer OUT
- ≈
I / B × F. - Total per-broker NIC
- ≈
(I/B) × (1 + 2(RF−1) + F), the sum of the four rows above. Substituting RF=3, F=2:1 + 2×(3−1) + 2 = 1 + 4 + 2 = 7, so total ≈(I/B) × 7(the same 7× the worked example derived for aggregate NIC, here divided acrossBbrokers).
Set B ≥ (I × (1 + 2(RF−1) + F)) / (NIC_line_rate × 0.7 × (N−1)/N). The 0.7 is burst headroom; the (N−1)/N is failure headroom. Plug RF=3, F=2, and you will see the NIC bound dominate for any read-heavy or high-RF workload, confirming the chapter's thesis that networking is the first ceiling and replication is why.
Checklist: from workload to cluster
- Measure
avg_sizeasBytesInPerSec / MessagesInPerSec; record peakmsg_rate. - Ingress =
msg_rate × avg_size. Everything derives from this. - Replication =
ingress × (RF−1); egress =ingress × fanout + replication; disk/s =ingress × RF. - Disk total =
ingress × retention_seconds × RF(× ~1.15 overhead); collapse the local term if tiering. - Memory: heap fixed ~6 GB; RAM ≥
ingress_per_broker × hot_window_secondsfor page cache. - Network: per-broker NIC ≈
(I/B) × (1 + 2(RF−1) + F); this is usually the binding bound. - Broker count =
max(throughput/ceiling, partitions/limit, RF, 3), then add N−1 failure and ~30% burst headroom; let the dominant physical resource (disk or NIC) set the final number. - Validate against the empirical envelope: ~10 MB/s/partition planning rate; ~605 MB/s peak per small cluster at RF=3
acks=all(empirical: Confluent benchmark, illustrative, not a guarantee); ~32k partitions/broker mmap wall.
The headline numbers, 2M writes/s (LinkedIn, three producers + async), 605 MB/s peak (Confluent OMB, 1 KB msgs, fsync off), are specific tests on specific hardware, not SLAs. A single durable producer at acks=all in that same LinkedIn run did ~422k rec/s (40 MB/s), sync replication roughly halved throughput (empirical). Use benchmarks to bound expectations; use the equations in this chapter, grounded in the replication and zero-copy mechanisms in source, to actually size. When the two disagree, trust the math and the mechanism, then verify on your hardware. Next: turn this capacity into the right partition count (op03) and the tuning that realizes the per-broker ceiling (op05).