II · 11 · Scaling: 1M → 10M → 100M events/sec
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.
Kafka does not scale by one dial; it scales by which constraint binds first, and that constraint changes by an order of magnitude at each step. This chapter works three concrete tiers, 1M, 10M, and 100M events/second at a 1 KiB average message (≈1, 10, and 100 GB/s of uncompressed ingress), and for each gives the throughput math, the partition count, the broker count, the replication factor, the bottleneck that emerges, and exactly what to monitor. The thesis in one sentence: at 1M/s the binding constraint is your own consumers and you barely feel the cluster; at 10M/s the constraint moves into the network, replication (×RF) and cross-AZ transfer dominate cost, and the page-cache working set starts to exceed RAM; at 100M/s you hit structural limits, the per-broker partition ceiling, the controller's metadata throughput, raw NIC bandwidth, and a cross-AZ bill that makes fetch-from-follower and tiered storage mandatory rather than optional. For each tier we name precisely where Kafka's limits bind, the architectural reason, and whether the limit is tunable (a config you can turn), emergent (a resource ceiling you must buy past), or structural (a compiled-in property you must architect around). The anchor for the top tier is real: LinkedIn ran ~7 trillion msgs/day ≈ ~80M/s across 100+ clusters and 4,000+ brokers, and then, at 32 trillion/day, outgrew Kafka and built a replacement (LinkedIn 2019 / 2025, InfoQ). 100M/s in a single cluster is past where the largest operators chose to federate.
Every byte produced is written RF times across the cluster and read once per consumer group. The leader of a partition carries, per byte produced: 1× ingress + (RF−1)× replication egress + (groups)× consumer egress. With RF=3 and 2 consumer groups, one produced byte becomes 2 bytes of cross-broker replication and ~3 bytes of read. The scaling story is the story of which term in that sum saturates a resource first, disk, then NIC, then partition metadata. See II · 04 Capacity Planning for the full derivation and I · 08 Replication for the replication mechanism.
The taxonomy of limits, and how to read this chapter
Before the tiers, fix the vocabulary used throughout Part II. A limit is one of three kinds, and the operational response differs for each:
| Kind | What it is | How you respond | Example in this chapter |
|---|---|---|---|
| HARD | A compiled-in constant or protocol invariant. You cannot exceed it without changing source. | Architect around it; it will not move. | One leader per partition; one consumer per partition per group; Linux vm.max_map_count bound on mmapped indexes (AbstractIndex.java:72). |
| SOFT | A configuration default. Tunable, with a known range and a tradeoff. | Turn the dial; know what you trade. | num.replica.fetchers=1 (ReplicationConfigs.java:96); replica.lag.time.max.ms=30000 (:55). |
| EMERGENT | A consequence of finite resources or architecture, RAM, NIC, disk IOPS, controller serialization. | Buy past it (more/bigger brokers) or change topology. | Page-cache working set vs RAM; NIC bandwidth vs ×RF replication; controller event-queue serialization. |
The scaling tiers below are not "Kafka gets slower", Kafka's per-broker throughput is roughly flat across them. What changes is how many brokers you need and which limit you bump into when you add them. We size with two empirical anchors from the reference, both marked as such:
- Per-broker write ceiling: a well-tuned modern broker on storage-optimized cloud hardware (e.g.
i3en/im4gnclass) sustains on the order of ~100–300 MB/s of ingress before replication and read amplification saturate its NIC or disk, derived from Confluent's OpenMessaging benchmark of ~605 MB/s aggregate peak peri3en.2xlargeat RF=3 (Confluent, 2020, empirical, fsync off, 1 KB msgs). We plan conservatively at ~250 MB/s ingress/broker. - Per-partition rate: "tens of MB/s" on a single partition; conservative planning number ~10 MB/s (Jun Rao, Confluent 2015, empirical). Throughput does not rise monotonically with partitions: peak producer rate was measured at ~100 partitions and latency rose sharply past ~1,000 (Instaclustr, Kafka 3.1.1, empirical). Over-partitioning to chase throughput is an anti-pattern.
Tier 1, 1M events/sec (≈1 GB/s): the consumer-bound regime
Assumptions and the math
Every number in this tier is derived from the labeled constants below. Each row gives the value with units, a one-line why/source, and its kind, fixed (the chapter holds it constant across all three tiers), config (a Kafka setting), hardware (what you buy), workload (you must measure your own, the example value is illustrative), or heuristic (a cited planning rule). Nothing below appears in a calculation before it is introduced here.
| Symbol | Assumption | Value (with units) | Why / source | Kind |
|---|---|---|---|---|
s | Average message size | 1 KiB | The chapter's fixed message size, held across all three tiers so the tiers differ only by rate. | fixed |
T | Target ingress (uncompressed) | 1,000,000 msg/s × 1 KiB/msg = 1,024 MB/s ≈ 1 GB/s | Derived: rate × s. (1,000,000 × 1 KiB = 1,000,000 KiB = 976.6 MiB ≈ 1,024 MB at 106 B/MB; we round to 1,024 MB/s ≈ 1 GB/s.) | derived |
RF | Replication factor | 3 (across 3 AZs) | Durability baseline, one byte survives the loss of any 2 replicas / 1 AZ (see II · 06 Durability). | config |
G | Consumer-group fan-out | 2 groups | A realistic "stream processor + archiver" read fan-out; each produced byte is read once per group. | workload |
Tp | Per-partition produce rate | ~10 MB/s per partition | Conservative single-partition planning rate ("tens of MB/s on one partition"), Jun Rao, Confluent 2015; empirical heuristic (this guide's reference §2). Substitute your measured value. | heuristic |
Tc | Per-consumer drain rate | ~8 MB/s per consumer | The slowest consumer path (sink write + processing), not a Kafka limit. Illustrative for this worked example, measure yours; it is usually below Tp because downstream sinks are slower than a sequential log append. | workload |
NIC | Per-broker network card | 10 GbE = 10 Gbit/s ÷ 8 bit/B = 1,250 MB/s | Hardware line rate of a 10 GbE broker; full-duplex, so in and out each get 1,250 MB/s. | hardware |
amp | NIC amplification on the worst (leader) broker | per produced byte: 1 in (produce) + (RF−1)=2 out (replicate to 2 followers) + G=2 out (2 consumer groups) = ≈5× (dimensionless) | Derived from the "one model" callout: 1 + (RF−1) + G = 1 + 2 + 2 = 5. It is a ratio, so it has no unit; it scales a MB/s figure into the MB/s the NIC must carry. | derived |
ceiling | Usable ingress per broker | NIC ÷ amp = 1,250 MB/s ÷ 5 = ≈250 MB/s | Derived: the NIC carries amp bytes for every ingress byte, so usable ingress = line rate ÷ amplification. Matches the conservative ~250 MB/s planning anchor and sits well under the ~605 MB/s aggregate i3en.2xlarge benchmark, Confluent 2020; empirical. | derived |
RAM | Box memory | 64 GiB | Confluent's recommended broker RAM (32 GiB minimum), this guide's reference §1/§3. | hardware |
heap | JVM heap | 6 GiB | Kafka does not benefit from a heap >6 GiB; the rest of RAM must stay free for the OS page cache, this guide's reference §3. | config |
A NIC's send and receive lanes are independent (each ~1,250 MB/s), so the strict bottleneck is the busier lane, egress, which carries 4 bytes per ingress byte (2 replicate + 2 consume) → 1,250 MB/s ÷ 4 ≈ ≈312 MB/s. We plan with the more conservative single-shared-budget figure (line rate ÷ amp = 1,250 ÷ 5 ≈ 250 MB/s) on purpose: it leaves headroom for bursts, TCP/protocol overhead, and the inbound replication a broker also carries as a follower for other leaders' partitions. Either way the method is identical, line rate ÷ per-ingress-byte amplification.
Two cluster-wide rates that follow directly. Disk write per cluster = T × RF = 1,024 MB/s × 3 = ≈3 GB/s of total log writes (every produced byte is stored RF times). Page cache per broker = RAM − heap = 64 GiB − 6 GiB = ≈58 GiB, of which the OS and other processes take a few GiB, leaving ~50 GiB for log pages.
Partitions. The rule is N ≥ max(T ÷ Tp, T ÷ Tc), partitions must be enough for both the producer side and the consumer side, so take the larger.
- Producer side:
T ÷ Tp= 1,024 MB/s ÷ 10 MB/s per partition ≈ 103 partitions (the MB/s cancel, leaving partitions). - Consumer side:
T ÷ Tc= 1,024 MB/s ÷ 8 MB/s per consumer ≈ 128 partitions, because one consumer reads one partition per group, you need at least this many partitions for the slowest group to keep up at one consumer each.
The consumer side dominates (128 > 103) because the assumed per-consumer drain (8 MB/s) is below the per-partition produce rate (10 MB/s), so consumer parallelism, not produce throughput or disk, sets the count. Take the max (128) and round up for headroom and future consumer parallelism: ~150–200 partitions for the hot topic. This is comfortably inside every limit, orders of magnitude below the per-broker partition ceiling discussed at Tier 3.
Brokers. The rule is B = max(T ÷ ceiling, N ÷ cap, 3), the largest of the NIC/ingress demand, the partition-density demand, and the availability floor of 3.
- NIC/ingress demand:
T ÷ ceiling= 1,024 MB/s ÷ 250 MB/s per broker ≈ 4.1 → 5 brokers (round up to a whole broker). Recallceiling=NIC ÷ amp= 1,250 ÷ 5 ≈ 250 MB/s, so this already accounts for replication and read egress on the NIC. - Partition-density demand:
N ÷ cap= 200 partitions ÷ (a few thousand/broker) ≪ 1, negligible here. - Availability floor: 3.
The max is 5, but you must size for N−1: one broker can fail and its leadership and replication load must redistribute onto the survivors without exceeding their ceiling. Add one → 6 brokers, RF=3. Leaders are spread evenly, so leaders per broker = partitions ÷ brokers = 180 ÷ 6 = 30 (using the midpoint of the 150–200 range as the planning figure). Page cache (~50 GiB/broker, from above) holds many seconds of the write rate, so consumers reading the tail are served from RAM via zero-copy sendfile (see I · 09 Fetch Path) and the disk does almost pure sequential writes.
At 1 GB/s, the working set a consumer needs (the last few seconds of each partition) fits in page cache. Kafka's read path never touches disk for tail consumers: the broker sendfile()s directly from the page cache to the socket. Disk sees only the sequential append. The mechanism, log segments are memory-mapped and the OS keeps recently-written pages resident, is described in I · 03 Storage Log Engine. The corollary: the cluster is not your bottleneck; your consumers are. A slow downstream sink, a GC-pausing stream app, or a single under-provisioned consumer will build lag long before a broker breaks a sweat.
Where the limit binds, and what to watch
At this tier nothing in Kafka is the constraint. The binding limits are (a) consumer drain rate (EMERGENT, your sink/processing throughput) and (b) the HARD ceiling of one consumer per partition per group: you cannot add a 201st consumer to a 200-partition topic and gain parallelism. The fix for (b) is provisioned up front by choosing partition count for peak fan-out, because you cannot decrease partitions and increasing them on a keyed topic remaps hash(key) % N and breaks per-key ordering (see II · 03 Partitioning).
| Watch | Metric (verified in source / standard JMX) | Threshold & why |
|---|---|---|
| Consumer lag trend | kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lag-max | Alert on a growing trend over 5+ min, not a static count (LinkedIn Burrow model, empirical). recovery_time ≈ current_lag / net_drain_rate. |
| Under-replicated partitions | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | >0 sustained = a broker down or a slow follower. At 1M/s this should be 0 except during deploys. |
| Request-handler idle | kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | Keep >0.3 (60–80% healthy). At this tier it should sit high; a dip means a hot partition or GC. |
| Page-cache health (OS) | OS read I/O on the data dirs (e.g. iostat read MB/s) | Near-zero read I/O confirms tail consumers are served from cache. A rising read rate is your early warning that the working set is outgrowing RAM (the Tier-2 problem). |
Tier 2, 10M events/sec (≈10 GB/s): the network-bound regime
Assumptions and the math
Same constants as Tier 1 (table above), only the rate changes: s=1 KiB, RF=3, G=2, Tp≈10 MB/s, Tc≈8 MB/s, amp≈5×, ceiling=NIC÷amp≈250 MB/s, RAM=64 GiB, heap=6 GiB. New target:
| Symbol | Assumption | Value (with units) | Why / source | Kind |
|---|---|---|---|---|
T | Target ingress | 10,000,000 msg/s × 1 KiB/msg = 10,240 MB/s ≈ 10 GB/s | Derived: rate × s, exactly 10× Tier 1. | derived |
NIC | Per-broker network card | 50 GbE ÷ 8 = 6,250 MB/s (up from 10 GbE at Tier 1) | At 10× the rate, a 10 GbE NIC's ~250 MB/s usable ingress would need too many brokers; a 50 GbE card raises the per-broker ceiling, see below. | hardware |
The arithmetic that was slack at Tier 1 now goes taut on two resources simultaneously, the NIC and RAM. Each line below shows its units cancelling to the result.
- Total disk writes =
T × RF= 10 GB/s × 3 = 30 GB/s across the cluster. - Total cross-broker replication egress =
T × (RF−1)= 10 GB/s × 2 = 20 GB/s of pure replication traffic, on top of produce and consume. - Aggregate NIC demand (cluster-wide). Counting each byte once as it leaves a broker to avoid double-counting (the 20 GB/s that exits the leaders is the same 20 GB/s that arrives at the followers): produce-in
T= 10 + replication-outT×(RF−1)= 20 + consumer-outT×G= 10×2 = 20, giving ≈50 GB/s egress + 10 GB/s ingress ≈ 60 GB/s of one-way NIC traffic (≈480 Gbit/s) the fleet must move. (Equivalently, total ingress equals total egress on a closed cluster, so ~50 GB/s in each direction is the planning number.) - Brokers =
T ÷ ceiling. With a 50 GbE NIC,ceiling=NIC ÷ amp= 6,250 MB/s ÷ 5 = 1,250 MB/s would over-provision per broker; we keep the conservative ~250 MB/s ingress/broker planning anchor (disk and read-amplification headroom, not just NIC, bind a real broker). So 10,240 MB/s ÷ 250 MB/s per broker ≈ 41 brokers; size for N−1 and burst headroom → ~45–50 brokers, RF=3. - Partitions =
max(T÷Tp, T÷Tc). Producer side: 10,240 MB/s ÷ 10 MB/s/partition ≈ 1,024; consumer side: 10,240 MB/s ÷ 8 MB/s/consumer ≈ 1,280 (still the binding side, 8 < 10). With consumer-group fan-out and an even per-broker spread (keep each broker well under its partition cap), plan ~2,000–4,000 partitions on the hot topic (low thousands).
Per-broker, that is ~50 brokers each ingesting ~250 MB/s (≈2 Gbit/s) but carrying about amp× that on the NIC, ~250 MB/s × 5 ≈ 1,250 MB/s ≈ 10 Gbit/s of total NIC traffic per broker once replication and reads are added. A 25 GbE NIC leaves thin headroom at burst; 50–100 GbE is the comfortable choice. This is the central fact of Tier 2: the network, specifically the ×(RF−1) replication multiplier, becomes the dominant NIC consumer, and it is not optional traffic.
min.insync.replicas (see I · 08).Bottleneck 1, replication network and the fetcher threads
Followers pull from leaders using a fixed pool of fetcher threads. The number is num.replica.fetchers, default 1 (server/src/main/java/org/apache/kafka/server/config/ReplicationConfigs.java:96), and the doc text is explicit about the bound: "The total number of fetchers on each broker is bound by num.replica.fetchers multiplied by the number of brokers in the cluster" (:97-98). With the default of 1 and ~50 brokers, each broker has up to ~49 fetcher threads, but all partitions replicated from a single peer share one thread. At 10 GB/s a single fetcher thread per source broker becomes a serialization point: if one broker leads many hot partitions, its followers' single fetcher to it cannot keep ISR caught up, ISR shrinks, and acks=all producers stall.
Raise num.replica.fetchers from 1 to 4–8 (SOFT limit; ReplicationConfigs.java:96). This parallelizes replication I/O between any pair of brokers at the cost of more CPU and memory (the doc names exactly this tradeoff, :99-100). Pair it with larger replication batches: replica.fetch.max.bytes default is 1 MiB per partition (:68) and replica.fetch.response.max.bytes is 10 MiB total (:88); raise the socket buffer replica.socket.receive.buffer.bytes above its 64 KiB default (:64) on high-bandwidth-delay links so the TCP window does not throttle a 50 GbE replication flow.
The companion risk is ISR thrash. A follower is dropped from the ISR if it has not reached the leader's log-end offset within replica.lag.time.max.ms, default 30000 ms (ReplicationConfigs.java:55): "the leader will remove the follower from ISR" (:56-57). At 10 GB/s a momentary fetcher stall, a GC pause, or a disk-flush spike can push a follower past 30 s of lag, shrink the ISR, and, if it drops below min.insync.replicas, reject produce with NotEnoughReplicas. The companion knob replica.fetch.wait.max.ms (default 500 ms, :75) must stay below replica.lag.time.max.ms to avoid spurious shrinks. Watch the shrink/expand rate closely (Tier-2 table).
Bottleneck 2, page-cache working set exceeds RAM (the catch-up tax)
At Tier 1 the read working set fit in page cache. At 10 GB/s, each broker ingests T ÷ brokers = 10,240 MB/s ÷ ~50 ≈ ~205 MB/s (round ~200 MB/s); with the same ~50 GiB of page cache per broker (RAM − heap less OS overhead, from the Tier-1 table), the cache holds only page cache ÷ per-broker write rate = 50 GiB ÷ 200 MB/s ≈ 53,687 MB ÷ 200 MB/s ≈ ~250 seconds of writes, fine for tail consumers, fatal for a consumer that falls minutes behind. A consumer reading historical data forces the broker to read cold segments from disk, which (a) competes with the sequential write stream for disk bandwidth and (b) evicts the hot tail pages, so even healthy tail consumers now miss cache. This is the documented "catch-up tax": historical reads spike p99 produce latency from ~2 ms to ~250 ms and drive disk I/O to 100% (azguards / KIP-405 testing, empirical). The KIP-405 measurements showed historical consumers causing a ~43% producer-throughput drop without tiered storage (empirical).
Tiered storage (KIP-405, GA in Kafka 3.6) moves closed segments to object storage and keeps only a small local retention window on broker disk. It does not change the network math, but it caps the page-cache working set: cold reads go to object storage via the remote read path instead of evicting hot pages from RAM. It is disabled by default, remote.log.storage.system.enable=false (storage/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteLogManagerConfig.java:58), and you set log.local.retention.ms / log.local.retention.bytes (:159, :165) to size the local window to a few hours. At 10M/s it is a strong recommendation; at 100M/s (Tier 3) it is effectively mandatory.
Bottleneck 3, cross-AZ cost appears
With brokers across 3 AZs, the cloud bill changes character. Two more labeled assumptions enter here: xfer$ = ~$0.02/GB effective cross-AZ rate (AWS bills $0.01/GB in each direction → ~$0.02/GB round-trip; this guide's reference §6; cited cloud-pricing figure, GCP differs, Azure historically free), and the formula itself: the cross-AZ throughput (Confluent, empirical) is cross-AZ = (ingress × ⅔) + (egress × ⅔) + (ingress × (RF−1)); the ingress × (RF−1) term, mandatory inter-broker replication, dominates and, unlike the rest, cannot be avoided with co-location because replicas must sit in different AZs for the RF=3-across-3-AZs durability you bought. At 10 GB/s ingress, the replication term alone is T × (RF−1) = 10 GB/s × 2 = ~20 GB/s crossing zone boundaries. Converting to a monthly bill: 20 GB/s × 2,592,000 s/month (= 60×60×24×30) = 51,840,000 GB/month × xfer$ ~$0.02/GB ≈ ~$1.0M/month in replication transfer alone, and that is before produce and consumer cross-AZ. This is the moment cross-AZ transfer becomes a visible line item, Confluent estimates networking is "likely over 50%" of a Kafka bill (empirical); at Tier 3 (10× the bytes) it dominates entirely. The first lever is free, compression (lz4/zstd) shrinks replication, storage, and fetch bytes simultaneously because Kafka replicates and stores batches compressed (see II · 10 Cost and II · 05 Performance Tuning).
Segment and flush tuning at this tier
With ~2,000–4,000 partitions across ~50 brokers, segment roll frequency matters. The default log.segment.bytes is 1 GiB (storage/src/main/java/org/apache/kafka/storage/internals/log/LogConfig.java:131, minimum 1 MiB at :160). At 200 MB/s/broker spread over hundreds of partitions, each partition rolls a 1 GiB segment only every few minutes, acceptable. Do not shrink segments to chase anything here: smaller segments multiply open files and mmapped indexes (the Tier-3 ceiling). Keep num.io.threads (default 8, server-common/src/main/java/org/apache/kafka/server/config/ServerConfigs.java:51) at roughly 8 per data disk, and ensure num.recovery.threads.per.data.dir (default 2, ServerLogConfigs.java:147) is raised so a broker restart re-reads its now-large log set quickly. The broker's request-admission queue queued.max.requests (default 500, SocketServerConfigs.java:144) and per-request cap socket.request.max.bytes (default 100 MiB, :96) rarely need touching, but the network thread count does, see the table.
| Watch (Tier 2) | Metric | Threshold & why |
|---|---|---|
| Network saturation | kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | Keep >0.4; <0.3 → raise num.network.threads (default 3, SocketServerConfigs.java:152). Also watch host NIC % util, the real ceiling. |
| ISR shrink/expand | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec / IsrExpandsPerSec | Both 0 in steady state. A shrink without a matching expand = a follower stuck behind replica.lag.time.max.ms; check the fetcher and disk. |
| Page-cache hit (OS) | Disk read MB/s vs write MB/s on data dirs | Read MB/s climbing toward write MB/s = working set exceeding RAM → enable tiered storage or add RAM. The catch-up tax lives here. |
| Replication wait | kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce | High RemoteTimeMs = produce waiting on followers / min.insync.replicas. Localizes the stall to replication, not local disk. |
| Cross-AZ bytes | Cloud provider network egress metrics by AZ pair | Track the ×(RF−1) replication term, it is the cost that scales with ingress and forces Tier-3 architecture. |
Tier 3, 100M events/sec (≈100 GB/s): the structural-limit regime
Assumptions and the math
Same constants as Tiers 1–2 (s=1 KiB, RF=3, G=2, Tp≈10 MB/s, Tc≈8 MB/s, amp≈5×, conservative ceiling≈250 MB/s ingress/broker), only the rate changes. Naively scaling Tier 2 by 10× gives the numbers below, and that is exactly the point: the numbers stop being "buy more brokers" and start being "Kafka's structure pushes back."
| Symbol | Assumption | Value (with units) | Why / source | Kind |
|---|---|---|---|---|
T | Target ingress | 100,000,000 msg/s × 1 KiB/msg = 102,400 MB/s ≈ 100 GB/s | Derived: rate × s, exactly 10× Tier 2. | derived |
maps | mmap regions per active partition | ≈2 (an OffsetIndex + a TimeIndex, each a MappedByteBuffer) | Each LogSegment memory-maps its offset and time indexes (LogSegment.java:81-82; AbstractIndex.java:72), the structural input to the partition ceiling below. | config |
vmmc | Linux vm.max_map_count (distro default) | 65,530 mmap regions per process | OS cap on memory-mapped regions per process, Instaclustr; empirical, KAFKA-14204. Raisable, but the default sets the out-of-the-box ceiling. | hardware |
The cluster-wide rates and the raw broker/partition counts follow directly, units cancelling at each step:
- Total disk writes =
T × RF= 100 GB/s × 3 = 300 GB/s across the cluster. - Cross-broker replication =
T × (RF−1)= 100 GB/s × 2 = 200 GB/s of mandatory inter-AZ replication. - Brokers (raw) =
T ÷ ceiling= 102,400 MB/s ÷ 250 MB/s per broker ≈ 410; with N−1 and headroom, ~450–500 brokers on high-bandwidth (100 GbE ÷ 8 ≈ 12,500 MB/s) NICs. - Partitions (raw) =
max(T÷Tp, T÷Tc)= max(102,400÷10, 102,400÷8) = max(10,240, 12,800) ≈ ~10,000–13,000 on the hot topic alone; total cluster partitions (countingRF=3 replicas across all topics) easily reach hundreds of thousands to low millions.
~80M/s is precisely the LinkedIn-2019 scale (~7T msgs/day), and they ran it across 100+ clusters and 4,000+ brokers, not one cluster (LinkedIn, empirical). The reason is in the next three subsections: at this size you bump three HARD/structural ceilings at once, and the operational answer is federation, many bounded clusters behind a routing/aggregation layer, not a single mega-cluster.
Structural limit 1, the per-broker partition ceiling (mmapped indexes)
Each open log segment memory-maps its indexes. In source, every LogSegment holds a LazyIndex<OffsetIndex> and a LazyIndex<TimeIndex> (storage/src/main/java/org/apache/kafka/storage/internals/log/LogSegment.java:81-82) plus a TransactionIndex (:83); each OffsetIndex/TimeIndex is backed by a MappedByteBuffer (AbstractIndex.java:72). The active segment of every partition therefore consumes maps ≈ 2 memory-map areas (from the table). Linux caps mmap regions per process at vm.max_map_count = vmmc = 65,530, so the HARD ceiling is vmmc ÷ maps = 65,530 ÷ 2 ≈ 32,765 partitions per broker unless you raise it (Instaclustr, empirical, KAFKA-14204). Even after raising it, a single KRaft broker has been demonstrated at ~600,000 partitions in a lab (Instaclustr, Kafka 3.2.1, RF=1, empirical), but creation became "painfully slow" and throughput is a separate question entirely.
KRaft removing the ZooKeeper metadata ceiling does not make partitions free. Per-broker counts are gated by file descriptors (set nofile ≥ 100,000; production brokers run >30k open handles, Jun Rao), mmapped indexes (vm.max_map_count, raise to ≥ 262,144), fetcher overhead, and rebalance time. And throughput does not rise with partition count: it peaked near ~100 partitions and latency degraded past ~1,000 in controlled tests (Instaclustr, empirical). At 100M/s you need ~10,000 partitions for parallelism but must keep per-broker partition counts bounded (a few thousand, not tens of thousands), which is itself an argument for spreading across many brokers and many clusters.
Structural limit 2, the controller's metadata throughput
In KRaft, all cluster metadata, every topic, partition, leader, ISR change, broker registration, flows through a single active controller that serializes events onto a Raft-replicated metadata log (see I · 11 KRaft Controller and I · 10 KRaft Consensus). KRaft replaced the ZooKeeper-era ~200,000-partition/cluster ceiling and its O(partitions) failover (controlled shutdown of 50k partitions dropped from 6.5 min to 3 s between Kafka 1.0 and 1.1, Confluent, empirical) with an in-memory, log-replicated quorum targeting "a million partitions or more" and near-instant failover via hot-standby controllers (KIP-500; Confluent lab ran 2M partitions, empirical). But the controller is still a single serialized event loop: it processes one metadata event at a time, and at hundreds of thousands of partitions with constant leadership churn (broker restarts, reassignments, ISR changes during a 100 GB/s incident) the event queue is a structural throughput limit. The source exposes exactly the signals for it.
| Controller / metadata signal | Metric (verified in source) | Source |
|---|---|---|
| Event queue wait | kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs | QuorumControllerMetrics.java:48-49 |
| Event processing time | kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs | QuorumControllerMetrics.java:50-51 |
| Cluster partition count | kafka.controller:type=KafkaController,name=GlobalPartitionCount | ControllerMetadataMetrics.java:59-60 |
| Controller singleton | kafka.controller:type=KafkaController,name=ActiveControllerCount (SUM across nodes must = 1) | QuorumControllerMetrics.java:46-47 |
| Metadata staleness (controller) | kafka.controller:type=KafkaController,name=LastAppliedRecordLagMs | QuorumControllerMetrics.java:60-61 |
| Broker metadata lag | kafka.server:type=broker-metadata-metrics,name=last-applied-record-lag-ms | BrokerServerMetrics.java:82 |
| Broker metadata apply errors | ...,name=metadata-apply-error-count / metadata-load-error-count | BrokerServerMetrics.java:87,92 |
| Stale broker heartbeats | kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount | QuorumControllerMetrics.java:62-63 |
When the controller event queue backs up (rising EventQueueTimeMs), leadership changes are applied slowly, and brokers replaying the metadata log fall behind (rising last-applied-record-lag-ms). The visible symptom is routing staleness: producers and consumers get NOT_LEADER_OR_FOLLOWER from brokers whose view of leadership is stale, retry, and amplify load, exactly when you can least afford it (during a multi-broker failure that is generating thousands of ISR/leadership events). This is the modern, KRaft-era analogue of LinkedIn's ZK-era "cascading controller failures under memory pressure" (LinkedIn, empirical). Keep per-cluster partition counts bounded so the controller's steady-state and failure-time event rates stay well inside its serialized throughput. See I · 12 Metadata Propagation.
Structural limit 3, raw network and cross-AZ cost domination
Two network facts become non-negotiable at 100 GB/s:
- Raw bandwidth requires many high-NIC brokers. Counting bytes as they leave a broker (as in Tier 2): produce-in
T=100 + replication-outT×(RF−1)=200 + consumer-outT×G=100×2=200 ≈ 500 GB/s egress, plus 100 GB/s produce ingress ≈ ~600 GB/s of one-way aggregate NIC traffic. On 100 GbE (= 100 Gbit/s ÷ 8 = 12.5 GB/s per broker), serving ~500 GB/s egress needs ≥ 500 GB/s ÷ 12.5 GB/s ≈ 40 brokers' worth of NIC just for egress, and the ~250 MB/s ingress ceiling already put us at ~450–500 brokers. There is no config for "more bytes through a NIC"; you buy more NICs. This is the EMERGENT bandwidth limit. - Cross-AZ cost dominates the bill, and fetch-from-follower becomes mandatory. By default consumers fetch from the partition leader, which is in a different AZ ~⅔ of the time across 3 AZs. KIP-392 rack-aware fetch-from-follower lets a consumer read from a same-AZ replica, eliminating consumer-side cross-AZ entirely (Grab drove reconfigured-consumer cross-AZ to zero, InfoQ 2023, empirical). At 100 GB/s with multiple consumer groups this is not an optimization, it is the difference between a viable and an absurd cloud bill. It is configured with
broker.rack,replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector, andclient.rack; the tradeoff is that followers serve only up to the high-watermark, adding up to ~500 ms tail latency and causing broker load skew (empirical). It does not reduce the produce or replication cross-AZ floor, that floor (ingress × (RF−1)= 200 GB/s here) is structural at RF=3-across-3-AZs and only object-store/diskless designs attack it (KIP-1150, accepted ~March 2026 but not production-ready, see II · 10 Cost).
At 300 GB/s of writes, even one-day retention is petabytes of local disk per cluster. Tiered storage (KIP-405) is what makes long retention affordable and keeps the page-cache working set bounded so the catch-up tax does not turn every replay into a cluster-wide latency event. Uber's tiered-storage adoption was driven precisely by broker-recovery time: with cold data in object storage, a rejoining broker does not re-replicate days of data (Uber, empirical). At this tier, treat remote.log.storage.system.enable=true with a few-hours log.local.retention.ms as a baseline, not an add-on.
Why you federate instead of building one cluster
Every limit above is per-cluster: the controller is one serialized loop per cluster, the partition ceiling is per broker but rebalance/recovery time grows with per-cluster partition count, and blast radius (one bad config, one cascading controller event) is bounded by cluster size. The largest operators therefore cap cluster size and run many clusters behind an aggregation/routing tier: Pinterest caps at ~200 brokers/cluster to bound blast radius (2,000+ brokers, 800B msgs/day, empirical); Netflix Keystone kept <200 brokers and <10,000 partitions per cluster across 36 clusters (empirical); Uber federated into ~150-node clusters with two-tier regional+aggregate topology (empirical); LinkedIn ran 100+ clusters at ~80M/s, then built Northguard (sharded Raft state machines) at 32T/day because three Kafka limits, single-controller metadata, multi-TB partition rebalance skew, and partition-based scaling, became hard blockers (LinkedIn / InfoQ 2025, empirical). The lesson for your 100M/s design: do not pursue a single cluster. Shard by topic/domain into bounded clusters (each well inside Tier-2 mechanics), and use MirrorMaker 2 (KIP-382) or an aggregation layer for the rare cross-cluster flows. See II · 09 Topologies.
The three tiers side by side
| 1M/s (~1 GB/s) | 10M/s (~10 GB/s) | 100M/s (~100 GB/s) | |
|---|---|---|---|
| Hot-topic partitions | ~150–200 | ~2,000–4,000 | ~10,000+ (sharded across clusters) |
| Brokers (RF=3, N−1 sized) | ~6 | ~45–50 | ~450–500 (across multiple clusters) |
| NIC class | 10–25 GbE | 50–100 GbE | 100 GbE+, hundreds of them |
| Binding limit | Consumer drain rate (EMERGENT); 1 consumer/partition (HARD) | Replication NIC ×(RF−1) & page-cache vs RAM (EMERGENT) | Partition ceiling (HARD), controller serialization (structural), cross-AZ & raw NIC (EMERGENT) |
| Primary dial | partition count for fan-out | num.replica.fetchers ↑, compression, tiered storage (advised) | tiered storage (required), KIP-392 (required), federate |
| Watch first | records-lag-max trend, URP | IsrShrinksPerSec, NetworkProcessorAvgIdlePercent, page-cache hit | EventQueueTimeMs, last-applied-record-lag-ms, GlobalPartitionCount, cross-AZ $ |
| Tunable vs structural | mostly tunable / provisioning | tunable (network, cache), buy + configure | structural, architect around (multi-cluster) |
Operational decision rules across the tiers
- Size partitions for the slowest consumer, once, for peak.
N = max(T/Tp, T/Tc)withTc= the slowest consumer-path rate, not the broker disk. You cannot reduce partitions, and increasing them on a keyed topic breakshash(key) % Nordering (II · 03). Over-partition modestly up front; do not chase throughput with partition count (peak ~100, degrades past ~1,000, empirical). - Size brokers for N−1 and burst, not steady state. Plan
ceiling≈ 250 MB/s ingress/broker, derived asNIC ÷ amp= 1,250 MB/s (10 GbE) ÷ 5 ≈ 250 MB/s, and conservative against the ~605 MB/s aggregatei3en.2xlargebenchmark (Confluent 2020, empirical), thenB = T ÷ ceilingand add one broker so a single failure redistributes without toppling the rest (II · 04). - The amplifier is RF. Every byte produced is
RFbytes of disk and(RF−1)bytes of mandatory cross-AZ replication. Dropping RF 3→2 on non-critical topics halves the replication tax; never drop below RF=2 withmin.insync.replicas=2if you want single-failure durability (II · 06). - Watch the page-cache read rate as the Tier-1→2 tripwire. Near-zero disk reads = healthy. Rising disk reads = working set outgrowing RAM = enable tiered storage or add RAM before the catch-up tax hits (I · 05).
- Watch the controller event queue as the Tier-2→3 tripwire. Rising
EventQueueTimeMs+ rising brokerlast-applied-record-lag-ms= metadata throughput becoming the limit. The answer is fewer partitions per cluster, i.e. federate (I · 11). - Cross-AZ cost is structural at RF=3-across-3-AZs. Compression first (free), then KIP-392 to kill consumer-side cross-AZ, then accept the produce+replication floor unless you move to object-store/diskless designs (II · 10).
- At ~80–100M/s, the correct unit is a fleet of clusters. Every public operator at this scale federated. A single cluster bumps the controller, the partition ceiling, and an unacceptable blast radius simultaneously (II · 09).
Broker/partition/NIC counts above are first-order capacity arithmetic from the chapter's stated assumptions (1 KiB messages, RF=3, ~250 MB/s/broker), not benchmarked guarantees, your message size, compression ratio, and fan-out move them substantially. Defaults, limits, and metric names are cited to Kafka 4.4 source (path/File.java:line). Benchmark and case-study figures (LinkedIn, Confluent, Instaclustr, Grab, Uber, Pinterest, Netflix) are empirical and version/hardware-dependent, directional, not SLAs. The headline org figures are point-in-time: LinkedIn's "~7T/day ≈ 80M/s" is ~2019; "32T/day" is 2025.