II · 10 · Cost Engineering
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.
In the cloud, a Kafka bill is rarely what an engineer expects. The instinct is "more brokers = more money," and compute is real, but on a well-utilised cluster it is usually the smallest of the three line items. The two giants are storage (ingress × retention × replication factor, sitting on local disk that the RF multiplies) and, almost always the largest, cross-AZ network transfer, because Kafka's replication protocol copies every byte RF−1 times between brokers, and in a multi-AZ deployment those copies, plus the producer write to a leader in another zone and every consumer fetch from a leader in another zone, all cross zone boundaries that the cloud provider meters per gigabyte in each direction. Confluent's own teardown of a 100 MBps cluster splits it roughly $2.3k compute / $14.5k storage / $24.2k networking per month, networking "likely over 50%," rising to ~90% once tiered storage shrinks storage (Confluent, empirical). This chapter roots every cost driver in a concrete mechanism you have already met in Part I, gives each lever its dial and its tradeoff, and ends with a worked cost model and an effort/impact-ranked lever table. The cardinal rule here is unusually literal: each dollar has a why, and the why is a byte that some piece of source code decided to copy, store, or compress.
The three cost drivers, each rooted in a mechanism
Kafka's architecture fixes the shape of the bill before you tune anything. Three mechanisms (log persistence, ISR replication, and broker-owned compute) each generate one of the three line items. Internalise the mechanism and the cost becomes predictable arithmetic rather than a surprise.
- Network (cross-AZ): replication (ingress × (RF−1)) + cross-AZ produce (~⅔ of writes) + cross-AZ consumer fetch (~⅔ × #groups). Metered per GB, each direction. The replication term dominates at RF=3.
- Storage: ingress × retention × RF on broker-attached block storage (EBS/SSD). Every produced byte is written to disk and the log keeps it for the retention window; RF copies it on RF brokers.

Driver 1, Storage: ingress × retention × RF
Every byte a producer sends is appended to the leader's log and, because Kafka persists to disk by design (Part I Storage & the Log Engine), written to broker-attached block storage. The log retains it for the configured window, and the replication factor copies the whole log onto RF brokers. The storage bill is therefore the cleanest formula in the chapter:
- Storage (GB on disk): ingress (MB/s) × 86,400 (s/day) × retention (days) × 0.001 (GB/MB) × RF. The two literals are pure unit conversions: 86,400 s/day = 60×60×24 (turns a per-second rate into a per-day volume), and 0.001 GB/MB converts MB→GB. Units cancel to GB: (MB/s)·(s/day)·(days)·(GB/MB) = GB.
- Storage cost/mo: (GB on disk) × $/GB-month, at the assumed $0.08/GB-mo EBS gp3 rate (Confluent, empirical; an illustrative cloud price, check your bill).
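The two formulas above translate directly into a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the $0.08/GB-month default is the chapter's illustrative EBS gp3 rate, not a quote.

```python
SECONDS_PER_DAY = 60 * 60 * 24   # 86,400 s/day
GB_PER_MB = 0.001                # MB -> GB unit conversion

def storage_gb_on_disk(ingress_mb_per_s, retention_days, replication_factor):
    """GB resident on broker disks: ingress x retention x RF."""
    return (ingress_mb_per_s * SECONDS_PER_DAY * retention_days
            * GB_PER_MB * replication_factor)

def storage_cost_per_month(gb_on_disk, usd_per_gb_month=0.08):
    """Monthly storage cost at an assumed, illustrative $/GB-month rate."""
    return gb_on_disk * usd_per_gb_month

# Example workload (an assumption): 100 MB/s, default 7-day retention, RF=3.
gb = storage_gb_on_disk(ingress_mb_per_s=100, retention_days=7, replication_factor=3)
print(f"{gb:,.0f} GB on disk, about ${storage_cost_per_month(gb):,.0f}/month")
```

Halving `retention_days` or dropping `replication_factor` from 3 to 2 shows the linear effect of each multiplier directly.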
The three multipliers are all operator-controlled, and each maps to a config with a source-verified default:
| Multiplier | Config (default) | Source | Cost effect |
|---|---|---|---|
| Retention (time) | retention.ms = 604800000 (7 days) | storage/.../LogConfig.java:134,212 (DEFAULT_RETENTION_MS = 24*7*60*60*1000) | Linear: halve retention → halve storage. |
| Retention (size) | retention.bytes = −1 (unbounded) | server-common/.../ServerLogConfigs.java:81 (LOG_RETENTION_BYTES_DEFAULT = -1) | A hard per-partition cap; when both limits are set, whichever is reached first triggers deletion. |
| Replication factor | default.replication.factor = 1 | server/.../ReplicationConfigs.java:42,153 (REPLICATION_FACTOR_DEFAULT = 1) | Direct multiplier: RF=3 stores 3× the bytes of RF=1. |
default.replication.factor=1 is the source default, but it is not a production setting: RF=1 means a single broker loss permanently loses data, and (see Part I Replication & the ISR) it silently breaks min.insync.replicas=2 because the effective in-sync requirement is capped at the replica count. Every durable deployment overrides RF to 3, which triples storage and adds 2× ingress of cross-AZ replication traffic that RF=1 never generates. The cost of durability is not a footnote; it is a multiplier on your two largest line items, and it is why RF is the first dial to revisit on non-critical topics (RF=3→2 cuts storage by ⅓ and, because replication scales with RF−1, cuts replication traffic by ½). Set durability deliberately in Durability Engineering; do not let it be an accident of the default.
Driver 2, Network: replication amplification + cross-AZ transfer
This is the one that surprises people, and it is structural. Two distinct things make up the network bill, and both come straight from how Kafka moves bytes.
(a) Replication amplification. When a producer writes 1 GB to a partition leader, the ISR mechanism (Part I Replication & the ISR) copies that GB to each of the RF−1 followers via their replica-fetcher threads. So RF−1 GB of inter-broker traffic is generated per GB ingested: at RF=3, 2 GB of replication per 1 GB written. This shows up on the brokers as kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec / ReplicationBytesOutPerSec (storage/.../BrokerTopicMetrics.java:37–38); watch those to see the amplification directly.
(b) Cross-AZ billing. Inside a single AZ, inter-broker traffic is free. The cloud meters traffic that crosses AZ boundaries, and Kafka's placement guarantees a lot of it crosses:
- Clients (producers and consumers) talk only to the partition leader. With brokers spread across 3 AZs, a leader is in a different AZ from the client ~⅔ of the time. So ~⅔ of produce bytes and ~⅔ of consumer-fetch bytes cross a zone boundary.
- All replication crosses zones when RF spans AZs (the whole point of multi-AZ RF is that the replicas are in different failure domains). That is the ingress × (RF−1) term, and it dominates.
- Cross-AZ throughput (3 AZ): (ingress × ⅔) + (egress × ⅔) + (ingress × (RF−1)), the Confluent / AutoMQ formula (empirical). The ⅔ is a probability, not a tunable: with brokers evenly spread over 3 AZs, a randomly placed client shares the leader's AZ 1 of 3 times, so 1 − ⅓ = ⅔ of produce and of fetch bytes cross a zone boundary. The (RF−1) replication term carries no ⅔ because every follower copy is placed in a different AZ on purpose (that is the point of multi-AZ RF).
- Effective AWS rate: $0.01/GB in EACH direction → ~$0.02/GB effective (AutoMQ, 2minutestreaming, empirical; illustrative, verify against your cloud bill). GCP ≈ $0.01/GB once; Azure inter-AZ historically free.
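The cross-AZ throughput formula above can be sketched as a small function. The names are hypothetical, the ⅔ is hard-coded for the 3-AZ even-spread case described in the text, and the $0.02/GB rate is the illustrative effective AWS figure.

```python
CROSS_AZ_FRACTION = 2 / 3              # 1 - 1/3: client and leader in different AZs (3-AZ case)
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 30-day month

def cross_az_gb_per_s(ingress_gb_per_s, egress_gb_per_s, replication_factor):
    """Cross-AZ throughput: produce + consumer fetch + replication terms."""
    produce = ingress_gb_per_s * CROSS_AZ_FRACTION
    fetch = egress_gb_per_s * CROSS_AZ_FRACTION
    replication = ingress_gb_per_s * (replication_factor - 1)  # no 2/3: every copy crosses
    return produce + fetch + replication

def cross_az_cost_per_month(gb_per_s, usd_per_gb=0.02):
    """Monthly transfer cost at an assumed effective per-GB rate."""
    return gb_per_s * SECONDS_PER_MONTH * usd_per_gb

# Example (an assumption): 0.1 GB/s ingress, 3 consumer groups (0.3 GB/s egress), RF=3.
rate = cross_az_gb_per_s(ingress_gb_per_s=0.1, egress_gb_per_s=0.3, replication_factor=3)
print(f"{rate:.3f} GB/s cross-AZ, about ${cross_az_cost_per_month(rate):,.0f}/month")
```

Setting `egress_gb_per_s` to 0 isolates the produce-plus-replication part that no consumer-side change can remove.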
Compute scales with broker count, which scales with peak throughput, and modern instances are cheap per MB/s. Storage scales with bytes at rest. But cross-AZ network scales with bytes in motion × an amplification factor that, for one consumer group, is roughly (RF−1) + ⅔ + ⅔ = (RF−1) + 4/3, the RF−1 replication copies, plus ⅔ of produce and ⅔ of fetch crossing zones (the 4/3 is just those two ⅔ terms added), and every gigabyte is charged twice (each direction) at a non-trivial rate. AutoMQ's worked example: a 3-node, 100 MiB/s, 3-consumer-group cluster moves ~173 TB/mo of producer writes (~$3,460) but ~$10,360 of replication and ~$10,360 of consumer reads, landing at $14k–$24k/mo in cross-AZ alone, vs a VM bill an order of magnitude smaller (AutoMQ, empirical). Even with fetch-from-follower eliminating the consumer term, a produce+replication floor of ~$13.8k/mo remains. This is why the cost levers below are ordered with the networking levers high.
Driver 3, Compute: brokers
Compute is CPU, RAM, and NIC on the broker fleet. The CPU-heavy paths are compression/decompression (Part I The Record & Batch Format), TLS handshakes and per-record encryption (Part I Security), request handling across the network/IO thread pools (Part I Network & Threading), and replication fetch. RAM is split: keep the JVM heap small, ~6 GB, and leave the rest of the box to the OS page cache, which is the real read/write accelerator via zero-copy sendfile (Confluent, empirical; the page-cache fetch path is Part I The Fetch Path). On a well-utilised, well-batched cluster, compute is usually the smallest of the three line items; it becomes material only when TLS + high message rates + heavy compression saturate cores, or when over-provisioned brokers sit idle. Right-sizing compute is the subject of Capacity Planning; here we note only that throwing brokers at a throughput problem is cheap, while throwing them at a network-cost problem does nothing: cross-AZ is per-GB, not per-broker.
The levers, each with its mechanism, dial, and tradeoff
Five levers, in roughly increasing order of effort and architectural disruption. The first is nearly free; the last restructures the cost model. Each is grounded in a specific Kafka feature you can point to in source.
Lever 1, Compression: compression.type (cuts storage AND network at CPU cost)
Compression is the highest-leverage, lowest-effort move, because Kafka compresses per record batch on the producer and then stores and replicates the batch still compressed: the broker does not recompress on the happy path, and zero-copy fetch ships the compressed bytes to consumers untouched. So a single producer config simultaneously shrinks storage, replication bytes, cross-AZ transfer, and consumer-fetch bytes. The cost is CPU on the producer (and on any consumer that decompresses).
The source enumerates exactly five codecs and their level ranges in clients/.../record/internal/CompressionType.java:
| Codec (id) | Level range / default | Source line | Ratio (HTTP data) | Speed |
|---|---|---|---|---|
| none (0) | n/a | CompressionType.java:30 | 1.0× | n/a |
| gzip (1) | 1–9, default -1 (Deflater.DEFAULT_COMPRESSION) | CompressionType.java:33–36 | 3.58× | slowest, usually avoided |
| snappy (2) | n/a (no levels) | CompressionType.java:71 | 2.35× | fast |
| lz4 (3) | 1–17, default 9 | CompressionType.java:72,75–77 | 1.81× | fastest decompress (2,428 MB/s) |
| zstd (4) | −131072–22, default 3 | CompressionType.java:99,105–109 | 4.5× (lvl 6) | moderate (409/844 MB/s) |
Ratios and speeds above are Cloudflare's production measurements on ~1 MB / 600-record HTTP-request batches (empirical); they are data-dependent, not guarantees: text/JSON compresses ~10–12×, pre-compressed or binary payloads barely at all. The operational defaults that fall out:
- compression.type=lz4 when CPU/latency matter most: fastest codec, ~1.8–2× shrink, smallest CPU tax. The throughput recipe's default.
- compression.type=zstd when bytes (and therefore cross-AZ + storage cost) matter most: markedly better ratio for moderately more CPU. Cloudflare chose zstd and saved "hundreds of gigabits of internal traffic and terabytes of flash storage," cancelling a hardware expansion (empirical); Trendyol reported ~70% message-size reduction at zstd level 3 (empirical). Tune the level via compression.zstd.level (added with fine-grained level control, KIP-390/KIP-780).
Because the unit of compression is the record batch, a producer that flushes one record at a time gets almost no ratio regardless of codec. Compression therefore composes with batching: raise batch.size (default 16384) and set a non-zero linger.ms so batches fill before they ship; only then does "zstd gives 4.5×" materialise. See Performance Tuning for the batching dials. Also: the producer default is compression.type=none (ProducerConfig.java:245,409; the doc literally says "The default is none"), and the broker default is compression.type=producer, meaning the broker keeps whatever the producer sent (server-common/.../ServerLogConfigs.java:178 → BrokerCompressionType.PRODUCER, BrokerCompressionType.java:33). So if no one sets it, nothing is compressed. "zstd is the default" is a common and costly myth.
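The batching effect is easy to reproduce outside Kafka. The sketch below uses Python's standard `zlib` (DEFLATE, the algorithm behind gzip) as a stand-in for a Kafka codec and compares compressing 600 small JSON records one at a time against compressing them as a single blob. The sample records are made up, and this does not model Kafka's actual record-batch framing; it only illustrates why one-record batches compress poorly.

```python
import json
import zlib

# Hypothetical sample data: 600 small JSON events with similar shape.
records = [
    json.dumps({"user_id": i, "event": "page_view",
                "path": f"/products/{i % 50}", "status": 200}).encode("utf-8")
    for i in range(600)
]

def ratio_per_record(payloads):
    """Each record compressed alone, as if every batch held a single record."""
    raw = sum(len(p) for p in payloads)
    compressed = sum(len(zlib.compress(p)) for p in payloads)
    return raw / compressed

def ratio_batched(payloads):
    """All records concatenated and compressed once, as one large batch."""
    raw = b"".join(payloads)
    return len(raw) / len(zlib.compress(raw))

print(f"one record per batch:  {ratio_per_record(records):.2f}x")
print(f"600 records per batch: {ratio_batched(records):.2f}x")
```

The codec only finds redundancy it can see inside a single compression unit, so the batched ratio should come out several times higher than the per-record one on data like this.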
Lever 2, Fetch-from-follower: rack-aware replica selection KIP-392 (kills consumer cross-AZ)
By default, a consumer always fetches from the partition leader, which is in another AZ ~⅔ of the time, generating the cross-AZ consumer-read term. KIP-392 lets a consumer instead read from a same-AZ follower replica, eliminating that term entirely. It is a configuration change, not an architectural one, and on consumer-read-heavy clusters it is the single biggest networking win after compression.
Three configs turn it on; all are source-verified:
| Side | Config | Value | Source |
|---|---|---|---|
| Broker | broker.rack | this broker's AZ id (e.g. us-east-1a) | server-common/.../ServerConfigs.java:92 |
| Broker | replica.selector.class | org.apache.kafka.common.replica.RackAwareReplicaSelector | server/.../ReplicationConfigs.java:140,173 (default null = "returns the leader") |
| Consumer | client.rack | the consumer's AZ id (must match a broker.rack) | clients/.../CommonClientConfigs.java:77–79 (default "") |
The mechanism is worth seeing in source, because its constraints are the tradeoff. The broker resolves the preferred replica in ReplicaManager.findPreferredReadReplica (core/.../ReplicaManager.scala:1964–2013), which builds the candidate set and hands it to the selector. Two guards in that code define the behaviour:
- A follower is a candidate only if it is in the ISR and its logEndOffset ≥ fetchOffset ≥ logStartOffset (ReplicaManager.scala:1985–1987). The comment is explicit: excluding out-of-sync replicas prevents the leader from "continuously pick[ing] the lagging follower … indefinitely."
- RackAwareReplicaSelector then filters to replicas whose endpoint().rack() equals the client's rackId; if the leader is in-rack it returns the leader, otherwise the most caught-up in-rack replica; if none are in-rack it falls back to the leader (clients/.../replica/RackAwareReplicaSelector.java:35–52).
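The two guards above can be modelled in a few lines. This is an illustrative Python sketch of the described behaviour, not the broker code: the `Replica` type and `select_read_replica` function are hypothetical, and "most caught-up" is approximated by the highest log end offset.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Replica:
    broker_id: int
    rack: str
    log_start_offset: int
    log_end_offset: int
    in_isr: bool
    is_leader: bool = False

def select_read_replica(replicas: List[Replica], client_rack: str,
                        fetch_offset: int) -> Replica:
    """Pick the replica a consumer in `client_rack` should fetch from."""
    leader = next(r for r in replicas if r.is_leader)
    if not client_rack:
        return leader  # no client.rack set: always the leader
    # Guard 1: followers must be in the ISR and hold the requested offset.
    candidates = [
        r for r in replicas
        if r.is_leader
        or (r.in_isr and r.log_start_offset <= fetch_offset <= r.log_end_offset)
    ]
    # Guard 2: keep only replicas in the client's rack.
    same_rack = [r for r in candidates if r.rack == client_rack]
    if not same_rack or leader in same_rack:
        return leader  # nothing in-rack, or the leader itself is in-rack
    return max(same_rack, key=lambda r: r.log_end_offset)

replicas = [
    Replica(1, "us-east-1a", 0, 100, in_isr=True, is_leader=True),
    Replica(2, "us-east-1b", 0, 100, in_isr=True),
    Replica(3, "us-east-1c", 0, 60, in_isr=False),  # lagging, out of the ISR
]
for rack in ("us-east-1a", "us-east-1b", "us-east-1c", ""):
    chosen = select_read_replica(replicas, rack, fetch_offset=90)
    print(f"client.rack={rack!r} -> broker {chosen.broker_id}")
```

The `us-east-1c` consumer falls back to the leader because its only in-rack replica is out of sync, which is exactly the case where the cross-AZ saving disappears.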
A follower can only serve records up to the high-watermark (the offset replicated to the full ISR; see Part I The Fetch Path and Replication & the ISR); it cannot hand out data the leader has but the ISR has not yet confirmed. Because a follower learns the new high-watermark only on its next fetch from the leader, consumers that switch to a follower see data slightly later than they would from the leader. Grab measured "up to 500 ms" of added consumer latency from this, and observed broker load skew because fetch traffic now follows replica placement rather than leader placement (Grab/InfoQ, empirical). What it does not touch: produce and replication bytes still cross AZ. Grab's result (cross-AZ cost for the reconfigured consumers driven to zero, at +500 ms) is the canonical outcome (empirical).
Lever 3, RF and retention as cost dials
Before reaching for new architecture, the two oldest dials give linear savings with no latency penalty, only a durability/availability tradeoff:
- Replication factor. RF multiplies both storage and cross-AZ replication. RF=3 is the durable standard (tolerates one broker loss with min.insync.replicas=2, two losses before unavailability). RF=2 cuts storage by ⅓ and replication traffic by ½ (replication scales with RF−1, so 2→1) but leaves zero headroom: a single failure drops you to one replica, and min.insync.replicas=2 then blocks all writes to that partition (Part I Replication & the ISR). Reserve RF=2 for explicitly non-critical, reproducible topics. Set via default.replication.factor (default 1) or per-topic.
- Retention. retention.ms (default 7 days, LogConfig.java:134) cuts storage linearly: 3-day retention is half the disk of 6-day. Use retention.bytes (default −1) to cap per-partition size as a hard backstop. Shorter retention is free until someone needs to replay further back than the window allows, which is exactly the gap that Lever 4 fills.
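Using the Driver 1 and Driver 2 formulas (storage scales with RF, replication with RF−1, and storage is linear in retention), the effect of each dial can be sketched as a ratio of new to old volume. The function names are hypothetical.

```python
def rf_change_effect(rf_old, rf_new):
    """Fractions of the old storage and replication volumes left after an RF change.

    Assumes rf_old > 1, so there is replication traffic to begin with.
    """
    storage = rf_new / rf_old                   # storage = ingress x retention x RF
    replication = (rf_new - 1) / (rf_old - 1)   # replication = ingress x (RF - 1)
    return storage, replication

def retention_change_effect(days_old, days_new):
    """Fraction of the old storage volume left after a retention change."""
    return days_new / days_old

storage, replication = rf_change_effect(rf_old=3, rf_new=2)
print(f"RF 3->2: storage x{storage:.2f}, replication x{replication:.2f}")
print(f"retention 6d->3d: storage x{retention_change_effect(6, 3):.2f}")
```

Neither dial touches the produce or consumer-fetch cross-AZ terms, which depend only on where leaders and clients sit.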
Lever 4, Tiered storage KIP-405 (cheap object store for cold data)
Tiered storage (Part I Tiered Storage) lets a topic keep only a small local tail on broker disk while offloading older segments to object storage (S3/GCS/ABS) via the RemoteLogManager (storage/.../RemoteLogManager.java). Because only that small local tail stays RF-multiplied on broker disk while older data lives on the cheap object store as a single billed copy, this attacks the storage line item hard: object storage runs ~$0.02/GiB-mo vs ~$0.08–0.10/GiB-mo for EBS, ~4–5× cheaper, and decouples retention from broker disk (so you can keep weeks or months without growing the fleet). Enable it cluster-wide with remote.log.storage.system.enable=true (default false, RemoteLogManagerConfig.java:55,58) and per-topic with remote.storage.enable=true (default false, LogConfig.java:142,253); set the local tail with local.retention.ms / local.retention.bytes (default −2 = "derive from retention.ms/retention.bytes", LogConfig.java:145,146,255,257).
The most common and most expensive misconception about KIP-405 is that it reduces the network bill. It moves bytes at rest to a cheaper tier; it does nothing to bytes in motion. Replication still copies RF−1 times across AZs, and producers and consumers still cross zones. In fact, by shrinking storage it raises networking's share of the bill. Aiven's figure: with tiered storage, "Networking is 83%+ of cost ($882k/yr out of $1.05M/yr)" (empirical). If networking is your problem, the levers are 2 (fetch-from-follower) and 5 (diskless), not 4. Watch the offload working via kafka.server:type=BrokerTopicMetrics,name=RemoteCopyBytesPerSec / RemoteFetchBytesPerSec and the backlog gauge RemoteCopyLagBytes (storage/api/.../RemoteStorageMetrics.java:35–36,48); remote reads are served by a pool of remote.log.reader.threads (default 10, RemoteLogManagerConfig.java:150,152).
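A rough way to size the storage saving is to price the local tail at the block-storage rate times RF and the remote tier at the object-store rate as a single copy. The sketch below is a simplification under stated assumptions: function names are hypothetical, the rates are the illustrative figures from this section, the object store is assumed to end up holding the full retention window, and request and remote-read charges are ignored. Network cost is deliberately absent because tiering does not change it.

```python
def classic_storage_cost(ingress_gib_per_day, retention_days, rf, disk_rate=0.08):
    """All retained bytes on broker disk, multiplied by RF."""
    return ingress_gib_per_day * retention_days * rf * disk_rate

def tiered_storage_cost(ingress_gib_per_day, retention_days, local_days, rf,
                        disk_rate=0.08, object_rate=0.02):
    """Local tail on broker disk (x RF) plus one billed copy in object storage."""
    local_gib = ingress_gib_per_day * min(local_days, retention_days) * rf
    remote_gib = ingress_gib_per_day * retention_days  # assumption: full window offloaded
    return local_gib * disk_rate + remote_gib * object_rate

# Example (an assumption): ~8,438 GiB/day ingress, 30-day retention, 1-day local tail, RF=3.
classic = classic_storage_cost(8438, retention_days=30, rf=3)
tiered = tiered_storage_cost(8438, retention_days=30, local_days=1, rf=3)
print(f"classic ${classic:,.0f}/mo vs tiered ${tiered:,.0f}/mo")
```

The longer the retention relative to the local tail, the larger the gap, which is why tiering pays off most on long-retention topics.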
Lever 5, Diskless / object-store-native designs KIP-1150 (restructure the model)
The newest direction attacks the networking floor directly. Diskless designs (WarpStream and AutoMQ commercially; KIP-1150 "Diskless Topics" in Apache Kafka) write produce batches directly to object storage instead of replicating between brokers. With no inter-broker replication and no local-disk RF, the ingress × (RF−1) cross-AZ term, the dominant one, drops to ~0, and durability comes from the object store's own cross-AZ replication (which the provider does not bill back to you as inter-AZ transfer). KIP-1150 was accepted ~March 2026, but acceptance is not a production-ready OSS implementation; in mid-2026, treat it as a roadmap item, not a deployable feature.
Writing to S3 means waiting on a commit interval (~250 ms or an 8 MiB batch) plus an S3 PUT (~200–400 ms p99 for 2–8 MB), so produce latency moves from sub-100 ms to ~200–400 ms typical, up to ~2.4 s end-to-end (WarpStream/Aiven, empirical). S3 Express One Zone narrows produce p99 to ~169 ms. The cost case is dramatic where it fits: WarpStream's own TCO benchmark put 3-AZ OSS Kafka at $20,252/mo (inter-zone networking alone $14,765) vs WarpStream at $2,961/mo for the same 268 MiB/s workload (vendor-reported, directional). KIP-1150 is explicitly designed to coexist with classic sub-100 ms topics in one cluster, not replace them: use diskless for high-throughput, latency-tolerant streams and classic topics for the rest. Vendor % savings (80–90%) assume high fanout, retail pricing, and RF=3; the mechanism is real, the percentage is workload-dependent.
A worked cost model
Put it together on one concrete workload so the arithmetic, and the dominance of networking, is unmistakable. Every number below is introduced first as a labeled assumption, then used; the rates are the empirical reference's cited cloud figures (AWS, RF=3, 3 AZs); treat them as illustrative and version-dependent, not a quote, substitute your own measured ingress and your own cloud bill's transfer rates.
Assumptions. Each constant carries its value with units, a one-line why/source, and its kind (workload / config / illustrative cloud rate). Nothing in the derivation uses a number that is not listed here.
| Symbol | Value (with units) | Why / source | Kind |
|---|---|---|---|
| ingress | 100 MiB/s sustained | The workload we are pricing, pick the peak sustained produce rate your cluster measures (BytesInPerSec). | workload (illustrative) |
| fanout | 3 consumer groups (3× read) | Each group re-reads the full stream once, so total consumer egress = 3 × ingress. | workload (illustrative) |
| RF | 3 replicas | The durable standard; set via default.replication.factor (source default is 1, see Driver 1). RF=3 ⇒ RF−1 = 2 follower copies. | config |
| AZs | 3 availability zones | Brokers spread evenly ⇒ a client shares the leader's AZ ⅓ of the time, so ⅔ of produce & fetch crosses zones (Driver 2). | workload / topology |
| retention | 3 days | How long the log is kept on disk; retention.ms (source default 7 days, LogConfig.java:134). 3 days chosen to keep the example small. | config (illustrative) |
| compression | none (1.0×) for the baseline | We price the untuned wire volume first, then apply compression as Lever 1. Producer & broker defaults compress nothing (see Lever 1). | config (baseline) |
| $storage | $0.08 / GiB-month | AWS EBS gp3 (Confluent, empirical), illustrative; check your bill. | cloud rate (illustrative) |
| $xAZ | $0.02 / GiB effective | AWS cross-AZ = $0.01/GiB each direction → ~$0.02 effective (AutoMQ / 2minutestreaming, empirical). GCP ≈ $0.01 once; Azure historically free. | cloud rate (illustrative) |
| $compute | ≈ $2,300 / month for the fleet | Broker instances sized to absorb produce plus replication writes (ingress × RF in total) and fan-out reads; matches Confluent's published 100 MBps teardown compute line (empirical). Right-sizing is Capacity Planning; here it is a single illustrative line, not a per-broker derivation. | cloud rate (illustrative) |
| sec/month | 2,592,000 s = 60×60×24×30 | Unit conversion: seconds in a 30-day month (turns a per-second rate into bytes/month). | constant (exact) |
Derived: monthly ingress volume. This single quantity feeds every line below, so derive it once:
ingress × sec/month ÷ 1024 = 100 MiB/s × 2,592,000 s ÷ 1024 MiB/GiB = 259,200,000 MiB ÷ 1024 ≈ 253,125 GiB ≈ 247 TiB/month. (Units cancel: (MiB/s)·s = MiB, then MiB ÷ (MiB/GiB) = GiB.) Call this I = 253,125 GiB/mo in the table below.
Derivation. Each row takes the formula from its Driver section, substitutes the assumptions above (so every literal traces to the Assumptions table), and multiplies by the relevant rate. Storage is a stock (bytes resident on disk), so it uses ingress-per-day × retention × RF; the cross-AZ rows are flows over the month, so they scale the monthly volume I. Units are shown so they cancel to dollars.
| Line item | Derivation (formula → substitute assumptions → with units) | Bytes | × Rate | ≈ Cost/mo |
|---|---|---|---|---|
| Storage (on disk) | ingress/day × retention × RF. ingress/day = 100 MiB/s × 86,400 s/day ÷ 1024 = 8,438 GiB/day; resident = 8,438 GiB/day × 3 days × 3 (RF) = 75,938 GiB (days & RF are dimensionless ⇒ result in GiB) | ~75,938 GiB (~74 TiB) resident | × $0.08/GiB-mo | 75,938 × $0.08 = ~$6,100 |
| Cross-AZ: replication | I × (RF−1) = 253,125 GiB × 2 (each of the 2 followers is in another AZ ⇒ all of it crosses) | ~506,250 GiB (~494 TiB) | × $0.02/GiB | 506,250 × $0.02 = ~$10,100 |
| Cross-AZ: produce | I × ⅔ = 253,125 GiB × 0.67 (⅔ of writes hit a leader in another AZ, see Driver 2) | ~169,594 GiB (~166 TiB) | × $0.02/GiB | 169,594 × $0.02 = ~$3,400 |
| Cross-AZ: consumer fetch | I × fanout × ⅔ = 253,125 GiB × 3 × 0.67 (3 groups, each reading from a leader that is cross-AZ ⅔ of the time) | ~508,781 GiB (~497 TiB) | × $0.02/GiB | 508,781 × $0.02 = ~$10,200 |
| Compute (brokers) | fleet sized to absorb produce plus replication writes (ingress × RF in total) and to serve fanout × ingress reads; taken directly as the $compute assumption (not re-derived here; see Capacity Planning) | n/a | fleet | ~$2,300 |
| Sum of the rows | | | | ≈ $32,100/mo |
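The derivation table can be reproduced in a short script, which makes it easy to substitute your own ingress, fanout, and rates. This is a sketch under the table's assumptions: the function name and dictionary keys are hypothetical, ⅔ is entered as 0.67 to match the table, and all rates are the illustrative ones from the Assumptions table.

```python
SECONDS_PER_DAY = 60 * 60 * 24
SECONDS_PER_MONTH = SECONDS_PER_DAY * 30
MIB_PER_GIB = 1024

def monthly_bill(ingress_mib_s=100, fanout=3, rf=3, retention_days=3,
                 cross_fraction=0.67, storage_rate=0.08, xaz_rate=0.02,
                 compute=2300.0):
    """Return each line item of the worked model in dollars per month."""
    daily_gib = ingress_mib_s * SECONDS_PER_DAY / MIB_PER_GIB
    monthly_gib = ingress_mib_s * SECONDS_PER_MONTH / MIB_PER_GIB  # I in the table
    return {
        "storage": daily_gib * retention_days * rf * storage_rate,
        "xaz_replication": monthly_gib * (rf - 1) * xaz_rate,
        "xaz_produce": monthly_gib * cross_fraction * xaz_rate,
        "xaz_fetch": monthly_gib * fanout * cross_fraction * xaz_rate,
        "compute": compute,
    }

bill = monthly_bill()
total = sum(bill.values())
networking = sum(v for k, v in bill.items() if k.startswith("xaz_"))
for item, cost in bill.items():
    print(f"{item:16s} ${cost:>10,.0f}")
print(f"{'total':16s} ${total:>10,.0f}  (networking {networking / total:.0%})")
```

The unrounded total is about $32,070; the table's ~$32,100 comes from rounding each row to the nearest $100 before summing.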
Adding the three cross-AZ rows: $10,100 + $3,400 + $10,200 ≈ $23,700 of networking. So networking is $23,700 ÷ $32,100 ≈ 74% of the bill; compute is $2,300 ÷ $32,100 ≈ 7%; storage is the remaining ~19%. Why networking dominates: the cross-AZ multiplier (RF−1) + ⅔ + (fanout × ⅔) = 2 + 0.67 + 2.01 ≈ 4.7 applies to the monthly volume at $0.02/GiB, whereas storage applies a 3× RF to only the resident tail (3 days, not the whole month) at $0.08/GiB-mo, so bytes-in-motion × ~4.7 beats bytes-at-rest × 3 even though the per-GiB storage rate is 4× higher. These figures align with Confluent's published 100 MBps teardown (~$24.2k networking / ~$14.5k storage / ~$2.3k compute); our storage line is lower only because this example assumes 3-day rather than longer retention. (Small differences from the round empirical figures, e.g. ~$10,100 vs the often-quoted ~$10,360, come mostly from unit conversion: treating the same 259,200,000 MiB as 259,200 GB (÷1,000 rather than ÷1,024) gives 259,200 × 2 × $0.02 ≈ $10,368. Use the vendor sheet's units, and more decimals on ⅔, if you need an exact match.) Now apply the levers in order and watch the bill collapse:
- Lever 1, compression (bytes ÷3): storage $6.1k÷3 ≈ $2.0k, cross-AZ $23.7k÷3 ≈ $7.9k, + compute $2.3k → total ≈ $12.2k
- Lever 2, fetch-from-follower (removes the consumer-fetch term, $10.2k÷3 ≈ $3.4k after L1): $12.2k − $3.4k → total ≈ $8.8k
- Lever 3, RF 3→2: ×(RF−1): 2→1 ⇒ ×½ each (linear in RF)
- Lever 4, tiered storage: $0.08→$0.02/GiB-mo (~4–5× cheaper); networking now >80% of remainder
- Lever 5, diskless: I×(RF−1) → ~0; new floor = produce + metadata

Compression shrinks the byte volume that every other lever and formula operates on. If you compute cross-AZ cost on uncompressed ingress and then "save 50% with fetch-from-follower," you have double-counted. The correct order is: (1) measure compressed ingress (or model it with your codec's real ratio on your data), (2) compute the cross-AZ terms from that, (3) then subtract the consumer term that fetch-from-follower removes. The single biggest modelling error is pricing the bill on wire-format bytes that compression already eliminated.
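The ordering rule (compress first, then subtract the consumer term) can be sketched as follows. The function name is hypothetical, the inputs are the rounded baseline rows from the derivation table, and the 3× compression ratio is an assumed figure that must be replaced by a ratio measured on your own data.

```python
def apply_levers(storage, xaz_replication, xaz_produce, xaz_fetch, compute,
                 compression_ratio=1.0, fetch_from_follower=False):
    """Apply compression first, then fetch-from-follower; return the monthly total."""
    # Step 1: compression shrinks every byte-volume line item (compute is left as-is).
    storage /= compression_ratio
    xaz_replication /= compression_ratio
    xaz_produce /= compression_ratio
    xaz_fetch /= compression_ratio
    # Step 2: fetch-from-follower removes the already-compressed consumer term.
    if fetch_from_follower:
        xaz_fetch = 0.0
    return storage + xaz_replication + xaz_produce + xaz_fetch + compute

baseline = dict(storage=6100, xaz_replication=10100, xaz_produce=3400,
                xaz_fetch=10200, compute=2300)

after_l1 = apply_levers(**baseline, compression_ratio=3.0)
after_l2 = apply_levers(**baseline, compression_ratio=3.0, fetch_from_follower=True)
print(f"baseline      ${apply_levers(**baseline):,.0f}")
print(f"+ compression ${after_l1:,.0f}")
print(f"+ follower    ${after_l2:,.0f}")
```

Because the consumer term is zeroed after it has been divided by the compression ratio, the saving attributed to fetch-from-follower is about $3.4k rather than the uncompressed $10.2k, which is the double-counting trap described above.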
The lever table, ranked by effort vs impact
Decision rule for picking the next lever: identify which line item dominates your bill (measure BytesInPerSec, ReplicationBytesInPerSec, and your cloud cross-AZ transfer report), then pick the lowest-effort lever that targets it.
| # | Lever | Targets | Effort | Impact | Tradeoff / dial | Feature |
|---|---|---|---|---|---|---|
| 1 | compression.type = lz4/zstd | storage + ALL network + fetch | Low (one producer config) | High (~2–3× on JSON; ~10× on text) | Producer/consumer CPU; per-batch (needs batching). Dial: codec + compression.zstd.level. | Part I 01; KIP-390 |
| 2 | Fetch-from-follower (rack-aware) | consumer cross-AZ (the ⅔×fanout term) | Low–Med (3 configs + correct rack ids) | High on read-heavy/multi-group clusters (Grab → $0) | +up to ~500 ms consumer latency (HW-bounded); broker load skew. Dial: on/off. | Part I 09; KIP-392 |
| 3 | RF 3→2 / shorter retention | storage + replication cross-AZ (linear) | Low (per-topic config) | Med (RF 3→2: −⅓ storage, −½ replication; linear in retention) | Durability/availability ↓ (RF=2 = no failure headroom); replay window ↓. Dials: replication.factor, retention.ms. | Part I 08 |
| 4 | Tiered storage | storage only (cold bytes → object store) | Med (cluster + object-store setup) | Med–High on storage (~4–5× cheaper tier; 30–90% storage cut) | Remote-read latency for cold reads; object-store ops. No network effect. Dials: remote.storage.enable, local.retention.ms. | Part I 05; KIP-405 |
| 5 | Diskless / object-store-native | replication + produce cross-AZ (the floor) | High (new topic type / vendor; not GA in OSS) | Very High on network (replication cross-AZ → ~0) | +200–400 ms produce latency, up to ~2.4 s e2e. Coexists with classic topics. Dial: per-topic. | KIP-1150 (accepted ~Mar 2026) |
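The decision rule (find the dominant line item, then take the lowest-effort lever that targets it) can be sketched as a lookup over the table. The line-item keys and the `next_lever` helper are hypothetical names for illustration; the ordering and targets are copied from the table above.

```python
# Levers in table order (lowest effort first) with the line items each one targets.
LEVERS = [
    ("compression", {"storage", "xaz_replication", "xaz_produce", "xaz_fetch"}),
    ("fetch-from-follower", {"xaz_fetch"}),
    ("RF / retention", {"storage", "xaz_replication"}),
    ("tiered storage", {"storage"}),
    ("diskless", {"xaz_replication", "xaz_produce"}),
]

def next_lever(bill, already_applied=()):
    """Return (dominant line item, lowest-effort unused lever that targets it)."""
    dominant = max(bill, key=bill.get)
    for name, targets in LEVERS:
        if name not in already_applied and dominant in targets:
            return dominant, name
    return dominant, None  # e.g. compute dominates: see Capacity Planning instead

bill = {"storage": 6100, "xaz_replication": 10100, "xaz_produce": 3400,
        "xaz_fetch": 10200, "compute": 2300}
print(next_lever(bill))
print(next_lever(bill, already_applied=("compression",)))
```

In practice the bill should be re-measured after each lever is applied, since the dominant line item shifts as the cheaper levers take effect.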
Turn on compression everywhere (lz4 if latency-sensitive, zstd if bytes-sensitive) with sane batching; it is nearly free and cuts every byte-volume line item. Then, if cross-AZ dominates (it usually does), enable fetch-from-follower to kill the consumer term and revisit RF on non-critical topics. If storage is large, add tiered storage, but do not expect it to touch networking. Only reach for diskless when the produce+replication cross-AZ floor is the dominant cost and the workload tolerates 200–400 ms latency. Measure first: ReplicationBytesInPerSec tells you the amplification, your cloud transfer bill tells you the cross-AZ reality, and RemoteCopyBytesPerSec tells you tiering is working. Every number in this chapter is directional and version/region-dependent; re-check cloud transfer rates and vendor TCO claims against live pricing before you commit a budget.
This chapter is the economics layer over the rest of the operations manual. Capacity Planning sizes the fleet whose compute you are pricing here; Partitioning governs the per-partition counts that drive replica placement and therefore cross-AZ flows; Performance Tuning owns the batching/compression dials Lever 1 depends on; Durability Engineering owns the RF/min.insync.replicas/acks choices behind Lever 3; Topologies decides the AZ/cluster layout that sets the cross-AZ baseline; and Metrics & Signals gives the byte-rate gauges that turn this model from estimate into measurement. Cost is not a separate concern; it is the shadow every architectural choice casts on the invoice.