III · 07 · The Architect's Cheat Sheet

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.

This is the page you bookmark. Everything in Parts II and III collapses here into a single decision surface: the dials you turn and what each one trades; the four decision trees that answer the recurring questions (should I use a log at all? how many partitions? what replication factor and floor? how do I scale to N?); the heuristics with their one-line mechanism; the tactics worth stealing for any system; and one consolidated reference panel for choosing and tuning a log-based architecture. Prose is minimal by design: each cell and node links to the chapter that proves it. One framing governs the whole sheet, and it is the most useful thing to carry away: every property of a commit-log system is either structural (baked into "partitioned, replicated, ordered, append-only"; you cannot tune it away, only design around it) or tunable (a config or feature that moves a point in the tradeoff space). Knowing which is which is the entire skill.

1 · The design dials

Eight dials cover almost every architectural decision. Each one buys something by spending something else; there are no free positions, only choices about which axis you can afford to give up. "Owns it" = who sets the dial; the binding tradeoff ("…and you spend") is the property you sacrifice when you turn it toward "more."

Dial	Range / key configs	Turn it up to gain…	…and you spend	Class	Owns it
Consistency	`acks` 0 → 1 → all; `min.insync.replicas` 1 → 2 → RF	Stronger commit guarantee: at `acks=all`+minISR=2 every ack lands on ≥2 brokers before the producer hears back	Write latency & availability, the leader waits for the full ISR, and below the floor it refuses the write (goes read-only)	tunable	Producer + topic
Durability	`replication.factor` 1 → 3 → 5; `broker.rack` spread	More copies across more failure domains → lower loss probability	Storage & cross-AZ bandwidth, each replica is a full copy written and replicated (RF=3 ⇒ 2× ingress crosses zones)	tunable	Topic
Latency	`batch.size`, `linger.ms` 0 → 100; `compression.type`; `fetch.min.bytes`	Lower per-record latency (small/zero linger; `fetch.min.bytes=1`)	Throughput & CPU/$, tiny batches amortize nothing; the inverse (big batches) trades latency for throughput. Pick at most two of {latency, throughput, durability}	tunable	Producer + consumer
Ordering scope	partition count + partition key (1 partition = global order)	Finer concurrency: more independent ordered streams	Global order, Kafka orders only within a partition, ever. Total order forces partition count = 1 (no parallelism)	structural	Topic design
Throughput / parallelism	`num.partitions` (producer leaders & consumer slots)	More parallel leaders writing, more consumers draining (≤ 1 per partition per group)	Control-plane & per-broker overhead, FDs, ~2 mmap/partition, fetcher load, rebalance time, election windows; throughput peaks ~100 and latency rises past ~1,000 partitions	partly structural	Topic + cluster
Cost	RF; `retention.ms`/`.bytes`; tiered storage (KIP-405); compression; fetch-from-follower (KIP-392)	Lower $: shorter retention, fewer copies, S3 for cold data, compressed bytes, same-AZ reads	Durability / history / cross-AZ floor, and tiered storage cuts storage only, never networking; a produce+replication cross-AZ floor remains	tunable	Topic + cluster
Availability	`unclean.leader.election.enable` false → true	Stay writable when the whole ISR is gone (elect any live replica)	Committed data, a lagging replica becomes leader and truncates everything it never received; any `UncleanLeaderElectionsPerSec>0` is loss	tunable	Topic
Delivery semantics	`enable.idempotence`; transactions / EOS; consumer commit policy	Exactly-once within Kafka read-process-write; dedup of producer retries	Latency & scope, transactional overhead; EOS does not extend across an external DB (that is at-least-once + outbox)	tunable	Producer + app

The single rule under every dial

Tunable dials move a point in the tradeoff space; structural dials define the space. You can buy durability with RF, latency with linger.ms, and cost with retention; those are negotiations. You cannot buy global ordering, sub-partition parallelism, per-message TTL/priority, or point-by-key lookup at any setting, because they contradict "partitioned, ordered, append-only log." When a requirement lands on a structural axis, the answer is never a config; it is a different topic design or a different system. See bp02 · Design Decisions and bp03 · Inherent Limits.

The tuning triangle, made literal

The three tunable performance dials (latency, throughput, durability) form a triangle in which you can optimize for at most two vertices. This is not folklore; it is mechanism: linger.ms↑ trades latency for throughput (bigger batches amortize per-request cost), acks=all trades latency for durability (wait for the ISR), and compression trades CPU for network/disk. LinkedIn measured synchronous acks=-1 replication roughly halving single-producer throughput (821,557 → 421,823 rec/s): the durability vertex pulling against throughput, made numeric (empirical: Jay Kreps / LinkedIn, 2014).

Latency vertex: linger.ms≈0, fetch.min.bytes=1, small batches, acks=1. Lowest per-record delay; costs throughput (no amortization) and durability (un-replicated tail).

↕ pick 2 of 3

Throughput vertex: batch.size↑, linger.ms 5–100, compression, large fetches. Max bytes/sec; costs latency (the linger window) and CPU (compression).

↕ pick 2 of 3

Durability vertex: acks=all, RF=3, minISR=2, rack-spread. Every ack on ≥2 independent brokers; costs latency (ISR wait) and $ (copies + cross-AZ).

The three tunable performance dials pull against each other. You can sit near two vertices; reaching the third sacrifices one of the others. Ordering and partition count are not on this triangle; they are structural and chosen separately.

Legend: ↕ = "tension between".
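The three vertices can be written down as starting configuration profiles. The sketch below is illustrative only: the config keys are the ones named above, but every numeric value marked "assumed" is a placeholder picked inside the quoted range, and `PROFILES` and `combine` are hypothetical names for this example.

```python
# Illustrative starting points, one per vertex; not recommendations for any workload.
# Values marked "assumed" are placeholders chosen inside the ranges quoted above.
PROFILES = {
    "latency": {
        "producer": {"acks": "1", "linger.ms": 0},
        "consumer": {"fetch.min.bytes": 1},
        "topic": {},
    },
    "throughput": {
        "producer": {
            "linger.ms": 20,           # assumed; the text gives 5-100
            "batch.size": 131072,      # assumed; the text only says "raise it"
            "compression.type": "lz4",
        },
        "consumer": {"fetch.min.bytes": 65536},  # assumed "large fetch" value
        "topic": {},
    },
    "durability": {
        "producer": {"acks": "all"},
        "consumer": {},
        # Topic-level settings; the replication factor is fixed at topic creation.
        "topic": {"replication.factor": 3, "min.insync.replicas": 2},
    },
}


def combine(first: str, second: str) -> dict:
    """Merge two vertex profiles; on a conflicting key the second profile wins."""
    return {
        role: {**PROFILES[first][role], **PROFILES[second][role]}
        for role in ("producer", "consumer", "topic")
    }


durable_and_fast = combine("throughput", "durability")
```

Combining "durability" with "latency" makes the conflict visible: both set `acks`, so whichever profile is listed second overrides the other, which is the "pick 2 of 3" tension in miniature.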

2 · The four decision trees

The four questions an architect asks, as compact flows. Each terminal node is an answer; red terminals mean "stop: wrong tool / wrong fix."

2a · Should I use a log at all?

The log is the right architecture when you need ordered, durable, replayable history that many independent consumers read at their own pace. It is the wrong tool for point lookups, synchronous request/reply, per-message routing/priority, and per-message-parallel task queues; those fight the structure (share groups, KIP-932, address the last). Full reasoning in bp01 · When To Use.

Requirement: what does the workload actually need?

Need ordered, durable, replayable history with many independent readers?
  if not → use a database / KV store (Kafka has no random access; materialize INTO one)

Need synchronous reply to the caller?
  if so → use RPC / gRPC (request-reply over a log re-couples what you decoupled)

Need per-message TTL, priority, or content routing?
  if so → use RabbitMQ / a broker (Kafka retention is topic-level; no priority/routing)

Per-message parallel work, order doesn't matter?
  if so → use share groups (KIP-932) / a queue (classic groups head-of-line-block on per-partition order)

Otherwise → USE THE LOG ✓ (event backbone, CDC, stream processing, replay)

Use-a-log decision spine. Every red off-ramp is a documented anti-pattern; the log wins precisely when ordered durable replayable fan-out is the requirement and none of the disqualifiers apply.


The four anti-patterns, one sentence each

As a database: no key lookups, no ad-hoc SQL, no multi-key ACID; Kafka replicates into specialized stores, it does not replace them (empirical: Kreps; Waehner). As RPC: a throughput-optimized log has no native reply channel; request-reply "returns the coupling we were trying to avoid." As a routing broker: no per-message TTL, priority, or content-based routing (that is RabbitMQ's domain). As a per-message task queue: per-partition order + one consumer per partition means "throughput collapses to the speed of the slowest message"; a poison pill stalls everything behind it. Share groups (Part I · 15) address the last one explicitly. These are intentional log-semantics tradeoffs, not bugs (empirical: ops-blueprint reference §8).
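The off-ramps can be encoded as a small routing function, which is handy as a checklist in design reviews. This is a sketch only: the requirement flag names are hypothetical, and the order of checks simply follows the four anti-patterns in the order they are listed.

```python
def choose_architecture(needs: set) -> str:
    """Return the first off-ramp that applies, or the log if none does.

    `needs` is a set of hypothetical requirement flags invented for this sketch.
    """
    if "point_lookup" in needs:
        return "database / KV store (materialize into one)"
    if "synchronous_reply" in needs:
        return "RPC / gRPC"
    if needs & {"per_message_ttl", "priority", "content_routing"}:
        return "RabbitMQ / a routing broker"
    if "unordered_parallel_work" in needs:
        return "share groups (KIP-932) / a queue"
    return "use the log"
```

A workload that only asks for replay and fan-out, for example `choose_architecture({"replay", "fan_out"})`, falls through every disqualifier and lands on the log.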

2b · How many partitions?

The sizing formula is N = max(T/T_p, T/T_c): target throughput T divided by the measured single-partition produce rate T_p and consume rate T_c. The binding term is almost always the consumer, because consumption parallelism is hard-capped at the partition count. Then add headroom, because resharding a keyed topic is a migration, not a config change. Full mechanism in Part II · 03 Partitioning.

Measure T_p, T_c at YOUR record size / compression / acks; never assume the ~10 MB/s planning number.

Compute the produce term T / T_p and the consume term T / T_c.

N = max(T/T_p, T/T_c)

Keyed topic?
  if not → small/zero headroom; growing in place is safe
  if so → ×2 growth headroom + divisor-rich count (12, 24, 60); you can't reshard keyed in place

N > ~1,000 per topic OR over per-broker budget?
  if so → back off: latency knee & FD/mmap ceiling; spread across brokers, don't inflate count

Then: set N · set RF=3 · never ship defaults (1/1)

Partition-count sizing spine. The consumer term usually binds; keyed topics get growth headroom because resharding is a new-topic migration; and the count is capped from above by the latency knee and per-broker ceilings.
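The sizing spine reduces to a few lines of arithmetic. The sketch below is a planning aid under stated assumptions, not a rule: `growth_factor` and `round_to` are hypothetical parameters, and rounding up to a multiple of 12 is just one way to land on a divisor-rich count.

```python
import math


def partitions_needed(target, produce_rate, consume_rate, keyed,
                      growth_factor=2, round_to=12):
    """N = max(T/T_p, T/T_c), rounded up; keyed topics get growth headroom.

    All three rates share one unit (for example MB/s) and should be measured,
    not assumed. growth_factor and round_to are assumptions of this sketch.
    """
    n = math.ceil(max(target / produce_rate, target / consume_rate))
    if keyed:
        # x2 headroom, then round up to a multiple of 12 (divisor-rich).
        n = math.ceil(n * growth_factor / round_to) * round_to
    return n


# 200 MB/s target, 10 MB/s produce and 5 MB/s consume per partition:
unkeyed = partitions_needed(200, 10, 5, keyed=False)  # consumer term binds
keyed = partitions_needed(200, 10, 5, keyed=True)
```

In this example the consume term (200/5 = 40) binds over the produce term (200/10 = 20); the keyed variant doubles to 80 and rounds up to 84. The upper caps (the ~1,000-partition knee and the per-broker budget) still have to be checked separately.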


The two partition facts that bite hardest

You cannot decrease partitions (the controller rejects it by construction), and growing a keyed topic remaps keys via murmur2(key) % N: a key at 7654321 moves from partition 1 (N=4) to 3 (N=6), splitting its history across two logs and breaking per-key order, compaction, and co-partitioned Streams joins (empirical: Confluent; Arpit Bhayani). Treat keyed-topic partition count as near-immutable; over-partition modestly up front. And remember: more partitions never fixes a hot key. That key still hashes to one partition; redesign or salt it instead. See Part II · 03.
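The remapping effect is easy to see with plain modulo arithmetic. In this sketch the integer `key_hash` stands in for the hashed key; the real partitioner hashes the key bytes with murmur2 first, which is not reproduced here.

```python
def partition_for(key_hash: int, num_partitions: int) -> int:
    """Modulo placement; key_hash is a stand-in for the hashed key bytes."""
    return key_hash % num_partitions


# One concrete hash value that moves when the topic grows from 4 to 6 partitions.
before = partition_for(9, 4)
after = partition_for(9, 6)

# Across evenly spread hash values, count how many change partition.
hashes = range(1200)
moved = sum(1 for h in hashes if partition_for(h, 4) != partition_for(h, 6))
moved_fraction = moved / len(hashes)
```

With these inputs a hash of 9 moves from partition 1 to partition 3, and two thirds of all hash values land on a different partition after the resize, so most keys end up with history split across two logs.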

2c · What replication factor and min.insync.replicas?

Durability is three dials owned by three actors: acks (producer's request), min.insync.replicas (broker's pre-append floor), replication.factor (topology). The canonical setting RF=3 / minISR=2 / acks=all / unclean=false is the unique point that survives one broker loss both losslessly and writably: minISR=2 is exactly one above the survivable failure count of 1. Eligible Leader Replicas KIP-966 (GA in this build) further narrows safe recovery. Full arithmetic and source in Part II · 06 Durability.

How much loss can this topic tolerate?

Acknowledged data loss acceptable?
  if so → RF=2 or 3, acks=1 (or 0): availability/cost over durability (e.g. Netflix Keystone)

Must stay writable through one failure?
  if not → RF=3 · minISR=3 · acks=all: any single loss ⇒ read-only (strict, less available)
  if so → RF=3 · minISR=2 · acks=all · unclean=false: survives 1 loss lossless + writable; 2nd ⇒ read-only

Then: spread RF across racks/AZs (broker.rack) · enable ELR (KIP-966) · verify minISR ≤ RF

Optional: small flush.ms on the critical topic; shrinks the page-cache loss window at a latency cost

Durability decision spine. The default answer is RF=3/minISR=2; the divergences are deliberate: lossy-tolerant telemetry relaxes it, strict consistency tightens it to minISR=3, and rack spread is non-negotiable because it converts the page-cache window from a single-event certainty into a multi-event improbability.


Why these exact numbers (the mechanism, not folklore)

The pre-append gate if (inSyncSize < minIsr && requiredAcks == -1) throw NotEnoughReplicasException (Partition.scala:1234) fires before the record is appended, so an unaccepted write can never be "lost": it was never in the log. With minISR=2, every accepted acks=all write sat on ≥2 brokers at ack time; lose one and a copy remains. RF=3 supplies the third copy purely as availability headroom. Failure independence dominates replication factor: RF=3 across 3 AZs gives loss probability ≈ p³ ≈ 10⁻⁹, but RF=3 in one AZ collapses to ≈ q ≈ 10⁻³ when a single power event takes all three, which is a million times worse at identical RF. Spend the durability budget on rack spread before more replicas. And the honest caveat: acks=all means "in the page cache of every ISR member," not fsynced. Kafka bets that N independent replicas beat one fsynced copy, which is correct for every failure except simultaneous power loss of the whole ISR (empirical: Kreps/LinkedIn; Vanlightly). See Part II · 06 and Part I · 08.
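The arithmetic above can be checked mechanically. This is a simplified model, not broker code: it assumes a healthy ISR equal to RF before the failure, treats "lossless" as "every acknowledged write sat on at least two brokers", and the function names are hypothetical. The capping of minISR at RF follows the `effectiveMinIsr` behaviour noted in the heuristics table.

```python
import math


def effective_min_isr(min_isr: int, rf: int) -> int:
    # The floor is capped at RF, so minISR > RF silently degrades.
    return min(min_isr, rf)


def after_one_broker_loss(rf: int, min_isr: int) -> tuple:
    """Return (lossless, writable) for acks=all writes when one replica dies."""
    floor = effective_min_isr(min_isr, rf)
    lossless = floor >= 2          # every acked write sat on >= 2 brokers
    writable = (rf - 1) >= floor   # survivors can still meet the floor
    return lossless, writable


def loss_probability(p_domain_failure: float, independent_domains: int) -> float:
    """Probability that every independent failure domain fails together."""
    return p_domain_failure ** independent_domains


three_azs = loss_probability(1e-3, 3)   # RF=3 spread over 3 AZs
single_az = loss_probability(1e-3, 1)   # RF=3 inside one AZ: one event takes all
```

Under this model RF=3/minISR=2 is the only row that comes back `(True, True)`; minISR=3 stays lossless but goes read-only, minISR=1 stays writable but can lose acknowledged data, and RF=1 with minISR=2 is neither.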

2d · How do I scale to N events/sec?

Kafka does not scale by one dial; it scales by which constraint binds first, and that constraint moves outward by an order of magnitude at each step: at ~1M/s your own consumers bind, at ~10M/s the network (×RF) and page cache bind, at ~100M/s you hit structural ceilings and the answer becomes "more clusters, not a bigger cluster." The levers that appear by tier are fetch-from-follower KIP-392, tiered storage KIP-405, and federation via MirrorMaker 2 KIP-382. Full tier-by-tier math in Part II · 11 Scaling.

Target throughput T (events/s): which wall binds first?

T ≲ ~1M/s (~1 GB/s)?
  if so → bottleneck = YOUR consumers + partition count
    ~6 brokers, ~100–200 partitions, cluster idle
    watch: records-lag-max, URP

T ≲ ~10M/s (~10 GB/s)?
  if so → bottleneck = NETWORK (×RF) + page-cache working set
    raise num.replica.fetchers 1→4–8; KIP-392 fetch-from-follower
    watch: NetworkProcessorAvgIdlePercent<0.4, IsrShrinksPerSec
  if not → bottleneck = STRUCTURAL: partition ceiling, controller metadata throughput, raw NIC, cross-AZ $$
    tiered storage + federate (MM2); LinkedIn runs ~80M/s across 100+ clusters
    watch: EventQueueTimeMs, LastAppliedRecordLagMs, $/GB

Scaling decision tree. Read it as "what changes," not "what's bigger": each 10× moves the binding constraint from your consumers, to the network and RAM, to the architecture itself. The final transition is structural: no single cluster does 100M/s.
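As a lookup, the tree is three tiers with two cut-offs. The sketch below hard-codes the approximate thresholds from the tree; they are order-of-magnitude planning numbers, and the function name and return shape are hypothetical.

```python
def binding_constraint(events_per_sec: float) -> dict:
    """Map a target rate to the constraint that is expected to bind first."""
    if events_per_sec <= 1_000_000:
        return {
            "tier": "consumer-bound",
            "bottleneck": "your consumers + partition count",
            "watch": ["records-lag-max", "UnderReplicatedPartitions"],
        }
    if events_per_sec <= 10_000_000:
        return {
            "tier": "resource-bound",
            "bottleneck": "network (xRF) + page-cache working set",
            "watch": ["NetworkProcessorAvgIdlePercent", "IsrShrinksPerSec"],
        }
    return {
        "tier": "structurally bound",
        "bottleneck": "partition ceiling, controller metadata, NIC, cross-AZ cost",
        "watch": ["EventQueueTimeMs", "LastAppliedRecordLagMs"],
    }
```

The cut-offs are deliberately coarse; a workload near a boundary should be treated as belonging to both tiers until measurement says otherwise.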


The one scaling number to internalize

Replication is the network tax. Every byte a producer writes is read RF−1 times by followers before any consumer reads it; at RF=3 a leader's egress is ≥2× its ingress just to stay replicated, exposed as ReplicationBytesOutPerSec (BrokerTopicMetrics.java:37). This single multiplier is why the network (not disk, not CPU) becomes the wall at 10M/s and the dominant cost (50–90% of a cloud bill) at 100M/s. Tiered storage (KIP-405) cuts storage but not this networking; only fetch-from-follower (consumer term) and diskless/object-store designs (produce+replication term) attack it. See Part II · 11 and Part II · 10 Cost.
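The multiplier is worth computing for a concrete broker before sizing NICs. This sketch assumes every consumer group reads the full stream from the leader (no fetch-from-follower); the function name and the `consumer_groups` parameter are illustrative.

```python
def leader_egress(ingress: float, rf: int, consumer_groups: int) -> float:
    """Bytes leaving a leader per unit of time, in the same unit as ingress.

    Followers pull each byte RF-1 times; each consumer group is assumed to
    read the whole stream once, directly from the leader.
    """
    replication = ingress * (rf - 1)
    consumption = ingress * consumer_groups
    return replication + consumption


replication_only = leader_egress(100, rf=3, consumer_groups=0)
with_two_groups = leader_egress(100, rf=3, consumer_groups=2)
```

At 100 MB/s of ingress and RF=3, the leader already sends 200 MB/s before a single consumer connects, and two full-stream consumer groups double that again.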

3 · The heuristics table

The rules of thumb gathered across both parts, each with its one-line mechanism. These are planning defaults, not guarantees: measure, then deviate deliberately.

#	Heuristic	Why (mechanism)	Where
H1	`partitions = max(T/T_p, T/T_c)`; plan ~10 MB/s/partition	Topic throughput is the sum of per-leader rates; the slower of produce/consume binds	op03
H2	Size partitions for the consumer, not the disk	Consumption parallelism is hard-capped at partition count (≤1 consumer/partition/group)	op03
H3	Throughput peaks ~100 partitions; latency rises past ~1,000, don't over-partition	Per-partition overhead (fetch fan-out, metadata, election) dominates past the knee (empirical: Instaclustr)	op02
H4	Treat keyed-topic partition count as immutable; over-partition ~2× up front	`hash(key) % N` remaps on resize → broken order/joins; you also can't decrease N	op03
H5	RF=3 / minISR=2 / acks=all / unclean=false is the durable-and-available default	minISR=2 is one above the survivable failure (1) → one loss is lossless + writable	op06
H6	Verify `min.insync.replicas ≤ replication.factor` on every topic	`effectiveMinIsr` silently caps minISR at RF, so RF=1 + minISR=2 still accepts single-copy writes	op06
H7	Spread RF across racks/AZs (`broker.rack`) before adding replicas	Independence is what earns the exponent (p³ across 3 AZs vs ≈ q in one); replicas that fail together add almost nothing	op06
H8	Durability is replication, not fsync, leave flush off; spread the ISR instead	Per-batch fsync destroys tail latency; N replicas in page cache beat one fsynced copy	op06
H9	Plan egress ≥ 2× ingress; budget cross-AZ at `ingress×(RF−1)` + ⅔ produce + ⅔ consume	Replication multiplies every byte by RF−1; clients hit a remote leader ~⅔ of the time	op04, op10
H10	Keep JVM heap ~6 GB; give the rest of RAM to the OS page cache	Kafka reads via zero-copy `sendfile` from page cache; a big heap only adds GC pauses	op05
H11	Set `linger.ms>0` only when the produce rate fills `batch.size` in the window	On low-rate streams linger adds pure latency; at `linger=0` batches averaged ~1.2 KB vs a 16 KB cap (empirical: Confluent)	op05
H12	Compress at the producer (lz4 as the usual pick, zstd for ratio): cuts storage + replication + fetch at once	Kafka stores/replicates compressed batches; one CPU spend, three byte savings (Kafka's own compression default is `none`, not zstd)	op05, op10
H13	Raise `fetch.min.bytes` on low-throughput-topic consumers	Default 1 is tuned for latency not cost; many tiny fetches burn broker CPU (a one-app fix cut cluster CPU 15%, empirical: New Relic)	op05
H14	Cross 10M/s ⇒ raise `num.replica.fetchers` 1→4–8 first	One fetcher thread/source broker serializes replication of thousands of partitions	op11
H15	Deploy fetch-from-follower (KIP-392) when cross-AZ shows up in the bill	Same-AZ reads zero the consumer cross-AZ term (~⅔ of consumer traffic); +up to ~500 ms latency, broker skew	op10, op11
H16	Tiered storage (KIP-405) for cold data, but it cuts storage only, never networking	Decouples retention from broker disk; the common misconception is that it lowers the cross-AZ bill	I·05, op10
H17	Cap clusters ≤ ~200 brokers; federate beyond with MirrorMaker 2	Bounds recovery time and blast radius; 100M/s is a multi-cluster topology by construction (empirical: LinkedIn 100+ clusters)	op09, op11
H18	For per-message parallelism without order, use share groups (KIP-932), not more partitions	Classic groups give per-partition order + one consumer/partition → head-of-line blocking on a slow message	I·15
H19	Cross-system "exactly-once" = outbox + CDC + idempotent consumers, not Kafka EOS	EOS is within Kafka read-process-write only; outbox writes {row + event} in one local txn, accepts at-least-once	I·14, bp04
H20	Per-broker ceiling: FDs ≥ 100,000; `vm.max_map_count` ≥ 262,144 (the default map limit caps a broker at ~32k partitions)	Each partition holds open segment files + ~2 mmap index regions; defaults cap density on Linux	op02

4 · The tactics toolkit

The reusable engineering moves: the parts of Kafka's design worth stealing for any system, independent of Kafka. Full treatment in bp04 · Tactics Toolkit; this is the quick-list with the transferable principle.

Tactic	What Kafka does	Transferable principle, steal this
The log as system of record	Immutable totally-ordered events are primary; tables/indexes/caches are derived, replayable projections (database "inside out")	Make the write-ahead log a first-class citizen; derive every queryable view from it so state is always reconstructable by `fold(log)`
Replication over fsync	Durability from N independent page-cache replicas, not a synchronous disk flush on the write path	For machine-level failures, cross-node redundancy is a stronger and faster guarantee than single-node fsync, spend on failure independence
Partition = unit of everything	One key (parallelism, ordering, placement, recovery) shards the system; appends need no cross-shard coordination → linear scale	Pick one sharding key that simultaneously buys parallelism and ordering; coordination-free appends are what make throughput scale linearly
Zero-copy + page cache	Reads served from OS page cache via `sendfile`; no user-space copy, no app-managed cache	Lean on the kernel's cache instead of building your own; sequential append + OS readahead beats a hand-rolled buffer pool
Pull-based consumers	Consumers poll at their own pace and track their own offset; the broker is "dumb," the consumer "smart"	Push backpressure and position-tracking to the consumer → natural flow control, independent replay, no broker-side per-consumer state
The ISR / high-watermark gate	Commit = "at or below the high watermark"; HW advances only when the ISR meets the floor, admission control before the append	Reject unsafe writes at the door, not after; a record never accepted can never be lost, cheaper than compensating later
Idempotent producer + epochs	Sequence numbers dedup retries; `transactional.id` epochs fence zombie/duplicate writers	Make retries safe by construction (sequence + dedup) and fence stale actors with monotonic epochs, turns at-least-once into effectively-once
Stream–table duality	A table is the aggregate of a changelog; a stream is the changelog of a table (KStream/KTable, CDC)	Treat state and its change-history as two views of one thing; CDC the DB log to bridge mutable state into the event world without dual-writes
Replay / Kappa	Retained log lets you reprocess history by starting a new job from offset 0 and swapping outputs, no second code path	Make recompute a first-class operation: one logic path, replay to rebuild views or fix bugs, instead of a parallel batch pipeline (Lambda)
Tier hot/cold storage	Short local retention for the hot tail, object store for cold (KIP-405), decouples retention from compute and shrinks recovery	Separate the working set from the archive; keep only the hot tail on fast/expensive media, push cold data to cheap storage

The meta-tactic

Kafka's whole design is one bet repeated: make the simple, sequential, append-only path the fast path, and derive everything else from it. Ordering, durability via replication, zero-copy reads, replay, and stream/table duality all fall out of that one commitment. When you design your own system, find the equivalent simplest-possible primitive (often: an ordered log of decisions) and let queryable state, caches, and recovery be deterministic projections of it. That is the transferable core of the blueprint; see bp00 · The Log Pattern and bp05 · Evolution.

5 · Reference panel, choosing & tuning a log-based architecture

One panel that ties the sheet together: when the log fits, where it inherently falls short, which lever attacks which cost, and how Kafka sits against the alternatives. Use it as the final sanity check before committing.

5a · Fit & misfit at a glance

The log is the RIGHT choice when…	The log is the WRONG choice when… (and what wins)
You need an ordered, durable, replayable record of what happened	You need point lookup by key / ad-hoc query / multi-key ACID → a database (materialize into one)
Many independent consumers read the same stream at their own pace	You need a synchronous reply to the caller → RPC/gRPC (request-reply re-couples)
You want decoupled fan-out and integration (one source, N sinks)	You need per-message TTL, priority, or content routing → RabbitMQ
You need CDC / event sourcing / stream processing / replay	You need per-message-parallel work, order irrelevant → queue / share groups
Throughput is high and per-partition ordering by key is enough	You need global total order at scale → forces 1 partition (no parallelism); rethink the requirement
You want history retained as the system of record (tiered/infinite retention)	You need sub-millisecond per-message latency at low load → a push broker (Kafka is pull-based)

5b · Inherent limits vs what mitigates them

The honest part: separate what is structural (you design around it) from what a feature or config moves. Mitigated ≠ removed; note where a floor remains. Full analysis in bp03 · Inherent Limits.

Limitation	Class	Mitigated / tuned by	Residual (what's left)
Ordering is per-partition, never global	structural	Partition-by-key for partition-local order	No global order without 1 partition, inherent
Consumption parallelism ≤ partition count	structural	Over-provision partitions; share groups (KIP-932) for queue-style work	Per-group cap remains for ordered consumption
No per-message TTL / priority / routing	structural	Topic-level retention; app-level routing into sub-topics	By-design log semantics, not a roadmap item
No random access / point-by-key lookup	structural	Compaction (last value per key); materialize into a KV store	Log is sequential; lookups live in derived stores
Latency floor (pull + no per-write fsync default)	tunable	Small `linger`/`fetch.min.bytes`, tuned serialization, fast disks/NIC (tier-1 bank hit <5 ms p99)	A real floor remains; push brokers beat it at low load
Partition count ceiling	tunable (was structural)	KRaft (KIP-500) removed the ZK O(partitions) failover; lab-tested to 2M partitions	Per-broker FD/mmap/fetcher ceilings still bind (~32k/broker default)
Cross-AZ replication cost (50–90% of cloud bill)	partly structural	Fetch-from-follower (KIP-392) zeroes consumer term; compression; tiered storage (storage only)	Produce + replication cross-AZ floor, only diskless/object-store removes it
Rebalance disruption	tunable (was structural)	Cooperative KIP-429, broker-side protocol KIP-848, static membership	Largely historical; old eager clients still pay it
Page-cache loss window (no fsync)	tunable	Rack/AZ spread (makes it multi-event); small `flush.ms` if regulated	Correlated power loss of whole ISR, bounded, not zero

5c · Cost levers, what each one actually attacks

Networking is ≥50% of a multi-AZ cloud Kafka bill (80–90% after tiered storage). The levers below are ordered by effort and, crucially, labeled with the cost each one actually touches (empirical: ops-blueprint reference App. C). See Part II · 10 Cost.

Lever	Attacks	Effect	Tradeoff / caveat
Producer compression (lz4 / zstd)	storage + replication + fetch	~10× on text/JSON, near-free; one CPU spend, three byte savings	Per-batch, needs batch depth; little gain on pre-compressed/binary
Fetch-from-follower (KIP-392)	consumer cross-AZ	~all consumer cross-AZ removed; up to ~50% total cluster cost	+up to ~500 ms tail; broker skew; does NOT touch produce/replication
Tune RF & retention	storage + replication cross-AZ	Linear cut (RF 3→2 on non-critical; shorter retention)	Lower RF reduces fault tolerance
Tiered storage (KIP-405)	storage only	30–90% storage cut (S3 ~$0.02 vs EBS ~$0.08/GiB-mo)	Does NOT reduce cross-AZ networking, the common misconception
Diskless / object-store (KIP-1150, WarpStream, AutoMQ)	replication + produce cross-AZ	Kills the cross-AZ floor (writes straight to S3); ~80% TCO cut claimed	+200–400 ms produce p99 (~1 s e2e); upstream KIP-1150 accepted ~2026 but not GA OSS

The cross-AZ hard floor

With classical Kafka + fetch-from-follower you remove the consumer cross-AZ term, but produce + replication cross-AZ remains: in AutoMQ's $24k/mo example, ~$13.8k of it (empirical: AutoMQ). Tiered storage does nothing for it. Only leaderless object-store designs (diskless) attack the produce+replication floor, and they trade ~200–400 ms of latency to do it. Pick the lever that matches the cost line you actually have, and don't expect tiered storage to fix a networking bill. See Part II · 10.
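The floor shows up directly when the H9 budget is split into its three terms. This sketch assumes three evenly loaded AZs with one replica in each, so clients reach a remote leader two thirds of the time; the function name and the `consumer_groups` multiplier are assumptions of the example, not part of the heuristic.

```python
def cross_az_traffic(ingress: float, rf: int = 3, consumer_groups: int = 1,
                     fetch_from_follower: bool = False) -> dict:
    """Split cross-AZ traffic into replication, produce, and consume terms.

    Units follow `ingress` (for example MB/s). Assumes 3 AZs, one replica per
    AZ, and that each consumer group reads the full stream once.
    """
    replication = ingress * (rf - 1)
    produce = ingress * 2 / 3
    consume = 0 if fetch_from_follower else consumer_groups * ingress * 2 / 3
    return {
        "replication": replication,
        "produce": produce,
        "consume": consume,
        "total": replication + produce + consume,
    }


baseline = cross_az_traffic(90)
with_kip_392 = cross_az_traffic(90, fetch_from_follower=True)
```

For 90 MB/s of ingress the baseline is 300 MB/s of cross-AZ traffic; fetch-from-follower removes the 60 MB/s consume term and leaves a 240 MB/s produce + replication floor that none of the classical levers touches.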

5d · Kafka vs the alternatives, when each wins

The log blueprint has several implementations; they differ mainly in where compute and storage sit and what they trade. All cross-system performance numbers are vendor benchmarks unless independently attributed: treat them as directional, and check durability/JVM/duration parity (empirical: ops-blueprint reference App. D; Vanlightly).

System	Model	Ordering	Wins when…	Gives up
Apache Kafka	Brokers co-own compute + storage; local disk + page cache + zero-copy; tiered to S3	Per-partition	You want highest throughput-per-partition + replay/storage on commodity infra	Cross-AZ cost, latency floor, no per-message TTL/priority, partition ceiling
Apache Pulsar	Stateless brokers; compute/storage separated via BookKeeper	Per-partition (+ keys)	You need independent storage scaling + seamless capacity add	Operational complexity (ZK + BookKeeper + brokers); own caches, not page cache
Redpanda	Single C++ binary, thread-per-core, no JVM, Raft	Per-partition	You want simpler ops (no JVM/ZK), Kafka-API compatible	10×/3× claims reverse on equal hardware (empirical: Vanlightly), directional
Amazon Kinesis	Managed, metered-shard model	Per-shard	Spiky/low-volume, zero-ops, pay-per-shard	Hard 1 MB/s per shard, 500-shard default, ~2–3× pricier at sustained scale
Google Pub/Sub	Managed, global, auto-scaling; no shards to manage	None by default (opt-in keys)	You want hands-off elastic global scale	Less throughput-per-unit control; ordering is opt-in and separate
RabbitMQ	Smart broker / dumb consumer; exchange routing; per-message state	Per-queue (priority breaks it)	You need rich routing, per-message TTL + priority, low-load latency	Lower throughput; degrades >~30 MB/s; per-message state is the cost
WarpStream	Diskless / S3-native; stateless agents, no inter-broker replication, leaderless	Per-partition (control plane)	Cost is the priority; no local disk, no cross-AZ replication; Kafka-API	~400 ms produce p99 / ~1 s e2e, not for sub-100 ms workloads

Read vendor benchmarks adversarially

Every vendor benchmark is advocacy. Before trusting a headline number, verify: equivalent durability (not log.flush.interval.messages=1 forcing per-batch fsync on Kafka, not "no-journal" Pulsar vs acks=all Kafka), Java 17+ (not 11, which hurts Kafka especially with TLS), consistent offset-commit cadence, Coordinated Omission corrected, and a long run (12–36 h, because tail latency drifts). Jack Vanlightly's independent re-run of Redpanda's own OMB on identical hardware reversed the headline: Kafka often matched or beat it (~1,900 vs 1,400 MB/s at NVMe saturation), and Redpanda's "Kafka doesn't fsync so it's unsafe" claim is simply false reasoning: Kafka trades single-node fsync for cross-node replication by design (empirical: Vanlightly). See bp06 · Comparative.

5e · The page-immediately signals (the on-call companion)

The architecture is only as good as your ability to see it failing. These are the binary health gauges that map to the dials above; alert on them in this order. Thresholds and JMX names in Part II · 08 Metrics; runbooks in Part II · 07 Failure Modes.

Signal	Alert when	Means (which dial / mechanism)
`UncleanLeaderElectionsPerSec`	any > 0	Active data loss, a lagging replica became leader and truncated (availability dial cost)
`OfflinePartitionsCount`	> 0	Partitions with no leader, unavailable now
`UnderMinIsrPartitionCount`	> 0	ISR below the floor, `acks=all` producers are getting `NotEnoughReplicas` (consistency dial engaged)
`UnderReplicatedPartitions`	> 0 sustained	A copy is missing, constant ≈ broker down; fluctuating ≈ performance bottleneck (durability dial degraded)
`ActiveControllerCount` (cluster sum)	≠ 1	0 = no controller; >1 = split-brain (control plane)
`RequestHandler`/`NetworkProcessorAvgIdlePercent`	< ~0.3	Saturation, too few IO/network threads or NIC-bound (throughput dial wall)
Consumer `records-lag-max`	growing over a window	A consumer is slower than the produce rate, bounded by partition count to drain (parallelism dial)
`EventQueueTimeMs` / `LastAppliedRecordLagMs`	rising fleet-wide	Controller backlog / metadata propagation lag, the 100M-tier failure mode (structural scaling wall)
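The point-in-time rows of the table translate into a small rule list. This is a sketch under assumptions: the flat `{metric_name: value}` input is a hypothetical shape (scraping the values from JMX is not shown), and the rows that need a time window (sustained under-replication, growing lag, rising controller queue time) are left out because a single sample cannot evaluate them.

```python
# Paging rules transcribed from the table above, in the same priority order.
PAGE_RULES = [
    ("UncleanLeaderElectionsPerSec", lambda v: v > 0),
    ("OfflinePartitionsCount", lambda v: v > 0),
    ("UnderMinIsrPartitionCount", lambda v: v > 0),
    ("ActiveControllerCount", lambda v: v != 1),          # cluster-wide sum
    ("NetworkProcessorAvgIdlePercent", lambda v: v < 0.3),
]


def firing(metrics: dict) -> list:
    """Return the names of breached rules, highest priority first.

    Metrics missing from the input are skipped rather than treated as healthy
    or unhealthy; a real check would probably alert on missing data too.
    """
    return [name for name, breached in PAGE_RULES
            if name in metrics and breached(metrics[name])]


sample = {
    "ActiveControllerCount": 1,
    "OfflinePartitionsCount": 2,
    "NetworkProcessorAvgIdlePercent": 0.12,
}
alerts = firing(sample)
```

Keeping the list ordered means the first entry in `alerts` is always the most urgent signal, matching the "alert in this order" guidance.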

How to use this sheet

Start at the dial or decision tree that matches your question, take the terminal answer, then jump to the linked chapter for the mechanism and source. The discipline that makes you fast: classify every requirement as tunable or structural first. Tunable → find the dial and accept its tradeoff. Structural → it is a topic-design or system-choice decision, never a config. The rest of Part III builds the judgement behind each box here: bp00 · The Log Pattern (why the log), bp01 · When To Use (fit), bp02 · Design Decisions (the dials in depth), bp03 · Inherent Limits (the structural walls), bp04 · Tactics (what to steal), bp05 · Evolution (where it's going), and bp06 · Comparative (the alternatives). Part II grounds every number in operations.