III · 07 · The Architect's Cheat Sheet
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.
This is the page you bookmark. Everything in Parts II and III collapses here into a single decision surface: the dials you turn and what each one trades; the four decision trees that answer the recurring questions, should I use a log at all? how many partitions? what replication factor and floor? how do I scale to N?; the heuristics with their one-line mechanism; the tactics worth stealing for any system; and one consolidated reference panel for choosing and tuning a log-based architecture. Minimal prose by design, each cell and node links to the chapter that proves it. One framing governs the whole sheet, and it is the most useful thing to carry away: every property of a commit-log system is either structural (baked into "partitioned, replicated, ordered, append-only", you cannot tune it away, only design around it) or tunable (a config or feature that moves a point in the tradeoff space). Knowing which is which is the entire skill.
1 · The design dials
Eight dials cover almost every architectural decision. Each one buys something by spending something else, there are no free positions, only choices about which axis you can afford to give up. Owner = who sets it; the binding tradeoff is the property you sacrifice when you turn it toward "more."
| Dial | Range / key configs | Turn it up to gain… | …and you spend | Class | Owns it |
|---|---|---|---|---|---|
| Consistency | acks 0 → 1 → all; min.insync.replicas 1 → 2 → RF | Stronger commit guarantee: at acks=all+minISR=2 every ack lands on ≥2 brokers before the producer hears back | Write latency & availability, the leader waits for the full ISR, and below the floor it refuses the write (goes read-only) | tunable | Producer + topic |
| Durability | replication.factor 1 → 3 → 5; broker.rack spread | More copies across more failure domains → lower loss probability | Storage & cross-AZ bandwidth, each replica is a full copy written and replicated (RF=3 ⇒ 2× ingress crosses zones) | tunable | Topic |
| Latency | batch.size, linger.ms 0 → 100; compression.type; fetch.min.bytes | Lower per-record latency (small/zero linger; fetch.min.bytes=1) | Throughput & CPU/$, tiny batches amortize nothing; the inverse (big batches) trades latency for throughput. Pick at most two of {latency, throughput, durability} | tunable | Producer + consumer |
| Ordering scope | partition count + partition key (1 partition = global order) | Finer concurrency: more independent ordered streams | Global order, Kafka orders only within a partition, ever. Total order forces partition count = 1 (no parallelism) | structural | Topic design |
| Throughput / parallelism | num.partitions (producer leaders & consumer slots) | More parallel leaders writing, more consumers draining (≤ 1 per partition per group) | Control-plane & per-broker overhead, FDs, ~2 mmap/partition, fetcher load, rebalance time, election windows; throughput peaks ~100 and latency rises past ~1,000 partitions | partly structural | Topic + cluster |
| Cost | RF; retention.ms/.bytes; tiered storage (KIP-405); compression; fetch-from-follower (KIP-392) | Lower $: shorter retention, fewer copies, S3 for cold data, compressed bytes, same-AZ reads | Durability / history / cross-AZ floor, and tiered storage cuts storage only, never networking; a produce+replication cross-AZ floor remains | tunable | Topic + cluster |
| Availability | unclean.leader.election.enable false → true | Stay writable when the whole ISR is gone (elect any live replica) | Committed data, a lagging replica becomes leader and truncates everything it never received; any UncleanLeaderElectionsPerSec>0 is loss | tunable | Topic |
| Delivery semantics | enable.idempotence; transactions / EOS; consumer commit policy | Exactly-once within Kafka read-process-write; dedup of producer retries | Latency & scope, transactional overhead; EOS does not extend across an external DB (that is at-least-once + outbox) | tunable | Producer + app |
Tunable dials move a point in the tradeoff space; structural dials define the space. You can buy durability with RF, latency with linger.ms, cost with retention, those are negotiations. You cannot buy global ordering, sub-partition parallelism, per-message TTL/priority, or point-by-key lookup at any setting, because they contradict "partitioned, ordered, append-only log." When a requirement lands on a structural axis, the answer is never a config, it is a different topic design or a different system. See bp02 · Design Decisions and bp03 · Inherent Limits.
The tuning triangle, made literal
The three tunable performance dials, latency, throughput, durability, form a triangle you can optimize for at most two vertices of. This is not folklore; it is mechanism: linger.ms↑ trades latency for throughput (bigger batches amortize per-request cost), acks=all trades latency for durability (wait for the ISR), and compression trades CPU for network/disk. LinkedIn measured synchronous acks=-1 replication roughly halving single-producer throughput (821,557 → 421,823 rec/s), the durability vertex pulling against throughput, made numeric (empirical: Jay Kreps / LinkedIn, 2014).
linger.ms≈0, fetch.min.bytes=1, small batches, acks=1batch.size↑, linger.ms 5–100, compression, large fetchesacks=all, RF=3, minISR=2, rack-spread2 · The four decision trees
The four questions an architect asks, as compact flows. Each terminal node is an answer; red terminals are "stop, wrong tool / wrong fix."
2a · Should I use a log at all?
The log is the right architecture when you need ordered, durable, replayable history that many independent consumers read at their own pace. It is the wrong tool for point lookups, synchronous request/reply, per-message routing/priority, and per-message-parallel task queues, those fight the structure (share groups KIP-932 address the last). Full reasoning in bp01 · When To Use.
(Kafka has no random access, materialize INTO one)
(request-reply over a log re-couples what you decoupled)
(Kafka retention is topic-level; no priority/routing)
classic groups head-of-line-block on per-partition order
event backbone, CDC, stream processing, replay
As a database, no key lookups, no ad-hoc SQL, no multi-key ACID; Kafka replicates into specialized stores, it does not replace them (empirical: Kreps; Waehner). As RPC, a throughput-optimized log has no native reply channel; request-reply "returns the coupling we were trying to avoid." As a routing broker, no per-message TTL, priority, or content-based routing (that is RabbitMQ's domain). As a per-message task queue, per-partition order + one consumer per partition means "throughput collapses to the speed of the slowest message"; a poison pill stalls everything behind it. Share groups (Part I · 15) address the last one explicitly. These are intentional log-semantics tradeoffs, not bugs (empirical: ops-blueprint reference §8).
2b · How many partitions?
The sizing formula is N = max(T/Tp, T/Tc), target throughput over the measured single-partition produce and consume rates, and the binding term is almost always the consumer, because consumption parallelism is hard-capped at the partition count. Then add headroom, because resharding a keyed topic is a migration, not a config change. Full mechanism in Part II · 03 Partitioning.
, you can't reshard keyed in place
, growing in place is safe
spread across brokers, don't inflate count
You cannot decrease partitions (the controller rejects it by construction), and growing a keyed topic remaps keys via murmur2(key) % N, a key at 7654321 moves from partition 1 (N=4) to 3 (N=6), splitting its history across two logs and breaking per-key order, compaction, and co-partitioned Streams joins (empirical: Confluent; Arpit Bhayani). Treat keyed-topic partition count as near-immutable; over-partition modestly up front. And remember: more partitions never fixes a hot key, that key still hashes to one partition; redesign or salt it instead. See Part II · 03.
2c · What replication factor and min.insync.replicas?
Durability is three dials owned by three actors: acks (producer's request), min.insync.replicas (broker's pre-append floor), replication.factor (topology). The canonical setting RF=3 / minISR=2 / acks=all / unclean=false is the unique point that survives one broker loss both losslessly and writably: minISR=2 is exactly one above the survivable failure count of 1. Eligible Leader Replicas KIP-966 (GA in this build) further narrows safe recovery. Full arithmetic and source in Part II · 06 Durability.
survives 1 loss lossless + writable; 2nd ⇒ read-only
any single loss ⇒ read-only (strict, less available)
broker.rack) · enable ELR (KIP-966) · verify minISR ≤ RFflush.ms on the critical topic, shrinks the page-cache loss window at a latency costThe pre-append gate if (inSyncSize < minIsr && requiredAcks == -1) throw NotEnoughReplicasException (Partition.scala:1234) fires before the record is appended, so an unaccepted write can never be "lost", it was never in the log. With minISR=2, every accepted acks=all write sat on ≥2 brokers at ack time; lose one and a copy remains. RF=3 supplies the third copy purely as availability headroom. Failure independence dominates replication factor: RF=3 across 3 AZs gives loss probability ≈ p³ ≈ 10⁻⁹, but RF=3 in one AZ collapses to ≈ q ≈ 10⁻³ when a single power event takes all three, a million times worse at identical RF. Spend the durability budget on rack spread before more replicas. And the honest caveat: acks=all means "in the page cache of every ISR member," not fsynced, Kafka bets N independent replicas beat one fsynced copy, which is correct for every failure except simultaneous power loss of the whole ISR (empirical: Kreps/LinkedIn; Vanlightly). See Part II · 06 and Part I · 08.
2d · How do I scale to N events/sec?
Kafka does not scale by one dial; it scales by which constraint binds first, and that constraint moves outward by an order of magnitude at each step: at ~1M/s your own consumers bind, at ~10M/s the network (×RF) and page cache bind, at ~100M/s you hit structural ceilings and the answer becomes "more clusters, not a bigger cluster." The levers that appear by tier are fetch-from-follower KIP-392, tiered storage KIP-405, and federation via MirrorMaker 2 KIP-382. Full tier-by-tier math in Part II · 11 Scaling.
~6 brokers, ~100–200 partitions, cluster idle
watch:
records-lag-max, URPraise
num.replica.fetchers 1→4–8; KIP-392 fetch-from-followerwatch:
NetworkProcessorAvgIdlePercent<0.4, IsrShrinksPerSectiered storage + federate (MM2); LinkedIn runs ~80M/s across 100+ clusters
watch:
EventQueueTimeMs, LastAppliedRecordLagMs, $/GBReplication is the network tax. Every byte a producer writes is read RF−1 times by followers before any consumer reads it, at RF=3 a leader's egress is ≥2× its ingress just to stay replicated, exposed as ReplicationBytesOutPerSec (BrokerTopicMetrics.java:37). This single multiplier is why the network, not disk, not CPU, becomes the wall at 10M/s and the dominant cost (50–90% of a cloud bill) at 100M/s. Tiered storage (KIP-405) cuts storage but not this networking; only fetch-from-follower (consumer term) and diskless/object-store designs (produce+replication term) attack it. See Part II · 11 and Part II · 10 Cost.
3 · The heuristics table
The rules of thumb gathered across both parts, each with its one-line mechanism. These are planning defaults, not guarantees, measure, then deviate deliberately.
| # | Heuristic | Why (mechanism) | Where |
|---|---|---|---|
| H1 | partitions = max(T/Tp, T/Tc); plan ~10 MB/s/partition | Topic throughput is the sum of per-leader rates; the slower of produce/consume binds | op03 |
| H2 | Size partitions for the consumer, not the disk | Consumption parallelism is hard-capped at partition count (≤1 consumer/partition/group) | op03 |
| H3 | Throughput peaks ~100 partitions; latency rises past ~1,000, don't over-partition | Per-partition overhead (fetch fan-out, metadata, election) dominates past the knee (empirical: Instaclustr) | op02 |
| H4 | Treat keyed-topic partition count as immutable; over-partition ~2× up front | hash(key) % N remaps on resize → broken order/joins; you also can't decrease N | op03 |
| H5 | RF=3 / minISR=2 / acks=all / unclean=false is the durable-and-available default | minISR=2 is one above the survivable failure (1) → one loss is lossless + writable | op06 |
| H6 | Verify min.insync.replicas ≤ replication.factor on every topic | effectiveMinIsr silently caps minISR at RF, RF=1 + minISR=2 still accepts single-copy writes | op06 |
| H7 | Spread RF across racks/AZs (broker.rack) before adding replicas | Independence changes the exponent (p³); RF only changes the base, correlated copies are worthless | op06 |
| H8 | Durability is replication, not fsync, leave flush off; spread the ISR instead | Per-batch fsync destroys tail latency; N replicas in page cache beat one fsynced copy | op06 |
| H9 | Plan egress ≥ 2× ingress; budget cross-AZ at ingress×(RF−1) + ⅔ produce + ⅔ consume | Replication multiplies every byte by RF−1; clients hit a remote leader ~⅔ of the time | op04, op10 |
| H10 | Keep JVM heap ~6 GB; give the rest of RAM to the OS page cache | Kafka reads via zero-copy sendfile from page cache; a big heap only adds GC pauses | op05 |
| H11 | Set linger.ms>0 only when the produce rate fills batch.size in the window | On low-rate streams linger adds pure latency; at linger=0 batches averaged ~1.2 KB vs a 16 KB cap (empirical: Confluent) | op05 |
| H12 | Compress at the producer (lz4 default, zstd for ratio), cuts storage + replication + fetch at once | Kafka stores/replicates compressed batches; one CPU spend, three byte savings (compression default is none, not zstd) | op05, op10 |
| H13 | Raise fetch.min.bytes on low-throughput-topic consumers | Default 1 is tuned for latency not cost; many tiny fetches burn broker CPU (a one-app fix cut cluster CPU 15%, empirical: New Relic) | op05 |
| H14 | Cross 10M/s ⇒ raise num.replica.fetchers 1→4–8 first | One fetcher thread/source broker serializes replication of thousands of partitions | op11 |
| H15 | Deploy fetch-from-follower (KIP-392) when cross-AZ shows up in the bill | Same-AZ reads zero the consumer cross-AZ term (~⅔ of consumer traffic); +up to ~500 ms latency, broker skew | op10, op11 |
| H16 | Tiered storage (KIP-405) for cold data, but it cuts storage only, never networking | Decouples retention from broker disk; the common misconception is that it lowers the cross-AZ bill | I·05, op10 |
| H17 | Cap clusters ≤ ~200 brokers; federate beyond with MirrorMaker 2 | Bounds recovery time and blast radius; 100M/s is a multi-cluster topology by construction (empirical: LinkedIn 100+ clusters) | op09, op11 |
| H18 | For per-message parallelism without order, use share groups (KIP-932), not more partitions | Classic groups give per-partition order + one consumer/partition → head-of-line blocking on a slow message | I·15 |
| H19 | Cross-system "exactly-once" = outbox + CDC + idempotent consumers, not Kafka EOS | EOS is within Kafka read-process-write only; outbox writes {row + event} in one local txn, accepts at-least-once | I·14, bp04 |
| H20 | Per-broker ceiling: FDs ≥ 100,000; vm.max_map_count ≥ 262,144 (default ~32k partitions/broker) | Each partition holds open segment files + ~2 mmap index regions; defaults cap density on Linux | op02 |
4 · The tactics toolkit
The reusable engineering moves, the parts of Kafka's design worth stealing for any system, independent of Kafka. Full treatment in bp04 · Tactics Toolkit; this is the quick-list with the transferable principle.
| Tactic | What Kafka does | Transferable principle, steal this |
|---|---|---|
| The log as system of record | Immutable totally-ordered events are primary; tables/indexes/caches are derived, replayable projections (database "inside out") | Make the write-ahead log a first-class citizen; derive every queryable view from it so state is always reconstructable by fold(log) |
| Replication over fsync | Durability from N independent page-cache replicas, not a synchronous disk flush on the write path | For machine-level failures, cross-node redundancy is a stronger and faster guarantee than single-node fsync, spend on failure independence |
| Partition = unit of everything | One key (parallelism, ordering, placement, recovery) shards the system; appends need no cross-shard coordination → linear scale | Pick one sharding key that simultaneously buys parallelism and ordering; coordination-free appends are what make throughput scale linearly |
| Zero-copy + page cache | Reads served from OS page cache via sendfile; no user-space copy, no app-managed cache | Lean on the kernel's cache instead of building your own; sequential append + OS readahead beats a hand-rolled buffer pool |
| Pull-based consumers | Consumers poll at their own pace and track their own offset; the broker is "dumb," the consumer "smart" | Push backpressure and position-tracking to the consumer → natural flow control, independent replay, no broker-side per-consumer state |
| The ISR / high-watermark gate | Commit = "at or below the high watermark"; HW advances only when the ISR meets the floor, admission control before the append | Reject unsafe writes at the door, not after; a record never accepted can never be lost, cheaper than compensating later |
| Idempotent producer + epochs | Sequence numbers dedup retries; transactional.id epochs fence zombie/duplicate writers | Make retries safe by construction (sequence + dedup) and fence stale actors with monotonic epochs, turns at-least-once into effectively-once |
| Stream–table duality | A table is the aggregate of a changelog; a stream is the changelog of a table (KStream/KTable, CDC) | Treat state and its change-history as two views of one thing; CDC the DB log to bridge mutable state into the event world without dual-writes |
| Replay / Kappa | Retained log lets you reprocess history by starting a new job from offset 0 and swapping outputs, no second code path | Make recompute a first-class operation: one logic path, replay to rebuild views or fix bugs, instead of a parallel batch pipeline (Lambda) |
| Tier hot/cold storage | Short local retention for the hot tail, object store for cold (KIP-405), decouples retention from compute and shrinks recovery | Separate the working set from the archive; keep only the hot tail on fast/expensive media, push cold data to cheap storage |
Kafka's whole design is one bet repeated: make the simple, sequential, append-only path the fast path, and derive everything else from it. Ordering, durability via replication, zero-copy reads, replay, and stream/table duality all fall out of that one commitment. When you design your own system, find the equivalent simplest-possible primitive (often: an ordered log of decisions) and let queryable state, caches, and recovery be deterministic projections of it. That is the transferable core of the blueprint, see bp00 · The Log Pattern and bp05 · Evolution.
5 · Reference panel, choosing & tuning a log-based architecture
One panel that ties the sheet together: when the log fits, where it inherently falls short, which lever attacks which cost, and how Kafka sits against the alternatives. Use it as the final sanity check before committing.
5a · Fit & misfit at a glance
| The log is the RIGHT choice when… | The log is the WRONG choice when… (and what wins) |
|---|---|
| You need an ordered, durable, replayable record of what happened | You need point lookup by key / ad-hoc query / multi-key ACID → a database (materialize into one) |
| Many independent consumers read the same stream at their own pace | You need a synchronous reply to the caller → RPC/gRPC (request-reply re-couples) |
| You want decoupled fan-out and integration (one source, N sinks) | You need per-message TTL, priority, or content routing → RabbitMQ |
| You need CDC / event sourcing / stream processing / replay | You need per-message-parallel work, order irrelevant → queue / share groups |
| Throughput is high and per-partition ordering by key is enough | You need global total order at scale → forces 1 partition (no parallelism); rethink the requirement |
| You want history retained as the system of record (tiered/infinite retention) | You need sub-millisecond per-message latency at low load → a push broker (Kafka is pull-based) |
5b · Inherent limits vs what mitigates them
The honest part: separate what is structural (you design around it) from what a feature or config moves. Mitigated ≠ removed, note where a floor remains. Full analysis in bp03 · Inherent Limits.
| Limitation | Class | Mitigated / tuned by | Residual (what's left) |
|---|---|---|---|
| Ordering is per-partition, never global | structural | Partition-by-key for partition-local order | No global order without 1 partition, inherent |
| Consumption parallelism ≤ partition count | structural | Over-provision partitions; share groups (KIP-932) for queue-style work | Per-group cap remains for ordered consumption |
| No per-message TTL / priority / routing | structural | Topic-level retention; app-level routing into sub-topics | By-design log semantics, not a roadmap item |
| No random access / point-by-key lookup | structural | Compaction (last value per key); materialize into a KV store | Log is sequential; lookups live in derived stores |
| Latency floor (pull + no per-write fsync default) | tunable | Small linger/fetch.min.bytes, tuned serialization, fast disks/NIC (tier-1 bank hit <5 ms p99) | A real floor remains; push brokers beat it at low load |
| Partition count ceiling | tunable (was structural) | KRaft removed the ZK O(partitions) failover; lab 2M KIP-500 | Per-broker FD/mmap/fetcher ceilings still bind (~32k/broker default) |
| Cross-AZ replication cost (50–90% of cloud bill) | partly structural | Fetch-from-follower (KIP-392) zeroes consumer term; compression; tiered storage (storage only) | Produce + replication cross-AZ floor, only diskless/object-store removes it |
| Rebalance disruption | tunable (was structural) | Cooperative KIP-429, broker-side protocol KIP-848, static membership | Largely historical; old eager clients still pay it |
| Page-cache loss window (no fsync) | tunable | Rack/AZ spread (makes it multi-event); small flush.ms if regulated | Correlated power loss of whole ISR, bounded, not zero |
5c · Cost levers, what each one actually attacks
Networking is ≥50% of a multi-AZ cloud Kafka bill (80–90% after tiered storage). The levers, ordered by effort, and crucially which cost each one touches (empirical: ops-blueprint reference App. C). See Part II · 10 Cost.
| Lever | Attacks | Effect | Tradeoff / caveat |
|---|---|---|---|
| Producer compression (lz4 / zstd) | storage + replication + fetch | ~10× on text/JSON, near-free; one CPU spend, three byte savings | Per-batch, needs batch depth; little gain on pre-compressed/binary |
| Fetch-from-follower (KIP-392) | consumer cross-AZ | ~all consumer cross-AZ removed; up to ~50% total cluster cost | +up to ~500 ms tail; broker skew; does NOT touch produce/replication |
| Tune RF & retention | storage + replication cross-AZ | Linear cut (RF 3→2 on non-critical; shorter retention) | Lower RF reduces fault tolerance |
| Tiered storage (KIP-405) | storage only | 30–90% storage cut (S3 ~$0.02 vs EBS ~$0.08/GiB-mo) | Does NOT reduce cross-AZ networking, the common misconception |
| Diskless / object-store (KIP-1150, WarpStream, AutoMQ) | replication + produce cross-AZ | Kills the cross-AZ floor (writes straight to S3); ~80% TCO cut claimed | +200–400 ms produce p99 (~1 s e2e); upstream KIP-1150 accepted ~2026 but not GA OSS |
With classical Kafka + fetch-from-follower you remove the consumer cross-AZ term, but produce + replication cross-AZ remains, in AutoMQ's $24k/mo example, ~$13.8k of it (empirical: AutoMQ). Tiered storage does nothing for it. Only leaderless object-store designs (diskless) attack the produce+replication floor, and they trade ~200–400 ms of latency to do it. Pick the lever that matches the cost line you actually have, and don't expect tiered storage to fix a networking bill. See Part II · 10.
5d · Kafka vs the alternatives, when each wins
The log blueprint has several implementations; they differ mainly in where compute and storage sit and what they trade. All cross-system performance numbers are vendor benchmarks unless independently attributed, directional, check durability/JVM/duration parity (empirical: ops-blueprint reference App. D; Vanlightly).
| System | Model | Ordering | Wins when… | Gives up |
|---|---|---|---|---|
| Apache Kafka | Brokers co-own compute + storage; local disk + page cache + zero-copy; tiered to S3 | Per-partition | You want highest throughput-per-partition + replay/storage on commodity infra | Cross-AZ cost, latency floor, no per-message TTL/priority, partition ceiling |
| Apache Pulsar | Stateless brokers; compute/storage separated via BookKeeper | Per-partition (+ keys) | You need independent storage scaling + seamless capacity add | Operational complexity (ZK + BookKeeper + brokers); own caches, not page cache |
| Redpanda | Single C++ binary, thread-per-core, no JVM, Raft | Per-partition | You want simpler ops (no JVM/ZK), Kafka-API compatible | 10×/3× claims reverse on equal hardware (empirical: Vanlightly), directional |
| Amazon Kinesis | Managed, metered-shard model | Per-shard | Spiky/low-volume, zero-ops, pay-per-shard | Hard 1 MB/s per shard, 500-shard default, ~2–3× pricier at sustained scale |
| Google Pub/Sub | Managed, global, auto-scaling; no shards to manage | None by default (opt-in keys) | You want hands-off elastic global scale | Less throughput-per-unit control; ordering is opt-in and separate |
| RabbitMQ | Smart broker / dumb consumer; exchange routing; per-message state | Per-queue (priority breaks it) | You need rich routing, per-message TTL + priority, low-load latency | Lower throughput; degrades >~30 MB/s; per-message state is the cost |
| WarpStream | Diskless / S3-native; stateless agents, no inter-broker replication, leaderless | Per-partition (control plane) | Cost is the priority; no local disk, no cross-AZ replication; Kafka-API | ~400 ms produce p99 / ~1 s e2e, not for sub-100 ms workloads |
Every vendor benchmark is advocacy. Before trusting a headline number, verify: equivalent durability (not log.flush.interval.messages=1 forcing per-batch fsync on Kafka, not "no-journal" Pulsar vs acks=all Kafka), Java 17+ (not 11, which hurts Kafka especially with TLS), consistent offset-commit cadence, Coordinated Omission corrected, and a long run (12–36h, tail latency drifts). Jack Vanlightly's independent re-run of Redpanda's own OMB on identical hardware reversed the headline: Kafka often matched or beat it (~1,900 vs 1,400 MB/s at NVMe saturation), and Redpanda's "Kafka doesn't fsync so it's unsafe" claim is simply false reasoning, Kafka trades single-node fsync for cross-node replication by design (empirical: Vanlightly). See bp06 · Comparative.
5e · The page-immediately signals (the on-call companion)
The architecture is only as good as your ability to see it failing. The binary health gauges that map to the dials above, alert on these, in this order. Thresholds and JMX names in Part II · 08 Metrics; runbooks in Part II · 07 Failure Modes.
| Signal | Alert when | Means (which dial / mechanism) |
|---|---|---|
UncleanLeaderElectionsPerSec | any > 0 | Active data loss, a lagging replica became leader and truncated (availability dial cost) |
OfflinePartitionsCount | > 0 | Partitions with no leader, unavailable now |
UnderMinIsrPartitionCount | > 0 | ISR below the floor, acks=all producers are getting NotEnoughReplicas (consistency dial engaged) |
UnderReplicatedPartitions | > 0 sustained | A copy is missing, constant ≈ broker down; fluctuating ≈ performance bottleneck (durability dial degraded) |
ActiveControllerCount (cluster sum) | ≠ 1 | 0 = no controller; >1 = split-brain (control plane) |
RequestHandler/NetworkProcessorAvgIdlePercent | < ~0.3 | Saturation, too few IO/network threads or NIC-bound (throughput dial wall) |
Consumer records-lag-max | growing over a window | A consumer is slower than the produce rate, bounded by partition count to drain (parallelism dial) |
EventQueueTimeMs / LastAppliedRecordLagMs | rising fleet-wide | Controller backlog / metadata propagation lag, the 100M-tier failure mode (structural scaling wall) |
Start at the dial or decision tree that matches your question, take the terminal answer, then jump to the linked chapter for the mechanism and source. The discipline that makes you fast: classify every requirement as tunable or structural first. Tunable → find the dial and accept its tradeoff. Structural → it is a topic-design or system-choice decision, never a config. The rest of Part III builds the judgement behind each box here, bp00 · The Log Pattern (why the log), bp01 · When To Use (fit), bp02 · Design Decisions (the dials in depth), bp03 · Inherent Limits (the structural walls), bp04 · Tactics (what to steal), bp05 · Evolution (where it's going), and bp06 · Comparative (the alternatives), and Part II grounds every number in operations.