krivaltsevich.com Kafka Internals4.4

III · 07 · The Architect's Cheat Sheet

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.

This is the page you bookmark. Everything in Parts II and III collapses here into a single decision surface: the dials you turn and what each one trades; the four decision trees that answer the recurring questions, should I use a log at all? how many partitions? what replication factor and floor? how do I scale to N?; the heuristics with their one-line mechanism; the tactics worth stealing for any system; and one consolidated reference panel for choosing and tuning a log-based architecture. Minimal prose by design, each cell and node links to the chapter that proves it. One framing governs the whole sheet, and it is the most useful thing to carry away: every property of a commit-log system is either structural (baked into "partitioned, replicated, ordered, append-only", you cannot tune it away, only design around it) or tunable (a config or feature that moves a point in the tradeoff space). Knowing which is which is the entire skill.

1 · The design dials

Eight dials cover almost every architectural decision. Each one buys something by spending something else, there are no free positions, only choices about which axis you can afford to give up. Owner = who sets it; the binding tradeoff is the property you sacrifice when you turn it toward "more."

DialRange / key configsTurn it up to gain……and you spendClassOwns it
Consistencyacks 0 → 1 → all; min.insync.replicas 1 → 2 → RFStronger commit guarantee: at acks=all+minISR=2 every ack lands on ≥2 brokers before the producer hears backWrite latency & availability, the leader waits for the full ISR, and below the floor it refuses the write (goes read-only)tunableProducer + topic
Durabilityreplication.factor 1 → 3 → 5; broker.rack spreadMore copies across more failure domains → lower loss probabilityStorage & cross-AZ bandwidth, each replica is a full copy written and replicated (RF=3 ⇒ 2× ingress crosses zones)tunableTopic
Latencybatch.size, linger.ms 0 → 100; compression.type; fetch.min.bytesLower per-record latency (small/zero linger; fetch.min.bytes=1)Throughput & CPU/$, tiny batches amortize nothing; the inverse (big batches) trades latency for throughput. Pick at most two of {latency, throughput, durability}tunableProducer + consumer
Ordering scopepartition count + partition key (1 partition = global order)Finer concurrency: more independent ordered streamsGlobal order, Kafka orders only within a partition, ever. Total order forces partition count = 1 (no parallelism)structuralTopic design
Throughput / parallelismnum.partitions (producer leaders & consumer slots)More parallel leaders writing, more consumers draining (≤ 1 per partition per group)Control-plane & per-broker overhead, FDs, ~2 mmap/partition, fetcher load, rebalance time, election windows; throughput peaks ~100 and latency rises past ~1,000 partitionspartly structuralTopic + cluster
CostRF; retention.ms/.bytes; tiered storage (KIP-405); compression; fetch-from-follower (KIP-392)Lower $: shorter retention, fewer copies, S3 for cold data, compressed bytes, same-AZ readsDurability / history / cross-AZ floor, and tiered storage cuts storage only, never networking; a produce+replication cross-AZ floor remainstunableTopic + cluster
Availabilityunclean.leader.election.enable false → trueStay writable when the whole ISR is gone (elect any live replica)Committed data, a lagging replica becomes leader and truncates everything it never received; any UncleanLeaderElectionsPerSec>0 is losstunableTopic
Delivery semanticsenable.idempotence; transactions / EOS; consumer commit policyExactly-once within Kafka read-process-write; dedup of producer retriesLatency & scope, transactional overhead; EOS does not extend across an external DB (that is at-least-once + outbox)tunableProducer + app
The single rule under every dial

Tunable dials move a point in the tradeoff space; structural dials define the space. You can buy durability with RF, latency with linger.ms, cost with retention, those are negotiations. You cannot buy global ordering, sub-partition parallelism, per-message TTL/priority, or point-by-key lookup at any setting, because they contradict "partitioned, ordered, append-only log." When a requirement lands on a structural axis, the answer is never a config, it is a different topic design or a different system. See bp02 · Design Decisions and bp03 · Inherent Limits.

The tuning triangle, made literal

The three tunable performance dials, latency, throughput, durability, form a triangle you can optimize for at most two vertices of. This is not folklore; it is mechanism: linger.ms↑ trades latency for throughput (bigger batches amortize per-request cost), acks=all trades latency for durability (wait for the ISR), and compression trades CPU for network/disk. LinkedIn measured synchronous acks=-1 replication roughly halving single-producer throughput (821,557 → 421,823 rec/s), the durability vertex pulling against throughput, made numeric (empirical: Jay Kreps / LinkedIn, 2014).

Latency vertex, linger.ms≈0, fetch.min.bytes=1, small batches, acks=1
lowest per-record delay; costs throughput (no amortization) and durability (un-replicated tail)
pick 2 of 3
Throughput vertex, batch.size↑, linger.ms 5–100, compression, large fetches
max bytes/sec; costs latency (the linger window) and CPU (compression)
pick 2 of 3
Durability vertex, acks=all, RF=3, minISR=2, rack-spread
every ack on ≥2 independent brokers; costs latency (ISR wait) and $ (copies + cross-AZ)
pick 2 of 3
The three tunable performance dials pull against each other. You can sit near two vertices; reaching the third sacrifices one of the others. Ordering and partition count are not on this triangle, they are structural and chosen separately.
latency-optimized   throughput-optimized   durability-optimized   = "tension between".

2 · The four decision trees

The four questions an architect asks, as compact flows. Each terminal node is an answer; red terminals are "stop, wrong tool / wrong fix."

2a · Should I use a log at all?

The log is the right architecture when you need ordered, durable, replayable history that many independent consumers read at their own pace. It is the wrong tool for point lookups, synchronous request/reply, per-message routing/priority, and per-message-parallel task queues, those fight the structure (share groups KIP-932 address the last). Full reasoning in bp01 · When To Use.

requirementwhat does the workload actually need?
need ordered, durable, replayable history with many independent readers?
use a database / KV store
(Kafka has no random access, materialize INTO one)
need synchronous reply to the caller?
use RPC / gRPC
(request-reply over a log re-couples what you decoupled)
need per-message TTL, priority, or content routing?
use RabbitMQ / a broker
(Kafka retention is topic-level; no priority/routing)
per-message parallel work, order doesn't matter?
use share groups (KIP-932) / a queue
classic groups head-of-line-block on per-partition order
USE THE LOG ✓
event backbone, CDC, stream processing, replay
Use-a-log decision tree. Every red exit is a documented anti-pattern; the log wins precisely when ordered durable replayable fan-out is the requirement and none of the disqualifiers apply.
decision   use the log   log-adjacent feature   wrong tool   path   qualified-yes.
The four anti-patterns, one sentence each

As a database, no key lookups, no ad-hoc SQL, no multi-key ACID; Kafka replicates into specialized stores, it does not replace them (empirical: Kreps; Waehner). As RPC, a throughput-optimized log has no native reply channel; request-reply "returns the coupling we were trying to avoid." As a routing broker, no per-message TTL, priority, or content-based routing (that is RabbitMQ's domain). As a per-message task queue, per-partition order + one consumer per partition means "throughput collapses to the speed of the slowest message"; a poison pill stalls everything behind it. Share groups (Part I · 15) address the last one explicitly. These are intentional log-semantics tradeoffs, not bugs (empirical: ops-blueprint reference §8).

2b · How many partitions?

The sizing formula is N = max(T/Tp, T/Tc), target throughput over the measured single-partition produce and consume rates, and the binding term is almost always the consumer, because consumption parallelism is hard-capped at the partition count. Then add headroom, because resharding a keyed topic is a migration, not a config change. Full mechanism in Part II · 03 Partitioning.

measure Tp, Tc at YOUR record size / compression / acksnever assume the ~10 MB/s planning number
T / Tp
T / Tc
N = max(T/Tp, T/Tc)
×2 growth headroom + divisor-rich (12, 24, 60)
, you can't reshard keyed in place
small/zero headroom
, growing in place is safe
N > ~1,000 per topic OR over per-broker budget?
back off: latency knee & FD/mmap ceiling
spread across brokers, don't inflate count
set N · set RF=3 · never ship defaults (1/1)
Partition-count decision tree. The consumer term usually binds; keyed topics get growth headroom because resharding is a new-topic migration; and the count is capped from above by the latency knee and per-broker ceilings.
produce term   consume term   result/action   back off   path   unkeyed branch.
The two partition facts that bite hardest

You cannot decrease partitions (the controller rejects it by construction), and growing a keyed topic remaps keys via murmur2(key) % N, a key at 7654321 moves from partition 1 (N=4) to 3 (N=6), splitting its history across two logs and breaking per-key order, compaction, and co-partitioned Streams joins (empirical: Confluent; Arpit Bhayani). Treat keyed-topic partition count as near-immutable; over-partition modestly up front. And remember: more partitions never fixes a hot key, that key still hashes to one partition; redesign or salt it instead. See Part II · 03.

2c · What replication factor and min.insync.replicas?

Durability is three dials owned by three actors: acks (producer's request), min.insync.replicas (broker's pre-append floor), replication.factor (topology). The canonical setting RF=3 / minISR=2 / acks=all / unclean=false is the unique point that survives one broker loss both losslessly and writably: minISR=2 is exactly one above the survivable failure count of 1. Eligible Leader Replicas KIP-966 (GA in this build) further narrows safe recovery. Full arithmetic and source in Part II · 06 Durability.

how much loss can this topic tolerate?
acknowledged data loss acceptable?
RF=2 or 3, acks=1 (or 0)
availability/cost over durability (e.g. Netflix Keystone)
must stay writable through one failure?
RF=3 · minISR=2 · acks=all · unclean=false
survives 1 loss lossless + writable; 2nd ⇒ read-only
RF=3 · minISR=3 · acks=all
any single loss ⇒ read-only (strict, less available)
spread RF across racks/AZs (broker.rack) · enable ELR (KIP-966) · verify minISR ≤ RF
small flush.ms on the critical topic, shrinks the page-cache loss window at a latency cost
Durability decision tree. The default answer is RF=3/minISR=2; the divergences are deliberate, lossy-tolerant telemetry relaxes it, strict-consistency tightens it to minISR=3, and rack spread is non-negotiable because it converts the page-cache window from a single-event certainty into a multi-event improbability.
decision/placement   durable setting   availability-leaning   path   lossy-tolerant branch.
Why these exact numbers (the mechanism, not folklore)

The pre-append gate if (inSyncSize < minIsr && requiredAcks == -1) throw NotEnoughReplicasException (Partition.scala:1234) fires before the record is appended, so an unaccepted write can never be "lost", it was never in the log. With minISR=2, every accepted acks=all write sat on ≥2 brokers at ack time; lose one and a copy remains. RF=3 supplies the third copy purely as availability headroom. Failure independence dominates replication factor: RF=3 across 3 AZs gives loss probability ≈ p³ ≈ 10⁻⁹, but RF=3 in one AZ collapses to ≈ q ≈ 10⁻³ when a single power event takes all three, a million times worse at identical RF. Spend the durability budget on rack spread before more replicas. And the honest caveat: acks=all means "in the page cache of every ISR member," not fsynced, Kafka bets N independent replicas beat one fsynced copy, which is correct for every failure except simultaneous power loss of the whole ISR (empirical: Kreps/LinkedIn; Vanlightly). See Part II · 06 and Part I · 08.

2d · How do I scale to N events/sec?

Kafka does not scale by one dial; it scales by which constraint binds first, and that constraint moves outward by an order of magnitude at each step: at ~1M/s your own consumers bind, at ~10M/s the network (×RF) and page cache bind, at ~100M/s you hit structural ceilings and the answer becomes "more clusters, not a bigger cluster." The levers that appear by tier are fetch-from-follower KIP-392, tiered storage KIP-405, and federation via MirrorMaker 2 KIP-382. Full tier-by-tier math in Part II · 11 Scaling.

target throughput T (events/s)which wall binds first?
T ≲ ~1M/s (~1 GB/s)?
bottleneck = YOUR consumers + partition count
~6 brokers, ~100–200 partitions, cluster idle
watch: records-lag-max, URP
T ≲ ~10M/s (~10 GB/s)?
bottleneck = NETWORK (×RF) + page-cache working set
raise num.replica.fetchers 1→4–8; KIP-392 fetch-from-follower
watch: NetworkProcessorAvgIdlePercent<0.4, IsrShrinksPerSec
bottleneck = STRUCTURAL: partition ceiling, controller metadata throughput, raw NIC, cross-AZ $$
tiered storage + federate (MM2); LinkedIn runs ~80M/s across 100+ clusters
watch: EventQueueTimeMs, LastAppliedRecordLagMs, $/GB
Scaling decision tree. Read it as "what changes," not "what's bigger": each 10× moves the binding constraint from your consumers, to the network and RAM, to the architecture itself. The final transition is structural, no single cluster does 100M/s.
target/decision   consumer-bound   resource-bound (tunable)   structurally bound (re-architect)   path.
The one scaling number to internalize

Replication is the network tax. Every byte a producer writes is read RF−1 times by followers before any consumer reads it, at RF=3 a leader's egress is ≥2× its ingress just to stay replicated, exposed as ReplicationBytesOutPerSec (BrokerTopicMetrics.java:37). This single multiplier is why the network, not disk, not CPU, becomes the wall at 10M/s and the dominant cost (50–90% of a cloud bill) at 100M/s. Tiered storage (KIP-405) cuts storage but not this networking; only fetch-from-follower (consumer term) and diskless/object-store designs (produce+replication term) attack it. See Part II · 11 and Part II · 10 Cost.

3 · The heuristics table

The rules of thumb gathered across both parts, each with its one-line mechanism. These are planning defaults, not guarantees, measure, then deviate deliberately.

#HeuristicWhy (mechanism)Where
H1partitions = max(T/Tp, T/Tc); plan ~10 MB/s/partitionTopic throughput is the sum of per-leader rates; the slower of produce/consume bindsop03
H2Size partitions for the consumer, not the diskConsumption parallelism is hard-capped at partition count (≤1 consumer/partition/group)op03
H3Throughput peaks ~100 partitions; latency rises past ~1,000, don't over-partitionPer-partition overhead (fetch fan-out, metadata, election) dominates past the knee (empirical: Instaclustr)op02
H4Treat keyed-topic partition count as immutable; over-partition ~2× up fronthash(key) % N remaps on resize → broken order/joins; you also can't decrease Nop03
H5RF=3 / minISR=2 / acks=all / unclean=false is the durable-and-available defaultminISR=2 is one above the survivable failure (1) → one loss is lossless + writableop06
H6Verify min.insync.replicas ≤ replication.factor on every topiceffectiveMinIsr silently caps minISR at RF, RF=1 + minISR=2 still accepts single-copy writesop06
H7Spread RF across racks/AZs (broker.rack) before adding replicasIndependence changes the exponent (p³); RF only changes the base, correlated copies are worthlessop06
H8Durability is replication, not fsync, leave flush off; spread the ISR insteadPer-batch fsync destroys tail latency; N replicas in page cache beat one fsynced copyop06
H9Plan egress ≥ 2× ingress; budget cross-AZ at ingress×(RF−1) + ⅔ produce + ⅔ consumeReplication multiplies every byte by RF−1; clients hit a remote leader ~⅔ of the timeop04, op10
H10Keep JVM heap ~6 GB; give the rest of RAM to the OS page cacheKafka reads via zero-copy sendfile from page cache; a big heap only adds GC pausesop05
H11Set linger.ms>0 only when the produce rate fills batch.size in the windowOn low-rate streams linger adds pure latency; at linger=0 batches averaged ~1.2 KB vs a 16 KB cap (empirical: Confluent)op05
H12Compress at the producer (lz4 default, zstd for ratio), cuts storage + replication + fetch at onceKafka stores/replicates compressed batches; one CPU spend, three byte savings (compression default is none, not zstd)op05, op10
H13Raise fetch.min.bytes on low-throughput-topic consumersDefault 1 is tuned for latency not cost; many tiny fetches burn broker CPU (a one-app fix cut cluster CPU 15%, empirical: New Relic)op05
H14Cross 10M/s ⇒ raise num.replica.fetchers 1→4–8 firstOne fetcher thread/source broker serializes replication of thousands of partitionsop11
H15Deploy fetch-from-follower (KIP-392) when cross-AZ shows up in the billSame-AZ reads zero the consumer cross-AZ term (~⅔ of consumer traffic); +up to ~500 ms latency, broker skewop10, op11
H16Tiered storage (KIP-405) for cold data, but it cuts storage only, never networkingDecouples retention from broker disk; the common misconception is that it lowers the cross-AZ billI·05, op10
H17Cap clusters ≤ ~200 brokers; federate beyond with MirrorMaker 2Bounds recovery time and blast radius; 100M/s is a multi-cluster topology by construction (empirical: LinkedIn 100+ clusters)op09, op11
H18For per-message parallelism without order, use share groups (KIP-932), not more partitionsClassic groups give per-partition order + one consumer/partition → head-of-line blocking on a slow messageI·15
H19Cross-system "exactly-once" = outbox + CDC + idempotent consumers, not Kafka EOSEOS is within Kafka read-process-write only; outbox writes {row + event} in one local txn, accepts at-least-onceI·14, bp04
H20Per-broker ceiling: FDs ≥ 100,000; vm.max_map_count ≥ 262,144 (default ~32k partitions/broker)Each partition holds open segment files + ~2 mmap index regions; defaults cap density on Linuxop02

4 · The tactics toolkit

The reusable engineering moves, the parts of Kafka's design worth stealing for any system, independent of Kafka. Full treatment in bp04 · Tactics Toolkit; this is the quick-list with the transferable principle.

TacticWhat Kafka doesTransferable principle, steal this
The log as system of recordImmutable totally-ordered events are primary; tables/indexes/caches are derived, replayable projections (database "inside out")Make the write-ahead log a first-class citizen; derive every queryable view from it so state is always reconstructable by fold(log)
Replication over fsyncDurability from N independent page-cache replicas, not a synchronous disk flush on the write pathFor machine-level failures, cross-node redundancy is a stronger and faster guarantee than single-node fsync, spend on failure independence
Partition = unit of everythingOne key (parallelism, ordering, placement, recovery) shards the system; appends need no cross-shard coordination → linear scalePick one sharding key that simultaneously buys parallelism and ordering; coordination-free appends are what make throughput scale linearly
Zero-copy + page cacheReads served from OS page cache via sendfile; no user-space copy, no app-managed cacheLean on the kernel's cache instead of building your own; sequential append + OS readahead beats a hand-rolled buffer pool
Pull-based consumersConsumers poll at their own pace and track their own offset; the broker is "dumb," the consumer "smart"Push backpressure and position-tracking to the consumer → natural flow control, independent replay, no broker-side per-consumer state
The ISR / high-watermark gateCommit = "at or below the high watermark"; HW advances only when the ISR meets the floor, admission control before the appendReject unsafe writes at the door, not after; a record never accepted can never be lost, cheaper than compensating later
Idempotent producer + epochsSequence numbers dedup retries; transactional.id epochs fence zombie/duplicate writersMake retries safe by construction (sequence + dedup) and fence stale actors with monotonic epochs, turns at-least-once into effectively-once
Stream–table dualityA table is the aggregate of a changelog; a stream is the changelog of a table (KStream/KTable, CDC)Treat state and its change-history as two views of one thing; CDC the DB log to bridge mutable state into the event world without dual-writes
Replay / KappaRetained log lets you reprocess history by starting a new job from offset 0 and swapping outputs, no second code pathMake recompute a first-class operation: one logic path, replay to rebuild views or fix bugs, instead of a parallel batch pipeline (Lambda)
Tier hot/cold storageShort local retention for the hot tail, object store for cold (KIP-405), decouples retention from compute and shrinks recoverySeparate the working set from the archive; keep only the hot tail on fast/expensive media, push cold data to cheap storage
The meta-tactic

Kafka's whole design is one bet repeated: make the simple, sequential, append-only path the fast path, and derive everything else from it. Ordering, durability via replication, zero-copy reads, replay, and stream/table duality all fall out of that one commitment. When you design your own system, find the equivalent simplest-possible primitive (often: an ordered log of decisions) and let queryable state, caches, and recovery be deterministic projections of it. That is the transferable core of the blueprint, see bp00 · The Log Pattern and bp05 · Evolution.

5 · Reference panel, choosing & tuning a log-based architecture

One panel that ties the sheet together: when the log fits, where it inherently falls short, which lever attacks which cost, and how Kafka sits against the alternatives. Use it as the final sanity check before committing.

5a · Fit & misfit at a glance

The log is the RIGHT choice when…The log is the WRONG choice when… (and what wins)
You need an ordered, durable, replayable record of what happenedYou need point lookup by key / ad-hoc query / multi-key ACID → a database (materialize into one)
Many independent consumers read the same stream at their own paceYou need a synchronous reply to the caller → RPC/gRPC (request-reply re-couples)
You want decoupled fan-out and integration (one source, N sinks)You need per-message TTL, priority, or content routing → RabbitMQ
You need CDC / event sourcing / stream processing / replayYou need per-message-parallel work, order irrelevant → queue / share groups
Throughput is high and per-partition ordering by key is enoughYou need global total order at scale → forces 1 partition (no parallelism); rethink the requirement
You want history retained as the system of record (tiered/infinite retention)You need sub-millisecond per-message latency at low load → a push broker (Kafka is pull-based)

5b · Inherent limits vs what mitigates them

The honest part: separate what is structural (you design around it) from what a feature or config moves. Mitigated ≠ removed, note where a floor remains. Full analysis in bp03 · Inherent Limits.

LimitationClassMitigated / tuned byResidual (what's left)
Ordering is per-partition, never globalstructuralPartition-by-key for partition-local orderNo global order without 1 partition, inherent
Consumption parallelism ≤ partition countstructuralOver-provision partitions; share groups (KIP-932) for queue-style workPer-group cap remains for ordered consumption
No per-message TTL / priority / routingstructuralTopic-level retention; app-level routing into sub-topicsBy-design log semantics, not a roadmap item
No random access / point-by-key lookupstructuralCompaction (last value per key); materialize into a KV storeLog is sequential; lookups live in derived stores
Latency floor (pull + no per-write fsync default)tunableSmall linger/fetch.min.bytes, tuned serialization, fast disks/NIC (tier-1 bank hit <5 ms p99)A real floor remains; push brokers beat it at low load
Partition count ceilingtunable (was structural)KRaft removed the ZK O(partitions) failover; lab 2M KIP-500Per-broker FD/mmap/fetcher ceilings still bind (~32k/broker default)
Cross-AZ replication cost (50–90% of cloud bill)partly structuralFetch-from-follower (KIP-392) zeroes consumer term; compression; tiered storage (storage only)Produce + replication cross-AZ floor, only diskless/object-store removes it
Rebalance disruptiontunable (was structural)Cooperative KIP-429, broker-side protocol KIP-848, static membershipLargely historical; old eager clients still pay it
Page-cache loss window (no fsync)tunableRack/AZ spread (makes it multi-event); small flush.ms if regulatedCorrelated power loss of whole ISR, bounded, not zero

5c · Cost levers, what each one actually attacks

Networking is ≥50% of a multi-AZ cloud Kafka bill (80–90% after tiered storage). The levers, ordered by effort, and crucially which cost each one touches (empirical: ops-blueprint reference App. C). See Part II · 10 Cost.

LeverAttacksEffectTradeoff / caveat
Producer compression (lz4 / zstd)storage + replication + fetch~10× on text/JSON, near-free; one CPU spend, three byte savingsPer-batch, needs batch depth; little gain on pre-compressed/binary
Fetch-from-follower (KIP-392)consumer cross-AZ~all consumer cross-AZ removed; up to ~50% total cluster cost+up to ~500 ms tail; broker skew; does NOT touch produce/replication
Tune RF & retentionstorage + replication cross-AZLinear cut (RF 3→2 on non-critical; shorter retention)Lower RF reduces fault tolerance
Tiered storage (KIP-405)storage only30–90% storage cut (S3 ~$0.02 vs EBS ~$0.08/GiB-mo)Does NOT reduce cross-AZ networking, the common misconception
Diskless / object-store (KIP-1150, WarpStream, AutoMQ)replication + produce cross-AZKills the cross-AZ floor (writes straight to S3); ~80% TCO cut claimed+200–400 ms produce p99 (~1 s e2e); upstream KIP-1150 accepted ~2026 but not GA OSS
The cross-AZ hard floor

With classical Kafka + fetch-from-follower you remove the consumer cross-AZ term, but produce + replication cross-AZ remains, in AutoMQ's $24k/mo example, ~$13.8k of it (empirical: AutoMQ). Tiered storage does nothing for it. Only leaderless object-store designs (diskless) attack the produce+replication floor, and they trade ~200–400 ms of latency to do it. Pick the lever that matches the cost line you actually have, and don't expect tiered storage to fix a networking bill. See Part II · 10.

5d · Kafka vs the alternatives, when each wins

The log blueprint has several implementations; they differ mainly in where compute and storage sit and what they trade. All cross-system performance numbers are vendor benchmarks unless independently attributed, directional, check durability/JVM/duration parity (empirical: ops-blueprint reference App. D; Vanlightly).

SystemModelOrderingWins when…Gives up
Apache KafkaBrokers co-own compute + storage; local disk + page cache + zero-copy; tiered to S3Per-partitionYou want highest throughput-per-partition + replay/storage on commodity infraCross-AZ cost, latency floor, no per-message TTL/priority, partition ceiling
Apache PulsarStateless brokers; compute/storage separated via BookKeeperPer-partition (+ keys)You need independent storage scaling + seamless capacity addOperational complexity (ZK + BookKeeper + brokers); own caches, not page cache
RedpandaSingle C++ binary, thread-per-core, no JVM, RaftPer-partitionYou want simpler ops (no JVM/ZK), Kafka-API compatible10×/3× claims reverse on equal hardware (empirical: Vanlightly), directional
Amazon KinesisManaged, metered-shard modelPer-shardSpiky/low-volume, zero-ops, pay-per-shardHard 1 MB/s per shard, 500-shard default, ~2–3× pricier at sustained scale
Google Pub/SubManaged, global, auto-scaling; no shards to manageNone by default (opt-in keys)You want hands-off elastic global scaleLess throughput-per-unit control; ordering is opt-in and separate
RabbitMQSmart broker / dumb consumer; exchange routing; per-message statePer-queue (priority breaks it)You need rich routing, per-message TTL + priority, low-load latencyLower throughput; degrades >~30 MB/s; per-message state is the cost
WarpStreamDiskless / S3-native; stateless agents, no inter-broker replication, leaderlessPer-partition (control plane)Cost is the priority; no local disk, no cross-AZ replication; Kafka-API~400 ms produce p99 / ~1 s e2e, not for sub-100 ms workloads
Read vendor benchmarks adversarially

Every vendor benchmark is advocacy. Before trusting a headline number, verify: equivalent durability (not log.flush.interval.messages=1 forcing per-batch fsync on Kafka, not "no-journal" Pulsar vs acks=all Kafka), Java 17+ (not 11, which hurts Kafka especially with TLS), consistent offset-commit cadence, Coordinated Omission corrected, and a long run (12–36h, tail latency drifts). Jack Vanlightly's independent re-run of Redpanda's own OMB on identical hardware reversed the headline: Kafka often matched or beat it (~1,900 vs 1,400 MB/s at NVMe saturation), and Redpanda's "Kafka doesn't fsync so it's unsafe" claim is simply false reasoning, Kafka trades single-node fsync for cross-node replication by design (empirical: Vanlightly). See bp06 · Comparative.

5e · The page-immediately signals (the on-call companion)

The architecture is only as good as your ability to see it failing. The binary health gauges that map to the dials above, alert on these, in this order. Thresholds and JMX names in Part II · 08 Metrics; runbooks in Part II · 07 Failure Modes.

SignalAlert whenMeans (which dial / mechanism)
UncleanLeaderElectionsPerSecany > 0Active data loss, a lagging replica became leader and truncated (availability dial cost)
OfflinePartitionsCount> 0Partitions with no leader, unavailable now
UnderMinIsrPartitionCount> 0ISR below the floor, acks=all producers are getting NotEnoughReplicas (consistency dial engaged)
UnderReplicatedPartitions> 0 sustainedA copy is missing, constant ≈ broker down; fluctuating ≈ performance bottleneck (durability dial degraded)
ActiveControllerCount (cluster sum)≠ 10 = no controller; >1 = split-brain (control plane)
RequestHandler/NetworkProcessorAvgIdlePercent< ~0.3Saturation, too few IO/network threads or NIC-bound (throughput dial wall)
Consumer records-lag-maxgrowing over a windowA consumer is slower than the produce rate, bounded by partition count to drain (parallelism dial)
EventQueueTimeMs / LastAppliedRecordLagMsrising fleet-wideController backlog / metadata propagation lag, the 100M-tier failure mode (structural scaling wall)
How to use this sheet

Start at the dial or decision tree that matches your question, take the terminal answer, then jump to the linked chapter for the mechanism and source. The discipline that makes you fast: classify every requirement as tunable or structural first. Tunable → find the dial and accept its tradeoff. Structural → it is a topic-design or system-choice decision, never a config. The rest of Part III builds the judgement behind each box here, bp00 · The Log Pattern (why the log), bp01 · When To Use (fit), bp02 · Design Decisions (the dials in depth), bp03 · Inherent Limits (the structural walls), bp04 · Tactics (what to steal), bp05 · Evolution (where it's going), and bp06 · Comparative (the alternatives), and Part II grounds every number in operations.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.