krivaltsevich.com Kafka Internals4.4

III · 05 · Architectural Evolution as Case Studies

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.

A long-lived distributed system is not designed once; it is re-decided dozens of times under the constraint that it can never stop, never lose data, and never break a client that was written years ago. Kafka's twelve-year history is an unusually clean archive of those re-decisions, because almost every one shipped as a numbered KIP with a written rationale and a reversible (or deliberately irreversible) migration path. This chapter reads six of them as engineering case studies, ZooKeeper → KRaft, the v0→v2 record format, exactly-once, the rebalance protocol, tiered storage, and share groups, and for each asks the same three questions: what forced the change, what was actually built, and what reusable lesson generalizes to your own system. Underneath the six stories sit five meta-tactics that are the real transferable payload: feature flags as a version ratchet, SPIs to extend without forking, backward compatibility as a hard constraint, the absorb-vs-externalize complexity decision, and multi-release deprecation discipline. Treat Kafka here not as a product to copy but as a worked exam in how to evolve infrastructure that thousands of teams depend on.

The meta-thesis

Every change below was constrained by the same invariant: the upgrade had to be performable on a running cluster, one node at a time, by an operator who could not coordinate every client. That single constraint, not elegance, not raw performance, explains the shape of each design (the dual-write phase, the feature flag, the down-converter, the two-protocol coexistence). When you evolve your own system, the binding constraint is almost never "is the new design better"; it is "can I get from here to there without a flag day." Design the migration first.

A twelve-year timeline of irreversible decisions

Before the individual cases, the sweep. Each milestone below either added a capability the log did not have or replaced a subsystem the log could no longer afford. Read it as a single trajectory: Kafka started as a thin partitioned log with an external coordinator and a naive on-disk format, and incrementally absorbed consensus, strong guarantees, elastic storage, and a second consumption model, each time paying down the constraint that had begun to bind.

0.8 · partitioned log + ZooKeeper ZK control plane 0.11 · KIP-98 v2 batch + idempotence + txns EOS core 2.4 · KIP-429 cooperative rebalance incremental rebalance 2.5 · KIP-447 scalable EOS offsets EOS per app-instance 2.8–3.3 · KIP-500/631 KRaft GA self-managed quorum 3.6 · KIP-405 tiered storage GA decoupled retention 4.0 · KIP-848 broker-driven rebalance + ZK removed server-side coordination 4.x · KIP-932 share groups queue model
Kafka's architecture as an evolution timeline. Each transition is a subsystem either replaced (ZK→KRaft) or added (EOS, tiered storage, share groups) without a flag day. Release numbers are approximate GA milestones; the lesson is the sequence of binding constraints, not the exact version.
state = the cluster's dominant capability after that milestone; transition labels name the KIP and release that delivered it. Time flows left→right.
The six case studies at a glance, the forcing function, the change, and the one-line lesson.
CaseKIP(s)What was bindingThe architectural moveTransferable lesson
ZK → KRaftKIP-500, 631, 853External coordinator capped partitions and slowed failoverReplace an external dependency with self-managed Raft consensusOwn your coordination plane when it sits on the critical path
Record format v0→v2KIP-98Per-message overhead; no place for idempotence fieldsEvolve the on-disk/on-wire format with down-conversion for old clientsVersion the format with a magic byte; make compatibility the broker's job
Exactly-onceKIP-98, 447At-least-once core forced dedup onto every consumerLayer idempotence + transactions on top of the existing logAdd strong guarantees as an opt-in layer, not a rewrite
Rebalance protocolKIP-429, 848Stop-the-world rebalance halted all consumers on any changeIncremental cooperative → broker-driven reconciliation by epochRemove a global barrier in stages, each backward-compatible
Tiered storageKIP-405Retention bounded by aggregate local diskTwo pluggable SPIs decouple cold-segment storage from the brokerExtend via a narrow interface (SPI) instead of forking the core
Share groupsKIP-932Per-partition ordering blocked queue-style fan-outAdd a second consumption model beside the log, not inside itAdd a new access pattern without disturbing the existing one

Case 1, ZooKeeper → KRaft: owning the coordination plane KIP-500

The problem: a borrowed brain on the critical path

For its first decade Kafka delegated all hard state, topic and partition definitions, ISR membership, leadership, broker registrations, configs, ACLs, ELR, to an external Apache ZooKeeper ensemble, and a single elected broker (the controller) acted as the bridge, reading ZooKeeper and pushing LeaderAndIsr and UpdateMetadata RPCs to every broker. This worked, and it let Kafka ship in 2011 without building consensus. But by construction it put a second distributed system, with its own operational model, its own failure modes, and its own scaling ceiling, directly on Kafka's control path. Part I The KRaft Controller documents what the old design cost and why it was replaced.

The forcing function was metadata scale. On ZooKeeper, controller failover was O(partitions): a new controller had to read all partition state out of ZooKeeper before it could act. The empirical reference pins this precisely, reloading metadata for 100,000 partitions dropped from 28 s (Kafka 1.0.0) to 14 s (1.1.0) after KIP-227 batched the controller's writes, but it remained linear in partition count, which is exactly why the era's guidance capped a cluster at roughly 200,000 partitions "to accommodate the rare event of a hard failure of the controller" (source: Confluent, "Apache Kafka Supports 200K Partitions Per Cluster"). Controlled shutdown told the same story, 5 brokers / 50,000 partitions took 6.5 minutes on 1.0.0, cut to 3 seconds on 1.1.0, but the structural O(n) dependency on ZooKeeper remained. ZooKeeper was a borrowed brain that could not be made fast enough.

Why the ZK design eventually had to go (the mechanism)

The deep problem was not ZooKeeper's speed; it was that the controller cached state by reading an external store and raced against watchers. Recovery meant re-reading, and the truth lived in two places (ZK and the controller's memory) that could diverge. KRaft inverts this: the controller is the source of truth, a deterministic state machine over an ordered, replicated log (Part I KRaft Consensus), so recovery is replay, not re-read, and standby controllers are already hot because they have been replaying the same log all along. The design rationale in KRaft Controller states it directly: the old ZK controller "raced with watchers and pushed full RPCs"; the KRaft controller is a log-replicated state machine and brokers pull incremental deltas (Part I Metadata Propagation).

The change: build the consensus you depend on

KIP-500 replaced ZooKeeper with a self-managed metadata quorum, a small set of controller nodes (typically 3 or 5) running a Raft implementation over the internal __cluster_metadata topic, realized as the event-driven quorum controller of KIP-631. The Raft leader becomes the active controller; the rest are standbys replaying the log. The benefits are exactly the inverse of the costs above: one system to operate instead of two, hot-standby failover (a standby promotes in milliseconds because its in-memory image is already current), and a metadata ceiling raised roughly 100×, Instaclustr created ~600,000 partitions on a single KRaft broker and stable lab operation at ~2 million per cluster (source: Instaclustr, "KRaft Part 3").

But "build the consensus you depend on" is a heavy bill. Kafka had to write, test, and harden a Raft implementation (KIP-595), a dynamic-reconfiguration story for the quorum itself (KIP-853, covered in KRaft Controller), snapshotting to bound the metadata log, and, the genuinely hard part, a live migration from ZooKeeper that could not lose a single record. That migration (KIP-866) ran the cluster in a dual-write mode: the KRaft controller wrote to both the new metadata log and the old ZooKeeper store simultaneously, so the cluster could be rolled forward incrementally and rolled back to ZK if the new path misbehaved, until the operator committed. The vestige is still visible in 4.4: ZkMigrationStateRecord survives as a deliberate no-op in MetadataDelta.replay, retained solely so clusters that migrated under 3.x can roll forward (Part I Metadata Propagation). ZooKeeper itself was removed entirely in 4.0.

Before, ZooKeeper era
controller caches state by reading ZK; recovery = re-read; truth lives in two places; failover is O(partitions)
external dependency~200k partition ceiling
During, KIP-866 dual-write migration
KRaft controller writes to both the metadata log and ZK; reversible; commit when soaked
rollback path preserved
After, KRaft
controller is the log; recovery = replay; standbys hot; ceiling ~100× higher
self-managed quorumms failover
The ZK→KRaft transition is a textbook dependency replacement: the dual-write middle phase is what made an irreversible-looking change reversible until the operator chose to commit.
control plane / coordination   migration / metadata log   represents an irreversible commit once the operator finalizes.
Where "own your coordination plane" is the WRONG lesson

Kafka was right to internalize Raft because coordination was on its hot path, it had the engineering capacity, and the dependency had become the scaling ceiling. Do not generalize this to "always build your own consensus." For most systems, an external etcd/ZooKeeper/Consul is the correct choice: you get a battle-tested implementation for free, and your coordination volume is nowhere near the ceiling that forced Kafka's hand. The honest decision rule is below, internalize only when the dependency is simultaneously on the critical path, the binding constraint, and within your competence to build and operate forever. Owning consensus means owning its bugs at 3 a.m. for the life of the product.

Should you internalize an external dependency?(coordination, storage, scheduling…)
Is it on the request-critical path?does its latency/availability bound yours?
Keep it external, integration cost only
Is the dependency the binding constraint?scaling ceiling, failover time, op burden
Tune/shard the dependency first
Can you build AND operate it forever?team capacity, on-call, correctness proof
Do NOT internalize, buy/borrow
Internalize (Kafka's KRaft choice)
The internalize-vs-externalize decision tree, abstracted from KIP-500. Kafka answered yes·yes·yes; most systems should not. The expensive failure mode is answering "yes" to the last gate when you cannot actually carry the operational weight.
internalize outcome   keep-external outcome   anti-pattern outcome   decision node   red/error edge.

Case 2, Record format v0→v2: evolving a format that's already on disk KIP-98

The problem: every record paid full freight

Kafka's original on-disk unit was a flat message (LegacyRecord, magic byte 0 or 1) prefixed by a 12-byte log overhead. Part I The Record Format dissects the weaknesses: in v0/v1 an uncompressed message set was a sequence of standalone messages, one per offset, and each repeated its own CRC, magic, attributes, and (in v1) an 8-byte absolute timestamp. There was no batch header in which to put anything shared, and, critically, nowhere to carry the producer-identity fields that idempotence and transactions would require. A compressed batch was a single "wrapper" message whose value was itself a compressed message set, and iterating it required decompressing the whole wrapper to recover absolute offsets, because inner offsets were stored relative to the wrapper's last offset. The format was simple and expensive: per-record overhead at high record counts, and no extension point for new semantics.

The change: a self-describing batch with delta-encoded records

KIP-98 introduced the v2 record batch (DefaultRecordBatch, magic byte 2). The architectural insight is a clean separation of weight: the batch became the heavyweight, self-describing unit (one header, one CRC-32C, one compression frame, and the producerId/producerEpoch/baseSequence triple that EOS needs), while each record became a maximally-compact body storing only an offset delta and timestamp delta from bases held once in the header, zig-zag varint encoded. Part I The Record Format shows the byte layout: the 8-byte absolute offset and 8-byte timestamp are paid once per batch; each additional record costs a byte or two before the record region is optionally compressed. The same header that delivered compactness is the one that made idempotence and transactions possible, the producer fields had a home.

What changed v0/v1 → v2, and why each change was load-bearing for later features.
Dimensionv0 / v1 (legacy)v2 (KIP-98)Consequence
UnitFlat message; uncompressed set = N standalone messagesBatch header + N delta-encoded inner recordsPer-record overhead amortized across the batch
OffsetsAbsolute per message; compressed inner offsets relative to wrapper's lastbaseOffset once; per-record offsetDelta varintTiny records; lastOffset = baseOffset + lastOffsetDelta computed, never stored
Timestamp8-byte absolute per message (v1)baseTimestamp + per-record deltaFurther compaction; log-append-time override is one field
Identity fieldsNone, no place for themproducerId, producerEpoch, baseSequence in headerEnabled idempotence + transactions (Case 3)
ChecksumPer-message CRCOne CRC-32C over the batch, excluding partitionLeaderEpochBroker can stamp leader epoch without recomputing CRC

The hard constraint: old clients still on the wire

You cannot change a format that is already on millions of disks and in millions of deployed clients with a flag day. KIP-98's compatibility tactic is the transferable part. First, the magic byte: every batch self-identifies its version in a fixed position, so a reader parses the magic before interpreting anything else, the format is versioned in the data, not out of band. Second, down-conversion: when a consumer too old to understand v2 fetches a v2 batch, the broker rewrites it back into the v0/v1 the client expects, on the fetch path. This preserves old clients at a real cost, down-conversion happens on the broker, burns CPU, and forfeits zero-copy (the broker must touch and transform the bytes rather than sendfile them straight from page cache; Part I The Fetch Path). Compatibility was bought with broker cycles, deliberately.

Tactic, version the format in the data, push compatibility to the server

The reusable pattern: (1) put a version discriminator (magic byte) at a fixed offset so any reader can branch before parsing; (2) make the newest component (the broker) responsible for serving the oldest reader, via on-the-fly conversion, so old clients need zero changes; (3) accept that conversion costs the server something (here, CPU and zero-copy) and treat removing old formats as a multi-year deprecation, not a release. This is the same shape as HTTP content negotiation and protobuf field numbers, the discriminator lives with the payload, and the new side absorbs the cost of the old.

Where down-conversion bites in production

Down-conversion is correct but not free, and the cost is invisible until a fleet of legacy consumers hits a high-throughput topic: broker CPU climbs and the fetch path loses zero-copy precisely when you can least afford it. The operational lesson, keep clients current so the broker can sendfile v2 bytes untouched, is the reason Part II Lifecycle Operations treats client upgrades as part of cluster hygiene, not an afterthought. A compatibility shim is a bridge, not a destination; budget to retire it.

Case 3, Exactly-once: strong guarantees as an opt-in layer KIP-98 KIP-447

The problem: at-least-once pushed dedup onto everyone

Kafka's core delivery semantic is at-least-once: a producer that retries after a network hiccup can write a duplicate, and a consumer that crashes after processing but before committing will reprocess. For many pipelines that is fine. For exactly-once read-process-write, the canonical "consume from topic A, transform, produce to topic B, commit input offset", it forced every application to implement its own idempotent dedup, correctly, forever. That is precisely the kind of cross-cutting correctness burden a platform should absorb once rather than externalize to thousands of teams.

The change: two composing layers, neither a rewrite

The architectural move (Part I Transactions & EOS) is that exactly-once was built as two layers stacked on the unchanged at-least-once log, not as a new storage engine. The lower layer is the idempotent producer: every batch carries (producerId, producerEpoch, baseSequence), the v2 header fields from Case 2, and the partition leader's ProducerStateManager deduplicates retries by checking that sequence numbers are strictly contiguous per (PID, epoch), absorbing retries in a 5-batch window and rejecting gaps with OutOfOrderSequenceException. That gives exactly-once-in-order to a single partition with essentially zero throughput cost, it is a sequence check on a batch that was being written anyway. The upper layer is transactions: a stable transactional.id maps to a PID with epoch fencing of zombies; a TransactionCoordinator drives a two-phase commit over the internal __transaction_state log and writes transaction markers (control records) into every partition the transaction touched; and consumers reading at read_committed filter aborted data using the Last Stable Offset and a per-segment aborted-transaction index.

Layer 2, Transactions (KIP-98)
atomic multi-partition writes + offset commit; two-phase commit over __transaction_state; markers; LSO-gated read_committed
opt-in via transactional.id
↓ builds on ↓
Layer 1, Idempotent producer (KIP-98)
(producerId, epoch, baseSequence) dedup in ProducerStateManager; strictly-increasing per-partition sequence
opt-in via enable.idempotence
↓ builds on ↓
Layer 0, At-least-once log (unchanged)
append-only partitioned log; the v2 batch header carries the identity fields the upper layers need
full throughput preserved
Exactly-once is a layered opt-in, not a rewrite. Each layer is independently switchable, and the base log is untouched, applications that don't need EOS pay nothing.
transaction layer   idempotence layer   base log   "builds on" = the upper layer reuses the lower's mechanism, not a parallel store.

The throughput-vs-guarantee balance, and the KIP-447 scaling fix

The crucial discipline is that strong guarantees were added without surrendering the throughput that defines Kafka. Idempotence is a near-free sequence check. Transactions cost a little, two extra coordinator round-trips to begin/commit and the marker writes, amortized across a whole batch of records, so the per-record overhead at realistic batch sizes is small. KIP-447 then removed the original scaling cliff: the first EOS design required one transactional producer per input partition (so offsets could be fenced), which exploded the producer count for a Streams app over a many-partition topic. KIP-447 changed sendOffsetsToTransaction to take the full ConsumerGroupMetadata (group id, generation, member id), letting the group coordinator fence offset commits by generation, so a single transactional producer can serve a partition and EOS scales per application instance rather than per input partition (Part I Group Coordination). That is the difference between EOS being a curiosity and being usable at Streams scale.

EOS as a tunable space, what each guarantee costs, and where it stays at-least-once.
SettingGuaranteeCostInherent limit
enable.idempotence=trueNo duplicates per partition on retry; in-orderCaps max.in.flight at 5; one sequence checkPer-partition only; not multi-partition atomic
+ transactions (transactional.id)Atomic multi-partition write + offset commit2 coordinator round-trips/txn + markers; LSO read latencyEOS is within Kafka read-process-write only
+ KIP-447 offsetsScales to one producer per app instanceCoordinator fences offset commits by generationStill bounded by partition assignment
External sink (DB + Kafka)At-least-once + eventual (outbox/CDC)Idempotent consumer requiredKafka EOS does NOT span an external system
The boundary the layer cannot cross

Kafka's exactly-once is exactly-once within the Kafka read-process-write loop, it does not, and cannot, give exactly-once across an external database. Writing to Postgres and producing to Kafka is a dual write; the correct pattern is the transactional outbox (write the business row and an outbox row in one local ACID transaction, then tail the DB log via CDC to Kafka), which yields at-least-once plus eventual consistency, not 2PC, and demands idempotent consumers (source: Morling / Debezium, "Reliable Microservices Data Exchange With the Outbox Pattern"). The lesson: an opt-in guarantee layer is bounded by the system whose primitives it is built on. Know exactly where the boundary is, and do not let "exactly-once" leak into a sentence about an external store.

Case 4, Rebalance: removing a global barrier in stages KIP-429 KIP-848

The problem: stop-the-world on every membership change

In the original ("classic") consumer protocol, group membership and partition assignment were negotiated through a JoinGroup/SyncGroup barrier orchestrated by the broker but with the assignment computed by an elected client leader (Part I Group Coordination). The fatal property was eager rebalancing: on any change, a consumer joining, leaving, or simply being slow, every member revoked all of its partitions, the leader recomputed a full assignment, and everyone re-acquired. The whole group stopped consuming during the rebalance. At scale, with deploys rolling consumers and occasional GC pauses tripping session timeouts, this produced "rebalance storms" where each rebalance triggered the next.

The change: two stages, each backward-compatible

Kafka did not fix this in one jump; it removed the barrier in two compatible stages, which is the transferable lesson.

  • Stage 1, cooperative incremental (KIP-429, Kafka 2.4). Still the classic protocol, but a new client-side assignor (CooperativeStickyAssignor) changes revocation behaviour: a consumer keeps the partitions it will retain and only revokes the ones actually being moved, across two rebalances. Only moved partitions pause; the rest keep consuming. Critically, it advertises both protocols (supportedProtocols() returns [COOPERATIVE, EAGER]), so a group can roll from eager to cooperative one consumer at a time, no flag day. This is the first attack on stop-the-world: it shrinks the blast radius without changing the protocol's shape.
  • Stage 2, broker-driven (KIP-848, GA Kafka 4.0). The barrier itself is removed. JoinGroup/SyncGroup are replaced by a single repeated ConsumerGroupHeartbeat RPC that doubles as join, sync, poll-for-assignment, and liveness. There is no client leader and no client-side assignor: the broker computes the target assignment with a server-side assignor (uniform or range), and each member incrementally reconciles toward it. Coordination becomes a conversation about three epochs, group epoch (bumped when inputs change), target-assignment epoch (when the assignor last ran), and per-member epoch (how far each member has reconciled), that replace the classic generation. No global synchronization point remains.
Consumer ACoordinator (broker)Consumer B
Heartbeat(memberEpoch)
target assignment + groupEpoch
Heartbeat(memberEpoch)
its slice + groupEpoch
reconciled → bump memberEpoch
KIP-848: assignment is a per-member conversation about epochs, not a synchronized round. A and B reconcile independently; neither waits on the other, so there is no stop-the-world point. Contrast the classic protocol, where every member blocks at the SyncGroup barrier.
consumer   broker coordinator   arrows are RPCs over time (left→right); each consumer's exchange is independent.
Why two stages instead of one

KIP-848 could in principle have been the whole answer, but shipping it in one jump would have demanded that every client in the world upgrade before any group benefited, and it would have stranded the millions of consumers that couldn't move. Cooperative rebalancing (KIP-429) delivered most of the operational relief years earlier within the classic protocol, and because it advertised both assignors, a group migrated incrementally. KIP-848 then finished the job with a new protocol that coexists with the classic one (the classic path remains the fallback for older clients and for components not yet ported). The general tactic: remove a global barrier in compatible increments, each of which is shippable and rolls one node at a time, rather than betting the migration on a single coordinated cutover. Reported result: KIP-848 rebalances run up to roughly 20× faster (source: Confluent, "KIP-848: A New Consumer Rebalance Protocol").

"Stop-the-world rebalance" is now largely a historical critique

Comparisons that still indict Kafka for stop-the-world rebalances are describing eager assignors on old clients. With CooperativeStickyAssignor (default-capable since 2.4) or the KIP-848 protocol (GA in 4.0), the global pause is gone (source: empirical reference §9). When you read a critique of any mature system, check whether it targets a design the system has already evolved past, and when you evolve, leave the old path working so the critique stops being true incrementally rather than all at once.

Case 5, Tiered storage: extending via an SPI, not a fork KIP-405

The problem: retention chained to local disk

Without tiered storage, a topic's retention is bounded by the aggregate local disk of the brokers hosting its replicas. Wanting to keep 30 days of a high-throughput topic meant provisioning (and paying for, and replicating across AZs) 30 days of expensive local SSD, even though the vast majority of that data is cold and read rarely, if ever. Worse, storage and compute were welded together: you scaled disk by adding brokers you didn't need for CPU, and broker recovery had to re-replicate cold data on rejoin (Part I Tiered Storage).

The change: two narrow SPIs decouple cold bytes from the broker

KIP-405 decoupled retention from local disk by letting the broker offload sealed (non-active) segments to an external object store and keep only a small hot tail locally. The architecturally interesting decision is how: rather than baking S3 (and GCS, and Azure Blob, and HDFS, and every future store) into the broker, Kafka defined two pluggable SPIs orchestrated by the broker-side RemoteLogManager (Part I Tiered Storage):

  • RemoteStorageManager (RSM), moves bytes. A five-method interface: copy a segment's data and indexes, fetch a segment (by byte range), fetch one index, and delete. Copy and delete must be idempotent; transient failures are signalled with a RetriableRemoteStorageException. The contract requires only eventual consistency of the object store.
  • RemoteLogMetadataManager (RLMM), tracks which segment covers which offsets/epochs, with strongly consistent semantics. The default implementation, TopicBasedRemoteLogMetadataManager, persists metadata as records in the internal __remote_log_metadata topic, Kafka storing its own tiering metadata in Kafka.
Why split the contract into two interfaces with different consistency demands

This is the subtle, transferable part. Object stores are cheap and durable but only eventually consistent and slow to list; you must not make correctness depend on an S3 bucket listing. So Kafka split the responsibilities by their consistency needs: the RSM owns data with eventual-consistency requirements, while the RLMM owns metadata with strong consistency, and the broker always updates RLMM metadata around every RSM data operation. The durable source of truth about "what exists in remote" is the metadata, never the object-store listing (Part I Tiered Storage). The general tactic: when you extend a system across a boundary with weak guarantees (object storage), keep a strongly-consistent index you control on the strong side, and treat the weak side as a content-addressed blob bag.

Broker core, RemoteLogManager
orchestrates copy/expire/read tasks; never knows what S3 is
↓ narrow SPI ↓
RLMM (strongly consistent)
which segment ↔ which offsets/epochs; default backed by __remote_log_metadata
source of truth
+
RSM (eventually consistent)
copy / fetch-range / delete segment bytes; idempotent; pluggable: S3, GCS, Azure, HDFS…
content-addressed blobs
Tiered storage extends the broker through two SPIs split by consistency need. The core stays storage-agnostic; vendors implement RSM/RLMM without forking Kafka.
broker core   metadata SPI (strong)   data SPI (eventual)   "narrow SPI" = the only surface a plugin must implement.
Tactic, SPI as the seam against fragmentation

Before KIP-405, every cloud vendor maintained a fork of Kafka to add their object store (Confluent, Aiven, and others each had one). A narrow, well-specified SPI converts forks into plugins: vendors implement two interfaces, the core stays single-sourced, and competition moves to implementation quality instead of divergent codebases. The general rule for extending infrastructure you don't want to fork: find the narrowest interface that captures the variability (here, "move bytes" + "track metadata"), specify its consistency contract explicitly, and load implementations dynamically (Kafka even isolates plugin dependencies with a child-first classloader). The same pattern recurs in Kafka Connect's SourceConnector/SinkConnector and in the pluggable assignors of Case 4.

What tiered storage does NOT fix (and the misconception that follows)

Tiered storage cuts storage cost (S3 at ~$0.02/GiB-mo vs EBS ~$0.08–0.10) and decouples storage from compute, but it does not reduce cross-AZ networking, which is 70–90% of a high-throughput cloud Kafka bill and can become an even larger share after tiering shrinks the storage line (source: Confluent, "Uncovering Kafka's Hidden Infrastructure Costs"; empirical reference §6, §9). A reader who hears "tiered storage saves money" and concludes "so my cross-AZ replication cost drops" is wrong; the networking lever is fetch-from-follower (KIP-392) or a diskless/object-store design (KIP-1150, still in-flight upstream, accepted ~March 2026 but not production-ready OSS). An extension SPI solves the problem it was scoped to and no other; name the boundary so the cost model stays honest. Note also that KIP-405 reached GA only in 3.6, "infinite retention" is a recent capability, not an original Kafka property.

Case 6, Share groups: a new access pattern beside the log KIP-932

The problem: per-partition ordering blocks queue fan-out

The log's defining strength, strict per-partition ordering consumed by exactly one member of a group per partition, is also the reason Kafka is a poor task queue. Parallelism within a consumer group is capped at the partition count, and a single slow or poison-pill message causes head-of-line blocking that stalls every later message in its partition (source: empirical reference §8). For genuine queue workloads, competing consumers, work that can be processed in any order, redelivery on failure, the log semantics fight you. Historically the answer was "use RabbitMQ for that," or bolt on the Confluent Parallel Consumer.

The change: add the model, don't mutate the log

KIP-932 added share groups: many share consumers cooperatively read the same partitions, with records delivered individually under a time-bounded acquisition lock rather than tracked by a single committed offset (Part I Share Groups). Each record carries a delivery state (Available → Acquired → Acknowledged/Archived) and a delivery count, giving at-least-once queue semantics with redelivery, max-attempt limits, and an optional dead-letter queue. The crucial architectural restraint: this was added beside the log, not inside it. The partition log, its append-only semantics, its replication, its retention, all unchanged. Share-group bookkeeping lives in a new in-memory engine on the partition leader (SharePartition), durably backed by a new internal topic (__share_group_state) through a coordinator, entirely separate from the offset-commit machinery of classic consumer groups.

Consumer group vs share group, two consumption models over the same log.
PropertyConsumer group (the log)Share group (KIP-932)
Unit of bookkeepingOne committed offset per partitionIndividual record (delivery state + count)
Consumers per partitionExactly one per groupMany, concurrency decoupled from partition count
OrderingStrict per-partitionNone, queue semantics
Failure handlingRewind the offset (coarse)Per-record redelivery, max attempts, DLQ
Head-of-line blockingYes, slow record stalls partitionNo, records acquired independently
State location__consumer_offsets__share_group_state (separate)
Tactic, add an access pattern without disturbing the existing one

The temptation when a system can't do X is to bend its core until it can, and break everyone relying on the core's current behaviour. KIP-932 instead introduced a parallel read model with its own state, its own RPCs (ShareFetch/ShareAcknowledge), and its own coordinator, leaving the log untouched. Existing consumer groups don't notice share groups exist. The general rule: when you need a fundamentally different access pattern over the same data, prefer a new model sharing the substrate over a mutation of the existing model. You keep both contracts intact and let users choose per workload, log semantics where order matters, queue semantics where throughput-of-competing-consumers matters.

The substrate still constrains the new model

Share groups relax ordering and decouple parallelism from partition count, but they do not turn Kafka into RabbitMQ. There is still no per-message TTL, no priority, and no content-based routing, those are deliberate log-semantics omissions, not gaps to be patched (source: empirical reference §9). And share-group state, while durably backed, is held and reconstructed on the partition leader, so it inherits Kafka's partition-leadership failure model. A new access pattern over an existing substrate inherits the substrate's constraints; it widens the system's reach without escaping its physics. If you need rich routing and per-message priority, RabbitMQ is still the right tool, adding a model is not the same as becoming a different system.

The five meta-lessons: how to evolve infrastructure safely

Stepping back from the six cases, the same five tactics recur. These are the transferable payload, the techniques you can carry to any long-lived system that must change while running.

1, Feature flags as a version ratchet (metadata.version) KIP-584

Every behaviour-changing or format-changing capability above is gated by a feature level, with metadata.version (KIP-584) as the master flag, and Transaction Version, kraft.version, and others as siblings. Two properties make it safe (Part I KRaft Controller, Part II Lifecycle Operations). First, a level can only be finalized when every node supports it: FeatureControlManager.reasonNotSupported() walks all registered brokers and controllers and refuses the upgrade if any reports a range that excludes the new level, the compile-time guard against a mixed-version cluster enabling a record format an old node cannot parse. Second, it is a one-way ratchet across any record-format change: the source says in capitals "ONCE A METADATA VERSION IS PRODUCTION, IT CANNOT BE CHANGED," and a downgrade is refused unless the controller can prove no record format actually changed between the two versions. This is what lets the binary roll and the behaviour switch be separated, you upgrade software reversibly, soak it, and only then cross the irreversible feature gate.

Roll binaries (one node at a time)reversible, old behaviour still active
Do ALL nodes support target level?kafka-features describe
Blocked, finish the roll first
Does the level flip a record format?
Reversible bump, can downgrade
Irreversible ratchet, commit
The metadata.version ratchet (KIP-584). The cardinal gate is "all nodes support it"; the cardinal trap is irreversibility once a record-format-changing level commits. Never bump in the same window as the binary roll.
reversible binary step   reversible / gated   irreversible commit   blocked edge   cylinder = durable, committed state.
The ratchet invariant

A feature flag that gates an on-disk format must be (a) monotonic, you can only move forward across a format change, because the alternative is deleting metadata you cannot reconstruct; and (b) quorum-gated, finalizable only when every participant can speak the new level. Kafka enforces both in FeatureControlManager. If you build version flags into your own system, copy this exactly: refuse the upgrade until all nodes support it, and refuse the downgrade if it would lose information. A flag that can be toggled freely is not a safe-evolution tool; it is a footgun.

2, SPIs to extend without forking

Tiered storage (RSM/RLMM), Connect (source/sink connectors), the pluggable server-side assignors of KIP-848, and the share-group persister are all service provider interfaces: narrow contracts that let third parties extend Kafka without modifying or forking it. The discipline is to make the interface as small as the variability requires and to specify its semantics, especially consistency and idempotence, in the contract, not in tribal knowledge. The payoff is that the core stays single-sourced while the ecosystem of implementations grows independently. The anti-pattern an SPI prevents is the vendor fork, which fragments the project and strands users on divergent versions.

3, Backward compatibility as a hard constraint

The magic byte and down-conversion (Case 2), the dual-protocol coexistence in rebalancing (Case 4), the both-assignors advertisement of CooperativeStickyAssignor, and the retained-no-op ZkMigrationStateRecord (Case 1) are all the same discipline: a new design must interoperate with the old one during the entire migration window, because you cannot upgrade every node and every client atomically. This costs real engineering, a down-converter, a compatibility shim, a fallback path, and the cost is the price of admission for evolving a system that thousands of teams depend on. The mark of mature infrastructure is that the old client written years ago still works against the new broker.

4, Absorb vs externalize complexity

Two of the cases are decisions about where complexity should live, and they point in opposite directions, which is the lesson. KRaft (Case 1) absorbed complexity that had been externalized to ZooKeeper, because coordination was on the critical path and the dependency was the binding constraint. Tiered storage (Case 5) externalized complexity to an object store behind an SPI, because cold storage is a commodity better bought than built. There is no universal rule; the correct direction depends on whether the thing is on your critical path, whether it is your binding constraint, and whether you can carry its operational weight forever (the decision tree in Case 1). Exactly-once (Case 3) is a third variant: it absorbed the dedup complexity that had been externalized to every application, a platform earning its keep by solving a cross-cutting correctness problem once.

Absorb vs externalize, the same decision, resolved differently per case.
CaseDirectionWhat movedWhy that direction
KRaftAbsorbConsensus, from ZooKeeper into the brokerOn the critical path; the binding scaling constraint
Exactly-onceAbsorbDedup, from every app into the platformCross-cutting correctness; solve once
Tiered storageExternalizeCold storage, from broker disk to object storeCommodity; cheaper to buy than build/operate
Share groupsAbsorb (additively)Queue semantics, from external brokers into KafkaDemand was high; added beside the log, not inside

5, Multi-release deprecation discipline

Nothing was removed abruptly. ZooKeeper was deprecated in 3.5 and only removed in 4.0, with the KIP-866 dual-write migration spanning the 3.x line and a no-op record retained even after removal so late migrators could still roll forward. The classic rebalance protocol remains the fallback for older clients even after KIP-848 went GA. Old record formats are still down-converted. The discipline is to give users multiple releases between "deprecated" and "removed," with a working migration path the whole time, never a single release that yanks a capability out from under a running fleet. Part II Lifecycle Operations operationalizes this from the operator's side.

The architect's takeaway

If you remember one thing from this chapter, make it the order of operations for evolving a live system: (1) design the migration before the destination, the dual-write phase, the down-converter, the both-protocols coexistence are not afterthoughts, they are the actual deliverable; (2) gate every behaviour change behind a quorum-checked, monotonic feature flag so the binary roll and the behaviour switch are separable; (3) extend through the narrowest possible SPI rather than forking, and specify its consistency contract; (4) decide absorb-vs-externalize per subsystem against the critical-path / binding-constraint / can-I-operate-it test, not by ideology; and (5) deprecate across releases with a working path throughout. Kafka's twelve years are a long, public proof that these five tactics let infrastructure change continuously without a flag day, and that skipping any one of them is how a "simple upgrade" becomes an outage.

Where to go next

The evolution stories here are the how of change; the rest of Part III is the when and what. For the structural properties these changes were defending or extending, see The Log as a Pattern and Inherent Limits; for the decision frameworks that mirror Case 1's tree, see When to Use a Log and Design Decisions; for the reusable tactics distilled here as a toolkit, see The Tactics Toolkit; and for how Kafka's choices compare to systems that made different ones (Pulsar's compute/storage split, Redpanda's thread-per-core, WarpStream's diskless), see Comparative Systems. The operator's view of these same migrations, rolling upgrades, the feature ratchet in practice, reassignment, is Part II Lifecycle Operations.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.