III · 05 · Architectural Evolution as Case Studies
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Architectural analysis grounded in the source-verified Part I and cited comparative sources.
A long-lived distributed system is not designed once; it is re-decided dozens of times under the constraint that it can never stop, never lose data, and never break a client that was written years ago. Kafka's twelve-year history is an unusually clean archive of those re-decisions, because almost every one shipped as a numbered KIP with a written rationale and a reversible (or deliberately irreversible) migration path. This chapter reads six of them as engineering case studies (ZooKeeper → KRaft, the v0→v2 record format, exactly-once, the rebalance protocol, tiered storage, and share groups) and for each asks the same three questions: what forced the change, what was actually built, and what reusable lesson generalizes to your own system. Underneath the six stories sit five meta-tactics that are the real transferable payload: feature flags as a version ratchet, SPIs to extend without forking, backward compatibility as a hard constraint, the absorb-vs-externalize complexity decision, and multi-release deprecation discipline. Treat Kafka here not as a product to copy but as a worked exam in how to evolve infrastructure that thousands of teams depend on.
Every change below was constrained by the same invariant: the upgrade had to be performable on a running cluster, one node at a time, by an operator who could not coordinate every client. That single constraint, not elegance, not raw performance, explains the shape of each design (the dual-write phase, the feature flag, the down-converter, the two-protocol coexistence). When you evolve your own system, the binding constraint is almost never "is the new design better"; it is "can I get from here to there without a flag day." Design the migration first.
A twelve-year timeline of irreversible decisions
Before the individual cases, the sweep. Each milestone below either added a capability the log did not have or replaced a subsystem the log could no longer afford. Read it as a single trajectory: Kafka started as a thin partitioned log with an external coordinator and a naive on-disk format, and incrementally absorbed consensus, strong guarantees, elastic storage, and a second consumption model, each time paying down the constraint that had begun to bind.
| Case | KIP(s) | What was binding | The architectural move | Transferable lesson |
|---|---|---|---|---|
| ZK → KRaft | KIP-500, 631, 853 | External coordinator capped partitions and slowed failover | Replace an external dependency with self-managed Raft consensus | Own your coordination plane when it sits on the critical path |
| Record format v0→v2 | KIP-98 | Per-message overhead; no place for idempotence fields | Evolve the on-disk/on-wire format with down-conversion for old clients | Version the format with a magic byte; make compatibility the broker's job |
| Exactly-once | KIP-98, 447 | At-least-once core forced dedup onto every consumer | Layer idempotence + transactions on top of the existing log | Add strong guarantees as an opt-in layer, not a rewrite |
| Rebalance protocol | KIP-429, 848 | Stop-the-world rebalance halted all consumers on any change | Incremental cooperative → broker-driven reconciliation by epoch | Remove a global barrier in stages, each backward-compatible |
| Tiered storage | KIP-405 | Retention bounded by aggregate local disk | Two pluggable SPIs decouple cold-segment storage from the broker | Extend via a narrow interface (SPI) instead of forking the core |
| Share groups | KIP-932 | Per-partition ordering blocked queue-style fan-out | Add a second consumption model beside the log, not inside it | Add a new access pattern without disturbing the existing one |
Case 1, ZooKeeper → KRaft: owning the coordination plane KIP-500
The problem: a borrowed brain on the critical path
For its first decade Kafka delegated all hard state (topic and partition definitions, ISR membership, leadership, broker registrations, configs, ACLs) to an external Apache ZooKeeper ensemble, and a single elected broker (the controller) acted as the bridge, reading ZooKeeper and pushing LeaderAndIsr and UpdateMetadata RPCs to every broker. This worked, and it let Kafka ship in 2011 without building consensus. But by construction it put a second distributed system, with its own operational model, its own failure modes, and its own scaling ceiling, directly on Kafka's control path. Part I The KRaft Controller documents what the old design cost and why it was replaced.
The forcing function was metadata scale. On ZooKeeper, controller failover was O(partitions): a new controller had to read all partition state out of ZooKeeper before it could act. The empirical reference pins this precisely: reloading metadata for 100,000 partitions dropped from 28 s (Kafka 1.0.0) to 14 s (1.1.0) after the controller moved to asynchronous, pipelined ZooKeeper requests (KAFKA-5642), but it remained linear in partition count, which is exactly why the era's guidance capped a cluster at roughly 200,000 partitions "to accommodate the rare event of a hard failure of the controller" (source: Confluent, "Apache Kafka Supports 200K Partitions Per Cluster"). Controlled shutdown told the same story: 5 brokers / 50,000 partitions took 6.5 minutes on 1.0.0, cut to 3 seconds on 1.1.0, but the structural O(n) dependency on ZooKeeper remained. ZooKeeper was a borrowed brain that could not be made fast enough.
The deep problem was not ZooKeeper's speed; it was that the controller cached state by reading an external store and raced against watchers. Recovery meant re-reading, and the truth lived in two places (ZK and the controller's memory) that could diverge. KRaft inverts this: the controller is the source of truth, a deterministic state machine over an ordered, replicated log (Part I KRaft Consensus), so recovery is replay, not re-read, and standby controllers are already hot because they have been replaying the same log all along. The design rationale in KRaft Controller states it directly: the old ZK controller "raced with watchers and pushed full RPCs"; the KRaft controller is a log-replicated state machine and brokers pull incremental deltas (Part I Metadata Propagation).
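The "recovery is replay, not re-read" idea can be illustrated with a toy state machine. This is a minimal sketch under stated assumptions: `LeaderChange` and `replay` are hypothetical names invented for illustration, and the record type is far simpler than anything in the real metadata log.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LeaderChange:
    """Hypothetical metadata record: set (or, with None, remove) a partition's leader."""
    partition: str
    leader: Optional[int]


def replay(log: Iterable[LeaderChange],
           image: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Apply records in log order to an in-memory image; same log, same image."""
    image = dict(image or {})
    for record in log:
        if record.leader is None:
            image.pop(record.partition, None)
        else:
            image[record.partition] = record.leader
    return image


log = [
    LeaderChange("orders-0", 1),
    LeaderChange("orders-1", 2),
    LeaderChange("orders-0", 3),     # leadership moves to broker 3
    LeaderChange("orders-1", None),  # partition deleted
]

# A node starting from nothing replays the whole log.
active = replay(log)

# A hot standby has already applied the first two records and only needs the tail.
standby = replay(log[2:], replay(log[:2]))
```

Because the image is a pure function of the ordered log, the standby and the active node converge on the same state without either consulting an external store.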
The change: build the consensus you depend on
KIP-500 replaced ZooKeeper with a self-managed metadata quorum, a small set of controller nodes (typically 3 or 5) running a Raft implementation over the internal __cluster_metadata topic, realized as the event-driven quorum controller of KIP-631. The Raft leader becomes the active controller; the rest are standbys replaying the log. The benefits are exactly the inverse of the costs above: one system to operate instead of two, hot-standby failover (a standby promotes in milliseconds because its in-memory image is already current), and a metadata ceiling raised roughly 10×. Instaclustr created ~600,000 partitions on a single KRaft broker and reported stable lab operation at ~2 million per cluster (source: Instaclustr, "KRaft Part 3").
But "build the consensus you depend on" is a heavy bill. Kafka had to write, test, and harden a Raft implementation (KIP-595), a dynamic-reconfiguration story for the quorum itself (KIP-853, covered in KRaft Controller), snapshotting to bound the metadata log, and, the genuinely hard part, a live migration from ZooKeeper that could not lose a single record. That migration (KIP-866) ran the cluster in a dual-write mode: the KRaft controller wrote to both the new metadata log and the old ZooKeeper store simultaneously, so the cluster could be rolled forward incrementally and rolled back to ZK if the new path misbehaved, until the operator committed. The vestige is still visible in 4.4: ZkMigrationStateRecord survives as a deliberate no-op in MetadataDelta.replay, retained solely so clusters that migrated under 3.x can roll forward (Part I Metadata Propagation). ZooKeeper itself was removed entirely in 4.0.
Kafka was right to internalize Raft because coordination was on its hot path, it had the engineering capacity, and the dependency had become the scaling ceiling. Do not generalize this to "always build your own consensus." For most systems, an external etcd/ZooKeeper/Consul is the correct choice: you get a battle-tested implementation for free, and your coordination volume is nowhere near the ceiling that forced Kafka's hand. The honest decision rule: internalize only when the dependency is simultaneously on the critical path, the binding constraint, and within your competence to build and operate forever. Owning consensus means owning its bugs at 3 a.m. for the life of the product.
Case 2, Record format v0→v2: evolving a format that's already on disk KIP-98
The problem: every record paid full freight
Kafka's original on-disk unit was a flat message (LegacyRecord, magic byte 0 or 1) prefixed by a 12-byte log overhead. Part I The Record Format dissects the weaknesses: in v0/v1 an uncompressed message set was a sequence of standalone messages, one per offset, and each repeated its own CRC, magic, attributes, and (in v1) an 8-byte absolute timestamp. There was no batch header in which to put anything shared, and, critically, nowhere to carry the producer-identity fields that idempotence and transactions would require. A compressed batch was a single "wrapper" message whose value was itself a compressed message set, and iterating it required decompressing the whole wrapper to recover absolute offsets, because inner offsets were stored relative to the wrapper's last offset. The format was simple and expensive: per-record overhead at high record counts, and no extension point for new semantics.
The change: a self-describing batch with delta-encoded records
KIP-98 introduced the v2 record batch (DefaultRecordBatch, magic byte 2). The architectural insight is a clean separation of weight: the batch became the heavyweight, self-describing unit (one header, one CRC-32C, one compression frame, and the producerId/producerEpoch/baseSequence triple that EOS needs), while each record became a maximally-compact body storing only an offset delta and timestamp delta from bases held once in the header, zig-zag varint encoded. Part I The Record Format shows the byte layout: the 8-byte absolute offset and 8-byte timestamp are paid once per batch; each additional record costs a byte or two before the record region is optionally compressed. The same header that delivered compactness is the one that made idempotence and transactions possible, the producer fields had a home.
| Dimension | v0 / v1 (legacy) | v2 (KIP-98) | Consequence |
|---|---|---|---|
| Unit | Flat message; uncompressed set = N standalone messages | Batch header + N delta-encoded inner records | Per-record overhead amortized across the batch |
| Offsets | Absolute per message; compressed inner offsets relative to wrapper's last | baseOffset once; per-record offsetDelta varint | Tiny records; lastOffset = baseOffset + lastOffsetDelta computed, never stored |
| Timestamp | 8-byte absolute per message (v1) | baseTimestamp + per-record delta | Further compaction; log-append-time override is one field |
| Identity fields | None, no place for them | producerId, producerEpoch, baseSequence in header | Enabled idempotence + transactions (Case 3) |
| Checksum | Per-message CRC | One CRC-32C over the batch, excluding partitionLeaderEpoch | Broker can stamp leader epoch without recomputing CRC |
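To make "each additional record costs a byte or two" concrete, here is a minimal sketch of zig-zag varint encoding applied to offset deltas. The function names are illustrative, and the code shows only the encoding idea, not Kafka's record serializer.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_varint(n: int) -> bytes:
    """Zig-zag the value, then emit 7 bits per byte, high bit set on all but the last."""
    value = zigzag(n)
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(data: bytes) -> int:
    """Inverse of encode_varint for a buffer that starts with one varint."""
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (value >> 1) ^ -(value & 1)


# Three consecutive offsets: the base is stored once, each record stores a delta.
base_offset = 1_000_000
offsets = [1_000_000, 1_000_001, 1_000_002]
delta_sizes = [len(encode_varint(o - base_offset)) for o in offsets]
absolute_size = 8  # bytes for a fixed-width 64-bit offset, as in the legacy format
```

Each delta here fits in a single byte, against eight bytes for an absolute offset repeated per message.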
The hard constraint: old clients still on the wire
You cannot change a format that is already on millions of disks and in millions of deployed clients with a flag day. KIP-98's compatibility tactic is the transferable part. First, the magic byte: every batch self-identifies its version in a fixed position, so a reader parses the magic before interpreting anything else. The format is versioned in the data, not out of band. Second, down-conversion: when a consumer too old to understand v2 fetches a v2 batch, the broker rewrites it back into the v0/v1 the client expects, on the fetch path. This preserves old clients at a real cost: down-conversion happens on the broker, burns CPU, and forfeits zero-copy (the broker must touch and transform the bytes rather than sendfile them straight from page cache; Part I The Fetch Path). Compatibility was bought with broker cycles, deliberately.
The reusable pattern: (1) put a version discriminator (magic byte) at a fixed offset so any reader can branch before parsing; (2) make the newest component (the broker) responsible for serving the oldest reader, via on-the-fly conversion, so old clients need zero changes; (3) accept that conversion costs the server something (here, CPU and zero-copy) and treat removing old formats as a multi-year deprecation, not a release. This is the same shape as HTTP content negotiation and protobuf field numbers, the discriminator lives with the payload, and the new side absorbs the cost of the old.
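Steps (1) and (2) of the pattern can be sketched as a reader that looks at one byte at a fixed position before deciding how to serve the data. The sketch assumes the magic byte sits at byte 16 of the buffer (after an 8-byte offset, a 4-byte length, and 4 further bytes), which is my understanding of the layout shared by the legacy and v2 formats; `serve` and `toy_header` are hypothetical helpers, and no real conversion is performed.

```python
import struct

# Assumed fixed position of the version discriminator (see lead-in).
MAGIC_OFFSET = 16
HIGHEST_KNOWN_MAGIC = 2


def read_magic(buf: bytes) -> int:
    """Read the discriminator without interpreting any other field."""
    if len(buf) <= MAGIC_OFFSET:
        raise ValueError("buffer too short to contain a magic byte")
    return buf[MAGIC_OFFSET]


def serve(buf: bytes, client_max_magic: int) -> str:
    """Decide how the newest component serves a possibly older reader."""
    magic = read_magic(buf)
    if magic > HIGHEST_KNOWN_MAGIC:
        raise ValueError(f"unknown magic {magic}")
    if magic <= client_max_magic:
        return "zero-copy"  # reader understands the stored format as-is
    # The server pays: bytes must be rewritten into the older format.
    return f"down-convert v{magic}->v{client_max_magic}"


def toy_header(magic: int) -> bytes:
    """Build a 17-byte toy header: int64, int32, int32, then the magic byte."""
    return struct.pack(">qiib", 0, 0, 0, magic)
```

For example, `serve(toy_header(2), client_max_magic=1)` reports a down-conversion, while a v2-capable client gets the untouched bytes.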
Down-conversion is correct but not free, and the cost is invisible until a fleet of legacy consumers hits a high-throughput topic: broker CPU climbs and the fetch path loses zero-copy precisely when you can least afford it. The operational lesson, keep clients current so the broker can sendfile v2 bytes untouched, is the reason Part II Lifecycle Operations treats client upgrades as part of cluster hygiene, not an afterthought. A compatibility shim is a bridge, not a destination; budget to retire it.
Case 3, Exactly-once: strong guarantees as an opt-in layer KIP-98 KIP-447
The problem: at-least-once pushed dedup onto everyone
Kafka's core delivery semantic is at-least-once: a producer that retries after a network hiccup can write a duplicate, and a consumer that crashes after processing but before committing will reprocess. For many pipelines that is fine. For exactly-once read-process-write, the canonical "consume from topic A, transform, produce to topic B, commit input offset", it forced every application to implement its own idempotent dedup, correctly, forever. That is precisely the kind of cross-cutting correctness burden a platform should absorb once rather than externalize to thousands of teams.
The change: two composing layers, neither a rewrite
The architectural move (Part I Transactions & EOS) is that exactly-once was built as two layers stacked on the unchanged at-least-once log, not as a new storage engine. The lower layer is the idempotent producer: every batch carries (producerId, producerEpoch, baseSequence), the v2 header fields from Case 2, and the partition leader's ProducerStateManager deduplicates retries by checking that sequence numbers are strictly contiguous per (PID, epoch), absorbing retries in a 5-batch window and rejecting gaps with OutOfOrderSequenceException. That gives exactly-once-in-order to a single partition with essentially zero throughput cost, it is a sequence check on a batch that was being written anyway. The upper layer is transactions: a stable transactional.id maps to a PID with epoch fencing of zombies; a TransactionCoordinator drives a two-phase commit over the internal __transaction_state log and writes transaction markers (control records) into every partition the transaction touched; and consumers reading at read_committed filter aborted data using the Last Stable Offset and a per-segment aborted-transaction index.
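The lower layer's sequence check can be sketched in a few lines. This is a toy under stated assumptions, not Kafka's ProducerStateManager: `PartitionDedup` and its exception classes are hypothetical, a new producer or epoch is assumed to start at sequence 0, and sequence wraparound is ignored.

```python
from collections import deque


class OutOfOrderSequence(Exception):
    """Raised when a batch leaves a gap in the producer's sequence."""


class ProducerFenced(Exception):
    """Raised when a batch arrives with an older epoch than the one on record."""


class PartitionDedup:
    """Toy per-partition tracker of the last few batches per producer id."""

    WINDOW = 5  # how many recent batches are remembered for retry detection

    def __init__(self):
        self._state = {}  # producer_id -> (epoch, deque of (base_seq, last_seq))

    def append(self, producer_id, epoch, base_seq, record_count):
        last_seq = base_seq + record_count - 1
        cur_epoch, batches = self._state.get(
            producer_id, (epoch, deque(maxlen=self.WINDOW)))
        if epoch < cur_epoch:
            raise ProducerFenced(f"epoch {epoch} < {cur_epoch}")
        if epoch > cur_epoch:
            batches = deque(maxlen=self.WINDOW)  # new epoch restarts the sequence
        if (base_seq, last_seq) in batches:
            return "duplicate"  # a retry of a batch already written: absorb it
        expected = batches[-1][1] + 1 if batches else 0
        if base_seq != expected:
            raise OutOfOrderSequence(f"expected {expected}, got {base_seq}")
        batches.append((base_seq, last_seq))
        self._state[producer_id] = (epoch, batches)
        return "appended"
```

The check is a dictionary lookup and a comparison on a batch that was going to be appended anyway, which is why this layer costs almost nothing in throughput.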
[Figure: the two EOS layers. Transactions: __transaction_state; markers; LSO-gated read_committed. Idempotent producer: (producerId, epoch, baseSequence) dedup in ProducerStateManager; strictly-increasing per-partition sequence.]
The throughput-vs-guarantee balance, and the KIP-447 scaling fix
The crucial discipline is that strong guarantees were added without surrendering the throughput that defines Kafka. Idempotence is a near-free sequence check. Transactions cost a little (two extra coordinator round-trips to begin/commit, plus the marker writes), amortized across a whole batch of records, so the per-record overhead at realistic batch sizes is small. KIP-447 then removed the original scaling cliff: the first EOS design required one transactional producer per input partition (so offsets could be fenced), which exploded the producer count for a Streams app over a many-partition topic. KIP-447 changed sendOffsetsToTransaction to take the full ConsumerGroupMetadata (group id, generation, member id), letting the group coordinator fence offset commits by generation, so a single transactional producer can serve many input partitions and EOS scales per application instance rather than per input partition (Part I Group Coordination). That is the difference between EOS being a curiosity and being usable at Streams scale.
| Setting | Guarantee | Cost | Inherent limit |
|---|---|---|---|
| enable.idempotence=true | No duplicates per partition on retry; in-order | Caps max.in.flight at 5; one sequence check | Per-partition only; not multi-partition atomic |
| + transactions (transactional.id) | Atomic multi-partition write + offset commit | 2 coordinator round-trips/txn + markers; LSO read latency | EOS is within Kafka read-process-write only |
| + KIP-447 offsets | Scales to one producer per app instance | Coordinator fences offset commits by generation | Still bounded by partition assignment |
| External sink (DB + Kafka) | At-least-once + eventual (outbox/CDC) | Idempotent consumer required | Kafka EOS does NOT span an external system |
Kafka's exactly-once is exactly-once within the Kafka read-process-write loop; it does not, and cannot, give exactly-once across an external database. Writing to Postgres and producing to Kafka is a dual write; the correct pattern is the transactional outbox (write the business row and an outbox row in one local ACID transaction, then tail the DB log via CDC to Kafka), which yields at-least-once plus eventual consistency, not 2PC, and demands idempotent consumers (source: Morling / Debezium, "Reliable Microservices Data Exchange With the Outbox Pattern"). The lesson: an opt-in guarantee layer is bounded by the system whose primitives it is built on. Know exactly where the boundary is, and do not let "exactly-once" leak into a sentence about an external store.
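The outbox half of that pattern can be sketched with SQLite standing in for Postgres. Everything here is an illustrative assumption: the table and function names are hypothetical, and a simple polling relay (`drain_outbox`) substitutes for the CDC log tailing the text describes, with `publish` standing in for a Kafka producer.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    """In-memory database with a business table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders(id TEXT PRIMARY KEY, amount INTEGER)")
    conn.execute(
        "CREATE TABLE outbox("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, msg_key TEXT, payload TEXT)"
    )
    return conn


def place_order(conn, order_id, amount):
    """Write the business row and the outbox row in ONE local transaction."""
    with conn:  # commits on success, rolls back both inserts on any exception
        conn.execute("INSERT INTO orders(id, amount) VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox(topic, msg_key, payload) VALUES (?, ?, ?)",
            ("orders", order_id, json.dumps({"id": order_id, "amount": amount})),
        )


def drain_outbox(conn, publish):
    """Polling relay (a stand-in for CDC): publish first, then delete the row.

    A crash between publish and delete re-sends the row on the next run,
    so delivery is at-least-once and consumers must be idempotent.
    """
    rows = conn.execute(
        "SELECT seq, topic, msg_key, payload FROM outbox ORDER BY seq").fetchall()
    for seq, topic, key, payload in rows:
        publish(topic, key, payload)
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
    return len(rows)
```

The only atomic step is the local database transaction; the hop to Kafka is a retryable relay, which is exactly why the result is at-least-once rather than exactly-once.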
Case 4, Rebalance: removing a global barrier in stages KIP-429 KIP-848
The problem: stop-the-world on every membership change
In the original ("classic") consumer protocol, group membership and partition assignment were negotiated through a JoinGroup/SyncGroup barrier orchestrated by the broker but with the assignment computed by an elected client leader (Part I Group Coordination). The fatal property was eager rebalancing: on any change, a consumer joining, leaving, or simply being slow, every member revoked all of its partitions, the leader recomputed a full assignment, and everyone re-acquired. The whole group stopped consuming during the rebalance. At scale, with deploys rolling consumers and occasional GC pauses tripping session timeouts, this produced "rebalance storms" where each rebalance triggered the next.
The change: two stages, each backward-compatible
Kafka did not fix this in one jump; it removed the barrier in two compatible stages, which is the transferable lesson.
- Stage 1, cooperative incremental (KIP-429, Kafka 2.4). Still the classic protocol, but a new client-side assignor (CooperativeStickyAssignor) changes revocation behaviour: a consumer keeps the partitions it will retain and only revokes the ones actually being moved, across two rebalances. Only moved partitions pause; the rest keep consuming. Critically, it advertises both protocols (supportedProtocols() returns [COOPERATIVE, EAGER]), so a group can roll from eager to cooperative one consumer at a time, no flag day. This is the first attack on stop-the-world: it shrinks the blast radius without changing the protocol's shape.
- Stage 2, broker-driven (KIP-848, GA Kafka 4.0). The barrier itself is removed. JoinGroup/SyncGroup are replaced by a single repeated ConsumerGroupHeartbeat RPC that doubles as join, sync, poll-for-assignment, and liveness. There is no client leader and no client-side assignor: the broker computes the target assignment with a server-side assignor (uniform or range), and each member incrementally reconciles toward it. Coordination becomes a conversation about three epochs, group epoch (bumped when inputs change), target-assignment epoch (when the assignor last ran), and per-member epoch (how far each member has reconciled), that replace the classic generation. No global synchronization point remains.
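The difference in blast radius between eager and cooperative revocation is simple set arithmetic. The sketch below uses hypothetical helper names and a hand-written target assignment; it does not model the assignor itself or the two-rebalance handshake, only which partitions pause.

```python
def eager_revocations(owned):
    """Eager: every member gives up everything it owns before reassignment."""
    return {member: set(parts) for member, parts in owned.items()}


def cooperative_revocations(owned, target):
    """Cooperative: a member gives up only partitions the target moves elsewhere."""
    return {
        member: set(parts) - set(target.get(member, ()))
        for member, parts in owned.items()
    }


def paused(revocations):
    """Total number of partitions that stop being consumed during the change."""
    return sum(len(parts) for parts in revocations.values())


# Two members own six partitions; a third member "c" joins the group.
owned = {"a": {0, 1, 2}, "b": {3, 4, 5}}
target = {"a": {0, 1}, "b": {3, 4}, "c": {2, 5}}

eager = eager_revocations(owned)
cooperative = cooperative_revocations(owned, target)
```

In this example the eager path pauses all six partitions, while the cooperative path pauses only the two that actually change owner.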
KIP-848 could in principle have been the whole answer, but shipping it in one jump would have demanded that every client in the world upgrade before any group benefited, and it would have stranded the millions of consumers that couldn't move. Cooperative rebalancing (KIP-429) delivered most of the operational relief years earlier within the classic protocol, and because it advertised both assignors, a group migrated incrementally. KIP-848 then finished the job with a new protocol that coexists with the classic one (the classic path remains the fallback for older clients and for components not yet ported). The general tactic: remove a global barrier in compatible increments, each of which is shippable and rolls one node at a time, rather than betting the migration on a single coordinated cutover. Reported result: KIP-848 rebalances run up to roughly 20× faster (source: Confluent, "KIP-848: A New Consumer Rebalance Protocol").
Comparisons that still indict Kafka for stop-the-world rebalances are describing eager assignors on old clients. With CooperativeStickyAssignor (default-capable since 2.4) or the KIP-848 protocol (GA in 4.0), the global pause is gone (source: empirical reference §9). When you read a critique of any mature system, check whether it targets a design the system has already evolved past, and when you evolve, leave the old path working so the critique stops being true incrementally rather than all at once.
Case 5, Tiered storage: extending via an SPI, not a fork KIP-405
The problem: retention chained to local disk
Without tiered storage, a topic's retention is bounded by the aggregate local disk of the brokers hosting its replicas. Wanting to keep 30 days of a high-throughput topic meant provisioning (and paying for, and replicating across AZs) 30 days of expensive local SSD, even though the vast majority of that data is cold and read rarely, if ever. Worse, storage and compute were welded together: you scaled disk by adding brokers you didn't need for CPU, and broker recovery had to re-replicate cold data on rejoin (Part I Tiered Storage).
The change: two narrow SPIs decouple cold bytes from the broker
KIP-405 decoupled retention from local disk by letting the broker offload sealed (non-active) segments to an external object store and keep only a small hot tail locally. The architecturally interesting decision is how: rather than baking S3 (and GCS, and Azure Blob, and HDFS, and every future store) into the broker, Kafka defined two pluggable SPIs orchestrated by the broker-side RemoteLogManager (Part I Tiered Storage):
- RemoteStorageManager (RSM), moves bytes. A five-method interface: copy a segment's data and indexes, fetch a segment (by byte range), fetch one index, and delete. Copy and delete must be idempotent; transient failures are signalled with a RetriableRemoteStorageException. The contract requires only eventual consistency of the object store.
- RemoteLogMetadataManager (RLMM), tracks which segment covers which offsets/epochs, with strongly consistent semantics. The default implementation, TopicBasedRemoteLogMetadataManager, persists metadata as records in the internal __remote_log_metadata topic, Kafka storing its own tiering metadata in Kafka.
This is the subtle, transferable part. Object stores are cheap and durable but only eventually consistent and slow to list; you must not make correctness depend on an S3 bucket listing. So Kafka split the responsibilities by their consistency needs: the RSM owns data with eventual-consistency requirements, while the RLMM owns metadata with strong consistency, and the broker always updates RLMM metadata around every RSM data operation. The durable source of truth about "what exists in remote" is the metadata, never the object-store listing (Part I Tiered Storage). The general tactic: when you extend a system across a boundary with weak guarantees (object storage), keep a strongly-consistent index you control on the strong side, and treat the weak side as a content-addressed blob bag.
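The "metadata around every data operation" rule can be sketched with two toy classes. These are illustrative stand-ins, not the RSM or RLMM interfaces: `ToyMetadataLog`, `ToyObjectStore`, and `upload` are hypothetical, and the two state names are borrowed only as labels.

```python
COPY_STARTED = "COPY_SEGMENT_STARTED"
COPY_FINISHED = "COPY_SEGMENT_FINISHED"


class ToyMetadataLog:
    """Strongly consistent side: an append-only list of (segment, state) events."""

    def __init__(self):
        self.events = []

    def record(self, segment_id, state):
        self.events.append((segment_id, state))

    def readable_segments(self):
        """Only segments whose latest state is COPY_FINISHED are visible to readers."""
        latest = {}
        for segment_id, state in self.events:
            latest[segment_id] = state
        return sorted(s for s, state in latest.items() if state == COPY_FINISHED)


class ToyObjectStore:
    """Weak side: a blob bag. Nothing ever lists it to decide what exists."""

    def __init__(self):
        self.blobs = {}

    def copy(self, segment_id, data):
        self.blobs[segment_id] = data  # idempotent: a retry overwrites the same key


def upload(metadata, store, segment_id, data):
    """Bracket the data operation with metadata events on the strong side."""
    metadata.record(segment_id, COPY_STARTED)
    store.copy(segment_id, data)
    metadata.record(segment_id, COPY_FINISHED)
```

If a crash lands between the copy and the final metadata event, the blob exists in the store but is never served, and a retry of `upload` is safe because the copy is idempotent.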
Before KIP-405, every cloud vendor maintained a fork of Kafka to add their object store (Confluent, Aiven, and others each had one). A narrow, well-specified SPI converts forks into plugins: vendors implement two interfaces, the core stays single-sourced, and competition moves to implementation quality instead of divergent codebases. The general rule for extending infrastructure you don't want to fork: find the narrowest interface that captures the variability (here, "move bytes" + "track metadata"), specify its consistency contract explicitly, and load implementations dynamically (Kafka even isolates plugin dependencies with a child-first classloader). The same pattern recurs in Kafka Connect's SourceConnector/SinkConnector and in the pluggable assignors of Case 4.
Tiered storage cuts storage cost (S3 at ~$0.02/GiB-mo vs EBS ~$0.08–0.10) and decouples storage from compute, but it does not reduce cross-AZ networking, which is 70–90% of a high-throughput cloud Kafka bill and can become an even larger share after tiering shrinks the storage line (source: Confluent, "Uncovering Kafka's Hidden Infrastructure Costs"; empirical reference §6, §9). A reader who hears "tiered storage saves money" and concludes "so my cross-AZ replication cost drops" is wrong; the networking lever is fetch-from-follower (KIP-392) or a diskless/object-store design (KIP-1150, still in-flight upstream, accepted ~March 2026 but not production-ready OSS). An extension SPI solves the problem it was scoped to and no other; name the boundary so the cost model stays honest. Note also that KIP-405 reached GA only in 3.9 (after early access in 3.6): "infinite retention" is a recent capability, not an original Kafka property.
Case 6, Share groups: a new access pattern beside the log KIP-932
The problem: per-partition ordering blocks queue fan-out
The log's defining strength, strict per-partition ordering consumed by exactly one member of a group per partition, is also the reason Kafka is a poor task queue. Parallelism within a consumer group is capped at the partition count, and a single slow or poison-pill message causes head-of-line blocking that stalls every later message in its partition (source: empirical reference §8). For genuine queue workloads, competing consumers, work that can be processed in any order, redelivery on failure, the log semantics fight you. Historically the answer was "use RabbitMQ for that," or bolt on the Confluent Parallel Consumer.
The change: add the model, don't mutate the log
KIP-932 added share groups: many share consumers cooperatively read the same partitions, with records delivered individually under a time-bounded acquisition lock rather than tracked by a single committed offset (Part I Share Groups). Each record carries a delivery state (Available → Acquired → Acknowledged/Archived) and a delivery count, giving at-least-once queue semantics with redelivery, max-attempt limits, and an optional dead-letter queue. The crucial architectural restraint: this was added beside the log, not inside it. The partition log, its append-only semantics, its replication, its retention, all unchanged. Share-group bookkeeping lives in a new in-memory engine on the partition leader (SharePartition), durably backed by a new internal topic (__share_group_state) through a coordinator, entirely separate from the offset-commit machinery of classic consumer groups.
| Property | Consumer group (the log) | Share group (KIP-932) |
|---|---|---|
| Unit of bookkeeping | One committed offset per partition | Individual record (delivery state + count) |
| Consumers per partition | Exactly one per group | Many, concurrency decoupled from partition count |
| Ordering | Strict per-partition | None, queue semantics |
| Failure handling | Rewind the offset (coarse) | Per-record redelivery, max attempts, DLQ |
| Head-of-line blocking | Yes, slow record stalls partition | No, records acquired independently |
| State location | __consumer_offsets | __share_group_state (separate) |
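The per-record bookkeeping in the right-hand column can be sketched as a small state machine. This is a toy under stated assumptions, not Kafka's SharePartition: the class name, the explicit `now_ms` clock, and the rule that a record is archived once its delivery count reaches the limit are all simplifications chosen for illustration.

```python
AVAILABLE, ACQUIRED = "Available", "Acquired"
ACKNOWLEDGED, ARCHIVED = "Acknowledged", "Archived"


class ToySharePartition:
    """Tracks a delivery state, delivery count, and lock deadline per record."""

    def __init__(self, offsets, lock_ms, max_deliveries):
        self.lock_ms = lock_ms
        self.max_deliveries = max_deliveries
        self.records = {
            o: {"state": AVAILABLE, "deliveries": 0, "deadline": None} for o in offsets
        }

    def _expire_locks(self, now_ms):
        """Release records whose acquisition lock ran out without an acknowledgement."""
        for rec in self.records.values():
            if rec["state"] == ACQUIRED and now_ms >= rec["deadline"]:
                exhausted = rec["deliveries"] >= self.max_deliveries
                rec["state"] = ARCHIVED if exhausted else AVAILABLE
                rec["deadline"] = None

    def acquire(self, now_ms, max_records):
        """Hand out up to max_records available records, each under its own lock."""
        self._expire_locks(now_ms)
        taken = []
        for offset in sorted(self.records):
            rec = self.records[offset]
            if rec["state"] == AVAILABLE and len(taken) < max_records:
                rec["state"] = ACQUIRED
                rec["deliveries"] += 1
                rec["deadline"] = now_ms + self.lock_ms
                taken.append(offset)
        return taken

    def acknowledge(self, offset):
        rec = self.records[offset]
        if rec["state"] != ACQUIRED:
            raise ValueError(f"record {offset} is not acquired")
        rec["state"] = ACKNOWLEDGED
        rec["deadline"] = None
```

Two consumers calling `acquire` on the same partition receive disjoint records, and a record that is never acknowledged becomes available again when its lock expires, so one slow record does not block the ones behind it.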
The temptation when a system can't do X is to bend its core until it can, and break everyone relying on the core's current behaviour. KIP-932 instead introduced a parallel read model with its own state, its own RPCs (ShareFetch/ShareAcknowledge), and its own coordinator, leaving the log untouched. Existing consumer groups don't notice share groups exist. The general rule: when you need a fundamentally different access pattern over the same data, prefer a new model sharing the substrate over a mutation of the existing model. You keep both contracts intact and let users choose per workload, log semantics where order matters, queue semantics where throughput-of-competing-consumers matters.
Share groups relax ordering and decouple parallelism from partition count, but they do not turn Kafka into RabbitMQ. There is still no per-message TTL, no priority, and no content-based routing; those are deliberate log-semantics omissions, not gaps to be patched (source: empirical reference §9). And share-group state, while durably backed, is held and reconstructed on the partition leader, so it inherits Kafka's partition-leadership failure model. A new access pattern over an existing substrate inherits the substrate's constraints; it widens the system's reach without escaping its physics. If you need rich routing and per-message priority, RabbitMQ is still the right tool: adding a model is not the same as becoming a different system.
The five meta-lessons: how to evolve infrastructure safely
Stepping back from the six cases, the same five tactics recur. These are the transferable payload, the techniques you can carry to any long-lived system that must change while running.
1, Feature flags as a version ratchet (metadata.version) KIP-584
Every behaviour-changing or format-changing capability above is gated by a feature level, with metadata.version (KIP-584) as the master flag, and Transaction Version, kraft.version, and others as siblings. Two properties make it safe (Part I KRaft Controller, Part II Lifecycle Operations). First, a level can only be finalized when every node supports it: FeatureControlManager.reasonNotSupported() walks all registered brokers and controllers and refuses the upgrade if any reports a range that excludes the new level. This is the runtime guard against a mixed-version cluster enabling a record format an old node cannot parse. Second, it is a one-way ratchet across any record-format change: the source says in capitals "ONCE A METADATA VERSION IS PRODUCTION, IT CANNOT BE CHANGED," and a downgrade is refused unless the controller can prove no record format actually changed between the two versions. This is what lets the binary roll and the behaviour switch be separated: you upgrade software reversibly, soak it, and only then cross the irreversible feature gate.
[Figure: kafka-features describe. The metadata.version ratchet (KIP-584). The cardinal gate is "all nodes support it"; the cardinal trap is irreversibility once a record-format-changing level commits. Never bump in the same window as the binary roll.]
A feature flag that gates an on-disk format must be (a) monotonic, you can only move forward across a format change, because the alternative is deleting metadata you cannot reconstruct; and (b) quorum-gated, finalizable only when every participant can speak the new level. Kafka enforces both in FeatureControlManager. If you build version flags into your own system, copy this exactly: refuse the upgrade until all nodes support it, and refuse the downgrade if it would lose information. A flag that can be toggled freely is not a safe-evolution tool; it is a footgun.
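Both rules, quorum-gated upgrade and refused lossy downgrade, fit in a short sketch. This is a toy under stated assumptions, not Kafka's FeatureControlManager: `ToyFeatureControl` is hypothetical, its `reason_not_supported` method only borrows the idea of the real check, and "format-changing levels" are supplied by hand.

```python
class ToyFeatureControl:
    """Finalized feature level guarded by node support and a one-way ratchet."""

    def __init__(self, finalized, format_changing_levels):
        self.finalized = finalized
        self.format_changing = set(format_changing_levels)
        self.nodes = {}  # node_id -> (min_supported, max_supported)

    def register(self, node_id, min_supported, max_supported):
        self.nodes[node_id] = (min_supported, max_supported)

    def reason_not_supported(self, level):
        """Return why some node cannot speak `level`, or None if all can."""
        for node_id, (lo, hi) in sorted(self.nodes.items()):
            if not lo <= level <= hi:
                return f"node {node_id} supports only {lo}..{hi}"
        return None

    def update(self, level):
        reason = self.reason_not_supported(level)
        if reason:
            raise ValueError(f"refusing level {level}: {reason}")
        if level < self.finalized:
            # A downgrade is lossy if it steps back across a format-changing level.
            crossed = sorted(
                l for l in self.format_changing if level < l <= self.finalized)
            if crossed:
                raise ValueError(f"unsafe downgrade: format changed at {crossed}")
        self.finalized = level
```

Registering nodes is the reversible binary roll; calling `update` is the separate, potentially irreversible step, which is why the two should not share a maintenance window.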
2, SPIs to extend without forking
Tiered storage (RSM/RLMM), Connect (source/sink connectors), the pluggable server-side assignors of KIP-848, and the share-group persister are all service provider interfaces: narrow contracts that let third parties extend Kafka without modifying or forking it. The discipline is to make the interface as small as the variability requires and to specify its semantics, especially consistency and idempotence, in the contract, not in tribal knowledge. The payoff is that the core stays single-sourced while the ecosystem of implementations grows independently. The anti-pattern an SPI prevents is the vendor fork, which fragments the project and strands users on divergent versions.
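As a rough illustration of "small interface, semantics in the contract", the sketch below defines a three-method storage SPI in Python. Everything in it is hypothetical: SegmentStore, InMemorySegmentStore, and archive are invented for this example and do not mirror the signatures of Kafka's RemoteStorageManager. The point is that idempotence is stated in the interface's docstring, so core code can retry without knowing which implementation is plugged in.

```python
from abc import ABC, abstractmethod
from typing import Dict


class SegmentStore(ABC):
    """Hypothetical narrow SPI for cold segment storage.

    Contract (part of the interface, not tribal knowledge):
    - copy() is idempotent: repeating it with the same id and bytes is a no-op.
    - fetch() after a successful copy() returns exactly the bytes copied.
    - delete() is idempotent: deleting a missing segment is not an error.
    """

    @abstractmethod
    def copy(self, segment_id: str, data: bytes) -> None: ...

    @abstractmethod
    def fetch(self, segment_id: str) -> bytes: ...

    @abstractmethod
    def delete(self, segment_id: str) -> None: ...


class InMemorySegmentStore(SegmentStore):
    """Reference implementation, useful for testing code written against the SPI."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def copy(self, segment_id: str, data: bytes) -> None:
        existing = self._blobs.get(segment_id)
        if existing is not None and existing != data:
            raise ValueError("segment already stored with different contents")
        self._blobs[segment_id] = data

    def fetch(self, segment_id: str) -> bytes:
        return self._blobs[segment_id]  # KeyError if the segment is unknown

    def delete(self, segment_id: str) -> None:
        self._blobs.pop(segment_id, None)


def archive(store: SegmentStore, segment_id: str, data: bytes,
            attempts: int = 3) -> int:
    """Core-side caller: retries blindly, which is safe only because the
    contract makes copy() idempotent. Returns the number of attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            store.copy(segment_id, data)
            return attempt
        except ConnectionError:
            if attempt == attempts:
                raise
    raise ValueError("attempts must be at least 1")
```

If an implementation's write lands but the acknowledgement is lost, archive simply calls copy() again. Without the idempotence clause in the contract, that retry could duplicate or corrupt a segment, and the caller would need implementation-specific knowledge to avoid it.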
3, Backward compatibility as a hard constraint
The magic byte and down-conversion (Case 2), the dual-protocol coexistence in rebalancing (Case 4), the both-assignors advertisement of CooperativeStickyAssignor, and the retained-no-op ZkMigrationStateRecord (Case 1) are all the same discipline: a new design must interoperate with the old one during the entire migration window, because you cannot upgrade every node and every client atomically. This costs real engineering (a down-converter, a compatibility shim, a fallback path), and that cost is the price of admission for evolving a system that thousands of teams depend on. The mark of mature infrastructure is that the old client written years ago still works against the new broker.
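The magic-byte pattern is easy to show on a toy format. The sketch below is an assumption-laden illustration in Python, not Kafka's record layout: the two versions (MAGIC_V1 with a bare payload, MAGIC_V2 adding an 8-byte timestamp) and the functions encode, decode, and for_client are all hypothetical. It shows the two halves of the discipline: readers dispatch on the leading version byte, and the server down-converts only when a client cannot read the stored version.

```python
import struct
from typing import Optional, Tuple

MAGIC_V1 = 1  # hypothetical layout: [magic:1][payload]
MAGIC_V2 = 2  # hypothetical layout: [magic:1][timestamp_ms:8, big-endian][payload]


def encode(payload: bytes, timestamp_ms: Optional[int] = None) -> bytes:
    """Write v1 when there is no timestamp, v2 otherwise."""
    if timestamp_ms is None:
        return bytes([MAGIC_V1]) + payload
    return bytes([MAGIC_V2]) + struct.pack(">q", timestamp_ms) + payload


def decode(record: bytes) -> Tuple[Optional[int], bytes]:
    """Dispatch on the magic byte; return (timestamp_ms or None, payload)."""
    magic = record[0]
    if magic == MAGIC_V1:
        return None, record[1:]
    if magic == MAGIC_V2:
        (timestamp_ms,) = struct.unpack_from(">q", record, 1)
        return timestamp_ms, record[9:]
    raise ValueError(f"unknown magic byte {magic}")


def for_client(record: bytes, client_max_magic: int) -> bytes:
    """Serve stored bytes untouched when the client can read them;
    down-convert (dropping the timestamp) only for older clients."""
    if record[0] <= client_max_magic:
        return record
    _, payload = decode(record)
    return encode(payload)
```

The down-conversion path is lossy and costs a re-encode per record, which is why it is a compatibility shim rather than the normal path: clients that understand the stored version get the bytes as written.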
4, Absorb vs externalize complexity
Two of the cases are decisions about where complexity should live, and they point in opposite directions, which is the lesson. KRaft (Case 1) absorbed complexity that had been externalized to ZooKeeper, because coordination was on the critical path and the dependency was the binding constraint. Tiered storage (Case 5) externalized complexity to an object store behind an SPI, because cold storage is a commodity better bought than built. There is no universal rule; the correct direction depends on whether the thing is on your critical path, whether it is your binding constraint, and whether you can carry its operational weight forever (the decision tree in Case 1). Exactly-once (Case 3) is a third variant: it absorbed the dedup complexity that had been externalized to every application, a platform earning its keep by solving a cross-cutting correctness problem once.
| Case | Direction | What moved | Why that direction |
|---|---|---|---|
| KRaft | Absorb | Consensus, from ZooKeeper into the broker | On the critical path; the binding scaling constraint |
| Exactly-once | Absorb | Dedup, from every app into the platform | Cross-cutting correctness; solve once |
| Tiered storage | Externalize | Cold storage, from broker disk to object store | Commodity; cheaper to buy than build/operate |
| Share groups | Absorb (additively) | Queue semantics, from external brokers into Kafka | Demand was high; added beside the log, not inside |
5, Multi-release deprecation discipline
Nothing was removed abruptly. ZooKeeper was deprecated in 3.5 and only removed in 4.0, with the KIP-866 dual-write migration spanning the 3.x line and a no-op record retained even after removal so late migrators could still roll forward. The classic rebalance protocol remains the fallback for older clients even after KIP-848 went GA. Old record formats are still down-converted. The discipline is to give users multiple releases between "deprecated" and "removed," with a working migration path the whole time, and never to ship a single release that yanks a capability out from under a running fleet. Part II Lifecycle Operations operationalizes this from the operator's side.
If you remember one thing from this chapter, make it the order of operations for evolving a live system: (1) design the migration before the destination, because the dual-write phase, the down-converter, and the both-protocols coexistence are not afterthoughts but the actual deliverable; (2) gate every behaviour change behind a quorum-checked, monotonic feature flag so the binary roll and the behaviour switch are separable; (3) extend through the narrowest possible SPI rather than forking, and specify its consistency contract; (4) decide absorb-vs-externalize per subsystem against the critical-path / binding-constraint / can-I-operate-it test, not by ideology; and (5) deprecate across releases with a working path throughout. Kafka's twelve years are a long, public proof that these five tactics let infrastructure change continuously without a flag day, and that skipping any one of them is how a "simple upgrade" becomes an outage.
Where to go next
The evolution stories here are the how of change; the rest of Part III is the when and what. For the structural properties these changes were defending or extending, see The Log as a Pattern and Inherent Limits; for the decision frameworks that mirror Case 1's tree, see When to Use a Log and Design Decisions; for the reusable tactics distilled here as a toolkit, see The Tactics Toolkit; and for how Kafka's choices compare to systems that made different ones (Pulsar's compute/storage split, Redpanda's thread-per-core, WarpStream's diskless), see Comparative Systems. The operator's view of these same migrations (rolling upgrades, the feature ratchet in practice, reassignment) is Part II Lifecycle Operations.