krivaltsevich.com Kafka Internals4.4

22 · Glossary & Cross-Cutting Concepts

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Derived from source code, not copied from official documentation.

This closing chapter is a reference index for the whole guide. Part A is a glossary: every load-bearing term used across the preceding chapters, defined in one to three sentences and cross-linked to the chapter that develops it. Part B draws out the cross-cutting ideas that recur everywhere, the log abstraction and stream-table duality, the three delivery semantics and where each is enforced, the ordering guarantees, the everything is a replicated log pattern that unifies data partitions with the five internal topics, epochs and fencing as a universal anti-zombie mechanism, and the timeline/snapshot data structures that let in-memory state track a replicated log. Terms are grouped by domain rather than strictly alphabetised so that related vocabulary reads together; the on-page table of contents lets you jump to any group.

Part A, Glossary

Terms are grouped into nine domains: the log & offsets, topics, partitions & replication, on-disk structures, the record format, producers & transactions, coordinators, groups & rebalancing, KRaft & metadata, security & authorization, and cross-cutting mechanisms. Each definition links to the chapter where the term is grounded in source.

The log & offsets

Offset
A monotonically increasing int64 assigned to each record within a partition by the leader at append time; it is the record's permanent address and is never reused or reordered. See The Log Storage Engine.
Log end offset (LEO)
The offset that will be assigned to the next record appended, i.e. one past the last record. Each replica has its own LEO; the leader tracks every follower's LEO to compute the high watermark. See The Log Storage Engine and Replication, ISR & High Watermark.
High watermark (HW)
The highest offset that has been replicated to the full in-sync replica set and is therefore committed and consumer-visible. It equals the minimum LEO over the maximal ISR, is monotonic, and never exceeds the leader's LEO (UnifiedLog.java:564). read_uncommitted consumers may read up to the HW. See Replication, ISR & High Watermark.
Last stable offset (LSO)
min(highWatermark, firstUnstableOffset), the highest offset below which there are no in-flight (undecided) transactions. read_committed consumers may not read past the LSO. See Transactions & Exactly-Once.
Isolation level (read_committed / read_uncommitted)
The consumer FetchRequest setting (a FetchIsolation on the broker) that bounds how far a fetch may see: read_uncommitted (the default, HIGH_WATERMARK) reads up to the high watermark; read_committed (TXN_COMMITTED) reads only up to the LSO and filters out batches belonging to aborted transactions using the segment .txnindex. See Transactions & Exactly-Once and The Consumer Client.
Log start offset
The lowest offset still retained in the log; advanced by retention, compaction, DeleteRecords, or tiering. Advancing it requires the new value to be ≤ the HW, else OffsetOutOfRangeException (UnifiedLog.java:1358). The local log start offset (≥ the log start offset) is the lowest offset held on local disk when tiered storage is enabled. See Log Management, Retention & Compaction.
Recovery point
The first not-yet-fsynced offset; advanced only at an actual flush. With default flush.messages/flush.ms = Long.MAX_VALUE, durability relies on replication and the OS page cache rather than fsync. See The Log Storage Engine.
Watermark checkpoint
The replication-offset-checkpoint file (and siblings recovery-point-offset-checkpoint, log-start-offset-checkpoint, cleaner-offset-checkpoint): plain-text topic partition offset triples written crash-consistently via atomic rename, persisting the HW and other boundaries across restarts (default interval 5000 ms). See Replication, ISR & High Watermark.

Topics, partitions & replication

Topic
A named, partitioned, append-only stream of records, the unit of pub/sub addressing. Internally a topic is just a set of partition logs plus metadata (id, config, ACLs). See Architecture Overview.
Partition
The unit of parallelism, ordering and replication: a totally ordered log identified by (topic, partition). All ordering and delivery guarantees are scoped to a single partition. See The Log Storage Engine.
Replica
One copy of a partition log hosted on a broker. The replication factor is the number of replicas; each is either the leader or a follower. See Replication, ISR & High Watermark.
Leader / follower
Exactly one replica is the leader, which serves all produces and consumer fetches and assigns offsets; the others are followers that replicate by issuing Fetch requests to the leader (replication is pull-based). See Replication, ISR & High Watermark and Fetch Path & Replica Fetchers.
ISR (in-sync replicas)
The set of replicas (including the leader) that are caught up to the leader within replica.lag.time.max.ms (default 30 s). In KRaft the ISR is controller-authoritative: the leader proposes changes via the AlterPartition RPC and the controller commits them by appending a PartitionChangeRecord. The HW cannot advance while the ISR is below the effective min.insync.replicas. See Replication, ISR & High Watermark.
ELR (eligible leader replicas)
KIP-966 state added to PartitionRegistration (elr[], lastKnownElr[]): replicas that were in the ISR when it shrank and so still hold all committed data, making them safe to elect as leader without data loss even after the ISR empties. The controller elects from ISR first, then ELR, then last-known ELR, and only as an unclean last resort an arbitrary replica. See Replication, ISR & High Watermark.
Leader epoch
A monotonically increasing int32 stamped into every v2 batch header by the leader; it identifies which leadership term wrote a range of offsets. Each replica keeps a LeaderEpochFileCache mapping epoch → first offset, which drives truncation-safe follower replication (KIP-101/KIP-279). See The Log Storage Engine and Replication, ISR & High Watermark.
Partition epoch
The optimistic-concurrency token for a partition's leader/ISR state, bumped by the controller on every PartitionChangeRecord. An AlterPartition commits only if its partition epoch matches the controller's current value, else INVALID_UPDATE_VERSION. See Replication, ISR & High Watermark and The KRaft Controller.
Rack awareness
Placement and routing by failure domain: StripedReplicaPlacer spreads replicas across racks first then brokers (The KRaft Controller), and fetch-from-follower (KIP-392) lets a consumer read from the most caught-up same-rack replica via RackAwareReplicaSelector (Fetch Path & Replica Fetchers).
Zero-copy
Returning log bytes straight from page cache to the socket via sendfile(2) (FileRecords.writeTotransferFrom), so fetched data never enters the JVM heap. Disabled only when records must be re-encoded (e.g. a follower re-appending). See Fetch Path & Replica Fetchers.
Page cache
The OS file-system cache that Kafka relies on instead of an in-process record cache; writes go to page cache and are flushed by the OS, and zero-copy reads serve hot data without a disk seek. See The Log Storage Engine.

On-disk structures

Segment
A contiguous slice of a partition log stored as a .log file named by its 20-digit base offset, paired with its indexes. The newest (highest-base-offset) segment is the active segment, the only one being appended; a segment rolls on size, time, index-full, or offset-overflow. See The Log Storage Engine.
Index (offset / time / transaction)
Three sparse companions to each segment: the memory-mapped .index (offset → file position, 8-byte entries), the memory-mapped .timeindex (timestamp → relative offset, 12-byte entries), and the non-mmapped, lazily-created .txnindex recording aborted transactions for read_committed fetches. Entries are added every index.interval.bytes (default 4096). Indexes carry no checksum and are rebuilt from the log on corruption. See The Log Storage Engine.
Compaction vs retention
The two cleanup.policy modes. Retention (delete) drops whole oldest-first segments past a size or time bound. Compaction (compact) keeps only the latest record per key, copying live records into new segments via a two-pass cleaner. A topic may set both. See Log Management, Retention & Compaction.
Tombstone
A record with a null value, signalling deletion of its key during compaction. Tombstones are themselves retained for delete.retention.ms (default 24 h) after entering the clean section, enforced in v2 by a delete-horizon stamped into the batch header, so that all replicas observe the deletion before it disappears. See Log Management, Retention & Compaction.
Tiered storage / remote log
KIP-405 offloading of older, closed log segments to an external store (object storage, HDFS, …) so brokers keep only recent data locally, enabled cluster-wide by remote.log.storage.system.enable and orchestrated per-partition by the RemoteLogManager (RLM). Only sealed (non-active) segments whose copy reached COPY_SEGMENT_FINISHED are remote-readable; local deletion of a tiered segment is guarded by highestOffsetInRemoteStorage. A consumer fetch below the local range is served transparently from the remote store. See Tiered Storage.
Local log start offset
The lowest offset still held on local disk (≥ the log start offset) once tiered storage is enabled; offsets between the log start offset and this boundary live only in the remote store and are fetched from there on demand. See Tiered Storage.
RSM / RLMM (__remote_log_metadata)
The two pluggable SPIs of tiered storage: the RemoteStorageManager (RSM) reads and writes segment and index objects in the external store, while the RemoteLogMetadataManager (RLMM) tracks each remote segment's id, offset range and state. The default RLMM persists that metadata to the internal compacted topic __remote_log_metadata (50 partitions) and materialises an in-memory index from it, one more instance of the replicated-log pattern. See Tiered Storage.

The record format

Record batch (v2)
The DefaultRecordBatch: the on-disk/on-wire container with a fixed 61-byte plaintext header (base offset, length, partition leader epoch, magic, CRC-32C, attributes, PID/epoch/base-sequence, record count, …) followed by a compressed run of delta-encoded inner records. The atom of replication, compression, idempotence and CRC. See Record Format & Batches.
Control record
A record in a batch whose attributes bit 5 is set; its typed key (Version, Type) names a control event, COMMIT/ABORT (transaction markers), LEADER_CHANGE, snapshot header/footer, and KRaft voter/version records. Control records are never compacted and clients may not write them. See Record Format & Batches.
Transaction marker
The COMMIT or ABORT control record (EndTransactionMarker, carrying the coordinator epoch) that the transaction coordinator writes into every partition a transaction touched, atomically deciding the transaction's visibility. See Transactions & Exactly-Once.

Producers & transactions

Producer ID (PID)
A cluster-unique int64 stamped in every batch from an idempotent or transactional producer, allocated by the controller in blocks of 1000 and never reused. See Transactions & Exactly-Once.
Producer epoch
An int16 generation counter for a PID. Bumped on InitProducerId (and, under Transaction V2, on every commit/abort) to fence zombies; the leader rejects any batch carrying a lower epoch, and, under Transaction V2, any marker whose epoch is not strictly greater than the current one. See Transactions & Exactly-Once.
Sequence number
A per-partition int32 base sequence in the batch header, starting at 0 and advancing by the record count. The leader requires nextFirstSeq == lastSeq + 1 for a given (PID, epoch); violations raise OutOfOrderSequenceException. See Transactions & Exactly-Once.
Idempotent producer
A producer with enable.idempotence=true (the default, requiring acks=all and ≤ 5 in-flight requests) whose (PID, epoch, sequence) triple lets the leader's ProducerStateManager deduplicate retries and reject reordering, giving exactly-once-in-order delivery to a single partition. See The Producer Client and Transactions & Exactly-Once.
transactional.id
A stable, user-supplied string naming a logical producer across restarts; it maps deterministically to one __transaction_state partition and thus one coordinator, and anchors epoch-based zombie fencing. See Transactions & Exactly-Once.

Coordinators, groups & rebalancing

Coordinator (group / transaction / share)
A broker-hosted replicated state machine that owns one partition of an internal topic. The group coordinator owns a __consumer_offsets partition (group membership + committed offsets); the transaction coordinator owns a __transaction_state partition; the share coordinator owns a __share_group_state partition. The owning broker is the leader of the relevant partition, picked by abs(key.hashCode()) % partitionCount. See Group Coordination, Transactions & Exactly-Once, Share Groups.
Consumer group
A set of consumers that cooperatively consume a subscription, with each partition assigned to exactly one member and one committed offset per partition. Managed by the group coordinator. See Group Coordination.
Rebalance (eager / cooperative / KIP-848)
The process of (re)distributing partitions among group members. Eager (classic protocol) revokes all partitions then reassigns via a stop-the-world JoinGroup/SyncGroup barrier. Cooperative (KIP-429) revokes only the partitions that move, in two rounds. The KIP-848 consumer protocol replaces the barrier with an incremental, server-driven ConsumerGroupHeartbeat reconciled by epochs (no leader, no client-side assignor; GA in 4.0). See Group Coordination.
Share group
KIP-932 queue-like consumption: N share consumers cooperatively read the same partition (N:M), with records leased individually under a time-bounded acquisition lock and a per-record delivery state instead of an offset commit, giving at-least-once with bounded redelivery and an optional dead-letter queue. Durable state lives in __share_group_state. See Share Groups.

KRaft & metadata

KRaft
Kafka Raft: the built-in consensus subsystem that replaced ZooKeeper (fully removed in 4.0). It stores all cluster metadata as a single Raft-replicated log partition, __cluster_metadata-0, over a quorum of controllers. See KRaft Consensus.
Controller quorum
The set of voter nodes that run the Raft protocol over the metadata log and elect an active controller. One controller is the Raft leader; the others are hot standbys that replay the same records. See KRaft Consensus and The KRaft Controller.
Voter vs observer
A voter participates in elections and counts toward the commit majority; an observer (every broker, and catching-up nodes) only fetches the metadata log to stay current. Both replicate by pulling via Fetch; only voters vote. See KRaft Consensus.
Quorum
A majority of voters. A metadata record is committed once replicated to a quorum (the leader counts itself); the Raft high watermark is the offset replicated to a majority. See KRaft Consensus.
Metadata log
The single-partition internal topic __cluster_metadata (Topic.CLUSTER_METADATA_TOPIC_NAME) carrying every cluster-state change as a typed record; it is the source of truth from which all broker state is materialised. See Metadata Propagation & Broker Lifecycle.
MetadataImage / MetadataDelta
The materialised view of the metadata log. MetadataImage is an immutable record of ten components, a MetadataProvenance offset/epoch marker plus nine sub-images (features, cluster, topics, configs, client-quotas, producer-ids, ACLs, SCRAM, delegation-tokens); MetadataDelta accumulates a batch of records and produces the next image, reusing unchanged sub-images by reference. Readers swap the whole image atomically. See Metadata Propagation & Broker Lifecycle.
metadata.version
A finalised feature level (KIP-584) stored as a FeatureLevelRecord that gates which record types and behaviours the cluster uses; raised only when the controller software and every registered broker and controller support the new level. See The KRaft Controller and Metadata Propagation & Broker Lifecycle.
Broker epoch
A lease identifier the controller assigns at broker registration and bumps on each re-registration; carried in heartbeats and Fetch requests so the controller and leaders can fence a stale broker incarnation. See Metadata Propagation & Broker Lifecycle.

Security & authorization

ACL (access-control entry)
An authorization rule binding a {ResourceType, ResourceName, PatternType, Principal, Host, Operation, PermissionType} tuple. In KRaft, ACLs are cluster metadata: each is stored as an AccessControlEntryRecord in __cluster_metadata (removed via RemoveAccessControlEntryRecord) and materialised into the AclsImage. See Security: Authentication & Authorization.
Authorizer (StandardAuthorizer)
The pluggable component (authorizer.class.name) that decides each request's ALLOW/DENY. The built-in KRaft implementation is StandardAuthorizer, which reads ACLs straight from the metadata image. Its evaluation order is fixed: a configured super-user wins outright, otherwise any matching DENY beats every ALLOW, and if nothing matches it falls back to the default (allow.everyone.if.no.acl.found, default deny). See Security: Authentication & Authorization.
KafkaPrincipal / principal
The authenticated identity an Authorizer checks, of the form type:name (almost always User:alice); built from the connection's credentials by a KafkaPrincipalBuilder after authentication. The unauthenticated identity is User:ANONYMOUS. See Security: Authentication & Authorization.
SASL
The authentication framework Kafka uses on the wire, selected per listener via a security.protocol of SASL_PLAINTEXT or SASL_SSL. Supported mechanisms are PLAIN, SCRAM-SHA-256/SCRAM-SHA-512, GSSAPI (Kerberos), and OAUTHBEARER. See Security: Authentication & Authorization.
SSL / TLS
Transport encryption for the SSL and SASL_SSL listeners; with ssl.client.auth=required it also performs mutual TLS, authenticating the client from its certificate and deriving the KafkaPrincipal from the certificate's distinguished name. See Security: Authentication & Authorization.
Delegation token
A lightweight shared secret (KIP-48) that an already-authenticated client can obtain to authenticate later connections without re-presenting its primary credentials, handy for distributed workers. Tokens are SCRAM-style secrets replicated as cluster metadata (the DelegationTokenImage) so every broker can verify them. See Security: Authentication & Authorization.
super.users
The broker config listing principals (e.g. User:admin;User:broker) that bypass ACL evaluation entirely, every operation is allowed before any ACL or default is consulted. Used for inter-broker and administrative identities. See Security: Authentication & Authorization.

Cross-cutting mechanisms

Fencing
Rejecting operations from a stale instance by comparing a monotonically increasing token (epoch). Used everywhere: producer epoch fences zombie producers, leader epoch fences stale leaders, partition/coordinator/broker epochs fence stale state. See Transactions & Exactly-Once and Part B below.
Purgatory
A DelayedOperationPurgatory that parks an operation which cannot complete immediately (e.g. acks=all produce waiting for the HW, a fetch waiting for min.bytes) and completes it on a triggering event or a timeout driven by a hierarchical timing wheel. See Network Layer & Threading.
Quota / throttling
Overload protection: per-entity limits on produce/fetch byte rate, request CPU %, controller mutations, and replication, enforced by a shared Sensor + Rate/TokenBucket. Over-quota clients are throttled by muting the channel for a computed delay and returning throttle_time_ms so they back off. See Quotas & Throttling.

Part B, Cross-cutting concepts

Five ideas appear in nearly every chapter. Reading them together explains why so much of Kafka looks the same regardless of subsystem.

The log abstraction & stream-table duality

Kafka's one foundational data structure is the append-only, totally ordered, offset-addressed log. A partition is a log; a topic is a set of logs; the metadata quorum is a log; every coordinator's durable state is a log. Everything else, indexes, the high watermark, retention, compaction, is bookkeeping over that log.

A compacted log encodes a table: the latest record per key is the current value, a tombstone is a deletion, and the log itself is the ordered changelog of every mutation. This stream-table duality is why the same __consumer_offsets, __transaction_state, __share_group_state and __cluster_metadata logs can be replayed to rebuild an in-memory table after failover, and why Kafka Streams (Kafka Streams Architecture) materialises a KTable from a compacted changelog and emits a KStream from the same log read forward.

Streamread the log forward
off 0 · k=a v=1
off 1 · k=b v=7
off 2 · k=a v=3
off 3 · k=c v=9
Tablelatest record per key
a ⇒ 3latest wins
b ⇒ 7
c ⇒ 9
Stream-table duality: one log read two ways. Reading forward yields the stream; compacting by key yields the table (latest record per key, tombstone ⇒ delete). Replaying the log forward reconstructs the in-memory table after failover.
log / record (the stream) materialised table entry compact by key replay rebuilds state cylinder = a log / changelog
Key idea

If you understand the log, you understand most of Kafka. Replication, consumer offsets, transactions, share-group state and cluster metadata are all the same pattern, a replicated log plus a deterministic replay that turns it into in-memory state.

Delivery semantics & where each is enforced

Kafka supports three end-to-end delivery semantics. Which one you get is not a single switch but the combination of producer settings, consumer commit timing, and (for the strongest) transactions.

SemanticProducer sideConsumer sideEnforced by
At-most-onceacks=0 or no retries, a lost batch is simply lostCommit the offset before processingNothing dedups; gaps are tolerated. Cheapest, lossy.
At-least-onceacks=all with retries (idempotence optional), a batch is never lost but may duplicate on retryCommit the offset after processingReplication (HW) for durability; redelivery on reprocessing. The default. Share groups are inherently at-least-once.
Exactly-onceIdempotent producer (PID+epoch+sequence) plus transactions binding writes and the offset commit atomicallyRead at read_committed (LSO + aborted-txn filtering)Leader sequence-dedup + 2-phase commit over __transaction_state + transaction markers. See Transactions & Exactly-Once.

The enforcement points are spread across the stack. Idempotence lives at the partition leader, where ProducerStateManager deduplicates retried batches by (PID, epoch, sequence) (The Log Storage Engine). Atomicity lives in the transaction coordinator's two-phase commit and the markers it fans out to every partition (Transactions & Exactly-Once). Visibility lives in the consumer: the leader caps a read_committed fetch at the LSO, and the client skips batches whose PID is in the aborted-transaction index (The Consumer Client). Kafka Streams packages all of this as KIP-447 exactly_once_v2 (Kafka Streams Architecture).

Gotcha

"Exactly-once" is exactly-once processing within a Kafka-to-Kafka pipeline, not magical de-duplication of arbitrary side effects. It holds only when the read, the derived writes, and the input-offset commit all sit inside one transaction and downstream consumers read at read_committed.

Ordering guarantees

All ordering in Kafka is per-partition; there is no global order across a topic. Within a partition the log is totally ordered by offset, and that order is preserved end to end provided two conditions hold:

  • On the produce side, order is preserved automatically when the idempotent producer is on (the broker enforces strictly increasing per-partition sequences, so any of max.in.flight.requests.per.connection in [1,5] is safe). Without idempotence, only max.in.flight=1 preserves order, because a retried batch could otherwise overtake a later one (The Producer Client).
  • On the broker side, the leader assigns offsets under a single lock so appends are serialised, and per-connection request ordering is guaranteed by muting a channel while its request is in flight (Network Layer & Threading).

Consumers see records in offset order within each partition but interleave partitions arbitrarily. Keyed records preserve relative order only because the default partitioner sends a given key to a fixed partition. Share groups deliberately give up ordering: records are leased individually to whichever consumer is free, so there is no cross-consumer order at all (Share Groups).

Everything is a replicated log

Data partitions are not special. The same machinery, an append-only log, replication to followers, a high watermark gating visibility, and a deterministic replay that rebuilds in-memory state, backs five internal logs, each owned by the controller, a coordinator, or the tiered-storage remote-log metadata manager (RLMM).

Internal topicHoldsReplayed intoChapter
__cluster_metadataAll cluster state (topics, configs, ACLs, quotas, broker registrations, leader/ISR/ELR)MetadataImage on every broker & controllerController, Metadata Propagation
__consumer_offsetsGroup membership + committed offsets (compacted)GroupCoordinatorShard timeline stateGroup Coordination
__transaction_statePer-transactional.id transaction metadata (compacted)TransactionMetadata state machinesTransactions & Exactly-Once
__share_group_statePer-record delivery state & share-partition offsets (compacted)ShareCoordinatorShard + SharePartitionShare Groups
__remote_log_metadataTiered-storage segment metadata (50 partitions)Default RLMM in-memory indexTiered Storage

The two data-plane logs (__cluster_metadata and ordinary topic partitions) replicate by followers pulling via Fetch; the coordinator logs are ordinary compacted topic partitions, so the leader that owns a partition is the coordinator for the keys hashed to it, and a coordinator failover is just a partition leader change plus a replay. This is the single most unifying idea in the codebase: there is essentially one durable-state mechanism, reused everywhere.

Why it is built this way

Reusing the log as the only persistence primitive means there is one replication path, one durability story (replicate-then-commit), one failover story (replay the partition), and one operational surface (it is just a topic). New stateful subsystems, share groups, KIP-848 groups, Streams groups, are added by writing new record types into a coordinator log, not by inventing new storage.

Epochs & fencing: a universal anti-zombie mechanism

Distributed systems must cope with zombies: an instance that was replaced (after a pause, crash, or network partition) but does not yet know it. Kafka's answer is uniform, a monotonically increasing epoch attached to every authority, checked on every operation, with the rule reject anything carrying an epoch lower than the current one.

EpochFencesBumped onRejection
Producer epochA zombie producer for a transactional.idInitProducerId; every commit/abort under TV_2InvalidProducerEpochException / PRODUCER_FENCED
Leader epochA stale leader's writes; guides follower truncationEach leadership change (new PartitionChangeRecord)Diverging-epoch response → follower truncates
Partition epochA stale AlterPartition ISR proposalEvery controller partition changeINVALID_UPDATE_VERSION
Coordinator epochA stale transaction marker / log appendEach __transaction_state leader changeTransactionCoordinatorFenced
Broker epochA stale broker incarnationEach (re-)registration leaseStale heartbeats/fetches ignored
Raft / quorum epochA deposed controller leaderEvery election (+1)NotController / vote rejected
Group / member epoch (KIP-848)A zombie group member & its commitsMembership/assignment changesSTALE_MEMBER_EPOCH / FENCED_*_EPOCH
Active instanceepoch = N
Zombie instanceepoch = N−1
Authoritycurrent epoch = N
compare request.epoch to N
✗ REJECTfenced
✓ accept
adopt N := request.epochthen accept
The single fencing rule, instantiated by every epoch in the table above: a request is rejected when its epoch is below the authority's, accepted when equal, and adopted-then-accepted when higher.
active (current) instance zombie / rejected path authority holding the epoch accepted adopt higher epoch request flow stale / fenced write rounded = decision
Invariant

For every epoch, the value is non-decreasing and a strictly higher epoch always supersedes a lower one. No operation tagged with a stale epoch is ever applied. This one rule is what makes failover safe across the entire system without distributed locks.

Timeline & snapshot data structures

Coordinators and the controller face a shared problem: they apply records to in-memory state optimistically (before those records are committed past the high watermark), yet a record that never commits, because leadership was lost or a metadata transaction aborted, must leave no visible trace. The solution is the timeline collection: a copy-on-write map/set/value that remembers its contents at each metadata offset and can be reverted to any earlier offset.

The controller's TimelineHashMap/TimelineObject family is backed by a SnapshotRegistry keyed by offset (The KRaft Controller); the group, transaction and share coordinators use the same SnapshotRegistry-backed timeline structures so their handlers need no locks and reads observe a consistent snapshot (Group Coordination). On renounce or transaction abort, the state is revertToSnapshot(lastStableOffset), discarding everything uncommitted, which is exactly why writes are applied optimistically yet failover is byte-identical.

Committed (stable)up to lastStableOffset = 110
offset 100a=1, b=2
offset 110a=1, b=9
offset 120 — uncommitteda=7, b=9
back to committed prefixoffset 110
A timeline collection keeps per-offset tiers so uncommitted state can be discarded wholesale on failover or abort. The Raft log is the truth; the in-memory image tracks only its committed prefix, reverting the optimistic offset-120 tier on renounce.
committed (stable) offset / prefix uncommitted (optimistic) tier append / apply revertToSnapshot (discard) cylinder = a log / snapshot tier

The same shape recurs at the log level. ProducerStateManager snapshots producer state on every segment roll so it can be rebuilt after failover (The Log Storage Engine); the metadata log takes KIP-630 snapshots so a fresh node need not replay from offset 0 (KRaft Consensus); the share coordinator periodically writes a full ShareSnapshotValue to bound replay (Share Groups). In every case a snapshot is a compaction of a log into materialised state, closing the loop back to stream-table duality.

Note

These six ideas are deliberately the threads that tie the guide together. When a new chapter introduces a subsystem, ask which log it persists to, which epoch fences it, and how its in-memory state is reverted on failover, the answers are almost always an instance of the patterns above.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.