krivaltsevich.com Kafka Internals4.4

00 · Architecture Overview

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Derived from source code, not copied from official documentation.

Apache Kafka is, at its core, a single idea executed with unusual discipline: a distributed, partitioned, replicated commit log. Producers append records to the end of a log; consumers read forward from a position they themselves track; the broker is a comparatively "dumb" custodian of bytes that it stores as sequential append-only segment files and ships to readers with near-zero overhead. Everything else in this guide, replication and the high watermark, the KRaft metadata quorum that replaced ZooKeeper, consumer groups and rebalancing, exactly-once transactions, share groups, tiered storage, Streams and Connect, is an elaboration built on top of that log. This chapter is the front door. It builds an accurate end-to-end mental model: what the abstractions are, how a cluster is laid out in KRaft, how a record physically travels from producer.send() to consumer.poll(), how bytes are stored, how coordination and delivery semantics work, how a broker is threaded, and where every detail lives in the codebase and in the twenty-one chapters that follow. Read it once to orient yourself; return to it as the map.

What Kafka is: a distributed commit log

Strip away the features and Kafka is a log. A log here is the precise computer-science object: an append-only, totally-ordered sequence of records, ordered by the time they were appended, addressed by a monotonically increasing integer position. Jay Kreps called it "perhaps the simplest possible storage abstraction", and it is the same primitive that sits underneath database replication and state-machine replication. Kafka's bet, made at LinkedIn in 2010–2011 and proven out since, is that this one abstraction, made distributed and durable, is the right backbone for both real-time stream processing and offline data integration, replacing an O(N²) tangle of point-to-point pipelines with an O(N) hub-and-spoke.

The core abstractions

Record
The unit of data: an optional key, a value (the payload, or null for a compaction tombstone), a timestamp, and an ordered list of headers (string key + bytes). Records are never stored individually on disk, they live inside batches. See Record Format & Batches.
Record batch
The real on-disk and on-wire unit. The v2 DefaultRecordBatch has a fixed 61-byte header (base offset, CRC-32C, producer id/epoch/sequence, leader epoch, attributes) followed by varint delta-encoded records. Writing shared fields once per batch cut per-record overhead from ~34 bytes for a standalone legacy v1 message (a 12-byte offset/size log prefix plus the 22-byte RECORD_OVERHEAD_V1) to ~7 bytes. The batch, not the record, is the fundamental unit of write, replication, compression, and fetch.
Partition
One physical log. A partition is the unit of parallelism and of ordering: records within a partition are strictly ordered by offset; there is no ordering guarantee across partitions. A partition is replicated; exactly one replica is the leader and serves reads/writes.
Topic
A named, logical stream, simply a set of partitions. A topic has no storage of its own; it is the partitions that store data.
Offset
The position of a record within its partition: a 64-bit integer assigned by the leader at append time, dense and gap-free within a batch's range. The offset is the only per-consumer state Kafka needs, one number per partition tells a consumer exactly where it is.
Broker
A server process that hosts partitions, accepts produce/fetch requests, and replicates. A Kafka cluster is a set of brokers plus a controller quorum (below).
Segment
A partition log is split into ~1 GiB segments: a .log file of record batches plus sparse .index / .timeindex / .txnindex sidecars. Only the last (active) segment is written; old segments are immutable and can be deleted, compacted, or tiered wholesale. See The Log Storage Engine.
partition 0  ·  topic "orders" producer appends to the tail →
0
1
2
3
4
5high watermark
6
group A
reads @1
group B
reads @6
One partition is an append-only, offset-addressed log (a topic is a set of these). Records up to the high watermark (offset 5) are committed, replicated to the ISR and visible to consumers; offset 6 is appended but not yet committed; the log end offset (LEO = 7) is the next slot the leader will write. Independent consumer groups read at their own positions, so a rewind is just a seek.
committed — offset ≤ high watermark uncommitted — replicating to followers LEO — next offset to be written ▲ consumer-group read position n = monotonic offset (int64)

The "dumb broker, smart client" model

The defining architectural choice is that the consumer tracks its own offset, and the broker keeps essentially no per-consumer state on the hot path. A traditional message queue deletes a message when it is consumed and must track, per consumer, what has been acknowledged. Kafka instead addresses every record by a stable logical offset, retains data by a time/size SLA regardless of who has read it, and lets each consumer (or consumer group) say "give me from offset N." This has profound consequences:

  • Replay and fan-out are free. Rewinding is a seek; a new consumer group reading history costs the broker nothing extra. Many independent subscribers read the same log.
  • The broker stays simple and fast. No per-message acknowledgement bookkeeping, no random-access index of "who has what", just sequential append and sequential scan, which is what disks and the OS page cache are best at.
  • Back-pressure is natural. Consumers pull (long-poll fetch) at their own pace; a slow consumer cannot be overwhelmed by a push.
The one-sentence mental model

Kafka is a set of append-only, offset-addressed, replicated partition logs. Producers append batches to the leader of a partition; the leader replicates to followers; data below the high watermark is committed and visible; consumers pull forward from a position they own and commit back to the cluster. Hold that picture and every other chapter slots into it.

Cluster anatomy in KRaft (no ZooKeeper)

As of Kafka 4.0, ZooKeeper is gone entirely; 4.4.0-SNAPSHOT runs KRaft-only. A cluster is split cleanly into two planes:

  • The data plane: broker nodes that host topic-partition replicas and serve produce/fetch. This is where your records live.
  • The metadata plane: a small controller quorum that owns all cluster metadata, topics, partitions, replica assignments, leadership, ISR, ACLs, quotas, feature levels, producer-ID blocks, as an ordered, replayable, fsync'd event log.

Every node runs with a configured set of process.roles. The ProcessRole enum has two members, BrokerRole and ControllerRole (server/.../ProcessRole.java), and KafkaRaftServer instantiates a BrokerServer if the roles contain BrokerRole and a ControllerServer if they contain ControllerRole (KafkaRaftServer.scala:76-82). So a node can be:

process.rolesRunsTypical use
brokerBrokerServer only, hosts partitions, serves clientsProduction data nodes in a larger cluster
controllerControllerServer only, a voter in the metadata quorumDedicated controllers (3 or 5 of them) in a larger cluster
broker,controllerBoth, in one JVM"Combined" mode, small clusters and development

Bootstrapping a KRaft cluster

Because there is no ZooKeeper to seed the cluster, KRaft makes formatting storage a mandatory first step, unlike the ZooKeeper era, a node will not start until its log directories carry a cluster identity. The sequence is:

  1. Generate a cluster ID. kafka-storage random-uuid prints a fresh base64 Uuid (StorageTool.scala, the random-uuid command → Uuid.randomUuid). The same ID is used to format every node.
  2. Format each node's log dirs. kafka-storage format --cluster-id <ID> --release-version <MV> writes a meta.properties into each log.dirs directory (cluster.id, node.id, and a per-directory directory.id UUID), and lays down a bootstrap.checkpoint file. That checkpoint is the BootstrapMetadata: a tiny record set whose first record is a FeatureLevelRecord pinning the initial metadata.version (the --release-version you chose), so the cluster comes up at a known feature level rather than the code's default.
  3. Declare the initial voters (KIP-853). For a dynamic controller quorum, the format step takes --standalone (a single-node quorum), --initial-controllers <list> (an explicit voter set, each id@host:port:directory-id), or --no-initial-controllers; the older static model instead lists voters in controller.quorum.voters.

Only after a node's directories are formatted does KafkaRaftServer start the Raft stack against them. The deep mechanics of registration and the bootstrap metadata live in The KRaft Controller and Metadata Propagation & Broker Lifecycle.

The metadata quorum and __cluster_metadata

The controllers form a Raft quorum using Kafka's own Raft dialect (KRaft). The quorum's state is a single topic partition, __cluster_metadata-0, replicated across the controller voters. One controller is the Raft leader (the "active controller"); the others are hot standbys. This partition is the single source of truth for the whole cluster: nothing about a topic, a partition's leader, or an ACL is "true" until it is a committed record in this log.

Clients
producers write to partition leaders; consumers read from leaders (or rack-local followers)
ProducersConsumersAdmin
↓ produce · fetch  |  metadata responses ↑
Metadata plane, controller quorum (KRaft)
a Raft quorum owns cluster metadata as an event log, __cluster_metadata-0; standbys fetch from the active leader (the single source of truth)
controller · active leadercontroller · standbycontroller · standby
↓ committed records replayed: TopicRecord · PartitionChangeRecord · RegisterBrokerRecord · AccessControlEntryRecord
Data plane, brokers
each broker replays committed metadata into a MetadataImage, hosts topic-partition replicas, and replicates partition data among themselves
broker 0broker 1broker 2
KRaft cluster anatomy: a Raft controller quorum owns metadata as an event log (the single source of truth, __cluster_metadata-0); brokers replay committed metadata and replicate partition data among themselves, while clients produce to and fetch from the brokers.
clients — producers / consumers / admin metadata plane — controller quorum data plane — brokers stacked bands = planes; ↓↑ separators = the flow between them Record = committed metadata record type

Brokers are clients of this metadata log. Each broker registers with the active controller (BrokerRegistrationRequest), is fenced until it has caught up, then continually pulls committed metadata records via the Raft Fetch path and sends periodic heartbeats (which double as liveness). Crucially, the controller does not push LeaderAndIsr/UpdateMetadata/StopReplica RPCs the way the ZooKeeper-era controller did; making metadata a log the brokers replay turns topic create/delete into an O(1) append for the controller and eliminates the controller-vs-ZooKeeper state divergence that plagued the old design. The deep mechanics are in KRaft Consensus, The KRaft Controller, and Metadata Propagation & Broker Lifecycle.

Two Rafts, do not conflate them

KRaft (the metadata quorum over __cluster_metadata) and ordinary partition replication (leader/follower ISR replication of your topic data) are different mechanisms. KRaft uses a majority-vote high watermark over the controller voters; topic-partition replication uses the ISR / min-LEO high watermark over the partition's replicas. They share a code lineage (followers fetch in both) but are governed by separate rules. KRaft replaced only ZooKeeper's metadata role, not data replication.

Control plane vs data plane, and the metadata backbone

The split above is enforced in the request layer too. Brokers do not mutate cluster metadata directly. Around eighteen administrative APIs, CreateTopics, DeleteTopics, the CreateAcls/DeleteAcls family, AlterClientQuotas, ElectLeaders, reassignments, UpdateFeatures, the Raft-voter APIs, when received by a broker are wrapped in an Envelope and forwarded to the active controller, which re-dispatches them through ControllerApis (KafkaApis.scalaforwardToControllerControllerApis.scala). The AlterConfigs/IncrementalAlterConfigs family is a partial exception: its handler preprocesses some resources locally and forwards only the remainder. The controller is the only writer of metadata; brokers are readers.

The propagation backbone is a clean pipeline. The active controller is a deterministic replicated state machine: a single-threaded event loop (QuorumController, thread quorum-controller-{id}-event-handler) turns each request into a list of metadata records, appends them to the Raft log, and applies them to in-memory state under a strict write → commit → apply discipline so the active controller and every standby reach byte-identical state by replaying the same records in offset order.

admin requestCreateTopics, AlterConfigs, ElectLeaders …
ControllerApison the active controller
QuorumControllersingle-threaded event loop: 1. generate records · 2. prepareAppend → Raft log · 3. replay() optimistically
__cluster_metadataRaft commit (majority over voters)
MetadataLoadersingle thread, folds committed batches into a MetadataDelta
new immutable MetadataImageprovenance + 9 sub-images: topics, configs, cluster, ACLs, quotas, features …
MetadataPublishersordered chain, fan out to each subsystem
KRaftMetadataCachesetImage(), serves Metadata responses
ReplicaManagerapplyDelta(), make leaders / followers
Group / Txn coordinatorselect / resign shards
Acl / Scram · quotasclient-metrics publishers
The metadata backbone: the controller writes records, Raft commits them, and every broker replays them into an immutable MetadataImage that publishers fan out to each subsystem.
admin request (client) controller / metadata plane broker-side component coordinator shard cylinder = the metadata log generate / apply async Raft commit callback

On the broker side, the MetadataLoader consumes Raft commits on one thread, folds them into immutable MetadataImage snapshots (a MetadataProvenance plus nine sub-images, rebuilt incrementally via MetadataDelta), and publishes each new image to an ordered chain of MetadataPublishers. The BrokerMetadataPublisher is what actually turns metadata into behaviour: it swaps the cache image, then calls replicaManager.applyDelta(topicsDelta, newImage) to drive partitions into leader/follower roles, and elects or resigns coordinator shards. This is the join between the metadata plane and the data plane.

The end-to-end data path for a record

Here is the journey of a single record, from a producer's send() to a consumer's poll(), with each hop linked to its chapter. This is the most important diagram in the guide.

PRODUCER · app thread · §16
send(record)serialize · pick partition (explicit / murmur2(key) / sticky)
RecordAccumulatorper-partition deque of ProducerBatch (fill to batch.size or linger.ms; compress) · §01
Sender I/O threadstamp (PID, epoch, seq) if idempotent · build ProduceRequest · §02
BROKER · leader of the partition · §06 · §07
Acceptor ▶ Processor (NIO) ▶ RequestChannel ▶ KafkaRequestHandler ▶ KafkaApis.handle
ReplicaManager.appendRecordsauthorize; for acks=all reject if ISR ‹ min.insync.replicas · §08
UnifiedLog.appendvalidate · assign offsets from LEO · stamp leader epoch · write to active segment · advance LEO · §03
acks?
fire-and-forgetno broker reply; client surfaces offset −1
ack nowreply once written to leader
DelayedProduce in PRODUCE purgatorypark; wait for HW · §08
RECORD COMMITTEDoffset ‹ high watermark
CONSUMER · poll loop · §17
Fetch from leader(or rack-local follower, KIP-392); page-cache read bounded by isolation: read_uncommitted→HW, read_committed→last stable offset, skip aborted · §09 · §14
FileRecords via sendfile (zero-copy)if ‹ fetch.min.bytes, park in FETCH purgatory until data / max.wait.ms · §03
poll()deliver ≤ max.poll.records to the app; process
OffsetCommitgroup coordinator writes offset to __consumer_offsets (replicated) · §13
The full data path: producer batching → ProduceRequest to the leader → log append → ISR replication → high-watermark commit (acks) → consumer fetch bounded by HW/LSO → group offset commit.
client (producer / consumer) broker server component the log / segments coordinator / purgatory (waiting) cylinder = a log / store acks? rounded = decision data flow async (await HW / replicate) pill = phase

Narrating the hops:

  1. Batching (producer). send() never does network I/O on the caller's thread. It serializes, selects a partition (explicit, murmur2 key-hash, or the adaptive sticky BuiltInPartitioner), and appends bytes into a per-partition deque in the RecordAccumulator. A single Sender thread drains full or lingering batches. Batching is what makes Kafka fast, it amortizes round-trips and turns writes into large sequential chunks. See The Producer Client.
  2. ProduceRequest to the leader. The batch travels as a length-prefixed, versioned RPC (Wire Protocol & RPC) to the broker that currently leads that partition. The client learned the leader from a cached Metadata response.
  3. Append to the log. The broker's network/handler threads hand the request to KafkaApis (Request Processing), which authorizes and calls ReplicaManager. For acks=all the leader first rejects if the ISR is below min.insync.replicas. Then UnifiedLog.append validates the batches, assigns offsets from the current log-end offset, stamps the leader epoch into each batch header, and writes to the active segment, advancing the LEO (The Log Storage Engine).
  4. Replication to followers (ISR). Followers are themselves fetch clients of the leader: their ReplicaFetcherThreads pull new records and report their own LEO back. The leader tracks each follower and maintains the in-sync replica set. See Replication, ISR & High Watermark and The Fetch Path.
  5. High-watermark commit (acks). The high watermark is the minimum LEO over the ISR, the boundary below which data is committed and can never be lost to a clean leader change. For acks=all, the produce request waits in the produce purgatory until the HW reaches its offset, then the ack is sent. acks=1 replies once the leader has written; acks=0 is fire-and-forget, the broker sends no response at all (a NoOpResponse), and the client's send callback completes locally with a fabricated RecordMetadata offset of −1.
  6. Consumer fetch. A consumer fetches from the leader (or, with KIP-392, a rack-local follower). The leader serves the read from the OS page cache, bounded by an isolation level: read_uncommitted sees up to the HW, read_committed sees only up to the last stable offset and skips aborted-transaction batches. Bytes are sent with sendfile (zero-copy). See The Consumer Client.
  7. Group offset commit. After processing, the consumer commits its position to the group coordinator, which durably writes it to the internal __consumer_offsets topic. On restart or rebalance, consumers resume from the committed offset. See Group Coordination.

The storage model in one screen

A partition replica on disk is a directory named topic-partition containing a sequence of segments. Each segment is a .log file of record batches plus three sparse, memory-mapped sidecar indexes. The layered abstraction is UnifiedLogLocalLogLogSegmentsLogSegment.

File in /var/lib/kafka/orders-0/Contents
00000000000000000000.logrecord batches (the data); the 20-digit prefix is the segment base offset = first offset in the file
00000000000000000000.indexsparse offset → file-position (8 B/entry)
00000000000000000000.timeindexsparse timestamp → relative offset (12 B/entry)
00000000000000000000.txnindexaborted-txn ranges (lazy; read_committed)
00000000000000004096.lognext segment; rolls at segment.bytes (1 GiB), segment.ms (7 d), or when an index fills
00000000000000004096.indexindex for the next segment
00000000000000004096.snapshotproducer-state snapshot (idempotence/txn)
leader-epoch-checkpointepoch → start offset (truncation authority)
partition.metadatatopic id

Read-by-offset: offset → OffsetIndex.lookup (greatest ≤ target) → short forward scan → FileRecords.slicesendfile to socket (no JVM heap copy).

On-disk layout of one partition replica: append-only segments with sparse indexes; reads translate an offset to a file position and stream bytes zero-copy.
00000000000000000000 = 20-digit segment base offset (the file's first offset) .log / .index / .timeindex / .txnindex = data + sparse sidecar indexes in the read path = "translates to / then"

The storage design is a deliberate set of bets, every one of which trades generality for throughput:

  • Sequential everything. Appends go to the tail; reads scan forward. Sequential disk I/O is orders of magnitude faster than random, so performance is decoupled from total retained volume.
  • Sparse indexes, relative offsets. An index entry is added only every index.interval.bytes (4 KiB) of log. Offsets are stored relative to the segment base so a 64-bit offset fits in 4 bytes. Indexes carry no checksum and are rebuilt from the log if corrupt.
  • Lean on the OS page cache. Kafka keeps no in-JVM record cache. The kernel page cache survives broker restarts, avoids double-buffering, and avoids GC pressure on huge heaps. A caught-up cluster does essentially no disk reads.
  • Zero-copy. Reads return FileRecords, a window over the segment's FileChannel; FileRecords.writeTo bottoms out in sendfile(2), so committed bytes flow page-cache → NIC without entering user space. (Zero-copy is defeated by TLS or by down-converting v2 to an old format.)
  • Durability via replication, not fsync. By default flush.messages and flush.ms are effectively infinite, Kafka does not fsync every write. Durability comes from replication across the ISR plus the page cache, not per-write disk sync.

Two retention modes govern a log's lifecycle, and a third tier extends it:

Modecleanup.policyWhat it keepsFor
Retention (delete)deleteWhole oldest segments deleted by size (retention.bytes) or age (retention.ms, default 7 d)Event streams, logs, most topics
CompactioncompactAt least the latest value per key; null value = tombstone (retained delete.retention.ms)Changelogs, CDC, state, __consumer_offsets, __transaction_state
Tiered storage(orthogonal)Closed segments offloaded to object store; local tier keeps the hot tailNear-infinite retention, small clusters

Retention and compaction mechanics are in Log Management, Retention & Compaction; the remote tier (RSM/RLMM SPIs, __remote_log_metadata, copy/fetch quotas) is in Tiered Storage.

Coordination & delivery semantics

Consumer groups & rebalancing

A consumer group scales out consumption: the group's members divide the subscribed partitions among themselves, with each partition assigned to exactly one member (preserving per-partition order). A broker-side group coordinator, the leader of the relevant __consumer_offsets partition, owns the group's membership and offsets, persisting them as records in that partition's log (its durable state is the log; failover is just a replay). The rebalance protocol has evolved decisively from client-driven to server-driven:

Protocolgroup.protocolHow assignment happens
ClassicclassicA consumer (the elected "leader," not the broker) computes the assignment via JoinGroup/SyncGroup; eager (stop-the-world) or incremental cooperative (KIP-429)
Consumer KIP-848consumerA single long-lived ConsumerGroupHeartbeat RPC; assignment is fully server-side; incremental reconciliation via three epochs (group / assignment / member). GA in 4.0; server-side assignors only
Streams KIP-1071streamsThe KIP-848 model extended to Kafka Streams task assignment; Early Access in 4.1, Generally Available (core feature set) since 4.2

The new Java group coordinator (default on KRaft in 4.0) runs each __consumer_offsets partition as a single-threaded replicated state machine and also backs share groups and Streams groups. Full detail in Group Coordination & Rebalance Protocols.

Exactly-once & transactions

Exactly-once semantics is two layers (KIP-98), both carried in the v2 batch header and reconstructable on log reload:

  • The idempotent producer deduplicates retries within one producer session, per partition, using a producer id (PID) + epoch + monotonic per-partition sequence number that the leader's ProducerStateManager validates. On by default since 3.0 (with acks=all, in-flight ≤ 5).
  • Transactions give atomic, all-or-nothing writes across many partitions and across sessions. A TransactionCoordinator maps a user transactional.id to a PID with epoch fencing, runs a two-phase commit over the internal __transaction_state log, and writes COMMIT/ABORT marker control records into every involved partition. read_committed consumers stop at the last stable offset and filter aborted batches via the transaction index.

Folding consumer-offset commits into the transaction (sendOffsetsToTransaction) makes the read-process-write loop a single atomic unit, the basis for Kafka Streams' end-to-end exactly-once. See Transactions & Exactly-Once.

Share groups (queues)

Share groups KIP-932 add a queue-like model: many consumers cooperatively read the same partition (N:M), and records are leased individually under a time-bounded acquisition lock with a per-record delivery state (Available → Acquired → Acknowledged/Archived) and delivery count, rather than committed by offset. This gives at-least-once with bounded redelivery and an optional dead-letter queue, decoupling parallelism from partition count. Durable state lives in __share_group_state via a share coordinator. See Share Groups.

Delivery & ordering guarantees

GuaranteeWhat holdsMechanism
Per-partition orderingRecords in a partition are read in offset order, alwaysSingle leader assigns dense offsets at append; followers copy a prefix
No cross-partition orderingOrder across partitions is undefinedPartition is the unit of ordering by design
At-least-onceThe default with acks=all + retries; a record may be delivered more than onceProducer retries; consumer commits after processing
Exactly-onceOpt-in: idempotent producer + transactions + read_committedPID/epoch/sequence dedup; 2PC markers; LSO read boundary
Read boundary (HW)read_uncommitted consumers never see uncommitted dataHigh watermark = min LEO over ISR; reads clamp to HW
Read boundary (LSO)read_committed consumers never see undecided/aborted txn dataLast stable offset = below it every transaction is decided
acks=all alone is not "no data loss"

acks=all waits for all members of the current ISR, which can shrink to just the leader. Durability requires acks=all and min.insync.replicas >= 2 so the leader rejects writes when the ISR is too small. The classic loss case is replication.factor=3, min.insync.replicas=1, acks=all. Eligible Leader Replicas (KIP-966) close the remaining "last replica standing" gap by freezing HW advancement when the ISR drops below min ISR.

The threading model of a broker at a glance

A broker is a carefully partitioned set of thread pools, each with a single job, communicating through bounded hand-off queues. Nothing on the request hot path blocks on disk or network from a handler thread, slow work is parked in a purgatory and completed by an event later.

socketinbound TCP connection
ACCEPTOR1 per listener, round-robins connections to processors
PROCESSOR threads (NIO)num.network.threads / listener (default 3); one Selector each · frame: 4-byte size + payload
RequestChannelbounded ArrayBlockingQueue (queued.max.requests = 500), the back-pressure point
KafkaRequestHandler poolnum.io.threads (default 8): KafkaApis.handle (match on apiKey, ~80 handlers) ▶ authorize ▶ act ▶ build response ▶ throttle ▶ route to originating processor
ReplicaManager → UnifiedLogappend / read · Group / Txn / Share coordinators · MetadataCache
DelayedOperationPurgatoryProduce, Fetch, DeleteRecords, RemoteFetch … hierarchical TIMING WHEEL for timeouts; completes when its condition fires
running alongside, background threads
background thread poolsReplicaFetcherThreads (pull from leaders) · log cleaner / retention schedulers · ExpirationReapers (advance the wheel) · ThrottledChannelReapers (quotas) · if combined: quorum-controller event loop + kafka-raft-io-thread
Broker threading: acceptor → processor (NIO reactor) → bounded RequestChannel → handler pool → KafkaApis, with non-completable work parked in purgatories driven by a hierarchical timing wheel.
external socket / connection broker thread pool / component purgatory (parked, waiting) cylinder = the bounded hand-off queue request hand-off parked / runs alongside
Acceptor
One non-daemon thread per listener; accepts TCP connections and round-robins them to processors. Connection quotas (count + creation rate) are enforced here.
Processor (network thread)
num.network.threads per listener (default 3), each owning one java.nio.Selector. Reads size-delimited requests, hands them to the RequestChannel, and writes responses back. Per-connection ordering is guaranteed by muting a channel while its request is in flight.
Request handler (I/O thread)
num.io.threads daemon threads (default 8) running KafkaApis.handle, a giant dispatch on apiKey to ~80 handlers. The RequestChannel (bounded at queued.max.requests = 500) is the back-pressure point: when handlers fall behind, network threads block on the queue and stop reading sockets, turning handler slowness into TCP back-pressure.
Replica fetcher
Background threads on a follower broker that pull from leaders (num.replica.fetchers per source, keyed by leader+fetcherId). Followers are fetch clients of leaders.
Purgatory & reapers
A DelayedOperationPurgatory parks operations (acks=all produce, under-min-bytes fetch, remote fetch, …) that cannot complete immediately, indexed by watch key and by a hierarchical timing wheel (O(1) insert/cancel). An ExpirationReaper advances the clock; operations complete when their condition fires or they time out.
Coordinator / controller threads
Group/transaction/share coordinators run per-partition single-threaded event loops. A combined node also runs the controller's quorum-controller event loop and the kafka-raft-io-thread.

The request lifecycle is instrumented end to end: each request carries timestamps decomposing latency into request-queue time, local processing, remote (purgatory) time, response-queue time, and send time, each written by a different actor. The reactor (Selector + KafkaChannel) is the same code used by producers, consumers, and replica fetchers. Deep dives: Network Layer & Threading and Request Processing (KafkaApis).

Map of the codebase

The repository is a multi-module Gradle build. Knowing which module owns what is half the battle when reading source. The principal modules:

ModuleLanguageWhat lives thereChapters
clientsJavaThe producer, consumer, admin, and share clients (the AdminClient has no chapter of its own, its controller-forwarding pipeline surfaces in 07); the shared NIO reactor (Selector, KafkaChannel); the wire-protocol message types and generated schemas; record-format classes; security channel stack01, 02, 16, 17, 18
coreScalaThe broker itself: SocketServer, KafkaApis, ReplicaManager, Partition, BrokerServer/ControllerServer/KafkaRaftServer, replica fetchers, the transaction coordinator runtime, broker-side metadata publishing06, 07, 08, 09, 14
storageJavaThe log engine: UnifiedLog, LocalLog, LogSegment, the indexes, LogManager, the LogCleaner (compaction), ProducerStateManager, tiered-storage RemoteLogManager03, 04, 05
metadataJavaThe KRaft controller (QuorumController and the per-domain control managers), the MetadataImage/MetadataDelta/MetadataLoader pipeline, PartitionRegistration, the StandardAuthorizer11, 12, 18
raftJavaKafka's own Raft implementation: KafkaRaftClient, the QuorumState role machine, snapshots, dynamic quorum reconfiguration10
group-coordinatorJavaThe modern Java group coordinator: CoordinatorRuntime, GroupMetadataManager, OffsetMetadataManager, classic + KIP-848 protocols, server-side assignors; share-group config13, 15
transaction-coordinatorJava/ScalaProducer-ID management (RPCProducerIdManager), transaction state schemas, marker plumbing14
share-coordinatorJavaThe share-group state coordinator and its __share_group_state persistence15
server / server-commonJavaShared server infrastructure: config classes, ProcessRole, the purgatory + timing wheel, quotas, broker lifecycle, AssignmentsManager, feature/metadata-version definitions06, 12, 19
streamsJavaKafka Streams: topology builder, StreamThread, the task model, state stores, the changelog restorer, EOS-v220
connectJavaKafka Connect: the plugin SPI, Worker and task loops, the DistributedHerder, the three internal topics, MirrorMaker 221

How to read this guide

The twenty-one detailed chapters are written to stand alone, but they build on each other. Here is a recommended path and a logical grouping. If you read top to bottom you will rarely hit a forward reference you do not already understand.

Recommended reading order

  1. Start here (this chapter) for the whole-system model.
  2. Bytes & protocol, the vocabulary everything else speaks: 01 · Record Format & Batches, then 02 · Wire Protocol & RPC.
  3. Storage, where records live: 03 · The Log Storage Engine, 04 · Log Management, Retention & Compaction, 05 · Tiered Storage.
  4. The broker's spine, how requests flow: 06 · Network Layer & Threading, 07 · Request Processing (KafkaApis).
  5. Durability, making data safe and readable: 08 · Replication, ISR & High Watermark, 09 · Fetch Path & Replica Fetchers.
  6. The metadata plane, the KRaft brain: 10 · KRaft Consensus (Raft), 11 · The KRaft Controller, 12 · Metadata Propagation & Broker Lifecycle.
  7. Coordination & semantics, groups, EOS, queues: 13 · Group Coordination, 14 · Transactions & Exactly-Once, 15 · Share Groups.
  8. The clients, the smart edge: 16 · The Producer Client, 17 · The Consumer Client.
  9. Cross-cutting concerns: 18 · Security, 19 · Quotas & Throttling.
  10. Built on top, the higher-level frameworks: 20 · Kafka Streams, 21 · Kafka Connect.
  11. Keep the Glossary & Cross-Cutting Concepts open in a tab.

Alternative entry points

"I operate clusters"
Read 08 (replication/durability), 12 (broker lifecycle), 04 (retention), 05 (tiered storage), 19 (quotas), 18 (security). The durability gotcha above is the single most important operational fact.
"I write producers/consumers"
Read 16, 17, then 13 (groups) and 14 (EOS) for semantics, with 01/02 for what goes over the wire.
"I'm here for KRaft"
Read 10 (the Raft dialect), 11 (the controller state machine), 12 (how brokers consume metadata). Note that KRaft is a pull-based Raft dialect with no leader heartbeats, not textbook Raft.
"I want exactly-once"
Read 14 first, then 13 (offset commits fold into transactions) and 03 (producer state on disk). Streams EOS-v2 is in 20.
Why this structure

The guide is layered like Kafka itself: bytes, then storage, then the broker, then durability, then the metadata plane, then coordination, then clients, then the frameworks built on all of it. Each chapter is derived directly from the 4.4.0-SNAPSHOT source (git 04bfe7d) with file-and-line citations, KRaft-only, and fact-checked, not paraphrased from official docs. Where a date or attribution is commonly mis-stated (e.g. "acks=all guarantees no loss," "KRaft is standard Raft," "tiered storage was GA in 3.6"), the chapters call out the correct fact explicitly.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.