00 · Architecture Overview

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Derived from source code, not copied from official documentation.

Apache Kafka is, at its core, a single idea executed with unusual discipline: a distributed, partitioned, replicated commit log. Producers append records to the end of a log; consumers read forward from a position they themselves track; the broker is a comparatively "dumb" custodian of bytes that it stores as sequential append-only segment files and ships to readers with near-zero overhead. Everything else in this guide, replication and the high watermark, the KRaft metadata quorum that replaced ZooKeeper, consumer groups and rebalancing, exactly-once transactions, share groups, tiered storage, Streams and Connect, is an elaboration built on top of that log. This chapter is the front door. It builds an accurate end-to-end mental model: what the abstractions are, how a cluster is laid out in KRaft, how a record physically travels from producer.send() to consumer.poll(), how bytes are stored, how coordination and delivery semantics work, how a broker is threaded, and where every detail lives in the codebase and in the twenty-one chapters that follow. Read it once to orient yourself; return to it as the map.

What Kafka is: a distributed commit log

Strip away the features and Kafka is a log. A log here is the precise computer-science object: an append-only, totally-ordered sequence of records, ordered by the time they were appended, addressed by a monotonically increasing integer position. Jay Kreps called it "perhaps the simplest possible storage abstraction", and it is the same primitive that sits underneath database replication and state-machine replication. Kafka's bet, made at LinkedIn in 2010–2011 and proven out since, is that this one abstraction, made distributed and durable, is the right backbone for both real-time stream processing and offline data integration, replacing an O(N²) tangle of point-to-point pipelines with an O(N) hub-and-spoke.

The core abstractions

Record: The unit of data: an optional key, a value (the payload, or null for a compaction tombstone), a timestamp, and an ordered list of headers (string key + bytes). Records are never stored individually on disk, they live inside batches. See Record Format & Batches.
Record batch: The real on-disk and on-wire unit. The v2 DefaultRecordBatch has a fixed 61-byte header (base offset, CRC-32C, producer id/epoch/sequence, leader epoch, attributes) followed by varint delta-encoded records. Writing shared fields once per batch cut per-record overhead from ~34 bytes for a standalone legacy v1 message (a 12-byte offset/size log prefix plus the 22-byte RECORD_OVERHEAD_V1) to ~7 bytes. The batch, not the record, is the fundamental unit of write, replication, compression, and fetch.
Partition: One physical log. A partition is the unit of parallelism and of ordering: records within a partition are strictly ordered by offset; there is no ordering guarantee across partitions. A partition is replicated; exactly one replica is the leader and serves reads/writes.
Topic: A named, logical stream, simply a set of partitions. A topic has no storage of its own; it is the partitions that store data.
Offset: The position of a record within its partition: a 64-bit integer assigned by the leader at append time, dense and gap-free within a batch's range. The offset is the only per-consumer state Kafka needs, one number per partition tells a consumer exactly where it is.
Broker: A server process that hosts partitions, accepts produce/fetch requests, and replicates. A Kafka cluster is a set of brokers plus a controller quorum (below).
Segment: A partition log is split into ~1 GiB segments: a .log file of record batches plus sparse .index / .timeindex / .txnindex sidecars. Only the last (active) segment is written; old segments are immutable and can be deleted, compacted, or tiered wholesale. See The Log Storage Engine.

partition 0 · topic "orders" producer appends to the tail →

5high watermark

7LEO

group A
reads @1

group B
reads @6

One partition is an append-only, offset-addressed log (a topic is a set of these). Records up to the high watermark (offset 5) are committed, replicated to the ISR and visible to consumers; offset 6 is appended but not yet committed; the log end offset (LEO = 7) is the next slot the leader will write. Independent consumer groups read at their own positions, so a rewind is just a seek.

committed — offset ≤ high watermark uncommitted — replicating to followers LEO — next offset to be written ▲ consumer-group read position n = monotonic offset (int64)

The "dumb broker, smart client" model

The defining architectural choice is that the consumer tracks its own offset, and the broker keeps essentially no per-consumer state on the hot path. A traditional message queue deletes a message when it is consumed and must track, per consumer, what has been acknowledged. Kafka instead addresses every record by a stable logical offset, retains data by a time/size SLA regardless of who has read it, and lets each consumer (or consumer group) say "give me from offset N." This has profound consequences:

Replay and fan-out are free. Rewinding is a seek; a new consumer group reading history costs the broker nothing extra. Many independent subscribers read the same log.
The broker stays simple and fast. No per-message acknowledgement bookkeeping, no random-access index of "who has what", just sequential append and sequential scan, which is what disks and the OS page cache are best at.
Back-pressure is natural. Consumers pull (long-poll fetch) at their own pace; a slow consumer cannot be overwhelmed by a push.

The one-sentence mental model

Kafka is a set of append-only, offset-addressed, replicated partition logs. Producers append batches to the leader of a partition; the leader replicates to followers; data below the high watermark is committed and visible; consumers pull forward from a position they own and commit back to the cluster. Hold that picture and every other chapter slots into it.

Cluster anatomy in KRaft (no ZooKeeper)

As of Kafka 4.0, ZooKeeper is gone entirely; 4.4.0-SNAPSHOT runs KRaft-only. A cluster is split cleanly into two planes:

The data plane: broker nodes that host topic-partition replicas and serve produce/fetch. This is where your records live.
The metadata plane: a small controller quorum that owns all cluster metadata, topics, partitions, replica assignments, leadership, ISR, ACLs, quotas, feature levels, producer-ID blocks, as an ordered, replayable, fsync'd event log.

Every node runs with a configured set of process.roles. The ProcessRole enum has two members, BrokerRole and ControllerRole (server/.../ProcessRole.java), and KafkaRaftServer instantiates a BrokerServer if the roles contain BrokerRole and a ControllerServer if they contain ControllerRole (KafkaRaftServer.scala:76-82). So a node can be:

`process.roles`	Runs	Typical use
`broker`	`BrokerServer` only, hosts partitions, serves clients	Production data nodes in a larger cluster
`controller`	`ControllerServer` only, a voter in the metadata quorum	Dedicated controllers (3 or 5 of them) in a larger cluster
`broker,controller`	Both, in one JVM	"Combined" mode, small clusters and development

Bootstrapping a KRaft cluster

Because there is no ZooKeeper to seed the cluster, KRaft makes formatting storage a mandatory first step, unlike the ZooKeeper era, a node will not start until its log directories carry a cluster identity. The sequence is:

Generate a cluster ID. kafka-storage random-uuid prints a fresh base64 Uuid (StorageTool.scala, the random-uuid command → Uuid.randomUuid). The same ID is used to format every node.
Format each node's log dirs. kafka-storage format --cluster-id <ID> --release-version <MV> writes a meta.properties into each log.dirs directory (cluster.id, node.id, and a per-directory directory.id UUID), and lays down a bootstrap.checkpoint file. That checkpoint is the BootstrapMetadata: a tiny record set whose first record is a FeatureLevelRecord pinning the initial metadata.version (the --release-version you chose), so the cluster comes up at a known feature level rather than the code's default.
Declare the initial voters (KIP-853). For a dynamic controller quorum, the format step takes --standalone (a single-node quorum), --initial-controllers <list> (an explicit voter set, each id@host:port:directory-id), or --no-initial-controllers; the older static model instead lists voters in controller.quorum.voters.

Only after a node's directories are formatted does KafkaRaftServer start the Raft stack against them. The deep mechanics of registration and the bootstrap metadata live in The KRaft Controller and Metadata Propagation & Broker Lifecycle.

The metadata quorum and __cluster_metadata

The controllers form a Raft quorum using Kafka's own Raft dialect (KRaft). The quorum's state is a single topic partition, __cluster_metadata-0, replicated across the controller voters. One controller is the Raft leader (the "active controller"); the others are hot standbys. This partition is the single source of truth for the whole cluster: nothing about a topic, a partition's leader, or an ACL is "true" until it is a committed record in this log.

Clients

producers write to partition leaders; consumers read from leaders (or rack-local followers)

ProducersConsumersAdmin

↓ produce · fetch | metadata responses ↑

Metadata plane, controller quorum (KRaft)

a Raft quorum owns cluster metadata as an event log, __cluster_metadata-0; standbys fetch from the active leader (the single source of truth)

controller · active leadercontroller · standbycontroller · standby

↓ committed records replayed: TopicRecord · PartitionChangeRecord · RegisterBrokerRecord · AccessControlEntryRecord …

Data plane, brokers

each broker replays committed metadata into a MetadataImage, hosts topic-partition replicas, and replicates partition data among themselves

broker 0broker 1broker 2

KRaft cluster anatomy: a Raft controller quorum owns metadata as an event log (the single source of truth, __cluster_metadata-0); brokers replay committed metadata and replicate partition data among themselves, while clients produce to and fetch from the brokers.

clients — producers / consumers / admin metadata plane — controller quorum data plane — brokers stacked bands = planes; ↓↑ separators = the flow between them Record = committed metadata record type

Brokers are clients of this metadata log. Each broker registers with the active controller (BrokerRegistrationRequest), is fenced until it has caught up, then continually pulls committed metadata records via the Raft Fetch path and sends periodic heartbeats (which double as liveness). Crucially, the controller does not push LeaderAndIsr/UpdateMetadata/StopReplica RPCs the way the ZooKeeper-era controller did; making metadata a log the brokers replay turns topic create/delete into an O(1) append for the controller and eliminates the controller-vs-ZooKeeper state divergence that plagued the old design. The deep mechanics are in KRaft Consensus, The KRaft Controller, and Metadata Propagation & Broker Lifecycle.

Two Rafts, do not conflate them

KRaft (the metadata quorum over __cluster_metadata) and ordinary partition replication (leader/follower ISR replication of your topic data) are different mechanisms. KRaft uses a majority-vote high watermark over the controller voters; topic-partition replication uses the ISR / min-LEO high watermark over the partition's replicas. They share a code lineage (followers fetch in both) but are governed by separate rules. KRaft replaced only ZooKeeper's metadata role, not data replication.

Control plane vs data plane, and the metadata backbone

The split above is enforced in the request layer too. Brokers do not mutate cluster metadata directly. Around eighteen administrative APIs, CreateTopics, DeleteTopics, the CreateAcls/DeleteAcls family, AlterClientQuotas, ElectLeaders, reassignments, UpdateFeatures, the Raft-voter APIs, when received by a broker are wrapped in an Envelope and forwarded to the active controller, which re-dispatches them through ControllerApis (KafkaApis.scala → forwardToController → ControllerApis.scala). The AlterConfigs/IncrementalAlterConfigs family is a partial exception: its handler preprocesses some resources locally and forwards only the remainder. The controller is the only writer of metadata; brokers are readers.

The propagation backbone is a clean pipeline. The active controller is a deterministic replicated state machine: a single-threaded event loop (QuorumController, thread quorum-controller-{id}-event-handler) turns each request into a list of metadata records, appends them to the Raft log, and applies them to in-memory state under a strict write → commit → apply discipline so the active controller and every standby reach byte-identical state by replaying the same records in offset order.

admin requestCreateTopics, AlterConfigs, ElectLeaders …

ControllerApison the active controller

QuorumControllersingle-threaded event loop: 1. generate records · 2. prepareAppend → Raft log · 3. replay() optimistically

__cluster_metadataRaft commit (majority over voters)

MetadataLoadersingle thread, folds committed batches into a MetadataDelta

new immutable MetadataImageprovenance + 9 sub-images: topics, configs, cluster, ACLs, quotas, features …

MetadataPublishersordered chain, fan out to each subsystem

KRaftMetadataCachesetImage(), serves Metadata responses

ReplicaManagerapplyDelta(), make leaders / followers

Group / Txn coordinatorselect / resign shards

Acl / Scram · quotasclient-metrics publishers

The metadata backbone: the controller writes records, Raft commits them, and every broker replays them into an immutable MetadataImage that publishers fan out to each subsystem.

admin request (client) controller / metadata plane broker-side component coordinator shard cylinder = the metadata log generate / apply async Raft commit callback

On the broker side, the MetadataLoader consumes Raft commits on one thread, folds them into immutable MetadataImage snapshots (a MetadataProvenance plus nine sub-images, rebuilt incrementally via MetadataDelta), and publishes each new image to an ordered chain of MetadataPublishers. The BrokerMetadataPublisher is what actually turns metadata into behaviour: it swaps the cache image, then calls replicaManager.applyDelta(topicsDelta, newImage) to drive partitions into leader/follower roles, and elects or resigns coordinator shards. This is the join between the metadata plane and the data plane.

The end-to-end data path for a record

Here is the journey of a single record, from a producer's send() to a consumer's poll(), with each hop linked to its chapter. This is the most important diagram in the guide.

PRODUCER · app thread · §16

send(record)serialize · pick partition (explicit / murmur2(key) / sticky)

RecordAccumulatorper-partition deque of ProducerBatch (fill to batch.size or linger.ms; compress) · §01

Sender I/O threadstamp (PID, epoch, seq) if idempotent · build ProduceRequest · §02

BROKER · leader of the partition · §06 · §07

Acceptor ▶ Processor (NIO) ▶ RequestChannel ▶ KafkaRequestHandler ▶ KafkaApis.handle

ReplicaManager.appendRecordsauthorize; for acks=all reject if ISR ‹ min.insync.replicas · §08

UnifiedLog.appendvalidate · assign offsets from LEO · stamp leader epoch · write to active segment · advance LEO · §03

acks?

fire-and-forgetno broker reply; client surfaces offset −1

ack nowreply once written to leader

DelayedProduce in PRODUCE purgatorypark; wait for HW · §08

RECORD COMMITTEDoffset ‹ high watermark

CONSUMER · poll loop · §17

Fetch from leader(or rack-local follower, KIP-392); page-cache read bounded by isolation: read_uncommitted→HW, read_committed→last stable offset, skip aborted · §09 · §14

FileRecords via sendfile (zero-copy)if ‹ fetch.min.bytes, park in FETCH purgatory until data / max.wait.ms · §03

poll()deliver ≤ max.poll.records to the app; process

OffsetCommitgroup coordinator writes offset to __consumer_offsets (replicated) · §13

The full data path: producer batching → ProduceRequest to the leader → log append → ISR replication → high-watermark commit (acks) → consumer fetch bounded by HW/LSO → group offset commit.

client (producer / consumer) broker server component the log / segments coordinator / purgatory (waiting) cylinder = a log / store acks? rounded = decision data flow async (await HW / replicate) pill = phase

Narrating the hops:

Batching (producer). send() never does network I/O on the caller's thread. It serializes, selects a partition (explicit, murmur2 key-hash, or the adaptive sticky BuiltInPartitioner), and appends bytes into a per-partition deque in the RecordAccumulator. A single Sender thread drains full or lingering batches. Batching is what makes Kafka fast, it amortizes round-trips and turns writes into large sequential chunks. See The Producer Client.
ProduceRequest to the leader. The batch travels as a length-prefixed, versioned RPC (Wire Protocol & RPC) to the broker that currently leads that partition. The client learned the leader from a cached Metadata response.
Append to the log. The broker's network/handler threads hand the request to KafkaApis (Request Processing), which authorizes and calls ReplicaManager. For acks=all the leader first rejects if the ISR is below min.insync.replicas. Then UnifiedLog.append validates the batches, assigns offsets from the current log-end offset, stamps the leader epoch into each batch header, and writes to the active segment, advancing the LEO (The Log Storage Engine).
Replication to followers (ISR). Followers are themselves fetch clients of the leader: their ReplicaFetcherThreads pull new records and report their own LEO back. The leader tracks each follower and maintains the in-sync replica set. See Replication, ISR & High Watermark and The Fetch Path.
High-watermark commit (acks). The high watermark is the minimum LEO over the ISR, the boundary below which data is committed and can never be lost to a clean leader change. For acks=all, the produce request waits in the produce purgatory until the HW reaches its offset, then the ack is sent. acks=1 replies once the leader has written; acks=0 is fire-and-forget, the broker sends no response at all (a NoOpResponse), and the client's send callback completes locally with a fabricated RecordMetadata offset of −1.
Consumer fetch. A consumer fetches from the leader (or, with KIP-392, a rack-local follower). The leader serves the read from the OS page cache, bounded by an isolation level: read_uncommitted sees up to the HW, read_committed sees only up to the last stable offset and skips aborted-transaction batches. Bytes are sent with sendfile (zero-copy). See The Consumer Client.
Group offset commit. After processing, the consumer commits its position to the group coordinator, which durably writes it to the internal __consumer_offsets topic. On restart or rebalance, consumers resume from the committed offset. See Group Coordination.

The storage model in one screen

A partition replica on disk is a directory named topic-partition containing a sequence of segments. Each segment is a .log file of record batches plus three sparse, memory-mapped sidecar indexes. The layered abstraction is UnifiedLog → LocalLog → LogSegments → LogSegment.

File in `/var/lib/kafka/orders-0/`	Contents
`00000000000000000000.log`	record batches (the data); the 20-digit prefix is the segment base offset = first offset in the file
`00000000000000000000.index`	sparse offset → file-position (8 B/entry)
`00000000000000000000.timeindex`	sparse timestamp → relative offset (12 B/entry)
`00000000000000000000.txnindex`	aborted-txn ranges (lazy; `read_committed`)
`00000000000000004096.log`	next segment; rolls at `segment.bytes` (1 GiB), `segment.ms` (7 d), or when an index fills
`00000000000000004096.index`	index for the next segment
`00000000000000004096.snapshot`	producer-state snapshot (idempotence/txn)
`leader-epoch-checkpoint`	epoch → start offset (truncation authority)
`partition.metadata`	topic id

Read-by-offset: offset → OffsetIndex.lookup (greatest ≤ target) → short forward scan → FileRecords.slice → sendfile to socket (no JVM heap copy).

On-disk layout of one partition replica: append-only segments with sparse indexes; reads translate an offset to a file position and stream bytes zero-copy.

00000000000000000000 = 20-digit segment base offset (the file's first offset) .log / .index / .timeindex / .txnindex = data + sparse sidecar indexes → in the read path = "translates to / then"

The storage design is a deliberate set of bets, every one of which trades generality for throughput:

Sequential everything. Appends go to the tail; reads scan forward. Sequential disk I/O is orders of magnitude faster than random, so performance is decoupled from total retained volume.
Sparse indexes, relative offsets. An index entry is added only every index.interval.bytes (4 KiB) of log. Offsets are stored relative to the segment base so a 64-bit offset fits in 4 bytes. Indexes carry no checksum and are rebuilt from the log if corrupt.
Lean on the OS page cache. Kafka keeps no in-JVM record cache. The kernel page cache survives broker restarts, avoids double-buffering, and avoids GC pressure on huge heaps. A caught-up cluster does essentially no disk reads.
Zero-copy. Reads return FileRecords, a window over the segment's FileChannel; FileRecords.writeTo bottoms out in sendfile(2), so committed bytes flow page-cache → NIC without entering user space. (Zero-copy is defeated by TLS or by down-converting v2 to an old format.)
Durability via replication, not fsync. By default flush.messages and flush.ms are effectively infinite, Kafka does not fsync every write. Durability comes from replication across the ISR plus the page cache, not per-write disk sync.

Two retention modes govern a log's lifecycle, and a third tier extends it:

Mode	`cleanup.policy`	What it keeps	For
Retention (delete)	`delete`	Whole oldest segments deleted by size (`retention.bytes`) or age (`retention.ms`, default 7 d)	Event streams, logs, most topics
Compaction	`compact`	At least the latest value per key; `null` value = tombstone (retained `delete.retention.ms`)	Changelogs, CDC, state, `__consumer_offsets`, `__transaction_state`
Tiered storage	(orthogonal)	Closed segments offloaded to object store; local tier keeps the hot tail	Near-infinite retention, small clusters

Retention and compaction mechanics are in Log Management, Retention & Compaction; the remote tier (RSM/RLMM SPIs, __remote_log_metadata, copy/fetch quotas) is in Tiered Storage.

Coordination & delivery semantics

Consumer groups & rebalancing

A consumer group scales out consumption: the group's members divide the subscribed partitions among themselves, with each partition assigned to exactly one member (preserving per-partition order). A broker-side group coordinator, the leader of the relevant __consumer_offsets partition, owns the group's membership and offsets, persisting them as records in that partition's log (its durable state is the log; failover is just a replay). The rebalance protocol has evolved decisively from client-driven to server-driven:

Protocol	`group.protocol`	How assignment happens
Classic	`classic`	A consumer (the elected "leader," not the broker) computes the assignment via `JoinGroup`/`SyncGroup`; eager (stop-the-world) or incremental cooperative (KIP-429)
Consumer KIP-848	`consumer`	A single long-lived `ConsumerGroupHeartbeat` RPC; assignment is fully server-side; incremental reconciliation via three epochs (group / assignment / member). GA in 4.0; server-side assignors only
Streams KIP-1071	`streams`	The KIP-848 model extended to Kafka Streams task assignment; Early Access in 4.1, Generally Available (core feature set) since 4.2

The new Java group coordinator (default on KRaft in 4.0) runs each __consumer_offsets partition as a single-threaded replicated state machine and also backs share groups and Streams groups. Full detail in Group Coordination & Rebalance Protocols.

Exactly-once & transactions

Exactly-once semantics is two layers (KIP-98), both carried in the v2 batch header and reconstructable on log reload:

The idempotent producer deduplicates retries within one producer session, per partition, using a producer id (PID) + epoch + monotonic per-partition sequence number that the leader's ProducerStateManager validates. On by default since 3.0 (with acks=all, in-flight ≤ 5).
Transactions give atomic, all-or-nothing writes across many partitions and across sessions. A TransactionCoordinator maps a user transactional.id to a PID with epoch fencing, runs a two-phase commit over the internal __transaction_state log, and writes COMMIT/ABORT marker control records into every involved partition. read_committed consumers stop at the last stable offset and filter aborted batches via the transaction index.

Folding consumer-offset commits into the transaction (sendOffsetsToTransaction) makes the read-process-write loop a single atomic unit, the basis for Kafka Streams' end-to-end exactly-once. See Transactions & Exactly-Once.

Share groups (queues)

Share groups KIP-932 add a queue-like model: many consumers cooperatively read the same partition (N:M), and records are leased individually under a time-bounded acquisition lock with a per-record delivery state (Available → Acquired → Acknowledged/Archived) and delivery count, rather than committed by offset. This gives at-least-once with bounded redelivery and an optional dead-letter queue, decoupling parallelism from partition count. Durable state lives in __share_group_state via a share coordinator. See Share Groups.

Delivery & ordering guarantees

Guarantee	What holds	Mechanism
Per-partition ordering	Records in a partition are read in offset order, always	Single leader assigns dense offsets at append; followers copy a prefix
No cross-partition ordering	Order across partitions is undefined	Partition is the unit of ordering by design
At-least-once	The default with `acks=all` + retries; a record may be delivered more than once	Producer retries; consumer commits after processing
Exactly-once	Opt-in: idempotent producer + transactions + `read_committed`	PID/epoch/sequence dedup; 2PC markers; LSO read boundary
Read boundary (HW)	`read_uncommitted` consumers never see uncommitted data	High watermark = min LEO over ISR; reads clamp to HW
Read boundary (LSO)	`read_committed` consumers never see undecided/aborted txn data	Last stable offset = below it every transaction is decided

acks=all alone is not "no data loss"

acks=all waits for all members of the current ISR, which can shrink to just the leader. Durability requires acks=all and min.insync.replicas >= 2 so the leader rejects writes when the ISR is too small. The classic loss case is replication.factor=3, min.insync.replicas=1, acks=all. Eligible Leader Replicas (KIP-966) close the remaining "last replica standing" gap by freezing HW advancement when the ISR drops below min ISR.

The threading model of a broker at a glance

A broker is a carefully partitioned set of thread pools, each with a single job, communicating through bounded hand-off queues. Nothing on the request hot path blocks on disk or network from a handler thread, slow work is parked in a purgatory and completed by an event later.

socketinbound TCP connection

ACCEPTOR1 per listener, round-robins connections to processors

PROCESSOR threads (NIO)num.network.threads / listener (default 3); one Selector each · frame: 4-byte size + payload

RequestChannelbounded ArrayBlockingQueue (queued.max.requests = 500), the back-pressure point

KafkaRequestHandler poolnum.io.threads (default 8): KafkaApis.handle (match on apiKey, ~80 handlers) ▶ authorize ▶ act ▶ build response ▶ throttle ▶ route to originating processor

ReplicaManager → UnifiedLogappend / read · Group / Txn / Share coordinators · MetadataCache

DelayedOperationPurgatoryProduce, Fetch, DeleteRecords, RemoteFetch … hierarchical TIMING WHEEL for timeouts; completes when its condition fires

running alongside, background threads

background thread poolsReplicaFetcherThreads (pull from leaders) · log cleaner / retention schedulers · ExpirationReapers (advance the wheel) · ThrottledChannelReapers (quotas) · if combined: quorum-controller event loop + kafka-raft-io-thread

Broker threading: acceptor → processor (NIO reactor) → bounded RequestChannel → handler pool → KafkaApis, with non-completable work parked in purgatories driven by a hierarchical timing wheel.

external socket / connection broker thread pool / component purgatory (parked, waiting) cylinder = the bounded hand-off queue request hand-off parked / runs alongside

Acceptor: One non-daemon thread per listener; accepts TCP connections and round-robins them to processors. Connection quotas (count + creation rate) are enforced here.
Processor (network thread): num.network.threads per listener (default 3), each owning one java.nio.Selector. Reads size-delimited requests, hands them to the RequestChannel, and writes responses back. Per-connection ordering is guaranteed by muting a channel while its request is in flight.
Request handler (I/O thread): num.io.threads daemon threads (default 8) running KafkaApis.handle, a giant dispatch on apiKey to ~80 handlers. The RequestChannel (bounded at queued.max.requests = 500) is the back-pressure point: when handlers fall behind, network threads block on the queue and stop reading sockets, turning handler slowness into TCP back-pressure.
Replica fetcher: Background threads on a follower broker that pull from leaders (num.replica.fetchers per source, keyed by leader+fetcherId). Followers are fetch clients of leaders.
Purgatory & reapers: A DelayedOperationPurgatory parks operations (acks=all produce, under-min-bytes fetch, remote fetch, …) that cannot complete immediately, indexed by watch key and by a hierarchical timing wheel (O(1) insert/cancel). An ExpirationReaper advances the clock; operations complete when their condition fires or they time out.
Coordinator / controller threads: Group/transaction/share coordinators run per-partition single-threaded event loops. A combined node also runs the controller's quorum-controller event loop and the kafka-raft-io-thread.

The request lifecycle is instrumented end to end: each request carries timestamps decomposing latency into request-queue time, local processing, remote (purgatory) time, response-queue time, and send time, each written by a different actor. The reactor (Selector + KafkaChannel) is the same code used by producers, consumers, and replica fetchers. Deep dives: Network Layer & Threading and Request Processing (KafkaApis).

Map of the codebase

The repository is a multi-module Gradle build. Knowing which module owns what is half the battle when reading source. The principal modules:

Module	Language	What lives there	Chapters
`clients`	Java	The producer, consumer, admin, and share clients (the `AdminClient` has no chapter of its own, its controller-forwarding pipeline surfaces in 07); the shared NIO reactor (`Selector`, `KafkaChannel`); the wire-protocol message types and generated schemas; record-format classes; security channel stack	01, 02, 16, 17, 18
`core`	Scala	The broker itself: `SocketServer`, `KafkaApis`, `ReplicaManager`, `Partition`, `BrokerServer`/`ControllerServer`/`KafkaRaftServer`, replica fetchers, the transaction coordinator runtime, broker-side metadata publishing	06, 07, 08, 09, 14
`storage`	Java	The log engine: `UnifiedLog`, `LocalLog`, `LogSegment`, the indexes, `LogManager`, the `LogCleaner` (compaction), `ProducerStateManager`, tiered-storage `RemoteLogManager`	03, 04, 05
`metadata`	Java	The KRaft controller (`QuorumController` and the per-domain control managers), the `MetadataImage`/`MetadataDelta`/`MetadataLoader` pipeline, `PartitionRegistration`, the `StandardAuthorizer`	11, 12, 18
`raft`	Java	Kafka's own Raft implementation: `KafkaRaftClient`, the `QuorumState` role machine, snapshots, dynamic quorum reconfiguration	10
`group-coordinator`	Java	The modern Java group coordinator: `CoordinatorRuntime`, `GroupMetadataManager`, `OffsetMetadataManager`, classic + KIP-848 protocols, server-side assignors; share-group config	13, 15
`transaction-coordinator`	Java/Scala	Producer-ID management (`RPCProducerIdManager`), transaction state schemas, marker plumbing	14
`share-coordinator`	Java	The share-group state coordinator and its `__share_group_state` persistence	15
`server` / `server-common`	Java	Shared server infrastructure: config classes, `ProcessRole`, the purgatory + timing wheel, quotas, broker lifecycle, `AssignmentsManager`, feature/metadata-version definitions	06, 12, 19
`streams`	Java	Kafka Streams: topology builder, `StreamThread`, the task model, state stores, the changelog restorer, EOS-v2	20
`connect`	Java	Kafka Connect: the plugin SPI, `Worker` and task loops, the `DistributedHerder`, the three internal topics, MirrorMaker 2	21

How to read this guide

The twenty-one detailed chapters are written to stand alone, but they build on each other. Here is a recommended path and a logical grouping. If you read top to bottom you will rarely hit a forward reference you do not already understand.

Alternative entry points

"I operate clusters": Read 08 (replication/durability), 12 (broker lifecycle), 04 (retention), 05 (tiered storage), 19 (quotas), 18 (security). The durability gotcha above is the single most important operational fact.
"I write producers/consumers": Read 16, 17, then 13 (groups) and 14 (EOS) for semantics, with 01/02 for what goes over the wire.
"I'm here for KRaft": Read 10 (the Raft dialect), 11 (the controller state machine), 12 (how brokers consume metadata). Note that KRaft is a pull-based Raft dialect with no leader heartbeats, not textbook Raft.
"I want exactly-once": Read 14 first, then 13 (offset commits fold into transactions) and 03 (producer state on disk). Streams EOS-v2 is in 20.

Why this structure

The guide is layered like Kafka itself: bytes, then storage, then the broker, then durability, then the metadata plane, then coordination, then clients, then the frameworks built on all of it. Each chapter is derived directly from the 4.4.0-SNAPSHOT source (git 04bfe7d) with file-and-line citations, KRaft-only, and fact-checked, not paraphrased from official docs. Where a date or attribution is commonly mis-stated (e.g. "acks=all guarantees no loss," "KRaft is standard Raft," "tiered storage was GA in 3.6"), the chapters call out the correct fact explicitly.