krivaltsevich.com Kafka Internals4.4

II · 00 · The Operator's Mental Model

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

Before any runbook, any capacity formula, any tuning dial, you need a correct model of what a Kafka cluster is in KRaft 4.x, because nearly every production incident is a control loop that stopped converging, and you cannot diagnose a feedback loop you cannot name. This chapter assembles that model: the two kinds of node and what process.roles decides; the metadata plane (the controller quorum) versus the data plane (brokers) and the hard line of ownership between them; the five self-correcting control loops, broker heartbeats & fencing, ISR maintenance, high-watermark advancement, metadata propagation, group rebalancing, each presented as something you observe rather than command; the five internal topics you implicitly operate and exactly what breaks when each one is unhealthy; the four SLIs that actually predict user pain and how to turn them into SLOs; and a one-screen map of "what can go wrong, where" that the remaining twelve chapters of Part II expand into detail. Every default and metric name here is pinned to the source file that defines it.

What you actually run: nodes, roles, and one process class

A KRaft cluster is a set of JVM processes, every one of which is constructed by the same class, KafkaRaftServer (core/src/main/scala/kafka/server/KafkaRaftServer.scala:49). That single class reads one decisive config, process.roles (raft/src/main/java/org/apache/kafka/raft/KRaftConfigs.java:32; it has no default, you must set it), and from it instantiates an Option[BrokerServer], an Option[ControllerServer], or both (KafkaRaftServer.scala:76-90). The legal role names are exactly two, broker and controller, enumerated in server/src/main/java/org/apache/kafka/server/ProcessRole.java:21-22. There is no third "ZooKeeper" thing to run: ZooKeeper was removed in 4.0, and this build has no migration path back to it.

So the entire operational fleet decomposes into three deployment shapes of one process:

process.rolesWhat this node isRuns (from KafkaRaftServer)When to use
brokerData-plane node: holds partition replicas, serves produce/fetch.BrokerServer only (:76)Production fleets of any size; brokers scale horizontally and independently of the quorum.
controllerMetadata-plane node: a voter in the Raft quorum that owns cluster metadata.ControllerServer only (:82)Dedicated controllers, the recommended shape for any cluster you care about; isolates metadata from data load.
broker,controllerCombined ("co-located") node: both at once, sharing a JVM.both BrokerServer + ControllerServer (:76,:82); detected by isKRaftCombinedMode (KafkaConfig.scala:180-181)Dev, test, small/edge clusters. Avoid for large production: broker GC pauses and page-cache pressure can disrupt the controller in the same heap.

Crucially, even a combined node holds exactly one SharedServer instance (core/src/main/scala/kafka/server/SharedServer.scala:97; its doc comment at :86-95 states this explicitly). The SharedServer owns the components that the broker and controller halves both need: the KafkaRaftManager (the Raft client for the metadata log), the MetadataLoader, and the SnapshotGenerator (SharedServer.scala:123-130). This matters operationally: in combined mode you are not running two independent servers that happen to share a host, you are running two role-players over one Raft client and one metadata pipeline. A fault in metadata loading is fatal for a process that plays the controller role and merely logged-and-incremented for a pure broker (SharedServer.scala:203-210), which is the first hint that the two planes have very different blast radii.

Rule of thumb · quorum sizing

Run 3 controllers for a normal production cluster (tolerates 1 controller loss) or 5 for the largest/most critical (tolerates 2). The number must be odd, a Raft quorum needs a strict majority, so an even count gives no extra fault tolerance and wastes a node. Controllers do small, metadata-shaped work; you do not scale them with traffic. You scale brokers with traffic. Cross-link KRaft consensus for the quorum mechanics and Topologies for placement.

Two planes, one hard line of ownership

The single most useful abstraction for operating KRaft is the split between the metadata plane and the data plane. They run different code, fail differently, are sized differently, and own strictly disjoint state. Confuse them and you will reach for the wrong runbook.

Metadata plane, the controller quorum
A Raft cluster of ControllerServer voters. One is the active controller (leader); the rest are hot standbys replaying the same log. Owns the authoritative cluster state: topics, partitions, replica assignments, ISR membership, broker registrations & liveness, configs, ACLs, quotas, feature levels. State lives in the __cluster_metadata log (single partition) and is published to brokers as a stream of records + periodic snapshots.
ControllerServerQuorumControllerKafkaRaftManager__cluster_metadata-0
↕ replicate metadata log (Raft) · brokers fetch it, never write it
Data plane, the brokers
A horizontally scaled set of BrokerServer nodes. Each is a follower of the metadata log and a materializer of the slice of cluster state that concerns it. Owns the data: partition replicas on local disk (and tiered object storage), the high-watermark per partition, the request-handling reactor, the group/transaction/share coordinators. Talks to the controller only to register, heartbeat, and propose ISR changes.
BrokerServerReplicaManagerLogManagerKRaftMetadataCachepartition logs on disk
The KRaft control/data split. The controller quorum is the single writer of cluster metadata; brokers are read-only consumers of that metadata and the sole owners of partition data.
metadata plane (controller / quorum) data plane (brokers) metadata log replication (one direction of authority)

The boundary is enforced in code, not convention. Brokers reach the controller through a NodeToControllerChannelManager for forwarding and for proposing changes (BrokerServer.scala:254-264), and they keep a read-optimized KRaftMetadataCache built by replaying the metadata log (BrokerServer.scala:223). A broker never mutates topic/partition/ISR state directly; it asks. Two concrete asks define almost everything an operator watches:

  • Broker → controller liveness: the broker's BrokerLifecycleManager registers with the controller (per KIP-631) and sends periodic heartbeats (server/src/main/java/org/apache/kafka/server/BrokerLifecycleManager.java:64-67). The controller decides whether the broker is alive (unfenced) or dead (fenced).
  • Broker → controller ISR proposals: when a partition leader wants to shrink or grow its in-sync set, it does not act unilaterally; it sends an AlterPartition request via the AlterPartitionManager (BrokerServer.scala:305-310) and the controller commits the new ISR to the metadata log. This is why ISR changes show up as controller-driven, log-ordered events.
Why a single-writer metadata plane

In the ZooKeeper era, cluster metadata lived in ZK and controller failover meant re-reading O(partitions) of state, Confluent measured ~28 s for 100k partitions on Kafka 1.0.0, dropping to ~14 s on 1.1.0, and it stayed structurally linear, which is the real reason ZK capped clusters near ~200k partitions (empirical; Confluent "200K Partitions" post). KRaft replaces that with a replicated log the standby controllers already hold in memory, so a new leader is current the instant it wins election. The lab demonstration, Confluent's 2,000,000-partition cluster, ~10× the ZK ceiling, is the headline, but the operational point is the mechanism: failover no longer reloads state (empirical; Confluent KRaft docs / KIP-500). Do not over-read "millions of partitions" as a per-broker SLA, per-broker counts are still bounded by file descriptors, fetcher threads, and Linux vm.max_map_count (default 65,530, ~2 mmaps/partition ⇒ ~32k partitions/broker until raised) (empirical; Instaclustr). See Limits and Partitioning.

The five control loops you observe (not command)

Kafka is not a request/response database you drive imperatively; it is a bundle of negative-feedback control loops, each continuously measuring an error and correcting toward a setpoint. Health is "all loops converging." Every page-worthy incident is a loop that has stopped converging, diverging, oscillating, or stuck. Learn the loops and the metrics fall out of them, because each loop's "error signal" is a metric you already have.

① Heartbeat & fencingsetpoint: every broker alive ⇒ unfenced
② ISR maintenancesetpoint: ISR = all caught-up replicas
③ High-watermark advancesetpoint: HW = min(LEO over ISR)
④ Metadata propagationsetpoint: broker cache = committed log
⑤ Group rebalancingsetpoint: every partition assigned once
The five control loops, roughly in dependency order: liveness gates membership, membership gates the ISR, the ISR gates the high-watermark (durability), metadata propagation distributes the resulting truth, and rebalancing keeps consumption matched to it.
controller-owned loop broker-owned loop log/durability loop coordinator loop dependency / drives

① Broker heartbeat & fencing, the liveness loop

Each broker sends a BrokerHeartbeat to the active controller every broker.heartbeat.interval.ms = 2000 ms (KRaftConfigs.java:40). The controller grants a lease that expires after broker.session.timeout.ms = 9000 ms (KRaftConfigs.java:44), its doc string literally reads "the length of time… that a broker lease lasts if no heartbeats are made." Miss heartbeats for the session window and the controller fences the broker: it stops being a valid leader/ISR member, and its leaderships migrate elsewhere. The broker's own state machine tracks this, NOT_RUNNING(0) → STARTING(1) → RECOVERY(2) → RUNNING(3) → PENDING_CONTROLLED_SHUTDOWN(6) → SHUTTING_DOWN(7) (metadata/src/main/java/org/apache/kafka/metadata/BrokerState.java:47-80), and a freshly registered broker stays fenced until it has caught up on the metadata log and the controller transitions it from RECOVERY to RUNNING (the heartbeat-response handler logs exactly that on unfence, BrokerLifecycleManager.java:665-666). The loop's error signal: heartbeats not arriving. Operator-visible as the controller-side TimedOutBrokerHeartbeatCount and the FencedBrokerCount / ActiveBrokerCount gauges (controller/.../metrics/ControllerMetadataMetrics.java:49,51; QuorumControllerMetrics.java:62).

Why 9000 over 2000

The 4.5× ratio between session timeout and heartbeat interval tolerates two consecutive missed heartbeats plus jitter before fencing, long enough to ride out a GC pause or a brief network blip, short enough that a genuinely dead broker is evicted in under ~10 s. This is the same design logic as a consumer's session.timeout.ms ≈ 3 × heartbeat.interval.ms. Do not set them equal; a single dropped packet would then fence a healthy broker. The interval also bounds how fast a fenced broker re-joins, because the broker re-heartbeats aggressively (it schedules the next heartbeat ~10 ms out after a fence-relevant response, BrokerLifecycleManager.java:660).

② ISR maintenance, the membership loop

For each partition, the leader maintains the in-sync replica set (ISR): the replicas that have fetched up to (close to) the leader's log-end offset recently. A follower is dropped from the ISR if it hasn't caught up within replica.lag.time.max.ms = 30000 ms (server/src/main/java/org/apache/kafka/server/config/ReplicationConfigs.java:55); when it catches up, it is re-added. Shrinks and expansions are committed through the controller (loop ①'s liveness gates who is even eligible). The error signal is the gap between the replica set and the ISR, surfaced as UnderReplicatedPartitions (core/src/main/scala/kafka/server/ReplicaManager.scala:94,241) and as the rates IsrShrinksPerSec / IsrExpandsPerSec (:100,101). In steady state both rates are zero; sustained shrinking with no matching expansion means a follower (or its disk/network) is falling behind, or a broker is down. Cross-link Replication.

③ High-watermark advancement, the durability loop

The high-watermark (HW) is the offset up to which a record is replicated to the whole current ISR and therefore safe to expose to consumers. It advances to the minimum log-end offset across the ISR. This is the loop that makes "committed" mean something: a producer with acks=all is told its write succeeded only once the HW (gated by min.insync.replicas) covers it. When the ISR shrinks, the HW advances against fewer replicas, faster, but less durable; that tradeoff is the entire reason min.insync.replicas exists, and why its default of 1 (ServerLogConfigs.java:155) is unsafe for durable topics. The operator levers here are covered in depth in Durability; the loop's health is visible as UnderMinIsrPartitionCount and AtMinIsrPartitionCount (ReplicaManager.scala:95,96,242,243).

Invariant · the durability triad

replication.factor=3 + min.insync.replicas=2 + producer acks=all makes any single broker failure non-data-losing: a write is acknowledged only after ≥2 replicas hold it, so the surviving replica always has every committed record. The compiled-in defaults do not give you this, default.replication.factor is 1 (ReplicationConfigs.java:42), min.insync.replicas is 1 (ServerLogConfigs.java:155), and num.partitions is 1 (ServerLogConfigs.java:36). You must set the triad explicitly. And note the silent trap: min.insync.replicas is capped by the actual replica count, so a topic with RF=1 still accepts writes even when the cluster default is 2, verify min.insync.replicas ≤ replication.factor per topic (empirical; Conduktor).

④ Metadata propagation, the truth-distribution loop

Once the controller commits a change (a new leader, an ISR update, a created topic), every broker must learn it. Brokers run a MetadataLoader that replays the committed metadata log and applies it to the local KRaftMetadataCache (SharedServer.scala:326-334, BrokerServer.scala:223). To keep replay bounded, the controller periodically snapshots: a new snapshot is generated after metadata.log.max.record.bytes.between.snapshots = 20 MiB or metadata.log.max.snapshot.interval.ms = 1 hour, whichever first (raft/src/main/java/org/apache/kafka/raft/MetadataLogConfig.java:41,39), and the active controller writes a no-op record every metadata.max.idle.interval.ms = 500 ms so the committed offset keeps moving even when idle (:76). The error signal is metadata lag: how far a broker's applied offset trails the committed log. A broker that lags propagation routes clients to stale leaders. Cross-link Metadata Propagation; controller-side progress is observable via LastAppliedRecordOffset / LastCommittedRecordOffset (QuorumControllerMetrics.java:54,56).

⑤ Group rebalancing, the consumption-matching loop

The group coordinator (a broker role, materialized over __consumer_offsets) keeps every partition of a subscribed topic assigned to exactly one consumer in a group. Members join, leave, time out, or fall behind; each change triggers a rebalance that recomputes the assignment. The setpoint is "each partition assigned once, no orphans, no double-consumption." The pathology is the rebalance storm: a member that exceeds max.poll.interval.ms (default 5 min) is evicted, rejoins, and triggers another rebalance, self-sustaining (empirical; Conduktor/AWS). This loop is unique in that the actors are clients you may not control; see Group Coordination and Failure Modes. The newer consumer-group protocol (KIP-848) moves assignment computation broker-side to damp these storms.

LoopSetpointError signal (metric)Key tunable / default · sourceDetail chapter
① Heartbeat / fencingalive broker ⇒ unfencedFencedBrokerCount, TimedOutBrokerHeartbeatCountbroker.session.timeout.ms=9000 · KRaftConfigs.java:44KRaft Controller
② ISR maintenanceISR = caught-up replicasUnderReplicatedPartitions, IsrShrinksPerSecreplica.lag.time.max.ms=30000 · ReplicationConfigs.java:55Replication
③ HW advancementHW = min LEO over ISRUnderMinIsrPartitionCount, AtMinIsrPartitionCountmin.insync.replicas=1 · ServerLogConfigs.java:155Durability
④ Metadata propagationbroker cache = committed logmetadata lag; LastAppliedRecordOffsetsnapshot @ 20 MiB / 1 h · MetadataLogConfig.java:41,39Metadata Propagation
⑤ Group rebalancingeach partition assigned oncerebalance rate; consumer lagmax.poll.interval.ms=300000 (client)Group Coordination

The five internal topics you implicitly operate

You did not create them and you rarely name them, but five internal topics carry the cluster's nervous system. Four are ordinary (replicated, broker-hosted) topics whose names are hardcoded in clients/src/main/java/org/apache/kafka/common/internals/Topic.java:27-30 plus the tiered-storage metadata topic in storage/.../TopicBasedRemoteLogMetadataManagerConfig.java:44; one, __cluster_metadata, is special and is not a normal topic at all but the Raft metadata log itself. Know each one's defaults and, more importantly, its failure blast radius.

Internal topicOwner planeDefault partitions · RF · min ISRSourceIf it is unhealthy…
__cluster_metadatametadata (Raft log, not a managed topic)1 partition (fixed); replicated by the controller quorumTopic.java:30-34; partition 0 is CLUSTER_METADATA_TOPIC_PARTITIONCluster-wide outage of the control plane. No leader elections, no topic creation, no config/ISR changes commit. Existing partitions keep serving from cache, but the cluster cannot react to any change. This is the one topic whose loss is catastrophic.
__consumer_offsetsdata (group coordinator)50 · 3 · (inherits broker min.insync.replicas)GroupCoordinatorConfig.java:115,124Consumer groups cannot commit or load offsets. On coordinator loss, affected groups stall and rebalance; committed positions for a lost partition may be unreadable, risking re-processing or skipped messages. Producers/consumers of user topics keep flowing until they need to commit.
__transaction_statedata (transaction coordinator)50 · 3 · 2TransactionLogConfig.java:32,40,45EOS/transactional producers cannot begin, commit, or abort. In-flight transactions hang; with hung transactions the last stable offset stalls and read_committed consumers block on that partition (see PartitionsWithLateTransactionsCount). Non-transactional traffic is unaffected.
__share_group_statedata (share coordinator, KIP-932)50 · 3ShareCoordinatorConfig.java:38,42Share groups (queue-style consumption) cannot persist per-record acknowledgement state. Share-group delivery stalls; classic consumer groups and producers are unaffected. Only relevant if you use share groups.
__remote_log_metadatadata (tiered storage, KIP-405)50 · 3 · 2TopicBasedRemoteLogMetadataManagerConfig.java:54,56,57Tiered-storage metadata (which segments are in object storage) cannot be read/written. Fetches that must reach remote segments fail or stall; local (hot) data still serves. Only present when tiered storage is enabled.
Warning · the coordinator topics are RF=3 for a reason, do not under-replicate them

All four managed internal topics default to RF 3, and the transaction and remote-log-metadata topics default to min ISR 2 (sources above). On a cluster with fewer than 3 brokers the offsets/transaction topics will be created with whatever RF is available, silently weakening durability of the very state that coordinates everyone else. The classic production incident: a 3-broker cluster created __consumer_offsets at RF=3, one broker is lost long-term, an offsets partition goes under-min-ISR, and every group whose coordinator maps to it stalls, an outage that has nothing to do with user-topic configuration. Treat the internal topics as first-class: verify their RF and ISR after any cluster resize, and keep ≥3 brokers before relying on RF=3. Cross-link Metrics & Signals.

The four SLIs that predict user pain

You can collect hundreds of Kafka metrics; only four service-level indicators directly track what a producer or consumer experiences. Everything else is a cause or a leading indicator of one of these. Define your SLOs on these four and let the rest be diagnostics.

SLIWhat the user feelsPrimary signals (source)Example SLO (set yours to workload)
Availability (write/read acceptance)"Can I produce/consume right now?"OfflinePartitionsCount=0 (ControllerMetadataMetrics.java:62); UnderMinIsrPartitionCount=0 (ReplicaManager.scala:95); ActiveControllerCount sums to 1 (QuorumControllerMetrics.java:46)99.95% of produce requests to RF≥3 topics succeed (non-throttled) per 30-day window.
Durability (no acknowledged-write loss)"Did my committed data survive a failure?"UncleanLeaderElectionsPerSec=0 (any non-zero = loss); UnderMinIsrPartitionCount; durability triad in force0 unclean leader elections; 100% of acks=all writes retained across single-broker loss.
End-to-end latency (produce→consume)"How fresh is the data I read?"broker RequestMetrics TotalTimeMs p99 (produce/fetch); client produce/fetch latency; e2e probep99 produce ack < 50 ms; p99 e2e (produce→consume) < 200 ms for the hot path.
Consumer lag (backlog age)"How far behind is my pipeline?"client records-lag-max; external trend monitor (Burrow / exporter); alert on trend, not a static countLag for tier-1 groups drains within 5 min of a spike; sustained growth over 15 min pages.
Turning SLIs into SLOs, three rules

(1) Latency and lag are distributions and trends, never single thresholds. Alert latency on p99 over a window; alert lag on sustained growth (recovery time ≈ current lag ÷ net drain rate), because a healthy consumer that briefly spikes is fine and a dead consumer reports no lag at all, only an external monitor catches the stalled group (empirical; LinkedIn Burrow). (2) Availability and durability are near-binary, page on >0. OfflinePartitionsCount, UnderMinIsrPartitionCount, and UncleanLeaderElectionsPerSec should be zero in a healthy cluster; any sustained non-zero is user-visible (empirical; Confluent/Datadog/AWS MSK). (3) Aggregate controller gauges with SUM. ActiveControllerCount is 0 or 1 per node; the cluster is healthy iff the sum is exactly 1, sum 0 means no controller (critical), sum >1 means split-brain. Datadog's rule: alert on any value lasting longer than one second (empirical; Datadog).

One screen: what can go wrong, where

This is the map the rest of Part II fills in. Read it as a triage tree: a symptom at the top, the plane and loop it implicates, and the chapter that holds the runbook. The discipline is to localize to a plane and a loop first, control-plane problems and data-plane problems demand different responses, and within the data plane a durability problem and a latency problem have different fixes.

symptom observedSLI breached or alert fired
control plane or data plane?
ActiveControllerCount ≠ 1no/duplicate controller; metadata frozen
quorum can't elect< majority of controllers up
metadata lag risingbrokers route to stale leaders
durability, availability, or latency?
OfflinePartitions > 0no leader → unavailable
UnderMinIsr > 0acks=all rejected
UncleanLeaderElection > 0committed data lost
disk full / slowhigh LocalTimeMs; broker halts at 0 free
handler/network saturationidle% < 30%
rebalance storm / hot partitionlag grows; one partition skewed
The "what can go wrong, where" triage map. Localize to plane first (control vs data), then to the loop. Each leaf is expanded in a later Part II chapter.
control-plane fault (quorum/controller) data-plane node fault storage/disk fault coordinator/client fault availability / data-loss fault sibling symptom (same class)
SymptomPlane · loopLikely root causeWhere it's expanded
ActiveControllerCount sum = 0control ·, Quorum lost majority; controllers down or partitioned.Failure Modes, KRaft Consensus
Metadata lag climbing on a brokercontrol→data · ④Broker can't keep up replaying the metadata log; GC, disk, or network on that node.Metadata Propagation
OfflinePartitionsCount > 0data · ②③Partition has no leader: all replicas down, or sole in-sync replica lost.Failure Modes
UnderReplicatedPartitions > 0, all brokers updata · ②Slow follower, disk I/O, network, or GC on one node; not a broker-down event.Replication, Performance Tuning
UnderMinIsrPartitionCount > 0data · ③ISR fell below min.insync.replicas; acks=all producers get NotEnoughReplicas.Durability
UncleanLeaderElectionsPerSec > 0data · ③An out-of-sync replica was elected leader (only if you enabled it) → committed data truncated.Durability
Disk approaching full; broker haltsdata · storageRetention not keeping up with ingest; a full log dir crashes the broker without graceful shutdown.Capacity Planning, Storage Management
Handler/network idle% < 30%data ·, Request-thread or reactor saturation; under-provisioned or hot-spotted broker.Performance Tuning, Network & Threading
Consumer lag growing; one partition hotdata · ⑤Skewed key, slow consumer processing, or rebalance storm.Partitioning, Group Coordination
PartitionsWithLateTransactionsCount > 0; read_committed stalleddata ·, Hung transaction pins the last stable offset.Transactions & EOS, Failure Modes
Gotcha · "constant" vs "fluctuating" under-replication mean opposite things

A steady, constant UnderReplicatedPartitions with one broker absent is almost always "broker down", fix the node. A fluctuating URP while all brokers are up is a performance problem, a follower repeatedly drops out and rejoins because its disk, NIC, or GC can't keep pace with the leader's write rate (empirical; oneuptime). Same metric, two root causes, two completely different runbooks. Always correlate URP with broker liveness (loop ①) before acting, and watch IsrShrinksPerSec/IsrExpandsPerSec, a shrink with no matching expand is the fingerprint of a follower that fell behind and stayed behind.

Operator actions vs. emergent behaviour

A final framing that prevents a lot of grief: distinguish the handful of things an operator commands from the vast majority that the cluster does on its own. You issue a small set of imperative actions; the control loops above produce everything else as emergent behaviour. Reaching for an imperative action when a loop is mid-correction is how operators turn incidents into outages.

You command (imperative)Mechanism / config · sourceThe loops then do (emergent)
Start / format a nodeprocess.roles, node.id, controller.quorum.bootstrap.servers; metadata dir metadata.log.dir (MetadataLogConfig.java:34)Registration (KIP-631), catch-up on metadata, unfence, leadership assignment.
Graceful shutdown for maintenancecontrolled.shutdown.enable=true (ServerConfigs.java:97); broker enters PENDING_CONTROLLED_SHUTDOWNController migrates leaderships off the broker before it stops, so no partition goes leaderless. Why you rolling-restart one broker at a time.
Create / configure a topicAdmin API → controller commits to __cluster_metadataReplica placement, leader election, metadata propagation to all brokers (loop ④), then clients discover it.
Reassign partitions / rebalance datakafka-reassign-partitions.sh → controller AlterPartitionReassignmentsNew replicas catch up (loop ②), join ISR, old replicas drop; throttle to protect the durability/latency SLIs.
Change quorum membershipDynamic quorum (controller.quorum.bootstrap.servers) vs static controller.quorum.voters (QuorumConfig.java:58,67)Raft adds/removes a voter and re-replicates; majority must stay live throughout.
Invariant · controlled shutdown is the whole reason you can do rolling restarts

With controlled.shutdown.enable=true (the default), a broker asked to stop first transitions to PENDING_CONTROLLED_SHUTDOWN (BrokerState.java:70) and the controller moves its partition leaderships to other in-sync replicas before the process exits. That is why a planned single-broker restart causes no leaderless partitions while an unplanned crash does (the controller only learns it's gone after the 9 s session timeout). The operational consequence is concrete: restart brokers one at a time, wait for UnderReplicatedPartitions to return to 0 between each, and never restart two replicas of the same partition inside one replica.lag.time.max.ms window, the durability triad only protects you if at least min.insync.replicas replicas are continuously up. See Lifecycle.

Where to go from here

This chapter is the index to Part II. The model, two planes, five loops, five internal topics, four SLIs, one failure map, is the lens every later chapter assumes. From here: Configuration (the dials and how they're applied), Limits (hard constants vs soft configs vs emergent bounds), Partitioning and Capacity Planning (sizing the data plane), Performance Tuning and Durability (the throughput/latency/durability triangle), Failure Modes and Metrics & Signals (the runbooks and the signals that trigger them), Topologies and Cost (placement and the cloud bill), and Scaling Scenarios and Lifecycle (what changes as you grow and how to operate the cluster over time). For the mechanisms beneath the loops, the Part I chapters are cross-linked throughout, start with Overview and KRaft Consensus.

KIPs referenced

  • KIP-500 Replaced ZooKeeper with the self-managed KRaft metadata quorum, the foundation of the metadata-plane / data-plane split in this build.
  • KIP-631 The KRaft controller's broker-lifecycle protocol: registration, heartbeats, fencing, control loop ① (BrokerLifecycleManager.java:64-67).
  • KIP-405 Tiered storage; introduces the __remote_log_metadata internal topic (TopicBasedRemoteLogMetadataManagerConfig.java:44).
  • KIP-848 The new (server-side) consumer group protocol, damps the rebalance-storm pathology of control loop ⑤.
  • KIP-932 Share groups (queue-style consumption); introduces __share_group_state (ShareCoordinatorConfig.java:38).

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.