krivaltsevich.com Kafka Internals4.4

II · 12 · Lifecycle Operations

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

A Kafka cluster is never done, it is upgraded, rebalanced, grown, shrunk, and recovered for years while it stays online. This chapter is the playbook for change-while-running. It walks the four lifecycle operations every operator performs and the one they hope never to: rolling upgrades (binaries first, then metadata.version via the feature mechanism, and the hard rule that metadata.version generally cannot be downgraded); partition reassignment (move replicas through the controller, throttled so it cannot starve live traffic); adding and removing brokers (registration, fencing, and the fact that Kafka never auto-rebalances data onto a new broker, you reassign it); controller-quorum changes (KIP-853 dynamic voters); and disaster recovery (the __cluster_metadata snapshot, MirrorMaker 2, unclean recovery, JBOD disk replacement). The unifying mechanism is the metadata log: every lifecycle change is a record the active controller appends and replicates, so each operation is really a question of what record, in what order, and what waits for what to commit. Every default and limit here is cited to source; every benchmark is marked empirical and directional, not a guarantee.

The lifecycle invariant: change is a metadata record, applied in order

Upgrades, reassignments, broker membership, and quorum changes all flow through the same path, a request to the active controller, which appends one or more records to __cluster_metadata, replicates them to the quorum, and lets every broker apply them via its metadata listener (Part I Metadata Propagation). This is why almost every lifecycle operation has a "wait until committed / wait until caught up" step: the cluster only acts on what has reached the high-watermark. Internalize this and the rest of the chapter is corollaries.

Rolling upgrades: binaries first, format second

A Kafka version upgrade has two independent phases that must happen in this order: (1) replace the binaries on every node, one at a time, while the cluster stays up; then, once every node runs the new version and is stable, (2) bump the cluster's feature flags, chiefly metadata.version, to enable the new on-disk record formats and behaviours the new binaries ship. Phase 1 is reversible (you can roll back the binary). Phase 2 is, for metadata.version, generally not reversible. Doing them in the wrong order, bumping the format before all nodes can read it, bricks the laggards.

Phase 1, rolling the binaries

Each broker is upgraded by a controlled restart. The shutdown is controlled: the broker tells the controller it is going down so leadership migrates off it before the process exits, rather than the cluster discovering it dead. The mechanism lives in BrokerLifecycleManager: on shutdown it transitions to PENDING_CONTROLLED_SHUTDOWN and sends its next heartbeat immediately so the controller starts moving leadership right away (server/.../BrokerLifecycleManager.java:341-346). On the controller side the broker passes through the BrokerControlState machine, UNFENCED → CONTROLLED_SHUTDOWN → SHUTDOWN_NOW (metadata/.../controller/BrokerControlState.java:22-25), and the controller elects new leaders for that broker's partitions from the in-sync followers. Because those followers are already in the ISR, this is a clean election with no data loss (Part I Replication).

Broker NControllerFollowers
heartbeat: PENDING_CONTROLLED_SHUTDOWN
elect new leaders from ISR (clean)
heartbeat reply: shutdown OK
exit → upgrade binary → restart
register (new incarnationId)
unfence once caught up to metadata
One step of a rolling upgrade. Leadership moves off the broker before it exits; on restart it re-registers and is unfenced only after catching up.
broker   controller/metadata   follower replicas   request   reply

The "wait" between brokers is the whole game. After a broker restarts it registers with the controller quorum (KIP-631) under a fresh random incarnationId (server/.../BrokerLifecycleManager.java:103), and it comes back up fenced, it cannot host leaders or serve clients until it has (a) caught up to the latest metadata offset and (b) been explicitly marked ready to unfence (BrokerLifecycleManager.java:109,114,143). Do not restart the next broker until the previous one is fully unfenced and UnderReplicatedPartitions has returned to 0 (core/.../server/ReplicaManager.scala:94); restarting two brokers whose replica sets overlap can drop a partition below min.insync.replicas and reject produces.

Config / constantDefaultSourceWhy it matters during a roll
broker.heartbeat.interval.ms2000raft/.../KRaftConfigs.java:40How often the broker proves liveness; the controlled-shutdown handshake rides on it.
broker.session.timeout.ms9000raft/.../KRaftConfigs.java:44Lease length; miss heartbeats for this long and the controller fences the broker. A roll that stalls a broker > 9 s makes it look dead.
initial.broker.registration.timeout.ms60000raft/.../KRaftConfigs.java:36How long a freshly restarted broker waits to register before giving up and exiting. Can't register in 60 s → the node self-terminates; check controller reachability.
replica.lag.time.max.ms30000server/.../ReplicationConfigs.java:55A restarted broker's followers must re-enter the ISR within this window after catching up. Bigger backlogs take longer; watch IsrExpandsPerSec.
Why one broker at a time, never two with overlapping replicas

With the durability baseline replication.factor=3, min.insync.replicas=2 (Part I Replication; Durability), a partition tolerates exactly one replica being unavailable while still accepting acks=all writes. Taking a single broker down leaves 2 replicas, still at min.insync.replicas. Take a second broker down that shares a partition and the ISR drops to 1; the partition is now under-min-ISR and producers get NotEnoughReplicas. The restart pacing rule ("URP back to 0 before the next node") is the operational expression of that arithmetic. At scale, replace ≤ 1 broker per maintenance window (Pinterest/Netflix practice, empirical).

Phase 2, bumping metadata.version (KIP-584)

New binaries ship new record formats and behaviours dormant. They are switched on cluster-wide by raising the finalized feature level, primarily metadata.version, the master version that gates KRaft record schemas. The flow: the operator runs kafka-features.sh upgrade --metadata <version> (or the equivalent AdminClient updateFeatures), which reaches the active controller's FeatureControlManager.updateFeatures(...) (metadata/.../controller/FeatureControlManager.java:179). The controller does not just take your word for it: before writing the upgrade it calls reasonNotSupported(...), which verifies that every registered broker and every controller advertises support for the target level, and refuses with INVALID_UPDATE_VERSION if any node is behind (FeatureControlManager.java:321-373). This is exactly why the binaries must roll first.

kafka-features.sh upgrade --metadata 4.3
all brokers + controllers support target level?
reject: INVALID_UPDATE_VERSION
is this a downgrade?
refuse: would delete metadata
append FeatureLevelRecord → replicate → every broker applies
metadata.version upgrade gating in FeatureControlManager. The controller verifies cluster-wide support, then refuses any downgrade that would erase metadata.
controller/feature control   metadata log record   rejected path   flow   rejection

The downgrade rule is the operator trap. updateMetadataVersion(...) (FeatureControlManager.java:385-423) treats a lower target as a downgrade and calls MetadataVersion.checkIfMetadataChanged(current, target). That method walks the version chain between the two and returns true if any intervening version set a "metadata changed" flag (MetadataVersion.java:398-425; each enum constant carries a didMetadataChange boolean, e.g. IBP_4_3_IV0(30,"4.3","IV0",true) at MetadataVersion.java:125). If the format changed, the controller refuses: "Refusing to perform the requested downgrade because it might delete metadata information" (FeatureControlManager.java:410-411). Unsafe (forced) metadata downgrade is not implemented, the code path exists but returns "Unsafe metadata downgrade is not supported in this version" (FeatureControlManager.java:404-406; tracked as KAFKA-13896). The only downgrade the controller will accept is to a version where nothing in the format changed.

metadata.version is a one-way door in practice, plan accordingly

Treat raising metadata.version as irreversible. Validate the new binaries thoroughly while the cluster still runs the old format (you can roll binaries back in this window), and only bump the format once you are committed to the release. The format is forward-only at the storage layer too: kafka-storage format warns that "formatted directories are not forward-compatible, the broker version must be ≥ the kafka-storage tool that formatted the directory" (core/.../tools/StorageTool.scala:339-341). Use kafka-features.sh describe to confirm the new level is finalized cluster-wide before declaring the upgrade complete.

Feature / constantValueSource
Feature namemetadata.versionMetadataVersion.java:142
MINIMUM_VERSION (oldest supported)IBP_3_3_IV3 (level 7)MetadataVersion.java:147
LATEST_PRODUCTION (newest stable, the default)IBP_4_3_IV0 (level 30)MetadataVersion.java:157
Latest unstable (test-only unless unstable.feature.versions.enable=true)IBP_4_4_IV0 (level 31)MetadataVersion.java:136
Other finalizable features (independent levels)kraft.version, group.version, eligible.leader.replicas.version, transaction.version, …FeatureControlManager.java:243-317

Two corollaries. First, since the 4.0 release removed ZooKeeper entirely, the floor for any cluster you operate is KRaft; there is no ZK migration runbook to consider here, only KRaft-native upgrades. Second, features other than metadata.version (e.g. group.version for the new consumer-group protocol of KIP-848, or eligible.leader.replicas.version) have their own levels and dependency rules, the controller validates a feature's declared dependencies against the proposed set before accepting it (FeatureControlManager.java:304-317), and you can preview those with kafka-storage feature-dependencies (StorageTool.scala:209).

Partition reassignment: moving replicas without dropping packets

Kafka does not move data on its own. Re-placing a partition's replicas, to drain a broker, relieve a hot broker, change the replication factor, or spread newly added brokers' load, is an explicit reassignment. The tool is kafka-reassign-partitions.sh (tools/.../reassign/ReassignPartitionsCommand.java), which sends an AlterPartitionReassignments request to the active controller, landing in ReplicationControlManager.alterPartitionReassignments(...) (metadata/.../controller/ReplicationControlManager.java:2134).

The three-stage reassignment protocol

The controller does not swap replica lists atomically, it cannot, because the new replicas hold none of the data yet. changePartitionReassignment(...) drives a three-stage protocol, documented verbatim in the source (ReplicationControlManager.java:2232-2251):

stage 1add new replicas; set addingReplicas / removingReplicas
stage 2wait: ISR contains all new replicas
stage 3drop removingReplicas; clear adding/removing, done
The controller-driven reassignment lifecycle. Stage 2 is the long pole: it is the time for the new replicas to fetch the entire partition into the ISR.
metadata record / replica state   wait-for-ISR   progression
  1. Stage 1: issue a PartitionChangeRecord that adds all new replicas to the partition's replica list and records which are addingReplicas and which are removingReplicas. The new followers begin replicating from the leader immediately, like any other follower (Part I Fetch Path).
  2. Stage 2: wait until the ISR contains every new replica (or, if only removing, until the ISR holds at least one replica that is not being removed). This is where the bytes actually move; for a multi-GB partition it can take minutes to hours, gated by your throttle and the leader's spare egress.
  3. Stage 3: issue a second PartitionChangeRecord removing the old replicas and clearing the adding/removing markers. The reassignment is complete.

A reassignment that merely reorders an existing replica list (e.g. to change the preferred leader) skips stages 1–2 and completes immediately, because the ISR is already suitable (ReplicationControlManager.java:2248-2251). While a partition is mid-reassignment it shows up in ReassigningPartitions (core/.../server/ReplicaManager.scala:97) and in kafka-reassign-partitions.sh --verify / the ListPartitionReassignments API (ReplicationControlManager.java:2298). Crucially, replicas being added by a reassignment are not counted as under-replicated, UnderReplicatedPartitions excludes them, so a healthy reassignment does not page you.

Cancelling a reassignment can require an unclean election, and may be blocked

You can abort an in-flight reassignment (resubmit with no target replicas → cancelPartitionReassignment, ReplicationControlManager.java:2197). The controller reverts to the original replica set, but if that revert would require electing a leader no longer in sync, it is treated as unclean and is refused unless unclean.leader.election.enable=true for the topic (ReplicationControlManager.java:2204-2209). Translation: do not assume "cancel" is always free. If you started a reassignment that moved the only in-sync replicas, cancelling may be blocked exactly when you most want it.

Throttling: the dial that keeps the move from starving live traffic

An unthrottled reassignment lets followers fetch as fast as the leader can serve, competing directly with live consumer fetches and replication, and able to saturate the leader's NIC. The fix is the replication throttle, enforced by ReplicationQuotaManager (server/.../quota/ReplicationQuotaManager.java). It is two coordinated parts:

Config keyScopeDefaultSource
leader.replication.throttled.rateper-broker (dynamic only)Long.MAX_VALUE (unlimited)server-common/.../config/QuotaConfig.java:75, default :87
follower.replication.throttled.rateper-broker (dynamic only)Long.MAX_VALUE (unlimited)QuotaConfig.java:80
leader.replication.throttled.replicasper-topic (list of part:broker or *)[] (none)QuotaConfig.java:60,65
follower.replication.throttled.replicasper-topic[]QuotaConfig.java:67,72
replica.alter.log.dirs.io.max.bytes.per.secondper-broker (intra-broker JBOD moves)unlimitedQuotaConfig.java:84,87

The two work together: the .rate configs set a bytes/sec ceiling on a broker's throttled replication traffic, and the .replicas configs mark which replicas count against it (ReplicationQuotaManager.isThrottled(...), :96-99; markThrottled(...), :114-123). A replica's bytes are metered only if its topic-partition is in the throttled set, so the throttle hits the reassignment's moving replicas and leaves normal traffic alone. The rate is enforced as a sliding-window byte-rate sensor that raises QuotaViolationException when exceeded (:80-90). kafka-reassign-partitions.sh wires all four keys for you: pass --throttle <bytes/sec> on execute, and it sets the broker .rate configs and the topic .replicas configs; on a successful --verify it clears them automatically (ReassignPartitionsCommand.java:112-119, 226-231, 494-537).

Throttle rule of thumb

Set the throttle to the leader broker's spare egress, not its total, leave headroom for live consumers and ongoing replication. A conservative starting point caps the move at the bandwidth that keeps client-facing TotalTimeMs p99 and UnderReplicatedPartitions flat (Metrics & Signals). You can raise the throttle mid-flight by resubmitting with a higher --throttle (ReassignPartitionsCommand.java:124-125). The source itself advises keeping the limit above 1 MB/s for accurate behaviour, below that the sampling is too coarse (QuotaConfig.java:76-78). Set the rate too low and the move never reaches stage 2: the reassignment hangs forever. Watch ReassigningPartitions for progress.

Always run --verify to completion, orphaned throttles silently cap replication

If you set a throttle and never run a successful --verify, the .rate / .replicas configs persist. A leftover leader.replication.throttled.replicas=* with a low rate will throttle all future replication on that topic, including recovery after a broker failure, turning a finished migration into a latent outage. The first thing to check when a topic mysteriously can't catch up is kafka-configs.sh --describe for stale throttle entries. The intra-broker JBOD case has its own ceiling (replica.alter.log.dirs.io.max.bytes.per.second) because that traffic is local disk-to-disk, not network.

Adding and removing brokers: there is no autopilot for data

The single most surprising fact for operators new to Kafka: adding a broker moves no data onto it. A new broker registers, is unfenced, and sits there hosting nothing until you explicitly reassign partitions to it. Kafka has no built-in data balancer; the partition placement of existing topics is frozen until you change it. (External tools, Cruise Control, Confluent Self-Balancing Clusters, MSK auto-rebalancing, exist precisely to automate the reassignment plan; open-source Kafka ships only the mechanism, kafka-reassign-partitions.sh, not the policy. LinkedIn runs Cruise Control across its fleet for exactly this, empirical.)

Adding a broker

format dirs, start STARTING register (KIP-631) FENCED caught up + ready UNFENCED / RUNNING you reassign partitions onto it hosting replicas
A new broker's path to usefulness. It is RUNNING but empty until an explicit reassignment places replicas on it.
FENCED cannot host leaders / serve clients   UNFENCED eligible for placement   the final transition is operator-driven, not automatic
  1. Format its storage with the cluster's existing cluster ID: kafka-storage format --cluster-id <id> --config <props> (StorageTool.scala:116). The cluster ID must match the running cluster (--cluster-id is required, StorageTool.scala:321-324); a mismatched ID is rejected by kafka-storage info with "Mismatched cluster IDs" (StorageTool.scala:438-439). A broker-only node does not write a bootstrap snapshot, only nodes with the controller role do (StorageTool.scala:128).
  2. Start it. It registers under a fresh incarnationId, comes up FENCED, catches up on metadata, and unfences (the registration/heartbeat machinery of BrokerLifecycleManager above). Confirm with kafka-broker-api-versions.sh or the controller's broker list.
  3. Reassign partitions onto it. Generate a plan (kafka-reassign-partitions.sh --generate) that includes the new broker ID, then execute it throttled. This is the step that actually uses the new capacity.

Removing a broker (decommission)

Decommissioning is the inverse and must be done before the broker stops, while it can still serve its replicas:

  1. Reassign every partition off the broker. Build a reassignment whose target replica lists exclude the departing broker ID, covering all its partitions, and execute it throttled. Until this finishes, the broker still holds the only copy of some replicas.
  2. Confirm it hosts nothing. Wait until ReassigningPartitions is 0 and the broker's PartitionCount / LeaderCount (ReplicaManager.scala:91-92) drop to 0.
  3. Shut it down (controlled). With no replicas left, there is nothing to migrate; the broker simply leaves.
Decommissioning the wrong way drops you below min.insync.replicas

Stopping a broker before draining its replicas is just an uncontrolled failure, for every partition where it was a replica, the ISR shrinks by one. If a partition was at RF=3 and you pull a broker without reassigning, it falls to 2 in-sync, then any further hiccup pushes it under min.insync.replicas=2 and produces are rejected. With degraded or flapping machines, even reassignment can be insufficient (the broker can't serve its data fast enough); the documented mitigation is rate-limited single-broker-per-window replacement (Pinterest's Jan-2018 incident, empirical). Never decommission two brokers concurrently unless you have proven no partition shares both.

Preferred-leader election: rebalancing leadership, not data

Reassignment moves replicas; it does not, by itself, balance who is leader. After a roll or a failure, leadership often piles onto whichever brokers came back first, leaving some brokers carrying most of the produce/fetch load. Each partition has a preferred leader, the first replica in its assignment list, and Kafka can restore leadership to it. By default the controller does this automatically: auto.leader.rebalance.enable=true (server/.../ReplicationConfigs.java:148) runs a background imbalance check every leader.imbalance.check.interval.seconds=300 (:119) and, if leaders are skewed, elects each partition's preferred replica.

You can also trigger it on demand with kafka-leader-election.sh --election-type PREFERRED (tools/.../LeaderElectionCommand.java), useful immediately after a planned roll to re-seat leadership without waiting for the 5-minute timer. Preferred election is a clean election (the preferred replica must be in the ISR), so it is safe; it is distinct from the dangerous UNCLEAN election type discussed under disaster recovery.

KRaft dropped the per-broker imbalance percentage knob

The ZooKeeper-era leader.imbalance.per.broker.percentage threshold is no longer a public configuration in KRaft; auto-rebalance is governed by auto.leader.rebalance.enable and the check interval. If you want tight control over when leadership re-seats (e.g. to avoid a leadership storm mid-incident), disable auto-rebalance and run preferred election manually as a deliberate step. Separately, periodic unclean election retries are paced by unclean.leader.election.interval.ms (default 5 minutes, ReplicationConfigs.java).

Cluster formation and dynamic controller membership

A cluster is born from kafka-storage format, which writes meta.properties (the cluster ID, node ID, and directory ID) into each log directory and, for controller nodes, a bootstrap snapshot seeding the initial metadata.version and feature levels (StorageTool.runFormatCommand, StorageTool.scala:116-177; Part I KRaft Controller). The --release-version flag picks the initial format level, default LATEST_PRODUCTION, minimum MINIMUM_VERSION (StorageTool.scala:336-341). For the controller quorum you choose, at format time, between a static voter set (controller.quorum.voters) and a dynamic one (--standalone, --initial-controllers, or --no-initial-controllers with controller.quorum.bootstrap.servers); the two are mutually exclusive and the tool enforces it (StorageTool.scala:148-172).

KIP-853: adding and removing controller voters at runtime

A dynamic quorum can change its voter membership while running, grow a 1-node controller quorum to 3, or replace a failed controller, without reformatting. This is KIP-853, enabled by kraft.version=1. The add path, AddVoterHandler.handleAddVoterRequest(...) (raft/.../internals/AddVoterHandler.java:86), runs a careful 10-step protocol (documented at AddVoterHandler.java:45-64); the gating checks that matter operationally are:

kafka-metadata-quorum add-controller
leader HWM known? (prior leaders fenced)
cluster kraft.version ≥ 1?
no uncommitted voter change?
new voter caught up to leader LEO?
append VotersRecord → commit on NEW majority
AddVoter gating. Every "no" returns an error (REQUEST_TIMED_OUT / UNSUPPORTED_VERSION / INVALID_REQUEST); the new voter must be caught up before it can join.
controller/quorum check   catch-up wait   VotersRecord   pass
  • kraft.version ≥ 1 required. If the cluster hasn't enabled reconfiguration, AddVoter returns UNSUPPORTED_VERSION (AddVoterHandler.java:114-127). Upgrade kraft.version via the same feature mechanism first.
  • New voter must be caught up. The leader checks isReplicaCaughtUp(...) and aborts with a timeout if the joiner lags (AddVoterHandler.java:282-301), so a fresh controller must replay the metadata log close to the LEO before it can be admitted, avoiding a quorum that can't make progress.
  • One change at a time. If a previous voter change hasn't committed, the request times out (AddVoterHandler.java:129-142). Add or remove voters sequentially, never in parallel.

Removal (RemoveVoterHandler, raft/.../internals/RemoveVoterHandler.java:40-52) is a 7-step mirror: same HWM and kraft.version and no-uncommitted-change gates, append the new VotersRecord, commit on the new majority, and, if the leader removed itself, resign leadership so a remaining voter takes over (RemoveVoterHandler.java:46-51). The operator commands are kafka-metadata-quorum.sh add-controller / remove-controller. Always keep the quorum at an odd size (1, 3, 5) so a clean majority exists; grow 1→3 by adding voters one at a time, each caught up before the next. See Part I KRaft Consensus for the replication protocol underneath.

Disaster recovery

Lifecycle operations are planned. DR is for when the plan failed. Four mechanisms matter, in roughly increasing severity.

The metadata snapshot, your control plane's backup

KRaft's entire cluster state, topics, partitions, configs, ACLs, broker registrations, feature levels, lives in the __cluster_metadata log. To bound recovery time and log size, the controller periodically snapshots it (KIP-630). A snapshot is written when either of two thresholds is hit:

metadata.log.max.snapshot.interval.ms
default 1 hour, time-based snapshotting (raft/.../MetadataLogConfig.java:39).
metadata.log.max.record.bytes.between.snapshots
20 MiB of new records since the last snapshot (MetadataLogConfig.java:41).
metadata.max.idle.interval.ms
500 ms, even when idle, the controller appends a no-op so the log keeps advancing and snapshots stay current (MetadataLogConfig.java:76).

Why this matters for DR: when a broker (or a new controller voter) starts, it loads the latest snapshot and replays only the log after it, that is why catch-up is fast and why a freshly added controller can be admitted quickly. For backup, the snapshot files plus the metadata log are the cluster's control-plane state; protect the controllers' metadata directory accordingly. These snapshots cover metadata, not your topic data, topic data durability is the replication/RF story, and cross-cluster data backup is MirrorMaker's job.

MirrorMaker 2, cross-cluster backup and DR

MirrorMaker 2 (connect/mirror/.../MirrorMaker.java) replicates topics, consumer-group offsets, and topic configs from a source cluster to a target cluster, running as Kafka Connect connectors (Part I Kafka Connect): MirrorSourceConnector (data + configs), MirrorCheckpointConnector (consumer offsets, so consumers can fail over and resume near where they left off), and MirrorHeartbeatConnector (liveness/lag of the replication flow). For DR you run MM2 from primary → standby cluster; on a primary loss, applications repoint to the standby and resume from the checkpointed offsets. Failover to a standby cluster in under 5 minutes is a documented target at scale (Netflix Keystone, empirical).

MM2 is asynchronous, expect a small replication lag, and don't mirror throttle configs

MM2 replication is asynchronous, so a hard primary failure can lose the few records not yet mirrored; it is RPO-near-zero, not RPO-zero. Also, MM2 deliberately excludes the replication-throttle configs from what it copies, follower.replication.throttled.replicas, leader.replication.throttled.replicas (and the matching rate keys) are in the default config-exclude list (connect/mirror/.../DefaultConfigPropertyFilter.java:38-39), precisely so a throttle set on the source does not silently throttle the target. Historically, naïve MirrorMaker (v1) suffered rebalance storms that stalled replication for 5–10 minutes and, after repeated failed rebalances, stuck permanently (Uber, empirical), which is why MM2's Connect-based, incrementally-rebalancing design replaced it.

Unclean recovery, trading data for availability

If every in-sync replica of a partition is lost, the partition has no leader and is offline (OfflinePartitionsCount > 0). The safe default is to wait: unclean.leader.election.enable=false (storage/.../log/LogConfig.java:139) keeps the partition unavailable until an in-sync replica returns, preserving every committed record. The escape hatch is an unclean leader election: promote an out-of-sync replica to leader, restoring availability but discarding every record that replica was missing. When this happens, the controller resets the ISR to that lone replica and sets LeaderRecoveryState=RECOVERING (metadata/.../LeaderRecoveryState.java:33) until the partition re-stabilises.

Unclean election is data loss, make it a conscious choice, monitored

Any non-zero UncleanLeaderElectionsPerSec means committed messages were lost (Conduktor, empirical). Because the modern default is false, an offline partition simply stays offline until you either (a) recover the original in-sync replica's broker, or (b) deliberately run kafka-leader-election.sh --election-type UNCLEAN to choose availability over the missing data. A standby cluster fed by MM2 is the better answer for genuinely critical data, Datadog recovered from a pre-0.11 unclean-election data-loss event only because they dual-wrote to a second cluster (empirical). Note the default flipped: unclean.leader.election.enable was true before 0.11.0, false after, verify on legacy clusters. Decide this policy per topic before an incident, not during one.

JBOD disk replacement

With multiple log directories per broker (JBOD), Kafka tolerates losing a single disk without losing the broker. On the first I/O error a broker marks that log directory offline: it stops serving the replicas on that disk (they go under-replicated and re-lead elsewhere) while the broker keeps serving its other disks. Only when all log directories are bad does the broker go fully offline. The replacement procedure:

  1. Physically replace the failed disk and recreate the empty log-directory mount.
  2. Restart the broker. It registers (fenced) and reconciles its on-disk directories against the controller's expectation. In KRaft, the broker advertises its directory assignments and the controller reconciles replica-to-directory placement via the AssignReplicasToDirs mechanism (KIP-858); the empty disk is repopulated by normal follower replication from the leaders.
  3. Optionally throttle the catch-up with follower.replication.throttled.rate if the re-replication competes with live traffic, and watch UnderReplicatedPartitions drain to 0.
A full disk is not a clean JBOD failure

Running a log directory out of space is more abrupt than a disk fault: the broker can crash on No space left on device without a graceful shutdown. It is recoverable if RF ≥ 2 (another replica leads). The emergency move is to free space safely, lower retention via kafka-configs.sh (which lets the cleaner delete old segments) rather than hand-deleting files, and never delete the newest active .log/.index/.timeindex of a partition (that corrupts it). Restart at 10–20% free and verify UnderReplicatedPartitions returns to empty (Conduktor, empirical). Alert at 70% / 85% disk well before this point (Metrics & Signals).

Putting it together: the lifecycle decision map

what changed / what's the goal?
roll binaries 1×, URP→0 between; then bump metadata.version (one-way)
reassign partitions, throttled; --verify to clear throttle
add/remove broker → then reassign (no autopilot)
offline? wait for ISR, or MM2 failover; unclean = last resort
The four lifecycle paths and their first move. Three are planned and reversible-ish; the red path trades data for availability and should be a pre-decided policy.
upgrade/feature   reassignment   membership   DR / data-loss path   path
Five rules that never change

(1) Roll binaries before bumping metadata.version, the controller rejects the bump until every node supports it, and the bump is effectively one-way (FeatureControlManager.java:321,399). (2) One broker (or one voter) at a time; wait for UnderReplicatedPartitions=0 and full unfence before the next. (3) Every data move is an explicit, throttled reassignment, Kafka never balances data for you, and an orphaned throttle is a latent outage. (4) Decommission by draining replicas before stopping the broker, never after. (5) Unclean leader election and forced metadata downgrade are data-destroying last resorts, decide the policy per topic in advance, and keep a MirrorMaker-2 standby for anything you cannot afford to lose.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.