II · 12 · Lifecycle Operations
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.
A Kafka cluster is never done, it is upgraded, rebalanced, grown, shrunk, and recovered for years while it stays online. This chapter is the playbook for change-while-running. It walks the four lifecycle operations every operator performs and the one they hope never to: rolling upgrades (binaries first, then metadata.version via the feature mechanism, and the hard rule that metadata.version generally cannot be downgraded); partition reassignment (move replicas through the controller, throttled so it cannot starve live traffic); adding and removing brokers (registration, fencing, and the fact that Kafka never auto-rebalances data onto a new broker, you reassign it); controller-quorum changes (KIP-853 dynamic voters); and disaster recovery (the __cluster_metadata snapshot, MirrorMaker 2, unclean recovery, JBOD disk replacement). The unifying mechanism is the metadata log: every lifecycle change is a record the active controller appends and replicates, so each operation is really a question of what record, in what order, and what waits for what to commit. Every default and limit here is cited to source; every benchmark is marked empirical and directional, not a guarantee.
Upgrades, reassignments, broker membership, and quorum changes all flow through the same path, a request to the active controller, which appends one or more records to __cluster_metadata, replicates them to the quorum, and lets every broker apply them via its metadata listener (Part I Metadata Propagation). This is why almost every lifecycle operation has a "wait until committed / wait until caught up" step: the cluster only acts on what has reached the high-watermark. Internalize this and the rest of the chapter is corollaries.
Rolling upgrades: binaries first, format second
A Kafka version upgrade has two independent phases that must happen in this order: (1) replace the binaries on every node, one at a time, while the cluster stays up; then, once every node runs the new version and is stable, (2) bump the cluster's feature flags, chiefly metadata.version, to enable the new on-disk record formats and behaviours the new binaries ship. Phase 1 is reversible (you can roll back the binary). Phase 2 is, for metadata.version, generally not reversible. Doing them in the wrong order, bumping the format before all nodes can read it, bricks the laggards.
Phase 1, rolling the binaries
Each broker is upgraded by a controlled restart. The shutdown is controlled: the broker tells the controller it is going down so leadership migrates off it before the process exits, rather than the cluster discovering it dead. The mechanism lives in BrokerLifecycleManager: on shutdown it transitions to PENDING_CONTROLLED_SHUTDOWN and sends its next heartbeat immediately so the controller starts moving leadership right away (server/.../BrokerLifecycleManager.java:341-346). On the controller side the broker passes through the BrokerControlState machine, UNFENCED → CONTROLLED_SHUTDOWN → SHUTDOWN_NOW (metadata/.../controller/BrokerControlState.java:22-25), and the controller elects new leaders for that broker's partitions from the in-sync followers. Because those followers are already in the ISR, this is a clean election with no data loss (Part I Replication).
The "wait" between brokers is the whole game. After a broker restarts it registers with the controller quorum (KIP-631) under a fresh random incarnationId (server/.../BrokerLifecycleManager.java:103), and it comes back up fenced, it cannot host leaders or serve clients until it has (a) caught up to the latest metadata offset and (b) been explicitly marked ready to unfence (BrokerLifecycleManager.java:109,114,143). Do not restart the next broker until the previous one is fully unfenced and UnderReplicatedPartitions has returned to 0 (core/.../server/ReplicaManager.scala:94); restarting two brokers whose replica sets overlap can drop a partition below min.insync.replicas and reject produces.
| Config / constant | Default | Source | Why it matters during a roll |
|---|---|---|---|
broker.heartbeat.interval.ms | 2000 | raft/.../KRaftConfigs.java:40 | How often the broker proves liveness; the controlled-shutdown handshake rides on it. |
broker.session.timeout.ms | 9000 | raft/.../KRaftConfigs.java:44 | Lease length; miss heartbeats for this long and the controller fences the broker. A roll that stalls a broker > 9 s makes it look dead. |
initial.broker.registration.timeout.ms | 60000 | raft/.../KRaftConfigs.java:36 | How long a freshly restarted broker waits to register before giving up and exiting. Can't register in 60 s → the node self-terminates; check controller reachability. |
replica.lag.time.max.ms | 30000 | server/.../ReplicationConfigs.java:55 | A restarted broker's followers must re-enter the ISR within this window after catching up. Bigger backlogs take longer; watch IsrExpandsPerSec. |
With the durability baseline replication.factor=3, min.insync.replicas=2 (Part I Replication; Durability), a partition tolerates exactly one replica being unavailable while still accepting acks=all writes. Taking a single broker down leaves 2 replicas, still at min.insync.replicas. Take a second broker down that shares a partition and the ISR drops to 1; the partition is now under-min-ISR and producers get NotEnoughReplicas. The restart pacing rule ("URP back to 0 before the next node") is the operational expression of that arithmetic. At scale, replace ≤ 1 broker per maintenance window (Pinterest/Netflix practice, empirical).
Phase 2, bumping metadata.version (KIP-584)
New binaries ship new record formats and behaviours dormant. They are switched on cluster-wide by raising the finalized feature level, primarily metadata.version, the master version that gates KRaft record schemas. The flow: the operator runs kafka-features.sh upgrade --metadata <version> (or the equivalent AdminClient updateFeatures), which reaches the active controller's FeatureControlManager.updateFeatures(...) (metadata/.../controller/FeatureControlManager.java:179). The controller does not just take your word for it: before writing the upgrade it calls reasonNotSupported(...), which verifies that every registered broker and every controller advertises support for the target level, and refuses with INVALID_UPDATE_VERSION if any node is behind (FeatureControlManager.java:321-373). This is exactly why the binaries must roll first.
The downgrade rule is the operator trap. updateMetadataVersion(...) (FeatureControlManager.java:385-423) treats a lower target as a downgrade and calls MetadataVersion.checkIfMetadataChanged(current, target). That method walks the version chain between the two and returns true if any intervening version set a "metadata changed" flag (MetadataVersion.java:398-425; each enum constant carries a didMetadataChange boolean, e.g. IBP_4_3_IV0(30,"4.3","IV0",true) at MetadataVersion.java:125). If the format changed, the controller refuses: "Refusing to perform the requested downgrade because it might delete metadata information" (FeatureControlManager.java:410-411). Unsafe (forced) metadata downgrade is not implemented, the code path exists but returns "Unsafe metadata downgrade is not supported in this version" (FeatureControlManager.java:404-406; tracked as KAFKA-13896). The only downgrade the controller will accept is to a version where nothing in the format changed.
Treat raising metadata.version as irreversible. Validate the new binaries thoroughly while the cluster still runs the old format (you can roll binaries back in this window), and only bump the format once you are committed to the release. The format is forward-only at the storage layer too: kafka-storage format warns that "formatted directories are not forward-compatible, the broker version must be ≥ the kafka-storage tool that formatted the directory" (core/.../tools/StorageTool.scala:339-341). Use kafka-features.sh describe to confirm the new level is finalized cluster-wide before declaring the upgrade complete.
| Feature / constant | Value | Source |
|---|---|---|
| Feature name | metadata.version | MetadataVersion.java:142 |
MINIMUM_VERSION (oldest supported) | IBP_3_3_IV3 (level 7) | MetadataVersion.java:147 |
LATEST_PRODUCTION (newest stable, the default) | IBP_4_3_IV0 (level 30) | MetadataVersion.java:157 |
Latest unstable (test-only unless unstable.feature.versions.enable=true) | IBP_4_4_IV0 (level 31) | MetadataVersion.java:136 |
| Other finalizable features (independent levels) | kraft.version, group.version, eligible.leader.replicas.version, transaction.version, … | FeatureControlManager.java:243-317 |
Two corollaries. First, since the 4.0 release removed ZooKeeper entirely, the floor for any cluster you operate is KRaft; there is no ZK migration runbook to consider here, only KRaft-native upgrades. Second, features other than metadata.version (e.g. group.version for the new consumer-group protocol of KIP-848, or eligible.leader.replicas.version) have their own levels and dependency rules, the controller validates a feature's declared dependencies against the proposed set before accepting it (FeatureControlManager.java:304-317), and you can preview those with kafka-storage feature-dependencies (StorageTool.scala:209).
Partition reassignment: moving replicas without dropping packets
Kafka does not move data on its own. Re-placing a partition's replicas, to drain a broker, relieve a hot broker, change the replication factor, or spread newly added brokers' load, is an explicit reassignment. The tool is kafka-reassign-partitions.sh (tools/.../reassign/ReassignPartitionsCommand.java), which sends an AlterPartitionReassignments request to the active controller, landing in ReplicationControlManager.alterPartitionReassignments(...) (metadata/.../controller/ReplicationControlManager.java:2134).
The three-stage reassignment protocol
The controller does not swap replica lists atomically, it cannot, because the new replicas hold none of the data yet. changePartitionReassignment(...) drives a three-stage protocol, documented verbatim in the source (ReplicationControlManager.java:2232-2251):
- Stage 1: issue a
PartitionChangeRecordthat adds all new replicas to the partition's replica list and records which areaddingReplicasand which areremovingReplicas. The new followers begin replicating from the leader immediately, like any other follower (Part I Fetch Path). - Stage 2: wait until the ISR contains every new replica (or, if only removing, until the ISR holds at least one replica that is not being removed). This is where the bytes actually move; for a multi-GB partition it can take minutes to hours, gated by your throttle and the leader's spare egress.
- Stage 3: issue a second
PartitionChangeRecordremoving the old replicas and clearing the adding/removing markers. The reassignment is complete.
A reassignment that merely reorders an existing replica list (e.g. to change the preferred leader) skips stages 1–2 and completes immediately, because the ISR is already suitable (ReplicationControlManager.java:2248-2251). While a partition is mid-reassignment it shows up in ReassigningPartitions (core/.../server/ReplicaManager.scala:97) and in kafka-reassign-partitions.sh --verify / the ListPartitionReassignments API (ReplicationControlManager.java:2298). Crucially, replicas being added by a reassignment are not counted as under-replicated, UnderReplicatedPartitions excludes them, so a healthy reassignment does not page you.
You can abort an in-flight reassignment (resubmit with no target replicas → cancelPartitionReassignment, ReplicationControlManager.java:2197). The controller reverts to the original replica set, but if that revert would require electing a leader no longer in sync, it is treated as unclean and is refused unless unclean.leader.election.enable=true for the topic (ReplicationControlManager.java:2204-2209). Translation: do not assume "cancel" is always free. If you started a reassignment that moved the only in-sync replicas, cancelling may be blocked exactly when you most want it.
Throttling: the dial that keeps the move from starving live traffic
An unthrottled reassignment lets followers fetch as fast as the leader can serve, competing directly with live consumer fetches and replication, and able to saturate the leader's NIC. The fix is the replication throttle, enforced by ReplicationQuotaManager (server/.../quota/ReplicationQuotaManager.java). It is two coordinated parts:
| Config key | Scope | Default | Source |
|---|---|---|---|
leader.replication.throttled.rate | per-broker (dynamic only) | Long.MAX_VALUE (unlimited) | server-common/.../config/QuotaConfig.java:75, default :87 |
follower.replication.throttled.rate | per-broker (dynamic only) | Long.MAX_VALUE (unlimited) | QuotaConfig.java:80 |
leader.replication.throttled.replicas | per-topic (list of part:broker or *) | [] (none) | QuotaConfig.java:60,65 |
follower.replication.throttled.replicas | per-topic | [] | QuotaConfig.java:67,72 |
replica.alter.log.dirs.io.max.bytes.per.second | per-broker (intra-broker JBOD moves) | unlimited | QuotaConfig.java:84,87 |
The two work together: the .rate configs set a bytes/sec ceiling on a broker's throttled replication traffic, and the .replicas configs mark which replicas count against it (ReplicationQuotaManager.isThrottled(...), :96-99; markThrottled(...), :114-123). A replica's bytes are metered only if its topic-partition is in the throttled set, so the throttle hits the reassignment's moving replicas and leaves normal traffic alone. The rate is enforced as a sliding-window byte-rate sensor that raises QuotaViolationException when exceeded (:80-90). kafka-reassign-partitions.sh wires all four keys for you: pass --throttle <bytes/sec> on execute, and it sets the broker .rate configs and the topic .replicas configs; on a successful --verify it clears them automatically (ReassignPartitionsCommand.java:112-119, 226-231, 494-537).
Set the throttle to the leader broker's spare egress, not its total, leave headroom for live consumers and ongoing replication. A conservative starting point caps the move at the bandwidth that keeps client-facing TotalTimeMs p99 and UnderReplicatedPartitions flat (Metrics & Signals). You can raise the throttle mid-flight by resubmitting with a higher --throttle (ReassignPartitionsCommand.java:124-125). The source itself advises keeping the limit above 1 MB/s for accurate behaviour, below that the sampling is too coarse (QuotaConfig.java:76-78). Set the rate too low and the move never reaches stage 2: the reassignment hangs forever. Watch ReassigningPartitions for progress.
If you set a throttle and never run a successful --verify, the .rate / .replicas configs persist. A leftover leader.replication.throttled.replicas=* with a low rate will throttle all future replication on that topic, including recovery after a broker failure, turning a finished migration into a latent outage. The first thing to check when a topic mysteriously can't catch up is kafka-configs.sh --describe for stale throttle entries. The intra-broker JBOD case has its own ceiling (replica.alter.log.dirs.io.max.bytes.per.second) because that traffic is local disk-to-disk, not network.
Adding and removing brokers: there is no autopilot for data
The single most surprising fact for operators new to Kafka: adding a broker moves no data onto it. A new broker registers, is unfenced, and sits there hosting nothing until you explicitly reassign partitions to it. Kafka has no built-in data balancer; the partition placement of existing topics is frozen until you change it. (External tools, Cruise Control, Confluent Self-Balancing Clusters, MSK auto-rebalancing, exist precisely to automate the reassignment plan; open-source Kafka ships only the mechanism, kafka-reassign-partitions.sh, not the policy. LinkedIn runs Cruise Control across its fleet for exactly this, empirical.)
Adding a broker
- Format its storage with the cluster's existing cluster ID:
kafka-storage format --cluster-id <id> --config <props>(StorageTool.scala:116). The cluster ID must match the running cluster (--cluster-idis required,StorageTool.scala:321-324); a mismatched ID is rejected bykafka-storage infowith "Mismatched cluster IDs" (StorageTool.scala:438-439). A broker-only node does not write a bootstrap snapshot, only nodes with the controller role do (StorageTool.scala:128). - Start it. It registers under a fresh
incarnationId, comes upFENCED, catches up on metadata, and unfences (the registration/heartbeat machinery ofBrokerLifecycleManagerabove). Confirm withkafka-broker-api-versions.shor the controller's broker list. - Reassign partitions onto it. Generate a plan (
kafka-reassign-partitions.sh --generate) that includes the new broker ID, then execute it throttled. This is the step that actually uses the new capacity.
Removing a broker (decommission)
Decommissioning is the inverse and must be done before the broker stops, while it can still serve its replicas:
- Reassign every partition off the broker. Build a reassignment whose target replica lists exclude the departing broker ID, covering all its partitions, and execute it throttled. Until this finishes, the broker still holds the only copy of some replicas.
- Confirm it hosts nothing. Wait until
ReassigningPartitionsis 0 and the broker'sPartitionCount/LeaderCount(ReplicaManager.scala:91-92) drop to 0. - Shut it down (controlled). With no replicas left, there is nothing to migrate; the broker simply leaves.
Stopping a broker before draining its replicas is just an uncontrolled failure, for every partition where it was a replica, the ISR shrinks by one. If a partition was at RF=3 and you pull a broker without reassigning, it falls to 2 in-sync, then any further hiccup pushes it under min.insync.replicas=2 and produces are rejected. With degraded or flapping machines, even reassignment can be insufficient (the broker can't serve its data fast enough); the documented mitigation is rate-limited single-broker-per-window replacement (Pinterest's Jan-2018 incident, empirical). Never decommission two brokers concurrently unless you have proven no partition shares both.
Preferred-leader election: rebalancing leadership, not data
Reassignment moves replicas; it does not, by itself, balance who is leader. After a roll or a failure, leadership often piles onto whichever brokers came back first, leaving some brokers carrying most of the produce/fetch load. Each partition has a preferred leader, the first replica in its assignment list, and Kafka can restore leadership to it. By default the controller does this automatically: auto.leader.rebalance.enable=true (server/.../ReplicationConfigs.java:148) runs a background imbalance check every leader.imbalance.check.interval.seconds=300 (:119) and, if leaders are skewed, elects each partition's preferred replica.
You can also trigger it on demand with kafka-leader-election.sh --election-type PREFERRED (tools/.../LeaderElectionCommand.java), useful immediately after a planned roll to re-seat leadership without waiting for the 5-minute timer. Preferred election is a clean election (the preferred replica must be in the ISR), so it is safe; it is distinct from the dangerous UNCLEAN election type discussed under disaster recovery.
The ZooKeeper-era leader.imbalance.per.broker.percentage threshold is no longer a public configuration in KRaft; auto-rebalance is governed by auto.leader.rebalance.enable and the check interval. If you want tight control over when leadership re-seats (e.g. to avoid a leadership storm mid-incident), disable auto-rebalance and run preferred election manually as a deliberate step. Separately, periodic unclean election retries are paced by unclean.leader.election.interval.ms (default 5 minutes, ReplicationConfigs.java).
Cluster formation and dynamic controller membership
A cluster is born from kafka-storage format, which writes meta.properties (the cluster ID, node ID, and directory ID) into each log directory and, for controller nodes, a bootstrap snapshot seeding the initial metadata.version and feature levels (StorageTool.runFormatCommand, StorageTool.scala:116-177; Part I KRaft Controller). The --release-version flag picks the initial format level, default LATEST_PRODUCTION, minimum MINIMUM_VERSION (StorageTool.scala:336-341). For the controller quorum you choose, at format time, between a static voter set (controller.quorum.voters) and a dynamic one (--standalone, --initial-controllers, or --no-initial-controllers with controller.quorum.bootstrap.servers); the two are mutually exclusive and the tool enforces it (StorageTool.scala:148-172).
KIP-853: adding and removing controller voters at runtime
A dynamic quorum can change its voter membership while running, grow a 1-node controller quorum to 3, or replace a failed controller, without reformatting. This is KIP-853, enabled by kraft.version=1. The add path, AddVoterHandler.handleAddVoterRequest(...) (raft/.../internals/AddVoterHandler.java:86), runs a careful 10-step protocol (documented at AddVoterHandler.java:45-64); the gating checks that matter operationally are:
- kraft.version ≥ 1 required. If the cluster hasn't enabled reconfiguration, AddVoter returns
UNSUPPORTED_VERSION(AddVoterHandler.java:114-127). Upgradekraft.versionvia the same feature mechanism first. - New voter must be caught up. The leader checks
isReplicaCaughtUp(...)and aborts with a timeout if the joiner lags (AddVoterHandler.java:282-301), so a fresh controller must replay the metadata log close to the LEO before it can be admitted, avoiding a quorum that can't make progress. - One change at a time. If a previous voter change hasn't committed, the request times out (
AddVoterHandler.java:129-142). Add or remove voters sequentially, never in parallel.
Removal (RemoveVoterHandler, raft/.../internals/RemoveVoterHandler.java:40-52) is a 7-step mirror: same HWM and kraft.version and no-uncommitted-change gates, append the new VotersRecord, commit on the new majority, and, if the leader removed itself, resign leadership so a remaining voter takes over (RemoveVoterHandler.java:46-51). The operator commands are kafka-metadata-quorum.sh add-controller / remove-controller. Always keep the quorum at an odd size (1, 3, 5) so a clean majority exists; grow 1→3 by adding voters one at a time, each caught up before the next. See Part I KRaft Consensus for the replication protocol underneath.
Disaster recovery
Lifecycle operations are planned. DR is for when the plan failed. Four mechanisms matter, in roughly increasing severity.
The metadata snapshot, your control plane's backup
KRaft's entire cluster state, topics, partitions, configs, ACLs, broker registrations, feature levels, lives in the __cluster_metadata log. To bound recovery time and log size, the controller periodically snapshots it (KIP-630). A snapshot is written when either of two thresholds is hit:
metadata.log.max.snapshot.interval.ms- default 1 hour, time-based snapshotting (
raft/.../MetadataLogConfig.java:39). metadata.log.max.record.bytes.between.snapshots- 20 MiB of new records since the last snapshot (
MetadataLogConfig.java:41). metadata.max.idle.interval.ms- 500 ms, even when idle, the controller appends a no-op so the log keeps advancing and snapshots stay current (
MetadataLogConfig.java:76).
Why this matters for DR: when a broker (or a new controller voter) starts, it loads the latest snapshot and replays only the log after it, that is why catch-up is fast and why a freshly added controller can be admitted quickly. For backup, the snapshot files plus the metadata log are the cluster's control-plane state; protect the controllers' metadata directory accordingly. These snapshots cover metadata, not your topic data, topic data durability is the replication/RF story, and cross-cluster data backup is MirrorMaker's job.
MirrorMaker 2, cross-cluster backup and DR
MirrorMaker 2 (connect/mirror/.../MirrorMaker.java) replicates topics, consumer-group offsets, and topic configs from a source cluster to a target cluster, running as Kafka Connect connectors (Part I Kafka Connect): MirrorSourceConnector (data + configs), MirrorCheckpointConnector (consumer offsets, so consumers can fail over and resume near where they left off), and MirrorHeartbeatConnector (liveness/lag of the replication flow). For DR you run MM2 from primary → standby cluster; on a primary loss, applications repoint to the standby and resume from the checkpointed offsets. Failover to a standby cluster in under 5 minutes is a documented target at scale (Netflix Keystone, empirical).
MM2 replication is asynchronous, so a hard primary failure can lose the few records not yet mirrored; it is RPO-near-zero, not RPO-zero. Also, MM2 deliberately excludes the replication-throttle configs from what it copies, follower.replication.throttled.replicas, leader.replication.throttled.replicas (and the matching rate keys) are in the default config-exclude list (connect/mirror/.../DefaultConfigPropertyFilter.java:38-39), precisely so a throttle set on the source does not silently throttle the target. Historically, naïve MirrorMaker (v1) suffered rebalance storms that stalled replication for 5–10 minutes and, after repeated failed rebalances, stuck permanently (Uber, empirical), which is why MM2's Connect-based, incrementally-rebalancing design replaced it.
Unclean recovery, trading data for availability
If every in-sync replica of a partition is lost, the partition has no leader and is offline (OfflinePartitionsCount > 0). The safe default is to wait: unclean.leader.election.enable=false (storage/.../log/LogConfig.java:139) keeps the partition unavailable until an in-sync replica returns, preserving every committed record. The escape hatch is an unclean leader election: promote an out-of-sync replica to leader, restoring availability but discarding every record that replica was missing. When this happens, the controller resets the ISR to that lone replica and sets LeaderRecoveryState=RECOVERING (metadata/.../LeaderRecoveryState.java:33) until the partition re-stabilises.
Any non-zero UncleanLeaderElectionsPerSec means committed messages were lost (Conduktor, empirical). Because the modern default is false, an offline partition simply stays offline until you either (a) recover the original in-sync replica's broker, or (b) deliberately run kafka-leader-election.sh --election-type UNCLEAN to choose availability over the missing data. A standby cluster fed by MM2 is the better answer for genuinely critical data, Datadog recovered from a pre-0.11 unclean-election data-loss event only because they dual-wrote to a second cluster (empirical). Note the default flipped: unclean.leader.election.enable was true before 0.11.0, false after, verify on legacy clusters. Decide this policy per topic before an incident, not during one.
JBOD disk replacement
With multiple log directories per broker (JBOD), Kafka tolerates losing a single disk without losing the broker. On the first I/O error a broker marks that log directory offline: it stops serving the replicas on that disk (they go under-replicated and re-lead elsewhere) while the broker keeps serving its other disks. Only when all log directories are bad does the broker go fully offline. The replacement procedure:
- Physically replace the failed disk and recreate the empty log-directory mount.
- Restart the broker. It registers (fenced) and reconciles its on-disk directories against the controller's expectation. In KRaft, the broker advertises its directory assignments and the controller reconciles replica-to-directory placement via the
AssignReplicasToDirsmechanism (KIP-858); the empty disk is repopulated by normal follower replication from the leaders. - Optionally throttle the catch-up with
follower.replication.throttled.rateif the re-replication competes with live traffic, and watchUnderReplicatedPartitionsdrain to 0.
Running a log directory out of space is more abrupt than a disk fault: the broker can crash on No space left on device without a graceful shutdown. It is recoverable if RF ≥ 2 (another replica leads). The emergency move is to free space safely, lower retention via kafka-configs.sh (which lets the cleaner delete old segments) rather than hand-deleting files, and never delete the newest active .log/.index/.timeindex of a partition (that corrupts it). Restart at 10–20% free and verify UnderReplicatedPartitions returns to empty (Conduktor, empirical). Alert at 70% / 85% disk well before this point (Metrics & Signals).
Putting it together: the lifecycle decision map
(1) Roll binaries before bumping metadata.version, the controller rejects the bump until every node supports it, and the bump is effectively one-way (FeatureControlManager.java:321,399). (2) One broker (or one voter) at a time; wait for UnderReplicatedPartitions=0 and full unfence before the next. (3) Every data move is an explicit, throttled reassignment, Kafka never balances data for you, and an orphaned throttle is a latent outage. (4) Decommission by draining replicas before stopping the broker, never after. (5) Unclean leader election and forced metadata downgrade are data-destroying last resorts, decide the policy per topic in advance, and keep a MirrorMaker-2 standby for anything you cannot afford to lose.