II · 09 · Topologies & Deployment Patterns

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

A Kafka cluster is a physical object: it lives in racks, in availability zones, in regions, and on a budget. The same logical cluster can be one combined node on a laptop or 150 brokers fronted by an object store spanning a continent, and the layout decisions you make at formatting time (how many controllers, how you tag racks, how many clusters, how you bridge regions) are the hardest to change later and the most expensive to get wrong. This chapter is the layout playbook: the KRaft node topology (dedicated vs combined controllers, static vs dynamic quorums), rack/multi-AZ placement and how StripedReplicaPlacer turns broker.rack into AZ-survivable partitions, the blast-radius math of single vs multi-cluster, the two ways to span regions (MirrorMaker 2 and stretch clusters), where tiered storage and an object store fit, and a concrete recommended topology for each scale tier from dev to global. Every default and limit is pinned to the source that defines it; every benchmark and cost figure is marked empirical and attributed.

The control plane: KRaft node topology

Since 4.0, ZooKeeper is gone: there is no separate consensus system to deploy. The cluster's metadata (every topic, partition, ISR, config, ACL, and broker registration) is itself a single-partition replicated log, __cluster_metadata, maintained by a Raft quorum of controller nodes. A node's job is set at startup by process.roles, which accepts broker, controller, or both (raft/.../KRaftConfigs.java:32, validated to that exact list at :73). KafkaRaftServer constructs a BrokerServer if the roles contain BrokerRole and a ControllerServer if they contain ControllerRole (core/.../KafkaRaftServer.scala:76-90), so "dedicated controller", "dedicated broker", and "combined" are all the same binary with a different one-line config. See Part I · 10 (KRaft consensus) and Part I · 11 (the controller) for the mechanism this section deploys.

Figure: the three planes of a KRaft cluster. The controller quorum is a small odd set; the broker plane scales independently.

Controller quorum (Raft voters): owns __cluster_metadata (1 partition, RF = #voters); one active leader, the rest hot standbys. Pinned by: odd voter count (3 or 5) · process.roles=controller · controller.listener.names

↕ brokers fetch metadata log + send heartbeats

Broker plane (data): hosts topic partitions, serves produce/fetch; each broker is a metadata observer (non-voter) that replays the log. Pinned by: process.roles=broker · scale horizontally

↕ replication (RF), client traffic

Storage: local log dirs (page cache + zero-copy) and optional remote tier (object store). Pinned by: JBOD / RAID · remote.log.storage.system.enable

Why the quorum is odd, and how many failures it tolerates

A Raft quorum makes progress only when a majority of voters is reachable and current. With n voters the majority is floor(n/2) + 1, so the cluster tolerates floor((n-1)/2) simultaneous voter failures. This is the single most important number in the control plane:

Voters `n`	Majority needed	Failures tolerated	Verdict
1	1	0	Dev/test only, any controller loss = metadata outage
2	2	0	Never. Even number; tolerates nothing yet costs two nodes
3	2	1	Standard production quorum
5	3	2	Large / global; survives an AZ loss + one more
7	4	3	Rarely justified; more voters = slower commits, larger election surface

Why odd, not even

Going from 3 to 4 voters does not raise fault tolerance (both tolerate 1 failure) but it raises the majority from 2 to 3, so a 4-node quorum is strictly worse than a 3-node one: same resilience, larger write set to acknowledge, more nodes that can be the slow one. An even quorum also makes split-brain "2 vs 2" partitions possible where neither side has a majority. Always pick an odd n. Three is the default answer; five only when you need to survive losing two voters at once (e.g. one whole AZ plus a maintenance reboot).
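The majority and fault-tolerance formulas above are easy to check by hand, but a few lines of Python make the 3-versus-4 comparison concrete. This is a minimal sketch of the arithmetic only; the function name quorum_math is illustrative and not part of Kafka.

```python
def quorum_math(n: int) -> tuple[int, int]:
    """Return (majority needed, failures tolerated) for n Raft voters."""
    if n < 1:
        raise ValueError("a quorum needs at least one voter")
    majority = n // 2 + 1        # floor(n/2) + 1
    tolerated = (n - 1) // 2     # floor((n-1)/2)
    return majority, tolerated


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 7):
        majority, tolerated = quorum_math(n)
        print(f"voters={n} majority={majority} tolerated={tolerated}")
```

Running it shows that 3 and 4 voters both tolerate one failure while the majority rises from 2 to 3, which is the whole argument against even quorums.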

Raft commits an append only after a majority has it, so commit latency tracks the median voter's round-trip, and elections are bounded by the KRaft timeouts. A follower becomes a candidate after controller.quorum.fetch.timeout.ms (default 2000 ms) without a successful fetch from the leader, and an election that stalls is retried after controller.quorum.election.timeout.ms (default 1000 ms) with exponential backoff capped by controller.quorum.election.backoff.max.ms (default 1000 ms); all three live in raft/.../QuorumConfig.java (:77-88, with the definitions at :123-125). These are deliberately low: the class header states the design philosophy that "changing the leader of a Raft cluster [is] a relatively quick operation … the standby is a 'hot' standby, not a 'cold' one" (QuorumConfig.java:46-49). A standby controller already holds the full metadata in memory, so failover does not reload state. This is the structural reason KRaft removed the ZooKeeper-era O(partitions) controller-failover penalty (empirically: a ZooKeeper-era controlled shutdown of 5 brokers / 50,000 partitions took 6.5 min on 1.0.0, versus ~20–30 s under KRaft at far larger partition counts in Confluent's ~2M-partition KRaft lab test; empirical, version-dependent).

Dedicated controllers vs combined nodes

Because the role is just config, you choose between two physical topologies:

Dimension	Combined `broker,controller`	Dedicated `controller` + `broker`
Node count floor	3 nodes total (each is voter + broker)	3 controllers + N brokers (≥ 6 nodes)
Isolation	Metadata writes share CPU, page cache, GC, and disk with data load; a produce spike can starve the controller	Controller has its own machine; data load cannot induce an election or stall metadata
Shutdown ordering	Broker must shut down before controller (controlled shutdown needs the controller), handled in `KafkaRaftServer.scala:103-108`	Independent lifecycles; roll brokers without touching the quorum
Blast radius	Losing a node removes a voter and a data broker at once	A broker loss never threatens quorum; a controller loss never loses partition data
Cost	Lowest; no dedicated hardware	3 small extra nodes (controllers need little disk/RAM; the metadata log is tiny)
Use when	Dev, test, small clusters (≲ 3–6 brokers, modest partition counts)	Production at scale: many brokers, high throughput, large metadata

Rule of thumb

Combined mode for ≤ ~3–6 brokers; dedicated controllers the moment the cluster matters. The controller quorum's job is to stay available while brokers churn; co-locating it with the heaviest, most volatile workload on the cluster defeats that. Dedicated controllers are cheap: they carry only the metadata log, so small instances with fast disks suffice. The startup/shutdown ordering in KafkaRaftServer (controller starts first, broker stops first, :92-108) exists precisely because combined mode entangles the two; dedicated mode sidesteps the entanglement.

Defining quorum membership: static voters vs dynamic (KIP-853)

There are two ways to tell the cluster who the voters are, and you pick one at format time:

Static voters: set controller.quorum.voters to a comma-separated list of {id}@{host}:{port} (e.g. 1@host1:9093,2@host2:9093,3@host3:9093), parsed in QuorumConfig.parseVoterConnections (QuorumConfig.java:219-257). The membership is fixed in config on every node. Simple, but changing the voter set means a coordinated config rollout and restart, and you cannot grow the quorum live.
Dynamic voters (KIP-853): leave controller.quorum.voters unset, set controller.quorum.bootstrap.servers instead, and establish the initial voter set at format time with kafka-storage.sh ... --standalone (single voter) or --initial-controllers (the voters doc spells this out at QuorumConfig.java:62-63). Voters are then added and removed at runtime via the AddRaftVoter / RemoveRaftVoter RPCs (the request schema exists at clients/.../message/AddRaftVoterRequest.json; handler state in raft/.../LeaderState.java:78-79). A new controller can even self-register when controller.quorum.auto.join.enable is true (default false, QuorumConfig.java:106-109).

Do not set both

The controller.quorum.voters doc is explicit: it "is the old way of defining membership … and should NOT be set if using dynamic quorums" (QuorumConfig.java:60-63). Setting both a static voter list and bootstrap-server dynamic membership is a configuration error. Choose dynamic quorums for any cluster you expect to live for years: replacing a dead controller becomes an online AddRaftVoter / RemoveRaftVoter operation instead of a fleet-wide config edit and rolling restart.
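A pre-flight lint for the "pick one" rule can be sketched as follows. This is an illustrative Python check, not Kafka's own validation: parse_voters only mirrors the {id}@{host}:{port} format described above (it is not a port of QuorumConfig.parseVoterConnections), and membership_mode is a hypothetical helper that treats setting both properties as an error, following this chapter's guidance.

```python
def parse_voters(value: str) -> dict[int, tuple[str, int]]:
    """Parse 'id@host:port,id@host:port' into {id: (host, port)}."""
    voters: dict[int, tuple[str, int]] = {}
    for entry in value.split(","):
        entry = entry.strip()
        node_id, at, endpoint = entry.partition("@")
        host, colon, port = endpoint.rpartition(":")
        if not at or not colon or not host:
            raise ValueError(f"expected id@host:port, got {entry!r}")
        nid = int(node_id)
        if nid in voters:
            raise ValueError(f"duplicate voter id {nid}")
        voters[nid] = (host, int(port))
    return voters


def membership_mode(props: dict[str, str]) -> str:
    """Return 'static' or 'dynamic'; reject configs that set both or neither."""
    static = props.get("controller.quorum.voters", "").strip()
    bootstrap = props.get("controller.quorum.bootstrap.servers", "").strip()
    if static and bootstrap:
        raise ValueError("set controller.quorum.voters OR bootstrap.servers, not both")
    if static:
        parse_voters(static)  # fail early on malformed entries
        return "static"
    if bootstrap:
        return "dynamic"
    raise ValueError("no quorum membership configured")


if __name__ == "__main__":
    print(parse_voters("1@host1:9093,2@host2:9093,3@host3:9093"))
    print(membership_mode({"controller.quorum.bootstrap.servers": "host1:9093"}))
```

Running such a lint in CI against every node's properties file catches a mixed static/dynamic configuration before it reaches a controller.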

Figure: dynamic quorum growth (KIP-853). A new controller joins live; the leader replicates the metadata log to it before counting it toward quorum.

Participants: admin · quorum leader · new controller

Sequence: AddRaftVoter(id, endpoints) → replicate __cluster_metadata to catch up → caught up, counted in majority → voter set now n+1

One more control-plane fact you will tune at scale: a broker's liveness in the cluster is a lease. Brokers heartbeat to the controller every broker.heartbeat.interval.ms (default 2000 ms) and the lease expires after broker.session.timeout.ms (default 9000 ms) of silence, after which the controller fences the broker and reassigns leadership (KRaftConfigs.java:39-45). That ~9 s is the upper bound on how long a hard-dead broker keeps "holding" leaderships before the controller moves them. See op07 (failure modes) for the failover runbook and op08 (signals) for the metrics that page on it.
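The lease arithmetic can be sketched in a few lines. This is a simplified model using the two defaults quoted above; the function names and the exact boundary comparison (>=) are assumptions for illustration, not the controller's actual code.

```python
HEARTBEAT_INTERVAL_MS = 2000   # broker.heartbeat.interval.ms default
SESSION_TIMEOUT_MS = 9000      # broker.session.timeout.ms default


def should_fence(last_heartbeat_ms: int, now_ms: int,
                 session_timeout_ms: int = SESSION_TIMEOUT_MS) -> bool:
    """True once the broker has been silent for the whole session timeout."""
    return now_ms - last_heartbeat_ms >= session_timeout_ms


def full_intervals_in_lease(interval_ms: int = HEARTBEAT_INTERVAL_MS,
                            session_timeout_ms: int = SESSION_TIMEOUT_MS) -> int:
    """Whole heartbeat intervals that fit inside one lease."""
    return session_timeout_ms // interval_ms


if __name__ == "__main__":
    print(should_fence(last_heartbeat_ms=0, now_ms=8_999))   # False
    print(should_fence(last_heartbeat_ms=0, now_ms=9_000))   # True
    print(full_intervals_in_lease())                          # 4
```

With the defaults, four full heartbeat intervals fit inside one lease, so a broker must miss several consecutive heartbeats before it is fenced; shortening the timeout trades faster failover for more sensitivity to pauses.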

Rack & multi-AZ awareness

A 3-broker RF=3 cluster all in one availability zone survives a broker failure but not a zone failure: lose the AZ and every partition is offline. The fix is to tell Kafka where each broker physically lives, via broker.rack (a free-form string, default null; doc: "used in rack aware replication assignment for fault tolerance … Examples: RACK1, us-east-1d", server-common/.../ServerConfigs.java:92-93). Set it to the AZ id. Once every broker is tagged, the controller's StripedReplicaPlacer spreads each partition's replicas across racks so no single rack/AZ holds two replicas of the same partition.

How StripedReplicaPlacer survives an AZ loss

The placer's design header is unambiguous about priority order (metadata/.../StripedReplicaPlacer.java:36-56): spreading replicas across racks is the highest-priority goal, spreading evenly across brokers is second, and preferring unfenced brokers is third. It places replicas onto racks round-robin with a random starting offset, advancing the offset by one per partition. The source illustrates the resulting "striped" pattern directly (:84-105): for racks A, B, C with three brokers each, partition 1 → A0, B0, C0; partition 2 → B1, C1, A1; partition 3 → C2, A2, B2. Two hard constraints (not goals: placement fails if they are violated) bound this:

No two replicas of a partition on the same broker (:60-64). This caps replication factor at the broker count: a 3-node cluster cannot create an RF=4 topic; the attempt throws InvalidReplicationFactorException (:411-417). This is an architectural constraint "from Kafka's internal design," not a tunable.
The leader (first replica) must be an unfenced broker (:66-69, :354-366): a brand-new partition cannot elect a leader that is down.
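The striped pattern is easier to see in code. The following is a deliberately simplified Python sketch, not a port of StripedReplicaPlacer: it uses a fixed start offset instead of a random one, ignores fencing, and the names place_striped and InvalidReplicationFactor are illustrative. It does reproduce the A/B/C example from the source comment and the RF ≤ broker-count constraint.

```python
class InvalidReplicationFactor(ValueError):
    """Stand-in for Kafka's InvalidReplicationFactorException."""


def place_striped(racks: dict[str, list[str]], num_partitions: int,
                  rf: int, start_offset: int = 0) -> list[list[str]]:
    """Assign rf replicas per partition, striping across racks round-robin."""
    total_brokers = sum(len(brokers) for brokers in racks.values())
    if rf > total_brokers:
        raise InvalidReplicationFactor(
            f"RF={rf} exceeds broker count {total_brokers}")
    rack_names = list(racks)
    cursor = {name: 0 for name in rack_names}   # next broker to try per rack
    assignments = []
    for p in range(num_partitions):
        replicas: list[str] = []
        step = 0
        while len(replicas) < rf:
            # Rack offset advances by one per partition, then one per replica.
            rack = rack_names[(start_offset + p + step) % len(rack_names)]
            step += 1
            brokers = racks[rack]
            for _ in range(len(brokers)):
                candidate = brokers[cursor[rack] % len(brokers)]
                cursor[rack] += 1
                if candidate not in replicas:   # never reuse a broker
                    replicas.append(candidate)
                    break
        assignments.append(replicas)
    return assignments


if __name__ == "__main__":
    racks = {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"], "C": ["C0", "C1", "C2"]}
    for replicas in place_striped(racks, num_partitions=3, rf=3):
        print(replicas)
    # ['A0', 'B0', 'C0']
    # ['B1', 'C1', 'A1']
    # ['C2', 'A2', 'B2']
```

Each partition takes its first replica from the next rack in the rotation, so leaders as well as replicas end up striped across racks.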

The AZ-survival invariant

With brokers tagged by AZ and StripedReplicaPlacer active, the replicas of any partition land in distinct racks (up to RF). For RF=3 across exactly 3 AZs, every partition has exactly one replica per AZ, so losing one entire AZ removes exactly one replica from every partition, and with min.insync.replicas=2 the surviving two still satisfy the in-sync minimum. Producers with acks=all keep writing; no partition goes offline. This is the whole point of rack awareness, and it is enforced at placement time, not by a background balancer. See Part I · 08 (replication & ISR) for why the surviving ISR remains writable.
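The invariant can be checked mechanically against any replica assignment. This is a minimal sketch that assumes every replica was in sync before the failure; the helper names and the sample broker ids are hypothetical.

```python
def replicas_left_after_rack_loss(assignment: list[list[str]],
                                  broker_rack: dict[str, str],
                                  lost_rack: str) -> int:
    """Smallest surviving replica count across all partitions."""
    return min(
        sum(1 for broker in replicas if broker_rack[broker] != lost_rack)
        for replicas in assignment
    )


def survives_any_rack_loss(assignment: list[list[str]],
                           broker_rack: dict[str, str],
                           min_insync_replicas: int = 2) -> bool:
    """True if every partition stays writable whichever single rack fails."""
    return all(
        replicas_left_after_rack_loss(assignment, broker_rack, rack)
        >= min_insync_replicas
        for rack in set(broker_rack.values())
    )


if __name__ == "__main__":
    broker_rack = {"a0": "az-a", "a1": "az-a", "b0": "az-b", "c0": "az-c"}
    striped = [["a0", "b0", "c0"], ["b0", "c0", "a1"]]
    clumped = [["a0", "a1", "b0"]]          # two replicas share az-a
    print(survives_any_rack_loss(striped, broker_rack))   # True
    print(survives_any_rack_loss(clumped, broker_rack))   # False
```

The clumped case fails because losing az-a leaves a single replica, below min.insync.replicas=2, which is exactly the outcome rack-aware placement is designed to prevent.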

Figure: rack-aware placement decision. As long as racks ≥ RF and are roughly equal-sized, an AZ loss removes only one replica per partition.

Flow: create topic, RF=3, 3 racks tagged → controller spreads replicas across racks (highest-priority goal) → P0: AZ-a, AZ-b, AZ-c · P1: AZ-b, AZ-c, AZ-a …

Outcome on AZ loss when replicas sit in distinct racks: every partition keeps 2 replicas → ISR ≥ 2, stays online.

Outcome on AZ loss when two replicas share a rack: the AZ loss can take partitions offline.

Unequal racks break the guarantee

The placer warns that rack-spread outranks broker-balance: "if you configure 10 brokers in rack A and B, and 1 broker in rack C … you will end up with a lot of partitions on that one broker in rack C … In general racks are supposed to be about the same size, if they aren't, this is a user error" (StripedReplicaPlacer.java:48-56). Keep the same broker count in every AZ (with 3 AZs, a total broker count that is a multiple of 3). An asymmetric cluster hot-spots the small AZ and can still satisfy rack-spread while overloading one node.
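The hot-spot is simple arithmetic. The sketch below assumes RF equals the number of racks, so every partition places exactly one replica in each rack; the function name and the 1,000-partition figure are illustrative.

```python
def replicas_per_broker(rack_sizes: dict[str, int],
                        partitions: int, rf: int) -> dict[str, float]:
    """Average replicas hosted by each broker, per rack, when RF == #racks."""
    if rf != len(rack_sizes):
        raise ValueError("this model assumes exactly one replica per rack")
    # Each rack receives one replica of every partition, shared by its brokers.
    return {rack: partitions / size for rack, size in rack_sizes.items()}


if __name__ == "__main__":
    load = replicas_per_broker({"A": 10, "B": 10, "C": 1}, partitions=1000, rf=3)
    print(load)   # {'A': 100.0, 'B': 100.0, 'C': 1000.0}
```

With the 10/10/1 layout from the source comment, the lone broker in rack C carries a replica of every partition, ten times the load of any broker in the other racks.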

The cost of rack awareness: cross-AZ traffic

Rack-spreading replicas means replication traffic is, by construction, cross-AZ, and on AWS/GCP that is metered. With RF=3 across 3 AZs, every 1 GiB produced becomes ~2 GiB of cross-AZ replication (the leader ships to two followers in other AZs), plus producers land ~2/3 cross-AZ and each consumer group reads ~2/3 cross-AZ absent optimization (empirical model; Confluent / AutoMQ). Cross-AZ networking is commonly 50–90% of a self-managed cloud Kafka bill at scale (empirical; Confluent cost model, directional, AWS/GCP-centric; Azure inter-AZ has historically been free). This is the durability/cost dial: AZ survival is not free, and the lever to recover the consumer share is fetch-from-follower.

Fetch-from-follower (KIP-392) lets a consumer read from a same-AZ follower instead of always the leader. The broker picks the preferred read replica via replica.selector.class (default returns the leader, ReplicationConfigs.java:140-141); set it to the rack-aware selector and tag consumers with client.rack (the client-side twin of broker.rack, clients/.../CommonClientConfigs.java:77-78) so the broker matches the consumer's AZ to a co-located replica. Aligning consumer fetch traffic this way can cut total cluster cross-AZ cost by roughly half (empirical; Grab drove consumer cross-AZ to zero, InfoQ, 2023), at the cost of up to ~hundreds of ms of added tail latency and some broker load skew. Crucially, it touches only consumer reads; produce and replication cross-AZ traffic remain. The mechanism and fetch path are in Part I · 09 (the fetch path); the full cost arithmetic and lever ordering live in op10 (cost), and quota-based tenant isolation in Part I · 19.
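The cross-AZ model above can be written down as a small calculator. This is a sketch of the stated empirical model only, under the assumptions that RF equals the AZ count (one replica per AZ) and that clients are spread evenly across AZs; the function name and parameters are illustrative.

```python
def cross_az_gib(produced_gib: float, rf: int = 3, azs: int = 3,
                 consumer_groups: int = 1,
                 fetch_from_follower: bool = False) -> dict[str, float]:
    """Estimate cross-AZ GiB per produced GiB under the one-replica-per-AZ model."""
    if rf != azs:
        raise ValueError("model assumes exactly one replica in every AZ")
    remote_fraction = (azs - 1) / azs          # chance the leader is in another AZ
    replication = produced_gib * (rf - 1)      # leader ships to rf-1 remote followers
    produce = produced_gib * remote_fraction
    consume = 0.0 if fetch_from_follower else (
        produced_gib * consumer_groups * remote_fraction)
    return {"replication": replication, "produce": produce,
            "consume": consume, "total": replication + produce + consume}


if __name__ == "__main__":
    before = cross_az_gib(1.0, consumer_groups=4)
    after = cross_az_gib(1.0, consumer_groups=4, fetch_from_follower=True)
    print(round(before["total"], 2), round(after["total"], 2))   # 5.33 2.67
```

In this model the saving from fetch-from-follower depends on consumer fan-out: with four consumer groups the total halves, while with a single group the consumer share is smaller and so is the saving.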

Producers can be rack-aware too

For keyless records, the built-in partitioner can prefer a partition whose leader is in the producer's own AZ when partitioner.rack.aware=true and client.rack is set (clients/.../ProducerConfig.java:126,331; it throws if client.rack is missing, RecordAccumulator.java:1243). This shaves producer-side cross-AZ for unkeyed topics but, like all partitioner choices, has no effect on a keyed topic where the key fixes the partition.

One cluster or many? Blast radius and tenancy

A single cluster is operationally cheap (one quorum, one set of brokers, one upgrade) but it is also a single blast radius: a runaway tenant, a bad config push, a metadata bug, or a correlated failure hits everyone on it. The largest operators deliberately cap cluster size to bound this: Pinterest holds clusters to ~200 brokers, Netflix Keystone to < 200 brokers and < 10,000 partitions, and Uber federates into ~150-node clusters (empirical; LinkedIn/Pinterest/Netflix/Uber engineering posts). They run many bounded clusters rather than one giant one, then bridge with an aggregation/federation tier.

Strategy	Isolation	Cost / ops	When to choose
Shared multi-tenant cluster (quotas + ACLs)	Soft: enforced by quotas (byte-rate, request-rate) and ACLs; a noisy tenant is throttled, not isolated	Lowest; one cluster to run, best resource packing	Many small/medium tenants, trusted internal teams, cost-sensitive
Dedicated clusters per tenant/domain	Hard: separate brokers, quorum, failure domain; one tenant's incident cannot touch another	Highest; N control planes, N upgrades, lower utilization	Strong compliance/SLA boundaries, very large or hostile-neighbor tenants, regulated data
Bounded clusters + federation	Per-cluster blast radius; cross-cluster via MM2	Medium; standard at hyperscale	Fleet beyond a few hundred brokers; bound blast radius while keeping clusters mergeable

Quotas + ACLs are the cheap isolation; separate clusters are the strong one

On a shared cluster, your isolation primitives are quotas (cap a principal's produce/fetch/request rate, see Part I · 19) and ACLs (who can touch which topic/group, Part I · 18). These prevent a tenant from starving the cluster but not from sharing its fate in a correlated failure or bad rollout. When a tenant's data or availability boundary is a hard requirement, give it its own cluster, the only way to guarantee that another tenant's incident, upgrade, or metadata growth cannot affect it. Most organizations run a tiered mix: a big shared cluster for the long tail, dedicated clusters for the few that warrant it.

Spanning regions: MirrorMaker 2 vs stretch clusters

A single Kafka cluster's replication assumes low, stable inter-broker latency: the ISR mechanism (Part I · 08) drops a follower from the ISR after replica.lag.time.max.ms (default 30 s) of falling behind, and acks=all waits on those followers. Stretch a single cluster across a high-latency WAN and you pay that latency on every durable write and risk constant ISR churn. There are therefore two fundamentally different ways to be in more than one region.

Option A: MirrorMaker 2 (asynchronous, separate clusters)

MirrorMaker 2 (KIP-382) is a set of Kafka Connect connectors (in the connect/mirror module) that asynchronously copy data between independent clusters. MirrorSourceConnector consumes from the source and produces to the target; MirrorCheckpointConnector translates and replicates consumer-group offsets; MirrorHeartbeatConnector emits heartbeats to measure end-to-end lag. The clusters stay fully independent: each has its own quorum and its own durability, and replication is best-effort-forward, so the target trails the source by the replication lag (no synchronous coupling, no WAN penalty on the producer). See Part I · 21 (Kafka Connect & MM2) for the connector internals.

Key source-grounded defaults that shape an MM2 deployment:

Config	Default	Source	Meaning
`replication.factor` (mirrored topics)	2	`MirrorSourceConfig.java:37`	RF MM2 uses when creating remote topics on the target; raise to 3 for production durability
`checkpoints.topic.replication.factor`	3	`MirrorCheckpointConfig.java:42`	RF of the internal offset-checkpoint topic
`heartbeats.topic.replication.factor`	3	`MirrorHeartbeatConfig.java:30`	RF of the internal heartbeats topic
`offset-syncs.topic.replication.factor`	3	`MirrorSourceConfig.java:53`	RF of the offset-translation mapping topic
`refresh.topics.interval.seconds`	600 (10 min)	`MirrorSourceConfig.java:66`	How often MM2 discovers new source topics to mirror
`sync.topic.configs.enabled`	true	`MirrorSourceConfig.java:70`	Propagate topic configs (retention, etc.) source → target
`emit.checkpoints.enabled`	true	`MirrorCheckpointConfig.java:59`	Write translated consumer offsets so a failover consumer resumes near where it left off
`sync.group.offsets.enabled`	false	`MirrorCheckpointConfig.java:66`	Write translated offsets into the target's `__consumer_offsets`; off by default, enable for clean failover

Two MM2 defaults to override on day one

First, mirrored-topic replication.factor defaults to 2 (MirrorSourceConfig.java:37), fine for a lab, but production DR topics should be RF=3 like the source. Second, sync.group.offsets.enabled defaults to false (MirrorCheckpointConfig.java:66): checkpoints are emitted but not written into the target's consumer-offsets topic unless you turn this on. For an active/passive DR setup where consumers must resume cleanly after failover, enable it (it only writes while no active consumer in that group is connected to the target, :65 doc).
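The two overrides can be captured in whatever generates your MM2 properties file. The sketch below renders a minimal one-way flow in the layout of the stock connect-mirror-maker.properties example; the cluster aliases, bootstrap addresses, and the render_mm2_properties helper are hypothetical, and a real deployment will need more settings than shown.

```python
def render_mm2_properties(source: str, target: str,
                          bootstrap: dict[str, str]) -> str:
    """Render a minimal one-way MM2 config with the two day-one overrides."""
    lines = [
        f"clusters = {source}, {target}",
        f"{source}.bootstrap.servers = {bootstrap[source]}",
        f"{target}.bootstrap.servers = {bootstrap[target]}",
        f"{source}->{target}.enabled = true",
        f"{source}->{target}.topics = .*",
        # Override 1: mirrored topics default to RF=2; match the source's RF=3.
        "replication.factor = 3",
        # Override 2: write translated offsets into the target's __consumer_offsets.
        "sync.group.offsets.enabled = true",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_mm2_properties(
        "us-east", "eu-west",
        {"us-east": "kafka-a.example.internal:9092",   # hypothetical hosts
         "eu-west": "kafka-b.example.internal:9092"},
    ))
```

Generating the file rather than hand-editing it makes it harder to ship a DR flow that silently keeps the lab-grade defaults.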

Topic renaming prevents replication cycles. The DefaultReplicationPolicy prefixes every mirrored topic with the source cluster's alias and a separator (default ".", connect/mirror-client/.../DefaultReplicationPolicy.java:39): formatRemoteTopic(alias, topic) = alias + separator + topic (:65-66). So topic orders from a cluster aliased us-east appears on the target as us-east.orders. This makes the origin visible in the name and lets MM2 detect already-mirrored topics, so in an active/active pair A↔B the topic orders on A becomes us-east.orders on B and is not mirrored back, breaking the loop. Consumers that want both local and remote data subscribe to a pattern (e.g. .*orders).
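The naming rule and the loop-breaking check can be sketched in Python. format_remote_topic mirrors the alias + separator + topic rule quoted above; topic_source and should_mirror are simplified illustrative helpers rather than MM2's actual code, and they assume aliases and local topic names do not themselves contain the separator.

```python
import re

SEPARATOR = "."   # DefaultReplicationPolicy's default separator


def format_remote_topic(source_alias: str, topic: str,
                        separator: str = SEPARATOR) -> str:
    """Name a mirrored topic takes on the target cluster."""
    return source_alias + separator + topic


def topic_source(topic: str, separator: str = SEPARATOR):
    """Alias prefix of a mirrored topic, or None for a local topic."""
    alias, sep, _ = topic.partition(separator)
    return alias if sep else None


def should_mirror(topic: str, target_alias: str) -> bool:
    """Skip topics that originated on the target: mirroring them back is a cycle."""
    return topic_source(topic) != target_alias


if __name__ == "__main__":
    remote = format_remote_topic("us-east", "orders")
    print(remote)                                   # us-east.orders
    print(should_mirror(remote, "us-east"))         # False: would loop back
    print(should_mirror("orders", "us-east"))       # True: local topic on B
    pattern = re.compile(r".*orders")
    print([t for t in ("orders", remote) if pattern.fullmatch(t)])
```

The final line shows why a pattern subscription such as .*orders picks up both the local topic and its mirrored counterpart.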

Figure: MirrorMaker 2 active/passive. Async copy with source-alias prefixing (prevents cycles) plus offset checkpoints (enables clean failover).

Participants: producer (us-east) · cluster A (us-east) · MM2 (Connect) · cluster B (eu-west)

Sequence: produce → topic "orders" · MirrorSourceTask consumes "orders" · produce → "us-east.orders" (prefixed) · checkpoint: translated consumer offsets · heartbeat: measure e2e lag

Active/passive vs active/active. In active/passive, one region is primary; MM2 one-way mirrors to a standby that runs no producers until failover. Simple, no conflict resolution; the standby trails by replication lag (RPO > 0). In active/active, both regions take writes and mirror to each other (the prefix scheme prevents loops); you get low local latency everywhere but must design for the fact that there is no global ordering across regions and no automatic conflict resolution, so keep keys region-affine or make consumers idempotent. Uber documented the operational reality: legacy MirrorMaker (MM1) rebalance storms caused 5–10 min replication stalls and ~weekly outages, which is why MM2's Connect-based, statically-assigned model (and uReplicator-style designs) replaced it (empirical; Uber engineering). MM1's flaws are not MM2's behavior.

Option B: stretch cluster (synchronous, one cluster across nearby AZs)

A stretch cluster is a single logical cluster whose brokers and controllers span multiple AZs in the same region (or very-low-latency nearby DCs), tagged by broker.rack so StripedReplicaPlacer spreads replicas across them. Because it is one cluster, acks=all + min.insync.replicas=2 gives you synchronous durability across AZs: a committed write is already replicated into two zones, RPO = 0 for an AZ loss, and failover is automatic (the controller re-elects from the surviving in-sync replicas). The price is that every durable write pays the inter-AZ round-trip, and a 5-voter quorum spread across AZs pays the median inter-AZ latency on every metadata commit.

Stretch within a region, mirror across regions

Stretch clusters work because intra-region inter-AZ latency is single-digit milliseconds, tolerable for synchronous replication. Do not stretch a single cluster across distant regions: the WAN latency turns every acks=all write slow and destabilizes both the ISR (followers fall outside replica.lag.time.max.ms) and the Raft quorum (voters miss controller.quorum.fetch.timeout.ms = 2000 ms and trigger spurious elections, QuorumConfig.java:83). Cross-region is MirrorMaker 2's job (async, decoupled clusters); cross-AZ within a region is the stretch cluster's job (sync, one cluster). Mixing them up is the classic multi-region footgun.

Property	Stretch cluster (1 cluster, multi-AZ)	MirrorMaker 2 (N clusters, multi-region)
Replication	Synchronous (ISR, `acks=all`)	Asynchronous (Connect copy)
RPO on zone/region loss	0 (AZ loss; committed = durable in another AZ)	> 0 (target trails by replication lag)
Failover	Automatic (controller re-elects)	Manual/orchestrated (repoint clients, use checkpointed offsets)
Latency on writes	Pays inter-AZ RTT on every durable write	Local-region latency; no WAN penalty on producers
Span	AZs within one region only	Any distance (regions, continents)
Topic names	Unchanged	Prefixed with source alias on the target

Tiered storage topology

By default a partition's entire log lives on broker-local disk, so storage and compute are coupled: retention is bounded by disk, and a broker that rejoins must re-replicate cold data. Tiered storage (KIP-405; see Part I · 05) puts an object store behind the cluster: old log segments are offloaded to remote storage while only recent data stays local. It is off by default: remote.log.storage.system.enable defaults to false (storage/.../RemoteLogManagerConfig.java:58) and must be enabled cluster-wide, then per-topic. Two new retention knobs split "how long to keep locally" from total retention: log.local.retention.ms and log.local.retention.bytes, both defaulting to -2 meaning "inherit the overall retention" (RemoteLogManagerConfig.java:163,169). Set them small (hours/GBs) so the broker keeps only the hot tail on disk and ships the rest to the object store.

Figure: tiered topology. Local disk holds the hot tail; the object store holds history. Storage scales independently of broker count.

Broker: serves hot reads from page cache (zero-copy); keeps only recent segments locally per log.local.retention.{ms,bytes}

↕ RemoteLogManager offloads closed segments

Object store (S3 / GCS / Azure Blob): cold segments; ~$0.02/GiB-mo vs ~$0.08–0.10 EBS (empirical, AWS retail). Pinned by: remote.log.storage.system.enable=true

Tiered storage cuts storage cost, not cross-AZ networking

A persistent misconception: tiering offloads cold storage (S3 ~$0.02/GiB-mo vs EBS ~$0.08–0.10, empirical, AWS retail; ~4–5× cheaper) and shrinks broker-rejoin re-replication time (the broker fetches cold data from the object store, not from peers), but it does not reduce the cross-AZ replication traffic between brokers: replication still goes leader→followers across AZs at RF=3. After tiering, networking can climb to 80–90% of TCO precisely because storage shrank. The networking levers are fetch-from-follower (consumer side) and, eventually, diskless/object-store-native designs (KIP-1150, accepted ~March 2026, not yet production-ready OSS, empirical/version note). See op10 (cost).
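The storage side of that trade can be estimated with a rough footprint model. This is a sketch under loud assumptions: local disk holds RF copies of the local-retention window, the object store holds one logical copy for the full retention window, prices are the empirical retail figures quoted above, and the ingest rate and retention values are made up for illustration.

```python
def monthly_storage_usd(ingest_gib_per_day: float, total_retention_days: float,
                        local_retention_days: float, rf: int = 3,
                        tiered: bool = True,
                        local_usd_per_gib: float = 0.08,
                        remote_usd_per_gib: float = 0.02) -> dict[str, float]:
    """Rough monthly storage bill for one workload, with or without tiering."""
    if tiered:
        local_gib = ingest_gib_per_day * local_retention_days * rf
        remote_gib = ingest_gib_per_day * total_retention_days  # one copy assumed
    else:
        local_gib = ingest_gib_per_day * total_retention_days * rf
        remote_gib = 0.0
    usd = local_gib * local_usd_per_gib + remote_gib * remote_usd_per_gib
    return {"local_gib": local_gib, "remote_gib": remote_gib, "usd": usd}


if __name__ == "__main__":
    flat = monthly_storage_usd(1000, 30, 30, tiered=False)
    tiered = monthly_storage_usd(1000, 30, 1, tiered=True)
    print(round(flat["usd"]), round(tiered["usd"]))   # 7200 840
```

Note what the model leaves out: cross-AZ replication traffic is identical in both cases, which is why the networking share of the bill grows after tiering.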

The operational wins are concrete: Uber and Pinterest run tiered storage to decouple retention from disk (Pinterest offloads ~200 TB/day to object store) and to blunt the page-cache "catch-up tax" where historical reads evict the hot tail and spike p99 produce latency (KIP-405 tests reported ~30% p99 improvement and avoided a ~43% producer-throughput drop under historical-read load, empirical, KIP-405; workload-specific). Tiered storage is also what makes "many bounded clusters" affordable, since each cluster needs only enough local disk for its hot tail.

A recommended topology per scale tier

Putting it together, a decision guide from laptop to continent. Numbers are starting points; size partitions and brokers with the formulas in op04 (capacity) and op03 (partitioning).

Tier	Controllers	Brokers / layout	Durability	Multi-region / storage
Dev / test	1 combined node (or 3 combined)	1–3 combined `broker,controller`, single AZ	RF=1 (single) or RF=3; `min.insync.replicas=1`	None; local disk only
Medium (single-region prod)	3 dedicated controllers, one per AZ, dynamic quorum	6–30 brokers, RF=3 striped across 3 AZs (`broker.rack`=AZ)	RF=3, `min.insync.replicas=2`, `acks=all`, unclean election off	Tiered storage on if retention > days; fetch-from-follower to cut consumer cross-AZ
Large (bounded fleet)	3 or 5 dedicated controllers per cluster, dynamic quorum	Many clusters capped ~150–200 brokers each, RF=3 / 3 AZs; tenants split by quotas+ACLs or dedicated clusters	Same triad; per-topic RF tuning; Cruise-Control-style balancing	Tiered storage standard; fetch-from-follower; MM2 for cross-cluster aggregation
Global / multi-region	5 dedicated controllers per regional cluster; stretch quorum only within a region	Independent regional clusters (each itself large/bounded)	RF=3 + min.insync=2 per region; stretch within region for RPO=0 on AZ loss	MM2 active/active or active/passive across regions (RF=3, `sync.group.offsets.enabled=true`); tiered storage everywhere

The four decisions that define your topology

(1) Combined or dedicated controllers: combined ≤ ~6 brokers, dedicated once it matters. (2) Quorum size: 3 (tolerate 1) almost always; 5 (tolerate 2) for large/global. (3) Rack awareness: always tag broker.rack with the AZ in any cloud cluster, keep AZs equal-sized, and accept the cross-AZ cost (recover the consumer share with KIP-392). (4) Cross-region bridge: stretch within a region (sync, RPO=0), MirrorMaker 2 across regions (async, RF=3, offsets synced). Get these four right at format time; everything else is tunable later.
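As a memory aid, the four decisions can be encoded as a tiny helper. This is only a sketch of this chapter's rules of thumb: the thresholds are the approximate ones stated above, the function and field names are hypothetical, and real sizing should follow the capacity formulas referenced earlier.

```python
def recommend_topology(brokers: int, regions: int = 1,
                       survive_two_voter_losses: bool = False) -> dict[str, object]:
    """Map broker count and region count to the chapter's four layout decisions."""
    if brokers < 1 or regions < 1:
        raise ValueError("need at least one broker and one region")
    controllers = "combined" if brokers <= 6 else "dedicated"
    voters = 5 if (survive_two_voter_losses or regions > 1) else 3
    bridge = ("MirrorMaker 2 across regions; stretch across AZs within each region"
              if regions > 1 else "stretch across AZs within the region")
    return {
        "controllers": controllers,
        "voters": voters,
        "rack_awareness": "broker.rack = AZ id, equal broker count per AZ",
        "bridge": bridge,
    }


if __name__ == "__main__":
    print(recommend_topology(brokers=3))
    print(recommend_topology(brokers=30))
    print(recommend_topology(brokers=150, regions=2))
```

Treat the output as a starting point for discussion, not a sizing tool: the inputs that actually drive the choice (throughput, partition count, SLA boundaries) are not modeled here.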

Anti-patterns to avoid

Even quorum (2 or 4 voters). Strictly worse than the next-lower odd number: same fault tolerance, larger majority, possible split-brain. Always odd.
Combined controllers under heavy data load at scale. A produce spike can starve the metadata plane and induce elections; dedicate controllers once the cluster carries real traffic.
All replicas in one AZ. RF=3 in a single AZ survives a broker, not a zone. Tag racks so StripedReplicaPlacer spreads across AZs.
Asymmetric AZ broker counts. The placer prioritizes rack-spread over broker-balance, so an undersized AZ hot-spots its nodes (StripedReplicaPlacer.java:48-56). Keep AZs equal.
Stretching one cluster across regions. WAN latency breaks ISR and the Raft quorum timeouts. Use MM2 for cross-region.
Leaving MM2 mirrored-topic RF at the default 2 and sync.group.offsets off. Both undercut a real DR posture; raise RF to 3 and enable offset sync.
Expecting tiered storage to cut the network bill. It cuts storage; cross-AZ replication is untouched. Reach for fetch-from-follower / diskless for networking.
One giant cluster as the whole fleet. Cap cluster size to bound blast radius; federate with MM2, the universal pattern at hyperscale.

Next: op10 (cost) turns the cross-AZ, RF, retention, and tiered-storage choices made here into dollar figures and an ordered lever list; op11 (scaling scenarios) walks the growth path through these tiers; and op07 (failure modes) covers what each topology does when an AZ, a broker, or a controller dies.