krivaltsevich.com Kafka Internals4.4

II · 03 · Partitioning Strategy: How Many Partitions?

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Operational guidance grounded in source code and cited benchmarks.

The partition count is the single most consequential, and most irreversible, number you set on a topic. It fixes the ceiling on producer parallelism, the maximum useful size of every consumer group, and the only ordering guarantee Kafka offers. Set it too low and you cannot consume faster no matter how many machines you throw at the problem; set it too high and you pay a standing tax in file descriptors, memory-mapped index regions, client memory, replication fan-out, failover latency, and rebalance time on every broker for the life of the topic. This chapter answers the central question, how many?, with the source mechanisms that make partitions the unit of parallelism, the empirical ceilings (~10 MB/s per partition, latency degradation past ~1000), the sizing heuristic max(T/Tp, T/Tc), the per-partition cost rooted in concrete code, the signals that tell you to reshard, and the repartitioning trap that makes a keyed topic's partition count effectively immutable.

Why the partition is the unit of everything

A Kafka topic is a logical name; the partition is the physical thing. Each partition is an independent append-only log (Part I 03 · Storage & the Log Engine) with its own leader, its own replica set, its own offset space, and its own segment files. Three properties of the system flow directly from this, and all three are bounded by the partition count.

Producer parallelism, partitions are the write fan-out

A producer spreads records across partitions. With no key, the default partitioner (BuiltInPartitioner) uses adaptive sticky partitioning (KIP-794): it fills one partition with a batch worth of bytes, then switches, weighting the next choice by the inverse of each partition's queue depth so slower brokers receive fewer records (clients/src/main/java/org/apache/kafka/clients/producer/internals/BuiltInPartitioner.java:71). With a key, the partition is deterministic, murmur2(key) % numPartitions (BuiltInPartitioner.java:397). Either way, writes for a topic can be in flight to at most numPartitions leaders simultaneously, and those leaders are spread across brokers. The partition count is therefore the ceiling on how much write concurrency the cluster can absorb for one topic.

Consumer parallelism, one consumer per partition, hard ceiling

This is the constraint engineers underestimate most. Within a consumer group, a partition is assigned to exactly one member at a time. The assignor lays out the partitions and divides them across the members: "we lay out the available partitions in numeric order and the consumers in lexicographic order... divide the number of partitions by the total number of consumers to determine the number of partitions to assign to each" (clients/src/main/java/org/apache/kafka/clients/consumer/RangeAssignor.java:43). The server-side assignor for the new consumer-group protocol (KIP-848) states the same invariant: "each subscribed member receives at least one partition" (group-coordinator/src/main/java/org/apache/kafka/coordinator/group/assignor/RangeAssignor.java). See Part I 13 · Group Coordination.

Invariant: useful group size ≤ partition count

A consumer group can have at most numPartitions actively-consuming members. Add more consumers than partitions and the surplus sit idle, assigned nothing, holding a TCP connection and a heartbeat but doing zero work. You cannot scale consumption past the partition count by adding machines. If a topic has 12 partitions, the 13th consumer is dead weight. This is the hard reason partition count must anticipate peak consumer parallelism, not just peak throughput.

topic: 4 partitions
consumer A
consumer B
consumer C (idle)
Four partitions, three consumers: two get two partitions each, the third is assigned nothing.
topic/log · consumer · idle/no-work · active assignment · empty assignment.

Ordering, only per partition, only per key

Kafka guarantees ordering within a partition only. There is no total order across a topic. For keyed records the order guarantee is per-key, and it holds precisely because every record with a given key lands on the same partition via murmur2(key) % numPartitions. This is why the partition count and the key schema are joined at the hip, and why changing the partition count silently breaks the guarantee (covered below). See Part I 16 · Producer Client.

Why these three are the same number

Producer fan-out, consumer fan-out, and per-key ordering are not three independent design axes, they are three views of one quantity, numPartitions, because each partition is a single-leader, single-writer, single-consumer-per-group log. Sizing partitions is capacity planning for parallelism; everything else in this chapter is the cost of that choice.

The per-partition throughput ceiling

How much can one partition do? The planning number, from Jun Rao's canonical guidance (Confluent, March 2015; empirical), is that "one can produce at 10s of MB/sec on just a single partition," with a conservative planning figure of ~10 MB/s. This is not a code constant, it is an emergent property of disk write bandwidth on the leader, the follower fetch rate, batch size, compression ratio, and acks. Treat it as directional and measure your own.

KnobEffect on single-partition throughputSource
Batch size / lingerLarger batches amortize per-request overhead; ~10–50× throughput gains reported moving from tiny to large batchesConfluent producer course (empirical)
Compression (lz4/zstd)More bytes per disk write & per network frame; trades producer CPU for effective throughputCloudflare (empirical)
acks=all + replicationSync replication roughly halved single-producer throughput in LinkedIn's benchmark (821K → 422K rec/s)Jay Kreps / LinkedIn 2014 (empirical)
Replication factorEach partition's writes are re-fetched RF−1 times; follower fetch rate caps the leaderPart I 08 · Replication
Gotcha: the consumer side is usually the slower path

When you compute the per-partition consume rate Tc, measure your application's end-to-end processing rate per partition, deserialize, business logic, downstream write, not the broker's disk-read ceiling. A partition the broker can serve at 50 MB/s is useless if your consumer processes 2 MB/s of it. Size Tc from the slowest stage in your consumer, then take the larger of the two requirements.

The sizing heuristic: partitions = max(T/Tp, T/Tc)

The accepted formula (Jun Rao, 2015; empirical) is:

N ≥ max(T / Tp, T / Tc)
Take the larger of the producer-side and consumer-side partition requirements. All three quantities must be in the same unit so the division cancels to a pure partition count.
T
Target throughput for the topic at peak (MB/s or records/s), measured in the same unit as Tp and Tc.
Tp
Throughput a single partition sustains on the produce path, given your batch/compression/acks. Conservative planning value ~10 MB/s.
Tc
Throughput a single partition sustains on the consume path, the application's per-partition processing rate, usually the binding constraint.

Before working the example, pin every constant it consumes. None of these is a Kafka guarantee, Tp and Tc are workload-dependent rates you must measure; the values below are illustrative planning figures. Substitute your own measured numbers.

SymbolValue (with units)KindWhy / source
T100 MB/s (peak topic ingress)Workload (illustrative)The target this topic must absorb at peak, assumed for this worked example; measure yours.
Tp~10 MB/s per partitionWorkload planning figure (empirical)Conservative single-partition produce rate; "one can produce at 10s of MB/sec on just a single partition", Jun Rao (this guide's empirical reference §2). Directional, not a code constant, measure with your batch/compression/acks.
Tc~5 MB/s per partition (one consumer thread)Workload (illustrative)The slowest stage of the consumer app (deserialize + business logic + downstream write) per partition, assumed here; measure yours per the gotcha above.

Derivation (worked example from the reference). Plug the assumptions into each side of the max, keeping units explicit so they cancel to a partition count:

Producer side
T / Tp = 100 MB/s ÷ 10 MB/s per partition ≈ 10 partitions, the MB/s cancels, leaving partitions. So 10 partitions is the floor the produce path needs.
Consumer side
T / Tc = 100 MB/s ÷ 5 MB/s per partition ≈ 20 partitions, again the MB/s cancels. The consume path needs 20 partitions so a 20-member group (one consumer per partition, per the invariant above) can keep up.
Take the larger
N ≥ max(10, 20) = 20 partitions.

Why the consumer side binds here: the assumed per-consumer drain (Tc = 5 MB/s) is below the per-partition produce rate (Tp = 10 MB/s), so each partition's consumer is the slowest link; consumer parallelism, not produce throughput or disk, sets the count, and the max correctly picks the larger (20) requirement. Had the consumer been faster than the producer (Tc > Tp), the produce side would have bound instead. The formula picks whichever dimension is slowest.

target T (peak MB/s)
T / Tp
producer-side N
T / Tc
consumer-side N
N = max(…), round up, add growth headroom
The sizing decision: compute both sides, take the maximum, then add headroom for growth (within the cost bounds below).
input/target · produce path · consume path · chosen count · decision nodes rounded.

The cost of a partition, each rooted in a mechanism

Partitions are not free. Every one carries a standing, per-broker cost for the life of the topic. Here is the full bill, each line tied to the concrete mechanism that produces it.

1. Open file descriptors and memory-mapped index regions

Each partition replica on a broker is a directory of segments, and each segment is three files: .log, .index (offset index), .timeindex (time index). The two index files are memory-mapped, AbstractIndex holds a MappedByteBuffer mmap (storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java:72), realized as OffsetIndex and TimeIndex. So each retained segment contributes ~2 mmap regions (its two memory-mapped indexes) plus ~3 open file handles (.log + .index + .timeindex). To size the per-partition cost, fix three numbers:

Segment size
1 GiB, Kafka default DEFAULT_SEGMENT_BYTES = 1024 * 1024 * 1024 (cited config default, storage/src/main/java/org/apache/kafka/storage/internals/log/LogConfig.java:131).
mmap regions per segment
~2, the offset index and time index are each one MappedByteBuffer (mechanism, AbstractIndex.java:72); the .log itself is not mmapped.
Partition data size
10 GB retained per partition, illustrative workload assumption for this example (= retention bytes per partition); substitute your own from retention × ingress.

Then, with units cancelling to a count:

Segments per partition
10 GB ÷ 1 GiB per segment ≈ 10 segments (treating GB ≈ GiB for planning).
mmap regions per partition
10 segments × 2 mmap regions per segment ≈ 20 mmap regions, with a similar count of FDs. (If a partition is near-empty, only its active segment is mapped, so the floor is ~2 mmap regions per partition.)
Warning: vm.max_map_count is a hard per-broker ceiling

Linux defaults vm.max_map_count to 65,530 (an OS hard limit, the max number of memory-map areas one process may hold). Taking the floor of ~2 mmap areas per partition (the active segment's two indexes, from above), the default ceiling derives as 65,530 mmap areas ÷ 2 mmap areas per partition ≈ 32,765 partitions/broker before the broker fails to mmap and crashes, even on KRaft. Instaclustr hit exactly this pushing toward 600k partitions on one KRaft broker (empirical reference §2). Production guidance: raise vm.max_map_count to ≥ 262,144 and broker nofile to ≥ 100,000. FD budget (units cancel to a file count): open files ≈ Σ(partition_size ÷ segment_size) + connections (Jun Rao; empirical reference §2). Cross-link Part I 04 · Storage Management.

2. Producer client memory, a buffer per partition

The producer's RecordAccumulator keeps an independent batch queue per partition: topicInfo.batches.computeIfAbsent(effectivePartition, k -> new ArrayDeque<>()) (clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java:318). All those queues draw from a single bounded pool, buffer.memory (Kafka default 32 MiB, ProducerConfig.java:40132 * 1024 * 1024). Budget ~tens of KB per partition being produced (empirical planning figure, Jun Rao; reference §2). To find where a single producer runs dry, divide the pool by the per-partition budget (KB cancels, leaving a partition count):

Partitions one 32 MiB pool covers
32 MiB ÷ ~64 KB per partition ≈ 32,768 KiB ÷ 64 KB per partition ≈ 512 partitions at the high end of "tens of KB", beyond that the queues thin out.

A producer fanning out past that either needs a larger buffer.memory or it stalls on max.block.ms when the pool drains. See Part I 16 · Producer Client.

3. Replication fan-out

Every partition is replicated RF times; followers fetch from the leader continuously (Part I 08 · Replication, 09 · Fetch Path). More partitions = more fetcher work and more inter-broker traffic. Two planning figures bound this, both empirical (Jun Rao; this guide's reference §2, directional, not Kafka guarantees):

Replication-latency figure
~20 ms added per 1,000 partitions replicated from one broker to another (equivalently ~0.02 ms/partition).
Per-broker cap formula
partitions/broker ≤ 100 × (#brokers) × RF, the published heuristic that keeps single-broker replication latency in a safe band given the figure above.

So for a 6-broker cluster at RF=3 the cap evaluates to 100 × 6 brokers × 3 ≈ 1,800 partitions/broker. Plan cluster egress at ≥ 2× ingress (a planning ratio, reference §1) because each written byte is re-fetched by RF−1 followers and read by consumer fan-out.

4. Controller / metadata load and failover time

The controller tracks every partition's leader, ISR, and assignment in the metadata log (Part I 11 · KRaft Controller, 12 · Metadata Propagation). The two cost constants here are empirical, ZooKeeper-era, 2015-hardware figures (Jun Rao; reference §2, teach the mechanism, not the millisecond; do not predict 2026 behavior):

Leader election
~5 ms per partition to elect a new leader on the surviving brokers.
Controller metadata init
~2 ms per partition to reload metadata when a ZooKeeper-era controller fails over.

These scale linearly with the partition count on the affected broker/controller, so on unclean loss the unavailability windows are products with units cancelling to time:

Broker with ~1,000 leaderships
1,000 leaders × 5 ms/leader = 5,000 ms ≈ 5 s to re-elect.
ZK controller failover, ~10,000 partitions
10,000 partitions × 2 ms/partition = 20,000 ms ≈ 20 s of metadata-init unavailability.

KRaft fundamentally changes this: the controller holds metadata in memory and replicates it as a log, so a standby controller takes over without a reload pass (KIP-500), it removes the ~2 ms/partition controller-init term entirely. That is what lifts the ceiling from ZK-era ~200,000/cluster toward the KRaft target of "a million partitions or more" (Confluent lab: 2M; empirical reference §2).

Gotcha: "KRaft means partitions are free" is wrong

KRaft removes the controller-failover term, not the per-broker cost. Real per-broker counts are still gated by FDs, mmap regions (vm.max_map_count), heap, fetcher overhead, and rebalance time. And critically: throughput does not scale with partitions, peak producer rate occurred around ~100 partitions and end-to-end latency rose sharply past ~1000 partitions in Instaclustr's test (Kafka 3.1.1, acks=all; ZooKeeper and KRaft were identical here, empirical). Over-partitioning to chase throughput is a documented anti-pattern.

5. Rebalance time

When group membership changes, the coordinator must recompute and redistribute the assignment across all subscribed partitions (Part I 13 · Group Coordination). More partitions = more to move and longer rebalances; this is one more reason the partition count is a cost, not just a capability. The newer protocol (KIP-848) and cooperative/sticky assignors reduce stop-the-world rebalances, but the work still scales with partition count.

CostTypeMechanism / sourceMitigation
Open FDsEmergent (OS)3 files/segment; many segments/partitionnofile ≥ 100,000
mmap regionsHard (OS)~2/segment via MappedByteBuffer, AbstractIndex.java:72vm.max_map_count ≥ 262144
Producer memorySoft + emergentDeque per partition, RecordAccumulator.java:318; buffer.memory default 32 MBraise buffer.memory; ~tens of KB/partition
Replication fan-outEmergentRF × fetcher work; +20 ms / 1000 partitions (empirical)cap ≈ 100 × brokers × RF
Failover latencyEmergentleader election ~5 ms/partition ZK-era; near-zero on KRaft standby (KIP-500)use KRaft; bound leaders/broker
Rebalance timeEmergentassignment work ∝ partitions, RangeAssignor.java:43KIP-848, cooperative assignor
Latency past ~1000Emergente2e latency degrades (empirical, Instaclustr)do not over-partition

The defaults, and why they are deliberately tiny

ConfigDefaultSourceType
num.partitions1server-common/src/main/java/org/apache/kafka/server/config/ServerLogConfigs.java:36Soft (config)
default.replication.factor1server/src/main/java/org/apache/kafka/server/config/ReplicationConfigs.java:42Soft (config)
min.insync.replicas1server-common/src/main/java/org/apache/kafka/server/config/ServerLogConfigs.java:155Soft (config)
controller op size10,000 records/opDEFAULT_MAX_RECORDS_PER_BATCH, metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:180Hard (constant)
Warning: these are auto-create defaults, not production defaults

num.partitions=1, default.replication.factor=1, and min.insync.replicas=1 exist so a single-broker dev box works out of the box, they are not a recommendation. A topic auto-created at these defaults has one partition (no parallelism) and one replica (any broker loss = data loss). In production, disable auto-creation and create topics explicitly with a deliberate partition count, RF=3, and min.insync.replicas=2. The min.insync.replicas=2 + acks=all + RF=3 triad is the durability baseline, see Part I 08 · Replication and Part II op06 · Durability.

The 10,000-record ceiling on a single controller operation (QuorumController.java:180) is a real hard limit you can hit: a single CreatePartitions or CreateTopics that would emit more than 10,000 metadata records (each new partition is a record) is rejected. Practically, you cannot create a topic with tens of thousands of partitions in one call, and bulk topic creation must be chunked.

When to reshard: 2 → 3 → 5

Increasing partition count (you cannot decrease, see below) is justified by a small set of trigger signals. The test is whether you are parallelism-bound, not whether the topic is merely busy.

symptom: consumer lag rising
group size == partition count?
add consumers first
(do NOT reshard)
per-partition throughput saturated?
RESHARD: increase partitions (or new topic if keyed)
Reshard only when the group is already at one-consumer-per-partition AND per-partition throughput is saturated. If you have fewer consumers than partitions, add consumers, that is free parallelism you already provisioned.
consumer/lag signal · throughput check · wrong fix · reshard action · decisions rounded.

Reshard triggers (all should hold):

  • Sustained consumer lag that grows over a window (not a transient spike), see Part II op08 · Metrics & Signals for lag-on-trend alerting.
  • The group is already at one consumer per partition, you have spent all the parallelism the current count allows. If not, add consumers; you provisioned that headroom already.
  • Per-partition throughput is saturated, the binding stage (broker disk, network, or your processing) is at its ceiling for a single partition.
Warning: adding partitions never fixes a hot key

If lag is concentrated on one partition because a single key dominates traffic, more partitions does nothing, that key still hashes to one partition via murmur2(key) % numPartitions. The fix is to redesign or salt the key (only where ordering allows), or use a custom partitioner, never to reshard. Cloudflare hit exactly this when a client-library abstraction funneled most messages to one partition (empirical). See Part II op07 · Failure Modes.

The repartitioning trap

This is the part that bites operators, and it is enforced in two hard mechanisms in the source.

You cannot decrease partitions, enforced in the controller

ReplicationControlManager.createPartitions rejects any request whose target count is not strictly greater than the current count:

RequestResultSource
target == currentInvalidPartitionsException "Topic already has N partition(s)."ReplicationControlManager.java:1917
target < currentInvalidPartitionsException "... would not be an increase."ReplicationControlManager.java:1920
target > currentappend (target − current) new partitions, ids starting at startPartitionId = topicInfo.parts.size()ReplicationControlManager.java:1925, :1953

The "decrease" path simply does not exist; the API only ever appends partition ids onto the end of the existing range. There is no truncation, no merge, no rebalancing of existing data across the new layout. This is by design, existing offsets, the leaders, and every committed record live in partitions 0..current−1 and cannot be relocated without rewriting consumer offsets globally.

Increasing partitions on a keyed topic breaks ordering and co-partitioning

Because the producer maps keys with toPositive(murmur2(serializedKey)) % numPartitions (clients/src/main/java/org/apache/kafka/clients/producer/internals/BuiltInPartitioner.java:398), the moment numPartitions changes, the same key routes to a different partition, the modulus shifts the entire mapping. To make this followable, let h = toPositive(murmur2(serializedKey)) be the key's (fixed) hash; only the divisor changes. Real murmur2 output is a large opaque integer, so the value below is an illustrative hash chosen so the arithmetic is checkable, substitute your own key's hash. Take h = 9:

Before (4 partitions)
9 % 4 = 1 → the key lives in partition 1.
After resize to 6 partitions
9 % 6 = 3 → the same key now routes to partition 3, while its old records remain in partition 1.

(The hash h is unchanged; only the divisor went 4 → 6, and that alone relocates the key. Note: the reference's literal example "7654321 % 6 = 3" does not actually compute, 7654321 % 6 = 1, and Kafka hashes the key bytes through murmur2 first rather than the integer, so the schematic h = 9 above is used to demonstrate the same remap with arithmetic that checks out.)

key K, h=9, before
partition 1
9 % 6 = 3
partition 3
With illustrative hash h = toPositive(murmur2(K)) = 9: the same key K moves from partition 1 (9 % 4) to partition 3 (9 % 6) the instant the count goes 4→6. New records for K append to partition 3; old records stay in partition 1, per-key order is now split across two partitions with no cross-partition ordering.
producer/key · partition/log · remap break · the resize event.

The consequences are concrete and unrecoverable in place:

  • Per-key ordering breaks across the resize boundary. A consumer reading partition 3 sees only the post-resize records for K; the pre-resize records sit in partition 1 and there is no ordering between the two partitions.
  • Co-partitioned joins break. Two topics joined by key rely on identical key→partition mapping. Resize one and they no longer co-partition, Kafka Streams joins and the assignor's co-partitioning guarantee (Part I 20 · Kafka Streams) silently produce wrong results.
  • Stateful processing breaks. Kafka Streams state stores are partitioned; when a key moves partitions, "the associated state does not follow them", the local state for K is on the task that owned the old partition (empirical; Confluent docs).

The accepted fix: a new topic at the target count, then migrate

Because in-place resize is destructive for keyed topics, the production pattern is to create a new topic at the target partition count and migrate:

producerstopic-old (N)topic-new (M)consumers
1. dual-write to new
2. replay/copy backlog old → new
3. drain old, switch to new
4. stop writing old, retire
Migration sequence: producers dual-write, the backlog is re-produced into the new topic (re-keyed by the new modulus), consumers drain then cut over, and the old topic is retired.
client · topic/log · data move · retire/cutover.
  1. Dual-write: producers begin writing to the new topic (and optionally keep writing the old during cutover).
  2. Backfill: copy/replay the old topic's retained records into the new topic so history is preserved under the new key mapping (re-keying happens automatically as records are re-produced through the partitioner).
  3. Drain & switch: consumers finish the old topic, then move to the new topic.
  4. Retire: stop writing the old topic and delete it once consumers are clear.
Rule of thumb: provision for growth, but respect the standing cost

Because resize is irreversible and breaks keyed ordering, over-partition modestly up front (Confluent's explicit advice; empirical), but "modestly" is the operative word. Each partition costs FDs, ~2 mmap regions, producer memory, replication fan-out, and rebalance time on every broker, and throughput does not improve past ~100 partitions per topic. A practical target: size with the max(T/Tp, T/Tc) formula, then multiply by a growth factor of ~1.5–2× (a planning heuristic, reference §2 / Confluent "over-partition modestly") to cover the next year, not 10×. Keep total partitions per broker well under the vm.max_map_count ÷ 2 mmap-per-partition ceiling derived above (~32,765 on default Linux) and ideally in the low thousands. For unkeyed topics, resize is cheap and safe, so you can size leaner and grow on demand. For keyed topics, treat the partition count as near-immutable and size it once, carefully.

Decision checklist

QuestionAnswer / rule
Starting partition count?max(T/Tp, T/Tc) with Tp≈10 MB/s (empirical planning figure), Tc=measured app rate, both same unit as T; e.g. max(100÷10, 100÷5)=20. Then × ~1.5–2 growth headroom (heuristic)
Is this topic keyed?Yes → count is near-immutable; size once. No → resize is safe; size leaner
Lag rising, reshard?Only if group == partition count AND per-partition saturated AND not a hot key
Hot single key?Redesign/salt the key, never add partitions
Can I reduce partitions?No, InvalidPartitionsException, ReplicationControlManager.java:1920
Increase on keyed topic?New topic at target count + migrate (dual-write, backfill, switch)
Per-broker partition ceiling?Hard: vm.max_map_count ÷ 2 mmap/partition = 65,530 ÷ 2 ≈ 32,765 default Linux; raise to ≥262,144. Keep counts low thousands
Throughput not improving with more partitions?Expected past ~100/topic; latency degrades past ~1000 (empirical), you are over-partitioned

Continue to Part II op04 · Capacity Planning for broker counts and cluster sizing, op06 · Durability for the replication/ISR settings that pair with these partitions, and op08 · Metrics & Signals for the lag and under-replication signals that drive reshard decisions.

krivaltsevich.com · Part of Apache Kafka Internals · derived from Apache Kafka 4.4 source · GitHub · MIT-licensed.

Apache Kafka® is a registered trademark of the Apache Software Foundation. This is an independent, unofficial guide, not affiliated with or endorsed by the ASF.