01 · The Record Format & Batches
Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Derived from source code, not copied from official documentation.
Every byte Kafka stores on disk, ships over the wire, or replicates between brokers is framed by the record batch format. This chapter dissects the v2 batch (DefaultRecordBatch) field by field, the variable-length zig-zag encoding of the inner records (DefaultRecord), the legacy v0/v1 formats that v2 superseded, the control records that carry transaction markers and KRaft quorum events, the compression codecs, the CRC-32C checksum, and the builder/reader machinery (MemoryRecordsBuilder, FileLogInputStream) that produces and consumes these bytes. The batch header is also where idempotence and transactions live: the producer id, epoch and base sequence are header fields.
Role & responsibilities
The record format is the lingua franca of Kafka. The same bytes flow unchanged through the producer's accumulator, the Produce RPC, the broker's log segments, the Fetch RPC, the replica fetchers, tiered storage, and the consumer. Keeping one canonical layout, and never re-serializing it on the broker's hot path, is what enables Kafka's zero-copy sendfile transfer from page cache to socket. Concretely the format is responsible for:
- Framing: delimiting batches so a reader can walk a log file or a network buffer batch-by-batch without parsing record contents.
- Integrity: a CRC-32C over each batch detects corruption on disk or in flight.
- Compaction: preserving the first and last offset/sequence of a batch even when its records are removed, so producer state can be rebuilt.
- Idempotence & transactions: the producer id (PID), epoch and base sequence ride in the batch header (see Transactions & Exactly-Once Semantics).
- Compression: the record-payload region (everything after the header) may be compressed as one unit.
- Control signalling: transaction commit/abort markers and KRaft leader-change / snapshot / voters events travel as control batches.
Where it lives in the code
The format lives in the org.apache.kafka.common.record.internal package of the clients module (note the internal sub-package; the classes were moved there in 4.x). Compression codecs live in org.apache.kafka.common.compress.
| Class / interface | File | Responsibility |
|---|---|---|
| RecordBatch | clients/src/main/java/org/apache/kafka/common/record/internal/RecordBatch.java | The batch interface; magic constants, sentinel values (NO_PRODUCER_ID, NO_SEQUENCE…). |
| DefaultRecordBatch | .../internal/DefaultRecordBatch.java | v2 batch: header field offsets, attributes bitfield, CRC, iteration. |
| DefaultRecord | .../internal/DefaultRecord.java | v2 inner record: varint/varlong encode & decode. |
| AbstractLegacyRecordBatch / LegacyRecord | .../internal/AbstractLegacyRecordBatch.java, LegacyRecord.java | v0/v1 message format (one record = one "batch", or a compressed wrapper). |
| MemoryRecords / MemoryRecordsBuilder | .../internal/MemoryRecords.java, MemoryRecordsBuilder.java | In-memory representation and the write path that builds batches. |
| FileRecords / FileLogInputStream | .../internal/FileRecords.java, FileLogInputStream.java | On-disk representation; lazy batch iteration over a FileChannel. |
| ControlRecordType / EndTransactionMarker | .../internal/ControlRecordType.java, EndTransactionMarker.java | Control record key/value schemas. |
| CompressionType / Compression | .../internal/CompressionType.java, compress/Compression.java | Codec enum (id + name) and codec factory/streams. |
| Crc32C / ByteUtils | .../utils/internals/Crc32C.java, ByteUtils.java | Castagnoli CRC; varint/varlong/zig-zag primitives. |
Core concepts & terminology
- Record: A single key/value/timestamp/headers tuple. The smallest addressable unit; each has an offset.
- Record batch: A container for one or more records sharing a header. In v2 the unit of compression, CRC, and the carrier of producer/transaction metadata. Also called a "message set" historically.
- Magic: One byte naming the format version: 0, 1, or 2. RecordBatch.CURRENT_MAGIC_VALUE = MAGIC_VALUE_V2 = 2 RecordBatch.java:39-46.
- Log overhead: The fixed 12-byte (baseOffset:int64, length:int32) prefix shared by all formats: OFFSET_LENGTH(8) + SIZE_LENGTH(4) = LOG_OVERHEAD = 12 Records.java:45-49. A reader can always read 12 bytes to learn how big the next batch is.
- Base / delta: v2 stores absolute baseOffset and baseTimestamp once in the header; each record stores only an offset delta and timestamp delta from those bases, which compress to tiny varints.
- Control batch: A batch with the control attribute bit set. Its records carry a typed control key rather than user data; never compacted, never delivered to ordinary consumers.
In v2 the batch is the heavyweight, self-describing unit (header, CRC, compression, PID/epoch/sequence), and the record is a maximally-compact delta-encoded body. A single produced batch typically holds many records, even when uncompressed, unlike v0/v1 where an uncompressed "batch" was always exactly one record.
Data structure: the v2 record batch (on disk / on wire)
The header is a fixed 61-byte structure followed by the records region. Field offsets are compile-time constants at the top of DefaultRecordBatch DefaultRecordBatch.java:104-131; the value RECORD_BATCH_OVERHEAD = RECORDS_OFFSET = 61. All multi-byte integers are big-endian (Java ByteBuffer default).
[Figure: v2 record batch header layout. partitionLeaderEpoch (@12) lies outside the CRC, so it can be re-stamped without recomputing the checksum.]
Notable design choices visible directly in the code:
- CRC placement & coverage. CRC_OFFSET = 17 sits after the magic byte, and the checksum covers from ATTRIBUTES_OFFSET (21) to the end: Crc32C.compute(buffer, ATTRIBUTES_OFFSET, buffer.limit() − ATTRIBUTES_OFFSET) DefaultRecordBatch.java:399-401. A reader must parse the magic byte before knowing how to interpret bytes 17–20. Crucially, partitionLeaderEpoch (bytes 12–15) is excluded from the CRC, so the broker can stamp the leader epoch into an incoming batch without recomputing the checksum DefaultRecordBatch.java:67-71, 386-388.
- sizeInBytes() = LOG_OVERHEAD + batchLength = 12 + buffer.getInt(LENGTH_OFFSET) DefaultRecordBatch.java:222-225. So batchLength is everything after itself.
- Second attributes byte is reserved. attributes() reads the 16-bit field but casts to byte: "note we're not using the second byte of attributes" DefaultRecordBatch.java:403-406.
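To make the header layout concrete, here is a minimal sketch of a header reader. It is not Kafka's DefaultRecordBatch: the class name V2HeaderSketch is hypothetical, the constant names other than those quoted in the text are assumptions, and the field order between the anchor points given above (12, 16, 17, 21, 61) follows the published v2 batch schema. It assumes the batch starts at index 0 of the buffer.

```java
import java.nio.ByteBuffer;

/** Illustrative accessor for the 61-byte v2 batch header; not Kafka's own class. */
final class V2HeaderSketch {
    static final int BASE_OFFSET_OFFSET = 0;             // int64
    static final int LENGTH_OFFSET = 8;                  // int32 batchLength
    static final int PARTITION_LEADER_EPOCH_OFFSET = 12; // int32, outside the CRC
    static final int MAGIC_OFFSET = 16;                  // int8
    static final int CRC_OFFSET = 17;                    // uint32
    static final int ATTRIBUTES_OFFSET = 21;             // int16, CRC coverage starts here
    static final int LAST_OFFSET_DELTA_OFFSET = 23;      // int32
    static final int BASE_TIMESTAMP_OFFSET = 27;         // int64
    static final int MAX_TIMESTAMP_OFFSET = 35;          // int64
    static final int PRODUCER_ID_OFFSET = 43;            // int64
    static final int PRODUCER_EPOCH_OFFSET = 51;         // int16
    static final int BASE_SEQUENCE_OFFSET = 53;          // int32
    static final int RECORDS_COUNT_OFFSET = 57;          // int32
    static final int RECORDS_OFFSET = 61;                // first byte of the records region
    static final int LOG_OVERHEAD = 12;                  // baseOffset + batchLength

    // All reads are absolute and big-endian (the ByteBuffer default).
    static long baseOffset(ByteBuffer b)  { return b.getLong(BASE_OFFSET_OFFSET); }
    static int sizeInBytes(ByteBuffer b)  { return LOG_OVERHEAD + b.getInt(LENGTH_OFFSET); }
    static byte magic(ByteBuffer b)       { return b.get(MAGIC_OFFSET); }
    static long crc(ByteBuffer b)         { return b.getInt(CRC_OFFSET) & 0xFFFFFFFFL; }
    static long lastOffset(ByteBuffer b)  { return baseOffset(b) + b.getInt(LAST_OFFSET_DELTA_OFFSET); }
    static long producerId(ByteBuffer b)  { return b.getLong(PRODUCER_ID_OFFSET); }
    static int recordsCount(ByteBuffer b) { return b.getInt(RECORDS_COUNT_OFFSET); }
}
```

Because every accessor is an absolute read at a fixed offset, none of them needs to touch, or decompress, the records region.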
The attributes bitfield
The low byte of the 16-bit attributes field packs five flags; bits 7–15 are unused DefaultRecordBatch.java:99-101, 133-137.
| Bits | Mask | Meaning | Accessor |
|---|---|---|---|
| 0–2 | 0x07 COMPRESSION_CODEC_MASK | Compression codec id (0 none, 1 gzip, 2 snappy, 3 lz4, 4 zstd) | compressionType() → CompressionType.forId(attrs & 0x07) |
| 3 | 0x08 TIMESTAMP_TYPE_MASK | 0 = CREATE_TIME, 1 = LOG_APPEND_TIME | timestampType() |
| 4 | 0x10 TRANSACTIONAL_FLAG_MASK | Batch is part of a transaction | isTransactional() |
| 5 | 0x20 CONTROL_FLAG_MASK | Control batch (markers / KRaft events) | isControlBatch() |
| 6 | 0x40 DELETE_HORIZON_FLAG_MASK | If set, baseTimestamp is the tombstone delete horizon | deleteHorizonMs() |
| 7–15 | - | Unused / reserved | - |
Attributes are assembled by computeAttributes(...) DefaultRecordBatch.java:424-440, which insists a timestamp type be supplied for v2 (it throws on NO_TIMESTAMP_TYPE). The delete-horizon bit (KIP-534) lets the cleaner record when tombstones and aborted-txn markers become eligible for removal: when set, deleteHorizonMs() returns baseTimestamp reinterpreted as that horizon DefaultRecordBatch.java:251-261.
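The bitfield in the table can be decoded with plain masking. The sketch below is illustrative only: the class BatchAttributesSketch and its method names are hypothetical, and the compute helper is a simplified stand-in for computeAttributes(...), not its actual signature.

```java
/** Illustrative encoder/decoder for the low attributes byte of a v2 batch. */
final class BatchAttributesSketch {
    static final int COMPRESSION_CODEC_MASK = 0x07;   // bits 0-2
    static final int TIMESTAMP_TYPE_MASK = 0x08;      // bit 3
    static final int TRANSACTIONAL_FLAG_MASK = 0x10;  // bit 4
    static final int CONTROL_FLAG_MASK = 0x20;        // bit 5
    static final int DELETE_HORIZON_FLAG_MASK = 0x40; // bit 6

    static int compressionCodecId(short attrs)  { return attrs & COMPRESSION_CODEC_MASK; }
    static boolean isLogAppendTime(short attrs) { return (attrs & TIMESTAMP_TYPE_MASK) != 0; }
    static boolean isTransactional(short attrs) { return (attrs & TRANSACTIONAL_FLAG_MASK) != 0; }
    static boolean isControlBatch(short attrs)  { return (attrs & CONTROL_FLAG_MASK) != 0; }
    static boolean hasDeleteHorizon(short attrs) { return (attrs & DELETE_HORIZON_FLAG_MASK) != 0; }

    /** Packs the five fields into the 16-bit attributes value; the high byte stays zero. */
    static short compute(int codecId, boolean logAppendTime, boolean transactional,
                         boolean control, boolean deleteHorizon) {
        if (codecId < 0 || codecId > COMPRESSION_CODEC_MASK)
            throw new IllegalArgumentException("codec id does not fit in 3 bits: " + codecId);
        int attrs = codecId;
        if (logAppendTime) attrs |= TIMESTAMP_TYPE_MASK;
        if (transactional) attrs |= TRANSACTIONAL_FLAG_MASK;
        if (control) attrs |= CONTROL_FLAG_MASK;
        if (deleteHorizon) attrs |= DELETE_HORIZON_FLAG_MASK;
        return (short) attrs;
    }
}
```

For example, a transactional zstd batch with CREATE_TIME timestamps packs to 0x14 (codec 4 plus bit 4).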
lastOffset() == baseOffset + lastOffsetDelta and lastSequence() == incrementSequence(baseSequence, lastOffsetDelta). These are computed, never stored, so re-stamping baseOffset on append (via setLastOffset, which writes offset − lastOffsetDelta DefaultRecordBatch.java:367-369) preserves both the offset and sequence ranges atomically.
Data structure: the v2 inner record
Each record after the header is a variable-length structure (DefaultRecord, schema at DefaultRecord.java:36-68). All lengths are signed zig-zag varints; a length of −1 means "null".
The fixed overhead of a record excluding key/value/headers is at most 21 bytes: MAX_RECORD_OVERHEAD = 21 = 5 (length) + 1 (attributes) + 10 (timestamp varlong) + 5 (offset varint) DefaultRecord.java:71-72. The actual size is far smaller in practice because deltas are tiny: a record whose offset delta is 0 and whose timestamp delta is 0 encodes those as a single byte each.
Varint / zig-zag encoding
Kafka uses the Protocol-Buffers base-128 varint, with zig-zag mapping so small-magnitude signed numbers (positive or negative) stay short. The mapping is purely arithmetic ByteUtils.java:474-487, 498-516:
// 32-bit: zz(n) = (n << 1) ^ (n >> 31)
// 64-bit: zz(n) = (n << 1) ^ (n >> 63)
// then emit 7 bits per byte, low byte first, high bit = "more bytes follow"
So 0→0, −1→1, 1→2, −2→3, … An int varint is 1–5 bytes; a long varlong is 1–10 bytes (64 bits at 7 bits per byte), which is why MAX_RECORD_OVERHEAD budgets 10 bytes for the timestamp. The encoded width is computed without a loop using a leading-zeros intrinsic ByteUtils.java:557-603. A null field is written as varint −1, which zig-zags to 1 and occupies a single byte; the size of that sentinel is cached as NULL_VARINT_SIZE_BYTES DefaultRecord.java:74.
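A compact sketch of the encoding follows. It is not Kafka's ByteUtils: the class ZigZagVarintSketch is hypothetical, and the size helpers use a simpler leading-zeros formula than the one in the source, chosen here for readability.

```java
import java.nio.ByteBuffer;

/** Illustrative zig-zag base-128 varint codec; a sketch, not Kafka's ByteUtils. */
final class ZigZagVarintSketch {
    static int zigZag32(int n)    { return (n << 1) ^ (n >> 31); }
    static long zigZag64(long n)  { return (n << 1) ^ (n >> 63); }
    static int unZigZag32(int z)  { return (z >>> 1) ^ -(z & 1); }
    static long unZigZag64(long z) { return (z >>> 1) ^ -(z & 1); }

    static void writeVarint(int value, ByteBuffer out) {
        int v = zigZag32(value);
        while ((v & 0xFFFFFF80) != 0) {           // more than 7 significant bits left
            out.put((byte) ((v & 0x7F) | 0x80));  // low 7 bits plus continuation flag
            v >>>= 7;
        }
        out.put((byte) v);
    }

    static int readVarint(ByteBuffer in) {
        int raw = 0;
        for (int shift = 0; shift < 35; shift += 7) {   // at most 5 bytes
            byte b = in.get();
            raw |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return unZigZag32(raw);
        }
        throw new IllegalArgumentException("varint is longer than 5 bytes");
    }

    static void writeVarlong(long value, ByteBuffer out) {
        long v = zigZag64(value);
        while ((v & ~0x7FL) != 0) {
            out.put((byte) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.put((byte) v);
    }

    static long readVarlong(ByteBuffer in) {
        long raw = 0;
        for (int shift = 0; shift < 70; shift += 7) {   // at most 10 bytes
            byte b = in.get();
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return unZigZag64(raw);
        }
        throw new IllegalArgumentException("varlong is longer than 10 bytes");
    }

    // Loop-free widths: count significant bits, then round up to 7-bit groups.
    static int sizeOfVarint(int value) {
        int bits = 32 - Integer.numberOfLeadingZeros(zigZag32(value) | 1);
        return (bits + 6) / 7;
    }

    static int sizeOfVarlong(long value) {
        int bits = 64 - Long.numberOfLeadingZeros(zigZag64(value) | 1L);
        return (bits + 6) / 7;
    }
}
```

The `| 1` in the size helpers makes zero count as one significant bit, so the value 0 still costs one byte.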
Delta encoding plus zig-zag varints is why v2 is dramatically more compact than v0/v1 for batched writes: the 8-byte absolute offset and 8-byte timestamp are stored once per batch; each additional record pays only a byte or two for its deltas, before the whole record region is (optionally) compressed.
Reading a record
DefaultRecord.readFrom reverses the layout DefaultRecord.java:306-360: read the body length, then attributes, then reconstruct timestamp = baseTimestamp + timestampDelta (overridden to logAppendTime if the batch is LOG_APPEND_TIME), offset = baseOffset + offsetDelta, and sequence = incrementSequence(baseSequence, offsetDelta) when a base sequence is present. After parsing it asserts the number of consumed bytes equals the declared body length, throwing InvalidRecordException on mismatch, a cheap structural integrity check independent of the batch CRC. A "partial" reader (readPartiallyFrom → PartialDefaultRecord) can skip key/value/header bytes when the caller only needs offsets/sizes (used by the log validator's skipKeyValueIterator) DefaultRecord.java:362-420.
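The decoding order can be sketched as follows. This is a simplified stand-in for DefaultRecord.readFrom, not its real signature: the class V2RecordSketch is hypothetical, headers are parsed but discarded, IllegalStateException stands in for InvalidRecordException, and the field order after the offset delta (key, value, header count) is taken from the published v2 record schema. Every varint is read through one varlong routine, which is byte-compatible for values in the int range.

```java
import java.nio.ByteBuffer;

/** Illustrative decoder for one v2 inner record; not Kafka's DefaultRecord. */
final class V2RecordSketch {
    final long offset;
    final long timestamp;
    final byte[] key;      // null when keyLength is -1
    final byte[] value;    // null (a tombstone) when valueLength is -1
    final int headerCount;

    V2RecordSketch(long offset, long timestamp, byte[] key, byte[] value, int headerCount) {
        this.offset = offset;
        this.timestamp = timestamp;
        this.key = key;
        this.value = value;
        this.headerCount = headerCount;
    }

    static V2RecordSketch readFrom(ByteBuffer buf, long baseOffset, long baseTimestamp) {
        int bodyLength = (int) readVarlong(buf);
        int bodyStart = buf.position();
        buf.get();                                          // record attributes (unused here)
        long timestamp = baseTimestamp + readVarlong(buf);  // timestampDelta
        long offset = baseOffset + (int) readVarlong(buf);  // offsetDelta
        byte[] key = readNullableBytes(buf);
        byte[] value = readNullableBytes(buf);
        int headerCount = (int) readVarlong(buf);
        if (headerCount < 0 || headerCount > buf.remaining())
            throw new IllegalStateException("corrupt header count " + headerCount);
        for (int i = 0; i < headerCount; i++) {
            readNullableBytes(buf);   // header key (a full reader rejects a null key)
            readNullableBytes(buf);   // header value, may be null
        }
        // Structural check: bytes consumed must equal the declared body length.
        if (buf.position() - bodyStart != bodyLength)
            throw new IllegalStateException("declared body length " + bodyLength + " was not honoured");
        return new V2RecordSketch(offset, timestamp, key, value, headerCount);
    }

    private static byte[] readNullableBytes(ByteBuffer buf) {
        int length = (int) readVarlong(buf);
        if (length < 0) return null;      // -1 encodes null, distinct from length 0
        byte[] bytes = new byte[length];
        buf.get(bytes);
        return bytes;
    }

    private static long readVarlong(ByteBuffer buf) {
        long raw = 0;
        for (int shift = 0; shift < 70; shift += 7) {
            byte b = buf.get();
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return (raw >>> 1) ^ -(raw & 1);  // undo zig-zag
        }
        throw new IllegalStateException("varlong longer than 10 bytes");
    }
}
```

A nine-byte record such as 10 00 04 02 01 04 68 69 00 decodes, against baseOffset 100 and baseTimestamp 1000, to offset 101, timestamp 1002, a null key and the value "hi".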
Legacy formats (v0 / v1) and why v2 exists
Before v2, the unit was a flat message (LegacyRecord) prefixed by the 12-byte log overhead. The header constants are in LegacyRecord LegacyRecord.java:45-74.
[Figure: legacy v0/v1 message layout; v1 adds a timestamp field.] Per-message CRC, magic and attributes (and timestamp in v1) repeat for every record; this is the overhead v2 amortizes. HEADER_SIZE_V0 = 6, HEADER_SIZE_V1 = 14.
The crucial structural difference: in v0/v1 an uncompressed message set is a sequence of standalone messages, one per offset; a compressed set is a single "wrapper" message whose value is itself a compressed message set of inner messages. AbstractLegacyRecordBatch models this duality by implementing both RecordBatch and Record: the wrapper is the batch, the inner messages are the records AbstractLegacyRecordBatch.java:45-56. Iterating a compressed v1 batch requires decompressing the whole wrapper to recover absolute offsets, since inner offsets were stored relative to the wrapper's last offset AbstractLegacyRecordBatch.java:315-413.
Legacy weaknesses that v2 fixes:
- Per-message CRC and absolute fields. Each v0/v1 message carries its own CRC, magic, attributes and full timestamp: large per-record overhead, plus a CRC that had to be recomputed whenever the broker re-assigned offsets. v2 amortizes one CRC and one timestamp/offset base across the batch.
- No record headers. v0/v1 have no header support; appending a header to a sub-v2 builder throws MemoryRecordsBuilder.java:468-469.
- No idempotence/transactions. There is nowhere to put a PID/epoch/sequence; legacy batches hard-code NO_PRODUCER_ID, NO_SEQUENCE, and isTransactional() == false AbstractLegacyRecordBatch.java:174-217.
- No efficient count or leader epoch. countOrNull() returns null for v0/v1 (the count is unknown without deep iteration); partitionLeaderEpoch() is −1 and cannot be set AbstractLegacyRecordBatch.java:159-161, 210-212, 499-501.
- No zstd. ZStandard is rejected for magic < 2 MemoryRecordsBuilder.java:115-116.
Although ZooKeeper and old inter-broker protocols are gone, the record format machinery still understands v0/v1 so the broker can read historical log segments and the LogValidator can convert between formats. LogValidator carries a toMagic target and up/down-converts incoming batches whose magic does not match storage/src/main/java/org/apache/kafka/storage/internals/log/LogValidator.java:75, 116, 128-129. New writes from modern clients are always v2.
Control batches & control records
A control batch sets attribute bit 5. Its records use a typed key: Key => Version:int16, Type:int16 ControlRecordType.java:29-43. The 16-bit version is a schema version (so the key can grow compatibly); the 16-bit type selects the meaning. Defined types ControlRecordType.java:43-57:
| Type id | Name | Purpose |
|---|---|---|
| 0 | ABORT | Transaction abort marker |
| 1 | COMMIT | Transaction commit marker |
| 2 | LEADER_CHANGE | KRaft: a new leader was elected for the metadata/Raft log |
| 3 / 4 | SNAPSHOT_HEADER / SNAPSHOT_FOOTER | KRaft snapshot delimiters |
| 5 / 6 | KRAFT_VERSION / KRAFT_VOTERS | KRaft membership / quorum version records |
| −1 | UNKNOWN | Type the client does not recognize → ignored |
Transaction markers are produced by the coordinator via appendEndTxnMarker, which requires a valid PID and the transactional flag MemoryRecordsBuilder.java:618-625. The marker's value is an EndTransactionMarker carrying the coordinator epoch, version-prefixed and serialized through the generated EndTxnMarker message EndTransactionMarker.java:41-47, 82-103. The KRaft control types (2–6) are appended by the corresponding append…Message helpers and are how the Raft log records leadership and membership changes MemoryRecordsBuilder.java:627-668; see KRaft Consensus and The KRaft Controller.
Control records are skipped by the log cleaner ("control records are not considered for compaction" ControlRecordType.java:39) and are filtered out before delivery to ordinary consumers. Ordinary clients may not write them: LogValidator rejects client-written control records LogValidator.java:476.
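The typed control key described in this section is small enough to parse by hand. The sketch below mirrors the documented Key => Version:int16, Type:int16 layout and the type ids from the table; the enum ControlTypeSketch and its method names are hypothetical, and IllegalArgumentException stands in for whatever exception the real parser throws.

```java
import java.nio.ByteBuffer;

/** Illustrative mapping of control record keys to types; not Kafka's ControlRecordType. */
enum ControlTypeSketch {
    ABORT(0), COMMIT(1), LEADER_CHANGE(2), SNAPSHOT_HEADER(3), SNAPSHOT_FOOTER(4),
    KRAFT_VERSION(5), KRAFT_VOTERS(6), UNKNOWN(-1);

    static final int KEY_SIZE = 4;   // int16 version + int16 type

    final short id;

    ControlTypeSketch(int id) {
        this.id = (short) id;
    }

    /** Unrecognised ids map to UNKNOWN so that newer record types can be ignored. */
    static ControlTypeSketch fromTypeId(short id) {
        for (ControlTypeSketch type : values()) {
            if (type != UNKNOWN && type.id == id) return type;
        }
        return UNKNOWN;
    }

    /** Reads the key without moving the buffer's position. */
    static ControlTypeSketch parseKey(ByteBuffer key) {
        if (key.remaining() < KEY_SIZE)
            throw new IllegalArgumentException("control key needs " + KEY_SIZE + " bytes, got " + key.remaining());
        short version = key.getShort(key.position());
        if (version < 0)
            throw new IllegalArgumentException("invalid control key version " + version);
        short typeId = key.getShort(key.position() + 2);
        return fromTypeId(typeId);
    }
}
```

Mapping unrecognised ids to UNKNOWN rather than throwing is what lets an older reader skip control types introduced later.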
Compression
In v2 the entire records region (bytes 61…end) is compressed as a single stream; the batch header is always plaintext, so a reader can inspect offsets, PID and CRC without decompressing. Codecs are an enum whose id fits in the 3-bit codec field (attribute bits 0–2) CompressionType.java:29-142:
| id | name | library | level range (default) | decompression buffer |
|---|---|---|---|---|
| 0 | none | - | - | - |
| 1 | gzip | JDK Deflater | BEST_SPEED…BEST_COMPRESSION, default DEFAULT_COMPRESSION | 16 KB |
| 2 | snappy | xerial snappy (native) | - | - |
| 3 | lz4 | lz4-java | 1…17 (default 9) | 2 KB intermediate |
| 4 | zstd | zstd-jni (native) | −131072…22 (default 3) | 16 KB |
Codec classes are loaded lazily so native libraries are only pulled in if the codec is actually used CompressionType.java:66-70. The Compression factory builds per-codec stream wrappers: wrapForOutput(ByteBufferOutputStream, magic) for the producer side and wrapForInput(ByteBuffer, magic, BufferSupplier) for readers compress/Compression.java:43-56. Decompression buffer sizes come from decompressionOutputSize() (gzip/zstd 16 KB compress/GzipCompression.java:80-83 compress/ZstdCompression.java:105-109; LZ4 uses a 2 KB intermediate compress/Lz4Compression.java:66-72). Kafka's LZ4 uses the LZ4 frame format with magic 0x184D2204 compress/Lz4BlockOutputStream.java:39.
The RecordBatch.streamingIterator javadoc warns that for small batches the per-call cost is dominated by allocating the decompression buffer ("64 KB for LZ4"), which is why a reusable BufferSupplier matters RecordBatch.java:234-240. That 64 KB refers to the LZ4 block size; the supplier-backed intermediate buffer Kafka requests is 2 KB. Either way: reuse the supplier on hot read paths.
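The split between a plaintext header and a single compressed records stream can be illustrated with the JDK's gzip classes. This is only a sketch of the byte layout: RecordsRegionGzipSketch is a hypothetical class, it does not use Kafka's Compression wrappers or BufferSupplier, and a real builder would also set the codec bits in the attributes, rewrite batchLength and recompute the CRC.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: keeps the 61-byte header plaintext and gzips the rest as one stream. */
final class RecordsRegionGzipSketch {
    static final int RECORDS_OFFSET = 61;

    /** Returns the unchanged header followed by the gzip-compressed records region. */
    static byte[] compressRecordsRegion(byte[] batch) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(batch, 0, RECORDS_OFFSET);                       // header copied verbatim
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(batch, RECORDS_OFFSET, batch.length - RECORDS_OFFSET);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses everything after the header, using a 16 KB input buffer. */
    static byte[] decompressRecordsRegion(byte[] compressedBatch) {
        ByteArrayInputStream in = new ByteArrayInputStream(
            compressedBatch, RECORDS_OFFSET, compressedBatch.length - RECORDS_OFFSET);
        try (GZIPInputStream gzip = new GZIPInputStream(in, 16 * 1024)) {
            return gzip.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the first 61 bytes are untouched, header-only questions (offset range, producer id, record count) never require running the decompressor.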
CRC-32C
v2 uses CRC-32C (Castagnoli polynomial) via the JDK's java.util.zip.CRC32C, wrapped by Crc32C utils/internals/Crc32C.java:41-59; on modern CPUs this maps to a hardware instruction. v0/v1 instead used plain CRC-32 (java.util.zip.CRC32) LegacyRecord.java:508-512. Validation: ensureValid() first checks the batch is at least RECORD_BATCH_OVERHEAD bytes, then compares the stored CRC with the recomputed one, throwing CorruptRecordException on a mismatch DefaultRecordBatch.java:151-159, 395-401.
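The coverage rule can be checked directly with the JDK class. The sketch below is not Kafka's Crc32C wrapper: BatchCrcSketch and its methods are hypothetical, it returns a boolean instead of throwing CorruptRecordException, and it assumes the batch occupies the buffer from index 0 to its limit.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

/** Illustrative CRC-32C handling for a v2 batch held in a buffer from index 0 to limit. */
final class BatchCrcSketch {
    static final int CRC_OFFSET = 17;
    static final int ATTRIBUTES_OFFSET = 21;
    static final int RECORD_BATCH_OVERHEAD = 61;

    /** Checksums bytes [ATTRIBUTES_OFFSET, limit), which excludes bytes 0-20. */
    static long computeChecksum(ByteBuffer batch) {
        ByteBuffer covered = batch.duplicate();   // independent position, shared content
        covered.position(ATTRIBUTES_OFFSET);
        CRC32C crc = new CRC32C();
        crc.update(covered);
        return crc.getValue();
    }

    static void writeChecksum(ByteBuffer batch) {
        batch.putInt(CRC_OFFSET, (int) computeChecksum(batch));
    }

    /** Size check first, then stored versus recomputed checksum. */
    static boolean isValid(ByteBuffer batch) {
        if (batch.limit() < RECORD_BATCH_OVERHEAD) return false;
        long stored = batch.getInt(CRC_OFFSET) & 0xFFFFFFFFL;   // read as unsigned 32-bit
        return stored == computeChecksum(batch);
    }
}
```

Since the covered range starts at byte 21, overwriting partitionLeaderEpoch at bytes 12–15 leaves isValid true, while changing any byte from 21 onward makes it false.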
Architecture & control/data flow
[Figure: batch data flow from producer to consumer, ending with writeTo straight to the socket.] The same bytes are reused; the broker rewrites only the header (offsets, leader epoch, CRC) when needed.
Building a batch (write path)
MemoryRecordsBuilder reserves the header region up front, then streams records (optionally through the compressor) into the buffer MemoryRecordsBuilder.java:138-147:
- The constructor positions the underlying ByteBufferOutputStream at initialPosition + batchHeaderSizeInBytes and wraps it with the codec's output stream. For v2, batchHeaderSizeInBytes = RECORD_BATCH_OVERHEAD (61); for legacy-compressed it is LOG_OVERHEAD + legacy record overhead AbstractRecords.java:153-161.
- Each append computes offsetDelta = offset − baseOffset and timestampDelta = timestamp − baseTimestamp, then DefaultRecord.writeTo emits the record; recordWritten updates numRecords, lastOffset, and tracks maxTimestamp MemoryRecordsBuilder.java:755-799. The first append seeds baseTimestamp (unless a delete horizon already set it) MemoryRecordsBuilder.java:471-472, 145-147.
- close() flushes the compressor, then backfills the header in place via writeDefaultBatchHeader → DefaultRecordBatch.writeHeader, which writes every field and computes the CRC over [attributes..end] last MemoryRecordsBuilder.java:361-386, 408-429 DefaultRecordBatch.java:460-500. The compression ratio is recorded for the accumulator's future size estimates.
Offsets must increase strictly monotonically within a builder (appendWithOffset throws otherwise), and a control record may only be appended to a control batch MemoryRecordsBuilder.java:455-469. Size accounting for "will the next record fit" is in hasRoomFor, which is deliberately conservative about compression and always admits the first record so a non-empty batch can always be formed even with batch.size = 0 MemoryRecordsBuilder.java:849-890.
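The reserve, append, backfill sequence can be sketched in a few dozen lines. This is not MemoryRecordsBuilder: BatchBuilderSketch is a hypothetical class that writes an uncompressed batch of records with null keys and no headers, uses −1 as an assumed "unset" sentinel for the leader epoch and producer fields, and takes the header offsets from the published v2 schema.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

/** Illustrative write path for an uncompressed v2 batch; not MemoryRecordsBuilder. */
final class BatchBuilderSketch {
    static final int HEADER_SIZE = 61;
    private final ByteBuffer buf;
    private final long baseOffset;
    private long baseTimestamp = -1L;
    private long maxTimestamp = -1L;
    private long lastOffset = -1L;
    private int numRecords = 0;

    BatchBuilderSketch(int capacity, long baseOffset) {
        this.buf = ByteBuffer.allocate(capacity);
        this.baseOffset = baseOffset;
        buf.position(HEADER_SIZE);                        // step 1: reserve the header region
    }

    /** Step 2: append one record (null key, no headers) as deltas from the bases. */
    void append(long offset, long timestamp, byte[] value) {
        if (numRecords > 0 && offset <= lastOffset)
            throw new IllegalArgumentException("offsets must increase strictly");
        if (numRecords == 0) baseTimestamp = timestamp;   // first append seeds the base
        ByteBuffer body = ByteBuffer.allocate(32 + value.length);
        body.put((byte) 0);                               // record attributes
        writeVarlong(timestamp - baseTimestamp, body);    // timestampDelta
        writeVarlong(offset - baseOffset, body);          // offsetDelta
        writeVarlong(-1, body);                           // null key
        writeVarlong(value.length, body);
        body.put(value);
        writeVarlong(0, body);                            // header count
        body.flip();
        writeVarlong(body.remaining(), buf);              // body length prefix
        buf.put(body);
        numRecords++;
        lastOffset = offset;
        maxTimestamp = Math.max(maxTimestamp, timestamp);
    }

    /** Step 3: backfill every header field in place, computing the CRC last. */
    ByteBuffer close() {
        int end = buf.position();
        buf.putLong(0, baseOffset);
        buf.putInt(8, end - 12);                          // batchLength: everything after itself
        buf.putInt(12, -1);                               // partitionLeaderEpoch, unset (assumed sentinel)
        buf.put(16, (byte) 2);                            // magic
        buf.putShort(21, (short) 0);                      // attributes: no codec, CREATE_TIME
        buf.putInt(23, (int) (lastOffset - baseOffset));  // lastOffsetDelta
        buf.putLong(27, baseTimestamp);
        buf.putLong(35, maxTimestamp);
        buf.putLong(43, -1L);                             // producerId, unset
        buf.putShort(51, (short) -1);                     // producerEpoch, unset
        buf.putInt(53, -1);                               // baseSequence, unset
        buf.putInt(57, numRecords);
        buf.flip();                                       // position 0, limit = end of batch
        ByteBuffer covered = buf.duplicate();
        covered.position(21);                             // CRC covers [attributes..end]
        CRC32C crc = new CRC32C();
        crc.update(covered);
        buf.putInt(17, (int) crc.getValue());
        return buf;
    }

    // Zig-zag varlong; byte-identical to a varint for values in the int range.
    private static void writeVarlong(long value, ByteBuffer out) {
        long v = (value << 1) ^ (value >> 63);
        while ((v & ~0x7FL) != 0) {
            out.put((byte) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.put((byte) v);
    }
}
```

Writing the CRC last matters because it covers the attributes, deltas, timestamps and record count, all of which are only known once appends have stopped.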
Reading from a file (read path)
FileLogInputStream.nextBatch() reads only the 17-byte header-up-to-magic (HEADER_SIZE_UP_TO_MAGIC = LOG_OVERHEAD + 4 + 1 Records.java:51-55), extracts offset, size and magic, sanity-checks the size against the V0 minimum, and constructs a FileChannelRecordBatch without copying the record data into the heap FileLogInputStream.java:63-94. The concrete batch lazily materializes either just the header (loadBatchHeader, reading headerSize() bytes) or the whole batch (loadFullBatch) on demand, caching the result FileLogInputStream.java:194-222. This lazy design is what lets a broker answer "what is the last offset / does this batch have a PID" cheaply, and lets writeTo stream batch bytes straight from the channel to a destination buffer for zero-copy fetch responses FileLogInputStream.java:177-188.
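The same framing logic works over any buffer of concatenated batches. The sketch below walks an in-memory ByteBuffer rather than a FileChannel, so it illustrates the technique only: BatchWalkerSketch and BatchRef are hypothetical names, and MIN_BATCH_LENGTH = 14 is an assumed value for the smallest legal v0 message body.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Illustrative batch-by-batch walk that reads only the first 17 bytes of each batch. */
final class BatchWalkerSketch {
    static final int LOG_OVERHEAD = 12;
    static final int MAGIC_OFFSET = 16;
    static final int HEADER_SIZE_UP_TO_MAGIC = LOG_OVERHEAD + 4 + 1;   // 17
    static final int MIN_BATCH_LENGTH = 14;   // assumption: minimal v0 message body

    /** Where a batch sits in the buffer, plus the three facts learned from its prefix. */
    static final class BatchRef {
        final int position;
        final long offset;
        final int sizeInBytes;
        final byte magic;

        BatchRef(int position, long offset, int sizeInBytes, byte magic) {
            this.position = position;
            this.offset = offset;
            this.sizeInBytes = sizeInBytes;
            this.magic = magic;
        }
    }

    static List<BatchRef> walk(ByteBuffer log) {
        List<BatchRef> batches = new ArrayList<>();
        int pos = log.position();
        while (log.limit() - pos >= HEADER_SIZE_UP_TO_MAGIC) {
            long offset = log.getLong(pos);               // absolute reads: nothing is copied
            int length = log.getInt(pos + 8);
            byte magic = log.get(pos + MAGIC_OFFSET);
            if (length < MIN_BATCH_LENGTH)
                throw new IllegalStateException("batch at " + pos + " declares length " + length);
            if (length > log.limit() - pos - LOG_OVERHEAD)
                break;                                    // trailing partial batch
            int sizeInBytes = LOG_OVERHEAD + length;
            batches.add(new BatchRef(pos, offset, sizeInBytes, magic));
            pos += sizeInBytes;
        }
        return batches;
    }
}
```

The walk never looks past byte 16 of any batch, which is the property that makes lazy, header-only iteration cheap.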
Sending over the wire
On the wire, records are framed identically to disk. DefaultRecordsSend simply delegates to records().writeTo(channel, …), allowing FileRecords to transfer bytes directly via the channel (the zero-copy path) without re-encoding DefaultRecordsSend.java:32-35. See Wire Protocol & RPC Framework and The Fetch Path.
Concurrency & threading
- Batches are not thread-safe. DefaultRecordBatch wraps a single ByteBuffer and reads/writes it without locks. Header mutators (setLastOffset, setPartitionLeaderEpoch, setMaxTimestamp) assume single-threaded access (the broker append path is serialized per partition; see The Log Storage Engine) DefaultRecordBatch.java:366-393.
- Builder lifecycle. MemoryRecordsBuilder is owned by one thread at a time. On the producer it is created and appended-to by the application threads under the accumulator's per-batch synchronization, then closeForRecordAppends() is called (releasing potentially large compression buffers) when appends stop but before the header is finalized; this matters because there can be a gap before the batch is drained and sent MemoryRecordsBuilder.java:45-47, 332-342. After close, only header fields may change; attempting to set producer state on a closed builder throws, to avoid introducing duplicates on retry MemoryRecordsBuilder.java:308-320. See The Producer Client.
- Iterator buffers. Decompressing iterators allocate intermediate buffers; the BufferSupplier passed to streamingIterator/skipKeyValueIterator is per-thread (a BufferSupplier is not thread-safe) and reused across batches on the broker's fetch/validation paths to avoid repeated 16 KB/2 KB allocations DefaultRecordBatch.java:279-364.
- Iterator must be closed. Compressed iterators hold a decompression stream; the plain iterator() eagerly drains the whole batch into a list precisely because it cannot guarantee the caller will close the stream, whereas streamingIterator defers decompression but obliges the caller to close DefaultRecordBatch.java:320-364.
Configuration reference
The format itself is fixed, but a few client knobs shape the batches produced. Defaults are from ProducerConfig.
| Key | Default | Effect on batches |
|---|---|---|
| compression.type | none | Codec written into attribute bits 0–2. ProducerConfig.java:244, 409 |
| batch.size | 16384 | Target bytes per batch; drives writeLimit/hasRoomFor. A value of 0 still admits one record per batch. ProducerConfig.java:82, 413 |
| compression.gzip.level | DEFAULT_COMPRESSION | gzip level (otherwise BEST_SPEED…BEST_COMPRESSION). ProducerConfig.java:250, 410 |
| compression.lz4.level | 9 | lz4 level 1–17. ProducerConfig.java:254; CompressionType.java:72-98 |
| compression.zstd.level | 3 | zstd level −131072…22. ProducerConfig.java:258; CompressionType.java:99-130 |
Idempotence/transaction header fields are not direct format knobs; they are populated from the producer's PID/epoch and per-partition sequence (enable.idempotence, transactional.id), see Transactions & Exactly-Once Semantics.
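As a small illustration, the batch-shaping keys from the table can be collected into a java.util.Properties object before being handed to a producer. The class name, the bootstrap address and the choice of zstd are placeholders for the example, not recommendations from the source.

```java
import java.util.Properties;

/** Illustrative producer settings that influence the batches written; values are examples. */
final class ProducerBatchConfigSketch {
    static Properties batchShapingProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");  // placeholder address
        props.setProperty("compression.type", "zstd");             // codec id 4 in attribute bits 0-2
        props.setProperty("compression.zstd.level", "3");          // the documented default level
        props.setProperty("batch.size", "16384");                  // target bytes per batch
        props.setProperty("enable.idempotence", "true");           // fills PID/epoch/sequence in the header
        return props;
    }
}
```

None of these keys changes the on-disk layout; they only decide which codec id, how many records and which producer fields end up in each header.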
Failure modes, edge cases & recovery
- Corruption. A bad CRC throws CorruptRecordException from ensureValid(); a too-small batch (< 61 bytes) is rejected before the CRC is even read DefaultRecordBatch.java:151-159.
- Truncated / over-declared batch. If recordsCount claims more records than the buffer holds, the uncompressed iterator throws "Incorrect declared batch size, premature EOF reached"; if records remain after reading the declared count, it throws "records still remaining" DefaultRecordBatch.java:299-318, 596-611. The per-record body-length cross-check in readFrom catches intra-record corruption independently of the batch CRC DefaultRecord.java:350-353.
- Empty batches after compaction. The cleaner may retain a header-only batch (recordsCount 0) to preserve the producer's first/last sequence so state can be rebuilt after a leader failover; such a batch persists until the producer writes a new sequence or its PID expires DefaultRecordBatch.java:73-85. A zero-count batch iterates as empty DefaultRecordBatch.java:320-323.
- Timestamp semantics under compaction. baseTimestamp is the first record's timestamp in most cases, −1 for an empty batch, or the delete horizon when bit 6 is set. For LOG_APPEND_TIME, maxTimestamp is broker-stamped and survives compaction; an empty batch retains its prior maxTimestamp DefaultRecordBatch.java:87-95.
- Sequence wraparound. Sequence numbers are 32-bit and wrap; incrementSequence/decrementSequence implement modular arithmetic around Integer.MAX_VALUE so lastSequence stays correct across the wrap boundary DefaultRecordBatch.java:558-568.
- Legacy edge cases. Decoding a compressed v1 wrapper rejects null inner values, inner compression, mismatched magic, and a wrapper offset less than the last inner offset (unless it is 0, tolerated for some librdkafka versions) AbstractLegacyRecordBatch.java:330-388.
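The wraparound arithmetic from the list above is easy to get wrong by one, so a sketch helps. The version below reflects one reading of the behaviour described: sequences run from 0 to Integer.MAX_VALUE and then continue at 0, with −1 reserved as "no sequence". The class SequenceSketch is hypothetical and the method bodies are a reconstruction, not a copy of the source.

```java
/** Illustrative sequence arithmetic with wraparound from Integer.MAX_VALUE back to 0. */
final class SequenceSketch {
    static final int NO_SEQUENCE = -1;

    /** Adds a non-negative increment, continuing at 0 after Integer.MAX_VALUE. */
    static int incrementSequence(int sequence, int increment) {
        if (sequence > Integer.MAX_VALUE - increment)
            return increment - (Integer.MAX_VALUE - sequence) - 1;
        return sequence + increment;
    }

    /** Inverse of incrementSequence: stepping back below 0 continues at Integer.MAX_VALUE. */
    static int decrementSequence(int sequence, int decrement) {
        if (sequence < decrement)
            return Integer.MAX_VALUE - (decrement - sequence) + 1;
        return sequence - decrement;
    }

    /** Derived, never stored: the last offset follows from the base and the delta. */
    static long lastOffset(long baseOffset, int lastOffsetDelta) {
        return baseOffset + lastOffsetDelta;
    }

    /** Derived the same way, except that a batch without sequences stays without one. */
    static int lastSequence(int baseSequence, int lastOffsetDelta) {
        if (baseSequence == NO_SEQUENCE) return NO_SEQUENCE;
        return incrementSequence(baseSequence, lastOffsetDelta);
    }
}
```

With this arithmetic a batch whose baseSequence is Integer.MAX_VALUE − 1 and whose lastOffsetDelta is 3 ends at sequence 1, and the next batch is expected to start at 2.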
Invariants & guarantees
First/last offsets and sequences are immutable under compaction. v2 preserves baseOffset, lastOffsetDelta, baseSequence and the derived last sequence even when inner records are removed, so a broker can rebuild producer state from the log and the next expected sequence stays in sync, preventing spurious OutOfOrderSequence errors after failover DefaultRecordBatch.java:73-80.
The CRC excludes the partition leader epoch. The broker can write partitionLeaderEpoch into a received batch without recomputing the checksum, because bytes 12–15 lie outside the CRC region DefaultRecordBatch.java:386-388, 399-401.
Offsets are strictly increasing within a batch. The builder enforces monotonicity at append time MemoryRecordsBuilder.java:461-463, and the log validator re-verifies the batch's reported offset range and count on the broker LogValidator.java:451-469.
Self-describing magic. Every batch begins with (offset, size) then a magic byte at a format-independent position (MAGIC_OFFSET = 16 Records.java:51-53), so a reader can always determine the format before interpreting the rest.
Interactions with other subsystems
- The Log Storage Engine & Log Management, Retention & Compaction: segments are sequences of these batches on disk; the cleaner manipulates batch headers (delete horizon, retained empty batches) directly.
- Replication, ISR & High Watermark & The Fetch Path: followers fetch and append the identical bytes; the leader stamps partitionLeaderEpoch into each batch.
- Transactions & Exactly-Once: PID/epoch/sequence header fields and the ABORT/COMMIT control records are the on-disk substrate of EOS.
- KRaft Consensus / The KRaft Controller: the metadata log is the same batch format; LEADER_CHANGE, SNAPSHOT_*, KRAFT_VERSION, KRAFT_VOTERS control records drive the quorum.
- Producer / Consumer: the producer's accumulator builds batches with MemoryRecordsBuilder; the consumer decodes them with the iterators here.
- Tiered Storage: remote segments store the same format (see RemoteLogInputStream).
Design rationale & evolution
The v2 batch was introduced by KIP-98 (exactly-once / idempotent producer), which needed a place for the PID, producer epoch and base sequence, and by the message-format overhaul that delta-encodes offsets/timestamps and moves the CRC to a per-batch, leader-epoch-excluding position. Record headers came with KIP-82. The delete-horizon attribute bit (so the cleaner can reason about when tombstones/markers expire) is KIP-534. CRC-32C replaced CRC-32 to exploit hardware checksum instructions. The lazy FileChannelRecordBatch and zero-copy RecordsSend exist so the broker never has to deserialize record bodies just to route bytes: the bytes a producer writes are the bytes a consumer reads.
The trajectory has been toward doing less per record on the broker: amortize integrity and metadata across a batch, keep the header parseable without decompression, and never re-encode payloads in the steady state. v0/v1 remain only as a compatibility/read path.
Gotchas / operational notes
baseOffset() on a v2 batch is O(1) (a header read), but on a legacy compressed batch it requires decompressing to find the first inner offset AbstractLegacyRecordBatch.java:138-141. Prefer lastOffset() when you only need a single offset and might be on a legacy path.
A null value (valueLength −1) is a tombstone for compacted topics, distinct from an empty value (valueLength 0). The distinction is preserved by the varint sentinel, not a separate flag DefaultRecord.java:191-197.
Header keys are required and non-null UTF-8; header values may be null. DefaultRecord.writeTo throws on a null headers array or a null header key DefaultRecord.java:199-219. The header count is a signed varint, and the reader rejects a negative count or a count larger than the remaining buffer as corruption DefaultRecord.java:338-342.
The maximum number of records per batch and the maximum offset delta are both bounded by Integer.MAX_VALUE (offset delta is a 32-bit field); the builder throws if either is exceeded MemoryRecordsBuilder.java:784-789.