01 · The Record Format & Batches

Source: Apache Kafka 4.4.0-SNAPSHOT (git 04bfe7d, 2026-06-15), KRaft mode. Derived from source code, not copied from official documentation.

Every byte Kafka stores on disk, ships over the wire, or replicates between brokers is framed by the record batch format. This chapter dissects the v2 batch (DefaultRecordBatch) field by field, the variable-length zig-zag encoding of the inner records (DefaultRecord), the legacy v0/v1 formats that v2 superseded, the control records that carry transaction markers and KRaft quorum events, the compression codecs, the CRC-32C checksum, and the builder/reader machinery (MemoryRecordsBuilder, FileLogInputStream) that produces and consumes these bytes. The batch header is also where idempotence and transactions live: the producer id, epoch and base sequence are header fields.

Role & responsibilities

The record format is the lingua franca of Kafka. The same bytes flow unchanged through the producer's accumulator, the Produce RPC, the broker's log segments, the Fetch RPC, the replica fetchers, tiered storage, and the consumer. Keeping one canonical layout, and never re-serializing it on the broker's hot path, is what enables Kafka's zero-copy sendfile transfer from page cache to socket. Concretely the format is responsible for:

Framing, delimiting batches so a reader can walk a log file or a network buffer batch-by-batch without parsing record contents.
Integrity, a CRC-32C over each batch detects corruption on disk or in flight.
Compaction, preserving the first and last offset/sequence of a batch even when its records are removed, so producer state can be rebuilt.
Idempotence & transactions, the producer id (PID), epoch and base sequence ride in the batch header (see Transactions & Exactly-Once Semantics).
Compression, the record-payload region (everything after the header) may be compressed as one unit.
Control signalling, transaction commit/abort markers and KRaft leader-change / snapshot / voters events travel as control batches.

Where it lives in the code

The format lives in the org.apache.kafka.common.record.internal package of the clients module (note the internal sub-package, it was moved there in 4.x). Compression codecs live in org.apache.kafka.common.compress.

Class / interface	File	Responsibility
`RecordBatch`	`clients/src/main/java/org/apache/kafka/common/record/internal/RecordBatch.java`	The batch interface; magic constants, sentinel values (`NO_PRODUCER_ID`, `NO_SEQUENCE`…).
`DefaultRecordBatch`	`.../internal/DefaultRecordBatch.java`	v2 batch: header field offsets, attributes bitfield, CRC, iteration.
`DefaultRecord`	`.../internal/DefaultRecord.java`	v2 inner record: varint/varlong encode & decode.
`AbstractLegacyRecordBatch` / `LegacyRecord`	`.../internal/AbstractLegacyRecordBatch.java`, `LegacyRecord.java`	v0/v1 message format (one record = one "batch", or a compressed wrapper).
`MemoryRecords` / `MemoryRecordsBuilder`	`.../internal/MemoryRecords.java`, `MemoryRecordsBuilder.java`	In-memory representation and the write path that builds batches.
`FileRecords` / `FileLogInputStream`	`.../internal/FileRecords.java`, `FileLogInputStream.java`	On-disk representation; lazy batch iteration over a `FileChannel`.
`ControlRecordType` / `EndTransactionMarker`	`.../internal/ControlRecordType.java`, `EndTransactionMarker.java`	Control record key/value schemas.
`CompressionType` / `Compression`	`.../internal/CompressionType.java`, `compress/Compression.java`	Codec enum (id + name) and codec factory/streams.
`Crc32C` / `ByteUtils`	`.../utils/internals/Crc32C.java`, `ByteUtils.java`	Castagnoli CRC; varint/varlong/zig-zag primitives.

Core concepts & terminology

Record: A single key/value/timestamp/headers tuple. The smallest addressable unit; each has an offset.
Record batch: A container for one or more records sharing a header. In v2 the unit of compression, CRC, and the carrier of producer/transaction metadata. Also called a "message set" historically.
Magic: One byte naming the format version: 0, 1, or 2. RecordBatch.CURRENT_MAGIC_VALUE = MAGIC_VALUE_V2 = 2 RecordBatch.java:39-46.
Log overhead: The fixed 12-byte (baseOffset:int64, length:int32) prefix shared by all formats: OFFSET_LENGTH(8) + SIZE_LENGTH(4) = LOG_OVERHEAD = 12 Records.java:45-49. A reader can always read 12 bytes to learn how big the next batch is.
Base / delta: v2 stores absolute baseOffset and baseTimestamp once in the header; each record stores only an offset delta and timestamp delta from those bases, which compress to tiny varints.
Control batch: A batch with the control attribute bit set. Its records carry a typed control key rather than user data; never compacted, never delivered to ordinary consumers.

Key idea

In v2 the batch is the heavyweight, self-describing unit (header, CRC, compression, PID/epoch/sequence), and the record is a maximally-compact delta-encoded body. A single produced batch typically holds many records, even when uncompressed, unlike v0/v1 where an uncompressed "batch" was always exactly one record.

Data structure: the v2 record batch (on disk / on wire)

The header is a fixed 61-byte structure followed by the records region. Field offsets are compile-time constants at the top of DefaultRecordBatch DefaultRecordBatch.java:104-131; the value RECORD_BATCH_OVERHEAD = RECORDS_OFFSET = 61. All multi-byte integers are big-endian (Java ByteBuffer default).

baseOffsetint64 · @0

batchLengthint32 · @8

partitionLeaderEpochint32 · @12 · not in CRC

magicint8 · @16 · = 2

crcuint32 · @17 · CRC-32C of [21..end]

attributesint16 · @21 · low byte only

lastOffsetDeltaint32 · @23

baseTimestampint64 · @27

maxTimestampint64 · @35

producerIdint64 · @43 · −1 if none

producerEpochint16 · @51 · −1 if none

baseSequenceint32 · @53 · −1 if none

recordsCountint32 · @57

records[Record]… · @61 · compressed as one blob

Fixed 61-byte header (RECORD_BATCH_OVERHEAD), then the records region. All multi-byte integers big-endian. The CRC at @17 covers bytes [21..end], so partitionLeaderEpoch (@12) can be re-stamped without recomputing it.

v2 DefaultRecordBatch header byte layout (offsets from DefaultRecordBatch.java:104-131)

header field (colours cycle to separate adjacent fields) --w = relative byte width int64 · @0 = type · byte offset records = variable-length region (compressed as one blob)

Notable design choices visible directly in the code:

CRC placement & coverage. CRC_OFFSET = 17 sits after the magic byte, and the checksum covers from ATTRIBUTES_OFFSET (21) to the end: Crc32C.compute(buffer, ATTRIBUTES_OFFSET, buffer.limit() − ATTRIBUTES_OFFSET) DefaultRecordBatch.java:399-401. A reader must parse the magic byte before knowing how to interpret bytes 17–20. Crucially, partitionLeaderEpoch (bytes 12–15) is excluded from the CRC, so the broker can stamp the leader epoch into an incoming batch without recomputing the checksum DefaultRecordBatch.java:67-71, 386-388.
sizeInBytes() = LOG_OVERHEAD + batchLength = 12 + buffer.getInt(LENGTH_OFFSET) DefaultRecordBatch.java:222-225. So batchLength is everything after itself.
Second attributes byte is reserved. attributes() reads the 16-bit field but casts to byte: "note we're not using the second byte of attributes" DefaultRecordBatch.java:403-406.

The attributes bitfield

The low byte of the 16-bit attributes field packs five flags; bits 7–15 are unused DefaultRecordBatch.java:99-101, 133-137.

Bits	Mask	Meaning	Accessor
0–2	`0x07` `COMPRESSION_CODEC_MASK`	Compression codec id (0 none, 1 gzip, 2 snappy, 3 lz4, 4 zstd)	`compressionType()` → `CompressionType.forId(attrs & 0x07)`
3	`0x08` `TIMESTAMP_TYPE_MASK`	0 = `CREATE_TIME`, 1 = `LOG_APPEND_TIME`	`timestampType()`
4	`0x10` `TRANSACTIONAL_FLAG_MASK`	Batch is part of a transaction	`isTransactional()`
5	`0x20` `CONTROL_FLAG_MASK`	Control batch (markers / KRaft events)	`isControlBatch()`
6	`0x40` `DELETE_HORIZON_FLAG_MASK`	If set, `baseTimestamp` is the tombstone delete horizon	`deleteHorizonMs()`
7–15	,	Unused / reserved	,

Attributes are assembled by computeAttributes(...) DefaultRecordBatch.java:424-440, which insists a timestamp type be supplied for v2 (it throws on NO_TIMESTAMP_TYPE). The delete-horizon bit (KIP-534) lets the cleaner record when tombstones and aborted-txn markers become eligible for removal: when set, deleteHorizonMs() returns baseTimestamp reinterpreted as that horizon DefaultRecordBatch.java:251-261.

Invariant

lastOffset() == baseOffset + lastOffsetDelta and lastSequence() == incrementSequence(baseSequence, lastOffsetDelta). These are computed, never stored, so re-stamping baseOffset on append (via setLastOffset, which writes offset − lastOffsetDelta DefaultRecordBatch.java:367-369) preserves both the offset and sequence ranges atomically.

Data structure: the v2 inner record

Each record after the header is a variable-length structure (DefaultRecord, schema at DefaultRecord.java:36-68). All lengths are signed zig-zag varints; a length of −1 means "null".

lengthvarint · body size, excludes self

attributesint8 · always 0 (reserved)

timestampDeltavarlong · ts − baseTimestamp

offsetDeltavarint · offset − baseOffset

keyLengthvarint · −1 ⇒ null key

keybytes

valueLengthvarint · −1 ⇒ tombstone

valuebytes

headerCountvarint

headers[ keyLen:varint, key:utf8, valLen:varint, val:bytes ] *

v2 DefaultRecord wire layout (DefaultRecord.java:36-68, writeTo at 168-223)

record field (colours cycle to separate adjacent fields) --w = relative byte width (approx, varints are 1–5/1–9 bytes) varint / varlong = zig-zag variable-length integer −1 length = null / tombstone sentinel

The fixed overhead of a record excluding key/value/headers is at most 21 bytes: MAX_RECORD_OVERHEAD = 21 = 5 (length) + 1 (attributes) + 10 (timestamp varlong) + 5 (offset varint) DefaultRecord.java:71-72. The actual size is far smaller in practice because deltas are tiny: a record whose offset delta is 0 and whose timestamp delta is 0 encodes those as a single byte each.

Varint / zig-zag encoding

Kafka uses the Protocol-Buffers base-128 varint, with zig-zag mapping so small-magnitude signed numbers (positive or negative) stay short. The mapping is purely arithmetic ByteUtils.java:474-487, 498-516:

// 32-bit:  zz(n) = (n << 1) ^ (n >> 31)
// 64-bit:  zz(n) = (n << 1) ^ (n >> 63)
// then emit 7 bits per byte, low byte first, high bit = "more bytes follow"

So 0→0, −1→1, 1→2, −2→3, … An int varint is 1–5 bytes; a long varlong is 1–9 bytes. The encoded width is computed without a loop using a leading-zeros intrinsic ByteUtils.java:557-603. A null field is written as varint −1, which zig-zags to 1 and occupies a single byte; the size of that sentinel is cached as NULL_VARINT_SIZE_BYTES DefaultRecord.java:74.

Key idea

Delta encoding plus zig-zag varints is why v2 is dramatically more compact than v0/v1 for batched writes: the 8-byte absolute offset and 8-byte timestamp are stored once per batch; each additional record pays only a byte or two for its deltas, before the whole record region is (optionally) compressed.

Reading a record

DefaultRecord.readFrom reverses the layout DefaultRecord.java:306-360: read the body length, then attributes, then reconstruct timestamp = baseTimestamp + timestampDelta (overridden to logAppendTime if the batch is LOG_APPEND_TIME), offset = baseOffset + offsetDelta, and sequence = incrementSequence(baseSequence, offsetDelta) when a base sequence is present. After parsing it asserts the number of consumed bytes equals the declared body length, throwing InvalidRecordException on mismatch, a cheap structural integrity check independent of the batch CRC. A "partial" reader (readPartiallyFrom → PartialDefaultRecord) can skip key/value/header bytes when the caller only needs offsets/sizes (used by the log validator's skipKeyValueIterator) DefaultRecord.java:362-420.

Legacy formats (v0 / v1) and why v2 exists

Before v2, the unit was a flat message (LegacyRecord) prefixed by the 12-byte log overhead. The header constants are in LegacyRecord LegacyRecord.java:45-74.

offsetint64 · @0 · log overhead

sizeint32 · @8 · log overhead

crcuint32 · @12 · CRC-32 (not 32C)

magicint8 · @16 · 0 or 1

attributesint8 · @17

timestampint64 · @18 · v1 only

keyLengthint32

keybytes

valueLengthint32

valuebytes

v1 message shown. v0 omits the 8-byte timestamp field. Per-message CRC, magic and attributes (and timestamp in v1) repeat for every record, the overhead v2 amortizes. HEADER_SIZE_V0 = 6, HEADER_SIZE_V1 = 14.

v0/v1 layout. RECORD_OVERHEAD_V0 = 14, RECORD_OVERHEAD_V1 = 22 (LegacyRecord.java:63-74)

message field (colours cycle to separate adjacent fields) --w = relative byte width int64 · @0 = type · byte offset crc = CRC-32 (plain, not 32C) · per message v1 only = field absent in v0

The crucial structural difference: in v0/v1 an uncompressed message set is a sequence of standalone messages, one per offset; a compressed set is a single "wrapper" message whose value is itself a compressed message set of inner messages. AbstractLegacyRecordBatch models this duality by implementing both RecordBatch and Record, the wrapper is the batch, the inner messages are the records AbstractLegacyRecordBatch.java:45-56. Iterating a compressed v1 batch requires decompressing the whole wrapper to recover absolute offsets, since inner offsets were stored relative to the wrapper's last offset AbstractLegacyRecordBatch.java:315-413.

Legacy weaknesses that v2 fixes:

Per-message CRC and absolute fields. Each v0/v1 message carries its own CRC, magic, attributes and full timestamp, large per-record overhead and a CRC that had to be recomputed whenever the broker re-assigned offsets. v2 amortizes one CRC and one timestamp/offset base across the batch.
No record headers. v0/v1 have no header support; appending a header to a sub-v2 builder throws MemoryRecordsBuilder.java:468-469.
No idempotence/transactions. There is nowhere to put a PID/epoch/sequence; legacy batches hard-code NO_PRODUCER_ID, NO_SEQUENCE, and isTransactional() == false AbstractLegacyRecordBatch.java:174-217.
No efficient count or leader epoch. countOrNull() returns null for v0/v1 (the count is unknown without deep iteration); partitionLeaderEpoch() is −1 and cannot be set AbstractLegacyRecordBatch.java:159-161, 210-212, 499-501.
No zstd. ZStandard is rejected for magic < 2 MemoryRecordsBuilder.java:115-116.

Note

Although ZooKeeper and old inter-broker protocols are gone, the record format machinery still understands v0/v1 so the broker can read historical log segments and the LogValidator can convert between formats. LogValidator carries a toMagic target and up/down-converts incoming batches whose magic does not match storage/src/main/java/org/apache/kafka/storage/internals/log/LogValidator.java:75, 116, 128-129. New writes from modern clients are always v2.

Control batches & control records

A control batch sets attribute bit 5. Its records use a typed key: Key => Version:int16, Type:int16 ControlRecordType.java:29-43. The 16-bit version is a schema version (so the key can grow compatibly); the 16-bit type selects the meaning. Defined types ControlRecordType.java:43-57:

Type id	Name	Purpose
0	`ABORT`	Transaction abort marker
1	`COMMIT`	Transaction commit marker
2	`LEADER_CHANGE`	KRaft: a new leader was elected for the metadata/Raft log
3 / 4	`SNAPSHOT_HEADER` / `SNAPSHOT_FOOTER`	KRaft snapshot delimiters
5 / 6	`KRAFT_VERSION` / `KRAFT_VOTERS`	KRaft membership / quorum version records
−1	`UNKNOWN`	Type the client does not recognize → ignored

Transaction markers are produced by the coordinator via appendEndTxnMarker, which requires a valid PID and the transactional flag MemoryRecordsBuilder.java:618-625. The marker's value is an EndTransactionMarker carrying the coordinator epoch, version-prefixed and serialized through the generated EndTxnMarker message EndTransactionMarker.java:41-47, 82-103. The KRaft control types (2–6) are appended by the corresponding append…Message helpers and are how the Raft log records leadership and membership changes, see KRaft Consensus and The KRaft Controller MemoryRecordsBuilder.java:627-668.

Note

Control records are skipped by the log cleaner ("control records are not considered for compaction" ControlRecordType.java:39) and are filtered out before delivery to ordinary consumers. Ordinary clients may not write them: LogValidator rejects client-written control records LogValidator.java:476.

Compression

In v2 the entire records region (bytes 61…end) is compressed as a single stream; the batch header is always plaintext, so a reader can inspect offsets, PID and CRC without decompressing. Codecs are an enum with a 2-bit-storable id CompressionType.java:29-142:

id	name	library	level range (default)	decompression buffer
0	`none`	,	,	,
1	`gzip`	JDK `Deflater`	`BEST_SPEED`…`BEST_COMPRESSION`, default `DEFAULT_COMPRESSION`	16 KB
2	`snappy`	xerial snappy (native)	,	,
3	`lz4`	lz4-java	1…17 (default 9)	2 KB intermediate
4	`zstd`	zstd-jni (native)	−131072…22 (default 3)	16 KB

Codec classes are loaded lazily so native libraries are only pulled in if the codec is actually used CompressionType.java:66-70. The Compression factory builds per-codec stream wrappers: wrapForOutput(ByteBufferOutputStream, magic) for the producer side and wrapForInput(ByteBuffer, magic, BufferSupplier) for readers compress/Compression.java:43-56. Decompression buffer sizes come from decompressionOutputSize() (gzip/zstd 16 KB compress/GzipCompression.java:80-83 compress/ZstdCompression.java:105-109; LZ4 uses a 2 KB intermediate compress/Lz4Compression.java:66-72). Kafka's LZ4 uses the LZ4 frame format with magic 0x184D2204 compress/Lz4BlockOutputStream.java:39.

Gotcha

The RecordBatch.streamingIterator javadoc warns that for small batches the per-call cost is dominated by allocating the decompression buffer ("64 KB for LZ4"), which is why a reusable BufferSupplier matters RecordBatch.java:234-240. That 64 KB refers to the LZ4 block size; the supplier-backed intermediate buffer Kafka requests is 2 KB. Either way: reuse the supplier on hot read paths.

CRC-32C

v2 uses CRC-32C (Castagnoli polynomial) via the JDK's java.util.zip.CRC32C, wrapped by Crc32C utils/internals/Crc32C.java:41-59, on modern CPUs this maps to a hardware instruction. v0/v1 instead used plain CRC-32 (java.util.zip.CRC32) LegacyRecord.java:508-512. Validation: ensureValid() first checks the batch is at least RECORD_BATCH_OVERHEAD bytes, then compares stored vs. recomputed CRC, throwing CorruptRecordException otherwise DefaultRecordBatch.java:151-159, 395-401.

Architecture & control/data flow

user recordsappend(ts, k, v, hdrs)

MemoryRecordsBuilderProduce write path

MemoryRecordsframed batch in heap

broker LogValidatorvalidate · assign offsets · stamp leaderEpoch · maybe convert

FileRecordssegment .log on disk

FileChannelRecordBatchlazy · loadBatchHeader / loadFullBatch

DefaultRecordBatch.iterator()decompress if needed

DefaultRecord⇒ consumer

socket bytessendfile from page cache

End-to-end path of the record format: the write path (top) builds and validates a batch into a log segment, then the read path forks, full iteration for consumers vs. a zero-copy writeTo straight to the socket. The same bytes are reused; the broker rewrites only the header (offsets, leader epoch, CRC) when needed.

client / producer · consumer path broker machinery log / segment · socket bytes cylinder = on-disk store data flow zero-copy transfer Class = class / method on the path

Building a batch (write path)

MemoryRecordsBuilder reserves the header region up front, then streams records (optionally through the compressor) into the buffer MemoryRecordsBuilder.java:138-147:

The constructor positions the underlying ByteBufferOutputStream at initialPosition + batchHeaderSizeInBytes and wraps it with the codec's output stream. For v2, batchHeaderSizeInBytes = RECORD_BATCH_OVERHEAD (61); for legacy-compressed it is LOG_OVERHEAD + legacy record overhead AbstractRecords.java:153-161.
Each append computes offsetDelta = offset − baseOffset and timestampDelta = timestamp − baseTimestamp, then DefaultRecord.writeTo emits the record; recordWritten updates numRecords, lastOffset, and tracks maxTimestamp MemoryRecordsBuilder.java:755-799. The first append seeds baseTimestamp (unless a delete horizon already set it) MemoryRecordsBuilder.java:471-472, 145-147.
close() flushes the compressor, then backfills the header in place via writeDefaultBatchHeader → DefaultRecordBatch.writeHeader, which writes every field and computes the CRC over [attributes..end] last MemoryRecordsBuilder.java:361-386, 408-429 DefaultRecordBatch.java:460-500. The compression ratio is recorded for the accumulator's future size estimates.

Offsets must increase strictly monotonically within a builder (appendWithOffset throws otherwise), and a control record may only be appended to a control batch MemoryRecordsBuilder.java:455-469. Size accounting for "will the next record fit" is in hasRoomFor, which is deliberately conservative about compression and always admits the first record so a non-empty batch can always be formed even with batch.size = 0 MemoryRecordsBuilder.java:849-890.

Reading from a file (read path)

FileLogInputStream.nextBatch() reads only the 17-byte header-up-to-magic (HEADER_SIZE_UP_TO_MAGIC = LOG_OVERHEAD + 4 + 1 Records.java:51-55), extracts offset, size and magic, sanity-checks the size against the V0 minimum, and constructs a FileChannelRecordBatch without copying the record data into the heap FileLogInputStream.java:63-94. The concrete batch lazily materializes either just the header (loadBatchHeader, reading headerSize() bytes) or the whole batch (loadFullBatch) on demand, caching the result FileLogInputStream.java:194-222. This lazy design is what lets a broker answer "what is the last offset / does this batch have a PID" cheaply, and lets writeTo stream batch bytes straight from the channel to a destination buffer for zero-copy fetch responses FileLogInputStream.java:177-188.

Sending over the wire

On the wire, records are framed identically to disk. DefaultRecordsSend simply delegates to records().writeTo(channel, …), allowing FileRecords to transfer bytes directly via the channel (the zero-copy path) without re-encoding DefaultRecordsSend.java:32-35. See Wire Protocol & RPC Framework and The Fetch Path.

Concurrency & threading

Batches are not thread-safe. DefaultRecordBatch wraps a single ByteBuffer and reads/writes it without locks. Header mutators, setLastOffset, setPartitionLeaderEpoch, setMaxTimestamp, assume single-threaded access (the broker append path is serialized per partition; see The Log Storage Engine) DefaultRecordBatch.java:366-393.
Builder lifecycle. MemoryRecordsBuilder is owned by one thread at a time. On the producer it is created and appended-to by the application threads under the accumulator's per-batch synchronization, then closeForRecordAppends() is called (releasing potentially large compression buffers) when appends stop but before the header is finalized, important because there can be a gap before the batch is drained and sent MemoryRecordsBuilder.java:45-47, 332-342. After close, only header fields may change; attempting to set producer state on a closed builder throws, to avoid introducing duplicates on retry MemoryRecordsBuilder.java:308-320. See The Producer Client.
Iterator buffers. Decompressing iterators allocate intermediate buffers; the BufferSupplier passed to streamingIterator/skipKeyValueIterator is per-thread (a BufferSupplier is not thread-safe) and reused across batches on the broker's fetch/validation paths to avoid repeated 16 KB/2 KB allocations DefaultRecordBatch.java:279-364.
Iterator must be closed. Compressed iterators hold a decompression stream; the plain iterator() eagerly drains the whole batch into a list precisely because it cannot guarantee the caller will close the stream, whereas streamingIterator defers decompression but obliges the caller to close DefaultRecordBatch.java:320-364.

Configuration reference

The format itself is fixed, but a few client knobs shape the batches produced. Defaults are from ProducerConfig.

Key	Default	Effect on batches
`compression.type`	`none`	Codec written into attribute bits 0–2. ProducerConfig.java:244, 409
`batch.size`	`16384`	Target bytes per batch; drives `writeLimit`/`hasRoomFor`. A value of 0 still admits one record per batch. ProducerConfig.java:82, 413
`compression.gzip.level`	`DEFAULT_COMPRESSION`	gzip level (otherwise `BEST_SPEED`…`BEST_COMPRESSION`). ProducerConfig.java:250, 410
`compression.lz4.level`	`9`	lz4 level 1–17. ProducerConfig.java:254; CompressionType.java:72-98
`compression.zstd.level`	`3`	zstd level −131072…22. ProducerConfig.java:258; CompressionType.java:99-130

Idempotence/transaction header fields are not direct format knobs; they are populated from the producer's PID/epoch and per-partition sequence (enable.idempotence, transactional.id), see Transactions & Exactly-Once Semantics.

Failure modes, edge cases & recovery

Corruption. A bad CRC throws CorruptRecordException from ensureValid(); a too-small batch (< 61 bytes) is rejected before the CRC is even read DefaultRecordBatch.java:151-159.
Truncated / over-declared batch. If recordsCount claims more records than the buffer holds, the uncompressed iterator throws "Incorrect declared batch size, premature EOF reached"; if records remain after reading the declared count, it throws "records still remaining" DefaultRecordBatch.java:299-318, 596-611. The per-record body-length cross-check in readFrom catches intra-record corruption independently of the batch CRC DefaultRecord.java:350-353.
Empty batches after compaction. The cleaner may retain a header-only batch (recordsCount 0) to preserve the producer's first/last sequence so state can be rebuilt after a leader failover; such a batch persists until the producer writes a new sequence or its PID expires DefaultRecordBatch.java:73-85. A zero-count batch iterates as empty DefaultRecordBatch.java:320-323.
Timestamp semantics under compaction. baseTimestamp is the first record's timestamp in most cases, −1 for an empty batch, or the delete horizon when bit 6 is set. For LOG_APPEND_TIME, maxTimestamp is broker-stamped and survives compaction; an empty batch retains its prior maxTimestamp DefaultRecordBatch.java:87-95.
Sequence wraparound. Sequence numbers are 32-bit and wrap; incrementSequence/decrementSequence implement modular arithmetic around Integer.MAX_VALUE so lastSequence stays correct across the wrap boundary DefaultRecordBatch.java:558-568.
Legacy edge cases. Decoding a compressed v1 wrapper rejects null inner values, inner compression, mismatched magic, and a wrapper offset less than the last inner offset (unless it is 0, tolerated for some librdkafka versions) AbstractLegacyRecordBatch.java:330-388.

Invariants & guarantees

Invariant

First/last offsets and sequences are immutable under compaction. v2 preserves baseOffset, lastOffsetDelta, baseSequence and the derived last sequence even when inner records are removed, so a broker can rebuild producer state from the log and the next expected sequence stays in sync, preventing spurious OutOfOrderSequence errors after failover DefaultRecordBatch.java:73-80.

Invariant

The CRC excludes the partition leader epoch. The broker can write partitionLeaderEpoch into a received batch without recomputing the checksum, because bytes 12–15 lie outside the CRC region DefaultRecordBatch.java:386-388, 399-401.

Invariant

Offsets are strictly increasing within a batch. The builder enforces monotonicity at append time MemoryRecordsBuilder.java:461-463, and the log validator re-verifies the batch's reported offset range and count on the broker LogValidator.java:451-469.

Invariant

Self-describing magic. Every batch begins with (offset, size) then a magic byte at a format-independent position (MAGIC_OFFSET = 16 Records.java:51-53), so a reader can always determine the format before interpreting the rest.

Interactions with other subsystems

The Log Storage Engine & Log Management, Retention & Compaction, segments are sequences of these batches on disk; the cleaner manipulates batch headers (delete horizon, retained empty batches) directly.
Replication, ISR & High Watermark & The Fetch Path, followers fetch and append the identical bytes; the leader stamps partitionLeaderEpoch into each batch.
Transactions & Exactly-Once, PID/epoch/sequence header fields and the ABORT/COMMIT control records are the on-disk substrate of EOS.
KRaft Consensus / The KRaft Controller, the metadata log is the same batch format; LEADER_CHANGE, SNAPSHOT_*, KRAFT_VERSION, KRAFT_VOTERS control records drive the quorum.
Producer / Consumer, the producer's accumulator builds batches with MemoryRecordsBuilder; the consumer decodes them with the iterators here.
Tiered Storage, remote segments store the same format (see RemoteLogInputStream).

Design rationale & evolution

Design rationale

The v2 batch was introduced by KIP-98 (exactly-once / idempotent producer), which needed a place for the PID, producer epoch and base sequence, and by the message-format overhaul that delta-encodes offsets/timestamps and moves the CRC to a per-batch, leader-epoch-excluding position. Record headers came with KIP-82. The delete-horizon attribute bit (so the cleaner can reason about when tombstones/markers expire) is KIP-534. CRC-32C replaced CRC-32 to exploit hardware checksum instructions. The lazy FileChannelRecordBatch and zero-copy RecordsSend exist so the broker never has to deserialize record bodies just to route bytes, the bytes a producer writes are the bytes a consumer reads.

The trajectory has been toward doing less per record on the broker: amortize integrity and metadata across a batch, keep the header parseable without decompression, and never re-encode payloads in the steady state. v0/v1 remain only as a compatibility/read path.

Gotchas / operational notes

Gotcha

baseOffset() on a v2 batch is O(1) (a header read), but on a legacy compressed batch it requires decompressing to find the first inner offset AbstractLegacyRecordBatch.java:138-141. Prefer lastOffset() when you only need a single offset and might be on a legacy path.

Gotcha

A null value (valueLength −1) is a tombstone for compacted topics, distinct from an empty value (valueLength 0). The distinction is preserved by the varint sentinel, not a separate flag DefaultRecord.java:191-197.

Caution

Header keys are required and non-null UTF-8; header values may be null. DefaultRecord.writeTo throws on a null headers array or a null header key DefaultRecord.java:199-219. The header count is a signed varint, and the reader rejects a negative count or a count larger than the remaining buffer as corruption DefaultRecord.java:338-342.

Note

The maximum number of records per batch and the maximum offset delta are both bounded by Integer.MAX_VALUE (offset delta is a 32-bit field); the builder throws if either is exceeded MemoryRecordsBuilder.java:784-789.