docs: clarify binary partition value serialization format in protocol spec

## Problem

The protocol spec's description of binary partition value serialization is ambiguous:

> binary | Encoded as a string of escaped binary values. For example, `"\u0001\u0002\u0003"`

The phrase "escaped binary values" is unclear, and the example uses JSON notation (`\u0001`) which can be misread as literal six-character escape sequences rather than single Unicode characters.

### How implementations interpret the spec

Java kernel and Spark treat the byte array as a UTF-8 string. Each byte sequence maps to its UTF-8 character. The `\u0001` in the spec example is JSON notation for the character U+0001, not the literal string `\u0001`.

- Java kernel write: `new String((byte[]) value, StandardCharsets.UTF_8)`
- Java kernel read: `Literal.ofBinary(partitionValue.getBytes())`
- Spark read: `case BinaryType => value.getBytes()`
- delta-kernel-rs read: `Binary => Ok(Scalar::Binary(raw.to_string().into_bytes()))`

However, a plausible misreading of the spec is to encode each byte as a literal six-character string `\uXXXX` (e.g., byte 0x01 becomes the six characters `\`, `u`, `0`, `0`, `0`, `1` stored in the map). This interpretation would produce values 6x larger than intended and would not interoperate with Java kernel or Spark.

### Deeper issue: non-UTF-8 binary data is lossy

The `partitionValues` map is `Map[String, String]`, and JSON strings are Unicode. All existing implementations use lossy UTF-8 conversion for binary partition values:

- Java's `new String(bytes, UTF_8)` replaces invalid UTF-8 sequences with U+FFFD (replacement character)
- delta-kernel-rs's `raw.to_string().into_bytes()` treats the string as UTF-8 bytes
- On the read side, `String.getBytes()` / `.into_bytes()` returns the UTF-8 bytes of the (possibly mangled) string

This means **binary partition columns with non-UTF-8 byte data silently lose information**. There is no lossless way to represent arbitrary bytes in a JSON string without an explicit encoding scheme (e.g., base64, hex).

The protocol does not acknowledge this limitation.

### Concrete examples of lossy behavior

Valid UTF-8 byte sequences round-trip correctly. Only invalid UTF-8 sequences are lossy:

| Input bytes | Meaning | After round-trip | Data lost? |
|---|---|---|---|
| `[0x48, 0x69]` | "Hi" (valid ASCII) | `[0x48, 0x69]` | No |
| `[0xF0, 0x9F, 0x98, 0x88]` | U+1F608 emoji (valid 4-byte UTF-8) | `[0xF0, 0x9F, 0x98, 0x88]` | No |
| `[0xC3, 0xBC]` | "u with umlaut" (valid 2-byte UTF-8) | `[0xC3, 0xBC]` | No |
| `[0x80]` | Lone continuation byte (invalid UTF-8) | `[0xEF, 0xBF, 0xBD]` (U+FFFD) | **Yes** -- 1 byte became 3 |
| `[0xFF]` | Invalid byte in UTF-8 | `[0xEF, 0xBF, 0xBD]` (U+FFFD) | **Yes** -- different byte |
| `[0x80, 0xFF]` | Two different invalid bytes | `[0xEF, 0xBF, 0xBD, 0xEF, 0xBF, 0xBD]` | **Yes** -- both become same replacement char, now indistinguishable |
| `[0xC3]` | Truncated 2-byte sequence (missing continuation byte) | `[0xEF, 0xBF, 0xBD]` (U+FFFD) | **Yes** |
| `[0x48, 0x80, 0x69]` | "H", invalid byte, "i" | `[0x48, 0xEF, 0xBF, 0xBD, 0x69]` | **Yes** -- 3 bytes became 5, middle byte corrupted |

The round-trip is: `bytes -> new String(bytes, UTF_8) -> string.getBytes(UTF_8) -> bytes`. Both Java and Rust produce identical behavior for all cases above.

### Proposed spec change

Replace:

> binary | Encoded as a string of escaped binary values. For example, `"\u0001\u0002\u0003"`

With:

> binary | Each byte is mapped to the Unicode character with the same code point (byte 0x00 becomes U+0000, byte 0x41 becomes U+0041 = 'A', etc.) and the resulting string is stored in the partition value map. For example, the three-byte array `[0x01, 0x02, 0x03]` becomes a three-character string. When serialized to JSON, non-printable characters are represented using standard JSON Unicode escapes (e.g., `"\u0001\u0002\u0003"`), but the map value itself is the three-character string, not the literal escape sequences.
>
> Note: This encoding is only lossless for bytes 0x00-0x7F. Byte sequences that are not valid UTF-8 may be corrupted during serialization because the `partitionValues` map uses JSON strings (which are Unicode). Implementations should avoid using non-UTF-8 binary data as partition values.

### References

- Java kernel serialization: [`PartitionUtils.java` line 573-574](https://github.com/delta-io/delta/blob/master/kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/PartitionUtils.java)
- Spark deserialization: [`PartitioningUtils.scala` line 796](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala)
- delta-kernel-rs deserialization: [`scalars.rs` parse_scalar for Binary](https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/expressions/scalars.rs)

Input bytes	Meaning	After round-trip	Data lost?
`[0x48, 0x69]`	"Hi" (valid ASCII)	`[0x48, 0x69]`	No
`[0xF0, 0x9F, 0x98, 0x88]`	U+1F608 emoji (valid 4-byte UTF-8)	`[0xF0, 0x9F, 0x98, 0x88]`	No
`[0xC3, 0xBC]`	"u with umlaut" (valid 2-byte UTF-8)	`[0xC3, 0xBC]`	No
`[0x80]`	Lone continuation byte (invalid UTF-8)	`[0xEF, 0xBF, 0xBD]` (U+FFFD)	Yes -- 1 byte became 3
`[0xFF]`	Invalid byte in UTF-8	`[0xEF, 0xBF, 0xBD]` (U+FFFD)	Yes -- different byte
`[0x80, 0xFF]`	Two different invalid bytes	`[0xEF, 0xBF, 0xBD, 0xEF, 0xBF, 0xBD]`	Yes -- both become same replacement char, now indistinguishable
`[0xC3]`	Truncated 2-byte sequence (missing continuation byte)	`[0xEF, 0xBF, 0xBD]` (U+FFFD)	Yes
`[0x48, 0x80, 0x69]`	"H", invalid byte, "i"	`[0x48, 0xEF, 0xBF, 0xBD, 0x69]`	Yes -- 3 bytes became 5, middle byte corrupted

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: clarify binary partition value serialization format in protocol spec #6480

Problem

How implementations interpret the spec

Deeper issue: non-UTF-8 binary data is lossy

Concrete examples of lossy behavior

Proposed spec change

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

docs: clarify binary partition value serialization format in protocol spec #6480

Description

Problem

How implementations interpret the spec

Deeper issue: non-UTF-8 binary data is lossy

Concrete examples of lossy behavior

Proposed spec change

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions