Skip to content

docs: clarify binary partition value serialization format in protocol spec #6480

@scottsand-db

Description

@scottsand-db

Problem

The protocol spec's description of binary partition value serialization is ambiguous:

binary | Encoded as a string of escaped binary values. For example, "\u0001\u0002\u0003"

The phrase "escaped binary values" is unclear, and the example uses JSON notation (\u0001) which can be misread as literal six-character escape sequences rather than single Unicode characters.

How implementations interpret the spec

Java kernel and Spark treat the byte array as a UTF-8 string. Each byte sequence maps to its UTF-8 character. The \u0001 in the spec example is JSON notation for the character U+0001, not the literal string \u0001.

  • Java kernel write: new String((byte[]) value, StandardCharsets.UTF_8)
  • Java kernel read: Literal.ofBinary(partitionValue.getBytes())
  • Spark read: case BinaryType => value.getBytes()
  • delta-kernel-rs read: Binary => Ok(Scalar::Binary(raw.to_string().into_bytes()))

However, a plausible misreading of the spec is to encode each byte as a literal six-character string \uXXXX (e.g., byte 0x01 becomes the six characters \, u, 0, 0, 0, 1 stored in the map). This interpretation would produce values 6x larger than intended and would not interoperate with Java kernel or Spark.

Deeper issue: non-UTF-8 binary data is lossy

The partitionValues map is Map[String, String], and JSON strings are Unicode. All existing implementations use lossy UTF-8 conversion for binary partition values:

  • Java's new String(bytes, UTF_8) replaces invalid UTF-8 sequences with U+FFFD (replacement character)
  • delta-kernel-rs's raw.to_string().into_bytes() treats the string as UTF-8 bytes
  • On the read side, String.getBytes() / .into_bytes() returns the UTF-8 bytes of the (possibly mangled) string

This means binary partition columns with non-UTF-8 byte data silently lose information. There is no lossless way to represent arbitrary bytes in a JSON string without an explicit encoding scheme (e.g., base64, hex).

The protocol does not acknowledge this limitation.

Concrete examples of lossy behavior

Valid UTF-8 byte sequences round-trip correctly. Only invalid UTF-8 sequences are lossy:

Input bytes Meaning After round-trip Data lost?
[0x48, 0x69] "Hi" (valid ASCII) [0x48, 0x69] No
[0xF0, 0x9F, 0x98, 0x88] U+1F608 emoji (valid 4-byte UTF-8) [0xF0, 0x9F, 0x98, 0x88] No
[0xC3, 0xBC] "u with umlaut" (valid 2-byte UTF-8) [0xC3, 0xBC] No
[0x80] Lone continuation byte (invalid UTF-8) [0xEF, 0xBF, 0xBD] (U+FFFD) Yes -- 1 byte became 3
[0xFF] Invalid byte in UTF-8 [0xEF, 0xBF, 0xBD] (U+FFFD) Yes -- different byte
[0x80, 0xFF] Two different invalid bytes [0xEF, 0xBF, 0xBD, 0xEF, 0xBF, 0xBD] Yes -- both become same replacement char, now indistinguishable
[0xC3] Truncated 2-byte sequence (missing continuation byte) [0xEF, 0xBF, 0xBD] (U+FFFD) Yes
[0x48, 0x80, 0x69] "H", invalid byte, "i" [0x48, 0xEF, 0xBF, 0xBD, 0x69] Yes -- 3 bytes became 5, middle byte corrupted

The round-trip is: bytes -> new String(bytes, UTF_8) -> string.getBytes(UTF_8) -> bytes. Both Java and Rust produce identical behavior for all cases above.

Proposed spec change

Replace:

binary | Encoded as a string of escaped binary values. For example, "\u0001\u0002\u0003"

With:

binary | Each byte is mapped to the Unicode character with the same code point (byte 0x00 becomes U+0000, byte 0x41 becomes U+0041 = 'A', etc.) and the resulting string is stored in the partition value map. For example, the three-byte array [0x01, 0x02, 0x03] becomes a three-character string. When serialized to JSON, non-printable characters are represented using standard JSON Unicode escapes (e.g., "\u0001\u0002\u0003"), but the map value itself is the three-character string, not the literal escape sequences.

Note: This encoding is only lossless for bytes 0x00-0x7F. Byte sequences that are not valid UTF-8 may be corrupted during serialization because the partitionValues map uses JSON strings (which are Unicode). Implementations should avoid using non-UTF-8 binary data as partition values.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions