## Problem

The protocol spec's description of binary partition value serialization is ambiguous:

> binary | Encoded as a string of escaped binary values. For example, "\u0001\u0002\u0003"

The phrase "escaped binary values" is unclear, and the example uses JSON notation (`\u0001`), which can be misread as literal six-character escape sequences rather than single Unicode characters.
## How implementations interpret the spec

Java kernel and Spark treat the byte array as a UTF-8 string: each UTF-8 byte sequence decodes to its corresponding character. The `\u0001` in the spec example is JSON notation for the character U+0001, not the literal string `\u0001`.

- Java kernel write: `new String((byte[]) value, StandardCharsets.UTF_8)`
- Java kernel read: `Literal.ofBinary(partitionValue.getBytes())`
- Spark read: `case BinaryType => value.getBytes()`
- delta-kernel-rs read: `Binary => Ok(Scalar::Binary(raw.to_string().into_bytes()))`
However, a plausible misreading of the spec is to encode each byte as a literal six-character string `\uXXXX` (e.g., byte 0x01 becomes the six characters `\`, `u`, `0`, `0`, `0`, `1` stored in the map). This interpretation would produce values six times larger than intended and would not interoperate with Java kernel or Spark.
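To make the difference concrete, here is a minimal Java sketch (the class and method names are illustrative, not from any kernel) contrasting the intended single-character-per-byte reading with the six-character-escape misreading:

```java
import java.nio.charset.StandardCharsets;

public class BinaryPartitionEncoding {
    // Intended reading: the bytes themselves are decoded to a string (UTF-8 here)
    static String correctEncoding(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Misreading: each byte spelled out as a literal six-character \uXXXX escape
    static String misreadEncoding(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\u%04x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] value = {0x01, 0x02, 0x03};
        System.out.println(correctEncoding(value).length()); // 3: one character per byte
        System.out.println(misreadEncoding(value).length()); // 18: six times larger
        System.out.println(misreadEncoding(value));          // \u0001\u0002\u0003 as literal text
    }
}
```

The 3-character and 18-character results would not compare equal as partition values, which is the interoperability failure described above.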
## Deeper issue: non-UTF-8 binary data is lossy

The `partitionValues` map is `Map[String, String]`, and JSON strings are Unicode. All existing implementations use a lossy UTF-8 conversion for binary partition values:

- Java's `new String(bytes, UTF_8)` replaces invalid UTF-8 sequences with U+FFFD (the replacement character)
- delta-kernel-rs's `raw.to_string().into_bytes()` treats the string as UTF-8 bytes
- On the read side, `String.getBytes()` / `.into_bytes()` returns the UTF-8 bytes of the (possibly mangled) string

This means binary partition columns containing non-UTF-8 byte data silently lose information. There is no lossless way to represent arbitrary bytes in a JSON string without an explicit encoding scheme (e.g., base64 or hex). The protocol does not acknowledge this limitation.
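As a sketch of the kind of explicit encoding scheme mentioned above, base64 (via the JDK's `java.util.Base64`) round-trips exactly the bytes that a UTF-8 string conversion would mangle:

```java
import java.util.Arrays;
import java.util.Base64;

public class LosslessAlternative {
    public static void main(String[] args) {
        // 0x80 and 0xFF are invalid anywhere in UTF-8, so a UTF-8 string
        // conversion would replace them, but base64 preserves them exactly.
        byte[] original = {(byte) 0x80, (byte) 0xFF};
        String encoded = Base64.getEncoder().encodeToString(original);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(encoded);                          // gP8=
        System.out.println(Arrays.equals(original, decoded)); // true: lossless round-trip
    }
}
```

Note that adopting base64 would be a breaking change against the existing Java kernel and Spark behavior; this only illustrates that a lossless scheme exists.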
## Concrete examples of lossy behavior

Valid UTF-8 byte sequences round-trip correctly; only invalid UTF-8 sequences are lossy:
| Input bytes | Meaning | After round-trip | Data lost? |
|---|---|---|---|
| `[0x48, 0x69]` | "Hi" (valid ASCII) | `[0x48, 0x69]` | No |
| `[0xF0, 0x9F, 0x98, 0x88]` | U+1F608 emoji (valid 4-byte UTF-8) | `[0xF0, 0x9F, 0x98, 0x88]` | No |
| `[0xC3, 0xBC]` | "u with umlaut" (valid 2-byte UTF-8) | `[0xC3, 0xBC]` | No |
| `[0x80]` | Lone continuation byte (invalid UTF-8) | `[0xEF, 0xBF, 0xBD]` (U+FFFD) | Yes -- 1 byte became 3 |
| `[0xFF]` | Invalid byte in UTF-8 | `[0xEF, 0xBF, 0xBD]` (U+FFFD) | Yes -- different byte |
| `[0x80, 0xFF]` | Two different invalid bytes | `[0xEF, 0xBF, 0xBD, 0xEF, 0xBF, 0xBD]` | Yes -- both become the same replacement character, now indistinguishable |
| `[0xC3]` | Truncated 2-byte sequence (missing continuation byte) | `[0xEF, 0xBF, 0xBD]` (U+FFFD) | Yes |
| `[0x48, 0x80, 0x69]` | "H", invalid byte, "i" | `[0x48, 0xEF, 0xBF, 0xBD, 0x69]` | Yes -- 3 bytes became 5, middle byte corrupted |
The round-trip is: bytes -> `new String(bytes, UTF_8)` -> `string.getBytes(UTF_8)` -> bytes. Both Java and Rust produce identical behavior for all cases above.
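The table rows can be reproduced with a small Java sketch of that round-trip (the helper name is illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripTable {
    // bytes -> new String(bytes, UTF_8) -> string.getBytes(UTF_8) -> bytes
    static byte[] roundTrip(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Valid UTF-8 survives unchanged
        System.out.println(Arrays.equals(
                new byte[]{0x48, 0x69}, roundTrip(new byte[]{0x48, 0x69}))); // true

        // A lone continuation byte becomes the three bytes of U+FFFD
        System.out.println(Arrays.toString(
                roundTrip(new byte[]{(byte) 0x80}))); // [-17, -65, -67] = EF BF BD

        // Two different invalid inputs collapse to the same output
        byte[] a = roundTrip(new byte[]{(byte) 0x80});
        byte[] b = roundTrip(new byte[]{(byte) 0xFF});
        System.out.println(Arrays.equals(a, b)); // true: originals are indistinguishable
    }
}
```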
## Proposed spec change

Replace:

> binary | Encoded as a string of escaped binary values. For example, "\u0001\u0002\u0003"

With:

> binary | Each byte is mapped to the Unicode character with the same code point (byte 0x00 becomes U+0000, byte 0x41 becomes U+0041 = 'A', etc.) and the resulting string is stored in the partition value map. For example, the three-byte array [0x01, 0x02, 0x03] becomes a three-character string. When serialized to JSON, non-printable characters are represented using standard JSON Unicode escapes (e.g., "\u0001\u0002\u0003"), but the map value itself is the three-character string, not the literal escape sequences.

Note: This encoding is only lossless for bytes 0x00-0x7F. Byte sequences that are not valid UTF-8 may be corrupted during serialization because the `partitionValues` map uses JSON strings (which are Unicode). Implementations should avoid using non-UTF-8 binary data as partition values.
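A minimal sketch of the proposed byte-to-code-point mapping (the `encode` helper is hypothetical, not part of any kernel API), including the caveat from the note above:

```java
import java.nio.charset.StandardCharsets;

public class ProposedEncoding {
    // Map each byte to the Unicode character with the same code point
    static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // [0x01, 0x02, 0x03] becomes a three-character string,
        // not the 18-character literal escape text
        System.out.println(encode(new byte[]{0x01, 0x02, 0x03}).length()); // 3

        // Caveat: byte 0xFF maps to the character U+00FF, which serializes
        // to the two UTF-8 bytes C3 BF -- lossless only for 0x00-0x7F
        System.out.println(encode(new byte[]{(byte) 0xFF})
                .getBytes(StandardCharsets.UTF_8).length); // 2
    }
}
```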
## References
- PartitionUtils.java, lines 573-574
- PartitioningUtils.scala, line 796
- scalars.rs, parse_scalar for Binary