Currently, materialization connectors utilize the minimum and maximum JSON Schema attributes to determine the optimal underlying data types for fields mapped as type: integer and type: number. This works well for native numeric fields.
However, for fields mapped as type: string with format: integer or format: number, we currently lack equivalent boundary attributes. Without min/max ranges for these fields, materialization connectors cannot make informed decisions about the best database types to use for numbers that are presented as strings.
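As a rough illustration of the decision described above, the sketch below picks a column type from a pair of bounds. The function name `pick_integer_column`, the SQL type names, and the fallback to `NUMERIC` are all assumptions made for this example; each real connector has its own mapping.

```python
from typing import Optional

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def pick_integer_column(minimum: Optional[int], maximum: Optional[int]) -> str:
    """Return the narrowest (hypothetical) column type holding [minimum, maximum]."""
    if minimum is None or maximum is None:
        # No bounds available, which is the situation today for
        # type: string fields with format: integer. Assume the widest type.
        return "NUMERIC"
    if INT32_MIN <= minimum and maximum <= INT32_MAX:
        return "INTEGER"
    if INT64_MIN <= minimum and maximum <= INT64_MAX:
        return "BIGINT"
    return "NUMERIC"


print(pick_integer_column(0, 1_000))    # INTEGER
print(pick_integer_column(0, 2**40))    # BIGINT
print(pick_integer_column(None, None))  # NUMERIC
```

With bounds, the connector can choose a narrow native type; without them it has to assume the worst case.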
Problem
The initial idea was to apply the existing minimum and maximum JSON Schema attributes directly to numeric-formatted string fields. However, because these are strict validation keywords (not just annotations) within JSON Schema, reusing them for strings introduces two major issues:
- Loss of precision (Lossiness): Reusing the same keyword merges the bounds of native numbers and numeric strings into a single range. A string value might exceed a 32-bit or 64-bit limit while every native numeric value stays within it, and with one shared bound a connector can no longer tell the two cases apart or act on them separately.
- Breaking backfills: Validation is critical to the end-to-end schema inference process. If we started enforcing standard maximum validation against strings, established collections with existing strings larger than the inferred maximum would permanently break during materialization backfills. Connectors rely on validation failures to safely halt, restart, and apply required DDL changes before processing widened documents.
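The backfill problem can be sketched with a toy validator. This is an illustration only: `passes_standard` models the standard JSON Schema rule that maximum constrains numeric instances and ignores strings, while `passes_if_reused` is a hypothetical variant that also applies it to numeric strings. The sample documents and the bound of 1000 are made up.

```python
def passes_standard(value, maximum):
    # Standard JSON Schema: `maximum` applies only to numeric instances.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value <= maximum
    return True


def passes_if_reused(value, maximum):
    # Hypothetical semantics: also compare numeric-formatted strings.
    if isinstance(value, str):
        return int(value) <= maximum
    return passes_standard(value, maximum)


# Documents already stored in an established collection (made-up values).
history = [12, 950, "99999999999"]
maximum = 1000  # inferred from the native numbers only

# Values that were valid when written but would fail during a backfill.
newly_rejected = [
    v for v in history
    if passes_standard(v, maximum) and not passes_if_reused(v, maximum)
]
print(newly_rejected)  # ['99999999999']
```

The string was valid when it was written, so a backfill that re-validates it under the reused keyword would fail on data the collection already accepted.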
Proposed Solution
Introduce two new custom validation keywords specifically for string fields:
- x-str-minimum
- x-str-maximum
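For illustration, a location that accepts both native integers and integer-formatted strings might carry the two sets of bounds side by side. The shape below is an assumption: the proposal does not specify whether the new bounds are stored as JSON numbers or strings, and the values are made up.

```python
import json

# Hypothetical inferred schema for a single location.
location_schema = {
    "type": ["integer", "string"],
    "format": "integer",
    # Bounds observed on native integers (standard keywords).
    "minimum": 0,
    "maximum": 1000,
    # Bounds observed on numeric strings (proposed keywords).
    "x-str-minimum": 0,
    "x-str-maximum": 99999999999,
}

print(json.dumps(location_schema, indent=2))
```

Keeping the ranges separate lets a connector see that native values fit in 32 bits while string values do not.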
Implementation Details / Requirements
- Schema Inference Update: Schema inference should begin populating x-str-minimum and x-str-maximum for new locations where type: string and format: integer/number are detected.
- Backward Compatibility: These new bounds should not be added to established string locations that do not already have them (matching our existing behavior for standard minimum/maximum to prevent breaking existing streams).
- Validation Enforcement: These must be implemented as strictly enforced validations in the validation crate, not merely as data-plane annotations.
- Connector Guarantee: This enforcement guarantees that a connector will never receive a document that violates an x-str-maximum or x-str-minimum setting it was provided at startup, giving it the opportunity to halt and apply DDL updates if the schema widens.
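The requirements above can be sketched in a few lines. This is a minimal model, not the validation crate's actual API: the function names `validate_str_bounds` and `infer_bounds` are hypothetical, bounds are assumed to be stored as JSON numbers, and `infer_bounds` assumes format: integer with well-formed input.

```python
from decimal import Decimal, InvalidOperation
from typing import List, Optional

X_MIN, X_MAX = "x-str-minimum", "x-str-maximum"


def validate_str_bounds(schema: dict, value: object) -> List[str]:
    """Return errors for the proposed keywords; an empty list means valid."""
    if not isinstance(value, str):
        return []  # the keywords constrain string instances only
    try:
        number = Decimal(value)  # arbitrary precision, no 64-bit overflow
    except InvalidOperation:
        return []  # non-numeric strings are left to `format` validation
    if not number.is_finite():
        return []
    errors = []
    if X_MIN in schema and number < Decimal(str(schema[X_MIN])):
        errors.append(f"{value} is below {X_MIN} {schema[X_MIN]}")
    if X_MAX in schema and number > Decimal(str(schema[X_MAX])):
        errors.append(f"{value} is above {X_MAX} {schema[X_MAX]}")
    return errors


def infer_bounds(existing: Optional[dict], value: str) -> dict:
    """Fold one observed integer-formatted string into a location's schema."""
    n = int(value)
    if existing is None:
        # New location: start populating the bounds.
        return {"type": "string", "format": "integer", X_MIN: n, X_MAX: n}
    if X_MIN not in existing or X_MAX not in existing:
        # Established location without bounds: leave it unbounded.
        return existing
    return {
        **existing,
        X_MIN: min(existing[X_MIN], n),
        X_MAX: max(existing[X_MAX], n),
    }


schema = infer_bounds(None, "100")
print(validate_str_bounds(schema, "100"))       # []
print(len(validate_str_bounds(schema, "101")))  # 1
widened = infer_bounds(schema, "101")
print(validate_str_bounds(widened, "101"))      # []
```

A document outside the bounds a connector started with fails validation first; only after the schema is widened and the connector has restarted does the same document pass.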