Introduce x-str-minimum and x-str-maximum validation keywords for numeric-formatted strings #2895

@williamhbaker

Description

Currently, materialization connectors utilize the minimum and maximum JSON Schema attributes to determine the optimal underlying data types for fields mapped as type: integer and type: number. This works well for native numeric fields.

However, fields mapped as type: string with format: integer or format: number have no equivalent boundary attributes. Without minimum and maximum bounds for these fields, materialization connectors cannot make informed decisions about the best database types to use for numbers that are encoded as strings.

Problem

The initial idea was to apply the existing minimum and maximum JSON Schema attributes directly to numeric-formatted string fields. However, because these are strict validation keywords (not just annotations) within JSON Schema, reusing them for strings introduces two major issues:

  1. Lossiness: Reusing the same keywords merges the bounds of native numbers and numeric strings into a single range. The string values at a location might exceed a 32-bit or 64-bit limit while the native numeric values do not, and a shared bound gives connectors no way to tell the two apart and act on the difference.
  2. Breaking backfills: Validation is critical to the end-to-end schema inference process. If we started enforcing standard maximum validation against strings, established collections with existing strings larger than the inferred maximum would permanently break during materialization backfills. Connectors rely on validation failures to safely halt, restart, and apply required DDL changes before processing widened documents.

Proposed Solution

Introduce two new custom validation keywords specifically for string fields:

  • x-str-minimum
  • x-str-maximum
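
As an illustration of how a connector might consume these keywords, here is a minimal sketch. The schema fragment, the choice to write bounds as JSON strings (to avoid precision loss for very large values), the `pick_column_type` helper, and the SQL type names are all assumptions made for this example, not part of the proposal.

```python
# Hypothetical schema fragment for a numeric-formatted string field.
# Encoding the bounds as strings is an assumption of this sketch.
schema = {
    "type": "string",
    "format": "integer",
    "x-str-minimum": "0",
    "x-str-maximum": "18446744073709551615",  # larger than a signed 64-bit integer
}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def pick_column_type(schema: dict) -> str:
    """Choose an illustrative SQL type for a numeric-formatted string field."""
    lo = schema.get("x-str-minimum")
    hi = schema.get("x-str-maximum")
    if schema.get("format") != "integer" or lo is None or hi is None:
        # No usable integer bounds: fall back to an arbitrary-precision type.
        return "NUMERIC"
    if INT64_MIN <= int(lo) and int(hi) <= INT64_MAX:
        return "BIGINT"
    return "NUMERIC"
```

With the fragment above, the upper bound does not fit in 64 bits, so this sketch falls back to the arbitrary-precision type. Narrowing `x-str-maximum` to a value such as `"100"` would select `BIGINT` instead.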

Implementation Details / Requirements

  • Schema Inference Update: Schema inference should begin populating x-str-minimum and x-str-maximum for new locations where type: string is detected together with format: integer or format: number.
  • Backward Compatibility: These new bounds should not be added to established string locations that do not already have them (matching our existing behavior for standard minimum/maximum to prevent breaking existing streams).
  • Validation Enforcement: These must be implemented as strictly enforced validations in the validation crate, not merely as data-plane annotations.
  • Connector Guarantee: This enforcement guarantees that a connector will never receive a document that violates an x-str-maximum or x-str-minimum setting it was provided at startup, giving it the opportunity to halt and apply DDL updates if the schema widens.
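
The enforcement requirement above could look roughly like the following. This is a sketch in Python for brevity, not the validation crate's implementation. The `violates_str_bounds` name, the string-encoded bounds, and the decision to ignore non-string and non-numeric values (leaving them to `type` and `format` validation) are assumptions.

```python
from decimal import Decimal, InvalidOperation


def violates_str_bounds(value, schema: dict) -> bool:
    """Return True if a numeric string falls outside the x-str-* bounds."""
    if not isinstance(value, str):
        return False  # assumption: the keywords apply only to string instances
    try:
        number = Decimal(value)  # exact parse, no 64-bit or float rounding
    except InvalidOperation:
        return False  # assumption: non-numeric strings are handled by `format`
    if number.is_nan():
        return False  # NaN cannot be ordered against a bound
    lo = schema.get("x-str-minimum")
    hi = schema.get("x-str-maximum")
    if lo is not None and number < Decimal(lo):
        return True
    if hi is not None and number > Decimal(hi):
        return True
    return False
```

A document for which this check returns True would fail validation before reaching the connector, which is the point at which the connector can halt and apply DDL changes for the widened schema.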
