Context
Currently, schema inference for integer fields rounds inferred bounds up to powers of 10 to minimize schema churn. For example, an observed value of 3 sets the inferred maximum to 10, and an observed value of 12 sets the maximum to 100. This coarse granularity is by design, as updating inferred schemas too frequently is undesirable.
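As a rough illustration of the rounding described above, here is a minimal Python sketch. The function name `round_up_pow10` is hypothetical, and the handling of exact powers of 10 (left unchanged here) and of negative values (rejected here) is an assumption, since only the 3 and 12 examples are given.

```python
def round_up_pow10(value: int) -> int:
    """Return the smallest power of 10 that is >= value.

    Hypothetical helper: only non-negative maxima are sketched here, and
    exact powers of 10 are assumed to map to themselves.
    """
    if value < 0:
        raise ValueError("this sketch only covers non-negative values")
    bound = 1
    while bound < value:
        bound *= 10
    return bound


print(round_up_pow10(3))   # 10
print(round_up_pow10(12))  # 100
```

Under this scheme the inferred maximum only changes when an observed value crosses a power of 10, which is what keeps schema updates infrequent.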
The Problem
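The tension with this approach is that materializations rely on these inferred values to determine what kind of column to create. If the power-of-10 rounding pushes an inferred maximum beyond the limits of an i64 (64-bit signed integer), the system will allocate a needlessly wide column type. Because max(i64) is 9,223,372,036,854,775,807 (about 9.22 × 10^18), any observed value above 10^18 rounds up to 10^19, which no longer fits in an i64 even when the value itself does.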
At these specific type boundaries, exactness is critical. We only want an inferred maximum to exceed max(i64) if the actual data strictly requires it. Other useful cutoffs to consider are the 128-bit and 256-bit boundaries, though materializations less commonly handle these separately.
Proposed Solution
Maintain the power-of-10 granularity for the vast majority of cases to prevent schema churn, but implement specific bounds testing for critical numeric thresholds (e.g., i64 limits). This will prevent the inference engine from artificially over-widening column types during materialization.
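One way the proposal could look is sketched below in Python. This is an illustration under stated assumptions, not the actual inference code: the names `round_up_pow10`, `THRESHOLDS`, and `inferred_max` are hypothetical, the 128-bit and 256-bit cutoffs are assumed to be signed limits, and only the maximum (non-negative) side is shown.

```python
# Assumed signed maxima for the 64-, 128-, and 256-bit boundaries.
THRESHOLDS = (2**63 - 1, 2**127 - 1, 2**255 - 1)


def round_up_pow10(value: int) -> int:
    """Smallest power of 10 that is >= value (non-negative values only)."""
    if value < 0:
        raise ValueError("this sketch only covers non-negative values")
    bound = 1
    while bound < value:
        bound *= 10
    return bound


def inferred_max(value: int) -> int:
    """Round up to a power of 10, but stop at a type boundary if the
    rounding alone would cross it."""
    rounded = round_up_pow10(value)
    for limit in THRESHOLDS:
        # The value fits under the limit, yet the rounded bound does not:
        # clamp to the limit instead of over-widening.
        if value <= limit < rounded:
            return limit
    return rounded


print(inferred_max(12) == 100)                  # True: ordinary rounding
print(inferred_max(2 * 10**18) == 2**63 - 1)    # True: clamped to max(i64)
print(inferred_max(2**63) == 10**19)            # True: data really exceeds i64
```

The clamp only applies in the narrow band between the last power of 10 below a limit and the limit itself, so the churn-reducing behaviour is unchanged everywhere else.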