feat: expand date/time expression support using codegen dispatcher by andygrove · Pull Request #4417 · apache/datafusion-comet

andygrove · 2026-05-24T16:21:21Z

Which issue does this PR close?

Part of #3202. Supersedes and consolidates #4373.

Expressions covered

Ten Spark date/time expressions are routed through the Arrow-direct codegen dispatcher in this PR. All run inside the Comet pipeline (no operator-level Spark fallback) when spark.comet.exec.scalaUDF.codegen.enabled=true is set. Default behavior is unchanged when the flag is off.

date_format (originally #4373, "feat: route date_format through codegen dispatcher for non-native cases"; native to_char path retained for UTC + whitelisted-format cases, dispatcher path for everything else)
add_months
months_between
make_timestamp
timestamp_millis (MillisToTimestamp)
timestamp_micros (MicrosToTimestamp)
unix_seconds
unix_millis
unix_micros
to_unix_timestamp

Rationale for this change

Comet's plan rules fall back to Spark for any expression that lacks a native serde, breaking up the Comet pipeline at the operator boundary. The Arrow-direct codegen dispatcher already in tree (CometScalaUDFCodegen, behind spark.comet.exec.scalaUDF.codegen.enabled) closure-serializes a bound Catalyst expression, ships it through a JvmScalarUdf proto, and Janino-compiles Spark's own doGenCode into a per-batch kernel that reads Arrow vectors and writes an Arrow output vector directly. For any expression whose doGenCode is real (not CodegenFallback) and whose input/output types fit CometBatchKernelCodegen.isSupportedDataType, routing through this dispatcher reproduces Spark behavior exactly without a bespoke UDF class.

This PR establishes a reusable helper for that routing pattern and applies it to the ten expressions above in two waves: DateFormatClass first (originally #4373, folded in here), then nine currently-unsupported expressions that share the same shape.

What changes are included in this PR?

fix(codegen) — Pre-existing bug surfaced while wiring date_format. CometBatchKernelCodegen.defaultBody emitted this.col$ord.isNull(i) for every NullIntolerant input, but primitive Arrow vectors are wrapped in CometPlainVector at input-cast time and expose isNullAt, not isNull. Janino rejected the kernel with "method isNull not declared". emitTypedGetters already knew the right method name via nullCheckMethod; the fix exposes that helper so defaultBody picks the same name per column ordinal. New source test pins the chosen method for TimeStampMicroTZVector so the regression can't recur.
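The fix can be pictured as a single helper that both emission sites consult. The following is a minimal sketch, not Comet's actual code: the class name, the `ColumnKind` enum, and the two-way split between wrapped and raw vectors are assumptions made for illustration.

```java
// Hypothetical sketch: names here are illustrative, not Comet's actual API.
final class NullCheckSketch {
    /** How an input column is exposed to the generated kernel (assumed split). */
    enum ColumnKind { PLAIN_WRAPPED, RAW_ARROW }

    /** Single source of truth for the null-check method name. */
    static String nullCheckMethod(ColumnKind kind) {
        // Wrapped primitive vectors expose isNullAt; raw Arrow vectors expose isNull.
        return kind == ColumnKind.PLAIN_WRAPPED ? "isNullAt" : "isNull";
    }

    /** Emits the null short-circuit guard for one input ordinal. */
    static String nullGuard(int ord, ColumnKind kind) {
        return "this.col" + ord + "." + nullCheckMethod(kind) + "(i)";
    }
}
```

Because the guard and the typed getters would both call `nullCheckMethod`, they cannot disagree about the method name for a given column.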

feat — dispatcher helper extraction — Extract the closure-serialize + JvmScalarUdf emission from CometScalaUDF.convert into CometScalaUDF.emitJvmCodegenDispatch(expr, inputs, binding) so other serdes can reuse the path. CometDateFormat.convert keeps the native to_char path for UTC + whitelisted-format cases and now calls the helper for everything else. Gated by spark.comet.exec.scalaUDF.codegen.enabled (default false, experimental); default behavior is unchanged.
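The routing decision described for CometDateFormat.convert can be sketched as a three-way choice. This is an illustration under stated assumptions: the `Route` enum, the method name, and the boolean parameters are hypothetical and do not mirror the real serde signature.

```java
// Hypothetical sketch of the date_format routing decision; not the real serde code.
final class DateFormatRouteSketch {
    enum Route { NATIVE_TO_CHAR, CODEGEN_DISPATCH, SPARK_FALLBACK }

    /**
     * @param sessionIsUtc       whether the session timezone is UTC
     * @param formatWhitelisted  whether the format is a literal in the native whitelist
     * @param codegenEnabled     value of spark.comet.exec.scalaUDF.codegen.enabled
     */
    static Route choose(boolean sessionIsUtc, boolean formatWhitelisted, boolean codegenEnabled) {
        if (sessionIsUtc && formatWhitelisted) {
            return Route.NATIVE_TO_CHAR;       // native path is kept as-is
        }
        // Everything else goes to the dispatcher, but only when the flag is on.
        return codegenEnabled ? Route.CODEGEN_DISPATCH : Route.SPARK_FALLBACK;
    }
}
```

With the flag off, every non-native case resolves to `SPARK_FALLBACK`, which matches the statement that default behavior is unchanged.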

feat — CometCodegenDispatch[T] base class — A one-class wrapper around emitJvmCodegenDispatch lets any expression with a real doGenCode slot in with a single-line object declaration. The helper marks the expression Compatible() because the dispatcher runs Spark's own doGenCode inside the kernel; behavior matches Spark exactly when the flag is on, and the operator falls back cleanly when it is off.

feat — nine new datetime serdes — Each of the nine new expressions becomes a one-line singleton extends CometCodegenDispatch[T] and is registered in temporalExpressions. Three are ANSI-sensitive (MakeTimestamp, MillisToTimestamp, ToUnixTimestamp carry failOnError); the dispatcher inherits the throw site from Spark's own doGenCode, so exception semantics propagate without any serde-level branching.

test — MaxSparkVersion annotation — Mirror the existing MinSparkVersion parser gate with a MaxSparkVersion ceiling. The CometSqlFileTestSuite skip logic now reports whether a constraint was a floor or ceiling. This lets fixtures pair a 3.5+ variant against a 3.4 variant where the expected error class wording differs.

Interval-producing expressions (MakeInterval/MakeYMInterval/MakeDTInterval) are explicitly out of scope: the dispatcher's isSupportedDataType does not include Spark's interval types. Version-conditional expressions (TimestampAdd/TimestampDiff 3.4+, DayName 3.5+, MonthName 4.0+) are deferred to a follow-on so this PR avoids touching the CometExprShim files.

Scaffolded with the superpowers:brainstorming and superpowers:writing-plans skills.

How are these changes tested?

CometTemporalExpressionSuite date_format tests: 10/10 pass. Three "falls back to Spark" tests are paired with a "routes via codegen dispatcher" sibling that enables the flag and asserts in-Comet execution. The "date_format - timestamp_ntz input" test runs checkSparkAnswerAndOperator for every timezone under the codegen flag.
CometSqlFileTestSuite: nine new per-expression SQL fixtures (add_months.sql, months_between.sql, make_timestamp.sql, timestamp_millis.sql, timestamp_micros.sql, unix_seconds.sql, unix_millis.sql, unix_micros.sql, to_unix_timestamp.sql) pin a non-UTC session timezone and the codegen flag at file scope. date_format.sql from #4373 is included.
ANSI exception coverage: paired fixtures make_timestamp_ansi.sql / make_timestamp_ansi_spark34.sql, to_unix_timestamp_ansi.sql / to_unix_timestamp_ansi_spark34.sql, plus timestamp_millis_ansi.sql. The Spark 3.5+ files match the DATETIME_FIELD_OUT_OF_BOUNDS / CANNOT_PARSE_TIMESTAMP error classes; the 3.4 files match the underlying JDK java.time exception text (MonthOfYear, Invalid date, HourOfDay, could not be parsed). Each ANSI file includes a sentinel non-error query that uses checkSparkAnswerAndOperator so a silent dispatcher fallback would fail the file (the expect_error path alone passes vacuously because Spark fallback also throws).
CometCodegenSourceSuite: includes the new "NullIntolerant short-circuit uses isNullAt for CometPlainVector-wrapped columns" regression test plus a parameterized "Bucket 4 datetime expressions produce non-empty generated kernel source" test covering all nine new expressions.
CometCodegenSuite: no regressions in the dispatcher's existing surface.
Local Spark 3.5 run: 284/284 SQL fixtures pass (3.4 variants correctly skip). Local Spark 3.4 run with -Pspark-3.4: 284/284 SQL fixtures pass (3.5+ variants correctly skip).

…l short-circuit CometBatchKernelCodegen.defaultBody emitted this.col$ord.isNull(i) for every NullIntolerant input, but primitive Arrow vectors (timestamp / int / float / date / bool / ...) are wrapped in CometPlainVector at input-cast time and expose isNullAt rather than the raw Arrow isNull. The short-circuit therefore failed to compile for any primitive-typed column with a Janino "method isNull not declared" error. Share the existing nullCheckMethod helper between emitTypedGetters and defaultBody so both sites pick the right method name per column. Add a source test that pins the chosen method for TimeStampMicroTZVector inputs.

CometDateFormat keeps the native to_char path for UTC sessions with a format literal in the strftime-mappable whitelist, and now routes every other case through the Arrow-direct codegen dispatcher (CometScalaUDFCodegen) so that non-UTC sessions, non-literal formats, and formats outside the whitelist stay inside the Comet pipeline running Spark's own DateFormatClass.doGenCode.

Refactor: extract the closure-serialize + JvmScalarUdf-proto emission from CometScalaUDF.convert into a reusable CometScalaUDF.emitJvmCodegenDispatch helper. Any serde that wants to fall back to a Spark built-in expression through the dispatcher can call it. Gated by COMET_SCALA_UDF_CODEGEN_ENABLED so the default remains a clean Spark fallback for those cases until the dispatcher graduates from experimental.

Reasoning notes:
- DateFormatClass already has a proper doGenCode (not CodegenFallback), NullIntolerant, and ResolveTimeZone stamps the timeZoneId on it during analysis. Closure-serializing the bound tree therefore reproduces Spark-identical behavior for every timezone.
- The kernel cache key already encodes the literal format and timezone via the serialized expression bytes, so (format, tz) combinations get distinct cached kernels just like a bespoke (format, tz) -> formatter cache would. Saves an entire DateFormatUDF.scala class.

Tests:
- date_format - timestamp_ntz input: now runs checkSparkAnswerAndOperator for every timezone under the codegen flag instead of falling back for non-UTC.
- Split each previous "falls back to Spark" Scala test into two: one asserting the codegen-on path stays in Comet, one asserting the codegen-off path falls back with the dispatcher flag as the reason.
- date_format.sql now pins a non-UTC session timezone and enables the codegen flag at file scope; all queries are plain query and assert in-Comet execution.
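The cache-key point above can be illustrated with a small cache keyed by the serialized expression bytes. This is a sketch under assumptions: the class name, the generic kernel type, and the `compileCount` counter are hypothetical, and the real kernelCache may be structured differently.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a kernel cache keyed by serialized expression bytes.
final class KernelCacheSketch<K> {
    private final Map<ByteBuffer, K> cache = new HashMap<>();
    private final Function<byte[], K> compiler;
    private int compileCount = 0;

    KernelCacheSketch(Function<byte[], K> compiler) {
        this.compiler = compiler;
    }

    /** Returns the cached kernel for these bytes, compiling only on a miss. */
    K kernelFor(byte[] serializedExpr) {
        // ByteBuffer compares by content, so equal byte arrays share one entry.
        ByteBuffer key = ByteBuffer.wrap(serializedExpr.clone());
        K kernel = cache.get(key);
        if (kernel == null) {
            compileCount++;
            kernel = compiler.apply(serializedExpr);
            cache.put(key, kernel);
        }
        return kernel;
    }

    int compileCount() {
        return compileCount;
    }
}
```

Since the format literal and timezone are part of the serialized bytes, two (format, tz) combinations would produce different keys and therefore separate entries, with no extra key fields needed.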

The CometScalaUDF fallback message was generalized from 'ScalaUDF has no native path' to 'expression has no native path' when the dispatcher helper was extracted for reuse by CometDateFormat.

- Drop the getCompatibleNotes override on CometCodegenDispatch. The docs generator emits compat notes under a heading promising 'no additional configuration', which contradicts a note describing the dispatcher flag. Keep getSupportLevel=Compatible and surface the flag dependency via withInfo / EXPLAIN instead.
- Add a sentinel non-error query to each *_ansi.sql fixture. The expect_error semantics pass vacuously when the dispatcher silently falls back to Spark (both paths throw identical exceptions); the sentinel uses checkSparkAnswerAndOperator and fails if Comet did not run the expression natively.
- Pin spark.sql.legacy.timeParserPolicy=CORRECTED in to_unix_timestamp_ansi.sql so the JDK java.time formatter is exercised regardless of runtime default; LEGACY policy uses SimpleDateFormat with a different exception class.
- Annotate the three ANSI fixtures with MinSparkVersion: 3.5 since the DATETIME_FIELD_OUT_OF_BOUNDS and CANNOT_PARSE_TIMESTAMP error classes were standardized in Spark 3.5. Spark 3.4 coverage is delivered separately.
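The sentinel requirement for ANSI fixtures can be expressed as a simple predicate over a parsed fixture. The sketch below is illustrative only: the `QueryMode` enum, the class name, and the method signature are assumptions, not the test framework's real types.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "expect_error needs a sentinel" rule.
final class SentinelRuleSketch {
    enum QueryMode { EXPECT_ERROR, CHECK_ANSWER_AND_OPERATOR, OTHER }

    static final String CODEGEN_FLAG = "spark.comet.exec.scalaUDF.codegen.enabled";

    /**
     * True when a fixture enables the codegen flag, contains an expect_error
     * query, and has no query that asserts in-Comet execution.
     */
    static boolean violatesSentinelRule(Map<String, String> configs, List<QueryMode> modes) {
        boolean codegenOn = "true".equalsIgnoreCase(configs.get(CODEGEN_FLAG));
        boolean hasExpectError = modes.contains(QueryMode.EXPECT_ERROR);
        boolean hasSentinel = modes.contains(QueryMode.CHECK_ANSWER_AND_OPERATOR);
        return codegenOn && hasExpectError && !hasSentinel;
    }
}
```

A fixture that only contains expect_error queries under the flag would be flagged, because a silent fallback to Spark would throw the same error and the file would pass without exercising the dispatcher.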

Mirror the existing MinSparkVersion gate with a MaxSparkVersion gate so SQL fixtures can pair a 3.5+ variant (using post-3.5 error class names) with a 3.4 variant (using the pre-classification JDK java.time exception text). The make_timestamp and to_unix_timestamp ANSI exception paths produce different exception wording on Spark 3.4 versus 3.5+; before this commit only the 3.5+ side had coverage and 3.4 ANSI behavior went untested.

Framework:
- SqlTestFile gains maxSparkVersion: Option[String].
- SqlFileTestParser recognizes -- MaxSparkVersion: lines.
- CometSqlFileTestSuite gains meetsMaxSparkVersion / skipReason helpers; the skip-and-log path now reports whether the constraint was a floor or ceiling.

Coverage:
- make_timestamp_ansi_spark34.sql: MaxSparkVersion: 3.4, expect_error patterns target the JDK DateTimeException field-name text (MonthOfYear, Invalid date, HourOfDay) which is stable in 3.4's pre-classification error path.
- to_unix_timestamp_ansi_spark34.sql: MaxSparkVersion: 3.4, expect_error pattern targets the JDK DateTimeParseException 'could not be parsed' wording.
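A version ceiling of this kind can be sketched as a header-line parser plus a major.minor comparison. The code below is an assumption-laden illustration: the class name and the comparison rule (only major and minor are compared) are hypothetical, and the real parser and helpers are written in Scala and may differ.

```java
import java.util.Optional;

// Hypothetical sketch of a MaxSparkVersion gate; not the real parser.
final class SparkVersionGateSketch {
    private static final String MAX_PREFIX = "-- MaxSparkVersion:";

    /** Returns the ceiling declared by a fixture header line, if any. */
    static Optional<String> parseMaxSparkVersion(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith(MAX_PREFIX)) {
            return Optional.empty();
        }
        return Optional.of(trimmed.substring(MAX_PREFIX.length()).trim());
    }

    /** Compares major.minor numerically; further components are ignored. */
    static int compareMajorMinor(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < 2; i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** A fixture runs when no ceiling is set or the running version is at or below it. */
    static boolean meetsMaxSparkVersion(String running, Optional<String> max) {
        return !max.isPresent() || compareMajorMinor(running, max.get()) <= 0;
    }
}
```

Under this rule a fixture marked `-- MaxSparkVersion: 3.4` would run on 3.4.3 and be skipped on 3.5.8, which is the pairing behavior the commit describes.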

CI failed on Spark 3.5.8 because the executor-thrown SparkDateTimeException's getMessage() does NOT preserve the driver-formatted '[DATETIME_FIELD_OUT_OF_BOUNDS]' error-class prefix; only the inner JDK message ('Invalid value for MonthOfYear ...', 'Invalid date FEBRUARY 30', 'Invalid value for HourOfDay ...') survives the 'Job aborted ... Lost task ... SparkDateTimeException: <inner>' wrapping that shows up in the test's caught exception. Switching to the JDK java.time field-name substrings (MonthOfYear, Invalid date, HourOfDay) makes the assertions stable across Spark 3.4, 3.5.x, and 4.x without needing a MinSparkVersion gate, so the make_timestamp_ansi_spark34.sql variant becomes redundant and is deleted in the same commit. Verified locally: passes under -Pspark-3.4 (3.4.3) and -Pspark-3.5 (3.5.8).
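The substrings chosen for the assertions are the ones the JDK's java.time validation puts in its messages. The sketch below triggers the same kind of field validation directly with `LocalDate.of` and `LocalTime.of`; it assumes, without confirming from Spark's source, that the inner messages surfaced by Spark originate from equivalent java.time calls. The class and method names are hypothetical.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.LocalTime;

// Sketch: reproduce java.time validation messages outside Spark.
final class JdkMessageSketch {
    /** Runs the action and returns the DateTimeException message, or null if none was thrown. */
    static String messageOf(Runnable action) {
        try {
            action.run();
            return null;
        } catch (DateTimeException e) {
            return e.getMessage();
        }
    }

    static String invalidMonth() { return messageOf(() -> LocalDate.of(2024, 13, 1)); }

    static String invalidDay() { return messageOf(() -> LocalDate.of(2023, 2, 30)); }

    static String invalidHour() { return messageOf(() -> LocalTime.of(25, 0)); }

    /** Substring match, mirroring how an expect_error pattern would be checked. */
    static boolean matches(String message, String expectedSubstring) {
        return message != null && message.contains(expectedSubstring);
    }
}
```

Matching on `MonthOfYear`, `Invalid date`, and `HourOfDay` depends only on the JDK text, not on any Spark error-class prefix, which is why the same assertions can hold across Spark versions.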

mbutrovich · 2026-05-26T15:03:56Z

Awesome to see the codegen framework being put to good use, and bugs being found and fixed!

A couple of things on the broader dispatcher pattern worth thinking about before the second wave of expressions lands.

The sentinel query after each expect_error block is cool. I checked make_timestamp_ansi.sql, timestamp_millis_ansi.sql, and to_unix_timestamp_ansi.sql and the sentinel is in all three. The concern is that nothing in CometSqlFileTestSuite or SqlFileTestParser enforces "any fixture combining expect_error with spark.comet.exec.scalaUDF.codegen.enabled=true must include a non-error sentinel using checkSparkAnswerAndOperator." As more codegen-dispatched ANSI expressions adopt this fixture pattern, a forgotten sentinel would let a regression silently pass. Could the parser flag that shape, or at minimum could a comment in SqlFileTestParser document the requirement so the next contributor isn't relying on conventions absorbed by osmosis?

On the cache-key side, three of the new expressions are failOnError-sensitive (MakeTimestamp, MillisToTimestamp, ToUnixTimestamp), MonthsBetween carries a roundOff flag, and several are timezone-aware. The dispatcher caches kernels by closure-serialized bytes, which should naturally differentiate them, but I don't see a test pinning that two plans of the same expression class differing only in failOnError (or roundOff, or session timezone) compile to distinct kernels rather than colliding in the cache. A small unit test in CometScalaUDFCodegen's suite that asserts the serialized bytes diverge for those pairs would lock in the invariant cheaply. Worth adding while the shapes are still fresh?

Address review feedback on Bucket 4 codegen PR:
- `CometSqlFileTestSuite.requireSentinelForCodegenExpectError` rejects any fixture combining `expect_error` with the codegen flag unless at least one non-error sentinel query is present. Documents the rule on `ExpectError`. Without the sentinel a silent dispatcher fallback to Spark would let the `expect_error` queries pass vacuously.
- New `CometCodegenSourceSuite` test serializes pairs of bound expressions that differ only in `failOnError` (MakeTimestamp, ToUnixTimestamp), `roundOff` (MonthsBetween), or `timeZoneId` and asserts the closure-serialized bytes diverge. Pins the invariant that the dispatcher caches these variants under distinct keys rather than colliding.
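The byte-divergence invariant can be demonstrated with plain Java serialization on a stand-in object. This is a sketch only: `BoundExprStandIn` is a hypothetical class, not a Catalyst expression, and the real test serializes actual bound Spark expressions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical sketch: serialized bytes differ when a flag or timezone differs.
final class CacheKeySketch {
    /** Stand-in for a bound expression; not a Spark class. */
    static final class BoundExprStandIn implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final boolean failOnError;
        final String timeZoneId;

        BoundExprStandIn(String name, boolean failOnError, String timeZoneId) {
            this.name = name;
            this.failOnError = failOnError;
            this.timeZoneId = timeZoneId;
        }
    }

    /** Java-serializes the object into a byte array. */
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Two expressions share a cache key only if their serialized bytes are equal. */
    static boolean sameKey(Serializable a, Serializable b) {
        return Arrays.equals(serialize(a), serialize(b));
    }
}
```

Two stand-ins that agree on every field serialize to the same bytes, while flipping `failOnError` or changing `timeZoneId` changes the bytes and hence the key.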

mbutrovich

I think expressions.md needs updating, but that's just a docs PR, no reason to kick CI again on this PR. Awesome to see more use for the codegen path!

andygrove · 2026-05-26T20:57:31Z

Merged. Thanks @mbutrovich. I will follow up with a docs PR.

…of hand-written UDFs

Replace the six hand-written `RegExp*UDF` / `StringSplitUDF` JVM UDF implementations with the Arrow-direct codegen dispatcher introduced in PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher Janino-compiles Spark's own `doGenCode` for the expression, so the regex family inherits Spark-identical semantics with no per-expression glue code.

Changes:
- Delete `spark/src/main/scala/org/apache/comet/udf/RegExp*UDF.scala` and `StringSplitUDF.scala`. Their behavior is now provided by Spark's `doGenCode` running inside the dispatcher.
- Rewrite the regex serdes in `strings.scala`. Expressions with no native Rust path (`RegExpExtract`, `RegExpExtractAll`, `RegExpInStr`) share a new `CometRegexpCodegenOnly` base; expressions with a native path (`RLike`, `RegExpReplace`, `StringSplit`) keep an explicit route table where the JVM arm now delegates to `CometScalaUDF.emitJvmCodegenDispatch`.
- Drop the `spark.comet.jvmUdf.enabled` config. The codegen dispatcher already has its own master switch (`spark.comet.exec.scalaUDF.codegen.enabled`); gating the regex family on the same flag avoids two flags for the same path. `spark.comet.exec.regexp.engine` keeps the `java`/`rust` selector semantics, and `engine=java` now requires the codegen flag.
- Revert the native Rust additions in `jvm_udf/mod.rs` and `jni-bridge/src/lib.rs`. The codegen dispatcher constructs Arrow output fields JVM-side via `CometBatchKernelCodegenOutput.toFfiArrowField`, so the list-vector field-name normalization cast is unnecessary.
- Update `CometRegExpJvmSuite`, `CometRegExpBenchmark`, the regex SQL test fixtures, and the regex compatibility doc to reflect the new gating.

Test plan:
- `CometRegExpJvmSuite`: 45/45 pass (covers all six regex expressions through the codegen dispatcher).
- `CometSqlFileTestSuite`: 289/289 pass.
- `CometStringExpressionSuite`: 33/33 pass.
- `CometCodegenSuite`: 60/60 pass.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.

…f hand-written UDFs

Replace the three hand-written `GetJsonObjectUDF` / `FromJsonUDF` / `ToJsonUDF` JVM UDF implementations and the `CometLambdaRegistry` indirection with the Arrow-direct codegen dispatcher introduced in PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher Janino-compiles Spark's own `doGenCode` (or `eval(row)` for CodegenFallback expressions) so the JSON family inherits Spark-identical semantics with no per-expression glue.

Changes:
- Delete the three hand-written UDF files under `spark/src/main/scala/org/apache/comet/udf/` and their unit-test suites. The codegen dispatcher's per-task `kernelCache` provides the same per-thread isolation that `CometLambdaRegistry` was working around.
- Rewrite the JSON serdes (`CometGetJsonObject` in `strings.scala`, `CometStructsToJson` and `CometJsonToStructs` in `structs.scala`) to go through a new `JsonRoute` helper. `engine=rust` keeps the native path; `engine=java` delegates to `CometScalaUDF.emitJvmCodegenDispatch` when `spark.comet.exec.scalaUDF.codegen.enabled=true`.
- Generalize the codegen dispatcher to accept `CodegenFallback` expressions. `CodegenFallback.doGenCode` emits `references[N].eval(row)`, the same shape the `HigherOrderFunction` carve-out already relied on; lifting the rejection lets `JsonToStructs` and `StructsToJson` (which are `CodegenFallback` in Spark 4) ride the same path.
- Unwrap `RuntimeReplaceable` expressions inside `CometScalaUDF.emitJvmCodegenDispatch` before binding. Spark 4's `StructsToJson` is `RuntimeReplaceable` and its `doGenCode` throws "Cannot generate code for expression"; calling `.replacement` gives the `Invoke(StructsToJsonEvaluator, ...)` form that does codegen.
- Update the JSON compatibility doc and the `CometJsonJvmSuite` config to reference the codegen flag.

Test plan:
- `CometJsonJvmSuite`: 3/3 pass (get_json_object, from_json, to_json round-trip via the codegen dispatcher).
- `CometJsonExpressionSuite`: 8/8 pass on the unchanged native path.
- `CometStringExpressionSuite`: 33/33, `CometCodegenSuite`: 60/60, `CometCodegenSourceSuite`: 50/50, `CometSqlFileTestSuite`: 284/284.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.

parthchandra · 2026-05-26T23:28:16Z

Some post merge comments -

This really should have been two PRs.
The codegen dispatcher is incredibly powerful but we must be careful to use it only when necessary. It's terrific that the implemented functions cause a host of issues, but perhaps we can restrict future conversions to those cases where compatible native implementations are impossible?

andygrove · 2026-05-27T00:53:37Z

> Some post merge comments -
>
> This really should have been two PRs.
>
> The codegen dispatcher is incredibly powerful but we must be careful to use it only when necessary. It's terrific that the implemented functions cause a host of issues, but perhaps we can restrict future conversions to those cases where compatible native implementations are impossible?

This is just adding a cheaper fallback for expressions that are not yet implemented. I'm happy to review PRs to add native implementations and remove the fallbacks.

andygrove added 17 commits May 23, 2026 09:47

Merge remote-tracking branch 'apache/main' into feat/date-format-jvm-udf

4823f33

test: update ArrayInsertUnsupportedArgs fallback reason wording

7cbb8e4

feat(serde): add CometCodegenDispatch helper for codegen-routed expressions

7f781e1

feat(datetime): route AddMonths through codegen dispatcher [skip ci]

34de47c

feat(datetime): route MonthsBetween through codegen dispatcher [skip ci]

15b4193

feat(datetime): route MakeTimestamp through codegen dispatcher [skip ci]

4d352fb

feat(datetime): route MillisToTimestamp through codegen dispatcher [skip ci]

6de2d2d

feat(datetime): route MicrosToTimestamp through codegen dispatcher [skip ci]

d22e355

feat(datetime): route UnixSeconds through codegen dispatcher [skip ci]

2d0a117

feat(datetime): route UnixMillis through codegen dispatcher [skip ci]

2826f15

feat(datetime): route UnixMicros through codegen dispatcher [skip ci]

f4e1c0e

feat(datetime): route ToUnixTimestamp through codegen dispatcher [skip ci]

eebeef8

test(datetime): ANSI fixtures for codegen-routed throw-capable expressions [skip ci]

771f367

test(codegen): unit coverage for Bucket 4 datetime kernel source [skip ci]

5871339

style: apply spotless formatting

fadcbfe

andygrove changed the title ~~feat: route Bucket 4 date/time expressions through codegen dispatcher~~ feat: Implement some new date/time expressions through codegen dispatcher May 24, 2026

andygrove mentioned this pull request May 24, 2026

[EPIC] Implement all Spark date/time expressions #4418

Open

87 tasks

andygrove marked this pull request as draft May 24, 2026 17:53

andygrove added 3 commits May 24, 2026 12:02

style: apply spotless to MaxSparkVersion javadoc

1b952e7

andygrove force-pushed the feat/datetime-codegen-bucket4 branch from a3923d4 to 1b952e7 Compare May 24, 2026 18:35

andygrove changed the title ~~feat: Implement some new date/time expressions through codegen dispatcher~~ feat: route date_format and 9 other date/time expressions through codegen dispatcher May 24, 2026

andygrove mentioned this pull request May 24, 2026

feat: route date_format through codegen dispatcher for non-native cases #4373

Closed

andygrove changed the title ~~feat: route date_format and 9 other date/time expressions through codegen dispatcher~~ feat: expand date/time expression support using codegen dispatcher May 24, 2026

andygrove marked this pull request as ready for review May 24, 2026 20:26

mbutrovich self-requested a review May 26, 2026 14:00

mbutrovich approved these changes May 26, 2026

View reviewed changes

andygrove merged commit f383a0c into apache:main May 26, 2026
62 checks passed

andygrove deleted the feat/datetime-codegen-bucket4 branch May 26, 2026 20:57

andygrove mentioned this pull request May 26, 2026

docs: list date/time expressions added in #4417 #4443

Merged

1 task

andygrove added a commit that referenced this pull request May 26, 2026

docs: list date/time expressions added in #4417 (#4443)

d6c402a

andygrove mentioned this pull request May 26, 2026

feat: experimental Spark regex support via codegen dispatcher #4239

Open

andygrove mentioned this pull request May 26, 2026

feat: experimental Spark JSON support via codegen dispatcher #4305

Draft

feat: expand date/time expression support using codegen dispatcher #4417
andygrove merged 22 commits into apache:main from andygrove:feat/datetime-codegen-bucket4

3 participants

Conversation

andygrove commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Expressions covered

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

mbutrovich commented May 26, 2026

Uh oh!

mbutrovich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

andygrove commented May 26, 2026

Uh oh!

parthchandra commented May 26, 2026

Uh oh!

andygrove commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andygrove commented May 24, 2026 •

edited

Loading