Skip to content

[SPARK-54908][SQL] Support dateFormat option during JSON schema inference#55417

Open
yadavay-amzn wants to merge 1 commit intoapache:masterfrom
yadavay-amzn:fix/SPARK-54908-json-date-infer
Open

[SPARK-54908][SQL] Support dateFormat option during JSON schema inference#55417
yadavay-amzn wants to merge 1 commit intoapache:masterfrom
yadavay-amzn:fix/SPARK-54908-json-date-infer

Conversation

@yadavay-amzn
Copy link
Copy Markdown

What changes were proposed in this pull request?

Add date type inference in JsonInferSchema using the dateFormat option, mirroring existing timestamp inference.

Why are the changes needed?

The dateFormat option is documented but has no effect during schema inference, while timestampFormat works correctly.

Does this PR introduce any user-facing change?

Yes, JSON schema inference now correctly infers DateType when dateFormat is specified.

How was this patch tested?

Added unit tests verifying date inference with custom dateFormat.

Was this patch authored or co-authored using generative AI tooling?

Yes

// this behavior now. See SPARK-46769 for more details.
if (SparkDateTimeUtils.stringToTimestampWithoutTimeZone(utf8Value, false).isDefined) {
TimestampType
} else if (allCatch.opt(dateFormatter.parse(field)).isDefined) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem here is the performance. It becomes too slower

Copy link
Copy Markdown
Author

@yadavay-amzn yadavay-amzn Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Updated with two changes:

  1. Date inference only runs when the user explicitly sets dateFormatInRead. When not set the code path is never reached.

  2. Added DateFormatter.parseOptional() that uses DateTimeFormatter.parseUnresolved() instead of exception-based control flow. So for users who do set dateFormat, the performance is comparable to the existing timestamp inference overhead.

Please let me know if you have any suggestions on how it can be improved. Thanks!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thing is that we should also support timestamp format too. How would we handle the precedence? Suppose that both are set differently.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code already handles this — timestamp is tried first, date only if timestamp doesn't
match any of the following:

  1. TimestampNTZType (if default NTZ + matches timestampNTZ format)
  2. TimestampType (if matches timestamp format)
  3. DateType (if matches date format)
  4. StringType (fallback)

So if both timestampFormat and dateFormat are set and a string matches both, timestamp wins. Date
inference only kicks in for strings that fail all timestamp parsers.

That said — should we also gate timestamp inference on an explicit timestampFormat being set, for
consistency? Currently it's controlled by inferTimestamp (a separate boolean config). Happy to align
the two if you think that's the right approach.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the explicit test cases? E.g., timstamp format as YYYY and date format YYYY-DD, vise versa. I want to make sure that this all works sound with all type coercion

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added explicit test cases covering the precedence and type coercion:

  1. Timestamp takes precedence over date — when both formats match the same string (e.g., "2024" with timestampFormat=yyyy and dateFormat=yyyy), TimestampType wins.
  2. Date inferred when timestamp does not match"2024-07-02" with timestampFormat=yyyy-MM-dd HH:mm:ss and dateFormat=yyyy-MM-ddDateType.
  3. No date inference without explicit dateFormat — date-like strings stay StringType when dateFormat is not set.
  4. Type coercion across rows — row 1 infers DateType, row 2 infers TimestampType → merged to TimestampType.

Also added the missing DateType + TimestampType → TimestampType coercion rule (previously only had DateType + TimestampNTZType → TimestampNTZType).

All 5 tests pass on both JsonV1Suite and JsonV2Suite.

@yadavay-amzn yadavay-amzn force-pushed the fix/SPARK-54908-json-date-infer branch 2 times, most recently from e477cdc to ef8d0bb Compare April 20, 2026 07:16
override def parseOptional(s: String): Option[Int] = {
val pp = new java.text.ParsePosition(0)
val parsed = formatter.parseUnresolved(s, pp)
if (parsed != null && pp.getErrorIndex == -1 && pp.getIndex == s.length) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds parseOptional to DateFormatter to mirror TimestampFormatter.parseOptional. The override in Iso8601DateFormatter uses parseUnresolved instead of exception-based control
flow, which is faster for non-matching strings.
While the inferDate eliminates cost for users who don't set dateFormat, users who do set it will run this on every string value, so I added this since original concern was with slowness and this brings consistency across TimestampFormatter and DateFormatter.

Happy to drop it though, if you'd prefer to keep the scope smaller and don't think this is needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that why do we need this null index s.length is required.


test("SPARK-54908: timestamp takes precedence over date when both formats match") {
// When a string matches both timestampFormat and dateFormat, timestamp wins
val ds = Seq("""{"v": "2024"}""").toDS()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have multiple values with explicit repartition so the type coercion result is always same?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

.schema
// DateType from row 1 + TimestampType from row 2 should merge
assert(schema("v").dataType === TimestampType ||
schema("v").dataType === TimestampNTZType)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it || or? Can we make the test determinstic

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@yadavay-amzn yadavay-amzn force-pushed the fix/SPARK-54908-json-date-infer branch from ef8d0bb to 155ea69 Compare April 20, 2026 23:49
.option("inferTimestamp", true)
.json(ds)
.schema
// Default timestampFormat may or may not match; but without dateFormat, no DateType
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the result type should here be deterministic?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — changed to use "02/Jul/2024" which does not match any default timestamp parser, asserting StringType.

"""{"v": "2024-07-02 10:30:00"}""",
"""{"v": "2024-08-15"}""",
"""{"v": "2024-08-15 14:00:00"}"""
).toDS().repartition(2)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this a little more complicated, e.g.,

    val ds = Seq(
      """{"v": "2024-07-02"}""",
      """{"v": "2024-07-02 10:30:00"}""",
      """{"v": "2024-08-15"}""",
      """{"v": "2024-08-15 14:00:00"}"""
      """{"v": "2024-08-15 14:00:00"}"""
      """{"v": "2024-08-15"}""",
    ).toDS().repartition(4)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

val pp = new java.text.ParsePosition(0)
val parsed = formatter.parseUnresolved(s, pp)
// pp.getIndex == s.length ensures the entire string was consumed,
// not just a prefix like "2024-07-02" matching "yyyy-MM-dd" in "2024-07-02 extra"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not just a prefix like "2024-07-02" matching "yyyy-MM-dd" in "2024-07-02 extra"

So 2024-07-02 extra was parsed properly but we are blocking this now, right?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, parseUnresolved parses the prefix successfully but we require full string consumption via pp.getIndex == s.length. This matches the behavior of DateTimeFormatter.parse() which also rejects "2024-07-02 extra" — so no behavior change, just a faster path.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon Any concerns with the above format being blocked now?
Just double checking to make sure I'm not misunderstanding something. Please let me know if you're not aligned on this behavior and it should be changed.
Thanks!

…ence

Add date type inference in JsonInferSchema using the dateFormat option,
mirroring existing timestamp inference. When inferTimestamp is enabled,
strings that don't match timestamp formats are now also tried against
the dateFormat. If they match, DateType is inferred instead of StringType.

Also adds DateType vs TimestampNTZType compatibility in compatibleType
so that merging these types produces TimestampNTZType (matching the
existing DateType vs TimestampType -> TimestampType precedence).
@yadavay-amzn yadavay-amzn force-pushed the fix/SPARK-54908-json-date-infer branch from 155ea69 to e44055a Compare April 21, 2026 00:27
@yadavay-amzn yadavay-amzn marked this pull request as ready for review April 29, 2026 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants