Fix duplicate field error when partition column already exists in schema #587

Open
simbadzina wants to merge 2 commits into linkedin:master from simbadzina:sdzinama/fix-datepartition-duplication

@simbadzina
Collaborator

@simbadzina simbadzina commented Mar 25, 2026

Summary

  • addPartitionColsToSchema() blindly appends partition columns without checking if they already exist in the schema, causing AvroRuntimeException: Duplicate field X in record
  • This can happen when a Hive view projects a partition column as a regular field — the schema already contains the field, and addPartitionColsToSchema() tries to add it again
  • The fix collects existing field names into a Set and skips partition columns that are already present
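
The check described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it models a schema as an ordered list of field names instead of an Avro Schema, and the class name PartitionColumnDedup and the method addPartitionCols are hypothetical. It also assumes an exact, case-sensitive name match, which the description does not specify.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the dedup step: a schema is modeled here as an
// ordered list of field names, not as an Avro Schema.
final class PartitionColumnDedup {

  private PartitionColumnDedup() {
  }

  // Returns the schema fields followed by every partition column whose name
  // is not already present. Assumption: names are compared case-sensitively.
  static List<String> addPartitionCols(List<String> schemaFields, List<String> partitionCols) {
    List<String> result = new ArrayList<>(schemaFields);
    Set<String> existing = new HashSet<>(schemaFields);
    for (String col : partitionCols) {
      // Set.add returns false when the name is already in the set,
      // so a column that is already present is skipped.
      if (existing.add(col)) {
        result.add(col);
      }
    }
    return result;
  }
}
```

With schema fields [id, datepartition] and partition columns [datepartition, hour], this sketch returns [id, datepartition, hour]: datepartition is skipped and hour is appended.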

Changes

  • SchemaUtilities.addPartitionColsToSchema(): Added deduplication check before adding partition columns
  • SchemaUtilitiesTests: Added two tests — one verifying duplicates are skipped, one verifying normal partition column addition still works

Test plan

  • New test testAddPartitionColsToSchemaSkipsDuplicates fails without fix, passes with fix
  • New test testAddPartitionColsToSchemaAddsNewPartitionCol confirms normal behavior unchanged
  • Existing partition tests pass (testBaseTableWithPartition, testSelectStarWithPartition, testSelectPartitionColumn, testUnionSelectStarFromPartitionTable)

@simbadzina simbadzina changed the title [Coral-Schema] Fix duplicate field error when partition column alread… Fix duplicate field error when partition column already exists in schema Mar 25, 2026
@simbadzina simbadzina force-pushed the sdzinama/fix-datepartition-duplication branch 2 times, most recently from 477c6b6 to 614ad37 Compare March 25, 2026 01:16
@simbadzina simbadzina marked this pull request as ready for review March 25, 2026 01:23
@ruolin59
Contributor

This shouldn't be needed, because partition columns should not exist inside the schema to begin with. Rather than handling it here, it should be fixed on the caller side: if the schema already contains the columns, no partition columns should be passed into this method in the first place.

@simbadzina
Collaborator Author

> This shouldn't be needed, because partition columns should not exist inside the schema to begin with. Rather than handling it here, it should be fixed on the caller side: if the schema already contains the columns, no partition columns should be passed into this method in the first place.

The two callers of addPartitionColsToSchema are both inside SchemaUtilities (getCasePreservedSchemaForTable:79 and the merge branch of getAvroSchemaForTable:118-124).

Deduplicating here fits the contract "ensure partition columns are present on the schema".
For some tables, the partition columns do already exist in the Avro files and also, independently, in HMS.
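
For comparison, the caller-side alternative suggested in review might look roughly like the sketch below. It is an illustration under the same simplifying assumptions: schemas are modeled as lists of field names, matching is case-sensitive, and the names CallerSideFilter and partitionColsMissingFrom are hypothetical, not part of Coral.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical caller-side helper: narrows the partition column list before
// it is handed to a method that appends columns unconditionally.
final class CallerSideFilter {

  private CallerSideFilter() {
  }

  // Returns only the partition columns that the schema does not already
  // contain, preserving their original order.
  static List<String> partitionColsMissingFrom(List<String> schemaFields, List<String> partitionCols) {
    Set<String> existing = new HashSet<>(schemaFields);
    return partitionCols.stream()
        .filter(col -> !existing.contains(col))
        .collect(Collectors.toList());
  }
}
```

Under this approach each call site would have to apply the filter before calling addPartitionColsToSchema, whereas the in-method check covers both callers in one place.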

@simbadzina simbadzina force-pushed the sdzinama/fix-datepartition-duplication branch from 2411854 to dc328ec Compare April 17, 2026 00:56
@simbadzina simbadzina force-pushed the sdzinama/fix-datepartition-duplication branch from dc328ec to 1807f65 Compare April 21, 2026 15:47
…y exists in schema

When a Hive view projects a partition column as a regular field in its schema,
addPartitionColsToSchema() would attempt to add it again, causing
AvroRuntimeException: "Duplicate field X in record". The fix skips partition
columns that already exist in the schema by name.
@simbadzina simbadzina force-pushed the sdzinama/fix-datepartition-duplication branch from 1807f65 to 5b343b8 Compare April 21, 2026 15:52