[Coral-Schema] Skip duplicate partition columns in getAvroSchemaForTable Hive-merge path #601

Draft
simbadzina wants to merge 1 commit into linkedin:master from simbadzina:sdzinama/dedup-partition-cols-merge-hive-path

Conversation

@simbadzina (Collaborator) commented Apr 17, 2026

Summary

Follow-up to #587 covering a sibling code path in the same file.

SchemaUtilities.getAvroSchemaForTable() (non-Avro-serde branch) builds a Hive column list from table.getSd().getCols() and appends getPartitionCols(table) before merging with the original Avro schema via MergeHiveSchemaWithAvro.visit(). If the regular column list already contains a column with the same name as a partition key, the merge produces a record with duplicate fields and fails at SchemaUtilities.copyRecord with AvroRuntimeException: Duplicate field X in record.

The same pattern exists in the fallback convertHiveSchemaToAvro(). Both call sites now go through a small shared helper that skips partition columns whose names are already present in the column list.
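
The dedup described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Col` is a hypothetical stand-in for Hive's `FieldSchema`, the helper here takes the partition columns directly instead of the `table` argument used by the real `appendMissingPartitionCols(cols, table)`, and exact-match name comparison is an assumption (the description does not say whether the real helper compares names case-insensitively).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PartitionColsSketch {

  /** Hypothetical stand-in for Hive's FieldSchema: a column name and a type string. */
  static final class Col {
    final String name;
    final String type;

    Col(String name, String type) {
      this.name = name;
      this.type = type;
    }
  }

  /**
   * Appends each partition column to cols unless a column with the same
   * name is already present. Assumption: names are compared exactly.
   */
  static void appendMissingPartitionCols(List<Col> cols, List<Col> partitionCols) {
    Set<String> seen = new HashSet<>();
    for (Col c : cols) {
      seen.add(c.name);
    }
    for (Col p : partitionCols) {
      // Set.add returns false when the name was already seen, so duplicates are skipped.
      if (seen.add(p.name)) {
        cols.add(p);
      }
    }
  }

  /** Collects column names in order, for printing and comparison. */
  static List<String> names(List<Col> cols) {
    List<String> out = new ArrayList<>();
    for (Col c : cols) {
      out.add(c.name);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Col> partitionCols = new ArrayList<>();
    partitionCols.add(new Col("datepartition", "string"));

    // Duplicate case: the regular columns already contain the partition key.
    List<Col> withDuplicate = new ArrayList<>();
    withDuplicate.add(new Col("id", "int"));
    withDuplicate.add(new Col("datepartition", "string"));
    appendMissingPartitionCols(withDuplicate, partitionCols);
    System.out.println(names(withDuplicate)); // prints [id, datepartition]

    // Normal case: the partition key is missing and gets appended.
    List<Col> withoutDuplicate = new ArrayList<>();
    withoutDuplicate.add(new Col("id", "int"));
    appendMissingPartitionCols(withoutDuplicate, partitionCols);
    System.out.println(names(withoutDuplicate)); // prints [id, datepartition]
  }
}
```

With a plain `cols.addAll(partitionCols)`, the first case would yield two `datepartition` entries, which is the shape of input that later fails with `Duplicate field X in record`.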

Changes

  • SchemaUtilities: Added private helper appendMissingPartitionCols(cols, table) that dedups by column name; replaced the two existing cols.addAll(getPartitionCols(table)) call sites (getAvroSchemaForTable and convertHiveSchemaToAvro) with calls to the helper.
  • SchemaUtilitiesTests: Added two tests for convertHiveSchemaToAvro covering both the duplicate-skipping case and the normal-append case.

Notes

  • This is complementary to #587 ("Fix duplicate field error when partition column already exists in schema"), which addresses the addPartitionColsToSchema path. The two PRs target different stack traces and can be reviewed independently.
  • Stack trace this PR fixes (observed in production, innermost frame first): RecordSchema.setFields ← SchemaUtilities.copyRecord ← MergeHiveSchemaWithAvro.struct ← SchemaUtilities.getAvroSchemaForTable.

Test plan

  • New tests pass against the fixed branch
  • ./gradlew coral-schema:test passes

@simbadzina force-pushed the sdzinama/dedup-partition-cols-merge-hive-path branch from c396840 to dd82e4a on April 21, 2026 17:14
…hema from Hive cols

getAvroSchemaForTable and convertHiveSchemaToAvro both appended the
table's partition columns to the regular column list before converting
to Avro. When the regular column list already contained a partition
column by name, the resulting Avro record failed with "Duplicate field X
in record". This mirrors the dedup already applied in
addPartitionColsToSchema. Extracted into a shared
appendMissingPartitionCols helper and covered by tests for
convertHiveSchemaToAvro.
@simbadzina force-pushed the sdzinama/dedup-partition-cols-merge-hive-path branch from dd82e4a to 585f841 on April 21, 2026 17:15