Skip to content

materialize-s3-iceberg: s/python/iceberg-go/#4347

Open
jacobmarble wants to merge 7 commits into
mainfrom
jgm-s3-iceberg-golang
Open

materialize-s3-iceberg: s/python/iceberg-go/#4347
jacobmarble wants to merge 7 commits into
mainfrom
jgm-s3-iceberg-golang

Conversation

@jacobmarble
Copy link
Copy Markdown
Contributor

@jacobmarble jacobmarble commented Apr 30, 2026

Description:

Replaces the embedded Python iceberg-ctl subprocess with the native Go iceberg-go library. The connector now creates and writes to Iceberg tables in-process.

New tables are created with metadata retention defaults (gzip metadata, drop prior versions on commit, expire snapshots older than 24h) to keep metadata files under object-store size limits.

As a nice side-effect, this also works around a date range issue in Python. When Iceberg metadata contains dates outside of the years 1 through 9999, it fails to parse because the corresponding Python type has this limit. Now that Python is removed from the code path, these dates are parsed without issue.

Workflow steps:

No user-visible changes.

Documentation links affected:

None.

Notes for reviewers:

  • iceberg-ctl/ Python tree is removed wholesale.
  • go/schema-gen/generate.go now traverses oneOf/anyOf branches so fixups apply to nullable-wrapped fields emitted by invopop.

@jacobmarble jacobmarble requested a review from a team April 30, 2026 23:54
@jacobmarble jacobmarble force-pushed the jgm-s3-iceberg-golang branch 2 times, most recently from 1927e86 to 93f26fc Compare May 4, 2026 20:37
time.Parse(time.DateOnly, ...) rejects 5-digit year strings such as
"10000-01-01". Fall back to a regex parse so Parquet DATE columns can
encode dates outside the year-1–9999 band that Python datetime imposed.
The clamping to year 1-9999 was required because pyarrow materialised
Parquet column min/max stats through Python datetime, which overflows
outside that band. iceberg-go reads the same stats as plain int64/int32,
so values like "9999-12-31T23:59:59-14:00" (UTC year 10000) and
"10000-01-01" now pass through without modification.
Remove stale pyarrow/pyiceberg references from the overflow regression
tests and migration-test skip comment now that the Python path is gone.
Regression tests now write the raw overflow values directly instead of
pre-clamping them.
@jacobmarble jacobmarble force-pushed the jgm-s3-iceberg-golang branch from 93f26fc to a645fa8 Compare May 5, 2026 20:35
@mdibaiee
Copy link
Copy Markdown
Member

@jacobmarble do we still want to do this?
I'm just afraid of the risk of regressions with this, seems like a sudden lift, but I like the path... mixing Python and Go is quite the mess

@jacobmarble
Copy link
Copy Markdown
Contributor Author

@jacobmarble do we still want to do this? I'm just afraid of the risk of regressions with this, seems like a sudden lift, but I like the path... mixing Python and Go is quite the mess

I still want to do it, sooner is better IMHO.

The Python Iceberg implementation is more mature than the Golang impl, but we don't push it very hard.

The PR hasn't been updated recently because no one had reviewed it yet. If you support the effort (with caveat to mitigate regression risks) then I'll rebase and squash to make the review easier.

@mdibaiee
Copy link
Copy Markdown
Member

@jacobmarble yeah I think this is worth doing if we can mitigate the risk 👍🏽

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants