materialize-bigtable: new connector#4357

Merged
williamhbaker merged 6 commits into main from wb/bigtable
May 5, 2026

Conversation

@williamhbaker
Member

@williamhbaker williamhbaker commented May 4, 2026

Description:

New materialize-bigtable connector. Materializes Flow collections to Google Cloud Bigtable as standard updates.

A few interesting design choices to be aware of:

Exactly-once via MVCC

The connector stamps every cell write with a monotonically increasing counter as the cell timestamp. The column family is created with MaxVersions(2), so a crashed transaction's "dirty" cell is preserved alongside the prior committed value. Loads request the latest 2 versions and walk them newest-first, skipping any cell at the in-flight timestamp. This surfaces the last committed value even if a previous Store partially landed.

Note: This relies on idempotent runtime transactions; without them, reductions other than last-write-wins (LWW) are incorrect on retry.
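
The load-side rule described above can be sketched in a few lines. This is a minimal illustration that uses a hypothetical `cell` struct and `latestCommitted` helper rather than the Bigtable client types, and it assumes cells arrive newest-first; it is not the connector's actual code.

```go
package main

import "fmt"

// cell is a hypothetical stand-in for one stored version of a Bigtable cell.
type cell struct {
	Timestamp int64  // counter stamped on the write
	Value     []byte // encoded field value
}

// latestCommitted walks the versions newest-first and returns the first cell
// that was NOT written at the in-flight timestamp. It assumes the slice is
// ordered newest-first and holds at most two versions.
func latestCommitted(cells []cell, inFlight int64) ([]byte, bool) {
	for _, c := range cells {
		if c.Timestamp == inFlight {
			continue // dirty write left behind by a transaction that did not commit
		}
		return c.Value, true
	}
	return nil, false // no committed value exists for this cell
}

func main() {
	// A previous Store at timestamp 8 partially landed before a crash, and
	// the retried transaction is running at timestamp 8 again.
	cells := []cell{
		{Timestamp: 8, Value: []byte("dirty")},
		{Timestamp: 7, Value: []byte("committed")},
	}
	v, ok := latestCommitted(cells, 8)
	fmt.Println(string(v), ok)
}
```

Keeping two versions is what makes the skip possible: with a single version, the dirty write would have overwritten the only committed copy.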

Row keys are FDB-packed tuples

Primary keys are packed into FDB tuples and used directly as the Bigtable row key. The packed bytes sort lexicographically in the same order as the key components, which enables efficient range scans with appropriate composite key structures.
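
To illustrate why an order-preserving packing matters, here is a simplified sketch. `packStrings` is a hypothetical helper that handles string components only and is loosely modeled on the tuple layer's approach (a type tag, NUL escaping, and a terminator); it is an assumption for illustration and not the encoding code the connector uses.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// packStrings is a hypothetical, strings-only packing. Each component is
// written as a tag byte, the string bytes with embedded NULs escaped, and a
// NUL terminator, so bytewise order matches component-wise order.
func packStrings(parts ...string) []byte {
	var out []byte
	for _, p := range parts {
		out = append(out, 0x02) // tag marking a string component
		for i := 0; i < len(p); i++ {
			out = append(out, p[i])
			if p[i] == 0x00 {
				out = append(out, 0xFF) // escape so the terminator stays unambiguous
			}
		}
		out = append(out, 0x00) // terminator sorts before any continuation byte
	}
	return out
}

func main() {
	keys := [][]byte{
		packStrings("usa", "bob"),
		packStrings("us", "zed"),
		packStrings("us", "alice"),
	}
	// Sort bytewise, which is how Bigtable orders row keys.
	sort.Slice(keys, func(i, j int) bool { return bytes.Compare(keys[i], keys[j]) < 0 })

	// Every key whose first component is exactly "us" shares this prefix.
	prefix := packStrings("us")
	for _, k := range keys {
		fmt.Printf("%q under prefix: %v\n", k, bytes.HasPrefix(k, prefix))
	}
}
```

Because the terminator byte sorts below every continuation byte, `("us", ...)` rows stay contiguous and ahead of `("usa", ...)` rows, so a prefix scan on the leading component touches only the matching rows.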

Value Encoding

Everything is stored as bytes to match Bigtable's native model:

| Type | Encoding | Notes |
| --- | --- | --- |
| Booleans | 0x00 / 0x01 | |
| Integers (int64) | 8-byte big-endian | Bigtable's spec'd format. Enables atomic increment via ReadModifyWrite (unused here). |
| Integers (wider than int64) | Decimal text (e.g. "99999999999999999999") | Used when schema inference indicates the range or string length exceeds int64. |
| Numbers (floats) | 8-byte big-endian IEEE 754 | Not formally Bigtable-specified, but matches standard tool behavior. |
| Null | Empty byte slice | Ambiguous with zero-length string/binary, but unlikely to matter in practice. |
| Strings | UTF-8 bytes | Passed through. |
| Binary | Raw bytes | Base64-decoded from JSON. |
| Arrays / objects / multi-typed | JSON bytes | Stored as the original JSON encoding. |
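
As a rough illustration of the fixed-width rows in the table, the sketch below encodes a boolean, an int64, and a float64 using only the Go standard library. The function names are hypothetical and are not taken from `type_mapping.go`.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeBool maps false/true to the single bytes 0x00/0x01.
func encodeBool(b bool) []byte {
	if b {
		return []byte{0x01}
	}
	return []byte{0x00}
}

// encodeInt64 writes the two's-complement value as 8 big-endian bytes.
func encodeInt64(v int64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, uint64(v))
	return buf
}

// encodeFloat64 writes the IEEE 754 bit pattern as 8 big-endian bytes.
func encodeFloat64(f float64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, math.Float64bits(f))
	return buf
}

func main() {
	fmt.Printf("% x\n", encodeBool(true))
	fmt.Printf("% x\n", encodeInt64(1))     // 00 00 00 00 00 00 00 01
	fmt.Printf("% x\n", encodeFloat64(1.0)) // 3f f0 00 00 00 00 00 00
}
```

Note that an int64 and a float64 both occupy 8 bytes, so a reader needs the field's schema type to interpret a cell; the bytes alone do not say which encoding was used.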

No delta-updates mode

Delta updates have no meaningful semantics for a key/value store, so the mode is not offered; this matches materialize-dynamodb.

Closes estuary/flow#2918

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

See estuary/flow#2919

Notes for reviewers:

(anything that might help someone review this PR)

Comment thread materialize-bigtable/type_mapping.go Dismissed
Materializes Flow collections to Cloud Bigtable as standard updates.
Primary keys pack into FDB tuples and are used directly as the
Bigtable row key, which keeps rows ordered by key. Fields and the
root document are written as cells in a single column family with
MaxVersions(2); the extra version supports dirty-read detection on
transaction retry.

docker-compose runs the gcloud Bigtable emulator on a known port so
the boilerplate materialize suite can exercise the full lifecycle
(apply, materialize, snapshot) without needing real GCP credentials.

Adds the connector to the build matrix and to the gate of
materializations whose integration tests are run on every PR.

…et client metrics

* Add an opt-in path for running integration tests against a real Cloud
  Bigtable instance, since the emulator can't exercise auth,
  PingAndWarm, or real network behavior.

* Disable the data client's built-in metrics exporter so the connector
  doesn't depend on metric-publishing permissions or emit telemetry the
  user hasn't opted into.

* Capture an initial benchmark baseline so future runs have something to
  compare against. This was run on a single-node, base "trial" instance,
  so it may not reflect the performance achievable by production
  workloads, but it appears at least on par with other
  materializations.

Otherwise deletes leave tombstone rows that downstream readers must
filter on `_meta/op`.

`DeleteRow` is unconditional rather than timestamp-bounded, which is
fine for exactly-once: a replayed transaction safely re-issues the
delete on an already-empty row.
@williamhbaker williamhbaker marked this pull request as ready for review May 5, 2026 17:48
@williamhbaker williamhbaker requested review from a team May 5, 2026 17:48
Contributor

@jacobmarble jacobmarble left a comment

Solid!

Comment on lines +10 to +19
# See README.md for instructions on running against a real Cloud Bigtable
# instance.

# acmeCo/tests/materialize-bigtable-gcp:
#   endpoint:
#     local:
#       command: ["go", "run", "."]
#       protobuf: true
#       config: config.gcp.yaml
#   bindings: []
Contributor

Would you mind following this pattern?

all := *testAll || os.Getenv("ICEBERG_TEST_ALL") != ""
materializeSpec := "testdata/materialize-rest-local.flow.yaml"
applySpec := "testdata/apply-rest-local.flow.yaml"
migrateSpec := "testdata/migrate-rest-local.flow.yaml"
if all {
	materializeSpec = "testdata/materialize.flow.yaml"
	applySpec = "testdata/apply.flow.yaml"
	migrateSpec = "testdata/migrate.flow.yaml"
}

FWIW I'm using the pattern in the EventBridge PR as well, much simpler though: #4359

Member Author

Cool, that is simpler! Updated.

Comment thread materialize-bigtable/transactor.go Outdated
Real-instance tests previously required hand-editing YAML to uncomment
the GCP block; switch to a `BIGTABLE_TEST_ALL` env var / `-bigtable.test-all`
flag matching the pattern used by materialize-iceberg and
materialize-eventbridge so the full matrix is invokable from `go test`
without source edits. Drop the `_hd` suffix from hard-delete bindings
since the boilerplate framework already disambiguates per-task tables
via UUID. Type the `CommittedTimestamp` state field as int64 to match
the underlying type of `bigtable.Timestamp`.
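
A sketch of how the env var / flag switch described in this commit might look, following the pattern quoted in the review thread. The flag and environment variable names come from the commit message; the helper names and both spec file paths are hypothetical placeholders, not the connector's real test files.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// testAll mirrors the -bigtable.test-all flag named in the commit message.
var testAll = flag.Bool("bigtable.test-all", false, "also run against a real Cloud Bigtable instance")

// wantAll reports whether the full matrix was requested by flag or env var.
func wantAll(flagSet bool, env string) bool {
	return flagSet || env != ""
}

// materializeSpec picks the spec file to run. Both paths are hypothetical.
func materializeSpec(all bool) string {
	if all {
		return "testdata/materialize.flow.yaml"
	}
	return "testdata/materialize-local.flow.yaml"
}

func main() {
	flag.Parse()
	all := wantAll(*testAll, os.Getenv("BIGTABLE_TEST_ALL"))
	fmt.Println(materializeSpec(all))
}
```

With this shape, the emulator-only spec remains the default, and setting `BIGTABLE_TEST_ALL` to any non-empty value (or passing the flag) selects the full spec without editing YAML.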
@williamhbaker williamhbaker merged commit bd06033 into main May 5, 2026
66 of 69 checks passed
@williamhbaker williamhbaker deleted the wb/bigtable branch May 5, 2026 20:40

Development

Successfully merging this pull request may close these issues.

stats-next: new bigtable materialization

3 participants