Dataflow reset drops journal splitting, painful for large captures #2881

@jwhartley

Problem

Dataflow reset on a collection removes the collection's journal splitting configuration. Journals that were previously split revert to the default (a single journal per logical partition). For large, high-throughput captures that relied on splitting for parallelism, post-reset performance drops significantly until the splits are re-applied.

Customers cannot split journals themselves — splitting is an Estuary-internal operation — so once splitting is lost on reset, throughput stays degraded until Estuary support intervenes to re-split.

Why it matters

  • Large captures (multi-TB, multi-million-row) frequently need journal splitting to keep up with source throughput or to process a backfill within an acceptable window.
  • Dataflow reset is commonly used precisely for these captures (e.g. to recover from schema drift, or to restart from the current source state rather than replaying weeks of change data).
  • Post-reset, the backfill is often the single largest throughput event the pipeline will ever do — exactly when split parallelism matters most.
  • There is a real risk scenario: if the source's replication slot or equivalent change log (Postgres, MySQL, etc.) is sensitive to replication lag, the slow post-reset catch-up can cause the capture to fall behind or the retained change data to be purged, escalating to an outage.
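
To make the throughput argument concrete, here is a rough back-of-the-envelope sketch of catch-up time with and without splits. It assumes throughput scales linearly with the number of journals, which is a simplification, and the function name, parameters, and numbers are all hypothetical and illustrative rather than measured.

```python
import math


def catch_up_hours(backlog_gb, source_rate_gb_per_hr,
                   per_journal_rate_gb_per_hr, num_journals):
    """Estimate hours needed to drain a backlog while the source keeps writing.

    Assumption (not from the issue): total pipeline throughput is
    per_journal_rate_gb_per_hr * num_journals, i.e. linear scaling.
    Returns math.inf when the pipeline cannot outpace the source.
    """
    drain_rate = per_journal_rate_gb_per_hr * num_journals - source_rate_gb_per_hr
    if drain_rate <= 0:
        return math.inf
    return backlog_gb / drain_rate


# Illustrative numbers only: 2 TB backlog, source writing 20 GB/hr,
# each journal sustaining 30 GB/hr.
unsplit = catch_up_hours(2000, 20, 30, num_journals=1)
split = catch_up_hours(2000, 20, 30, num_journals=8)
print(f"unsplit: {unsplit:.1f} h, 8 journals: {split:.1f} h")
```

Under these made-up numbers the unsplit pipeline needs about 200 hours to catch up and the split one about 9 hours. If the source only retains change data for a few days, only the second figure fits inside that window.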

Prior internal discussion

Proposed resolution directions

  1. Preserve journal splits across reset. The split configuration is an operational concern separate from the collection data / inferred schema. It could be copied forward onto the new collection state as part of reset publication.

  2. Auto-re-apply splits after reset based on a recorded target. If we can't carry the live split state forward, we could persist the split intent (e.g. "this collection should have N splits") on the spec/controller, and have the control plane re-apply it once the reset completes and the collection is receiving data again.

  3. Expose journal splitting to customers. Out of scope for this issue, but it would turn this problem into a self-service recovery rather than a support ticket. It should be tracked in a separate issue if it isn't already.

Direction (1) is likely simplest if the splitting state is recorded in a place that survives the reset's journal turnover.
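
A minimal sketch of how directions (1) and (2) could fit together, assuming the split intent is stored as a field that the reset copies forward. Every name here (`CollectionState`, `target_splits`, `reset`, `reconcile`) is hypothetical and does not reflect the actual control-plane schema or API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CollectionState:
    """Hypothetical model of the per-collection state relevant to this issue."""
    name: str
    generation: int                      # bumped on every dataflow reset
    live_splits: int                     # journals per logical partition in effect
    target_splits: Optional[int] = None  # recorded split intent, if any


def reset(state: CollectionState) -> CollectionState:
    """Dataflow reset: journals turn over, so live splits fall back to 1.

    The recorded intent (target_splits) is carried forward untouched,
    which is the part that is lost today.
    """
    return replace(state, generation=state.generation + 1, live_splits=1)


def reconcile(state: CollectionState, receiving_data: bool) -> CollectionState:
    """Re-apply the recorded split intent once the collection has data again."""
    if state.target_splits is None or not receiving_data:
        return state
    if state.live_splits >= state.target_splits:
        return state
    return replace(state, live_splits=state.target_splits)


before = CollectionState("acme/orders", generation=1, live_splits=8, target_splits=8)
after_reset = reset(before)
restored = reconcile(after_reset, receiving_data=True)
print(after_reset.live_splits, restored.live_splits)
```

The point of keeping `target_splits` separate from `live_splits` is that the intent survives even if the live journal state cannot, so the control plane can restore parallelism without a support ticket.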

Cross-links

Labels

control-plane, enhance
