
billing: add tenant billing contact fields and per-tenant controller #2902

Open
jshearer wants to merge 38 commits into jshearer/billing_graphql from jshearer/billing_fields

Conversation

@jshearer
Contributor

@jshearer jshearer commented Apr 29, 2026

Summary

Customers cannot self-serve billing email changes today. Every request requires manual Stripe intervention. This adds admin-editable billing contact fields on tenants, a GraphQL mutation for managing them, and a per-tenant controller that reconciles the Stripe-backed subset to Stripe asynchronously.

  • Adds billing_email, billing_name, and billing_address fields to tenants, letting admins self-serve billing contact changes through a new setBillingContact GraphQL mutation instead of requesting manual Stripe edits
  • Introduces a per-tenant TenantController automation (the first tenant-scoped automation) that reconciles DB-authoritative billing contact data to Stripe asynchronously
  • Updates existing customer-creation paths (createBillingSetupIntent, billing-integrations) to prefer tenant-managed billing email over JWT claims when creating new Stripe customers
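
The email preference in the last bullet amounts to a simple fallback. The following is a minimal sketch, not the PR's code: `pick_customer_email` and its parameters are hypothetical names, and treating a blank value as unset is an assumption.

```rust
/// Choose the email for a newly created Stripe customer.
/// Hypothetical helper: prefers the tenant-managed billing email and
/// falls back to the email taken from the caller's JWT claims.
fn pick_customer_email<'a>(tenant_billing_email: Option<&'a str>, jwt_email: &'a str) -> &'a str {
    match tenant_billing_email.map(str::trim) {
        // Assumption: an empty or whitespace-only value counts as unset.
        Some(email) if !email.is_empty() => email,
        _ => jwt_email,
    }
}
```

With this shape, a tenant that has never set `billing_email` keeps the existing JWT-based behavior.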

How it works

The mutation writes to Postgres and returns. A trigger wakes the tenant's controller task, which reads the current DB state, compares it against the Stripe customer, and calls update_customer_billing_profile if they differ. For tenants without a Stripe customer, billing data lives only in the DB: the controller treats "no customer" as a no-op, and customer-creation paths wake the controller afterward.
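
The compare-then-update decision described above can be sketched as a pure function. This is an illustration under stated assumptions, not the controller's actual code: `BillingContact`, `Action`, and `plan_reconcile` are hypothetical names, and the address is modeled as a plain string for brevity.

```rust
/// The Stripe-backed subset of billing contact fields (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
struct BillingContact {
    email: Option<String>,
    address: Option<String>,
}

/// What the controller should do on one wake-up.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// The tenant has no Stripe customer yet: nothing to sync.
    NoCustomer,
    /// Stripe already matches the database.
    InSync,
    /// Stripe differs: push the database values.
    Update(BillingContact),
}

/// Compare desired (database) state with actual (Stripe) state.
/// `stripe` is `None` when the tenant has no Stripe customer.
fn plan_reconcile(db: &BillingContact, stripe: Option<&BillingContact>) -> Action {
    match stripe {
        None => Action::NoCustomer,
        Some(actual) if actual == db => Action::InSync,
        Some(_) => Action::Update(db.clone()),
    }
}
```

Keeping the decision free of I/O means it can be tested without a Stripe key. Only the `Update` case would lead to a call such as update_customer_billing_profile.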

The controller follows the same sub-controller composition pattern as LiveSpecControllerExecutor: TenantControllerState contains a nested BillingContactStatus managed by the billing_contact sub-module, so future tenant automations can be added as additional sub-controllers.
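
To make the composition concrete, here is a rough sketch of how a nested status could be threaded through a top-level controller step. The type names `TenantControllerState` and `BillingContactStatus` come from the description above. The fields and the two functions are illustrative assumptions and are not taken from the PR.

```rust
/// Status owned by the billing-contact sub-controller (fields are illustrative).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct BillingContactStatus {
    /// Consecutive failed sync attempts.
    failures: u32,
    /// Error message from the most recent attempt, if it failed.
    last_error: Option<String>,
}

/// Top-level per-tenant state; each sub-controller owns one nested field.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct TenantControllerState {
    billing_contact: BillingContactStatus,
    // A future tenant automation would add its own status field here.
}

/// Sub-controller step: record the outcome of one sync attempt.
fn update_billing_contact(status: &mut BillingContactStatus, outcome: Result<(), String>) {
    match outcome {
        Ok(()) => {
            status.failures = 0;
            status.last_error = None;
        }
        Err(msg) => {
            status.failures += 1;
            status.last_error = Some(msg);
        }
    }
}

/// Top-level step: delegate to each sub-controller in turn.
fn run_tenant_controller(state: &mut TenantControllerState, billing_outcome: Result<(), String>) {
    update_billing_contact(&mut state.billing_contact, billing_outcome);
}
```

Because each sub-controller only touches its own nested status, adding another automation would mean one new field and one more delegated call.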

Migration

20260429120000_tenant_controller_billing_contact.sql:

  • Adds the billing_email, billing_name, and billing_address columns. New tenants get a controller task via an insert trigger. Existing tenants get one lazily: wake_tenant_controller creates the task on demand if controller_task_id is null, so setBillingContact or customer-creation paths work without a pre-existing task row.
  • Any change to these billing fields, from any source, triggers the automation to sync to Stripe.
  • Backfills billing_email and billing_address from CDC-synced stripe.customers so existing data matches Stripe without triggering reconciliation. Only tenants that received billing data from this backfill get a controller task row; the rest get one on first use.
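
The lazy task creation in the first bullet can be modeled in memory as follows. The real logic lives in the SQL migration. This Rust sketch only mirrors the described behavior, and the `Db` struct, its fields, and the id allocation scheme are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the `tenants` table and its controller tasks.
#[derive(Default)]
struct Db {
    /// Tenant name -> controller_task_id (`None` models a NULL column).
    controller_task_id: HashMap<String, Option<u64>>,
    /// Task ids that have been woken, in order.
    woken: Vec<u64>,
    /// Last task id handed out (hypothetical allocation scheme).
    next_task_id: u64,
}

impl Db {
    /// Wake the tenant's controller, creating its task first if missing.
    /// Returns the id of the task that was woken.
    fn wake_tenant_controller(&mut self, tenant: &str) -> u64 {
        let slot = self
            .controller_task_id
            .entry(tenant.to_string())
            .or_insert(None);
        let id = match *slot {
            Some(id) => id,
            None => {
                // No task yet: allocate one and record it on the tenant.
                self.next_task_id += 1;
                let id = self.next_task_id;
                *slot = Some(id);
                id
            }
        };
        self.woken.push(id);
        id
    }
}
```

Waking the same tenant twice reuses the task created on the first call, which is why callers do not need a pre-existing task row.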

Testing

I tested this end to end in a local stack with a test-mode Stripe API key.

@jshearer jshearer force-pushed the jshearer/billing_fields branch 3 times, most recently from dd24bcb to 2d02827 Compare April 29, 2026 21:35
@jshearer jshearer self-assigned this Apr 29, 2026
@jshearer jshearer added change:planned This is a planned change control-plane-api Change affecting the API of control-plane, may impact the UI, flowctl, etc labels Apr 29, 2026
@jshearer jshearer marked this pull request as ready for review April 29, 2026 21:40
@jshearer jshearer force-pushed the jshearer/billing_fields branch from 2d02827 to 769438a Compare April 29, 2026 22:42
@jshearer jshearer force-pushed the jshearer/billing_graphql branch from cc73e15 to 2184f2e Compare April 29, 2026 22:42
@jshearer jshearer force-pushed the jshearer/billing_fields branch from 769438a to 4203194 Compare April 29, 2026 22:43
@jshearer jshearer added waiting This change is waiting on something else and removed change:planned This is a planned change control-plane-api Change affecting the API of control-plane, may impact the UI, flowctl, etc waiting This change is waiting on something else labels May 4, 2026
@jshearer jshearer force-pushed the jshearer/billing_graphql branch 2 times, most recently from 183cb12 to e3ba37e Compare May 4, 2026 17:35
@jshearer jshearer force-pushed the jshearer/billing_fields branch from 4203194 to 583eeca Compare May 5, 2026 15:24
@jshearer jshearer force-pushed the jshearer/billing_graphql branch 5 times, most recently from ca57448 to 98a8373 Compare May 6, 2026 20:06
@jshearer jshearer added control-plane waiting This change is waiting on something else labels May 8, 2026
@jshearer jshearer force-pushed the jshearer/billing_fields branch from 583eeca to 4d347ab Compare May 8, 2026 19:02
@jshearer jshearer force-pushed the jshearer/billing_graphql branch 3 times, most recently from 955f630 to d1317da Compare May 18, 2026 17:13
jacobmarble and others added 29 commits May 19, 2026 16:08
* materialize-eventbridge: document new connector

* docs(eventbridge): fix broken link and add field selection tip for size limit

Fixes the eb-events-structure link that redirected to the wrong page,
and notes that field selection can help stay under EventBridge's 256 KB limit.
Mechanical regeneration only; no .proto changes. Captures formatting
shifts from the local prost-build/protoc toolchain (e.g. `*` -> `-`
bullets in doc comments, multi-line attribute reflow, type-parameter
line breaks). Splitting this from the next commit so review can focus
on the functional diff.
* updating SEO related redirects

* Adding another recently removed connector

* formatting
Introduce the runtime-next crate housing both the Shuffle Leader and
per-shard TaskService implementations behind the bidirectional Leader
and Shard RPCs defined in runtime.proto. Materialization is implemented
end-to-end (open / commit / acknowledge / trigger plus recovery and
Frontier↔Checkpoint mapping); derivations and captures land in
follow-on work.

Small supporting bits land in the runtime, doc, and ops crates:
TaskServiceConfig destructuring tolerates new fields,
combine::Accumulator exposes spill-segment ranges, and ops re-exports
proto_flow::ops as ops::proto.

See plans/runtime-v2/plan.md for the architecture and rollout plan.
RCLOCK_BEGIN_MIN aliased KEY_BEGIN rather than KEY_BEGIN_MIN.
`bytesBehind` is typed as u64 to tally large values, but like
`bytesTotal` and `docsTotal` it should serialize as a native JSON
integer rather than a quoted string. Extend the build.rs codegen
rewrite to cover `bytesBehind`.
Relaxed schemas strip validation keywords; `redact` belongs in that
set. Pass it through to `RelaxedSchemaObj` with `skip_serializing` so
it is dropped, and cover the behavior in both the models unit test and
a validation scenario exercising a redacted key with a connector and
relaxed write schema.
When a journal read skips the direct-fragment path and the broker
returns a `file://` fragment URL, the fragment lives on the broker's
local filesystem and the client has no transport to read it. With
`do_not_proxy=true` and no open spool file, the broker's `serveRead`
short-circuits after sending only fragment metadata, EOFs the stream,
and the client loop spins. Clear `do_not_proxy` for `file://`
fragments so the broker proxies the content instead.
The "nonce" name is unrelated to cryptography but trips GitHub's secret
scanner. Renaming to "seq_no" sidesteps the false positive without
changing protocol semantics.
…TH_TOKEN

Replace the direct storage_mappings/grants inserts in `local:test-tenant`
with a betaOnboard directive, mint a multi-use refresh token, and emit
`~/flow-local/test-tenant.env`. Raise the new tenant's task/collection quotas
so concurrent integration suites don't trip the default ceiling.

flowctl: collapse FLOW_ACCESS_TOKEN into FLOW_AUTH_TOKEN, which now accepts
either a JWT access token or a base64 refresh-token JSON; drop the now-unused
base64 dependency.

ci:dekaf-e2e and the dekaf e2e harness take FLOW_AUTH_TOKEN / FLOW_TEST_TENANT
from that env file instead of a hard-coded system-user token.

Also symlink CLAUDE.md -> AGENTS.md and add local/README.md documenting the
local-stack systemd topology.
Use "runtime sidecar" consistently across the runtime-next README and
the runtime-v2 plan, replacing the mix of "shuffle sidecar" and
"runtime-sidecar process" phrasings.
Add `crates/runtime-sidecar/`, the per-machine Rust process that hosts
the Shuffle and Shuffle Leader gRPC services for all V2 tasks on a
reactor machine. It listens on a fixed fleet-wide sidecar port,
optionally terminates TLS, and is supervised with the same lifetime
as the reactor process(es) it serves.
Implement the "controller" portion of the V2 runtime, which initiates
the shard RPC lifecycle and drives the Join/Joined => Opened sequence.

The new runtime is selected only for tasks having the `enable-runtime-v2`
feature flag.

Also add a new --shuffle-port flag, used to generate accessible
endpoints for the sidecar of a given reactor.
Add a Fixed binding that targets a single pre-existing journal by name,
distinct from Mapped bindings which dynamically resolve documents to a
collection's partitions (creating them on demand). Use Fixed for the
ops stats journal, which activate pre-creates and which never needs
partition mapping.

This lets the runtime drop ops_stats_spec from the Task proto and
removes catalog.LoadCollectionForJournal along with its Go caller,
both of which existed only to recover the ops stats CollectionSpec
so a Mapped binding could be narrowed to its single partition.
The V2 publisher creates destination partitions on demand, so it needs
APPLY in addition to APPEND, plus LIST to watch journals. Have the
runtime-next task service and the runtime-sidecar publisher factory
request `APPEND | APPLY | LIST` jointly, and teach `authorize_task` to
accept that combined capability as `models::Capability::Write`. Update
the `TaskCollectionAuth` doc comment to reflect the broadened set.
`bindings` is linked into Go binaries through CGO, so there is no Rust
binary entrypoint to install rustls' process-wide CryptoProvider.
Install the `aws-lc-rs` provider lazily (once) when a task service is
created, and enable rustls' `aws_lc_rs` feature.
… Materialize

Drop `ops_stats_journal` from the `Task` proto: both the leader and the
shard already receive it via shard labeling at Join time, so passing it
through `Task` was redundant.

Add `log_level` to the top-level `Materialize` message so the controller
can supply it on unary `spec` / `validate` requests, which never see
the Join-time labeling that carries log level for session-bound work.
Session paths continue to read log level from labeling.
…scan

Long-lived tasks accumulate FC: entries for producers that stopped
writing (including ones that wrote CONTINUE_TXN docs but never committed
them), inflating startup cost, RocksDB size, and abandoned-transaction
replay distance.

Add `recovery::prune_committed_frontier`, a pure pass over the decoded
per-(journal, binding) FC: chunks that drops a producer only when, within
its group, it is not FH:-protected, trails the newest last_commit by at
least FRONTIER_PRUNE_CLOCK_HORIZON, and trails the furthest read offset
by at least FRONTIER_PRUNE_BYTE_HORIZON. The scan path then issues a
small (non-synced; this is GC, not a commit) delete batch for the pruned
FC: keys before returning Recover, so the leader never observes them.
The close-policy comparisons used a strict `>`, so a threshold of zero
could never be satisfied; use `>=` so zero-valued thresholds fire.
Widen the `last_close_age` placeholder ceiling from 300s to
`Duration::MAX`.

The materialize stats doc also reported the `sourced` and `loaded`
document/byte tallies under swapped `left`/`right` keys; correct the
orientation.
The V2 leader stamps a synthetic "committed-close" source into the
consumer.Checkpoint on each commit, recording the V2 RocksDB epoch. If
a task is rolled back to the V1 runtime, V1 would otherwise carry that
marker verbatim across its own commits; a later roll-forward to V2
would then mistake the stale marker for an in-sync RocksDB state,
ignore the legacy_checkpoint, and resume from V2's stale frontier —
reprocessing whatever V1 had advanced past. Strip the "committed-close"
source on each V1 start-commit so a subsequent V2 startup treats V1's
advanced sources as authoritative.
`NewStore` is invoked only on the initial PRIMARY transition, so a
publish that flips the `enable-runtime-v2` flag on a running shard
cannot otherwise reroute it between the V1 and V2 materialize runtimes.
Have each app's `RestoreCheckpoint` surface a functional error when its
shard's flag no longer matches the running runtime, forcing the
controller to restart the shard so `NewStore` re-evaluates the flag and
selects the correct runtime.
Each local data plane now runs a dedicated runtime-v2 sidecar on
base_port+60, advertising the same per-plane FQDN and HMAC key as its
reactors. Cap brokers and reactors at 10 instances each so the +0..+9
and +90..+99 ranges stay clear of the +50/+51/+52 Dekaf and +60 sidecar
reservations.

Also set CONSUMER_ZONE on reactors so sidecar peering resolves, prefer
the musl target dir ahead of glibc on PATH so an over-broad `cargo
build` doesn't shadow the musl flow-connector-init, disable color in
sidecar logs, and document the preview-harness scope and the
Supabase Docker-network connector wiring.
… and metrics

Add `crates/service-kit/`, a service-agnostic leaf crate that provides
the observability foundation for the runtime-v2 sidecar:

- `Registry` / `HandlerGuard`: a coarse lifecycle view of in-flight
  units of work (label / phase / fields), each running inside its own
  `tracing` handler span.
- `admin`: a loopback-only `axum` surface — an auto-refreshing HTML
  dashboard, `/debug/handlers.json`, a per-handler drill-down page, and
  a `POST /debug/handlers/{id}/level/{level}` runtime trace-level
  control.
- `trace`: a `tracing_subscriber` layer-filter, composed with the base
  `EnvFilter`, that admits events at or above an enclosing handler
  span's override level.
- `event!`: a structured-event macro with lazy field capture, feeding
  both `tracing` and per-handler breadcrumb rings shown on the
  drill-down page.
- `metrics`: a Prometheus registry and `/metrics` route.

The crate is added inert; the following commit wires it into the
sidecar's Shuffle and Leader services.
…nt! instrumentation

Wire the runtime-v2 sidecar's Shuffle and Shuffle Leader services into
`service-kit`. Both gRPC services register their spawned handlers in a
shared `Registry`, each running inside its handler span, and replace
ad-hoc `tracing` calls in their actor loops with `service_kit::event!`.
`runtime-sidecar` gains an `--admin-port` and rebuilds its tracing
stack on a layered subscriber that hosts the loopback admin surface;
local data planes bind it at base_port + 61.

The shuffle `Shard` message gains an `id` field used to label handlers
and metrics, and gazette journal append/read gain the instrumentation
the event stream draws on.
The legacy V1 `consumer.Checkpoint` holds a complete committed frontier,
whereas V2 writes `FC:` keys as per-transaction deltas. At a cutover the
recovered `FC:` keys are not yet a sound recovery baseline.

`leader::materialize::startup` now reconciles synchronously: after the
connector Open/Opened exchange, when the final status of the recovered V1
checkpoint and any remote-authoritative connector checkpoint is known, it
issues one cleanup `Persist` to shard zero. An authoritative checkpoint
clears all `FC:` keys and rewrites the complete baseline.

The per-task `drop-runtime-v1-rollback` shard-label flag tells the leader
to stop maintaining the legacy `consumer.Checkpoint`, deleting the
persisted key during startup in exchange for forfeiting V1 rollback.

Adds `delete_committed_frontier` and `delete_legacy_checkpoint` to the
`Persist` proto, renumbering subsequent fields.
* Add new Agent Skills docs page

* Shuffle links around
Customers cannot self-serve billing email changes today. Every request requires manual Stripe intervention. This adds admin-editable billing contact fields on `tenants`, a GraphQL mutation for managing them, and a per-tenant controller that reconciles the Stripe-backed subset to Stripe asynchronously.

* `billing_email`, `billing_name`, `billing_address` columns on `tenants`. Insert trigger creates a controller task per tenant, update trigger wakes the controller when `billing_email` or `billing_address` change. Existing tenants are backfilled from `stripe.customers` CDC data.
* `setBillingContact` mutation writes the DB and returns immediately. `TenantBilling.contact` query field reads from the DB. No Stripe call in the request path.
* `BillingProvider::update_customer_billing_profile` for updating Stripe `Customer.email` and `Customer.address`.
* `TenantController` executor (`TaskType(12)`) with a `billing_contact` sub-controller that compares DB desired state against actual Stripe state and reconciles on mismatch, with retry backoff.
* `createBillingSetupIntent` and `billing-integrations` customer creation now prefer `billing_email` from the tenants table over the JWT user's email when creating new Stripe customers.
@jshearer jshearer force-pushed the jshearer/billing_fields branch from 4d347ab to c8fc53b Compare May 21, 2026 16:10

Labels

control-plane waiting This change is waiting on something else


7 participants