Open

RFCs #2891

71 changes: 71 additions & 0 deletions plans/api-deprecation.md
@@ -0,0 +1,71 @@
# API Deprecation Lifecycle

## Executive Summary

Estuary maintains an evolving product API, but today we have no mechanism to retire an endpoint once it's in use. The immediate motivator is the user-management migration from PostgREST to GraphQL — flowctl users are still hitting the old PostgREST endpoints, and we have no systematic way to detect that or steer them to the replacement.

Supabase logs show who's calling what, but only with seven days of retention, which is not long enough to track adoption of a replacement endpoint. Communicating a deprecation to customers today means either sending a mass email or relying on institutional knowledge of which customers happen to be using which APIs.

This plan establishes a general-purpose deprecation lifecycle for the control-plane API. The challenge: while we control the dashboard UI and can migrate it to new endpoints on our own schedule, flowctl is installed on customer machines and older versions will continue to call deprecated endpoints indefinitely unless we give ourselves a way to see them and reach their operators.

- Engineering gets visibility into which tenants (and which flowctl versions) are calling a given endpoint. Once request volume drops below an acceptable threshold, we can remove the endpoint.
- Deprecated endpoints announce themselves via standard `Deprecation`/`Sunset` response headers.
- flowctl surfaces those headers noisily, printing a stderr warning whenever a command hits a deprecated endpoint so the signal reaches the operator running the command or reading CI logs.
- Affected customers get targeted outreach: automated, periodic email alerts whose frequency increases as the sunset date approaches, so the specific tenants still calling a deprecated endpoint hear about it directly.

At current scale, flowctl adoption is small enough that watching call volume in Loki and reaching out to affected customers is the primary enforcement mechanism. We'll hold off on Sunset headers and actual endpoint removal until the warning (P1) and alerting (P3) machinery is live and broadly adopted. Until then, deprecation headers plus human support follow-up is sufficient.

## Technical Notes

### Signaling deprecation to API consumers

Both PostgREST and GraphQL endpoints return standard `Deprecation` and `Sunset` headers. GraphQL additionally marks deprecated operations and fields in the schema itself, so schema-aware clients get the signal through introspection as well. Successor information (e.g. "use the `listConnectors` GraphQL operation instead") is stored in the deprecation table and surfaced in flowctl warnings and alert emails rather than via a `Link` header — GraphQL operations don't have their own URLs, so a link isn't meaningful.

### PostgREST deprecation headers set via pre-request function

PostgREST supports a `db-pre-request` configuration — a Postgres function that runs before every request and can set response headers via `set_config('response.headers', ...)`. We use this to inject deprecation headers.

The deprecation metadata lives in a `deprecated_endpoints` table — endpoint path, deprecation date, optional sunset date, and a human-readable successor description (e.g. "use the `listConnectors` GraphQL operation"). The pre-request function looks up `current_setting('request.path')` against this table and sets `Deprecation` and (if present) `Sunset` headers. The same table serves as the source of truth for alert emails to communicate successor info to users.
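
The lookup described above can be modelled outside the database as a rough sketch. It is written in Python for illustration only; the real logic would live in the Postgres pre-request function. The row shape, the exact-match lookup, the example path, and the dates are all assumptions. The header formats follow the published specifications: `Deprecation` as a structured-field date (RFC 9745) and `Sunset` as an HTTP-date (RFC 8594).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional


@dataclass(frozen=True)
class DeprecatedEndpoint:
    # Hypothetical row shape for the `deprecated_endpoints` table.
    path: str
    deprecated_at: datetime          # must be timezone-aware UTC
    sunset_at: Optional[datetime]    # None until a sunset date is set
    successor: str                   # human-readable successor description


def deprecation_headers(request_path: str, table: dict) -> dict:
    """Return the headers to add for `request_path` (empty if not deprecated)."""
    row = table.get(request_path)  # exact-match lookup is an assumption
    if row is None:
        return {}
    # Deprecation is a structured-field date: "@" followed by Unix seconds.
    headers = {"Deprecation": f"@{int(row.deprecated_at.timestamp())}"}
    if row.sunset_at is not None:
        # Sunset is an HTTP-date, e.g. "<day>, 01 Jul 2026 00:00:00 GMT".
        headers["Sunset"] = format_datetime(row.sunset_at, usegmt=True)
    return headers


# Placeholder data; the path and dates are illustrative only.
TABLE = {
    "/connectors": DeprecatedEndpoint(
        path="/connectors",
        deprecated_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
        sunset_at=None,
        successor="use the `listConnectors` GraphQL operation",
    ),
}

print(deprecation_headers("/connectors", TABLE))
print(deprecation_headers("/not_deprecated", TABLE))
```

Keeping `Sunset` optional mirrors the plan: an endpoint can be marked deprecated well before a removal date is chosen.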

## Open Questions

- **Pre-request table lookup performance.** The `deprecated_endpoints` table is the single source of truth for deprecation metadata — used by the pre-request header injection, alert emails, and potentially a GraphQL query for flowctl to enrich deprecation warnings. But the pre-request function runs on every PostgREST request, so we need to verify the per-request cost of the table lookup is negligible (the table will be tiny and should stay in the buffer cache, but we should confirm this).

## Phases

### P1: flowctl deprecation warnings

flowctl learns to inspect responses from the control-plane API for `Deprecation` and `Sunset` headers and prints a human-readable warning on stderr, once per invocation, including the sunset date and successor information when present. We aren't setting either of these headers yet.

The warning message distinguishes between two contexts. When the deprecated call originates from a built-in flowctl subcommand, the warning tells the user to update flowctl — the newer version already uses the successor endpoint. When it originates from a user-defined raw API call, the warning names the deprecated endpoint and its sunset date if known. Successor information (which endpoint or operation to use instead) becomes available once the deprecation table exists in P2 — flowctl can query it to enrich the warning.
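
The two warning contexts can be sketched as follows. This is an illustrative Python model, not flowctl's implementation (flowctl is not written in Python). The function and class names, the message wording, and the choice to deduplicate per endpoint within one invocation are all assumptions.

```python
import sys
from typing import Mapping, Optional


def build_warning(endpoint: str, headers: Mapping[str, str], builtin: bool) -> Optional[str]:
    """Return warning text for a response, or None if it is not deprecated."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    if "deprecation" not in lowered:
        return None
    if builtin:
        # Built-in subcommand: a newer flowctl already uses the successor.
        msg = "this flowctl version calls a deprecated API endpoint; please update flowctl"
    else:
        # User-defined raw API call: name the endpoint so the caller can fix it.
        msg = f"the API endpoint {endpoint} is deprecated"
    sunset = lowered.get("sunset")
    if sunset:
        msg += f" (scheduled for removal on {sunset})"
    return f"warning: {msg}"


class DeprecationReporter:
    """Prints at most one warning per deprecated endpoint per process."""

    def __init__(self, stream=None):
        self._seen = set()
        self._stream = stream if stream is not None else sys.stderr

    def observe(self, endpoint: str, headers: Mapping[str, str], builtin: bool) -> bool:
        """Inspect one response; return True if a warning was printed."""
        text = build_warning(endpoint, headers, builtin)
        if text is None or endpoint in self._seen:
            return False
        self._seen.add(endpoint)
        print(text, file=self._stream)
        return True


reporter = DeprecationReporter()
reporter.observe("/rest/v1/user_grants", {"Deprecation": "@1767225600"}, builtin=False)
reporter.observe("/rest/v1/user_grants", {"Deprecation": "@1767225600"}, builtin=False)  # suppressed
```

Writing to stderr keeps the warning out of any stdout that scripts parse while still showing up in CI logs.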

This phase also fixes a bug: flowctl already constructs a `flowctl-<version>` User-Agent and applies it to its agent-API HTTP client, but the PostgREST client never receives the header. As a result, every PostgREST call from flowctl currently arrives at the server with an empty UA.

### P2: PostgREST deprecation signaling

Build the `deprecated_endpoints` table and the PostgREST pre-request function that injects `Deprecation` (and eventually `Sunset`) headers based on it. Then use it to deprecate our first endpoints — likely `user_grants` and `role_grants` once the GraphQL operations that replace them ship as part of the user-management migration. An endpoint must not be marked deprecated until flowctl's own subcommands have migrated to the successor — otherwise the "update flowctl" advice in the deprecation warning would be wrong. After this phase we can actually begin deprecating PostgREST endpoints: customers running an updated flowctl see warnings (from P1), engineering uses Loki to see who's still calling a given endpoint, and we do manual customer outreach based on that visibility.

This LogQL query shows who's calling specific endpoints, filtering out dashboard and Supabase JS traffic to isolate programmatic callers. Once the P1 UA fix has propagated, we can filter on `user_agent` directly instead of excluding known non-flowctl callers by referer and client info.

```logql
{service="edge_logs"}
| metadata_request_path =~ "/rest/v1/(user_grants|role_grants).*"
| metadata_request_method != "OPTIONS"
| metadata_request_headers_x_client_info !~ "supabase-js-web/.*"
| metadata_request_headers_referer !~ "https://dashboard\\.estuary\\.dev.*"
| line_format "{{.metadata_request_method}} {{.metadata_request_path}} {{.metadata_response_status_code}} sub={{.metadata_request_sb_jwt_authorization_payload_subject}} ua={{.metadata_request_headers_user_agent}}"
```

### P3: Automated customer email alerts

_Speculative: details will firm up once P2 is in use and we have enough customers using flowctl to justify the work._ A new alert type on the existing alerting infrastructure sends periodic email alerts to tenants still calling deprecated endpoints. Alerts only fire once a sunset date is set: no sunset, no emails. As the sunset date approaches, alert frequency increases: roughly weekly at first, then every few days, then daily as the deadline nears.
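
One way the escalating cadence could look, as a sketch only. The 60-day and 14-day thresholds and the 3-day middle interval are placeholders picked for illustration; the plan does not fix any of them.

```python
from datetime import date, timedelta
from typing import Optional


def alert_interval(today: date, sunset: Optional[date]) -> Optional[timedelta]:
    """Return how long to wait between alert emails, or None to send none."""
    if sunset is None:
        return None  # no sunset, no emails
    days_left = (sunset - today).days
    if days_left > 60:   # hypothetical threshold: far from the deadline
        return timedelta(days=7)
    if days_left > 14:   # hypothetical threshold: getting closer
        return timedelta(days=3)
    return timedelta(days=1)  # final stretch (and after the date has passed)


print(alert_interval(date(2026, 1, 1), date(2026, 6, 1)))   # weekly
print(alert_interval(date(2026, 5, 25), date(2026, 6, 1)))  # daily
```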

## Phase Dependencies

```mermaid
graph TD
P1[flowctl deprecation warnings, fix missing UA header]
P2[Send PostgREST deprecation headers]
P3[Automated customer email alerts]
P1 --> P2 --> P3
```
59 changes: 59 additions & 0 deletions plans/orthogonal-authz.md
@@ -0,0 +1,59 @@
# Orthogonal Authz

## Executive Summary

Estuary's access control today is a tiered role model with only two tiers in practice: `read` for looking at data, and `admin` for everything else. That makes `admin` badly overloaded: platform engineers receive billing email alerts meant for finance, and the finance team has access to take down a production system.

This plan refactors the role hierarchy into fine-grained, independent capabilities — most immediately to support dedicated **billing** and **user management** capabilities, so customers can delegate those responsibilities without handing out platform admin.

## Technical Notes

- **Capabilities are a flat set, not a hierarchy.** The five capabilities — `read`, `write`, `admin`, `billing`, `user_management` — don't imply each other. An admin grant does not grant `billing`, and (once the migration completes) does not grant `write` either; each capability is listed explicitly. This is the whole point of the refactor, and has downstream consequences — most notably for publish-target checks, see Phases below.

  Once capabilities are orthogonal, the names `write` and `admin` start to feel vague: they were meaningful as tiers but don't describe a specific power on their own. A later migration phase renames and/or splits them (e.g. `write → publish`, or separating task control from catalog edits) once the capability model has settled and PostgREST has been retired.

- **Capabilities inherit down the prefix tree.** A grant at `acmeCo/` applies to every descendant prefix — `acmeCo/sales/`, `acmeCo/sales/leads/`, and so on. A user's effective capabilities on a given prefix are the union of every grant at that prefix or any ancestor. This is how scoping already works for `read`/`write`/`admin`, and the new capabilities inherit the same way.

  > The `billing` capability only really makes sense at the root prefix; granting it on a subprefix will be inert. The UI can handle this as a special case.

- **Role grants narrow capabilities, never widen them.** When a user reaches a prefix through a role grant, their effective capabilities are the intersection of what the user has and what the role grant allows. Neither side can escalate past the other:

| Alice's user grant on `acmeCo/` | `acmeCo/` role grant on `partner/shared/` | Alice's effective capabilities on `partner/shared/` |
| ------------------------------- | ----------------------------------------- | --------------------------------------------------- |
| `{read, write, billing}` | `{read, write}` | `{read, write}` — `billing` is filtered out |
| `{read}` | `{read, write}` | `{read}` — the role grant can't add `write` |
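
The two rules above (union down the prefix tree, intersection across a role grant) can be sketched with plain sets. This is an illustrative Python model, not the Rust `CapabilitySet` code; the function names and the dictionary shape for grants are assumptions.

```python
def ancestors(prefix: str) -> list:
    """List `prefix` and every ancestor prefix, shortest first."""
    parts = [p for p in prefix.split("/") if p]
    return ["/".join(parts[: i + 1]) + "/" for i in range(len(parts))]


def direct_capabilities(user_grants: dict, prefix: str) -> frozenset:
    """Union of every grant at `prefix` or any ancestor of it."""
    caps = frozenset()
    for p in ancestors(prefix):
        caps |= user_grants.get(p, frozenset())
    return caps


def via_role_grant(user_caps: frozenset, role_caps: frozenset) -> frozenset:
    """A role grant narrows: only capabilities present on both sides survive."""
    return user_caps & role_caps


# Alice's grants, matching the first table row above.
alice = {"acmeCo/": frozenset({"read", "write", "billing"})}
on_sales = direct_capabilities(alice, "acmeCo/sales/leads/")
on_shared = via_role_grant(direct_capabilities(alice, "acmeCo/"), frozenset({"read", "write"}))
print(sorted(on_sales))   # ['billing', 'read', 'write']
print(sorted(on_shared))  # ['read', 'write']
```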

## Open Questions

1. **Do we need a `traverse` capability to gate role-grant traversal?**

Today, only users with the `admin` role can traverse role grants at all. A read-only user on `acmeCo/` cannot follow a role grant from `acmeCo/` → `partner/shared/`.

The role grant rule as stated in Technical Notes would change this. Once capabilities are orthogonal and we drop the `admin`-required gate, any user whose capabilities intersect with a role grant's capabilities can traverse it. That means every existing read-only user would suddenly gain read access to every prefix reachable through existing role grants — a potentially large, silent expansion of access.

Should we add an explicit `traverse` capability to prevent this? With `traverse`, a user can only follow a role-grant edge if `traverse` appears on their user grant. `traverse` is a gate — it controls whether the user can enter the role grant at all, but it doesn't carry through to the effective capability set:

| User grant on `acmeCo/` | Role grant `acmeCo/` → `partner/shared/` | Effective capabilities on `partner/shared/` |
|---|---|---|
| `{read, write}` | `{read, write}` | none — no `traverse` on user grant |
   | `{read, traverse}` | `{read, write}` | `{read}` — `traverse` lets the user in, but `write` is filtered out because it wasn't on the user grant |

We could backfill and add `traverse` wherever there is already an `admin` grant so as not to change anyone's existing level of access.
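
If `traverse` were adopted, the gate could be modelled as below. This is a sketch of the proposal in this open question, not a decided design; the function name is hypothetical.

```python
TRAVERSE = "traverse"


def effective_via_role_grant(user_caps: frozenset, role_caps: frozenset) -> frozenset:
    """Capabilities a user holds on the far side of a role grant."""
    if TRAVERSE not in user_caps:
        return frozenset()  # the gate is closed: the edge cannot be followed
    # `traverse` only opens the gate; it is not carried into the result.
    return (user_caps & role_caps) - {TRAVERSE}


role = frozenset({"read", "write"})
print(sorted(effective_via_role_grant(frozenset({"read", "write"}), role)))     # []
print(sorted(effective_via_role_grant(frozenset({"read", "traverse"}), role)))  # ['read']
```

Both printed results correspond to the two table rows above.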

## Phases (still in progress)

We will interleave these phases with other changes (service accounts, better user management, billing features) as needed.

**Phase 1 — add the array, orthogonal capabilities only.** Introduce `capabilities capability[] NOT NULL DEFAULT '{}'` on `user_grants`. The existing `capability` enum stays authoritative for `read`/`write`/`admin`; the array only carries the new orthogonal capabilities (`billing`, `user_management`). Only the GraphQL/Rust path reads the array. This lets us gate `billing` and `user_management` features immediately without touching existing authz code paths.

**Phase 2: dual-write the tiered capabilities into the array.** The array becomes authoritative for the Rust/GraphQL authz layer for all five capabilities; the enum stays authoritative for RLS. A sync trigger keeps them coherent during the PostgREST sunset:

- _New-path writes_ (GraphQL/Rust) set the array directly and project to the enum: `admin` if the array contains it, else `write`, else `read`. Orthogonal-only grants (e.g. `{billing}`) project to enum `read`, which accepts a PostgREST read leak within the prefix while PostgREST is being sunset.
- _Legacy-path writes_ (PostgREST/direct SQL) trigger a DB function that expands the enum to its tier capabilities (`admin → {read, write, admin}`, `write → {read, write}`, `read → {read}`) and merges them with any existing orthogonal capabilities on the row. A PostgREST write re-expresses only the tier portion; capabilities like `billing` are left untouched. PostgREST can't remove orthogonal capabilities, which is fine because they're only managed through the new path.
- Add a `capabilities capability[]` column to `role_grants` (same as `user_grants`), backfill from the existing enum, and update role-grant traversal logic to compute intersections against the new array.
- A one-shot backfill populates tier capabilities into the array for all existing rows using the same expansion.
- If we decide to add the `traverse` capability, this backfill should also add `traverse` to every existing admin user and role grant, preserving today's behavior where admins can follow role-grant edges. Going forward, `traverse` is auto-bundled whenever an `admin` grant is created — the grant-expansion rule becomes `admin → {read, write, admin, traverse}`. A later phase of the user-management RFC will unbundle `traverse` from `admin` when the UI supports assigning capabilities individually.
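
The projection and expansion rules in this phase can be sketched as two small functions. This is an illustrative Python model of the sync behaviour rather than the actual trigger or Rust code; the function names are hypothetical, and the optional `traverse` bundling is left out.

```python
# Tier expansion for legacy enum values, as listed above.
TIER_EXPANSION = {
    "read": frozenset({"read"}),
    "write": frozenset({"read", "write"}),
    "admin": frozenset({"read", "write", "admin"}),
}
TIER_CAPS = frozenset({"read", "write", "admin"})


def project_to_enum(caps: frozenset) -> str:
    """New-path write: derive the legacy enum value from the array."""
    if "admin" in caps:
        return "admin"
    if "write" in caps:
        return "write"
    return "read"  # orthogonal-only grants such as {billing} also land here


def apply_legacy_write(enum_value: str, existing: frozenset) -> frozenset:
    """Legacy-path write: replace the tier portion, keep orthogonal capabilities."""
    orthogonal = existing - TIER_CAPS
    return TIER_EXPANSION[enum_value] | orthogonal


print(project_to_enum(frozenset({"billing"})))  # read
row = frozenset({"read", "write", "admin", "billing"})
print(sorted(apply_legacy_write("write", row)))  # ['billing', 'read', 'write']
```

In this model a legacy write that downgrades `admin` to `write` drops `admin` from the array but leaves `billing` in place, matching the rule that the legacy path can't remove orthogonal capabilities.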

**Phase 3: cutover.** Once PostgREST is retired, drop the enum column on both tables, remove the sync trigger, and remove the projection logic. `CapabilitySet` becomes the only representation. The publish-target check becomes a plain flag-containment test for `write`; admin grants continue to satisfy it because the grant-expansion rule always stores `{read, write, admin}` on admin grants.

**Phase 4: rename and split the legacy tier names.** With PostgREST gone and `CapabilitySet` as the sole representation, the `write` and `admin` names can be replaced with capabilities that describe specific powers (e.g. `publish`, `manage`, or finer splits between task control and catalog edits). This is a pure rename/split inside the new model: a migration on `grant_capability` values, updates to the Rust `CapabilitySet` variants, and a sweep of the call sites. It is sequenced last because it's disruptive without a forcing function, and it only makes sense once nothing outside the new model speaks the old names.