diff --git a/docs/CONTEXT.md b/docs/CONTEXT.md new file mode 100644 index 000000000..7c39fffa5 --- /dev/null +++ b/docs/CONTEXT.md @@ -0,0 +1,103 @@ +# LFX Insights Public API + +A server-to-server HTTP API that exposes LFX Insights analytics data (contributor activity, project health, security posture) to external developers. It is a standalone service (`/api`) backed by the same Tinybird workspace and Postgres database as the existing Nuxt frontend, but with a formal versioned contract, API key authentication, and rate limiting. + +## Language + +### Auth & Identity + +**API Key**: +The long-lived credential a User receives from the LFX Self-Serve App at `app.lfx.dev/settings` — technically a Refresh Token. Customers see and handle this as their "API key." Only Key Contacts in member organizations are permitted to create them. The actual Bearer value sent to the Insights API on each request is a short-lived Access Token derived from it. +_Avoid_: token, secret, access key + +**Refresh Token**: +The long-lived credential held by the customer (what they receive as their "API key"). Used at `POST api.insights.linuxfoundation.org/v1/auth/token` (which Insights proxies to LFX Self-Serve) to mint Access Tokens. Never sent to the Insights API as a Bearer credential. Revoking it stops the customer's code from minting new Access Tokens; in-flight Access Tokens continue to work until their `exp`. Multiple active Refresh Tokens per User are supported for zero-downtime rotation. +_Avoid_: long-lived JWT, API key (when referring to the credential type specifically) + +**Access Token**: +A short-lived JWT (~15 min) minted from the Refresh Token via the proxied `/v1/auth/token` endpoint. Sent to the Insights API as `Authorization: Bearer `. Carries the verified `sub`, `org`, `tier`, `iss`, `kid`, and possibly `aud` claims. JWKS-verified on every request. Customers typically don't handle these directly — a short `getAccessToken()` helper or SDK manages the lifecycle. 
+_Note:_ the presence of `org` and `tier` in the LFX Self-Serve access token is an assumption pending confirmation with the Self-Serve team (T-015). If the existing PAT is reused, these claims may need to be added. +_Avoid_: calling it just "a JWT" or "the bearer token" — always use "access token" so it's clear which credential is meant + +**User**: +The human account that owns one or more API Keys (Refresh Tokens) and is the billing principal. Identified by the JWT `sub` claim issued by the LFX Self-Serve App. +_Avoid_: account, customer, client + +**Organization (org_id)**: +The LFX organization tied to a User's API access, encoded in the `org` claim of the LFX Self-Serve access token. "Belongs to" is narrow here: only authorized **Key Contacts** of an organization with an active LFX membership can hold API keys — not every employee or self-attested affiliate of the organization. Used as the shared bucket for rate-limit quotas — all keys belonging to Key Contacts in the same org share a pool. +_Avoid_: tenant, workspace, team + +**Tier**: +A named LFX membership level attached to an Organization that controls the rate-limit pool size. Known tiers in ascending order: Silver, Gold, Platinum (exact hierarchy and rate-limit numbers confirmed at T-093). In v1, tiers affect only rate limits; endpoint-level gating is reserved for future versions. +_Avoid_: plan, subscription + +### API Shape + +**Endpoint Group**: +A logical cluster of related endpoints released together as a unit (Development, Contributors, Popularity, Security, Collections). Each group maps to a Jira epic. +_Avoid_: phase, module, domain + +**Breaking Change**: +Any modification that forces existing callers to update their integration: removing or renaming a response field, changing a field's type, making an optional input required, removing an endpoint, or changing the error envelope shape. Governed by the tolerant-reader contract (see ADR-0003). 
Changing default or max pageSize, changing the cursor encoding semantics, removing a value from an endpoint's `sort` allow-list, or changing an endpoint's default `sort` value also counts. +_Avoid_: non-backwards-compatible change + +**Error Envelope**: +The standard JSON wrapper for all error responses: `{ error: { code, message, requestId, docsUrl } }`. `code` is a machine-readable snake_case string; `docsUrl` is always present — it deep-links to a specific docs page when one exists, otherwise to the general errors reference page. +_Avoid_: error body, error payload + +**Request ID**: +The OpenTelemetry trace ID for the request — a 32-char lowercase hex string (128-bit). Exposed in the error envelope's `requestId` field; this is the value a customer quotes in a support ticket. The same value appears in pino log lines as `trace_id` and on the active OTel span, so logs ↔ APM traces join in Datadog without translation. W3C `traceparent` is the sole HTTP propagation channel — honoured inbound, injected outbound; no `X-Request-Id` response header is set. There is no separate ULID/UUID request ID — see ADR-0019. +_Avoid_: ULID, UUID, separate correlation ID, X-Request-Id + +### Data & Infrastructure + +**Tinybird**: +The columnar analytics database backing all time-series metrics (contributor activity, commit counts, etc.). The API queries Tinybird Pipes via HTTP; it does not use Postgres for analytics reads. +_Avoid_: analytics DB, ClickHouse (Tinybird is the canonical name in this repo) + +**Collection**: +A named group of projects and repositories stored in Postgres (`collections` table, keyed by `slug`). Two types: **Community Collections** (created by a User, owned via `ssoUserId`) and **Curated Collections** (system-created, `ssoUserId` is null). Privacy is a boolean `isPrivate` flag — public collections are visible to any valid API key; private collections are visible only to their owner. 
There is no collaborator or member model — ownership is exclusively the creator. +_Avoid_: project group, saved filter, list + +**Rate-limit Pool**: +The shared sliding-window counter for an Organization. All API Keys belonging to users in the same org draw from the same pool. Implemented in Redis. +_Avoid_: quota, bucket + +### API Stability + +**`/v1-alpha`**: +The unstable stage. Endpoints are served under `/v1-alpha/...` — no contract guarantees. Breaking changes (field renames, shape changes, endpoint removal) are allowed freely. Used during early validation with a small allow-listed cohort. +_Avoid_: beta, preview + +**`/v1`**: +The stable stage. Endpoints graduate here from `/v1-alpha` once they pass the promotion criteria (load test, one week of stable shape, error/latency budgets, security sign-off). From this point the full tolerant-reader contract applies: additive-only changes within `/v1`; any breaking change requires `/v2`. +_Avoid_: stable, released, GA + +## Relationships + +- A **User** holds one or more **API Keys** +- A **User** is an authorized **Key Contact** of one **Organization** (v1); the **Organization** owns the **Rate-limit Pool** +- A **Tier** is attached to an **Organization** and governs the size of its **Rate-limit Pool** +- A **Collection** is owned by a single **User** (the creator, identified by `ssoUserId`); curated/system Collections have `ssoUserId = null`. There is no collaborator or org-ownership model in v1. A **Permission Check** gates access per request for private Collections (see ADR-0007) +- An **Endpoint Group** contains many endpoints; endpoints are promoted through launch stages independently + +## Example dialogue + +> **Dev:** "Should I check the User's tier before returning a response?" +> **Domain expert:** "In v1, no — all tiers see all endpoints. Tier only affects the Rate-limit Pool size. If the org's pool is exhausted you return 429; if a future endpoint requires a higher tier you return 403 `tier_forbidden`. 
Don't conflate the two." + +## Conventions + +These are committed wire-format decisions — changing them within v1 would be a breaking change (see ADR-0003). + +- **JSON key casing:** camelCase for all request and response fields (`startDate`, `activityTypes`). The Nuxt layer uses mixed casing; the `nuxt-to-api` skill normalizes to camelCase at port time. +- **Date format:** ISO-8601 UTC strings only (`2025-12-31T23:59:59Z`). Never Unix timestamps or locale-formatted strings. +- **Pagination:** cursor-based. Request: `cursor` (opaque, omit on first page) + `pageSize` (default 50, max 200) + `sort` from a per-endpoint allow-list (e.g. `name_asc`, `commits_desc`). Response: `{ data, pageSize, nextCursor }` — `nextCursor: null` means end of list. No `total` field. Removing an allowed `sort` value or changing an endpoint's default sort is a breaking change. See [ADR-0011](adr/0011-pagination-cursor-based.md). +- **Error codes:** machine-readable snake_case strings in the Error Envelope `code` field (e.g. `tier_forbidden`, `rate_limit_exceeded`, `unauthorized`, `upstream_unavailable`). +- **Cache TTLs:** two tiers — long cache (24h) for stable data (project lists, leaderboards, categories); short cache (1h) for time-series analytics. Mirrors the existing Nuxt API caching model. +- **Tinybird error handling:** when Tinybird is unavailable, the cached Redis response is served if one exists within its normal TTL. If the cache is empty or expired, 503 with `code: upstream_unavailable` is returned. + +## Flagged ambiguities + +- "account" was used for both User and Organization during design — resolved: **User** is the human principal (identified by `sub`), **Organization** is the entity that holds an LFX membership (Silver/Gold/Platinum) and owns the Rate-limit Pool (identified by `org_id`). 
+- "phase" was used interchangeably with Endpoint Group — resolved: use **Endpoint Group** for the technical rollout cluster; "phase" is informal and should be avoided in task descriptions. diff --git a/docs/PUBLIC_API_PLAN.md b/docs/PUBLIC_API_PLAN.md new file mode 100644 index 000000000..8146d180a --- /dev/null +++ b/docs/PUBLIC_API_PLAN.md @@ -0,0 +1,399 @@ +# Insights Public API — Project Plan + +> Draft plan for review. Not yet broken out into Jira tickets — once we agree on the shape, every "T-XXX" line below maps cleanly to a single Jira task and every "Epic" maps to an epic / milestone. + +## 1. Context + +Today, all `/api/*` endpoints live inside the Nuxt frontend (`frontend/server/api/`, ~106 endpoint files across 17 sub-folders). They were designed as **internal** endpoints for the insights UI: + +- Authenticated via Auth0 OIDC cookie (browser session) or a single shared Bearer secret (`jwtSecret`) for a few report endpoints. +- Rate limited only by IP (Redis sliding window, 200 req/min default). +- No per-customer concept, no usage tiers, no SLAs, no public documentation, no contract guarantees. +- Coupled to the frontend release cycle — we cannot change endpoint shape without coordinating UI changes. + +We want to expose a **public API** to LFX customers. Rather than retrofit the Nuxt routes, we will build a **standalone API service** that ports endpoints over with proper API key auth, tier-based access, rate limits, versioning, observability, and SLAs. + +### Goals +- Standalone API app, independently deployable. +- API key auth via LFX Self-Serve (`app.lfx.dev/settings`) — customers receive refresh tokens; their code mints short-lived access tokens via Insights' proxied `/v1/auth/token` endpoint. Access tokens carry membership tier claims that drive **rate limits** in v1; endpoint-level tier gating is a future capability. +- URL-versioned (`/v1`, `/v2`); breaking changes only across versions. 
+- Heavy observability (OTel → Datadog) so we can offer SLAs and bill by tier confidently. +- Phased rollout: **Development → Contributors → Popularity → Security & Best Practices → Collections** (more later). +- Frictionless mechanism to port a Nuxt endpoint into the public API (probably a Claude skill). + +--- + +## 2. Architecture Overview + +``` +┌─────────────────────┐ 1. create token ┌──────────────────────────────────┐ +│ User (browser) │ ────────────────▶ │ LFX Self-Serve App │ +│ │ │ app.lfx.dev/settings │ +│ │ ◀──────────────── │ Issues + revokes refresh tokens │ +│ │ 2. refresh token │ Publishes JWKS endpoint │ +└─────────────────────┘ └──────────────────┬───────────────┘ + │ ▲ + │ 3. paste refresh token │ 4. forward + │ into server env │ /token request + ▼ │ +┌─────────────────────┐ POST /v1/auth/token ┌──────────────┴──────────────────────────┐ +│ Customer server │ ────────────────────▶ │ api.insights.linuxfoundation.org │ +│ │ ◀──────────────────── │ (Fastify, TypeScript) │ +│ │ 5. access token │ │ +│ │ (~15 min) │ /v1/auth/token (proxy) │ +│ │ │ /v1/development/... │ +│ │ 6. Bearer │ /v1/contributors/... │ +│ │ │ /v1/popularity/... │ +│ │ ────────────────────▶ │ /v1/security/... │ +└─────────────────────┘ │ /v1/collections/... │ + └───────────┬──────────────┬──────────────┘ + 7. JWKS verify │ │ + (LFX Self-Serve JWKS — cached) ◀─────────────┘ │ + ▼ + ┌────────────────────────────────────────┐ + │ Redis │ + │ Rate-limit counters │ + │ Response cache (cache hit → return) │ + └──────────────┬─────────────────────────┘ + │ cache miss + ┌───────────┴───────────┐ + ▼ ▼ + ┌─────────────────┐ ┌────────────────────┐ + │ Tinybird │ │ Postgres │ + │ (analytics) │ │ (read host) │ + │ dedicated read │ │ Collections auth │ + │ replica * │ └────────────────────┘ + └─────────────────┘ + + * dedicated Tinybird read replica is the goal; pending + confirmation from the Tinybird team on whether per-app + replica isolation is supported. 
+ + ┌────────────────────────────────────────────────────────────────────────┐ + │ App OTel SDK ──OTLP──▶ otel-collector sidecar ──▶ Datadog │ + │ │ + │ Custom metrics — low-cardinality tags only (endpoint, version, │ + │ tier, status_class). Billed per unique tag combo. │ + │ Used for: SRE dashboards, alerts, SLO tracking. │ + │ │ + │ APM trace metrics — high-cardinality dims live on spans as attributes │ + │ (customer_id, api_key_id). Not billed as metrics. │ + │ Used for: per-customer drilldowns, debugging. │ + │ │ + │ Structured logs via pino, correlated to traces via trace_id │ + │ (OTel hex format; Datadog ingests natively, no dd.trace_id needed). │ + └────────────────────────────────────────────────────────────────────────┘ +``` + +- **Location in monorepo:** `api/` at the repo root, sibling of `frontend/`. Add `api` to `pnpm-workspace.yaml`. +- **Shared code with frontend:** `libs/tinybird-client` (HTTP client, AdaptiveSemaphore, bucket routing), `libs/insights-types` (shared enums), `libs/rate-limiter` (Redis sliding-window primitive). +- **Tinybird:** dedicated read replica per app (goal — pending Tinybird team confirmation); fallback is a separate token/pipe set. +- **Postgres:** reuse existing read host with its own connection pool, separate from the frontend's. +- **Cache:** Redis for rate-limit counters + response cache (24h stable data, 1h time-series). + +--- + +## 3. Open Decisions — Pros / Cons & Recommendations + +Each decision below has a full pros/cons analysis and a recommendation. We are committing to these in this plan; there are no separate "spike" tasks. If we change our minds during implementation, that's fine — but the default direction is set. + +### D1. 
Framework — NestJS vs Fastify vs Express vs Hono + +| Option | Pros | Cons | +|---|---|---| +| **NestJS** | Opinionated structure (modules, controllers, providers, DI); batteries-included validation/guards/interceptors/pipes; mature `@nestjs/swagger` for OpenAPI; great for large APIs; familiar to anyone from Angular/Spring/.NET; CLI generators; strong DI-based testing story. | Heavy footprint and slower cold start; opinionated to the point of friction if you fight it; requires `experimentalDecorators` + `reflect-metadata`; steep learning curve for the team if not already on Angular-style DI; overkill for a read-only API. | +| **Fastify** ⭐ | ~2× throughput vs Express on Node; native JSON-Schema validation plus schema-compiled response serialization (a free speedup); `@fastify/swagger` + `@fastify/type-provider-typebox` auto-generate OpenAPI from schemas with **zero spec drift**; mature plugin ecosystem; encapsulation via plugins; supported by NestJS as an HTTP adapter (`@nestjs/platform-fastify`), so we can wrap it later if we ever want Nest. | Less opinionated than Nest — team must enforce structure conventions; smaller community than Express; some plugins lag Express equivalents. | +| **Express** | Ubiquitous, every dev knows it, every middleware exists, simplest to debug, lowest learning curve. | No built-in validation or OpenAPI; ~half the throughput of Fastify; async error handling clumsy without wrappers; no opinionated structure — every team builds it differently; Express 5 only reached a stable release in late 2024 after years of being "imminent"; least modern of the four. | +| **Hono** | Edge-native (Workers, Vercel, Bun, Node), fastest of the four; modern ergonomic API; first-class TypeScript; built-in Zod/Valibot/TypeBox validators; tiny bundle; great OpenAPI middleware. | Newer ecosystem; smaller community than Fastify/Express; fewer pre-built middlewares; Node-at-scale story less battle-tested than Fastify (most case studies are edge); team would need to learn it. 
| + +**Recommendation: Fastify.** Code-first OpenAPI via TypeBox is essentially free, the schema-driven serializer is a real perf win, it's opinionated enough to give us structure without the Nest tax, and it's battle-tested on Node at scale. Hono is the runner-up if we ever want to deploy at the edge. + +### D2. Docs Tool — Mintlify vs Scalar vs Stoplight Elements vs VitePress + Swagger UI + +| Option | Pros | Cons | +|---|---|---| +| **Mintlify** ⭐ (if budget approved) | Best-in-class hosted polish (Anthropic, Cursor, Cloudflare use it); MDX guides + auto-generated OpenAPI reference in one product; built-in search; AI assistant baked in (customers chat over docs); analytics + CDN included; great onboarding flows. | Paid (≈$150–$550/mo team tier; enterprise priced separately); content lives on their infra; customization constrained by their conventions. | +| **Scalar** ⭐ (OSS fallback) | OSS, "Stripe-like" reference UI — easily the prettiest of the OSS options; embeddable into anything (Vue/VitePress/Next/Hono); best-in-class OpenAPI rendering; built-in "try it" client; fast; well-funded team behind it. | Just a reference renderer — you bring your own narrative/guide layer (we'd marry it with VitePress for guides); smaller team than Stoplight; theming is configurable but less plug-and-play than Mintlify. | +| **Stoplight Elements** | OSS web component; drop-in API reference; mature (Stoplight has been doing this for years); high OpenAPI 3.x fidelity. | Looks dated next to Scalar/Mintlify; Stoplight's commercial focus is on Stoplight Platform — OSS Elements gets less love; weak narrative-doc story. | +| **VitePress + Swagger UI** | VitePress already in repo (powering `/docs` and `/blog`); zero new tooling; full control; Swagger UI is the most universally-recognized OpenAPI viewer. | Swagger UI is ugly and dated; integration is DIY; "try it" UX is mediocre; reference + guides feel disjointed (two render styles). 
| + +**Decision: VitePress + Scalar under `api/docs/`.** Standalone VitePress site co-located with the API service. Scalar embedded for the interactive OpenAPI reference, reading the generated spec — the reference cannot drift. Deployed independently of the frontend with its own subdomain. + +### D3. OpenAPI Source — Code-first vs Spec-first + +| Option | Pros | Cons | +|---|---|---| +| **Code-first (TypeBox or Zod → OpenAPI)** ⭐ | Schema lives next to the handler — single source of truth; types derived automatically (`Static`); spec literally cannot drift from implementation; framework integrations (Fastify+TypeBox) are turnkey; validation + OpenAPI from one schema; refactors stay safe. | Spec is generated — pre-implementation contract review is awkward; harder for PMs/technical writers to propose changes via PR; design-first workflows feel inverted. | +| **Spec-first (hand-written `openapi.yaml`)** | Contract exists before code; easy for non-engineers to review/comment; language-agnostic; can drive both server and client codegen; classic API-design discipline. | Drift is the #1 failure mode — handlers diverge from spec silently unless you wire heavy contract tests; two sources of truth; refactoring is painful; TS-side codegen tooling is mediocre. | + +**Recommendation: code-first via TypeBox** (paired with Fastify per D1). For a 100+ endpoint surface area, drift is a near-certainty in spec-first; code-first inverts the failure mode — handlers cannot lie about their schemas. + +### D4. 
Endpoint Conversion Tooling — Claude Skill vs Codegen Script vs Manual + +| Option | Pros | Cons | +|---|---|---| +| **Claude Code skill** ⭐ (`.claude/skills/nuxt-to-api/`) | Handles variance in Nuxt route shapes (different validation styles, response patterns, error conventions); can update related artifacts (handler + TypeBox schema + integration test + OpenAPI tag + docs stub) in one pass; can ask follow-up questions when ambiguous; lives in-repo and improves iteratively; matches existing `.claude/rules/*` and `.claude/skills/*` workflows. | Non-deterministic — different runs may produce slightly different code (mitigated by mandatory PR review and tests); skill quality can drift over time without maintenance. | +| **Codegen script (AST-based)** | Deterministic; reproducible; could run in CI to enforce conformance. | Nuxt handlers vary too much for clean AST transforms (H3 helpers, inline TB queries, custom middlewares); you spend more time building the codegen than the API; LLMs end up doing the last-mile cleanup anyway. | +| **Manual port** | Maximum control; zero tooling overhead. | ~100 endpoints × 1–2 h each ≈ 100–200 h of repetitive work; high copy-paste error rate; pattern drift across endpoints. | + +**Recommendation: Claude skill.** This codebase already has `.claude/skills/` and `.claude/rules/` infrastructure, and this is exactly the kind of repetitive structured port the skill model was designed for. Every conversion lands in a PR with tests, so non-determinism is a non-issue. + +### D5. Datadog Metrics Strategy — Custom Metrics vs APM Trace Metrics + +| Option | Pros | Cons | +|---|---|---| +| **Custom metrics** (DogStatsD / OTel metrics → DD custom metrics) | Explicit metric names; dashboards/monitors trivial to build; predictable aggregation semantics; fast queries. 
| Billed per unique tag-combination per metric (≈$0.05/series/month above quota); cardinality explosion is easy and expensive; high-cardinality dimensions (`customer_id`, `api_key_id`) blow the budget fast. | +| **APM trace metrics** (derived from span attributes) | Slicing by span attributes is not billed as custom metrics; can slice by `customer_id` / `api_key_id` without cost spike; flame-graph + latency-breakdown per request; ingestion cost is per-span, not per-tag. | APM has its own ingestion cost; span sampling can drop rare events at scale; alerting ergonomics are slightly different; counters for rate-limit rejections still want every event. | +| **Hybrid (both)** ⭐ | Low-cardinality custom metrics for SRE dashboards/alerting; high-cardinality slicing happens in APM; cost-controlled and complete. | Two systems to learn; need a clear rule for "what goes where" (covered in §6). | + +**Recommendation: Hybrid.** A small set of low-cardinality custom metrics (tags: `endpoint`, `version`, `tier`, `status_class`) for dashboards and alerts, and APM trace metrics (span attributes: `customer_id`, `api_key_id`, `bucket_id`, `pipe_id`, numeric `status_code`) for per-customer drilldowns. Catalog in §6. + +--- + +## 4. Epics / Milestones for Jira + +Recommended ordering: **E1 → E2 → E3 (in parallel with E4, E5) → E6 → E7 (Development) → E8 (Contributors) → ...** + +### Epic E1 — Foundation & Framework + +Bootstrap the standalone service per §3 D1 (Fastify) and share code with frontend. + +- **T-001** Bootstrap `api` package in pnpm workspace with **Fastify + TypeScript + TypeBox** (per §3 D1, D3). ESLint/Prettier matching repo conventions. Include `@fastify/swagger` and `@fastify/type-provider-typebox`. +- **T-002** Dockerfile + Helm/Terraform/whatever the repo uses for deploy. Stage and prod environments. +- **T-003** CI: lint, typecheck, test, build, image push (mirror frontend pipeline). 
+- **T-004** Extract Tinybird client (`adaptive-semaphore.ts`, `bucket-cache.ts`, `TinybirdResponse` type, core HTTP fetch logic) into `libs/tinybird-client`. Replace `ofetch` with native `fetch` (Node 18+). Remove Luxon and H3/Nuxt-specific dependencies — the lib is framework-agnostic. Frontend and API both depend on it. +- **T-005** Extract shared enum definitions (`ActivityPlatforms`, `ActivityTypes`, `Granularity`) into `libs/insights-types`. Request/response shape types are defined separately in each app — the frontend keeps its Luxon-based types; `/api` defines its own TypeBox schemas. +- **T-006** Standard health endpoints: `/health/live`, `/health/ready` (TB ping, Redis ping, PG ping). +- **T-007** Error envelope ADR + Fastify error hook — single shape for all errors (`{ error: { code, message, requestId, docsUrl } }`). +- **T-008** Wire OTel trace ID as the request ID — error envelope's `requestId` = OTel trace ID. Inbound `traceparent` honoured via the W3C propagator (auto); outbound calls inject it automatically. W3C `traceparent` is the sole HTTP propagation channel — no `X-Request-Id` response header. No separate ULID generator (per ADR-0019). +- **T-009** Local dev story: `pnpm dev` from `api/`, hot reload via `tsx watch` or `fastify-cli`, env file template. + +### Epic E2 — Tinybird Read Replica + +Prepare upstream so the API does not contend with the frontend for TB capacity. + +- **T-010** Contact Tinybird support to confirm whether per-app dedicated read replicas are supported. Owner: . +- **T-011** Track TB response; if available, set up the replica and wire its host/token into env config. +- **T-012** Fallback plan if TB does not offer per-app replicas: provision a **separate Tinybird workspace** or **separate token with workload class** for the API. ADR documenting trade-offs. +- **T-013** Move TB token + host config to runtime env (`API_TB_TOKEN`, `API_TB_HOST`). 
+- **T-014** Add upstream Tinybird latency + error metrics (covered in E4) so we can compare replica vs frontend behavior. + +### Epic E3 — Auth & Rate Limiting (API Keys via LFX Self-Serve) + +Per-key auth with tier-aware authorization and rate limiting. Reuse existing LFX membership tiers — we consume them from JWT claims, we don't invent a new tier model. + +- **T-015** Coordinate with the LFX Self-Serve team: confirm the JWKS endpoint URL, the JWT claim names for `org` and `tier`, the `iss` value, the access-token lifetime (target: ~15 min), the `/token` endpoint shape on Self-Serve (RFC 6749 §6 `grant_type=refresh_token`), and **resolve the open question of whether the Insights API reuses the existing `app.lfx.dev/settings` personal access token as the refresh token or mints a new Insights-scoped refresh token** (see §9 Open Questions and ADR-0015). +- **T-015b** Implement `POST /v1/auth/token` proxy in Insights — forwards request body and headers to Self-Serve's `/token` verbatim, returns the response verbatim. Maps Self-Serve unreachable to `503 upstream_unavailable`. This is the sole endpoint customers use to exchange refresh tokens for access tokens. +- **T-016** ADR: API key claims schema + tier → capability matrix (which existing LFX tiers map to which endpoints and rate limits). Reference LFX Self-Serve as the claims issuer. +- **T-017** Access token verification middleware: verify JWT signature of the **access token** against the LFX Self-Serve JWKS endpoint, cache JWKS, accept `Authorization: Bearer `. Insights never receives refresh tokens on `/v1/...` endpoints. Reject tokens whose `iss` does not match the configured LFX Self-Serve issuer (and `aud`, if the open question lands on a scoped token). +- **T-018** Tier-based authorization: route-level decorator/config that checks the key's tier against the route's required tier, lives next to the route definition. 
+- **T-019** Per-org rate limiting (Redis sliding window) — extract only the Redis sorted-set sliding-window primitive from `frontend/server/utils/rate-limiter.ts` into `libs/rate-limiter`. The public API writes its own org-aware wrapper keyed by `org_id` on top of this primitive. The IP-based identity resolution and H3-specific code stays in the frontend. +- **T-020** Standard rate-limit response headers (`X-RateLimit-*`, `Retry-After`) and 429 envelope. +- **T-021** Document and test the revocation flow: revoking a refresh token in `app.lfx.dev/settings` causes the next `POST /v1/auth/token` (forwarded to Self-Serve) to return `400 invalid_grant`; in-flight access tokens expire naturally (~15 min). Insights has nothing to delete or maintain. Cover in an integration test. +- **T-022** Customer-facing docs: explain the refresh-token / access-token split; provide a ~30-line `getAccessToken()` snippet pointing at `api.insights.linuxfoundation.org/v1/auth/token`. Link to `app.lfx.dev/settings` for token creation — Insights docs do not duplicate that flow. + +### Epic E4 — Observability (OpenTelemetry + Datadog) + +No hard SLAs in v1 — everything is **observational** for now. Implements the hybrid strategy from §3 D5. + +- **T-023** Integrate OpenTelemetry SDK (`@opentelemetry/sdk-node`): HTTP auto-instrumentation, Postgres auto-instrumentation, custom spans around Tinybird calls. W3C TraceContext propagator (default). Span attributes carry high-cardinality dimensions (`customer_id`, `api_key_id`, `bucket_id`, `pipe`, numeric `status_code`). Per ADR-0019. +- **T-024** Implement low-cardinality custom metrics per §6 catalog. Helper module so handlers emit consistently. +- **T-025** Add `opentelemetry-collector` sidecar to the API pod spec. Configure the SDK to export OTLP to `localhost:4317`. Collector forwards to Datadog in prod/staging; stdout exporter in local dev (per ADR-0019). 
+- **T-026** Datadog dashboards: per-endpoint, per-tier, per-customer top-N (via APM trace metrics), upstream TB health. +- **T-027** Datadog monitors: 5xx rate, TB failure rate, auth-failure spike, rate-limit-rejection spike. Latency thresholds left open — we baseline first, then dial in. +- **T-028** Structured JSON logging (pino), shipped to Datadog Logs with trace correlation (per ADR-0018). Logs include `trace_id`/`span_id` (OTel hex format); Datadog ingests OTel-format IDs natively. + +### Epic E5 — API Documentation + +Implements §3 D2. **VitePress + Scalar** under `api/docs/` — standalone site co-located with the API service, deployed independently of the frontend. + +- **T-029** Bootstrap `api/docs/` as a VitePress site — quickstart, authentication, pagination, error codes, changelog pages. +- **T-030** Embed Scalar on the reference page; wire it to ingest the generated OpenAPI spec (`api/openapi.json`) on every release. Serve the static VitePress build at `api.insights.linuxfoundation.org/docs` via Fastify's static file serving under `/docs`. +- **T-031** Wire Fastify OpenAPI export so docs ingest the generated spec on every release. +- **T-032** Quickstart guide: auth, first request, error envelope, rate limits. +- **T-033** Per-tier capability matrix in docs. +- **T-034** Changelog + deprecation page. + +### Epic E6 — Versioning + +Implements URL-prefix versioning (`/v1`, `/v2`). + +- **T-035** ADR: URL-prefix versioning. Why URL over headers: discoverability, easier caching, simpler customer code samples. +- **T-036** Version routing structure in Fastify: separate router trees per version, not flag-based branching inside handlers. +- **T-037** Per-version OpenAPI artifact (one spec per version, served at `/v1/openapi.json`). +- **T-038** Deprecation/Sunset header support (`Deprecation: true`, `Sunset: `, `Link: ; rel="deprecation"`). +- **T-039** Version-bumping playbook: introducing v2 of an endpoint while keeping v1 stable. 
Shared upstream code where possible (handler imports a `v1Mapper` / `v2Mapper`). + +### Epic E7 — Endpoint Migration Phase 1: Development + +One ticket per endpoint. Each ticket: port handler, define TypeBox schema, write integration test, OpenAPI tag, document, ship to production (soft-launch model — per §9 #23). No feature flag. + +- **T-040** Inventory all `frontend/server/api/**` endpoints used by the Development tab. Produce a checklist. +- **T-041 .. T-04N** One task per endpoint (N tickets — fill in once T-040 is done). Each uses the `nuxt-to-api` skill (E14). Each endpoint goes live to production when its ticket completes — no batched "Phase 1 launch" event (per §9 #23 soft-launch model). + +### Epic E8 — Endpoint Migration Phase 2: Contributors +- Inventory + one ticket per endpoint + launch. + +### Epic E9 — Endpoint Migration Phase 3: Popularity +- Same shape. + +### Epic E10 — Endpoint Migration Phase 4: Security & Best Practices +- Same shape. + +### Epic E11 — Endpoint Migration Phase 5: Overviews +- Same shape. + +### Epic E12 — Endpoint Migration Phase 6: Collections +- Same shape as earlier endpoint groups, plus: +- **Collections Postgres queries:** `/api` writes its own minimal read-only SQL queries directly — no shared repo lib with the frontend. The frontend's `communityCollection.repo.ts` is write-heavy and Nuxt-coupled; the API only needs ~3 read queries (get by slug, list, permission check). + +### Epic E13 — Endpoint Migration Phase 7: Leaderboard +- Same shape. + +### Epic E14 — Endpoint Conversion Tooling + + + + +Implements §3 D4 (Claude skill). + +- **T-080** Build the skill `nuxt-to-api` (under `.claude/skills/`): + - Input: a Nuxt endpoint path (e.g. `frontend/server/api/development/...`). + - Reads the Nuxt handler, request validation, response shape. 
+ - Emits a Fastify handler in `api/src/routes/v1//[slug]/.ts` (slug mirrored in directory structure, matching Nuxt convention) with TypeBox schema, OpenAPI tags, route-level tier requirement, and a passing integration test. + - Adds entries to docs (or at least a stub) and to the per-version OpenAPI. +- **T-081** Test the skill on 3 representative endpoints from the Development tab; iterate. +- **T-082** Document the skill in `CONTRIBUTING.md` of `api`. +- **T-083** Cursor pagination rewrite: the skill must convert Nuxt `page`/`pageSize`/`total` handlers (and `limit`/`offset` variants) into cursor-based handlers — encoding `{ k, id }` as `base64url`, using the lookahead-LIMIT trick (`LIMIT pageSize + 1`) to set `nextCursor` without a second count query, and dropping the `total` field from the response. + +### Epic E15 — Key Management Entry Point (LFX Insights Frontend) + +Key creation, listing, and revocation live entirely in the LFX Self-Serve App (per ADR-0015). The LFX Insights frontend only needs a small placeholder that deep-links to `app.lfx.dev/settings`. + +- **T-095** Add a `/settings/api-keys` page in the LFX Insights frontend with a single "Manage API keys" CTA that opens `app.lfx.dev/settings` in a new tab. No list, no create, no revoke UI. +- **T-096** Closed-alpha gating signal: surface a "request access" state for users whose org is not on the closed-alpha allowlist ([T-089](#epic-e16--pre-launch)), so they understand why their key (if any) returns 403 against `/v1-alpha`. + +### Epic E16 — Pre-Launch + +These are the gates for the **launch** (per §9 #23) — not for individual endpoints, which roll out per-endpoint through closed alpha → silent public. + +- **T-090** Load testing (k6 or artillery) — establish baseline req/s per pod, validate rate limiter under load. Required gate for promoting an endpoint from closed alpha → silent public. 
+- **T-091** Tier-to-API-access mapping finalized with product (per §9 #2 reuses existing LFX tiers, but rate-limit numbers per tier need product sign-off).
+
+---
+
+## 5. Endpoint Rollout Order (recap)
+
+1. Development
+2. Contributors
+3. Popularity
+4. Security & Best Practices
+5. Overviews
+6. Collections
+7. Leaderboard
+8. (more — to be decided)
+
+Each phase = one epic = many tasks (one per endpoint).
+
+---
+
+## 6. Datadog Metrics Catalog (Initial)
+
+Implements the hybrid strategy from §3 D5. **Tags = low cardinality** (billed as custom metrics). **Span attributes = high cardinality** (free in APM trace metrics).
+
+Cost reminder: Datadog bills custom metrics per unique tag-combination per metric (≈$0.05/series/month above the included quota); high-cardinality tags multiply fast. Span attributes do not count toward custom-metric billing.
+
+| Metric | Type | Tags (low-card) | Span attribute (high-card) | Why |
+|---|---|---|---|---|
+| `api.request.count` | Counter | `endpoint`, `version`, `status_class` (2xx/4xx/5xx), `tier` | `customer_id`, `api_key_id`, `status_code` | Throughput by route/tier |
+| `api.request.duration` | Histogram | same as above | same | Latency p50/p95/p99 |
+| `api.tinybird.duration` | Histogram | `pipe`, `status_class` | `query_id`, `bucket_id` | Upstream latency |
+| `api.tinybird.errors` | Counter | `pipe`, `error_type` | `query_id`, `customer_id` | Upstream reliability |
+| `api.postgres.duration` | Histogram | `query_name`, `status_class` | `customer_id` | DB latency |
+| `api.ratelimit.rejections` | Counter | `tier`, `endpoint` | `customer_id`, `api_key_id` | Customers hitting limits |
+| `api.auth.failures` | Counter | `reason` (invalid_jwt / expired / revoked / missing) | `api_key_id` (when known) | Auth bypass attempts |
+| `api.cache.hit_ratio` | Gauge | `cache_layer` | — | If we add response caching |
+| `api.concurrency` | Gauge | — | — | Adaptive semaphore depth |
+
+**Cardinality budget (initial):** roughly `~25 
endpoints × 4 tiers × 3 status_class × 2 versions ≈ 600 timeseries per metric` × 9 metrics ≈ 5.4k custom timeseries. Well within reasonable cost. + +**Things we will NOT tag:** `customer_id`, `api_key_id`, numeric `status_code`, `bucket_id`, `pipe_id`. These ride on spans (APM trace metrics, not billed as custom metrics) so we can still slice by them in Datadog APM and Logs. + +--- + +## 7. Critical Files / Areas to Reference During Implementation + +- `frontend/server/data/tinybird/tinybird.ts` — TB client with `AdaptiveSemaphore`, bucket routing, response typing. Extract to shared lib ([T-004](#epic-e1--foundation--framework)). +- `frontend/server/data/types.ts` — shared filter shapes (`DefaultFilter`, `ActiveContributorsFilter`, etc.). These are **not** extracted to a shared lib — `/api` defines its own TypeBox equivalents per endpoint (filter shapes are TypeBox schemas, not shared runtime types). Use this file as a reference when porting handlers. +- `frontend/server/utils/rate-limiter.ts` — Redis sliding-window implementation; extract the core primitive into `libs/rate-limiter`, write a new org-aware wrapper in `/api` keyed by `org_id` ([T-019](#epic-e3--auth--rate-limiting-api-keys-via-lfx-self-serve)). +- `frontend/server/utils/jwt.ts` — existing Bearer/JWT helper (`auth(event)`). Conceptually closest to the public-API auth but uses one shared secret; we'll replace the verify step with the LFX Self-Serve JWKS endpoint (per ADR-0015), not Auth0 JWKS ([T-017](#epic-e3--auth--rate-limiting-api-keys-via-lfx-self-serve)). +- `frontend/setup/rate-limiter.ts` — current rate-limiter rules. Inspiration for tier-based rules. +- `frontend/server/api/development/**` — all endpoints to inventory in [T-040](#epic-e7--endpoint-migration-phase-1-development). +- `pnpm-workspace.yaml` — currently lists `frontend`, `workers/*`. Add `api` and `libs/*` entries when bootstrapping ([T-001](#epic-e1--foundation--framework)). + +--- + +## 8. 
Verification (post-implementation) + +- Unit + integration tests per endpoint (mocked TB). +- Contract tests against staging TB for each endpoint family. +- Load test ([T-090](#epic-e16--pre-launch)) against staging cluster. +- Security review ([T-091](#epic-e16--pre-launch)). +- Datadog dashboards green for 72h on stage with synthetic traffic before GA per phase. +- One real partner integration completed end-to-end before declaring a phase GA. + +--- + +## 9. Decisions So Far + Remaining Open Questions + +**Decided (this plan):** + +1. **Location:** monorepo, `/api` at the repo root (sibling of `frontend/`). Add `api` to `pnpm-workspace.yaml`. +2. **Tiers:** reuse existing LFX membership tiers — consumed as JWT claims, no new tier model. +3. **SLAs:** observational only in v1. No numeric latency/uptime commitments yet. +4. **Framework:** Fastify (§3 D1). +5. **OpenAPI source:** code-first via TypeBox (§3 D3). +6. **Versioning:** URL prefix `/v1`, `/v2` (§3 + E6). +7. **Conversion tooling:** Claude skill `nuxt-to-api` (§3 D4). Produces a complete, ready-to-review Fastify handler (TypeBox schema, Tinybird/Postgres calls, response mapping, integration test) — not a skeleton. Developer's job is review, not writing. +8. **Datadog strategy:** hybrid — low-card custom metrics + APM trace metrics for high-card slicing (§3 D5, §6). +9. **Docs tool:** VitePress + Scalar (§3 D2). Standalone site under `api/docs/`, co-located with the service, deployed independently. Scalar embedded on the reference page, reads generated OpenAPI spec. +10. **Customer model:** the API principal is a **user** (`sub` = user ID, used as the `customer_id` span attribute in APM traces). The user's **organization** (`org` claim, exact name confirmed at T-015) drives **tier and rate-limit pool** — multiple users in the same org share one rate-limit pool. Both claims come from the LFX Self-Serve JWT (per ADR-0015). 
The exact claim names, behavior when a user has no org, and behavior when a user belongs to multiple orgs are follow-ups for T-015 with the LFX Self-Serve team. Whether the token is an existing platform PAT or a new Insights-scoped token is open question #2 under "Still open" below.
+11. **Data scope (v1):**
+    - **Phases 1–4 (Development, Contributors, Popularity, Security & Best Practices):** public OSS data, **no per-project permission check**. Tier check only.
+    - **Phase 5 (Collections):** tier check + permission check (private collections gated by ownership/membership; public collections open to all valid keys). Permission source: **Postgres lookup with Redis cache** (~60s TTL). Decision deferred to Phase 5 — does not block earlier phases.
+    - No general project-membership authorization graph in v1.
+12. **Authentication floor:** every request requires a valid API key. No anonymous access path. 401 on missing/invalid key, full stop.
+13. **Caller scope (v1):** server-to-server only. CORS responds with no `Access-Control-Allow-Origin` for the API, which blocks browser callers. **Revisit before GA** — we may extend to browser support (per-key origin allowlist or a publishable+secret key model) if customer demand emerges. Track as a follow-up.
+14. **Billing model:** bundled with existing LFX membership. No standalone billing infra in v1. Tier comes from the existing LFX tier on the user's org. Usage is metered only for rate-limit enforcement and Datadog dashboards — not for invoicing. 
Enforcement is split: (a) **token-mint time** — Self-Serve validates Key Contact status via OpenFGA `v2_organization` entities before issuing an Insights access token; the precise moment depends on the PAT model (new Insights-scoped refresh token → check at issuance; existing PAT reused → check at `/v1/auth/token` exchange time — open question ADR-0015 Q1); either way, non-Key-Contacts never receive a valid Insights access token; (b) **request time** — the Insights API reads `tier` and `org` from the JWKS-verified claims on every request to enforce rate limits and future per-endpoint tier gating, but never re-queries OpenFGA or any membership system. See ADR-0010. +15. **Pagination:** cursor-based. Request: `cursor` (opaque base64url, omit on first page) + `pageSize` (default 50, max 200). Response: `{ data, pageSize, nextCursor }`; `nextCursor: null` indicates end of list. No `total` field (counting on every request doubles Tinybird load). Cursor-based chosen over the Nuxt `page`/`pageSize` convention for stability under mutations, O(log N) cost vs offset scan, and fit with server-to-server iteration. See ADR-0011. +16. **URL port strategy:** **hybrid**. Default to port-as-is from Nuxt to `/v1/...`, applying only light normalization (kebab-case path segments, plural collection nouns). Rename only when the existing URL is **genuinely misleading** to an external developer. The `nuxt-to-api` skill ([T-080](#epic-e14--endpoint-conversion-tooling)) defaults to port-as-is and surfaces the URL for explicit reviewer approval; renames are a per-endpoint judgement call recorded in the PR. +17. **Versioning semantics ("breaking change" definition):** **tolerant-reader / additive-only**. Within a version, allowed: adding response fields, adding optional query params, adding endpoints, expanding accepted enum INPUT values, adding new error codes, adding new success status codes. 
Requires a major version bump: removing/renaming a response field, changing a field's type, making an optional input required, narrowing accepted input values, removing an endpoint, changing the error envelope shape, changing default or max pageSize, or changing the cursor encoding semantics. **Customers commit to ignoring unknown response fields** (documented prominently). Matches Stripe/GitHub/Google. +18. **Caching contract (v1):** origin-side Redis cache only (~5–60s TTL depending on endpoint). All responses set `Cache-Control: private, max-age=0` — customers do not cache, intermediaries do not cache. Lets us tune TTL without breaking customers. Public/CDN cache headers can be introduced later as a non-breaking improvement once we have real traffic data. +19. **JSON key casing:** **camelCase** for all request and response JSON keys (`startDate`, `activityTypes`, `includeCodeContributions`). The existing Nuxt code is mixed (some snake_case fields like `activity_types`); the `nuxt-to-api` skill normalizes these to camelCase at port time. Date values are ISO-8601 strings in UTC (`2025-12-31T23:59:59Z`) — committed as a convention, not a question. +20. **Tier gating (v1):** all tiers see all endpoints; **tiers control rate limits only** in v1. The per-route "required tier" mechanism IS built into the framework (so individual endpoints can be gated later without an architectural change), but every v1 endpoint declares the lowest tier. When a future endpoint is gated above a user's tier, the response is **403 with `code: tier_forbidden`** and the error envelope's `docsUrl` deep-links to the tier-capability matrix. +21. **API key lifecycle:** refresh tokens are long-lived with **no auto-expiry**. Access tokens are short-lived (~15 min, confirmed at T-015). Customers rotate refresh tokens manually via `app.lfx.dev/settings`. **Multiple active refresh tokens per user supported** so rotation is zero-downtime (mint new → switch → revoke old). 
Revocation enforced by LFX Self-Serve — once a refresh token is revoked, the next `POST /v1/auth/token` call returns `400 invalid_grant`; in-flight access tokens expire naturally. No Insights-side deny-list. Best practice rotation guidance documented but never forced. +22. **SDKs (v1):** none. Customers integrate against the OpenAPI spec directly, plus `curl`/`fetch`/`requests` examples in docs. SDK strategy revisited post-v1 once we see what languages customers actually use. +23. **Launch model:** Endpoints roll out per-endpoint through two stability stages: + 1. **`/v1-alpha`:** endpoint is served under `/v1-alpha/...` with no contract guarantees. Access is restricted to an allow-listed cohort (LFX-internal devs + external design partners). Breaking changes are allowed freely. Used to validate contract and performance before broad exposure. + 2. **`/v1`:** once the endpoint passes the promotion criteria (load test, one week stable shape with alpha cohort, error/latency budgets healthy, security sign-off), it graduates to `/v1/...` and the full tolerant-reader contract locks in. Open to all users with a valid API key. The `/v1-alpha` route returns `410 Gone` for two weeks after promotion. + +**Still open:** + +1. Rate-limit numbers per LFX membership tier (Gold, Platinum, etc.) — TBD, pending product sign-off. Drives [T-093](#epic-e16--pre-launch). +2. **Reuse `app.lfx.dev/settings` personal access token as the refresh token, or mint a new Insights-scoped refresh token?** Drives the JWT claim shape, whether `aud` is required, and the LFX Self-Serve UI work. Trade-offs: + - **Reuse:** one platform-wide refresh token across every LFX service — best UX. Compromise blast radius is wider than Insights alone, so a service-scope claim (e.g. `aud` or `scopes`) would be needed for Insights to safely refuse out-of-scope use. Requires LFX Self-Serve to ensure the existing token carries (or can be made to carry) `org` and `tier`. 
+ - **Mint new:** Insights-scoped refresh token isolated from the user's other LFX integrations; per-service revocation. Cost: extra "Create Insights API token" affordance in `app.lfx.dev/settings` and a new token type for LFX Self-Serve to support. + +**Resolved:** +- Deployed on the same Kubernetes cluster as frontend. ([T-002](#epic-e1--foundation--framework)) +- Using existing Datadog org and APM agent. ([T-025](#epic-e4--observability-opentelemetry--datadog)) +- `granularity` most granular option will be `daily` — no `hourly`. (E7–E11) +- API docs gated until launch. (E5) +- Base URL: `https://api.insights.linuxfoundation.org/v1/...` +- Keys issued by LFX Self-Serve App (`app.lfx.dev/settings`); Insights stores no keys. (ADR-0015) +- Tinybird errors: serve stale Redis cache, 503 if no cache. (ADR-0016) +- Collections: read-only in v1, both metadata and analytics exposed. +- Slug validation: pass-through (unknown slugs return empty data, not 404). +- Repo-level filtering (`repos` param) exposed as-is across all endpoints. +- Integration test per endpoint required before promotion to `/v1`. +- Local dev uses real credentials via `.env` + Docker Redis. diff --git a/docs/adr/0001-fastify-over-nestjs.md b/docs/adr/0001-fastify-over-nestjs.md new file mode 100644 index 000000000..6768f0552 --- /dev/null +++ b/docs/adr/0001-fastify-over-nestjs.md @@ -0,0 +1,9 @@ +# Fastify over NestJS for the public API service + +We chose Fastify as the HTTP framework for `/api` rather than NestJS. NestJS was the leading alternative — it's the de-facto "enterprise Node.js" choice and familiar to most backend developers. 
We rejected it because: (1) NestJS's decorator-based DI and module system add ~3–5× the boilerplate for a service whose only job is validating a JWT, proxying a Tinybird query, and shaping the response; (2) NestJS couples schema validation to class-validator/class-transformer, whereas we want TypeBox for code-first OpenAPI generation (see ADR-0008); and (3) Fastify's plugin isolation maps cleanly onto our endpoint-group structure. The trade-off is that Fastify's ecosystem is smaller and less opinionated — future engineers should resist the urge to add NestJS-style decorators or a DI framework unless the service grows substantially beyond its current scope.
+
+## Considered Options
+
+- **NestJS** — rejected: too much ceremony for a thin proxy service; forces class-validator coupling
+- **Express** — rejected: no built-in schema validation, slower than Fastify, no first-class TypeBox integration
+- **Hono** — rejected: designed for edge runtimes (Cloudflare Workers, Deno Deploy) first and Node second — the Tinybird and Postgres clients rely on Node-native APIs (`net`, `tls`, `Buffer`) that edge runtimes don't provide, so running Hono on Node requires an adapter layer that introduces exactly the kind of compatibility risk we want to avoid. Fastify is Node-native and battle-tested at scale with no runtime adapter between the framework and the OS.
diff --git a/docs/adr/0002-api-at-repo-root.md b/docs/adr/0002-api-at-repo-root.md
new file mode 100644
index 000000000..cccc53e09
--- /dev/null
+++ b/docs/adr/0002-api-at-repo-root.md
@@ -0,0 +1,3 @@
+# Public API service lives at `/api` (repo root), not inside `workers/`
+
+The public API service is a top-level directory `/api` at the repo root, sibling to `frontend/`, rather than living under `workers/` or another subdirectory. 
The repo already has a `workers/` tree for background workers; placing `/api` there would imply the same operational model (fire-and-forget workers) rather than the long-running HTTP server it actually is. The `frontend/` precedent — a first-class product surface with its own `package.json`, dev server, and deployment unit — is the correct mental model. Keeping `/api` at the root also makes `pnpm-workspace.yaml` membership and CI matrix entries symmetric with `frontend`. Engineers should not move it into `workers/` to "organize" it — the flat structure is intentional. diff --git a/docs/adr/0003-tolerant-reader-versioning.md b/docs/adr/0003-tolerant-reader-versioning.md new file mode 100644 index 000000000..8395e2e8f --- /dev/null +++ b/docs/adr/0003-tolerant-reader-versioning.md @@ -0,0 +1,51 @@ +# v1 contract: tolerant-reader / additive-only changes within a version + +## Stability prefixes + +Endpoints go through two URL-level stability stages before they are considered stable: + +- **`/v1-alpha/...`** — no contract guarantees. Used during the closed-alpha stage. We can change shapes, rename fields, remove endpoints, or restructure responses freely. Alpha callers are explicitly told to expect breakage and must re-integrate when we promote to `/v1`. +- **`/v1/...`** — full tolerant-reader contract (see below). An endpoint graduates to `/v1` when it is promoted to silent public. From that point, the additive-only rules apply and breaking changes require a version bump. + +This means closed-alpha users integrate against `/v1-alpha` and absorb any churn during validation. Once an endpoint moves to `/v1`, the contract is locked and all callers — including MCP integrations — can rely on it indefinitely. + +## `/v1` contract + +Within `/v1`, we commit to additive-only changes: new response fields, new optional query params, new endpoints, expanded enum inputs, new error codes. 
In practice this means the version number rarely if ever changes — the vast majority of API evolution is additive and stays on v1 indefinitely. A version bump is only warranted when something fundamental needs to be removed or restructured, which for a read-only analytics API is rare; Stripe's API URL prefix, for example, has stayed at `/v1` since 2011. The docs instruct callers to ignore unknown response fields. Any removal, rename, type change, or constraint-tightening requires a new major version prefix. We chose this over date-based versioning (e.g. `2025-01-01`) because the primary consumers are MCP integrations and automated tooling — a stable, predictable URL like `/v1/...` is simpler to hardcode and reason about than a date-versioned header scheme where the caller must track which date snapshot they're pinned to. Engineers adding a field to a response type must not remove another field in the same PR without a `/v2` plan in place.
+
+## Promotion from `/v1-alpha` to `/v1`
+
+An endpoint can be promoted when all of the following are true:
+
+1. **Load test passes** — baseline req/s established, rate limiter validated under load ([T-090](../PUBLIC_API_PLAN.md#epic-e16--pre-launch)).
+2. **No contract regressions** — the closed-alpha cohort has been using it without shape changes for at least one week.
+3. **Error and latency budgets healthy** — 5xx rate and p99 latency within acceptable thresholds in Datadog.
+4. **Security sign-off** — no open auth, tenant isolation, or key-leakage issues for this endpoint.
+
+**Promotion mechanics:**
+
+1. Open a PR that adds the handler under `/v1/...` (the contract-locked route).
+2. Change the `/v1-alpha/...` handler to return `410 Gone` with a `Link: ; rel="successor-version"` header, so alpha callers get a clear machine-readable signal to update their URLs.
+3. Notify the closed-alpha cohort directly (release notes) with the new `/v1` URL and a migration note.
+4. 
Both routes coexist for a minimum of 2 weeks after promotion to give alpha callers time to migrate, then the `410` handler can be removed entirely. + +## Deprecation process + +When a field, parameter, or endpoint needs to be removed, the process is: + +1. **Mark it deprecated** in the TypeBox schema using a `description` annotation starting with `DEPRECATED:` and a short reason. The OpenAPI spec will surface this to consumers. +2. **Add response headers** on the affected endpoint: `Deprecation: true` and `Sunset: ` (the earliest date we will remove it), plus `Link: ; rel="deprecation"` pointing to the migration guide. +3. **Communicate** — add an entry to the changelog page in docs and, where possible, notify known consumers directly. +4. **Honour the sunset window** — the field or endpoint must remain functional until the `Sunset` date. Removal before that date is a contract violation. +5. **Remove in the next major version** — the actual removal ships in `/v2`, not `/v1`. The `/v1` endpoint or field stays alive (even if just returning empty/stub data) until v1 is fully sunset. + +For **fields**: keep the field in the response but document it as deprecated. Do not change its type or meaning — that is also a breaking change. + +For **endpoints**: keep the route returning valid responses. Return a `Deprecation` header on every response from that route. + +For **entire versions**: sunset `/v1` only after `/v2` has been stable for a publicly announced minimum period (TBD — at least 6 months is conventional). 
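The response headers named in step 2 of the deprecation process could be assembled by a small helper. The following is a minimal sketch, not the planned implementation: `DeprecationInfo`, `deprecationHeaders`, and `sunsetReached` are hypothetical names, and the choice to format `Sunset` as an HTTP-date via `toUTCString()` is an assumption based on the usual form of that header.

```typescript
// Hypothetical helper for illustration; names and shapes are not part of the plan.
interface DeprecationInfo {
  sunset: Date; // earliest date the field or endpoint may stop working
  migrationGuideUrl: string; // page describing how to migrate
}

// Builds the three headers listed in step 2 of the deprecation process.
function deprecationHeaders(info: DeprecationInfo): Record<string, string> {
  return {
    Deprecation: "true",
    // toUTCString() yields an HTTP-date such as "Wed, 31 Dec 2025 23:59:59 GMT".
    Sunset: info.sunset.toUTCString(),
    Link: `<${info.migrationGuideUrl}>; rel="deprecation"`,
  };
}

// Step 4: the field or endpoint must remain functional until the Sunset date.
// This only reports whether that date has passed; removal itself still ships in /v2.
function sunsetReached(info: DeprecationInfo, now: Date): boolean {
  return now.getTime() >= info.sunset.getTime();
}
```

A route would merge the returned object into its response headers on every response, so that callers see the signal regardless of which request they happen to inspect.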
+ +## Consequences + +- All TypeBox response schemas must be treated as append-only within v1 +- The `nuxt-to-api` conversion skill must normalize field names to camelCase at port time (not patch them in-place later, which would be a breaking rename) diff --git a/docs/adr/0004-server-to-server-cors-deny.md b/docs/adr/0004-server-to-server-cors-deny.md new file mode 100644 index 000000000..600cddd3f --- /dev/null +++ b/docs/adr/0004-server-to-server-cors-deny.md @@ -0,0 +1,3 @@ +# v1 is server-to-server only; CORS denies all browser origins + +The v1 API does not send an `Access-Control-Allow-Origin` header for any browser origin and does not satisfy CORS preflight requests, making it intentionally unusable from browser JavaScript. This was a deliberate scope decision, not an oversight — the API is designed for server-side integrations (CI pipelines, dashboards, data exports) where the API key is safely stored server-side. Allowing browser origins in v1 would require a CORS policy, credential-safe key distribution story, and potentially cookie-based auth — all out of scope. The decision is flagged for **revisit before launch**: if design partners need browser-side access (e.g. embedding charts in their own web apps), we will add an explicit `allowedOrigins` allow-list mechanism at that point. Engineers must not "fix" the missing CORS headers — this is intentional. diff --git a/docs/adr/0005-tiers-control-rate-limits-only.md b/docs/adr/0005-tiers-control-rate-limits-only.md new file mode 100644 index 000000000..609404d64 --- /dev/null +++ b/docs/adr/0005-tiers-control-rate-limits-only.md @@ -0,0 +1,3 @@ +# Tiers control rate limits only in v1; no per-endpoint feature gating + +In v1, all tiers see all endpoints. The Tier attached to an Organization only determines the size of its Rate-limit Pool. 
We deliberately did not gate specific endpoints behind higher tiers in v1 — the endpoint-group rollout is already complex enough, and we do not yet have enough usage data to know which endpoints are expensive enough to warrant gating. The per-route "required tier" mechanism IS built into the framework (so future gating requires a config change, not an architectural one), but every v1 endpoint declares the minimum tier. When a caller's tier is insufficient for a future gated endpoint, the response is `403` with `code: tier_forbidden` and a `docsUrl` pointing to the tier-capability matrix. Engineers must not add per-endpoint tier checks to v1 endpoints without a product decision — doing so would break callers who integrated assuming open access. diff --git a/docs/adr/0006-refresh-and-short-lived-access-tokens.md b/docs/adr/0006-refresh-and-short-lived-access-tokens.md new file mode 100644 index 000000000..27782365a --- /dev/null +++ b/docs/adr/0006-refresh-and-short-lived-access-tokens.md @@ -0,0 +1,19 @@ +# Refresh tokens are long-lived; access tokens are short-lived + +The customer-facing credential is a **refresh token** issued by the LFX Self-Serve App at `app.lfx.dev/settings`. It does not expire automatically. Rotation is encouraged (documented best practice) but never enforced — multiple active refresh tokens per User are supported so rotation is zero-downtime: mint new → switch integrations → revoke old. + +Customer code exchanges the refresh token for a short-lived **access token** (~15 min; exact lifetime confirmed at T-015) via `POST api.insights.linuxfoundation.org/v1/auth/token`. That endpoint is a thin proxy to LFX Self-Serve's `/token` endpoint (RFC 6749 §6 `grant_type=refresh_token`). The Insights API forwards the request and returns the response verbatim — it does not mint, validate, or store refresh tokens. + +The access token is what the Insights API JWKS-verifies on every request. 
It carries the `sub`, `org`, `tier`, `iss`, `kid`, and possibly `aud` claims. The Insights API never receives refresh tokens on `/v1/...` endpoints. + +## Revocation + +Revocation is owned by LFX Self-Serve. When a user revokes a refresh token in `app.lfx.dev/settings`, the next `POST /v1/auth/token` call (forwarded to Self-Serve) fails with `400 invalid_grant` — the customer's code can no longer mint new access tokens. Already-issued access tokens continue to work until their `exp`. Revocation lag is bounded by the access token's natural lifetime (~15 min). Insights maintains no deny-list and runs no introspection endpoint. + +## Why not long-lived JWTs sent directly as Bearer + +The previous design issued a long-lived JWT that the customer sent as `Authorization: Bearer` on every request. Two problems: + +1. **No natural expiry on a leaked token.** A long-lived JWT that escapes into logs, error reports, or a compromised system stays valid indefinitely. Revocation requires Insights to run a per-request introspection-with-cache check — extra infrastructure for a problem the refresh-token model solves natively (leaked access token expires in ~15 min; leaked refresh token can only mint new access tokens, which the legitimate owner can stop by revoking it). + +2. **Revocation is ambiguous.** "Revoking" a long-lived JWT that has been cryptographically signed means nothing to a verifier that only checks the signature — the token remains valid until the key is rotated. Rotating the JWKS key revokes all tokens at once, not just one user's. The refresh-token model avoids this entirely. 
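The client side of the exchange described in this ADR might look roughly like the sketch below. It covers only the pure parts of a `getAccessToken()` helper: building the refresh request and deciding when a cached access token needs replacing. The form-encoded body follows RFC 6749 §6; whether the Insights proxy expects exactly this encoding, the `access_token` / `expires_in` response fields, and the 30-second safety margin are all assumptions, and every function name here is hypothetical.

```typescript
// Sketch under stated assumptions; not the official snippet from the docs.
const TOKEN_URL = "https://api.insights.linuxfoundation.org/v1/auth/token";

interface TokenRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the refresh-token grant request (RFC 6749 §6 shape, assumed).
function buildTokenRequest(refreshToken: string): TokenRequest {
  const body =
    "grant_type=refresh_token&refresh_token=" + encodeURIComponent(refreshToken);
  return {
    url: TOKEN_URL,
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body,
  };
}

interface CachedToken {
  accessToken: string;
  expiresAtMs: number;
}

// Converts an (assumed) token response into a cache entry with an absolute expiry.
function cacheFromResponse(
  resp: { access_token: string; expires_in: number },
  nowMs: number,
): CachedToken {
  return { accessToken: resp.access_token, expiresAtMs: nowMs + resp.expires_in * 1000 };
}

// True when there is no cached token or it expires within the safety margin.
function needsRefresh(cached: CachedToken | null, nowMs: number, skewMs = 30000): boolean {
  return cached === null || nowMs >= cached.expiresAtMs - skewMs;
}
```

A full helper would call `needsRefresh` before each API request, send the result of `buildTokenRequest` with the HTTP client of choice when it returns `true`, and treat a `400 invalid_grant` response as a signal that the refresh token has been revoked.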
diff --git a/docs/adr/0007-collections-only-permission-check.md b/docs/adr/0007-collections-only-permission-check.md new file mode 100644 index 000000000..e318f9150 --- /dev/null +++ b/docs/adr/0007-collections-only-permission-check.md @@ -0,0 +1,3 @@ +# Only Collections endpoints require a per-request permission check; phases 1–4 are public-data-only + +Endpoint Groups 1–4 (Development, Contributors, Popularity, Security) expose aggregated public project metrics — data that is already public on the LFX Insights website. No per-request ownership check is needed; a valid API key is sufficient. Only Group 5 (Collections) requires a permission check because Collections can be private. The check: if `collections.isPrivate = false`, serve to any valid API key; if `collections.isPrivate = true`, compare `collections.ssoUserId` (e.g. `auth0|abc123`) against the requesting user's `sub` JWT claim — deny with 403 if they don't match. There is no collaborator model; ownership is exclusively the creator (`ssoUserId`). This is implemented as a Postgres lookup with a Redis cache (60s TTL) at the Collections route level, not globally. Engineers must not add ownership checks to Groups 1–4 "for safety" — doing so would break the public-data model and add latency with no security benefit. If a project ever becomes private in LFX Insights, the access model for that endpoint group must be reconsidered separately. diff --git a/docs/adr/0008-typebox-code-first-openapi.md b/docs/adr/0008-typebox-code-first-openapi.md new file mode 100644 index 000000000..2477cb938 --- /dev/null +++ b/docs/adr/0008-typebox-code-first-openapi.md @@ -0,0 +1,13 @@ +# TypeBox for code-first OpenAPI schema generation + +We use TypeBox to define request/response schemas in TypeScript; the OpenAPI spec is generated from those definitions rather than written by hand or inferred from decorators. 
+ +TypeBox works by making a single object definition serve two roles simultaneously: it is a valid JSON Schema (consumed by Fastify at runtime for request validation and response serialization) and a TypeScript type (derived via `Static` at compile time). There is no conversion step and no second representation — the object literally is both things at once. + +Fastify's first-class TypeBox support (`@fastify/type-provider-typebox` + `@fastify/swagger`) means the OpenAPI spec is generated from the same schema objects that validate requests. The spec cannot drift from the implementation because they share one source. + +Zod was considered: it has a more ergonomic API and a larger community, but its native representation is not JSON Schema — a conversion layer (`zod-to-json-schema`) is required, which reintroduces a transformation step that can silently diverge. The Fastify+TypeBox integration is zero-transformation. + +Hand-written OpenAPI YAML was rejected: with ~100 endpoints, drift between spec and implementation is a certainty at scale, not a risk to manage. + +Engineers must not introduce a parallel validation library (Zod, Joi, class-validator) into `/api` — TypeBox is the single validation and schema boundary. diff --git a/docs/adr/0009-api-key-required-for-all-requests.md b/docs/adr/0009-api-key-required-for-all-requests.md new file mode 100644 index 000000000..bdc96fe4e --- /dev/null +++ b/docs/adr/0009-api-key-required-for-all-requests.md @@ -0,0 +1,3 @@ +# Every request requires a valid API key — no anonymous access + +All endpoints, including those that expose public project data (Endpoint Groups 1–4), require a valid API key. There is no unauthenticated access path. A missing or invalid key returns 401 immediately. 
We chose the auth-floor approach because: (1) rate limiting and abuse prevention require a stable identity to enforce per-org quotas; (2) attribution — knowing which orgs use which endpoints — is essential for prioritizing the roadmap and justifying infrastructure cost; (3) anonymous access complicates the future tier-gating mechanism. The cost is a higher onboarding barrier (users must create an API key before their first request). This can be revisited if adoption data shows the friction is significant. diff --git a/docs/adr/0010-billing-bundled-with-lfx-membership.md b/docs/adr/0010-billing-bundled-with-lfx-membership.md new file mode 100644 index 000000000..89af5b8d5 --- /dev/null +++ b/docs/adr/0010-billing-bundled-with-lfx-membership.md @@ -0,0 +1,18 @@ +# API access is bundled with LFX membership; no standalone billing in v1 + +API access is included in the user's existing LFX membership tier. There is no separate billing infrastructure or usage-based pricing. + +## Where membership is enforced + +Enforcement is split across two boundaries: + +**Token-mint time (LFX Self-Serve):** Self-Serve checks Key Contact status via OpenFGA against `v2_organization` entities before issuing an Insights **access token**, and refuses non-Key-Contacts. The precise moment of this check depends on the PAT model chosen (open question — ADR-0015 Q1): + +- *New Insights-scoped refresh token:* check happens when the refresh token is first issued at `app.lfx.dev/settings`. Non-Key-Contacts cannot obtain the token at all. +- *Existing PAT reused as refresh token:* all users already hold a PAT. Check happens at `POST /v1/auth/token` exchange time — Self-Serve refuses to mint an Insights access token for non-Key-Contacts even though they possess the PAT. + +In either model the check lives in Self-Serve, not in Insights. The `tier` and `org` claims baked into the resulting access token are Self-Serve's authoritative statement of membership status. 
+ +**Request time (Insights API):** the Insights API verifies the access token's JWT signature on every request, reads the `tier` and `org` claims from the verified payload, and uses them to enforce rate limits and (in future versions) per-endpoint tier gating. It never re-queries OpenFGA, Postgres member tables, or any other membership system — it trusts what Self-Serve encoded in the token. **Assumption:** the presence of `org` and `tier` in the Self-Serve access token is not yet confirmed — this needs validation with the Self-Serve team at T-015. If the existing PAT is reused, these claims may need to be added to it. + +**Implication for membership revocation:** revoking a user's LFX membership does not immediately invalidate their existing refresh tokens. Those tokens continue to mint valid access tokens until they are explicitly revoked in `app.lfx.dev/settings`. Membership-driven revocation is a Self-Serve operational concern — Self-Serve is expected to revoke the user's tokens when their membership lapses. \ No newline at end of file diff --git a/docs/adr/0011-pagination-cursor-based.md b/docs/adr/0011-pagination-cursor-based.md new file mode 100644 index 000000000..dbda9338a --- /dev/null +++ b/docs/adr/0011-pagination-cursor-based.md @@ -0,0 +1,56 @@ +# Pagination is cursor-based with opaque base64url cursors + +All paginated endpoints accept `cursor` (opaque base64url string, omit on the first request) and `pageSize` (integer, default 50, max 200) as query parameters, and return `{ data, pageSize, nextCursor }`. `nextCursor: null` indicates the end of the list. There is no `total` field and no page number. + +## Why cursor-based instead of page + pageSize + +The previous design ported the Nuxt convention (`page` + `pageSize` zero-indexed, `total` from Tinybird's `rows_before_limit_at_least`). That decision was driven by minimising port-time translation cost, not by what is right for a public API contract. Four reasons flip the trade-off: + +1. 
**Stability under mutations: no duplicates or missed entries.** Offset pagination produces corrupt iteration when the underlying set changes between page fetches. Concretely: if an item is inserted ahead of the caller's position between fetching page N and page N+1, every subsequent row shifts one position later. The item that was the last entry of page N moves into the first position of page N+1, so it appears in both pages. The reverse happens on deletion: the item that was the first entry of page N+1 slides back into the last position of page N, which was already fetched, so it is silently skipped. A cursor anchors to a row's sort-key value rather than its offset, so insertions and deletions between calls never cause the caller to skip or repeat a row. This API serves analytics over data that grows continuously (commits, contributors, vulnerabilities), making offset instability a practical concern, not a theoretical one. + +2. **Cost at scale.** `LIMIT N OFFSET M` is O(N+M): Tinybird and Postgres scan the skipped rows. A cursor query (`WHERE (sort_key, id) < (:cursor_k, :cursor_id) ORDER BY sort_key DESC, id DESC LIMIT N`) is O(log N) on an indexed key: an index seek to the cursor position, then a read of the page itself. The tie-breaker direction must match the sort key's direction for the row-value comparison to be valid. The first and the hundredth page cost the same. + +3. **Fit for the caller.** v1 is server-to-server only: pipelines, dashboards, batch jobs. They iterate end-to-end. They do not render a "page 5 of 23" UI. Cursor-based fits that shape; offset is a UI affordance we do not need. + +4. **Industry convention.** Stripe, GitHub, AWS, Linear, and Slack all use cursor-based pagination for their public APIs. + +## No `total` field + +Computing a total count alongside every paginated result requires a separate `count(*)` query, doubling Tinybird load per request. The cursor model makes this unnecessary for the primary use case (iterate all records). If a specific caller needs a total count we can add a dedicated `/count` endpoint; we will not add it speculatively in v1. 
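To make the stability argument concrete, here is a minimal in-memory sketch of cursor pagination next to offset pagination. It is an illustration under stated assumptions, not the service's implementation. The `Row` shape, the function names, and the use of a sorted array in place of a Tinybird or Postgres query are hypothetical; the cursor payload `{ k, id }` and the `{ data, pageSize, nextCursor }` response shape follow this ADR. The cursor here is taken from the last row actually returned, so a strict comparison resumes at the lookahead row.

```typescript
import { Buffer } from "node:buffer";

interface Row { id: number; sortKey: number; }
interface CursorPayload { k: number; id: number; }
interface Page { data: Row[]; pageSize: number; nextCursor: string | null; }

const encodeCursor = (c: CursorPayload): string =>
  Buffer.from(JSON.stringify(c), "utf8").toString("base64url");

const decodeCursor = (s: string): CursorPayload =>
  JSON.parse(Buffer.from(s, "base64url").toString("utf8")) as CursorPayload;

// Newest-first order: sortKey DESC, with id DESC as the tie-breaker.
const byKeyDesc = (a: Row, b: Row): number => b.sortKey - a.sortKey || b.id - a.id;

// True when `row` sorts strictly after the cursor position.
const isAfter = (row: Row, c: CursorPayload): boolean =>
  row.sortKey < c.k || (row.sortKey === c.k && row.id < c.id);

function cursorPage(rows: Row[], pageSize: number, cursor?: string): Page {
  const after = cursor ? decodeCursor(cursor) : undefined;
  const sorted = [...rows].sort(byKeyDesc);
  const candidates = after ? sorted.filter((r) => isAfter(r, after)) : sorted;
  const fetched = candidates.slice(0, pageSize + 1); // one lookahead row
  const data = fetched.slice(0, pageSize);
  const hasMore = fetched.length > pageSize;
  const last = data[data.length - 1];
  return {
    data,
    pageSize,
    nextCursor: hasMore && last ? encodeCursor({ k: last.sortKey, id: last.id }) : null,
  };
}

// Offset pagination over the same ordering, for comparison (zero-indexed page).
function offsetPage(rows: Row[], pageSize: number, page: number): Row[] {
  return [...rows].sort(byKeyDesc).slice(page * pageSize, (page + 1) * pageSize);
}
```

If a newer row arrives between two fetches, `offsetPage` returns the last row of the first page again at the start of the second, while `cursorPage` continues from the row after the cursor.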
+ +## Cursor encoding + +The cursor is `base64url(JSON.stringify({ k: , id: }))`. It is server-generated and opaque to clients: clients must not parse, construct, or store cursors as structured data. They pass back the `nextCursor` value verbatim. Opacity means the server can change the internal encoding (add fields, switch format) without issuing a breaking change. + +## Lookahead trick: no second count query + +Tinybird queries fetch `pageSize + 1` rows. If the result set has `pageSize + 1` rows, there is a next page: drop the extra row from the response and encode the `(sort_key, id)` of the last row actually returned as the `nextCursor`, so that the strict comparison in the next query resumes at the dropped row. If the result set has `≤ pageSize` rows, set `nextCursor: null`. No second query needed. + +## Default and max pageSize + +- Default: `pageSize = 50` +- Maximum: `pageSize = 200` + +Changing either value is a breaking change per ADR-0003. Tune based on RPS and Tinybird query latency data once available. + +## Non-paginated endpoints + +Top-N leaderboards (hard-capped result sets), single-row lookups, and time-series charts are not paginated. Cursor pagination applies only to list endpoints that would previously have returned `{ data, page, pageSize, total }` or equivalent `limit`/`offset` shapes. + +## Sort order + +Sort order is **caller-selectable from a per-endpoint allow-list.** Every paginated endpoint declares a closed set of accepted `sort` values in its TypeBox schema; a value outside the list returns `400 invalid_sort`. Each accepted value must be backed by a Tinybird or Postgres index. Sort keys with no index are not added, since server-side resorting would break the O(log N) cost argument. + +Wire format: `?sort=_`, e.g. `name_asc`, `commits_desc`. This mirrors the existing Nuxt convention (`frontend/server/api/project/index.ts`, `frontend/server/api/collection/index.ts`, `frontend/server/api/ossindex/collections.ts`) so port-time conversion is mechanical. + +The cursor still encodes `(sort_key_value, id)`. 
The `sort_key_value` is whichever field the caller selected. The cursor is opaque — callers must not mix cursors across `sort` values; this is documented in the OpenAPI parameter description for each endpoint. + +**Adding** a value to an endpoint's allow-list is additive (non-breaking). **Removing** a value or **changing the default** is a breaking change per ADR-0003. + +Sort direction (`asc`/`desc`) is selectable only where both directions are indexed; otherwise the endpoint locks direction in its allow-list (e.g. `name_asc` only). + +Each endpoint's allowed `sort` values and its default are documented in the OpenAPI spec for that route. + +## Port-time conversion + +The `nuxt-to-api` skill rewrites Nuxt `page`/`pageSize`/`total` handlers — and `limit`/`offset` handlers — into cursor-based handlers at port time, including the lookahead-LIMIT pattern in the Tinybird query. Existing `sort` parameters from Nuxt handlers are preserved as-is (same `field_direction` wire format). diff --git a/docs/adr/0012-url-port-strategy-hybrid.md b/docs/adr/0012-url-port-strategy-hybrid.md new file mode 100644 index 000000000..669d94c40 --- /dev/null +++ b/docs/adr/0012-url-port-strategy-hybrid.md @@ -0,0 +1,3 @@ +# URL porting strategy: port-as-is by default, rename only when genuinely misleading + +When porting Nuxt endpoints to `/v1/...`, the default is to preserve the existing URL path with light normalization (kebab-case segments, plural collection nouns). Renaming is only done when the existing URL is **genuinely misleading to an external developer** who has no knowledge of the Nuxt internals. The `nuxt-to-api` skill defaults to port-as-is and surfaces the URL explicitly for reviewer sign-off; any rename must be recorded in the PR description. 
We chose this over a full rename-everything approach because: (1) the existing URLs have been stable and are already familiar to the frontend team; (2) wholesale renaming creates a large diff with no functional change, obscuring the real porting work; (3) any URL committed in v1 is hard to change without a major version bump (ADR-0003). The cost is that some URLs may carry Nuxt-internal naming conventions that are slightly awkward externally — this is acceptable when the alternative is a large, hard-to-review rename pass. diff --git a/docs/adr/0013-origin-cache-only-private-cache-control.md b/docs/adr/0013-origin-cache-only-private-cache-control.md new file mode 100644 index 000000000..5fd560fad --- /dev/null +++ b/docs/adr/0013-origin-cache-only-private-cache-control.md @@ -0,0 +1,8 @@ +# Responses are cached at the origin (Redis) only; Cache-Control: private, max-age=0 + +All responses carry `Cache-Control: private, max-age=0`. The origin stores a Redis cache keyed by request params, mirroring the two-tier TTL model already in use by the Insights Nuxt API (`frontend/setup/caching.ts`): + +- **Long cache (86 400s / 24h):** stable, infrequently-changing data — project lists, leaderboards, search indexes, category lists, report aggregates. +- **Short cache (3 600s / 1h):** time-series analytics endpoints where data updates throughout the day. + +Clients and intermediary proxies/CDNs do not cache responses (`Cache-Control: private, max-age=0`). We chose origin-only caching over public HTTP caching because: (1) some endpoints are user-scoped (Collections) where a shared CDN cache would be a security error; (2) origin Redis gives us a single TTL knob tunable without a contract change. Public cache headers can be introduced later as a non-breaking improvement once per-endpoint analysis is done. Engineers must not add `public` or `s-maxage` headers without a deliberate review. 
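The two-tier origin cache described above can be sketched as follows. This is an illustration under stated assumptions: an in-memory `Map` stands in for Redis, and the names `cacheKey`, `OriginCache`, and `CacheClass`, along with the key format, are hypothetical. The TTL values and the `Cache-Control` header come from this ADR. The sketch keys on path and parameters only and does not address user-scoped endpoints such as Collections.

```typescript
// TTL classes from this ADR: 24h for stable data, 1h for time-series data.
type CacheClass = "long" | "short";
const TTL_SECONDS: Record<CacheClass, number> = { long: 86_400, short: 3_600 };

// Header sent on every response so clients and CDNs do not cache.
const CACHE_CONTROL = "private, max-age=0";

type ParamValue = string | number | boolean | undefined;

// Builds a key that does not depend on query-parameter order (hypothetical format).
function cacheKey(path: string, params: Record<string, ParamValue>): string {
  const query = Object.keys(params)
    .filter((name) => params[name] !== undefined)
    .sort()
    .map((name) => `${encodeURIComponent(name)}=${encodeURIComponent(String(params[name]))}`)
    .join("&");
  return `${path}?${query}`;
}

interface Entry { body: string; expiresAtMs: number; }

// In-memory stand-in for the Redis origin cache.
class OriginCache {
  private readonly entries = new Map<string, Entry>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Returns the cached body, or undefined on a miss or an expired entry.
  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAtMs <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.body;
  }

  set(key: string, body: string, cacheClass: CacheClass): void {
    this.entries.set(key, { body, expiresAtMs: this.now() + TTL_SECONDS[cacheClass] * 1000 });
  }
}
```

Because the TTL lives only in `TTL_SECONDS` on the origin side, it can be tuned without touching the response contract, which is the "single TTL knob" property the ADR relies on.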
diff --git a/docs/adr/0014-camelcase-json-iso8601-dates.md b/docs/adr/0014-camelcase-json-iso8601-dates.md new file mode 100644 index 000000000..6c1d2cff8 --- /dev/null +++ b/docs/adr/0014-camelcase-json-iso8601-dates.md @@ -0,0 +1,3 @@ +# All JSON keys are camelCase; dates are ISO-8601 UTC strings + +All request and response JSON field names use camelCase (`startDate`, `activityTypes`, `includeCodeContributions`). Date values are ISO-8601 strings in UTC (`2025-12-31T23:59:59Z`), never Unix timestamps or locale-formatted strings. The existing Nuxt API layer is inconsistent — some fields are camelCase, others are snake_case (`activity_types`, `start_date`) reflecting Tinybird's internal naming. The `nuxt-to-api` skill normalizes all names to camelCase at port time rather than exposing the inconsistency externally. We chose camelCase over snake_case because the primary consumers are JavaScript/TypeScript server-side callers for whom camelCase is idiomatic; and over snake_case-preserving-Tinybird-names because coupling the public contract to internal Tinybird field names makes future Tinybird schema changes into breaking API changes. Any field name committed in v1 is hard to rename without a major version bump (ADR-0003) — verify casing before merging a new endpoint. diff --git a/docs/adr/0015-api-keys-issued-by-lfx-self-serve.md b/docs/adr/0015-api-keys-issued-by-lfx-self-serve.md new file mode 100644 index 000000000..3e292e34a --- /dev/null +++ b/docs/adr/0015-api-keys-issued-by-lfx-self-serve.md @@ -0,0 +1,52 @@ +# API keys are issued by the LFX Self-Serve App + +API keys are refresh tokens issued by the LFX Self-Serve App at `app.lfx.dev/settings`. The Insights API exposes a proxied `/v1/auth/token` endpoint (forwarding to Self-Serve's `/token`) so customers configure only one host. Customer code exchanges refresh tokens for short-lived access tokens via that proxy (per ADR-0006); the Insights API receives only access tokens on actual API requests. 
The Insights API is a verifier only: it fetches the LFX Self-Serve JWKS endpoint, verifies the access token's signature, and reads identity + authorization claims from the verified payload. Insights stores no keys, runs no key-management UI, and has no dependency on the Auth0 Management API. + +> **Note:** this decision assumes LFX Self-Serve can support the required token model (Key Contact gating, `org`/`tier` claims, JWKS exposure, and the `/token` proxy endpoint). Coordination with the Self-Serve team is required at T-015 before implementation — the exact shape of the solution may change based on what Self-Serve can provide. + +## JWT claims used by the Insights API + +These claims live on the **access token**, not the refresh token. + +| Claim | Purpose | +|---|---| +| `iss` | LFX Self-Serve issuer URL — used to select the right JWKS and reject foreign tokens. | +| `sub` | User ID — used for revocation reference and as the `customer_id` span attribute in APM traces. | +| `org` | LFX Organization ID — drives the rate-limit pool key (all Key Contacts in the same org share a pool). **Assumption:** Self-Serve includes this in the access token. Exact claim name and feasibility confirmed at T-015. | +| `tier` | LFX membership tier (`silver` / `gold` / `platinum`) — drives rate-limit pool size and any future per-route tier gating. **Assumption:** Self-Serve includes this in the access token. Confirmed at T-015. | +| `kid` | Key ID — selects the right key in the JWKS response for signature verification. | +| `aud` | Service audience — required if the open question below resolves to using the existing platform PAT with scope enforcement. | + +## Token endpoint proxy + +The Insights API exposes `POST /v1/auth/token` as a thin passthrough to LFX Self-Serve's `/token` endpoint. Insights forwards the request body and headers verbatim and returns the response verbatim. It does not mint, validate, or persist refresh tokens. 
The proxy exists purely so customers configure a single host (`api.insights.linuxfoundation.org`) for both API calls and access-token minting. + +Failure modes: if Self-Serve returns a 4xx or 5xx, Insights forwards the response as-is. If Self-Serve is unreachable, Insights returns `503` with `code: upstream_unavailable` matching its standard error envelope. + +## Revocation + +Revocation is owned by LFX Self-Serve. When a user revokes a refresh token in `app.lfx.dev/settings`, the next `POST /v1/auth/token` call (forwarded to Self-Serve) fails with `400 invalid_grant` — the customer's code can no longer mint new access tokens. Already-issued access tokens continue to work until their `exp` (~15 min). Insights maintains no deny-list and runs no introspection endpoint — revocation lag is bounded by the access token's natural lifetime. + +## Open question: reuse existing platform PAT vs mint a new Insights-scoped refresh token + +**Undecided.** This is a product question to resolve with the LFX Self-Serve team at T-015. Both options are valid: + +| Aspect | Reuse existing `app.lfx.dev/settings` PAT as refresh token | Mint new Insights-scoped refresh token | +|---|---|---| +| User experience | One token across every LFX service — best UX. | Extra "Create Insights API token" step in `app.lfx.dev/settings`. | +| Compromise blast radius | A leaked refresh token grants access to whatever the platform PAT covers — wider than Insights alone. | A leaked Insights refresh token is scoped to Insights only; revoking it doesn't break the user's other LFX integrations. | +| Scope enforcement | Requires a service-scope claim (`aud: insights.linuxfoundation.org` or a `scopes` array) so Insights can refuse tokens issued for unrelated services. | Natural: `aud`/`iss` already implies Insights. | +| Claim requirements | The existing PAT may need additional claims (`org`, `tier`) if they're not already present. | LFX Self-Serve mints with the exact claim shape Insights needs. 
| +| Revocation granularity | Revoking the platform PAT loses every LFX integration at once. | Per-service revocation — rotating the Insights refresh token leaves other LFX integrations intact. | +| LFX Self-Serve team work | Extend/add claims on an existing token type. | New token type + UI affordance in `app.lfx.dev/settings`. | + +We commit to the issuer (LFX Self-Serve) and the verification path (JWKS from LFX Self-Serve). The token form is decided at T-015. + +## Why this replaces the previous Auth0-based design + +The original design stored keys in Auth0 and routed key-creation UI through the LFX Insights frontend (E15), which called the Auth0 Management API. That design was rejected for two reasons: + +1. **Duplicate UI.** The LFX Self-Serve App already runs a key-management UI at `app.lfx.dev/settings`. Building a parallel UI inside Insights forces two surfaces to stay in sync and gives users a fragmented experience. +2. **Wrong ownership.** Insights is an analytics API, not an identity service. Owning a signing key, a Management API integration, and a key-lifecycle surface is out of scope for a read-only analytics proxy. + +Storing keys in our own Postgres was rejected for the same reason: we'd own the signing key, the revocation surface, and the token lifecycle. diff --git a/docs/adr/0016-vitepress-scalar-api-docs.md b/docs/adr/0016-vitepress-scalar-api-docs.md new file mode 100644 index 000000000..3955f5726 --- /dev/null +++ b/docs/adr/0016-vitepress-scalar-api-docs.md @@ -0,0 +1,16 @@ +# API docs use VitePress + Scalar, served from `api/docs/` + +API documentation is a standalone VitePress site under `api/docs/`, co-located with the service it documents. Scalar is embedded on the reference page and ingests the OpenAPI spec generated from TypeBox schemas on every release. + +## Considered Options + +- **Mintlify** — rejected: paid, content lives on external infra, customization constrained by their conventions. 
+- **Extend `frontend/docs/`** — rejected: API docs have their own versioning, navigation, and release cadence that has nothing to do with the Insights frontend. Coupling them means the docs deploy is tied to the frontend deploy, and `frontend/` grows a dependency it doesn't own. +- **`api/docs/` standalone VitePress + Scalar (chosen)** — docs live alongside the service they document. Scalar is embedded on the reference page and reads `api/openapi.json` — the reference cannot drift from the implementation. Served at `api.insights.linuxfoundation.org/docs` — same host as the API; Fastify serves the static VitePress build under `/docs`, no extra subdomain needed. + +## Consequences + +- API docs (quickstart, authentication, pagination, error reference, changelog) live under `api/docs/`. +- The Scalar reference page loads `api/openapi.json` at build time. +- Served at `api.insights.linuxfoundation.org/docs`. Docs changes do not require a frontend deploy. +- Engineers must not move the docs into `frontend/docs/` to "consolidate" — the separation is intentional. diff --git a/docs/adr/0017-collections-queries-not-shared.md b/docs/adr/0017-collections-queries-not-shared.md new file mode 100644 index 000000000..281cbf40b --- /dev/null +++ b/docs/adr/0017-collections-queries-not-shared.md @@ -0,0 +1,14 @@ +# Collections Postgres queries are written fresh in `/api`, not shared with the frontend + +The `/api` service writes its own minimal read-only Postgres queries for the Collections endpoints rather than extracting `frontend/server/repo/communityCollection.repo.ts` into a shared library. + +## Considered Options + +- **Shared `libs/collections-repo`** — rejected: the frontend repo is 762 lines, write-heavy (create, update, delete, like), and coupled to Nuxt path aliases (`~~/server/utils/common`). Extracting it would require stripping Nuxt dependencies and carrying write operations that `/api` will never use. The coupling risk outweighs the DRY benefit. 
+- **Fresh queries in `/api` (chosen)** — the public API needs ~3 read queries in v1: get collection by slug, list collections, permission check (`isPrivate` + `ssoUserId`). Writing these directly keeps `/api` self-contained and avoids a shared library that exists only because two services query the same table. + +## Consequences + +- If the `collections` table schema changes, both the frontend repo and the `/api` queries need updating. This is acceptable — schema changes are rare and infrequent, and both sites need to be reviewed anyway. +- Engineers seeing what looks like duplicated SQL should not reflexively extract a shared repo. The duplication is deliberate: the frontend owns write operations, the API owns read operations, and they have different dependency contexts. +- If a third consumer of the collections table emerges, revisit this decision. diff --git a/docs/adr/0018-structured-json-logging.md b/docs/adr/0018-structured-json-logging.md new file mode 100644 index 000000000..51b65bee4 --- /dev/null +++ b/docs/adr/0018-structured-json-logging.md @@ -0,0 +1,14 @@ +# Structured JSON logging via pino; log levels follow LFX-0002 + +The API service uses pino for structured JSON logging. Every log line is a single valid JSON object — no leading or trailing characters, no literal newlines (escaped newlines inside JSON string values are fine). Logs go to stdout (pino default). The Datadog agent DaemonSet running on each cluster node picks them up via `containerCollectAll: true` — it tails the container runtime log files and ships them to Datadog Logs automatically, with no log routing through the otel-collector sidecar. This inherits the LFX platform standard from [lfx-architecture-decisions/0002](https://github.com/linuxfoundation/lfx-architecture-decisions/blob/main/decisions/0002-structured-json-logging.md), so any future LFX service we integrate with reads logs in the same shape. 
+ +## Log levels + +- **`error`** — non-recoverable failures: Tinybird or Postgres unreachable, malformed responses from those documented upstreams, our own crashes. +- **`warn`** — invalid client request payloads (validation 4xx), recoverable upstream blips, unparsable data from undocumented sources. +- **`info`** — successful mutating requests only (one line per request). **In v1 the API is read-only, so info logs should be rare** — limited to admin actions and config changes. Read endpoints MUST NOT emit info logs; doing so would blow up log volume on a high-RPS proxy. +- **`debug`** — judicious developer aid only, never function entry/exit. Use OpenTelemetry spans for execution tracing. + +## Trace correlation + +pino's mixin pulls the active OpenTelemetry trace context into every log line, emitting OTel-formatted `trace_id` (32-char lowercase hex, 128-bit) and `span_id` (16-char lowercase hex, 64-bit). Datadog's APM ingester recognises OTel-format IDs natively — no parallel `dd.trace_id` / `dd.span_id` fields are emitted. `trace_id` is the canonical correlation key — engineers must not invent a parallel `requestId` field for log correlation. The OTel trace ID is also the API's exposed request ID (error envelope `requestId` field); see ADR-0019. diff --git a/docs/adr/0019-opentelemetry-instrumentation.md b/docs/adr/0019-opentelemetry-instrumentation.md new file mode 100644 index 000000000..208e4a98f --- /dev/null +++ b/docs/adr/0019-opentelemetry-instrumentation.md @@ -0,0 +1,32 @@ +# OpenTelemetry instrumentation; OTel trace ID is the request ID + +The API service is instrumented with OpenTelemetry from day one. Inbound and outbound HTTP context propagation, log↔trace correlation, and custom metrics all flow through a single OTel SDK rather than ad-hoc per-concern integrations. 
This inherits the LFX platform standard from [lfx-architecture-decisions/0003](https://github.com/linuxfoundation/lfx-architecture-decisions/blob/main/decisions/0003-opentelemetry-instrumentation.md). + +## What we use + +`@opentelemetry/sdk-node` with auto-instrumentation for `http`, `pg` (Postgres), and pino. The W3C TraceContext propagator is the default — it reads incoming `traceparent` headers and injects them on outbound HTTP calls automatically. No manual context plumbing in handler code. + +## OTel trace ID is the request ID + +Auto-propagation makes a separate ULID-based request ID redundant: every request already has a 128-bit OTel trace ID generated at the inbound HTTP span (or honoured from an inbound `traceparent`). We use that trace ID as our request identifier: + +- Error envelope `requestId` field = OTel trace ID (32-char lowercase hex). This is the customer-facing ID for support tickets — it correlates directly to logs and APM traces in Datadog without translation. +- Log lines include `trace_id` and `span_id` (OTel hex format, 128-bit and 64-bit). Datadog's APM ingester recognises OTel-format IDs natively — no parallel `dd.trace_id` / `dd.span_id` fields are emitted. A pino mixin can re-add `dd.trace_id` without contract impact if a future Datadog regression breaks UI joins. +- Inbound `traceparent` is honoured (caller's trace continues across the boundary). +- W3C `traceparent` is the sole HTTP propagation channel — honoured inbound, injected outbound automatically by the SDK. No `X-Request-Id` response header is set. + +This means engineers must not introduce a parallel ULID/UUID request ID generator. + +## Sampling + +100% sampling in v1 — closed-alpha and silent-public traffic is low absolute volume and we want every trace for debugging. Per LFX-0003, higher-throughput services should drop the sample rate; revisit once we have RPS data on `/v1`. + +## Deployment topology + +The app pod runs an `opentelemetry-collector` sidecar container. 
The app's OTel SDK sends OTLP to the sidecar (`localhost:4317`); the collector forwards to Datadog. This follows the LFX-0003 recommendation and keeps the configuration self-contained — no changes to the shared cluster Datadog agent are needed. + +In production and staging the collector exports to Datadog. Logs already flow to Datadog independently via `containerCollectAll: true` on the cluster log collector — no log routing through the OTel collector needed (per ADR-0018). + +## Local development + +`OTLP_EXPORTER` defaults to the stdout exporter in dev — pino logs already carry `trace_id`/`span_id`, which is enough for most local debugging. A developer wanting visual trace inspection can run a one-off Jaeger container and point the SDK at it; not mandated. diff --git a/docs/architecture-review/01-overview.md b/docs/architecture-review/01-overview.md new file mode 100644 index 000000000..020668d67 --- /dev/null +++ b/docs/architecture-review/01-overview.md @@ -0,0 +1,170 @@ +# LFX Insights Public API — Architecture Review + +**Date:** 2026-05-04 +**Author:** LFX Insights Engineering +**Status:** Pending architecture team approval + +**ADRs:** Architecture Decision Records are committed to the codebase at [`docs/adr/`](../adr/) alongside the code they describe. Future engineers can find the reasoning for any decision without hunting through wikis or Notion. ADRs are append-only — past decisions are never edited, only superseded by new ones. + +--- + +## Problem + +All analytics endpoints in LFX Insights today live inside the Nuxt frontend (`frontend/server/api/`, ~106 files). They were built as internal routes for the web UI: + +- Authenticated by Auth0 OIDC session cookies (browser-only) or a single shared Bearer secret. +- Rate limited only by IP — no per-customer identity, no tiers, no quota enforcement. +- Coupled to the frontend release cycle — changing a response shape requires coordinating UI changes in the same PR. 
+- No versioning contract, no public documentation, no SLA, no observability at the customer level. + +LFX customers need programmatic access to the same analytics data to build pipelines, dashboards, and reports. We cannot expose the existing Nuxt routes as-is. + +--- + +## What We Are Building + +A standalone HTTP API service (`/api`, sibling of `frontend/`) that ports existing Nuxt analytics endpoints under a formal versioned contract with proper authentication, rate limiting, observability, and documentation. + +### Key characteristics + +| Property | Decision | +|---|---| +| Base URL | `https://api.insights.linuxfoundation.org/v1/...` | +| Location | `/api` at monorepo root, added to `pnpm-workspace.yaml` | +| Framework | Fastify + TypeScript + TypeBox | +| Auth | Refresh tokens issued by LFX Self-Serve (`app.lfx.dev/settings`); customer code mints short-lived access tokens via Insights `/v1/auth/token` (proxied to Self-Serve); Insights JWKS-verifies access tokens on every request | +| Rate limiting | Redis sliding window, per-org pool, tier-driven | +| Versioning | URL prefix (`/v1`, `/v2`); additive-only within a version | +| Contract | Tolerant-reader; no breaking changes within a major version | +| Docs | VitePress + Scalar at `api.insights.linuxfoundation.org/docs` (served by Fastify from `api/docs/`) | +| Observability | OpenTelemetry → Datadog (hybrid custom metrics + APM) | +| Callers | Server-to-server only in v1; CORS denies all browser origins | +| Billing | Bundled with existing LFX membership tiers; no standalone billing | +| SDKs | None in v1; OpenAPI spec + curl examples | + +--- + +## Architecture + +``` +┌─────────────────────┐ 1. create token ┌──────────────────────────────────┐ +│ User (browser) │ ────────────────▶ │ LFX Self-Serve App │ +│ │ │ app.lfx.dev/settings │ +│ │ ◀──────────────── │ Issues + revokes refresh tokens │ +│ │ 2. 
refresh token │ Publishes JWKS endpoint │ +└─────────────────────┘ └──────────────────┬───────────────┘ + │ ▲ + │ 3. paste refresh token │ 4. forward /token + │ into server env │ request + ▼ │ +┌─────────────────────┐ POST /v1/auth/token ┌──────────────┴──────────────────┐ +│ Customer server │ ────────────────────▶ │ api.insights.linuxfoundation │ +│ │ ◀──────────────────── │ .org (Fastify, TypeScript) │ +│ │ 5. access token │ │ +│ │ (~15 min) │ /v1/auth/token (proxy) │ +│ │ │ /v1/development/... │ +│ │ 6. Bearer /v1/contributors/... │ +│ │ ────────────────────▶ │ /v1/popularity/... │ +└─────────────────────┘ │ /v1/security/... │ + │ /v1/collections/... │ + 7. JWKS verify └───────────┬─────────────────────┘ + ┌─────────────────────────────────────┘ + ▼ + (LFX Self-Serve JWKS endpoint — cached) + └───────────┬─────────────────────┐ + │ │ + ▼ ▼ + ┌────────────────────────────────────────┐ + │ Redis │ + │ Rate-limit counters │ + │ Response cache (cache hit → return) │ + └──────────────┬─────────────────────────┘ + │ cache miss + ┌───────────┴───────────┐ + ▼ ▼ + ┌─────────────────┐ ┌────────────────────┐ + │ Tinybird │ │ Postgres │ + │ (analytics) │ │ (read host) │ + │ dedicated read │ │ Collections auth │ + │ replica * │ └────────────────────┘ + └─────────────────┘ + + * dedicated Tinybird read replica is the goal; pending + confirmation from the Tinybird team on whether per-app + replica isolation is supported. + + ┌────────────────────────────────────────────────────────────────────────┐ + │ App OTel SDK ──OTLP──▶ otel-collector sidecar ──▶ Datadog │ + │ │ + │ Custom metrics — low-cardinality tags only (endpoint, version, │ + │ tier, status_class). Billed per unique tag combo. │ + │ Used for: SRE dashboards, alerts, SLO tracking. │ + │ │ + │ APM trace metrics — high-cardinality dims live on spans as attributes │ + │ (customer_id, api_key_id). Not billed as metrics. │ + │ Used for: per-customer drilldowns, debugging. 
│ + │ Structured logs via pino, correlated to traces via trace_id / span_id │ + │ (OTel hex format; Datadog ingests natively). Local dev: stdout. │ + └────────────────────────────────────────────────────────────────────────┘ +``` + +### Key management + +Refresh tokens (what customers call their "API key") are created and managed entirely in the LFX Self-Serve App at `app.lfx.dev/settings`. The LFX Insights frontend deep-links to that page from a `/settings/api-keys` placeholder ([E15](../PUBLIC_API_PLAN.md#epic-e15--key-management-entry-point-lfx-insights-frontend)); it does not implement create / list / revoke. Membership gating (only Key Contacts in member organizations can create keys) is enforced by LFX Self-Serve, not by Insights. What the customer pastes into their environment is a refresh token; their code mints short-lived access tokens from it via Insights' proxied `/v1/auth/token` endpoint (per ADR-0006 and ADR-0015). + +Whether the existing `app.lfx.dev/settings` personal access token is reused as the refresh token or a new Insights-scoped refresh token is minted is an open product question; see ADR-0015. + +### Shared library strategy + +Rather than duplicating Tinybird query logic, three workspace libraries are extracted: + +- `libs/tinybird-client` — Tinybird HTTP client, AdaptiveSemaphore, bucket-per-project routing. Both `frontend/` and `api/` depend on it. +- `libs/insights-types` — shared enum definitions only (`ActivityPlatforms`, `ActivityTypes`, `Granularity`). Request/response shapes are defined separately per app. +- `libs/rate-limiter` — Redis sliding-window rate limiter, forked from `frontend/server/utils/rate-limiter.ts`. + +--- + +## Endpoint Rollout + +Endpoints are ported in seven groups, each mapped to a Jira epic. Each endpoint ships through two stability stages: `/v1-alpha` → `/v1` (see Endpoint Stability below). 
+ +| Group | Content | Status | +|---|---|---| +| 1 — Development | Commit activity, PR metrics, review turnaround | [E7](../PUBLIC_API_PLAN.md#epic-e7--endpoint-migration-phase-1-development) | +| 2 — Contributors | Contributor leaderboards, org breakdowns | [E8](../PUBLIC_API_PLAN.md#epic-e8--endpoint-migration-phase-2-contributors) | +| 3 — Popularity | Stars, forks, downloads, dependency counts | [E9](../PUBLIC_API_PLAN.md#epic-e9--endpoint-migration-phase-3-popularity) | +| 4 — Security & Best Practices | CVE counts, vulnerability summaries, scorecard | [E10](../PUBLIC_API_PLAN.md#epic-e10--endpoint-migration-phase-4-security--best-practices) | +| 5 — Overviews | Project health summaries and overview metrics | [E11](../PUBLIC_API_PLAN.md#epic-e11--endpoint-migration-phase-5-overviews) | +| 6 — Collections | User-curated project groups (requires permission check) | [E12](../PUBLIC_API_PLAN.md#epic-e12--endpoint-migration-phase-6-collections) | +| 7 — Leaderboard | Cross-project contributor and activity leaderboards | [E13](../PUBLIC_API_PLAN.md#epic-e13--endpoint-migration-phase-7-leaderboard) | + +--- + +## Endpoint Stability + +Each endpoint goes through two stages: + +1. **`/v1-alpha/...`** — no contract guarantees. Breaking changes (field renames, shape changes, endpoint removal) are allowed freely. Access is restricted to an allow-listed cohort (LFX-internal + external design partners) for contract and performance validation. +2. **`/v1/...`** — full tolerant-reader contract. An endpoint graduates here once it passes the promotion criteria: load test passes, shape has been stable for at least one week, error/latency budgets are healthy, and security sign-off is given. From this point only additive changes are permitted within `/v1`; breaking changes require `/v2`. + +Promotion is per-endpoint. The `/v1-alpha` route returns `410 Gone` for two weeks after promotion so alpha callers get a clear signal to update their URLs. 
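The promotion behaviour described above can be sketched as a small routing decision. This is a minimal illustration, not the service's actual handler: `resolveAlphaRoute`, the `promoted` registry, and the example path are hypothetical, and the `Link` header with `rel="successor-version"` is taken from the ADR-0003 summary rather than from a confirmed implementation. The two-week window after which the `410` handler is removed is deliberately left out.

```typescript
// Result of routing a request that arrived on the alpha prefix.
type AlphaResolution =
  | { status: 200 } // still in alpha: serve the alpha handler
  | { status: 410; headers: { Link: string } }; // promoted: point the caller at /v1

// Hypothetical registry of endpoint paths (without version prefix) already promoted to /v1.
const promoted: ReadonlySet<string> = new Set(["/development/commit-activity"]);

function resolveAlphaRoute(path: string, promotedPaths: ReadonlySet<string>): AlphaResolution {
  const prefix = "/v1-alpha";
  if (path.indexOf(prefix + "/") !== 0) {
    throw new Error("not an alpha route: " + path);
  }
  // Strip the version prefix, e.g. "/development/commit-activity".
  const suffix = path.slice(prefix.length);
  if (promotedPaths.has(suffix)) {
    // Promotion is per-endpoint, so only promoted paths answer 410 Gone.
    return { status: 410, headers: { Link: "</v1" + suffix + '>; rel="successor-version"' } };
  }
  return { status: 200 };
}
```

Because the lookup is keyed by path, promoting one endpoint does not affect alpha callers of any other endpoint in the same group.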
+ +--- + +## Open Questions for Architecture Team + +The following items are unresolved and need input before or during implementation: + +| # | Question | Drives | +|---|---|---| +| 1 | Who is the named LFX Self-Serve contact for API key claims schema coordination? | [T-015](../PUBLIC_API_PLAN.md#epic-e3--auth--rate-limiting-api-keys-via-lfx-self-serve) | +| 2 | Can a user belong to more than one org, or hold more than one membership tier? Affects how the rate-limit pool and tier are resolved per request. | [T-015](../PUBLIC_API_PLAN.md#epic-e3--auth--rate-limiting-api-keys-via-lfx-self-serve) | +| 3 | Reuse existing `app.lfx.dev/settings` personal access token, or mint a new Insights-scoped token? See ADR-0015 for trade-off table. | [T-015](../PUBLIC_API_PLAN.md#epic-e3--auth--rate-limiting-api-keys-via-lfx-self-serve) | +**Notes:** + +- Deployed on the same Kubernetes cluster as `frontend/`. ([T-002](../PUBLIC_API_PLAN.md#epic-e1--foundation--framework)) +- Using the existing Datadog org and APM agent in the cluster. ([T-025](../PUBLIC_API_PLAN.md#epic-e4--observability-opentelemetry--datadog)) +- Hour-granularity datetime filters (`2024-01-01T14:00:00Z`) are supported and will be accepted. +- The most granular `granularity` option will be `daily` — no `hourly` option. ([E7](../PUBLIC_API_PLAN.md#epic-e7--endpoint-migration-phase-1-development)–[E11](../PUBLIC_API_PLAN.md#epic-e11--endpoint-migration-phase-5-overviews)) +- API docs will be gated (not publicly indexable) until launch. ([E5](../PUBLIC_API_PLAN.md#epic-e5--api-documentation)) diff --git a/docs/architecture-review/02-decisions.md b/docs/architecture-review/02-decisions.md new file mode 100644 index 000000000..e0b68779e --- /dev/null +++ b/docs/architecture-review/02-decisions.md @@ -0,0 +1,126 @@ +# Architecture Decisions + +All decisions recorded here met the ADR bar: hard to reverse, surprising without context, result of a genuine trade-off. Full ADR files live in [`docs/adr/`](../adr/). 
+ +--- + +## Service & Infrastructure + +### Where the service lives — [docs/adr/0002](../adr/0002-api-at-repo-root.md) + +The API is a top-level directory `/api` at the repo root, sibling to `frontend/`, not inside `workers/`. The `workers/` tree is for background workers; `/api` is a long-running HTTP server with its own deployment unit, matching the `frontend/` precedent. Keeping it at the root makes `pnpm-workspace.yaml` membership and CI matrix entries symmetric. + +### Framework: Fastify over NestJS — [docs/adr/0001](../adr/0001-fastify-over-nestjs.md) + +We chose Fastify over NestJS (the leading alternative) because: NestJS adds 3–5× boilerplate for a read-only proxy service; it couples schema validation to class-validator, where we want TypeBox (see ADR-0008); and Fastify's plugin isolation maps cleanly onto our endpoint-group structure. Express was rejected for lack of built-in validation and lower throughput. Hono was rejected because it is designed for edge runtimes (Cloudflare Workers, Deno Deploy) first and Node second — the Tinybird and Postgres clients rely on Node-native APIs (`net`, `tls`, `Buffer`) that edge runtimes don't provide, so running Hono on Node requires an adapter layer, which adds the kind of runtime-compatibility risk we want to avoid. Fastify is Node-native and battle-tested at scale with no runtime adapter between the framework and the OS. + +### OpenAPI: TypeBox code-first — [docs/adr/0008](../adr/0008-typebox-code-first-openapi.md) + +TypeBox is a TypeScript library where a single object definition is simultaneously a valid JSON Schema (used at runtime) and a TypeScript type (used at compile time). A route schema like `Type.Object({ commitCount: Type.Number() })` both validates the payload at runtime and produces the TypeScript type `{ commitCount: number }` statically — no duplication and no drift. 
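To make the "one definition, two uses" idea concrete, here is a dependency-free miniature of the pattern. It is not TypeBox and does not use its API: `Schema`, `Static`, `num`, and `obj` are hypothetical stand-ins written for this sketch. It only shows how a single value can carry a JSON Schema for runtime use while TypeScript infers the static type from the same definition.

```typescript
// A schema value carries a plain JSON Schema plus a runtime parser for the same shape.
interface Schema<T> {
  readonly jsonSchema: Record<string, unknown>;
  parse(input: unknown): T; // throws TypeError when the input does not match
}

// Derive the static TypeScript type from a schema value (stand-in for the real library's helper).
type Static<S> = S extends Schema<infer T> ? T : never;
type Props = Record<string, Schema<unknown>>;
type ObjectOf<P extends Props> = { [K in keyof P]: Static<P[K]> };

function num(): Schema<number> {
  return {
    jsonSchema: { type: "number" },
    parse(input: unknown): number {
      if (typeof input !== "number") throw new TypeError("expected a number");
      return input;
    },
  };
}

function obj<P extends Props>(props: P): Schema<ObjectOf<P>> {
  const keys = Object.keys(props);
  const properties: Record<string, unknown> = {};
  for (const key of keys) properties[key] = props[key].jsonSchema;
  return {
    jsonSchema: { type: "object", required: keys, properties },
    parse(input: unknown): ObjectOf<P> {
      if (typeof input !== "object" || input === null) throw new TypeError("expected an object");
      const record = input as Record<string, unknown>;
      // Validate every declared property with its own schema.
      for (const key of keys) props[key].parse(record[key]);
      return record as unknown as ObjectOf<P>;
    },
  };
}

// One definition: a runtime schema value and a compile-time type of the same name.
const CommitSummary = obj({ commitCount: num() });
type CommitSummary = Static<typeof CommitSummary>; // inferred as { commitCount: number }

const summary: CommitSummary = CommitSummary.parse({ commitCount: 42 });
```

Renaming `commitCount` in the `obj(...)` call changes the JSON Schema, the validator, and the static type together, which is the drift-avoidance property the paragraph above describes.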
+ +Fastify has first-class support for this pairing via `@fastify/type-provider-typebox`: it consumes TypeBox schemas directly for request validation and response serialization, and `@fastify/swagger` reads the same schemas to generate the OpenAPI spec automatically. As a result the spec cannot diverge from the validated request and response shapes, because both are derived from the same object. + +Zod was considered as an alternative (more ergonomic API, larger community) but rejected because Zod's native type is not JSON Schema — a conversion layer (`zod-to-json-schema`) is required, which reintroduces a transformation step that can drift. The Fastify+TypeBox integration is zero-transformation: the schema object _is_ the JSON Schema. + +Hand-written OpenAPI YAML was rejected outright: with ~100 endpoints, keeping a YAML spec in sync with actual handler behavior by hand is a high maintenance burden, and drift is likely. TypeBox is the single validation and schema boundary. + +### API docs: VitePress + Scalar at `api.insights.linuxfoundation.org/docs` — [docs/adr/0016](../adr/0016-vitepress-scalar-api-docs.md) + +API docs live in `api/docs/` as a standalone VitePress site, served by Fastify under `/docs`. Scalar is embedded on the reference page and reads the generated OpenAPI spec — the reference cannot drift. Mintlify was rejected (paid, external infra). Extending `frontend/docs/` was rejected — API docs have their own release cadence and should not be coupled to the frontend deploy. + +--- + +## Authentication & Authorization + +### Every request requires a valid API key — [docs/adr/0009](../adr/0009-api-key-required-for-all-requests.md) + +All endpoints, including those serving public project data, require a valid API key. There is no unauthenticated path. A missing or invalid key returns 401 immediately. This is intentional: rate limiting requires a stable identity, and attribution data is essential for roadmap prioritization. 
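A minimal sketch of the "no unauthenticated path" rule, assuming the credential arrives as an `Authorization: Bearer` header. `extractBearer` and the `AuthResult` type are hypothetical names; the `unauthorized` code follows the error-code convention used elsewhere in these docs. Signature and claim verification against the JWKS would follow this step and is not shown.

```typescript
// Outcome of the first auth step: either a candidate access token or an immediate 401.
type AuthResult =
  | { ok: true; accessToken: string }
  | { ok: false; status: 401; code: "unauthorized" };

function extractBearer(authorization: string | undefined): AuthResult {
  // Accept exactly "Bearer <token>" with a single non-empty token.
  const match = /^Bearer ([^\s]+)$/.exec(authorization ?? "");
  const token = match === null ? undefined : match[1];
  if (token === undefined) {
    // Missing header, wrong scheme, or empty token: reject before any other work.
    return { ok: false, status: 401, code: "unauthorized" };
  }
  return { ok: true, accessToken: token };
}
```

Rejecting at this point means rate-limit counters and downstream queries are only ever touched by requests that carry a candidate identity.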
+ +### API keys are issued by the LFX Self-Serve App — [docs/adr/0015](../adr/0015-api-keys-issued-by-lfx-self-serve.md) + +API keys are refresh tokens issued by the LFX Self-Serve App at `app.lfx.dev/settings`. Customer code exchanges refresh tokens for short-lived access tokens via `POST api.insights.linuxfoundation.org/v1/auth/token` — a thin proxy to Self-Serve's `/token` endpoint so customers configure only one host. The Insights API JWKS-verifies the access token on every request and reads `sub`, `org`, and `tier` from the verified claims — Insights stores no keys and runs no key-management UI of its own. + +Whether the existing `app.lfx.dev/settings` personal access token is reused as the refresh token or a new Insights-scoped refresh token is minted is an open product question — see ADR-0015 and §9 of the plan. + +Storing keys in Auth0 (and building an Insights-side management UI) was rejected because it duplicates a UI the LFX Self-Serve App already runs. Storing keys in our own Postgres was rejected — we'd own the signing key, the revocation surface, and the token lifecycle. + +### Tiers control rate limits only in v1 — [docs/adr/0005](../adr/0005-tiers-control-rate-limits-only.md) + +All tiers see all endpoints in v1 (which are existing endpoints that power the widgets in Insights). The Tier attached to an Organization only determines the size of its Rate-limit Pool. The per-route "required tier" mechanism is built into the framework (future gating is a config change, not an architecture change), but every v1 endpoint declares the minimum tier. Engineers must not add per-endpoint tier checks without a product decision — doing so breaks callers who integrated assuming open access. + +### Refresh tokens are long-lived; access tokens are short-lived — [docs/adr/0006](../adr/0006-refresh-and-short-lived-access-tokens.md) + +Refresh tokens do not expire automatically. 
Multiple active refresh tokens per user are supported for zero-downtime rotation (mint new → switch → revoke old). Access tokens are short-lived (~15 min, to be confirmed at T-015). Revocation is owned by LFX Self-Serve — revoking a refresh token prevents future access token mints; in-flight access tokens expire naturally. No Insights-side deny-list or introspection endpoint is needed. + +### Collections-only permission check — [docs/adr/0007](../adr/0007-collections-only-permission-check.md) + +Every Endpoint Group other than Collections (Development, Contributors, Popularity, Security, Overview) exposes aggregated public data, so a valid API key is sufficient. Only the Collections group requires a per-request ownership check (Postgres lookup, Redis-cached at 60s TTL) because Collections are user-created private groupings. Engineers must not add ownership checks to the other groups. + +### Collections Postgres queries written fresh in `/api` — [docs/adr/0017](../adr/0017-collections-queries-not-shared.md) + +The `/api` service writes its own minimal read-only Postgres queries for Collections rather than extracting a shared repo library. The frontend repo is write-heavy and Nuxt-coupled; the API only needs ~3 read queries. Engineers seeing what looks like duplicated SQL should not reflexively extract a shared repo — the duplication is deliberate. + +--- + +## API Contract + +### Tolerant-reader / additive-only versioning — [docs/adr/0003](../adr/0003-tolerant-reader-versioning.md) + +Endpoints go through two URL-level stability stages. During closed alpha they are served under `/v1-alpha/...` — no contract guarantees, shapes can change freely. When an endpoint graduates to silent public it moves to `/v1/...` and the full contract locks in: only additive changes are permitted (new response fields, new optional query params, new endpoints, expanded enum inputs, new error codes). Any removal, rename, type change, or constraint-tightening from that point requires a new major version prefix (`/v2`). 
Customers commit to ignoring unknown fields (documented prominently). + +**Promotion from `/v1-alpha` to `/v1`** happens when: load test passes, the closed-alpha cohort has used the endpoint without shape changes for at least one week, error/latency budgets are healthy, and security sign-off is given. Mechanically: the `/v1` route is added in a PR, the `/v1-alpha` route is changed to return `410 Gone` with a `Link: ; rel="successor-version"` header, the alpha cohort is notified directly, and both routes coexist for a minimum of 2 weeks before the `410` handler is removed. + +When a field, parameter, or endpoint needs to eventually be removed, the deprecation process is: (1) annotate it as `DEPRECATED:` in the TypeBox schema so it surfaces in the OpenAPI spec; (2) add `Deprecation: true`, `Sunset: `, and `Link: ; rel="deprecation"` response headers on every response from that route; (3) publish a changelog entry and notify known consumers; (4) honour the sunset window — removal before the `Sunset` date is a contract violation; (5) do the actual removal in `/v2`, not `/v1`. Full details in [docs/adr/0003](../adr/0003-tolerant-reader-versioning.md). + +### Pagination: cursor-based — [docs/adr/0011](../adr/0011-pagination-cursor-based.md) + +All paginated endpoints accept `cursor` (opaque base64url, omit on first page) + `pageSize` (default 50, max 200) and return `{ data, pageSize, nextCursor }`. `nextCursor: null` means end of list. No `total` field — counting on every request doubles Tinybird load and offset semantics don't fit cursor pagination. 
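A caller-side sketch of walking a cursor-paginated endpoint end to end, following the `{ data, pageSize, nextCursor }` shape described above. `collectAll`, `Page`, and `FetchPage` are hypothetical names; the transport is injected as a function so the loop stays independent of any HTTP client, and the cursor is treated as an opaque string that is never parsed.

```typescript
// Response shape of a paginated endpoint, as described above.
interface Page<T> {
  data: T[];
  pageSize: number;
  nextCursor: string | null; // null means end of list
}

// Hypothetical transport: fetches one page; cursor is omitted (undefined) on the first call.
type FetchPage<T> = (cursor: string | undefined, pageSize: number) => Promise<Page<T>>;

async function collectAll<T>(fetchPage: FetchPage<T>, pageSize: number = 50): Promise<T[]> {
  const items: T[] = [];
  let cursor: string | undefined = undefined;
  do {
    const page: Page<T> = await fetchPage(cursor, pageSize);
    for (const item of page.data) items.push(item);
    // Pass the cursor back verbatim; it is opaque to the caller.
    cursor = page.nextCursor === null ? undefined : page.nextCursor;
  } while (cursor !== undefined);
  return items;
}
```

Since there is no `total` field, the loop terminates only on `nextCursor: null`; callers that need a count derive it from the collected items.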
+ +Cursor-based was chosen over the existing Nuxt `page` + `pageSize` convention for three reasons: (1) stability under inserts/deletes (offset pagination skips or duplicates rows when the underlying set mutates between calls — common for live analytics); (2) O(log N) cost on an indexed sort key vs O(N+offset) for `LIMIT … OFFSET`; (3) v1 is server-to-server, so callers iterate end-to-end and don't need "jump to page 5" UI affordances. Industry alignment (Stripe, GitHub, AWS, Linear) is a side benefit. + +The `nuxt-to-api` skill rewrites Nuxt's `page`/`pageSize`/`total` handlers (and `limit`/`offset` handlers) into cursor handlers at port time. Top-N leaderboards (hard-capped, no second page) and time-series charts stay non-paginated. + +Sort order is caller-selectable from a per-endpoint allow-list (`?sort=name_asc`, `?sort=commits_desc`). Each accepted `sort` value is index-backed; values outside the list return `400 invalid_sort`. Removing an allowed value or changing an endpoint's default sort is a breaking change per ADR-0003. Full details in [docs/adr/0011](../adr/0011-pagination-cursor-based.md). + +### camelCase JSON + ISO-8601 UTC dates — [docs/adr/0014](../adr/0014-camelcase-json-iso8601-dates.md) + +All JSON keys are camelCase (`startDate`, `activityTypes`). Dates are ISO-8601 UTC strings (`2025-12-31T23:59:59Z`), never Unix timestamps. The existing Nuxt layer is mixed (some snake_case fields mirror Tinybird's internal naming); the `nuxt-to-api` skill normalizes at port time. Coupling the public contract to Tinybird field names was rejected because it would make Tinybird schema changes into API breaking changes. + +### URL porting: hybrid, rename only when genuinely misleading — [docs/adr/0012](../adr/0012-url-port-strategy-hybrid.md) + +The default is to port-as-is from Nuxt to `/v1/...` with light normalization (kebab-case, plural nouns). Renaming happens only when the existing URL is genuinely misleading to an external developer. 
Any rename must be recorded in the PR. Wholesale renaming was rejected as a large diff with no functional change, obscuring the real porting work. + +### v1 is server-to-server only; CORS denies all browser origins — [docs/adr/0004](../adr/0004-server-to-server-cors-deny.md) + +The API never sends `Access-Control-Allow-Origin` — this is intentional. Allowing browser origins in v1 would require a CORS policy, credential-safe key distribution, and potentially cookie-based auth — all out of scope. **Flagged for revisit before launch:** will we need browser-side access? + +--- + +## Operations & Cost + +### Caching: origin Redis only, `Cache-Control: private` — [docs/adr/0013](../adr/0013-origin-cache-only-private-cache-control.md) + +All responses carry `Cache-Control: private, max-age=0`. A Redis cache with two TTL tiers lives at the origin: 24h for stable data (project lists, leaderboards, categories), 1h for time-series analytics. Clients and CDNs do not cache. This keeps TTLs as an origin-side setting we can tune without a contract change, and avoids accidental public caching of org-scoped Collection responses. + +### Billing bundled with LFX membership — [docs/adr/0010](../adr/0010-billing-bundled-with-lfx-membership.md) + +API access is included in the user's existing LFX membership tier. No separate billing infrastructure or usage-based pricing in v1. + +Enforcement is split across two boundaries: + +- **Token-mint time (LFX Self-Serve):** Self-Serve checks Key Contact status via OpenFGA `v2_organization` entities before issuing an Insights access token. The precise moment depends on the PAT model (open — ADR-0015 Q1): with a new Insights-scoped refresh token the check happens at token issuance; with the existing PAT reused, the check happens at `POST /v1/auth/token` exchange time. Either way the check is in Self-Serve; non-Key-Contacts never receive a valid Insights access token. 
+- **Request time (Insights API):** Insights verifies the JWT signature, reads `tier` and `org` from the verified claims, and uses them for rate limits and future per-endpoint tier gating. It never re-queries OpenFGA or any membership system. + +Revoking a membership does not immediately invalidate existing refresh tokens — full details in [docs/adr/0010](../adr/0010-billing-bundled-with-lfx-membership.md). + +### Datadog: hybrid custom metrics + APM trace metrics — [PUBLIC_API_PLAN.md §3 D5 + §6](../PUBLIC_API_PLAN.md#d5-datadog-metrics-strategy--custom-metrics-vs-apm-trace-metrics) + +Low-cardinality custom metrics (tags: `endpoint`, `version`, `tier`, `status_class(2xx,4xx,5xx)`) power SRE dashboards and alerting. High-cardinality dimensions (`customer_id`, `api_key_id`) live in APM span attributes — not billed as custom metrics. Estimated budget: ~5,400 timeseries at ~$270/mo above DD quota. Pure custom metrics were rejected because per-customer cardinality would blow the cost budget. + +### Structured JSON logging via pino — [docs/adr/0018](../adr/0018-structured-json-logging.md) + +pino emits single-line JSON to stdout (no literal newlines, no leading/trailing characters), collected and shipped to Datadog Logs by the cluster log collector. Log levels follow the LFX platform standard ([lfx-architecture-decisions/0002](https://github.com/linuxfoundation/lfx-architecture-decisions/blob/main/decisions/0002-structured-json-logging.md)): `error` for non-recoverable failures, `warn` for recoverable/client errors, `info` for successful mutating requests only. Because v1 is a read-only API, read endpoints must not emit info logs. pino's OTel mixin injects `trace_id` and `span_id` into every line so logs correlate to traces in Datadog automatically — `trace_id` is the canonical correlation key; no parallel `requestId` field is used for log correlation. 
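The log-level rules above can be expressed as a small pure function. This is a sketch under stated assumptions, not the service's logger: it does not use pino, `levelFor` and `logLine` are hypothetical names, and mapping "non-recoverable failures" to 5xx responses and "client errors" to 4xx responses is an interpretation for illustration only.

```typescript
type Level = "error" | "warn" | "info";

// Decide which level (if any) a finished request is logged at.
function levelFor(method: string, status: number): Level | null {
  if (status >= 500) return "error"; // assumed: server-side, non-recoverable
  if (status >= 400) return "warn"; // assumed: recoverable / client error
  const mutating = ["POST", "PUT", "PATCH", "DELETE"].indexOf(method.toUpperCase()) !== -1;
  // Successful reads emit nothing at info level.
  return mutating ? "info" : null;
}

// Build one single-line JSON record carrying the OTel correlation fields.
function logLine(level: Level, msg: string, trace: { traceId: string; spanId: string }): string {
  // JSON.stringify escapes embedded newlines, so the output stays on one line.
  return JSON.stringify({ level, msg, trace_id: trace.traceId, span_id: trace.spanId });
}
```

Returning `null` for successful reads keeps the "read endpoints must not emit info logs" rule in one place instead of relying on each handler to remember it.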
+ +### OpenTelemetry instrumentation; trace ID is the request ID — [docs/adr/0019](../adr/0019-opentelemetry-instrumentation.md) + +The API uses `@opentelemetry/sdk-node` with HTTP, Postgres, and pino auto-instrumentation. W3C TraceContext propagation is automatic — inbound `traceparent` headers are honoured, outbound HTTP calls inject them, and the active trace context flows through async operations without manual plumbing. This inherits [LFX-0003](https://github.com/linuxfoundation/lfx-architecture-decisions/blob/main/decisions/0003-opentelemetry-instrumentation.md). W3C `traceparent` is the sole HTTP propagation channel — no `X-Request-Id` response header is set. The OTel trace ID **is** the request ID: it surfaces in the error envelope's `requestId` field, so a support ticket carrying a `requestId` joins logs ↔ APM traces in Datadog without translation. There is no parallel ULID generator. Logs include `trace_id`/`span_id` (OTel hex format); Datadog's APM ingester recognises OTel-format IDs natively — no parallel `dd.trace_id`/`dd.span_id` fields are emitted. Sampling is 100% in v1 — revisit when RPS data is available. The SDK exports OTLP to a co-located `opentelemetry-collector` sidecar (`localhost:4317`), which forwards to Datadog in prod/staging — following LFX-0003's recommendation and keeping the configuration self-contained in the pod with no changes to the shared cluster Datadog agent. diff --git a/docs/architecture-review/03-context.md b/docs/architecture-review/03-context.md new file mode 100644 index 000000000..d75963252 --- /dev/null +++ b/docs/architecture-review/03-context.md @@ -0,0 +1,140 @@ +# Domain Context + +This page defines the canonical language for the LFX Insights Public API. When writing tasks, docs, or code comments, use these terms precisely. Ambiguities from earlier design discussions are recorded in the **Flagged Ambiguities** section. + +The full machine-readable version lives at [`docs/CONTEXT.md`](../CONTEXT.md). 
+ +--- + +## Auth & Identity + +**API Key** +The long-lived credential a User receives from the LFX Self-Serve App at `app.lfx.dev/settings` — technically a Refresh Token. Customers see and handle this as their "API key." Only Key Contacts in member organizations are permitted to create them. The actual Bearer value sent to the Insights API is a short-lived Access Token derived from it. +_Avoid:_ token, secret, access key + +**Refresh Token** +The long-lived credential held by the customer (what they receive as their "API key"). Used at `POST api.insights.linuxfoundation.org/v1/auth/token` (proxied to LFX Self-Serve) to mint Access Tokens. Never sent to the Insights API as a Bearer credential. Revoking it stops future Access Token mints; in-flight Access Tokens continue to work until their `exp`. Multiple active Refresh Tokens per User are supported for zero-downtime rotation. +_Avoid:_ long-lived JWT, API key (when referring to the credential type specifically) + +**Access Token** +A short-lived JWT (~15 min) minted from the Refresh Token. Sent to the Insights API as `Authorization: Bearer `. Carries the verified `sub`, `org`, `tier`, `iss`, `kid`, and possibly `aud` claims. JWKS-verified on every request. +_Note:_ the presence of `org` and `tier` in the LFX Self-Serve access token is an assumption pending confirmation with the Self-Serve team (T-015). If the existing PAT is reused, these claims may need to be added. +_Avoid:_ calling it just "a JWT" or "the bearer token" — always use "access token" so it's clear which credential is meant + +**User** +The human account that owns one or more API Keys (Refresh Tokens) and is the billing principal. Identified by the JWT `sub` claim issued by the LFX Self-Serve App. +_Avoid:_ account, customer, client + +**Organization** (`org_id`) +The LFX organization tied to a User's API access, encoded in the `org` claim of the LFX Self-Serve access token. 
"Belongs to" is narrow here: only authorized **Key Contacts** of an organization with an active LFX membership can hold API keys — not every employee or self-attested affiliate of the organization. Used as the shared bucket for rate-limit quotas — all API keys belonging to Key Contacts in the same org draw from one pool. +_Avoid:_ tenant, workspace, team + +**Tier** +A named access level attached to an Organization that controls Rate-limit Pool size. In v1, tiers affect only rate limits; endpoint-level gating is reserved for future versions. +_Avoid:_ plan, subscription + +--- + +## API Shape + +**Endpoint Group** +A logical cluster of related endpoints released together (Development, Contributors, Popularity, Security, Collections). Each group maps to a Jira epic. Endpoints within a group are promoted through launch stages independently. +_Avoid:_ phase, module, domain + +**Breaking Change** +Any modification that forces existing callers to update their integration: removing or renaming a response field, changing a field's type, making an optional input required, removing an endpoint, or changing the Error Envelope shape. Governed by the tolerant-reader contract (ADR-0003). Changing default or max pageSize, changing the cursor encoding semantics, removing a value from an endpoint's `sort` allow-list, or changing an endpoint's default `sort` value also counts. +_Avoid:_ non-backwards-compatible change + +**Error Envelope** +The standard JSON shape for all error responses: +```json +{ + "error": { + "code": "rate_limit_exceeded", + "message": "You have exceeded your rate limit.", + "requestId": "4bf92f3577b34da6a3ce929d0e0e4736", + "docsUrl": "https://docs.../errors#rate_limit_exceeded" + } +} +``` +`code` is a machine-readable snake_case string. `docsUrl` deep-links to the relevant docs page. +_Avoid:_ error body, error payload + +**Request ID** +The OpenTelemetry trace ID for the request — a 32-char lowercase hex string (128-bit). 
Exposed in the error envelope's `requestId` field; this is the value a customer quotes in a support ticket. The same value appears in pino log lines as `trace_id` and on the active OTel span, so logs ↔ APM traces join in Datadog without translation. W3C `traceparent` is the sole HTTP propagation channel — honoured inbound, injected outbound; no `X-Request-Id` response header is set. There is no separate ULID/UUID request ID — see ADR-0019. +_Avoid:_ ULID, UUID, separate correlation ID, X-Request-Id + +--- + +## Data & Infrastructure + +**Tinybird** +The columnar analytics database backing all time-series metrics (contributor activity, commit counts, etc.). The API queries Tinybird Pipes via HTTP; it does not use Postgres for analytics reads. +_Avoid:_ analytics DB, ClickHouse (Tinybird is the canonical name in this repo) + +**Collection** +A user-curated named group of projects, stored in Postgres. Collections is the only Endpoint Group that requires a per-request permission check — callers must prove they have access to the specific Collection they are querying. +_Avoid:_ project group, saved filter, list + +**Rate-limit Pool** +The shared sliding-window counter for an Organization. All API keys belonging to Key Contacts in the same org draw from the same pool. Implemented as a Redis key derived from `org_id`. +_Avoid:_ quota, bucket + +--- + +## API Stability + +**`/v1-alpha`** +The unstable stage. Endpoints served under `/v1-alpha/...` carry no contract guarantees — breaking changes (field renames, shape changes, endpoint removal) are allowed freely. Access is restricted to an allow-listed cohort (LFX-internal + external design partners) for contract and performance validation. +_Avoid:_ beta, preview + +**`/v1`** +The stable stage. An endpoint graduates here once it passes the promotion criteria: load test passes, shape has been stable for at least one week with the alpha cohort, error/latency budgets are healthy, and security sign-off is given. 
From this point the full tolerant-reader contract applies — only additive changes within `/v1`; breaking changes require `/v2`. The `/v1-alpha` route returns `410 Gone` for two weeks after promotion. +_Avoid:_ stable, released, GA + +--- + +## Wire-Format Conventions + +These are committed in v1. Changing any of them within v1 is a Breaking Change. + +| Convention | Rule | +|---|---| +| JSON key casing | camelCase for all request and response fields (`startDate`, `activityTypes`) | +| Date format | ISO-8601 UTC strings only (`2025-12-31T23:59:59Z`) — never Unix timestamps | +| Pagination | cursor-based: `cursor` (opaque base64url) + `pageSize` (default 50, max 200) + `sort` from a per-endpoint allow-list (e.g. `name_asc`, `commits_desc`); response: `{ data, pageSize, nextCursor }`; `nextCursor: null` ⇒ end; removing an allowed `sort` value or changing its default is a breaking change | +| Error codes | snake_case machine-readable strings (`tier_forbidden`, `rate_limit_exceeded`, `unauthorized`) | + +--- + +## Relationships + +``` +User ──owns──▶ API Key (many keys per user) +User ──is Key Contact of──▶ Organization (one org per user, v1) +Organization ──has──▶ Tier +Organization ──has──▶ Rate-limit Pool +Endpoint Group ──contains──▶ many Endpoints +Endpoint ──transitions through──▶ /v1-alpha → /v1 +Collection ──owned by──▶ User (creator only, via `ssoUserId`; null for curated/system Collections) +Collection endpoint ──requires──▶ Permission Check (Postgres + Redis cache) +``` + +--- + +## Example Dialogue + +> **Dev:** "Should I look up the User's tier before returning a response?" +> +> **Domain expert:** "In v1, no — all tiers see all endpoints. Tier only affects the Rate-limit Pool size. If the org's pool is exhausted, return 429. If a *future* endpoint is gated above the caller's tier, return 403 `tier_forbidden`. Don't conflate the two." 
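The distinction drawn in the dialogue above can be sketched as a single decision function. This is illustrative only: `decide` and `TIER_RANK` are hypothetical names, the Silver/Gold/Platinum ordering is the provisional hierarchy mentioned in these docs, and checking the tier before the pool is a choice made for this sketch, not something the source specifies.

```typescript
// Provisional tier ordering (assumed; the real hierarchy is still to be confirmed).
const TIER_RANK = { silver: 0, gold: 1, platinum: 2 } as const;
type Tier = keyof typeof TIER_RANK;

type Decision =
  | { status: 200 }
  | { status: 403; code: "tier_forbidden" }
  | { status: 429; code: "rate_limit_exceeded" };

function decide(callerTier: Tier, requiredTier: Tier, remainingInPool: number): Decision {
  // Gating: only relevant for a future endpoint that requires more than the lowest tier.
  if (TIER_RANK[callerTier] < TIER_RANK[requiredTier]) {
    return { status: 403, code: "tier_forbidden" };
  }
  // Quota: the org's shared Rate-limit Pool is exhausted.
  if (remainingInPool <= 0) {
    return { status: 429, code: "rate_limit_exceeded" };
  }
  return { status: 200 };
}
```

In v1 every endpoint would pass the lowest tier as `requiredTier`, so the `403` branch never fires and the tier only matters through the size of the pool that feeds `remainingInPool`.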
+ +--- + +## Flagged Ambiguities + +| Term used | Ambiguity | Resolution | +|---|---|---| +| "account" | Used for both User and Organization during design | **User** = human principal (`sub`); **Organization** = the entity that holds an LFX membership (Silver/Gold/Platinum) and owns the Rate-limit Pool (`org_id`) | +| "GA" / "stable" / "released" | Used loosely to mean "live" | Use **`/v1`** — an endpoint is either on `/v1-alpha` (no contract) or `/v1` (full contract). Avoid GA/stable/released. | +| "phase" | Used interchangeably with Endpoint Group | Use **Endpoint Group** in task descriptions; "phase" is informal and imprecise | +| "customer" | Used for both the User and their Organization | Use **User** for the human principal; use **Organization** for the LFX membership holder |