[WIP] RFC-0001: HUD Integration for Out-of-Tree CI Results#1

Draft
subinz1 wants to merge 3 commits into master from oot-hud-integration-rfc


@subinz1 subinz1 commented Apr 28, 2026

Summary

This RFC defines the HUD-side ingestion and display layer for Out-of-Tree (OOT) CI results, building on RFC-0050 (Cross-Repository CI Relay for PyTorch Out-of-Tree Backends).

Data Flow

flowchart LR
    subgraph Downstream["Downstream CI (OOT Backend)"]
        DS["Run tests\n+ upload artifacts"]
    end

    subgraph ART["Artifact Storage (org-managed)"]
        STORE[("Logs, test reports,\nJUnit XML")]
    end

    subgraph Relay["Relay Server"]
        RH["Result Handler\n• OIDC verify\n• Allowlist check\n• Rate limit"]
    end

    subgraph HUD["HUD"]
        API["/api/oot/results\n• Auth check\n• Payload validation\n• Payload caps (2MB)"]
    end

    subgraph Storage["Storage"]
        DDB[("DynamoDB\ntorchci-oot-workflow-job\n(in_progress + completed)")]
        STR["DynamoDB Stream"]
        REP["clickhouse-replicator-dynamo"]
        CH[("ClickHouse\ndefault.oot_workflow_job\n(completed only)")]
    end

    subgraph Frontend["HUD Frontend"]
        P1["/oot — Global Summary"]
        P2["/oot/org/repo — Per-Backend"]
        P3["/pr/N — OOT Section"]
    end

    DS -->|"Upload artifacts"| STORE
    DS -->|"① POST in_progress\n② POST completed\n+ artifact_url\n(OIDC token)"| RH
    RH -->|"X-Hud-Internal-Bot\n{trusted, untrusted}"| API
    API -->|"PutItem"| DDB
    DDB --> STR --> REP -->|"completed only"| CH
    CH -->|"Query results +\nartifact_url"| P1 & P2 & P3
    P2 & P3 -.->|"User clicks\nexternal link"| STORE

Key points:

  • Artifact URLs are included in the completed callback payload and flow through the Result Handler → HUD API → DynamoDB → ClickHouse
  • HUD pages read artifact_url from ClickHouse and render it as an external link — no direct connection between HUD and downstream storage
  • Only completed records are replicated to ClickHouse; in_progress stays in DynamoDB for mutable state tracking
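The split between the two stores can be sketched as follows. This is a minimal illustration, not the reference implementation: the in-memory `dynamo_table` and `clickhouse_rows`, the helper names, and the `job_id` key are hypothetical stand-ins, and the `status` field name is assumed from the in_progress/completed wording above.

```python
from typing import Any

# Hypothetical in-memory stand-ins for the two stores. The real tables are
# torchci-oot-workflow-job (DynamoDB) and default.oot_workflow_job (ClickHouse).
dynamo_table: dict[str, dict[str, Any]] = {}
clickhouse_rows: list[dict[str, Any]] = []


def put_item(record: dict[str, Any]) -> None:
    """Upsert by job key, mimicking a PutItem that overwrites the prior state."""
    dynamo_table[record["job_id"]] = record  # "job_id" is an assumed key name


def replicate(record: dict[str, Any]) -> bool:
    """Forward a record to the analytics store only when it is completed."""
    if record.get("status") != "completed":
        return False  # in_progress stays in the mutable store only
    clickhouse_rows.append(dict(record))
    return True


def handle_callback(record: dict[str, Any]) -> bool:
    """Write every callback to the mutable store, then replicate if eligible."""
    put_item(record)
    return replicate(record)
```

Under these assumptions, an in_progress callback followed by a completed one for the same job leaves one mutable record (overwritten in place) and exactly one replicated row, which carries the `artifact_url` that the frontend later renders as an external link.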

What this RFC covers

  • Write path: Downstream CI → Result Handler → HUD API → DynamoDB → ClickHouse (completed records only)
    • in_progress callbacks → DynamoDB only (mutable state tracking)
    • completed callbacks → DynamoDB → replicated to ClickHouse (dashboard queries)
    • Artifact URLs flow through the callback payload, not sent directly to HUD
  • Read path: Three new HUD views:
    • /oot — Global OOT CI summary (cross-repo health overview, repos sorted by pass rate)
    • /oot/[org]/[repo] — Per-backend dashboard (matrix view: PRs × jobs, failure drill-down, external artifact links)
    • /pr/[number] — Collapsible "Out-of-Tree Backends" section in existing PR pages
  • Storage schemas: DynamoDB table and ClickHouse table designs
  • DB protection: Rate limiting (per-repo at relay), payload caps (2MB at HUD API)
  • Security: OIDC authentication, trusted/untrusted payload split, error handling strategy (delivered/hud_rejected/hud_unavailable/skipped), signed callback token proposal, state machine for status transitions
  • Sample payloads: In-progress, success, and failure callback examples with full field definitions
  • Implementation plan: 6-phase rollout with task-level breakdown:
    1. Storage Layer — DynamoDB + ClickHouse + replicator mapping
    2. HUD API Endpoint — types, validation, write logic
    3. Relay Integration — handler → HUD forwarding, rate limiting, reusable GHA action
    4. HUD Frontend Pages — 3 views + saved ClickHouse queries
    5. End-to-End Validation — real downstream repo testing
    6. Security Hardening — callback token, state machine (future)
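As a rough sketch of two of the protections listed above, the payload cap and the status state machine might look like the following. The function names, the reading of "2MB" as 2 × 1024 × 1024 bytes, and the exact transition table are assumptions made for illustration; the RFC text is the authority on the actual rules.

```python
import json
from typing import Optional

# Assumed reading of the 2MB cap; the RFC may define the limit differently.
MAX_PAYLOAD_BYTES = 2 * 1024 * 1024

VALID_STATUSES = {"in_progress", "completed"}

# Assumed transition table: a job may first appear in either state, may move
# from in_progress to completed, and never leaves completed.
ALLOWED_TRANSITIONS = {
    None: {"in_progress", "completed"},
    "in_progress": {"in_progress", "completed"},
    "completed": set(),
}


def check_payload(raw: bytes) -> dict:
    """Reject oversized or malformed payloads before any storage write."""
    if len(raw) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds the 2MB cap")
    payload = json.loads(raw)  # raises a ValueError subclass on bad JSON
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    if payload.get("status") not in VALID_STATUSES:
        raise ValueError("unknown status")
    return payload


def can_transition(current: Optional[str], new: str) -> bool:
    """Return True if moving from the stored status to the new one is allowed."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

Checking the size before parsing keeps an oversized body from being deserialized at all, and making completed a terminal state would prevent a late or replayed in_progress callback from overwriting a final result.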

Reference implementation

A working reference implementation is available at subinz1/test-infra#1, which includes the API endpoint, ClickHouse schema, replicator mapping, saved ClickHouse queries, and all three frontend pages.

Status

This is a WIP draft. Feedback welcome.

Defines the HUD-side ingestion and display layer for OOT CI results,
building on RFC-0050 (Cross-Repository CI Relay). Covers the complete
write path (Result Lambda → HUD API → DynamoDB → ClickHouse), three
frontend views (global summary, per-backend dashboard, PR integration),
storage schemas, DB protection (rate limits, payload caps, daily budgets),
and security design (OIDC, trusted/untrusted split, callback token proposal).

Reference implementation: subinz1/test-infra#1
@subinz1 subinz1 force-pushed the oot-hud-integration-rfc branch from c94be0f to d57ee39 Compare April 28, 2026 11:38
Rename from RFC-0051 to RFC-0001. Defines the HUD-side ingestion and
display layer for OOT CI results, building on the Cross-Repository CI
Relay. Covers write path, storage schemas, DB protection, security,
and three frontend views.

Reference implementation: subinz1/test-infra#1
@subinz1 subinz1 changed the title from "[WIP] RFC-0051: HUD Integration for Out-of-Tree CI Results" to "[WIP] RFC-0001: HUD Integration for Out-of-Tree CI Results" Apr 28, 2026
…WS refs

- Artifact URLs now flow through Result Handler (not directly to HUD)
- Removed daily budget enforcement
- Split implementation plan into 6 clearly defined phases with task tables
- Removed AWS/Vercel/Terraform/IAM-specific references throughout
- Clarified that only completed records are replicated to ClickHouse
  (in_progress stays in DynamoDB only for mutable state tracking)