
[WIP][CRCR] Initial implementation of L2 #7967

Draft
KarhouTam wants to merge 5 commits into pytorch:main from KarhouTam:crcr-L2

Conversation

@KarhouTam KarhouTam (Contributor, Author) commented Apr 14, 2026

Summary

Higher-level behaviors for L3 and L4 are intentionally left for follow-up work.

Architecture

The relay is split into two AWS Lambda functions:

  • webhook lambda function (Updated)

    • receives GitHub webhook PR and push events from the upstream repo
    • validates webhook signatures and authenticates via AWS Secrets Manager
    • reads the downstream allowlist from its configured URL and caches it in Redis
    • for opened/reopened/synchronize/closed actions, forwards repository_dispatch events to downstream repos
  • callback lambda function (Added)

    • receives the downstream callback payload through a public lambda function URL
    • validates the callback payload with GitHub OIDC
    • reads the downstream allowlist from its configured URL and caches it in Redis
    • extracts CI result information from the payload and uploads it to PyTorch HUD
    • records queue time and execution time to support promotion of repos to L3
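
For illustration, a minimal sketch of the webhook lambda's flow. The secret loading and the dispatch helper below are placeholders, not the actual config/gh_helper implementation:

```python
# Illustrative sketch only; secret loading and the dispatch helper are
# placeholders, not the actual gh_helper/config implementation.
import hashlib
import hmac
import json

WEBHOOK_SECRET = b"placeholder"  # really fetched from AWS Secrets Manager


def dispatch_to_downstream(payload: dict) -> None:
    """Placeholder for the repository_dispatch fan-out to allowlisted repos."""


def signature_ok(body: bytes, signature_header: str) -> bool:
    # GitHub signs the raw request body and sends it as X-Hub-Signature-256.
    expected = "sha256=" + hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or "")


def webhook_handler(event: dict, context) -> dict:
    body = event["body"].encode()
    if not signature_ok(body, event["headers"].get("x-hub-signature-256", "")):
        return {"statusCode": 401, "body": "invalid signature"}
    payload = json.loads(body)
    if payload.get("action") in {"opened", "reopened", "synchronize", "closed"}:
        dispatch_to_downstream(payload)
    return {"statusCode": 200, "body": "ok"}
```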

Changes

.github/
├── workflows/
│   └── _lambda-do-release-runners.yml     # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release
│
└── actions/
    └── cross-repo-ci-relay-callback/
        └── action.yml                     # Composite action used by downstream workflows to report status back to the relay/result endpoint

aws/lambda/cross_repo_ci_relay/
├── tests/                                 # Unit tests for allowlist/config/webhook/result/redis behavior
├── README.md                              # Project overview, local development, callback flow, and result-side validation steps
├── Makefile                               # Top-level local developer entrypoint for test / deploy / clean
├── local_server.py                        # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints
├── requirements.txt                       # Python dependencies required by the relay Lambdas
│
├── utils/
│   ├── allowlist.py                       # Loads, parses, and queries the downstream allowlist by rollout level
│   ├── config.py                          # Shared runtime config loading and cached get_config() helper
│   ├── gh_helper.py                       # GitHub App, repository_dispatch, and GitHub file access helpers
│   ├── hud.py                             # HUD write helpers for downstream result reporting
│   ├── jwt_helper.py                      # Helpers for minting/verifying relay callback tokens
│   ├── redis_helper.py                    # Redis helpers for allowlist cache, OOT state, and timing data
│   └── misc.py                            # Shared TypedDict definitions and HTTPException
│
├── webhook/
│   ├── Makefile                           # Build/package/deploy commands for the webhook Lambda
│   ├── lambda_function.py                 # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events
│   └── event_handler.py                   # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos
│
└── callback/
    ├── Makefile                           # Build/package/deploy commands for the result Lambda
    ├── lambda_function.py                 # Result Lambda entrypoint: verifies callback token and GitHub OIDC token
    └── callback_handler.py                # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD

Usage

See README.md for more details.

Verification

We performed the following scenario verification on our AWS Lambda instance:

  • Upstream PR opened/reopened/synchronize and push events trigger the webhook lambda, which redispatches to the downstream CI workflow in a different organization.
  • The downstream workflow sends a callback payload through the added action to the result lambda, which extracts the CI result information and uploads it to PyTorch HUD.

Terraform configuration

Unit Tests

  • Unit Tests (Mock)

Security

  • Callback payload carries full upstream webhook data back to HUD: action.yml builds the callback body by mutating github.event.client_payload (which contains the entire original webhook payload: PR metadata, commits, author info) and adding status/conclusion/workflow_name/workflow_url on top. This full blob is forwarded verbatim by hud.py to HUD with no relay-side filtering. HUD receives both the relay-trusted verified_repo and an unvalidated body; if HUD trusts self-reported fields inside the body over verified_repo, a manipulated dispatch payload could tamper with HUD records.

  • Lambda callback URL is public and hardcoded: the endpoint is hardcoded in action.yml and exposed in a public action, making it trivially discoverable. OIDC verification blocks unauthorized HUD writes, but the endpoint has no rate limiting; request flooding can cause Lambda concurrency exhaustion or Redis connection saturation (a minimal limiter sketch follows this list).

  • Only OIDC is used for verification: the callback lambda relies solely on GitHub OIDC token verification for authentication, without additional application-level secrets or signatures. If an attacker compromises a downstream repo's GitHub Actions permissions, they could forge authenticated requests to the callback endpoint. OIDC also has its own limitations (e.g., token expiration, potential misconfiguration) that could lead to unauthorized access if not carefully managed.
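
To make the rate-limiting gap above concrete, a minimal fixed-window limiter sketch on top of the existing Redis connection. The limits and key shape here are placeholders, not a tuned policy:

```python
# Sketch of a fixed-window per-repo limiter on the existing Redis connection;
# the 30-requests-per-minute numbers are placeholders, not a tuned policy.
import redis

r = redis.Redis()  # connection details omitted


def allow_request(repo: str, limit: int = 30, window_s: int = 60) -> bool:
    key = f"crcr:ratelimit:{repo}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)  # start the window on the first hit
    return count <= limit
```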

HUD Interaction

  • Design Principle: Transparent Relay & Decoupling
    The Relay Server acts as a lightweight data passthrough layer. It does not define or parse specific CI data formats; instead, it offloads data interpretation and validation to the HUD. This ensures complete decoupling between the relay infrastructure and business-specific data.

  • Security & Risk Mitigation
    The relay uses OIDC authentication to guarantee the authenticity of the data source (Verified Repo). Its core responsibility is to ensure the data originates from the claimed repository, while security filtering and content compliance are enforced at the HUD level.
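
For reference, verifying the data source against GitHub's OIDC tokens can look roughly like the sketch below (using PyJWT; the expected audience value and the claims read are assumptions, not the exact relay code):

```python
# Sketch of GitHub Actions OIDC verification with PyJWT; the expected
# audience and the claims we read are assumptions, not the exact relay code.
import jwt

_JWKS = jwt.PyJWKClient("https://token.actions.githubusercontent.com/.well-known/jwks")


def verified_repo_from_token(token: str, expected_audience: str) -> str:
    signing_key = _JWKS.get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=expected_audience,
        issuer="https://token.actions.githubusercontent.com",
    )
    return claims["repository"]  # e.g. "downstream-org/downstream-repo"
```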

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
Co-authored-by: fffrog <ljw1101.vip@gmail.com>

meta-cla Bot commented Apr 14, 2026

Hi @KarhouTam!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!


vercel Bot commented Apr 14, 2026

@KarhouTam is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@KarhouTam KarhouTam changed the title from [CRCR] Initial implementation of L2 to [WIP][CRCR] Initial implementation of L2 on Apr 14, 2026

meta-cla Bot commented Apr 15, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 15, 2026
Comment thread aws/lambda/cross_repo_ci_relay/utils/hud.py Outdated

#### Known limitations of this model

A compromised or malicious maintainer of an allowlisted repo can:

@ZainRizvi raised this in the RFC Document: What's the worst that a malicious user could do with these known limitations?

Contributor

Maybe a higher CI passing rate could help them upgrade from L2 to L3, I guess. I believe most people won't cheat on that, but we must account for the possibility in case it happens.

Comment thread aws/lambda/cross_repo_ci_relay/utils/hud.py Outdated
# a tampered value just misses the timing cache, which only hurts the
# attacker's own HUD row.
ci_metrics: dict = {"queue_time": None, "execution_time": None}
if status == "in_progress":

Consider adding a state machine to enforce valid status transitions before processing. Right now any allowlisted repo can send in_progress or completed in any order with no checks. A simple Redis-backed state per delivery_id + verified_repo would prevent rollbacks and replays:

in_progress: accepted only if no record exists yet
completed: accepted if current status is in_progress OR no record exists (handles missed in_progress callback)
in_progress after completed: rejected (no rollback)
Duplicate completed: rejected
This combined with per-repo rate limiting covers replay and flooding concerns without needing complex tamper-proof logic on every field. The OIDC-verified downstream_repo is the trust anchor — trusting the data from a verified source is reasonable since the downstream is invested in accuracy.

The in_progress skip for completed is important — the in_progress callback is best-effort and can be lost (network issues, Lambda timeout). Rejecting completed in that case would mean silently losing the actual test results, which is worse than missing the queue_time metric.

Reference: https://docs.google.com/document/d/1w10Agz_YkSA1IDurobWjuNoYo4u9aykprjuymDFE6MU/edit?disco=AAAB3xDqDYg

Contributor Author

@KarhouTam KarhouTam Apr 25, 2026

The state machine approach is solid, but we must consider what happens if the downstream workflow fails due to internal issues and requires a rerun. In that scenario, the relay would receive multiple in_progress payloads, making it indistinguishable from a replay attack—the only difference being the intent. Given that network instability, hardware failures, and software bugs are common, such downstream failures are highly probable and must be accounted for.

cc @jewelkm89 @fffrog @can-gaa-hou @ZainRizvi

Contributor

@fffrog fffrog Apr 27, 2026

@subinz1 @KarhouTam @can-gaa-hou

I agree with adding the state machine back (we had it in the POC stage but removed it because of the added complexity), but this logic involves several details.

Background

Starting from L3, we need to support the rerun functionality on both the PyTorch side and the downstream side. Essentially, the handling logic for both rerun scenarios is the same. Considering network fluctuations, user behavior, and reruns, I suggest simplifying the state machine:

Metrics

  • dispatch_time: This is straightforward. It is recorded in Redis as soon as webhook_lambda forwards the event to the downstream repo.
  • started_at: Recorded once and only once. Regardless of whether the request is an initial run or a rerun, we only keep the timestamp of the first occurrence in Redis.
  • completed_at: Same as above, recorded once and only once. Subsequent rerun updates will not overwrite this time.

Limitations and Justification:

  • callback_handler ensures that these timestamp fields are not empty when delivering data to HUD.
  • While the final calculated execution and queue times might be "magnified" (less accurate) during reruns, I believe this error is reasonable and acceptable at this stage since it is primarily driven by downstream behavior.

Test Result Uploads and State Transitions

To reach a consensus, I suggest defining the following three STATUS values in Redis:

  • Started: dispatch has been successfully sent.
  • In_progress: "In-progress" feedback has been received.
  • Completed: Final test results have been received.

(Note: There is also an implicit None state, meaning no record exists in Redis for the given delivery_id.)

When callback_lambda receives a callback signal, the logic is as follows:

A. Upon receiving an in_progress signal:

  • None: Reject. This indicates the task was not triggered by a valid webhook. Whether it is an initial run or a rerun, a record should exist in Redis (we set a 3-day expiration to align with the maximum job execution time).
  • Started: Accept and call the HUD API to notify that the task has officially started.
  • In_progress: Accept, but do not call the HUD API (to ensure the workflow continues while avoiding redundant calls).
  • Completed: Accept, update the HUD status to In progress.

B. Upon receiving a completed signal:

  • None: Reject. (Same reason as above).
  • Started: Reject. In this abnormal case (skipping the in_progress stage), the user can use a rerun in the downstream repo to re-upload, so no test results will be lost.
  • In_progress: Accept and call the HUD API (initial creation or updating).
  • Completed: Reject
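
For illustration, the two tables above as a plain transition map. This is a sketch only; the state names and the (accept, notify HUD) encoding are one possible reading, not settled design:

```python
# Sketch only: state names and the (accept, notify_hud) encoding are one
# possible reading of the table above, not settled design.
ALLOWED_TRANSITIONS = {
    # (state in Redis, incoming signal) -> (accept?, call HUD API?)
    (None, "in_progress"): (False, False),          # no webhook record: reject
    ("started", "in_progress"): (True, True),       # first start: notify HUD
    ("in_progress", "in_progress"): (True, False),  # rerun: accept, skip HUD call
    ("completed", "in_progress"): (True, True),     # rerun after completion: HUD back to in progress
    (None, "completed"): (False, False),            # reject, same reason
    ("started", "completed"): (False, False),       # skipped in_progress: reject, rerun re-uploads
    ("in_progress", "completed"): (True, True),     # final results: create/update HUD record
    ("completed", "completed"): (False, False),     # duplicate final result: reject
}


def check_transition(current: str | None, signal: str) -> tuple[bool, bool]:
    return ALLOWED_TRANSITIONS.get((current, signal), (False, False))
```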

This is just a preliminary thought, and I'd love to hear your feedback.

Contributor

but we must consider what happens if the downstream workflow fails due to internal issues and requires a rerun. In that scenario, the relay would receive multiple in_progress payloads, making it indistinguishable from a replay attack—the only difference being the intent

The GitHub Actions API's attempt field helps with this:

  • When a workflow is executed, it creates a new Run with attempt=1
  • If jobs fail and get retried, a new event is emitted with the same run number but attempt is incremented.
  • job_id remains unique per job execution, so it should never be affected by reruns (reruns create a new job id)

So our state machine should include tracking: run_id, run_attempt, and job_id.
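
For illustration, a sketch of a state key that folds these in (the prefix and field order are assumptions):

```python
# Sketch: key shape is an assumption. A genuine rerun arrives with a higher
# run_attempt (and a fresh job_id); a replayed payload reuses an already-seen
# (run_id, run_attempt, job_id) triple and can be rejected.
def state_key(repo: str, delivery_id: str, run_id: int, run_attempt: int, job_id: int) -> str:
    return f"crcr:state:{repo}:{delivery_id}:{run_id}:{run_attempt}:{job_id}"
```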

Contributor

@ZainRizvi ZainRizvi Apr 28, 2026

In_progress: Accept, but do not call the HUD API (to ensure the workflow continues while avoiding redundant calls).

I also want to call out that HUD should probably get an entry per job as well as per callback. The way I'm thinking about it, the OOT repo could want to trigger multiple jobs for a single commit, and each one should be able to report back its own status. This allows the OOT repo to say "these four jobs passed but this one failed".

One callback could potentially call 5 jobs under the hood, and we want to ensure all 5 get status updates in HUD

Comment thread aws/lambda/cross_repo_ci_relay/callback/result_handler.py
"in_progress".
required: false
default: ''
test-results:
@subinz1 subinz1 Apr 25, 2026

Hi @KarhouTam,

As per @ZainRizvi's comments here: https://docs.google.com/document/d/1w10Agz_YkSA1IDurobWjuNoYo4u9aykprjuymDFE6MU/edit?disco=AAAB3T8nxDE, it is tricky to enforce a consistent test result JSON schema, and including all the failed test info may make the payload very large. So it is better to call out or enforce a standard test result JSON that the downstream sends to the result_handler lambda.

A standard test result JSON is also needed for HUD integration.

WDYT

CC @ZainRizvi @fffrog @groenenboomj @jewelkm89

Contributor Author

@KarhouTam KarhouTam Apr 27, 2026

Agreed that regulating the test-results is necessary. I think we should settle the final HUD RFC first to have a clear direction for the code implementation.

Contributor

Sure.

We can add annotations to this action to clarify for downstream developers what they should and should not deliver. We could also evaluate the "body" content received from the downstream repo. However, the prerequisite for this is having a clear definition of the "body" in the HUD RFC.

Would you mind proposing the RFC first? I believe this will help us reach a final consensus, which can then guide us in developing and optimizing the code.

Contributor

Just to make sure we're on the same page here, I think we're landing at:

  1. This test-results object only contains a summary of the tests (e.g. X tests passed, Y failed, Z skipped. Anything else needed?). Importantly, it does not enumerate all passing/failing tests.
  2. Full test result info gets uploaded as an artifact to some url

Also: We should add a version number to this schema in case we need to update it in the future without breaking backwards compatibility.
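
For illustration, a sketch of what that summary-only, versioned shape could look like (field names are assumptions pending the HUD RFC):

```python
# Sketch of the summary-only, versioned shape; field names are assumptions
# pending the HUD RFC.
from typing import TypedDict


class TestResultsSummary(TypedDict):
    schema_version: int  # bump on breaking changes
    passed: int
    failed: int
    skipped: int
    artifact_url: str    # full per-test results are uploaded here, not inlined
```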

| HUD response | Relay behavior | Caller's CI |
|---|---|---|
| `2xx` | record delivered | green |
| `4xx` (schema reject) | propagate same status | **red** — author must fix payload |
| `5xx` / network error | log + swallow | green — HUD outage is not the caller's fault |
Contributor

@ZainRizvi ZainRizvi Apr 28, 2026

log + swallow

minor discrepancy here: we should log and return the error to the caller, letting them know that there was some internal error and the data was not saved as they expect (that's us being honest). The code returns the error instead of swallowing it.

But our documentation and return message should tell the caller that these internal errors should be considered ignorable and should not fail their CI.

Why put this in the return message? It'll probably get read by an LLM. We're helping it code :)

Contributor

@ZainRizvi ZainRizvi left a comment

Some earlier comments plus a couple new ones from cross-referencing with the RFC.

last_exception.code,
last_exception.reason,
)
raise HTTPException(last_exception.code, last_exception.reason) from last_exception
Contributor

Prob shouldn't log the internal code and exception reason here. If it's an internal failure (not the caller's fault) it's valid to just return a 500 and say something like this to guide the LLM reading the failure:

"An internal failure occured. Your update was note saved, but the CI run is still valid. You can attempt progressive retries after X seconds or ignore this failure."

in_progress_at, completed_at, "execution_time"
)

trusted = {"ci_metrics": ci_metrics, "verified_repo": verified_repo}
Contributor

Let's also include the downstream repo level here, so that this lambda is the only determiner of authoritative repo level information at this point in time.

If we make HUD recompute this information when we make tiering information dynamic, that logic will have to be kept in sync with HUD (and we get exposed to other timing/consistency issues).

"in_progress".
required: false
default: ''
test-results:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure we're on the same page here, I think we're landing at:

  1. This test-results object only contains a summary of the tests (e.g. X tests passed, Y failed, Z skipped. Anything else needed?). Importantly, it does not enumerate all passing/failing tests.
  2. Full test result info gets uploaded as an artifact to some url

Also: We should add a version number to this schema in case we need to update it in the future without breaking backwards compatibility.

supply the relay URL, the status/conclusion, and optional structured test
results.

inputs:
Contributor

description: >
Base URL of the result callback server.
required: false
default: https://zciswjsrynb6ksccc2nb3mckpa0ldzou.lambda-url.us-east-1.on.aws/github/result
Contributor

minor: I know this is for testing, but might be easier to avoid an accidental merge by moving this to the caller sooner rather than later :)

Contributor

@ZainRizvi ZainRizvi left a comment

A couple more things from cross-referencing with the RFC.

field; HUD should prefer it over anything self-reported in `body`.
- **Schema / business validation (HUD's job)**: the callback body is passed
through to HUD verbatim as a top-level `body` field. Relay does **not**
validate fields inside `body.workflow` or `body.payload` — HUD owns the
Contributor

validate fields inside body.workflow or body.payload — HUD owns the schema since it owns persistence.

This contradicts the direction we landed on in the RFC review (rfcs#96) — validation should live in the relay, not HUD. The relay is already the gatekeeper for OIDC, allowlist, and rate limiting, so schema validation belongs here too. HUD should just authenticate the relay and write what it's told.

The code actually does the right thing already (validates delivery_id and workflow.status here in the relay). Let's update this section to match what the code does instead of claiming the opposite.


_ALLOWLIST_CACHE_KEY = "crcr:allowlist_yaml"
_TIMING_PREFIX = "crcr:timing:"
_RATE_LIMIT_PREFIX = "crcr:ratelimit:"
Contributor

Thoughts on using oot: as the prefix for all the Redis keys? Right now it's crcr: for timing/allowlist and crcr:ratelimit: here. The RFC and doc use oot:rate:. Either prefix is fine, but let's pick one — a split between crcr: and oot: for different keys would be the worst outcome. I'd lean oot: since that's what the RFC uses and it's more descriptive for anyone poking around in Redis.
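
For concreteness, the constants unified under one prefix could look like this (the exact key names are illustrative):

```python
# Sketch: everything under a single oot: prefix, matching the RFC's
# oot:rate: convention; exact key names are illustrative.
_ALLOWLIST_CACHE_KEY = "oot:allowlist_yaml"
_TIMING_PREFIX = "oot:timing:"
_RATE_LIMIT_PREFIX = "oot:rate:"
```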

Contributor Author

KarhouTam commented Apr 30, 2026

Hey @ZainRizvi, thanks for your valuable comments and suggestions. I am on Labor Day vacation now and will be back on May 6th. I will address these when I come back! Once again, thank you!

"Authorization": config.hud_bot_key,
},
method="POST",
)
Contributor

The RFC (rfcs#96) just changed this to a dedicated X-OOT-Relay-Token header — the whole point was to avoid reusing a shared credential on a write path. This should send X-OOT-Relay-Token instead of Authorization so HUD can scope the token to just the /api/oot/results endpoint.
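
For illustration, the write with a dedicated header could look like this (the URL and token plumbing here are assumptions):

```python
# Sketch: sends the relay token in a dedicated X-OOT-Relay-Token header
# instead of Authorization; the URL and token plumbing are assumptions.
import json
import urllib.request


def post_result_to_hud(hud_url: str, relay_token: str, payload: dict) -> None:
    req = urllib.request.Request(
        hud_url,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "X-OOT-Relay-Token": relay_token,  # HUD can scope this to /api/oot/results
        },
        method="POST",
    )
    urllib.request.urlopen(req)
```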

"name": os.environ["WORKFLOW_NAME"],
"url": os.environ["WORKFLOW_URL"],
}

Contributor

The RFC's dynamoKey is {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}, which means HUD expects job_name, run_attempt, started_at, and completed_at in the workflow dict. Without them, every job in a multi-job workflow will collide on the same key (job_name defaults to "default", run_attempt to 1), and HUD has no wall-clock timestamps so duration_seconds is always 0.

This needs to include:

"job_name": os.environ["JOB_NAME"],
"run_attempt": int(os.environ["RUN_ATTEMPT"]),
"started_at": datetime.utcnow().isoformat() + "Z",  # only on in_progress
"completed_at": datetime.utcnow().isoformat() + "Z",  # only on completed

with corresponding env vars:

JOB_NAME: ${{ github.job }}
RUN_ATTEMPT: ${{ github.run_attempt }}

in_progress_at, completed_at, "execution_time"
)

trusted = {"ci_metrics": ci_metrics, "verified_repo": verified_repo}
Contributor

Following up on my earlier comment about including downstream_repo_level — the allowlist is already loaded a few lines above and verified_repo is checked against l2_repos. The level should be set here in trusted so HUD doesn't need to recompute it:

level = allowlist.get_level(verified_repo)  # or however the allowlist exposes this
trusted = {"ci_metrics": ci_metrics, "verified_repo": verified_repo, "downstream_repo_level": level.value}

The RFC schema includes this field and HUD will need it for the per-backend dashboard.

Contributor

@ZainRizvi ZainRizvi left a comment

As a general prompting tip, Claude does really well if you point it at both the RFC PR and the Google doc we worked on, and ask it to look for inconsistencies between them (and explicitly ask it to check the comments as well).

That's a good prompt for both the coding agent to start with and also for the reviewing agent to verify against.

