Fix llvm update and mlir distro by hunhoffe · Pull Request #3052 · Xilinx/mlir-aie

hunhoffe · 2026-05-06T21:23:59Z

Summary

Fixes two compounding bugs that caused #3018, an auto-generated LLVM-bump PR that could never pass CI.

Bug 1: `mlirDistro.yml` silently produced `LLVM_PROJECT_COMMIT=null`

Evidence: run 25436146719, step "Get latest LLVM commit". LLVM_PROJECT_COMMIT=null was written to the step outputs after curl was rate-limited.

The "Get latest LLVM commit" step used an unauthenticated curl against api.github.com. When the per-IP rate limit was hit, jq parsed the error body and returned the literal string "null", and the step exited 0. That null was passed downstream, so every wheel build tried to fetch zip/null and failed. At least, this is my theory; I've added extra logging in case I am wrong about where the "null" comes from.

Fix: add an Authorization: Bearer $GH_TOKEN header, fail-fast on HTTP error, and reject empty/null commit values before writing them to $GITHUB_OUTPUT.
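The "reject empty/null commit values" part of the fix can be sketched in Python. This is only an illustration, not the shell code the workflow actually uses: the function name `extract_commit`, the assumption that the response is a JSON object with a top-level `sha` field, and the 40-hex-character check are all hypothetical choices made for this example.

```python
import json
import re


def extract_commit(body: str) -> str:
    """Return the commit SHA from an API response body, or raise ValueError.

    Assumes (for illustration) that a successful response is a JSON object
    with a top-level "sha" field holding a 40-character hex string.
    """
    payload = json.loads(body)
    sha = payload.get("sha") if isinstance(payload, dict) else None
    # An error body such as a rate-limit message has no "sha" key, so a naive
    # lookup yields None/null. Refuse to pass that downstream.
    if not isinstance(sha, str) or re.fullmatch(r"[0-9a-f]{40}", sha) is None:
        raise ValueError(f"no usable commit SHA in response: {body[:200]!r}")
    return sha
```

A rate-limit error body, an explicit `"sha": null`, and an empty string all raise here instead of producing a value that looks like a commit.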

Bug 2: `update_llvm_version.py` fabricated a wheel version

When no mlir-distro wheel existed for the chosen Triton/Torch-MLIR commit, the script invented a DATETIME from the LLVM commit timestamp and proceeded anyway, producing a PR that referenced a wheel version that had never been built.

Fix: exit non-zero when no wheel exists, surfacing the gap rather than hiding it behind a fabricated value.
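A minimal sketch of the fail-loud behaviour, assuming the published wheels are available as a mapping from LLVM commit to wheel version. The function name `resolve_wheel_version`, the mapping shape, and the error text are hypothetical and are not taken from update_llvm_version.py.

```python
import sys


def resolve_wheel_version(commit: str, published: dict) -> str:
    """Look up the wheel version for `commit`, exiting non-zero on a miss.

    `published` is a hypothetical mapping of LLVM commit -> wheel version.
    The point is that a miss is an error, never a cue to invent a version.
    """
    version = published.get(commit)
    if version is None:
        print(
            f"error: no mlir-distro wheel found for LLVM commit {commit}",
            file=sys.stderr,
        )
        sys.exit(1)
    return version
```

With this shape a scheduled run fails visibly when the wheel is missing, rather than opening a PR that references a version nobody built.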

Closing the loop

A fail-loud script alone leaves a manual gap: someone has to dispatch mlirDistro.yml by hand to publish a wheel before the next scheduled update-llvm run can succeed. To close the loop, update_llvm_version.py learns --identify-only, and update-llvm.yml runs in three phases:

1. Identify the target commit and check wheel availability.
2. If no wheel exists, dispatch mlirDistro.yml with LLVM_COMMIT=$target and gh run watch it to completion.
3. Apply the update with --llvm-hash $target.
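The three phases can be sketched as a small synchronous driver. This is an illustration of the control flow only: in the PR the phases are workflow steps in update-llvm.yml, and the callables `identify`, `build_wheel`, and `apply_update` below are hypothetical stand-ins for them.

```python
def run_update(identify, build_wheel, apply_update):
    """Run identify -> (optional) build -> apply, blocking at each phase.

    identify()           -> (target_commit, wheel_exists)
    build_wheel(target)  -> True if the wheel build succeeded
    apply_update(target) -> applies the bump for the given commit
    All three are hypothetical hooks standing in for workflow steps.
    """
    target, wheel_exists = identify()
    if not wheel_exists:
        # Block until the build finishes; do not apply on a failed build.
        if not build_wheel(target):
            raise RuntimeError(f"wheel build failed for {target}")
    apply_update(target)
    return target
```

Because each phase blocks, the apply step never runs against a commit whose wheel is missing or failed to build.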

The orchestration is intentionally synchronous; the schedule is biweekly, so the runner-hour cost of blocking is small relative to the operational complexity of an async workflow_run chain.

Test plan

- On a branch with this change, mlirDistro.yml is exercised by the pull_request: paths: ['.github/workflows/mlirDistro.yml'] trigger. Verify that the "Get latest LLVM commit" step succeeds with the auth header and that null/empty values are rejected.
- Manually dispatch update-llvm.yml (workflow_dispatch) on the branch and confirm:
  - The --identify-only step writes target_commit, wheel_exists, and bump_reason to $GITHUB_OUTPUT.
  - When wheel_exists == false, mlirDistro is dispatched and watched through to success.
  - The apply step uses --llvm-hash with the captured target.
  - The resulting PR points at a wheel that actually exists in mlir-distro.
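For the --identify-only check, it may help to see how a script typically hands values to later workflow steps: GitHub Actions reads `name=value` lines appended to the file named by the GITHUB_OUTPUT environment variable. The helper below is a sketch of that mechanism; `write_outputs` is a hypothetical name, and the lower-case `true`/`false` encoding for booleans is an assumption, not something confirmed by the script.

```python
import os


def write_outputs(outputs: dict, path: str = None) -> None:
    """Append name=value lines to the GitHub Actions step-output file.

    `path` defaults to $GITHUB_OUTPUT. Booleans are written as lower-case
    true/false (an assumption chosen for this sketch).
    """
    path = path or os.environ["GITHUB_OUTPUT"]
    with open(path, "a", encoding="utf-8") as handle:
        for name, value in outputs.items():
            if isinstance(value, bool):
                value = "true" if value else "false"
            text = str(value)
            if "\n" in text:
                # Multi-line values need the delimiter syntax; keep it simple.
                raise ValueError(f"output {name!r} must be a single line")
            handle.write(f"{name}={text}\n")
```

When testing on a branch, the three keys named in the test plan should show up as step outputs after the identify step.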

Two independent failures combined to create PR Xilinx#3018, an LLVM update PR that could never pass CI:

1. mlirDistro.yml fetched llvm-project HEAD with an unauthenticated curl. When the GitHub API rate-limited the runner IP, jq parsed the error response, returned the literal string "null", and the workflow wrote LLVM_PROJECT_COMMIT=null to its output without complaining. Every downstream wheel build then tried to download zip/null and failed, but the get-commit job exited 0, so wheels silently stopped publishing for ~6 weeks while the schedule reported "success".
2. update_llvm_version.py picked the newest Triton/Torch-MLIR LLVM commit, looked it up in mlir-distro, and on cache miss fabricated a DATETIME from the commit timestamp and proceeded anyway. The resulting PR pointed at a wheel version that never existed.

Fixes:

- mlirDistro.yml: add Authorization header to the curl, fail-fast on HTTP error, and reject empty/null LLVM_PROJECT_COMMIT values before passing them downstream.
- update_llvm_version.py: when no mlir-distro wheel exists for the target commit, exit non-zero instead of fabricating one. A failed scheduled run will surface that mlir-distro is broken or lagging rather than producing another unmergeable PR.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

The previous commit made update_llvm_version.py fail loudly when no mlir-distro wheel existed for the target commit. This avoided creating broken PRs but left a manual gap: someone had to dispatch mlirDistro by hand to publish a wheel before the next scheduled update-llvm run could succeed. In practice, that gap meant LLVM bumps stalled indefinitely.

Close the loop by having update-llvm orchestrate the build:

- update_llvm_version.py learns --identify-only, which runs the existing Triton/Torch detection and writes target_commit, wheel_exists, and bump_reason to $GITHUB_OUTPUT without modifying any files.
- update-llvm.yml runs in three phases:
  1. Identify the target commit and check wheel availability.
  2. If no wheel exists, dispatch mlirDistro with LLVM_COMMIT=$target and use gh run watch to follow the build to completion.
  3. Apply the update with --llvm-hash $target.

The job timeout-minutes is raised to 180 to cover the ~1h mlirDistro build, and actions:write is added to the permissions block so the scheduled workflow can dispatch mlirDistro.

The orchestration step is intentionally synchronous (gh run watch blocks): the schedule is biweekly, so the runner-hour cost is small compared to the operational complexity of an async workflow_run chain.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

The previous fix added an Authorization header and a curl -f flag, but -f discards the response body. If the API ever fails again, the CI log would show only "curl: (22) The requested URL returned error: 403" with no indication whether it was a rate limit, expired token, GitHub outage, or something else.

Switch to --fail-with-body so the body is captured even on HTTP error, and dump it to the CI log on failure. The next failure will show the actual GitHub response, removing the inference step from diagnosis.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
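The same idea, keeping the response body when the request fails, can be illustrated in Python, where `urllib.error.HTTPError` exposes the error body through `read()`. This is an analogue of the curl change, not the workflow's code; `fetch_body` and its `opener` parameter (injected so the function can be exercised without network access) are hypothetical.

```python
import urllib.error
import urllib.request


def fetch_body(url: str, token: str, opener=urllib.request.urlopen) -> str:
    """GET `url` with a bearer token and return the response body as text.

    On an HTTP error, raise RuntimeError carrying both the status code and
    the server's response body, so the log shows why the request failed.
    """
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with opener(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The body often distinguishes a rate limit from a bad token.
        detail = err.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"HTTP {err.code} from {url}: {detail}") from err
```

A 403 surfaced this way includes the server's own message, which is the information that curl -f throws away.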

hunhoffe and others added 4 commits May 6, 2026 14:44

Merge branch 'main' into fix-llvm-update-and-mlir-distro

7ecd10b

hunhoffe added this to the IRON 1.3.2 milestone May 6, 2026

hunhoffe mentioned this pull request May 7, 2026

Fix llvm update and mlir distro #3054

Draft

hunhoffe closed this May 7, 2026

hunhoffe wants to merge 4 commits into Xilinx:main from hunhoffe:fix-llvm-update-and-mlir-distro
