Fix llvm update and mlir distro #3052
Closed
hunhoffe wants to merge 4 commits into Xilinx:main from
Two independent failures combined to create PR Xilinx#3018, an LLVM update PR that could never pass CI:

1. mlirDistro.yml fetched llvm-project HEAD with an unauthenticated curl. When the GitHub API rate-limited the runner IP, jq parsed the error response, returned the literal string "null", and the workflow wrote LLVM_PROJECT_COMMIT=null to its output without complaining. Every downstream wheel build then tried to download zip/null and failed, but the get-commit job exited 0, so wheels silently stopped publishing for ~6 weeks while the schedule reported "success".

2. update_llvm_version.py picked the newest Triton/Torch-MLIR LLVM commit, looked it up in mlir-distro, and on cache miss fabricated a DATETIME from the commit timestamp and proceeded anyway. The resulting PR pointed at a wheel version that never existed.

Fixes:

- mlirDistro.yml: add Authorization header to the curl, fail fast on HTTP error, and reject empty/null LLVM_PROJECT_COMMIT values before passing them downstream.
- update_llvm_version.py: when no mlir-distro wheel exists for the target commit, exit non-zero instead of fabricating one. A failed scheduled run will surface that mlir-distro is broken or lagging rather than producing another unmergeable PR.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
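The null guard from this commit message can be sketched in Python. This is an illustration of the check only, not the workflow's actual shell step: the function name extract_commit_sha and the 40-hex-character test are assumptions, and the sketch assumes the step reads the top-level sha field of the GitHub commits API response.

```python
import json
import re

# Assumption: a usable commit is a full 40-character lowercase hex SHA.
SHA_RE = re.compile(r"[0-9a-f]{40}")


def extract_commit_sha(response_body: str) -> str:
    """Return the commit SHA from an API response body, or raise ValueError."""
    payload = json.loads(response_body)
    # A rate-limit error body is a JSON object with "message" but no "sha",
    # so .get() yields None here (jq -r would print the string "null").
    sha = payload.get("sha") if isinstance(payload, dict) else None
    if not isinstance(sha, str) or not SHA_RE.fullmatch(sha):
        raise ValueError(
            f"no usable commit SHA in response: {response_body[:200]!r}"
        )
    return sha
```

Raising instead of returning a placeholder means the calling step fails at the point of the bad response, rather than handing "null" to the wheel builds.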
The previous commit made update_llvm_version.py fail loudly when no
mlir-distro wheel existed for the target commit. This avoided creating
broken PRs but left a manual gap: someone had to dispatch mlirDistro by
hand to publish a wheel before the next scheduled update-llvm run could
succeed. In practice, that gap meant LLVM bumps stalled indefinitely.

Close the loop by having update-llvm orchestrate the build:

- update_llvm_version.py learns --identify-only, which runs the existing
  Triton/Torch detection and writes target_commit, wheel_exists, and
  bump_reason to $GITHUB_OUTPUT without modifying any files.
- update-llvm.yml runs in three phases:
  1. Identify the target commit and check wheel availability.
  2. If no wheel exists, dispatch mlirDistro with LLVM_COMMIT=$target
     and gh run watch the build to completion.
  3. Apply the update with --llvm-hash $target.

The job timeout-minutes is raised to 180 to cover the ~1h mlirDistro
build, and actions:write is added to the permissions block so the
scheduled workflow can dispatch mlirDistro.

The orchestration step is intentionally synchronous (gh run watch
blocks): the schedule is biweekly, so the runner-hour cost is small
compared to the operational complexity of an async workflow_run chain.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
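A minimal sketch of how an --identify-only mode might publish its three outputs, assuming the standard key=value line format of the $GITHUB_OUTPUT file. The helper names write_github_outputs and identify_only, and the lowercase true/false encoding of wheel_exists, are illustrative assumptions rather than the script's actual code.

```python
import os
from typing import Dict, Optional


def write_github_outputs(outputs: Dict[str, str], path: Optional[str] = None) -> None:
    """Append key=value lines to the file named by $GITHUB_OUTPUT."""
    path = path or os.environ["GITHUB_OUTPUT"]
    with open(path, "a", encoding="utf-8") as fh:
        for key, value in outputs.items():
            fh.write(f"{key}={value}\n")


def identify_only(target_commit: str, wheel_exists: bool, bump_reason: str,
                  path: Optional[str] = None) -> None:
    """Report what would be bumped without modifying any repository files."""
    write_github_outputs(
        {
            "target_commit": target_commit,
            # Assumption: later workflow steps compare against "true"/"false".
            "wheel_exists": str(wheel_exists).lower(),
            "bump_reason": bump_reason,
        },
        path,
    )
```

Later steps in the same job could then read these values through the step's outputs to decide whether a wheel build has to be dispatched first.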
The previous fix added an Authorization header and a curl -f flag, but -f discards the response body. If the API ever fails again, the CI log would show only "curl: (22) The requested URL returned error: 403", with no indication whether it was a rate limit, expired token, GitHub outage, or something else.

Switch to --fail-with-body so the body is captured even on HTTP error, and dump it to the CI log on failure. The next failure will show the actual GitHub response, removing the inference step from diagnosis.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
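The same idea, failing on an HTTP error while still surfacing the response body, can be sketched with Python's standard library. This is an analogue of curl --fail-with-body, not the workflow's code; the function name fetch_or_report and the injectable opener parameter are assumptions added for illustration and offline testing.

```python
import urllib.error
import urllib.request


def fetch_or_report(url: str, token: str, opener=urllib.request.urlopen) -> bytes:
    """Fetch url with a bearer token; on HTTP error, log the body and re-raise."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The error object still carries the server's response body, which
        # distinguishes a rate limit from an expired token or an outage.
        body = err.read().decode("utf-8", errors="replace")
        print(f"HTTP {err.code} from {url}; response body:\n{body}")
        raise
```

Re-raising after logging keeps the step failing, so the diagnostic output is added without weakening the fail-fast behaviour.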
Summary
Fixes two compounding bugs that caused #3018, an auto-generated LLVM-bump PR that could never pass CI.
Bug 1: mlirDistro.yml silently produced LLVM_PROJECT_COMMIT=null
LLVM_PROJECT_COMMIT=null written to outputs after curl rate-limited.
The "Get latest LLVM commit" step used an unauthenticated curl against api.github.com. When the per-IP rate limit hit, jq parsed the error body, returned the literal string "null", and the step exited 0, passing null downstream so every wheel build tried to fetch zip/null and failed. At least, this is my theory. I've added additional logging in case I am wrong about where the "null" comes from.

Fix: add an Authorization: Bearer $GH_TOKEN header, fail fast on HTTP error, and reject empty/null commit values before writing them to $GITHUB_OUTPUT.

Bug 2: update_llvm_version.py fabricated a wheel version
When no mlir-distro wheel existed for the chosen Triton/Torch-MLIR commit, the script invented a DATETIME from the LLVM commit timestamp and proceeded anyway, producing a PR that referenced a wheel version that had never been built.

Fix: exit non-zero when no wheel exists, surfacing the gap rather than hiding it behind a fabricated value.
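The fail-loud lookup for Bug 2 might look roughly like the sketch below. The function name wheel_version_for and the published mapping (commit hash to wheel version) are hypothetical stand-ins for however the script queries mlir-distro.

```python
import sys
from typing import Mapping


def wheel_version_for(commit: str, published: Mapping[str, str]) -> str:
    """Return the published wheel version for commit, or exit non-zero."""
    version = published.get(commit)
    if version is None:
        # No fallback: a version derived from the commit timestamp would
        # name a wheel that was never built.
        sys.exit(f"error: no mlir-distro wheel published for commit {commit}")
    return version
```

Passing a string to sys.exit prints it to stderr and exits with status 1, which is enough for the scheduled run to show up as failed.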
Closing the loop
A fail-loud script alone leaves a manual gap: someone has to dispatch mlirDistro.yml by hand to publish a wheel before the next scheduled update-llvm run can succeed. To close the loop, update_llvm_version.py learns --identify-only, and update-llvm.yml runs in three phases:

1. Identify the target commit and check wheel availability.
2. If no wheel exists, dispatch mlirDistro.yml with LLVM_COMMIT=$target and gh run watch it to completion.
3. Apply the update with --llvm-hash $target.

The orchestration is intentionally synchronous; the schedule is biweekly, so the runner-hour cost of blocking is small relative to the operational complexity of an async workflow_run chain.

Test plan
- mlirDistro.yml is exercised by the pull_request: paths: ['.github/workflows/mlirDistro.yml'] trigger: verify the "Get latest LLVM commit" step succeeds with auth header and that null/empty values are rejected.
- Dispatch update-llvm.yml (workflow_dispatch) on the branch and confirm:
  - The --identify-only step writes target_commit, wheel_exists, bump_reason to $GITHUB_OUTPUT.
  - When wheel_exists == false, mlirDistro is dispatched and watched through to success.
  - The apply step runs --llvm-hash with the captured target.
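The branching that the test plan checks can be sketched as a pure function that returns the commands for the later phases. Everything here is illustrative: plan_phases is a hypothetical name, the script path is assumed to be the bare filename, and "<run-id>" is a placeholder, since a real step would first have to look up the ID of the run it just dispatched.

```python
from typing import List


def plan_phases(target_commit: str, wheel_exists: bool) -> List[List[str]]:
    """Return the commands to run after the identify phase."""
    commands: List[List[str]] = []
    if not wheel_exists:
        # Build the missing wheel, then block until that run finishes.
        commands.append(
            ["gh", "workflow", "run", "mlirDistro.yml",
             "-f", f"LLVM_COMMIT={target_commit}"]
        )
        commands.append(["gh", "run", "watch", "<run-id>", "--exit-status"])
    # Apply the bump only once a wheel is known to exist.
    commands.append(
        ["python", "update_llvm_version.py", "--llvm-hash", target_commit]
    )
    return commands
```

Keeping the decision separate from command execution makes both cases (wheel present, wheel missing) checkable without dispatching a real workflow.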