Fix llvm update and mlir distro #3052

Closed
hunhoffe wants to merge 4 commits into Xilinx:main from hunhoffe:fix-llvm-update-and-mlir-distro

hunhoffe (Collaborator) commented May 6, 2026

Summary

Fixes two compounding bugs that caused #3018, an auto-generated LLVM-bump PR that could never pass CI.

Bug 1: mlirDistro.yml silently produced LLVM_PROJECT_COMMIT=null

The "Get latest LLVM commit" step used an unauthenticated curl against api.github.com. When the per-IP rate limit was hit, jq parsed the error body, found no commit field, and printed the literal string "null". The step still exited 0, so null was passed downstream and every wheel build tried to fetch zip/null and failed. At least, this is my theory; I've added extra logging in case I am wrong about where the "null" comes from.

Fix: add an Authorization: Bearer $GH_TOKEN header, fail fast on HTTP errors, and reject empty or null commit values before writing them to $GITHUB_OUTPUT.
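The failure mode and the guard can be illustrated in Python. This is a minimal sketch of the logic only, not the workflow's actual shell step: `extract_commit_sha` and `validate_commit` are hypothetical names, the sample error body is an assumed shape for an API error response with no `sha` field, and the 40-hex-character check is an extra assumption beyond the empty/null rejection described above.

```python
import json
import re

# A full Git commit SHA-1 is 40 lowercase hex characters.
_SHA_RE = re.compile(r"[0-9a-f]{40}")


def extract_commit_sha(body: str) -> str:
    """Mimic `jq -r .sha`: a missing key is printed as the string "null"."""
    data = json.loads(body)
    sha = data.get("sha") if isinstance(data, dict) else None
    return "null" if sha is None else str(sha)


def validate_commit(value: str) -> str:
    """Return the commit, or raise if it is empty, "null", or malformed."""
    value = value.strip()
    if not value or value == "null":
        raise ValueError(f"refusing to use commit value {value!r}")
    if not _SHA_RE.fullmatch(value):
        raise ValueError(f"not a full commit SHA: {value!r}")
    return value


# Illustrative error body (assumed shape): there is no "sha" key.
error_body = '{"message": "API rate limit exceeded"}'
unguarded = extract_commit_sha(error_body)  # the value that used to flow downstream
```

Without the guard, `unguarded` is the string "null" and looks like a successful result. With `validate_commit` in the path, the step fails at the point where the bad value is produced.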

Bug 2: update_llvm_version.py fabricated a wheel version

When no mlir-distro wheel existed for the chosen Triton/Torch-MLIR commit, the script invented a DATETIME from the LLVM commit timestamp and proceeded anyway, producing a PR that referenced a wheel version that had never been built.

Fix: exit non-zero when no wheel exists, surfacing the gap rather than hiding it behind a fabricated value.
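A minimal sketch of the fail-loud behavior, assuming the wheel lookup has already produced a mapping from LLVM commit to wheel version. `require_wheel` and `available_wheels` are hypothetical names for illustration, not the script's actual internals.

```python
import sys


def require_wheel(target_commit: str, available_wheels: dict[str, str]) -> str:
    """Return the wheel version for target_commit, or exit non-zero if none exists."""
    version = available_wheels.get(target_commit)
    if version is None:
        # Surface the gap instead of inventing a version string.
        print(
            f"error: no mlir-distro wheel found for LLVM commit {target_commit}",
            file=sys.stderr,
        )
        sys.exit(1)
    return version
```

The point of the design is that there is no fallback branch: a missing wheel ends the run with a non-zero status, so a scheduled run fails visibly rather than opening a PR that cannot pass CI.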

Closing the loop

A fail-loud script alone leaves a manual gap: someone has to dispatch mlirDistro.yml by hand to publish a wheel before the next scheduled update-llvm run can succeed. To close the loop, update_llvm_version.py learns --identify-only, and update-llvm.yml runs in three phases:

  1. Identify the target commit and check wheel availability.
  2. If no wheel exists, dispatch mlirDistro.yml with LLVM_COMMIT=$target and block on it with gh run watch until it completes.
  3. Apply the update with --llvm-hash $target.

The orchestration is intentionally synchronous; the schedule is biweekly, so the runner-hour cost of blocking is small relative to the operational complexity of an async workflow_run chain.
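The apply and build phases can be sketched as a function that returns the commands to run, in order, once the identify phase has reported its result. This is an illustration under assumptions: `plan_update` is a hypothetical name, the bare `update_llvm_version.py` path is a placeholder because the script's location is not given here, and `<run-id>` stands for the ID of the dispatched run, which has to be looked up after dispatch.

```python
def plan_update(target_commit: str, wheel_exists: bool) -> list[list[str]]:
    """Return the ordered commands for the build (if needed) and apply phases."""
    steps: list[list[str]] = []
    if not wheel_exists:
        # Phase 2: build the missing wheel and block until it finishes.
        steps.append(
            ["gh", "workflow", "run", "mlirDistro.yml",
             "-f", f"LLVM_COMMIT={target_commit}"]
        )
        # "<run-id>" is a placeholder; --exit-status makes the watch fail
        # if the watched run fails.
        steps.append(["gh", "run", "watch", "<run-id>", "--exit-status"])
    # Phase 3: apply the update against the commit chosen in phase 1.
    steps.append(["python", "update_llvm_version.py", "--llvm-hash", target_commit])
    return steps
```

Because the plan is a flat, ordered list, the synchronous design is easy to see: the apply step is always last and is only reached after the watch step has returned.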

Test plan

  • On a branch with this change, mlirDistro.yml is exercised by the pull_request: paths: ['.github/workflows/mlirDistro.yml'] trigger. Verify that the "Get latest LLVM commit" step succeeds with the auth header and that null/empty values are rejected.
  • Manually dispatch update-llvm.yml (workflow_dispatch) on the branch and confirm:
      • The --identify-only step writes target_commit, wheel_exists, and bump_reason to $GITHUB_OUTPUT.
      • When wheel_exists == false, mlirDistro is dispatched and watched through to success.
      • The apply step uses --llvm-hash with the captured target.
      • The resulting PR points at a wheel that actually exists in mlir-distro.
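For reference when checking the --identify-only outputs, a step sets outputs by appending `key=value` lines to the file named by the `GITHUB_OUTPUT` environment variable. A minimal sketch of such a writer follows; `write_outputs` is a hypothetical helper name, and the script's real implementation may differ.

```python
import os


def write_outputs(outputs: dict[str, str]) -> None:
    """Append key=value lines to the file named by $GITHUB_OUTPUT."""
    path = os.environ["GITHUB_OUTPUT"]
    with open(path, "a", encoding="utf-8") as fh:
        for key, value in outputs.items():
            # Single-line values only; multiline values need the delimiter syntax.
            if "\n" in value:
                raise ValueError(f"multiline value not supported for {key!r}")
            fh.write(f"{key}={value}\n")
```

A later step can then branch on the `wheel_exists` output to decide whether mlirDistro needs to be dispatched.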

hunhoffe and others added 4 commits May 6, 2026 14:44
Two independent failures combined to create PR Xilinx#3018, an LLVM update PR
that could never pass CI:

1. mlirDistro.yml fetched llvm-project HEAD with an unauthenticated
   curl. When the GitHub API rate-limited the runner IP, jq parsed the
   error response, returned the literal string "null", and the workflow
   wrote LLVM_PROJECT_COMMIT=null to its output without complaining.
   Every downstream wheel build then tried to download zip/null and
   failed, but the get-commit job exited 0, so wheels silently stopped
   publishing for ~6 weeks while the schedule reported "success".

2. update_llvm_version.py picked the newest Triton/Torch-MLIR LLVM
   commit, looked it up in mlir-distro, and on cache miss fabricated a
   DATETIME from the commit timestamp and proceeded anyway. The
   resulting PR pointed at a wheel version that never existed.

Fixes:

- mlirDistro.yml: add Authorization header to the curl, fail-fast on
  HTTP error, and reject empty/null LLVM_PROJECT_COMMIT values before
  passing them downstream.

- update_llvm_version.py: when no mlir-distro wheel exists for the
  target commit, exit non-zero instead of fabricating one. A failed
  scheduled run will surface that mlir-distro is broken or lagging
  rather than producing another unmergeable PR.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

The previous commit made update_llvm_version.py fail loudly when no
mlir-distro wheel existed for the target commit. This avoided creating
broken PRs but left a manual gap: someone had to dispatch mlirDistro by
hand to publish a wheel before the next scheduled update-llvm run could
succeed. In practice, that gap meant LLVM bumps stalled indefinitely.

Close the loop by having update-llvm orchestrate the build:

- update_llvm_version.py learns --identify-only, which runs the existing
  Triton/Torch detection and writes target_commit, wheel_exists, and
  bump_reason to $GITHUB_OUTPUT without modifying any files.

- update-llvm.yml runs in three phases:
    1. Identify the target commit and check wheel availability.
    2. If no wheel exists, dispatch mlirDistro with LLVM_COMMIT=$target
       and gh-run-watch the build to completion.
    3. Apply the update with --llvm-hash $target.
  The job timeout-minutes is raised to 180 to cover the ~1h mlirDistro
  build, and actions:write is added to the permissions block so the
  scheduled workflow can dispatch mlirDistro.

The orchestration step is intentionally synchronous (gh run watch
blocks): the schedule is biweekly, so the runner-hour cost is small
compared to the operational complexity of an async workflow_run chain.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
The previous fix added an Authorization header and a curl -f flag, but
-f discards the response body. If the API ever fails again, the CI log
would show only "curl: (22) The requested URL returned error: 403" — no
indication whether it was a rate limit, expired token, GitHub outage, or
something else.

Switch to --fail-with-body so the body is captured even on HTTP error,
and dump it to the CI log on failure. The next failure will show the
actual GitHub response, removing the inference step from diagnosis.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
@hunhoffe hunhoffe added this to the IRON 1.3.2 milestone May 6, 2026
@hunhoffe hunhoffe closed this May 7, 2026