ci: install torchvision alongside torch in transformers backend workflow #3002
Merged
tianyu-l merged 2 commits into pytorch:main on Apr 18, 2026
The ROCm matrix entry of integration_test_8gpu_transformers_modeling_backend
reinstalls torch from the ROCm nightly index but leaves the docker image's
stale torchvision (built against an older torch ABI). When transformers
eagerly imports torchvision via image_utils.py during module init, the
registration register_fake("torchvision::nms") fails because the op
doesn't exist in the freshly-installed torch nightly, crashing the test
at import time.
Reinstalling torchvision from the same nightly index keeps both wheels
ABI-compatible.
Reproduced locally on MI325X with rocm/pytorch:latest:
pip install --force-reinstall --pre torch \
--index-url https://download.pytorch.org/whl/nightly/rocm7.1
python -c "import torchvision"
# RuntimeError: operator torchvision::nms does not exist
Fix verified locally:
pip install --force-reinstall --pre torch torchvision \
--index-url https://download.pytorch.org/whl/nightly/rocm7.1
python -c "import torchvision" # OK
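The `python -c "import torchvision"` check above could also be written as a small smoke test that runs right after the install step. This is only a sketch: the helper name `try_import` is hypothetical and is not part of torchtitan or the workflow, and it relies solely on the standard library's `importlib`.

```python
import importlib
from typing import Tuple


def try_import(module_name: str) -> Tuple[bool, str]:
    """Attempt to import a module and report whether it worked.

    Hypothetical helper for illustration; not part of torchtitan.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        # Catch broadly on purpose: the failure described here surfaces as a
        # RuntimeError raised during op registration, not as an ImportError.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "OK"
```

A CI step might call `try_import("torchvision")` and fail the job when the first element of the result is `False`, printing the second element as the reason.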
Contributor
Author
Also seen on #2982 (same traceback, same job): https://github.com/pytorch/torchtitan/actions/runs/24466095345/job/71493479334
tianyu-l approved these changes Apr 17, 2026
  fi
  python -m pip install --force-reinstall --pre \
-   "${TORCH_SPEC}" --index-url ${{ matrix.index-url }}
+   "${TORCH_SPEC}" torchvision --index-url ${{ matrix.index-url }}
Contributor
Sounds reasonable to me. Could you add a NOTE comment nearby, so that people who read / maintain the code can get context? Thanks!
Contributor
Author
Thanks @tianyu-l - I added the note for context. Let me know if there are any other issues!
Address review feedback: explain why torchvision is reinstalled together with torch so future maintainers don't have to spelunk git blame.
rishisinhanj added a commit to rishisinhanj/torchtitan that referenced this pull request Apr 22, 2026
Force-reinstalling only torch from the nightly index leaves the pre-installed torchvision (built against an older nightly) in place. Pip's resolver doesn't catch the resulting ABI break (pip check passes), but any code path that does `import torchvision` fails:

    RuntimeError: operator torchvision::nms does not exist

Same one-line fix that landed in pytorch#3002 for the transformers modeling backend workflow.

Narrowing this PR to the vlm workflow only: the other ROCm 8-GPU workflows (features/torchft/autoparallel) are failing for unrelated reasons, and a torchvision pin there would be speculative.

Verified locally on MI325X / rocm/pytorch:latest against the nightly/rocm7.1 index: bug reproduced before fix, import succeeds after.
tianyu-l pushed a commit that referenced this pull request Apr 22, 2026
## What

Add `torchvision` to the `pip install --force-reinstall --pre torch ...` line in the VLM 8-GPU integration test workflow:

- `integration_test_8gpu_vlm.yaml`

Same one-line fix that landed in #3002 for `integration_test_8gpu_transformers_modeling_backend.yaml`.

## Why

Force-reinstalling only `torch` from the nightly index leaves the pre-installed `torchvision` (built against an older nightly) in place. Pip's resolver doesn't catch the resulting ABI break (`pip check` passes), but any code path that does `import torchvision` fails:

```
RuntimeError: operator torchvision::nms does not exist
File ".../torchvision/_meta_registrations.py", line 163, in <module>
@torch.library.register_fake("torchvision::nms")
```

The VLM workflow's test entry point pulls `torchvision` transitively (image-utils path through `transformers`), so the ABI break is fatal at import time.

## Scope note

An earlier revision of this PR also patched `features.yaml`, `torchft.yaml`, and `autoparallel.yaml`. Removed: those workflows are red on ROCm for an unrelated reason (DCP planner `EOFError` in `gather_object`, see #3051), and a torchvision pin there would be speculative. Will file separately if/when verified.

## Repro / verification (local, MI325X / `rocm/pytorch:latest`)

| Step | Result |
|---|---|
| Force-reinstall `torch` only from `nightly/rocm7.1` | torch upgraded to 2.13.0.dev, torchvision left at 0.25.0 |
| `import torchvision` | `RuntimeError: operator torchvision::nms does not exist` |
| `pip check` | "No broken requirements" (silent ABI break) |
| Apply fix (install `torch torchvision` together) | torch 2.13.0.dev + torchvision 0.27.0.dev |
| `import torchvision` | succeeds |

The exact same shape was previously confirmed and merged for `transformers_modeling_backend.yaml` in #3002.

## Note comment

Per review feedback on #3002, the touched file gets a 2-line `# NOTE:` explaining why `torchvision` is on the install line, so future maintainers don't strip it.
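Because `pip check` reports no problem in this situation, a workflow could add its own coarse guard. The sketch below is an illustrative heuristic, not something either PR implements: it flags the case where exactly one of the two packages is a `.dev` nightly build, which matches the failing combination reported here (torch 2.13.0.dev with torchvision 0.25.0). The function names are hypothetical, and the check would not catch two nightlies built on different dates.

```python
from importlib import metadata
from typing import Optional


def is_nightly(version: str) -> bool:
    """Return True for nightly-style versions that contain a '.dev' segment."""
    return ".dev" in version


def looks_mismatched(torch_version: str, torchvision_version: str) -> bool:
    """Heuristic: exactly one of the two packages is a nightly build."""
    return is_nightly(torch_version) != is_nightly(torchvision_version)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

In a CI step, the two `installed_version(...)` results for `torch` and `torchvision` could be passed to `looks_mismatched` before any test imports run. Actually importing `torchvision` remains the more reliable check, since version strings alone cannot prove ABI compatibility.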
Summary
The ROCm path of `integration_test_8gpu_transformers_modeling_backend.yaml` reinstalls `torch` from the ROCm nightly index but doesn't reinstall `torchvision`, leaving the docker image's stale `torchvision==0.26.0` (built against `torch==2.11.0`). pip's resolver flags this explicitly.

When `transformers.modeling_utils` eagerly imports torchvision (via `loss_for_object_detection.py` → `image_transforms.py` → `image_utils.py`), the registration `@torch.library.register_fake("torchvision::nms")` fails because the op doesn't exist in the freshly-installed torch nightly (`RuntimeError: operator torchvision::nms does not exist`). This crashes the test at import time, before any GPU code runs.

Reproduction
Failing scheduled run on `main`: https://github.com/pytorch/torchtitan/actions/runs/24270213946/job/70873527337

Reproduced locally on MI325X with `rocm/pytorch:latest`. After the fix, `import torchvision` succeeds.

Test plan
- `build-test (rocm, linux.rocm.gpu.gfx942.8, ...)` job goes green on this PR
- `build-test (cuda, ...)` job stays green (no regression)

Note on a separate latent issue
While reproducing, I also hit an unrelated `TypeError: HFTransformerModel.Config field 'transformers_version' must be keyword-only` at `model.py:83` with `transformers >= 5.5.x` (inherited from `PretrainedConfig`). That's downstream of this fix: CI's docker image currently has an older `transformers`, so it doesn't trigger yet. Worth filing separately once this lands and the ROCm path actually gets that far.