
ci: install torchvision alongside torch in transformers backend workflow#3002

Merged
tianyu-l merged 2 commits into pytorch:main from rishisinhanj:fix-rocm-torchvision-install
Apr 18, 2026

Conversation

@rishisinhanj
Contributor

Summary

The ROCm path of integration_test_8gpu_transformers_modeling_backend.yaml reinstalls torch from the ROCm nightly index but doesn't reinstall torchvision, leaving the docker image's stale torchvision==0.26.0 (built against torch==2.11.0). pip's resolver flags this explicitly:

torchvision 0.26.0 requires torch==2.11.0,
  but you have torch 2.12.0.dev20260411+rocm7.1 which is incompatible.

When transformers.modeling_utils eagerly imports torchvision (via loss_for_object_detection.py → image_transforms.py → image_utils.py), the registration @torch.library.register_fake("torchvision::nms") fails because the op doesn't exist in the freshly-installed torch nightly:

RuntimeError: operator torchvision::nms does not exist

This crashes the test at import time, before any GPU code runs.
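The incompatibility pip reports comes down to torchvision's exact `torch==X.Y.Z` metadata pin versus the freshly installed nightly. A minimal stdlib-only sketch of that comparison (the helper name is hypothetical, not part of pip or the workflow):

```python
import re

def torch_pin_satisfied(requirement: str, installed: str) -> bool:
    """Return True if an exact ``torch==X.Y.Z`` pin matches the installed
    torch version, ignoring pre-release/local suffixes like ``+rocm7.1``."""
    m = re.fullmatch(r"torch==([0-9][\w.]*)", requirement.strip())
    if m is None:
        raise ValueError(f"unsupported requirement: {requirement!r}")
    pinned = m.group(1)
    # Keep only the leading release segment:
    # "2.12.0.dev20260411+rocm7.1" -> "2.12.0"
    base = re.match(r"[0-9]+(\.[0-9]+)*", installed).group(0)
    return base == pinned

# The situation from the CI log:
torch_pin_satisfied("torch==2.11.0", "2.12.0.dev20260411+rocm7.1")  # -> False
torch_pin_satisfied("torch==2.11.0", "2.11.0")                      # -> True
```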

Reproduction

Failing scheduled run on main: https://github.com/pytorch/torchtitan/actions/runs/24270213946/job/70873527337

Reproduced locally on MI325X with rocm/pytorch:latest:

pip install --force-reinstall --pre torch \
  --index-url https://download.pytorch.org/whl/nightly/rocm7.1
python -c "import torchvision"
# RuntimeError: operator torchvision::nms does not exist

After the fix:

pip install --force-reinstall --pre torch torchvision \
  --index-url https://download.pytorch.org/whl/nightly/rocm7.1
python -c "import torchvision; print(torchvision.__version__)"
# OK

Test plan

  • CI's ROCm build-test (rocm, linux.rocm.gpu.gfx942.8, ...) job goes green on this PR
  • CUDA build-test (cuda, ...) job stays green (no regression)

Note on a separate latent issue

While reproducing, I also hit an unrelated TypeError: HFTransformerModel.Config field 'transformers_version' must be keyword-only at model.py:83 with transformers >= 5.5.x (inherited from PretrainedConfig). That's downstream of this fix — CI's docker image currently has an older transformers so it doesn't trigger yet. Worth filing separately once this lands and the ROCm path actually gets that far.

The ROCm matrix entry of integration_test_8gpu_transformers_modeling_backend
reinstalls torch from the ROCm nightly index but leaves the docker image's
stale torchvision (built against an older torch ABI). When transformers
eagerly imports torchvision via image_utils.py during module init, the
registration register_fake("torchvision::nms") fails because the op
doesn't exist in the freshly-installed torch nightly, crashing the test
at import time.

Reinstalling torchvision from the same nightly index keeps both wheels
ABI-compatible.

Reproduced locally on MI325X with rocm/pytorch:latest:
  pip install --force-reinstall --pre torch \
    --index-url https://download.pytorch.org/whl/nightly/rocm7.1
  python -c "import torchvision"
  # RuntimeError: operator torchvision::nms does not exist

Fix verified locally:
  pip install --force-reinstall --pre torch torchvision \
    --index-url https://download.pytorch.org/whl/nightly/rocm7.1
  python -c "import torchvision"  # OK
meta-cla Bot added the CLA Signed label Apr 16, 2026
@pytorch-bot

pytorch-bot Bot commented Apr 16, 2026

Workflows were awaiting approval. CI has now been triggered for the ciflow labels on this PR.

@rishisinhanj
Contributor Author

Also seen on #2982 — same traceback, same job: https://github.com/pytorch/torchtitan/actions/runs/24466095345/job/71493479334

```diff
 fi
 python -m pip install --force-reinstall --pre \
-  "${TORCH_SPEC}" --index-url ${{ matrix.index-url }}
+  "${TORCH_SPEC}" torchvision --index-url ${{ matrix.index-url }}
```
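Applied in the workflow, the merged step together with the NOTE comment the reviewer asked for might look roughly like this (the step name and surrounding structure are assumptions; only the install line and the rationale come from the PR):

```yaml
# NOTE: torchvision must be (re)installed together with torch from the same
# nightly index; otherwise the docker image's stale torchvision wheel breaks
# `import torchvision` (RuntimeError: operator torchvision::nms does not exist).
- name: Install torch nightly
  run: |
    python -m pip install --force-reinstall --pre \
      "${TORCH_SPEC}" torchvision --index-url ${{ matrix.index-url }}
```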
Contributor


Sounds reasonable to me. Could you add a NOTE comment nearby, so that people who read / maintain the code can get context? Thanks!

Contributor Author

@rishisinhanj rishisinhanj Apr 17, 2026


Thanks @tianyu-l - I added the note for context. Let me know if there are any other issues!

Address review feedback: explain why torchvision is reinstalled together
with torch so future maintainers don't have to spelunk git blame.
@tianyu-l tianyu-l merged commit ab87f58 into pytorch:main Apr 18, 2026
15 of 19 checks passed
rishisinhanj added a commit to rishisinhanj/torchtitan that referenced this pull request Apr 22, 2026
Force-reinstalling only torch from the nightly index leaves the
pre-installed torchvision (built against an older nightly) in place.
Pip's resolver doesn't catch the resulting ABI break (pip check passes),
but any code path that does `import torchvision` fails:

  RuntimeError: operator torchvision::nms does not exist

Same one-line fix that landed in pytorch#3002 for the transformers modeling
backend workflow. Narrowing this PR to the vlm workflow only — the
other ROCm 8-GPU workflows (features/torchft/autoparallel) are
failing for unrelated reasons and a torchvision pin there would be
speculative.

Verified locally on MI325X / rocm/pytorch:latest against the
nightly/rocm7.1 index: bug reproduced before fix, import succeeds
after.
tianyu-l pushed a commit that referenced this pull request Apr 22, 2026
## What

Add `torchvision` to the `pip install --force-reinstall --pre torch ...`
line in the VLM 8-GPU integration test workflow:

- `integration_test_8gpu_vlm.yaml`

Same one-line fix that landed in #3002 for
`integration_test_8gpu_transformers_modeling_backend.yaml`.

## Why

Force-reinstalling only `torch` from the nightly index leaves the
pre-installed `torchvision` (built against an older nightly) in place.
Pip's resolver doesn't catch the resulting ABI break (`pip check`
passes), but any code path that does `import torchvision` fails:

```
RuntimeError: operator torchvision::nms does not exist
  File ".../torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
```

The VLM workflow's test entry point pulls `torchvision` transitively
(image-utils path through `transformers`), so the ABI break is fatal at
import time.
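Since `pip check` stays green here, the only signal is the RuntimeError at import time. A hedged sketch of an import wrapper that turns that opaque failure into an actionable message (a hypothetical helper, not something this PR adds):

```python
def import_with_abi_hint(importer):
    """Run an import callable (e.g. ``lambda: __import__("torchvision")``)
    and re-raise the torchvision::nms failure with a clearer diagnosis."""
    try:
        return importer()
    except RuntimeError as e:
        if "torchvision::nms" in str(e) and "does not exist" in str(e):
            raise RuntimeError(
                "torch/torchvision ABI mismatch: reinstall both wheels "
                "from the same (nightly) index"
            ) from e
        raise
```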

## Scope note

An earlier revision of this PR also patched `features.yaml`,
`torchft.yaml`, and `autoparallel.yaml`. Removed — those workflows are
red on ROCm for an unrelated reason (DCP planner `EOFError` in
`gather_object`, see #3051), and a torchvision pin there would be
speculative. Will file separately if/when verified.

## Repro / verification (local, MI325X / `rocm/pytorch:latest`)

| Step | Result |
|---|---|
| Force-reinstall `torch` only from `nightly/rocm7.1` | torch upgraded to 2.13.0.dev, torchvision left at 0.25.0 |
| `import torchvision` | `RuntimeError: operator torchvision::nms does not exist` |
| `pip check` | "No broken requirements" — silent ABI break |
| Apply fix (install `torch torchvision` together) | torch 2.13.0.dev + torchvision 0.27.0.dev |
| `import torchvision` | succeeds |

The exact same shape was previously confirmed and merged for
`transformers_modeling_backend.yaml` in #3002.

## Note comment

Per review feedback on #3002, the touched file gets a 2-line `# NOTE:`
explaining why `torchvision` is on the install line, so future
maintainers don't strip it.

Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.
