
ci: install torchvision alongside torch in transformers backend workflow#3002

Merged
tianyu-l merged 2 commits into pytorch:main from rishisinhanj:fix-rocm-torchvision-install
Apr 18, 2026

Conversation

@rishisinhanj
Contributor

Summary

The ROCm path of integration_test_8gpu_transformers_modeling_backend.yaml reinstalls torch from the ROCm nightly index but doesn't reinstall torchvision, leaving the docker image's stale torchvision==0.26.0 (built against torch==2.11.0). pip's resolver flags this explicitly:

torchvision 0.26.0 requires torch==2.11.0,
  but you have torch 2.12.0.dev20260411+rocm7.1 which is incompatible.

When transformers.modeling_utils eagerly imports torchvision (via loss_for_object_detection.py → image_transforms.py → image_utils.py), the registration @torch.library.register_fake("torchvision::nms") fails because the op doesn't exist in the freshly-installed torch nightly:

RuntimeError: operator torchvision::nms does not exist

This crashes the test at import time, before any GPU code runs.
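The incompatibility pip reports comes down to torchvision's exact `torch==X.Y.Z` metadata pin versus the freshly installed nightly. A minimal stdlib-only sketch of that comparison (the helper name is hypothetical, not part of pip or the workflow):

```python
import re

def torch_pin_satisfied(requirement: str, installed: str) -> bool:
    """Return True if an exact ``torch==X.Y.Z`` pin matches the installed
    torch version, ignoring pre-release/local suffixes like ``+rocm7.1``."""
    m = re.fullmatch(r"torch==([0-9][\w.]*)", requirement.strip())
    if m is None:
        raise ValueError(f"unsupported requirement: {requirement!r}")
    pinned = m.group(1)
    # Keep only the leading release segment:
    # "2.12.0.dev20260411+rocm7.1" -> "2.12.0"
    base = re.match(r"[0-9]+(\.[0-9]+)*", installed).group(0)
    return base == pinned

# The situation from the CI log:
torch_pin_satisfied("torch==2.11.0", "2.12.0.dev20260411+rocm7.1")  # -> False
torch_pin_satisfied("torch==2.11.0", "2.11.0")                      # -> True
```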

Reproduction

Failing scheduled run on main: https://github.com/pytorch/torchtitan/actions/runs/24270213946/job/70873527337

Reproduced locally on MI325X with rocm/pytorch:latest:

pip install --force-reinstall --pre torch \
  --index-url https://download.pytorch.org/whl/nightly/rocm7.1
python -c "import torchvision"
# RuntimeError: operator torchvision::nms does not exist

After the fix:

pip install --force-reinstall --pre torch torchvision \
  --index-url https://download.pytorch.org/whl/nightly/rocm7.1
python -c "import torchvision; print(torchvision.__version__)"
# OK

Test plan

  • CI's ROCm build-test (rocm, linux.rocm.gpu.gfx942.8, ...) job goes green on this PR
  • CUDA build-test (cuda, ...) job stays green (no regression)

Note on a separate latent issue

While reproducing, I also hit an unrelated TypeError: HFTransformerModel.Config field 'transformers_version' must be keyword-only at model.py:83 with transformers >= 5.5.x (inherited from PretrainedConfig). That's downstream of this fix — CI's docker image currently has an older transformers so it doesn't trigger yet. Worth filing separately once this lands and the ROCm path actually gets that far.

The ROCm matrix entry of integration_test_8gpu_transformers_modeling_backend
reinstalls torch from the ROCm nightly index but leaves the docker image's
stale torchvision (built against an older torch ABI). When transformers
eagerly imports torchvision via image_utils.py during module init, the
registration register_fake("torchvision::nms") fails because the op
doesn't exist in the freshly-installed torch nightly, crashing the test
at import time.

Reinstalling torchvision from the same nightly index keeps both wheels
ABI-compatible.

Reproduced locally on MI325X with rocm/pytorch:latest:
  pip install --force-reinstall --pre torch \
    --index-url https://download.pytorch.org/whl/nightly/rocm7.1
  python -c "import torchvision"
  # RuntimeError: operator torchvision::nms does not exist

Fix verified locally:
  pip install --force-reinstall --pre torch torchvision \
    --index-url https://download.pytorch.org/whl/nightly/rocm7.1
  python -c "import torchvision"  # OK
meta-cla Bot added the CLA Signed label Apr 16, 2026
@pytorch-bot

pytorch-bot Bot commented Apr 16, 2026

Workflows were awaiting approval. CI has now been triggered for the ciflow labels on this PR.

@rishisinhanj
Contributor Author

Also seen on #2982 — same traceback, same job: https://github.com/pytorch/torchtitan/actions/runs/24466095345/job/71493479334

```diff
 fi
 python -m pip install --force-reinstall --pre \
-  "${TORCH_SPEC}" --index-url ${{ matrix.index-url }}
+  "${TORCH_SPEC}" torchvision --index-url ${{ matrix.index-url }}
```
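Applied in the workflow, the merged step together with the NOTE comment the reviewer asked for might look roughly like this (the step name and surrounding structure are assumptions; only the install line and the rationale come from the PR):

```yaml
# NOTE: torchvision must be (re)installed together with torch from the same
# nightly index; otherwise the docker image's stale torchvision wheel breaks
# `import torchvision` (RuntimeError: operator torchvision::nms does not exist).
- name: Install torch nightly
  run: |
    python -m pip install --force-reinstall --pre \
      "${TORCH_SPEC}" torchvision --index-url ${{ matrix.index-url }}
```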
Contributor


Sounds reasonable to me. Could you add a NOTE comment nearby, so that people who read / maintain the code can get context? Thanks!

Contributor Author

@rishisinhanj rishisinhanj Apr 17, 2026


Thanks @tianyu-l - I added the note for context. Let me know if there are any other issues!

Address review feedback: explain why torchvision is reinstalled together
with torch so future maintainers don't have to spelunk git blame.
@tianyu-l tianyu-l merged commit ab87f58 into pytorch:main Apr 18, 2026
15 of 19 checks passed
rishisinhanj added a commit to rishisinhanj/torchtitan that referenced this pull request Apr 22, 2026
Force-reinstalling only torch from the nightly index leaves the
pre-installed torchvision (built against an older nightly) in place.
Pip's resolver doesn't catch the resulting ABI break (pip check passes),
but any code path that does `import torchvision` fails:

  RuntimeError: operator torchvision::nms does not exist

Same one-line fix that landed in pytorch#3002 for the transformers modeling
backend workflow. Narrowing this PR to the vlm workflow only — the
other ROCm 8-GPU workflows (features/torchft/autoparallel) are
failing for unrelated reasons and a torchvision pin there would be
speculative.

Verified locally on MI325X / rocm/pytorch:latest against the
nightly/rocm7.1 index: bug reproduced before fix, import succeeds
after.
tianyu-l pushed a commit that referenced this pull request Apr 22, 2026
## What

Add `torchvision` to the `pip install --force-reinstall --pre torch ...`
line in the VLM 8-GPU integration test workflow:

- `integration_test_8gpu_vlm.yaml`

Same one-line fix that landed in #3002 for
`integration_test_8gpu_transformers_modeling_backend.yaml`.

## Why

Force-reinstalling only `torch` from the nightly index leaves the
pre-installed `torchvision` (built against an older nightly) in place.
Pip's resolver doesn't catch the resulting ABI break (`pip check`
passes), but any code path that does `import torchvision` fails:

```
RuntimeError: operator torchvision::nms does not exist
  File ".../torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
```

The VLM workflow's test entry point pulls `torchvision` transitively
(image-utils path through `transformers`), so the ABI break is fatal at
import time.
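Since `pip check` stays green here, the only signal is the RuntimeError at import time. A hedged sketch of an import wrapper that turns that opaque failure into an actionable message (a hypothetical helper, not something this PR adds):

```python
def import_with_abi_hint(importer):
    """Run an import callable (e.g. ``lambda: __import__("torchvision")``)
    and re-raise the torchvision::nms failure with a clearer diagnosis."""
    try:
        return importer()
    except RuntimeError as e:
        if "torchvision::nms" in str(e) and "does not exist" in str(e):
            raise RuntimeError(
                "torch/torchvision ABI mismatch: reinstall both wheels "
                "from the same (nightly) index"
            ) from e
        raise
```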

## Scope note

An earlier revision of this PR also patched `features.yaml`,
`torchft.yaml`, and `autoparallel.yaml`. Removed — those workflows are
red on ROCm for an unrelated reason (DCP planner `EOFError` in
`gather_object`, see #3051), and a torchvision pin there would be
speculative. Will file separately if/when verified.

## Repro / verification (local, MI325X / `rocm/pytorch:latest`)

| Step | Result |
|---|---|
| Force-reinstall `torch` only from `nightly/rocm7.1` | torch upgraded to 2.13.0.dev, torchvision left at 0.25.0 |
| `import torchvision` | `RuntimeError: operator torchvision::nms does not exist` |
| `pip check` | "No broken requirements" — silent ABI break |
| Apply fix (install `torch torchvision` together) | torch 2.13.0.dev + torchvision 0.27.0.dev |
| `import torchvision` | succeeds |

The exact same shape was previously confirmed and merged for
`transformers_modeling_backend.yaml` in #3002.

## Note comment

Per review feedback on #3002, the touched file gets a 2-line `# NOTE:`
explaining why `torchvision` is on the install line, so future
maintainers don't strip it.

Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.
