Increase timeout for features test to 60 minutes. #2982

Closed
akashveramd wants to merge 2 commits into pytorch:main from akashveramd:av_features_timeout

Conversation

@akashveramd
Collaborator

@akashveramd akashveramd commented Apr 15, 2026

This PR increases the timeout for the features test to 60 minutes. This is the timeout failure we are seeing in the features test: https://github.com/pytorch/torchtitan/actions/runs/24430483234/job/71373655959. With the increased timeout, the features test passes on ROCm CI: https://github.com/pytorch/torchtitan/actions/runs/24541348555/job/71747650589.

This is a temporary fix for the ROCm CI timeout issue. We are working on enabling the MI350 label to replace the existing MI325 label. We hope the MI350 runners will not time out, but we need to enable the label and check.
The draft PR for enabling the MI350 label is #2740.

The failing transformers test seems unrelated to the timeout change made to the features test: https://github.com/pytorch/torchtitan/actions/runs/24466095345/job/71493479334?pr=2982.

@akashveramd akashveramd self-assigned this Apr 15, 2026
@meta-cla meta-cla Bot added the CLA Signed label Apr 15, 2026
@wwwjn
Contributor

wwwjn commented Apr 15, 2026

Is this change mainly needed for the ROCm CI timeout?

@akashveramd
Collaborator Author

> Is this change mainly needed for the ROCm CI timeout?

Yes, it's a temporary fix for the ROCm CI timeout. We are in the process of replacing the MI325X runners with MI350X runners. We hope the new runners will not time out, but we need to enable them and see.

Comment thread .github/workflows/integration_test_8gpu_features.yaml Outdated
@akashveramd akashveramd requested a review from tianyu-l April 17, 2026 01:23
@tianyu-l
Contributor

are the rocm CI failures real?

@akashveramd
Collaborator Author

> are the rocm CI failures real?

The job seems to have failed due to an RCCL timeout. I have re-run the failing jobs to see whether the error is consistent.

@akashveramd
Collaborator Author

@tianyu-l: My other PR #2740, which adds the MI350 label to all torchtitan workflows, is ready for review. We may not need this temporary PR that increases the timeout for ROCm. If you want, I can close this PR.

@tianyu-l
Contributor

sg, left comments over there

@akashveramd akashveramd marked this pull request as draft April 21, 2026 20:40
  repository: pytorch/torchtitan
  upload-artifact: outputs
- timeout: 45
+ timeout: ${{ matrix.gpu-arch-type == 'rocm' && 60 || 45 }} # TODO: change it to 45min when MI350 label is added.
Contributor

Suggested change
- timeout: ${{ matrix.gpu-arch-type == 'rocm' && 60 || 45 }} # TODO: change it to 45min when MI350 label is added.
+ timeout: ${{ matrix.gpu-arch-type == 'rocm' && 60 || 45 }}
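
For readers unfamiliar with the `cond && a || b` form used in the `timeout` line: GitHub Actions expressions have no dedicated if/else operator in this expression, so the logical operators are used as a conditional. Both operators return one of their operands, which gives 60 when `matrix.gpu-arch-type` is `'rocm'` and 45 otherwise. Below is a minimal Python sketch of the same short-circuit logic, relying on the fact that Python's `and`/`or` also return operands. The function name `pick_timeout` and its parameters are hypothetical and are not part of the torchtitan workflow.

```python
def pick_timeout(gpu_arch_type: str, rocm_minutes: int = 60, default_minutes: int = 45) -> int:
    """Mimic `gpu_arch_type == 'rocm' && rocm_minutes || default_minutes`.

    Hypothetical helper for illustration only; not part of the workflow.
    """
    # `and` yields rocm_minutes when the comparison is true, otherwise False.
    # `or` then falls back to default_minutes whenever the left side is falsy.
    return (gpu_arch_type == "rocm" and rocm_minutes) or default_minutes


if __name__ == "__main__":
    print(pick_timeout("rocm"))                  # ROCm runners get the longer limit
    print(pick_timeout("cuda"))                  # everything else keeps the default
    print(pick_timeout("rocm", rocm_minutes=0))  # caveat: a falsy middle value falls through
```

The last call shows the known caveat of this idiom: if the middle value is falsy (for example 0), the result falls through to the default. Since 60 is truthy, the expression in this PR is not affected.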

@akashveramd
Collaborator Author

Closing this PR since #3144, which improves test run time, has been merged.

@akashveramd akashveramd closed this May 5, 2026
Labels

ciflow/8gpu, CLA Signed

4 participants