Increase timeout for features test to 60 minutes. #2982
Closed
akashveramd wants to merge 2 commits into pytorch:main
Contributor
Is this change mainly needed for the ROCm CI timeout?
Collaborator
Author
Yes, it's a temporary fix for the ROCm CI timeout. We are in the process of replacing the MI325X runners with MI350X runners. Hopefully the new runners will not time out, but we need to enable them to find out.
tianyu-l reviewed on Apr 16, 2026
Contributor
Are the ROCm CI failures real?
Collaborator
Author
It seems to have failed due to an RCCL timeout. I have re-run the failing jobs to see whether the error is consistent.
Collaborator
Author
Contributor
Sounds good, left comments over there.
  repository: pytorch/torchtitan
  upload-artifact: outputs
- timeout: 45
+ timeout: ${{ matrix.gpu-arch-type == 'rocm' && 60 || 45 }} # TODO: change it to 45min when MI350 label is added.
Contributor
Suggested change
- timeout: ${{ matrix.gpu-arch-type == 'rocm' && 60 || 45 }} # TODO: change it to 45min when MI350 label is added.
+ timeout: ${{ matrix.gpu-arch-type == 'rocm' && 60 || 45 }}
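For readers unfamiliar with the `cond && a || b` idiom in the `timeout` line above: in GitHub Actions expressions, `&&` and `||` evaluate to one of their operands, so the pattern acts as a ternary as long as the middle value is truthy. The sketch below is a rough Python analogue, not part of this PR. The function name `select_timeout`, its parameters, and the `"cuda"` value are hypothetical and used only for illustration. Python's `==` is also case-sensitive, whereas GitHub Actions string comparison ignores case.

```python
def select_timeout(gpu_arch_type: str, rocm_minutes: int = 60, default_minutes: int = 45) -> int:
    """Rough Python analogue of `matrix.gpu-arch-type == 'rocm' && 60 || 45`."""
    # `and` / `or` return one of their operands, much like `&&` / `||`
    # in the workflow expression, so this behaves like a ternary.
    return (gpu_arch_type == "rocm") and rocm_minutes or default_minutes


if __name__ == "__main__":
    # "cuda" is an assumed example of a non-ROCm matrix value.
    for arch in ("rocm", "cuda"):
        print(arch, select_timeout(arch))

    # Pitfall of the idiom: a falsy middle value (such as 0) falls
    # through to the default even when the condition is true.
    print(select_timeout("rocm", rocm_minutes=0))
```

Because 60 is truthy, the pitfall shown in the last call does not affect the expression used in this PR.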
Collaborator
Author
Closing this PR, since #3144, which improves test run time, has been merged.
This PR increases the timeout for the features test to 60 minutes. The features test is currently failing with a timeout: https://github.com/pytorch/torchtitan/actions/runs/24430483234/job/71373655959. With the increased timeout, the features test passes on ROCm CI: https://github.com/pytorch/torchtitan/actions/runs/24541348555/job/71747650589.
This is a temporary fix for the ROCm CI timeout issue. We are working towards enabling the MI350 label to replace the existing MI325 label. Hopefully the new runners will not time out, but we need to enable the label to confirm.
The draft PR for enabling the MI350 label is #2740.
The failing transformers test seems unrelated to the timeout change made to the features test: https://github.com/pytorch/torchtitan/actions/runs/24466095345/job/71493479334?pr=2982.