Collect peak memory directly in integration tests #2974

Open

tugsbayasgalan wants to merge 1 commit into gh/tugsbayasgalan/17/base from
Conversation
tugsbayasgalan added a commit that referenced this pull request on Apr 15, 2026
Add direct peak-memory collection for integration runs by having MetricsProcessor write a JSON summary at the end of training.

ghstack-source-id: 21c2731
Pull Request resolved: #2974
Contributor
@felipemello1 has WIP changes in #2607 that move in this general direction. I would prefer to wait for that if this is not urgent for the graph-trainer workstream.
Stack from ghstack (oldest at bottom):
Add direct peak-memory collection for integration runs by having MetricsProcessor write a JSON summary at the end of training.
The test runner now passes TORCHTITAN_PEAK_MEMORY_JSON into each launched training job, forces metrics logging every step, and reads the emitted summary file back for reporting. MetricsProcessor tracks the maximum reserved and active CUDA memory it observes across log and validation calls and writes a single summary on close from the metrics rank.
This keeps the measurement path local to the training run, avoids depending on TensorBoard event parsing for memory collection, and preserves the integration-test UX via --collect_peak_memory. The graph-trainer integration entrypoint and 8-GPU workflow are wired to use the flag.
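The mechanism described above can be sketched roughly as follows. This is a hypothetical, torch-free illustration of the pattern, not the actual MetricsProcessor code: the class name `PeakMemoryTracker` and the injected `read_stats` callable are invented for the sketch, and the real implementation would read `torch.cuda` memory statistics instead.

```python
import json
import os


class PeakMemoryTracker:
    """Hypothetical sketch: track peak memory across log calls and
    write a single JSON summary on close, as the PR describes for
    MetricsProcessor."""

    # Path for the summary file, passed in by the test runner.
    ENV_VAR = "TORCHTITAN_PEAK_MEMORY_JSON"

    def __init__(self, read_stats, is_metrics_rank=True):
        # read_stats: callable returning (reserved_bytes, active_bytes).
        # In the real code this would query torch.cuda memory stats.
        self._read_stats = read_stats
        self._is_metrics_rank = is_metrics_rank
        self.max_reserved = 0
        self.max_active = 0

    def log(self):
        # Called every step (and from validation) so no peak is missed.
        reserved, active = self._read_stats()
        self.max_reserved = max(self.max_reserved, reserved)
        self.max_active = max(self.max_active, active)

    def close(self):
        # Only the metrics rank writes the single summary file.
        path = os.environ.get(self.ENV_VAR)
        if path and self._is_metrics_rank:
            with open(path, "w") as f:
                json.dump(
                    {
                        "max_reserved_bytes": self.max_reserved,
                        "max_active_bytes": self.max_active,
                    },
                    f,
                )
```

After the launched job exits, the test runner only needs to read the file at `TORCHTITAN_PEAK_MEMORY_JSON` back for reporting, with no TensorBoard event parsing involved.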