Collect peak memory directly in integration tests#2974

Open
tugsbayasgalan wants to merge 1 commit into gh/tugsbayasgalan/17/base from gh/tugsbayasgalan/17/head

Conversation

@tugsbayasgalan
Contributor

@tugsbayasgalan tugsbayasgalan commented Apr 15, 2026

Stack from ghstack (oldest at bottom):

Add direct peak-memory collection for integration runs by having `MetricsProcessor` write a JSON summary at the end of training.

The test runner now passes `TORCHTITAN_PEAK_MEMORY_JSON` into each launched training job, forces metrics logging every step, and reads the emitted summary file back for reporting. `MetricsProcessor` tracks the maximum reserved and active CUDA memory it observes across log and validation calls, and writes a single summary on close from the metrics rank.
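The metrics-side logic can be sketched roughly as below. This is a minimal illustration with hypothetical names, not the actual torchtitan `MetricsProcessor`; in the real code the memory numbers would come from `torch.cuda.max_memory_reserved()` / `torch.cuda.max_memory_allocated()` device stats rather than being passed in.

```python
import json
import os


class PeakMemoryTracker:
    """Sketch: track the max reserved/active memory seen across log calls
    and write a JSON summary on close, mirroring the approach in this PR."""

    def __init__(self):
        self.max_reserved_gib = 0.0
        self.max_active_gib = 0.0

    def observe(self, reserved_gib, active_gib):
        # In the real processor these values would be read from CUDA
        # memory stats at each metrics-logging step (and validation calls).
        self.max_reserved_gib = max(self.max_reserved_gib, reserved_gib)
        self.max_active_gib = max(self.max_active_gib, active_gib)

    def close(self):
        # Only emit a summary when the test runner requested one via the
        # TORCHTITAN_PEAK_MEMORY_JSON environment variable.
        path = os.environ.get("TORCHTITAN_PEAK_MEMORY_JSON")
        if path:
            with open(path, "w") as f:
                json.dump(
                    {
                        "max_reserved_gib": self.max_reserved_gib,
                        "max_active_gib": self.max_active_gib,
                    },
                    f,
                )
```

In the PR this write happens only on the metrics rank, so each run produces a single summary file.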

This keeps the measurement path local to the training run, avoids a dependency on TensorBoard event parsing for memory collection, and preserves the integration-test UX via `--collect_peak_memory`. The graph-trainer integration entrypoint and the 8-GPU workflow are wired to use the flag.
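The runner side of the handshake can be sketched as follows. The helper name is hypothetical (the actual test runner launches full training jobs); it shows only the mechanism described above: inject `TORCHTITAN_PEAK_MEMORY_JSON` into the child environment, run the job, and read the emitted summary back for reporting.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_and_collect_peak_memory(cmd):
    """Sketch: launch a training command with TORCHTITAN_PEAK_MEMORY_JSON
    set to a fresh path, then load the JSON summary the job wrote."""
    summary_path = os.path.join(tempfile.mkdtemp(), "peak_memory.json")
    env = dict(os.environ, TORCHTITAN_PEAK_MEMORY_JSON=summary_path)
    subprocess.run(cmd, env=env, check=True)
    with open(summary_path) as f:
        return json.load(f)
```

Because the training job itself writes the summary, the runner never has to parse TensorBoard event files to recover memory numbers.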


[ghstack-poisoned]
@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 15, 2026
tugsbayasgalan added a commit that referenced this pull request Apr 15, 2026
Add direct peak-memory collection for integration runs by having MetricsProcessor write a JSON summary at the end of training.

ghstack-source-id: 21c2731
Pull Request resolved: #2974
@tianyu-l
Contributor

> Add direct peak-memory collection for integration runs by having MetricsProcessor write a JSON summary at the end of training.

@felipemello1 has WIP changes in #2607 that move in this direction more generally. I would prefer we wait for that if this is not urgent for the graph-trainer workstream.


Labels

ciflow/8gpu, CLA Signed


2 participants