Collect peak memory directly in integration tests#2974

Open
tugsbayasgalan wants to merge 1 commit into gh/tugsbayasgalan/17/base from gh/tugsbayasgalan/17/head

Conversation

@tugsbayasgalan
Contributor

@tugsbayasgalan tugsbayasgalan commented Apr 15, 2026

Stack from ghstack (oldest at bottom):

Add direct peak-memory collection for integration runs by having `MetricsProcessor` write a JSON summary at the end of training.

The test runner now passes `TORCHTITAN_PEAK_MEMORY_JSON` into each launched training job, forces metrics logging every step, and reads the emitted summary file back for reporting. `MetricsProcessor` tracks the maximum reserved and active CUDA memory it observes across log and validation calls, and writes a single summary on close from the metrics rank.
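The metrics-side logic can be sketched roughly as below. This is a minimal illustration with hypothetical names, not the actual torchtitan `MetricsProcessor`; in the real code the memory numbers would come from `torch.cuda.max_memory_reserved()` / `torch.cuda.max_memory_allocated()` device stats rather than being passed in.

```python
import json
import os


class PeakMemoryTracker:
    """Sketch: track the max reserved/active memory seen across log calls
    and write a JSON summary on close, mirroring the approach in this PR."""

    def __init__(self):
        self.max_reserved_gib = 0.0
        self.max_active_gib = 0.0

    def observe(self, reserved_gib, active_gib):
        # In the real processor these values would be read from CUDA
        # memory stats at each metrics-logging step (and validation calls).
        self.max_reserved_gib = max(self.max_reserved_gib, reserved_gib)
        self.max_active_gib = max(self.max_active_gib, active_gib)

    def close(self):
        # Only emit a summary when the test runner requested one via the
        # TORCHTITAN_PEAK_MEMORY_JSON environment variable.
        path = os.environ.get("TORCHTITAN_PEAK_MEMORY_JSON")
        if path:
            with open(path, "w") as f:
                json.dump(
                    {
                        "max_reserved_gib": self.max_reserved_gib,
                        "max_active_gib": self.max_active_gib,
                    },
                    f,
                )
```

In the PR this write happens only on the metrics rank, so each run produces a single summary file.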

This keeps the measurement path local to the training run, avoids a dependency on TensorBoard event parsing for memory collection, and preserves the integration-test UX via `--collect_peak_memory`. The graph-trainer integration entrypoint and the 8-GPU workflow are wired to use the flag.
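The runner side of the handshake can be sketched as follows. The helper name is hypothetical (the actual test runner launches full training jobs); it shows only the mechanism described above: inject `TORCHTITAN_PEAK_MEMORY_JSON` into the child environment, run the job, and read the emitted summary back for reporting.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_and_collect_peak_memory(cmd):
    """Sketch: launch a training command with TORCHTITAN_PEAK_MEMORY_JSON
    set to a fresh path, then load the JSON summary the job wrote."""
    summary_path = os.path.join(tempfile.mkdtemp(), "peak_memory.json")
    env = dict(os.environ, TORCHTITAN_PEAK_MEMORY_JSON=summary_path)
    subprocess.run(cmd, env=env, check=True)
    with open(summary_path) as f:
        return json.load(f)
```

Because the training job itself writes the summary, the runner never has to parse TensorBoard event files to recover memory numbers.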


[ghstack-poisoned]
@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 15, 2026
tugsbayasgalan added a commit that referenced this pull request Apr 15, 2026
Add direct peak-memory collection for integration runs by having MetricsProcessor write a JSON summary at the end of training.

ghstack-source-id: 21c2731
Pull Request resolved: #2974
@tianyu-l
Contributor

> Add direct peak-memory collection for integration runs by having MetricsProcessor write a JSON summary at the end of training.

@felipemello1 has WIP changes in #2607 that move in this direction more generally. I would prefer we wait for that if this is not urgent for the graph-trainer workstream.


Labels

ciflow/8gpu, CLA Signed


2 participants