[ignore-for-now][llm_trainer] Add experiment for LLM-driven model optimization #3006
Closed
bobrenjc93 wants to merge 9 commits into gh/bobrenjc93/43/base
Conversation
bobrenjc93 added a commit that referenced this pull request on Apr 17, 2026:

…imization

Adds the llm_trainer experiment which traces a model's full
forward+backward training step into a flat sequence of ATen ops, then
provides benchmarking infra for an LLM to iteratively optimize the
generated code while maintaining bitwise correctness.

Key components:
- flattener: traces via make_fx, writes standalone Python files per
  rank, verifies bitwise equivalence, copies baseline to
  optimized_models/
- benchmarker: compares optimized vs candidate models for bitwise
  correctness and MFU, promotes only if >=1% faster on N consecutive
  runs (default 3)
- Shell script wrappers (run_flattener.sh, run_benchmarker.sh) for
  ergonomic torchrun invocation
- INSTRUCTIONS.md guide for LLMs

Directory structure uses targets/<fingerprint>/ where fingerprint
encodes both hardware label and parallelism config (e.g.
h100-sm90_tp2_fsdp4). Promoted files get an MFU comment header for
self-documenting optimization history.

ghstack-source-id: b011200
Pull-Request: #3006
bobrenjc93 added seven further commits that referenced this pull request on Apr 17 and Apr 19, 2026, each carrying the same message as above (ghstack-source-ids: e5feea0, b817ccb, 90f76c6, 3a2e782, c3775b1, e8114a9, 2c324d2).
This was referenced Apr 21, 2026
…n model optimization" (same message as above) [ghstack-poisoned]
Stack from ghstack (oldest at bottom):
Adds the llm_trainer experiment which traces a model's full
forward+backward training step into a flat sequence of ATen ops,
then provides benchmarking infra for an LLM to iteratively optimize
the generated code while maintaining bitwise correctness.
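A minimal sketch of that flattening step, assuming a toy linear model; `train_step`, the tensors, and the bitwise check below are illustrative, not the experiment's actual code:

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def train_step(w, b, x):
    # Forward + backward in one function; under make_fx the autograd
    # ops are traced too, yielding a flat sequence of ATen ops.
    out = torch.nn.functional.linear(x, w, b)
    loss = out.sum()
    gw, gb = torch.autograd.grad(loss, (w, b))
    return loss, gw, gb

w = torch.randn(16, 16, requires_grad=True)
b = torch.randn(16, requires_grad=True)
x = torch.randn(4, 16)

flat_gm = make_fx(train_step)(w, b, x)  # GraphModule of flat ATen ops
flat_gm.graph.print_tabular()           # inspect the traced op sequence

# Bitwise equivalence gate: the traced step must match eager exactly.
for eager_out, traced_out in zip(train_step(w, b, x), flat_gm(w, b, x)):
    assert torch.equal(eager_out, traced_out)
```

The real flattener traces the model's actual training step and writes one standalone Python file per rank; the sketch only shows the make_fx trace and the equivalence check.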
Key components:
- flattener: traces via make_fx, writes standalone Python files
  per rank, verifies bitwise equivalence, copies baseline to
  optimized_models/
- benchmarker: compares optimized vs candidate models for bitwise
  correctness and MFU, promotes only if >=1% faster on N consecutive
  runs (default 3); see the sketch after this list
- Shell script wrappers (run_flattener.sh, run_benchmarker.sh) for
  ergonomic torchrun invocation
- INSTRUCTIONS.md guide for LLMs
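A hedged sketch of the benchmarker's promotion rule as described above; `should_promote` and its arguments are hypothetical stand-ins for the real measurement code:

```python
def should_promote(baseline_mfus, candidate_mfus,
                   outputs_bitwise_equal, n_consecutive=3, min_speedup=0.01):
    if not outputs_bitwise_equal:
        return False  # correctness gate comes first, regardless of speed
    if len(candidate_mfus) < n_consecutive:
        return False  # not enough consecutive runs measured yet
    recent = list(zip(baseline_mfus, candidate_mfus))[-n_consecutive:]
    # Higher MFU means faster; require the >=1% margin on every
    # one of the last N runs before promoting the candidate.
    return all(cand >= base * (1.0 + min_speedup) for base, cand in recent)
```

Requiring the margin on every one of the last N runs, rather than on an average, keeps a single noisy fast run from triggering a promotion.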
Directory structure uses targets/<fingerprint>/ where fingerprint
encodes both hardware label and parallelism config (e.g.
h100-sm90_tp2_fsdp4). Promoted files get an MFU comment header for
self-documenting optimization history.
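For illustration, one way such a fingerprint could be assembled; the `fingerprint` helper below is inferred from the h100-sm90_tp2_fsdp4 example, not taken from the PR's code:

```python
import os
import torch

def fingerprint(hw_label: str, tp_degree: int, fsdp_degree: int) -> str:
    # Compute capability gives the smXY part, e.g. (9, 0) -> "sm90" on H100.
    major, minor = torch.cuda.get_device_capability(0)
    return f"{hw_label}-sm{major}{minor}_tp{tp_degree}_fsdp{fsdp_degree}"

target_dir = os.path.join("targets", fingerprint("h100", tp_degree=2, fsdp_degree=4))
# -> "targets/h100-sm90_tp2_fsdp4" on an H100 host
```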