
[WIP] Enable DP-to-EP for MoE inference #3171

Closed
wwwjn wants to merge 9 commits into gh/wwwjn/16/base from gh/wwwjn/16/head

Conversation

@wwwjn
Contributor

@wwwjn wwwjn commented Apr 30, 2026

Stack from ghstack (oldest at bottom):

Map vLLM's data_parallel_size to dp_shard in TorchTitan's mesh math,
enabling EP to span both DP and TP ranks (ep = dp * tp). For inference,
the skip_dp path returns the model directly — no FSDP wrapping needed
since there's no backward pass.

Changes:

- parallel_dims: dp_replicate mesh always exists
- vllm_wrapper: ep_size = dp_size * tp_size, dp_shard = dp_size
- vllm_wrapper: weight loading uses [Replicate()] * mesh.ndim
- qwen3/parallelize: skip_dp returns model (no fully_shard)
- llama4/parallelize: clarify shard_placement_fn comments
- actors/utils: full_tensor() instead of to_local() for TP-sharded logits
- actors/trainer: strict=False (expert_bias not in HF), ep_enabled clip
- grpo: tyro.conf.Suppress on model_spec to bypass CLI parsing of Placement
- experiments/rl: TORCHTITAN_SKIP_INITIAL_HF_LOAD env var
- config_registry: env/validation_env in MoE configs (fix tyro CLI)
- qwen3/__init__: debugmodel_moe vocab_size 2048 → 151936 (match tokenizer)
- scripts/rl/create_debug_moe_ckpt.py: helper to generate debug checkpoint

Verified end-to-end RL with rl_grpo_qwen3_moe_debug_ep on debug model.
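The dp_shard / ep mapping described above is plain multiplicative mesh arithmetic. A minimal sketch under the PR's stated mapping (the helper name is hypothetical, not the actual vllm_wrapper code):

```python
def ep_mesh_sizes(data_parallel_size: int, tensor_parallel_size: int) -> tuple[int, int]:
    """Sketch of the DP-to-EP mapping from the summary above.

    vLLM's data_parallel_size is mapped onto TorchTitan's dp_shard
    dimension, and EP spans both DP and TP ranks, so ep = dp * tp.
    """
    dp_shard = data_parallel_size
    ep_size = data_parallel_size * tensor_parallel_size
    return dp_shard, ep_size

# e.g. 4 DP replicas x 2 TP ranks -> dp_shard = 4, experts sharded 8 ways
```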

meta-cla bot added the CLA Signed label Apr 30, 2026
wwwjn changed the title from "Enable DP-to-EP for MoE inference" to "[WIP] Enable DP-to-EP for MoE inference" Apr 30, 2026
wwwjn added 6 commits April 30, 2026 20:25
wwwjn added a commit that referenced this pull request May 5, 2026
…cher (#3193)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #3172
* #3171
* #3142
* __->__ #3193

Today the dispatcher's _split_along_sp() raises when num_tokens (bs *
slen) is not divisible by sp_size (= TP degree). Real workloads with
varlen prompts can land on non-divisible totals and crash the MoE
forward.

Pad inside dispatch(): round num_tokens up to the next multiple of
sp_size, padding x and top_scores with zeros and
selected_experts_indices with 0 (so pad rows route deterministically to
expert 0 with zero score). combine() reads metadata.original_num_tokens
to size the scatter_add buffer at the padded length and slices the pad
rows off before returning. When sp_size == 1 or input is already
divisible, behavior is bitwise identical to today.

Pad tokens are numerically inert:
- Zero scores -> contribution to scatter_add is exactly zero either
before or after expert compute (independent of score_before_experts).
- Pad indices fall in [original, padded), which is sliced off after
scatter_add, so they never appear in the returned output.

Trainer/generator can pad by different amounts depending on their batch
shapes; the unpadded portions remain bitwise identical.

- TorchAOTokenDispatcher inherits dispatch/combine and gets the fix for
free.
- DeepEPTokenDispatcher uses a separate metadata type and is unaffected.
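The round-up-and-slice scheme above can be sketched in plain Python (helper names like padded_length and pad_rows are illustrative, not the dispatcher's actual API):

```python
def padded_length(num_tokens: int, sp_size: int) -> int:
    """Round num_tokens up to the next multiple of sp_size.

    With sp_size == 1 or an already-divisible input the length is
    unchanged, matching the "bitwise identical" guarantee above.
    """
    if sp_size <= 1:
        return num_tokens
    return -(-num_tokens // sp_size) * sp_size  # ceiling division

def pad_rows(rows: list, sp_size: int, fill) -> tuple[list, int]:
    """Append `fill` rows until len(rows) is divisible by sp_size.

    Returns the padded rows plus the original length (analogous to
    metadata.original_num_tokens, which combine() uses to slice the
    pad rows back off).
    """
    original = len(rows)
    target = padded_length(original, sp_size)
    return rows + [fill] * (target - original), original
```

Per the description, x and top_scores are padded with zeros and selected_experts_indices with 0, so pad rows route to expert 0 with zero score and contribute exactly nothing to scatter_add before being sliced off.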
@wwwjn wwwjn closed this May 6, 2026

Labels

ciflow/8gpu, CLA Signed
