[rl] Enable TP2EP for MoE inference in vLLM wrapper #3142
wwwjn wants to merge 23 commits into gh/wwwjn/14/base
Conversation
Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, an init_states() fix for the RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs.

Key changes:
- Meta-device init + to_empty() + init_states() for large MoE models
- EP mesh: ep = tp_size (TP ranks become EP ranks for experts)
- Use enable_sp (not the inference flag) for output layout and SP splitting
- enable_sequence_parallel=True, disable_loss_parallel=True for inference
- Remove stale ModelConvertersContainer references
- Add MoE debug and 30B-A3B RL configs

Verified: TP=2, EP+TP=2, and EP+TP=4 all produce correct output matching native vLLM on Qwen3-30B-A3B.

Similar to #3057.
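To make the meta-device init flow in the description concrete, here is a minimal, self-contained sketch; TinyModel and its init_states() are illustrative stand-ins for the torchtitan MoE model, and only to_empty()/init_states() come from the PR itself:

```python
import torch
import torch.nn as nn

# Placeholder model; the real flow uses the torchtitan MoE model, whose
# init_states() rebuilds non-persistent state such as the RoPE cache.
class TinyModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(16, 16)

    def init_states(self) -> None:
        # Stand-in for recomputing non-persistent state (e.g. RoPE cache).
        nn.init.zeros_(self.proj.bias)

# 1) Build on the meta device: no real memory is allocated for the large model.
with torch.device("meta"):
    model = TinyModel()

# 2) (In the real flow) apply TP to dense layers and EP to experts here.

# 3) Materialize real, uninitialized storage, then re-run init_states().
model.to_empty(device="cpu")
model.init_states()

# 4) Finally load real weights (HF checkpoint, or the first RL weight sync).
```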
    dp_shard=1,
-   cp=parallel_config.decode_context_parallel_size,
+   cp=1,
    tp=parallel_config.tensor_parallel_size,
This is not enabled now; previously it was only there for testing purposes. We lack knowledge of decode_context_parallel_size in vLLM, and I think hardcoding it to 1 is safer before we officially support CP in inference.
we should assert vllm CP / PP degrees to be 1
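For concreteness, a guard along those lines could look like the following sketch (field names taken from the hunk above; not the final code):

```python
def check_unsupported_parallel_degrees(parallel_config) -> None:
    # Fail fast until CP / PP are actually supported in this inference path.
    if parallel_config.decode_context_parallel_size != 1:
        raise ValueError("context parallelism is not supported in inference yet")
    if parallel_config.pipeline_parallel_size != 1:
        raise ValueError("pipeline parallelism is not supported in inference yet")
```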
This file is a soft link
    # Initial load model weights from HuggingFace checkpoint path.
    import os as _os

    if _os.environ.get("TORCHTITAN_SKIP_INITIAL_HF_LOAD") != "1":
I used an env variable to allow random init in vllm_wrapper because it's impossible to change the vllm_wrapper init signature, so this is hard to pass as a kwarg or config.
        storage_reader.read_metadata().state_dict_metadata.keys()
    )
    missing = set(hf_state_dict.keys()) - hf_keys_in_checkpoint
    unexpected_missing = {k for k in missing if not k.endswith(".expert_bias")}
We need to remove expert_bias from this check after loss-based load balancing lands (#3000). If expert_bias is in our model, we should load it from the checkpoint.
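In other words, once the .expert_bias carve-out is dropped the check becomes strict; a minimal sketch (names follow the diff above):

```python
def check_hf_keys(hf_state_dict: dict, hf_keys_in_checkpoint: set) -> None:
    # No .expert_bias exemption anymore: every model key must be present.
    missing = set(hf_state_dict.keys()) - hf_keys_in_checkpoint
    if missing:
        raise RuntimeError(f"HF checkpoint is missing keys: {sorted(missing)}")
```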
…cher (#3193)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #3172
* #3171
* #3142
* __->__ #3193

Today the dispatcher's _split_along_sp() raises when num_tokens (bs * slen) is not divisible by sp_size (= TP degree). Real workloads with varlen prompts can land on non-divisible totals and crash the MoE forward.

Pad inside dispatch(): round num_tokens up to the next multiple of sp_size, padding x and top_scores with zeros and selected_experts_indices with 0 (so pad rows route deterministically to expert 0 with zero score). combine() reads metadata.original_num_tokens to size the scatter_add buffer at the padded length and slices the pad rows off before returning. When sp_size == 1 or the input is already divisible, behavior is bitwise identical to today.

Pad tokens are numerically inert:

- Zero scores -> contribution to scatter_add is exactly zero either before or after expert compute (independent of score_before_experts).
- Pad indices fall in [original, padded), which is sliced off after scatter_add, so they never appear in the returned output.

Trainer/generator can pad by different amounts depending on their batch shapes; the unpadded portions remain bitwise identical.

- TorchAOTokenDispatcher inherits dispatch/combine and gets the fix for free.
- DeepEPTokenDispatcher uses a separate metadata type and is unaffected.
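A standalone sketch of the padding scheme described in this commit message, not the actual dispatcher code; the tensor names (x, top_scores, selected_experts_indices) follow the description above:

```python
import torch

def _pad_rows(t: torch.Tensor, pad: int) -> torch.Tensor:
    # Append `pad` zero-filled rows along dim 0.
    return torch.cat([t, t.new_zeros((pad,) + tuple(t.shape[1:]))], dim=0)

def pad_for_sp(x, top_scores, selected_experts_indices, sp_size):
    original_num_tokens = x.shape[0]
    padded = -(-original_num_tokens // sp_size) * sp_size  # round up to a multiple of sp_size
    pad = padded - original_num_tokens
    if pad > 0:
        x = _pad_rows(x, pad)                    # zero activations
        top_scores = _pad_rows(top_scores, pad)  # zero routing scores
        # Index 0 routes pad rows deterministically to expert 0 (with zero score).
        selected_experts_indices = _pad_rows(selected_experts_indices, pad)
    return x, top_scores, selected_experts_indices, original_num_tokens

def unpad_after_combine(out: torch.Tensor, original_num_tokens: int) -> torch.Tensor:
    # combine() sizes its scatter_add buffer at the padded length and slices
    # the pad rows [original, padded) off before returning.
    return out[:original_num_tokens]
```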
| """Top-level config for RL training.""" | ||
|
|
||
| model_spec: ModelSpec | None = None | ||
| model_spec: Annotated[ModelSpec | None, tyro.conf.Suppress] = None |
Why didn't we need this before?
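For context on the suppressed field: tyro builds a CLI from the dataclass and presumably cannot construct a parser for an arbitrary ModelSpec object, so the field is hidden from the CLI and assigned in code. A self-contained sketch (ModelSpecStub and RLJobConfig are illustrative stand-ins, not the torchtitan types):

```python
from dataclasses import dataclass
from typing import Annotated, Optional

import tyro

@dataclass
class ModelSpecStub:
    name: str = "debug"

@dataclass
class RLJobConfig:
    """Top-level config for RL training."""
    num_steps: int = 10
    # Suppressed: not exposed as a CLI flag; set programmatically instead.
    model_spec: Annotated[Optional[ModelSpecStub], tyro.conf.Suppress] = None

if __name__ == "__main__":
    cfg = tyro.cli(RLJobConfig)
    print(cfg)
```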
    model=model,
    model_state_dict=torchtitan_state_dict,
-   options=StateDictOptions(strict=True),
+   # strict=False: HF MoE checkpoints don't carry expert_bias buffers
Then expert_bias should be a non-persistent buffer?
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/common/moe.py#L349
I believe Qwen is not using loss-free load balancing, so we should enable the load-balancing loss for it.
This is blocked by @pianpwk's #3000.
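For reference, making expert_bias a non-persistent buffer would keep it out of the state dict entirely (and out of strict checkpoint checks). A minimal sketch, not the torchtitan MoE module:

```python
import torch
import torch.nn as nn

class RouterStub(nn.Module):
    def __init__(self, num_experts: int) -> None:
        super().__init__()
        # persistent=False -> excluded from state_dict(), so checkpoints that
        # lack expert_bias would not trigger missing/unexpected-key errors.
        self.register_buffer(
            "expert_bias", torch.zeros(num_experts), persistent=False
        )
```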
    import os as _os

    if _os.environ.get("TORCHTITAN_SKIP_INITIAL_HF_LOAD") != "1":
        self._initial_load_weights(checkpoint_path=vllm_config.model_config.model)
I'm confused here. Why do we need an initial weight load at all?
https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/rl/grpo.py#L453
This will init the weights on the generator. The flow should be:
- trainer builds the model and loads weights from HF (which we can now skip with a flag for debugging)
- trainer pushes the model state dict to TS
- generator pulls the model state dict from TS and loads it into the model
Yes, you are right. I guess it's more for historical reasons: we started from inference only, so we needed a way to load weights. Now initial_load_weights is unconditionally called when the generator is initialized. I agree we should either 1) skip initial_load_weights (but passing an extra flag to vllm_wrapper requires an env variable), or 2) skip the first weight sync.
I'd prefer to go with 1).
What's 1) -- is it generator side or trainer side?
I think we can just remove any initial weight load from the generator (and rely only on the first weight sync to get the proper weights).
Replied here: #3142 (comment)
> I think we can just remove any initial load weight from generator (and only rely on the first weight sync to get proper weight)

In RL, yes.
    cp=1,
-   tp=parallel_config.tensor_parallel_size,
+   tp=tp_size,
    pp=parallel_config.pipeline_parallel_size,
then you should hardcode this to 1 as well?
    dp_shard=1,
-   cp=parallel_config.decode_context_parallel_size,
+   cp=1,
    tp=parallel_config.tensor_parallel_size,
we should assert vllm CP / PP degrees to be 1
    # All-to-all dispatch tokens to EP ranks.
    # Use the non-autograd version under inference (vLLM), since
    # _c10d_functional_autograd ops don't dispatch correctly without
    # an active autograd context. Gated by a Python bool so the choice
    # is stable at trace time.
Sorry, why do we think all_to_all_autograd didn't dispatch properly? autograd.Functions have some early-exit conditions in inference mode, but I don't think that prevents you from using it in inference.
UPDATE: we need to revive pytorch/pytorch#149411
For my current PR, I will leave the current branching as is and create an issue in torchtitan to track progress. The ultimate goal is to consolidate to a single op for pretrain and RL.
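For reference, the branching being discussed looks roughly like the following sketch (not the torchtitan code); both ops live in torch.distributed._functional_collectives:

```python
import torch.distributed._functional_collectives as funcol

def dispatch_tokens(x, output_splits, input_splits, group, use_autograd_a2a: bool):
    # `use_autograd_a2a` is a plain Python bool so the branch is fixed at trace time.
    if use_autograd_a2a:
        # Training path: differentiable all-to-all.
        return funcol.all_to_all_single_autograd(x, output_splits, input_splits, group)
    # Inference (vLLM) path: non-autograd functional all-to-all.
    return funcol.all_to_all_single(x, output_splits, input_splits, group)
```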
| f"got cp={p.context_parallel_degree}" | ||
| ) | ||
| if p.expert_parallel_degree > 1: | ||
| # vLLM ties EP to TP (enable_expert_parallel=True reuses the TP |
So vLLM doesn't support EP > TP, e.g. by also reusing the DP group if we turn on FSDP?
When EP > TP, we will borrow DP degrees, which is achieved in the next PR. And it's pure DP, not FSDP.
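To make the TP2EP idea concrete: with EP tied to TP, the same ranks serve as the TP group for dense layers and the EP group for experts (ep == tp_size). A hedged DeviceMesh sketch (the actual mesh wiring in the PR may differ; run under torchrun):

```python
from torch.distributed.device_mesh import init_device_mesh

tp_size = 2  # example degree
mesh = init_device_mesh("cuda", (tp_size,), mesh_dim_names=("tp",))
tp_mesh = mesh["tp"]  # dense layers: tensor parallel
ep_mesh = mesh["tp"]  # experts: expert parallel over the same ranks (TP2EP)
```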
    config_format=TORCHTITAN_CONFIG_FORMAT,
    dtype=config.model_dtype,
    tensor_parallel_size=config.parallelism.tensor_parallel_degree,
    enable_expert_parallel=enable_ep,
    dp_shard=1,
-   cp=parallel_config.decode_context_parallel_size,
+   cp=1,
    tp=parallel_config.tensor_parallel_size,
Maybe leave a comment on why cp is hardcoded to 1? Also, for FSDP, I'm wondering if we could turn it on and later have both tp2ep and dp2ep.
    moe = getattr(layer, "moe", None)
    if moe is None:
        continue
    dispatcher = getattr(moe.experts, "token_dispatcher", None)
Can you raise a ValueError when it's None?
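Something like the following sketch of the suggested change, using the names from the hunk above:

```python
def get_token_dispatcher(moe):
    dispatcher = getattr(moe.experts, "token_dispatcher", None)
    if dispatcher is None:
        raise ValueError(
            f"expected a token_dispatcher on {type(moe.experts).__name__}, got None"
        )
    return dispatcher
```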
    model_spec=model_registry("30B-A3B", attn_backend="varlen"),
    hf_assets_path="torchtitan/experiments/rl/example_checkpoint/Qwen3-30B-A3B",
    num_steps=10,
    compile=CompileConfig(enable=True, backend="aot_eager"),
Why doesn't this need to be False?
| """Top-level config for RL training.""" | ||
|
|
||
| model_spec: ModelSpec | None = None | ||
| model_spec: Annotated[ModelSpec | None, tyro.conf.Suppress] = None |
-   routed_input = all_to_all_single_autograd(
+   # All-to-all dispatch tokens to EP ranks.
+   # Use the non-autograd version under inference (vLLM), since
+   # _c10d_functional_autograd ops don't dispatch correctly without
Just curious: why can't torch.inference_mode automatically switch between these two, if this is the only difference?
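One way the automatic switch asked about here could look, if the choice were read from runtime state rather than a config flag (sketch only; the PR keeps a static Python bool so the branch stays stable at trace time):

```python
import torch

def use_autograd_all_to_all() -> bool:
    # True during training; False under torch.inference_mode() / no_grad().
    return torch.is_grad_enabled() and not torch.is_inference_mode_enabled()
```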
    # Replicate (no-op) and Shard(-1) (all-gather) lm_head output placements.
    if isinstance(logits, DTensor):
-       logits = logits.to_local()
+       logits = logits.full_tensor()
should work with disable_loss_parallel already?

Stack from ghstack (oldest at bottom):
Enable Expert Parallelism for MoE models in the vLLM inference path, with TP on dense layers. Includes meta-device init, init_states() fix for RoPE cache, EP mesh creation, enable_sp-based output layout, and MoE debug configs.
Key changes:
Verified: TP=2, EP+TP=2, EP+TP=4 all produce correct output matching native vLLM on Qwen3-30B-A3B.
Similar as #3057