re-enable compile tests #2978

Closed
acisseJZhong wants to merge 2 commits into main from renable_compile

@acisseJZhong
Contributor

TSIA

@meta-cla meta-cla Bot added the CLA Signed label on Apr 15, 2026
@xmfan
Member

xmfan commented Apr 17, 2026

@laithsakka would you mind taking a look?

Failure Analysis for pytorch/torchtitan#2978

  Which test failed: Only one integration test failed, "Gpt-oss FSDP+TP+EP+compile". All the others passed, including "DeepSeek V3 FSDP+EP+compile".

  The error:
  torch._inductor.exc.InductorError: AssertionError: failed OrderedSet([]) >= OrderedSet([u12]) (inductor >= fx)

  Where it happens: torch/_inductor/graph.py:2181 in run_node:
  assert new_unbacked_defs >= renamed_unbacked_bindings, ...

  The FX node causing the issue is:
  %slice_12 : [num_users=2] = call_function[target=torch.ops.aten.slice.Tensor](args = (%index_5, 0, 0, %sym_sum_2), kwargs = {})

  What this means:

  This is an unbacked SymInt tracking bug in Inductor (torch._inductor). During lowering, Inductor walks the FX graph node by node and checks that every node that is supposed to define ("bind") an unbacked symbolic
  integer actually does so. Here, the FX graph says slice_12 should bind the unbacked symbol u12, but Inductor's lowering of that node produced no new unbacked symbol definitions (OrderedSet([])). The assertion
  message "inductor >= fx" means "Inductor must define at least every unbacked symbol that FX says this node defines."
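
  The check described above is a set-superset test. A minimal sketch of its logic, using plain Python sets and strings in place of Inductor's OrderedSet and sympy symbols; the function name check_unbacked_defs is hypothetical and this is not the code in graph.py:

```python
def check_unbacked_defs(new_unbacked_defs, renamed_unbacked_bindings):
    """Illustrative stand-in for the check; not Inductor's actual code."""
    new_defs = set(new_unbacked_defs)
    fx_bindings = set(renamed_unbacked_bindings)
    # Superset test: lowering must define every symbol FX expects from this node.
    assert new_defs >= fx_bindings, (
        f"failed {sorted(new_defs)} >= {sorted(fx_bindings)} (inductor >= fx)"
    )


# Passing case: lowering defined the symbol FX expected.
check_unbacked_defs({"u12"}, {"u12"})

# Failing case mirroring the log: lowering defined nothing, FX expected u12.
try:
    check_unbacked_defs(set(), {"u12"})
except AssertionError as err:
    print(err)
```

  Defining extra symbols is allowed under this check; only a symbol that FX expects and lowering never defines trips the assertion.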

  The symbolic variable u12 likely comes from the dynamic value of sym_sum_2 (the slice end index), which is related to expert-parallel routing in the MoE (Mixture of Experts) model: the gpt_oss model uses expert
  parallelism (EP=4, ETP=1). Because the end index of aten.slice.Tensor is a symbolic sum, the length of the sliced output is only known at runtime, so FX records an unbacked SymInt for it, and Inductor's lowering of the slice does not define that symbol.
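
  To illustrate why such a slice length cannot be known at trace time, here is a plain-Python sketch (no torch) of a slice whose end index is a sum of per-expert token counts. The names slice_routed_tokens and tokens_per_expert are hypothetical and do not come from the torchtitan code:

```python
def slice_routed_tokens(padded_tokens, tokens_per_expert):
    """Keep only the valid prefix of a padded token buffer."""
    # The end index is a sum of runtime values, playing the role of sym_sum_2.
    end = sum(tokens_per_expert)
    # Mirrors the shape of the FX node: slice(input, dim=0, start=0, end=end).
    return padded_tokens[0:end]


buffer = list(range(8))

# Same input sizes, different routing counts, different output lengths.
first = slice_routed_tokens(buffer, [2, 1, 3, 0])
second = slice_routed_tokens(buffer, [1, 0, 1, 1])
print(len(first), len(second))
```

  Both calls receive inputs of identical size, yet the output lengths differ, which is the property a compiler has to represent with a symbol that has no example value to specialize on.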

  Root cause: This is a PyTorch nightly inductor bug, not a torchtitan code bug. The aten.slice op with a symbolic (sym_sum_2) upper bound creates an unbacked symbol u12 that inductor's lowering doesn't propagate.
  This is likely a regression in the nightly torch build related to how inductor handles unbacked SymInts from dynamic slicing in the context of expert-parallel MoE models.

Labels

ciflow/8gpu, CLA Signed
