
[demo] verify fully_shard([norm, head]) and fully_shard([tok_embedding, norm, head]) works with chunked loss #2976

Closed

weifengpy wants to merge 6 commits into pytorch:main from weifengpy:chunked-ce-loss

Conversation

@weifengpy (Contributor) commented Apr 15, 2026

Removes the if/else branching on chunked loss in apply_fsdp: #2937

wwwjn added 6 commits April 10, 2026 15:40
Implements chunked cross-entropy loss that splits the sequence dimension
into N chunks, computing lm_head projection and CE loss per-chunk to avoid
materializing the full [B, L, V] logits tensor at once.

Key components:
- ChunkedCELoss: wraps lm_head + ce_loss with chunked forward/backward
- GradAccumulator: pre-allocated buffer for assembling chunk gradients
- _no_reshard_after_backward: FSDP2 context to avoid N all-gathers
- skip_lm_head kwarg on Decoder.forward() for the detach boundary
- ChunkedCELossFactory: deferred initialization (model not available at build time)
- Trainer integration with dedicated forward_backward_step branch
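
End to end, the flow is roughly the sketch below. This is an illustrative reconstruction from the description above, not the PR's actual `ChunkedCELoss`; `chunked_ce_forward_backward` and the tensor shapes are assumptions. The real implementation additionally runs the loop under the `_no_reshard_after_backward` context so the lm_head parameters are all-gathered once rather than N times.

```python
# Illustrative sketch only -- assumed helper, not the PR's ChunkedCELoss.
import torch
import torch.nn.functional as F


def chunked_ce_forward_backward(h, lm_head_weight, labels, num_chunks):
    """h: [B, L, D] decoder output (requires grad). Computes the mean CE loss
    chunk by chunk so only [B, L/num_chunks, V] logits are alive at a time,
    then drives a single backward through the decoder."""
    grad_h = torch.zeros_like(h)            # plays the role of GradAccumulator
    total_loss = torch.zeros((), device=h.device)
    num_tokens = labels.numel()

    for h_src, y_c, g_c in zip(
        h.chunk(num_chunks, dim=1),
        labels.chunk(num_chunks, dim=1),
        grad_h.chunk(num_chunks, dim=1),
    ):
        h_c = h_src.detach().requires_grad_(True)   # per-chunk detach boundary
        logits = F.linear(h_c, lm_head_weight)      # [B, L/N, V]
        loss_c = F.cross_entropy(
            logits.flatten(0, 1).float(), y_c.flatten(), reduction="sum"
        ) / num_tokens
        loss_c.backward()          # frees this chunk's logits; fills h_c.grad
        g_c.copy_(h_c.grad)        # assemble d(loss)/dh into the buffer
        total_loss += loss_c.detach()

    # One backward through the decoder, driven by the assembled gradient.
    (h * grad_h).sum().backward()
    return total_loss
```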
…CELoss

- Add loss_num_chunks to TrainingConfig (default 1, no-op)
- Trainer auto-wraps loss_fn in ChunkedCELossFactory when loss_num_chunks > 1
- Integration tests for FSDP, FSDP+TP(SP), FSDP+CP, FSDP+TP+CP, FSDP+compile
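
The trainer-side wiring is roughly the following sketch; `loss_num_chunks` and `ChunkedCELossFactory` come from this PR, while the config path and trainer attributes are assumptions for illustration.

```python
# Hedged sketch of the auto-wrap; exact config path is an assumption.
if job_config.training.loss_num_chunks > 1:
    # Deferred initialization: the factory binds to the model's lm_head
    # later, once the model has been built and parallelized.
    self.loss_fn = ChunkedCELossFactory(
        self.loss_fn, num_chunks=job_config.training.loss_num_chunks
    )
```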
FSDP2's backward hooks are one-shot per forward pass. The previous approach
of calling self.lm_head(h_chunk) triggered FSDP2's backward hooks during
chunk backward, leaving no hooks for the decoder backward (h.backward(grad)),
causing zero gradients on model parameters.

Fix: Use F.linear(h_chunk, lm_weight) to bypass FSDP2 module hooks during
chunk computation. Use (h * accumulated_grad).sum().backward() instead of
h.backward(grad) to properly trigger FSDP2's hooks in a single backward pass.
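
In code, the difference looks roughly like this (a paraphrased fragment of the description above, not runnable as-is; `h_chunk`, `lm_weight`, and `accumulated_grad` stand in for the PR's actual variables):

```python
# Broken: the FSDP2-wrapped module's backward hook fires on the first
# chunk's backward and is then spent, so the later decoder backward
# leaves zero gradients on the sharded parameters.
logits = self.lm_head(h_chunk)

# Fixed: use the weight functionally so no module hook fires per chunk...
logits = F.linear(h_chunk, lm_weight)

# ...and trigger FSDP2's one-shot hooks exactly once, in the final
# backward through the decoder, by folding the accumulated gradient in:
(h * accumulated_grad).sum().backward()   # instead of h.backward(grad)
```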
Replace bare function + build_fn pattern with proper loss classes.
CrossEntropyLoss and MSELoss encapsulate compilation logic internally.
The old function names (cross_entropy_loss, mse_loss) remain as public API
for backward compatibility. build_cross_entropy_loss and build_mse_loss
now return class instances.
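
A minimal sketch of the new shape (assuming a boolean compile flag on the training config; not the exact torchtitan code):

```python
import torch
import torch.nn.functional as F


def cross_entropy_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Kept as a public function for backward compatibility.
    return F.cross_entropy(pred.flatten(0, 1).float(), labels.flatten(0, 1))


class CrossEntropyLoss:
    """Callable loss object; whether to torch.compile is an internal detail."""

    def __init__(self, compile_loss: bool = False):
        self._fn = torch.compile(cross_entropy_loss) if compile_loss else cross_entropy_loss

    def __call__(self, pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return self._fn(pred, labels)


def build_cross_entropy_loss(job_config) -> CrossEntropyLoss:
    # Now returns a class instance rather than a bare (possibly compiled) function.
    # The config attribute below is an assumption for illustration.
    return CrossEntropyLoss(compile_loss=getattr(job_config.training, "compile", False))
```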
meta-cla bot added the CLA Signed label on Apr 15, 2026
weifengpy changed the title from "verify fully_shard([norm, head]) and fully_shard([tok_embedding, norm, head]) works with chunked loss" to "[demo] verify fully_shard([norm, head]) and fully_shard([tok_embedding, norm, head]) works with chunked loss" on Apr 15, 2026
@weifengpy weifengpy closed this May 8, 2026

Labels

ciflow/8gpu, CLA Signed
