
fix(generation): beam sample when num_beams * vocab_size exceeds multinomial limit #45251

Closed
balgaly wants to merge 2 commits into huggingface:main from balgaly:fix/beam-sample-multinomial-flat-dim-limit



balgaly commented Apr 5, 2026

Problem

torch.multinomial rejects inputs whose last dimension exceeds 2**24. Beam search with do_sample=True builds a flat distribution of size num_beams * vocab_size, which can exceed that limit (e.g. large beams + ~164k vocab) and crash during generation (#45245).
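
To make the size concrete, here is a rough back-of-the-envelope check. It assumes a vocabulary of exactly 164,000 tokens as a stand-in for the "~164k" figure; the constant and helper names are illustrative and are not taken from transformers or PyTorch.

```python
# Limit implied by the error text "number of categories cannot exceed 2^24".
MULTINOMIAL_MAX_CATEGORIES = 2**24  # 16_777_216


def flat_dim(num_beams: int, vocab_size: int) -> int:
    """Size of the flattened (beam, token) distribution used in beam sampling."""
    return num_beams * vocab_size


def min_beams_exceeding_limit(vocab_size: int) -> int:
    """Smallest num_beams whose flat dimension is strictly above the limit."""
    return MULTINOMIAL_MAX_CATEGORIES // vocab_size + 1


VOCAB_SIZE = 164_000  # assumption: illustrative value for a "~164k" vocabulary

print(min_beams_exceeding_limit(VOCAB_SIZE))  # 103
print(flat_dim(103, VOCAB_SIZE))              # 16892000, above 2**24
print(flat_dim(102, VOCAB_SIZE))              # 16728000, still below 2**24
```

Under that assumption, the flat dimension only crosses the limit at around a hundred beams.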

Solution

When the flat dimension is at or above 2**24, select beams_to_keep continuations via Gumbel-top-k on accumulated_log_probs. This is equivalent in distribution to multinomial(softmax(logits), k, replacement=False), but it avoids calling multinomial on an oversized input.
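
The Gumbel-top-k idea can be illustrated with a minimal pure-Python sketch. This is not the PR's code, which operates on tensors in utils.py; `gumbel_top_k` and `unflatten` are hypothetical helper names, and the standard library stands in for PyTorch so the example runs without extra dependencies.

```python
import heapq
import math
import random


def gumbel_top_k(log_probs, k, rng=None):
    """Pick k distinct indices by adding Gumbel(0, 1) noise and taking the top k.

    log_probs may be unnormalized; entries equal to -inf stay at -inf
    after perturbation and therefore rank last.
    """
    if not 0 < k <= len(log_probs):
        raise ValueError("k must be between 1 and len(log_probs)")
    if rng is None:
        rng = random.Random()
    perturbed = []
    for lp in log_probs:
        u = rng.random()
        while u == 0.0:  # keep u strictly inside (0, 1) so both logs are defined
            u = rng.random()
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append(lp + gumbel_noise)
    # Indices of the k largest perturbed scores, best first.
    return heapq.nlargest(k, range(len(log_probs)), key=perturbed.__getitem__)


def unflatten(flat_index, vocab_size):
    """Map a flat index back to (beam_index, token_id)."""
    return divmod(flat_index, vocab_size)


# Toy example: 2 beams, vocabulary of 3 tokens, flattened to 6 scores.
vocab_size = 3
accumulated = [math.log(p) for p in (0.3, 0.2, 0.1, 0.25, 0.1, 0.05)]
picked = gumbel_top_k(accumulated, k=4, rng=random.Random(0))
print([unflatten(i, vocab_size) for i in picked])
```

The selected indices are always distinct, which is what sampling without replacement requires. A tensor version would add the noise to the whole score tensor and take the top k in one call, for example with torch.topk.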

Tests

  • tests/generation/test_beam_search_multinomial_limit.py patches the limit to exercise the fallback on small tensors.
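
As a rough illustration of how patching the limit lets small inputs reach the fallback, here is a self-contained sketch. The `sampler` namespace, `MAX_CATEGORIES`, and `choose_path` are hypothetical stand-ins; the actual names used in the PR's test file are not shown in this thread.

```python
import types
from unittest import mock

# Hypothetical stand-in for the module that holds the threshold.
sampler = types.SimpleNamespace(MAX_CATEGORIES=2**24)


def choose_path(flat_dim: int) -> str:
    """Report which sampling path a given flat dimension would take."""
    return "gumbel_top_k" if flat_dim >= sampler.MAX_CATEGORIES else "multinomial"


small_flat_dim = 4 * 3  # e.g. 4 beams over a 3-token vocabulary

# With the limit patched down to 8, the small input takes the fallback.
with mock.patch.object(sampler, "MAX_CATEGORIES", 8):
    patched_path = choose_path(small_flat_dim)

# Outside the context manager the original limit is restored.
unpatched_path = choose_path(small_flat_dim)

print(patched_path, unpatched_path)
```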

Fixes #45245

…inomial limit

PyTorch multinomial requires the last dimension to be at most 2**24.
Beam search with do_sample flattens num_beams * vocab_size into one
dimension; large beams + large vocabs (e.g. huggingface#45245) crash on CUDA.

Use Gumbel-top-k when flat_dim >= 2**24, equivalent to sampling without
replacement from softmax(accumulated_log_probs).

Adds a unit test with a patched limit for small tensors.

Made-with: Cursor

github-actions bot commented Apr 5, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45251&sha=490973


balgaly commented Apr 7, 2026

The tests_hub failure appears unrelated to this PR. tests_generate passed, and it covers the changed code in utils.py and the new test in tests/generation/test_beam_search_multinomial_limit.py. Could a maintainer re-run tests_hub when convenient? Happy to address any concerns about the implementation.

@Rocketknight1
Member

No agent PRs on random issues please! (Swapping a multinomial for a gumbel-topk trick that adds loads of code bloat for a very rare edge case is the most code agent solution imaginable)


Development

Successfully merging this pull request may close these issues.

RuntimeError: number of categories cannot exceed 2^24
