
Fix CB Accuracy Regression under FA2 #45274

Open
Qubitium wants to merge 11 commits into huggingface:main from Qubitium:fix-fa2-max-bt

Conversation

Contributor

@Qubitium Qubitium commented Apr 7, 2026

What does this PR do?

  1. Fix: CUDA graph reuse for FA2 continuous batching was wrongly keyed, causing a quality collapse under specific configurations

CUDA graph reuse used the wrong key: replay depended only on padded tensor sizes, but FA varlen kernels also depend on non-tensor runtime ints such as max_seqlen_q and max_seqlen_k. That allowed CB to replay a graph captured for one FA runtime shape against a different one, which is why changing max_batch_tokens could change accuracy.
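A minimal sketch of the keying fix, with hypothetical names (`GraphKey`, `lookup_or_capture` are illustrative, not the actual transformers internals): the replay key must include the FA varlen runtime ints, not just the padded tensor shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphKey:
    # Padded tensor shapes alone are NOT enough to identify a captured graph:
    # FA varlen kernels bake non-tensor runtime ints into the capture.
    batch_size: int
    padded_seq_len: int
    # The fix: also key on the FA varlen runtime ints, so a graph captured
    # for one (max_seqlen_q, max_seqlen_k) pair is never replayed against
    # a different one.
    max_seqlen_q: int
    max_seqlen_k: int


def lookup_or_capture(cache: dict, key: GraphKey, capture_fn):
    """Replay a cached CUDA graph only on an exact key match; capture otherwise."""
    if key not in cache:
        cache[key] = capture_fn()
    return cache[key]
```

With the old key (shapes only), two requests with the same padded shape but different max_seqlen_k would collide and replay the wrong graph; including the runtime ints forces a fresh capture instead.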

  2. Fix: remove the decode-fast-path gate that rejected FA2 for no clear reason (likely related to the bug above)

_ensure_decode_fast_path_is_available() only accepted FA3, but FA2/FA3 should both be supported.

Looking at the comments that installed the FA3-only gate, I strongly suspect the first bug is the root cause of the output variance that led to FA2 being gated off.

  MAIN = /tmp/transformers-main-head @ d3c7a19176
  PR = /root/transformers @ 33e9cbb08e
  Workload: Evaluation on GSM8K Platinum, paged|flash_attention_2, bf16, batch_size=24, max_rows=128, use_cuda_graph=auto, use_async_batching=auto.
  

  Sweep Results (--> marks the regressed row)
  +--------+--------------+------------+---------+-------------+-----------+----------+--------------+------------+
  | max_bt | MAIN acc     | PR acc     | Δ acc   | MAIN samp/s | PR samp/s | Δ samp/s | MAIN peak GB | PR peak GB |
  +========+==============+============+=========+=============+===========+==========+==============+============+
  | 128    | 0.4140625    | 0.4140625  | +0.0000 | 9.7604      | 10.1033   | +3.51%   | 34.7090      | 34.7090    |
  | 256    | 0.4140625    | 0.4453125  | +0.0312 | 9.7349      | 8.7137    | -10.49%  | 34.9004      | 34.9004    |
-->| 384    | 0.2265625    | 0.4531250  | +0.2266 | 10.0055     | 9.9615    | -0.44%   | 35.0859      | 35.0898    |
  | 512    | 0.4218750    | 0.4062500  | -0.0156 | 10.0384     | 9.7630    | -2.74%   | 35.2949      | 35.2949    |
  | 640    | 0.3984375    | 0.4140625  | +0.0156 | 10.2056     | 9.5572    | -6.35%   | 35.4883      | 35.4902    |
  | 768    | 0.3984375    | 0.4062500  | +0.0078 | 10.0023     | 9.7447    | -2.58%   | 35.6836      | 35.6836    |
  | 896    | FAIL         | FAIL       | n/a     | n/a         | n/a       | n/a      | n/a          | n/a        |
  | 1024   | FAIL         | FAIL       | n/a     | n/a         | n/a       | n/a      | n/a          | n/a        |
  +--------+--------------+------------+---------+-------------+-----------+----------+--------------+------------+

  Failure Details
  +--------+--------------------------------+-----------------------+
  | max_bt | MAIN                           | PR                    |
  +========+================================+=======================+
  | 896    | illegal memory access          | illegal memory access |
  | 1024   | CUBLAS_STATUS_EXECUTION_FAILED | illegal memory access |
  +--------+--------------------------------+-----------------------+

Look at the accuracy collapse on MAIN at max_bt (max batched tokens for CB) == 384!

The 896/1024 failures are caused by another, unrelated bug whose fix I did not push to this PR, so this one stays clean and isolated.

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@remi-or @ArthurZucker @McPatate

@Qubitium Qubitium changed the title Fix fa2 max bt Fix CB Accuracy Regression under FA2 Apr 7, 2026