
Fix CB Accuracy Regression under FA2 #45274

Open
Qubitium wants to merge 11 commits into huggingface:main from Qubitium:fix-fa2-max-bt

Conversation

Contributor

@Qubitium Qubitium commented Apr 7, 2026

What does this PR do?

  1. Fix: CUDA graph reuse for FA2 continuous batching was wrongly keyed, causing a quality collapse under specific configurations

CUDA graph reuse used the wrong key: replay depended only on padded tensor sizes, but FA varlen kernels also depend on non-tensor runtime ints such as max_seqlen_q and max_seqlen_k. That allowed CB to replay a graph captured for one FA runtime shape against a different one, which is why changing max_batch_tokens could change accuracy.
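A minimal sketch of the keying fix, with hypothetical names (`GraphKey`, `lookup_or_capture` are illustrative, not the actual transformers internals): the replay key must include the FA varlen runtime ints, not just the padded tensor shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphKey:
    # Padded tensor shapes alone are NOT enough to identify a captured graph:
    # FA varlen kernels bake non-tensor runtime ints into the capture.
    batch_size: int
    padded_seq_len: int
    # The fix: also key on the FA varlen runtime ints, so a graph captured
    # for one (max_seqlen_q, max_seqlen_k) pair is never replayed against
    # a different one.
    max_seqlen_q: int
    max_seqlen_k: int


def lookup_or_capture(cache: dict, key: GraphKey, capture_fn):
    """Replay a cached CUDA graph only on an exact key match; capture otherwise."""
    if key not in cache:
        cache[key] = capture_fn()
    return cache[key]
```

With the old key (shapes only), two requests with the same padded shape but different max_seqlen_k would collide and replay the wrong graph; including the runtime ints forces a fresh capture instead.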

  2. Fix: remove the decode-fast-path gate that rejected FA2 for no clear reason (likely related to the bug above)

_ensure_decode_fast_path_is_available() only accepted FA3, but FA2/FA3 should both be supported.

Looking at the comments that installed the FA3-only gate, I strongly suspect the first bug is the root cause of the output variance that led to FA2 being gated off.

  MAIN = /tmp/transformers-main-head @ d3c7a19176
  PR = /root/transformers @ 33e9cbb08e
  Workload: Evaluation on GSM8K Platinum, paged|flash_attention_2, bf16, batch_size=24, max_rows=128, use_cuda_graph=auto, use_async_batching=auto.
  

  Sweep Results (--> marks the regressed row)
  +--------+--------------+------------+---------+-------------+-----------+----------+--------------+------------+
  | max_bt | MAIN acc     | PR acc     | Δ acc   | MAIN samp/s | PR samp/s | Δ samp/s | MAIN peak GB | PR peak GB |
  +========+==============+============+=========+=============+===========+==========+==============+============+
  | 128    | 0.4140625    | 0.4140625  | +0.0000 | 9.7604      | 10.1033   | +3.51%   | 34.7090      | 34.7090    |
  | 256    | 0.4140625    | 0.4453125  | +0.0312 | 9.7349      | 8.7137    | -10.49%  | 34.9004      | 34.9004    |
-->| 384    | 0.2265625    | 0.4531250  | +0.2266 | 10.0055     | 9.9615    | -0.44%   | 35.0859      | 35.0898    |
  | 512    | 0.4218750    | 0.4062500  | -0.0156 | 10.0384     | 9.7630    | -2.74%   | 35.2949      | 35.2949    |
  | 640    | 0.3984375    | 0.4140625  | +0.0156 | 10.2056     | 9.5572    | -6.35%   | 35.4883      | 35.4902    |
  | 768    | 0.3984375    | 0.4062500  | +0.0078 | 10.0023     | 9.7447    | -2.58%   | 35.6836      | 35.6836    |
  | 896    | FAIL         | FAIL       | n/a     | n/a         | n/a       | n/a      | n/a          | n/a        |
  | 1024   | FAIL         | FAIL       | n/a     | n/a         | n/a       | n/a      | n/a          | n/a        |
  +--------+--------------+------------+---------+-------------+-----------+----------+--------------+------------+

  Failure Details
  +--------+--------------------------------+-----------------------+
  | max_bt | MAIN                           | PR                    |
  +========+================================+=======================+
  | 896    | illegal memory access          | illegal memory access |
  | 1024   | CUBLAS_STATUS_EXECUTION_FAILED | illegal memory access |
  +--------+--------------------------------+-----------------------+

Look at the accuracy collapse on MAIN at max_bt (max batched tokens for CB) == 384!

The 896/1024 failures are caused by another, unrelated bug whose fix I did not push to this PR, so this one stays clean and isolated.

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@remi-or @ArthurZucker @McPatate

@Qubitium Qubitium changed the title Fix fa2 max bt Fix CB Accuracy Regression under FA2 Apr 7, 2026