Skip to content

aya-ebpf: document FExit ret kernel range#1584

Merged
tamird merged 1 commit into
aya-rs:mainfrom
swananan:doc/fexit-ret-verifier-fix-note
May 26, 2026
Merged

aya-ebpf: document FExit ret kernel range#1584
tamird merged 1 commit into
aya-rs:mainfrom
swananan:doc/fexit-ret-verifier-fix-note

Conversation

@swananan
Copy link
Copy Markdown
Contributor

@swananan swananan commented May 25, 2026

kernel verifier git-bisect

Follow-up to #1574.

I ran a kernel git bisect for the verifier issue hit by the FExitContext::ret() test, using this minimized C reproducer as the bisect test case:

https://gist.github.com/swananan/165cca6008f6c81870a28aa7a445d5ea

The bisect identified the upstream fix as:
torvalds/linux@d028f87

One important note for 6.1.91 backporting: applying d028f87517d6 alone is not sufficient. The backport also needs the verifier range-tracking semantics from:

torvalds/linux@9e314f5

Without that prerequisite, the 6.1.91 verifier can still lose the refined bounds needed for this case.

root cause analysis

I realized that the root cause was not explained clearly enough in #1574, so I dug deeper into the verifier logic. I also added temporary logging to the kernel verifier to confirm the state transitions, then re-analyzed the issue based on those logs.

A few verifier details were not obvious to me at first. Summarizing my current understanding here:

  • The scalar ID is part of the verifier's register state, not something tied to source-level control-flow blocks like {}. Its lifetime follows the register value inside each verifier state: it can be created for an unknown scalar, copied by assignments such as r7 = r0, refined by branch conditions, and lost or replaced when the register is overwritten or transformed. So the shared ID only means that, in that verifier state, two registers are currently known to hold the same scalar value. It does not mean the verifier understands the original source-level logic.
  • The verifier explores possible instruction-level paths and makes decisions based on the register state it has tracked for each path. It does not infer facts from the source-level intent unless those facts are represented in the register state.
  • The verifier also uses state pruning to avoid re-analyzing the same program suffix for many similar paths. It caches register/stack states at selected instructions. When another path reaches the same instruction, the verifier compares the current state with the cached one using regsafe()/states_equal(). If the cached state is considered sufficient to prove the current state safe, the verifier treats the current path as already analyzed and stops exploring it. For scalar registers, this depends on liveness, precision marks, range/tnum constraints, and scalar ID relationships.
15: (85) call bpf_get_func_ret#184    ; R0_w=scalar() fp-8_w=mmmmmmmm
; if ret == 0 {
16: (79) r7 = *(u64 *)(r10 -8)        ; R7_w=scalar() R10=fp0
17: (15) if r0 == 0x0 goto pc+1       ; R0_w=scalar()
18: (bf) r7 = r0                      ; R0=scalar(id=1) R7=scalar(id=1)
; match ctx.ret::<T>() {
19: (55) if r0 != 0x0 goto pc+6       ; R0=0
20: (67) r7 <<= 32                    ; R7_w=0
21: (77) r7 >>= 32                    ; R7_w=0
22: (b7) r1 = 1                       ; R1_w=1
; if retval == expected {
23: (55) if r7 != 0xf goto pc+1
  1. At line 15, bpf_get_func_ret() returns an unknown scalar to the verifier. At runtime, it succeeds, so r0 == 0.

  2. At line 16, the traced function return value is loaded into r7. At runtime, this value is 15.

  3. At line 17, the program checks if r0 == 0. The jump target is the success path, while the fallthrough path is the failure path and should imply r0 != 0. The verifier explores both paths: it continues with the fallthrough path first and saves the jump target for later analysis.

  4. On 6.1.91, the verifier does not record the r0 != 0 fact for the fallthrough path of line 17. Then line 18 executes r7 = r0, so the verifier gives r0 and r7 the same scalar ID, but still treats both as possibly zero.

  5. At line 19, the program checks if r0 != 0. This should be always true on the failure path, but the verifier still thinks r0 may be zero, so it explores the fallthrough path as well.

  6. Line 19 is a JNE instruction: if r0 != 0 goto .... For this instruction, the jump target means r0 != 0, and the fallthrough path means r0 == 0. Because the verifier explores that fallthrough path, it narrows r0 to 0. Since r7 shares the same scalar ID as r0, the verifier also concludes r7 == 0. This creates an impossible path: it came from the failure path of line 17, which should mean r0 != 0, but it is now analyzed as r0 == 0.

  7. The impossible path continues to line 23. Since the verifier now thinks r7 == 0, it concludes that if r7 != 15 is always true. So it only explores the branch that keeps error = 1.

  8. Later, the verifier pops the saved success path from line 17. This path arrives at line 19 with r0 == 0, while r7 still holds the traced function return value. This is the real runtime path that should eventually prove r7 == 15 and set error = 0.

  9. When this real success path reaches line 19, the verifier performs state pruning. It finds an earlier cached state at the same instruction from the previous exploration. According to regsafe(), the current state is considered safe against that cached state: the cached r0 is an imprecise scalar, and the cached r7 constraints are still loose enough to accept the current r7. Therefore the verifier stops analyzing this path.

  10. Because the real success path is pruned at line 19, the verifier never analyzes the real continuation where r0 == 0 and r7 == 15 reaches line 23. The only explored continuation at line 23 came from the ghost path where r7 == 0, so the verifier treats if r7 != 15 as always true. Later branch hard-wiring preserves that wrong conclusion, and the program reports error = 1 at runtime.

torvalds/linux@d028f87 fixes the missing != 0 refinement on the fallthrough path of line 17, so the verifier can represent r0 more precisely as [1, U64_MAX] instead of a fully unknown scalar.
torvalds/linux@9e314f5 fixes the later precision loss in the old range-combining logic. Without that semantic change, 6.1.91 can still discard the refined bounds after learning them.

Overall, this backport is a bit tricky because the verifier logic changed substantially between the tested 6.1.91 LTS baseline and the upstream fixed baseline (v6.8). Because of that, this is not a clean textual cherry-pick for 6.1.91.

However, the actual change needed for 6.1.91 looks manageable: the torvalds/linux@9e314f5 semantics need to be preserved, and the torvalds/linux@d028f87 logic only needs a small adaptation to fit the older verifier code.

Added/updated tests?

We strongly encourage you to add a test for your changes.

  • No, and this is why: just updated the doc

Checklist

  • Rust code has been formatted with cargo +nightly fmt.
  • All clippy lints have been fixed.
    You can find failing lints with cargo xtask clippy.
  • Unit tests are passing locally with cargo test.
  • The Integration tests are passing locally.
  • I have blessed any API changes with cargo xtask public-api --bless.

(Optional) What GIF best describes this PR or how it makes you feel?


This change is Reviewable

Kernel git bisect identified the upstream verifier fix:

torvalds/linux@d028f87

`bpf_get_func_ret` was added in v5.17. The workaround is therefore
for kernels v5.17 through v6.7, before the fix is present in v6.8.
@swananan swananan requested a review from a team as a code owner May 25, 2026 16:15
@netlify
Copy link
Copy Markdown

netlify Bot commented May 25, 2026

Deploy Preview for aya-rs-docs ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit e36ffb8
🔍 Latest deploy log https://app.netlify.com/projects/aya-rs-docs/deploys/6a14758ce163d40008897671
😎 Deploy Preview https://deploy-preview-1584--aya-rs-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Copy Markdown
Member

@tamird tamird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. According to https://www.kernel.org/category/releases.html: 6.6, 6.1, 5.15, and 5.10 are all LTS that are not yet EOL. Might be worth an email to see which of these are worth doing the backport for. cc @4ast

@tamird reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on swananan).

@tamird tamird merged commit 46759c9 into aya-rs:main May 26, 2026
21 of 22 checks passed
@swananan
Copy link
Copy Markdown
Contributor Author

Sure, I will prepare the patch series and send an email.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants