aya-ebpf: document FExit ret kernel range #1584
Merged
Kernel git bisect identified the upstream verifier fix: torvalds/linux@d028f87. `bpf_get_func_ret` was added in v5.17, so the workaround is needed for kernels v5.17 through v6.7; the fix is present from v6.8 onward.
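The version range above can be expressed as a small predicate. This is a minimal sketch, not code from this PR: `KernelVersion` and `needs_fexit_ret_workaround` are hypothetical names, and it only looks at upstream major/minor numbers, so it would not account for stable or distro kernels that carry a backport of the fix.

```rust
/// Upstream kernel version as (major, minor); the patch level is ignored.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct KernelVersion {
    pub major: u32,
    pub minor: u32,
}

/// First release that has `bpf_get_func_ret` (v5.17, per the description above).
const HELPER_ADDED: KernelVersion = KernelVersion { major: 5, minor: 17 };

/// First release that contains the verifier fix (v6.8, per the description above).
const VERIFIER_FIXED: KernelVersion = KernelVersion { major: 6, minor: 8 };

/// Returns true for kernels in the half-open range [v5.17, v6.8).
/// Assumption: no backports are considered, only upstream version numbers.
pub fn needs_fexit_ret_workaround(v: KernelVersion) -> bool {
    // The derived `Ord` compares `major` first, then `minor`.
    v >= HELPER_ADDED && v < VERIFIER_FIXED
}
```

With this sketch, v6.1 and v6.7 fall inside the range, while v5.16 and v6.8 fall outside it.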
tamird
approved these changes
May 26, 2026
Member
tamird
left a comment
Nice. According to https://www.kernel.org/category/releases.html: 6.6, 6.1, 5.15, and 5.10 are all LTS that are not yet EOL. Might be worth an email to see which of these are worth doing the backport for. cc @4ast
@tamird reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on swananan).
Contributor
Author
Sure, I will prepare the patch series and send an email.
kernel verifier git-bisect
Follow-up to #1574.
I ran a kernel `git bisect` for the verifier issue hit by the `FExitContext::ret()` test, using this minimized C reproducer as the bisect test case: https://gist.github.com/swananan/165cca6008f6c81870a28aa7a445d5ea
The bisect identified the upstream fix as:
torvalds/linux@d028f87
One important note for 6.1.91 backporting: applying `d028f87517d6` alone is not sufficient. The backport also needs the verifier range-tracking semantics from torvalds/linux@9e314f5.
Without that prerequisite, the 6.1.91 verifier can still lose the refined bounds needed for this case.
root cause analysis
I realized that the root cause was not explained clearly enough in #1574, so I dug deeper into the verifier logic. I also added temporary logging to the kernel verifier to confirm the state transitions, then re-analyzed the issue based on those logs.
A few verifier details were not obvious to me at first. Summarizing my current understanding here:
- Scalar IDs. A scalar ID's lifetime follows the register value inside each verifier state: it can be created for an unknown scalar, copied by assignments such as `r7 = r0`, refined by branch conditions, and lost or replaced when the register is overwritten or transformed. So the shared ID only means that, in that verifier state, two registers are currently known to hold the same scalar value. It does not mean the verifier understands the original source-level logic.
- State pruning via `regsafe()`/`states_equal()`. If the cached state is considered sufficient to prove the current state safe, the verifier treats the current path as already analyzed and stops exploring it. For scalar registers, this depends on liveness, precision marks, range/tnum constraints, and scalar ID relationships.

At line 15, `bpf_get_func_ret()` returns an unknown scalar to the verifier. At runtime, it succeeds, so `r0 == 0`.

At line 16, the traced function return value is loaded into `r7`. At runtime, this value is `15`.

At line 17, the program checks `if r0 == 0`. The jump target is the success path, while the fallthrough path is the failure path and should imply `r0 != 0`. The verifier explores both paths: it continues with the fallthrough path first and saves the jump target for later analysis.

On 6.1.91, the verifier does not record the `r0 != 0` fact for the fallthrough path of line 17. Then line 18 executes `r7 = r0`, so the verifier gives `r0` and `r7` the same scalar ID, but still treats both as possibly zero.

At line 19, the program checks `if r0 != 0`. This should always be true on the failure path, but the verifier still thinks `r0` may be zero, so it explores the fallthrough path as well.

Line 19 is a `JNE` instruction: `if r0 != 0 goto ...`. For this instruction, the jump target means `r0 != 0`, and the fallthrough path means `r0 == 0`. Because the verifier explores that fallthrough path, it narrows `r0` to `0`. Since `r7` shares the same scalar ID as `r0`, the verifier also concludes `r7 == 0`. This creates an impossible path: it came from the failure path of line 17, which should mean `r0 != 0`, but it is now analyzed as `r0 == 0`.

The impossible path continues to line 23. Since the verifier now thinks `r7 == 0`, it concludes that `if r7 != 15` is always true. So it only explores the branch that keeps `error = 1`.

Later, the verifier pops the saved success path from line 17. This path arrives at line 19 with `r0 == 0`, while `r7` still holds the traced function return value. This is the real runtime path that should eventually prove `r7 == 15` and set `error = 0`.

When this real success path reaches line 19, the verifier performs state pruning. It finds an earlier cached state at the same instruction from the previous exploration. According to `regsafe()`, the current state is considered safe against that cached state: the cached `r0` is an imprecise scalar, and the cached `r7` constraints are still loose enough to accept the current `r7`. Therefore the verifier stops analyzing this path.

Because the real success path is pruned at line 19, the verifier never analyzes the real continuation where `r0 == 0` and `r7 == 15` reaches line 23. The only explored continuation at line 23 came from the ghost path where `r7 == 0`, so the verifier treats `if r7 != 15` as always true. Later branch hard-wiring preserves that wrong conclusion, and the program reports `error = 1` at runtime.

torvalds/linux@d028f87 fixes the missing `!= 0` refinement on the fallthrough path of line 17, so the verifier can represent `r0` more precisely as `[1, U64_MAX]` instead of a fully unknown scalar.

torvalds/linux@9e314f5 fixes the later precision loss in the old range-combining logic. Without that semantic change, 6.1.91 can still discard the refined bounds after learning them.

Overall, this backport is a bit tricky because the verifier logic changed substantially between the tested 6.1.91 LTS baseline and the upstream fixed baseline (`v6.8`). Because of that, this is not a clean textual cherry-pick for 6.1.91.

However, the actual change needed for 6.1.91 looks manageable: the torvalds/linux@9e314f5 semantics need to be preserved, and the torvalds/linux@d028f87 logic only needs a small adaptation to fit the older verifier code.
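The refinement and pruning steps in the root cause analysis can be illustrated with a toy model. This is only a sketch for intuition and is not kernel code: `Range`, `assume_ne_zero`, `assume_eq_zero`, `covers`, and `ghost_path_feasible` are hypothetical names, the model tracks a single unsigned range per register, and it ignores tnums, liveness, precision marks, and scalar IDs that the real `regsafe()` considers. The `refine` flag stands in for whether the `!= 0` fact is recorded on the fallthrough of line 17.

```rust
/// Toy unsigned range for a scalar register: every value in [min, max].
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Range {
    pub min: u64,
    pub max: u64,
}

impl Range {
    /// A fully unknown scalar, like the result of line 15 in the walkthrough.
    pub const UNKNOWN: Range = Range { min: 0, max: u64::MAX };

    /// Range on a path where `reg != 0` holds, or `None` if that path is infeasible.
    /// `refine == false` models the unfixed behaviour: the fact is not recorded.
    pub fn assume_ne_zero(self, refine: bool) -> Option<Range> {
        if self.max == 0 {
            return None; // the register is known to be 0, so `!= 0` cannot hold
        }
        if refine && self.min == 0 {
            Some(Range { min: 1, max: self.max }) // [1, max]
        } else {
            Some(self)
        }
    }

    /// Range on a path where `reg == 0` holds, or `None` if that path is infeasible.
    pub fn assume_eq_zero(self) -> Option<Range> {
        if self.min == 0 {
            Some(Range { min: 0, max: 0 })
        } else {
            None // 0 is outside the range, so the path is dead
        }
    }

    /// Simplified pruning check: the cached range `self` accepts every value of `cur`.
    pub fn covers(self, cur: Range) -> bool {
        self.min <= cur.min && cur.max <= self.max
    }
}

/// Is the "ghost path" (fallthrough of line 17, then fallthrough of line 19) explored?
pub fn ghost_path_feasible(refine: bool) -> bool {
    let r0 = Range::UNKNOWN; // line 15: unknown scalar
    // Line 17 fallthrough: the failure path, which should imply r0 != 0.
    let r0 = match r0.assume_ne_zero(refine) {
        Some(r) => r,
        None => return false,
    };
    // Line 18 copies r0 into r7; line 19 fallthrough then requires r0 == 0.
    r0.assume_eq_zero().is_some()
}
```

In this model, `ghost_path_feasible(false)` is true (the contradictory path is explored), while `ghost_path_feasible(true)` is false because `r0` is already `[1, U64_MAX]`. The `covers` check shows why a loose cached range accepts a tighter current one, which is the condition under which the real success path gets pruned.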