Skip to content

Fix epoll backend losing signal counts on eventfd read failure#5348

Open
mvanhorn wants to merge 1 commit into
ponylang:mainfrom
mvanhorn:fix/5152-uninitialized-eventfd-signal-count-epoll
Open

Fix epoll backend losing signal counts on eventfd read failure#5348
mvanhorn wants to merge 1 commit into
ponylang:mainfrom
mvanhorn:fix/5152-uninitialized-eventfd-signal-count-epoll

Conversation

@mvanhorn
Copy link
Copy Markdown
Contributor

Summary

Initializes missed = 0 and explicitly handles read() failure on the eventfd signal-dispatch path in the epoll backend, so a failed read no longer forwards uninitialized stack data as the missed-signal count to subscribers.

Why this matters

In src/libponyrt/asio/epoll.c, the signal-dispatch path around line 318 reads the eventfd to recover the missed-signal count:

uint64_t missed;
ssize_t rc = read(ev->fd, &missed, sizeof(uint64_t));
(void)rc;
flags |= ASIO_SIGNAL;
count = (uint32_t)missed;

If read() fails (EAGAIN, EINTR, partial), missed is uninitialized stack data and that garbage value gets forwarded to the actor's notifier via pony_asio_event_send's arg. The timer dispatch path has the same read() pattern but discards the value, so it's harmless there. For signals, the count is observable.

Surfaced during review of #5129 (timerfd return checking) and filed as #5152.

Testing

ASIO signal-delivery timing is impractical to exercise in a unit test without timer manipulation. The fix is small and the failure mode (uninitialized stack data) is what the change addresses directly.

Fixes #5152

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label May 16, 2026
@SeanTAllen SeanTAllen added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label May 17, 2026
@SeanTAllen
Copy link
Copy Markdown
Member

please give this PR a title that is a good entry in the CHANGELOG and please stop editing the next-release file directly.

@ponylang-main
Copy link
Copy Markdown
Contributor

Hi @mvanhorn,

The changelog - fixed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 5348.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

@SeanTAllen
Copy link
Copy Markdown
Member

Beyond the two issues I already flagged, I ran our pony-code-review skill against this and a stack of findings came out. Here they are.

The fix is incomplete. count=0 lies to the user code. When read() fails, the new code still sets ASIO_SIGNAL and dispatches with count = 0. The docstring on SignalNotify.apply says count is "the number of times the signal has fired." A handler that does something like for _ in Range(0, count.usize()) do handle_signal() end will silently drop the signal that woke it up. Going from "uninitialized garbage" to "deterministic zero" is an improvement on the UB axis, but it trades one wrong value for another. The right move is probably to not dispatch at all on read failure and let the next epoll iteration handle the real event.

The release note describes a failure mode that doesn't exist. "Or returned a short result." On Linux, eventfd reads are atomic 8-byte. Short reads aren't a thing. The defensive if(rc != sizeof(uint64_t)) is fine to keep, but the prose describing it as a real failure mode isn't accurate.

No regression test. The PR body says exercising this requires timer manipulation. It doesn't. An unwritten EFD_NONBLOCK eventfd returns EAGAIN on read with zero ceremony. Factor a small read_signal_count helper and unit test it in C. The existing Pony-level _TestSignalINT doesn't assert on count, so a future regression here is invisible regardless of what the fix does.

The timer dispatch path right above has the same risky pattern. Same uint64_t missed; ... (void)rc; shape. It's safe today because missed is unused there. But that pattern is now asymmetric with the signal branch, and the next contributor who decides to use missed for timer coalescing will walk right back into this bug. Either init it for hygiene or drop a comment explaining why the asymmetry is intentional.

Smaller stuff. The double zero-init of missed is redundant. Pick one. The release note phrasing leans on implementation terms ("signal notifier," "dispatch with a zero count") instead of describing what a user actually sees. The existing signal test should probably assert count >= 1 while we're in there.

Out of scope, but worth filing separately. kqueue.c around line 210 has an analogous unchecked-read pattern. Same bug class. Different PR. The cross-backend semantics of count are also inconsistent: Windows hardcodes 1, kqueue uses real occurrence counts, epoll after this fix can send 0. The docstring describes one behavior; reality is three. And: MemorySanitizer isn't in our CI. ASan and UBSan don't catch this bug class. Worth a discussion.


A meta question. Are you planning to open LLM-driven PRs to ponylang going forward? If so, please run our pony-code-review skill against your change before opening the PR. I asked you about this previously and the shape of this one suggests it didn't happen. Most of what I flagged above is exactly what that skill surfaces. When the author doesn't run it and the maintainers do, you're trading the LLM's time for ours. We can just as easily run a fixer LLM against these findings ourselves and use the skill as our review pass. The point of the skill is to compress that loop on the author's side, not the maintainer's.

@SeanTAllen SeanTAllen added the do not merge This PR should not be merged at this time label May 17, 2026
…kend (issue ponylang#5152)

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
@mvanhorn mvanhorn force-pushed the fix/5152-uninitialized-eventfd-signal-count-epoll branch from c3033c0 to 96ec361 Compare May 17, 2026 23:44
@mvanhorn mvanhorn changed the title [Fix] Uninitialized signal count on eventfd read failure in epoll backend (issue #5152) Fix epoll backend losing signal counts on eventfd read failure May 17, 2026
@mvanhorn
Copy link
Copy Markdown
Contributor Author

@SeanTAllen thanks for the detailed review. Pushed 96ec361. Hitting your meta question first and then the technical findings:

On the AI question: yes, this PR was AI-assisted, and you're right that I should be running pony-code-review against my changes before opening. I'll do that going forward. Apologies for the loop you've been pulled into here. The findings below were addressed with another AI pass against your review specifically; I read each finding and the resulting diff before pushing, so I'm responsible for what's in 96ec361, but I want to be straight about the workflow.

On the technical findings:

  1. count=0 lies. Agreed, and you're right that "deterministic zero" is the wrong fix. The signal branch now only sets flags |= ASIO_SIGNAL when the eventfd read returned sizeof(uint64_t). On read failure we just leave flags alone and let the next epoll iteration pick up the real event. Factored the read into a read_signal_count static helper at the same time.

  2. Release-note prose. Rewrote to drop "short result", describe what the user actually sees (signal handlers no longer wake with a misleading zero count), and reflect the new "don't dispatch on read failure" behavior.

  3. Regression test. Added test/libponyrt/asio_epoll.cc with three Linux-gated tests against the unwritten-EFD_NONBLOCK (returns false, leaves *out_count untouched), single-write (returns the written value), and double-write-then-drain (eventfd accumulates and the second read returns false) scenarios. Wired into test/libponyrt/CMakeLists.txt. Also added _h.assert_true(count >= 1) to _TestSighupNotify so a regression on the production path is visible too.

  4. Asymmetric timer dispatch. Timer branch is now uint64_t missed = 0; with an explicit (void)missed; discard and a comment naming the asymmetry so the next contributor sees it before re-introducing the bug.

  5. Smaller cleanups. Double zero-init is gone (the if(rc != sizeof(uint64_t)) missed = 0; line went away with finding 1). Release note phrasing rewritten per item 2.

On scope. I left kqueue.c, the cross-backend count docstring inconsistency, and MemorySanitizer-in-CI alone in this PR per your "out of scope" notes. Happy to open separate issues for any of them if useful.

Local verification: my workstation can't build ponyc cleanly (make configure can't find LLVM here), so the new tests haven't been run locally. CI on the PR will be the authority.

@SeanTAllen
Copy link
Copy Markdown
Member

SeanTAllen commented May 18, 2026

@mvanhorn please run the same review on your other open PRs. there are some issues there. some quite problematic. My ask from you would be, if the questions raised by those reviews aren't ones you feel comfortable answering then do 1 of 2 things:

  • invest in learning enough about Pony to be able to answer those questions
  • drop trying to do that particular issue

If you want to learn about pony and the compiler, we are happy to do it in the zulip, but if you are mainly interested in contributing vibe coded stuff to a wide range of projects (that's the impression your profile README gives), then please be respectful of our time with how you do that here and only open PRs you fully understand what is going on/the review from our tool comes back with no questions/parked items or only ones you feel confident yourself in guiding the answer to.

Please let me know if you have any questions. Thanks for the enthusiasm. Hopefully we can make this mutually beneficial. If not, at worst, you have been quite instructive with getting us to talk about how we want to approach LLM assisted changes from non committers.

@mvanhorn
Copy link
Copy Markdown
Contributor Author

@SeanTAllen - thanks for the push to actually run pony-code-review on the others. I should have found ponylang/llm-skills and installed it before opening any of these PRs.

Ran the lightweight mode against all three:

On your ask: I'll only open ponylang PRs going forward where pony-code-review comes back clean and I can defend the parked items. That's the right bar - apologies for not meeting it the first time. The hope is that AI-assisted PRs can add value here once they go through the same review gate maintainers do, not before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge discuss during sync Should be discussed during an upcoming sync do not merge This PR should not be merged at this time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Uninitialized signal count on eventfd read failure in epoll backend

3 participants