Add diagnostics for Linux XDP Debug BVT runner crash #5914
Draft
The Debug+UseXdp Linux BVT job consistently crashes the GitHub Actions
runner (at ~72 min, vs ~92 min for a successful run). The runner
disconnects entirely, so the Test step logs are lost (HTTP 404).

Two-pronged approach to capture diagnostics:

1. test.yml: Add a 'System Diagnostics (pre-test)' step that runs
   BEFORE the Test step. Because this step completes normally, its
   logs are preserved even when the runner later crashes. Captures
   baseline memory, disk, kernel version, dmesg, core pattern, and
   CPU info.

2. test.ps1: For the Linux XDP sudo path:
   - Print memory/disk/load before and after each test binary
   - Check dmesg for OOM killer, XDP, BPF, and segfault messages
   - Start a background resource monitor that logs to a file every
     30 seconds under artifacts/xdp_diagnostics/ (uploaded as artifact
     if the runner survives long enough)
   - Limit core dumps to 1 GB (ulimit -c 1048576) to prevent
     cascading crashes from filling the disk
   - Add process timeout (6000s) as safety net against hangs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
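
The background resource monitor described in this commit could look roughly like the bash sketch below. It is an illustration only, not the script in test.ps1: the function names `snapshot` and `monitor_loop` and the log file name are hypothetical, and the fields logged (available memory, free disk on /, 1-minute load) are an assumed subset of what the PR captures.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a resource monitor; names and fields are assumptions.

# Print one timestamped line: available memory (MiB), free disk on / (MiB),
# and the 1-minute load average. Linux only (reads /proc).
snapshot() {
    local mem_mb disk_mb load
    mem_mb=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)
    disk_mb=$(df -Pk / | awk 'NR==2 {print int($4/1024)}')
    load=$(cut -d' ' -f1 /proc/loadavg)
    printf '%s mem_avail_mb=%s disk_free_mb=%s load1=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$mem_mb" "$disk_mb" "$load"
}

# Append a snapshot to the log file ($1) every $2 seconds (default 30)
# until the loop is killed.
monitor_loop() {
    local log_file=$1 interval=${2:-30}
    mkdir -p "$(dirname "$log_file")"
    while true; do
        snapshot >> "$log_file"
        sleep "$interval"
    done
}
```

A caller would start it in the background before the test binary, for example `monitor_loop artifacts/xdp_diagnostics/monitor.log 30 &`, and kill the saved PID afterwards. Appending one line per interval keeps the file readable even if the runner dies mid-write.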
Use a single-quoted here-string (@'...'@) to write the bash monitor script to a file, avoiding PowerShell's interpretation of $ and / inside the awk commands.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
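
For comparison, bash has the same hazard and a similar remedy: a heredoc with a quoted delimiter is written verbatim. The sketch below shows that bash analogue, not the PowerShell here-string the commit uses; `write_probe_script` and the one-line awk probe are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a small awk-based probe script to the path in $1.
# The quoted delimiter ('EOF') disables expansion, so awk's $2 is written
# literally instead of being replaced by the shell's own (empty) $2.
write_probe_script() {
    cat > "$1" <<'EOF'
#!/usr/bin/env bash
awk '/^MemTotal:/ {print $2}' /proc/meminfo
EOF
}
```

With an unquoted delimiter (`<<EOF`), the shell would expand `$2` before the file is written and the awk program would silently change meaning.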
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5914      +/-   ##
==========================================
- Coverage   86.18%   84.92%   -1.26%
==========================================
  Files          60       60
  Lines       18731    18731
==========================================
- Hits        16143    15907     -236
- Misses       2588     2824     +236
The Test step logs are completely lost (HTTP 404) when the runner crashes, so console diagnostics are useless. Instead, post a PR comment after each test binary with memory, disk, load, dmesg, and resource monitor data. These comments survive because they're sent via GitHub API before the next binary starts. Also disable core dumps entirely via hard limit (ulimit -Hc 0) since the ulimit -c unlimited in run-gtest.ps1 overrides any soft limit.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
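
The soft/hard distinction behind this change can be checked with a small bash sketch. It is an illustration under assumptions, not the code in test.ps1: `core_limits_after_disable` is a hypothetical name, and it uses plain `ulimit -c 0` (which sets both limits when neither -H nor -S is given) rather than the commit's `ulimit -Hc 0`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lower the core-dump limits inside a subshell and report
# them, leaving the caller's own limits untouched.
core_limits_after_disable() {
    (
        ulimit -c 0    # no -H/-S given: soft and hard limits are both set to 0
        printf 'soft=%s hard=%s\n' "$(ulimit -S -c)" "$(ulimit -H -c)"
    )
}
```

Once the hard limit is 0, an unprivileged child that runs `ulimit -c unlimited` cannot raise it again. A process with CAP_SYS_RESOURCE (typically root) is still permitted to raise a hard limit, which may matter for a path that runs under sudo.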
GITHUB_TOKEN is not automatically available as an environment variable in GitHub Actions steps. Explicitly pass it so test.ps1 can post diagnostic checkpoints as PR comments that survive runner crashes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s permission
- Add pull-requests: write permission to workflow
- Rewrite Post-XdpDiag to write JSON to temp file and use curl -d @file to avoid PowerShell escaping issues
- Post diagnostic comment BEFORE each binary starts (not just after) so we get data even if the first binary crashes the runner
- Remove duplicate/broken curl call
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
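
The "write JSON to a temp file, then curl -d @file" pattern can be sketched in bash as follows. This is not the Post-XdpDiag function itself: `json_escape`, `write_comment_payload`, and `post_comment` are hypothetical names, the escaping covers only backslash, double quote, newline, and tab, and the endpoint is GitHub's REST route for creating an issue comment (pull requests share it).

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names are assumptions, not from test.ps1.

# Escape a string for use inside a JSON string literal.
# Handles only backslash, double quote, newline, and tab.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
    s=${s//\"/\\\"}      # double quote
    s=${s//$'\n'/\\n}    # newline
    s=${s//$'\t'/\\t}    # tab
    printf '%s' "$s"
}

# Write {"body": "..."} for comment text $1 into file $2.
write_comment_payload() {
    local body=$1 out_file=$2
    printf '{"body":"%s"}\n' "$(json_escape "$body")" > "$out_file"
}

# Post the payload file $3 as a comment on PR number $2 in repo $1 (owner/name).
# Assumes GITHUB_TOKEN is exported and has pull-requests: write permission.
post_comment() {
    local repo=$1 pr=$2 payload=$3
    curl -sS -X POST \
        -H "Authorization: Bearer ${GITHUB_TOKEN}" \
        -H "Accept: application/vnd.github+json" \
        -d @"$payload" \
        "https://api.github.com/repos/${repo}/issues/${pr}/comments"
}
```

Passing the body through a file means the JSON never appears on a command line, so no second layer of shell or PowerShell quoting can alter it.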
XDP Diag: Starting msquicplatformtest
XDP Diag: Finished msquicplatformtest (exit=0)
XDP Diag: Starting msquiccoretest
XDP Diag: Finished msquiccoretest (exit=0)
XDP Diag: Starting msquictest
XDP Diag: Finished msquictest (exit=0)
Posts PR comments every 5 minutes with memory, disk, load, dmesg (broadened to catch kernel oops/BUG/panic), top processes by memory, and resource monitor log. This will capture the system state just before the runner crash during msquictest in Debug mode.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
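
A broadened dmesg filter of the kind this commit mentions might be sketched as below. The exact patterns used in test.ps1 are not shown on this page, so the regular expression and the name `filter_kernel_events` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical filter: keep kernel log lines that hint at OOM kills, XDP/BPF
# problems, segfaults, oopses, BUG reports, or panics. Matching is
# case-insensitive, so it may over-match (for example any line with "bug:").
# "|| true" keeps the exit status 0 when nothing matches.
filter_kernel_events() {
    grep -Ei 'out of memory|oom[-_ ]?kill|xdp|bpf|segfault|oops|bug:|kernel panic' || true
}
```

It reads from stdin, so a heartbeat could run something like `sudo dmesg | filter_kernel_events | tail -n 20` and paste the result into the comment body.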
XDP Diag: Starting msquicplatformtest
XDP Diag: Finished msquicplatformtest (exit=0)
XDP Diag: Starting msquiccoretest
XDP Heartbeat #1: msquiccoretest (1x5 min elapsed)
XDP Heartbeat #2: msquiccoretest (2x5 min elapsed)
XDP Heartbeat #3: msquiccoretest (3x5 min elapsed)
XDP Heartbeat #16: msquictest (+16 min)
XDP Heartbeat #17: msquictest (+17 min)
XDP Heartbeat #18: msquictest (+18 min)
XDP Heartbeat #19: msquictest (+19 min)
XDP Heartbeat #20: msquictest (+20 min)
XDP Heartbeat #21: msquictest (+21 min)
XDP Heartbeat #22: msquictest (+22 min)
XDP Heartbeat #23: msquictest (+23 min)
XDP Heartbeat #24: msquictest (+24 min)
XDP Heartbeat #25: msquictest (+25 min)
XDP Heartbeat #26: msquictest (+26 min)
XDP Heartbeat #27: msquictest (+27 min)
XDP Heartbeat #28: msquictest (+28 min)
XDP Heartbeat #29: msquictest (+29 min)
XDP Heartbeat #30: msquictest (+30 min)
XDP Heartbeat #31: msquictest (+31 min)
XDP Heartbeat #32: msquictest (+32 min)
XDP Heartbeat #33: msquictest (+33 min)
XDP Heartbeat #34: msquictest (+34 min)
XDP Heartbeat #35: msquictest (+35 min)
XDP Heartbeat #36: msquictest (+36 min)
XDP Heartbeat #37: msquictest (+37 min)
XDP Heartbeat #38: msquictest (+38 min)
XDP Heartbeat #39: msquictest (+39 min)
XDP Heartbeat #40: msquictest (+40 min)
XDP Heartbeat #41: msquictest (+41 min)
XDP Heartbeat #42: msquictest (+42 min)
XDP Heartbeat #43: msquictest (+43 min)
XDP Heartbeat #44: msquictest (+44 min)
XDP Diag: Finished msquictest (exit=0)
Description
The BVT (Debug, linux, ubuntu-24.04, x64, quictls, -UseSystemOpenSSLCrypto, -Test, -UseXdp) job consistently crashes the GitHub Actions runner at ~72 minutes (vs ~92 minutes for a normal successful run). The runner disconnects entirely: Test step logs return HTTP 404 and cleanup steps never run.

This PR adds diagnostics to identify the crash cause and mitigations to prevent cascading failures:
test.yml: New 'System Diagnostics (pre-test)' step that captures baseline system info (memory, disk, kernel, dmesg, core pattern). This step completes before the Test step, so its logs are preserved even when the runner crashes.
test.ps1 (Linux XDP sudo path):
Testing
No new tests needed. This change adds diagnostics and mitigations to the existing BVT CI pipeline. The changes only affect the Linux XDP test execution path.
Documentation
No documentation impact.