[ci] Investigate/Re-enable linux-gfx1150-gpu-rocm test runner (#3199)#4459
Draft
mgehre-amd wants to merge 12 commits into main
hipcc_check hangs during execution on the gfx1150 CI runner. Adding HIP_TRACE_API, HIP_LAUNCH_BLOCKING, and AMD_SERIALIZE_* env vars to capture where exactly the hang occurs. Also bumping sanity timeout from 5 to 15 minutes to accommodate the extra logging overhead.
Reads /sys/kernel/debug/dri/0/amdgpu_firmware_info when accessible, to help debug gfx1150 kernel dispatch hangs by comparing firmware versions across machines.
The HIP trace debug logs were invisible because run_command uses capture_output=True and only prints on failure — but when the process hangs, pytest's timeout kills it before output is printed. Now using Popen with communicate(timeout=120) so partial output is printed when the timeout fires.
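The timeout handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `run_with_timeout` and its return shape are hypothetical. The key point is that after `communicate(timeout=...)` raises, the child must be killed and `communicate()` called again to collect whatever output was already buffered.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=120, env=None):
    """Run cmd and return (returncode, output, timed_out).

    Hypothetical helper: output captured before a timeout is still returned.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so trace logs stay in order
        text=True,
        env=env,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        timed_out = False
    except subprocess.TimeoutExpired:
        # The child is still running: kill it, then drain the pipes so the
        # partial output is not lost.
        proc.kill()
        out, _ = proc.communicate()
        timed_out = True
        print(f"timed out after {timeout_s}s, partial output follows", file=sys.stderr)
    return proc.returncode, out, timed_out
```

A caller would print `out` unconditionally when `timed_out` is true, so the log shows how far the process got before it hung.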
Use capture=False for hipcc_check so HIP debug output goes directly to the CI log in real time, visible even when the process hangs.
debugfs is not mounted inside CI containers, so the previous approach of reading /sys/kernel/debug/dri/0/amdgpu_firmware_info silently produced no output. Use `amd-smi firmware` instead, which prints per-component firmware versions (CP_PFP, CP_ME, RLC, SDMA, etc.) and works inside containers.
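For illustration, a firmware dump along these lines might look like the sketch below. The function name `collect_firmware_info` and the 30s timeout are assumptions; only the `amd-smi firmware` invocation comes from the commit message. The helper returns `None` rather than failing when the tool is missing, so the sanity test itself is not affected.

```python
import shutil
import subprocess


def collect_firmware_info(tool="amd-smi", timeout_s=30):
    """Return the text printed by `<tool> firmware`, or None if unavailable.

    Hypothetical helper; the output format is whatever the tool prints.
    """
    exe = shutil.which(tool)
    if exe is None:
        return None  # tool not installed in this container
    try:
        result = subprocess.run(
            [exe, "firmware"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout
```

Printing the result at the start of the test run makes it possible to diff firmware versions between a passing and a hanging machine.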
HIP trace showed the hang is inside hipLaunchKernel, after fatbin
loading ("Forcing SPIRV: false") but before kernel dispatch completes.
Changes:
- Bump AMD_LOG_LEVEL from 4 to 7 for maximum verbosity through the
program build/load path
- Add HSA_ENABLE_SDMA=0 to rule out SDMA-related hangs
- Add hip_simple_check.cpp: minimal kernel WITHOUT printf to isolate
whether the hang is caused by printf buffer setup or kernel dispatch
- Add test_hip_simple that runs before test_hip_printf
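As a rough sketch, the debug environment for the test process could be assembled like this. `AMD_LOG_LEVEL=7` and `HSA_ENABLE_SDMA=0` are taken from the list above; the function name and the values chosen for `HIP_LAUNCH_BLOCKING` and the `AMD_SERIALIZE_*` variables are assumptions, since the commit messages name those variables without giving values.

```python
import os


def build_hip_debug_env(base=None):
    """Return a copy of the environment with HIP debug settings applied.

    Hypothetical helper; only the first two values are stated in the PR.
    """
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "AMD_LOG_LEVEL": "7",        # from the PR: maximum verbosity
            "HSA_ENABLE_SDMA": "0",      # from the PR: rule out SDMA hangs
            "HIP_LAUNCH_BLOCKING": "1",  # assumed value
            "AMD_SERIALIZE_KERNEL": "3", # assumed value
            "AMD_SERIALIZE_COPY": "3",   # assumed value
        }
    )
    return env
```

The resulting dict can be passed as `env=` to `subprocess.Popen` so the settings apply only to the process under test.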
The hang occurs during code object loading (after "Using Code Object V5" in devprogram.cpp). AMD_LOG_LEVEL=7 showed no further messages. Adding strace to capture the exact syscall (ioctl to KFD, futex, etc.) where the process blocks. Uses 60s timeout with Popen so the strace log is printed even when the process hangs.
strace is not available in the CI container image. Instead, poll /proc/PID/stack, /proc/PID/syscall, and /proc/PID/wchan every 5s while the process runs. This captures the kernel stack trace and current syscall for the main thread and all worker threads, which will tell us exactly where in the KFD/amdgpu driver the process blocks during code object loading.
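A minimal sketch of one polling step, assuming a Linux `/proc` layout; the function names are hypothetical and this is not the code from the PR. Files that cannot be read (for example `stack` without the required privileges) are reported as placeholders instead of raising, so one unreadable file does not hide the others.

```python
import os
from pathlib import Path


def read_proc_file(path):
    """Return the file contents, or a placeholder if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace").strip()
    except OSError as exc:
        return f"<unreadable: {exc}>"


def snapshot_threads(pid=None):
    """Map each thread id of pid to its wchan, syscall and stack contents."""
    pid = os.getpid() if pid is None else pid
    task_dir = Path(f"/proc/{pid}/task")
    try:
        tids = sorted(int(entry.name) for entry in task_dir.iterdir())
    except OSError:
        return {}  # process gone, or /proc not available
    snapshot = {}
    for tid in tids:
        base = task_dir / str(tid)
        snapshot[tid] = {
            name: read_proc_file(base / name)
            for name in ("wchan", "syscall", "stack")
        }
    return snapshot
```

Calling `snapshot_threads(proc.pid)` every 5s while the child is alive and printing the result gives the per-thread view described above.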
The /proc polling approach showed the gfx1150 hip_simple_check hang is a userspace CPU spin (wchan=0, syscall=running), not a kernel block. But /proc/PID/stack requires SYS_PTRACE which the container lacked. Add --cap-add SYS_PTRACE to the sanity test container options (matching existing precedent from rocgdb and rocprofiler-sdk containers), then install and attach gdb to get full userspace backtraces for all threads when the process hangs for >30s.
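Attaching gdb to a hung process for an all-thread backtrace could look roughly like this. The helper names are hypothetical; the gdb options used (`-p`, `-batch`, `-ex`) are standard. The sketch returns `None` when gdb is not installed.

```python
import shutil
import subprocess


def gdb_backtrace_cmd(pid):
    """Build the argv that attaches to pid and prints all thread backtraces."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid, timeout_s=60):
    """Return gdb's combined output for pid, or None if gdb cannot be run."""
    if shutil.which("gdb") is None:
        return None
    try:
        result = subprocess.run(
            gdb_backtrace_cmd(pid),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout + result.stderr
```

Attaching requires ptrace permission on the target, which is why the container needs `--cap-add SYS_PTRACE`.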
…rrors A known GPU firmware bug prints error messages to the kernel log and causes all subsequent kernel launches to stall. Dump dmesg (errors/warnings) before running hip_simple_check, and full dmesg after detecting a hang, to check if firmware errors are the root cause on gfx1150.
Previous attempts to debug the gfx1150 hang failed because:
- strace is not installed in the CI container
- apt-get install gdb fails (no network/cache in container)
- dmesg returns empty (kernel.dmesg_restrict=1, no CAP_SYSLOG)
- /proc/PID/stack requires SYS_PTRACE on the target process
New approach:
- hip_simple_check.cpp: SIGALRM watchdog fires after 20s, calls backtrace() + backtrace_symbols_fd() to dump the call stack from within the process itself, plus /proc/self/maps for symbol resolution and /proc/self/stack for the kernel stack.
- test_rocm_sanity.py: read /dev/kmsg directly (bypasses dmesg_restrict) to capture GPU firmware errors before and after the hang.
This eliminates all external tool dependencies: the backtrace comes from glibc's backtrace(), which is always available.
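The /dev/kmsg read on the Python side might be sketched as below. This is an illustration under assumptions, not the PR's code: the function name and record cap are hypothetical. The device is opened non-blocking so the read stops once the existing records are drained, and an open failure (for example a permission error in a restricted container) is returned as a placeholder line rather than raised.

```python
import errno
import os


def read_kmsg(path="/dev/kmsg", max_records=10000):
    """Return the kernel log records currently available at path."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    except OSError as exc:
        return [f"<cannot open {path}: {exc}>"]
    records = []
    try:
        while len(records) < max_records:
            try:
                data = os.read(fd, 8192)  # /dev/kmsg yields one record per read
            except BlockingIOError:
                break  # EAGAIN: no more records right now
            except OSError as exc:
                if exc.errno == errno.EPIPE:
                    continue  # records were overwritten; the next read resumes
                records.append(f"<read error: {exc}>")
                break
            if not data:
                break
            records.append(data.decode("utf-8", errors="replace").rstrip("\n"))
    finally:
        os.close(fd)
    return records
```

Taking one snapshot before launching the test binary and another after a hang is detected allows the new records to be isolated by comparing the two lists.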
The gfx1150 Linux test runner was disabled in bd97652 (Feb 2026) due to ROCm sanity check timeouts. Re-enabling to observe current failure state and unblock investigation.
Nightly tests run:
https://github.com/ROCm/TheRock/actions/runs/24239158127
https://github.com/ROCm/TheRock/actions/runs/24244204542 (using artifact id 24239158127)
Hanging on