[ci] Investigate/Re-enable linux-gfx1150-gpu-rocm test runner (#3199)#4459

Draft
mgehre-amd wants to merge 12 commits into main from users/mgehre/re-enable-gfx1150-runner

Conversation


mgehre-amd (Contributor) commented on Apr 10, 2026

The gfx1150 Linux test runner was disabled in bd97652 (Feb 2026) due to ROCm sanity check timeouts. Re-enabling to observe current failure state and unblock investigation.

Nightly test runs:
https://github.com/ROCm/TheRock/actions/runs/24239158127
https://github.com/ROCm/TheRock/actions/runs/24244204542 (using artifact id 24239158127)

The run hangs at:

-------------------------------- live log call ---------------------------------
INFO     test_rocm_sanity:test_rocm_sanity.py:33 ++ Run [None]$ /__w/TheRock/TheRock/build/lib/llvm/bin/offload-arch
INFO     test_rocm_sanity:test_rocm_sanity.py:33 ++ Run [/__w/TheRock/TheRock/build/bin]$ /__w/TheRock/TheRock/build/bin/hipcc /__w/TheRock/TheRock/tests/hipcc_check.cpp -Xlinker -rpath=/__w/TheRock/TheRock/build/bin/../lib/ --offload-arch=gfx1150 -o hipcc_check
INFO     test_rocm_sanity:test_rocm_sanity.py:33 ++ Run [/__w/TheRock/TheRock/build/bin]$ ./hipcc_check
[12:16:14Z] Mem: 2.5/30.7GB (8%) | Jobs: ~1/24 | Disk: 28GB free
[12:16:44Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:17:14Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:17:44Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:18:14Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:18:44Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:19:14Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:19:44Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:20:14Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
Error: The action 'Test' has timed out after 5 minutes.


Changes:
- Restore test-runs-on label and fetch-gfx-targets in both matrix files
- Remove stale TODO(#3199) comments
mgehre-amd changed the title from "[ci] Re-enable linux-gfx1150-gpu-rocm test runner (#3199)" to "[ci] Investigate/Re-enable linux-gfx1150-gpu-rocm test runner (#3199)" on Apr 10, 2026

hipcc_check hangs during execution on the gfx1150 CI runner. Adding
HIP_TRACE_API, HIP_LAUNCH_BLOCKING, and AMD_SERIALIZE_* env vars to
capture exactly where the hang occurs. Also bumping the sanity timeout
from 5 to 15 minutes to accommodate the extra logging overhead.
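A minimal sketch of how these debug variables might be layered onto the test subprocess environment. The helper name is hypothetical, and the specific `AMD_SERIALIZE_KERNEL`/`AMD_SERIALIZE_COPY` names are an assumption: the commit message only says `AMD_SERIALIZE_*`.

```python
import os
import subprocess

# Debug env vars named in the commit message. The AMD_SERIALIZE_*
# expansions below (KERNEL/COPY) are assumed, not confirmed by the PR.
HIP_DEBUG_ENV = {
    "HIP_TRACE_API": "1",        # trace every HIP API call
    "HIP_LAUNCH_BLOCKING": "1",  # make kernel launches synchronous
    "AMD_SERIALIZE_KERNEL": "3",
    "AMD_SERIALIZE_COPY": "3",
}

def run_with_hip_debug(cmd, timeout):
    """Run cmd with HIP debug vars layered over the current environment."""
    env = {**os.environ, **HIP_DEBUG_ENV}
    return subprocess.run(cmd, env=env, timeout=timeout,
                          capture_output=True, text=True)
```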

Reads /sys/kernel/debug/dri/0/amdgpu_firmware_info when accessible,
to help debug gfx1150 kernel dispatch hangs by comparing firmware
versions across machines.

The HIP trace debug logs were invisible because run_command uses
capture_output=True and only prints on failure — but when the process
hangs, pytest's timeout kills it before output is printed. Now using
Popen with communicate(timeout=120) so partial output is printed
when the timeout fires.
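The Popen pattern described here can be sketched roughly as follows (a minimal sketch; the helper name is hypothetical, and `communicate(timeout=120)` is the value from the commit message):

```python
import subprocess

def run_capturing_partial_output(cmd, timeout=120):
    """Run cmd, returning (returncode, output). On timeout, kill the
    process but still collect and return whatever output it produced,
    so logs from a hung process are not lost."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        # After kill(), communicate() drains the pipes without blocking.
        out, _ = proc.communicate()
    return proc.returncode, out
```

The key detail is the second `communicate()` call after `kill()`: it is what recovers the partial output that `subprocess.run(capture_output=True)` would have discarded on a hang.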

Use capture=False for hipcc_check so HIP debug output goes directly
to the CI log in real time, visible even when the process hangs.

debugfs is not mounted inside CI containers, so the previous
approach of reading /sys/kernel/debug/dri/0/amdgpu_firmware_info
silently produced no output. Use `amd-smi firmware` instead,
which prints per-component firmware versions (CP_PFP, CP_ME,
RLC, SDMA, etc.) and works inside containers.

HIP trace showed the hang is inside hipLaunchKernel, after fatbin
loading ("Forcing SPIRV: false") but before kernel dispatch completes.

Changes:
- Bump AMD_LOG_LEVEL from 4 to 7 for maximum verbosity through the
  program build/load path
- Add HSA_ENABLE_SDMA=0 to rule out SDMA-related hangs
- Add hip_simple_check.cpp: minimal kernel WITHOUT printf to isolate
  whether the hang is caused by printf buffer setup or kernel dispatch
- Add test_hip_simple that runs before test_hip_printf

The hang occurs during code object loading (after "Using Code Object V5"
in devprogram.cpp). AMD_LOG_LEVEL=7 showed no further messages. Adding
strace to capture the exact syscall (ioctl to KFD, futex, etc.) where
the process blocks. Uses a 60s timeout with Popen so the strace log is
printed even when the process hangs.

strace is not available in the CI container image. Instead, poll
/proc/PID/stack, /proc/PID/syscall, and /proc/PID/wchan every 5s
while the process runs. This captures the kernel stack trace and
current syscall for the main thread and all worker threads, which
will tell us exactly where in the KFD/amdgpu driver the process
blocks during code object loading.
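The polling approach can be sketched as below (a minimal sketch; the function name is hypothetical, and the commit's per-thread sampling via /proc/PID/task/* is omitted for brevity):

```python
import os

def sample_proc_state(pid):
    """Sample kernel-visible state for one process: its wait channel,
    current syscall, and kernel stack. Any of these files may be
    unreadable (permissions, missing SYS_PTRACE), in which case the
    entry is reported as '?'."""
    state = {}
    for name in ("wchan", "syscall", "stack"):
        try:
            with open(f"/proc/{pid}/{name}") as f:
                state[name] = f.read().strip() or "?"
        except OSError:
            state[name] = "?"
    return state
```

Sampling this every 5 seconds while the child runs is what distinguished the two failure modes: a kernel block shows a named wait channel and a blocking syscall, while a userspace spin shows `wchan=0` and `syscall=running`.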

The /proc polling approach showed the gfx1150 hip_simple_check hang is
a userspace CPU spin (wchan=0, syscall=running), not a kernel block.
But /proc/PID/stack requires SYS_PTRACE, which the container lacked.

Add --cap-add SYS_PTRACE to the sanity test container options (matching
existing precedent from rocgdb and rocprofiler-sdk containers), then
install and attach gdb to get full userspace backtraces for all threads
when the process hangs for >30s.
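The gdb attach step might look like the sketch below (the helper name is hypothetical; `-p`, `-batch`, and `-ex` are standard gdb flags):

```python
import subprocess

def gdb_backtrace_cmd(pid):
    """Batch-mode gdb invocation: attach to pid, dump backtraces for
    all threads, then detach and exit. Attaching requires ptrace
    permission (hence --cap-add SYS_PTRACE for the container)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]

# Usage (only where gdb is installed and ptrace is permitted):
#   out = subprocess.run(gdb_backtrace_cmd(hung_pid),
#                        capture_output=True, text=True, timeout=60).stdout
```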

A known GPU firmware bug prints error messages to the kernel log and
causes all subsequent kernel launches to stall. Dump dmesg (errors/warnings)
before running hip_simple_check, and full dmesg after detecting a hang,
to check if firmware errors are the root cause on gfx1150.

Previous attempts to debug the gfx1150 hang failed because:
- strace is not installed in the CI container
- apt-get install gdb fails (no network/cache in container)
- dmesg returns empty (kernel.dmesg_restrict=1, no CAP_SYSLOG)
- /proc/PID/stack requires SYS_PTRACE on the target process

New approach:
- hip_simple_check.cpp: SIGALRM watchdog fires after 20s, calls
  backtrace() + backtrace_symbols_fd() to dump the call stack from
  within the process itself, plus /proc/self/maps for symbol resolution
  and /proc/self/stack for the kernel stack.
- test_rocm_sanity.py: read /dev/kmsg directly (bypasses dmesg_restrict)
  to capture GPU firmware errors before and after the hang.

This eliminates all external tool dependencies — the backtrace comes
from glibc's backtrace() which is always available.
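The /dev/kmsg side of this could be sketched as follows (a minimal sketch; the function name is hypothetical, and whether opening /dev/kmsg succeeds still depends on the container's capabilities, so denial is handled gracefully):

```python
import os

def read_kmsg_lines(max_lines=100):
    """Read buffered kernel log records directly from /dev/kmsg,
    avoiding the dmesg binary (which returns nothing under
    kernel.dmesg_restrict=1 without CAP_SYSLOG). Each os.read()
    on /dev/kmsg returns one record; O_NONBLOCK makes the read
    raise BlockingIOError once the buffer is drained."""
    lines = []
    try:
        fd = os.open("/dev/kmsg", os.O_RDONLY | os.O_NONBLOCK)
    except OSError:
        return lines  # open denied in this container; nothing to report
    try:
        while len(lines) < max_lines:
            try:
                record = os.read(fd, 8192)
            except (BlockingIOError, OSError):
                break
            if not record:
                break
            lines.append(record.decode("utf-8", "replace").rstrip("\n"))
    finally:
        os.close(fd)
    return lines
```

Capturing this once before hip_simple_check runs and again after a hang is detected gives a diff of kernel log records, which is where amdgpu firmware errors would appear.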
