[ci] Investigate/Re-enable linux-gfx1150-gpu-rocm test runner (#3199)#4459

Draft
mgehre-amd wants to merge 12 commits into main from users/mgehre/re-enable-gfx1150-runner

Conversation


mgehre-amd (Contributor) commented on Apr 10, 2026

The gfx1150 Linux test runner was disabled in bd97652 (Feb 2026) due to ROCm sanity check timeouts. Re-enabling to observe current failure state and unblock investigation.

Nightly test runs:
https://github.com/ROCm/TheRock/actions/runs/24239158127
https://github.com/ROCm/TheRock/actions/runs/24244204542 (using artifact id 24239158127)

The run hangs at:

-------------------------------- live log call ---------------------------------
INFO     test_rocm_sanity:test_rocm_sanity.py:33 ++ Run [None]$ /__w/TheRock/TheRock/build/lib/llvm/bin/offload-arch
INFO     test_rocm_sanity:test_rocm_sanity.py:33 ++ Run [/__w/TheRock/TheRock/build/bin]$ /__w/TheRock/TheRock/build/bin/hipcc /__w/TheRock/TheRock/tests/hipcc_check.cpp -Xlinker -rpath=/__w/TheRock/TheRock/build/bin/../lib/ --offload-arch=gfx1150 -o hipcc_check
INFO     test_rocm_sanity:test_rocm_sanity.py:33 ++ Run [/__w/TheRock/TheRock/build/bin]$ ./hipcc_check
[12:16:14Z] Mem: 2.5/30.7GB (8%) | Jobs: ~1/24 | Disk: 28GB free
[12:16:44Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:17:14Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:17:44Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:18:14Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:18:44Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:19:14Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:19:44Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
[12:20:14Z] Mem: 2.4/30.7GB (8%) | CPU: 4% | Jobs: ~1/24 | Disk: 28GB free
Error: The action 'Test' has timed out after 5 minutes.


Changes:
- Restore test-runs-on label and fetch-gfx-targets in both matrix files
- Remove stale TODO(#3199) comments
mgehre-amd changed the title from "[ci] Re-enable linux-gfx1150-gpu-rocm test runner (#3199)" to "[ci] Investigate/Re-enable linux-gfx1150-gpu-rocm test runner (#3199)" on Apr 10, 2026

hipcc_check hangs during execution on the gfx1150 CI runner. Adding
HIP_TRACE_API, HIP_LAUNCH_BLOCKING, and AMD_SERIALIZE_* env vars to
capture exactly where the hang occurs. Also bumping the sanity timeout
from 5 to 15 minutes to accommodate the extra logging overhead.
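A minimal sketch of how these debug variables might be layered onto the test subprocess environment. The helper name is hypothetical, and the specific `AMD_SERIALIZE_KERNEL`/`AMD_SERIALIZE_COPY` names are an assumption: the commit message only says `AMD_SERIALIZE_*`.

```python
import os
import subprocess

# Debug env vars named in the commit message. The AMD_SERIALIZE_*
# expansions below (KERNEL/COPY) are assumed, not confirmed by the PR.
HIP_DEBUG_ENV = {
    "HIP_TRACE_API": "1",        # trace every HIP API call
    "HIP_LAUNCH_BLOCKING": "1",  # make kernel launches synchronous
    "AMD_SERIALIZE_KERNEL": "3",
    "AMD_SERIALIZE_COPY": "3",
}

def run_with_hip_debug(cmd, timeout):
    """Run cmd with HIP debug vars layered over the current environment."""
    env = {**os.environ, **HIP_DEBUG_ENV}
    return subprocess.run(cmd, env=env, timeout=timeout,
                          capture_output=True, text=True)
```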

Reads /sys/kernel/debug/dri/0/amdgpu_firmware_info when accessible,
to help debug gfx1150 kernel dispatch hangs by comparing firmware
versions across machines.

The HIP trace debug logs were invisible because run_command uses
capture_output=True and only prints on failure — but when the process
hangs, pytest's timeout kills it before output is printed. Now using
Popen with communicate(timeout=120) so partial output is printed
when the timeout fires.
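The Popen pattern described here can be sketched roughly as follows (a minimal sketch; the helper name is hypothetical, and `communicate(timeout=120)` is the value from the commit message):

```python
import subprocess

def run_capturing_partial_output(cmd, timeout=120):
    """Run cmd, returning (returncode, output). On timeout, kill the
    process but still collect and return whatever output it produced,
    so logs from a hung process are not lost."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        # After kill(), communicate() drains the pipes without blocking.
        out, _ = proc.communicate()
    return proc.returncode, out
```

The key detail is the second `communicate()` call after `kill()`: it is what recovers the partial output that `subprocess.run(capture_output=True)` would have discarded on a hang.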

Use capture=False for hipcc_check so HIP debug output goes directly
to the CI log in real time, visible even when the process hangs.

debugfs is not mounted inside CI containers, so the previous
approach of reading /sys/kernel/debug/dri/0/amdgpu_firmware_info
silently produced no output. Use `amd-smi firmware` instead,
which prints per-component firmware versions (CP_PFP, CP_ME,
RLC, SDMA, etc.) and works inside containers.

HIP trace showed the hang is inside hipLaunchKernel, after fatbin
loading ("Forcing SPIRV: false") but before kernel dispatch completes.

Changes:
- Bump AMD_LOG_LEVEL from 4 to 7 for maximum verbosity through the
  program build/load path
- Add HSA_ENABLE_SDMA=0 to rule out SDMA-related hangs
- Add hip_simple_check.cpp: minimal kernel WITHOUT printf to isolate
  whether the hang is caused by printf buffer setup or kernel dispatch
- Add test_hip_simple that runs before test_hip_printf

The hang occurs during code object loading (after "Using Code Object V5"
in devprogram.cpp). AMD_LOG_LEVEL=7 showed no further messages. Adding
strace to capture the exact syscall (ioctl to KFD, futex, etc.) where
the process blocks. Uses a 60s timeout with Popen so the strace log is
printed even when the process hangs.

strace is not available in the CI container image. Instead, poll
/proc/PID/stack, /proc/PID/syscall, and /proc/PID/wchan every 5s
while the process runs. This captures the kernel stack trace and
current syscall for the main thread and all worker threads, which
will tell us exactly where in the KFD/amdgpu driver the process
blocks during code object loading.
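The polling approach can be sketched as below (a minimal sketch; the function name is hypothetical, and the commit's per-thread sampling via /proc/PID/task/* is omitted for brevity):

```python
import os

def sample_proc_state(pid):
    """Sample kernel-visible state for one process: its wait channel,
    current syscall, and kernel stack. Any of these files may be
    unreadable (permissions, missing SYS_PTRACE), in which case the
    entry is reported as '?'."""
    state = {}
    for name in ("wchan", "syscall", "stack"):
        try:
            with open(f"/proc/{pid}/{name}") as f:
                state[name] = f.read().strip() or "?"
        except OSError:
            state[name] = "?"
    return state
```

Sampling this every 5 seconds while the child runs is what distinguished the two failure modes: a kernel block shows a named wait channel and a blocking syscall, while a userspace spin shows `wchan=0` and `syscall=running`.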

The /proc polling approach showed the gfx1150 hip_simple_check hang is
a userspace CPU spin (wchan=0, syscall=running), not a kernel block.
But /proc/PID/stack requires SYS_PTRACE, which the container lacked.

Add --cap-add SYS_PTRACE to the sanity test container options (matching
existing precedent from rocgdb and rocprofiler-sdk containers), then
install and attach gdb to get full userspace backtraces for all threads
when the process hangs for >30s.
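The gdb attach step might look like the sketch below (the helper name is hypothetical; `-p`, `-batch`, and `-ex` are standard gdb flags):

```python
import subprocess

def gdb_backtrace_cmd(pid):
    """Batch-mode gdb invocation: attach to pid, dump backtraces for
    all threads, then detach and exit. Attaching requires ptrace
    permission (hence --cap-add SYS_PTRACE for the container)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]

# Usage (only where gdb is installed and ptrace is permitted):
#   out = subprocess.run(gdb_backtrace_cmd(hung_pid),
#                        capture_output=True, text=True, timeout=60).stdout
```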

A known GPU firmware bug prints error messages to the kernel log and
causes all subsequent kernel launches to stall. Dump dmesg (errors/warnings)
before running hip_simple_check, and full dmesg after detecting a hang,
to check if firmware errors are the root cause on gfx1150.

Previous attempts to debug the gfx1150 hang failed because:
- strace is not installed in the CI container
- apt-get install gdb fails (no network/cache in container)
- dmesg returns empty (kernel.dmesg_restrict=1, no CAP_SYSLOG)
- /proc/PID/stack requires SYS_PTRACE on the target process

New approach:
- hip_simple_check.cpp: SIGALRM watchdog fires after 20s, calls
  backtrace() + backtrace_symbols_fd() to dump the call stack from
  within the process itself, plus /proc/self/maps for symbol resolution
  and /proc/self/stack for the kernel stack.
- test_rocm_sanity.py: read /dev/kmsg directly (bypasses dmesg_restrict)
  to capture GPU firmware errors before and after the hang.

This eliminates all external tool dependencies — the backtrace comes
from glibc's backtrace() which is always available.
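The /dev/kmsg side of this could be sketched as follows (a minimal sketch; the function name is hypothetical, and whether opening /dev/kmsg succeeds still depends on the container's capabilities, so denial is handled gracefully):

```python
import os

def read_kmsg_lines(max_lines=100):
    """Read buffered kernel log records directly from /dev/kmsg,
    avoiding the dmesg binary (which returns nothing under
    kernel.dmesg_restrict=1 without CAP_SYSLOG). Each os.read()
    on /dev/kmsg returns one record; O_NONBLOCK makes the read
    raise BlockingIOError once the buffer is drained."""
    lines = []
    try:
        fd = os.open("/dev/kmsg", os.O_RDONLY | os.O_NONBLOCK)
    except OSError:
        return lines  # open denied in this container; nothing to report
    try:
        while len(lines) < max_lines:
            try:
                record = os.read(fd, 8192)
            except (BlockingIOError, OSError):
                break
            if not record:
                break
            lines.append(record.decode("utf-8", "replace").rstrip("\n"))
    finally:
        os.close(fd)
    return lines
```

Capturing this once before hip_simple_check runs and again after a hang is detected gives a diff of kernel log records, which is where amdgpu firmware errors would appear.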
