
enh(test_common): add profiler-safe HIP-event timing path to run_perftest #656

Draft
Arist12 wants to merge 1 commit into ROCm:main from Arist12:enh/perftest-events-timing

Arist12 commented Jun 4, 2026

Problem

run_perftest times benchmark iterations by running them under torch.profiler (ROCTracer). When the benchmark is driven by an external rocprofv3 session, nesting torch.profiler inside rocprofv3 causes duplicate-flow warnings from ROCTracer and can perturb the measured kernel time. There was no way to time the same workload without entering the profiler.

The module-level import torch.profiler as tpf also caused ROCTracer to be imported eagerly on every pytest collection.

Solution

Add a FLYDSL_PERFTEST_USE_EVENTS=1 environment variable that switches the internal timing backend to paired HIP events, bypassing torch.profiler entirely. When set:

  • Each of the num_iters iterations is bracketed by torch.cuda.Event(enable_timing=True) record/synchronize pairs.
  • The mean over all iterations is returned as the average latency, consistent with what rocprofv3 --stats reports per dispatch.
  • torch.profiler is never imported or entered.

When the variable is unset (default), behavior is identical to before.
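The events path described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `use_events_backend` and `time_with_events` and the `make_event` parameter are hypothetical, and the exact accepted values of the environment variable are an assumption. Only `torch.cuda.Event(enable_timing=True)`, `record()`, `synchronize()`, and `elapsed_time()` are real PyTorch calls; on ROCm builds they map to HIP events.

```python
import os
from statistics import mean


def use_events_backend(env=os.environ):
    """Return True when FLYDSL_PERFTEST_USE_EVENTS is set to "1" (assumed semantics)."""
    return env.get("FLYDSL_PERFTEST_USE_EVENTS", "0") == "1"


def time_with_events(func, num_iters, make_event=None):
    """Run func num_iters times; return (last result, mean latency in microseconds)."""
    if make_event is None:
        # Imported lazily; needs a ROCm or CUDA build of PyTorch at call time.
        import torch

        def make_event():
            return torch.cuda.Event(enable_timing=True)

    latencies_ms = []
    data = None
    for _ in range(num_iters):
        start, end = make_event(), make_event()
        start.record()
        data = func()
        end.record()
        end.synchronize()  # block until the end event has completed on the device
        latencies_ms.append(start.elapsed_time(end))  # elapsed_time returns milliseconds
    return data, mean(latencies_ms) * 1000.0
```

Passing `make_event` explicitly lets the loop be exercised without a GPU; in a real benchmark it would be left as `None` so that device events are used.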

torch.profiler is now lazy-imported in both the default timing branch and the testGraph branch, so it is only pulled in when actually used.
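A deferred import of this kind might look like the sketch below. The names `lazy_import` and `run_default_timing` and the `profiler_module` parameter are hypothetical and exist only for illustration; the PR itself may simply place `import torch.profiler as tpf` inside the relevant branches.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def lazy_import(module_name):
    """Import module_name on first call and cache the module object."""
    return importlib.import_module(module_name)


def run_default_timing(func, profiler_module="torch.profiler"):
    """Hypothetical default branch: the profiler module is imported only here."""
    tpf = lazy_import(profiler_module)
    with tpf.profile() as prof:
        data = func()
    return data, prof
```

Because nothing at module level touches `torch.profiler`, collecting tests that import this file does not load the profiler; the import cost is paid only when the default branch actually runs.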

Typical usage

# Without external profiler — behavior unchanged, default path
python tests/kernels/test_pa_mqa_logits_fp4.py --batch 8 --ctx 131072 --next_n 1

# Under rocprofv3 — avoids nested profiler
FLYDSL_PERFTEST_USE_EVENTS=1 \
rocprofv3 --stats --kernel-trace -f csv -o /tmp/out -- \
python tests/kernels/test_pa_mqa_logits_fp4.py --batch 8 --ctx 131072 --next_n 1

Testing

  • Both FLYDSL_PERFTEST_USE_EVENTS=1 and =0 paths verified to return valid (data, avg_us) tuples.
  • FLYDSL_LOG_MORE=1 combined with FLYDSL_PERFTEST_USE_EVENTS=1 does not crash.
  • tests/unit/test_compile_hints.py::TestCacheDisabledRegression passes (this test directly calls run_perftest).
  • python -m pytest tests/unit/ -m "not l2_device and not rocm_lower": 356 passed, 2 pre-existing failures unrelated to this change.

…test

Add FLYDSL_PERFTEST_USE_EVENTS=1 to time benchmark iterations with a pair
of HIP events rather than torch.profiler.  When set, each iteration is
bracketed by Event.record() / Event.synchronize() and the mean latency is
returned as usual, but torch.profiler is never entered.

This is necessary when running benchmarks under an external rocprofv3
session: nesting torch.profiler (ROCTracer) inside rocprofv3 produces
duplicate-flow warnings and can perturb timing.  With the events path the
benchmark command line stays identical; only the internal timing backend
changes.

Lazy-import torch.profiler in the default path so the module-level import
no longer pulls in ROCTracer on every test collection.  The testGraph path
gets the same lazy import.

FLYDSL_PERFTEST_USE_EVENTS is not set in any test; default behavior is
unchanged.