-
Notifications
You must be signed in to change notification settings - Fork 56
[docs] Add Predecoder Decoding with CUDA-Q Realtime #497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
bmhowe23
merged 10 commits into
NVIDIA:main
from
wsttiger:realtime_predecoder_no_fpga_demo_docs
Apr 14, 2026
Merged
Changes from 4 commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
910394b
Added initial docs for predecoder + pymatching pipeline
wsttiger 164dfe4
Updated docs
wsttiger 9a43acc
Minor updates
bmhowe23 f65f101
Merge branch 'main' into realtime_predecoder_no_fpga_demo_docs
bmhowe23 f625cd2
Added class / API documentation
wsttiger 112e9da
Merge branch 'realtime_predecoder_no_fpga_demo_docs' of github.com:ws…
wsttiger 51444cf
Converted class-level documentation back to Doxygen
wsttiger 8514d9a
Merge branch 'main' into realtime_predecoder_no_fpga_demo_docs
bmhowe23 5aebdd9
clang-format
bmhowe23 5fbcf3d
Update memory estimate
bmhowe23 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
310 changes: 310 additions & 0 deletions
310
docs/sphinx/examples_rst/qec/realtime_predecoder_pymatching.rst
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,310 @@ | ||
| Realtime AI Predecoder Pipeline | ||
| ================================ | ||
|
|
||
| .. note:: | ||
|
|
||
| The following information is about a C++ demonstration that must be built | ||
| from source and is not part of any distributed CUDA-Q QEC binaries. | ||
|
|
||
| This guide explains how to build and run the hybrid AI predecoder + PyMatching | ||
| streaming benchmark. The benchmark uses a TensorRT-accelerated neural network | ||
| (the *predecoder*) to reduce syndrome density on the GPU, then feeds the | ||
| residual detectors to a pool of PyMatching MWPM decoders on the CPU. A | ||
| software data injector streams pre-generated syndrome shots through the | ||
| ``RealtimePipeline`` at a configurable rate and collects latency, throughput, | ||
| syndrome density, and logical error rate statistics. | ||
|
|
||
| The benchmark binary is | ||
| ``test_realtime_predecoder_w_pymatching``, built from | ||
| `libs/qec/unittests/realtime/test_realtime_predecoder_w_pymatching.cpp | ||
| <https://github.com/NVIDIA/cudaqx/blob/main/libs/qec/unittests/realtime/test_realtime_predecoder_w_pymatching.cpp>`_. | ||
|
|
||
|
|
||
| Prerequisites | ||
| ------------- | ||
|
|
||
| Hardware | ||
| ^^^^^^^^ | ||
|
|
||
| - CUDA-capable GPU (NVIDIA Grace Blackwell / GB200 recommended) | ||
| - Sufficient GPU memory for the TensorRT engine (the d13_r104 model requires | ||
| approximately 1 GB per predecoder instance) | ||
|
|
||
| Software | ||
| ^^^^^^^^ | ||
|
|
||
| - **CUDA Toolkit** 12.6 or later | ||
| - **TensorRT** 10.x (headers and libraries) | ||
| - **CUDA-Q SDK** pre-installed (provides ``libcudaq``, ``libnvqir``, ``nvq++``) | ||
| - **CUDA-Q Realtime** libraries (``libcudaq-realtime``, | ||
| ``libcudaq-realtime-dispatch``, ``libcudaq-realtime-host-dispatch``) built | ||
| and installed to a known prefix (e.g. ``/tmp/cudaq-realtime``) | ||
|
|
||
| Additional inputs: | ||
|
|
||
| - **Predecoder ONNX model** (e.g. ``predecoder_memory_d13_T104_X.onnx``) | ||
| placed under ``libs/qec/lib/realtime/``. A cached TensorRT ``.engine`` file | ||
| with the same base name is loaded automatically if present; otherwise the | ||
| engine is built from the ONNX file on first run (this can take 1--2 minutes | ||
| for large models). | ||
| - **Syndrome data directory** containing pre-generated detector samples, | ||
| observables, and matching graph data (see `Data Directory Layout`_). | ||
|
|
||
|
|
||
| Data Directory Layout | ||
| --------------------- | ||
|
|
||
| The ``--data-dir`` flag points to a directory with the following files. | ||
| All binary files use little-endian format. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 30 70 | ||
|
|
||
| * - File | ||
| - Description | ||
| * - ``detectors.bin`` | ||
| - Detector samples. Header: ``uint32 num_samples``, ``uint32 num_detectors``; | ||
| body: ``int32[num_samples * num_detectors]``. | ||
| * - ``observables.bin`` | ||
| - Observable ground-truth labels. Header: ``uint32 num_samples``, | ||
| ``uint32 num_observables``; body: ``int32[num_samples * num_observables]``. | ||
| * - ``H_csr.bin`` | ||
| - Sparse CSR parity check matrix. Header: ``uint32 nrows``, | ||
| ``uint32 ncols``, ``uint32 nnz``; body: ``int32 indptr[nrows+1]``, | ||
| ``int32 indices[nnz]``. | ||
| * - ``O_csr.bin`` | ||
| - Sparse CSR observables matrix (same format as ``H_csr.bin``). | ||
| * - ``priors.bin`` | ||
| - Per-edge error probabilities. Header: ``uint32 num_edges``; body: | ||
| ``float64[num_edges]``. | ||
| * - ``metadata.txt`` | ||
| - Human-readable parameters (``distance``, ``n_rounds``, ``p_error``, | ||
| etc.). Not read by the binary; included for reference. | ||
|
|
||
|
|
||
| Building | ||
| -------- | ||
|
|
||
| The benchmark requires two CMake targets: | ||
|
|
||
| - ``test_realtime_predecoder_w_pymatching`` -- the benchmark binary | ||
| - ``cudaq-qec-pymatching`` -- the PyMatching decoder plugin (loaded at runtime) | ||
|
|
||
| Configure and build: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| cd /path/to/cudaqx | ||
|
|
||
| cmake -S . -B build \ | ||
| -DCMAKE_BUILD_TYPE=Release \ | ||
| -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \ | ||
| -DCUDAQ_DIR=/usr/local/cudaq/lib/cmake/cudaq \ | ||
| -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ | ||
| -DCUDAQ_QEC_BUILD_TRT_DECODER=ON \ | ||
| -DCUDAQX_ENABLE_LIBS=qec \ | ||
| -DCUDAQX_INCLUDE_TESTS=ON \ | ||
| -DCUDAQX_QEC_INCLUDE_TESTS=ON | ||
|
|
||
| cmake --build build -j$(nproc) --target \ | ||
| test_realtime_predecoder_w_pymatching \ | ||
| cudaq-qec-pymatching | ||
|
|
||
| .. note:: | ||
|
|
||
| The ``test_realtime_predecoder_w_pymatching`` target requires TensorRT | ||
| headers and libraries to be discoverable. CMake searches standard system | ||
| paths (e.g. ``/usr/include/aarch64-linux-gnu``, | ||
| ``/usr/lib/aarch64-linux-gnu``). If TensorRT is installed elsewhere, set | ||
| ``-DTENSORRT_ROOT=/path/to/tensorrt``. | ||
|
|
||
| The ``cudaq-qec-pymatching`` shared library is written to | ||
| ``build/lib/decoder-plugins/``. If the benchmark fails with | ||
| ``invalid decoder requested: pymatching``, verify that this file exists. | ||
|
|
||
|
|
||
| Running | ||
| ------- | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| test_realtime_predecoder_w_pymatching <config> [rate_us] [duration_s] [flags] | ||
|
|
||
| Positional Arguments | ||
| ^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 15 55 15 | ||
|
|
||
| * - Argument | ||
| - Description | ||
| - Default | ||
| * - ``config`` | ||
| - Pipeline configuration name (see table below) | ||
| - ``d7`` | ||
| * - ``rate_us`` | ||
| - Inter-arrival time in microseconds. ``0`` runs open-loop (as fast as | ||
| possible). | ||
| - ``0`` | ||
| * - ``duration_s`` | ||
| - Test duration in seconds | ||
| - ``5`` | ||
|
|
||
| Named Flags | ||
| ^^^^^^^^^^^ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 30 70 | ||
|
|
||
| * - Flag | ||
| - Description | ||
| * - ``--data-dir <path>`` | ||
| - Path to syndrome data directory (see `Data Directory Layout`_). When | ||
| omitted, random syndromes with 1% error rate are generated. | ||
| * - ``--num-gpus <n>`` | ||
| - Number of GPUs to use. Currently clamped to 1 (multi-GPU dispatch is | ||
| not yet supported). | ||
|
|
||
| Pipeline Configurations | ||
| ^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 12 10 10 38 10 10 10 | ||
|
|
||
| * - Config | ||
| - Distance | ||
| - Rounds | ||
| - ONNX Model | ||
| - Pre-decoders | ||
| - Workers | ||
| - Decode Workers | ||
| * - ``d13_r104`` | ||
| - 13 | ||
| - 104 | ||
| - ``predecoder_memory_d13_T104_X.onnx`` | ||
| - 8 | ||
| - 8 | ||
| - 16 | ||
|
|
||
| Example | ||
| ^^^^^^^ | ||
|
|
||
| Run the d13_r104 configuration at 500 req/s for 2 minutes with real syndrome | ||
| data: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| ./build/libs/qec/unittests/realtime/test_realtime_predecoder_w_pymatching \ | ||
| d13_r104 2000 120 \ | ||
| --data-dir /path/to/syndrome_data/p0.003 | ||
|
|
||
|
|
||
| Changing the Predecoder Model | ||
| ----------------------------- | ||
|
|
||
| The ONNX model file for each configuration is set in the ``PipelineConfig`` | ||
| factory methods in | ||
| ``libs/qec/unittests/realtime/predecoder_pipeline_common.h``. To use a | ||
| different model, edit the ``onnx_filename`` field and rebuild: | ||
|
|
||
| .. code-block:: cpp | ||
|
|
||
| static PipelineConfig d13_r104() { | ||
| return { | ||
| "d13_r104_X", 13, 104, | ||
| "predecoder_memory_model_4_d13_T104_X.onnx", // changed model | ||
| 8, 8, 16}; | ||
| } | ||
|
|
||
| Then rebuild: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| cmake --build build -j$(nproc) --target test_realtime_predecoder_w_pymatching | ||
|
|
||
| ONNX model files and their corresponding ``.engine`` caches live in | ||
| ``libs/qec/lib/realtime/``. If a cached engine exists with the same base name | ||
| as the ONNX file, TensorRT loads it directly. Otherwise, the engine is built | ||
| from the ONNX file on the first run. | ||
|
|
||
|
|
||
| Reading the Output | ||
| ------------------ | ||
|
|
||
| The benchmark prints a structured report after the streaming run completes. | ||
|
|
||
| Throughput and Timing | ||
| ^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| Submitted: 60001 | ||
| Completed: 60001 | ||
| Throughput: 500.0 req/s | ||
| Backpressure stalls: 0 | ||
|
|
||
| ``Backpressure stalls`` counts how many times the producer had to spin because | ||
| all pipeline slots were occupied. Zero stalls means the pipeline kept up with | ||
| the injection rate. | ||
|
|
||
| Latency Distribution | ||
| ^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| Latency (us) [steady-state, 59981 requests after 20 warmup] | ||
| min = 154.8 | ||
| p50 = 203.9 | ||
| mean = 215.5 | ||
| p99 = 363.4 | ||
|
|
||
| End-to-end latency measured from ``injector.submit()`` to the completion | ||
| callback. Includes GPU inference, CPU-side PyMatching decode, and all pipeline | ||
| overhead. The first 20 requests are excluded as warmup. | ||
|
|
||
| PyMatching Average Time | ||
| ^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| PyMatching decode: 75.6 us | ||
|
|
||
| Average time for the PyMatching MWPM decoder to process a single residual | ||
| syndrome. | ||
|
|
||
| Syndrome Density | ||
| ^^^^^^^^^^^^^^^^ | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| Input: 931.0 / 17472 (0.0533) | ||
| Output: 16.0 / 17472 (0.0009) | ||
| Reduction: 98.3% | ||
|
|
||
| Average nonzero detectors before the predecoder (input) and after (residual | ||
| output). Higher reduction means the predecoder is removing more syndrome | ||
| weight, which reduces PyMatching decode time. | ||
|
|
||
| Correctness Verification | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| Printed only when ``--data-dir`` is provided: | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| Pipeline (pred+pymatch) mismatches: 108 LER: 0.0018 | ||
|
|
||
| - **Pipeline LER**: logical error rate of the full predecoder + PyMatching | ||
| chain compared to ground-truth observables. | ||
|
|
||
| .. note:: | ||
|
|
||
| Syndrome samples are cycled when the run exceeds the dataset size. | ||
| For example, if the dataset has 10,000 shots and the test runs 60,000 | ||
| requests, each shot is replayed approximately 6 times. Correctness | ||
| verification still compares against the correct ground truth for each | ||
| replayed shot. | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.