NVIDIA · bmhowe23 · Apr 14, 2026 · Apr 10, 2026 · Apr 10, 2026 · Apr 12, 2026
diff --git a/docs/sphinx/components/qec/introduction.rst b/docs/sphinx/components/qec/introduction.rst
@@ -861,6 +861,7 @@ Additional quantum gates can be applied, and only when `get_corrections` is call
 For detailed information on real-time decoding, see:
 
 * :doc:`/examples_rst/qec/realtime_decoding` - Complete Guide with Examples
+* :doc:`/examples_rst/qec/realtime_predecoder_pymatching` - Realtime AI Predecoder Pipeline
 * :doc:`/api/qec/cpp_api` - C++ API Reference (see Real-Time Decoding section)
 * :doc:`/api/qec/python_api` - Python API Reference (see Real-Time Decoding section)
 

diff --git a/docs/sphinx/examples_rst/qec/examples.rst b/docs/sphinx/examples_rst/qec/examples.rst
@@ -10,4 +10,5 @@ Examples that illustrate how to use CUDA-QX for application development are avai
       Code-Capacity QEC <code_capacity_noise.rst>
       Circuit-Level QEC <circuit_level_noise.rst>
       Decoders <decoders.rst>
-      Real-Time Decoding <realtime_decoding.rst>
+      Real-Time Decoding <realtime_decoding.rst>
+      Realtime AI Predecoder Pipeline <realtime_predecoder_pymatching.rst>
diff --git a/docs/sphinx/examples_rst/qec/realtime_predecoder_pymatching.rst b/docs/sphinx/examples_rst/qec/realtime_predecoder_pymatching.rst
@@ -0,0 +1,310 @@
+Realtime AI Predecoder Pipeline
+================================
+
+.. note::
+
+  This guide describes a C++ demonstration that must be built from source; it
+  is not included in any distributed CUDA-Q QEC binaries.
+
+This guide explains how to build and run the hybrid AI predecoder + PyMatching
+streaming benchmark. The benchmark uses a TensorRT-accelerated neural network
+(the *predecoder*) to reduce syndrome density on the GPU, then feeds the
+residual detectors to a pool of PyMatching MWPM decoders on the CPU. A
+software data injector streams pre-generated syndrome shots through the
+``RealtimePipeline`` at a configurable rate and collects latency, throughput,
+syndrome density, and logical error rate statistics.
+
+The benchmark binary is
+``test_realtime_predecoder_w_pymatching``, built from
+`libs/qec/unittests/realtime/test_realtime_predecoder_w_pymatching.cpp
+<https://github.com/NVIDIA/cudaqx/blob/main/libs/qec/unittests/realtime/test_realtime_predecoder_w_pymatching.cpp>`_.
+
+
+Prerequisites
+-------------
+
+Hardware
+^^^^^^^^
+
+- CUDA-capable GPU (NVIDIA Grace Blackwell / GB200 recommended)
+- Sufficient GPU memory for the TensorRT engine (the d13_r104 model requires
+  approximately 1 GB per predecoder instance)
+
+Software
+^^^^^^^^
+
+- **CUDA Toolkit** 12.6 or later
+- **TensorRT** 10.x (headers and libraries)
+- **CUDA-Q SDK** pre-installed (provides ``libcudaq``, ``libnvqir``, ``nvq++``)
+- **CUDA-Q Realtime** libraries (``libcudaq-realtime``,
+  ``libcudaq-realtime-dispatch``, ``libcudaq-realtime-host-dispatch``) built
+  and installed to a known prefix (e.g. ``/tmp/cudaq-realtime``)
+
+Additional inputs:
+
+- **Predecoder ONNX model** (e.g. ``predecoder_memory_d13_T104_X.onnx``)
+  placed under ``libs/qec/lib/realtime/``. A cached TensorRT ``.engine`` file
+  with the same base name is loaded automatically if present; otherwise the
+  engine is built from the ONNX file on first run (this can take 1--2 minutes
+  for large models).
+- **Syndrome data directory** containing pre-generated detector samples,
+  observables, and matching graph data (see `Data Directory Layout`_).
+
+
+Data Directory Layout
+---------------------
+
+The ``--data-dir`` flag points to a directory with the following files.
+All binary files use little-endian format.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 70
+
+   * - File
+     - Description
+   * - ``detectors.bin``
+     - Detector samples. Header: ``uint32 num_samples``, ``uint32 num_detectors``;
+       body: ``int32[num_samples * num_detectors]``.
+   * - ``observables.bin``
+     - Observable ground-truth labels. Header: ``uint32 num_samples``,
+       ``uint32 num_observables``; body: ``int32[num_samples * num_observables]``.
+   * - ``H_csr.bin``
+     - Sparse CSR parity check matrix. Header: ``uint32 nrows``,
+       ``uint32 ncols``, ``uint32 nnz``; body: ``int32 indptr[nrows+1]``,
+       ``int32 indices[nnz]``.
+   * - ``O_csr.bin``
+     - Sparse CSR observables matrix (same format as ``H_csr.bin``).
+   * - ``priors.bin``
+     - Per-edge error probabilities. Header: ``uint32 num_edges``; body:
+       ``float64[num_edges]``.
+   * - ``metadata.txt``
+     - Human-readable parameters (``distance``, ``n_rounds``, ``p_error``,
+       etc.). Not read by the binary; included for reference.
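+
+As a rough illustration of the header-plus-body layout in the table above, the
+following sketch reads ``detectors.bin`` into memory. ``DetectorSamples`` and
+``read_detectors`` are hypothetical names that do not come from the benchmark
+source. The sketch assumes a little-endian host and assumes the body is stored
+sample-major (all detectors of sample 0 first); neither assumption is confirmed
+by the benchmark code.
+
```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the contents of detectors.bin.
struct DetectorSamples {
  std::uint32_t num_samples = 0;
  std::uint32_t num_detectors = 0;
  // Assumed sample-major: values[s * num_detectors + d].
  std::vector<std::int32_t> values;
};

// Reads the two uint32 header fields, then the int32 body.
// Assumes a little-endian host, so no byte swapping is performed.
DetectorSamples read_detectors(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  if (!in)
    throw std::runtime_error("cannot open " + path);

  DetectorSamples d;
  in.read(reinterpret_cast<char *>(&d.num_samples), sizeof(d.num_samples));
  in.read(reinterpret_cast<char *>(&d.num_detectors), sizeof(d.num_detectors));
  if (!in)
    throw std::runtime_error("truncated header in " + path);

  const std::size_t count =
      static_cast<std::size_t>(d.num_samples) * d.num_detectors;
  d.values.resize(count);
  in.read(reinterpret_cast<char *>(d.values.data()),
          static_cast<std::streamsize>(count * sizeof(std::int32_t)));
  if (!in)
    throw std::runtime_error("truncated body in " + path);
  return d;
}
```
+
+Calling ``read_detectors("/path/to/syndrome_data/p0.003/detectors.bin")`` would
+return the header counts and a flat vector of samples. ``observables.bin`` has
+the same shape, so the same pattern should apply with ``num_observables`` in
+place of ``num_detectors``.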
+
+
+Building
+--------
+
+The benchmark requires two CMake targets:
+
+- ``test_realtime_predecoder_w_pymatching`` -- the benchmark binary
+- ``cudaq-qec-pymatching`` -- the PyMatching decoder plugin (loaded at runtime)
+
+Configure and build:
+
+.. code-block:: bash
+
+   cd /path/to/cudaqx
+
+   cmake -S . -B build \
+     -DCMAKE_BUILD_TYPE=Release \
+     -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \
+     -DCUDAQ_DIR=/usr/local/cudaq/lib/cmake/cudaq \
+     -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
+     -DCUDAQ_QEC_BUILD_TRT_DECODER=ON \
+     -DCUDAQX_ENABLE_LIBS=qec \
+     -DCUDAQX_INCLUDE_TESTS=ON \
+     -DCUDAQX_QEC_INCLUDE_TESTS=ON
+
+   cmake --build build -j$(nproc) --target \
+     test_realtime_predecoder_w_pymatching \
+     cudaq-qec-pymatching
+
+.. note::
+
+   The ``test_realtime_predecoder_w_pymatching`` target requires TensorRT
+   headers and libraries to be discoverable. CMake searches standard system
+   paths (e.g. ``/usr/include/aarch64-linux-gnu``,
+   ``/usr/lib/aarch64-linux-gnu``). If TensorRT is installed elsewhere, set
+   ``-DTENSORRT_ROOT=/path/to/tensorrt``.
+
+   The ``cudaq-qec-pymatching`` shared library is written to
+   ``build/lib/decoder-plugins/``. If the benchmark fails with
+   ``invalid decoder requested: pymatching``, verify that this file exists.
+
+
+Running
+-------
+
+.. code-block:: text
+
+   test_realtime_predecoder_w_pymatching <config> [rate_us] [duration_s] [flags]
+
+Positional Arguments
+^^^^^^^^^^^^^^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 15 55 15
+
+   * - Argument
+     - Description
+     - Default
+   * - ``config``
+     - Pipeline configuration name (see table below)
+     - ``d7``
+   * - ``rate_us``
+     - Inter-arrival time in microseconds. ``0`` runs open-loop (as fast as
+       possible).
+     - ``0``
+   * - ``duration_s``
+     - Test duration in seconds
+     - ``5``
+
+Named Flags
+^^^^^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 70
+
+   * - Flag
+     - Description
+   * - ``--data-dir <path>``
+     - Path to syndrome data directory (see `Data Directory Layout`_). When
+       omitted, random syndromes with 1% error rate are generated.
+   * - ``--num-gpus <n>``
+     - Number of GPUs to use. Currently clamped to 1 (multi-GPU dispatch is
+       not yet supported).
+
+Pipeline Configurations
+^^^^^^^^^^^^^^^^^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+   :widths: 12 10 10 38 10 10 10
+
+   * - Config
+     - Distance
+     - Rounds
+     - ONNX Model
+     - Pre-decoders
+     - Workers
+     - Decode Workers
+   * - ``d13_r104``
+     - 13
+     - 104
+     - ``predecoder_memory_d13_T104_X.onnx``
+     - 8
+     - 8
+     - 16
+
+Example
+^^^^^^^
+
+Run the d13_r104 configuration at 500 req/s (``rate_us`` of 2000) for 2 minutes
+(``duration_s`` of 120) with real syndrome data:
+
+.. code-block:: bash
+
+   ./build/libs/qec/unittests/realtime/test_realtime_predecoder_w_pymatching \
+       d13_r104 2000 120 \
+       --data-dir /path/to/syndrome_data/p0.003
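+
+The relationship between ``rate_us`` and the request rate in this example can
+be sketched as follows. The function names are hypothetical and are not part
+of the benchmark; the request count is only an estimate, since the exact
+number the injector submits depends on how it handles the start and end of
+the run.
+
```cpp
#include <cstdint>

// Requests per second implied by an inter-arrival time in microseconds.
// rate_us == 0 means open-loop (no fixed rate); this sketch returns 0 for it.
double requests_per_second(std::uint64_t rate_us) {
  return rate_us == 0 ? 0.0 : 1.0e6 / static_cast<double>(rate_us);
}

// Approximate number of requests injected during a fixed-rate run.
std::uint64_t expected_requests(std::uint64_t rate_us,
                                std::uint64_t duration_s) {
  return rate_us == 0 ? 0 : duration_s * 1000000ULL / rate_us;
}
```
+
+For the command above, ``requests_per_second(2000)`` gives 500 and
+``expected_requests(2000, 120)`` gives 60000, which is within one request of
+the ``Submitted`` count in the sample report later in this guide.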
+
+
+Changing the Predecoder Model
+-----------------------------
+
+The ONNX model file for each configuration is set in the ``PipelineConfig``
+factory methods in
+``libs/qec/unittests/realtime/predecoder_pipeline_common.h``. To use a
+different model, edit the ``onnx_filename`` field:
+
+.. code-block:: cpp
+
+   static PipelineConfig d13_r104() {
+       return {
+           "d13_r104_X", 13, 104,
+           "predecoder_memory_model_4_d13_T104_X.onnx",  // changed model
+           8, 8, 16};
+   }
+
+Then rebuild:
+
+.. code-block:: bash
+
+   cmake --build build -j$(nproc) --target test_realtime_predecoder_w_pymatching
+
+ONNX model files and their corresponding ``.engine`` caches live in
+``libs/qec/lib/realtime/``. If a cached engine exists with the same base name
+as the ONNX file, TensorRT loads it directly. Otherwise, the engine is built
+from the ONNX file on the first run.
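+
+The "same base name" rule can be illustrated with a short sketch. This is not
+the benchmark's actual lookup code; ``engine_cache_path`` and
+``has_cached_engine`` are hypothetical helpers that only show how an engine
+path could be derived from an ONNX path, assuming both files sit in the same
+directory.
+
```cpp
#include <filesystem>

// Derives the assumed cache location: same directory and base name as the
// ONNX file, with the extension swapped to ".engine".
std::filesystem::path
engine_cache_path(const std::filesystem::path &onnx_path) {
  std::filesystem::path engine = onnx_path;
  engine.replace_extension(".engine");
  return engine;
}

// True if a cached engine file exists next to the ONNX file.
bool has_cached_engine(const std::filesystem::path &onnx_path) {
  return std::filesystem::exists(engine_cache_path(onnx_path));
}
```
+
+Under these assumptions,
+``libs/qec/lib/realtime/predecoder_memory_d13_T104_X.onnx`` maps to
+``libs/qec/lib/realtime/predecoder_memory_d13_T104_X.engine``. Deleting that
+file would be one way to force a rebuild from the ONNX model on the next run.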
+
+
+Reading the Output
+------------------
+
+The benchmark prints a structured report after the streaming run completes.
+
+Throughput and Timing
+^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: text
+
+   Submitted:          60001
+   Completed:          60001
+   Throughput:         500.0 req/s
+   Backpressure stalls:       0
+
+``Backpressure stalls`` counts how many times the producer had to spin because
+all pipeline slots were occupied. Zero stalls means the pipeline kept up with
+the injection rate.
+
+Latency Distribution
+^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: text
+
+   Latency (us)  [steady-state, 59981 requests after 20 warmup]
+     min    =      154.8
+     p50    =      203.9
+     mean   =      215.5
+     p99    =      363.4
+
+End-to-end latency is measured from ``injector.submit()`` to the completion
+callback. It includes GPU inference, CPU-side PyMatching decode, and all
+pipeline overhead. The first 20 requests are excluded as warmup.
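+
+One way such a summary could be computed from raw per-request latencies is
+sketched here. The benchmark's own percentile method is not documented in this
+guide, so the nearest-rank definition used below is an assumption, and
+``LatencySummary``, ``percentile``, and ``summarize`` are hypothetical names.
+
```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical summary of steady-state latencies, in microseconds.
struct LatencySummary {
  double min_us;
  double p50_us;
  double mean_us;
  double p99_us;
};

// Nearest-rank percentile of a sorted, non-empty vector, with q in (0, 1].
double percentile(const std::vector<double> &sorted, double q) {
  std::size_t rank = static_cast<std::size_t>(
      std::ceil(q * static_cast<double>(sorted.size())));
  if (rank == 0)
    rank = 1;
  return sorted[rank - 1];
}

// Drops the first `warmup` samples (in arrival order), then summarizes.
LatencySummary summarize(std::vector<double> latencies_us,
                         std::size_t warmup) {
  if (latencies_us.size() <= warmup)
    throw std::invalid_argument("not enough samples after warmup");
  latencies_us.erase(latencies_us.begin(),
                     latencies_us.begin() +
                         static_cast<std::ptrdiff_t>(warmup));
  std::sort(latencies_us.begin(), latencies_us.end());
  const double sum =
      std::accumulate(latencies_us.begin(), latencies_us.end(), 0.0);
  return {latencies_us.front(), percentile(latencies_us, 0.50),
          sum / static_cast<double>(latencies_us.size()),
          percentile(latencies_us, 0.99)};
}
```
+
+Passing the latencies in arrival order with ``warmup`` set to 20 would mirror
+the exclusion described above, so that slow first requests do not skew the
+mean or the tail percentiles.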
+
+PyMatching Average Time
+^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: text
+
+   PyMatching decode:         75.6 us
+
+This is the average time the PyMatching MWPM decoder takes to process a single
+residual syndrome.
+
+Syndrome Density
+^^^^^^^^^^^^^^^^
+
+.. code-block:: text
+
+   Input:  931.0 / 17472  (0.0533)
+   Output: 16.0 / 17472  (0.0009)
+   Reduction: 98.3%
+
+These are the average counts of nonzero detectors before the predecoder
+(input) and after it (residual output), each shown against the total number of
+detectors. A higher reduction means the predecoder removes more syndrome
+weight, which reduces PyMatching decode time.
+
+Correctness Verification
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Printed only when ``--data-dir`` is provided:
+
+.. code-block:: text
+
+   Pipeline (pred+pymatch) mismatches: 108  LER: 0.0018
+
+- **Pipeline LER**: logical error rate of the full predecoder + PyMatching
+  chain compared to ground-truth observables.
+
+.. note::
+
+   Syndrome samples are cycled when the run exceeds the dataset size.
+   For example, if the dataset has 10,000 shots and the test runs 60,000
+   requests, each shot is replayed approximately 6 times. Correctness
+   verification still compares against the correct ground truth for each
+   replayed shot.