@@ -3,10 +3,9 @@
 Realtime Pipeline API
 =====================

-The realtime pipeline API provides a framework for building low-latency QEC
-decoding pipelines that combine GPU inference (e.g. TensorRT) with CPU
-post-processing (e.g. PyMatching MWPM). All types live in the
-``cudaq::qec::realtime::experimental`` namespace and are declared in
+The realtime pipeline API provides the reusable host-side runtime for
+low-latency QEC pipelines that combine GPU inference with optional CPU
+post-processing. The published reference is generated from
 ``cudaq/qec/realtime/pipeline.h``.

 .. note::
@@ -17,237 +16,57 @@ post-processing (e.g. PyMatching MWPM). All types live in the
 Configuration
 -------------

-.. class:: core_pinning
+.. doxygenstruct:: cudaq::qec::realtime::experimental::core_pinning
+   :members:

-   CPU core affinity settings for pipeline threads.
-
-   :param dispatcher: Core for the host dispatcher thread (-1 to disable pinning).
-   :param consumer: Core for the consumer (completion) thread (-1 to disable pinning).
-   :param worker_base: Base core for worker threads. Workers pin to
-      base, base+1, etc. (-1 to disable pinning).
-
-
-.. class:: pipeline_stage_config
-
-   Configuration for a single pipeline stage.
-
-   :param num_workers: Number of GPU worker threads (max 64). Default: 8.
-   :param num_slots: Number of ring buffer slots. Default: 32.
-   :param slot_size: Size of each ring buffer slot in bytes. Default: 16384.
-   :param cores: CPU core affinity settings (``core_pinning``).
-   :param external_ringbuffer: When non-null, the pipeline uses this
-      caller-owned ring buffer (``cudaq_ringbuffer_t*``) instead of
-      allocating its own. The caller is responsible for lifetime.
-      ``ring_buffer_injector`` is unavailable in this mode.
+.. doxygenstruct:: cudaq::qec::realtime::experimental::pipeline_stage_config
+   :members:


 GPU Stage
 ---------

-.. class:: gpu_worker_resources
-
-   Per-worker GPU resources returned by the ``gpu_stage_factory``.
-
-   Each worker owns a captured CUDA graph, a dedicated stream, and optional
-   pre/post launch callbacks for DMA staging or result extraction.
-
-   :param graph_exec: Instantiated CUDA graph (``cudaGraphExec_t``).
-   :param stream: Dedicated CUDA stream (``cudaStream_t``).
-   :param pre_launch_fn: Optional callback invoked before graph launch.
-   :param pre_launch_data: Opaque user data for ``pre_launch_fn``.
-   :param post_launch_fn: Optional callback invoked after graph launch.
-   :param post_launch_data: Opaque user data for ``post_launch_fn``.
-   :param function_id: RPC function ID that this worker handles.
-   :param user_context: Opaque user context passed to the CPU stage callback.
-
+.. doxygenstruct:: cudaq::qec::realtime::experimental::gpu_worker_resources
+   :members:

-.. type:: gpu_stage_factory
-
-   ``std::function<gpu_worker_resources(int worker_id)>``
-
-   Factory called once per worker during ``start()``. Returns the GPU
-   resources for the given worker index.
+.. doxygentypedef:: cudaq::qec::realtime::experimental::gpu_stage_factory


 CPU Stage
 ---------

-.. class:: cpu_stage_context
-
-   Context passed to the CPU stage callback for each completed GPU workload.
-
-   :param worker_id: Index of the worker thread.
-   :param origin_slot: Ring buffer slot that originated this request.
-   :param gpu_output: Pointer to GPU inference output (nullptr in poll mode).
-   :param gpu_output_size: Size of GPU output in bytes.
-   :param response_buffer: Destination buffer for the RPC response.
-   :param max_response_size: Maximum bytes writable to ``response_buffer``.
-   :param user_context: Opaque context from ``gpu_worker_resources``.
+.. doxygenstruct:: cudaq::qec::realtime::experimental::cpu_stage_context
+   :members:

+.. doxygentypedef:: cudaq::qec::realtime::experimental::cpu_stage_callback

-.. type:: cpu_stage_callback
-
-   ``std::function<size_t(const cpu_stage_context &ctx)>``
-
-   Returns the number of bytes written into ``response_buffer``. Special
-   return values:
-
-   - **0**: No GPU result ready yet; the pipeline will poll again.
-   - **DEFERRED_COMPLETION** (``SIZE_MAX``): Release the worker immediately
-     but defer slot completion. The caller must call
-     ``realtime_pipeline::complete_deferred(slot)`` once the deferred work
-     finishes.
+.. doxygenvariable:: cudaq::qec::realtime::experimental::DEFERRED_COMPLETION


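The return-value convention described in the removed ``cpu_stage_callback`` text (0 to be polled again, ``DEFERRED_COMPLETION`` to defer slot completion, otherwise the number of bytes written) can be sketched as below. This is a minimal illustration under stated assumptions, not the library's code: the ``sketch`` namespace, the field types, and the ``decode_state`` helper are stand-ins invented so the example compiles on its own. The real declarations are in ``cudaq/qec/realtime/pipeline.h``.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace sketch {

// Stand-in for the documented sentinel, which the docs give as SIZE_MAX.
constexpr std::size_t DEFERRED_COMPLETION = SIZE_MAX;

// Stand-in for cpu_stage_context. Field names follow the documentation;
// the field types are assumptions.
struct cpu_stage_context {
  int worker_id;
  int origin_slot;
  const void *gpu_output;
  std::size_t gpu_output_size;
  void *response_buffer;
  std::size_t max_response_size;
  void *user_context;
};

// Hypothetical per-worker state reached through user_context.
struct decode_state {
  bool ready;   // has the GPU result arrived?
  bool offload; // should another thread finish this slot later?
};

// Example callback following the documented return-value convention.
inline std::size_t example_cpu_stage(const cpu_stage_context &ctx) {
  auto *state = static_cast<decode_state *>(ctx.user_context);
  assert(state != nullptr);
  if (!state->ready)
    return 0; // nothing to report yet: the pipeline polls again
  if (state->offload)
    return DEFERRED_COMPLETION; // worker is released, slot stays open
  // Copy the result into the response buffer, never past its capacity.
  // Note that a zero-length result would look the same as "poll again".
  const std::size_t n = std::min(ctx.gpu_output_size, ctx.max_response_size);
  std::memcpy(ctx.response_buffer, ctx.gpu_output, n);
  return n;
}

} // namespace sketch
```

Per the removed text, a callback that returns ``DEFERRED_COMPLETION`` obliges the caller to invoke ``realtime_pipeline::complete_deferred(slot)`` once the deferred work finishes.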
 Completion
 ----------

-.. class:: completion
-
-   Metadata for a completed (or errored) pipeline request.
+.. doxygenstruct:: cudaq::qec::realtime::experimental::completion
+   :members:

-   :param request_id: Original request ID from the RPC header.
-   :param slot: Ring buffer slot that held this request.
-   :param success: True if the request completed without CUDA errors.
-   :param cuda_error: CUDA error code (0 on success).
-
-
-.. type:: completion_callback
-
-   ``std::function<void(const completion &c)>``
-
-   Invoked by the consumer thread for each completed or errored request.
+.. doxygentypedef:: cudaq::qec::realtime::experimental::completion_callback


 Ring Buffer Injector
 --------------------

-.. class:: ring_buffer_injector
-
-   Writes RPC-framed requests into the pipeline's ring buffer, simulating
-   FPGA DMA deposits. Created via ``realtime_pipeline::create_injector()``.
-   The parent ``realtime_pipeline`` must outlive the injector.
-
-   Not available when the pipeline is configured with an external ring buffer
-   (``pipeline_stage_config::external_ringbuffer != nullptr``).
-
-   .. method:: bool try_submit(uint32_t function_id, const void *payload, size_t payload_size, uint64_t request_id)
-
-      Try to submit a request without blocking.
-
-      :param function_id: RPC function identifier.
-      :param payload: Pointer to payload data.
-      :param payload_size: Payload size in bytes.
-      :param request_id: Caller-assigned request identifier.
-      :return: True if accepted, false if all slots are busy.
-
-   .. method:: void submit(uint32_t function_id, const void *payload, size_t payload_size, uint64_t request_id)
-
-      Submit a request, spinning until a slot becomes available.
-
-      :param function_id: RPC function identifier.
-      :param payload: Pointer to payload data.
-      :param payload_size: Payload size in bytes.
-      :param request_id: Caller-assigned request identifier.
-
-   .. method:: uint64_t backpressure_stalls() const
-
-      :return: Cumulative number of times ``submit()`` had to spin-wait.
+.. doxygenclass:: cudaq::qec::realtime::experimental::ring_buffer_injector
+   :members:


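The removed injector text distinguishes a non-blocking ``try_submit()`` (returns false when all slots are busy) from a ``submit()`` that spins until a slot frees up. A caller that wants a bounded wait can build one on ``try_submit()``. The sketch below assumes only the ``try_submit`` signature quoted in the removed lines; ``submit_with_retry`` and ``fake_injector`` are hypothetical names introduced for illustration and are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Retries try_submit() at most max_attempts times, yielding between tries.
// Works with any type exposing the documented try_submit signature.
// `rejections` is incremented once per rejected attempt.
template <typename Injector>
bool submit_with_retry(Injector &inj, std::uint32_t function_id,
                       const void *payload, std::size_t payload_size,
                       std::uint64_t request_id, int max_attempts,
                       std::uint64_t &rejections) {
  assert(max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (inj.try_submit(function_id, payload, payload_size, request_id))
      return true; // a slot accepted the request
    ++rejections;  // all slots busy on this attempt
    std::this_thread::yield();
  }
  return false; // gave up: the caller decides whether to drop or escalate
}

// Hypothetical stand-in used only to exercise the helper: it rejects the
// first `busy_calls` submissions and accepts every one after that.
struct fake_injector {
  int busy_calls;
  bool try_submit(std::uint32_t, const void *, std::size_t, std::uint64_t) {
    if (busy_calls > 0) {
      --busy_calls;
      return false;
    }
    return true;
  }
};
```

Unlike the spinning ``submit()``, a bounded retry lets a test harness detect a stalled pipeline instead of hanging in the producer.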
 Pipeline
 --------

-.. class:: realtime_pipeline
-
-   Orchestrates GPU inference and CPU post-processing for low-latency
-   realtime QEC decoding.
-
-   The pipeline manages a ring buffer, a host dispatcher thread, per-worker
-   GPU streams with captured CUDA graphs, optional CPU worker threads, and a
-   consumer thread for completion signaling. It supports both an internal
-   ring buffer (for software testing via ``ring_buffer_injector``) and an
-   external ring buffer (for FPGA RDMA).
-
-   **Lifecycle:**
-
-   1. Construct with ``pipeline_stage_config``
-   2. Register callbacks: ``set_gpu_stage()``, ``set_cpu_stage()`` (optional),
-      ``set_completion_handler()`` (optional)
-   3. Call ``start()`` to spawn threads
-   4. Submit requests via ``ring_buffer_injector`` or external FPGA DMA
-   5. Call ``stop()`` to shut down
-
-   .. method:: realtime_pipeline(const pipeline_stage_config &config)
-
-      Construct a pipeline and allocate ring buffer resources.
-
-      :param config: Stage configuration.
-
-   .. method:: void set_gpu_stage(gpu_stage_factory factory)
-
-      Register the GPU stage factory. Must be called before ``start()``.
-
-      :param factory: Callback returning ``gpu_worker_resources`` per worker.
-
-   .. method:: void set_cpu_stage(cpu_stage_callback callback)
-
-      Register the CPU worker callback. Must be called before ``start()``.
-      If not set, the pipeline operates in GPU-only mode with completion
-      signaled via ``cudaLaunchHostFunc``.
-
-      :param callback: CPU stage processing function.
-
-   .. method:: void set_completion_handler(completion_callback handler)
-
-      Register the completion callback. Must be called before ``start()``.
-
-      :param handler: Function called for each completed request.
-
-   .. method:: void start()
-
-      Allocate resources, build dispatcher config, and spawn all threads.
-
-   .. method:: void stop()
-
-      Signal shutdown, join all threads, and free resources.
-
-   .. method:: ring_buffer_injector create_injector()
-
-      Create a software injector for testing without FPGA hardware.
-
-      :return: A ``ring_buffer_injector`` bound to this pipeline.
-      :raises std::logic_error: If the pipeline uses an external ring buffer.
-
-   .. method:: Stats stats() const
-
-      Thread-safe, lock-free statistics snapshot.
-
-      :return: Current ``Stats`` struct.
-
-   .. method:: void complete_deferred(int slot)
-
-      Signal that deferred processing for a slot is complete. Call from any
-      thread after the CPU stage callback returned ``DEFERRED_COMPLETION``.
-
-      :param slot: Ring buffer slot index to complete.
-
-   .. method:: ring_buffer_bases ringbuffer_bases() const
-
-      :return: Host and device base addresses of the RX data ring.
-
-   .. class:: Stats
-
-      Pipeline throughput and backpressure statistics.
-
-      :param submitted: Total requests submitted to the ring buffer.
-      :param completed: Total requests that completed (success or error).
-      :param dispatched: Total packets dispatched by the host dispatcher.
-      :param backpressure_stalls: Cumulative producer backpressure stalls.
-
-   .. class:: ring_buffer_bases
+.. doxygenclass:: cudaq::qec::realtime::experimental::realtime_pipeline
+   :members:

-      Host and device base addresses of the RX data ring.
+.. doxygenstruct:: cudaq::qec::realtime::experimental::realtime_pipeline::Stats
+   :members:

-      :param rx_data_host: Host-mapped base pointer.
-      :param rx_data_dev: Device-mapped base pointer.
+.. doxygenstruct:: cudaq::qec::realtime::experimental::realtime_pipeline::ring_buffer_bases
+   :members:
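The lifecycle in the removed ``realtime_pipeline`` text ends with ``start()`` spawning threads and ``stop()`` joining them, so every successful ``start()`` should be paired with a ``stop()``, including on early returns and exceptions. One way to guarantee that pairing is a small RAII guard, sketched below. ``scoped_run`` and ``fake_pipeline`` are hypothetical names introduced here; the guard assumes only that the pipeline type exposes the ``start()`` and ``stop()`` methods named in the removed text.

```cpp
#include <cassert>

// RAII guard: calls start() on construction and stop() on destruction,
// so the pipeline is shut down on every exit path from the scope.
template <typename Pipeline>
class scoped_run {
public:
  explicit scoped_run(Pipeline &p) : p_(p) { p_.start(); }
  ~scoped_run() { p_.stop(); }

  // Non-copyable: exactly one stop() per start().
  scoped_run(const scoped_run &) = delete;
  scoped_run &operator=(const scoped_run &) = delete;

private:
  Pipeline &p_;
};

// Hypothetical stand-in that only records calls, used to exercise the guard.
struct fake_pipeline {
  int starts = 0;
  int stops = 0;
  bool running = false;

  void start() {
    assert(!running); // callbacks would have been registered before this
    running = true;
    ++starts;
  }
  void stop() {
    running = false;
    ++stops;
  }
};
```

With a guard like this, callback registration (``set_gpu_stage()`` and the optional setters) happens before the guard is constructed, and request submission happens inside its scope.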