Add skill and evals for dynamic mode usage #6271
**New file** (eval definitions for the skill; the path is not shown in this view):

```json
{
  "skill_name": "using-dali-dynamic-mode",
  "evals": [
    {
      "id": 1,
      "prompt": "Write a complete Python script that uses DALI dynamic mode to load and preprocess images for training an image classification model with PyTorch. The images are JPEGs on disk, and I need GPU-accelerated decode, resize to 224x224, and ImageNet normalization. The script should include the training loop.",
      "expected_output": "Complete pipeline using ndd.readers.File, ndd.decoders.image(device='gpu'), ndd.resize, ndd.crop_mirror_normalize, .torch() handoff",
      "files": [],
      "assertions": [
        {"name": "correct-import", "text": "Uses import nvidia.dali.experimental.dynamic as ndd"},
        {"name": "reader-no-batchsize-in-constructor", "text": "batch_size is NOT passed to the ndd.readers.File() constructor (it belongs in next_epoch(), not the reader constructor)"},
        {"name": "reader-pascalcase", "text": "Reader is PascalCase: ndd.readers.File(...)"},
        {"name": "reader-stateful", "text": "Reader created once outside loop, reused across epochs"},
        {"name": "next-epoch-iteration", "text": "Uses reader.next_epoch(batch_size=N) for iteration"},
        {"name": "device-gpu-not-mixed", "text": "Uses device='gpu' for decoder, NOT device='mixed'"},
        {"name": "no-pipeline-mode", "text": "No pipeline-mode constructs (no @pipeline_def, pipe.build(), pipe.run()) and operators called directly on ndd (e.g. ndd.resize, not fn.resize or ndd.fn.resize)"},
        {"name": "torch-handoff", "text": "Uses .torch() for PyTorch conversion"},
        {"name": "no-unnecessary-evaluate", "text": "No unnecessary .evaluate() calls"},
        {"name": "set-num-threads", "text": "Calls ndd.set_num_threads() at startup"}
      ]
    },
    {
      "id": 2,
      "prompt": "I have a Batch of 2D random values in DALI dynamic mode and need to extract the first column as crop_x and the second column as crop_y to pass to an operator. How do I do this? Show a working code example.",
      "expected_output": "Uses batch.slice[0] and batch.slice[1] for samplewise slicing",
      "files": [],
      "assertions": [
        {"name": "correct-slice-usage", "text": "Uses batch.slice[0] and batch.slice[1] (not batch.select())"},
        {"name": "no-getitem", "text": "Does not use batch[0] or batch[:, 0] (Batch has no __getitem__)"},
        {"name": "correct-slice-semantics", "text": "Correctly explains that .slice indexes within each sample, not across samples"},
        {"name": "batch-size-to-random", "text": "Passes batch_size to ndd.random.uniform()"}
      ]
    },
    {
      "id": 3,
      "prompt": "Convert the following pipeline-mode DALI code to dynamic mode. Write the complete converted script.",
      "expected_output": "Correct conversion with all pipeline-mode patterns replaced",
      "files": ["evals/files/pipeline_to_convert.py"],
      "assertions": [
        {"name": "device-gpu-not-mixed", "text": "device='mixed' converted to device='gpu'"},
        {"name": "reader-pascalcase", "text": "fn.readers.file converted to ndd.readers.File (PascalCase)"},
        {"name": "no-pipeline-mode", "text": "No pipeline-mode constructs (no @pipeline_def, pipe.build(), pipe.run()) and operators called directly on ndd (e.g. ndd.rotate, not fn.rotate or ndd.fn.rotate)"},
        {"name": "next-epoch-iteration", "text": "Uses reader.next_epoch(batch_size=N) for iteration (batch_size in next_epoch, not reader constructor)"},
        {"name": "seed-handling", "text": "Pipeline seed converted to ndd.random.set_seed() or RNG(seed=)"},
        {"name": "set-num-threads", "text": "Pipeline num_threads converted to ndd.set_num_threads()"},
        {"name": "batch-size-to-random", "text": "batch_size passed to random operators (uniform, coin_flip)"}
      ]
    },
    {
      "id": 4,
      "prompt": "My data loading code built with DALI's dynamic (imperative) API produces wrong results intermittently — images sometimes appear corrupted. The code decodes JPEG images on the GPU, resizes them, and normalizes them. How do I debug this? Write a debugging guide with code examples.",
      "expected_output": "Recommends EvalMode.sync_full or sync_cpu for debugging, explains async execution model, code examples use correct dynamic mode patterns",
      "files": [],
      "assertions": [
        {"name": "recommends-sync-mode", "text": "Recommends EvalMode.sync_full or EvalMode.sync_cpu for debugging"},
        {"name": "no-scatter-evaluate", "text": "Does not recommend adding .evaluate() after every operation as the primary debugging approach"},
        {"name": "correct-evalmode-syntax", "text": "Uses correct context manager syntax: with ndd.EvalMode.sync_full: (not ndd.eval_mode(...) or other invented API)"},
        {"name": "correct-sample-inspection", "text": "When inspecting intermediate values, uses batch.select(i).cpu() or np.asarray(batch.select(i).cpu()) — not batch[i] or batch.as_cpu().as_array()"},
        {"name": "code-examples-no-pipeline-mode", "text": "All code examples in the guide use dynamic mode patterns (ndd.decoders.image, ndd.resize, etc.) — no fn.* or ndd.fn.* operators in any code snippet"},
        {"name": "code-examples-device-gpu", "text": "All code examples use device='gpu' for decode, NOT device='mixed'"}
      ]
    },
    {
      "id": 5,
      "prompt": "I need to train a speech classification model on WAV files using PyTorch. Write a complete Python script that uses DALI dynamic mode for the data loading and audio feature extraction (mel spectrograms). My audio clips have different durations.",
      "expected_output": "Uses ndd.readers, ndd.decoders.audio(), spectral ops, handles variable-length via .torch(pad=True)",
      "files": [],
      "assertions": [
        {"name": "correct-import", "text": "Uses import nvidia.dali.experimental.dynamic as ndd"},
        {"name": "device-gpu-not-mixed", "text": "Uses device='gpu' for audio decode, NOT device='mixed'"},
        {"name": "reader-pascalcase", "text": "Reader class is PascalCase (e.g. ndd.readers.File)"},
        {"name": "reader-stateful", "text": "Reader created once and reused across epochs via next_epoch()"},
        {"name": "torch-pad-true", "text": "Uses .torch(pad=True) to handle variable-length spectrograms when converting to PyTorch"},
        {"name": "no-pipeline-mode", "text": "No pipeline-mode constructs (no @pipeline_def, pipe.build(), pipe.run()) and operators called directly on ndd (e.g. ndd.resize, not fn.resize or ndd.fn.resize)"}
      ]
    },
    {
      "id": 6,
      "prompt": "Write a complete Python script for an object detection training pipeline using DALI dynamic mode and PyTorch. It should read COCO-format images and annotations, apply random horizontal flip as augmentation (both images and their bounding boxes), resize, normalize, and feed to a model.",
      "expected_output": "DALI reader with bbox support, coordinated augmentation via ndd.random, correct dynamic mode patterns",
      "files": [],
      "assertions": [
        {"name": "correct-import", "text": "Uses import nvidia.dali.experimental.dynamic as ndd"},
        {"name": "batch-size-to-random", "text": "Passes batch_size to the coin_flip/random operator"},
        {"name": "device-gpu-not-mixed", "text": "Uses device='gpu' for decode, NOT device='mixed'"},
        {"name": "next-epoch-iteration", "text": "Uses reader.next_epoch(batch_size=N) for iteration"},
        {"name": "torch-pad-true", "text": "Uses .torch(pad=True) for bounding boxes (ragged — different images have different numbers of boxes)"},
        {"name": "no-pipeline-mode", "text": "No pipeline-mode constructs (no @pipeline_def, pipe.build(), pipe.run()) and operators called directly on ndd (e.g. ndd.resize, not fn.resize or ndd.fn.resize)"},
        {"name": "coordinated-flip", "text": "The same coin_flip Batch is passed to both ndd.flip (images) and ndd.bb_flip (bounding boxes) — not two separate independent coin flips"}
      ]
    }
  ]
}
```
**New file:** `evals/files/pipeline_to_convert.py` (the pipeline-mode fixture referenced by eval 3; imports added so the fixture is runnable as-is):

```python
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn


@pipeline_def
def training_pipeline(image_dir):
    jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")
    angle = fn.random.uniform(range=(-30, 30))
    images = fn.rotate(images, angle=angle)
    mirror = fn.random.coin_flip(probability=0.5)
    images = fn.crop_mirror_normalize(
        images,
        crop=(224, 224),
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=mirror,
    )
    return images, labels


pipe = training_pipeline(
    image_dir="/data/images",
    batch_size=64,
    num_threads=4,
    device_id=0,
    seed=42,
)
pipe.build()
for _ in range(100):
    images, labels = pipe.run()
```
**New file** (the skill itself):

---
name: using-dali-dynamic-mode
description: "Use when writing DALI data loading or preprocessing code with `nvidia.dali.experimental.dynamic` (ndd), or when converting DALI pipeline-mode code to dynamic mode, or when the user asks about DALI dynamic mode, imperative DALI, or ndd. Use this skill any time someone mentions 'ndd', 'dynamic mode', 'DALI preprocessing', or wants to load/augment data with DALI outside of a pipeline definition."
---

# DALI Dynamic Mode

Dynamic mode is DALI's imperative Python API. Call DALI operators as regular Python functions with standard control flow -- no pipeline graph, no `pipe.build()`, no `pipe.run()`.

```python
import nvidia.dali.experimental.dynamic as ndd
```
## Core Data Types

### Tensor -- single sample

```python
t = ndd.tensor(data)     # copy
t = ndd.as_tensor(data)  # wrap, no copy if possible
t.cpu()                  # move to CPU
t.gpu()                  # move to GPU
t.torch(copy=False)      # zero-copy PyTorch tensor (default)
t[1:3]                   # slicing supported
np.asarray(t)            # NumPy via __array__ (CPU only)
```

Supports `__dlpack__`, `__cuda_array_interface__`, `__array__`, arithmetic operators.

### Batch -- collection of samples (variable shapes OK)

```python
b = ndd.batch([arr1, arr2])  # copy
b = ndd.as_batch(data)       # wrap, no copy if possible
```

**Batch has no `__getitem__`** -- `batch[i]` raises `TypeError` because indexing is ambiguous (sample selection vs. per-sample slicing). Use the explicit APIs instead:

| Intent | Method | Returns |
|--------|--------|---------|
| Get sample i | `batch.select(i)` | `Tensor` |
| Get subset of samples | `batch.select(slice_or_list)` | `Batch` |
| Slice within each sample | `batch.slice[...]` | `Batch` (same batch_size) |

`.select()` picks **which samples**. `.slice` indexes **inside each sample**.

```python
xy = ndd.random.uniform(batch_size=16, range=[0, 1], shape=2)
crop_x = xy.slice[0]     # Batch of 16 scalars, first element from each sample
crop_y = xy.slice[1]     # Batch of 16 scalars, second element from each sample
sample_0 = xy.select(0)  # Tensor, the entire first sample [x, y]
```
**PyTorch conversion:**
- `batch.torch()` -- works for uniform shapes; raises for ragged batches
- `batch.torch(pad=True)` -- zero-pads ragged batches to the max shape (use for variable-length audio, detection boxes, etc.)
- `batch.torch(copy=None)` is the default (avoids a copy when possible)
- Batch has **no `__dlpack__`** -- use `ndd.as_tensor(batch)` first for DLPack consumers; `ndd.as_tensor` supports `pad` as well
- `Tensor.torch(copy=False)` is the default (no copy)
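To make the `pad=True` semantics concrete, here is a NumPy-only sketch (not the DALI API -- `pad_to_max_shape` is a hypothetical helper) of what zero-padding a ragged batch to the elementwise max shape means:

```python
import numpy as np

def pad_to_max_shape(samples, fill=0):
    """Zero-pad a ragged list of same-rank arrays to a common max shape,
    mimicking what a pad=True conversion does conceptually."""
    ndim = samples[0].ndim
    max_shape = tuple(max(s.shape[d] for s in samples) for d in range(ndim))
    out = np.full((len(samples), *max_shape), fill, dtype=samples[0].dtype)
    for i, s in enumerate(samples):
        # Copy each sample into the top-left corner of its padded slot
        out[(i,) + tuple(slice(0, n) for n in s.shape)] = s
    return out

# Two variable-length "audio" samples of lengths 3 and 5
ragged = [np.ones(3, dtype=np.float32), np.ones(5, dtype=np.float32)]
padded = pad_to_max_shape(ragged)
print(padded.shape)  # (2, 5)
```

The shorter sample ends up with trailing zeros, which is why padded conversion is the right fit for variable-length spectrograms or per-image bounding-box lists.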
**Iteration:** `for sample in batch:` yields Tensors.

## Readers

Readers are **stateful objects** -- create once, reuse across epochs. This matters because readers track internal state such as shuffle order and shard position.

```python
reader = ndd.readers.File(file_root=image_dir, random_shuffle=True)

for epoch in range(num_epochs):
    for jpegs, labels in reader.next_epoch(batch_size=64):
        # jpegs, labels are Batch objects
        ...
```

Key points:
- Reader outputs (jpegs, labels, etc.) are **CPU** tensors/batches. Labels typically stay on CPU until you convert them for your framework (e.g. `labels.torch().to(device)`).
- Reader classes are **PascalCase**: `ndd.readers.File(...)`, `ndd.readers.COCO(...)`, `ndd.readers.TFRecord(...)`
- `batch_size` goes to `next_epoch()`, not to the reader constructor
- `next_epoch(batch_size=N)` yields tuples of `Batch`; `next_epoch()` without `batch_size` yields tuples of `Tensor`
- The iterator from `next_epoch()` must be fully consumed before calling `next_epoch()` again
- Batch size can vary between epochs but is bounded by `max_batch_size`; if not specified, it defaults to the first batch size used
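The statefulness is the key difference from a plain function. As a pure-Python toy model (this `ToyFileReader` is illustrative only, not the real `ndd.readers.File`), a reader holds an RNG that persists across epochs, so each `next_epoch()` produces a fresh shuffle order from the same object:

```python
import random

class ToyFileReader:
    """Toy stand-in for a stateful reader: created once, reshuffles
    each epoch from persistent RNG state, yields fixed-size batches."""
    def __init__(self, files, random_shuffle=False, seed=0):
        self.files = list(files)
        self.random_shuffle = random_shuffle
        self.rng = random.Random(seed)  # state persists across epochs

    def next_epoch(self, batch_size):
        order = list(self.files)
        if self.random_shuffle:
            self.rng.shuffle(order)  # a different order each epoch
        for i in range(0, len(order), batch_size):
            yield order[i:i + batch_size]

reader = ToyFileReader([f"img_{i}.jpg" for i in range(6)], random_shuffle=True)
epoch1 = list(reader.next_epoch(batch_size=2))
epoch2 = list(reader.next_epoch(batch_size=2))
print(len(epoch1))  # 3 batches of 2 samples each
```

Recreating the reader every epoch would reset that state and replay the same shuffle order, which is exactly the mistake the "create once, reuse" rule guards against.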
Sharded reading for distributed training:

```python
reader = ndd.readers.File(
    file_root=image_dir,
    shard_id=rank, num_shards=world_size,
    stick_to_shard=True,
    pad_last_batch=True,
)
```
## Device Handling

- Device is **inferred from inputs** -- GPU if any input is on the GPU
- For hybrid decode, use `device="gpu"` (NOT `"mixed"`). The `"mixed"` keyword is a pipeline-mode concept for implicit CPU-to-GPU transfer; in dynamic mode, passing `device="gpu"` triggers the same hardware-accelerated decode path.
- Don't call `.cpu()` before passing to a GPU model -- `.torch()` gives you a GPU tensor directly. `.cpu()` is only needed for consumers that require host memory (NumPy, `__array__`).
- CUDA stream synchronization between DALI and PyTorch is **automatic via DLPack** -- no manual stream management needed.
## Execution Model

The default mode is `eager` -- asynchronous execution in a background thread; calls return immediately.

**No `.evaluate()` needed in most cases.** Any data consumption (`.torch()`, `__dlpack__`, `__array__`, `.shape`, property access, iteration) triggers evaluation automatically.

For debugging, switch to synchronous mode so errors surface at the exact call site rather than later in the async queue:

```python
with ndd.EvalMode.sync_full:
    images = ndd.decoders.image(jpegs, device="gpu")
    images = ndd.resize(images, size=[224, 224])
    # Any error surfaces here, at the exact op that failed
```

Modes (increasing synchronicity): `deferred` < `eager` < `sync_cpu` < `sync_full`

Use `EvalMode.sync_full` for debugging instead of scattering `.evaluate()` calls -- it's cleaner and catches all issues at once. `sync_cpu` is often sufficient and lighter than `sync_full`.
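Why errors move away from the call site under lazy/async execution can be shown with a tiny pure-Python model (`Deferred` is a toy class, not DALI API): the failing function only runs when the result is consumed, which is the behavior the synchronous eval modes exist to pin down.

```python
class Deferred:
    """Toy deferred value: work runs only when the result is consumed,
    so a failure surfaces at consumption time, not at the call site."""
    def __init__(self, fn):
        self.fn = fn

    def value(self):
        return self.fn()  # evaluation (and any error) happens here

def bad_decode():
    raise ValueError("corrupt JPEG")

x = Deferred(bad_decode)  # returns immediately -- no error raised yet
try:
    _ = x.value()         # the error surfaces only on consumption
    caught = None
except ValueError as e:
    caught = str(e)
print("caught at consumption:", caught)
```

In a real async queue the consumption point can be many operations downstream, which is why a stack trace in eager mode often points nowhere near the op that actually failed.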
## Thread Configuration

```python
ndd.set_num_threads(4)  # Call once at startup
```

Controls DALI's internal worker threads for CPU operators. Defaults to the CPU affinity count or the `DALI_NUM_THREADS` env var. Unrelated to Python-level threading.
## RNG

Two approaches (use one, not both):

```python
# Approach 1: set the thread-local default seed (simple, good enough for most cases)
ndd.random.set_seed(42)
angles = ndd.random.uniform(batch_size=64, range=(-30, 30))

# Approach 2: explicit RNG object (finer control, pass rng= to each op)
rng = ndd.random.RNG(seed=42)
values = ndd.random.uniform(batch_size=64, range=[0, 1], shape=2, rng=rng)
```

When `rng=` is passed to a random op, the explicit RNG overrides the default seed. The default random state is thread-local: each thread has an independent stream.

Random ops need an explicit `batch_size` when working with batches -- there is no pipeline-level batch size to inherit.
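The global-seed vs explicit-generator split mirrors NumPy's RNG design, which can serve as a runnable analogy (NumPy stands in here; the mapping to `ndd.random` is conceptual, not an API equivalence):

```python
import numpy as np

# Global-seed style: analogous to ndd.random.set_seed(42)
np.random.seed(42)
a = np.random.uniform(-30, 30, size=64)

# Explicit-generator style: analogous to ndd.random.RNG(seed=42)
rng = np.random.default_rng(seed=42)
b = rng.uniform(-30, 30, size=64)

# Same seed, fresh generator -> identical stream, untouched by global state
rng2 = np.random.default_rng(seed=42)
c = rng2.uniform(-30, 30, size=64)

print(np.allclose(b, c))  # True
```

The explicit-generator style is preferable when several components need independent, reproducible streams, since no component can perturb another's state through the global seed.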
## Example: Image Classification Pipeline

```python
import nvidia.dali.experimental.dynamic as ndd

ndd.set_num_threads(4)
reader = ndd.readers.File(file_root="/data/imagenet/train", random_shuffle=True)

for epoch in range(num_epochs):
    for jpegs, labels in reader.next_epoch(batch_size=64):
        images = ndd.decoders.image(jpegs, device="gpu")
        images = ndd.resize(images, size=[224, 224])
        images = ndd.crop_mirror_normalize(
            images,
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        )
        train_step(images.torch(), labels.torch())
```

> **Reviewer:** Is there a way to test this code against going stale?
>
> **Author:** Unless we set up infrastructure to run the evals in CI, I don't think there is. We should maintain the skill the same way we maintain the documentation.
## Common Mistakes

| Wrong | Right | Why |
|-------|-------|-----|
| `device="mixed"` | `device="gpu"` | `"mixed"` is pipeline mode only |
| `batch[i]` | `batch.select(i)` | `Batch` has no `__getitem__` |
| `batch.select(0)` for per-sample slicing | `batch.slice[0]` | `.select()` picks samples; `.slice` slices within each sample |
| `.evaluate()` after every op | Let consumption trigger eval | `.torch()`, `.shape`, etc. trigger it automatically |
| `.cpu()` before GPU model | `.torch()` directly | Avoids a wasteful D2H + H2D round-trip |
| Recreate reader each epoch | `reader.next_epoch()` | Readers are stateful -- create once, reuse |
| `ndd.readers.file(...)` | `ndd.readers.File(...)` | Reader classes are PascalCase |
| `break` from `next_epoch()` loop | Exhaust the iterator or create a new reader | The iterator must be fully consumed before the next `next_epoch()` |
| No `batch_size` to random ops | `ndd.random.uniform(batch_size=N, ...)` | No pipeline-level batch size to inherit |
## Pipeline Mode Migration

| Pipeline Mode | Dynamic Mode |
|--------------|--------------|
| `@pipeline_def` / `pipe.build()` / `pipe.run()` | Direct function calls in a loop |
| `fn.readers.file(...)` | `ndd.readers.File(...)` (PascalCase, stateful) |
| `fn.decoders.image(jpegs, device="mixed")` | `ndd.decoders.image(jpegs, device="gpu")` |
| `fn.op_name(...)` | `ndd.op_name(...)` |
| Pipeline-level `batch_size=64` | `reader.next_epoch(batch_size=64)` + random ops `batch_size=64` |
| Pipeline-level `seed=42` | `ndd.random.set_seed(42)` or `ndd.random.RNG(seed=42)` |
| Pipeline-level `num_threads=4` | `ndd.set_num_threads(4)` at startup |
| `output.at(i)` | `batch.select(i)` |
| `output.as_cpu()` | `batch.cpu()` |
| `pipe.run()` returns tuple of `TensorList` | `reader.next_epoch(batch_size=N)` yields tuples of `Batch` |