Draft
Changes from 250 commits
Commits
513 commits
efd3555
Merge branch 'memcpy_map_to_libnode_pass' into explicit-streams
ThrudPrimrose Apr 25, 2026
4d008c4
Merge branch 'explicit-streams' into new-gpu-codegen-dev
ThrudPrimrose Apr 25, 2026
f9b4c41
Auto-trigger GPU CI workflows on PR/push/merge_group
ThrudPrimrose Apr 25, 2026
454525c
Add experimental GPU CI + auto-trigger both workflows on PR
ThrudPrimrose Apr 25, 2026
9b43207
Add experimental GPU CI + auto-trigger both workflows on PR
ThrudPrimrose Apr 25, 2026
545f441
Preserve _stream_in connector on memcpy tasklet
ThrudPrimrose Apr 25, 2026
a2bf51d
Scalar: trivial packed-strides predicates
ThrudPrimrose Apr 25, 2026
43d1500
Merge memcpy_map_to_libnode_pass
ThrudPrimrose Apr 25, 2026
f70b66f
Merge explicit-streams
ThrudPrimrose Apr 25, 2026
3010949
Default to experimental codegen
ThrudPrimrose Apr 26, 2026
76e9166
Attempt to fix remaining gpu codegen bugs
ThrudPrimrose Apr 26, 2026
6c6879a
Rm submodules
ThrudPrimrose Apr 26, 2026
9c5cede
Tests
ThrudPrimrose Apr 26, 2026
55e3099
Run lint
ThrudPrimrose Apr 26, 2026
b5120d2
Fixes
ThrudPrimrose Apr 26, 2026
54435c6
Minor fix to copynd
ThrudPrimrose Apr 26, 2026
7ad8831
More fixex
ThrudPrimrose Apr 26, 2026
e9b1438
New GPU codegen fixes
ThrudPrimrose Apr 26, 2026
c6f28ae
Extend support for other subset
ThrudPrimrose Apr 26, 2026
fab551f
Fixes
ThrudPrimrose Apr 27, 2026
c087d9c
New attempt to fix new gpu codegen
ThrudPrimrose Apr 27, 2026
ce9d4b7
Fixes to new gpu codegen
ThrudPrimrose Apr 28, 2026
6c9be0b
More fixes to stream and copy node management
ThrudPrimrose Apr 28, 2026
6944f32
Fix many things
ThrudPrimrose Apr 28, 2026
8f8fe46
Run lint
ThrudPrimrose Apr 28, 2026
5c4f219
Add correct keywords
ThrudPrimrose Apr 28, 2026
7214cb0
Fix things
ThrudPrimrose Apr 28, 2026
6e0d3d4
Fixes
ThrudPrimrose Apr 28, 2026
df04c6e
RM old dep
ThrudPrimrose Apr 28, 2026
5f67d70
Run lint
ThrudPrimrose Apr 28, 2026
9047c29
gpu_specialization: refactor stream lowering — strategy-as-policy, un…
ThrudPrimrose Apr 29, 2026
2f22e8a
gpu_specialization: compose monolithic validator, merge seq-scope rou…
ThrudPrimrose Apr 29, 2026
2546c33
Fix issues refactor a littl
ThrudPrimrose Apr 29, 2026
1fe674e
Attempt fix
ThrudPrimrose Apr 29, 2026
cf62d4d
Name fixes
ThrudPrimrose Apr 29, 2026
e2e669e
Try fix thing
ThrudPrimrose Apr 30, 2026
6137e7a
Refactor
ThrudPrimrose Apr 30, 2026
bd6c8e3
Fix regressions
ThrudPrimrose May 1, 2026
8a83ffd
Drop map-staging copy lift
ThrudPrimrose May 1, 2026
d665388
Fix erronous and unnecessary transformation
ThrudPrimrose May 1, 2026
a9922ac
Added an `__init__.py` file to `dace.codegen.targets.experimental_cud…
philip-paul-mueller May 7, 2026
612a90b
Fixing `config_schema.yml`. (#2363)
philip-paul-mueller May 11, 2026
2b7f2fa
Add some stuff
ThrudPrimrose May 11, 2026
c394f73
Fixes
ThrudPrimrose May 11, 2026
3e4a702
Merge branch 'main' into new-gpu-codegen-dev
ThrudPrimrose May 11, 2026
d21b365
Drop sudo apt-get from gpu-experimental-ci
ThrudPrimrose May 11, 2026
29688e2
Merge branch 'main' into new-gpu-codegen-dev
ThrudPrimrose May 11, 2026
7a025e8
Drop sudo apt-get from gpu-ci (legacy)
ThrudPrimrose May 12, 2026
65445b8
Skip direct-copy lift for Stream and custom-target storages
ThrudPrimrose May 12, 2026
f88e0d6
Fix handling of unknown copy types added by targets added to dispatch…
ThrudPrimrose May 12, 2026
902b2d7
Pin I1 regression: minimal CPU→GPU→CPU scalar round-trip segfault
ThrudPrimrose May 12, 2026
3329bcb
Fix I1: allow Scalar endpoints in _replace_direct_copies
ThrudPrimrose May 12, 2026
89fca3a
I2: publish libnode connector names as class constants
ThrudPrimrose May 12, 2026
dc61eec
I3 + ReinferConnectorTypes rename + Memset constructor fix
ThrudPrimrose May 12, 2026
d0d51b3
Pin WCR survival through experimental GPU pipeline
ThrudPrimrose May 12, 2026
d596246
Add WCR np.sum tests (implicit + explicit GPU storage)
ThrudPrimrose May 12, 2026
7a6fb2b
Broaden Register demotion to small literal-shape transients
ThrudPrimrose May 12, 2026
eed8ff8
I4: gpu_utils dedup, drop dead helper, refresh DESIGN.md
ThrudPrimrose May 12, 2026
4332ea0
Code-review cleanup: docstring fix + predicate rename
ThrudPrimrose May 12, 2026
2f8457d
auto_optimize: don't demote small-map to Sequential when data is GPU_…
ThrudPrimrose May 12, 2026
fa46197
Remove explicit-GPU-storage np.sum xfail test
ThrudPrimrose May 12, 2026
9bd4e89
Stream scheduler: self-idempotency + pipeline docstring refresh
ThrudPrimrose May 12, 2026
6895933
Memcpy/Memset pass: reject transpose patterns
ThrudPrimrose May 12, 2026
ea5e250
Fix things
ThrudPrimrose May 12, 2026
5d3343a
gpu_utils: re-export to_3d_dims / product / validate_block_size_limits
ThrudPrimrose May 12, 2026
3cce708
Fix imports
ThrudPrimrose May 12, 2026
3be988d
Cleanup: direct imports, drop -> None / depends_on boilerplate, prune…
ThrudPrimrose May 12, 2026
a91d252
Minor refactor
ThrudPrimrose May 12, 2026
0a596dd
Memset/Memcpy pass: skip maps nested in any GPU scope
ThrudPrimrose May 12, 2026
a7a5f61
Memset/Memcpy pass: skip single-element lifts
ThrudPrimrose May 12, 2026
610bea8
Fix assignment map to tasklet pass
ThrudPrimrose May 12, 2026
3eeef35
Minor fixes the assignment map to libnode pass
ThrudPrimrose May 12, 2026
eff356a
cuda_test.sh: skip filter.py under experimental codegen
ThrudPrimrose May 12, 2026
cb33cf4
Add CopyLibraryNode / MemsetLibraryNode + InsertExplicitCopies pass
ThrudPrimrose May 13, 2026
b44ed68
Add AssignmentAndCopyKernelToMemsetAndMemcpy pass + wire into simplify
ThrudPrimrose May 13, 2026
9602b41
infer_types: default-schedule patch for CopyLibraryNode / MemsetLibra…
ThrudPrimrose May 13, 2026
921df18
infer_types: default-schedule patch for CopyLibraryNode / MemsetLibra…
ThrudPrimrose May 13, 2026
6abc67b
Remove InsertExplicitCopies from SIMPLIFY_PASSES: lowering pass, not …
ThrudPrimrose May 13, 2026
c84419d
Remove InsertExplicitCopies + AssignmentAndCopyKernelToMemsetAndMemcp…
ThrudPrimrose May 13, 2026
8e807cf
tests/passes/iec: add xfail pins for view-lift bugs
ThrudPrimrose May 13, 2026
db43693
IEC: collapse AN->View->AN round-trips; CopyND fallback for rank-mism…
ThrudPrimrose May 13, 2026
3663cda
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 13, 2026
c64c965
assignment-memset pass: drop dynamic map-range connectors when lifting
ThrudPrimrose May 13, 2026
f82811b
cleanup: dead code, docstrings, comments
ThrudPrimrose May 15, 2026
faef60f
cleanup: dead code, docstrings, comments
ThrudPrimrose May 15, 2026
ec00a31
cleanup: dead code, docstrings, comments
ThrudPrimrose May 15, 2026
3dcd98b
Merge explicit-copy-memset-nodes
ThrudPrimrose May 15, 2026
8fa8ff0
Merge assignment-copy-kernel-to-libnode
ThrudPrimrose May 15, 2026
8848dec
fix: mutable default arg in AssignmentAndCopyKernelToMemsetAndMemcpy
ThrudPrimrose May 15, 2026
6acf08a
Merge assignment-copy-kernel-to-libnode
ThrudPrimrose May 15, 2026
bf51a91
docs: ASCII-only, repo backtick convention, concise docstrings
ThrudPrimrose May 15, 2026
e88aceb
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 15, 2026
6d4882d
docs: ASCII-only, repo backtick convention, concise docstrings
ThrudPrimrose May 15, 2026
bdc90e0
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 15, 2026
92f031c
docs: ASCII-only, repo backtick convention, concise docstrings
ThrudPrimrose May 15, 2026
e40f224
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 15, 2026
27abf7b
docs: fix :role: cross-reference backticks
ThrudPrimrose May 15, 2026
21d711d
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 15, 2026
e6f03a4
docs: ASCII-only, repo backtick convention, concise docstrings
ThrudPrimrose May 15, 2026
9cf409e
docs: ASCII-only, repo backtick convention, concise docstrings
ThrudPrimrose May 15, 2026
f1f0c8a
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 15, 2026
832299f
docs: rigorous Sphinx docstring + comment audit
ThrudPrimrose May 15, 2026
5123e8a
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 15, 2026
a1c4026
docs: rigorous Sphinx docstring + comment audit
ThrudPrimrose May 16, 2026
8f58dde
Fix __dace_current_stream undeclared: scheduler wires a connector of …
ThrudPrimrose May 16, 2026
26547d6
Unify libnode stream connector on __dace_current_stream (valid in bot…
ThrudPrimrose May 16, 2026
cc5dae8
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 16, 2026
5dc30ec
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 16, 2026
bb96608
Unify scheduler stream connector on __dace_current_stream (match help…
ThrudPrimrose May 16, 2026
3812133
Remove libnode stream-input plumbing; rename STREAM_CONN -> CURRENT_S…
ThrudPrimrose May 16, 2026
e0f6d9b
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 16, 2026
5bb80fa
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 16, 2026
d54c490
gpu_helpers: reference renamed CURRENT_STREAM_NAME in comment
ThrudPrimrose May 16, 2026
f90df82
Drop banner comments in stream-lowering helpers + stream test
ThrudPrimrose May 16, 2026
22acb94
Remove dead stream-shim code; single-source the stream-connector name
ThrudPrimrose May 16, 2026
58c4e47
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 17, 2026
dae3798
Reuse subset.num_elements(); unify memset tasklet builder
ThrudPrimrose May 17, 2026
df315c3
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 17, 2026
fc0987d
Factor _get_write_begin_and_length into pure helpers (-50 LoC)
ThrudPrimrose May 17, 2026
7db3a83
Reuse helpers.is_within_schedule_types; drop dead lines in experiment…
ThrudPrimrose May 17, 2026
6f26f73
Merge remote-tracking branch 'origin/explicit-copy-memset-nodes' into…
ThrudPrimrose May 17, 2026
8bbf73f
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 17, 2026
d5e7e13
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 17, 2026
2fc469f
Use CopyLibraryNode connector-name constants in legacy-ambient-stream…
ThrudPrimrose May 18, 2026
18cc384
Drop stale stream-descriptor mention from CopyLibraryNode.validate do…
ThrudPrimrose May 18, 2026
9361677
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 18, 2026
02c06c0
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 18, 2026
dd6dfe2
Refresh stale comments referencing removed stream passes
ThrudPrimrose May 18, 2026
a5540a2
Move no-cycle inline imports to module top in copy/memset libnodes
ThrudPrimrose May 18, 2026
8aaae39
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 18, 2026
874f360
Module-qualify disallowed class imports + add missing helper docstrings
ThrudPrimrose May 18, 2026
a7c3ae7
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 18, 2026
09de0b2
Code-style cleanup: copyright year, drop -> None, de-banner comments,…
ThrudPrimrose May 18, 2026
fbc62ba
Move get_parent_map_and_loop_scopes to transformation.helpers; drop c…
ThrudPrimrose May 18, 2026
2ec94e2
Merge remote-tracking branch 'origin/explicit-copy-memset-nodes' into…
ThrudPrimrose May 18, 2026
e64dc22
Merge remote-tracking branch 'origin/assignment-copy-kernel-to-libnod…
ThrudPrimrose May 18, 2026
a40c343
Add length-1<->scalar conversion passes
ThrudPrimrose May 18, 2026
84a0079
Passes are the only API (drop standalone function)
ThrudPrimrose May 18, 2026
b5394c5
Add length-1<->scalar conversion passes
ThrudPrimrose May 18, 2026
0ffad1e
Add length-1<->scalar conversion passes
ThrudPrimrose May 18, 2026
dad18c1
Merge branch 'main' into explicit-copy-memset-nodes
ThrudPrimrose May 20, 2026
1cbc095
Move connector names to class constants; use dtypes.{CPU,GPU}_RESIDEN…
ThrudPrimrose May 20, 2026
8e39e71
Apply connector-contract + inner-literal consolidation to memset and …
ThrudPrimrose May 20, 2026
a9b8bea
helpers: use imported nodes module for MapEntry/Tasklet/LibraryNode i…
ThrudPrimrose May 20, 2026
ef777dd
Accept Fortran-packed layouts in CopyNDTemplate; route mixed C/F to M…
ThrudPrimrose May 20, 2026
192d6c9
MappedTasklet handles rank-mismatch via 1-D walker + int_floor/% deli…
ThrudPrimrose May 20, 2026
3074fd0
Revert MappedTasklet rank-mismatch; CopyNDTemplate is the only suppor…
ThrudPrimrose May 20, 2026
6a42deb
Refactor multi-dim copy tests onto shared helper + add unsupported-ca…
ThrudPrimrose May 20, 2026
31b963a
Extract auto_dispatch + merge memset test helpers + type hints on new…
ThrudPrimrose May 20, 2026
7b3e859
DRY/YAGNI: drop dead CopyExpansion fields, reuse collapse_shape_and_s…
ThrudPrimrose May 20, 2026
6c35792
Drop ExpandCopyNDTemplate; MappedTasklet handles rank-mismatch (CopyN…
ThrudPrimrose May 20, 2026
a8aceb7
Simplify _coarse_pick, _cuda2d_strides_are_supported, merge cudaMemcp…
ThrudPrimrose May 20, 2026
f42e134
Tailor _build_copynd_call to its sole shared-memory caller
ThrudPrimrose May 20, 2026
aef3558
Drop dynamic-input connectors from copy/memset libnodes; subset symbo…
ThrudPrimrose May 20, 2026
cb62493
Prune dead code: drop length-one<->scalar pass (unused), TODOs, requi…
ThrudPrimrose May 20, 2026
b55c5f5
Soften _is_cross_cpu_gpu docstring on Register handling assumption
ThrudPrimrose May 20, 2026
2e47679
Merge remote-tracking branch 'origin/main' into explicit-copy-memset-…
ThrudPrimrose May 20, 2026
567bd79
Inline _require_contiguous_subset as combined check at both call sites
ThrudPrimrose May 20, 2026
411b83a
Add type hints (StorageType, Range, SymExpr, LibraryNode forward refs…
ThrudPrimrose May 20, 2026
b29b36f
Improve _refine_cuda_impl_for_subsets docstring (accurate routing table)
ThrudPrimrose May 20, 2026
5e2ba18
Drop _memcpy_connector_typing: DaCe handles single-element pointer co…
ThrudPrimrose May 20, 2026
d99c6b0
Add 5 Auto-routed copy edge-case tests: 4D/1D flatten, 1D/4D inflate,…
ThrudPrimrose May 20, 2026
03f97ef
Trim inner-tasklet connector comments; unify memset inner connector t…
ThrudPrimrose May 20, 2026
3bf59c8
Rename memset tests: <expansion>_<rank>_<storage>; reject-test names …
ThrudPrimrose May 20, 2026
c8991f4
Drop new_gpu_codegen_only markers; remove stale connector-types test
ThrudPrimrose May 20, 2026
3a18bd6
Unify SDFG-construction helpers in copy_node_test.py
ThrudPrimrose May 20, 2026
4f723c6
Reject transpose pattern upfront in CopyLibraryNode + trim test docst…
ThrudPrimrose May 20, 2026
e891b81
Pin contract: same-rank copy needs matching per-dim subset sizes, not…
ThrudPrimrose May 20, 2026
ea15eca
Normalize storage-type references to the full dace.dtypes.StorageType…
ThrudPrimrose May 20, 2026
aa0adf7
MemsetLibraryNode: Auto falls back to 'pure' for non-contiguous subsets
ThrudPrimrose May 20, 2026
66f4624
Prune dead helpers, args, and stale docstring
ThrudPrimrose May 20, 2026
5c2f27c
InsertExplicitCopies: lift stage-in / stage-out copies to libnodes in…
ThrudPrimrose May 20, 2026
33e0fff
IEC cleanup: unify staging-lift methods + drop dead param
ThrudPrimrose May 20, 2026
8ceb7a4
copy_node_test: assert no dace::CopyND in generated C++ at every compile
ThrudPrimrose May 20, 2026
0686082
Inline single-callsite helpers, drop dead docs; simplify __main__ blo…
ThrudPrimrose May 20, 2026
22cb179
Trim verbose WHY comment on symbolic-< fallback in _is_consecutive_re…
ThrudPrimrose May 20, 2026
b5b2341
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 21, 2026
4bbd8fe
AssignmentAndCopyKernelToMemsetAndMemcpy: hoist dynamic map-range bou…
ThrudPrimrose May 21, 2026
20309cf
AssignmentAndCopyKernelToMemsetAndMemcpy: nested-SDFG fallback for in…
ThrudPrimrose May 21, 2026
e59aaab
Test dynamic map-range bound handling: symbol hoist, nested-SDFG fall…
ThrudPrimrose May 21, 2026
4f4aba2
AssignmentAndCopyKernelToMemsetAndMemcpy: extract shared _lift_precon…
ThrudPrimrose May 21, 2026
9ef2295
test: add _get_num_nested_sdfgs helper, use it in the dynamic-bound t…
ThrudPrimrose May 21, 2026
0fb6a7c
Drop out-of-scope length-1<->scalar conversion passes from this PR
ThrudPrimrose May 21, 2026
7ae9adc
Apply code-style rules to the dynamic-bound test additions
ThrudPrimrose May 21, 2026
eeeb36d
Unify remove_memcpy/remove_memset into one _lift_paths(is_memset) driver
ThrudPrimrose May 21, 2026
9451d5a
Merge branch 'main' into explicit-copy-memset-nodes
ThrudPrimrose May 21, 2026
7bff1e9
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 21, 2026
e0991d7
Derive owning SDFG from state in copy/memset libnodes and copy-insert…
ThrudPrimrose May 22, 2026
ee066ed
Unify contiguous-memcpy expansion across CPU and CUDA1D
ThrudPrimrose May 22, 2026
5983f26
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 22, 2026
b716ee2
Audit InsertExplicitCopies and treat views as copy endpoints
ThrudPrimrose May 22, 2026
f1dcd9f
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 22, 2026
1a52336
Improve comments
ThrudPrimrose May 22, 2026
786eb82
Pre-commit
ThrudPrimrose May 22, 2026
7d02766
InsertExplicitCopies: memlet-path subset resolution, reuse is_in_scop…
ThrudPrimrose May 22, 2026
48f3605
Merge branch 'explicit-copy-memset-nodes' into assignment-copy-kernel…
ThrudPrimrose May 22, 2026
236aad8
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose May 22, 2026
f5dbd9c
gpu-specialization tests: drop sys.path hacks and banner comments
ThrudPrimrose May 22, 2026
2cc7572
gpu-specialization: de-alias imports, drop redundant -> None
ThrudPrimrose May 22, 2026
de0f16b
gpu-specialization: fix exhausted-filter pool scan and bad should_rea…
ThrudPrimrose May 22, 2026
2180b60
gpu-specialization: dedup global-lifetime predicate, drop dead get_de…
ThrudPrimrose May 22, 2026
f6b661c
length-1<->scalar conversion: anchor [0]-strip, preserve dynamic, add…
ThrudPrimrose May 22, 2026
26a7990
MoveArrayOutOfKernel: fix symbol_mapping key type, drop dead subset c…
ThrudPrimrose May 22, 2026
5256dc1
experimental_cuda: keep pooled as a lazy filter to mirror legacy cuda
ThrudPrimrose May 22, 2026
9137a7c
gpu_helpers: fix stale src_storage/dst_storage call arity, add regres…
ThrudPrimrose May 22, 2026
21d7ebb
Merge origin/main into explicit-copy-memset-nodes (prefer theirs)
ThrudPrimrose May 26, 2026
564fde7
Merge explicit-copy-memset-nodes into assignment-copy-kernel-to-libno…
ThrudPrimrose May 26, 2026
3236de5
Merge assignment-copy-kernel-to-libnode into new-gpu-codegen-dev (pre…
ThrudPrimrose May 26, 2026
503e5b1
Post-merge fixes: restore GPU_KERNEL_ACCESSIBLE_STORAGES, drop orphan…
ThrudPrimrose May 26, 2026
010e0cd
experimental_cuda: don't redeclare __dace_current_stream when it is t…
ThrudPrimrose May 27, 2026
7235227
fix(loop_to_map): refuse loops that carry a scalar read in-state
ThrudPrimrose May 27, 2026
3db37b5
fix(sdfg): keep extent symbols of used arrays only (#2382)
ThrudPrimrose May 27, 2026
0750523
fix(sdfg): resolve exit-write arg from the memlet tree root
ThrudPrimrose May 27, 2026
3bd3e30
Merge fix-arglist-and-loop2map-carried-symbol into new-gpu-codegen-de…
ThrudPrimrose May 27, 2026
215d8e2
cuTensor env: drop malformed cmake_link_flags
ThrudPrimrose May 28, 2026
30f5e0b
copy_node + experimental_cuda: fix assumption-divergent shape equalit…
ThrudPrimrose May 28, 2026
db84609
codegen + pass: Scalar->pointer dispatcher, stream wiring transfer, l…
ThrudPrimrose May 28, 2026
36cf4b8
ci: drop pauli GPU runners, run legacy + experimental codegen on cscs
ThrudPrimrose May 28, 2026
e0830d4
style: yapf-format files touched by the previous commit
ThrudPrimrose May 28, 2026
33db9d6
codegen: skip Scalar->pointer address-of for opaque dtypes (fixes MPI…
ThrudPrimrose May 28, 2026
750d3a5
ci: restore pauli GPU runners alongside cscs (faster signal during dev)
ThrudPrimrose May 28, 2026
439647d
ci: use cutensor.__path__[0] (cutensor-cu12 is a namespace package)
ThrudPrimrose May 28, 2026
0920be2
test: relax hoist-vs-nest assertion to ``<= 1`` nested SDFG (Option A)
ThrudPrimrose May 28, 2026
0d5bca1
insert_explicit_copies: skip stage-out lift when MapExit has overlapp…
ThrudPrimrose May 28, 2026
327d48c
subgraph_fusion: refuse fusion that would create duplicate writes via…
ThrudPrimrose May 29, 2026
4da0201
copy_node: scope-aware dispatcher + single-element invariant
ThrudPrimrose May 29, 2026
c178311
subgraph_fusion: drop dead intermediate writes in apply (vadv WAW)
ThrudPrimrose May 29, 2026
fd21b59
copy_node: scope-aware dispatcher + single-element invariant
ThrudPrimrose May 29, 2026
f2b7078
copy_node: scope-aware dispatcher + single-element invariant
ThrudPrimrose May 29, 2026
1ce3dab
argument_signature_test: split arglist assertion from GPU compile/run
ThrudPrimrose May 29, 2026
b472562
insert_explicit_copies: build outer-side Memlet from outer.data and r…
ThrudPrimrose May 29, 2026
81f10cf
insert_explicit_copies: build outer-side Memlet from outer.data and r…
ThrudPrimrose May 29, 2026
0fd04e8
insert_explicit_copies: build outer-side Memlet from outer.data and r…
ThrudPrimrose May 29, 2026
7862281
fix(sdfg): resolve exit-write arg from the memlet tree root
ThrudPrimrose May 27, 2026
eddc2cc
argument_signature_test: split arglist assertion from GPU compile/run
ThrudPrimrose May 29, 2026
dac86ff
fix(sdfg): resolve exit-write arg from the memlet tree root
ThrudPrimrose May 27, 2026
d4754d7
argument_signature_test: split arglist assertion from GPU compile/run
ThrudPrimrose May 29, 2026
febd08d
style: tighten comments on DSE block + IEC outer-subset resolution
ThrudPrimrose May 30, 2026
f696321
style: tighten outer-subset resolution comment in InsertExplicitCopies
ThrudPrimrose May 30, 2026
1407eca
style: tighten outer-subset resolution comment in InsertExplicitCopies
ThrudPrimrose May 30, 2026
4a839fe
fix(experimental_cuda): register split-DECLARE/ALLOCATE transients in…
ThrudPrimrose May 31, 2026
f3abd12
ci: comma in module docstring to re-trigger CI
ThrudPrimrose May 31, 2026
d62159c
fix(experimental_cuda): pick ``defined_vars.add`` ancestor by topmost…
ThrudPrimrose Jun 1, 2026
71d093c
comments: debloat ``_declare_pointer_if_needed`` docstring + split-al…
ThrudPrimrose Jun 1, 2026
dcc552f
Fixed a missing return (#2390)
philip-paul-mueller Jun 3, 2026
d356364
Select CUDA codegen implementation at build time
ThrudPrimrose Jun 3, 2026
2e60b37
Improve register location detection
ThrudPrimrose Jun 4, 2026
0c0d088
Merge
ThrudPrimrose Jun 4, 2026
1fd2a6e
Merge branch 'assignment-copy-kernel-to-libnode' into new-gpu-codegen…
ThrudPrimrose Jun 4, 2026
152d493
Removed a stray `print()`.
philip-paul-mueller Jun 4, 2026
14 changes: 7 additions & 7 deletions .github/workflows/gpu-ci.yml
@@ -1,13 +1,13 @@
 name: Pauli GPU Tests
 
 on:
-  workflow_dispatch
-  #push:
-  # branches: [ main, ci-fix ]
-  #pull_request:
-  # branches: [ main, ci-fix ]
-  #merge_group:
-  # branches: [ main, ci-fix ]
+  workflow_dispatch:
+  push:
+    branches: [ main, ci-fix ]
+  pull_request:
+    branches: [ main, ci-fix ]
+  merge_group:
+    branches: [ main, ci-fix ]
 
 env:
   CUDACXX: /usr/local/cuda/bin/nvcc
80 changes: 80 additions & 0 deletions .github/workflows/gpu-experimental-ci.yml
@@ -0,0 +1,80 @@
+name: Pauli GPU Tests (ExperimentalCUDACodeGen)
+
+on:
+  workflow_dispatch:
+  push:
+    branches: [ main, ci-fix ]
+  pull_request:
+    branches: [ main, ci-fix ]
+  merge_group:
+    branches: [ main, ci-fix ]
+
+env:
+  CUDACXX: /usr/local/cuda/bin/nvcc
+  MKLROOT: /opt/intel/oneapi/mkl/latest/
+  CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
+  # Force the experimental CUDA codegen for every test in this workflow.
+  DACE_compiler_cuda_implementation: experimental
+
+concurrency:
+  group: ${{github.workflow}}-${{github.ref}}
+  cancel-in-progress: true
+
+jobs:
+  test-gpu-experimental:
+    if: "!contains(github.event.pull_request.labels.*.name, 'no-ci')"
+    runs-on: [self-hosted, gpu]
+    steps:
+    - uses: actions/checkout@v6
+      with:
+        submodules: 'recursive'
+    - name: Install dependencies
+      run: |
+        rm -f ~/.dace.conf
+        rm -rf .dacecache tests/.dacecache
+        python -m venv ~/.venv # create venv so we can use pip
+        source ~/.venv/bin/activate # activate venv
+        python -m pip install --upgrade pip
+        pip install flake8 pytest-xdist coverage
+        pip install mpi4py
+        pip install cupy
+        pip uninstall -y dace
+        pip install -e ".[testing,ml]"
+        curl -Os https://uploader.codecov.io/latest/linux/codecov
+        chmod +x codecov
+
+    - name: Test dependencies
+      run: |
+        source ~/.venv/bin/activate # activate venv
+        nvidia-smi
+
+    - name: Run pytest GPU (experimental codegen)
+      run: |
+        source ~/.venv/bin/activate # activate venv
+        export DACE_cache=single
+        export PATH=$PATH:/usr/local/cuda/bin # some test is calling cuobjdump, so it needs to be in path
+        echo "CUDACXX: $CUDACXX"
+        echo "DACE_compiler_cuda_implementation: $DACE_compiler_cuda_implementation"
+        pytest --cov-report=xml --cov=dace --tb=short --timeout_method thread --timeout=300 -m "gpu"
+
+    - name: Run extra GPU tests (experimental codegen)
+      run: |
+        source ~/.venv/bin/activate # activate venv
+        export NOSTATUSBAR=1
+        export DACE_cache=single
+        export COVERAGE_RCFILE=`pwd`/.coveragerc
+        export PYTHON_BINARY="coverage run --source=dace --parallel-mode"
+        ./tests/cuda_test.sh
+
+    - name: Report overall coverage
+      run: |
+        source ~/.venv/bin/activate # activate venv
+        export COVERAGE_RCFILE=`pwd`/.coveragerc
+        coverage combine . */; coverage report; coverage xml
+        reachable=0
+        ping -W 2 -c 1 codecov.io || reachable=$?
+        if [ $reachable -eq 0 ]; then
+          ./codecov
+        else
+          echo "Codecov.io is unreachable"
+        fi
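
The workflow above selects the backend through the `DACE_compiler_cuda_implementation` environment variable. A minimal sketch of how such a variable could be resolved on the Python side. This is an illustration only: `resolve_cuda_implementation` is a hypothetical helper, not DaCe API, and the `"legacy"` default is an assumption (the diff only shows that `legacy` and `experimental` are used as values).

```python
import os


def resolve_cuda_implementation(environ=None, default="legacy"):
    """Return the CUDA codegen name requested via the environment.

    Hypothetical helper for illustration; the default is an assumption.
    """
    environ = os.environ if environ is None else environ
    # The key mirrors the variable set in the workflow's `env:` block.
    value = environ.get("DACE_compiler_cuda_implementation", default)
    if value not in ("legacy", "experimental"):
        raise ValueError(f"unknown CUDA codegen implementation: {value!r}")
    return value
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the process environment.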
12 changes: 11 additions & 1 deletion dace/codegen/CMakeLists.txt
@@ -35,7 +35,8 @@ foreach(DACE_FILE ${DACE_FILES})
   # Make the path absolute
   set(DACE_FILE ${DACE_SRC_DIR}/${DACE_FILE})
   # Now treat the file according to the deduced target
-  if(${DACE_FILE_TARGET} STREQUAL "cuda")
+  # previous: if(${DACE_FILE_TARGET} STREQUAL "cuda"). Needed to work with experimental
+  if(${DACE_FILE_TARGET} STREQUAL "experimental_cuda" OR ${DACE_FILE_TARGET} STREQUAL "cuda")
     if(${DACE_FILE_TARGET_TYPE} MATCHES "hip")
       set(DACE_ENABLE_HIP ON)
       set(DACE_HIP_FILES ${DACE_HIP_FILES} ${DACE_FILE})
@@ -261,13 +262,22 @@ endforeach()
 # Create DaCe library file
 add_library(${DACE_PROGRAM_NAME} SHARED ${DACE_CPP_FILES} ${DACE_OBJECTS})
 target_link_libraries(${DACE_PROGRAM_NAME} PUBLIC ${DACE_LIBS})
+# The OpenMP INTERFACE options don't always propagate through to this target;
+# inject -fopenmp at the front of both compile and link lines so libgomp is
+# considered before -Wl,--as-needed can drop it.
+target_compile_options(${DACE_PROGRAM_NAME} BEFORE PRIVATE ${OpenMP_CXX_FLAGS})
+target_link_options(${DACE_PROGRAM_NAME} BEFORE PRIVATE ${OpenMP_CXX_FLAGS})
 
 # Set C++ standard to C++20 (or the configured standard)
 set_property(TARGET ${DACE_PROGRAM_NAME} PROPERTY CXX_STANDARD ${DACE_CPP_STANDARD})
 
 # Create DaCe loader stub
 add_library(dacestub_${DACE_PROGRAM_NAME} SHARED "${CMAKE_SOURCE_DIR}/tools/dacestub.cpp")
 target_link_libraries(dacestub_${DACE_PROGRAM_NAME} Threads::Threads OpenMP::OpenMP_CXX ${CMAKE_DL_LIBS})
+# Same -fopenmp injection as above: dacestub.cpp calls omp_get_max_threads() at
+# load time, so the symbol must be resolved even after --as-needed.
+target_compile_options(dacestub_${DACE_PROGRAM_NAME} BEFORE PRIVATE ${OpenMP_CXX_FLAGS})
+target_link_options(dacestub_${DACE_PROGRAM_NAME} BEFORE PRIVATE ${OpenMP_CXX_FLAGS})
 
 # Windows-specific fixes
 if (MSVC_IDE)
4 changes: 3 additions & 1 deletion dace/codegen/dispatcher.py
@@ -27,6 +27,7 @@ class DefinedType(attr_enum.ExtensibleAttributeEnum):
     Object = auto()  # An object moved by reference
     Stream = auto()  # A stream object moved by reference and accessed via a push/pop API
     StreamArray = auto()  # An array of Streams
+    GPUStream = auto()  # A backend GPU stream handle (e.g., cudaStream_t / hipStream_t)
 
 
 class DefinedMemlets:
@@ -91,7 +92,8 @@ def add(self, name: str, dtype: DefinedType, ctype: str, ancestor: int = 0, allo
         for _, scope, can_access_parent in reversed(self._scopes):
             if name in scope:
                 err_str = "Shadowing variable {} from type {} to {}".format(name, scope[name], dtype)
-                if (allow_shadowing or config.Config.get_bool("compiler", "allow_shadowing")):
+                if (allow_shadowing or config.Config.get_bool("compiler", "allow_shadowing")
+                        or dtype == DefinedType.GPUStream):
                     if not allow_shadowing:
                         print("WARNING: " + err_str)
                 else:
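
The second hunk in `dispatcher.py` widens the shadowing condition so that a variable of type `DefinedType.GPUStream` may always shadow an outer definition. A minimal sketch of that predicate in isolation. The enum below is a toy stand-in with two members, and `may_shadow` with its `config_allows` parameter is a hypothetical name standing in for the `compiler.allow_shadowing` lookup; neither is DaCe API.

```python
from enum import Enum, auto


class DefinedType(Enum):
    """Toy stand-in for the dispatcher's enum (only two members modelled)."""
    Pointer = auto()
    GPUStream = auto()


def may_shadow(dtype, allow_shadowing=False, config_allows=False):
    """Mirror the widened condition: GPU stream handles may always shadow."""
    return allow_shadowing or config_allows or dtype == DefinedType.GPUStream
```

Note that in the diff the `WARNING` message is still printed whenever `allow_shadowing` is false, so a shadowing GPU stream is accepted but not silent.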
64 changes: 60 additions & 4 deletions dace/codegen/instrumentation/gpu_events.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,7 @@ def on_scope_entry(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, n
'GPU_Device map scopes')

idstr = 'b' + self._idstr(cfg, state, node)
stream = getattr(node, '_cuda_stream', -1)
stream = self._get_gpu_stream(state, node)
outer_stream.write(self._record_event(idstr, stream), cfg, state_id, node)

def on_scope_exit(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, node: nodes.ExitNode,
Expand All @@ -139,7 +139,7 @@ def on_scope_exit(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, no
s = self._get_sobj(node)
if s.instrument == dtypes.InstrumentationType.GPU_Events:
idstr = 'e' + self._idstr(cfg, state, entry_node)
stream = getattr(node, '_cuda_stream', -1)
stream = self._get_gpu_stream(state, node)
outer_stream.write(self._record_event(idstr, stream), cfg, state_id, node)
outer_stream.write(self._report('%s %s' % (type(s).__name__, s.label), cfg, state, entry_node), cfg,
state_id, node)
Expand All @@ -153,7 +153,7 @@ def on_node_begin(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, no
if node.instrument == dtypes.InstrumentationType.GPU_Events:
state_id = state.parent_graph.node_id(state)
idstr = 'b' + self._idstr(cfg, state, node)
stream = getattr(node, '_cuda_stream', -1)
stream = self._get_gpu_stream(state, node)
outer_stream.write(self._record_event(idstr, stream), cfg, state_id, node)

def on_node_end(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, node: nodes.Node,
Expand All @@ -165,7 +165,63 @@ def on_node_end(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, node
if node.instrument == dtypes.InstrumentationType.GPU_Events:
state_id = state.parent_graph.node_id(state)
idstr = 'e' + self._idstr(cfg, state, node)
stream = getattr(node, '_cuda_stream', -1)
stream = self._get_gpu_stream(state, node)
outer_stream.write(self._record_event(idstr, stream), cfg, state_id, node)
outer_stream.write(self._report('%s %s' % (type(node).__name__, node.label), cfg, state, node), cfg,
state_id, node)

def _get_gpu_stream(self, state: SDFGState, node: nodes.Node) -> int:
"""
Return the GPU stream ID assigned to a given node.

- In the CUDACodeGen, the stream ID is stored as the private attribute
``_cuda_stream`` on the node.
- In the ExperimentalCUDACodeGen, streams are explicitly assigned to tasklets
and GPU_Device-scheduled maps (kernels) via a GPU stream AccessNode. For
other node types, no reliable stream assignment is available.

Parameters
----------
state : SDFGState
The state containing the node.
node : dace.sdfg.nodes.Node
The node for which to query the GPU stream.

Returns
-------
int
The assigned GPU stream ID, or ``-1`` if none could be determined.
"""
if config.Config.get('compiler', 'cuda', 'implementation') == 'legacy':
stream = getattr(node, '_cuda_stream', -1)
return stream

def _stream_from_in_edges(target: nodes.Node) -> int:
for in_edge in state.in_edges(target):
src = in_edge.src
if (isinstance(src, nodes.AccessNode) and src.desc(state).dtype == dtypes.gpuStream_t
and not in_edge.data.is_empty()):
return int(in_edge.data.subset)
return -1

stream = _stream_from_in_edges(node)

# MapExit's out-edge to gpu_streams carries an empty dependency memlet
# (see ConnectGPUStreamsToNodes._build_chain). Resolve via the matching
# MapEntry, which has the real `gpu_streams[i]` in-edge.
if stream == -1 and isinstance(node, nodes.MapExit):
entry = state.entry_node(node)
if entry is not None:
stream = _stream_from_in_edges(entry)

# Defensive out-edge fallback for non-Exit nodes only (Exit nodes' stream
# out-edges are always empty by construction).
if stream == -1 and not isinstance(node, nodes.ExitNode):
for out_edge in state.out_edges(node):
dst = out_edge.dst
if (isinstance(dst, nodes.AccessNode) and dst.desc(state).dtype == dtypes.gpuStream_t
and not out_edge.data.is_empty()):
stream = int(out_edge.data.subset)
break

return stream
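The resolution order used by `_get_gpu_stream` for the non-legacy code generator can be sketched in isolation as follows. This is a minimal illustration, not DaCe code: `Edge`, `Node`, `first_stream` and `resolve_stream` are hypothetical stand-ins, an empty memlet is modelled as `stream_index=None`, and `MapExit` is treated as the only exit node type.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Edge:
    """Hypothetical stand-in for an edge to/from a GPU stream AccessNode."""
    stream_index: Optional[int]  # None models an empty (dependency-only) memlet


@dataclass
class Node:
    """Hypothetical stand-in for an SDFG node and its stream-related edges."""
    kind: str  # e.g. "Tasklet", "MapEntry", "MapExit"
    stream_in_edges: List[Edge] = field(default_factory=list)
    stream_out_edges: List[Edge] = field(default_factory=list)
    entry: Optional["Node"] = None  # matching MapEntry, set only for MapExit


def first_stream(edges: List[Edge]) -> int:
    """Return the first non-empty stream index among the edges, or -1."""
    for edge in edges:
        if edge.stream_index is not None:
            return edge.stream_index
    return -1


def resolve_stream(node: Node) -> int:
    # 1. Prefer a non-empty in-edge coming from a stream AccessNode.
    stream = first_stream(node.stream_in_edges)
    # 2. A MapExit only carries empty stream memlets, so ask its MapEntry.
    if stream == -1 and node.kind == "MapExit" and node.entry is not None:
        stream = first_stream(node.entry.stream_in_edges)
    # 3. Defensive out-edge fallback, skipped for exit nodes.
    if stream == -1 and node.kind != "MapExit":
        stream = first_stream(node.stream_out_edges)
    return stream


map_entry = Node("MapEntry", stream_in_edges=[Edge(2)])
map_exit = Node("MapExit", stream_out_edges=[Edge(None)], entry=map_entry)
tasklet = Node("Tasklet", stream_in_edges=[Edge(0)])
unassigned = Node("Tasklet")

print(resolve_stream(map_exit), resolve_stream(tasklet), resolve_stream(unassigned))
```

In this model the `MapExit` resolves to the stream of its `MapEntry`, and a node with no stream edges falls through to `-1`, matching the documented return value.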
37 changes: 34 additions & 3 deletions dace/codegen/instrumentation/gpu_tx_markers.py
@@ -22,15 +22,18 @@ class GPUTXMarkersProvider(InstrumentationProvider):

def __init__(self):
self.backend = common.get_gpu_backend()
# Check if ROCm TX libraries and headers are available
# Check if ROCm TX libraries and headers are available. This is only
# meaningful when the backend is HIP: on a CUDA host that also has ROCm
# installed we must not switch to rocTX mode, because that would suppress
# the NVTX init markers via the `enable_rocTX` short-circuits below.
rocm_path = os.getenv('ROCM_PATH', '/opt/rocm')
roctx_header_paths = [
os.path.join(rocm_path, 'roctracer/include/roctx.h'),
os.path.join(rocm_path, 'include/roctracer/roctx.h')
]
roctx_library_path = os.path.join(rocm_path, 'lib', 'libroctx64.so')
self.enable_rocTX = any(os.path.isfile(path)
for path in roctx_header_paths) and os.path.isfile(roctx_library_path)
self.enable_rocTX = (self.backend == 'hip' and any(os.path.isfile(path) for path in roctx_header_paths)
and os.path.isfile(roctx_library_path))
self.include_generated = False
super().__init__()
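The effect of gating the rocTX check on the backend can be shown with a small standalone sketch. The helper name `roctx_available` and its parameters are hypothetical; only the file paths and the `'hip'` comparison are taken from the change above.

```python
import os


def roctx_available(backend: str, rocm_path: str) -> bool:
    """Return True only if the backend is HIP and the rocTX header and library exist."""
    # A CUDA backend never enables rocTX, even if ROCm is installed on the host.
    if backend != 'hip':
        return False
    header_paths = [
        os.path.join(rocm_path, 'roctracer/include/roctx.h'),
        os.path.join(rocm_path, 'include/roctracer/roctx.h'),
    ]
    library_path = os.path.join(rocm_path, 'lib', 'libroctx64.so')
    return any(os.path.isfile(p) for p in header_paths) and os.path.isfile(library_path)


if __name__ == '__main__':
    rocm_path = os.getenv('ROCM_PATH', '/opt/rocm')
    for backend in ('cuda', 'hip'):
        print(backend, roctx_available(backend, rocm_path))
```

Checking the backend first also avoids touching the filesystem at all on CUDA hosts.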

@@ -171,6 +174,34 @@ def on_scope_exit(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, no
return
self.print_range_pop(outer_stream)

def on_node_begin(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, node: nodes.Node,
outer_stream: CodeIOStream, inner_stream: CodeIOStream, global_stream: CodeIOStream) -> None:
# Bracket host-side cudaMemcpyAsync tasklets emitted by expanded
# CopyLibraryNode instances. These tasklets bypass the legacy
# _emit_copy() path that fires on_copy_begin, so without an explicit
# hook here the experimental codegen ends up with no `copy_*` ranges.
if state.instrument != dtypes.InstrumentationType.GPU_TX_MARKERS:
return
if not isinstance(node, nodes.Tasklet):
return
if is_devicelevel_gpu_kernel(sdfg, state, node):
return
if not node.label.startswith('copy_'):
return
self.print_range_push(node.label, sdfg, outer_stream)

def on_node_end(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, node: nodes.Node,
outer_stream: CodeIOStream, inner_stream: CodeIOStream, global_stream: CodeIOStream) -> None:
if state.instrument != dtypes.InstrumentationType.GPU_TX_MARKERS:
return
if not isinstance(node, nodes.Tasklet):
return
if is_devicelevel_gpu_kernel(sdfg, state, node):
return
if not node.label.startswith('copy_'):
return
self.print_range_pop(outer_stream)
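Because `on_node_begin` and `on_node_end` apply the same four guards, every pushed range is matched by a pop. The sketch below restates that filter as one predicate and emits illustrative marker lines; `should_mark_copy`, `emit_node` and the emitted strings are hypothetical and do not reflect the provider's actual output.

```python
from typing import List


def should_mark_copy(instrumented: bool, is_tasklet: bool, in_gpu_kernel: bool, label: str) -> bool:
    """Mirror the guards: instrumented state, host-side tasklet, 'copy_' label prefix."""
    return instrumented and is_tasklet and not in_gpu_kernel and label.startswith('copy_')


def emit_node(lines: List[str], label: str, instrumented: bool, is_tasklet: bool, in_gpu_kernel: bool) -> None:
    """Append placeholder code for one node, bracketed by a range if it qualifies."""
    mark = should_mark_copy(instrumented, is_tasklet, in_gpu_kernel, label)
    if mark:
        lines.append(f'range_push("{label}")')
    lines.append(f'// code for {label}')
    if mark:
        lines.append('range_pop()')


lines: List[str] = []
emit_node(lines, 'copy_A_to_gpu_A', instrumented=True, is_tasklet=True, in_gpu_kernel=False)
emit_node(lines, 'compute', instrumented=True, is_tasklet=True, in_gpu_kernel=False)
emit_node(lines, 'copy_inside_kernel', instrumented=True, is_tasklet=True, in_gpu_kernel=True)
print('\n'.join(lines))
```

Sharing a single predicate between the begin and end hooks is one way to keep the two guard lists from drifting apart.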

def on_sdfg_init_begin(self, sdfg: SDFG, callsite_stream: CodeIOStream, global_stream: CodeIOStream) -> None:
if sdfg.instrument != dtypes.InstrumentationType.GPU_TX_MARKERS:
return
1 change: 1 addition & 0 deletions dace/codegen/targets/__init__.py
@@ -5,3 +5,4 @@
from .mlir.mlir import MLIRCodeGen
from .sve.codegen import SVECodeGen
from .snitch import SnitchCodeGen
from .experimental_cuda import ExperimentalCUDACodeGen