From 62b83ca7afeb440cd046d78e16f5db2d694059ca Mon Sep 17 00:00:00 2001 From: romintomasetti Date: Wed, 21 Aug 2024 17:37:17 -0400 Subject: [PATCH 1/6] core(graph): adding documentation for `Kokkos::Experimental::Graph` --- docs/source/API/core-index.rst | 3 + docs/source/API/core/Graph.rst | 350 +++++++++++++++++++++++++++++++++ docs/source/conf.py | 1 + 3 files changed, 354 insertions(+) create mode 100644 docs/source/API/core/Graph.rst diff --git a/docs/source/API/core-index.rst b/docs/source/API/core-index.rst index 0996b0521..fa22cd0d3 100644 --- a/docs/source/API/core-index.rst +++ b/docs/source/API/core-index.rst @@ -37,6 +37,8 @@ API: Core - Utility functionality part of Kokkos Core. * - `Detection Idiom `__ - Used to recognize, in an SFINAE-friendly way, the validity of any C++ expression. + * - `Graph and related `_ + - Kokkos Graph abstraction. * - `Macros `__ - Global macros defined by Kokkos, used for architectures, general settings, etc. @@ -60,4 +62,5 @@ API: Core ./core/Utilities ./core/Detection-Idiom ./core/Macros + ./core/Graph ./core/Profiling diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst new file mode 100644 index 000000000..30d6c8982 --- /dev/null +++ b/docs/source/API/core/Graph.rst @@ -0,0 +1,350 @@ +Graph and related +================= + +Usage +----- + +:code:`Kokkos::Graph` is an abstraction that can be used to define a group of asynchronous workloads that are organised as a direct acyclic graph. +A :code:`Kokkos::Graph` is defined separatly from its execution, allowing it to be re-executed multiple times. + +:code:`Kokkos::Graph` is a powerful way of describing workload dependencies. It is also a good opportunity to present all workloads +at once to the driver, and allow some optimizations [ref]. + +.. 
note:: + + However, because command-group submission is tied to execution on the queue, without having a prior construction step before starting execution, optimization opportunities are missed from the runtime not being made aware of a defined dependency graph ahead of execution. + +For small workloads that need to be sumitted several times, it might save you some overhead [reference to some presentation / paper]. + +:code:`Kokkos::Graph` is specialized for some backends: + +* :code:`Cuda`: [ref to vendor doc] +* :code:`HIP`: [ref to vendor doc] +* :code:`SYCL`: [ref to vendor doc] -> https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc + +For other backends, Kokkos provides a defaulted implementation [ref to file]. + +Philosophy +---------- + +As mentioned earlier, the :code:`Kokkos::Graph` is first defined, and then executed. In fact, before the graph can be executed, +it needs to be *instantiated*. + +During the *instantiation* phase, the topology of the graph is **locked**, and an *executable graph* is created. + +In short, we have 3 phases: + +1. Graph definition (topology DAG graph) +2. Graph instantiation (executable graph) +3. Graph submission (execute) + +"Splitting command construction from execution is a proven solution." (https://www.iwocl.org/wp-content/uploads/iwocl-2023-Ewan-Crawford-4608.pdf) + +Basic example +------------- + +This example showcases how three workloads can be organised as a :code:`Kokkos::Graph`. + +Workloads A and B are independent, but workload C needs the completion of A and B. + +.. 
code-block:: cpp + + int main() + { + auto graph = Kokkos::Experimental::create_graph([&](auto root) { + const auto node_A = root.then_parallel_for(...label..., ...policy..., ...body...); + const auto node_B = root.then_parallel_for(...label..., ...policy..., ...body...); + const auto ready = Kokkos::Experimental::when_all(node_A, node_B); + const auto node_C = ready.then_parallel_for(...label..., ...policy..., ...body...); + }); + + for(int irep = 0; irep < nrep; ++irep) + graph.submit(); + } + +Advanced example +---------------- + +To be done soon. + +References +---------- + +* https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf +* https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md +* https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/ + + +Use cases +--------- + +Diamond with closure, don't care about `exec` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Create a simple diamond-like graph within a closure, no caring about execution space instances. + +This use case demonstrates how a graph can be created from inside a closure, and how it could look like in the future. +It is a very simple use case. + +Note that I'm not sure why we should support the closure anyway. + +.. graphviz:: + :caption: Diamond topology + + digraph diamond { + A -> B; + A -> C; + B -> D; + C -> D; + } + +.. code-block:: c++ + :caption: Current pseudo-code + + auto graph = Kokkos::create_graph([&](const auto& root){ + auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...); + + auto node_B = node_A.then_parallel_...(...label..., ...policy..., ...functor...); + auto node_C = node_A.then_parallel_...(...label..., ...policy..., ...functor...); + + auto node_D = Kokkos::when_all(node_B, node_C).then_parallel_...(...label..., ...policy..., ...functor...); + }); + graph.instantiate(); + graph.submit() + +.. 
code-block:: c++ + :caption: P2300 (but really I don't like that because `graph` itself is already a *sender*) + + auto graph = Kokkos::create_graph([&](const auto& root){ + auto node_A = then(root, parallel_...(...label..., ...policy..., ...functor...)); + + auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); + auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); + + auto node_D = then(when_all(node_B, node_C), parallel_...(...label..., ...policy..., ...functor...)); + }); + graph.instantiate(); + graph.submit() + +Diamond, caring about `exec` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Create a simple diamond-like graph, caring about execution space instances. + +This use case demonstrates how a graph can be created without a closure, and how it could look like in the future. +It also focuses on where steps occur. + +Graph topology is known at compile, thus enabling a lot of optimizations (kernel fusion might be one). + +.. graphviz:: + :caption: Diamond topology + + digraph diamond { + A -> B; + A -> C; + B -> D; + C -> D; + } + +.. code-block:: c++ + :caption: Current pseudo-code + + auto graph = Kokkos::create_graph(exec_A, [&](const auto& root){}); + auto root = Kokkos::Impl::GraphAccess::create_root_node_ref(graph); + + auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...); + + auto node_B = node_A.then_parallel_...(...label..., ...policy..., ...functor...); + auto node_C = node_A.then_parallel_...(...label..., ...policy..., ...functor...); + + auto node_D = Kokkos::when_all(node_B, node_C).then_parallel_...(...label..., ...policy..., ...functor...); + + graph.instantiate(); + exec_A.fence("The graph might make some async to-device copies."); + graph.submit(exec_B); + +.. 
code-block:: c++ + :caption: P2300 + defer when Kokkos performs internal async to-device copies + + // Step 1: define topology (no execution space instance required) + auto graph = Kokkos::create_graph(); + + auto node_A = then(graph, parallel_...(...label..., ...policy..., ...functor...)); + + auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); + auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); + + auto node_D = then(when_all(node_B, node_C), parallel_...(...label..., ...policy..., ...functor...)); + + // Step 2: instantiate (execution space instance required by both backend and Kokkos internals) + graph.instantiate(exec_A); + exec_A.fence(); + + // Step 3: execute + graph.submit(exec_B) + +No "root" node +~~~~~~~~~~~~~~ + +Currently, the :code:`Kokkos::Graph` would expose to the user a "root node" concept that is not needed +by any backend (but might be needed by the default implementation that works with *sinks*). + +The "root node" might be confusing. It sould not appear in the API for 2 reasons: + +1. It can be misleading, as the user might think it's necessary though I think it's an artifact of how :code:`Kokkos::Graph` + is currently implemented for graph construction, and because of the *sink*-based defaulted implementation. +2. With P2300, it's clear that *root* is an empty useless sender that can be thrown away at compile time. + +.. graphviz:: + :caption: No root node. + + digraph no_root { + A1 -> B; + A2 -> B; + A3 -> B; + } + +.. code-block:: c++ + :caption: P2300 + + auto graph = construct_graph(); + + auto A1 = then(graph, ...); + auto A2 = then(graph, ...); + auto A3 = then(graph, ...); + + auto B = then(when_all(A1, A2, A3), ...); + +Complex DAG topology +~~~~~~~~~~~~~~~~~~~~ + +Any complex-but-valid DAG topology should work. + +.. 
graphviz:: + :caption: A complex DAG + + digraph complex_dag { + + A1 -> B1; + A1 -> B2; + A1 -> B3; + A2 -> B1; + A2 -> B3; + A3 -> B4; + + B1 -> C1; + B3 -> C1; + + B2 -> C2; + B4 -> C2; + + // Enfore ordering of nodes with invisible edges. + { + rank = same; + edge[ style=invis]; + B1 -> B2 -> B3 -> B4 ; + rankdir = LR; + } + } + +Changing scheduler +~~~~~~~~~~~~~~~~~~ + +This is the purpose of PR https://github.com/kokkos/kokkos/pull/7249, and should be further documented. + +Towards https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html#design-sender-adaptor-starts_on. + +.. code-block:: c++ + + auto graph = construct() + + auto node_1 = ... + + ... + + graph.instantiate(); + + graph.submit(exec_A); + + ... + + graph.submit(exec_C); + + ... + + graph.submit(exec_D); + +Interoperability +~~~~~~~~~~~~~~~~ + +Why interoperability matters (helps adoption of :code:`Kokkos::Graph`, extensibility, corner cases): + +1. Attract users that already use some backend graph (*e.g.* `cudaGraph_t`) towards `Kokkos`. It helps them transition smoothly. +2. Help user integrate backend-specific graph capabilities that are not part of the :code:`Kokkos::Graph` API for whatever reason. + +Since `Kokkos` might run some stuff linked to its internals at *instantiation* stage, and since in PR https://github.com/kokkos/kokkos/pull/7240 +we decided to ensure that before the submission, the graph needs to be instantiated in `Kokkos`, interoperability implies that the user +passes through `Kokkos` for both *instantiation* and *submission*. + +.. graphviz:: + :caption: Dark nodes/edges are added through :code:`Kokkos::Graph`. + + digraph interoperability { + + A[color=darksalmon]; + + B1[color=darksalmon]; + B2[color=darksalmon]; + B3[color=darksalmon]; + + C3[color=darksalmon]; + + A -> B1[color=darksalmon]; + A -> B2[color=darksalmon]; + A -> B3[color=darksalmon]; + + B3 -> C3[color=darksalmon]; + + // Enfore ordering of nodes with invisible edges. 
+ { + rank = same; + edge[style=invis]; + B1 -> B2 -> B3 ; + rankdir = LR; + } + + B1 -> C1; + B2 -> C1; + + C1 -> D1; + C3 -> D1; + } + +.. code-block:: c++ + :caption: interoperability pseudo-code P2300 + + cudaGraph_t graph; + cudaGraphCreate(&graph, ...); + + cudaGraphNode_t A, B1, B2, B3, C3; + ... create kernel nodes and add dependencies ... + + auto kokkos_graph = construct(graph); + + auto C1 = then(when_all(B1, B2), ...); + auto D1 = then(when_all(C1, C3), ...); + + kokkos_graph.instantiate(); + kokkos_graph.submit(); + +Graph update +~~~~~~~~~~~~ + +From reading `Cuda`, `HIP` and `SYCL` documentations, all have some *executable graph update* mechanisms. + +For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`, not in `HIP` yet) can support complex graphs that might slightly change from one submission to another. + + Updates to a graph will be scheduled after any in-flight executions of the same graph and will not affect previous submissions of the same graph. + The user is not required to wait on any previous submissions of a graph before updating it. + +As the topology is fixed, we can only reasonably update kernel parameters. diff --git a/docs/source/conf.py b/docs/source/conf.py index cfc39b7e7..583e7a044 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -35,6 +35,7 @@ # ones. 
extensions = ["myst_parser", "sphinx.ext.autodoc", + "sphinx.ext.graphviz", "sphinx.ext.viewcode", "sphinx.ext.intersphinx", "sphinx_copybutton", From e9d776cef8fd333f1e5bc358b33a1a737b3e8ccf Mon Sep 17 00:00:00 2001 From: romintomasetti Date: Tue, 27 Aug 2024 17:23:40 +0000 Subject: [PATCH 2/6] wip meeting this morning results and todos --- docs/source/API/core/Graph.rst | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst index 30d6c8982..1a9f3df6a 100644 --- a/docs/source/API/core/Graph.rst +++ b/docs/source/API/core/Graph.rst @@ -171,6 +171,9 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel auto node_A = then(graph, parallel_...(...label..., ...policy..., ...functor...)); + // what happens to an exec space instance passed to the policy ? is it used somehow or just ignored ? + // when dispatching the driver to global memory, what exec space instance is used for the async copies ? + auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); @@ -348,3 +351,26 @@ For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`, not in The user is not required to wait on any previous submissions of a graph before updating it. As the topology is fixed, we can only reasonably update kernel parameters. + +Iterative process +----------------- + +- iterative solver (our assembly case) +- line search in optimization + + + +They also use graphs... +----------------------- + +* `PyTorch` https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/ +* `GROMACS` https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/ + + +Homework + +- what does Kokkos during dispatching ? (HIP CUDA SYCL) Execution space instance from the policy, used or ignored ? 
+- for each example 3 columns how to write it in CUDA SYCL P2300 Kokkos +- elaborate on the update +- try to demonstrate that we can write a single code, and say whether we want it to be a graph or not + (why it matters: write single source code, Kokkos premise 'single source code') \ No newline at end of file From ad9700f7bb8461b30a8f42e8d49f69cda30dca12 Mon Sep 17 00:00:00 2001 From: romintomasetti Date: Thu, 29 Aug 2024 19:59:26 +0000 Subject: [PATCH 3/6] wip before meeting --- .../API/core/Graph.axpby.kokkos.graph.cpp | 12 + .../core/Graph.axpby.kokkos.graph.p2300.cpp | 15 ++ .../API/core/Graph.axpby.kokkos.vanilla.cpp | 8 + docs/source/API/core/Graph.rst | 253 ++++++++++-------- 4 files changed, 181 insertions(+), 107 deletions(-) create mode 100644 docs/source/API/core/Graph.axpby.kokkos.graph.cpp create mode 100644 docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp create mode 100644 docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.cpp new file mode 100644 index 000000000..24cc178ac --- /dev/null +++ b/docs/source/API/core/Graph.axpby.kokkos.graph.cpp @@ -0,0 +1,12 @@ +auto graph = Kokkos::Experimental::create_graph(exec_A, [&](auto root){ + auto node_xpy = root.then_parallel_for(N, MyAxpby{x, y, alpha, beta}); + auto node_zpy = root.then_parallel_for(N, MyAxpby{z, y, gamma, beta}); + + auto node_dotp = Kokkos::Experimental::when_all(node_xpy, node_zpy).then_parallel_reduce( + N, MyDotp{x, z}, dotp + ); +}); + +graph.submit(exec_A); + +exec_A.fence(); diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp new file mode 100644 index 000000000..3d129d2a4 --- /dev/null +++ b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp @@ -0,0 +1,15 @@ +auto graph = Kokkos::construct_graph(); + +auto node_xpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{x, y, alpha, beta}));
+auto node_zpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{z, y, gamma, beta})); + +auto node_dotp = Kokkos::then( + Kokkos::when_all(node_xpy, node_zpy), + Kokkos::parallel_reduce(N, MyDotp{x, z}, dotp) +); + +graph.instantiate(); + +graph.submit(exec_A); + +exec_A.fence(); diff --git a/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp b/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp new file mode 100644 index 000000000..3789ba4d7 --- /dev/null +++ b/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp @@ -0,0 +1,8 @@ +Kokkos::parallel_for(policy_t(exec_A, 0, N), MyAxpby{x, y, alpha, beta}); +Kokkos::parallel_for(policy_t(exec_B, 0, N), MyAxpby{z, y, gamma, beta}); + +exec_B.fence(); + +Kokkos::parallel_reduce(policy_t(exec_A, 0, N), MyDotp{x, z}, dotp); + +exec_A.fence(); diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst index 1a9f3df6a..42beb0860 100644 --- a/docs/source/API/core/Graph.rst +++ b/docs/source/API/core/Graph.rst @@ -4,10 +4,10 @@ Graph and related Usage ----- -:code:`Kokkos::Graph` is an abstraction that can be used to define a group of asynchronous workloads that are organised as a direct acyclic graph. -A :code:`Kokkos::Graph` is defined separatly from its execution, allowing it to be re-executed multiple times. +:cppkokkos:`Kokkos::Graph` is an abstraction that can be used to define a group of asynchronous workloads that are organised as a directed acyclic graph. +A :cppkokkos:`Kokkos::Graph` is defined separately from its execution, allowing it to be re-executed multiple times. -:code:`Kokkos::Graph` is a powerful way of describing workload dependencies. It is also a good opportunity to present all workloads +:cppkokkos:`Kokkos::Graph` is a powerful way of describing workload dependencies. It is also a good opportunity to present all workloads at once to the driver, and allow some optimizations [ref]. .. note:: @@ -16,18 +16,18 @@ at once to the driver, and allow some optimizations [ref].
For small workloads that need to be sumitted several times, it might save you some overhead [reference to some presentation / paper]. -:code:`Kokkos::Graph` is specialized for some backends: +:cppkokkos:`Kokkos::Graph` is specialized for some backends: -* :code:`Cuda`: [ref to vendor doc] -* :code:`HIP`: [ref to vendor doc] -* :code:`SYCL`: [ref to vendor doc] -> https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc +* :cppkokkos:`Cuda`: [ref to vendor doc] +* :cppkokkos:`HIP`: [ref to vendor doc] +* :cppkokkos:`SYCL`: [ref to vendor doc] -> https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc For other backends, Kokkos provides a defaulted implementation [ref to file]. Philosophy ---------- -As mentioned earlier, the :code:`Kokkos::Graph` is first defined, and then executed. In fact, before the graph can be executed, +As mentioned earlier, the :cppkokkos:`Kokkos::Graph` is first defined, and then executed. In fact, before the graph can be executed, it needs to be *instantiated*. During the *instantiation* phase, the topology of the graph is **locked**, and an *executable graph* is created. @@ -40,53 +40,23 @@ In short, we have 3 phases: "Splitting command construction from execution is a proven solution." (https://www.iwocl.org/wp-content/uploads/iwocl-2023-Ewan-Crawford-4608.pdf) -Basic example -------------- - -This example showcases how three workloads can be organised as a :code:`Kokkos::Graph`. - -Workloads A and B are independent, but workload C needs the completion of A and B. - -.. 
code-block:: cpp - - int main() - { - auto graph = Kokkos::Experimental::create_graph([&](auto root) { - const auto node_A = root.then_parallel_for(...label..., ...policy..., ...body...); - const auto node_B = root.then_parallel_for(...label..., ...policy..., ...body...); - const auto ready = Kokkos::Experimental::when_all(node_A, node_B); - const auto node_C = ready.then_parallel_for(...label..., ...policy..., ...body...); - }); - - for(int irep = 0; irep < nrep; ++irep) - graph.submit(); - } - -Advanced example ----------------- - -To be done soon. - -References ----------- - -* https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf -* https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md -* https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/ - - Use cases --------- Diamond with closure, don't care about `exec` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Create a simple diamond-like graph within a closure, no caring about execution space instances. +Create a simple diamond-like graph within a closure, not caring too much about execution space instances. This use case demonstrates how a graph can be created from inside a closure, and how it could look like in the future. It is a very simple use case. -Note that I'm not sure why we should support the closure anyway. +.. note:: + + I'm not sure why we should support the closure anyway. I don't see the benefits of enforcing the + user to create the whole graph in there. + + See :ref:`no_root_node` for discussion. .. graphviz:: :caption: Diamond topology @@ -99,9 +69,9 @@ Note that I'm not sure why we should support the closure anyway. } .. code-block:: c++ - :caption: Current pseudo-code + :caption: Current `Kokkos` pseudo-code. 
- auto graph = Kokkos::create_graph([&](const auto& root){ + auto graph = Kokkos::create_graph([&](auto root){ auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...); auto node_B = node_A.then_parallel_...(...label..., ...policy..., ...functor...); @@ -113,9 +83,9 @@ Note that I'm not sure why we should support the closure anyway. graph.submit() .. code-block:: c++ - :caption: P2300 (but really I don't like that because `graph` itself is already a *sender*) + :caption: *à la* P2300 (but really I don't like that because `graph` itself is already a *sender*). - auto graph = Kokkos::create_graph([&](const auto& root){ + auto graph = Kokkos::create_graph([&](auto root){ auto node_A = then(root, parallel_...(...label..., ...policy..., ...functor...)); auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); @@ -129,7 +99,7 @@ Note that I'm not sure why we should support the closure anyway. Diamond, caring about `exec` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Create a simple diamond-like graph, caring about execution space instances. +Create a simple diamond-like graph, caring about execution space instances. No closure. This use case demonstrates how a graph can be created without a closure, and how it could look like in the future. It also focuses on where steps occur. @@ -147,9 +117,9 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel } .. code-block:: c++ - :caption: Current pseudo-code + :caption: Current `Kokkos` pseudo-code. 
- auto graph = Kokkos::create_graph(exec_A, [&](const auto& root){}); + auto graph = Kokkos::create_graph(exec_A, [&](auto root){}); auto root = Kokkos::Impl::GraphAccess::create_root_node_ref(graph); auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...); @@ -161,19 +131,17 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel graph.instantiate(); exec_A.fence("The graph might make some async to-device copies."); + graph.submit(exec_B); .. code-block:: c++ - :caption: P2300 + defer when Kokkos performs internal async to-device copies + :caption: *à la* P2300 and defer when `Kokkos` performs internal async to-device copies to the `instantiate` step. - // Step 1: define topology (no execution space instance required) + // Step 1: define graph topology (note that no execution space instance required). auto graph = Kokkos::create_graph(); auto node_A = then(graph, parallel_...(...label..., ...policy..., ...functor...)); - // what happens to an exec space instance passed to the policy ? is it used somehow or just ignored ? - // when dispatching the driver to global memory, what exec space instance is used for the async copies ? - auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...)); @@ -186,15 +154,17 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel // Step 3: execute graph.submit(exec_B) -No "root" node -~~~~~~~~~~~~~~ +.. _no_root_node: -Currently, the :code:`Kokkos::Graph` would expose to the user a "root node" concept that is not needed +To root or not to root ? +~~~~~~~~~~~~~~~~~~~~~~~~ + +Currently, the :cppkokkos:`Kokkos::Graph` API would expose to the user a "root node" concept that is not strictly needed by any backend (but might be needed by the default implementation that works with *sinks*). -The "root node" might be confusing. 
It sould not appear in the API for 2 reasons: +I think the "root node" might be confusing. IMO, it should not appear in the API for 2 reasons: -1. It can be misleading, as the user might think it's necessary though I think it's an artifact of how :code:`Kokkos::Graph` +1. It can be misleading, as the user might think it's necessary though I think it's an artifact of how :cppkokkos:`Kokkos::Graph` is currently implemented for graph construction, and because of the *sink*-based defaulted implementation. 2. With P2300, it's clear that *root* is an empty useless sender that can be thrown away at compile time. @@ -208,15 +178,15 @@ The "root node" might be confusing. It sould not appear in the API for 2 reasons } .. code-block:: c++ - :caption: P2300 + :caption: *à la* P2300. - auto graph = construct_graph(); + auto graph = Kokkos::construct_graph(); - auto A1 = then(graph, ...); - auto A2 = then(graph, ...); - auto A3 = then(graph, ...); + auto A1 = Kokkos::then(graph, Kokkos::parallel_...(...)); + auto A2 = Kokkos::then(graph, Kokkos::parallel_...(...)); + auto A3 = Kokkos::then(graph, Kokkos::parallel_...(...)); - auto B = then(when_all(A1, A2, A3), ...); + auto B = Kokkos::then(Kokkos::when_all(A1, A2, A3), Kokkos::parallel_...(...)); Complex DAG topology ~~~~~~~~~~~~~~~~~~~~ @@ -234,13 +204,13 @@ Any complex-but-valid DAG topology should work. A2 -> B1; A2 -> B3; A3 -> B4; - + B1 -> C1; B3 -> C1; - + B2 -> C2; B4 -> C2; - + // Enfore ordering of nodes with invisible edges. { rank = same; @@ -255,59 +225,58 @@ Changing scheduler This is the purpose of PR https://github.com/kokkos/kokkos/pull/7249, and should be further documented. -Towards https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html#design-sender-adaptor-starts_on. +This is a step towards https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html#design-sender-adaptor-starts_on. .. code-block:: c++ + :caption: *à la* P2300. - auto graph = construct() - - auto node_1 = ... 
+ // Step 1: construct. + auto graph = Kokkos::construct_graph(); + auto node_1 = Kokkos::then(graph, ...); ... + // Step 2: instantiate. graph.instantiate(); + // Step 3: execute, execute, and again. graph.submit(exec_A); - ... - graph.submit(exec_C); - ... - graph.submit(exec_D); Interoperability ~~~~~~~~~~~~~~~~ -Why interoperability matters (helps adoption of :code:`Kokkos::Graph`, extensibility, corner cases): +Why interoperability matters (helps adoption of :cppkokkos:`Kokkos::Graph`, extensibility, corner cases): -1. Attract users that already use some backend graph (*e.g.* `cudaGraph_t`) towards `Kokkos`. It helps them transition smoothly. -2. Help user integrate backend-specific graph capabilities that are not part of the :code:`Kokkos::Graph` API for whatever reason. +1. Attract users that already use some backend graph (*e.g.* :code:`cudaGraph_t`) towards `Kokkos`. It helps them transition smoothly. +2. Help user integrate backend-specific graph capabilities that are not part of the :cppkokkos:`Kokkos::Graph` API for whatever reason. Since `Kokkos` might run some stuff linked to its internals at *instantiation* stage, and since in PR https://github.com/kokkos/kokkos/pull/7240 we decided to ensure that before the submission, the graph needs to be instantiated in `Kokkos`, interoperability implies that the user -passes through `Kokkos` for both *instantiation* and *submission*. +relies on `Kokkos` for both *instantiation* and *submission*. .. graphviz:: - :caption: Dark nodes/edges are added through :code:`Kokkos::Graph`. + :caption: Dark nodes/edges are added through :cppkokkos:`Kokkos::Graph` API, the rest is pre-existing. digraph interoperability { A[color=darksalmon]; - + B1[color=darksalmon]; B2[color=darksalmon]; B3[color=darksalmon]; - + C3[color=darksalmon]; A -> B1[color=darksalmon]; A -> B2[color=darksalmon]; A -> B3[color=darksalmon]; - + B3 -> C3[color=darksalmon]; - + // Enfore ordering of nodes with invisible edges. 
{ rank = same; @@ -315,50 +284,102 @@ passes through `Kokkos` for both *instantiation* and *submission*. B1 -> B2 -> B3 ; rankdir = LR; } - + B1 -> C1; B2 -> C1; - + C1 -> D1; C3 -> D1; - } + } .. code-block:: c++ - :caption: interoperability pseudo-code P2300 + :caption: Interoperability pseudo-code *à la* P2300. + // The user starts creating its graph with a backend API for some reason. cudaGraph_t graph; cudaGraphCreate(&graph, ...); cudaGraphNode_t A, B1, B2, B3, C3; ... create kernel nodes and add dependencies ... - auto kokkos_graph = construct(graph); + // But at some point wants interoperability with Kokkos. + auto kokkos_graph = Kokkos::construct_graph(graph); - auto C1 = then(when_all(B1, B2), ...); - auto D1 = then(when_all(C1, C3), ...); + auto C1 = Kokkos::then(Kokkos::when_all(B1, B2), ...); + auto D1 = Kokkos::then(Kokkos::when_all(C1, C3), ...); + // The user is now bound to Kokkos for instantiation and submission. kokkos_graph.instantiate(); kokkos_graph.submit(); Graph update ~~~~~~~~~~~~ -From reading `Cuda`, `HIP` and `SYCL` documentations, all have some *executable graph update* mechanisms. +From reading :cppkokkos:`Cuda`, :cppkokkos:`HIP` and :cppkokkos:`SYCL` documentations, all have some *executable graph update* mechanisms. -For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`, not in `HIP` yet) can support complex graphs that might slightly change from one submission to another. +For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`) can support complex graphs that might slightly change from one submission to another. Updates to a graph will be scheduled after any in-flight executions of the same graph and will not affect previous submissions of the same graph. The user is not required to wait on any previous submissions of a graph before updating it. -As the topology is fixed, we can only reasonably update kernel parameters. 
+As the topology is fixed, we can only reasonably update kernel parameters or skip a node. + +.. graphviz:: + :caption: Some iterative loop that needs to seed under some condition (to be enhanced). + + digraph graph_update { + + S[label="start", shape=diamond]; + + A[label="seed"]; + B[label="compute"]; + C[label="solve"]; + + S -> A[color=green]; + + A -> B[color=green]; + + B -> C; + + C -> S; + + S -> B[color="red"]; + + } + +Iterative processes +~~~~~~~~~~~~~~~~~~~ -Iterative process ------------------ +Plenty of opportunities for :cppkokkos:`Kokkos::Graph` to lean in: -- iterative solver (our assembly case) +- iterative solver - line search in optimization +- you name it + +Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images/1/1a/Kokkos-graphs-presentation.pdf: + +.. graphviz:: + :caption: Two `AXPBY` operations followed by a dot product. + + digraph axpby { + A[label="axpby"]; + B[label="axpby"]; + C[label="dotp"]; + A->C; + B->C; + } + +.. literalinclude:: Graph.axpby.kokkos.vanilla.cpp + :language: c++ + :caption: Vanilla `Kokkos`. +.. literalinclude:: Graph.axpby.kokkos.graph.cpp + :language: c++ + :caption: Current :cppkokkos:`Kokkos::Graph`. +.. literalinclude:: Graph.axpby.kokkos.graph.p2300.cpp + :language: c++ + :caption: *à la* P2300. They also use graphs... ----------------------- @@ -366,11 +387,29 @@ They also use graphs... * `PyTorch` https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/ * `GROMACS` https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/ +Design choices +-------------- + +Questions we need to answer before going further in the :cppkokkos:`Graph` refactor. + +Dispatching +~~~~~~~~~~~ -Homework +- Do we allow node policies to have a user-provided execution space instance ? +- When does `Kokkos` make its to-device dispatching (*e.g.* to global memory) ? -- what does Kokkos during dispatching ? (HIP CUDA SYCL) Execution space instance from the policy, used or ignored ?
-- for each example 3 columns how to write it in CUDA SYCL P2300 Kokkos -- développer l'update -- essayer de démontrer qu'on peut écrire un seul code, et dire si on veut que ce soit un graph ou pas - (why it matters: write single source code , kokkos premise 'single source code') \ No newline at end of file +Write a single source code, but allow skipping backend graph +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We should be able to write a single source code and decide if we want the graph to map to the backend graph or just +execute nodes. + +This would greatly benefit adoption, and respect `Kokkos` single source code promise. + +References +---------- + +* https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf +* https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md +* https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/ +* https://hihat.opencommons.org/images/1/1a/Kokkos-graphs-presentation.pdf From 9a268a3a19ec74c3e7ab044cc80057ce82d63659 Mon Sep 17 00:00:00 2001 From: romintomasetti Date: Fri, 30 Aug 2024 03:51:05 +0000 Subject: [PATCH 4/6] cleaning stuff --- .../core/Graph.axpby.kokkos.graph.p2300.cpp | 62 ++++++++++++++++--- docs/source/API/core/Graph.rst | 15 +++-- 2 files changed, 63 insertions(+), 14 deletions(-) diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp index 3d129d2a4..c4e7015bf 100644 --- a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp +++ b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp @@ -1,15 +1,57 @@ -auto graph = Kokkos::construct_graph(); +/** + * This is some external library function to which we pass a sender. + * The sender might either be a regular @c Kokkos execution space instance + * or a graph-node-sender-like stuff. 
+ * The asynchronicity within the function will either be provided by the graph + * or must be dealt with in the regular way (creating many space instances). + */ +sender library_stuff(sender start) +{ + sender auto exec_A, exec_B; + + if constexpr (Kokkos::is_a_sender) { + exec_A = exec_B = start; + } else { + std::tie(exec_A, exec_B) = Kokkos::partition_space(start, 1, 1); + } -auto node_xpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{x, y, alpha, beta})); -auto node_zpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{z, y, gamma, beta})); + auto node_xpy = Kokkos::parallel_for(exec_A, policy(N), MyAxpby{x, y, alpha, beta}); + auto node_zpy = Kokkos::parallel_for(exec_B, policy(N), MyAxpby{z, y, gamma, beta}); -auto node_dotp = Kokkos::then( - Kokkos::when_all(node_xpy, node_zpy), - Kokkos::parallel_reduce(N, MyDotp{x, z}, dotp) -); + /// No need to fence, because @c Kokkos::when_all will take care of that. + return Kokkos::parallel_reduce( + Kokkos::when_all(node_xpy, node_zpy), + policy(N), + MyDotp{x, z}, dotp + ); +} -graph.instantiate(); +int main() +{ + scheduler auto exec = Kokkos::DefaultExecutionSpace{}; -graph.submit(exec_A); + /** + * Start the chain of nodes with an "empty" node, similar to @c std::execution::schedule. + * Under the hood, it creates the @c Kokkos::Graph. + * All nodes created from this sender will share a handle to the underlying @c Kokkos::Graph. + */ + sender auto start = Kokkos::construct_empty_node(exec); -exec_A.fence(); + sender auto seeding = Kokkos::parallel_for(start, policy(N), SomeWork{...}); + + /// Pass our chain to some external library function. + sender auto subgraph = library_stuff(seeding); + + sender auto last_action = Kokkos::parallel_scan(subgraph, policy(N), ScanFunctor{...}); + + /// @c Kokkos has a free function for instantiating the underlying graph. 
+ /// All nodes connected to the same handle are notified that they cannot be used as senders anymore, + /// because they are locked in an instantiated graph. + sender auto executable_whatever = Kokkos::Graph::instantiate(last_action); + + /// Submission is a no-op if the received sender is an execution space instance. + /// Otherwise, it submits the underlying graph. + Kokkos::Graph::submit(my_exec, executable_whatever) + + my_exec.fence(); +} diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst index 42beb0860..9a4283f1a 100644 --- a/docs/source/API/core/Graph.rst +++ b/docs/source/API/core/Graph.rst @@ -377,10 +377,6 @@ Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images :language: c++ :caption: Current :cppkokkos:`Kokkos::Graph`. -.. literalinclude:: Graph.axpby.kokkos.graph.p2300.cpp - :language: c++ - :caption: *à la* P2300. - They also use graphs... ----------------------- @@ -406,6 +402,17 @@ execute nodes. This would greatly benefit adoption, and respect `Kokkos` single source code promise. +Design we would like to agree on +-------------------------------- + +This should be the kind of design we'd like to have (kind of conforming to P2300). + +Might be worth reading: https://docs.nvidia.com/hpc-sdk/archive/23.9/pdf/hpc239c++_par_alg.pdf. + +.. literalinclude:: Graph.axpby.kokkos.graph.p2300.cpp + :language: c++ + :caption: *à la* P2300. 
+ References ---------- From 46e55537356a4f3cf1e97e9832af927f866fd9c0 Mon Sep 17 00:00:00 2001 From: romintomasetti Date: Fri, 30 Aug 2024 21:46:04 +0200 Subject: [PATCH 5/6] wip --- .../core/Graph.axpby.kokkos.graph.p2300.cpp | 54 +++++++++++++------ docs/source/API/core/Graph.rst | 50 +++++++++-------- docs/source/API/core/Graph.update.tikz | 38 +++++++++++++ docs/source/conf.py | 1 + 4 files changed, 104 insertions(+), 39 deletions(-) create mode 100644 docs/source/API/core/Graph.update.tikz diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp index c4e7015bf..f8182024a 100644 --- a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp +++ b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp @@ -12,12 +12,18 @@ sender library_stuff(sender start) if constexpr (Kokkos::is_a_sender) { exec_A = exec_B = start; } else { - std::tie(exec_A, exec_B) = Kokkos::partition_space(start, 1, 1); + /// How do we partition ? + exec_A = start; + exec_B = Kokkos::partition_space(start, 1); } auto node_xpy = Kokkos::parallel_for(exec_A, policy(N), MyAxpby{x, y, alpha, beta}); auto node_zpy = Kokkos::parallel_for(exec_B, policy(N), MyAxpby{z, y, gamma, beta}); + /// In the non-graph case,how do we enforce that e.g. node_zpy is done and launch + /// the parallel-reduce on the same execution space instance as node_xpy without writing + /// any additional piece of code ? + /// No need to fence, because @c Kokkos::when_all will take care of that. return Kokkos::parallel_reduce( Kokkos::when_all(node_xpy, node_zpy), @@ -28,30 +34,44 @@ sender library_stuff(sender start) int main() { - scheduler auto exec = Kokkos::DefaultExecutionSpace{}; + /// A @c Kokkos execution space instance is a context (i.e. a source + /// of asynchronous execution such as a thread pool or a GPU stream) + const Kokkos::DefaultExecutionSpace context {}; + + /// A scheduler is a lightweight handle to an execution context. 
+ stdexec::scheduler auto scheduler = context.get_scheduler(); /** - * Start the chain of nodes with an "empty" node, similar to @c std::execution::schedule. - * Under the hood, it creates the @c Kokkos::Graph. - * All nodes created from this sender will share a handle to the underlying @c Kokkos::Graph. - */ - sender auto start = Kokkos::construct_empty_node(exec); + * Start the chain of nodes with an "empty" node, similar to @c std::execution::schedule. + * Under the hood, it creates the @c Kokkos::Graph. + * All nodes created from this sender will share a handle to the underlying @c Kokkos::Graph. + */ + stdexec::sender auto start = Kokkos::Experimental::Graph::schedule(scheduler); - sender auto seeding = Kokkos::parallel_for(start, policy(N), SomeWork{...}); + /// @c Kokkos::parallel_for would behave much like @c std::execution::bulk. + stdexec::sender auto my_work = Kokkos::Experimental::Graph::parallel_for(start, policy(N), ForFunctor{...}); /// Pass our chain to some external library function. - sender auto subgraph = library_stuff(seeding); + stdexec::sender auto subgraph = library_stuff(my_work); - sender auto last_action = Kokkos::parallel_scan(subgraph, policy(N), ScanFunctor{...}); + /// Add some work again. + stdexec::sender auto my_other_work = Kokkos::Experimental::Graph::parallel_scan(subgraph, policy(N), ScanFunctor{...}); - /// @c Kokkos has a free function for instantiating the underlying graph. - /// All nodes connected to the same handle are notified that they cannot be used as senders anymore, - /// because they are locked in an instantiated graph. - sender auto executable_whatever = Kokkos::Graph::instantiate(last_action); + /// @c Kokkos::Graph has a free function for instantiating the underlying graph. + /// All nodes connected to the same handle (i.e. that are on the same chain) are notified + /// that they cannot be used as senders anymore, + /// because they are locked in an instantiated graph.
In other words, the chain is a DAG, and it + /// cannot change anymore. + stdexec::sender auto executable_chain = Kokkos::Graph::instantiate(my_other_work); - /// Submission is a no-op if the received sender is an execution space instance. + /// Submission is a no-op if the passed sender is a @c Kokkos execution space instance. /// Otherwise, it submits the underlying graph. - Kokkos::Graph::submit(my_exec, executable_whatever) + Kokkos::Graph::submit(scheduler, executable_chain); + + ::stdexec::sync_wait(scheduler); - my_exec.fence(); + /// Submit the chain again, using another scheduler. + /// In essence, what @c Kokkos::Graph::submit can do is pretty much similar to what + /// @c std::execution::starts_on does. It allows the sender to be executed elsewhere. + Kokkos::Graph::submit(another_scheduler, executable_chain); } diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst index 9a4283f1a..1adde3f55 100644 --- a/docs/source/API/core/Graph.rst +++ b/docs/source/API/core/Graph.rst @@ -312,6 +312,21 @@ relies on `Kokkos` for both *instantiation* and *submission*. kokkos_graph.instantiate(); kokkos_graph.submit(); +Interweaving +~~~~~~~~~~~~ + +This is the case where a user does not use :cppkokkos:`Graph`, but calls some external library function that does. + +In this case, :code:`submit` really needs to be passed an execution space instance to ensure that the graph +is nicely inserted into the user's kernel queues. + +Stated verbosely: + + The stream-based (execution space instance based) approach can co-exist in the same code with + the graph-based approach, thereby making :cppkokkos:`Graph` a very attractive abstraction. + A use case in which "at the global level" the code uses a stream-based approach can play well with + some (possibly external) calls that use :cppkokkos:`Graph` under the hood.
+ Graph update ~~~~~~~~~~~~ @@ -324,28 +339,9 @@ For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`) can su As the topology is fixed, we can only reasonably update kernel parameters or skip a node. -.. graphviz:: - :caption: Some iterative loop that needs to seed under some condition (to be enhanced). - - digraph graph_update { - - S[label="start", shape=diamond]; - - A[label="seed"]; - B[label="compute"]; - C[label="solve"]; - - S -> A[color=green]; - - A -> B[color=green]; - - B -> C; - - C -> S; - - S -> B[color="red"]; - - } +.. tikz:: Some iterative loop that needs to seed under some condition, as well as a library call for compute. + :include: Graph.update.tikz + :libs: backgrounds, calc, positioning, shapes Iterative processes ~~~~~~~~~~~~~~~~~~~ @@ -377,6 +373,16 @@ Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images :language: c++ :caption: Current :cppkokkos:`Kokkos::Graph`. +Why/when should I choose :cppkokkos:`Kokkos::Graph` +--------------------------------------------------- + +Two obvious but different cases: + +#. A few kernels, probably small, easily manually stream-managed, submitted several times. Then using :cppkokkos:`Graph` + will help you reduce kernel launch overheads. TODO: link to A1 A2 A3 B graph. +#. A lot of kernels, very complex DAG, probably not worth thinking too much about how they could be efficiently orchestrated + if :cppkokkos:`Graph` guarantees that it will take care of that for you. + They also use graphs...
----------------------- diff --git a/docs/source/API/core/Graph.update.tikz b/docs/source/API/core/Graph.update.tikz new file mode 100644 index 000000000..3f679e444 --- /dev/null +++ b/docs/source/API/core/Graph.update.tikz @@ -0,0 +1,38 @@ +\tikzset{ + decide/.style = {draw, shape = diamond, fill = red!25, aspect = 2, inner sep = 1pt}, + endpoint/.style = {draw, circle, fill = black!20, inner sep = 1pt}, + yesorno/.style = {rectangle,draw,fill=white,inner sep=1pt}, + work/.style = {rectangle, draw, fill = orange!25}, + % We need to enforce a white background for folks in dark mode. + background rectangle/.style={fill=white}, + show background rectangle +} +\node[endpoint] (start) {Start}; + +\node[decide,below=0.5cm of start] (decision) { Seeding ?}; + +\node[work, below=1cm of decision] (seeding) {Seeding}; + +\node[work, below=0.5cm of seeding, minimum height=2cm, minimum width = 2cm] (compute) {Compute}; + +\node[work, below=1cm of compute] (solve) {Solve}; + +\node[decide,right=0.5cm of solve] (convergence) {Convergence ?}; + +\node[endpoint, right=1cm of convergence] (end) {End}; + +\draw [-stealth,solid](start) -- (decision.north); + +\draw [-stealth,solid](decision) -- (seeding.north) node[midway, yesorno] {yes}; + +\draw [-stealth,solid](seeding)--(compute.north); + +\draw [-stealth,solid](compute)--(solve.north); + +\draw [-stealth,solid](solve)--(convergence.west); + +\draw [-stealth,solid](convergence.east)--(end.west) node[midway, yesorno] {yes}; + +\draw [-stealth,solid](convergence.north) -- node[midway, yesorno]{no} (convergence.north|-decision.east) -- (decision.east); + +\draw [-stealth,solid](decision.west) -- ({$(decision.west)-0.25*(convergence.north)+0.25*(decision.east)$}|-decision.west) |- node[near start, yesorno] {no} (compute.west); diff --git a/docs/source/conf.py b/docs/source/conf.py index 583e7a044..524bc8225 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -40,6 +40,7 @@ "sphinx.ext.intersphinx", "sphinx_copybutton", 
"sphinx_design", + "sphinxcontrib.tikz", "cppkokkos"] # Add any paths that contain templates here, relative to this directory. From 49ba326bbe7f7b9de75f7a23b5d0576f31e77a6e Mon Sep 17 00:00:00 2001 From: romintomasetti Date: Tue, 1 Apr 2025 16:07:48 +0000 Subject: [PATCH 6/6] wip --- docs/source/API/core/Graph.old.rst | 166 +++++++++++++++++++++++++++++ docs/source/API/core/Graph.rst | 98 +++++++++++++++++ 2 files changed, 264 insertions(+) create mode 100644 docs/source/API/core/Graph.old.rst diff --git a/docs/source/API/core/Graph.old.rst b/docs/source/API/core/Graph.old.rst new file mode 100644 index 000000000..191fab0ae --- /dev/null +++ b/docs/source/API/core/Graph.old.rst @@ -0,0 +1,166 @@ +# What are the semantics of `Kokkos::Graph` ? + +What are the allowed semantics of `Kokkos::Graph` ? + +Questions: + +1. Do we document the allowed semantics for which the user gets covered by `Kokkos` or do we try to enforce the semantics with object states and stuff ? +2. What about the execution space instance ? It seems that `submit` should allow one to be passed. +3. Multi-GPU. +4. runtime aggregate node is still not possible, see https://github.com/kokkos/kokkos/issues/6060. +4. Missing documentation online ? + +It should allow functionalities listed in https://www.olcf.ornl.gov/wp-content/uploads/2021/10/013_CUDA_Graphs.pdf, slide 4. + +## Usage + +How would people use `Kokkos::Graph` ? + +### The simplest usage I could come with + +The graph is known in advance (at compile time) and can be created in the lambda (*i.e.* not using hidden `impl` stuff). +Once created, the user expects that the graph can be re-submitted several time. The user does not want to add/remove nodes once submitted for the first time (no fancy stuff). +The user does not care about streams whatsoever. + +1. Create some `data` in a view, and a `functor` to act on it. +2. Create the `graph` and add a parallel-for `node` using the `functor` acting on `data`. +3. 
Submit the graph as many times as you want. + +```c++ +template +struct Functor +{ + Kokkos::View data; + + template + KOKKOS_FUNCTION + void operator()(const T index) const { ... ... }; +}; + +int main() +{ + const Kokkos::View data(...); + + auto graph = Kokkos::Experimental::create_graph([&](auto root) { + [[maybe_unused]] const auto node = root.then_parallel_for(0, ..., Functor{ .data = data }); + }); + + graph.submit(); +} +``` + +### More advanced usage + +The graph is unknown and cannot be easily/prettily created in the lambda (*e.g.* the user attaches nodes dynamically depending on some complex setup like partitioning). +Once created, the user still expects that the graph can be re-submitted several times. +The user cares about streams for orchestration. + +We need to use some `impl` stuff for such a case. + +```c++ +/** + * Create the graph. + * + * 1. Damien said there are other ways to do that w/o using Impl, but I could not find them. It seems that TestGraph.hpp only uses + * the Kokkos::Experimental::create_graph that takes a closure. + * It seems that 'construct_graph' should somehow be promoted to the public API. Is there any reason not to do so? + * 2. The execution space instance is not used until the executable graph is launched with 'cudaGraphLaunch'. + * Therefore, it's questionable whether it should be part of the Kokkos::Graph state or not (it's an Impl detail though). + */ +auto graph = Kokkos::Impl::GraphAccess::construct_graph(exec_a); +auto root = Kokkos::Impl::GraphAccess::create_root_ref(graph); + +/** + * Fill the graph with nodes, according to a complex DAG topology. + * The nodes might be added conditionally (conditions might change at runtime, e.g. MPI partitioning). + * + * ROOT + * / \ + * N11 N12 + * | | \ + * N21 N22 N23 + * \ / / + * \ / / + * N31 + * + * @todo Add @c if nodes. See also https://developer.nvidia.com/blog/dynamic-control-flow-in-cuda-graphs-with-conditional-nodes/.
+ */ +std::vector N31_predecessors; + +if(condition_branch_1) // branch 1 +{ + auto N11 = root.then_parallel_for(...label..., ...policy..., ...body...); + auto N21 = N11.then_parallel_for(...label..., ...policy..., ...body...); + N31_predecessors.push_back(N21); +} + +if(condition_branch_2) // branch 2 +{ + auto N12 = root.then_parallel_for(...name..., ...policy..., ...body...); + auto N22 = N12.then_parallel_for(...name..., ...policy..., ...body...); + auto N23 = N12.then_parallel_for(...name..., ...policy..., ...body...); + N31_predecessors.push_back(N22); + N31_predecessors.push_back(N23); +} + +//! This is currently impossible. See also https://github.com/kokkos/kokkos/issues/6060. +auto N31_ready = Kokkos::Experimental::when_all(N31_predecessors); +auto N31 = N31_ready.then_parallel_for(...name..., ...policy..., ...body...); + +/** + * The topology of the graph has been defined. + * It now has to be instantiated. + * According to: + * - https://www.olcf.ornl.gov/wp-content/uploads/2021/10/013_CUDA_Graphs.pdf (slide 9) + * - https://developer.nvidia.com/blog/employing-cuda-graphs-in-a-dynamic-environment/ + * the topology cannot change once the graph has been instantiated, + * but the node parameters may be updated (cudaGraphExecUpdate). + */ +graph.instantiate(...); + +/** + * Launch the graph on some execution space instance. + * Re-launch onto another execution space instance. + * According to cudaGraphLaunch, a stream is allowed and it makes sense. + * + * @todo Check for @c HIP and @c SYCL. + */ +graph.submit(exec_b); +graph.submit(exec_c); +``` + +## What to do, prioritizing + +### Promote `construct_graph` to the public API + +This allows for advanced use cases that do not fit well with the current closure-based construction API. + +Retrieving the root node should also be promoted to the public API. + +### `Kokkos::Graph::instantiate` + +**Add** `Kokkos::Graph::instantiate` to the public API.
+ +This allows the user to control when the executable graph gets instantiated. + +It can be called only once. + +Adding nodes after instantiation is prohibited. + +### `Kokkos::Graph::submit` + +**Change** the public API to accept an execution space instance. + +Note that it is simply used to order the graph launch into some work queue. + +### Remove the execution space instance from `Kokkos::Graph` state + +The title says it all. + +### Allow dynamic aggregate node + +**Add** a `Kokkos::Experimental::when_all` that allows for a vector/list of nodes to be passed. + +## Go further + +We might want to get the design of `Kokkos::Graph` close to `std::execution` (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html). diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst index 1adde3f55..d6ad76e7d 100644 --- a/docs/source/API/core/Graph.rst +++ b/docs/source/API/core/Graph.rst @@ -339,6 +339,12 @@ For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`) can su As the topology is fixed, we can only reasonably update kernel parameters or skip a node. +.. note:: + + Todo: for the solve step, take Emil's work and say that at compile time we could not reasonably know what its graph + would look like. But our own assembly graph could be determined at compile time (knowing the system at stake, + how we partition it and so on -> still a burden). + .. tikz:: Some iterative loop that needs to seed under some condition, as well as a library call for compute. :include: Graph.update.tikz :libs: backgrounds, calc, positioning, shapes @@ -373,6 +379,18 @@ Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images :language: c++ :caption: Current :cppkokkos:`Kokkos::Graph`. +Runtime graph +~~~~~~~~~~~~~ + +It can happen that a graph cannot be known at compile time.
Examples of programs that could not +determine the control flow completely at compile time: +- MPI partitioning +- BLAS routines and system size +- you name it + +Therefore, we must support both pure compile-time graphs and runtime graphs. +This implies type-erasure. And this is not possible by default in `std::execution` apparently. + Why/when should I choose :cppkokkos:`Kokkos::Graph` --------------------------------------------------- @@ -426,3 +444,83 @@ References * https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md * https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/ * https://hihat.opencommons.org/images/1/1a/Kokkos-graphs-presentation.pdf + + +********************************* + +Write our graph examples in P2300 -> compile + test + +Then in current kokkos graph -> compile + test + +Then in the draft kokkos graph à la p2300 -> as a guideline (where we want to go) + +* Example from Emil Geleleens' TFE: Solver for matrix L -> backsubstitution for lower triangular + matrix -> dependencies between unknowns to speed up things -> creates a graph with "as many nodes + as there are unknowns" (with variants, but in any case we get many nodes) in the input matrix L + -> his work is not about efficiently launching this graph and in fact he did it with manual kernel launches + -> could be nice to use kokkos graph to focus on other things + -> there might be some sweet spot above which the backend graph makes sense (cost of instantiate and launch) + +=> private repo "uliegecsm/kokkos-graph-p2300" + + +One blocker (https://docs.nvidia.com/cuda/pdf/CUSPARSE_Library.pdf): we would need to use graph capture +to embed the cu solver into our graph... + + Most of the cuSPARSE routines can be optimized by exploiting CUDA Graphs capture and + Hardware Memory Compression features. + More in details, a single cuSPARSE call or a sequence of calls can be captured by a CUDA + Graph and executed in a second moment.
This minimizes kernels launch overhead and allows + the CUDA runtime to optimize the whole workflow. A full example of CUDA graphs capture + applied to a cuSPARSE routine can be found in cuSPARSE Library Samples - CUDA Graph. + + + +Meeting notes +============= + +0. Do you know :cppkokkos:`Kokkos::Graph` ? + + :cppkokkos:`Kokkos::Graph` is an abstraction of a DAG of asynchronous workloads that maps to a backend graph, + or to the defaulted implementation. + + Advantages of the graph: asynchronous management done by the backend driver + launch overhead reduces especially + when submitting many times. + + .. figure:: Graph.kokkos.3.paper.jpg + + Example from Kokkos 3 paper. + +1. We want to refactor the public API of :cppkokkos:`Kokkos::Graph` so that it feels more like `std::execution` (P2300). + + We could think of a graph (e.g. :cppkokkos:`Kokkos::Graph`) as a **multi-shot sender chain** (?). + + .. code-block:: c++ + :caption: Old way + + child = parent.then_parallel_(policy, body); + + .. code-block:: c++ + :caption: P2300-alike way + + child = parallel_for(parent, policy, body); // usual + child = parent | parallel_for(policy, body); // piping + + This seems to be an easy step. A few wrappers could be used in a first step to "transport" + the P2300-alike way arguments to the old way (thereby keeping the `Kokkos::Graph` implementation + untouched). + +2. Deeper refactoring of :cppkokkos:`Kokkos::Graph`: + + * Should the nodes of the graph be senders ? Or should `Kokkos` nodes and graph + be wrapped in an adaptor-like API to remain an implementation detail hidden to the user ? + "P2300 nodes" would then have handlers to their :cppkokkos:`Kokkos::Impl` (nodes and graph) counterparts. + How is this implemented in `HPX` ? When creating a sender, is there some under-the-hood implementation + class that maps to some `HPX` pre-existing internals ? 
+ * Current :cppkokkos:`Kokkos::Graph` restrictions: + - All nodes are targeting the same backend :math:`\implies` only one scheduler type can be used. + - The chain cannot contain `transfer`, `starts_on`, and so on. The scheduling is left to `Kokkos` through + :cppkokkos:`Kokkos::Graph::submit(exec)`. + + +https://accu.org/journals/overload/29/164/teodorescu/ \ No newline at end of file
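The "few wrappers" idea from item 1 of the meeting notes could look roughly like the sketch below. It is only an illustration under stated assumptions: `MockNode`, `MockPolicy`, `MockBody` and `ParallelForClosure` are hypothetical names introduced here so that the snippet compiles without `Kokkos`, and the real `then_parallel_for` overloads may take other arguments (e.g. a label). The point is only that both the free-function and the piping spellings can forward to the existing member function, leaving the underlying implementation untouched.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical policy and functor types, for illustration only.
struct MockPolicy { int n; };
struct MockBody { std::string label; };

// Hypothetical stand-in for a graph node exposing the "old way" member API.
// It only records the labels of the chained bodies.
struct MockNode {
    std::vector<std::string> chain;

    template <class Policy, class Body>
    MockNode then_parallel_for(const Policy&, const Body& body) const {
        MockNode child{chain};
        child.chain.push_back(body.label);
        return child;
    }
};

// "Usual" P2300-alike spelling: transport the arguments to the old way.
template <class Node, class Policy, class Body>
auto parallel_for(const Node& parent, Policy policy, Body body) {
    return parent.then_parallel_for(policy, body);
}

// Closure returned by the two-argument overload, enabling the piping spelling.
template <class Policy, class Body>
struct ParallelForClosure {
    Policy policy;
    Body body;
};

template <class Policy, class Body>
ParallelForClosure<Policy, Body> parallel_for(Policy policy, Body body) {
    return {std::move(policy), std::move(body)};
}

// Piping spelling: `parent | parallel_for(policy, body)`.
template <class Node, class Policy, class Body>
auto operator|(const Node& parent, const ParallelForClosure<Policy, Body>& closure) {
    return parent.then_parallel_for(closure.policy, closure.body);
}

// Usage: both spellings extend the same chain of mock nodes.
inline std::vector<std::string> demo_chain() {
    MockNode root{};
    auto a = parallel_for(root, MockPolicy{8}, MockBody{"axpby"});
    auto b = a | parallel_for(MockPolicy{8}, MockBody{"dotp"});
    assert(b.chain.size() == 2);
    return b.chain;
}
```

Dispatching on arity (three arguments for the direct call, two for the closure) keeps the two spellings unambiguous without any tag types; whether real graph nodes should themselves be senders is the open question of item 2 and is not addressed by this sketch.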