From 62b83ca7afeb440cd046d78e16f5db2d694059ca Mon Sep 17 00:00:00 2001
From: romintomasetti <romin.tomasetti@gmail.com>
Date: Wed, 21 Aug 2024 17:37:17 -0400
Subject: [PATCH 1/6] core(graph): adding documentation for
 `Kokkos::Experimental::Graph`

---
 docs/source/API/core-index.rst |   3 +
 docs/source/API/core/Graph.rst | 350 +++++++++++++++++++++++++++++++++
 docs/source/conf.py            |   1 +
 3 files changed, 354 insertions(+)
 create mode 100644 docs/source/API/core/Graph.rst
diff --git a/docs/source/API/core-index.rst b/docs/source/API/core-index.rst
index 0996b0521..fa22cd0d3 100644
--- a/docs/source/API/core-index.rst
+++ b/docs/source/API/core-index.rst
@@ -37,6 +37,8 @@ API: Core
      - Utility functionality part of Kokkos Core.
    * - `Detection Idiom <core/Detection-Idiom.html>`__
      - Used to recognize, in an SFINAE-friendly way, the validity of any C++ expression.
+   * - `Graph and related <core/Graph.html>`__
+     - Kokkos Graph abstraction.
    * - `Macros <core/Macros.html>`__
      - Global macros defined by Kokkos, used for architectures, general settings, etc.
 
@@ -60,4 +62,5 @@ API: Core
    ./core/Utilities
    ./core/Detection-Idiom
    ./core/Macros
+   ./core/Graph
    ./core/Profiling
diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst
new file mode 100644
index 000000000..30d6c8982
--- /dev/null
+++ b/docs/source/API/core/Graph.rst
@@ -0,0 +1,350 @@
+Graph and related
+=================
+
+Usage
+-----
+
+:code:`Kokkos::Graph` is an abstraction that can be used to define a group of asynchronous workloads that are organised as a direct acyclic graph.
+A :code:`Kokkos::Graph` is defined separatly from its execution, allowing it to be re-executed multiple times.
+
+:code:`Kokkos::Graph` is a powerful way of describing workload dependencies. It is also a good opportunity to present all workloads
+at once to the driver, and allow some optimizations [ref].
+
+.. note::
+
+    However, because command-group submission is tied to execution on the queue, without having a prior construction step before starting execution, optimization opportunities are missed from the runtime not being made aware of a defined dependency graph ahead of execution.
+
+For small workloads that need to be sumitted several times, it might save you some overhead [reference to some presentation / paper].
+
+:code:`Kokkos::Graph` is specialized for some backends:
+
+* :code:`Cuda`: [ref to vendor doc]
+* :code:`HIP`: [ref to vendor doc]
+* :code:`SYCL`: [ref to vendor doc] -> https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
+
+For other backends, Kokkos provides a defaulted implementation [ref to file].
+
+Philosophy
+----------
+
+As mentioned earlier, the :code:`Kokkos::Graph` is first defined, and then executed. In fact, before the graph can be executed,
+it needs to be *instantiated*.
+
+During the *instantiation* phase, the topology of the graph is **locked**, and an *executable graph* is created.
+
+In short, we have 3 phases:
+
+1. Graph definition (DAG topology)
+2. Graph instantiation (executable graph)
+3. Graph submission (execute)
+
+"Splitting command construction from execution is a proven solution." (https://www.iwocl.org/wp-content/uploads/iwocl-2023-Ewan-Crawford-4608.pdf)
+
+Basic example
+-------------
+
+This example showcases how three workloads can be organised as a :code:`Kokkos::Graph`.
+
+Workloads A and B are independent, but workload C needs the completion of A and B.
+
+.. code-block:: cpp
+
+    int main()
+    {
+        auto graph = Kokkos::Experimental::create_graph<Exec>([&](auto root) {
+            const auto node_A = root.then_parallel_for(...label..., ...policy..., ...body...);
+            const auto node_B = root.then_parallel_for(...label..., ...policy..., ...body...);
+            const auto ready  = Kokkos::Experimental::when_all(node_A, node_B);
+            const auto node_C = ready.then_parallel_for(...label..., ...policy..., ...body...);
+        });
+
+        for(int irep = 0; irep < nrep; ++irep)
+            graph.submit();
+    }
+
+Advanced example
+----------------
+
+To be done soon.
+
+References
+----------
+
+* https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
+* https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md
+* https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/
+
+
+Use cases
+---------
+
+Diamond with closure, don't care about `exec`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Create a simple diamond-like graph within a closure, no caring about execution space instances.
+
+This use case demonstrates how a graph can be created from inside a closure, and how it could look like in the future.
+It is a very simple use case.
+
+Note that I'm not sure why we should support the closure anyway.
+
+.. graphviz::
+    :caption: Diamond topology
+
+    digraph diamond {
+        A -> B;
+        A -> C;
+        B -> D;
+        C -> D;
+    }
+
+.. code-block:: c++
+    :caption: Current pseudo-code
+
+    auto graph = Kokkos::create_graph([&](const auto& root){
+        auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...);
+
+        auto node_B = node_A.then_parallel_...(...label..., ...policy..., ...functor...);
+        auto node_C = node_A.then_parallel_...(...label..., ...policy..., ...functor...);
+
+        auto node_D = Kokkos::when_all(node_B, node_C).then_parallel_...(...label..., ...policy..., ...functor...);
+    });
+    graph.instantiate();
+    graph.submit()
+
+.. code-block:: c++
+    :caption: P2300 (but really I don't like that because `graph` itself is already a *sender*)
+
+    auto graph = Kokkos::create_graph([&](const auto& root){
+        auto node_A = then(root, parallel_...(...label..., ...policy..., ...functor...));
+
+        auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
+        auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
+
+        auto node_D = then(when_all(node_B, node_C), parallel_...(...label..., ...policy..., ...functor...));
+    });
+    graph.instantiate();
+    graph.submit()
+
+Diamond, caring about `exec`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Create a simple diamond-like graph, caring about execution space instances.
+
+This use case demonstrates how a graph can be created without a closure, and how it could look like in the future.
+It also focuses on where steps occur.
+
+Graph topology is known at compile time, thus enabling a lot of optimizations (kernel fusion might be one).
+
+.. graphviz::
+    :caption: Diamond topology
+
+    digraph diamond {
+        A -> B;
+        A -> C;
+        B -> D;
+        C -> D;
+    }
+
+.. code-block:: c++
+    :caption: Current pseudo-code
+
+    auto graph = Kokkos::create_graph(exec_A, [&](const auto& root){});
+    auto root  = Kokkos::Impl::GraphAccess::create_root_node_ref(graph);
+
+    auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...);
+
+    auto node_B = node_A.then_parallel_...(...label..., ...policy..., ...functor...);
+    auto node_C = node_A.then_parallel_...(...label..., ...policy..., ...functor...);
+
+    auto node_D = Kokkos::when_all(node_B, node_C).then_parallel_...(...label..., ...policy..., ...functor...);
+
+    graph.instantiate();
+    exec_A.fence("The graph might make some async to-device copies.");
+    graph.submit(exec_B);
+
+.. code-block:: c++
+    :caption: P2300 + defer when Kokkos performs internal async to-device copies
+
+    // Step 1: define topology (no execution space instance required)
+    auto graph = Kokkos::create_graph<execution_space>();
+
+    auto node_A = then(graph, parallel_...(...label..., ...policy..., ...functor...));
+
+    auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
+    auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
+
+    auto node_D = then(when_all(node_B, node_C), parallel_...(...label..., ...policy..., ...functor...));
+
+    // Step 2: instantiate (execution space instance required by both backend and Kokkos internals)
+    graph.instantiate(exec_A);
+    exec_A.fence();
+
+    // Step 3: execute
+    graph.submit(exec_B)
+
+No "root" node
+~~~~~~~~~~~~~~
+
+Currently, the :code:`Kokkos::Graph` would expose to the user a "root node" concept that is not needed
+by any backend (but might be needed by the default implementation that works with *sinks*).
+
+The "root node" might be confusing. It sould not appear in the API for 2 reasons:
+
+1. It can be misleading, as the user might think it's necessary though I think it's an artifact of how :code:`Kokkos::Graph`
+   is currently implemented for graph construction, and because of the *sink*-based defaulted implementation.
+2. With P2300, it's clear that *root* is an empty useless sender that can be thrown away at compile time.
+
+.. graphviz::
+    :caption: No root node.
+
+    digraph no_root {
+        A1 -> B;
+        A2 -> B;
+        A3 -> B;
+    }
+
+.. code-block:: c++
+    :caption: P2300
+
+    auto graph = construct_graph();
+
+    auto A1 = then(graph, ...);
+    auto A2 = then(graph, ...);
+    auto A3 = then(graph, ...);
+
+    auto B = then(when_all(A1, A2, A3), ...);
+
+Complex DAG topology
+~~~~~~~~~~~~~~~~~~~~
+
+Any complex-but-valid DAG topology should work.
+
+.. graphviz::
+    :caption: A complex DAG
+
+    digraph complex_dag {
+
+        A1 -> B1;
+        A1 -> B2;
+        A1 -> B3;
+        A2 -> B1;
+        A2 -> B3;
+        A3 -> B4;
+        
+        B1 -> C1;
+        B3 -> C1;
+        
+        B2 -> C2;
+        B4 -> C2;
+        
+        // Enfore ordering of nodes with invisible edges.
+        {
+            rank = same;
+            edge[ style=invis];
+            B1 -> B2 -> B3 -> B4 ;
+            rankdir = LR;
+        }
+    }
+
+Changing scheduler
+~~~~~~~~~~~~~~~~~~
+
+This is the purpose of PR https://github.com/kokkos/kokkos/pull/7249, and should be further documented.
+
+Towards https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html#design-sender-adaptor-starts_on.
+
+.. code-block:: c++
+
+    auto graph = construct()
+
+    auto node_1 = ...
+
+    ...
+
+    graph.instantiate();
+
+    graph.submit(exec_A);
+
+    ...
+
+    graph.submit(exec_C);
+
+    ...
+
+    graph.submit(exec_D);
+
+Interoperability
+~~~~~~~~~~~~~~~~
+
+Why interoperability matters (helps adoption of :code:`Kokkos::Graph`, extensibility, corner cases):
+
+1. Attract users that already use some backend graph (*e.g.* `cudaGraph_t`) towards `Kokkos`. It helps them transition smoothly.
+2. Help user integrate backend-specific graph capabilities that are not part of the :code:`Kokkos::Graph` API for whatever reason.
+
+Since `Kokkos` might run some stuff linked to its internals at *instantiation* stage, and since in PR https://github.com/kokkos/kokkos/pull/7240
+we decided to ensure that before the submission, the graph needs to be instantiated in `Kokkos`, interoperability implies that the user
+passes through `Kokkos` for both *instantiation* and *submission*.
+
+.. graphviz::
+    :caption: Dark nodes/edges are added through :code:`Kokkos::Graph`.
+
+    digraph interoperability {
+
+        A[color=darksalmon];
+        
+        B1[color=darksalmon];
+        B2[color=darksalmon];
+        B3[color=darksalmon];
+        
+        C3[color=darksalmon];
+
+        A -> B1[color=darksalmon];
+        A -> B2[color=darksalmon];
+        A -> B3[color=darksalmon];
+        
+        B3 -> C3[color=darksalmon];
+        
+        // Enfore ordering of nodes with invisible edges.
+        {
+            rank = same;
+            edge[style=invis];
+            B1 -> B2 -> B3 ;
+            rankdir = LR;
+        }
+        
+        B1 -> C1;
+        B2 -> C1;
+        
+        C1 -> D1;
+        C3 -> D1;
+    } 
+
+.. code-block:: c++
+    :caption: interoperability pseudo-code P2300
+
+    cudaGraph_t graph;
+    cudaGraphCreate(&graph, ...);
+
+    cudaGraphNode_t A, B1, B2, B3, C3;
+    ... create kernel nodes and add dependencies ...
+
+    auto kokkos_graph = construct(graph);
+
+    auto C1 = then(when_all(B1, B2), ...);
+    auto D1 = then(when_all(C1, C3), ...);
+
+    kokkos_graph.instantiate();
+    kokkos_graph.submit();
+
+Graph update
+~~~~~~~~~~~~
+
+From reading `Cuda`, `HIP` and `SYCL` documentations, all have some *executable graph update* mechanisms.
+
+For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`, not in `HIP` yet) can support complex graphs that might slightly change from one submission to another.
+
+    Updates to a graph will be scheduled after any in-flight executions of the same graph and will not affect previous submissions of the same graph.
+    The user is not required to wait on any previous submissions of a graph before updating it.
+
+As the topology is fixed, we can only reasonably update kernel parameters.
diff --git a/docs/source/conf.py b/docs/source/conf.py
index cfc39b7e7..583e7a044 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -35,6 +35,7 @@
 # ones.
 extensions = ["myst_parser",
               "sphinx.ext.autodoc",
+              "sphinx.ext.graphviz",
               "sphinx.ext.viewcode",
               "sphinx.ext.intersphinx",
               "sphinx_copybutton",

From e9d776cef8fd333f1e5bc358b33a1a737b3e8ccf Mon Sep 17 00:00:00 2001
From: romintomasetti <romin.tomasetti@gmail.com>
Date: Tue, 27 Aug 2024 17:23:40 +0000
Subject: [PATCH 2/6] wip meeting this morning results and todos

---
 docs/source/API/core/Graph.rst | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst
index 30d6c8982..1a9f3df6a 100644
--- a/docs/source/API/core/Graph.rst
+++ b/docs/source/API/core/Graph.rst
@@ -171,6 +171,9 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel
 
     auto node_A = then(graph, parallel_...(...label..., ...policy..., ...functor...));
 
+    // what happens to an exec space instance passed to the policy ? is it used somehow or just ignored ?
+    // when dispatching the driver to global memory, what exec space instance is used for the async copies ?
+
     auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
     auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
 
@@ -348,3 +351,26 @@ For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`, not in
     The user is not required to wait on any previous submissions of a graph before updating it.
 
 As the topology is fixed, we can only reasonably update kernel parameters.
+
+Iterative process
+-----------------
+
+- iterative solver (our assembly case)
+- line search in optimization
+
+
+
+They also use graphs...
+-----------------------
+
+* `PyTorch` https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
+* `GROMACS` https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/
+
+
+Homework
+
+- what does Kokkos do during dispatch? (HIP, CUDA, SYCL) Is the execution space instance from the policy used or ignored?
+- for each example, 3 columns: how to write it in CUDA, SYCL, P2300, Kokkos
+- develop the update part
+- try to demonstrate that a single code can be written, and then choose whether it should run as a graph or not
+  (why it matters: write single source code, Kokkos premise 'single source code')
\ No newline at end of file

From ad9700f7bb8461b30a8f42e8d49f69cda30dca12 Mon Sep 17 00:00:00 2001
From: romintomasetti <romin.tomasetti@gmail.com>
Date: Thu, 29 Aug 2024 19:59:26 +0000
Subject: [PATCH 3/6] wip before meeting

---
 .../API/core/Graph.axpby.kokkos.graph.cpp     |  12 +
 .../core/Graph.axpby.kokkos.graph.p2300.cpp   |  15 ++
 .../API/core/Graph.axpby.kokkos.vanilla.cpp   |   8 +
 docs/source/API/core/Graph.rst                | 253 ++++++++++--------
 4 files changed, 181 insertions(+), 107 deletions(-)
 create mode 100644 docs/source/API/core/Graph.axpby.kokkos.graph.cpp
 create mode 100644 docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
 create mode 100644 docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp

diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.cpp
new file mode 100644
index 000000000..24cc178ac
--- /dev/null
+++ b/docs/source/API/core/Graph.axpby.kokkos.graph.cpp
@@ -0,0 +1,12 @@
+auto graph = Kokkos::Experimental::create_graph(exec_A, [&](auto root){
+    auto node_xpy = root.then_parallel_for(N, MyAxpby{x, y, alpha, beta});
+    auto node_zpy = root.then_parallel_for(N, MyAxpby{z, y, gamma, beta});
+
+    auto node_dotp = Kokkos::Experimental::when_all(node_xpy, node_zpy).then_parallel_reduce(
+        N, MyDotp{x, z}, dotp
+    );
+});
+
+graph.submit(exec_A);
+
+exec_A.fence();
diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
new file mode 100644
index 000000000..3d129d2a4
--- /dev/null
+++ b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
@@ -0,0 +1,15 @@
+auto graph = Kokkos::construct_graph();
+
+auto node_xpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{x, y, alpha, beta}));
+auto node_zpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{z, y, gamma, beta}));
+
+auto node_dotp = Kokkos::then(
+    Kokkos::when_all(node_xpy, node_zpy),
+    Kokkos::parallel_reduce(N, MyDotp{x, z}, dotp)
+);
+
+graph.instantiate();
+
+graph.submit(exec_A);
+
+exec_A.fence();
diff --git a/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp b/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp
new file mode 100644
index 000000000..3789ba4d7
--- /dev/null
+++ b/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp
@@ -0,0 +1,8 @@
+Kokkos::parallel_for(policy_t(exec_A, 0, N), MyAxpby{x, y, alpha, beta});
+Kokkos::parallel_for(policy_t(exec_B, 0, N), MyAxpby{z, y, gamma, beta});
+
+exec_B.fence();
+
+Kokkos::parallel_reduce(policy_t(exec_A, 0, N), MyDotp{x, z}, dotp);
+
+exec_A.fence();
diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst
index 1a9f3df6a..42beb0860 100644
--- a/docs/source/API/core/Graph.rst
+++ b/docs/source/API/core/Graph.rst
@@ -4,10 +4,10 @@ Graph and related
 Usage
 -----
 
-:code:`Kokkos::Graph` is an abstraction that can be used to define a group of asynchronous workloads that are organised as a direct acyclic graph.
-A :code:`Kokkos::Graph` is defined separatly from its execution, allowing it to be re-executed multiple times.
+:cppkokkos:`Kokkos::Graph` is an abstraction that can be used to define a group of asynchronous workloads that are organised as a directed acyclic graph.
+A :cppkokkos:`Kokkos::Graph` is defined separately from its execution, allowing it to be re-executed multiple times.
 
-:code:`Kokkos::Graph` is a powerful way of describing workload dependencies. It is also a good opportunity to present all workloads
+:cppkokkos:`Kokkos::Graph` is a powerful way of describing workload dependencies. It is also a good opportunity to present all workloads
 at once to the driver, and allow some optimizations [ref].
 
 .. note::
@@ -16,18 +16,18 @@ at once to the driver, and allow some optimizations [ref].
 
 For small workloads that need to be sumitted several times, it might save you some overhead [reference to some presentation / paper].
 
-:code:`Kokkos::Graph` is specialized for some backends:
+:cppkokkos:`Kokkos::Graph` is specialized for some backends:
 
-* :code:`Cuda`: [ref to vendor doc]
-* :code:`HIP`: [ref to vendor doc]
-* :code:`SYCL`: [ref to vendor doc] -> https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
+* :cppkokkos:`Cuda`: [ref to vendor doc]
+* :cppkokkos:`HIP`: [ref to vendor doc]
+* :cppkokkos:`SYCL`: [ref to vendor doc] -> https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
 
 For other backends, Kokkos provides a defaulted implementation [ref to file].
 
 Philosophy
 ----------
 
-As mentioned earlier, the :code:`Kokkos::Graph` is first defined, and then executed. In fact, before the graph can be executed,
+As mentioned earlier, the :cppkokkos:`Kokkos::Graph` is first defined, and then executed. In fact, before the graph can be executed,
 it needs to be *instantiated*.
 
 During the *instantiation* phase, the topology of the graph is **locked**, and an *executable graph* is created.
@@ -40,53 +40,23 @@ In short, we have 3 phases:
 
 "Splitting command construction from execution is a proven solution." (https://www.iwocl.org/wp-content/uploads/iwocl-2023-Ewan-Crawford-4608.pdf)
 
-Basic example
--------------
-
-This example showcases how three workloads can be organised as a :code:`Kokkos::Graph`.
-
-Workloads A and B are independent, but workload C needs the completion of A and B.
-
-.. code-block:: cpp
-
-    int main()
-    {
-        auto graph = Kokkos::Experimental::create_graph<Exec>([&](auto root) {
-            const auto node_A = root.then_parallel_for(...label..., ...policy..., ...body...);
-            const auto node_B = root.then_parallel_for(...label..., ...policy..., ...body...);
-            const auto ready  = Kokkos::Experimental::when_all(node_A, node_B);
-            const auto node_C = ready.then_parallel_for(...label..., ...policy..., ...body...);
-        });
-
-        for(int irep = 0; irep < nrep; ++irep)
-            graph.submit();
-    }
-
-Advanced example
-----------------
-
-To be done soon.
-
-References
-----------
-
-* https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
-* https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md
-* https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/
-
-
 Use cases
 ---------
 
 Diamond with closure, don't care about `exec`
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Create a simple diamond-like graph within a closure, no caring about execution space instances.
+Create a simple diamond-like graph within a closure, not caring too much about execution space instances.
 
 This use case demonstrates how a graph can be created from inside a closure, and how it could look like in the future.
 It is a very simple use case.
 
-Note that I'm not sure why we should support the closure anyway.
+.. note::
+
+    I'm not sure why we should support the closure anyway. I don't see the benefit of forcing the
+    user to create the whole graph in there.
+
+    See :ref:`no_root_node` for discussion.
 
 .. graphviz::
     :caption: Diamond topology
@@ -99,9 +69,9 @@ Note that I'm not sure why we should support the closure anyway.
     }
 
 .. code-block:: c++
-    :caption: Current pseudo-code
+    :caption: Current `Kokkos` pseudo-code.
 
-    auto graph = Kokkos::create_graph([&](const auto& root){
+    auto graph = Kokkos::create_graph([&](auto root){
         auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...);
 
         auto node_B = node_A.then_parallel_...(...label..., ...policy..., ...functor...);
@@ -113,9 +83,9 @@ Note that I'm not sure why we should support the closure anyway.
     graph.submit()
 
 .. code-block:: c++
-    :caption: P2300 (but really I don't like that because `graph` itself is already a *sender*)
+    :caption: *à la* P2300 (but really I don't like that because `graph` itself is already a *sender*).
 
-    auto graph = Kokkos::create_graph([&](const auto& root){
+    auto graph = Kokkos::create_graph([&](auto root){
         auto node_A = then(root, parallel_...(...label..., ...policy..., ...functor...));
 
         auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
@@ -129,7 +99,7 @@ Note that I'm not sure why we should support the closure anyway.
 Diamond, caring about `exec`
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Create a simple diamond-like graph, caring about execution space instances.
+Create a simple diamond-like graph, caring about execution space instances. No closure.
 
 This use case demonstrates how a graph can be created without a closure, and how it could look like in the future.
 It also focuses on where steps occur.
@@ -147,9 +117,9 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel
     }
 
 .. code-block:: c++
-    :caption: Current pseudo-code
+    :caption: Current `Kokkos` pseudo-code.
 
-    auto graph = Kokkos::create_graph(exec_A, [&](const auto& root){});
+    auto graph = Kokkos::create_graph(exec_A, [&](auto root){});
     auto root  = Kokkos::Impl::GraphAccess::create_root_node_ref(graph);
 
     auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...);
@@ -161,19 +131,17 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel
 
     graph.instantiate();
     exec_A.fence("The graph might make some async to-device copies.");
+
     graph.submit(exec_B);
 
 .. code-block:: c++
-    :caption: P2300 + defer when Kokkos performs internal async to-device copies
+    :caption: *à la* P2300, deferring the internal async to-device copies performed by `Kokkos` to the `instantiate` step.
 
-    // Step 1: define topology (no execution space instance required)
+    // Step 1: define graph topology (note that no execution space instance is required).
     auto graph = Kokkos::create_graph<execution_space>();
 
     auto node_A = then(graph, parallel_...(...label..., ...policy..., ...functor...));
 
-    // what happens to an exec space instance passed to the policy ? is it used somehow or just ignored ?
-    // when dispatching the driver to global memory, what exec space instance is used for the async copies ?
-
     auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
     auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
 
@@ -186,15 +154,17 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel
     // Step 3: execute
     graph.submit(exec_B)
 
-No "root" node
-~~~~~~~~~~~~~~
+.. _no_root_node:
 
-Currently, the :code:`Kokkos::Graph` would expose to the user a "root node" concept that is not needed
+To root or not to root?
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Currently, the :cppkokkos:`Kokkos::Graph` API exposes to the user a "root node" concept that is not strictly needed
 by any backend (but might be needed by the default implementation that works with *sinks*).
 
-The "root node" might be confusing. It sould not appear in the API for 2 reasons:
+I think the "root node" might be confusing. IMO, it should not appear in the API for 2 reasons:
 
-1. It can be misleading, as the user might think it's necessary though I think it's an artifact of how :code:`Kokkos::Graph`
+1. It can be misleading, as the user might think it's necessary, though I think it's an artifact of how :cppkokkos:`Kokkos::Graph`
    is currently implemented for graph construction, and because of the *sink*-based defaulted implementation.
 2. With P2300, it's clear that *root* is an empty useless sender that can be thrown away at compile time.
 
@@ -208,15 +178,15 @@ The "root node" might be confusing. It sould not appear in the API for 2 reasons
     }
 
 .. code-block:: c++
-    :caption: P2300
+    :caption: *à la* P2300.
 
-    auto graph = construct_graph();
+    auto graph = Kokkos::construct_graph();
 
-    auto A1 = then(graph, ...);
-    auto A2 = then(graph, ...);
-    auto A3 = then(graph, ...);
+    auto A1 = Kokkos::then(graph, Kokkos::parallel_...(...));
+    auto A2 = Kokkos::then(graph, Kokkos::parallel_...(...));
+    auto A3 = Kokkos::then(graph, Kokkos::parallel_...(...));
 
-    auto B = then(when_all(A1, A2, A3), ...);
+    auto B = Kokkos::then(Kokkos::when_all(A1, A2, A3), Kokkos::parallel_...(...));
 
 Complex DAG topology
 ~~~~~~~~~~~~~~~~~~~~
@@ -234,13 +204,13 @@ Any complex-but-valid DAG topology should work.
         A2 -> B1;
         A2 -> B3;
         A3 -> B4;
-        
+
         B1 -> C1;
         B3 -> C1;
-        
+
         B2 -> C2;
         B4 -> C2;
-        
+
         // Enfore ordering of nodes with invisible edges.
         {
             rank = same;
@@ -255,59 +225,58 @@ Changing scheduler
 
 This is the purpose of PR https://github.com/kokkos/kokkos/pull/7249, and should be further documented.
 
-Towards https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html#design-sender-adaptor-starts_on.
+This is a step towards https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html#design-sender-adaptor-starts_on.
 
 .. code-block:: c++
+    :caption: *à la* P2300.
 
-    auto graph = construct()
-
-    auto node_1 = ...
+    // Step 1: construct.
+    auto graph = Kokkos::construct_graph();
 
+    auto node_1 = Kokkos::then(graph, ...);
     ...
 
+    // Step 2: instantiate.
     graph.instantiate();
 
+    // Step 3: execute, execute, and execute again.
     graph.submit(exec_A);
-
     ...
-
     graph.submit(exec_C);
-
     ...
-
     graph.submit(exec_D);
 
 Interoperability
 ~~~~~~~~~~~~~~~~
 
-Why interoperability matters (helps adoption of :code:`Kokkos::Graph`, extensibility, corner cases):
+Why interoperability matters (helps adoption of :cppkokkos:`Kokkos::Graph`, extensibility, corner cases):
 
-1. Attract users that already use some backend graph (*e.g.* `cudaGraph_t`) towards `Kokkos`. It helps them transition smoothly.
-2. Help user integrate backend-specific graph capabilities that are not part of the :code:`Kokkos::Graph` API for whatever reason.
+1. Attract users that already use some backend graph (*e.g.* :code:`cudaGraph_t`) towards `Kokkos`. It helps them transition smoothly.
+2. Help users integrate backend-specific graph capabilities that are not part of the :cppkokkos:`Kokkos::Graph` API for whatever reason.
 
 Since `Kokkos` might run some stuff linked to its internals at *instantiation* stage, and since in PR https://github.com/kokkos/kokkos/pull/7240
 we decided to ensure that before the submission, the graph needs to be instantiated in `Kokkos`, interoperability implies that the user
-passes through `Kokkos` for both *instantiation* and *submission*.
+relies on `Kokkos` for both *instantiation* and *submission*.
 
 .. graphviz::
-    :caption: Dark nodes/edges are added through :code:`Kokkos::Graph`.
+    :caption: Black nodes/edges are added through the :cppkokkos:`Kokkos::Graph` API; the salmon-colored ones are pre-existing.
 
     digraph interoperability {
 
         A[color=darksalmon];
-        
+
         B1[color=darksalmon];
         B2[color=darksalmon];
         B3[color=darksalmon];
-        
+
         C3[color=darksalmon];
 
         A -> B1[color=darksalmon];
         A -> B2[color=darksalmon];
         A -> B3[color=darksalmon];
-        
+
         B3 -> C3[color=darksalmon];
-        
+
         // Enfore ordering of nodes with invisible edges.
         {
             rank = same;
@@ -315,50 +284,102 @@ passes through `Kokkos` for both *instantiation* and *submission*.
             B1 -> B2 -> B3 ;
             rankdir = LR;
         }
-        
+
         B1 -> C1;
         B2 -> C1;
-        
+
         C1 -> D1;
         C3 -> D1;
-    } 
+    }
 
 .. code-block:: c++
-    :caption: interoperability pseudo-code P2300
+    :caption: Interoperability pseudo-code *à la* P2300.
 
+    // The user starts creating their graph with a backend API for some reason.
     cudaGraph_t graph;
     cudaGraphCreate(&graph, ...);
 
     cudaGraphNode_t A, B1, B2, B3, C3;
     ... create kernel nodes and add dependencies ...
 
-    auto kokkos_graph = construct(graph);
+    // But at some point, they want interoperability with Kokkos.
+    auto kokkos_graph = Kokkos::construct_graph(graph);
 
-    auto C1 = then(when_all(B1, B2), ...);
-    auto D1 = then(when_all(C1, C3), ...);
+    auto C1 = Kokkos::then(Kokkos::when_all(B1, B2), ...);
+    auto D1 = Kokkos::then(Kokkos::when_all(C1, C3), ...);
 
+    // The user is now bound to Kokkos for instantiation and submission.
     kokkos_graph.instantiate();
     kokkos_graph.submit();
 
 Graph update
 ~~~~~~~~~~~~
 
-From reading `Cuda`, `HIP` and `SYCL` documentations, all have some *executable graph update* mechanisms.
+According to the :cppkokkos:`Cuda`, :cppkokkos:`HIP` and :cppkokkos:`SYCL` documentation, all three backends provide some *executable graph update* mechanism.
 
-For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`, not in `HIP` yet) can support complex graphs that might slightly change from one submission to another.
+For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`) can support complex graphs that might slightly change from one submission to another.
 
     Updates to a graph will be scheduled after any in-flight executions of the same graph and will not affect previous submissions of the same graph.
     The user is not required to wait on any previous submissions of a graph before updating it.
 
-As the topology is fixed, we can only reasonably update kernel parameters.
+As the topology is fixed, we can only reasonably update kernel parameters or skip a node.
+
+.. graphviz::
+    :caption: Some iterative loop that needs to seed under some condition (to be enhanced).
+
+    digraph graph_update {
+
+        S[label="start", shape=diamond];
+
+        A[label="seed"];
+        B[label="compute"];
+        C[label="solve"];
+        
+        S -> A[color=green];
+        
+        A -> B[color=green];
+        
+        B -> C;
+        
+        C -> S;
+        
+        S -> B[color="red"];
+
+    }
+
+Iterative processes
+~~~~~~~~~~~~~~~~~~~
 
-Iterative process
------------------
+Plenty of opportunities for :cppkokkos:`Kokkos::Graph` to lean in:
 
-- iterative solver (our assembly case)
+- iterative solver
 - line search in optimization
+- you name it
+
+Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images/1/1a/Kokkos-graphs-presentation.pdf:
+
+.. graphviz::
+    :caption: Two `AXPBY` followed by a dot product.
+
+    digraph axpby {
+        A[label="axpby"];
+        B[label="axpby"];
+        C[label="dotp"];
+        A->C;
+        B->C;
+    }
+
+.. literalinclude:: Graph.axpby.kokkos.vanilla.cpp
+    :language: c++
+    :caption: Vanilla `Kokkos`.
 
+.. literalinclude:: Graph.axpby.kokkos.graph.cpp
+    :language: c++
+    :caption: Current :cppkokkos:`Kokkos::Graph`.
 
+.. literalinclude:: Graph.axpby.kokkos.graph.p2300.cpp
+    :language: c++
+    :caption: *à la* P2300.
 
 They also use graphs...
 -----------------------
@@ -366,11 +387,29 @@ They also use graphs...
 * `PyTorch` https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
 * `GROMACS` https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/
 
+Design choices
+--------------
+
+Questions we need to answer before going further in the :cppkokkos:`Graph` refactor.
+
+Dispatching
+~~~~~~~~~~~
 
-Homework
+- Do we allow node policies to have a user-provided execution space instance ?
+- When does `Kokkos` perform its to-device dispatching (*e.g.* to global memory) ?
 
-- what does Kokkos during dispatching ? (HIP CUDA SYCL) Execution space instance from the policy, used or ignored ?
-- for each example 3 columns how to write it in CUDA SYCL P2300 Kokkos
-- développer l'update
-- essayer de démontrer qu'on peut écrire un seul code, et dire si on veut que ce soit un graph ou pas
-  (why it matters: write single source code , kokkos premise 'single source code')
\ No newline at end of file
+Write a single source code, but allow skipping backend graph
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We should be able to write a single source code and decide if we want the graph to map to the backend graph or just
+execute nodes.
+
+This would greatly benefit adoption, and respect `Kokkos` single source code promise.
+
+References
+----------
+
+* https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
+* https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md
+* https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/
+* https://hihat.opencommons.org/images/1/1a/Kokkos-graphs-presentation.pdf

From 9a268a3a19ec74c3e7ab044cc80057ce82d63659 Mon Sep 17 00:00:00 2001
From: romintomasetti <romin.tomasetti@gmail.com>
Date: Fri, 30 Aug 2024 03:51:05 +0000
Subject: [PATCH 4/6] cleaning stuff

---
 .../core/Graph.axpby.kokkos.graph.p2300.cpp   | 62 ++++++++++++++++---
 docs/source/API/core/Graph.rst                | 15 +++--
 2 files changed, 63 insertions(+), 14 deletions(-)

diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
index 3d129d2a4..c4e7015bf 100644
--- a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
+++ b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
@@ -1,15 +1,57 @@
-auto graph = Kokkos::construct_graph();
+/**
+ * This is some external library function to which we pass a sender.
+ * The sender might either be a regular @c Kokkos execution space instance
+ * or a graph-node-sender-like stuff.
+ * The asynchronicity within the function will either be provided by the graph
+ * or must be dealt with in the regular way (creating many space instances).
+ */
+sender library_stuff(sender start)
+{
+    sender auto exec_A, exec_B;
+    
+    if constexpr (Kokkos::is_a_sender<sender>) {
+        exec_A = exec_B = start;
+    } else {
+        std::tie(exec_A, exec_B) = Kokkos::partition_space(start, 1, 1);
+    }
 
-auto node_xpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{x, y, alpha, beta}));
-auto node_zpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{z, y, gamma, beta}));
+    auto node_xpy = Kokkos::parallel_for(exec_A, policy(N), MyAxpby{x, y, alpha, beta});
+    auto node_zpy = Kokkos::parallel_for(exec_B, policy(N), MyAxpby{z, y, gamma, beta});
 
-auto node_dotp = Kokkos::then(
-    Kokkos::when_all(node_xpy, node_zpy),
-    Kokkos::parallel_reduce(N, MyDotp{x, z}, dotp)
-);
+    /// No need to fence, because @c Kokkos::when_all will take care of that.
+    return Kokkos::parallel_reduce(
+        Kokkos::when_all(node_xpy, node_zpy),
+        policy(N),
+        MyDotp{x, z}, dotp
+    );
+}
 
-graph.instantiate();
+int main()
+{
+    scheduler auto exec = Kokkos::DefaultExecutionSpace{};
 
-graph.submit(exec_A);
+    /**
+    * Start the chain of nodes with an "empty" node, similar to @c std::execution::schedule.
+    * Under the hood, it creates the @c Kokkos::Graph.
+    * All nodes created from this sender will share a handle to the underlying @c Kokkos::Graph.
+    */
+    sender auto start = Kokkos::construct_empty_node(exec);
 
-exec_A.fence();
+    sender auto seeding = Kokkos::parallel_for(start, policy(N), SomeWork{...});
+
+    /// Pass our chain to some external library function.
+    sender auto subgraph = library_stuff(seeding);
+
+    sender auto last_action = Kokkos::parallel_scan(subgraph, policy(N), ScanFunctor{...});
+
+    /// @c Kokkos has a free function for instantiating the underlying graph.
+    /// All nodes connected to the same handle are notified that they cannot be used as senders anymore,
+    /// because they are locked in an instantiated graph.
+    sender auto executable_whatever = Kokkos::Graph::instantiate(last_action);
+
+    /// Submission is a no-op if the received sender is an execution space instance.
+    /// Otherwise, it submits the underlying graph.
+    Kokkos::Graph::submit(my_exec, executable_whatever)
+
+    my_exec.fence();
+}
diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst
index 42beb0860..9a4283f1a 100644
--- a/docs/source/API/core/Graph.rst
+++ b/docs/source/API/core/Graph.rst
@@ -377,10 +377,6 @@ Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images
     :language: c++
     :caption: Current :cppkokkos:`Kokkos::Graph`.
 
-.. literalinclude:: Graph.axpby.kokkos.graph.p2300.cpp
-    :language: c++
-    :caption: *à la* P2300.
-
 They also use graphs...
 -----------------------
 
@@ -406,6 +402,17 @@ execute nodes.
 
 This would greatly benefit adoption, and respect `Kokkos` single source code promise.
 
+Design we would like to agree on
+--------------------------------
+
+This is the kind of design we would like to converge on (roughly conforming to P2300).
+
+Might be worth reading: https://docs.nvidia.com/hpc-sdk/archive/23.9/pdf/hpc239c++_par_alg.pdf.
+
+.. literalinclude:: Graph.axpby.kokkos.graph.p2300.cpp
+    :language: c++
+    :caption: *à la* P2300.
+
 References
 ----------
 

From 46e55537356a4f3cf1e97e9832af927f866fd9c0 Mon Sep 17 00:00:00 2001
From: romintomasetti <romin.tomasetti@gmail.com>
Date: Fri, 30 Aug 2024 21:46:04 +0200
Subject: [PATCH 5/6] wip

---
 .../core/Graph.axpby.kokkos.graph.p2300.cpp   | 54 +++++++++++++------
 docs/source/API/core/Graph.rst                | 50 +++++++++--------
 docs/source/API/core/Graph.update.tikz        | 38 +++++++++++++
 docs/source/conf.py                           |  1 +
 4 files changed, 104 insertions(+), 39 deletions(-)
 create mode 100644 docs/source/API/core/Graph.update.tikz

diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
index c4e7015bf..f8182024a 100644
--- a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
+++ b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
@@ -12,12 +12,18 @@ sender library_stuff(sender start)
     if constexpr (Kokkos::is_a_sender<sender>) {
         exec_A = exec_B = start;
     } else {
-        std::tie(exec_A, exec_B) = Kokkos::partition_space(start, 1, 1);
+        /// How do we partition ?
+        exec_A = start;
+        exec_B = Kokkos::partition_space(start, 1);
     }
 
     auto node_xpy = Kokkos::parallel_for(exec_A, policy(N), MyAxpby{x, y, alpha, beta});
     auto node_zpy = Kokkos::parallel_for(exec_B, policy(N), MyAxpby{z, y, gamma, beta});
 
+    /// In the non-graph case, how do we enforce that e.g. node_zpy is done and launch
+    /// the parallel-reduce on the same execution space instance as node_xpy without writing
+    /// any additional piece of code ?
+
     /// No need to fence, because @c Kokkos::when_all will take care of that.
     return Kokkos::parallel_reduce(
         Kokkos::when_all(node_xpy, node_zpy),
@@ -28,30 +34,44 @@ sender library_stuff(sender start)
 
 int main()
 {
-    scheduler auto exec = Kokkos::DefaultExecutionSpace{};
+    /// A @c Kokkos execution space instance is a context (i.e. a source
+    /// of asynchronous execution such as a thread pool or a GPU stream)
+    const Kokkos::DefaultExecutionSpace context {};
+
+    /// A scheduler is a lightweight handle to an execution context.
+    stdexec::scheduler auto scheduler = context.get_scheduler();
 
     /**
-    * Start the chain of nodes with an "empty" node, similar to @c std::execution::schedule.
-    * Under the hood, it creates the @c Kokkos::Graph.
-    * All nodes created from this sender will share a handle to the underlying @c Kokkos::Graph.
-    */
-    sender auto start = Kokkos::construct_empty_node(exec);
+     * Start the chain of nodes with an "empty" node, similar to @c std::execution::schedule.
+     * Under the hood, it creates the @c Kokkos::Graph.
+     * All nodes created from this sender will share a handle to the underlying @c Kokkos::Graph.
+     */
+    stdexec::sender auto start = Kokkos::Experimental::Graph::schedule(scheduler);
 
-    sender auto seeding = Kokkos::parallel_for(start, policy(N), SomeWork{...});
+    /// @c Kokkos::parallel_for would behave much like @c std::execution::bulk.
+    stdexec::sender auto my_work = Kokkos::Experimental::Graph::parallel_for(start, policy(N), ForFunctor{...});
 
     /// Pass our chain to some external library function.
-    sender auto subgraph = library_stuff(seeding);
+    stdexec::sender auto subgraph = library_stuff(my_work);
 
-    sender auto last_action = Kokkos::parallel_scan(subgraph, policy(N), ScanFunctor{...});
+    /// Add some work again.
+    stdexec::sender auto my_other_work = Kokkos::Experimental::Graph::parallel_scan(subgraph, policy(N), ScanFunctor{...});
 
-    /// @c Kokkos has a free function for instantiating the underlying graph.
-    /// All nodes connected to the same handle are notified that they cannot be used as senders anymore,
-    /// because they are locked in an instantiated graph.
-    sender auto executable_whatever = Kokkos::Graph::instantiate(last_action);
+    /// @c Kokkos::Graph has a free function for instantiating the underlying graph.
+    /// All nodes connected to the same handle (i.e. that are on the same chain) are notified
+    /// that they cannot be used as senders anymore,
+    /// because they are locked in an instantiated graph. In other words, the chain is a DAG, and it
+    /// cannot change anymore.
+    stdexec::sender auto executable_chain = Kokkos::Graph::instantiate(my_other_work);
 
-    /// Submission is a no-op if the received sender is an execution space instance.
+    /// Submission is a no-op if the passed sender is a @c Kokkos execution space instance.
     /// Otherwise, it submits the underlying graph.
-    Kokkos::Graph::submit(my_exec, executable_whatever)
+    Kokkos::Graph::submit(scheduler, executable_chain);
+
+    ::stdexec::sync_wait(scheduler);
 
-    my_exec.fence();
+    /// Submit the chain again, using another scheduler.
+    /// In essence, what @c Kokkos::Graph::submit can do is quite similar to what
+    /// @c std::execution::starts_on does. It allows the sender to be executed elsewhere.
+    Kokkos::Graph::submit(another_scheduler, executable_chain);
 }
diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst
index 9a4283f1a..1adde3f55 100644
--- a/docs/source/API/core/Graph.rst
+++ b/docs/source/API/core/Graph.rst
@@ -312,6 +312,21 @@ relies on `Kokkos` for both *instantiation* and *submission*.
     kokkos_graph.instantiate();
     kokkos_graph.submit();
 
+Interweaving
+~~~~~~~~~~~~
+
+This is the case where a user does not use :cppkokkos:`Graph` directly, but calls some external library function that does.
+
+In this case, :code:`submit` really needs to be passed an execution space instance to ensure that the graph
+is nicely inserted into the user's kernel queues.
+
+Stated verbosely:
+
+    The stream-based (execution space instance based) approach can co-exist in the same code with
+    the graph-based approach, thereby making :cppkokkos:`Graph` a very attractive abstraction.
+    A use case in which "at the global level" the code uses a stream-based approach can play well with
+    some (possibly external) calls that use :cppkokkos:`Graph` under the hood.
+
 Graph update
 ~~~~~~~~~~~~
 
@@ -324,28 +339,9 @@ For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`) can su
 
 As the topology is fixed, we can only reasonably update kernel parameters or skip a node.
 
-.. graphviz::
-    :caption: Some iterative loop that needs to seed under some condition (to be enhanced).
-
-    digraph graph_update {
-
-        S[label="start", shape=diamond];
-
-        A[label="seed"];
-        B[label="compute"];
-        C[label="solve"];
-        
-        S -> A[color=green];
-        
-        A -> B[color=green];
-        
-        B -> C;
-        
-        C -> S;
-        
-        S -> B[color="red"];
-
-    }
+.. tikz:: Some iterative loop that needs to seed under some condition, as well as a library call for compute.
+   :include: Graph.update.tikz
+   :libs: backgrounds, calc, positioning, shapes
 
 Iterative processes
 ~~~~~~~~~~~~~~~~~~~
@@ -377,6 +373,16 @@ Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images
     :language: c++
     :caption: Current :cppkokkos:`Kokkos::Graph`.
 
+Why/when should I choose :cppkokkos:`Kokkos::Graph`
+---------------------------------------------------
+
+Two obvious but different cases:
+
+#. A few kernels, probably small, easily manually stream-managed, submitted several times. Then using :cppkokkos:`Graph`
+   will help you reduce kernel launch overheads. TODO: link to A1 A2 A3 B graph.
+#. A lot of kernels, very complex DAG, probably not worth thinking too much about how they could be efficiently orchestrated
+   if :cppkokkos:`Graph` guarantees that it will take care of that for you.
+
 They also use graphs...
 -----------------------
 
diff --git a/docs/source/API/core/Graph.update.tikz b/docs/source/API/core/Graph.update.tikz
new file mode 100644
index 000000000..3f679e444
--- /dev/null
+++ b/docs/source/API/core/Graph.update.tikz
@@ -0,0 +1,38 @@
+\tikzset{
+    decide/.style = {draw, shape = diamond, fill = red!25, aspect = 2, inner sep = 1pt},
+    endpoint/.style = {draw, circle, fill = black!20, inner sep = 1pt},
+    yesorno/.style = {rectangle,draw,fill=white,inner sep=1pt},
+    work/.style = {rectangle, draw, fill = orange!25},
+    % We need to enforce a white background for folks in dark mode.
+    background rectangle/.style={fill=white},
+    show background rectangle
+}
+\node[endpoint] (start) {Start};
+
+\node[decide,below=0.5cm of start] (decision) { Seeding ?};
+
+\node[work, below=1cm of decision] (seeding) {Seeding};
+
+\node[work, below=0.5cm of seeding, minimum height=2cm, minimum width = 2cm] (compute) {Compute};
+
+\node[work, below=1cm of compute] (solve) {Solve};
+
+\node[decide,right=0.5cm of solve] (convergence) {Convergence ?};
+
+\node[endpoint, right=1cm of convergence] (end) {End};
+
+\draw [-stealth,solid](start) -- (decision.north);
+
+\draw [-stealth,solid](decision) -- (seeding.north) node[midway, yesorno] {yes};
+
+\draw [-stealth,solid](seeding)--(compute.north);
+
+\draw [-stealth,solid](compute)--(solve.north);
+
+\draw [-stealth,solid](solve)--(convergence.west);
+
+\draw [-stealth,solid](convergence.east)--(end.west) node[midway, yesorno] {yes};
+
+\draw [-stealth,solid](convergence.north) -- node[midway, yesorno]{no} (convergence.north|-decision.east) -- (decision.east);
+
+\draw [-stealth,solid](decision.west) -- ({$(decision.west)-0.25*(convergence.north)+0.25*(decision.east)$}|-decision.west) |- node[near start, yesorno] {no} (compute.west);
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 583e7a044..524bc8225 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -40,6 +40,7 @@
               "sphinx.ext.intersphinx",
               "sphinx_copybutton",
               "sphinx_design",
+              "sphinxcontrib.tikz",
               "cppkokkos"]
 
 # Add any paths that contain templates here, relative to this directory.

From 49ba326bbe7f7b9de75f7a23b5d0576f31e77a6e Mon Sep 17 00:00:00 2001
From: romintomasetti <romin.tomasetti@gmail.com>
Date: Tue, 1 Apr 2025 16:07:48 +0000
Subject: [PATCH 6/6] wip

---
 docs/source/API/core/Graph.old.rst | 166 +++++++++++++++++++++++++++++
 docs/source/API/core/Graph.rst     |  98 +++++++++++++++++
 2 files changed, 264 insertions(+)
 create mode 100644 docs/source/API/core/Graph.old.rst

diff --git a/docs/source/API/core/Graph.old.rst b/docs/source/API/core/Graph.old.rst
new file mode 100644
index 000000000..191fab0ae
--- /dev/null
+++ b/docs/source/API/core/Graph.old.rst
@@ -0,0 +1,166 @@
+# What are the semantics of `Kokkos::Graph` ?
+
+What are the allowed semantics of `Kokkos::Graph` ?
+
+Questions:
+
+1. Do we document the allowed semantics for which the user gets covered by `Kokkos` or do we try to enforce the semantics with object states and stuff ?
+2. What about the execution space instance ? It seems that `submit` should allow one to be passed.
+3. Multi-GPU.
+4. Runtime aggregate node is still not possible, see https://github.com/kokkos/kokkos/issues/6060.
+5. Missing documentation online ?
+
+It should allow functionalities listed in https://www.olcf.ornl.gov/wp-content/uploads/2021/10/013_CUDA_Graphs.pdf, slide 4.
+
+## Usage
+
+How would people use `Kokkos::Graph` ?
+
+### The simplest usage I could come up with
+
+The graph is known in advance (at compile time) and can be created in the lambda (*i.e.* not using hidden `impl` stuff).
+Once created, the user expects that the graph can be re-submitted several times. The user does not want to add/remove nodes once it has been submitted for the first time (no fancy stuff).
+The user does not care about streams whatsoever.
+
+1. Create some `data` in a view, and a `functor` to act on it.
+2. Create the `graph` and add a parallel-for `node` using the `functor` acting on `data`.
+3. Submit the graph as many times as you want.
+
+```c++
+template <typename Mem>
+struct Functor
+{
+    Kokkos::View<int*, Mem> data;
+
+    template <std::integral T>
+    KOKKOS_FUNCTION
+    void operator()(const T index) const { ... <data> ... };
+};
+
+int main()
+{
+    const Kokkos::View<int*, Exec> data(...);
+
+    auto graph = Kokkos::Experimental::create_graph<Exec>([&](auto root) {
+        [[maybe_unused]] const auto node = root.then_parallel_for(0, ..., Functor<Mem>{ .data = data });
+    });
+
+    graph.submit();
+}
+```
+
+### More advanced usage
+
+The graph is not known in advance and cannot be easily/prettily created in the lambda (*e.g.* the user attaches nodes dynamically depending on some complex setup like partitioning).
+Once created, the user still expects that the graph can be re-submitted several times.
+The user cares about streams for orchestration.
+
+We need to use some `impl` stuff for such a case.
+
+```c++
+/**
+ * Create the graph.
+ *
+ * 1. Damien said there are other ways to do that w/o using Impl, but I could not find them. It seems that TestGraph.hpp only uses
+ *    the Kokkos::Experimental::create_graph that takes a closure.
+ *    It seems that 'construct_graph' should somehow be promoted to the public API. Is there any reason not to do so?
+ * 2. The execution space instance is not used until the executable graph is launched with 'cudaGraphLaunch'.
+ *    Therefore, it's questionable whether it should be part of the Kokkos::Graph state or not (it's an Impl detail though).
+ */
+auto graph = Kokkos::Impl::GraphAccess::construct_graph(exec_a);
+auto root  = Kokkos::Impl::GraphAccess::create_root_ref(graph);
+
+/**
+ * Fill the graph with nodes, according to a complex DAG topology.
+ * The nodes might be added conditionally (conditions might change at runtime, e.g. MPI partitioning).
+ *
+ *       ROOT
+ *      /    \
+ *     N11    N12
+ *     |       | \
+ *     N21    N22 N23
+ *     \      /   /
+ *      \    /   /
+ *         N31
+ *
+ * @todo Add @c if nodes. See also https://developer.nvidia.com/blog/dynamic-control-flow-in-cuda-graphs-with-conditional-nodes/.
+ */
+std::vector<generic_node_t> N31_predecessors;
+
+if(condition_branch_1) // branch 1
+{
+    auto N11 = root.then_parallel_for(...label..., ...policy..., ...body...);
+    auto N21 = N11.then_parallel_for(...label..., ...policy..., ...body...);
+    N31_predecessors.push_back(N21);
+}
+
+if(condition_branch_2) // branch 2
+{
+    auto N12 = root.then_parallel_for(...name..., ...policy..., ...body...);
+    auto N22 = N12.then_parallel_for(...name..., ...policy..., ...body...);
+    auto N23 = N12.then_parallel_for(...name..., ...policy..., ...body...);
+    N31_predecessors.push_back(N22);
+    N31_predecessors.push_back(N23);
+}
+
+//! This is currently impossible. See also https://github.com/kokkos/kokkos/issues/6060.
+auto N31_ready = Kokkos::Experimental::when_all(N31_predecessors);
+auto N31 = N31_ready.then_parallel_for(...name..., ...policy..., ...body...);
+
+/**
+ * The topology of the graph has been defined.
+ * It now has to be instantiated.
+ * According to:
+ *  - https://www.olcf.ornl.gov/wp-content/uploads/2021/10/013_CUDA_Graphs.pdf (slide 9)
+ *  - https://developer.nvidia.com/blog/employing-cuda-graphs-in-a-dynamic-environment/
+ * the topology cannot change once the graph has been instantiated,
+ * but the nodes parameters may be updated (cudaGraphExecUpdate).
+ */
+graph.instantiate(...);
+
+/**
+ * Launch the graph on some execution space instance.
+ * Re-launch onto another execution space instance.
+ * According to cudaGraphLaunch, a stream is allowed and it makes sense.
+ *
+ * @todo Check for @c HIP and @c SYCL.
+ */
+graph.submit(exec_b);
+graph.submit(exec_c);
+```
+
+## What to do, prioritizing
+
+### Promote `construct_graph` to the public API
+
+This allows for advanced use cases that do not fit well with the current closure-based construction API.
+
+Retrieving the root node should also be promoted to the public API.
+
+### `Kokkos::Graph::instantiate`
+
+**Add** `Kokkos::Graph::instantiate` to the public API.
+
+This allows the user to control when the executable graph gets instantiated.
+
+It can be called only once.
+
+Adding nodes after instantiation is prohibited.
+
+### `Kokkos::Graph::submit`
+
+**Change** the public API to accept an execution space instance.
+
+Note that it is simply used to order the graph launch into some work queue.
+
+### Remove the execution space instance from `Kokkos::Graph` state
+
+The title says it all.
+
+### Allow dynamic aggregate node
+
+**Add** a `Kokkos::Experimental::when_all` that allows for a vector/list of nodes to be passed.
+
+## Go further
+
+We might want to get the design of `Kokkos::Graph` close to `std::execution` (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html).
diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst
index 1adde3f55..d6ad76e7d 100644
--- a/docs/source/API/core/Graph.rst
+++ b/docs/source/API/core/Graph.rst
@@ -339,6 +339,12 @@ For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`) can su
 
 As the topology is fixed, we can only reasonably update kernel parameters or skip a node.
 
+.. note::
+
+    Todo: for the solve step, take Emil's work and say that at compile time we could not reasonably know what its graph
+    would look like. But our own assembly graph could be determined at compile time (knowing the system at stake,
+    how we partition it and so on -> still a burden).
+
 .. tikz:: Some iterative loop that needs to seed under some condition, as well as a library call for compute.
    :include: Graph.update.tikz
    :libs: backgrounds, calc, positioning, shapes
@@ -373,6 +379,18 @@ Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images
     :language: c++
     :caption: Current :cppkokkos:`Kokkos::Graph`.
 
+Runtime graph
+~~~~~~~~~~~~~
+
+A graph cannot always be known at compile time. Examples of programs whose control flow cannot be fully determined at compile time:
+
+- MPI partitioning
+- BLAS routines and system size
+- you name it
+
+Therefore, we must support both pure compile time graphs and runtime graphs.
+This implies type erasure, which apparently is not available by default in `std::execution`.
+
 Why/when should I choose :cppkokkos:`Kokkos::Graph`
 ---------------------------------------------------
 
@@ -426,3 +444,83 @@ References
 * https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md
 * https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/
 * https://hihat.opencommons.org/images/1/1a/Kokkos-graphs-presentation.pdf
+
+
+*********************************
+
+Write our graph examples in P2300 -> compile + test
+
+Then with the current Kokkos graph -> compile + test
+
+Then as a draft of the Kokkos graph *à la* P2300 -> as a guideline (where we want to go)
+
+* Example from Emil Geleleens' master's thesis (TFE): Solver for matrix L -> back substitution for lower triangular
+  matrix -> dependencies between unknowns to speed things up -> creates a graph with "as many nodes
+  as there are unknowns" (with variants, but either way we get many nodes) in the input matrix L
+  -> his work is not about efficiently launching this graph and in fact he did it with manual kernel launches
+  -> could be nice to use Kokkos graph to focus on other things
+  -> there might be some sweet spot above which the backend graph makes sense (cost of instantiate and launch)
+
+=> private repo "uliegecsm/kokkos-graph-p2300"
+
+
+One blocker (https://docs.nvidia.com/cuda/pdf/CUSPARSE_Library.pdf): we would need to use graph capture
+to embed the cu solver into our graph...
+
+    Most of the cuSPARSE routines can be optimized by exploiting CUDA Graphs capture and
+    Hardware Memory Compression features.
+    More in details, a single cuSPARSE call or a sequence of calls can be captured by a CUDA
+    Graph and executed in a second moment. This minimizes kernels launch overhead and allows
+    the CUDA runtime to optimize the whole workflow. A full example of CUDA graphs capture
+    applied to a cuSPARSE routine can be found in cuSPARSE Library Samples - CUDA Graph.
+
+
+
+Meeting notes
+=============
+
+0. Do you know :cppkokkos:`Kokkos::Graph` ?
+
+   :cppkokkos:`Kokkos::Graph` is an abstraction of a DAG of asynchronous workloads that maps to a backend graph,
+   or to the defaulted implementation.
+
+   Advantages of the graph: asynchronous management is done by the backend driver, and launch overhead is reduced, especially
+   when submitting many times.
+
+   .. figure:: Graph.kokkos.3.paper.jpg
+
+     Example from Kokkos 3 paper.
+
+1. We want to refactor the public API of :cppkokkos:`Kokkos::Graph` so that it feels more like `std::execution` (P2300).
+
+   We could think of a graph (e.g. :cppkokkos:`Kokkos::Graph`) as a **multi-shot sender chain** (?).
+
+    .. code-block:: c++
+        :caption: Old way
+
+        child = parent.then_parallel_for(policy, body);
+
+    .. code-block:: c++
+        :caption: P2300-alike way
+
+        child = parallel_for(parent, policy, body);  // usual
+        child = parent | parallel_for(policy, body); // piping
+
+   This seems to be an easy step. A few wrappers could be used in a first step to "transport"
+   the P2300-alike way arguments to the old way (thereby keeping the `Kokkos::Graph` implementation
+   untouched).
+
+2. Deeper refactoring of :cppkokkos:`Kokkos::Graph`:
+
+   * Should the nodes of the graph be senders ? Or should `Kokkos` nodes and graph
+     be wrapped in an adaptor-like API to remain an implementation detail hidden to the user ?
+     "P2300 nodes" would then have handlers to their :cppkokkos:`Kokkos::Impl` (nodes and graph) counterparts.
+     How is this implemented in `HPX` ? When creating a sender, is there some under-the-hood implementation
+     class that maps to some `HPX` pre-existing internals ?
+   * Current :cppkokkos:`Kokkos::Graph` restrictions:
+      - All nodes are targeting the same backend :math:`\implies` only one scheduler type can be used.
+      - The chain cannot contain `transfer`, `starts_on`, and so on. The scheduling is left to `Kokkos` through
+        :cppkokkos:`Kokkos::Graph::submit(exec)`.
+
+
+https://accu.org/journals/overload/29/164/teodorescu/
\ No newline at end of file