Threading Model #388
masterleinad left a comment
I think the most interesting question still to answer is whether we serialize kernels on all backends for the same execution space instance.
| A multi-threaded program structured such that there is a *happens-before* relationship between each call to perform a *Fundamental Operation* will behave equivalently to a single-threaded program that performs the same sequence of *Fundamental Operations*. (Note: This is analogous to ``MPI_THREAD_SERIALIZED``) | ||
| .. Do we actually want to guarantee that every Fundamental Operation is serializing? Should that just mean that we don't require call sites to have *happens-before* relationships, or should they also internally create such *happens-before* relationships? I.e. that the calling threads *synchronize-with* each other at those points? |
That's a key question. My understanding is that we want to serialize parallel dispatch to the same execution space instance, but I don't think we want to promise anything with respect to data access outside of kernels.
| *Global Synchronization* creates a *happens-before* relationship between the completion of every *Fundamental Operation* on any *Execution Space Instance* that *happens-before* the *Global Synchronization* and the thread that performs the *Global Synchronization*. | ||
| .. Should the above actually be *synchronizes-with*? |
Is there really much of a difference when we talk about fence?
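For reference, the distinction in the C++ memory model can be sketched with a release/acquire pair in plain standard C++. *Synchronizes-with* relates one specific pair of operations (here the release store and the acquire load that reads it), while *happens-before* is the ordering derived from it, which also covers the plain write to `payload`. The function name `publish_and_observe` is hypothetical, and this says nothing about how a Kokkos fence is implemented.

```cpp
#include <atomic>
#include <thread>

// The producer writes plain data, then performs a release store.
// The consumer spins on an acquire load; once it reads `true`, the store
// synchronizes-with that load, so the write to `payload` happens-before
// the read below and no data race occurs.
int publish_and_observe() {
  int payload = 0;                  // plain, non-atomic data
  std::atomic<bool> ready{false};

  std::thread producer([&payload, &ready] {
    payload = 42;
    ready.store(true, std::memory_order_release);
  });

  while (!ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  int observed = payload;           // ordered after the producer's write

  producer.join();
  return observed;
}
```

From the caller's point of view only the resulting *happens-before* ordering is observable, which may be why the two terms look interchangeable when describing a fence.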
| * Managed Construction | ||
| Managed construction of a Kokkos View performs a *Memory Allocation*, potentially followed by a *Parallel Dispatch* to initialize the memory (depending on whether ``WithoutInitializing`` was passed), potentially followed by a *Synchronization* (if no execution space instance was passed, so that allocation and initialization *happen-before* any subsequent operation that may reference the ``View``'s memory). | ||
| .. Do we want that to be *Global Synchronization* or *Local Synchronization*? |
We effectively do a device-wide (or at least execution space-wide) synchronization at the moment, see https://github.com/kokkos/kokkos/blob/5d81422daea73f5a2a69771cc0dfafc19f785003/core/src/Cuda/Kokkos_CudaSpace.cpp#L160-L205. The intent is to make sure that memory can't be accessed before allocation is complete and thus it should be (IMHO) enough to fence the active execution space instance on the current thread.
| * *Initialization* | ||
| .. Not just Kokkos::init, but also whatever device-specific or thread-specific stuff we have Legion doing now | ||
| * *Finalization* | ||
| .. Ditto Initialization |
Backends can still only be initialized or finalized once, so I'm not sure it's worth mentioning initialization/finalization here. At the very least, we need to clarify what we mean (execution space instance initialization/finalization may be sensible).
| * *Data Access* | ||
| ``View::operator()``, to memory that is accessible from the host. | ||
Not quite sure if we want to promise anything about data access outside of kernels.
I think we have to, or else we can't suitably address either usage of unmanaged views or UVM.
| * Metadata Query | ||
| * Element Access | ||
| Element Access performs a Data Access operation. |
Not quite sure if we need these.
| Backend-Specific Details | ||
| ------------------------ | ||
| .. Local or Global synchronizations below? |
It might be enough to group backends into synchronous and asynchronous backends, clarifying that kernels submitted by multiple threads are serialized (if we decide to make that promise).
| * ``CUDA`` and ``HIP`` | ||
| * ``HPX`` | ||
We should talk more about parallel dispatch and the behavior of independent threads (without a happens-before relationship between them) accessing the same data.
Possibly also clarifying where we promise that dispatch implies fences (linking to API for parallel_for, parallel_reduce, parallel_scan).
TODO:
Document what semantics we actually have around use of multiple threads calling Kokkos
The foundational principles I think we have are that
`View::operator()` from host, and equivalent memory access in buffers that we `deep_copy` to/from)