diff --git a/.wordlist.txt b/.wordlist.txt index c35dfb045d..7bc0f65fa0 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -7,6 +7,8 @@ APUs AQL AXPY asm +Asynchronicity +Asynchrony asynchrony backtrace bfloat @@ -71,6 +73,7 @@ hipModule hipModuleLaunchKernel hipother HIPRTC +hyperthreading icc IILE iGPU @@ -91,7 +94,6 @@ iteratively Lapack latencies libc -libhipcxx libstdc lifecycle linearizing @@ -116,6 +118,7 @@ NDRange nonnegative NOP Numa +ns Nsight ocp omnitrace @@ -124,6 +127,7 @@ overindexing oversubscription overutilized parallelizable +pipelining parallelized pixelated pragmas @@ -172,6 +176,7 @@ ULP ULPs unintuitive UMM +uncoalesced unmap unmapped unmapping diff --git a/docs/data/understand/programming_model/cpu-gpu-comparison.drawio b/docs/data/understand/programming_model/cpu-gpu-comparison.drawio new file mode 100644 index 0000000000..a7e851b3d5 --- /dev/null +++ b/docs/data/understand/programming_model/cpu-gpu-comparison.drawio @@ -0,0 +1,181 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/data/understand/programming_model/cpu-gpu-comparison.svg b/docs/data/understand/programming_model/cpu-gpu-comparison.svg new file mode 100644 index 0000000000..552290299f --- /dev/null +++ b/docs/data/understand/programming_model/cpu-gpu-comparison.svg @@ -0,0 +1 @@ +
CPU versus GPU Architecture
CPU versus GPU Archite...
CPU
CPU
CPU Core
CPU Core
CPU Core
CPU Core
CPU Core
CPU Core
CPU Core
CPU Core
GPU
GPU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
CU
Large Complex Cores
Large Complex Cores
High Clock Speed (3-5 GHz)
High Clock Speed (3-5 GHz)
Many Simple Cores
Many Simple Cores
Lower Clock Speed (1-2 GHz)
Lower Clock Speed (1-2 GHz)
Large Cache per Core
Large Cache per Core
Shared Memory across Cores
Shared Memory across Cores
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/data/understand/programming_model/host-device-flow.drawio b/docs/data/understand/programming_model/host-device-flow.drawio new file mode 100644 index 0000000000..2ee8c43ae9 --- /dev/null +++ b/docs/data/understand/programming_model/host-device-flow.drawio @@ -0,0 +1,61 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/data/understand/programming_model/host-device-flow.svg b/docs/data/understand/programming_model/host-device-flow.svg new file mode 100644 index 0000000000..02bce96c5d --- /dev/null +++ b/docs/data/understand/programming_model/host-device-flow.svg @@ -0,0 +1 @@ +
Host-Device Data Flow
Host-Device Data Flow
Host (CPU)
Host (CPU)
Device (GPU)
Device (GPU)
1. Initialize
1. Initialize
2. Transfer Data
2. Transfer Data
3. Execute Kernel
3. Execute Kernel
4. Return Results
4. Return Results
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/data/understand/programming_model/memory-access.drawio b/docs/data/understand/programming_model/memory-access.drawio new file mode 100644 index 0000000000..3577772532 --- /dev/null +++ b/docs/data/understand/programming_model/memory-access.drawio @@ -0,0 +1,237 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/data/understand/programming_model/memory-access.svg b/docs/data/understand/programming_model/memory-access.svg new file mode 100644 index 0000000000..5f0dbd8aae --- /dev/null +++ b/docs/data/understand/programming_model/memory-access.svg @@ -0,0 +1 @@ +
Memory Access Patterns
Memory Access Patterns
Uncoalesced Access
Uncoalesced Access
Threads
Threads
Memory
Memory
Coalesced Access
Coalesced Access
Threads
Threads
Memory
Memory
0
0
...
...
...
...
63
63
0
0
...
...
...
...
63
63
0
0
...
...
...
...
63
63
0
0
...
...
...
...
63
63
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/data/understand/programming_model/multi-gpu.drawio b/docs/data/understand/programming_model/multi-gpu.drawio new file mode 100644 index 0000000000..17eca3c318 --- /dev/null +++ b/docs/data/understand/programming_model/multi-gpu.drawio @@ -0,0 +1,64 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/data/understand/programming_model/multi-gpu.svg b/docs/data/understand/programming_model/multi-gpu.svg new file mode 100644 index 0000000000..190f2593d2 --- /dev/null +++ b/docs/data/understand/programming_model/multi-gpu.svg @@ -0,0 +1 @@ +
Multi-GPU Workload Distribution
Multi-GPU Workload Distribution
Host CPU
Host CPU
GPU 0
GPU 0
GPU 1
GPU 1
GPU 2
GPU 2
GPU 3
GPU 3
25%
25%
25%
25%
25%
25%
25%
25%
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/data/understand/programming_model/simt-execution.drawio b/docs/data/understand/programming_model/simt-execution.drawio new file mode 100644 index 0000000000..1e2652f51f --- /dev/null +++ b/docs/data/understand/programming_model/simt-execution.drawio @@ -0,0 +1,124 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/data/understand/programming_model/simt-execution.svg b/docs/data/understand/programming_model/simt-execution.svg new file mode 100644 index 0000000000..412b9265e7 --- /dev/null +++ b/docs/data/understand/programming_model/simt-execution.svg @@ -0,0 +1 @@ +
SIMT Execution Model
SIMT Execution Model
a[i] = b[i] + c[i]
a[i] = b[i] + c[i]
Thread 0
Thread 0
b[0] = 5
b[0] = 5
c[0] = 3
c[0] = 3
a[0] = 8
a[0] = 8
Thread 1
Thread 1
b[1] = 2
b[1] = 2
c[1] = 4
c[1] = 4
a[1] = 6
a[1] = 6
Thread 2
Thread 2
b[2] = 7
b[2] = 7
c[2] = 1
c[2] = 1
a[2] = 8
a[2] = 8
Thread 3
Thread 3
b[3] = 3
b[3] = 3
c[3] = 5
c[3] = 5
a[3] = 8
a[3] = 8
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/data/understand/programming_model/simt.drawio b/docs/data/understand/programming_model/simt.drawio deleted file mode 100644 index 4c5c5a3f26..0000000000 --- a/docs/data/understand/programming_model/simt.drawio +++ /dev/null @@ -1,148 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/docs/data/understand/programming_model/simt.svg b/docs/data/understand/programming_model/simt.svg deleted file mode 100644 index c149ab88e4..0000000000 --- a/docs/data/understand/programming_model/simt.svg +++ /dev/null @@ -1 +0,0 @@ -
ADD
ADD
FM
FMA
FMA
FM
FMA
FMA
FM
FMA
FMA
FM
FMA
FMA
DIV
DIV
ADD
ADD
FM
FMA
FMA
FM
FMA
FMA
FM
FMA
FMA
FM
FMA
FMA
DIV
DIV
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/data/understand/programming_model/stream-workflow.drawio b/docs/data/understand/programming_model/stream-workflow.drawio new file mode 100644 index 0000000000..616dd28d78 --- /dev/null +++ b/docs/data/understand/programming_model/stream-workflow.drawio @@ -0,0 +1,97 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/data/understand/programming_model/stream-workflow.svg b/docs/data/understand/programming_model/stream-workflow.svg new file mode 100644 index 0000000000..9648351cad --- /dev/null +++ b/docs/data/understand/programming_model/stream-workflow.svg @@ -0,0 +1 @@ +
Stream and Event Workflow
Stream and Event Workf...
Stream 1
Stream 1
Stream 2
Stream 2
Stream 3
Stream 3
Operation
Operation
Event
Event
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/how-to/hip_runtime_api/cooperative_groups.rst b/docs/how-to/hip_runtime_api/cooperative_groups.rst index 3170e197ef..a3e32cd294 100644 --- a/docs/how-to/hip_runtime_api/cooperative_groups.rst +++ b/docs/how-to/hip_runtime_api/cooperative_groups.rst @@ -164,7 +164,7 @@ The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ` Coalesced groups ------------------ -Threads (64 threads on CDNA and 32 threads on RDNA) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled, if they do not meet the condition to execute that branch. The active threads referred as coalesced, and coalesced group represents an active thread group within a warp. +Threads (64 threads on CDNA and 32 threads on RDNA) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled if they do not meet the condition to execute that branch. The active threads are referred to as coalesced, and coalesced group represents an active thread group within a warp. .. note:: diff --git a/docs/index.md b/docs/index.md index 23f352e306..9be67a91d3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -22,7 +22,6 @@ The HIP documentation is organized into the following categories: :::{grid-item-card} Programming guide -* [Introduction](./programming_guide) * {doc}`./understand/programming_model` * {doc}`./understand/hardware_implementation` * {doc}`./understand/compilers` diff --git a/docs/programming_guide.rst b/docs/programming_guide.rst deleted file mode 100644 index 7444408866..0000000000 --- a/docs/programming_guide.rst +++ /dev/null @@ -1,83 +0,0 @@ -.. meta:: - :description: HIP programming guide introduction - :keywords: HIP programming guide introduction, HIP programming guide - -.. _hip-programming-guide: - -******************************************************************************** -HIP programming guide introduction -******************************************************************************** - -This topic provides key HIP programming concepts and links to more detailed -information. - -Write GPU Kernels for Parallel Execution -================================================================================ - -To make the most of the parallelism inherent to GPUs, a thorough understanding -of the :ref:`programming model ` is helpful. The HIP -programming model is designed to make it easy to map data-parallel algorithms to -architecture of the GPUs. HIP employs the SIMT-model (Single Instruction -Multiple Threads) with a multi-layered thread hierarchy for efficient execution. - -Understand the Target Architecture (CPU and GPU) -================================================================================ - -The :ref:`hardware implementation ` topic outlines the -GPUs supported by HIP. In general, GPUs are made up of Compute Units that excel -at executing parallelizable, computationally intensive workloads without complex -control-flow. - -Increase parallelism on multiple level -================================================================================ - -To maximize performance and keep all system components fully utilized, the -application should expose and efficiently manage as much parallelism as possible. -:ref:`Parallel execution ` can be achieved at the -application, device, and multiprocessor levels. - -The application’s host and device operations can achieve parallel execution -through asynchronous calls, streams, or HIP graphs. On the device level, -multiple kernels can execute concurrently when resources are available, and at -the multiprocessor level, developers can overlap data transfers with -computations to further optimize performance. - -Memory management -================================================================================ - -GPUs generally have their own distinct memory, also called :ref:`device -memory `, separate from the :ref:`host memory `. -Device memory needs to be managed separately from the host memory. This includes -allocating the memory and transfering it between the host and the device. These -operations can be performance critical, so it's important to know how to use -them effectively. For more information, see :ref:`Memory management `. - -Synchronize CPU and GPU Workloads -================================================================================ - -Tasks on the host and devices run asynchronously, so proper synchronization is -needed when dependencies between those tasks exist. The asynchronous execution -of tasks is useful for fully utilizing the available resources. Even when only a -single device is available, memory transfers and the execution of tasks can be -overlapped with asynchronous execution. - -Error Handling -================================================================================ - -All functions in the HIP runtime API return an error value of type -:cpp:enum:`hipError_t` that can be used to verify whether the function was -successfully executed. It's important to confirm these returned values, in order -to catch and handle those errors, if possible. An exception is kernel launches, -which don't return any value. These errors can be caught with specific functions -like :cpp:func:`hipGetLastError()`. - -For more information, see :ref:`error_handling` . - -Multi-GPU and Load Balancing -================================================================================ - -Large-scale applications that need more compute power can use multiple GPUs in -the system. This requires distributing workloads across multiple GPUs to balance -the load to prevent GPUs from being overutilized while others are idle. - -For more information, see :ref:`multi-device` . \ No newline at end of file diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 2f08ffcd5a..dacc58d884 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -24,8 +24,6 @@ subtrees: - caption: Programming guide entries: - - file: programming_guide - title: Introduction - file: understand/programming_model - file: understand/hardware_implementation - file: understand/compilers diff --git a/docs/understand/hardware_implementation.rst b/docs/understand/hardware_implementation.rst index 7038262812..e57f7d4505 100644 --- a/docs/understand/hardware_implementation.rst +++ b/docs/understand/hardware_implementation.rst @@ -45,12 +45,13 @@ The amount of warps that can reside concurrently on a CU, known as occupancy, is determined by the warp's resource usage of registers and shared memory. +.. _gcn_cu: + .. figure:: ../data/understand/hardware_implementation/compute_unit.svg :alt: Diagram depicting the general structure of a compute unit of an AMD GPU. - An AMD Graphics Core Next (GCN) CU. The CDNA and RDNA CUs are based on - variations of the GCN CU. + AMD Graphics Core Next (GCN) CU On AMD GCN GPUs the basic structure of a CU is: @@ -102,6 +103,8 @@ The scalar unit performs instructions that are uniform within a warp. It thereby improves efficiency and reduces the pressure on the vector ALUs and the vector register file. +.. _cdna3_cu: + CDNA architecture ================= @@ -121,6 +124,8 @@ multiply-accumulate operations for Block Diagram of a CDNA3 Compute Unit. +.. _rdna3_cu: + RDNA architecture ================= diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 6c7015996f..3cac7e374a 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -7,67 +7,96 @@ .. _programming_model: ******************************************************************************* -HIP programming model +Introduction to the HIP programming model ******************************************************************************* -The HIP programming model makes it easy to map data-parallel C/C++ algorithms to -massively parallel, wide single instruction, multiple data (SIMD) architectures, -such as GPUs. +The HIP programming model enables mapping data-parallel C/C++ algorithms to massively +parallel SIMD (Single Instruction, Multiple Data) architectures like GPUs. HIP +supports many imperative languages, such as Python via PyHIP, but this document +focuses on the original C/C++ API of HIP. -While the model may be expressed in most imperative languages, (for example -Python via PyHIP) this document will focus on the original C/C++ API of HIP. - -A basic understanding of the underlying device architecture helps you +While GPUs may be capable of running applications written for CPUs if properly ported +and compiled, it would not be an efficient use of GPU resources. GPUs fundamentally differ +from CPUs and should be used accordingly to achieve optimum +performance. A basic understanding of the underlying device architecture helps you make efficient use of HIP and general purpose graphics processing unit (GPGPU) -programming in general. +programming in general. The following topics introduce you to the key concepts of +GPU-based programming and the HIP programming model. -RDNA & CDNA architecture summary +Hardware differences: CPU vs GPU ================================ -GPUs in general are made up of basic building blocks called compute units (CUs), -that execute the threads of a kernel. These CUs provide the necessary resources -for the threads: the Arithmetic Logical Units (ALUs), register files, caches and -shared memory for efficient communication between the threads. +CPUs and GPUs have been designed for different purposes. CPUs quickly execute a single thread, decreasing the time for a single operation while increasing the number of sequential instructions that can be executed. This includes fetching data and reducing pipeline stalls where the ALU has to wait for previous instructions to finish. + +.. figure:: ../data/understand/programming_model/cpu-gpu-comparison.svg + :alt: Diagram depicting the differences between CPU and GPU hardware. + The CPU block shows four large processing cores, lists Large Cache per + Core, and High Clock Speed of 3 to 5 gigahertz. The GPU block shows 42 + smaller processing cores, lists Shared Memory across Cores, and Lower + Clock Speeds of 1 to 2 gigahertz. + + Differences in CPUs and GPUs + +With CPUs, the goal is to quickly process operations. CPUs provide low-latency processing for +serial instructions. On the other hand, GPUs have been designed to execute many similar commands, or threads, +in parallel, achieving higher throughput. Latency is the time between starting an +operation and receiving its result, such as 2 ns, while throughput is the rate of +completed operations, for example, operations per second. + +For the GPU, the objective is to process as many operations in parallel, rather +than to finish a single instruction quickly. GPUs in general are made up of basic +building blocks called compute units (CUs), that execute the threads of a kernel. +As described in :ref:`hardware_implementation`, these CUs provide the necessary +resources for the threads: the Arithmetic Logical Units (ALUs), register files, +caches and shared memory for efficient communication between the threads. -This design allows for efficient execution of kernels while also being able to -scale from small GPUs embedded in APUs with few CUs up to GPUs designed for data -centers with hundreds of CUs. Figure :ref:`rdna3_cu` and :ref:`cdna3_cu` show -examples of such compute units. +The following describes a few hardware differences between CPUs and GPUs: -For architecture details, check :ref:`hardware_implementation`. +* CPU: -.. _rdna3_cu: + - Optimized for sequential processing with a few powerful cores (4-64 typically) + - High clock speeds (3-5 GHz) + - One register file per thread. On modern CPUs you have at most 2 register files per core, called hyperthreading. + - One ALU executing the thread. -.. figure:: ../data/understand/programming_model/rdna3_cu.png - :alt: Block diagram showing the structure of an RDNA3 Compute Unit. It - consists of four SIMD units, each including a vector and scalar register - file, with the corresponding scalar and vector ALUs. All four SIMDs - share a scalar and instruction cache, as well as the shared memory. Two - of the SIMD units each share an L0 cache. + - Designed to quickly execute instructions of the same thread. + - Complex branch prediction. - Block Diagram of an RDNA3 Compute Unit. + - Large L1/L2 cache per core, shared by fewer threads (maximum of 2 when hyperthreading is available). + - A disadvantage is switching execution from one thread to another (or context switching) takes a considerable amount of time: the ALU pipeline needs to be emptied, the register file has to be written to memory to free the register for another thread. + +* GPU: -.. _cdna3_cu: + - Designed for parallel processing with many simpler cores (hundreds/thousands) + - Lower clock speeds (1-2 GHz) + - Streamlined control logic + - Small caches, more registers + - Register files are shared among threads. The number of threads that can be run in parallel depends on the registers needed per thread. + - Multiple ALUs execute a collection of threads having the same operations, also known as a wavefront or warp. This is called single-instruction, multiple threads (SIMT) operation as described in :ref:`programming_model_simt`. -.. figure:: ../data/understand/programming_model/cdna3_cu.png - :alt: Block diagram showing the structure of a CDNA3 compute unit. It includes - Shader Cores, the Matrix Core Unit, a Local Data Share used for sharing - memory between threads in a block, an L1 Cache and a Scheduler. The - Shader Cores represent the vector ALUs and the Matrix Core Unit the - matrix ALUs. The Local Data Share is used as the shared memory. + - The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture that allows a `single instruction` to concurrently operate on `multiple data` inputs. + - For branching threads where conditional instructions lead to thread divergence, ALUs still process the full wavefront, but the result for divergent threads is masked out. This leads to wasted ALU cycles and should be a consideration in your programming. Keep instructions consistent and leave conditionals out of threads. - Block Diagram of a CDNA3 Compute Unit. + - The advantage for GPUs is that context switching is easy. All threads that run on a core/compute unit have their registers on the compute unit, so they don't need to be stored to global memory, and each cycle one instruction from any wavefront that resides on the compute unit can be issued. -Heterogeneous Programming +When programming for a heterogeneous system, which incorporates CPUs and GPUs, you must +write your program to take advantage of the strengths of the available hardware. +Use the CPU for tasks that require complex logic with conditional branching, to reduce the +time to reach a decision. Use the GPU for parallel operations of the same instruction +across large datasets, with little branching, where the volume of operations is the key. + +.. _heterogeneous_programming: + +Heterogeneous programming ========================= -The HIP programming model assumes two execution contexts. One is referred to as -*host* while compute kernels execute on a *device*. These contexts have -different capabilities, therefor slightly different rules apply. The *host* -execution is defined by the C++ abstract machine, while *device* execution -follows the :ref:`SIMT model` of HIP. These execution contexts in -code are signified by the ``__host__`` and ``__device__`` decorators. There are -a few key differences between the two: +The HIP programming model has two execution contexts. The main application starts on the CPU, or +the *host* processor, and compute kernels are launched on the *device* such as `Instinct +accelerators `_ or AMD GPUs. +The host execution is defined by the C++ abstract machine, while device execution +follows the :ref:`SIMT model` of HIP. These two execution contexts +are signified by the ``__host__`` and ``__global__`` (or ``__device__``) decorators +in HIP program code. There are a few key differences between the two contexts: * The C++ abstract machine assumes a unified memory address space, meaning that one can always access any given address in memory (assuming the absence of @@ -75,96 +104,160 @@ a few key differences between the two: from one means nothing in another. Moreover, not all address spaces are accessible from all contexts. - Looking at :ref:`rdna3_cu` and :ref:`cdna3_cu`, you can see that - every CU has an instance of storage backing the namespace ``__shared__``. - Even if the host were to have access to these regions of - memory, the performance benefits of the segmented memory subsystem are + Looking at the :ref:`gcn_cu` figure, you can see that every CU has an instance of storage + backing the namespace ``__shared__``. Even if the host were to have access to these + regions of memory, the performance benefits of the segmented memory subsystem are supported by the inability of asynchronous access from the host. -* Not all C++ language features map cleanly to typical device architectures, - some are very expensive (meaning slow) to implement on GPU devices, therefor - they are forbidden in device contexts to avoid users tapping into features - that unexpectedly decimate their program's performance. Offload devices targeted - by HIP aren't general purpose devices, at least not in the sense that a CPU is. - HIP focuses on data parallel computations and as such caters to throughput - optimized architectures, such as GPUs or accelerators derived from GPU - architectures. +* Not all C++ language features map cleanly to typical GPU device architectures. + Some C++ features have poor latency when implemented on GPU devices, therefore + they are forbidden in device contexts to avoid using features that unexpectedly + decimate the program's performance. Offload devices targeted by HIP aren't general + purpose devices, at least not in the sense that a CPU is. HIP focuses on data + parallel computations and as such caters to throughput optimized architectures, + such as GPUs or accelerators derived from GPU architectures. -* Asynchrony is at the forefront of the HIP API. Computations launched on the device +* Asynchronicity is at the forefront of the HIP API. Computations launched on the device execute asynchronously with respect to the host, and it is the user's responsibility to synchronize their data dispatch/fetch with computations on the device. .. note:: - HIP does perform implicit synchronization on occasions, more advanced than other - APIs such as OpenCL or SYCL, in which the responsibility of synchronization mostly - depends on the user. + HIP performs implicit synchronization on occasions, unlike some + APIs where the responsibility for synchronization is left to the user. + +Host programming +---------------- + +In heterogeneous programming, the CPU is available for processing operations but the host application has the additional task of managing data and computation exchanges between the CPU (host) and GPU (device). The host acts as the application manager, coordinating the overall workflow and directing operations to the appropriate context, handles data preparation and data transfers, and manages GPU tasks and synchronization. Here is a typical sequence of operations: + +1. Initialize the HIP runtime and select the GPU: As described in :ref:`initialization`, refers to identifying and selecting a target GPU, setting up a context to let the CPU interact with the GPU. +2. Data preparation: As discussed in :ref:`memory_management`, this includes allocating the required memory on the host and device, preparing input data and transferring it from the host to the device. The data is both transferred to the device, and passed as an input parameter when launching the kernel. +3. Configure and launch the kernel on the GPU: As described in :ref:`device_program`, this defines kernel configurations and arguments, launches kernel to run on the GPU device using the triple chevron syntax or appropriate API call (for example ``hipLaunchKernelGGL``). On the GPU, multiple kernels can run on streams, with a queue of operations. Within the same stream, operations run in the order they were issued, but on multiple streams operations are independent and can execute concurrently. In the HIP runtime, kernels run on the default stream when one is not specified, but specifying a stream for the kernel lets you increase concurrency in task scheduling and resource utilization, and launch and manage multiple kernels from the host program. +4. Synchronization: As described in :ref:`asynchronous_how-to`, kernel execution occurs in the context of device streams, specifically the default (`0`) stream. You can use streams and events to manage task dependencies, overlap computation with data transfers, and manage asynchronous processes to ensure proper sequencing of operations. Wait for events or streams to finish execution and transfer results from the GPU back to the host. +5. Error handling: As described in :ref:`error_handling`, you should catch and handle potential errors from API calls, kernel launches, or memory operations. For example, use ``hipGetErrorString`` to retrieve error messages. +6. Cleanup and resource management: Validate results, clean up GPU contexts and resources, and free allocated memory on the host and devices. + +This structure allows for efficient use of GPU resources and facilitates the acceleration of compute-intensive tasks while keeping the host CPU available for other tasks. + +.. figure:: ../data/understand/programming_model/host-device-flow.svg + :alt: Diagram depicting a host CPU and device GPU rectangles of varying color. + There are arrows pointing between the rectangles showing from the Host + to the Device the initialization, data transfer, and Kernel execution + steps, and from the Device back to the Host the returning results. + + Interaction of Host and Device in a GPU application + +.. _device_program: + +Device programming +------------------ + +The device or kernel program acts as workers on the GPU application, distributing operations to be handled quickly and efficiently. Launching a kernel in the host application starts the kernel program running on the GPU, defining the parallel operations to repeat the same instructions across many datasets. Understanding how the kernel works and the processes involved is essential to writing efficient GPU applications. Threads, blocks, and grids provide a hierarchical approach to parallel operations. Understanding the thread hierarchy is critical to distributing work across the available CUs, managing parallel operations, and optimizing memory access. The general flow of the kernel program looks like this: + +1. Thread Grouping: As described in :ref:`inherent_thread_model`, threads are organized into a hierarchy consisting of threads, which are individual instances of parallel operations, blocks that group the threads, and grids that group blocks into the kernel. Each thread runs an instance of the kernel in parallel with other threads in the block. +2. Indexing: The kernel computes the unique index for each thread to access the relevant data to be processed by the thread. +3. Data Fetch: Threads fetch input data from memory previously transferred from the host to the device. As described in :ref:`memory_hierarchy`, the hierarchy of threads is influenced by the memory subsystem of GPUs. The memory hierarchy includes local memory per-thread with very fast access, shared memory for the block of threads which also supports quick access, and larger amounts of global memory visible to the whole kernel,but accesses are expensive due to high latency. Understanding the memory model is a key concept for kernel programming. +4. Computation: Threads perform the required computations on the input data, and generate any needed output. Each thread of the kernel runs the same instruction simultaneously on the different datasets. This sometimes require multiple iterations when the number of operations exceeds the resources of the CU. +5. Synchronization: When needed, threads synchronize within their block to ensure correct results when working with shared memory. + +Kernels are parallel programs that execute the same instruction set across multiple threads, organized in wavefronts, as described below and as demonstrated in the `Hello World tutorial `_ or :doc:`../tutorial/saxpy`. However, heterogeneous GPU applications can also become quite complex, managing hundreds, thousands, or hundreds of thousands of operations with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage concurrent asynchronous operations, using rich libraries of functions optimized for GPU hardware as described in the `ROCm documentation `_. .. _programming_model_simt: Single instruction multiple threads (SIMT) ========================================== -The SIMT programming model behind the HIP device-side execution is a middle-ground -between SMT (Simultaneous Multi-Threading) programming known from multicore CPUs, -and SIMD (Single Instruction, Multiple Data) programming mostly known from exploiting -relevant instruction sets on CPUs (for example SSE/AVX/Neon). +The HIP kernel code, written as a series of scalar instructions for multiple +threads with different thread indices, gets mapped to the SIMD units of the GPUs. +Every single instruction, which is executed for every participating thread of a +kernel, gets mapped to the SIMD. -A HIP device compiler maps SIMT code written in HIP C++ to an inherently SIMD -architecture (like GPUs). This is done by scalarizing the entire kernel and issuing the scalar -instructions of multiple kernel instances (called threads) to each of the SIMD engine lanes, rather -than exploiting data parallelism within a single instance of a kernel and spreading -identical instructions over the available SIMD engines. +This is done by grouping threads into warps, which contain as many threads as there +are physical lanes in a SIMD, and issuing that instruction to the SIMD for every +warp of a kernel. Ideally, the SIMD is always fully utilized. However, if the number of threads +can't be evenly divided by the warpSize, then the unused lanes are masked out +from the corresponding SIMD execution. -Consider the following kernel: +.. _simt: + +.. figure:: ../data/understand/programming_model/simt-execution.svg + :alt: Diagram depicting the SIMT execution model. There is a red rectangle + which contains the expression a[i] = b[i] + c[i], and below that four + arrows that point to Thread 0,1,2, and 3. Each thread contains different + values for b, c, and a, showing the parallel operations of this equation. + + Instruction flow of a sample SIMT program + +A kernel follows the same C++ rules as the functions on the host, but it has a special ``__global__`` label to mark it for execution on the device, as shown in the following example: .. code-block:: cpp - __global__ void k(float4* a, const float4* b) + __global__ void AddKernel(float* a, const float* b) { - int tid = threadIdx.x; - int bid = blockIdx.x; - int dim = blockDim.x; + int global_idx = threadIdx.x + blockIdx.x * blockDim.x; - a[tid] += (tid + bid - dim) * b[tid]; + a[global_idx] += b[global_idx]; } -The incoming four-vector of floating-point values ``b`` is multiplied by a -scalar and then added element-wise to the four-vector floating-point values of -``a``. On modern SIMD-capable architectures, the four-vector ops are expected to -compile to a single SIMD instruction. However, GPU execution of this kernel will -typically break down the vector elements into 4 separate threads for parallel execution, -as seen in the following figure: +One of the first things you might notice is the usage of the special ``threadIdx``, +``blockIdx`` and ``blockDim`` variables. Unlike normal C++ host functions, a kernel +is not launched once, but as often as specified by the user. Each of these instances +is a separate thread, with its own values for ``threadIdx``, ``blockIdx`` and ``blockDim``. -.. _simt: +The kernel program is launched from the host application using a language extension +called the triple chevron syntax, which looks like the following: + +.. code-block:: cpp + + AddKernel<<>>(a, b); + +Inside the angle brackets, provide the following: -.. figure:: ../data/understand/programming_model/simt.svg - :alt: Image representing the instruction flow of a SIMT program. Two identical - arrows pointing downward with blocks representing the instructions - inside and ellipsis between the arrows. The instructions represented in - the arrows are, from top to bottom: ADD, DIV, FMA, FMA, FMA and FMA. +* The number of blocks to launch, which defines the grid size (relating to blockDim). +* The number of threads in a block, which defines the block size (relating to blockIdx). +* The amount of shared memory to allocate by the host, not specified above. +* The device stream to enqueue the operation on, not specified above so the default stream is used. - Instruction flow of the sample SIMT program. +.. note:: + The kernel can also be launched through other methods, such as the ``hipLaunchKernel()`` function. + +Here, the total number of threads launched for the ``AddKernel`` program is defined by +``number_of_blocks * threads_per_block``. You define these values when launching the +kernel program to address the problem to be solved with the available resources within +the system. In other words, the thread configuration is customized to the needs of the +operations and the available hardware. + +For comparison, the ``AddKernel`` program could be written in plain C++ as a ``FOR`` loop: + +.. code-block:: cpp + + for(int i = 0; i < (number_of_blocks * threads_per_block); ++i){ + a[i] += b[i]; + } In HIP, lanes of the SIMD architecture are fed by mapping threads of a SIMT execution, one thread down each lane of an SIMD engine. Execution parallelism -usually isn't exploited from the width of the built-in vector types, but across multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, etc. +usually isn't exploited from the width of the built-in vector types, but across +multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, etc. .. _inherent_thread_model: -Inherent thread model -===================== - -The SIMT nature of HIP is captured by the ability to execute user-provided -device programs, expressed as single-source C/C++ functions or sources compiled -online/offline to binaries, in bulk. +Hierarchical thread model +--------------------- -All threads of a kernel are uniquely identified by a set of integral values, called thread IDs. -The set of integers identifying a thread relate to the hierarchy in which the threads execute. +As previously discussed, all threads of a kernel are uniquely identified by a set +of integral values called thread IDs. The hierarchy consists of three levels: thread, +blocks, and grids. -The thread hierarchy inherent to how AMD GPUs operate is depicted in the -following figure. +* Threads are single instances of kernel operations, running concurrently across warps +* Blocks group threads together and enable cooperation and shared memory +* Grids define the number of thread blocks for a single kernel launch +* Blocks and grids can be defined in 3 dimensions (``x``, ``y``, ``z``) +* By default, the Y and Z dimensions are set to 1 -.. _inherent_thread_hierarchy: +The combined values represent the thread index, and relate to the sequence that the +threads execute. The thread hierarchy is integral to how AMD GPUs operate, and is +depicted in the following figure. .. figure:: ../data/understand/programming_model/thread_hierarchy.svg :alt: Diagram depicting nested rectangles of varying color. The outermost one @@ -175,10 +268,13 @@ following figure. Hierarchy of thread groups. +.. _wavefront: + Warp (or Wavefront) - The innermost grouping of threads is called a warp, or a wavefront in ISA terms. A warp - is the most tightly coupled groups of threads, both physically and logically. Threads - inside a warp are also called lanes, and the integral value identifying them is the lane ID. + The innermost grouping of threads is called a warp. A warp is the most tightly + coupled groups of threads, both physically and logically. Threads inside a warp + are executed in lockstep, with each thread executing the same instruction. Threads + in a warp are also called lanes, and the value identifying them is the lane ID. .. tip:: @@ -187,41 +283,52 @@ Warp (or Wavefront) calculated values to be. The size of a warp is architecture dependent and always fixed. For AMD GPUs - the wavefront is typically 64 threads, though sometimes 32 threads. Warps are + the warp is typically 64 threads, though sometimes 32 threads. Warps are signified by the set of communication primitives at their disposal, as discussed in :ref:`warp-cross-lane`. .. _inherent_thread_hierarchy_block: Block - The middle grouping is called a block or thread block. The defining feature - of a block is that all threads in a block will share an instance of memory - which they may use to share data or synchronize with one another. - - The size of a block is user-configurable but is limited by the queryable - capabilities of the executing hardware. The unique ID of the thread within a - block is 3-dimensional as provided by the API. When linearizing thread IDs - within a block, assume the "fast index" being dimension ``x``, followed by - the ``y`` and ``z`` dimensions. + The next level of the thread hierarchy is called a thread block, or block. The + defining feature of a block is that all threads in the block have shared memory + that they can use to share data or synchronize with one another, as described in + :ref:`memory_hierarchy`. + + The size of a block, or the block dimension, is the user-configurable number of + threads per block, but is limited by the queryable capabilities of the executing + hardware. The unique ID of the thread within a block can be 1, 2, or 3-dimensional + as provided by the HIP API. You can configure the thread block to best represent + the data associated with the kernel instruction set. + + .. note:: + When linearizing thread IDs within a block, assume the *fast index* is the ``x`` + dimension, followed by the ``y`` and ``z`` dimensions. .. _inherent_thread_hierarchy_grid: Grid - The outermost grouping is called a grid. A grid manifests as a single - dispatch of kernels for execution. The unique ID of each block within a grid - is 3-dimensional, as provided by the API and is queryable by every thread - within the block. + The top-most level of the thread hierarchy is a grid. A grid is the number of blocks + needed for a single launch of the kernel. The unique ID of each block within + a grid can be 1, 2, or 3-dimensional, as provided by the API and is queryable + by every thread within the block. + +The three-dimensional thread hierarchy available to a kernel program lends itself to solutions +that align closely to the computational problem. The following are some examples: + +* 1-dimensional: array processing, linear data structures, or sequential data transformation +* 2-dimensional: Image processing, matrix operations, 2 dimensional simulations +* 3-dimensional: Volume rendering, 3D scientific simulations, spatial algorithms Cooperative groups thread model ------------------------------- -The Cooperative groups API introduces new APIs to launch, group, subdivide, +The Cooperative groups API introduces new functions to launch, group, subdivide, synchronize and identify threads, as well as some predefined group-collective -algorithms, but most importantly a matching threading model to think in terms of. -It relaxes some restrictions of the :ref:`inherent_thread_model` imposed by the -strict 1:1 mapping of architectural details to the programming model. Cooperative -groups let you define your own set of thread groups which may fit your user-cases -better than the defaults defined by the hardware. +algorithms. Cooperative groups let you define your own set of thread groups which +may fit your use-cases better than those defined by the hardware. It relaxes some +restrictions of the :ref:`inherent_thread_model` imposed by the strict 1:1 mapping +of architectural details to the programming model. .. note:: The implicit groups defined by kernel launch parameters are still available @@ -229,14 +336,17 @@ better than the defaults defined by the hardware. For further information, see :doc:`Cooperative groups `. +.. _memory_hierarchy: + Memory model ============ -The hierarchy of threads introduced by the :ref:`inherent_thread_model` is induced -by the memory subsystem of GPUs. The following figure summarizes the memory -namespaces and how they relate to the various levels of the threading model. - -.. _memory_hierarchy: +The GPU memory architecture is designed to support parallel execution across the +thread hierarchy. Understanding the following memory spaces and their relationships +to thread groupings is crucial for efficient GPU programming. The choice of memory +type and access patterns significantly impacts kernel performance. The following figure +summarizes the memory namespaces and how they relate to the various levels of the +threading model. .. figure:: ../data/understand/programming_model/memory_hierarchy.svg :alt: Diagram depicting nested rectangles of varying color. The outermost one @@ -250,22 +360,33 @@ namespaces and how they relate to the various levels of the threading model. Local or per-thread memory Read-write storage only visible to the threads defining the given variables, - also called per-thread memory. The size of a block for a given kernel, and thereby - the number of concurrent warps, are limited by local memory usage. - This relates to an important aspect: occupancy. This is the default memory - namespace. + also called per-thread memory. This is the default memory namespace. + The size of the blocks for a given kernel, and thereby the number of concurrent + warps, are limited by local memory usage. This relates to the *occupancy* of the + CU as described in :doc:`Compute Units <./hardware_implementation>`, + an important concept in resource usage and performance optimization. + + Use local memory when the data is specific to a thread, to store variables generated + by the thread, or to provide register pressure relief for the thread. Shared memory - Read-write storage visible to all the threads in a given block. + Read-write storage visible to all the threads in a given block. Use shared memory + when the data is reused within a thread block, when cross-thread communication + is needed, or to minimize global memory transactions by using device memory + whenever possible. Global Read-write storage visible to all threads in a given grid. There are specialized versions of global memory with different usage semantics which - are typically backed by the same hardware storing global. + are typically backed by the same hardware storing global. + + Use global memory when you have large datasets, are transferring memory between + the host and the device, and when you are sharing data between thread blocks. Constant Read-only storage visible to all threads in a given grid. It is a limited - segment of global with queryable size. + segment of global with queryable size. Use constant memory for read-only data + that is shared across multiple threads, and that has a small data size. Texture Read-only storage visible to all threads in a given grid and accessible @@ -274,54 +395,86 @@ Global Surface A read-write version of texture memory. -Execution model -=============== +Memory optimizations and best practices +--------------------------------------- -HIP programs consist of two distinct scopes: +.. figure:: ../data/understand/programming_model/memory-access.svg + :alt: Diagram depicting an example memory access pattern for coalesced memory. + The diagram has uncoalesced access on the left side, with consecutive + threads accessing memory in a random pattern. With coalesced access on the + right showing consecutive threads accessing consecutive memory addresses. -* The host-side API running on the host processor. There are two APIs available: + Coalesced memory accesses - * The HIP runtime API which enables use of the single-source programming - model. +The following are a few memory access patterns and best practices to improve performance. You can find additional information in :ref:`memory_management` and :doc:`../how-to/performance_guidelines`. - * The HIP driver API which sits at a lower level and most importantly differs - by removing some facilities provided by the runtime API, most - importantly around kernel launching and argument setting. It is geared - towards implementing abstractions atop, such as the runtime API itself. - Offers two additional pieces of functionality not provided by the Runtime - API: ``hipModule`` and ``hipCtx`` APIs. For further details, check - :doc:`HIP driver API `. +* **Global memory**: Coalescing reduces the number of memory transactions. -* The device-side kernels running on GPUs. Both the host and the device-side - APIs have synchronous and asynchronous functions in them. + Coalesced memory access in HIP refers to the optimization of memory transactions to maximize throughput when accessing global memory. When a kernel accesses global memory, the memory transactions typically occur in chunks of 32, 64, or 128 bytes, which must be naturally aligned. Coalescing memory accesses means aligning and organizing these accesses so that multiple threads in a warp can combine their memory requests into the fewest possible transactions. If threads access memory in a coalesced manner, meaning consecutive threads read or write consecutive memory locations, the memory controller can merge these accesses into a single transaction. This is crucial because global memory bandwidth is relatively low compared to on-chip bandwidths, and non-optimal memory accesses can significantly impact performance. If all the threads in a warp can access consecutive memory locations, memory access is fully coalesced. -.. note:: + To achieve coalesced memory access in HIP, you should: + + 1. *Align Data*: Use data types that are naturally aligned and ensure that structures and arrays are aligned properly. + 2. *Optimize Access Patterns*: Arrange memory accesses so that consecutive threads in a warp access consecutive memory locations. For example, if threads access a 2D array, the array and thread block widths should be multiples of the warp size. + 3. *Avoid strided access*: For example array[i * stride] can lead to memory bank conflicts and inefficient access. + 4. *Pad Data*: If necessary, pad data structures to ensure alignment and coalescing. + +* **Shared memory**: Avoiding bank conflicts reduces the serialization of memory transactions. + + Shared memory is a small, fast memory region inside the CU. Unlike global memory, shared memory accesses do not require coalescing, but they can suffer from bank conflicts, which are another form of inefficient memory access. Shared memory is divided into multiple memory banks (usually 32 banks on modern GPUs). If multiple threads within a warp try to access different addresses that map to the same memory bank, accesses get serialized, leading to poor performance. To optimize shared memory usage, ensure that consecutive threads access different memory banks. Use padding if necessary to avoid conflicts. + +* **Texture memory**: Spatial locality improves caching performance. - The HIP does not present two *separate* APIs link NVIDIA CUDA. HIP only extends - the HIP runtime API with new APIs for ``hipModule`` and ``hipCtx``. + Texture memory is read-only memory optimized for spatial locality and caching rather than coalescing. Texture memory is cached, unlike standard global memory, and it provides optimized access patterns for 2D and spatially local data. Accessing neighboring values results in cache hits, improving performance. Therefore, instead of worrying about coalescing, optimal memory access patterns involve ensuring that threads access spatially adjacent texture elements, and the memory layout aligns well with the 2D caching mechanism. + +* **Unified memory**: Structured access reduces the overhead of page migrations. + + Unified memory allows the CPU and GPU to share memory seamlessly, but performance depends on access patterns. Unified memory enables automatic page migration between CPU and GPU memory. However, if different threads access different pages, it can lead to expensive page migrations and slow throughput performance. Accessing unified memory in a structured, warp-friendly manner reduces unnecessary page transfers. Ensure threads access memory in a structured, consecutive manner, minimizing page faults. Prefetch data to the GPU before computation by using ``hipMemPrefetchAsync()``. In addition, using small batch transfers as described below, can reduce unexpected page migrations when using unified memory. + +* **Small batch transfers**: Enable pipelining and improve PCIe bandwidth use. + + Memory transfers between the host and the device can become a major bottleneck if not optimized. One method is to use small batch memory transfers where data is transferred in smaller chunks instead of dealing with large datasets to avoid long blocking operations. Small batch transfers offer better PCIe bandwidth utilization over large data transfers. Small batch transfers offer performance improvement by offering reduced latency with small batches that run asynchronously using ``hipMemcpyAsync()`` as described in :ref:`asynchronous_how-to`, pipelining data transfers and kernel execution using separate streams. Finally, using pinned memory with small batch transfers enables faster DMA transfers without CPU involvement, greatly improving memory transfer performance. + +Execution model +=============== + +As previously discussed in :ref:`heterogeneous_programming`, HIP programs consist of two distinct scopes: + +* The host-side API running on the host processor. +* The device-side kernels running on GPUs. + +Both the host and the device-side APIs have synchronous and asynchronous functions. Host-side execution ------------------- -The part of the host-side API which deals with device management and their -queries are synchronous. All asynchronous APIs, such as kernel execution, data -movement and potentially data allocation/freeing all happen in the context of -device streams. +The host-side API dealing with device management and their queries are synchronous. +All asynchronous APIs, such as kernel execution, data movement and potentially data +allocation/freeing all happen in the context of device streams, as described in `Managing streams <../how-to/hip_runtime_api/asynchronous.html#managing-streams>`_. Streams are FIFO buffers of commands to execute relating to a given device. -Commands which enqueue tasks on a stream all return promptly and the command is +Operations that enqueue tasks on a stream all return promptly, and the command is executed asynchronously. All side effects of a command on a stream are visible to all subsequent commands on the same stream. Multiple streams may point to the same device and those streams may be fed from multiple concurrent host-side threads. Execution on multiple streams may be concurrent but isn't required to be. -Asynchronous APIs involving a stream all return a stream event which may be +Asynchronous APIs involving a stream all return a stream event, which can be used to synchronize the execution of multiple streams. A user may enqueue a -barrier onto a stream referencing an event. The barrier will block until -the command related to the event does not complete, at which point all -side effects of the command shall be visible to commands following the barrier, -even if those side effects manifest on different devices. +barrier onto a stream referencing an event. The barrier will block activity on the +stream until the operation related to the event completes. After the event completes, all +side effects of the operation will be visible to subsequent commands even if those +side effects manifest on different devices. + +.. figure:: ../data/understand/programming_model/stream-workflow.svg + :alt: Diagram depicting the stream and event workflow, with an example of + multiple streams working together. The diagram shows operations as red + rectangles, and events as white dots. There are three streams labelled + Stream 1, 2, and 3. The streams each have multiple operations and events + that require synchronization between the streams. + + Multiple stream workflow Streams also support executing user-defined functions as callbacks on the host. The stream will not launch subsequent commands until the callback completes. @@ -329,17 +482,8 @@ The stream will not launch subsequent commands until the callback completes. Device-side execution --------------------- -The SIMT programming model behind the HIP device-side execution is a -middle-ground between SMT (Simultaneous Multi-Threading) programming known from -multicore CPUs, and SIMD (Single Instruction, Multiple Data) programming -mostly known from exploiting relevant instruction sets on CPUs (for example -SSE/AVX/Neon). - -Kernel launch -------------- - -Kernels may be launched in multiple ways all with different syntaxes and -intended use-cases. +Kernels may be launched in multiple ways, all with different syntaxes and +intended use cases. * Using the triple-chevron ``<<<...>>>`` operator on a ``__global__`` annotated function. @@ -348,18 +492,44 @@ intended use-cases. .. tip:: - This name by default is a macro expanding to triple-chevron. In cases where + This name, by default, is a macro expanding to the triple-chevron syntax. In cases where language syntax extensions are undesirable, or where launching templated and/or overloaded kernel functions define the ``HIP_TEMPLATE_KERNEL_LAUNCH`` preprocessor macro before including the HIP headers to turn it into a templated function. -* Using the launch APIs supporting the triple-chevron syntax directly. - - .. caution:: - - These APIs are intended to be used/generated by tools such as the HIP - compiler itself and not intended towards end-user code. Should you be - writing a tool having to launch device code using HIP, consider using these - over the alternatives. - +Asynchronous execution +---------------------- + +Asynchronous operations between the host and the kernel provide a variety of opportunities, +or challenges, for managing synchronization, as described in :ref:`asynchronous_how-to`. +For instance, a basic model would be to launch an asynchronous operation on a kernel +in a stream, create an event to track the operation, continue operations in the host +program, and when the event shows that the asynchronous operation is complete, synchronize the kernel to return the results. + +However, one of the opportunities of asynchronous operation is the pipelining of operations +between launching kernels and transferring memory. In this case, you would be working +with multiple streams running concurrently, or at least overlapping in some regard, +and managing any dependencies between the streams in the host application. +The producer-consumer paradigm can be used to convert a sequential program +into parallel operations to improve performance. This process can employ multiple +streams to kick off asynchronous kernels, provide data to the kernels, perform operations, +and return the results for further processing in the host application. + +These asynchronous activities call for stream management strategies. In the case +of the single stream, the only management would be the stream synchronization +when the work was complete. However, with multiple streams you have +overlapping execution of operations and synchronization becomes more complex, as shown +in the variations of the example in `Programmatic dependent launch and synchronization <../how-to/hip_runtime_api/asynchronous.html#programmatic-dependent-launch-and-synchronization>`_. +You need to manage each stream's activities, evaluate the availability of results, evaluate the critical path of the tasks, allocate resources on the hardware, and manage the execution order. + +Multi-GPU and load balancing +---------------------------- + +For applications requiring additional computational power beyond a single device, +HIP supports utilizing multiple GPUs within a system. Large-scale applications +that need more compute power can use multiple GPUs in the system. This enables +the runtime to distribute workloads across multiple GPUs to balance the load and prevent some GPUs +from being over-utilized while others are idle. + +For more information, see :ref:`multi-device`.