1 change: 1 addition & 0 deletions .wordlist.txt
@@ -15,6 +15,7 @@ bfloat
Bitcode
bitcode
bitcodes
bitmask
blockDim
blockIdx
builtins
3 changes: 2 additions & 1 deletion docs/doxygen/Doxyfile
@@ -834,7 +834,8 @@ INPUT                  = ../../include/hip \
../../../clr/hipamd/include/hip/amd_detail/amd_hip_gl_interop.h \
../../../clr/hipamd/include/hip/amd_detail/amd_surface_functions.h \
../../../clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h \
../../../ROCR-Runtime/src/inc/hsa_ext_amd.h
../../../ROCR-Runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h
# ../../../ROCR-Runtime/src/inc/hsa_ext_amd.h

# This tag can be used to specify the character encoding of the source files
# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
2 changes: 1 addition & 1 deletion docs/faq.rst
@@ -43,7 +43,7 @@ What NVIDIA CUDA features does HIP support?
The :doc:`NVIDIA CUDA runtime API supported by HIP<hipify:tables/CUDA_Runtime_API_functions_supported_by_HIP>`
and :doc:`NVIDIA CUDA driver API supported by HIP<hipify:tables/CUDA_Driver_API_functions_supported_by_HIP>`
pages describe which NVIDIA CUDA APIs are supported and what the equivalents are.
The :doc:`HIP API documentation <doxygen/html/index>` describes each API and
The :doc:`HIP API documentation <reference/hip_runtime_api_reference>` describes each API and
its limitations, if any, compared with the equivalent CUDA API.

The kernel language features are documented in the
138 changes: 130 additions & 8 deletions docs/how-to/hip_cpp_language_extensions.rst
@@ -411,20 +411,18 @@ warpSize
================================================================================

The ``warpSize`` constant contains the number of threads per warp for the given
target device. It can differ between different architectures, and on RDNA
architectures it can even differ between kernel launches, depending on whether
they run in CU or WGP mode. See the
:doc:`hardware features <../reference/hardware_features>` for more
information.
target device. On AMD hardware, this is referred to as the *wavefront size*,
which may vary depending on the architecture. For more details, see the
:doc:`hardware features <../reference/hardware_features>`.

Since ``warpSize`` can differ between devices, it cannot be assumed to be a
compile-time constant on the host. It has to be queried using
:cpp:func:`hipDeviceGetAttribute` or :cpp:func:`hipGetDeviceProperties`, for example:

.. code-block:: cpp

int val;
hipDeviceGetAttribute(&val, hipDeviceAttributeWarpSize, deviceId);
int warpSizeHost;
hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId);

.. note::

@@ -435,6 +433,130 @@ compile-time constant on the host. It has to be queried using
of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of
the compute resources.

The ``warpSize`` parameter will no longer be a compile-time constant in a future
release of ROCm. However, it will still be early-folded by the compiler, which
means it can be used for loop bounds and supports loop unrolling, similarly to a
compile-time warp size.

If a compile-time warp size is still required, for example to select the correct
mask type or code path at compile time, the recommended approach is to determine
the warp size of the GPU on the host side and set up the kernel accordingly, as
shown in the following block reduce example.

The ``block_reduce`` kernel has a template parameter for warp size and performs
a reduction operation in two main phases:

- Shared memory reduction: Reduction is performed iteratively, halving the
number of active threads each step until only a warp remains
(32 or 64 threads, depending on the device).

- Warp-level reduction: Once the shared memory reduction completes, the
remaining threads use warp-level shuffling to sum the remaining values. This
is done efficiently with the ``__shfl_down`` intrinsic, which allows threads within
the warp to exchange values without explicit synchronization.

.. tab-set::

.. tab-item:: Compile-time WarpSize
:sync: template-warpsize

.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size block reduction kernel start]
:end-before: // [Sphinx template warp size block reduction kernel end]
:language: cpp


.. tab-item:: HIP warpSize
:sync: hip-warpsize

.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size block reduction kernel start]
:end-before: // [Sphinx HIP warp size block reduction kernel end]
:language: cpp

The host code with the main function:

- Retrieves the warp size of the GPU (``warpSizeHost``) to determine the optimal
kernel configuration.

- Allocates device memory (``d_data`` for input, ``d_results`` for block-wise
output) and initializes the input vector to 1.

- Generates the mask variables for every warp and copies them to the device.

.. tab-set::

.. tab-item:: Compile-time WarpSize
:sync: template-warpsize

.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size mask generation start]
:end-before: // [Sphinx template warp size mask generation end]
:language: cpp


.. tab-item:: HIP warpSize
:sync: hip-warpsize

.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size mask generation start]
:end-before: // [Sphinx HIP warp size mask generation end]
:language: cpp

- Selects the appropriate kernel specialization based on the warp
size (either 32 or 64) and launches the kernel.

.. tab-set::

.. tab-item:: Compile-time WarpSize
:sync: template-warpsize

.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size select kernel start]
:end-before: // [Sphinx template warp size select kernel end]
:language: cpp


.. tab-item:: HIP warpSize
:sync: hip-warpsize

.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size select kernel start]
:end-before: // [Sphinx HIP warp size select kernel end]
:language: cpp

- Synchronizes the device and copies the results back to the host.

- Checks that each block's sum equals the expected mask bit count, verifying
the reduction's correctness.

- Frees the device memory to prevent memory leaks.

.. note::

The runtime ``warpSize`` example code is also provided for comparison purposes;
the full example codes are located in the `tools folder <https://github.com/ROCm/hip/tree/docs/develop/docs/tools/example_codes>`_.

The variable ``warpSize`` can be used for loop bounds and supports
loop unrolling similarly to the template parameter ``WarpSize``.

For users who still require a compile-time constant warp size as a macro on the
device side, it can be defined manually based on the target device architecture,
as shown in the following example.

.. code-block:: cpp

#if defined(__GFX8__) || defined(__GFX9__)
#define WarpSize 64
#else
#define WarpSize 32
#endif

.. note::

The ``mwavefrontsize64`` compiler option is not supported by the HIP runtime,
which is why the architecture-based compile-time selection is an acceptable approach.
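
As an illustrative, host-compilable sketch (the ``lane_mask_t`` alias and the
``static_assert`` message are assumptions, not part of the HIP API), such a
macro can drive a compile-time choice of mask type. Note that when compiling
for the host neither architecture macro is defined, so ``WarpSize`` resolves to 32:

```cpp
#include <cstdint>
#include <type_traits>

// Architecture-based compile-time warp size (as described above).
#if defined(__GFX8__) || defined(__GFX9__)
#define WarpSize 64
#else
#define WarpSize 32
#endif

// Pick a lane-mask type wide enough to hold one bit per lane.
using lane_mask_t =
    std::conditional<WarpSize == 64, uint64_t, uint32_t>::type;

// Sanity check: the mask type must provide exactly one bit per lane.
static_assert(sizeof(lane_mask_t) * 8 == WarpSize,
              "lane mask must hold one bit per lane");
```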

********************************************************************************
Vector types
********************************************************************************
@@ -857,7 +979,7 @@ The different shuffle functions behave as following:
of range, the thread returns its own ``var``.

``__shfl_down``
The thread reads ``var`` from lane ``laneIdx - delta``, thereby "shuffling"
The thread reads ``var`` from lane ``laneIdx + delta``, thereby "shuffling"
the values of the lanes of the warp "down". If the resulting source lane is
out of range, the thread returns its own ``var``.
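
As a host-side illustration (plain C++, not device code; the function name is
hypothetical, since real device code calls the intrinsic directly), the
source-lane rule of ``__shfl_down`` can be modeled as:

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the __shfl_down source-lane rule: lane i reads var from
// lane i + delta, and lanes whose source is out of range keep their own var.
std::vector<int> shfl_down_model(const std::vector<int>& var, unsigned delta) {
    const std::size_t warp = var.size();  // 32 or 64 on real hardware
    std::vector<int> out(warp);
    for (std::size_t lane = 0; lane < warp; ++lane) {
        const std::size_t src = lane + delta;
        out[lane] = (src < warp) ? var[src] : var[lane];
    }
    return out;
}
```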

64 changes: 64 additions & 0 deletions docs/how-to/hip_porting_guide.rst
@@ -611,6 +611,70 @@ code, while the host can query it during runtime via the device properties. See
the :ref:`HIP language extension for warpSize <warp_size>` for information on
how to write portable wave-aware code.

Lane masks bit-shift
================================================================================

A thread in a warp is also called a lane, and a lane mask is a bitmask in which
each bit corresponds to a thread in a warp: a bit is 1 if the thread is active
and 0 if it is inactive. Bit-shift operations are typically used to create lane
masks, and because ``warpSize`` differs between AMD GPU architectures, it is
essential to use the correct bitmask type when porting code.

Example:

.. code-block:: cpp

// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;

// Use the lane ID for the bit shift
val & ((1 << (threadIdx.x % warpSize)) - 1);

// Shift a 32-bit integer by the val variable
WarpReduce::sum((val < warpSize) ? (1 << val) : 0);

Lane masks are 32-bit integer types, as this is the integer precision that C
assigns to such constants by default. Because GCN/CDNA architectures have a warp
size of 64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may
take values greater than 31. Consequently, shifting by such values would clear
the 32-bit register to which the shift operation is applied. For AMD
architectures, a straightforward fix could look as follows:

.. code-block:: cpp

// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;

// Use the lane ID for the bit shift
val & ((1ull << (threadIdx.x % warpSize)) - 1);

// Shift a 64-bit integer by the val variable
WarpReduce::sum((val < warpSize) ? (1ull << val) : 0);

For portability reasons, it is better to introduce appropriately
typed placeholders as shown below:

.. code-block:: cpp

#if defined(__GFX8__) || defined(__GFX9__)
typedef uint64_t lane_mask_t;
#else
typedef uint32_t lane_mask_t;
#endif

The use of :code:`lane_mask_t` with the previous example:

.. code-block:: cpp

// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;

// Use the lane ID for the bit shift
val & ((lane_mask_t{1} << (threadIdx.x % warpSize)) - 1);

// Shift a 32- or 64-bit integer by the val variable
WarpReduce::sum((val < warpSize) ? (lane_mask_t{1} << val) : 0);
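
A standalone host-side check (illustrative; ``lanes_below`` is a hypothetical
helper, and ``uint64_t`` is spelled out to model a 64-wide wavefront) shows why
the wide type matters for lane IDs at position 32 and above:

```cpp
#include <cstdint>

// Build the "all lanes below me" mask for a given lane ID, using a 64-bit
// type so that lane IDs up to 63 (wavefront size 64) shift correctly.
// A 32-bit mask would invoke undefined behavior for laneId >= 32.
uint64_t lanes_below(unsigned laneId) {
    return (uint64_t{1} << laneId) - 1;
}
```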

Porting from CUDA __launch_bounds__
================================================================================

54 changes: 12 additions & 42 deletions docs/how-to/hip_rtc.rst
@@ -17,9 +17,14 @@ alongside options to guide the compilation.
* This library can be used for compilation on systems without AMD GPU drivers
installed (offline compilation). However, running the compiled code still
requires both the HIP runtime library and GPU drivers on the target system.
* This library depends on Code Object Manager (comgr). You can try to
statically link comgr into HIPRTC to avoid ambiguity.
* Developers can bundle this library with their application.
* HIPRTC leverages AMD's Code Object Manager API (``Comgr``) internally, which
is designed to simplify linking, compiling, and inspecting code objects. For
more information, see the `llvm-project/amd/comgr/README <https://github.com/ROCm/llvm-project/blob/amd-staging/amd/comgr/README.md>`_.
* Comgr may cache HIPRTC compilations. To force full recompilation for each HIPRTC API invocation, set ``AMD_COMGR_CACHE=0``.

- When viewing the *README* in the Comgr GitHub repository you should look at a
specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch.
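
As a sketch (the application name is a placeholder), disabling the cache from a
shell before launching a HIPRTC-based application looks like:

```shell
# Disable the Comgr compilation cache for this process and its children,
# so every HIPRTC API invocation triggers a full recompilation.
export AMD_COMGR_CACHE=0
echo "AMD_COMGR_CACHE=$AMD_COMGR_CACHE"
# ./my_hiprtc_app   # placeholder: run the HIPRTC-based application here
```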

Compilation APIs
===============================================================================
@@ -250,45 +255,6 @@ The full example is below:
HIP_CHECK(hipFree(doutput));
}


Kernel Compilation Cache
===============================================================================

HIPRTC incorporates a cache to avoid recompiling kernels between program
executions. The contents of the cache include the kernel source code (including
the contents of any ``#include`` headers), the compilation flags, and the
compiler version. After a ROCm version update, the kernels are progressively
recompiled, and the new results are cached. When the cache is disabled, each
kernel is recompiled every time it is requested.

Use the following environment variables to manage the cache status as enabled or
disabled, the location for storing the cache contents, and the cache eviction
policy:

* ``AMD_COMGR_CACHE``: By default this variable is unset and the
compilation cache feature is enabled. To disable the feature, set the
environment variable to a value of ``0``.

* ``AMD_COMGR_CACHE_DIR``: By default the value of this environment variable is
defined as ``$XDG_CACHE_HOME/comgr``, which defaults to
``$HOME/.cache/comgr`` on Linux, and ``%LOCALAPPDATA%\cache\comgr``
on Windows. You can specify a different directory for the environment variable
to change the path for cache storage. If the runtime fails to access the
specified cache directory the cache is disabled. If the environment variable
is set to an empty string (``""``), the default directory is used.

* ``AMD_COMGR_CACHE_POLICY``: If assigned a value, the string is interpreted and
applied to the cache pruning policy. The string format is consistent with
`Clang's ThinLTO cache pruning policy <https://rocm.docs.amd.com/projects/llvm-project/en/latest/LLVM/clang/html/ThinLTO.html#cache-pruning>`_.
The default policy is defined as:
``prune_interval=1h:prune_expiration=0h:cache_size=75%:cache_size_bytes=30g:cache_size_files=0``.
If the runtime fails to parse the defined string, or the environment variable
is set to an empty string (""), the cache is disabled.

.. note::

This cache is also shared with the OpenCL runtime shipped with ROCm.

HIPRTC specific options
===============================================================================

@@ -484,14 +450,18 @@ application requires the ingestion of bitcode/IR not derived from the currently
installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that
are compatible with the version of the bitcode/IR.

`Comgr <https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr>`_ is a
`Comgr <https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr/README.md>`_ is a
shared library that incorporates the LLVM/Clang compiler that HIPRTC relies on.
To identify the bitcode/IR version that comgr is compatible with, one can
execute ``clang -v`` using the clang binary from the same ROCm or HIP SDK package.
For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries
released by AMD around mid 2022 would be the best choice, assuming the
LLVM/Clang version included in the package is also version 14.

.. note::
When viewing the *README* in the Comgr GitHub repository you should look at a
specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch.

To ensure smooth operation and compatibility, an application may choose to ship
the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to
clearly specify the version requirements and dependencies. This approach
2 changes: 1 addition & 1 deletion docs/how-to/hip_runtime_api/call_stack.rst
@@ -21,7 +21,7 @@ and AMD GPUs use different approaches. NVIDIA GPUs have the independent thread
scheduling feature where each thread has its own call stack and effective
program counter. On AMD GPUs threads are grouped; each warp has its own call
stack and program counter. Warps are described and explained in the
:ref:`inherent_thread_hierarchy`
:ref:`inherent_thread_model` section.

If a thread or warp exceeds its stack size, a stack overflow occurs, causing
kernel failure. This can be detected using debuggers.