1 change: 1 addition & 0 deletions .wordlist.txt
@@ -15,6 +15,7 @@ bfloat
Bitcode
bitcode
bitcodes
bitmask
blockDim
blockIdx
builtins
3 changes: 2 additions & 1 deletion docs/doxygen/Doxyfile
@@ -834,7 +834,8 @@ INPUT                  = ../../include/hip \
../../../clr/hipamd/include/hip/amd_detail/amd_hip_gl_interop.h \
../../../clr/hipamd/include/hip/amd_detail/amd_surface_functions.h \
../../../clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h \
../../../ROCR-Runtime/src/inc/hsa_ext_amd.h
../../../ROCR-Runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h
# ../../../ROCR-Runtime/src/inc/hsa_ext_amd.h

# This tag can be used to specify the character encoding of the source files
# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
2 changes: 1 addition & 1 deletion docs/faq.rst
@@ -43,7 +43,7 @@ What NVIDIA CUDA features does HIP support?
The :doc:`NVIDIA CUDA runtime API supported by HIP<hipify:tables/CUDA_Runtime_API_functions_supported_by_HIP>`
and :doc:`NVIDIA CUDA driver API supported by HIP<hipify:tables/CUDA_Driver_API_functions_supported_by_HIP>`
pages describe which NVIDIA CUDA APIs are supported and what the equivalents are.
The :doc:`HIP API documentation <doxygen/html/index>` describes each API and
The :doc:`HIP API documentation <reference/hip_runtime_api_reference>` describes each API and
its limitations, if any, compared with the equivalent CUDA API.

The kernel language features are documented in the
138 changes: 130 additions & 8 deletions docs/how-to/hip_cpp_language_extensions.rst
@@ -411,20 +411,18 @@ warpSize
================================================================================

The ``warpSize`` constant contains the number of threads per warp for the given
target device. It can differ between different architectures, and on RDNA
architectures it can even differ between kernel launches, depending on whether
they run in CU or WGP mode. See the
:doc:`hardware features <../reference/hardware_features>` for more
information.
target device. On AMD hardware, this is referred to as the *wavefront size*,
which may vary depending on the architecture. For more details, see the
:doc:`hardware features <../reference/hardware_features>`.

Since ``warpSize`` can differ between devices, it cannot be assumed to be a
compile-time constant on the host. It has to be queried using
:cpp:func:`hipDeviceGetAttribute` or :cpp:func:`hipGetDeviceProperties`, for example:

.. code-block:: cpp

int val;
hipDeviceGetAttribute(&val, hipDeviceAttributeWarpSize, deviceId);
int warpSizeHost;
hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId);

.. note::

@@ -435,6 +433,130 @@ compile-time constant on the host. It has to be queried using
of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of
the compute resources.

The ``warpSize`` parameter will no longer be a compile-time constant in a future
release of ROCm. However, it will still be early-folded by the compiler, which
means it can be used for loop bounds and supports loop unrolling, similarly to a
compile-time warp size.

If a compile-time warp size is still required, for example to select the correct
mask type or code path at compile time, the recommended approach is to determine
the warp size of the GPU on the host side and set up the kernel accordingly, as
shown in the following block reduce example.

The ``block_reduce`` kernel has a template parameter for warp size and performs
a reduction operation in two main phases:

- Shared memory reduction: Reduction is performed iteratively, halving the
number of active threads each step until only a warp remains
(32 or 64 threads, depending on the device).

- Warp-level reduction: Once the shared memory reduction completes, the
remaining threads use warp-level shuffling to sum the remaining values. This
is done efficiently with the ``__shfl_down`` intrinsic, which allows threads within
the warp to exchange values without explicit synchronization.

.. tab-set::

.. tab-item:: Compile-time WarpSize
:sync: template-warpsize

.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size block reduction kernel start]
:end-before: // [Sphinx template warp size block reduction kernel end]
:language: cpp


.. tab-item:: HIP warpSize
:sync: hip-warpsize

.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size block reduction kernel start]
:end-before: // [Sphinx HIP warp size block reduction kernel end]
:language: cpp

The host code with the main function:

- Retrieves the warp size of the GPU (``warpSizeHost``) to determine the optimal
kernel configuration.

- Allocates device memory (``d_data`` for input, ``d_results`` for block-wise
output) and initializes the input vector to 1.

- Generates the mask variables for every warp and copies them to the device.

.. tab-set::

.. tab-item:: Compile-time WarpSize
:sync: template-warpsize

.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size mask generation start]
:end-before: // [Sphinx template warp size mask generation end]
:language: cpp


.. tab-item:: HIP warpSize
:sync: hip-warpsize

.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size mask generation start]
:end-before: // [Sphinx HIP warp size mask generation end]
:language: cpp

- Selects the appropriate kernel specialization based on the warp
size (either 32 or 64) and launches the kernel.

.. tab-set::

.. tab-item:: Compile-time WarpSize
:sync: template-warpsize

.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size select kernel start]
:end-before: // [Sphinx template warp size select kernel end]
:language: cpp


.. tab-item:: HIP warpSize
:sync: hip-warpsize

.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size select kernel start]
:end-before: // [Sphinx HIP warp size select kernel end]
:language: cpp

- Synchronizes the device and copies the results back to the host.

- Checks that each block's sum equals the expected mask bit count, verifying
the reduction's correctness.

- Frees the device memory to prevent memory leaks.

.. note::

The runtime ``warpSize`` example code is also provided for comparison purposes;
the full example codes are located in the `tools folder <https://github.com/ROCm/hip/tree/docs/develop/docs/tools/example_codes>`_.

The variable ``warpSize`` can be used for loop bounds and supports
loop unrolling similarly to the template parameter ``WarpSize``.

For users who still require a compile-time constant warp size as a macro on the
device side, it can be defined manually based on the target device architecture,
as shown in the following example.

.. code-block:: cpp

#if defined(__GFX8__) || defined(__GFX9__)
#define WarpSize 64
#else
#define WarpSize 32
#endif

.. note::

The ``mwavefrontsize64`` compiler option is not supported by the HIP runtime,
which is why the architecture-based compile-time selection is an acceptable approach.
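
As an illustrative, host-compilable sketch (the ``lane_mask_t`` alias and the
``static_assert`` message are assumptions, not part of the HIP API), such a
macro can drive a compile-time choice of mask type. Note that when compiling
for the host neither architecture macro is defined, so ``WarpSize`` resolves to 32:

```cpp
#include <cstdint>
#include <type_traits>

// Architecture-based compile-time warp size (as described above).
#if defined(__GFX8__) || defined(__GFX9__)
#define WarpSize 64
#else
#define WarpSize 32
#endif

// Pick a lane-mask type wide enough to hold one bit per lane.
using lane_mask_t =
    std::conditional<WarpSize == 64, uint64_t, uint32_t>::type;

// Sanity check: the mask type must provide exactly one bit per lane.
static_assert(sizeof(lane_mask_t) * 8 == WarpSize,
              "lane mask must hold one bit per lane");
```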

********************************************************************************
Vector types
********************************************************************************
@@ -857,7 +979,7 @@ The different shuffle functions behave as following:
of range, the thread returns its own ``var``.

``__shfl_down``
The thread reads ``var`` from lane ``laneIdx - delta``, thereby "shuffling"
The thread reads ``var`` from lane ``laneIdx + delta``, thereby "shuffling"
the values of the lanes of the warp "down". If the resulting source lane is
out of range, the thread returns its own ``var``.
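
As a host-side illustration (plain C++, not device code; the function name is
hypothetical, since real device code calls the intrinsic directly), the
source-lane rule of ``__shfl_down`` can be modeled as:

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the __shfl_down source-lane rule: lane i reads var from
// lane i + delta, and lanes whose source is out of range keep their own var.
std::vector<int> shfl_down_model(const std::vector<int>& var, unsigned delta) {
    const std::size_t warp = var.size();  // 32 or 64 on real hardware
    std::vector<int> out(warp);
    for (std::size_t lane = 0; lane < warp; ++lane) {
        const std::size_t src = lane + delta;
        out[lane] = (src < warp) ? var[src] : var[lane];
    }
    return out;
}
```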

64 changes: 64 additions & 0 deletions docs/how-to/hip_porting_guide.rst
@@ -611,6 +611,70 @@ code, while the host can query it during runtime via the device properties. See
the :ref:`HIP language extension for warpSize <warp_size>` for information on
how to write portable wave-aware code.

Lane masks bit-shift
================================================================================

A thread in a warp is also called a lane, and a lane mask is a bitmask in which
each bit corresponds to a thread in a warp: a bit is 1 if the thread is active
and 0 if it is inactive. Bit-shift operations are typically used to create lane
masks, and because ``warpSize`` differs between AMD GPU architectures, it is
essential to use the correct bitmask type when porting code.

Example:

.. code-block:: cpp

// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;

// Use the lane ID for the bit shift
val & ((1 << (threadIdx.x % warpSize)) - 1);

// Shift a 32-bit integer by the val variable
WarpReduce::sum((val < warpSize) ? (1 << val) : 0);

Lane masks are 32-bit integer types, as this is the integer precision that C
assigns to such constants by default. Because GCN/CDNA architectures have a warp
size of 64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may
take values greater than 31. Consequently, shifting by such values would clear
the 32-bit register to which the shift operation is applied. For AMD
architectures, a straightforward fix could look as follows:

.. code-block:: cpp

// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;

// Use the lane ID for the bit shift
val & ((1ull << (threadIdx.x % warpSize)) - 1);

// Shift a 64-bit integer by the val variable
WarpReduce::sum((val < warpSize) ? (1ull << val) : 0);

For portability reasons, it is better to introduce appropriately
typed placeholders as shown below:

.. code-block:: cpp

#if defined(__GFX8__) || defined(__GFX9__)
typedef uint64_t lane_mask_t;
#else
typedef uint32_t lane_mask_t;
#endif

The use of :code:`lane_mask_t` with the previous example:

.. code-block:: cpp

// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;

// Use the lane ID for the bit shift
val & ((lane_mask_t{1} << (threadIdx.x % warpSize)) - 1);

// Shift a 32- or 64-bit integer by the val variable
WarpReduce::sum((val < warpSize) ? (lane_mask_t{1} << val) : 0);
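
A standalone host-side check (illustrative; ``lanes_below`` is a hypothetical
helper, and ``uint64_t`` is spelled out to model a 64-wide wavefront) shows why
the wide type matters for lane IDs at position 32 and above:

```cpp
#include <cstdint>

// Build the "all lanes below me" mask for a given lane ID, using a 64-bit
// type so that lane IDs up to 63 (wavefront size 64) shift correctly.
// A 32-bit mask would invoke undefined behavior for laneId >= 32.
uint64_t lanes_below(unsigned laneId) {
    return (uint64_t{1} << laneId) - 1;
}
```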

Porting from CUDA __launch_bounds__
================================================================================

54 changes: 12 additions & 42 deletions docs/how-to/hip_rtc.rst
@@ -17,9 +17,14 @@ alongside options to guide the compilation.
* This library can be used for compilation on systems without AMD GPU drivers
installed (offline compilation). However, running the compiled code still
requires both the HIP runtime library and GPU drivers on the target system.
* This library depends on Code Object Manager (comgr). You can try to
statically link comgr into HIPRTC to avoid ambiguity.
* Developers can bundle this library with their application.
* HIPRTC leverages AMD's Code Object Manager API (``Comgr``) internally, which
is designed to simplify linking, compiling, and inspecting code objects. For
more information, see the `llvm-project/amd/comgr/README <https://github.com/ROCm/llvm-project/blob/amd-staging/amd/comgr/README.md>`_.
* Comgr may cache HIPRTC compilations. To force full recompilation for each HIPRTC API invocation, set ``AMD_COMGR_CACHE=0``.

- When viewing the *README* in the Comgr GitHub repository you should look at a
specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch.
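
As a sketch (the application name is a placeholder), disabling the cache from a
shell before launching a HIPRTC-based application looks like:

```shell
# Disable the Comgr compilation cache for this process and its children,
# so every HIPRTC API invocation triggers a full recompilation.
export AMD_COMGR_CACHE=0
echo "AMD_COMGR_CACHE=$AMD_COMGR_CACHE"
# ./my_hiprtc_app   # placeholder: run the HIPRTC-based application here
```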

Compilation APIs
===============================================================================
@@ -250,45 +255,6 @@ The full example is below:
HIP_CHECK(hipFree(doutput));
}


Kernel Compilation Cache
===============================================================================

HIPRTC incorporates a cache to avoid recompiling kernels between program
executions. The contents of the cache include the kernel source code (including
the contents of any ``#include`` headers), the compilation flags, and the
compiler version. After a ROCm version update, the kernels are progressively
recompiled, and the new results are cached. When the cache is disabled, each
kernel is recompiled every time it is requested.

Use the following environment variables to manage the cache status as enabled or
disabled, the location for storing the cache contents, and the cache eviction
policy:

* ``AMD_COMGR_CACHE``: By default this variable is unset and the
compilation cache feature is enabled. To disable the feature, set the
environment variable to a value of ``0``.

* ``AMD_COMGR_CACHE_DIR``: By default the value of this environment variable is
defined as ``$XDG_CACHE_HOME/comgr``, which defaults to
``$HOME/.cache/comgr`` on Linux, and ``%LOCALAPPDATA%\cache\comgr``
on Windows. You can specify a different directory for the environment variable
to change the path for cache storage. If the runtime fails to access the
specified cache directory the cache is disabled. If the environment variable
is set to an empty string (``""``), the default directory is used.

* ``AMD_COMGR_CACHE_POLICY``: If assigned a value, the string is interpreted and
applied to the cache pruning policy. The string format is consistent with
`Clang's ThinLTO cache pruning policy <https://rocm.docs.amd.com/projects/llvm-project/en/latest/LLVM/clang/html/ThinLTO.html#cache-pruning>`_.
The default policy is defined as:
``prune_interval=1h:prune_expiration=0h:cache_size=75%:cache_size_bytes=30g:cache_size_files=0``.
If the runtime fails to parse the defined string, or the environment variable
is set to an empty string (""), the cache is disabled.

.. note::

This cache is also shared with the OpenCL runtime shipped with ROCm.

HIPRTC specific options
===============================================================================

@@ -484,14 +450,18 @@ application requires the ingestion of bitcode/IR not derived from the currently
installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that
are compatible with the version of the bitcode/IR.

`Comgr <https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr>`_ is a
`Comgr <https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr/README.md>`_ is a
shared library that incorporates the LLVM/Clang compiler that HIPRTC relies on.
To identify the bitcode/IR version that comgr is compatible with, one can
execute ``clang -v`` using the clang binary from the same ROCm or HIP SDK package.
For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries
released by AMD around mid 2022 would be the best choice, assuming the
LLVM/Clang version included in the package is also version 14.

.. note::
When viewing the *README* in the Comgr GitHub repository you should look at a
specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch.

To ensure smooth operation and compatibility, an application may choose to ship
the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to
clearly specify the version requirements and dependencies. This approach
2 changes: 1 addition & 1 deletion docs/how-to/hip_runtime_api/call_stack.rst
@@ -21,7 +21,7 @@ and AMD GPUs use different approaches. NVIDIA GPUs have the independent thread
scheduling feature where each thread has its own call stack and effective
program counter. On AMD GPUs threads are grouped; each warp has its own call
stack and program counter. Warps are described and explained in the
:ref:`inherent_thread_hierarchy`
:ref:`inherent_thread_model` section.

If a thread or warp exceeds its stack size, a stack overflow occurs, causing
kernel failure. This can be detected using debuggers.