diff --git a/.wordlist.txt b/.wordlist.txt index 1bca54a941..6cbf374ae1 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -15,6 +15,7 @@ bfloat Bitcode bitcode bitcodes +bitmask blockDim blockIdx builtins diff --git a/docs/doxygen/Doxyfile b/docs/doxygen/Doxyfile index fb4eaae2de..0d6c94b4e4 100644 --- a/docs/doxygen/Doxyfile +++ b/docs/doxygen/Doxyfile @@ -834,7 +834,8 @@ INPUT = ../../include/hip \ ../../../clr/hipamd/include/hip/amd_detail/amd_hip_gl_interop.h \ ../../../clr/hipamd/include/hip/amd_detail/amd_surface_functions.h \ ../../../clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h \ - ../../../ROCR-Runtime/src/inc/hsa_ext_amd.h + ../../../ROCR-Runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h + # ../../../ROCR-Runtime/src/inc/hsa_ext_amd.h # This tag can be used to specify the character encoding of the source files # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses diff --git a/docs/faq.rst b/docs/faq.rst index 15308e437c..f0d836761f 100644 --- a/docs/faq.rst +++ b/docs/faq.rst @@ -43,7 +43,7 @@ What NVIDIA CUDA features does HIP support? The :doc:`NVIDIA CUDA runtime API supported by HIP` and :doc:`NVIDIA CUDA driver API supported by HIP` pages describe which NVIDIA CUDA APIs are supported and what the equivalents are. -The :doc:`HIP API documentation ` describes each API and +The :doc:`HIP API documentation ` describes each API and its limitations, if any, compared with the equivalent CUDA API. The kernel language features are documented in the diff --git a/docs/how-to/hip_cpp_language_extensions.rst b/docs/how-to/hip_cpp_language_extensions.rst index afedf9cc06..fba47b2a6a 100644 --- a/docs/how-to/hip_cpp_language_extensions.rst +++ b/docs/how-to/hip_cpp_language_extensions.rst @@ -411,11 +411,9 @@ warpSize ================================================================================ The ``warpSize`` constant contains the number of threads per warp for the given -target device. 
It can differ between different architectures, and on RDNA -architectures it can even differ between kernel launches, depending on whether -they run in CU or WGP mode. See the -:doc:`hardware features <../reference/hardware_features>` for more -information. +target device. On AMD hardware, this is referred to as ``wavefront size``, which +may vary depending on the architecture. For more details, see the +:doc:`hardware features <../reference/hardware_features>`. Since ``warpSize`` can differ between devices, it can not be assumed to be a compile-time constant on the host. It has to be queried using @@ -423,8 +421,8 @@ compile-time constant on the host. It has to be queried using .. code-block:: cpp - int val; - hipDeviceGetAttribute(&val, hipDeviceAttributeWarpSize, deviceId); + int warpSizeHost; + hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId); .. note:: @@ -435,6 +433,130 @@ compile-time constant on the host. It has to be queried using of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of the compute resources. +The ``warpSize`` parameter will no longer be a compile-time constant in a future +release of ROCm; however, it will still be early folded by the compiler, which +means it can be used for loop bounds and supports loop unrolling similarly to +a compile-time warp size. + +If the compile-time warp size is still required, for example, to select the correct +mask type or code path at compile time, the recommended approach is to determine +the warp size of the GPU on the host side and set up the kernel accordingly, as shown +in the following block-reduce example. + +The ``block_reduce`` kernel has a template parameter for warp size and performs +a reduction operation in two main phases: + +- Shared memory reduction: Reduction is performed iteratively, halving the + number of active threads each step until only a warp remains + (32 or 64 threads, depending on the device). 
+ +- Warp-level reduction: Once the shared memory reduction completes, the + remaining threads use warp-level shuffling to sum the remaining values. This + is done efficiently with the ``__shfl_down`` intrinsic, which allows threads within + the warp to exchange values without explicit synchronization. + +.. tab-set:: + + .. tab-item:: WarpSize Template Parameter + :sync: template-warpsize + + .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip + :start-after: // [Sphinx template warp size block reduction kernel start] + :end-before: // [Sphinx template warp size block reduction kernel end] + :language: cpp + + + .. tab-item:: HIP warpSize + :sync: hip-warpsize + + .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip + :start-after: // [Sphinx HIP warp size block reduction kernel start] + :end-before: // [Sphinx HIP warp size block reduction kernel end] + :language: cpp + +The host code with the main function: + +- Retrieves the warp size of the GPU (``warpSizeHost``) to determine the optimal + kernel configuration. + +- Allocates device memory (``d_data`` for input, ``d_results`` for block-wise + output) and initializes the input vector to 1. + +- Generates the mask variables for every warp and copies them to the device. + + .. tab-set:: + + .. tab-item:: Compile-time WarpSize + :sync: template-warpsize + + .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip + :start-after: // [Sphinx template warp size mask generation start] + :end-before: // [Sphinx template warp size mask generation end] + :language: cpp + + + .. tab-item:: HIP warpSize + :sync: hip-warpsize + + .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip + :start-after: // [Sphinx HIP warp size mask generation start] + :end-before: // [Sphinx HIP warp size mask generation end] + :language: cpp + +- Selects the appropriate kernel specialization based on the warp + size (either 32 or 64) and launches the kernel. + + .. 
tab-set:: + + .. tab-item:: Compile-time WarpSize + :sync: template-warpsize + + .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip + :start-after: // [Sphinx template warp size select kernel start] + :end-before: // [Sphinx template warp size select kernel end] + :language: cpp + + + .. tab-item:: HIP warpSize + :sync: hip-warpsize + + .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip + :start-after: // [Sphinx HIP warp size select kernel start] + :end-before: // [Sphinx HIP warp size select kernel end] + :language: cpp + +- Synchronizes the device and copies the results back to the host. + +- Checks that each block's sum is equal to the expected mask bit count, + verifying the reduction's correctness. + +- Frees the device memory to prevent memory leaks. + +.. note:: + + The ``warpSize`` runtime example code is also provided for comparison purposes, + and the full example codes are located in the `tools folder `_. + + The variable ``warpSize`` can be used for loop bounds and supports + loop unrolling similarly to the template parameter ``WarpSize``. + +For users who still require a compile-time constant warp size as a macro on the +device side, it can be defined manually based on the target device architecture, +as shown in the following example. + +.. code-block:: cpp + + #if defined(__GFX8__) || defined(__GFX9__) + #define WarpSize 64 + #else + #define WarpSize 32 + #endif + +.. note:: + + The ``-mwavefrontsize64`` compiler option is not supported by the HIP runtime, which is + why the architecture-based compile-time selector is an acceptable approach. + ******************************************************************************** Vector types ******************************************************************************** @@ -857,7 +979,7 @@ The different shuffle functions behave as following: of range, the thread returns its own ``var``. 
``__shfl_down`` - The thread reads ``var`` from lane ``laneIdx - delta``, thereby "shuffling" + The thread reads ``var`` from lane ``laneIdx + delta``, thereby "shuffling" the values of the lanes of the warp "down". If the resulting source lane is out of range, the thread returns its own ``var``. diff --git a/docs/how-to/hip_porting_guide.rst b/docs/how-to/hip_porting_guide.rst index 136084f66b..8e2b0f2c5e 100644 --- a/docs/how-to/hip_porting_guide.rst +++ b/docs/how-to/hip_porting_guide.rst @@ -611,6 +611,70 @@ code, while the host can query it during runtime via the device properties. See the :ref:`HIP language extension for warpSize ` for information on how to write portable wave-aware code. +Lane masks bit-shift +================================================================================ + +A thread in a warp is also called a lane, and a lane mask is a bitmask where +each bit corresponds to a thread in a warp. A bit is 1 if the thread is active, +0 if it's inactive. Bit-shift operations are typically used to create lane masks, +and on AMD GPUs the ``warpSize`` can differ between architectures, +which is why it is essential to use the correct bitmask type when porting code. + +Example: + +.. code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((1 << (threadIdx.x % warpSize) )-1 ); + + // Shift 32 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (1 << val) : 0); + +Lane masks here are 32-bit integers, as this is the type that C +assigns to such constants by default. Because GCN/CDNA architectures have a warp size of +64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may take +values greater than 31. Consequently, shifting by such values would clear the +32-bit register to which the shift operation is applied. For AMD +architectures, a straightforward fix could look as follows: + +.. 
code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((1ull << (threadIdx.x % warpSize) )-1 ); + + // Shift 64 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0); + +For portability reasons, it is better to introduce appropriately +typed placeholders as shown below: + +.. code-block:: cpp + + #if defined(__GFX8__) || defined(__GFX9__) + typedef uint64_t lane_mask_t; + #else + typedef uint32_t lane_mask_t; + #endif + +The use of :code:`lane_mask_t` with the previous example: + +.. code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 ); + + // Shift 32 or 64 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0); + Porting from CUDA __launch_bounds__ ================================================================================ diff --git a/docs/how-to/hip_rtc.rst b/docs/how-to/hip_rtc.rst index 223e11081c..861e07fec8 100644 --- a/docs/how-to/hip_rtc.rst +++ b/docs/how-to/hip_rtc.rst @@ -17,9 +17,14 @@ alongside options to guide the compilation. * This library can be used for compilation on systems without AMD GPU drivers installed (offline compilation). However, running the compiled code still requires both the HIP runtime library and GPU drivers on the target system. - * This library depends on Code Object Manager (comgr). You can try to - statically link comgr into HIPRTC to avoid ambiguity. * Developers can bundle this library with their application. + * HIPRTC leverages AMD's Code Object Manager API (``Comgr``) internally, which + is designed to simplify linking, compiling, and inspecting code objects. For + more information, see the `llvm-project/amd/comgr/README `_. + * Comgr may cache HIPRTC compilations. 
To force full recompilation for each HIPRTC API invocation, set ``AMD_COMGR_CACHE=0``. + + - When viewing the *README* in the Comgr GitHub repository, you should look at a + specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch. Compilation APIs =============================================================================== @@ -250,45 +255,6 @@ The full example is below: HIP_CHECK(hipFree(doutput)); } - -Kernel Compilation Cache -=============================================================================== - -HIPRTC incorporates a cache to avoid recompiling kernels between program -executions. The contents of the cache include the kernel source code (including -the contents of any ``#include`` headers), the compilation flags, and the -compiler version. After a ROCm version update, the kernels are progressively -recompiled, and the new results are cached. When the cache is disabled, each -kernel is recompiled every time it is requested. - -Use the following environment variables to manage the cache status as enabled or -disabled, the location for storing the cache contents, and the cache eviction -policy: - -* ``AMD_COMGR_CACHE`` By default this variable is unset and the - compilation cache feature is enabled. To disable the feature set the - environment variable to a value of ``0``. - -* ``AMD_COMGR_CACHE_DIR``: By default the value of this environment variable is - defined as ``$XDG_CACHE_HOME/comgr``, which defaults to - ``$USER/.cache/comgr`` on Linux, and ``%LOCALAPPDATA%\cache\comgr`` - on Windows. You can specify a different directory for the environment variable - to change the path for cache storage. If the runtime fails to access the - specified cache directory the cache is disabled. If the environment variable - is set to an empty string (``""``), the default directory is used. - -* ``AMD_COMGR_CACHE_POLICY``: If assigned a value, the string is interpreted and - applied to the cache pruning policy. 
The string format is consistent with - `Clang's ThinLTO cache pruning policy `_. - The default policy is defined as: - ``prune_interval=1h:prune_expiration=0h:cache_size=75%:cache_size_bytes=30g:cache_size_files=0``. - If the runtime fails to parse the defined string, or the environment variable - is set to an empty string (""), the cache is disabled. - -.. note:: - - This cache is also shared with the OpenCL runtime shipped with ROCm. - HIPRTC specific options =============================================================================== @@ -484,7 +450,7 @@ application requires the ingestion of bitcode/IR not derived from the currently installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that are compatible with the version of the bitcode/IR. -`Comgr `_ is a +`Comgr `_ is a shared library that incorporates the LLVM/Clang compiler that HIPRTC relies on. To identify the bitcode/IR version that comgr is compatible with, one can execute "clang -v" using the clang binary from the same ROCm or HIP SDK package. @@ -492,6 +458,10 @@ For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries released by AMD around mid 2022 would be the best choice, assuming the LLVM/Clang version included in the package is also version 14. +.. note:: + When viewing the *README* in the Comgr GitHub repository you should look at a + specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch. + To ensure smooth operation and compatibility, an application may choose to ship the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to clearly specify the version requirements and dependencies. 
This approach diff --git a/docs/how-to/hip_runtime_api/call_stack.rst b/docs/how-to/hip_runtime_api/call_stack.rst index 43354cd0cf..a9d03bb493 100644 --- a/docs/how-to/hip_runtime_api/call_stack.rst +++ b/docs/how-to/hip_runtime_api/call_stack.rst @@ -21,7 +21,7 @@ and AMD GPUs use different approaches. NVIDIA GPUs have the independent thread scheduling feature where each thread has its own call stack and effective program counter. On AMD GPUs threads are grouped; each warp has its own call stack and program counter. Warps are described and explained in the -:ref:`inherent_thread_hierarchy` +:ref:`inherent_thread_model`. If a thread or warp exceeds its stack size, a stack overflow occurs, causing kernel failure. This can be detected using debuggers. diff --git a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst index 55f3ceb35b..838433c313 100644 --- a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst @@ -41,48 +41,64 @@ Unified memory enables the access to memory located on other devices via several methods, depending on whether hardware support is available or has to be managed by the driver. -Hardware supported on-demand page migration -------------------------------------------------------------------------------- - -When a kernel on the device tries to access a memory address that is not in its -memory, a page-fault is triggered. The GPU then in turn requests the page from -the host or an other device, on which the memory is located. The page is then -unmapped from the source, sent to the device and mapped to the device's memory. -The requested memory is then available to the processes running on the device. - -In case the device's memory is at capacity, a page is unmapped from the device's -memory first and sent and mapped to host memory. 
This enables more memory to be -allocated and used for a GPU, than the GPU itself has physically available. - -This level of unified memory support can be very beneficial for sparse accesses -to an array, that is not often used on the device. - -Driver managed page migration -------------------------------------------------------------------------------- +Managed memory +================================================================================ -If the hardware does not support on-demand page migration, then all the pages -accessed by a kernel have to be resident on the device, so they have to be -migrated before the kernel is running. Since the driver can not know beforehand, -what parts of an array are going to be accessed, all pages of all accessed -arrays have to be migrated. This can lead to significant delays on the first run -of a kernel, on top of possibly copying more memory than is actually accessed by -the kernel. +Managed memory is an extension of the unified memory architecture in which HIP +monitors memory access and intelligently migrates data between device and +system memories, thereby improving performance and resource efficiency. + +When a kernel on the device tries to access a managed memory address that is +not in its local device memory, a page-fault is triggered. The GPU then in +turn requests the page from the host or another device on which the memory is +located. The page is then unmapped from the source, sent to the device and +mapped to the device's memory. The requested memory is then available locally +to the processes running on the device, which improves performance as local +memory access outperforms remote memory access. + +Managed memory also expands the memory capacity available to a GPU kernel. When +migrating memory into the device on page-fault, if the device's memory is +already at capacity, a page is unmapped from the device's memory first and sent +and mapped to host memory. 
This enables more memory to be allocated and used +for a GPU than the GPU itself has physically available. This level of support +can be very beneficial, for example, for sparse accesses to an array that is +not often used on the device. .. _unified memory system requirements: -System requirements -================================================================================ +System requirements for managed memory +-------------------------------------------------------------------------------- -Unified memory is supported on Linux by all modern AMD GPUs from the Vega -series onward, as shown in the following table. Unified memory management can -be achieved by explicitly allocating managed memory using -:cpp:func:`hipMallocManaged` or marking variables with the ``__managed__`` -attribute. For the latest GPUs, with a Linux kernel that supports -`Heterogeneous Memory Management (HMM) +Some AMD GPUs do not support page-faults, and thus do not support on-demand +page-fault driven migration. On these architectures, if the programmer prefers +all GPU memory accesses to be local, all pages have to be migrated before the +kernel is dispatched, as the driver cannot know beforehand which parts of a +dataset are going to be accessed. This can lead to significant delays on the +first run of a kernel, and, in the example of a sparsely accessed array, can +also lead to copying more memory than is actually accessed by the kernel. + +Note that on systems which do not support page-faults, managed memory APIs are +still accessible to the programmer, but managed memory operates in a degraded +fashion due to the lack of demand-driven migration. Furthermore, on these +systems it is still possible to use other unified memory allocators that do not +provide managed memory features. + +Managed memory is supported on Linux by all modern AMD GPUs from the Vega +series onward, as shown in the following table. 
Managed memory can be +explicitly allocated using :cpp:func:`hipMallocManaged()` or by marking variables +with the ``__managed__`` attribute. For the latest GPUs, with a Linux kernel +that supports `Heterogeneous Memory Management (HMM) `_, the normal system -allocator can be used. +allocators can be used. -.. list-table:: Supported Unified Memory Allocators by GPU architecture +Note: to ensure the proper functioning of managed memory on supported GPUs, it +is **essential** to set the environment variable ``HSA_XNACK=1`` and use a GPU +kernel mode driver that supports HMM +`_. Without this +configuration, access-driven memory migration will be disabled, and the +behavior will be similar to that of systems without HMM support. + +.. list-table:: Managed Memory Support by GPU Architecture :widths: 40, 25, 25 :header-rows: 1 :align: center @@ -98,7 +114,7 @@ allocator can be used. - ✅ :sup:`1` * - CDNA1 - ✅ - - ✅ :sup:`1` + - ❌ * - RDNA1 - ✅ - ❌ @@ -138,12 +154,11 @@ system requirements` and :ref:`checking unified memory support`. offers an easy transition for code written for CPUs to HIP code as the same system allocation API is used. -To ensure the proper functioning of system allocated unified memory on supported -GPUs, it is essential to set the environment variable ``HSA_XNACK=1`` and use -a GPU kernel mode driver that supports HMM -`_. Without this -configuration, the behavior will be similar to that of systems without HMM -support. +- **HIP allocated non-managed memory** + + :cpp:func:`hipMalloc()` and :cpp:func:`hipHostMalloc()` are dynamic memory + allocators available on all GPUs with unified memory support. Memory + allocated by these allocators is not migrated between device and host memory. The table below illustrates the expected behavior of managed and unified memory functions on ROCm and CUDA, both with and without HMM support. 
- host - page-fault migration * - :cpp:func:`hipHostRegister()` + - pinned host + - zero copy [zc]_ - undefined behavior - undefined behavior - - host - - page-fault migration * - :cpp:func:`hipHostMalloc()` - pinned host - zero copy [zc]_ diff --git a/docs/how-to/performance_guidelines.rst b/docs/how-to/performance_guidelines.rst index 33dbbb4af4..efe532e6b6 100644 --- a/docs/how-to/performance_guidelines.rst +++ b/docs/how-to/performance_guidelines.rst @@ -262,7 +262,9 @@ For example, when the control condition depends on ``threadIdx`` or ``warpSize`` warp doesn't diverge. The compiler might optimize loops, short ifs, or switch blocks using branch predication, which prevents warp divergence. With branch predication, instructions associated with a false predicate are scheduled but -not executed, which avoids unnecessary operations. +not executed, which avoids unnecessary operations. For control conditions where +one outcome is significantly more likely than the other, use `__builtin_expect `_ +or ``[[likely]]`` to indicate the likely condition result. Avoiding divergent warps ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/docs/install/build.rst b/docs/install/build.rst index 64deba241b..b0a7baa43d 100644 --- a/docs/install/build.rst +++ b/docs/install/build.rst @@ -9,27 +9,28 @@ Build HIP from source Prerequisites ================================================= -HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, or a CUDA platform with ``nvcc`` installed. -Before building and running HIP, make sure drivers and prebuilt packages are installed properly on the platform. +HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, +or a CUDA platform with ``nvcc`` installed. Before building and running HIP, +make sure drivers and prebuilt packages are installed properly on the platform. You also need to install Python 3, which includes the ``CppHeaderParser`` package. 
Install Python 3 using the following command: .. code-block:: shell - apt-get install python3 + apt-get install python3 Check and install ``CppHeaderParser`` package using the command: .. code-block:: shell - pip3 install CppHeaderParser + pip3 install CppHeaderParser Install ``ROCm LLVM`` package using the command: .. code-block:: shell - apt-get install rocm-llvm-dev + apt-get install rocm-llvm-dev .. _Building the HIP runtime: @@ -41,201 +42,209 @@ Set the repository branch using the variable: ``ROCM_BRANCH``. For example, for .. code-block:: shell - export ROCM_BRANCH=rocm-6.1.x + export ROCM_BRANCH=rocm-6.1.x .. tab-set:: - .. tab-item:: AMD - :sync: amd + .. tab-item:: AMD + :sync: amd - #. Get HIP source code. + #. Get HIP source code. - .. note:: - Starting in ROCM 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and - OpenCl repositories. OpenCL provides headers that ROCclr runtime depends on. + .. note:: + + Starting in ROCM 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and + OpenCl repositories. OpenCL provides headers that ROCclr runtime depends on. - .. note:: - Starting in ROCM 6.1, a new repository ``hipother`` is added to ROCm, which is branched out from HIP. - ``hipother`` provides files required to support the HIP back-end implementation on some non-AMD platforms, - like NVIDIA. + .. note:: - .. code-block:: shell + Starting in ROCM 6.1, a new repository ``hipother`` is added to ROCm, which is branched out from HIP. + ``hipother`` provides files required to support the HIP back-end implementation on some non-AMD platforms, + like NVIDIA. - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git + .. code-block:: shell - CLR (Compute Language Runtime) repository includes ROCclr, HIPAMD and OpenCL. 
+ git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git - ROCclr (ROCm Compute Language Runtime) is a virtual device interface which - is defined on the AMD platform. HIP runtime uses ROCclr to interact with different backends. + CLR (Compute Language Runtime) repository includes ROCclr, HIPAMD and OpenCL. - HIPAMD provides implementation specifically for HIP on the AMD platform. + ROCclr (ROCm Compute Language Runtime) is a virtual device interface which + is defined on the AMD platform. HIP runtime uses ROCclr to interact with different backends. - OpenCL provides headers that ROCclr runtime currently depends on. - hipother provides headers and implementation specifically for non-AMD HIP platforms, like NVIDIA. + HIPAMD provides implementation specifically for HIP on the AMD platform. - #. Set the environment variables. + OpenCL provides headers that ROCclr runtime currently depends on. + hipother provides headers and implementation specifically for non-AMD HIP platforms, like NVIDIA. - .. code-block:: shell + #. Set the environment variables. - export CLR_DIR="$(readlink -f clr)" - export HIP_DIR="$(readlink -f hip)" + .. code-block:: shell + export CLR_DIR="$(readlink -f clr)" + export HIP_DIR="$(readlink -f hip)" - #. Build HIP. - .. code-block:: shell + #. Build HIP. - cd "$CLR_DIR" - mkdir -p build; cd build - cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF .. + .. code-block:: shell - make -j$(nproc) - sudo make install + cd "$CLR_DIR" + mkdir -p build; cd build + cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF .. - .. 
note:: + make -j$(nproc) + sudo make install - Note, if you don't specify ``CMAKE_INSTALL_PREFIX``, the HIP runtime is installed at - ````. + .. note:: - By default, release version of HIP is built. If need debug version, you can put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line. + Note, if you don't specify ``CMAKE_INSTALL_PREFIX``, the HIP runtime is installed at + ````. - Default paths and environment variables: + By default, the release version of HIP is built. If you need the debug version, you can + put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line. - * HIP is installed into ````. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option. - environment variable. - * HSA is in ````. This can be overridden by setting the ``HSA_PATH`` - environment variable. - * Clang is in ``/llvm/bin``. This can be overridden by setting the - ``HIP_CLANG_PATH`` environment variable. - * The device library is in ``/lib``. This can be overridden by setting the - ``DEVICE_LIB_PATH`` environment variable. - * Optionally, you can add ``/bin`` to your ``PATH``, which can make it easier to - use the tools. - * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation. + Default paths and environment variables: - After you run the ``make install`` command, HIP is installed to ```` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined. + * HIP is installed into ````. This can be overridden by setting ``INSTALL_PREFIX`` as a command option. + + * HSA is in ````. This can be overridden by setting the ``HSA_PATH`` environment variable. + + * Clang is in ``/llvm/bin``. This can be overridden by setting the ``HIP_CLANG_PATH`` environment variable. + + * The device library is in ``/lib``. This can be overridden by setting the ``DEVICE_LIB_PATH`` environment variable. + + * Optionally, you can add ``/bin`` to your ``PATH``, which can make it easier to use the tools. 
+ + * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation. - #. Generate a profiling header after adding/changing a HIP API. + After you run the ``make install`` command, HIP is installed to ```` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined. - When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header. - This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``. + #. Generate a profiling header after adding/changing a HIP API. - To generate the header after your change, use the ``hip_prof_gen.py`` tool located in - ``hipamd/src``. + When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header. + This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``. - Usage: + To generate the header after your change, use the ``hip_prof_gen.py`` tool located in + ``hipamd/src``. - .. code-block:: shell + Usage: - `hip_prof_gen.py [-v] []` + .. code-block:: shell - Flags: + `hip_prof_gen.py [-v] []` - * ``-v``: Verbose messages - * ``-r``: Process source directory recursively - * ``-t``: API types matching check - * ``--priv``: Private API check - * ``-e``: On error exit mode - * ``-p``: ``HIP_INIT_API`` macro patching mode + Flags: - Example usage: + * ``-v``: Verbose messages + * ``-r``: Process source directory recursively + * ``-t``: API types matching check + * ``--priv``: Private API check + * ``-e``: On error exit mode + * ``-p``: ``HIP_INIT_API`` macro patching mode - .. code-block:: shell + Example usage: - hip_prof_gen.py -v -p -t --priv /include/hip/hip_runtime_api.h \ - /src /include/hip/amd_detail/hip_prof_str.h \ - /include/hip/amd_detail/hip_prof_str.h.new + .. code-block:: shell - .. 
tab-item:: NVIDIA - :sync: nvidia + hip_prof_gen.py -v -p -t --priv /include/hip/hip_runtime_api.h \ + /src /include/hip/amd_detail/hip_prof_str.h \ + /include/hip/amd_detail/hip_prof_str.h.new - #. Get the HIP source code. + .. tab-item:: NVIDIA + :sync: nvidia - .. code-block:: shell + #. Get the HIP source code. - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git + .. code-block:: shell - #. Set the environment variables. + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git - .. code-block:: shell + #. Set the environment variables. - export CLR_DIR="$(readlink -f clr)" - export HIP_DIR="$(readlink -f hip)" - export HIP_OTHER="$(readlink -f hipother)" + .. code-block:: shell - #. Build HIP. + export CLR_DIR="$(readlink -f clr)" + export HIP_DIR="$(readlink -f hip)" + export HIP_OTHER="$(readlink -f hipother)" - .. code-block:: shell + #. Build HIP. - cd "$CLR_DIR" - mkdir -p build; cd build - cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv .. - make -j$(nproc) - sudo make install + .. code-block:: shell + + cd "$CLR_DIR" + mkdir -p build; cd build + cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv .. + make -j$(nproc) + sudo make install + +.. note:: + + The HIP runtime is only buildable on Linux. Build HIP tests ================================================= .. tab-set:: - .. tab-item:: AMD - :sync: amd + .. tab-item:: AMD + :sync: amd - * Build HIP catch tests. 
+ **Build HIP catch tests.** - HIP catch tests are separate from the HIP project and use Catch2. + HIP catch tests are separate from the HIP project and use Catch2. - * Get HIP tests source code. + #. Get HIP tests source code. - .. code-block:: shell + .. code-block:: shell - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip-tests.git + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip-tests.git - * Build HIP tests from source. + #. Build HIP tests from source. - .. code-block:: shell + .. code-block:: shell - export HIPTESTS_DIR="$(readlink -f hip-tests)" - cd "$HIPTESTS_DIR" - mkdir -p build; cd build - cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed; for example: ``/opt/rocm`` - make build_tests - ctest # run tests + export HIPTESTS_DIR="$(readlink -f hip-tests)" + cd "$HIPTESTS_DIR" + mkdir -p build; cd build + cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed; for example: ``/opt/rocm`` + make build_tests + ctest # run tests - HIP catch tests are built in ``$HIPTESTS_DIR/build``. + HIP catch tests are built in ``$HIPTESTS_DIR/build``. - To run any single catch test, use this example: + To run any single catch test, use this example: - .. code-block:: shell + .. code-block:: shell - cd $HIPTESTS_DIR/build/catch_tests/unit/texture - ./TextureTest + cd $HIPTESTS_DIR/build/catch_tests/unit/texture + ./TextureTest - * Build a HIP Catch2 standalone test. + #. Build a HIP Catch2 standalone test. (Optional) - .. code-block:: shell + .. code-block:: shell - cd "$HIPTESTS_DIR" - hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \ - -I ./catch/include ./catch/hipTestMain/standalone_main.cc \ - -I ./catch/external/Catch2 -o hipPointerGetAttributes - ./hipPointerGetAttributes - ... 
+ cd "$HIPTESTS_DIR" + hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \ + -I ./catch/include ./catch/hipTestMain/standalone_main.cc \ + -I ./catch/external/Catch2 -o hipPointerGetAttributes + ./hipPointerGetAttributes + ... - All tests passed + All tests passed - .. tab-item:: NVIDIA - :sync: nvidia + .. tab-item:: NVIDIA + :sync: nvidia - The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform. - However, you must first set ``-DHIP_PLATFORM=nvidia``. + The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform. + However, you must first set ``-DHIP_PLATFORM=nvidia``. Run HIP ================================================= After installation and building HIP, you can compile your application and run. -Simple examples can be found in the `ROCm-examples repository `_. +A simple SAXPY example can be found in the `ROCm-examples repository `_, +and the :doc:`SAXPY tutorial <../tutorial/saxpy>` describes how to build and run it.
tab-item:: AMD + :sync: amd - Refer to the Prerequisites section in the ROCm install guides: + Refer to the Prerequisites section in the ROCm install guides: - * :doc:`rocm-install-on-linux:reference/system-requirements` - * :doc:`rocm-install-on-windows:reference/system-requirements` + * :doc:`rocm-install-on-linux:reference/system-requirements` + * :doc:`rocm-install-on-windows:reference/system-requirements` - .. tab-item:: NVIDIA - :sync: nvidia + .. tab-item:: NVIDIA + :sync: nvidia - With NVIDIA GPUs, HIP requires unified memory. All CUDA-enabled NVIDIA - GPUs with compute capability 5.0 or later should be supported. For more - information, see `NVIDIA's list of CUDA enabled GPUs `_. + With NVIDIA GPUs, HIP requires unified memory. All CUDA-enabled NVIDIA + GPUs with compute capability 5.0 or later should be supported. For more + information, see `NVIDIA's list of CUDA enabled GPUs `_. Installation ======================================= .. tab-set:: - .. tab-item:: AMD - :sync: amd + .. tab-item:: AMD + :sync: amd - HIP is automatically installed during the ROCm installation. If you haven't yet installed ROCm, you - can find installation instructions here: + HIP is automatically installed during the ROCm installation. If you haven't + yet installed ROCm, you can find installation instructions here: - * :doc:`rocm-install-on-linux:index` - * :doc:`rocm-install-on-windows:index` + * :doc:`rocm-install-on-linux:index` + * :doc:`rocm-install-on-windows:index` - By default, HIP is installed into ``/opt/rocm``. + By default, HIP is installed into ``/opt/rocm``. - .. note:: - There is no autodetection for the HIP installation. If you choose to install it somewhere other than the default location, you must set the ``HIP_PATH`` environment variable as explained in `Build HIP from source <./build.html>`_. + .. note:: + + There is no autodetection for the HIP installation. 
If you choose to + install it somewhere other than the default location, you must set the + ``HIP_PATH`` environment variable as explained in + `Build HIP from source <./build.html>`_. - .. tab-item:: NVIDIA - :sync: nvidia + .. tab-item:: NVIDIA + :sync: nvidia - #. Install the NVIDIA toolkit. + #. Install the NVIDIA toolkit. - The latest release can be found here: - `CUDA Toolkit `_. + The latest release can be found here: + `CUDA Toolkit `_. - #. Setup the radeon repo. + #. Set up the Radeon repository. - .. code-block::shell + .. code-block:: shell - # Replace url with appropriate link in the table below - wget https://repo.radeon.com/amdgpu-install/6.2/distro/version_name/amdgpu-install_6.2.60200-1_all.deb - sudo apt install ./amdgpu-install_6.2.60200-1_all.deb - sudo apt update + # Replace the URL with the appropriate link from the table below + wget https://repo.radeon.com/amdgpu-install/6.2/distro/version_name/amdgpu-install_6.2.60200-1_all.deb + sudo apt install ./amdgpu-install_6.2.60200-1_all.deb + sudo apt update - .. list-table:: amdgpu-install links - :widths: 25 100 - :header-rows: 1 + .. list-table:: amdgpu-install links + :widths: 25 100 + :header-rows: 1 - * - Ubuntu version - - URL - * - 24.04 - - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb - * - 22.04 - - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb + * - Ubuntu version + - URL + * - 24.04 + - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb + * - 22.04 + - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb - #. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP - porting layer. + #. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP + porting layer. - .. code-block:: shell + ..
code-block:: shell - apt-get install hip-runtime-nvidia hip-dev + apt-get install hip-runtime-nvidia hip-dev - The default paths are: - * CUDA SDK: ``/usr/local/cuda`` - * HIP: ``/opt/rocm`` + The default paths are: + * CUDA SDK: ``/usr/local/cuda`` + * HIP: ``/opt/rocm`` - #. Set the HIP_PLATFORM to nvidia. + #. Set ``HIP_PLATFORM`` to ``nvidia``. - .. code-block:: shell + .. code-block:: shell - export HIP_PLATFORM="nvidia" + export HIP_PLATFORM="nvidia" Verify your installation ========================================================== @@ -108,4 +112,4 @@ Run ``hipconfig`` in your installation path. .. code-block:: shell - /opt/rocm/bin/hipconfig --full + /opt/rocm/bin/hipconfig --full diff --git a/docs/reference/hardware_features.rst b/docs/reference/hardware_features.rst index f5e227fc78..5bf3a74b81 100644 --- a/docs/reference/hardware_features.rst +++ b/docs/reference/hardware_features.rst @@ -240,10 +240,12 @@ page. - 106 - 104 -.. [1] RDNA architectures have a configurable wavefront size. The native - wavefront size is 32, but they can run in "CU mode", which has an effective - wavefront size of 64. This affects the number of resident wavefronts and - blocks per compute Unit. +.. [1] The RDNA architectures feature an experimental compiler option called + ``mwavefrontsize64``, which determines the wavefront size for kernel code + generation. When this option is disabled, the native wavefront size of 32 is + used; when it is enabled, wavefront size 64 is used. This option is not supported by + the HIP runtime. + .. [2] RDNA architectures expand the concept of the traditional compute unit with the so-called work group processor, which effectively includes two compute units, within which all threads can cooperate.
diff --git a/docs/reference/hip_runtime_api_reference.rst b/docs/reference/hip_runtime_api_reference.rst index e77490f79e..4af699c309 100644 --- a/docs/reference/hip_runtime_api_reference.rst +++ b/docs/reference/hip_runtime_api_reference.rst @@ -8,7 +8,50 @@ HIP runtime API ******************************************************************************** -The HIP Runtime API reference: +The HIP Runtime API reference includes descriptions of HIP functions, as well as global datatypes, enums, and structs. -* :ref:`modules_reference` -* :ref:`global_defines_enums_structs_files_reference` +Modules +======= + +The API is organized into modules based on functionality. + +* :ref:`initialization_version_reference` +* :ref:`device_management_reference` +* :ref:`execution_control_reference` +* :ref:`error_handling_reference` +* :ref:`stream_management_reference` +* :ref:`stream_memory_operations_reference` +* :ref:`event_management_reference` +* :ref:`memory_management_reference` + + * :ref:`memory_management_deprecated_reference` + * :ref:`external_resource_interoperability_reference` + * :ref:`stream_ordered_memory_allocator_reference` + * :ref:`unified_memory_reference` + * :ref:`virtual_memory_reference` + * :ref:`texture_management_reference` + * :ref:`texture_management_deprecated_reference` + * :ref:`surface_object_reference` + +* :ref:`peer_to_peer_device_memory_access_reference` +* :ref:`context_management_reference` +* :ref:`module_management_reference` +* :ref:`occupancy_reference` +* :ref:`profiler_control_reference` +* :ref:`launch_api_reference` +* :ref:`runtime_compilation_reference` +* :ref:`callback_activity_apis_reference` +* :ref:`graph_management_reference` +* :ref:`opengl_interoperability_reference` +* :ref:`graphics_interoperability_reference` +* :ref:`cooperative_groups_reference` + +Global defines, enums, structs and files +======================================== + +The structs, define macros, enums and files in the HIP runtime API. 
+ +* :ref:`global_enum_defines_reference` +* :ref:`driver_types_reference` +* :doc:`../../doxygen/html/annotated` +* :doc:`../../doxygen/html/files` \ No newline at end of file diff --git a/docs/reference/low_fp_types.rst b/docs/reference/low_fp_types.rst index 7fe450a35f..b5645eed61 100644 --- a/docs/reference/low_fp_types.rst +++ b/docs/reference/low_fp_types.rst @@ -12,6 +12,382 @@ and FP16 (Half Precision), which reduce memory and bandwidth requirements compar 32-bit or 64-bit formats. The following sections detail their specifications, variants, and provide practical guidance for implementation in HIP. +FP4 (4-bit Precision) +======================= + +FP4 (Floating Point 4-bit) numbers represent the current extreme in low-precision formats, +pushing the boundaries of memory optimization for specialized AI workloads. This ultra-compact +format is designed for scenarios where model size and computational efficiency are paramount +constraints, even at the cost of significant precision reduction. + +FP4 is particularly valuable in weight storage for large language models (LLMs) and vision +transformers, where aggressive quantization can dramatically reduce model size while +maintaining acceptable inference quality. By reducing memory footprint to a quarter of FP16, +FP4 enables deployment of larger models in memory-constrained environments or higher throughput +in existing hardware. + +The supported FP4 format is: + +- **E2M1 Format** + + - Sign: 1 bit + - Exponent: 2 bits + - Mantissa: 1 bit + +The E2M1 format offers a balance between minimal precision and a reasonable dynamic range, +optimized for weight storage in neural network applications. + +HIP Header +---------- + +The `HIP FP4 header `_ +defines the FP4 numbers. + +Supported Devices +----------------- + +Different GPU models support different FP4 formats. Here's a breakdown: + +.. 
list-table:: Supported devices for fp4 numbers + :header-rows: 1 + + * - Device Type + - E2M1 + * - Host + - Yes + * - CDNA1 + - No + * - CDNA2 + - No + * - CDNA3 + - Yes + * - RDNA2 + - No + * - RDNA3 + - No + +Using FP4 Numbers in HIP Programs +--------------------------------- + +To use the FP4 numbers inside HIP programs: + +.. code-block:: cpp + + #include + +FP4 numbers can be used on CPU side: + +.. code-block:: cpp + + __hip_fp4_storage_t convert_float_to_fp4( + float in, /* Input val */ + __hip_saturation_t sat /* Saturation behavior */ + ) { + return __hip_cvt_float_to_fp4(in, __HIP_E2M1, sat); + } + +The same can be done in kernels as well: + +.. code-block:: cpp + + __device__ __hip_fp4_storage_t d_convert_float_to_fp4( + float in, + __hip_saturation_t sat) { + return __hip_cvt_float_to_fp4(in, __HIP_E2M1, sat); + } + +The following code example demonstrates a simple roundtrip conversion using FP4 types: + +.. code-block:: cpp + + #include + #include + #include + #include + + #define hip_check(hip_call) \ + { \ + auto hip_res = hip_call; \ + if (hip_res != hipSuccess) { \ + std::cerr << "Failed in HIP call: " << #hip_call \ + << " at " << __FILE__ << ":" << __LINE__ \ + << " with error: " << hipGetErrorString(hip_res) << std::endl; \ + std::abort(); \ + } \ + } + + __global__ void float_to_fp4_to_float(float *in, + __hip_saturation_t sat, float *out, + size_t size) { + int i = threadIdx.x; + if (i < size) { + auto fp4 = __hip_cvt_float_to_fp4(in[i], __HIP_E2M1, sat); + out[i] = __hip_cvt_fp4_to_halfraw(fp4, __HIP_E2M1); + } + } + + int main() { + constexpr size_t size = 16; + hipDeviceProp_t prop; + hip_check(hipGetDeviceProperties(&prop, 0)); + bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); + if(!is_supported) { + std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl; + std::cerr << "Device conversions are not supported on this hardware." 
<< std::endl; + return -1; + } + + constexpr __hip_saturation_t sat = __HIP_SATFINITE; + + // Create test data + std::vector in; + in.reserve(size); + for (size_t i = 0; i < size; i++) { + in.push_back(i * 0.5f); + } + + // Allocate device memory + float *d_in, *d_out; + hip_check(hipMalloc(&d_in, sizeof(float) * size)); + hip_check(hipMalloc(&d_out, sizeof(float) * size)); + hip_check(hipMemcpy(d_in, in.data(), sizeof(float) * size, hipMemcpyHostToDevice)); + + // Run conversion kernel + float_to_fp4_to_float<<<1, size>>>(d_in, sat, d_out, size); + + // Get results + std::vector result(size); + hip_check(hipMemcpy(result.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost)); + + // Clean up + hip_check(hipFree(d_in)); + hip_check(hipFree(d_out)); + + // Display results + std::cout << "FP4 Roundtrip Results:" << std::endl; + for (size_t i = 0; i < size; i++) { + std::cout << "Original: " << in[i] << " -> FP4 roundtrip: " << result[i] << std::endl; + } + + return 0; + } + +There are C++ style classes available as well: + +.. code-block:: cpp + + __hip_fp4_e2m1 fp4_val(1.0f); + +FP4 type has its own class: + +- __hip_fp4_e2m1 + +There is support of vector of FP4 types: + +- __hip_fp4x2_e2m1: holds 2 values of FP4 e2m1 numbers +- __hip_fp4x4_e2m1: holds 4 values of FP4 e2m1 numbers + +FP6 (6-bit Precision) +======================== + +FP6 (Floating Point 6-bit) numbers represent an even more aggressive memory optimization +compared to FP8, designed specifically for ultra-efficient deep learning inference and +specialized AI applications. This extremely compact format delivers significant memory +and bandwidth savings at the cost of reduced dynamic range and precision. + +The primary advantage of FP6 is enabling higher computational throughput in +hardware-constrained environments, particularly for AI model deployment on edge devices +and applications where model size is a critical constraint. 
While offering less precision +than FP8, FP6 maintains sufficient accuracy for many inference tasks, especially when +used with carefully quantized models. + +There are two primary FP6 formats: + +- **E3M2 Format** + + - Sign: 1 bit + - Exponent: 3 bits + - Mantissa: 2 bits + +- **E2M3 Format** + + - Sign: 1 bit + - Exponent: 2 bits + - Mantissa: 3 bits + +The E3M2 format provides a wider numeric range with less precision, while the E2M3 format +offers higher precision within a narrower range. + +HIP Header +---------- + +The `HIP FP6 header `_ +defines the FP6 numbers. + +Supported Devices +----------------- + +Different GPU models support different FP6 formats. Here's a breakdown: + +.. list-table:: Supported devices for fp6 numbers + :header-rows: 1 + + * - Device Type + - E3M2 + - E2M3 + * - Host + - Yes + - Yes + * - CDNA1 + - No + - No + * - CDNA2 + - No + - No + * - CDNA3 + - Yes + - Yes + * - RDNA2 + - No + - No + * - RDNA3 + - No + - No + +Using FP6 Numbers in HIP Programs +--------------------------------- + +To use the FP6 numbers inside HIP programs: + +.. code-block:: cpp + + #include + +FP6 numbers can be used on CPU side: + +.. code-block:: cpp + + __hip_fp6_storage_t convert_float_to_fp6( + float in, /* Input val */ + __hip_fp6_interpretation_t interpret, /* interpretation of number E3M2/E2M3 */ + __hip_saturation_t sat /* Saturation behavior */ + ) { + return __hip_cvt_float_to_fp6(in, interpret, sat); + } + +The same can be done in kernels as well: + +.. code-block:: cpp + + __device__ __hip_fp6_storage_t d_convert_float_to_fp6( + float in, + __hip_fp6_interpretation_t interpret, + __hip_saturation_t sat) { + return __hip_cvt_float_to_fp6(in, interpret, sat); + } + +The following code example demonstrates a roundtrip conversion using FP6 types: + +.. 
code-block:: cpp + + #include + #include + #include + #include + + #define hip_check(hip_call) \ + { \ + auto hip_res = hip_call; \ + if (hip_res != hipSuccess) { \ + std::cerr << "Failed in HIP call: " << #hip_call \ + << " at " << __FILE__ << ":" << __LINE__ \ + << " with error: " << hipGetErrorString(hip_res) << std::endl; \ + std::abort(); \ + } \ + } + + __global__ void float_to_fp6_to_float(float *in, + __hip_fp6_interpretation_t interpret, + __hip_saturation_t sat, float *out, + size_t size) { + int i = threadIdx.x; + if (i < size) { + auto fp6 = __hip_cvt_float_to_fp6(in[i], interpret, sat); + out[i] = __hip_cvt_fp6_to_halfraw(fp6, interpret); + } + } + + int main() { + constexpr size_t size = 16; + hipDeviceProp_t prop; + hip_check(hipGetDeviceProperties(&prop, 0)); + bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); + if(!is_supported) { + std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl; + std::cerr << "Device conversions are not supported on this hardware." 
<< std::endl; + return -1; + } + + // Test both formats + const __hip_saturation_t sat = __HIP_SATFINITE; + + // Create test vectors + std::vector<float> in(size); + for (size_t i = 0; i < size; i++) { + in[i] = i * 0.5f; + } + + std::vector<float> out_e2m3(size); + std::vector<float> out_e3m2(size); + + // Allocate device memory + float *d_in, *d_out; + hip_check(hipMalloc(&d_in, sizeof(float) * size)); + hip_check(hipMalloc(&d_out, sizeof(float) * size)); + hip_check(hipMemcpy(d_in, in.data(), sizeof(float) * size, hipMemcpyHostToDevice)); + + // Test E2M3 format + float_to_fp6_to_float<<<1, size>>>(d_in, __HIP_E2M3, sat, d_out, size); + hip_check(hipMemcpy(out_e2m3.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost)); + + // Test E3M2 format + float_to_fp6_to_float<<<1, size>>>(d_in, __HIP_E3M2, sat, d_out, size); + hip_check(hipMemcpy(out_e3m2.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost)); + + // Display results + std::cout << "FP6 Roundtrip Results:" << std::endl; + for (size_t i = 0; i < size; i++) { + std::cout << "Original: " << in[i] + << " -> E2M3: " << out_e2m3[i] + << " -> E3M2: " << out_e3m2[i] << std::endl; + } + + // Clean up + hip_check(hipFree(d_in)); + hip_check(hipFree(d_out)); + + return 0; + } + +There are C++ style classes available as well: + +.. code-block:: cpp + + __hip_fp6_e2m3 fp6_val_e2m3(1.1f); + __hip_fp6_e3m2 fp6_val_e3m2(1.1f); + +Each FP6 format has its own class: + +- __hip_fp6_e2m3 +- __hip_fp6_e3m2 + +Vector variants of the FP6 types are also supported: + +- __hip_fp6x2_e2m3: holds 2 values of FP6 e2m3 numbers +- __hip_fp6x4_e2m3: holds 4 values of FP6 e2m3 numbers +- __hip_fp6x2_e3m2: holds 2 values of FP6 e3m2 numbers +- __hip_fp6x4_e3m2: holds 4 values of FP6 e3m2 numbers + FP8 (Quarter Precision) ======================= @@ -65,7 +441,7 @@ numbers compared to standard FP8 formats. HIP Header ---------- -The `HIP FP8 header `_ +The `HIP FP8 header `_ defines the FP8 ocp/fnuz numbers.
Supported Devices @@ -317,10 +693,10 @@ supported with its two main formats, float16 and bfloat16. HIP Header ---------- -The `HIP FP16 header `_ +The `HIP FP16 header `_ defines the float16 format. -The `HIP BF16 header `_ +The `HIP BF16 header `_ defines the bfloat16 format. Supported Devices diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 35ed57f0b6..72f87fb9a8 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -66,39 +66,36 @@ subtrees: - file: reference/hip_runtime_api_reference subtrees: - entries: - - file: reference/hip_runtime_api/modules + - file: reference/hip_runtime_api/modules/initialization_and_version + - file: reference/hip_runtime_api/modules/device_management + - file: reference/hip_runtime_api/modules/execution_control + - file: reference/hip_runtime_api/modules/error_handling + - file: reference/hip_runtime_api/modules/stream_management + - file: reference/hip_runtime_api/modules/stream_memory_operations + - file: reference/hip_runtime_api/modules/event_management + - file: reference/hip_runtime_api/modules/memory_management subtrees: - entries: - - file: reference/hip_runtime_api/modules/initialization_and_version - - file: reference/hip_runtime_api/modules/device_management - - file: reference/hip_runtime_api/modules/execution_control - - file: reference/hip_runtime_api/modules/error_handling - - file: reference/hip_runtime_api/modules/stream_management - - file: reference/hip_runtime_api/modules/stream_memory_operations - - file: reference/hip_runtime_api/modules/event_management - - file: reference/hip_runtime_api/modules/memory_management - subtrees: - - entries: - - file: reference/hip_runtime_api/modules/memory_management/memory_management_deprecated - - file: reference/hip_runtime_api/modules/memory_management/external_resource_interoperability - - file: reference/hip_runtime_api/modules/memory_management/stream_ordered_memory_allocator - - file: 
reference/hip_runtime_api/modules/memory_management/unified_memory_reference - - file: reference/hip_runtime_api/modules/memory_management/virtual_memory_reference - - file: reference/hip_runtime_api/modules/memory_management/texture_management - - file: reference/hip_runtime_api/modules/memory_management/texture_management_deprecated - - file: reference/hip_runtime_api/modules/memory_management/surface_object - - file: reference/hip_runtime_api/modules/peer_to_peer_device_memory_access - - file: reference/hip_runtime_api/modules/context_management - - file: reference/hip_runtime_api/modules/module_management - - file: reference/hip_runtime_api/modules/occupancy - - file: reference/hip_runtime_api/modules/profiler_control - - file: reference/hip_runtime_api/modules/launch_api - - file: reference/hip_runtime_api/modules/runtime_compilation - - file: reference/hip_runtime_api/modules/callback_activity_apis - - file: reference/hip_runtime_api/modules/graph_management - - file: reference/hip_runtime_api/modules/graphics_interoperability - - file: reference/hip_runtime_api/modules/opengl_interoperability - - file: reference/hip_runtime_api/modules/cooperative_groups_reference + - file: reference/hip_runtime_api/modules/memory_management/memory_management_deprecated + - file: reference/hip_runtime_api/modules/memory_management/external_resource_interoperability + - file: reference/hip_runtime_api/modules/memory_management/stream_ordered_memory_allocator + - file: reference/hip_runtime_api/modules/memory_management/unified_memory_reference + - file: reference/hip_runtime_api/modules/memory_management/virtual_memory_reference + - file: reference/hip_runtime_api/modules/memory_management/texture_management + - file: reference/hip_runtime_api/modules/memory_management/texture_management_deprecated + - file: reference/hip_runtime_api/modules/memory_management/surface_object + - file: reference/hip_runtime_api/modules/peer_to_peer_device_memory_access + - file: 
reference/hip_runtime_api/modules/context_management + - file: reference/hip_runtime_api/modules/module_management + - file: reference/hip_runtime_api/modules/occupancy + - file: reference/hip_runtime_api/modules/profiler_control + - file: reference/hip_runtime_api/modules/launch_api + - file: reference/hip_runtime_api/modules/runtime_compilation + - file: reference/hip_runtime_api/modules/callback_activity_apis + - file: reference/hip_runtime_api/modules/graph_management + - file: reference/hip_runtime_api/modules/graphics_interoperability + - file: reference/hip_runtime_api/modules/opengl_interoperability + - file: reference/hip_runtime_api/modules/cooperative_groups_reference - file: reference/hip_runtime_api/global_defines_enums_structs_files subtrees: - entries: diff --git a/docs/tools/example_codes/template_warp_size_reduction.hip b/docs/tools/example_codes/template_warp_size_reduction.hip new file mode 100644 index 0000000000..2d265080d9 --- /dev/null +++ b/docs/tools/example_codes/template_warp_size_reduction.hip @@ -0,0 +1,207 @@ +// MIT License +// +// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+#include <cstdint>
+#include <iostream>
+#include <random>
+#include <vector>
+
+#define HIP_CHECK(expression)                  \
+{                                              \
+    const hipError_t status = expression;      \
+    if(status != hipSuccess){                  \
+        std::cerr << "HIP error "              \
+                  << status << ": "            \
+                  << hipGetErrorString(status) \
+                  << " at " << __FILE__ << ":" \
+                  << __LINE__ << std::endl;    \
+    }                                          \
+}
+
+// [Sphinx template warp size block reduction kernel start]
+template<uint32_t WarpSize>
+using lane_mask_t = typename std::conditional<WarpSize == 32, uint32_t, uint64_t>::type;
+
+template<uint32_t WarpSize>
+__global__ void block_reduce(int* input, lane_mask_t<WarpSize>* mask, int* output, size_t size) {
+    extern __shared__ int shared[];
+
+    // Read of input with bounds check
+    auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
+    {
+        lane_mask_t<WarpSize> warp_mask = lane_mask_t<WarpSize>(1) << lane_id;
+        return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
+    };
+
+    const uint32_t tid = threadIdx.x,
+                   lid = threadIdx.x % WarpSize,
+                   wid = threadIdx.x / WarpSize,
+                   bid = blockIdx.x,
+                   gid = bid * blockDim.x + tid;
+
+    // Read input buffer to shared
+    shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid);
+    __syncthreads();
+
+    // Shared reduction
+    for (uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2)
+    {
+        if (tid < i)
+            shared[tid] = shared[tid] + shared[tid + i];
+        __syncthreads();
+    }
+
+    // Use local variable in warp reduction
+    int result = shared[tid];
+    __syncthreads();
+
+    // This loop would be unrolled the same with the runtime warpSize.
+    #pragma unroll
+    for (uint32_t i = WarpSize / 2; i >= 1; i /= 2) {
+        result = result + __shfl_down(result, i);
+    }
+
+    // Write result to output buffer
+    if (tid == 0)
+        output[bid] = result;
+}
+// [Sphinx template warp size block reduction kernel end]
+
+// [Sphinx template warp size mask generation start]
+template<uint32_t WarpSize>
+void generate_and_copy_mask(
+    void *d_mask,
+    std::vector<int>& vectorExpected,
+    int numOfBlocks,
+    int numberOfWarp,
+    int mask_size,
+    int mask_element_size) {
+
+    std::random_device rd;
+    std::mt19937_64 eng(rd());
+
+    // Host side mask vector
+    std::vector<lane_mask_t<WarpSize>> mask(mask_size);
+    // Define uniform unsigned int distribution
+    std::uniform_int_distribution<lane_mask_t<WarpSize>> distr;
+    // Fill up the mask
+    for(int i = 0; i < numOfBlocks; i++) {
+        int count = 0;
+        for(int j = 0; j < numberOfWarp; j++) {
+            int mask_index = i * numberOfWarp + j;
+            mask[mask_index] = distr(eng);
+            if constexpr(WarpSize == 32)
+                count += __builtin_popcount(mask[mask_index]);
+            else
+                count += __builtin_popcountll(mask[mask_index]);
+        }
+        vectorExpected[i] = count;
+    }
+
+    // Copy the mask array
+    HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
+}
+// [Sphinx template warp size mask generation end]
+
+int main() {
+
+    int deviceId = 0;
+    int warpSizeHost;
+    HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
+    std::cout << "Warp size: " << warpSizeHost << std::endl;
+
+    constexpr int numOfBlocks = 16;
+    constexpr int threadsPerBlock = 1024;
+    const int numberOfWarp = threadsPerBlock / warpSizeHost;
+    const int mask_element_size = warpSizeHost == 32 ? sizeof(uint32_t) : sizeof(uint64_t);
+    const int mask_size = numOfBlocks * numberOfWarp;
+    constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
+
+    int *d_data, *d_results;
+    void *d_mask;
+    int initValue = 1;
+    std::vector<int> vectorInput(arraySize, initValue);
+    std::vector<int> vectorOutput(numOfBlocks);
+    std::vector<int> vectorExpected(numOfBlocks);
+    // Allocate device memory
+    HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
+    HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
+    HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
+    // Host to Device copy of the input array
+    HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
+
+    // [Sphinx template warp size select kernel start]
+    // Fill up the mask variable, copy to device and select the right kernel.
+    if(warpSizeHost == 32) {
+        // Generate and copy mask arrays
+        generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
+
+        // Start the kernel
+        block_reduce<32><<<numOfBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(
+            d_data,
+            static_cast<uint32_t*>(d_mask),
+            d_results,
+            arraySize);
+    } else if(warpSizeHost == 64) {
+        // Generate and copy mask arrays
+        generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
+
+        // Start the kernel
+        block_reduce<64><<<numOfBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(
+            d_data,
+            static_cast<uint64_t*>(d_mask),
+            d_results,
+            arraySize);
+    } else {
+        std::cerr << "Unsupported warp size." << std::endl;
+        return 1;
+    }
+    // [Sphinx template warp size select kernel end]
+
+    // Check the kernel launch
+    HIP_CHECK(hipGetLastError());
+    // Check for kernel execution error
+    HIP_CHECK(hipDeviceSynchronize());
+    // Device to Host copy of the result
+    HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
+
+    // Verify results
+    bool passed = true;
+    for(size_t i = 0; i < numOfBlocks; ++i) {
+        if(vectorOutput[i] != vectorExpected[i]) {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
+        }
+    }
+    if(passed){
+        std::cout << "Execution completed successfully." << std::endl;
+    }else{
+        std::cerr << "Execution failed." << std::endl;
+    }
+
+    // Cleanup
+    HIP_CHECK(hipFree(d_data));
+    HIP_CHECK(hipFree(d_mask));
+    HIP_CHECK(hipFree(d_results));
+    return 0;
+}
\ No newline at end of file
diff --git a/docs/tools/example_codes/warp_size_reduction.hip b/docs/tools/example_codes/warp_size_reduction.hip
new file mode 100644
index 0000000000..0be830ff0e
--- /dev/null
+++ b/docs/tools/example_codes/warp_size_reduction.hip
@@ -0,0 +1,184 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+#include <cstdint>
+#include <iostream>
+#include <random>
+#include <vector>
+
+#define HIP_CHECK(expression)                  \
+{                                              \
+    const hipError_t status = expression;      \
+    if(status != hipSuccess){                  \
+        std::cerr << "HIP error "              \
+                  << status << ": "            \
+                  << hipGetErrorString(status) \
+                  << " at " << __FILE__ << ":" \
+                  << __LINE__ << std::endl;    \
+    }                                          \
+}
+
+// [Sphinx HIP warp size block reduction kernel start]
+__global__ void block_reduce(int* input, uint64_t* mask, int* output, size_t size){
+    extern __shared__ int shared[];
+    // Read of input with bounds check
+    auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
+    {
+        uint64_t warp_mask = 1ull << lane_id;
+        return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
+    };
+    const uint32_t tid = threadIdx.x,
+                   lid = threadIdx.x % warpSize,
+                   wid = threadIdx.x / warpSize,
+                   bid = blockIdx.x,
+                   gid = bid * blockDim.x + tid;
+    // Read input buffer to shared
+    shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / warpSize) + wid);
+    __syncthreads();
+    // Shared reduction
+    for (uint32_t i = blockDim.x / 2; i >= warpSize; i /= 2)
+    {
+        if (tid < i)
+            shared[tid] = shared[tid] + shared[tid + i];
+        __syncthreads();
+    }
+
+    // Use local variable in warp reduction
+    int result = shared[tid];
+    __syncthreads();
+
+    // This loop would be unrolled the same with the compile-time WarpSize.
+    #pragma unroll
+    for (uint32_t i = warpSize / 2; i >= 1; i /= 2) {
+        result = result + __shfl_down(result, i);
+    }
+
+    // Write result to output buffer
+    if (tid == 0)
+        output[bid] = result;
+}
+// [Sphinx HIP warp size block reduction kernel end]
+
+// [Sphinx HIP warp size mask generation start]
+void generate_and_copy_mask(
+    uint64_t *d_mask,
+    std::vector<int>& vectorExpected,
+    int warpSizeHost,
+    int numOfBlocks,
+    int numberOfWarp,
+    int mask_size,
+    int mask_element_size) {
+
+    std::random_device rd;
+    std::mt19937_64 eng(rd());
+
+    // Host side mask vector
+    std::vector<uint64_t> mask(mask_size);
+    // Define uniform unsigned int distribution
+    std::uniform_int_distribution<uint64_t> distr;
+    // Fill up the mask
+    for(int i = 0; i < numOfBlocks; i++) {
+        int count = 0;
+        for(int j = 0; j < numberOfWarp; j++) {
+            int mask_index = i * numberOfWarp + j;
+            mask[mask_index] = distr(eng);
+            if(warpSizeHost == 32)
+                count += __builtin_popcount(mask[mask_index]);
+            else
+                count += __builtin_popcountll(mask[mask_index]);
+        }
+        vectorExpected[i] = count;
+    }
+    // Copy the mask array
+    HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
+}
+// [Sphinx HIP warp size mask generation end]
+
+int main() {
+    int deviceId = 0;
+    int warpSizeHost;
+    HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
+    std::cout << "Warp size: " << warpSizeHost << std::endl;
+    constexpr int numOfBlocks = 16;
+    constexpr int threadsPerBlock = 1024;
+    const int numberOfWarp = threadsPerBlock / warpSizeHost;
+    const int mask_element_size = sizeof(uint64_t);
+    const int mask_size = numOfBlocks * numberOfWarp;
+    constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
+    int *d_data, *d_results;
+    uint64_t *d_mask;
+    int initValue = 1;
+    std::vector<int> vectorInput(arraySize, initValue);
+    std::vector<int> vectorOutput(numOfBlocks);
+    std::vector<int> vectorExpected(numOfBlocks);
+    // Allocate device memory
+    HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
+    HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
+    HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
+    // Host to Device copy of the input array
+    HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
+
+    // [Sphinx HIP warp size select kernel start]
+    // Generate and copy mask arrays
+    generate_and_copy_mask(
+        d_mask,
+        vectorExpected,
+        warpSizeHost,
+        numOfBlocks,
+        numberOfWarp,
+        mask_size,
+        mask_element_size);
+
+    // Start the kernel
+    block_reduce<<<numOfBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(
+        d_data,
+        d_mask,
+        d_results,
+        arraySize);
+    // [Sphinx HIP warp size select kernel end]
+
+    // Check the kernel launch
+    HIP_CHECK(hipGetLastError());
+    // Check for kernel execution error
+    HIP_CHECK(hipDeviceSynchronize());
+    // Device to Host copy of the result
+    HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
+    // Verify results
+    bool passed = true;
+    for(size_t i = 0; i < numOfBlocks; ++i) {
+        if(vectorOutput[i] != vectorExpected[i]) {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
+        }
+    }
+    if(passed){
+        std::cout << "Execution completed successfully." << std::endl;
+    }else{
+        std::cerr << "Execution failed." << std::endl;
+    }
+    // Cleanup
+    HIP_CHECK(hipFree(d_data));
+    HIP_CHECK(hipFree(d_mask));
+    HIP_CHECK(hipFree(d_results));
+    return 0;
+}
\ No newline at end of file
diff --git a/docs/understand/compilers.rst b/docs/understand/compilers.rst
index ccd2dbbec6..650ca622ef 100644
--- a/docs/understand/compilers.rst
+++ b/docs/understand/compilers.rst
@@ -8,95 +8,12 @@
 HIP compilers
 ********************************************************************************
 
-ROCm provides the compiler driver ``hipcc``, that can be used on AMD ROCm and
-NVIDIA CUDA platforms.
+ROCm provides the compiler tools used to compile HIP applications for use on AMD GPUs.
+The compilers set up the default libraries and include paths for the HIP and ROCm
+libraries and some needed environment variables. For more information, see the
+:doc:`ROCm compiler reference `.
-
-On ROCm, ``hipcc`` takes care of the following:
-
-- Setting the default library and include paths for HIP
-- Setting some environment variables
-- Invoking the appropriate compiler - ``amdclang++``
-
-On NVIDIA CUDA platform, ``hipcc`` takes care of invoking compiler ``nvcc``.
-``amdclang++`` is based on the ``clang++`` compiler. For more
-details, see the :doc:`llvm project`.
-
-HIPCC
-================================================================================
-
-Common Compiler Options
---------------------------------------------------------------------------------
-
-The following table shows the most common compiler options supported by
-``hipcc``.
-
-.. list-table::
-  :header-rows: 1
-
-  * -
-      Option
-    - Description
-  * -
-      ``--fgpu-rdc``
-    - Generate relocatable device code, which allows kernels or device functions
-      to call device functions in different translation units.
-  * -
-      ``-ggdb``
-    - Equivalent to `-g` plus tuning for GDB. This is recommended when using
-      ROCm's GDB to debug GPU code.
-  * -
-      ``--gpu-max-threads-per-block=<n>``
-    - Generate code to support up to the specified number of threads per block.
-  * -
-      ``-offload-arch=<arch>``
-    - Generate code for the given GPU target.
-      For a full list of supported compilation targets see the `processor names in AMDGPU's llvm documentation <https://llvm.org/docs/AMDGPUUsage.html#processors>`_.
-      This option can appear multiple times to generate a fat binary for multiple
-      targets.
-      The actual support of the platform's runtime may differ.
-  * -
-      ``-save-temps``
-    - Save the compiler generated intermediate files.
-  * -
-      ``-v``
-    - Show the compilation steps.
-
-Linking
---------------------------------------------------------------------------------
-
-``hipcc`` adds the necessary libraries for HIP as well as for the accelerator
-compiler (``nvcc`` or ``amdclang++``). We recommend linking with ``hipcc`` since
-it automatically links the binary to the necessary HIP runtime libraries.
-
-Linking Code With Other Compilers
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``nvcc`` by default uses ``g++`` to generate the host code.
-
-``amdclang++`` generates both device and host code. The code uses the same API
-as ``gcc``, which allows code generated by different ``gcc``-compatible
-compilers to be linked together. For example, code compiled using ``amdclang++``
-can link with code compiled using compilers such as ``gcc``, ``icc`` and
-``clang``. Take care to ensure all compilers use the same standard C++ header
-and library formats.
-
-libc++ and libstdc++
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``hipcc`` links to ``libstdc++`` by default. This provides better compatibility
-between ``g++`` and HIP.
-
-In order to link to ``libc++``, pass ``--stdlib=libc++`` to ``hipcc``.
-Generally, libc++ provides a broader set of C++ features while ``libstdc++`` is
-the standard for more compilers, notably including ``g++``.
-
-When cross-linking C++ code, any C++ functions that use types from the C++
-standard library, such as ``std::string``, ``std::vector`` and other containers,
-must use the same standard-library implementation. This includes cross-linking
-between ``amdclang++`` and other compilers.
-
-
-HIP compilation workflow
+Compilation workflow
 ================================================================================
 
 HIP provides a flexible compilation workflow that supports both offline
@@ -115,25 +32,18 @@
 performance overhead.
 Offline compilation
 --------------------------------------------------------------------------------
 
-The HIP code compilation is performed in two stages: host and device code
-compilation stage.
+Offline compilation is performed in two steps: host and device code
+compilation.
+
+- Host-code compilation: On the host side, ``amdclang++`` or ``hipcc`` can
+  compile the host code in one step without other C++ compilers.
 
-- Device-code compilation stage: The compiled device code is embedded into the
+- Device-code compilation: The compiled device code is embedded into the
   host object file. Depending on the platform, the device code can be compiled
-  into assembly or binary. ``nvcc`` and ``amdclang++`` target different
-  architectures and use different code object formats. ``nvcc`` uses the binary
-  ``cubin`` or the assembly PTX files, while the ``amdclang++`` path is the
-  binary ``hsaco`` format. On CUDA platforms, the driver compiles the PTX files
-  to executable code during runtime.
-
-- Host-code compilation stage: On the host side, ``hipcc`` or ``amdclang++`` can
-  compile the host code in one step without other C++ compilers. On the other
-  hand, ``nvcc`` only replaces the ``<<<...>>>`` kernel launch syntax with the
-  appropriate CUDA runtime function call and the modified host code is passed to
-  the default host compiler.
+  into assembly or binary.
 
 For an example on how to compile HIP from the command line, see :ref:`SAXPY
-tutorial` .
+tutorial `.
 
 Runtime compilation
 --------------------------------------------------------------------------------
 
@@ -142,27 +52,26 @@
 HIP allows you to compile kernels at runtime using the ``hiprtc*`` API.
 Kernels are stored as a text string, which is passed to HIPRTC alongside
 options to guide the compilation.
 
-For more details, see
-:doc:`HIP runtime compiler <../how-to/hip_rtc>`.
+For more information, see :doc:`HIP runtime compiler <../how-to/hip_rtc>`.
 
 Static libraries
 ================================================================================
 
-``hipcc`` supports generating two types of static libraries.
+Both ``amdclang++`` and ``hipcc`` support generating two types of static libraries.
 
 - The first type of static library only exports and launches host functions
   within the same library and not the device functions. This library type offers
-  the ability to link with a non-hipcc compiler such as ``gcc``. Additionally,
+  the ability to link with another compiler such as ``gcc``. Additionally,
   this library type contains host objects with device code embedded as fat
   binaries. This library type is generated using the flag ``--emit-static-lib``:
 
   .. code-block:: shell
 
-    hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
+    amdclang++ hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
     gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out
 
 - The second type of static library exports device functions to be linked by
-  other code objects by using ``hipcc`` as the linker. This library type
+  other code objects by using ``amdclang++`` or ``hipcc`` as the linker. This library type
   contains relocatable device objects and is generated using ``ar``:
 
   .. code-block:: shell
 
@@ -171,6 +80,6 @@
     ar rcsD libHipDevice.a hipDevice.o
     hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out
 
-A full example for this can be found in the ROCm-examples, see the examples for
+Examples of this can be found in `rocm-examples <https://github.com/ROCm/rocm-examples>`_ under
 `static host libraries <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_host_library>`_ or
 `static device libraries <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_device_library>`_.
diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst
index 3cac7e374a..2fc13df65b 100644
--- a/docs/understand/programming_model.rst
+++ b/docs/understand/programming_model.rst
@@ -243,7 +243,7 @@ multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, etc.
 
 .. _inherent_thread_model:
 
 Hierarchical thread model
----------------------
+-------------------------
 
 As previously discussed, all threads of a kernel are uniquely identified by a
 set of integral values called thread IDs. The hierarchy consists of three levels: thread,
diff --git a/include/hip/hip_runtime_api.h b/include/hip/hip_runtime_api.h
index e96769113b..337dd94718 100644
--- a/include/hip/hip_runtime_api.h
+++ b/include/hip/hip_runtime_api.h
@@ -551,6 +551,8 @@ typedef enum hipDeviceAttribute_t {
   hipDeviceAttributeFineGrainSupport,  ///< '1' if Device supports fine grain, '0' otherwise
   hipDeviceAttributeWallClockRate,     ///< Constant frequency of wall clock in kilohertz.
   hipDeviceAttributeNumberOfXccs,      ///< The number of XCC(s) on the device
+  hipDeviceAttributeMaxAvailableVgprsPerThread, ///< Max number of available (directly or indirectly
+                                                ///< addressable) VGPRs per thread in DWORDs.
 
   hipDeviceAttributeAmdSpecificEnd = 19999,
 
   hipDeviceAttributeVendorSpecificBegin = 20000,
diff --git a/include/hip/linker_types.h b/include/hip/linker_types.h
index fd3d29d09a..505cdcf0f0 100755
--- a/include/hip/linker_types.h
+++ b/include/hip/linker_types.h
@@ -127,4 +127,4 @@ typedef enum hipJitFallback {
 #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
 #endif
 
-#endif // HIP_INCLUDE_HIP_LINKER_TYPES_H
\ No newline at end of file
+#endif // HIP_INCLUDE_HIP_LINKER_TYPES_H