Full documentation for rocPRIM is available at https://rocm.docs.amd.com/projects/rocPRIM/en/latest/.
- The default scan accumulator types for device-level scan algorithms have changed. This is a breaking change. The previous default accumulator types could lead to unexpected overflow, such as when the input or initial type was smaller than the output type.
This is a complete list of affected functions and how their default accumulator types are changing:
- rocprim::inclusive_scan
  - past default: class AccType = typename std::iterator_traits<InputIterator>::value_type
  - new default: class AccType = rocprim::invoke_result_binary_op_t<typename std::iterator_traits<InputIterator>::value_type, BinaryFunction>
- rocprim::deterministic_inclusive_scan
  - past default: class AccType = typename std::iterator_traits<InputIterator>::value_type
  - new default: class AccType = rocprim::invoke_result_binary_op_t<typename std::iterator_traits<InputIterator>::value_type, BinaryFunction>
- rocprim::exclusive_scan
  - past default: class AccType = detail::input_type_t<InitValueType>
  - new default: class AccType = rocprim::invoke_result_binary_op_t<rocprim::detail::input_type_t<InitValueType>, BinaryFunction>
- rocprim::deterministic_exclusive_scan
  - past default: class AccType = detail::input_type_t<InitValueType>
  - new default: class AccType = rocprim::invoke_result_binary_op_t<rocprim::detail::input_type_t<InitValueType>, BinaryFunction>
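The effect of the accumulator type can be illustrated with a host-only sketch. It uses only the C++ standard library and does not call rocPRIM: scan_with_acc and widening_plus are hypothetical names, and std::invoke_result_t stands in for rocprim::invoke_result_binary_op_t on the assumption that both yield the operator's return type in this simple case.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical scan operator: accepts narrow inputs, returns a wider type.
struct widening_plus
{
    template <class A, class B>
    std::int32_t operator()(A a, B b) const
    {
        return static_cast<std::int32_t>(a) + static_cast<std::int32_t>(b);
    }
};

// Hypothetical host-side inclusive scan with an explicit accumulator type.
template <class AccType, class Out, class In, class Op>
std::vector<Out> scan_with_acc(const std::vector<In>& in, Op op)
{
    std::vector<Out> out;
    out.reserve(in.size());
    if(in.empty())
    {
        return out;
    }
    AccType acc = static_cast<AccType>(in[0]);
    out.push_back(static_cast<Out>(acc));
    for(std::size_t i = 1; i < in.size(); ++i)
    {
        // Every partial result is narrowed to AccType before it is stored.
        acc = static_cast<AccType>(op(acc, in[i]));
        out.push_back(static_cast<Out>(acc));
    }
    return out;
}

using input_t = std::uint8_t;
// Past-style default: the input type itself.
using past_acc_t = input_t;
// New-style default: whatever the operator returns for two input values.
using new_acc_t = std::invoke_result_t<widening_plus, input_t, input_t>;
static_assert(std::is_same_v<new_acc_t, std::int32_t>, "operator result is 32-bit");
```

For the 8-bit input {200, 100, 50} written to a 32-bit output, the past-style accumulator wraps and yields {200, 44, 94}, while the new-style accumulator yields {200, 300, 350}. Code that relied on the old wrap-around would need to pass the accumulator type explicitly.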
- This release removes support for custom builds on gfx940 and gfx941.
- Added gfx950 support.
- Added rocprim::key_value_pair::operator==.
- Added the rocprim::unrolled_copy thread function to copy multiple items inside a thread.
- Added the rocprim::unrolled_thread_load function to load multiple items inside a thread using rocprim::thread_load.
- Added rocprim::int128_t and rocprim::uint128_t to benchmarks for improved performance evaluation on 128-bit integers.
- Added rocprim::int128_t to the supported autotuning types to improve performance for 128-bit integers.
- Added the rocprim::merge_inplace function for merging in-place.
- Added initial value support for warp- and block-level inclusive scan.
- Added support for building tests with device-side random data generation, making them finish faster. This requires rocRAND, and is enabled with the WITH_ROCRAND=ON build flag.
- New versions of rocprim::thread_load and rocprim::thread_store replace the deprecated rocprim::thread_load and rocprim::thread_store functions. The new versions avoid inline assembly where possible and, as a result, hinder the optimizer less.
- Renamed rocprim::load_cs to rocprim::load_nontemporal and rocprim::store_cs to rocprim::store_nontemporal to express the intent of these load and store methods better.
- All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, for example, rocprim::ROCPRIM_300400_NS::symbol instead of rocprim::symbol, letting the user link multiple libraries built with different versions of rocPRIM.
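As a rough illustration of the inline-namespace scheme, the following sketch uses a hypothetical library named mylib rather than rocPRIM itself. The namespace name MYLIB_300400_NS only mimics the naming pattern quoted above.

```cpp
// Hypothetical library "mylib", versioned with an inline namespace.
namespace mylib
{
inline namespace MYLIB_300400_NS
{
inline int symbol()
{
    return 300400;
}
} // namespace MYLIB_300400_NS
} // namespace mylib

// User code keeps writing the unversioned name...
inline int call_unqualified()
{
    return mylib::symbol();
}

// ...while the fully qualified, versioned name refers to the same function.
inline int call_versioned()
{
    return mylib::MYLIB_300400_NS::symbol();
}
```

Because the version is part of the qualified (and therefore mangled) name, two translation units built against different versions define distinct symbols instead of colliding at link time.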
- Fixed device radix sort not returning the correct required temporary storage when a double buffer contains nullptr.
- Fixed constness of equality operators (== and !=) in rocprim::key_value_pair.
- The next major release may change the template parameters of warp and block algorithms.
- The default scan accumulator types for device-level scan algorithms will be changed. This is a breaking change.
  Previously, the default accumulator type was set to the input type for inclusive scans and to the initial value type for exclusive scans. These default types could cause unexpected overflow in situations where the input or initial type is smaller than the output type when the user doesn't explicitly set an accumulator type using the AccType template parameter.
  The new default types will be set to the type that results when the input or initial value type is applied to the scan operator.
  The following is the complete list of affected functions and how their default accumulator types are changing:
- rocprim::inclusive_scan
  - current default: class AccType = typename std::iterator_traits<InputIterator>::value_type
  - future default: class AccType = rocprim::invoke_result_binary_op_t<typename std::iterator_traits<InputIterator>::value_type, BinaryFunction>
- rocprim::deterministic_inclusive_scan
  - current default: class AccType = typename std::iterator_traits<InputIterator>::value_type
  - future default: class AccType = rocprim::invoke_result_binary_op_t<typename std::iterator_traits<InputIterator>::value_type, BinaryFunction>
- rocprim::exclusive_scan
  - current default: class AccType = detail::input_type_t<InitValueType>
  - future default: class AccType = rocprim::invoke_result_binary_op_t<rocprim::detail::input_type_t<InitValueType>, BinaryFunction>
- rocprim::deterministic_exclusive_scan
  - current default: class AccType = detail::input_type_t<InitValueType>
  - future default: class AccType = rocprim::invoke_result_binary_op_t<rocprim::detail::input_type_t<InitValueType>, BinaryFunction>
- rocprim::load_cs and rocprim::store_cs are deprecated. Use rocprim::load_nontemporal and rocprim::store_nontemporal now.
- Due to an upcoming compiler change, the following warp size-related symbols will be removed in the next major release and are thus marked as deprecated:
  - rocprim::device_warp_size()
    - For compile-time constants, this is replaced with rocprim::arch::wavefront::min_size() and rocprim::arch::wavefront::max_size(). Use this when allocating global or shared memory.
    - For run-time constants, this is replaced with rocprim::arch::wavefront::size().
  - rocprim::warp_size()
  - ROCPRIM_WAVEFRONT_SIZE
- When using rocprim::deterministic_inclusive_scan_by_key and rocprim::deterministic_exclusive_scan_by_key, the intermediate values can change order on Navi3x.
  - However, if a commutative scan operator is used, the final scan value (output array) will still always be consistent between runs.
- Added extended tests to rtest.py. These tests are extra tests that did not fit the criteria of smoke and regression tests. These tests will take much longer to run relative to smoke and regression tests.
- Use python rtest.py [--emulation|-e|--test|-t]=extended to run these tests.
- Added regression tests to rtest.py. Regression tests are a subset of tests that caused hardware problems for past emulation environments.
  - Can be run with python rtest.py [--emulation|-e|--test|-t]=regression
- Added the parallel find_first_of device function with autotuned configurations. This function is similar to std::find_first_of: it searches for the first occurrence of any of the provided elements.
- Added the --emulation option for rtest.py.
  - Unit tests can be run with [--emulation|-e|--test|-t]=<test_name>
- Added tuned configurations for segmented radix sort for gfx942 to improve performance on this architecture.
- Added a parallel device-level function, rocprim::adjacent_find, similar to the C++ Standard Library std::adjacent_find algorithm.
- Added configuration autotuning to device adjacent find (rocprim::adjacent_find) for improved performance on selected architectures.
- Added rocprim::numeric_limits, an extension of std::numeric_limits that includes support for 128-bit integers.
- Added rocprim::int128_t and rocprim::uint128_t, which are the __int128_t and __uint128_t types.
- Added the parallel search and find_end device functions, similar to std::search and std::find_end. These functions search for the first and last occurrence of the sequence, respectively.
- Added a parallel device-level function, rocprim::search_n, similar to the C++ Standard Library std::search_n algorithm.
- Added new constructors and a base function, and added the constexpr specifier to all functions in rocprim::reverse_iterator to improve parity with the C++17 std::reverse_iterator.
- Added hipGraph support to device run-length-encode for nontrivial runs (rocprim::run_length_encode_non_trivial_runs).
- Added configuration autotuning to device run-length-encode for nontrivial runs (rocprim::run_length_encode_non_trivial_runs) for improved performance on selected architectures.
- Added configuration autotuning to device run-length-encode for trivial runs (rocprim::run_length_encode) for improved performance on selected architectures.
- Added a new type traits interface to enable users to provide additional type trait information to rocPRIM, facilitating better compatibility with custom types.
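Several of the functions above are described as device-level counterparts of standard algorithms. The host-only sketch below shows what those standard algorithms return for one small input. It does not call rocPRIM, whose signatures are not given in this changelog, and the helper names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

using vec = std::vector<int>;

// Each helper returns the index of the match, or in.size() when nothing is found.
inline std::size_t first_of_index(const vec& in, const vec& candidates)
{
    auto it = std::find_first_of(in.begin(), in.end(), candidates.begin(), candidates.end());
    return static_cast<std::size_t>(std::distance(in.begin(), it));
}

inline std::size_t adjacent_index(const vec& in)
{
    auto it = std::adjacent_find(in.begin(), in.end());
    return static_cast<std::size_t>(std::distance(in.begin(), it));
}

inline std::size_t search_index(const vec& in, const vec& sequence)
{
    auto it = std::search(in.begin(), in.end(), sequence.begin(), sequence.end());
    return static_cast<std::size_t>(std::distance(in.begin(), it));
}

inline std::size_t find_end_index(const vec& in, const vec& sequence)
{
    auto it = std::find_end(in.begin(), in.end(), sequence.begin(), sequence.end());
    return static_cast<std::size_t>(std::distance(in.begin(), it));
}

inline std::size_t search_n_index(const vec& in, std::size_t count, int value)
{
    auto it = std::search_n(in.begin(), in.end(), count, value);
    return static_cast<std::size_t>(std::distance(in.begin(), it));
}
```

With the input {1, 2, 3, 3, 2, 3, 3, 3, 4}: the first occurrence of any of {3, 4} is at index 2, the first adjacent equal pair starts at index 2, the sequence {2, 3} first occurs at index 1 and last occurs at index 4, and the first run of three 3s starts at index 5.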
- Changed the subset of tests that are run for smoke tests so that the smoke test completes faster and never exceeds 2 GB of VRAM usage. Use python rtest.py [--emulation|-e|--test|-t]=smoke to run these tests.
- The rtest.py options have changed. rtest.py must now be run with either --test|-t or --emulation|-e, but not both options.
- Changed the internal algorithm of block radix sort to use rank match to improve performance of various radix sort related algorithms.
- Disabled padding in various cases where higher occupancy resulted in better performance despite more bank conflicts.
- Removed HIP-CPU support. HIP-CPU support was experimental and broken.
- Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release.
- You can use CMake HIP language support with CMake 3.18 and later. To use HIP language support, run cmake with -DUSE_HIPCXX=ON instead of setting the CXX variable to the path to a HIP-aware compiler.
- Fixed an issue where rmake.py would generate wrong CMake commands in a Linux environment.
- Fixed an issue where rocprim::partial_sort_copy would yield a compile error if the input iterator is const.
- Fixed incorrect type traits for 128-bit signed and unsigned integers.
- Fixed a compilation issue when rocprim::radix_key_codec<...> is specialized with a 128-bit integer.
- Fixed the warp-level reduction rocprim::warp_reduce.reduce DPP implementation to avoid undefined intermediate values during the reduction.
- Fixed an issue that caused a segmentation fault when hipStreamLegacy was passed to some API functions.
- Using the initialisation constructor of rocprim::reverse_iterator will throw a deprecation warning. It will be marked as explicit in the next major release.
- Changed the default value of rmake.py -a to default_gpus. This is equivalent to gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201.
- The --test smoke option has been added to rtest.py. When rtest.py is called with this option it runs a subset of tests such that the total test time is 5 minutes. Use python3 ./rtest.py --test smoke or python3 ./rtest.py -t smoke to run the smoke test.
- The --seed option has been added to run_benchmarks.py. The --seed option specifies a seed for the generation of random inputs. When the option is omitted, the default behavior is to use a random seed for each benchmark measurement.
- Added configuration autotuning to device partition (rocprim::partition, rocprim::partition_two_way, and rocprim::partition_three_way), to device select (rocprim::select, rocprim::unique, and rocprim::unique_by_key), and to device reduce by key (rocprim::reduce_by_key) to improve performance on selected architectures.
- Added rocprim::uninitialized_array to provide uninitialized storage in local memory for user-defined types.
- Added large segment support for rocprim::segmented_reduce.
- Added a parallel nth_element device function similar to std::nth_element. nth_element places elements that are smaller than the nth element before the nth element, and elements that are bigger than the nth element after the nth element.
- Added deterministic (bitwise reproducible) algorithm variants rocprim::deterministic_inclusive_scan, rocprim::deterministic_exclusive_scan, rocprim::deterministic_inclusive_scan_by_key, rocprim::deterministic_exclusive_scan_by_key, and rocprim::deterministic_reduce_by_key. These provide run-to-run stable results with non-associative operators such as float operations, at the cost of reduced performance.
- Added parallel partial_sort and partial_sort_copy device functions similar to std::partial_sort and std::partial_sort_copy. partial_sort and partial_sort_copy arrange elements such that the elements are in the same order as a sorted list up to and including the middle index.
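The deterministic variants exist because floating-point addition is not associative, so the grouping of partial results can change the outcome. A minimal host-only sketch of that effect follows; the function names are hypothetical and nothing here calls rocPRIM.

```cpp
// Same three operands, two different groupings.
inline float sum_left_first(float a, float b, float c)
{
    return (a + b) + c;
}

inline float sum_right_first(float a, float b, float c)
{
    return a + (b + c);
}
```

With a = 1e30f, b = -1e30f and c = 1.0f, the left-first grouping gives 1.0f because the large values cancel before c is added. The right-first grouping gives 0.0f because c is absorbed when it is added to -1e30f. A parallel scan that groups partial sums differently from run to run can therefore produce different bits unless the grouping is fixed.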
- Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
- Changed the default seed for device_benchmark_segmented_reduce.
- Changed test_utils_hipgraphs.hpp to be a class GraphHelper with internal graph and graph instances.
- rocprim::thread_load() and rocprim::thread_store() have been deprecated. Use dereference() instead.
- Fixed an issue in rmake.py where the list storing cmake options would contain individual characters instead of a full string of options.
- Resolved an issue in rtest.py where it crashed if the build folder was created without release or debug subdirectories.
- Resolved an issue with rtest.py on Windows where passing an absolute path to --install_dir caused a FileNotFound error.
- rocPRIM functions are no longer forcefully inlined on Windows. This significantly reduces the build time of debug builds.
- block_load, block_store, block_shuffle, block_exchange, and warp_exchange now use placement new instead of copy assignment (operator=) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
- Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (uint8_t and int8_t) and instantiations of custom_type. Specifically, device binary search, device histogram, device merge and warp sort are affected.
- Fixed a bug for rocprim::merge_path_search where using unsigned offsets would produce incorrect results.
- Fixed a bug for rocprim::thread_load and rocprim::thread_store where float and double were not cast to the correct type, resulting in incorrect results.
- Resolved an issue where tests were failing when they were compiled with -D_GLIBCXX_ASSERTIONS=ON.
- Resolved an issue where algorithms that used an internal serial merge routine caused a memory access fault that resulted in potential performance drops when using block sort, device merge sort (block merge), device merge, device partial sort, and device sort (merge sort).
- Fixed memory leaks in unit tests due to missing calls to hipFree() and the incorrect use of hipGraphs.
- Fixed an issue where certain inputs to block_sort_merge(), device_merge_sort_merge_path(), device_merge(), and warp_sort_stable() caused an assertion error during the call to serial_merge().
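The placement new change in the list above can be pictured with a host-only sketch. The type tracked and the helper copy_into_raw_storage are hypothetical. The point is that a copy assignment operator may read the destination's previous state, which does not exist in uninitialized storage, whereas placement new constructs a fresh object there.

```cpp
#include <new>

// Hypothetical user type whose copy assignment touches the destination's old state.
struct tracked
{
    int  value;
    int* live_count;

    tracked(int v, int* counter) : value(v), live_count(counter) { ++*live_count; }
    tracked(const tracked& other) : value(other.value), live_count(other.live_count)
    {
        ++*live_count;
    }
    tracked& operator=(const tracked& other)
    {
        --*live_count; // reads this->live_count: only valid if *this was constructed
        value      = other.value;
        live_count = other.live_count;
        ++*live_count;
        return *this;
    }
    ~tracked() { --*live_count; }
};

// Copies src into raw, never-constructed storage and returns the copied value.
inline int copy_into_raw_storage(const tracked& src)
{
    alignas(tracked) unsigned char storage[sizeof(tracked)];
    // Copy-construct in place instead of assigning to an object that does not exist yet.
    tracked* dst   = ::new(static_cast<void*>(storage)) tracked(src);
    const int copy = dst->value;
    dst->~tracked(); // manual destruction balances the placement new
    return copy;
}
```

Assigning through operator= into the same raw storage would dereference an indeterminate live_count pointer, which is the kind of misbehavior the entry above describes for custom types.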
- Improved performance of block_reduce_warp_reduce when warp size equals block size.
- New overloads for warp_scan::exclusive_scan that take no initial value. These new overloads will write an unspecified result to the first value of each warp.
- The internal accumulator type of inclusive_scan(_by_key) and exclusive_scan(_by_key) is now exposed as an optional type parameter.
  - The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan). This is the same behaviour as before this change.
- New overload for device_adjacent_difference_inplace that allows separate input and output iterators, but allows them to point to the same element.
- New public API for deriving resulting type on device-only functions:
  - rocprim::invoke_result
  - rocprim::invoke_result_t
  - rocprim::invoke_result_binary_op
  - rocprim::invoke_result_binary_op_t
- New rocprim::batch_copy function added. Similar to rocprim::batch_memcpy, but copies by element, not with memcpy.
- Added more test cases, to better cover supported data types.
- Updated some tests to work with supported data types.
- An optional decomposer argument for all member functions of rocprim::block_radix_sort and all functions of device_radix_sort. To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a rocprim::tuple of references to arithmetic types from the key.
- New rocprim::predicate_iterator which acts as a proxy for an underlying iterator based on a predicate. It iterates over proxies that hold the references to the underlying values, but only allow reading and writing if the predicate is true. It can be instantiated with:
  - rocprim::make_predicate_iterator
  - rocprim::make_mask_iterator
- Added custom radix sizes as the last parameter for block_radix_sort. The default value is 4; it can be a number between 0 and 32.
- New rocprim::radix_key_codec, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
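The shape of a decomposer can be sketched on the host. The entry above asks for a rocprim::tuple of references. This sketch substitutes std::tuple so that it compiles without rocPRIM, and the key type and functor name are hypothetical.

```cpp
#include <tuple>

// Hypothetical user-defined key made of two arithmetic members.
struct custom_key
{
    int   id;
    float weight;
};

// Sketch of a decomposer: exposes the key's arithmetic members by reference.
// Assumption: with rocPRIM the return type would be a rocprim::tuple instead of std::tuple.
struct custom_key_decomposer
{
    std::tuple<int&, float&> operator()(custom_key& key) const
    {
        return std::tuple<int&, float&>{key.id, key.weight};
    }
};
```

Because the tuple holds references, writing through it changes the key itself. That aliasing is why the text above asks for references to the members rather than copies.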
- Improved the performance of warp_sort_shuffle and block_sort_bitonic.
- Created an optimized version of the warp_exchange functions blocked_to_striped_shuffle and striped_to_blocked_shuffle when the warp size is equal to the items per thread.
- Improved the performance of device_transform.
- Fixed incorrect results of warp_exchange::blocked_to_striped_shuffle and warp_exchange::striped_to_blocked_shuffle when the block size is larger than the logical warp size. The test suite has been updated with such cases.
- Fixed incorrect results returned when calling device unique_by_key with overlapping values_input and values_output.
- Fixed incorrect output type used in device_adjacent_difference.
- Hotfix for incorrect results on the GFX10 (Navi 10/RDNA1, Navi 20/RDNA2) ISA and GFX11 ISA (Navi 30 GPUs) on device scan algorithms rocprim::inclusive_scan(_by_key) and rocprim::exclusive_scan(_by_key) with large input types.
- device_adjacent_difference now considers both the input and the output type for selecting the appropriate kernel launch config. Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.
- Fixed incorrect data being loaded with rocprim::thread_load when compiling with -O0.
- Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64.
- The internal header detail/match_result_type.hpp has been deprecated.
- TwiddleIn and TwiddleOut have been deprecated in favor of radix_key_codec.
- The internal ::rocprim::detail::radix_key_codec has been deprecated in favor of the new public utility with the same name.
- New primitive: block_run_length_decode
- New primitive: batch_memcpy
- Renamed:
  - scan_config_v2 to scan_config
  - scan_by_key_config_v2 to scan_by_key_config
  - radix_sort_config_v2 to radix_sort_config
  - reduce_by_key_config_v2 to reduce_by_key_config
- Removed support for custom config types for device algorithms
- host_warp_size() was moved into rocprim/device/config_types.hpp; it now uses either device_id or a stream parameter to query the proper device and a device_id out parameter
  - The return type is hipError_t
- Added support for __int128_t in device_radix_sort and block_radix_sort
- Improved the performance of match_any, and block_histogram which uses it
- Removed reduce_by_key_config, MatchAny, scan_config, scan_by_key_config, and radix_sort_config
- Build issues with rmake.py on Windows when using VS 2017 15.8 or later (due to a breaking fix with extended aligned storage)
- Fix tests for block_histogram, block_exchange, device_histogram and device_reduce_by_key for various types
- device_run_length_encode, warp_exchange and warp_load tests fail with rocprim::half
- block_sort::sort() overload for keys and values with a dynamic size, for all block sort algorithms. Additionally, all block_sort::sort() overloads with a dynamic size are now supported for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort.
- New two-way partition primitive partition_two_way which can write to two separate iterators.
- Improved the performance of partition.
- Fixed rocprim::MatchAny for devices with 64-bit warp size. The function rocprim::MatchAny is deprecated and rocprim::match_any is preferred instead.
- Deprecated configuration radix_sort_config for device-level radix sort as it no longer matches the algorithm's parameters. New configuration radix_sort_config_v2 is preferred instead.
- Removed erroneous implementation of device-level inclusive_scan and exclusive_scan. The prior default implementation using lookback-scan now is the only available implementation.
- The benchmark metric indicating the bytes processed for exclusive_scan_by_key and inclusive_scan_by_key has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as scan and scan_by_key instead of scan_exclusive and scan_inclusive.
- Deprecated configurations scan_config and scan_by_key_config for device-level scans, as they no longer match the algorithm's parameters. New configurations scan_config_v2 and scan_by_key_config_v2 are preferred instead.
- Fixed build issue caused by missing header in thread/thread_search.hpp.
- New block level radix_rank primitive
- New block level radix_rank_match primitive
- Added a stable block sorting implementation, which can be used with block_sort by adding the block_sort_algorithm::stable_merge_sort algorithm
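What "stable" guarantees can be shown on the host with std::stable_sort. This is only an analogy for the block-level algorithm named above and does not use rocPRIM, and stable_sort_by_key is a hypothetical helper.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sorts (key, tag) pairs by key only; a stable sort keeps equal keys in input order.
inline std::vector<std::pair<int, char>> stable_sort_by_key(std::vector<std::pair<int, char>> items)
{
    std::stable_sort(items.begin(),
                     items.end(),
                     [](const auto& a, const auto& b) { return a.first < b.first; });
    return items;
}
```

For the input (2,a), (1,b), (2,c), (1,d), the result is (1,b), (1,d), (2,a), (2,c): within each key, the tags keep their original relative order.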
- Improved the performance of:
  - block_radix_sort
  - device_radix_sort
  - device_merge_sort
- Updated the docs directory structure to match the standard of rocm-docs-core
- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows (due to GPU printf performance issues on Windows)
- When ROCPRIM_DISABLE_LOOKBACK_SCAN is set, device_scan fails for input sizes larger than scan_config::size_limit, which defaults to std::numeric_limits<unsigned int>::max()
- device_partition, device_unique, and device_reduce_by_key now support problem sizes larger than 2^32 items
- Device algorithms now return hipErrorInvalidValue if the amount of passed temporary memory is insufficient
- Lists of sizes for tests are unified, restored scan and reduce tests for half and bfloat16 values
- block_sort::sort() overload for keys and values with a dynamic size
  - This overload was documented but the implementation is missing; to avoid further confusion, the documentation is removed until a decision is made on implementing the function
- Fixed the compilation failure in device_merge when the two key iterators don't match
- device_merge doesn't correctly support different types for keys_input1 and keys_input2 (as of the 5.3.0 release)
- New functions subtract_left and subtract_right in block_adjacent_difference to apply functions on pairs of adjacent items distributed between threads in a block
- New device-level adjacent_difference primitives
- Experimental tooling for automatic kernel configuration tuning for various architectures
- Benchmarks collect and output more detailed system information
- CMake functionality improves build parallelism of the test suite that splits compilation units by function or by parameters
- Reverse iterator
- Support for problem sizes over UINT_MAX in device functions inclusive_scan_by_key and exclusive_scan_by_key
- Improved the performance of warp primitives using the swizzle operation on Navi
- Improved build parallelism of the test suite by splitting up large compilation units
- device_select now supports problem sizes larger than 2^32 items
- device_segmented_radix_sort now partitions segments to group small, medium, and large segments
  - Each segment group can be sorted by specialized kernels to improve throughput
- Improved histogram performance for the case of highly uneven sample distribution
- Packages for tests and benchmark executables on all supported operating systems using CPack
- Added file and folder reorganization changes with backward compatibility support using wrapper headers
- Fixed Radix Sort int64_t bug introduced in version 2.10.11
- Future value
- Device partition_three_way to partition input to three output iterators based on two predicates
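A three-way partition driven by two predicates can be sketched as a sequential host reference. This is not the rocPRIM API. The names are hypothetical, and the rule that the first predicate takes priority over the second is an assumption made for the sketch.

```cpp
#include <vector>

// Hypothetical container for the three outputs.
template <class T>
struct three_way_result
{
    std::vector<T> first_part;
    std::vector<T> second_part;
    std::vector<T> unselected;
};

// Sequential reference: each element goes to exactly one output, in input order.
template <class T, class FirstPredicate, class SecondPredicate>
three_way_result<T> partition_three_way_reference(const std::vector<T>& input,
                                                  FirstPredicate        first_pred,
                                                  SecondPredicate       second_pred)
{
    three_way_result<T> result;
    for(const T& item : input)
    {
        if(first_pred(item))
        {
            result.first_part.push_back(item);
        }
        else if(second_pred(item)) // assumed: only tested when the first predicate fails
        {
            result.second_part.push_back(item);
        }
        else
        {
            result.unselected.push_back(item);
        }
    }
    return result;
}
```

For the input {1, 2, 3, 4, 5, 6} with the predicates "less than 3" and "even", this reference yields {1, 2}, {4, 6} and {3, 5}.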
- The reduce/scan algorithm precision issues in the tests have been resolved for half types
- The device Radix Sort algorithm supports indexing with 64-bit unsigned integers
  - The indexer type is chosen based on the type argument of parameter size
  - If sizeof(size) is not larger than 4 bytes, the indexer type is 32-bit unsigned int, otherwise, the indexer type is 64-bit unsigned int
  - The maximum problem size is based on the compile time configuration of the algorithm according to the following formula:
    max_problem_size = (UINT_MAX + 1) * config::scan::block_size * config::scan::items_per_thread
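The indexer selection rule and the size formula above can be written out as a small host-only sketch. The names indexer_t and max_problem_size are hypothetical, and the block size and items-per-thread values used in the example are made up rather than taken from any rocPRIM configuration.

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// Indexer selection as described above: 32-bit unless the size type is wider than 4 bytes.
template <class Size>
using indexer_t = std::conditional_t<(sizeof(Size) > 4), std::uint64_t, std::uint32_t>;

// The formula from above, evaluated in 64-bit arithmetic so that UINT_MAX + 1 does not wrap.
constexpr std::uint64_t max_problem_size(std::uint64_t block_size, std::uint64_t items_per_thread)
{
    return (static_cast<std::uint64_t>(UINT_MAX) + 1) * block_size * items_per_thread;
}

static_assert(std::is_same_v<indexer_t<std::uint32_t>, std::uint32_t>, "4-byte size type");
static_assert(std::is_same_v<indexer_t<std::uint64_t>, std::uint64_t>, "8-byte size type");
```

With a hypothetical block size of 256 and 4 items per thread, the formula gives (UINT_MAX + 1) * 1024, which is 2^42 on platforms where unsigned int is 32 bits wide.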
- Flags API of block_adjacent_difference
- device_segmented_radix_sort unit test is failing for HIP on Windows
- Enable bfloat16 tests and reduce threshold for bfloat16
- Fix device scan limit_size feature
- Non-optimized builds no longer trigger local memory limit errors
- Scan size limit feature
- Reduce size limit feature
- Transform size limit feature
- block_load_striped and block_store_striped
- gather_to_blocked to gather values from other threads into a blocked arrangement
- The block sizes for device merge sort's initial block sort and its merge steps are now separate in its kernel config
  - The block sort step supports multiple items per thread
- You can now set the size_limit for scan, reduce, and transform in the config struct instead of using a parameter
- device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type; exclusive_scan uses initial-value-type
  - This changes the behavior of small-size input types with large-size output types (e.g., short input, int output) and low-res input with high-res output (e.g., float input, double output)
- Revert an old Fiji workaround because they solved the issue at the compiler side
- Update README CMake minimum version number
- Added block sort support for multiple items per thread
- Currently only powers of two block sizes, and items per threads are supported and only for full blocks
- Bumped the minimum required version of CMake to 3.16
- device_segmented_radix_sort and device_scan unit tests failing for HIP on Windows
- ReduceEmptyInput causes random failure with bfloat16
- Initial HIP on Windows support
- bfloat16 support added
- Packaging has been split into a runtime package (rocprim) and a development package (rocprim-devel): The development package depends on the runtime package. When installing the runtime package, the package manager will suggest the installation of the development package to aid users transitioning from the previous version's combined package. This suggestion by package manager is for all supported operating systems (except CentOS 7) to aid in the transition. The suggestion feature in the runtime package is introduced as a deprecated feature and will be removed in a future ROCm release.
  - Because rocPRIM is a header-only library, the runtime package is an empty placeholder used to aid in the transition. This package is also a deprecated feature and will be removed in a future ROCm release.
- Unit tests may soft hang on MI200 when running in hipMallocManaged mode
- Code coverage tools build option
- AddressSanitizer build option
- gfx1030 support added
- Experimental HIP-CPU support; build using GCC/Clang/MSVC on Windows and Linux (this is work in progress and many algorithms are known to fail)
- Added single tile Radix Sort for smaller sizes
- Improved performance for Radix Sort for larger element sizes
- Bug fix and minor performance improvement for merge_sort when input and output storage are the same
- gfx90a support added
- warp_size() function; use host_warp_size() and device_warp_size() for host and device references, respectively
- Size zero inputs are now properly handled with newer ROCm builds that no longer allow zero-size kernel grid and block dimensions
- Minimum CMake version required is now 3.10.2
- Device scan unit test is currently failing due to an LLVM bug
- Texture cache iteration support has been re-enabled
- Benchmark builds have been re-enabled
- Unique operator is no longer called on invalid elements
- Device scan unit test is currently failing because of an LLVM bug
- No new features
- Updates to DPP instructions for warp shuffle
- Benchmark builds are disabled due to compiler bug
- HIP CMake dependency
- Updates to warp shuffle for gfx10
- Disabled DPP functions on gfx10++
- Benchmark builds are disabled due to compiler bug
- Fix for rocPRIM texture cache iterator
- Package dependency corrected to hip-rocclr
- rocPRIM texture cache iterator functionality is broken in the runtime (this will be fixed in the next release); you can use the prior release if calling this function
- No new features
- Point release with compilation fix
- Improved tests with fixed and random seeds for test data
- Network interface improvements with API v3
- Switched to HIP-Clang as the default compiler
- CMake searches for rocPRIM locally first; if it's not found, CMake downloads it from GitHub