Add partitioned probe support for hash joins by PointKernel · Pull Request #22108 · rapidsai/cudf

PointKernel · 2026-04-10T20:28:19Z

Description

Closes #18677

This introduces partitioned probe support for hash joins and refactors the join internals to reduce duplication.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

GregoryKimball · 2026-04-21T17:43:17Z

@shrshi and @lamarrr, would you please share your reviews?

mhaseeb123

Looks good so far. Surprisingly, fairly easy to review.

shrshi

Can you please post the benchmark results for the normal vs partitioned comparison? Thanks!

shrshi · 2026-04-23T00:58:15Z

+  auto constexpr tiles_in_block = DEFAULT_JOIN_BLOCK_SIZE / Ref::cg_size;
+  auto const num_blocks = static_cast<unsigned int>((n + tiles_in_block - 1) / tiles_in_block);
+
+  retrieve_kernel<IsOuter><<<num_blocks, DEFAULT_JOIN_BLOCK_SIZE, 0, stream.value()>>>(
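
For readers following the launch configuration in the diff above, here is a minimal host-side sketch of the grid-size arithmetic. The values of `block_size` and `cg_size` are illustrative stand-ins for `DEFAULT_JOIN_BLOCK_SIZE` and `Ref::cg_size`, and the assumption that each tile handles one probe key is mine, not something the diff states.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative values only; stand-ins for DEFAULT_JOIN_BLOCK_SIZE and Ref::cg_size.
constexpr std::size_t block_size = 128;
constexpr std::size_t cg_size    = 4;
static_assert(block_size % cg_size == 0, "a block must hold a whole number of tiles");

// Tiles (cooperative groups) per block; assumed here to handle one probe key each.
constexpr std::size_t tiles_in_block = block_size / cg_size;

// Ceiling division: the smallest number of blocks whose tiles cover all n keys.
constexpr unsigned int num_blocks_for(std::size_t n)
{
  return static_cast<unsigned int>((n + tiles_in_block - 1) / tiles_in_block);
}
```

With these stand-in values a block holds 32 tiles, so 33 keys need two blocks and the second block is mostly idle.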


I might be overlooking something, but could you help me understand why we need a custom retrieve kernel? My assumption is that we are seeing significant performance benefits from shared-memory buffering compared to our approach for unpartitioned joins, where we directly invoke cuco's retrieve for the probe table.

PointKernel · 2026-04-28

We originally tried to avoid writing custom kernels and instead relied on cuco host APIs as much as possible. This helped minimize maintenance overhead in cuDF, and that approach worked well for quite some time. However, things have changed recently. We’ve identified several issues:

Some cuco-provided logic, such as retrieve_outer and count_outer, is highly ETL-specific and doesn’t have meaningful use cases outside of cuDF join operations. As a result, we’ve decided to move this logic into cuDF and deprecate it in cuco.

From a code cleanliness perspective, the original cuco kernels (e.g., retrieve) are designed to write build-table matches to the output as well, which isn’t always needed in cuDF. This forces us to use workarounds like transform or discard iterators, which are inefficient.

Additionally, since all types are already known within cuDF, writing custom kernels allows us to avoid the long list of template parameters required by cuco kernels, improving both readability and build times.
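
To make the discard-iterator workaround mentioned above concrete, here is a small host-side sketch. It is not cuco or cuDF code: `discard_iterator` and `retrieve` are hypothetical names, and the build table is modelled as a `std::unordered_multimap` from key to build row index. The generic routine always writes both sides, so a caller that only wants probe indices has to pass a sink for the other side.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <unordered_map>
#include <vector>

// Hypothetical output iterator that accepts and drops every value written through it.
struct discard_iterator {
  struct sink {
    template <typename T>
    sink const& operator=(T const&) const { return *this; }
  };
  sink operator*() const { return {}; }
  discard_iterator& operator++() { return *this; }
  discard_iterator operator++(int) { return *this; }
};

// Generic retrieve in the style described above: it always writes both sides.
template <typename ProbeOut, typename BuildOut>
std::size_t retrieve(std::unordered_multimap<int, int> const& build,
                     std::vector<int> const& probe,
                     ProbeOut probe_out,
                     BuildOut build_out)
{
  std::size_t num_matches = 0;
  for (std::size_t i = 0; i < probe.size(); ++i) {
    auto [first, last] = build.equal_range(probe[i]);
    for (auto it = first; it != last; ++it, ++num_matches) {
      *probe_out++ = static_cast<int>(i);  // probe row index
      *build_out++ = it->second;           // build row index, dropped by a discard sink
    }
  }
  return num_matches;
}
```

A caller that needs only probe indices would pass `std::back_inserter(probe_idx)` and `discard_iterator{}`. The build-side values are still produced and then thrown away, which is the wasted work a purpose-built kernel can skip.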

Regarding your specific question about the retrieve kernel: the underlying algorithms have changed.

Previously, the non-partitioned count only returned a total count. This required the retrieve kernel to use atomic operations to update match counts and compute correct output offsets.

With the new partitioned count_each approach, we maintain a per-key count array. This eliminates the need for atomic updates, and output offsets can now be computed via a scan over the count array.

Because of this change, the previous use of shared memory to improve memory coalescing is no longer as beneficial as before. Writes to the output array are already coalesced under the new design.
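
The count, scan, and write sequence described above can be sketched on the host as follows. This is a simplified illustration under stated assumptions, not the cuDF implementation: the build table is modelled as a `std::unordered_multimap` from key to build row index, and `count_each` and `retrieve` are hypothetical helper names. On the GPU each probe row would be handled by a tile in parallel; the sequential loop only shows that every row writes into its own precomputed slice, so no atomic counter is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <unordered_map>
#include <utility>
#include <vector>

using build_map   = std::unordered_multimap<int, int>;  // key -> build row index
using index_pairs = std::pair<std::vector<int>, std::vector<int>>;

// Phase 1: one match count per probe key.
std::vector<std::size_t> count_each(build_map const& build, std::vector<int> const& probe)
{
  std::vector<std::size_t> counts(probe.size());
  for (std::size_t i = 0; i < probe.size(); ++i) { counts[i] = build.count(probe[i]); }
  return counts;
}

index_pairs retrieve(build_map const& build, std::vector<int> const& probe)
{
  auto const counts = count_each(build, probe);

  // Phase 2: an exclusive scan turns counts into each row's starting output offset.
  std::vector<std::size_t> offsets(counts.size());
  std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(), std::size_t{0});
  std::size_t const total = counts.empty() ? 0 : offsets.back() + counts.back();

  // Phase 3: each probe row writes only into [offsets[i], offsets[i] + counts[i]).
  std::vector<int> probe_idx(total), build_idx(total);
  for (std::size_t i = 0; i < probe.size(); ++i) {
    std::size_t out    = offsets[i];
    auto [first, last] = build.equal_range(probe[i]);
    for (auto it = first; it != last; ++it, ++out) {
      probe_idx[out] = static_cast<int>(i);
      build_idx[out] = it->second;
    }
    assert(out == offsets[i] + counts[i]);  // the slice is filled exactly
  }
  return {std::move(probe_idx), std::move(build_idx)};
}
```

Because consecutive probe rows own consecutive output slices, neighbouring writers touch neighbouring memory, which is the coalescing property the comment above refers to.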

In summary, introducing custom kernels in cuDF allows us to:

Remove cuDF-specific logic from cuco

Simplify and clean up the code

Eliminate unnecessary general-purpose abstractions

Improve build times and maintainability

For reference, see the third bullet point in #19270.

lamarrr · 2026-04-26T13:40:55Z

+
+template <bool IsOuter, typename Ref>
+CUDF_KERNEL void __launch_bounds__(DEFAULT_JOIN_BLOCK_SIZE)
+  count_each_kernel(probe_key_type const* __restrict__ keys,


Can we rename it to something more concise, like count_join_matches_kernel?

PointKernel · 2026-04-28

Good point. The current naming was carried over from cuco without much consideration, hence count_each_kernel. We also plan to migrate count_kernel into cuDF, and names like count_join_matches_kernel could become confusing.

To better reflect the use case, I'm renaming the current kernel to partitioned_count_kernel, indicating it’s used for partitioned joins. The upcoming standard version can remain count_kernel: the former operates on per-key count arrays, while the latter produces a single total count.
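
The distinction between the two kernels can be sketched on the host like this. The function names `partitioned_count` and `total_count` are hypothetical stand-ins for the kernels named above, the build table is modelled as a `std::unordered_multimap`, and the outer-join clamp (a probe key with zero matches still counts as one output row) is an assumption based on the usual left-join semantics rather than on the kernel source.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <unordered_map>
#include <vector>

using build_map = std::unordered_multimap<int, int>;  // key -> build row index

// Per-key variant: one count per probe key.
template <bool IsOuter>
std::vector<std::size_t> partitioned_count(build_map const& build, std::vector<int> const& probe)
{
  std::vector<std::size_t> counts(probe.size());
  for (std::size_t i = 0; i < probe.size(); ++i) {
    std::size_t const n = build.count(probe[i]);
    // Assumed outer-join rule: an unmatched probe row still emits one output row.
    counts[i] = (IsOuter && n == 0) ? 1 : n;
  }
  return counts;
}

// Total variant: a single number, here derived by summing the per-key counts.
template <bool IsOuter>
std::size_t total_count(build_map const& build, std::vector<int> const& probe)
{
  auto const counts = partitioned_count<IsOuter>(build, probe);
  return std::accumulate(counts.begin(), counts.end(), std::size_t{0});
}
```

The total is enough to size an output buffer, but only the per-key array lets each probe row know where its own matches belong.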

shrshi

Two non-blocking nits, but looks great otherwise! :)

wence-

Requesting a few clarifications in the docstrings because every time I read the join docstrings I get confused.

coderabbitai · 2026-05-09T00:04:03Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Implements partitioned (count+retrieve) hash-join: new count/retrieve kernels and launchers, partition-level inner/left/full APIs, per-partition finalization, refactored full-join finalization, explicit instantiations, benchmarks, tests, and CMake additions.

Changes

Partitioned Hash Join Feature

| Layer | File(s) | Summary |
| --- | --- | --- |
| Public API Contracts & Semantics | `cpp/include/cudf/join/join.hpp`, `cpp/include/cudf/join/hash_join.hpp`, `cpp/include/cudf/detail/join/hash_join.hpp`, `cpp/src/join/hash_join/hash_join.cu`, `cpp/src/join/hash_join/finalize_partitioned_full_join.cpp` | Adds partitioned join declarations and public wrappers (`partitioned_inner_join`, `partitioned_left_join`, `partitioned_full_join`) and `finalize_partitioned_full_join`; makes `join_match_context` non-copyable and movable. |
| Kernel Type & Reference Infrastructure | `cpp/src/join/hash_join/kernels_common.cuh`, `cpp/src/join/hash_join/ref_types.cuh` | Introduces `probe_key_type` and centralized equality/count-ref type aliases for primitive/nested/flat dispatch paths used by kernels. |
| Count Phase | `cpp/src/join/hash_join/partitioned_count_kernels.hpp`, `cpp/src/join/hash_join/partitioned_count_kernels.cuh`, `cpp/src/join/hash_join/partitioned_count.cu`, `cpp/src/join/hash_join/partitioned_count_outer.cu`, `cpp/src/join/hash_join/match_context.cu` | Adds partitioned match-count kernel and host launcher with warp/tile reduction and outer-join clamping; match_context now precomputes probe keys and calls `launch_partitioned_count`. |
| Retrieve Phase | `cpp/src/join/hash_join/partitioned_retrieve_kernels.hpp`, `cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh`, `cpp/src/join/hash_join/partitioned_retrieve.cu`, `cpp/src/join/hash_join/partitioned_retrieve_outer.cu` | Adds partitioned retrieve kernel with shared buffering and atomic-amortized global writes; launcher sizes and allocates output vectors from match counts and supports outer-join `JoinNoMatch`. |
| Partitioned Join Orchestration | `cpp/src/join/hash_join/partitioned_join_retrieve.cu`, `cpp/src/join/hash_join/partitioned_inner_join.cu`, `cpp/src/join/hash_join/partitioned_left_join.cu`, `cpp/src/join/hash_join/partitioned_full_join.cu` | Orchestrates per-partition work: empty-left/build cases, probe slicing, key preprocessing, comparator dispatch, and launches count+retrieve; thin wrappers per join kind. |
| Full-Join Finalization Consolidation | `cpp/src/join/join_common_utils.hpp`, `cpp/src/join/join_utils.cu`, `cpp/src/join/hash_join/retrieve_impl.cuh`, `cpp/src/join/conditional_join.cu`, `cpp/src/join/mixed_join.cu` | Centralizes `finalize_full_join` overloads handling single-probe and per-partition partials, replacing earlier complement/concatenate helpers with scatter/copy_if-based unmatched-row emission. |
| Build, Benchmarks & Tests | `cpp/CMakeLists.txt`, `cpp/benchmarks/join/join.cu`, `cpp/tests/join/join_tests.cpp` | Registers new sources in CMake; benchmark `nvbench_inner_join` gains `mode` axis (`normal`,`partitioned`); tests extended with partitioned inner/left/full and edge-case coverage. |
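
As an illustration of the full-join finalization summarized in the walkthrough, here is a host-side sketch of the scatter-then-copy_if idea applied to per-partition partial results. It is an assumption-laden simplification, not the `finalize_full_join` implementation: the function name, the `no_match` sentinel value, and the choice to concatenate the partials are all hypothetical. Each partial is a left-join result for one probe partition; right rows that no partition matched are appended with `no_match` on the left side.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sentinel standing in for the library's JoinNoMatch value.
constexpr int no_match = std::numeric_limits<int>::min();

using index_pairs = std::pair<std::vector<int>, std::vector<int>>;  // (left, right)

index_pairs finalize_full_join_sketch(std::vector<index_pairs> const& partials,
                                      std::size_t right_num_rows)
{
  std::vector<bool> matched(right_num_rows, false);
  index_pairs result;

  for (auto const& [left, right] : partials) {
    assert(left.size() == right.size());
    for (std::size_t k = 0; k < left.size(); ++k) {
      result.first.push_back(left[k]);
      result.second.push_back(right[k]);
      // "Scatter": flag every right row that some partition matched.
      if (right[k] != no_match) { matched[static_cast<std::size_t>(right[k])] = true; }
    }
  }

  // "copy_if": emit the right rows whose flag was never set by any partition.
  for (std::size_t r = 0; r < right_num_rows; ++r) {
    if (!matched[r]) {
      result.first.push_back(no_match);
      result.second.push_back(static_cast<int>(r));
    }
  }
  return result;
}
```

The flags have to be accumulated across all partitions before the final pass, since a right row unmatched in one partition may still match in another.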

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

wence-
mythrocks
mhaseeb123

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title clearly and specifically describes the primary change: adding partitioned (chunked) probe support for hash joins, which aligns with the main objectives in the changeset. |
| Description check | ✅ Passed | Description is related to the changeset, referencing the linked issue `#18677` and explaining the feature introduces partitioned probe support for hash joins with refactoring to reduce duplication. |
| Linked Issues check | ✅ Passed | Changeset implements partitioned probe API for hash joins [hash_join.hpp, partitioned_*_join.cu files], count kernels [partitioned_count_kernels.cuh], retrieve kernels [partitioned_retrieve_kernels.cuh], and test coverage [join_tests.cpp], directly fulfilling `#18677` requirements for chunked probe support. |
| Out of Scope Changes check | ✅ Passed | All significant changes relate to implementing partitioned hash join probing (`#18677`). Refactoring of join internals (finalize_full_join, join_common_utils) directly supports the partitioned implementation and reduces code duplication as stated in PR objectives. |

Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.

Comment @coderabbitai help to get the list of available commands and usage tips.

wence-

I think this looks good here, thanks for all your work.

One thing: there are a number of resolved comments where nothing ended up changing. I am unclear if that is because (for example) they were not relevant, they were considered but the codegen/performance was worse, they're being addressed in a followup, or something else.

bdice

Approving CMake. I had leftover questions from an earlier state of this PR where I looked at the C++, but I never completed my review... Please treat these comments as non-blocking. I haven't re-reviewed the C++ side in this round.

bdice · 2026-04-21T18:29:19Z

+// Hash join retrieve kernel ported from cuco's open_addressing retrieve.
+// Uses a shared-memory buffer per flushing tile (warp) to coalesce global
+// output writes and amortize the global atomic counter across many matches.
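
The buffering idea in the quoted comment can be illustrated with a small host-side sketch. It is not the kernel under review: `buffered_writer` and `buffer_size` are hypothetical, and `std::atomic` stands in for the global device counter. Matches are staged in a small local buffer, and the shared counter is bumped once per flush rather than once per match.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t buffer_size = 4;  // hypothetical per-tile buffer capacity

struct buffered_writer {
  std::atomic<std::size_t>& counter;  // shared output cursor (the "global atomic counter")
  std::vector<int>& output;           // preallocated output array
  std::array<int, buffer_size> buffer{};
  std::size_t filled     = 0;
  std::size_t atomic_ops = 0;  // bookkeeping for the illustration only

  void push(int value)
  {
    buffer[filled++] = value;
    if (filled == buffer_size) { flush(); }
  }

  void flush()
  {
    if (filled == 0) { return; }
    // One atomic reserves a contiguous range for every buffered match.
    std::size_t const base = counter.fetch_add(filled);
    ++atomic_ops;
    assert(base + filled <= output.size());
    for (std::size_t k = 0; k < filled; ++k) { output[base + k] = buffer[k]; }
    filled = 0;
  }
};
```

Pushing ten values through a buffer of four costs three atomic operations instead of ten, and each flush writes a contiguous run of the output.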


How does this compare to the cuCollections implementation? Why don't we use that directly?

#22108 (comment) Here’s a more detailed explanation that should hopefully answer your questions.

PointKernel · 2026-05-20T20:27:35Z

/merge

github-actions Bot assigned PointKernel Apr 10, 2026

github-actions Bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Apr 10, 2026

PointKernel changed the title ~~Chunked hash join probe~~ Add partitioned probe support for hash joins Apr 10, 2026

PointKernel added feature request New feature or request non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS labels Apr 10, 2026

PointKernel mentioned this pull request Apr 13, 2026

[FEA] Rewrite mixed join internals using normal join + pre/post-filtering #22124

Open

PointKernel marked this pull request as ready for review April 17, 2026 23:48

PointKernel requested review from a team as code owners April 17, 2026 23:48

PointKernel requested review from mhaseeb123, mythrocks and shrshi April 17, 2026 23:48

GregoryKimball added this to libcudf Apr 21, 2026

GregoryKimball moved this to Burndown in libcudf Apr 21, 2026

GregoryKimball requested a review from lamarrr April 21, 2026 17:43

mhaseeb123 reviewed Apr 21, 2026

shrshi reviewed Apr 22, 2026

Comment thread cpp/include/cudf/join/hash_join.hpp Outdated

shrshi reviewed Apr 23, 2026

lamarrr requested changes Apr 26, 2026

Comment thread cpp/benchmarks/join/join.cu Outdated

Comment thread cpp/include/cudf/join/join.hpp Outdated

Comment thread cpp/src/join/hash_join/count_kernels.cuh Outdated

lamarrr requested changes Apr 26, 2026

PointKernel requested review from lamarrr, mhaseeb123 and shrshi April 28, 2026 23:46

shrshi approved these changes May 4, 2026

Comment thread cpp/src/join/hash_join/partitioned_count_kernels.cuh Outdated

Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh

wence- requested changes May 5, 2026

PointKernel requested a review from wence- May 8, 2026 23:54

PointKernel requested a review from a team as a code owner May 19, 2026 22:05

PointKernel requested a review from jameslamb May 19, 2026 22:05

github-actions Bot added Python Affects Python cuDF API. Java Affects Java cuDF API. cudf-polars Issues specific to cudf-polars labels May 19, 2026

github-project-automation Bot added this to cuDF Python May 19, 2026

GPUtester moved this to In Progress in cuDF Python May 19, 2026

PointKernel changed the base branch from main to release/26.06 May 19, 2026 22:06

PointKernel removed request for a team and jameslamb May 19, 2026 22:07

rapidsai deleted a comment from copy-pr-bot Bot May 19, 2026

PointKernel removed Python Affects Python cuDF API. Java Affects Java cuDF API. cudf-polars Issues specific to cudf-polars labels May 19, 2026

PointKernel added 2 commits May 19, 2026 22:33

Fix probe_table_num_rows reference in match_context.cu

f2c9fff

Use right/left naming for table nouns in finalize_full_join and tests

3a47653

wence- approved these changes May 20, 2026

Comment thread cpp/src/join/hash_join/partitioned_count_kernels.cuh Outdated

Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh

bdice approved these changes May 20, 2026

mhaseeb123 approved these changes May 20, 2026

PointKernel added 3 commits May 20, 2026 17:14

Simplify partitioned_count_kernel: drop tile.all short-circuit, fuse IsOuter writers

87765a3

Add tile size check + use cuda::std::distance

594a532

Merge remote-tracking branch 'upstream/release/26.06' into chunked-hash-join-probe

c455b21

rapids-bot Bot merged commit f871594 into rapidsai:release/26.06 May 20, 2026
319 of 325 checks passed

github-project-automation Bot moved this from In Progress to Done in cuDF Python May 20, 2026

PointKernel deleted the chunked-hash-join-probe branch May 20, 2026 20:27

vuule moved this from Burndown to Landed in libcudf May 20, 2026
