
Add partitioned probe support for hash joins #22108

Merged
rapids-bot[bot] merged 30 commits into rapidsai:release/26.06 from PointKernel:chunked-hash-join-probe
May 20, 2026

Conversation

@PointKernel
Member

@PointKernel PointKernel commented Apr 10, 2026

Description

Closes #18677

This introduces partitioned probe support for hash joins and refactors the join internals to reduce duplication.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions Bot added the libcudf and CMake labels Apr 10, 2026
@PointKernel PointKernel changed the title Chunked hash join probe Add partitioned probe support for hash joins Apr 10, 2026
@PointKernel PointKernel added the feature request, non-breaking, and Spark labels Apr 10, 2026
@PointKernel PointKernel marked this pull request as ready for review April 17, 2026 23:48
@PointKernel PointKernel requested review from a team as code owners April 17, 2026 23:48
@GregoryKimball GregoryKimball moved this to Burndown in libcudf Apr 21, 2026
@GregoryKimball GregoryKimball requested a review from lamarrr April 21, 2026 17:43
@GregoryKimball
Contributor

@shrshi and @lamarrr would you please share your review?

Member

@mhaseeb123 mhaseeb123 left a comment

Looks good so far. Surprisingly, it was fairly easy to review.

Comment thread cpp/src/join/hash_join/partitioned_join_retrieve.cu
Comment thread cpp/src/join/hash_join/count_kernels.cuh Outdated
Comment thread cpp/src/join/hash_join/partitioned_join_retrieve.cu
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/join_utils.cu Outdated
Comment thread cpp/src/join/join_utils.cu
Comment thread cpp/src/join/join_utils.cu
Contributor

@shrshi shrshi left a comment

Can you please post the benchmark results for the normal vs partitioned comparison? Thanks!

Comment thread cpp/include/cudf/join/hash_join.hpp Outdated
Comment thread cpp/CMakeLists.txt Outdated
auto constexpr tiles_in_block = DEFAULT_JOIN_BLOCK_SIZE / Ref::cg_size;
auto const num_blocks = static_cast<unsigned int>((n + tiles_in_block - 1) / tiles_in_block);

retrieve_kernel<IsOuter><<<num_blocks, DEFAULT_JOIN_BLOCK_SIZE, 0, stream.value()>>>(
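
The launch-size arithmetic in the snippet above can be illustrated on the host. This is a minimal sketch, assuming one cooperative-group tile handles one probe key; the function name `num_blocks_for` and the example values 128 and 4 are hypothetical, and the real `DEFAULT_JOIN_BLOCK_SIZE` and `Ref::cg_size` are defined in libcudf and cuco.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the grid-size computation in the quoted diff.
// A block of `block_size` threads holds block_size / cg_size tiles, and each
// tile is assumed to process one probe key, so the grid needs
// ceil(n / tiles_in_block) blocks.
inline std::size_t num_blocks_for(std::size_t n, std::size_t block_size, std::size_t cg_size)
{
  assert(cg_size > 0 && block_size % cg_size == 0);
  std::size_t const tiles_in_block = block_size / cg_size;
  return (n + tiles_in_block - 1) / tiles_in_block;  // ceiling division
}
```

With a block size of 128 and a tile size of 4, a block covers 32 keys, so 33 probe keys need two blocks.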
Contributor

I might be overlooking something, but could you help me understand why we need a custom retrieve kernel? My assumption is that we are seeing significant performance benefits from shared-memory buffering compared to our approach for unpartitioned joins, where we directly invoke cuco's retrieve for the probe table.

Member Author

@PointKernel PointKernel Apr 28, 2026

We originally avoided writing custom kernels and relied on cuco host APIs as much as possible. This minimized maintenance overhead in cuDF, and the approach worked well for quite some time. However, things have changed recently, and we’ve identified a few issues:

  • Some cuco-provided logic, such as retrieve_outer and count_outer, is highly ETL-specific and has no meaningful use cases outside of cuDF join operations. As a result, we’ve decided to move this logic into cuDF and deprecate it in cuco.
  • From a code cleanliness perspective, the original cuco kernels (e.g., retrieve) are designed to also write build-table matches to the output, which cuDF doesn’t always need. This forces us to use workarounds like transform or discard iterators, which are inefficient.
  • Additionally, since all types are already known within cuDF, writing custom kernels allows us to avoid the long list of template parameters required by cuco kernels, improving both readability and build times.

Regarding your specific question about the retrieve kernel: the underlying algorithms have changed.

Previously, the non-partitioned count only returned a total count. This required the retrieve kernel to use atomic operations to update match counts and compute correct output offsets.

With the new partitioned count_each approach, we maintain a per-key count array. This eliminates the need for atomic updates, and output offsets can now be computed via a scan over the count array.

Because of this change, using shared memory to improve memory coalescing is less beneficial than before. Writes to the output array are already coalesced under the new design.
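
The count, scan, and retrieve flow described above can be sketched sequentially on the host. This is only an illustration under stated assumptions, not the libcudf kernels: the function `inner_join_via_counts`, the `index_pairs` alias, and the use of `std::unordered_map` as the build table are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// (probe row indices, build row indices); hypothetical output layout.
using index_pairs = std::pair<std::vector<int>, std::vector<int>>;

// Hypothetical host-side analogue of an inner join driven by per-key counts.
// `build` maps a key to the build-table rows that hold it.
inline index_pairs inner_join_via_counts(
  std::vector<int> const& probe_keys,
  std::unordered_map<int, std::vector<int>> const& build)
{
  // 1. Count phase: one match count per probe row.
  std::vector<std::size_t> counts(probe_keys.size(), 0);
  for (std::size_t i = 0; i < probe_keys.size(); ++i) {
    auto const it = build.find(probe_keys[i]);
    if (it != build.end()) { counts[i] = it->second.size(); }
  }

  // 2. Exclusive scan: offsets[i] is the first output slot owned by probe row i.
  std::vector<std::size_t> offsets(counts.size());
  std::size_t total = 0;
  for (std::size_t i = 0; i < counts.size(); ++i) {
    offsets[i] = total;
    total += counts[i];
  }

  // 3. Retrieve phase: each probe row writes into its own disjoint range,
  //    so no shared counter is needed to claim output slots.
  index_pairs out{std::vector<int>(total), std::vector<int>(total)};
  for (std::size_t i = 0; i < probe_keys.size(); ++i) {
    if (counts[i] == 0) { continue; }
    auto const& rows = build.at(probe_keys[i]);
    assert(offsets[i] + rows.size() <= total);
    for (std::size_t j = 0; j < rows.size(); ++j) {
      out.first[offsets[i] + j]  = static_cast<int>(i);
      out.second[offsets[i] + j] = rows[j];
    }
  }
  return out;
}
```

Because the output ranges are fixed before any write happens, the iterations of the third loop are independent of one another, which is the property a parallel version would rely on.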

In summary, introducing custom kernels in cuDF allows us to:

  • Remove cuDF-specific logic from cuco
  • Simplify and clean up the code
  • Eliminate unnecessary general-purpose abstractions
  • Improve build times and maintainability

For reference, see the third bullet point in #19270.

Comment thread cpp/benchmarks/join/join.cu Outdated
Comment thread cpp/include/cudf/join/join.hpp Outdated
Comment thread cpp/src/join/hash_join/count_kernels.cuh Outdated

template <bool IsOuter, typename Ref>
CUDF_KERNEL void __launch_bounds__(DEFAULT_JOIN_BLOCK_SIZE)
count_each_kernel(probe_key_type const* __restrict__ keys,
Contributor

Can we rename it to something more concise, like count_join_matches_kernel?

Member Author

Good point. The current naming was carried over from cuco without much consideration, hence count_each_kernel. We also plan to migrate count_kernel into cuDF, and names like count_join_matches_kernel could become confusing.

To better reflect the use case, I'm renaming the current kernel to partitioned_count_kernel to indicate that it’s used for partitioned joins. The upcoming standard version can remain count_kernel: the former operates on per-key count arrays, while the latter produces a single total count.
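
The difference between the two count flavors can be shown with a small host-side sketch. The names `count_per_key`, `count_total`, and `build_counts` are hypothetical stand-ins, not the kernels discussed here. The walkthrough on this page mentions outer-join clamping; the sketch assumes this means an unmatched probe row still occupies one output row in an outer join.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the build-side hash table: key -> number of build rows.
using build_counts = std::unordered_map<int, std::size_t>;

// Per-key flavor: one count per probe row. For outer joins an unmatched row is
// assumed to still produce one output row, so its count is clamped to 1.
template <bool IsOuter>
std::vector<std::size_t> count_per_key(std::vector<int> const& probe_keys,
                                       build_counts const& build)
{
  std::vector<std::size_t> counts(probe_keys.size());
  for (std::size_t i = 0; i < probe_keys.size(); ++i) {
    auto const it             = build.find(probe_keys[i]);
    std::size_t const matches = (it == build.end()) ? 0 : it->second;
    counts[i]                 = IsOuter ? std::max(matches, std::size_t{1}) : matches;
  }
  return counts;
}

// Total flavor: a single number, equal to the sum of the per-key counts.
template <bool IsOuter>
std::size_t count_total(std::vector<int> const& probe_keys, build_counts const& build)
{
  auto const counts = count_per_key<IsOuter>(probe_keys, build);
  assert(counts.size() == probe_keys.size());
  return std::accumulate(counts.begin(), counts.end(), std::size_t{0});
}
```

The per-key array carries strictly more information than the total, which is why offsets can be derived from it with a scan.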

Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Contributor

@shrshi shrshi left a comment

Two non-blocking nits, but looks great otherwise! :)

Comment thread cpp/src/join/hash_join/partitioned_count_kernels.cuh Outdated
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Contributor

@wence- wence- left a comment

Requesting a few clarifications in the docstrings because every time I read the join docstrings I get confused.

Comment thread cpp/include/cudf/join/hash_join.hpp
Comment thread cpp/include/cudf/join/hash_join.hpp
Comment thread cpp/include/cudf/join/hash_join.hpp
Comment thread cpp/include/cudf/join/hash_join.hpp
Comment thread cpp/src/join/hash_join/match_context.cu Outdated
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/join_utils.cu
@PointKernel PointKernel requested a review from wence- May 8, 2026 23:54
@coderabbitai

coderabbitai Bot commented May 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Implements partitioned (count+retrieve) hash-join: new count/retrieve kernels and launchers, partition-level inner/left/full APIs, per-partition finalization, refactored full-join finalization, explicit instantiations, benchmarks, tests, and CMake additions.

Changes

Partitioned Hash Join Feature

Layer: Public API Contracts & Semantics
File(s): cpp/include/cudf/join/join.hpp, cpp/include/cudf/join/hash_join.hpp, cpp/include/cudf/detail/join/hash_join.hpp, cpp/src/join/hash_join/hash_join.cu, cpp/src/join/hash_join/finalize_partitioned_full_join.cpp
Summary: Adds partitioned join declarations and public wrappers (partitioned_inner_join, partitioned_left_join, partitioned_full_join) and finalize_partitioned_full_join; makes join_match_context non-copyable and movable.

Layer: Kernel Type & Reference Infrastructure
File(s): cpp/src/join/hash_join/kernels_common.cuh, cpp/src/join/hash_join/ref_types.cuh
Summary: Introduces probe_key_type and centralized equality/count-ref type aliases for primitive/nested/flat dispatch paths used by kernels.

Layer: Count Phase
File(s): cpp/src/join/hash_join/partitioned_count_kernels.hpp, cpp/src/join/hash_join/partitioned_count_kernels.cuh, cpp/src/join/hash_join/partitioned_count.cu, cpp/src/join/hash_join/partitioned_count_outer.cu, cpp/src/join/hash_join/match_context.cu
Summary: Adds partitioned match-count kernel and host launcher with warp/tile reduction and outer-join clamping; match_context now precomputes probe keys and calls launch_partitioned_count.

Layer: Retrieve Phase
File(s): cpp/src/join/hash_join/partitioned_retrieve_kernels.hpp, cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh, cpp/src/join/hash_join/partitioned_retrieve.cu, cpp/src/join/hash_join/partitioned_retrieve_outer.cu
Summary: Adds partitioned retrieve kernel with shared buffering and atomic-amortized global writes; launcher sizes and allocates output vectors from match counts and supports outer-join JoinNoMatch.

Layer: Partitioned Join Orchestration
File(s): cpp/src/join/hash_join/partitioned_join_retrieve.cu, cpp/src/join/hash_join/partitioned_inner_join.cu, cpp/src/join/hash_join/partitioned_left_join.cu, cpp/src/join/hash_join/partitioned_full_join.cu
Summary: Orchestrates per-partition work: empty-left/build cases, probe slicing, key preprocessing, comparator dispatch, and launches count+retrieve; thin wrappers per join kind.

Layer: Full-Join Finalization Consolidation
File(s): cpp/src/join/join_common_utils.hpp, cpp/src/join/join_utils.cu, cpp/src/join/hash_join/retrieve_impl.cuh, cpp/src/join/conditional_join.cu, cpp/src/join/mixed_join.cu
Summary: Centralizes finalize_full_join overloads handling single-probe and per-partition partials, replacing earlier complement/concatenate helpers with scatter/copy_if-based unmatched-row emission.

Layer: Build, Benchmarks & Tests
File(s): cpp/CMakeLists.txt, cpp/benchmarks/join/join.cu, cpp/tests/join/join_tests.cpp
Summary: Registers new sources in CMake; benchmark nvbench_inner_join gains mode axis (normal,partitioned); tests extended with partitioned inner/left/full and edge-case coverage.
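
The per-partition full-join finalization summarized above can be pictured with a host-side sketch. Everything here is an assumption made for illustration: the function `finalize_full_join_sketch`, the `index_pairs` alias, and the `no_match` sentinel (standing in for JoinNoMatch) are hypothetical and do not reflect the actual libcudf signatures or output layout. The idea shown is that each probe partition contributes left-join-style pairs, and build rows that no partition matched are appended once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr int no_match = -1;  // hypothetical sentinel standing in for JoinNoMatch

// (probe row indices, build row indices); hypothetical layout.
using index_pairs = std::pair<std::vector<int>, std::vector<int>>;

// Concatenate per-partition partial results, then append every build row that
// no partition matched, paired with the sentinel on the probe side.
inline index_pairs finalize_full_join_sketch(std::vector<index_pairs> const& partials,
                                             std::size_t num_build_rows)
{
  index_pairs out;
  std::vector<bool> build_matched(num_build_rows, false);

  for (auto const& part : partials) {
    assert(part.first.size() == part.second.size());
    for (std::size_t i = 0; i < part.first.size(); ++i) {
      out.first.push_back(part.first[i]);
      out.second.push_back(part.second[i]);
      if (part.second[i] != no_match) {
        auto const b = static_cast<std::size_t>(part.second[i]);
        assert(b < num_build_rows);
        build_matched[b] = true;  // remember that this build row was matched
      }
    }
  }

  // Emit the unmatched build rows once, after all partitions are processed.
  for (std::size_t b = 0; b < num_build_rows; ++b) {
    if (!build_matched[b]) {
      out.first.push_back(no_match);
      out.second.push_back(static_cast<int>(b));
    }
  }
  return out;
}
```

Deferring the unmatched build rows to a single final pass matters because a build row that is unmatched in one partition may still be matched by a later one.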

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • wence-
  • mythrocks
  • mhaseeb123
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 16.22% which is insufficient. The required threshold is 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name: Title check
Status: ✅ Passed
Explanation: Title clearly and specifically describes the primary change: adding partitioned (chunked) probe support for hash joins, which aligns with the main objectives in the changeset.

Check name: Description check
Status: ✅ Passed
Explanation: Description is related to the changeset, referencing the linked issue #18677 and explaining the feature introduces partitioned probe support for hash joins with refactoring to reduce duplication.

Check name: Linked Issues check
Status: ✅ Passed
Explanation: Changeset implements partitioned probe API for hash joins [hash_join.hpp, partitioned_*_join.cu files], count kernels [partitioned_count_kernels.cuh], retrieve kernels [partitioned_retrieve_kernels.cuh], and test coverage [join_tests.cpp], directly fulfilling #18677 requirements for chunked probe support.

Check name: Out of Scope Changes check
Status: ✅ Passed
Explanation: All significant changes relate to implementing partitioned hash join probing (#18677). Refactoring of join internals (finalize_full_join, join_common_utils) directly supports the partitioned implementation and reduces code duplication as stated in PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Comment @coderabbitai help to get the list of available commands and usage tips.

@PointKernel PointKernel requested a review from a team as a code owner May 19, 2026 22:05
@PointKernel PointKernel requested a review from jameslamb May 19, 2026 22:05
@github-actions github-actions Bot added the Python, Java, and cudf-polars labels May 19, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 19, 2026
@PointKernel PointKernel changed the base branch from main to release/26.06 May 19, 2026 22:06
@PointKernel PointKernel removed request for a team and jameslamb May 19, 2026 22:07
@rapidsai rapidsai deleted a comment from copy-pr-bot Bot May 19, 2026
@rapidsai rapidsai deleted a comment from copy-pr-bot Bot May 19, 2026
@PointKernel PointKernel removed the Python, Java, and cudf-polars labels May 19, 2026
Contributor

@wence- wence- left a comment

I think this looks good here, thanks for all your work.

One thing: there are a number of resolved comments where nothing ended up changing. It is unclear to me whether that is because, for example, they were not relevant, they were considered but the codegen/performance was worse, they're being addressed in a follow-up, or something else.

Comment thread cpp/src/join/hash_join/partitioned_count_kernels.cuh Outdated
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Contributor

@bdice bdice left a comment

Approving CMake. I had leftover questions from an earlier state of this PR where I looked at the C++, but I never completed my review... Please treat these comments as non-blocking. I haven't re-reviewed the C++ side in this round.

Comment thread cpp/src/join/join_utils.cu Outdated
Comment on lines +6 to +8
// Hash join retrieve kernel ported from cuco's open_addressing retrieve.
// Uses a shared-memory buffer per flushing tile (warp) to coalesce global
// output writes and amortize the global atomic counter across many matches.
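
The buffering idea in the quoted comment can be illustrated with a host-side analogue. This is a sketch under assumptions, not the kernel itself: the class `buffered_writer` is hypothetical, a `std::vector` plays the role of the shared-memory buffer, and a `std::atomic` plays the role of the global counter. The point shown is that one `fetch_add` reserves a contiguous output range for a whole buffer of matches instead of one slot per match.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of a per-tile staging buffer.
class buffered_writer {
 public:
  buffered_writer(std::vector<int>& output, std::atomic<std::size_t>& counter, std::size_t capacity)
    : output_(output), counter_(counter), capacity_(capacity)
  {
    assert(capacity > 0);
    buffer_.reserve(capacity);
  }

  // Stage one match; the shared counter is touched only when the buffer fills up.
  void push(int value)
  {
    buffer_.push_back(value);
    if (buffer_.size() == capacity_) { flush(); }
  }

  // Reserve a contiguous output range with a single fetch_add, then copy into it.
  void flush()
  {
    if (buffer_.empty()) { return; }
    std::size_t const begin = counter_.fetch_add(buffer_.size());
    assert(begin + buffer_.size() <= output_.size());
    std::copy(buffer_.begin(), buffer_.end(),
              output_.begin() + static_cast<std::ptrdiff_t>(begin));
    buffer_.clear();
    ++num_flushes_;
  }

  std::size_t num_flushes() const { return num_flushes_; }

 private:
  std::vector<int>& output_;
  std::atomic<std::size_t>& counter_;
  std::size_t capacity_;
  std::vector<int> buffer_;
  std::size_t num_flushes_ = 0;
};
```

Writing ten values through a buffer of capacity four touches the counter three times rather than ten, and each flush writes a contiguous run of output slots.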
Contributor

How does this compare to the cuCollections implementation? Why don't we use that directly?

Member Author

Here’s a more detailed explanation that should hopefully answer your questions: #22108 (comment)

@PointKernel
Member Author

/merge

@rapids-bot rapids-bot Bot merged commit f871594 into rapidsai:release/26.06 May 20, 2026
319 of 325 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in cuDF Python May 20, 2026
@PointKernel PointKernel deleted the chunked-hash-join-probe branch May 20, 2026 20:27
@vuule vuule moved this from Burndown to Landed in libcudf May 20, 2026

Labels

CMake (CMake build issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.), non-breaking (Non-breaking change), Spark (Functionality that helps Spark RAPIDS)

Projects

Status: Done
Status: Landed

Development

Successfully merging this pull request may close these issues.

[FEA] Add support for chunked probe in hash joins

9 participants