Add partitioned probe support for hash joins#22108
Conversation
mhaseeb123
left a comment
There was a problem hiding this comment.
Looks good so far. Surprisingly fairly easy to review
shrshi
left a comment
There was a problem hiding this comment.
Can you please post the benchmark results for the normal vs partitioned comparison? Thanks!
| auto constexpr tiles_in_block = DEFAULT_JOIN_BLOCK_SIZE / Ref::cg_size; | ||
| auto const num_blocks = static_cast<unsigned int>((n + tiles_in_block - 1) / tiles_in_block); | ||
|
|
||
| retrieve_kernel<IsOuter><<<num_blocks, DEFAULT_JOIN_BLOCK_SIZE, 0, stream.value()>>>( |
There was a problem hiding this comment.
I might be overlooking something, but could you help me understand why we need a custom retrieve kernel? My assumption is that we are seeing significant performance benefits from shared-memory buffering compared to our approach for unpartitioned joins, where we directly invoke cuco's retrieve for the probe table.
There was a problem hiding this comment.
We originally tried to avoid writing custom kernels and instead relied on cuco host APIs as much as possible. This helped minimize maintenance overhead in cuDF, and that approach worked well for quite some time. However, things have changed recently. We’ve identified a couple of issues:
- Some cuco-provided logic, such as
retrieve_outerandcount_outer, is highly ETL-specific and doesn’t have meaningful use cases outside of cuDF join operations. As a result, we’ve decided to move this logic into cuDF and deprecate it in cuco. - From a code cleanliness perspective, the original cuco kernels (e.g.,
retrieve) are designed to write build table matches in output as well, which isn’t always needed in cuDF. This forces us to use workarounds like transform or discard iterators, which are inefficient. - Additionally, since all types are already known within cuDF, writing custom kernels allows us to avoid the long list of template parameters required by cuco kernels, improving both readability and build times.
Regarding your specific question about the retrieve kernel: the underlying algorithms have changed.
Previously, the non-partitioned count only returned a total count. This required the retrieve kernel to use atomic operations to update match counts and compute correct output offsets.
With the new partitioned count_each approach, we maintain a per-key count array. This eliminates the need for atomic updates and output offsets can now be computed via a scan over the count array.
Because of this change, the previous use of shared memory to improve memory coalescing is no longer as beneficial as before. Writes to the output array are already coalesced under the new design.
In summary, introducing custom kernels in cuDF allows us to:
- Remove cuDF-specific logic from cuco
- Simplify and clean up the code
- Eliminate unnecessary general-purpose abstractions
- Improve build times and maintainability
for reference, the third bullet point in #19270
|
|
||
| template <bool IsOuter, typename Ref> | ||
| CUDF_KERNEL void __launch_bounds__(DEFAULT_JOIN_BLOCK_SIZE) | ||
| count_each_kernel(probe_key_type const* __restrict__ keys, |
There was a problem hiding this comment.
can we rename it to something more concise, like count_join_matches_kernel
There was a problem hiding this comment.
Good point. The current naming was carried over from cuco without much consideration, hence count_each_kernel. We also plan to migrate count_kernel into cuDF, and names like count_join_matches_kernel could become confusing.
To better reflect the use case, I'm renaming the current kernel to partitioned_count_kernel, indicating it’s used for partitioned joins. The upcoming standard version can remain count_kernel, where the former operates on per-key count arrays, while the latter produces a single total count.
shrshi
left a comment
There was a problem hiding this comment.
Two non-blocking nits, but looks great otherwise! :)
wence-
left a comment
There was a problem hiding this comment.
Requesting a few clarifications in the docstrings because every time I read the join docstrings I get confused.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughImplements partitioned (count+retrieve) hash-join: new count/retrieve kernels and launchers, partition-level inner/left/full APIs, per-partition finalization, refactored full-join finalization, explicit instantiations, benchmarks, tests, and CMake additions. ChangesPartitioned Hash Join Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Comment |
wence-
left a comment
There was a problem hiding this comment.
I think this looks good here, thanks for all your work.
One thing: there are a number of resolved comments where nothing ended up changing. I am unclear if that is because (for example) they were not relevant, they were considered but the codegen/performance was worse, they're being addressed in a followup, something else.
bdice
left a comment
There was a problem hiding this comment.
Approving CMake. I had leftover questions from an earlier state of this PR where I looked at the C++, but I never completed my review... Please treat these comments as non-blocking. I haven't re-reviewed the C++ side in this round.
| // Hash join retrieve kernel ported from cuco's open_addressing retrieve. | ||
| // Uses a shared-memory buffer per flushing tile (warp) to coalesce global | ||
| // output writes and amortize the global atomic counter across many matches. |
There was a problem hiding this comment.
How does this compare to the cuCollections implementation? Why don't we use that directly?
There was a problem hiding this comment.
#22108 (comment) Here’s a more detailed explanation that should hopefully answer your questions.
|
/merge |
f871594
into
rapidsai:release/26.06
Description
Closes #18677
This introduces partitioned probe support for hash joins and refactors the join internals to reduce duplication.
Checklist