[Example] Update intra-node GEMM-RS example#35
Caution: Review failed. The pull request is closed.
Walkthrough
This PR adds distributed GEMM infrastructure with overlapped computation support. It introduces atomic and store synchronization primitives at the PTX and IR levels, creates a 2D reduce-scatter framework with NVLink topology detection, provides new distributed GEMM examples, and updates an existing example with error checking.
Sequence Diagram
sequenceDiagram
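To make the "full-mesh NVLink" condition in the walkthrough concrete, here is a minimal sketch of what such a check amounts to. It assumes the topology has already been reduced to a boolean matrix; the function name `is_full_mesh` and the matrix input are hypothetical and are not taken from the PR, whose `has_fullmesh_nvlink()` may detect the topology quite differently.

```python
def is_full_mesh(links):
    """Return True if every pair of distinct GPUs is directly linked.

    `links` is a hypothetical square matrix where links[i][j] is truthy
    when GPU i has a direct NVLink connection to GPU j.
    """
    n = len(links)
    return all(links[i][j] for i in range(n) for j in range(n) if i != j)


# Two GPUs linked to each other: full mesh.
two_gpu = [[0, 1],
           [1, 0]]

# Three GPUs in a star around GPU 0: GPUs 1 and 2 are not linked.
star = [[0, 1, 1],
        [1, 0, 0],
        [1, 0, 0]]
```

Diagonal entries are ignored, since a device does not need a link to itself.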
participant Main as Main Process
participant Distributed as Distributed Init
participant GEMM as GEMM Kernel Build
participant Exec as Execution (Multi-rank)
participant TL as TileLang GEMM RS
participant RS as Reduce-Scatter
participant PT as PyTorch Path
participant Verify as Verification
Main->>Distributed: Initialize process group & NVLink topology check
Distributed->>Distributed: has_fullmesh_nvlink() validation
Distributed-->>Main: Ready
Main->>GEMM: Compile gemm_kernel (JIT)
GEMM-->>Main: Kernel ready
Main->>Exec: Spawn processes per rank
Exec->>TL: Execute gemm_rs_op(A, B, C)
activate TL
TL->>TL: Run gemm_kernel on gemm_stream
TL->>RS: Synchronize, initiate reduce-scatter
RS->>RS: intra_node_scatter with signal sync
RS->>RS: ring_reduce_tma (per-node reduction)
RS->>RS: ring_reduce (inter-node if multi-node)
RS-->>TL: Output ready
deactivate TL
TL-->>Exec: TileLang result
Exec->>PT: Execute torch_gemm_rs (baseline)
PT-->>Exec: PyTorch result
Exec->>Verify: Compare results (allclose)
Verify-->>Exec: Pass/Fail status
Exec-->>Main: Benchmark & metrics
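The result that the verification step compares against can be described without any GPU code. The sketch below emulates GEMM followed by reduce-scatter in plain Python. It assumes each rank holds a K-dimension slice of A and B and receives a contiguous block of output rows; that sharding layout and the helper names are illustrative assumptions, not details confirmed by the PR.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gemm_rs_reference(a_shards, b_shards):
    """Emulate GEMM followed by reduce-scatter across len(a_shards) ranks.

    a_shards[r] is rank r's (M x K_r) slice of A and b_shards[r] is its
    (K_r x N) slice of B (assumed layout). Returns one block of
    M / world_size output rows per rank.
    """
    world_size = len(a_shards)
    # Each rank computes a full-size partial product from its own slices.
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    m, n = len(partials[0]), len(partials[0][0])
    assert m % world_size == 0, "rows must divide evenly across ranks"
    # Reduce: element-wise sum of the partial products.
    full = [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(m)]
    # Scatter: rank r keeps rows [r * rows, (r + 1) * rows).
    rows = m // world_size
    return [full[r * rows:(r + 1) * rows] for r in range(world_size)]


# Two ranks, M = 2, N = 2, one column of K per rank.
a_shards = [[[1], [2]], [[3], [4]]]
b_shards = [[[1, 0]], [[0, 1]]]
outputs = gemm_rs_reference(a_shards, b_shards)
```

An overlapped implementation produces the same per-rank blocks; it differs only in when the communication happens relative to the compute.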
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
This pull request introduces support for new tilelang CUDA intrinsics for atomic operations and stores, and demonstrates their usage in distributed GEMM examples. The changes include adding new built-in operators, corresponding CUDA code generation and PTX implementations, and updating distributed example scripts to leverage these features. Additionally, a new distributed GEMM example with overlapped reduce-scatter is provided, while an older example is removed.
Tilelang CUDA atomic intrinsics support
- Adds the built-in operators atom_add and st to tilelang, registered in src/op/builtin.cc and declared in src/op/builtin.h. These operators represent atomic add (returning the original value) and atomic store with semantics.
- Adds the PTX implementations in src/tl_templates/cuda/atomic.h and src/tl_templates/cuda/sync.h.
- Updates src/target/codegen_cuda.cc to emit calls to the new atomic intrinsics, mapping tilelang operators to the correct PTX functions.
Distributed GEMM examples update
- Adds example_gemm_rs_overlapped.py, demonstrating overlapped GEMM and reduce-scatter using the new atomic intrinsics and synchronization primitives.
- Removes example_gemm_rs.py, which is now superseded by the new version.
Miscellaneous
- Updates example_allgather_gemm_overlapped.py for CUDA error checking.
- Updates tilelang/distributed/utils.py with additional imports for threading and subprocess management.
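The phrase "atomic add (returning the original value)" describes fetch-and-add semantics, which is what makes a shared counter usable as a signal or ticket dispenser. The sketch below is a host-side emulation in Python using a lock. The class `SignalCounter` and the helper `run_producers` are hypothetical names for illustration only; this is not the tilelang API and does not model GPU memory ordering.

```python
import threading


class SignalCounter:
    """Host-side stand-in for a device counter updated with fetch-and-add."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def atom_add(self, delta):
        """Add delta atomically and return the value held before the add."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def load(self):
        with self._lock:
            return self._value


def run_producers(num_producers):
    """Each producer takes one ticket; return the sorted tickets and final count."""
    counter = SignalCounter()
    tickets = []
    tickets_lock = threading.Lock()

    def producer():
        ticket = counter.atom_add(1)  # the old value is unique per caller
        with tickets_lock:
            tickets.append(ticket)

    threads = [threading.Thread(target=producer) for _ in range(num_producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(tickets), counter.load()


tickets, final = run_producers(8)
```

Because every caller sees a distinct old value, the caller that observes `num_producers - 1` knows it arrived last, which is a common way to decide who triggers the next stage.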