
Build RCCL in a single generic stage for all Instinct/CDNA targets #4450

Draft
marbre wants to merge 4 commits into main from users/marbre/multi-arch-rccl-single-stage



@marbre marbre commented Apr 10, 2026

Motivation

RCCL's host code can vary depending on which GPU targets are compiled in, so building per-arch shards risks producing inconsistent host code across shards. This PR switches to building RCCL with USE_DIST_AMDGPU_TARGETS in a single comm-libs CI stage, which yields a single consistent artifact.

Building for all dist families in one pass would spawn one device-link lld invocation per arch (at O3 LTO on roughly 160 bitcode objects), pushing the multi-arch CI stage close to the 2-hour timeout. The comm-libs stage is therefore restricted to Instinct/CDNA families only.

Technical Details

  • Add USE_DIST_AMDGPU_TARGETS to therock_cmake_subproject_declare(rccl)
  • Add TARGET_NEUTRAL to therock_provide_artifact(rccl), producing a single rccl_dev_generic.tar.xz artifact
  • Change artifact_groups.comm-libs type from per-arch to generic
  • Remove type = per-arch from build_stages.comm-libs
  • Replace the matrix CI job with a single generic job
  • Add restrict_dist_families_regex = "dcgpu|^gfx9" to the comm-libs artifact group to filter dist families to Instinct/CDNA only
  • Add exclude_family to the rccl test configuration to prevent test failures if RDNA runners gain
    further test coverage in the future
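
The family filter in the list above can be illustrated with a minimal Python sketch. It is not the code from configure_stage.py. The helper name `restrict_dist_families` and the family names in the list are hypothetical, and the sketch assumes the pattern is applied with `re.search` semantics (the `^` anchor on only the second alternative suggests this, but it is an assumption).

```python
import re


def restrict_dist_families(families, pattern):
    """Return the families whose name matches `pattern` anywhere (re.search)."""
    regex = re.compile(pattern)
    return [family for family in families if regex.search(family)]


# Hypothetical family names, chosen only to exercise both regex alternatives.
families = [
    "dcgpu-all",     # kept: contains "dcgpu"
    "gfx94X-dcgpu",  # kept: contains "dcgpu" and starts with "gfx9"
    "gfx950-dcgpu",  # kept: contains "dcgpu" and starts with "gfx9"
    "gfx110X-all",   # dropped: no "dcgpu", starts with "gfx1"
    "gfx120X-all",   # dropped: no "dcgpu", starts with "gfx1"
    "dgpu-all",      # dropped: "dgpu" is not "dcgpu"
]

selected = restrict_dist_families(families, r"dcgpu|^gfx9")
print(selected)
```

Under these assumptions, only names that contain `dcgpu` or begin with `gfx9` survive, which is how the pattern narrows the comm-libs stage to Instinct/CDNA families.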

The [artifacts.rccl] entry in BUILD_TOPOLOGY.toml intentionally remains target-specific so the kpack splitter can later split the monolithic artifact by architecture.

Test Plan

CI run on PR.

Test Result

Pending.

Submission Checklist

RCCL's host code can vary depending on which GPU targets are compiled
in. Building per-arch shards risks producing inconsistent host code
across shards. This switches to build RCCL with USE_DIST_AMDGPU_TARGETS
in a single comm-libs CI stage, yielding a single consistent artifact.

Changes:
- Add USE_DIST_AMDGPU_TARGETS to therock_cmake_subproject_declare(rccl)
- Add TARGET_NEUTRAL to therock_provide_artifact(rccl), producing a
  single rccl_dev_generic.tar.xz artifact
- Change artifact_groups.comm-libs type from per-arch to generic
- Remove type = per-arch from build_stages.comm-libs
- Replace the matrix CI job with a single generic job

The [artifacts.rccl] entry in BUILD_TOPOLOGY.toml intentionally remains
target-specific so the kpack splitter can later split the monolithic
artifact by architecture.

Co-Authored-By: Claude <noreply@anthropic.com>
@marbre marbre changed the title from "Build RCCL for all dist targets in a single stage" to "Build RCCL in a single generic stage for all Instinct/CDNA targets" Apr 10, 2026
rccl is a multi-GPU collective comms library requiring high-bandwidth
GPU-to-GPU interconnects (xGMI on Instinct/CDNA). With
USE_DIST_AMDGPU_TARGETS, building for all dist families spawns one
device-link lld invocation per arch at O3 LTO on ~160 bitcode objects,
pushing CI toward the 2-hour timeout.

Add restrict_dist_families_regex to ArtifactGroup in BUILD_TOPOLOGY.toml.
configure_stage.py reads this and filters dist_amdgpu_families before
generating THEROCK_DIST_AMDGPU_FAMILIES, limiting comm-libs to Instinct
families (dcgpu|^gfx9).

Also add exclude_family to the rccl test configuration to prevent test
failures when RDNA runners gain full test coverage.

Co-Authored-By: Claude <noreply@anthropic.com>
@marbre marbre force-pushed the users/marbre/multi-arch-rccl-single-stage branch from 7b08402 to 5e84d5f Compare April 10, 2026 20:06
rccl-tests is bundled into the same artifact as rccl. Using
USE_TEST_AMDGPU_TARGETS would build rccl-tests for all available
targets, producing an artifact inconsistent with rccl which only
contains Instinct/CDNA device code. USE_DIST_AMDGPU_TARGETS keeps
both in sync and ensures the restrict_dist_families_regex on the
comm-libs artifact group applies to rccl-tests as well.

Co-Authored-By: Claude <noreply@anthropic.com>
