Make get_rdma_gbs RoCE-LAG aware #671

Open
zfan2356 wants to merge 3 commits into deepseek-ai:main from zfan2356:fix/rdma-gbs-lag-aware


@zfan2356

Background

get_rdma_gbs reads ibstat's Rate: to feed ElasticBuffer.get_theoretical_num_sms. On Mellanox RoCE LAG (e.g., CX-7 with two 200 Gb/s NDR rails bonded into one mlx5_bond_*), ibstat reports the per-rail rate — the mlx5 driver hashes traffic across both rails, so a single QP only ever uses one, but multiple QPs reach the bond's full aggregate. The probe therefore returns half of the real per-NIC capacity and the downstream SM-count calculation picks the wrong bottleneck.

Same root cause as #614.

Measurement (H800 + CX-7, mlx5_bond_* 2-rail LAG)

| Probe | Result |
| --- | --- |
| ibstat 'mlx5_bond_1' .. Rate: | 200 Gb/s (per rail) |
| ib_write_bw -d mlx5_bond_1 -F (1 QP) | 183 Gb/s ≈ 22.9 GB/s |
| ib_write_bw -d mlx5_bond_1 -F -q 8 | 372 Gb/s ≈ 46.6 GB/s |
| Mlx5Context.query_mlx5_device(comp_mask=MASK_NUM_LAG_PORTS).num_lag_ports | 2 |

Real per-NIC bandwidth is ~50 GB/s, not 25 — a 2x error.
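The 2x figure follows from unit conversion alone. A minimal sketch of the arithmetic, using the nominal 200 Gb/s per-rail rate and the 2 rails reported in the measurement (the helper name gbps_to_gbytes is illustrative and not part of the package):

```python
def gbps_to_gbytes(rate_gbps: float) -> float:
    """Convert a link rate in Gb/s to GB/s (8 bits per byte)."""
    return rate_gbps / 8


per_rail_gbps = 200  # nominal rate ibstat reports for the bond
num_rails = 2        # num_lag_ports on the measured host

single_rail = gbps_to_gbytes(per_rail_gbps)            # old probe result
aggregate = gbps_to_gbytes(per_rail_gbps * num_rails)  # LAG-aware result

print(single_rail, aggregate, aggregate / single_rail)  # 25.0 50.0 2.0
```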

Fix

Query num_lag_ports via pyverbs and multiply the ibstat rate by it. Same approach as Mooncake's transfer engine, see mooncake-transfer-engine/src/transport/rdma_transport/rdma_context.cpp:266 for the C++ equivalent (mlx5dv_query_device + MLX5DV_CONTEXT_MASK_NUM_LAG_PORTS).

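from pyverbs.providers.mlx5.mlx5dv import Mlx5Context, Mlx5DVContextAttr

ctx = Mlx5Context(attr=Mlx5DVContextAttr(), name=nic_name)  # nic_name, e.g. 'mlx5_bond_1'
dv = ctx.query_mlx5_device(comp_mask=1 << 9)  # MLX5DV_CONTEXT_MASK_NUM_LAG_PORTS
num_lag_ports = dv.num_lag_ports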

Gotcha: query_mlx5_device()'s default comp_mask=-1 ORs only the masks pyverbs knows about; at least the version I tested against didn't include NUM_LAG_PORTS, so num_lag_ports came back as 0. The bit must be passed explicitly. Hardcoded from infiniband/mlx5dv.h since pyverbs doesn't re-export the constant.
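To illustrate why the default mask can miss the field, here is a pure-Python sketch of the bit logic. It does not call pyverbs; the list of "known" bits is a hypothetical stand-in (bits 0 through 8), since the real list depends on the pyverbs version, and effective_mask is an illustrative helper name.

```python
MLX5DV_CONTEXT_MASK_NUM_LAG_PORTS = 1 << 9  # value from infiniband/mlx5dv.h

# Hypothetical stand-in for the bits an older binding might know about.
# The real list depends on the installed pyverbs version.
KNOWN_BITS_HYPOTHETICAL = [1 << i for i in range(9)]


def effective_mask(comp_mask: int) -> int:
    """Mimic 'comp_mask=-1 means the OR of every bit the binding knows'."""
    if comp_mask == -1:
        mask = 0
        for bit in KNOWN_BITS_HYPOTHETICAL:
            mask |= bit
        return mask
    return comp_mask


default_mask = effective_mask(-1)
explicit_mask = effective_mask(MLX5DV_CONTEXT_MASK_NUM_LAG_PORTS)

# The LAG bit is absent from the default mask, so the field is never requested.
print(bool(default_mask & MLX5DV_CONTEXT_MASK_NUM_LAG_PORTS))   # False
print(bool(explicit_mask & MLX5DV_CONTEXT_MASK_NUM_LAG_PORTS))  # True
```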

Also adds an EP_RDMA_GBS=<gbps> env override (given in Gb/s and divided by 8 to get GB/s, the same units as the ibstat rate) so users with broken probes or non-Mellanox fabrics can short-circuit detection. This addresses the request in #614.

Falls back cleanly to the original single-rail behaviour when pyverbs is missing, when the NIC isn't mlx5, or when LAG isn't enabled.

Tests

tests/utils/test_envs.py — three local checks (no distributed setup needed) that auto-skip when pyverbs / ibstat / mlx5_bond_* aren't present. On H800 + CX-7 LAG the test prints:

get_rdma_gbs('mlx5_bond_1') = 50.0 GB/s  (per-port 200 Gb/s x 2 rails / 8)
LAG aggregation detected: 400 Gb/s -> 50.0 GB/s
ALL TESTS PASSED

Compatibility

No new hard dependencies; pyverbs is optional (import is try/except). No API changes. Only the returned value changes, and only when LAG is actually in effect.

Fixes #614. Credit to @michaelchen1996 for the env-override sketch.

xingonzhang added 3 commits June 25, 2026 15:10
Mellanox RoCE LAG presents N physical rails (e.g. 2x200G NDR) as a
single logical port, so the existing ibstat-based probe in
get_rdma_gbs() returns the per-rail rate, not the aggregated bandwidth.
On a 2-rail bond that's half of what's actually achievable, and the
downstream get_theoretical_num_sms calculation then picks the wrong
bottleneck and recommends fewer SMs than it should.

Probe num_lag_ports via the mlx5 direct-verbs interface (pyverbs) and
multiply the ibstat rate by it. Falls back cleanly to the old behaviour
when pyverbs is not installed.

Also add an EP_RDMA_GBS escape hatch (in Gbps, divided by 8 here, same
units as ibstat's Rate field) so deployments with broken probes or
exotic fabrics can override without patching the package. This matches
the request in deepseek-ai#614.

Verified on H800 + CX-7: ibstat reports 200 Gb/s per rail, ib_write_bw
with 8 QPs reaches ~372 Gb/s, and Mlx5Context.query_mlx5_device()
returns num_lag_ports=2, recovering ~46 GB/s = 372/8 (vs the old 25).

Fixes deepseek-ai#614 (credit @michaelchen1996 for the original report and patch).

Three local checks that fall back gracefully when pyverbs / ibstat /
mlx5_bond_* is absent (so single-rail and non-RDMA CI hosts skip):

  - _query_num_lag_ports returns >=1
  - get_rdma_gbs matches rate_per_port * num_lag_ports / 8
  - EP_RDMA_GBS env override is honored

Pure-local probes, so single-process is enough; the file also runs
cleanly under torchrun --nnodes=2 --nproc-per-node=8 (each rank just
prints from its own context, exercising concurrent pyverbs opens).

On H800 + CX-7 2x200G RoCE LAG the test prints:

    get_rdma_gbs('mlx5_bond_1') = 50.0 GB/s
    LAG aggregation detected: 400 Gb/s -> 50.0 GB/s

pyverbs's default comp_mask=-1 ORs only the mask bits it knows about,
and at least some versions do not include MLX5DV_CONTEXT_MASK_NUM_LAG_PORTS
in that list. The result is that num_lag_ports stays at 0 and we fall back
to the legacy single-rail behaviour even on real LAG fabrics.

Passing the bit explicitly (1 << 9, taken from rdma-core's mlx5dv.h)
restores the intended behaviour.

Verified on H800 + CX-7 2-rail LAG:

  comp_mask=-1 (default)      -> num_lag_ports = 0
  comp_mask=1<<9 (this patch) -> num_lag_ports = 2

  get_rdma_gbs('mlx5_bond_1') = 50.0 GB/s    (was 25.0)

ds-review-bot left a comment

Collaborator


🤖 ds-review-bot Code Review

Model 1

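This MR makes get_rdma_gbs aware of the aggregate bandwidth under Mellanox RoCE LAG (multiple physical rails bonded into one logical port): it queries num_lag_ports through pyverbs' mlx5dv_query_device and multiplies the per-rail rate reported by ibstat by the number of rails, recovering the true per-NIC bandwidth (corrected from 25 GB/s to 50 GB/s on H800 + CX-7 2×200G). It also adds an EP_RDMA_GBS environment variable that overrides the detected value directly, and falls back gracefully to the old behaviour when pyverbs is missing, the NIC is not mlx5, or LAG is not enabled. The overall approach matches Mooncake's C++ implementation. The core computation (regex-parsed rate × num_lag_ports / 8) is correct and is consistent with the bandwidth model in the downstream get_theoretical_num_sms (rdma_gbs already enters the traffic/gbs calculation as the whole-NIC aggregate bandwidth). The main weaknesses are that a failed LAG query silently degrades to half bandwidth without any warning, and that the new tests are mostly "self-consistency" assertions with little power to validate the real bandwidth.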

Model 2

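This MR makes get_rdma_gbs support Mellanox RoCE LAG: on a CX-7 dual-rail bond (mlx5_bond_*), ibstat reports only the single-rail rate, so the downstream SM-count calculation underestimates the real bandwidth by half. The fix queries num_lag_ports through pyverbs' mlx5 direct-verbs interface and multiplies the ibstat rate by it, and adds an EP_RDMA_GBS environment-variable override. When pyverbs is missing, the NIC is not mlx5, or LAG is not enabled, it falls back cleanly to the original single-rail behaviour. The implementation logic is correct overall, exception handling is complete, the fallback path is clear, and the tests skip automatically when dependencies are missing. No obvious bugs were found; only two advisory suggestions are given below.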

Files reviewed: 2
Issues found: 🟡 1 warning | 🔵 4 suggestion
Inline comments posted: 5

Comment thread deep_ep/utils/envs.py
Comment on lines +270 to +279
try:
    ctx = Mlx5Context(attr=Mlx5DVContextAttr(), name=nic_name)
    try:
        dv = ctx.query_mlx5_device(comp_mask=_MLX5DV_CONTEXT_MASK_NUM_LAG_PORTS)
        num_lag_ports = int(dv.num_lag_ports or 0)
    finally:
        ctx.close()
    return max(num_lag_ports, 1)
except Exception:
    return 1


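🟡 warning: The try/except Exception in _query_num_lag_ports swallows every exception and silently returns 1, and get_rdma_gbs does not log whether LAG aggregation was applied when it calls it. So when pyverbs is installed but the query fails (the device cannot be opened, the pyverbs version is mismatched, the num_lag_ports field does not exist, and so on), the function falls back without any notice to the previous half-bandwidth result, which is exactly the problem this MR sets out to fix. Users cannot tell that the fix did not take effect (and when the error occurs on only one node, it is even harder to track down across nodes). Note that an invalid EP_RDMA_GBS already gets a print (lines 303-304), while a failure of the core LAG query is completely silent. Suggestion: print at least one warning in this except branch, or fall back silently only for expected cases such as "pyverbs missing / not mlx5" and surface all other exceptions.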
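One way to act on this warning, sketched under assumptions: treat a missing optional dependency as an expected, silent fallback, and emit a warning for anything else. The wrapper name lag_ports_or_warn and the injected probe callable are hypothetical and not taken from the PR; the failing probe below only simulates an unexpected error.

```python
import warnings
from typing import Callable


def lag_ports_or_warn(probe: Callable[[], int], nic_name: str) -> int:
    """Run a LAG probe; stay quiet on ImportError, warn on anything else."""
    try:
        num_lag_ports = int(probe())
    except ImportError:
        return 1  # expected: optional dependency (e.g. pyverbs) is missing
    except Exception as exc:
        warnings.warn(
            f'LAG probe failed for {nic_name!r} ({exc!r}); assuming a single rail',
            RuntimeWarning,
        )
        return 1
    return max(num_lag_ports, 1)


def _failing_probe() -> int:
    raise OSError('device busy')  # stands in for any unexpected failure


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = lag_ports_or_warn(_failing_probe, 'mlx5_bond_1')

print(result, len(caught))  # 1 1
```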

Comment thread deep_ep/utils/envs.py
Comment on lines +299 to +305
override = os.getenv('EP_RDMA_GBS')
if override:
    # noinspection PyBroadException
    try:
        return float(override) / 8
    except ValueError:
        print(f'Invalid EP_RDMA_GBS={override!r}, ignoring and falling back to ibstat')


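🔵 suggestion: EP_RDMA_GBS is read inside the body of get_rdma_gbs, but that function is decorated with @lru_cache (keyed on nic_name). The environment variable is therefore read and cached only on the first call; setting or changing EP_RDMA_GBS afterwards has no effect. If runtime overrides are meant to be supported, the environment variable needs to be part of the lru_cache key (or the env should not be read inside the cached function). If it is only meant to be set once before startup, the docstring should state explicitly that it must be set before the first call and that later changes are ignored.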
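The caching behaviour described here can be reproduced in isolation. The sketch below uses hypothetical function names and a fixed 200 Gb/s default in place of the real ibstat probe; it shows the stale read, and one possible remedy in which the env value is read outside the cached function and becomes part of the cache key.

```python
import functools
import os


@functools.lru_cache()
def cached_gbs(nic_name: str) -> float:
    # Env is read inside the cached body: only the first call sees it.
    return float(os.environ.get('EP_RDMA_GBS', '200')) / 8


@functools.lru_cache()
def _gbs_keyed(nic_name: str, override: str) -> float:
    return float(override or '200') / 8


def gbs_keyed(nic_name: str) -> float:
    # Env is read on every call and passed in, so it is part of the key.
    return _gbs_keyed(nic_name, os.environ.get('EP_RDMA_GBS', ''))


os.environ['EP_RDMA_GBS'] = '400'
first = cached_gbs('nic0')
os.environ['EP_RDMA_GBS'] = '800'
second = cached_gbs('nic0')  # stale: still the value from the first call
keyed = gbs_keyed('nic0')    # sees the updated value
os.environ.pop('EP_RDMA_GBS', None)

print(first, second, keyed)  # 50.0 50.0 100.0
```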

Comment thread tests/utils/test_envs.py
Comment on lines +98 to +110
n_lag = _query_num_lag_ports(nic)
expected_gbs = rate_per_port * n_lag / 8.0
actual_gbs = get_rdma_gbs(nic)
_log(
    f"get_rdma_gbs({nic!r}) = {actual_gbs} GB/s "
    f"(per-port {rate_per_port} Gb/s x {n_lag} rails / 8)"
)
assert actual_gbs == expected_gbs, f"expected {expected_gbs}, got {actual_gbs}"
# Headline assertion: on a 2-rail 200G LAG, we recover ~50 GB/s
# (= 400 Gb/s aggregated), not the previous 25 GB/s single-rail value.
if n_lag >= 2 and rate_per_port >= 100:
    aggregated_gbps = rate_per_port * n_lag
    _log(f"LAG aggregation detected: {aggregated_gbps} Gb/s -> {actual_gbs} GB/s")


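🔵 suggestion: In test_get_rdma_gbs_lag_aggregation, expected_gbs is computed by re-parsing ibstat and calling the same _query_num_lag_ports(nic), while get_rdma_gbs internally computes exactly rate_per_port * _query_num_lag_ports(nic) / 8. The assertion actual_gbs == expected_gbs therefore only verifies that the function is consistent with itself (it can catch only inconsistencies introduced by refactoring). The real conclusion (about 50 GB/s recovered under LAG rather than 25 GB/s) is only printed, not asserted, and the test is skipped outright in non-LAG environments. Suggestion: when LAG is detected (n_lag >= 2), at least assert that the aggregate bandwidth is greater than the single-rail value (or roughly n_lag times it). Otherwise the test has almost no power to check that the fix actually works, and there is a risk that the test passes while the aggregate bandwidth is still not obtained.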
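A sketch of the stronger assertion suggested here, written as a standalone helper so it can run without hardware. The helper name check_lag_recovery and its parameters are illustrative; in the real test the three inputs would come from ibstat, the LAG probe, and get_rdma_gbs.

```python
def check_lag_recovery(rate_per_port_gbps: float, n_lag: int, actual_gbs: float) -> None:
    """Assert that the reported bandwidth reflects LAG aggregation."""
    single_rail_gbs = rate_per_port_gbps / 8
    if n_lag >= 2:
        # The headline claim: strictly more than one rail's worth.
        assert actual_gbs > single_rail_gbs, (
            f'LAG with {n_lag} rails but only {actual_gbs} GB/s reported'
        )
        # And roughly n_lag times the single-rail value.
        expected = single_rail_gbs * n_lag
        assert abs(actual_gbs - expected) <= 1e-6 * expected
    else:
        assert actual_gbs == single_rail_gbs


check_lag_recovery(200, 2, 50.0)  # passes: aggregation applied

try:
    check_lag_recovery(200, 2, 25.0)  # regression: half bandwidth reported
    regression_caught = False
except AssertionError:
    regression_caught = True

print(regression_caught)  # True
```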

Comment thread deep_ep/utils/envs.py
Comment on lines +299 to +305
override = os.getenv('EP_RDMA_GBS')
if override:
    # noinspection PyBroadException
    try:
        return float(override) / 8
    except ValueError:
        print(f'Invalid EP_RDMA_GBS={override!r}, ignoring and falling back to ibstat')


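🔵 suggestion: The EP_RDMA_GBS override path does not validate non-positive values: if a user mistakenly sets EP_RDMA_GBS=0 or a negative value, a zero or negative bandwidth is returned directly. The return value of get_rdma_gbs is used in the SM-count calculation at elastic.py:614 (typically as a denominator or a rate), so a zero or negative value can cause a division by zero or an abnormal SM estimate. Suggestion: after the float conversion succeeds, check that the value is > 0, and otherwise fall back to the ibstat probe as well.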
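A minimal sketch of the validation proposed here, assuming a helper that returns None to signal "fall back to the ibstat probe". The function name read_gbs_override and the injectable env mapping are hypothetical; the finite-value check is an extra assumption that goes slightly beyond the suggestion.

```python
import math
import os
from typing import Mapping, Optional


def read_gbs_override(env: Mapping[str, str] = os.environ) -> Optional[float]:
    """Return EP_RDMA_GBS converted to GB/s, or None to request the probe."""
    raw = env.get('EP_RDMA_GBS')
    if not raw:
        return None
    try:
        gbps = float(raw)
    except ValueError:
        print(f'Invalid EP_RDMA_GBS={raw!r}, ignoring and falling back to ibstat')
        return None
    if not math.isfinite(gbps) or gbps <= 0:
        print(f'Non-positive EP_RDMA_GBS={raw!r}, ignoring and falling back to ibstat')
        return None
    return gbps / 8  # Gb/s -> GB/s


print(read_gbs_override({'EP_RDMA_GBS': '400'}))  # 50.0
print(read_gbs_override({'EP_RDMA_GBS': '0'}))    # None (after a message)
```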

Comment thread deep_ep/utils/envs.py
Comment on lines +252 to +253
@functools.lru_cache()
def _query_num_lag_ports(nic_name: str) -> int:


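🔵 suggestion: _query_num_lag_ports caches its result with lru_cache, including the fallback value 1 returned on failure. If the first call fails for a transient reason (for example, the device is briefly unavailable) and 1 is cached, the single-rail value stays in use even after LAG recovers. Since the function is called only once at startup and matches the caching semantics of get_rdma_gbs, the impact is limited, but a comment describing this caching behaviour would help prevent misuse.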
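If retrying after a transient failure were ever wanted, one option is to cache only successful probes. The sketch below is an assumption-laden alternative to lru_cache and not what the PR does; cached_lag_ports, the module-level dictionary, and the flaky probe are all illustrative.

```python
from typing import Callable, Dict

_lag_cache: Dict[str, int] = {}


def cached_lag_ports(nic_name: str, probe: Callable[[str], int]) -> int:
    """Memoize successful probes only; failures return 1 without caching."""
    if nic_name in _lag_cache:
        return _lag_cache[nic_name]
    try:
        num_lag_ports = max(int(probe(nic_name)), 1)
    except Exception:
        return 1  # not cached, so the next call retries the probe
    _lag_cache[nic_name] = num_lag_ports
    return num_lag_ports


calls = {'n': 0}


def flaky_probe(nic_name: str) -> int:
    """Fails on the first call, then reports two rails."""
    calls['n'] += 1
    if calls['n'] == 1:
        raise OSError('device temporarily unavailable')
    return 2


results = [cached_lag_ports('mlx5_bond_1', flaky_probe) for _ in range(3)]
print(results, calls['n'])  # [1, 2, 2] 2
```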

zfan2356 force-pushed the fix/rdma-gbs-lag-aware branch from 4426f40 to 2c5745a on June 25, 2026 11:55


Development

Successfully merging this pull request may close these issues.

EPv2: get_rdma_gbs() get half bandwidth in LACP bonding environments
