
support to transfer key of routed experts #1658

Open

YanhuiDua wants to merge 2 commits into InternLM:main from YanhuiDua:refactor_r3_transfer

Conversation

@YanhuiDua
Collaborator

No description provided.

@YanhuiDua
Collaborator Author

@claude review

enable_transfer_obj_ref: Annotated[
bool,
Parameter(
group=infer_group,
Contributor


Claude: Warning — Copy-paste help text. This help string is identical to enable_return_routed_experts above. It should describe the transfer obj ref feature instead.

Suggested change
group=infer_group,
help="Whether to enable transferring routed experts via Ray shared store object references.",

Comment on lines +596 to +600
lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
routed_experts_ref = lmdeploy_actor.get.remote(routed_experts)
cur_routed_experts = await routed_experts_ref # n,layer,expert
ray.internal.free(routed_experts_ref, local_only=False)
else:
Contributor


Claude: Nit — Missing cleanup of string key. In the no-history transfer path (line 580), del routed_experts cleans up the string key. Here in the history transfer path, the original routed_experts string is not deleted after use. For consistency and to mirror the del data in the else branch:

Suggested change
lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
routed_experts_ref = lmdeploy_actor.get.remote(routed_experts)
cur_routed_experts = await routed_experts_ref # n,layer,expert
ray.internal.free(routed_experts_ref, local_only=False)
else:
lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
routed_experts_ref = lmdeploy_actor.get.remote(routed_experts)
del routed_experts
cur_routed_experts = await routed_experts_ref # n,layer,expert
ray.internal.free(routed_experts_ref, local_only=False)

@claude
Contributor

claude bot commented Apr 8, 2026

Claude: ## Summary

Adds enable_transfer_obj_ref config flag to support transferring routed expert keys via a Ray actor-based shared store (lmdeploy namespace) instead of the existing base64-encoded cloudpickle approach. When enabled, the worker fetches a named Ray actor and uses .get.remote(key) to retrieve routed expert data by key.
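For context, the existing path that this flag replaces serializes the routed-experts payload as a base64-encoded cloudpickle string. A minimal stdlib sketch of that round trip (using pickle in place of ray.cloudpickle, and a nested list standing in for the (n, layer, expert) tensor; both substitutions are illustrative assumptions):

```python
import base64
import pickle

def encode_routed_experts(data) -> str:
    # Serialize, then base64-encode so the payload can travel as a plain string
    return base64.b64encode(pickle.dumps(data)).decode("ascii")

def decode_routed_experts(payload: str):
    # Reverse of the above: base64-decode, then unpickle the original object
    return pickle.loads(base64.b64decode(payload))

experts = [[0, 3], [1, 2]]  # stand-in for the (n, layer, expert) tensor
assert decode_routed_experts(encode_routed_experts(experts)) == experts
```

The actor-based alternative avoids this string round trip: the producer stores the tensor in a named Ray actor under a string key, and the consumer resolves that key with .get.remote(key) as described in the summary.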

Issues

Warning

  • [xtuner/v1/ray/config/worker.py:185] Copy-paste help text — enable_transfer_obj_ref has the identical help string as enable_return_routed_experts. Should describe the transfer feature.
  • [xtuner/v1/ray/rollout/worker.py:576-578] ray.get_actor() is called redundantly in both the no-history and history branches. Could be fetched once or cached.

Nit

  • [xtuner/v1/ray/rollout/worker.py:576] Inconsistent config access: self.config.enable_transfer_obj_ref is read directly while the sibling flag uses a cached self.enable_return_routed_experts attribute.
  • [xtuner/v1/ray/rollout/worker.py:596-600] Missing del routed_experts for the string key in the history+transfer path (present in the no-history path at line 580).
  • PR body is empty — consider adding a description with motivation and test plan per project PR standards.

Verdict

COMMENT — no critical bugs found, but the copy-paste help text should be fixed and the other suggestions would improve consistency.

@YanhuiDua YanhuiDua force-pushed the refactor_r3_transfer branch 3 times, most recently from baa6401 to 26ca2b2 Compare April 10, 2026 06:39
@YanhuiDua
Collaborator Author

@claude review

Comment on lines +342 to +344
if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
self.enable_transfer_obj_ref = True
self.enable_return_routed_experts = True
Contributor


Claude: Warning — Implicit breaking change via forced coupling. When either flag is True, both are forced to True. This means existing users who only had enable_return_routed_experts: true will silently get the new transfer-via-actor behavior instead of the previous base64 decode path. This is a backwards-incompatible change.

If this is intentional (i.e. the old base64 path is being fully replaced), please document that in the PR body and commit message. If users should still be able to opt into the old path, this coupling should be removed.

Also a minor style nit: prefer truthiness over is True:

Suggested change
if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
self.enable_transfer_obj_ref = True
self.enable_return_routed_experts = True
if self.enable_return_routed_experts or self.enable_transfer_obj_ref:
self.enable_transfer_obj_ref = True
self.enable_return_routed_experts = True
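The behavioral consequence flagged above can be demonstrated with a plain-Python stand-in for the config (a hypothetical sketch, not the actual pydantic model):

```python
class WorkerConfigSketch:
    """Hypothetical stand-in for the worker config's post-init flag coupling."""

    def __init__(self, enable_return_routed_experts=False, enable_transfer_obj_ref=False):
        self.enable_return_routed_experts = enable_return_routed_experts
        self.enable_transfer_obj_ref = enable_transfer_obj_ref
        self._post_init()

    def _post_init(self):
        # Mirrors the coupling under review: setting either flag forces both on
        if self.enable_return_routed_experts or self.enable_transfer_obj_ref:
            self.enable_transfer_obj_ref = True
            self.enable_return_routed_experts = True

# A user who only opted into returning routed experts now silently gets
# the new actor-based transfer path as well:
cfg = WorkerConfigSketch(enable_return_routed_experts=True)
assert cfg.enable_transfer_obj_ref is True
```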


if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
self.enable_transfer_obj_ref = True
self.enable_return_routed_experts = True
Contributor


Claude: Nit — Trailing whitespace. There are trailing spaces on this line. Ruff/pre-commit should catch this, but flagging in case pre-commit wasn't run.

Suggested change
self.enable_return_routed_experts = True
self.enable_return_routed_experts = True

Comment on lines 218 to 224
if isinstance(routed_experts, str):
import base64

data = base64.b64decode(routed_experts)
return ray.cloudpickle.loads(data)
if self.lmdeploy_actor is None:
self.lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
assert self.lmdeploy_actor is not None, "LMDeploy actor should be available in the shared store."
routed_experts_ref = self.lmdeploy_actor.get.remote(routed_experts)
return routed_experts_ref
return torch.tensor(routed_experts)
Contributor


Claude: Note — Return type change. This method now returns an ObjectRef (from .get.remote()) for string inputs, whereas it previously returned the materialized data (via base64 + cloudpickle). The callers in worker.py already handle ObjectRef correctly, so this works, but the implicit return-type contract has changed and may surprise future readers.

Consider adding a return type annotation to make this explicit:

def _decode_routed_experts(self, routed_experts: Any) -> ObjectRef | torch.Tensor:

@claude
Contributor

claude bot commented Apr 10, 2026

Claude: ## Re-review after commit 26ca2b2

The second commit addressed the redundant ray.get_actor() call with a lazy-init pattern — good fix.

Still open from previous review

  • Copy-paste help text on enable_transfer_obj_ref (config/worker.py:184) — still identical to enable_return_routed_experts. Previous inline comment still applies.
  • Empty PR body — per project standards, PRs should include motivation, what changed, and a test plan.

New findings

Warning

  • [config/worker.py:342-344] Implicit breaking change via forced flag coupling. model_post_init forces both enable_return_routed_experts and enable_transfer_obj_ref to True whenever either is set. This silently migrates existing users from the old base64 decode path to the new actor-based transfer. If intentional, please document; if not, decouple the flags.

Nit

  • [config/worker.py:344] Trailing whitespace — trailing spaces on the line.
  • [config/worker.py:342] Style — prefer if self.enable_return_routed_experts or self.enable_transfer_obj_ref: over is True.
  • [lmdeploy.py:218-224] Return type change — _decode_routed_experts now returns ObjectRef for string inputs instead of materialized data. Callers handle this correctly, but a return type annotation would make the contract explicit.

Verdict

COMMENT — no critical bugs, but the forced flag coupling is a notable behavioral change that should be documented or reconsidered. The copy-paste help text from the first review is still unresolved.

@YanhuiDua YanhuiDua force-pushed the refactor_r3_transfer branch 2 times, most recently from 599ad71 to 09eddbd Compare April 10, 2026 07:01
@YanhuiDua YanhuiDua force-pushed the refactor_r3_transfer branch from 09eddbd to 7d1ee1b Compare April 10, 2026 07:02