
support to transfer key of routed experts #1658

Open

YanhuiDua wants to merge 2 commits into InternLM:main from YanhuiDua:refactor_r3_transfer

Conversation

@YanhuiDua
Collaborator

No description provided.

@YanhuiDua
Collaborator Author

@claude review

enable_transfer_obj_ref: Annotated[
bool,
Parameter(
group=infer_group,
Contributor


Claude: Warning — Copy-paste help text. This help string is identical to enable_return_routed_experts above. It should describe the transfer obj ref feature instead.

Suggested change
group=infer_group,
help="Whether to enable transferring routed experts via Ray shared store object references.",

Comment on lines +596 to +600
lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
routed_experts_ref = lmdeploy_actor.get.remote(routed_experts)
cur_routed_experts = await routed_experts_ref # n,layer,expert
ray.internal.free(routed_experts_ref, local_only=False)
else:
Contributor


Claude: Nit — Missing cleanup of string key. In the no-history transfer path (line 580), del routed_experts cleans up the string key. Here in the history transfer path, the original routed_experts string is not deleted after use. For consistency and to mirror the del data in the else branch:

Suggested change
lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
routed_experts_ref = lmdeploy_actor.get.remote(routed_experts)
cur_routed_experts = await routed_experts_ref # n,layer,expert
ray.internal.free(routed_experts_ref, local_only=False)
else:
lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
routed_experts_ref = lmdeploy_actor.get.remote(routed_experts)
del routed_experts
cur_routed_experts = await routed_experts_ref # n,layer,expert
ray.internal.free(routed_experts_ref, local_only=False)

@claude
Contributor

claude bot commented Apr 8, 2026

Claude: ## Summary

Adds enable_transfer_obj_ref config flag to support transferring routed expert keys via a Ray actor-based shared store (lmdeploy namespace) instead of the existing base64-encoded cloudpickle approach. When enabled, the worker fetches a named Ray actor and uses .get.remote(key) to retrieve routed expert data by key.
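For context, the existing path that this flag replaces serializes the routed-experts payload as a base64-encoded cloudpickle string. A minimal stdlib sketch of that round trip (using pickle in place of ray.cloudpickle, and a nested list standing in for the (n, layer, expert) tensor; both substitutions are illustrative assumptions):

```python
import base64
import pickle

def encode_routed_experts(data) -> str:
    # Serialize, then base64-encode so the payload can travel as a plain string
    return base64.b64encode(pickle.dumps(data)).decode("ascii")

def decode_routed_experts(payload: str):
    # Reverse of the above: base64-decode, then unpickle the original object
    return pickle.loads(base64.b64decode(payload))

experts = [[0, 3], [1, 2]]  # stand-in for the (n, layer, expert) tensor
assert decode_routed_experts(encode_routed_experts(experts)) == experts
```

The actor-based alternative avoids this string round trip: the producer stores the tensor in a named Ray actor under a string key, and the consumer resolves that key with .get.remote(key) as described in the summary.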

Issues

Warning

  • [xtuner/v1/ray/config/worker.py:185] Copy-paste help text — enable_transfer_obj_ref has the identical help string as enable_return_routed_experts. Should describe the transfer feature.
  • [xtuner/v1/ray/rollout/worker.py:576-578] ray.get_actor() is called redundantly in both the no-history and history branches. Could be fetched once or cached.

Nit

  • [xtuner/v1/ray/rollout/worker.py:576] Inconsistent config access: self.config.enable_transfer_obj_ref is read directly while the sibling flag uses a cached self.enable_return_routed_experts attribute.
  • [xtuner/v1/ray/rollout/worker.py:596-600] Missing del routed_experts for the string key in the history+transfer path (present in the no-history path at line 580).
  • PR body is empty — consider adding a description with motivation and test plan per project PR standards.

Verdict

COMMENT — no critical bugs found, but the copy-paste help text should be fixed and the other suggestions would improve consistency.

@YanhuiDua YanhuiDua force-pushed the refactor_r3_transfer branch 3 times, most recently from baa6401 to 26ca2b2 Compare April 10, 2026 06:39
@YanhuiDua
Collaborator Author

@claude review

Comment on lines +342 to +344
if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
self.enable_transfer_obj_ref = True
self.enable_return_routed_experts = True
Contributor


Claude: Warning — Implicit breaking change via forced coupling. When either flag is True, both are forced to True. This means existing users who only had enable_return_routed_experts: true will silently get the new transfer-via-actor behavior instead of the previous base64 decode path. This is a backwards-incompatible change.

If this is intentional (i.e. the old base64 path is being fully replaced), please document that in the PR body and commit message. If users should still be able to opt into the old path, this coupling should be removed.

Also a minor style nit: prefer truthiness over is True:

Suggested change
if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
self.enable_transfer_obj_ref = True
self.enable_return_routed_experts = True
if self.enable_return_routed_experts or self.enable_transfer_obj_ref:
self.enable_transfer_obj_ref = True
self.enable_return_routed_experts = True
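The behavioral consequence flagged above can be demonstrated with a plain-Python stand-in for the config (a hypothetical sketch, not the actual pydantic model):

```python
class WorkerConfigSketch:
    """Hypothetical stand-in for the worker config's post-init flag coupling."""

    def __init__(self, enable_return_routed_experts=False, enable_transfer_obj_ref=False):
        self.enable_return_routed_experts = enable_return_routed_experts
        self.enable_transfer_obj_ref = enable_transfer_obj_ref
        self._post_init()

    def _post_init(self):
        # Mirrors the coupling under review: setting either flag forces both on
        if self.enable_return_routed_experts or self.enable_transfer_obj_ref:
            self.enable_transfer_obj_ref = True
            self.enable_return_routed_experts = True

# A user who only opted into returning routed experts now silently gets
# the new actor-based transfer path as well:
cfg = WorkerConfigSketch(enable_return_routed_experts=True)
assert cfg.enable_transfer_obj_ref is True
```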


if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
self.enable_transfer_obj_ref = True
self.enable_return_routed_experts = True
Contributor


Claude: Nit — Trailing whitespace. There are trailing spaces on this line. Ruff/pre-commit should catch this, but flagging in case pre-commit wasn't run.

Suggested change
self.enable_return_routed_experts = True
self.enable_return_routed_experts = True

Comment on lines 218 to 224
if isinstance(routed_experts, str):
import base64

data = base64.b64decode(routed_experts)
return ray.cloudpickle.loads(data)
if self.lmdeploy_actor is None:
self.lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
assert self.lmdeploy_actor is not None, "LMDeploy actor should be available in the shared store."
routed_experts_ref = self.lmdeploy_actor.get.remote(routed_experts)
return routed_experts_ref
return torch.tensor(routed_experts)
Contributor


Claude: Note — Return type change. This method now returns an ObjectRef (from .get.remote()) for string inputs, whereas it previously returned the materialized data (via base64 + cloudpickle). The callers in worker.py already handle ObjectRef correctly, so this works, but the implicit return-type contract has changed and may surprise future readers.

Consider adding a return type annotation to make this explicit:

def _decode_routed_experts(self, routed_experts: Any) -> ObjectRef | torch.Tensor:

@claude
Contributor

claude bot commented Apr 10, 2026

Claude: ## Re-review after commit 26ca2b2

The second commit addressed the redundant ray.get_actor() call with a lazy-init pattern — good fix.

Still open from previous review

  • Copy-paste help text on enable_transfer_obj_ref (config/worker.py:184) — still identical to enable_return_routed_experts. Previous inline comment still applies.
  • Empty PR body — per project standards, PRs should include motivation, what changed, and a test plan.

New findings

Warning

  • [config/worker.py:342-344] Implicit breaking change via forced flag coupling. model_post_init forces both enable_return_routed_experts and enable_transfer_obj_ref to True whenever either is set. This silently migrates existing users from the old base64 decode path to the new actor-based transfer. If intentional, please document; if not, decouple the flags.

Nit

  • [config/worker.py:344] Trailing whitespace — trailing spaces on the line.
  • [config/worker.py:342] Style — prefer if self.enable_return_routed_experts or self.enable_transfer_obj_ref: over is True.
  • [lmdeploy.py:218-224] Return type change — _decode_routed_experts now returns ObjectRef for string inputs instead of materialized data. Callers handle this correctly, but a return type annotation would make the contract explicit.

Verdict

COMMENT — no critical bugs, but the forced flag coupling is a notable behavioral change that should be documented or reconsidered. The copy-paste help text from the first review is still unresolved.

@YanhuiDua YanhuiDua force-pushed the refactor_r3_transfer branch 2 times, most recently from 599ad71 to 09eddbd Compare April 10, 2026 07:01
@YanhuiDua YanhuiDua force-pushed the refactor_r3_transfer branch from 09eddbd to 7d1ee1b Compare April 10, 2026 07:02