support to transfer key of routed experts #1658
YanhuiDua wants to merge 2 commits into InternLM:main from
Conversation
@claude review
xtuner/v1/ray/config/worker.py
Outdated
```python
    enable_transfer_obj_ref: Annotated[
        bool,
        Parameter(
            group=infer_group,
```
Claude: Warning — Copy-paste help text. This help string is identical to enable_return_routed_experts above. It should describe the transfer obj ref feature instead.
```suggestion
            group=infer_group,
            help="Whether to enable transferring routed experts via Ray shared store object references.",
```
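As an aside, the way each flag carries its own help string in `Annotated` metadata — and how copy-pasting the annotation produces duplicated help text — can be sketched with a minimal stand-in for the `Parameter` descriptor (all names here are hypothetical simplifications, not the real xtuner types):

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass
class Parameter:
    """Minimal stand-in for a config parameter descriptor (hypothetical)."""
    group: str = ""
    help: str = ""


class WorkerConfigSketch:
    # Each flag's help text rides along in the Annotated metadata;
    # copy-pasting the annotation is exactly how duplicated help strings happen.
    enable_return_routed_experts: Annotated[
        bool, Parameter(group="infer", help="Whether to return routed experts.")
    ] = False
    enable_transfer_obj_ref: Annotated[
        bool,
        Parameter(
            group="infer",
            help="Whether to enable transferring routed experts via Ray shared store object references.",
        ),
    ] = False


def field_help(cls: type, name: str) -> str:
    """Pull the help string back out of the Annotated metadata."""
    hints = get_type_hints(cls, include_extras=True)
    for meta in hints[name].__metadata__:
        if isinstance(meta, Parameter):
            return meta.help
    return ""
```

With distinct help strings, a `field_help`-style lookup returns different text per flag, which is what the suggested fix restores.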
xtuner/v1/ray/rollout/worker.py
Outdated
```python
    lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
    routed_experts_ref = lmdeploy_actor.get.remote(routed_experts)
    cur_routed_experts = await routed_experts_ref  # n,layer,expert
    ray.internal.free(routed_experts_ref, local_only=False)
else:
```
Claude: Nit — Missing cleanup of string key. In the no-history transfer path (line 580), del routed_experts cleans up the string key. Here in the history transfer path, the original routed_experts string is not deleted after use. For consistency and to mirror the del data in the else branch:
```diff
     lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
     routed_experts_ref = lmdeploy_actor.get.remote(routed_experts)
+    del routed_experts
     cur_routed_experts = await routed_experts_ref  # n,layer,expert
     ray.internal.free(routed_experts_ref, local_only=False)
 else:
```
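The key-based hand-off this comment is about can be simulated without Ray at all: a store actor maps string keys to payloads, the consumer fetches by key, and the stored copy is freed after use so large routed-experts tensors do not accumulate. A minimal sketch, with a plain dict standing in for the Ray shared-store actor and all names hypothetical:

```python
class FakeSharedStore:
    """Dict-backed stand-in for the Ray shared-store actor (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, payload):
        self._objects[key] = payload

    def get(self, key):
        # In the real code this is `actor.get.remote(key)` returning an ObjectRef.
        return self._objects[key]

    def free(self, key):
        # Mirrors the role of `ray.internal.free(...)`: drop the stored copy
        # once the consumer has materialized it.
        self._objects.pop(key, None)


def consume(store, key):
    payload = store.get(key)  # fetch by string key
    store.free(key)           # release the stored copy after use
    return payload
```

The review's nit is the same hygiene applied to the local `routed_experts` string: once the payload is fetched, neither the store entry nor the local key reference is needed any longer.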
Claude: ## Summary
Adds support to transfer keys of routed experts via the Ray shared store.
Issues: Warning — copy-pasted help text; Nit — missing cleanup of the string key.
Verdict: COMMENT — no critical bugs found, but the copy-pasted help text should be fixed and the other suggestions would improve consistency.
Force-pushed from baa6401 to 26ca2b2 (Compare)
@claude review
xtuner/v1/ray/config/worker.py
Outdated
```python
if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
    self.enable_transfer_obj_ref = True
    self.enable_return_routed_experts = True
```
Claude: Warning — Implicit breaking change via forced coupling. When either flag is True, both are forced to True. This means existing users who only had enable_return_routed_experts: true will silently get the new transfer-via-actor behavior instead of the previous base64 decode path. This is a backwards-incompatible change.
If this is intentional (i.e. the old base64 path is being fully replaced), please document that in the PR body and commit message. If users should still be able to opt into the old path, this coupling should be removed.
Also a minor style nit: prefer truthiness over is True:
```diff
-if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
+if self.enable_return_routed_experts or self.enable_transfer_obj_ref:
     self.enable_transfer_obj_ref = True
     self.enable_return_routed_experts = True
```
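The coupling behavior flagged above is easy to pin down with a small test. A pydantic-free sketch of the validator (a hypothetical stand-in class, not the real `WorkerConfig`) shows why a user who only set `enable_return_routed_experts` now silently gets the new transfer path as well:

```python
from dataclasses import dataclass


@dataclass
class WorkerConfigSketch:
    """Hypothetical stand-in for the real config class."""
    enable_return_routed_experts: bool = False
    enable_transfer_obj_ref: bool = False

    def __post_init__(self):
        # Mirrors the validator under review: setting either flag forces both on.
        if self.enable_return_routed_experts or self.enable_transfer_obj_ref:
            self.enable_transfer_obj_ref = True
            self.enable_return_routed_experts = True
```

This is exactly the backwards-incompatible behavior the warning describes: there is no longer any configuration in which `enable_return_routed_experts` is on while the obj-ref transfer is off.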
xtuner/v1/ray/config/worker.py
Outdated
```python


if self.enable_return_routed_experts is True or self.enable_transfer_obj_ref is True:
    self.enable_transfer_obj_ref = True
    self.enable_return_routed_experts = True
```
Claude: Nit — Trailing whitespace. There are trailing spaces on this line. Ruff/pre-commit should catch this, but flagging in case pre-commit wasn't run.
```suggestion
    self.enable_return_routed_experts = True
```
```diff
 if isinstance(routed_experts, str):
-    import base64
-
-    data = base64.b64decode(routed_experts)
-    return ray.cloudpickle.loads(data)
+    if self.lmdeploy_actor is None:
+        self.lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
+    assert self.lmdeploy_actor is not None, "LMDeploy actor should be available in the shared store."
+    routed_experts_ref = self.lmdeploy_actor.get.remote(routed_experts)
+    return routed_experts_ref
 return torch.tensor(routed_experts)
```
Claude: Note — Return type change. This method now returns an ObjectRef (from .get.remote()) for string inputs, whereas it previously returned the materialized data (via base64 + cloudpickle). The callers in worker.py already handle ObjectRef correctly, so this works, but the implicit return-type contract has changed and may surprise future readers.
Consider adding a return type annotation to make this explicit:
```python
def _decode_routed_experts(self, routed_experts: Any) -> ObjectRef | torch.Tensor:
```
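The old-versus-new contract can be made concrete with a small decoder that keeps both paths behind an explicit union return type. This is a sketch only: `pickle` stands in for `ray.cloudpickle`, a trivial wrapper class stands in for `ray.ObjectRef`, and the function name and flag are hypothetical:

```python
import base64
import pickle


class FakeObjectRef:
    """Stand-in for ray.ObjectRef (hypothetical)."""

    def __init__(self, key):
        self.key = key


def decode_routed_experts(routed_experts, use_obj_ref: bool) -> "FakeObjectRef | list":
    """Old path: base64 + pickle materializes the data immediately.
    New path: return a reference that the caller must resolve/await."""
    if isinstance(routed_experts, str):
        if use_obj_ref:
            # New behavior: the string is a store key; hand back a reference.
            return FakeObjectRef(routed_experts)
        # Old behavior: the string is base64-encoded pickled data.
        data = base64.b64decode(routed_experts)
        return pickle.loads(data)
    return list(routed_experts)
```

Spelling the union out in the signature, as the review suggests, makes the caller's obligation (resolve the ref versus use the value directly) visible at the call site instead of being an implicit contract.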
Claude: ## Re-review after commit 26ca2b2
The second commit addressed the redundant [...]
Still open from previous review: the copy-pasted help text.
New findings: Warning — forced flag coupling; Nit — trailing whitespace.
Verdict: COMMENT — no critical bugs, but the forced flag coupling is a notable behavioral change that should be documented or reconsidered. The copy-pasted help text from the first review is still unresolved.
Force-pushed from 599ad71 to 09eddbd (Compare)
Force-pushed from 09eddbd to 7d1ee1b (Compare)
No description provided.