feat(agent): add Doubao (Volcengine Ark) agent loop #1275
hillday wants to merge 2 commits into trycua:main
@hillday is attempting to deploy a commit to the Cua Team on Vercel. A member of the Team first needs to authorize it.
📝 Walkthrough
This PR adds Doubao agent loop support to the Python agent library. It includes exporting a new …
Sequence Diagram(s)
sequenceDiagram
participant Agent as Agent Loop
participant DoubaoAPI as Doubao API
participant Handler as Computer Handler
participant Response as Response Parser
Agent->>Agent: Prepare messages & adapt tool schemas
Note over Agent: Computer tools: 1000x1000 coords<br/>Other tools: pass-through
Agent->>DoubaoAPI: Call with model, messages, tools, stream, reasoning
DoubaoAPI-->>Agent: Response with output items & usage
Agent->>Response: Parse tool-call arguments (JSON retry on fail)
Response-->>Agent: Parsed output items
Agent->>Handler: get_dimensions() for screen size
Handler-->>Agent: Physical screen dimensions (fallback: 1024x768)
Agent->>Agent: Denormalize x/y from 0..1000 to physical coords
Agent->>Agent: Set usage + response_cost from _hidden_params
Agent-->>Agent: Return modified response
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 3 passed
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@libs/python/agent/agent/loops/doubao.py`:
- Around line 44-45: The current register_agent decorator on
DoubaoComputerAgentConfig uses a permissive regex r".*doubao.*" which matches
any model string containing "doubao" and can incorrectly claim models from other
providers; narrow the pattern to only match the OpenAI-adapter Doubao models
(e.g. require the "openai/" prefix) by updating the models regex in the
`@register_agent` call (reference: the register_agent decorator and the
DoubaoComputerAgentConfig class) to a stricter pattern such as one that enforces
the "openai/" prefix (for example ^openai/.*doubao.*$) so this loop only
activates for the intended OpenAI Doubao models.
- Around line 154-166: The current JSON parse failure branch logs the raw tool
argument blob (args / cleaned_args) which may contain secrets; update the
exception handling in the parse block (where args is processed and cleaned_args
is constructed) to avoid emitting raw contents to logger.warning. Instead, build
a sanitized summary: parse cleaned_args into a dict if possible and redact
sensitive keys like "password", "secret", "token", "keys", "credentials", or
"text" (replace values with "<REDACTED>"), or if it cannot be parsed, log only
its length and a hash, with no raw preview, since even a short prefix can
contain secrets; then call logger.warning with that sanitized summary and the
error. Ensure you
change the logger.warning call and any subsequent logging to reference the
sanitized variable rather than the original args/cleaned_args.
- Around line 195-203: predict_click in doubao.py tries to use authoritative
screen dimensions via computer_handler.get_dimensions, but
ComputerAgent.predict_click currently only forwards model, image_b64,
instruction, api_key, and api_base so computer_handler is never passed through;
update the public call path (ComputerAgent.predict_click) to accept and forward
the computer_handler (or relevant agent/context object) into
agent.loops.doubao.predict_click so the get_dimensions branch is reachable, and
ensure any call sites that invoke ComputerAgent.predict_click also supply the
computer_handler; reference symbols: predict_click (in
libs/python/agent/agent/loops/doubao.py), ComputerAgent.predict_click (in
libs/python/agent/agent/agent.py), and computer_handler.get_dimensions.
- Around line 33-40: _denormalize_xy currently maps normalized 1000-based
coordinates so that nx=1000 -> x=target_w and ny=1000 -> y=target_h, producing
out-of-bounds pixel indices; update the function (_denormalize_xy) to compute x
and y as before then clamp them into the valid zero-based pixel range [0,
target_w-1] and [0, target_h-1] (e.g., use min/max or equivalent) so results
never exceed the last pixel and never go negative.
- Around line 59-60: The code incorrectly assumes litellm.aresponses() returns a
final response when called with stream=True; to fix, either remove/ban streaming
by validating the stream parameter at the start of the relevant functions (e.g.,
in the function signature handling in doubao.py) and raise a clear error if
stream is True, or implement proper streaming consumption: when calling
litellm.aresponses() with stream=True, treat the result as an async iterator,
iterate with "async for event in ..." to collect content events and only access
.usage on the final event (and build the full response object before calling
.model_dump()); update the same pattern for the other places noted (around lines
134-146 and 151-191) so you do not call .model_dump() or index .usage on the
streaming iterator.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bb0876db-3582-4766-9988-dd69a3c4c1f7
📒 Files selected for processing (2)
libs/python/agent/agent/loops/__init__.py
libs/python/agent/agent/loops/doubao.py
def _denormalize_xy(
    nx: float, ny: float, target_w: int = 1024, target_h: int = 768
) -> Tuple[int, int]:
    """
    Convert normalized coordinates in the 1000x1000 space back to the Computer Server's physical coordinate system.
    """
    x = int(round((nx / 1000.0) * target_w))
    y = int(round((ny / 1000.0) * target_h))
Clamp denormalized coordinates to the last valid pixel.
nx=1000 currently maps to x=target_w and ny=1000 to y=target_h. On a 1024x768 screen that yields (1024, 768), which is one past the right/bottom edge for zero-based coordinates and can miss edge targets in both step and click flows.
🐛 Proposed fix
 def _denormalize_xy(
     nx: float, ny: float, target_w: int = 1024, target_h: int = 768
 ) -> Tuple[int, int]:
     """
     Convert normalized coordinates in the 1000x1000 space back to the Computer Server's physical coordinate system.
     """
-    x = int(round((nx / 1000.0) * target_w))
-    y = int(round((ny / 1000.0) * target_h))
+    target_w = max(1, int(target_w))
+    target_h = max(1, int(target_h))
+    max_x = target_w - 1
+    max_y = target_h - 1
+    x = max(0, min(max_x, int(round((nx / 1000.0) * max_x))))
+    y = max(0, min(max_y, int(round((ny / 1000.0) * max_y))))
     return x, y
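To make the edge case concrete, here is a minimal, self-contained sketch that compares the original mapping with a clamped variant. The names `denormalize_unclamped` and `denormalize_clamped` are illustrative and do not appear in the PR; the clamped version is assumed to follow the same approach as the proposed fix, scaling onto `[0, target - 1]` and then clamping.

```python
from typing import Tuple


def denormalize_unclamped(nx: float, ny: float, w: int = 1024, h: int = 768) -> Tuple[int, int]:
    """Original behaviour: scale 0..1000 onto 0..w and 0..h."""
    return int(round(nx / 1000.0 * w)), int(round(ny / 1000.0 * h))


def denormalize_clamped(nx: float, ny: float, w: int = 1024, h: int = 768) -> Tuple[int, int]:
    """Scale 0..1000 onto the valid pixel range 0..w-1 and 0..h-1, then clamp."""
    max_x = max(1, int(w)) - 1
    max_y = max(1, int(h)) - 1
    x = max(0, min(max_x, int(round(nx / 1000.0 * max_x))))
    y = max(0, min(max_y, int(round(ny / 1000.0 * max_y))))
    return x, y


if __name__ == "__main__":
    print(denormalize_unclamped(1000, 1000))  # (1024, 768): one past the last pixel
    print(denormalize_clamped(1000, 1000))    # (1023, 767): last valid pixel
    print(denormalize_clamped(-50, 1200))     # (0, 767): out-of-range input is clamped
```

Scaling by `max_x` rather than `target_w` means the clamp only matters when the model emits values outside 0..1000, which is a cheap safeguard against malformed tool-call arguments.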
@register_agent(models=r".*doubao.*", priority=10)
class DoubaoComputerAgentConfig:
Restrict this registration to the intended openai/ Doubao models.
r".*doubao.*" matches any model string containing doubao. Because agent dispatch picks the first regex match by priority, this loop will activate even when callers did not choose the OpenAI-adapter path described in the PR, and it can shadow other provider configs that happen to include the same token.
🐛 Proposed fix
-@register_agent(models=r".*doubao.*", priority=10)
+@register_agent(models=r"^openai/.*doubao.*$", priority=10)
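The difference between the two patterns can be checked in isolation. The sketch below is only an illustration: the model strings are hypothetical, and it assumes the registry tests patterns with `re.match`, which this thread does not confirm. For these inputs `re.search` and `re.fullmatch` would give the same result.

```python
import re

BROAD = re.compile(r".*doubao.*")
NARROW = re.compile(r"^openai/.*doubao.*$")

# Hypothetical model strings, used only to illustrate matching behaviour.
CANDIDATES = [
    "openai/doubao-example",
    "otherprovider/doubao-example",
    "doubao-example",
]


def matching(pattern: "re.Pattern[str]", models: list) -> list:
    """Return the model strings that the pattern would claim."""
    return [m for m in models if pattern.match(m)]


if __name__ == "__main__":
    print(matching(BROAD, CANDIDATES))   # all three strings
    print(matching(NARROW, CANDIDATES))  # only the openai/-prefixed string
```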
stream: bool = False,
computer_handler=None,
🧩 Analysis chain
🌐 Web query:
LiteLLM `aresponses` streaming return type: when `stream=True`, does the API return a streaming iterator / stream wrapper instead of a response object with `.model_dump()` and `.usage`?
💡 Result:
Yes. In LiteLLM’s Responses API wrapper, litellm.aresponses(..., stream=True) returns a streaming iterator/wrapper (a BaseResponsesAPIStreamingIterator), not a single ResponsesAPIResponse object you can immediately call .model_dump() on or read .usage from. The return type is explicitly Union[ResponsesAPIResponse, BaseResponsesAPIStreamingIterator]. [1]
In streaming mode, you consume it with async for ... in stream: and LiteLLM yields stream events; the usage (and the full response object) is only available on the final “response.completed” (or failed/incomplete) event, where the event contains a response object that has .usage (and can be .model_dump()’d). LiteLLM internally records that final event as completed_response. [2]
Sources: [1] [2]
Add streaming support or reject stream=True entirely.
The stream parameter is forwarded to litellm.aresponses() but the code immediately calls .model_dump(), iterates the result as output, and accesses .usage — none of which work with a streaming response. When stream=True, litellm.aresponses() returns a streaming iterator, not a response object; usage data is only available on the final stream event after consuming with async for. Either implement proper streaming handling with event consumption, or remove the stream parameter and always use stream=False.
Also applies to: 134-146, 151-191
args = item.get("arguments", "{}")
if isinstance(args, str):
    try:
        args = json.loads(args)
    except json.JSONDecodeError as e:
        logger.warning(f"⚠️ [JSON parse failed] Unable to parse tool-call arguments: {args}. Error: {e}")
        # Try a simple cleanup: strip any markdown code-fence markers
        cleaned_args = args.strip()
        if cleaned_args.startswith("```json"):
            cleaned_args = cleaned_args[7:]
        if cleaned_args.endswith("```"):
            cleaned_args = cleaned_args[:-3]
        cleaned_args = cleaned_args.strip()
Don’t log raw computer arguments on parse failures.
This payload can include text and keys. If the model is typing credentials, the warning logs the whole secret-bearing argument blob, which breaks the PR’s “no secrets are logged” guarantee.
🛡️ Proposed fix
-    except json.JSONDecodeError as e:
-        logger.warning(f"⚠️ [JSON parse failed] Unable to parse tool-call arguments: {args}. Error: {e}")
+    except json.JSONDecodeError as e:
+        logger.warning(
+            "⚠️ [JSON parse failed] Unable to parse computer tool-call arguments: error=%s raw_length=%d",
+            e,
+            len(args),
+        )
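As a variation on the same idea, the sketch below logs a content-free summary (length plus a truncated SHA-256 digest) so that repeated failures can still be correlated. The helper names and the logger name are hypothetical, and the digest is an optional extra beyond what the proposed fix logs; drop it if even a hash of the payload is considered too revealing.

```python
import hashlib
import json
import logging
from typing import Any

logger = logging.getLogger("doubao_sketch")  # hypothetical logger name


def summarize_unparsed_args(raw: str) -> str:
    """Describe an unparseable payload without revealing any of its contents."""
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    return f"raw_length={len(raw)} sha256_prefix={digest}"


def parse_tool_args(raw: str) -> Any:
    """Parse tool-call arguments; on failure log only a content-free summary."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # str(exc) carries the parser message and position, not the payload itself.
        logger.warning(
            "Could not parse tool-call arguments: error=%s %s",
            exc,
            summarize_unparsed_args(raw),
        )
        return {}
```

Using `%s` placeholders instead of an f-string also defers formatting until the record is actually emitted.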
    self, model: str, image_b64: str, instruction: str, computer_handler=None, **kwargs
) -> Optional[Tuple[int, int]]:
    """Predict click coordinates specifically for Doubao with 1000x1000 scaling."""
    # Get the real physical dimensions for denormalization
    physical_width, physical_height = 1024, 768
    if computer_handler and hasattr(computer_handler, "get_dimensions"):
        try:
            physical_width, physical_height = await computer_handler.get_dimensions()
            logger.info(
predict_click() cannot use physical screen dimensions from the current public call path.
This method expects computer_handler, but the current ComputerAgent.predict_click() integration only forwards model, image_b64, instruction, api_key, and api_base. That makes the get_dimensions() branch dead in normal use and forces denormalization to fall back to image size instead of the authoritative screen dimensions.
🐛 Complementary fix in libs/python/agent/agent/agent.py
 return await self.agent_loop.predict_click(
-    model=self.model, image_b64=image_b64, instruction=instruction, **click_kwargs
+    model=self.model,
+    image_b64=image_b64,
+    instruction=instruction,
+    computer_handler=self.computer_handler,
+    **click_kwargs,
 )
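Why the forwarding matters can be seen in a small stand-alone sketch of the dimension lookup. This is an assumption-laden illustration rather than the PR's implementation: `resolve_dimensions` and `FakeHandler` are hypothetical, and the broad `except` is a guess at how handler failures are treated, since the original `except` clause is not shown in this thread.

```python
import asyncio
from typing import Any, Optional, Tuple

FALLBACK_DIMENSIONS: Tuple[int, int] = (1024, 768)


async def resolve_dimensions(computer_handler: Optional[Any]) -> Tuple[int, int]:
    """Return the handler's screen size, or the fallback when it is unavailable."""
    if computer_handler is not None and hasattr(computer_handler, "get_dimensions"):
        try:
            width, height = await computer_handler.get_dimensions()
            return int(width), int(height)
        except Exception:
            # Assumed behaviour: any handler failure falls through to the default size.
            pass
    return FALLBACK_DIMENSIONS


class FakeHandler:
    """Hypothetical stand-in for a computer handler, for illustration only."""

    async def get_dimensions(self) -> Tuple[int, int]:
        return 1920, 1080


if __name__ == "__main__":
    print(asyncio.run(resolve_dimensions(None)))           # (1024, 768)
    print(asyncio.run(resolve_dimensions(FakeHandler())))  # (1920, 1080)
```

When the caller never passes a handler, every click is denormalized against the fallback size, which is the dead-branch situation this comment describes.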
What problem this solves
- … `ComputerAgent` against ByteDance Volcengine Ark (Doubao) models via an OpenAI-compatible interface, including proper parsing of Responses-style tool calls and coordinate normalization/denormalization so the agent can reliably click/type on real screens.
How to use
- … `VOLCENGINE_API_KEY` in your environment.
- … https://ark.cn-beijing.volces.com/api/v3
- … `openai/` prefix so the OpenAI adapter path is used.
Scope / impact
- … `openai/doubao-...`).
- … `VOLCENGINE_API_KEY` and sent to the configured `api_base`.
Summary by CodeRabbit
New Features