feat(agent): add Doubao (Volcengine Ark) agent loop #1275

Open
hillday wants to merge 2 commits into trycua:main from hillday:pr/doubao-support

@hillday hillday commented Apr 7, 2026

What problem this solves

  • Adds first-class support for running CUA’s ComputerAgent against ByteDance Volcengine Ark (Doubao) models via an OpenAI-compatible interface, including proper parsing of Responses-style tool calls and coordinate normalization/denormalization so the agent can reliably click/type on real screens.

How to use

  • Set your Volcengine API key:
    • VOLCENGINE_API_KEY in your environment.
  • Point the agent to Ark’s OpenAI-compatible base URL (no extra quotes/spaces):
    • https://ark.cn-beijing.volces.com/api/v3
  • Choose a Doubao model name using the openai/ prefix so the OpenAI adapter path is used.
import os
from agent import ComputerAgent
from agent.tools.skill import SkillTool
from computer import Computer

computer = Computer(os_type="windows", use_host_computer_server=True)

agent = ComputerAgent(
    model="openai/doubao-seed-2-0-lite-260215",
    tools=[computer, SkillTool()],
    api_key=os.getenv("VOLCENGINE_API_KEY"),
    api_base="https://ark.cn-beijing.volces.com/api/v3",
    screenshot_delay=1.0,
)
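
The "no extra quotes/spaces" note and the API key requirement can be checked before the agent is constructed. This is only a minimal sketch: `clean_base_url` and `require_api_key` are hypothetical helpers, not part of this PR or of the agent library.

```python
import os

ARK_BASE_URL = "https://ark.cn-beijing.volces.com/api/v3"


def clean_base_url(raw: str) -> str:
    """Strip surrounding whitespace, quotes, and trailing slashes from a base URL."""
    url = raw.strip().strip("'\"").strip()
    return url.rstrip("/")


def require_api_key(env_var: str = "VOLCENGINE_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is missing."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


if __name__ == "__main__":
    # A value copied from a shell config with stray quotes and spaces.
    print(clean_base_url(' "https://ark.cn-beijing.volces.com/api/v3/" '))
```

The cleaned values could then be passed as `api_base` and `api_key` in the `ComputerAgent(...)` call shown above.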

Scope / impact

  • Opt-in only: affects users who explicitly select a Doubao model (e.g., openai/doubao-...).
  • No behavior changes for existing model providers (Anthropic/OpenAI/Gemini/etc.) unless they switch to these model strings.
  • No secrets are logged; the API key is read from VOLCENGINE_API_KEY and sent to the configured api_base.

Summary by CodeRabbit

New Features

  • Added Doubao model support to the agent framework. The new loop predicts agent steps and click targets, and maps the model's coordinates to the physical screen resolution so that actions land correctly across different display sizes.

vercel bot commented Apr 7, 2026

@hillday is attempting to deploy a commit to the Cua Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai bot commented Apr 7, 2026

Walkthrough

This PR adds Doubao agent loop support to the Python agent library. It includes exporting a new doubao module and implementing a DoubaoComputerAgentConfig class that handles Doubao model interactions with normalized coordinate transformations for computer control tasks.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Module Exports: `libs/python/agent/agent/loops/__init__.py` | Added doubao module to package imports and `__all__` declaration. |
| Doubao Agent Loop Implementation: `libs/python/agent/agent/loops/doubao.py` | New DoubaoComputerAgentConfig class implementing async predict_step and predict_click methods for Doubao API interactions. Handles tool schema adaptation (forcing computer tools to 1000x1000 coordinate system), API calls via litellm.aresponses, coordinate denormalization from normalized to physical screen coordinates, JSON parsing with retry logic, and usage tracking. |

Sequence Diagram(s)

sequenceDiagram
    participant Agent as Agent Loop
    participant DoubaoAPI as Doubao API
    participant Handler as Computer Handler
    participant Response as Response Parser
    
    Agent->>Agent: Prepare messages & adapt tool schemas
    Note over Agent: Computer tools: 1000x1000 coords<br/>Other tools: pass-through
    Agent->>DoubaoAPI: Call with model, messages, tools, stream, reasoning
    DoubaoAPI-->>Agent: Response with output items & usage
    Agent->>Response: Parse tool-call arguments (JSON retry on fail)
    Response-->>Agent: Parsed output items
    Agent->>Handler: get_dimensions() for screen size
    Handler-->>Agent: Physical screen dimensions (fallback: 1024x768)
    Agent->>Agent: Denormalize x/y from 0..1000 to physical coords
    Agent->>Agent: Set usage + response_cost from _hidden_params
    Agent-->>Agent: Return modified response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A Doubao loop hops into view,
With coordinates transformed anew,
From thousand-fold to pixel-true,
The agent clicks where we want it to! 🎯✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: adding a new Doubao agent loop implementation for Volcengine Ark models. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@libs/python/agent/agent/loops/doubao.py`:
- Around line 44-45: The current register_agent decorator on
DoubaoComputerAgentConfig uses a permissive regex r".*doubao.*" which matches
any model string containing "doubao" and can incorrectly claim models from other
providers; narrow the pattern to only match the OpenAI-adapter Doubao models
(e.g. require the "openai/" prefix) by updating the models regex in the
`@register_agent` call (reference: the register_agent decorator and the
DoubaoComputerAgentConfig class) to a stricter pattern such as one that enforces
the "openai/" prefix (for example ^openai/.*doubao.*$) so this loop only
activates for the intended OpenAI Doubao models.
- Around line 154-166: The current JSON parse failure branch logs the raw tool
argument blob (args / cleaned_args) which may contain secrets; update the
exception handling in the parse block (where args is processed and cleaned_args
is constructed) to avoid emitting raw contents to logger.warning. Instead, build
a sanitized summary: parse cleaned_args into a dict if possible and redact
sensitive keys like "password", "secret", "token", "keys", "credentials", or
"text" (replace values with "<REDACTED>"), or if it cannot be parsed, log only a
truncated length and a safe hash/preview (e.g., first 32 chars) without secrets;
then call logger.warning with that sanitized summary and the error. Ensure you
change the logger.warning call and any subsequent logging to reference the
sanitized variable rather than the original args/cleaned_args.
- Around line 195-203: predict_click in doubao.py tries to use authoritative
screen dimensions via computer_handler.get_dimensions, but
ComputerAgent.predict_click currently only forwards model, image_b64,
instruction, api_key, and api_base so computer_handler is never passed through;
update the public call path (ComputerAgent.predict_click) to accept and forward
the computer_handler (or relevant agent/context object) into
agent.loops.doubao.predict_click so the get_dimensions branch is reachable, and
ensure any call sites that invoke ComputerAgent.predict_click also supply the
computer_handler; reference symbols: predict_click (in
libs/python/agent/agent/loops/doubao.py), ComputerAgent.predict_click (in
libs/python/agent/agent/agent.py), and computer_handler.get_dimensions.
- Around line 33-40: _denormalize_xy currently maps normalized 1000-based
coordinates so that nx=1000 -> x=target_w and ny=1000 -> y=target_h, producing
out-of-bounds pixel indices; update the function (_denormalize_xy) to compute x
and y as before then clamp them into the valid zero-based pixel range [0,
target_w-1] and [0, target_h-1] (e.g., use min/max or equivalent) so results
never exceed the last pixel and never go negative.
- Around line 59-60: The code incorrectly assumes litellm.aresponses() returns a
final response when called with stream=True; to fix, either remove/ban streaming
by validating the stream parameter at the start of the relevant functions (e.g.,
in the function signature handling in doubao.py) and raise a clear error if
stream is True, or implement proper streaming consumption: when calling
litellm.aresponses() with stream=True, treat the result as an async iterator,
iterate with "async for event in ..." to collect content events and only access
.usage on the final event (and build the full response object before calling
.model_dump()); update the same pattern for the other places noted (around lines
134-146 and 151-191) so you do not call .model_dump() or index .usage on the
streaming iterator.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bb0876db-3582-4766-9988-dd69a3c4c1f7

📥 Commits

Reviewing files that changed from the base of the PR and between 1e62756 and 692c1ce.

📒 Files selected for processing (2)
  • libs/python/agent/agent/loops/__init__.py
  • libs/python/agent/agent/loops/doubao.py

Comment on lines +33 to +40
def _denormalize_xy(
    nx: float, ny: float, target_w: int = 1024, target_h: int = 768
) -> Tuple[int, int]:
    """
    Convert normalized coordinates in the 1000x1000 space back to the Computer Server's physical coordinate system.
    """
    x = int(round((nx / 1000.0) * target_w))
    y = int(round((ny / 1000.0) * target_h))

⚠️ Potential issue | 🟠 Major

Clamp denormalized coordinates to the last valid pixel.

nx=1000 currently maps to x=target_w and ny=1000 to y=target_h. On a 1024x768 screen that yields (1024, 768), which is one past the right/bottom edge for zero-based coordinates and can miss edge targets in both step and click flows.

🐛 Proposed fix
 def _denormalize_xy(
     nx: float, ny: float, target_w: int = 1024, target_h: int = 768
 ) -> Tuple[int, int]:
     """
     Convert normalized coordinates in the 1000x1000 space back to the Computer Server's physical coordinate system.
     """
-    x = int(round((nx / 1000.0) * target_w))
-    y = int(round((ny / 1000.0) * target_h))
+    target_w = max(1, int(target_w))
+    target_h = max(1, int(target_h))
+    max_x = target_w - 1
+    max_y = target_h - 1
+    x = max(0, min(max_x, int(round((nx / 1000.0) * max_x))))
+    y = max(0, min(max_y, int(round((ny / 1000.0) * max_y))))
     return x, y
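
The edge case is easy to check by running the original mapping and the proposed one side by side. The names `denormalize_old` and `denormalize_clamped` below are illustrative stand-ins for the two versions of `_denormalize_xy`, not names from the PR.

```python
from typing import Tuple


def denormalize_old(
    nx: float, ny: float, target_w: int = 1024, target_h: int = 768
) -> Tuple[int, int]:
    """Original mapping: scales by the full width/height, so 1000 lands one past the edge."""
    x = int(round((nx / 1000.0) * target_w))
    y = int(round((ny / 1000.0) * target_h))
    return x, y


def denormalize_clamped(
    nx: float, ny: float, target_w: int = 1024, target_h: int = 768
) -> Tuple[int, int]:
    """Proposed mapping: scales by the last valid index and clamps into range."""
    target_w = max(1, int(target_w))
    target_h = max(1, int(target_h))
    max_x = target_w - 1
    max_y = target_h - 1
    x = max(0, min(max_x, int(round((nx / 1000.0) * max_x))))
    y = max(0, min(max_y, int(round((ny / 1000.0) * max_y))))
    return x, y


if __name__ == "__main__":
    print(denormalize_old(1000, 1000))      # (1024, 768): outside a 1024x768 screen
    print(denormalize_clamped(1000, 1000))  # (1023, 767): last valid pixel
    print(denormalize_clamped(-50, 1200))   # (0, 767): out-of-range input is clamped
```

Scaling by `max_x` rather than `target_w` also moves interior points, by at most one pixel for in-range inputs, which seems an acceptable trade for never leaving the screen.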

Comment on lines +44 to +45
@register_agent(models=r".*doubao.*", priority=10)
class DoubaoComputerAgentConfig:

⚠️ Potential issue | 🟠 Major

Restrict this registration to the intended openai/ Doubao models.

r".*doubao.*" matches any model string containing doubao. Because agent dispatch picks the first regex match by priority, this loop will activate even when callers did not choose the OpenAI-adapter path described in the PR, and it can shadow other provider configs that happen to include the same token.

🐛 Proposed fix
-@register_agent(models=r".*doubao.*", priority=10)
+@register_agent(models=r"^openai/.*doubao.*$", priority=10)
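
The difference between the two patterns can be seen on a few model strings. This sketch assumes the registry tests patterns with something like `re.match`; how `register_agent` actually matches is not shown in this thread, and two of the three model strings are made up for illustration.

```python
import re
from typing import List, Pattern

PERMISSIVE = re.compile(r".*doubao.*")
STRICT = re.compile(r"^openai/.*doubao.*$")

MODELS = [
    "openai/doubao-seed-2-0-lite-260215",  # model string from the PR description
    "someprovider/doubao-compatible",      # made-up string from another provider
    "openai/gpt-4o",                       # unrelated model, for contrast
]


def claimed_by(pattern: Pattern[str], models: List[str]) -> List[str]:
    """Return the model strings that the given pattern would claim."""
    return [m for m in models if pattern.match(m)]


if __name__ == "__main__":
    print(claimed_by(PERMISSIVE, MODELS))  # claims both strings containing "doubao"
    print(claimed_by(STRICT, MODELS))      # claims only the openai/ Doubao model
```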

Comment on lines +59 to +60
stream: bool = False,
computer_handler=None,

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

LiteLLM aresponses streaming return type: when stream=True, does the API return a streaming iterator / stream wrapper instead of a response object with .model_dump() and .usage?

💡 Result:

Yes. In LiteLLM’s Responses API wrapper, litellm.aresponses(..., stream=True) returns a streaming iterator/wrapper (a BaseResponsesAPIStreamingIterator), not a single ResponsesAPIResponse object you can immediately call .model_dump() on or read .usage from. The return type is explicitly Union[ResponsesAPIResponse, BaseResponsesAPIStreamingIterator]. [1]

In streaming mode, you consume it with async for ... in stream: and LiteLLM yields stream events; the usage (and the full response object) is only available on the final “response.completed” (or failed/incomplete) event, where the event contains a response object that has .usage (and can be .model_dump()’d). LiteLLM internally records that final event as completed_response. [2]

Sources: [1] [2]


Add streaming support or reject stream=True entirely.

The stream parameter is forwarded to litellm.aresponses() but the code immediately calls .model_dump(), iterates the result as output, and accesses .usage — none of which work with a streaming response. When stream=True, litellm.aresponses() returns a streaming iterator, not a response object; usage data is only available on the final stream event after consuming with async for. Either implement proper streaming handling with event consumption, or remove the stream parameter and always use stream=False.

Also applies to: 134-146, 151-191
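
Both options can be sketched without calling LiteLLM at all. In the sketch below the stream is a fake async generator and the events are plain dicts, which is an assumption made for illustration (LiteLLM yields event objects); `collect_final_response` and `reject_streaming` are hypothetical helper names.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Optional

# Event types that carry the final response object.
TERMINAL_EVENTS = {"response.completed", "response.failed", "response.incomplete"}


async def fake_stream() -> AsyncIterator[Dict[str, Any]]:
    """Stand-in for a streaming iterator; the dict event shape is an assumption."""
    yield {"type": "response.output_text.delta", "delta": "cli"}
    yield {"type": "response.output_text.delta", "delta": "ck"}
    yield {
        "type": "response.completed",
        "response": {"output": [], "usage": {"input_tokens": 12, "output_tokens": 3}},
    }


async def collect_final_response(stream: AsyncIterator[Dict[str, Any]]) -> Dict[str, Any]:
    """Option A: drain the stream and return the response from the terminal event."""
    final: Optional[Dict[str, Any]] = None
    async for event in stream:
        if event.get("type") in TERMINAL_EVENTS:
            final = event["response"]
    if final is None:
        raise RuntimeError("stream ended without a terminal response event")
    return final


def reject_streaming(stream: bool) -> None:
    """Option B: fail fast instead of mishandling a streaming iterator."""
    if stream:
        raise NotImplementedError("stream=True is not supported by this loop")


if __name__ == "__main__":
    response = asyncio.run(collect_final_response(fake_stream()))
    print(response["usage"])  # usage is only read after the stream is consumed
```

Option B is the smaller change; Option A keeps the `stream` parameter meaningful but requires the rest of the loop to work from the collected response rather than the iterator.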

Comment on lines +154 to +166
args = item.get("arguments", "{}")
if isinstance(args, str):
    try:
        args = json.loads(args)
    except json.JSONDecodeError as e:
        logger.warning(f"⚠️ [JSON parse failed] Unable to parse tool call arguments: {args}. Error: {e}")
        # Attempt a simple cleanup: strip any markdown code-fence markers
        cleaned_args = args.strip()
        if cleaned_args.startswith("```json"):
            cleaned_args = cleaned_args[7:]
        if cleaned_args.endswith("```"):
            cleaned_args = cleaned_args[:-3]
        cleaned_args = cleaned_args.strip()

⚠️ Potential issue | 🟠 Major

Don’t log raw computer arguments on parse failures.

This payload can include text and keys. If the model is typing credentials, the warning logs the whole secret-bearing argument blob, which breaks the PR’s “no secrets are logged” guarantee.

🛡️ Proposed fix
-                    except json.JSONDecodeError as e:
-                        logger.warning(f"⚠️ [JSON parse failed] Unable to parse tool call arguments: {args}. Error: {e}")
+                    except json.JSONDecodeError as e:
+                        logger.warning(
+                            "⚠️ [JSON parse failed] Unable to parse computer tool call arguments: error=%s raw_length=%d",
+                            e,
+                            len(args),
+                        )
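
A slightly fuller sketch of the same idea combines the fence-stripping retry with logging that never includes the raw payload. The helper names (`safe_summary`, `strip_code_fence`, `parse_tool_arguments`) are hypothetical and not taken from the PR; the summary logs only the length and a short hash prefix.

```python
import hashlib
import json
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

FENCE = "`" * 3  # markdown code-fence marker


def safe_summary(raw: str) -> str:
    """Describe a payload by length and a short hash, never by its contents."""
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]
    return f"length={len(raw)} sha256_prefix={digest}"


def strip_code_fence(raw: str) -> str:
    """Remove a leading fence (optionally tagged json) and a trailing fence."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE + "json"):
        cleaned = cleaned[len(FENCE) + len("json"):]
    elif cleaned.startswith(FENCE):
        cleaned = cleaned[len(FENCE):]
    if cleaned.endswith(FENCE):
        cleaned = cleaned[: -len(FENCE)]
    return cleaned.strip()


def parse_tool_arguments(raw: str) -> Optional[Dict[str, Any]]:
    """Parse tool-call arguments, retrying once after stripping code fences."""
    for attempt, candidate in enumerate((raw, strip_code_fence(raw)), start=1):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError as exc:
            # exc.msg and exc.pos describe the failure without echoing the payload.
            logger.warning(
                "tool-call argument parse failed (attempt %d): %s error=%s pos=%d",
                attempt, safe_summary(raw), exc.msg, exc.pos,
            )
            continue
        return parsed if isinstance(parsed, dict) else None
    return None
```

A short hash still lets two failures with the same payload be correlated in the logs, while typed text such as a password stays out of them.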

Comment on lines +195 to +203
        self, model: str, image_b64: str, instruction: str, computer_handler=None, **kwargs
    ) -> Optional[Tuple[int, int]]:
        """Predict click coordinates specifically for Doubao with 1000x1000 scaling."""
        # Get the real physical dimensions for denormalization
        physical_width, physical_height = 1024, 768
        if computer_handler and hasattr(computer_handler, "get_dimensions"):
            try:
                physical_width, physical_height = await computer_handler.get_dimensions()
                logger.info(

⚠️ Potential issue | 🟠 Major

predict_click() cannot use physical screen dimensions from the current public call path.

This method expects computer_handler, but the current ComputerAgent.predict_click() integration only forwards model, image_b64, instruction, api_key, and api_base. That makes the get_dimensions() branch dead in normal use and forces denormalization to fall back to image size instead of the authoritative screen dimensions.

🐛 Complementary fix in libs/python/agent/agent/agent.py
             return await self.agent_loop.predict_click(
-                model=self.model, image_b64=image_b64, instruction=instruction, **click_kwargs
+                model=self.model,
+                image_b64=image_b64,
+                instruction=instruction,
+                computer_handler=self.computer_handler,
+                **click_kwargs,
             )
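
Why forwarding the handler matters can be shown with a small stand-in. In this sketch `FakeHandler` and `resolve_dimensions` are hypothetical; only the `get_dimensions()` method name and the 1024x768 default come from the quoted snippet.

```python
import asyncio
from typing import Any, Optional, Tuple

FALLBACK_DIMENSIONS = (1024, 768)  # default used in the quoted predict_click snippet


class FakeHandler:
    """Hypothetical stand-in for a computer handler exposing get_dimensions()."""

    async def get_dimensions(self) -> Tuple[int, int]:
        return 2560, 1440


async def resolve_dimensions(computer_handler: Optional[Any] = None) -> Tuple[int, int]:
    """Use the handler's dimensions when available, otherwise the fallback."""
    if computer_handler is not None and hasattr(computer_handler, "get_dimensions"):
        try:
            width, height = await computer_handler.get_dimensions()
            return int(width), int(height)
        except Exception:
            # A failing handler should not break click prediction.
            pass
    return FALLBACK_DIMENSIONS


if __name__ == "__main__":
    print(asyncio.run(resolve_dimensions(FakeHandler())))  # handler forwarded
    print(asyncio.run(resolve_dimensions()))               # handler not forwarded
```

Without the handler, every click is denormalized against the fallback size, regardless of the real screen.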
