feat(agent): add Doubao (Volcengine Ark) agent loop by hillday · Pull Request #1275 · trycua/cua

hillday · 2026-04-07T03:46:52Z

What problem this solves

Adds first-class support for running CUA’s ComputerAgent against ByteDance Volcengine Ark (Doubao) models via an OpenAI-compatible interface, including proper parsing of Responses-style tool calls and coordinate normalization/denormalization so the agent can reliably click/type on real screens.

How to use

Set your Volcengine API key:
- VOLCENGINE_API_KEY in your environment.
Point the agent to Ark’s OpenAI-compatible base URL (no extra quotes/spaces):
- https://ark.cn-beijing.volces.com/api/v3
Choose a Doubao model name using the openai/ prefix so the OpenAI adapter path is used.

import os
from agent import ComputerAgent
from agent.tools.skill import SkillTool
from computer import Computer

# Connect to the computer server running on the host machine.
computer = Computer(os_type="windows", use_host_computer_server=True)

agent = ComputerAgent(
    # The openai/ prefix routes requests through the OpenAI adapter path.
    model="openai/doubao-seed-2-0-lite-260215",
    tools=[computer, SkillTool()],
    api_key=os.getenv("VOLCENGINE_API_KEY"),
    # Ark's OpenAI-compatible base URL.
    api_base="https://ark.cn-beijing.volces.com/api/v3",
    screenshot_delay=1.0,
)
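
The setup notes above ask for a key in the environment and a base URL with no extra quotes or spaces. A small guard like the one below can catch both problems before the agent is constructed. It is only a sketch: `load_ark_settings` and the `ARK_API_BASE` override variable are hypothetical names for illustration and are not part of CUA or this PR.

```python
import os

# Default Ark endpoint taken from the usage notes above.
ARK_BASE_URL = "https://ark.cn-beijing.volces.com/api/v3"


def load_ark_settings(env=None):
    """Return (api_key, api_base), raising if the key is missing.

    Hypothetical helper: strips stray whitespace and quotes that often
    sneak in when values are copied into shell profiles or .env files.
    """
    env = os.environ if env is None else env

    api_key = (env.get("VOLCENGINE_API_KEY") or "").strip().strip("'\"")
    if not api_key:
        raise RuntimeError("VOLCENGINE_API_KEY is not set")

    # ARK_API_BASE is an assumed, optional override for illustration only.
    raw_base = env.get("ARK_API_BASE") or ARK_BASE_URL
    api_base = raw_base.strip().strip("'\"").rstrip("/")
    return api_key, api_base
```

The returned pair can then be passed as `api_key` and `api_base` to `ComputerAgent` in place of the inline `os.getenv` call.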

Scope / impact

Opt-in only: affects users who explicitly select a Doubao model (e.g., openai/doubao-...).
No behavior changes for existing model providers (Anthropic/OpenAI/Gemini/etc.) unless they switch to these model strings.
No secrets are logged; the API key is read from VOLCENGINE_API_KEY and sent to the configured api_base.

Summary by CodeRabbit

New Features

Added Doubao model support to the agent framework. Coordinates that the model emits on a normalized 1000x1000 grid are mapped to the physical screen resolution, for both step prediction (`predict_step`) and click prediction (`predict_click`).

coderabbitai · 2026-04-07T03:47:11Z

Walkthrough

This PR adds Doubao agent loop support to the Python agent library. It includes exporting a new doubao module and implementing a DoubaoComputerAgentConfig class that handles Doubao model interactions with normalized coordinate transformations for computer control tasks.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Module Exports `libs/python/agent/agent/loops/__init__.py` | Added `doubao` module to package imports and `__all__` declaration. |
| Doubao Agent Loop Implementation `libs/python/agent/agent/loops/doubao.py` | New `DoubaoComputerAgentConfig` class implementing async `predict_step` and `predict_click` methods for Doubao API interactions. Handles tool schema adaptation (forcing computer tools to a 1000x1000 coordinate system), API calls via `litellm.aresponses`, coordinate denormalization from normalized to physical screen coordinates, JSON parsing with retry logic, and usage tracking. |

Sequence Diagram(s)

sequenceDiagram
    participant Agent as Agent Loop
    participant DoubaoAPI as Doubao API
    participant Handler as Computer Handler
    participant Response as Response Parser
    
    Agent->>Agent: Prepare messages & adapt tool schemas
    Note over Agent: Computer tools: 1000x1000 coords<br/>Other tools: pass-through
    Agent->>DoubaoAPI: Call with model, messages, tools, stream, reasoning
    DoubaoAPI-->>Agent: Response with output items & usage
    Agent->>Response: Parse tool-call arguments (JSON retry on fail)
    Response-->>Agent: Parsed output items
    Agent->>Handler: get_dimensions() for screen size
    Handler-->>Agent: Physical screen dimensions (fallback: 1024x768)
    Agent->>Agent: Denormalize x/y from 0..1000 to physical coords
    Agent->>Agent: Set usage + response_cost from _hidden_params
    Agent-->>Agent: Return modified response
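
The denormalization step in the diagram can be sketched as follows. This is a minimal illustration, not the PR's `_denormalize_xy`: the function name, the rounding choice, and the `scale` parameter are assumptions. The clamp addresses the bounds concern raised in the review, where a normalized value of 1000 would otherwise land one pixel past the screen edge.

```python
def denormalize_xy(nx, ny, target_w, target_h, scale=1000):
    """Map coordinates on a 0..scale grid to zero-based pixel indices.

    Hypothetical sketch: results are clamped to [0, target_w - 1] and
    [0, target_h - 1] so they always address a valid pixel.
    """
    # Scale proportionally to the physical screen size.
    x = int(round(nx / scale * target_w))
    y = int(round(ny / scale * target_h))

    # Clamp so nx == scale (or a negative input) stays on screen.
    x = max(0, min(x, target_w - 1))
    y = max(0, min(y, target_h - 1))
    return x, y
```

On a 1920x1080 screen, the normalized center (500, 500) maps to (960, 540), and the corner (1000, 1000) clamps to (1919, 1079) rather than (1920, 1080).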

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately and specifically describes the main change: adding a new Doubao agent loop implementation for Volcengine Ark models. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@libs/python/agent/agent/loops/doubao.py`:
- Around line 44-45: The register_agent decorator on
DoubaoComputerAgentConfig uses the permissive regex r".*doubao.*", which matches
any model string containing "doubao" and can incorrectly claim models from other
providers. Narrow the models regex in the `@register_agent` call (reference: the
register_agent decorator and the DoubaoComputerAgentConfig class) so this loop
only activates for the intended OpenAI-adapter Doubao models, i.e. require the
"openai/" prefix. The pattern must still accept hyphenated names such as
openai/doubao-seed-2-0-lite-260215, so something like ^openai/doubao.*$ is
suitable; a pattern like ^openai/doubao(/.*)?$ would reject them.
- Around line 154-166: The JSON parse failure branch logs the raw tool
argument blob (args / cleaned_args), which may contain secrets. Update the
exception handling in the parse block (where args is processed and cleaned_args
is constructed) so raw contents never reach logger.warning. Instead, build
a sanitized summary: if cleaned_args can be parsed into a dict, redact
sensitive keys like "password", "secret", "token", "keys", "credentials", or
"text" (replace values with "<REDACTED>"); if it cannot be parsed, log only its
length and a hash, since even a short raw preview could leak a secret. Then
call logger.warning with that sanitized summary and the error, and make sure
any subsequent logging references the sanitized variable rather than the
original args/cleaned_args.
- Around line 195-203: predict_click in doubao.py tries to use authoritative
screen dimensions via computer_handler.get_dimensions, but
ComputerAgent.predict_click currently only forwards model, image_b64,
instruction, api_key, and api_base so computer_handler is never passed through;
update the public call path (ComputerAgent.predict_click) to accept and forward
the computer_handler (or relevant agent/context object) into
agent.loops.doubao.predict_click so the get_dimensions branch is reachable, and
ensure any call sites that invoke ComputerAgent.predict_click also supply the
computer_handler; reference symbols: predict_click (in
libs/python/agent/agent/loops/doubao.py), ComputerAgent.predict_click (in
libs/python/agent/agent/agent.py), and computer_handler.get_dimensions.
- Around line 33-40: _denormalize_xy currently maps normalized 1000-based
coordinates so that nx=1000 -> x=target_w and ny=1000 -> y=target_h, producing
out-of-bounds pixel indices; update the function (_denormalize_xy) to compute x
and y as before then clamp them into the valid zero-based pixel range [0,
target_w-1] and [0, target_h-1] (e.g., use min/max or equivalent) so results
never exceed the last pixel and never go negative.
- Around line 59-60: The code incorrectly assumes litellm.aresponses() returns a
final response when called with stream=True; to fix, either remove/ban streaming
by validating the stream parameter at the start of the relevant functions (e.g.,
in the function signature handling in doubao.py) and raise a clear error if
stream is True, or implement proper streaming consumption: when calling
litellm.aresponses() with stream=True, treat the result as an async iterator,
iterate with "async for event in ..." to collect content events and only access
.usage on the final event (and build the full response object before calling
.model_dump()); update the same pattern for the other places noted (around lines
134-146 and 151-191) so you do not call .model_dump() or index .usage on the
streaming iterator.
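
The model-matching concern in the first comment can be checked with a few lines of `re`. This is a minimal sketch: it assumes the registry matches model strings with `re.match`, and the stricter pattern is an illustrative choice (not taken from the PR) that requires the `openai/` prefix while still accepting the hyphenated model name from the usage example. The second candidate name is hypothetical.

```python
import re

# Pattern reported in the review as currently used by the decorator.
PERMISSIVE = re.compile(r".*doubao.*")
# Assumed stricter alternative: anchored and prefixed with openai/.
STRICT = re.compile(r"^openai/doubao.*$")

candidates = [
    "openai/doubao-seed-2-0-lite-260215",  # model string from the usage example
    "someprovider/doubao-compatible",      # hypothetical model from another provider
    "openai/gpt-4o",                       # unrelated OpenAI model
]


def claimed_by(pattern, names):
    """Return the model names that the given pattern would claim."""
    return [name for name in names if pattern.match(name)]


permissive_hits = claimed_by(PERMISSIVE, candidates)
strict_hits = claimed_by(STRICT, candidates)
```

Here `permissive_hits` contains the first two names, while `strict_hits` contains only the Ark model routed through the OpenAI adapter.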

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 1e62756 and 692c1ce.

📒 Files selected for processing (2)

libs/python/agent/agent/loops/__init__.py
libs/python/agent/agent/loops/doubao.py

coderabbitai · 2026-04-07T03:58:42Z