Skip to content

QUALITY-780: client driver — WaitingForEvents watchdog (Wave 2)#12009

Merged
cephalonaut merged 1 commit into
matthew/QUALITY-780-client-corefrom
matthew/QUALITY-780-client-driver
Jun 1, 2026
Merged

QUALITY-780: client driver — WaitingForEvents watchdog (Wave 2)#12009
cephalonaut merged 1 commit into
matthew/QUALITY-780-client-corefrom
matthew/QUALITY-780-client-driver

Conversation

@cephalonaut
Copy link
Copy Markdown
Contributor

Description

Implements client TECH §4 of QUALITY-780 (Wave 2: client-driver): when an Oz cloud agent yields via wait_for_events, the local driver no longer exits — instead it schedules a watchdog timer that, on fire, emits a synthetic WaitForEventsResult against the unresolved tool call so the agent's next turn decides how to proceed. The watchdog never resolves run_exit, so a waiting run stays alive across the yield.

Stacks on top of client-core (PR #11998matthew/QUALITY-780-client-core).

Linked Issue

QUALITY-780 (linked from client-core)

What

Touches only the two files in this Wave 2 agent's scope:

  • app/src/ai/agent_sdk/driver.rs — new

    • DEFAULT_ORCHESTRATED_IDLE_TIMEOUT_SECONDS = 30 * 60 (30 minutes; documented to roughly mirror server-side VMIdleTimeoutMinutes tenant default)
    • UnresolvedWaitForEventsCall struct + find_unresolved_wait_for_events_call(&AIConversation) helper that scans linearized messages for the most recent unresolved WaitForEvents tool call (intentionally decoupled from client-detection's internal mark_/clear_conversation_waiting_for_events storage)
    • watchdog_timeout_for_call resolves the timeout from idle_timeout_seconds, falling back to the default for 0 (prost flattened-scalar unset) and clamping negatives to the default
    • execute_run's UpdatedConversationStatus handler:
      • The existing is_in_progress() early-return now also bumps the watchdog generation counter, so an inbound resume (per client TECH §8.3) cancels any pending watchdog
      • New is_waiting_for_events() arm schedules the watchdog and explicitly does NOT resolve run_exit
    • schedule_waiting_watchdog spawns the timer with a generation-counter check (mirrors the existing IdleTimeoutSender pattern but is a separate counter so the waiting watchdog and the idle_on_complete completion timer never share state)
    • emit_wait_for_events_timeout_result is the watchdog-fire callback. Currently a logged stub with a precise TODO(integration, QUALITY-780) marker — see "Integration handoff" below
  • app/src/ai/agent_sdk/driver_tests.rs — new tests:

    • default_orchestrated_idle_timeout_seconds_is_thirty_minutes
    • watchdog_timeout_uses_server_supplied_value_when_positive / ..._falls_back_to_default_when_unset / ..._clamps_negative_value_to_default
    • find_unresolved_returns_none_when_no_wait_for_events_call / ..._returns_call_when_present_and_unresolved / ..._returns_none_when_resolved_by_later_result / ..._returns_most_recent_when_multiple_present
    • unresolved_wait_for_events_call_roundtrips_idle_timeout_zero_via_helper

PRODUCT.md invariants (11), (12), (13) are covered at the unit level. The full subscription-driven schedule/cancel/fire-then-resume flows are intentionally deferred to the Wave 3 integration tests against the fake server (per client TECH §"Testing and validation").

Why

PRODUCT.md (11)–(13): a wait_for_events yield must not exit the Oz CLI worker, must auto-emit an empty WaitForEventsResult after a bounded watchdog interval, and must let the agent's next turn decide what to do with that timeout (commonly finish_task, but the agent may re-yield, ask the user, or take other action) rather than auto-cancelling the run.

Integration handoff

This Wave 2 PR ends at the watchdog-fire boundary. The actual upstream emission (building an AIAgentInput::ActionResult carrying AIAgentActionResultType::WaitForEvents(WaitForEventsResult{}) and dispatching it via BlocklistAIController::submit_wait_for_events_timeout(...)) requires both an AIAgentActionResultType variant and a new public controller method — both in files owned by client-detection. The orchestrator confirmed (019e73e3-c513-7cd7-9a8b-75439ee9ecbe) that this is the correct boundary; the watchdog-fire stub logs the event and stores everything needed for the integrator to replace it with a single method call. The replacement is documented inline at emit_wait_for_events_timeout_result's TODO(integration, QUALITY-780) marker.

Testing

  • cargo fmt --check — passes
  • cargo clippy -p warp --tests -- -D warnings — passes
  • cargo nextest run -p warp for the new tests and the existing IdleTimeoutSender tests — all 16 pass
  • Full workspace ./script/presubmit could not complete on the sandbox due to disk-space exhaustion across the sibling Wave 2 worktrees; the per-crate clippy + tests above cover the changed code

Agent Mode

  • Warp Agent Mode — This PR was created via Warp's AI Agent Mode

Conversation: https://staging.warp.dev/conversation/e93eac23-3709-40ab-a60a-841b50333372
Run: https://oz.staging.warp.dev/runs/019e83b1-72f0-7bb0-926a-b0ba3463d6c6

This PR was generated with Oz.

Implements client TECH §4: when the conversation transitions into
`ConversationStatus::WaitingForEvents`, the Oz CLI driver's
`UpdatedConversationStatus` handler now schedules a local watchdog timer
instead of resolving `run_exit` (so the worker stays alive across the
yield) and the watchdog uses the server-supplied `idle_timeout_seconds`
from the unresolved `wait_for_events` tool call, falling back to a new
`DEFAULT_ORCHESTRATED_IDLE_TIMEOUT_SECONDS` (30 minutes) when unset.

The watchdog is decoupled from `run_exit`: on fire it does NOT cancel
the run; instead it (will) emit a `WaitForEventsResult` tool-call result
that the agent's next turn processes. A `TODO(integration, QUALITY-780)`
marker at `emit_wait_for_events_timeout_result` documents the exact
helper signature client-detection's Wave 2 work will expose, so the
integrator can replace the stub with a single line.

A generation counter (mirroring the existing `IdleTimeoutSender`
pattern, but separate so the watchdog and the `idle_on_complete` timer
don't share state) supersedes the watchdog when the conversation
returns to `InProgress` via the normal resume path (per client TECH
§8.3).

`find_unresolved_wait_for_events_call` scans the conversation's
linearized messages for the most recent unresolved `WaitForEvents` tool
call, intentionally avoiding coupling to client-detection's internal
`mark_/clear_conversation_waiting_for_events` storage.

Tests cover:
- Default-timeout constant (30 minutes).
- `watchdog_timeout_for_call` honors positive server-supplied values,
  falls back to default for `0` (prost flattened-scalar unset), and
  clamps negative values to the default.
- `find_unresolved_wait_for_events_call` returns `None` when there's no
  `WaitForEvents` call, returns the call when present and unresolved,
  returns `None` when a later `ToolCallResult` matches the
  `tool_call_id`, and returns the most-recent unresolved call when
  multiple yields appear in the same conversation.
- Roundtrip: `idle_timeout_seconds=0` is preserved through the detection
  helper and resolved to the default at the timeout-helper boundary.

PRODUCT.md invariants (11), (12), (13) covered at the unit level. The
full subscription-driven scheduling/cancellation and end-to-end
fire-then-resume flows are intentionally deferred to the Wave 3
integration tests against the fake server (per client TECH
§"Testing and validation").

Co-Authored-By: Oz <oz-agent@warp.dev>
@cla-bot cla-bot Bot added the cla-signed label Jun 1, 2026
@cephalonaut cephalonaut merged commit c092bde into matthew/QUALITY-780-client-core Jun 1, 2026
4 checks passed
@cephalonaut cephalonaut deleted the matthew/QUALITY-780-client-driver branch June 1, 2026 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant