Skip to content

fix: debounce TUI refresh to cap renderer redraw rate#17

Open
demianbrecht wants to merge 3 commits into
mainfrom
tui-refresh-debounce
Open

fix: debounce TUI refresh to cap renderer redraw rate#17
demianbrecht wants to merge 3 commits into
mainfrom
tui-refresh-debounce

Conversation

@demianbrecht
Copy link
Copy Markdown
Contributor

@demianbrecht demianbrecht commented May 26, 2026

Summary

A long-running task wedged at 99% CPU on the daemon process (1h 51m elapsed × 41% sustained CPU = continuous spin since startup), while the agent process was at 0% CPU patiently blocked on its IPC channel. py-spy dump pinned the daemon's main thread:

```
split_lines (prompt_toolkit/formatted_text/utils.py:89)
create_content (prompt_toolkit/layout/controls.py:376)
_write_to_screen_at_index (prompt_toolkit/layout/containers.py:1760)
write_to_screen (prompt_toolkit/layout/containers.py:1735)
render (prompt_toolkit/renderer.py:682)
run_in_context (prompt_toolkit/application/application.py:526)
_redraw (prompt_toolkit/application/application.py:543)
redraw (prompt_toolkit/application/application.py:471)
schedule (prompt_toolkit/eventloop/utils.py:73)
```

The agent's events were piling up unread because the TUI couldn't drain them between redraws.

Mechanism

`_append_line` → `_refresh` → unconditional `Application.invalidate()`. When events arrive in bursts (an LLM tool loop firing dozens of `tool.invoke` events back-to-back, each producing 50+ rendered lines via markdown / diff splitting), every line schedules a fresh redraw. `split_lines` is per-frame O(total-rendered-text) — with a 10k-line scrollback the renderer can't finish a frame before the next invalidate is queued, and we converge on a render-only loop that never yields to the IPC reader.

Fix

Coalesce `_refresh()` calls within a 33ms (~30 fps) window into a single `Application.invalidate()`. First call arms a `loop.call_later` timer; subsequent calls inside the window are no-ops; timer fires once and clears.

  • Visible latency bounded by 33ms (well below human perception).
  • Bursty append-storms collapse from N redraws to one.
  • Outside an event loop (test fixtures, pre-startup paths) falls back to direct `invalidate()`.

Test plan

  • 662 unit tests pass (`uv run pytest tests/`).
  • `uv run ruff check src/ tests/ examples/` clean.
  • Reattach to a long-running calder task with high event rate (LLM tool loop) — confirm daemon CPU stays low and the agent's events are visible promptly. Detached mode (`-d`) is unaffected as a control case.

A long-running calder task wedged at 99% CPU on the daemon process.
py-spy dump pinned the main thread spinning in
`prompt_toolkit.layout.controls.create_content → split_lines →
write_to_screen → render → _redraw` while the agent process
(properly blocked on its IPC channel) was at 0% CPU and 30s of
total CPU time vs. 1h 51m elapsed. The agent's events were piling
up unread because the TUI couldn't drain them between redraws.

Mechanism: every `_append_line` calls `self._refresh()` which
unconditionally calls `Application.invalidate()`. When events
arrive in bursts (an LLM tool loop firing dozens of `tool.invoke`
events back-to-back, each producing 50+ rendered lines via
markdown / diff splitting), the renderer can't keep up — each
redraw runs `split_lines` over the entire 10k-line scrollback,
which is per-frame O(total-rendered-text), not O(visible-area).
The redraw schedules itself (via the next invalidate) before it
finishes processing the prior event, and we converge on a render
loop that never yields back to the IPC reader.

Fix: coalesce `_refresh()` calls within a 33ms (~30 fps) window
into a single `Application.invalidate()`. The first call within a
quiet period arms a `loop.call_later` timer; subsequent calls
inside the window are no-ops; the timer fires once and clears.
Visible latency is bounded (33ms is well below human perception)
while bursty append-storms collapse from N redraws to one,
giving the IPC reader CPU to actually drain events.

Outside an event loop (test fixtures, pre-startup, the existing
`test_calls_refresh_with_app_set` shape) the behavior falls back
to direct `invalidate()` so legacy callers don't silently lose
their redraw.

Three new TUI tests:
- `test_coalesces_burst_into_single_invalidate` — 10 rapid
  appends + sleep past the debounce → exactly one `invalidate`
  call, timer cleared.
- `test_subsequent_burst_after_fire_arms_new_timer` — debounce
  doesn't permanently gate; a second burst after the first fires
  re-arms.
- `test_refresh_outside_event_loop_falls_back_to_direct_invalidate`
  — pins the legacy synchronous-fallback path.

662 unit tests pass (+3). ruff lint + format clean across whole repo.
@github-actions github-actions Bot added the tests label May 26, 2026
Follow-on to the `_refresh()` debounce in this PR. The debounce
removes the *additive* invalidate storm during event bursts, but
doesn't help the steady-state pin: an active task drives a 5 fps
spinner that calls `_refresh()` every 200ms regardless of whether
content changed, and prompt_toolkit's `FormattedTextControl.create_content`
walks the entire scrollback on every redraw. With a 10k-line
buffer the per-frame cost can exceed the spinner interval, and the
renderer never yields back to the IPC reader.

Cuts both axes:

- **`max_buffer_lines` default 10_000 → 2_000.** Render cost is
  linear in buffer size; a 5× cut makes the steady-state render
  cheap enough that the spinner can never starve the reader.
- **Spinner tick 0.2s → 1.0s.** Was hardcoded; promoted to a
  per-session field. Humans don't perceive the difference between
  a 5 fps and 1 fps spinner ("task is running"), but the renderer
  pressure drops by 5×.

Both knobs land on a new `TuiConfig` model in `switchplane.config`
so applications (e.g. calder) can override them via their merged
TOML config without code changes:

```toml
[tui]
max_buffer_lines = 5000
spinner_interval = 0.5
```

The CLI's interactive-launch path now reads the merged config and
passes both values into `run_tui` → `TUISession`. Falls back to the
defaults if config parsing fails — a config validation error on
TUI launch is the wrong place to surface.

5 new tests:
- `TestTuiConfig` (3): defaults, overrides, reach via AppConfig.
- `TestTUISessionInit` gains `test_default_spinner_interval` and
  `test_custom_spinner_interval`.

667 tests pass (+5). ruff lint + format clean across the whole repo.
Copy link
Copy Markdown
Contributor Author

@demianbrecht demianbrecht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Security] No findings.

Reviewed the TUI debounce + config-knob change. Scope is local-process only: a Pydantic TuiConfig model with two numeric fields, an asyncio.call_later-based redraw debounce in tui.py, and a config-loader plumb-through in cli.py. No new HTTP handlers, auth surface, DB queries, filesystem paths from user input, crypto, network egress, secrets, or deserialization of untrusted data. The existing load_config uses stdlib tomllib (safe parser) and Pydantic validation against already-trusted paths (paths.config_path, app.default_config_path).

Notes considered and dismissed:

  • Broad except Exception in _load_tui_config falls back to defaults — UX/reliability concern, not security; doesn't mask any auth or integrity check.
  • A user setting spinner_interval=0.0 in their own config would self-DoS their TUI — config file is user-owned and already trusted; not an attacker-reachable vector.

Critical: 0
High: 0
Medium: 0
Low: 0

Comment thread src/switchplane/config.py
long-running tasks even when the user isn't actively scrolling.
"""

max_buffer_lines: int = 2_000
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc drift: CLAUDE.md line 103 still says "Event buffers are capped at a configurable max_buffer_lines (default 10,000)". This PR drops the default to 2,000. Worth updating in the same PR so the agent-facing docs don't lie about runtime behavior — a future change here will read CLAUDE.md, not config.py.

Comment thread src/switchplane/tui.py
# Fall back to immediate invalidate.
self._app.invalidate()
return
self._refresh_timer = loop.call_later(_REFRESH_DEBOUNCE_SECONDS, self._fire_refresh)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistency with the existing spinner pattern. Two-line down (_tick_spinner at line ~1348) the codebase already has the same arm-a-timer shape, and there it acquires the loop via self._app.loop. Here _refresh uses asyncio.get_running_loop(). They will normally agree, but if anything ever calls _refresh from a coroutine running on a non-app loop (e.g. a future test that drives a session under a separate event loop, or a thread-bridged context), the timer fires on a loop the app's invalidate doesn't observe. Recommend matching the spinner pattern: loop = self._app.loop; if loop is None: self._app.invalidate(); return — same shape, no try/except RuntimeError, and consistent with the established convention in this file.

Comment thread src/switchplane/cli.py Outdated
config_class=app.config_class,
)
return cfg.tui
except Exception:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except Exception here will silently swallow real config errors (TOML syntax errors, type validation failures from a misnamed [tui] field, etc.) and the user gets defaults with zero feedback. The docstring acknowledges this is intentional — "the TUI is the wrong place to surface a config validation error" — but "surface" doesn't have to mean "crash the TUI". Consider at minimum a stderr warning before the swallow, or routing through structlog so the issue ends up in the daemon log. Right now a typo in user config is invisible.

Comment thread src/switchplane/cli.py Outdated
)
return cfg.tui
except Exception:
from switchplane.config import TuiConfig
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: load_config is already imported at the top of this module (line 13) — TuiConfig can ride along in the same import line at module scope rather than being deferred into the except branch. There's no circular-import reason to keep it local; this just makes the failure path do extra work and slightly obscures the dependency.

Comment thread src/switchplane/config.py Outdated
deep-enough scrollback for routine debugging.
"""

spinner_interval: float = 1.0
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth calling out in the PR body / commit msg: this is a 5× drop in spinner frame rate (5 Hz → 1 Hz), not just a debounce. A 1 Hz spinner is going to feel sluggish to operators watching for liveness — at 1 fps, a moving spinner reads more like "is it stuck?" than "it's alive." The render-cost rationale in the docstring is sound, but the human-visible effect is also a UX regression on its own. Two options worth considering: (1) keep the spinner at a higher rate (e.g. 0.5s) and rely on the _refresh debounce + smaller buffer to bound cost — those changes alone might be sufficient; (2) drop to 1.0s but call it out explicitly in user-facing release notes so operators know to override via config.

Copy link
Copy Markdown
Contributor Author

@demianbrecht demianbrecht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Quality] CI green across all 8 checks (e2e 3.12/3.13, lint, test 3.12/3.13/3.14). Diagnosis is well-grounded — the py-spy trace, the per-frame O(total-rendered-text) analysis, and the bursty-IPC mechanism in the description all line up with the code change. Test coverage of the debounce timer (burst-coalescing, second-burst rearming, fallback path) is the right shape.

Five non-blocking observations posted inline. Highest-value:

  1. CLAUDE.md driftdefault 10,000 claim on line 103 of CLAUDE.md is now wrong; agent-facing docs should be updated in this PR.
  2. Loop-acquisition inconsistency_refresh() uses asyncio.get_running_loop() while _tick_spinner() (same file, same arm-a-timer pattern) uses self._app.loop. Worth aligning so future cross-loop shenanigans don't bite asymmetrically.
  3. Silent config-error swallow in _load_tui_config's except Exception — typos in user [tui] config are now invisible. Suggest at least a stderr/structlog warning before falling back to defaults.

Two minor:
4. Local import of TuiConfig inside the except branch — module-level next to existing load_config import is cleaner.
5. spinner_interval: 0.2 → 1.0 is a 5× UX-visible drop, not just a render-cost change. Consider 0.5s (still 4× cost reduction) or call it out in release notes.

Verdict: comment, not blocking. The fix is right, the tests are real, the data path is sound.

Five non-blocking findings from the review, all addressed:

1. **CLAUDE.md drift** — `default 10,000` claim was now wrong since
   this PR drops the default to 2,000. Updated to reference the new
   `[tui]` config section and the surrounding TUI tuning knobs.

2. **Loop-acquisition convention** — `_refresh()` was using
   `asyncio.get_running_loop()` while `_tick_spinner()` (same file,
   same arm-a-timer pattern) used `self._app.loop`. Aligned to
   prompt_toolkit's own convention from
   `Application.create_background_task`:
   `self._app.loop or asyncio.get_running_loop()`. This handles the
   pre-`run_async` window correctly (where `self._app.loop` is None
   and we still need to schedule on whatever loop is running) AND
   the post-`run_async` window (where `self._app.loop` is set and
   should be preferred). Preserves the synchronous-test fallback to
   direct `invalidate()`. Added a test pinning the
   "loop set → schedule on app loop, not get_running_loop" branch.

3. **Silent config-error swallow** — `_load_tui_config`'s
   `except Exception` now emits a stderr warning before falling
   back to defaults. A typo in `[tui]` would otherwise silently
   revert to defaults with zero feedback.

4. **Local TuiConfig import** — promoted to a module-level import
   alongside `load_config`. No circular-import reason to keep it
   local; cleaner failure path.

5. **Spinner interval pushback** — reviewer's case for "1 fps reads
   as stuck, not alive" was sound. Settled on 0.5s (2 fps) as the
   compromise: 2.5× cost reduction vs. 0.2s without crossing the
   "is it stuck?" threshold. The buffer cap and refresh debounce
   are doing most of the heavy lifting on render cost; the spinner
   change is polish on top. TuiConfig docstring + test fixtures
   updated.

668 unit tests pass (+1 new for the loop-acquisition branch).
ruff lint + format clean across the whole repo.
@github-actions github-actions Bot added the docs label May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant