fix: debounce TUI refresh to cap renderer redraw rate#17
Conversation
A long-running calder task wedged at 99% CPU on the daemon process. py-spy dump pinned the main thread spinning in `prompt_toolkit.layout.controls.create_content → split_lines → write_to_screen → render → _redraw` while the agent process (properly blocked on its IPC channel) was at 0% CPU and 30s of total CPU time vs. 1h 51m elapsed. The agent's events were piling up unread because the TUI couldn't drain them between redraws. Mechanism: every `_append_line` calls `self._refresh()` which unconditionally calls `Application.invalidate()`. When events arrive in bursts (an LLM tool loop firing dozens of `tool.invoke` events back-to-back, each producing 50+ rendered lines via markdown / diff splitting), the renderer can't keep up — each redraw runs `split_lines` over the entire 10k-line scrollback, which is per-frame O(total-rendered-text), not O(visible-area). The redraw schedules itself (via the next invalidate) before it finishes processing the prior event, and we converge on a render loop that never yields back to the IPC reader. Fix: coalesce `_refresh()` calls within a 33ms (~30 fps) window into a single `Application.invalidate()`. The first call within a quiet period arms a `loop.call_later` timer; subsequent calls inside the window are no-ops; the timer fires once and clears. Visible latency is bounded (33ms is well below human perception) while bursty append-storms collapse from N redraws to one, giving the IPC reader CPU to actually drain events. Outside an event loop (test fixtures, pre-startup, the existing `test_calls_refresh_with_app_set` shape) the behavior falls back to direct `invalidate()` so legacy callers don't silently lose their redraw. Three new TUI tests: - `test_coalesces_burst_into_single_invalidate` — 10 rapid appends + sleep past the debounce → exactly one `invalidate` call, timer cleared. - `test_subsequent_burst_after_fire_arms_new_timer` — debounce doesn't permanently gate; a second burst after the first fires re-arms. - `test_refresh_outside_event_loop_falls_back_to_direct_invalidate` — pins the legacy synchronous-fallback path. 662 unit tests pass (+3). ruff lint + format clean across whole repo.
Follow-on to the `_refresh()` debounce in this PR. The debounce
removes the *additive* invalidate storm during event bursts, but
doesn't help the steady-state pin: an active task drives a 5 fps
spinner that calls `_refresh()` every 200ms regardless of whether
content changed, and prompt_toolkit's `FormattedTextControl.create_content`
walks the entire scrollback on every redraw. With a 10k-line
buffer the per-frame cost can exceed the spinner interval, and the
renderer never yields back to the IPC reader.
Cuts both axes:
- **`max_buffer_lines` default 10_000 → 2_000.** Render cost is
linear in buffer size; a 5× cut makes the steady-state render
cheap enough that the spinner can never starve the reader.
- **Spinner tick 0.2s → 1.0s.** Was hardcoded; promoted to a
per-session field. Humans don't perceive the difference between
a 5 fps and 1 fps spinner ("task is running"), but the renderer
pressure drops by 5×.
Both knobs land on a new `TuiConfig` model in `switchplane.config`
so applications (e.g. calder) can override them via their merged
TOML config without code changes:
```toml
[tui]
max_buffer_lines = 5000
spinner_interval = 0.5
```
The CLI's interactive-launch path now reads the merged config and
passes both values into `run_tui` → `TUISession`. Falls back to the
defaults if config parsing fails — a config validation error on
TUI launch is the wrong place to surface.
5 new tests:
- `TestTuiConfig` (3): defaults, overrides, reach via AppConfig.
- `TestTUISessionInit` gains `test_default_spinner_interval` and
`test_custom_spinner_interval`.
667 tests pass (+5). ruff lint + format clean across the whole repo.
demianbrecht
left a comment
There was a problem hiding this comment.
[Security] No findings.
Reviewed the TUI debounce + config-knob change. Scope is local-process only: a Pydantic TuiConfig model with two numeric fields, an asyncio.call_later-based redraw debounce in tui.py, and a config-loader plumb-through in cli.py. No new HTTP handlers, auth surface, DB queries, filesystem paths from user input, crypto, network egress, secrets, or deserialization of untrusted data. The existing load_config uses stdlib tomllib (safe parser) and Pydantic validation against already-trusted paths (paths.config_path, app.default_config_path).
Notes considered and dismissed:
- Broad
except Exceptionin_load_tui_configfalls back to defaults — UX/reliability concern, not security; doesn't mask any auth or integrity check. - A user setting
spinner_interval=0.0in their own config would self-DoS their TUI — config file is user-owned and already trusted; not an attacker-reachable vector.
Critical: 0
High: 0
Medium: 0
Low: 0
| long-running tasks even when the user isn't actively scrolling. | ||
| """ | ||
|
|
||
| max_buffer_lines: int = 2_000 |
There was a problem hiding this comment.
Doc drift: CLAUDE.md line 103 still says "Event buffers are capped at a configurable max_buffer_lines (default 10,000)". This PR drops the default to 2,000. Worth updating in the same PR so the agent-facing docs don't lie about runtime behavior — a future change here will read CLAUDE.md, not config.py.
| # Fall back to immediate invalidate. | ||
| self._app.invalidate() | ||
| return | ||
| self._refresh_timer = loop.call_later(_REFRESH_DEBOUNCE_SECONDS, self._fire_refresh) |
There was a problem hiding this comment.
Inconsistency with the existing spinner pattern. Two-line down (_tick_spinner at line ~1348) the codebase already has the same arm-a-timer shape, and there it acquires the loop via self._app.loop. Here _refresh uses asyncio.get_running_loop(). They will normally agree, but if anything ever calls _refresh from a coroutine running on a non-app loop (e.g. a future test that drives a session under a separate event loop, or a thread-bridged context), the timer fires on a loop the app's invalidate doesn't observe. Recommend matching the spinner pattern: loop = self._app.loop; if loop is None: self._app.invalidate(); return — same shape, no try/except RuntimeError, and consistent with the established convention in this file.
| config_class=app.config_class, | ||
| ) | ||
| return cfg.tui | ||
| except Exception: |
There was a problem hiding this comment.
except Exception here will silently swallow real config errors (TOML syntax errors, type validation failures from a misnamed [tui] field, etc.) and the user gets defaults with zero feedback. The docstring acknowledges this is intentional — "the TUI is the wrong place to surface a config validation error" — but "surface" doesn't have to mean "crash the TUI". Consider at minimum a stderr warning before the swallow, or routing through structlog so the issue ends up in the daemon log. Right now a typo in user config is invisible.
| ) | ||
| return cfg.tui | ||
| except Exception: | ||
| from switchplane.config import TuiConfig |
There was a problem hiding this comment.
Minor: load_config is already imported at the top of this module (line 13) — TuiConfig can ride along in the same import line at module scope rather than being deferred into the except branch. There's no circular-import reason to keep it local; this just makes the failure path do extra work and slightly obscures the dependency.
| deep-enough scrollback for routine debugging. | ||
| """ | ||
|
|
||
| spinner_interval: float = 1.0 |
There was a problem hiding this comment.
Worth calling out in the PR body / commit msg: this is a 5× drop in spinner frame rate (5 Hz → 1 Hz), not just a debounce. A 1 Hz spinner is going to feel sluggish to operators watching for liveness — at 1 fps, a moving spinner reads more like "is it stuck?" than "it's alive." The render-cost rationale in the docstring is sound, but the human-visible effect is also a UX regression on its own. Two options worth considering: (1) keep the spinner at a higher rate (e.g. 0.5s) and rely on the _refresh debounce + smaller buffer to bound cost — those changes alone might be sufficient; (2) drop to 1.0s but call it out explicitly in user-facing release notes so operators know to override via config.
demianbrecht
left a comment
There was a problem hiding this comment.
[Quality] CI green across all 8 checks (e2e 3.12/3.13, lint, test 3.12/3.13/3.14). Diagnosis is well-grounded — the py-spy trace, the per-frame O(total-rendered-text) analysis, and the bursty-IPC mechanism in the description all line up with the code change. Test coverage of the debounce timer (burst-coalescing, second-burst rearming, fallback path) is the right shape.
Five non-blocking observations posted inline. Highest-value:
- CLAUDE.md drift —
default 10,000claim on line 103 of CLAUDE.md is now wrong; agent-facing docs should be updated in this PR. - Loop-acquisition inconsistency —
_refresh()usesasyncio.get_running_loop()while_tick_spinner()(same file, same arm-a-timer pattern) usesself._app.loop. Worth aligning so future cross-loop shenanigans don't bite asymmetrically. - Silent config-error swallow in
_load_tui_config'sexcept Exception— typos in user[tui]config are now invisible. Suggest at least a stderr/structlog warning before falling back to defaults.
Two minor:
4. Local import of TuiConfig inside the except branch — module-level next to existing load_config import is cleaner.
5. spinner_interval: 0.2 → 1.0 is a 5× UX-visible drop, not just a render-cost change. Consider 0.5s (still 4× cost reduction) or call it out in release notes.
Verdict: comment, not blocking. The fix is right, the tests are real, the data path is sound.
Five non-blocking findings from the review, all addressed: 1. **CLAUDE.md drift** — `default 10,000` claim was now wrong since this PR drops the default to 2,000. Updated to reference the new `[tui]` config section and the surrounding TUI tuning knobs. 2. **Loop-acquisition convention** — `_refresh()` was using `asyncio.get_running_loop()` while `_tick_spinner()` (same file, same arm-a-timer pattern) used `self._app.loop`. Aligned to prompt_toolkit's own convention from `Application.create_background_task`: `self._app.loop or asyncio.get_running_loop()`. This handles the pre-`run_async` window correctly (where `self._app.loop` is None and we still need to schedule on whatever loop is running) AND the post-`run_async` window (where `self._app.loop` is set and should be preferred). Preserves the synchronous-test fallback to direct `invalidate()`. Added a test pinning the "loop set → schedule on app loop, not get_running_loop" branch. 3. **Silent config-error swallow** — `_load_tui_config`'s `except Exception` now emits a stderr warning before falling back to defaults. A typo in `[tui]` would otherwise silently revert to defaults with zero feedback. 4. **Local TuiConfig import** — promoted to a module-level import alongside `load_config`. No circular-import reason to keep it local; cleaner failure path. 5. **Spinner interval pushback** — reviewer's case for "1 fps reads as stuck, not alive" was sound. Settled on 0.5s (2 fps) as the compromise: 2.5× cost reduction vs. 0.2s without crossing the "is it stuck?" threshold. The buffer cap and refresh debounce are doing most of the heavy lifting on render cost; the spinner change is polish on top. TuiConfig docstring + test fixtures updated. 668 unit tests pass (+1 new for the loop-acquisition branch). ruff lint + format clean across the whole repo.
Summary
A long-running task wedged at 99% CPU on the daemon process (1h 51m elapsed × 41% sustained CPU = continuous spin since startup), while the agent process was at 0% CPU patiently blocked on its IPC channel. py-spy dump pinned the daemon's main thread:
```
split_lines (prompt_toolkit/formatted_text/utils.py:89)
create_content (prompt_toolkit/layout/controls.py:376)
_write_to_screen_at_index (prompt_toolkit/layout/containers.py:1760)
write_to_screen (prompt_toolkit/layout/containers.py:1735)
render (prompt_toolkit/renderer.py:682)
run_in_context (prompt_toolkit/application/application.py:526)
_redraw (prompt_toolkit/application/application.py:543)
redraw (prompt_toolkit/application/application.py:471)
schedule (prompt_toolkit/eventloop/utils.py:73)
```
The agent's events were piling up unread because the TUI couldn't drain them between redraws.
Mechanism
`_append_line` → `_refresh` → unconditional `Application.invalidate()`. When events arrive in bursts (an LLM tool loop firing dozens of `tool.invoke` events back-to-back, each producing 50+ rendered lines via markdown / diff splitting), every line schedules a fresh redraw. `split_lines` is per-frame O(total-rendered-text) — with a 10k-line scrollback the renderer can't finish a frame before the next invalidate is queued, and we converge on a render-only loop that never yields to the IPC reader.
Fix
Coalesce `_refresh()` calls within a 33ms (~30 fps) window into a single `Application.invalidate()`. First call arms a `loop.call_later` timer; subsequent calls inside the window are no-ops; timer fires once and clears.
Test plan