fix: debounce TUI refresh to cap renderer redraw rate by demianbrecht · Pull Request #17 · salesforce-misc/switchplane

demianbrecht · 2026-05-26T04:10:28Z

Summary

A long-running task wedged at 99% CPU on the daemon process (1h 51m elapsed × 41% sustained CPU = continuous spin since startup), while the agent process was at 0% CPU patiently blocked on its IPC channel. py-spy dump pinned the daemon's main thread:

```
split_lines (prompt_toolkit/formatted_text/utils.py:89)
create_content (prompt_toolkit/layout/controls.py:376)
_write_to_screen_at_index (prompt_toolkit/layout/containers.py:1760)
write_to_screen (prompt_toolkit/layout/containers.py:1735)
render (prompt_toolkit/renderer.py:682)
run_in_context (prompt_toolkit/application/application.py:526)
_redraw (prompt_toolkit/application/application.py:543)
redraw (prompt_toolkit/application/application.py:471)
schedule (prompt_toolkit/eventloop/utils.py:73)
```

The agent's events were piling up unread because the TUI couldn't drain them between redraws.

Mechanism

`_append_line` → `_refresh` → unconditional `Application.invalidate()`. When events arrive in bursts (an LLM tool loop firing dozens of `tool.invoke` events back-to-back, each producing 50+ rendered lines via markdown / diff splitting), every line schedules a fresh redraw. `split_lines` is per-frame O(total-rendered-text) — with a 10k-line scrollback the renderer can't finish a frame before the next invalidate is queued, and we converge on a render-only loop that never yields to the IPC reader.

Fix

Coalesce `_refresh()` calls within a 33ms (~30 fps) window into a single `Application.invalidate()`. First call arms a `loop.call_later` timer; subsequent calls inside the window are no-ops; timer fires once and clears.

Visible latency bounded by 33ms (well below human perception).
Bursty append-storms collapse from N redraws to one.
Outside an event loop (test fixtures, pre-startup paths) falls back to direct `invalidate()`.

Test plan

662 unit tests pass (`uv run pytest tests/`).
`uv run ruff check src/ tests/ examples/` clean.
Reattach to a long-running calder task with high event rate (LLM tool loop) — confirm daemon CPU stays low and the agent's events are visible promptly. Detached mode (`-d`) is unaffected as a control case.

A long-running calder task wedged at 99% CPU on the daemon process. py-spy dump pinned the main thread spinning in `prompt_toolkit.layout.controls.create_content → split_lines → write_to_screen → render → _redraw` while the agent process (properly blocked on its IPC channel) was at 0% CPU and 30s of total CPU time vs. 1h 51m elapsed. The agent's events were piling up unread because the TUI couldn't drain them between redraws. Mechanism: every `_append_line` calls `self._refresh()` which unconditionally calls `Application.invalidate()`. When events arrive in bursts (an LLM tool loop firing dozens of `tool.invoke` events back-to-back, each producing 50+ rendered lines via markdown / diff splitting), the renderer can't keep up — each redraw runs `split_lines` over the entire 10k-line scrollback, which is per-frame O(total-rendered-text), not O(visible-area). The redraw schedules itself (via the next invalidate) before it finishes processing the prior event, and we converge on a render loop that never yields back to the IPC reader. Fix: coalesce `_refresh()` calls within a 33ms (~30 fps) window into a single `Application.invalidate()`. The first call within a quiet period arms a `loop.call_later` timer; subsequent calls inside the window are no-ops; the timer fires once and clears. Visible latency is bounded (33ms is well below human perception) while bursty append-storms collapse from N redraws to one, giving the IPC reader CPU to actually drain events. Outside an event loop (test fixtures, pre-startup, the existing `test_calls_refresh_with_app_set` shape) the behavior falls back to direct `invalidate()` so legacy callers don't silently lose their redraw. Three new TUI tests: - `test_coalesces_burst_into_single_invalidate` — 10 rapid appends + sleep past the debounce → exactly one `invalidate` call, timer cleared. - `test_subsequent_burst_after_fire_arms_new_timer` — debounce doesn't permanently gate; a second burst after the first fires re-arms. - `test_refresh_outside_event_loop_falls_back_to_direct_invalidate` — pins the legacy synchronous-fallback path. 662 unit tests pass (+3). ruff lint + format clean across whole repo.

Follow-on to the `_refresh()` debounce in this PR. The debounce removes the *additive* invalidate storm during event bursts, but doesn't help the steady-state pin: an active task drives a 5 fps spinner that calls `_refresh()` every 200ms regardless of whether content changed, and prompt_toolkit's `FormattedTextControl.create_content` walks the entire scrollback on every redraw. With a 10k-line buffer the per-frame cost can exceed the spinner interval, and the renderer never yields back to the IPC reader. Cuts both axes: - **`max_buffer_lines` default 10_000 → 2_000.** Render cost is linear in buffer size; a 5× cut makes the steady-state render cheap enough that the spinner can never starve the reader. - **Spinner tick 0.2s → 1.0s.** Was hardcoded; promoted to a per-session field. Humans don't perceive the difference between a 5 fps and 1 fps spinner ("task is running"), but the renderer pressure drops by 5×. Both knobs land on a new `TuiConfig` model in `switchplane.config` so applications (e.g. calder) can override them via their merged TOML config without code changes: ```toml [tui] max_buffer_lines = 5000 spinner_interval = 0.5 ``` The CLI's interactive-launch path now reads the merged config and passes both values into `run_tui` → `TUISession`. Falls back to the defaults if config parsing fails — a config validation error on TUI launch is the wrong place to surface. 5 new tests: - `TestTuiConfig` (3): defaults, overrides, reach via AppConfig. - `TestTUISessionInit` gains `test_default_spinner_interval` and `test_custom_spinner_interval`. 667 tests pass (+5). ruff lint + format clean across the whole repo.

demianbrecht

[Security] No findings.

Reviewed the TUI debounce + config-knob change. Scope is local-process only: a Pydantic TuiConfig model with two numeric fields, an asyncio.call_later-based redraw debounce in tui.py, and a config-loader plumb-through in cli.py. No new HTTP handlers, auth surface, DB queries, filesystem paths from user input, crypto, network egress, secrets, or deserialization of untrusted data. The existing load_config uses stdlib tomllib (safe parser) and Pydantic validation against already-trusted paths (paths.config_path, app.default_config_path).

Notes considered and dismissed:

Broad except Exception in _load_tui_config falls back to defaults — UX/reliability concern, not security; doesn't mask any auth or integrity check.
A user setting spinner_interval=0.0 in their own config would self-DoS their TUI — config file is user-owned and already trusted; not an attacker-reachable vector.

Critical: 0
High: 0
Medium: 0
Low: 0

demianbrecht · 2026-05-26T15:23:20Z

+    long-running tasks even when the user isn't actively scrolling.
+    """
+
+    max_buffer_lines: int = 2_000


Doc drift: CLAUDE.md line 103 still says "Event buffers are capped at a configurable max_buffer_lines (default 10,000)". This PR drops the default to 2,000. Worth updating in the same PR so the agent-facing docs don't lie about runtime behavior — a future change here will read CLAUDE.md, not config.py.

demianbrecht · 2026-05-26T15:23:29Z

+            # Fall back to immediate invalidate.
+            self._app.invalidate()
+            return
+        self._refresh_timer = loop.call_later(_REFRESH_DEBOUNCE_SECONDS, self._fire_refresh)


Inconsistency with the existing spinner pattern. Two-line down (_tick_spinner at line ~1348) the codebase already has the same arm-a-timer shape, and there it acquires the loop via self._app.loop. Here _refresh uses asyncio.get_running_loop(). They will normally agree, but if anything ever calls _refresh from a coroutine running on a non-app loop (e.g. a future test that drives a session under a separate event loop, or a thread-bridged context), the timer fires on a loop the app's invalidate doesn't observe. Recommend matching the spinner pattern: loop = self._app.loop; if loop is None: self._app.invalidate(); return — same shape, no try/except RuntimeError, and consistent with the established convention in this file.

demianbrecht · 2026-05-26T15:23:44Z

+                config_class=app.config_class,
+            )
+            return cfg.tui
+        except Exception:


except Exception here will silently swallow real config errors (TOML syntax errors, type validation failures from a misnamed [tui] field, etc.) and the user gets defaults with zero feedback. The docstring acknowledges this is intentional — "the TUI is the wrong place to surface a config validation error" — but "surface" doesn't have to mean "crash the TUI". Consider at minimum a stderr warning before the swallow, or routing through structlog so the issue ends up in the daemon log. Right now a typo in user config is invisible.

demianbrecht · 2026-05-26T15:23:46Z

+            )
+            return cfg.tui
+        except Exception:
+            from switchplane.config import TuiConfig


Minor: load_config is already imported at the top of this module (line 13) — TuiConfig can ride along in the same import line at module scope rather than being deferred into the except branch. There's no circular-import reason to keep it local; this just makes the failure path do extra work and slightly obscures the dependency.

demianbrecht · 2026-05-26T15:23:48Z

+    deep-enough scrollback for routine debugging.
+    """
+
+    spinner_interval: float = 1.0


Worth calling out in the PR body / commit msg: this is a 5× drop in spinner frame rate (5 Hz → 1 Hz), not just a debounce. A 1 Hz spinner is going to feel sluggish to operators watching for liveness — at 1 fps, a moving spinner reads more like "is it stuck?" than "it's alive." The render-cost rationale in the docstring is sound, but the human-visible effect is also a UX regression on its own. Two options worth considering: (1) keep the spinner at a higher rate (e.g. 0.5s) and rely on the _refresh debounce + smaller buffer to bound cost — those changes alone might be sufficient; (2) drop to 1.0s but call it out explicitly in user-facing release notes so operators know to override via config.

demianbrecht

[Quality] CI green across all 8 checks (e2e 3.12/3.13, lint, test 3.12/3.13/3.14). Diagnosis is well-grounded — the py-spy trace, the per-frame O(total-rendered-text) analysis, and the bursty-IPC mechanism in the description all line up with the code change. Test coverage of the debounce timer (burst-coalescing, second-burst rearming, fallback path) is the right shape.

Five non-blocking observations posted inline. Highest-value:

CLAUDE.md drift — default 10,000 claim on line 103 of CLAUDE.md is now wrong; agent-facing docs should be updated in this PR.
Loop-acquisition inconsistency — _refresh() uses asyncio.get_running_loop() while _tick_spinner() (same file, same arm-a-timer pattern) uses self._app.loop. Worth aligning so future cross-loop shenanigans don't bite asymmetrically.
Silent config-error swallow in _load_tui_config's except Exception — typos in user [tui] config are now invisible. Suggest at least a stderr/structlog warning before falling back to defaults.

Two minor:
4. Local import of TuiConfig inside the except branch — module-level next to existing load_config import is cleaner.
5. spinner_interval: 0.2 → 1.0 is a 5× UX-visible drop, not just a render-cost change. Consider 0.5s (still 4× cost reduction) or call it out in release notes.

Verdict: comment, not blocking. The fix is right, the tests are real, the data path is sound.

Five non-blocking findings from the review, all addressed: 1. **CLAUDE.md drift** — `default 10,000` claim was now wrong since this PR drops the default to 2,000. Updated to reference the new `[tui]` config section and the surrounding TUI tuning knobs. 2. **Loop-acquisition convention** — `_refresh()` was using `asyncio.get_running_loop()` while `_tick_spinner()` (same file, same arm-a-timer pattern) used `self._app.loop`. Aligned to prompt_toolkit's own convention from `Application.create_background_task`: `self._app.loop or asyncio.get_running_loop()`. This handles the pre-`run_async` window correctly (where `self._app.loop` is None and we still need to schedule on whatever loop is running) AND the post-`run_async` window (where `self._app.loop` is set and should be preferred). Preserves the synchronous-test fallback to direct `invalidate()`. Added a test pinning the "loop set → schedule on app loop, not get_running_loop" branch. 3. **Silent config-error swallow** — `_load_tui_config`'s `except Exception` now emits a stderr warning before falling back to defaults. A typo in `[tui]` would otherwise silently revert to defaults with zero feedback. 4. **Local TuiConfig import** — promoted to a module-level import alongside `load_config`. No circular-import reason to keep it local; cleaner failure path. 5. **Spinner interval pushback** — reviewer's case for "1 fps reads as stuck, not alive" was sound. Settled on 0.5s (2 fps) as the compromise: 2.5× cost reduction vs. 0.2s without crossing the "is it stuck?" threshold. The buffer cap and refresh debounce are doing most of the heavy lifting on render cost; the spinner change is polish on top. TuiConfig docstring + test fixtures updated. 668 unit tests pass (+1 new for the loop-acquisition branch). ruff lint + format clean across the whole repo.

github-actions Bot added the tests label May 26, 2026

demianbrecht commented May 26, 2026

View reviewed changes

github-actions Bot added the docs label May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: debounce TUI refresh to cap renderer redraw rate#17

fix: debounce TUI refresh to cap renderer redraw rate#17
demianbrecht wants to merge 3 commits into
mainfrom
tui-refresh-debounce

demianbrecht commented May 26, 2026 •

edited

Loading

Uh oh!

demianbrecht left a comment

Uh oh!

demianbrecht May 26, 2026

Uh oh!

demianbrecht May 26, 2026

Uh oh!

demianbrecht May 26, 2026

Uh oh!

demianbrecht May 26, 2026

Uh oh!

demianbrecht May 26, 2026

Uh oh!

demianbrecht left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

demianbrecht commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Mechanism

Fix

Test plan

Uh oh!

demianbrecht left a comment

Choose a reason for hiding this comment

Uh oh!

demianbrecht May 26, 2026

Choose a reason for hiding this comment

Uh oh!

demianbrecht May 26, 2026

Choose a reason for hiding this comment

Uh oh!

demianbrecht May 26, 2026

Choose a reason for hiding this comment

Uh oh!

demianbrecht May 26, 2026

Choose a reason for hiding this comment

Uh oh!

demianbrecht May 26, 2026

Choose a reason for hiding this comment

Uh oh!

demianbrecht left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

demianbrecht commented May 26, 2026 •

edited

Loading