diff --git a/docs/app-mode-ai-chat-review.md b/docs/app-mode-ai-chat-review.md
index 51833d4f9acef..eacfba08cbcbf 100644
--- a/docs/app-mode-ai-chat-review.md
+++ b/docs/app-mode-ai-chat-review.md
@@ -1,354 +1,48 @@
 # App Mode AI Chat Review
 
-## Purpose
+This note only tracks the highest-value next steps for making app-mode AI chat
+safer and more efficient.
 
-This document reviews the current app-mode AI chat design with a focus on:
+## Recommended Next Steps
 
-- keeping prompts and context as small as possible;
-- requiring user confirmation for important actions;
-- making datatable integration smooth and safe for users.
+1. Add confirmation for dangerous app tools.
 
-## Short verdict
+   Require explicit user confirmation before file writes, file deletes, backend
+   runnable writes, backend runnable deletes, and datatable SQL execution. Show a
+   useful diff or exact SQL before applying the action.
 
-The app-mode AI chat has a solid foundation: mode-specific helpers, explicit `@` context, app snapshots/revert, datatable whitelisting, and generic confirmation UI already exist.
+2. Enforce datatable SQL safety in code.
 
-However, it is not yet optimal for minimal context and user-safe automation:
+   Do not rely on prompt instructions for SQL safety. Classify statements before
+   execution, block DDL unless table creation is allowed, and require
+   confirmation for DDL, DML, and row-returning reads that would expose data back
+   to the model.
 
-1. **Context is still too large by default**, especially the app system prompt, broad file-discovery guidance, full datatable schemas, and persistent `@` context. (`get_files()` has since been replaced by metadata-only `list_files()`.)
-2. **Important app/datatable actions are not consistently confirmed**. The confirmation infrastructure exists, but app tools mostly bypass it.
-3. **Datatables UX is promising but has rough edges**: stale cached table context, weak SQL safety, policy persistence issues, and too-heavy full-schema fetching.
+3. Keep default context demand-driven.
 
-## Relevant files
+   Prefer selected context and targeted reads before broad discovery. Keep file
+   listings metadata-only, avoid sending full datatable schemas by default, and
+   keep SDK/reference material out of the base prompt unless it is requested or
+   needed for the task.
 
-### AI chat orchestration
+4. Improve app context lifecycle.
 
-- `frontend/src/lib/components/copilot/chat/AIChatManager.svelte.ts`
-- `frontend/src/lib/components/copilot/chat/chatLoop.ts`
-- `frontend/src/lib/components/copilot/chat/shared.ts`
-- `frontend/src/lib/components/copilot/chat/AIChat.svelte`
-- `frontend/src/lib/components/copilot/chat/AIChatDisplay.svelte`
-- `frontend/src/lib/components/copilot/chat/AIChatInput.svelte`
-- `frontend/src/lib/components/copilot/chat/ToolExecutionDisplay.svelte`
+   Treat `@` context as per-message by default, with an explicit pinning affordance
+   for context that should persist. Lazy-load file and runnable contents, and add
+   a visible approximate context-size indicator so users can spot prompt bloat.
 
-### App mode
+5. Refresh datatable context after mutations.
 
-- `frontend/src/lib/components/copilot/chat/app/core.ts`
-- `frontend/src/lib/components/copilot/chat/AppAvailableContextList.svelte`
-- `frontend/src/lib/components/copilot/chat/ContextElementBadge.svelte`
-- `frontend/src/lib/components/copilot/chat/DatatableCreationPolicy.svelte`
+   Refresh table metadata after data-panel changes and after AI-created tables so
+   follow-up tool calls and user-visible context do not use stale schema data.
 
-### Raw app editor and datatables
+6. Persist table creation policy explicitly.
 
-- `frontend/src/lib/components/raw_apps/RawAppEditor.svelte`
-- `frontend/src/lib/components/raw_apps/RawAppDataTableList.svelte`
-- `frontend/src/lib/components/raw_apps/RawAppDataTableDrawer.svelte`
-- `frontend/src/lib/components/raw_apps/DefaultDatabaseSelector.svelte`
-- `frontend/src/lib/components/raw_apps/dataTableRefUtils.ts`
-- `frontend/src/lib/components/raw_apps/datatableUtils.svelte.ts`
-- `frontend/src/routes/(root)/(logged)/apps_raw/add/+page.svelte`
-- `frontend/src/routes/(root)/(logged)/apps_raw/edit/[...path]/+page.svelte`
+   Store whether AI table creation is enabled as an explicit app setting instead
+   of inferring it from the presence of datatable configuration.
 
-### Backend datatable APIs
+7. Add focused eval coverage for these behaviors.
 
-- `backend/windmill-api-workspaces/src/workspaces.rs`
-  - `list_datatables`
-  - `list_datatable_schemas`
-  - `get_datatable_schema`
-  - `edit_datatable_config`
-
-### System prompts
-
-- `system_prompts/README.md`
-- `system_prompts/auto-generated/index.ts`
-- `system_prompts/auto-generated/sdks/datatable-typescript.md`
-- `system_prompts/auto-generated/sdks/datatable-python.md`
-
-## How app mode works today
-
-In the raw app editor, `RawAppEditor.svelte` initializes app-mode AI chat on mount:
-
-- calls `aiChatManager.saveAndClear()`;
-- calls `aiChatManager.changeMode(AIMode.APP)`;
-- registers app helpers through `aiChatManager.setAppHelpers(...)`.
-
-Those app helpers expose operations for:
-
-- frontend files;
-- backend runnables;
-- current selected editor context;
-- linting;
-- app snapshots and revert;
-- datatable schema loading;
-- SQL execution;
-- app table whitelisting.
-
-When app mode is active, `AIChatManager.changeMode(AIMode.APP)` sets:
-
-- system prompt: `prepareAppSystemMessage(...)`;
-- tools: `getAppTools()`;
-- helpers: `appAiChatHelpers`.
-
-When the user sends a message, `prepareAppUserMessage(...)` builds the user prompt from:
-
-- current frontend/backend file selection, unless excluded;
-- inspector-selected DOM element;
-- editor code selection;
-- additional `@`-mentioned context;
-- the user instructions.
-
-`runChatLoop(...)` then sends the system message, history, user message, and tool definitions to the selected model. Tool calls go through `processToolCall(...)`, which supports confirmation only when a tool opts into `requiresConfirmation`.
-
-## Current app tools
-
-### Read and discovery tools
-
-These are generally safe without confirmation:
-
-- `list_files`
-- `get_frontend_file`
-- `get_backend_runnable`
-- `get_selected_context`
-- `lint`
-- `search_workspace`
-- `get_runnable_details`
-- `search_hub_scripts`
-- `list_datatables`
-- `get_datatable_table_schema`
-
-### Mutating tools
-
-These currently execute directly in app mode:
-
-- `set_frontend_file`
-- `patch_file`
-- `delete_frontend_file`
-- `set_backend_runnable`
-- `delete_backend_runnable`
-- `exec_datatable_sql`
-
-This is the biggest mismatch with the requirement that every important action should be confirmed by the user.
-
-## System prompt assessment
-
-The app system prompt is useful but heavier than ideal.
-
-### Strengths
-
-- Clearly explains raw app structure.
-- Explains the frontend/backend runnable split.
-- Encourages `patch_file` for small edits.
-- Pushes datatables for persisted app storage.
-- Explains that datatable DDL should go through `exec_datatable_sql`.
-- Includes table creation policy context.
-
-### Concerns
-
-1. It always includes broad app-building instructions, even for small localized edits.
-2. The previous prompt included the datatable SDK reference for both TypeScript and Python every time. This has since been removed; concise examples remain in the prompt.
-3. The previous prompt told the model to start with `get_files()`, which encouraged loading all files even when selected context was sufficient. This is now improved by `list_files()`, but the prompt still needs to stay demand-driven.
-4. It relies heavily on prompt instructions for datatable safety instead of enforcing safety in tools.
-5. Custom workspace/user prompts are appended as `USER GIVEN INSTRUCTIONS`, which is flexible but can further increase context.
-
-### Recommendation
-
-The base app prompt should be shorter and more demand-driven:
-
-- Keep file discovery demand-driven: use selected and explicitly provided context first; call `list_files()` only when a broader metadata overview is needed.
-- Keep full SDK details out of the default prompt; concise examples are usually enough. Add an on-demand SDK reference only if it does not cause unnecessary extra tool turns.
-- Keep only minimal datatable rules in the base prompt:
-  - use datatables for persistence;
-  - call `list_datatables()` before schema work;
-  - DDL must use `exec_datatable_sql`;
-  - non-read SQL requires confirmation.
-
-## Additional context assessment
-
-The `@` context system is a good UX foundation.
-
-App mode exposes categories for:
-
-- frontend files;
-- backend runnables;
-- datatables.
-
-Selecting a datatable context includes its columns and also calls `addTableToWhitelist(...)`, adding the table to the app data panel.
-
-### Strengths
-
-- Context is explicit and user-controllable.
-- Datatable table selection is naturally integrated into the chat input.
-- Selected app file/runnable chips are visible and can be excluded.
-- Inspector and code-selection context are compact and useful.
-
-### Concerns
-
-1. `@` context persists across messages until manually removed, which can silently bloat follow-up prompts.
-2. Available app context currently includes file contents/runnable configs in memory before selection.
-3. Each selected context item is truncated, but there is no overall context budget indicator.
-4. Current file/runnable selection is included by default unless excluded, which is convenient but not minimal.
-
-### Recommendation
-
-- Make app `@` context per-message by default.
-- Add an explicit “pin” option for context that should persist across messages.
-- Lazy-load file/runnable content when selected or when a message is sent.
-- Show an approximate context-size/token budget indicator.
-- Prefer sending path/name and selected code first; fetch full files only when necessary.
-
-## Confirmation assessment
-
-The generic confirmation mechanism already exists:
-
-- `processToolCall(...)` checks `tool.requiresConfirmation`.
-- `ToolExecutionDisplay.svelte` renders Run/Cancel controls.
-- Script test runs, flow test runs, and mutating API calls already use confirmation.
-
-App mode should use the same infrastructure for important actions.
-
-### Suggested confirmation policy
-
-#### No confirmation required
-
-- `list_files`, as a metadata-only response;
-- `get_frontend_file`;
-- `get_backend_runnable`;
-- `get_selected_context`;
-- `list_datatables`, as table-name metadata only;
-- `get_datatable_table_schema`, as a targeted schema read;
-- `lint`;
-- search tools.
-
-#### Confirmation required
-
-- `set_frontend_file`;
-- `patch_file`;
-- `delete_frontend_file`;
-- `set_backend_runnable`;
-- `delete_backend_runnable`;
-- `exec_datatable_sql` for any DDL or DML;
-- `exec_datatable_sql` for `SELECT` if it returns real row data that will be sent back to the model.
-
-### Recommended UX
-
-For files/runnables:
-
-- Prefer batched proposed edits.
-- Show a diff.
-- Let the user click “Apply changes”.
-- Run lint after applying.
-
-For SQL:
-
-- Show the exact SQL.
-- Classify the query as:
-  - schema read;
-  - data read;
-  - insert/update/delete;
-  - DDL.
-- Require confirmation before data reads and all mutations.
-- For table creation, require both:
-  - table creation policy enabled;
-  - explicit confirmation of the `CREATE TABLE` SQL.
-
-## Datatables integration assessment
-
-The datatable integration is directionally good and already has several strong user-facing pieces.
-
-### Current strengths
-
-The new app setup lets the user choose:
-
-- default datatable;
-- schema mode: none, new, existing;
-- whether AI can create tables;
-- pre-whitelisted existing tables.
-
-The raw app data panel lets users:
-
-- add datatable table references;
-- inspect tables through the DB manager drawer;
-- configure the default datatable/schema for new tables.
-
-The AI chat integration lets users:
-
-- mention datatable tables through `@` context;
-- add mentioned tables to the app whitelist;
-- list datatable/schema/table names with `list_datatables()`;
-- retrieve one table's columns with `get_datatable_table_schema()`;
-- create tables through `exec_datatable_sql(..., new_table)`.
-
-### Concerns
-
-1. **`exec_datatable_sql` is too powerful without confirmation.**
-   It can run `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP`, `ALTER`, etc.
-
-2. **Table creation policy is not fully enforced in code.**
-   The tool blocks `new_table` when policy is disabled, but it does not block DDL if the model omits `new_table`.
-
-3. **Table creation disabled state may not persist cleanly.**
-   `RawAppData` stores `datatable` and `schema`, but not an explicit `enabled` value. `RawAppEditor` infers enabled from `data.datatable !== undefined`, which can re-enable table creation after reopening.
-
-4. **Datatable context cache can become stale.**
-   `AIChatManager.refreshDatatables()` runs when app helpers are set, but may not refresh immediately after data panel changes or after AI creates a new table.
-
-5. **Full schema loading can still be too expensive internally.**
-   `list_datatables()` and `get_datatable_table_schema()` reduce what is sent to the model, but they still currently rely on app helpers that fetch full schema data before filtering.
-
-6. **Auto-whitelisting from `@table` is convenient but silent.**
-   It mutates app data without an obvious confirmation or undo affordance.
-
-### Recommended datatable tool design
-
-Instead of one broad schema tool and one unrestricted SQL tool, prefer smaller tools:
-
-- `list_datatables()`
-- `list_datatable_tables(datatable, schema?, search?)` (optional backend/API optimization if table lists need server-side filtering)
-- `get_datatable_table_schema(datatable, schema, table)`
-- `preview_datatable_rows(datatable, schema, table, limit)` with confirmation
-- `execute_datatable_sql(datatable, sql)` with query classification and confirmation
-- `create_datatable_table(datatable, schema, table, columns)` as a structured safe path for table creation
-
-## Priority recommendations
-
-1. **Add confirmation to dangerous app tools**
-   - file/runnable writes;
-   - file/runnable deletes;
-   - datatable SQL;
-   - especially DDL/DML.
-
-2. **Enforce SQL safety in code, not only in prompts**
-   - block DDL unless `new_table` is provided and policy allows it;
-   - confirm all non-`SELECT` statements;
-   - consider confirming `SELECT` row reads too.
-
-3. **Reduce default prompt/tool context**
-   - keep `list_files()` metadata-only and demand-driven;
-   - use selected context first;
-   - keep full SDK references out of the default prompt;
-   - keep datatable tools split into smaller schema/table lookups.
-
-4. **Refresh datatable context reliably**
-   - refresh after data panel changes;
-   - refresh after `exec_datatable_sql(..., new_table)`;
-   - remove debug logging from datatable refresh.
-
-5. **Persist table creation policy explicitly**
-   - store a boolean such as `tableCreationEnabled` in raw app data;
-   - do not infer enabled solely from `data.datatable`.
-
-6. **Improve `@` context lifecycle**
-   - make app `@` context per-message by default;
-   - add pinning for persistent context;
-   - lazy-load file/runnable contents;
-   - show approximate context size.
-
-## Overall opinion
-
-The current architecture is good and extensible, but it should become more demand-driven and safer before being considered efficient and user-safe.
-
-The highest-impact changes are:
-
-- add confirmation for app mutations and datatable SQL;
-- enforce datatable SQL policy programmatically;
-- reduce the app system prompt and avoid automatic broad context loading;
-- split datatable schema access into smaller, targeted tools.
+   Cover confirmation requirements, datatable SQL policy enforcement, selected
+   context minimization, and stale-schema refresh behavior with targeted app-mode
+   evals or lower-level tests where practical.
diff --git a/docs/app-mode-ai-chat-token-baseline.md b/docs/app-mode-ai-chat-token-baseline.md
deleted file mode 100644
index 95eab70c0dfb5..0000000000000
--- a/docs/app-mode-ai-chat-token-baseline.md
+++ /dev/null
@@ -1,245 +0,0 @@
-# App Mode AI Chat Token Baseline
-
-This baseline was collected before optimizing app-mode context/prompt/datatable behavior.
-
-> Note: The historical commands/results below include `app-token-selected-large-frontend-context` and `app-token-selected-large-backend-context`. Those cases were removed from the active eval suite because `runtime.appContext.selected` only verified that the file/runnable existed and did not serialize a selected file/runnable hint to the model. Future selected-file/runnable coverage should be reintroduced through the app context manager path.
-
-## Command
-
-Secrets were loaded from `~/windmill/ai_evals/.env` without printing them.
-
-```bash
-cd ai_evals
-set -a
-source ~/windmill/ai_evals/.env
-set +a
-bun run cli -- run app \
-  app-token-baseline-large-app-small-edit \
-  app-token-selected-large-frontend-context \
-  app-token-selected-large-backend-context \
-  app-token-many-datatable-context \
-  app-token-large-datatable-discovery \
-  --model haiku \
-  --runs 1 \
-  --output results/app-token-baseline-current-max8.json
-```
-
-## Environment
-
-- Mode: `app`
-- Model under test: `anthropic:claude-haiku-4-5-20251001`
-- Transport: `direct`
-- Judge model: `claude-sonnet-4-6`
-- Runs per case: `1`
-- Token-heavy app cases use `runtime.maxTurns: 8`
-
-## Results
-
-Pass rate: **100% (5/5)**
-
-| Case | Prompt tokens | Completion tokens | Total tokens | Tool calls | Tools used |
-|---|---:|---:|---:|---:|---|
-| `app-token-baseline-large-app-small-edit` | 73,682 | 519 | 74,201 | 4 | `get_files`, `get_frontend_file`, `patch_file` |
-| `app-token-selected-large-frontend-context` | 36,305 | 348 | 36,653 | 2 | `get_frontend_file`, `patch_file` |
-| `app-token-selected-large-backend-context` | 95,232 | 19,633 | 114,865 | 4 | `set_backend_runnable`, `get_backend_runnable` |
-| `app-token-many-datatable-context` | 35,204 | 404 | 35,608 | 2 | `get_files`, `patch_file` |
-| `app-token-large-datatable-discovery` | 114,964 | 4,047 | 119,011 | 7 | `get_files`, `get_datatables`, `set_backend_runnable`, `set_frontend_file`, `patch_file`, `lint` |
-
-Aggregate token usage:
-
-```json
-{
-  "totalTokenUsage": {
-    "prompt": 355387,
-    "completion": 24951,
-    "total": 380338
-  },
-  "averageTokenUsagePerAttempt": {
-    "prompt": 71077.4,
-    "completion": 4990.2,
-    "total": 76067.6
-  }
-}
-```
-
-## Interpretation
-
-The highest-token cases are:
-
-1. `app-token-large-datatable-discovery` — full datatable discovery with `get_datatables()` and app edits reached **119,011** total tokens.
-2. `app-token-selected-large-backend-context` — selected large backend runnable plus a rewrite-style tool call reached **114,865** total tokens.
-3. `app-token-baseline-large-app-small-edit` — a trivial heading edit still reached **74,201** total tokens, largely due to broad file discovery.
-
-These cases should be rerun after prompt/context/tool changes to compare total and prompt-token reductions.
-
-## Follow-up: metadata-only `list_files`
-
-The contentful `get_files` app-mode tool was replaced with `list_files` to make broad app discovery cheaper and less sticky in chat history.
-
-Changes:
-
-- Renamed the overview tool from `get_files` to `list_files`.
-- Changed the overview response from truncated source/config contents to metadata only.
-- `list_files` returns:
-  - frontend files: `path`, character `size`, and file `kind`;
-  - backend runnables: `key`, `name`, `type`, and lightweight optional metadata such as `path`, `language`, `contentSize`, and `staticInputKeys`.
-- Updated app-mode prompt guidance so the model no longer starts every task with broad file discovery.
-- Kept targeted content tools as the path for inspection:
-  - `get_frontend_file(path)` for frontend source;
-  - `get_backend_runnable(key)` for runnable configuration/source.
-
-The same five cases were rerun with:
-
-```bash
-cd ai_evals
-set -a
-source ~/windmill/ai_evals/.env
-set +a
-bun run cli -- run app \
-  app-token-baseline-large-app-small-edit \
-  app-token-selected-large-frontend-context \
-  app-token-selected-large-backend-context \
-  app-token-many-datatable-context \
-  app-token-large-datatable-discovery \
-  --model haiku \
-  --runs 1 \
-  --output results/app-token-after-list-files.json
-```
-
-Pass rate: **100% (5/5)**
-
-| Case | Prompt tokens | Completion tokens | Total tokens | Tool calls | Tools used |
-|---|---:|---:|---:|---:|---|
-| `app-token-baseline-large-app-small-edit` | 41,020 | 422 | 41,442 | 3 | `list_files`, `get_frontend_file`, `patch_file` |
-| `app-token-selected-large-frontend-context` | 41,020 | 422 | 41,442 | 3 | `list_files`, `get_frontend_file`, `patch_file` |
-| `app-token-selected-large-backend-context` | 53,511 | 9,714 | 63,225 | 3 | `list_files`, `get_backend_runnable`, `set_backend_runnable` |
-| `app-token-many-datatable-context` | 46,990 | 475 | 47,465 | 3 | `list_files`, `get_frontend_file`, `patch_file` |
-| `app-token-large-datatable-discovery` | 131,607 | 5,084 | 136,691 | 8 | `get_datatables`, `list_files`, `set_backend_runnable`, `set_frontend_file`, `patch_file`, `lint` |
-
-Aggregate token usage:
-
-```json
-{
-  "totalTokenUsage": {
-    "prompt": 314148,
-    "completion": 16117,
-    "total": 330265
-  },
-  "averageTokenUsagePerAttempt": {
-    "prompt": 62829.6,
-    "completion": 3223.4,
-    "total": 66053
-  }
-}
-```
-
-Comparison against the post-rebase / PR #8922 run (`results/app-token-after-origin-main-pr8922.json`):
-
-| Case | PR #8922 total | `list_files` total | Delta | Delta % | Prompt delta |
-|---|---:|---:|---:|---:|---:|
-| `app-token-baseline-large-app-small-edit` | 74,061 | 41,442 | -32,619 | -44.0% | -32,522 |
-| `app-token-selected-large-frontend-context` | 74,061 | 41,442 | -32,619 | -44.0% | -32,522 |
-| `app-token-selected-large-backend-context` | 71,050 | 63,225 | -7,825 | -11.0% | -7,787 |
-| `app-token-many-datatable-context` | 35,497 | 47,465 | +11,968 | +33.7% | +11,886 |
-| `app-token-large-datatable-discovery` | 97,128 | 136,691 | +39,563 | +40.7% | +38,295 |
-
-Aggregate comparison against the post-rebase / PR #8922 run:
-
-| Metric | PR #8922 | `list_files` | Delta | Delta % |
-|---|---:|---:|---:|---:|
-| Prompt tokens | 336,798 | 314,148 | -22,650 | -6.7% |
-| Completion tokens | 14,999 | 16,117 | +1,118 | +7.5% |
-| Total tokens | 351,797 | 330,265 | -21,532 | -6.1% |
-
-Compared to the original baseline above, the `list_files` run is **-50,073 total tokens** (**-13.2% total**).
-
-Interpretation:
-
-- The small edit and selected-frontend cases improved substantially because broad discovery no longer injects truncated contents for the whole app.
-- The selected-backend case also improved, despite still needing targeted runnable inspection.
-- The datatable-context cases can require an extra `get_frontend_file` after `list_files`, so the small datatable edit regressed in this single-run sample.
-- The large datatable case remains dominated by datatable/schema prompt bloat and model variability; moving datatable SDK/reference and schema discovery behind smaller on-demand tools is still the next likely high-impact optimization.
-
-## Follow-up: targeted datatable tools and shorter datatable prompt
-
-The next pass reduced default datatable context by making datatable discovery metadata-first and removing the full datatable SDK reference from the system prompt.
-
-Changes:
-
-- Replaced the broad schema discovery tool with `list_datatables()` for datatable/schema/table names only.
-- Added `get_datatable_table_schema(datatable_name, schema_name, table_name)` for targeted column lookup when column names/types are actually needed.
-- Removed the full TypeScript + Python datatable SDK reference from the default app system prompt.
-- Kept concise TypeScript and Python datatable examples in the prompt, which were enough for the benchmark cases.
-- Strengthened prompt/tool guidance so table-list dashboards use `list_datatables()` directly and avoid schema/SDK lookups unless needed.
-
-The same five cases were rerun with:
-
-```bash
-cd ai_evals
-set -a
-source ~/windmill/ai_evals/.env
-set +a
-bun run cli -- run app \
-  app-token-baseline-large-app-small-edit \
-  app-token-selected-large-frontend-context \
-  app-token-selected-large-backend-context \
-  app-token-many-datatable-context \
-  app-token-large-datatable-discovery \
-  --model haiku \
-  --runs 1 \
-  --output results/app-token-after-datatable-tools-v3.json
-```
-
-Pass rate: **100% (5/5)**
-
-| Case | Prompt tokens | Completion tokens | Total tokens | Tool calls | Tools used |
-|---|---:|---:|---:|---:|---|
-| `app-token-baseline-large-app-small-edit` | 37,516 | 425 | 37,941 | 3 | `list_files`, `get_frontend_file`, `patch_file` |
-| `app-token-selected-large-frontend-context` | 37,516 | 358 | 37,874 | 3 | `list_files`, `get_frontend_file`, `patch_file` |
-| `app-token-selected-large-backend-context` | 49,995 | 9,708 | 59,703 | 3 | `list_files`, `get_backend_runnable`, `set_backend_runnable` |
-| `app-token-many-datatable-context` | 43,493 | 536 | 44,029 | 3 | `list_files`, `get_frontend_file`, `patch_file` |
-| `app-token-large-datatable-discovery` | 24,193 | 2,043 | 26,236 | 4 | `list_datatables`, `list_files`, `get_frontend_file`, `set_frontend_file` |
-
-Aggregate token usage:
-
-```json
-{
-  "totalTokenUsage": {
-    "prompt": 192713,
-    "completion": 13070,
-    "total": 205783
-  },
-  "averageTokenUsagePerAttempt": {
-    "prompt": 38542.6,
-    "completion": 2614,
-    "total": 41156.6
-  }
-}
-```
-
-Comparison against the metadata-only `list_files` run (`results/app-token-after-list-files.json`):
-
-| Case | `list_files` total | Datatable-tools total | Delta | Delta % | Prompt delta |
-|---|---:|---:|---:|---:|---:|
-| `app-token-baseline-large-app-small-edit` | 41,442 | 37,941 | -3,501 | -8.4% | -3,504 |
-| `app-token-selected-large-frontend-context` | 41,442 | 37,874 | -3,568 | -8.6% | -3,504 |
-| `app-token-selected-large-backend-context` | 63,225 | 59,703 | -3,522 | -5.6% | -3,516 |
-| `app-token-many-datatable-context` | 47,465 | 44,029 | -3,436 | -7.2% | -3,497 |
-| `app-token-large-datatable-discovery` | 136,691 | 26,236 | -110,455 | -80.8% | -107,414 |
-
-Aggregate comparison:
-
-| Metric | `list_files` | Datatable tools | Delta | Delta % |
-|---|---:|---:|---:|---:|
-| Prompt tokens | 314,148 | 192,713 | -121,435 | -38.7% |
-| Completion tokens | 16,117 | 13,070 | -3,047 | -18.9% |
-| Total tokens | 330,265 | 205,783 | -124,482 | -37.7% |
-
-Compared to the post-rebase / PR #8922 run, the datatable-tools run is **-146,014 total tokens** (**-41.5% total**). Compared to the original baseline above, it is **-174,555 total tokens** (**-45.9% total**).
-
-Interpretation:
-
-- Removing the full datatable SDK reference from the default prompt saved about 3.5k prompt tokens in every case.
-- The large datatable discovery case improved dramatically because the model used `list_datatables()` table-name metadata instead of loading full schemas.
-- The small datatable-context edit is still higher than the post-rebase / PR #8922 run because selected file identifiers are not yet injected, so the model still discovers and reads `/index.tsx` before patching.
-- A future context-manager-backed selected file/runnable flow should add cheap selected identifiers when that UX is ready, so selected-file tasks can skip `list_files()` without reintroducing implicit source-content bloat.
diff --git a/docs/failing-tests.md b/docs/failing-tests.md
deleted file mode 100644
index d0ae44f1095d1..0000000000000
--- a/docs/failing-tests.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Failing Tests
-
-This file tracks benchmark cases that still fail or need follow-up validation.
-
-## Flow
-
-- `flow-test6-ai-agent-tools`
-  Latest failing run: `ai_evals/results/2026-04-09T11-25-24.107Z__flow`
-  Issues:
-  final output does not include the actions or tool-result details the prompt asks for
-  `open_support_ticket` contains a syntax bug
-
-- `flow-test7-simple-modification`
-  Latest failing run: `ai_evals/results/2026-04-09T11-25-24.107Z__flow`
-  Issues:
-  `validate_data` was added, but the failure behavior still does not match the requested contract
-  `save_results` throws instead of returning a graceful structured result
-
-- `flow-test11-preprocessor-and-failure-handler`
-  Latest failing run: `ai_evals/results/2026-04-09T11-25-24.107Z__flow`
-  Issues:
-  the model creates regular `preprocessor` and `failure` modules
-  it does not use Windmill's special top-level `preprocessor_module` and `failure_module`
-
-## Needs Reconfirmation
-
-- `flow-test4-order-processing-loop`
-  Full-suite failing run: `ai_evals/results/2026-04-09T11-25-24.107Z__flow`
-  Follow-up passing run after prompt improvement: `ai_evals/results/2026-04-09T13-29-15.877Z__flow`
-  Note:
-  this case failed on invalid `branchone` downstream result access
-  it passed after adding explicit branch-output guidance to the flow prompt
-  rerun the full flow suite to confirm the fix holds in the broader benchmark
diff --git a/docs/system-prompt-testing-plan.md b/docs/system-prompt-testing-plan.md
deleted file mode 100644
index 9b12f1c5e0c84..0000000000000
--- a/docs/system-prompt-testing-plan.md
+++ /dev/null
@@ -1,1000 +0,0 @@
-# System Prompt And Skill Output Testing Plan
-
-Historical note:
-
-- This file is a planning document and no longer matches the current benchmark CLI in every detail.
-- The current source of truth is [ai_evals/README.md](/home/farhad/windmill__worktrees/prompt-testing-plan/ai_evals/README.md) and [system-prompt-testing-status.md](/home/farhad/windmill__worktrees/prompt-testing-plan/docs/system-prompt-testing-status.md).
-- In particular, the current tool no longer has the old variants, compare, or history workflow described below.
-
-## Goal
-
-Build a single testing strategy that answers one question reliably:
-
-> Given a user task, how good is the artifact produced by our AI system?
-
-This plan is intentionally focused on **black-box output evaluation**, not on unit testing frontend or CLI internals.
-
-The intended end state is a **new repo-level benchmark CLI** that runs a shared
-eval suite across multiple surfaces.
-
-That benchmark CLI should be the main entrypoint for:
-
-- running one case
-- running a benchmark set
-- comparing baseline vs candidate variants
-- writing benchmark history snapshots
-
-Frontend and Windmill CLI are not meant to become separate testing products.
-They should be implemented as adapters behind this shared benchmark CLI.
-
-The system under test is:
-
-- Frontend AI Chat in `script`, `flow`, and `app` modes
-- CLI local development experience driven by generated guidance and skills
-
-The artifact under test is:
-
-- Script code
-- Flow JSON / module structure
-- Raw app files and backend runnables
-- Files and project artifacts produced in a local CLI workspace
-
-## Non-Goals
-
-This plan does **not** treat the following as the main testing target:
-
-- Unit testing helper functions, stores, or tool wrapper internals
-- UI rendering behavior, DOM interactions, or component-level correctness
-- `wmill init` correctness as a standalone product area
-- Backend route correctness except where it affects prompt delivery or AI configuration
-
-Those may still need lightweight tests, but they are not the core of prompt reliability evaluation.
-
-## Core Principles
-
-### 1. Black-box evaluation only
-
-The runner should provide an input task to the real system setup, let it run, collect the final artifact, and score the result.
-
-In practice, this runner should be exposed through the new repo-level benchmark
-CLI rather than through separate ad hoc test commands for each surface.
-
-### 2. Headless execution
-
-Frontend evaluation must be fully decoupled from the browser UI. It should exercise prompt assembly, tool selection, and tool execution logic without mounting Svelte components or clicking through the app.
-
-### 3. Real prompt environment
-
-All evals must use the same prompt-building path, tool definitions, and skill content that production uses, or a clearly defined variant of them.
-
-### 4. Artifact-first scoring
-
-The main score is based on the produced artifact, not on intermediate transcripts.
-
-### 5. Reliability over one-off success
-
-A prompt is not "good" because it passed once. Reliability means pass rate across repeated runs and across a representative case set.
-
-### 6. Track benchmark history over time
-
-The suite must not only evaluate the current output. It must also produce a
-git-tracked benchmark history so the team can see whether the system is
-improving over time.
-
-This history should focus on official benchmark snapshots, not on every local
-experiment.
-
-### 7. Shared corpus, separate adapters
-
-Frontend and CLI should share the same evaluation corpus format when possible, but each surface should have its own execution adapter.
-
-### 8. CLI first, frontend second
-
-The CLI should be the first surface brought to a high-confidence benchmark
-state.
-
-It is the cleanest foundation for the suite because it produces direct files in
-an isolated workspace, has less ambiguity than the frontend, and is easier to
-score deterministically.
-
-Frontend should reuse the benchmark model proven on the CLI rather than define
-a parallel testing philosophy.
-
-### 9. UI comes last
-
-The testing suite must exist and be trustworthy before building a studio UI on top of it.
-
-## Current State
-
-## Shared Prompt Source Of Truth
-
-The repo already has the right content split:
-
-- `system_prompts/` is the shared source of truth for core Windmill prompt content
-- frontend adds chat-specific tool instructions on top
-- CLI materializes guidance and skill content from generated outputs
-
-This is a strong foundation for a shared eval suite.
-
-## Execution Priority
-
-Even though the repo already has useful frontend eval scaffolding, the
-implementation priority should be:
-
-1. build the repo-level benchmark CLI and use the Windmill CLI adapter as the
-   first implementation behind it
-2. make the CLI artifact-evaluation path excellent
-3. stabilize shared scoring, reporting, and benchmark history around that path
-4. bring frontend onto the same benchmark model through the same benchmark CLI
-5. build the UI only after the underlying suite is trustworthy
-
-This keeps the hardest product question focused on artifact quality rather than
-on UI workflow.
-
-## Benchmark CLI As The Main Product
-
-The testing suite should have one primary interface:
-
-- a new repo-level benchmark CLI
-
-The benchmark CLI should be able to run:
-
-- Windmill CLI evals
-- frontend evals
-- shared reporting and comparison commands
-
-Illustrative command shape:
-
-```bash
-ai-evals run --surface cli --case bun-hello-script
-ai-evals run --surface frontend-flow --case support-flow
-ai-evals compare --surface cli --variant baseline --variant candidate-a
-ai-evals history latest
-```
-
-The exact binary name can change, but the architecture should not:
-
-- one benchmark CLI
-- shared case loader
-- shared scoring
-- shared history writer
-- separate surface adapters underneath
-
-## Temporary Bootstrap Code
-
-This bootstrap phase is now complete for frontend `flow`, `app`, and `script`.
-
-Frontend AI benchmark ownership has moved into `ai_evals/`, and the frontend
-source tree no longer owns a separate AI benchmark suite under
-`frontend/.../__tests__/...`.
-
-Benchmark authors should only need the repo-level benchmark CLI to run the
-long-term suite.
-
-The only temporary frontend-specific piece that remains is a thin Vitest/Vite
-loader bridge so the benchmark runner can import the production chat modules in
-the same module/runtime environment they already expect.
-
-## Frontend: What Exists Today
-
-The current frontend benchmark path is **decoupled from the UI** and now owned
-by `ai_evals`.
-
-The frontend evals currently:
-
-- run through the shared headless chat loop
-- use production prompt builders
-- use production tool definitions
-- use benchmark-owned helper adapters that write to temp workspaces on disk
-- execute through the frontend module/runtime environment only as a loader bridge
-
-This means the current frontend evals are now a proper benchmark adapter,
-not a frontend test suite.
-
-That is the correct direction.
-
-### Frontend Architecture Notes
-
-There are three categories of code involved:
-
-- shared production logic:
-  - production system prompt builders
-  - production tool definitions
-  - production `runChatLoop`
-- benchmark-only infrastructure:
-  - case loading
-  - variant loading
-  - judge scoring
-  - benchmark result shaping
-  - history/reporting integration
-- alternate helper adapters:
-  - production helpers mutate UI/editor state
-  - benchmark helpers mutate temp-workspace files
-
-This is important because the benchmark suite is **not** meant to duplicate the
-frontend chat logic. It is meant to reuse the production chat loop and tool
-definitions while swapping the execution backend from UI state to filesystem
-state.
-
-## Frontend: What Is Missing
-
-### Coverage gaps
-
-- `script` is now exposed through the shared benchmark CLI, but it only has initial case coverage.
-- Existing frontend coverage is still too small relative to the target benchmark corpus.
-
-### Reliability gaps
-
-- Frontend flow and app can already run with pass/fail results and repeated runs through the shared benchmark CLI.
-- The remaining gap is turning that into stronger routine reliability gating with better deterministic validators and broader routine case coverage.
-- Frontend reliability reporting is still less mature than the intended end state for official CI tiers and richer failure triage.
-
-### Prompt-iteration gaps
-
-- Frontend prompt variants are file-backed now, but the repo only ships baseline manifests by default.
-- Creating and curating meaningful frontend candidate variants is still a mostly manual workflow compared with the CLI snapshot flow.
-- Frontend prompt comparison exists through the shared `compare` command, but it still needs broader routine use and better variant coverage.
-
-### Artifact-validation gaps
-
-- The current flow and app helpers are file-backed now, but several effects are still lightweight and should become more realistic over time.
-- Linting and runnable validation are currently too lightweight in the eval path.
-- Datatable interactions are mocked rather than validated as output constraints.
-- The suite does not yet enforce a strong deterministic validator layer before using an LLM judge.
-
-### Corpus gaps
-
-- Frontend surfaces already use shared case manifests under `ai_evals/cases/frontend/`.
-- The remaining gap is breadth and representativeness, not the absence of a shared corpus.
-- Cases still need richer metadata, stronger deterministic constraints, and a larger regression library built from real failures.
-
-### Reporting gaps
-
-- Frontend runs already emit the shared benchmark result shape and can write official history snapshots through the shared benchmark CLI.
-- There is still no rich leaderboard or trend-oriented debugging workflow for frontend surfaces specifically.
-- There is still no strong "worst failures first" report for debugging regressions.
-
-## Frontend: Perfect Testing Logic
-
-The perfect frontend testing logic is described below.
-
-Frontend should not be the place where the benchmark philosophy is invented.
-
-It should consume the shared case format, validator model, reporting format,
-and history format already proven through the CLI path.
-
-### 1. Stay fully headless
-
-Do not mount the chat UI.
-
-Do not click through the frontend.
-
-Do not use Playwright for prompt evaluation.
-
-The runner should directly invoke:
-
-- the production system message builder
-- the production user message builder
-- the production tool list
-- the production chat loop
-
-It is acceptable for the benchmark adapter to use the frontend Vitest/Vite
-runtime as a thin loader bridge when production chat modules still depend on
-that environment, as long as:
-
-- the benchmark entrypoint remains the shared benchmark CLI
-- the benchmark logic and fixtures live under `ai_evals`
-- the frontend source tree does not own a separate benchmark suite
-
-This keeps the suite decoupled from the frontend UI while still testing the real AI logic.
-
-### 2. Test the three frontend AI surfaces separately
-
-#### Script mode
-
-Input:
-
-- user prompt
-- optional initial script
-- optional context such as selected workspace runnables or DB references
-
-Output:
-
-- final script code
-
-Scoring:
-
-- deterministic validators first
-- LLM judge second
-
-Deterministic validators should include:
-
-- expected entrypoint present
-- syntax / parse validity
-- language-appropriate compile or lint check where feasible
-- required behaviors or structures present
-- forbidden patterns absent
-
-#### Flow mode
-
-Input:
-
-- user prompt
-- optional initial flow
-- optional schema
-- optional workspace context
-
-Output:
-
-- final flow definition
-
-Scoring:
-
-- flow JSON is structurally valid
-- expected module types exist
-- expected branches / loops / tools exist
-- schema shape matches required inputs
-- required data flow connections are present
-- LLM judge scores completeness and overall quality
-
-#### App mode
-
-Input:
-
-- user prompt
-- optional initial app
-- optional workspace context
-
-Output:
-
-- final frontend files
-- final backend runnables
-
-Scoring:
-
-- expected files and runnables exist
-- file structure is coherent
-- app bundle / lint checks pass where feasible in headless mode
-- required UI/backend behaviors are represented in the artifact
-- LLM judge scores completeness and product quality
-
-### 3. Use repeated runs, not single runs
-
-Each case should run more than once.
-
-Recommended starting point:
-
-- PR smoke run: 2 runs per case on a small curated subset
-- nightly reliability run: 5 to 10 runs per case on the full benchmark set
-
-Primary metric:
-
-- pass rate
-
-Secondary metrics:
-
-- average deterministic score
-- average judge score
-- worst-case judge score
-- latency
-- total tool calls
-
-### 4. Keep tool traces as diagnostics only
-
-Tool usage matters for debugging, but it should not be the primary score.
-
-The suite should record:
-
-- tool names
-- tool arguments
-- iteration count
-- model/provider
-
-But the main question remains:
-
-> Was the final artifact good?
-
-### 5. Make prompt variants easy to test
-
-Prompt candidates should not require editing test code.
-
-The suite should support a file-based prompt variant workflow.
-
-Example direction:
-
-- `ai_evals/variants/frontend/script/baseline.md`
-- `ai_evals/variants/frontend/script/candidate-a.md`
-- `ai_evals/variants/frontend/flow/baseline.md`
-- `ai_evals/variants/frontend/app/baseline.md`
-
-Each variant should be runnable side by side against the same case set.
-
-### 6. Separate benchmark cases from test code
-
-Benchmark cases should live in data files, not inline in test files.
-
-Each case should define:
-
-- surface
-- user prompt
-- initial artifact if any
-- required constraints
-- forbidden constraints
-- judge rubric
-- tags
-
-This makes the benchmark editable by prompt authors without changing runner logic.
-
-## CLI: What Exists Today
-
-The current CLI tests prove only one narrow property:
-
-> Given a prompt, does the model invoke the expected skill?
-
-That is useful as a smoke signal, but it is far from sufficient for output evaluation.
-
-The current CLI setup also depends on manual preparation of a `.claude/skills` folder, which makes repeated benchmarking and prompt iteration much harder than necessary.
-
-## CLI: What Is Missing
-
-### Output-evaluation gap
-
-- The current suite does not score the artifact produced by the CLI workflow.
-- It only checks whether a skill was invoked.
-- It does not verify that the resulting files are good.
-
-### Automation gap
-
-- The current setup requires manual copying of generated skills into a test folder.
-- That makes the suite too fragile and too manual for rapid prompt iteration.
-
-### Reliability gap
-
-- There is no repeated-run measurement.
-- There is no pass-rate metric.
-- There is no baseline vs candidate comparison workflow.
-
-### Prompt-variant gap
-
-- There is no first-class way to test alternate skill bundles or alternate generated guidance.
-- There is no clean candidate flow for "I changed skill content, show me whether reliability improved."
-
-### Corpus gap
-
-- CLI cases are not aligned with frontend benchmark cases.
-- There is no shared benchmark language describing the task, initial state, and expected artifact.
-
-### Reporting gap
-
-- There is no stable output report for artifact comparison.
-- There is no failure clustering by skill bundle, task family, or model.
-
-## CLI: Perfect Testing Logic
-
-The perfect CLI testing logic is described below.
-
-This should be the reference implementation for the suite.
-
-### 1. Evaluate the final artifact, not the skill invocation
-
-Skill invocation should be kept as diagnostic metadata only.
-
-The primary output should be the files produced in a temporary workspace.
-
-Example CLI artifacts:
-
-- generated script files
-- generated flow files
-- raw app project files
-- schedule / trigger config files
-- AGENTS / guidance files only when they are directly relevant to the task
-
-### 2. Create the workspace automatically
-
-The runner should create a fresh temporary project for every case.
-
-It should seed that workspace with:
-
-- initial files for the benchmark case
-- the current generated CLI guidance and skills
-- any fixture data required by the task
-
-It should never depend on a manually maintained test folder.
-
-### 3. Materialize the exact skill bundle under test
-
-The runner should be able to test:
-
-- the current production skill bundle
-- a candidate skill bundle built from prompt changes
-
-For CLI, a "prompt variant" is effectively a skill-bundle variant.
-
-That means the suite should support alternate generated skill content without requiring ad hoc manual copies.
-
-### 4. Score the final workspace
-
-The scoring approach should match the frontend philosophy:
-
-- deterministic validators first
-- LLM judge second
-
-Deterministic validators for CLI should include:
-
-- expected files created
-- expected file names and locations
-- required content patterns present
-- expected artifact type produced
-- optional parse / lint / compile validation where feasible
-
-### 5. Run repeated benchmarks
-
-The CLI should use the same reliability logic as frontend:
-
-- benchmark set
-- repeated runs
-- pass rate
-- baseline vs candidate comparison
-
-### 6. Keep skill traces as diagnostics
-
-Record:
-
-- invoked skills
-- order of invocation
-- turns
-- file changes
-
-But do not let that replace artifact evaluation.
-
-## Perfect Shared Benchmark Model
-
-The frontend and CLI should share the same benchmark concept.
-
-Each evaluation case should define:
-
-- `id`
-- `surface`
-- `user_prompt`
-- `initial_state`
-- `workspace_context`
-- `artifact_checks`
-- `judge_rubric`
-- `tags`
-
-The same task should be runnable on multiple surfaces when it makes sense.
-
-This gives direct comparability between:
-
-- frontend script vs CLI script
-- frontend flow vs CLI flow
-- frontend app vs CLI app
-
-## Recommended Benchmark Categories
-
-The first benchmark set should be broad, but not huge.
-
-Recommended initial size:
-
-- 20 to 30 core cases
-
-Recommended categories:
-
-- from-scratch script creation
-- script modification
-- from-scratch flow creation
-- flow modification
-- from-scratch raw app creation
-- raw app modification
-- reuse of workspace assets
-- tasks requiring datatable awareness
-- tasks requiring constraints or edge-case handling
-- known regressions from real failures
-
-Every category should contain both:
-
-- "easy success" cases
-- "high ambiguity" cases
-
-This is essential for measuring reliability rather than only measuring best-case demos.
-
-## Scoring Model
-
-The suite should use three layers.
-
-## Layer 1: Deterministic Validators
-
-This is the hard gate.
-
-Examples:
-
-- parse succeeds
-- artifact shape is valid
-- required entrypoint exists
-- expected files exist
-- required module types exist
-- expected inputs / schema fields exist
-- forbidden patterns are absent
-
-If layer 1 fails, the run is a failure.
-
-## Layer 2: Task-Specific Validators
-
-These are stronger artifact checks derived from the benchmark case.
-
-Examples:
-
-- flow contains a loop and a conditional branch
-- app includes a reset button path and backend wiring
-- script performs the requested transformation
-
-These should still be deterministic whenever possible.
-
-## Layer 3: LLM Judge
-
-Use an LLM judge only after deterministic validation.
-
-The judge should answer:
-
-- Did the artifact satisfy the request?
-- Is it complete?
-- Is it coherent for Windmill?
-- How close is it to the intended solution?
-
-The judge score is valuable, but it should not be the only oracle.
-
-## Benchmark History
-
-The suite should persist official benchmark summaries in a git-tracked history
-layer so improvements and regressions can be reviewed over time.
-
-## What Should Be Git-Tracked
-
-Only official benchmark outputs should be committed:
-
-- post-merge benchmark snapshots on `main`
-- scheduled nightly benchmark snapshots
-- manually promoted benchmark snapshots when the team wants to record a result
-
-Each official snapshot should produce:
-
-- one detailed run JSON
-- one entry in an append-only summary file
-- regenerated rollups for trend views
-
-## What Should Not Be Git-Tracked
-
-The following should remain local or external by default:
-
-- raw transcripts
-- full model messages
-- large generated artifact bundles
-- ad hoc local experiments
-- temporary comparison runs
-
-This keeps git history focused on stable benchmark signals instead of noisy
-debug output.
-
-## Reliability Metrics
-
-Every prompt or skill candidate should be reported with:
-
-- total cases
-- passes
-- pass rate
-- average judge score
-- median judge score
-- worst-case judge score
-- average latency
-- average turns
-
-Per-case results should also be retained.
-
-This is the minimum needed to compare:
-
-- baseline vs candidate
-- provider vs provider
-- frontend vs CLI
-
-## Benchmark Metrics
-
-The history layer should track metrics in four groups.
-
-## Quality Metrics
-
-- `pass_rate`
-- `deterministic_pass_rate`
-- `judge_score_mean`
-- `judge_score_median`
-- `judge_score_p10`
-- `category_pass_rate`
-
-## Reliability Metrics
-
-- `runs_per_case`
-- `flake_rate`
-- `path_consistency`
-
-## Efficiency Metrics
-
-- `latency_ms_mean`
-- `latency_ms_median`
-- `tokens_prompt_mean`
-- `tokens_completion_mean`
-- `tokens_total_mean`
-- `tool_calls_mean`
-- `iterations_mean`
-- `estimated_cost_mean`
-- `cost_per_success`
-- `latency_per_success`
-
-## Provenance Metrics
-
-- `timestamp`
-- `git_sha`
-- `suite_version`
-- `scoring_version`
-- `surface`
-- `variant_name`
-- `provider`
-- `model`
-- `judge_model`
-
-The provenance metrics are essential. Without them, a trend line can mix prompt
-changes with upstream model drift and become hard to interpret.
-
-## Composite Scores
-
-The suite should not collapse everything into one number.
-
-It should track at least three top-level composite scores:
-
-- `quality_score`
-- `efficiency_score`
-- `value_score`
-
-Recommended interpretation:
-
-- `quality_score`: how good the artifact is
-- `efficiency_score`: how fast and cheap the system is relative to peers
-- `value_score`: quality-adjusted efficiency
-
-These composite scores should sit on top of the raw metrics, not replace them.
-
-## Proposed Suite Architecture
-
-The suite should be built in six layers.
-
-## Layer 1: Benchmark Data
-
-Purpose:
-
-- define the cases once
-
-Contents:
-
-- case files
-- reusable initial fixtures
-- evaluation metadata
-
-## Layer 2: Benchmark CLI
-
-Purpose:
-
-- provide one shared entrypoint for the suite
-
-Responsibilities:
-
-- load cases and variants
-- select a surface adapter
-- run one case or a benchmark set
-- invoke shared scoring and history writing
-- expose comparison and history commands
-
-## Layer 3: Surface Adapters
-
-Purpose:
-
-- run a case against one surface
-
-Adapters:
-
-- frontend-script adapter
-- frontend-flow adapter
-- frontend-app adapter
-- CLI adapter
-
-Responsibilities:
-
-- prepare the correct prompt environment
-- prepare the initial artifact state
-- run the real model loop
-- return the final artifact plus diagnostics
-
-## Layer 4: Scoring And Reporting
-
-Purpose:
-
-- evaluate the final artifact
-- aggregate repeated runs
-- compare variants
-
-Responsibilities:
-
-- deterministic validation
-- LLM judging
-- pass/fail computation
-- result serialization
-- comparison reports
-
-## Layer 5: Benchmark History
-
-Purpose:
-
-- preserve official benchmark summaries over time
-- support trend analysis and regression review
-
-Responsibilities:
-
-- store official run snapshots
-- append benchmark summary entries
-- generate rollups for charts and dashboards
-- keep provenance metadata for every tracked run
-
-## Layer 6: UI Studio
-
-Purpose:
-
-- provide a user interface for the exact same benchmark CLI and runner stack
-
-Important rule:
-
-The UI must not define its own execution semantics.
-
-It must only be a frontend over the same suite used in CI and local benchmarking.
-
-## Proposed Development Order
-
-### Phase 1: Stabilize the benchmark model
-
-Deliverables:
-
-- shared case schema
-- shared result schema
-- initial core benchmark set
-
-### Phase 2: Build the benchmark CLI shell
-
-Deliverables:
-
-- repo-level benchmark CLI entrypoint
-- `run`, `compare`, and `history` command skeletons
-- adapter selection layer
-- temporary wiring to the first CLI adapter
-
-### Phase 3: Replace the CLI smoke suite with real artifact evaluation
-
-Deliverables:
-
-- temp-workspace runner
-- automatic skill-bundle materialization
-- artifact scoring
-- repeated-run support
-- baseline vs candidate skill-bundle comparison
-
-### Phase 4: Add shared reporting and benchmark history around the CLI path
-
-Deliverables:
-
-- baseline vs candidate reports
-- pass-rate summaries
-- worst-failure reports
-- official run schema
-- git-tracked benchmark summary file
-- history snapshot writer
-- rollup generation for trend charts
-
-### Phase 5: Finish the frontend black-box harness on top of the shared model
-
-Deliverables:
-
-- convert current flow and app evals into proper scored reliability tests
-- add script eval support
-- add repeated-run support
-- add prompt-variant loading from files
-- align frontend outputs with the shared result and history format
-- expose frontend runs through the same benchmark CLI
-
-### Phase 6: Add CI tiers
-
-Deliverables:
-
-- fast PR smoke benchmark
-- fuller nightly benchmark
-- official history updates on `main` and scheduled runs
-- manual benchmark mode for prompt authors
-
-### Phase 7: Build the UI studio
-
-Deliverables:
-
-- run selector
-- variant selector
-- per-case comparison view
-- artifact diff view
-- reliability dashboard
-- trend dashboard backed by git-tracked benchmark history
-
-This phase comes last because the UI is only valuable once the underlying suite is stable and trusted.
-
-## Proposed Prompt Variant Workflow
-
-The suite should make it cheap to test new prompt candidates.
-
-Recommended workflow:
-
-1. Edit or add a candidate prompt file.
-2. Run the benchmark against baseline and candidate.
-3. Compare pass rate and score.
-4. Inspect worst regressions first.
-5. Promote only if the candidate improves the benchmark materially.
-
-For CLI, the same workflow applies, but the tested unit is the generated skill bundle rather than a single chat system prompt.
-
-## Suggested Repository Direction
-
-This plan does not require the UI studio to exist first.
-
-A reasonable repo structure would be:
-
-```text
-ai_evals/
-  cli/
-  cases/
-  fixtures/
-  history/
-    runs/
-    rollups/
-  variants/
-    frontend/
-      script/
-      flow/
-      app/
-    cli/
-  results/        # gitignored
-  scripts/
-  adapters/
-  scoring/
-  reports/
-```
-
-The exact folder names can change, but the architectural split should remain.
-
-## What "Done" Looks Like
-
-This project is successful when all of the following are true:
-
-- one repo-level benchmark CLI is the primary way to run prompt evals
-- frontend prompt behavior is tested headlessly and independently from the UI
-- CLI local-dev behavior is tested by evaluating the final files it produces
-- benchmark cases are shared where possible between frontend and CLI
-- prompt and skill candidates can be tested without editing test code
-- reliability is reported as pass rate over repeated runs
-- baseline vs candidate comparisons are easy to run and inspect
-- the UI studio is only a thin interface over the same trusted runner
-
-## Final Recommendation
-
-The current frontend evals should be treated as a useful starting point, not the finished solution.
-
-They already prove that the repo can test AI behavior without coupling to the browser UI.
-
-The main work now is:
-
-- build the repo-level benchmark CLI as the durable entrypoint
-- replace CLI invocation checks with artifact evaluation
-- make the CLI path the reference benchmark implementation
-- unify frontend under that same benchmark model
-- make frontend evals complete and reliability-oriented only after the shared
-  scoring model is stable
-- build the UI only after the suite is strong enough to stand on its own
diff --git a/docs/system-prompt-testing-status.md b/docs/system-prompt-testing-status.md
deleted file mode 100644
index 86beaadc892d1..0000000000000
--- a/docs/system-prompt-testing-status.md
+++ /dev/null
@@ -1,140 +0,0 @@
-# System Prompt Testing Status
-
-This document describes the benchmark tool that exists today. It is the current
-truth for `ai_evals/`.
-
-The longer planning document in
-[system-prompt-testing-plan.md](system-prompt-testing-plan.md)
-still contains useful background, but parts of its workflow are now historical
-because the old variants/history system was removed.
-
-## Current Tool
-
-There is one repo-level benchmark CLI under `ai_evals/` with three commands:
-
-- `bun run cli -- models`
-- `bun run cli -- cases [mode]`
-- `bun run cli -- run <mode> [caseIds...]`
-
-Supported modes:
-
-- `cli`
-- `flow`
-- `script`
-- `app`
-
-Public `run` options:
-
-- `--runs <n>`
-- `--output <path>`
-- `--model <alias>`
-- `--verbose`
-- `--record`
-
-There is no variant workflow and no compare command in the current tool.
-Tracked history is intentionally minimal: `run --record` appends one compact
-summary line to `ai_evals/history/<mode>.jsonl`. This is only allowed for
-full-suite runs, not selected case ids. History lines include average token
-usage when the benchmark mode reports it, plus average judge score and per-case
-duration/judge/token usage summaries.
-
-## How It Works
-
-Each attempt runs:
-
-1. the current production prompts, tools, and guidance from this checkout
-2. deterministic validation
-3. LLM judging
-
-Results are written locally under `ai_evals/results/` as:
-
-- a summary JSON file
-- a sibling artifacts directory containing the generated flow/script/app/workspace
-
-If `--record` is used, the CLI also appends a compact JSONL summary line to the
-tracked file for that mode under `ai_evals/history/`.
-
-## Current Architecture
-
-- `ai_evals/cases/`: one YAML manifest per mode
-- `ai_evals/fixtures/`: initial and expected fixtures
-- `ai_evals/core/`: shared case loading, model resolution, validation, judging, and result writing
-- `ai_evals/history/`: optional tracked pass-rate history written by `run --record`, one JSONL file per mode
-- `ai_evals/modes/`: one runner per mode
-
-Execution model:
-
-- `flow`, `script`, and `app` reuse the production frontend chat loop and production tool definitions through the frontend Vitest bridge
-- `cli` creates a temp workspace, writes the current checkout guidance into it, and runs the Anthropic agent SDK against that workspace
-
-## Case Model
-
-Each case is intentionally small:
-
-- `prompt`
-- optional `initial`
-- optional `expected`
-- optional `validate`
-- optional `cliExpect`
-
-`validate` is mainly used for stronger deterministic checks where exact fixture
-matching would be too strict, especially for `flow` creation cases.
-
-`cliExpect` is used by CLI-mode cases to assert agent behavior deterministically,
-including:
-
-- required or forbidden skills
-- skills invoked before the first file mutation
-- ordered `wmill` command proposals in the assistant response
-- forbidden attempted `wmill` executions
-- read-only guidance cases where the workspace must stay unchanged
-
-Examples of current deterministic checks:
-
-- schema contains one of several accepted input shapes
-- `results.*` references resolve
-- required code/input characteristics exist in some module
-- expected workspace files are created in `cli` mode
-- expected CLI skills and proposed `wmill` commands are observed in `cli` mode
-
-## Model Selection
-
-Model aliases are resolved through a shared registry in `ai_evals/core/models.ts`.
-
-Current aliases:
-
-- `haiku`
-- `sonnet`
-- `opus`
-- `4o`
-
-Notes:
-
-- the `models` command also shows accepted alias spellings such as `gpt-4o` and `claude-opus-4.6`
-- frontend modes can use Anthropic and OpenAI-backed aliases
-- `cli` mode is Anthropic-only because it runs through the Anthropic agent SDK
-- the judge model is separate and currently defaults to `claude-sonnet-4-6`
-
-## What Is Working Well
-
-- one simple local benchmark CLI
-- real production execution paths instead of synthetic prompt variants
-- local result and artifact persistence by default
-- live frontend progress output
-- reusable flow/script/app/cli runners under one tool
-- deterministic validation can now catch real runtime-invalid flow wiring
-
-## What Still Needs Work
-
-- broader case coverage across all four modes
-- stronger deterministic validators for more cases, especially app/script semantics
-- clearer per-case validation metadata as the corpus grows
-- CI automation for smoke and nightly runs
-
-## Recommended Next Focus
-
-The next high-value work is:
-
-1. add more realistic benchmark cases
-2. keep simplifying deterministic validators so they check correctness, not one exact implementation
-3. add CI only after the local benchmark signal is trustworthy