diff --git a/docs/app-mode-ai-chat-review.md b/docs/app-mode-ai-chat-review.md index 51833d4f9acef..eacfba08cbcbf 100644 --- a/docs/app-mode-ai-chat-review.md +++ b/docs/app-mode-ai-chat-review.md @@ -1,354 +1,48 @@ # App Mode AI Chat Review -## Purpose +This note only tracks the highest-value next steps for making app-mode AI chat +safer and more efficient. -This document reviews the current app-mode AI chat design with a focus on: +## Recommended Next Steps -- keeping prompts and context as small as possible; -- requiring user confirmation for important actions; -- making datatable integration smooth and safe for users. +1. Add confirmation for dangerous app tools. -## Short verdict + Require explicit user confirmation before file writes, file deletes, backend + runnable writes, backend runnable deletes, and datatable SQL execution. Show a + useful diff or exact SQL before applying the action. -The app-mode AI chat has a solid foundation: mode-specific helpers, explicit `@` context, app snapshots/revert, datatable whitelisting, and generic confirmation UI already exist. +2. Enforce datatable SQL safety in code. -However, it is not yet optimal for minimal context and user-safe automation: + Do not rely on prompt instructions for SQL safety. Classify statements before + execution, block DDL unless table creation is allowed, and require + confirmation for DDL, DML, and row-returning reads that would expose data back + to the model. -1. **Context is still too large by default**, especially the app system prompt, broad file-discovery guidance, full datatable schemas, and persistent `@` context. (`get_files()` has since been replaced by metadata-only `list_files()`.) -2. **Important app/datatable actions are not consistently confirmed**. The confirmation infrastructure exists, but app tools mostly bypass it. -3. **Datatables UX is promising but has rough edges**: stale cached table context, weak SQL safety, policy persistence issues, and too-heavy full-schema fetching. 
+3. Keep default context demand-driven. -## Relevant files + Prefer selected context and targeted reads before broad discovery. Keep file + listings metadata-only, avoid sending full datatable schemas by default, and + keep SDK/reference material out of the base prompt unless it is requested or + needed for the task. -### AI chat orchestration +4. Improve app context lifecycle. -- `frontend/src/lib/components/copilot/chat/AIChatManager.svelte.ts` -- `frontend/src/lib/components/copilot/chat/chatLoop.ts` -- `frontend/src/lib/components/copilot/chat/shared.ts` -- `frontend/src/lib/components/copilot/chat/AIChat.svelte` -- `frontend/src/lib/components/copilot/chat/AIChatDisplay.svelte` -- `frontend/src/lib/components/copilot/chat/AIChatInput.svelte` -- `frontend/src/lib/components/copilot/chat/ToolExecutionDisplay.svelte` + Treat `@` context as per-message by default, with an explicit pinning affordance + for context that should persist. Lazy-load file and runnable contents, and add + a visible approximate context-size indicator so users can spot prompt bloat. -### App mode +5. Refresh datatable context after mutations. -- `frontend/src/lib/components/copilot/chat/app/core.ts` -- `frontend/src/lib/components/copilot/chat/AppAvailableContextList.svelte` -- `frontend/src/lib/components/copilot/chat/ContextElementBadge.svelte` -- `frontend/src/lib/components/copilot/chat/DatatableCreationPolicy.svelte` + Refresh table metadata after data-panel changes and after AI-created tables so + follow-up tool calls and user-visible context do not use stale schema data. -### Raw app editor and datatables +6. Persist table creation policy explicitly. 
-- `frontend/src/lib/components/raw_apps/RawAppEditor.svelte` -- `frontend/src/lib/components/raw_apps/RawAppDataTableList.svelte` -- `frontend/src/lib/components/raw_apps/RawAppDataTableDrawer.svelte` -- `frontend/src/lib/components/raw_apps/DefaultDatabaseSelector.svelte` -- `frontend/src/lib/components/raw_apps/dataTableRefUtils.ts` -- `frontend/src/lib/components/raw_apps/datatableUtils.svelte.ts` -- `frontend/src/routes/(root)/(logged)/apps_raw/add/+page.svelte` -- `frontend/src/routes/(root)/(logged)/apps_raw/edit/[...path]/+page.svelte` + Store whether AI table creation is enabled as an explicit app setting instead + of inferring it from the presence of datatable configuration. -### Backend datatable APIs +7. Add focused eval coverage for these behaviors. -- `backend/windmill-api-workspaces/src/workspaces.rs` - - `list_datatables` - - `list_datatable_schemas` - - `get_datatable_schema` - - `edit_datatable_config` - -### System prompts - -- `system_prompts/README.md` -- `system_prompts/auto-generated/index.ts` -- `system_prompts/auto-generated/sdks/datatable-typescript.md` -- `system_prompts/auto-generated/sdks/datatable-python.md` - -## How app mode works today - -In the raw app editor, `RawAppEditor.svelte` initializes app-mode AI chat on mount: - -- calls `aiChatManager.saveAndClear()`; -- calls `aiChatManager.changeMode(AIMode.APP)`; -- registers app helpers through `aiChatManager.setAppHelpers(...)`. - -Those app helpers expose operations for: - -- frontend files; -- backend runnables; -- current selected editor context; -- linting; -- app snapshots and revert; -- datatable schema loading; -- SQL execution; -- app table whitelisting. - -When app mode is active, `AIChatManager.changeMode(AIMode.APP)` sets: - -- system prompt: `prepareAppSystemMessage(...)`; -- tools: `getAppTools()`; -- helpers: `appAiChatHelpers`. 
- -When the user sends a message, `prepareAppUserMessage(...)` builds the user prompt from: - -- current frontend/backend file selection, unless excluded; -- inspector-selected DOM element; -- editor code selection; -- additional `@`-mentioned context; -- the user instructions. - -`runChatLoop(...)` then sends the system message, history, user message, and tool definitions to the selected model. Tool calls go through `processToolCall(...)`, which supports confirmation only when a tool opts into `requiresConfirmation`. - -## Current app tools - -### Read and discovery tools - -These are generally safe without confirmation: - -- `list_files` -- `get_frontend_file` -- `get_backend_runnable` -- `get_selected_context` -- `lint` -- `search_workspace` -- `get_runnable_details` -- `search_hub_scripts` -- `list_datatables` -- `get_datatable_table_schema` - -### Mutating tools - -These currently execute directly in app mode: - -- `set_frontend_file` -- `patch_file` -- `delete_frontend_file` -- `set_backend_runnable` -- `delete_backend_runnable` -- `exec_datatable_sql` - -This is the biggest mismatch with the requirement that every important action should be confirmed by the user. - -## System prompt assessment - -The app system prompt is useful but heavier than ideal. - -### Strengths - -- Clearly explains raw app structure. -- Explains the frontend/backend runnable split. -- Encourages `patch_file` for small edits. -- Pushes datatables for persisted app storage. -- Explains that datatable DDL should go through `exec_datatable_sql`. -- Includes table creation policy context. - -### Concerns - -1. It always includes broad app-building instructions, even for small localized edits. -2. The previous prompt included the datatable SDK reference for both TypeScript and Python every time. This has since been removed; concise examples remain in the prompt. -3. 
The previous prompt told the model to start with `get_files()`, which encouraged loading all files even when selected context was sufficient. This is now improved by `list_files()`, but the prompt still needs to stay demand-driven. -4. It relies heavily on prompt instructions for datatable safety instead of enforcing safety in tools. -5. Custom workspace/user prompts are appended as `USER GIVEN INSTRUCTIONS`, which is flexible but can further increase context. - -### Recommendation - -The base app prompt should be shorter and more demand-driven: - -- Keep file discovery demand-driven: use selected and explicitly provided context first; call `list_files()` only when a broader metadata overview is needed. -- Keep full SDK details out of the default prompt; concise examples are usually enough. Add an on-demand SDK reference only if it does not cause unnecessary extra tool turns. -- Keep only minimal datatable rules in the base prompt: - - use datatables for persistence; - - call `list_datatables()` before schema work; - - DDL must use `exec_datatable_sql`; - - non-read SQL requires confirmation. - -## Additional context assessment - -The `@` context system is a good UX foundation. - -App mode exposes categories for: - -- frontend files; -- backend runnables; -- datatables. - -Selecting a datatable context includes its columns and also calls `addTableToWhitelist(...)`, adding the table to the app data panel. - -### Strengths - -- Context is explicit and user-controllable. -- Datatable table selection is naturally integrated into the chat input. -- Selected app file/runnable chips are visible and can be excluded. -- Inspector and code-selection context are compact and useful. - -### Concerns - -1. `@` context persists across messages until manually removed, which can silently bloat follow-up prompts. -2. Available app context currently includes file contents/runnable configs in memory before selection. -3. 
Each selected context item is truncated, but there is no overall context budget indicator. -4. Current file/runnable selection is included by default unless excluded, which is convenient but not minimal. - -### Recommendation - -- Make app `@` context per-message by default. -- Add an explicit “pin” option for context that should persist across messages. -- Lazy-load file/runnable content when selected or when a message is sent. -- Show an approximate context-size/token budget indicator. -- Prefer sending path/name and selected code first; fetch full files only when necessary. - -## Confirmation assessment - -The generic confirmation mechanism already exists: - -- `processToolCall(...)` checks `tool.requiresConfirmation`. -- `ToolExecutionDisplay.svelte` renders Run/Cancel controls. -- Script test runs, flow test runs, and mutating API calls already use confirmation. - -App mode should use the same infrastructure for important actions. - -### Suggested confirmation policy - -#### No confirmation required - -- `list_files`, as a metadata-only response; -- `get_frontend_file`; -- `get_backend_runnable`; -- `get_selected_context`; -- `list_datatables`, as table-name metadata only; -- `get_datatable_table_schema`, as a targeted schema read; -- `lint`; -- search tools. - -#### Confirmation required - -- `set_frontend_file`; -- `patch_file`; -- `delete_frontend_file`; -- `set_backend_runnable`; -- `delete_backend_runnable`; -- `exec_datatable_sql` for any DDL or DML; -- `exec_datatable_sql` for `SELECT` if it returns real row data that will be sent back to the model. - -### Recommended UX - -For files/runnables: - -- Prefer batched proposed edits. -- Show a diff. -- Let the user click “Apply changes”. -- Run lint after applying. - -For SQL: - -- Show the exact SQL. -- Classify the query as: - - schema read; - - data read; - - insert/update/delete; - - DDL. -- Require confirmation before data reads and all mutations. 
-- For table creation, require both: - - table creation policy enabled; - - explicit confirmation of the `CREATE TABLE` SQL. - -## Datatables integration assessment - -The datatable integration is directionally good and already has several strong user-facing pieces. - -### Current strengths - -The new app setup lets the user choose: - -- default datatable; -- schema mode: none, new, existing; -- whether AI can create tables; -- pre-whitelisted existing tables. - -The raw app data panel lets users: - -- add datatable table references; -- inspect tables through the DB manager drawer; -- configure the default datatable/schema for new tables. - -The AI chat integration lets users: - -- mention datatable tables through `@` context; -- add mentioned tables to the app whitelist; -- list datatable/schema/table names with `list_datatables()`; -- retrieve one table's columns with `get_datatable_table_schema()`; -- create tables through `exec_datatable_sql(..., new_table)`. - -### Concerns - -1. **`exec_datatable_sql` is too powerful without confirmation.** - It can run `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP`, `ALTER`, etc. - -2. **Table creation policy is not fully enforced in code.** - The tool blocks `new_table` when policy is disabled, but it does not block DDL if the model omits `new_table`. - -3. **Table creation disabled state may not persist cleanly.** - `RawAppData` stores `datatable` and `schema`, but not an explicit `enabled` value. `RawAppEditor` infers enabled from `data.datatable !== undefined`, which can re-enable table creation after reopening. - -4. **Datatable context cache can become stale.** - `AIChatManager.refreshDatatables()` runs when app helpers are set, but may not refresh immediately after data panel changes or after AI creates a new table. - -5. 
**Full schema loading can still be too expensive internally.** - `list_datatables()` and `get_datatable_table_schema()` reduce what is sent to the model, but they still currently rely on app helpers that fetch full schema data before filtering. - -6. **Auto-whitelisting from `@table` is convenient but silent.** - It mutates app data without an obvious confirmation or undo affordance. - -### Recommended datatable tool design - -Instead of one broad schema tool and one unrestricted SQL tool, prefer smaller tools: - -- `list_datatables()` -- `list_datatable_tables(datatable, schema?, search?)` (optional backend/API optimization if table lists need server-side filtering) -- `get_datatable_table_schema(datatable, schema, table)` -- `preview_datatable_rows(datatable, schema, table, limit)` with confirmation -- `execute_datatable_sql(datatable, sql)` with query classification and confirmation -- `create_datatable_table(datatable, schema, table, columns)` as a structured safe path for table creation - -## Priority recommendations - -1. **Add confirmation to dangerous app tools** - - file/runnable writes; - - file/runnable deletes; - - datatable SQL; - - especially DDL/DML. - -2. **Enforce SQL safety in code, not only in prompts** - - block DDL unless `new_table` is provided and policy allows it; - - confirm all non-`SELECT` statements; - - consider confirming `SELECT` row reads too. - -3. **Reduce default prompt/tool context** - - keep `list_files()` metadata-only and demand-driven; - - use selected context first; - - keep full SDK references out of the default prompt; - - keep datatable tools split into smaller schema/table lookups. - -4. **Refresh datatable context reliably** - - refresh after data panel changes; - - refresh after `exec_datatable_sql(..., new_table)`; - - remove debug logging from datatable refresh. - -5. 
**Persist table creation policy explicitly** - - store a boolean such as `tableCreationEnabled` in raw app data; - - do not infer enabled solely from `data.datatable`. - -6. **Improve `@` context lifecycle** - - make app `@` context per-message by default; - - add pinning for persistent context; - - lazy-load file/runnable contents; - - show approximate context size. - -## Overall opinion - -The current architecture is good and extensible, but it should become more demand-driven and safer before being considered efficient and user-safe. - -The highest-impact changes are: - -- add confirmation for app mutations and datatable SQL; -- enforce datatable SQL policy programmatically; -- reduce the app system prompt and avoid automatic broad context loading; -- split datatable schema access into smaller, targeted tools. + Cover confirmation requirements, datatable SQL policy enforcement, selected + context minimization, and stale-schema refresh behavior with targeted app-mode + evals or lower-level tests where practical. diff --git a/docs/app-mode-ai-chat-token-baseline.md b/docs/app-mode-ai-chat-token-baseline.md deleted file mode 100644 index 95eab70c0dfb5..0000000000000 --- a/docs/app-mode-ai-chat-token-baseline.md +++ /dev/null @@ -1,245 +0,0 @@ -# App Mode AI Chat Token Baseline - -This baseline was collected before optimizing app-mode context/prompt/datatable behavior. - -> Note: The historical commands/results below include `app-token-selected-large-frontend-context` and `app-token-selected-large-backend-context`. Those cases were removed from the active eval suite because `runtime.appContext.selected` only verified that the file/runnable existed and did not serialize a selected file/runnable hint to the model. Future selected-file/runnable coverage should be reintroduced through the app context manager path. - -## Command - -Secrets were loaded from `~/windmill/ai_evals/.env` without printing them. 
- -```bash -cd ai_evals -set -a -source ~/windmill/ai_evals/.env -set +a -bun run cli -- run app \ - app-token-baseline-large-app-small-edit \ - app-token-selected-large-frontend-context \ - app-token-selected-large-backend-context \ - app-token-many-datatable-context \ - app-token-large-datatable-discovery \ - --model haiku \ - --runs 1 \ - --output results/app-token-baseline-current-max8.json -``` - -## Environment - -- Mode: `app` -- Model under test: `anthropic:claude-haiku-4-5-20251001` -- Transport: `direct` -- Judge model: `claude-sonnet-4-6` -- Runs per case: `1` -- Token-heavy app cases use `runtime.maxTurns: 8` - -## Results - -Pass rate: **100% (5/5)** - -| Case | Prompt tokens | Completion tokens | Total tokens | Tool calls | Tools used | -|---|---:|---:|---:|---:|---| -| `app-token-baseline-large-app-small-edit` | 73,682 | 519 | 74,201 | 4 | `get_files`, `get_frontend_file`, `patch_file` | -| `app-token-selected-large-frontend-context` | 36,305 | 348 | 36,653 | 2 | `get_frontend_file`, `patch_file` | -| `app-token-selected-large-backend-context` | 95,232 | 19,633 | 114,865 | 4 | `set_backend_runnable`, `get_backend_runnable` | -| `app-token-many-datatable-context` | 35,204 | 404 | 35,608 | 2 | `get_files`, `patch_file` | -| `app-token-large-datatable-discovery` | 114,964 | 4,047 | 119,011 | 7 | `get_files`, `get_datatables`, `set_backend_runnable`, `set_frontend_file`, `patch_file`, `lint` | - -Aggregate token usage: - -```json -{ - "totalTokenUsage": { - "prompt": 355387, - "completion": 24951, - "total": 380338 - }, - "averageTokenUsagePerAttempt": { - "prompt": 71077.4, - "completion": 4990.2, - "total": 76067.6 - } -} -``` - -## Interpretation - -The highest-token cases are: - -1. `app-token-large-datatable-discovery` — full datatable discovery with `get_datatables()` and app edits reached **119,011** total tokens. -2. 
`app-token-selected-large-backend-context` — selected large backend runnable plus a rewrite-style tool call reached **114,865** total tokens. -3. `app-token-baseline-large-app-small-edit` — a trivial heading edit still reached **74,201** total tokens, largely due to broad file discovery. - -These cases should be rerun after prompt/context/tool changes to compare total and prompt-token reductions. - -## Follow-up: metadata-only `list_files` - -The contentful `get_files` app-mode tool was replaced with `list_files` to make broad app discovery cheaper and less sticky in chat history. - -Changes: - -- Renamed the overview tool from `get_files` to `list_files`. -- Changed the overview response from truncated source/config contents to metadata only. -- `list_files` returns: - - frontend files: `path`, character `size`, and file `kind`; - - backend runnables: `key`, `name`, `type`, and lightweight optional metadata such as `path`, `language`, `contentSize`, and `staticInputKeys`. -- Updated app-mode prompt guidance so the model no longer starts every task with broad file discovery. -- Kept targeted content tools as the path for inspection: - - `get_frontend_file(path)` for frontend source; - - `get_backend_runnable(key)` for runnable configuration/source. 
- -The same five cases were rerun with: - -```bash -cd ai_evals -set -a -source ~/windmill/ai_evals/.env -set +a -bun run cli -- run app \ - app-token-baseline-large-app-small-edit \ - app-token-selected-large-frontend-context \ - app-token-selected-large-backend-context \ - app-token-many-datatable-context \ - app-token-large-datatable-discovery \ - --model haiku \ - --runs 1 \ - --output results/app-token-after-list-files.json -``` - -Pass rate: **100% (5/5)** - -| Case | Prompt tokens | Completion tokens | Total tokens | Tool calls | Tools used | -|---|---:|---:|---:|---:|---| -| `app-token-baseline-large-app-small-edit` | 41,020 | 422 | 41,442 | 3 | `list_files`, `get_frontend_file`, `patch_file` | -| `app-token-selected-large-frontend-context` | 41,020 | 422 | 41,442 | 3 | `list_files`, `get_frontend_file`, `patch_file` | -| `app-token-selected-large-backend-context` | 53,511 | 9,714 | 63,225 | 3 | `list_files`, `get_backend_runnable`, `set_backend_runnable` | -| `app-token-many-datatable-context` | 46,990 | 475 | 47,465 | 3 | `list_files`, `get_frontend_file`, `patch_file` | -| `app-token-large-datatable-discovery` | 131,607 | 5,084 | 136,691 | 8 | `get_datatables`, `list_files`, `set_backend_runnable`, `set_frontend_file`, `patch_file`, `lint` | - -Aggregate token usage: - -```json -{ - "totalTokenUsage": { - "prompt": 314148, - "completion": 16117, - "total": 330265 - }, - "averageTokenUsagePerAttempt": { - "prompt": 62829.6, - "completion": 3223.4, - "total": 66053 - } -} -``` - -Comparison against the post-rebase / PR #8922 run (`results/app-token-after-origin-main-pr8922.json`): - -| Case | PR #8922 total | `list_files` total | Delta | Delta % | Prompt delta | -|---|---:|---:|---:|---:|---:| -| `app-token-baseline-large-app-small-edit` | 74,061 | 41,442 | -32,619 | -44.0% | -32,522 | -| `app-token-selected-large-frontend-context` | 74,061 | 41,442 | -32,619 | -44.0% | -32,522 | -| `app-token-selected-large-backend-context` | 71,050 | 63,225 | -7,825 | 
-11.0% | -7,787 | -| `app-token-many-datatable-context` | 35,497 | 47,465 | +11,968 | +33.7% | +11,886 | -| `app-token-large-datatable-discovery` | 97,128 | 136,691 | +39,563 | +40.7% | +38,295 | - -Aggregate comparison against the post-rebase / PR #8922 run: - -| Metric | PR #8922 | `list_files` | Delta | Delta % | -|---|---:|---:|---:|---:| -| Prompt tokens | 336,798 | 314,148 | -22,650 | -6.7% | -| Completion tokens | 14,999 | 16,117 | +1,118 | +7.5% | -| Total tokens | 351,797 | 330,265 | -21,532 | -6.1% | - -Compared to the original baseline above, the `list_files` run is **-50,073 total tokens** (**-13.2% total**). - -Interpretation: - -- The small edit and selected-frontend cases improved substantially because broad discovery no longer injects truncated contents for the whole app. -- The selected-backend case also improved, despite still needing targeted runnable inspection. -- The datatable-context cases can require an extra `get_frontend_file` after `list_files`, so the small datatable edit regressed in this single-run sample. -- The large datatable case remains dominated by datatable/schema prompt bloat and model variability; moving datatable SDK/reference and schema discovery behind smaller on-demand tools is still the next likely high-impact optimization. - -## Follow-up: targeted datatable tools and shorter datatable prompt - -The next pass reduced default datatable context by making datatable discovery metadata-first and removing the full datatable SDK reference from the system prompt. - -Changes: - -- Replaced the broad schema discovery tool with `list_datatables()` for datatable/schema/table names only. -- Added `get_datatable_table_schema(datatable_name, schema_name, table_name)` for targeted column lookup when column names/types are actually needed. -- Removed the full TypeScript + Python datatable SDK reference from the default app system prompt. 
-- Kept concise TypeScript and Python datatable examples in the prompt, which were enough for the benchmark cases. -- Strengthened prompt/tool guidance so table-list dashboards use `list_datatables()` directly and avoid schema/SDK lookups unless needed. - -The same five cases were rerun with: - -```bash -cd ai_evals -set -a -source ~/windmill/ai_evals/.env -set +a -bun run cli -- run app \ - app-token-baseline-large-app-small-edit \ - app-token-selected-large-frontend-context \ - app-token-selected-large-backend-context \ - app-token-many-datatable-context \ - app-token-large-datatable-discovery \ - --model haiku \ - --runs 1 \ - --output results/app-token-after-datatable-tools-v3.json -``` - -Pass rate: **100% (5/5)** - -| Case | Prompt tokens | Completion tokens | Total tokens | Tool calls | Tools used | -|---|---:|---:|---:|---:|---| -| `app-token-baseline-large-app-small-edit` | 37,516 | 425 | 37,941 | 3 | `list_files`, `get_frontend_file`, `patch_file` | -| `app-token-selected-large-frontend-context` | 37,516 | 358 | 37,874 | 3 | `list_files`, `get_frontend_file`, `patch_file` | -| `app-token-selected-large-backend-context` | 49,995 | 9,708 | 59,703 | 3 | `list_files`, `get_backend_runnable`, `set_backend_runnable` | -| `app-token-many-datatable-context` | 43,493 | 536 | 44,029 | 3 | `list_files`, `get_frontend_file`, `patch_file` | -| `app-token-large-datatable-discovery` | 24,193 | 2,043 | 26,236 | 4 | `list_datatables`, `list_files`, `get_frontend_file`, `set_frontend_file` | - -Aggregate token usage: - -```json -{ - "totalTokenUsage": { - "prompt": 192713, - "completion": 13070, - "total": 205783 - }, - "averageTokenUsagePerAttempt": { - "prompt": 38542.6, - "completion": 2614, - "total": 41156.6 - } -} -``` - -Comparison against the metadata-only `list_files` run (`results/app-token-after-list-files.json`): - -| Case | `list_files` total | Datatable-tools total | Delta | Delta % | Prompt delta | -|---|---:|---:|---:|---:|---:| -| 
`app-token-baseline-large-app-small-edit` | 41,442 | 37,941 | -3,501 | -8.4% | -3,504 | -| `app-token-selected-large-frontend-context` | 41,442 | 37,874 | -3,568 | -8.6% | -3,504 | -| `app-token-selected-large-backend-context` | 63,225 | 59,703 | -3,522 | -5.6% | -3,516 | -| `app-token-many-datatable-context` | 47,465 | 44,029 | -3,436 | -7.2% | -3,497 | -| `app-token-large-datatable-discovery` | 136,691 | 26,236 | -110,455 | -80.8% | -107,414 | - -Aggregate comparison: - -| Metric | `list_files` | Datatable tools | Delta | Delta % | -|---|---:|---:|---:|---:| -| Prompt tokens | 314,148 | 192,713 | -121,435 | -38.7% | -| Completion tokens | 16,117 | 13,070 | -3,047 | -18.9% | -| Total tokens | 330,265 | 205,783 | -124,482 | -37.7% | - -Compared to the post-rebase / PR #8922 run, the datatable-tools run is **-146,014 total tokens** (**-41.5% total**). Compared to the original baseline above, it is **-174,555 total tokens** (**-45.9% total**). - -Interpretation: - -- Removing the full datatable SDK reference from the default prompt saved about 3.5k prompt tokens in every case. -- The large datatable discovery case improved dramatically because the model used `list_datatables()` table-name metadata instead of loading full schemas. -- The small datatable-context edit is still higher than the post-rebase / PR #8922 run because selected file identifiers are not yet injected, so the model still discovers and reads `/index.tsx` before patching. -- A future context-manager-backed selected file/runnable flow should add cheap selected identifiers when that UX is ready, so selected-file tasks can skip `list_files()` without reintroducing implicit source-content bloat. diff --git a/docs/failing-tests.md b/docs/failing-tests.md deleted file mode 100644 index d0ae44f1095d1..0000000000000 --- a/docs/failing-tests.md +++ /dev/null @@ -1,33 +0,0 @@ -# Failing Tests - -This file tracks benchmark cases that still fail or need follow-up validation. 
- -## Flow - -- `flow-test6-ai-agent-tools` - Latest failing run: `ai_evals/results/2026-04-09T11-25-24.107Z__flow` - Issues: - final output does not include the actions or tool-result details the prompt asks for - `open_support_ticket` contains a syntax bug - -- `flow-test7-simple-modification` - Latest failing run: `ai_evals/results/2026-04-09T11-25-24.107Z__flow` - Issues: - `validate_data` was added, but the failure behavior still does not match the requested contract - `save_results` throws instead of returning a graceful structured result - -- `flow-test11-preprocessor-and-failure-handler` - Latest failing run: `ai_evals/results/2026-04-09T11-25-24.107Z__flow` - Issues: - the model creates regular `preprocessor` and `failure` modules - it does not use Windmill's special top-level `preprocessor_module` and `failure_module` - -## Needs Reconfirmation - -- `flow-test4-order-processing-loop` - Full-suite failing run: `ai_evals/results/2026-04-09T11-25-24.107Z__flow` - Follow-up passing run after prompt improvement: `ai_evals/results/2026-04-09T13-29-15.877Z__flow` - Note: - this case failed on invalid `branchone` downstream result access - it passed after adding explicit branch-output guidance to the flow prompt - rerun the full flow suite to confirm the fix holds in the broader benchmark diff --git a/docs/system-prompt-testing-plan.md b/docs/system-prompt-testing-plan.md deleted file mode 100644 index 9b12f1c5e0c84..0000000000000 --- a/docs/system-prompt-testing-plan.md +++ /dev/null @@ -1,1000 +0,0 @@ -# System Prompt And Skill Output Testing Plan - -Historical note: - -- This file is a planning document and no longer matches the current benchmark CLI in every detail. -- The current source of truth is [ai_evals/README.md](/home/farhad/windmill__worktrees/prompt-testing-plan/ai_evals/README.md) and [system-prompt-testing-status.md](/home/farhad/windmill__worktrees/prompt-testing-plan/docs/system-prompt-testing-status.md). 
-- In particular, the current tool no longer has the old variants, compare, or history workflow described below. - -## Goal - -Build a single testing strategy that answers one question reliably: - -> Given a user task, how good is the artifact produced by our AI system? - -This plan is intentionally focused on **black-box output evaluation**, not on unit testing frontend or CLI internals. - -The intended end state is a **new repo-level benchmark CLI** that runs a shared -eval suite across multiple surfaces. - -That benchmark CLI should be the main entrypoint for: - -- running one case -- running a benchmark set -- comparing baseline vs candidate variants -- writing benchmark history snapshots - -Frontend and Windmill CLI are not meant to become separate testing products. -They should be implemented as adapters behind this shared benchmark CLI. - -The system under test is: - -- Frontend AI Chat in `script`, `flow`, and `app` modes -- CLI local development experience driven by generated guidance and skills - -The artifact under test is: - -- Script code -- Flow JSON / module structure -- Raw app files and backend runnables -- Files and project artifacts produced in a local CLI workspace - -## Non-Goals - -This plan does **not** treat the following as the main testing target: - -- Unit testing helper functions, stores, or tool wrapper internals -- UI rendering behavior, DOM interactions, or component-level correctness -- `wmill init` correctness as a standalone product area -- Backend route correctness except where it affects prompt delivery or AI configuration - -Those may still need lightweight tests, but they are not the core of prompt reliability evaluation. - -## Core Principles - -### 1. Black-box evaluation only - -The runner should provide an input task to the real system setup, let it run, collect the final artifact, and score the result. 
- -In practice, this runner should be exposed through the new repo-level benchmark -CLI rather than through separate ad hoc test commands for each surface. - -### 2. Headless execution - -Frontend evaluation must be fully decoupled from the browser UI. It should exercise prompt assembly, tool selection, and tool execution logic without mounting Svelte components or clicking through the app. - -### 3. Real prompt environment - -All evals must use the same prompt-building path, tool definitions, and skill content that production uses, or a clearly defined variant of them. - -### 4. Artifact-first scoring - -The main score is based on the produced artifact, not on intermediate transcripts. - -### 5. Reliability over one-off success - -A prompt is not "good" because it passed once. Reliability means pass rate across repeated runs and across a representative case set. - -### 6. Track benchmark history over time - -The suite must not only evaluate the current output. It must also produce a -git-tracked benchmark history so the team can see whether the system is -improving over time. - -This history should focus on official benchmark snapshots, not on every local -experiment. - -### 7. Shared corpus, separate adapters - -Frontend and CLI should share the same evaluation corpus format when possible, but each surface should have its own execution adapter. - -### 8. CLI first, UI last - -The CLI should be the first surface brought to a high-confidence benchmark -state. - -It is the cleanest foundation for the suite because it produces direct files in -an isolated workspace, has less ambiguity than the frontend, and is easier to -score deterministically. - -Frontend should reuse the benchmark model proven on the CLI rather than define -a parallel testing philosophy. - -### 9. UI comes last - -The testing suite must exist and be trustworthy before building a studio UI on top of it. 
-
-## Current State
-
-## Shared Prompt Source Of Truth
-
-The repo already has the right content split:
-
-- `system_prompts/` is the shared source of truth for core Windmill prompt content
-- frontend adds chat-specific tool instructions on top
-- CLI materializes guidance and skill content from generated outputs
-
-This is a strong foundation for a shared eval suite.
-
-## Execution Priority
-
-Even though the repo already has useful frontend eval scaffolding, the
-implementation priority should be:
-
-1. build the repo-level benchmark CLI and use the Windmill CLI adapter as the
- first implementation behind it
-2. make the CLI artifact-evaluation path excellent
-3. stabilize shared scoring, reporting, and benchmark history around that path
-4. bring frontend onto the same benchmark model through the same benchmark CLI
-5. build the UI only after the underlying suite is trustworthy
-
-This keeps the hardest product question focused on artifact quality rather than
-on UI workflow.
-
-## Benchmark CLI As The Main Product
-
-The testing suite should have one primary interface:
-
-- a new repo-level benchmark CLI
-
-The benchmark CLI should be able to run:
-
-- Windmill CLI evals
-- frontend evals
-- shared reporting and comparison commands
-
-Illustrative command shape:
-
-```bash
-ai-evals run --surface cli --case bun-hello-script
-ai-evals run --surface frontend-flow --case support-flow
-ai-evals compare --surface cli --variant baseline --variant candidate-a
-ai-evals history latest
-```
-
-The exact binary name can change, but the architecture should not:
-
-- one benchmark CLI
-- shared case loader
-- shared scoring
-- shared history writer
-- separate surface adapters underneath
-
-## Temporary Bootstrap Code
-
-This bootstrap phase is now complete for frontend `flow`, `app`, and `script`.
-
-Frontend AI benchmark ownership has moved into `ai_evals/`, and the frontend
-source tree no longer owns a separate AI benchmark suite under
-`frontend/.../__tests__/...`.
-
-Benchmark authors should only need the repo-level benchmark CLI to run the
-long-term suite.
-
-The only temporary frontend-specific piece that remains is a thin Vitest/Vite
-loader bridge so the benchmark runner can import the production chat modules in
-the same module/runtime environment they already expect.
-
-## Frontend: What Exists Today
-
-The current frontend benchmark path is **decoupled from the UI** and now owned
-by `ai_evals`.
-
-They currently:
-
-- run through the shared headless chat loop
-- use production prompt builders
-- use production tool definitions
-- use benchmark-owned helper adapters that write to temp workspaces on disk
-- execute through the frontend module/runtime environment only as a loader bridge
-
-This means the current frontend evals are now a proper benchmark adapter,
-not a frontend test suite.
-
-That is the correct direction.
-
-### Frontend Architecture Notes
-
-There are three categories of code involved:
-
-- shared production logic:
- - production system prompt builders
- - production tool definitions
- - production `runChatLoop`
-- benchmark-only infrastructure:
- - case loading
- - variant loading
- - judge scoring
- - benchmark result shaping
- - history/reporting integration
-- alternate helper adapters:
- - production helpers mutate UI/editor state
- - benchmark helpers mutate temp-workspace files
-
-This is important because the benchmark suite is **not** meant to duplicate the
-frontend chat logic. It is meant to reuse the production chat loop and tool
-definitions while swapping the execution backend from UI state to filesystem
-state.
-
-## Frontend: What Is Missing
-
-### Coverage gaps
-
-- `script` is now exposed through the shared benchmark CLI, but it only has initial case coverage.
-- Existing frontend coverage is still too small relative to the target benchmark corpus.
-
-### Reliability gaps
-
-- Frontend flow and app can already run with pass/fail results and repeated runs through the shared benchmark CLI.
-- The remaining gap is turning that into stronger routine reliability gating with better deterministic validators and broader routine case coverage.
-- Frontend reliability reporting is still less mature than the intended end state for official CI tiers and richer failure triage.
-
-### Prompt-iteration gaps
-
-- Frontend prompt variants are file-backed now, but the repo only ships baseline manifests by default.
-- Creating and curating meaningful frontend candidate variants is still a mostly manual workflow compared with the CLI snapshot flow.
-- Frontend prompt comparison exists through the shared `compare` command, but it still needs broader routine use and better variant coverage.
-
-### Artifact-validation gaps
-
-- The current flow and app helpers are file-backed now, but several effects are still lightweight and should become more realistic over time.
-- Linting and runnable validation are currently too lightweight in the eval path.
-- Datatable interactions are mocked rather than validated as output constraints.
-- The suite does not yet enforce a strong deterministic validator layer before using an LLM judge.
-
-### Corpus gaps
-
-- Frontend surfaces already use shared case manifests under `ai_evals/cases/frontend/`.
-- The remaining gap is breadth and representativeness, not the absence of a shared corpus.
-- Cases still need richer metadata, stronger deterministic constraints, and a larger regression library built from real failures.
-
-### Reporting gaps
-
-- Frontend runs already emit the shared benchmark result shape and can write official history snapshots through the shared benchmark CLI.
-- There is still no rich leaderboard or trend-oriented debugging workflow for frontend surfaces specifically.
-- There is still no strong "worst failures first" report for debugging regressions.
-
-## Frontend: Perfect Testing Logic
-
-The perfect frontend testing logic is:
-
-Frontend should not be the place where the benchmark philosophy is invented.
-
-It should consume the shared case format, validator model, reporting format,
-and history format already proven through the CLI path.
-
-### 1. Stay fully headless
-
-Do not mount the chat UI.
-
-Do not click through the frontend.
-
-Do not use Playwright for prompt evaluation.
-
-The runner should directly invoke:
-
-- the production system message builder
-- the production user message builder
-- the production tool list
-- the production chat loop
-
-It is acceptable for the benchmark adapter to use the frontend Vitest/Vite
-runtime as a thin loader bridge when production chat modules still depend on
-that environment, as long as:
-
-- the benchmark entrypoint remains the shared benchmark CLI
-- the benchmark logic and fixtures live under `ai_evals`
-- the frontend source tree does not own a separate benchmark suite
-
-This keeps the suite decorrelated from the frontend UI while still testing the real AI logic.
-
-### 2. Test the three frontend AI surfaces separately
-
-#### Script mode
-
-Input:
-
-- user prompt
-- optional initial script
-- optional context such as selected workspace runnables or DB references
-
-Output:
-
-- final script code
-
-Scoring:
-
-- deterministic validators first
-- LLM judge second
-
-Deterministic validators should include:
-
-- expected entrypoint present
-- syntax / parse validity
-- language-appropriate compile or lint check where feasible
-- required behaviors or structures present
-- forbidden patterns absent
-
-#### Flow mode
-
-Input:
-
-- user prompt
-- optional initial flow
-- optional schema
-- optional workspace context
-
-Output:
-
-- final flow definition
-
-Scoring:
-
-- flow JSON is structurally valid
-- expected module types exist
-- expected branches / loops / tools exist
-- schema shape matches required inputs
-- required data flow connections are present
-- LLM judge scores completeness and overall quality
-
-#### App mode
-
-Input:
-
-- user prompt
-- optional initial app
-- optional workspace context
-
-Output:
-
-- final frontend files
-- final backend runnables
-
-Scoring:
-
-- expected files and runnables exist
-- file structure is coherent
-- app bundle / lint checks pass where feasible in headless mode
-- required UI/backend behaviors are represented in the artifact
-- LLM judge scores completeness and product quality
-
-### 3. Use repeated runs, not single runs
-
-Each case should run more than once.
-
-Recommended starting point:
-
-- PR smoke run: 2 runs per case on a small curated subset
-- nightly reliability run: 5 to 10 runs per case on the full benchmark set
-
-Primary metric:
-
-- pass rate
-
-Secondary metrics:
-
-- average deterministic score
-- average judge score
-- worst-case judge score
-- latency
-- total tool calls
-
-### 4. Keep tool traces as diagnostics only
-
-Tool usage matters for debugging, but it should not be the primary score.
-
-The suite should record:
-
-- tool names
-- tool arguments
-- iteration count
-- model/provider
-
-But the main question remains:
-
-> Was the final artifact good?
-
-### 5. Make prompt variants easy to test
-
-Prompt candidates should not require editing test code.
-
-The suite should support a file-based prompt variant workflow.
-
-Example direction:
-
-- `ai_evals/variants/frontend/script/baseline.md`
-- `ai_evals/variants/frontend/script/candidate-a.md`
-- `ai_evals/variants/frontend/flow/baseline.md`
-- `ai_evals/variants/frontend/app/baseline.md`
-
-Each variant should be runnable side by side against the same case set.
-
-### 6. Separate benchmark cases from test code
-
-Benchmark cases should live in data files, not inline in test files.
-
-Each case should define:
-
-- surface
-- user prompt
-- initial artifact if any
-- required constraints
-- forbidden constraints
-- judge rubric
-- tags
-
-This makes the benchmark editable by prompt authors without changing runner logic.
-
-## CLI: What Exists Today
-
-The current CLI tests prove only one narrow property:
-
-> Given a prompt, does the model invoke the expected skill?
-
-That is useful as a smoke signal, but it is far from sufficient for output evaluation.
-
-The current CLI setup also depends on manual preparation of a `.claude/skills` folder, which makes repeated benchmarking and prompt iteration much harder than necessary.
-
-## CLI: What Is Missing
-
-### Output-evaluation gap
-
-- The current suite does not score the artifact produced by the CLI workflow.
-- It only checks whether a skill was invoked.
-- It does not verify that the resulting files are good.
-
-### Automation gap
-
-- The current setup requires manual copying of generated skills into a test folder.
-- That makes the suite too fragile and too manual for rapid prompt iteration.
-
-### Reliability gap
-
-- There is no repeated-run measurement.
-- There is no pass-rate metric.
-- There is no baseline vs candidate comparison workflow.
-
-### Prompt-variant gap
-
-- There is no first-class way to test alternate skill bundles or alternate generated guidance.
-- There is no clean candidate flow for "I changed skill content, show me whether reliability improved."
-
-### Corpus gap
-
-- CLI cases are not aligned with frontend benchmark cases.
-- There is no shared benchmark language describing the task, initial state, and expected artifact.
-
-### Reporting gap
-
-- There is no stable output report for artifact comparison.
-- There is no failure clustering by skill bundle, task family, or model.
-
-## CLI: Perfect Testing Logic
-
-The perfect CLI testing logic is:
-
-This should be the reference implementation for the suite.
-
-### 1. Evaluate the final artifact, not the skill invocation
-
-Skill invocation should be kept as diagnostic metadata only.
-
-The primary output should be the files produced in a temporary workspace.
-
-Example CLI artifacts:
-
-- generated script files
-- generated flow files
-- raw app project files
-- schedule / trigger config files
-- AGENTS / guidance files only when they are directly relevant to the task
-
-### 2. Create the workspace automatically
-
-The runner should create a fresh temporary project for every case.
-
-It should seed that workspace with:
-
-- initial files for the benchmark case
-- the current generated CLI guidance and skills
-- any fixture data required by the task
-
-It should never depend on a manually maintained test folder.
-
-### 3. Materialize the exact skill bundle under test
-
-The runner should be able to test:
-
-- the current production skill bundle
-- a candidate skill bundle built from prompt changes
-
-For CLI, a "prompt variant" is effectively a skill-bundle variant.
-
-That means the suite should support alternate generated skill content without requiring ad hoc manual copies.
-
-### 4. Score the final workspace
-
-The scoring approach should match the frontend philosophy:
-
-- deterministic validators first
-- LLM judge second
-
-Deterministic validators for CLI should include:
-
-- expected files created
-- expected file names and locations
-- required content patterns present
-- expected artifact type produced
-- optional parse / lint / compile validation where feasible
-
-### 5. Run repeated benchmarks
-
-The CLI should use the same reliability logic as frontend:
-
-- benchmark set
-- repeated runs
-- pass rate
-- baseline vs candidate comparison
-
-### 6. Keep skill traces as diagnostics
-
-Record:
-
-- invoked skills
-- order of invocation
-- turns
-- file changes
-
-But do not let that replace artifact evaluation.
-
-## Perfect Shared Benchmark Model
-
-The frontend and CLI should share the same benchmark concept.
-
-Each evaluation case should define:
-
-- `id`
-- `surface`
-- `user_prompt`
-- `initial_state`
-- `workspace_context`
-- `artifact_checks`
-- `judge_rubric`
-- `tags`
-
-The same task should be runnable on multiple surfaces when it makes sense.
-
-This gives direct comparability between:
-
-- frontend script vs CLI script
-- frontend flow vs CLI flow
-- frontend app vs CLI app
-
-## Recommended Benchmark Categories
-
-The first benchmark set should be broad, but not huge.
-
-Recommended initial size:
-
-- 20 to 30 core cases
-
-Recommended categories:
-
-- from-scratch script creation
-- script modification
-- from-scratch flow creation
-- flow modification
-- from-scratch raw app creation
-- raw app modification
-- reuse of workspace assets
-- tasks requiring datatable awareness
-- tasks requiring constraints or edge-case handling
-- known regressions from real failures
-
-Every category should contain both:
-
-- "easy success" cases
-- "high ambiguity" cases
-
-This is essential for measuring reliability rather than only measuring best-case demos.
-
-## Scoring Model
-
-The suite should use three layers.
-
-## Layer 1: Deterministic Validators
-
-This is the hard gate.
-
-Examples:
-
-- parse succeeds
-- artifact shape is valid
-- required entrypoint exists
-- expected files exist
-- required module types exist
-- expected inputs / schema fields exist
-- forbidden patterns are absent
-
-If layer 1 fails, the run is a failure.
-
-## Layer 2: Task-Specific Validators
-
-These are stronger artifact checks derived from the benchmark case.
-
-Examples:
-
-- flow contains a loop and a conditional branch
-- app includes a reset button path and backend wiring
-- script performs the requested transformation
-
-These should still be deterministic whenever possible.
-
-## Layer 3: LLM Judge
-
-Use an LLM judge only after deterministic validation.
-
-The judge should answer:
-
-- Did the artifact satisfy the request?
-- Is it complete?
-- Is it coherent for Windmill?
-- How close is it to the intended solution?
-
-The judge score is valuable, but it should not be the only oracle.
-
-## Benchmark History
-
-The suite should persist official benchmark summaries in a git-tracked history
-layer so improvements and regressions can be reviewed over time.
-
-## What Should Be Git-Tracked
-
-Only official benchmark outputs should be committed:
-
-- post-merge benchmark snapshots on `main`
-- scheduled nightly benchmark snapshots
-- manually promoted benchmark snapshots when the team wants to record a result
-
-Each official snapshot should produce:
-
-- one detailed run JSON
-- one entry in an append-only summary file
-- regenerated rollups for trend views
-
-## What Should Not Be Git-Tracked
-
-The following should remain local or external by default:
-
-- raw transcripts
-- full model messages
-- large generated artifact bundles
-- ad hoc local experiments
-- temporary comparison runs
-
-This keeps git history focused on stable benchmark signals instead of noisy
-debug output.
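
The three scoring layers in the Scoring Model section above can be sketched as a short gate, under two assumptions the plan does not state: a layer 2 failure also fails the run, and the judge is a synchronous function (a real LLM judge would be asynchronous). `Check`, `ScoringPlan`, and `scoreArtifact` are hypothetical names.

```typescript
// A deterministic check over the artifact text (illustrative shape).
type Check = (artifact: string) => boolean;

interface ScoringPlan {
  hardGates: Check[]; // layer 1: deterministic validators
  taskChecks: Check[]; // layer 2: task-specific validators
  judge: (artifact: string) => number; // layer 3: stand-in for an LLM judge
}

interface ScoreResult {
  passed: boolean;
  failedLayer: 1 | 2 | null;
  judgeScore: number | null;
}

// Run the layers in order; the judge is only called once both
// deterministic layers have passed.
function scoreArtifact(artifact: string, plan: ScoringPlan): ScoreResult {
  if (!plan.hardGates.every((check) => check(artifact))) {
    return { passed: false, failedLayer: 1, judgeScore: null };
  }
  // Assumption: a layer 2 failure is treated as a failed run as well.
  if (!plan.taskChecks.every((check) => check(artifact))) {
    return { passed: false, failedLayer: 2, judgeScore: null };
  }
  return { passed: true, failedLayer: null, judgeScore: plan.judge(artifact) };
}
```

Skipping the judge on deterministic failures keeps the judge from being the only oracle and avoids paying for a model call on artifacts that are already known to be invalid.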
-
-## Reliability Metrics
-
-Every prompt or skill candidate should be reported with:
-
-- total cases
-- passes
-- pass rate
-- average judge score
-- median judge score
-- worst-case judge score
-- average latency
-- average turns
-
-Per-case results should also be retained.
-
-This is the minimum needed to compare:
-
-- baseline vs candidate
-- provider vs provider
-- frontend vs CLI
-
-## Benchmark Metrics
-
-The history layer should track metrics in four groups.
-
-## Quality Metrics
-
-- `pass_rate`
-- `deterministic_pass_rate`
-- `judge_score_mean`
-- `judge_score_median`
-- `judge_score_p10`
-- `category_pass_rate`
-
-## Reliability Metrics
-
-- `runs_per_case`
-- `flake_rate`
-- `path_consistency`
-
-## Efficiency Metrics
-
-- `latency_ms_mean`
-- `latency_ms_median`
-- `tokens_prompt_mean`
-- `tokens_completion_mean`
-- `tokens_total_mean`
-- `tool_calls_mean`
-- `iterations_mean`
-- `estimated_cost_mean`
-- `cost_per_success`
-- `latency_per_success`
-
-## Provenance Metrics
-
-- `timestamp`
-- `git_sha`
-- `suite_version`
-- `scoring_version`
-- `surface`
-- `variant_name`
-- `provider`
-- `model`
-- `judge_model`
-
-The provenance metrics are essential. Without them, a trend line can mix prompt
-changes with upstream model drift and become hard to interpret.
-
-## Efficiency Score
-
-The suite should not collapse everything into one number.
-
-It should track at least three top-level composite scores:
-
-- `quality_score`
-- `efficiency_score`
-- `value_score`
-
-Recommended interpretation:
-
-- `quality_score`: how good the artifact is
-- `efficiency_score`: how fast and cheap the system is relative to peers
-- `value_score`: quality-adjusted efficiency
-
-These composite scores should sit on top of the raw metrics, not replace them.
-
-## Proposed Suite Architecture
-
-The suite should be built in six layers.
-
-## Layer 1: Benchmark Data
-
-Purpose:
-
-- define the cases once
-
-Contents:
-
-- case files
-- reusable initial fixtures
-- evaluation metadata
-
-## Layer 2: Benchmark CLI
-
-Purpose:
-
-- provide one shared entrypoint for the suite
-
-Responsibilities:
-
-- load cases and variants
-- select a surface adapter
-- run one case or a benchmark set
-- invoke shared scoring and history writing
-- expose comparison and history commands
-
-## Layer 3: Surface Adapters
-
-Purpose:
-
-- run a case against one surface
-
-Adapters:
-
-- frontend-script adapter
-- frontend-flow adapter
-- frontend-app adapter
-- CLI adapter
-
-Responsibilities:
-
-- prepare the correct prompt environment
-- prepare the initial artifact state
-- run the real model loop
-- return the final artifact plus diagnostics
-
-## Layer 4: Scoring And Reporting
-
-Purpose:
-
-- evaluate the final artifact
-- aggregate repeated runs
-- compare variants
-
-Responsibilities:
-
-- deterministic validation
-- LLM judging
-- pass/fail computation
-- result serialization
-- comparison reports
-
-## Layer 5: Benchmark History
-
-Purpose:
-
-- preserve official benchmark summaries over time
-- support trend analysis and regression review
-
-Responsibilities:
-
-- store official run snapshots
-- append benchmark summary entries
-- generate rollups for charts and dashboards
-- keep provenance metadata for every tracked run
-
-## Layer 6: UI Studio
-
-Purpose:
-
-- provide a user interface for the exact same benchmark CLI and runner stack
-
-Important rule:
-
-The UI must not define its own execution semantics.
-
-It must only be a frontend over the same suite used in CI and local benchmarking.
-
-## Proposed Development Order
-
-### Phase 1: Stabilize the benchmark model
-
-Deliverables:
-
-- shared case schema
-- shared result schema
-- initial core benchmark set
-
-### Phase 2: Build the benchmark CLI shell
-
-Deliverables:
-
-- repo-level benchmark CLI entrypoint
-- `run`, `compare`, and `history` command skeletons
-- adapter selection layer
-- temporary wiring to the first CLI adapter
-
-### Phase 3: Replace the CLI smoke suite with real artifact evaluation
-
-Deliverables:
-
-- temp-workspace runner
-- automatic skill-bundle materialization
-- artifact scoring
-- repeated-run support
-- baseline vs candidate skill-bundle comparison
-
-### Phase 4: Add shared reporting and benchmark history around the CLI path
-
-Deliverables:
-
-- baseline vs candidate reports
-- pass-rate summaries
-- worst-failure reports
-- official run schema
-- git-tracked benchmark summary file
-- history snapshot writer
-- rollup generation for trend charts
-
-### Phase 5: Finish the frontend black-box harness on top of the shared model
-
-Deliverables:
-
-- convert current flow and app evals into proper scored reliability tests
-- add script eval support
-- add repeated-run support
-- add prompt-variant loading from files
-- align frontend outputs with the shared result and history format
-- expose frontend runs through the same benchmark CLI
-
-### Phase 6: Add CI tiers
-
-Deliverables:
-
-- fast PR smoke benchmark
-- fuller nightly benchmark
-- official history updates on `main` and scheduled runs
-- manual benchmark mode for prompt authors
-
-### Phase 7: Build the UI studio
-
-Deliverables:
-
-- run selector
-- variant selector
-- per-case comparison view
-- artifact diff view
-- reliability dashboard
-- trend dashboard backed by git-tracked benchmark history
-
-This phase comes last because the UI is only valuable once the underlying suite is stable and trusted.
-
-## Proposed Prompt Variant Workflow
-
-The suite should make it cheap to test new prompt candidates.
-
-Recommended workflow:
-
-1. Edit or add a candidate prompt file.
-2. Run the benchmark against baseline and candidate.
-3. Compare pass rate and score.
-4. Inspect worst regressions first.
-5. Promote only if the candidate improves the benchmark materially.
-
-For CLI, the same workflow applies, but the tested unit is the generated skill bundle rather than a single chat system prompt.
-
-## Suggested Repository Direction
-
-This plan does not require the UI studio to exist first.
-
-A reasonable repo structure would be:
-
-```text
-ai_evals/
- cli/
- cases/
- fixtures/
- history/
- runs/
- rollups/
- variants/
- frontend/
- script/
- flow/
- app/
- cli/
- results/ # gitignored
- scripts/
- adapters/
- scoring/
- reports/
-```
-
-The exact folder names can change, but the architectural split should remain.
-
-## What "Done" Looks Like
-
-This project is successful when all of the following are true:
-
-- one repo-level benchmark CLI is the primary way to run prompt evals
-- frontend prompt behavior is tested headlessly and independently from the UI
-- CLI local-dev behavior is tested by evaluating the final files it produces
-- benchmark cases are shared where possible between frontend and CLI
-- prompt and skill candidates can be tested without editing test code
-- reliability is reported as pass rate over repeated runs
-- baseline vs candidate comparisons are easy to run and inspect
-- the UI studio is only a thin interface over the same trusted runner
-
-## Final Recommendation
-
-The current frontend evals should be treated as a useful starting point, not the finished solution.
-
-They already prove that the repo can test AI behavior without coupling to the browser UI.
-
-The main work now is:
-
-- build the repo-level benchmark CLI as the durable entrypoint
-- replace CLI invocation checks with artifact evaluation
-- make the CLI path the reference benchmark implementation
-- unify frontend under that same benchmark model
-- make frontend evals complete and reliability-oriented only after the shared
- scoring model is stable
-- build the UI only after the suite is strong enough to stand on its own
diff --git a/docs/system-prompt-testing-status.md b/docs/system-prompt-testing-status.md
deleted file mode 100644
index 86beaadc892d1..0000000000000
--- a/docs/system-prompt-testing-status.md
+++ /dev/null
@@ -1,140 +0,0 @@
-# System Prompt Testing Status
-
-This document describes the benchmark tool that exists today. It is the current
-truth for `ai_evals/`.
-
-The longer planning document in
-[system-prompt-testing-plan.md](/home/farhad/windmill__worktrees/prompt-testing-plan/docs/system-prompt-testing-plan.md)
-still contains useful background, but parts of its workflow are now historical
-because the old variants/history system was removed.
-
-## Current Tool
-
-There is one repo-level benchmark CLI under `ai_evals/` with three commands:
-
-- `bun run cli -- models`
-- `bun run cli -- cases [mode]`
-- `bun run cli -- run [caseIds...]`
-
-Supported modes:
-
-- `cli`
-- `flow`
-- `script`
-- `app`
-
-Public `run` options:
-
-- `--runs `
-- `--output `
-- `--model `
-- `--verbose`
-- `--record`
-
-There is no variant workflow and no compare command in the current tool.
-Tracked history is intentionally minimal: `run --record` appends one compact
-summary line to `ai_evals/history/.jsonl`. This is only allowed for
-full-suite runs, not selected case ids. History lines include average token
-usage when the benchmark mode reports it, plus average judge score and per-case
-duration/judge/token usage summaries.
-
-## How It Works
-
-Each attempt runs:
-
-1. the current production prompts, tools, and guidance from this checkout
-2. deterministic validation
-3. LLM judging
-
-Results are written locally under `ai_evals/results/` as:
-
-- a summary JSON file
-- a sibling artifacts directory containing the generated flow/script/app/workspace
-
-If `--record` is used, the CLI also appends a compact JSONL summary line to the
-tracked file for that mode under `ai_evals/history/`.
-
-## Current Architecture
-
-- `ai_evals/cases/`: one YAML manifest per mode
-- `ai_evals/fixtures/`: initial and expected fixtures
-- `ai_evals/core/`: shared case loading, model resolution, validation, judging, and result writing
-- `ai_evals/history/`: optional tracked pass-rate history written by `run --record`, one JSONL file per mode
-- `ai_evals/modes/`: one runner per mode
-
-Execution model:
-
-- `flow`, `script`, and `app` reuse the production frontend chat loop and production tool definitions through the frontend Vitest bridge
-- `cli` creates a temp workspace, writes the current checkout guidance into it, and runs the Anthropic agent SDK against that workspace
-
-## Case Model
-
-Each case is intentionally small:
-
-- `prompt`
-- optional `initial`
-- optional `expected`
-- optional `validate`
-- optional `cliExpect`
-
-`validate` is mainly used for stronger deterministic checks where exact fixture
-matching would be too strict, especially for `flow` creation cases.
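
The case model above can be sketched as a small type with a loader check. This is an illustration only: the field names come from the list above, but the value types are left as `unknown` because the status document does not specify them, and `parseCase` is a hypothetical helper rather than the loader in `ai_evals/core/`.

```typescript
// Field names follow the case model; value types are unspecified there,
// so the optional fields are deliberately typed as unknown.
interface EvalCase {
  prompt: string;
  initial?: unknown;
  expected?: unknown;
  validate?: unknown;
  cliExpect?: unknown;
}

const KNOWN_FIELDS = ["prompt", "initial", "expected", "validate", "cliExpect"];

// Hypothetical loader check: require a non-empty prompt and reject
// fields outside the case model so typos fail loudly.
function parseCase(raw: unknown): EvalCase {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("case must be an object");
  }
  const fields = raw as Record<string, unknown>;
  for (const key of Object.keys(fields)) {
    if (KNOWN_FIELDS.indexOf(key) === -1) {
      throw new Error(`unknown case field: ${key}`);
    }
  }
  const prompt = fields["prompt"];
  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("case requires a non-empty prompt");
  }
  return {
    prompt,
    initial: fields["initial"],
    expected: fields["expected"],
    validate: fields["validate"],
    cliExpect: fields["cliExpect"],
  };
}
```

Rejecting unknown fields is a design choice for this sketch; a real loader might prefer to warn instead so that older manifests keep loading.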
-
-`cliExpect` is used by CLI-mode cases to assert agent behavior deterministically,
-including:
-
-- required or forbidden skills
-- skills invoked before the first file mutation
-- ordered `wmill` command proposals in the assistant response
-- forbidden attempted `wmill` executions
-- read-only guidance cases where the workspace must stay unchanged
-
-Examples of current deterministic checks:
-
-- schema contains one of several accepted input shapes
-- `results.*` references resolve
-- required code/input characteristics exist in some module
-- expected workspace files are created in `cli` mode
-- expected CLI skills and proposed `wmill` commands are observed in `cli` mode
-
-## Model Selection
-
-Model aliases are resolved through a shared registry in `ai_evals/core/models.ts`.
-
-Current aliases:
-
-- `haiku`
-- `sonnet`
-- `opus`
-- `4o`
-
-Notes:
-
-- the `models` command also shows accepted alias spellings such as `gpt-4o` and `claude-opus-4.6`
-- frontend modes can use Anthropic and OpenAI-backed aliases
-- `cli` mode is Anthropic-only because it runs through the Anthropic agent SDK
-- the judge model is separate and currently defaults to `claude-sonnet-4-6`
-
-## What Is Working Well
-
-- one simple local benchmark CLI
-- real production execution paths instead of synthetic prompt variants
-- local result and artifact persistence by default
-- live frontend progress output
-- reusable flow/script/app/cli runners under one tool
-- deterministic validation can now catch real runtime-invalid flow wiring
-
-## What Still Needs Work
-
-- broader case coverage across all four modes
-- stronger deterministic validators for more cases, especially app/script semantics
-- clearer per-case validation metadata as the corpus grows
-- CI automation for smoke and nightly runs
-
-## Recommended Next Focus
-
-The next high-value work is:
-
-1. add more realistic benchmark cases
-2. keep simplifying deterministic validators so they check correctness, not one exact implementation
-3. add CI only after the local benchmark signal is trustworthy