test: add global chat eval coverage by centdix · Pull Request #9320 · windmill-labs/windmill

centdix · 2026-05-26T00:46:52Z

Summary

Adds broader global chat eval coverage and tightens the frontend eval harness so ai_evals reproduces production-like global chat behavior more closely.

This PR covers two things:

global chat eval parity improvements for frontend/global runs
additional human-style global chat prompts where the user does not explicitly provide exact paths or script languages

Changes

Make global chat eval runner state capture closer to the real draft behavior used by production global chat.
Add a fixture for editing an existing invoice-processing flow from global chat.
Add flexible global draft validation for path-free human prompts with optional pathIncludes and pathStartsWith requirements.
Add global chat cases for:
- inferred script path/language for a welcome helper
- a draft weekday trial-check job with script plus schedule
- a draft Slack secret variable
- finding and editing an existing invoice flow without an explicit path

Test plan and results

cd ai_evals && bun test adapters/frontend/vitestAdapter.test.ts adapters/frontend/core/shared/providerConfig.test.ts modes/frontendCommon.test.ts core/validators.test.ts core/cases.test.ts
- Result: passed, 46 pass, 1 skip, 0 fail.
cd ai_evals && bun test core/validators.test.ts core/cases.test.ts
- Result: passed, 38 pass, 0 fail.
cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8010 WMILL_AI_EVAL_BACKEND_WORKSPACE=integration-tests bun run cli -- run global global-test8-human-script-infer-path-language global-test9-human-weekday-trial-job global-test10-human-secret-variable global-test11-human-existing-flow-informal-edit --models haiku,opus
- Haiku result before relaxing the helper path assertion: 2/4 passed.
  - global-test8-human-script-infer-path-language: behavior was correct and judge scored 100, but deterministic validation failed because the harness incorrectly required an f/ path while Haiku reasonably chose u/welcome_formatter.
  - global-test9-human-weekday-trial-job: failed judge threshold, score 72; Haiku created the script and schedule but interpreted the 30-day cutoff as expired-or-expiring-soon rather than ended-after-30-days.
  - global-test10-human-secret-variable: passed; tools used write_variable; no resource/deploy/delete calls.
  - global-test11-human-existing-flow-informal-edit: passed; tools used list_workspace_items, read_workspace_item, read_flow_module_code, set_flow_module_code.
- Opus result before relaxing the helper path assertion: 3/4 passed.
  - global-test8-human-script-infer-path-language: behavior was correct and judge scored 100, but deterministic validation failed for the same overly strict f/ path assertion; Opus chose u/user/welcome_message.
  - global-test9-human-weekday-trial-job: passed, judge score 90; tools used askUserQuestion, get_instructions, write_script, write_schedule.
  - global-test10-human-secret-variable: passed; tools used search_resource_types, write_variable; no resource/deploy/delete calls.
  - global-test11-human-existing-flow-informal-edit: passed, judge score 100; tools used list_workspace_items, read_workspace_item, read_flow_module_code, set_flow_module_code.
After removing the overly strict helper path assertion, reran global-test8-human-script-infer-path-language on Haiku and Opus:
- Command: cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8010 WMILL_AI_EVAL_BACKEND_WORKSPACE=integration-tests bun run cli -- run global global-test8-human-script-infer-path-language --models haiku,opus
- Haiku: passed, judge score 100; tools used get_instructions, write_script; chose Python at u/welcome_formatter.
- Opus: passed, judge score 100; tools used askUserQuestion, get_instructions, write_script; chose Bun at u/user/format_welcome_line.
Earlier provider matrix for the existing global suite through ai_evals:
- Original global cases 1-3: passed 3/3 on Haiku, GPT-4o, and Gemini Flash.
- Harder global cases 4-7: passed 4/4 on Haiku and GPT-4o; Gemini Flash passed 2/4, failing the secret-variable draft case and the ambiguous-app ask-question case because expected tool calls were missing.
- Fresh global-test1 smoke against the current backend: Haiku, GPT-4o, and Gemini Flash all passed.
UI deployment testing in the test workspace with global chat:
- Script create/deploy succeeded.
- Existing script update/redeploy succeeded.
- Schedule deployment succeeded.
- Secret variable deployment succeeded.
- App deployment was correctly refused by global chat and no app was created.
- Editor-open global chat script deployment succeeded.
- Flow deployment succeeded and was verified via the flow API.
- HTTP trigger deployment failed in this environment: write_trigger repeatedly hit a trigger existence endpoint returning 404, and the chat then became noisy by trying to redeploy the script. This remains a behavior/environment issue to investigate separately.
cd ai_evals && bun run typecheck
- Result: failed on existing broader TypeScript issues outside this PR path, starting with modes/cli.ts(83,11): Object literal may only specify known properties, and 'overwriteProjectGuidance' does not exist in type 'WriteAiGuidanceOptions', followed by many ../cli/... type-only import/API errors.

Review

Skipped local review per request.

cloudflare-workers-and-pages · 2026-05-26T00:50:48Z

Deploying windmill with Cloudflare Pages

Latest commit:	`696cf89`
Status:	✅ Deploy successful!
Preview URL:	https://d1c6a9bf.windmill.pages.dev
Branch Preview URL:	https://test-global-chat-evals.windmill.pages.dev

View logs

claude · 2026-05-27T02:28:05Z

Claude finished @centdix's task in 3m 17s —— View job

Reviewing PR #9320

cc @centdix

Mergeable, but should ideally address nits: weak valueIncludes: "true" assertion for is_secret in global-test6/test10; duplicated inline requirement types in validators.ts helpers.

This PR is entirely scoped to ai_evals/ test infrastructure — no production code is touched. It (1) re-points the eval runner from the (removed) globalDraftStore to the new userDraftAdapter-backed clearGlobalDrafts / listGlobalDrafts, (2) adds mocks for the three *WithDraft-style $lib/gen calls that the production global tools now use, (3) extends GlobalDraftRequirement with pathIncludes / pathStartsWith for path-free human prompts, and (4) adds four richer global cases plus four human-style cases. The refactor of findGlobalDraft looks correct — it filters by locator, prefers a content-matching candidate, and falls back to the first locator match so the per-snippet checks still produce meaningful failure details.

Findings

P2 — valueIncludes: "true" is a weak proxy for is_secret: true ai_evals/cases/global.yaml:178 (global-test6) and ai_evals/cases/global.yaml:264 (global-test10) both rely on the substring "true" to assert the variable draft is a secret. Because stringifyGlobalDraftValue does JSON.stringify(value, null, 2) and normalizeText only normalizes line endings, the boolean serializes as the bare token true. That string is short and unanchored, so any value containing "true" (e.g. a description like "trues up"), or future fields with "force": true, would also satisfy the check. Prefer is_secret or "is_secret": true (or both) to actually pin the secret-flag intent. The judge guards behavior in test6 but test10 has skipJudge: true, so this is the only mechanical check that the secret flag is set.

P2 — duplicated inline requirement-shape types in validators.ts ai_evals/core/validators.ts:618-642, :645-666, :668-686, :691-714 all redeclare overlapping inline object types for the requirement. GlobalDraftRequirement from core/types.ts already covers the superset (type, optional path, pathIncludes, pathStartsWith, triggerKind, summaryIncludes, valueIncludes, valueExcludes). Reusing the existing type (or splitting it into GlobalDraftLocator + GlobalDraftContent aliases) avoids future drift between the runtime shape and the YAML-facing schema.

Nit (non-blocking) — forbiddenDrafts doesn't get the flexible matching ai_evals/core/types.ts:127-131 still requires path: string for forbidden entries while requiredDrafts is now flexible. Probably intentional (forbidding "any draft whose path includes X" is rarely what you want), but worth a comment so future authors don't extend it ad hoc.

AGENTS.md compliance

Prompts match the "global-specific rules" guidance — human-style cases (test8-test11) intentionally do not specify paths, languages, or windmill internals, which is exactly what the doc encourages.
Deterministic validation is narrow (draft count + required types + a few content snippets), with judgeChecklist carrying semantic expectations — consistent with the "core rules" section.
New initial: fixture path follows the existing ai_evals/fixtures/frontend/... convention used across app.yaml.

Test coverage

Backend (Rust): not touched, none required.
Frontend pure-logic: validators.test.ts adds two new tests for the flexible matcher (positive and failure-message cases). Adapter-side changes are covered by the existing vitestAdapter.test.ts shape (the new mocks are direct analogues of the previous getScriptByPath / getFlowByPath mocks).
Eval-suite (live model runs): PR description shows Haiku + Opus passing the new cases post-fix, plus the broader provider matrix on the original cases — sufficient.

Manual verification still warranted before merge: re-run cd ai_evals && bun test adapters/frontend/vitestAdapter.test.ts core/validators.test.ts core/cases.test.ts modes/frontendCommon.test.ts adapters/frontend/core/shared/providerConfig.test.ts once the userDraftAdapter rename has fully landed in main, since the imports in globalEvalRunner.ts cross the frontend/backend boundary and the runner won't load if the exports drift. The pre-existing bun run typecheck failures in modes/cli.ts are unrelated to this PR and noted by the author.
· Branch

github-actions · 2026-05-27T02:30:35Z

Now I have a thorough understanding of all the changes. Let me compose the review.

Pi Review

Good to merge

This PR adds global chat eval coverage and tightens the frontend eval harness — purely test infrastructure and eval case definitions. The changes are clean, well-structured, and backward-compatible.

What changed

globalEvalRunner.ts — Switches draft state capture from globalDraftStore (a Svelte store) to clearGlobalDrafts/listGlobalDrafts from userDraftAdapter, matching the production global chat code path (core.ts line 1351).
types.ts — Makes GlobalDraftRequirement.path optional and adds pathIncludes/pathStartsWith for flexible draft matching without exact paths.
validators.ts — Refactors findGlobalDraft into a two-phase locator+content matcher with flexible path filters. Existing exact-path matching is unchanged (fallback behavior preserves old semantics).
validators.test.ts — Two new tests for the flexible path matching.
vitestAdapter.test.ts — Mocks getScriptByPathWithDraft, getFlowByPathWithDraft, and getFlowLatestVersion for benchmark workspaces, needed by flow-edit eval cases.
global.yaml — Eight new eval cases (global-test4 through global-test11) covering multi-artifact drafts, flow editing, secret variables, ambiguous-app questioning, and human-style prompts without explicit paths.
process_invoice_flow.json — Fixture for the invoice-processing flow used by the editing test cases.

Findings

P2 — ai_evals/core/types.ts:127 — The forbiddenDrafts type still requires path: string (non-optional) while requiredDrafts now uses the new GlobalDraftRequirement with path?: string. This is a minor inconsistency — forbiddenDrafts can't use the new flexible path filters (pathIncludes/pathStartsWith). Not a bug since the existing forbiddenDrafts entries in the YAML all use exact paths, but a future contributor adding a forbiddenDrafts entry for a human-style prompt might expect to be able to write pathIncludes there too.

P2 — ai_evals/core/validators.ts:619-643 — findGlobalDraft now has content-aware matching (via globalDraftMatchesContent) with a fallback to candidates[0] when no candidate satisfies the content constraints. The call site independently re-validates content properties (summaryIncludes, valueIncludes, valueExcludes) so this is functionally correct, but the two-phase approach with fallback is a subtlety future maintainers may not expect when reading the function in isolation. Consider either: (a) merging locator+content into a single pass, or (b) documenting the fallback intent in a comment.

Test coverage

Test infrastructure (ai_evals): The PR adds validator unit tests covering the new flexible path matching (pathIncludes acceptance and rejection). The vitest adapter mocks follow the existing pattern. Eval cases themselves are validated by running against real models, which the PR body documents extensively (Haiku, Opus, GPT-4o, Gemini Flash).
No backend or Svelte changes — no integration tests needed.
Manual verification: Run the new global eval cases (global-test8 through global-test11) against the integration-tests workspace with wmill ai evals -- run global to confirm they pass on current models. The PR body already documents these runs and the results.

github-actions · 2026-05-27T02:32:22Z

Codex Review

Good to merge

No issues found. Checked for bugs, security, and AGENTS.md compliance.

Test coverage

The PR adds focused validator coverage and adapter mock coverage for the changed eval harness behavior. No backend or in-app UI surface is touched. Manual verification is the global eval execution described in the PR body; bun run typecheck remains blocked by existing broader TypeScript issues outside this PR path.

cubic-dev-ai

1 issue found across 7 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="ai_evals/core/validators.ts">

<violation number="1" location="ai_evals/core/validators.ts:634">
P2: Required draft matching does not consider `required.language`, so validation can fail incorrectly when multiple drafts match locator filters.</violation>
</file>

_{Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic}

cubic-dev-ai · 2026-05-27T02:35:05Z

+    globalDraftMatchesLocator(draft, requirement)
+  );
+  return (
+    candidates.find((draft) => globalDraftMatchesContent(draft, requirement)) ??


P2: Required draft matching does not consider required.language, so validation can fail incorrectly when multiple drafts match locator filters.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At ai_evals/core/validators.ts, line 634: <comment>Required draft matching does not consider `required.language`, so validation can fail incorrectly when multiple drafts match locator filters.</comment> <file context> @@ -615,18 +616,102 @@ function summarizeProblems(problems: string[], limit = 5): string | undefined { + globalDraftMatchesLocator(draft, requirement) + ); + return ( + candidates.find((draft) => globalDraftMatchesContent(draft, requirement)) ?? + candidates[0] + ); </file context>

centdix added 2 commits May 26, 2026 02:28

test: improve global chat eval parity

e58bf9d

test: add human-style global chat evals

696cf89

centdix marked this pull request as ready for review May 27, 2026 02:27

centdix requested review from alpetric, hugocasa and rubenfiszel as code owners May 27, 2026 02:27

cubic-dev-ai Bot reviewed May 27, 2026

View reviewed changes

rubenfiszel merged commit e29dfba into main May 27, 2026
31 checks passed

rubenfiszel deleted the test-global-chat-evals branch May 27, 2026 12:20

github-actions Bot locked and limited conversation to collaborators May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add global chat eval coverage#9320

test: add global chat eval coverage#9320
rubenfiszel merged 2 commits into
mainfrom
test-global-chat-evals

centdix commented May 26, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented May 26, 2026

Uh oh!

claude Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot May 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

centdix commented May 26, 2026

Summary

Changes

Test plan and results

Review

Uh oh!

cloudflare-workers-and-pages Bot commented May 26, 2026

Deploying windmill with Cloudflare Pages

Uh oh!

claude Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewing PR #9320

Findings

AGENTS.md compliance

Test coverage

Uh oh!

github-actions Bot commented May 27, 2026

Pi Review

What changed

Findings

Test coverage

Uh oh!

github-actions Bot commented May 27, 2026

Codex Review

Test coverage

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented May 27, 2026 •

edited

Loading

cubic-dev-ai Bot May 27, 2026 •

edited

Loading