Skip to content

test: add global chat eval coverage#9320

Merged
rubenfiszel merged 2 commits into
mainfrom
test-global-chat-evals
May 27, 2026
Merged

test: add global chat eval coverage#9320
rubenfiszel merged 2 commits into
mainfrom
test-global-chat-evals

Conversation

@centdix
Copy link
Copy Markdown
Collaborator

@centdix centdix commented May 26, 2026

Summary

Adds broader global chat eval coverage and tightens the frontend eval harness so ai_evals reproduces production-like global chat behavior more closely.

This PR covers two things:

  • global chat eval parity improvements for frontend/global runs
  • additional human-style global chat prompts where the user does not explicitly provide exact paths or script languages

Changes

  • Make global chat eval runner state capture closer to the real draft behavior used by production global chat.
  • Add a fixture for editing an existing invoice-processing flow from global chat.
  • Add flexible global draft validation for path-free human prompts with optional pathIncludes and pathStartsWith requirements.
  • Add global chat cases for:
    • inferred script path/language for a welcome helper
    • a draft weekday trial-check job with script plus schedule
    • a draft Slack secret variable
    • finding and editing an existing invoice flow without an explicit path

Test plan and results

  • cd ai_evals && bun test adapters/frontend/vitestAdapter.test.ts adapters/frontend/core/shared/providerConfig.test.ts modes/frontendCommon.test.ts core/validators.test.ts core/cases.test.ts
    • Result: passed, 46 pass, 1 skip, 0 fail.
  • cd ai_evals && bun test core/validators.test.ts core/cases.test.ts
    • Result: passed, 38 pass, 0 fail.
  • cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8010 WMILL_AI_EVAL_BACKEND_WORKSPACE=integration-tests bun run cli -- run global global-test8-human-script-infer-path-language global-test9-human-weekday-trial-job global-test10-human-secret-variable global-test11-human-existing-flow-informal-edit --models haiku,opus
    • Haiku result before relaxing the helper path assertion: 2/4 passed.
      • global-test8-human-script-infer-path-language: behavior was correct and judge scored 100, but deterministic validation failed because the harness incorrectly required an f/ path while Haiku reasonably chose u/welcome_formatter.
      • global-test9-human-weekday-trial-job: failed judge threshold, score 72; Haiku created the script and schedule but interpreted the 30-day cutoff as expired-or-expiring-soon rather than ended-after-30-days.
      • global-test10-human-secret-variable: passed; tools used write_variable; no resource/deploy/delete calls.
      • global-test11-human-existing-flow-informal-edit: passed; tools used list_workspace_items, read_workspace_item, read_flow_module_code, set_flow_module_code.
    • Opus result before relaxing the helper path assertion: 3/4 passed.
      • global-test8-human-script-infer-path-language: behavior was correct and judge scored 100, but deterministic validation failed for the same overly strict f/ path assertion; Opus chose u/user/welcome_message.
      • global-test9-human-weekday-trial-job: passed, judge score 90; tools used askUserQuestion, get_instructions, write_script, write_schedule.
      • global-test10-human-secret-variable: passed; tools used search_resource_types, write_variable; no resource/deploy/delete calls.
      • global-test11-human-existing-flow-informal-edit: passed, judge score 100; tools used list_workspace_items, read_workspace_item, read_flow_module_code, set_flow_module_code.
  • After removing the overly strict helper path assertion, reran global-test8-human-script-infer-path-language on Haiku and Opus:
    • Command: cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8010 WMILL_AI_EVAL_BACKEND_WORKSPACE=integration-tests bun run cli -- run global global-test8-human-script-infer-path-language --models haiku,opus
    • Haiku: passed, judge score 100; tools used get_instructions, write_script; chose Python at u/welcome_formatter.
    • Opus: passed, judge score 100; tools used askUserQuestion, get_instructions, write_script; chose Bun at u/user/format_welcome_line.
  • Earlier provider matrix for the existing global suite through ai_evals:
    • Original global cases 1-3: passed 3/3 on Haiku, GPT-4o, and Gemini Flash.
    • Harder global cases 4-7: passed 4/4 on Haiku and GPT-4o; Gemini Flash passed 2/4, failing the secret-variable draft case and the ambiguous-app ask-question case because expected tool calls were missing.
    • Fresh global-test1 smoke against the current backend: Haiku, GPT-4o, and Gemini Flash all passed.
  • UI deployment testing in the test workspace with global chat:
    • Script create/deploy succeeded.
    • Existing script update/redeploy succeeded.
    • Schedule deployment succeeded.
    • Secret variable deployment succeeded.
    • App deployment was correctly refused by global chat and no app was created.
    • Editor-open global chat script deployment succeeded.
    • Flow deployment succeeded and was verified via the flow API.
    • HTTP trigger deployment failed in this environment: write_trigger repeatedly hit a trigger existence endpoint returning 404, and the chat then became noisy by trying to redeploy the script. This remains a behavior/environment issue to investigate separately.
  • cd ai_evals && bun run typecheck
    • Result: failed on existing broader TypeScript issues outside this PR path, starting with modes/cli.ts(83,11): Object literal may only specify known properties, and 'overwriteProjectGuidance' does not exist in type 'WriteAiGuidanceOptions', followed by many ../cli/... type-only import/API errors.

Review

Skipped local review per request.

@cloudflare-workers-and-pages
Copy link
Copy Markdown

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: 696cf89
Status: ✅  Deploy successful!
Preview URL: https://d1c6a9bf.windmill.pages.dev
Branch Preview URL: https://test-global-chat-evals.windmill.pages.dev

View logs

@centdix centdix marked this pull request as ready for review May 27, 2026 02:27
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 27, 2026

Claude finished @centdix's task in 3m 17s —— View job


Reviewing PR #9320

  • Load review tools
  • Read AGENTS.md and surrounding context
  • Read changed files
  • Analyze diff for bugs / AGENTS.md violations
  • Assess test coverage
  • Post review

cc @centdix

Mergeable, but should ideally address nits: weak valueIncludes: "true" assertion for is_secret in global-test6/test10; duplicated inline requirement types in validators.ts helpers.

This PR is entirely scoped to ai_evals/ test infrastructure — no production code is touched. It (1) re-points the eval runner from the (removed) globalDraftStore to the new userDraftAdapter-backed clearGlobalDrafts / listGlobalDrafts, (2) adds mocks for the three *WithDraft-style $lib/gen calls that the production global tools now use, (3) extends GlobalDraftRequirement with pathIncludes / pathStartsWith for path-free human prompts, and (4) adds four richer global cases plus four human-style cases. The refactor of findGlobalDraft looks correct — it filters by locator, prefers a content-matching candidate, and falls back to the first locator match so the per-snippet checks still produce meaningful failure details.

Findings

P2 — valueIncludes: "true" is a weak proxy for is_secret: true   ai_evals/cases/global.yaml:178 (global-test6) and ai_evals/cases/global.yaml:264 (global-test10) both rely on the substring "true" to assert the variable draft is a secret. Because stringifyGlobalDraftValue does JSON.stringify(value, null, 2) and normalizeText only normalizes line endings, the boolean serializes as the bare token true. That string is short and unanchored, so any value containing "true" (e.g. a description like "trues up"), or future fields with "force": true, would also satisfy the check. Prefer is_secret or "is_secret": true (or both) to actually pin the secret-flag intent. The judge guards behavior in test6 but test10 has skipJudge: true, so this is the only mechanical check that the secret flag is set.

P2 — duplicated inline requirement-shape types in validators.ts   ai_evals/core/validators.ts:618-642, :645-666, :668-686, :691-714 all redeclare overlapping inline object types for the requirement. GlobalDraftRequirement from core/types.ts already covers the superset (type, optional path, pathIncludes, pathStartsWith, triggerKind, summaryIncludes, valueIncludes, valueExcludes). Reusing the existing type (or splitting it into GlobalDraftLocator + GlobalDraftContent aliases) avoids future drift between the runtime shape and the YAML-facing schema.

Nit (non-blocking) — forbiddenDrafts doesn't get the flexible matching   ai_evals/core/types.ts:127-131 still requires path: string for forbidden entries while requiredDrafts is now flexible. Probably intentional (forbidding "any draft whose path includes X" is rarely what you want), but worth a comment so future authors don't extend it ad hoc.

AGENTS.md compliance

  • Prompts match the "global-specific rules" guidance — human-style cases (test8-test11) intentionally do not specify paths, languages, or windmill internals, which is exactly what the doc encourages.
  • Deterministic validation is narrow (draft count + required types + a few content snippets), with judgeChecklist carrying semantic expectations — consistent with the "core rules" section.
  • New initial: fixture path follows the existing ai_evals/fixtures/frontend/... convention used across app.yaml.

Test coverage

  • Backend (Rust): not touched, none required.
  • Frontend pure-logic: validators.test.ts adds two new tests for the flexible matcher (positive and failure-message cases). Adapter-side changes are covered by the existing vitestAdapter.test.ts shape (the new mocks are direct analogues of the previous getScriptByPath / getFlowByPath mocks).
  • Eval-suite (live model runs): PR description shows Haiku + Opus passing the new cases post-fix, plus the broader provider matrix on the original cases — sufficient.

Manual verification still warranted before merge: re-run cd ai_evals && bun test adapters/frontend/vitestAdapter.test.ts core/validators.test.ts core/cases.test.ts modes/frontendCommon.test.ts adapters/frontend/core/shared/providerConfig.test.ts once the userDraftAdapter rename has fully landed in main, since the imports in globalEvalRunner.ts cross the frontend/backend boundary and the runner won't load if the exports drift. The pre-existing bun run typecheck failures in modes/cli.ts are unrelated to this PR and noted by the author.
· Branch

@github-actions
Copy link
Copy Markdown
Contributor

Now I have a thorough understanding of all the changes. Let me compose the review.

Pi Review

Good to merge

This PR adds global chat eval coverage and tightens the frontend eval harness — purely test infrastructure and eval case definitions. The changes are clean, well-structured, and backward-compatible.

What changed

  • globalEvalRunner.ts — Switches draft state capture from globalDraftStore (a Svelte store) to clearGlobalDrafts/listGlobalDrafts from userDraftAdapter, matching the production global chat code path (core.ts line 1351).
  • types.ts — Makes GlobalDraftRequirement.path optional and adds pathIncludes/pathStartsWith for flexible draft matching without exact paths.
  • validators.ts — Refactors findGlobalDraft into a two-phase locator+content matcher with flexible path filters. Existing exact-path matching is unchanged (fallback behavior preserves old semantics).
  • validators.test.ts — Two new tests for the flexible path matching.
  • vitestAdapter.test.ts — Mocks getScriptByPathWithDraft, getFlowByPathWithDraft, and getFlowLatestVersion for benchmark workspaces, needed by flow-edit eval cases.
  • global.yaml — Eight new eval cases (global-test4 through global-test11) covering multi-artifact drafts, flow editing, secret variables, ambiguous-app questioning, and human-style prompts without explicit paths.
  • process_invoice_flow.json — Fixture for the invoice-processing flow used by the editing test cases.

Findings

P2ai_evals/core/types.ts:127 — The forbiddenDrafts type still requires path: string (non-optional) while requiredDrafts now uses the new GlobalDraftRequirement with path?: string. This is a minor inconsistency — forbiddenDrafts can't use the new flexible path filters (pathIncludes/pathStartsWith). Not a bug since the existing forbiddenDrafts entries in the YAML all use exact paths, but a future contributor adding a forbiddenDrafts entry for a human-style prompt might expect to be able to write pathIncludes there too.

P2ai_evals/core/validators.ts:619-643findGlobalDraft now has content-aware matching (via globalDraftMatchesContent) with a fallback to candidates[0] when no candidate satisfies the content constraints. The call site independently re-validates content properties (summaryIncludes, valueIncludes, valueExcludes) so this is functionally correct, but the two-phase approach with fallback is a subtlety future maintainers may not expect when reading the function in isolation. Consider either: (a) merging locator+content into a single pass, or (b) documenting the fallback intent in a comment.

Test coverage

  • Test infrastructure (ai_evals): The PR adds validator unit tests covering the new flexible path matching (pathIncludes acceptance and rejection). The vitest adapter mocks follow the existing pattern. Eval cases themselves are validated by running against real models, which the PR body documents extensively (Haiku, Opus, GPT-4o, Gemini Flash).
  • No backend or Svelte changes — no integration tests needed.
  • Manual verification: Run the new global eval cases (global-test8 through global-test11) against the integration-tests workspace with wmill ai evals -- run global to confirm they pass on current models. The PR body already documents these runs and the results.

@github-actions
Copy link
Copy Markdown
Contributor

Codex Review

Good to merge

No issues found. Checked for bugs, security, and AGENTS.md compliance.

Test coverage

The PR adds focused validator coverage and adapter mock coverage for the changed eval harness behavior. No backend or in-app UI surface is touched. Manual verification is the global eval execution described in the PR body; bun run typecheck remains blocked by existing broader TypeScript issues outside this PR path.

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="ai_evals/core/validators.ts">

<violation number="1" location="ai_evals/core/validators.ts:634">
P2: Required draft matching does not consider `required.language`, so validation can fail incorrectly when multiple drafts match locator filters.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

globalDraftMatchesLocator(draft, requirement)
);
return (
candidates.find((draft) => globalDraftMatchesContent(draft, requirement)) ??
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Required draft matching does not consider required.language, so validation can fail incorrectly when multiple drafts match locator filters.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At ai_evals/core/validators.ts, line 634:

<comment>Required draft matching does not consider `required.language`, so validation can fail incorrectly when multiple drafts match locator filters.</comment>

<file context>
@@ -615,18 +616,102 @@ function summarizeProblems(problems: string[], limit = 5): string | undefined {
+    globalDraftMatchesLocator(draft, requirement)
+  );
+  return (
+    candidates.find((draft) => globalDraftMatchesContent(draft, requirement)) ??
+    candidates[0]
+  );
</file context>
Fix with Cubic

@rubenfiszel rubenfiszel merged commit e29dfba into main May 27, 2026
31 checks passed
@rubenfiszel rubenfiszel deleted the test-global-chat-evals branch May 27, 2026 12:20
@github-actions github-actions Bot locked and limited conversation to collaborators May 27, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants