test: add global chat eval coverage #9320
Deploying windmill with
| Latest commit: | 696cf89 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://d1c6a9bf.windmill.pages.dev |
| Branch Preview URL: | https://test-global-chat-evals.windmill.pages.dev |
Claude finished @centdix's task in 3m 17s
Reviewing PR #9320
cc @centdix
Mergeable, but should ideally address nits: weak
This PR is entirely scoped to
Findings
P2
P2: duplicated inline requirement-shape types in
Nit (non-blocking): AGENTS.md compliance
Test coverage
Manual verification still warranted before merge: re-run
Now I have a thorough understanding of all the changes. Let me compose the review.
Pi Review
Good to merge
This PR adds global chat eval coverage and tightens the frontend eval harness: purely test infrastructure and eval case definitions. The changes are clean, well-structured, and backward-compatible.
What changed
Findings
P2
P2
Test coverage
Codex Review
Good to merge
No issues found. Checked for bugs, security, and
Test coverage
The PR adds focused validator coverage and adapter mock coverage for the changed eval harness behavior. No backend or in-app UI surface is touched. Manual verification is the global eval execution described in the PR body;
1 issue found across 7 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="ai_evals/core/validators.ts">
<violation number="1" location="ai_evals/core/validators.ts:634">
P2: Required draft matching does not consider `required.language`, so validation can fail incorrectly when multiple drafts match locator filters.</violation>
</file>
  globalDraftMatchesLocator(draft, requirement)
);
return (
  candidates.find((draft) => globalDraftMatchesContent(draft, requirement)) ??
P2: Required draft matching does not consider required.language, so validation can fail incorrectly when multiple drafts match locator filters.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At ai_evals/core/validators.ts, line 634:
<comment>Required draft matching does not consider `required.language`, so validation can fail incorrectly when multiple drafts match locator filters.</comment>
<file context>
@@ -615,18 +616,102 @@ function summarizeProblems(problems: string[], limit = 5): string | undefined {
+ globalDraftMatchesLocator(draft, requirement)
+ );
+ return (
+ candidates.find((draft) => globalDraftMatchesContent(draft, requirement)) ??
+ candidates[0]
+ );
</file context>
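One way the finding above could be addressed is sketched below. This is only an illustration under stated assumptions: the `Draft` and `DraftRequirement` shapes, the three `matches*` helpers, and `findRequiredDraft` are hypothetical stand-ins, since the real types and the bodies of `globalDraftMatchesLocator` and `globalDraftMatchesContent` are not shown in this thread. The idea is to make `language` part of candidate selection, so that a draft in the wrong language is not picked just because it is the first locator match.

```typescript
// Hypothetical shapes: the real types in ai_evals/core/validators.ts are not shown here.
interface Draft {
  path: string;
  language?: string;
  content: string;
}

interface DraftRequirement {
  pathStartsWith?: string;
  pathIncludes?: string;
  language?: string;
  contentIncludes?: string;
}

// Stand-in for globalDraftMatchesLocator: path-based filters only (assumed behavior).
function matchesLocator(draft: Draft, req: DraftRequirement): boolean {
  if (req.pathStartsWith !== undefined && !draft.path.startsWith(req.pathStartsWith)) {
    return false;
  }
  if (req.pathIncludes !== undefined && !draft.path.includes(req.pathIncludes)) {
    return false;
  }
  return true;
}

// Assumed fix: an unset language matches anything, a set language must match exactly.
function matchesLanguage(draft: Draft, req: DraftRequirement): boolean {
  return req.language === undefined || draft.language === req.language;
}

// Stand-in for globalDraftMatchesContent (assumed behavior).
function matchesContent(draft: Draft, req: DraftRequirement): boolean {
  return req.contentIncludes === undefined || draft.content.includes(req.contentIncludes);
}

// Prefer a draft matching language and content, then language only,
// then fall back to the first locator match so a mismatch can still be reported.
function findRequiredDraft(drafts: Draft[], req: DraftRequirement): Draft | undefined {
  const candidates = drafts.filter((draft) => matchesLocator(draft, req));
  return (
    candidates.find((draft) => matchesLanguage(draft, req) && matchesContent(draft, req)) ??
    candidates.find((draft) => matchesLanguage(draft, req)) ??
    candidates[0]
  );
}
```

With two drafts under `u/` where only the second is written in the required language, this version selects the second draft rather than the first locator match. Keeping the final `candidates[0]` fallback preserves the existing behavior when no candidate satisfies the language constraint.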
Summary
Adds broader global chat eval coverage and tightens the frontend eval harness so `ai_evals` reproduces production-like global chat behavior more closely.
This PR covers two things:
Changes
`pathIncludes` and `pathStartsWith` requirements.
Test plan and results
- `cd ai_evals && bun test adapters/frontend/vitestAdapter.test.ts adapters/frontend/core/shared/providerConfig.test.ts modes/frontendCommon.test.ts core/validators.test.ts core/cases.test.ts`
  - `46 pass`, `1 skip`, `0 fail`.
- `cd ai_evals && bun test core/validators.test.ts core/cases.test.ts`
  - `38 pass`, `0 fail`.
- `cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8010 WMILL_AI_EVAL_BACKEND_WORKSPACE=integration-tests bun run cli -- run global global-test8-human-script-infer-path-language global-test9-human-weekday-trial-job global-test10-human-secret-variable global-test11-human-existing-flow-informal-edit --models haiku,opus`
  - `2/4` passed.
    - `global-test8-human-script-infer-path-language`: behavior was correct and judge scored `100`, but deterministic validation failed because the harness incorrectly required an `f/` path while Haiku reasonably chose `u/welcome_formatter`.
    - `global-test9-human-weekday-trial-job`: failed judge threshold, score `72`; Haiku created the script and schedule but interpreted the 30-day cutoff as expired-or-expiring-soon rather than ended-after-30-days.
    - `global-test10-human-secret-variable`: passed; tools used `write_variable`; no resource/deploy/delete calls.
    - `global-test11-human-existing-flow-informal-edit`: passed; tools used `list_workspace_items`, `read_workspace_item`, `read_flow_module_code`, `set_flow_module_code`.
  - `3/4` passed.
    - `global-test8-human-script-infer-path-language`: behavior was correct and judge scored `100`, but deterministic validation failed for the same overly strict `f/` path assertion; Opus chose `u/user/welcome_message`.
    - `global-test9-human-weekday-trial-job`: passed, judge score `90`; tools used `askUserQuestion`, `get_instructions`, `write_script`, `write_schedule`.
    - `global-test10-human-secret-variable`: passed; tools used `search_resource_types`, `write_variable`; no resource/deploy/delete calls.
    - `global-test11-human-existing-flow-informal-edit`: passed, judge score `100`; tools used `list_workspace_items`, `read_workspace_item`, `read_flow_module_code`, `set_flow_module_code`.
- `global-test8-human-script-infer-path-language` on Haiku and Opus:
  - `cd ai_evals && WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8010 WMILL_AI_EVAL_BACKEND_WORKSPACE=integration-tests bun run cli -- run global global-test8-human-script-infer-path-language --models haiku,opus`
  - `100`; tools used `get_instructions`, `write_script`; chose Python at `u/welcome_formatter`.
  - `100`; tools used `askUserQuestion`, `get_instructions`, `write_script`; chose Bun at `u/user/format_welcome_line`.
- `ai_evals`: `3/3` on Haiku, GPT-4o, and Gemini Flash.
- `4/4` on Haiku and GPT-4o; Gemini Flash passed `2/4`, failing the secret-variable draft case and the ambiguous-app ask-question case because expected tool calls were missing.
- `test` workspace with global chat: `write_trigger` repeatedly hit a trigger existence endpoint returning `404`, and the chat then became noisy by trying to redeploy the script. This remains a behavior/environment issue to investigate separately.
- `cd ai_evals && bun run typecheck`
  - `modes/cli.ts(83,11): Object literal may only specify known properties, and 'overwriteProjectGuidance' does not exist in type 'WriteAiGuidanceOptions'`, followed by many `../cli/...` type-only import/API errors.
Review
Skipped local review per request.
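
The overly strict `f/` path assertion reported in the test plan can be illustrated with a small path check. This is a minimal sketch, not the harness code: the `PathRequirement` shape and the `pathProblems` helper are hypothetical, and only the field names `pathIncludes` and `pathStartsWith` come from the PR description.

```typescript
// Hypothetical requirement shape; only the two field names appear in the PR description.
interface PathRequirement {
  pathStartsWith?: string;
  pathIncludes?: string;
}

// Returns one human-readable problem per unmet constraint; an empty array means the path is accepted.
function pathProblems(path: string, req: PathRequirement): string[] {
  const problems: string[] = [];
  if (req.pathStartsWith !== undefined && !path.startsWith(req.pathStartsWith)) {
    problems.push(`expected path to start with "${req.pathStartsWith}", got "${path}"`);
  }
  if (req.pathIncludes !== undefined && !path.includes(req.pathIncludes)) {
    problems.push(`expected path to include "${req.pathIncludes}", got "${path}"`);
  }
  return problems;
}
```

Under this sketch, `pathProblems("u/welcome_formatter", { pathStartsWith: "f/" })` reports one problem, which mirrors the deterministic failure described for Haiku. A looser requirement such as `{ pathIncludes: "welcome" }` (a value chosen here purely for illustration) would accept `u/welcome_formatter`, `u/user/welcome_message`, and `u/user/format_welcome_line` alike.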