Releases: microsoft/waza
Waza v0.31.0
What's Changed
- refactor: complete vocabulary renames — BenchmarkSpec→EvalSpec, TestRunner→EvalRunner (#166) by @spboyer in #222
- feat: support custom agent (.agent.md) file discovery and parsing #225 by @spboyer in #226
- fix: mock engine echoes file content for CI evals (#227) by @spboyer in #228
- fix: waza serve crashes when stdin is not a terminal by @spboyer in #224
- chore(deps): Bump postcss from 8.5.6 to 8.5.12 in /site by @dependabot[bot] in #229
- docs: cross-reference audit for recent renames and feature additions by @spboyer in #230
- Release v0.31.0 by @spboyer in #231
Full Changelog: v0.30.1...v0.31.0
Waza azd Extension v0.31.0
Changelog
All notable changes to waza will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[Unreleased]
[0.31.0] - 2026-04-28
Added
- Custom agent (`.agent.md`) eval support — Discover `.agent.md` files alongside `SKILL.md`, parse agent-specific frontmatter (`tools`, `model`, `handoffs`, `mcp-servers`, `agents`), auto-inject `tool_constraint` grader from agent `tools:` field, complete worked example under `examples/custom-agent/`, and new "Evaluating Custom Agents" docs guide (#226, closes #225)
Fixed
- Mock engine echoes file content — `_output_contains` expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
- `waza serve` no longer crashes when stdin isn't a terminal — MCP stdio server only starts when `term.IsTerminal()` is true; piped input or background mode no longer kills the HTTP dashboard (#224)
Changed
- Vocabulary renames — Internal types renamed: `BenchmarkSpec` → `EvalSpec`, `TestRunner` → `EvalRunner`. Not a breaking change for external consumers (types live in `internal/`) (#222)
Documentation
- Cross-reference audit for recent renames + custom agent feature: added `.agent.md` coverage to quickstart, getting-started, GUIDE, TUTORIAL, examples README; updated mock engine descriptions in INTEGRATION-TESTING and eval-yaml guide (#230)
Dependencies
- Bump postcss from 8.5.6 to 8.5.12 in /site (#229)
[0.30.1] - 2026-04-22
Documentation
- Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)
[0.30.0] - 2026-04-22
Added
- `waza quality` command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
- Scope-reduction advisory check — `waza check` now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)
[0.29.0] - 2026-04-22
Added
- `--keep-workspace` flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
- `--no-skills` flag and `disabled_skills` config — Disable specific skills during evaluation to isolate behavior (#126, #216)
- Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
- Per-task `skill_directories` — Specify different skill directories for individual tasks in eval YAML (#156, #215)
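To illustrate the per-task `skill_directories` entry, here is a small Go sketch of how a task-level list could take precedence over an eval-level one. The `Task` and `EvalConfig` types and the `effectiveSkillDirs` helper are hypothetical; only the `skill_directories` key comes from the changelog, and the fallback rule is an assumption.

```go
package main

import "fmt"

// Task and EvalConfig are hypothetical, trimmed-down shapes used only
// for this illustration.
type Task struct {
	Name             string
	SkillDirectories []string
}

type EvalConfig struct {
	SkillDirectories []string
	Tasks            []Task
}

// effectiveSkillDirs returns the task's own directories when it sets any,
// and falls back to the eval-level list otherwise (assumed precedence).
func effectiveSkillDirs(cfg EvalConfig, t Task) []string {
	if len(t.SkillDirectories) > 0 {
		return t.SkillDirectories
	}
	return cfg.SkillDirectories
}

func main() {
	cfg := EvalConfig{
		SkillDirectories: []string{"skills/common"},
		Tasks: []Task{
			{Name: "uses-default"},
			{Name: "uses-override", SkillDirectories: []string{"skills/azure"}},
		},
	}
	for _, t := range cfg.Tasks {
		fmt.Println(t.Name, effectiveSkillDirs(cfg, t))
	}
}
```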
Dependencies
- Bump astro and @astrojs/starlight in /site (#212)
[0.28.0] - 2026-04-21
Added
- Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
- `waza models` command — List all available models supported by the configured engine (#208)
- Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)
Fixed
- Stricter YAML validation — Audited all YAML parsers; unknown fields in `TestCase` definitions are now properly rejected (#132, #206)
- Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
- CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)
Documentation
- Added Quick Start guide to the documentation site (#205)
[0.27.0] - 2026-04-21
Added
- `output_contains_any` expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
- `max_response_time_ms` behavior rule — Enforce maximum response time constraints on agent execution (#201)
- Task prompt from file — Task `prompt` field can now reference an external file path instead of inline text (#157, #200)
- `tool_calls` grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)
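The "any one of the specified strings" rule behind `output_contains_any` can be sketched in a few lines of Go. The helper name `containsAny` is hypothetical, and case-sensitive substring matching is an assumption; the changelog does not say how waza compares strings.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether response contains at least one candidate.
// Matching is a case-sensitive substring test (an assumption), and an
// empty candidate list never passes in this sketch.
func containsAny(response string, candidates []string) bool {
	for _, c := range candidates {
		if strings.Contains(response, c) {
			return true
		}
	}
	return false
}

func main() {
	response := "Deployed the app to eastus."
	fmt.Println(containsAny(response, []string{"westus", "eastus"}))
	fmt.Println(containsAny(response, []string{"westus", "northeurope"}))
}
```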
Fixed
- Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)
[0.26.0] - 2026-04-21
Changed
- Timestamped output directories — `run --output-dir` now groups result files by timestamp for cleaner organization (#153)
- Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)
Fixed
- `--discover` finds eval.yaml in nested layout — Skill discovery now correctly locates `eval.yaml` files in `evals/{name}/` directories at the project root (#44)
- Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
- Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
- macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)
Documentation
- Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
- Updated demo guide and added CI/CD integration guide (#112, #89, #194)
Dependencies
- Bump defu from 6.1.4 to 6.1.6 in /site (#181)
- Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
- Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
- Bump astro from 5.17.3 to 5.18.1 in /site (#163)
- Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
- Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)
[0.25.0] - 2026-04-21
Added
- Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)
Fixed
- SKILL.md injection and trigger fixture loading — `waza run` now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)
Dependencies
- Bump h3 from 1.15.5 to 1.15.8 in /site (#144)
[0.24.0] - 2026-03-25
Changed
- Strict YAML validation — All YAML parsers now use `KnownFields(true)` to reject unknown fields, catching typos and misconfigurations early (#132, #133)
- `max_workers` renamed to `workers` — Config YAML key renamed for consistency across all config types (breaking change)
- Unified token counting — `waza check` and `waza tokens count` now share the same counting logic for consistent results (#146)
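The two changes above interact: once unknown fields are rejected, a config that still uses the old `max_workers` key fails to load instead of being silently ignored. The sketch below shows the idea with the standard library's `encoding/json` and `DisallowUnknownFields`, which is swapped in here as a stdlib analogue of a YAML decoder's `KnownFields(true)`. The `Config` type and `decodeStrict` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config is a hypothetical config type holding only the renamed key.
type Config struct {
	Workers int `json:"workers"`
}

// decodeStrict rejects any key that Config does not declare.
func decodeStrict(raw string) (Config, error) {
	var cfg Config
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields() // JSON counterpart of strict known-field decoding
	err := dec.Decode(&cfg)
	return cfg, err
}

func main() {
	// The old key is now an unknown field, so decoding fails loudly.
	if _, err := decodeStrict(`{"max_workers": 4}`); err != nil {
		fmt.Println("rejected:", err)
	}
	// The renamed key decodes normally.
	if cfg, err := decodeStrict(`{"workers": 4}`); err == nil {
		fmt.Println("accepted, workers =", cfg.Workers)
	}
}
```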
Fixed
- Typo in prompt grader — Fixed "prmopt" → "prompt" in error message
Dependencies
- Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
- Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)
[0.21.0] - 2026-03-12
Added
- `waza new task from-prompt` command — Record Copilot sessions into task YAML files for eval creation (#110)
- Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
- Eval scaffolding command — `waza eval new` generates eval.yaml scaffolding for skills (#94)
- Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
- Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
- Per-file token budget configuration — Configure token budgets per-file in `.waza.yaml` (#96)
- Skill-aware thresholds — `waza tokens compare` supports skill-specific threshold configuration (#93)
- Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
- CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
- FileWriter service — Refactored `waza init` inventory with FileWriter abstraction (#63)
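For the multi-trial flakiness entry, one simple working definition is that a task is flaky when its trials disagree. The Go sketch below uses that definition purely as an assumption; the changelog does not describe waza's actual criterion, and `isFlaky` is a hypothetical name.

```go
package main

import "fmt"

// isFlaky reports whether a task's trials disagree: at least one pass and
// at least one fail. This definition is an assumption for illustration.
func isFlaky(trialPassed []bool) bool {
	sawPass, sawFail := false, false
	for _, passed := range trialPassed {
		if passed {
			sawPass = true
		} else {
			sawFail = true
		}
	}
	return sawPass && sawFail
}

func main() {
	fmt.Println(isFlaky([]bool{true, true, true}))   // consistent pass
	fmt.Println(isFlaky([]bool{true, false, true}))  // mixed outcomes
	fmt.Println(isFlaky([]bool{false, false, false})) // consistent fail
}
```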
Fixed
- `waza suggest` deadlock — `Execute()` now applies the request timeout before calling `Start()`, preventing goroutine deadlock (#43)
- `ResourceFile.Content` type — Changed from `string` to `[]byte` for proper binary file handling (#117)
- `tokens compare` in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
- `--output-dir` ignored — Fixed `--output-dir` having no effect for single-skill runs (#109)
- Web dashboard build order — Build dashboard assets before Go compilation (#107)
- Test file leak — Fixed test that leaked files into the repo (#120)
- Config schema defaults — Aligned `config.schema.json` defaults with Go source of truth (#65)
- Skill discovery path — Discover skills under `.github/skills/` directory (#69)
Changed
- Renamed `config` node `max_workers` to `workers` for consistency across all config types
  - This is a breaking change
- Custom YAML deserializers for config types (#106)
- Validate only known fields in YAML decoders. (#132)
- Token limits priority inverted to `.waza.yaml` first (#64)
- `@wbreza` added to CODEOWNERS (#111)
- Go 1.26+ noted in agent instruction files (#108)
[0.9.0] - 2026-02-23
Added
- A/B baseline testing — `--baseline` flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
- Pairwise LLM judging — `pairwise` mode on `prompt` grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
- Tool constraint grader — New `tool_constraint` grader type with `expect_tools`, `reject_tools`, `max_turns`, `max_tokens` constraints. Validates agent tool usage behavior (#391)
- Auto skill discovery — `--discover` flag walks directory trees for SKILL.md + eval.yaml pairs. `--strict` mode fails if any skill lacks...
v0.30.1
v0.30.0
What's New in v0.30.0
New Features
- `waza quality` command (#98) — LLM-as-Judge skill quality scoring. Evaluates SKILL.md across 5 dimensions (clarity, completeness, trigger precision, scope coverage, anti-patterns) using the Copilot SDK. Scored 1-5 with visual bar output. Supports `--format json` for CI integration. (@spboyer)
- Scope reduction advisory (#183) — `waza check` now warns when a skill has low capability scope, detecting potential token-limit compression loss. Parses USE FOR phrases, headings, and numbered procedures as capability signals. (@diberry)
Housekeeping
- Closed 5 stale issues that were already implemented: #59 (token limits priority), #86 (per-file budgets), #81 (tokens diff), #83 (eval scaffolding), #162 (TypeSpec user query — answered)
Full Changelog: v0.29.0...v0.30.0
v0.29.0
What's New in v0.29.0
New Features
- Per-task skill directories (#156) — Tasks can now override eval-level `skill_directories` with their own, enabling multi-skill eval suites. (@LarryOsterman)
- Disable skill loading (#126) — New `disabled_skills` config field and `--no-skills` CLI flag. Use `disabled_skills: ["*"]` to disable all skills for baseline/comparison testing. (@richardpark-msft)
- Debug workspace preservation (#123) — New `--keep-workspace` flag preserves temp workspace directories after execution for debugging fixture and agent file issues. (@richardpark-msft)
- Version update notifications (#104) — `waza run` now checks for new versions in the background (cached 24h, non-blocking). Disable with `--no-update-check` or `WAZA_NO_UPDATE_CHECK=1`. (@RickWinter)
Test Coverage
- Copilot log parsing edge cases (#115) — 23 new tests covering malformed JSON, truncated logs, binary data, unknown event types, and more. (@richardpark-msft)
Dependencies
- Bumped astro + @astrojs/starlight in site
Full Changelog: v0.28.0...v0.29.0
Waza azd Extension v0.30.1
[0.9.0] - 2026-02-23
Added
- A/B baseline testing — `--baseline` flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
- Pairwise LLM judging — `pairwise` mode on `prompt` grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
- Tool constraint grader — New `tool_constraint` grader type with `expect_tools`, `reject_tools`, `max_turns`, `max_tokens` constraints. Validates agent tool usage behavior (#391)
- Auto skill discovery — `--discover` flag walks directory trees for SKILL.md + eval.yaml pairs. `--strict` mode fails if any skill lacks eval coverage (#392)
- Releases page — New docs site page at `reference/releases` with platform download links, install commands, and azd extension info (#383)
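As an illustration of what a tool constraint check might involve, the Go sketch below evaluates a run against `expect_tools`, `reject_tools`, and `max_turns`. The field names come from the changelog entry; the `ToolConstraint` type, the `check` function, the violation messages, and the exact matching semantics are all assumptions, and `max_tokens` is left out for brevity.

```go
package main

import "fmt"

// ToolConstraint mirrors field names from the changelog. How waza actually
// interprets them is not stated there, so the logic below is assumed.
type ToolConstraint struct {
	ExpectTools []string
	RejectTools []string
	MaxTurns    int // 0 means "no limit" in this sketch
}

// check returns one message per violated constraint; an empty result
// means the run satisfied every constraint.
func check(c ToolConstraint, usedTools []string, turns int) []string {
	used := make(map[string]bool, len(usedTools))
	for _, t := range usedTools {
		used[t] = true
	}
	var violations []string
	for _, t := range c.ExpectTools {
		if !used[t] {
			violations = append(violations, "expected tool not called: "+t)
		}
	}
	for _, t := range c.RejectTools {
		if used[t] {
			violations = append(violations, "rejected tool was called: "+t)
		}
	}
	if c.MaxTurns > 0 && turns > c.MaxTurns {
		violations = append(violations,
			fmt.Sprintf("turns %d exceed max_turns %d", turns, c.MaxTurns))
	}
	return violations
}

func main() {
	c := ToolConstraint{
		ExpectTools: []string{"read_file"}, // tool names are made up
		RejectTools: []string{"shell"},
		MaxTurns:    3,
	}
	for _, v := range check(c, []string{"shell"}, 5) {
		fmt.Println(v)
	}
}
```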
Fixed
- Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings
Changed
- Competitive research — Added OpenAI Evals analysis (`docs/research/waza-vs-openai-evals.md`), skill-validator analysis (`docs/research/waza-vs-skill-validator.md`), and eval registry design doc (`docs/research/waza-eval-registry-design.md`)
- Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md
[0.8.0] - 2026-02-21
Added
- MCP Server — `waza serve` now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
- `waza suggest` command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: `--model`, `--dry-run`, `--apply`, `--output-dir`, `--format` (#287)
- Interactive workflow skill — `skills/waza-interactive/SKILL.md` with 5 workflow scenarios for conversational eval orchestration (#288)
- Grader weighting — `weight` field on grader configs, `ComputeWeightedRunScore` method, dashboard weighted scores column (#299)
- Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
- Judge model support — `--judge-model` flag and `judge_model` config for separate LLM-as-judge model (#309)
- Spec compliance checks — 8 agentskills.io compliance checks in `waza check` and `waza dev` (#314)
- SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
- MCP integration scoring — 4 MCP integration checks in `waza dev` (#316)
- Batch skill processing — `waza dev` processes multiple skills in one run (#317)
- Token compare --strict — Budget enforcement mode for `waza tokens compare` (#318)
- Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
- Skill profile — `waza tokens profile` for static analysis of skill token distribution (#311)
- JUnit XML reporter — `--format junit` output for CI integration (#312)
- Template Variables — New `internal/template` package with `Render()` for Go text/template syntax in hooks and commands. System variables: `JobID`, `TaskName`, `Iteration`, `Attempt`, `Timestamp`. User variables via `vars` map (#186)
- GroupBy Results — New `group_by` config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes `GroupStats` with name/passed/total/avg_score (#188)
- Custom Input Variables — New `inputs` section in eval.yaml for defining key-value pairs available as `{{.Vars.key}}` throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
- CSV Dataset Support — New `tasks_from` field to generate tasks from CSV files. Each row becomes a task with columns accessible as `{{.Vars.column}}`. Optional `range: [start, end]` for row filtering. First row treated as headers (#187)
- Retry/Attempts — Add `max_attempts` config field for retrying failed task executions within each trial (#191)
- Lifecycle Hooks — Add `hooks` section with `before_run`/`after_run`/`before_task`/`after_task` lifecycle points (#191)
- `prompt` grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
  - Two modes: `clean` (fresh context) and `continue_session` (resumes test session)
  - Tool-based grading: `set_waza_grade_pass` and `set_waza_grade_fail` tools for LLM graders
  - Separate judge model configuration: run evaluation with a different model than the executor
  - Pre-built rubric templates adapted from Azure ML evaluators
- `trigger_tests.yaml` auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
  - New `internal/trigger/` package for trigger testing
  - Automatically discovered alongside `eval.yaml`
  - Confidence weighting: `high` (weight 1.0) and `medium` (weight 0.5) for borderline cases
  - `trigger_accuracy` metric with configurable cutoff threshold
  - Metrics: accuracy, precision, recall, F1, error count
- `diff` grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
- Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in `examples/rubrics/` adapted from Azure ML evaluators (#160, #161):
  - Tool call rubrics: `tool_call_accuracy`, `tool_selection`, `tool_input_accuracy`, `tool_output_utilization`
  - Task evaluation rubrics: `task_completion`, `task_adherence`, `intent_resolution`, `response_completeness`
- MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
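The template variable entries in the list above describe Go text/template placeholders such as `{{.Vars.key}}` alongside system variables like `TaskName`. The sketch below shows how such a string can be expanded with the standard `text/template` package. It is not the `internal/template` package itself: the `Context` type and `render` function are hypothetical, only a subset of the system variables is modeled, and failing on a missing key is a design choice made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Context is a hypothetical data shape: a few of the system variables
// named in the changelog plus a Vars map for user-defined inputs.
type Context struct {
	JobID    string
	TaskName string
	Attempt  int
	Vars     map[string]string
}

// render expands a Go text/template string against ctx. The
// missingkey=error option turns a reference to an undefined Vars key
// into an error instead of silently printing an empty value.
func render(tmpl string, ctx Context) (string, error) {
	t, err := template.New("hook").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, ctx); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	ctx := Context{
		JobID:    "job-1",
		TaskName: "deploy",
		Attempt:  2,
		Vars:     map[string]string{"region": "eastus"},
	}
	cmd, err := render("echo {{.TaskName}} in {{.Vars.region}} (attempt {{.Attempt}})", ctx)
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(cmd)
}
```

With CSV-driven tasks, the same mechanism would apply per row, with each column name taking the place of `region` in the `Vars` map.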
Changed
- Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
- Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)
Fixed
- install.sh macOS checksum — added `shasum -a 256` fallback for macOS (which lacks `sha256sum`) (#163)
- Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
- GitHub icon alignment and search bar width on docs site
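Both `sha256sum` and `shasum -a 256` compute the same SHA-256 digest, so the installer's checksum step reduces to comparing a computed digest with a published one. The Go sketch below expresses that comparison; it is an illustration of the check, not a port of `install.sh`, and `verifyChecksum` is a hypothetical helper.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifyChecksum compares the SHA-256 digest of data with an expected
// hex digest, ignoring surrounding whitespace and letter case.
func verifyChecksum(data []byte, expectedHex string) bool {
	sum := sha256.Sum256(data)
	want := strings.ToLower(strings.TrimSpace(expectedHex))
	return hex.EncodeToString(sum[:]) == want
}

func main() {
	// Well-known SHA-256 test vector for the ASCII string "abc".
	const abcDigest = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(verifyChecksum([]byte("abc"), abcDigest))
	fmt.Println(verifyChecksum([]byte("tampered"), abcDigest))
}
```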
[0.4.0-alpha.1] - 2026-02-17
Added
- Go cross-platform release pipeline — `go-release.yml` workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
- `install.sh` installer — one-line binary install with checksum verification: `curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash`
- `skill_invocation` grader — validates orchestration workflows by checking which sk...
Waza azd Extension v0.30.0
Waza azd Extension v0.29.0
Changelog
All notable changes to waza will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[Unreleased]
[0.24.0] - 2026-03-25
Changed
- Strict YAML validation — All YAML parsers now use
KnownFields(true)to reject unknown fields, catching typos and misconfigurations early (#132, #133) max_workersrenamed toworkers— Config YAML key renamed for consistency across all config types (breaking change)- Unified token counting —
waza checkandwaza tokens countnow share the same counting logic for consistent results (#146)
Fixed
- Typo in prompt grader — Fixed "prmopt" → "prompt" in error message
Dependencies
- Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
- Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)
[0.21.0] - 2026-03-12
Added
waza new task from-promptcommand — Record Copilot sessions into task YAML files for eval creation (#110)- Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
- Eval scaffolding command —
waza eval newgenerates eval.yaml scaffolding for skills (#94) - Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
- Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
- Per-file token budget configuration — Configure token budgets per-file in
.waza.yaml(#96) - Skill-aware thresholds —
waza tokens comparesupports skill-specific threshold configuration (#93) - Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
- CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
- FileWriter service — Refactored
waza initinventory with FileWriter abstraction (#63)
Fixed
- `waza suggest` deadlock — `Execute()` now applies the request timeout before calling `Start()`, preventing goroutine deadlock (#43)
- `ResourceFile.Content` type — Changed from `string` to `[]byte` for proper binary file handling (#117)
- `tokens compare` in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
- `--output-dir` ignored — Fixed `--output-dir` having no effect for single-skill runs (#109)
- Web dashboard build order — Build dashboard assets before Go compilation (#107)
- Test file leak — Fixed test that leaked files into the repo (#120)
- Config schema defaults — Aligned `config.schema.json` defaults with Go source of truth (#65)
- Skill discovery path — Discover skills under `.github/skills/` directory (#69)
Changed
- Renamed `config` node `max_workers` to `workers` for consistency across all config types
  - This is a breaking change
- Custom YAML deserializers for config types (#106)
- Validate only known fields in YAML decoders. (#132)
- Token limits priority inverted to `.waza.yaml` first (#64)
- @wbreza added to CODEOWNERS (#111)
- Go 1.26+ noted in agent instruction files (#108)
[0.9.0] - 2026-02-23
Added
- A/B baseline testing — `--baseline` flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
- Pairwise LLM judging — `pairwise` mode on `prompt` grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
- Tool constraint grader — New `tool_constraint` grader type with `expect_tools`, `reject_tools`, `max_turns`, `max_tokens` constraints. Validates agent tool usage behavior (#391)
- Auto skill discovery — `--discover` flag walks directory trees for SKILL.md + eval.yaml pairs. `--strict` mode fails if any skill lacks eval coverage (#392)
- Releases page — New docs site page at `reference/releases` with platform download links, install commands, and azd extension info (#383)
Fixed
- Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings
Changed
- Competitive research — Added OpenAI Evals analysis (`docs/research/waza-vs-openai-evals.md`), skill-validator analysis (`docs/research/waza-vs-skill-validator.md`), and eval registry design doc (`docs/research/waza-eval-registry-design.md`)
- Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md
[0.8.0] - 2026-02-21
Added
- MCP Server — `waza serve` now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
- `waza suggest` command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: `--model`, `--dry-run`, `--apply`, `--output-dir`, `--format` (#287)
- Interactive workflow skill — `skills/waza-interactive/SKILL.md` with 5 workflow scenarios for conversational eval orchestration (#288)
- Grader weighting — `weight` field on grader configs, `ComputeWeightedRunScore` method, dashboard weighted scores column (#299)
- Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
- Judge model support — `--judge-model` flag and `judge_model` config for separate LLM-as-judge model (#309)
- Spec compliance checks — 8 agentskills.io compliance checks in `waza check` and `waza dev` (#314)
- SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
- MCP integration scoring — 4 MCP integration checks in `waza dev` (#316)
- Batch skill processing — `waza dev` processes multiple skills in one run (#317)
- Token compare --strict — Budget enforcement mode for `waza tokens compare` (#318)
- Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
- Skill profile — `waza tokens profile` for static analysis of skill token distribution (#311)
- JUnit XML reporter — `--format junit` output for CI integration (#312)
- Template Variables — New `internal/template` package with `Render()` for Go text/template syntax in hooks and commands. System variables: `JobID`, `TaskName`, `Iteration`, `Attempt`, `Timestamp`. User variables via `vars` map (#186)
- GroupBy Results — New `group_by` config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes `GroupStats` with name/passed/total/avg_score (#188)
- Custom Input Variables — New `inputs` section in eval.yaml for defining key-value pairs available as `{{.Vars.key}}` throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
- CSV Dataset Support — New `tasks_from` field to generate tasks from CSV files. Each row becomes a task with columns accessible as `{{.Vars.column}}`. Optional `range: [start, end]` for row filtering. First row treated as headers (#187)
- Retry/Attempts — Add `max_attempts` config field for retrying failed task executions within each trial (#191)
- Lifecycle Hooks — Add `hooks` section with `before_run`/`after_run`/`before_task`/`after_task` lifecycle points (#191)
- `prompt` grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
  - Two modes: `clean` (fresh context) and `continue_session` (resumes test session)
  - Tool-based grading: `set_waza_grade_pass` and `set_waza_grade_fail` tools for LLM graders
  - Separate judge model configuration: run evaluation with a different model than the executor
  - Pre-built rubric templates adapted from Azure ML evaluators
- `trigger_tests.yaml` auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
  - New `internal/trigger/` package for trigger testing
  - Automatically discovered alongside `eval.yaml`
  - Confidence weighting: `high` (weight 1.0) and `medium` (weight 0.5) for borderline cases
  - `trigger_accuracy` metric with configurable cutoff threshold
  - Metrics: accuracy, precision, recall, F1, error count
- `diff` grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
- Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in `examples/rubrics/` adapted from Azure ML evaluators (#160, #161):
  - Tool call rubrics: `tool_call_accuracy`, `tool_selection`, `tool_input_accuracy`, `tool_output_utilization`
  - Task evaluation rubrics: `task_completion`, `task_adherence`, `intent_resolution`, `response_completeness`
- MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
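The Template Variables and Custom Input Variables entries above describe Go `text/template` syntax with system variables and a user variable map reachable as `{{.Vars.key}}`. A minimal sketch of how such rendering could work with the standard library follows. The `renderContext` type and `render` helper are hypothetical and only mirror the variable names listed in the changelog; they are not the API of waza's `internal/template` package.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderContext is a hypothetical data shape: two of the listed system
// variables plus a user-supplied Vars map.
type renderContext struct {
	JobID    string
	TaskName string
	Vars     map[string]string
}

// render expands Go text/template syntax in s against ctx.
// missingkey=error turns a reference to an undefined Vars key into an
// error instead of an empty string.
func render(s string, ctx renderContext) (string, error) {
	tmpl, err := template.New("cmd").Option("missingkey=error").Parse(s)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, ctx); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	ctx := renderContext{
		JobID:    "job-1",
		TaskName: "summarize",
		Vars:     map[string]string{"model": "example-model"},
	}
	out, err := render("echo {{.TaskName}} on {{.Vars.model}} ({{.JobID}})", ctx)
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out) // echo summarize on example-model (job-1)
}
```

Failing on a missing key is a design choice made for this sketch; whether waza treats undefined variables as errors is not stated in the changelog.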
Changed
- Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
- Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)
Fixed
- install.sh macOS checksum — added `shasum -a 256` fallback for macOS (which lacks `sha256sum`) (#163)
- Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
- GitHub icon alignment and search bar width on docs site
[0.4.0-alpha.1] - 2026-02-17
Added
- Go cross-platform release pipeline — `go-release.yml` workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
- `install.sh` installer — one-line binary install with checksum verification: `curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash`
- `skill_invocation` grader — validates orchestration workflows by checking which sk...
v0.28.0
What's New in v0.28.0
New Features
- `waza models` command (#141) — List available Copilot models with their IDs and capabilities. Requires authentication via `copilot login`. (@richardpark-msft)
- Trigger test early termination (#188) — Trigger tests now cancel the agent session as soon as a skill invocation is detected, instead of waiting for the full timeout. Implemented at the execution layer via `CancelOnSkillInvocation` flag. (@JasonYeMSFT)
- Follow-up prompts (#189) — Support multi-turn eval tasks with `follow_up_prompts` in task YAML. Follow-ups reuse the same session and workspace, enabling tests for conversational workflows where the agent pauses for confirmation. (@JasonYeMSFT)
- Quick Start guide — New focused 5-minute quick start page on the docs site, with Mermaid workflow diagram and tabbed install options. Added as the first sidebar item.
Bug Fixes
- CI integration test (#210) — Fixed the root cause of persistent ubuntu-latest CI failures. PR #203 (v0.27.0) wired up `evaluateExpectations()` which made `output_contains` checks execute for the first time — the mock executor's generic output didn't match. CI now correctly allows eval failures with mock while still catching crashes.
- YAML validation audit (#132) — Verified all 10 user-facing config loaders use strict `KnownFields(true)` parsing. Added regression test for unknown field rejection. (@LarryOsterman)
Infrastructure
- CODEOWNERS simplified to @spboyer
- Branch protection rulesets updated with proper bypass actors and streamlined required checks (Lint, CLA, test)
Documentation
- Quick Start page with install → auth → first eval workflow
- `waza models` added to CLI reference
- `follow_up_prompts` and trigger early termination documented in eval-yaml guide
Full Changelog: v0.27.0...v0.28.0
v0.27.0
What's New in v0.27.0
New Features
- `tool_calls` grader (#187) — Validate which tools the agent called during execution. Supports `required_tools`, `forbidden_tools`, `min_calls`, and `max_calls` constraints with partial scoring. (@JasonYeMSFT)
- `output_contains_any` expectation (#137) — New YAML field that passes if ANY of the listed strings appear in output (OR logic), complementing the existing `output_contains` (AND logic) and `output_not_contains`. (@LarryOsterman)
- `max_response_time_ms` behavior rule (#136) — Enforce response time limits on eval tasks. Fails the behavior check if execution exceeds the configured threshold. (@LarryOsterman)
- `prompt_file` for task prompts (#157) — Load task prompts from external files instead of inline YAML. Supports `prompt_file: path/to/prompt.md` with path traversal protection. (@LarryOsterman)
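The difference between `output_contains` (AND logic) and `output_contains_any` (OR logic) can be sketched as two small predicates. The helper names `containsAll` and `containsAny` are hypothetical, and the sketch assumes plain case-sensitive substring matching, which the release notes do not specify.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAll reports whether every needle occurs in output (AND logic).
func containsAll(output string, needles []string) bool {
	for _, n := range needles {
		if !strings.Contains(output, n) {
			return false
		}
	}
	return true
}

// containsAny reports whether at least one needle occurs in output (OR logic).
func containsAny(output string, needles []string) bool {
	for _, n := range needles {
		if strings.Contains(output, n) {
			return true
		}
	}
	return false
}

func main() {
	output := "Deployment succeeded in 42s"
	needles := []string{"succeeded", "completed"}
	fmt.Println(containsAll(output, needles)) // false: "completed" is missing
	fmt.Println(containsAny(output, needles)) // true: "succeeded" is present
}
```

The OR form suits tasks where several phrasings of a correct answer are acceptable, while the AND form requires every listed string to be present.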
Bug Fixes
- Windows CI fix (#204) — Webserver test now skips gracefully when frontend assets aren't built, fixing the persistent `windows-latest` CI failure that blocked all PRs today.
- Cross-platform test fix — Absolute path test in suggest package uses `runtime.GOOS` for Windows compatibility.
Documentation
All 4 new features include updated docs:
- Graders guide (`graders.mdx`) — `tool_calls` section added
- Eval YAML guide (`eval-yaml.mdx`) — `output_contains_any`, `max_response_time_ms`, `prompt_file` documented
- Schema reference (`schema.mdx`) — All new fields added
Full Changelog: v0.26.0...v0.27.0