Releases: microsoft/waza

Waza v0.31.0

28 Apr 20:11
bf77c75

What's Changed

  • refactor: complete vocabulary renames — BenchmarkSpec→EvalSpec, TestRunner→EvalRunner (#166) by @spboyer in #222
  • feat: support custom agent (.agent.md) file discovery and parsing #225 by @spboyer in #226
  • fix: mock engine echoes file content for CI evals (#227) by @spboyer in #228
  • fix: waza serve crashes when stdin is not a terminal by @spboyer in #224
  • chore(deps): Bump postcss from 8.5.6 to 8.5.12 in /site by @dependabot[bot] in #229
  • docs: cross-reference audit for recent renames and feature additions by @spboyer in #230
  • Release v0.31.0 by @spboyer in #231

Full Changelog: v0.30.1...v0.31.0

Waza azd Extension v0.31.0

28 Apr 20:08
bf77c75

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.31.0] - 2026-04-28

Added

  • Custom agent (.agent.md) eval support — Discover .agent.md files alongside SKILL.md, parse agent-specific frontmatter (tools, model, handoffs, mcp-servers, agents), auto-inject tool_constraint grader from agent tools: field, complete worked example under examples/custom-agent/, and new "Evaluating Custom Agents" docs guide (#226, closes #225)

Fixed

  • Mock engine echoes file content — output_contains expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
  • waza serve no longer crashes when stdin isn't a terminal — MCP stdio server only starts when term.IsTerminal() is true; piped input or background mode no longer kills the HTTP dashboard (#224)
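
The stdin check behind the waza serve fix can be pictured with a short sketch. It is written in Python for brevity and is not the project's Go code: should_start_stdio_server is a hypothetical helper, and isatty() stands in for the term.IsTerminal() call named above.

```python
import io
import sys


def should_start_stdio_server(stdin=None) -> bool:
    """Return True only when stdin is attached to an interactive terminal.

    Hypothetical helper: it illustrates gating a stdio server on a terminal
    check so that piped or detached input never starts it.
    """
    stream = sys.stdin if stdin is None else stdin
    if stream is None:  # stdin can be None in some detached processes
        return False
    try:
        return stream.isatty()
    except (AttributeError, ValueError):
        # Closed or non-file-like stdin: treat it as non-interactive.
        return False


if __name__ == "__main__":
    # A StringIO behaves like piped input: it is never a terminal.
    print(should_start_stdio_server(io.StringIO("piped data")))
```

Under this assumption, a process started with piped input or in the background would skip the stdio server and keep only the HTTP dashboard running.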

Changed

  • Vocabulary renames — Internal types renamed: BenchmarkSpec→EvalSpec, TestRunner→EvalRunner. Not a breaking change for external consumers (types live in internal/) (#222)

Documentation

  • Cross-reference audit for recent renames + custom agent feature: added .agent.md coverage to quickstart, getting-started, GUIDE, TUTORIAL, examples README; updated mock engine descriptions in INTEGRATION-TESTING and eval-yaml guide (#230)

Dependencies

  • Bump postcss from 8.5.6 to 8.5.12 in /site (#229)

[0.30.1] - 2026-04-22

Documentation

  • Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)

[0.30.0] - 2026-04-22

Added

  • waza quality command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
  • Scope-reduction advisory check — waza check now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)

[0.29.0] - 2026-04-22

Added

  • --keep-workspace flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
  • --no-skills flag and disabled_skills config — Disable specific skills during evaluation to isolate behavior (#126, #216)
  • Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
  • Per-task skill_directories — Specify different skill directories for individual tasks in eval YAML (#156, #215)

Dependencies

  • Bump astro and @astrojs/starlight in /site (#212)

[0.28.0] - 2026-04-21

Added

  • Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
  • waza models command — List all available models supported by the configured engine (#208)
  • Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)

Fixed

  • Stricter YAML validation — Audited all YAML parsers; unknown fields in TestCase definitions are now properly rejected (#132, #206)
  • Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
  • CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)

Documentation

  • Added Quick Start guide to the documentation site (#205)

[0.27.0] - 2026-04-21

Added

  • output_contains_any expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
  • max_response_time_ms behavior rule — Enforce maximum response time constraints on agent execution (#201)
  • Task prompt from file — Task prompt field can now reference an external file path instead of inline text (#157, #200)
  • tool_calls grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)
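
As a rough illustration of the two simplest checks in this list, the sketch below shows what "contains any one of the specified strings" and a maximum response time amount to. It is Python, not the project's code; the function names are hypothetical, and plain case-sensitive substring matching is an assumption.

```python
def output_contains_any(response: str, candidates: list[str]) -> bool:
    """Pass when the response contains at least one candidate string.

    Assumption: matching is a plain, case-sensitive substring test. The real
    expectation may normalize text differently.
    """
    return any(candidate in response for candidate in candidates)


def within_response_time(elapsed_ms: float, max_response_time_ms: float) -> bool:
    """Pass when the measured duration does not exceed the configured limit."""
    return elapsed_ms <= max_response_time_ms


if __name__ == "__main__":
    reply = "Deployed the app to Azure Container Apps."
    print(output_contains_any(reply, ["App Service", "Container Apps"]))
    print(within_response_time(elapsed_ms=1200, max_response_time_ms=1000))
```

Note that an empty candidate list never passes in this sketch, since there is nothing to match.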

Fixed

  • Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)

[0.26.0] - 2026-04-21

Changed

  • Timestamped output directories — run --output-dir now groups result files by timestamp for cleaner organization (#153)
  • Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)

Fixed

  • --discover finds eval.yaml in nested layout — Skill discovery now correctly locates eval.yaml files in evals/{name}/ directories at the project root (#44)
  • Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
  • Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
  • macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)

Documentation

  • Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
  • Updated demo guide and added CI/CD integration guide (#112, #89, #194)

Dependencies

  • Bump defu from 6.1.4 to 6.1.6 in /site (#181)
  • Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
  • Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
  • Bump astro from 5.17.3 to 5.18.1 in /site (#163)
  • Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
  • Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)

[0.25.0] - 2026-04-21

Added

  • Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)

Fixed

  • SKILL.md injection and trigger fixture loading — waza run now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)

Dependencies

  • Bump h3 from 1.15.5 to 1.15.8 in /site (#144)

[0.24.0] - 2026-03-25

Changed

  • Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
  • max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
  • Unified token counting — waza check and waza tokens count now share the same counting logic for consistent results (#146)
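
The effect of strict validation combined with the max_workers rename can be sketched as follows. This is a Python analogy rather than the Go decoder itself; KNOWN_FIELDS is a made-up schema, and reject_unknown_fields is a hypothetical helper.

```python
# Hypothetical set of accepted keys, for illustration only.
KNOWN_FIELDS = {"name", "prompt", "workers"}


def reject_unknown_fields(config: dict, known: set[str]) -> None:
    """Raise ValueError if the config has keys outside the known set."""
    unknown = sorted(set(config) - known)
    if unknown:
        raise ValueError(f"unknown field(s): {', '.join(unknown)}")


if __name__ == "__main__":
    reject_unknown_fields({"name": "demo", "workers": 4}, KNOWN_FIELDS)  # accepted
    try:
        # The old key is no longer known, so strict validation rejects it.
        reject_unknown_fields({"name": "demo", "max_workers": 4}, KNOWN_FIELDS)
    except ValueError as exc:
        print(exc)
```

Under strict validation, a config that still uses the old key fails loudly instead of being silently ignored, which is why the rename is flagged as breaking.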

Fixed

  • Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

  • Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
  • Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

  • waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
  • Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
  • Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
  • Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
  • Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
  • Per-file token budget configuration — Configure token budgets per-file in .waza.yaml (#96)
  • Skill-aware thresholds — waza tokens compare supports skill-specific threshold configuration (#93)
  • Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
  • CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
  • FileWriter service — Refactored waza init inventory with FileWriter abstraction (#63)
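
One plausible reading of multi-trial flakiness detection is sketched below in Python. The definition used here (a task is flaky when its trials disagree) is an assumption, and the function names are hypothetical. The release notes do not say how waza decides.

```python
def pass_rate(trial_results: list[bool]) -> float:
    """Fraction of trials that passed (0.0 for an empty list)."""
    if not trial_results:
        return 0.0
    return sum(trial_results) / len(trial_results)


def is_flaky(trial_results: list[bool]) -> bool:
    """Assumed definition: flaky means some trials passed and some failed."""
    return len(set(trial_results)) > 1


if __name__ == "__main__":
    trials = [True, False, True, True]
    print(pass_rate(trials), is_flaky(trials))
```

A task that always passes or always fails is stable under this definition. Only mixed outcomes are reported as flaky.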

Fixed

  • waza suggest deadlock — Execute() now applies the request timeout before calling Start(), preventing goroutine deadlock (#43)
  • ResourceFile.Content type — Changed from string to []byte for proper binary file handling (#117)
  • tokens compare in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
  • --output-dir ignored — Fixed --output-dir having no effect for single-skill runs (#109)
  • Web dashboard build order — Build dashboard assets before Go compilation (#107)
  • Test file leak — Fixed test that leaked files into the repo (#120)
  • Config schema defaults — Aligned config.schema.json defaults with Go source of truth (#65)
  • Skill discovery path — Discover skills under .github/skills/ directory (#69)

Changed

  • Renamed config node max_workers to workers for consistency across all config types
    • This is a breaking change
  • Custom YAML deserializers for config types (#106)
  • Validate only known fields in YAML decoders. (#132)
  • Token limits priority inverted to .waza.yaml first (#64)
  • @wbreza added to CODEOWNERS (#111)
  • Go 1.26+ noted in agent instruction files (#108)

[0.9.0] - 2026-02-23

Added

  • A/B baseline testing — --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
  • Pairwise LLM judging — pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
  • Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
  • Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks...

v0.30.1

22 Apr 20:53
47a3d9c

v0.30.1

Documentation

  • README updated — Added missing waza models command documentation with usage examples and flags (#220)

Full Changelog: v0.30.0...v0.30.1

v0.30.0

22 Apr 19:57
6aaebec

What's New in v0.30.0

New Features

  • waza quality command (#98) — LLM-as-Judge skill quality scoring. Evaluates SKILL.md across 5 dimensions (clarity, completeness, trigger precision, scope coverage, anti-patterns) using the Copilot SDK. Scored 1-5 with visual bar output. Supports --format json for CI integration. (@spboyer)

  • Scope reduction advisory (#183) — waza check now warns when a skill has low capability scope, detecting potential token-limit compression loss. Parses USE FOR phrases, headings, and numbered procedures as capability signals. (@diberry)

Housekeeping

  • Closed 5 stale issues that were already implemented: #59 (token limits priority), #86 (per-file budgets), #81 (tokens diff), #83 (eval scaffolding), #162 (TypeSpec user query — answered)

Full Changelog: v0.29.0...v0.30.0

v0.29.0

22 Apr 17:57
0194e61

What's New in v0.29.0

New Features

  • Per-task skill directories (#156) — Tasks can now override eval-level skill_directories with their own, enabling multi-skill eval suites. (@LarryOsterman)

  • Disable skill loading (#126) — New disabled_skills config field and --no-skills CLI flag. Use disabled_skills: ["*"] to disable all skills for baseline/comparison testing. (@richardpark-msft)

  • Debug workspace preservation (#123) — New --keep-workspace flag preserves temp workspace directories after execution for debugging fixture and agent file issues. (@richardpark-msft)

  • Version update notifications (#104) — waza run now checks for new versions in the background (cached 24h, non-blocking). Disable with --no-update-check or WAZA_NO_UPDATE_CHECK=1. (@RickWinter)

Test Coverage

  • Copilot log parsing edge cases (#115) — 23 new tests covering malformed JSON, truncated logs, binary data, unknown event types, and more. (@richardpark-msft)

Dependencies

  • Bumped astro + @astrojs/starlight in site

Full Changelog: v0.28.0...v0.29.0

Waza azd Extension v0.30.1

22 Apr 20:59
47a3d9c

Changelog

[0.9.0] - 2026-02-23

Added

  • A/B baseline testing — --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
  • Pairwise LLM judging — pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
  • Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
  • Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks eval coverage (#392)
  • Releases page — New docs site page at reference/releases with platform download links, install commands, and azd extension info (#383)
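
Position-swap bias mitigation for pairwise judging can be illustrated with a small Python sketch. It is a generic version of the technique, not waza's implementation: the judge callable, the intermediate scale labels, and the averaging rule are all assumptions (only the much-better and much-worse endpoints come from the notes above).

```python
from typing import Callable

# Hypothetical five-point scale, read from the first answer's point of view.
MAGNITUDE = {"much-better": 2, "better": 1, "same": 0, "worse": -1, "much-worse": -2}

# A judge takes (first_answer, second_answer) and returns a MAGNITUDE label.
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, answer_a: str, answer_b: str) -> float:
    """Score answer A against B, averaging over both presentation orders.

    Positive values favor A, negative values favor B. A judge that simply
    prefers whichever answer it sees first cancels out to 0.0.
    """
    forward = MAGNITUDE[judge(answer_a, answer_b)]   # A shown first
    backward = MAGNITUDE[judge(answer_b, answer_a)]  # B shown first
    return (forward - backward) / 2


if __name__ == "__main__":
    def positional_judge(first: str, second: str) -> str:
        return "better"  # always favors the first slot

    print(debiased_preference(positional_judge, "answer A", "answer B"))
```

Running the judge in both orders doubles the judge calls per comparison, which is the cost of removing a preference for one slot.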

Fixed

  • Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings

Changed

  • Competitive research — Added OpenAI Evals analysis (docs/research/waza-vs-openai-evals.md), skill-validator analysis (docs/research/waza-vs-skill-validator.md), and eval registry design doc (docs/research/waza-eval-registry-design.md)
  • Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md

[0.8.0] - 2026-02-21

Added

  • MCP Server — waza serve now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
  • waza suggest command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: --model, --dry-run, --apply, --output-dir, --format (#287)
  • Interactive workflow skill — skills/waza-interactive/SKILL.md with 5 workflow scenarios for conversational eval orchestration (#288)
  • Grader weighting — weight field on grader configs, ComputeWeightedRunScore method, dashboard weighted scores column (#299)
  • Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
  • Judge model support — --judge-model flag and judge_model config for separate LLM-as-judge model (#309)
  • Spec compliance checks — 8 agentskills.io compliance checks in waza check and waza dev (#314)
  • SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
  • MCP integration scoring — 4 MCP integration checks in waza dev (#316)
  • Batch skill processing — waza dev processes multiple skills in one run (#317)
  • Token compare --strict — Budget enforcement mode for waza tokens compare (#318)
  • Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
  • Skill profile — waza tokens profile for static analysis of skill token distribution (#311)
  • JUnit XML reporter — --format junit output for CI integration (#312)
  • Template Variables — New internal/template package with Render() for Go text/template syntax in hooks and commands. System variables: JobID, TaskName, Iteration, Attempt, Timestamp. User variables via vars map (#186)
  • GroupBy Results — New group_by config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes GroupStats with name/passed/total/avg_score (#188)
  • Custom Input Variables — New inputs section in eval.yaml for defining key-value pairs available as {{.Vars.key}} throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
  • CSV Dataset Support — New tasks_from field to generate tasks from CSV files. Each row becomes a task with columns accessible as {{.Vars.column}}. Optional range: [start, end] for row filtering. First row treated as headers (#187)
  • Retry/Attempts — Add max_attempts config field for retrying failed task executions within each trial (#191)
  • Lifecycle Hooks — Add hooks section with before_run/after_run/before_task/after_task lifecycle points (#191)
  • prompt grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
    • Two modes: clean (fresh context) and continue_session (resumes test session)
    • Tool-based grading: set_waza_grade_pass and set_waza_grade_fail tools for LLM graders
    • Separate judge model configuration: run evaluation with a different model than the executor
    • Pre-built rubric templates adapted from Azure ML evaluators
  • trigger_tests.yaml auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
    • New internal/trigger/ package for trigger testing
    • Automatically discovered alongside eval.yaml
    • Confidence weighting: high (weight 1.0) and medium (weight 0.5) for borderline cases
    • trigger_accuracy metric with configurable cutoff threshold
    • Metrics: accuracy, precision, recall, F1, error count
  • diff grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
  • Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in examples/rubrics/ adapted from Azure ML evaluators (#160, #161):
    • Tool call rubrics: tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization
    • Task evaluation rubrics: task_completion, task_adherence, intent_resolution, response_completeness
  • MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
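
The statistical confidence intervals entry above mentions a bootstrap with 10K resamples at 95% confidence. The sketch below shows the general idea in Python using a percentile bootstrap over mean scores. The choice of the percentile method, the function name, and the fixed seed are assumptions, since the release notes do not say which variant waza uses.

```python
import random
import statistics


def bootstrap_ci(scores, resamples=10_000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * resamples)]
    upper = means[min(resamples - 1, int((1.0 - alpha) * resamples))]
    return lower, upper


if __name__ == "__main__":
    low, high = bootstrap_ci([0.6, 0.7, 0.8, 0.9, 1.0])
    print(f"95% CI for the mean score: [{low:.3f}, {high:.3f}]")
```

When the intervals of two runs do not overlap, the difference is unlikely to be noise. A significance badge can be read as conveying roughly that.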

Changed

  • Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
  • Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)

Fixed

  • install.sh macOS checksum — added shasum -a 256 fallback for macOS (which lacks sha256sum) (#163)
  • Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
  • GitHub icon alignment and search bar width on docs site
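
Both sha256sum and shasum -a 256 compute a SHA-256 digest, so either tool can verify a download. The Python sketch below shows the same verification step. It is an illustration only, not part of install.sh, and the helper names are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_checksum(data: bytes, expected_hex: str) -> bool:
    """Compare the digest of `data` with an expected hex checksum."""
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(sha256_hex(data), expected)


if __name__ == "__main__":
    payload = b"example binary contents"
    published = sha256_hex(payload)  # stands in for a published checksum
    print(verify_checksum(payload, published))
    print(verify_checksum(payload + b"tampered", published))
```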

[0.4.0-alpha.1] - 2026-02-17

Added

  • Go cross-platform release pipeline — go-release.yml workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
  • install.sh installer — one-line binary install with checksum verification: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
  • skill_invocation grader — validates orchestration workflows by checking which sk...

Waza azd Extension v0.30.0

22 Apr 20:01
6aaebec

Waza azd Extension v0.29.0

22 Apr 18:03
0194e61

Choose a tag to compare

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.24.0] - 2026-03-25

Changed

  • Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
  • max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
  • Unified token counting: waza check and waza tokens count now share the same counting logic for consistent results (#146)
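
To illustrate what strict decoding catches, here is a minimal Go sketch. It is not waza's parser: the notes above say the YAML decoders use KnownFields(true), while this sketch substitutes the standard library's encoding/json and its DisallowUnknownFields method so that it runs without third-party modules. The Config type and decodeStrict helper are hypothetical; only the workers and max_workers keys come from the notes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config is a hypothetical config type with the renamed key.
type Config struct {
	Workers int `json:"workers"`
}

// decodeStrict rejects any key that does not map to a Config field.
// This is the JSON analogue of a strict YAML decoder, used here as a stand-in.
func decodeStrict(input string) (Config, error) {
	var cfg Config
	dec := json.NewDecoder(strings.NewReader(input))
	dec.DisallowUnknownFields()
	err := dec.Decode(&cfg)
	return cfg, err
}

func main() {
	// The old key is now an unknown field, so strict decoding reports it.
	if _, err := decodeStrict(`{"max_workers": 4}`); err != nil {
		fmt.Println("rejected:", err)
	}

	// The renamed key decodes normally.
	cfg, err := decodeStrict(`{"workers": 4}`)
	fmt.Println(cfg.Workers, err)
}
```

Under a lenient decoder the old key would be silently ignored and the worker count would fall back to its default, which is the kind of misconfiguration strict parsing is meant to surface.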

Fixed

  • Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

  • Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
  • Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

  • waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
  • Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
  • Eval scaffolding command: waza eval new generates eval.yaml scaffolding for skills (#94)
  • Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
  • Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
  • Per-file token budget configuration — Configure token budgets per-file in .waza.yaml (#96)
  • Skill-aware thresholds: waza tokens compare supports skill-specific threshold configuration (#93)
  • Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
  • CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
  • FileWriter service — Refactored waza init inventory with FileWriter abstraction (#63)

Fixed

  • waza suggest deadlock: Execute() now applies the request timeout before calling Start(), preventing goroutine deadlock (#43)
  • ResourceFile.Content type — Changed from string to []byte for proper binary file handling (#117)
  • tokens compare in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
  • --output-dir ignored — Fixed --output-dir having no effect for single-skill runs (#109)
  • Web dashboard build order — Build dashboard assets before Go compilation (#107)
  • Test file leak — Fixed test that leaked files into the repo (#120)
  • Config schema defaults — Aligned config.schema.json defaults with Go source of truth (#65)
  • Skill discovery path — Discover skills under .github/skills/ directory (#69)

Changed

  • Renamed config node max_workers to workers for consistency across all config types
    • This is a breaking change
  • Custom YAML deserializers for config types (#106)
  • Validate only known fields in YAML decoders (#132)
  • Token limits priority inverted to .waza.yaml first (#64)
  • @wbreza added to CODEOWNERS (#111)
  • Go 1.26+ noted in agent instruction files (#108)

[0.9.0] - 2026-02-23

Added

  • A/B baseline testing: --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
  • Pairwise LLM judging: pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
  • Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
  • Auto skill discovery: --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks eval coverage (#392)
  • Releases page — New docs site page at reference/releases with platform download links, install commands, and azd extension info (#383)
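
Position-swap bias mitigation, mentioned in the pairwise judging entry above, can be sketched in a few lines of Go. This is an assumption-laden illustration, not waza's implementation: the Judge type, the integer scale from -2 (much worse) to +2 (much better), and the averaging rule are all hypothetical choices made for the example.

```go
package main

import "fmt"

// Judge is a hypothetical comparison function. It returns a magnitude from
// -2 (first is much worse) to +2 (first is much better).
type Judge func(first, second string) int

// swapDebiased asks the judge twice, once per ordering, and averages the
// two verdicts from a's point of view.
func swapDebiased(j Judge, a, b string) float64 {
	forward := j(a, b)   // a shown first
	backward := -j(b, a) // b shown first; negate to express from a's view
	return float64(forward+backward) / 2
}

func main() {
	// A judge that always favors whichever answer it sees first.
	biased := func(first, second string) int { return 1 }

	// A judge that prefers the longer answer regardless of position.
	byLength := func(first, second string) int {
		switch {
		case len(first) > len(second):
			return 2
		case len(first) < len(second):
			return -2
		}
		return 0
	}

	fmt.Println("position-biased judge:", swapDebiased(biased, "x", "y"))
	fmt.Println("content-based judge:", swapDebiased(byLength, "longer answer", "short"))
}
```

A purely positional preference cancels out across the two orderings, while a preference based on content survives the swap with its magnitude intact.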

Fixed

  • Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings

Changed

  • Competitive research — Added OpenAI Evals analysis (docs/research/waza-vs-openai-evals.md), skill-validator analysis (docs/research/waza-vs-skill-validator.md), and eval registry design doc (docs/research/waza-eval-registry-design.md)
  • Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md

[0.8.0] - 2026-02-21

Added

  • MCP Server: waza serve now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
  • waza suggest command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: --model, --dry-run, --apply, --output-dir, --format (#287)
  • Interactive workflow skill: skills/waza-interactive/SKILL.md with 5 workflow scenarios for conversational eval orchestration (#288)
  • Grader weighting: weight field on grader configs, ComputeWeightedRunScore method, dashboard weighted scores column (#299)
  • Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
  • Judge model support: --judge-model flag and judge_model config for separate LLM-as-judge model (#309)
  • Spec compliance checks — 8 agentskills.io compliance checks in waza check and waza dev (#314)
  • SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
  • MCP integration scoring — 4 MCP integration checks in waza dev (#316)
  • Batch skill processing: waza dev processes multiple skills in one run (#317)
  • Token compare --strict — Budget enforcement mode for waza tokens compare (#318)
  • Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
  • Skill profile: waza tokens profile for static analysis of skill token distribution (#311)
  • JUnit XML reporter: --format junit output for CI integration (#312)
  • Template Variables — New internal/template package with Render() for Go text/template syntax in hooks and commands. System variables: JobID, TaskName, Iteration, Attempt, Timestamp. User variables via vars map (#186)
  • GroupBy Results — New group_by config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes GroupStats with name/passed/total/avg_score (#188)
  • Custom Input Variables — New inputs section in eval.yaml for defining key-value pairs available as {{.Vars.key}} throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
  • CSV Dataset Support — New tasks_from field to generate tasks from CSV files. Each row becomes a task with columns accessible as {{.Vars.column}}. Optional range: [start, end] for row filtering. First row treated as headers (#187)
  • Retry/Attempts — Add max_attempts config field for retrying failed task executions within each trial (#191)
  • Lifecycle Hooks — Add hooks section with before_run/after_run/before_task/after_task lifecycle points (#191)
  • prompt grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
    • Two modes: clean (fresh context) and continue_session (resumes test session)
    • Tool-based grading: set_waza_grade_pass and set_waza_grade_fail tools for LLM graders
    • Separate judge model configuration: run evaluation with a different model than the executor
    • Pre-built rubric templates adapted from Azure ML evaluators
  • trigger_tests.yaml auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
    • New internal/trigger/ package for trigger testing
    • Automatically discovered alongside eval.yaml
    • Confidence weighting: high (weight 1.0) and medium (weight 0.5) for borderline cases
    • trigger_accuracy metric with configurable cutoff threshold
    • Metrics: accuracy, precision, recall, F1, error count
  • diff grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
  • Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in examples/rubrics/ adapted from Azure ML evaluators (#160, #161):
    • Tool call rubrics: tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization
    • Task evaluation rubrics: task_completion, task_adherence, intent_resolution, response_completeness
  • MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)

Changed

  • Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
  • Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)

Fixed

  • install.sh macOS checksum — added shasum -a 256 fallback for macOS (which lacks sha256sum) (#163)
  • Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
  • GitHub icon alignment and search bar width on docs site

[0.4.0-alpha.1] - 2026-02-17

Added

  • Go cross-platform release pipeline: go-release.yml workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
  • install.sh installer — one-line binary install with checksum verification: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
  • skill_invocation grader — validates orchestration workflows by checking which sk...

v0.28.0

21 Apr 21:06
b1acf61

What's New in v0.28.0

New Features

  • waza models command (#141) — List available Copilot models with their IDs and capabilities. Requires authentication via copilot login. (@richardpark-msft)

  • Trigger test early termination (#188) — Trigger tests now cancel the agent session as soon as a skill invocation is detected, instead of waiting for the full timeout. Implemented at the execution layer via CancelOnSkillInvocation flag. (@JasonYeMSFT)

  • Follow-up prompts (#189) — Support multi-turn eval tasks with follow_up_prompts in task YAML. Follow-ups reuse the same session and workspace, enabling tests for conversational workflows where the agent pauses for confirmation. (@JasonYeMSFT)

  • Quick Start guide — New focused 5-minute quick start page on the docs site, with Mermaid workflow diagram and tabbed install options. Added as the first sidebar item.

Bug Fixes

  • CI integration test (#210) — Fixed the root cause of persistent ubuntu-latest CI failures. PR #203 (v0.27.0) wired up evaluateExpectations() which made output_contains checks execute for the first time — the mock executor's generic output didn't match. CI now correctly allows eval failures with mock while still catching crashes.

  • YAML validation audit (#132) — Verified all 10 user-facing config loaders use strict KnownFields(true) parsing. Added regression test for unknown field rejection. (@LarryOsterman)

Infrastructure

  • CODEOWNERS simplified to @spboyer
  • Branch protection rulesets updated with proper bypass actors and streamlined required checks (Lint, CLA, test)

Documentation

  • Quick Start page with install → auth → first eval workflow
  • waza models added to CLI reference
  • follow_up_prompts and trigger early termination documented in eval-yaml guide

Full Changelog: v0.27.0...v0.28.0

v0.27.0

21 Apr 18:29
7fc7f07

What's New in v0.27.0

New Features

  • tool_calls grader (#187) — Validate which tools the agent called during execution. Supports required_tools, forbidden_tools, min_calls, and max_calls constraints with partial scoring. (@JasonYeMSFT)

  • output_contains_any expectation (#137) — New YAML field that passes if ANY of the listed strings appear in output (OR logic), complementing the existing output_contains (AND logic) and output_not_contains. (@LarryOsterman)

  • max_response_time_ms behavior rule (#136) — Enforce response time limits on eval tasks. Fails the behavior check if execution exceeds the configured threshold. (@LarryOsterman)

  • prompt_file for task prompts (#157) — Load task prompts from external files instead of inline YAML. Supports prompt_file: path/to/prompt.md with path traversal protection. (@LarryOsterman)
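
The AND/OR distinction between output_contains and output_contains_any is easy to state in code. The Go sketch below is illustrative only: the Expectations struct and its Check method are hypothetical, and it assumes case-sensitive substring matching, which the notes do not specify.

```go
package main

import (
	"fmt"
	"strings"
)

// Expectations mirrors the three YAML fields named in the release notes.
// The Go struct itself is hypothetical.
type Expectations struct {
	OutputContains    []string // every entry must appear (AND)
	OutputContainsAny []string // at least one entry must appear (OR)
	OutputNotContains []string // no entry may appear
}

// Check reports whether output satisfies all three kinds of expectation.
func (e Expectations) Check(output string) bool {
	for _, s := range e.OutputContains {
		if !strings.Contains(output, s) {
			return false
		}
	}
	for _, s := range e.OutputNotContains {
		if strings.Contains(output, s) {
			return false
		}
	}
	if len(e.OutputContainsAny) == 0 {
		return true // no OR-group to satisfy
	}
	for _, s := range e.OutputContainsAny {
		if strings.Contains(output, s) {
			return true
		}
	}
	return false
}

func main() {
	exp := Expectations{
		OutputContains:    []string{"deployed"},
		OutputContainsAny: []string{"eastus", "westus"},
		OutputNotContains: []string{"error"},
	}
	fmt.Println(exp.Check("deployed to westus"))
	fmt.Println(exp.Check("deployed to northeurope"))
}
```

The OR-group is useful when several phrasings of a correct answer are acceptable and listing all of them under output_contains would make the task impossible to pass.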

Bug Fixes

  • Windows CI fix (#204) — Webserver test now skips gracefully when frontend assets aren't built, fixing the persistent windows-latest CI failure that blocked all PRs today.
  • Cross-platform test fix — Absolute path test in suggest package uses runtime.GOOS for Windows compatibility.

Documentation

All 4 new features include updated docs:

  • Graders guide (graders.mdx) — tool_calls section added
  • Eval YAML guide (eval-yaml.mdx) — output_contains_any, max_response_time_ms, prompt_file documented
  • Schema reference (schema.mdx) — All new fields added

Full Changelog: v0.26.0...v0.27.0