28 Apr 20:11

github-actions

bf77c75

Waza v0.31.0 Latest

What's Changed

refactor: complete vocabulary renames — BenchmarkSpec→EvalSpec, TestRunner→EvalRunner (#166) by @spboyer in #222
feat: support custom agent (.agent.md) file discovery and parsing #225 by @spboyer in #226
fix: mock engine echoes file content for CI evals (#227) by @spboyer in #228
fix: waza serve crashes when stdin is not a terminal by @spboyer in #224
chore(deps): Bump postcss from 8.5.6 to 8.5.12 in /site by @dependabot[bot] in #229
docs: cross-reference audit for recent renames and feature additions by @spboyer in #230
Release v0.31.0 by @spboyer in #231

Full Changelog: v0.30.1...v0.31.0

Contributors

spboyer and dependabot

28 Apr 20:08

github-actions

azd-ext-microsoft-azd-waza_0.31.0

bf77c75

Waza azd Extension v0.31.0

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.31.0] - 2026-04-28

Added

Custom agent (.agent.md) eval support — Discover .agent.md files alongside SKILL.md, parse agent-specific frontmatter (tools, model, handoffs, mcp-servers, agents), auto-inject tool_constraint grader from agent tools: field, complete worked example under examples/custom-agent/, and new "Evaluating Custom Agents" docs guide (#226, closes #225)

Fixed

Mock engine echoes file content — _output_contains expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
waza serve no longer crashes when stdin isn't a terminal — MCP stdio server only starts when term.IsTerminal() is true; piped input or background mode no longer kills the HTTP dashboard (#224)
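
The stdin fix above comes down to a guard that runs before the stdio server starts. A minimal sketch of such a guard follows, assuming a standard-library approximation (the `ModeCharDevice` bit reported by `Stat`) in place of the `term.IsTerminal()` call the entry names. `looksLikeTerminal` and `shouldStartStdio` are hypothetical names for illustration, not waza's actual code.

```go
package main

import (
	"fmt"
	"os"
)

// looksLikeTerminal reports whether a file mode describes a character device,
// which is how an interactive terminal normally appears. It is a rough
// stand-in for term.IsTerminal(): other character devices such as /dev/null
// also pass this check.
func looksLikeTerminal(mode os.FileMode) bool {
	return mode&os.ModeCharDevice != 0
}

// shouldStartStdio decides whether a stdio-based server should be started.
// Any error while inspecting stdin is treated as "not a terminal".
func shouldStartStdio(stdin *os.File) bool {
	info, err := stdin.Stat()
	if err != nil {
		return false
	}
	return looksLikeTerminal(info.Mode())
}

func main() {
	if shouldStartStdio(os.Stdin) {
		fmt.Println("stdin is a terminal: start the stdio server")
	} else {
		fmt.Println("stdin is piped or detached: serve the HTTP dashboard only")
	}
}
```

Running the program with input piped in (for example `echo | go run .`) should take the second branch, which mirrors the piped and background cases the entry describes.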

Changed

Vocabulary renames — Internal types renamed: BenchmarkSpec → EvalSpec, TestRunner → EvalRunner. Not a breaking change for external consumers (types live in internal/) (#222)

Documentation

Cross-reference audit for recent renames + custom agent feature: added .agent.md coverage to quickstart, getting-started, GUIDE, TUTORIAL, examples README; updated mock engine descriptions in INTEGRATION-TESTING and eval-yaml guide (#230)

Dependencies

Bump postcss from 8.5.6 to 8.5.12 in /site (#229)

[0.30.1] - 2026-04-22

Documentation

Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)

[0.30.0] - 2026-04-22

Added

waza quality command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
Scope-reduction advisory check — waza check now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)

[0.29.0] - 2026-04-22

Added

--keep-workspace flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
--no-skills flag and disabled_skills config — Disable specific skills during evaluation to isolate behavior (#126, #216)
Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
Per-task skill_directories — Specify different skill directories for individual tasks in eval YAML (#156, #215)

Dependencies

Bump astro and @astrojs/starlight in /site (#212)

[0.28.0] - 2026-04-21

Added

Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
waza models command — List all available models supported by the configured engine (#208)
Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)

Fixed

Stricter YAML validation — Audited all YAML parsers; unknown fields in TestCase definitions are now properly rejected (#132, #206)
Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)

Documentation

Added Quick Start guide to the documentation site (#205)

[0.27.0] - 2026-04-21

Added

output_contains_any expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
max_response_time_ms behavior rule — Enforce maximum response time constraints on agent execution (#201)
Task prompt from file — Task prompt field can now reference an external file path instead of inline text (#157, #200)
tool_calls grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)
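
The `output_contains_any` rule described above passes as soon as one listed string appears in the response. A minimal sketch of that matching rule, assuming case-sensitive substring matching (the changelog does not say how waza compares strings); `containsAny` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether response includes at least one candidate.
// Case-sensitive substring matching is an assumption of this sketch.
func containsAny(response string, candidates []string) bool {
	for _, c := range candidates {
		if strings.Contains(response, c) {
			return true
		}
	}
	return false
}

func main() {
	response := "Deployment finished: resource group created."
	fmt.Println(containsAny(response, []string{"created", "provisioned"}))
	fmt.Println(containsAny(response, []string{"deleted"}))
}
```

An empty candidate list never passes in this sketch, which seems the safer default for an expectation that is meant to assert something.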

Fixed

Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)

[0.26.0] - 2026-04-21

Changed

Timestamped output directories — run --output-dir now groups result files by timestamp for cleaner organization (#153)
Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)

Fixed

--discover finds eval.yaml in nested layout — Skill discovery now correctly locates eval.yaml files in evals/{name}/ directories at the project root (#44)
Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)

Documentation

Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
Updated demo guide and added CI/CD integration guide (#112, #89, #194)

Dependencies

Bump defu from 6.1.4 to 6.1.6 in /site (#181)
Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
Bump astro from 5.17.3 to 5.18.1 in /site (#163)
Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)

[0.25.0] - 2026-04-21

Added

Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)

Fixed

SKILL.md injection and trigger fixture loading — waza run now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)

Dependencies

Bump h3 from 1.15.5 to 1.15.8 in /site (#144)

[0.24.0] - 2026-03-25

Changed

Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
Unified token counting — waza check and waza tokens count now share the same counting logic for consistent results (#146)
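
Strict decoding is what turns the `max_workers` rename into a hard error rather than a silently ignored key. The sketch below illustrates the idea with the standard library's JSON decoder and `DisallowUnknownFields()`, which is swapped in here as the JSON counterpart of the YAML `KnownFields(true)` setting the entry names. The `config` type and `decodeStrict` function are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// config is a hypothetical stand-in for a waza config type.
type config struct {
	Workers int `json:"workers"`
}

// decodeStrict rejects any field that config does not declare.
func decodeStrict(input string) (config, error) {
	var c config
	dec := json.NewDecoder(strings.NewReader(input))
	dec.DisallowUnknownFields() // JSON counterpart of KnownFields(true) for YAML
	err := dec.Decode(&c)
	return c, err
}

func main() {
	_, err := decodeStrict(`{"max_workers": 4}`) // old key name, now unknown
	fmt.Println("old key rejected:", err != nil)

	c, err := decodeStrict(`{"workers": 4}`)
	fmt.Println(c.Workers, err)
}
```

With lenient decoding the first input would parse without error and leave `Workers` at zero, which is the kind of quiet misconfiguration strict validation is meant to surface.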

Fixed

Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
Per-file token budget configuration — Configure token budgets per-file in .waza.yaml (#96)
Skill-aware thresholds — waza tokens compare supports skill-specific threshold configuration (#93)
Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
FileWriter service — Refactored waza init inventory with FileWriter abstraction (#63)

Fixed

waza suggest deadlock — Execute() now applies the request timeout before calling Start(), preventing goroutine deadlock (#43)
ResourceFile.Content type — Changed from string to []byte for proper binary file handling (#117)
tokens compare in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
--output-dir ignored — Fixed --output-dir having no effect for single-skill runs (#109)
Web dashboard build order — Build dashboard assets before Go compilation (#107)
Test file leak — Fixed test that leaked files into the repo (#120)
Config schema defaults — Aligned config.schema.json defaults with Go source of truth (#65)
Skill discovery path — Discover skills under .github/skills/ directory (#69)
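
The `waza suggest` deadlock fix above is about ordering: the timeout has to be in place before the blocking call begins. A minimal sketch of that ordering, assuming a context-based timeout; the `start` and `execute` functions here are hypothetical stand-ins and do not reproduce waza's `Execute()` or `Start()`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// start is a hypothetical blocking call: it waits until ready is signalled
// or the context ends, whichever comes first.
func start(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// execute attaches the timeout to the context before calling start, so a
// start that never becomes ready returns an error instead of hanging.
func execute(timeout time.Duration, ready <-chan struct{}) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return start(ctx, ready)
}

func main() {
	never := make(chan struct{}) // never signalled: simulates a stuck start
	fmt.Println(execute(50*time.Millisecond, never))
}
```

If the timeout were created only after `start` returned, the stuck case would never reach it, which is the failure mode the entry describes.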

Changed

Renamed config node max_workers to workers for consistency across all config types
- This is a breaking change
Custom YAML deserializers for config types (#106)
YAML decoders now reject unknown fields (#132)
Token limits priority inverted to .waza.yaml first (#64)
@wbreza added to CODEOWNERS (#111)
Go 1.26+ noted in agent instruction files (#108)

[0.9.0] - 2026-02-23

Added

A/B baseline testing — --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
Pairwise LLM judging — pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks...

22 Apr 20:53

spboyer

v0.30.1

47a3d9c

v0.30.1

Documentation

README updated — Added missing waza models command documentation with usage examples and flags (#220)

Full Changelog: v0.30.0...v0.30.1

22 Apr 19:57

spboyer

v0.30.0

6aaebec

v0.30.0

What's New in v0.30.0

New Features

waza quality command (#98) — LLM-as-Judge skill quality scoring. Evaluates SKILL.md across 5 dimensions (clarity, completeness, trigger precision, scope coverage, anti-patterns) using the Copilot SDK. Scored 1-5 with visual bar output. Supports --format json for CI integration. (@spboyer)
Scope reduction advisory (#183) — waza check now warns when a skill has low capability scope, detecting potential token-limit compression loss. Parses USE FOR phrases, headings, and numbered procedures as capability signals. (@diberry)

Housekeeping

Closed 5 stale issues that were already implemented: #59 (token limits priority), #86 (per-file budgets), #81 (tokens diff), #83 (eval scaffolding), #162 (TypeSpec user query — answered)

Full Changelog: v0.29.0...v0.30.0

Contributors

spboyer and diberry

22 Apr 17:57

spboyer

v0.29.0

0194e61

v0.29.0

What's New in v0.29.0

New Features

Per-task skill directories (#156) — Tasks can now override eval-level skill_directories with their own, enabling multi-skill eval suites. (@LarryOsterman)
Disable skill loading (#126) — New disabled_skills config field and --no-skills CLI flag. Use disabled_skills: ["*"] to disable all skills for baseline/comparison testing. (@richardpark-msft)
Debug workspace preservation (#123) — New --keep-workspace flag preserves temp workspace directories after execution for debugging fixture and agent file issues. (@richardpark-msft)
Version update notifications (#104) — waza run now checks for new versions in the background (cached 24h, non-blocking). Disable with --no-update-check or WAZA_NO_UPDATE_CHECK=1. (@RickWinter)
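
The version check above combines three ideas: an opt-out, a 24-hour cache window, and a lookup that never delays the main command. A minimal sketch of how those pieces could fit together follows. It assumes that only the value `1` disables the check, and `shouldCheck` and `fetchLatest` are hypothetical names; the real lookup and cache storage are not shown in the release notes.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// cacheTTL matches the 24h cache window mentioned in the release notes.
const cacheTTL = 24 * time.Hour

// shouldCheck reports whether a new version lookup is due. sinceLast is the
// time since the last cached lookup; optOut is the WAZA_NO_UPDATE_CHECK value.
func shouldCheck(sinceLast time.Duration, optOut string) bool {
	if optOut == "1" {
		return false
	}
	return sinceLast >= cacheTTL
}

// fetchLatest is a hypothetical stub standing in for a network lookup.
func fetchLatest() string {
	return "v0.0.0-example"
}

func main() {
	latest := make(chan string, 1) // buffered so the goroutine can never block
	if shouldCheck(48*time.Hour, os.Getenv("WAZA_NO_UPDATE_CHECK")) {
		go func() { latest <- fetchLatest() }()
	}

	fmt.Println("running evals...") // the real work proceeds without waiting

	// Report the result only if it has already arrived; otherwise move on.
	select {
	case v := <-latest:
		fmt.Println("latest published version:", v)
	default:
	}
}
```

The `select` with a `default` branch is what keeps the check non-blocking: a slow lookup is simply skipped for this run.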

Test Coverage

Copilot log parsing edge cases (#115) — 23 new tests covering malformed JSON, truncated logs, binary data, unknown event types, and more. (@richardpark-msft)

Dependencies

Bumped astro + @astrojs/starlight in site

Full Changelog: v0.28.0...v0.29.0

Contributors

RickWinter, LarryOsterman, and richardpark-msft

22 Apr 20:59

github-actions

azd-ext-microsoft-azd-waza_0.30.1

47a3d9c

Waza azd Extension v0.30.1

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.24.0] - 2026-03-25

Changed

Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
Unified token counting — waza check and waza tokens count now share the same counting logic for consistent results (#146)

Fixed

Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
Per-file token budget configuration — Configure token budgets per-file in .waza.yaml (#96)
Skill-aware thresholds — waza tokens compare supports skill-specific threshold configuration (#93)
Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
FileWriter service — Refactored waza init inventory with FileWriter abstraction (#63)

Fixed

waza suggest deadlock — Execute() now applies the request timeout before calling Start(), preventing goroutine deadlock (#43)
ResourceFile.Content type — Changed from string to []byte for proper binary file handling (#117)
tokens compare in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
--output-dir ignored — Fixed --output-dir having no effect for single-skill runs (#109)
Web dashboard build order — Build dashboard assets before Go compilation (#107)
Test file leak — Fixed test that leaked files into the repo (#120)
Config schema defaults — Aligned config.schema.json defaults with Go source of truth (#65)
Skill discovery path — Discover skills under .github/skills/ directory (#69)

Changed

Renamed config node max_workers to workers for consistency across all config types
- This is a breaking change
Custom YAML deserializers for config types (#106)
YAML decoders now reject unknown fields (#132)
Token limits priority inverted to .waza.yaml first (#64)
@wbreza added to CODEOWNERS (#111)
Go 1.26+ noted in agent instruction files (#108)

[0.9.0] - 2026-02-23

Added

A/B baseline testing — --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
Pairwise LLM judging — pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks eval coverage (#392)
Releases page — New docs site page at reference/releases with platform download links, install commands, and azd extension info (#383)
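
The tool constraint grader listed above checks an agent run against four limits: `expect_tools`, `reject_tools`, `max_turns`, and `max_tokens`. A minimal sketch of such a check follows. The struct shapes, the `violations` function, the tool names, and the rule that a zero limit means "no limit" are all assumptions made for illustration; the changelog only names the four constraints.

```go
package main

import "fmt"

// toolConstraint carries the four constraint names from the changelog entry.
// The struct shape and "zero means no limit" are assumptions of this sketch.
type toolConstraint struct {
	ExpectTools []string
	RejectTools []string
	MaxTurns    int
	MaxTokens   int
}

// agentRun is a hypothetical summary of one agent execution.
type agentRun struct {
	ToolsUsed []string
	Turns     int
	Tokens    int
}

// violations lists every constraint the run breaks; an empty result passes.
func violations(c toolConstraint, r agentRun) []string {
	used := make(map[string]bool, len(r.ToolsUsed))
	for _, t := range r.ToolsUsed {
		used[t] = true
	}
	var out []string
	for _, t := range c.ExpectTools {
		if !used[t] {
			out = append(out, "expected tool not used: "+t)
		}
	}
	for _, t := range c.RejectTools {
		if used[t] {
			out = append(out, "rejected tool used: "+t)
		}
	}
	if c.MaxTurns > 0 && r.Turns > c.MaxTurns {
		out = append(out, fmt.Sprintf("turns %d exceed max_turns %d", r.Turns, c.MaxTurns))
	}
	if c.MaxTokens > 0 && r.Tokens > c.MaxTokens {
		out = append(out, fmt.Sprintf("tokens %d exceed max_tokens %d", r.Tokens, c.MaxTokens))
	}
	return out
}

func main() {
	c := toolConstraint{ExpectTools: []string{"read_file"}, RejectTools: []string{"shell"}, MaxTurns: 5}
	r := agentRun{ToolsUsed: []string{"read_file", "shell"}, Turns: 7, Tokens: 1200}
	for _, v := range violations(c, r) {
		fmt.Println(v)
	}
}
```

Returning every violation rather than stopping at the first makes the result more useful in a report, since one run can break several constraints at once.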

Fixed

Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings

Changed

Competitive research — Added OpenAI Evals analysis (docs/research/waza-vs-openai-evals.md), skill-validator analysis (docs/research/waza-vs-skill-validator.md), and eval registry design doc (docs/research/waza-eval-registry-design.md)
Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md

[0.8.0] - 2026-02-21

Added

MCP Server — waza serve now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
waza suggest command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: --model, --dry-run, --apply, --output-dir, --format (#287)
Interactive workflow skill — skills/waza-interactive/SKILL.md with 5 workflow scenarios for conversational eval orchestration (#288)
Grader weighting — weight field on grader configs, ComputeWeightedRunScore method, dashboard weighted scores column (#299)
Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
Judge model support — --judge-model flag and judge_model config for separate LLM-as-judge model (#309)
Spec compliance checks — 8 agentskills.io compliance checks in waza check and waza dev (#314)
SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
MCP integration scoring — 4 MCP integration checks in waza dev (#316)
Batch skill processing — waza dev processes multiple skills in one run (#317)
Token compare --strict — Budget enforcement mode for waza tokens compare (#318)
Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
Skill profile — waza tokens profile for static analysis of skill token distribution (#311)
JUnit XML reporter — --format junit output for CI integration (#312)
Template Variables — New internal/template package with Render() for Go text/template syntax in hooks and commands. System variables: JobID, TaskName, Iteration, Attempt, Timestamp. User variables via vars map (#186)
GroupBy Results — New group_by config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes GroupStats with name/passed/total/avg_score (#188)
Custom Input Variables — New inputs section in eval.yaml for defining key-value pairs available as {{.Vars.key}} throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
CSV Dataset Support — New tasks_from field to generate tasks from CSV files. Each row becomes a task with columns accessible as {{.Vars.column}}. Optional range: [start, end] for row filtering. First row treated as headers (#187)
Retry/Attempts — Add max_attempts config field for retrying failed task executions within each trial (#191)
Lifecycle Hooks — Add hooks section with before_run/after_run/before_task/after_task lifecycle points (#191)
prompt grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
- Two modes: clean (fresh context) and continue_session (resumes test session)
- Tool-based grading: set_waza_grade_pass and set_waza_grade_fail tools for LLM graders
- Separate judge model configuration: run evaluation with a different model than the executor
- Pre-built rubric templates adapted from Azure ML evaluators
trigger_tests.yaml auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
- New internal/trigger/ package for trigger testing
- Automatically discovered alongside eval.yaml
- Confidence weighting: high (weight 1.0) and medium (weight 0.5) for borderline cases
- trigger_accuracy metric with configurable cutoff threshold
- Metrics: accuracy, precision, recall, F1, error count
diff grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in examples/rubrics/ adapted from Azure ML evaluators (#160, #161):
- Tool call rubrics: tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization
- Task evaluation rubrics: task_completion, task_adherence, intent_resolution, response_completeness
MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)
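
The statistical confidence intervals entry above mentions a bootstrap with 10K resamples at 95% confidence. A minimal sketch of a percentile bootstrap for the mean score follows. It assumes the percentile method (the changelog does not say which bootstrap variant waza uses), and the `seed` parameter, the `bootstrapCI` name, and the sample scores are additions for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// bootstrapCI estimates a percentile bootstrap confidence interval for the
// mean of scores. The seed parameter is an addition for reproducibility.
func bootstrapCI(scores []float64, resamples int, confidence float64, seed int64) (lo, hi float64) {
	if len(scores) == 0 || resamples <= 0 {
		return 0, 0
	}
	rng := rand.New(rand.NewSource(seed))
	means := make([]float64, resamples)
	for i := range means {
		sum := 0.0
		for range scores { // draw len(scores) values with replacement
			sum += scores[rng.Intn(len(scores))]
		}
		means[i] = sum / float64(len(scores))
	}
	sort.Float64s(means)

	// Take the empirical quantiles that cut off (1-confidence)/2 in each tail.
	tail := (1 - confidence) / 2
	loIdx := clamp(int(tail*float64(resamples)), resamples)
	hiIdx := clamp(int((1-tail)*float64(resamples))-1, resamples)
	return means[loIdx], means[hiIdx]
}

// clamp keeps an index inside [0, n-1].
func clamp(i, n int) int {
	if i < 0 {
		return 0
	}
	if i > n-1 {
		return n - 1
	}
	return i
}

func main() {
	scores := []float64{0.6, 0.7, 0.8, 0.9, 1.0} // hypothetical per-trial scores
	lo, hi := bootstrapCI(scores, 10000, 0.95, 1)
	fmt.Printf("95%% CI for the mean: [%.3f, %.3f]\n", lo, hi)
}
```

A narrow interval suggests the trial scores agree with each other; a wide one suggests more trials are needed before comparing two runs.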

Changed

Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)

Fixed

install.sh macOS checksum — added shasum -a 256 fallback for macOS (which lacks sha256sum) (#163)
Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
GitHub icon alignment and search bar width on docs site

[0.4.0-alpha.1] - 2026-02-17

Added

Go cross-platform release pipeline — go-release.yml workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
install.sh installer — one-line binary install with checksum verification: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
skill_invocation grader — validates orchestration workflows by checking which sk...

22 Apr 20:01

github-actions

azd-ext-microsoft-azd-waza_0.30.0

6aaebec

Waza azd Extension v0.30.0


22 Apr 18:03

github-actions

azd-ext-microsoft-azd-waza_0.29.0

0194e61

Waza azd Extension v0.29.0

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.24.0] - 2026-03-25

Changed

Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
Unified token counting — waza check and waza tokens count now share the same counting logic for consistent results (#146)

Fixed

Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
Per-file token budget configuration — Configure token budgets per-file in .waza.yaml (#96)
Skill-aware thresholds — waza tokens compare supports skill-specific threshold configuration (#93)
Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
FileWriter service — Refactored waza init inventory with FileWriter abstraction (#63)

Fixed

waza suggest deadlock — Execute() now applies the request timeout before calling Start(), preventing goroutine deadlock (#43)
ResourceFile.Content type — Changed from string to []byte for proper binary file handling (#117)
tokens compare in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
--output-dir ignored — Fixed --output-dir having no effect for single-skill runs (#109)
Web dashboard build order — Build dashboard assets before Go compilation (#107)
Test file leak — Fixed test that leaked files into the repo (#120)
Config schema defaults — Aligned config.schema.json defaults with Go source of truth (#65)
Skill discovery path — Discover skills under .github/skills/ directory (#69)

Changed

Renamed config node max_workers to workers for consistency across all config types
- This is a breaking change
Custom YAML deserializers for config types (#106)
YAML decoders now accept only known fields (#132)
Token limits priority inverted to .waza.yaml first (#64)
@wbreza added to CODEOWNERS (#111)
Go 1.26+ noted in agent instruction files (#108)

[0.9.0] - 2026-02-23

Added

A/B baseline testing — --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
Pairwise LLM judging — pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks eval coverage (#392)
Releases page — New docs site page at reference/releases with platform download links, install commands, and azd extension info (#383)
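The auto skill discovery entry above describes walking a directory tree for SKILL.md and eval.yaml pairs, with a strict mode that fails when a skill has no eval. The sketch below is one way to express that pairing. It assumes both files sit in the same directory, which the changelog does not state, and collectFiles and pairSkills are hypothetical names.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"sort"
)

// collectFiles walks root and returns the path of every non-directory entry.
func collectFiles(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

// pairSkills groups files by directory. Directories holding SKILL.md are
// reported as covered when they also hold eval.yaml, and as missing otherwise.
// Assumption: a skill and its eval live side by side in one directory.
func pairSkills(files []string) (covered, missing []string) {
	hasSkill := map[string]bool{}
	hasEval := map[string]bool{}
	for _, f := range files {
		dir := filepath.Dir(f)
		switch filepath.Base(f) {
		case "SKILL.md":
			hasSkill[dir] = true
		case "eval.yaml":
			hasEval[dir] = true
		}
	}
	for dir := range hasSkill {
		if hasEval[dir] {
			covered = append(covered, dir)
		} else {
			missing = append(missing, dir)
		}
	}
	sort.Strings(covered)
	sort.Strings(missing)
	return covered, missing
}

func main() {
	files, err := collectFiles(".")
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	covered, missing := pairSkills(files)
	fmt.Println("covered:", covered)
	// A strict mode would treat a non-empty missing list as a failure.
	fmt.Println("missing eval:", missing)
}
```

Keeping the pairing logic separate from the filesystem walk makes it easy to check against a plain list of paths.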

Fixed

Lint warnings — Resolved errcheck (webserver) and ineffassign (utils) lint warnings

Changed

Competitive research — Added OpenAI Evals analysis (docs/research/waza-vs-openai-evals.md), skill-validator analysis (docs/research/waza-vs-skill-validator.md), and eval registry design doc (docs/research/waza-eval-registry-design.md)
Mermaid diagrams — Converted remaining ASCII diagrams to Mermaid across all markdown files. Added Mermaid directive to AGENTS.md

[0.8.0] - 2026-02-21

Added

MCP Server — waza serve now includes an always-on MCP server with 10 tools (eval.list, eval.get, eval.validate, eval.run, task.list, run.status, run.cancel, results.summary, results.runs, skill.check) via stdio transport (#286)
waza suggest command — LLM-powered eval suggestions: reads SKILL.md, proposes test cases, graders, and fixtures. Flags: --model, --dry-run, --apply, --output-dir, --format (#287)
Interactive workflow skill — skills/waza-interactive/SKILL.md with 5 workflow scenarios for conversational eval orchestration (#288)
Grader weighting — weight field on grader configs, ComputeWeightedRunScore method, dashboard weighted scores column (#299)
Statistical confidence intervals — Bootstrap CI with 10K resamples, 95% confidence, normalized gain. Dashboard CI bands and significance badges (#308)
Judge model support — --judge-model flag and judge_model config for separate LLM-as-judge model (#309)
Spec compliance checks — 8 agentskills.io compliance checks in waza check and waza dev (#314)
SkillsBench advisory — 5 advisory checks (module-count, complexity, negative-delta, procedural, over-specificity) (#315)
MCP integration scoring — 4 MCP integration checks in waza dev (#316)
Batch skill processing — waza dev processes multiple skills in one run (#317)
Token compare --strict — Budget enforcement mode for waza tokens compare (#318)
Scaffold trigger tests — Auto-generate trigger test YAML from SKILL.md frontmatter (#319)
Skill profile — waza tokens profile for static analysis of skill token distribution (#311)
JUnit XML reporter — --format junit output for CI integration (#312)
Template Variables — New internal/template package with Render() for Go text/template syntax in hooks and commands. System variables: JobID, TaskName, Iteration, Attempt, Timestamp. User variables via vars map (#186)
GroupBy Results — New group_by config field to organize results by dimension (e.g., model). CLI shows grouped output, JSON includes GroupStats with name/passed/total/avg_score (#188)
Custom Input Variables — New inputs section in eval.yaml for defining key-value pairs available as {{.Vars.key}} throughout evaluation. Accessible in hooks, task templates, and grader configs (#189)
CSV Dataset Support — New tasks_from field to generate tasks from CSV files. Each row becomes a task with columns accessible as {{.Vars.column}}. Optional range: [start, end] for row filtering. First row treated as headers (#187)
Retry/Attempts — Add max_attempts config field for retrying failed task executions within each trial (#191)
Lifecycle Hooks — Add hooks section with before_run/after_run/before_task/after_task lifecycle points (#191)
prompt grader (LLM-as-judge) — LLM-based evaluation with rubrics, tool-based grading, and session management modes (#177, closes #104)
- Two modes: clean (fresh context) and continue_session (resumes test session)
- Tool-based grading: set_waza_grade_pass and set_waza_grade_fail tools for LLM graders
- Separate judge model configuration: run evaluation with a different model than the executor
- Pre-built rubric templates adapted from Azure ML evaluators
trigger_tests.yaml auto-discovery — measure prompt trigger accuracy for skills (#166, closes #36)
- New internal/trigger/ package for trigger testing
- Automatically discovered alongside eval.yaml
- Confidence weighting: high (weight 1.0) and medium (weight 0.5) for borderline cases
- trigger_accuracy metric with configurable cutoff threshold
- Metrics: accuracy, precision, recall, F1, error count
diff grader — new grader type for workspace file comparison with snapshot matching and contains-line fragment checks (#158)
Azure ML evaluation rubrics — 8 pre-built rubric YAMLs in examples/rubrics/ adapted from Azure ML evaluators (#160, #161):
- Tool call rubrics: tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization
- Task evaluation rubrics: task_completion, task_adherence, intent_resolution, response_completeness
MockEngine WorkspaceDir support — test infrastructure for graders that need workspace access (#159)

Changed

Dashboard — Aspire-style trajectory waterfall, weighted scores column, CI bands with significance indicators, judge model badge (#303, #330, #331, #332)
Docs site — Dashboard explore page with 14+ screenshots, light/dark mode, navbar polish (#357, #358, #360)

Fixed

install.sh macOS checksum — added shasum -a 256 fallback for macOS (which lacks sha256sum) (#163)
Dashboard compare-runs screenshot now shows 2 runs selected with full comparison
GitHub icon alignment and search bar width on docs site

[0.4.0-alpha.1] - 2026-02-17

Added

Go cross-platform release pipeline — go-release.yml workflow builds binaries for linux/darwin/windows on amd64 and arm64 (#155)
install.sh installer — one-line binary install with checksum verification: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
skill_invocation grader — validates orchestration workflows by checking which sk...


21 Apr 21:06

spboyer

v0.28.0

b1acf61

v0.28.0

What's New in v0.28.0

New Features

waza models command (#141) — List available Copilot models with their IDs and capabilities. Requires authentication via copilot login. (@richardpark-msft)
Trigger test early termination (#188) — Trigger tests now cancel the agent session as soon as a skill invocation is detected, instead of waiting for the full timeout. Implemented at the execution layer via CancelOnSkillInvocation flag. (@JasonYeMSFT)
Follow-up prompts (#189) — Support multi-turn eval tasks with follow_up_prompts in task YAML. Follow-ups reuse the same session and workspace, enabling tests for conversational workflows where the agent pauses for confirmation. (@JasonYeMSFT)
Quick Start guide — New focused 5-minute quick start page on the docs site, with Mermaid workflow diagram and tabbed install options. Added as the first sidebar item.

Bug Fixes

CI integration test (#210) — Fixed the root cause of persistent ubuntu-latest CI failures. PR #203 (v0.27.0) wired up evaluateExpectations() which made output_contains checks execute for the first time — the mock executor's generic output didn't match. CI now correctly allows eval failures with mock while still catching crashes.
YAML validation audit (#132) — Verified all 10 user-facing config loaders use strict KnownFields(true) parsing. Added regression test for unknown field rejection. (@LarryOsterman)

Infrastructure

CODEOWNERS simplified to @spboyer
Branch protection rulesets updated with proper bypass actors and streamlined required checks (Lint, CLA, test)

Documentation

Quick Start page with install → auth → first eval workflow
waza models added to CLI reference
follow_up_prompts and trigger early termination documented in eval-yaml guide

Full Changelog: v0.27.0...v0.28.0

Contributors

spboyer, LarryOsterman, and 2 other contributors


21 Apr 18:29

spboyer

v0.27.0

7fc7f07

v0.27.0

What's New in v0.27.0

New Features

tool_calls grader (#187) — Validate which tools the agent called during execution. Supports required_tools, forbidden_tools, min_calls, and max_calls constraints with partial scoring. (@JasonYeMSFT)
output_contains_any expectation (#137) — New YAML field that passes if ANY of the listed strings appear in output (OR logic), complementing the existing output_contains (AND logic) and output_not_contains. (@LarryOsterman)
max_response_time_ms behavior rule (#136) — Enforce response time limits on eval tasks. Fails the behavior check if execution exceeds the configured threshold. (@LarryOsterman)
prompt_file for task prompts (#157) — Load task prompts from external files instead of inline YAML. Supports prompt_file: path/to/prompt.md with path traversal protection. (@LarryOsterman)
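The output_contains_any entry above distinguishes three expectation fields: output_contains (AND), output_contains_any (OR), and output_not_contains. The sketch below spells out that logic. The Expectations struct and check function are hypothetical, and two details are assumptions: matching is a case-sensitive substring test, and an empty list imposes no constraint.

```go
package main

import (
	"fmt"
	"strings"
)

// Expectations mirrors the three YAML fields named above.
// The struct itself is hypothetical.
type Expectations struct {
	Contains    []string // every string must appear (AND)
	ContainsAny []string // at least one string must appear (OR)
	NotContains []string // no string may appear
}

// check reports whether output satisfies all three groups.
// Assumption: an empty group is treated as no constraint.
func check(output string, e Expectations) bool {
	for _, s := range e.Contains {
		if !strings.Contains(output, s) {
			return false
		}
	}
	if len(e.ContainsAny) > 0 {
		found := false
		for _, s := range e.ContainsAny {
			if strings.Contains(output, s) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	for _, s := range e.NotContains {
		if strings.Contains(output, s) {
			return false
		}
	}
	return true
}

func main() {
	e := Expectations{
		Contains:    []string{"deployed"},
		ContainsAny: []string{"eastus", "westus"},
		NotContains: []string{"error"},
	}
	fmt.Println(check("deployed to westus", e))            // true
	fmt.Println(check("deployed to northeurope", e))       // false: no OR match
	fmt.Println(check("deployed to eastus with error", e)) // false: forbidden string
}
```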

Bug Fixes

Windows CI fix (#204) — Webserver test now skips gracefully when frontend assets aren't built, fixing the persistent windows-latest CI failure that blocked all PRs today.
Cross-platform test fix — Absolute path test in suggest package uses runtime.GOOS for Windows compatibility.

Documentation

All 4 new features include updated docs:

Graders guide (graders.mdx) — tool_calls section added
Eval YAML guide (eval-yaml.mdx) — output_contains_any, max_response_time_ms, prompt_file documented
Schema reference (schema.mdx) — All new fields added

Full Changelog: v0.26.0...v0.27.0

Contributors

LarryOsterman and JasonYeMSFT


Releases: microsoft/waza
