chore: remove building-a-custom-evaluator notebook #70
Closed
jimbobbennett wants to merge 1 commit into
Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark Dataset" cookbook, which is being consolidated into the newer "Align LLM Evals with Human Judgment" guide. No other notebooks or READMEs reference it. Companion to Arize-ai/docs#650. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jimbobbennett added a commit that referenced this pull request on Jun 25, 2026
Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark Dataset" cookbook, which is being consolidated into "Align LLM Evals with Human Judgment" (Arize-ai/docs#650). No other notebooks or READMEs reference it. Folds in the change from #70. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jimbobbennett added a commit that referenced this pull request on Jun 25, 2026
* chore: upgrade mastra-arize-ax-tracing cookbook to latest stack
The cookbook was pinned to a Mastra 0.15-era stack and `npm i` failed with an
ERESOLVE peer conflict. The bigger issue: @arizeai/openinference-mastra is
deprecated and relied on Mastra's legacy telemetry system, removed 2025-11-04.
Upgrade everything to the current stack and migrate the code accordingly:
Dependencies:
- @mastra/core 0.15 -> 1.46, libsql/loggers/memory/mastra CLI to latest 1.x
- @ai-sdk/openai 1 -> 3, zod 3 -> 4, typescript 5 -> 6, @types/node 24 -> 26
- replace deprecated @arizeai/openinference-mastra with native @mastra/arize
- add @mastra/observability and ai (now required directly)
Code (Mastra 0.x -> 1.x API changes):
- tracing: legacy telemetry.export.custom -> new AI Tracing via
  new Observability({ configs: { arize: { exporters: [new ArizeExporter(...)] }}})
- tool execute: input is now the first arg directly ({ context } -> { ...input })
- add required `id` to Agent and every LibSQLStore
- main storage :memory: -> file:../mastra.db so Mastra's internal scheduler
  tables (mastra_workflow_snapshot) persist and are shared across connections
  (":memory:" gives each libsql connection its own DB -> SQLITE_ERROR)
- gitignore *.db* artifacts; refresh README tracing references
Verified: npm i clean, tsc --noEmit clean, `mastra dev` boots with no errors.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: route tool LLM calls through Mastra agents for tracing
The weather-analysis and activity-planning tools called the ai SDK's
generateText() directly, so those LLM calls bypassed Mastra and never appeared
as spans in Arize AX.
Make them first-class Mastra agents instead:
- Add weatherAnalysisAgent and activityPlanningAgent (instructions + model),
registered on the Mastra instance
- Tools now resolve the agent from the tool execution context
(mastra.getAgent(...)) and call agent.generate(), so each call is captured
as a traced child span of the tool. Using the context avoids importing the
Mastra instance into tools (no circular dependency).
- Drop the now-unused `ai` dependency and direct @ai-sdk/openai imports in tools
Verified: tsc clean, `mastra dev` boots with no errors, all three agents
register.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
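The context-based lookup described in this commit might look roughly like the sketch below. It does not use the real Mastra types: `AgentLike`, `RegistryLike`, `ToolContextLike`, `analyzeWeather`, and `makeStubRegistry` are hypothetical stand-ins, and the `(input, context)` argument order is an assumption based on the commit text. Only the `getAgent(...)` and `generate()` call names and the `weatherAnalysisAgent` name come from the commit message.

```typescript
// Hypothetical stand-ins for the framework types; the real Mastra types differ.
interface AgentLike {
  generate(prompt: string): Promise<{ text: string }>;
}

interface RegistryLike {
  getAgent(name: string): AgentLike;
}

interface ToolContextLike {
  mastra: RegistryLike;
}

// The tool receives the registry through its execution context instead of
// importing the top-level instance, so this module creates no import cycle.
async function analyzeWeather(
  input: { location: string; forecast: string },
  context: ToolContextLike,
): Promise<string> {
  const agent = context.mastra.getAgent("weatherAnalysisAgent");
  const result = await agent.generate(
    `Analyze the forecast for ${input.location}: ${input.forecast}`,
  );
  return result.text;
}

// A stub registry that records which agents were requested (illustration only).
function makeStubRegistry(): RegistryLike & { calls: string[] } {
  const calls: string[] = [];
  return {
    calls,
    getAgent(name: string): AgentLike {
      calls.push(name);
      return {
        generate: async (prompt: string) => ({ text: `[${name}] ${prompt}` }),
      };
    },
  };
}
```

Because the tool depends only on the small context shape, it can be exercised with a stub registry and no model call, which is one practical benefit of resolving the agent at execution time.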
* docs: fix README and Node engine for the upgraded cookbook
- Point the intro link at the exact guide in its new location
(cookbooks/evaluate/align-llm-evals-with-human-judgment)
- Bump required Node to >=22.13.0 (engines + README) to match @mastra/core 1.46
- Run instructions: Mastra Studio at http://localhost:4111 (no auto-open),
with a concrete example request
- "How it works": show the analysis/planning tools delegating to worker
agents, and explain the nested tracing
- Project structure: add the two new worker agent files
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: use non-deprecated gpt-4.1-mini model
gpt-4o-mini is superseded by gpt-4.1 (per the check-models policy) and gets
flagged as an outdated introduced reference. Move all three agents to
gpt-4.1-mini — the current non-reasoning mini tier, which (unlike the GPT-5
reasoning minis) still accepts temperature/top_p.
Verified with scan-models.mjs: no blocking model findings.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* ci(check-models): run on Node 24
Bump the check-models workflow's setup-node from 20 to 24.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: use gpt-5.4-mini for the agents
Switch all three agents from gpt-4.1-mini to gpt-5.4-mini (the latest mini
tier). Verified end-to-end: `mastra dev` boots and a live orchestrator
invocation runs the full weather -> analysis -> planning flow and returns a
plan with no errors (the reasoning model does not reject the request).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* ci(check-models): bump github-script to v8 (Node 24)
actions/github-script@v7 declares a Node 20 runtime, which the runner now
forces onto Node 24 with a deprecation warning. Bump to v8, which runs on
Node 24 natively.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: add generate-traces script for populating Arize AX
Add `npm run generate-traces`, a tsx script that sends a diverse set of
prompts to the WeatherOrchestratorAgent and flushes the spans to Arize AX via
mastra.shutdown(). Prompts are grouped to exercise each tool path:
weather-only, weather+planning, and the full weather+analysis+planning chain,
plus some varied/edge phrasings.
The script chdirs into src/ so the Mastra config's `file:../mastra.db` lands
in the cookbook root (gitignored) rather than the parent directory. Verified:
all 12 prompts complete and the observability exporter flushes on shutdown.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
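A minimal sketch of the send-then-flush pattern this commit describes, under stated assumptions: `OrchestratorLike`, `FlushableLike`, `RunSummary`, and `generateTraces` are hypothetical names, not the script's actual code. Continuing past a failed prompt is also an assumption made for illustration; the commit message only says that all prompts complete and the exporter flushes on shutdown.

```typescript
// Hypothetical minimal shapes; not the real Mastra API surface.
interface OrchestratorLike {
  generate(prompt: string): Promise<{ text: string }>;
}

interface FlushableLike {
  shutdown(): Promise<void>;
}

interface RunSummary {
  succeeded: number;
  failed: number;
}

// Sends prompts one at a time, counts failures instead of aborting, and
// always calls shutdown() so buffered spans are flushed before returning.
async function generateTraces(
  agent: OrchestratorLike,
  runtime: FlushableLike,
  prompts: string[],
): Promise<RunSummary> {
  const summary: RunSummary = { succeeded: 0, failed: 0 };
  try {
    for (const prompt of prompts) {
      try {
        await agent.generate(prompt);
        summary.succeeded += 1;
      } catch {
        summary.failed += 1;
      }
    }
  } finally {
    await runtime.shutdown();
  }
  return summary;
}
```

Placing the flush in `finally` means spans from completed prompts are still exported if the loop itself throws.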
* feat: expand example prompts to 50 and extract to JSON
Move the trace-generation prompts out of generate-traces.ts into
src/scripts/example-prompts.json, grouped by tool path, and expand the set to
50 prompts for broader coverage:
- weather-only: 12
- weather+planning: 12
- weather+analysis+planning: 12
- multi-location: 6
- vague-intent: 4
- activity-constrained: 4
The script now loads and flattens the JSON at runtime. Validated: JSON parses
to 50 unique prompts, tsc clean. README points at the JSON for editing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
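The load, flatten, and uniqueness check described above could be sketched as follows. The `PromptGroups` shape (group name mapped to a list of prompts), the helper names, and the sample prompts are assumptions for illustration; the real layout of `src/scripts/example-prompts.json` may differ.

```typescript
// Hypothetical shape: group name -> list of prompts.
type PromptGroups = Record<string, string[]>;

// Concatenates every group's prompts into one list, preserving group order.
function flattenPrompts(groups: PromptGroups): string[] {
  const all: string[] = [];
  for (const name of Object.keys(groups)) {
    for (const prompt of groups[name]) {
      all.push(prompt);
    }
  }
  return all;
}

// Returns each prompt that appears more than once, listed a single time.
function findDuplicates(prompts: string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const prompt of prompts) {
    if (seen.has(prompt)) duplicates.add(prompt);
    seen.add(prompt);
  }
  return Array.from(duplicates);
}

// Inline sample standing in for the parsed contents of the JSON file.
const sampleGroups: PromptGroups = JSON.parse(`{
  "weather-only": ["What's the weather in Paris?", "Is it raining in Tokyo?"],
  "weather+planning": ["Plan a day in Rome based on the weather."]
}`);

const samplePrompts = flattenPrompts(sampleGroups);
```

A check like `findDuplicates(samplePrompts).length === 0` is one way to express the "unique prompts" validation the commit mentions.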
* docs: add text language hint to diagram code fence
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: remove building-a-custom-evaluator notebook
Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark
Dataset" cookbook, which is being consolidated into "Align LLM Evals with
Human Judgment" (Arize-ai/docs#650). No other notebooks or READMEs reference
it. Folds in the change from #70.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Removes `python/llm/evaluation/building-a-custom-evaluator.ipynb`, the source notebook for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark Dataset" cookbook.
That cookbook is being consolidated into the newer "Align LLM Evals with Human Judgment" guide, which covers the same concept: building a custom LLM-as-a-Judge and aligning it to human-annotated ground truth. As a result, no docs page references the standalone notebook any longer. No other notebooks or READMEs in this repo reference it.
Companion to Arize-ai/docs#650.
🤖 Generated with Claude Code