chore: remove building-a-custom-evaluator notebook by jimbobbennett · Pull Request #70 · Arize-ai/tutorials

jimbobbennett · 2026-06-24T19:03:39Z

Summary

Removes python/llm/evaluation/building-a-custom-evaluator.ipynb, the source notebook for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark Dataset" cookbook.

That cookbook is being consolidated into the newer "Align LLM Evals with Human Judgment" guide (same concept — build a custom LLM-as-a-Judge and align it to human-annotated ground truth), so the standalone notebook is no longer referenced by any docs page. No other notebooks or READMEs in this repo reference it.

Companion to Arize-ai/docs#650.

🤖 Generated with Claude Code

Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark Dataset" cookbook, which is being consolidated into the newer "Align LLM Evals with Human Judgment" guide. No other notebooks or READMEs reference it. Companion to Arize-ai/docs#650.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

review-notebook-app · 2026-06-24T19:03:44Z

Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks.

Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark Dataset" cookbook, which is being consolidated into "Align LLM Evals with Human Judgment" (Arize-ai/docs#650). No other notebooks or READMEs reference it. Folds in the change from #70.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jimbobbennett · 2026-06-25T01:40:56Z

Folded into #71 (the mastra-arize-ax-tracing upgrade) to keep this workstream in a single tutorials PR. The notebook deletion is now commit 8666e5c on chore/upgrade-mastra-cookbook-latest.

* chore: upgrade mastra-arize-ax-tracing cookbook to latest stack

  The cookbook was pinned to a Mastra 0.15-era stack and `npm i` failed with an ERESOLVE peer conflict. The bigger issue: @arizeai/openinference-mastra is deprecated and relied on Mastra's legacy telemetry system, removed 2025-11-04. Upgrade everything to the current stack and migrate the code accordingly:

  Dependencies:
  - @mastra/core 0.15 -> 1.46, libsql/loggers/memory/mastra CLI to latest 1.x
  - @ai-sdk/openai 1 -> 3, zod 3 -> 4, typescript 5 -> 6, @types/node 24 -> 26
  - replace deprecated @arizeai/openinference-mastra with native @mastra/arize
  - add @mastra/observability and ai (now required directly)

  Code (Mastra 0.x -> 1.x API changes):
  - tracing: legacy telemetry.export.custom -> new AI Tracing via new Observability({ configs: { arize: { exporters: [new ArizeExporter(...)] }}})
  - tool execute: input is now the first arg directly ({ context } -> { ...input })
  - add required `id` to Agent and every LibSQLStore
  - main storage :memory: -> file:../mastra.db so Mastra's internal scheduler tables (mastra_workflow_snapshot) persist and are shared across connections (":memory:" gives each libsql connection its own DB -> SQLITE_ERROR)
  - gitignore *.db* artifacts; refresh README tracing references

  Verified: npm i clean, tsc --noEmit clean, `mastra dev` boots with no errors.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: route tool LLM calls through Mastra agents for tracing

  The weather-analysis and activity-planning tools called the ai SDK's generateText() directly, so those LLM calls bypassed Mastra and never appeared as spans in Arize AX. Make them first-class Mastra agents instead:
  - Add weatherAnalysisAgent and activityPlanningAgent (instructions + model), registered on the Mastra instance
  - Tools now resolve the agent from the tool execution context (mastra.getAgent(...)) and call agent.generate(), so each call is captured as a traced child span of the tool. Using the context avoids importing the Mastra instance into tools (no circular dependency).
  - Drop the now-unused `ai` dependency and direct @ai-sdk/openai imports in tools

  Verified: tsc clean, `mastra dev` boots with no errors, all three agents register.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: fix README and Node engine for the upgraded cookbook

  - Point the intro link at the exact guide in its new location (cookbooks/evaluate/align-llm-evals-with-human-judgment)
  - Bump required Node to >=22.13.0 (engines + README) to match @mastra/core 1.46
  - Run instructions: Mastra Studio at http://localhost:4111 (no auto-open), with a concrete example request
  - "How it works": show the analysis/planning tools delegating to worker agents, and explain the nested tracing
  - Project structure: add the two new worker agent files

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: use non-deprecated gpt-4.1-mini model

  gpt-4o-mini is superseded by gpt-4.1 (per the check-models policy) and gets flagged as an outdated introduced reference. Move all three agents to gpt-4.1-mini, the current non-reasoning mini tier, which (unlike the GPT-5 reasoning minis) still accepts temperature/top_p.

  Verified with scan-models.mjs: no blocking model findings.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(check-models): run on Node 24

  Bump the check-models workflow's setup-node from 20 to 24.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: use gpt-5.4-mini for the agents

  Switch all three agents from gpt-4.1-mini to gpt-5.4-mini (the latest mini tier). Verified end-to-end: `mastra dev` boots and a live orchestrator invocation runs the full weather -> analysis -> planning flow and returns a plan with no errors (the reasoning model does not reject the request).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(check-models): bump github-script to v8 (Node 24)

  actions/github-script@v7 declares a Node 20 runtime, which the runner now forces onto Node 24 with a deprecation warning. Bump to v8, which runs on Node 24 natively.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: add generate-traces script for populating Arize AX

  Add `npm run generate-traces`, a tsx script that sends a diverse set of prompts to the WeatherOrchestratorAgent and flushes the spans to Arize AX via mastra.shutdown(). Prompts are grouped to exercise each tool path: weather-only, weather+planning, and the full weather+analysis+planning chain, plus some varied/edge phrasings.

  The script chdirs into src/ so the Mastra config's `file:../mastra.db` lands in the cookbook root (gitignored) rather than the parent directory.

  Verified: all 12 prompts complete and the observability exporter flushes on shutdown.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: expand example prompts to 50 and extract to JSON

  Move the trace-generation prompts out of generate-traces.ts into src/scripts/example-prompts.json, grouped by tool path, and expand the set to 50 prompts for broader coverage:
  - weather-only: 12
  - weather+planning: 12
  - weather+analysis+planning: 12
  - multi-location: 6
  - vague-intent: 4
  - activity-constrained: 4

  The script now loads and flattens the JSON at runtime. Validated: JSON parses to 50 unique prompts, tsc clean. README points at the JSON for editing.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: add text language hint to diagram code fence

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: remove building-a-custom-evaluator notebook

  Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark Dataset" cookbook, which is being consolidated into "Align LLM Evals with Human Judgment" (Arize-ai/docs#650). No other notebooks or READMEs reference it. Folds in the change from #70.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
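For readers who want to see what "loads and flattens the JSON" and "parses to 50 unique prompts" could look like, here is a minimal sketch. It is not the code from generate-traces.ts. The `PromptGroups` shape (group name mapped to an array of prompt strings), the `flattenPrompts` helper, and the placeholder prompt text are all assumptions made for illustration; only the group names and counts come from the commit message above. The data is built inline rather than read from src/scripts/example-prompts.json so the snippet runs on its own.

```typescript
// Hypothetical shape: group name -> list of prompts.
// The real example-prompts.json may be structured differently.
type PromptGroups = { [group: string]: string[] };

// Flatten every group into one list, failing loudly on a duplicate prompt.
function flattenPrompts(groups: PromptGroups): string[] {
  const all: string[] = [];
  for (const group of Object.keys(groups)) {
    for (const prompt of groups[group]) {
      if (all.indexOf(prompt) !== -1) {
        throw new Error(`Duplicate prompt in group "${group}": ${prompt}`);
      }
      all.push(prompt);
    }
  }
  return all;
}

// Group sizes as listed in the commit message.
const groupSizes: { [group: string]: number } = {
  "weather-only": 12,
  "weather+planning": 12,
  "weather+analysis+planning": 12,
  "multi-location": 6,
  "vague-intent": 4,
  "activity-constrained": 4,
};

// Placeholder prompts standing in for the real file contents.
const sampleGroups: PromptGroups = {};
for (const group of Object.keys(groupSizes)) {
  sampleGroups[group] = [];
  for (let i = 1; i <= groupSizes[group]; i++) {
    sampleGroups[group].push(`${group} placeholder prompt ${i}`);
  }
}

const prompts = flattenPrompts(sampleGroups);
console.log(prompts.length); // 50
```

Checking uniqueness during the flatten step, rather than afterwards, means a duplicate is reported together with the group it was found in.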

jimbobbennett closed this Jun 25, 2026

jimbobbennett deleted the chore/remove-custom-evaluator-notebook branch June 25, 2026 01:40

jimbobbennett wants to merge 1 commit into main from chore/remove-custom-evaluator-notebook