chore: remove building-a-custom-evaluator notebook#70

Closed
jimbobbennett wants to merge 1 commit into main from chore/remove-custom-evaluator-notebook

@jimbobbennett

Contributor

Summary

Removes python/llm/evaluation/building-a-custom-evaluator.ipynb, the source notebook for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark Dataset" cookbook.

That cookbook is being consolidated into the newer "Align LLM Evals with Human Judgment" guide (same concept — build a custom LLM-as-a-Judge and align it to human-annotated ground truth), so the standalone notebook is no longer referenced by any docs page. No other notebooks or READMEs in this repo reference it.

Companion to Arize-ai/docs#650.

🤖 Generated with Claude Code

Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark
Dataset" cookbook, which is being consolidated into the newer "Align LLM Evals
with Human Judgment" guide. No other notebooks or READMEs reference it.

Companion to Arize-ai/docs#650.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

jimbobbennett added a commit that referenced this pull request Jun 25, 2026
Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark
Dataset" cookbook, which is being consolidated into "Align LLM Evals with
Human Judgment" (Arize-ai/docs#650). No other notebooks or READMEs reference
it. Folds in the change from #70.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jimbobbennett

Contributor Author

Folded into #71 (the mastra-arize-ax-tracing upgrade) to keep this workstream in a single tutorials PR. The notebook deletion is now commit 8666e5c on chore/upgrade-mastra-cookbook-latest.

@jimbobbennett jimbobbennett deleted the chore/remove-custom-evaluator-notebook branch June 25, 2026 01:40
jimbobbennett added a commit that referenced this pull request Jun 25, 2026
* chore: upgrade mastra-arize-ax-tracing cookbook to latest stack

The cookbook was pinned to a Mastra 0.15-era stack and `npm i` failed with an
ERESOLVE peer conflict. The bigger issue: @arizeai/openinference-mastra is
deprecated and relied on Mastra's legacy telemetry system, removed 2025-11-04.

Upgrade everything to the current stack and migrate the code accordingly:

Dependencies:
- @mastra/core 0.15 -> 1.46, libsql/loggers/memory/mastra CLI to latest 1.x
- @ai-sdk/openai 1 -> 3, zod 3 -> 4, typescript 5 -> 6, @types/node 24 -> 26
- replace deprecated @arizeai/openinference-mastra with native @mastra/arize
- add @mastra/observability and ai (now required directly)

Code (Mastra 0.x -> 1.x API changes):
- tracing: legacy telemetry.export.custom -> new AI Tracing via
  new Observability({ configs: { arize: { exporters: [new ArizeExporter(...)] }}})
- tool execute: input is now the first arg directly ({ context } -> { ...input })
- add required `id` to Agent and every LibSQLStore
- main storage :memory: -> file:../mastra.db so Mastra's internal scheduler
  tables (mastra_workflow_snapshot) persist and are shared across connections
  (":memory:" gives each libsql connection its own DB -> SQLITE_ERROR)
- gitignore *.db* artifacts; refresh README tracing references
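
To illustrate the tool `execute` change listed above, here is a minimal sketch of moving from a wrapped `{ context }` argument to an input-first argument. The `LegacyExecute` and `InputFirstExecute` types and the `toInputFirst` helper are hypothetical names invented for this example. They are not Mastra's actual types, and the real signatures may carry extra arguments.

```typescript
// Hypothetical types for illustration only; NOT Mastra's real signatures.
// Legacy style: the tool input arrives wrapped as { context }.
type LegacyExecute<I, O> = (args: { context: I }) => O;
// Input-first style: the tool input is the first argument directly.
type InputFirstExecute<I, O> = (input: I) => O;

// Adapter that lets a legacy-style function be called input-first.
function toInputFirst<I, O>(
  legacy: LegacyExecute<I, O>,
): InputFirstExecute<I, O> {
  return (input: I) => legacy({ context: input });
}

// A legacy-style tool body (invented example).
const legacyDescribe: LegacyExecute<{ city: string }, string> = ({ context }) =>
  `Weather lookup for ${context.city}`;

// The same logic, now callable with the input as the first argument.
const describeCity = toInputFirst(legacyDescribe);
```

In an actual migration the tool bodies would more likely be rewritten to read from the input directly. The adapter only makes the difference between the two call shapes explicit.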

Verified: npm i clean, tsc --noEmit clean, `mastra dev` boots with no errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: route tool LLM calls through Mastra agents for tracing

The weather-analysis and activity-planning tools called the ai SDK's
generateText() directly, so those LLM calls bypassed Mastra and never appeared
as spans in Arize AX.

Make them first-class Mastra agents instead:
- Add weatherAnalysisAgent and activityPlanningAgent (instructions + model),
  registered on the Mastra instance
- Tools now resolve the agent from the tool execution context
  (mastra.getAgent(...)) and call agent.generate(), so each call is captured
  as a traced child span of the tool. Using the context avoids importing the
  Mastra instance into tools (no circular dependency).
- Drop the now-unused `ai` dependency and direct @ai-sdk/openai imports in tools

Verified: tsc clean, `mastra dev` boots with no errors, all three agents
register.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
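
The context-based agent lookup described in this commit can be sketched roughly as follows. Only the agent name `weatherAnalysisAgent` and the `getAgent(...)` / `generate()` calls come from the commit message. The interfaces `AgentLike`, `RegistryLike`, and `ToolContextLike`, the `InMemoryRegistry` class, and the stub agent are assumptions made so the example runs without Mastra installed.

```typescript
// Hypothetical minimal interfaces; these are NOT Mastra's real types.
interface AgentLike {
  generate(prompt: string): Promise<{ text: string }>;
}
interface RegistryLike {
  getAgent(name: string): AgentLike;
}
interface ToolContextLike {
  mastra?: RegistryLike;
}

// Stand-in for the instance that agents are registered on.
class InMemoryRegistry implements RegistryLike {
  private readonly agents: Record<string, AgentLike>;
  constructor(agents: Record<string, AgentLike>) {
    this.agents = agents;
  }
  getAgent(name: string): AgentLike {
    const agent = this.agents[name];
    if (!agent) throw new Error(`Agent not registered: ${name}`);
    return agent;
  }
}

// Tool body: it resolves its worker agent from the context it is handed,
// so this module never imports the registry (no circular dependency).
async function analyzeWeather(
  input: { forecast: string },
  context: ToolContextLike,
): Promise<string> {
  if (!context.mastra) {
    throw new Error("No registry on the tool execution context");
  }
  const agent = context.mastra.getAgent("weatherAnalysisAgent");
  const result = await agent.generate(`Analyze this forecast: ${input.forecast}`);
  return result.text;
}

// Stub agent that records its prompts instead of calling a model.
const calls: string[] = [];
const stubAgent: AgentLike = {
  async generate(prompt: string) {
    calls.push(prompt);
    return { text: `analysis of: ${prompt}` };
  },
};
const registry = new InMemoryRegistry({ weatherAnalysisAgent: stubAgent });
```

Because the tool only depends on the small context interface, it can be exercised in isolation with a stub registry, as above.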

* docs: fix README and Node engine for the upgraded cookbook

- Point the intro link at the exact guide in its new location
  (cookbooks/evaluate/align-llm-evals-with-human-judgment)
- Bump required Node to >=22.13.0 (engines + README) to match @mastra/core 1.46
- Run instructions: Mastra Studio at http://localhost:4111 (no auto-open),
  with a concrete example request
- "How it works": show the analysis/planning tools delegating to worker
  agents, and explain the nested tracing
- Project structure: add the two new worker agent files

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: use non-deprecated gpt-4.1-mini model

gpt-4o-mini is superseded by gpt-4.1 (per the check-models policy) and gets
flagged as an outdated introduced reference. Move all three agents to
gpt-4.1-mini — the current non-reasoning mini tier, which (unlike the GPT-5
reasoning minis) still accepts temperature/top_p.

Verified with scan-models.mjs: no blocking model findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(check-models): run on Node 24

Bump the check-models workflow's setup-node from 20 to 24.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: use gpt-5.4-mini for the agents

Switch all three agents from gpt-4.1-mini to gpt-5.4-mini (the latest mini
tier). Verified end-to-end: `mastra dev` boots and a live orchestrator
invocation runs the full weather -> analysis -> planning flow and returns a
plan with no errors (the reasoning model does not reject the request).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(check-models): bump github-script to v8 (Node 24)

actions/github-script@v7 declares a Node 20 runtime, which the runner now
forces onto Node 24 with a deprecation warning. Bump to v8, which runs on
Node 24 natively.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: add generate-traces script for populating Arize AX

Add `npm run generate-traces`, a tsx script that sends a diverse set of
prompts to the WeatherOrchestratorAgent and flushes the spans to Arize AX via
mastra.shutdown(). Prompts are grouped to exercise each tool path:
weather-only, weather+planning, and the full weather+analysis+planning chain,
plus some varied/edge phrasings.

The script chdirs into src/ so the Mastra config's `file:../mastra.db` lands
in the cookbook root (gitignored) rather than the parent directory. Verified:
all 12 prompts complete and the observability exporter flushes on shutdown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
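
The overall shape of such a trace-generation loop might look like the sketch below: send each prompt in turn, then always flush on shutdown. The interfaces `OrchestratorLike` and `ShutdownLike`, the function `generateTraces`, and the stubs are hypothetical. The per-prompt error handling is an assumption and is not confirmed by the commit message.

```typescript
// Hypothetical minimal interfaces; NOT the real Mastra types.
interface OrchestratorLike {
  generate(prompt: string): Promise<{ text: string }>;
}
interface ShutdownLike {
  shutdown(): Promise<void>;
}

interface TraceRunSummary {
  completed: number;
  failed: string[];
}

// Send prompts one at a time, then flush exporters via shutdown().
// The finally block ensures the flush happens even if the loop throws.
async function generateTraces(
  agent: OrchestratorLike,
  app: ShutdownLike,
  prompts: string[],
): Promise<TraceRunSummary> {
  const failed: string[] = [];
  let completed = 0;
  try {
    for (const prompt of prompts) {
      try {
        await agent.generate(prompt);
        completed += 1;
      } catch (err) {
        // Assumption: one failing prompt should not abort the whole run.
        failed.push(prompt);
      }
    }
  } finally {
    await app.shutdown();
  }
  return { completed, failed };
}

// Stubs so the sketch runs without network access or API keys.
const flakyAgent: OrchestratorLike = {
  async generate(prompt: string) {
    if (prompt === "boom") throw new Error("simulated failure");
    return { text: `plan for: ${prompt}` };
  },
};
let shutdownCalls = 0;
const fakeApp: ShutdownLike = {
  async shutdown() {
    shutdownCalls += 1;
  },
};
```

Running prompts sequentially keeps each trace distinct. Whether the actual script runs them sequentially or concurrently is not stated in the commit message.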

* feat: expand example prompts to 50 and extract to JSON

Move the trace-generation prompts out of generate-traces.ts into
src/scripts/example-prompts.json, grouped by tool path, and expand the set to
50 prompts for broader coverage:

- weather-only: 12
- weather+planning: 12
- weather+analysis+planning: 12
- multi-location: 6
- vague-intent: 4
- activity-constrained: 4

The script now loads and flattens the JSON at runtime. Validated: JSON parses
to 50 unique prompts, tsc clean. README points at the JSON for editing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
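
The "load and flatten" step and the uniqueness check can be sketched as below. The group names and sizes are taken from the list in this commit message. The JSON shape (group name mapped to an array of strings), the helper names `flattenPrompts` and `findDuplicates`, and the placeholder prompt text are assumptions for illustration, since the actual file contents are not shown here.

```typescript
// Assumed shape of example-prompts.json: group name -> list of prompts.
type PromptGroups = Record<string, string[]>;

// Flatten every group into a single ordered list of prompts.
function flattenPrompts(groups: PromptGroups): string[] {
  const all: string[] = [];
  for (const name of Object.keys(groups)) {
    for (const prompt of groups[name]) {
      all.push(prompt);
    }
  }
  return all;
}

// Return each prompt text that appears more than once.
function findDuplicates(prompts: string[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const p of prompts) {
    if (seen.has(p)) dupes.add(p);
    seen.add(p);
  }
  return Array.from(dupes);
}

// Group sizes from the commit message; prompt text is invented placeholder data.
const groupSizes: Record<string, number> = {
  "weather-only": 12,
  "weather+planning": 12,
  "weather+analysis+planning": 12,
  "multi-location": 6,
  "vague-intent": 4,
  "activity-constrained": 4,
};

const groups: PromptGroups = {};
for (const name of Object.keys(groupSizes)) {
  const items: string[] = [];
  for (let i = 1; i <= groupSizes[name]; i++) {
    items.push(`${name} placeholder prompt ${i}`);
  }
  groups[name] = items;
}

const prompts = flattenPrompts(groups);
console.log(`${prompts.length} prompts, ${findDuplicates(prompts).length} duplicates`);
```

Keeping the prompts grouped in the JSON and flattening only at runtime lets the file stay organized by tool path while the script works with a simple list.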

* docs: add text language hint to diagram code fence

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: remove building-a-custom-evaluator notebook

Source for the Arize docs "Creating a Custom LLM Evaluator with a Benchmark
Dataset" cookbook, which is being consolidated into "Align LLM Evals with
Human Judgment" (Arize-ai/docs#650). No other notebooks or READMEs reference
it. Folds in the change from #70.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>