diff --git a/.agents/skills/check-models/SKILL.md b/.agents/skills/check-models/SKILL.md new file mode 100644 index 0000000..7d112b4 --- /dev/null +++ b/.agents/skills/check-models/SKILL.md @@ -0,0 +1,152 @@ +--- +name: check-models +description: Find and update out-of-date OpenAI and Anthropic model references (in docs, MDX, notebooks, and code) to the latest size-equivalent models, and apply the code changes each new generation requires (e.g. max_tokens → max_completion_tokens for GPT-5). Use when asked to "check the models", "update model versions", "are these models current", "migrate the models in the docs/tutorials", or before publishing content that names a model. +--- + +# check-models + +Keep model references current. OpenAI and Anthropic ship new generations often, and docs/tutorials drift: a notebook pinned to `gpt-4o-mini` or `claude-3-5-sonnet` is teaching a model that's a generation (or three) behind, sometimes with code that no longer runs (GPT-5 rejects `max_tokens`). + +This skill does three things: + +1. **Look up** the latest models (don't trust memory — models change). +2. **Migrate** every reference to the latest model **of the same size tier** (a `*-mini` becomes the latest mini, never the flagship). +3. **Fix the code** each new generation requires (parameter renames/removals). + +## The golden rule: match the size tier + +Never "upgrade" a small model to a big one. `gpt-4o-mini` is a small/cheap model — its modern equivalent is the latest **mini**, not the flagship. Migrating it to `gpt-5.5` silently multiplies a reader's cost. Map within the tier: + +| Old tier | OpenAI → | Anthropic → | +|---|---|---| +| flagship / full | latest flagship | latest Opus | +| pro | latest pro | latest Opus | +| mini / small / cheap | latest **mini** | latest Haiku | +| nano | latest **nano** | (smallest tier) latest Haiku | + +> The latest mini/nano are **not** always in the newest generation. As of the policy date below, the flagship is GPT-5.5 but there is no GPT-5.5-mini — so `gpt-4o-mini` → `gpt-5.4-mini`, **not** `gpt-5.5-mini`. Always confirm which generation actually has a mini/nano variant. + +## Current models (policy) + +The authoritative values live in [`models.json`](./models.json) — the scanner and the GitHub Action both read it. As of `updated: 2026-06-17`: + +**OpenAI (reasoning — default targets)** — flagship `gpt-5.5`, pro `gpt-5.5-pro`, mini `gpt-5.4-mini`, nano `gpt-5.4-nano`. +**OpenAI (non-reasoning — for temperature-dependent code)** — flagship `gpt-4.1`, mini `gpt-4.1-mini`, nano `gpt-4.1-nano`. Active, **not** deprecated, and they **accept `temperature`/`top_p`**. +**Anthropic** — Opus `claude-opus-4-8`, Sonnet `claude-sonnet-4-6`, Haiku `claude-haiku-4-5`. +**Google Gemini** — Pro `gemini-2.5-pro`, Flash `gemini-3.5-flash`, Flash-Lite `gemini-3.1-flash-lite`. The latest **stable** model per tier sits on *different* generations (like OpenAI), so the policy tracks Gemini with an explicit `google.current` allow-list and a `google.deprecated` → replacement map (in `models.json`), not a single version floor. Match the SIZE tier (pro/flash/flash-lite). `gemini-2.0-*` (shut down 2026-06-01), `gemini-1.5-*`, and `gemini-1.0-*` are deprecated; `gemini-2.5-*` and `gemini-3.x` are still current and are left alone. + +> ⚠️ **Claude Fable 5 / Mythos 5** were export-control-suspended on 2026-06-12 — they are NOT valid migration targets. ⚠️ GPT-5.5 has **no** mini/nano variant. + +**Specialised (non-chat) OpenAI models** — realtime, audio, image, transcribe, tts, embeddings, and moderation models don't map to the flagship/mini tiers, so they're tracked separately in `models.json` → `specialized`: +- `specialized.current` (e.g. `gpt-realtime-1.5`, `gpt-realtime-2`, `gpt-audio-1.5`, `gpt-image-2`, `gpt-4o-transcribe`, `gpt-4o-mini-tts`, `text-embedding-3-*`, `omni-moderation`) are **valid** — the scanner leaves them alone. +- `specialized.deprecated` maps a retired ID to its current replacement (e.g. `gpt-4o-realtime-preview` → `gpt-realtime-1.5` (shut down 2026-05-07), `gpt-4o-audio-preview` → `gpt-audio-1.5`, `gpt-4o-search-preview` → `gpt-5.4-mini`) — flagged `warn` with the target, since the realtime/audio API surface differs and the swap needs a human eye. + +These are checked **before** `review.openaiPatterns`, which stays the fallback for any unrecognised variant. When a new specialised model ships (or one is retired), update these two lists. + +### Reasoning vs non-reasoning — which OpenAI target? + +The GPT-5 family are **reasoning** models: they reject `temperature`, `top_p`, and the other sampling params. So the target depends on whether the call relies on those: + +- **General code** (no meaningful `temperature`, or `temperature` was incidental) → migrate to the **GPT-5** tier above (the default). This is what most code wants. +- **Temperature-dependent code** → migrate to the **non-reasoning `gpt-4.1` tier** and **keep `temperature`**. This is the right call for: + - **LLM-as-judge / evaluators** that set `temperature=0` for reproducible scores (e.g. Phoenix `OpenAIModel(...)`, `LLM(...)`, eval classifiers), + - **deterministic extraction / classification** pinned at `temperature=0`. + Match the size tier: a `gpt-4o-mini` eval judge → `gpt-4.1-mini` (keep `temperature=0`), not `gpt-5.4-mini`. + +`gpt-4.1` / `gpt-4.1-mini` / `gpt-4.1-nano` are treated as **current** by the scanner (not flagged) — they're a legitimate, non-deprecated choice. `gpt-4o` / `gpt-4o-mini` are also non-reasoning + active but older; migrate them to the GPT-5 tier by default, or to `gpt-4.1*` when temperature matters. + +## Workflow + +### 1. Verify the latest models are still current + +`models.json` is a cache — refresh it before a big migration. **The scanner can't browse, so the lookup is yours to run; `--refresh` gives you the exact checklist:** + +```bash +node .agents/skills/check-models/scripts/scan-models.mjs --refresh +``` + +It prints the current policy, the WebSearch queries, the authoritative source URLs, and which keys to edit. Then: + +- WebSearch `"latest OpenAI models"` and `"latest Anthropic Claude models"` (use the current month); cross-check Anthropic against the **`claude-api`** skill's `shared/models.md` (canonical Claude IDs + the Claude-side code changes). +- Confirm each tier's latest model **and that the tier still exists** (a new flagship may ship with no `-mini` yet — keep mini/nano on the older generation). Skip any suspended/withdrawn model. +- If anything changed, **propose the `models.json` diff and confirm before writing** (bump `flagship`/`mini`/`opus`/… and `updated`). That one edit updates the scanner, the skill, and the CI gate together. + +**When to refresh:** the scanner tells you. Every run prints a `⚠ Model policy may be out of date…` hint (and sets `stale.suggestRefresh` in `--json`) when either the policy is older than ~45 days **or** the content references a model newer than the policy knows (e.g. a `gpt-5.6` appears while the policy flagship is `gpt-5.5`). Newer-than-policy models are left untouched — never downgraded — so refresh first, then re-scan. + +### 2. Scan + +```bash +node .agents/skills/check-models/scripts/scan-models.mjs python typescript # scan paths +node .agents/skills/check-models/scripts/scan-models.mjs --json > out.json # machine-readable +node .agents/skills/check-models/scripts/scan-models.mjs --diff python typescript # PR-gate mode +``` + +**`--diff ` (the CI gate)** scans the whole of each *touched* file but tags every finding with `changed` (was the line added/modified by the PR?). It splits outdated-model errors into two tiers: **introduced** (on a changed line) → fails the run; **pre-existing** (on an unchanged line of a touched file) → reported as a non-blocking warning. Untouched files are never read. Full-scan mode (no `--diff`) tags everything `changed:true`, so it fails on any error as before. + +Each finding is one of: +- `✗ error` — an outdated lowercase canonical model ID (e.g. `gpt-4o-mini`). Migrate it. +- `⚠ review` / `⚠ replace` (prose) — a model named in prose (`GPT-4o`) or a specialised variant (`*-codex`, `*-chat-latest`, `gpt-4v`). Use judgement (see below). +- `⚠ param` — a GPT-5/o-series code change to apply (`max_tokens`, `temperature`, …). + +### 3. Migrate the model IDs + +For each finding, replace with the scanner's suggested target, **preserving local style**: +- Keep the original separator/case for prose: `GPT-4o` → `GPT-5.5` (not `gpt-5.5`); `claude-sonnet-4.5` (dotted) → `claude-sonnet-4.6`. +- Drop stale date snapshots: `gpt-5-2025-08-07` → `gpt-5.5` (use the bare alias, don't invent a date). +- Keep both halves of any SDK-version tab in sync (v7/v8 examples). + +#### Platform-specific IDs (Bedrock, Databricks, OpenRouter, LiteLLM) + +The scanner matches the embedded `claude-*` / `gpt-*` substring inside a platform-wrapped ID and flags it. Bump the **version** but **keep the host platform's ID format** — these are not bare first-party IDs: + +- **Amazon Bedrock** — IDs look like `[region.]anthropic.claude-<...>[-vN:0]`. Claude 4.x (Opus 4.x, Sonnet 4.5+, Haiku 4.5) require a **cross-region inference profile**, so they take a `us.` / `eu.` / `apac.` prefix and **drop** the on-demand `-vN:0` suffix the Claude 3 IDs used. e.g. `anthropic.claude-3-haiku-20240307-v1:0` → `us.anthropic.claude-haiku-4-5`. Match the region prefix to the doc's endpoint (the repo's Bedrock docs default to `us.`). +- **Databricks** — Foundation Model endpoint names look like `databricks-claude-sonnet-4-6`. Databricks owns these names and **availability is workspace/region-dependent** — bump the version following their pattern, but verify the endpoint actually exists rather than assuming it. +- **OpenRouter / LiteLLM** — provider-prefixed: `anthropic/claude-sonnet-4-6`, `openai/gpt-5.4-mini`. Keep the `provider/` prefix; bump only the model half. + +When a migration changes more than the version number (a Bedrock region prefix, a Databricks endpoint name), **call it out for the reviewer** — it may need adjusting for their region/workspace. + +### 4. Apply the code changes + +**These changes apply to the _raw OpenAI SDK_ only** (`client.chat.completions.create(...)`, `openai.OpenAI()...`). When migrating such a call **to GPT-5 or o-series** (reasoning models), in the same example: +- Rename `max_tokens` → `max_completion_tokens`. +- Remove `temperature` (unless it's the default `1`), `top_p`, `presence_penalty`, `frequency_penalty`, `logprobs`, `top_logprobs`, `logit_bias` — reasoning models reject them. Steer with `reasoning_effort` (`low`/`medium`/`high`) and `verbosity` instead. + +**Wrapper libraries need a split rule** — `phoenix.evals.OpenAIModel(...)`, `langchain_openai.ChatOpenAI(...)`, `litellm.completion(...)`, etc. expose their own kwargs: +- **`max_tokens`: keep it.** The wrapper owns this kwarg and maps it to `max_completion_tokens` internally; renaming it passes an unknown constructor arg and breaks the call. +- **non-default `temperature` / `top_p`:** a GPT-5/o-series reasoning model rejects any `temperature` ≠ 1 with a 400 — wrapper or not. Two correct outcomes: + - if the value **matters** (eval judge / deterministic call) → re-target to the non-reasoning **`gpt-4.1`** tier and **keep** the `temperature` (see "Reasoning vs non-reasoning" above); + - if it was **incidental** → drop the explicit value and stay on the GPT-5 model. +- Otherwise migrate the **model ID only**. + +> **`phoenix.evals` gotcha:** the legacy model classes (`OpenAIModel`, `LiteLLMModel`, `AnthropicModel`) accept `temperature` in their constructor (valid). The newer `LLM(provider=…, model=…)` does **not** — its `**kwargs` are forwarded to the *SDK client constructor* (for `api_key`/`base_url`), so `LLM(provider="openai", model="gpt-4.1", temperature=0)` raises `TypeError` at construction. Set sampling params on the **evaluator** instead: `ClassificationEvaluator(name=…, llm=…, prompt_template=…, choices=…, temperature=0)` (the `create_classifier(...)` helper takes no `temperature`). + +(The scanner flags any `max_tokens`/`temperature` token as a `⚠ param` — it can't tell a raw call from a wrapper or know the value, so this is the judgement call.) + +**Watch token caps on reasoning models.** GPT-5/o-series count *reasoning* tokens against `max_completion_tokens`, so a tiny cap that worked on gpt-4o (e.g. `max_tokens=20` for a short answer) can return empty/truncated output. When migrating such a call, raise the cap to a safe value (≥256) or set `reasoning_effort: "minimal"`. + +When migrating **to Claude Opus 4.8/4.7 or Sonnet 4.6**: `budget_tokens` and `temperature`/`top_p`/`top_k` are removed — use `thinking: {type: "adaptive"}` + `output_config.effort`. Defer to the **`claude-api`** skill (`/claude-api migrate`) for the full Claude code-change checklist; don't hand-edit Claude SDK calls from memory. + +### 5. Skip what shouldn't change + +Do **not** rewrite: +- **Autogenerated code** — `api-clients/` is generated SDK reference; never hand-edit it (it's in `excludePaths`, so the scanner skips it). Fix the model references at their generator/source instead. +- **Historical / release-notes / changelog / migration-guide** content — it documents what *was* true ("v1.2 added gpt-4o support"). These paths are in `excludePaths`; respect the same rule for any historical prose the scanner happens to catch. +- **Non-model tokens** that share a prefix: `gpt-oss-*` (open-weight models, version-pinned), `claude-code`, `claude-agent-sdk`, web-crawler user-agents (`claude-web`, `claude-user`, `claude-searchbot`), image filenames. These are in the `ignore` list and won't be flagged. +- **Comparative prose** that names an old model on purpose ("unlike GPT-4, GPT-5.5 can…"). Leave the historical reference; only update where the doc is telling the reader which model to *use*. +- **Markdown image alt text** — `![…model gpt-4o-mini…](screenshot.png)` describes what a screenshot *shows*. Editing the alt text alone would make it misdescribe the image (the pixels still show the old model), so the scanner skips model names inside `![ … ]`. Update these by **regenerating the screenshot**, not by editing the alt text. + +To suppress a single line the scanner shouldn't touch, add a `check-models:ignore` comment on it. + +### 6. Verify + +Re-run the scanner — `✗ error` count should be 0. Remaining `⚠` are prose/variants you've consciously reviewed. + +Also guard against a **stale date carried onto a new alias**: when migrating a dated ID, drop the date (`claude-sonnet-4-5-20250929` → `claude-sonnet-4-6`, **not** `claude-sonnet-4-6-20250929` — that snapshot doesn't exist). The scanner can't catch this (it date-strips before classifying, so a wrong-dated current alias looks current), so grep for it: + +```bash +grep -rnE 'claude-(opus-4-8|sonnet-4-6)-20[0-9]' python typescript # these aliases have no dated form +``` + +## Updating for a new model launch + +The whole skill is driven by `models.json`. When a new model ships: WebSearch to confirm the IDs and which tiers exist, edit the relevant value(s) in `models.json` plus `updated`, re-run the scanner. No code changes to the scanner are normally needed — it derives the full old→new table from the policy numbers. diff --git a/.agents/skills/check-models/scripts/models.json b/.agents/skills/check-models/scripts/models.json new file mode 100644 index 0000000..988d0a0 --- /dev/null +++ b/.agents/skills/check-models/scripts/models.json @@ -0,0 +1,129 @@ +{ + "_comment": "Single source of truth for the check-models skill. Update the numbers in `openai` and `anthropic` whenever a new model ships, then re-run the scanner. The scanner derives the full replacement table from these policy values — you should not need to enumerate every old ID.", + "updated": "2026-06-18", + "verifiedBy": "WebFetch developers.openai.com/api/docs/models + /deprecations (2026-06-18); Google via ai.google.dev/gemini-api/docs/models (2026-06-18); Anthropic via claude-api skill shared/models.md", + + "openai": { + "flagship": "gpt-5.5", + "pro": "gpt-5.5-pro", + "mini": "gpt-5.4-mini", + "nano": "gpt-5.4-nano", + "flagshipMinVersion": 5.4, + "proMinVersion": 5.5, + "miniMinVersion": 5.4, + "nanoMinVersion": 5.4, + "notes": "GPT-5.5 is the latest flagship (released 2026-04-23, API id gpt-5.5-2026-04-23) and has no mini/nano variant — mini/nano stay on the 5.4 generation. GPT-5.4 also remains available as a cheaper flagship (still current, NOT deprecated per OpenAI's models page), so flagshipMinVersion is 5.4 — only gpt-5.3 and older flag for migration; gpt-5.4 and gpt-5.5 both pass. Migration target for outdated flagships is still gpt-5.5 (the `flagship` value). Match the SIZE tier: a *-mini model migrates to the latest mini, never the flagship. The GPT-5 family are REASONING models — they reject temperature/top_p (see codeChanges). For temperature-dependent code use the nonReasoning tier below.", + "nonReasoning": { + "flagship": "gpt-4.1", + "mini": "gpt-4.1-mini", + "nano": "gpt-4.1-nano", + "note": "Latest NON-reasoning OpenAI models — active (NOT deprecated, per https://developers.openai.com/api/docs/deprecations checked 2026-06-17) and they ACCEPT temperature/top_p. Use these (not GPT-5) for code that needs temperature control — LLM-as-judge evaluators, deterministic extraction/classification at temperature 0. gpt-4o / gpt-4o-mini are also non-reasoning + active but older; gpt-4.1 supersedes them.", + "treatAsCurrent": true + }, + "deprecatedOpenAI": ["gpt-3.5-turbo (snapshots; shutdown 2026-10-23)", "o1", "o1-mini", "o1-preview", "o3-mini", "o4-mini", "gpt-5 / gpt-5-mini legacy snapshots (shutdown 2026-12-11)", "gpt-5.2-chat-latest (2026-08-10)"] + }, + + "anthropic": { + "opus": "claude-opus-4-8", + "sonnet": "claude-sonnet-4-6", + "haiku": "claude-haiku-4-5", + "notes": "Opus 4.8 / Sonnet 4.6 / Haiku 4.5 are the current GA tiers. Claude Fable 5 and Claude Mythos 5 exist but were export-control-suspended on 2026-06-12 — NEVER migrate to them. Match the SIZE tier (opus/sonnet/haiku)." + }, + + "google": { + "_comment": "Gemini tiers span generations (the latest STABLE per tier is on different generations, like OpenAI). `current` IDs are valid (not flagged). `deprecated` IDs map to their current-tier replacement. Verified against https://ai.google.dev/gemini-api/docs/models + /changelog (2026-06-18). Match the SIZE tier: pro/flash/flash-lite.", + "tiers": { "pro": "gemini-2.5-pro", "flash": "gemini-3.5-flash", "flash-lite": "gemini-3.1-flash-lite" }, + "current": [ + "gemini-2.5-pro", "gemini-3.1-pro", "gemini-3.5-pro", + "gemini-3.5-flash", "gemini-3-flash", "gemini-2.5-flash", + "gemini-3.1-flash-lite", "gemini-2.5-flash-lite" + ], + "deprecated": { + "gemini-2.0-flash": "gemini-3.5-flash", + "gemini-2.0-flash-lite": "gemini-3.1-flash-lite", + "gemini-1.5-pro": "gemini-2.5-pro", + "gemini-1.5-flash": "gemini-3.5-flash", + "gemini-1.5-flash-8b": "gemini-3.1-flash-lite", + "gemini-1.0-pro": "gemini-2.5-pro", + "gemini-pro": "gemini-2.5-pro", + "gemini-pro-vision": "gemini-2.5-pro" + } + }, + + "codeChanges": { + "_comment": "Reasoning-model (GPT-5 / o-series) parameter migrations. The scanner flags these; the skill applies them.", + "renames": [ + { "from": "max_tokens", "to": "max_completion_tokens", "scope": "openai-reasoning", "note": "GPT-5 and o-series reject max_tokens in Chat Completions; use max_completion_tokens." } + ], + "removeForReasoning": [ + "temperature", "top_p", "presence_penalty", "frequency_penalty", + "logprobs", "top_logprobs", "logit_bias" + ], + "removeNote": "GPT-5 / o-series reasoning models reject these sampling params (temperature must be the default 1 if sent). Remove them; steer with `reasoning_effort` (low|medium|high) and `verbosity` (low|medium|high) instead.", + "anthropicNote": "Anthropic Opus 4.8/4.7 reject temperature/top_p/top_k and budget_tokens (use thinking:{type:'adaptive'} + effort). See the claude-api skill `shared/model-migration.md` for full Claude code changes." + }, + + "specialized": { + "_comment": "Non-chat OpenAI families (realtime / audio / image / transcribe / tts / embeddings / moderation) that don't map to the flagship/mini tiers. `current` IDs are treated as valid (not flagged); `deprecated` IDs map to their current replacement. Checked before review.openaiPatterns, which stays as the fallback for unrecognised variants. Verified against https://developers.openai.com/api/docs/models and /deprecations (2026-06-18).", + "current": [ + "gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", "gpt-realtime-mini", + "gpt-realtime-translate", "gpt-realtime-whisper", + "gpt-audio", "gpt-audio-1.5", "gpt-audio-mini", + "gpt-image-2", + "gpt-4o-transcribe", "gpt-4o-mini-transcribe", + "gpt-4o-mini-tts", + "text-embedding-3-small", "text-embedding-3-large", + "omni-moderation", "omni-moderation-latest" + ], + "deprecated": { + "gpt-4o-realtime-preview": "gpt-realtime-1.5", + "gpt-4o-mini-realtime-preview": "gpt-realtime-mini", + "gpt-4o-audio-preview": "gpt-audio-1.5", + "gpt-4o-mini-audio-preview": "gpt-audio-mini", + "gpt-4o-search-preview": "gpt-5.4-mini", + "gpt-4o-mini-search-preview": "gpt-5.4-mini" + } + }, + + "review": { + "_comment": "Tokens that look like model IDs but need a human decision — the scanner flags them (severity: warn) and never auto-rewrites them. specialized.current/deprecated (above) take precedence for known non-chat models.", + "openaiPatterns": [ + "codex", "chat-latest", "realtime", "audio", "search", "transcribe", + "image", "tts", "whisper", "embedding", "moderation", "instruct" + ], + "openaiExact": ["gpt-4v"] + }, + + "ignore": { + "_comment": "Known NON-model tokens that share the model prefix. The scanner never reports these.", + "openai": ["gpt-oss-20b", "gpt-oss-120b", "gpt-oss"], + "openaiPatterns": ["oss", "\\.png", "\\.jpg", "\\.jpeg", "\\.gif", "\\.svg"], + "anthropic": [ + "claude-code", "claude-code-tracing", "claude-agent-sdk", "claude-web", + "claude-user", "claude-searchbot", "claude-powered", "claude-trace", + "claude-session", "claude-fable-5", "claude-mythos-5" + ], + "anthropicPatterns": [ + "^claude-code", "^claude-md", "searchbot", "-bot$", "\\.png", "\\.jpg", + "\\.jpeg", "\\.gif", "\\.svg", "\\.log", "fable", "mythos", "-powered" + ] + }, + + "excludePaths": [ + "**/release-notes/**", + "**/changelog/**", + "**/*release-note*", + "**/*-releases.*", + "**/on-premise-releases*", + "**/*changelog*", + "**/*migration*", + "**/*migrate*", + "**/api-clients/**", + "**/.agents/skills/check-models/**", + "**/node_modules/**", + "**/.git/**", + "**/.venv/**" + ], + + "scanExtensions": [".mdx", ".md", ".ipynb", ".py", ".ts", ".tsx", ".js", ".jsx"] +} diff --git a/.agents/skills/check-models/scripts/scan-models.mjs b/.agents/skills/check-models/scripts/scan-models.mjs new file mode 100644 index 0000000..4684f6f Binary files /dev/null and b/.agents/skills/check-models/scripts/scan-models.mjs differ diff --git a/.github/workflows/check-models.yml b/.github/workflows/check-models.yml new file mode 100644 index 0000000..4bc50f8 --- /dev/null +++ b/.github/workflows/check-models.yml @@ -0,0 +1,106 @@ +name: Check model versions + +# Flags out-of-date OpenAI / Anthropic / Google model references. +# Two tiers: outdated models a PR INTRODUCES on changed lines FAIL the check; outdated +# models that already exist on unchanged lines of a file the PR touches are reported as +# non-blocking WARNINGS. Untouched files are never inspected, so unrelated PRs aren't blocked. +# Policy (latest models + ignore lists) lives in .agents/skills/check-models/scripts/models.json. +# Platform-wrapped IDs (Bedrock `[region.]anthropic.claude-…`, Databricks +# `databricks-claude-…`, OpenRouter/LiteLLM `provider/model`) are flagged on their +# embedded model substring — migrate the version but keep the platform's ID format. +# See the "Platform-specific IDs" section of the check-models skill. + +on: + pull_request: + branches: [main] + +permissions: + contents: read + pull-requests: write + +jobs: + check-models: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd + with: + fetch-depth: 0 # need the base commit to diff against + + - uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e + with: + node-version: 20 + + - name: Scan touched files for outdated models + id: scan + run: | + node .agents/skills/check-models/scripts/scan-models.mjs \ + --diff "${{ github.event.pull_request.base.sha }}" \ + python typescript \ + --json --no-fail > model-scan.json + cat model-scan.json + + - name: Comment and gate + uses: actions/github-script@f28e40c7f34bde8b3046d885e986cb6290c5673b + with: + script: | + const fs = require('fs'); + const r = JSON.parse(fs.readFileSync('model-scan.json', 'utf8')); + const marker = ''; + + // Three buckets: + // introduced — outdated model on a line this PR added/changed → FAILS the check + // preExisting — outdated model on an unchanged line of a touched file → warn, non-blocking + // warns — prose / specialised variants / GPT-5 param hints (changed lines) → review + const introduced = r.findings.filter(f => f.severity === 'error' && f.changed !== false); + const preExisting = r.findings.filter(f => f.severity === 'error' && f.changed === false); + const warns = r.findings.filter(f => f.severity === 'warn'); + + const row = f => `| \`${f.file}\`:${f.line} | \`${f.token}\` | ${f.action === 'replace' ? '`' + f.target + '`' : '—'} | ${f.reason} |`; + const table = items => `| Location | Found | Suggested | Why |\n|---|---|---|---|\n` + items.map(row).join('\n') + '\n\n'; + let body = `${marker}\n## 🤖 Model version check\n\n`; + + // Surface a stale-policy warning (content references a newer model than the policy knows, or the policy is old) + if (r.stale && r.stale.suggestRefresh) { + const newer = (r.stale.newerThanPolicy || []).map(n => `\`${n.token}\``).join(', '); + body += newer + ? `> ⚠️ **The model policy looks out of date** — this PR references ${newer}, newer than \`models.json\` knows. Run \`scan-models.mjs --refresh\` and update the policy.\n\n` + : `> ⚠️ **The model policy is ${r.stale.ageDays} days old** (updated ${r.updated}). Consider \`scan-models.mjs --refresh\`.\n\n`; + } + + if (!introduced.length && !preExisting.length && !warns.length) { + body += `✓ No outdated model references in this PR. _(policy ${r.updated})_`; + } else { + if (introduced.length) { + body += `### ✗ ${introduced.length} outdated model reference(s) introduced — please update (this fails the check)\n\n`; + body += `On lines this PR adds or changes:\n\n` + table(introduced); + } + if (preExisting.length) { + body += `### ⚠️ ${preExisting.length} pre-existing outdated model(s) in files this PR touches — not blocking\n\n`; + body += `These are on unchanged lines, so the check still passes — but since you're already editing these files, consider updating them too.\n\n` + table(preExisting); + } + if (warns.length) { + body += `### ⚠ ${warns.length} item(s) to review (not blocking)\n\n`; + body += `Prose mentions, specialised variants (\`*-codex\`, \`*-chat-latest\`), or GPT-5/o-series code changes (\`max_tokens\` → \`max_completion_tokens\`, drop \`temperature\`).\n\n` + table(warns); + } + body += `_See the [\`check-models\`](.agents/skills/check-models/SKILL.md) skill. Policy date: ${r.updated}. Add \`check-models:ignore\` to a line to skip it._\n`; + body += `\n> ℹ️ **Platform-wrapped IDs** (Bedrock \`[region.]anthropic.claude-…\`, Databricks \`databricks-claude-…\`, OpenRouter/LiteLLM \`provider/model\`) are flagged on their embedded model name — bump the version but keep the platform's ID format (e.g. Bedrock 4.x needs a \`us.\`/\`eu.\`/\`apac.\` inference-profile prefix). See the skill's _Platform-specific IDs_ section.`; + } + + // upsert a single bot comment + const { owner, repo } = context.repo; + const prNumber = context.payload.pull_request.number; + const comments = await github.paginate(github.rest.issues.listComments, { owner, repo, issue_number: prNumber }); + const existing = comments.find(c => c.body && c.body.includes(marker)); + // Post when there are findings, or when this PR concretely introduces a + // newer-than-policy model. Don't open a fresh comment for date-staleness + // alone (it would fire on every PR once the policy ages) — only update an existing one. + const worthPosting = introduced.length || preExisting.length || warns.length || (r.stale && r.stale.byContent); + if (existing) { + await github.rest.issues.updateComment({ owner, repo, comment_id: existing.id, body }); + } else if (worthPosting) { + await github.rest.issues.createComment({ owner, repo, issue_number: prNumber, body }); + } + + if (introduced.length) { + core.setFailed(`${introduced.length} outdated model reference(s) introduced by this PR.`); + } diff --git a/python/cookbooks/airflow_example_dags/example_arize_ax_self_optimizing_loop_dag.py b/python/cookbooks/airflow_example_dags/example_arize_ax_self_optimizing_loop_dag.py index 7a1beef..dec4677 100644 --- a/python/cookbooks/airflow_example_dags/example_arize_ax_self_optimizing_loop_dag.py +++ b/python/cookbooks/airflow_example_dags/example_arize_ax_self_optimizing_loop_dag.py @@ -83,7 +83,7 @@ ``"false"`` so you can inspect the artifacts in the Arize UI between runs. - ``arize_ax_self_optimizing_model`` — the model used by the server-side - experiment tasks (default ``"gpt-4o-mini"``). The optimizer always + experiment tasks (default ``"gpt-4o-mini"``). The optimizer always # demo: gate test edit uses ``gpt-4o`` regardless. """