diff --git a/.agents/skills/check-models/SKILL.md b/.agents/skills/check-models/SKILL.md
new file mode 100644
index 0000000..7d112b4
--- /dev/null
+++ b/.agents/skills/check-models/SKILL.md
@@ -0,0 +1,152 @@
+---
+name: check-models
+description: Find and update out-of-date OpenAI and Anthropic model references (in docs, MDX, notebooks, and code) to the latest size-equivalent models, and apply the code changes each new generation requires (e.g. max_tokens → max_completion_tokens for GPT-5). Use when asked to "check the models", "update model versions", "are these models current", "migrate the models in the docs/tutorials", or before publishing content that names a model.
+---
+
+# check-models
+
+Keep model references current. OpenAI and Anthropic ship new generations often, and docs/tutorials drift: a notebook pinned to `gpt-4o-mini` or `claude-3-5-sonnet` is teaching a model that's a generation (or three) behind, sometimes with code that no longer runs (GPT-5 rejects `max_tokens`).
+
+This skill does three things:
+
+1. **Look up** the latest models (don't trust memory — models change).
+2. **Migrate** every reference to the latest model **of the same size tier** (a `*-mini` becomes the latest mini, never the flagship).
+3. **Fix the code** each new generation requires (parameter renames/removals).
+
+## The golden rule: match the size tier
+
+Never "upgrade" a small model to a big one. `gpt-4o-mini` is a small/cheap model — its modern equivalent is the latest **mini**, not the flagship. Migrating it to `gpt-5.5` silently multiplies a reader's cost. Map within the tier:
+
+| Old tier | OpenAI → | Anthropic → |
+|---|---|---|
+| flagship / full | latest flagship | latest Opus |
+| pro | latest pro | latest Opus |
+| mini / small / cheap | latest **mini** | latest Haiku |
+| nano | latest **nano** | (smallest tier) latest Haiku |
+
+> The latest mini/nano are **not** always in the newest generation. As of the policy date below, the flagship is GPT-5.5 but there is no GPT-5.5-mini — so `gpt-4o-mini` → `gpt-5.4-mini`, **not** `gpt-5.5-mini`. Always confirm which generation actually has a mini/nano variant.
+
+## Current models (policy)
+
+The authoritative values live in [`models.json`](./models.json) — the scanner and the GitHub Action both read it. As of `updated: 2026-06-17`:
+
+**OpenAI (reasoning — default targets)** — flagship `gpt-5.5`, pro `gpt-5.5-pro`, mini `gpt-5.4-mini`, nano `gpt-5.4-nano`.
+**OpenAI (non-reasoning — for temperature-dependent code)** — flagship `gpt-4.1`, mini `gpt-4.1-mini`, nano `gpt-4.1-nano`. Active, **not** deprecated, and they **accept `temperature`/`top_p`**.
+**Anthropic** — Opus `claude-opus-4-8`, Sonnet `claude-sonnet-4-6`, Haiku `claude-haiku-4-5`.
+**Google Gemini** — Pro `gemini-2.5-pro`, Flash `gemini-3.5-flash`, Flash-Lite `gemini-3.1-flash-lite`. The latest **stable** model per tier sits on *different* generations (like OpenAI), so the policy tracks Gemini with an explicit `google.current` allow-list and a `google.deprecated` → replacement map (in `models.json`), not a single version floor. Match the SIZE tier (pro/flash/flash-lite). `gemini-2.0-*` (shut down 2026-06-01), `gemini-1.5-*`, and `gemini-1.0-*` are deprecated; `gemini-2.5-*` and `gemini-3.x` are still current and are left alone.
+
+> ⚠️ **Claude Fable 5 / Mythos 5** were export-control-suspended on 2026-06-12 — they are NOT valid migration targets. ⚠️ GPT-5.5 has **no** mini/nano variant.
+
+**Specialised (non-chat) OpenAI models** — realtime, audio, image, transcribe, tts, embeddings, and moderation models don't map to the flagship/mini tiers, so they're tracked separately in `models.json` → `specialized`:
+- `specialized.current` (e.g. `gpt-realtime-1.5`, `gpt-realtime-2`, `gpt-audio-1.5`, `gpt-image-2`, `gpt-4o-transcribe`, `gpt-4o-mini-tts`, `text-embedding-3-*`, `omni-moderation`) are **valid** — the scanner leaves them alone.
+- `specialized.deprecated` maps a retired ID to its current replacement (e.g. `gpt-4o-realtime-preview` → `gpt-realtime-1.5` (shut down 2026-05-07), `gpt-4o-audio-preview` → `gpt-audio-1.5`, `gpt-4o-search-preview` → `gpt-5.4-mini`) — flagged `warn` with the target, since the realtime/audio API surface differs and the swap needs a human eye.
+
+These are checked **before** `review.openaiPatterns`, which stays the fallback for any unrecognised variant. When a new specialised model ships (or one is retired), update these two lists.
+
+### Reasoning vs non-reasoning — which OpenAI target?
+
+The GPT-5 family are **reasoning** models: they reject `temperature`, `top_p`, and the other sampling params. So the target depends on whether the call relies on those:
+
+- **General code** (no meaningful `temperature`, or `temperature` was incidental) → migrate to the **GPT-5** tier above (the default). This is what most code wants.
+- **Temperature-dependent code** → migrate to the **non-reasoning `gpt-4.1` tier** and **keep `temperature`**. This is the right call for:
+ - **LLM-as-judge / evaluators** that set `temperature=0` for reproducible scores (e.g. Phoenix `OpenAIModel(...)`, `LLM(...)`, eval classifiers),
+ - **deterministic extraction / classification** pinned at `temperature=0`.
+ Match the size tier: a `gpt-4o-mini` eval judge → `gpt-4.1-mini` (keep `temperature=0`), not `gpt-5.4-mini`.
+
+`gpt-4.1` / `gpt-4.1-mini` / `gpt-4.1-nano` are treated as **current** by the scanner (not flagged) — they're a legitimate, non-deprecated choice. `gpt-4o` / `gpt-4o-mini` are also non-reasoning + active but older; migrate them to the GPT-5 tier by default, or to `gpt-4.1*` when temperature matters.
+
+## Workflow
+
+### 1. Verify the latest models are still current
+
+`models.json` is a cache — refresh it before a big migration. **The scanner can't browse, so the lookup is yours to run; `--refresh` gives you the exact checklist:**
+
+```bash
+node .agents/skills/check-models/scripts/scan-models.mjs --refresh
+```
+
+It prints the current policy, the WebSearch queries, the authoritative source URLs, and which keys to edit. Then:
+
+- WebSearch `"latest OpenAI models"` and `"latest Anthropic Claude models"` (use the current month); cross-check Anthropic against the **`claude-api`** skill's `shared/models.md` (canonical Claude IDs + the Claude-side code changes).
+- Confirm each tier's latest model **and that the tier still exists** (a new flagship may ship with no `-mini` yet — keep mini/nano on the older generation). Skip any suspended/withdrawn model.
+- If anything changed, **propose the `models.json` diff and confirm before writing** (bump `flagship`/`mini`/`opus`/… and `updated`). That one edit updates the scanner, the skill, and the CI gate together.
+
+**When to refresh:** the scanner tells you. Every run prints a `⚠ Model policy may be out of date…` hint (and sets `stale.suggestRefresh` in `--json`) when either the policy is older than ~45 days **or** the content references a model newer than the policy knows (e.g. a `gpt-5.6` appears while the policy flagship is `gpt-5.5`). Newer-than-policy models are left untouched — never downgraded — so refresh first, then re-scan.
+
+### 2. Scan
+
+```bash
+node .agents/skills/check-models/scripts/scan-models.mjs python typescript # scan paths
+node .agents/skills/check-models/scripts/scan-models.mjs --json > out.json # machine-readable
+node .agents/skills/check-models/scripts/scan-models.mjs --diff python typescript # PR-gate mode
+```
+
+**`--diff ` (the CI gate)** scans the whole of each *touched* file but tags every finding with `changed` (was the line added/modified by the PR?). It splits outdated-model errors into two tiers: **introduced** (on a changed line) → fails the run; **pre-existing** (on an unchanged line of a touched file) → reported as a non-blocking warning. Untouched files are never read. Full-scan mode (no `--diff`) tags everything `changed:true`, so it fails on any error as before.
+
+Each finding is one of:
+- `✗ error` — an outdated lowercase canonical model ID (e.g. `gpt-4o-mini`). Migrate it.
+- `⚠ review` / `⚠ replace` (prose) — a model named in prose (`GPT-4o`) or a specialised variant (`*-codex`, `*-chat-latest`, `gpt-4v`). Use judgement (see below).
+- `⚠ param` — a GPT-5/o-series code change to apply (`max_tokens`, `temperature`, …).
+
+### 3. Migrate the model IDs
+
+For each finding, replace with the scanner's suggested target, **preserving local style**:
+- Keep the original separator/case for prose: `GPT-4o` → `GPT-5.5` (not `gpt-5.5`); `claude-sonnet-4.5` (dotted) → `claude-sonnet-4.6`.
+- Drop stale date snapshots: `gpt-5-2025-08-07` → `gpt-5.5` (use the bare alias, don't invent a date).
+- Keep both halves of any SDK-version tab in sync (v7/v8 examples).
+
+#### Platform-specific IDs (Bedrock, Databricks, OpenRouter, LiteLLM)
+
+The scanner matches the embedded `claude-*` / `gpt-*` substring inside a platform-wrapped ID and flags it. Bump the **version** but **keep the host platform's ID format** — these are not bare first-party IDs:
+
+- **Amazon Bedrock** — IDs look like `[region.]anthropic.claude-<...>[-vN:0]`. Claude 4.x (Opus 4.x, Sonnet 4.5+, Haiku 4.5) require a **cross-region inference profile**, so they take a `us.` / `eu.` / `apac.` prefix and **drop** the on-demand `-vN:0` suffix the Claude 3 IDs used. e.g. `anthropic.claude-3-haiku-20240307-v1:0` → `us.anthropic.claude-haiku-4-5`. Match the region prefix to the doc's endpoint (the repo's Bedrock docs default to `us.`).
+- **Databricks** — Foundation Model endpoint names look like `databricks-claude-sonnet-4-6`. Databricks owns these names and **availability is workspace/region-dependent** — bump the version following their pattern, but verify the endpoint actually exists rather than assuming it.
+- **OpenRouter / LiteLLM** — provider-prefixed: `anthropic/claude-sonnet-4-6`, `openai/gpt-5.4-mini`. Keep the `provider/` prefix; bump only the model half.
+
+When a migration changes more than the version number (a Bedrock region prefix, a Databricks endpoint name), **call it out for the reviewer** — it may need adjusting for their region/workspace.
+
+### 4. Apply the code changes
+
+**These changes apply to the _raw OpenAI SDK_ only** (`client.chat.completions.create(...)`, `openai.OpenAI()...`). When migrating such a call **to GPT-5 or o-series** (reasoning models), in the same example:
+- Rename `max_tokens` → `max_completion_tokens`.
+- Remove `temperature` (unless it's the default `1`), `top_p`, `presence_penalty`, `frequency_penalty`, `logprobs`, `top_logprobs`, `logit_bias` — reasoning models reject them. Steer with `reasoning_effort` (`low`/`medium`/`high`) and `verbosity` instead.
+
+**Wrapper libraries need a split rule** — `phoenix.evals.OpenAIModel(...)`, `langchain_openai.ChatOpenAI(...)`, `litellm.completion(...)`, etc. expose their own kwargs:
+- **`max_tokens`: keep it.** The wrapper owns this kwarg and maps it to `max_completion_tokens` internally; renaming it passes an unknown constructor arg and breaks the call.
+- **non-default `temperature` / `top_p`:** a GPT-5/o-series reasoning model rejects any `temperature` ≠ 1 with a 400 — wrapper or not. Two correct outcomes:
+ - if the value **matters** (eval judge / deterministic call) → re-target to the non-reasoning **`gpt-4.1`** tier and **keep** the `temperature` (see "Reasoning vs non-reasoning" above);
+ - if it was **incidental** → drop the explicit value and stay on the GPT-5 model.
+- Otherwise migrate the **model ID only**.
+
+> **`phoenix.evals` gotcha:** the legacy model classes (`OpenAIModel`, `LiteLLMModel`, `AnthropicModel`) accept `temperature` in their constructor (valid). The newer `LLM(provider=…, model=…)` does **not** — its `**kwargs` are forwarded to the *SDK client constructor* (for `api_key`/`base_url`), so `LLM(provider="openai", model="gpt-4.1", temperature=0)` raises `TypeError` at construction. Set sampling params on the **evaluator** instead: `ClassificationEvaluator(name=…, llm=…, prompt_template=…, choices=…, temperature=0)` (the `create_classifier(...)` helper takes no `temperature`).
+
+(The scanner flags any `max_tokens`/`temperature` token as a `⚠ param` — it can't tell a raw call from a wrapper or know the value, so this is the judgement call.)
+
+**Watch token caps on reasoning models.** GPT-5/o-series count *reasoning* tokens against `max_completion_tokens`, so a tiny cap that worked on gpt-4o (e.g. `max_tokens=20` for a short answer) can return empty/truncated output. When migrating such a call, raise the cap to a safe value (≥256) or set `reasoning_effort: "minimal"`.
+
+When migrating **to Claude Opus 4.8/4.7 or Sonnet 4.6**: `budget_tokens` and `temperature`/`top_p`/`top_k` are removed — use `thinking: {type: "adaptive"}` + `output_config.effort`. Defer to the **`claude-api`** skill (`/claude-api migrate`) for the full Claude code-change checklist; don't hand-edit Claude SDK calls from memory.
+
+### 5. Skip what shouldn't change
+
+Do **not** rewrite:
+- **Autogenerated code** — `api-clients/` is generated SDK reference; never hand-edit it (it's in `excludePaths`, so the scanner skips it). Fix the model references at their generator/source instead.
+- **Historical / release-notes / changelog / migration-guide** content — it documents what *was* true ("v1.2 added gpt-4o support"). These paths are in `excludePaths`; respect the same rule for any historical prose the scanner happens to catch.
+- **Non-model tokens** that share a prefix: `gpt-oss-*` (open-weight models, version-pinned), `claude-code`, `claude-agent-sdk`, web-crawler user-agents (`claude-web`, `claude-user`, `claude-searchbot`), image filenames. These are in the `ignore` list and won't be flagged.
+- **Comparative prose** that names an old model on purpose ("unlike GPT-4, GPT-5.5 can…"). Leave the historical reference; only update where the doc is telling the reader which model to *use*.
+- **Markdown image alt text** — `` describes what a screenshot *shows*. Editing the alt text alone would make it misdescribe the image (the pixels still show the old model), so the scanner skips model names inside `![ … ]`. Update these by **regenerating the screenshot**, not by editing the alt text.
+
+To suppress a single line the scanner shouldn't touch, add a `check-models:ignore` comment on it.
+
+### 6. Verify
+
+Re-run the scanner — `✗ error` count should be 0. Remaining `⚠` are prose/variants you've consciously reviewed.
+
+Also guard against a **stale date carried onto a new alias**: when migrating a dated ID, drop the date (`claude-sonnet-4-5-20250929` → `claude-sonnet-4-6`, **not** `claude-sonnet-4-6-20250929` — that snapshot doesn't exist). The scanner can't catch this (it date-strips before classifying, so a wrong-dated current alias looks current), so grep for it:
+
+```bash
+grep -rnE 'claude-(opus-4-8|sonnet-4-6)-20[0-9]' python typescript # these aliases have no dated form
+```
+
+## Updating for a new model launch
+
+The whole skill is driven by `models.json`. When a new model ships: WebSearch to confirm the IDs and which tiers exist, edit the relevant value(s) in `models.json` plus `updated`, re-run the scanner. No code changes to the scanner are normally needed — it derives the full old→new table from the policy numbers.
diff --git a/.agents/skills/check-models/scripts/models.json b/.agents/skills/check-models/scripts/models.json
new file mode 100644
index 0000000..988d0a0
--- /dev/null
+++ b/.agents/skills/check-models/scripts/models.json
@@ -0,0 +1,129 @@
+{
+ "_comment": "Single source of truth for the check-models skill. Update the numbers in `openai` and `anthropic` whenever a new model ships, then re-run the scanner. The scanner derives the full replacement table from these policy values — you should not need to enumerate every old ID.",
+ "updated": "2026-06-18",
+ "verifiedBy": "WebFetch developers.openai.com/api/docs/models + /deprecations (2026-06-18); Google via ai.google.dev/gemini-api/docs/models (2026-06-18); Anthropic via claude-api skill shared/models.md",
+
+ "openai": {
+ "flagship": "gpt-5.5",
+ "pro": "gpt-5.5-pro",
+ "mini": "gpt-5.4-mini",
+ "nano": "gpt-5.4-nano",
+ "flagshipMinVersion": 5.4,
+ "proMinVersion": 5.5,
+ "miniMinVersion": 5.4,
+ "nanoMinVersion": 5.4,
+ "notes": "GPT-5.5 is the latest flagship (released 2026-04-23, API id gpt-5.5-2026-04-23) and has no mini/nano variant — mini/nano stay on the 5.4 generation. GPT-5.4 also remains available as a cheaper flagship (still current, NOT deprecated per OpenAI's models page), so flagshipMinVersion is 5.4 — only gpt-5.3 and older flag for migration; gpt-5.4 and gpt-5.5 both pass. Migration target for outdated flagships is still gpt-5.5 (the `flagship` value). Match the SIZE tier: a *-mini model migrates to the latest mini, never the flagship. The GPT-5 family are REASONING models — they reject temperature/top_p (see codeChanges). For temperature-dependent code use the nonReasoning tier below.",
+ "nonReasoning": {
+ "flagship": "gpt-4.1",
+ "mini": "gpt-4.1-mini",
+ "nano": "gpt-4.1-nano",
+ "note": "Latest NON-reasoning OpenAI models — active (NOT deprecated, per https://developers.openai.com/api/docs/deprecations checked 2026-06-17) and they ACCEPT temperature/top_p. Use these (not GPT-5) for code that needs temperature control — LLM-as-judge evaluators, deterministic extraction/classification at temperature 0. gpt-4o / gpt-4o-mini are also non-reasoning + active but older; gpt-4.1 supersedes them.",
+ "treatAsCurrent": true
+ },
+ "deprecatedOpenAI": ["gpt-3.5-turbo (snapshots; shutdown 2026-10-23)", "o1", "o1-mini", "o1-preview", "o3-mini", "o4-mini", "gpt-5 / gpt-5-mini legacy snapshots (shutdown 2026-12-11)", "gpt-5.2-chat-latest (2026-08-10)"]
+ },
+
+ "anthropic": {
+ "opus": "claude-opus-4-8",
+ "sonnet": "claude-sonnet-4-6",
+ "haiku": "claude-haiku-4-5",
+ "notes": "Opus 4.8 / Sonnet 4.6 / Haiku 4.5 are the current GA tiers. Claude Fable 5 and Claude Mythos 5 exist but were export-control-suspended on 2026-06-12 — NEVER migrate to them. Match the SIZE tier (opus/sonnet/haiku)."
+ },
+
+ "google": {
+ "_comment": "Gemini tiers span generations (the latest STABLE per tier is on different generations, like OpenAI). `current` IDs are valid (not flagged). `deprecated` IDs map to their current-tier replacement. Verified against https://ai.google.dev/gemini-api/docs/models + /changelog (2026-06-18). Match the SIZE tier: pro/flash/flash-lite.",
+ "tiers": { "pro": "gemini-2.5-pro", "flash": "gemini-3.5-flash", "flash-lite": "gemini-3.1-flash-lite" },
+ "current": [
+ "gemini-2.5-pro", "gemini-3.1-pro", "gemini-3.5-pro",
+ "gemini-3.5-flash", "gemini-3-flash", "gemini-2.5-flash",
+ "gemini-3.1-flash-lite", "gemini-2.5-flash-lite"
+ ],
+ "deprecated": {
+ "gemini-2.0-flash": "gemini-3.5-flash",
+ "gemini-2.0-flash-lite": "gemini-3.1-flash-lite",
+ "gemini-1.5-pro": "gemini-2.5-pro",
+ "gemini-1.5-flash": "gemini-3.5-flash",
+ "gemini-1.5-flash-8b": "gemini-3.1-flash-lite",
+ "gemini-1.0-pro": "gemini-2.5-pro",
+ "gemini-pro": "gemini-2.5-pro",
+ "gemini-pro-vision": "gemini-2.5-pro"
+ }
+ },
+
+ "codeChanges": {
+ "_comment": "Reasoning-model (GPT-5 / o-series) parameter migrations. The scanner flags these; the skill applies them.",
+ "renames": [
+ { "from": "max_tokens", "to": "max_completion_tokens", "scope": "openai-reasoning", "note": "GPT-5 and o-series reject max_tokens in Chat Completions; use max_completion_tokens." }
+ ],
+ "removeForReasoning": [
+ "temperature", "top_p", "presence_penalty", "frequency_penalty",
+ "logprobs", "top_logprobs", "logit_bias"
+ ],
+ "removeNote": "GPT-5 / o-series reasoning models reject these sampling params (temperature must be the default 1 if sent). Remove them; steer with `reasoning_effort` (low|medium|high) and `verbosity` (low|medium|high) instead.",
+ "anthropicNote": "Anthropic Opus 4.8/4.7 reject temperature/top_p/top_k and budget_tokens (use thinking:{type:'adaptive'} + effort). See the claude-api skill `shared/model-migration.md` for full Claude code changes."
+ },
+
+ "specialized": {
+ "_comment": "Non-chat OpenAI families (realtime / audio / image / transcribe / tts / embeddings / moderation) that don't map to the flagship/mini tiers. `current` IDs are treated as valid (not flagged); `deprecated` IDs map to their current replacement. Checked before review.openaiPatterns, which stays as the fallback for unrecognised variants. Verified against https://developers.openai.com/api/docs/models and /deprecations (2026-06-18).",
+ "current": [
+ "gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", "gpt-realtime-mini",
+ "gpt-realtime-translate", "gpt-realtime-whisper",
+ "gpt-audio", "gpt-audio-1.5", "gpt-audio-mini",
+ "gpt-image-2",
+ "gpt-4o-transcribe", "gpt-4o-mini-transcribe",
+ "gpt-4o-mini-tts",
+ "text-embedding-3-small", "text-embedding-3-large",
+ "omni-moderation", "omni-moderation-latest"
+ ],
+ "deprecated": {
+ "gpt-4o-realtime-preview": "gpt-realtime-1.5",
+ "gpt-4o-mini-realtime-preview": "gpt-realtime-mini",
+ "gpt-4o-audio-preview": "gpt-audio-1.5",
+ "gpt-4o-mini-audio-preview": "gpt-audio-mini",
+ "gpt-4o-search-preview": "gpt-5.4-mini",
+ "gpt-4o-mini-search-preview": "gpt-5.4-mini"
+ }
+ },
+
+ "review": {
+ "_comment": "Tokens that look like model IDs but need a human decision — the scanner flags them (severity: warn) and never auto-rewrites them. specialized.current/deprecated (above) take precedence for known non-chat models.",
+ "openaiPatterns": [
+ "codex", "chat-latest", "realtime", "audio", "search", "transcribe",
+ "image", "tts", "whisper", "embedding", "moderation", "instruct"
+ ],
+ "openaiExact": ["gpt-4v"]
+ },
+
+ "ignore": {
+ "_comment": "Known NON-model tokens that share the model prefix. The scanner never reports these.",
+ "openai": ["gpt-oss-20b", "gpt-oss-120b", "gpt-oss"],
+ "openaiPatterns": ["oss", "\\.png", "\\.jpg", "\\.jpeg", "\\.gif", "\\.svg"],
+ "anthropic": [
+ "claude-code", "claude-code-tracing", "claude-agent-sdk", "claude-web",
+ "claude-user", "claude-searchbot", "claude-powered", "claude-trace",
+ "claude-session", "claude-fable-5", "claude-mythos-5"
+ ],
+ "anthropicPatterns": [
+ "^claude-code", "^claude-md", "searchbot", "-bot$", "\\.png", "\\.jpg",
+ "\\.jpeg", "\\.gif", "\\.svg", "\\.log", "fable", "mythos", "-powered"
+ ]
+ },
+
+ "excludePaths": [
+ "**/release-notes/**",
+ "**/changelog/**",
+ "**/*release-note*",
+ "**/*-releases.*",
+ "**/on-premise-releases*",
+ "**/*changelog*",
+ "**/*migration*",
+ "**/*migrate*",
+ "**/api-clients/**",
+ "**/.agents/skills/check-models/**",
+ "**/node_modules/**",
+ "**/.git/**",
+ "**/.venv/**"
+ ],
+
+ "scanExtensions": [".mdx", ".md", ".ipynb", ".py", ".ts", ".tsx", ".js", ".jsx"]
+}
diff --git a/.agents/skills/check-models/scripts/scan-models.mjs b/.agents/skills/check-models/scripts/scan-models.mjs
new file mode 100644
index 0000000..4684f6f
Binary files /dev/null and b/.agents/skills/check-models/scripts/scan-models.mjs differ
diff --git a/.github/workflows/check-models.yml b/.github/workflows/check-models.yml
new file mode 100644
index 0000000..4bc50f8
--- /dev/null
+++ b/.github/workflows/check-models.yml
@@ -0,0 +1,106 @@
+name: Check model versions
+
+# Flags out-of-date OpenAI / Anthropic / Google model references.
+# Two tiers: outdated models a PR INTRODUCES on changed lines FAIL the check; outdated
+# models that already exist on unchanged lines of a file the PR touches are reported as
+# non-blocking WARNINGS. Untouched files are never inspected, so unrelated PRs aren't blocked.
+# Policy (latest models + ignore lists) lives in .agents/skills/check-models/scripts/models.json.
+# Platform-wrapped IDs (Bedrock `[region.]anthropic.claude-…`, Databricks
+# `databricks-claude-…`, OpenRouter/LiteLLM `provider/model`) are flagged on their
+# embedded model substring — migrate the version but keep the platform's ID format.
+# See the "Platform-specific IDs" section of the check-models skill.
+
+on:
+ pull_request:
+ branches: [main]
+
+permissions:
+ contents: read
+ pull-requests: write
+
+jobs:
+ check-models:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
+ with:
+ fetch-depth: 0 # need the base commit to diff against
+
+ - uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e
+ with:
+ node-version: 20
+
+ - name: Scan touched files for outdated models
+ id: scan
+ run: |
+ node .agents/skills/check-models/scripts/scan-models.mjs \
+ --diff "${{ github.event.pull_request.base.sha }}" \
+ python typescript \
+ --json --no-fail > model-scan.json
+ cat model-scan.json
+
+ - name: Comment and gate
+ uses: actions/github-script@f28e40c7f34bde8b3046d885e986cb6290c5673b
+ with:
+ script: |
+ const fs = require('fs');
+ const r = JSON.parse(fs.readFileSync('model-scan.json', 'utf8'));
+ const marker = '';
+
+ // Three buckets:
+ // introduced — outdated model on a line this PR added/changed → FAILS the check
+ // preExisting — outdated model on an unchanged line of a touched file → warn, non-blocking
+ // warns — prose / specialised variants / GPT-5 param hints (changed lines) → review
+ const introduced = r.findings.filter(f => f.severity === 'error' && f.changed !== false);
+ const preExisting = r.findings.filter(f => f.severity === 'error' && f.changed === false);
+ const warns = r.findings.filter(f => f.severity === 'warn');
+
+ const row = f => `| \`${f.file}\`:${f.line} | \`${f.token}\` | ${f.action === 'replace' ? '`' + f.target + '`' : '—'} | ${f.reason} |`;
+ const table = items => `| Location | Found | Suggested | Why |\n|---|---|---|---|\n` + items.map(row).join('\n') + '\n\n';
+ let body = `${marker}\n## 🤖 Model version check\n\n`;
+
+ // Surface a stale-policy warning (content references a newer model than the policy knows, or the policy is old)
+ if (r.stale && r.stale.suggestRefresh) {
+ const newer = (r.stale.newerThanPolicy || []).map(n => `\`${n.token}\``).join(', ');
+ body += newer
+ ? `> ⚠️ **The model policy looks out of date** — this PR references ${newer}, newer than \`models.json\` knows. Run \`scan-models.mjs --refresh\` and update the policy.\n\n`
+ : `> ⚠️ **The model policy is ${r.stale.ageDays} days old** (updated ${r.updated}). Consider \`scan-models.mjs --refresh\`.\n\n`;
+ }
+
+ if (!introduced.length && !preExisting.length && !warns.length) {
+ body += `✓ No outdated model references in this PR. _(policy ${r.updated})_`;
+ } else {
+ if (introduced.length) {
+ body += `### ✗ ${introduced.length} outdated model reference(s) introduced — please update (this fails the check)\n\n`;
+ body += `On lines this PR adds or changes:\n\n` + table(introduced);
+ }
+ if (preExisting.length) {
+ body += `### ⚠️ ${preExisting.length} pre-existing outdated model(s) in files this PR touches — not blocking\n\n`;
+ body += `These are on unchanged lines, so the check still passes — but since you're already editing these files, consider updating them too.\n\n` + table(preExisting);
+ }
+ if (warns.length) {
+ body += `### ⚠ ${warns.length} item(s) to review (not blocking)\n\n`;
+ body += `Prose mentions, specialised variants (\`*-codex\`, \`*-chat-latest\`), or GPT-5/o-series code changes (\`max_tokens\` → \`max_completion_tokens\`, drop \`temperature\`).\n\n` + table(warns);
+ }
+ body += `_See the [\`check-models\`](.agents/skills/check-models/SKILL.md) skill. Policy date: ${r.updated}. Add \`check-models:ignore\` to a line to skip it._\n`;
+ body += `\n> ℹ️ **Platform-wrapped IDs** (Bedrock \`[region.]anthropic.claude-…\`, Databricks \`databricks-claude-…\`, OpenRouter/LiteLLM \`provider/model\`) are flagged on their embedded model name — bump the version but keep the platform's ID format (e.g. Bedrock 4.x needs a \`us.\`/\`eu.\`/\`apac.\` inference-profile prefix). See the skill's _Platform-specific IDs_ section.`;
+ }
+
+ // upsert a single bot comment
+ const { owner, repo } = context.repo;
+ const prNumber = context.payload.pull_request.number;
+ const comments = await github.paginate(github.rest.issues.listComments, { owner, repo, issue_number: prNumber });
+ const existing = comments.find(c => c.body && c.body.includes(marker));
+ // Post when there are findings, or when this PR concretely introduces a
+ // newer-than-policy model. Don't open a fresh comment for date-staleness
+ // alone (it would fire on every PR once the policy ages) — only update an existing one.
+ const worthPosting = introduced.length || preExisting.length || warns.length || (r.stale && r.stale.byContent);
+ if (existing) {
+ await github.rest.issues.updateComment({ owner, repo, comment_id: existing.id, body });
+ } else if (worthPosting) {
+ await github.rest.issues.createComment({ owner, repo, issue_number: prNumber, body });
+ }
+
+ if (introduced.length) {
+ core.setFailed(`${introduced.length} outdated model reference(s) introduced by this PR.`);
+ }