From 1b1dbdeadb332d8b9daad29dd14412b140e84c25 Mon Sep 17 00:00:00 2001 From: Jim Bennett Date: Thu, 18 Jun 2026 15:46:34 -0700 Subject: [PATCH 1/3] feat: add check-models skill + PR-diff-scoped CI gate Skill + scanner that find outdated OpenAI / Anthropic / Google (Gemini) model references and migrate them to current size-tier equivalents. The CI gate scans only the lines a PR changes (paths: python typescript), so it flags newly introduced outdated models and fails the check, without blocking on pre-existing references. Notebook/code migrations land separately as themed PRs. Co-Authored-By: Claude Opus 4.8 (1M context) --- .agents/skills/check-models/SKILL.md | 149 ++++++++++++++++++ .../skills/check-models/scripts/models.json | 129 +++++++++++++++ .../check-models/scripts/scan-models.mjs | Bin 0 -> 19646 bytes .github/workflows/check-models.yml | 97 ++++++++++++ 4 files changed, 375 insertions(+) create mode 100644 .agents/skills/check-models/SKILL.md create mode 100644 .agents/skills/check-models/scripts/models.json create mode 100644 .agents/skills/check-models/scripts/scan-models.mjs create mode 100644 .github/workflows/check-models.yml diff --git a/.agents/skills/check-models/SKILL.md b/.agents/skills/check-models/SKILL.md new file mode 100644 index 0000000..1c1e0d5 --- /dev/null +++ b/.agents/skills/check-models/SKILL.md @@ -0,0 +1,149 @@ +--- +name: check-models +description: Find and update out-of-date OpenAI, Anthropic, and Google Gemini model references (in docs, MDX, notebooks, and code) to the latest size-equivalent models, and apply the code changes each new generation requires (e.g. max_tokens → max_completion_tokens for GPT-5). Use when asked to "check the models", "update model versions", "are these models current", "migrate the models in the docs/tutorials", or before publishing content that names a model. +--- + +# check-models + +Keep model references current. 
OpenAI and Anthropic ship new generations often, and docs/tutorials drift: a notebook pinned to `gpt-4o-mini` or `claude-3-5-sonnet` is teaching a model that's a generation (or three) behind, sometimes with code that no longer runs (GPT-5 rejects `max_tokens`). + +This skill does three things: + +1. **Look up** the latest models (don't trust memory — models change). +2. **Migrate** every reference to the latest model **of the same size tier** (a `*-mini` becomes the latest mini, never the flagship). +3. **Fix the code** each new generation requires (parameter renames/removals). + +## The golden rule: match the size tier + +Never "upgrade" a small model to a big one. `gpt-4o-mini` is a small/cheap model — its modern equivalent is the latest **mini**, not the flagship. Migrating it to `gpt-5.5` silently multiplies a reader's cost. Map within the tier: + +| Old tier | OpenAI → | Anthropic → | +|---|---|---| +| flagship / full | latest flagship | latest Opus | +| pro | latest pro | latest Opus | +| mini / small / cheap | latest **mini** | latest Haiku | +| nano | latest **nano** | (smallest tier) latest Haiku | + +> The latest mini/nano are **not** always in the newest generation. As of the policy date below, the flagship is GPT-5.5 but there is no GPT-5.5-mini — so `gpt-4o-mini` → `gpt-5.4-mini`, **not** `gpt-5.5-mini`. Always confirm which generation actually has a mini/nano variant. + +## Current models (policy) + +The authoritative values live in [`models.json`](./scripts/models.json) — the scanner and the GitHub Action both read it. As of `updated: 2026-06-18`: + +**OpenAI (reasoning — default targets)** — flagship `gpt-5.5`, pro `gpt-5.5-pro`, mini `gpt-5.4-mini`, nano `gpt-5.4-nano`. +**OpenAI (non-reasoning — for temperature-dependent code)** — flagship `gpt-4.1`, mini `gpt-4.1-mini`, nano `gpt-4.1-nano`. Active, **not** deprecated, and they **accept `temperature`/`top_p`**. +**Anthropic** — Opus `claude-opus-4-8`, Sonnet `claude-sonnet-4-6`, Haiku `claude-haiku-4-5`. 
+**Google Gemini** — Pro `gemini-2.5-pro`, Flash `gemini-3.5-flash`, Flash-Lite `gemini-3.1-flash-lite`. The latest **stable** model per tier sits on *different* generations (like OpenAI), so the policy tracks Gemini with an explicit `google.current` allow-list and a `google.deprecated` → replacement map (in `models.json`), not a single version floor. Match the SIZE tier (pro/flash/flash-lite). `gemini-2.0-*` (shut down 2026-06-01), `gemini-1.5-*`, and `gemini-1.0-*` are deprecated; `gemini-2.5-*` and `gemini-3.x` are still current and are left alone. + +> ⚠️ **Claude Fable 5 / Mythos 5** were export-control-suspended on 2026-06-12 — they are NOT valid migration targets. ⚠️ GPT-5.5 has **no** mini/nano variant. + +**Specialised (non-chat) OpenAI models** — realtime, audio, image, transcribe, tts, embeddings, and moderation models don't map to the flagship/mini tiers, so they're tracked separately in `models.json` → `specialized`: +- `specialized.current` (e.g. `gpt-realtime-1.5`, `gpt-realtime-2`, `gpt-audio-1.5`, `gpt-image-2`, `gpt-4o-transcribe`, `gpt-4o-mini-tts`, `text-embedding-3-*`, `omni-moderation`) are **valid** — the scanner leaves them alone. +- `specialized.deprecated` maps a retired ID to its current replacement (e.g. `gpt-4o-realtime-preview` → `gpt-realtime-1.5` (shut down 2026-05-07), `gpt-4o-audio-preview` → `gpt-audio-1.5`, `gpt-4o-search-preview` → `gpt-5.4-mini`) — flagged `warn` with the target, since the realtime/audio API surface differs and the swap needs a human eye. + +These are checked **before** `review.openaiPatterns`, which stays the fallback for any unrecognised variant. When a new specialised model ships (or one is retired), update these two lists. + +### Reasoning vs non-reasoning — which OpenAI target? + +The GPT-5 family are **reasoning** models: they reject `temperature`, `top_p`, and the other sampling params. 
So the target depends on whether the call relies on those: + +- **General code** (no meaningful `temperature`, or `temperature` was incidental) → migrate to the **GPT-5** tier above (the default). This is what most code wants. +- **Temperature-dependent code** → migrate to the **non-reasoning `gpt-4.1` tier** and **keep `temperature`**. This is the right call for: + - **LLM-as-judge / evaluators** that set `temperature=0` for reproducible scores (e.g. Phoenix `OpenAIModel(...)`, `LLM(...)`, eval classifiers), + - **deterministic extraction / classification** pinned at `temperature=0`. + Match the size tier: a `gpt-4o-mini` eval judge → `gpt-4.1-mini` (keep `temperature=0`), not `gpt-5.4-mini`. + +`gpt-4.1` / `gpt-4.1-mini` / `gpt-4.1-nano` are treated as **current** by the scanner (not flagged) — they're a legitimate, non-deprecated choice. `gpt-4o` / `gpt-4o-mini` are also non-reasoning + active but older; migrate them to the GPT-5 tier by default, or to `gpt-4.1*` when temperature matters. + +## Workflow + +### 1. Verify the latest models are still current + +`models.json` is a cache — refresh it before a big migration. **The scanner can't browse, so the lookup is yours to run; `--refresh` gives you the exact checklist:** + +```bash +node .agents/skills/check-models/scripts/scan-models.mjs --refresh +``` + +It prints the current policy, the WebSearch queries, the authoritative source URLs, and which keys to edit. Then: + +- WebSearch `"latest OpenAI models"` and `"latest Anthropic Claude models"` (use the current month); cross-check Anthropic against the **`claude-api`** skill's `shared/models.md` (canonical Claude IDs + the Claude-side code changes). +- Confirm each tier's latest model **and that the tier still exists** (a new flagship may ship with no `-mini` yet — keep mini/nano on the older generation). Skip any suspended/withdrawn model. 
+- If anything changed, **propose the `models.json` diff and confirm before writing** (bump `flagship`/`mini`/`opus`/… and `updated`). That one edit updates the scanner, the skill, and the CI gate together. + +**When to refresh:** the scanner tells you. Every run prints a `⚠ Model policy may be out of date…` hint (and sets `stale.suggestRefresh` in `--json`) when either the policy is older than ~45 days **or** the content references a model newer than the policy knows (e.g. a `gpt-5.6` appears while the policy flagship is `gpt-5.5`). Newer-than-policy models are left untouched — never downgraded — so refresh first, then re-scan. + +### 2. Scan + +```bash +node .agents/skills/check-models/scripts/scan-models.mjs python typescript # scan paths +node .agents/skills/check-models/scripts/scan-models.mjs --json > out.json # machine-readable +``` + +Each finding is one of: +- `✗ error` — an outdated lowercase canonical model ID (e.g. `gpt-4o-mini`). Migrate it. +- `⚠ review` / `⚠ replace` (prose) — a model named in prose (`GPT-4o`) or a specialised variant (`*-codex`, `*-chat-latest`, `gpt-4v`). Use judgement (see below). +- `⚠ param` — a GPT-5/o-series code change to apply (`max_tokens`, `temperature`, …). + +### 3. Migrate the model IDs + +For each finding, replace with the scanner's suggested target, **preserving local style**: +- Keep the original separator/case for prose: `GPT-4o` → `GPT-5.5` (not `gpt-5.5`); `claude-sonnet-4.5` (dotted) → `claude-sonnet-4.6`. +- Drop stale date snapshots: `gpt-5-2025-08-07` → `gpt-5.5` (use the bare alias, don't invent a date). +- Keep both halves of any SDK-version tab in sync (v7/v8 examples). + +#### Platform-specific IDs (Bedrock, Databricks, OpenRouter, LiteLLM) + +The scanner matches the embedded `claude-*` / `gpt-*` substring inside a platform-wrapped ID and flags it. 
Bump the **version** but **keep the host platform's ID format** — these are not bare first-party IDs: + +- **Amazon Bedrock** — IDs look like `[region.]anthropic.claude-<...>[-vN:0]`. Claude 4.x (Opus 4.x, Sonnet 4.5+, Haiku 4.5) require a **cross-region inference profile**, so they take a `us.` / `eu.` / `apac.` prefix and **drop** the on-demand `-vN:0` suffix the Claude 3 IDs used. e.g. `anthropic.claude-3-haiku-20240307-v1:0` → `us.anthropic.claude-haiku-4-5`. Match the region prefix to the doc's endpoint (the repo's Bedrock docs default to `us.`). +- **Databricks** — Foundation Model endpoint names look like `databricks-claude-sonnet-4-6`. Databricks owns these names and **availability is workspace/region-dependent** — bump the version following their pattern, but verify the endpoint actually exists rather than assuming it. +- **OpenRouter / LiteLLM** — provider-prefixed: `anthropic/claude-sonnet-4-6`, `openai/gpt-5.4-mini`. Keep the `provider/` prefix; bump only the model half. + +When a migration changes more than the version number (a Bedrock region prefix, a Databricks endpoint name), **call it out for the reviewer** — it may need adjusting for their region/workspace. + +### 4. Apply the code changes + +**These changes apply to the _raw OpenAI SDK_ only** (`client.chat.completions.create(...)`, `openai.OpenAI()...`). When migrating such a call **to GPT-5 or o-series** (reasoning models), in the same example: +- Rename `max_tokens` → `max_completion_tokens`. +- Remove `temperature` (unless it's the default `1`), `top_p`, `presence_penalty`, `frequency_penalty`, `logprobs`, `top_logprobs`, `logit_bias` — reasoning models reject them. Steer with `reasoning_effort` (`low`/`medium`/`high`) and `verbosity` instead. + +**Wrapper libraries need a split rule** — `phoenix.evals.OpenAIModel(...)`, `langchain_openai.ChatOpenAI(...)`, `litellm.completion(...)`, etc. 
expose their own kwargs: +- **`max_tokens`: keep it.** The wrapper owns this kwarg and maps it to `max_completion_tokens` internally; renaming it passes an unknown constructor arg and breaks the call. +- **non-default `temperature` / `top_p`:** a GPT-5/o-series reasoning model rejects any `temperature` ≠ 1 with a 400 — wrapper or not. Two correct outcomes: + - if the value **matters** (eval judge / deterministic call) → re-target to the non-reasoning **`gpt-4.1`** tier and **keep** the `temperature` (see "Reasoning vs non-reasoning" above); + - if it was **incidental** → drop the explicit value and stay on the GPT-5 model. +- Otherwise migrate the **model ID only**. + +> **`phoenix.evals` gotcha:** the legacy model classes (`OpenAIModel`, `LiteLLMModel`, `AnthropicModel`) accept `temperature` in their constructor (valid). The newer `LLM(provider=…, model=…)` does **not** — its `**kwargs` are forwarded to the *SDK client constructor* (for `api_key`/`base_url`), so `LLM(provider="openai", model="gpt-4.1", temperature=0)` raises `TypeError` at construction. Set sampling params on the **evaluator** instead: `ClassificationEvaluator(name=…, llm=…, prompt_template=…, choices=…, temperature=0)` (the `create_classifier(...)` helper takes no `temperature`). + +(The scanner flags any `max_tokens`/`temperature` token as a `⚠ param` — it can't tell a raw call from a wrapper or know the value, so this is the judgement call.) + +**Watch token caps on reasoning models.** GPT-5/o-series count *reasoning* tokens against `max_completion_tokens`, so a tiny cap that worked on gpt-4o (e.g. `max_tokens=20` for a short answer) can return empty/truncated output. When migrating such a call, raise the cap to a safe value (≥256) or set `reasoning_effort: "minimal"`. + +When migrating **to Claude Opus 4.8/4.7 or Sonnet 4.6**: `budget_tokens` and `temperature`/`top_p`/`top_k` are removed — use `thinking: {type: "adaptive"}` + `output_config.effort`. 
Defer to the **`claude-api`** skill (`/claude-api migrate`) for the full Claude code-change checklist; don't hand-edit Claude SDK calls from memory. + +### 5. Skip what shouldn't change + +Do **not** rewrite: +- **Autogenerated code** — `api-clients/` is generated SDK reference; never hand-edit it (it's in `excludePaths`, so the scanner skips it). Fix the model references at their generator/source instead. +- **Historical / release-notes / changelog / migration-guide** content — it documents what *was* true ("v1.2 added gpt-4o support"). These paths are in `excludePaths`; respect the same rule for any historical prose the scanner happens to catch. +- **Non-model tokens** that share a prefix: `gpt-oss-*` (open-weight models, version-pinned), `claude-code`, `claude-agent-sdk`, web-crawler user-agents (`claude-web`, `claude-user`, `claude-searchbot`), image filenames. These are in the `ignore` list and won't be flagged. +- **Comparative prose** that names an old model on purpose ("unlike GPT-4, GPT-5.5 can…"). Leave the historical reference; only update where the doc is telling the reader which model to *use*. +- **Markdown image alt text** — `![…model gpt-4o-mini…](screenshot.png)` describes what a screenshot *shows*. Editing the alt text alone would make it misdescribe the image (the pixels still show the old model), so the scanner skips model names inside `![ … ]`. Update these by **regenerating the screenshot**, not by editing the alt text. + +To suppress a single line the scanner shouldn't touch, add a `check-models:ignore` comment on it. + +### 6. Verify + +Re-run the scanner — `✗ error` count should be 0. Remaining `⚠` are prose/variants you've consciously reviewed. + +Also guard against a **stale date carried onto a new alias**: when migrating a dated ID, drop the date (`claude-sonnet-4-5-20250929` → `claude-sonnet-4-6`, **not** `claude-sonnet-4-6-20250929` — that snapshot doesn't exist). 
The scanner can't catch this (it date-strips before classifying, so a wrong-dated current alias looks current), so grep for it: + +```bash +grep -rnE 'claude-(opus-4-8|sonnet-4-6)-20[0-9]' python typescript # these aliases have no dated form +``` + +## Updating for a new model launch + +The whole skill is driven by `models.json`. When a new model ships: WebSearch to confirm the IDs and which tiers exist, edit the relevant value(s) in `models.json` plus `updated`, re-run the scanner. No code changes to the scanner are normally needed — it derives the full old→new table from the policy numbers. diff --git a/.agents/skills/check-models/scripts/models.json b/.agents/skills/check-models/scripts/models.json new file mode 100644 index 0000000..988d0a0 --- /dev/null +++ b/.agents/skills/check-models/scripts/models.json @@ -0,0 +1,129 @@ +{ + "_comment": "Single source of truth for the check-models skill. Update the numbers in `openai` and `anthropic` whenever a new model ships, then re-run the scanner. The scanner derives the full replacement table from these policy values — you should not need to enumerate every old ID.", + "updated": "2026-06-18", + "verifiedBy": "WebFetch developers.openai.com/api/docs/models + /deprecations (2026-06-18); Google via ai.google.dev/gemini-api/docs/models (2026-06-18); Anthropic via claude-api skill shared/models.md", + + "openai": { + "flagship": "gpt-5.5", + "pro": "gpt-5.5-pro", + "mini": "gpt-5.4-mini", + "nano": "gpt-5.4-nano", + "flagshipMinVersion": 5.4, + "proMinVersion": 5.5, + "miniMinVersion": 5.4, + "nanoMinVersion": 5.4, + "notes": "GPT-5.5 is the latest flagship (released 2026-04-23, API id gpt-5.5-2026-04-23) and has no mini/nano variant — mini/nano stay on the 5.4 generation. GPT-5.4 also remains available as a cheaper flagship (still current, NOT deprecated per OpenAI's models page), so flagshipMinVersion is 5.4 — only gpt-5.3 and older flag for migration; gpt-5.4 and gpt-5.5 both pass. 
Migration target for outdated flagships is still gpt-5.5 (the `flagship` value). Match the SIZE tier: a *-mini model migrates to the latest mini, never the flagship. The GPT-5 family are REASONING models — they reject temperature/top_p (see codeChanges). For temperature-dependent code use the nonReasoning tier below.", + "nonReasoning": { + "flagship": "gpt-4.1", + "mini": "gpt-4.1-mini", + "nano": "gpt-4.1-nano", + "note": "Latest NON-reasoning OpenAI models — active (NOT deprecated, per https://developers.openai.com/api/docs/deprecations checked 2026-06-17) and they ACCEPT temperature/top_p. Use these (not GPT-5) for code that needs temperature control — LLM-as-judge evaluators, deterministic extraction/classification at temperature 0. gpt-4o / gpt-4o-mini are also non-reasoning + active but older; gpt-4.1 supersedes them.", + "treatAsCurrent": true + }, + "deprecatedOpenAI": ["gpt-3.5-turbo (snapshots; shutdown 2026-10-23)", "o1", "o1-mini", "o1-preview", "o3-mini", "o4-mini", "gpt-5 / gpt-5-mini legacy snapshots (shutdown 2026-12-11)", "gpt-5.2-chat-latest (2026-08-10)"] + }, + + "anthropic": { + "opus": "claude-opus-4-8", + "sonnet": "claude-sonnet-4-6", + "haiku": "claude-haiku-4-5", + "notes": "Opus 4.8 / Sonnet 4.6 / Haiku 4.5 are the current GA tiers. Claude Fable 5 and Claude Mythos 5 exist but were export-control-suspended on 2026-06-12 — NEVER migrate to them. Match the SIZE tier (opus/sonnet/haiku)." + }, + + "google": { + "_comment": "Gemini tiers span generations (the latest STABLE per tier is on different generations, like OpenAI). `current` IDs are valid (not flagged). `deprecated` IDs map to their current-tier replacement. Verified against https://ai.google.dev/gemini-api/docs/models + /changelog (2026-06-18). 
Match the SIZE tier: pro/flash/flash-lite.", + "tiers": { "pro": "gemini-2.5-pro", "flash": "gemini-3.5-flash", "flash-lite": "gemini-3.1-flash-lite" }, + "current": [ + "gemini-2.5-pro", "gemini-3.1-pro", "gemini-3.5-pro", + "gemini-3.5-flash", "gemini-3-flash", "gemini-2.5-flash", + "gemini-3.1-flash-lite", "gemini-2.5-flash-lite" + ], + "deprecated": { + "gemini-2.0-flash": "gemini-3.5-flash", + "gemini-2.0-flash-lite": "gemini-3.1-flash-lite", + "gemini-1.5-pro": "gemini-2.5-pro", + "gemini-1.5-flash": "gemini-3.5-flash", + "gemini-1.5-flash-8b": "gemini-3.1-flash-lite", + "gemini-1.0-pro": "gemini-2.5-pro", + "gemini-pro": "gemini-2.5-pro", + "gemini-pro-vision": "gemini-2.5-pro" + } + }, + + "codeChanges": { + "_comment": "Reasoning-model (GPT-5 / o-series) parameter migrations. The scanner flags these; the skill applies them.", + "renames": [ + { "from": "max_tokens", "to": "max_completion_tokens", "scope": "openai-reasoning", "note": "GPT-5 and o-series reject max_tokens in Chat Completions; use max_completion_tokens." } + ], + "removeForReasoning": [ + "temperature", "top_p", "presence_penalty", "frequency_penalty", + "logprobs", "top_logprobs", "logit_bias" + ], + "removeNote": "GPT-5 / o-series reasoning models reject these sampling params (temperature must be the default 1 if sent). Remove them; steer with `reasoning_effort` (low|medium|high) and `verbosity` (low|medium|high) instead.", + "anthropicNote": "Anthropic Opus 4.8/4.7 reject temperature/top_p/top_k and budget_tokens (use thinking:{type:'adaptive'} + effort). See the claude-api skill `shared/model-migration.md` for full Claude code changes." + }, + + "specialized": { + "_comment": "Non-chat OpenAI families (realtime / audio / image / transcribe / tts / embeddings / moderation) that don't map to the flagship/mini tiers. `current` IDs are treated as valid (not flagged); `deprecated` IDs map to their current replacement. 
Checked before review.openaiPatterns, which stays as the fallback for unrecognised variants. Verified against https://developers.openai.com/api/docs/models and /deprecations (2026-06-18).", + "current": [ + "gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", "gpt-realtime-mini", + "gpt-realtime-translate", "gpt-realtime-whisper", + "gpt-audio", "gpt-audio-1.5", "gpt-audio-mini", + "gpt-image-2", + "gpt-4o-transcribe", "gpt-4o-mini-transcribe", + "gpt-4o-mini-tts", + "text-embedding-3-small", "text-embedding-3-large", + "omni-moderation", "omni-moderation-latest" + ], + "deprecated": { + "gpt-4o-realtime-preview": "gpt-realtime-1.5", + "gpt-4o-mini-realtime-preview": "gpt-realtime-mini", + "gpt-4o-audio-preview": "gpt-audio-1.5", + "gpt-4o-mini-audio-preview": "gpt-audio-mini", + "gpt-4o-search-preview": "gpt-5.4-mini", + "gpt-4o-mini-search-preview": "gpt-5.4-mini" + } + }, + + "review": { + "_comment": "Tokens that look like model IDs but need a human decision — the scanner flags them (severity: warn) and never auto-rewrites them. specialized.current/deprecated (above) take precedence for known non-chat models.", + "openaiPatterns": [ + "codex", "chat-latest", "realtime", "audio", "search", "transcribe", + "image", "tts", "whisper", "embedding", "moderation", "instruct" + ], + "openaiExact": ["gpt-4v"] + }, + + "ignore": { + "_comment": "Known NON-model tokens that share the model prefix. 
The scanner never reports these.", + "openai": ["gpt-oss-20b", "gpt-oss-120b", "gpt-oss"], + "openaiPatterns": ["oss", "\\.png", "\\.jpg", "\\.jpeg", "\\.gif", "\\.svg"], + "anthropic": [ + "claude-code", "claude-code-tracing", "claude-agent-sdk", "claude-web", + "claude-user", "claude-searchbot", "claude-powered", "claude-trace", + "claude-session", "claude-fable-5", "claude-mythos-5" + ], + "anthropicPatterns": [ + "^claude-code", "^claude-md", "searchbot", "-bot$", "\\.png", "\\.jpg", + "\\.jpeg", "\\.gif", "\\.svg", "\\.log", "fable", "mythos", "-powered" + ] + }, + + "excludePaths": [ + "**/release-notes/**", + "**/changelog/**", + "**/*release-note*", + "**/*-releases.*", + "**/on-premise-releases*", + "**/*changelog*", + "**/*migration*", + "**/*migrate*", + "**/api-clients/**", + "**/.agents/skills/check-models/**", + "**/node_modules/**", + "**/.git/**", + "**/.venv/**" + ], + + "scanExtensions": [".mdx", ".md", ".ipynb", ".py", ".ts", ".tsx", ".js", ".jsx"] +} diff --git a/.agents/skills/check-models/scripts/scan-models.mjs b/.agents/skills/check-models/scripts/scan-models.mjs new file mode 100644 index 0000000000000000000000000000000000000000..065599466b4cc9148582875b1c2756edbb3bf1e5 GIT binary patch literal 19646 zcmcg!ZFAekcD|qWE7q(h0WA@b6+2Dk>o|%nC+>A@kEPo?x1wx<&z?Pd-uEp3=3cZY@~EGrkxfrcnhotn z6q&)q4o*5VyiE&J4B|AkdB}#BNjfY{wkU^jX@}PCVj?H8-i&Kn9 z#%Y!d{F5|0OMzdiYc*wdV%a0h1-V6@bE+Vo<~AN)n8h4D442*x&3i?2#vWyyk8O!L z0Cs5e0rI;l1g4;k?e=g#&-Q|;!ZIf+sTNciKF3iz9E+<78hR5h3|G1H6luT_A4HKXd z4DBdhOv_y}I2*R_0M+RXlhMfhu^$)qTLG1&(+d+1hgjz{NwIsUg>Mety*Fc$=uUvL zaEBNF%&_|a)FGkc{?wAk=8N)981Z!4fMEQ-+VjD5F|_Zv{5!{|S!Wa{HH>(A7GHo7 z&y&(@bL2js$gk$1(cMTdOwZxpmSmNnb>UbkKXH9I)CNC#%w>;=kAhaWNdz!nW) zm$KJ?e*gL~BP~|PEb=J;Zw#`uD9yqC>z5zj?H>p017w6ytDa%Q(c4_7gYa$Nl&F z=uG@V@KOp14Js&AFMdIx5j9&4_yOpBb_;ALQ{h`xLK z`qlHl0_5KxynPeSMB)3efA!;p5)moIJZ(402$t~{S zAH;TC$P?c_`=7&6fPpl+9U)#%!vY$>20ImDvU!O`TpCKjt{KJCf+}%3t)A2DB{;Ks zrTFbaF-)h_cca`Eld6sDh1NVeHp?d)qb!FSw53S^fBOmk{83gHLT8T42|jFYw$0^+ 
z0b6tDV{{|JN6GOMd`w2vx=)`zHLXsE$hF!AxT>0ZYRY_J*$q~~uJ#4)t}BJ|uAWZl zTyI_HoddXb*ITrZLwIvjQNE6gBaj>8-qxq!s)Takd{In-7+bU4;PUGRC7DoTTge~; z0O_a|wm>iP&yU%e$)-c-QmBr0L*GNAU6{Vb|6@K-4e?7n~j~;urR|~Q?wlAokyg=2vIK;OXP`OSbR7%3dT#PhUW-w2=$W!x13{aR9 zZ{jz>ux*|RPnliw69st$`lT`WMm-=8(1;0uU3V8hev3(v(`b zoFqjEBt6daJV~(wgelI>6yC%xlza?hp2;+@_FL5Yx!oeXvq=K~M0diRSnLD(*wA&bY;^6*+rGvj6&=7ID-Qg zCq$wC7JFZ8%Z$hM=W&*eVOODEFR_VU0gS1L#^bO|=$NcDW>XtZvvJT{@doLvcp`Ff z>hC!p=bgAZVncS#{Yz2Bx2b$-yj4%P#SS!8_j2QP42|Wun&4OYe0l#8EC6v;o_=Op zEiiV^1oC|ew+(w1kF$1fxv}xwxps>8F390rSvVAX2ELFYIP`Gn3;uBk2clRo3)Bu-8i`b{1Q;|JY0Ag1$= zp}1szU{*}f86V^knv_<)AGPWatz#_vFst@15P8{!i!MA$AyeVn;(NNnz>RHC zW~Lt>7bjl1Ii_xtjt@ig7u!FuaSm|_7ES~Pj^48j%qjiEb8EpHfF|_B*Lhk-;sn!j zTQY{RR4${Cns}bHFwe^Pn$FLsF?>JmUsRBqUrjPBt^sgF{(%ylxY06`$EDc<7k{4H z0UR0FPzP#n0p}gEE#3__cag&xqauLJj*@%^&WjURT#`usP0Dg!>_!n{osxDZY(_$u72;tD4Mq^ob2c&Jz?baR* zCUH8pPz4?t(V8+phnV6dfzkicOb74O{T5)Aj9O@2V80lk z5);IRe_;dMeqgmIu&;=dBF;C=W5k=fvz$&rLImHuJp?qwtr`NxE}{&L<2zN)#T#4>fihFfCyxM)bXm_mg5x+wJ&IgbRiEvKzp727*5ueKwCe<^$qod!?Cy~vhS zZ?Y4?rg`vFlKunC6vU3)BL6C5CwdhS>JO?_S3ZeS^+Vl?g|y)V(K#fe`|LGVQ;`1k(fq|qcPHHcoiML0uXb)nSXJCyM!DuVJGbkdHjU-jlf@F+S z$iIlI=3*dO&0*1+)PHj+xW%fElO6fTg89}=piU0>u&>gmq!fxl758Zy$egaB} z!#snFEMex$X%?43@BZcN=;84)kvgJIbd3BDHTfHpxr$~Vki1)5Q6L|jwO z53HnN6(4$K3_+p~FnLiD(lj~%>K=NhjjIwkegPcN0rH!{v)yipzwKvjtU_3n5bw>} zZbkY{i38IBK#k>bZElqIAkEPw5*>LzQjBC~#ttotDV?mSLlM2+&ac`Ie!Sj{k8Gur zBDxu2K4ltve&AWu3~Yh~6Wv%cSUe~f;H1}khx-Rmn20;32CJY(AiL4YS51d30CEF$ z63zp59E$wRmgrmpZKxF|U<4<@P$-d!Jc40Ua55qjSi&>GMh%h&t=7h`rB+quarlVQ z1AK4aND1MNJq{ml0p1{{o5XJTu4W8L&Sco$^01*};1^&`B!%-N1O-LyEL$K1l{5m< za-2mPy$Y+QjSJ+Q$d)Z?8_!L;j8_aISE7m|?*6fP_H6ZYkj5rHq#M+05B`^HWiKpj|8ijK@N%BvZiwH4n*r-4Ur)8f9rgTwkDcu?W0M7OBGU8 zJqRV#I7M#m-j<;<+9d;ejkN2Xl%>qHtTQFi2j`bLu~s2k6>FwfL4;agX{TC@9T7Bt z9oCjENhRZ11cV@wlPES@?4=j-ZOd=r`p|cA3FDI%?VI$2mwrYpPWE*Ih(5oGXseZ7 zZ`QBUL2T}NyL@D`C{JNV(NmoACR9jikfze~x>6zg%4bKRiZsf@jcpe|_gswl#TT?B zrtvDx$W><3-^dfyOlxdWa|CB_?0^&#jV3dUW@4xC%l@}%M`r;mmH{z>k`@poErqN| 
zj>>UL5Ouo4%STK8cgOwreuQ)=@?=!Got^ESuRGgcbsqlh4dcF9!m@Nu#j;eRs&S7V zZtv{C(ISjT+h6Z&cTRWKfvAkTu~p`IJk9!t*?Y2mFm`xDL%|Ii@1M^D0n+-=GMn;b zOcjEddR<51=7;;2%bs(M7jZ`26Q_k@;DE{1J;Kr!(pcLN4TLcGtuL6Wk%>>ysMTz!RS4t#a7koWk z2E5hLTGgj6_W0e(RL9ktR9MydkC${=;7-8HQPta`qo_X4+zt|$dAcb9k023UL3Bs) z3?7e_bI^ENCNs+jW-&|{9wjrTxFRIt$Pk5@fyI{zz?@Az>Y`f`Ahgy*2tj6hxI#~_Ito*JACu%!+y zKqvaLaA{qWy6x5RWDU_#>;{NzxdkGdZiT4Sy#XK^Zh?oE2DlrBded;rc;8GQD`5^r z>Z_m|K5jq54ydo-*zpp5;g*Uv10_A3xcgs)k2~PiK91DN0dltf#B>I&iTtyr$zm>o zy2^>?4CEJpj>}=(EFA@}FX5~RF5$T;2>?Yd;&^~UQHD~MoE2D- zl5)xskxHePuT$2?-`8kb$Huo%jK}PZfxcg$W<`CFgvmi?cp`6zAZEj4fWuX0sgd-Q zQW}s1YA{&Lp;reI>O$!f+UgSI+@(ir*mKT|dHzajl911bF2y0mP%3|rr$n+2Tc8IN z5^fVNeUh+tMz615%>wGg4wP~zn_&%yDjDjJd&=dX_9axCCB-{g8d7aXd!2tgcI-#D zS8|}OAKo2y9)U9=(|WE4F9&jd*b<4b!-YEDAf(U;d*@stU!?RD#!cq{H+!_9+Qnx8 zpS7f_NiuLZv6Y07l`B%4W$$g!tHs7%^WXRUS@gOKFSLddxaPT2p(yUu8t75QwkYpI z>Z;+{!$wvCRuzZe;@ns5YYqs7`!@IH_GC#@?=-Ay!^N*wp<-{M|Bob!a$X~a()3|} zw{Fm6OG7JEYZ_+A8Zl0dx&9XD!Jf?2{+Lv>1_0eq^v zBU#^eBk=i+*)dOGks(s}H)X@56y?*J>{c`UNn-(Y)F}L*r4D-L37;99jD+~wp{E3> zffV^_b^txE10`-oxaxgKqI{_glF?)-wkJe+)Rr9LP#v2-u5IykE5n)#5mpItcx z>xQfj9czWhHJA*O#tAg$HPhWZx+G_Z_)E+UE=@A}uE1K{zx1G8yRH>vaP zbjx_YEhp5@(qPwS@%cwVyQqQ$9fjo3y@;(m!h`xpI?#1=c2l}(8VGRx+0zN;hF>Nt zx{<48GW4h?MmP0L@m1Dc`S!e3!6I#FFOQ{ojyIx(5I__T>sJC@@0`c0E~!I8`!WZV zJffZiMLX|(qQnQhtC3sUk5R#SFaFL}^ zB4E#v(ZvxJ&Or`r>5o2HAG?2!G7VaoqG1u_Mn}bsx?gAU^aP&*&EadfNYA6dvLp!4 zFW?)(;30A;Lii%3*3Ki`SVDubdg&5`V)_VQWQh0+$lnz&P`NTFv-|=T6oNGYiiQY2 zTy>*1GGjyg5wu-kbywXGSuFhb6!(kTK0h^IJ8wkgV$EwPw;B;E>=Sa?xkIdNxTZ#R zKeivqQ(aCYfG6>od)1zJle-C+u*Ovgrrb970I)uDeUh z6euub%Hk$ZovM<-p~0G7i@PQTuQs8!M)wxBtuAHa6$MebPLot}EdtU)dg6VX zT^jsEYLTU=VhS8yltHVDiw&M1lD4jc0M2t(c6Bcu&+ zsRAGGVc&%BfD~Z!t$^_24Hw{?B!gYK$tVEFJqjmQI{jpQ`tQCoIH_el&z2mzn(ij+ zd)oDf@4jo|G$Bqk4W#_#RayoUAh?AsyJ2QxO$x$3$;zPx^3o-LH<9wB-MZeei6+j) z1>#hK5ZWO`fy(|h2#AT?)8?jMW0eGT(_IJKQJ`xhxzA4WFrAk>Dp>-#b#8pNF z)(T5gz3%4CCYUTgfo7Ep2vP@$U<=dnIvs4edfi}K+*M_rGor3DN$JgSsWIvN>3dna 
zR92F@T)(>qh032D+-H%@?h~cL))k4cD<`bo=*s;K=IQwA=ed10z8WW^tKxLrc21j6 zv$;p0#7{t1t7_$(7DsMEbiurL@5n&U;v4>JjsxDN>q!ej0y)anz+j%7LtLn@5LFIx za6~c50NL->J8Vl2B+*9O-35}FfXD^8LD zLg_AJ^d<}Mv~U~IhBIEbf*9M4n>eDzrA`?JZk7}tb?J-~H$CeT-jYIQ1Re%>SNoXEbHdPXMdpKW>?sboXqYvHV;|Fcjf&|eR zVpIW9VL6&11vo^wwrO@?-0By6V7ZxNkz{VB>XE=Efo88^QLP0hQ{EjY|pOPzJr0cP~ES`~$P zvu6!#~6wR}ba z1xLH&;7~7H8{8d4@hRi~90Eg_V1M=Dsn$5c&O-B=_sJ=P{^&xLpd4T%Xkzg`*@8#L zs!x0IIj^kX0!V_71;Q+E$%yXoJ;MyFslMQdNm2Y4xFBJUIyjtn0y}T%lJtNhF{V+E zBx_e-UYT;YsY|gN%~ll-jXjlHJ8$R{fv z3Utrf7`jI&O;qww&X*f5$fK5QtW9LT8BSOHT`dTp5$S@4lxT=6B1@bh^Ay5&f9b3bIMY)m+C%O1@`Ws^|O3w?-Ad8Ulp1SUfKc*1U|IQJJf_7myqY3T~+6D z;fi{0l8jNLR{9*egDpe9zc>^Jvl$laRd?MBVQ=Yqh&kTLCDgV?WzF#-uqg@!`6C~$ zzSn`4Sj{ydUaM4~496~0c2+>4T5W;gy$FAhU+jG7?v}sT9|-_H`dKj`pZj+= zjo0<=Cee?$3x+}Ziw#_gge6>B*uQW?yk=C|!;zA00jl~d2isfbE0+GxgCG^wMoi== zyEK8+R^#VT5bEP24tSRGq-?Ql!)VEsU}Q4A@Vfx8_v?TDxA_US9Y@@H%h+A8XQG38 zTqq2jRxdL7GdfB2@p4dQj~x{L1WX}*IFb+N%dndo^BzfxZjWg!AjhAk2--^sh@(-r z*Kk-hg(%nFLW$!VG3#}dQDD}*jVwPL;I7D_!R1%ohb4bwMi8NVH+~Ic9TP}!GD7hh z4QN$0_%7RLy zSaI+>MoWPL23Hs6`Lh(Y6W}kY%MXHR019HjjdWDp^;v884t4KBXZY+qT}5gtMQx=n z%1ImU{R2tYBKTGfB0gpb9U)n#9r`B?!=y6Nzo6w-xj<0JEFMH@-Dr74+9uEel26g} z0cz!*L2VF>s3!foCFQEjy0~8LI{sHt^)S8K5ZA0Ju%2G6U;pdh%$tl!F-i{ag;KPt z3xNJdYg7swD_mK}5q2KGTrw&?iZ4Yi>5zs8$`e;8NOd+UV_fG3<+la8De@!$QlrFr(;UofdYr#$cm`1!t(z^?w&!bwA4z{lIr%266&GjenSfgOcfSg`EDVbx@gvb-MJa2}Zi*P!*JuDv#w-wl97S5St9#ILXTK#ccUpHnYCH4mR`{0};xtS0~f literal 0 HcmV?d00001 diff --git a/.github/workflows/check-models.yml b/.github/workflows/check-models.yml new file mode 100644 index 0000000..f93ade9 --- /dev/null +++ b/.github/workflows/check-models.yml @@ -0,0 +1,97 @@ +name: Check model versions + +# Flags out-of-date OpenAI / Anthropic model references introduced by a PR. +# Only added lines are inspected, so existing references don't block unrelated PRs. +# Policy (latest models + ignore lists) lives in .agents/skills/check-models/scripts/models.json. 
+# Platform-wrapped IDs (Bedrock `[region.]anthropic.claude-…`, Databricks +# `databricks-claude-…`, OpenRouter/LiteLLM `provider/model`) are flagged on their +# embedded model substring — migrate the version but keep the platform's ID format. +# See the "Platform-specific IDs" section of the check-models skill. + +on: + pull_request: + branches: [main] + +permissions: + contents: read + pull-requests: write + +jobs: + check-models: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd + with: + fetch-depth: 0 # need the base commit to diff against + + - uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e + with: + node-version: 20 + + - name: Scan changed lines for outdated models + id: scan + run: | + node .agents/skills/check-models/scripts/scan-models.mjs \ + --diff "${{ github.event.pull_request.base.sha }}" \ + python typescript \ + --json --no-fail > model-scan.json + cat model-scan.json + + - name: Comment and gate + uses: actions/github-script@f28e40c7f34bde8b3046d885e986cb6290c5673b + with: + script: | + const fs = require('fs'); + const r = JSON.parse(fs.readFileSync('model-scan.json', 'utf8')); + const marker = '<!-- check-models -->'; + + const errors = r.findings.filter(f => f.severity === 'error'); + const warns = r.findings.filter(f => f.severity === 'warn'); + + const row = f => `| \`${f.file}\`:${f.line} | \`${f.token}\` | ${f.action === 'replace' ? '`' + f.target + '`' : '—'} | ${f.reason} |`; + let body = `${marker}\n## 🤖 Model version check\n\n`; + + // Surface a stale-policy warning (content references a newer model than the policy knows, or the policy is old) + if (r.stale && r.stale.suggestRefresh) { + const newer = (r.stale.newerThanPolicy || []).map(n => `\`${n.token}\``).join(', '); + body += newer + ? `> ⚠️ **The model policy looks out of date** — this PR references ${newer}, newer than \`models.json\` knows. 
Run \`scan-models.mjs --refresh\` and update the policy.\n\n` + : `> ⚠️ **The model policy is ${r.stale.ageDays} days old** (updated ${r.updated}). Consider \`scan-models.mjs --refresh\`.\n\n`; + } + + if (errors.length === 0 && warns.length === 0) { + body += `✓ No outdated model references in the changed lines. _(policy ${r.updated})_`; + } else { + if (errors.length) { + body += `### ✗ ${errors.length} outdated model reference(s) — please update\n\n`; + body += `| Location | Found | Suggested | Why |\n|---|---|---|---|\n`; + body += errors.map(row).join('\n') + '\n\n'; + } + if (warns.length) { + body += `### ⚠ ${warns.length} item(s) to review (not blocking)\n\n`; + body += `Prose mentions, specialised variants (\`*-codex\`, \`*-chat-latest\`), or GPT-5/o-series code changes (\`max_tokens\` → \`max_completion_tokens\`, drop \`temperature\`).\n\n`; + body += `| Location | Found | Suggested | Why |\n|---|---|---|---|\n`; + body += warns.map(row).join('\n') + '\n\n'; + } + body += `_See the [\`check-models\`](.agents/skills/check-models/SKILL.md) skill. Policy date: ${r.updated}. Add \`check-models:ignore\` to a line to skip it._\n`; + body += `\n> ℹ️ **Platform-wrapped IDs** (Bedrock \`[region.]anthropic.claude-…\`, Databricks \`databricks-claude-…\`, OpenRouter/LiteLLM \`provider/model\`) are flagged on their embedded model name — bump the version but keep the platform's ID format (e.g. Bedrock 4.x needs a \`us.\`/\`eu.\`/\`apac.\` inference-profile prefix). See the skill's _Platform-specific IDs_ section.`; + } + + // upsert a single bot comment + const { owner, repo } = context.repo; + const prNumber = context.payload.pull_request.number; + const comments = await github.paginate(github.rest.issues.listComments, { owner, repo, issue_number: prNumber }); + const existing = comments.find(c => c.body && c.body.includes(marker)); + // Post when there are findings, or when this PR concretely introduces a + // newer-than-policy model. 
Don't open a fresh comment for date-staleness + // alone (it would fire on every PR once the policy ages) — only update an existing one. + const worthPosting = errors.length || warns.length || (r.stale && r.stale.byContent); + if (existing) { + await github.rest.issues.updateComment({ owner, repo, comment_id: existing.id, body }); + } else if (worthPosting) { + await github.rest.issues.createComment({ owner, repo, issue_number: prNumber, body }); + } + + if (errors.length) { + core.setFailed(`${errors.length} outdated model reference(s) introduced by this PR.`); + } From 96cac1ed846fa71cd3266f1634b421498e74d765 Mon Sep 17 00:00:00 2001 From: Jim Bennett Date: Thu, 18 Jun 2026 15:57:58 -0700 Subject: [PATCH 2/3] two-tier PR gate: fail on introduced outdated models, warn on pre-existing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI gate now scans the whole of each touched file and tags findings by whether the PR changed the line — introduced fails, pre-existing (unchanged line of a touched file) is a non-blocking warning. `python typescript` scan paths. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .agents/skills/check-models/SKILL.md | 3 ++ .../check-models/scripts/scan-models.mjs | Bin 19646 -> 21318 bytes .github/workflows/check-models.yml | 43 +++++++++++------- 3 files changed, 29 insertions(+), 17 deletions(-) diff --git a/.agents/skills/check-models/SKILL.md b/.agents/skills/check-models/SKILL.md index 1c1e0d5..7d112b4 100644 --- a/.agents/skills/check-models/SKILL.md +++ b/.agents/skills/check-models/SKILL.md @@ -78,8 +78,11 @@ It prints the current policy, the WebSearch queries, the authoritative source UR ```bash node .agents/skills/check-models/scripts/scan-models.mjs python typescript # scan paths node .agents/skills/check-models/scripts/scan-models.mjs --json > out.json # machine-readable +node .agents/skills/check-models/scripts/scan-models.mjs --diff python typescript # PR-gate mode ``` +**`--diff ` (the CI gate)** scans the whole of each *touched* file but tags every finding with `changed` (was the line added/modified by the PR?). It splits outdated-model errors into two tiers: **introduced** (on a changed line) → fails the run; **pre-existing** (on an unchanged line of a touched file) → reported as a non-blocking warning. Untouched files are never read. Full-scan mode (no `--diff`) tags everything `changed:true`, so it fails on any error as before. + Each finding is one of: - `✗ error` — an outdated lowercase canonical model ID (e.g. `gpt-4o-mini`). Migrate it. - `⚠ review` / `⚠ replace` (prose) — a model named in prose (`GPT-4o`) or a specialised variant (`*-codex`, `*-chat-latest`, `gpt-4v`). Use judgement (see below). 
diff --git a/.agents/skills/check-models/scripts/scan-models.mjs b/.agents/skills/check-models/scripts/scan-models.mjs index 065599466b4cc9148582875b1c2756edbb3bf1e5..4684f6f0e926e9e7d47e25df6f62c30304a406f8 100644 GIT binary patch delta 2015 zcmah~ON%2_6sECN$Yum}<>pMLXRN|X1yKe&Aw6m)&@oM?>26V(O>U}Ar7m54Z@6_U znb0)RrEY5>zF=^nTcusOa3i?VAK=Et_y^ny3Z8qbs*|P_10j#|`p$R0bLxjrHh%hQ zSQuj1BtIT; z3}L(=JHo-CU!#me0huVX+K?1v1Prvyf|$G`o=_Hq6I|jHhWmT(fy{KsG=>dRIssHl zsuY}5+Po7onZyATR9y*RV+w$&!UjI%spet=A&#{ROoR|BIVGD+u{mXmoi-KDIL<*x zo}v$Z0a2DDjg+#XZPlb+u@GX!vP7q#RfZlA1e~E-Q2$hgI5@$nwska{5N%L4+A{9o z2nl11*^En7T3kQ>d3);_uX2Qr&3y64_7|H=9)WYSqUz2a2>jGk<9Y!%I~}N7BI+&# zQc#x}wt=u->APrj#uR6QKr@BnMGbSQbPQ>xA{G$t8qkYo=A2pbTc9xUXf*kwooAP2 zc@KQw58&KwllM8j*Ri=2p))0l`j%B&n1F@YC^7&sLTaCe%~;nJTjv;LGSDi_j*P?$bz^EvrTwl24IF_O%WHN`7!6z zcqLj+cq+|OEhC{q>}!gJZt+*cIuhYaBfW_6L>J7quJg zDLchN>@*c3a7sfV!O;2$DH`KM1}7FVDVjb>?sYraH%dttxjCu5NV1SwML^y44Rxe3 zhm?_Ocwh%=xAF*e{`c#@-8GUyp?&h|5qFNG@uBvByU^KzGYU7aB03QyN)9i<%kBK5 zZhtEgM6UI&GWsz~okX}+93&67eew01PdsHtbv&mfQejPJSC#Bcuz1u%JHbK{TUS2yZj=2xs6@rN?O1i%G zcI%6^CTa`a&}NsI>?YgAAFZZ70$N9sZVAq>S>~41bEONSumJAUeIxNZPGqT${Hj;8 z(Gg2hBo+(eY2Ll+fctKe(ALKPvvE^rJ5z^)!^c8|GSOX)@~ErgZkr7AJw4+t4=SWTI>5?8f*RW z=O5>rgR@uc;U>2J!yO#E1tl%64b7*6!IN8OWXQiCys&RG>`RT5(aM+CfQ7;+Ijxx;6o$%p*eADdM*Wfc{5Zdd^lKtI+Y&I rH3di2N*j%u2V?53FOj{b^jM?&QQnLjSC59Df3WvlZG}GnZ12`TtpKn; delta 393 zcmYjMu}T9$5am!(1hEv1&EzhG3nVK68;xfS7;SRmVHqP?b9?b%-Icoy2qE^?Q;1@~ z%Em@GYfC#DKS0nA@E3IN(qx($X5M@A-usjBc4xeNY+}7(C4OVNh!?x%B|^vmCY44W zV1mo7gXFaJZL9<`0(&4ute_h)UIiWs7UT8aSAXur1;_xpS|U65NGu~|+q4R26>xTd zLQPK<7;Evg!XOh>OsKBr)}Y>lqT|3w_$<&5d?=Rzr|mUQzn({FjdT-=5lfAx6%VNB zXpTLzC#o4@Ub)!>)69bDXvtrDA%|D)IzGD=e!3euh!y25p@*+V!O1y zZcW(I{r(Z%qcP^aZR~i3lH`nVIkfvHqBPFnG2u)M)%gVENcmJTAAh}@)ih{PUMeEw OJ`H&qYX(Q_*!l f.severity === 'error'); - const warns = r.findings.filter(f => f.severity === 'warn'); + // Three buckets: + // introduced — outdated model on a line this PR added/changed → FAILS the check + // preExisting — 
outdated model on an unchanged line of a touched file → warn, non-blocking + // warns — prose / specialised variants / GPT-5 param hints (changed lines) → review + const introduced = r.findings.filter(f => f.severity === 'error' && f.changed !== false); + const preExisting = r.findings.filter(f => f.severity === 'error' && f.changed === false); + const warns = r.findings.filter(f => f.severity === 'warn'); const row = f => `| \`${f.file}\`:${f.line} | \`${f.token}\` | ${f.action === 'replace' ? '`' + f.target + '`' : '—'} | ${f.reason} |`; + const table = items => `| Location | Found | Suggested | Why |\n|---|---|---|---|\n` + items.map(row).join('\n') + '\n\n'; let body = `${marker}\n## 🤖 Model version check\n\n`; // Surface a stale-policy warning (content references a newer model than the policy knows, or the policy is old) @@ -59,19 +67,20 @@ jobs: : `> ⚠️ **The model policy is ${r.stale.ageDays} days old** (updated ${r.updated}). Consider \`scan-models.mjs --refresh\`.\n\n`; } - if (errors.length === 0 && warns.length === 0) { - body += `✓ No outdated model references in the changed lines. _(policy ${r.updated})_`; + if (!introduced.length && !preExisting.length && !warns.length) { + body += `✓ No outdated model references in this PR. 
_(policy ${r.updated})_`; } else { - if (errors.length) { - body += `### ✗ ${errors.length} outdated model reference(s) — please update\n\n`; - body += `| Location | Found | Suggested | Why |\n|---|---|---|---|\n`; - body += errors.map(row).join('\n') + '\n\n'; + if (introduced.length) { + body += `### ✗ ${introduced.length} outdated model reference(s) introduced — please update (this fails the check)\n\n`; + body += `On lines this PR adds or changes:\n\n` + table(introduced); + } + if (preExisting.length) { + body += `### ⚠️ ${preExisting.length} pre-existing outdated model(s) in files this PR touches — not blocking\n\n`; + body += `These are on unchanged lines, so the check still passes — but since you're already editing these files, consider updating them too.\n\n` + table(preExisting); } if (warns.length) { body += `### ⚠ ${warns.length} item(s) to review (not blocking)\n\n`; - body += `Prose mentions, specialised variants (\`*-codex\`, \`*-chat-latest\`), or GPT-5/o-series code changes (\`max_tokens\` → \`max_completion_tokens\`, drop \`temperature\`).\n\n`; - body += `| Location | Found | Suggested | Why |\n|---|---|---|---|\n`; - body += warns.map(row).join('\n') + '\n\n'; + body += `Prose mentions, specialised variants (\`*-codex\`, \`*-chat-latest\`), or GPT-5/o-series code changes (\`max_tokens\` → \`max_completion_tokens\`, drop \`temperature\`).\n\n` + table(warns); } body += `_See the [\`check-models\`](.agents/skills/check-models/SKILL.md) skill. Policy date: ${r.updated}. Add \`check-models:ignore\` to a line to skip it._\n`; body += `\n> ℹ️ **Platform-wrapped IDs** (Bedrock \`[region.]anthropic.claude-…\`, Databricks \`databricks-claude-…\`, OpenRouter/LiteLLM \`provider/model\`) are flagged on their embedded model name — bump the version but keep the platform's ID format (e.g. Bedrock 4.x needs a \`us.\`/\`eu.\`/\`apac.\` inference-profile prefix). 
See the skill's _Platform-specific IDs_ section.`; @@ -85,13 +94,13 @@ jobs: // Post when there are findings, or when this PR concretely introduces a // newer-than-policy model. Don't open a fresh comment for date-staleness // alone (it would fire on every PR once the policy ages) — only update an existing one. - const worthPosting = errors.length || warns.length || (r.stale && r.stale.byContent); + const worthPosting = introduced.length || preExisting.length || warns.length || (r.stale && r.stale.byContent); if (existing) { await github.rest.issues.updateComment({ owner, repo, comment_id: existing.id, body }); } else if (worthPosting) { await github.rest.issues.createComment({ owner, repo, issue_number: prNumber, body }); } - if (errors.length) { - core.setFailed(`${errors.length} outdated model reference(s) introduced by this PR.`); + if (introduced.length) { + core.setFailed(`${introduced.length} outdated model reference(s) introduced by this PR.`); } From 71df10b601af810b231a7e6b956bc8533bf9c263 Mon Sep 17 00:00:00 2001 From: Jim Bennett Date: Thu, 18 Jun 2026 16:03:33 -0700 Subject: [PATCH 3/3] test: demo the check-models gate (touch one model line; do not merge) --- .../example_arize_ax_self_optimizing_loop_dag.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/cookbooks/airflow_example_dags/example_arize_ax_self_optimizing_loop_dag.py b/python/cookbooks/airflow_example_dags/example_arize_ax_self_optimizing_loop_dag.py index 7a1beef..dec4677 100644 --- a/python/cookbooks/airflow_example_dags/example_arize_ax_self_optimizing_loop_dag.py +++ b/python/cookbooks/airflow_example_dags/example_arize_ax_self_optimizing_loop_dag.py @@ -83,7 +83,7 @@ ``"false"`` so you can inspect the artifacts in the Arize UI between runs. - ``arize_ax_self_optimizing_model`` — the model used by the server-side - experiment tasks (default ``"gpt-4o-mini"``). The optimizer always + experiment tasks (default ``"gpt-4o-mini"``). 
The optimizer always # demo: gate test edit uses ``gpt-4o`` regardless. """
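
Reviewer note on the two-tier gate from PATCH 2/3: the bucket logic in the workflow's inline script can be sketched as a standalone function, which may make it easier to reason about or unit-test outside of `actions/github-script`. This is a minimal sketch, not the scanner's actual code. The helper names `classifyFindings` and `gateDecision` and the sample file paths are hypothetical; the field names (`severity`, `changed`, `stale.byContent`) are the ones the workflow reads from `model-scan.json`.

```javascript
// Hypothetical helpers mirroring the workflow's inline gate logic.
// Only the field names (severity, changed, stale.byContent) come from the patch.

// Split scanner findings into the three buckets used by the PR comment.
function classifyFindings(findings) {
  return {
    // Outdated model on a line the PR added or changed: blocks the check.
    // `changed !== false` keeps full-scan findings (no `changed` tag) blocking.
    introduced: findings.filter(f => f.severity === 'error' && f.changed !== false),
    // Outdated model on an unchanged line of a touched file: warning only.
    preExisting: findings.filter(f => f.severity === 'error' && f.changed === false),
    // Prose mentions, specialised variants, parameter hints: review only.
    warns: findings.filter(f => f.severity === 'warn'),
  };
}

// Decide what to do with the bot comment and whether the check fails.
function gateDecision(report, hasExistingComment) {
  const { introduced, preExisting, warns } = classifyFindings(report.findings);
  const worthPosting = Boolean(
    introduced.length || preExisting.length || warns.length ||
    (report.stale && report.stale.byContent)
  );
  let commentAction = 'none';
  if (hasExistingComment) {
    commentAction = 'update';   // always refresh an existing bot comment
  } else if (worthPosting) {
    commentAction = 'create';   // open a new comment only when there is something to say
  }
  return { commentAction, failed: introduced.length > 0 };
}

// Illustrative report; the paths and line numbers are made up.
const sampleReport = {
  findings: [
    { file: 'python/a.py', line: 86, token: 'gpt-4o-mini', severity: 'error', changed: true },
    { file: 'python/a.py', line: 12, token: 'gpt-4o-mini', severity: 'error', changed: false },
    { file: 'python/b.py', line: 3, token: 'GPT-4o', severity: 'warn', changed: true },
  ],
  stale: { byContent: false },
};

console.log(gateDecision(sampleReport, false));
```

Under this sketch, a PR that only touches files with pre-existing outdated references still gets a comment but does not fail, which matches the behaviour the commit message describes.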