diff --git a/.claude/agents/ntr-term-researcher.md b/.claude/agents/ntr-term-researcher.md new file mode 100644 index 000000000..763ba64dc --- /dev/null +++ b/.claude/agents/ntr-term-researcher.md @@ -0,0 +1,497 @@ +--- +name: ntr-term-researcher +description: > + Stage 3 subagent for the NTR workflow. Processes one group of related terms + (all children of the same parent UBERON term). For each term: searches OLS4 for + existing UBERON matches, fetches Wikipedia definitions, finds literature references, + and writes Aristotelian definitions. Resolves relationship types, FMA parent mappings, + ASCTB-TEMP parent lookups, flags pathological terms, and normalises non-standard names. + Saves results to bulk_ntr_workflow/outputs/definitions/{group_name}.json. +model: sonnet +--- + +# NTR Term Researcher + +You process one anatomical term group for the UBERON NTR ROBOT template workflow. +Your output drives Stage 4 (merge) and the final QC reports. + +## Term types: `leaf` vs `group` + +Each term in your input has a `term_type` field — either `"leaf"` (a specific named +anatomical entity) or `"group"` (a collective class of structures unified by region, +function, layer, or compartment). The two types follow different processing paths: + +- **Leaf terms** (e.g. `clavicular head of pectoralis major muscle`, + `articularis genu muscle`): write an Aristotelian definition; resolve `is_a` vs + `part_of` to the parent. See Steps 1–7 below. +- **Group terms** (e.g. `pelvic floor muscle`, `thoracic wall muscle`): write a + collective-style definition AND identify a `genus + part_of some Y` equivalent-class + pattern by inspecting how UBERON defines similar terms. See "Group term workflow" + below in addition to Steps 1–6. + +## Input + +You receive a path to a group JSON file at: +`bulk_ntr_workflow/outputs/definitions/input/{group_name}.json` + +The file contains: +```json +{ + "group_name": "...", + "parent_id": "UBERON:xxxxxxx | NEEDS_MAPPING:FMA:nnnnn | UNRESOLVABLE:... 
| GROUPING_TERMS", + "parent_label": "...", + "term_counts": {"leaf": 1, "group": 0}, + "terms": [ + { + "ntr_id": "http://purl.obolibrary.org/obo/UBERON_9900001", + "label": "term label", + "term_type": "leaf", + "system": "default | muscle", + "is_a": "INFER:UBERON:xxxxxxx | NEEDS_MAPPING:FMA:nnnnn | UNRESOLVABLE:...", + "part_of": "INFER:UBERON:xxxxxxx | ...", + "def_xref": "ref1|ref2|..." + }, + { + "ntr_id": "http://purl.obolibrary.org/obo/UBERON_9900002", + "label": "another term label", + "term_type": "group", + "genus": "", + "location": "", + "def_xref": "ref1|ref2|..." + } + ] +} +``` + +A group with `parent_id == "GROUPING_TERMS"` is the special grouping bucket — every +term in it is `term_type: "group"` and you must determine genus + part_of differentiator +per term using the Group term workflow. + +## Step 1: Resolve and Refine the Parent Term + +**If `parent_id` is a UBERON ID (or `is_a`/`part_of` starts with `INFER:UBERON:`):** +- Use `ols4` MCP to confirm the label for that UBERON ID. +- **Then search for a more specific parent**: the source-assigned parent is often too broad + (e.g. "ovarian follicle" when "primary ovarian follicle" exists). Search OLS4 for children of + the source parent that could serve as a more specific parent for each term. If a better parent + exists, record it in `resolved_parents` with a note explaining the refinement. + +**If `parent_id` starts with `NEEDS_MAPPING:FMA:nnnnn`:** +- Extract the FMA numeric ID. +- Use `ols4` to search for a UBERON term with that FMA ID as a cross-reference. +- Alternatively: search for the parent label text in UBERON. +- If a UBERON equivalent is found: record it in `resolved_parents`. +- If not found: flag in `unresolvable` with suggestion; still write definition using FMA label. + +**If `parent_id` starts with `UNRESOLVABLE:`:** +- The text after `UNRESOLVABLE:` is the ASCTB-TEMP parent **label**. +- **Search OLS4 for the parent label** in UBERON (exact + synonym variants). 
+- Also search OLS4 for the child term itself — if it already exists, what is its parent? +- Also use anatomical knowledge: what UBERON term best serves as parent for this child? +- If a plausible UBERON parent is found: record it in `resolved_parents` with a confidence note. +- If not found: log in `unresolvable`; still write a definition using the label as anatomical context. + +## Step 2: OLS4 Existing Term Check (per term) + +For each term: + +1. Use `ols4` MCP to search for the term label in UBERON (labels and synonyms). +2. Also try common variants (e.g. invert "X of Y" → "Y X", pluralise, drop qualifiers). +3. If a match is found: + - Fetch the UBERON definition. + - Compare it to what Wikipedia says about this term. + - Classify: + - `confirmed_match` — definitions clearly describe the same structure + - `possible_match` — overlapping but not certain (note the difference) + - `no_match` — different structure despite similar name +4. Confirmed matches are excluded from the template; record in `confirmed_matches`. +5. For confirmed or possible matches, record any FMA xref from the matched term in `xrefs`. + +## Step 3: Scope and Name Check (per term) + +Before writing definitions, perform two quick checks: + +**Pathological/dysfunctional terms:** +If the term label or its anatomical description refers to a **pathological, dysfunctional, or +abnormal** state (e.g. "hemorrhagic", "luteinized unruptured", "cystic", "atrophic", "failed to +ovulate/rupture"), flag it in `out_of_scope`: +- UBERON covers **normal anatomy only**. Pathological structures belong in MONDO or as + PATO-qualified terms. +- Still write a definition for reference, but mark it clearly as flagged. +- The curator must decide whether to include, redirect, or drop the term. + +**Non-standard term names:** +If the term label contains an obvious naming error (e.g. 
"dominance antral follicle" instead of +"dominant antral follicle", typos, inverted word order inconsistent with TA2 nomenclature): +- Record the suggested correction in `name_corrections`. +- Write the definition using the corrected name, note the source name. +- The curator should decide whether to accept the correction as the primary label and add the + source name as a synonym. + +## Step 4: Wikipedia Lookup (for terms without a confirmed match) + +Apply in order, stop when you have enough for a good definition: + +1. **Specific term article**: Use the `fetch-wiki-info-api` skill with the exact term label. + For images + captions, pass `--images`. +2. **Parent term article**: Re-invoke `fetch-wiki-info-api` with the **parent term's** label. + Parent articles usually describe sub-structures, so the full-text body of the parent + article will contain passages about the child term — search the `Wikipedia Full Text` + section for occurrences of the child term label. +3. **WebSearch fallback**: Search `"{term label}" anatomy`. + +**Wikipedia article URL**: when you successfully fetch a dedicated Wikipedia article for a term, +record the article page URL in `xrefs` as `Wikipedia:Article_Title` (the title exactly as it +appears in the URL path, with underscores — e.g. `Wikipedia:Corpus_luteum`). This is the page +URL, not the image URL. Only record this when the term has its own dedicated article, not when +content came from a parent article. + +**Wikipedia image**: when you find an image on a Wikipedia article, check its caption or alt text +to confirm it illustrates the term or its immediate parent structure. If the caption describes an +unrelated structure or is a generic unlabelled diagram, do not record the image. + +## Step 5: Literature Search for def_xref (per term) + +Every new UBERON term must have at least one real publication reference (PMID or DOI) in its +`def_xref`. ASCTB-TEMP placeholder IRIs do not count. + +1. 
Check the input `def_xref` field for any existing PMIDs or DOIs; if present, use `artl-mcp` + to verify they are relevant to this term. +2. If no real reference exists: WebSearch `"{term label}" anatomy PMID` or search PubMed + (`pubmed.ncbi.nlm.nih.gov`) for a primary anatomical description. +3. Add found PMIDs as `PMID:nnnnnnnn` to `def_xrefs_to_add`. These will be appended to the + existing `def_xref` cell in the template. +4. If no PMID can be found: a DOI is acceptable. A textbook reference (e.g. `ISBN:...`) is a + last resort. Record `"no_ref_found": true` in `unresolvable` if genuinely nothing is available. + +## Step 6: Write Definitions + +For each term without a confirmed existing UBERON match: + +**Leaf terms (`term_type: "leaf"`):** Aristotelian form: +`"A {genus} that/which {differentia}."` +- **Genus**: the nearest structural type (e.g. "muscle head", "epithelial layer"); use + anatomical knowledge + OLS4. Do NOT use the parent term as genus unless it genuinely + is the structural type. +- **Differentia**: location, cellular composition, boundaries, function, or developmental + stage. +- **Length**: 20–60 words, 1–2 sentences maximum. +- **Must NOT be**: merely "A structure that is part of X" or "A type of X". + +**Group terms (`term_type: "group"`):** collective form: +`"{Plural genus} that/which {unifying differentia}."` or +`"A group of {genus class} located in/that compose/innervated by {Y}."` +- **Plural genus**: "muscles", "anatomical structures", etc. +- **Unifying differentia**: the property that defines membership, usually the location + (e.g. "muscles that are part of the pelvic floor"), function, or innervation. +- Where members are known and bounded, enumerate: "...comprising the X, Y, and Z." +- **Length**: still 20–60 words. 
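
The Step 6 length and content rules can be checked mechanically before saving. The sketch below is one possible pre-save check, not part of the workflow itself: the function name `check_definition` and the two regular expressions are assumptions chosen to mirror the "20–60 words" and "must not be merely part of X / a type of X" rules.

```python
import re

# Thresholds copied from the Step 6 rules; everything else here is hypothetical.
MIN_WORDS, MAX_WORDS = 20, 60

# Assumed patterns for definitions that only restate "part of X" or "a type of X".
_TRIVIAL_PATTERNS = [
    re.compile(r"^an?\s+structure\s+(that|which)\s+is\s+part\s+of\s+[^.]*\.?$", re.IGNORECASE),
    re.compile(r"^an?\s+type\s+of\s+[^.]*\.?$", re.IGNORECASE),
]


def check_definition(text: str) -> list:
    """Return a list of problems; an empty list means the text passes both checks."""
    problems = []
    stripped = text.strip()
    n_words = len(stripped.split())
    if not MIN_WORDS <= n_words <= MAX_WORDS:
        problems.append(f"length is {n_words} words, expected {MIN_WORDS}-{MAX_WORDS}")
    if any(p.match(stripped) for p in _TRIVIAL_PATTERNS):
        problems.append("definition only says 'part of X' or 'a type of X'")
    return problems
```

A check like this catches only the mechanical failures; whether the genus and differentia are anatomically right still needs the lookups in the surrounding steps.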
+ +## Step 7: Resolve genus AND part_of for LEAF terms + +For each `term_type: "leaf"` term, look up how UBERON defines similar specific structures +to determine BOTH a genus (`is_a`) class AND a `part_of` containing structure. UBERON +convention typically populates both for specific named anatomical entities, e.g. +`vastus lateralis` has `is_a: UBERON:0001630 ! muscle organ` AND +`relationship: part_of UBERON:0001377 ! quadriceps femoris`. + +**Procedure:** + +1. Use awk over `src/ontology/uberon-edit.obo` to find similar specific UBERON terms. + Examples for muscle subdivisions: + ```bash + awk 'BEGIN{RS=""} /\nname: .*head of .*muscle/' src/ontology/uberon-edit.obo + awk 'BEGIN{RS=""} /\nname: .*part of .*muscle/' src/ontology/uberon-edit.obo + awk 'BEGIN{RS=""} /\nname: .*belly of/' src/ontology/uberon-edit.obo + awk 'BEGIN{RS=""} /\nid: UBERON:0001379\n/' src/ontology/uberon-edit.obo # vastus lateralis + ``` + +2. From similar terms, extract the genus pattern. Common UBERON genus classes for + muscle leaf terms: + - `UBERON:0001630` muscle organ: for whole named individual muscles (e.g. articularis + genu, longus capitis, vastus lateralis) + - `UBERON:0011906` muscle head: for named heads of muscles (clavicular head, long + head, short head) + - `UBERON:0014892` skeletal muscle organ, vertebrate: for skeletal muscles when a + more specific class is unavailable + - `UBERON:0001630` or domain-specific (e.g. `UBERON:0001135` smooth muscle organ) + for non-skeletal cases + +3. From similar terms, extract the part_of pattern. Common targets: + - For "X head/belly/part of Y muscle" → part_of the named parent muscle Y + - For named muscles in a region → part_of the region (e.g. neck, thigh, + anterior compartment) + - For named segmental muscles → part_of the relevant region (cervical vertebral + column, lumbar region, etc.) + +4. Emit a `leaf_template_rows[label]` entry with `{"is_a": "UBERON:...", "part_of": + "UBERON:..."}`. 
**Both columns should be populated when applicable.** + - Set `is_a` only (omit `part_of`) for classification subtypes that don't have a + containing structure (e.g. `dominant antral follicle is_a antral follicle` — no + additional part_of needed beyond what the genus class implies). + - Set `part_of` only (omit `is_a` or use a very generic genus) when the term is + purely a subdivision and no specific genus class is available. + +5. The legacy `resolved_relationships` + `resolved_parents` keys are still accepted as + a fallback but `leaf_template_rows` is preferred — it expresses both axes + simultaneously. + +**Optional fields in `leaf_template_rows` (Phase 6 + 7):** + +The default leaf template has an OPTIONAL `develops_from` column. The muscular-system +overlay also has `has_muscle_origin`, `has_muscle_insertion`, `innervated_by` columns. +Populate any of these in `leaf_template_rows[label]` when you have evidence: + +```json +"leaf_template_rows": { + "early antral follicle": { + "is_a": "UBERON:0000037", + "develops_from": "UBERON:0000036" + }, + "articularis genu muscle": { + "is_a": "UBERON:0001630", + "part_of": "UBERON:0000376", + "has_muscle_origin": "UBERON:0000981", + "has_muscle_insertion": "UBERON:0000976", + "innervated_by": "UBERON:0001267" + } +} +``` + +The merge step writes any of these to the corresponding column IF the column exists in +the current template variant. Unknown fields are silently dropped — you don't need to +know which template the row belongs to. Just emit whatever you can populate with +evidence. + +**Stage-series guidance for `develops_from`:** + +For terms in a developmental sequence (follicle stages, embryonic stages, hematopoietic +differentiation), look up the precursor stage via OLS4 / awk and emit `develops_from`. +Example: `early antral follicle` develops_from `secondary ovarian follicle` +(UBERON:0000036). 
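
The "write a field only if the column exists, silently drop the rest" behaviour described above can be pictured as a simple projection of each row onto the template's columns. This is a minimal sketch under stated assumptions, not the merge step's actual code: `project_row` and the two column sets are hypothetical names, and dropping empty values is an added assumption.

```python
# Assumed column sets for illustration; the real headers live in the templates.
DEFAULT_COLUMNS = {"is_a", "part_of", "develops_from"}
MUSCLE_OVERLAY_COLUMNS = DEFAULT_COLUMNS | {
    "has_muscle_origin",
    "has_muscle_insertion",
    "innervated_by",
}


def project_row(row: dict, template_columns: set) -> dict:
    """Keep only non-empty fields that have a matching column in the template."""
    return {key: value for key, value in row.items() if key in template_columns and value}


# Row copied from the articularis genu example in the text above.
ARTICULARIS_GENU = {
    "is_a": "UBERON:0001630",
    "part_of": "UBERON:0000376",
    "has_muscle_origin": "UBERON:0000981",
    "has_muscle_insertion": "UBERON:0000976",
    "innervated_by": "UBERON:0001267",
}

default_view = project_row(ARTICULARIS_GENU, DEFAULT_COLUMNS)
muscle_view = project_row(ARTICULARIS_GENU, MUSCLE_OVERLAY_COLUMNS)
```

Under this reading the researcher can emit the same row regardless of template variant, which is why the text says the agent does not need to know which template a row belongs to.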
+ +**Muscle-overlay guidance for `has_muscle_origin`/`has_muscle_insertion`/`innervated_by`:** + +For `system: "muscle"` terms (the per-group JSON contains a `system` field per term), +extract origin/insertion/innervation from Wikipedia + UBERON precedent. The bone or +nerve labels in Wikipedia text typically need OLS4 lookup to resolve to UBERON IDs +(e.g. "femur" → UBERON:0000981, "femoral nerve" → UBERON:0001267). If a UBERON ID +cannot be resolved (named bone landmark, specific nerve branch missing from UBERON), +omit that field rather than guess. + +**Worked examples:** + +- `clavicular head of pectoralis major muscle`: + - Look up similar: UBERON:0007168 (long head of biceps brachii), UBERON:0007169 (short + head of biceps brachii) → both use `is_a: UBERON:0011906 ! muscle head` and + `relationship: part_of `. + - Emit: `{"is_a": "UBERON:0011906", "part_of": "UBERON:0002381"}` + +- `articularis genu muscle`: + - Look up similar: vastus lateralis (UBERON:0001379) uses + `is_a: UBERON:0001630 ! muscle organ` + `part_of UBERON:0001377 ! quadriceps femoris`. + - For articularis genu, the analogous part_of would be the thigh region (or anterior + compartment of thigh if a UBERON term exists for it). Emit: + `{"is_a": "UBERON:0001630", "part_of": "UBERON:0004252"}` (or more specific). + +- `costal part of respiratory diaphragm muscle`: similar UBERON pattern is to use a + domain part as `part_of` plus a generic genus. Already a confirmed match in this + case (UBERON:0035831), so this term is excluded from the leaf template. + +- `dominant antral follicle` (a stage/subtype, no spatial part_of beyond the parent): + emit `{"is_a": "UBERON:0000035"}` only — omit `part_of`. + +**Important — DO NOT just take the supplied source parent and assign it to one column.** +Look at similar UBERON terms first; the source parent is often too broad (a grouping class) +to serve as the genus, and a more specific genus may be obvious (muscle head, muscle +organ, etc.). 
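
For readers who prefer Python to awk, the "look at similar UBERON terms first" lookup in Step 7 can be sketched as below. It mimics awk's paragraph mode (`RS=""`) by splitting on blank lines. The helper names are hypothetical, and `SAMPLE` is a toy two-stanza excerpt built from the vastus lateralis example in the text, not the real contents of `uberon-edit.obo`.

```python
import re


def iter_stanzas(obo_text: str):
    """Yield blank-line-separated stanzas, similar to awk's RS="" paragraph mode."""
    for block in re.split(r"\n\s*\n", obo_text.strip()):
        if block.strip():
            yield block


def genus_and_part_of(stanza: str) -> dict:
    """Collect the is_a and part_of targets declared in one stanza."""
    return {
        "is_a": re.findall(r"^is_a: (UBERON:\d+)", stanza, flags=re.MULTILINE),
        "part_of": re.findall(r"^relationship: part_of (UBERON:\d+)", stanza, flags=re.MULTILINE),
    }


def similar_terms(obo_text: str, name_pattern: str) -> list:
    """Return (id, name, relations) for stanzas whose name matches the regex."""
    name_re = re.compile(rf"^name: ({name_pattern})$", re.MULTILINE)
    id_re = re.compile(r"^id: (\S+)$", re.MULTILINE)
    hits = []
    for stanza in iter_stanzas(obo_text):
        name, term_id = name_re.search(stanza), id_re.search(stanza)
        if name and term_id:
            hits.append((term_id.group(1), name.group(1), genus_and_part_of(stanza)))
    return hits


# Toy excerpt for illustration only.
SAMPLE = """\
[Term]
id: UBERON:0001379
name: vastus lateralis
is_a: UBERON:0001630 ! muscle organ
relationship: part_of UBERON:0001377 ! quadriceps femoris

[Term]
id: UBERON:0001377
name: quadriceps femoris
"""

precedents = similar_terms(SAMPLE, r"vastus .*")
```

In practice the text would come from `src/ontology/uberon-edit.obo`, and the precedent's `is_a` / `part_of` pair is what suggests the genus and containing structure for the new leaf term.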
+ +## Step 8: Group term equivalent class — genus + part_of some Y (GROUP terms only) + +For each `term_type: "group"` term, find existing UBERON terms with similar names and +mirror their equivalent-class definition pattern. Stage 1 has already routed the term to +the groups template (with EC directives); your job is to populate `genus` and `location`. + +The supported pattern is **only** `genus and (part_of some Y)`. Anything more complex +gets punted to manual_curation. + +**Procedure:** + +1. Use `obo-grep.pl` (via Bash) — or `awk` if obo-grep is not in PATH — to find UBERON + terms with similar labels in `src/ontology/uberon-edit.obo`. Examples: + + ```bash + awk 'BEGIN{RS=""} /\nname: muscle of [a-z].*\n/' src/ontology/uberon-edit.obo + awk 'BEGIN{RS=""} /\nname: .*pelvic floor.*\n/' src/ontology/uberon-edit.obo + ``` + +2. Inspect the `intersection_of` lines of similar terms. The most common UBERON pattern + for muscle group terms is: + ``` + intersection_of: UBERON:0014892 ! skeletal muscle organ, vertebrate + intersection_of: part_of UBERON:NNNNNNN ! some region + ``` + Genus is typically `UBERON:0014892` (skeletal muscle organ, vertebrate); use + `UBERON:0001630` (muscle organ) only if a similar non-skeletal term uses it. + +3. Determine `Y` (the differentiator) from anatomical context. For "thoracic wall muscle", + Y = the UBERON term for "thoracic wall". Look it up via OLS4 or by name-grep over + uberon-edit.obo. + +4. **If at least one similar UBERON term uses the simple `genus + part_of some Y` pattern + AND that pattern fits this term**: emit a `group_template_rows[label]` entry with + `{"genus": "UBERON:NNNNNNN", "location": "UBERON:MMMMMMM"}`. + +5. **Otherwise — pattern unsupported**: emit a `manual_curation` entry. Reasons to punt: + - Similar UBERON terms use `innervated_by some Y` (function-defined groups like facial + muscle), not part_of. + - Similar UBERON terms use multiple intersection_of axioms (e.g. 
attaches_to_part_of + + innervated_by + part_of for intrinsic muscle of tongue). + - No clear genus class identifiable. + - The group is defined by something the simple pattern can't express (e.g. layer + within a hollow organ, has_part-defined collective). + + In the manual_curation entry, include: + - The proposed definition you wrote in Step 6 + - The reason this term doesn't fit the simple pattern + - 3–5 most similar UBERON terms found via obo-grep, with their full + `intersection_of` lines (so the curator can see the precedent) + - A suggestion for what equivalent class the curator should write + +## Output Format + +Save to: `bulk_ntr_workflow/outputs/definitions/{group_name}.json` + +```json +{ + "definitions": { + "term label": "Aristotelian definition string." + }, + "wikipedia_images": { + "term label": "https://upload.wikimedia.org/wikipedia/commons/..." + }, + "xrefs": { + "term label": "Wikipedia:Article_Title|FMA:NNNNN" + }, + "def_xrefs_to_add": { + "term label": "PMID:12345678|PMID:87654321" + }, + "leaf_template_rows": { + "leaf term label": { + "is_a": "UBERON:0011906", + "part_of": "UBERON:0002381", + "develops_from": "UBERON:0000036", + "has_muscle_origin": "UBERON:0001105", + "has_muscle_insertion": "UBERON:0000976", + "innervated_by": "UBERON:0003726" + } + }, + "resolved_relationships": { + "leaf term label": "is_a | part_of" + }, + "resolved_parents": { + "leaf term label": "UBERON:xxxxxxx" + }, + "group_template_rows": { + "group term label": { + "genus": "UBERON:0014892", + "location": "UBERON:0002047" + } + }, + "manual_curation": [ + { + "label": "muscle of facial expression", + "definition": "A group of muscles innervated by the facial nerve...", + "reason": "UBERON's similar 'facial muscle' (UBERON:0001577) uses innervated_by some facial nerve, not part_of. Out of simple part_of-only template scope.", + "similar_terms": [ + {"id": "UBERON:0001577", "label": "facial muscle", + "intersection_of": ["UBERON:0014892 ! 
skeletal muscle organ, vertebrate", "innervated_by UBERON:0001647 ! facial nerve"]} + ], + "suggestion": "Curator should add directly to uberon-edit.obo with the same innervated_by pattern." + } + ], + "confirmed_matches": [ + { + "label": "term label", + "uberon_id": "UBERON:xxxxxxx", + "confidence": "high", + "uberon_definition": "...", + "wikipedia_summary": "..." + } + ], + "possible_matches": [ + { + "label": "term label", + "uberon_id": "UBERON:xxxxxxx", + "confidence": "medium", + "note": "..." + } + ], + "out_of_scope": [ + { + "label": "term label", + "reason": "Describes a pathological/dysfunctional state (hemorrhagic follicle). UBERON covers normal anatomy only.", + "suggestion": "Consider MONDO or PATO-qualified term." + } + ], + "name_corrections": [ + { + "label": "dominance antral follicle", + "suggested": "dominant antral follicle", + "reason": "Standard anatomical term; 'dominance' is non-standard. Keep source name as synonym." + } + ], + "unresolvable": [ + { + "label": "term label", + "reason": "...", + "suggestion": "..." + } + ] +} +``` + +Omit empty lists/dicts. Do NOT include a `fma_resolutions` key — use `resolved_parents` instead. + +## Quality Checks Before Saving + +- Every definition must be content-rich (not just "part of X" or "a type of X"). +- Every confirmed match must have both a UBERON definition and Wikipedia/literature evidence. +- Every new term must have at least one real PMID/DOI in `def_xrefs_to_add` or in the existing + `def_xref` input field (ASCTB-TEMP placeholders do not count as real references). +- For LEAF terms: prefer emitting `leaf_template_rows[label]` with both `is_a` and + `part_of` populated. Look up similar UBERON terms via awk over uberon-edit.obo to + find the right genus class — do NOT just assign the source parent to one column. +- `leaf_template_rows[label].is_a` should be a genus class (e.g. UBERON:0001630 muscle + organ, UBERON:0011906 muscle head), not a regional grouping class. 
+- `leaf_template_rows[label].part_of` should be the containing structure (parent muscle, + body region, compartment). +- For backward compatibility, `resolved_relationships` (values `"is_a"` or `"part_of"`) + + `resolved_parents` may still be used; merge will fall back to these if + `leaf_template_rows` is absent. +- All UBERON ID values must be real UBERON IDs retrieved from OLS4 or uberon-edit.obo — + never guessed. +- Layers, zones, heads, bellies, parts of named structures → MUST have `part_of` + populated to the named parent structure. +- Pathological/dysfunctional terms → must appear in `out_of_scope`. +- Non-standard names → must appear in `name_corrections`. +- **For `term_type: "group"` terms**: every term must end up in EITHER + `group_template_rows` (with both `genus` and `location` populated as real UBERON IDs) + OR `manual_curation` (with proposed definition + similar UBERON terms). No group term + should be silently absent from both. +- `leaf_template_rows`, `resolved_relationships`, `resolved_parents` apply to LEAF terms + only — do not emit these keys for group terms. +- Do NOT invent UBERON IDs. + +## Tools Available + +- `ols4` MCP server — ontology term search and lookup +- `ontology-term-lookup` subagent — structured OLS4 search with quality assessment +- `fetch-wiki-info-api` skill — Wikidata + Wikipedia structured fetch via HTTP APIs. + Use for both the specific term article and (with the parent label) the parent article + passage extraction in Step 4. +- `artl-mcp` — fetch and verify literature (PMID, DOI) +- `awk` over `src/ontology/uberon-edit.obo` — find existing UBERON terms by name pattern + and inspect their `intersection_of` axioms (used for group term EC pattern detection). + `obo-grep.pl` is documented as in PATH but may be missing on some setups; awk is the + fallback (`awk 'BEGIN{RS=""} /\nname: PATTERN\n/' src/ontology/uberon-edit.obo`). 
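
Two of the quality checks above lend themselves to a mechanical pre-save check: every group term must land in `group_template_rows` or `manual_curation`, and template values must at least look like UBERON CURIEs. The sketch below is an illustration under assumptions; `check_group_coverage` is a hypothetical name, and the seven-digit pattern only checks the shape of an ID, so it cannot show that the ID is real (that still requires OLS4 or `uberon-edit.obo`).

```python
import re

# Shape check only: seven digits after the prefix. It does not verify existence.
UBERON_ID = re.compile(r"^UBERON:\d{7}$")


def check_group_coverage(group_labels: list, output: dict) -> list:
    """Return problems for group terms that are missing or have unfilled rows."""
    rows = output.get("group_template_rows", {})
    manual = {entry["label"] for entry in output.get("manual_curation", [])}
    problems = []
    for label in group_labels:
        if label not in rows and label not in manual:
            problems.append(f"{label}: absent from group_template_rows and manual_curation")
        if label in rows:
            for key in ("genus", "location"):
                value = rows[label].get(key, "")
                if not UBERON_ID.match(value):
                    problems.append(f"{label}: {key}={value!r} is not a UBERON CURIE")
    return problems
```

The labels passed in would be the `term_type: "group"` labels from the input file; an empty result means no group term is silently absent.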
diff --git a/.claude/agents/ontology-term-lookup.md b/.claude/agents/ontology-term-lookup.md new file mode 100644 index 000000000..8b9e9457f --- /dev/null +++ b/.claude/agents/ontology-term-lookup.md @@ -0,0 +1,102 @@ +--- +name: ontology-term-lookup +description: Use this agent when you need to find ontology terms by their textual labels or descriptions using the OLS4 MCP. This includes:\n\n\nContext: User is populating a DOSDP template and needs to find the correct ontology term for 'hepatic artery'.\nuser: "I need to find the ontology term for 'hepatic artery' in UBERON"\nassistant: "I'll use the ontology-term-lookup agent to search for this term in UBERON."\n\n\n\n\nContext: Agent is filling in missing ontology terms in a template and encounters text describing an anatomical structure.\nassistant: "I need to find the ontology term for 'renal vein' to complete this template entry. Let me use the ontology-term-lookup agent."\n\n\n\n\nContext: User provides alternative phrasings that need to be searched.\nuser: "Check if there's a term for either 'artery of kidney' or 'kidney artery'"\nassistant: "I'll use the ontology-term-lookup agent to search for both phrasings."\n\n\n +model: sonnet +--- + +You are an expert ontology term matcher specializing in using the OLS4 (Ontology Lookup Service 4) MCP to find precise ontology term matches for textual descriptions. + +Your core responsibility is to take textual input describing an anatomical or biological concept and find the best matching ontology term(s) from a specified ontology using the ols4-mcp tool. + +## Input Processing + +You will receive: +1. **text**: The term or phrase to look up (e.g., 'hepatic artery', 'blood vessel', 'artery of liver') +2. **ontology**: The target ontology to search within (e.g., 'UBERON', 'CL', 'GO') + +## Search Strategy + +Execute searches systematically: + +1. 
**Primary Search**: Search for the exact text as provided in the specified ontology using ols4-mcp, looking for matches in labels and synonyms. + +2. **Alternative Phrasing**: If no high-confidence match is found, automatically generate and search alternative phrasings: + - Convert "X artery" to "artery of X" and vice versa + - Try singular/plural variations + - Substitute common synonyms (e.g., 'vessel' for 'blood vessel', 'hepatic' for 'liver') + - Consider anatomical term variations (e.g., 'renal' for 'kidney', 'cardiac' for 'heart') + +3. **Iterative Refinement**: If initial searches yield poor results, progressively broaden or narrow the search terms based on the domain. + +## Match Quality Assessment + +Evaluate matches based on: +- **Exact label match**: Highest confidence +- **Exact synonym match**: High confidence +- **Partial label/synonym match**: Medium confidence (note the differences) +- **Related term**: Low confidence (clearly indicate this is not a direct match) + +## Output Format + +Return results in this structured format: + +**For single high-confidence match:** +``` +Best Match Found: +- Input Text: [original input] +- Matched Term: [term label] +- Ontology ID: [full IRI or CURIE] +- Match Type: [exact label | exact synonym | partial match] +- Definition: [term definition if available] +- Confidence: High +``` + +**For multiple high-confidence matches:** +``` +Multiple Matches Found (ranked by relevance): + +Input Text: [original input] + +1. [Match rank] + - Matched Term: [term label] + - Ontology ID: [full IRI or CURIE] + - Match Type: [exact label | exact synonym | partial match] + - Definition: [term definition if available] + - Confidence: High/Medium + - Reason for ranking: [brief explanation] + +2. 
[Match rank] + - Matched Term: [term label] + - Ontology ID: [full IRI or CURIE] + - Match Type: [exact label | exact synonym | partial match] + - Definition: [term definition if available] + - Confidence: High/Medium + - Reason for ranking: [brief explanation] + +[Continue for all relevant matches] +``` + +**For no matches:** +``` +No Match Found: +- Input Text: [original input] +- Ontology Searched: [ontology name] +- Alternative phrasings tried: [list attempted variations] +- Recommendation: [suggest manual review, broader ontology search, or term creation] +``` + +## Quality Control + +- Always verify that the matched term's definition aligns semantically with the input text +- Flag cases where the match seems questionable despite technical similarity +- When ranking multiple matches, prioritize based on: definition alignment > match type > term specificity +- Never return matches with low confidence without clearly labeling them as such +- If the ontology parameter seems inappropriate for the term type, note this in your response + +## Error Handling + +- If the ols4-mcp tool is unavailable, clearly state this and suggest alternative approaches +- If the specified ontology doesn't exist or is inaccessible, report this explicitly +- If the input text is ambiguous, note this and explain what additional context would help + +Remember: Precision is paramount. It's better to return no match or multiple candidates than to return a single incorrect high-confidence match. 
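
The alternative-phrasing step in the search strategy can be sketched as a small generator of query variants. This is a minimal illustration, assuming a three-entry adjective/noun table taken from the examples in the text; `phrasing_variants` is a hypothetical helper, it lowercases its input, and it does not attempt singular/plural handling.

```python
# Adjective -> noun pairs taken from the examples above; the table is illustrative only.
_ADJ_TO_NOUN = {"hepatic": "liver", "renal": "kidney", "cardiac": "heart"}
_NOUN_TO_ADJ = {noun: adj for adj, noun in _ADJ_TO_NOUN.items()}


def phrasing_variants(text: str) -> list:
    """Return alternative phrasings of a two-part term, excluding the input itself."""
    words = text.lower().split()
    if "of" in words:
        # "artery of kidney": head before "of", modifier after it.
        i = words.index("of")
        head, modifier = " ".join(words[:i]), " ".join(words[i + 1:])
    elif len(words) >= 2:
        # "kidney artery": last word is the head, the rest is the modifier.
        head, modifier = words[-1], " ".join(words[:-1])
    else:
        return []
    if not head or not modifier:
        return []
    noun = _ADJ_TO_NOUN.get(modifier, modifier)
    candidates = [f"{noun} {head}", f"{head} of {noun}"]
    adjective = _NOUN_TO_ADJ.get(noun)
    if adjective:
        candidates.append(f"{adjective} {head}")
    seen = {" ".join(words)}
    variants = []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants
```

Each variant would then be searched in the target ontology in turn, with the match-quality rules above deciding which hit, if any, is reported.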
diff --git a/.claude/skills/fetch-wiki-info-api/SKILL.md b/.claude/skills/fetch-wiki-info-api/SKILL.md new file mode 100644 index 000000000..e6819f46c --- /dev/null +++ b/.claude/skills/fetch-wiki-info-api/SKILL.md @@ -0,0 +1,75 @@ +--- +name: fetch-wiki-info-api +description: Fetch structured and descriptive information from Wikidata and Wikipedia via HTTP APIs (no browser, no Playwright) +argument-hint: "[search term] [--images]" +allowed-tools: Bash +--- + +# Fetch Wiki Info Skill (HTTP-API variant) + +Parallel implementation of `fetch-wiki-info` that hits Wikidata + Wikipedia public APIs directly instead of going through Playwright. Faster, no Chromium dependency, no 8-parallel cap. + +## Search Term + +Topic to search for: **$ARGUMENTS** + +## Instructions + +Run the bundled Python helper. It is stdlib-only — no `pip install`. + +```bash +python3 .claude/skills/fetch-wiki-info-api/fetch_wiki_info.py "$ARGUMENTS" +``` + +If the caller wants Wikipedia images + captions (e.g. for the `ntr-term-researcher` agent's image-xref step), pass `--images`: + +```bash +python3 .claude/skills/fetch-wiki-info-api/fetch_wiki_info.py "$ARGUMENTS" --images +``` + +For machine-readable output, add `--json`. + +## Workflow inside the script + +1. **Wikidata search** (`wbsearchentities`) — top 5 candidates. +2. **Wikidata entity fetch** (`Special:EntityData/{Q}.json`) for the top hit. Extracts label, description, aliases, P31/P361/P279, and the canonical English Wikipedia title via `sitelinks.enwiki.title` (avoids redirect guessing). +3. **Wikipedia summary** (`/api/rest_v1/page/summary/{title}`) — liberal relevance gate: rejects only disambiguation pages or empty extracts. +4. **Wikipedia full extract** (`action=query&prop=extracts&explaintext=1&redirects=1`) — full plain-text article body. +5. **Wikipedia media** (with `--images` only): `/api/rest_v1/page/media-list/{title}`, keeping only items whose caption shares a word with the query term. 
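
The caption filter in step 5 ("shares a word with the query term") can be sketched as a set intersection over lowercase word tokens. This is an assumption-laden illustration rather than the script's code: the item shape (`title`, `caption`, `src`), the function name, and the small stop-word list are all hypothetical, and the stop-word exclusion goes slightly beyond what the text states.

```python
import re

# Assumed stop words so that "of" or "the" alone does not count as a shared word.
_STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


def _content_words(text: str) -> set:
    """Lowercase alphanumeric tokens minus the assumed stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - _STOP_WORDS


def filter_media_by_caption(items: list, query: str) -> list:
    """Keep media items whose caption shares at least one content word with the query."""
    query_words = _content_words(query)
    return [item for item in items if query_words & _content_words(item.get("caption", ""))]
```

A word-overlap gate like this is deliberately liberal; the agent-side caption check described in the researcher's Step 4 remains the place to reject generic or unrelated images.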
+ +Set a polite `User-Agent` (already done in the script). + +## Output Format + +Markdown with the same overall shape as the Playwright skill, plus an optional **Wikipedia Full Text** section and an optional **Wikipedia Images** section: + +``` +# + +## Wikidata (Q#######) +- Label / Description / Aliases / Instance of / Subclass of / Part of / Wikipedia link + +## Wikipedia Summary () +<one-paragraph extract> + +## Wikipedia Full Text +<full plain-text article> + +## Wikipedia Images (only with --images) +- <file title> — <caption> + - src: <url> + +## Notes +- <relevance-gate reasons, if any> + +## Sources +- Wikidata: https://www.wikidata.org/wiki/Q####### +- Wikipedia: https://en.wikipedia.org/wiki/<page> +``` + +## Notes + +- Endpoints are anonymous; no auth required. +- This skill exists in parallel with `fetch-wiki-info` for A/B comparison. Once validated on a real Stage 3 NTR run, the Playwright version (and the 8-parallel cap in [bulk_ntr_workflow/CLAUDE.md](../../../bulk_ntr_workflow/CLAUDE.md)) can be retired. +- If Wikidata has no match, the script reports the empty candidate list and exits cleanly. +- Disambiguation pages (e.g. "head") are dropped via the relevance gate — try a more specific term. diff --git a/.claude/skills/fetch-wiki-info-api/VALIDATION.md b/.claude/skills/fetch-wiki-info-api/VALIDATION.md new file mode 100644 index 000000000..114b8a8d4 --- /dev/null +++ b/.claude/skills/fetch-wiki-info-api/VALIDATION.md @@ -0,0 +1,89 @@ +# Validation of the HTTP-API `fetch-wiki-info-api` skill + +This skill replaced the Playwright-based `fetch-wiki-info` skill. This note documents +the A/B validation run that justified the switch — keep alongside the skill so the +provenance lives with the code, not in a PR description that ages out. + +**Not auto-loaded into agent context** (only `SKILL.md` frontmatter is). Safe reference, +won't distract agents. 
+ +## Method + +Test set: every unique term label across the 45 group-input JSONs on the +`add-hra-muscular-ntr` branch (`bulk_ntr_workflow/outputs/definitions/input/*.json`). +75 unique terms after label-deduplication. + +For each term: +- Invoke the new skill (`fetch_wiki_info.py <label> --json`) +- Record: Wikidata Q-ID found? Wikipedia summary found? Full-text length? Latency? +- Compare against the `wikipedia_summary` field in the Playwright-skill-produced + output JSONs on the same branch (`bulk_ntr_workflow/outputs/definitions/*.json`). + +39 of the 75 terms had a Playwright-produced reference summary to compare against. + +Test harness: `/tmp/wiki-test/run_test.py` (not checked in — single-shot validation +script; recreate from this note + the branch fixtures if needed to re-run). + +## Headline results (parallel=6) + +| Metric | Result | +|---|---:| +| Successful runs | 75 / 75 | +| Got Wikidata Q-ID | 65 / 75 (87%) | +| Got Wikipedia summary | 72 / 75 (96%) | +| Got Wikipedia full-text | 72 / 75 (96%) | +| **Matches Playwright reference** | 38 / 39 | +| Failures (crashes) | 0 | +| Latency p50 / p95 | 1.77 s / 13.01 s | + +The single remaining miss (`pteryopharyngeal part of superior pharyngeal constrictor +muscle`) was a misspelling for which the Playwright-side **agent step 4.2** +(parent-article passage extraction) had carried the load — not the Playwright skill +itself. That step is orthogonal to this skill and works identically with the new +skill (call it on the parent label). + +## Issues found and fixed during validation + +1. **Rate-limit handling.** Wikimedia returned HTTP 429 once parallelism reached ~12. + Added exponential backoff + `Retry-After` honouring + up to 5 retries on 429/5xx + in `_request`. 0 crashes at parallel=6 afterwards. +2. **Wikidata `wbsearchentities` is strict.** Initial hit rate was 29% — many real + anatomy terms didn't match because Wikidata search insists on tight prefix + + word-order matches (e.g. 
`splenius capitus` typo, `respiratory diaphragm muscle` + → Wikipedia title is `Thoracic diaphragm`, `spermatic cord muscle` → + `Spermatic cord`). + Added two cascading fallbacks: + - Wikipedia `opensearch` (prefix match, handles typos) + - Wikipedia `list=search` (CirrusSearch full-text, catches redirects + alternate names) + When a fallback resolves a Wikipedia title, the skill reverse-looks-up the Q-ID + via `action=wbgetentities&sites=enwiki&titles=...` so the Wikidata block is still + populated. +3. **Captions weren't on `media-list`.** The REST `page/media-list/{title}` endpoint + does NOT include caption text despite docs suggesting otherwise. Switched to + parsing `<figure>+<figcaption>` blocks from `page/html/{title}` instead. +4. **macOS Homebrew Python SSL.** The default `urllib` SSL context on + Homebrew-Python doesn't trust system roots. Added a fallback that tries + `certifi`, then `$SSL_CERT_FILE`, then common Homebrew/OS bundle paths. + +## Operational guidance + +- **Safe parallelism**: tested clean at 6. Likely fine up to ~10 with the retry + logic, but observed p95 latency climbs from rate-limit retries past that. +- **Reverse lookup is cheap**: Wikipedia title → Q-ID via `wbgetentities` is one + extra HTTP call per fallback hit; ~+0.3 s. +- **3 remaining test misses** are all misspellings (`pteryopharyngeal`, + `compartmet`, `puboperineales`) — the curator should flag these as + `name_corrections` rather than relying on the wiki lookup. + +## How to re-validate + +1. Check out the `add-hra-muscular-ntr` branch (or any branch with finished + Stage 3 outputs). +2. Collect unique labels from `bulk_ntr_workflow/outputs/definitions/input/*.json`. +3. Run the skill helper (`fetch_wiki_info.py <label> --json`) on each, in parallel. +4. Compare `wikipedia.summary` field against the Playwright run's + `confirmed_matches[*].wikipedia_summary` in + `bulk_ntr_workflow/outputs/definitions/*.json`. 
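Step 2 of the re-validation list can be sketched as follows. This is not the original (unchecked-in) test harness; the function name `collect_unique_labels` is hypothetical, and it assumes each group-input JSON has a top-level `terms` list whose entries carry a `label` string.

```python
import json
from pathlib import Path


def collect_unique_labels(input_dir: str) -> list[str]:
    """Return the sorted, de-duplicated term labels across all group-input JSONs."""
    labels: set[str] = set()
    for path in sorted(Path(input_dir).glob("*.json")):
        group = json.loads(path.read_text(encoding="utf-8"))
        for term in group.get("terms", []):
            label = (term.get("label") or "").strip()
            if label:  # skip empty or whitespace-only labels
                labels.add(label)
    return sorted(labels)
```

Each returned label can then be passed to `fetch_wiki_info.py <label> --json` as described in step 3.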
+ +A skill regression should show as either a drop in the per-term hit rate or in the +Playwright-reference match count. diff --git a/.claude/skills/fetch-wiki-info-api/fetch_wiki_info.py b/.claude/skills/fetch-wiki-info-api/fetch_wiki_info.py new file mode 100644 index 000000000..5ddac5082 --- /dev/null +++ b/.claude/skills/fetch-wiki-info-api/fetch_wiki_info.py @@ -0,0 +1,450 @@ +#!/usr/bin/env python3 +"""Fetch Wikidata + Wikipedia info via public HTTP APIs (no browser). + +Stdlib-only. Replaces the Playwright-based fetch-wiki-info skill. +""" +from __future__ import annotations + +import argparse +import json +import random +import re +import ssl +import sys +import time +import urllib.parse +import urllib.request +from typing import Any + +USER_AGENT = "uberon-bulk-ntr/0.1 (https://github.com/obophenotype/uberon; dosumis@gmail.com)" +TIMEOUT = 15 + + +def _ssl_context() -> ssl.SSLContext: + """macOS/Homebrew Python doesn't always trust system roots; try certifi, then common bundle paths.""" + try: + import certifi # type: ignore + return ssl.create_default_context(cafile=certifi.where()) + except ImportError: + pass + import os + for path in ( + os.environ.get("SSL_CERT_FILE"), + "/etc/ssl/cert.pem", + "/opt/homebrew/etc/ca-certificates/cert.pem", + os.path.expanduser("~/.homebrew/etc/ca-certificates/cert.pem"), + "/usr/local/etc/ca-certificates/cert.pem", + ): + if path and os.path.exists(path): + return ssl.create_default_context(cafile=path) + return ssl.create_default_context() + + +_SSL_CTX = _ssl_context() + +DISAMBIG_MARKERS = ( + "may refer to:", + "may refer to several", + "is a disambiguation page", +) + + +def _request(url: str, accept_json: bool) -> str | None: + headers = {"User-Agent": USER_AGENT} + if accept_json: + headers["Accept"] = "application/json" + req = urllib.request.Request(url, headers=headers) + # Retry on 429 (rate limit) and 5xx with exponential backoff + jitter. 
+ delay = 0.5 + for attempt in range(5): + try: + with urllib.request.urlopen(req, timeout=TIMEOUT, context=_SSL_CTX) as resp: + return resp.read().decode("utf-8", errors="replace") + except urllib.error.HTTPError as e: + if e.code == 404: + return None + if e.code == 429 or 500 <= e.code < 600: + retry_after = e.headers.get("Retry-After") if e.headers else None + wait = float(retry_after) if (retry_after and retry_after.isdigit()) else delay + wait += random.uniform(0, 0.3) + time.sleep(wait) + delay = min(delay * 2, 8.0) + continue + print(f"WARN: HTTP {e.code} for {url}", file=sys.stderr) + return None + except Exception as e: + if attempt == 4: + print(f"WARN: request failed for {url}: {e}", file=sys.stderr) + return None + time.sleep(delay + random.uniform(0, 0.3)) + delay = min(delay * 2, 8.0) + return None + + +def _get_json(url: str) -> dict[str, Any] | None: + body = _request(url, accept_json=True) + if not body: + return None + try: + return json.loads(body) + except json.JSONDecodeError: + return None + + +def _get_json_array(url: str) -> list[Any] | None: + body = _request(url, accept_json=True) + if not body: + return None + try: + data = json.loads(body) + return data if isinstance(data, list) else None + except json.JSONDecodeError: + return None + + +def search_wikidata(term: str, limit: int = 5) -> list[dict[str, str]]: + url = ( + "https://www.wikidata.org/w/api.php" + f"?action=wbsearchentities&search={urllib.parse.quote(term)}" + f"&language=en&format=json&limit={limit}" + ) + data = _get_json(url) or {} + out = [] + for item in data.get("search", []): + out.append({ + "id": item.get("id", ""), + "label": item.get("label", ""), + "description": item.get("description", ""), + }) + return out + + +def fetch_wikidata_entity(qid: str) -> dict[str, Any]: + url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json" + data = _get_json(url) or {} + entity = data.get("entities", {}).get(qid, {}) + if not entity: + return {} + + def 
_label_only_claims(prop: str) -> list[str]: + ids = [] + for c in entity.get("claims", {}).get(prop, []): + try: + v = c["mainsnak"]["datavalue"]["value"] + if isinstance(v, dict) and "id" in v: + ids.append(v["id"]) + except (KeyError, TypeError): + continue + return ids + + result: dict[str, Any] = {"qid": qid} + labels = entity.get("labels", {}) + descriptions = entity.get("descriptions", {}) + aliases = entity.get("aliases", {}) + if labels.get("en"): + result["label"] = labels["en"]["value"] + if descriptions.get("en"): + result["description"] = descriptions["en"]["value"] + if aliases.get("en"): + result["aliases"] = [a["value"] for a in aliases["en"]] + + result["properties"] = { + "instance_of": _label_only_claims("P31"), + "part_of": _label_only_claims("P361"), + "subclass_of": _label_only_claims("P279"), + } + + enwiki = entity.get("sitelinks", {}).get("enwiki", {}) + if enwiki.get("title"): + result["wikipedia_title"] = enwiki["title"] + result["wikipedia_url"] = enwiki.get("url") or ( + "https://en.wikipedia.org/wiki/" + urllib.parse.quote(enwiki["title"].replace(" ", "_")) + ) + return result + + +def wikipedia_opensearch(term: str, limit: int = 5) -> list[str]: + """Prefix search — fast, handles typos and casing.""" + url = ( + "https://en.wikipedia.org/w/api.php" + f"?action=opensearch&search={urllib.parse.quote(term)}&limit={limit}&namespace=0&format=json" + ) + data = _get_json_array(url) or [] + if len(data) >= 2 and isinstance(data[1], list): + return data[1] + return [] + + +def wikipedia_fulltext_search(term: str, limit: int = 5) -> list[str]: + """Full-text (CirrusSearch) — catches redirects, alternate names, and content matches.""" + url = ( + "https://en.wikipedia.org/w/api.php" + f"?action=query&list=search&srsearch={urllib.parse.quote(term)}" + f"&srlimit={limit}&srnamespace=0&format=json" + ) + data = _get_json(url) or {} + hits = (data.get("query") or {}).get("search") or [] + return [h.get("title") for h in hits if h.get("title")] + + 
+def wikidata_qid_from_enwiki_title(title: str) -> dict[str, Any] | None: + """Reverse-look-up a Wikidata Q-ID + entity from a known Wikipedia title.""" + url = ( + "https://www.wikidata.org/w/api.php" + f"?action=wbgetentities&sites=enwiki&titles={urllib.parse.quote(title)}&format=json&languages=en" + ) + data = _get_json(url) or {} + entities = data.get("entities", {}) + for qid, ent in entities.items(): + if qid.startswith("Q") and not ent.get("missing"): + return {"qid": qid} + return None + + +def fetch_wikipedia_summary(title: str) -> dict[str, Any] | None: + enc = urllib.parse.quote(title.replace(" ", "_"), safe="") + url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{enc}" + return _get_json(url) + + +def fetch_wikipedia_extract(title: str) -> str: + url = ( + "https://en.wikipedia.org/w/api.php" + "?action=query&prop=extracts&explaintext=1&exsectionformat=plain&redirects=1" + f"&titles={urllib.parse.quote(title)}&format=json" + ) + data = _get_json(url) or {} + pages = data.get("query", {}).get("pages", {}) + for _, page in pages.items(): + if "extract" in page: + return page["extract"] + return "" + + +def _get_text(url: str) -> str: + return _request(url, accept_json=False) or "" + + +_TAG_RE = re.compile(r"<[^>]+>") +_FIGURE_RE = re.compile(r"<figure\b[^>]*>(.*?)</figure>", re.DOTALL | re.IGNORECASE) +_FIGCAPTION_RE = re.compile(r"<figcaption\b[^>]*>(.*?)</figcaption>", re.DOTALL | re.IGNORECASE) +_IMG_SRC_RE = re.compile(r'<img\b[^>]*\bsrc="([^"]+)"', re.IGNORECASE) + + +_STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style>", re.DOTALL | re.IGNORECASE) +_SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE) + + +def _strip_tags(html: str) -> str: + html = _STYLE_RE.sub("", html) + html = _SCRIPT_RE.sub("", html) + return re.sub(r"\s+", " ", _TAG_RE.sub("", html)).strip() + + +def fetch_wikipedia_media(title: str, term: str) -> list[dict[str, str]]: + """Parse <figure>+<figcaption> blocks out of the rendered page HTML. 
+ + The rest_v1/page/media-list endpoint does NOT include captions despite + documentation suggesting otherwise, so we go to the HTML. + """ + enc = urllib.parse.quote(title.replace(" ", "_"), safe="") + html = _get_text(f"https://en.wikipedia.org/api/rest_v1/page/html/{enc}") + if not html: + return [] + term_words = {w.lower() for w in re.findall(r"\w+", term) if len(w) > 3} + items: list[dict[str, str]] = [] + for fig in _FIGURE_RE.findall(html): + cap_m = _FIGCAPTION_RE.search(fig) + if not cap_m: + continue + cap = _strip_tags(cap_m.group(1)) + if not cap: + continue + cap_words = {w.lower() for w in re.findall(r"\w+", cap)} + relevant = bool(term_words & cap_words) if term_words else True + if not relevant: + continue + src_m = _IMG_SRC_RE.search(fig) + src = src_m.group(1) if src_m else "" + if src.startswith("//"): + src = "https:" + src + items.append({"title": "", "caption": cap, "src": src}) + return items + + +def is_relevant(summary: dict[str, Any] | None, wd_description: str) -> tuple[bool, str]: + """Liberal relevance gate. Drop only obvious non-matches.""" + if not summary: + return False, "no Wikipedia summary" + if summary.get("type") == "disambiguation": + return False, "disambiguation page" + extract = (summary.get("extract") or "").lower() + for marker in DISAMBIG_MARKERS: + if marker in extract: + return False, f"disambiguation-like extract ({marker!r})" + if not extract.strip(): + return False, "empty extract" + return True, "ok" + + +def _try_wikipedia(term: str, title: str, want_images: bool) -> dict[str, Any] | None: + """Fetch summary + full text + (optional) images for a Wikipedia title. + Returns {"_rejected": reason} if the page fails the relevance gate.
+ """ + summary = fetch_wikipedia_summary(title) + relevant, why = is_relevant(summary, "") + if not relevant: + return {"_rejected": why} + canonical_title = (summary.get("titles", {}) or {}).get("canonical") or title + wp: dict[str, Any] = { + "title": canonical_title, + "url": (summary.get("content_urls", {}).get("desktop", {}) or {}).get("page", ""), + "description": summary.get("description", ""), + "summary": summary.get("extract", ""), + } + extract = fetch_wikipedia_extract(canonical_title) + if extract: + wp["full_text"] = extract + if want_images: + wp["media"] = fetch_wikipedia_media(canonical_title, term) + return wp + + +def assemble(term: str, want_images: bool) -> dict[str, Any]: + out: dict[str, Any] = {"query": term, "wikidata_candidates": [], "wikidata": {}, "wikipedia": {}, "notes": []} + + # 1. Try Wikidata first. + candidates = search_wikidata(term) + out["wikidata_candidates"] = candidates + + wd: dict[str, Any] = {} + candidate_title: str | None = None + if candidates: + wd = fetch_wikidata_entity(candidates[0]["id"]) + candidate_title = wd.get("wikipedia_title") + + # 2. Try Wikipedia via the Wikidata-supplied title (if any). + wp_result = None + if candidate_title: + wp_result = _try_wikipedia(term, candidate_title, want_images) + if wp_result and "_rejected" in wp_result: + out["notes"].append(f"Wikidata-linked Wikipedia page rejected: {wp_result['_rejected']}") + wp_result = None + + # 3. Fallback: Wikipedia search. Try prefix (opensearch) then full-text (CirrusSearch). 
+ if not wp_result: + tried: set[str] = set() + if candidate_title: + tried.add(candidate_title) + candidate_titles: list[str] = [] + for t in wikipedia_opensearch(term): + if t not in tried: + candidate_titles.append(t); tried.add(t) + for t in wikipedia_fulltext_search(term): + if t not in tried: + candidate_titles.append(t); tried.add(t) + for title in candidate_titles: + wp_result = _try_wikipedia(term, title, want_images) + if wp_result and "_rejected" not in wp_result: + if not wd: + qid_info = wikidata_qid_from_enwiki_title(title) + if qid_info: + wd = fetch_wikidata_entity(qid_info["qid"]) + out["notes"].append(f"Wikidata Q-ID resolved via Wikipedia title '{title}'.") + out["notes"].append(f"Wikipedia match via fallback search: '{title}'.") + break + wp_result = None + + out["wikidata"] = wd + if wp_result: + out["wikipedia"] = wp_result + elif not candidates: + out["notes"].append("No Wikidata hits and no Wikipedia match.") + elif not wd.get("wikipedia_title"): + out["notes"].append("Wikidata entity has no enwiki sitelink; no Wikipedia match.") + + return out + + +def render_markdown(data: dict[str, Any]) -> str: + lines = [f"# {data['query']}", ""] + + wd = data.get("wikidata") or {} + if wd: + lines.append(f"## Wikidata ({wd.get('qid', '?')})") + if wd.get("label"): + lines.append(f"- **Label**: {wd['label']}") + if wd.get("description"): + lines.append(f"- **Description**: {wd['description']}") + if wd.get("aliases"): + lines.append(f"- **Aliases**: {', '.join(wd['aliases'])}") + props = wd.get("properties", {}) + for key, label in (("instance_of", "Instance of"), ("subclass_of", "Subclass of"), ("part_of", "Part of")): + if props.get(key): + lines.append(f"- **{label}**: {', '.join(props[key])}") + if wd.get("wikipedia_url"): + lines.append(f"- **Wikipedia link**: {wd['wikipedia_url']}") + lines.append("") + else: + cands = data.get("wikidata_candidates", []) + if cands: + lines.append("## Wikidata candidates") + for c in cands: + lines.append(f"- 
{c['id']}: {c['label']} — {c['description']}") + lines.append("") + + wp = data.get("wikipedia") or {} + if wp.get("summary"): + lines.append(f"## Wikipedia Summary ({wp.get('title', '?')})") + if wp.get("description"): + lines.append(f"_{wp['description']}_") + lines.append("") + lines.append(wp["summary"]) + lines.append("") + if wp.get("full_text"): + lines.append("## Wikipedia Full Text") + lines.append(wp["full_text"]) + lines.append("") + if wp.get("media"): + lines.append("## Wikipedia Images") + for m in wp["media"]: + prefix = f"**{m['title']}** — " if m.get("title") else "" + lines.append(f"- {prefix}{m['caption']}") + if m.get("src"): + lines.append(f" - src: {m['src']}") + lines.append("") + + if data.get("notes"): + lines.append("## Notes") + for n in data["notes"]: + lines.append(f"- {n}") + lines.append("") + + lines.append("## Sources") + if wd.get("qid"): + lines.append(f"- Wikidata: https://www.wikidata.org/wiki/{wd['qid']}") + if wp.get("url"): + lines.append(f"- Wikipedia: {wp['url']}") + elif wd.get("wikipedia_url"): + lines.append(f"- Wikipedia: {wd['wikipedia_url']}") + + return "\n".join(lines) + + +def main() -> int: + ap = argparse.ArgumentParser(description="Fetch Wikidata + Wikipedia info via HTTP APIs.") + ap.add_argument("term", help="Search term") + ap.add_argument("--images", action="store_true", help="Include Wikipedia images + captions") + ap.add_argument("--json", action="store_true", help="Emit JSON instead of markdown") + args = ap.parse_args() + + data = assemble(args.term, want_images=args.images) + if args.json: + print(json.dumps(data, indent=2, ensure_ascii=False)) + else: + print(render_markdown(data)) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.mcp.json b/.mcp.json new file mode 100644 index 000000000..dcb096f50 --- /dev/null +++ b/.mcp.json @@ -0,0 +1,21 @@ +{ + "mcpServers": { + "artl-mcp": { + "command": "uvx", + "args": ["artl-mcp"], + "tools": ["*"] + }, + "ols4": { + "type": "http", + 
"url": "https://www.ebi.ac.uk/ols4/api/mcp", + "tools": ["*"] + }, + "playwright": { + "command": "npx", + "args": [ + "@playwright/mcp@latest" + ], + "tools": ["*"] + } + } +} diff --git a/bulk_ntr_workflow/CLAUDE.md b/bulk_ntr_workflow/CLAUDE.md new file mode 100644 index 000000000..c8226619e --- /dev/null +++ b/bulk_ntr_workflow/CLAUDE.md @@ -0,0 +1,279 @@ +# NTR Workflow: UBERON New Term Request ROBOT Template Generator + +Generates a ROBOT template TSV for new UBERON term requests from HRA ASCTB unmapped terms, +together with error and candidate-match reports. + +## Overview + +``` +source_data/input.xlsx (or .csv) + | + v +[Stage 1: generate_template.py] + → outputs/template_initial.tsv (ROBOT template, placeholder definitions) + → outputs/errors.md (bad/missing parent IDs) + → outputs/candidates.md (terms already in UBERON) + | + v +[Stage 2: group_terms_by_parent.py] + → outputs/definitions/input/*.json (one JSON per parent group) + | + v +[Stage 3: ntr-term-researcher subagents] ← up to 8 in parallel + → outputs/definitions/*.json (definitions, matches, resolved relationships) + | + v +[Stage 4: merge_definitions.py] + → outputs/template_final.tsv (ready for review) + + appends to outputs/candidates.md (confirmed/possible OLS4 matches) +``` + +## Input Format + +Place the source file in `source_data/`. The workflow reads: +- **xlsx**: the `as-temp terms` sheet from `hra_unmapped-asct-term-list-with-refs.xlsx` +- **csv**: any CSV with the same columns: `tables`, `as`, `as_label`, `UBERON ID`, `parents_as`, `parents_as_label`, `references` + +Use `--table muscular-system` (or another table name) to filter to one anatomical system. + +## Stage 1: Generate Initial Template + +The source file `hra_unmapped-asct-term-list-with-refs.xlsx` lives at the repo root. 
+ +```bash +cd bulk_ntr_workflow +uv run --with openpyxl scripts/generate_template.py \ + --input ../hra_unmapped-asct-term-list-with-refs.xlsx \ + --table muscular-system \ + --name hra-muscular \ + --start-id 9900001 +``` + +**Parent ID handling:** +- `UBERON:xxxxxxx` → accepted; relationship type marked INFER for subagent resolution +- FMA IRI (e.g. `http://purl.org/sig/ont/fma/fmaXXXX`) → flagged in errors.md; subagent maps to UBERON via OLS4 +- `ASCTB-TEMP` IRI → flagged as error (parent not yet in UBERON; needs human follow-up) +- Terms with `UBERON ID` already populated in the input → logged in candidates.md, excluded from template + +**ID assignment:** `UBERON:99` + 5-digit counter (e.g. `UBERON:9900001`). Adjust `--start-id` to avoid +collisions with other NTR batches. + +**Term-type pre-classification (Phase 2):** Stage 1 also classifies each term as `leaf` or +`group` using linguistic regex rules: +- **Leaf terms** (specific named structures, subdivisions of named muscles) → routed to + `<name>.template.tsv` with `SC %` and `SC BFO:0000050 some %` directives (asserted + is_a/part_of). The agent picks one in Stage 3. +- **Group terms** (collective classes — "muscle of X", "X muscle group", regional + collectives) → routed to `<name>-groups.template.tsv` with `EC %` (genus) and + `EC BFO:0000050 some %` (location) directives. The agent fills genus + location in + Stage 3 by inspecting how UBERON defines similar terms. + +`input.tsv` carries the `term_type` column so curators can review the classification. + +## Stage 2: Group by Parent + +```bash +uv run scripts/group_terms_by_parent.py +``` + +Outputs one JSON per parent group to `outputs/definitions/input/`. Check the files — each group +should contain 1–20 terms for efficient subagent processing. If a group is very large (>20 terms), +consider splitting manually. 
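The group-size check described for Stage 2 could be automated along these lines. This is a sketch, not part of `group_terms_by_parent.py`: the function name `flag_group_sizes` is hypothetical, and it assumes each parsed group dict has `group_name` and `terms` keys as in the group-input JSON format.

```python
from typing import Any


def flag_group_sizes(groups: list[dict[str, Any]], max_terms: int = 20) -> dict[str, int]:
    """Return {group_name: term_count} for groups that are empty or exceed max_terms."""
    flagged: dict[str, int] = {}
    for group in groups:
        n_terms = len(group.get("terms", []))
        if n_terms == 0 or n_terms > max_terms:
            flagged[group.get("group_name", "?")] = n_terms
    return flagged
```

An empty result suggests every group is within the 1–20 range; anything flagged is a candidate for manual splitting.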
+ +## Stage 3: Definition Writing and OLS4 Matching + +Launch one `ntr-term-researcher` subagent per group JSON, up to 8 in parallel: + +``` +For each file in outputs/definitions/input/*.json: + → Launch Agent(subagent_type="ntr-term-researcher", + prompt="Process group: bulk_ntr_workflow/outputs/definitions/input/{filename}") +``` + +Each subagent: +1. Resolves parent (UBERON confirm / FMA→UBERON / ASCTB-TEMP→UBERON) AND searches for a more + specific parent than the source provided +2. Searches OLS4 for existing UBERON matches per term +3. Flags pathological/dysfunctional terms as out-of-scope (UBERON is normal anatomy) +4. Flags non-standard term names with corrections +5. Fetches Wikipedia (specific term → parent → WebSearch); checks image caption for relevance +6. Searches PubMed for a real PMID/DOI to add to `def_xref` +7. Writes Aristotelian (leaf) or collective (group) definitions +8. **For LEAF terms**: resolves relationship type using the structural-vs-classification + rule (layers/heads/parts = `part_of`; subtypes/stages = `is_a`) +9. **For GROUP terms**: uses awk over `src/ontology/uberon-edit.obo` to find similar UBERON + group terms; if they use `genus + part_of some Y` pattern, populates `group_template_rows` + with `{genus, location}`; otherwise punts to `manual_curation` +10. Saves `outputs/definitions/{group_name}.json` with keys: definitions, wikipedia_images, + xrefs, def_xrefs_to_add, resolved_relationships, resolved_parents, group_template_rows, + confirmed_matches, possible_matches, out_of_scope, name_corrections, manual_curation, + unresolvable + +**Do not launch more than 8 subagents in parallel** (Wikimedia per-IP rate limits; +the `fetch-wiki-info-api` skill retries on 429 with exponential backoff but throughput +degrades sharply if every worker is being throttled). 6–8 is the validated sweet spot. 
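To see why throttled workers hurt throughput, the base retry waits can be worked out from the values used in the helper's `_request` function (start at 0.5 s, double each time, cap at 8 s, five attempts). The sketch below is illustrative only: the name `backoff_schedule` is hypothetical, and it ignores both the random jitter of up to 0.3 s per wait and any `Retry-After` header that overrides the computed delay.

```python
def backoff_schedule(attempts: int = 5, start: float = 0.5, cap: float = 8.0) -> list[float]:
    """Base wait in seconds (before jitter) slept after each failed attempt."""
    waits: list[float] = []
    delay = start
    for _ in range(attempts):
        waits.append(delay)
        delay = min(delay * 2, cap)  # exponential growth, capped
    return waits


schedule = backoff_schedule()
worst_case_sleep = sum(schedule)  # total sleep if every attempt is throttled
```

With the defaults the waits are 0.5, 1, 2, 4 and 8 seconds, so a request that is rate-limited on every attempt spends about 15.5 s sleeping before it gives up.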
+ +## Stage 4: Merge + +```bash +uv run scripts/merge_definitions.py --name hra-muscular +``` + +Merges definitions, images, and relationships from subagent JSONs into `template_final.tsv`. +Outputs a summary of remaining issues. + +## QC Checklist Before Finalising + +**Both templates:** +1. No `[PENDING]` definitions remain +2. Every term has a real PMID/DOI (or ISBN) in `def_xref` — ASCTB-TEMP placeholder IRIs + do not count as references + +**Leaf template (`<name>.template.tsv`):** +3. No `INFER` / `NEEDS_MAPPING` / `UNRESOLVABLE` values in `is_a` or `part_of` columns +4. Layers / zones / regions / heads / bellies / parts of named structures are in + `part_of`, never `is_a` + +**Groups template (`<name>-groups.template.tsv`):** +5. Every row has both `genus` and `location` populated with real UBERON IDs (the merge + script flags incomplete rows as "EC incomplete" — those need agent re-run or manual + curator addition) +6. The `genus` column uses a sensible class — typically `UBERON:0014892` (skeletal + muscle organ, vertebrate) for muscle group terms + +**Reports:** +7. Row counts: input − confirmed_match − out_of_scope − manual_curation = leaf + group +8. Spot-check 5–10 definitions for anatomical accuracy +9. Review `<name>-reports/candidates.tsv` — `confirmed_match` auto-excluded; + `possible_match` rows need curator decision +10. Review `<name>-reports/out_of_scope.tsv` — pathological/dysfunctional terms; + curator decides: drop, reroute to MONDO, keep with PATO qualifier +11. Review `<name>-reports/name_corrections.tsv` — confirm and decide whether source + name should be added as a synonym +12. Review `<name>-reports/manual_curation.tsv` — group terms that don't fit the + simple `part_of` pattern; curator adds these directly to `uberon-edit.obo`, + using the similar UBERON terms listed for guidance +13. 
Review `<name>-reports/errors.tsv` — input rows with bad/missing parents + +## Stage 5: Register Templates with ODK + +After QC, register the generated templates with ODK and regenerate the Makefile: + +```bash +uv run scripts/register_templates.py --name hra-muscular +``` + +The script discovers every `src/templates/<name>*.template.tsv` (default +leaf, `-groups`, and any per-system overlays like `-muscle`), appends a +component entry per template to `components.products:` in +`src/ontology/uberon-odk.yaml`, then runs `sh run.sh make update_repo` +to regenerate the Makefile. Idempotent — already-registered templates +are skipped. + +Component filenames map dashes → underscores (e.g. +`hra-muscular-groups.template.tsv` → `components/hra_muscular_groups.owl`), +matching existing precedent (`hra-skeleton.template.tsv` → +`components/hra_skeleton.owl`). + +Each generated entry has the form: +```yaml + - filename: hra_muscular.owl + use_template: true + templates: + - hra-muscular.template.tsv +``` + +Use `--skip-update-repo` to edit the yaml without invoking the (slow, +Docker-based) Makefile regeneration step. 
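The dash-to-underscore filename mapping described above can be illustrated with a small function. This is a sketch of the naming rule only, not the code in `register_templates.py`; the function name `component_path` is hypothetical.

```python
TEMPLATE_SUFFIX = ".template.tsv"


def component_path(template_filename: str) -> str:
    """Map a template filename to its ODK component path (dashes become underscores)."""
    if not template_filename.endswith(TEMPLATE_SUFFIX):
        raise ValueError(f"not a template file: {template_filename!r}")
    stem = template_filename[: -len(TEMPLATE_SUFFIX)]
    return f"components/{stem.replace('-', '_')}.owl"
```

For example, `component_path("hra-muscular-groups.template.tsv")` gives `components/hra_muscular_groups.owl`, matching the precedent quoted in this section.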
+ +## Output Files Reference + +| File | Description | +|---|---| +| `bulk_ntr_workflow/outputs/template_initial.tsv` | Leaf working copy (SC directives) | +| `bulk_ntr_workflow/outputs/template_groups_initial.tsv` | Groups working copy (EC directives) | +| `src/templates/<name>.template.tsv` | Final leaf template; updated in-place by Stage 4 | +| `src/templates/<name>-groups.template.tsv` | Final groups template (equivalent class definitions) | +| `src/templates/<name>-reports/input.tsv` | Filtered input rows + `term_type` classification | +| `src/templates/<name>-reports/errors.tsv` | Input errors (bad/FMA/ASCTB-TEMP parents) | +| `src/templates/<name>-reports/candidates.tsv` | Pre-mapped + OLS4-confirmed existing terms | +| `src/templates/<name>-reports/out_of_scope.tsv` | Pathological/dysfunctional terms | +| `src/templates/<name>-reports/name_corrections.tsv` | Source-label → corrected-label rewrites | +| `src/templates/<name>-reports/manual_curation.tsv` | Group terms not fitting simple `part_of` pattern | +| `bulk_ntr_workflow/outputs/definitions/input/*.json` | Per-group input for subagents | +| `bulk_ntr_workflow/outputs/definitions/*.json` | Per-group subagent output | + +## ROBOT Template Column Reference + +### Leaf template (`<name>.template.tsv`) — asserted SC + +| Header | ROBOT directive | Notes | +|---|---|---| +| ID | ID | `http://purl.obolibrary.org/obo/UBERON_99xxxxx` | +| LABEL | LABEL | Term label | +| Definition | A IAO:0000115 | Aristotelian definition | +| def_xref | >A oboInOwl:hasDbXref SPLIT=\| | References + ASCTB-TEMP IRI | +| is_a | SC % | Genus class (structural type or classification parent) | +| part_of | SC BFO:0000050 some % | Containing structure | +| develops_from | SC RO:0002202 some % | Optional. 
Developmental precursor (stage series) | +| In_subset | AI oboInOwl:inSubset | `added_by_HRA` subset IRI | +| Date | AT dcterms:date^^xsd:dateTime | ISO timestamp | +| Contributor | AI dcterms:contributor | ORCID IRI | +| Present_in_taxon | AI RO:0002175 | NCBITaxon IRI | +| Wikipedia_image | A foaf:depiction | Wikipedia image URL | +| xref | A oboInOwl:hasDbXref SPLIT=\| | Direct term xrefs: Wikipedia article + FMA ID | + +### Muscle leaf template (`<name>-muscle.template.tsv`) — Phase 7 overlay + +Used automatically when the source `tables` value is `muscular-system`. Adds three +columns between `develops_from` and `In_subset`: + +| Header | ROBOT directive | Notes | +|---|---|---| +| has_muscle_origin | SC RO:0002372 some % | Bone/structure the muscle arises from | +| has_muscle_insertion | SC RO:0002373 some % | Bone/structure the muscle inserts onto | +| innervated_by | SC RO:0002005 some % | Motor nerve | + +All three are OPTIONAL — empty cell ⇒ no axiom. Populate only when Wikipedia + +UBERON precedent provide a resolvable UBERON ID for the related entity. + +### Template variants and partitioning + +Stage 1 partitions input rows by source `tables` column → system overlay map: + +| Source table value | Overlay | Output template | +|---|---|---| +| `muscular-system` | `muscle` | `<name>-muscle.template.tsv` | +| (anything else) | `default` | `<name>.template.tsv` | + +A single Stage 1 run can produce multiple leaf templates if the input has rows from +mixed tables (each system gets its own clean template — no muscle-specific empty +columns appear in non-muscle templates). The routing decision is printed at the start +of Stage 1 as `Step 0 routing: muscle=N, default=M, group=K`. 
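The routing described in this subsection might look roughly like the sketch below. It is not the Stage 1 implementation: the names `route` and `routing_summary` are hypothetical, and it assumes that group terms are counted separately before the system overlay is applied, which the `Step 0 routing` message suggests but the text does not state outright.

```python
from collections import Counter

# Source `tables` value -> overlay name; anything not listed falls back to "default".
SYSTEM_OVERLAYS = {"muscular-system": "muscle"}


def route(row: dict[str, str]) -> str:
    """Return 'group', 'muscle' or 'default' for one input row."""
    if row.get("term_type") == "group":  # assumption: group terms bypass overlays
        return "group"
    return SYSTEM_OVERLAYS.get(row.get("tables", ""), "default")


def routing_summary(rows: list[dict[str, str]]) -> str:
    """Format the per-template counts in the style of the Stage 1 routing line."""
    counts = Counter(route(row) for row in rows)
    return (
        f"Step 0 routing: muscle={counts['muscle']}, "
        f"default={counts['default']}, group={counts['group']}"
    )
```

Adding another system overlay would then amount to one more entry in `SYSTEM_OVERLAYS`.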
+ +### Groups template (`<name>-groups.template.tsv`) — equivalent class + +Same as leaf, with `is_a` / `part_of` replaced by `genus` / `location`: + +| Header | ROBOT directive | Notes | +|---|---|---| +| genus | EC % | Genus class for the equivalent definition (typically `UBERON:0014892`) | +| location | EC BFO:0000050 some % | Differentiator class — what the group is `part_of` | + +The two columns together generate +`EquivalentClass(this_term, genus and (part_of some location))`. + +## Tools Available + +MCP servers (configured in repo-root `.mcp.json`): +- `ols4` — OLS4 ontology search (UBERON, FMA, etc.) +- `artl-mcp` — literature lookup (PMID, DOI) + +Skills (`.claude/skills/`): +- `fetch-wiki-info-api` — Wikidata + Wikipedia structured fetch via HTTP APIs + (covers both specific-term articles and, when called with a parent label, + parent-article passage extraction). See its `VALIDATION.md` for A/B results. + +Agents (`.claude/agents/`): +- `ntr-term-researcher` — Stage 3 subagent (this workflow) +- `ontology-term-lookup` — structured OLS4 term lookup diff --git a/bulk_ntr_workflow/README.md b/bulk_ntr_workflow/README.md new file mode 100644 index 000000000..19f4e22a6 --- /dev/null +++ b/bulk_ntr_workflow/README.md @@ -0,0 +1,35 @@ +## Quick start + +This workflow is run interactively by a curator from a Claude Code session. + +1. `cd bulk_ntr_workflow` +2. Start Claude Code in this directory: `claude` +3. Drop the source spreadsheet into `source_data/` (or point at the repo-root copy of `hra_unmapped-asct-term-list-with-refs.xlsx`) +4. Ask Claude to run the workflow — e.g. *"Run the NTR workflow for the muscular-system table, name hra-muscular, starting at UBERON:9900001"* + +Claude picks up `CLAUDE.md` in this folder, which is the authoritative spec for the pipeline (stages 1–5, input format, QC checklist, ROBOT template columns, tools/agents/skills). Start there if you want to understand or modify what runs. 
+ +## What the workflow does + +Generates ROBOT template TSVs for UBERON new term requests from HRA ASCTB unmapped terms: + +- **Stage 1** — `generate_template.py` builds initial TSV + error/candidate reports +- **Stage 2** — `group_terms_by_parent.py` splits terms into per-parent JSON groups +- **Stage 3** — up to 8 `ntr-term-researcher` subagents in parallel write definitions, resolve parents/relationships, and find OLS4 matches +- **Stage 4** — `merge_definitions.py` produces the final template +- **Stage 5** — `register_templates.py` registers templates with ODK + +See [CLAUDE.md](CLAUDE.md) for full details and [ROADMAP.md](ROADMAP.md) for planned work. + +## Layout + +``` +bulk_ntr_workflow/ +├── CLAUDE.md # workflow spec (read by Claude on session start) +├── ROADMAP.md # planned work +├── scripts/ # stage 1/2/4/5 Python scripts +├── source_data/ # drop input xlsx/csv here +└── outputs/ # generated TSVs, reports, per-group JSONs +``` + +Final templates land in `src/templates/<name>*.template.tsv` and reports in `src/templates/<name>-reports/`. diff --git a/bulk_ntr_workflow/ROADMAP.md b/bulk_ntr_workflow/ROADMAP.md new file mode 100644 index 000000000..00a39267f --- /dev/null +++ b/bulk_ntr_workflow/ROADMAP.md @@ -0,0 +1,403 @@ +# Bulk NTR Workflow — Development Roadmap + +## Current state (Phase 1 complete) + +Four-stage pipeline operational: +1. `generate_template.py` — reads source xlsx/csv, classifies parent IDs, assigns UBERON:99xxxxx IDs +2. `group_terms_by_parent.py` — groups by parent for parallel subagent processing +3. `ntr-term-researcher` subagent — OLS4 matching, Wikipedia lookup, Aristotelian definitions, relationship resolution +4. `merge_definitions.py` — merges subagent outputs into final ROBOT template TSV + +Tested on first 10 muscular-system terms: 4 confirmed UBERON matches identified; 6 new terms with +complete definitions and resolved relationship types. 
+ +--- + +## Phase 2: Grouping terms — equivalent-class definitions + +**Status:** Step 1 (investigation), Steps 2–6 (implementation) complete. +End-to-end test on muscular-system pending — Stage 1 smoke-tested (20 group / 55 leaf +on 75 input terms); Stage 4 smoke-tested (correctly flags 20 group terms as +"EC incomplete" pre-agent). Agent run not yet performed. + +### Step 1 findings (empirical survey of UBERON's existing "muscle of X" group terms) + +19 existing UBERON terms named `muscle of X` were inspected via awk over +`src/ontology/uberon-edit.obo`: + +| Pattern | Count | Examples | +|---|---:|---| +| `genus and (part_of some Y)` | **14 (74%)** | muscle of neck (UBERON:0002377), muscle of back (UBERON:0002324), muscle of abdomen (UBERON:0002378), muscle of pelvis, muscle of leg, muscle of arm, muscle of larynx, muscle of iris, muscle of pectoral girdle, muscle of digastric group, muscle of pelvic girdle, muscle of pes, muscle of manus, muscle of anal triangle | +| `genus and (attaches_to_part_of some Y)` | 3 (16%) | muscle of shoulder, muscle of vertebral column, muscle of auditory ossicle | +| No `intersection_of` (no logical definition) | 2 (10%) | muscle of pelvic diaphragm, muscle of posterior compartment of hindlimb stylopod | + +**Genus consistency:** +- 16/19 use `UBERON:0014892` (skeletal muscle organ, vertebrate) +- 3/19 use `UBERON:0001630` (muscle organ — broader; used for iris/auditory ossicle/anal triangle muscles, i.e. non-skeletal or finer granularity) + +Additional spot-checks of neighbouring group classes: +- `intrinsic muscle of tongue` (UBERON:0001576): 4 intersection_of axioms + (`genus + attaches_to_part_of + innervated_by + part_of`) — multi-axiom, complex. +- `extrinsic muscle of tongue` (UBERON:0001575): 4 intersection_of axioms (no `part_of`, + has `attaches_to`). +- `facial muscle` (UBERON:0001577): `genus + innervated_by some facial nerve` only — + defined by innervation rather than location. 
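
The survey above was done with awk. A rough Python equivalent of the tally might look like the sketch below. It assumes the usual OBO layout (`[Term]` stanzas separated by blank lines, one `intersection_of:` tag per line) and classifies each stanza by the relations used in its differentiae. `classify_stanza` and `survey` are hypothetical names, and the sample stanzas use made-up `EX:` IDs rather than real entries from `uberon-edit.obo`.

```python
from collections import Counter


def classify_stanza(stanza: str) -> str:
    """Return the logical-definition pattern of one [Term] stanza."""
    relations = []
    for line in stanza.splitlines():
        if not line.startswith("intersection_of:"):
            continue
        # Drop the trailing "! comment", then drop the tag itself.
        tokens = line.split("!", 1)[0].split()[1:]
        if len(tokens) >= 2:
            # "<relation> <filler>" is a differentia; record the relation.
            relations.append(tokens[0])
        # A single token is the genus; nothing to record.
    if not relations:
        return "no intersection_of"
    return "genus + " + " + ".join(sorted(relations))


def survey(obo_text: str, name_prefix: str = "name: muscle of ") -> Counter:
    """Tally patterns for stanzas whose name line starts with name_prefix."""
    counts = Counter()
    for stanza in obo_text.split("\n\n"):
        if any(line.startswith(name_prefix) for line in stanza.splitlines()):
            counts[classify_stanza(stanza)] += 1
    return counts


# Illustrative stanzas only; the EX: identifiers are invented.
SAMPLE = """\
[Term]
id: EX:0000001
name: muscle of region A
intersection_of: UBERON:0014892 ! skeletal muscle organ, vertebrate
intersection_of: part_of EX:0000010 ! region A

[Term]
id: EX:0000002
name: muscle of region B
intersection_of: UBERON:0014892 ! skeletal muscle organ, vertebrate
intersection_of: attaches_to_part_of EX:0000011 ! region B

[Term]
id: EX:0000003
name: muscle of region C
is_a: UBERON:0014892 ! skeletal muscle organ, vertebrate
"""

result = survey(SAMPLE)
for pattern, n in sorted(result.items()):
    print(f"{pattern}\t{n}")
```

Run against the real ontology file, the same function would take the text of `src/ontology/uberon-edit.obo` as its input.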
+ +**Decision: proceed with simple `genus + part_of some Y` pattern.** 74% coverage of +existing UBERON convention is sufficient. The genus is `UBERON:0014892` for the muscular +system; the agent will identify it from similar terms via obo-grep rather than hardcode. + +### Future patterns (deferred — not in current Phase 2 scope) + +Once the simple `part_of`-only template is proven, additional ROBOT templates can be +added for: +- `genus and (attaches_to_part_of some Y)` — covers ~16% of muscle group terms + (muscle of shoulder, vertebral column, auditory ossicle) +- `genus and (innervated_by some Y)` — function/innervation-defined groups + (facial muscle, possibly muscle of facial expression in our request set) +- Multi-axiom group definitions (intrinsic/extrinsic muscle of tongue style) — low + frequency; manual curation is probably appropriate even in the long term. + +For now these go to `<name>-reports/manual_curation.tsv` for direct curator addition. + +### Original Phase 2 problem statement (preserved for context) + +### The problem + +The current workflow treats all requested terms as leaf-node anatomical entities (specific named +muscles) and writes Aristotelian definitions and `part_of`/`is_a` placements accordingly. Many +requested terms are however **grouping terms** — collective classes that group several individual +muscles by region, layer, function, or compartment. These require different: + +- **Definition form**: "A muscle group comprising..." or "A collection of muscles that..." rather + than a single-entity definition with specific attachments +- **Relationship type**: always `is_a` to a broader group (never `part_of` a single muscle) +- **OLS4 search strategy**: search for collective/group terms, not individual muscle names +- **Wikipedia strategy**: group articles are often titled "muscles of the X" rather than a single + named muscle article + +### Assessment: can latent knowledge distinguish grouping from leaf? 
+ +Reviewed all 75 muscular-system terms. Latent knowledge is **sufficient for the large majority**. +Key linguistic cues: + +| Cue | Term type | Examples | +|---|---|---| +| `X part of Y muscle` | leaf — subdivision of named muscle | `clavicular head of pectoralis major muscle`, `costal part of respiratory diaphragm muscle` | +| `X belly/head of Y muscle` | leaf — subdivision | `frontal belly of occipitofrontalis muscle`, `inferior head of lateral pterygoid muscle` | +| Named individual muscle | leaf | `articularis genu muscle`, `tensor fascia latae muscle`, `longus capitis muscle` | +| Regional/directional qualifier on known named type | leaf | `multifidus cervicis muscle`, `splenius capitis muscle`, `iliocostalis cervicalis muscle` | +| `muscle of [body region]` | **group** | `muscle of facial expression`, `pelvic floor muscle`, `posterior abdominal wall muscle` | +| `[region] muscle` where region is diffuse | **group** | `intermediate back muscle`, `superficial back muscle`, `thoracic wall muscle` | +| `[layer] pharyngeal/lingual muscle` | **group** | `circular pharyngeal muscle`, `longitudinal pharyngeal muscle` | +| `[region] eye/ear muscle` | **group** | `intrinsic eye muscle`, `middle ear muscle`, `external ear muscle` | +| `[compartment] muscle` | **group** | `hypothenar hand muscle`, `lateral compartment of leg muscle` | + +#### Latent knowledge assessment: all 75 terms + +**Grouping terms** (class of muscles; definition form: "A group of muscles that..."; relationship: `is_a`): + +| Term | Rationale | +|---|---| +| anterior vertebral muscle | Collective for prevertebral group (longus capitis, longus colli, rectus capitis ant./lat.) | +| circular pharyngeal muscle | Outer circular layer — comprises superior, middle, inferior constrictors | +| dorsum of foot muscle | Regional group (extensor digitorum brevis etc.) 
| +| external ear muscle | Auricular muscle group (anterior, superior, posterior auricular) | +| hypothenar hand muscle | Hypothenar group (abductor digiti minimi, flexor digiti minimi, opponens digiti minimi) | +| intermediate back muscle | Serratus posterior superior/inferior group | +| intrinsic eye muscle | Intraocular muscle group (ciliary, iris muscles) | +| lateral compartment of leg muscle | Peroneal group (peroneus longus/brevis) | +| lateral vertebral muscle | Scalene group (anterior, middle, posterior scalene) | +| longitudinal pharyngeal muscle | Longitudinal layer — stylopharyngeus, palatopharyngeus, salpingopharyngeus | +| middle ear muscle | Tensor tympani + stapedius | +| muscle of facial expression | Large group; dozens of individual muscles | +| palmar interosseous muscle | Collective for 3–4 palmar interossei | +| pelvic floor muscle | Levator ani group + coccygeus | +| plantar interosseous muscle | Collective for 3 plantar interossei | +| posterior abdominal wall muscle | Quadratus lumborum, psoas major/minor, iliacus | +| respiratory diaphragm muscle | Single organ but structurally complex; treat as leaf unless context suggests group | +| segmental back muscle | Collective for short segmental intrinsic back muscles | +| sole of foot muscle | Plantar intrinsic muscle group | +| spinotransversales muscle | Splenius capitis + splenius cervicis | +| superficial back muscle | Trapezius, latissimus dorsi, rhomboids, levator scapulae | +| thoracic wall muscle | External/internal/innermost intercostals, subcostals, transversus thoracis | + +**Uncertain / borderline** (require OLS4 check or context): + +| Term | Issue | +|---|---| +| intertransversarii laterales lumborum muscle | Segmental — multiple pairs but treated as a named entity in TA2/FMA; check OLS4 | +| levator costarum muscle | 12 pairs, segmental; TA2 names it as a single entity — probably leaf | +| thoracic intertransversarii muscle | Same issue as above | +| spermatic cord muscle | 
Cremaster muscle analog — probably leaf | + +**Leaf terms** (specific named muscle or named subdivision): all remaining 49 terms. + +### Required workflow changes + +#### Stage 1 (`generate_template.py`) +- Add `term_type` column with values `leaf` | `group` | `infer` +- Pre-classify using a rule set based on the linguistic cues above (regex patterns + known + group-name vocabulary) +- Flag `infer` for borderline cases; subagent resolves + +#### Stage 3 (`ntr-term-researcher` subagent) +- Respect `term_type` from input JSON +- **Group terms**: write definition as "A group of muscles that..." with member enumeration + where known; always emit `is_a` in `resolved_relationships`; search OLS4 for group-level + terms not individual muscles +- **Leaf terms**: current behaviour (specific definition with attachments/function/innervation; + infer `is_a` vs `part_of` from anatomical context) +- **Infer terms**: use OLS4 to check if existing UBERON children of the parent are groups or + leaves, then classify accordingly + +#### Stage 2 (`group_terms_by_parent.py`) +- Include `term_type` in the per-group JSON so subagents receive it + +#### Stage 4 (`merge_definitions.py`) +- No changes needed — `term_type` is resolved upstream + +#### Reports +- Add `term_type` column to `input.tsv` +- Add a `grouping_terms.tsv` report listing all group-classified terms for curator review + +### Linguistic rule set (draft) + +```python +GROUP_PATTERNS = [ + r'\bmuscle of\b', # "muscle of facial expression" + r'\b(pelvic floor|thoracic wall|abdominal wall|dorsum of foot|sole of foot)\b', + r'\b(circular|longitudinal)\s+pharyngeal\b', + r'\b(intrinsic|extrinsic)\s+(eye|ear|tongue|hand|foot)\s+muscle\b', + r'\b(hypothenar|thenar|interosseous)\b', + r'\b(superficial|intermediate|deep)\s+back\s+muscle\b', + r'\b(lateral|medial|anterior|posterior)\s+(vertebral|compartment)\b.*muscle\b', + r'\bspinotransversales\b', + r'\bsegmental back\b', +] + +LEAF_PART_PATTERNS = [ + 
r'\b(head|belly|part|portion|crus)\s+of\b', # subdivisions of named muscles +] +``` + +--- + +## Phase 3: Parent quality — detect and flag UBERON label-ID mismatches + +### The problem + +Stage 1 accepts any syntactically valid UBERON ID (`UBERON:\d{7}`) as a good parent. It does +not check whether the `parent_label` column in the source data actually matches that UBERON ID. +The HRA ASCTB data has many rows where the UBERON ID and label are clearly inconsistent — data +entry errors where an ID from one row of a spreadsheet was accidentally paired with a label from +another. Examples from the ovary CSV: + +| Child term | Supplied parent ID | Supplied parent label | Actual UBERON label | +|---|---|---|---| +| corpus luteum granulosa lutein layer | UBERON:0000976 | humerus | humerus (bone!) | +| dominance antral follicle | UBERON:0001684 | mandible | mandible (bone!) | +| early antral follicle | UBERON:0001677 | sphenoid bone | sphenoid bone (bone!) | +| hemorrhagic anovulatory follicle | UBERON:0001424 | ulna | ulna (bone!) | +| luteinized unruptured follicle | UBERON:0001272 | innominate bone | innominate bone (bone!) | + +None of these were flagged in `errors.tsv`. They passed through Stage 1 as `INFER:UBERON:XXXXXXX`, +with the subagent left to notice the mismatch from context provided in the prompt rather than from +structured error information. + +There is also a related issue with **multi-valued parent columns**: the corona radiata row had a +comma-separated list of parents (`UBERON:0004641, UBERON:0003337, ASCTB-TEMP_serosa`). Stage 1 +classifies the entire string as a single parent, and the presence of the ASCTB-TEMP entry causes +the whole row to be flagged as `asctb_temp_parent`, hiding the fact that some of the supplied +parents are valid UBERON IDs. + +### Root cause in `generate_template.py` + +`classify_parent()` only checks the **format** of the ID string — it never validates whether the +provided `parent_label` matches the ID's actual content. 
There is no OLS4 lookup in Stage 1. + +### Proposed fix + +#### 3a — Label-ID mismatch detection (Stage 1) + +Two options, in order of preference: + +**Option A (preferred): OLS4 label lookup in Stage 1** + +After classifying a parent as `uberon`, look up the UBERON ID label via the OLS4 MCP +(`ontology_search` with exact ID). Compare the returned label (lowercased, stripped) to the +supplied `parent_label`. If they differ: +- Emit a new `issue_type: uberon_label_mismatch` row in `errors.tsv` with columns: + `label | as_iri | uberon_label_mismatch | parent_id | parent_label | actual_label` +- Use `WRONG_PARENT:<parent_id>` (not `INFER:`) in both `is_a` and `part_of` template columns +- In the per-group JSON, set `"parent_mismatch": true` and record `"supplied_label"` and + `"actual_label"` so subagents have the full picture + +**Option B (fallback): keyword heuristic** + +If adding OLS4 calls to Stage 1 is undesirable (latency, dependency), detect mismatches using +a blocklist of label keywords that are never valid parents in any organ-system table: +```python +ANATOMICALLY_IMPOSSIBLE_PARENT_LABELS = { + # Skeletal + "bone", "humerus", "femur", "tibia", "fibula", "ulna", "radius", + "mandible", "maxilla", "clavicle", "scapula", "patella", + "innominate bone", "sphenoid bone", "temporal bone", + # Vascular (if table is non-vascular) + "artery", "vein", "lymphatic vessel", "lymph node", +} +``` +Flag any UBERON parent whose `parent_label` (lowercased) matches or contains a blocklist entry. +This catches the most egregious cases without network calls but will miss subtler mismatches. + +#### 3b — Multi-valued parent column handling (Stage 1) + +The `parents_as` column sometimes contains a comma-separated list of parent IRIs. Stage 1 currently +treats the entire string as a single parent ID, causing misclassification. 
Fix: + +- Split `parent_id` on `,` and classify each element independently +- If multiple valid UBERON IDs are present, use the first and record the others as `additional_parents` + in the JSON for the subagent +- If any element is ASCTB-TEMP, flag accordingly but also surface any valid UBERON IDs present +- If any element is FMA, flag accordingly + +#### 3c — Subagent behaviour for `WRONG_PARENT:` + +When `ntr-term-researcher` sees `is_a: WRONG_PARENT:UBERON:XXXXXXX` in its input JSON: +- Do NOT use the supplied UBERON ID as the parent +- Do NOT look up the supplied UBERON ID's children/hierarchy +- Instead, search OLS4 for the **child term label** directly to find an existing term or candidate parent +- Record the correction in `resolved_parents` with `"source": "label_mismatch_correction"` + +### Required changes + +| File | Change | +|---|---| +| `scripts/generate_template.py` | Add label-ID mismatch detection (Option A or B); add multi-parent splitting | +| `scripts/generate_template.py` | Emit `WRONG_PARENT:<id>` placeholder for mismatch cases | +| `.claude/agents/ntr-term-researcher.md` | Document `WRONG_PARENT:` handling; subagent must search by child label | +| `CLAUDE.md` (workflow) | Document new `uberon_label_mismatch` error type and `WRONG_PARENT:` placeholder | + +### Impact assessment + +From the ovary run: 7 of 13 terms (54%) had label-ID mismatches. From the muscular-system run, +a similar proportion had wrong-domain FMA/ASCTB-TEMP parents. This is a high-frequency data +quality issue across all ASCTB tables — fixing it will make errors.tsv substantially more +informative and reduce the amount of tacit correction subagents must perform. + +--- + +## Phase 4: Scale to full muscular-system table + +Once Phase 2 is implemented, run the complete 75-term muscular-system table. 
Expected: +- ~22 grouping terms → `is_a` definitions +- ~49 leaf terms → specific Aristotelian definitions +- Many wrong-parent rows to resolve (seen in test: ~30 FMA/ASCTB-TEMP/wrong-domain parents) +- Likely 10–20 additional confirmed UBERON matches to exclude + +--- + +## Phase 5: Other anatomical systems + +Generalise to other ASCTB tables (nervous system, vasculature, etc.). The grouping vs leaf +distinction will apply across systems (e.g. "artery of X" vs "X artery", "region of cortex" vs +"X gyrus"). + +--- + +## Phase 6: Optional `develops_from` column on default leaf template ✅ + +**Status:** complete. + +Added an optional `develops_from` column with directive `SC RO:0002202 some %` to the +default leaf template. Empty cell → no axiom emitted by ROBOT. Populated by the agent +when Wikipedia + UBERON precedent indicate a developmental precursor (stage series: +follicle stages, embryonic stages, hematopoietic differentiation, etc.). + +Agent emits via `leaf_template_rows[label].develops_from` in its JSON output. Merge +silently drops the field if the column is absent in the current template variant. + +--- + +## Phase 7: System overlays + +The default leaf template captures only `is_a`, `part_of`, and (optional) `develops_from`. +Some anatomical systems benefit substantially from additional axiomatic richness (origin, +insertion, innervation for muscles; arterial supply / drainage for vasculature; etc.). +Phase 7 implements per-system template overlays — a system overlay is a leaf template +variant with extra columns covering system-specific connectivity relations. + +Stage 1 routes input rows to the appropriate overlay based on the source `tables` +column. Per-system separation keeps each output template clean (no muscle-specific +empty columns in non-muscle templates). + +### Phase 7 — Skeletal muscle overlay ✅ + +**Status:** complete. 
+ +For inputs with `tables == muscular-system`, Stage 1 produces +`<name>-muscle.template.tsv` instead of (or alongside) the default leaf template, +adding three columns: + +| Column | ROBOT directive | Relation | +|---|---|---| +| has_muscle_origin | SC RO:0002372 some % | bone/structure muscle arises from | +| has_muscle_insertion | SC RO:0002373 some % | bone/structure muscle inserts onto | +| innervated_by | SC RO:0002005 some % | motor nerve | + +All three OPTIONAL — populated only with evidence-quoted UBERON IDs. Coverage gaps +(e.g. "lateral pectoral nerve" not in UBERON) are captured as free-text notes in the +agent's output rather than guessed UBERON IDs. + +### Phase 7 — Future overlays (NOT IMPLEMENTED) + +| System | Source table | Suggested fields | Notes | +|---|---|---|---| +| Skeletal | `skeletal-system`? | `articulates_with`, `ossifies_via`, `composed_primarily_of` (bone tissue) | Bones often have rich articulation patterns | +| Vasculature | `vasculature` | `arterial_supply_to`, `drains_into`, `branch_of` | Connectivity is central to vasculature semantics | +| Nervous system | `nervous-system`, `allen-brain` | `innervates`, `synapsed_to`, `axon_in` | Cell-type heavy; CL ontology integration matters | + +Each overlay should be added only when there's a real bulk NTR batch that would benefit +from it. The skeletal-muscle overlay was justified by the muscle enrichment experiment +(see `bulk_ntr_workflow/experiments/SUMMARY.md`); future overlays should similarly +follow an enrichment-experiment validation step before code commits. + +--- + +## Phase 8: Term promotion to direct editing + +**Status:** roadmap only. + +When a templated term needs richer axiomatisation than its template supports — e.g. 
a +follicle stage that requires `has_component UBERON:0005170 minCardinality=2` (cardinality- +constrained intersection_of), or a complex term needing multiple `has_part` axioms with +CL cell-type fillers — the templating system becomes a constraint rather than a help. + +The proposed remedy: a "promote to direct editing" agent that: + +1. Takes a term ID (or list) plus the desired richer axiom set. +2. Reads the current template TSV row for that ID. +3. Converts the row to OBO stanza form (mapping ROBOT directives back to OBO syntax: + `SC %` → `is_a`, `SC BFO:0000050 some %` → `relationship: part_of`, etc.). +4. Augments the stanza with the new axioms (intersection_of, cardinality, additional + relationship axioms). +5. Uses the standard checkout/checkin flow: writes to `terms/UBERON_NNNNNNN.obo`, then + runs `obo-checkin.pl` to merge into `uberon-edit.obo`. +6. Removes the row from the template TSV. +7. Runs the reasoner to confirm the new axiomatisation produces the expected + classification (no unsatisfiable classes, no unexpected new is_a links). + +This solves the templating lock-in concern: any term can be promoted to direct editing +later without losing its UBERON ID or history. + +UX sketch: +```bash +bulk_ntr_workflow/scripts/promote_term.py UBERON:9900037 \ + --add 'intersection_of: UBERON:0001305' \ + --add 'intersection_of: has_component UBERON:0005170 {minCardinality="2"}' \ + --add 'relationship: develops_from UBERON:0000035' +``` + +Or for batches, a YAML/TSV input listing which terms to promote with which axiom sets. +The agent should handle is_a-inheritance carefully (the inferred is_a after +intersection_of must still resolve to the previous genus + the new differentia).
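
Step 3 of the promotion list (mapping ROBOT directives back to OBO syntax) could be prototyped roughly as follows. This is a sketch under stated assumptions, not the planned `promote_term.py`: `row_to_stanza` is a hypothetical helper, only the three leaf-template directives named in this roadmap are mapped, cell values are assumed to be CURIEs already, and the example label is invented.

```python
# ROBOT directive -> OBO line format. Only the default leaf-template
# directives are covered; every other column is skipped in this sketch.
DIRECTIVE_TO_OBO = {
    "SC %": "is_a: {v}",
    "SC BFO:0000050 some %": "relationship: part_of {v}",
    "SC RO:0002202 some %": "relationship: develops_from {v}",
}


def row_to_stanza(directives: list[str], row: list[str]) -> str:
    """Convert one template row into a minimal OBO [Term] stanza (hypothetical)."""
    lines = ["[Term]"]
    for directive, value in zip(directives, row):
        if not value:
            continue  # empty cell => no axiom, mirroring ROBOT's behaviour
        if directive == "ID":
            lines.append(f"id: {value}")
        elif directive == "LABEL":
            lines.append(f"name: {value}")
        elif directive in DIRECTIVE_TO_OBO:
            lines.append(DIRECTIVE_TO_OBO[directive].format(v=value))
    return "\n".join(lines)


# Illustrative row: IDs reused from the UX sketch above, label made up.
directives = ["ID", "LABEL", "SC %", "SC BFO:0000050 some %", "SC RO:0002202 some %"]
row = ["UBERON:9900037", "example follicle stage", "UBERON:0001305", "", "UBERON:0000035"]

stanza = row_to_stanza(directives, row)
print(stanza)
```

A full implementation would also need to carry over the definition, xrefs, and annotation columns, which this sketch leaves out.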
diff --git a/bulk_ntr_workflow/scripts/generate_template.py b/bulk_ntr_workflow/scripts/generate_template.py new file mode 100644 index 000000000..52b0e9373 --- /dev/null +++ b/bulk_ntr_workflow/scripts/generate_template.py @@ -0,0 +1,537 @@ +""" +Stage 1: Generate initial ROBOT template TSVs from HRA ASCTB unmapped terms. + +Each input row is pre-classified as a leaf or group term (linguistic rules) and routed +to the appropriate template: + - Leaf terms → standard template with SC (asserted is_a/part_of) + - Group terms → groups template with EC (equivalent class: genus + part_of some Y) + +Input: An xlsx file (default: hra_unmapped-asct-term-list-with-refs.xlsx at repo root) + OR a pre-exported CSV with the same columns as the 'as-temp terms' sheet. + Optionally filter to a specific ASCTB table (e.g. 'muscular-system'). + +Outputs (REPO_ROOT = two levels up from this script): + bulk_ntr_workflow/outputs/template_initial.tsv — leaf working template + bulk_ntr_workflow/outputs/template_groups_initial.tsv — groups working template + src/templates/<name>.template.tsv — leaf final template + src/templates/<name>-groups.template.tsv — groups final template + src/templates/<name>-reports/input.tsv — filtered input rows + term_type + src/templates/<name>-reports/errors.tsv — input problems + src/templates/<name>-reports/candidates.tsv — pre-mapped existing terms + +Usage: + cd bulk_ntr_workflow + uv run --with openpyxl scripts/generate_template.py \\ + --input ../hra_unmapped-asct-term-list-with-refs.xlsx \\ + --table muscular-system \\ + --name hra-muscular \\ + --start-id 9900001 +""" + +import argparse +import csv +import re +import sys +from datetime import date +from pathlib import Path + +# bulk_ntr_workflow/scripts/ → bulk_ntr_workflow/ → repo root +NTR_ROOT = Path(__file__).resolve().parent.parent +REPO_ROOT = NTR_ROOT.parent + +WORK_DIR = NTR_ROOT / "outputs" +WORK_DIR.mkdir(parents=True, exist_ok=True) +WORK_TSV = WORK_DIR / "template_initial.tsv" # default 
leaf +WORK_GROUPS_TSV = WORK_DIR / "template_groups_initial.tsv" # groups +# System overlay working files: outputs/template_<overlay>_initial.tsv (created on demand) + +# ROBOT template column headers and directives — DEFAULT LEAF template (asserted SC) +# Phase 6: develops_from is OPTIONAL — empty cell ⇒ no axiom emitted by ROBOT +TEMPLATE_HEADERS = [ + "ID", "LABEL", "Definition", "def_xref", + "is_a", "part_of", "develops_from", + "In_subset", "Date", "Contributor", "Present_in_taxon", + "Wikipedia_image", "xref", +] +TEMPLATE_DIRECTIVES = [ + "ID", "LABEL", "A IAO:0000115", ">A oboInOwl:hasDbXref SPLIT=|", + "SC %", "SC BFO:0000050 some %", "SC RO:0002202 some %", + "AI oboInOwl:inSubset", "AT dcterms:date^^xsd:dateTime", + "AI dcterms:contributor", "AI RO:0002175", + "A foaf:depiction", "A oboInOwl:hasDbXref SPLIT=|", +] + +# Phase 7: MUSCLE LEAF template overlay — adds muscle-specific relations. +# RO IDs: has_muscle_origin=RO:0002372, has_muscle_insertion=RO:0002373, innervated_by=RO:0002005 +# Inserted between develops_from and In_subset (positions 7-9). +MUSCLE_TEMPLATE_HEADERS = TEMPLATE_HEADERS[:7] + [ + "has_muscle_origin", "has_muscle_insertion", "innervated_by", +] + TEMPLATE_HEADERS[7:] +MUSCLE_TEMPLATE_DIRECTIVES = TEMPLATE_DIRECTIVES[:7] + [ + "SC RO:0002372 some %", "SC RO:0002373 some %", "SC RO:0002005 some %", +] + TEMPLATE_DIRECTIVES[7:] + +# Map source-table value to a system overlay name. Unmapped tables → 'default'. +# Future overlays for skeleton, vasculature, nervous-system go here (see ROADMAP). 
+SYSTEM_OVERLAYS = { + "muscular-system": "muscle", +} + +# Per-overlay header/directive sets +OVERLAY_TEMPLATES = { + "default": (TEMPLATE_HEADERS, TEMPLATE_DIRECTIVES), + "muscle": (MUSCLE_TEMPLATE_HEADERS, MUSCLE_TEMPLATE_DIRECTIVES), +} + + +def classify_system(record: dict) -> str: + """Return the system overlay name for a row; 'default' if no overlay applies.""" + return SYSTEM_OVERLAYS.get(record.get("table", ""), "default") + + +def overlay_paths(overlay: str, name: str) -> tuple[Path, Path]: + """Return (working_tsv, final_tsv) paths for a given overlay name.""" + templates_dir = REPO_ROOT / "src" / "templates" + if overlay == "default": + work = WORK_DIR / "template_initial.tsv" + final = templates_dir / f"{name}.template.tsv" + else: + work = WORK_DIR / f"template_{overlay}_initial.tsv" + final = templates_dir / f"{name}-{overlay}.template.tsv" + return work, final + + +# ROBOT template — GROUPS template (equivalent class: genus + part_of some Y) +GROUPS_TEMPLATE_HEADERS = [ + "ID", "LABEL", "Definition", "def_xref", + "genus", "location", + "In_subset", "Date", "Contributor", "Present_in_taxon", + "Wikipedia_image", "xref", +] +GROUPS_TEMPLATE_DIRECTIVES = [ + "ID", "LABEL", "A IAO:0000115", ">A oboInOwl:hasDbXref SPLIT=|", + "EC %", "EC BFO:0000050 some %", + "AI oboInOwl:inSubset", "AT dcterms:date^^xsd:dateTime", + "AI dcterms:contributor", "AI RO:0002175", + "A foaf:depiction", "A oboInOwl:hasDbXref SPLIT=|", +] + +# Columns for input.tsv (mirrors the raw source columns + term_type pre-classification) +INPUT_HEADERS = [ + "table", "as_iri", "label", "uberon_id", + "parent_id", "parent_label", "references", "term_type", +] + +# Columns for errors.tsv +ERROR_HEADERS = ["label", "as_iri", "issue_type", "parent_id", "parent_label", "detail"] + +# Columns for candidates.tsv +CANDIDATE_HEADERS = ["label", "as_iri", "uberon_id", "note"] + +SUBSET_IRI = "http://purl.obolibrary.org/obo/uberon/core#added_by_HRA" +CREATION_DATE = 
f"{date.today().isoformat()}T00:00:00Z" +TAXON_IRI = "http://purl.obolibrary.org/obo/NCBITaxon_9606" + +ORCID_RE = re.compile(r'^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$') + +DEFAULT_START_ID = 9900001 + +UBERON_RE = re.compile(r'^UBERON:\d{7}$') +FMA_IRI_RE = re.compile(r'fma/fma(\d+)', re.IGNORECASE) + +# Linguistic patterns for grouping terms (collective classes, not specific named entities). +# When matched, the term is routed to the groups template (EquivalentClass form). +# Default if none match: "leaf" (asserted SC subclass). +GROUP_PATTERNS = [ + re.compile(r'\bmuscle of (?!the )', re.IGNORECASE), + re.compile(r'\b(pelvic floor|thoracic wall|abdominal wall|chest|chest wall) muscle\b', re.IGNORECASE), + re.compile(r'\b(dorsum|sole) of (foot|hand) muscle\b', re.IGNORECASE), + re.compile(r'\b(circular|longitudinal) pharyngeal muscle\b', re.IGNORECASE), + re.compile(r'\b(intrinsic|extrinsic) (eye|ear|tongue|hand|foot|laryngeal|lingual) muscle\b', re.IGNORECASE), + re.compile(r'\b(hypothenar|thenar) hand muscle\b', re.IGNORECASE), + re.compile(r'\b(palmar|plantar) interosseous muscle\b', re.IGNORECASE), + re.compile(r'\b(superficial|intermediate|deep) back muscle\b', re.IGNORECASE), + re.compile(r'\b(anterior|posterior|lateral|medial) vertebral muscle\b', re.IGNORECASE), + re.compile(r'\b(anterior|posterior|lateral|medial) compartment( of \w+)? 
muscle\b', re.IGNORECASE), + re.compile(r'\b(spinotransversales|segmental back|external ear|middle ear|cranial) muscle\b', re.IGNORECASE), + re.compile(r'\b(posterior|anterior|lateral|medial) abdominal wall muscle\b', re.IGNORECASE), + re.compile(r'\bmuscle of (facial expression|mastication)\b', re.IGNORECASE), +] + +# Subdivision patterns — head/belly/part/portion/crus/etc of a named muscle → leaf +LEAF_PART_PATTERNS = [ + re.compile(r'\b(head|belly|part|portion|crus|fascicle|layer|zone|lamina) of\b', re.IGNORECASE), +] + + +def classify_term_type(label: str) -> str: + """Classify a term label as 'group' or 'leaf' using linguistic rules. + + Default: 'leaf'. A term matching any LEAF_PART_PATTERN (e.g. 'X head of Y muscle') + is always 'leaf', even if a GROUP_PATTERN would otherwise match. Specific + subdivisions of a named structure trump grouping cues. + """ + if not label: + return "leaf" + for pat in LEAF_PART_PATTERNS: + if pat.search(label): + return "leaf" + for pat in GROUP_PATTERNS: + if pat.search(label): + return "group" + return "leaf" + + +# --------------------------------------------------------------------------- +# Input reading +# --------------------------------------------------------------------------- + +def read_xlsx(path: Path, table_filter: str | None) -> list[dict]: + try: + import openpyxl + except ImportError: + sys.exit("openpyxl not installed — run: uv run --with openpyxl ...") + + wb = openpyxl.load_workbook(str(path), read_only=True) + if "as-temp terms" not in wb.sheetnames: + sys.exit(f"Sheet 'as-temp terms' not found in {path}. 
Sheets: {wb.sheetnames}") + ws = wb["as-temp terms"] + rows = list(ws.iter_rows(values_only=True)) + if not rows: + sys.exit("Sheet is empty") + + raw_headers = [str(h).strip() if h else "" for h in rows[0]] + col = {h: i for i, h in enumerate(raw_headers)} + + def get(row, name, default=""): + idx = col.get(name) + if idx is None: + return default + v = row[idx] + return str(v).strip() if v is not None else default + + records = [] + for row in rows[1:]: + if not any(row): + continue + table = get(row, "tables") + if table_filter and table != table_filter: + continue + records.append({ + "table": table, + "iri": get(row, "as"), + "label": get(row, "as_label"), + "uberon_id": get(row, "UBERON ID"), + "parent_id": get(row, "parents_as"), + "parent_label": get(row, "parents_as_label"), + "references": get(row, "references"), + }) + return records + + +def read_csv(path: Path, table_filter: str | None) -> list[dict]: + records = [] + # process() also routes .tsv files here, so pick the delimiter from the suffix. + delimiter = "\t" if path.suffix.lower() == ".tsv" else "," + with open(path, newline="", encoding="utf-8") as f: + reader = csv.DictReader(f, delimiter=delimiter) + for row in reader: + table = row.get("tables", "").strip() + if table_filter and table != table_filter: + continue + records.append({ + "table": table, + "iri": row.get("as", "").strip(), + "label": row.get("as_label", "").strip(), + "uberon_id": row.get("UBERON ID", "").strip(), + "parent_id": row.get("parents_as", "").strip(), + "parent_label": row.get("parents_as_label", "").strip(), + "references": row.get("references", "").strip(), + }) + return records + + +# --------------------------------------------------------------------------- +# Parent ID classification +# --------------------------------------------------------------------------- + +def classify_parent(parent_id: str) -> str: + """Return 'uberon', 'fma', 'asctb_temp', or 'unknown'.""" + if UBERON_RE.match(parent_id): + return "uberon" + if FMA_IRI_RE.search(parent_id): + return "fma" + # Case-insensitive match covers both "ASCTB-TEMP" and "asctb-temp". + if "asctb-temp" in parent_id.lower(): + return "asctb_temp" + return "unknown"
+ + +def fma_id_from_iri(iri: str) -> str: + m = FMA_IRI_RE.search(iri) + return f"FMA:{m.group(1)}" if m else iri + + +# --------------------------------------------------------------------------- +# Reference formatting (comma-separated → pipe-separated) +# --------------------------------------------------------------------------- + +def format_refs(raw: str, asctb_iri: str) -> str: + parts = [r.strip() for r in raw.split(",") if r.strip()] + if asctb_iri and asctb_iri not in parts: + parts.append(asctb_iri) + return "|".join(parts) if parts else "" + + +# --------------------------------------------------------------------------- +# TSV helpers +# --------------------------------------------------------------------------- + +def write_tsv(path: Path, headers: list[str], rows: list[list]) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + with open(path, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerow(headers) + writer.writerows(rows) + + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- + +def resolve_contributor(contributor_arg: str | None) -> str: + """Return a validated ORCID IRI, prompting if not supplied.""" + if contributor_arg: + iri = contributor_arg.strip() + if not iri.startswith("https://orcid.org/"): + iri = f"https://orcid.org/{iri}" + if not ORCID_RE.match(iri): + sys.exit(f"Invalid ORCID format: {iri}\nExpected: https://orcid.org/XXXX-XXXX-XXXX-XXXX") + return iri + while True: + raw = input("Contributor ORCID (e.g. 
https://orcid.org/0000-0000-0000-0000): ").strip() + if not raw.startswith("https://orcid.org/"): + raw = f"https://orcid.org/{raw}" + if ORCID_RE.match(raw): + return raw + print(" Invalid format, try again.") + + +def process(input_path: Path, table_filter: str | None, start_id: int, name: str, + contributor_iri: str, limit: int | None = None) -> None: + suffix = input_path.suffix.lower() + if suffix in (".xlsx", ".xlsm"): + records = read_xlsx(input_path, table_filter) + elif suffix in (".csv", ".tsv"): + records = read_csv(input_path, table_filter) + else: + sys.exit(f"Unsupported file type: {suffix}") + + if not records: + sys.exit("No records found (check --table filter)") + + if limit is not None: + records = records[:limit] + + # Output paths + templates_dir = REPO_ROOT / "src" / "templates" + reports_dir = templates_dir / f"{name}-reports" + final_groups_tsv = templates_dir / f"{name}-groups.template.tsv" + input_tsv = reports_dir / "input.tsv" + errors_tsv = reports_dir / "errors.tsv" + candidates_tsv = reports_dir / "candidates.tsv" + reports_dir.mkdir(parents=True, exist_ok=True) + + # Step 0: rows are partitioned by system overlay (default vs muscle vs ...). + # leaf_rows_by_overlay[overlay] holds the rows destined for that overlay's template.
+ leaf_rows_by_overlay: dict[str, list] = {} + group_rows = [] + error_rows = [] + candidate_rows = [] + input_rows = [] + counter = start_id + + for rec in records: + label = rec["label"] + iri = rec["iri"] + uberon_id = rec["uberon_id"] + parent_id = rec["parent_id"] + parent_lbl = rec["parent_label"] + refs = rec["references"] + + term_type = classify_term_type(label) if label else "leaf" + + # Save to input.tsv regardless of outcome (now includes term_type) + input_rows.append([rec["table"], iri, label, uberon_id, + parent_id, parent_lbl, refs, term_type]) + + if not label: + error_rows.append(["", iri, "missing_label", "", "", ""]) + continue + + # Already mapped — skip from template, log as candidate + if uberon_id and UBERON_RE.match(uberon_id): + candidate_rows.append([label, iri, uberon_id, "pre-assigned in input"]) + continue + + # Classify parent + parent_class = classify_parent(parent_id) if parent_id else "unknown" + + if parent_class == "uberon": + # Embed parent ID; subagent resolves is_a vs part_of + is_a_val = f"INFER:{parent_id}" + part_of_val = f"INFER:{parent_id}" + elif parent_class == "fma": + fma_curie = fma_id_from_iri(parent_id) + is_a_val = f"NEEDS_MAPPING:{fma_curie}" + part_of_val = f"NEEDS_MAPPING:{fma_curie}" + error_rows.append([ + label, iri, "fma_parent", + fma_curie, parent_lbl, + "Subagent should resolve FMA→UBERON via OLS4" + ]) + elif parent_class == "asctb_temp": + # Embed parent label so subagent can try OLS4 to find correct UBERON parent + safe_lbl = parent_lbl.replace("|", ";") + is_a_val = f"UNRESOLVABLE:{safe_lbl}" + part_of_val = f"UNRESOLVABLE:{safe_lbl}" + error_rows.append([ + label, iri, "asctb_temp_parent", + parent_id, parent_lbl, + "Parent not yet in UBERON; subagent should search OLS4 for correct parent" + ]) + else: + is_a_val = "UNKNOWN" + part_of_val = "UNKNOWN" + error_rows.append([ + label, iri, "unknown_parent", + parent_id, parent_lbl, "Unrecognised parent ID format" + ]) + + def_xref = format_refs(refs, 
iri) + # Pre-populate xref with FMA ID if the term's own IRI is an FMA IRI + own_fma = fma_id_from_iri(iri) if FMA_IRI_RE.search(iri) else "" + + if term_type == "group": + # Groups template: genus + location columns are populated by the subagent + group_rows.append([ + f"http://purl.obolibrary.org/obo/UBERON_{counter}", + label, + "[PENDING]", + def_xref, + "", # genus — filled by subagent + "", # location — filled by subagent + SUBSET_IRI, + CREATION_DATE, + contributor_iri, + TAXON_IRI, + "", # Wikipedia_image — filled by subagent + own_fma, # xref — FMA from source IRI; subagent appends + ]) + else: + overlay = classify_system(rec) + base_row = [ + f"http://purl.obolibrary.org/obo/UBERON_{counter}", + label, + "[PENDING]", + def_xref, + is_a_val, + part_of_val, + "", # develops_from — filled by subagent if applicable + ] + if overlay == "muscle": + base_row += ["", "", ""] # has_muscle_origin, has_muscle_insertion, innervated_by + base_row += [ + SUBSET_IRI, + CREATION_DATE, + contributor_iri, + TAXON_IRI, + "", # Wikipedia_image — filled by subagent + own_fma, + ] + leaf_rows_by_overlay.setdefault(overlay, []).append(base_row) + counter += 1 + + # Write per-overlay LEAF working + final templates + overlay_summary = [] + for overlay, rows in sorted(leaf_rows_by_overlay.items()): + headers, directives = OVERLAY_TEMPLATES[overlay] + work_path, final_path = overlay_paths(overlay, name) + for path in (work_path, final_path): + with open(path, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerow(headers) + writer.writerow(directives) + writer.writerows(rows) + overlay_summary.append((overlay, len(rows), final_path)) + + # Write GROUPS working + final templates + for path in (WORK_GROUPS_TSV, final_groups_tsv): + with open(path, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerow(GROUPS_TEMPLATE_HEADERS) + writer.writerow(GROUPS_TEMPLATE_DIRECTIVES) + 
writer.writerows(group_rows) + + # Write reports + write_tsv(input_tsv, INPUT_HEADERS, input_rows) + write_tsv(errors_tsv, ERROR_HEADERS, error_rows) + write_tsv(candidates_tsv, CANDIDATE_HEADERS, candidate_rows) + + # Routing summary (Step 0) + parts = ", ".join(f"{ov}={n}" for ov, n, _ in overlay_summary) or "(none)" + print(f"Step 0 routing: {parts}, group={len(group_rows)}") + print() + for overlay, n, final_path in overlay_summary: + work_path, _ = overlay_paths(overlay, name) + print(f"Leaf template [{overlay}] → {final_path} ({n} rows)") + print(f"Groups template → {final_groups_tsv} ({len(group_rows)} rows)") + print(f"Reports → {reports_dir}/") + print(f" input.tsv {len(input_rows)} rows") + print(f" errors.tsv {len(error_rows)} rows") + print(f" candidates.tsv {len(candidate_rows)} rows") + + total_leaf = sum(len(r) for r in leaf_rows_by_overlay.values()) + uberon_p = sum(1 for r in records if classify_parent(r["parent_id"]) == "uberon") + fma_p = sum(1 for r in records if classify_parent(r["parent_id"]) == "fma") + asctb_p = sum(1 for r in records if classify_parent(r["parent_id"]) == "asctb_temp") + print(f"\nTemplate rows: leaf={total_leaf} group={len(group_rows)} | " + f"Parents: UBERON={uberon_p} FMA={fma_p} ASCTB-TEMP={asctb_p}") + if asctb_p: + print(f" ⚠ {asctb_p} terms have ASCTB-TEMP parents — " + f"subagent will attempt OLS4 lookup for correct parent") + + +def main(): + parser = argparse.ArgumentParser( + description="Generate initial UBERON NTR ROBOT template from HRA ASCTB unmapped terms" + ) + parser.add_argument( + "--input", required=True, + help="Path to input xlsx or csv file" + ) + parser.add_argument( + "--table", default=None, + help="Filter to a specific ASCTB table name (e.g. 'muscular-system')" + ) + parser.add_argument( + "--name", required=True, + help="Template name used for output filenames (e.g. 
'hra-muscular')" + ) + parser.add_argument( + "--start-id", type=int, default=DEFAULT_START_ID, + help=f"Starting UBERON:99xxxxx counter (default: {DEFAULT_START_ID})" + ) + parser.add_argument( + "--limit", type=int, default=None, + help="Process only the first N terms (for testing)" + ) + parser.add_argument( + "--contributor", default=None, + help="Contributor ORCID IRI (e.g. https://orcid.org/0000-0000-0000-0000). " + "Prompted interactively if omitted." + ) + args = parser.parse_args() + contributor_iri = resolve_contributor(args.contributor) + process(Path(args.input), args.table, args.start_id, args.name, contributor_iri, args.limit) + + +if __name__ == "__main__": + main() diff --git a/bulk_ntr_workflow/scripts/group_terms_by_parent.py b/bulk_ntr_workflow/scripts/group_terms_by_parent.py new file mode 100644 index 000000000..26f56f356 --- /dev/null +++ b/bulk_ntr_workflow/scripts/group_terms_by_parent.py @@ -0,0 +1,204 @@ +""" +Stage 2: Group ROBOT template rows by parent term for parallel subagent processing. + +Reads BOTH templates produced by Stage 1: + bulk_ntr_workflow/outputs/template_initial.tsv — leaf terms (SC directives) + bulk_ntr_workflow/outputs/template_groups_initial.tsv — group terms (EC directives) + +Each per-term JSON entry includes a `term_type` field ("leaf" or "group") so the agent +can branch its behaviour (Step 5 of the agent spec). Group terms have no parent ID +encoded in the template (the agent will determine genus + part_of differentiator), so +they are all collected into a single group keyed by `term_type=group` rather than by +parent UBERON ID. 
+ +Output: bulk_ntr_workflow/outputs/definitions/input/{group_name}.json + +Usage: + uv run scripts/group_terms_by_parent.py +""" + +import csv +import json +import re +from pathlib import Path + +ROOT = Path(__file__).resolve().parent.parent +OUTPUT_DIR = ROOT / "outputs" / "definitions" / "input" +OUTPUT_DIR.mkdir(parents=True, exist_ok=True) + +# Discovered at runtime via glob over outputs/template_*_initial.tsv +LEAF_DEFAULT_TSV = ROOT / "outputs" / "template_initial.tsv" +INPUT_GROUPS_TSV = ROOT / "outputs" / "template_groups_initial.tsv" + +# Header column names — looked up per-template via header_indices() +H_ID, H_LABEL, H_DEF, H_DEFXREF = "ID", "LABEL", "Definition", "def_xref" +H_IS_A, H_PART_OF = "is_a", "part_of" +H_GENUS, H_LOCATION = "genus", "location" + + +def header_indices(header_row: list[str]) -> dict[str, int]: + return {h.strip(): i for i, h in enumerate(header_row)} + + +def discover_leaf_templates() -> list[Path]: + """Return all leaf template working files (default + system overlays). + + Convention: outputs/template_initial.tsv (default), outputs/template_<overlay>_initial.tsv. 
+ """ + out_dir = ROOT / "outputs" + paths = [] + if LEAF_DEFAULT_TSV.exists(): + paths.append(LEAF_DEFAULT_TSV) + for p in sorted(out_dir.glob("template_*_initial.tsv")): + if p.name in ("template_initial.tsv", "template_groups_initial.tsv"): + continue + paths.append(p) + return paths + + +def extract_parent_info(row: list[str], idx: dict[str, int]) -> tuple[str, str]: + """Return (parent_id, parent_label) from a leaf template's is_a/part_of cells.""" + is_a = row[idx[H_IS_A]].strip() if H_IS_A in idx else "" + part_of = row[idx[H_PART_OF]].strip() if H_PART_OF in idx else "" + + for val in (is_a, part_of): + m = re.match(r'^(UBERON:\d{7})$', val) + if m: + return m.group(1), "" + m = re.match(r'^INFER:(UBERON:\d{7})$', val) + if m: + return m.group(1), "" + m = re.match(r'^(NEEDS_MAPPING:FMA:\d+)$', val) + if m: + return m.group(1), "" + + val = is_a if is_a and is_a not in ("", "[PENDING]") else part_of + return val, "" + + +def make_group_name(parent_id: str, parent_label: str) -> str: + """Derive a safe filename-friendly group name.""" + if parent_label and parent_label not in ("INFER", "NEEDS_MAPPING", "UNRESOLVABLE", "UNKNOWN"): + slug = re.sub(r'[^\w]+', '_', parent_label.lower()).strip('_') + return slug[:50] + safe = re.sub(r'[^\w]+', '_', parent_id.lower()).strip('_') + return safe[:50] + + +def process() -> None: + leaf_paths = discover_leaf_templates() + if not leaf_paths: + raise FileNotFoundError( + f"No leaf templates found in {ROOT/'outputs'}. Run generate_template.py first." 
+ ) + + groups: dict[str, dict] = {} + + # --- Leaf templates (default + system overlays): group by parent --- + # Each row carries the `system` overlay it came from, derived from filename: + # template_initial.tsv → system='default' + # template_<overlay>_initial.tsv → system='<overlay>' + for leaf_path in leaf_paths: + if leaf_path.name == "template_initial.tsv": + system = "default" + else: + # template_muscle_initial.tsv → muscle + system = leaf_path.stem[len("template_"):-len("_initial")] + + with open(leaf_path, newline="", encoding="utf-8") as f: + reader = csv.reader(f, delimiter="\t") + header_row = next(reader) + next(reader) # directive row + idx = header_indices(header_row) + + for row in reader: + if not row or len(row) <= idx[H_LABEL] or not row[idx[H_LABEL]].strip(): + continue + label = row[idx[H_LABEL]].strip() + ntr_id = row[idx[H_ID]].strip() + + parent_id, _ = extract_parent_info(row, idx) + group_key = parent_id + + if group_key not in groups: + groups[group_key] = { + "parent_id": parent_id, + "parent_label": "", + "terms": [], + } + + groups[group_key]["terms"].append({ + "ntr_id": ntr_id, + "label": label, + "term_type": "leaf", + "system": system, + "is_a": row[idx[H_IS_A]].strip() if H_IS_A in idx else "", + "part_of": row[idx[H_PART_OF]].strip() if H_PART_OF in idx else "", + "def_xref": row[idx[H_DEFXREF]].strip() if H_DEFXREF in idx and len(row) > idx[H_DEFXREF] else "", + }) + + # --- Groups template: all into one bucket; agent determines genus + location per term --- + if INPUT_GROUPS_TSV.exists(): + with open(INPUT_GROUPS_TSV, newline="", encoding="utf-8") as f: + reader = csv.reader(f, delimiter="\t") + header_row = next(reader) + next(reader) # directive row + idx = header_indices(header_row) + grouping_terms = [] + for row in reader: + if not row or len(row) <= idx[H_LABEL] or not row[idx[H_LABEL]].strip(): + continue + grouping_terms.append({ + "ntr_id": row[idx[H_ID]].strip(), + "label": row[idx[H_LABEL]].strip(), + "term_type": 
"group", + "genus": row[idx[H_GENUS]].strip() if H_GENUS in idx and len(row) > idx[H_GENUS] else "", + "location": row[idx[H_LOCATION]].strip() if H_LOCATION in idx and len(row) > idx[H_LOCATION] else "", + "def_xref": row[idx[H_DEFXREF]].strip() if H_DEFXREF in idx and len(row) > idx[H_DEFXREF] else "", + }) + if grouping_terms: + groups["__grouping_terms__"] = { + "parent_id": "GROUPING_TERMS", + "parent_label": "(grouping terms — agent determines genus + part_of differentiator per term)", + "terms": grouping_terms, + } + + written = 0 + for group_key, data in sorted(groups.items()): + parent_id = data["parent_id"] + # Special handling for the grouping bucket + if group_key == "__grouping_terms__": + group_name = "grouping_terms" + else: + group_name = make_group_name(parent_id, data.get("parent_label", "")) + + # Group-level summary: leaf vs group counts (always one or the other in this iteration) + leaf_n = sum(1 for t in data["terms"] if t.get("term_type") == "leaf") + group_n = sum(1 for t in data["terms"] if t.get("term_type") == "group") + + out = { + "group_name": group_name, + "parent_id": parent_id, + "parent_label": data.get("parent_label", ""), + "term_counts": {"leaf": leaf_n, "group": group_n}, + "note": "parent_label is best-effort; subagent should resolve via OLS4. " + "For term_type='group' terms: use obo-grep on uberon-edit.obo to find " + "similar UBERON groupings, identify the genus + part_of pattern, and " + "fill genus + location. 
If pattern doesn't fit, route to manual_curation.", + "terms": data["terms"], + } + + out_path = OUTPUT_DIR / f"{group_name}.json" + with open(out_path, "w", encoding="utf-8") as f: + json.dump(out, f, indent=2) + + marker = "[GROUP]" if group_n else "[leaf] " + print(f" {marker} {group_name:45s} {len(data['terms']):3d} terms") + written += 1 + + total = sum(len(d["terms"]) for d in groups.values()) + print(f"\nTotal groups: {written} | Total terms: {total}") + + +if __name__ == "__main__": + process() diff --git a/bulk_ntr_workflow/scripts/merge_definitions.py b/bulk_ntr_workflow/scripts/merge_definitions.py new file mode 100644 index 000000000..084885440 --- /dev/null +++ b/bulk_ntr_workflow/scripts/merge_definitions.py @@ -0,0 +1,598 @@ +""" +Stage 4: Merge subagent outputs back into the ROBOT template TSVs. + +Reads all bulk_ntr_workflow/outputs/definitions/*.json files. Each JSON may contain: + - "definitions": {label: definition_string} + - "wikipedia_images": {label: image_url} + - "xrefs": {label: "Wikipedia:Title|FMA:NNNNN"} + - "def_xrefs_to_add": {label: "PMID:nnn|DOI:..."} + - "resolved_relationships": {label: "is_a" | "part_of"} — leaf template only + - "resolved_parents": {label: "UBERON:xxxxxxx"} — leaf template only + - "group_template_rows": {label: {"genus": "...", "location": "..."}} — groups only + - "confirmed_matches": [{label, uberon_id, confidence}] + - "possible_matches": [{label, uberon_id, confidence, note}] + - "out_of_scope": [{label, reason, suggestion}] + - "name_corrections": [{label, suggested, reason}] + - "manual_curation": [{label, definition, reason, similar_terms, suggestion}] + +Reads both working templates: + bulk_ntr_workflow/outputs/template_initial.tsv — leaf (SC directives) + bulk_ntr_workflow/outputs/template_groups_initial.tsv — groups (EC directives) + +Writes back to: + src/templates/<name>.template.tsv — leaf, in place + src/templates/<name>-groups.template.tsv — groups, in place + +Reports written to 
src/templates/<name>-reports/: + candidates.tsv — confirmed + possible OLS4 matches + out_of_scope.tsv — pathological/dysfunctional terms + name_corrections.tsv — agent-applied label rewrites + manual_curation.tsv — group terms that don't fit the simple part_of pattern + +Requires --name matching the value used in Stage 1. + +Usage: + uv run scripts/merge_definitions.py --name hra-muscular +""" + +import argparse +import csv +import json +import re +from pathlib import Path + +NTR_ROOT = Path(__file__).resolve().parent.parent +REPO_ROOT = NTR_ROOT.parent +INPUT_TSV = NTR_ROOT / "outputs" / "template_initial.tsv" +INPUT_GROUPS_TSV = NTR_ROOT / "outputs" / "template_groups_initial.tsv" +DEFS_DIR = NTR_ROOT / "outputs" / "definitions" + +PENDING_PATTERN = re.compile(r'^\[PENDING\]$') +INFER_PATTERN = re.compile(r'^INFER') + +# Header column names. Indices are looked up per-template via header_indices() below +# so the merge step is robust to additional columns (e.g. develops_from, has_muscle_origin) +# without having to update hardcoded positions. 
+H_ID = "ID" +H_LABEL = "LABEL" +H_DEF = "Definition" +H_DEFXREF = "def_xref" +H_IMAGE = "Wikipedia_image" +H_TERMREF = "xref" +# Leaf template logic columns +H_IS_A = "is_a" +H_PART_OF = "part_of" +H_DEVELOPS_FROM = "develops_from" +# Optional muscle-overlay logic columns (Phase 7) +H_MUSCLE_ORIGIN = "has_muscle_origin" +H_MUSCLE_INSERTION = "has_muscle_insertion" +H_INNERVATED_BY = "innervated_by" +# Groups template logic columns (EC genus, EC part_of some location) +H_GENUS = "genus" +H_LOCATION = "location" + + +def header_indices(header_row: list[str]) -> dict[str, int]: + """Return {column_name: index} for a template header row.""" + return {h.strip(): i for i, h in enumerate(header_row)} + + +def ensure_width(row: list[str], width: int) -> None: + """Extend row in-place to at least `width` cells with empty strings.""" + while len(row) < width: + row.append("") + + +def _normalise_matches(raw: list) -> list: + """Normalise various field-name conventions agents may use into {label, uberon_id, ...}.""" + out = [] + for m in raw: + label = m.get("label") or m.get("ntr_label") or m.get("term_label", "") + uid = m.get("uberon_id") or m.get("matched_id") or "" + out.append({ + "label": label, + "uberon_id": uid, + "confidence": m.get("confidence", ""), + "note": m.get("note", ""), + }) + return out + + +def load_subagent_outputs() -> dict: + """Load and merge all subagent JSON outputs into a dict of merged maps/lists.""" + out = { + "definitions": {}, + "images": {}, + "relationships": {}, # legacy fallback: label → "is_a" | "part_of" + "resolved_parents": {}, # legacy fallback: label → "UBERON:xxxxxxx" + "leaf_template_rows": {}, # label → {"is_a": "UBERON:...", "part_of": "UBERON:..."} + "xrefs": {}, # label → pipe-sep xref (Wikipedia URL + FMA ID) + "def_xrefs_extra": {}, # label → PMIDs/DOIs to append to def_xref column + "group_template_rows": {}, # label → {"genus": "...", "location": "..."} + "confirmed": [], + "possible": [], + "out_of_scope": [], # 
[{label, reason, suggestion}] + "name_corrections": [], # [{label, suggested, reason}] + "manual_curation": [], # [{label, definition, reason, similar_terms, ...}] + } + + for jf in sorted(DEFS_DIR.glob("*.json")): + with open(jf, encoding="utf-8") as f: + data = json.load(f) + if not isinstance(data, dict): + print(f" WARNING: {jf.name} is not a dict, skipping") + continue + + out["definitions"].update(data.get("definitions", {})) + out["images"].update(data.get("wikipedia_images", {})) + out["relationships"].update(data.get("resolved_relationships", {})) + out["resolved_parents"].update(data.get("resolved_parents", {})) + out["leaf_template_rows"].update(data.get("leaf_template_rows", {})) + out["xrefs"].update(data.get("xrefs", {})) + out["def_xrefs_extra"].update(data.get("def_xrefs_to_add", {})) + out["group_template_rows"].update(data.get("group_template_rows", {})) + out["confirmed"].extend(_normalise_matches(data.get("confirmed_matches", []))) + out["possible"].extend(_normalise_matches(data.get("possible_matches", []))) + out["out_of_scope"].extend(data.get("out_of_scope", [])) + out["name_corrections"].extend(data.get("name_corrections", [])) + out["manual_curation"].extend(data.get("manual_curation", [])) + # Also accept {label: {match_type, matched_id, ...}} dict form + for lbl, info in data.get("existing_term_match", {}).items(): + mt = info.get("match_type", "") + entry = { + "label": lbl, + "uberon_id": info.get("matched_id") or info.get("uberon_id", ""), + "confidence": info.get("confidence", "high" if "confirmed" in mt else "medium"), + "note": info.get("note", ""), + } + if "confirmed" in mt: + out["confirmed"].append(entry) + elif "possible" in mt: + out["possible"].append(entry) + + return out + + +def extract_parent_id(cell_val: str) -> str: + """Pull the embedded UBERON ID from INFER:, NEEDS_MAPPING:, etc.""" + m = re.match(r'^INFER:(UBERON:\d{7})$', cell_val) + if m: + return m.group(1) + m = re.match(r'^(UBERON:\d{7})$', cell_val) + if m: + 
return m.group(1) + m = re.match(r'^NEEDS_MAPPING:(.*)', cell_val) + if m: + return m.group(1) + return "" + + +def _apply_common_fields(row: list[str], label: str, lookup_label: str, + sub: dict, counters: dict, idx: dict[str, int]) -> None: + """Update definition / image / xref / def_xref columns. Used for both templates. + + idx is the header→index map for the current template (different leaf variants + have different positions for these columns).""" + ensure_width(row, max(idx.values()) + 1) + + def get(d: dict): + return d.get(lookup_label) or d.get(label) + + new_def = get(sub["definitions"]) + if new_def and new_def.strip(): + row[idx[H_DEF]] = new_def.strip() + counters["defs"] += 1 + + if H_IMAGE in idx: + new_img = get(sub["images"]) + if new_img and new_img.strip(): + row[idx[H_IMAGE]] = new_img.strip() + counters["images"] += 1 + + if H_TERMREF in idx: + new_xref = get(sub["xrefs"]) + if new_xref and new_xref.strip(): + col = idx[H_TERMREF] + existing = row[col].strip() + parts = [p for p in existing.split("|") if p] if existing else [] + for p in new_xref.strip().split("|"): + if p and p not in parts: + parts.append(p) + row[col] = "|".join(parts) + counters["xrefs"] += 1 + + if H_DEFXREF in idx: + extra_def_xref = get(sub["def_xrefs_extra"]) + if extra_def_xref and extra_def_xref.strip(): + col = idx[H_DEFXREF] + existing = row[col].strip() + parts = [p for p in existing.split("|") if p] if existing else [] + for p in extra_def_xref.strip().split("|"): + if p and p not in parts: + parts.append(p) + row[col] = "|".join(parts) + counters["def_xrefs"] += 1 + + +def merge_leaf_template(input_tsv: Path, final_tsv: Path, sub: dict, + excluded_labels: set, out_of_scope_labels: set, + name_correction_map: dict, manual_curation_labels: set) -> dict: + """Merge subagent outputs into a leaf template (default OR system overlay). 
+ + Uses header-name lookup so the function works with any leaf template variant + (default 13 columns, muscle 16 columns, future overlays). + + Resolution priority for is_a / part_of columns: + 1. leaf_template_rows[label] = {is_a, part_of, develops_from?, has_muscle_*?} + 2. resolved_relationships + resolved_parents — legacy single-column form + 3. INFER:/UNRESOLVABLE:/NEEDS_MAPPING: — fall back to blank + curator review + """ + # Optional logic columns; populated only if the column exists in this template + OPTIONAL_LEAF_COLS = [H_DEVELOPS_FROM, H_MUSCLE_ORIGIN, + H_MUSCLE_INSERTION, H_INNERVATED_BY] + counters = {"defs": 0, "images": 0, "xrefs": 0, "def_xrefs": 0, + "rels": 0, "leaf_rows_used": 0, "relabelled": 0, + "pending": 0, "infer": 0, "unknown_rel": [], + "optional_filled": 0} + rows = [] + with open(input_tsv, newline="", encoding="utf-8") as f: + reader = csv.reader(f, delimiter="\t") + header_row = next(reader) + directive_row = next(reader) + rows.append(header_row) + rows.append(directive_row) + idx = header_indices(header_row) + width = max(idx.values()) + 1 + for row in reader: + if not row: + rows.append(row) + continue + ensure_width(row, width) + label = row[idx[H_LABEL]].strip() + + if label in excluded_labels or label in out_of_scope_labels: + continue + if label in manual_curation_labels: + continue + + if label in name_correction_map: + row[idx[H_LABEL]] = name_correction_map[label] + counters["relabelled"] += 1 + lookup_label = name_correction_map.get(label, label) + + _apply_common_fields(row, label, lookup_label, sub, counters, idx) + + is_a_val = row[idx[H_IS_A]].strip() + part_of_val = row[idx[H_PART_OF]].strip() + + # Priority 1: leaf_template_rows — preferred, populates both axes + optional cols + ltr = (sub["leaf_template_rows"].get(lookup_label) + or sub["leaf_template_rows"].get(label)) + if ltr: + row[idx[H_IS_A]] = (ltr.get("is_a") or "").strip() + row[idx[H_PART_OF]] = (ltr.get("part_of") or "").strip() + # Optional columns — 
only populate if both the column exists in this + # template AND the agent emitted a value + for col_name in OPTIONAL_LEAF_COLS: + if col_name in idx and ltr.get(col_name): + row[idx[col_name]] = ltr[col_name].strip() + counters["optional_filled"] += 1 + counters["leaf_rows_used"] += 1 + else: + # Priority 2: legacy resolved_relationships + resolved_parents + parent_id = (sub["resolved_parents"].get(lookup_label) + or sub["resolved_parents"].get(label) + or extract_parent_id(is_a_val) + or extract_parent_id(part_of_val)) + rel = (sub["relationships"].get(lookup_label) + or sub["relationships"].get(label)) + if rel and parent_id: + if rel == "is_a": + row[idx[H_IS_A]] = parent_id + row[idx[H_PART_OF]] = "" + elif rel == "part_of": + row[idx[H_IS_A]] = "" + row[idx[H_PART_OF]] = parent_id + counters["rels"] += 1 + elif parent_id and (is_a_val.startswith("INFER:") or + is_a_val.startswith("UNRESOLVABLE:") or + is_a_val.startswith("NEEDS_MAPPING:")): + row[idx[H_IS_A]] = "" + row[idx[H_PART_OF]] = "" + counters["unknown_rel"].append(row[idx[H_LABEL]].strip()) + + if PENDING_PATTERN.match(row[idx[H_DEF]].strip()): + counters["pending"] += 1 + if INFER_PATTERN.match(row[idx[H_IS_A]].strip()) or \ + INFER_PATTERN.match(row[idx[H_PART_OF]].strip()): + counters["infer"] += 1 + + rows.append(row) + + with open(final_tsv, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerows(rows) + counters["data_rows"] = len(rows) - 2 + return counters + + +def merge_groups_template(input_tsv: Path, final_tsv: Path, sub: dict, + excluded_labels: set, out_of_scope_labels: set, + name_correction_map: dict, + manual_curation_labels: set) -> dict: + """Merge subagent outputs into the groups template. Returns a counters dict. + + Group rows that have no genus+location populated AND are not in manual_curation + are dropped from the template (the agent has not produced an EC definition for + them) — the curator should investigate. 
+ """ + counters = {"defs": 0, "images": 0, "xrefs": 0, "def_xrefs": 0, + "ec_resolved": 0, "ec_incomplete": [], "relabelled": 0, + "pending": 0} + rows = [] + with open(input_tsv, newline="", encoding="utf-8") as f: + reader = csv.reader(f, delimiter="\t") + header_row = next(reader) + directive_row = next(reader) + rows.append(header_row) + rows.append(directive_row) + idx = header_indices(header_row) + width = max(idx.values()) + 1 + for row in reader: + if not row: + rows.append(row) + continue + ensure_width(row, width) + label = row[idx[H_LABEL]].strip() + + if label in excluded_labels or label in out_of_scope_labels: + continue + # Group terms that the agent punted go to manual_curation — exclude from template + if label in manual_curation_labels: + continue + + if label in name_correction_map: + row[idx[H_LABEL]] = name_correction_map[label] + counters["relabelled"] += 1 + lookup_label = name_correction_map.get(label, label) + + _apply_common_fields(row, label, lookup_label, sub, counters, idx) + + # Populate genus + location from the agent + ec = (sub["group_template_rows"].get(lookup_label) + or sub["group_template_rows"].get(label)) + if ec and ec.get("genus") and ec.get("location"): + row[idx[H_GENUS]] = ec["genus"].strip() + row[idx[H_LOCATION]] = ec["location"].strip() + counters["ec_resolved"] += 1 + else: + # Incomplete EC — agent didn't produce both columns; flag for curator + counters["ec_incomplete"].append(row[idx[H_LABEL]].strip()) + + if PENDING_PATTERN.match(row[idx[H_DEF]].strip()): + counters["pending"] += 1 + + rows.append(row) + + with open(final_tsv, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerows(rows) + counters["data_rows"] = len(rows) - 2 + return counters + + +def discover_leaf_partitions(name: str, ntr_root: Path, repo_root: Path) -> list[tuple[str, Path, Path]]: + """Find all leaf-template partitions for a given name. 
+ + Returns a list of (partition_label, working_tsv, final_tsv) tuples for every + partition that has both a working and final template on disk. Partition_label is + 'default' for the base template, or the overlay name (e.g. 'muscle') for system + overlays. + + Convention: + default → outputs/template_initial.tsv + src/templates/<name>.template.tsv + <system> → outputs/template_<system>_initial.tsv + src/templates/<name>-<system>.template.tsv + """ + out_dir = ntr_root / "outputs" + templates_dir = repo_root / "src" / "templates" + partitions = [] + + # Default partition + work = out_dir / "template_initial.tsv" + final = templates_dir / f"{name}.template.tsv" + if work.exists() and final.exists(): + partitions.append(("default", work, final)) + + # System overlay partitions — discover by looking for outputs/template_<system>_initial.tsv + for work_path in sorted(out_dir.glob("template_*_initial.tsv")): + stem = work_path.stem # 'template_muscle_initial' + if stem in ("template_initial", "template_groups_initial"): + continue + # Extract overlay name from 'template_<overlay>_initial' + overlay = stem[len("template_"):-len("_initial")] + final = templates_dir / f"{name}-{overlay}.template.tsv" + if final.exists(): + partitions.append((overlay, work_path, final)) + + return partitions + + +def process(name: str) -> None: + templates_dir = REPO_ROOT / "src" / "templates" + final_groups_tsv = templates_dir / f"{name}-groups.template.tsv" + reports_dir = templates_dir / f"{name}-reports" + candidates_tsv = reports_dir / "candidates.tsv" + out_of_scope_tsv = reports_dir / "out_of_scope.tsv" + name_corrections_tsv = reports_dir / "name_corrections.tsv" + manual_curation_tsv = reports_dir / "manual_curation.tsv" + + leaf_partitions = discover_leaf_partitions(name, NTR_ROOT, REPO_ROOT) + if not leaf_partitions: + raise FileNotFoundError( + f"No leaf templates found for '{name}'. Run generate_template.py --name {name} first." 
+ ) + + sub = load_subagent_outputs() + print(f"Loaded: {len(sub['definitions'])} definitions, {len(sub['images'])} images, " + f"{len(sub['relationships'])} resolved relationships, " + f"{len(sub['resolved_parents'])} resolved parents, " + f"{len(sub['leaf_template_rows'])} leaf rows, " + f"{len(sub['group_template_rows'])} group EC rows, " + f"{len(sub['xrefs'])} xrefs, {len(sub['def_xrefs_extra'])} extra def_xrefs, " + f"{len(sub['confirmed'])} confirmed, {len(sub['possible'])} possible, " + f"{len(sub['out_of_scope'])} out-of-scope, " + f"{len(sub['name_corrections'])} name corrections, " + f"{len(sub['manual_curation'])} manual_curation") + + name_correction_map = { + nc["label"]: nc.get("suggested", "").strip() + for nc in sub["name_corrections"] if nc.get("suggested", "").strip() + } + excluded_labels = {m["label"] for m in sub["confirmed"]} + out_of_scope_labels = {o["label"] for o in sub["out_of_scope"]} + manual_curation_labels = {mc["label"] for mc in sub["manual_curation"]} + + for partition_label, work_tsv, final_tsv in leaf_partitions: + leaf_counters = merge_leaf_template( + work_tsv, final_tsv, sub, + excluded_labels, out_of_scope_labels, name_correction_map, + manual_curation_labels, + ) + print(f"\nLeaf template [{partition_label}] → {final_tsv} ({leaf_counters['data_rows']} rows)") + print(f" Definitions updated: {leaf_counters['defs']}") + print(f" Images added: {leaf_counters['images']}") + print(f" Xrefs added: {leaf_counters['xrefs']}") + print(f" def_xref refs appended: {leaf_counters['def_xrefs']}") + print(f" Labels corrected: {leaf_counters['relabelled']}") + print(f" leaf_template_rows used:{leaf_counters['leaf_rows_used']}") + print(f" Optional cols filled: {leaf_counters['optional_filled']}") + print(f" Relationships resolved (legacy): {leaf_counters['rels']}") + print(f" Still [PENDING] defs: {leaf_counters['pending']}") + print(f" Still INFER: {leaf_counters['infer']}") + print(f" Relationship 
unresolved:{len(leaf_counters['unknown_rel'])}") + for lbl in leaf_counters["unknown_rel"]: + print(f" ⚠ {lbl}") + + if INPUT_GROUPS_TSV.exists() and final_groups_tsv.exists(): + groups_counters = merge_groups_template( + INPUT_GROUPS_TSV, final_groups_tsv, sub, + excluded_labels, out_of_scope_labels, name_correction_map, + manual_curation_labels, + ) + print(f"\nGroups template → {final_groups_tsv} ({groups_counters['data_rows']} rows)") + print(f" Definitions updated: {groups_counters['defs']}") + print(f" Images added: {groups_counters['images']}") + print(f" Xrefs added: {groups_counters['xrefs']}") + print(f" def_xref refs appended: {groups_counters['def_xrefs']}") + print(f" Labels corrected: {groups_counters['relabelled']}") + print(f" EC genus+location set: {groups_counters['ec_resolved']}") + print(f" Still [PENDING] defs: {groups_counters['pending']}") + print(f" EC incomplete: {len(groups_counters['ec_incomplete'])}") + for lbl in groups_counters["ec_incomplete"]: + print(f" ⚠ {lbl} (no genus+location from agent)") + + print(f"\n Excluded (confirmed match): {len(excluded_labels)}") + print(f" Excluded (out_of_scope): {len(out_of_scope_labels)}") + print(f" Excluded (manual_curation): {len(manual_curation_labels)}") + + # Append confirmed/possible matches to candidates.tsv + if sub["confirmed"] or sub["possible"]: + reports_dir.mkdir(parents=True, exist_ok=True) + existing_rows = [] + existing_header = ["label", "as_iri", "uberon_id", "note"] + if candidates_tsv.exists(): + with open(candidates_tsv, newline="", encoding="utf-8") as f: + rows_read = list(csv.reader(f, delimiter="\t")) + if rows_read: + existing_header = rows_read[0] + existing_rows = rows_read[1:] + + new_rows = [] + for m in sub["confirmed"]: + new_rows.append([ + m.get("label", ""), "", + m.get("uberon_id", ""), + f"confirmed_match (confidence: {m.get('confidence','')})" + ]) + for m in sub["possible"]: + new_rows.append([ + m.get("label", ""), "", + m.get("uberon_id", ""), + 
f"possible_match ({m.get('note','')})" + ]) + + with open(candidates_tsv, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerow(existing_header) + writer.writerows(existing_rows) + writer.writerows(new_rows) + print(f" Updated candidates.tsv → {candidates_tsv}") + + # Out-of-scope report (pathological/dysfunctional terms — curator decides) + if sub["out_of_scope"]: + reports_dir.mkdir(parents=True, exist_ok=True) + with open(out_of_scope_tsv, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerow(["label", "reason", "suggestion"]) + for o in sub["out_of_scope"]: + writer.writerow([ + o.get("label", ""), + o.get("reason", ""), + o.get("suggestion", ""), + ]) + print(f" Wrote out_of_scope.tsv → {out_of_scope_tsv}") + + # Name corrections report (so curator can review label changes) + if sub["name_corrections"]: + reports_dir.mkdir(parents=True, exist_ok=True) + with open(name_corrections_tsv, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerow(["source_label", "corrected_label", "reason"]) + for nc in sub["name_corrections"]: + writer.writerow([ + nc.get("label", ""), + nc.get("suggested", ""), + nc.get("reason", ""), + ]) + print(f" Wrote name_corrections.tsv → {name_corrections_tsv}") + + # Manual curation report — group terms that don't fit the genus + part_of pattern + if sub["manual_curation"]: + reports_dir.mkdir(parents=True, exist_ok=True) + with open(manual_curation_tsv, "w", newline="", encoding="utf-8") as f: + writer = csv.writer(f, delimiter="\t") + writer.writerow(["label", "definition", "reason", "similar_terms", "suggestion"]) + for mc in sub["manual_curation"]: + similar_terms = mc.get("similar_terms", []) + if isinstance(similar_terms, list): + similar = "; ".join( + f"{s.get('id', '')}={s.get('label', '')}" if isinstance(s, dict) + else str(s) + for s in similar_terms + ) + else: + similar = str(similar_terms) + 
writer.writerow([ + mc.get("label", ""), + mc.get("definition", ""), + mc.get("reason", ""), + similar, + mc.get("suggestion", ""), + ]) + print(f" Wrote manual_curation.tsv → {manual_curation_tsv}") + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Merge subagent definition outputs into the ROBOT template" + ) + parser.add_argument( + "--name", required=True, + help="Template name used in Stage 1 (e.g. 'hra-muscular')" + ) + args = parser.parse_args() + process(args.name) + + +if __name__ == "__main__": + main() diff --git a/bulk_ntr_workflow/scripts/register_templates.py b/bulk_ntr_workflow/scripts/register_templates.py new file mode 100644 index 000000000..dcc30a975 --- /dev/null +++ b/bulk_ntr_workflow/scripts/register_templates.py @@ -0,0 +1,193 @@ +#!/usr/bin/env python3 +"""Stage 5: register bulk-NTR templates with ODK and regenerate the Makefile. + +Discovers `src/templates/<name>*.template.tsv` files produced by Stage 4 +(`merge_definitions.py`), registers any that are not yet listed under +`components.products:` in `src/ontology/uberon-odk.yaml`, and adds matching +`import:` lines to `src/ontology/uberon-edit.obo`. Then runs `sh run.sh make +update_repo` from `src/ontology/` so the Makefile picks up the new components. + +Idempotent: re-running after a successful registration is a no-op for +already-registered templates.
+ +Usage: + uv run scripts/register_templates.py --name hra-muscular + uv run scripts/register_templates.py --name hra-muscular --skip-update-repo +""" + +from __future__ import annotations + +import argparse +import subprocess +import sys +from pathlib import Path + +REPO_ROOT = Path(__file__).resolve().parents[2] +TEMPLATE_DIR = REPO_ROOT / "src" / "templates" +ODK_YAML = REPO_ROOT / "src" / "ontology" / "uberon-odk.yaml" +ONTOLOGY_DIR = REPO_ROOT / "src" / "ontology" +EDIT_OBO = REPO_ROOT / "src" / "ontology" / "uberon-edit.obo" +COMPONENT_IMPORT_PREFIX = "import: http://purl.obolibrary.org/obo/uberon/components/" + + +def discover_templates(name: str) -> list[Path]: + """Return template TSVs whose stem starts with `<name>` (excludes -reports dirs).""" + files = sorted(TEMPLATE_DIR.glob(f"{name}*.template.tsv")) + return [f for f in files if f.is_file()] + + +def component_filename(template_path: Path) -> str: + """Map `hra-muscular-groups.template.tsv` → `hra_muscular_groups.owl`.""" + stem = template_path.name[: -len(".template.tsv")] + return stem.replace("-", "_") + ".owl" + + +def already_registered(yaml_text: str, component: str) -> bool: + # Match ` - filename: <component>` exactly on a line. 
+ return f"- filename: {component}\n" in yaml_text + + +def build_entry(component: str, template_filename: str) -> str: + return ( + f" - filename: {component}\n" + f" use_template: true\n" + f" templates:\n" + f" - {template_filename}\n" + ) + + +def insert_entries(yaml_text: str, entries: list[str]) -> str: + """Insert entries at the end of components.products: (before `workflows:`).""" + marker = "\nworkflows:" + idx = yaml_text.find(marker) + if idx < 0: + raise RuntimeError("Could not find `workflows:` section in uberon-odk.yaml") + return yaml_text[:idx] + "".join(entries) + yaml_text[idx:] + + +def add_imports_to_edit_obo(components: list[str]) -> list[str]: + """Add `import:` lines for each component to uberon-edit.obo, keeping the + components/ import block sorted alphabetically. Returns the list of components + that were newly added (i.e. excludes ones already present). + """ + lines = EDIT_OBO.read_text().splitlines(keepends=True) + existing = { + line.strip()[len("import: "):] + for line in lines + if line.startswith(COMPONENT_IMPORT_PREFIX) + } + new_iris = [] + for component in components: + iri = f"http://purl.obolibrary.org/obo/uberon/components/{component}" + if iri not in existing: + new_iris.append((component, iri)) + + if not new_iris: + return [] + + block_start = next( + (i for i, l in enumerate(lines) if l.startswith(COMPONENT_IMPORT_PREFIX)), + None, + ) + if block_start is None: + raise RuntimeError( + f"No `{COMPONENT_IMPORT_PREFIX}` lines found in {EDIT_OBO.name}; " + "cannot determine where to insert new component imports." 
+ ) + block_end = block_start + while block_end < len(lines) and lines[block_end].startswith(COMPONENT_IMPORT_PREFIX): + block_end += 1 + + block = lines[block_start:block_end] + [f"{COMPONENT_IMPORT_PREFIX}{c}\n" for c, _ in new_iris] + block.sort() + EDIT_OBO.write_text("".join(lines[:block_start] + block + lines[block_end:])) + return [c for c, _ in new_iris] + + +def register(name: str) -> list[Path]: + templates = discover_templates(name) + if not templates: + print(f"No templates found matching src/templates/{name}*.template.tsv", + file=sys.stderr) + sys.exit(1) + + yaml_text = ODK_YAML.read_text() + new_entries: list[str] = [] + registered: list[Path] = [] + skipped: list[Path] = [] + + for tpl in templates: + component = component_filename(tpl) + if already_registered(yaml_text, component): + skipped.append(tpl) + continue + new_entries.append(build_entry(component, tpl.name)) + registered.append(tpl) + + if new_entries: + ODK_YAML.write_text(insert_entries(yaml_text, new_entries)) + print(f"Registered {len(registered)} template(s) in {ODK_YAML.relative_to(REPO_ROOT)}:") + for tpl in registered: + print(f" + {component_filename(tpl)} ← {tpl.name}") + else: + print("All matching templates already registered in uberon-odk.yaml.") + + if skipped: + print(f"Skipped {len(skipped)} already-registered template(s):") + for tpl in skipped: + print(f" = {component_filename(tpl)} ← {tpl.name}") + + all_components = [component_filename(tpl) for tpl in templates] + added_imports = add_imports_to_edit_obo(all_components) + if added_imports: + print(f"\nAdded {len(added_imports)} import(s) to {EDIT_OBO.relative_to(REPO_ROOT)}:") + for c in added_imports: + print(f" + import: .../components/{c}") + else: + print(f"\nAll component imports already present in {EDIT_OBO.name}.") + + return registered + + +def run_update_repo() -> None: + print("\nRunning `sh run.sh make update_repo` (this may take several minutes)...") + result = subprocess.run( + ["sh", "run.sh", "make", 
"update_repo"], + cwd=ONTOLOGY_DIR, + ) + if result.returncode != 0: + sys.exit(f"update_repo failed with exit code {result.returncode}") + print("update_repo completed successfully.") + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__.splitlines()[0]) + parser.add_argument( + "--name", + required=True, + help="Base name used by Stage 4 (e.g. hra-muscular).", + ) + parser.add_argument( + "--skip-update-repo", + action="store_true", + help="Edit uberon-odk.yaml but skip the ODK Makefile regeneration step.", + ) + args = parser.parse_args() + + registered = register(args.name) + + if args.skip_update_repo: + if registered: + print("\nSkipping update_repo (per --skip-update-repo).") + print("Run `sh run.sh make update_repo` from src/ontology/ to wire components into the Makefile.") + return + + if not registered: + print("\nNothing changed — skipping update_repo.") + return + + run_update_repo() + + +if __name__ == "__main__": + main() diff --git a/bulk_ntr_workflow/template-id-allocation-ticket.md b/bulk_ntr_workflow/template-id-allocation-ticket.md new file mode 100644 index 000000000..b3770555a --- /dev/null +++ b/bulk_ntr_workflow/template-id-allocation-ticket.md @@ -0,0 +1,166 @@ +# Template-aware definitive ID allocation + +## Problem + +The `Temporary IDs` mechanism (`UBERON:99xxxxx` → minted into the `Automation` +range on PR merge by [`make allocate-definitive-ids`](../src/ontology/uberon.Makefile#L1412-L1417)) +does not work for ROBOT templates. + +`kgcl:mint` (via `robot-kgcl-plugin`) is not import-aware: + +- **Output**: `MintCommand.execute` constructs `OWLEntityRenamer(manager, + Sets.newHashSet(new OWLOntology[]{rootOntology}))` — the renamer set + contains only the root. IRIs in imported components are never rewritten. +- **Input**: `RandomizedIDGenerator.exists()` calls + `OWLOntology.containsEntityInSignature(IRI)` — the single-arg overload, + which defaults to `Imports.EXCLUDED` in OWLAPI 5+. 
IDs already minted into + an imported component (e.g. `components/hra_muscular.owl`, built from a + template) are invisible when mint picks the next free ID → collision risk. + +The GitHub Action that triggers minting +([allocate-definitive-ids.yml](../.github/workflows/allocate-definitive-ids.yml)) +is additionally restricted to `paths: src/ontology/uberon-edit.obo`, so a +template-only PR never triggers minting at all. + +Even if both `kgcl:mint` bugs were fixed upstream, the template TSV is the +source of truth — any in-OWL rewrite of the generated component would be +clobbered by the next ODK template rebuild. **Template ID allocation +fundamentally needs to operate on the TSV files.** + +## Proposal + +Run a separate template-ID allocator on merge to `master`, using a disjoint +sub-range of `Automation` so the two systems cannot collide. + +### 1. Partition the `Automation` range + +Split [`idrange:43`](../src/ontology/uberon-idranges.owl#L280-L284) into two +disjoint datatypes in `src/ontology/uberon-idranges.owl`: + +``` +Datatype: idrange:43 + Annotations: + allocatedto: "Automation (edit file, kgcl:mint)" + EquivalentTo: + xsd:integer[>= 1200000, < 1250000] + +Datatype: idrange:44 + Annotations: + allocatedto: "Templates-Automation" + EquivalentTo: + xsd:integer[>= 1250000, < 1300000] +``` + +No code change needed for `kgcl:mint` to honour the new boundary: +`RandomizedIDGenerator` reads bounds straight from the datatype. The +existing `make allocate-definitive-ids` invocation +(`--id-range-name Automation`) is automatically confined to `1200000–1250000`. + +**Precondition** — confirm no Automation IDs `≥ 1250000` have been minted: + +``` +obo-grep.pl -r 'id: UBERON:12[5-9][0-9]{4}' src/ontology/uberon-edit.obo +grep -rEh 'UBERON[:_]12[5-9][0-9]{4}' src/ontology/components/ +``` + +If any exist, pick a different split point or migrate them. + +### 2. 
Reserve a template temp-ID range + +Templates ship with `UBERON:99xxxxx` exactly as today (no curator-visible +change). Disambiguating template temps from edit-file temps is optional — +the allocator distinguishes by *which file the ID appears in*, not by the +ID value. + +### 3. Template-ID allocator script + +New script `src/scripts/allocate-template-ids.py` (Python; uv-runnable): + +1. Build (or read a pre-built) `uberon.owl` — the merged release artefact, + guaranteed to contain every ID minted into every component. +2. Extract all `UBERON:NNNNNNN` IRIs from the merged ontology via + `robot query` with a SPARQL `SELECT DISTINCT ?id` over the signature. +3. Read the `Templates-Automation` bounds from `uberon-idranges.owl` (parse + the same `Datatype: idrange:N` frames that mint parses, or obtain them + through an equivalent inspection of `robot kgcl:mint`). +4. Load `src/ontology/allocated-template-ids.tsv` (the ledger; see section 4) + to collect IDs that are *claimed* but not yet visible in the merged build + (concurrent PRs). +5. For each `UBERON:99xxxxx` ID in any `src/templates/*.template.tsv`: + - Pick the next free ID from the sub-range, skipping anything in the + extracted set ∪ ledger. + - Record the temp→definitive mapping. +6. Apply the mapping to every `src/templates/*.template.tsv` by pure string + substitution (anchored on the `UBERON:99` prefix; no OWL handling required). +7. Append new entries to the ledger and commit it alongside the templates. + +### 4. Ledger file + +`src/ontology/allocated-template-ids.tsv`, columns: `uberon_id`, `template`, +`label`, `pr`, `date`. Checked into the repo. Purpose: guard against the +concurrent-PR race in which PR A and PR B both build against a `uberon.owl` +that lacks the other's IDs; the ledger gives the allocator a second source +of "claimed" IDs to skip. + +### 5.
CI trigger + +Extend the `paths:` filter of [`allocate-definitive-ids.yml`](../.github/workflows/allocate-definitive-ids.yml), +or add a parallel workflow: + +```yaml +on: + push: + branches: [ master ] + paths: + - 'src/templates/**.template.tsv' + workflow_dispatch: +``` + +Workflow steps: + +``` +- make uberon.owl # full build with temp IDs +- python src/scripts/allocate-template-ids.py # rewrites TSVs + ledger +- sh run.sh make recreate-components # rebuild components +- commit + push (existing actions-js/push pattern) +``` + +The PR-dispatch mode used by `allocate-definitive-ids` today is the +recommended manual fallback (run before merge to avoid post-merge history +noise). + +### 6. `register_templates.py` follow-on + +Already done in this branch — `register_templates.py` now adds an `import:` +line to `uberon-edit.obo` for each registered component, sorted into the +existing components block. + +## Out of scope (separate upstream issue) + +The two `robot-kgcl-plugin` bugs should still be filed upstream — they are +one-liner fixes (pass `Imports.INCLUDED` to `containsEntityInSignature`; +include manager ontologies in the renamer set) and are useful independently +of this work. But fixing them does **not** remove the need for section 3 — the +TSV is the source of truth and must be rewritten directly.
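A minimal sketch of steps 5 and 6 of the allocator described in section 3, assuming the set of used IDs (merged build plus ledger) and the range bounds have already been collected. The function names `allocate` and `rewrite`, and the sequential next-free-ID strategy, are illustrative assumptions and not the final `allocate-template-ids.py`:

```python
import re

# Temporary IDs are `UBERON:99` followed by five digits (seven digits in total).
TEMP_ID = re.compile(r"UBERON:99\d{5}(?!\d)")


def allocate(template_texts, used_ids, lower, upper):
    """Map each temporary ID to the next free ID in the half-open range [lower, upper).

    template_texts: dict of template filename -> file content (hypothetical shape).
    used_ids: IDs already present in the merged build or the ledger.
    """
    taken = set(used_ids)  # copy, so the caller's set is not mutated
    temp_ids = sorted(
        {tid for text in template_texts.values() for tid in TEMP_ID.findall(text)}
    )
    mapping = {}
    candidate = lower
    for temp in temp_ids:
        # Skip anything already claimed by the build, the ledger, or this run.
        while f"UBERON:{candidate:07d}" in taken:
            candidate += 1
        if candidate >= upper:
            raise RuntimeError("Templates-Automation range exhausted")
        definitive = f"UBERON:{candidate:07d}"
        mapping[temp] = definitive
        taken.add(definitive)
    return mapping


def rewrite(text, mapping):
    """Substitute mapped temporary IDs in a template; unmapped IDs stay as-is."""
    return TEMP_ID.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)
```

Sorting the temporary IDs keeps the mapping deterministic across runs, which makes the resulting TSV diff and the ledger entries easier to review.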
+ +## Implementation checklist + +- [ ] Verify nothing in the existing ontology already uses + `UBERON:125xxxx`–`UBERON:129xxxx` +- [ ] Split `Automation` in `uberon-idranges.owl` (`idrange:43` + new + `idrange:44 Templates-Automation`) +- [ ] Write `src/scripts/allocate-template-ids.py` +- [ ] Add `src/ontology/allocated-template-ids.tsv` (empty header row) +- [ ] Extend `paths:` filter in `allocate-definitive-ids.yml` (or new + workflow) for `src/templates/**.template.tsv` +- [ ] Document workflow in `docs/id-management.md` — new section + "Template ID allocation" +- [ ] Test: dry-run on `hra-muscular.template.tsv` (currently full of + `UBERON:99xxxxx`); verify rewrite + ledger update + rebuilt component + +## Related + +- `docs/id-management.md` — current temp-ID docs +- `bulk_ntr_workflow/` — the workflow that surfaces this gap +- `src/ontology/uberon-idranges.owl` — range definitions +- `src/ontology/tmp/plugins/kgcl.jar` — `MintCommand`, `RandomizedIDGenerator`
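For step 3 of the allocator, the range bounds could be read with a small regular expression over the Manchester-syntax frames. This is a hedged sketch that assumes `uberon-idranges.owl` keeps the `Datatype:` / `allocatedto:` / `xsd:integer[...]` layout shown in section 1, with `Datatype:` at the start of a line; `read_range_bounds` is a hypothetical helper name:

```python
import re

# Matches facets such as `xsd:integer[>= 1250000, < 1300000]`.
FACET = re.compile(r"xsd:integer\[\s*(>=?)\s*(\d+)\s*,\s*(<=?)\s*(\d+)\s*\]")


def read_range_bounds(idranges_text: str, allocated_to: str) -> tuple[int, int]:
    """Return the half-open interval [lower, upper) for the named ID range."""
    # Everything before the first `Datatype:` frame (prefixes, ontology header) is dropped.
    frames = re.split(r"(?m)^Datatype:", idranges_text)[1:]
    for frame in frames:
        if f'allocatedto: "{allocated_to}"' not in frame:
            continue
        match = FACET.search(frame)
        if match is None:
            raise ValueError(f"No integer facet found for range '{allocated_to}'")
        low_op, low, high_op, high = match.groups()
        # Normalise strict and inclusive comparisons to a half-open interval.
        lower = int(low) + (0 if low_op == ">=" else 1)
        upper = int(high) + (1 if high_op == "<=" else 0)
        return lower, upper
    raise KeyError(f"No ID range allocated to '{allocated_to}'")
```

Normalising to a half-open interval means the allocator loop can use a single `candidate < upper` check regardless of which comparison operators the range file uses.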