NLP attribute extraction from part free-text (organism, spectra, markers, …) #148

Motivation
Parts carry rich attributes that are buried in the free-text name and description and absent from the SBOL2 data model. For example, BBa_E0040 is described as "green fluorescent protein derived from jellyfish Aequorea victoria … Ex 488 / Em 507". Neither the species nor the excitation and emission wavelengths are structured fields today, so they cannot be faceted or filtered.

This task extracts these attributes into structured indexed fields, which then (a) power faceted search and (b) improve retrieval. It generalizes and replaces the earlier organism-only feasibility stretch.

Approach — hybrid, precision-first
Closed-vocabulary / structured attributes (Ex/Em, resistance markers, known fluorophores) → regex + dictionaries. High precision, cheap, no GPU.
Open attributes (species) → NER or LLM extraction, then map to NCBI Taxonomy.
Precision over recall: a wrong attribute (wrong species, wrong Ex) is worse than a missing one — it misleads users and pollutes facets. Leave the field blank when unsure.
Run offline at index-build time (same place clusters/pagerank are computed), not per-request.
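
A minimal sketch of the closed-vocabulary side, to make the precision-first rule concrete. Everything named here is hypothetical: the function names, the accepted label spellings, the 300–800 nm plausibility window, and the alias table are assumptions to be tuned against the gold sample, not an agreed design. The extractor returns a blank (None) whenever the text is ambiguous: conflicting values, slash-separated pairs such as "Ex/Em 488/507", or an emission that is not longer than the excitation.

```python
import re
from typing import Dict, List, Optional

# Illustrative patterns only: label spellings and filler words are assumptions.
# The trailing lookahead skips slash-separated pairs ("488/507"), which are ambiguous.
_TAIL = r"\b(?:\s+(?:max\w*|peak|at)){0,2}\s*[:=]?\s*(\d{3})(?!\d|\s*/\s*\d)"
_EX = re.compile(r"\b(?:ex|excitation)" + _TAIL, re.IGNORECASE)
_EM = re.compile(r"\b(?:em|emission)" + _TAIL, re.IGNORECASE)


def _unique_nm(pattern: "re.Pattern[str]", text: str) -> Optional[int]:
    """Return the wavelength only if exactly one plausible value is present."""
    values = {int(v) for v in pattern.findall(text)}
    values = {v for v in values if 300 <= v <= 800}  # assumed plausibility window
    return values.pop() if len(values) == 1 else None


def extract_spectra(text: str) -> Dict[str, Optional[int]]:
    ex = _unique_nm(_EX, text)
    em = _unique_nm(_EM, text)
    # Emission is normally at a longer wavelength than excitation;
    # if the pair contradicts that, leave both fields blank.
    if ex is not None and em is not None and em <= ex:
        ex, em = None, None
    return {"excitation_nm": ex, "emission_nm": em}


# Hypothetical alias table. "TetR" is deliberately left out because it also
# names the tetracycline repressor, which would hurt precision.
_MARKER_ALIASES = {
    "ampicillin": ["ampicillin resistance", "ampr"],
    "kanamycin": ["kanamycin resistance", "kanr"],
    "chloramphenicol": ["chloramphenicol resistance", "cmr"],
}


def extract_resistance_markers(text: str) -> List[str]:
    found = []
    for marker, aliases in _MARKER_ALIASES.items():
        if any(re.search(rf"\b{re.escape(a)}\b", text, re.IGNORECASE) for a in aliases):
            found.append(marker)
    return sorted(found)
```

On the BBa_E0040 description quoted in the motivation, `extract_spectra` would return `{"excitation_nm": 488, "emission_nm": 507}`; on "Ex/Em 488/507" it returns blanks for both fields rather than guessing.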

Milestones
M1 — Scope + gold sample + harness

- [ ] Hand-label a gold sample (~150–200 parts) for the initial attributes (Ex, Em, resistance marker, species).
- [ ] Build an extraction-eval harness that reports per-attribute precision and recall against the gold sample.

M2 — Build extractors + measure

- [ ] Implement the easy ones first (Ex/Em regex, resistance-marker dictionary).
- [ ] Implement species extraction (NER/LLM), map each hit to NCBI Taxonomy, and derive the genus from the matched taxon.
- [ ] Report precision/recall per attribute on the gold sample.

M3 — Ship passing attributes

- [ ] Add only the attributes that clear the gate as new facet-enabled indexed fields in Typesense.
- [ ] Document which attributes shipped, their accuracy, and how to add a new one.

Gate (per attribute, not all-or-nothing)
Ship an attribute as a facet only if precision ≥ ~85% on the gold sample (precision-first, since wrong metadata misleads). Attributes below the bar are dropped and reported as honest negative findings (e.g., "species could not be reliably extracted from free text"). Recall is reported but not gating.
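
The gate can be sketched as below. This is one possible scoring convention, not a settled one: a blank prediction never counts against precision, and a wrong value on a part that has a gold value counts as both a false positive and a false negative. The names `precision_recall` and `passes_gate` are hypothetical, and the 0.85 default simply encodes the "~85%" bar stated above.

```python
from typing import Dict, Hashable, Optional, Tuple


def precision_recall(
    gold: Dict[str, Optional[Hashable]],
    predicted: Dict[str, Optional[Hashable]],
) -> Tuple[Optional[float], Optional[float]]:
    """Score one attribute. Both dicts map part id -> value, None meaning blank."""
    tp = fp = fn = 0
    for part_id, truth in gold.items():
        guess = predicted.get(part_id)
        if guess is None:
            if truth is not None:
                fn += 1          # extractor stayed silent on a labeled part
        elif guess == truth:
            tp += 1
        else:
            fp += 1              # wrong or spurious value
            if truth is not None:
                fn += 1          # and the real value was missed
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


def passes_gate(precision: Optional[float], threshold: float = 0.85) -> bool:
    """Recall is reported but not gating; an attribute with no predictions does not ship."""
    return precision is not None and precision >= threshold
```

For instance, with gold values `{"a": 488, "b": 507, "c": None, "d": 395}` and predictions `{"a": 488, "b": 500, "c": None, "d": None}`, precision is 0.5 and recall is 1/3, so the attribute would not ship.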

Deliverables
Offline extraction module (runs at index build).
Per-attribute precision/recall report (paper artifact + decides what ships).
New facet-enabled fields for the passing attributes.
The labeled gold sample (reusable).

Output fields (proposed)
excitation_nm (number), emission_nm (number), resistance_marker (string[]), fluorophore (string), species (string), genus (string), taxon_id (string) — all facet: true.
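
As a rough illustration of how the proposed fields might be expressed as Typesense field definitions, the sketch below builds the list for only those attributes that passed the gate. The choice of `int32` for the "number" fields and of `optional: True` (so a blank extraction is allowed) are assumptions that need confirming against the actual index schema; `PROPOSED_FIELDS` and `facet_fields` are hypothetical names.

```python
from typing import Dict, Iterable, List

# Field name -> assumed Typesense type; the issue only says "number" / "string".
PROPOSED_FIELDS: Dict[str, str] = {
    "excitation_nm": "int32",
    "emission_nm": "int32",
    "resistance_marker": "string[]",
    "fluorophore": "string",
    "species": "string",
    "genus": "string",
    "taxon_id": "string",
}


def facet_fields(shipped: Iterable[str]) -> List[dict]:
    """Build field definitions for the attributes that cleared the gate."""
    shipped = set(shipped)
    return [
        # optional=True lets a part omit the field when extraction was unsure
        {"name": name, "type": ftype, "facet": True, "optional": True}
        for name, ftype in PROPOSED_FIELDS.items()
        if name in shipped
    ]
```

Calling `facet_fields({"excitation_nm", "emission_nm"})` yields two definitions, in the order the fields are declared above.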

Dependencies & links
Feeds the faceted search frontend issue (these become new facets).
Needs the index schema to accept new fields (coordinate with @cl117 / Phase 1).
Supersedes the earlier "organism/strain extraction feasibility" stretch — fold that in here.
