Add the implementation plan for building project taxa lists from a region #1366

Draft
mihow wants to merge 4 commits into main from plan/regional-taxa-lists-1364

mihow commented Jul 2, 2026

Collaborator

Summary

Class masking (#999) can cut a global classifier down to the species that actually occur at a site, but only if someone first curates a taxa list for the project — and today that means adding taxa one at a time, which is too tedious to expect of project owners. This PR does not add any code; it adds the implementation plan for the approach proposed in #1364 ("Proposal A"): let a user generate a regional species list automatically from an external biodiversity database (GBIF and/or iNaturalist) by giving a region code, so building a masking list becomes a single action.

The point of opening this as a draft is to get agreement on the design before implementation starts — in particular the two decisions that shape everything downstream: (1) when multiple sources are used they are combined as a wide union with per-source provenance, never intersected against each other; and (2) the list is, by default, restricted to species the classifiers can actually predict, with a stored "model coverage" relationship so the UI can be honest about regional species that no model will ever predict (many valid species lack training data). The same core service is designed to be reused by a management command, a Django admin action, an API endpoint, the main UI, and unit tests — which also lets us backfill regional lists for every existing project.
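
As a rough illustration of the two decisions above, the sketch below merges per-source species sets into a wide union that records which sources reported each name, then splits that union by model coverage. The function names `merge_sources` and `split_by_coverage`, the exact-string name matching, and the example species are assumptions made for illustration; none of them are taken from the plan document.

```python
from collections import defaultdict


def merge_sources(species_by_source: dict[str, set[str]]) -> dict[str, set[str]]:
    """Wide union: keep every species any source reports, with provenance.

    Returns a mapping of species name -> set of source names that reported it.
    Sources are never intersected with each other.
    """
    provenance: dict[str, set[str]] = defaultdict(set)
    for source, names in species_by_source.items():
        for name in names:
            provenance[name].add(source)
    return dict(provenance)


def split_by_coverage(
    provenance: dict[str, set[str]], model_labels: set[str]
) -> tuple[dict[str, set[str]], dict[str, set[str]]]:
    """Split the union into species a model can predict and species it cannot.

    Assumption: coverage is decided by exact name match against the label set.
    """
    covered = {n: s for n, s in provenance.items() if n in model_labels}
    uncovered = {n: s for n, s in provenance.items() if n not in model_labels}
    return covered, uncovered


# Illustrative data only.
merged = merge_sources(
    {
        "gbif": {"Actias luna", "Danaus plexippus"},
        "inat": {"Danaus plexippus", "Vanessa cardui"},
    }
)
covered, uncovered = split_by_coverage(merged, {"Actias luna", "Danaus plexippus"})
```

Here `covered` would form the default masking list, while `uncovered` corresponds to regional species that no model can predict and that the UI could flag.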

No migrations, models, or endpoints are included yet. This is the plan; the code lands in the phased slices it describes, starting with a Phase 0 spike to verify the external APIs (none of the GBIF/iNaturalist endpoints have been exercised against the live services — they are flagged CANDIDATE/UNVERIFIED throughout).

Planning for #1364. Design writeup: docs/claude/planning/2026-07-02-regional-taxa-lists-class-masking.md.

List of Changes

| # | Change (what it adds) | Detail |
|---|---|---|
| 1 | Adds the Proposal A implementation plan document | docs/claude/planning/2026-07-02-proposal-a-regional-taxa-lists-impl-plan.md (14 sections, docs-only, no executable change) |
| 2 | Specifies the reusable core service | generate_regional_taxa_list(...) with a Result breakdown, surfaced through five thin wrappers (command, admin action, API, UI, tests) |
| 3 | Specifies the multi-source design | A RegionalSpeciesSource protocol + a wide union merge that keeps per-source provenance; sources are never intersected with each other |
| 4 | Specifies model & DB awareness | Default list = regional species ∩ model-covered taxa; a stored Taxon.covered_by_algorithms relationship (+ a denormalized has_model_coverage flag) plus a refresh path, so the UI can flag regional species no model can predict |
| 5 | Specifies the data-model changes | New region_source / region_code / taxa-list fields on Site and Project, plus the coverage relationship; each comes with its migration, called out but not implemented here |
| 6 | Specifies auto-apply masking | A taxa_list_mode="auto" resolution ladder (occurrence → site → project) that no-ops until a region/list is configured; lands on the #999 branch since class_masking.py is not on main yet |
| 7 | Lays out a TDD test plan and phased rollout | Ten tests written first; internal-first phases (spike → core service → command/admin → masking wiring → API/UI), with an explicit "what to verify before building" list |
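
The resolution ladder in row 6 can be pictured with a minimal sketch. It assumes the ladder means "most specific level wins" and that each level may or may not carry a taxa list. The `Project`, `Site`, and `Occurrence` classes below are plain stand-ins rather than the real Django models, and the `taxa_list` attribute and `resolve_taxa_list` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    taxa_list: Optional[str] = None


@dataclass
class Site:
    project: Project
    taxa_list: Optional[str] = None


@dataclass
class Occurrence:
    site: Site
    taxa_list: Optional[str] = None


def resolve_taxa_list(occurrence: Occurrence) -> Optional[str]:
    """Walk occurrence -> site -> project and return the first configured list.

    Returns None when nothing is configured, in which case masking is a no-op.
    """
    candidates = (
        occurrence.taxa_list,
        occurrence.site.taxa_list,
        occurrence.site.project.taxa_list,
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None
```

With nothing configured the function returns `None`, which matches the "no-ops until a region/list is configured" behaviour described in the table.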

The branch is docs-only. CI is skipped on these commits ([skip ci]). Nothing outside docs/claude/planning/ is touched.

mihow and others added 3 commits July 2, 2026 13:58
…) [skip ci]

Staged, TDD-oriented implementation plan for building a project taxa list
from a geographic region so class masking works out of the box. Covers the
reusable core service, the union-with-provenance source merge (sources union,
never intersect), Site/Project data-model fields, region derivation for
backfill, the class-masking auto-resolution order, the five surfaces, a
test plan, and a phased rollout.

Refs #1364, #999, #1289

Co-Authored-By: Claude <noreply@anthropic.com>
…ip ci]

Fold the model/DB-awareness requirement into the Proposal A plan as a
first-class section. The regional-list generator now, by default, subsets the
union of source species to those a classifier can actually predict (name in
some AlgorithmCategoryMap label set), with an opt-in flag to also create
uncovered regional species flagged as not classifiable.

Confirmed by code reading that no persisted Taxon-to-Algorithm/CategoryMap link
exists today (with_taxa() resolves names live, unpersisted), so the plan adds a
persisted relationship (category-map-anchored M2M plus a denormalized
Taxon.is_classifiable boolean) and a refresh path keyed on labels_hash. Updates
the Result dataclass with explicit buckets, the data-model and test-plan
sections, and the open-questions list.

Refs #1364, #999, #1289

Co-Authored-By: Claude <noreply@anthropic.com>
#1364) [skip ci]

Rename the persisted model-coverage relationship to the through-model the
requester asked for: Taxon.covered_by_algorithms (M2M to Algorithm) so the
list and UI can show which model is aware of a taxon, with Taxon.has_model_coverage
as the denormalized boolean MVP. The category-map-anchored variant is retained
as a noted deduplication alternative and open question, since many algorithms
share one category map. Updates the data-model section, the options table and
recommendation, the refresh helpers, the Result-consuming step, the test plan,
and the open questions accordingly.

Refs #1364, #999, #1289

Co-Authored-By: Claude <noreply@anthropic.com>
mihow added the enhancement, needs design, backend, and ml labels Jul 2, 2026

Phase 0 spike verified GBIF/iNat regional endpoints and the A3 reverse-geocode
path against a live run, and measured 70% coverage of the 2497-label Quebec &
Vermont classifier from a Vermont region list. Verdict: GO for Proposal A.

Co-Authored-By: Claude <noreply@anthropic.com>

mihow commented Jul 2, 2026

Collaborator Author

Phase 0 (de-risk spike) ran — verdict: GO.

Measured against the real 2497-label Quebec & Vermont classifier, a Vermont region list (GBIF ∪ iNat) covers 70.0% (1749/2497) of the classifier's labels, so the default masking list would keep 1749 classes and mask ~748 (30%) that neither source records in Vermont. GBIF alone reaches 69.8%; iNat adds little to the intersection but feeds the 550-species include_uncovered bucket — so Phase 1 starts GBIF-first. The A3 reverse-geocode path resolved (44.26,-72.58) → GADM1 USA.46_1 cleanly, so the --all-projects backfill is viable.
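
The headline percentages follow directly from the two counts quoted above. A small sketch of the arithmetic, where `coverage_report` is a hypothetical helper and not part of the spike script:

```python
def coverage_report(n_labels: int, n_covered: int) -> dict[str, float]:
    """Summarise how many classifier labels a regional list keeps or masks."""
    n_masked = n_labels - n_covered
    return {
        "kept": n_covered,
        "masked": n_masked,
        "kept_pct": round(100 * n_covered / n_labels, 1),
        "masked_pct": round(100 * n_masked / n_labels, 1),
    }


# Counts taken from the Phase 0 numbers: 1749 of 2497 labels covered.
report = coverage_report(n_labels=2497, n_covered=1749)
print(report)
# {'kept': 1749, 'masked': 748, 'kept_pct': 70.0, 'masked_pct': 30.0}
```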

Top risk to carry into Phase 1: name-join fragility — 30% of labels are absent from the region union, a mix of true regional absences and likely name/synonym mismatches that needs a sample audit. Full numbers, caveats, and the reproducible script are in docs/claude/analysis/ on this branch (commit 574f9e3).
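
One possible way to start the sample audit is to draw a reproducible random sample of classifier labels that are missing from the region union and review them by hand. This is only a sketch: `audit_sample`, its parameters, and the exact-string comparison are assumptions, and the reproducible script on the branch may work differently.

```python
import random


def audit_sample(
    model_labels: list[str],
    regional_names: list[str],
    k: int = 50,
    seed: int = 0,
) -> list[str]:
    """Return up to k unmatched labels, chosen reproducibly for manual review."""
    # Sort first so the sample depends only on the seed, not on set ordering.
    unmatched = sorted(set(model_labels) - set(regional_names))
    rng = random.Random(seed)
    return rng.sample(unmatched, min(k, len(unmatched)))


# Illustrative data only.
sample = audit_sample(
    model_labels=["Actias luna", "Danaus plexippus", "Vanessa cardui"],
    regional_names=["Danaus plexippus"],
    k=2,
    seed=1,
)
```

Each sampled label can then be checked against the source databases to tell a true regional absence from a name or synonym mismatch.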
