
Generate a project taxa list from a region, and track which species the models can predict #1367

Draft
mihow wants to merge 10 commits into main from feat/regional-taxa-lists

@mihow mihow commented Jul 3, 2026

Collaborator

Summary

Class masking (#999) can cut a global classifier down to the species that actually occur where a project monitors, but only if someone first curates a taxa list — and doing that by hand, one taxon at a time, is too tedious to expect of project owners. This PR lets a project build that list automatically from a geographic region: give a region (or let it be derived from the project's deployments) and it pulls the species recorded there from an external biodiversity database, keeps the ones a classifier can actually predict, and saves them as the project's taxa list. Class masking can then resolve that list on its own.

The same core capability is reachable four ways: a management command (including an --all-projects backfill), Django admin actions, a REST endpoint, and unit tests all call one service, so we can also generate regional lists for every existing project in one pass. This is the backend and API; the in-app UI button is the remaining piece (see the checklist below).

Two design decisions are worth a reviewer's attention:

  1. Multiple sources union, they never intersect. A species present in any source (GBIF today, iNaturalist next behind the same protocol) is a candidate; the merge is a wide union that preserves per-source provenance. Sources are never intersected against each other.
  2. The list is honest about what the models can predict. By default it is restricted to species some classifier can output — tracked by a new, persisted Taxon ↔ Algorithm relationship. An opt-in mode also keeps regional species no model can predict, flagging each so the UI can say so.
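
The wide-union rule in point 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's merge_source_species(): the function name, the MergedSpecies class, and the lowercased-name key are all hypothetical, and the species names are example data.

```python
from dataclasses import dataclass, field


@dataclass
class MergedSpecies:
    """One candidate species plus the set of sources that reported it."""

    name: str
    sources: set = field(default_factory=set)


def merge_sources_wide_union(per_source):
    """Union species across sources, remembering which source(s) reported each.

    per_source maps a source name to the species names it returned.
    """
    merged = {}
    for source_name, species_names in per_source.items():
        for raw_name in species_names:
            name = raw_name.strip()
            key = name.lower()  # hypothetical normalisation, for illustration only
            entry = merged.setdefault(key, MergedSpecies(name=name))
            entry.sources.add(source_name)  # provenance is kept, never dropped
    return merged


merged = merge_sources_wide_union(
    {
        "gbif": ["Actias luna", "Hyalophora cecropia"],
        "inaturalist": ["Actias luna", "Danaus plexippus"],
    }
)
```

A species reported by only one source stays in the result; an intersection would have kept only the species both sources agree on.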

Phase 0 measured this against the real 2497-label Quebec & Vermont classifier: a Vermont region list covers 70% of its labels, so the default masking list keeps ~1749 classes and drops ~748 that neither GBIF nor iNaturalist records in Vermont (full findings in docs/claude/analysis/, verdict: proceed).
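
As a quick check of those figures, using only the counts quoted above (the kept and dropped counts are approximate in the source):

```python
total_labels = 2497  # labels in the Quebec & Vermont classifier
kept = 1749          # approximate number of classes kept by the Vermont list
dropped = 748        # approximate number of classes dropped

# The two buckets should account for every label.
accounted_for = kept + dropped

# Share of classifier labels covered by the regional list.
coverage_fraction = kept / total_labels
coverage_percent = round(coverage_fraction * 100)
```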

Verification: the feature ships with its own tests (44 in tests_regional_taxa.py plus 12 for masking auto-mode), and the existing taxonomy/taxa-list/class-masking suites still pass. makemigrations --check is clean; black/isort/flake8 pass. No existing behavior changes until a project configures a region.

Planning: #1364. Plan/design PR: #1366.

List of Changes

| # | Change (effect) | How (implementation) |
|---|---|---|
| 1 | A project can build a taxa list from a region | Core service ami/main/services/regional_taxa.py: generate_regional_taxa_list(), the wide-union merge_source_species(), map_to_taxa(), apply_model_coverage(), Result with per-bucket counts |
| 2 | Regional species come from GBIF (primary source, verified in Phase 0) | services/gbif.py: occurrence facet by GADM region + species-name resolution + reverse-geocode; iNaturalist later behind the same protocol |
| 3 | The list knows which species the models can predict | Persisted Taxon.covered_by_algorithms (M2M → Algorithm) + has_model_coverage flag; services/taxon_coverage.py refreshes it from category-map labels, hook keyed on labels_hash, targeted refresh for newly created taxa |
| 4 | Operators can generate lists and backfill every project | generate_regional_taxa_list management command with --all-projects (derives each region from deployments), --dry-run, --include-uncovered; refresh_taxon_model_coverage command |
| 5 | Admins can generate a list from the changelist | Project & Site admin actions that enqueue a background task (the external fetch is slow); region fields exposed in both admins |
| 6 | Class masking can pick the list automatically | ClassMaskingConfig.taxa_list_mode="auto" resolves the list from the occurrence's site, then the project's default; a no-op when nothing is configured, so masking is safe to enable by default |
| 7 | The app/API can trigger generation | POST /projects/{id}/generate-regional-taxa-list/ enqueues the task and returns 202; requires project update permission; region derivable from deployments |
| 8 | A site or project records its region | region_source / region_code fields on Site and Project, plus taxa_list (Site) / default_taxa_list (Project) FKs; migration 0095 |
| 9 | Shared taxonomy building | Extracted the create_taxon hierarchy builder from import_taxa into services/taxonomy.py (no behavior change) |

Still to do

  • In-app UI button ("Create taxa list for region") in project settings, with screenshots — the remaining phase.
  • iNaturalist source (second source behind the existing protocol).
  • Name-mismatch audit of the ~30% of classifier labels not attested in-region (the top risk flagged in Phase 0), to decide how much fuzzy matching to add.

mihow and others added 5 commits July 2, 2026 15:19
…service

Move Command.create_taxon() and get_or_create_root_taxon() out of the
import_taxa management command and into ami.main.services.taxonomy, so
the regional taxa-list service can reuse the same rank-hierarchy builder
instead of re-deriving it. Behaviour is unchanged; import_taxa now calls
the extracted functions instead of defining them locally.

Co-Authored-By: Claude <noreply@anthropic.com>
…el-coverage relationship

Part of #1364 (regional taxa lists for class masking), Phase 1.

Adds the data-model plumbing for region-derived taxa lists:
- Site/Project gain region_source, region_code, and a taxa_list /
  default_taxa_list FK, so a project or one of its research sites can be
  tied to a geographic region and a designated TaxaList.
- Taxon gains covered_by_algorithms (M2M to ml.Algorithm) and the
  denormalized has_model_coverage boolean, answering "which classifier(s),
  if any, can predict this taxon" without a live label-set join at read
  time.

Coverage is derived data, computed by ami.main.services.taxon_coverage
from each algorithm's category map labels (the same Taxon.name == label
join AlgorithmCategoryMap.with_taxa() uses for masking). Algorithm.save()
refreshes coverage automatically whenever its category_map link changes;
the refresh_taxon_model_coverage management command does a full rebuild
for the initial backfill or to repair drift from a write path that
bypasses the hook (e.g. a bulk_update).
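
The name-to-label join described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in ami.main.services.taxon_coverage: compute_coverage() and its dictionary inputs are hypothetical stand-ins for the ORM queries, and the algorithm and species names are example data.

```python
def compute_coverage(taxon_names, labels_by_algorithm):
    """Map each taxon name to the set of algorithms whose labels include it.

    The join is an exact match between a taxon's name and a category-map label.
    """
    coverage = {name: set() for name in taxon_names}
    for algorithm, labels in labels_by_algorithm.items():
        for label in labels:
            if label in coverage:
                coverage[label].add(algorithm)
    return coverage


coverage = compute_coverage(
    taxon_names=["Actias luna", "Danaus plexippus"],
    labels_by_algorithm={
        "classifier_a": ["Actias luna", "Hyalophora cecropia"],
        "classifier_b": ["Actias luna"],
    },
)

# The denormalized flag is simply "is the set non-empty?".
has_model_coverage = {name: bool(algos) for name, algos in coverage.items()}
```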

Co-Authored-By: Claude <noreply@anthropic.com>
Adds generate_regional_taxa_list(), the core service that turns a
geographic region into a project-scoped TaxaList: fetch species recorded
in the region from GBIF, merge multiple sources with a wide union (never
an intersection - a species in any source is a candidate), map merged
species onto Taxon rows (matching by GBIF/iNat key or name, creating
missing ones via the shared taxonomy hierarchy builder), then restrict
to species some classifier can actually predict using the persisted
model-coverage relationship.

By default the saved list keeps only model-covered species, since class
masking can't do anything with a species no classifier knows.
include_uncovered=True opts into keeping the rest too, honestly flagged
has_model_coverage=False so the UI/reporting can distinguish "in the
region" from "a model can predict it." A single classifier can also be
passed for a report-only coverage count that never changes what's saved.
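
A minimal sketch of the default and opt-in behaviours, assuming coverage is already known as a set of names. select_list_rows() and the row dictionaries are hypothetical; the service's apply_model_coverage() works on Taxon rows and may differ in detail.

```python
def select_list_rows(regional_species, covered_names, include_uncovered=False):
    """Return the rows that would be saved to the taxa list.

    By default only model-covered species are kept. With include_uncovered=True
    every regional species is kept and flagged with its coverage status.
    """
    rows = []
    for name in regional_species:
        has_model_coverage = name in covered_names
        if has_model_coverage or include_uncovered:
            rows.append({"name": name, "has_model_coverage": has_model_coverage})
    return rows


regional = ["Actias luna", "Danaus plexippus", "Hyalophora cecropia"]
covered = {"Actias luna", "Hyalophora cecropia"}

default_rows = select_list_rows(regional, covered)
all_rows = select_list_rows(regional, covered, include_uncovered=True)
```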

Idempotent: re-running for the same (name, project) updates the existing
list rather than creating a duplicate.

GBIFRegionalSource is the first concrete source (species search faceted
by GBIF's speciesKey, endpoints exercised in the #1364 Phase 0 spike);
iNaturalist can be added later behind the same RegionalSpeciesSource
protocol without changing the merge or mapping logic.

Every test uses a stubbed source or a monkeypatched HTTP session - no
network calls in the suite.

Co-Authored-By: Claude <noreply@anthropic.com>
logger.warn has been deprecated since Python 3.3. These two calls were moved
verbatim from import_taxa.py into the extracted taxonomy service; switch them
to logger.warning while they are being relocated.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow added the enhancement, backend, and ml labels Jul 3, 2026

mihow and others added 5 commits July 2, 2026 19:24
… backfill

Operator/backfill entry point over the regional-taxa service (#1364, Phase 2).
Runs for one project with an explicit GADM region, or --all-projects derives each
project's region from a representative deployment's coordinates via GBIF
reverse-geocode (path A3). Adds reverse_geocode_gadm() to the GBIF client and
derive_region_for_project() to the service, both with a test seam so nothing hits
the network in CI. 11 new tests (reverse-geocode level selection, region
derivation, command arg wiring + the two guards).

Co-Authored-By: Claude <noreply@anthropic.com>
…axa list

Adds a background task (generate_regional_taxa_list_task) and admin actions on the
Project and Site changelists that enqueue it for rows with a region configured. The
generated list is linked to project.default_taxa_list or site.taxa_list, which the
masking auto-resolution reads. Runs off the request path because the external fetch
is slow. Exposes the region_source/region_code/taxa-list fields in both admins.
4 new tests (task links list to project vs. site; actions enqueue only configured
rows with the right scope). Part of #1364, Phase 2.

Co-Authored-By: Claude <noreply@anthropic.com>
… the region

Adds a taxa_list_mode to class masking. In 'auto' mode the taxa list is resolved
from the scope's configured region instead of an operator picking one each run: an
occurrence prefers its site's list, then its project's default; a collection resolves
at the project level. When nothing is configured the run is a safe no-op, so a
pipeline can enable masking before a project has set up a region. The explicit path
(taxa_list_id) is unchanged and still the default. The admin form gains a source
toggle. 12 new tests (config validation, the resolution ladder, the no-op path).
Part of #1364, Phase 3.
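
The resolution ladder for an occurrence can be sketched as below. The dataclasses and resolve_taxa_list() are hypothetical stand-ins for the Django models and the masking code, and the masking step is simplified to a filter over a label-to-score mapping; only the site-then-project order and the no-op fallback come from the description above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Project:
    default_taxa_list: Optional[FrozenSet[str]] = None


@dataclass
class Site:
    taxa_list: Optional[FrozenSet[str]] = None


def resolve_taxa_list(site, project):
    """Prefer the site's list, then the project's default, else None."""
    if site is not None and site.taxa_list is not None:
        return site.taxa_list
    return project.default_taxa_list


def mask_scores(scores, taxa_list):
    """Keep only labels on the list; with no list, return the scores unchanged."""
    if taxa_list is None:
        return dict(scores)  # safe no-op when nothing is configured
    return {label: score for label, score in scores.items() if label in taxa_list}


scores = {"Actias luna": 0.6, "Danaus plexippus": 0.4}
site_list = frozenset({"Actias luna"})
project_list = frozenset({"Danaus plexippus"})

from_site = resolve_taxa_list(Site(site_list), Project(project_list))
from_project = resolve_taxa_list(Site(), Project(project_list))
unconfigured = resolve_taxa_list(None, Project())
```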

Co-Authored-By: Claude <noreply@anthropic.com>
POST /projects/{id}/generate-regional-taxa-list/ enqueues the background
generation task and returns 202; the generated list becomes the project's
default_taxa_list. region_code may be omitted to derive it from the project's
deployments. Requires update permission on the project. 6 tests cover the
permission matrix (editor 202, non-editor and anonymous 403), body validation
(invalid source / underivable region -> 400), and region derivation. Part of
#1364, Phase 4.
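
A client call might look roughly like this. The base URL, the token auth header, and the placeholder region code are assumptions for illustration; only the path, the POST method, the optional region_code field, and the 202/400/403 responses come from the description above. The request is built but not sent.

```python
import json
import urllib.request


def build_generate_request(base_url, project_id, token, region_code=None):
    """Build (but do not send) the POST that enqueues list generation.

    Omitting region_code asks the server to derive it from the deployments.
    """
    body = {}
    if region_code is not None:
        body["region_code"] = region_code
    return urllib.request.Request(
        url=f"{base_url}/projects/{project_id}/generate-regional-taxa-list/",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {token}",  # auth scheme is an assumption
        },
        method="POST",
    )


request = build_generate_request(
    base_url="https://antenna.example.org/api/v2",  # hypothetical host and prefix
    project_id=42,
    token="example-token",
    region_code="XXX.1_1",  # placeholder, not a real GADM code
)
derived = build_generate_request("https://antenna.example.org/api/v2", 42, "example-token")
```

Sending it with urllib.request.urlopen(request) should return 202 for an editor; a 400 or 403 is raised as urllib.error.HTTPError.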

Co-Authored-By: Claude <noreply@anthropic.com>
…orithms

apply_model_coverage previously called refresh_all_algorithm_coverage() whenever a
run created any taxon, rewriting the covered_taxa relation for every algorithm. In
the --all-projects backfill that is O(projects x algorithms). Replace it with a
targeted refresh_coverage_for_taxa() that links only the just-created taxa to the
category maps whose labels overlap their names (one overlap query, then adds), so
per-run cost scales with new taxa, not the total algorithm/label count. Adds a test
pinning that the targeted refresh covers only the named taxa.
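
The idea behind the targeted refresh, sketched over plain dictionaries. refresh_coverage_for_new_taxa() here is a hypothetical stand-in for the service's refresh_coverage_for_taxa(), which presumably does the overlap as a database query; the classifier and species names are example data.

```python
def refresh_coverage_for_new_taxa(new_taxon_names, labels_by_category_map, coverage):
    """Link only the just-created taxa to category maps whose labels contain them.

    Existing entries in coverage are left untouched. Returns the number of links added.
    """
    new_names = set(new_taxon_names)
    links_added = 0
    for category_map, labels in labels_by_category_map.items():
        for name in new_names & set(labels):  # overlap with the new taxa only
            coverage.setdefault(name, set()).add(category_map)
            links_added += 1
    return links_added


coverage = {"Actias luna": {"classifier_a"}}  # state before the run
links_added = refresh_coverage_for_new_taxa(
    new_taxon_names=["Danaus plexippus", "Unmatched species"],
    labels_by_category_map={
        "classifier_a": ["Actias luna", "Danaus plexippus"],
        "classifier_b": ["Danaus plexippus"],
    },
    coverage=coverage,
)
```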

Co-Authored-By: Claude <noreply@anthropic.com>