Add the Multi-Tenant Catalogs Endpoint Extension, for nested catalog support#366

Open
jonhealy1 wants to merge 68 commits into
stac-utils:mainfrom
jonhealy1:stac-fastapi-catalogs-extension

@jonhealy1
Collaborator

@jonhealy1 jonhealy1 commented Mar 25, 2026

Related Issue(s):

Description:

Extension spec: https://github.com/StacLabs/multi-tenant-catalogs
STAC-FastAPI catalogs extension: https://github.com/StacLabs/stac-fastapi-catalogs-extension

PR Checklist:

  • pre-commit hooks pass locally
  • Tests pass (run make test)
  • Documentation has been updated to reflect changes, if applicable, and docs build successfully (run make docs)
  • Changes are added to the CHANGELOG.

@jonhealy1
Collaborator Author

It would be best to fix the extension so that it supports Python 3.11.

@jonhealy1 jonhealy1 changed the title route extension, create, get catalogs Add the Multi-Tenant Virtual Catalogs Extension, for nested catalog support Mar 25, 2026
@jonhealy1 jonhealy1 changed the title Add the Multi-Tenant Virtual Catalogs Extension, for nested catalog support Add the Multi-Tenant Catalogs Endpoint Extension, for nested catalog support Mar 25, 2026
Comment thread stac_fastapi/pgstac/extensions/catalogs/catalogs_database_logic.py Outdated
@jonhealy1
Collaborator Author

jonhealy1 commented Apr 21, 2026

This is really close to being reviewable. I just need some time for QA and documentation.

@jonhealy1
Collaborator Author

If we PUT/POST an item or upsert in a pypgstac loader context, the spec expectation is that the entire item is replaced.

I agree with this. As soon as you're modifying things at a DB level, you're on your own 😉 The APIs can only do so much and we have PATCH for that.

Another potential issue: I am getting a 500 response validation error on the base /collections endpoint when a catalog exists. To replicate just create a catalog and then call /collections.

{
code: "ResponseValidationError",
description: "9 validation errors:
  {'type': 'literal_error', 'loc': ('response', 'collections', 1, 'type'), 'msg': "Input should be 'Collection'", 'input': 'Catalog', 'ctx': {'expected': "'Collection'"}}
  ...

@bkanuka Nice catch - I think this should be fixed now.
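
For illustration only, one way a fix like this could look is to filter non-Collection records out before the response model validates them. This is a sketch, not the change made in this PR: `only_collections` is a hypothetical helper, and it assumes catalogs and collections come back from the same query as plain dicts with a `type` field.

```python
from typing import Any


def only_collections(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop anything that is not a STAC Collection (hypothetical helper)."""
    return [r for r in records if r.get("type") == "Collection"]


# Made-up sample data: one collection and one catalog stored side by side.
rows = [
    {"type": "Collection", "id": "sentinel-2"},
    {"type": "Catalog", "id": "tenant-a"},
]
print([r["id"] for r in only_collections(rows)])
```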

@jonhealy1 jonhealy1 requested review from bkanuka and hrodmn May 15, 2026 06:31
@hrodmn
Collaborator

hrodmn commented May 15, 2026

I don't think we should make any changes to upsert behavior on the pypgstac side since the behavior is simple to explain and lines up with standard Postgres operations. The trick is just that most of our STAC metadata is squeezed into a single column, which limits the reach of the upsert behavior! I think for now the best path on the pgstac side is to make sure users understand that upserting collections and items will not perform property-wise updates (except for the top-level pgstac.collections fields like bbox, datetime (from extent)). This is not a new thing, just something that becomes more important as users rely on the poly-hierarchy concept.
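
A small sketch of what "not property-wise" means in practice, with plain Python dicts standing in for the content column. The `upsert_replace` name and the sample documents are made up for illustration; the function only mimics the replace-everything behavior described above and does not call pgstac.

```python
import copy


def upsert_replace(stored: dict, incoming: dict) -> dict:
    """Mimic a whole-document upsert: the incoming JSON wins outright."""
    return copy.deepcopy(incoming)


# Stored document carries a hierarchy link; the incoming one does not.
stored = {"id": "c1", "type": "Collection", "title": "Old", "parent_ids": ["cat-a"]}
incoming = {"id": "c1", "type": "Collection", "title": "New"}

result = upsert_replace(stored, incoming)
print("parent_ids" in result)  # False: the hierarchy link is gone
```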

We should make it clear to users of the Catalogs Extension that the catalog hierarchical relationships are inherently fragile because they are not managed by pgstac at all - parent/child relationships are not stored in proper columns or tracked in a junction table or anything like that. The transactions routes in the extension provide a way to manage these relationships more safely but it would still be easy for a user to POST a collection

There might be a way to do this more safely using pgstac's private column. I am not exactly sure how we would want to wire it up but I think it was designed exactly for fields like parent_ids that are never going to be returned via STAC API requests but are useful for server-side data. The catch is that none of the built-in pgstac functions (e.g. collection_search) will use it so we would need to do some more manual SQL in this repo to use it.

I had an LLM spec out what the changes would look like and it is not a simple swap. We could probably do it without pgstac changes but there are some small pgstac-side tweaks that could make it easier. Expand below for that response:

parent_ids -> private spec

What changes immediately if parent_ids moves to pgstac.collections.private

  • Good news: normal collection PUT/PATCH gets safer automatically.
    • pgstac.update_collection(data) only does SET content = data
    • it does not touch private
    • so out-of-band collection updates (e.g. via pypgstac load) would not wipe out parent_ids unless it is explicitly changed
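
To make that concrete, here is a toy model of a collections row with separate content and private values. `CollectionRow` and `update_content_only` are hypothetical stand-ins for the `SET content = data` behavior described in the bullets above; this is not pgstac code.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionRow:
    content: dict
    private: dict = field(default_factory=dict)


def update_content_only(row: CollectionRow, data: dict) -> CollectionRow:
    """Replace content and carry private over untouched."""
    return CollectionRow(content=dict(data), private=row.private)


row = CollectionRow(
    content={"id": "c1", "title": "Old"},
    private={"parent_ids": ["cat-a"]},
)
updated = update_content_only(row, {"id": "c1", "title": "New"})
print(updated.private)  # hierarchy metadata survives the content update
```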

Why it is not trivial

  • All current catalogs code assumes parent_ids is in the returned collection JSON.
  • Child lookups depend on pgstac collection_search(...) with CQL like:
    • a_contains(parent_ids, catalog_id)
  • That works only because parent_ids is currently in content.
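
For reference, a filter of that kind can be written in CQL2-JSON roughly as below. The `children_filter` helper is hypothetical, and how the extension actually passes the filter to collection_search is not shown in this thread; only the dict shape follows the CQL2 array-operator form.

```python
import json


def children_filter(catalog_id: str) -> dict:
    """Build a CQL2-JSON filter matching records whose parent_ids contains catalog_id."""
    return {
        "op": "a_contains",
        "args": [{"property": "parent_ids"}, [catalog_id]],
    }


print(json.dumps(children_filter("cat-a")))
```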

What a stac-fastapi-pgstac-only prototype would look like

1. Read parent_ids from private

  • Change internal reads like:
    • stac_fastapi/pgstac/extensions/catalogs/catalogs_database_logic.py
  • Instead of only selecting content, select merged internal JSON, e.g.:
SELECT
  content || jsonb_build_object(
    'parent_ids',
    coalesce(private->'parent_ids', '[]'::jsonb)
  )
FROM collections
WHERE id = :id;
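
A plain-Python analogue of that merge can help when reasoning about it or unit-testing callers without a database. `merge_parent_ids` is a made-up name, and the function is only an approximation of the SQL above.

```python
from typing import Optional


def merge_parent_ids(content: dict, private: Optional[dict]) -> dict:
    """Approximate content || jsonb_build_object('parent_ids', coalesce(...)).

    Note: a JSON null under private.parent_ids is treated as missing here,
    which is a small deviation from the SQL coalesce semantics.
    """
    parent_ids = (private or {}).get("parent_ids")
    merged = dict(content)
    merged["parent_ids"] = parent_ids if parent_ids is not None else []
    return merged


print(merge_parent_ids({"id": "c1"}, {"parent_ids": ["cat-a"]}))
```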

2. Write parent_ids into private

  • For create/link/unlink flows, stop mutating content["parent_ids"]
  • Use direct SQL like:
INSERT INTO collections (content, private)
VALUES (
  :content::jsonb,
  jsonb_build_object('parent_ids', :parent_ids::jsonb)
);

and for updates to hierarchy metadata:

UPDATE collections
SET private = jsonb_set(
  coalesce(private, '{}'::jsonb),
  '{parent_ids}',
  :parent_ids::jsonb,
  true
)
WHERE id = :id;
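
On the application side, the value bound to `:parent_ids` has to be computed somewhere. A minimal sketch of idempotent link and unlink helpers, assuming private is handled as a plain dict; `link_parent` and `unlink_parent` are hypothetical names, not functions from this PR.

```python
from typing import Optional


def link_parent(private: Optional[dict], catalog_id: str) -> dict:
    """Return a new private dict with catalog_id added to parent_ids once."""
    result = dict(private or {})
    parents = list(result.get("parent_ids", []))
    if catalog_id not in parents:
        parents.append(catalog_id)
    result["parent_ids"] = parents
    return result


def unlink_parent(private: Optional[dict], catalog_id: str) -> dict:
    """Return a new private dict with catalog_id removed from parent_ids."""
    result = dict(private or {})
    result["parent_ids"] = [p for p in result.get("parent_ids", []) if p != catalog_id]
    return result


linked = link_parent(link_parent(None, "cat-a"), "cat-a")
print(linked)
```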

3. Replace collection_search-based hierarchy queries

These methods currently break if parent_ids leaves content:

  • get_catalog_children
  • get_catalog_collections
  • get_sub_catalogs

You would replace the CQL search with direct SQL on private, e.g.:

SELECT content
FROM collections
WHERE coalesce(private->'parent_ids', '[]'::jsonb) ? :catalog_id

with extra filters for content->>'type' = 'Catalog' or 'Collection'.
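
The same selection logic, expressed over in-memory rows for clarity. This is a sketch under assumptions: `children_of` is a hypothetical helper, and each row is modeled as a dict with `content` and `private` keys rather than a real database record.

```python
from typing import Optional


def children_of(rows: list[dict], catalog_id: str, stac_type: Optional[str] = None) -> list[dict]:
    """Return content of rows whose private.parent_ids includes catalog_id."""
    matches = []
    for row in rows:
        parents = (row.get("private") or {}).get("parent_ids", [])
        if catalog_id not in parents:
            continue
        if stac_type is not None and row["content"].get("type") != stac_type:
            continue
        matches.append(row["content"])
    return matches


rows = [
    {"content": {"id": "sub", "type": "Catalog"}, "private": {"parent_ids": ["root"]}},
    {"content": {"id": "c1", "type": "Collection"}, "private": {"parent_ids": ["root"]}},
    {"content": {"id": "c2", "type": "Collection"}, "private": None},
]
print([c["id"] for c in children_of(rows, "root", "Collection")])
```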

4. Keep public responses unchanged

  • parent_ids should still be stripped before API responses.
  • Link generation can still use the internally merged parent_ids.
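
A rough sketch of that last step: build parent links from the internal parent_ids, then drop the field from the public document. The `to_public` helper, the link shape, and the `/catalogs/{id}` href pattern are assumptions for illustration and may not match what the extension emits.

```python
def to_public(collection: dict, base_url: str) -> dict:
    """Strip parent_ids and add one parent link per parent catalog."""
    public = {k: v for k, v in collection.items() if k != "parent_ids"}
    links = list(public.get("links", []))
    for parent_id in collection.get("parent_ids", []):
        links.append(
            {
                "rel": "parent",
                "type": "application/json",
                "href": f"{base_url}/catalogs/{parent_id}",
            }
        )
    public["links"] = links
    return public


doc = {"id": "c1", "type": "Collection", "parent_ids": ["cat-a"]}
public_doc = to_public(doc, "https://example.test")
print(public_doc["links"][0]["href"])
```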

The main tradeoff

  • This makes hierarchy state much safer.
  • But it also means catalogs logic stops using pgstac’s nice collection_search path for these hierarchy queries and starts carrying custom SQL.

Cleaner long-term version

The cleaner design is probably a pgstac change, not just an app-layer one:

  • either expose private.parent_ids to collection search
  • or add collection helpers that return content + private for internal use

Comment thread stac_fastapi/pgstac/core.py
Comment thread stac_fastapi/pgstac/extensions/catalogs/catalogs_database_logic.py Outdated
Comment thread stac_fastapi/pgstac/extensions/catalogs/catalogs_database_logic.py
Comment thread stac_fastapi/pgstac/extensions/catalogs/catalogs_database_logic.py Outdated
Comment thread stac_fastapi/pgstac/extensions/catalogs/catalogs_database_logic.py
@jonhealy1
Collaborator Author

This is interesting, but sounds like something we would want to open an issue for and discuss after this first version is merged. Having a private, protected column for parent_ids is a very good idea potentially. It would take some coordination with the pgstac devs for one thing.

Comment thread stac_fastapi/pgstac/extensions/catalogs/catalogs_client.py
@hrodmn
Collaborator

hrodmn commented May 15, 2026

FWIW I think the private column was designed for things exactly like this. We could probably merge this PR without making a choice, but since it is a decision that affects the data storage schema in Postgres, it would be nice to pick a direction before we release a new version of stac-fastapi-pgstac with the extension. If someone starts using it with parent_ids tucked into the content column and we then change it to live in private, it could be a challenge to migrate.

@jonhealy1
Collaborator Author

I think that the change would be backwards compatible from a user's perspective: you can still POST a collection with the parent_ids list, but instead of the list being stored in jsonb, it would be stored in our new private database column. Someone using stac-fastapi-pgstac wouldn't know anything has changed internally.
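
Sketching that idea: the request body keeps accepting parent_ids, and the app layer splits it off before writing. `split_parent_ids` is a hypothetical helper, and the two-column layout (content plus private) is the proposal under discussion, not current behavior.

```python
def split_parent_ids(body: dict) -> tuple[dict, dict]:
    """Split a posted collection into (content, private) parts."""
    content = {k: v for k, v in body.items() if k != "parent_ids"}
    private = {"parent_ids": list(body.get("parent_ids", []))}
    return content, private


content, private = split_parent_ids(
    {"id": "c1", "type": "Collection", "parent_ids": ["cat-a"]}
)
print(content, private)
```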

@bitner
Collaborator

bitner commented May 19, 2026

The idea for the private column in collections and items (which I don't think anyone has ever actually done anything with) was to be a place where a user could store things that were never to be directly user accessible. A couple of use cases behind the thought were: 1) storing metadata that could be used by Row Based Authorization tooling within Postgres, and 2) storing metadata that could be used for things like data synchronization, such as a timestamp of when the data was last updated in that instance of pgstac (which is different than when the row was actually updated) or similar things.

At least in my head, if you can search off something, it is inherently "leaky" and not private anymore to a user. If you want to search by parent_ids, but you don't want them to show up in the results, the "staccy" way to do that would be to use the fields extension excluding that column.
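
As a sketch of that approach, a search request body could filter on parent_ids while asking for the field to be left out of the results. The `search_body` helper is made up, and whether the collection search endpoint here accepts a POST body with both `filter` and `fields` is an assumption; the `fields.exclude` shape follows the STAC API Fields extension.

```python
def search_body(catalog_id: str) -> dict:
    """Filter by parent catalog but exclude parent_ids from returned records."""
    return {
        "filter-lang": "cql2-json",
        "filter": {
            "op": "a_contains",
            "args": [{"property": "parent_ids"}, [catalog_id]],
        },
        "fields": {"exclude": ["parent_ids"]},
    }


print(search_body("cat-a")["fields"])
```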

I still lean towards the stance that this should be a PATCH operation and not a PUT/POST, at least for now. If/when this extension gets marked as stable in the STAC API spec, I do think that we could/should do some things in pgstac to actually instantiate this as a real column and really optimize it, but I am leery of adding churn to the pgstac schema right now (changing the schema, less so for collections than items, can be a big deal as it often requires an entire table rewrite).

I still think that if parent_ids is important to the collection (whether in the content json, the private field, or instantiated as an actual column), that that should be something that is managed wherever you are defining your collection and that it should not be the storage engine's responsibility to selectively merge things while ignoring that field. If you have out-of-band updates to your collection content, that seems like you have an open door for plenty of other issues.

@jonhealy1
Collaborator Author

@bitner I completely agree. The app layer should carry the responsibility of maintaining that state rather than adding complexity or schema churn to pgstac.

To make this safe for users, the right path forward is ensuring they update via scoped transaction routes - such as PUT /catalogs/{catalog_id}/collections/{collection_id} or PUT /catalogs/{catalog_id} - where the extension can safely preserve the DAG hierarchy behind the scenes.

The scoped collection update route isn't currently in the extension spec, so I'll add that endpoint to the extension to close this loophole before we wrap this up.
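
Roughly what I have in mind for the scoped update, as a sketch only: the body replaces the collection content, while the stored parent_ids are carried over rather than taken from the request. `scoped_collection_update` is a hypothetical name, and the exact semantics (including the error for a collection that is not linked to the catalog) are assumptions until the endpoint is specified.

```python
def scoped_collection_update(stored: dict, body: dict, catalog_id: str) -> dict:
    """Replace a collection's content while preserving its stored parent_ids."""
    parents = list(stored.get("parent_ids", []))
    if catalog_id not in parents:
        # Assumed behavior: refuse updates through a catalog that is not a parent.
        raise LookupError(f"{stored.get('id')} is not a child of {catalog_id}")
    updated = dict(body)
    updated["parent_ids"] = parents  # anything sent in the body is ignored
    return updated


stored = {"id": "c1", "title": "Old", "parent_ids": ["cat-a", "cat-b"]}
body = {"id": "c1", "title": "New"}
print(scoped_collection_update(stored, body, "cat-a"))
```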

@jonhealy1 jonhealy1 requested a review from bkanuka May 21, 2026 04:40

5 participants