Vector data support: GeoParquet output, named vector collections, OGC API Features #153

@turban

Background

The current design treats vector data as ephemeral input — a GeoJSON FeatureCollection passed as a parameter to aggregate_spatial. This works for the zonal statistics use case but misses broader opportunities: storing and serving standard boundary datasets, producing rich mappable output, and enabling vector-to-vector operations.

openEO's vector datacube model offers a clean pattern to follow.

The vector datacube model

openEO treats vector as a first-class datacube type alongside raster. The output of aggregate_spatial is a vector datacube — an xr.DataArray with a geometry dimension backed by shapely geometries via the xvec library:

Raster datacube  →  aggregate_spatial(geometries)  →  Vector datacube
(time, y, x)                                          (time, geometry)
                                                       geometry coords = shapely polygons

The geometry (polygon shapes, feature properties) travels through the pipeline and ends up in the output — the result is not a dead-end table but a datacube that can be further processed, filtered, and saved in standard GIS formats.

Proposed changes

1. GeoParquet output instead of JSON

save_result(format="GeoParquet") writes vector output in a format that openEO back ends can advertise, and it suits zonal statistics results better than plain JSON:

  • Columnar storage — efficient for large zone × timestep results
  • Preserves geometry — output is directly mappable in QGIS, Python, R without joining to a separate boundaries file
  • Natively supported by GDAL, geopandas, DuckDB, QGIS, ArcGIS
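The reshaping that a GeoParquet writer would need can be sketched without any dependencies. This is an illustration only: the long layout (one row per geometry and timestep), the column names, and the helper name cube_to_rows are assumptions, not something the proposal specifies. Writing the rows to a file is left out here.

```python
def cube_to_rows(times, geometries_wkt, values):
    """Flatten a (time, geometry) cube into one row per geometry and timestep.

    values[t][g] is the aggregated value for timestep t and geometry g.
    The geometry is repeated on every row, so the table stays mappable
    without joining to a separate boundaries file.
    """
    rows = []
    for g, wkt in enumerate(geometries_wkt):
        for t, timestamp in enumerate(times):
            rows.append({"geometry": wkt, "time": timestamp, "value": values[t][g]})
    return rows


# Made-up example: two timesteps, two zones.
times = ["2024-01", "2024-02"]
zones = [
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((1 0, 2 0, 2 1, 1 1, 1 0))",
]
values = [[1.0, 2.0], [3.0, 4.0]]  # indexed as values[time][geometry]

rows = cube_to_rows(times, zones, values)
for row in rows:
    print(row["time"], row["value"], row["geometry"][:20])
```

A wide layout (one row per geometry, one column per timestep) would work equally well; the long form is used here only because it keeps the schema fixed as timesteps are added.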

2. Vector datasets as named collections

Standard boundary datasets (administrative regions, watersheds, river basins) would be stored as GeoParquet files, registered as STAC collections, and loaded via a load_vector_cube process:

# Client supplies geometry
aggregate_spatial(load_collection("era5land_temp"), geometries=geojson_fc)

# Client references a named collection
aggregate_spatial(load_collection("era5land_temp"), geometries=load_vector_cube("admin_boundaries_lk"))

This removes the need for clients to supply boundaries on every request and allows service deployers to ship standard boundaries alongside dataset plugins.
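As a rough sketch, the second call above could be expressed as an openEO process graph like the one below. The load_vector_cube node and its "id" argument are assumptions based on this proposal, the choice of mean as reducer is only an example, and the helper name zonal_mean_graph is hypothetical.

```python
import json


def zonal_mean_graph(raster_id, vector_id):
    """Build a process graph: mean of a raster collection per named zone."""
    return {
        "process_graph": {
            "raster": {
                "process_id": "load_collection",
                "arguments": {
                    "id": raster_id,
                    "spatial_extent": None,
                    "temporal_extent": None,
                },
            },
            "zones": {
                # Assumption: back-end process proposed in this issue;
                # the "id" argument name is illustrative.
                "process_id": "load_vector_cube",
                "arguments": {"id": vector_id},
            },
            "aggregate": {
                "process_id": "aggregate_spatial",
                "arguments": {
                    "data": {"from_node": "raster"},
                    "geometries": {"from_node": "zones"},
                    # Reducer callback applied to the pixels in each geometry.
                    "reducer": {
                        "process_graph": {
                            "mean": {
                                "process_id": "mean",
                                "arguments": {"data": {"from_parameter": "data"}},
                                "result": True,
                            }
                        }
                    },
                },
            },
            "save": {
                "process_id": "save_result",
                "arguments": {
                    "data": {"from_node": "aggregate"},
                    "format": "GeoParquet",
                },
                "result": True,
            },
        }
    }


graph = zonal_mean_graph("era5land_temp", "admin_boundaries_lk")
print(json.dumps(graph, indent=2))
```

The client-supplied variant differs only in the "zones" node, which would carry the GeoJSON FeatureCollection instead of a collection ID.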

3. Serving vector data — GeoParquet for analysis, PMTiles for the browser

These are two different access patterns requiring different formats:

GeoParquet over HTTP: for analysis clients (DuckDB, PyArrow, geopandas, QGIS). Parquet keeps min/max column statistics per row group in the file footer; when the file carries a bbox covering column (GeoParquet 1.1) and the rows are spatially sorted, these statistics act as a bounding box per row group. A client reads the footer first (~8 KB), identifies which row groups overlap the area of interest, then fetches only those via HTTP range requests, the same mechanism the service already uses for Zarr. The file is served statically with no server-side processing.

Client reads footer → checks row group bboxes → fetches only matching chunks
                                                  (HTTP range requests)
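The pruning step in the diagram can be sketched in plain Python. Everything below is illustrative: the RowGroup structure, the byte offsets, and the bounding boxes are made up, and a real client would take them from the Parquet footer through a library such as PyArrow or DuckDB.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


class RowGroup(NamedTuple):
    bbox: BBox   # min/max of the bbox columns, as read from the footer
    offset: int  # first byte of the row group in the file
    length: int  # size of the row group in bytes


def intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def ranges_to_fetch(row_groups, aoi):
    """Return HTTP Range header values for row groups overlapping the AOI."""
    return [
        f"bytes={rg.offset}-{rg.offset + rg.length - 1}"
        for rg in row_groups
        if intersects(rg.bbox, aoi)
    ]


# Made-up footer contents for a file with three row groups.
footer = [
    RowGroup(BBox(79.5, 5.9, 80.5, 7.0), offset=4, length=1000),
    RowGroup(BBox(80.5, 7.0, 81.9, 9.8), offset=1004, length=2000),
    RowGroup(BBox(79.6, 6.9, 80.2, 8.0), offset=3004, length=500),
]
aoi = BBox(79.8, 6.8, 80.0, 7.0)

selected = ranges_to_fetch(footer, aoi)
print(selected)
```

The second row group is skipped entirely, which is where the saving comes from; how much is saved depends on how well the rows were spatially sorted when the file was written.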

PMTiles: for browser map rendering. Browsers have no native Parquet support, and even with a WebAssembly Parquet reader the row group is too coarse a unit for viewport-based rendering (row groups typically hold 100K–1M features and carry no zoom awareness). PMTiles organises data by tile (zoom + x + y) so the browser fetches only tiles for the current viewport at the current zoom level. MapLibre GL JS reads PMTiles through the protocol handler provided by the pmtiles JavaScript library (registered with addProtocol), not natively.
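To make the tile addressing concrete, here is a small sketch of how a viewport maps to (zoom, x, y) tiles using the standard Web Mercator tiling formula. The function names are illustrative, and the sketch ignores viewports that cross the antimeridian.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the Web Mercator tile containing a point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_viewport(west, south, east, north, zoom):
    """List the (zoom, x, y) tiles covering a bounding box at one zoom level."""
    x0, y0 = lonlat_to_tile(west, north, zoom)  # top-left tile
    x1, y1 = lonlat_to_tile(east, south, zoom)  # bottom-right tile
    return [(zoom, x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]


# Example viewport (made-up coordinates).
for zoom in (0, 4, 8):
    tiles = tiles_for_viewport(79.5, 5.9, 81.9, 9.8, zoom)
    print(zoom, len(tiles))
```

The number of tiles needed stays small at every zoom level because the viewport shrinks in world terms as the zoom grows, which is the property row groups lack.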

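PMTiles for a vector collection is generated from the GeoParquet file with tippecanoe and served as a static file alongside it, with both registered as assets in the STAC collection. tippecanoe does not read Parquet directly, so the data is first converted to a format it accepts, such as GeoJSON or FlatGeobuf (for example with ogr2ogr).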
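One possible shape for the two assets is sketched below. The asset keys, the file layout under vector/, the roles, and the helper name are assumptions; the media types are the ones I understand to be registered for Parquet and PMTiles and should be checked against the STAC best practices before use.

```python
import json


def vector_collection_assets(base_url, collection_id):
    """Build the STAC assets block for a vector collection (illustrative)."""
    return {
        "geoparquet": {
            "href": f"{base_url}/vector/{collection_id}.parquet",
            "type": "application/vnd.apache.parquet",  # assumed media type
            "roles": ["data"],
            "title": "GeoParquet for analysis clients",
        },
        "pmtiles": {
            "href": f"{base_url}/vector/{collection_id}.pmtiles",
            "type": "application/vnd.pmtiles",  # assumed media type
            "roles": ["visual"],
            "title": "PMTiles for browser rendering",
        },
    }


# The base URL is a placeholder, not a real deployment.
assets = vector_collection_assets("https://example.org", "admin_boundaries_lk")
print(json.dumps({"assets": assets}, indent=2))
```

Keeping both files under the same collection lets a client pick the asset that matches its access pattern without knowing how the files were produced.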

OGC API - Features — pygeoapi (already in the project) serves vector collections for GIS client access (QGIS, ArcGIS, Leaflet) independently of the openEO jobs flow.

Full data flow

Vector input:   GeoJSON FeatureCollection (client-supplied)
             OR load_vector_cube("named_collection")  (service-stored GeoParquet)
                          ↓
             aggregate_spatial  (openeo-processes-dask, xvec internally)
                          ↓
Vector output:  save_result(format="GeoParquet")
                          ↓
             GET /jobs/{id}/results        →  GeoParquet asset  (analysis: DuckDB, Python, QGIS)
                                          →  PMTiles asset      (browser: MapLibre GL JS)
             GET /collections/{id}/items   →  OGC API Features  (pygeoapi)

What needs to be built

Component                  Notes
GeoParquet save_result     Detect vector datacube output, write GeoParquet via geopandas
PMTiles generation         Run tippecanoe on GeoParquet output; register as assets.pmtiles in STAC
Vector collection storage  GeoParquet + PMTiles in vector/ directory, registered as STAC collections
load_vector_cube process   Processing plugin; loads a named vector collection by ID
OGC API Features serving   Wire pygeoapi to serve vector STAC collections

Relevant openEO processes

The following processes from the openEO spec apply to vector data — most are implemented in openeo-processes-dask:

Process                   Description                              Implemented
load_geojson              GeoJSON → vector datacube
aggregate_spatial         Raster + geometries → vector datacube
vector_buffer             Buffer geometries by distance
vector_reproject          Reproject geometry dimension
filter_vector             Filter vector datacube by properties
load_vector_cube          Load stored vector dataset               ❌ (backend-provided)
vector_to_random_points   Sample random points from polygons
vector_to_regular_points  Sample regular point grid from polygons
