Problem
partition_metadata() in df.py recomputes min/max coordinate bounds for all partitions every time read_xarray_table() is called. For ARCO-ERA5 (732,072 partitions), this adds startup latency on every new session even though the coordinate layout of the dataset never changes.
For remote datasets (GCS/S3), each coordinate access incurs network latency, which makes the recomputation especially costly.
Proposed API
table = read_xarray_table(
    ds,
    chunks={'time': 1},
    metadata_cache='./era5_meta.parquet',
)
# First call: computes and saves bounds to cache file
# Subsequent calls: loads bounds from cache, skipping 732,072 coordinate reads
The partition bounds are a pure function of the dataset path and the chunk specification, so caching is safe as long as the dataset structure doesn't change.
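A minimal sketch of how the cache lookup might work, assuming the bounds can be serialized as plain lists and dicts. The names `cache_key` and `load_or_compute_bounds` are hypothetical and not part of the existing code. The `compute` callback stands in for whatever `partition_metadata()` does today. JSON is used here only to keep the sketch dependency-free.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Mapping


def cache_key(dataset_path: str, chunks: Mapping[str, int]) -> str:
    """Hash the two inputs the bounds depend on: dataset path and chunk spec."""
    # sort_keys makes the key independent of dict ordering.
    payload = json.dumps({"path": dataset_path, "chunks": dict(chunks)}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_or_compute_bounds(
    cache_file: str,
    dataset_path: str,
    chunks: Mapping[str, int],
    compute: Callable[[], Any],
) -> Any:
    """Return cached bounds when the key matches, otherwise compute and save them."""
    key = cache_key(dataset_path, chunks)
    path = Path(cache_file)

    if path.exists():
        cached = json.loads(path.read_text())
        # A different path or chunk spec invalidates the cache.
        if cached.get("key") == key:
            return cached["bounds"]

    bounds = compute()  # the expensive step: one coordinate read per partition
    path.write_text(json.dumps({"key": key, "bounds": bounds}))
    return bounds
```

Storing the key next to the bounds means a changed `chunks` argument triggers a recompute instead of silently returning stale bounds. This sketch does not detect a dataset that was rewritten in place under the same path, so that case would still need an explicit invalidation mechanism.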
Storage formats to consider
- Parquet sidecar file (efficient, columnar)
- JSON sidecar file (human-readable, debuggable)
- Zarr consolidated metadata attributes (colocated with dataset)
Parent: #126