Merged
15 changes: 15 additions & 0 deletions examples/example_metadata-geoarrow.json
Collaborator

Nit: maybe `_geoarrow`, to avoid mixing `_` and `-` in the filename?

@@ -0,0 +1,15 @@
{
"geo": {
"columns": {
"geometry": {
"encoding": "geoarrow",
"extension_name": "geoarrow.point",
"geometry_types": [
"Point"
]
}
},
"primary_column": "geometry",
"version": "1.1.0-dev"
}
}
2 changes: 1 addition & 1 deletion format-specs/compatible-parquet.md
@@ -12,7 +12,7 @@ The core idea of the compatibility guidelines is to have the output match the de

* The geometry column should be named either `"geometry"` or `"geography"`.

* The geometry column should be a `BYTE_ARRAY` with Well Known Binary (WKB) used to define the geometries, as defined in the [encoding](./geoparquet.md#encoding) section of the GeoParquet spec.
* The geometry column should be a `BYTE_ARRAY` with Well Known Binary (WKB) used to define the geometries, as defined in the [encoding](./geoparquet.md#encoding) section of the GeoParquet spec. Alternatively, the geometry column can be stored according to the Point, MultiPoint, MultiLinestring, or MultiPolygon memory layouts with separated (struct) coordinates as specified in the [GeoArrow format](https://geoarrow.org/format).

* All data is stored in longitude, latitude based on the WGS84 datum, as defined as the default in the [crs](./geoparquet.md#crs) section of the GeoParquet spec.

19 changes: 12 additions & 7 deletions format-specs/geoparquet.md
@@ -12,11 +12,7 @@ This is version 1.1.0-dev of the GeoParquet specification. See the [JSON Schema

## Geometry columns

Geometry columns MUST be stored using the `BYTE_ARRAY` parquet type. They MUST be encoded as [WKB](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary).
Collaborator

Can we keep the information of the first sentence somewhere? (i.e. that for a WKB encoding, the geometry column MUST be stored using the `BYTE_ARRAY` Parquet type)

(you kept the "Implementation note" just below, which also mentions `BYTE_ARRAY`, but that is not as specific as the above)

Collaborator Author

Done! We could also add some more in here about the Parquet physical description of how nesting works (but maybe in a future PR?)


Implementation note: when using the ecosystem of Arrow libraries, Parquet types such as `BYTE_ARRAY` might not be directly accessible. Instead, the corresponding Arrow data type can be `Arrow::Type::BINARY` (for arrays whose elements can be indexed through a 32-bit index) or `Arrow::Type::LARGE_BINARY` (64-bit index). It is recommended that GeoParquet readers are compatible with both data types, and writers preferably use `Arrow::Type::BINARY` (thus limiting to row groups with content smaller than 2 GB) for larger compatibility.

See the [encoding](#encoding) section below for more details.
Geometry columns MUST be encoded as [WKB](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary) or [GeoArrow](https://geoarrow.org/). See the [encoding](#encoding) section below for more details.

### Nesting

@@ -51,13 +47,14 @@ Each geometry column in the dataset MUST be included in the `columns` field abov

| Field Name | Type | Description |
| -------------- | ------------ | ----------- |
| encoding | string | **REQUIRED.** Name of the geometry encoding format. Currently only `"WKB"` is supported. |
| encoding | string | **REQUIRED.** Name of the geometry encoding format. Currently `"WKB"` and `"geoarrow"` are supported. |
| geometry_types | \[string] | **REQUIRED.** The geometry types of all geometries, or an empty array if they are not known. |
| crs | object\|null | [PROJJSON](https://proj.org/specifications/projjson.html) object representing the Coordinate Reference System (CRS) of the geometry. If the field is not provided, the default CRS is [OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means the data in this column must be stored in longitude, latitude based on the WGS84 datum. |
| orientation | string | Winding order of exterior ring of polygons. If present must be `"counterclockwise"`; interior rings are wound in opposite order. If absent, no assertions are made regarding the winding order. |
| edges | string | Name of the coordinate system for the edges. Must be one of `"planar"` or `"spherical"`. The default value is `"planar"`. |
| bbox | \[number] | Bounding Box of the geometries in the file, formatted according to [RFC 7946, section 5](https://tools.ietf.org/html/rfc7946#section-5). |
| epoch | number | Coordinate epoch in case of a dynamic CRS, expressed as a decimal year. |
| geoarrow_type | string | The [GeoArrow extension name](https://geoarrow.org/extension-types#extension-names) corresponding to the column memory layout. This is required when `encoding` is `"geoarrow"` and must be omitted otherwise. |
Collaborator

Suggested change
| geoarrow_type | string | The [GeoArrow extension name](https://geoarrow.org/extension-types#extension-names) corresponding to the column memory layout. This is required when `encoding` is `"geoarrow"` and must be omitted otherwise. |
| geoarrow_type | string | The [GeoArrow extension name](https://geoarrow.org/extension-types#extension-names) corresponding to the column's memory layout. This is required when `encoding` is `"geoarrow"` and must be omitted otherwise. |


#### crs

@@ -83,10 +80,18 @@ The optional `epoch` field allows to specify this in case the `crs` field define

#### encoding

This is the binary format that the geometry is encoded in. The string `"WKB"`, signifying Well Known Binary is the only current option, but future versions of the spec may support alternative encodings. This SHOULD be the ["OpenGIS® Implementation Specification for Geographic information - Simple feature access - Part 1: Common architecture"](https://portal.ogc.org/files/?artifact_id=18241) WKB representation (using codes for 3D geometry types in the \[1001,1007\] range). This encoding is also consistent with the one defined in the ["ISO/IEC 13249-3:2016 (Information technology - Database languages - SQL multimedia and application packages - Part 3: Spatial)"](https://www.iso.org/standard/60343.html) standard.
This is the memory layout used to encode geometries in the geometry column.

The preferred option for maximum portability is `"WKB"`, signifying Well Known Binary. This SHOULD be the ["OpenGIS® Implementation Specification for Geographic information - Simple feature access - Part 1: Common architecture"](https://portal.ogc.org/files/?artifact_id=18241) WKB representation (using codes for 3D geometry types in the \[1001,1007\] range). This encoding is also consistent with the one defined in the ["ISO/IEC 13249-3:2016 (Information technology - Database languages - SQL multimedia and application packages - Part 3: Spatial)"](https://www.iso.org/standard/60343.html) standard.

Note that the current version of the spec only allows for a subset of WKB: 2D or 3D geometries of the standard geometry types (the Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection geometry types). This means that M values or non-linear geometry types are not yet supported.
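
As a rough illustration of the WKB subset described above, the following sketch reads the geometry type code from a WKB header and checks that it falls in the 2D range (1 to 7) or the ISO 3D range (1001 to 1007). The helper names `wkb_geometry_type` and `is_supported_wkb` are hypothetical and not part of the spec or of any library; only the Python standard library is assumed.

```python
import struct

# Geometry type codes permitted by the subset described above:
# 1-7 for 2D types and 1001-1007 for the ISO 3D (Z) variants.
ALLOWED_2D = range(1, 8)
ALLOWED_3D = range(1001, 1008)


def wkb_geometry_type(wkb: bytes) -> int:
    """Return the geometry type code stored in the 5-byte WKB header."""
    if len(wkb) < 5:
        raise ValueError("WKB value is too short to contain a header")
    # Byte 0 is the byte-order flag: 0 = big endian, 1 = little endian.
    if wkb[0] not in (0, 1):
        raise ValueError("unknown WKB byte-order flag")
    byte_order = "<" if wkb[0] == 1 else ">"
    (type_code,) = struct.unpack_from(byte_order + "I", wkb, 1)
    return type_code


def is_supported_wkb(wkb: bytes) -> bool:
    """Hypothetical check: True if the type code is 2D or ISO 3D."""
    code = wkb_geometry_type(wkb)
    return code in ALLOWED_2D or code in ALLOWED_3D


# Hand-built little-endian points: 2D, ISO Z (1001), and ISO M (2001).
point_2d = struct.pack("<BIdd", 1, 1, 30.0, 10.0)
point_z = struct.pack("<BIddd", 1, 1001, 30.0, 10.0, 5.0)
point_m = struct.pack("<BIddd", 1, 2001, 30.0, 10.0, 5.0)
```

Under these assumptions, `point_2d` and `point_z` pass the check, while `point_m` is rejected because M values fall outside the permitted ranges. The sketch only inspects the top-level header, so members nested inside a GeometryCollection would need a full parser.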

Using the `"geoarrow"` encoding may provide better performance and enable readers to leverage more features of the Parquet format to accelerate geospatial queries (e.g., row group-level min/max statistics). When `encoding` is set to `"geoarrow"`, the column metadata must also specify `geoarrow_type` according to the [GeoArrow metadata specification for extension names](https://geoarrow.org/extension-types#extension-names) to signify the memory layout used by the geometry column.

Note that the current version of the spec only allows for a subset of GeoArrow: separated (struct) coordinates are required, only 2D or 3D geometries are permitted, and the supported extension names are currently `"geoarrow.point"`, `"geoarrow.linestring"`, `"geoarrow.polygon"`, `"geoarrow.multipoint"`, `"geoarrow.multilinestring"`, and `"geoarrow.multipolygon"`. This means that M values and serialized encodings are not yet supported.
Collaborator

Oh I see, this doc doesn't even allow interleaved coordinates as GeoParquet.

I'm sensitive to the complexity concerns of having too many options in the spec, but I see this as favoring the "support cloud-native remote queries" use case over the "efficient file format, but reading and writing whole tables" use case. It "feels" like there's still a strong pull in general towards storing interleaved coordinates across the geo ecosystem.

That said, the memcopy to and from separated coordinates is pretty fast, so I can tolerate this.

Collaborator Author

The reasons interleaved coordinates are not a good candidate as of this writing are:

  • They don't give useful column statistics
  • Current tools are slow to read them compared to separated encodings (important for points)
  • NULL values randomly error in some cases

Demo of column statistics:

```python
import geoarrow.pyarrow as ga
import pyarrow as pa
from pyarrow import parquet
import numpy as np

# Two points stored with interleaved coordinates
array_interleaved = ga.as_geoarrow(["POINT (0 100)", "POINT (2 102)"], coord_type=ga.CoordType.INTERLEAVED)
tbl_interleaved = pa.table([array_interleaved], ["geom"])

parquet.write_table(tbl_interleaved, "test_interleaved.parquet")

# Inspect the statistics recorded for the coordinate column
f = parquet.ParquetFile("test_interleaved.parquet")
f.metadata.row_group(0).column(0).statistics
#> <pyarrow._parquet.Statistics object at 0x1223e7600>
#>   has_min_max: True
#>   min: -0.0
#>   max: 102.0
#>   null_count: 0
#>   distinct_count: None
#>   num_values: 4
#>   physical_type: DOUBLE
#>   logical_type: None
#>   converted_type (legacy): NONE
```

Demo of slowness:

```python
import geoarrow.pyarrow as ga
import pyarrow as pa
from pyarrow import parquet
import numpy as np

# One million random points, once with separated and once with interleaved coordinates
n = int(1e6)
array = ga.point().from_geobuffers(None, np.random.random(n), np.random.random(n))
array_interleaved = ga.as_geoarrow(array, coord_type=ga.CoordType.INTERLEAVED)
tbl = pa.table([array], ["geom"])
tbl_interleaved = pa.table([array_interleaved], ["geom"])

parquet.write_table(tbl, "test.parquet")
parquet.write_table(tbl_interleaved, "test_interleaved.parquet")

# Compare read times (IPython %timeit)
%timeit parquet.read_table("test.parquet")
#> 7.36 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit parquet.read_table("test_interleaved.parquet")
#> 15.8 ms ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Demo of random errors for NULL values:

```python
import geoarrow.pyarrow as ga
import pyarrow as pa
from pyarrow import parquet
import numpy as np

# One valid point followed by a null
array_interleaved = ga.as_geoarrow(["POINT (0 1)", None], coord_type=ga.CoordType.INTERLEAVED)
tbl_interleaved = pa.table([array_interleaved], ["geom"])

# Reading the file back raises an error
parquet.write_table(tbl_interleaved, "test_interleaved.parquet")
parquet.read_table("test_interleaved.parquet")
#> ArrowInvalid: Expected all lists to be of size=2 but index 2 had size=0
```

These are probably all solvable and might be unique to Arrow C++-backed implementations, but I am not sure it is the best encoding to start with (and it does seem like a good idea to start with just one encoding to minimize the burden on implementors).
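
The column statistics point can also be illustrated without any Parquet machinery. The sketch below is plain Python, not the pyarrow implementation: it computes the min/max pair that a writer would keep for a single interleaved value run, and compares it with the pairs kept for separated `x` and `y` runs, using the same two points as the demo above. The `min_max` helper is a hypothetical stand-in for per-column-chunk statistics.

```python
# The two points from the demo above: POINT (0 100) and POINT (2 102).
points = [(0.0, 100.0), (2.0, 102.0)]

# Interleaved layout: one flat run of values, x0, y0, x1, y1, ...
interleaved = [value for point in points for value in point]

# Separated (struct) layout: one run of values per dimension.
xs = [x for x, _ in points]
ys = [y for _, y in points]


def min_max(values):
    """Stand-in for the min/max pair kept per column chunk."""
    return min(values), max(values)


# A single pair that mixes x and y, so it says little about either axis.
interleaved_stats = min_max(interleaved)

# One pair per axis, which together form a bounding box.
x_stats = min_max(xs)
y_stats = min_max(ys)
bbox = (x_stats[0], y_stats[0], x_stats[1], y_stats[1])
```

With separated coordinates, the per-axis pairs can be combined into `(xmin, ymin, xmax, ymax)`, which is what a reader would need to skip row groups outside a query window. The interleaved pair spans both axes and cannot be used that way.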


Implementation note: when using WKB encoding with the ecosystem of Arrow libraries, Parquet types such as `BYTE_ARRAY` might not be directly accessible. Instead, the corresponding Arrow data type can be `Arrow::Type::BINARY` (for arrays whose elements can be indexed through a 32-bit index) or `Arrow::Type::LARGE_BINARY` (64-bit index). It is recommended that GeoParquet readers are compatible with both data types, and writers preferably use `Arrow::Type::BINARY` (thus limiting to row groups with content smaller than 2 GB) for larger compatibility.
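
The 2 GB figure in the implementation note follows from the 32-bit index: the largest offset a signed 32-bit integer can hold is 2^31 - 1 bytes. A minimal sketch of that arithmetic is shown below, assuming the writer knows the byte length of each WKB value in advance. The function names are hypothetical, and the returned strings are labels only, not calls into any Arrow library.

```python
INT32_MAX = 2**31 - 1  # largest offset a signed 32-bit index can hold


def fits_binary_offsets(value_lengths) -> bool:
    """Return True if values of these byte lengths fit 32-bit offsets.

    The final offset equals the total number of bytes, so the sum of
    all lengths must not exceed INT32_MAX.
    """
    return sum(value_lengths) <= INT32_MAX


def choose_binary_type(value_lengths) -> str:
    """Hypothetical helper: prefer 32-bit offsets when they suffice."""
    return "binary" if fits_binary_offsets(value_lengths) else "large_binary"


# One million 21-byte WKB points is far below the limit.
small = choose_binary_type([21] * 1_000_000)

# Two values of 1 GiB each add up to 2**31 bytes, one byte too many.
large = choose_binary_type([2**30, 2**30])
```

A writer that wants to stay on the 32-bit type could use a check like this to decide when to start a new row group, rather than switching to the 64-bit type.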

#### Coordinate axis order

The axis order of the coordinates in WKB stored in a GeoParquet follows the de facto standard for axis order in WKB and is therefore always (x, y) where x is easting or longitude and y is northing or latitude. This ordering explicitly overrides the axis order as specified in the CRS. This follows the precedent of [GeoPackage](https://geopackage.org), see the [note in their spec](https://www.geopackage.org/spec130/#gpb_spec).
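
The axis order rule can be sketched with a hand-built WKB point. The example below assumes a little-endian 2D point and uses only the standard library; `encode_wkb_point` and `decode_wkb_point` are hypothetical helpers written for illustration. The coordinates are written as (x, y), that is (longitude, latitude), even though a CRS such as EPSG:4326 formally lists latitude first.

```python
import struct

WKB_POINT = 1  # geometry type code for a 2D point


def encode_wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB, always in (x, y) order."""
    # Layout: 1 byte-order byte, 4-byte type code, two 8-byte doubles.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_wkb_point(wkb: bytes) -> tuple:
    """Decode a 2D WKB point and return its (x, y) pair."""
    byte_order = "<" if wkb[0] == 1 else ">"
    type_code, x, y = struct.unpack_from(byte_order + "Idd", wkb, 1)
    if type_code != WKB_POINT:
        raise ValueError("not a 2D WKB point")
    return x, y


# Longitude goes first (x) and latitude second (y), regardless of the
# axis order declared by the CRS.
lon, lat = -73.9857, 40.7484
wkb = encode_wkb_point(lon, lat)
x, y = decode_wkb_point(wkb)
```

A reader that decodes this value gets longitude in `x` and latitude in `y` and should not swap them based on the CRS definition.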
6 changes: 5 additions & 1 deletion format-specs/schema.json
@@ -23,7 +23,7 @@
"properties": {
"encoding": {
"type": "string",
"const": "WKB"
"pattern": "^(WKB|geoarrow)$"
},
"geometry_types": {
"type": "array",
@@ -71,6 +71,10 @@
},
"epoch": {
"type": "number"
},
"geoarrow_type": {
"type": "string",
"pattern": "^geoarrow\\.(point|linestring|polygon|multipoint|multilinestring|multipolygon)$"
Collaborator

I'm not a json schema expert, but would we be able to make this conditionally required? It looks like dependentRequired meets what we need, though I don't know what version of json schema we're pinned to.

}
}
}
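
On the conditional requirement raised in the review comment above: `dependentRequired` keys on the presence of a property rather than its value, so one possible alternative is the `if`/`then`/`else` keywords available from JSON Schema draft-07 onward. The sketch below is not the schema adopted by this PR. It expresses the rule as a Python dict and pairs it with a deliberately tiny evaluator that understands only the keywords used in the fragment; a real validator library would replace `_matches`.

```python
# Hypothetical fragment that could sit next to "properties" in the
# per-column schema. It is an illustration, not the merged schema.
CONDITIONAL_RULE = {
    "if": {
        "properties": {"encoding": {"const": "geoarrow"}},
        "required": ["encoding"],
    },
    "then": {"required": ["geoarrow_type"]},
    "else": {"not": {"required": ["geoarrow_type"]}},
}


def _matches(subschema: dict, column: dict) -> bool:
    """Evaluate only 'required', 'properties'/'const', and 'not'."""
    for key in subschema.get("required", []):
        if key not in column:
            return False
    for key, rule in subschema.get("properties", {}).items():
        if key in column and "const" in rule and column[key] != rule["const"]:
            return False
    if "not" in subschema and _matches(subschema["not"], column):
        return False
    return True


def satisfies_rule(column: dict) -> bool:
    """Apply the 'then' branch if 'if' matches, otherwise 'else'."""
    branch = "then" if _matches(CONDITIONAL_RULE["if"], column) else "else"
    return _matches(CONDITIONAL_RULE[branch], column)


geoarrow_ok = satisfies_rule({"encoding": "geoarrow", "geoarrow_type": "geoarrow.point"})
geoarrow_missing = satisfies_rule({"encoding": "geoarrow"})
wkb_ok = satisfies_rule({"encoding": "WKB"})
wkb_extra = satisfies_rule({"encoding": "WKB", "geoarrow_type": "geoarrow.point"})
```

The `else` branch covers the "must be omitted otherwise" half of the rule. Whether the repository's pinned JSON Schema version supports these keywords would need to be checked before adopting this approach.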