-
Notifications
You must be signed in to change notification settings - Fork 87
Speedup for bounding_box_query #1104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 15 commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
409900d
Removed unnecessary compute() call
MeyerBender 054990f
Using the R-tree for spatial querying of shapes
MeyerBender cd454dc
Cleanup
MeyerBender 8d98440
Cleanup
MeyerBender e0f0ba7
Cleanup
MeyerBender be2db1b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 357004c
Merge pull request #1 from MeyerBender/performance/bounding_box_query
MeyerBender 8040288
Optimize transformed point bounding box queries and add multi-box cov…
dschaub95 6853ec8
add comments
dschaub95 7a29c50
Merge pull request #2 from MeyerBender/simplify-dask-bbox-query
MeyerBender 8b44d2e
Used spatial indexing also for polygon query
MeyerBender 394f998
sped up querying for scaling transformation and removed warning about…
MeyerBender 81f99aa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 11776f3
Merge branch 'main' into main
LucaMarconato 6a192d1
Fix ruff pre-commit violations in spatial_query.py
LucaMarconato 0711c40
code review: remove points parameter in bounding box internal functio…
LucaMarconato 8d0ce73
Fix bounding_box_query: restore negative-scale interval swap, use axe…
LucaMarconato 561f5f3
Merge branch 'main' into main
LucaMarconato File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,5 @@ | ||
| from __future__ import annotations | ||
|
|
||
| import warnings | ||
| from abc import abstractmethod | ||
| from collections.abc import Callable, Mapping | ||
| from dataclasses import dataclass | ||
|
|
@@ -9,6 +8,7 @@ | |
|
|
||
| import dask.dataframe as dd | ||
| import numpy as np | ||
| import pandas as pd | ||
| from dask.dataframe import DataFrame as DaskDataFrame | ||
| from geopandas import GeoDataFrame | ||
| from shapely.geometry import MultiPolygon, Point, Polygon | ||
|
|
@@ -78,7 +78,7 @@ def _get_bounding_box_corners_in_intrinsic_coordinates( | |
|
|
||
| # compute the output axes of the transformation, remove c from input and output axes, return the matrix without c | ||
| # and then build an affine transformation from that | ||
| m_without_c, input_axes_without_c, output_axes_without_c = _get_axes_of_tranformation( | ||
| m_without_c, input_axes_without_c, output_axes_without_c = _get_axes_of_transformation( | ||
| element, target_coordinate_system | ||
| ) | ||
| spatial_transform = Affine(m_without_c, input_axes=input_axes_without_c, output_axes=output_axes_without_c) | ||
|
|
@@ -142,7 +142,7 @@ def _get_polygon_in_intrinsic_coordinates( | |
|
|
||
| polygon_gdf = ShapesModel.parse(GeoDataFrame(geometry=[polygon])) | ||
|
|
||
| m_without_c, input_axes_without_c, output_axes_without_c = _get_axes_of_tranformation( | ||
| m_without_c, input_axes_without_c, output_axes_without_c = _get_axes_of_transformation( | ||
| element, target_coordinate_system | ||
| ) | ||
| spatial_transform = Affine(m_without_c, input_axes=input_axes_without_c, output_axes=output_axes_without_c) | ||
|
|
@@ -186,7 +186,7 @@ def _get_polygon_in_intrinsic_coordinates( | |
| return transform(polygon_gdf, to_coordinate_system="inverse") | ||
|
|
||
|
|
||
| def _get_axes_of_tranformation( | ||
| def _get_axes_of_transformation( | ||
| element: SpatialElement, target_coordinate_system: str | ||
| ) -> tuple[ArrayLike, tuple[str, ...], tuple[str, ...]]: | ||
| """ | ||
|
|
@@ -321,6 +321,11 @@ def _get_case_of_bounding_box_query( | |
| return case | ||
|
|
||
|
|
||
| def _is_scaling_transform(m_linear: np.ndarray) -> bool: | ||
| """Check if the linear part is a diagonal (pure scaling) matrix.""" | ||
| return np.allclose(m_linear, np.diag(np.diagonal(m_linear))) | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class BaseSpatialRequest: | ||
| """Base class for spatial queries.""" | ||
|
|
@@ -386,6 +391,7 @@ def _bounding_box_mask_points( | |
| axes: tuple[str, ...], | ||
| min_coordinate: list[Number] | ArrayLike, | ||
| max_coordinate: list[Number] | ArrayLike, | ||
| points_df: pd.DataFrame | None = None, | ||
| ) -> list[ArrayLike]: | ||
| """Compute a mask that is true for the points inside axis-aligned bounding boxes. | ||
|
|
||
|
|
@@ -404,31 +410,35 @@ def _bounding_box_mask_points( | |
| The lower right hand corners of the bounding boxes (i.e., the maximum coordinates along all dimensions). | ||
| Shape: (n_boxes, n_axes) or (n_axes,) for a single box. | ||
| {max_coordinate_docs} | ||
| points_df | ||
| A pre-computed pandas dataframe. Useful if the points_df has already been materialized, otherwise the | ||
| methods simply calls .compute() on the dask data frame | ||
|
|
||
| Returns | ||
| ------- | ||
| The masks for the points inside the bounding boxes. | ||
| """ | ||
| element_axes = get_axes_names(points) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This function uses single dispatch and expects a Dask |
||
|
|
||
| min_coordinate = _parse_list_into_array(min_coordinate) | ||
| max_coordinate = _parse_list_into_array(max_coordinate) | ||
|
|
||
| # Ensure min_coordinate and max_coordinate are 2D arrays | ||
| min_coordinate = min_coordinate[np.newaxis, :] if min_coordinate.ndim == 1 else min_coordinate | ||
| max_coordinate = max_coordinate[np.newaxis, :] if max_coordinate.ndim == 1 else max_coordinate | ||
|
|
||
| # Compute once here only if the caller hasn't already done so | ||
| if points_df is None: | ||
| points_df = points.compute() | ||
|
|
||
| n_boxes = min_coordinate.shape[0] | ||
| in_bounding_box_masks = [] | ||
|
|
||
| for box in range(n_boxes): | ||
| box_masks = [] | ||
| for axis_index, axis_name in enumerate(axes): | ||
| if axis_name not in element_axes: | ||
| continue | ||
| min_value = min_coordinate[box, axis_index] | ||
| max_value = max_coordinate[box, axis_index] | ||
| box_masks.append(points[axis_name].gt(min_value).compute() & points[axis_name].lt(max_value).compute()) | ||
| col = points_df[axis_name].values | ||
| box_masks.append((col > min_value) & (col < max_value)) | ||
| bounding_box_mask = np.stack(box_masks, axis=-1) | ||
| in_bounding_box_masks.append(np.all(bounding_box_mask, axis=1)) | ||
| return in_bounding_box_masks | ||
|
|
@@ -514,16 +524,6 @@ def _( | |
| min_coordinate = _parse_list_into_array(min_coordinate) | ||
| max_coordinate = _parse_list_into_array(max_coordinate) | ||
| new_elements = {} | ||
| if sdata.points: | ||
| warnings.warn( | ||
| ( | ||
| "The object has `points` element. Depending on the number of points, querying MAY suffer from " | ||
| "performance issues. Please consider filtering the object before calling this function by calling the " | ||
| "`subset()` method of `SpatialData`." | ||
| ), | ||
| UserWarning, | ||
| stacklevel=2, | ||
| ) | ||
| for element_type in ["points", "images", "labels", "shapes"]: | ||
| elements = getattr(sdata, element_type) | ||
| queried_elements = _dict_query_dispatcher( | ||
|
|
@@ -630,7 +630,6 @@ def _( | |
| max_coordinate: list[Number] | ArrayLike, | ||
| target_coordinate_system: str, | ||
| ) -> DaskDataFrame | list[DaskDataFrame] | None: | ||
| from spatialdata import transform | ||
| from spatialdata.transformations import get_transformation | ||
|
|
||
| min_coordinate = _parse_list_into_array(min_coordinate) | ||
|
|
@@ -648,100 +647,103 @@ def _( | |
| max_coordinate=max_coordinate, | ||
| ) | ||
|
|
||
| # get the four corners of the bounding box (2D case), or the 8 corners of the "3D bounding box" (3D case) | ||
| (intrinsic_bounding_box_corners, intrinsic_axes) = _get_bounding_box_corners_in_intrinsic_coordinates( | ||
| element=points, | ||
| axes=axes, | ||
| min_coordinate=min_coordinate, | ||
| max_coordinate=max_coordinate, | ||
| target_coordinate_system=target_coordinate_system, | ||
| m_without_c, input_axes_without_c, output_axes_without_c = _get_axes_of_transformation( | ||
| points, target_coordinate_system | ||
| ) | ||
| m_without_c_linear = m_without_c[:-1, :-1] | ||
| _ = _get_case_of_bounding_box_query( | ||
| m_without_c_linear, | ||
| input_axes_without_c, | ||
| output_axes_without_c, | ||
| ) | ||
| min_coordinate_intrinsic = intrinsic_bounding_box_corners.min(dim="corner") | ||
| max_coordinate_intrinsic = intrinsic_bounding_box_corners.max(dim="corner") | ||
|
|
||
| min_coordinate_intrinsic = min_coordinate_intrinsic.data | ||
| max_coordinate_intrinsic = max_coordinate_intrinsic.data | ||
|
|
||
| # get the points in the intrinsic coordinate bounding box | ||
| in_intrinsic_bounding_box = _bounding_box_mask_points( | ||
| points=points, | ||
| axes=intrinsic_axes, | ||
| min_coordinate=min_coordinate_intrinsic, | ||
| max_coordinate=max_coordinate_intrinsic, | ||
| axes_adjusted, min_coordinate_adjusted, max_coordinate_adjusted = _adjust_bounding_box_to_real_axes( | ||
| axes, | ||
| min_coordinate, | ||
| max_coordinate, | ||
| output_axes_without_c, | ||
| ) | ||
| if axes_adjusted != output_axes_without_c: | ||
| raise RuntimeError("This should not happen") | ||
|
|
||
| if not (len_df := len(in_intrinsic_bounding_box)) == (len_bb := len(min_coordinate)): | ||
| raise ValueError( | ||
| f"Length of list of dataframes `{len_df}` is not equal to the number of bounding boxes axes `{len_bb}`." | ||
| ) | ||
| points_in_intrinsic_bounding_box: list[DaskDataFrame | None] = [] | ||
| # materialize the points in the intrinsic coordinate system once | ||
| points_pd = points.compute() | ||
| attrs = points.attrs.copy() | ||
| for mask_np in in_intrinsic_bounding_box: | ||
| if mask_np.sum() == 0: | ||
| points_in_intrinsic_bounding_box.append(None) | ||
| else: | ||
| # TODO there is a problem when mixing dask dataframe graph with dask array graph. Need to compute for now. | ||
| # we can't compute either mask or points as when we calculate either one of them | ||
| # test_query_points_multiple_partitions will fail as the mask will be used to index each partition. | ||
| # However, if we compute and then create the dask array again we get the mixed dask graph problem. | ||
| filtered_pd = points_pd[mask_np] | ||
| points_filtered = dd.from_pandas(filtered_pd, npartitions=points.npartitions) | ||
| points_filtered.attrs.update(attrs) | ||
| points_in_intrinsic_bounding_box.append(points_filtered) | ||
| if len(points_in_intrinsic_bounding_box) == 0: | ||
| return None | ||
|
|
||
| # assert that the number of queried points is correct | ||
| assert len(points_in_intrinsic_bounding_box) == len(min_coordinate) | ||
|
|
||
| # # we have to reset the index since we have subset | ||
| # # https://stackoverflow.com/questions/61395351/how-to-reset-index-on-concatenated-dataframe-in-dask | ||
| # points_in_intrinsic_bounding_box = points_in_intrinsic_bounding_box.assign(idx=1) | ||
| # points_in_intrinsic_bounding_box = points_in_intrinsic_bounding_box.set_index( | ||
| # points_in_intrinsic_bounding_box.idx.cumsum() - 1 | ||
| # ) | ||
| # points_in_intrinsic_bounding_box = points_in_intrinsic_bounding_box.map_partitions( | ||
| # lambda df: df.rename(index={"idx": None}) | ||
| # ) | ||
| # points_in_intrinsic_bounding_box = points_in_intrinsic_bounding_box.drop(columns=["idx"]) | ||
|
|
||
| # transform the element to the query coordinate system | ||
| # checking the type of the transformation | ||
| # in the case of an identity or scaling transform, we can skip the whole | ||
| # projection into intrinsic space and reprojection into the global coordinate system | ||
| is_identity_transform = input_axes_without_c == output_axes_without_c and np.allclose( | ||
| m_without_c, np.eye(m_without_c.shape[0]) | ||
| ) | ||
| is_scaling_transform = input_axes_without_c == output_axes_without_c and _is_scaling_transform(m_without_c_linear) | ||
|
|
||
| # if the transform is identity, we can save extra for the affine transformation | ||
| if is_identity_transform: | ||
| bounding_box_masks = _bounding_box_mask_points( | ||
| points=points, | ||
| axes=axes, | ||
| min_coordinate=min_coordinate, | ||
| max_coordinate=max_coordinate, | ||
| points_df=points_pd, | ||
| ) | ||
| elif is_scaling_transform: | ||
| # Pull scale factors from the diagonal and the translation from the last column | ||
| scales = np.diagonal(m_without_c_linear) # shape: (n_axes,) | ||
| translation = m_without_c[:-1, -1] # shape: (n_axes,) | ||
|
|
||
| # Invert the affine: x_intrinsic = (x_output - translation) / scale | ||
| min_intrinsic = (min_coordinate_adjusted - translation) / scales | ||
| max_intrinsic = (max_coordinate_adjusted - translation) / scales | ||
|
|
||
| # Ensure min < max after inversion (negative scale flips the interval) | ||
| min_intrinsic, max_intrinsic = ( | ||
| np.minimum(min_intrinsic, max_intrinsic), | ||
| np.maximum(min_intrinsic, max_intrinsic), | ||
| ) | ||
|
|
||
| bounding_box_masks = _bounding_box_mask_points( | ||
| points=points, | ||
| axes=tuple(input_axes_without_c), | ||
| min_coordinate=min_intrinsic, | ||
| max_coordinate=max_intrinsic, | ||
| points_df=points_pd, | ||
| ) | ||
| else: | ||
| query_coordinates = points_pd.loc[:, list(input_axes_without_c)].to_numpy(copy=False) | ||
| query_coordinates = query_coordinates @ m_without_c[:-1, :-1].T + m_without_c[:-1, -1] | ||
|
|
||
| bounding_box_masks = [] | ||
| for box_index in range(min_coordinate_adjusted.shape[0]): | ||
| bounding_box_mask = np.ones(len(points_pd), dtype=bool) | ||
| for axis_index in range(len(output_axes_without_c)): | ||
| min_value = min_coordinate_adjusted[box_index, axis_index] | ||
| max_value = max_coordinate_adjusted[box_index, axis_index] | ||
| column = query_coordinates[:, axis_index] | ||
| bounding_box_mask &= (column > min_value) & (column < max_value) | ||
| bounding_box_masks.append(bounding_box_mask) | ||
|
|
||
| if not (len_df := len(bounding_box_masks)) == (len_bb := len(min_coordinate)): | ||
| raise ValueError(f"Length of list of masks `{len_df}` is not equal to the number of bounding boxes `{len_bb}`.") | ||
|
|
||
| old_transformations = get_transformation(points, get_all=True) | ||
| assert isinstance(old_transformations, dict) | ||
| feature_key = points.attrs.get(ATTRS_KEY, {}).get(PointsModel.FEATURE_KEY) | ||
|
|
||
| output: list[DaskDataFrame | None] = [] | ||
| for p, min_c, max_c in zip(points_in_intrinsic_bounding_box, min_coordinate, max_coordinate, strict=True): | ||
| if p is None: | ||
| for mask_np in bounding_box_masks: | ||
| bounding_box_indices = np.flatnonzero(mask_np) | ||
| if len(bounding_box_indices) == 0: | ||
| output.append(None) | ||
| else: | ||
| points_query_coordinate_system = transform( | ||
| p, to_coordinate_system=target_coordinate_system, maintain_positioning=False | ||
| continue | ||
|
|
||
| # The exact mask is computed in the query coordinate system, but the returned points must stay intrinsic. | ||
| queried_points = points_pd.iloc[bounding_box_indices] | ||
| output.append( | ||
| PointsModel.parse( | ||
| dd.from_pandas(queried_points, npartitions=1), | ||
| transformations=old_transformations.copy(), | ||
| feature_key=feature_key, | ||
| ) | ||
|
|
||
| # get a mask for the points in the bounding box | ||
| bounding_box_mask = _bounding_box_mask_points( | ||
| points=points_query_coordinate_system, | ||
| axes=axes, | ||
| min_coordinate=min_c, # type: ignore[arg-type] | ||
| max_coordinate=max_c, # type: ignore[arg-type] | ||
| ) | ||
| if len(bounding_box_mask) != 1: | ||
| raise ValueError(f"Expected a single mask, got {len(bounding_box_mask)} masks. Please report this bug.") | ||
| bounding_box_indices = np.where(bounding_box_mask[0])[0] | ||
|
|
||
| if len(bounding_box_indices) == 0: | ||
| output.append(None) | ||
| else: | ||
| points_df = p.compute().iloc[bounding_box_indices] | ||
| old_transformations = get_transformation(p, get_all=True) | ||
| assert isinstance(old_transformations, dict) | ||
| feature_key = p.attrs.get(ATTRS_KEY, {}).get(PointsModel.FEATURE_KEY) | ||
|
|
||
| output.append( | ||
| PointsModel.parse( | ||
| dd.from_pandas(points_df, npartitions=1), | ||
| transformations=old_transformations.copy(), | ||
| feature_key=feature_key, | ||
| ) | ||
| ) | ||
| ) | ||
| if len(output) == 0: | ||
| return None | ||
| if len(output) == 1: | ||
|
|
@@ -791,8 +793,8 @@ def _( | |
| ) | ||
| for box_corners in intrinsic_bounding_box_corners: | ||
| bounding_box_non_axes_aligned = Polygon(box_corners.data) | ||
| indices = polygons.geometry.intersects(bounding_box_non_axes_aligned) | ||
| queried = polygons[indices] | ||
| candidate_idx = polygons.sindex.query(bounding_box_non_axes_aligned, predicate="intersects") | ||
| queried = polygons.iloc[candidate_idx] | ||
| if len(queried) == 0: | ||
| queried_polygon = None | ||
| else: | ||
|
|
@@ -949,17 +951,22 @@ def _( | |
| assert np.all(element[OLD_INDEX] == buffered.index) | ||
| else: | ||
| buffered[OLD_INDEX] = buffered.index | ||
| indices = buffered.geometry.apply(lambda x: x.intersects(polygon)) | ||
| if np.sum(indices) == 0: | ||
|
|
||
| # Use sindex for fast candidate pre-filtering, then exact intersection check | ||
| # only on the (typically small) candidate set — same pattern as bounding_box_query. | ||
| candidate_idx = buffered.sindex.query(polygon, predicate="intersects") | ||
| if len(candidate_idx) == 0: | ||
| del buffered[OLD_INDEX] | ||
| return None | ||
| queried_shapes = element[indices] | ||
| queried_shapes.index = buffered[indices][OLD_INDEX] | ||
|
|
||
| queried_shapes = element.iloc[candidate_idx].copy() | ||
| queried_shapes.index = buffered.iloc[candidate_idx][OLD_INDEX] | ||
| queried_shapes.index.name = None | ||
|
|
||
| if clip: | ||
| if isinstance(element.geometry.iloc[0], Point): | ||
| queried_shapes = buffered[indices] | ||
| queried_shapes.index = buffered[indices][OLD_INDEX] | ||
| queried_shapes = buffered.iloc[candidate_idx].copy() | ||
| queried_shapes.index = buffered.iloc[candidate_idx][OLD_INDEX] | ||
| queried_shapes.index.name = None | ||
| queried_shapes = queried_shapes.clip(polygon_gdf, keep_geom_type=True) | ||
|
|
||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_bounding_box_mask_points()is called only in two locations, and in bothpoints_dfis available. I will therefore remove thepointsargument, and only keeppoints_df.