Add creation of col stats from row_range or date_range #3154
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,6 +15,14 @@ namespace arcticdb { | |
|
|
||
| SegmentInMemory merge_column_stats_segments(const std::vector<SegmentInMemory>& segments); | ||
|
|
||
| // Builds a merged column-stats segment that keeps every row of `old_segment` whose | ||
| // [start_index, end_index] lies fully outside `range_replaced`, and inserts the rows from | ||
| // `new_segment` (covering the in-range row-slices). Both segments must share the same | ||
| // stat-column schema. Used by the range-restricted RMW path in create_column_stats_impl. | ||
| SegmentInMemory merge_column_stats_with_range_replacement( | ||
|
Collaborator
Let's leave this merging with the old stats out, at least for now; it's a bit complicated and I'm not certain that we will need it.
||
| const SegmentInMemory& old_segment, SegmentInMemory new_segment, const entity::TimestampRange& range_replaced | ||
| ); | ||
|
|
||
| // User facing types - eg users are only allowed to create min and max together, not one or the other | ||
| enum class ColumnStatType { MINMAX }; | ||
| // Total universe of column stats we support - min and max are treated separately here | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -1266,14 +1266,27 @@ def _batch_update_internal( | |||||
| _log_warning_on_writing_empty_dataframe(data_vector[idx], symbols[idx]) | ||||||
| return result | ||||||
|
|
||||||
| def create_column_stats( | ||||||
| self, symbol: str, column_stats: Dict[str, Set[str]], as_of: Optional[VersionQueryInput] = None | ||||||
| def create_column_stats_experimental( | ||||||
|
Contributor
Renaming the public
||||||
| self, | ||||||
| symbol: str, | ||||||
| column_stats: Optional[Dict[str, Set[str]]] = None, | ||||||
| as_of: Optional[VersionQueryInput] = None, | ||||||
| date_range: Optional[DateRangeInput] = None, | ||||||
| row_range: Optional[Tuple[int, int]] = None, | ||||||
| ) -> None: | ||||||
| """ | ||||||
| Calculates the specified column statistics for each row-slice for the given symbol. In the future, these | ||||||
| statistics will be used by `QueryBuilder` filtering operations to reduce the number of data segments read out | ||||||
| of storage. | ||||||
|
|
||||||
| When `column_stats` is omitted, MINMAX stats are built for every non-index column whose dtype is numeric | ||||||
| (uint/int/float) or a UTC nanosecond timestamp. Any pre-existing stats are merged with the newly computed | ||||||
| ones (read-modify-write). | ||||||
|
|
||||||
| When `date_range` or `row_range` is provided, stats are only (re)computed for the row-slices overlapping that | ||||||
| range. Existing stats for row-slices that fall fully outside the range are kept; in-range slice stats are | ||||||
| replaced by the freshly computed values. `date_range` and `row_range` are mutually exclusive. | ||||||
|
|
||||||
| Parameters | ||||||
| ---------- | ||||||
| symbol: `str` | ||||||
|
|
@@ -1285,14 +1298,53 @@ def create_column_stats( | |||||
| "MINMAX" : store the minimum and maximum value for the column in each row-slice | ||||||
| as_of : `Optional[VersionQueryInput]`, default=None | ||||||
| See documentation of `read` method for more details. | ||||||
| date_range: `Optional[DateRangeInput]`, default=None | ||||||
| Restrict computation to row-slices overlapping this date range. Mutually exclusive with `row_range`. | ||||||
| Only supported on timestamp-indexed symbols. | ||||||
| row_range: `Optional[Tuple[int, int]]`, default=None | ||||||
| Restrict computation to row-slices overlapping the given (start, end) row range. Mutually exclusive | ||||||
| with `date_range`. | ||||||
|
|
||||||
| Returns | ||||||
| ------- | ||||||
| None | ||||||
| """ | ||||||
| column_stats = self._get_column_stats(column_stats) | ||||||
| check( | ||||||
| date_range is None or row_range is None, | ||||||
| "create_column_stats_experimental: date_range and row_range are mutually exclusive", | ||||||
| ) | ||||||
|
|
||||||
| # if no column stats specified in the function call, fallback to columns stats over all columns | ||||||
| if column_stats is None: | ||||||
| column_stats = self._get_eligible_column_stats_spec(symbol, as_of) | ||||||
| if not column_stats: | ||||||
| return | ||||||
|
|
||||||
| column_stats = self._convert_to_native_column_stats(column_stats) | ||||||
|
Contributor
Suggested change
|
||||||
| version_query = self._get_version_query(as_of) | ||||||
| self.version_store.create_column_stats_version(symbol, column_stats, version_query) | ||||||
| read_query = _PythonVersionStoreReadQuery() | ||||||
|
|
||||||
| if date_range is not None: | ||||||
| read_query.row_filter = _normalize_dt_range(date_range) | ||||||
|
|
||||||
| if row_range is not None: | ||||||
| read_query.row_range = _SignedRowRange(row_range[0], row_range[1]) | ||||||
|
|
||||||
| self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query) | ||||||
|
|
||||||
| def _get_eligible_column_stats_spec( | ||||||
| self, symbol: str, as_of: Optional[VersionQueryInput] | ||||||
| ) -> Dict[str, Set[str]]: | ||||||
| info = self.get_info(symbol, version=as_of) | ||||||
| numeric_value_types = { | ||||||
| TypeDescriptor.ValueType.UINT, | ||||||
| TypeDescriptor.ValueType.INT, | ||||||
| TypeDescriptor.ValueType.FLOAT, | ||||||
| TypeDescriptor.ValueType.NANOSECONDS_UTC, | ||||||
| } | ||||||
| columns = info["col_names"]["columns"] | ||||||
| dtypes = info["dtype"] | ||||||
| return {str(col): {"MINMAX"} for col, dtype in zip(columns, dtypes) if dtype.value_type in numeric_value_types} | ||||||
|
|
||||||
| def drop_column_stats( | ||||||
| self, symbol: str, column_stats: Optional[Dict[str, Set[str]]] = None, as_of: Optional[VersionQueryInput] = None | ||||||
|
|
||||||
Two correctness concerns with the range-restricted RMW merge:
1. Schema mismatch silently corrupts stats. In the `is_filtered` path the existing-stats drop optimization is skipped (version_core.cpp:2006), so `new_segment` holds stats for whatever `column_stats` was requested. If that set differs from the old segment's stat columns (e.g. the user passes an explicit `column_stats` subset, or the eligible-column set changed between calls), `merge_column_stats_segments` unions columns by name: in-range rows will lack the old-only stat columns and out-of-range rows will lack the new-only columns, producing a sparse/inconsistent stats segment. The doc comment states "Both segments must share the same stat-column schema" but nothing validates or enforces it. Either assert the schemas match (like the non-filtered branch does at version_core.cpp:2081) or handle the differing-column case explicitly.
2. Closed-interval `intersects` vs exclusive `end_index`. `entity::intersects` is closed on both ends (`left.first <= right.second && left.second >= right.first`). If the column-stats `end_index` is stored one-past-the-end (exclusive, as index keys are), an old row whose `end_index` equals `range_replaced.first` will be falsely treated as overlapping and dropped. Please confirm the end-index convention here and adjust the comparison if it is exclusive.
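To make the two concerns concrete, here is a minimal Python sketch of a range-replacement merge over plain dict rows rather than `SegmentInMemory`. It is an illustration under stated assumptions, not the PR's implementation: `overlaps`, `merge_with_range_replacement`, the `end_exclusive` flag and the `min_a`/`max_a` column names are all hypothetical. Whether `end_index` is actually exclusive in the column-stats segment is exactly the open question, so the sketch takes it as a parameter.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for one row of a column-stats segment (not an ArcticDB type).
Row = Dict[str, int]


def overlaps(row: Row, replaced: Tuple[int, int], end_exclusive: bool) -> bool:
    """Check whether a row-slice intersects the closed range `replaced`."""
    start, end = row["start_index"], row["end_index"]
    first, last = replaced
    if end_exclusive:
        # end_index is one past the last covered value, so end == first is no overlap.
        return start <= last and end > first
    # Closed on both ends, mirroring the comparison quoted in the review comment.
    return start <= last and end >= first


def merge_with_range_replacement(
    old_rows: List[Row], new_rows: List[Row], replaced: Tuple[int, int], end_exclusive: bool
) -> List[Row]:
    """Keep old rows fully outside `replaced`, then add the freshly computed rows."""
    # Refuse to merge when the stat columns differ, instead of producing sparse rows.
    schemas = {frozenset(row) for row in old_rows + new_rows}
    if len(schemas) > 1:
        raise ValueError("old and new column-stats rows have different stat columns")
    kept = [row for row in old_rows if not overlaps(row, replaced, end_exclusive)]
    return sorted(kept + new_rows, key=lambda row: row["start_index"])


# Assumed example data: two adjacent row-slices, the second one being recomputed.
old = [
    {"start_index": 0, "end_index": 10, "min_a": 1, "max_a": 5},
    {"start_index": 10, "end_index": 20, "min_a": 6, "max_a": 9},
]
new = [{"start_index": 10, "end_index": 20, "min_a": 7, "max_a": 8}]

exclusive_result = merge_with_range_replacement(old, new, (10, 19), end_exclusive=True)
closed_result = merge_with_range_replacement(old, new, (10, 19), end_exclusive=False)
```

With an exclusive `end_index` the first old row survives and the result has two rows. With the closed comparison the same row is treated as overlapping because its `end_index` equals the start of the replaced range, so it is dropped and only the new row remains.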