Conversation
This commit separates three concepts that were previously mixed under the same "timef" naming:
1. SensorTSLM runtime/domain objects
2. the TimeF export adapter
3. the vendored TimeNet-derived persistence layer
Actionable comments posted: 12
🧹 Nitpick comments (1)
runtime/types.py (1)
89-100: Consider consolidating the aliased methods.
`signal()`/`channel()` and `iter_signals()`/`iter_channels()` are exact aliases. While this provides API flexibility, consider documenting one as the canonical name and the other as an alias to avoid confusion about which to use.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@runtime/types.py` around lines 89 - 100, Pick canonical names (e.g., signal and iter_signals) and mark the others as documented aliases: update the docstrings for signal, channel, iter_channels and iter_signals to clearly state which is canonical and that the other methods are simple aliases; ensure channel(self, idx) delegates to signal(self, idx) and iter_channels(self) delegates to iter_signals(self) (or vice versa) so implementation stays a single source of truth (refer to SignalView, signal, channel, iter_signals, iter_channels and self.values for locating the methods).
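The delegation pattern the prompt describes can be sketched like this (an illustrative stand-in, not the actual `SignalView` from `runtime/types.py`):

```python
class SignalView:
    """Illustrative stand-in for the runtime SignalView (assumed shape)."""

    def __init__(self, values):
        # values: list of per-channel sample lists
        self.values = values

    def signal(self, idx):
        """Canonical accessor: return one channel's samples."""
        return self.values[idx]

    def channel(self, idx):
        """Alias for signal(); delegates so behavior lives in one place."""
        return self.signal(idx)

    def iter_signals(self):
        """Canonical iterator over channels."""
        yield from self.values

    # Documented alias bound at class level -- a single source of truth.
    iter_channels = iter_signals
```

Because both aliases delegate to the canonical methods, a future fix only needs to land in one spot.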
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@exporters/timef_export.py`:
- Around line 85-95: TimeFWriter(root) eagerly creates the final root/signals
tree which can leave partial files on failure; instead stage output to a
temporary directory and only instantiate or move TimeFWriter into the final path
on success: write signals via _write_signal_frame into a temp staging directory
(or pass a staging path to TimeFWriter), accumulate samples/signals/annotations
there, and after all rows and writes complete, atomically move or rename the
staging directory to the intended root (or then construct TimeFWriter(root) if
it must create the tree). Also ensure any temporary directory is removed on
failure so subsequent retries won't hit FileExistsError.
- Around line 136-139: The code that builds reference (the json.dumps block
using channel_names and annotation.channel_idxs) must validate
annotation.channel_idxs before indexing: ensure each idx is an integer and 0 <=
idx < len(channel_names) (do not allow negative wrapping), and handle violations
with a clear error or by filtering them out consistently; update the
comprehension that produces "channel_names": [channel_names[idx] for idx in
annotation.channel_idxs] to first validate/map the indexes (or raise a
ValueError with the bad idx and annotation id/context) so an IndexError or
silent negative wrap cannot occur during export.
- Around line 209-214: In _write_signal_frame ensure you detect and reject
duplicate or colliding column names before building payload: check
row.channel_names for duplicates and also check if any channel name equals the
time_column_name, and if any conflict exists raise a clear ValueError (including
the offending names) instead of silently overwriting; do this validation prior
to creating payload and populating entries so callers get an explicit error when
channel names conflict with each other or with time_column_name.
In `@README.md`:
- Around line 21-40: The example uses an undefined lowercase variable
captionizer when calling captionizer.run; either instantiate a Captionizer
instance before use (e.g., create and configure a Captionizer object and assign
it to captionizer) or add a clear comment above the snippet stating that a
configured Captionizer instance must exist; update the README example to
reference the Captionizer symbol (Captionizer) properly so export_caption_result
and TimeFExportConfig are called with a real result from captionizer.run.
In `@timef/folder_only_for_backwards_compatability.txt`:
- Around line 1-7: The filename contains a spelling mistake: rename
folder_only_for_backwards_compatability.txt to
folder_only_for_backwards_compatibility.txt and update any references to the old
name (docs, README, tests, CI configs, or code that reads this file) to use the
corrected filename so nothing breaks; ensure the rename preserves file contents
and SPDX headers and run a quick grep for "backwards_compatability" to find and
update all references.
In `@timenet_timef/__init__.py`:
- Around line 23-40: The __all__ list is unsorted (contains names like
"Annotation", "AnnotationSampleRef", ..., "validate_dataset") and triggers Ruff
RUF022; reorder the entries in the __all__ list into a stable alphabetical order
(by string) so that symbols like Annotation, AnnotationSampleRef,
AnnotationSpec, DatasetManifest, Sample, SampleSignalRef, SensorSpec, Signal,
SignalSpec, TimeFReader, TimeFValidationError, TimeFWriter, VALID_DOMAINS,
mark_validated, validate_annotation_against_spec, validate_dataset are sorted
alphabetically to silence the lint error and keep CI deterministic.
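A lightweight way to keep that invariant visible (a sketch; the names are copied from the review comment, and a real module would rely on Ruff's RUF022 rather than a runtime assert):

```python
# Names from the review comment; keeping the __all__ literal pre-sorted
# is what RUF022 checks, and it keeps CI output deterministic.
__all__ = [
    "Annotation",
    "AnnotationSampleRef",
    "AnnotationSpec",
    "DatasetManifest",
    "Sample",
    "SampleSignalRef",
    "SensorSpec",
    "Signal",
    "SignalSpec",
    "TimeFReader",
    "TimeFValidationError",
    "TimeFWriter",
    "VALID_DOMAINS",
    "mark_validated",
    "validate_annotation_against_spec",
    "validate_dataset",
]

assert __all__ == sorted(__all__)
</antml>```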
In `@timenet_timef/io.py`:
- Around line 203-204: The read_signal_frame method currently joins signal_file
directly into self.root allowing absolute or ../ paths to escape root; update
read_signal_frame to first run signal_file through the same shard-path resolver
used in timenet_timef/validate.py (the shard path resolver) to normalize and
resolve the path relative to self.root, verify the resolved path is contained
under self.root / "signals" (raise an error if not), and then pass that resolved
path to pq.read_table instead of joining signal_file directly.
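A containment check along those lines might look like this (a sketch using a hypothetical `resolve_shard_path` helper; the project's actual resolver may differ):

```python
from pathlib import Path


def resolve_shard_path(root: Path, signal_file: str) -> Path:
    """Resolve signal_file under root/signals and reject path escapes."""
    signals_root = (root / "signals").resolve()
    # Joining with an absolute signal_file replaces signals_root entirely,
    # so resolving and re-checking containment catches that case too.
    candidate = (signals_root / signal_file).resolve()
    try:
        # relative_to raises ValueError when candidate lies outside
        # signals_root, covering both absolute paths and "../" traversal.
        candidate.relative_to(signals_root)
    except ValueError:
        raise ValueError(f"signal file escapes dataset root: {signal_file}") from None
    return candidate
```

The resolved path, not the raw `signal_file`, would then be handed to `pq.read_table`.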
- Around line 206-223: The reader currently filters columns using the hard-coded
helper _is_time_column inside read_signal_frame_for_sample which drops
non-"time"/"time_*" columns (breaking exports that set time_column_name like
"timestamp"); update the filter to preserve the actual time column provided by
the dataset/sample metadata instead of relying solely on _is_time_column — e.g.
derive the expected time column name from the signal/manifest/sample metadata
(check signal.time_column_name or manifest.time_column_name, falling back to
_is_time_column only if none is set) and include that name in the
key-preservation check when building filtered rows for signal_ref.channels.
In `@timenet_timef/labels.py`:
- Around line 17-28: get_schema and get_valid_labels currently return direct
references into LABEL_SCHEMAS so callers can mutate global state; change
get_schema(name: str) to return a defensive copy (use copy.deepcopy on
LABEL_SCHEMAS[name]) and change get_valid_labels(name: str) to return a new list
(e.g., list(labels) or labels.copy()) after validating types; ensure you import
copy and keep the same error checks (raise ValueError for unknown schema or
invalid labels) so callers receive safe, non-mutable copies of schema and label
lists.
In `@timenet_timef/units.py`:
- Around line 10-13: The validate_ucum function accepts whitespace-padded unit
codes but returns the original string; change it to normalize the input by
trimming whitespace (use unit_code.strip()) and return the stripped value after
validation; keep the same ValueError on empty/invalid inputs but ensure callers
receive the normalized unit code from validate_ucum.
In `@timenet_timef/validate.py`:
- Around line 22-26: Before instantiating TimeFReader in validate_dataset, check
that manifest.json, samples.parquet, and annotations.parquet exist in the
provided root and raise TimeFValidationError with the explicit messages instead
of letting TimeFReader/FileNotFoundError surface; locate the validate_dataset
function and the calls to TimeFReader.read_manifest(), read_samples(), and
read_annotations() and add pre-checks for those three files (and mirror the same
checks for the other validation block that uses TimeFReader between lines 42-51)
so failures produce the intended TimeFValidationError messages.
- Around line 60-64: Add checks to reject negative row spans: validate that
signal.row_start >= 0 and signal.row_count >= 0 and raise TimeFValidationError
with a clear message if either is negative. Update the validation block that
currently checks signal.row_group_id and row span (using metadata.row_group(...)
and variables row_group_rows) to first assert signal.row_start and
signal.row_count are non-negative, then perform the existing upper-bound check;
use the same TimeFValidationError type and include signal.id in the error
messages.
📒 Files selected for processing (29)
`README.md`, `annotator.py`, `captionizer.py`, `exporters/__init__.py`, `exporters/timef_export.py`, `extractors/__init__.py`, `extractors/generative.py`, `extractors/semantic.py`, `extractors/statistical.py`, `extractors/structural.py`, `mhc/transformer.py`, `models/base.py`, `models/client.py`, `models/local.py`, `reviewer.py`, `runtime/__init__.py`, `runtime/types.py`, `timef/__init__.py`, `timef/folder_only_for_backwards_compatability.txt`, `timef/schema.py`, `timenet_timef/__init__.py`, `timenet_timef/io.py`, `timenet_timef/labels.py`, `timenet_timef/schema.py`, `timenet_timef/units.py`, `timenet_timef/utils.py`, `timenet_timef/validate.py`, `transformer.py`, `visualizer.py`
```python
writer = TimeFWriter(root)
samples: list[Sample] = []
signals: list[Signal] = []
annotations: list[PersistedAnnotation] = []
sampling_rate = config.timestamp_unit / config.sampling_period
annotation_id = 0

for sample_id, row in enumerate(result.rows):
    _validate_row_shape(row, channel_names)
    signal_file = f"sample-{sample_id}.parquet"
    _write_signal_frame(root / "signals" / signal_file, config.sampling_period, config.time_column_name, row)
```
Stage the export before touching the final dataset path.
TimeFWriter(root) eagerly creates root/signals in timenet_timef/io.py Lines 76-78. Any later failure here—or at Lines 182-183—leaves a partial tree behind, and the next retry then hits Line 65's FileExistsError even though no valid dataset was produced.
Also applies to: 179-183
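The staging approach can be sketched as follows (a minimal, hypothetical `export_atomically` helper, not the exporter's actual code):

```python
import os
import shutil
import tempfile
from pathlib import Path


def export_atomically(root: Path, write_dataset) -> Path:
    """Write a dataset into a staging directory, then move it into place.

    write_dataset is a callable that receives the staging directory and
    writes the full tree there; on failure the staging directory is removed,
    so a retry never finds a half-written root.
    """
    if root.exists():
        raise FileExistsError(f"dataset already exists: {root}")
    staging = Path(tempfile.mkdtemp(dir=root.parent, prefix=".staging-"))
    try:
        write_dataset(staging)
        os.replace(staging, root)  # atomic rename on the same filesystem
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return root
```

Staging on the same filesystem as the final root keeps the final `os.replace` a single atomic rename.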
```python
reference = json.dumps(
    {
        "channel_names": [channel_names[idx] for idx in annotation.channel_idxs],
        "window": list(annotation.window) if annotation.window is not None else None,
```
Validate annotation.channel_idxs before translating them to names.
runtime.types.Annotation does not bound-check these indexes. Negative indexes silently bind to the wrong channel, and an out-of-range index raises IndexError mid-export.
Suggested change

```diff
 if spec_id is None:
     raise ValueError(f"Unsupported annotation kind: {annotation.kind}")
+invalid_channel_idxs = [
+    idx for idx in annotation.channel_idxs if idx < 0 or idx >= len(channel_names)
+]
+if invalid_channel_idxs:
+    raise ValueError(
+        f"Annotation {annotation.kind!r} references invalid channel indexes: {invalid_channel_idxs}"
+    )
 reference = json.dumps(
     {
         "channel_names": [channel_names[idx] for idx in annotation.channel_idxs],
         "window": list(annotation.window) if annotation.window is not None else None,
         "kind": annotation.kind,
```
```python
def _write_signal_frame(path: Path, sampling_period: float, time_column_name: str, row) -> None:
    time_axis = [idx * sampling_period for idx in range(row.values.shape[1])]
    payload: dict[str, list[float]] = {time_column_name: time_axis}
    for idx, channel_name in enumerate(row.channel_names):
        payload[channel_name] = row.values[idx].astype(float).tolist()
    path.parent.mkdir(parents=True, exist_ok=True)
```
Reject duplicate or colliding column names before building the parquet payload.
This payload is keyed by time_column_name and row.channel_names. Duplicate channel names—or a channel named exactly like the time column—overwrite earlier entries in the dict and silently drop signal data.
Suggested change

```diff
 def _write_signal_frame(path: Path, sampling_period: float, time_column_name: str, row) -> None:
+    channel_names = tuple(row.channel_names)
+    if len(set(channel_names)) != len(channel_names):
+        raise ValueError("RuntimeRow channel_names must be unique for export")
+    if time_column_name in channel_names:
+        raise ValueError("time_column_name must not collide with a channel name")
+
     time_axis = [idx * sampling_period for idx in range(row.values.shape[1])]
     payload: dict[str, list[float]] = {time_column_name: time_axis}
-    for idx, channel_name in enumerate(row.channel_names):
+    for idx, channel_name in enumerate(channel_names):
         payload[channel_name] = row.values[idx].astype(float).tolist()
     path.parent.mkdir(parents=True, exist_ok=True)
     pq.write_table(pa.table(payload), path)
```
```python
from pathlib import Path

from captionizer import Captionizer
from exporters.timef_export import TimeFExportConfig, export_caption_result

result, _ = captionizer.run(max_rows=5)
root = export_caption_result(
    result,
    TimeFExportConfig(
        output_root=Path("exports"),
        dataset_id="mhc_caption_runs",
        sampling_period=1,
        timestamp_unit=1,
        unit_sampling_rate="1 / minute",
        unit_timestamp="minute",
        time_column_name="time_minute",
    ),
)
print(root)
```
Example code references undefined captionizer variable.
The example imports the Captionizer class (line 24) but uses a lowercase captionizer instance (line 27) that is never created. Users following this example will encounter a NameError.
Consider either:
- Adding the necessary setup (dataset, transformer, annotator, instantiation), or
- Adding a comment indicating prerequisite setup is required.
📝 Proposed fix with minimal setup context

```diff
 from pathlib import Path
 from captionizer import Captionizer
 from exporters.timef_export import TimeFExportConfig, export_caption_result
+
+# Setup (see Usage section for full details)
+# captionizer = Captionizer(dataset, transformer, annotator)
 result, _ = captionizer.run(max_rows=5)
```

Or provide a complete working example:

```diff
 from pathlib import Path
 from captionizer import Captionizer
 from exporters.timef_export import TimeFExportConfig, export_caption_result
+from mhc.dataset import MHCDataset
+from mhc.transformer import MHCTransformer
+from mhc.constants import MHC_CHANNEL_CONFIG
+from extractors.statistical import StatisticalExtractor
+from annotator import Annotator
+
+dataset = MHCDataset()
+annotator = Annotator([StatisticalExtractor(MHC_CHANNEL_CONFIG)])
+captionizer = Captionizer(dataset, MHCTransformer(), annotator)
 result, _ = captionizer.run(max_rows=5)
```
```text
#
# SPDX-FileCopyrightText: 2026 Stanford University, ETH Zurich, and the project authors (see CONTRIBUTORS.md)
# SPDX-FileCopyrightText: 2026 This source file is part of the SensorTSLM open-source project.
#
# SPDX-License-Identifier: MIT
#
This folder is only here for backwards compatibility it serves no purpose in the current implementation, its supposed to be deleted soon
```
Typo in filename: "compatability" should be "compatibility".
The filename folder_only_for_backwards_compatability.txt contains a spelling error. Consider renaming to folder_only_for_backwards_compatibility.txt.
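The mechanical rename plus reference sweep can be scripted, e.g. (a sketch with a hypothetical `fix_typo_filename` helper, run against the repo root):

```python
from pathlib import Path


def fix_typo_filename(repo: Path) -> list[Path]:
    """Rename the misspelled file and list files still referencing it."""
    old = repo / "timef" / "folder_only_for_backwards_compatability.txt"
    if old.exists():
        # rename preserves the file contents, including the SPDX headers
        old.rename(old.with_name("folder_only_for_backwards_compatibility.txt"))
    # Grep-equivalent sweep for stale references to the old spelling.
    return [
        p
        for p in sorted(repo.rglob("*"))
        if p.is_file() and "backwards_compatability" in p.read_text(errors="ignore")
    ]
```

Any paths returned still reference the old name and need updating by hand.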
```python
def read_signal_frame_for_sample(self, sample: Sample, manifest: DatasetManifest) -> dict[int, list[dict[str, Any]]]:
    signals_by_id = {signal.id: signal for signal in manifest.signals}
    result: dict[int, list[dict[str, Any]]] = {}
    for signal_ref in sample.signals:
        signal = signals_by_id.get(signal_ref.signal_id)
        if signal is None:
            continue
        rows = self.read_signal_frame(signal.shard_file)
        sliced = rows[signal.row_start : signal.row_start + signal.row_count]
        if signal_ref.channels is not None:
            filtered: list[dict[str, Any]] = []
            for row in sliced:
                filtered.append(
                    {
                        key: value
                        for key, value in row.items()
                        if _is_time_column(key) or key in signal_ref.channels
                    }
```
Don't hard-code the time-column naming convention in the reader.
export_caption_result() only requires time_column_name to be non-empty, but this filter preserves only time/time_*. A valid export with time_column_name="timestamp" loses its time axis on readback through this path.
Suggested change

```diff
 def read_signal_frame_for_sample(self, sample: Sample, manifest: DatasetManifest) -> dict[int, list[dict[str, Any]]]:
     signals_by_id = {signal.id: signal for signal in manifest.signals}
+    all_signal_columns = {channel for spec in manifest.signal_spec for channel in spec.channels}
     result: dict[int, list[dict[str, Any]]] = {}
     for signal_ref in sample.signals:
         signal = signals_by_id.get(signal_ref.signal_id)
         if signal is None:
             continue
         rows = self.read_signal_frame(signal.shard_file)
         sliced = rows[signal.row_start : signal.row_start + signal.row_count]
         if signal_ref.channels is not None:
+            requested_channels = set(signal_ref.channels)
             filtered: list[dict[str, Any]] = []
             for row in sliced:
                 filtered.append(
                     {
                         key: value
                         for key, value in row.items()
-                        if _is_time_column(key) or key in signal_ref.channels
+                        if key in requested_channels or key not in all_signal_columns
                     }
                 )
             sliced = filtered
         result[signal_ref.signal_id] = sliced
     return result
```
```python
def get_schema(name: str) -> dict[str, list[str] | str]:
    if name not in LABEL_SCHEMAS:
        raise ValueError(f"Unknown label schema: {name}")
    return LABEL_SCHEMAS[name]


def get_valid_labels(name: str) -> list[str]:
    schema = get_schema(name)
    labels = schema["labels"]
    if not isinstance(labels, list):
        raise ValueError(f"Invalid labels payload for schema {name}")
    return labels
```
Avoid exposing mutable global schema internals.
Both helpers return direct references to LABEL_SCHEMAS contents, so callers can mutate global label definitions unintentionally.
Proposed fix

```diff
+from copy import deepcopy
+
 def get_schema(name: str) -> dict[str, list[str] | str]:
     if name not in LABEL_SCHEMAS:
         raise ValueError(f"Unknown label schema: {name}")
-    return LABEL_SCHEMAS[name]
+    return deepcopy(LABEL_SCHEMAS[name])
@@
 def get_valid_labels(name: str) -> list[str]:
     schema = get_schema(name)
     labels = schema["labels"]
     if not isinstance(labels, list):
         raise ValueError(f"Invalid labels payload for schema {name}")
-    return labels
+    return list(labels)
```
```python
def validate_ucum(unit_code: str) -> str:
    if not isinstance(unit_code, str) or not unit_code.strip():
        raise ValueError("Unit code must be a non-empty string")
    return unit_code
```
Normalize unit codes before returning them.
validate_ucum checks strip() but returns the original value, so whitespace-padded codes are accepted and persisted unchanged. That can produce invalid/cross-tool incompatible unit values.
Proposed fix

```diff
 def validate_ucum(unit_code: str) -> str:
-    if not isinstance(unit_code, str) or not unit_code.strip():
+    if not isinstance(unit_code, str):
+        raise ValueError("Unit code must be a non-empty string")
+    normalized = unit_code.strip()
+    if not normalized:
         raise ValueError("Unit code must be a non-empty string")
-    return unit_code
+    return normalized
```
```python
def validate_dataset(root: Path) -> None:
    reader = TimeFReader(root)
    manifest = reader.read_manifest()
    samples = list(reader.read_samples())
    annotations = list(reader.read_annotations())
```
Check required artifacts before calling the reader.
A missing manifest.json, samples.parquet, or annotations.parquet will fail in Lines 23-26 before this loop runs, so callers get FileNotFoundError instead of TimeFValidationError and the explicit messages below never fire.
Suggested change

```diff
 def validate_dataset(root: Path) -> None:
+    for required in (
+        "manifest.json",
+        "samples.parquet",
+        "annotations.parquet",
+        "sample_signal_index.parquet",
+        "annotation_tasks_index.parquet",
+        "annotation_domains_index.parquet",
+    ):
+        if not (root / required).exists():
+            raise TimeFValidationError(f"missing required file {required}")
+
     reader = TimeFReader(root)
     manifest = reader.read_manifest()
     samples = list(reader.read_samples())
     annotations = list(reader.read_annotations())
-
-    for required in (
-        "manifest.json",
-        "samples.parquet",
-        "annotations.parquet",
-        "sample_signal_index.parquet",
-        "annotation_tasks_index.parquet",
-        "annotation_domains_index.parquet",
-    ):
-        if not (root / required).exists():
-            raise TimeFValidationError(f"missing required file {required}")
```

Also applies to: 42-51
```python
if signal.row_group_id >= metadata.num_row_groups:
    raise TimeFValidationError(f"row_group_id out of bounds for signal {signal.id}")
row_group_rows = metadata.row_group(signal.row_group_id).num_rows
if signal.row_start + signal.row_count > row_group_rows:
    raise TimeFValidationError(f"row span out of bounds for signal {signal.id}")
```
Reject negative row spans during validation.
Only the upper bound is enforced here. A manifest with row_start < 0 or row_count < 0 passes validation, and TimeFReader.read_signal_frame_for_sample() later slices from the end of the frame instead of failing.
Suggested change

```diff
 metadata = pq.ParquetFile(shard_path).metadata
+if signal.row_group_id < 0 or signal.row_start < 0 or signal.row_count < 0:
+    raise TimeFValidationError(f"negative row coordinates for signal {signal.id}")
 if signal.row_group_id >= metadata.num_row_groups:
     raise TimeFValidationError(f"row_group_id out of bounds for signal {signal.id}")
 row_group_rows = metadata.row_group(signal.row_group_id).num_rows
 if signal.row_start + signal.row_count > row_group_rows:
     raise TimeFValidationError(f"row span out of bounds for signal {signal.id}")
```
Add TimeNet/TimeF Export for SensorTSLM Results
♻️ Current situation & Problem
SensorTSLM previously produced useful runtime results in memory, but it did not have a persistent export format for storing those results in a structured, reusable way.
This PR adds a new export path that converts SensorTSLM output into a TimeNet-compatible on-disk TimeF dataset. That means the transformed sensor signals and generated annotations can now be persisted in a format that is inspectable, portable, and aligned with the TimeNet ecosystem.
As part of making that integration understandable and maintainable, this PR also cleans up naming around the new persistence layer:
As part of making that integration understandable and maintainable, this PR also cleans up naming around the new persistence layer: the vendored TimeNet-derived code now lives under `timenet_timef`, and the runtime/domain objects live under `runtime`. That rename work is supportive of the exporter feature, not the main purpose of the PR.
Also important for review:
⚙️ Release Notes
- Export `CaptionResult` to TimeNet-compatible TimeF datasets on disk (example usage is shown in the README)
This export writes a self-contained TimeF dataset containing:
- `manifest.json`
- `samples.parquet`
- `annotations.parquet`
- `signals/`

📚 Documentation
The main addition in this PR is the new TimeNet/TimeF exporter.
What the exporter does
The exporter takes SensorTSLM runtime output and writes it into a TimeNet-style persistent dataset layout.
In v1, it persists the transformed signal frames for each `RuntimeRow`, together with the generated annotations. It does not yet persist anything beyond that.
Why this is useful
This makes SensorTSLM results inspectable, portable, and reusable across the TimeNet ecosystem.
Why the naming cleanup is included
Because this PR introduces a real TimeNet-style persistence layer, the old naming became more confusing:
- the `timef` package represented runtime/in-memory objects

To avoid conflating those two concepts, the runtime layer now has clearer canonical imports and the persistence layer has a name that reflects its TimeNet origin.
The runtime rename is isolated in the last commit so that reviewers can easily drop that part if they prefer to keep the previous naming, while still keeping the exporter itself.
✅ Testing
Tested locally, test coverage includes:
- backwards-compatible `timef` imports

Code of Conduct & Contributing Guidelines