
Add TimeNet/TimeF Export for SensorTSLM Results (based on branch: internal-data-update)#39

Open
KarlDeck wants to merge 9 commits into main from
karl-deck/export-to-TimeNet-persistent-fromat(based_on_internal-data-update)

Conversation


@KarlDeck KarlDeck commented Apr 3, 2026

Add TimeNet/TimeF Export for SensorTSLM Results

♻️ Current situation & Problem

SensorTSLM previously produced useful runtime results in memory, but it had no persistent export format for storing those results in a structured, reusable way.

This PR adds a new export path that converts SensorTSLM output into a TimeNet-compatible on-disk TimeF dataset. That means the transformed sensor signals and generated annotations can now be persisted in a format that is inspectable, portable, and aligned with the TimeNet ecosystem.

As part of making that integration understandable and maintainable, this PR also cleans up naming around the new persistence layer:

  • the vendored TimeNet-derived storage layer is now clearly identified as timenet_timef
  • the SensorTSLM runtime/in-memory types now have clearer canonical imports under runtime

That rename work supports the exporter feature; it is not the main purpose of this PR.

Also important for review:

  • the last commit contains the runtime rename
  • if that rename is not desired, it can be reverted independently while keeping the new exporter and persistence package rename

⚙️ Release Notes

  • Added a persistent export pipeline from CaptionResult to TimeNet-compatible TimeF datasets on disk
  • Added a local TimeNet-derived persistence layer for writing, reading, and validating exported datasets
  • Added tests covering export validity, round-trip reading, annotation mapping, provenance preservation, and compatibility shims
  • Clarified package naming so runtime SensorTSLM types and TimeNet-derived persistence types are no longer conflated

Example usage (assumes a configured `Captionizer` instance named `captionizer` already exists):

```python
from pathlib import Path

from exporters.timef_export import TimeFExportConfig, export_caption_result

# Prerequisite: `captionizer` is a configured Captionizer instance.
result, _ = captionizer.run(max_rows=5)
root = export_caption_result(
    result,
    TimeFExportConfig(
        output_root=Path("exports"),
        dataset_id="mhc_caption_runs",
    ),
)
print(root)
```

This export writes a self-contained TimeF dataset version containing:

  • manifest.json
  • samples.parquet
  • annotations.parquet
  • index parquet files
  • per-sample signal parquet files in signals/
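As a rough illustration of that layout (not the PR's actual validation code; the helper below is hypothetical, only the file names come from the list above), a minimal layout check for one dataset version directory could look like:

```python
from pathlib import Path

# File names taken from the layout described above; this helper itself is
# illustrative and not part of the PR.
REQUIRED_FILES = ("manifest.json", "samples.parquet", "annotations.parquet")


def missing_dataset_files(version_root: Path) -> list[str]:
    """Return the required TimeF entries absent from a dataset version dir."""
    missing = [name for name in REQUIRED_FILES if not (version_root / name).is_file()]
    if not (version_root / "signals").is_dir():
        missing.append("signals/")
    return missing
```

A check like this can run before attempting a full read, so a partially written export fails fast with a clear list of what is missing.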

📚 Documentation

The main addition in this PR is the new TimeNet/TimeF exporter.

What the exporter does

The exporter takes SensorTSLM runtime output and writes it into a TimeNet-style persistent dataset layout.

In v1, it persists:

  • transformed runtime signals from each RuntimeRow
  • generated annotations from SensorTSLM
  • dataset metadata required by the TimeF structure

It does not yet persist:

  • reviewer/evaluation outputs
  • richer run provenance such as model configuration or execution metadata

Why this is useful

This makes SensorTSLM results:

  • persistent instead of only in-memory
  • inspectable in a structured dataset format
  • reusable for downstream tooling
  • closer to the broader TimeNet data model

Why the naming cleanup is included

Because this PR introduces a real TimeNet-style persistence layer, the old naming became more confusing:

  • SensorTSLM’s timef package represented runtime/in-memory objects
  • the new persistence layer represented actual TimeNet-style storage logic

To avoid conflating those two concepts, the runtime layer now has clearer canonical imports and the persistence layer has a name that reflects its TimeNet origin.

The runtime rename is isolated in the last commit so that reviewers can easily drop that part if they prefer to keep the previous naming, while still keeping the exporter itself.
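For reviewers weighing the rename, a legacy-import shim is one common way to keep old `timef` imports working after such a move. The sketch below is a generic illustration of that pattern; the function name and mechanism are assumptions, not this PR's actual compatibility code:

```python
import sys
import types
import warnings


def make_alias_module(old_name: str, target: types.ModuleType) -> types.ModuleType:
    """Register a module under `old_name` that forwards lookups to `target`."""
    shim = types.ModuleType(old_name)

    def module_getattr(name: str):
        # PEP 562 module-level __getattr__: warn on each access, then delegate.
        warnings.warn(
            f"{old_name}.{name} is deprecated; import it from {target.__name__}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(target, name)

    shim.__getattr__ = module_getattr
    sys.modules[old_name] = shim
    return shim
```

With a shim like this in place, old import paths keep resolving while each attribute access emits a `DeprecationWarning`, which gives downstream users time to migrate.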

✅ Testing

Tested locally; test coverage includes:

  • valid single-row export
  • stable multi-row export counts
  • round-trip reads through the persistence reader
  • channel filtering behavior
  • unlabeled vs labeled annotation mapping
  • provenance JSON preservation
  • generated timestamp versions
  • compatibility behavior for legacy timef imports

Code of Conduct & Contributing Guidelines

Summary by CodeRabbit

  • New Features

    • Added export functionality to persist captioning results as TimeF datasets with configurable output paths, metadata, and channel-specific unit mappings.
    • Introduced dataset I/O operations including reading/writing manifest, samples, annotations, and signal frames.
    • Added dataset validation and marking capabilities to ensure dataset integrity.
  • Refactor

    • Reorganized core data structures into a dedicated runtime module for improved modularity and clarity.

KarlDeck and others added 6 commits April 2, 2026 12:46
@KarlDeck KarlDeck requested a review from max-rosenblattl April 3, 2026 18:29

coderabbitai bot commented Apr 3, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a5e6e7b6-6062-4441-9a3a-23673d06ab73



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (1)
runtime/types.py (1)

89-100: Consider consolidating the aliased methods.

signal()/channel() and iter_signals()/iter_channels() are exact aliases. While this provides API flexibility, consider documenting one as the canonical name and the other as an alias to avoid confusion about which to use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@runtime/types.py` around lines 89 - 100, Pick canonical names (e.g., signal
and iter_signals) and mark the others as documented aliases: update the
docstrings for signal, channel, iter_channels and iter_signals to clearly state
which is canonical and that the other methods are simple aliases; ensure
channel(self, idx) delegates to signal(self, idx) and iter_channels(self)
delegates to iter_signals(self) (or vice versa) so implementation stays a single
source of truth (refer to SignalView, signal, channel, iter_signals,
iter_channels and self.values for locating the methods).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@exporters/timef_export.py`:
- Around line 85-95: TimeFWriter(root) eagerly creates the final root/signals
tree which can leave partial files on failure; instead stage output to a
temporary directory and only instantiate or move TimeFWriter into the final path
on success: write signals via _write_signal_frame into a temp staging directory
(or pass a staging path to TimeFWriter), accumulate samples/signals/annotations
there, and after all rows and writes complete, atomically move or rename the
staging directory to the intended root (or then construct TimeFWriter(root) if
it must create the tree). Also ensure any temporary directory is removed on
failure so subsequent retries won't hit FileExistsError.
- Around line 136-139: The code that builds reference (the json.dumps block
using channel_names and annotation.channel_idxs) must validate
annotation.channel_idxs before indexing: ensure each idx is an integer and 0 <=
idx < len(channel_names) (do not allow negative wrapping), and handle violations
with a clear error or by filtering them out consistently; update the
comprehension that produces "channel_names": [channel_names[idx] for idx in
annotation.channel_idxs] to first validate/map the indexes (or raise a
ValueError with the bad idx and annotation id/context) so an IndexError or
silent negative wrap cannot occur during export.
- Around line 209-214: In _write_signal_frame ensure you detect and reject
duplicate or colliding column names before building payload: check
row.channel_names for duplicates and also check if any channel name equals the
time_column_name, and if any conflict exists raise a clear ValueError (including
the offending names) instead of silently overwriting; do this validation prior
to creating payload and populating entries so callers get an explicit error when
channel names conflict with each other or with time_column_name.

In `@README.md`:
- Around line 21-40: The example uses an undefined lowercase variable
captionizer when calling captionizer.run; either instantiate a Captionizer
instance before use (e.g., create and configure a Captionizer object and assign
it to captionizer) or add a clear comment above the snippet stating that a
configured Captionizer instance must exist; update the README example to
reference the Captionizer symbol (Captionizer) properly so export_caption_result
and TimeFExportConfig are called with a real result from captionizer.run.

In `@timef/folder_only_for_backwards_compatability.txt`:
- Around line 1-7: The filename contains a spelling mistake: rename
folder_only_for_backwards_compatability.txt to
folder_only_for_backwards_compatibility.txt and update any references to the old
name (docs, README, tests, CI configs, or code that reads this file) to use the
corrected filename so nothing breaks; ensure the rename preserves file contents
and SPDX headers and run a quick grep for "backwards_compatability" to find and
update all references.

In `@timenet_timef/__init__.py`:
- Around line 23-40: The __all__ list is unsorted (contains names like
"Annotation", "AnnotationSampleRef", ..., "validate_dataset") and triggers Ruff
RUF022; reorder the entries in the __all__ list into a stable alphabetical order
(by string) so that symbols like Annotation, AnnotationSampleRef,
AnnotationSpec, DatasetManifest, Sample, SampleSignalRef, SensorSpec, Signal,
SignalSpec, TimeFReader, TimeFValidationError, TimeFWriter, VALID_DOMAINS,
mark_validated, validate_annotation_against_spec, validate_dataset are sorted
alphabetically to silence the lint error and keep CI deterministic.

In `@timenet_timef/io.py`:
- Around line 203-204: The read_signal_frame method currently joins signal_file
directly into self.root allowing absolute or ../ paths to escape root; update
read_signal_frame to first run signal_file through the same shard-path resolver
used in timenet_timef/validate.py (the shard path resolver) to normalize and
resolve the path relative to self.root, verify the resolved path is contained
under self.root / "signals" (raise an error if not), and then pass that resolved
path to pq.read_table instead of joining signal_file directly.
- Around line 206-223: The reader currently filters columns using the hard-coded
helper _is_time_column inside read_signal_frame_for_sample which drops
non-"time"/"time_*" columns (breaking exports that set time_column_name like
"timestamp"); update the filter to preserve the actual time column provided by
the dataset/sample metadata instead of relying solely on _is_time_column — e.g.
derive the expected time column name from the signal/manifest/sample metadata
(check signal.time_column_name or manifest.time_column_name, falling back to
_is_time_column only if none is set) and include that name in the
key-preservation check when building filtered rows for signal_ref.channels.

In `@timenet_timef/labels.py`:
- Around line 17-28: get_schema and get_valid_labels currently return direct
references into LABEL_SCHEMAS so callers can mutate global state; change
get_schema(name: str) to return a defensive copy (use copy.deepcopy on
LABEL_SCHEMAS[name]) and change get_valid_labels(name: str) to return a new list
(e.g., list(labels) or labels.copy()) after validating types; ensure you import
copy and keep the same error checks (raise ValueError for unknown schema or
invalid labels) so callers receive safe, non-mutable copies of schema and label
lists.

In `@timenet_timef/units.py`:
- Around line 10-13: The validate_ucum function accepts whitespace-padded unit
codes but returns the original string; change it to normalize the input by
trimming whitespace (use unit_code.strip()) and return the stripped value after
validation; keep the same ValueError on empty/invalid inputs but ensure callers
receive the normalized unit code from validate_ucum.

In `@timenet_timef/validate.py`:
- Around line 22-26: Before instantiating TimeFReader in validate_dataset, check
that manifest.json, samples.parquet, and annotations.parquet exist in the
provided root and raise TimeFValidationError with the explicit messages instead
of letting TimeFReader/FileNotFoundError surface; locate the validate_dataset
function and the calls to TimeFReader.read_manifest(), read_samples(), and
read_annotations() and add pre-checks for those three files (and mirror the same
checks for the other validation block that uses TimeFReader between lines 42-51)
so failures produce the intended TimeFValidationError messages.
- Around line 60-64: Add checks to reject negative row spans: validate that
signal.row_start >= 0 and signal.row_count >= 0 and raise TimeFValidationError
with a clear message if either is negative. Update the validation block that
currently checks signal.row_group_id and row span (using metadata.row_group(...)
and variables row_group_rows) to first assert signal.row_start and
signal.row_count are non-negative, then perform the existing upper-bound check;
use the same TimeFValidationError type and include signal.id in the error
messages.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d6df737c-1bb3-426e-8be1-0d0f56e1c1b3

📥 Commits

Reviewing files that changed from the base of the PR and between d09648c and 78ed48c.

📒 Files selected for processing (29)
  • README.md
  • annotator.py
  • captionizer.py
  • exporters/__init__.py
  • exporters/timef_export.py
  • extractors/__init__.py
  • extractors/generative.py
  • extractors/semantic.py
  • extractors/statistical.py
  • extractors/structural.py
  • mhc/transformer.py
  • models/base.py
  • models/client.py
  • models/local.py
  • reviewer.py
  • runtime/__init__.py
  • runtime/types.py
  • timef/__init__.py
  • timef/folder_only_for_backwards_compatability.txt
  • timef/schema.py
  • timenet_timef/__init__.py
  • timenet_timef/io.py
  • timenet_timef/labels.py
  • timenet_timef/schema.py
  • timenet_timef/units.py
  • timenet_timef/utils.py
  • timenet_timef/validate.py
  • transformer.py
  • visualizer.py

Comment thread exporters/timef_export.py
Comment on lines +85 to +95
writer = TimeFWriter(root)
samples: list[Sample] = []
signals: list[Signal] = []
annotations: list[PersistedAnnotation] = []
sampling_rate = config.timestamp_unit / config.sampling_period
annotation_id = 0

for sample_id, row in enumerate(result.rows):
    _validate_row_shape(row, channel_names)
    signal_file = f"sample-{sample_id}.parquet"
    _write_signal_frame(root / "signals" / signal_file, config.sampling_period, config.time_column_name, row)

⚠️ Potential issue | 🟠 Major

Stage the export before touching the final dataset path.

TimeFWriter(root) eagerly creates root/signals in timenet_timef/io.py Lines 76-78. Any later failure here—or at Lines 182-183—leaves a partial tree behind, and the next retry then hits Line 65's FileExistsError even though no valid dataset was produced.

Also applies to: 179-183


Comment thread exporters/timef_export.py
Comment on lines +136 to +139
reference = json.dumps(
    {
        "channel_names": [channel_names[idx] for idx in annotation.channel_idxs],
        "window": list(annotation.window) if annotation.window is not None else None,

⚠️ Potential issue | 🟠 Major

Validate annotation.channel_idxs before translating them to names.

runtime.types.Annotation does not bound-check these indexes. Negative indexes silently bind to the wrong channel, and an out-of-range index raises IndexError mid-export.

Suggested change
             if spec_id is None:
                 raise ValueError(f"Unsupported annotation kind: {annotation.kind}")
+            invalid_channel_idxs = [
+                idx for idx in annotation.channel_idxs if idx < 0 or idx >= len(channel_names)
+            ]
+            if invalid_channel_idxs:
+                raise ValueError(
+                    f"Annotation {annotation.kind!r} references invalid channel indexes: {invalid_channel_idxs}"
+                )
             reference = json.dumps(
                 {
                     "channel_names": [channel_names[idx] for idx in annotation.channel_idxs],
                     "window": list(annotation.window) if annotation.window is not None else None,
                     "kind": annotation.kind,

Comment thread exporters/timef_export.py
Comment on lines +209 to +214
def _write_signal_frame(path: Path, sampling_period: float, time_column_name: str, row) -> None:
    time_axis = [idx * sampling_period for idx in range(row.values.shape[1])]
    payload: dict[str, list[float]] = {time_column_name: time_axis}
    for idx, channel_name in enumerate(row.channel_names):
        payload[channel_name] = row.values[idx].astype(float).tolist()
    path.parent.mkdir(parents=True, exist_ok=True)

⚠️ Potential issue | 🟠 Major

Reject duplicate or colliding column names before building the parquet payload.

This payload is keyed by time_column_name and row.channel_names. Duplicate channel names—or a channel named exactly like the time column—overwrite earlier entries in the dict and silently drop signal data.

Suggested change
 def _write_signal_frame(path: Path, sampling_period: float, time_column_name: str, row) -> None:
+    channel_names = tuple(row.channel_names)
+    if len(set(channel_names)) != len(channel_names):
+        raise ValueError("RuntimeRow channel_names must be unique for export")
+    if time_column_name in channel_names:
+        raise ValueError("time_column_name must not collide with a channel name")
+
     time_axis = [idx * sampling_period for idx in range(row.values.shape[1])]
     payload: dict[str, list[float]] = {time_column_name: time_axis}
-    for idx, channel_name in enumerate(row.channel_names):
+    for idx, channel_name in enumerate(channel_names):
         payload[channel_name] = row.values[idx].astype(float).tolist()
     path.parent.mkdir(parents=True, exist_ok=True)
     pq.write_table(pa.table(payload), path)

Comment thread README.md
Comment on lines +21 to +40
```python
from pathlib import Path

from captionizer import Captionizer
from exporters.timef_export import TimeFExportConfig, export_caption_result

result, _ = captionizer.run(max_rows=5)
root = export_caption_result(
    result,
    TimeFExportConfig(
        output_root=Path("exports"),
        dataset_id="mhc_caption_runs",
        sampling_period=1,
        timestamp_unit=1,
        unit_sampling_rate="1 / minute",
        unit_timestamp="minute",
        time_column_name="time_minute",
    ),
)
print(root)
```

⚠️ Potential issue | 🟡 Minor

Example code references undefined captionizer variable.

The example imports Captionizer class (line 24) but uses a lowercase captionizer instance (line 27) that is never created. Users following this example will encounter a NameError.

Consider either:

  1. Adding the necessary setup (dataset, transformer, annotator, instantiation), or
  2. Adding a comment indicating prerequisite setup is required.
📝 Proposed fix with minimal setup context
 from pathlib import Path
 
 from captionizer import Captionizer
 from exporters.timef_export import TimeFExportConfig, export_caption_result
+
+# Setup (see Usage section for full details)
+# captionizer = Captionizer(dataset, transformer, annotator)
 
 result, _ = captionizer.run(max_rows=5)

Or provide a complete working example:

 from pathlib import Path
 
 from captionizer import Captionizer
 from exporters.timef_export import TimeFExportConfig, export_caption_result
+from mhc.dataset import MHCDataset
+from mhc.transformer import MHCTransformer
+from mhc.constants import MHC_CHANNEL_CONFIG
+from extractors.statistical import StatisticalExtractor
+from annotator import Annotator
+
+dataset = MHCDataset()
+annotator = Annotator([StatisticalExtractor(MHC_CHANNEL_CONFIG)])
+captionizer = Captionizer(dataset, MHCTransformer(), annotator)
 
 result, _ = captionizer.run(max_rows=5)

Comment thread timef/folder_only_for_backwards_compatability.txt
Comment on lines +1 to +7
#
# SPDX-FileCopyrightText: 2026 Stanford University, ETH Zurich, and the project authors (see CONTRIBUTORS.md)
# SPDX-FileCopyrightText: 2026 This source file is part of the SensorTSLM open-source project.
#
# SPDX-License-Identifier: MIT
#
This folder is only here for backwards compatibility it serves no purpose in the current implementation, its supposed to be deleted soon

⚠️ Potential issue | 🟡 Minor

Typo in filename: "compatability" should be "compatibility".

The filename folder_only_for_backwards_compatability.txt contains a spelling error. Consider renaming to folder_only_for_backwards_compatibility.txt.


Comment thread timenet_timef/io.py
Comment on lines +206 to +223
def read_signal_frame_for_sample(self, sample: Sample, manifest: DatasetManifest) -> dict[int, list[dict[str, Any]]]:
    signals_by_id = {signal.id: signal for signal in manifest.signals}
    result: dict[int, list[dict[str, Any]]] = {}
    for signal_ref in sample.signals:
        signal = signals_by_id.get(signal_ref.signal_id)
        if signal is None:
            continue
        rows = self.read_signal_frame(signal.shard_file)
        sliced = rows[signal.row_start : signal.row_start + signal.row_count]
        if signal_ref.channels is not None:
            filtered: list[dict[str, Any]] = []
            for row in sliced:
                filtered.append(
                    {
                        key: value
                        for key, value in row.items()
                        if _is_time_column(key) or key in signal_ref.channels
                    }

⚠️ Potential issue | 🟠 Major

Don't hard-code the time-column naming convention in the reader.

export_caption_result() only requires time_column_name to be non-empty, but this filter preserves only time/time_*. A valid export with time_column_name="timestamp" loses its time axis on readback through this path.

Suggested change
     def read_signal_frame_for_sample(self, sample: Sample, manifest: DatasetManifest) -> dict[int, list[dict[str, Any]]]:
         signals_by_id = {signal.id: signal for signal in manifest.signals}
+        all_signal_columns = {channel for spec in manifest.signal_spec for channel in spec.channels}
         result: dict[int, list[dict[str, Any]]] = {}
         for signal_ref in sample.signals:
             signal = signals_by_id.get(signal_ref.signal_id)
             if signal is None:
                 continue
             rows = self.read_signal_frame(signal.shard_file)
             sliced = rows[signal.row_start : signal.row_start + signal.row_count]
             if signal_ref.channels is not None:
+                requested_channels = set(signal_ref.channels)
                 filtered: list[dict[str, Any]] = []
                 for row in sliced:
                     filtered.append(
                         {
                             key: value
                             for key, value in row.items()
-                            if _is_time_column(key) or key in signal_ref.channels
+                            if key in requested_channels or key not in all_signal_columns
                         }
                     )
                 sliced = filtered
             result[signal_ref.signal_id] = sliced
         return result

Comment thread timenet_timef/labels.py
Comment on lines +17 to +28
def get_schema(name: str) -> dict[str, list[str] | str]:
    if name not in LABEL_SCHEMAS:
        raise ValueError(f"Unknown label schema: {name}")
    return LABEL_SCHEMAS[name]


def get_valid_labels(name: str) -> list[str]:
    schema = get_schema(name)
    labels = schema["labels"]
    if not isinstance(labels, list):
        raise ValueError(f"Invalid labels payload for schema {name}")
    return labels

⚠️ Potential issue | 🟠 Major

Avoid exposing mutable global schema internals.

Both helpers return direct references to LABEL_SCHEMAS contents, so callers can mutate global label definitions unintentionally.

Proposed fix
+from copy import deepcopy
+
 def get_schema(name: str) -> dict[str, list[str] | str]:
     if name not in LABEL_SCHEMAS:
         raise ValueError(f"Unknown label schema: {name}")
-    return LABEL_SCHEMAS[name]
+    return deepcopy(LABEL_SCHEMAS[name])
@@
 def get_valid_labels(name: str) -> list[str]:
     schema = get_schema(name)
     labels = schema["labels"]
     if not isinstance(labels, list):
         raise ValueError(f"Invalid labels payload for schema {name}")
-    return labels
+    return list(labels)

Comment thread timenet_timef/units.py
Comment on lines +10 to +13
def validate_ucum(unit_code: str) -> str:
    if not isinstance(unit_code, str) or not unit_code.strip():
        raise ValueError("Unit code must be a non-empty string")
    return unit_code

⚠️ Potential issue | 🟠 Major

Normalize unit codes before returning them.

validate_ucum calls strip() only for the emptiness check but returns the original value, so whitespace-padded codes are accepted and persisted unchanged. That can produce unit values that are invalid or incompatible with other tools.

Proposed fix
 def validate_ucum(unit_code: str) -> str:
-    if not isinstance(unit_code, str) or not unit_code.strip():
+    if not isinstance(unit_code, str):
+        raise ValueError("Unit code must be a non-empty string")
+    normalized = unit_code.strip()
+    if not normalized:
         raise ValueError("Unit code must be a non-empty string")
-    return unit_code
+    return normalized
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@timenet_timef/units.py` around lines 10 - 13, The validate_ucum function
accepts whitespace-padded unit codes but returns the original string; change it
to normalize the input by trimming whitespace (use unit_code.strip()) and return
the stripped value after validation; keep the same ValueError on empty/invalid
inputs but ensure callers receive the normalized unit code from validate_ucum.
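The normalized version can be sketched and exercised in isolation; this is a standalone copy of the suggested fix, not the file itself:

```python
def validate_ucum(unit_code: str) -> str:
    if not isinstance(unit_code, str):
        raise ValueError("Unit code must be a non-empty string")
    # Normalize before the emptiness check so whitespace-only and
    # whitespace-padded codes are handled consistently.
    normalized = unit_code.strip()
    if not normalized:
        raise ValueError("Unit code must be a non-empty string")
    return normalized


assert validate_ucum(" mg/dL ") == "mg/dL"
```

With this change, callers always persist the trimmed code, and a whitespace-only string raises the same ValueError as an empty one.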

Comment thread timenet_timef/validate.py
Comment on lines +22 to +26
def validate_dataset(root: Path) -> None:
    reader = TimeFReader(root)
    manifest = reader.read_manifest()
    samples = list(reader.read_samples())
    annotations = list(reader.read_annotations())

⚠️ Potential issue | 🟠 Major

Check required artifacts before calling the reader.

A missing manifest.json, samples.parquet, or annotations.parquet will fail in Lines 23-26 before this loop runs, so callers get FileNotFoundError instead of TimeFValidationError and the explicit messages below never fire.

Suggested change
 def validate_dataset(root: Path) -> None:
+    for required in (
+        "manifest.json",
+        "samples.parquet",
+        "annotations.parquet",
+        "sample_signal_index.parquet",
+        "annotation_tasks_index.parquet",
+        "annotation_domains_index.parquet",
+    ):
+        if not (root / required).exists():
+            raise TimeFValidationError(f"missing required file {required}")
+
     reader = TimeFReader(root)
     manifest = reader.read_manifest()
     samples = list(reader.read_samples())
     annotations = list(reader.read_annotations())
-
-    for required in (
-        "manifest.json",
-        "samples.parquet",
-        "annotations.parquet",
-        "sample_signal_index.parquet",
-        "annotation_tasks_index.parquet",
-        "annotation_domains_index.parquet",
-    ):
-        if not (root / required).exists():
-            raise TimeFValidationError(f"missing required file {required}")

Also applies to: 42-51

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@timenet_timef/validate.py` around lines 22 - 26, Before instantiating
TimeFReader in validate_dataset, check that manifest.json, samples.parquet, and
annotations.parquet exist in the provided root and raise TimeFValidationError
with the explicit messages instead of letting TimeFReader/FileNotFoundError
surface; locate the validate_dataset function and the calls to
TimeFReader.read_manifest(), read_samples(), and read_annotations() and add
pre-checks for those three files (and mirror the same checks for the other
validation block that uses TimeFReader between lines 42-51) so failures produce
the intended TimeFValidationError messages.
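The pre-check pattern can be shown without the real reader. `TimeFValidationError` below is a stand-in class and the file list is trimmed to the three core artifacts; the point is raising the domain-specific error before any reader touches the directory:

```python
import tempfile
from pathlib import Path


class TimeFValidationError(Exception):
    """Stand-in for the package's validation error type."""


REQUIRED_FILES = ("manifest.json", "samples.parquet", "annotations.parquet")


def check_required_artifacts(root: Path) -> None:
    # Fail with the domain-specific error, not FileNotFoundError,
    # when a required artifact is absent.
    for required in REQUIRED_FILES:
        if not (root / required).exists():
            raise TimeFValidationError(f"missing required file {required}")


with tempfile.TemporaryDirectory() as tmp:
    try:
        check_required_artifacts(Path(tmp))
    except TimeFValidationError as exc:
        print(exc)  # missing required file manifest.json
```

Running these checks first means callers of validate_dataset always see TimeFValidationError with an explicit message, regardless of which artifact is missing.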

Comment thread timenet_timef/validate.py
Comment on lines +60 to +64
        if signal.row_group_id >= metadata.num_row_groups:
            raise TimeFValidationError(f"row_group_id out of bounds for signal {signal.id}")
        row_group_rows = metadata.row_group(signal.row_group_id).num_rows
        if signal.row_start + signal.row_count > row_group_rows:
            raise TimeFValidationError(f"row span out of bounds for signal {signal.id}")

⚠️ Potential issue | 🟠 Major

Reject negative row spans during validation.

Only the upper bound is enforced here. A manifest with row_start < 0 or row_count < 0 passes validation, and TimeFReader.read_signal_frame_for_sample() later slices from the end of the frame instead of failing.

Suggested change
         metadata = pq.ParquetFile(shard_path).metadata
+        if signal.row_group_id < 0 or signal.row_start < 0 or signal.row_count < 0:
+            raise TimeFValidationError(f"negative row coordinates for signal {signal.id}")
         if signal.row_group_id >= metadata.num_row_groups:
             raise TimeFValidationError(f"row_group_id out of bounds for signal {signal.id}")
         row_group_rows = metadata.row_group(signal.row_group_id).num_rows
         if signal.row_start + signal.row_count > row_group_rows:
             raise TimeFValidationError(f"row span out of bounds for signal {signal.id}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@timenet_timef/validate.py` around lines 60 - 64, Add checks to reject
negative row spans: validate that signal.row_start >= 0 and signal.row_count >=
0 and raise TimeFValidationError with a clear message if either is negative.
Update the validation block that currently checks signal.row_group_id and row
span (using metadata.row_group(...) and variables row_group_rows) to first
assert signal.row_start and signal.row_count are non-negative, then perform the
existing upper-bound check; use the same TimeFValidationError type and include
signal.id in the error messages.
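The failure mode is easy to demonstrate with a plain list, assuming the reader ultimately slices rows Python-style as the comment describes. Note that the existing upper-bound check passes for this negative span:

```python
rows = list(range(10))  # stand-in for a 10-row row group
row_group_rows = len(rows)

row_start, row_count = -5, 3

# The existing upper-bound check passes: -5 + 3 = -2 <= 10.
assert row_start + row_count <= row_group_rows

# But slicing silently reads from the END of the row group
# instead of raising an error.
sliced = rows[row_start:row_start + row_count]
assert sliced == [5, 6, 7]
```

An explicit non-negativity check on row_start, row_count, and row_group_id closes this gap before the slice is ever taken.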

Base automatically changed from karl-deck/internal-data-update to main April 6, 2026 22:24
@max-rosenblattl max-rosenblattl mentioned this pull request Apr 7, 2026
1 task