
Add TimeNet/TimeF Export for SensorTSLM Results (based on branch: internal-data-update)#39

Open
KarlDeck wants to merge 9 commits into main from
karl-deck/export-to-TimeNet-persistent-fromat(based_on_internal-data-update)

Conversation


@KarlDeck KarlDeck commented Apr 3, 2026

Add TimeNet/TimeF Export for SensorTSLM Results

♻️ Current situation & Problem

SensorTSLM previously produced useful runtime results in memory, but it had no persistent export format for storing those results in a structured, reusable way.

This PR adds a new export path that converts SensorTSLM output into a TimeNet-compatible on-disk TimeF dataset. That means the transformed sensor signals and generated annotations can now be persisted in a format that is inspectable, portable, and aligned with the TimeNet ecosystem.

As part of making that integration understandable and maintainable, this PR also cleans up naming around the new persistence layer:

  • the vendored TimeNet-derived storage layer is now clearly identified as timenet_timef
  • the SensorTSLM runtime/in-memory types now have clearer canonical imports under runtime

That rename work supports the exporter feature; it is not the main purpose of this PR.

Also important for review:

  • the last commit contains the runtime rename
  • if that rename is not desired, it can be reverted independently while keeping the new exporter and persistence package rename

⚙️ Release Notes

  • Added a persistent export pipeline from CaptionResult to TimeNet-compatible TimeF datasets on disk
  • Added a local TimeNet-derived persistence layer for writing, reading, and validating exported datasets
  • Added tests covering export validity, round-trip reading, annotation mapping, provenance preservation, and compatibility shims
  • Clarified package naming so runtime SensorTSLM types and TimeNet-derived persistence types are no longer conflated

Example usage (assumes a configured `Captionizer` instance named `captionizer` already exists):

```python
from pathlib import Path

from exporters.timef_export import TimeFExportConfig, export_caption_result

# Prerequisite: `captionizer` is a configured Captionizer instance.
result, _ = captionizer.run(max_rows=5)
root = export_caption_result(
    result,
    TimeFExportConfig(
        output_root=Path("exports"),
        dataset_id="mhc_caption_runs",
    ),
)
print(root)
```

This export writes a self-contained TimeF dataset version containing:

  • manifest.json
  • samples.parquet
  • annotations.parquet
  • index parquet files
  • per-sample signal parquet files in signals/
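As a rough illustration of that layout (not the PR's actual validation code; the helper below is hypothetical, only the file names come from the list above), a minimal layout check for one dataset version directory could look like:

```python
from pathlib import Path

# File names taken from the layout described above; this helper itself is
# illustrative and not part of the PR.
REQUIRED_FILES = ("manifest.json", "samples.parquet", "annotations.parquet")


def missing_dataset_files(version_root: Path) -> list[str]:
    """Return the required TimeF entries absent from a dataset version dir."""
    missing = [name for name in REQUIRED_FILES if not (version_root / name).is_file()]
    if not (version_root / "signals").is_dir():
        missing.append("signals/")
    return missing
```

A check like this can run before attempting a full read, so a partially written export fails fast with a clear list of what is missing.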

📚 Documentation

The main addition in this PR is the new TimeNet/TimeF exporter.

What the exporter does

The exporter takes SensorTSLM runtime output and writes it into a TimeNet-style persistent dataset layout.

In v1, it persists:

  • transformed runtime signals from each RuntimeRow
  • generated annotations from SensorTSLM
  • dataset metadata required by the TimeF structure

It does not yet persist:

  • reviewer/evaluation outputs
  • richer run provenance such as model configuration or execution metadata

Why this is useful

This makes SensorTSLM results:

  • persistent instead of only in-memory
  • inspectable in a structured dataset format
  • reusable for downstream tooling
  • closer to the broader TimeNet data model

Why the naming cleanup is included

Because this PR introduces a real TimeNet-style persistence layer, the old naming became more confusing:

  • SensorTSLM’s timef package represented runtime/in-memory objects
  • the new persistence layer represented actual TimeNet-style storage logic

To avoid conflating those two concepts, the runtime layer now has clearer canonical imports and the persistence layer has a name that reflects its TimeNet origin.

The runtime rename is isolated in the last commit so that reviewers can easily drop that part if they prefer to keep the previous naming, while still keeping the exporter itself.
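For reviewers weighing the rename, a legacy-import shim is one common way to keep old `timef` imports working after such a move. The sketch below is a generic illustration of that pattern; the function name and mechanism are assumptions, not this PR's actual compatibility code:

```python
import sys
import types
import warnings


def make_alias_module(old_name: str, target: types.ModuleType) -> types.ModuleType:
    """Register a module under `old_name` that forwards lookups to `target`."""
    shim = types.ModuleType(old_name)

    def module_getattr(name: str):
        # PEP 562 module-level __getattr__: warn on each access, then delegate.
        warnings.warn(
            f"{old_name}.{name} is deprecated; import it from {target.__name__}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(target, name)

    shim.__getattr__ = module_getattr
    sys.modules[old_name] = shim
    return shim
```

With a shim like this in place, old import paths keep resolving while each attribute access emits a `DeprecationWarning`, which gives downstream users time to migrate.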

✅ Testing

Tested locally; test coverage includes:

  • valid single-row export
  • stable multi-row export counts
  • round-trip reads through the persistence reader
  • channel filtering behavior
  • unlabeled vs labeled annotation mapping
  • provenance JSON preservation
  • generated timestamp versions
  • compatibility behavior for legacy timef imports

Code of Conduct & Contributing Guidelines

Summary by CodeRabbit

  • New Features

    • Added export functionality to persist captioning results as TimeF datasets with configurable output paths, metadata, and channel-specific unit mappings.
    • Introduced dataset I/O operations including reading/writing manifest, samples, annotations, and signal frames.
    • Added dataset validation and marking capabilities to ensure dataset integrity.
  • Refactor

    • Reorganized core data structures into a dedicated runtime module for improved modularity and clarity.

KarlDeck and others added 6 commits April 2, 2026 12:46
@KarlDeck KarlDeck requested a review from max-rosenblattl April 3, 2026 18:29

coderabbitai bot commented Apr 3, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a5e6e7b6-6062-4441-9a3a-23673d06ab73



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (1)
runtime/types.py (1)

89-100: Consider consolidating the aliased methods.

signal()/channel() and iter_signals()/iter_channels() are exact aliases. While this provides API flexibility, consider documenting one as the canonical name and the other as an alias to avoid confusion about which to use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@runtime/types.py` around lines 89 - 100, Pick canonical names (e.g., signal
and iter_signals) and mark the others as documented aliases: update the
docstrings for signal, channel, iter_channels and iter_signals to clearly state
which is canonical and that the other methods are simple aliases; ensure
channel(self, idx) delegates to signal(self, idx) and iter_channels(self)
delegates to iter_signals(self) (or vice versa) so implementation stays a single
source of truth (refer to SignalView, signal, channel, iter_signals,
iter_channels and self.values for locating the methods).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@exporters/timef_export.py`:
- Around line 85-95: TimeFWriter(root) eagerly creates the final root/signals
tree which can leave partial files on failure; instead stage output to a
temporary directory and only instantiate or move TimeFWriter into the final path
on success: write signals via _write_signal_frame into a temp staging directory
(or pass a staging path to TimeFWriter), accumulate samples/signals/annotations
there, and after all rows and writes complete, atomically move or rename the
staging directory to the intended root (or then construct TimeFWriter(root) if
it must create the tree). Also ensure any temporary directory is removed on
failure so subsequent retries won't hit FileExistsError.
- Around line 136-139: The code that builds reference (the json.dumps block
using channel_names and annotation.channel_idxs) must validate
annotation.channel_idxs before indexing: ensure each idx is an integer and 0 <=
idx < len(channel_names) (do not allow negative wrapping), and handle violations
with a clear error or by filtering them out consistently; update the
comprehension that produces "channel_names": [channel_names[idx] for idx in
annotation.channel_idxs] to first validate/map the indexes (or raise a
ValueError with the bad idx and annotation id/context) so an IndexError or
silent negative wrap cannot occur during export.
- Around line 209-214: In _write_signal_frame ensure you detect and reject
duplicate or colliding column names before building payload: check
row.channel_names for duplicates and also check if any channel name equals the
time_column_name, and if any conflict exists raise a clear ValueError (including
the offending names) instead of silently overwriting; do this validation prior
to creating payload and populating entries so callers get an explicit error when
channel names conflict with each other or with time_column_name.

In `@README.md`:
- Around line 21-40: The example uses an undefined lowercase variable
captionizer when calling captionizer.run; either instantiate a Captionizer
instance before use (e.g., create and configure a Captionizer object and assign
it to captionizer) or add a clear comment above the snippet stating that a
configured Captionizer instance must exist; update the README example to
reference the Captionizer symbol (Captionizer) properly so export_caption_result
and TimeFExportConfig are called with a real result from captionizer.run.

In `@timef/folder_only_for_backwards_compatability.txt`:
- Around line 1-7: The filename contains a spelling mistake: rename
folder_only_for_backwards_compatability.txt to
folder_only_for_backwards_compatibility.txt and update any references to the old
name (docs, README, tests, CI configs, or code that reads this file) to use the
corrected filename so nothing breaks; ensure the rename preserves file contents
and SPDX headers and run a quick grep for "backwards_compatability" to find and
update all references.

In `@timenet_timef/__init__.py`:
- Around line 23-40: The __all__ list is unsorted (contains names like
"Annotation", "AnnotationSampleRef", ..., "validate_dataset") and triggers Ruff
RUF022; reorder the entries in the __all__ list into a stable alphabetical order
(by string) so that symbols like Annotation, AnnotationSampleRef,
AnnotationSpec, DatasetManifest, Sample, SampleSignalRef, SensorSpec, Signal,
SignalSpec, TimeFReader, TimeFValidationError, TimeFWriter, VALID_DOMAINS,
mark_validated, validate_annotation_against_spec, validate_dataset are sorted
alphabetically to silence the lint error and keep CI deterministic.

In `@timenet_timef/io.py`:
- Around line 203-204: The read_signal_frame method currently joins signal_file
directly into self.root allowing absolute or ../ paths to escape root; update
read_signal_frame to first run signal_file through the same shard-path resolver
used in timenet_timef/validate.py (the shard path resolver) to normalize and
resolve the path relative to self.root, verify the resolved path is contained
under self.root / "signals" (raise an error if not), and then pass that resolved
path to pq.read_table instead of joining signal_file directly.
- Around line 206-223: The reader currently filters columns using the hard-coded
helper _is_time_column inside read_signal_frame_for_sample which drops
non-"time"/"time_*" columns (breaking exports that set time_column_name like
"timestamp"); update the filter to preserve the actual time column provided by
the dataset/sample metadata instead of relying solely on _is_time_column — e.g.
derive the expected time column name from the signal/manifest/sample metadata
(check signal.time_column_name or manifest.time_column_name, falling back to
_is_time_column only if none is set) and include that name in the
key-preservation check when building filtered rows for signal_ref.channels.

In `@timenet_timef/labels.py`:
- Around line 17-28: get_schema and get_valid_labels currently return direct
references into LABEL_SCHEMAS so callers can mutate global state; change
get_schema(name: str) to return a defensive copy (use copy.deepcopy on
LABEL_SCHEMAS[name]) and change get_valid_labels(name: str) to return a new list
(e.g., list(labels) or labels.copy()) after validating types; ensure you import
copy and keep the same error checks (raise ValueError for unknown schema or
invalid labels) so callers receive safe, non-mutable copies of schema and label
lists.

In `@timenet_timef/units.py`:
- Around line 10-13: The validate_ucum function accepts whitespace-padded unit
codes but returns the original string; change it to normalize the input by
trimming whitespace (use unit_code.strip()) and return the stripped value after
validation; keep the same ValueError on empty/invalid inputs but ensure callers
receive the normalized unit code from validate_ucum.

In `@timenet_timef/validate.py`:
- Around line 22-26: Before instantiating TimeFReader in validate_dataset, check
that manifest.json, samples.parquet, and annotations.parquet exist in the
provided root and raise TimeFValidationError with the explicit messages instead
of letting TimeFReader/FileNotFoundError surface; locate the validate_dataset
function and the calls to TimeFReader.read_manifest(), read_samples(), and
read_annotations() and add pre-checks for those three files (and mirror the same
checks for the other validation block that uses TimeFReader between lines 42-51)
so failures produce the intended TimeFValidationError messages.
- Around line 60-64: Add checks to reject negative row spans: validate that
signal.row_start >= 0 and signal.row_count >= 0 and raise TimeFValidationError
with a clear message if either is negative. Update the validation block that
currently checks signal.row_group_id and row span (using metadata.row_group(...)
and variables row_group_rows) to first assert signal.row_start and
signal.row_count are non-negative, then perform the existing upper-bound check;
use the same TimeFValidationError type and include signal.id in the error
messages.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d6df737c-1bb3-426e-8be1-0d0f56e1c1b3

📥 Commits

Reviewing files that changed from the base of the PR and between d09648c and 78ed48c.

📒 Files selected for processing (29)
  • README.md
  • annotator.py
  • captionizer.py
  • exporters/__init__.py
  • exporters/timef_export.py
  • extractors/__init__.py
  • extractors/generative.py
  • extractors/semantic.py
  • extractors/statistical.py
  • extractors/structural.py
  • mhc/transformer.py
  • models/base.py
  • models/client.py
  • models/local.py
  • reviewer.py
  • runtime/__init__.py
  • runtime/types.py
  • timef/__init__.py
  • timef/folder_only_for_backwards_compatability.txt
  • timef/schema.py
  • timenet_timef/__init__.py
  • timenet_timef/io.py
  • timenet_timef/labels.py
  • timenet_timef/schema.py
  • timenet_timef/units.py
  • timenet_timef/utils.py
  • timenet_timef/validate.py
  • transformer.py
  • visualizer.py

Comment thread exporters/timef_export.py
Comment on lines +85 to +95
writer = TimeFWriter(root)
samples: list[Sample] = []
signals: list[Signal] = []
annotations: list[PersistedAnnotation] = []
sampling_rate = config.timestamp_unit / config.sampling_period
annotation_id = 0

for sample_id, row in enumerate(result.rows):
    _validate_row_shape(row, channel_names)
    signal_file = f"sample-{sample_id}.parquet"
    _write_signal_frame(root / "signals" / signal_file, config.sampling_period, config.time_column_name, row)

⚠️ Potential issue | 🟠 Major

Stage the export before touching the final dataset path.

TimeFWriter(root) eagerly creates root/signals in timenet_timef/io.py Lines 76-78. Any later failure here—or at Lines 182-183—leaves a partial tree behind, and the next retry then hits Line 65's FileExistsError even though no valid dataset was produced.

Also applies to: 179-183


Comment thread exporters/timef_export.py
Comment on lines +136 to +139
reference = json.dumps(
    {
        "channel_names": [channel_names[idx] for idx in annotation.channel_idxs],
        "window": list(annotation.window) if annotation.window is not None else None,

⚠️ Potential issue | 🟠 Major

Validate annotation.channel_idxs before translating them to names.

runtime.types.Annotation does not bound-check these indexes. Negative indexes silently bind to the wrong channel, and an out-of-range index raises IndexError mid-export.

Suggested change
             if spec_id is None:
                 raise ValueError(f"Unsupported annotation kind: {annotation.kind}")
+            invalid_channel_idxs = [
+                idx for idx in annotation.channel_idxs if idx < 0 or idx >= len(channel_names)
+            ]
+            if invalid_channel_idxs:
+                raise ValueError(
+                    f"Annotation {annotation.kind!r} references invalid channel indexes: {invalid_channel_idxs}"
+                )
             reference = json.dumps(
                 {
                     "channel_names": [channel_names[idx] for idx in annotation.channel_idxs],
                     "window": list(annotation.window) if annotation.window is not None else None,
                     "kind": annotation.kind,

Comment thread exporters/timef_export.py
Comment on lines +209 to +214
def _write_signal_frame(path: Path, sampling_period: float, time_column_name: str, row) -> None:
    time_axis = [idx * sampling_period for idx in range(row.values.shape[1])]
    payload: dict[str, list[float]] = {time_column_name: time_axis}
    for idx, channel_name in enumerate(row.channel_names):
        payload[channel_name] = row.values[idx].astype(float).tolist()
    path.parent.mkdir(parents=True, exist_ok=True)

⚠️ Potential issue | 🟠 Major

Reject duplicate or colliding column names before building the parquet payload.

This payload is keyed by time_column_name and row.channel_names. Duplicate channel names—or a channel named exactly like the time column—overwrite earlier entries in the dict and silently drop signal data.

Suggested change
 def _write_signal_frame(path: Path, sampling_period: float, time_column_name: str, row) -> None:
+    channel_names = tuple(row.channel_names)
+    if len(set(channel_names)) != len(channel_names):
+        raise ValueError("RuntimeRow channel_names must be unique for export")
+    if time_column_name in channel_names:
+        raise ValueError("time_column_name must not collide with a channel name")
+
     time_axis = [idx * sampling_period for idx in range(row.values.shape[1])]
     payload: dict[str, list[float]] = {time_column_name: time_axis}
-    for idx, channel_name in enumerate(row.channel_names):
+    for idx, channel_name in enumerate(channel_names):
         payload[channel_name] = row.values[idx].astype(float).tolist()
     path.parent.mkdir(parents=True, exist_ok=True)
     pq.write_table(pa.table(payload), path)

Comment thread README.md
Comment on lines +21 to +40
```python
from pathlib import Path

from captionizer import Captionizer
from exporters.timef_export import TimeFExportConfig, export_caption_result

result, _ = captionizer.run(max_rows=5)
root = export_caption_result(
    result,
    TimeFExportConfig(
        output_root=Path("exports"),
        dataset_id="mhc_caption_runs",
        sampling_period=1,
        timestamp_unit=1,
        unit_sampling_rate="1 / minute",
        unit_timestamp="minute",
        time_column_name="time_minute",
    ),
)
print(root)
```

⚠️ Potential issue | 🟡 Minor

Example code references undefined captionizer variable.

The example imports Captionizer class (line 24) but uses a lowercase captionizer instance (line 27) that is never created. Users following this example will encounter a NameError.

Consider either:

  1. Adding the necessary setup (dataset, transformer, annotator, instantiation), or
  2. Adding a comment indicating prerequisite setup is required.
📝 Proposed fix with minimal setup context
 from pathlib import Path
 
 from captionizer import Captionizer
 from exporters.timef_export import TimeFExportConfig, export_caption_result
+
+# Setup (see Usage section for full details)
+# captionizer = Captionizer(dataset, transformer, annotator)
 
 result, _ = captionizer.run(max_rows=5)

Or provide a complete working example:

 from pathlib import Path
 
 from captionizer import Captionizer
 from exporters.timef_export import TimeFExportConfig, export_caption_result
+from mhc.dataset import MHCDataset
+from mhc.transformer import MHCTransformer
+from mhc.constants import MHC_CHANNEL_CONFIG
+from extractors.statistical import StatisticalExtractor
+from annotator import Annotator
+
+dataset = MHCDataset()
+annotator = Annotator([StatisticalExtractor(MHC_CHANNEL_CONFIG)])
+captionizer = Captionizer(dataset, MHCTransformer(), annotator)
 
 result, _ = captionizer.run(max_rows=5)

Comment thread timef/folder_only_for_backwards_compatability.txt
Comment on lines +1 to +7
#
# SPDX-FileCopyrightText: 2026 Stanford University, ETH Zurich, and the project authors (see CONTRIBUTORS.md)
# SPDX-FileCopyrightText: 2026 This source file is part of the SensorTSLM open-source project.
#
# SPDX-License-Identifier: MIT
#
This folder is only here for backwards compatibility it serves no purpose in the current implementation, its supposed to be deleted soon

⚠️ Potential issue | 🟡 Minor

Typo in filename: "compatability" should be "compatibility".

The filename folder_only_for_backwards_compatability.txt contains a spelling error. Consider renaming to folder_only_for_backwards_compatibility.txt.


Comment thread timenet_timef/io.py
Comment on lines +206 to +223
def read_signal_frame_for_sample(self, sample: Sample, manifest: DatasetManifest) -> dict[int, list[dict[str, Any]]]:
    signals_by_id = {signal.id: signal for signal in manifest.signals}
    result: dict[int, list[dict[str, Any]]] = {}
    for signal_ref in sample.signals:
        signal = signals_by_id.get(signal_ref.signal_id)
        if signal is None:
            continue
        rows = self.read_signal_frame(signal.shard_file)
        sliced = rows[signal.row_start : signal.row_start + signal.row_count]
        if signal_ref.channels is not None:
            filtered: list[dict[str, Any]] = []
            for row in sliced:
                filtered.append(
                    {
                        key: value
                        for key, value in row.items()
                        if _is_time_column(key) or key in signal_ref.channels
                    }

⚠️ Potential issue | 🟠 Major

Don't hard-code the time-column naming convention in the reader.

export_caption_result() only requires time_column_name to be non-empty, but this filter preserves only time/time_*. A valid export with time_column_name="timestamp" loses its time axis on readback through this path.

Suggested change
     def read_signal_frame_for_sample(self, sample: Sample, manifest: DatasetManifest) -> dict[int, list[dict[str, Any]]]:
         signals_by_id = {signal.id: signal for signal in manifest.signals}
+        all_signal_columns = {channel for spec in manifest.signal_spec for channel in spec.channels}
         result: dict[int, list[dict[str, Any]]] = {}
         for signal_ref in sample.signals:
             signal = signals_by_id.get(signal_ref.signal_id)
             if signal is None:
                 continue
             rows = self.read_signal_frame(signal.shard_file)
             sliced = rows[signal.row_start : signal.row_start + signal.row_count]
             if signal_ref.channels is not None:
+                requested_channels = set(signal_ref.channels)
                 filtered: list[dict[str, Any]] = []
                 for row in sliced:
                     filtered.append(
                         {
                             key: value
                             for key, value in row.items()
-                            if _is_time_column(key) or key in signal_ref.channels
+                            if key in requested_channels or key not in all_signal_columns
                         }
                     )
                 sliced = filtered
             result[signal_ref.signal_id] = sliced
         return result

Comment thread timenet_timef/labels.py
Comment on lines +17 to +28
def get_schema(name: str) -> dict[str, list[str] | str]:
    if name not in LABEL_SCHEMAS:
        raise ValueError(f"Unknown label schema: {name}")
    return LABEL_SCHEMAS[name]


def get_valid_labels(name: str) -> list[str]:
    schema = get_schema(name)
    labels = schema["labels"]
    if not isinstance(labels, list):
        raise ValueError(f"Invalid labels payload for schema {name}")
    return labels

⚠️ Potential issue | 🟠 Major

Avoid exposing mutable global schema internals.

Both helpers return direct references to LABEL_SCHEMAS contents, so callers can mutate global label definitions unintentionally.

Proposed fix
+from copy import deepcopy
+
 def get_schema(name: str) -> dict[str, list[str] | str]:
     if name not in LABEL_SCHEMAS:
         raise ValueError(f"Unknown label schema: {name}")
-    return LABEL_SCHEMAS[name]
+    return deepcopy(LABEL_SCHEMAS[name])
@@
 def get_valid_labels(name: str) -> list[str]:
     schema = get_schema(name)
     labels = schema["labels"]
     if not isinstance(labels, list):
         raise ValueError(f"Invalid labels payload for schema {name}")
-    return labels
+    return list(labels)

Comment thread timenet_timef/units.py
Comment on lines +10 to +13
def validate_ucum(unit_code: str) -> str:
    if not isinstance(unit_code, str) or not unit_code.strip():
        raise ValueError("Unit code must be a non-empty string")
    return unit_code

⚠️ Potential issue | 🟠 Major

Normalize unit codes before returning them.

validate_ucum calls strip() only for the emptiness check but returns the original value, so whitespace-padded codes are accepted and persisted unchanged. That can produce unit values that are invalid or incompatible with other tools.

Proposed fix
 def validate_ucum(unit_code: str) -> str:
-    if not isinstance(unit_code, str) or not unit_code.strip():
+    if not isinstance(unit_code, str):
+        raise ValueError("Unit code must be a non-empty string")
+    normalized = unit_code.strip()
+    if not normalized:
         raise ValueError("Unit code must be a non-empty string")
-    return unit_code
+    return normalized
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@timenet_timef/units.py` around lines 10 - 13, The validate_ucum function
accepts whitespace-padded unit codes but returns the original string; change it
to normalize the input by trimming whitespace (use unit_code.strip()) and return
the stripped value after validation; keep the same ValueError on empty/invalid
inputs but ensure callers receive the normalized unit code from validate_ucum.
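The normalized version can be sketched and exercised in isolation; this is a standalone copy of the suggested fix, not the file itself:

```python
def validate_ucum(unit_code: str) -> str:
    if not isinstance(unit_code, str):
        raise ValueError("Unit code must be a non-empty string")
    # Normalize before the emptiness check so whitespace-only and
    # whitespace-padded codes are handled consistently.
    normalized = unit_code.strip()
    if not normalized:
        raise ValueError("Unit code must be a non-empty string")
    return normalized


assert validate_ucum(" mg/dL ") == "mg/dL"
```

With this change, callers always persist the trimmed code, and a whitespace-only string raises the same ValueError as an empty one.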

Comment thread timenet_timef/validate.py
Comment on lines +22 to +26
def validate_dataset(root: Path) -> None:
    reader = TimeFReader(root)
    manifest = reader.read_manifest()
    samples = list(reader.read_samples())
    annotations = list(reader.read_annotations())

⚠️ Potential issue | 🟠 Major

Check required artifacts before calling the reader.

A missing manifest.json, samples.parquet, or annotations.parquet will fail in Lines 23-26 before this loop runs, so callers get FileNotFoundError instead of TimeFValidationError and the explicit messages below never fire.

Suggested change
 def validate_dataset(root: Path) -> None:
+    for required in (
+        "manifest.json",
+        "samples.parquet",
+        "annotations.parquet",
+        "sample_signal_index.parquet",
+        "annotation_tasks_index.parquet",
+        "annotation_domains_index.parquet",
+    ):
+        if not (root / required).exists():
+            raise TimeFValidationError(f"missing required file {required}")
+
     reader = TimeFReader(root)
     manifest = reader.read_manifest()
     samples = list(reader.read_samples())
     annotations = list(reader.read_annotations())
-
-    for required in (
-        "manifest.json",
-        "samples.parquet",
-        "annotations.parquet",
-        "sample_signal_index.parquet",
-        "annotation_tasks_index.parquet",
-        "annotation_domains_index.parquet",
-    ):
-        if not (root / required).exists():
-            raise TimeFValidationError(f"missing required file {required}")

Also applies to: 42-51

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@timenet_timef/validate.py` around lines 22 - 26, Before instantiating
TimeFReader in validate_dataset, check that manifest.json, samples.parquet, and
annotations.parquet exist in the provided root and raise TimeFValidationError
with the explicit messages instead of letting TimeFReader/FileNotFoundError
surface; locate the validate_dataset function and the calls to
TimeFReader.read_manifest(), read_samples(), and read_annotations() and add
pre-checks for those three files (and mirror the same checks for the other
validation block that uses TimeFReader between lines 42-51) so failures produce
the intended TimeFValidationError messages.
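The pre-check pattern can be shown without the real reader. `TimeFValidationError` below is a stand-in class and the file list is trimmed to the three core artifacts; the point is raising the domain-specific error before any reader touches the directory:

```python
import tempfile
from pathlib import Path


class TimeFValidationError(Exception):
    """Stand-in for the package's validation error type."""


REQUIRED_FILES = ("manifest.json", "samples.parquet", "annotations.parquet")


def check_required_artifacts(root: Path) -> None:
    # Fail with the domain-specific error, not FileNotFoundError,
    # when a required artifact is absent.
    for required in REQUIRED_FILES:
        if not (root / required).exists():
            raise TimeFValidationError(f"missing required file {required}")


with tempfile.TemporaryDirectory() as tmp:
    try:
        check_required_artifacts(Path(tmp))
    except TimeFValidationError as exc:
        print(exc)  # missing required file manifest.json
```

Running these checks first means callers of validate_dataset always see TimeFValidationError with an explicit message, regardless of which artifact is missing.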

Comment thread timenet_timef/validate.py
Comment on lines +60 to +64
        if signal.row_group_id >= metadata.num_row_groups:
            raise TimeFValidationError(f"row_group_id out of bounds for signal {signal.id}")
        row_group_rows = metadata.row_group(signal.row_group_id).num_rows
        if signal.row_start + signal.row_count > row_group_rows:
            raise TimeFValidationError(f"row span out of bounds for signal {signal.id}")

⚠️ Potential issue | 🟠 Major

Reject negative row spans during validation.

Only the upper bound is enforced here. A manifest with row_start < 0 or row_count < 0 passes validation, and TimeFReader.read_signal_frame_for_sample() later slices from the end of the frame instead of failing.

Suggested change
         metadata = pq.ParquetFile(shard_path).metadata
+        if signal.row_group_id < 0 or signal.row_start < 0 or signal.row_count < 0:
+            raise TimeFValidationError(f"negative row coordinates for signal {signal.id}")
         if signal.row_group_id >= metadata.num_row_groups:
             raise TimeFValidationError(f"row_group_id out of bounds for signal {signal.id}")
         row_group_rows = metadata.row_group(signal.row_group_id).num_rows
         if signal.row_start + signal.row_count > row_group_rows:
             raise TimeFValidationError(f"row span out of bounds for signal {signal.id}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@timenet_timef/validate.py` around lines 60 - 64, Add checks to reject
negative row spans: validate that signal.row_start >= 0 and signal.row_count >=
0 and raise TimeFValidationError with a clear message if either is negative.
Update the validation block that currently checks signal.row_group_id and row
span (using metadata.row_group(...) and variables row_group_rows) to first
assert signal.row_start and signal.row_count are non-negative, then perform the
existing upper-bound check; use the same TimeFValidationError type and include
signal.id in the error messages.
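The failure mode is easy to demonstrate with a plain list, assuming the reader ultimately slices rows Python-style as the comment describes. Note that the existing upper-bound check passes for this negative span:

```python
rows = list(range(10))  # stand-in for a 10-row row group
row_group_rows = len(rows)

row_start, row_count = -5, 3

# The existing upper-bound check passes: -5 + 3 = -2 <= 10.
assert row_start + row_count <= row_group_rows

# But slicing silently reads from the END of the row group
# instead of raising an error.
sliced = rows[row_start:row_start + row_count]
assert sliced == [5, 6, 7]
```

An explicit non-negativity check on row_start, row_count, and row_group_id closes this gap before the slice is ever taken.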

Base automatically changed from karl-deck/internal-data-update to main April 6, 2026 22:24
@max-rosenblattl max-rosenblattl mentioned this pull request Apr 7, 2026
1 task