
Add lean exporter #42

Merged
max-rosenblattl merged 1 commit into main from max-rosenblattl/lean-exporter
Apr 7, 2026


@max-rosenblattl max-rosenblattl commented Apr 7, 2026

♻️ Current situation & Problem

There is no way to persist a CaptionResult to disk.

⚙️ Release Notes

  • Adds exporters/ package with lean.py: write_caption_result(result, path, *, rows_per_shard=4096, compression="zstd")
  • Sharded Arrow IPC layout (recordings_{i:04d}.arrow), one shard per rows_per_shard recordings — ~32 shards for the full MHC dataset at the default, a single shard for typical small exports
  • Embeds sensortslm.schema_version in each shard's schema metadata — no separate manifest file
  • Chooses Arrow IPC over Parquet: it matches the existing MHC input format, enables zero-copy mmap reads downstream (for uncompressed buffers; zstd-compressed buffers are decompressed on read), and our access pattern is whole-row, so Parquet's column-pruning advantage doesn't apply
  • Read path intentionally out of scope — a dedicated dataset class will live with the training pipeline; this module is write-only by design
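The shard layout described above can be illustrated with a small standard-library sketch. It only plans file names and row ranges and does not write any Arrow data; `plan_shards` and `ShardPlan` are hypothetical names used for illustration and are not part of this PR. How the real exporter handles an empty result is not stated here, so this sketch simply yields no shards in that case.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class ShardPlan(NamedTuple):
    """Hypothetical helper type: one output file and the rows it would hold."""
    path: Path
    start: int  # inclusive row index
    stop: int   # exclusive row index


def plan_shards(
    n_rows: int, out_dir: Path, rows_per_shard: int = 4096
) -> Iterator[ShardPlan]:
    """Yield one (path, start, stop) slice per shard, in row order."""
    if rows_per_shard <= 0:
        raise ValueError("rows_per_shard must be positive")
    for i, start in enumerate(range(0, n_rows, rows_per_shard)):
        # The last shard may be shorter than rows_per_shard.
        stop = min(start + rows_per_shard, n_rows)
        yield ShardPlan(out_dir / f"recordings_{i:04d}.arrow", start, stop)


# Same shape as the smoke test in this PR: 5 rows, 2 rows per shard.
plans = list(plan_shards(5, Path("exports/lean_smoke"), rows_per_shard=2))
for plan in plans:
    print(plan.path.name, plan.start, plan.stop)
# recordings_0000.arrow 0 2
# recordings_0001.arrow 2 4
# recordings_0002.arrow 4 5
```

The shard count is the ceiling of `n_rows / rows_per_shard`, which is why any export of up to 4096 recordings lands in a single file at the default.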

📚 Documentation

Public API is documented inline. Usage:

from pathlib import Path
from exporters.lean import write_caption_result

write_caption_result(result, Path("exports/my_run"))
# → exports/my_run/recordings_0000.arrow, recordings_0001.arrow, ...

The exporters/ package uses an empty __init__.py so future format modules (e.g. exporters/timenet.py) can be added without forcing every consumer to load all formats' dependencies.
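The effect of the empty `__init__.py` can be checked with a throwaway package. The sketch below builds a stand-in package named `demo_exporters` in a temporary directory (a hypothetical name, as are the module contents) and shows that importing one format module does not load its sibling.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for exporters/: an empty __init__.py and two format modules.
    pkg = Path(tmp) / "demo_exporters"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "lean.py").write_text("FORMAT = 'lean'\n")
    (pkg / "timenet.py").write_text("FORMAT = 'timenet'\n")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        lean = importlib.import_module("demo_exporters.lean")
        # Record which modules of the package were actually loaded.
        loaded = sorted(m for m in sys.modules if m.startswith("demo_exporters"))
    finally:
        sys.path.remove(tmp)

print(loaded)
# ['demo_exporters', 'demo_exporters.lean']
```

Had `__init__.py` re-exported every format module, `demo_exporters.timenet` (and whatever it imports) would have been loaded as well.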

✅ Testing

Smoke-tested on the local MHC subset (../mhc_subset_testdata_hf) in the sensortslm Python 3.12 venv:

  • Captioned 5 rows with statistical + structural + semantic extractors → 69 annotations
  • Wrote to exports/lean_smoke with rows_per_shard=2 to force the multi-shard path → got 3 shards (~75 KB total)
  • Verified all 13 expected columns including nested annotations struct list
  • Verified sensortslm.schema_version=1 round-trips via feather.read_table(...).schema.metadata
  • Verified Recording reconstruction (round-trip via a temporary read helper) preserves row_id, channel_names, values (NaN-aware), has_any_data, and all annotation fields
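The "NaN-aware" qualifier matters because NaN never compares equal to itself, so a plain equality check would report a mismatch even after a perfect round trip. A minimal sketch of such a comparison follows; `nan_aware_equal` is a hypothetical helper, and the flat list of floats is an assumption made for brevity (the real `values` field may be nested per channel).

```python
import math
from typing import Sequence


def nan_aware_equal(a: Sequence[float], b: Sequence[float]) -> bool:
    """Element-wise equality that treats NaN at the same position as a match."""
    if len(a) != len(b):
        return False
    return all(
        (math.isnan(x) and math.isnan(y)) or x == y
        for x, y in zip(a, b)
    )


original = [0.5, float("nan"), 2.0]
restored = [0.5, float("nan"), 2.0]  # e.g. values read back from a shard

print(original == restored)                 # False: NaN != NaN
print(nan_aware_equal(original, restored))  # True
```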

Code of Conduct & Contributing Guidelines

By creating and submitting this pull request, you agree to follow our Code of Conduct and Contributing Guidelines:

@max-rosenblattl

@coderabbitai review

@max-rosenblattl max-rosenblattl merged commit 983f6d8 into main Apr 7, 2026
2 checks passed
@max-rosenblattl max-rosenblattl deleted the max-rosenblattl/lean-exporter branch April 7, 2026 00:43