
Feature/evaluation pipeline#55

Merged
milanagm merged 54 commits into main from feature/evaluation-pipeline
Apr 30, 2026

Conversation

@milanagm
Contributor

Eval Architecture and First End-to-End Metabolic Draft

♻️ Current situation & Problem

Link any open issues or pull requests (PRs) related to this PR. Please ensure that all non-trivial PRs are first tracked and discussed in an existing GitHub issue or discussion.

⚙️ Release Notes

Add a bullet point list summary of the feature and possible migration guides if this is a breaking change so this section can be added to the release notes.
Include code snippets that provide examples of the feature implemented or links to the documentation if it appends or changes the public interface.

📚 Documentation

Please ensure that you properly document any additions in conformance with the Spezi Documentation Guide.
You can use this section to describe your solution, but we encourage contributors to document their reasoning and changes using in-line documentation.

✅ Testing

Please ensure that the PR meets the testing requirements set by CodeCov and that new functionality is appropriately tested.
This section describes important information about the tests and why some elements might not be testable.

Code of Conduct & Contributing Guidelines

By creating and submitting this pull request, you agree to follow our Code of Conduct and Contributing Guidelines:

Introduces a modular evaluation pipeline for benchmarking LLMs on
MHC wearable sensor data. All components are task-agnostic and
extensible via registries in run_eval.py.

- evaluation/: unified entry point (run_eval.py), Evaluator,
  EvalTask ABC, classification metrics with bootstrap SE,
  ResultsWriter; stubs for sleep and caption tasks
- time_series_datasets/mhc_activity_dataset.py: 60-min windowed
  samples with ground truth from Apple Watch workout:* binary
  channels — no manual annotation required
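As a hedged illustration of the registry-driven design described above, a task ABC plus a registry in the runner could look like the sketch below. The names (`EvalTask`, `TASK_REGISTRY`, `register_task`) follow the PR description, but the actual signatures in `run_eval.py` may differ.

```python
from abc import ABC, abstractmethod


class EvalTask(ABC):
    """Sketch of the task interface; the real ABC may differ."""

    name: str  # identifier used for result file paths

    @abstractmethod
    def build_prompt(self, sample) -> str: ...

    @abstractmethod
    def parse_prediction(self, text: str) -> str: ...


TASK_REGISTRY: dict[str, type[EvalTask]] = {}


def register_task(cls):
    """Class decorator so a new task plugs in without editing the runner."""
    TASK_REGISTRY[cls.name] = cls
    return cls


@register_task
class MetabolicTask(EvalTask):
    name = "metabolic"

    def build_prompt(self, sample) -> str:
        return f"Given this 60-min sensor window, answer yes/no: {sample}"

    def parse_prediction(self, text: str) -> str:
        answer = text.strip().lower()
        return answer if answer in {"yes", "no"} else "unknown"
```

New tasks then only need a decorated class; the CLI can resolve `--task` via `TASK_REGISTRY` without any other wiring.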

coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new evaluation framework (orchestrator, tasks, metrics, I/O, CLI), placeholder caption generator, dataset label-lookup/split changes, and minor transformer and packaging/ignore updates. The changes are additive, introducing new modules and APIs for end-to-end model evaluation.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Evaluation core | evaluation/evaluator.py, evaluation/__init__.py | New Evaluator orchestration, SampleResult/EvalResult dataclasses, and package initializer headers. |
| Tasks | evaluation/tasks/base.py, evaluation/tasks/metabolic.py, evaluation/tasks/__init__.py | Abstract EvalTask and EvalSample; concrete MetabolicTask with prompt/parse/aggregate logic and package initializer header. |
| Metrics | evaluation/metrics/classification.py, evaluation/metrics/__init__.py | New stateless classification metrics (binary/multiclass/ordinal) with bootstrap SEs and package initializer header. |
| I/O / persistence | evaluation/io/writer.py, evaluation/io/__init__.py | ResultsWriter writes a timestamped output dir with metrics.json and predictions.csv; package initializer header. |
| CLI / utilities | evaluation/run_eval.py, evaluation/generate_captions_placeholder.py | New evaluation CLI (run_eval) wiring registries/builders and the end-to-end flow; caption placeholder generator producing captions.parquet. |
| Dataset label/split changes | time_series_datasets/mhc_label_lookup.py, time_series_datasets/mhc_base_qa_dataset.py | Added SHA-1-based assign_split() with TRAIN_FRAC/VAL_FRAC; replaced join() with join_ondate() and join_onlabels(); dataset now delegates split logic and uses the new join API. |
| Data transform & deps | mhc/transformer.py, requirements.txt, .gitignore | Channel-5 zeros now treated as NaN; opentslm pinned to GitHub main; .gitignore updated to ignore top-level results/ (and other entries). |
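The SHA-1-based split mentioned for mhc_label_lookup.py can be sketched as below. The fractions and the exact signature are assumptions for illustration, not the repository's actual code; the key property is that a user's split is deterministic and stable across runs.

```python
import hashlib

# Illustrative values; the real TRAIN_FRAC / VAL_FRAC may differ.
TRAIN_FRAC = 0.8
VAL_FRAC = 0.1


def assign_split(user_id: str) -> str:
    """Map a user deterministically to train/val/test via its SHA-1 hash."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    # Interpret the hex digest as a uniform value in [0, 1).
    bucket = int(digest, 16) / 16 ** len(digest)
    if bucket < TRAIN_FRAC:
        return "train"
    if bucket < TRAIN_FRAC + VAL_FRAC:
        return "val"
    return "test"
```

Because the assignment depends only on the user ID, the same user always lands in the same split, even if the dataset is regenerated or reordered.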

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/run_eval
    participant Eval as Evaluator
    participant Task as EvalTask
    participant Model as BaseModel
    participant Writer as ResultsWriter

    CLI->>Eval: run(task, model, dataset, max_samples)
    loop per sample
        Eval->>Task: build_prompt(sample)
        Task-->>Eval: prompt
        Eval->>Model: process(prompt, channels, annotations)
        Model-->>Eval: ModelResponse
        Eval->>Task: parse_prediction(response.text)
        Task-->>Eval: parsed_prediction
        Eval-->>Eval: collect SampleResult
    end
    Eval->>Task: aggregate_metrics(gts, preds)
    Task-->>Eval: metrics
    Eval-->>CLI: EvalResult
    CLI->>Writer: write(EvalResult)
    Writer-->>CLI: output directory path
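The flow in the diagram can be sketched as a minimal loop. All class and method names here mirror the diagram rather than the actual Evaluator implementation, and the per-sample error handling is a simplified assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SampleResult:
    sample_id: int
    ground_truth: str
    prediction: str


@dataclass
class EvalResult:
    samples: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)


def run(task, model, dataset, max_samples=None) -> EvalResult:
    result = EvalResult()
    for i, (label, sample) in enumerate(dataset):
        if max_samples is not None and i >= max_samples:
            break
        try:
            # Prompt building sits inside the try block so a pre-processing
            # failure on one sample does not abort the whole run.
            prompt = task.build_prompt(sample)
            parsed = task.parse_prediction(model.process(prompt))
        except Exception:
            parsed = "unknown"  # count as a parse failure and keep going
        result.samples.append(SampleResult(i, label, parsed))
    gts = [s.ground_truth for s in result.samples]
    preds = [s.prediction for s in result.samples]
    result.metrics = task.aggregate_metrics(gts, preds)
    return result
```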

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • max-rosenblattl
  • ThomasKaar
  • RealLast
🚥 Pre-merge checks | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.64%, below the required threshold of 80.00%. | Write docstrings for the functions that are missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Feature/evaluation pipeline' is vague and generic, using broad terms that don't specify the main change or architectural details introduced in this substantial feature addition. | Use a more descriptive title that highlights the primary change, such as 'Add evaluation orchestration framework with metabolic task support' or 'Introduce evaluation pipeline with metrics aggregation and result persistence.' |
| Description check | ❓ Inconclusive | The description is mostly a template scaffold with placeholder sections and does not document the actual evaluation pipeline implementation, architectural decisions, or concrete features introduced. | Fill in the template sections with concrete details: describe the evaluation architecture, the MetabolicTask implementation, provide usage examples, specify testing done, and link any related GitHub issues or discussions. |



@max-rosenblattl max-rosenblattl marked this pull request as ready for review April 13, 2026 22:47

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (5)
evaluation/generate_captions_placeholder.py (2)

37-37: Consider using relative imports or package installation instead of sys.path manipulation.

sys.path.insert(0, ...) is fragile and can cause import conflicts. Since this is a script intended for standalone execution, consider either:

  1. Running via python -m evaluation.generate_captions_placeholder from repo root
  2. Adding an editable install (pip install -e .) to make the package importable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/generate_captions_placeholder.py` at line 37, Replace the sys.path
manipulation in generate_captions_placeholder.py (the call to sys.path.insert(0,
str(Path(__file__).resolve().parents[1])) that alters imports at runtime) with a
proper package import strategy: either update usage to run the module with
python -m evaluation.generate_captions_placeholder from the repository root or
remove the sys.path change and rely on an editable install (pip install -e .) so
imports resolve normally; make sure to remove the sys.path.insert line and any
dependency on Path(__file__).resolve().parents[1] and document the preferred
run/install method in the script header or README.

42-44: Consider accepting paths via CLI arguments for flexibility.

Hardcoded paths work for the documented workflow but limit reusability. An optional argparse interface would allow testing with different datasets or output locations.

♻️ Optional: Add CLI arguments
+import argparse
+
+def parse_args():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--weekly-path", type=Path, default=Path("data/mhc_dev_subset_weekly_hf"))
+    parser.add_argument("--output-dir", type=Path, default=Path("data/mhc_dev_subset_captions"))
+    return parser.parse_args()
+
-WEEKLY_PATH  = Path("data/mhc_dev_subset_weekly_hf")
-OUTPUT_DIR   = Path("data/mhc_dev_subset_captions")
-OUTPUT_FILE  = OUTPUT_DIR / "captions.parquet"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/generate_captions_placeholder.py` around lines 42 - 44, Add an
argparse interface to make WEEKLY_PATH, OUTPUT_DIR, and OUTPUT_FILE
configurable: parse optional CLI arguments (e.g., --weekly-path, --output-dir,
--output-file) that default to the existing WEEKLY_PATH, OUTPUT_DIR, and
OUTPUT_FILE constants, convert parsed strings to Path objects, and ensure the
output directory exists before writing; update any main or entrypoint logic in
generate_captions_placeholder.py to use the parsed values instead of the
hardcoded constants.
evaluation/tasks/base.py (1)

48-48: Consider enforcing name as an abstract property.

The name attribute is required by subclasses but isn't enforced at the ABC level. Subclasses could accidentally omit it, leading to runtime AttributeError when the evaluator accesses task.name.

♻️ Optional: Enforce via abstract property
+    @property
+    @abc.abstractmethod
+    def name(self) -> str:
+        """Task identifier used for result file paths."""
+        ...
-    name: str  # set by each concrete subclass, used for result file paths
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/tasks/base.py` at line 48, Make the name attribute an abstract
property on the base task class so subclasses are forced to define it: replace
the plain "name: str" declaration with a `@property` decorated abstract method
(using `@abstractmethod`) on the base class (e.g., def name(self) -> str) so code
that accesses task.name will be guaranteed to exist; update imports to include
ABC/abstractmethod if needed and ensure existing subclasses implement the name
property returning the required string.
time_series_datasets/mhc_activity_dataset.py (1)

107-118: Use actual recording length instead of fixed 1440 for window centering.

Hard-coding max_len=1440 couples this helper to one data shape and risks invalid windows if sequence length changes.

Proposed fix
 def _center_window(
     event_start: int,
     event_end: int,
     window_size: int,
-    max_len: int = 1440,
+    max_len: int,
 ) -> tuple[int, int]:
@@
                 for event_start, event_end in _find_workout_windows(
                     data[ch_idx], min_workout_duration
                 ):
                     w_start, w_end = _center_window(
-                        event_start, event_end, self._window_size
+                        event_start, event_end, self._window_size, max_len=data.shape[1]
                     )

Also applies to: 318-320

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@time_series_datasets/mhc_activity_dataset.py` around lines 107 - 118, The
helper _center_window currently hardcodes max_len=1440 which can produce invalid
windows for other sequence lengths; change the signature of _center_window to
accept the actual recording length (e.g., max_len: int with no default or
default None and require caller to pass length) and update all call sites
(including the other usage around the same module) to pass the true sequence
length from the dataset (e.g., the per-sample recording length variable) so
window start/end are computed against the real recording length rather than
1440.
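A minimal sketch of such a centering helper, taking the true recording length as suggested; the name and clamping behavior are illustrative rather than the file's actual code.

```python
def center_window(event_start: int, event_end: int,
                  window_size: int, max_len: int) -> tuple[int, int]:
    """Center a fixed-size window on an event, clamped to [0, max_len]."""
    center = (event_start + event_end) // 2
    start = center - window_size // 2
    # Clamp so the window never runs off either end of the recording.
    start = max(0, min(start, max_len - window_size))
    return start, start + window_size
```

Passing `max_len=data.shape[1]` at the call site (instead of a hard-coded 1440) keeps the helper valid for any sequence length.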
time_series_datasets/mhc_metabolic_dataset.py (1)

108-113: Normalize caption lookup keys to avoid silent misses.

Line 108 and Lines 151-152 build keys from week_start via implicit string conversion from different sources. If types differ, captions won’t attach and evaluation silently runs without them.

Proposed fix
 class MHCMetabolicDataset(Dataset):
+    @staticmethod
+    def _caption_key(user_id: object, week_start: object) -> str:
+        wk = pd.Timestamp(week_start).date().isoformat()
+        return f"{user_id}:{wk}"
+
     def __getitem__(self, idx: int) -> EvalSample:
         row, ground_truth = self._samples[idx]
         recording = self._transformer.transform_row(row)
 
         # Attach caption as annotation if available.
-        caption_key = f"{row['user_id']}:{row['week_start']}"
+        caption_key = self._caption_key(row["user_id"], row["week_start"])
         if caption_key in self._captions:
             recording.annotations.append(Annotation(
                 caption_type="weekly_caption",
                 text=self._captions[caption_key],
             ))
@@
         df = pd.read_parquet(path, columns=["user_id", "week_start", "caption"])
         return {
-            f"{row.user_id}:{row.week_start}": row.caption
+            self._caption_key(row.user_id, row.week_start): row.caption
             for row in df.itertuples(index=False)
         }

Also applies to: 149-153

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@time_series_datasets/mhc_metabolic_dataset.py` around lines 108 - 113, The
caption lookup can silently miss because keys are built from row['week_start']
with inconsistent types; normalize the key building in the caption_key creation
and the related key generation (the spots that reference row['week_start'] and
where captions are added to self._captions) by converting week_start to a
canonical string (e.g., str() or .isoformat() if it's a date/datetime) before
composing caption_key and before storing/looking up in self._captions so
recording.annotations.append(Annotation(...)) always finds matching captions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluation/evaluator.py`:
- Around line 83-99: The prompt/signal preparation (task.build_prompt,
sample.recording.iter_channels, and annotations extraction) must be included
inside the per-sample try/except so pre-process failures don't abort the run;
wrap the lines that compute prompt, signals, and annotations together with the
call to model.process and task.parse_prediction in the same try block (referring
to prompt, signals, annotations, model.process, and task.parse_prediction), and
in the except ensure you set response = ModelResponse(text="", input_tokens=0,
output_tokens=0), parsed = "unknown", increment n_errors, and log the
sample.sample_id as before.

In `@evaluation/io/writer.py`:
- Around line 44-49: The current out_dir creation uses a second-level timestamp
and exist_ok=True which allows collisions; modify the directory naming in the
block that computes timestamp/safe_model/out_dir (using variables timestamp,
safe_model, out_dir, and self.results_dir with result.task_name and
result.model_name) to guarantee uniqueness per run — for example, use
datetime.now().strftime with microseconds (e.g. "%Y%m%d_%H%M%S_%f") or append a
short UUID/nonce to the folder name, then create the directory without silently
allowing existing directories (remove or avoid relying on exist_ok=True for
collisions) so each run gets a unique folder and cannot overwrite
metrics.json/predictions.json.
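One way to implement the suggested collision-proof naming; `make_out_dir` and the folder layout are assumptions for illustration, not the actual ResultsWriter code.

```python
import uuid
from datetime import datetime
from pathlib import Path


def make_out_dir(results_dir: Path, task_name: str, model_name: str) -> Path:
    """Create a unique per-run output directory; raises on collision."""
    safe_model = model_name.replace("/", "_")
    # Microsecond timestamp plus a short nonce: unique even for two runs
    # started within the same second.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    out_dir = results_dir / f"{task_name}_{safe_model}_{stamp}_{uuid.uuid4().hex[:8]}"
    out_dir.mkdir(parents=True)  # deliberately no exist_ok=True
    return out_dir
```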

In `@evaluation/metrics/classification.py`:
- Around line 65-68: The loop currently swallows all exceptions from metric_fn,
hiding real bugs; change the broad "except Exception" to only catch expected
metric-evaluation errors (e.g., except (ValueError, ZeroDivisionError) as e) and
set scores[i] = np.nan in that branch (optionally logging the error), while
letting any other exceptions from metric_fn propagate so they surface during
tests; update the try/except surrounding the call to metric_fn(y_true[idx],
y_pred[idx]) (references: metric_fn, scores, y_true, y_pred, idx) accordingly.
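A stdlib-only sketch of bootstrap standard errors with the narrowed exception handling this comment asks for; the real classification.py presumably uses NumPy and may differ in shape and defaults.

```python
import random
import statistics


def bootstrap_se(metric_fn, y_true, y_pred, n_boot=200, seed=0):
    """Bootstrap SE of a paired metric; only expected errors are swallowed."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            scores.append(metric_fn([y_true[i] for i in idx],
                                    [y_pred[i] for i in idx]))
        except (ValueError, ZeroDivisionError):
            continue  # degenerate resample (e.g., one class only)
    return statistics.stdev(scores) if len(scores) > 1 else 0.0
```

Any other exception from `metric_fn` propagates, so genuine bugs surface during tests instead of being masked as NaN scores.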

In `@evaluation/run_eval.py`:
- Around line 245-260: After building the dataset in main() (after dataset =
_build_dataset(args)), validate that the chosen args.task is compatible with the
dataset type/name (args.dataset or samples in dataset) by adding a compatibility
check using TASK_REGISTRY[args.task] or task = TASK_REGISTRY[args.task]() and
inspecting dataset metadata or name; if incompatible, print a clear error
mentioning the dataset and task and exit (sys.exit(1)). Implement the check as a
small helper or inline guard before calling _build_model() / creating
Evaluator(), referencing main(), _build_dataset(), TASK_REGISTRY, args.dataset
and args.task so mismatched pairs (e.g., mhc_activity vs metabolic_* tasks) fail
fast with a descriptive message.

In `@evaluation/tasks/activity.py`:
- Around line 91-103: When filtered is empty, return the same metric keys as the
normal path so downstream code doesn't break: populate f1, precision, recall,
accuracy and their standard-error counterparts (f1_se, precision_se, recall_se,
accuracy_se) with zeros (floats), include n_samples (len(ground_truths)),
n_unknown, n_positive_gt and n_positive_pred as zeros (ints), and parse_rate as
0.0; this mirrors the shape produced after calling binary_metrics(gts, preds)
and keeps keys consistent with the metrics produced later in the function (see
variables filtered, ground_truths, n_unknown and function binary_metrics).

In `@evaluation/tasks/metabolic.py`:
- Around line 224-230: The all-unknown branch in evaluate (where filtered is
empty) returns a metrics dict missing *_se and positive-count keys, causing a
different schema than binary_metrics; update that return to include the same
keys as binary_metrics (e.g., "f1_se", "precision_se", "recall_se",
"accuracy_se", and any positive-count keys like "n_positive" or
"n_positive_pred") set to 0 or 0.0 as appropriate, and ensure parse_rate and
counts (n_samples, n_unknown) remain present so the shape matches binary_metrics
and downstream aggregation won't break; locate the branch around the filtered
check in evaluation/tasks/metabolic.py and mirror the full metric keys produced
by binary_metrics.
- Around line 217-221: The code computes n_unknown and builds filtered pairs
using zip(ground_truths, predictions) without checking that ground_truths and
predictions have the same length, which can silently truncate pairs and make
metrics (parse_rate, n_samples) inconsistent; update the logic (around variables
n_unknown, filtered and where parse_rate/n_samples are computed) to explicitly
guard lengths by verifying len(ground_truths) == len(predictions) and either
raise a clear ValueError when they differ or normalize the inputs (e.g.,
trim/align predictions with a logged warning) before constructing filtered, so
denominators and paired metrics remain consistent.

In `@time_series_datasets/mhc_activity_dataset.py`:
- Around line 336-342: The negative-sample creation is setting
workout_channel="none" on _SampleIndex (constructed with
dataset_idx/window_start/window_end/ground_truth), which later gets turned into
a prompt label; change negative samples to set workout_channel=None (or an
explicit sentinel like "" or "no_activity" that your prompt-generator expects)
instead of the string "none", and update any downstream prompt label generation
code that reads workout_channel to treat None/""/ "no_activity" as a
negative/no-target case (i.e., skip inserting a target label or produce the
proper "no activity" phrasing) so malformed questions are not emitted; adjust
both the instance at the shown _SampleIndex creation and the analogous creation
at the other occurrence (the one around the other negative sample block).

In `@time_series_datasets/mhc_label_lookup.py`:
- Around line 176-184: The merge in join_onlabels can silently duplicate rows if
self.df contains multiple label rows per user_id; update the code in
join_onlabels to enforce one label row per user by either adding
validate="one_to_one" to the pd.DataFrame.merge call or by explicitly checking
self.df["user_id"].duplicated() (and raising a ValueError) before merging;
reference the merge invocation that uses self.df[["user_id"] + label_cols] and
the parameters user_ids and label_cols when making this change so duplicates are
detected and handled rather than silently duplicating rows.
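A runnable illustration of the `validate="one_to_one"` behavior described above, with a small made-up label frame:

```python
import pandas as pd

labels = pd.DataFrame({"user_id": ["u1", "u1", "u2"], "bmi": [22.0, 23.0, 30.0]})
users = pd.DataFrame({"user_id": ["u1", "u2"]})

# Without validation, the duplicated u1 label silently fans the join out
# to three rows instead of two.
fanned = users.merge(labels, on="user_id")
assert len(fanned) == 3

# With validate, pandas raises MergeError (a ValueError subclass) instead
# of silently duplicating rows.
try:
    users.merge(labels, on="user_id", validate="one_to_one")
except pd.errors.MergeError:
    pass
```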


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 01f6201b-68e9-45fe-8966-dbd79195082e

📥 Commits

Reviewing files that changed from the base of the PR and between b366917 and 1c5aa45.

📒 Files selected for processing (20)
  • .gitignore
  • README.md
  • evaluation/README.md
  • evaluation/__init__.py
  • evaluation/evaluator.py
  • evaluation/generate_captions_placeholder.py
  • evaluation/io/__init__.py
  • evaluation/io/writer.py
  • evaluation/metrics/__init__.py
  • evaluation/metrics/classification.py
  • evaluation/run_eval.py
  • evaluation/tasks/__init__.py
  • evaluation/tasks/activity.py
  • evaluation/tasks/base.py
  • evaluation/tasks/metabolic.py
  • notebooks/01_downstream_label_exploration.ipynb
  • time_series_datasets/mhc_activity_dataset.py
  • time_series_datasets/mhc_base_qa_dataset.py
  • time_series_datasets/mhc_label_lookup.py
  • time_series_datasets/mhc_metabolic_dataset.py


@max-rosenblattl max-rosenblattl left a comment


Thanks for the work @milanagm 🚀 I know it's only a draft, but when you want to merge it, please reduce the comments to a minimum and make them concise. Also, the activity task should be in its own PR.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
evaluation/run_eval.py (1)

205-220: ⚠️ Potential issue | 🟠 Major

Fail fast on incompatible --dataset / --task pairs.

Right now mismatched pairs can run and emit meaningless metrics. Add a guard before dataset/model construction. This was already flagged in prior review and is still unresolved.

💡 Proposed fix
+def _validate_dataset_task_pair(dataset: str, task: str) -> None:
+    # e.g. mhc_metabolic_bmi -> metabolic_bmi
+    expected_task = dataset.removeprefix("mhc_")
+    if task != expected_task:
+        raise ValueError(
+            f"Incompatible selection: dataset '{dataset}' requires task '{expected_task}', got '{task}'."
+        )
+
 def main() -> None:
     args = _parse_args()
+    try:
+        _validate_dataset_task_pair(args.dataset, args.task)
+    except ValueError as exc:
+        print(str(exc))
+        sys.exit(1)

     print(f"Dataset  : {args.dataset} ({args.split})")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/run_eval.py` around lines 205 - 220, Add a fast-fail validation
before building dataset or model in main(): implement and call a small checker
(e.g., _validate_task_dataset_pair(args) or inline logic) right after
_parse_args() that verifies the provided args.task is compatible with
args.dataset/split (use existing TASK_REGISTRY[args.task] or dataset metadata to
test compatibility), and if incompatible print a clear error and sys.exit(1);
ensure this guard runs before calling _build_dataset(), _build_model(), or
instantiating Evaluator so mismatched pairs never proceed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluation/run_eval.py`:
- Around line 138-143: The CLI currently allows negative values for
"--max-samples" and "--max-users"; add validation to ensure they are positive
integers. Implement a small validator (e.g., positive_int) and use it as the
argument type or, after parsing, check args.max_samples and args.max_users: if a
value is not None and is <= 0 call parser.error (or raise a clear argparse
error) to reject invalid inputs. Reference parser.add_argument for
"--max-samples" and "--max-users" and the parsed variables args.max_samples /
args.max_users when adding the validation.

In `@time_series_datasets/mhc_metabolic_dataset.py`:
- Line 171: The return statement in mhc_metabolic_dataset.py uses
zip(df["user_id"], df["_label_str"]) which triggers Ruff B905; change it to
zip(df["user_id"], df["_label_str"], strict=True) to enforce equal-length
iterables and satisfy the linter—update the return in the function that
currently returns dict(zip(...)) so it returns dict(zip(df["user_id"],
df["_label_str"], strict=True)).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e7f62e39-2a9c-4b0c-bc1a-60c0f9c203e1

📥 Commits

Reviewing files that changed from the base of the PR and between 1c5aa45 and 4ae7876.

📒 Files selected for processing (3)
  • evaluation/README.md
  • evaluation/run_eval.py
  • time_series_datasets/mhc_metabolic_dataset.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • evaluation/README.md


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
evaluation/tasks/metabolic.py (2)

216-221: ⚠️ Potential issue | 🟠 Major

Validate list lengths before pairing ground truths and predictions.

Line 219 uses zip(...) without a guard, so mismatched lengths are silently truncated and metrics/parse-rate become inconsistent.

Proposed fix
     def aggregate_metrics(
         self,
         ground_truths: list[str],
         predictions: list[str],
     ) -> dict[str, float | int]:
+        if len(ground_truths) != len(predictions):
+            raise ValueError(
+                f"Length mismatch: {len(ground_truths)} ground truths vs "
+                f"{len(predictions)} predictions."
+            )
+
         n_unknown = sum(1 for p in predictions if p == "unknown")
         filtered = [
             (gt, pred)
-            for gt, pred in zip(ground_truths, predictions)
+            for gt, pred in zip(ground_truths, predictions, strict=True)
             if pred != "unknown"
         ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/tasks/metabolic.py` around lines 216 - 221, Before zipping
ground_truths and predictions to build filtered and compute metrics, validate
that len(ground_truths) == len(predictions); if they differ, raise a clear error
(or log and abort) that includes both lengths and any context (e.g., task id) so
the mismatch isn't silently truncated. Replace the direct zip usage in the block
that defines n_unknown and filtered with a guard that checks lengths first, then
proceed to compute n_unknown and build filtered only when lengths match.

223-229: ⚠️ Potential issue | 🟠 Major

Return a stable metric schema when all predictions are unknown.

The Line 223 branch omits *_se, n_positive_gt, and n_positive_pred, so the output shape differs from normal binary_metrics output.

Proposed fix
         if not filtered:
             return {
-                "f1": 0.0, "precision": 0.0, "recall": 0.0, "accuracy": 0.0,
+                "f1": 0.0, "f1_se": 0.0,
+                "precision": 0.0, "precision_se": 0.0,
+                "recall": 0.0, "recall_se": 0.0,
+                "accuracy": 0.0, "accuracy_se": 0.0,
                 "n_samples": len(ground_truths),
+                "n_positive_gt": 0,
+                "n_positive_pred": 0,
                 "n_unknown": n_unknown,
                 "parse_rate": 0.0,
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/tasks/metabolic.py` around lines 223 - 229, The early-return in
evaluation/tasks/metabolic.py (inside the metrics computation, e.g., the
binary_metrics branch) returns a subset of keys when all predictions are
unknown; update that branch to return the full stable metric schema used
elsewhere by including the missing keys: f1_se, precision_se, recall_se,
accuracy_se (set to 0.0), and n_positive_gt and n_positive_pred (set to 0),
while keeping existing keys n_samples, n_unknown, parse_rate, f1, precision,
recall, accuracy; ensure key names exactly match the normal output schema so
callers always receive a consistent shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluation/io/writer.py`:
- Around line 47-48: Sanitize both path components before joining: add/used a
sanitizer (e.g., sanitize_path_component or sanitize_filename) and apply it to
result.task_name and result.model_name (currently assigned to safe_model) so you
replace or remove unsafe sequences like "/", "\\", and ".." and any other
characters invalid for filenames; then compute out_dir using self.results_dir /
sanitized_task_name / sanitized_model_name / sanitized_timestamp (ensure
timestamp is sanitized if generated elsewhere). Update the assignments where
safe_model and out_dir are set so they use the sanitizer and keep references to
safe_model, out_dir, self.results_dir, result.task_name and result.model_name to
locate the change.
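A minimal sanitizer along the lines the comment suggests — the function name and allowed character set are assumptions, not the project's actual helper:

```python
import re
from pathlib import Path


def sanitize_path_component(name: str) -> str:
    """Collapse '..' sequences, path separators, and other unsafe
    characters into underscores so a single directory name remains."""
    cleaned = name.replace("..", "_").replace("/", "_").replace("\\", "_")
    # Keep only a conservative filename alphabet.
    return re.sub(r"[^A-Za-z0-9._-]", "_", cleaned) or "_"


results_dir = Path("results")
out_dir = (
    results_dir
    / sanitize_path_component("metabolic")
    / sanitize_path_component("../../etc/passwd")  # hostile model name
)
print(out_dir)  # stays inside results/
```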

---

Duplicate comments:
In `@evaluation/tasks/metabolic.py`:
- Around line 216-221: Before zipping ground_truths and predictions to build
filtered and compute metrics, validate that len(ground_truths) ==
len(predictions); if they differ, raise a clear error (or log and abort) that
includes both lengths and any context (e.g., task id) so the mismatch isn't
silently truncated. Replace the direct zip usage in the block that defines
n_unknown and filtered with a guard that checks lengths first, then proceed to
compute n_unknown and build filtered only when lengths match.
- Around line 223-229: The early-return in evaluation/tasks/metabolic.py (inside
the metrics computation, e.g., the binary_metrics branch) returns a subset of
keys when all predictions are unknown; update that branch to return the full
stable metric schema used elsewhere by including the missing keys: f1_se,
precision_se, recall_se, accuracy_se (set to 0.0), and n_positive_gt and
n_positive_pred (set to 0), while keeping existing keys n_samples, n_unknown,
parse_rate, f1, precision, recall, accuracy; ensure key names exactly match the
normal output schema so callers always receive a consistent shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 20bede1c-9a2d-4b38-ac79-1902ca74c362

📥 Commits

Reviewing files that changed from the base of the PR and between 4ae7876 and a002894.

📒 Files selected for processing (5)
  • .gitignore
  • evaluation/io/writer.py
  • evaluation/metrics/classification.py
  • evaluation/tasks/metabolic.py
  • mhc/transformer.py
✅ Files skipped from review due to trivial changes (2)
  • mhc/transformer.py
  • .gitignore

Comment thread evaluation/io/writer.py Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
evaluation/metrics/classification.py (1)

117-117: Remove redundant int(len(...)) casts.

These casts are unnecessary since len(...) already returns int; dropping them improves clarity and resolves the Ruff warnings.

Proposed cleanup
-        "n_samples":     int(len(ground_truths)),
+        "n_samples":     len(ground_truths),
@@
-        "n_samples":     int(len(ground_truths)),
-        "n_classes":     int(len(label_set)),
+        "n_samples":     len(ground_truths),
+        "n_classes":     len(label_set),
@@
-        "n_samples": int(len(ground_truths)),
-        "n_classes": int(len(labels)),
+        "n_samples": len(ground_truths),
+        "n_classes": len(labels),

Also applies to: 146-147, 193-194

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/metrics/classification.py` at line 117, Remove the redundant
int(...) casts around len(...) in the result dictionaries: replace
int(len(ground_truths)) (and the other occurrences using int(len(...))) with
just len(...) so the "n_samples" and similar fields use len(...) directly;
locate these in evaluation/metrics/classification.py where result dicts are
assembled (e.g., the "n_samples" key and the other two similar entries) and
update them accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluation/metrics/classification.py`:
- Line 131: The code currently treats labels as falsy, so replace the expression
that builds label_set with an explicit None check: if labels is None then
compute label_set = sorted({*ground_truths, *predictions}) else use the provided
labels as-is (preserving an explicit empty list). Update any downstream
computations (e.g., n_classes) to derive their values from this label_set.
Ensure validation still runs on the provided labels (uniqueness/consistency with
ground_truths/predictions) rather than silently falling back when labels == [].
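The distinction the comment draws — checking `labels is None` rather than truthiness — can be shown in isolation. The function is a sketch; only the behavior described in the comment is assumed:

```python
def resolve_label_set(ground_truths, predictions, labels=None):
    """Derive the label set only when the caller passed nothing;
    an explicit empty list is preserved rather than overridden."""
    if labels is None:
        return sorted({*ground_truths, *predictions})
    return labels


# None -> derived from the data.
assert resolve_label_set(["a", "b"], ["b", "c"]) == ["a", "b", "c"]
# Explicit empty list -> kept as-is; a `labels or ...` fallback would
# silently replace it with the derived set.
assert resolve_label_set(["a"], ["a"], labels=[]) == []
```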

---

Nitpick comments:
In `@evaluation/metrics/classification.py`:
- Line 117: Remove the redundant int(...) casts around len(...) in the result
dictionaries: replace int(len(ground_truths)) (and the other occurrences using
int(len(...))) with just len(...) so the "n_samples" and similar fields use
len(...) directly; locate these in evaluation/metrics/classification.py where
result dicts are assembled (e.g., the "n_samples" key and the other two similar
entries) and update them accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 79bd7bf3-e2c8-4af1-9892-deb18203f7a5

📥 Commits

Reviewing files that changed from the base of the PR and between a002894 and 88f1301.

📒 Files selected for processing (1)
  • evaluation/metrics/classification.py

Comment thread evaluation/metrics/classification.py Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@requirements.txt`:
- Line 14: The requirements entry pins opentslm to the mutable branch "main",
making installs non-deterministic; update the line in requirements.txt
referencing "opentslm @ git+https://github.com/StanfordBDHG/OpenTSLM.git@main"
to use an immutable ref (a commit SHA such as
104013b90b89fb39c1537c9190b3d047c3072fb5 or a release tag) so installs are
reproducible, then re-run any dependency install/lock steps (e.g., pip-compile
or CI dependency caching) to ensure the change is applied across environments.
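Concretely, the requirements.txt line would change from the mutable branch to the SHA the comment cites:

```
# before (mutable — resolves differently over time)
opentslm @ git+https://github.com/StanfordBDHG/OpenTSLM.git@main

# after (immutable — reproducible installs)
opentslm @ git+https://github.com/StanfordBDHG/OpenTSLM.git@104013b90b89fb39c1537c9190b3d047c3072fb5
```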

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2bc558c7-b16c-4187-aa38-a3ebf5fb569c

📥 Commits

Reviewing files that changed from the base of the PR and between 88f1301 and 349ab92.

📒 Files selected for processing (2)
  • .gitignore
  • requirements.txt
✅ Files skipped from review due to trivial changes (1)
  • .gitignore

Comment thread requirements.txt Outdated
milanagm and others added 18 commits April 22, 2026 10:22
# Add cross-channel captions for all binary daily channels

## ♻️ Current Situation & Problem
The daily cross-channel caption pipeline only covered a subset of binary
channels: in-bed-awake sleep periods, selected stationary workouts, and
running/cycling cardio workouts.

This left several binary channels without cross-channel captions,
including direct asleep/in-bed windows, walking, yoga, elliptical, mixed
cardio, and other workouts. These channels should also be described with
relevant context from continuous signals such as heart rate, distance,
steps, and active energy.

## ⚙️ What Changed
- Added focused cross-channel synthesizers by activity family:
  - `WalkingSynthesizer` in `synthesizers/locomotion.py`
  - `MindBodyActivitySynthesizer` in `synthesizers/mind_body.py`
  - `OtherActivitySynthesizer` in `synthesizers/other_activity.py`
  - `EnduranceActivitySynthesizer` in `synthesizers/cardio.py`
- Extended `SleepSynthesizer` to emit captions for direct `sleep:asleep`
and `sleep:inbed` windows, while preserving the existing
`in_bed_not_sleeping` behavior.
- Added active-energy metric support through the shared cross-channel
helper pattern.
- Updated existing cardio and stationary workout captions to include
active-energy context where available.
- Added new cross-channel templates for sleep stages, walking, yoga,
endurance activities, generic workouts, active energy, and stationary
sleep summaries.
- Added explorer hit-target support for the new cross-channel labels so
the new captions are discoverable through the hit navigation.
- Kept `CrossChannelExtractor` unchanged; it still collects captions
from the configured synthesizer list.

## 📝 Release Notes
- Added cross-channel captions for previously uncovered daily binary
workout channels:
  - Walking
  - Yoga
  - Elliptical
  - Mixed cardio
  - Other workout
- Added direct sleep captions for `sleep:asleep` and `sleep:inbed`.
- Preserved overlapping sleep captions, so `sleep:inbed` and
`sleep:asleep` can both produce captions for the same window.
- Improved sleep movement summaries so zero-distance sleep periods are
described as stationary rather than listing zero steps/distance.
- Improved template wording for more natural activity captions.
- No public API or extractor interface changes.

## 📚 Documentation
This change is implemented through focused synthesizer additions and
template updates. `CrossChannelExtractor` remains unchanged and
continues to collect captions from configured synthesizers.

New and updated templates in `templates/templates.json` provide more
natural phrasing for sleep, yoga, walking, elliptical, mixed cardio, and
other workout captions.

## ✅ Testing
Added synthetic unit tests covering:
- Cross-channel caption coverage for every daily binary channel.
- Overlapping `sleep:inbed` and `sleep:asleep` windows producing
separate captions.
- Yoga captions including heart-rate and active-energy context without
requiring movement.
- Walking captions including steps and distance.
- Missing helper channels not crashing caption generation.
- Explorer hit targets covering emitted cross-channel labels.
- Sleep zero-distance captions using a stationary summary rather than
explicit zero steps/distance wording.

### Code of Conduct & Contributing Guidelines
By creating and submitting this pull request, you agree to follow our
[Code of
Conduct](https://github.com/StanfordBDHG/.github/blob/main/CODE_OF_CONDUCT.md)
and [Contributing
Guidelines](https://github.com/StanfordBDHG/.github/blob/main/CONTRIBUTING.md):
- [x] I agree to follow the [Code of
Conduct](https://github.com/StanfordBDHG/.github/blob/main/CODE_OF_CONDUCT.md)
and [Contributing
Guidelines](https://github.com/StanfordBDHG/.github/blob/main/CONTRIBUTING.md).

---------

Co-authored-by: Max Rosenblattl <max.rosenblattl@tum.de>
Resolves shard-glob conflict in mhc_base_qa_dataset (kept rglob form,
functionally equivalent). Adopts main's NaN→0 fix in _binary_prompt
for OpenTSLM compatibility. Pulls in PR #50 (curriculum learning
training pipeline + multi-label datasets).
Comment thread time_series_datasets/mhc_label_lookup.py
Comment thread scripts/smoke_test_vlm_local.sh Outdated
Comment thread models/base.py
@milanagm milanagm merged commit 5c959c5 into main Apr 30, 2026
3 checks passed
@milanagm milanagm deleted the feature/evaluation-pipeline branch April 30, 2026 00:47
milanagm added a commit that referenced this pull request Apr 30, 2026
Switch metabolic training targets from ints to canonical strings
  ♻️ Current situation & Problem
Depends on / sits on top of the eval pipeline PR (#55,
feature/evaluation-pipeline). Surface follow-up to PR #50, which
introduced the curriculum-learning training stack with int-encoded
answer targets ("0", "1", "2", …).
Two related issues motivating this PR:
1. OpenTSLM convention divergence — the upstream OpenTSLM reference
datasets (HARAcc, Sleep, ECG) all train on canonical class strings
("biking", "Wake", "yes", …). Our metabolic_finetune.py and
mhc_multi_label_qa_dataset.py deviated by using str(int(value)) as the
training target.
   
2. Train ↔ eval prompt drift — eval (MHCMetabolicQADataset from PR #55)
frames the task as "Is this person biologically male or female? …
Answer: <label>", while training framed it as the terse "Based on the
sensor data, predict the value of BiologicalSex (binary).".
  ⚙️ Release Notes
  - Training answer format switched to canonical strings:        
    - Binary: "Male" / "Female" / "True" / "False"
    - Ordinal: "Underweight" / "Normal weight" / "Overweight" / …
    - Continuous: numeric string (unchanged)          
- Wrapped as "Answer: <value>" so the loss target matches the prompt's
Answer: <label> instruction.
- New shared helper build_metabolic_post_prompt(label) in
time_series_datasets/mhc_label_lookup.py. Both MHCMetabolicQADataset
(eval) and MHCMultiLabelQADataset (training) call it — train and eval
now emit byte-identical post_prompts. Lives next to
METABOLIC_LABEL_CONFIG it formats.
- LabelLookup API change: _decode_labels → decode_labels (was
effectively public; renamed to match its actual usage).
- scripts/metabolic_finetune.py eval logic removed. The single-token
logit-scoring path (_candidate_token_ids,
_logits_over_candidates, _answer_prefix_embed) was incompatible with
multi-token canonical strings. Downstream metrics now flow
through evaluation/run_eval.py (PR #55) instead. --eval_only flag
removed; net −189 LOC.
- Migration: existing int-trained checkpoints will not parse correctly
under the new prompt format. Re-train from a stage-1 captioning
checkpoint:
python curriculum_learning.py \
    --model OpenTSLMSP \
    --stages stage_joint_caption_metabolic \
    --batch_size 2 \
    --gradient_checkpointing
- (Or run stage_metabolic alone if a captioning checkpoint is already on
disk.)
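A sketch of what the shared helper could look like, given the "Answer: <value>" convention described above — the body is an assumption; only the name and the byte-identical train/eval contract come from the PR:

```python
def build_metabolic_post_prompt(label: str) -> str:
    """Format a canonical label as the shared train/eval answer target.

    Both the eval dataset (MHCMetabolicQADataset) and the training
    dataset (MHCMultiLabelQADataset) call this helper, so the
    post_prompts they emit are byte-identical.
    """
    return f"Answer: {label}"


assert build_metabolic_post_prompt("Female") == "Answer: Female"
assert build_metabolic_post_prompt("Overweight") == "Answer: Overweight"
```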