
Feature/evaluation pipeline#55

Merged
milanagm merged 54 commits into main from feature/evaluation-pipeline
Apr 30, 2026

Conversation

@milanagm
Contributor

Eval Architecture and First End-to-End Metabolic Draft

♻️ Current situation & Problem

Link any open issues or pull requests (PRs) related to this PR. Please ensure that all non-trivial PRs are first tracked and discussed in an existing GitHub issue or discussion.

⚙️ Release Notes

Add a bullet point list summary of the feature and possible migration guides if this is a breaking change so this section can be added to the release notes.
Include code snippets that provide examples of the feature implemented or links to the documentation if it appends or changes the public interface.

📚 Documentation

Please ensure that you properly document any additions in conformance with the Spezi Documentation Guide.
You can use this section to describe your solution, but we encourage contributors to document their reasoning and changes using in-line documentation.

✅ Testing

Please ensure that the PR meets the testing requirements set by CodeCov and that new functionality is appropriately tested.
This section describes important information about the tests and why some elements might not be testable.

Code of Conduct & Contributing Guidelines

By creating and submitting this pull request, you agree to follow our Code of Conduct and Contributing Guidelines:

Introduces a modular evaluation pipeline for benchmarking LLMs on
MHC wearable sensor data. All components are task-agnostic and
extensible via registries in run_eval.py.

- evaluation/: unified entry point (run_eval.py), Evaluator,
  EvalTask ABC, classification metrics with bootstrap SE,
  ResultsWriter; stubs for sleep and caption tasks
- time_series_datasets/mhc_activity_dataset.py: 60-min windowed
  samples with ground truth from Apple Watch workout:* binary
  channels — no manual annotation required
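As a hedged illustration of the registry-driven design described above, a task ABC plus a registry in the runner could look like the sketch below. The names (`EvalTask`, `TASK_REGISTRY`, `register_task`) follow the PR description, but the actual signatures in `run_eval.py` may differ.

```python
from abc import ABC, abstractmethod


class EvalTask(ABC):
    """Sketch of the task interface; the real ABC may differ."""

    name: str  # identifier used for result file paths

    @abstractmethod
    def build_prompt(self, sample) -> str: ...

    @abstractmethod
    def parse_prediction(self, text: str) -> str: ...


TASK_REGISTRY: dict[str, type[EvalTask]] = {}


def register_task(cls):
    """Class decorator so a new task plugs in without editing the runner."""
    TASK_REGISTRY[cls.name] = cls
    return cls


@register_task
class MetabolicTask(EvalTask):
    name = "metabolic"

    def build_prompt(self, sample) -> str:
        return f"Given this 60-min sensor window, answer yes/no: {sample}"

    def parse_prediction(self, text: str) -> str:
        answer = text.strip().lower()
        return answer if answer in {"yes", "no"} else "unknown"
```

New tasks then only need a decorated class; the CLI can resolve `--task` via `TASK_REGISTRY` without any other wiring.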

coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new evaluation framework (orchestrator, tasks, metrics, I/O, CLI), placeholder caption generator, dataset label-lookup/split changes, and minor transformer and packaging/ignore updates. The changes are additive, introducing new modules and APIs for end-to-end model evaluation.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Evaluation core | evaluation/evaluator.py, evaluation/__init__.py | New Evaluator orchestration, SampleResult/EvalResult dataclasses, and package initializer headers. |
| Tasks | evaluation/tasks/base.py, evaluation/tasks/metabolic.py, evaluation/tasks/__init__.py | Abstract EvalTask and EvalSample; concrete MetabolicTask with prompt/parse/aggregate logic and package initializer header. |
| Metrics | evaluation/metrics/classification.py, evaluation/metrics/__init__.py | New stateless classification metrics (binary/multiclass/ordinal) with bootstrap SEs and package initializer header. |
| I/O / persistence | evaluation/io/writer.py, evaluation/io/__init__.py | ResultsWriter writes a timestamped output dir with metrics.json and predictions.csv; package initializer header. |
| CLI / utilities | evaluation/run_eval.py, evaluation/generate_captions_placeholder.py | New evaluation CLI (run_eval) wiring registries/builders and the end-to-end flow; caption placeholder generator producing captions.parquet. |
| Dataset label/split changes | time_series_datasets/mhc_label_lookup.py, time_series_datasets/mhc_base_qa_dataset.py | Added SHA-1-based assign_split() with TRAIN_FRAC/VAL_FRAC; replaced join() with join_ondate() and join_onlabels(); dataset now delegates split logic and uses the new join API. |
| Data transform & deps | mhc/transformer.py, requirements.txt, .gitignore | Channel-5 zeros now treated as NaN; opentslm pinned to GitHub main; .gitignore updated to ignore top-level results/ (and other entries). |
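The SHA-1-based split mentioned for mhc_label_lookup.py can be sketched as below. The fractions and the exact signature are assumptions for illustration, not the repository's actual code; the key property is that a user's split is deterministic and stable across runs.

```python
import hashlib

# Illustrative values; the real TRAIN_FRAC / VAL_FRAC may differ.
TRAIN_FRAC = 0.8
VAL_FRAC = 0.1


def assign_split(user_id: str) -> str:
    """Map a user deterministically to train/val/test via its SHA-1 hash."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    # Interpret the hex digest as a uniform value in [0, 1).
    bucket = int(digest, 16) / 16 ** len(digest)
    if bucket < TRAIN_FRAC:
        return "train"
    if bucket < TRAIN_FRAC + VAL_FRAC:
        return "val"
    return "test"
```

Because the assignment depends only on the user ID, the same user always lands in the same split, even if the dataset is regenerated or reordered.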

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/run_eval
    participant Eval as Evaluator
    participant Task as EvalTask
    participant Model as BaseModel
    participant Writer as ResultsWriter

    CLI->>Eval: run(task, model, dataset, max_samples)
    loop per sample
        Eval->>Task: build_prompt(sample)
        Task-->>Eval: prompt
        Eval->>Model: process(prompt, channels, annotations)
        Model-->>Eval: ModelResponse
        Eval->>Task: parse_prediction(response.text)
        Task-->>Eval: parsed_prediction
        Eval-->>Eval: collect SampleResult
    end
    Eval->>Task: aggregate_metrics(gts, preds)
    Task-->>Eval: metrics
    Eval-->>CLI: EvalResult
    CLI->>Writer: write(EvalResult)
    Writer-->>CLI: output directory path
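The flow in the diagram can be sketched as a minimal loop. All class and method names here mirror the diagram rather than the actual Evaluator implementation, and the per-sample error handling is a simplified assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SampleResult:
    sample_id: int
    ground_truth: str
    prediction: str


@dataclass
class EvalResult:
    samples: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)


def run(task, model, dataset, max_samples=None) -> EvalResult:
    result = EvalResult()
    for i, (label, sample) in enumerate(dataset):
        if max_samples is not None and i >= max_samples:
            break
        try:
            # Prompt building sits inside the try block so a pre-processing
            # failure on one sample does not abort the whole run.
            prompt = task.build_prompt(sample)
            parsed = task.parse_prediction(model.process(prompt))
        except Exception:
            parsed = "unknown"  # count as a parse failure and keep going
        result.samples.append(SampleResult(i, label, parsed))
    gts = [s.ground_truth for s in result.samples]
    preds = [s.prediction for s in result.samples]
    result.metrics = task.aggregate_metrics(gts, preds)
    return result
```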

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • max-rosenblattl
  • ThomasKaar
  • RealLast
🚥 Pre-merge checks | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.64%, below the required threshold of 80.00%. | Write docstrings for the functions that are missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Feature/evaluation pipeline' is vague and generic, using broad terms that don't specify the main change or architectural details introduced in this substantial feature addition. | Use a more descriptive title that highlights the primary change, such as 'Add evaluation orchestration framework with metabolic task support' or 'Introduce evaluation pipeline with metrics aggregation and result persistence.' |
| Description check | ❓ Inconclusive | The description is mostly a template scaffold with placeholder sections and does not document the actual evaluation pipeline implementation, architectural decisions, or concrete features introduced. | Fill in the template sections with concrete details: describe the evaluation architecture, the MetabolicTask implementation, provide usage examples, specify testing done, and link any related GitHub issues or discussions. |



@max-rosenblattl max-rosenblattl marked this pull request as ready for review April 13, 2026 22:47

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (5)
evaluation/generate_captions_placeholder.py (2)

37-37: Consider using relative imports or package installation instead of sys.path manipulation.

sys.path.insert(0, ...) is fragile and can cause import conflicts. Since this is a script intended for standalone execution, consider either:

  1. Running via python -m evaluation.generate_captions_placeholder from repo root
  2. Adding an editable install (pip install -e .) to make the package importable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/generate_captions_placeholder.py` at line 37, Replace the sys.path
manipulation in generate_captions_placeholder.py (the call to sys.path.insert(0,
str(Path(__file__).resolve().parents[1])) that alters imports at runtime) with a
proper package import strategy: either update usage to run the module with
python -m evaluation.generate_captions_placeholder from the repository root or
remove the sys.path change and rely on an editable install (pip install -e .) so
imports resolve normally; make sure to remove the sys.path.insert line and any
dependency on Path(__file__).resolve().parents[1] and document the preferred
run/install method in the script header or README.

42-44: Consider accepting paths via CLI arguments for flexibility.

Hardcoded paths work for the documented workflow but limit reusability. An optional argparse interface would allow testing with different datasets or output locations.

♻️ Optional: Add CLI arguments
+import argparse
+
+def parse_args():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--weekly-path", type=Path, default=Path("data/mhc_dev_subset_weekly_hf"))
+    parser.add_argument("--output-dir", type=Path, default=Path("data/mhc_dev_subset_captions"))
+    return parser.parse_args()
+
-WEEKLY_PATH  = Path("data/mhc_dev_subset_weekly_hf")
-OUTPUT_DIR   = Path("data/mhc_dev_subset_captions")
-OUTPUT_FILE  = OUTPUT_DIR / "captions.parquet"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/generate_captions_placeholder.py` around lines 42 - 44, Add an
argparse interface to make WEEKLY_PATH, OUTPUT_DIR, and OUTPUT_FILE
configurable: parse optional CLI arguments (e.g., --weekly-path, --output-dir,
--output-file) that default to the existing WEEKLY_PATH, OUTPUT_DIR, and
OUTPUT_FILE constants, convert parsed strings to Path objects, and ensure the
output directory exists before writing; update any main or entrypoint logic in
generate_captions_placeholder.py to use the parsed values instead of the
hardcoded constants.
evaluation/tasks/base.py (1)

48-48: Consider enforcing name as an abstract property.

The name attribute is required by subclasses but isn't enforced at the ABC level. Subclasses could accidentally omit it, leading to runtime AttributeError when the evaluator accesses task.name.

♻️ Optional: Enforce via abstract property
+    @property
+    @abc.abstractmethod
+    def name(self) -> str:
+        """Task identifier used for result file paths."""
+        ...
-    name: str  # set by each concrete subclass, used for result file paths
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/tasks/base.py` at line 48, Make the name attribute an abstract
property on the base task class so subclasses are forced to define it: replace
the plain "name: str" declaration with a `@property` decorated abstract method
(using `@abstractmethod`) on the base class (e.g., def name(self) -> str) so code
that accesses task.name will be guaranteed to exist; update imports to include
ABC/abstractmethod if needed and ensure existing subclasses implement the name
property returning the required string.
time_series_datasets/mhc_activity_dataset.py (1)

107-118: Use actual recording length instead of fixed 1440 for window centering.

Hard-coding max_len=1440 couples this helper to one data shape and risks invalid windows if sequence length changes.

Proposed fix
 def _center_window(
     event_start: int,
     event_end: int,
     window_size: int,
-    max_len: int = 1440,
+    max_len: int,
 ) -> tuple[int, int]:
@@
                 for event_start, event_end in _find_workout_windows(
                     data[ch_idx], min_workout_duration
                 ):
                     w_start, w_end = _center_window(
-                        event_start, event_end, self._window_size
+                        event_start, event_end, self._window_size, max_len=data.shape[1]
                     )

Also applies to: 318-320

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@time_series_datasets/mhc_activity_dataset.py` around lines 107 - 118, The
helper _center_window currently hardcodes max_len=1440 which can produce invalid
windows for other sequence lengths; change the signature of _center_window to
accept the actual recording length (e.g., max_len: int with no default or
default None and require caller to pass length) and update all call sites
(including the other usage around the same module) to pass the true sequence
length from the dataset (e.g., the per-sample recording length variable) so
window start/end are computed against the real recording length rather than
1440.
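A minimal sketch of such a centering helper, taking the true recording length as suggested; the name and clamping behavior are illustrative rather than the file's actual code.

```python
def center_window(event_start: int, event_end: int,
                  window_size: int, max_len: int) -> tuple[int, int]:
    """Center a fixed-size window on an event, clamped to [0, max_len]."""
    center = (event_start + event_end) // 2
    start = center - window_size // 2
    # Clamp so the window never runs off either end of the recording.
    start = max(0, min(start, max_len - window_size))
    return start, start + window_size
```

Passing `max_len=data.shape[1]` at the call site (instead of a hard-coded 1440) keeps the helper valid for any sequence length.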
time_series_datasets/mhc_metabolic_dataset.py (1)

108-113: Normalize caption lookup keys to avoid silent misses.

Line 108 and Lines 151-152 build keys from week_start via implicit string conversion from different sources. If types differ, captions won’t attach and evaluation silently runs without them.

Proposed fix
 class MHCMetabolicDataset(Dataset):
+    @staticmethod
+    def _caption_key(user_id: object, week_start: object) -> str:
+        wk = pd.Timestamp(week_start).date().isoformat()
+        return f"{user_id}:{wk}"
+
     def __getitem__(self, idx: int) -> EvalSample:
         row, ground_truth = self._samples[idx]
         recording = self._transformer.transform_row(row)
 
         # Attach caption as annotation if available.
-        caption_key = f"{row['user_id']}:{row['week_start']}"
+        caption_key = self._caption_key(row["user_id"], row["week_start"])
         if caption_key in self._captions:
             recording.annotations.append(Annotation(
                 caption_type="weekly_caption",
                 text=self._captions[caption_key],
             ))
@@
         df = pd.read_parquet(path, columns=["user_id", "week_start", "caption"])
         return {
-            f"{row.user_id}:{row.week_start}": row.caption
+            self._caption_key(row.user_id, row.week_start): row.caption
             for row in df.itertuples(index=False)
         }

Also applies to: 149-153

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@time_series_datasets/mhc_metabolic_dataset.py` around lines 108 - 113, The
caption lookup can silently miss because keys are built from row['week_start']
with inconsistent types; normalize the key building in the caption_key creation
and the related key generation (the spots that reference row['week_start'] and
where captions are added to self._captions) by converting week_start to a
canonical string (e.g., str() or .isoformat() if it's a date/datetime) before
composing caption_key and before storing/looking up in self._captions so
recording.annotations.append(Annotation(...)) always finds matching captions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluation/evaluator.py`:
- Around line 83-99: The prompt/signal preparation (task.build_prompt,
sample.recording.iter_channels, and annotations extraction) must be included
inside the per-sample try/except so pre-process failures don't abort the run;
wrap the lines that compute prompt, signals, and annotations together with the
call to model.process and task.parse_prediction in the same try block (referring
to prompt, signals, annotations, model.process, and task.parse_prediction), and
in the except ensure you set response = ModelResponse(text="", input_tokens=0,
output_tokens=0), parsed = "unknown", increment n_errors, and log the
sample.sample_id as before.

In `@evaluation/io/writer.py`:
- Around line 44-49: The current out_dir creation uses a second-level timestamp
and exist_ok=True which allows collisions; modify the directory naming in the
block that computes timestamp/safe_model/out_dir (using variables timestamp,
safe_model, out_dir, and self.results_dir with result.task_name and
result.model_name) to guarantee uniqueness per run — for example, use
datetime.now().strftime with microseconds (e.g. "%Y%m%d_%H%M%S_%f") or append a
short UUID/nonce to the folder name, then create the directory without silently
allowing existing directories (remove or avoid relying on exist_ok=True for
collisions) so each run gets a unique folder and cannot overwrite
metrics.json/predictions.json.
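One way to implement the suggested collision-proof naming; `make_out_dir` and the folder layout are assumptions for illustration, not the actual ResultsWriter code.

```python
import uuid
from datetime import datetime
from pathlib import Path


def make_out_dir(results_dir: Path, task_name: str, model_name: str) -> Path:
    """Create a unique per-run output directory; raises on collision."""
    safe_model = model_name.replace("/", "_")
    # Microsecond timestamp plus a short nonce: unique even for two runs
    # started within the same second.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    out_dir = results_dir / f"{task_name}_{safe_model}_{stamp}_{uuid.uuid4().hex[:8]}"
    out_dir.mkdir(parents=True)  # deliberately no exist_ok=True
    return out_dir
```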

In `@evaluation/metrics/classification.py`:
- Around line 65-68: The loop currently swallows all exceptions from metric_fn,
hiding real bugs; change the broad "except Exception" to only catch expected
metric-evaluation errors (e.g., except (ValueError, ZeroDivisionError) as e) and
set scores[i] = np.nan in that branch (optionally logging the error), while
letting any other exceptions from metric_fn propagate so they surface during
tests; update the try/except surrounding the call to metric_fn(y_true[idx],
y_pred[idx]) (references: metric_fn, scores, y_true, y_pred, idx) accordingly.
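A stdlib-only sketch of bootstrap standard errors with the narrowed exception handling this comment asks for; the real classification.py presumably uses NumPy and may differ in shape and defaults.

```python
import random
import statistics


def bootstrap_se(metric_fn, y_true, y_pred, n_boot=200, seed=0):
    """Bootstrap SE of a paired metric; only expected errors are swallowed."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            scores.append(metric_fn([y_true[i] for i in idx],
                                    [y_pred[i] for i in idx]))
        except (ValueError, ZeroDivisionError):
            continue  # degenerate resample (e.g., one class only)
    return statistics.stdev(scores) if len(scores) > 1 else 0.0
```

Any other exception from `metric_fn` propagates, so genuine bugs surface during tests instead of being masked as NaN scores.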

In `@evaluation/run_eval.py`:
- Around line 245-260: After building the dataset in main() (after dataset =
_build_dataset(args)), validate that the chosen args.task is compatible with the
dataset type/name (args.dataset or samples in dataset) by adding a compatibility
check using TASK_REGISTRY[args.task] or task = TASK_REGISTRY[args.task]() and
inspecting dataset metadata or name; if incompatible, print a clear error
mentioning the dataset and task and exit (sys.exit(1)). Implement the check as a
small helper or inline guard before calling _build_model() / creating
Evaluator(), referencing main(), _build_dataset(), TASK_REGISTRY, args.dataset
and args.task so mismatched pairs (e.g., mhc_activity vs metabolic_* tasks) fail
fast with a descriptive message.

In `@evaluation/tasks/activity.py`:
- Around line 91-103: When filtered is empty, return the same metric keys as the
normal path so downstream code doesn't break: populate f1, precision, recall,
accuracy and their standard-error counterparts (f1_se, precision_se, recall_se,
accuracy_se) with zeros (floats), include n_samples (len(ground_truths)),
n_unknown, n_positive_gt and n_positive_pred as zeros (ints), and parse_rate as
0.0; this mirrors the shape produced after calling binary_metrics(gts, preds)
and keeps keys consistent with the metrics produced later in the function (see
variables filtered, ground_truths, n_unknown and function binary_metrics).

In `@evaluation/tasks/metabolic.py`:
- Around line 224-230: The all-unknown branch in evaluate (where filtered is
empty) returns a metrics dict missing *_se and positive-count keys, causing a
different schema than binary_metrics; update that return to include the same
keys as binary_metrics (e.g., "f1_se", "precision_se", "recall_se",
"accuracy_se", and any positive-count keys like "n_positive" or
"n_positive_pred") set to 0 or 0.0 as appropriate, and ensure parse_rate and
counts (n_samples, n_unknown) remain present so the shape matches binary_metrics
and downstream aggregation won't break; locate the branch around the filtered
check in evaluation/tasks/metabolic.py and mirror the full metric keys produced
by binary_metrics.
- Around line 217-221: The code computes n_unknown and builds filtered pairs
using zip(ground_truths, predictions) without checking that ground_truths and
predictions have the same length, which can silently truncate pairs and make
metrics (parse_rate, n_samples) inconsistent; update the logic (around variables
n_unknown, filtered and where parse_rate/n_samples are computed) to explicitly
guard lengths by verifying len(ground_truths) == len(predictions) and either
raise a clear ValueError when they differ or normalize the inputs (e.g.,
trim/align predictions with a logged warning) before constructing filtered, so
denominators and paired metrics remain consistent.

In `@time_series_datasets/mhc_activity_dataset.py`:
- Around line 336-342: The negative-sample creation is setting
workout_channel="none" on _SampleIndex (constructed with
dataset_idx/window_start/window_end/ground_truth), which later gets turned into
a prompt label; change negative samples to set workout_channel=None (or an
explicit sentinel like "" or "no_activity" that your prompt-generator expects)
instead of the string "none", and update any downstream prompt label generation
code that reads workout_channel to treat None/""/ "no_activity" as a
negative/no-target case (i.e., skip inserting a target label or produce the
proper "no activity" phrasing) so malformed questions are not emitted; adjust
both the instance at the shown _SampleIndex creation and the analogous creation
at the other occurrence (the one around the other negative sample block).

In `@time_series_datasets/mhc_label_lookup.py`:
- Around line 176-184: The merge in join_onlabels can silently duplicate rows if
self.df contains multiple label rows per user_id; update the code in
join_onlabels to enforce one label row per user by either adding
validate="one_to_one" to the pd.DataFrame.merge call or by explicitly checking
self.df["user_id"].duplicated() (and raising a ValueError) before merging;
reference the merge invocation that uses self.df[["user_id"] + label_cols] and
the parameters user_ids and label_cols when making this change so duplicates are
detected and handled rather than silently duplicating rows.
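A runnable illustration of the `validate="one_to_one"` behavior described above, with a small made-up label frame:

```python
import pandas as pd

labels = pd.DataFrame({"user_id": ["u1", "u1", "u2"], "bmi": [22.0, 23.0, 30.0]})
users = pd.DataFrame({"user_id": ["u1", "u2"]})

# Without validation, the duplicated u1 label silently fans the join out
# to three rows instead of two.
fanned = users.merge(labels, on="user_id")
assert len(fanned) == 3

# With validate, pandas raises MergeError (a ValueError subclass) instead
# of silently duplicating rows.
try:
    users.merge(labels, on="user_id", validate="one_to_one")
except pd.errors.MergeError:
    pass
```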


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 01f6201b-68e9-45fe-8966-dbd79195082e

📥 Commits

Reviewing files that changed from the base of the PR and between b366917 and 1c5aa45.

📒 Files selected for processing (20)
  • .gitignore
  • README.md
  • evaluation/README.md
  • evaluation/__init__.py
  • evaluation/evaluator.py
  • evaluation/generate_captions_placeholder.py
  • evaluation/io/__init__.py
  • evaluation/io/writer.py
  • evaluation/metrics/__init__.py
  • evaluation/metrics/classification.py
  • evaluation/run_eval.py
  • evaluation/tasks/__init__.py
  • evaluation/tasks/activity.py
  • evaluation/tasks/base.py
  • evaluation/tasks/metabolic.py
  • notebooks/01_downstream_label_exploration.ipynb
  • time_series_datasets/mhc_activity_dataset.py
  • time_series_datasets/mhc_base_qa_dataset.py
  • time_series_datasets/mhc_label_lookup.py
  • time_series_datasets/mhc_metabolic_dataset.py


@max-rosenblattl max-rosenblattl left a comment


Thanks for the work @milanagm 🚀 I know it's only a draft, but when you want to merge it, please reduce the comments to a minimum and make them concise. Also, the activity task should be in its own PR.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
evaluation/run_eval.py (1)

205-220: ⚠️ Potential issue | 🟠 Major

Fail fast on incompatible --dataset / --task pairs.

Right now mismatched pairs can run and emit meaningless metrics. Add a guard before dataset/model construction. This was already flagged in prior review and is still unresolved.

💡 Proposed fix
+def _validate_dataset_task_pair(dataset: str, task: str) -> None:
+    # e.g. mhc_metabolic_bmi -> metabolic_bmi
+    expected_task = dataset.removeprefix("mhc_")
+    if task != expected_task:
+        raise ValueError(
+            f"Incompatible selection: dataset '{dataset}' requires task '{expected_task}', got '{task}'."
+        )
+
 def main() -> None:
     args = _parse_args()
+    try:
+        _validate_dataset_task_pair(args.dataset, args.task)
+    except ValueError as exc:
+        print(str(exc))
+        sys.exit(1)

     print(f"Dataset  : {args.dataset} ({args.split})")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/run_eval.py` around lines 205 - 220, Add a fast-fail validation
before building dataset or model in main(): implement and call a small checker
(e.g., _validate_task_dataset_pair(args) or inline logic) right after
_parse_args() that verifies the provided args.task is compatible with
args.dataset/split (use existing TASK_REGISTRY[args.task] or dataset metadata to
test compatibility), and if incompatible print a clear error and sys.exit(1);
ensure this guard runs before calling _build_dataset(), _build_model(), or
instantiating Evaluator so mismatched pairs never proceed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluation/run_eval.py`:
- Around line 138-143: The CLI currently allows negative values for
"--max-samples" and "--max-users"; add validation to ensure they are positive
integers. Implement a small validator (e.g., positive_int) and use it as the
argument type or, after parsing, check args.max_samples and args.max_users: if a
value is not None and is <= 0 call parser.error (or raise a clear argparse
error) to reject invalid inputs. Reference parser.add_argument for
"--max-samples" and "--max-users" and the parsed variables args.max_samples /
args.max_users when adding the validation.

In `@time_series_datasets/mhc_metabolic_dataset.py`:
- Line 171: The return statement in mhc_metabolic_dataset.py uses
zip(df["user_id"], df["_label_str"]) which triggers Ruff B905; change it to
zip(df["user_id"], df["_label_str"], strict=True) to enforce equal-length
iterables and satisfy the linter—update the return in the function that
currently returns dict(zip(...)) so it returns dict(zip(df["user_id"],
df["_label_str"], strict=True)).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e7f62e39-2a9c-4b0c-bc1a-60c0f9c203e1

📥 Commits

Reviewing files that changed from the base of the PR and between 1c5aa45 and 4ae7876.

📒 Files selected for processing (3)
  • evaluation/README.md
  • evaluation/run_eval.py
  • time_series_datasets/mhc_metabolic_dataset.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • evaluation/README.md


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
evaluation/tasks/metabolic.py (2)

216-221: ⚠️ Potential issue | 🟠 Major

Validate list lengths before pairing ground truths and predictions.

Line 219 uses zip(...) without a guard, so mismatched lengths are silently truncated and metrics/parse-rate become inconsistent.

Proposed fix
     def aggregate_metrics(
         self,
         ground_truths: list[str],
         predictions: list[str],
     ) -> dict[str, float | int]:
+        if len(ground_truths) != len(predictions):
+            raise ValueError(
+                f"Length mismatch: {len(ground_truths)} ground truths vs "
+                f"{len(predictions)} predictions."
+            )
+
         n_unknown = sum(1 for p in predictions if p == "unknown")
         filtered = [
             (gt, pred)
-            for gt, pred in zip(ground_truths, predictions)
+            for gt, pred in zip(ground_truths, predictions, strict=True)
             if pred != "unknown"
         ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/tasks/metabolic.py` around lines 216 - 221, Before zipping
ground_truths and predictions to build filtered and compute metrics, validate
that len(ground_truths) == len(predictions); if they differ, raise a clear error
(or log and abort) that includes both lengths and any context (e.g., task id) so
the mismatch isn't silently truncated. Replace the direct zip usage in the block
that defines n_unknown and filtered with a guard that checks lengths first, then
proceed to compute n_unknown and build filtered only when lengths match.

223-229: ⚠️ Potential issue | 🟠 Major

Return a stable metric schema when all predictions are unknown.

The Line 223 branch omits *_se, n_positive_gt, and n_positive_pred, so the output shape differs from normal binary_metrics output.

Proposed fix
         if not filtered:
             return {
-                "f1": 0.0, "precision": 0.0, "recall": 0.0, "accuracy": 0.0,
+                "f1": 0.0, "f1_se": 0.0,
+                "precision": 0.0, "precision_se": 0.0,
+                "recall": 0.0, "recall_se": 0.0,
+                "accuracy": 0.0, "accuracy_se": 0.0,
                 "n_samples": len(ground_truths),
+                "n_positive_gt": 0,
+                "n_positive_pred": 0,
                 "n_unknown": n_unknown,
                 "parse_rate": 0.0,
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/tasks/metabolic.py` around lines 223 - 229, The early-return in
evaluation/tasks/metabolic.py (inside the metrics computation, e.g., the
binary_metrics branch) returns a subset of keys when all predictions are
unknown; update that branch to return the full stable metric schema used
elsewhere by including the missing keys: f1_se, precision_se, recall_se,
accuracy_se (set to 0.0), and n_positive_gt and n_positive_pred (set to 0),
while keeping existing keys n_samples, n_unknown, parse_rate, f1, precision,
recall, accuracy; ensure key names exactly match the normal output schema so
callers always receive a consistent shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluation/io/writer.py`:
- Around line 47-48: Sanitize both path components before joining: add/used a
sanitizer (e.g., sanitize_path_component or sanitize_filename) and apply it to
result.task_name and result.model_name (currently assigned to safe_model) so you
replace or remove unsafe sequences like "/", "\\", and ".." and any other
characters invalid for filenames; then compute out_dir using self.results_dir /
sanitized_task_name / sanitized_model_name / sanitized_timestamp (ensure
timestamp is sanitized if generated elsewhere). Update the assignments where
safe_model and out_dir are set so they use the sanitizer and keep references to
safe_model, out_dir, self.results_dir, result.task_name and result.model_name to
locate the change.
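A minimal sanitizer along the lines the comment suggests — the function name and allowed character set are assumptions, not the project's actual helper:

```python
import re
from pathlib import Path


def sanitize_path_component(name: str) -> str:
    """Collapse '..' sequences, path separators, and other unsafe
    characters into underscores so a single directory name remains."""
    cleaned = name.replace("..", "_").replace("/", "_").replace("\\", "_")
    # Keep only a conservative filename alphabet.
    return re.sub(r"[^A-Za-z0-9._-]", "_", cleaned) or "_"


results_dir = Path("results")
out_dir = (
    results_dir
    / sanitize_path_component("metabolic")
    / sanitize_path_component("../../etc/passwd")  # hostile model name
)
print(out_dir)  # stays inside results/
```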

---

Duplicate comments:
In `@evaluation/tasks/metabolic.py`:
- Around line 216-221: Before zipping ground_truths and predictions to build
filtered and compute metrics, validate that len(ground_truths) ==
len(predictions); if they differ, raise a clear error (or log and abort) that
includes both lengths and any context (e.g., task id) so the mismatch isn't
silently truncated. Replace the direct zip usage in the block that defines
n_unknown and filtered with a guard that checks lengths first, then proceed to
compute n_unknown and build filtered only when lengths match.
- Around line 223-229: The early-return in evaluation/tasks/metabolic.py (inside
the metrics computation, e.g., the binary_metrics branch) returns a subset of
keys when all predictions are unknown; update that branch to return the full
stable metric schema used elsewhere by including the missing keys: f1_se,
precision_se, recall_se, accuracy_se (set to 0.0), and n_positive_gt and
n_positive_pred (set to 0), while keeping existing keys n_samples, n_unknown,
parse_rate, f1, precision, recall, accuracy; ensure key names exactly match the
normal output schema so callers always receive a consistent shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 20bede1c-9a2d-4b38-ac79-1902ca74c362

📥 Commits

Reviewing files that changed from the base of the PR and between 4ae7876 and a002894.

📒 Files selected for processing (5)
  • .gitignore
  • evaluation/io/writer.py
  • evaluation/metrics/classification.py
  • evaluation/tasks/metabolic.py
  • mhc/transformer.py
✅ Files skipped from review due to trivial changes (2)
  • mhc/transformer.py
  • .gitignore

Comment thread evaluation/io/writer.py Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
evaluation/metrics/classification.py (1)

117-117: Remove redundant int(len(...)) casts.

These casts are unnecessary since len(...) already returns int; dropping them improves clarity and resolves the Ruff warnings.

Proposed cleanup
-        "n_samples":     int(len(ground_truths)),
+        "n_samples":     len(ground_truths),
@@
-        "n_samples":     int(len(ground_truths)),
-        "n_classes":     int(len(label_set)),
+        "n_samples":     len(ground_truths),
+        "n_classes":     len(label_set),
@@
-        "n_samples": int(len(ground_truths)),
-        "n_classes": int(len(labels)),
+        "n_samples": len(ground_truths),
+        "n_classes": len(labels),

Also applies to: 146-147, 193-194

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/metrics/classification.py` at line 117, Remove the redundant
int(...) casts around len(...) in the result dictionaries: replace
int(len(ground_truths)) (and the other occurrences using int(len(...))) with
just len(...) so the "n_samples" and similar fields use len(...) directly;
locate these in evaluation/metrics/classification.py where result dicts are
assembled (e.g., the "n_samples" key and the other two similar entries) and
update them accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluation/metrics/classification.py`:
- Line 131: The code currently treats labels as falsy, so replace the expression
that builds label_set with an explicit None check: if labels is None then
compute label_set = sorted({*ground_truths, *predictions}) else use the provided
labels as-is (preserving an explicit empty list). Update any downstream
computations (e.g., n_classes) to derive their values from this label_set.
Ensure validation still runs on the provided labels (uniqueness/consistency with
ground_truths/predictions) rather than silently falling back when labels == [].
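The distinction the comment draws — checking `labels is None` rather than truthiness — can be shown in isolation. The function is a sketch; only the behavior described in the comment is assumed:

```python
def resolve_label_set(ground_truths, predictions, labels=None):
    """Derive the label set only when the caller passed nothing;
    an explicit empty list is preserved rather than overridden."""
    if labels is None:
        return sorted({*ground_truths, *predictions})
    return labels


# None -> derived from the data.
assert resolve_label_set(["a", "b"], ["b", "c"]) == ["a", "b", "c"]
# Explicit empty list -> kept as-is; a `labels or ...` fallback would
# silently replace it with the derived set.
assert resolve_label_set(["a"], ["a"], labels=[]) == []
```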

---

Nitpick comments:
In `@evaluation/metrics/classification.py`:
- Line 117: Remove the redundant int(...) casts around len(...) in the result
dictionaries: replace int(len(ground_truths)) (and the other occurrences using
int(len(...))) with just len(...) so the "n_samples" and similar fields use
len(...) directly; locate these in evaluation/metrics/classification.py where
result dicts are assembled (e.g., the "n_samples" key and the other two similar
entries) and update them accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 79bd7bf3-e2c8-4af1-9892-deb18203f7a5

📥 Commits

Reviewing files that changed from the base of the PR and between a002894 and 88f1301.

📒 Files selected for processing (1)
  • evaluation/metrics/classification.py

Comment thread evaluation/metrics/classification.py Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@requirements.txt`:
- Line 14: The requirements entry pins opentslm to the mutable branch "main",
making installs non-deterministic; update the line in requirements.txt
referencing "opentslm @ git+https://github.com/StanfordBDHG/OpenTSLM.git@main"
to use an immutable ref (a commit SHA such as
104013b90b89fb39c1537c9190b3d047c3072fb5 or a release tag) so installs are
reproducible, then re-run any dependency install/lock steps (e.g., pip-compile
or CI dependency caching) to ensure the change is applied across environments.
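Concretely, the requirements.txt line would change from the mutable branch to the SHA the comment cites:

```
# before (mutable — resolves differently over time)
opentslm @ git+https://github.com/StanfordBDHG/OpenTSLM.git@main

# after (immutable — reproducible installs)
opentslm @ git+https://github.com/StanfordBDHG/OpenTSLM.git@104013b90b89fb39c1537c9190b3d047c3072fb5
```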

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2bc558c7-b16c-4187-aa38-a3ebf5fb569c

📥 Commits

Reviewing files that changed from the base of the PR and between 88f1301 and 349ab92.

📒 Files selected for processing (2)
  • .gitignore
  • requirements.txt
✅ Files skipped from review due to trivial changes (1)
  • .gitignore

Comment thread requirements.txt Outdated
milanagm and others added 18 commits April 22, 2026 10:22
# Add cross-channel captions for all binary daily channels

## ♻️ Current Situation & Problem
The daily cross-channel caption pipeline only covered a subset of binary
channels: in-bed-awake sleep periods, selected stationary workouts, and
running/cycling cardio workouts.

This left several binary channels without cross-channel captions,
including direct asleep/in-bed windows, walking, yoga, elliptical, mixed
cardio, and other workouts. These channels should also be described with
relevant context from continuous signals such as heart rate, distance,
steps, and active energy.

## ⚙️ What Changed
- Added focused cross-channel synthesizers by activity family:
  - `WalkingSynthesizer` in `synthesizers/locomotion.py`
  - `MindBodyActivitySynthesizer` in `synthesizers/mind_body.py`
  - `OtherActivitySynthesizer` in `synthesizers/other_activity.py`
  - `EnduranceActivitySynthesizer` in `synthesizers/cardio.py`
- Extended `SleepSynthesizer` to emit captions for direct `sleep:asleep`
and `sleep:inbed` windows, while preserving the existing
`in_bed_not_sleeping` behavior.
- Added active-energy metric support through the shared cross-channel
helper pattern.
- Updated existing cardio and stationary workout captions to include
active-energy context where available.
- Added new cross-channel templates for sleep stages, walking, yoga,
endurance activities, generic workouts, active energy, and stationary
sleep summaries.
- Added explorer hit-target support for the new cross-channel labels so
the new captions are discoverable through the hit navigation.
- Kept `CrossChannelExtractor` unchanged; it still collects captions
from the configured synthesizer list.

## 📝 Release Notes
- Added cross-channel captions for previously uncovered daily binary
workout channels:
  - Walking
  - Yoga
  - Elliptical
  - Mixed cardio
  - Other workout
- Added direct sleep captions for `sleep:asleep` and `sleep:inbed`.
- Preserved overlapping sleep captions, so `sleep:inbed` and
`sleep:asleep` can both produce captions for the same window.
- Improved sleep movement summaries so zero-distance sleep periods are
described as stationary rather than listing zero steps/distance.
- Improved template wording for more natural activity captions.
- No public API or extractor interface changes.

## 📚 Documentation
This change is implemented through focused synthesizer additions and
template updates. `CrossChannelExtractor` remains unchanged and
continues to collect captions from configured synthesizers.

New and updated templates in `templates/templates.json` provide more
natural phrasing for sleep, yoga, walking, elliptical, mixed cardio, and
other workout captions.

## ✅ Testing
Added synthetic unit tests covering:
- Cross-channel caption coverage for every daily binary channel.
- Overlapping `sleep:inbed` and `sleep:asleep` windows producing
separate captions.
- Yoga captions including heart-rate and active-energy context without
requiring movement.
- Walking captions including steps and distance.
- Missing helper channels not crashing caption generation.
- Explorer hit targets covering emitted cross-channel labels.
- Sleep zero-distance captions using a stationary summary rather than
explicit zero steps/distance wording.

### Code of Conduct & Contributing Guidelines
By creating and submitting this pull request, you agree to follow our
[Code of
Conduct](https://github.com/StanfordBDHG/.github/blob/main/CODE_OF_CONDUCT.md)
and [Contributing
Guidelines](https://github.com/StanfordBDHG/.github/blob/main/CONTRIBUTING.md):
- [x] I agree to follow the [Code of
Conduct](https://github.com/StanfordBDHG/.github/blob/main/CODE_OF_CONDUCT.md)
and [Contributing
Guidelines](https://github.com/StanfordBDHG/.github/blob/main/CONTRIBUTING.md).

---------

Co-authored-by: Max Rosenblattl <max.rosenblattl@tum.de>
Resolves shard-glob conflict in mhc_base_qa_dataset (kept rglob form,
functionally equivalent). Adopts main's NaN→0 fix in _binary_prompt
for OpenTSLM compatibility. Pulls in PR #50 (curriculum learning
training pipeline + multi-label datasets).
Comment thread time_series_datasets/mhc_label_lookup.py
Comment thread scripts/smoke_test_vlm_local.sh Outdated
Comment thread models/base.py
@milanagm milanagm merged commit 5c959c5 into main Apr 30, 2026
3 checks passed
@milanagm milanagm deleted the feature/evaluation-pipeline branch April 30, 2026 00:47
milanagm added a commit that referenced this pull request Apr 30, 2026
Switch metabolic training targets from ints to canonical strings
  ♻️ Current situation & Problem
Depends on / sits on top of the eval pipeline PR (#55,
feature/evaluation-pipeline). Surface follow-up to PR #50, which
introduced the curriculum-learning training stack with int-encoded
answer targets ("0", "1", "2", …).
Two related issues motivating this PR:
1. OpenTSLM convention divergence — the upstream OpenTSLM reference
datasets (HARAcc, Sleep, ECG) all train on canonical class strings
("biking", "Wake", "yes", …). Our metabolic_finetune.py and
mhc_multi_label_qa_dataset.py deviated by using str(int(value)) as the
training target.
   
2. Train ↔ eval prompt drift — eval (MHCMetabolicQADataset from PR #55)
frames the task as "Is this person biologically male or female? …
Answer: <label>", while training framed it as the terse "Based on the
sensor data, predict the value of BiologicalSex (binary).".
  ⚙️ Release Notes
  - Training answer format switched to canonical strings:        
    - Binary: "Male" / "Female" / "True" / "False"
    - Ordinal: "Underweight" / "Normal weight" / "Overweight" / …
    - Continuous: numeric string (unchanged)          
- Wrapped as "Answer: <value>" so the loss target matches the prompt's
Answer: <label> instruction.
- New shared helper build_metabolic_post_prompt(label) in
time_series_datasets/mhc_label_lookup.py. Both MHCMetabolicQADataset
(eval) and MHCMultiLabelQADataset (training) call it — train and eval
now emit byte-identical post_prompts. Lives next to
METABOLIC_LABEL_CONFIG it formats.
- LabelLookup API change: _decode_labels → decode_labels (was
effectively public; renamed to match its actual usage).
- scripts/metabolic_finetune.py eval logic removed. The single-token
logit-scoring path (_candidate_token_ids,
_logits_over_candidates, _answer_prefix_embed) was incompatible with
multi-token canonical strings. Downstream metrics now flow
through evaluation/run_eval.py (PR #55) instead. --eval_only flag
removed; net −189 LOC.
- Migration: existing int-trained checkpoints will not parse correctly
under the new prompt format. Re-train from a stage-1 captioning
checkpoint:
python curriculum_learning.py \
    --model OpenTSLMSP \
    --stages stage_joint_caption_metabolic \
    --batch_size 2 \
    --gradient_checkpointing
- (Or run stage_metabolic alone if a captioning checkpoint is already on
disk.)
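A sketch of what the shared helper could look like, given the "Answer: <value>" convention described above — the body is an assumption; only the name and the byte-identical train/eval contract come from the PR:

```python
def build_metabolic_post_prompt(label: str) -> str:
    """Format a canonical label as the shared train/eval answer target.

    Both the eval dataset (MHCMetabolicQADataset) and the training
    dataset (MHCMultiLabelQADataset) call this helper, so the
    post_prompts they emit are byte-identical.
    """
    return f"Answer: {label}"


assert build_metabolic_post_prompt("Female") == "Answer: Female"
assert build_metabolic_post_prompt("Overweight") == "Answer: Overweight"
```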