Conversation
Introduces a modular evaluation pipeline for benchmarking LLMs on MHC wearable sensor data. All components are task-agnostic and extensible via registries in run_eval.py.
- evaluation/: unified entry point (run_eval.py), Evaluator, EvalTask ABC, classification metrics with bootstrap SE, ResultsWriter; stubs for sleep and caption tasks
- time_series_datasets/mhc_activity_dataset.py: 60-min windowed samples with ground truth from Apple Watch workout:* binary channels — no manual annotation required
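For orientation, here is a minimal usage sketch of the registry-based wiring described above. TASK_REGISTRY and EvalTask are names referenced elsewhere in this PR; the task key and the exact import path are assumptions.

```python
# Hypothetical sketch only; the real wiring in evaluation/run_eval.py may differ.
from evaluation.run_eval import TASK_REGISTRY   # assumed: maps task name -> EvalTask subclass

task = TASK_REGISTRY["metabolic_bmi"]()          # pick a registered task by name
# Adding a new benchmark means implementing another EvalTask subclass
# (build_prompt / parse_prediction / aggregate_metrics) and registering it under a new key.
```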
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough
Adds a new evaluation framework (orchestrator, tasks, metrics, I/O, CLI), a placeholder caption generator, dataset label-lookup/split changes, and minor transformer and packaging/ignore updates. All changes are factual and introduce new modules and APIs for end-to-end model evaluation.
Sequence Diagram(s)
sequenceDiagram
participant CLI as CLI/run_eval
participant Eval as Evaluator
participant Task as EvalTask
participant Model as BaseModel
participant Writer as ResultsWriter
CLI->>Eval: run(task, model, dataset, max_samples)
loop per sample
Eval->>Task: build_prompt(sample)
Task-->>Eval: prompt
Eval->>Model: process(prompt, channels, annotations)
Model-->>Eval: ModelResponse
Eval->>Task: parse_prediction(response.text)
Task-->>Eval: parsed_prediction
Eval-->>Eval: collect SampleResult
end
Eval->>Task: aggregate_metrics(gts, preds)
Task-->>Eval: metrics
Eval-->>CLI: EvalResult
CLI->>Writer: write(EvalResult)
Writer-->>CLI: output directory path
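In code, the loop above reduces to roughly the following; this is a simplified sketch rather than the actual Evaluator.run, and the attribute names on the sample object are assumptions.

```python
# Simplified sketch of the sequence diagram; the real Evaluator also records per-sample
# errors, token counts, and typed SampleResult/EvalResult objects.
def run(task, model, dataset, max_samples=None):
    ground_truths, predictions = [], []
    for i, sample in enumerate(dataset):
        if max_samples is not None and i >= max_samples:
            break
        prompt = task.build_prompt(sample)                         # Eval -> Task
        channels = list(sample.recording.iter_channels())          # assumed accessor
        response = model.process(prompt, channels, sample.recording.annotations)
        predictions.append(task.parse_prediction(response.text))   # Task parses raw text
        ground_truths.append(sample.ground_truth)                  # assumed attribute
    return task.aggregate_metrics(ground_truths, predictions)      # metrics dict
```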
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ❌ 3 failed checks (1 warning, 2 inconclusive)
Actionable comments posted: 9
🧹 Nitpick comments (5)
evaluation/generate_captions_placeholder.py (2)
37-37: Consider using relative imports or package installation instead of sys.path manipulation.
sys.path.insert(0, ...) is fragile and can cause import conflicts. Since this is a script intended for standalone execution, consider either:
- Running via python -m evaluation.generate_captions_placeholder from the repo root
- Adding an editable install (pip install -e .) to make the package importable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/generate_captions_placeholder.py` at line 37, Replace the sys.path manipulation in generate_captions_placeholder.py (the call to sys.path.insert(0, str(Path(__file__).resolve().parents[1])) that alters imports at runtime) with a proper package import strategy: either update usage to run the module with python -m evaluation.generate_captions_placeholder from the repository root or remove the sys.path change and rely on an editable install (pip install -e .) so imports resolve normally; make sure to remove the sys.path.insert line and any dependency on Path(__file__).resolve().parents[1] and document the preferred run/install method in the script header or README.
42-44: Consider accepting paths via CLI arguments for flexibility. Hardcoded paths work for the documented workflow but limit reusability. An optional argparse interface would allow testing with different datasets or output locations.
♻️ Optional: Add CLI arguments
+import argparse
+
+def parse_args():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--weekly-path", type=Path, default=Path("data/mhc_dev_subset_weekly_hf"))
+    parser.add_argument("--output-dir", type=Path, default=Path("data/mhc_dev_subset_captions"))
+    return parser.parse_args()
+
-WEEKLY_PATH = Path("data/mhc_dev_subset_weekly_hf")
-OUTPUT_DIR = Path("data/mhc_dev_subset_captions")
-OUTPUT_FILE = OUTPUT_DIR / "captions.parquet"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/generate_captions_placeholder.py` around lines 42 - 44, Add an argparse interface to make WEEKLY_PATH, OUTPUT_DIR, and OUTPUT_FILE configurable: parse optional CLI arguments (e.g., --weekly-path, --output-dir, --output-file) that default to the existing WEEKLY_PATH, OUTPUT_DIR, and OUTPUT_FILE constants, convert parsed strings to Path objects, and ensure the output directory exists before writing; update any main or entrypoint logic in generate_captions_placeholder.py to use the parsed values instead of the hardcoded constants.
evaluation/tasks/base.py (1)
48-48: Consider enforcing name as an abstract property. The name attribute is required by subclasses but isn't enforced at the ABC level. Subclasses could accidentally omit it, leading to a runtime AttributeError when the evaluator accesses task.name.
♻️ Optional: Enforce via abstract property
+    @property
+    @abc.abstractmethod
+    def name(self) -> str:
+        """Task identifier used for result file paths."""
+        ...
-    name: str  # set by each concrete subclass, used for result file paths
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/tasks/base.py` at line 48, Make the name attribute an abstract property on the base task class so subclasses are forced to define it: replace the plain "name: str" declaration with a `@property` decorated abstract method (using `@abstractmethod`) on the base class (e.g., def name(self) -> str) so code that accesses task.name will be guaranteed to exist; update imports to include ABC/abstractmethod if needed and ensure existing subclasses implement the name property returning the required string.
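If the abstract-property route is taken, a concrete task would satisfy it along these lines (the subclass name here is illustrative, not a class from the PR):

```python
# Illustrative subclass only; the real tasks live in evaluation/tasks/.
class WorkoutDetectionTask(EvalTask):
    @property
    def name(self) -> str:
        return "workout_detection"   # used by the ResultsWriter for result file paths
```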
time_series_datasets/mhc_activity_dataset.py (1)
107-118: Use actual recording length instead of a fixed 1440 for window centering. Hard-coding max_len=1440 couples this helper to one data shape and risks invalid windows if sequence length changes.
Proposed fix
 def _center_window(
     event_start: int,
     event_end: int,
     window_size: int,
-    max_len: int = 1440,
+    max_len: int,
 ) -> tuple[int, int]:
@@
         for event_start, event_end in _find_workout_windows(
             data[ch_idx], min_workout_duration
         ):
             w_start, w_end = _center_window(
-                event_start, event_end, self._window_size
+                event_start, event_end, self._window_size, max_len=data.shape[1]
             )
Also applies to: 318-320
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@time_series_datasets/mhc_activity_dataset.py` around lines 107 - 118, The helper _center_window currently hardcodes max_len=1440 which can produce invalid windows for other sequence lengths; change the signature of _center_window to accept the actual recording length (e.g., max_len: int with no default or default None and require caller to pass length) and update all call sites (including the other usage around the same module) to pass the true sequence length from the dataset (e.g., the per-sample recording length variable) so window start/end are computed against the real recording length rather than 1440.
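For reference, a sketch of what the parameterized helper could look like, assuming minute-resolution indices and max_len >= window_size; the actual implementation in mhc_activity_dataset.py may differ.

```python
def _center_window(event_start: int, event_end: int, window_size: int, max_len: int) -> tuple[int, int]:
    """Center a fixed-size window on a workout event, clamped to the recording length."""
    center = (event_start + event_end) // 2
    start = max(0, min(center - window_size // 2, max_len - window_size))
    return start, start + window_size
```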
time_series_datasets/mhc_metabolic_dataset.py (1)
108-113: Normalize caption lookup keys to avoid silent misses. Line 108 and lines 151-152 build keys from week_start via implicit string conversion from different sources. If types differ, captions won't attach and evaluation silently runs without them.
Proposed fix
 class MHCMetabolicDataset(Dataset):
+    @staticmethod
+    def _caption_key(user_id: object, week_start: object) -> str:
+        wk = pd.Timestamp(week_start).date().isoformat()
+        return f"{user_id}:{wk}"
+
     def __getitem__(self, idx: int) -> EvalSample:
         row, ground_truth = self._samples[idx]
         recording = self._transformer.transform_row(row)
         # Attach caption as annotation if available.
-        caption_key = f"{row['user_id']}:{row['week_start']}"
+        caption_key = self._caption_key(row["user_id"], row["week_start"])
         if caption_key in self._captions:
             recording.annotations.append(Annotation(
                 caption_type="weekly_caption",
                 text=self._captions[caption_key],
             ))
@@
         df = pd.read_parquet(path, columns=["user_id", "week_start", "caption"])
         return {
-            f"{row.user_id}:{row.week_start}": row.caption
+            self._caption_key(row.user_id, row.week_start): row.caption
             for row in df.itertuples(index=False)
         }
Also applies to: 149-153
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@time_series_datasets/mhc_metabolic_dataset.py` around lines 108 - 113, The caption lookup can silently miss because keys are built from row['week_start'] with inconsistent types; normalize the key building in the caption_key creation and the related key generation (the spots that reference row['week_start'] and where captions are added to self._captions) by converting week_start to a canonical string (e.g., str() or .isoformat() if it's a date/datetime) before composing caption_key and before storing/looking up in self._captions so recording.annotations.append(Annotation(...)) always finds matching captions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluation/evaluator.py`:
- Around line 83-99: The prompt/signal preparation (task.build_prompt,
sample.recording.iter_channels, and annotations extraction) must be included
inside the per-sample try/except so pre-process failures don't abort the run;
wrap the lines that compute prompt, signals, and annotations together with the
call to model.process and task.parse_prediction in the same try block (referring
to prompt, signals, annotations, model.process, and task.parse_prediction), and
in the except ensure you set response = ModelResponse(text="", input_tokens=0,
output_tokens=0), parsed = "unknown", increment n_errors, and log the
sample.sample_id as before.
In `@evaluation/io/writer.py`:
- Around line 44-49: The current out_dir creation uses a second-level timestamp
and exist_ok=True which allows collisions; modify the directory naming in the
block that computes timestamp/safe_model/out_dir (using variables timestamp,
safe_model, out_dir, and self.results_dir with result.task_name and
result.model_name) to guarantee uniqueness per run — for example, use
datetime.now().strftime with microseconds (e.g. "%Y%m%d_%H%M%S_%f") or append a
short UUID/nonce to the folder name, then create the directory without silently
allowing existing directories (remove or avoid relying on exist_ok=True for
collisions) so each run gets a unique folder and cannot overwrite
metrics.json/predictions.json.
In `@evaluation/metrics/classification.py`:
- Around line 65-68: The loop currently swallows all exceptions from metric_fn,
hiding real bugs; change the broad "except Exception" to only catch expected
metric-evaluation errors (e.g., except (ValueError, ZeroDivisionError) as e) and
set scores[i] = np.nan in that branch (optionally logging the error), while
letting any other exceptions from metric_fn propagate so they surface during
tests; update the try/except surrounding the call to metric_fn(y_true[idx],
y_pred[idx]) (references: metric_fn, scores, y_true, y_pred, idx) accordingly.
In `@evaluation/run_eval.py`:
- Around line 245-260: After building the dataset in main() (after dataset =
_build_dataset(args)), validate that the chosen args.task is compatible with the
dataset type/name (args.dataset or samples in dataset) by adding a compatibility
check using TASK_REGISTRY[args.task] or task = TASK_REGISTRY[args.task]() and
inspecting dataset metadata or name; if incompatible, print a clear error
mentioning the dataset and task and exit (sys.exit(1)). Implement the check as a
small helper or inline guard before calling _build_model() / creating
Evaluator(), referencing main(), _build_dataset(), TASK_REGISTRY, args.dataset
and args.task so mismatched pairs (e.g., mhc_activity vs metabolic_* tasks) fail
fast with a descriptive message.
In `@evaluation/tasks/activity.py`:
- Around line 91-103: When filtered is empty, return the same metric keys as the
normal path so downstream code doesn't break: populate f1, precision, recall,
accuracy and their standard-error counterparts (f1_se, precision_se, recall_se,
accuracy_se) with zeros (floats), include n_samples (len(ground_truths)),
n_unknown, n_positive_gt and n_positive_pred as zeros (ints), and parse_rate as
0.0; this mirrors the shape produced after calling binary_metrics(gts, preds)
and keeps keys consistent with the metrics produced later in the function (see
variables filtered, ground_truths, n_unknown and function binary_metrics).
In `@evaluation/tasks/metabolic.py`:
- Around line 224-230: The all-unknown branch in evaluate (where filtered is
empty) returns a metrics dict missing *_se and positive-count keys, causing a
different schema than binary_metrics; update that return to include the same
keys as binary_metrics (e.g., "f1_se", "precision_se", "recall_se",
"accuracy_se", and any positive-count keys like "n_positive" or
"n_positive_pred") set to 0 or 0.0 as appropriate, and ensure parse_rate and
counts (n_samples, n_unknown) remain present so the shape matches binary_metrics
and downstream aggregation won't break; locate the branch around the filtered
check in evaluation/tasks/metabolic.py and mirror the full metric keys produced
by binary_metrics.
- Around line 217-221: The code computes n_unknown and builds filtered pairs
using zip(ground_truths, predictions) without checking that ground_truths and
predictions have the same length, which can silently truncate pairs and make
metrics (parse_rate, n_samples) inconsistent; update the logic (around variables
n_unknown, filtered and where parse_rate/n_samples are computed) to explicitly
guard lengths by verifying len(ground_truths) == len(predictions) and either
raise a clear ValueError when they differ or normalize the inputs (e.g.,
trim/align predictions with a logged warning) before constructing filtered, so
denominators and paired metrics remain consistent.
In `@time_series_datasets/mhc_activity_dataset.py`:
- Around line 336-342: The negative-sample creation is setting
workout_channel="none" on _SampleIndex (constructed with
dataset_idx/window_start/window_end/ground_truth), which later gets turned into
a prompt label; change negative samples to set workout_channel=None (or an
explicit sentinel like "" or "no_activity" that your prompt-generator expects)
instead of the string "none", and update any downstream prompt label generation
code that reads workout_channel to treat None/""/ "no_activity" as a
negative/no-target case (i.e., skip inserting a target label or produce the
proper "no activity" phrasing) so malformed questions are not emitted; adjust
both the instance at the shown _SampleIndex creation and the analogous creation
at the other occurrence (the one around the other negative sample block).
In `@time_series_datasets/mhc_label_lookup.py`:
- Around line 176-184: The merge in join_onlabels can silently duplicate rows if
self.df contains multiple label rows per user_id; update the code in
join_onlabels to enforce one label row per user by either adding
validate="one_to_one" to the pd.DataFrame.merge call or by explicitly checking
self.df["user_id"].duplicated() (and raising a ValueError) before merging;
reference the merge invocation that uses self.df[["user_id"] + label_cols] and
the parameters user_ids and label_cols when making this change so duplicates are
detected and handled rather than silently duplicating rows.
---
Nitpick comments:
In `@evaluation/generate_captions_placeholder.py`:
- Line 37: Replace the sys.path manipulation in generate_captions_placeholder.py
(the call to sys.path.insert(0, str(Path(__file__).resolve().parents[1])) that
alters imports at runtime) with a proper package import strategy: either update
usage to run the module with python -m evaluation.generate_captions_placeholder
from the repository root or remove the sys.path change and rely on an editable
install (pip install -e .) so imports resolve normally; make sure to remove the
sys.path.insert line and any dependency on Path(__file__).resolve().parents[1]
and document the preferred run/install method in the script header or README.
- Around line 42-44: Add an argparse interface to make WEEKLY_PATH, OUTPUT_DIR,
and OUTPUT_FILE configurable: parse optional CLI arguments (e.g., --weekly-path,
--output-dir, --output-file) that default to the existing WEEKLY_PATH,
OUTPUT_DIR, and OUTPUT_FILE constants, convert parsed strings to Path objects,
and ensure the output directory exists before writing; update any main or
entrypoint logic in generate_captions_placeholder.py to use the parsed values
instead of the hardcoded constants.
In `@evaluation/tasks/base.py`:
- Line 48: Make the name attribute an abstract property on the base task class
so subclasses are forced to define it: replace the plain "name: str" declaration
with a `@property` decorated abstract method (using `@abstractmethod`) on the base
class (e.g., def name(self) -> str) so code that accesses task.name will be
guaranteed to exist; update imports to include ABC/abstractmethod if needed and
ensure existing subclasses implement the name property returning the required
string.
In `@time_series_datasets/mhc_activity_dataset.py`:
- Around line 107-118: The helper _center_window currently hardcodes
max_len=1440 which can produce invalid windows for other sequence lengths;
change the signature of _center_window to accept the actual recording length
(e.g., max_len: int with no default or default None and require caller to pass
length) and update all call sites (including the other usage around the same
module) to pass the true sequence length from the dataset (e.g., the per-sample
recording length variable) so window start/end are computed against the real
recording length rather than 1440.
In `@time_series_datasets/mhc_metabolic_dataset.py`:
- Around line 108-113: The caption lookup can silently miss because keys are
built from row['week_start'] with inconsistent types; normalize the key building
in the caption_key creation and the related key generation (the spots that
reference row['week_start'] and where captions are added to self._captions) by
converting week_start to a canonical string (e.g., str() or .isoformat() if it's
a date/datetime) before composing caption_key and before storing/looking up in
self._captions so recording.annotations.append(Annotation(...)) always finds
matching captions.
📒 Files selected for processing (20)
- .gitignore
- README.md
- evaluation/README.md
- evaluation/__init__.py
- evaluation/evaluator.py
- evaluation/generate_captions_placeholder.py
- evaluation/io/__init__.py
- evaluation/io/writer.py
- evaluation/metrics/__init__.py
- evaluation/metrics/classification.py
- evaluation/run_eval.py
- evaluation/tasks/__init__.py
- evaluation/tasks/activity.py
- evaluation/tasks/base.py
- evaluation/tasks/metabolic.py
- notebooks/01_downstream_label_exploration.ipynb
- time_series_datasets/mhc_activity_dataset.py
- time_series_datasets/mhc_base_qa_dataset.py
- time_series_datasets/mhc_label_lookup.py
- time_series_datasets/mhc_metabolic_dataset.py
max-rosenblattl left a comment
Thanks for the work @milanagm 🚀 I know it's only draft, but when you want to merge it, please reduce the comments to a minimum and make them concise. Also activity task should be in its own PR.
Actionable comments posted: 2
♻️ Duplicate comments (1)
evaluation/run_eval.py (1)
205-220: ⚠️ Potential issue | 🟠 Major
Fail fast on incompatible --dataset/--task pairs. Right now mismatched pairs can run and emit meaningless metrics. Add a guard before dataset/model construction. This was already flagged in prior review and is still unresolved.
💡 Proposed fix
+def _validate_dataset_task_pair(dataset: str, task: str) -> None:
+    # e.g. mhc_metabolic_bmi -> metabolic_bmi
+    expected_task = dataset.removeprefix("mhc_")
+    if task != expected_task:
+        raise ValueError(
+            f"Incompatible selection: dataset '{dataset}' requires task '{expected_task}', got '{task}'."
+        )
+
 def main() -> None:
     args = _parse_args()
+    try:
+        _validate_dataset_task_pair(args.dataset, args.task)
+    except ValueError as exc:
+        print(str(exc))
+        sys.exit(1)
     print(f"Dataset : {args.dataset} ({args.split})")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/run_eval.py` around lines 205 - 220, Add a fast-fail validation before building dataset or model in main(): implement and call a small checker (e.g., _validate_task_dataset_pair(args) or inline logic) right after _parse_args() that verifies the provided args.task is compatible with args.dataset/split (use existing TASK_REGISTRY[args.task] or dataset metadata to test compatibility), and if incompatible print a clear error and sys.exit(1); ensure this guard runs before calling _build_dataset(), _build_model(), or instantiating Evaluator so mismatched pairs never proceed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluation/run_eval.py`:
- Around line 138-143: The CLI currently allows negative values for
"--max-samples" and "--max-users"; add validation to ensure they are positive
integers. Implement a small validator (e.g., positive_int) and use it as the
argument type or, after parsing, check args.max_samples and args.max_users: if a
value is not None and is <= 0 call parser.error (or raise a clear argparse
error) to reject invalid inputs. Reference parser.add_argument for
"--max-samples" and "--max-users" and the parsed variables args.max_samples /
args.max_users when adding the validation.
In `@time_series_datasets/mhc_metabolic_dataset.py`:
- Line 171: The return statement in mhc_metabolic_dataset.py uses
zip(df["user_id"], df["_label_str"]) which triggers Ruff B905; change it to
zip(df["user_id"], df["_label_str"], strict=True) to enforce equal-length
iterables and satisfy the linter—update the return in the function that
currently returns dict(zip(...)) so it returns dict(zip(df["user_id"],
df["_label_str"], strict=True)).
---
Duplicate comments:
In `@evaluation/run_eval.py`:
- Around line 205-220: Add a fast-fail validation before building dataset or
model in main(): implement and call a small checker (e.g.,
_validate_task_dataset_pair(args) or inline logic) right after _parse_args()
that verifies the provided args.task is compatible with args.dataset/split (use
existing TASK_REGISTRY[args.task] or dataset metadata to test compatibility),
and if incompatible print a clear error and sys.exit(1); ensure this guard runs
before calling _build_dataset(), _build_model(), or instantiating Evaluator so
mismatched pairs never proceed.
📒 Files selected for processing (3)
- evaluation/README.md
- evaluation/run_eval.py
- time_series_datasets/mhc_metabolic_dataset.py
🚧 Files skipped from review as they are similar to previous changes (1)
- evaluation/README.md
Actionable comments posted: 1
♻️ Duplicate comments (2)
evaluation/tasks/metabolic.py (2)
216-221: ⚠️ Potential issue | 🟠 Major
Validate list lengths before pairing ground truths and predictions. Line 219 uses zip(...) without a guard, so mismatched lengths are silently truncated and metrics/parse-rate become inconsistent.
Proposed fix
 def aggregate_metrics(
     self,
     ground_truths: list[str],
     predictions: list[str],
 ) -> dict[str, float | int]:
+    if len(ground_truths) != len(predictions):
+        raise ValueError(
+            f"Length mismatch: {len(ground_truths)} ground truths vs "
+            f"{len(predictions)} predictions."
+        )
+
     n_unknown = sum(1 for p in predictions if p == "unknown")
     filtered = [
         (gt, pred)
-        for gt, pred in zip(ground_truths, predictions)
+        for gt, pred in zip(ground_truths, predictions, strict=True)
         if pred != "unknown"
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/tasks/metabolic.py` around lines 216 - 221, Before zipping ground_truths and predictions to build filtered and compute metrics, validate that len(ground_truths) == len(predictions); if they differ, raise a clear error (or log and abort) that includes both lengths and any context (e.g., task id) so the mismatch isn't silently truncated. Replace the direct zip usage in the block that defines n_unknown and filtered with a guard that checks lengths first, then proceed to compute n_unknown and build filtered only when lengths match.
223-229: ⚠️ Potential issue | 🟠 Major
Return a stable metric schema when all predictions are unknown. The Line 223 branch omits *_se, n_positive_gt, and n_positive_pred, so the output shape differs from normal binary_metrics output.
Proposed fix
 if not filtered:
     return {
-        "f1": 0.0, "precision": 0.0, "recall": 0.0, "accuracy": 0.0,
+        "f1": 0.0, "f1_se": 0.0,
+        "precision": 0.0, "precision_se": 0.0,
+        "recall": 0.0, "recall_se": 0.0,
+        "accuracy": 0.0, "accuracy_se": 0.0,
         "n_samples": len(ground_truths),
+        "n_positive_gt": 0,
+        "n_positive_pred": 0,
         "n_unknown": n_unknown,
         "parse_rate": 0.0,
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/tasks/metabolic.py` around lines 223 - 229, The early-return in evaluation/tasks/metabolic.py (inside the metrics computation, e.g., the binary_metrics branch) returns a subset of keys when all predictions are unknown; update that branch to return the full stable metric schema used elsewhere by including the missing keys: f1_se, precision_se, recall_se, accuracy_se (set to 0.0), and n_positive_gt and n_positive_pred (set to 0), while keeping existing keys n_samples, n_unknown, parse_rate, f1, precision, recall, accuracy; ensure key names exactly match the normal output schema so callers always receive a consistent shape.
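To make the schema these comments refer to concrete, here is a rough sketch of a binary-metrics helper with bootstrap standard errors. The key names mirror what the review quotes; the positive label "yes", the resampling count, and the implementation details are assumptions about the real evaluation/metrics/classification.py.

```python
from functools import partial

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def bootstrap_se(metric_fn, y_true, y_pred, n_boot=1000, seed=0):
    """Standard error of a metric, estimated by resampling (ground truth, prediction) pairs."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        try:
            scores[i] = metric_fn(y_true[idx], y_pred[idx])
        except (ValueError, ZeroDivisionError):
            scores[i] = np.nan  # degenerate resample, e.g. only one class drawn
    return float(np.nanstd(scores))


def binary_metrics(gts, preds, pos_label="yes"):
    f1 = partial(f1_score, pos_label=pos_label, zero_division=0)
    precision = partial(precision_score, pos_label=pos_label, zero_division=0)
    recall = partial(recall_score, pos_label=pos_label, zero_division=0)
    return {
        "f1": f1(gts, preds), "f1_se": bootstrap_se(f1, gts, preds),
        "precision": precision(gts, preds), "precision_se": bootstrap_se(precision, gts, preds),
        "recall": recall(gts, preds), "recall_se": bootstrap_se(recall, gts, preds),
        "accuracy": accuracy_score(gts, preds), "accuracy_se": bootstrap_se(accuracy_score, gts, preds),
        "n_samples": len(gts),
        "n_positive_gt": sum(g == pos_label for g in gts),
        "n_positive_pred": sum(p == pos_label for p in preds),
    }
```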
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluation/io/writer.py`:
- Around line 47-48: Sanitize both path components before joining: add/used a
sanitizer (e.g., sanitize_path_component or sanitize_filename) and apply it to
result.task_name and result.model_name (currently assigned to safe_model) so you
replace or remove unsafe sequences like "/", "\\", and ".." and any other
characters invalid for filenames; then compute out_dir using self.results_dir /
sanitized_task_name / sanitized_model_name / sanitized_timestamp (ensure
timestamp is sanitized if generated elsewhere). Update the assignments where
safe_model and out_dir are set so they use the sanitizer and keep references to
safe_model, out_dir, self.results_dir, result.task_name and result.model_name to
locate the change.
---
Duplicate comments:
In `@evaluation/tasks/metabolic.py`:
- Around line 216-221: Before zipping ground_truths and predictions to build
filtered and compute metrics, validate that len(ground_truths) ==
len(predictions); if they differ, raise a clear error (or log and abort) that
includes both lengths and any context (e.g., task id) so the mismatch isn't
silently truncated. Replace the direct zip usage in the block that defines
n_unknown and filtered with a guard that checks lengths first, then proceed to
compute n_unknown and build filtered only when lengths match.
- Around line 223-229: The early-return in evaluation/tasks/metabolic.py (inside
the metrics computation, e.g., the binary_metrics branch) returns a subset of
keys when all predictions are unknown; update that branch to return the full
stable metric schema used elsewhere by including the missing keys: f1_se,
precision_se, recall_se, accuracy_se (set to 0.0), and n_positive_gt and
n_positive_pred (set to 0), while keeping existing keys n_samples, n_unknown,
parse_rate, f1, precision, recall, accuracy; ensure key names exactly match the
normal output schema so callers always receive a consistent shape.
📒 Files selected for processing (5)
- .gitignore
- evaluation/io/writer.py
- evaluation/metrics/classification.py
- evaluation/tasks/metabolic.py
- mhc/transformer.py
✅ Files skipped from review due to trivial changes (2)
- mhc/transformer.py
- .gitignore
…integration into existing mhc metabolic qa dataset
Actionable comments posted: 1
🧹 Nitpick comments (1)
evaluation/metrics/classification.py (1)
117-117: Remove redundant int(len(...)) casts. These casts are unnecessary since len(...) already returns int; dropping them improves clarity and resolves the Ruff warnings.
Proposed cleanup
- "n_samples": int(len(ground_truths)), + "n_samples": len(ground_truths), @@ - "n_samples": int(len(ground_truths)), - "n_classes": int(len(label_set)), + "n_samples": len(ground_truths), + "n_classes": len(label_set), @@ - "n_samples": int(len(ground_truths)), - "n_classes": int(len(labels)), + "n_samples": len(ground_truths), + "n_classes": len(labels),Also applies to: 146-147, 193-194
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/metrics/classification.py` at line 117, Remove the redundant int(...) casts around len(...) in the result dictionaries: replace int(len(ground_truths)) (and the other occurrences using int(len(...))) with just len(...) so the "n_samples" and similar fields use len(...) directly; locate these in evaluation/metrics/classification.py where result dicts are assembled (e.g., the "n_samples" key and the other two similar entries) and update them accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluation/metrics/classification.py`:
- Line 131: The code currently treats labels as falsy, so replace the expression
that builds label_set with an explicit None check: if labels is None then
compute label_set = sorted({*ground_truths, *predictions}) else use the provided
labels as-is (preserving an explicit empty list). Update any downstream
computations (e.g., n_classes) to derive their values from this label_set.
Ensure validation still runs on the provided labels (uniqueness/consistency with
ground_truths/predictions) rather than silently falling back when labels == [].
---
Nitpick comments:
In `@evaluation/metrics/classification.py`:
- Line 117: Remove the redundant int(...) casts around len(...) in the result
dictionaries: replace int(len(ground_truths)) (and the other occurrences using
int(len(...))) with just len(...) so the "n_samples" and similar fields use
len(...) directly; locate these in evaluation/metrics/classification.py where
result dicts are assembled (e.g., the "n_samples" key and the other two similar
entries) and update them accordingly.
📒 Files selected for processing (1)
evaluation/metrics/classification.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@requirements.txt`:
- Line 14: The requirements entry pins opentslm to the mutable branch "main",
making installs non-deterministic; update the line in requirements.txt
referencing "opentslm @ git+https://github.com/StanfordBDHG/OpenTSLM.git@main"
to use an immutable ref (a commit SHA such as
104013b90b89fb39c1537c9190b3d047c3072fb5 or a release tag) so installs are
reproducible, then re-run any dependency install/lock steps (e.g., pip-compile
or CI dependency caching) to ensure the change is applied across environments.
📒 Files selected for processing (2)
- .gitignore
- requirements.txt
✅ Files skipped from review due to trivial changes (1)
- .gitignore
…ion and updated readme
# Add cross-channel captions for all binary daily channels

## ♻️ Current Situation & Problem
The daily cross-channel caption pipeline only covered a subset of binary channels: in-bed-awake sleep periods, selected stationary workouts, and running/cycling cardio workouts. This left several binary channels without cross-channel captions, including direct asleep/in-bed windows, walking, yoga, elliptical, mixed cardio, and other workouts. These channels should also be described with relevant context from continuous signals such as heart rate, distance, steps, and active energy.

## ⚙️ What Changed
- Added focused cross-channel synthesizers by activity family:
  - `WalkingSynthesizer` in `synthesizers/locomotion.py`
  - `MindBodyActivitySynthesizer` in `synthesizers/mind_body.py`
  - `OtherActivitySynthesizer` in `synthesizers/other_activity.py`
  - `EnduranceActivitySynthesizer` in `synthesizers/cardio.py`
- Extended `SleepSynthesizer` to emit captions for direct `sleep:asleep` and `sleep:inbed` windows, while preserving the existing `in_bed_not_sleeping` behavior.
- Added active-energy metric support through the shared cross-channel helper pattern.
- Updated existing cardio and stationary workout captions to include active-energy context where available.
- Added new cross-channel templates for sleep stages, walking, yoga, endurance activities, generic workouts, active energy, and stationary sleep summaries.
- Added explorer hit-target support for the new cross-channel labels so the new captions are discoverable through the hit navigation.
- Kept `CrossChannelExtractor` unchanged; it still collects captions from the configured synthesizer list.

## 📝 Release Notes
- Added cross-channel captions for previously uncovered daily binary workout channels:
  - Walking
  - Yoga
  - Elliptical
  - Mixed cardio
  - Other workout
- Added direct sleep captions for `sleep:asleep` and `sleep:inbed`.
- Preserved overlapping sleep captions, so `sleep:inbed` and `sleep:asleep` can both produce captions for the same window.
- Improved sleep movement summaries so zero-distance sleep periods are described as stationary rather than listing zero steps/distance.
- Improved template wording for more natural activity captions.
- No public API or extractor interface changes.

## 📚 Documentation
This change is implemented through focused synthesizer additions and template updates. `CrossChannelExtractor` remains unchanged and continues to collect captions from configured synthesizers. New and updated templates in `templates/templates.json` provide more natural phrasing for sleep, yoga, walking, elliptical, mixed cardio, and other workout captions.

## ✅ Testing
Added synthetic unit tests covering:
- Cross-channel caption coverage for every daily binary channel.
- Overlapping `sleep:inbed` and `sleep:asleep` windows producing separate captions.
- Yoga captions including heart-rate and active-energy context without requiring movement.
- Walking captions including steps and distance.
- Missing helper channels not crashing caption generation.
- Explorer hit targets covering emitted cross-channel labels.
- Sleep zero-distance captions using a stationary summary rather than explicit zero steps/distance wording.
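As an illustration of the synthesizer pattern described above, a walking-style synthesizer that pulls context from continuous channels might look like the sketch below. The class and method signatures are assumptions, not the actual interface of `WalkingSynthesizer` or `CrossChannelExtractor`.

```python
# Hypothetical sketch; the real WalkingSynthesizer in synthesizers/locomotion.py and the
# CrossChannelExtractor interface may look different. It only illustrates the pattern
# described above: a per-activity-family synthesizer that captions a binary-channel
# window using context from continuous channels.
from dataclasses import dataclass


@dataclass
class ActivityWindow:
    channel: str     # binary channel the window came from, e.g. a walking workout channel
    start_min: int
    end_min: int


class WalkingCaptionSketch:
    def synthesize(self, window: ActivityWindow, signals: dict[str, list[float]]) -> str:
        """Summarize a walking window with steps/distance context when available."""
        span = slice(window.start_min, window.end_min)
        minutes = window.end_min - window.start_min
        parts = [f"Walked for about {minutes} minutes"]
        if "step_count" in signals:
            parts.append(f"taking roughly {int(sum(signals['step_count'][span]))} steps")
        if "distance" in signals:
            parts.append(f"covering about {sum(signals['distance'][span]):.1f} km")
        # Missing helper channels simply drop their clause instead of failing.
        return ", ".join(parts) + "."
```

A walking window with both helper channels present would then yield something like "Walked for about 35 minutes, taking roughly 3900 steps, covering about 2.8 km."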
### Code of Conduct & Contributing Guidelines
By creating and submitting this pull request, you agree to follow our [Code of Conduct](https://github.com/StanfordBDHG/.github/blob/main/CODE_OF_CONDUCT.md) and [Contributing Guidelines](https://github.com/StanfordBDHG/.github/blob/main/CONTRIBUTING.md):
- [x] I agree to follow the [Code of Conduct](https://github.com/StanfordBDHG/.github/blob/main/CODE_OF_CONDUCT.md) and [Contributing Guidelines](https://github.com/StanfordBDHG/.github/blob/main/CONTRIBUTING.md).

---------

Co-authored-by: Max Rosenblattl <max.rosenblattl@tum.de>
Resolves shard-glob conflict in mhc_base_qa_dataset (kept rglob form, functionally equivalent). Adopts main's NaN→0 fix in _binary_prompt for OpenTSLM compatibility. Pulls in PR #50 (curriculum learning training pipeline + multi-label datasets).
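In spirit, the NaN→0 fix referred to here looks like the snippet below; the helper name _binary_prompt comes from the note above, while this standalone function is only an assumed illustration.

```python
import numpy as np


def _sanitize_for_prompt(values: np.ndarray) -> np.ndarray:
    # Assumed illustration: OpenTSLM prompt serialization cannot represent NaN,
    # so missing minutes are rendered as 0 before the channel is written into the prompt.
    return np.nan_to_num(values, nan=0.0)
```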
…l uses into to string, training not
Switch metabolic training targets from ints to canonical strings

♻️ Current situation & Problem
Depends on / sits on top of the eval pipeline PR (#55, feature/evaluation-pipeline). Surface follow-up to PR #50, which introduced the curriculum-learning training stack with int-encoded answer targets ("0", "1", "2", …). Two related issues motivating this PR:
1. OpenTSLM convention divergence — the upstream OpenTSLM reference datasets (HARAcc, Sleep, ECG) all train on canonical class strings ("biking", "Wake", "yes", …). Our metabolic_finetune.py and mhc_multi_label_qa_dataset.py deviated by using str(int(value)) as the training target.
2. Train ↔ eval prompt drift — eval (MHCMetabolicQADataset from PR #55) frames the task as "Is this person biologically male or female? … Answer: <label>", while training framed it as the terse "Based on the sensor data, predict the value of BiologicalSex (binary).".

⚙️ Release Notes
- Training answer format switched to canonical strings:
  - Binary: "Male" / "Female" / "True" / "False"
  - Ordinal: "Underweight" / "Normal weight" / "Overweight" / …
  - Continuous: numeric string (unchanged)
  - Wrapped as "Answer: <value>" so the loss target matches the prompt's Answer: <label> instruction.
- New shared helper build_metabolic_post_prompt(label) in time_series_datasets/mhc_label_lookup.py. Both MHCMetabolicQADataset (eval) and MHCMultiLabelQADataset (training) call it — train and eval now emit byte-identical post_prompts. Lives next to the METABOLIC_LABEL_CONFIG it formats.
- LabelLookup API change: _decode_labels → decode_labels (was effectively public; renamed to match its actual usage).
- scripts/metabolic_finetune.py eval logic removed. The single-token logit-scoring path (_candidate_token_ids, _logits_over_candidates, _answer_prefix_embed) was incompatible with multi-token canonical strings. Downstream metrics now flow through evaluation/run_eval.py (PR #55) instead. --eval_only flag removed; net −189 LOC.
- Migration: existing int-trained checkpoints will not parse correctly under the new prompt format. Re-train from a stage-1 captioning checkpoint:
  python curriculum_learning.py \
    --model OpenTSLMSP \
    --stages stage_joint_caption_metabolic \
    --batch_size 2 \
    --gradient_checkpointing
  (or stage_metabolic alone if a captioning checkpoint is already on disk.)
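A sketch of what the shared helper described above might reduce to; the function name and the "Answer: <value>" wrapping come from the notes, everything else is assumed.

```python
# Assumed minimal form of the helper in time_series_datasets/mhc_label_lookup.py.
def build_metabolic_post_prompt(label: str) -> str:
    # Train and eval call the same helper so their post_prompts are byte-identical.
    return f"Answer: {label}"


build_metabolic_post_prompt("Male")            # -> "Answer: Male"
build_metabolic_post_prompt("Normal weight")   # -> "Answer: Normal weight"
```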
Eval Architecture and first End2End Metabolic Draft
♻️ Current situation & Problem
Link any open issues or pull requests (PRs) related to this PR. Please ensure that all non-trivial PRs are first tracked and discussed in an existing GitHub issue or discussion.
⚙️ Release Notes
Add a bullet point list summary of the feature and possible migration guides if this is a breaking change so this section can be added to the release notes.
Include code snippets that provide examples of the feature implemented or links to the documentation if it appends or changes the public interface.
📚 Documentation
Please ensure that you properly document any additions in conformance to Spezi Documentation Guide.
You can use this section to describe your solution, but we encourage contributors to document your reasoning and changes using in-line documentation.
✅ Testing
Please ensure that the PR meets the testing requirements set by CodeCov and that new functionality is appropriately tested.
This section describes important information about the tests and why some elements might not be testable.
Code of Conduct & Contributing Guidelines
By creating and submitting this pull request, you agree to follow our Code of Conduct and Contributing Guidelines: