Add compare and ranking function for FEMaps by jthorton · Pull Request #174 · OpenFreeEnergy/cinnabar

jthorton · 2026-01-16T10:50:33Z

Description

This PR adds a general compare-and-rank function for a collection of FEMaps that contain results for the same edges and should be checked for significant differences in performance. The FEMaps are compared on a user-chosen metric (by default the mean unsigned error) via the distribution of differences in that metric under a joint bootstrapping procedure. A two-sided p-value is then determined from the fraction of negative or positive differences in the metric (whichever is smaller). A multiple-testing correction scheme is applied to the p-values, and the corrected values are used to rank the FEMaps using the compact letter display (CLD) assigned via the insert-absorb method.

The result is two pandas DataFrames: the first contains all evaluated metrics for each FEMap with confidence intervals and the CLD rank; the second contains the raw comparison data for the metric, including p-values.
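
The joint bootstrap described above can be sketched as follows. This is a minimal standard-library illustration, not the code in this PR: the function name `bootstrap_mue_difference`, the toy data, and the percentile confidence interval are assumptions made for the example. The point it shows is that one set of resampled edge indices is applied to both models, so each replicate compares the metric on identical edges.

```python
import random
from statistics import fmean


def mue(pred, exp):
    """Mean unsigned error between predictions and experiment."""
    return fmean(abs(p - e) for p, e in zip(pred, exp))


def bootstrap_mue_difference(pred_a, pred_b, exp, num_bootstraps=1000, seed=0):
    """Jointly bootstrap MUE(a) - MUE(b) over a shared set of edges (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(exp)
    diffs = []
    for _ in range(num_bootstraps):
        # one set of resampled indices, reused for both models (joint resampling)
        idx = [rng.randrange(n) for _ in range(n)]
        e = [exp[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(mue(a, e) - mue(b, e))
    # two-sided p-value: twice the smaller tail fraction around zero
    frac_neg = sum(d < 0 for d in diffs) / len(diffs)
    frac_pos = sum(d > 0 for d in diffs) / len(diffs)
    p_value = 2 * min(frac_neg, frac_pos)
    # percentile 95% confidence interval on the difference
    ordered = sorted(diffs)
    lower = ordered[int(0.025 * len(ordered))]
    upper = ordered[int(0.975 * len(ordered)) - 1]
    return p_value, lower, upper


# toy data (made up): model_b is uniformly worse than model_a
exp = [-1.2, 0.4, 0.9, -0.3, 1.5, -0.8]
model_a = [e + 0.1 for e in exp]
model_b = [e + 1.0 for e in exp]
p_value, lower, upper = bootstrap_mue_difference(model_a, model_b, exp)
```

A confidence interval on the difference that excludes zero points the same way as a small p-value; both are derived from the same list of bootstrap differences.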

Example stats dataframe

    Model       MUE  MUE_CI_Lower  ...  KTAU_CI_Lower  KTAU_CI_Upper  CLD
0  FE Map 1  9.326389      9.055143  ...       0.407202       0.724801    a
1  FE Map 2  9.326389      9.062688  ...       0.377350       0.713957    a
2  FE Map 3  9.326389      8.987990  ...       0.044224       0.456656    b

Example comparison

    Model 1   Model 2  Diff in rho  ...  p-value  significant  p-value corrected
0  FE Map 1  FE Map 2     0.002788  ...    0.666        False              0.666
1  FE Map 1  FE Map 3     0.321357  ...    0.002         True              0.006
2  FE Map 2  FE Map 3     0.318568  ...    0.002         True              0.006

Todos

Notable points that this PR has either accomplished or will accomplish.

add more tests
add a centralise/ shift method to the nodewise comparisons?

Checklist

Added a news entry for new features, bug fixes, or other user facing changes.

Status

Ready to go

Tips

Comment "pre-commit.ci autofix" to have pre-commit.ci atomically format your PR.
Since this will create a commit, it is best to make this comment when you are finished with your work.

codecov · 2026-01-16T10:53:12Z

Codecov Report

❌ Patch coverage is 98.67841% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.62%. Comparing base (c91f57d) to head (c80a7af).

Files with missing lines	Patch %	Lines
cinnabar/compare.py	97.94%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #174      +/-   ##
==========================================
+ Coverage   97.53%   97.62%   +0.09%     
==========================================
  Files          22       24       +2     
  Lines        2470     2697     +227     
==========================================
+ Hits         2409     2633     +224     
- Misses         61       64       +3

jthorton · 2026-01-16T11:11:47Z

pre-commit.ci autofix

hannahbaumann · 2026-01-28T13:01:22Z

+
+
+def compare_and_rank_femaps(
+    femaps: list[FEMap],


I wonder if it would be easier to have a single FEMap object that contains calc data from multiple experiments, using the "source" argument to distinguish between?

Yes, that would be easier, as the current method requires you to add the same experimental info to all of the maps you are comparing. However, we would need to rework the MLE calculation on the legacy graph to group by source first and then split into different edges, so we would effectively be doing this under the hood. For now it might be easier to ask users to do this? We could add something which copies the experimental values to all graphs if that would be easier?

I believe that was the intent of the FEMap API - i.e. to have multiple sources so it's easier to do one big comparison rather than multiple objects.

Thanks both, I have updated this now to use a single FEMap with different sources.

fjclark · 2026-05-11T15:39:35Z

Commenting as Josh pointed me here.

I'm excited about this and would love to see this functionality in cinnabar soon! We've also been using the bootstrapping method from https://www.nature.com/articles/s42004-025-01428-y and it would be great to have it implemented here.

My main comment is I'm confused about when to use each of the prediction_type options and when they're valid. I think my confusion could be phrased as: "exactly which distribution are we trying to approximate with bootstrapping, and how much do we violate the IID assumption of bootstrapping for this distribution?". If we start by assuming each "single calculation" (dG for ABFE and ddG for RBFE) is an IID sample from the distribution we're interested in, then to me:

For ABFE, nodewise makes sense (as used in the original paper). As I understand it, edgewise and pairwise would introduce correlation (each dG can contribute to many ddGs).
For RBFE, edgewise makes sense for ddG stats. nodewise feels wrong because we're no longer resampling single calculations. Our dG estimates are fixed by the MLE from a fixed set of ddG estimates, so the bootstrapped distribution of dGs is generated by a different "process" to the edgewise ddG stats. An (impractical) approach which would be more intuitive to me would be to bootstrap over edges in a way which keeps a valid graph each time and recompute the dGs through the MLE each time to get the dG distribution. However, I guess this is the only practical way to use bootstrapping here. pairwise would introduce correlation because each ddG would contribute to many of the pairwise ddGs.
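
To make the correlation concern concrete, here is a small counting sketch (an illustration with made-up ligand names, not cinnabar code). With N single values, the all-pairs differences number N(N-1)/2 and each value feeds into N-1 of them, so the pairwise ddGs are not independent samples.

```python
from itertools import combinations

ligands = ["lig_a", "lig_b", "lig_c", "lig_d", "lig_e"]  # hypothetical names
pairs = list(combinations(ligands, 2))

# how many pairwise ddGs each single dG contributes to
uses = {lig: sum(lig in pair for pair in pairs) for lig in ligands}

print(len(pairs))  # number of pairwise ddGs
print(uses)        # each ligand appears in len(ligands) - 1 pairs
```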

@jthorton Am I misunderstanding? What's the intended use of pairwise? Thanks!

Though please feel free to ignore -- I am likely overthinking this!

review-notebook-app · 2026-05-19T09:34:07Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

jthorton · 2026-05-19T09:34:49Z

pre-commit.ci autofix

jthorton · 2026-05-19T10:38:33Z

Thanks @fjclark, this is great feedback. I agree that naively applying this to all of the available analysis types in Cinnabar is not a good idea, so I have removed the pairwise option and changed the default to edgewise, as I expect RBFE results to be the main use of the function. I agree that the nodewise analysis with the back-calculated absolute values is problematic due to the correlated nature of the values. For now, I recommend using the edgewise analysis for RBFE results in the docs, and we can look into other possible methods in future to assess differences in ranking. Hopefully this can serve as a good starting point though!

IAlibay

Sorry that took me so long to review. Overall this looks good to me, but I've got a few questions.

IAlibay · 2026-05-26T11:15:37Z

+            metrics_to_compute = ["MUE", "RMSE", "RAE", "R2", "rho", "KTAU", "PI"]
+
+    # we must compute the rank metric however it is possible to miss it
+    if rank_metric not in metrics_to_compute:


Maybe not here, but we should think about how best to nudge users away from doing bad things - i.e. maybe we can add a warning that tells folks that it's a bad idea to rank using correlation metrics if you're doing edgewise comparisons.

Yeah I agree, hopefully the docs guide them in the right direction but a warning would help!
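
A warning of the kind discussed here could look roughly like the sketch below. Everything in it is hypothetical: the helper name `warn_if_correlation_rank`, the set of metrics treated as correlation metrics, and the wording of the message (including the suggestion of MUE or RMSE) are assumptions for illustration, not what the PR implements.

```python
import warnings

# assumed grouping of correlation-style metrics, for illustration only
CORRELATION_METRICS = {"R2", "rho", "KTAU"}


def warn_if_correlation_rank(rank_metric: str, prediction_type: str) -> None:
    """Hypothetical helper: warn when ranking edgewise results by a correlation metric."""
    if prediction_type == "edgewise" and rank_metric in CORRELATION_METRICS:
        warnings.warn(
            f"Ranking edgewise results by {rank_metric!r} may be misleading; "
            "consider an error metric such as MUE or RMSE.",
            UserWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
```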

IAlibay · 2026-05-26T11:26:55Z

+                "CI Lower": lower,
+                "CI Upper": upper,
+                "p-value": p_value,
+                "significant": p_value < 0.05,


Do we want users to set this significance value?

Good idea, I added an option using the standard name alpha.

IAlibay · 2026-05-26T11:36:40Z

+    - The comparison method uses a joint bootstrapping procedure that generates a distribution of differences in the rank metric and checks for significant differences using a method inspired by. [1]_
+    - Each source must be evaluated on the same set of edges.
+    - Prediction types "nodewise" and "edgewise" correspond to DGs and edgewise DDGs respectively.
+    - When we have more than 2 models, we apply multiple testing correction to the pairwise comparisons using the ``Holm`` method by default.


Do you have a small explanation of why Holm over any of the other methods?

Added something and a link to the wiki
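
For readers unfamiliar with the Holm method, a minimal pure-Python sketch of the step-down adjustment is given below. It is written for illustration and is not the implementation used in the PR; the function name `holm_correction` is an assumption. Applied to the three raw p-values from the example comparison table in the description, it gives adjusted values consistent with the "p-value corrected" column when rounded to three decimals.

```python
def holm_correction(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # the smallest p-value is scaled by m, the next by m - 1, and so on, capped at 1
        candidate = min(1.0, (m - rank) * p_values[i])
        # enforce monotonicity so adjusted values never decrease with rank
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw = [0.666, 0.002, 0.002]  # raw p-values from the example comparison table
corrected = holm_correction(raw)
```

Holm controls the family-wise error rate like Bonferroni but is never less powerful, because only the smallest p-value is multiplied by the full number of comparisons.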

IAlibay · 2026-05-26T11:57:12Z

+        assert metric in summary_df.columns
+        for ci in ["Upper", "Lower"]:
+            assert f"{metric}_CI_{ci}" in summary_df.columns
+    summary_df.to_csv("summary_df.csv")


Should this be writing to disk unprotected by a temporary directory?

Good spot, that's left over from checking the output. Removed!
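
If a test ever does need to write the summary to disk, pytest's built-in `tmp_path` fixture keeps the file out of the working directory. The sketch below is an assumption-laden illustration: `write_summary` is a hypothetical helper that uses the standard-library `csv` module in place of the DataFrame call in the diff.

```python
import csv
from pathlib import Path


def write_summary(rows, out_dir: Path) -> Path:
    """Hypothetical helper: write summary rows to a CSV file inside out_dir."""
    out_file = out_dir / "summary_df.csv"
    with out_file.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return out_file


def test_write_summary(tmp_path):
    # tmp_path is pytest's per-test temporary directory fixture
    out_file = write_summary([{"Model": "FE Map 1", "CLD": "a"}], tmp_path)
    assert out_file.exists()
```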

jthorton · 2026-06-12T12:49:34Z

pre-commit.ci autofix

jthorton · 2026-06-15T10:17:15Z

pre-commit.ci autofix

IAlibay

Couple of small things, I'm going to put a "request changes" for some of the questions but it's not majorly blocking.

IAlibay · 2026-06-15T10:30:35Z

+
+    # we must compute the rank metric however it is possible to miss it
+    if rank_metric not in metrics_to_compute:
+        metrics_to_compute.append(rank_metric)


[nit] This does an in-place modification of the input kwarg. It's probably safe, but always better to avoid in case someone is using a pre-defined variable somewhere outside of the method call.

Fixed to use a copy.
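
The pattern behind this fix can be sketched as follows. The helper name `resolve_metrics` is hypothetical and the actual change in the PR may look different; the sketch only shows that copying the list before appending leaves the caller's variable untouched.

```python
def resolve_metrics(metrics_to_compute, rank_metric):
    """Hypothetical helper: return the metrics to evaluate without touching the input list."""
    metrics = list(metrics_to_compute)  # shallow copy, the caller's list is left alone
    if rank_metric not in metrics:
        metrics.append(rank_metric)
    return metrics


user_metrics = ["RMSE", "rho"]
resolved = resolve_metrics(user_metrics, rank_metric="MUE")
```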

IAlibay · 2026-06-15T10:47:18Z

+        # calculate the p-value as the fraction of bootstrap samples that cross zero
+        # use a 2-tailed test
+        # inspired by https://www.nature.com/articles/s42004-025-01428-y
+        p_value = 2 * min(np.mean(diffs < 0), np.mean(diffs > 0))


If I understand this correctly, there is a chance that the p value ends up being 0 when you have a clear winner. Is this something that would be misleading to users?

That's right. I did find mention of a method which adds 1 to the numerator and denominator to avoid exact zero answers, but I am not confident that this is best practice. One option is to push users to use the reported metric differences and the CIs around them to interpret the results when this happens. I could add a warning to the docs about it?
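
The two options can be compared with a small standard-library sketch. The first function mirrors the formula in the diff above; the second is one possible reading of the "add 1 to the numerator and denominator" idea and is shown only to illustrate it, not as a recommendation. Both function names are assumptions.

```python
def two_sided_p(diffs):
    """Two-sided bootstrap p-value, following the formula in the snippet under review."""
    n = len(diffs)
    return 2 * min(sum(d < 0 for d in diffs) / n, sum(d > 0 for d in diffs) / n)


def two_sided_p_plus_one(diffs):
    """Variant adding 1 to numerator and denominator, so the result is never exactly 0."""
    n = len(diffs)
    smaller_tail = min(sum(d < 0 for d in diffs), sum(d > 0 for d in diffs))
    return min(1.0, 2 * (smaller_tail + 1) / (n + 1))


clear_winner = [-0.5] * 1000  # every bootstrap difference has the same sign
p_plain = two_sided_p(clear_winner)
p_plus_one = two_sided_p_plus_one(clear_winner)
```

With the variant, the smallest reportable value is 2 / (num_bootstraps + 1), which makes explicit that the resolution of the p-value is limited by the number of bootstrap samples.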

IAlibay · 2026-06-15T10:48:36Z

+
+    elif prediction_type == "edgewise":
+        rel_df = femap.get_relative_dataframe()
+        sources = rel_df[rel_df["computational"] == True]["source"].unique()


On line 68 you call to_list() too, whilst here you only call unique(). Should the two be aligned?

Yes, good catch, changed to tolist().

IAlibay · 2026-06-15T10:51:37Z

+    comparison_df = pd.DataFrame(comparison_data)
+
+    # if we have more than 2 models, apply multiple testing correction
+    if len(sources) > 2:


[nit] If you don't have > 2, you don't have a p-value corrected column. Would it be better for users to always have the same columns and just return the same as the p-value in that case or would it be better to not have the column in that case?

Having the output be stable would be great but I don't want people to think a correction was applied when it wasn't. I think this is something we could easily add in future though if people want it?

IAlibay · 2026-06-15T10:52:47Z

+    num_bootstraps: int = 1_000,
+    confidence_level: float = 0.95,
+    alpha: float = 0.05,


[nit] Some bounds checks on some of these arguments could be useful to avoid users passing bad things and getting odd failures.

Added checks and tests.
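
Checks of this kind might look like the sketch below. The validator name `validate_compare_args` and the exact bounds and messages are assumptions for illustration; the checks actually added in the PR may differ.

```python
def validate_compare_args(num_bootstraps: int, confidence_level: float, alpha: float) -> None:
    """Hypothetical validator for the three keyword arguments shown in the diff."""
    if num_bootstraps < 1:
        raise ValueError(f"num_bootstraps must be a positive integer, got {num_bootstraps}")
    if not 0.0 < confidence_level < 1.0:
        raise ValueError(f"confidence_level must be between 0 and 1, got {confidence_level}")
    if not 0.0 < alpha < 1.0:
        raise ValueError(f"alpha must be between 0 and 1, got {alpha}")
```

Failing early with a clear message avoids the odd downstream failures mentioned above, such as an empty bootstrap distribution or a percentile outside the valid range.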

IAlibay · 2026-06-15T10:57:17Z

Notebook looks great - I don't have any comments about it.

jthorton · 2026-06-17T10:20:19Z

pre-commit.ci autofix

add compare and rank method for FEMaps

639fbbe

jthorton and others added 3 commits January 16, 2026 11:13

fix pairwise comparisons, remove prints

1ab4380

[pre-commit.ci] auto fixes from pre-commit.com hooks

b0dbba7

for more information, see https://pre-commit.ci

Merge remote-tracking branch 'origin/compair' into compair

117aa88

# Conflicts: # cinnabar/tests/test_compare.py

jameseastwood assigned hannahbaumann and IAlibay Jan 26, 2026

hannahbaumann reviewed Jan 28, 2026

View reviewed changes

jthorton added 2 commits May 18, 2026 15:25

Merge branch 'refs/heads/main' into compair

08826ed

# Conflicts: # cinnabar/stats.py

rework to use a single femap, remove pairwise, add more tests and a t…

739ba99

…utorial

pre-commit-ci Bot and others added 2 commits May 19, 2026 09:34

[pre-commit.ci] auto fixes from pre-commit.com hooks

7e383ce

for more information, see https://pre-commit.ci

fix alter box add nodewise test

9af5c5a

jthorton added 3 commits May 19, 2026 11:41

render output

054fea3

fix highlights

8b610e9

add news

fb486eb

IAlibay requested changes May 26, 2026

View reviewed changes

jthorton added 3 commits June 12, 2026 11:36

Merge branch 'main' into compair

30a028b

# Conflicts: # docs/tutorials/index.rst

expose alpha, update tests, dont generate abs values in call

2269b75

fix notebook rendering

d036057

pre-commit-ci Bot and others added 2 commits June 12, 2026 12:49

[pre-commit.ci] auto fixes from pre-commit.com hooks

ef8d230

for more information, see https://pre-commit.ci

fix mypy

404a5d3

[pre-commit.ci] auto fixes from pre-commit.com hooks

179126a

for more information, see https://pre-commit.ci

IAlibay requested changes Jun 15, 2026

View reviewed changes

jthorton self-assigned this Jun 15, 2026

jthorton added 2 commits June 16, 2026 11:49

PR feedback, add bounds check tests

d70c014

Merge branch 'main' into compair

e6dfd0a

# Conflicts: # docs/tutorials/index.rst

pre-commit-ci Bot and others added 3 commits June 17, 2026 10:20

[pre-commit.ci] auto fixes from pre-commit.com hooks

5a86140

for more information, see https://pre-commit.ci

update info and warning boxes

8354a91

Merge remote-tracking branch 'origin/compair' into compair

c80a7af
