Add compare and ranking function for FEMaps#174

Open
jthorton wants to merge 22 commits into main from compair

Conversation

@jthorton

@jthorton jthorton commented Jan 16, 2026

Contributor

Description

This PR adds a general compare-and-rank function for a collection of FEMaps that contain results for the same edges and should be compared for significant differences in performance. The FEMaps are compared by a user-chosen metric (by default the mean unsigned error) via the distribution of differences of that metric under a joint bootstrapping procedure. A two-sided p-value is then determined from the fraction of negative or positive differences in the metric (whichever is smaller). A multiple-testing correction scheme is then applied to the p-values, and the corrected p-values are used to rank the FEMaps using the compact letter display (CLD) assigned via the insert-absorb method.

The result is two pandas dataframes: the first contains all evaluated metrics for each FEMap with confidence intervals and the CLD rank; the second contains the raw comparison data of the metric, including p-values.
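
The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in this PR: the helper names `mue` and `bootstrap_mue_difference` are hypothetical, the inputs are flat lists of per-edge values rather than FEMaps, and the factor of two in the p-value follows the `compare.py` snippet quoted later in this review.

```python
import random


def mue(calc, expt):
    """Mean unsigned error between calculated and experimental values."""
    return sum(abs(c - e) for c, e in zip(calc, expt)) / len(expt)


def bootstrap_mue_difference(calc_a, calc_b, expt, num_bootstraps=1000, seed=0):
    """Return bootstrap MUE differences (A - B) and a two-sided p-value.

    Hypothetical helper for illustration only; not the cinnabar API.
    """
    rng = random.Random(seed)
    n = len(expt)
    diffs = []
    for _ in range(num_bootstraps):
        # Joint resampling: the same edge indices are drawn for both models,
        # so each difference is computed on an identical set of edges.
        idx = [rng.randrange(n) for _ in range(n)]
        expt_s = [expt[i] for i in idx]
        mue_a = mue([calc_a[i] for i in idx], expt_s)
        mue_b = mue([calc_b[i] for i in idx], expt_s)
        diffs.append(mue_a - mue_b)
    frac_neg = sum(d < 0 for d in diffs) / num_bootstraps
    frac_pos = sum(d > 0 for d in diffs) / num_bootstraps
    # Two-sided p-value: twice the smaller tail fraction.
    p_value = 2 * min(frac_neg, frac_pos)
    return diffs, p_value


# Toy data: model A is off by 0.1 on every edge, model B by 2.0.
expt = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
model_a = [e + 0.1 for e in expt]
model_b = [e + 2.0 for e in expt]
diffs, p_value = bootstrap_mue_difference(model_a, model_b, expt, num_bootstraps=200)
```

In this toy case every resample favours model A, so all differences are negative and the p-value is exactly 0.0.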

Example stats dataframe

    Model       MUE  MUE_CI_Lower  ...  KTAU_CI_Lower  KTAU_CI_Upper  CLD
0  FE Map 1  9.326389      9.055143  ...       0.407202       0.724801    a
1  FE Map 2  9.326389      9.062688  ...       0.377350       0.713957    a
2  FE Map 3  9.326389      8.987990  ...       0.044224       0.456656    b

Example comparison

    Model 1   Model 2  Diff in rho  ...  p-value  significant  p-value corrected
0  FE Map 1  FE Map 2     0.002788  ...    0.666        False              0.666
1  FE Map 1  FE Map 3     0.321357  ...    0.002         True              0.006
2  FE Map 2  FE Map 3     0.318568  ...    0.002         True              0.006

Todos

Notable points that this PR has either accomplished or will accomplish.

  • add more tests
  • add a centralise/shift method to the nodewise comparisons?

Questions

  • Question1

Checklist

  • Added a news entry for new features, bug fixes, or other user facing changes.

Status

  • Ready to go

Tips

  • Comment "pre-commit.ci autofix" to have pre-commit.ci automatically format your PR.
    Since this will create a commit, it is best to make this comment when you are finished with your work.

@codecov

codecov Bot commented Jan 16, 2026

Codecov Report

❌ Patch coverage is 98.67841% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.62%. Comparing base (c91f57d) to head (c80a7af).

Files with missing lines | Patch % | Lines
cinnabar/compare.py | 97.94% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #174      +/-   ##
==========================================
+ Coverage   97.53%   97.62%   +0.09%     
==========================================
  Files          22       24       +2     
  Lines        2470     2697     +227     
==========================================
+ Hits         2409     2633     +224     
- Misses         61       64       +3     

@jthorton

Contributor Author

pre-commit.ci autofix

Comment thread cinnabar/compare.py Outdated


def compare_and_rank_femaps(
    femaps: list[FEMap],

Contributor

I wonder if it would be easier to have a single FEMap object that contains calc data from multiple experiments, using the "source" argument to distinguish between them?

Contributor Author

Yes, that would be easier, as the current method requires you to add the same experimental info to all of the maps you are comparing. However, we would need to rework the MLE calculation on the legacy graph to group by source first and then split into different edges, so we would effectively be doing this under the hood anyway. For now it might be easier to ask users to do this? We could add something which copies the experimental values to all graphs if that would be easier?

Member

I believe that was the intent of the FEMap API - i.e. to have multiple sources so it's easier to do one big comparison rather than multiple objects.

Contributor Author

Thanks both, I have updated this now to use a single FEMap with different sources.

@fjclark

fjclark commented May 11, 2026

Commenting as Josh pointed me here.

I'm excited about this and would love to see this functionality in cinnabar soon! We've also been using the bootstrapping method from https://www.nature.com/articles/s42004-025-01428-y and it would be great to have it implemented here.

My main comment is I'm confused about when to use each of the prediction_type options and when they're valid. I think my confusion could be phrased as: "exactly which distribution are we trying to approximate with bootstrapping, and how much do we violate the IID assumption of bootstrapping for this distribution?". If we start by assuming each "single calculation" (dG for ABFE and ddG for RBFE) is an IID sample from the distribution we're interested in, then to me:

  • For ABFE, nodewise makes sense (as used in the original paper). As I understand it, edgewise and pairwise would introduce correlation (each dG can contribute to many ddGs).
  • For RBFE, edgewise makes sense for ddG stats. nodewise feels wrong because we're no longer resampling single calculations. Our dG estimates are fixed by the MLE from a fixed set of ddG estimates, so the bootstrapped distribution of dGs is generated by a different "process" to the edgewise ddG stats. An (impractical) approach which would be more intuitive to me would be to bootstrap over edges in a way which keeps a valid graph each time and recompute the dGs through the MLE each time to get the dG distribution. However, I guess this is the only practical way to use bootstrapping here. pairwise would introduce correlation because each ddG would contribute to many of the pairwise ddGs.

@jthorton Am I misunderstanding? What's the intended use of pairwise? Thanks!

Though please feel free to ignore -- I am likely overthinking this!

@jthorton

Contributor Author

pre-commit.ci autofix

@jthorton

Contributor Author

Thanks @fjclark, this is great feedback. I agree that naively applying this to all of the available analysis types in Cinnabar is not a good idea, so I have removed the pairwise option and changed the default to edgewise, as I expect RBFE results to be the main use of the function. I agree that the nodewise analysis with the back-calculated absolute values is problematic due to the correlated nature of the values. For now, I recommend using the edgewise analysis for RBFE results in the docs, and we can look into other possible methods to assess differences in ranking in future. Hopefully this can serve as a good starting point though!

@IAlibay IAlibay left a comment

Member

Sorry that took me so long to review. Overall this looks good to me, but I've got a few questions.

Comment thread cinnabar/compare.py
Comment thread cinnabar/compare.py
metrics_to_compute = ["MUE", "RMSE", "RAE", "R2", "rho", "KTAU", "PI"]

# we must compute the rank metric however it is possible to miss it
if rank_metric not in metrics_to_compute:

Member

Maybe not here, but we should think about how best to nudge users away from doing bad things - i.e. maybe we can add a warning that tells folks that it's a bad idea to rank using correlation metrics if you're doing edgewise comparisons.

Contributor Author

Yeah I agree, hopefully the docs guide them in the right direction but a warning would help!
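
A warning of the kind suggested here could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper name `warn_if_correlation_metric`, the message text, and the choice of which metrics count as correlation metrics (`R2`, `rho` and `KTAU` are taken from the metric list in the diff above).

```python
import warnings

# Assumed set of correlation-type metrics; adjust to match the real metric names.
CORRELATION_METRICS = {"R2", "rho", "KTAU"}


def warn_if_correlation_metric(rank_metric, prediction_type):
    """Warn when ranking edgewise results by a correlation metric.

    Hypothetical helper; returns True if a warning was issued.
    """
    if prediction_type == "edgewise" and rank_metric in CORRELATION_METRICS:
        warnings.warn(
            f"Ranking edgewise results by '{rank_metric}' is discouraged; "
            "consider an error metric such as MUE or RMSE instead.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Usage: capture the warning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned = warn_if_correlation_metric("rho", "edgewise")
    quiet = warn_if_correlation_metric("MUE", "edgewise")
```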

Comment thread cinnabar/compare.py Outdated
"CI Lower": lower,
"CI Upper": upper,
"p-value": p_value,
"significant": p_value < 0.05,

Member

Do we want users to set this significance value?

Contributor Author

Good idea, added an option using the standard name alpha.

Comment thread cinnabar/compare.py Outdated
- The comparison method uses a joint bootstrapping procedure that generates a distribution of differences in the rank metric and checks for significant differences using a method inspired by. [1]_
- Each source must be evaluated on the same set of edges.
- Prediction types "nodewise" and "edgewise" correspond to DGs and edgewise DDGs respectively.
- When we have more than 2 models, we apply multiple testing correction to the pairwise comparisons using the ``Holm`` method by default.

Member

Do you have a small explanation of why Holm over any of the other methods?

Contributor Author

Added something and a link to the wiki
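
For readers unfamiliar with the Holm method, the step-down adjustment is short enough to sketch by hand. The function below is a plain-Python illustration, not the call used in this PR (which correction routine the PR uses is not shown in this thread). Holm controls the family-wise error rate without assuming independence between tests and is never less powerful than a plain Bonferroni correction. Applied to the three raw p-values in the example comparison table in the PR description, it reproduces the corrected values shown there.

```python
def holm_correction(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank, starting at 0) is multiplied
        # by (m - k), capped at 1, and kept monotone via a running maximum.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw = [0.666, 0.002, 0.002]
corrected = holm_correction(raw)
print([round(p, 3) for p in corrected])  # [0.666, 0.006, 0.006]
```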

Comment thread cinnabar/tests/test_compare.py Outdated
assert metric in summary_df.columns
for ci in ["Upper", "Lower"]:
    assert f"{metric}_CI_{ci}" in summary_df.columns
summary_df.to_csv("summary_df.csv")

Member

Should this be writing to disk unprotected by a temporary directory?

Contributor Author

Good spot, that's left over from checking the output, removed!

Comment thread cinnabar/tests/test_compare.py
@jthorton

Contributor Author

pre-commit.ci autofix

@jthorton

Contributor Author

pre-commit.ci autofix

@IAlibay IAlibay left a comment

Member

Couple of small things, I'm going to put a "request changes" for some of the questions but it's not majorly blocking.

Comment thread cinnabar/compare.py

# we must compute the rank metric however it is possible to miss it
if rank_metric not in metrics_to_compute:
    metrics_to_compute.append(rank_metric)

Member

[nit] This does an in-place modification of the input kwarg. It's probably safe, but always better to avoid in case someone is using a pre-defined variable somewhere outside of the method call.

Contributor Author

Fixed to use a copy.
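
The pattern behind this fix can be sketched as follows. The helper name `resolve_metrics` and the `DEFAULT_METRICS` constant are hypothetical; the metric names mirror the list shown in the diff above, and the actual change in the PR may be structured differently.

```python
# Assumed default list, mirroring the metric names shown in the diff.
DEFAULT_METRICS = ["MUE", "RMSE", "RAE", "R2", "rho", "KTAU", "PI"]


def resolve_metrics(rank_metric, metrics_to_compute=None):
    """Return the metrics to evaluate without mutating the caller's list."""
    source = DEFAULT_METRICS if metrics_to_compute is None else metrics_to_compute
    metrics = list(source)  # shallow copy, so appending is safe
    if rank_metric not in metrics:
        metrics.append(rank_metric)
    return metrics


# Usage: the caller's list is unchanged after the call.
user_metrics = ["RMSE"]
resolved = resolve_metrics("MUE", user_metrics)
```

Copying at the top of the function also protects the module-level default from being modified across calls.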

Comment thread cinnabar/compare.py
# calculate the p-value as the fraction of bootstrap samples that cross zero
# use a 2-tailed test
# inspired by https://www.nature.com/articles/s42004-025-01428-y
p_value = 2 * min(np.mean(diffs < 0), np.mean(diffs > 0))

Member

If I understand this correctly, there is a chance that the p value ends up being 0 when you have a clear winner. Is this something that would be misleading to users?

Contributor Author

That's right. I did find mention of a method which adds 1 to the numerator and denominator to avoid exact zero answers, but I am not confident that this is the best practice. One option is to push users to use the reported metric differences and the CI around those to interpret the results when this happens; I could add a warning to the docs about it?
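
To make the two options concrete, the sketch below compares the plain estimate with the add-one variant mentioned here. The function name `two_sided_bootstrap_p` and the `add_one` flag are hypothetical, and doubling the add-one tail estimate for a two-sided test is one possible convention rather than an established recommendation for this use case.

```python
def two_sided_bootstrap_p(diffs, add_one=False):
    """Two-sided p-value from bootstrap differences.

    With add_one=True the smaller tail count k is estimated as
    (k + 1) / (n + 1), so the result can never be exactly zero.
    """
    n = len(diffs)
    n_neg = sum(d < 0 for d in diffs)
    n_pos = sum(d > 0 for d in diffs)
    k = min(n_neg, n_pos)
    if add_one:
        p = 2 * (k + 1) / (n + 1)
    else:
        p = 2 * k / n
    return min(1.0, p)


# A clear winner: all 1000 bootstrap differences are negative.
clear_winner = [-1.0] * 1000
p_plain = two_sided_bootstrap_p(clear_winner)
p_add_one = two_sided_bootstrap_p(clear_winner, add_one=True)
```

With 1000 bootstraps the plain estimate is 0.0, while the add-one variant gives 2/1001 (about 0.002), which makes the resolution limit of the bootstrap explicit.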

Comment thread cinnabar/compare.py Outdated

elif prediction_type == "edgewise":
    rel_df = femap.get_relative_dataframe()
    sources = rel_df[rel_df["computational"] == True]["source"].unique()

Member

On line 68 you call to_list() too, whilst here you only call unique(). Should the two be aligned?

Contributor Author

Yes, good catch, changed to tolist().

Comment thread cinnabar/compare.py Outdated
Comment thread cinnabar/compare.py
comparison_df = pd.DataFrame(comparison_data)

# if we have more than 2 models, apply multiple testing correction
if len(sources) > 2:

Member

[nit] If you don't have > 2, you don't have a p-value corrected column. Would it be better for users to always have the same columns and just return the same as the p-value in that case or would it be better to not have the column in that case?

Contributor Author

Having the output be stable would be great but I don't want people to think a correction was applied when it wasn't. I think this is something we could easily add in future though if people want it?

Comment thread cinnabar/compare.py
Comment on lines +17 to +19
num_bootstraps: int = 1_000,
confidence_level: float = 0.95,
alpha: float = 0.05,

Member

[nit] Some bounds checks on some of these arguments could be useful to avoid users passing bad things and getting odd failures.

Contributor Author

Added checks and tests.
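
Bounds checks of this kind might look like the sketch below. The function name `validate_compare_arguments`, the exact bounds, and the error messages are assumptions for illustration; only the three parameter names and their defaults come from the diff above.

```python
def validate_compare_arguments(num_bootstraps=1_000, confidence_level=0.95, alpha=0.05):
    """Raise ValueError for out-of-range arguments (hypothetical helper)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(num_bootstraps, bool) or not isinstance(num_bootstraps, int):
        raise ValueError(f"num_bootstraps must be an integer, got {num_bootstraps!r}")
    if num_bootstraps < 1:
        raise ValueError(f"num_bootstraps must be at least 1, got {num_bootstraps}")
    if not 0.0 < confidence_level < 1.0:
        raise ValueError(f"confidence_level must be in (0, 1), got {confidence_level}")
    if not 0.0 < alpha < 1.0:
        raise ValueError(f"alpha must be in (0, 1), got {alpha}")


# Usage: the defaults pass silently, bad values fail early with a clear message.
validate_compare_arguments()
try:
    validate_compare_arguments(alpha=1.5)
    alpha_error = None
except ValueError as err:
    alpha_error = str(err)
```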

Comment thread cinnabar/compare.py Outdated
@IAlibay

IAlibay commented Jun 15, 2026

Member

Notebook looks great - I don't have any comments about it.

@jthorton jthorton self-assigned this Jun 15, 2026
@jthorton

Contributor Author

pre-commit.ci autofix

4 participants