Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #174 +/- ##
==========================================
+ Coverage 97.53% 97.62% +0.09%
==========================================
Files 22 24 +2
Lines 2470 2697 +227
==========================================
+ Hits 2409 2633 +224
- Misses 61 64 +3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
# Conflicts: # cinnabar/tests/test_compare.py
|
|
||
|
|
||
| def compare_and_rank_femaps( | ||
| femaps: list[FEMap], |
There was a problem hiding this comment.
I wonder if it would be easier to have a single FEMap object that contains calc data from multiple experiments, using the "source" argument to distinguish between?
There was a problem hiding this comment.
Yes that would be easier as the current method requires you to add the same experimental info to all of the maps you are comparing, but we would need to rework the MLE calculation on the legacy graph to group by source first and then split into different edges so kind of doing this under the hood, for now it might be easier to ask users to do this? We could add something which copies the experimental values to all graphs if that would be easier?
There was a problem hiding this comment.
I believe that was the intent of the FEMap API - i.e. to have multiple sources so it's easier to do one big comparison rather than multiple objects.
There was a problem hiding this comment.
Thanks both, I have updated this now to use a single FEMap with different sources.
|
Commenting as Josh pointed me here. I'm excited about this and would love to see this functionality in My main comment is I'm confused about when to use each of the
@jthorton Am I misunderstanding? What's the intended use of Though please feel free to ignore -- I am likely overthinking this! |
# Conflicts: # cinnabar/stats.py
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
|
Thanks @fjclark this is great feedback. I agree that naively applying this to all of the available analysis types in Cinnabar is not a good idea and so I have removed the pairwise option and changed the default to edgewise, as I expect RBFE results to be the main use of the function. I agree that the nodewise analysis with the back-calculated absolute values is problematic due to the correlated nature of the values. For now, I recommend using the edgewise analysis for RBFE results in the docs and we can look into other possible methods in future to assess differences in ranking, hopefully this can serve as a good starting point though! |
IAlibay
left a comment
There was a problem hiding this comment.
Sorry that took me so long to review. Overall this looks good to me, but I've got a few questions.
| metrics_to_compute = ["MUE", "RMSE", "RAE", "R2", "rho", "KTAU", "PI"] | ||
|
|
||
| # we must compute the rank metric however it is possible to miss it | ||
| if rank_metric not in metrics_to_compute: |
There was a problem hiding this comment.
Maybe not here, but we should think about how best to nudge users away from doing bad things - i.e. maybe we can add a warning that tells folks that it's a bad idea to rank using correlation metrics if you're doing edgewise comparisons.
There was a problem hiding this comment.
Yeah I agree, hopefully the docs guide them in the right direction but a warning would help!
| "CI Lower": lower, | ||
| "CI Upper": upper, | ||
| "p-value": p_value, | ||
| "significant": p_value < 0.05, |
There was a problem hiding this comment.
Do we want users to set this significance value?
There was a problem hiding this comment.
Good idea added an option using the standard name alpha.
| - The comparison method uses a joint bootstrapping procedure that generates a distribution of differences in the rank metric and checks for significant differences using a method inspired by. [1]_ | ||
| - Each source must be evaluated on the same set of edges. | ||
| - Prediction types "nodewise" and "edgewise" correspond to DGs and edgewise DDGs respectively. | ||
| - When we have more than 2 models, we apply multiple testing correction to the pairwise comparisons using the ``Holm`` method by default. |
There was a problem hiding this comment.
Do you have a small explanation of why Holm over any of the other methods?
There was a problem hiding this comment.
Added something and a link to the wiki
| assert metric in summary_df.columns | ||
| for ci in ["Upper", "Lower"]: | ||
| assert f"{metric}_CI_{ci}" in summary_df.columns | ||
| summary_df.to_csv("summary_df.csv") |
There was a problem hiding this comment.
Should this be writing to disk unprotected by a temporary directory?
There was a problem hiding this comment.
Good spot thats left over from checking the output, removed!
# Conflicts: # docs/tutorials/index.rst
|
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
|
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
IAlibay
left a comment
There was a problem hiding this comment.
Couple of small things, I'm going to put a "request changes" for some of the questions but it's not majorly blocking.
|
|
||
| # we must compute the rank metric however it is possible to miss it | ||
| if rank_metric not in metrics_to_compute: | ||
| metrics_to_compute.append(rank_metric) |
There was a problem hiding this comment.
[nit] This does an in-place modification of the input kwarg. It's probably safe, but always better to avoid in case someone is using a pre-defined variable somewhere outside of the method call.
There was a problem hiding this comment.
Fixed to use a copy.
| # calculate the p-value as the fraction of bootstrap samples that cross zero | ||
| # use a 2-tailed test | ||
| # inspired by https://www.nature.com/articles/s42004-025-01428-y | ||
| p_value = 2 * min(np.mean(diffs < 0), np.mean(diffs > 0)) |
There was a problem hiding this comment.
If I understand this correctly, there is a chance that the p value ends up being 0 when you have a clear winner. Is this something that would be misleading to users?
There was a problem hiding this comment.
Thats right, I did find mention of a method which adds 1 to the numerator and denominator to avoid exact zero answers but I am not confident that this is the best practice. One option is to push users to use the reported metric differences and CI around those to interpret the results when this happens, I could add a warning to the docs about it?
|
|
||
| elif prediction_type == "edgewise": | ||
| rel_df = femap.get_relative_dataframe() | ||
| sources = rel_df[rel_df["computational"] == True]["source"].unique() |
There was a problem hiding this comment.
On line 68 you to to_list() too, whilst here you only do unique(). Should the two be aligned?
There was a problem hiding this comment.
Yes good catch, changed to tolist()
| comparison_df = pd.DataFrame(comparison_data) | ||
|
|
||
| # if we have more than 2 models, apply multiple testing correction | ||
| if len(sources) > 2: |
There was a problem hiding this comment.
[nit] If you don't have > 2, you don't have a p-value corrected column. Would it be better for users to always have the same columns and just return the same as the p-value in that case or would it be better to not have the column in that case?
There was a problem hiding this comment.
Having the output be stable would be great but I don't want people to think a correction was applied when it wasn't. I think this is something we could easily add in future though if people want it?
| num_bootstraps: int = 1_000, | ||
| confidence_level: float = 0.95, | ||
| alpha: float = 0.05, |
There was a problem hiding this comment.
[nit] Some bounds checks on some of these arguments could be useful to avoid users passing bad things and getting odd failures.
There was a problem hiding this comment.
Added checks and tests.
|
Notebook looks great - I don't have any comments about it. |
# Conflicts: # docs/tutorials/index.rst
|
pre-commit.ci autofix |
Description
This PR adds a general compare and ranking function for a collection of FEMaps which contain results for the same edges which should be compared for any significant differences in performance. The FEMaps are compared by the user chosen metric, by default the mean unsigned error, via the distribution of differences of that metric under a joint bootstrapping procedure. A two-sided p-value is then determined via the fraction of negative or positive differences in the metric (whichever is smaller), a multitest correction scheme is then used to correct the p-values which are used to rank the FEMaps using the compact letter display (CLD) assigned via the insert-absorb method.
The result is two pandas dataframes, the first contains all evaluated metrics for each FEMap with confidence intervals and the CLD rank, the second contains the raw comparison data of the metric including p-values.
Example stats dataframe
Example comparsion
Todos
Notable points that this PR has either accomplished or will accomplish.
Questions
Checklist
newsentry for new features, bug fixes, or other user facing changes.Status
Tips
Since this will create a commit, it is best to make this comment when you are finished with your work.