import-rna: add --min-sample-fraction for sparse/single-cell cohorts (#448) #1107
Merged
…448)

filter_probes hard-coded the gene-expression filter at 'expressed in at least half of samples', which is too strict for single-cell or sparse cohorts where most genes are expressed in fewer than half of cells (reported in GH #448). Expose the threshold as a parameter.

The legacy filter kept a gene when median(counts) >= 1. This generalizes it to keep a gene when its (1 - min_sample_fraction) quantile of counts is >= 1, which reduces to the median exactly at the default 0.5, so existing output is preserved bit-exact (verified end-to-end on the RNA fixture). A naive 'fraction of samples with count >= 1' reimplementation would NOT be bit-exact (e.g. counts [0, 1]: median 0.5 drops, but a 0.5 expressed-fraction keeps), so the quantile form is the correct generalization.

- rna.filter_probes gains min_sample_fraction (default 0.5), validates [0, 1], and short-circuits empty input (quantile(axis=1) raises on an empty frame where the old median(axis=1) returned empty).
- Plumbed through import_rna.do_import_rna and the import-rna CLI as --min-sample-fraction.
- doc/rna.rst documents the flag, the single-cell use case, and a noise warning below ~0.2.

Tests: behavior across N/M vs fraction, monotonic permissiveness, bit-exact default vs legacy median, empty/single-sample edges, and AST-walk plumbing guards (CLI -> do_import_rna -> filter_probes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
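The counts [0, 1] case from the commit message above can be checked directly. This is a minimal standard-library sketch, not code from the PR; the variable names are illustrative.

```python
from statistics import median

counts = [0, 1]  # one gene measured in two samples

# Legacy rule: keep the gene when the median count is >= 1.
legacy_keep = median(counts) >= 1

# Naive alternative: keep when the fraction of samples with count >= 1
# reaches the threshold (0.5 here).
expressed_fraction = sum(c >= 1 for c in counts) / len(counts)
naive_keep = expressed_fraction >= 0.5

print(median(counts), legacy_keep)      # 0.5 False
print(expressed_fraction, naive_keep)   # 0.5 True
```

The two rules disagree on this gene, which is why a fraction-based rewrite would change existing output at the default.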
The RuntimeError raised for an unknown --format value was missing its f
prefix, so the message displayed the literal '{in_format!r}' instead of
the offending value. Surfaced by code review of #448; the unknown-format
test now asserts the bad name appears in the message.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
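The missing-prefix bug described in this commit is easy to reproduce in isolation. The sketch below is illustrative only: the message text and the bad value are assumptions, not the strings used in the codebase.

```python
in_format = "bogus"  # hypothetical bad value passed to --format

# Missing f prefix: the braces are kept as literal text.
broken = "Unknown format name: {in_format!r}"

# With the f prefix, the repr of the offending value is interpolated.
fixed = f"Unknown format name: {in_format!r}"

# The kind of check the updated test makes: the bad name must appear.
assert "bogus" not in broken
assert "'bogus'" in fixed
```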
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1107      +/-   ##
==========================================
+ Coverage   70.73%   70.75%   +0.01%
==========================================
  Files          73       73
  Lines        7946     7951       +5
  Branches     1405     1407       +2
==========================================
+ Hits         5621     5626       +5
  Misses       1881     1881
  Partials      444      444
Summary

import-rna hard-coded its gene-expression filter at "expressed in at least half of the samples" (filter_probes: median(counts) >= 1). As reported in #448, this is too strict for single-cell or otherwise sparse cohorts, where most genes are legitimately expressed in fewer than half of cells and the default discards informative genes.

This adds a --min-sample-fraction option (default 0.5) plumbed through the CLI → do_import_rna → filter_probes.

Design: bit-exact at the default

The legacy filter kept a gene when median(counts) >= 1. This generalizes it to keep a gene when its (1 - min_sample_fraction) quantile of per-sample counts is >= 1. Since the median is the 0.5 quantile, the default reduces to the old rule bit-exact, verified end-to-end on the RNA fixture (summary table and every .cnr log2 column identical).

A naive "fraction of samples with count ≥ 1" reimplementation would not be bit-exact: for example, counts [0, 1] have median 0.5 (dropped) but a 0.5 expressed-fraction (kept). The quantile form is therefore the correct generalization.

Changes

- cnvlib/rna.py: filter_probes gains min_sample_fraction (default 0.5); validates [0, 1]; short-circuits empty input (quantile(axis=1) raises ValueError: no types given on an empty frame, where the old median(axis=1) returned empty).
- cnvlib/import_rna.py: do_import_rna gains min_sample_fraction, forwarded to filter_probes.
- cnvlib/commands.py: new --min-sample-fraction flag on import-rna.
- doc/rna.rst: documents the flag, the single-cell use case, and a noise warning for values below ~0.2.

Tests (test/test_rna.py)

- Behavior across N/M vs. fraction: N/M >= min_sample_fraction across the full N=0..10 grid and fractions 0.1..1.0.
- Monotonic permissiveness: a gene kept at 0.2 that 0.5 drops.
- Default matches the legacy median >= 1 filter bit-exact on a random matrix.
- Empty and single-sample edge cases.
- AST-walk plumbing guards: CLI → do_import_rna → filter_probes.

Clinical impact

None at the default (output preserved bit-exact). Users who lower the threshold retain more genes in the output .cnr, which may shift downstream segmentation. This is a deliberate, documented trade-off for sparse data.

Closes #448.
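The quantile rule described under Design can be sketched roughly as follows. This is a hedged illustration, not the code in cnvlib/rna.py: the function name filter_by_quantile, the toy matrix, and the error message are all assumptions. It relies only on pandas' DataFrame.quantile and DataFrame.median with axis=1.

```python
import numpy as np
import pandas as pd


def filter_by_quantile(counts: pd.DataFrame,
                       min_sample_fraction: float = 0.5) -> pd.DataFrame:
    """Keep genes (rows) whose (1 - min_sample_fraction) quantile is >= 1.

    Hypothetical stand-in for the behavior described above.
    """
    if not 0 <= min_sample_fraction <= 1:
        raise ValueError(
            f"min_sample_fraction must be in [0, 1], got {min_sample_fraction!r}"
        )
    if counts.empty:
        # Short-circuit so quantile() is never called on an empty frame.
        return counts
    row_quantile = counts.quantile(1 - min_sample_fraction, axis=1)
    return counts[row_quantile >= 1]


rng = np.random.default_rng(0)
# Sparse toy matrix (an assumption): 200 genes x 10 samples, mostly zeros.
toy = pd.DataFrame(rng.poisson(0.4, size=(200, 10)))

default_kept = filter_by_quantile(toy)        # default 0.5
legacy_kept = toy[toy.median(axis=1) >= 1]    # legacy median rule
relaxed_kept = filter_by_quantile(toy, 0.2)   # more permissive threshold
```

On this toy matrix the default selection coincides with the legacy median rule, and lowering the fraction to 0.2 can only add genes, because a row's 0.8 quantile is never below its 0.5 quantile.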
🤖 Generated with Claude Code