import-rna: add --min-sample-fraction for sparse/single-cell cohorts (#448) #1107

Merged: etal merged 2 commits into master from feature-cnvkit-9b7-import-rna-min-sample-fraction on Jun 14, 2026
etal (Owner) commented Jun 14, 2026

Summary

import-rna hard-coded its gene-expression filter at "expressed in at least half of the samples" (filter_probes: median(counts) >= 1). As reported in #448, this is too strict for single-cell or otherwise sparse cohorts, where most genes are legitimately expressed in fewer than half of cells and the default discards informative genes.

This adds a --min-sample-fraction option (default 0.5), plumbed through the CLI → do_import_rna → filter_probes.

Design: bit-exact at the default

The legacy filter kept a gene when median(counts) >= 1. This generalizes it to keep a gene when its (1 - min_sample_fraction) quantile of per-sample counts is >= 1. Since the median is the 0.5 quantile, the default reduces to the old rule bit-exact — verified end-to-end on the RNA fixture (summary table and every .cnr log2 column identical).
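A minimal sketch of the quantile rule described above, assuming a pandas frame with genes as rows and samples as columns. The name `filter_by_quantile` and the random test matrix are illustrative stand-ins, not the code in cnvlib/rna.py.

```python
import numpy as np
import pandas as pd


def filter_by_quantile(counts: pd.DataFrame,
                       min_sample_fraction: float = 0.5) -> pd.DataFrame:
    """Keep genes whose (1 - min_sample_fraction) quantile of counts is >= 1.

    Hypothetical stand-in for the filter described in this PR.
    """
    if not 0.0 <= min_sample_fraction <= 1.0:
        raise ValueError(
            f"min_sample_fraction must be in [0, 1], got {min_sample_fraction!r}"
        )
    if counts.empty:
        # Return an empty frame unchanged instead of computing a quantile
        return counts
    # Row-wise quantile across samples; at the default this is the median
    cutoff = counts.quantile(1.0 - min_sample_fraction, axis=1)
    return counts[cutoff >= 1]


# Compare against the legacy median rule on a random sparse count matrix
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.poisson(0.8, size=(200, 11)),
    index=[f"gene{i}" for i in range(200)],
)
legacy = counts[counts.median(axis=1) >= 1]
assert filter_by_quantile(counts).equals(legacy)
```

Lowering `min_sample_fraction` raises the quantile being tested, so a gene with a few high counts among many zeros can pass where the median rule would drop it.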

A naive "fraction of samples with count ≥ 1" reimplementation would not be bit-exact — e.g. counts [0, 1] have median 0.5 (dropped) but a 0.5 expressed-fraction (kept) — so the quantile form is the correct generalization.
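The two-sample counterexample can be checked directly. This is only an illustration of the arithmetic with NumPy; the variable names are made up for the example.

```python
import numpy as np

counts = np.array([0, 1])

# Legacy rule: the median of [0, 1] is 0.5, which is below 1, so the gene is dropped
median_rule = bool(np.median(counts) >= 1)

# Naive rule: half of the samples have count >= 1, so a 0.5 threshold keeps the gene
fraction_rule = bool(np.mean(counts >= 1) >= 0.5)

print(median_rule, fraction_rule)  # prints: False True
```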

Changes

  • cnvlib/rna.py: filter_probes gains min_sample_fraction (default 0.5), validates that it lies in [0, 1], and short-circuits empty input (quantile(axis=1) raises ValueError: no types given on an empty frame, where the old median(axis=1) returned empty).
  • cnvlib/import_rna.py: do_import_rna gains min_sample_fraction, forwarded to filter_probes.
  • cnvlib/commands.py: new --min-sample-fraction flag on import-rna.
  • doc/rna.rst: documents the flag, the single-cell use case, and a noise warning for values below ~0.2.
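The plumbing in the list above might look roughly like the following sketch. Only the flag name, the two function names, and the default of 0.5 come from this PR; the function bodies, the argument lists, and the help text are placeholders invented for illustration.

```python
import argparse


def filter_probes(sample_counts, min_sample_fraction=0.5):
    """Placeholder body: validate the fraction and echo what was received."""
    if not 0.0 <= min_sample_fraction <= 1.0:
        raise ValueError(
            f"min_sample_fraction must be in [0, 1], got {min_sample_fraction!r}"
        )
    return {"n_genes": len(sample_counts),
            "min_sample_fraction": min_sample_fraction}


def do_import_rna(sample_counts, min_sample_fraction=0.5):
    """Placeholder body: forward the keyword argument unchanged."""
    return filter_probes(sample_counts, min_sample_fraction=min_sample_fraction)


# Hypothetical parser carrying only the new flag
parser = argparse.ArgumentParser(prog="import-rna-sketch")
parser.add_argument(
    "--min-sample-fraction",
    type=float,
    default=0.5,
    help="Minimum fraction of samples in which a gene must be expressed.",
)

args = parser.parse_args(["--min-sample-fraction", "0.2"])
result = do_import_rna(["geneA", "geneB"],
                       min_sample_fraction=args.min_sample_fraction)
print(result)
```

Keeping the same default at every layer means that omitting the flag behaves the same as calling filter_probes with no extra argument.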

Tests (test/test_rna.py)

  • Gene retained iff N/M >= min_sample_fraction across the full N=0..10 grid and fractions 0.1..1.0.
  • Lower fraction is monotonically more permissive; rescues a sparsely-expressed gene at 0.2 that 0.5 drops.
  • Default reproduces the legacy median >= 1 filter bit-exact on a random matrix.
  • Edge cases: empty frame (regression guard), single-sample, invalid fractions raise.
  • AST-walk plumbing guards that the kwarg is forwarded CLI → do_import_rna → filter_probes.
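One way an AST-walk plumbing guard could be written is sketched below with the standard `ast` module. The helper `forwards_kwarg` and the inline `SOURCE` snippet are assumptions made for the example; the actual tests in test/test_rna.py may be structured differently.

```python
import ast

# Stand-in source text; a real guard would read the module file instead
SOURCE = '''
def do_import_rna(gene_count_fnames, min_sample_fraction=0.5):
    counts = load(gene_count_fnames)
    return filter_probes(counts, min_sample_fraction=min_sample_fraction)
'''


def forwards_kwarg(source: str, caller: str, callee: str, kwarg: str) -> bool:
    """Return True if `caller` contains a call to `callee` passing `kwarg=`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == caller:
            for call in ast.walk(node):
                if not isinstance(call, ast.Call):
                    continue
                func = call.func
                # Handle both bare names (f(...)) and attributes (mod.f(...))
                if isinstance(func, ast.Name):
                    name = func.id
                else:
                    name = getattr(func, "attr", None)
                if name == callee and any(kw.arg == kwarg for kw in call.keywords):
                    return True
    return False


assert forwards_kwarg(SOURCE, "do_import_rna", "filter_probes",
                      "min_sample_fraction")
```

A guard like this fails if a later refactor stops passing the keyword, even when the default value keeps the behavioral tests passing.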

Clinical impact

None at the default (output preserved bit-exact). Users who lower the threshold retain more genes in the output .cnr, which may shift downstream segmentation — a deliberate, documented trade-off for sparse data.

Closes #448.

🤖 Generated with Claude Code

etal and others added 2 commits June 14, 2026 16:23
…448)

filter_probes hard-coded the gene-expression filter at 'expressed in at
least half of samples', which is too strict for single-cell or sparse
cohorts where most genes are expressed in fewer than half of cells
(reported in GH #448). Expose the threshold as a parameter.

The legacy filter kept a gene when median(counts) >= 1. This generalizes
it to keep a gene when its (1 - min_sample_fraction) quantile of counts
is >= 1, which reduces to the median exactly at the default 0.5, so
existing output is preserved bit-exact (verified end-to-end on the RNA
fixture). A naive 'fraction of samples with count >= 1' reimplementation
would NOT be bit-exact (e.g. counts [0, 1]: median 0.5 drops, but a 0.5
expressed-fraction keeps), so the quantile form is the correct
generalization.

- rna.filter_probes gains min_sample_fraction (default 0.5), validates
  [0, 1], and short-circuits empty input (quantile(axis=1) raises on an
  empty frame where the old median(axis=1) returned empty).
- Plumbed through import_rna.do_import_rna and the import-rna CLI as
  --min-sample-fraction.
- doc/rna.rst documents the flag, the single-cell use case, and a noise
  warning below ~0.2.

Tests: behavior across N/M vs fraction, monotonic permissiveness,
bit-exact default vs legacy median, empty/single-sample edges, and
AST-walk plumbing guards (CLI -> do_import_rna -> filter_probes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The RuntimeError raised for an unknown --format value was missing its f
prefix, so the message displayed the literal '{in_format!r}' instead of
the offending value. Surfaced by code review of #448; the unknown-format
test now asserts the bad name appears in the message.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
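The missing f prefix described in the second commit can be reproduced in a few lines. The message wording here is invented for illustration; only the `{in_format!r}` placeholder comes from the commit message.

```python
in_format = "bogus"

# Without the f prefix the braces are kept as literal text
broken = "Unknown format name: {in_format!r}"

# With the f prefix the placeholder is replaced by repr(in_format)
fixed = f"Unknown format name: {in_format!r}"

assert "{in_format!r}" in broken
assert "'bogus'" in fixed
```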
codecov Bot commented Jun 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.75%. Comparing base (240aca1) to head (7933b6d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1107      +/-   ##
==========================================
+ Coverage   70.73%   70.75%   +0.01%     
==========================================
  Files          73       73              
  Lines        7946     7951       +5     
  Branches     1405     1407       +2     
==========================================
+ Hits         5621     5626       +5     
  Misses       1881     1881              
  Partials      444      444              
Flag Coverage Δ
unittests 70.75% <100.00%> (+0.01%) ⬆️


etal merged commit 849a32b into master on Jun 14, 2026
13 checks passed
etal deleted the feature-cnvkit-9b7-import-rna-min-sample-fraction branch on June 14, 2026 23:52
Development

Successfully merging this pull request may close these issues.

import-rna: Parameterize fraction of samples with gene expression