Skip to content

feat: allow merge argument for DatasetCollection.add_adatas#198

Open
ilan-gold wants to merge 8 commits into
mainfrom
ig/subset_before_concat
Open

feat: allow merge argument for DatasetCollection.add_adatas#198
ilan-gold wants to merge 8 commits into
mainfrom
ig/subset_before_concat

Conversation

@ilan-gold
Copy link
Copy Markdown
Collaborator

@ilan-gold ilan-gold commented Apr 21, 2026

To do this we need to disallow Dataset2D in anndata.conat because it does not respect merge arguments...Luckily, these df's are generally small.

It might be time to start thinking about a settings object.

@ilan-gold ilan-gold added the skip-gpu-ci Whether gpu ci should be skipped label Apr 21, 2026
Comment thread src/annbatch/io.py
Comment on lines +372 to +373
if var_subset is not None:
adata = adata[:, adata.var.index.isin(var_subset)]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also slipped this in - I don't think there's much point in outer joining genes if we're just gonna subset them anyway. It seems like we should just subset first and then outer join. I think this is just a distributive law of sets.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #198 (comment) for a bit more info on a small implication of this, which I think is actually more "correct" anyway

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.68%. Comparing base (9dc1ba2) to head (28468d5).

Files with missing lines Patch % Lines
src/annbatch/io.py 93.10% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #198      +/-   ##
==========================================
- Coverage   93.48%   91.68%   -1.81%     
==========================================
  Files          14       14              
  Lines        1121     1118       -3     
==========================================
- Hits         1048     1025      -23     
- Misses         73       93      +20     
Files with missing lines Coverage Δ
src/annbatch/io.py 91.77% <93.10%> (+0.44%) ⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ilan-gold ilan-gold changed the title feat: allow merge argument for add_adatas feat: allow merge argument for DatasetCollection.add_adatas Apr 21, 2026
@ilan-gold ilan-gold marked this pull request as ready for review April 22, 2026 09:55
@ilan-gold
Copy link
Copy Markdown
Collaborator Author

ilan-gold commented Apr 22, 2026

Ok so here's an interesting problem: by subsetting the gene space first and then concatenating with an outer join, for unique merge, we end up picking up a column that is made to be uniquely identifiable as the other objects (potentially) because its index becomes identical and therefore there is only one option for that column:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "anndata",
# ]
# ///


from __future__ import annotations

import numpy as np
import pandas as pd

import anndata as ad

adatas = [
    ad.AnnData(
        X=np.ones((1, 2)),
        obs=pd.DataFrame({"1": ["a"], "2": ["b"]}),
        var=pd.DataFrame(index=["1", "2"], data={"column_1": [1, 2]}),
    ),
    ad.AnnData(
        X=np.ones((1, 2)),
        obs=pd.DataFrame({"1": ["a"], "3": ["b"]}),
        var=pd.DataFrame(index=["1", "3"], data={"column_1": [1, 2]}),
    ),
]
print(ad.concat(adatas, join="outer", merge="unique")[:, ["1"]].var.columns)  # empty
print(
    ad.concat(
        [adata[:, ["1"]] for adata in adatas], join="outer", merge="unique"
    ).var.columns
)  # has column_1

So the question is whether or not we have this. It's a subtle behavioral difference from anndata.concat but I think it is actually correct for our use-case.

@ilan-gold ilan-gold requested a review from felix0097 April 22, 2026 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-gpu-ci Whether gpu ci should be skipped

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant