PCA transform before cosine calculation by niklasmueboe · Pull Request #146 · LazDaria/SegTraQ

niklasmueboe · 2026-05-21T07:34:53Z

This includes:

- optional PCA transformation (fit based on the full cells) before calculating cosine similarity between top/bottom
- optimized cosine similarity calculation (vectorized instead of Python loops)
- no genes are dropped
- cells are filtered early
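
A minimal sketch of the first two points, not the code in this PR: the PCA is fit on the full-cell matrix, the same fitted transform is applied to the top and bottom halves, and the cosine similarity is computed row-wise with NumPy instead of a Python loop. The helper `rowwise_cosine`, the simulated counts, and the plain `log1p` normalization are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA


def rowwise_cosine(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Cosine similarity between matching rows of a and b, without Python loops."""
    dot = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Rows with a (near-)zero norm have no defined angle, so return NaN for them.
    return np.where(norms > eps, dot / np.maximum(norms, eps), np.nan)


# Simulated data (assumption): each full cell is split into a top and bottom half.
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 50
full = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)
top = rng.binomial(full.astype(int), 0.5).astype(float)
bottom = full - top

# Fit once on the full cells, then project both halves with the same loadings.
pca = PCA(n_components=10, random_state=0).fit(np.log1p(full))
top_pcs = pca.transform(np.log1p(top))
bottom_pcs = pca.transform(np.log1p(bottom))

cos_pca = rowwise_cosine(top_pcs, bottom_pcs)
cos_raw = rowwise_cosine(np.log1p(top), np.log1p(bottom))
```

Comparing `cos_pca` with `cos_raw` on real data would be one way to look at the effect of the projection.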

Pearson residuals as a normalization option have been dropped for now.

Potentially to optimize:

- The full cells are calculated from the transcripts dataframe. The count matrix could instead be extracted from one of the tables (AnnData) rather than recalculated.
- The cell-by-gene matrices are currently based on DataFrames and are therefore dense. It may be necessary to move to sparse arrays for large gene panels.
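
As a rough illustration of the sparse option, a cell-by-gene matrix could be built directly from the transcripts table with SciPy. The function name `sparse_cell_by_gene` and the toy column names are hypothetical and do not correspond to the existing `_cell_by_gene_from_transcripts` helper. Passing the full gene list keeps every gene as a column, even if no transcript of it is present.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def sparse_cell_by_gene(tx: pd.DataFrame, cell_key: str, gene_key: str, genes=None):
    """Count transcripts per (cell, gene) into a CSR matrix (hypothetical helper)."""
    cells = pd.Categorical(tx[cell_key])
    gene_cat = pd.Categorical(tx[gene_key], categories=genes)
    # Code -1 marks a missing cell id or a gene that is not in `genes`.
    keep = (cells.codes >= 0) & (gene_cat.codes >= 0)
    data = np.ones(int(keep.sum()), dtype=np.int32)
    mat = sparse.coo_matrix(
        (data, (cells.codes[keep], gene_cat.codes[keep])),
        shape=(len(cells.categories), len(gene_cat.categories)),
    ).tocsr()  # duplicate (cell, gene) entries are summed during conversion
    return mat, cells.categories, gene_cat.categories


# Toy transcripts table (assumption) with one row per detected transcript.
tx_toy = pd.DataFrame(
    {
        "cell_id": ["a", "a", "b", "b", "b"],
        "gene": ["g1", "g1", "g2", "g1", "g3"],
    }
)
all_genes = ["g1", "g2", "g3", "g4"]
counts, cell_ids, gene_ids = sparse_cell_by_gene(tx_toy, "cell_id", "gene", all_genes)
```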

The effect of performing PCA before the cosine calculation still needs to be investigated. It will probably reduce some noise, and it seems able to correct count effects (low-count cells have lower cosine similarity) to some degree. However, it may also hide small gene expression differences originating from "contamination" when only a few transcripts are involved.

this commit includes - optional PCA transformation (fit based on the full cells) before calculating cosine similarity between top/bottom - optimized cosine similarity correction - no genes are dropped

mjemons

Hi Niklas,
Thanks for this PR. I have started integrating PCA into SegTraQ myself, including in other modules where we compare expression vectors. I will leave the PR as is for now and think about how best to apply these changes across the entire package. I will then add you as a reviewer for the "bigger" changes once they are done.
Best, Martin

mjemons · 2026-05-21T08:14:47Z

+        # TODO: probably tables["table"].X can be reused or similar
+        counts_cell = _cell_by_gene_from_transcripts(tx, points_cell_id_key, points_gene_key)
+        counts_cell = counts_cell.reindex(columns=all_genes, fill_value=0)
+        cell_norm = _normalize(counts_cell, normalization, scale)


Why do we normalise the entire dataset but then compute PCA on the two subsets individually?

Actually, since you fit the PCA on the full dataset, it probably also makes sense to normalise and scale on the entire dataset. My only slight worry is that the two subsets might differ considerably in, say, sparsity, and then we would get a kind of "averaged out" result.

> makes sense to normalise and scale on the entire dataset

I agree. For now I just kept it as it was before but probably should be changed.

mjemons · 2026-05-21T08:15:22Z

+        counts_cell = _cell_by_gene_from_transcripts(tx, points_cell_id_key, points_gene_key)
+        counts_cell = counts_cell.reindex(columns=all_genes, fill_value=0)
+        cell_norm = _normalize(counts_cell, normalization, scale)
+        pca = PCA(n_components=n_pcs, random_state=seed).fit(cell_norm)


Also, should we scale before the PCA? Something like StandardScaler could actually replace the log normalization/scaling from above?

Could be an idea, but then you would also need to scale the top/bottom using the same transformation. Also consider that at some point it might make sense to switch to sparse arrays, and I always find StandardScaler a bit awkward there because you can't center the data (subtract the mean) without losing sparsity.
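
To make the two points above concrete, here is a small sketch under assumed toy data: a `StandardScaler` is fit on the full cells and then reused for top and bottom, and for sparse input it is created with `with_mean=False`, which scales by the standard deviation but does not center. The variable names and the `log1p` step are illustrative assumptions, not the PR's `_normalize` function.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy counts (assumption): full cells split into top and bottom halves.
rng = np.random.default_rng(0)
full = rng.poisson(1.0, size=(100, 20)).astype(float)
top = rng.binomial(full.astype(int), 0.5).astype(float)
bottom = full - top

# Dense case: fit mean and variance on the full cells, reuse them for both halves.
scaler = StandardScaler().fit(np.log1p(full))
full_scaled = scaler.transform(np.log1p(full))
top_scaled = scaler.transform(np.log1p(top))
bottom_scaled = scaler.transform(np.log1p(bottom))

# Sparse case: centering would destroy sparsity, so only the variance is scaled.
sparse_scaler = StandardScaler(with_mean=False).fit(sparse.csr_matrix(np.log1p(full)))
top_sparse_scaled = sparse_scaler.transform(sparse.csr_matrix(np.log1p(top)))
```

Note that without centering the cosine similarity is measured relative to the origin rather than the dataset mean, so dense and sparse results are not directly comparable.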

mjemons · 2026-05-21T08:15:57Z

        normalization: str | None = "log",
        min_genes: int = 5,
        min_transcripts: int = 10,
+        n_pcs: int | None = 30,


Potentially, it would be nice to use the percentage of variance explained as a threshold rather than a fixed number of PCs.

Hmm, that makes it more complicated. Also, you then technically don't know in advance how many PCs you need to calculate.
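
For reference, scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` with `svd_solver="full"` and keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below, on simulated data (an assumption), also shows the manual equivalent, which illustrates the concern raised above: the full decomposition has to be computed before the number of PCs is known.

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated matrix (assumption): 5 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))
loadings = rng.normal(size=(5, 40))
x = latent @ loadings + 0.1 * rng.normal(size=(300, 40))

# Variance threshold: keep enough PCs to explain more than 90% of the variance.
pca_var = PCA(n_components=0.9, svd_solver="full").fit(x)
n_selected = pca_var.n_components_

# Manual equivalent: compute all PCs first, then cut at the cumulative ratio.
pca_full = PCA(svd_solver="full").fit(x)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_manual = int(np.searchsorted(cumulative, 0.9, side="right") + 1)
```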

LazDaria · 2026-05-22T15:42:43Z

Needs further exploration, ideally with STPuppeteer, to see:

- whether contamination by lowly expressed genes gets picked up
- whether the first PCs mainly capture variation in total counts, e.g. using PC regression
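
One simple form of PC regression, sketched here on simulated data and not tied to STPuppeteer: regress each PC on the log total counts per cell and report the R². For a single covariate, the R² of a linear regression equals the squared Pearson correlation, which is what the code computes. The simulation, in which cells differ only in sequencing depth, is an assumption chosen so that the first PC is expected to track total counts.

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated counts (assumption): one shared expression profile, varying depth.
rng = np.random.default_rng(0)
n_cells, n_genes = 500, 30
depth = rng.lognormal(mean=0.0, sigma=0.7, size=n_cells)
profile = rng.dirichlet(np.ones(n_genes))
counts = rng.poisson(200 * depth[:, None] * profile[None, :]).astype(float)

pcs = PCA(n_components=5, random_state=0).fit_transform(np.log1p(counts))
log_total = np.log1p(counts.sum(axis=1))

# R^2 of regressing each PC on log total counts (squared Pearson correlation).
r2 = np.array(
    [np.corrcoef(pcs[:, k], log_total)[0, 1] ** 2 for k in range(pcs.shape[1])]
)
```

A high value for the first entries of `r2` on real data would suggest that the leading PCs mainly reflect total counts rather than expression differences.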

PCA transform before cosine calculation

1156ae0

niklasmueboe requested a review from LazDaria May 21, 2026 07:35

mjemons reviewed May 21, 2026

niklasmueboe wants to merge 1 commit into main from hackathon/pca_cosine
