Skip to content

PCA transform before cosine calculation#146

Draft
niklasmueboe wants to merge 1 commit into
mainfrom
hackathon/pca_cosine
Draft

PCA transform before cosine calculation#146
niklasmueboe wants to merge 1 commit into
mainfrom
hackathon/pca_cosine

Conversation

@niklasmueboe
Copy link
Copy Markdown
Collaborator

This includes

  • optional PCA transformation (fit based on the full cells) before calculating cosine similarity between top/bottom
  • optimized cosine similarity calculation (vectorized instead of python loops)
  • no genes are dropped
  • cells are filtered early

Currently Pearson residuals for the normalization has been dropped.

Potentially to optimize:

  • The full cells are calculated from the transcripts dataframe. The count matrix could also be extracted from one of the tables (AnnData) instead of recalculating
  • The cell-by-gene are currently based on DataFrames and therefore dense. It may be necessary to move to SparseArrays for large gene panels

Effect of performing PCA before cosine calculation needs to be investigated. Probably it will reduce some noise and it seems to be able to correct the count effects (low count cells have lower cosine similarity) to some degree. But it may also hide small gene expression differences originating from "contamination" if it is only a few transcripts.

this commit includes
- optional PCA transformation (fit based on the full cells) before calculating cosine similarity between top/bottom
- optimized cosine similarity correction
- no genes are dropped
@niklasmueboe niklasmueboe requested a review from LazDaria May 21, 2026 07:35
Copy link
Copy Markdown
Collaborator

@mjemons mjemons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Niklas,
Thanks for this PR. I started integrating PCA myself in SegTraQ, also in other modules where we compare expression vectors. I will leave the PR as is for now and think how to best apply these changes to the entire package. I would then add you as reviewer to the "bigger" changes, once they are done.
Best, Martin

Comment thread src/segtraq/vl/volume.py
# TODO: probably tables["table"].X can be reused or similar
counts_cell = _cell_by_gene_from_transcripts(tx, points_cell_id_key, points_gene_key)
counts_cell = counts_cell.reindex(columns=all_genes, fill_value=0)
cell_norm = _normalize(counts_cell, normalization, scale)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we normalise the entire dataset but then compute PCA on the two subsets individually?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, since you fit the PCA on the full dataset it probably also makes sense to normalise and scale on the entire dataset. My worry is just a bit, that they might actually be very different in say sparsity and then we get kind of an "averaged out" result

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to normalise and scale on the entire dataset

I agree. For now I just kept it as it was before but probably should be changed.

Comment thread src/segtraq/vl/volume.py
counts_cell = _cell_by_gene_from_transcripts(tx, points_cell_id_key, points_gene_key)
counts_cell = counts_cell.reindex(columns=all_genes, fill_value=0)
cell_norm = _normalize(counts_cell, normalization, scale)
pca = PCA(n_components=n_pcs, random_state=seed).fit(cell_norm)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should we scale before the PCA? smth like StandardScaler could actually replace the log norm/scale from above?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be an idea, but then you would also need to scale the top/bottom using the same transformation? Also consider that at some point it might make sense to switch to sparse arrays and then I always find StandardScaler a bit awkward because you can't correct the mean

Comment thread src/segtraq/SegTraQ.py
normalization: str | None = "log",
min_genes: int = 5,
min_transcripts: int = 10,
n_pcs: int | None = 30,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially, it would be nice to go via percent variance explained as a threshold rather than the number of PCs.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, makes it more complicated, also you technically then don't know how many PCs you need to calculate

@LazDaria
Copy link
Copy Markdown
Owner

needs further exploration, ideally with STPuppeteer, to see

  1. whether contamination by lowly expressed genes gets picked up.
  2. whether the first PCs mainly capture variation in total counts, e.g. using PC regression

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants