PCA transform before cosine calculation #146
This commit includes:
- optional PCA transformation (fit on all cells) before calculating cosine similarity between top/bottom
- optimized cosine similarity correction
- no genes are dropped
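A minimal sketch of the first item above, for illustration only: the PCA is fit once on the full-cell matrix and the same projection is applied to the top and bottom matrices before a row-wise cosine similarity. The data is synthetic, and `rowwise_cosine`, `top_norm`, and `bottom_norm` are hypothetical names, not necessarily what the PR uses.

```python
import numpy as np
from sklearn.decomposition import PCA


def rowwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching rows of a and b (NaN for zero-norm rows)."""
    num = np.einsum("ij,ij->i", a, b)
    denom = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, num / denom, np.nan)


rng = np.random.default_rng(0)
n_cells, n_genes, n_pcs = 200, 50, 10

# Synthetic stand-ins for per-cell top/bottom counts (assumption: same cells, same gene order).
counts_top = rng.poisson(2.0, size=(n_cells, n_genes))
counts_bottom = rng.poisson(2.0, size=(n_cells, n_genes))

# Log-normalize the full cells and both subsets in the same way.
cell_norm = np.log1p(counts_top + counts_bottom)
top_norm = np.log1p(counts_top)
bottom_norm = np.log1p(counts_bottom)

# Fit on the full cells only, then project both subsets with the same loadings.
pca = PCA(n_components=n_pcs, random_state=0).fit(cell_norm)
top_pcs = pca.transform(top_norm)
bottom_pcs = pca.transform(bottom_norm)

cos = rowwise_cosine(top_pcs, bottom_pcs)
```

Note that `pca.transform` subtracts the mean learned from the full cells, so the cosine is computed on vectors centered with respect to the full-cell data rather than on raw expression vectors.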
mjemons left a comment
Hi Niklas,
Thanks for this PR. I started integrating PCA myself in SegTraQ, also in other modules where we compare expression vectors. I will leave the PR as is for now and think about how best to apply these changes to the entire package. I would then add you as a reviewer on the "bigger" changes once they are done.
Best, Martin
    # TODO: probably tables["table"].X can be reused or similar
    counts_cell = _cell_by_gene_from_transcripts(tx, points_cell_id_key, points_gene_key)
    counts_cell = counts_cell.reindex(columns=all_genes, fill_value=0)
    cell_norm = _normalize(counts_cell, normalization, scale)
Why do we normalise the entire dataset but then compute PCA on the two subsets individually?
Actually, since you fit the PCA on the full dataset, it probably also makes sense to normalise and scale on the entire dataset. My only slight worry is that the two subsets might be very different in, say, sparsity, and then we get a kind of "averaged out" result.
> makes sense to normalise and scale on the entire dataset

I agree. For now I just kept it as it was before, but it should probably be changed.
    counts_cell = _cell_by_gene_from_transcripts(tx, points_cell_id_key, points_gene_key)
    counts_cell = counts_cell.reindex(columns=all_genes, fill_value=0)
    cell_norm = _normalize(counts_cell, normalization, scale)
    pca = PCA(n_components=n_pcs, random_state=seed).fit(cell_norm)
Also, should we scale before the PCA? Something like StandardScaler could actually replace the log norm/scale from above.
Could be an idea, but then you would also need to scale the top/bottom using the same transformation. Also consider that at some point it might make sense to switch to sparse arrays, and there I always find StandardScaler a bit awkward because you can't center the data (subtract the mean) without densifying it.
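To make the two points above concrete, here is a rough sketch on synthetic data (all variable names are hypothetical): the scaler is fit once on the full cells and reused for top/bottom, and for sparse input scikit-learn's `StandardScaler` only works with `with_mean=False`, i.e. it rescales by the standard deviation but does not center.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic full-cell counts, split binomially into top and bottom (assumption for illustration).
counts_cell = rng.poisson(1.0, size=(100, 20))
counts_top = rng.binomial(counts_cell, 0.5)
counts_bottom = counts_cell - counts_top

# Dense case: fit on the full cells, apply the same mean/scale to both subsets.
scaler = StandardScaler().fit(counts_cell.astype(float))
top_scaled = scaler.transform(counts_top.astype(float))
bottom_scaled = scaler.transform(counts_bottom.astype(float))

# Sparse case: centering would densify the matrix, so it has to be switched off.
cell_sparse = sparse.csr_matrix(counts_cell.astype(float))
sparse_scaler = StandardScaler(with_mean=False).fit(cell_sparse)
top_sparse_scaled = sparse_scaler.transform(sparse.csr_matrix(counts_top.astype(float)))

# With the default with_mean=True, fitting on a sparse matrix raises a ValueError.
try:
    StandardScaler().fit(cell_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True
```

PCA itself centers the data, so the missing mean correction matters mainly if the scaled matrix is used directly or passed to a method that does not center.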
    normalization: str | None = "log",
    min_genes: int = 5,
    min_transcripts: int = 10,
    n_pcs: int | None = 30,
Potentially, it would be nice to use percent variance explained as a threshold rather than a fixed number of PCs.
Hmm, that makes it more complicated. Also, you then technically don't know in advance how many PCs you need to calculate.
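For reference, a small sketch of what a variance threshold could look like with scikit-learn, on synthetic low-rank data (the 0.90 threshold and variable names are assumptions). Passing a float in (0, 1) as `n_components` together with `svd_solver="full"` selects the smallest number of PCs whose cumulative explained variance exceeds the threshold. It also illustrates the concern raised in the reply: the full decomposition is computed first and truncated afterwards.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic "normalized expression": 5 latent factors plus a little noise.
latent = rng.normal(size=(300, 5))
loadings = rng.normal(size=(5, 40))
cell_norm = latent @ loadings + 0.1 * rng.normal(size=(300, 40))

# Option 1: let scikit-learn pick the number of PCs for a 90% variance threshold.
threshold = 0.90
pca = PCA(n_components=threshold, svd_solver="full").fit(cell_norm)
n_selected = pca.n_components_

# Option 2: the same selection done by hand from a full PCA.
full = PCA(svd_solver="full").fit(cell_norm)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_manual = int(np.searchsorted(cumulative, threshold, side="right") + 1)
```

A possible compromise would be to keep `n_pcs` as an upper bound and report the variance explained by those PCs, which avoids the full decomposition on large matrices.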
needs further exploration, ideally with STPuppeteer, to see
This includes
Currently, Pearson residuals as a normalization option have been dropped.
Potential points to optimize:
The effect of performing PCA before the cosine calculation needs to be investigated. It will probably reduce some noise, and it seems able to correct the count effect (low-count cells have lower cosine similarity) to some degree. However, it may also hide small gene expression differences originating from "contamination" if only a few transcripts are involved.
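The count effect mentioned above can be reproduced with a toy simulation, sketched here under strong assumptions (one shared expression profile, transcripts assigned to top/bottom with probability 0.5, no contamination at all; `mean_split_cosine` is a hypothetical helper). Even though top and bottom are drawn from the same profile, the raw cosine similarity is clearly lower at low depth.

```python
import numpy as np


def rowwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching rows of a and b (NaN for zero-norm rows)."""
    num = np.einsum("ij,ij->i", a, b)
    denom = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, num / denom, np.nan)


def mean_split_cosine(depth: int, rng: np.random.Generator,
                      n_cells: int = 500, n_genes: int = 50) -> float:
    """Mean top/bottom cosine for cells with `depth` transcripts from one shared profile."""
    profile = rng.dirichlet(np.ones(n_genes))
    counts = rng.multinomial(depth, profile, size=n_cells)
    top = rng.binomial(counts, 0.5)  # each transcript lands in the top half with p = 0.5
    bottom = counts - top
    return float(np.nanmean(rowwise_cosine(top.astype(float), bottom.astype(float))))


rng = np.random.default_rng(0)
low_depth_cos = mean_split_cosine(20, rng)     # low-count cells
high_depth_cos = mean_split_cosine(2000, rng)  # high-count cells
print(f"depth 20: {low_depth_cos:.2f}, depth 2000: {high_depth_cos:.2f}")
```

A similar setup with a few "contaminating" transcripts added to one half could be used to check whether the PCA step hides such small differences.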