Good day, I have run Spectra successfully using both CPU and GPU. Ho

Unexpected spectra factors and some misc questions about spectra HOT 6 CLOSED

dpeerlab commented on August 31, 2024

Unexpected spectra factors and some misc questions

from spectra.

Comments (6)

wallet-maker commented on August 31, 2024 1

Hi Cristian,

thanks for your feedback.

Regarding your first question:

These behaviors are expected.

There are three aspects: a) multiple factors getting named by the same gene set, b) cell type specific factors getting named by gene sets from a different cell type, and c) a factor not getting a gene set name assigned.

a) By design, there is no 1:1 mapping between gene sets and factors. The factors are named by their overlap coefficient with all input gene sets, so cell type specific gene sets are also used for naming all of the factors. But don't worry, Spectra only used them to fit the factors for the cell types you indicated. Maybe I will constrain the factor naming to the gene sets that were used for fitting the factors of the respective cell type. Personally, I felt it was helpful to perform the naming by computing the overlap of all gene sets vs all factors, because when you set a gene set as cell type specific, Spectra may still discover a similar factor outside of that cell type in an unsupervised way, that is, without using the gene set. Let me know if that makes sense.

b) Because of this lack of 1:1 mapping, gene sets which are not very coherent (maybe they are a mix of several processes) can be split into several factors that are named by the same gene set. You may want to look at the marker genes for each of these factors. I suspect they will use different genes from the gene set. We believe this behavior is desired, as the gene sets cannot be regarded as ground truth and may contain several ground truth processes, which should then be split into distinct factors by Spectra. I think this behavior will be reduced if you reduce lambda; however, this will also lead to stronger supervision by the gene sets, so you learn less from the data. If you have a lot more factors than gene sets, you can try reducing the number of factors first. Another approach is to take a closer look at the marker genes and see whether there are some important subprocesses in your gene set and whether the gene set can be split and renamed or made more coherent. Of course there are also other ways to name the factors, e.g. by running GSEA on the marker genes / gene scores / loadings.

c) If the overlap coefficient between a gene set and the factor's marker genes is lower than the threshold in est_spectra (default of overlap_threshold = 0.2), the factor does not get a gene set assigned, so it is just called 'factor-index-X-celltype-X-factor-index', with cell type being the cell type of that factor and the factor index just the index in the factors x gene loadings/scores matrix. The factor names should be unique though, and this is also what I see from the data you posted, correct?
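
To make points a) to c) concrete, here is a minimal sketch of threshold-based naming. It assumes the standard overlap coefficient, |A ∩ B| / min(|A|, |B|), and it is not Spectra's actual labeling code: the function names, the fallback label, and the toy gene sets are all hypothetical.

```python
def overlap_coefficient(a, b):
    """Overlap coefficient of two gene collections: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def name_factor(marker_genes, gene_sets, threshold=0.2, fallback="unlabeled"):
    """Return (name, score) of the best overlapping gene set.

    Hypothetical helper: every factor is compared against ALL gene sets,
    so nothing prevents two factors from receiving the same name.
    """
    best_name, best_score = None, 0.0
    for name, genes in gene_sets.items():
        score = overlap_coefficient(marker_genes, genes)
        if score > best_score:
            best_name, best_score = name, score
    # Below the threshold the factor keeps a generic label instead.
    if best_name is None or best_score < threshold:
        return fallback, best_score
    return best_name, best_score


# Toy data (invented for illustration only).
gene_sets = {
    "glycolysis": ["HK2", "PFKP", "LDHA", "ENO1"],
    "ifn_response": ["ISG15", "IFIT1", "MX1"],
}
factor_markers = {
    "f0": ["LDHA", "ENO1", "PFKP", "ACTB", "GAPDH"],
    "f1": ["ENO1", "HK2", "TPI1", "PGK1", "ALDOA"],
    "f2": ["CD3E", "CD3D", "TRAC", "IL7R", "CCR7"],
}
labels = {f: name_factor(m, gene_sets) for f, m in factor_markers.items()}
```

In this toy example f0 and f1 use different genes from the same gene set and both end up named after it, while f2 falls below the threshold and keeps the fallback label.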

Regarding eta:
You can find eta in the model file outputted by est_spectra by calling:
model.return_eta_diag()

Regarding the cell scores:
Yes, this is expected behavior. The absolute value of the cell score is not super helpful / not easy to interpret. You are correct that in the preprint we have generally higher cell scores which is probably just because we used a lower number of factors. We have added information and importance scores in the revised manuscript. These quantify the contribution of a factor to explaining the observed expression data and in cell type variation, respectively. You can find these in spectra_util . I will try to add these to the tutorial soon. Also if you have suggestions for better naming conventions we are happy to consider them.

Regarding the gene set dictionary:
Yes. If you run Spectra without a gene set dictionary, it will only be supervised by cell type. If you run it without a gene set dictionary and without cell types, it will be completely unsupervised.

Regarding the GPU version:
This is interesting, I do not have a very confident answer and would refer this to @russellkune .

Let me know if that helps.

Thanks,
Thomas

kvshams commented on August 31, 2024

@ccruizm I have also seen a similar observation. I also got drastically different lam saturation on the same data set, same machine, and same seed (#23) across multiple runs. The GPU implementation is much more inconsistent than the CPU one.

ccruizm commented on August 31, 2024

Hello Thomas,

I appreciate your thorough explanation. Now things are much clearer. I will need to invest some time to digest all the output I got from Spectra taking into account all the remarks you made :)

I will also test different lambda values and try to refine my gene sets so there are no 'promiscuous' gene sets that cover many cell processes.

Regarding your comment on factors not assigned to a specific gene set, you are right that all entries are unique and named with no duplicates. However, I am curious why the final number of factors is 'limited/stuck' to the number of gene sets. I would assume that if there are factors that do not fit that well or are less 'relevant', then the number of final factors should change. I now have the feeling that from my 465 gene sets, Spectra will try to fit 465 factors even if some of them are not specific for a particular cell type, such as the example with jessa22_M2 (which appears at least 8 times and abrogates other gene sets). What happened to the gene sets that this particular gene set (jessa22_M2) has 'replaced'? Does it mean that gene set 17, which in my dictionary is 'all_selenoamino-acid metabolism', overlaps with jessa22_M2 but is better defined by the latter? (Although I have noticed that factors try to keep the same order as in the dictionary, sometimes they are shuffled and there is no direct 1:1 replacement of less fitting factors.)

About the cell scores, I had the feeling they are more like 'arbitrary' units and not necessarily comparable across factors. Looking forward to reading more about it once the revised version is publicly available (or in an updated tutorial🙂).

Lastly, if I run Spectra in an unsupervised way, how does it determine the number of factors? Do I need to set an expected number of factors to be recovered? Or will Spectra determine the best and most stable value, where there are no more overlapping factors and only the more relevant ones remain?

Hope @russellkune can tell us a bit more about the GPU implementation of Spectra.

Thanks in advance for the time and detailed answers.

ccruizm commented on August 31, 2024

Hi there,

I was trying to run Spectra fully unsupervised (no dictionary and no cell types), but I am unsure how to set up the model besides setting use_cell_types=False. Since gene_set_dictionary must be provided, should I create an empty dictionary with global as a unique key? Which of the other options should be modified when running in an unsupervised way (e.g., lam, delta, etc.)?

Thanks!

wallet-maker commented on August 31, 2024

Hi Cristian,
We have never used the method this way. I think setting gene_set_dictionary = {'global':{}} sounds good. The choice of lambda should not matter because it controls the weight of the data vs gene sets. I think you can use default settings for the parameters. There might be bugs running the method without gene sets and cell types. It was not really designed for that.
best,
Thomas
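
For reference, the settings discussed above can be collected as follows. This is only a sketch of the two arguments named in this thread (gene_set_dictionary and use_cell_types). The helper name is hypothetical, the call to est_spectra is left as a comment because it needs an AnnData object and the Spectra package, and, as noted above, this mode is untested and may hit bugs.

```python
def unsupervised_settings():
    """Hypothetical helper: keyword arguments for a fully unsupervised run.

    Only the two arguments discussed in this thread are set; everything
    else is left at the package defaults.
    """
    return {
        # Empty gene set dictionary with 'global' as its only key.
        "gene_set_dictionary": {"global": {}},
        # No cell type labels are used.
        "use_cell_types": False,
    }


settings = unsupervised_settings()
# Assumed usage (not executed here):
# model = est_spectra(adata=adata, **settings)
```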

wallet-maker commented on August 31, 2024

Perhaps to add to that: If you use Spectra without cell types and without gene sets it should essentially behave like NMF. So probably it's better to run NMF for this use case (mind that in the paper we show that NMF is actually performing very poorly so I would maybe rather use something like scHPF if you want to go fully unsupervised).
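
As a point of comparison, plain NMF can be sketched in a few lines. The example below is a generic NumPy implementation of the Lee and Seung multiplicative updates for the Frobenius loss, not Spectra or scHPF code. The function names are hypothetical, and note that in plain NMF the number of factors has to be chosen by the user.

```python
import numpy as np


def nmf_multiplicative(V, n_factors, n_iter=200, eps=1e-10, seed=0):
    """Factorize a non-negative matrix V (cells x genes) as W @ H.

    W (cells x factors) plays the role of cell scores and
    H (factors x genes) the role of gene loadings.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = V.shape
    W = rng.random((n_cells, n_factors)) + eps
    H = rng.random((n_factors, n_genes)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def reconstruction_error(V, W, H):
    """Frobenius norm of the residual V - W @ H."""
    return float(np.linalg.norm(V - W @ H))
```

A quick check on synthetic data is to build V from known non-negative factors and confirm that the reconstruction error drops as the number of iterations increases.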
