dpeerlab / spectra

Supervised Pathway DEConvolution of InTerpretable Gene ProgRAms

License: MIT License

spectra's Introduction

Quickstart

SPECTRA takes in a single cell gene expression matrix, cell type annotations, and gene sets for cellular processes to fit the data to.

If you use Spectra please cite our preprint on bioRxiv.

We start by importing spectra. The easiest way to run spectra is to use the est_spectra function, as shown below. The default behavior is to set the number of factors equal to the number of gene sets plus one. However, this can be modified by passing an integer e.g. L = 20 as an argument to the function or a dictionary that maps cell type to an integer per cell type. We provide a method for estimating the number of factors directly from the data by bulk eigenvalue matching analysis, which is detailed further below.

# This script requires 12GB of memory or more

import Spectra

annotations = Spectra.default_gene_sets.load()
adata = Spectra.sample_data.load()

#run the model with default values
model = Spectra.est_spectra(
    adata=adata, 
    gene_set_dictionary=annotations, 
    use_highly_variable=True,
    cell_type_key="cell_type_annotations", 
    use_weights=True,
    lam=0.1, #varies depending on data and gene sets, try between 0.5 and 0.001
    delta=0.001, 
    kappa=None,
    rho=0.001, 
    use_cell_types=True,
    n_top_vals=50,
    label_factors=True, 
    overlap_threshold=0.2,
    clean_gs = True, 
    min_gs_num = 3,
    num_epochs=10000
)

This function stores four important quantities in the AnnData, in addition to returning a fitted model object. Factors are the scores that tell you how much each gene contributes to each factor, while markers is an array of genes with top scores for every factor. Cell scores are similarly the score of each factor for every cell. Finally, vocab is a boolean array that is True for genes that were used while fitting the model. Note that this quantity is only added to the AnnData when use_highly_variable is set to True.

factors = adata.uns['SPECTRA_factors'] # factors x genes matrix that tells you how important each gene is to the resulting factors
markers = adata.uns['SPECTRA_markers'] # factors x n_top_vals list of n_top_vals top markers per factor
cell_scores = adata.obsm['SPECTRA_cell_scores'] # cells x factors matrix of cell scores
vocab = adata.var['spectra_vocab'] # boolean matrix of size # of genes that indicates the set of genes used to fit spectra
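
To make the relationship between factors and markers concrete, here is a minimal sketch with made-up numbers. It simply ranks genes by their raw factor score. The gene names, the scores, and the helper top_markers are illustrative assumptions, and Spectra's own marker selection may rank genes differently.

```python
# Toy illustration only: hypothetical gene names and scores, not Spectra output.
genes = ["CD3E", "CD8A", "MKI67", "TOP2A", "LYZ"]

# factors x genes: one row per factor, one score per gene
factors = [
    [0.40, 0.35, 0.05, 0.05, 0.15],  # factor 0
    [0.05, 0.05, 0.45, 0.40, 0.05],  # factor 1
]

def top_markers(factor_row, gene_names, n_top_vals):
    """Return the n_top_vals gene names with the highest score in one factor."""
    ranked = sorted(zip(gene_names, factor_row), key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in ranked[:n_top_vals]]

# One list of marker genes per factor (factors x n_top_vals)
markers = [top_markers(row, genes, n_top_vals=2) for row in factors]
print(markers)
```

With real output you would read the gene names from the AnnData rather than from a hand-written list.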

Installation

Spectra can be installed with pip:

pip install scSpectra

Requires Python 3.7 or later.

scRNAseq Knowledge Base

Check out our scRNAseq knowledge base Cytopus 🐙 to retrieve Spectra input gene sets adapted to the cell type composition in your data.

Spectra GUI

A graphical user interface (GUI) that sits on top of SPECTRA, a factor analysis tool developed in Dana Pe'er's lab at Memorial Sloan Kettering Cancer Center (MSKCC), can be found here.

Interactive Tutorial

We provide a full tutorial on how to run the basic Spectra model here: (12 GB RAM required)

You can use log1p transformed median library size normalized data. For leukocyte data, we recommend using scran normalization. We provide a tutorial.

Advanced

Accessing model parameters

Run the model as indicated above. To access finer-grained information about the model fit, we can look at the attributes of the model object directly. Model parameters can be accessed with functions associated with the model object:

model.return_eta_diag()
model.return_cell_scores()
model.return_factors() 
model.return_eta()
model.return_rho()
model.return_kappa()
model.return_gene_scalings()

Apart from cell scores and factors, we can also retrieve a number of other parameters this way that are not added to the AnnData by default. Eta diag is the diagonal of the fitted factor-factor interaction matrix; its interpretation is that it measures the extent to which each factor is influenced by the prior information. In practice many of these values are zero, indicating that the corresponding factors are estimated without bias introduced by the annotation set. Eta is the full set of factor-factor interaction matrices, whose off-diagonals measure the extent to which factors share the same genes. Rho and kappa are parameters that control the background rate of non-edges and edges respectively. These can be fixed throughout training or estimated from the data by providing rho = None or kappa = None to the est_spectra() function or to model.train(). Finally, gene scalings are correction factors that normalize each gene based on its mean expression value.
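
As a rough illustration of how the eta diagonal might be used, the sketch below splits factors by a cutoff on this value. The numbers are invented, and the helper name split_by_graph_dependency is hypothetical. The cutoff of 0.25 is the one quoted from the manuscript's methods in an issue further down this page. The sketch assumes the diagonal is available as a flat sequence with one value per factor, which may not match what return_eta_diag() returns for your model.

```python
# Hypothetical values standing in for the diagonal of eta (one entry per factor).
eta_diag = [0.0, 0.9, 0.1, 0.3]

def split_by_graph_dependency(values, cutoff):
    """Partition factor indices into those below the cutoff and those at or above it."""
    low = [i for i, v in enumerate(values) if v < cutoff]
    high = [i for i, v in enumerate(values) if v >= cutoff]
    return low, high

weakly_tied, strongly_tied = split_by_graph_dependency(eta_diag, cutoff=0.25)
print("weakly tied to the gene set graph:", weakly_tied)
print("strongly tied to the gene set graph:", strongly_tied)
```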

Estimating the number of factors

For most datasets you want to select the number of factors based on the number of gene sets and prior knowledge as well as the granularity of the expected gene programs. However, we also provide a method to estimate the number of factors from the data. To use it, run:

from Spectra import K_est as kst
L = kst.estimate_L(adata, attribute = "cell_type", highly_variable = True)

Fitting via EM

For smaller problems, we can use a memory-intensive EM algorithm instead:

X = adata.X
A = binary_adjacency_matrix  # placeholder: supply a binary adjacency matrix here
model = Spectra.SPECTRA_EM(X=X, A=A, T=4)
model.fit()

Parameters

adata : AnnData object containing cell_type_key with log count data stored in .X

gene_set_dictionary : dict or OrderedDict() maps cell types to gene set names to gene sets ; if use_cell_types == False then maps gene set names to gene sets ; must contain "global" key in addition to every unique cell type under .obs.<cell_type_key>

L : dict, OrderedDict(), int , NoneType number of factors per cell type ; if use_cell_types == False then int. Else dictionary. If None then match factors to number of gene sets (recommended)

use_highly_variable : bool if True, then uses highly_variable_genes

cell_type_key : str cell type key, must be under adata.obs.<cell_type_key> . If use_cell_types == False, this is ignored

use_weights : bool if True, edge weights are estimated based on graph structure and used throughout training

lam : float lambda parameter of the model. weighs relative contribution of graph and expression loss functions

delta : float delta parameter of the model. lower bounds possible gene scaling factors so that maximum ratio of gene scalings cannot be too large

kappa : float or None if None, estimate background rate of 1s in the graph from data

rho : float or None if None, estimate background rate of 0s in the graph from data

use_cell_types : bool if True then cell type label is used to fit cell type specific factors. If false then cell types are ignored

n_top_vals : int number of top markers to return in markers dataframe

determinant_penalty : float determinant penalty of the attention mechanism. If set higher than 0 then sparse solutions of the attention weights and diverse attention weights are encouraged. However, tuning is crucial as setting too high reduces the selection accuracy because convergence to a hard selection occurs early during training [todo: annealing strategy]

filter_sets : bool whether to filter the gene sets based on coherence

clean_gs : bool if True cleans up the gene set dictionary by: 1. checking that annotations dictionary cell type keys and adata cell types are identical, 2. keeping only genes contained in the adata, 3. keeping only gene sets with at least min_gs_num genes

min_gs_num : int only use if clean_gs True, minimum number of genes per gene set expressed in adata, other gene sets will be filtered out

label_factors : bool whether to label the factors by their cell type specificity and their Szymkiewicz–Simpson overlap coefficient with the input marker genes

overlap_threshold: float minimum overlap coefficient to assign an input gene set label to a factor

num_epochs : int number of epochs to fit the model. We recommend 10,000 epochs which works for most datasets although many models converge earlier

**kwargs : (num_epochs = 10000, lr_schedule = [...], verbose = False) arguments to .train(), maximum number of training epochs, learning rate schedule and whether to print changes in learning rate

Returns: SPECTRA_Model object [after training]

In place: adds 1. factors, 2. cell scores, 3. vocabulary, and 4. markers as attributes in .obsm, .var, .uns

default parameters: est_spectra(adata, gene_set_dictionary, L = None,use_highly_variable = True, cell_type_key = None, use_weights = True, lam = 0.01, delta=0.001,kappa = None, rho = 0.001, use_cell_types = True, n_top_vals = 50, filter_sets = True, clean_gs = True, min_gs_num = 3, label_factors=True, overlap_threshold= 0.2, **kwargs)
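
The two shapes of gene_set_dictionary described above can be sketched as follows. The cell type and gene set names are placeholders borrowed from an example elsewhere on this page, and missing_keys is a hypothetical helper written for this illustration, not part of Spectra.

```python
# Nested format for use_cell_types=True: cell type -> gene set name -> genes.
# It needs a "global" key plus one key per cell type found in adata.obs[cell_type_key].
nested = {
    "global": {"global_MHCI": ["HLA-A", "HLA-B", "HLA-C"]},
    "CD8_T": {"CD8_tex": ["TOX", "LAG3", "PDCD1"]},
}

# Flat format for use_cell_types=False: gene set name -> genes.
flat = {
    "global_MHCI": ["HLA-A", "HLA-B", "HLA-C"],
    "CD8_tex": ["TOX", "LAG3", "PDCD1"],
}

def missing_keys(gene_set_dictionary, cell_types, global_key="global"):
    """Return required top-level keys that are absent from a nested dictionary."""
    required = {global_key} | set(cell_types)
    return sorted(required - set(gene_set_dictionary))

print(missing_keys(nested, ["CD8_T"]))            # every required key is present
print(missing_keys(nested, ["CD8_T", "B_cell"]))  # "B_cell" has no entry
```

Running a check like this before fitting can surface a missing cell type key early instead of as a KeyError partway through est_spectra.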

Labeling factors

We also provide an approach to label the factors by their Szymkiewicz–Simpson overlap coefficient with the input gene sets. Each factor receives the label of the input gene set with the highest overlap coefficient, provided that the overlap coefficient is greater than the threshold defined in 'overlap_threshold'. Ties in the overlap coefficient are broken by gene set size, selecting the label of the bigger gene set (because smaller gene sets might get bigger overlap coefficients by chance).

We provide a pandas.DataFrame indicating the overlap coefficients for each input gene set with each factor's marker genes. The index of this dataframe contains the index of each factor, assigned label as well as the cell type specificity for each factor in the format:

['index' + '-X-' + 'cell type specificity' + '-X-' + 'assigned label', ...]

We use '-X-' as a unique separator to make string splitting and retrieval of the different components of the index easier.

adata.uns['SPECTRA_overlap']
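
The labeling rule described above can be sketched in a few lines. This is an illustration of the stated rule (highest overlap coefficient above the threshold, ties going to the larger gene set), not Spectra's actual implementation. The function names, marker list, and gene sets are all hypothetical.

```python
def overlap_coefficient(a, b):
    """Szymkiewicz-Simpson overlap: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def assign_label(markers, gene_sets, overlap_threshold):
    """Pick the gene set with the highest overlap; ties go to the larger set."""
    best_name, best_key = None, None
    for name, genes in gene_sets.items():
        score = overlap_coefficient(markers, genes)
        key = (score, len(set(genes)))  # compare by score first, then by set size
        if score > overlap_threshold and (best_key is None or key > best_key):
            best_name, best_key = name, key
    return best_name  # None if no gene set exceeds the threshold

# Hypothetical marker list and gene sets.
markers = ["TOX", "LAG3", "PDCD1", "GZMA"]
gene_sets = {
    "small_set": ["TOX", "LAG3"],
    "big_set": ["TOX", "LAG3", "PDCD1", "GZMA", "IFNG"],
    "unrelated": ["HLA-A", "HLA-B"],
}
label = assign_label(markers, gene_sets, overlap_threshold=0.2)
print(label)

# Splitting an index entry of the documented form back into its parts.
index, cell_type, assigned = "0-X-global-X-big_set".split("-X-")
print(index, cell_type, assigned)
```

Here both small_set and big_set reach an overlap coefficient of 1.0, so the tie is resolved in favor of the larger set.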

spectra's Issues

Error when running spc.est_spectra

Hello!

Congrats and thanks for your work! A few days ago I read your group's tweet and thought SPECTRA is exactly what I need for my data.
I am trying to disentangle CD8 activation/TCR reactivity vs exhaustion. I apologize in advance for the naive question, I am an R user.

I used Seurat to analyze my data, then saved the single cell object as h5ad file. Here is my script:

import os
import pandas as pd
from spectra import spectra as spc
import scanpy

os.getcwd()
adata = scanpy.read_h5ad("CD8_ad.h5ad")
pal_ident = ["#F0E442","#E69F00", "#009E73", "#56B4E9" ] #custom palette for clusters
adata.uns['clusters']= pal_ident
scanpy.pp.highly_variable_genes(adata)

then following (don't know if I followed properly) your README file I did:

gene_set_annotations = {"global": {"global_ifn_II_response" : ["CIITA", "CXCL10"] , "global_MHCI": ["HLA-A", "HLA-B", "HLA-C"] },
                        "CD8_T": {"CD8_tex": ['TOX', 'LAG3', 'PDCD1', 'HAVCR2', 'EOMES', 'ID2', 'CD244', 'NR4A3', 'NR4A2', 'NR4A1',
                                              'CXCL13', 'IRF4','PRDM1'],
                                  "CD8_reac": ['MKI67', 'TNFRSF9', 'TNFRSF18', 'GZMA', 'IFNG', 'ENTPD1', 'ITGAE', 'CXCL13']}
}
model = spc.est_spectra(adata = adata,  gene_set_dictionary = gene_set_annotations, cell_type_key = "clusters", use_highly_variable = True, lam = 0.01)

At first I had not included "global" (since I am only interested in CD8), but I was getting
KeyError: 'global'

Including the "global" variable the function runs till ~30% and then throws this error:

ValueError: operands could not be broadcast together with shapes (15486,9) (1,6)

I think I understand the type of error but not how to fix it. Hope you can help!

Best,
Francesco

Unexpected spectra factors and some misc questions

Good day,

I have run Spectra successfully using both CPU and GPU. However, I have noticed an unexpected behavior in both approaches. I used a downsampled version of my data (120K cells), normalized using scran, and grouped in major cell types (malignant, myeloid, lymphoid, and vascular). In addition to the global gene sets, I added some specific ones for each cell group, and in total, the dictionary contains 465 gene sets.

When I examined and plotted the output stored in adata.uns['SPECTRA_overlap'] and the respective scores in adata.obsm['SPECTRA_cell_scores'], I noticed that the order and naming of the gene sets changed. In some cases, gene sets appear in global and other categories more than once and show different results. Also, some categories are called only global, or malignant or myeloid. I thought they could be a summary of all gene sets in that category, but once again, there are several with the same name and different scores/distributions. See the example below (in this case, the gene set called ‘jessa22_M2’ is in the dictionary only as an entry for the malignant category and contains a signature of 50 genes).

I am using

model = spc.est_spectra(adata = adata, gene_set_dictionary = annotations, 
                        use_highly_variable = True, cell_type_key = "cell_type_annotations", 
                        use_weights = True, 
                        lam = 0.1, 
                        delta=0.001,kappa = 0.00001, rho = 0.00001, 
                        use_cell_types = True, n_top_vals = 50, 
                        label_factors = True,
                        overlap_threshold = 0.2,
                        num_epochs=10000,
                        verbose = True
                       )

Why do you think this is happening? Should I reduce lam (e.g., 0.01) to make it more strict?

I also wonder how you can discriminate potentially novel programs, as you described in the manuscript for the therapy-induced macrophages. You mentioned in the methods that you defined “new factors as factors with a graph dependency parameter η < 0.25 and modified factors as factors with a graph dependency parameter η ≥ 0.25”. Which of the outputs can I use to filter potentially novel factors?

I also read in the methods that the gene set dictionary is optional. In that case, will Spectra perform an unsupervised gene program detection?

About the SPECTRA_cell_scores, I saw that most of the scores are not higher than 0.02. Do you know if this is normal? I see these scores even for well-known gene sets that define, for instance, the cell cycle. (I see the scales presented in your manuscript are much higher)

Finally, as I mentioned in #21, GPU can reach a lower LR with fewer epochs than the CPU implementation. Does this suggest the GPU version can provide more accurate results in less time?

Thanks in advance! Sorry for the questions, but I am excited to get this working on my dataset (and understanding better the outputs)!

I'm looking forward to hearing your thoughts and suggestions.

GPU tutorial?

I'm looking to use the GPU implementation of Spectra as mentioned in issue #21.

When you load from Spectra import Spectra_gpu as spc_gpu you get this info:

Spectra GPU support is still under development. Raise any issues on github 
 
 Changes from v1: 
 (1) GPU support [see tutorial] 
 (2) minibatching for local parameters and data 
 Note that minibatching may affect optimization results 
 Code will eventually be merged into spectra.py

Is there a tutorial hiding somewhere? I'm especially interested in the minibatching.

Thanks!

Error saving / loading when initialize the model with kappa/ rho etc.

Hi,

Thank you for creating this amazing tool. I noticed a potential bug when saving/loading models.

File spectra.py
Line: near 243 & 251.
The dictionary self.rho / self.kappa should be transformed in to nn.ParameterDict after initialization to register the parameters to the state_dict. Otherwise, these parameters won't be saved, and hence raise an error when trying to load trained model from disk.

The same goes for cases when use_cell_types = False: kappa and rho should be registered so that they are saved properly.

spc.est_spectra stalls and goes to next step after some epochs

In my data set (114k cells, 100 gene sets and 10 cell types), spc.est_spectra stops after some epochs without any error message. Most of the outputs are populated in adata except adata.var['spectra_vocab']. The stopping point differs each time I re-run the data, e.g. 12%, 32%, 58% and 4% out of 10k epochs. I have not changed any parameters or compute nodes/VM. I have simply restarted my Jupyter notebook to rerun and see what is happening.

tutorial typo

Hi,

In the 9th chunk of the tutorial Spectra_Colaboratory_tutorial.ipynb, there is the following typo: if len(annotation_labels)2

Thanks,

Example notebook is missing

Good day!

I am eager to test this exciting approach in our dataset. I have seen in several of the previous issues a link to a more detailed and expanded tutorial on how to use the tool (https://github.com/dpeerlab/spectra/blob/main/notebooks/example_notebook.ipynb), but that notebook does not exist. Could you please upload it again?

Thanks in advance, and looking forward to use your tool

TypeError: SPECTRA_Model.train() got an unexpected keyword argument 'label_factors'

hello, thanks for this great tool !!!
I ran this tool with CPU successfully but it failed with GPU:

#import packages
import numpy as np
import json
import scanpy as sc
from collections import OrderedDict
import scipy
import pandas as pd
import matplotlib.pyplot as plt

#spectra imports
import Spectra as spc
from Spectra import Spectra_util as spc_tl
from Spectra import K_est as kst
from Spectra import default_gene_sets

## GPU
from Spectra import Spectra_gpu as spc_gpu

#filter gene set annotation dict for genes contained in adata
my_annotations = spc_tl.check_gene_set_dictionary(
    adata,
    my_annotations,
    obs_key='Disease subtype2',
    global_key='global')

# fit the model (We will run this with only 2 epochs to decrease runtime in this tutorial)
model = spc_gpu.est_spectra(adata=adata,
    gene_set_dictionary=my_annotations,
    use_highly_variable=True,
    cell_type_key="Disease subtype2",
    use_weights=True,
    lam=0.1, # varies depending on data and gene sets, try between 0.5 and 0.001
    delta=0.001,
    kappa=None,
    rho=0.001,
    use_cell_types=True,
    n_top_vals=50,
    label_factors=True,
    overlap_threshold=0.2,
    clean_gs = True,
    min_gs_num = 3,
    num_epochs=500 #here running only 2 epochs for time reasons, we recommend 10,000 epochs for most datasets
)

and the running log:

CUDA Available:  True
Initializing model...
Building parameter set...
CUDA memory:  1.788089856
Beginning training...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-232fdfc3e7ab> in <cell line: 2>()
      1 # fit the model (We will run this with only 2 epochs to decrease runtime in this tutorial)
----> 2 model = spc_gpu.est_spectra(adata=adata,
      3     gene_set_dictionary=my_annotations,
      4     use_highly_variable=True,
      5     cell_type_key="Disease subtype2",

/usr/local/lib/python3.10/dist-packages/Spectra/Spectra_gpu.py in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, **kwargs)
    886     spectra.initialize(gene_set_dictionary, word2id, X, init_scores)
    887     print("Beginning training...")
--> 888     spectra.train(X = X, labels = labels,**kwargs)
    889 
    890     adata.uns["SPECTRA_factors"] = spectra.factors

TypeError: SPECTRA_Model.train() got an unexpected keyword argument 'label_factors'

Could you help with this problem ? Thanks !!!

Key Error for est_spectra

Hi,
I used the data in the dropbox link from the Bassez publication, loaded with the respective scanpy method (note: the same function from the AnnData package did not work) and ran est_spectra with the cell_type_key = "Bassez_cellType". (This obs contains cell type annotations, like 'B_cell').

I get a KeyError that points to line 1196 of spectra.py:
        for ct in gene_set_dictionary.keys():
   1195     new_gs_dict[ct] = {}
-> 1196     mval = max(L[ct] - 1, 0)
   1197     sorted_init_scores = sorted(init_scores[ct].items(), key=lambda x:x[1])
   1198     sorted_init_scores = sorted_init_scores[-1*mval:]

KeyError: 'B_cell'

Hope this helps!

use_cell_types not implemented in Spectra_gpu

Good day!

I want to use spectra for each main group of cells (myeloid, lymphoid, etc) in my tumor data. In that case, I want to set use_cell_types=False, but this option is not implemented for GPU. When do you think it will become available?

Thanks in advance!

Monitoring training

Hi,

Thank you for creating this really interesting project. I'm eager to use it in my work. One thing that would be really useful is if there was some way to easily examine training history to actually confirm convergence. Also, if some kind of early stopping could be implemented, that would be very helpful as Spectra takes a while to run.

Thank you,
Connor

Error when a gene in a gene set is not contained in anndata

Hi,
thank you for creating spectra, promising concept!
I tried it on CD8-positive TILs with the provided gene sets, but ran into two problems when initializing:

1. If a gene in one of the gene sets is not contained in the AnnData, a KeyError is thrown.
2. When computing the init scores, I get a float division by zero error.

Traceback:

ZeroDivisionError                         Traceback (most recent call last)
Cell In [21], line 1
----> 1 model = spc.est_spectra(adata = cd8_ad,  gene_set_dictionary = gene_set_dictionary,  use_highly_variable = False, lam = 0.01,cell_type_key = "Celltype Pred",use_cell_types=True,filter_sets=True)
      3 #save trained model
      4 model.save("test_model")

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/spectra.py:1195, in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, **kwargs)
   1193 if use_cell_types:
   1194     new_gs_dict = {}
-> 1195     init_scores = compute_init_scores(gene_set_dictionary,word2id, torch.Tensor(X))
   1196     for ct in gene_set_dictionary.keys():
   1197         new_gs_dict[ct] = {}

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/initialization.py:47, in compute_init_scores(gs_dict, word2id, W)
     45         gs = gs_dict[key][inner_key]
     46         idxs = [word2id[word] for word in gs]
---> 47         coh = mimno_coherence_2011(idxs, W)
     48         init_scores[key][inner_key] = coh.item() 
     49 else: 

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/initialization.py:25, in mimno_coherence_2011(words, W)
     23         score += mimno_coherence_single(words[j1], words[j2], W)
     24 denom = V*(V-1)/2
---> 25 return(score/denom)

ZeroDivisionError: float division by zero

If you could advise me, I would be very happy!
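
Both problems in this report concern gene sets that do not match the data. In the traceback the denominator is V*(V-1)/2, which is zero whenever V is 0 or 1, so a gene set left with fewer than two usable genes would plausibly trigger the division error (assuming V is the number of genes in the set, which the traceback does not show). A minimal pre-filtering sketch, similar in spirit to what the README describes for clean_gs and min_gs_num, could look like this. The function clean_gene_sets and all inputs are hypothetical.

```python
def clean_gene_sets(gene_sets, genes_in_data, min_genes=3):
    """Drop genes absent from the data, then drop gene sets that become too small."""
    available = set(genes_in_data)
    cleaned = {}
    for name, genes in gene_sets.items():
        kept = [g for g in genes if g in available]
        if len(kept) >= min_genes:
            cleaned[name] = kept
    return cleaned

# Hypothetical inputs; in practice genes_in_data would be something like adata.var_names.
genes_in_data = ["TOX", "LAG3", "PDCD1", "HLA-A"]
gene_sets = {
    "CD8_tex": ["TOX", "LAG3", "PDCD1", "HAVCR2"],  # 3 of 4 genes present
    "MHCI": ["HLA-A", "HLA-B", "HLA-C"],            # only 1 gene present
}
cleaned = clean_gene_sets(gene_sets, genes_in_data)
print(cleaned)
```

For a nested dictionary the same function could be applied to each cell type's inner dictionary in turn.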

Running Spectra with mouse data?

Hi, really excited about this analysis approach but I am wondering how can I adapt it to mouse data please?

I understand from reading the Cytopus vignette that I should just be able to create a custom dictionary incorporating mouse genesets, but do you have any thoughts on an efficient way to do that? I.e. for the default human dictionary, did you pull the genesets from an existing database or did you have to curate them manually?

I am principally interested in running just the mouse equivalent of the global genesets since my sequencing data is from a single cell type.

Thanks so much for any thoughts or pointers for this!

Util function improvements

In check_gene_set_dictionary there will not be any error or flags if the length of the keys is the same between anndata and gene set dictionary but some of them are misspelled. Propose modifying function to this to account for the scenario:
if (len(adata_labels)<len(annotation_labels)) | (set(annotation_labels) != set(adata_labels)):

Then the print will output the mismatched keys.

Also, cell type labels cannot include periods because it will throw the following torch error:
KeyError: 'parameter name can't contain "."'

This should be checked as well.
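
A sketch of the two checks proposed here, written as a standalone function rather than as a patch to check_gene_set_dictionary. The function name and messages are hypothetical, and it assumes the "global" key should be ignored when comparing against the AnnData labels.

```python
def check_cell_type_labels(adata_labels, annotation_labels, global_key="global"):
    """Return a list of human-readable problems; an empty list means the labels agree."""
    data = set(adata_labels)
    annot = set(annotation_labels) - {global_key}
    problems = []
    for label in sorted(data - annot):
        problems.append(f"in AnnData but not in gene set dictionary: {label!r}")
    for label in sorted(annot - data):
        problems.append(f"in gene set dictionary but not in AnnData: {label!r}")
    for label in sorted(data | annot):
        if "." in label:
            problems.append(f"contains a period: {label!r}")
    return problems

# Same number of keys on both sides, but one is misspelled and one contains a period.
problems = check_cell_type_labels(["B_cell", "T.cell"], ["global", "B_cel", "T.cell"])
for problem in problems:
    print(problem)
```

Comparing sets rather than lengths is what catches the misspelled key in this example.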

Number of cores allocation

Hello,
We use the sbatch system to manage tasks. When running Spectra, we found that it uses more cores than we assigned to it. Could you explain Spectra's CPU scheduling strategy, and is there any way to limit the number of cores?

Thanks!
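
One common way to cap CPU threads for PyTorch-based code is sketched below. It has not been tested against Spectra specifically. It assumes the job runs under Slurm with --cpus-per-task set, so that SLURM_CPUS_PER_TASK is defined, and it falls back to a single thread otherwise.

```python
import os

# Assumption: the scheduler exposes the allocation as SLURM_CPUS_PER_TASK; fall back to 1.
n_threads = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

# These variables are read when the numerical libraries initialize,
# so set them before importing numpy, torch, or Spectra.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = str(n_threads)

try:
    import torch
    torch.set_num_threads(n_threads)  # caps PyTorch's intra-op CPU threads
except ImportError:
    pass  # PyTorch is not installed in this environment

print("thread cap:", n_threads)
```

Placing this at the very top of the script or notebook matters, because the environment variables have no effect on libraries that are already loaded.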

dictionary format when no cell type labels in adata

Could you please provide an example of gene_set dictionary for cases where 'use_cell_types = False'? I can't seem to find the right format.
Thanks

issue with self.rho = nn.ParameterDict(self.rho)

Hi all, I was running through the code below in the provided tutorial and came across an error.

model = Spectra.est_spectra(adata=adata, gene_set_dictionary=annotations, use_highly_variable=True, cell_type_key="cell_type_annotations", use_weights=True, lam=0.1, delta=0.001, kappa=None, rho=0.001, use_cell_types=True, n_top_vals=50, label_factors=True, overlap_threshold=0.2, clean_gs=True, min_gs_num=3, num_epochs=2)

When rho=None, est_spectra runs just fine.
The same error with rho=0.001 also happens with kappa=0.001.

train() got an unexpected keyword argument 'label_factors'

Hi there,

When using spc.est_spectra, I am getting the error train() got an unexpected keyword argument 'label_factors'. I installed it from GitHub recently and can see that label_factors is part of spectra.py. I am trying to understand why I am getting this error.

Any insights you can provide will be helpful! Thanks!

UnboundLocalError: local variable 'is_global' referenced before assignment

Hello,
I am trying to run spectra using cell type labels:

import Spectra
import scanpy as sc
import pandas as pd
import numpy as np
import cytopus as cp

#subset my dataset from cytopus
G = cp.KnowledgeBase()
celltype_of_interest = ['T']
global_celltypes = ['all-cells','leukocyte']
G.get_celltype_processes(celltype_of_interest,global_celltypes = global_celltypes,get_children=True,get_parents =False)
annotations = G.celltype_process_dict
annotations = G.celltype_process_dict

#Run spectra
model = Spectra.est_spectra(
    adata=adata, 
    gene_set_dictionary=annotations, 
    use_highly_variable=True,
    cell_type_key="predicted.celltype.l1", 
    use_weights=True,
    lam=0.1, #varies depending on data and gene sets, try between 0.5 and 0.001
    delta=0.001, 
    kappa=None,
    rho=0.001, 
    use_cell_types=False,
    n_top_vals=50,
    label_factors=True, 
    overlap_threshold=0.2,
    clean_gs = True, 
    min_gs_num = 3,
    num_epochs=5000
)

It finishes the process, but gives the following error:

Cell type labels in gene set annotation dictionary and AnnData object are identical
removing gene set T for cell type global which is of length 14 0 genes are found in the data. minimum length is 3
removing gene set global for cell type global which is of length 150 0 genes are found in the data. minimum length is 3
Your gene set annotation dictionary is now correctly formatted.
/home/ubuntu/anaconda3/envs/scFates-gpu/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/ubuntu/anaconda3/envs/scFates-gpu/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
100%|██████████████████████████████████████████████████████████████████████████████████| 5000/5000 [38:15<00:00,  2.18it/s]
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[12], line 1
----> 1 model = Spectra.est_spectra(
      2     adata=adata, 
      3     gene_set_dictionary=annotations, 
      4     use_highly_variable=True,
      5     cell_type_key="predicted.celltype.l1", 
      6     use_weights=True,
      7     lam=0.1, #varies depending on data and gene sets, try between 0.5 and 0.001
      8     delta=0.001, 
      9     kappa=None,
     10     rho=0.001, 
     11     use_cell_types=False,
     12     n_top_vals=50,
     13     label_factors=True, 
     14     overlap_threshold=0.2,
     15     clean_gs = True, 
     16     min_gs_num = 3,
     17     num_epochs=5000
     18 )

File ~/anaconda3/envs/scFates-gpu/lib/python3.8/site-packages/Spectra/Spectra.py:1314, in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, label_factors, clean_gs, min_gs_num, overlap_threshold, **kwargs)
   1311 #labeling function
   1312 if label_factors:
   1313     #get cell type specificity of every factor
-> 1314     if is_global == False:
   1315         celltype_dict = get_factor_celltypes(adata, cell_type_key, cellscore=spectra.cell_scores)
   1316         max_celltype = [celltype_dict[x] for x in range(spectra.cell_scores.shape[1])]

UnboundLocalError: local variable 'is_global' referenced before assignment

Annotation format requirement

What annotation format is required? Is it possible to use the gene sets directly from a pathway database, for instance the C2 JSON bundle from the Broad Institute pathway database?
Thanks,
Shams

GPU implementation

Good day,

I would like to know whether Spectra can use a GPU. In the example notebook I could not find much information about it, but I saw there seems to be an implementation to use GPU (https://github.com/dpeerlab/spectra/blob/main/spectra/spectra_gpu.py).

Does this automatically detect I am using a GPU node? Are there differences in the outcome between running using CPU vs GPU?

Thanks in advance!

unbound local variable is_global

Trying to call est_spectra with a gene_set_dict of the form {"global": dict of gene sets} and use_cell_types=False leads to an unbound local variable error for is_global after training. Looking through the code, this happens because in this case, check_gene_set_dictionary expects a single-layered dict for gene_set_dict. This behavior isn't in the doc string. A couple suggestions:

1. Either change this so the expected format of gene_set_dict is consistent (my personal preference), or else have check_gene_set_dictionary check if global_key is in the top layer of gene_set_dict before wrapping it in another layer.
2. Expand the doc string for est_spectra to be more explicit about the format and put an example of the format in the README.
3. No matter what, check that gene_set_dict is not empty after all changes are made to it and before training.
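
The first and last suggestions could be sketched roughly as follows for the use_cell_types=False case. This is an illustration of the idea, not a patch against Spectra. The function to_flat_gene_sets is hypothetical and accepts either layout.

```python
def to_flat_gene_sets(gene_set_dict, global_key="global"):
    """Accept either {"global": {name: genes}} or {name: genes} and return {name: genes}."""
    # Unwrap only when the sole top-level key is the global key and it holds a dict.
    if set(gene_set_dict) == {global_key} and isinstance(gene_set_dict[global_key], dict):
        gene_set_dict = gene_set_dict[global_key]
    flat = {name: list(genes) for name, genes in gene_set_dict.items() if len(genes) > 0}
    if not flat:
        raise ValueError("gene set dictionary is empty after cleaning")
    return flat

wrapped = {"global": {"global_MHCI": ["HLA-A", "HLA-B", "HLA-C"]}}
plain = {"global_MHCI": ["HLA-A", "HLA-B", "HLA-C"]}
same = to_flat_gene_sets(wrapped) == to_flat_gene_sets(plain)
print(same)
```

Raising an error for an empty dictionary before training would replace the late UnboundLocalError with an immediate, readable message.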

TabError: inconsistent use of tabs and spaces in indentation

A week ago the model was running perfectly. Today, when I try to import the package and its modules, I get the error "TabError: inconsistent use of tabs and spaces in indentation":

from spectra import spectra as spc
TabError: inconsistent use of tabs and spaces in indentation

from spectra import spectra_util as spc_tl
TabError: inconsistent use of tabs and spaces in indentation

I used the original notebook (tutorial notebook) of spectra.

Please help. How can I avoid this error?