dpeerlab / spectra

Supervised Pathway DEConvolution of InTerpretable Gene ProgRAms

License: MIT License

Python 63.93% Jupyter Notebook 35.45% Shell 0.10% R 0.52%

spectra's Issues

unbound local variable is_global

Trying to call est_spectra with a gene_set_dict of the form {"global": dict of gene sets} and use_cell_types=False leads to an unbound local variable error for is_global after training. Looking through the code, this happens because in this case, check_gene_set_dictionary expects a single-layered dict for gene_set_dict. This behavior isn't in the doc string. A couple suggestions:

  • Either change this so the expected format of gene_set_dict is consistent (my personal preference), or else have check_gene_set_dictionary check if global_key is in the top layer of gene_set_dict before wrapping it in another layer.
  • Expand the doc string for est_spectra to be more explicit about the format and put an example of the format in the README
  • No matter what, check that gene_set_dict is not empty after all changes are made to it and before training
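The second and third suggestions could look roughly like the sketch below. It is not the package's code: `normalize_gene_set_dict` is a hypothetical helper, and the nested layout {group: {gene set name: [genes]}} is assumed from the description above.

```python
def normalize_gene_set_dict(gene_set_dict, global_key="global"):
    """Return a two-level dict {group: {set_name: [genes]}} (hypothetical helper).

    If global_key is already a top-level key, the dict is assumed to be nested
    and is returned as a shallow copy; otherwise it is treated as a flat
    {set_name: [genes]} mapping and wrapped under global_key.
    """
    if global_key in gene_set_dict:
        normalized = {group: dict(sets) for group, sets in gene_set_dict.items()}
    else:
        normalized = {global_key: dict(gene_set_dict)}

    # Fail early instead of after training if nothing usable is left.
    if not any(normalized.values()):
        raise ValueError("gene_set_dict contains no gene sets")
    return normalized


nested = {"global": {"ifn": ["CIITA", "CXCL10"]}}
flat = {"ifn": ["CIITA", "CXCL10"]}
print(normalize_gene_set_dict(nested) == normalize_gene_set_dict(flat))
```

With a check like this, both input shapes end up in the same form, and an empty dictionary is rejected before any training time is spent.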

tutorial typo

Hi,

In the 9th chunk of the tutorial Spectra_Colaboratory_tutorial.ipynb, there is the following typo: if len(annotation_labels)2

Thanks,

Key Error for est_spectra

Hi,
I used the data in the dropbox link from the Bassez publication, loaded with the respective scanpy method (note: the same function from the AnnData package did not work) and ran est_spectra with the cell_type_key = "Bassez_cellType". (This obs contains cell type annotations, like 'B_cell').

I get a KeyError that points to line 1196 of spectra.py:
for ct in gene_set_dictionary.keys():
1195 new_gs_dict[ct] = {}
-> 1196 mval = max(L[ct] - 1, 0)
1197 sorted_init_scores = sorted(init_scores[ct].items(), key=lambda x:x[1])
1198 sorted_init_scores = sorted_init_scores[-1*mval:]

KeyError: 'B_cell'

Hope this helps!

Specific factors vary based on global.

Thanks for creating this tool.

I have a question regarding the definition of global and specific gene sets.

From what I have understood, by defining specific gene sets we aim to study the within-cell-type variance of these gene sets by applying the matrix decomposition approach only to that cell type, right?

On the other hand, I have run a few tests to understand the outputs of the method.
I have generated two gene set collections for spectra inputs:

  1. As global the hallmark gene sets collection (50 gene sets) from MSigDB and 15 specific gene sets for two specific cell types.
  2. As global the ~180 cytopus "all_" starting gene sets and the same 15 specific gene sets for the same two cell types.

I've found that when checking the output values for those specific gene sets there were noticeable differences. That makes me think that differences in the global gene sets will affect results for the specific gene sets, right?

Can you give a hint on how to better choose global gene sets, or the rationale behind their selection?

Thanks!

Error saving/loading when initializing the model with kappa, rho, etc.

Hi,

Thank you for creating this amazing tool. I noticed a potential bug when saving/loading models.

File spectra.py
Line: near 243 & 251.
The dictionaries self.rho / self.kappa should be converted into nn.ParameterDict after initialization so that the parameters are registered in the state_dict. Otherwise, these parameters are not saved, which raises an error when loading a trained model from disk.

The same applies when use_cell_types = False: kappa and rho should be registered so that they are saved properly.

Running Spectra with mouse data?

Hi, really excited about this analysis approach but I am wondering how can I adapt it to mouse data please?

I understand from reading the Cytopus vignette that I should just be able to create a custom dictionary incorporating mouse genesets, but do you have any thoughts on an efficient way to do that? I.e. for the default human dictionary, did you pull the genesets from an existing database or did you have to curate them manually?

I am principally interested in running just the mouse equivalent of the global genesets since my sequencing data is from a single cell type.

Thanks so much for any thoughts or pointers for this!
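One way to build a mouse dictionary is to translate the human gene symbols through an ortholog table. The sketch below assumes you already have such a table as a plain dict; the three pairs in `ortholog_map` are only illustrative, and `map_gene_sets` is a hypothetical helper, not part of Spectra or Cytopus. Simply changing the capitalization of human symbols is not reliable, since not every gene has a one-to-one ortholog with a matching name.

```python
def map_gene_sets(gene_sets, ortholog_map):
    """Translate {set_name: [human symbols]} to mouse symbols.

    Genes without an entry in ortholog_map are dropped and reported separately.
    """
    mapped, unmapped = {}, {}
    for name, genes in gene_sets.items():
        mapped[name] = [ortholog_map[g] for g in genes if g in ortholog_map]
        unmapped[name] = [g for g in genes if g not in ortholog_map]
    return mapped, unmapped


# Illustrative table only; a real one would come from an ortholog resource.
ortholog_map = {"TP53": "Trp53", "CD8A": "Cd8a", "IFNG": "Ifng"}
human_sets = {"example_set": ["TP53", "IFNG", "HLA-A"]}

mouse_sets, missing = map_gene_sets(human_sets, ortholog_map)
print(mouse_sets)
print(missing)
```

Reviewing the `missing` output per gene set shows how much of each set survives the translation before it is passed on.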

TypeError: SPECTRA_Model.train() got an unexpected keyword argument 'label_factors'

hello, thanks for this great tool !!!
I ran this tool successfully on CPU, but it fails on GPU:

#import packages
import numpy as np
import json
import scanpy as sc
from collections import OrderedDict
import scipy
import pandas as pd
import matplotlib.pyplot as plt

#spectra imports
import Spectra as spc
from Spectra import Spectra_util as spc_tl
from Spectra import K_est as kst
from Spectra import default_gene_sets

## GPU
from Spectra import Spectra_gpu as spc_gpu

#filter gene set annotation dict for genes contained in adata
my_annotations = spc_tl.check_gene_set_dictionary(
    adata,
    my_annotations,
    obs_key='Disease subtype2',
    global_key='global')

# fit the model (We will run this with only 2 epochs to decrease runtime in this tutorial)
model = spc_gpu.est_spectra(adata=adata,
    gene_set_dictionary=my_annotations,
    use_highly_variable=True,
    cell_type_key="Disease subtype2",
    use_weights=True,
    lam=0.1, # varies depending on data and gene sets, try between 0.5 and 0.001
    delta=0.001,
    kappa=None,
    rho=0.001,
    use_cell_types=True,
    n_top_vals=50,
    label_factors=True,
    overlap_threshold=0.2,
    clean_gs = True,
    min_gs_num = 3,
    num_epochs=500 #here running only 2 epochs for time reasons, we recommend 10,000 epochs for most datasets
)

and the running log:

CUDA Available:  True
Initializing model...
Building parameter set...
CUDA memory:  1.788089856
Beginning training...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-13-232fdfc3e7ab>](https://localhost:8080/#) in <cell line: 2>()
      1 # fit the model (We will run this with only 2 epochs to decrease runtime in this tutorial)
----> 2 model = spc_gpu.est_spectra(adata=adata,
      3     gene_set_dictionary=my_annotations,
      4     use_highly_variable=True,
      5     cell_type_key="Disease subtype2",

[/usr/local/lib/python3.10/dist-packages/Spectra/Spectra_gpu.py](https://localhost:8080/#) in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, **kwargs)
    886     spectra.initialize(gene_set_dictionary, word2id, X, init_scores)
    887     print("Beginning training...")
--> 888     spectra.train(X = X, labels = labels,**kwargs)
    889 
    890     adata.uns["SPECTRA_factors"] = spectra.factors

TypeError: SPECTRA_Model.train() got an unexpected keyword argument 'label_factors'

Could you help with this problem ? Thanks !!!
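The traceback shows that the GPU `est_spectra` does not list `label_factors` among its named parameters, so it ends up in `**kwargs` and is forwarded to `train()`. One possible way to see this before calling is to check which keyword arguments a function names explicitly. The sketch below uses a stand-in function (`fake_est_spectra` is hypothetical and only mimics the shape of the signature in the traceback); it does not import Spectra.

```python
import inspect


def split_supported_kwargs(func, kwargs):
    """Split kwargs into those func names explicitly and the rest."""
    params = inspect.signature(func).parameters
    named = {
        name
        for name, p in params.items()
        if p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }
    supported = {k: v for k, v in kwargs.items() if k in named}
    unsupported = {k: v for k, v in kwargs.items() if k not in named}
    return supported, unsupported


# Stand-in with a **kwargs catch-all, like the signature in the traceback.
def fake_est_spectra(adata, gene_set_dictionary, lam=0.01, rho=None, **kwargs):
    return kwargs


requested = {"lam": 0.1, "rho": 0.001, "label_factors": True, "clean_gs": True}
ok, extra = split_supported_kwargs(fake_est_spectra, requested)
print(sorted(extra))
```

Arguments that land in `extra` are not necessarily wrong, since some may be legitimate `train()` options; the list only shows what is being forwarded.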

Does Spectra accept processed expression matrix as the input?

As input, the algorithm receives a normalized cell-by-gene count matrix.

Because I want to use the integrated scRNA data, I only have the processed cell-by-gene expression matrix.
Will it affect the results if I use a processed expression matrix instead of a count matrix as the input?

issue with self.rho = nn.ParameterDict(self.rho)

Hi all, I was running through the code below in the provided tutorial and came across an error.

model = Spectra.est_spectra(adata=adata, gene_set_dictionary=annotations, use_highly_variable=True, cell_type_key="cell_type_annotations", use_weights=True, lam=0.1, delta=0.001, kappa=None, rho=0.001, use_cell_types=True, n_top_vals=50, label_factors=True, overlap_threshold=0.2, clean_gs=True, min_gs_num=3, num_epochs=2)

Screenshot 2023-10-20 at 10 58 46 PM
Screenshot 2023-10-20 at 10 59 07 PM

When rho=None, est_spectra runs just fine.
The same error with rho=0.001 also happens with kappa=0.001.

TabError: inconsistent use of tabs and spaces in indentation

A week ago the model ran perfectly. Today, when I try to import the package and its modules, I get the error "TabError: inconsistent use of tabs and spaces in indentation".

from spectra import spectra as spc
TabError: inconsistent use of tabs and spaces in indentation

from spectra import spectra_util as spc_tl
TabError: inconsistent use of tabs and spaces in indentation

from spectra import spectra_util as spc_tl
TabError: inconsistent use of tabs and spaces in indentation

I used the original notebook (tutorial notebook) of spectra.

Please help: how can I avoid this error?
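A TabError on import most likely means that a source file in the installed package mixes tabs and spaces in its indentation, so the problem would be in the installed files rather than in the notebook. Python ships a checker for this (`python -m tabnanny <path>`). As a rough, hand-rolled alternative, the sketch below lists lines whose indentation contains a tab, on the assumption that the rest of the file is space-indented; `find_tab_indented_lines` is a hypothetical helper.

```python
def find_tab_indented_lines(source):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    flagged = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            flagged.append(lineno)
    return flagged


# Line 2 uses spaces, line 3 mixes a tab with spaces, line 4 uses a tab only.
sample = "def f():\n    x = 1\n\t    y = 2\n\treturn x\n"
print(find_tab_indented_lines(sample))
```

Running this over the text of the module named in the traceback should narrow down which lines need to be re-indented.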

train() got an unexpected keyword argument 'label_factors'

Hi there,

When using spc.est_spectra, I am getting the error train() got an unexpected keyword argument 'label_factors'. I installed it from GitHub recently and can see that label_factors is part of spectra.py, so I am trying to understand why I am getting this error.

Any insights you can provide will be helpful! Thanks!

spc.est_spectra stalls and goes to the next step after some epochs

In my data set (114k cells, 100 gene sets, and 10 cell types), spc.est_spectra stops after some epochs without any error message. Most of the outputs are populated in adata, except adata.var['spectra_vocab']. The stopping point differs between reruns: 12%, 32%, 58%, and 4% of the 10k epochs. I have not changed any parameters or compute nodes/VMs; I simply restarted my Jupyter notebook and reran to see what happens.

Monitoring training

Hi,

Thank you for creating this really interesting project. I'm eager to use it in my work. One thing that would be really useful is if there was some way to easily examine training history to actually confirm convergence. Also, if some kind of early stopping could be implemented, that would be very helpful as Spectra takes a while to run.

Thank you,
Connor
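To make the request concrete, convergence monitoring with early stopping could be sketched as below. This is a generic patience-based rule, not Spectra's training loop: `EarlyStopper` is a hypothetical class, and the loss values are made up for illustration.

```python
class EarlyStopper:
    """Stop when the loss has not improved by min_delta for `patience` updates."""

    def __init__(self, patience=3, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0
        self.history = []  # every loss seen, for plotting afterwards

    def update(self, loss):
        """Record a loss value; return True if training should stop."""
        self.history.append(loss)
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


stopper = EarlyStopper(patience=3)
losses = [10.0, 8.0, 7.5, 7.5, 7.5, 7.5, 7.4]  # illustrative values
for epoch, loss in enumerate(losses):
    if stopper.update(loss):
        print(f"stopping at epoch {epoch}, best loss {stopper.best}")
        break
```

Keeping the full `history` list also covers the first part of the request, since it can be plotted after training to confirm that the loss has flattened.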

use_cell_types not implemented in Spectra_gpu

Good day!

I want to use spectra for each main group of cells (myeloid, lymphoid, etc) in my tumor data. In that case, I want to set use_cell_types=False, but this option is not implemented for GPU. When do you think it will become available?

Thanks in advance!

Error when a gene in a gene set is not contained in AnnData

Hi,
thank you for creating spectra, promising concept!
I tried it on CD8 pos TILs with the given genesets, but ran into two problems when initializing:
All with the provided genesets.

  1. If the gene in one of the genesets is not contained in the anndata, a Key Error is thrown.
  2. When computing the init scores, I get a float division by zero error
    Traceback
ZeroDivisionError                         Traceback (most recent call last)
Cell In [21], line 1
----> 1 model = spc.est_spectra(adata = cd8_ad,  gene_set_dictionary = gene_set_dictionary,  use_highly_variable = False, lam = 0.01,cell_type_key = "Celltype Pred",use_cell_types=True,filter_sets=True)
      3 #save trained model
      4 model.save("test_model")

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/spectra.py:1195, in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, **kwargs)
   1193 if use_cell_types:
   1194     new_gs_dict = {}
-> 1195     init_scores = compute_init_scores(gene_set_dictionary,word2id, torch.Tensor(X))
   1196     for ct in gene_set_dictionary.keys():
   1197         new_gs_dict[ct] = {}

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/initialization.py:47, in compute_init_scores(gs_dict, word2id, W)
     45         gs = gs_dict[key][inner_key]
     46         idxs = [word2id[word] for word in gs]
---> 47         coh = mimno_coherence_2011(idxs, W)
     48         init_scores[key][inner_key] = coh.item() 
     49 else: 

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/initialization.py:25, in mimno_coherence_2011(words, W)
     23         score += mimno_coherence_single(words[j1], words[j2], W)
     24 denom = V*(V-1)/2
---> 25 return(score/denom)

ZeroDivisionError: float division by zero

If you could advise me, I would be very happy!
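Both problems might be avoided by cleaning the dictionary before calling est_spectra. In the traceback, `denom = V*(V-1)/2` is zero whenever `V` is 0 or 1, which (assuming `V` is the number of genes in the set) happens when a set has fewer than two usable genes. A possible pre-filtering step is sketched below; `filter_gene_sets` is a hypothetical helper, and `vocab` stands in for the gene names in the data (for example `adata.var_names`).

```python
def filter_gene_sets(gene_set_dict, vocab, min_genes=2):
    """Keep only genes present in vocab; drop sets with fewer than min_genes."""
    vocab = set(vocab)
    filtered = {}
    for cell_type, sets in gene_set_dict.items():
        filtered[cell_type] = {}
        for name, genes in sets.items():
            kept = [g for g in genes if g in vocab]
            if len(kept) >= min_genes:
                filtered[cell_type][name] = kept
    return filtered


vocab = ["CD8A", "GZMB", "PDCD1", "TOX"]
gene_sets = {
    "global": {"single_gene": ["GZMB", "NOT_IN_DATA"]},
    "CD8_T": {"exhaustion": ["PDCD1", "TOX", "LAG3"]},
}
print(filter_gene_sets(gene_sets, vocab))
```

Here the "single_gene" set is dropped because only one of its genes is in the data, and "LAG3" is removed from the remaining set.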

GPU tutorial?

I'm looking to use the GPU implementation of Spectra as mentioned in issue #21.

When you load from Spectra import Spectra_gpu as spc_gpu you get this info:

Spectra GPU support is still under development. Raise any issues on github 
 
 Changes from v1: 
 (1) GPU support [see tutorial] 
 (2) minibatching for local parameters and data 
 Note that minibatching may affect optimization results 
 Code will eventually be merged into spectra.py

Is there a tutorial hiding somewhere? I'm especially interested in the minibatching.

Thanks!

UnboundLocalError: local variable 'is_global' referenced before assignment

Hello,
I am trying to run spectra using cell type labels:

import Spectra
import scanpy as sc
import pandas as pd
import numpy as np
import cytopus as cp

#subset my dataset from cytopus
G = cp.KnowledgeBase()
celltype_of_interest = ['T']
global_celltypes = ['all-cells','leukocyte']
G.get_celltype_processes(celltype_of_interest,global_celltypes = global_celltypes,get_children=True,get_parents =False)
annotations = G.celltype_process_dict

#Run spectra
model = Spectra.est_spectra(
    adata=adata, 
    gene_set_dictionary=annotations, 
    use_highly_variable=True,
    cell_type_key="predicted.celltype.l1", 
    use_weights=True,
    lam=0.1, #varies depending on data and gene sets, try between 0.5 and 0.001
    delta=0.001, 
    kappa=None,
    rho=0.001, 
    use_cell_types=False,
    n_top_vals=50,
    label_factors=True, 
    overlap_threshold=0.2,
    clean_gs = True, 
    min_gs_num = 3,
    num_epochs=5000
)

It finishes the process, but gives the following error:

Cell type labels in gene set annotation dictionary and AnnData object are identical
removing gene set T for cell type global which is of length 14 0 genes are found in the data. minimum length is 3
removing gene set global for cell type global which is of length 150 0 genes are found in the data. minimum length is 3
Your gene set annotation dictionary is now correctly formatted.
/home/ubuntu/anaconda3/envs/scFates-gpu/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/ubuntu/anaconda3/envs/scFates-gpu/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
100%|██████████████████████████████████████████████████████████████████████████████████| 5000/5000 [38:15<00:00,  2.18it/s]
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[12], line 1
----> 1 model = Spectra.est_spectra(
      2     adata=adata, 
      3     gene_set_dictionary=annotations, 
      4     use_highly_variable=True,
      5     cell_type_key="predicted.celltype.l1", 
      6     use_weights=True,
      7     lam=0.1, #varies depending on data and gene sets, try between 0.5 and 0.001
      8     delta=0.001, 
      9     kappa=None,
     10     rho=0.001, 
     11     use_cell_types=False,
     12     n_top_vals=50,
     13     label_factors=True, 
     14     overlap_threshold=0.2,
     15     clean_gs = True, 
     16     min_gs_num = 3,
     17     num_epochs=5000
     18 )

File ~/anaconda3/envs/scFates-gpu/lib/python3.8/site-packages/Spectra/Spectra.py:1314, in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, label_factors, clean_gs, min_gs_num, overlap_threshold, **kwargs)
   1311 #labeling function
   1312 if label_factors:
   1313     #get cell type specificity of every factor
-> 1314     if is_global == False:
   1315         celltype_dict = get_factor_celltypes(adata, cell_type_key, cellscore=spectra.cell_scores)
   1316         max_celltype = [celltype_dict[x] for x in range(spectra.cell_scores.shape[1])]

UnboundLocalError: local variable 'is_global' referenced before assignment

Util function improvements

In check_gene_set_dictionary, no error or warning is raised if AnnData and the gene set dictionary have the same number of keys but some of them are misspelled. I propose changing the condition to the following to cover this scenario:
if (len(adata_labels)<len(annotation_labels)) | (set(annotation_labels) != set(adata_labels)):

Then the print will output the mismatched keys.

Also, cell type labels cannot include periods because it will throw the following torch error:
KeyError: 'parameter name can't contain "."'

This should be checked as well.
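The two checks proposed here could be combined along these lines. `validate_cell_type_labels` is a hypothetical helper, not the existing `check_gene_set_dictionary`, and treating "global" as an extra key that needs no matching label is an assumption.

```python
def validate_cell_type_labels(adata_labels, annotation_labels, global_key="global"):
    """Report label mismatches and labels containing a period."""
    adata_set = set(adata_labels)
    annot_set = set(annotation_labels) - {global_key}
    return {
        "missing_in_annotations": sorted(adata_set - annot_set),
        "missing_in_adata": sorted(annot_set - adata_set),
        "contains_period": sorted(
            label for label in adata_set | annot_set if "." in label
        ),
    }


report = validate_cell_type_labels(
    adata_labels=["B_cell", "T_cell", "NK.cell"],
    annotation_labels=["global", "B_cell", "Tcell", "NK.cell"],
)
print(report)
```

Comparing the two sets directly catches the misspelling ("T_cell" vs "Tcell") even though both sides have the same number of labels.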

Computing factor importance scores

Thanks for the amazing work!

Is there any way to infer which factors are important from the results? I see some functions in Spectra_utils, but couldn't get them to work and couldn't find documentation for them either. It would be great to get some help.

Thanks

Unexpected spectra factors and some misc questions

Good day,

I have run Spectra successfully using both CPU and GPU. However, I have noticed an unexpected behavior in both approaches. I used a downsampled version of my data (120K cells), normalized using scran, and grouped in major cell types (malignant, myeloid, lymphoid, and vascular). In addition to the global gene sets, I added some specific ones for each cell group, and in total, the dictionary contains 465 gene sets.

When I examined and plotted the output stored in adata.uns['SPECTRA_overlap'] and the respective scores in adata.obsm['SPECTRA_cell_scores'], I noticed that the order and naming of the gene sets changed. In some cases, gene sets appear in global and other categories more than once and show different results. Also, some categories are called only global, or malignant or myeloid. I thought they could be a summary of all gene sets in that category, but once again, there are several with the same name and different scores/distributions. See the example below (in this case, the gene set called ‘jessa22_M2’ is in the dictionary only as an entry for the malignant category and contains a signature of 50 genes).

Screenshot 2023-06-20 at 13 56 19

Screenshot 2023-06-20 at 13 56 39

I am using

model = spc.est_spectra(adata = adata, gene_set_dictionary = annotations, 
                        use_highly_variable = True, cell_type_key = "cell_type_annotations", 
                        use_weights = True, 
                        lam = 0.1, 
                        delta=0.001,kappa = 0.00001, rho = 0.00001, 
                        use_cell_types = True, n_top_vals = 50, 
                        label_factors = True,
                        overlap_threshold = 0.2,
                        num_epochs=10000,
                        verbose = True
                       )

Why do you think this is happening? Should I reduce lam (e.g., 0.01) to make it more strict?

I also wonder how can you discriminate potentially novel programs as you described in the manuscript for the therapy-induced macrophages? You mentioned in the methods that you defined “new factors as factors with a graph dependency parameter η < 0.25 and modified factors as factors with a graph dependency parameter η ≥ 0.25”. Which of the outputs can I use to filter potentially novel factors?

I also read in the methods that the gene set dictionary is optional. In that case, will Spectra perform an unsupervised gene program detection?

About the SPECTRA_cell_scores, I saw that most of the scores are not higher than 0.02. Do you know if this is normal? I see these scores even for well-known gene sets that define, for instance, the cell cycle. (I see the scales presented in your manuscript are much higher)

Finally, as I mentioned in #21, GPU can reach a lower LR with fewer epochs than the CPU implementation. Does this suggest the GPU version can provide more accurate results in less time?

Thanks in advance! Sorry for the questions, but I am excited to get this working on my dataset (and understanding better the outputs)!

I'm looking forward to hearing your thoughts and suggestions.

Number of cores allocation

Hello,
We use the sbatch system to monitor tasks. When running Spectra, we found that it uses more cores than we assigned to it. Could you explain Spectra's CPU scheduling strategy, and is there any way to limit the number of cores?

Thanks!
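Spectra is built on PyTorch (see the nn.ParameterDict and torch.Tensor references elsewhere on this page), and numerical libraries of this kind usually default to using every core they can see. A common way to cap this is to set the thread-count environment variables before the numerical libraries are imported. Whether this fully constrains Spectra has not been verified here, and `limit_threads` is a hypothetical helper.

```python
import os


def limit_threads(n):
    """Set common thread-count variables; call before importing numpy/torch."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n)


limit_threads(4)  # match the number of cores requested from the scheduler
print(os.environ["OMP_NUM_THREADS"])

# After importing torch, torch.set_num_threads(4) caps its intra-op threads too.
```

The value passed in should match the core count requested in the batch script, so that the job and the scheduler agree.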

Error when running spc.est_spectra

Hello!

Congrats and thanks for your work! A few days ago I read your group's tweet and thought SPECTRA is exactly what I need for my data.
I am trying to disentangle CD8 activation/TCR reactivity vs exhaustion. I apologize in advance for the naive question, I am an R user.

I used Seurat to analyze my data, then saved the single cell object as h5ad file. Here is my script:

import os
import pandas as pd
from spectra import spectra as spc
import scanpy

os.getcwd()
adata = scanpy.read_h5ad("CD8_ad.h5ad")
pal_ident = ["#F0E442","#E69F00", "#009E73", "#56B4E9" ] #custom palette for clusters
adata.uns['clusters']= pal_ident
scanpy.pp.highly_variable_genes(adata)

Then, following your README file (I am not sure I followed it properly), I did:

gene_set_annotations = {"global": {"global_ifn_II_response" : ["CIITA", "CXCL10"] , "global_MHCI": ["HLA-A", "HLA-B", "HLA-C"] },
                        "CD8_T": {"CD8_tex": ['TOX', 'LAG3', 'PDCD1', 'HAVCR2', 'EOMES', 'ID2', 'CD244', 'NR4A3', 'NR4A2', 'NR4A1',
                                              'CXCL13', 'IRF4','PRDM1'],
                                  "CD8_reac": ['MKI67', 'TNFRSF9', 'TNFRSF18', 'GZMA', 'IFNG', 'ENTPD1', 'ITGAE', 'CXCL13']}
}
model = spc.est_spectra(adata = adata,  gene_set_dictionary = gene_set_annotations, cell_type_key = "clusters", use_highly_variable = True, lam = 0.01)

At first I had not included "global" (since I am only interested in CD8), but I was getting
KeyError: 'global'

Including the "global" variable the function runs till ~30% and then throws this error:

ValueError: operands could not be broadcast together with shapes (15486,9) (1,6) 

I think I understand the type of error but not how to fix it. Hope you can help!

Best,
Francesco

The meaning of gene weight?

I have a factor A whose top 50 genes have high gene weights, and a factor B whose top 50 genes have low gene weights.
Does this mean factor A is of higher quality than factor B?

For example:

> factor A
CDKN2A     0.325802
BRCA2      0.313171
CDC6       0.300558
RFC4       0.298734
SLC1A4     0.295682
RRM2       0.284938
OXCT1      0.277684
GGH        0.277203
POLD3      0.271817
E2F1       0.271748
> factor B
CDKN2A     0.325802
BRCA2      0.313171
CDC6       0.300558
RFC4       0.298734
SLC1A4     0.295682
RRM2       0.284938
OXCT1      0.277684
GGH        0.277203
POLD3      0.271817
E2F1       0.271748

factor importance and information scores

Thanks for the amazing package!

I successfully managed to run spectra on a few datasets and I was eager to calculate the importance and information scores for the factors that spectra found. However, while reading the utils functions it was not immediately clear how to use these functions to calculate the scores for each factor:

  1. There is no mention of an importance score function; am I right to assume that it is calculated with the holdout_loss() function? For this function I am unsure what the cell_type and labels arguments should contain. The lines below suggest that there should be a loop over each unique cell_type, where labels is an array of the cell_type annotations for each cell. Am I correct that this loop is missing from the current code?
    # loop through cell types and evaluate loss at every cell type
    X_c = X[labels == cell_type]
  2. The get_information_score function returns an empty list because of the commented-out code. The labels parameter is missing here, as mentioned in the TODO:
    # TODO: Fix undefined "labels" variable
    Is labels supposed to contain the same array of cell_type annotations as in holdout_loss()?

In #24 (comment) it is mentioned that an example would be added to the tutorial, but I have not found it there. Ideally I would like to make a figure similar to "Extended Data Fig. 6a".

Any help would be greatly appreciated.

Annotation format requirement

What annotation format is required? Is it possible to use gene sets directly from a pathway database, for instance the C2 JSON bundle from the Broad Institute pathway database?
Thanks,
Shams
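From the examples elsewhere on this page, the annotations are a nested dict of the form {cell type or "global": {gene set name: [gene symbols]}}. A pathway database export can be reshaped into that form. The sketch below assumes each entry of the JSON bundle stores its genes under a `geneSymbols` field; check this against the file you downloaded. `msigdb_json_to_annotations` is a hypothetical helper and the sample string is made up.

```python
import json


def msigdb_json_to_annotations(json_text, group="global", gene_field="geneSymbols"):
    """Convert {set_name: {gene_field: [...]}} JSON into {group: {set_name: [...]}}."""
    raw = json.loads(json_text)
    return {group: {name: list(entry[gene_field]) for name, entry in raw.items()}}


# Made-up sample in the assumed layout, not real database content.
sample = '{"EXAMPLE_PATHWAY": {"geneSymbols": ["TP53", "MDM2", "CDKN1A"]}}'
annotations = msigdb_json_to_annotations(sample)
print(annotations)
```

Placing everything under "global" is only one option; sets that apply to a single cell type would go under that cell type's key instead.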
