mousepixels / sanbomics_scripts


scripts and notebooks from sanbomics

Jupyter Notebook 98.22% HTML 1.78%

sanbomics_scripts's People

hesmalek mansi260 olivehblee cemkaratastan slives-lab mmxx2022 rinku313 sayem-eee-kuet sophie006liu sidy2015 khl0798 shicheng-guo happypuppycoder shunsunsun hanis-gatech sciaan101 ustcjyh tidowi bbyun28 hossein-fallahi harry-maan sbwiecko gautam-lk a-vasquez-2 howtofindme marwan-unipd ronfinn sunnyyuan99 rimanb zhaojk ukmrs daltonjay subhajitbotany ngshasan mdbabumiamssm merckey capuccino26 iyalue aphbt suraj-adewale peterk678 helianthuszhu danieltiki sogada claratejido ewowiredu zzygyx9119 amyylin1 sehwanyoo ducminhnguyenle sultanghazala maratat mdmkac1 pranavmishra90 rezar12 huruifeng poku0857643 drejom mmasoud1 yupaulk zmokhtari sli0202 genostack olabiyi tealeave gurdgd92 kiddo18 pastvir jhpanda ndaniel qiuxiazhou nine-sarayut faker1c learning-jusue404 carlosseiyahigashilo sainwhisper hqi87 veera1988 mmwsygmyr abedkurdi siriusstarzx aaa7260 fallahi-bioinformatics-lab hongzhan2015 clementleong spongxin piyumalanthony whve padra66 surpoudel kashingtang pilaneli shahab178 juzheng87 cynthia1012 xueyao0830 jakelehle zhengxj1 rpc3312 animesh

sanbomics_scripts's Issues

high_quality_volcano_plots.ipynb

Hi Mark,

Thank you very much for the code it is really useful.

I am just a beginner. I was thinking of a volcano plot in which downregulated genes are blue and overexpressed genes are red, leaving the non-differentially expressed genes in grey. Could you give me any advice, please?

Thanks a lot

Vic
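One way to approach the colouring, sketched under assumptions: the thresholds (`fc_cut`, `p_cut`) and the helper names `volcano_colors` and `draw_volcano` are illustrative and not taken from the notebook. The idea is to compute one colour per gene first and then hand that list to matplotlib's `scatter` through its `c` argument.

```python
def volcano_colors(log2fc, padj, fc_cut=1.0, p_cut=0.05):
    """Return one colour per gene: blue = down, red = up, grey = not significant."""
    colors = []
    for fc, p in zip(log2fc, padj):
        if p < p_cut and fc <= -fc_cut:
            colors.append("blue")    # significantly downregulated
        elif p < p_cut and fc >= fc_cut:
            colors.append("red")     # significantly upregulated
        else:
            colors.append("grey")    # fails the p-value or fold-change cutoff
    return colors


def draw_volcano(ax, log2fc, neg_log10_padj, colors):
    """Scatter the points on an existing matplotlib Axes supplied by the caller."""
    ax.scatter(log2fc, neg_log10_padj, c=colors, s=8)
    ax.set_xlabel("log2 fold change")
    ax.set_ylabel("-log10 adjusted p-value")


# Tiny made-up example: four genes with their fold changes and adjusted p-values.
example = volcano_colors([-2.5, 0.3, 3.1, 2.0], [0.001, 0.2, 0.0004, 0.3])
```

The cutoffs are a judgment call; adjust them to whatever the rest of the analysis uses.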

About singler.rmd: is there a human reference specific to a tissue, such as liver?

Dear Sanbomics,

Thanks a lot for your efforts; they inspired me a lot.
My question: is there a human reference specific to individual tissues, like what "TabulaMurisData" provides for Mus musculus?
Thanks again.

Best regards,
Na

SoupX tutorial: meta.data nCount_RNA contains information from previous RNA assay

Hi,
thank you a lot for your SoupX tutorial. I found the video and the code very helpful during my own analysis, since I am new to scRNA-seq analysis. One thing I stumbled upon is that nCount_RNA is not "updated" when I replace the RNA assay counts of the Seurat object with the SoupX-filtered counts. I'm referring to line 94 (sobj@assays$RNA@counts <- out) of your .Rmd file in this repository.
I noticed this when I tried to check whether the counts had actually changed with
summary(sobj@meta.data$nCount_RNA == sobj@meta.data$nCount_original.counts)
and nothing had changed.
When I tried to overwrite the RNA assay with
sobj[["RNA"]] <- CreateAssayObject(counts = out)
nothing changed. When I do this with a new assay name, e.g.
sobj[["RNAsoupx"]] <- CreateAssayObject(counts = out)
it works.

I'm not sure if this is the right place to post my observation/question, but maybe you have an idea how the sobj@meta.data$nCount_RNA information can be updated while keeping the name "RNA" for this assay. I'm not sure whether giving the assay a new name could cause any problems during downstream analysis.

Thank you a lot :)

single_cell_analysis_complete_class.ipynb

Hi @mousepixels

I'm trying to read this GEO dataset (GSE198896) and I can't load it using Scanpy's adata function. The file in CSV format is not provided. In this case, how can I load this dataset?

ERROR in pseudobulk_pyDeseq2.ipynb with pseudo replicates

Hi,

I was testing your tutorial on pseudobulk with pseudo-replicates and I noticed that all the replicates created by your script are identical, because the data are never sliced with the pseudo_rep indices that you created.

I propose changing the body of this loop from

for i, pseudo_rep in enumerate(indices):

    rep_adata = sc.AnnData(X = samp_cell_subset.X.sum(axis = 0),
                           var = samp_cell_subset.var[[]])

to

    rep_adata = sc.AnnData(X = samp_cell_subset[samp_cell_subset.obs_names.isin(pseudo_rep)].X.sum(axis = 0),
                           var = samp_cell_subset.var[[]])

And thanks for the cool work you do :)
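To make the point concrete without AnnData, here is a toy sketch of the same idea in plain Python. The helper names and the four-cell count table are invented for illustration. Each pseudo-replicate sums only its own cells, so the replicate totals partition the full sum instead of repeating it.

```python
import random


def split_into_pseudo_reps(cell_ids, n_reps, seed=0):
    """Shuffle the cell ids and deal them into n_reps disjoint groups."""
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_reps] for i in range(n_reps)]


def sum_counts(counts_by_cell, cell_ids):
    """Sum per-gene counts over only the given cells."""
    n_genes = len(next(iter(counts_by_cell.values())))
    totals = [0] * n_genes
    for cell_id in cell_ids:
        for gene_idx, value in enumerate(counts_by_cell[cell_id]):
            totals[gene_idx] += value
    return totals


# Made-up counts: four cells, two genes.
counts = {"c1": [1, 0], "c2": [0, 5], "c3": [2, 2], "c4": [7, 1]}

reps = split_into_pseudo_reps(counts, n_reps=2)
# The key step: each replicate is summed over its own subset of cells.
rep_sums = [sum_counts(counts, rep) for rep in reps]
full_sum = sum_counts(counts, counts)
```

Summing the whole subset inside the loop, as in the original line, would instead give `full_sum` for every replicate.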

Location of lung1.h5 file

Can you please provide the link to download the lung1.h5 file, which is used in the scvi_label_transfer.ipynb notebook?

Integration question + consulting work?

Hi Mark,

I'm running your scripts on my local computer, and it crashed with an out-of-memory error when I tried to integrate the 26 samples. First question: if I don't integrate, could I look at each sample alone and get the same results, or does integration do something to the data (i.e. normalize the samples with respect to each other), so that I would want to integrate everything?

Second question: my single-cell data is unusual. We do depletion using CRISPR-Cas9 to remove abundant, ribosomal, and mitochondrial targets, which redistributes reads onto low-expression targets. Would you modify anything in the workflow in that scenario? Also, are you available for hire as a consultant?

Thanks,
Smita

Assigning mitochondrial genes variable

In my dataset the mitochondrial genes are not all labeled with 'mt-'; some of them have names like 'NC_002333.23' or 'NC_002333.24'. I have a txt file with all the mitochondrial genes, and I have loaded it into the notebook with

x = open('michondrialgenesDR11.txt', 'r')
mitogenes = x.read()

I am now looking for a command to annotate the genes in the txt file as adata.var['mt']. Could you please help? Thank you!
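A possible approach, sketched with plain Python so it runs without an AnnData object: split the text read from the file into a set of names, then test each variable name for membership. The helper names `parse_gene_list` and `flag_genes` are made up here, and the sketch assumes the file separates names with commas and/or whitespace.

```python
def parse_gene_list(text):
    """Split a gene list on commas and whitespace, dropping empty entries."""
    return {name for name in text.replace(",", " ").split() if name}


def flag_genes(var_names, gene_set):
    """Return True/False per variable name, in the original order."""
    return [name in gene_set for name in var_names]


# Stand-in for the string returned by x.read() on the gene list file.
mito = parse_gene_list("NC_002333.23, NC_002333.24\nmt-nd1")
flags = flag_genes(["mt-nd1", "actb1", "NC_002333.24"], mito)

# With a real AnnData object the same idea would be (assumed usage,
# relying on pandas Index.isin):
#   adata.var["mt"] = adata.var_names.isin(list(mito))
```

If the file has one gene per line the same parser still works, since newlines count as whitespace.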

Problem to load files in Scanpy

Hi!

I am having some issues with these files that I got from GEO:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi

GSM3577882_normal_panc_barcodes.tsv.gz
GSM3577882_normal_panc_genes.tsv.gz
GSM3577882_normal_matrix.mtx.gz

This is my code: adata=sc.read_10x_mtx('./', prefix='GSM3577882_normal_',var_names='gene_symbols', cache=True )

I cannot load those samples in Scanpy. I have tried different approaches with no success, so I decided to download the SRR files from GEO and practice with Cell Ranger, following your YouTube video. Here I ran into another issue. These are the files for normal pancreas, which correspond to two samples:

SRR8485290.fastq
SRR8485291.fastq
SRR8485292.fastq
SRR8485293.fastq

I am not able to make Cell Ranger work. Could it be because of how the samples are named?

Should I run each sample individually? How should I proceed with the tumor samples? Should I run each sample independently in Cell Ranger and then integrate the results in Scanpy?

I am really interested in this dataset and I don't know what to do.

Thank you very much for your help and availability.

Victor
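One thing that stands out in the file list is that the three files do not share a single prefix (two contain `normal_panc_`, the matrix file only `normal_`), so one `prefix=` argument cannot match all of them. A possible workaround, sketched below, is to decompress the files into a separate folder under the plain names `barcodes.tsv`, `genes.tsv` and `matrix.mtx`. This assumes Scanpy's reader accepts that older uncompressed layout; the helper `stage_10x_files` and the demo file names are invented for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

LEGACY_NAMES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def stage_10x_files(src_dir, dest_dir):
    """Decompress GEO-style *.gz files into dest_dir under plain 10x names."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    staged = []
    for standard in LEGACY_NAMES:
        # Match any prefix, e.g. GSM..._normal_panc_barcodes.tsv.gz
        matches = sorted(src.glob(f"*{standard}.gz"))
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one *{standard}.gz, found {len(matches)}"
            )
        target = dest / standard
        with gzip.open(matches[0], "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        staged.append(target)
    return staged


# Self-contained demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "geo"
    src.mkdir()
    for name in ("GSM_x_panc_barcodes.tsv.gz",
                 "GSM_x_panc_genes.tsv.gz",
                 "GSM_x_matrix.mtx.gz"):
        with gzip.open(src / name, "wt") as handle:
            handle.write("demo\n")
    staged_names = [p.name for p in stage_10x_files(src, Path(tmp) / "tenx")]

# Afterwards (assumed usage, not run here):
#   adata = sc.read_10x_mtx("tenx", var_names="gene_symbols")
```

Keeping one sample per folder also avoids the helper picking up files from another GSM entry.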

single-cell RNA velocity analysis

Hi,

I really appreciate your tutorial on scRNA velocity.

I am trying to merge my output loom file from velocyto and my preprocessed counts like you did here:
adata = pp('../tutorial_sample/outs/filtered_feature_bc_matrix/')
ldata = scv.read('../tutorial_sample/velocyto/tutorial_sample.loom')
adata = scv.utils.merge(adata, ldata)

My question: do the dimensions of the two objects have to match exactly, or does the merge take the intersection of barcodes/cells and the intersection of genes? In my case the merge runs, but the spliced and unspliced information does not appear in the new object, and I guess that is because the dimensions do not match.

Thank you,
Diego
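As far as I know, the merge keeps only the cells and genes found in both objects, so a quick sanity check is to see how many barcodes actually overlap once naming differences are stripped away. The sketch below assumes 10x-style names such as `BARCODE-1` and velocyto-style names such as `sample:BARCODEx`; the helpers `core_barcode` and `shared_barcodes` are invented for illustration.

```python
import re


def core_barcode(name):
    """Return the longest run of A/C/G/T in a cell name (assumed to be the barcode)."""
    runs = re.findall(r"[ACGT]+", name)
    return max(runs, key=len) if runs else name


def shared_barcodes(names_a, names_b):
    """Barcodes present in both collections after stripping prefixes and suffixes."""
    return {core_barcode(n) for n in names_a} & {core_barcode(n) for n in names_b}


# Made-up obs names in the two assumed conventions.
tenx_names = ["AAACCCAAGAAACACT-1", "AAACCCAAGAAACCAT-1"]
loom_names = ["tutorial_sample:AAACCCAAGAAACACTx", "tutorial_sample:GGGTTTCCCAAATTTGx"]

common = shared_barcodes(tenx_names, loom_names)
```

With real data, `adata.obs_names` and `ldata.obs_names` would take the place of the two lists. A very small overlap would point to a naming mismatch and not to a problem with the counts.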

Issue convert anndata to seurat object

hello all,

I've written some code based on your YouTube video (included below).

However, I get the following error when I try to load the output into R using Read10X from Seurat:

Error in readMM(file = matrix.loc) :
'readMM()' is not yet implemented for representation 'array'

Any help is greatly appreciated!

Here's the code I wrote based on the youtube video (note I've tried different formats for the sparse matrix - int, int64, etc.)

See attached file for reference


    @staticmethod
    def create_10x_files(adata, dir, raw_layer):
        RhapsodyReader.create_barcodes_table(adata=adata, save_path=f"{dir}/barcodes.tsv")
        RhapsodyReader.create_features_table(adata, f"{dir}/features.tsv")
        RhapsodyReader.create_matrix(adata, f"{dir}/matrix.mtx", layer=raw_layer)
        # RhapsodyReader.gzip_files_in_directory(dir)

    
    @staticmethod
    def create_matrix(adata, save_path, layer=None):
        if layer:
            matrix = adata.layers[layer].T
        else:
            matrix = adata.X.T
            
        io.mmwrite(save_path, matrix.astype(np.int64))

    @staticmethod
    def create_barcodes_table(adata, save_path):
        with open(save_path, "w") as file:
            for item in adata.obs_names:
                file.write(item + "\n")
    
    @staticmethod
    def create_features_table(adata, save_path):
        tab_var_names = ["\t".join([x, x, "Gene Expression"]) for x in adata.var_names]

        with open(save_path, "w") as file:
            for var_name in tab_var_names:
                file.write(var_name + "\n")

matrix.mtx.gz
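The error message suggests the file was written in the dense Matrix Market "array" representation, while `readMM` expects the sparse "coordinate" one. The sketch below writes the coordinate layout by hand so the difference is visible; `write_mtx_coordinate` is an illustrative helper, not part of any library. With SciPy, my understanding is that passing a sparse matrix to `mmwrite` (see the final comment) produces the same layout, but treat that line as an untested assumption.

```python
import os
import tempfile


def write_mtx_coordinate(path, dense_rows):
    """Write a dense list-of-lists of ints as a Matrix Market 'coordinate' file."""
    n_rows = len(dense_rows)
    n_cols = len(dense_rows[0]) if n_rows else 0
    entries = [
        (i + 1, j + 1, value)  # Matrix Market indices are 1-based
        for i, row in enumerate(dense_rows)
        for j, value in enumerate(row)
        if value != 0          # only non-zero entries are stored
    ]
    with open(path, "w") as handle:
        handle.write("%%MatrixMarket matrix coordinate integer general\n")
        handle.write(f"{n_rows} {n_cols} {len(entries)}\n")
        for i, j, value in entries:
            handle.write(f"{i} {j} {value}\n")
    return len(entries)


# Demo: a 3 x 2 matrix with two non-zero counts.
with tempfile.TemporaryDirectory() as tmp:
    mtx_path = os.path.join(tmp, "matrix.mtx")
    nnz = write_mtx_coordinate(mtx_path, [[0, 3], [1, 0], [0, 0]])
    with open(mtx_path) as handle:
        lines = handle.read().splitlines()

# Assumed SciPy equivalent inside create_matrix (not run here):
#   io.mmwrite(save_path, sparse.csr_matrix(matrix.astype(np.int64)))
```

Checking the first line of the written file is a quick way to confirm which representation was produced.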

Differential expression

Hi Mark,

I have a question. I am doing differential expression between two cell types in scRNA-seq. I tried scVI and it worked, but diffxpy did not; it gives me this error: ZeroDivisionError: float division by zero

I opened an issue with them; apparently it is a common error, because others have described the same problem. I am waiting for a reply, but they have not responded yet.

I wanted to try this method because I am worried about the discrepancy between the DEGs I get with scVI (approximately 2,400 DEGs) and those from Scanpy's differential expression function using the Wilcoxon test (9,000 DEGs). Have you compared different methods for differential expression? Could you give me any advice, please?

Thanks a lot.

Victor

Salmon index issue

I'm trying to construct the reference for Salmon, but the process stopped with this error:

salmon index -t CRCH38_and_decoys.fa.gz -d decoys.txt -i GRCh38_salmon_index --gencode

[2024-02-19 12:15:56.790] [puff::index::jointLog] [warning] Removed 882 transcripts that were sequence duplicates of indexed transcripts.
[2024-02-19 12:15:56.792] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the --keepDuplicates flag
[2024-02-19 12:15:56.919] [puff::index::jointLog] [info] Replaced 151122967 non-ATCG nucleotides
[2024-02-19 12:15:56.919] [puff::index::jointLog] [info] Clipped poly-A tails from 2034 transcripts
Killed

Could this be caused by running out of memory (RAM)?

go in R question about baseMean filtering

I am curious why you are filtering the expression data to only include genes with a base mean over 50? I have not really seen this as a step in other tutorials.

from GO_in_R.Rmd
sigs <- sigs[sigs$padj < 0.05 & sigs$baseMean > 50,]

thanks!

There is a typo in the code.

[sanbomics_scripts](https://github.com/mousepixels/sanbomics_scripts/tree/main) /simpleaf_alevin_fry_tutorial.txt

simpleaf quant --reads1 a_r1.fastq.gz,b_r1.fastq.gz --reads2 a_r2.fastq.gz,b_R2_001.fastq.gz --threads 28 --index simpleaf_index/index --chemistry 10xv3 --resolution cr-like --unfiltered-pl --expected-ori fw --t2g-map simpleaf_index/index/t2g_3col.tsv --output simpleaf_output

It should be:

simpleaf quant --reads1 fastq/pbmc_1k_v3_S1_L001_R1_001.fastq.gz,fastq/pbmc_1k_v3_S1_L002_R1_001.fastq.gz --reads2 fastq/pbmc_1k_v3_S1_L001_R2_001.fastq.gz,fastq/pbmc_1k_v3_S1_L002_R2_001.fastq.gz --threads 28 --index simpleaf_index/index --chemistry 10xv3 --resolution cr-like --unfiltered-pl --expected-ori fw --t2g-map simpleaf_index/index/t2g_3col.tsv --output simpleaf_output

CellBender scRNA-seq tutorial

Hi,

I was trying to follow the complete scRNA-seq tutorial using the CellBender h5 output file, by changing the loading step to adata = anndata_from_h5(mtx_path) (anndata_from_h5 being the function from broadinstitute/CellBender#57).
When I train the SOLO model I noticed something weird; it looks like it's not working:

/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call .var_names_make_unique.
utils.warn_names_duplicates("var")
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scvi/model/_utils.py:287: UserWarning: This dataset has some empty cells, this might fail inference.Data should be filtered with scanpy.pp.filter_cells()
warnings.warn(
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 267/267: 100%|██████████| 267/267 [13:27<00:00, 2.96s/it, loss=478, v_num=1]

INFO Creating doublets, preparing SOLO model.
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scvi/external/solo/_model.py:185: RuntimeWarning: divide by zero encountered in log
latent_adata = AnnData(np.concatenate([latent_rep, np.log(lib_size)], axis=1))
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass AnnData(X, dtype=X.dtype, ...) to get the future behavour.
[AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 1/400: 0%| | 1/400 [00:01<08:54, 1.34s/it, loss=nan, v_num=1]
Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.

Any idea why is that? Thank you!!

showing selected genenames on the plot

Hello,

I have created a volcano plot with your method and it looks good. I am just stuck on one step.

I created a list of selected 5 genes as below:
pikced10 = ('gene1', 'gene2', 'gene3', 'gene4', 'gene5')

Now I want to show the names of these 5 selected genes on the plot. I know you have shown
if df.iloc[i].nlog10 > 5 and abs(df.iloc[i].log2FoldChange) > 2:

but I am confused about how to use my customized list instead.

How can I feed this list of genes into the script so that only these 5 gene names are shown when I plot the graph?

Many thanks,
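One option is to swap the threshold test for a membership test against the list. The sketch below uses plain dictionaries in place of DataFrame rows; the `symbol` key is an assumption about what the gene-name column is called, and `labels_for` and `annotate` are illustrative helpers.

```python
picked = {"gene1", "gene2", "gene3", "gene4", "gene5"}


def labels_for(rows, wanted):
    """Return (x, y, name) for every row whose gene symbol is in `wanted`."""
    return [
        (row["log2FoldChange"], row["nlog10"], row["symbol"])
        for row in rows
        if row["symbol"] in wanted
    ]


def annotate(ax, labels):
    """Write each selected gene name next to its point on a matplotlib Axes."""
    for x, y, name in labels:
        ax.text(x, y, name, fontsize=8)


# Made-up rows standing in for the results table.
rows = [
    {"symbol": "gene1", "log2FoldChange": 2.4, "nlog10": 6.0},
    {"symbol": "geneX", "log2FoldChange": -3.0, "nlog10": 9.0},
    {"symbol": "gene4", "log2FoldChange": -0.2, "nlog10": 0.5},
]
to_label = labels_for(rows, picked)
```

In the notebook's loop, the equivalent change would be replacing the condition with something like `if df.iloc[i].symbol in picked:`, again assuming the column is named `symbol`.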

Error in import_atac function

young <- import_atac("GSM5723631_Young_HSC_filtered_peak_bc_matrix.h5",
'GSM5723631_Young_HSC_singlecell.csv',
'./GSM5723631_Young_HSC_fragments.tsv.gz')
When I run this code, this error appears:
Error in import_atac("GSM5723631_Young_HSC_filtered_peak_bc_matrix.h5", :
could not find function "import_atac"

Which package do I need to install to make this function work?

local variable 'id_length' referenced before assignment

Hi,
Thank you for this tutorial! It is very useful and I am almost done. But when I tried to merge the data in the step adata = scv.utils.merge(adata, ldata), I got this error:
UnboundLocalError Traceback (most recent call last)
Cell In[5], line 1
----> 1 adata = scv.utils.merge(adata, ldata)

File ~/.local/lib/python3.9/site-packages/scvelo/core/_anndata.py:526, in merge(adata, ldata, copy, **kwargs)
524 if "id_length" in kwargs:
525 id_length = kwargs.get("id_length")
--> 526 clean_obs_names(adata, id_length=id_length)
527 clean_obs_names(ldata, id_length=id_length)
528 common_obs = adata.obs_names.intersection(ldata.obs_names)

UnboundLocalError: local variable 'id_length' referenced before assignment
Do you have any idea why this error happens? Thank you very much!
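Reading the traceback, `id_length` is only assigned when it arrives as a keyword argument, yet it is used unconditionally on the next line. The toy function below reproduces that control flow in isolation; `merge_like` is a stand-in written for this illustration, not scvelo code.

```python
def merge_like(**kwargs):
    """Toy stand-in that copies the control flow visible in the traceback."""
    if "id_length" in kwargs:
        id_length = kwargs.get("id_length")
    # When the caller did not pass id_length, the name was never bound.
    return id_length


try:
    merge_like()
    failed = False
except UnboundLocalError:
    failed = True

# Passing the keyword explicitly takes the branch that binds the name.
value = merge_like(id_length=16)
```

If that reading is right, a possible workaround is to pass the keyword explicitly, e.g. `scv.utils.merge(adata, ldata, id_length=16)`, where 16 is an assumption based on typical 10x barcode length. Trying a different scvelo version may also help. Neither suggestion has been tested against this dataset.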

No fragment file present when extracting gene activity

Hello,
I am new to bioinformatics and R, and come from a wet-lab background. I am analyzing three single-cell multiomics (RNA + ATAC) datasets from 10x Genomics. When I merged the three datasets, the fragment files for each dataset were present. But after integrating with Harmony, when I try to extract gene activity, I get an error saying no fragment file is present. I am attaching the full code here.

##Create seurat object reading .h5 file and meta file

counts_APL1 <- Read10X_h5(filename = "path/to/filtered_feature_bc_matrix.h5")

meta_APL1 <- read.csv(
file = 'path/to/per_barcode_metrics.csv',
header = TRUE,
row.names = 1)

 chrom_assay <- CreateChromatinAssay(
  counts = counts_APL1$Peaks,
  sep = c(":", "-"),
  genome = 'hg38',
  fragments = 'path/to/atac_fragments.tsv.gz',
  min.cells = 3,
  min.features = 200
)

data_APL1 <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = meta_APL1
)

data_APL1[[]]

annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
Annotation(data_APL1) <- annotations

data_APL1 <- NucleosomeSignal(object = data_APL1) #fragment ratio 147-294: <147
data_APL1 <- TSSEnrichment(object = data_APL1, fast = FALSE)

##Here I get an error. I can't find blacklist_region_fragments or pct_reads_in_peaks. Even peak_region_fragments is stored in the metadata under the name atac_peak_region_fragments.

So I proceeded without these two parameters in QC

#data_APL1$blacklist_ratio <- data_APL1$blacklist_region_fragments / data$peak_region_fragments
#data_APL1[[]]
#data$pct_reads_in_peaks <- data$peak_region_fragments / data$passed_filters * 100

VlnPlot(
  object = data_APL1,
  features = c('nCount_peaks', 
               'nucleosome_signal', 'TSS.enrichment'),
  pt.size = 0.1,
  ncol = 5
)

data_APL1 <- subset(
    x = data_APL1,
    subset = atac_peak_region_fragments > 1000 &
           atac_peak_region_fragments < 100000 &
           nucleosome_signal < 4 &
           TSS.enrichment > 1)

##Created 3 SeuratObjects from 3 three samples

#add sample information for merging samples

Control$dataset <- "Control"
APL1$dataset <- "APL1"
APL2$dataset <- "APL2"

merged_atac <- merge(Control, y = c(APL1, APL2),  add.cell.ids = c("Control", "APL1", "APL2"), project = "Integrated_ATAC" )

merged_atac@meta.data

merged_atac <- FindTopFeatures(merged_atac, min.cutoff = 'q0')
merged_atac <- RunTFIDF(merged_atac)
merged_atac <- RunSVD(merged_atac)
merged_atac

#An object of class Seurat
181982 features across 3848 samples within 1 assay
Active assay: peaks (181982 features, 181982 variable features)
1 dimensional reduction calculated: lsi

merged_atac <- RunUMAP(object = merged_atac, reduction = 'lsi', dims = 2:30)
merged_atac <- FindNeighbors(object = merged_atac, reduction = 'lsi', dims = 2:30)

merged_atac <- FindClusters(object = merged_atac, verbose = FALSE, algorithm = 3, resolution = .4)
DimPlot(object = merged_atac, label = TRUE) + NoLegend()
DimPlot(object = merged_atac, label = TRUE, group.by = "dataset") + NoLegend()

#Batch correction using Harmony
integrated_atac_harmony <- RunHarmony(object = merged_atac, group.by.vars = 'dataset', reduction = 'lsi', assay.use = 'peaks', project.dim = FALSE)
integrated_atac_harmony <- RunUMAP(integrated_atac_harmony, dims = 2:10, reduction = 'harmony')
DimPlot(integrated_atac_harmony, group.by = 'dataset', pt.size = 0.5)

# Do UMAP and clustering using ** Harmony embeddings instead of PCA **
integrated_atac_harmony <- integrated_atac_harmony %>%
  RunUMAP(reduction = 'harmony', dims = 2:10) %>%
  FindNeighbors(reduction = "harmony", dims = 2:10) %>%
  FindClusters(resolution = 0.3)

# visualize 
DimPlot(integrated_atac_harmony, reduction = 'umap', group.by = 'dataset')
DimPlot(integrated_atac_harmony, reduction = 'umap')

gene.activities <- GeneActivity(integrated_atac_harmony)

##Here I get the error
##Error in GeneActivity(integrated_atac_harmony) :
No fragment information found for requested assay

This happens although the fragment files were present when I created the Seurat objects. Could you please help me find the cause of this issue, and suggest how to improve the pipeline for further integration with the merged scRNA dataset?

reproducing from filtered matrix

Hi,

Thank you so much for your walkthrough video!

I'm trying to reproduce your analysis on my samples. Instead of .csv files, I have the filtered_feature_bc_matrix output of Cell Ranger.
How could I adapt your integration code to preprocess all the samples at once? By creating a folder containing the filtered_feature_bc_matrix of every sample and iterating through them, or by converting adata to .csv? I tried both but I'm stuck.

Thanks!
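Iterating over folders is probably the simpler of the two options. The sketch below only handles the discovery step: it assumes a layout like `root/<sample>/outs/filtered_feature_bc_matrix`, and `find_matrix_dirs` is an illustrative helper. The closing comments describe, as assumed usage, how each folder could then be read and combined.

```python
import tempfile
from pathlib import Path


def find_matrix_dirs(root):
    """Map sample name -> path of each filtered_feature_bc_matrix folder under root."""
    found = {}
    for matrix_dir in sorted(Path(root).rglob("filtered_feature_bc_matrix")):
        if matrix_dir.is_dir():
            # Use the first path component below root as the sample name.
            sample = matrix_dir.relative_to(root).parts[0]
            found[sample] = matrix_dir
    return found


# Demo on an empty directory tree created just for this example.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("sampleA", "sampleB"):
        (Path(tmp) / sample / "outs" / "filtered_feature_bc_matrix").mkdir(parents=True)
    samples = sorted(find_matrix_dirs(tmp))

# Assumed follow-up with Scanpy/anndata (not run here):
#   adatas = []
#   for sample, path in find_matrix_dirs("my_runs").items():
#       adata = sc.read_10x_mtx(path)
#       adata.obs["Sample"] = sample
#       adatas.append(adata)      # per-sample QC would go before this line
#   combined = anndata.concat(adatas)
```

Tagging each object with its sample name before concatenating keeps the batch information available for integration.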

use better `pip install` approach in tutorials

I ran across your tutorial here as part of looking into this Biostars post.

I don't know where you suggest people run this code; however, if it is in standard Jupyter, I'd suggest you update your pip install examples to drop the exclamation point. For example, the fifth cell of the Jupyter notebook I referenced has one. I would suggest you change it to %pip install scvi-tools. In the last few years the magic command %pip install (and the related %conda install) was added to ensure that when running pip install in Jupyter notebooks, the installation occurs in the environment that backs the notebook kernel. This is something the exclamation point with pip install failed to do, which caused all sorts of issues. You can read more about the modern %pip install and %conda install magic commands here.

Unfortunately, I think Google Colab derived their offshoot of the Jupyter interface before that happened, and so it isn't consistent. I'm not 100% sure about the Google Colab notebooks and what is best; however, for standard Jupyter use of the magic command is best. And so if you direct people using your stuff to Google Colab mostly, I can understand using the outdated syntax.

It's also good to be aware that because modern Jupyter installations most often have automagics enabled by default, pip install without any symbol will use the modern magic command behind the scenes, and so even no symbol is better than suggesting the exclamation point with pip install now. However, explicit is best for learners and future you.
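For plain Python scripts, where magics are not available, the same goal (installing into the environment of the running interpreter) is commonly reached by invoking pip through `sys.executable`. A minimal sketch follows; `pip_install_command` is an illustrative helper, and the command is only built here, not executed.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


cmd = pip_install_command("scvi-tools")

# To actually install, hand the list to subprocess:
#   import subprocess
#   subprocess.check_call(cmd)
```

Because the command starts with the current interpreter's path, the package cannot land in a different environment than the one doing the import.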
