sanbomics_scripts's People

Contributors

mousepixels

sanbomics_scripts's Issues

high_quality_volcano_plots.ipynb

Hi Mark,

Thank you very much for the code; it is really useful.

I am just a beginner. I was thinking of a volcano plot in which downregulated genes are in blue and overexpressed genes in red, leaving the non-differentially expressed genes in grey. Could you give me any advice, please?

Thanks a lot

Vic
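
One way to get the blue/red/grey coloring asked about here is to assign a color to each gene before plotting. The following is only a minimal sketch: the column names log2FoldChange and padj appear elsewhere in this repository's issues, but the cutoffs and the helper name classify_genes are assumptions made for illustration.

```python
import pandas as pd


def classify_genes(df, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return one color per gene: red (up), blue (down), grey (not significant).

    The cutoffs are illustrative defaults, not values from the tutorial.
    """
    significant = df["padj"] < padj_cutoff
    colors = pd.Series("grey", index=df.index)
    colors[significant & (df["log2FoldChange"] > lfc_cutoff)] = "red"
    colors[significant & (df["log2FoldChange"] < -lfc_cutoff)] = "blue"
    return colors


# Toy results table standing in for a differential expression output.
df = pd.DataFrame(
    {
        "log2FoldChange": [2.5, -3.0, 0.2, 4.0],
        "padj": [0.001, 0.01, 0.5, 0.2],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)
colors = classify_genes(df)
```

The resulting colors could then be passed to a scatter plot as the per-point color argument.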

SoupX tutorial: meta.data nCount_RNA contains information from previous RNA assay

Hi,
Thank you very much for your SoupX tutorial. I find the video and the code very helpful for my own analysis, since I am new to scRNA-seq analysis. One thing I stumbled upon is that nCount_RNA is not updated when I replace the counts of the RNA assay of the Seurat object with the SoupX-filtered counts. I'm referring to line 94 (sobj@assays$RNA@counts <- out) of your .Rmd file in this repository.
I noticed this when I checked whether the counts had actually changed with
summary(sobj@meta.data$nCount_RNA == sobj@meta.data$nCount_original.counts)
and nothing had changed.
When I tried to overwrite the RNA assay with
sobj[["RNA"]] <- CreateAssayObject(counts = out)
nothing changed either. When I give the assay a new name instead, e.g. with
sobj[["RNAsoupx"]] <- CreateAssayObject(counts = out)
it works.

I'm not sure if this is the right place to post my observation/question, but maybe you have an idea how sobj@meta.data$nCount_RNA can be updated while keeping the name "RNA" for this assay. I'm not sure whether giving the assay a new name could cause problems in downstream analysis.

Thank you a lot :)

ERROR in pseudobulk_pyDeseq2.ipynb with pseudo replicates

Hi,

I was testing your pseudobulk tutorial with pseudo-replicates and noticed that all the replicates created by your script are identical, because the data is never sliced with the pseudo_rep indices that you created.

I propose you change this line

from
`
for i, pseudo_rep in enumerate(indices):

    rep_adata = sc.AnnData(X = samp_cell_subset.X.sum(axis = 0),
                           var = samp_cell_subset.var[[]])

`

to

rep_adata = sc.AnnData(X = samp_cell_subset[samp_cell_subset.obs_names.isin(pseudo_rep)].X.sum(axis = 0), var = samp_cell_subset.var[[]])

And thanks for the cool work you do :)
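
The effect of the proposed fix can be illustrated without AnnData. The sketch below uses a plain NumPy count matrix and integer row positions instead of cell names, so it is an analogy to the notebook code rather than a copy of it; the function name and the toy data are assumptions.

```python
import numpy as np


def pseudo_replicate_sums(counts, n_reps, seed=0):
    """Split the cells (rows) into n_reps random groups and sum each group.

    Each pseudo-replicate is built only from its own rows, which is the
    slicing step the issue says was missing.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(counts.shape[0])
    indices = np.array_split(shuffled, n_reps)
    return np.vstack([counts[pseudo_rep].sum(axis=0) for pseudo_rep in indices])


# Six toy cells, two genes. The first column uses powers of two so that
# any two different groups of cells give different sums.
counts = np.array([[1, 0], [2, 1], [4, 0], [8, 1], [16, 0], [32, 1]])
reps = pseudo_replicate_sums(counts, n_reps=3)
```

Summing the full subset inside the loop, as in the original lines, would instead give three identical rows.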

Location of lung1.h5 file

Can you please provide the link to download the lung1.h5 file, which is used in the scvi_label_transfer.ipynb notebook?

Integration question + consulting work?

Hi Mark,

I'm running your scripts on my local computer and it crashed with an out-of-memory error when I tried to integrate the 26 samples. Question: if I didn't integrate, could I look at each sample alone and get the same data out, or does integration do something to the data (i.e. normalize the samples with respect to each other), so that I would want to integrate everything?

Second question: my single-cell data is interesting. We do depletion using CRISPR-Cas9 to remove abundant, ribosomal, and mitochondrial targets. This redistributes reads onto low-expression targets. Would you modify anything in the workflow in that scenario? Are you available for hire as a consultant?

Thanks,
Smita

Assigning mitochondrial genes variable

In my dataset the mitochondrial genes are not all labeled with 'mt-'; some of them look like 'NC_002333.23, NC_002333.24', for example. I have a txt file with all the mitochondrial genes and I have loaded it into the notebook with

x = open('michondrialgenesDR11.txt', 'r')
mitogenes = x.read()

I am now looking for a command to annotate the group of genes in the txt file as adata.var['mt']. Could you please help? Thank you!
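
A possible approach is to turn the file contents into a set of gene IDs and flag every variable name that is in that set or carries the 'mt-' prefix. This is a sketch under assumptions: the file is assumed to list IDs separated by commas or whitespace, and the helper names parse_gene_list and flag_mito are made up for illustration.

```python
import re


def parse_gene_list(text):
    """Split raw file contents (as returned by x.read()) into a set of gene IDs."""
    return {token for token in re.split(r"[,\s]+", text) if token}


def flag_mito(var_names, mito_genes, prefix="mt-"):
    """Return one boolean per gene: True if prefixed with 'mt-' or listed in the file."""
    return [name.startswith(prefix) or name in mito_genes for name in var_names]


# Toy stand-ins for the file contents and for adata.var_names.
text = "NC_002333.23\nNC_002333.24\n"
mito_genes = parse_gene_list(text)
var_names = ["mt-nd1", "NC_002333.23", "actb1", "NC_002333.24"]
mt_flags = flag_mito(var_names, mito_genes)
```

With a real object, the same list could be assigned as adata.var['mt'] = flag_mito(adata.var_names, mito_genes). Since adata.var_names is a pandas Index, adata.var_names.isin(mito_genes) would be an equivalent way to test membership in the file.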

Problem to load files in Scanpy

Hi!

I am having some issues with these files that I got from GEO:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi

GSM3577882_normal_panc_barcodes.tsv.gz
GSM3577882_normal_panc_genes.tsv.gz
GSM3577882_normal_matrix.mtx.gz

This is my code: adata=sc.read_10x_mtx('./', prefix='GSM3577882_normal_',var_names='gene_symbols', cache=True )

I cannot load those samples in Scanpy. I have tried different ways but with no results. So I decided to download the SRR files from GEO and practice with Cell Ranger, following your video on YouTube. Here I had another issue. These are the files for normal pancreas, which correspond to two samples:

SRR8485290.fastq
SRR8485291.fastq
SRR8485292.fastq
SRR8485293.fastq

And I am not able to make Cell Ranger work. Could it be because of how the samples are named?

Should I run each sample individually? How should I proceed with the tumor samples? Should I run each sample independently in Cell Ranger and then integrate the results in Scanpy?

I am really interested in this dataset and I don't know what to do.

Thank you very much for your help and availability.

Victor
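
One detail visible in the three GEO file names listed in this issue is that they do not share the prefix passed to read_10x_mtx: the barcodes and genes files contain an extra panc_ part. The sketch below only makes that mismatch explicit. It assumes, without having checked the Scanpy source, that the reader builds each file name as the prefix followed by a standard name; the helper strip_prefix is hypothetical.

```python
def strip_prefix(filenames, prefix):
    """Map each file name to what is left after removing `prefix`.

    Files that do not start with the prefix map to None.
    """
    return {
        name: name[len(prefix):] if name.startswith(prefix) else None
        for name in filenames
    }


files = [
    "GSM3577882_normal_panc_barcodes.tsv.gz",
    "GSM3577882_normal_panc_genes.tsv.gz",
    "GSM3577882_normal_matrix.mtx.gz",
]
remainders = strip_prefix(files, "GSM3577882_normal_")
```

If the assumption holds, copying or renaming the files so that all three share one prefix would be a first thing to try.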

single-cell RNA velocity analysis

Hi,

I really appreciate your tutorial on scRNA velocity.

I am trying to merge my output loom file from velocyto and my preprocessed counts like you did here:
adata = pp('../tutorial_sample/outs/filtered_feature_bc_matrix/')
ldata = scv.read('../tutorial_sample/velocyto/tutorial_sample.loom')
adata = scv.utils.merge(adata, ldata)

My question is: do the dimensions of the two objects need to match exactly? Or does the merge look for the intersection of barcodes/cells and the intersection of genes? In my case the merge runs, but the spliced and unspliced information does not appear in the new object, I guess because the dimensions do not match.

Thank you,
Diego
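
A traceback quoted in another issue on this page shows scvelo's merge cleaning the cell names of both objects and then taking adata.obs_names.intersection(ldata.obs_names), which suggests the shapes do not need to match but the cleaned barcodes do. The sketch below mimics that idea in plain Python. The name formats (a "-1" suffix on the 10x side, a "sample:" prefix and trailing "x" on the loom side) and the helper clean_barcode are assumptions for illustration, not scvelo's actual cleaning logic.

```python
import re


def clean_barcode(name):
    """Keep only the nucleotide barcode part of a cell name, if one is found."""
    match = re.search(r"[ACGT]{8,}", name)
    return match.group(0) if match else name


adata_names = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1", "AAACCTGAGAAACCTA-1"]
ldata_names = [
    "tutorial_sample:AAACCTGAGAAACCATx",
    "tutorial_sample:AAACCTGAGAAACCTAx",
    "tutorial_sample:TTTGTCATCTTTACGTx",
]

# Cells kept by an intersection-style merge: present in both objects after cleaning.
common = sorted(
    {clean_barcode(n) for n in adata_names} & {clean_barcode(n) for n in ldata_names}
)
```

Comparing a few obs_names from each object in this way can show whether the barcodes actually overlap before merging.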

Issue convert anndata to seurat object

hello all,

I've written the code below based on your YouTube video.

However, I get the following error when I try to load the output into R using Read10X from Seurat:

Error in readMM(file = matrix.loc) :
'readMM()' is not yet implemented for representation 'array'

Any help is greatly appreciated!

Here's the code I wrote based on the youtube video (note I've tried different formats for the sparse matrix - int, int64, etc.)

  • See attached file for reference

    @staticmethod
    def create_10x_files(adata, dir, raw_layer):
        RhapsodyReader.create_barcodes_table(adata=adata, save_path=f"{dir}/barcodes.tsv")
        RhapsodyReader.create_features_table(adata, f"{dir}/features.tsv")
        RhapsodyReader.create_matrix(adata, f"{dir}/matrix.mtx", layer=raw_layer)
        # RhapsodyReader.gzip_files_in_directory(dir)

    
    @staticmethod
    def create_matrix(adata, save_path, layer=None):
        if layer:
            matrix = adata.layers[layer].T
        else:
            matrix = adata.X.T
            
        io.mmwrite(save_path, matrix.astype(np.int64))

    @staticmethod
    def create_barcodes_table(adata, save_path):
        with open(save_path, "w") as file:
            for item in adata.obs_names:
                file.write(item + "\n")
    
    @staticmethod
    def create_features_table(adata, save_path):
        tab_var_names = ["\t".join([x, x, "Gene Expression"]) for x in adata.var_names]

        with open(save_path, "w") as file:
            for var_name in tab_var_names:
                file.write(var_name + "\n")

matrix.mtx.gz
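
The word 'array' in the readMM error matches the Matrix Market format that scipy writes for dense input, while sparse input is written in 'coordinate' format. If adata.layers[layer] is a dense NumPy array in this case (an assumption, since the attached data is not shown), converting it to a sparse matrix before mmwrite may be what is missing. The sketch below only compares the two headers; note that io here is Python's standard io module, not scipy.io as in the code above, and the helper name is made up.

```python
import io

import numpy as np
from scipy import sparse
from scipy.io import mmwrite


def matrix_market_header(matrix):
    """Write the matrix to an in-memory buffer and return the first header line."""
    buffer = io.BytesIO()
    mmwrite(buffer, matrix)
    return buffer.getvalue().split(b"\n", 1)[0].decode()


dense = np.array([[0, 2, 0], [1, 0, 0]], dtype=np.int64)

# Dense input is written in 'array' format, sparse input in 'coordinate' format.
dense_header = matrix_market_header(dense)
sparse_header = matrix_market_header(sparse.csr_matrix(dense))
```

In create_matrix, the corresponding change would be to wrap the transposed matrix in sparse.csr_matrix(...) before calling mmwrite.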

Differential expression

Hi Mark,

I have a question. I am doing differential expression between two cell types in scRNA-seq. I tried scVI and it worked, but diffxpy did not work; it gives me this error: ZeroDivisionError: float division by zero

I opened an issue with them. Apparently it is a common error, because others have described the same problem. I am waiting for a reply, but they have not responded yet.

I wanted to try this method because I am worried about the discrepancy between the DEGs I get with scVI (approximately 2,400 DEGs) and with Scanpy's differential expression function using the Wilcoxon test (9,000 DEGs). Have you compared different methods for differential expression? Could you give me any advice, please?

Thanks a lot.

Victor

Salmon index issue

I'm trying to construct the reference for Salmon, but the process stopped with this error:

salmon index -t CRCH38_and_decoys.fa.gz -d decoys.txt -i GRCh38_salmon_index --gencode

[2024-02-19 12:14:55.227] [puff::index::jointLog] [warning] Entry with header [ENST00000634174.1|ENSG00000282732.1|OTTHUMG00000191398.1|OTTHUMT00000487783.1|ENST00000634174|ENSG00000282732|28|unprocessed_pseudogene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)

[2024-02-19 12:15:56.790] [puff::index::jointLog] [warning] Removed 882 transcripts that were sequence duplicates of indexed transcripts.
[2024-02-19 12:15:56.792] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the --keepDuplicates flag
[2024-02-19 12:15:56.919] [puff::index::jointLog] [info] Replaced 151122967 non-ATCG nucleotides
[2024-02-19 12:15:56.919] [puff::index::jointLog] [info] Clipped poly-A tails from 2034 transcripts
Killed

Could this be a memory (RAM) problem?

GO in R: question about baseMean filtering

I am curious why you are filtering the expression data to only include genes with a base mean over 50? I have not really seen this as a step in other tutorials.

from GO_in_R.Rmd
sigs <- sigs[sigs$padj < 0.05 & sigs$baseMean > 50,]

thanks!

There is a typo in the code.

[sanbomics_scripts](https://github.com/mousepixels/sanbomics_scripts/tree/main) /simpleaf_alevin_fry_tutorial.txt

simpleaf quant --reads1 a_r1.fastq.gz,b_r1.fastq.gz --reads2 a_r2.fastq.gz,b_R2_001.fastq.gz --threads 28 --index simpleaf_index/index --chemistry 10xv3 --resolution cr-like --unfiltered-pl --expected-ori fw --t2g-map simpleaf_index/index/t2g_3col.tsv --output simpleaf_output

It should be

simpleaf quant --reads1 fastq/pbmc_1k_v3_S1_L001_R1_001.fastq.gz,fastq/pbmc_1k_v3_S1_L002_R1_001.fastq.gz --reads2 fastq/pbmc_1k_v3_S1_L001_R2_001.fastq.gz,fastq/pbmc_1k_v3_S1_L002_R2_001.fastq.gz --threads 28 --index simpleaf_index/index --chemistry 10xv3 --resolution cr-like --unfiltered-pl --expected-ori fw --t2g-map simpleaf_index/index/t2g_3col.tsv --output simpleaf_output

CellBender scRNA-seq tutorial

Hi,

I was trying to follow the complete scRNA-seq tutorial using the CellBender h5 output file, by changing the loading step to adata = anndata_from_h5(mtx_path) (anndata_from_h5 being the function from broadinstitute/CellBender#57).
When I train the SOLO model I noticed something weird; it looks like it's not working:

/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call .var_names_make_unique.
utils.warn_names_duplicates("var")
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scvi/model/_utils.py:287: UserWarning: This dataset has some empty cells, this might fail inference.Data should be filtered with scanpy.pp.filter_cells()
warnings.warn(
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 267/267: 100%|██████████| 267/267 [13:27<00:00, 2.96s/it, loss=478, v_num=1]

INFO Creating doublets, preparing SOLO model.
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scvi/external/solo/_model.py:185: RuntimeWarning: divide by zero encountered in log
latent_adata = AnnData(np.concatenate([latent_rep, np.log(lib_size)], axis=1))
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass AnnData(X, dtype=X.dtype, ...) to get the future behavour.
[AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 1/400: 0%| | 1/400 [00:01<08:54, 1.34s/it, loss=nan, v_num=1]
Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.

Any idea why that is? Thank you!
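
Two warnings in the log above point in the same direction: the dataset has some empty cells, and a divide by zero occurs in np.log(lib_size). Whether that is the cause of the nan loss is an assumption, but the mechanism can be sketched with NumPy alone: a cell with zero total counts has a log library size of minus infinity. The helper drop_empty_cells is hypothetical; the warning itself suggests scanpy.pp.filter_cells() for real data.

```python
import numpy as np


def drop_empty_cells(counts):
    """Remove cells (rows) whose total count is zero; also return the keep mask."""
    lib_size = counts.sum(axis=1)
    keep = lib_size > 0
    return counts[keep], keep


# Toy matrix: the second cell has no counts at all.
counts = np.array([[3, 0, 1], [0, 0, 0], [2, 5, 0]])

# Before filtering, the empty cell yields -inf in the log library size.
with np.errstate(divide="ignore"):
    log_lib_before = np.log(counts.sum(axis=1))

filtered, keep = drop_empty_cells(counts)
log_lib_after = np.log(filtered.sum(axis=1))
```

Filtering out the empty cells before training the models would be a first thing to try.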

Showing selected gene names on the plot

Hello,

I have created a volcano plot with your method and it looks good. I am just stuck on one step.

I created a list of 5 selected genes as below:
pikced10 = ('gene1', 'gene2', 'gene3', 'gene4', 'gene5')

Now I want to show the names of these 5 selected genes on the plot. I know you have shown
if df.iloc[i].nlog10 > 5 and abs(df.iloc[i].log2FoldChange) > 2:

but I am confused about how to show my custom list.

How can I feed this list of genes into the script so that, when I plot the graph, only the names of these 5 genes are shown?

Many thanks,
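
One option is to replace the threshold condition with a membership test against the picked list. The sketch below assumes the gene names live in a column called symbol, which is a guess; nlog10 and log2FoldChange are the column names used in the condition quoted above. The helper rows_to_label is hypothetical.

```python
import pandas as pd


def rows_to_label(df, picked, symbol_col="symbol"):
    """Return only the rows whose gene name is in the picked collection."""
    return df[df[symbol_col].isin(picked)]


# Toy results table standing in for the volcano plot data frame.
df = pd.DataFrame(
    {
        "symbol": ["gene1", "gene2", "gene3"],
        "log2FoldChange": [2.4, -0.3, -3.1],
        "nlog10": [6.0, 0.5, 8.2],
    }
)
picked = ("gene1", "gene3")

# One (x, y, text) triple per label to draw.
labels = [
    (row.log2FoldChange, row.nlog10, row.symbol)
    for row in rows_to_label(df, picked).itertuples()
]
```

Each triple could then be passed to plt.text(x, y, s) in place of the loop over all rows.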

Error in import_atac function

young <- import_atac("GSM5723631_Young_HSC_filtered_peak_bc_matrix.h5",
'GSM5723631_Young_HSC_singlecell.csv',
'./GSM5723631_Young_HSC_fragments.tsv.gz')
When I run this code, the following error appears:
Error in import_atac("GSM5723631_Young_HSC_filtered_peak_bc_matrix.h5", :
could not find function "import_atac"

Which package do I need to install to make this function work?

local variable 'id_length' referenced before assignment

Hi,
Thank you for this tutorial! It is very useful and I am almost done. But when I tried to merge the data in the step adata = scv.utils.merge(adata, ldata), I got this error:
UnboundLocalError Traceback (most recent call last)
Cell In[5], line 1
----> 1 adata = scv.utils.merge(adata, ldata)

File ~/.local/lib/python3.9/site-packages/scvelo/core/_anndata.py:526, in merge(adata, ldata, copy, **kwargs)
524 if "id_length" in kwargs:
525 id_length = kwargs.get("id_length")
--> 526 clean_obs_names(adata, id_length=id_length)
527 clean_obs_names(ldata, id_length=id_length)
528 common_obs = adata.obs_names.intersection(ldata.obs_names)

UnboundLocalError: local variable 'id_length' referenced before assignment
Do you have any idea why this error happens? Thank you very much!

No fragment file present when extracting gene activity

Hello,
I am new to bioinformatics and R and come from a wet-lab background. I am analyzing three single-cell multiomics (RNA + ATAC) datasets from 10x Genomics. When I merged the three datasets, the fragment files for each dataset were present. But after integrating with Harmony, when I try to extract gene activity, I get an error saying that no fragment information was found. I am attaching the full code here.

##Create seurat object reading .h5 file and meta file

counts_APL1 <- Read10X_h5(filename = "path/to/filtered_feature_bc_matrix.h5")

meta_APL1 <- read.csv(
file = 'path/to/per_barcode_metrics.csv',
header = TRUE,
row.names = 1)
 chrom_assay <- CreateChromatinAssay(
  counts = counts_APL1$Peaks,
  sep = c(":", "-"),
  genome = 'hg38',
  fragments = 'path/to/atac_fragments.tsv.gz',
  min.cells = 3,
  min.features = 200
)
data_APL1 <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = meta_APL1
)
data_APL1[[]]
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
Annotation(data_APL1) <- annotations
data_APL1 <- NucleosomeSignal(object = data_APL1) #fragment ratio 147-294: <147
data_APL1 <- TSSEnrichment(object = data_APL1, fast = FALSE)

##Here I get the error. I can't find blacklist_region_fragments or pct_reads_in_peaks. Even peak_region_fragments is stored in the metadata under the name atac_peak_region_fragments.

So I proceeded without these two parameters in QC

#data_APL1$blacklist_ratio <- data_APL1$blacklist_region_fragments / data$peak_region_fragments
#data_APL1[[]]
#data$pct_reads_in_peaks <- data$peak_region_fragments / data$passed_filters * 100 
VlnPlot(
  object = data_APL1,
  features = c('nCount_peaks', 
               'nucleosome_signal', 'TSS.enrichment'),
  pt.size = 0.1,
  ncol = 5
)
data_APL1 <- subset(
    x = data_APL1,
    subset = atac_peak_region_fragments > 1000 &
           atac_peak_region_fragments < 100000 &
           nucleosome_signal < 4 &
           TSS.enrichment > 1)

##Created 3 Seurat objects from the three samples

#add sample information for merging samples

Control$dataset <- "Control"
APL1$dataset <- "APL1"
APL2$dataset <- "APL2"
merged_atac <- merge(Control, y = c(APL1, APL2),  add.cell.ids = c("Control", "APL1", "APL2"), project = "Integrated_ATAC" )
merged_atac <- FindTopFeatures(merged_atac, min.cutoff = 'q0')
merged_atac <- RunTFIDF(merged_atac)
merged_atac <- RunSVD(merged_atac)
merged_atac

#An object of class Seurat
#181982 features across 3848 samples within 1 assay
#Active assay: peaks (181982 features, 181982 variable features)
#1 dimensional reduction calculated: lsi

merged_atac <- RunUMAP(object = merged_atac, reduction = 'lsi', dims = 2:30)
merged_atac <- FindNeighbors(object = merged_atac, reduction = 'lsi', dims = 2:30)
merged_atac <- FindClusters(object = merged_atac, verbose = FALSE, algorithm = 3, resolution = .4)
DimPlot(object = merged_atac, label = TRUE) + NoLegend()
DimPlot(object = merged_atac, label = TRUE, group.by = "dataset") + NoLegend()
#Batch correction using Harmony
integrated_atac_harmony <- RunHarmony(object = merged_atac, group.by.vars = 'dataset', reduction = 'lsi', assay.use = 'peaks', project.dim = FALSE)
integrated_atac_harmony <- RunUMAP(integrated_atac_harmony, dims = 2:10, reduction = 'harmony')
DimPlot(integrated_atac_harmony, group.by = 'dataset', pt.size = 0.5)
# Do UMAP and clustering using ** Harmony embeddings instead of PCA **
integrated_atac_harmony <- integrated_atac_harmony %>%
  RunUMAP(reduction = 'harmony', dims = 2:10) %>%
  FindNeighbors(reduction = "harmony", dims = 2:10) %>%
  FindClusters(resolution = 0.3)

# visualize 
DimPlot(integrated_atac_harmony, reduction = 'umap', group.by = 'dataset')
DimPlot(integrated_atac_harmony, reduction = 'umap')
gene.activities <- GeneActivity(integrated_atac_harmony)

##Here I get the error
##Error in GeneActivity(integrated_atac_harmony) :
##No fragment information found for requested assay

The fragment files were present when I created the Seurat objects, though. Could you please help me track down this issue, and suggest how to improve the pipeline for further integration with the merged scRNA dataset?

reproducing from filtered matrix

Hi,

Thank you so much for your walkthrough video!

I'm trying to reproduce your analysis on my samples. Instead of .csv files, I have the filtered_feature_bc_matrix output of Cell Ranger.
How could I adapt your integration code to preprocess all the samples at once? By creating a folder containing the filtered_feature_bc_matrix of every sample and iterating through them, or by converting adata to .csv? I tried both but I'm stuck.

Thanks!
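
Iterating over one folder per sample is probably the simpler of the two options. The sketch below only handles the discovery step and takes the actual reader as a parameter. The directory layout (one folder per sample, with filtered_feature_bc_matrix somewhere beneath it) and the helper names find_sample_dirs and load_all are assumptions.

```python
import tempfile
from pathlib import Path


def find_sample_dirs(root, matrix_dirname="filtered_feature_bc_matrix"):
    """Map each top-level sample folder under root to its matrix directory."""
    root = Path(root)
    samples = {}
    for path in sorted(root.rglob(matrix_dirname)):
        if path.is_dir():
            samples[path.relative_to(root).parts[0]] = path
    return samples


def load_all(root, reader):
    """Apply reader to every sample's matrix directory and collect the results."""
    return {sample: reader(path) for sample, path in find_sample_dirs(root).items()}


# Demo on a temporary directory tree; the reader here just returns the folder name.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("sample_A", "sample_B"):
        (Path(tmp) / sample / "outs" / "filtered_feature_bc_matrix").mkdir(parents=True)
    loaded = load_all(tmp, reader=lambda path: path.name)
```

With real data, the reader could be sc.read_10x_mtx or a preprocessing function that wraps it, and the per-sample objects could then be concatenated before integration.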

use better `pip install` approach in tutorials

I ran across your tutorial here as part of looking into this Biostars post.

I don't know where you suggest people run this code; however, if it is in standard Jupyter, I'd suggest you update your pip install examples to drop the exclamation point. For example, in the Jupyter notebook I referenced, the fifth cell has such an example. I would suggest you change it to %pip install scvi-tools. In the last few years the magic command %pip install (and the related %conda install) was added to ensure that when running pip install in Jupyter notebooks, the installation occurs in the environment that backs the notebook kernel. This is something that the exclamation point with pip install failed to do, which caused all sorts of issues. You can read more about the modern %pip install and %conda install magic commands here.

Unfortunately, I think Google Colab derived their offshoot of the Jupyter interface before that happened, and so it isn't consistent. I'm not 100% sure about Google Colab notebooks and what is best there; however, for standard Jupyter, use of the magic command is best. So if you mostly direct people using your material to Google Colab, I can understand using the outdated syntax.

It's also good to be aware that, because modern Jupyter installations most often have automagics enabled by default, pip install without any symbol will use the modern magic command behind the scenes, so even no symbol is better than suggesting the exclamation point with pip install now. However, explicit is best for learners and future you.
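
The point about the kernel's environment can also be illustrated in plain Python: running pip as a module of sys.executable targets the interpreter that is executing the code. This is a sketch of the idea only, not a description of how %pip is implemented, and the helper name pip_install_command is made up.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


command = pip_install_command("scvi-tools")
```

Passing this list to subprocess.check_call would run the installation in that interpreter's environment. Inside a notebook, %pip install scvi-tools remains the more readable choice.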
