mousepixels / sanbomics_scripts
scripts and notebooks from sanbomics
Hi Mark,
Thank you very much for the code; it is really useful.
I am just a beginner. I was thinking of a volcano plot in which downregulated genes are blue and overexpressed genes are red, leaving the non-differentially expressed genes in grey. Could you give me any advice, please?
Thanks a lot
Vic
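A minimal sketch of the coloring logic asked about above, assuming the results are available as two parallel sequences of log2 fold changes and adjusted p-values. The function name and both cutoffs are hypothetical placeholders, not values taken from the tutorial.

```python
def volcano_colors(log2fc, padj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return one color per gene: blue = down, red = up, grey = not significant.

    The cutoffs are illustrative defaults; pick values that suit the analysis.
    """
    colors = []
    for lfc, p in zip(log2fc, padj):
        if p < padj_cutoff and lfc <= -lfc_cutoff:
            colors.append("blue")   # significantly downregulated
        elif p < padj_cutoff and lfc >= lfc_cutoff:
            colors.append("red")    # significantly upregulated
        else:
            colors.append("grey")   # everything else
    return colors


# Toy values only, to show the three outcomes.
log2fc = [-2.5, 0.3, 3.1, -1.2]
padj = [0.001, 0.5, 0.0001, 0.2]
colors = volcano_colors(log2fc, padj)
```

The resulting list can be handed to a scatter call such as matplotlib's `plt.scatter(x, y, c=colors)`, with the log2 fold change on the x axis and the negative log10 adjusted p-value on the y axis.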
Dear Sanbomics,
Thanks a lot for your efforts; they inspired me a lot.
I have a question: is there a human reference dataset organized by tissue, like what "TabulaMurisData" provides for Mus musculus?
Thanks again.
Best regards,
Na
Hi,
thank you a lot for your SoupX tutorial. I found the video and the code very helpful during my own analysis, since I am new to scRNA-seq analysis. One thing I stumbled upon is that nCount_RNA is not "updated" when I replace the counts in the RNA assay of the Seurat object with the SoupX-filtered counts. I'm referring to line 94 (sobj@assays$RNA@counts <- out) of your .Rmd file in this repository.
I noticed this when I tried to check whether counts actually changed by
summary(sobj@meta.data$nCount_RNA == sobj@meta.data$nCount_original.counts)
and nothing had changed.
When I tried to overwrite the RNA assay with
sobj[["RNA"]] <- CreateAssayObject(counts = out)
nothing changed. When I do this giving a new name to the assay, e.g. with
sobj[["RNAsoupx"]] <- CreateAssayObject(counts = out)
it works.
I'm not sure if this is the right place to post my observation/question, but maybe you have an idea how the sobj@meta.data$nCount_RNA information can be updated while keeping the name "RNA" for this assay. I'm not sure whether giving the assay a new name could cause problems during downstream analysis.
Thank you a lot :)
Hi @mousepixels
I'm trying to read this GEO dataset (GSE198896) and I can't load it using Scanpy's adata function. The file in CSV format is not provided. In this case, how can I load this dataset?
Hi,
I was testing your tutorial on pseudobulk with pseudo replicates and I noticed that all the replicates created by your script are identical because you did not slice with the pseudo_rep slice that you created from the indices.
I propose changing this line from

    for i, pseudo_rep in enumerate(indices):
        rep_adata = sc.AnnData(X = samp_cell_subset.X.sum(axis = 0),
                               var = samp_cell_subset.var[[]])

to

    rep_adata = sc.AnnData(X = samp_cell_subset[samp_cell_subset.obs_names.isin(pseudo_rep)].X.sum(axis = 0), var = samp_cell_subset.var[[]])
And thanks for the cool work you do :)
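The point of the proposed fix can be illustrated with a small sketch that uses plain Python containers in place of AnnData. The helper names and the toy counts are hypothetical; the only idea carried over from the issue is that each pseudo-replicate must sum over its own subset of cells rather than over all of them.

```python
import random


def split_into_pseudo_reps(cell_ids, n_reps, seed=0):
    """Shuffle the cell IDs and deal them into n_reps disjoint groups."""
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_reps] for i in range(n_reps)]


def pseudobulk(counts, cell_ids):
    """Sum per-gene counts over the selected cells only."""
    n_genes = len(next(iter(counts.values())))
    totals = [0] * n_genes
    for cell_id in cell_ids:
        for gene_index, value in enumerate(counts[cell_id]):
            totals[gene_index] += value
    return totals


# Toy data: four cells, two genes.
counts = {"c1": [1, 0], "c2": [0, 2], "c3": [3, 1], "c4": [2, 2]}
reps = split_into_pseudo_reps(counts, n_reps=2)

# Each replicate is summed over its own cells, so the profiles differ.
profiles = [pseudobulk(counts, rep) for rep in reps]
```

Summing over every cell inside the loop, as in the original line, would instead produce the same profile for every replicate.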
Can you please provide the link to download the lung1.h5 file, which is used in the scvi_label_transfer.ipynb notebook?
Hi Mark,
I'm running your scripts on my local computer and it crashed when I tried to integrate the 26 samples- I got an out of memory error. Question- if I didn't integrate, could I look at each sample alone and get the same data out- or does integration do something to the data (ie normalize with respect to each other?) so I would want to integrate everything?
Second question, my single cell data is interesting- we do depletion using CRISPR-cas9 to remove abundant, ribo, and mito targets. This redistributes reads onto low expression targets. Would you modify anything in the workflow in that scenario? Are you available for hire as a consultant?
Thanks,
Smita
In my dataset, not all mitochondrial genes are labeled with 'mt-'; some of them have names like 'NC_002333.23' or 'NC_002333.24', for example. I have a txt file with all the mitochondrial genes, and I have loaded it into the notebook with
x = open('michondrialgenesDR11.txt', 'r')
mitogenes = x.read()
I am now looking for a command to annotate the group of genes in the txt file as adata.var['mt']. Could you please help? Thank you!
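One possible approach, sketched under the assumption that the text file lists gene names separated by newlines, commas, or spaces (the actual layout of the file is not shown in the issue). The helper names and the demo file are hypothetical.

```python
import tempfile
from pathlib import Path


def read_gene_list(path):
    """Read gene names separated by newlines, commas, or whitespace into a set."""
    text = Path(path).read_text()
    return set(text.replace(",", " ").split())


def flag_genes(var_names, gene_set):
    """Return one boolean per gene name: True if it is in gene_set."""
    return [name in gene_set for name in var_names]


# Demo with a throwaway file standing in for the real gene list.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "mito_demo.txt"
    demo_file.write_text("NC_002333.23\nNC_002333.24\nmt-nd1\n")
    mito_genes = read_gene_list(demo_file)

flags = flag_genes(["actb1", "NC_002333.23", "mt-nd1"], mito_genes)

# With an AnnData object, the equivalent assignment would likely be:
#   adata.var["mt"] = adata.var_names.isin(mito_genes)
```

Note that `x.read()` in the snippet above returns one long string, so it needs to be split into individual names before any membership test.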
Hi!
I am having some issues with these files that I got from GEO:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
GSM3577882_normal_panc_barcodes.tsv.gz
GSM3577882_normal_panc_genes.tsv.gz
GSM3577882_normal_matrix.mtx.gz
This is my code: adata=sc.read_10x_mtx('./', prefix='GSM3577882_normal_',var_names='gene_symbols', cache=True )
I cannot load those samples in Scanpy. I have tried different approaches, but with no results. So I decided to download the SRR files from GEO and practice with Cell Ranger, following your video on YouTube. Here I ran into another issue. These are the files for normal pancreas, which correspond to two samples:
SRR8485290.fastq
SRR8485291.fastq
SRR8485292.fastq
SRR8485293.fastq
And I am not able to make Cell Ranger work, could it be because of how the samples are named?
Should I run each sample individually? How should I proceed with the tumor samples? Should I run each sample independently in Cell Ranger and then integrate the results in Scanpy?
I am really interested in this dataset and I don't know what to do.
Thank you very much for your help and availability.
Victor
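One detail visible in the file list above: the barcodes and genes files carry the prefix `GSM3577882_normal_panc_`, while the matrix file carries `GSM3577882_normal_`, so a single `prefix=` value cannot match all three. A possible workaround is to decompress the files into a folder under the standard legacy 10x names and read that folder without a prefix. The sketch below assumes, without having verified it against the Scanpy source, that the reader accepts uncompressed `barcodes.tsv`, `genes.tsv`, and `matrix.mtx`. The function name and folder names are hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Standard legacy 10x file names (assumed to be what the reader looks for).
LEGACY_NAMES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def stage_10x_files(src_dir, dest_dir):
    """Decompress '*barcodes.tsv.gz', '*genes.tsv.gz', '*matrix.mtx.gz' into dest_dir."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    staged = []
    for standard in LEGACY_NAMES:
        matches = sorted(src.glob(f"*{standard}.gz"))
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one *{standard}.gz in {src}, found {len(matches)}"
            )
        target = dest / standard
        with gzip.open(matches[0], "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        staged.append(target)
    return staged


# Demo with tiny stand-in files named like the GEO downloads.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "geo"
    src.mkdir()
    demo_files = {
        "GSM3577882_normal_panc_barcodes.tsv.gz": "AAAC-1\n",
        "GSM3577882_normal_panc_genes.tsv.gz": "ENSG1\tGENE1\n",
        "GSM3577882_normal_matrix.mtx.gz": "%%MatrixMarket matrix coordinate integer general\n",
    }
    for name, text in demo_files.items():
        with gzip.open(src / name, "wt") as handle:
            handle.write(text)
    staged = stage_10x_files(src, Path(tmp) / "normal_panc")
    staged_names = [path.name for path in staged]
    first_barcode = staged[0].read_text().strip()

# Afterwards, something like this would be the next step:
#   adata = sc.read_10x_mtx("normal_panc/", var_names="gene_symbols")
```

This only works when the source folder holds a single sample, since the function refuses to guess when more than one file matches a given name.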
Hi,
I really appreciate your tutorial on scRNA velocity.
I am trying to merge my output loom file from velocyto and my preprocessed counts like you did here:
adata = pp('../tutorial_sample/outs/filtered_feature_bc_matrix/')
ldata = scv.read('../tutorial_sample/velocyto/tutorial_sample.loom')
adata = scv.utils.merge(adata, ldata)
My question is: do the dimensions of the two objects have to match exactly? Or does the merge look for the intersection of barcodes/cells and the intersection of genes? In my case the merge runs, but the spliced and unspliced information does not appear in the new object, I guess because the dimensions do not match.
Thank you,
Diego
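A traceback quoted further down this page shows `merge` computing `adata.obs_names.intersection(ldata.obs_names)`, which suggests the two objects do not need identical dimensions but do need cell names that overlap after cleaning. A quick way to check the overlap beforehand is sketched below. It assumes barcodes of the usual 10x form (`ACGT...-1`) on one side and velocyto-style names (`sample:ACGT...x`) on the other; the helper names and the 12-nucleotide minimum are hypothetical choices.

```python
import re


def core_barcode(name):
    """Pull the nucleotide barcode out of names like 'sample:ACGT...x' or 'ACGT...-1'."""
    matches = re.findall(r"[ACGTN]{12,}", name)
    if not matches:
        raise ValueError(f"no barcode-like sequence found in {name!r}")
    return matches[-1]  # the barcode comes after any sample prefix


def shared_barcodes(adata_names, ldata_names):
    """Return the set of core barcodes present in both collections of cell names."""
    left = {core_barcode(name) for name in adata_names}
    right = {core_barcode(name) for name in ldata_names}
    return left & right


# Toy cell names in the two assumed formats.
adata_names = ["AAACCCAAGAAACACT-1", "AAACCCAAGAAACCAT-1"]
ldata_names = ["tutorial_sample:AAACCCAAGAAACACTx", "tutorial_sample:TTTGTTGTCTTTACGTx"]
shared = shared_barcodes(adata_names, ldata_names)
```

With real objects, `adata.obs_names` and `ldata.obs_names` would take the place of the two lists. An empty or very small overlap would point to a naming mismatch rather than a problem with the dimensions.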
hello all,
I've written the following code with reference to your YouTube video.
However, I get the following error when I try to load the output into R using Read10X from Seurat:
Error in readMM(file = matrix.loc) :
'readMM()' is not yet implemented for representation 'array'
Any help is greatly appreciated!
Here's the code I wrote based on the youtube video (note I've tried different formats for the sparse matrix - int, int64, etc.)
@staticmethod
def create_10x_files(adata, dir, raw_layer):
    RhapsodyReader.create_barcodes_table(adata=adata, save_path=f"{dir}/barcodes.tsv")
    RhapsodyReader.create_features_table(adata, f"{dir}/features.tsv")
    RhapsodyReader.create_matrix(adata, f"{dir}/matrix.mtx", layer=raw_layer)
    # RhapsodyReader.gzip_files_in_directory(dir)

@staticmethod
def create_matrix(adata, save_path, layer=None):
    if layer:
        matrix = adata.layers[layer].T
    else:
        matrix = adata.X.T
    io.mmwrite(save_path, matrix.astype(np.int64))

@staticmethod
def create_barcodes_table(adata, save_path):
    with open(save_path, "w") as file:
        for item in adata.obs_names:
            file.write(item + "\n")

@staticmethod
def create_features_table(adata, save_path):
    tab_var_names = ["\t".join([x, x, "Gene Expression"]) for x in adata.var_names]
    with open(save_path, "w") as file:
        for var_name in tab_var_names:
            file.write(var_name + "\n")
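The error message mentions the representation 'array', which is the Matrix Market header that `scipy.io.mmwrite` emits for a dense NumPy array; sparse input is written in 'coordinate' format instead. A likely cause, then, is that the layer being exported is dense. The sketch below only demonstrates the difference in headers on a toy matrix; the helper name `mm_header` is hypothetical.

```python
import io

import numpy as np
from scipy import sparse
from scipy.io import mmwrite


def mm_header(matrix):
    """Write matrix in Matrix Market format to memory and return the header line."""
    buffer = io.BytesIO()
    mmwrite(buffer, matrix)
    return buffer.getvalue().decode().splitlines()[0]


dense = np.array([[0, 2, 0], [1, 0, 3]], dtype=np.int64)

dense_header = mm_header(dense)                      # dense input: 'array' format
sparse_header = mm_header(sparse.csr_matrix(dense))  # sparse input: 'coordinate' format
```

If this is indeed the cause, wrapping the matrix in `sparse.csr_matrix(...)` before the `mmwrite` call in `create_matrix` would be the change to try.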
Hi Mark,
I have a question. I am doing differential expression between two cell types in scRNA-seq. I tried scVI and it worked, but diffxpy did not work; it gives me this error: ZeroDivisionError: float division by zero
I opened an issue with them. Apparently it is a common error, because others have described the same problem. I am waiting for a reply, but they have not responded yet.
I wanted to try this method because I am worried about the discrepancy between the DEGs I get with scVI (approximately 2,400 DEGs) and with Scanpy's differential expression function using the Wilcoxon test (9,000 DEGs). Have you compared different methods for differential expression? Could you give me any advice, please?
Thanks a lot.
Victor
I'm trying to construct the reference for Salmon, but the process stopped with this error:
salmon index -t CRCH38_and_decoys.fa.gz -d decoys.txt -i GRCh38_salmon_index --gencode
[2024-02-19 12:14:55.227] [puff::index::jointLog] [warning] Entry with header [ENST00000634174.1|ENSG00000282732.1|OTTHUMG00000191398.1|OTTHUMT00000487783.1|ENST00000634174|ENSG00000282732|28|unprocessed_pseudogene|], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2024-02-19 12:15:56.790] [puff::index::jointLog] [warning] Removed 882 transcripts that were sequence duplicates of indexed transcripts.
[2024-02-19 12:15:56.792] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the --keepDuplicates flag
[2024-02-19 12:15:56.919] [puff::index::jointLog] [info] Replaced 151122967 non-ATCG nucleotides
[2024-02-19 12:15:56.919] [puff::index::jointLog] [info] Clipped poly-A tails from 2034 transcripts
Killed
Could this be a memory (RAM) problem?
I am curious why you are filtering the expression data to only include genes with a base mean over 50? I have not really seen this as a step in other tutorials.
from GO_in_R.Rmd
sigs <- sigs[sigs$padj < 0.05 & sigs$baseMean > 50,]
thanks!
[sanbomics_scripts](https://github.com/mousepixels/sanbomics_scripts/tree/main) /simpleaf_alevin_fry_tutorial.txt
simpleaf quant --reads1 a_r1.fastq.gz,b_r1.fastq.gz --reads2 a_r2.fastq.gz,b_R2_001.fastq.gz --threads 28 --index simpleaf_index/index --chemistry 10xv3 --resolution cr-like --unfiltered-pl --expected-ori fw --t2g-map simpleaf_index/index/t2g_3col.tsv --output simpleaf_output
should be:
simpleaf quant --reads1 fastq/pbmc_1k_v3_S1_L001_R1_001.fastq.gz,fastq/pbmc_1k_v3_S1_L002_R1_001.fastq.gz --reads2 fastq/pbmc_1k_v3_S1_L001_R2_001.fastq.gz,fastq/pbmc_1k_v3_S1_L002_R2_001.fastq.gz --threads 28 --index simpleaf_index/index --chemistry 10xv3 --resolution cr-like --unfiltered-pl --expected-ori fw --t2g-map simpleaf_index/index/t2g_3col.tsv --output simpleaf_output
Hi,
I was trying to follow the complete scRNA-seq tutorial using a CellBender h5 output file, by changing to adata = anndata_from_h5(mtx_path) (anndata_from_h5 being from broadinstitute/CellBender#57).
When I train the solo model I noticed something weird, it looks like it's not working:
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call .var_names_make_unique.
utils.warn_names_duplicates("var")
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scvi/model/_utils.py:287: UserWarning: This dataset has some empty cells, this might fail inference.Data should be filtered with scanpy.pp.filter_cells()
warnings.warn(
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 267/267: 100%|██████████| 267/267 [13:27<00:00, 2.96s/it, loss=478, v_num=1]
INFO Creating doublets, preparing SOLO model.
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scvi/external/solo/_model.py:185: RuntimeWarning: divide by zero encountered in log
latent_adata = AnnData(np.concatenate([latent_rep, np.log(lib_size)], axis=1))
/home/gtosoni/miniconda3/envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass AnnData(X, dtype=X.dtype, ...) to get the future behavour.
[AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Epoch 1/400: 0%| | 1/400 [00:01<08:54, 1.34s/it, loss=nan, v_num=1]
Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.
Any idea why that is? Thank you!!
Hello,
I have created a volcano plot with your method and it looks good. I am just stuck on one step.
I created a list of selected 5 genes as below:
pikced10 = ('gene1', 'gene2', 'gene3', 'gene4', 'gene5')
Now, I want to show names of these selected 5 genes on the plot. I know you have shown
if df.iloc[i].nlog10 > 5 and abs(df.iloc[i].log2FoldChange) > 2:
but I am confused about how to show my customized list.
How can I feed this list of genes into the script so that, when I plot the graph, only the names of these 5 genes are shown?
Many thanks,
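One way to approach this is to swap the threshold condition for a membership test against the chosen list. The sketch below uses plain sequences so it runs on its own; the function name, the toy gene names, and the coordinates are hypothetical.

```python
def points_to_label(genes, x, y, picked):
    """Return (gene, x, y) tuples for the genes that appear in the picked list."""
    wanted = set(picked)
    return [(gene, xi, yi) for gene, xi, yi in zip(genes, x, y) if gene in wanted]


# Toy data standing in for the results table.
genes = ["gene1", "geneA", "gene3", "geneB"]
log2fc = [2.1, -0.4, -3.0, 1.5]
nlog10 = [6.0, 0.5, 8.2, 4.1]
picked = ("gene1", "gene3")

labels = points_to_label(genes, log2fc, nlog10, picked)

# In the plotting loop, each tuple would then be drawn with something like:
#   plt.text(x, y, gene)
```

In the tutorial's loop over `df.iloc[i]`, the equivalent change would be to test whether the row's gene name is in the picked tuple, for example `if df.iloc[i].symbol in picked:`, where `symbol` is an assumed column name.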
young <- import_atac("GSM5723631_Young_HSC_filtered_peak_bc_matrix.h5",
'GSM5723631_Young_HSC_singlecell.csv',
'./GSM5723631_Young_HSC_fragments.tsv.gz')
When I run this code, this error appears:
Error in import_atac("GSM5723631_Young_HSC_filtered_peak_bc_matrix.h5", :
could not find function "import_atac"
Which package do I need to install to make this function work?
Hi,
Thank you for this tutorial! It is very useful and I am almost done. But when I tried to merge the data in the step adata = scv.utils.merge(adata, ldata), I got this error:
UnboundLocalError Traceback (most recent call last)
Cell In[5], line 1
----> 1 adata = scv.utils.merge(adata, ldata)
File ~/.local/lib/python3.9/site-packages/scvelo/core/_anndata.py:526, in merge(adata, ldata, copy, **kwargs)
524 if "id_length" in kwargs:
525 id_length = kwargs.get("id_length")
--> 526 clean_obs_names(adata, id_length=id_length)
527 clean_obs_names(ldata, id_length=id_length)
528 common_obs = adata.obs_names.intersection(ldata.obs_names)
UnboundLocalError: local variable 'id_length' referenced before assignment
Do you have any idea why this error happens? Thank you very much!
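Judging only from the traceback above, `id_length` is assigned inside an `if "id_length" in kwargs:` branch and then used unconditionally, so the name is never bound when the keyword is omitted. The toy function below reproduces that pattern in isolation; it is not scvelo code, and the value 16 is an arbitrary placeholder.

```python
def merge_like(**kwargs):
    """Toy stand-in that mimics the control flow shown in the traceback."""
    if "id_length" in kwargs:
        id_length = kwargs.get("id_length")
    # When the keyword was not passed, id_length was never assigned.
    return id_length


try:
    merge_like()
    outcome = "no error"
except UnboundLocalError:
    outcome = "UnboundLocalError"

# Supplying the keyword takes the branch that binds the variable.
with_keyword = merge_like(id_length=16)
```

This suggests two things to try: passing `id_length` explicitly to `scv.utils.merge`, or installing a different scvelo version in case the behaviour differs there. Neither has been verified here.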
Hello,
I am new to bioinformatics and R, and I come from a wet lab background. I am analyzing three single-cell multiomics (RNA + ATAC) datasets from 10x Genomics. When I merge the three datasets, the fragment files for each dataset are present. But after integrating with Harmony, when I try to extract gene activity, I get an error saying that no fragment file is present. I am attaching the full code here.
##Create seurat object reading .h5 file and meta file
counts_APL1 <- Read10X_h5(filename = "path/to/filtered_feature_bc_matrix.h5")
meta_APL1 <- read.csv(
file = 'path/to/per_barcode_metrics.csv',
header = TRUE,
row.names = 1)
chrom_assay <- CreateChromatinAssay(
counts = counts_APL1$Peaks,
sep = c(":", "-"),
genome = 'hg38',
fragments = 'path/to/atac_fragments.tsv.gz',
min.cells = 3,
min.features = 200
)
data_APL1 <- CreateSeuratObject(
counts = chrom_assay,
assay = "peaks",
meta.data = meta_APL1
)
data_APL1[[]]
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
Annotation(data_APL1) <- annotations
data_APL21<- NucleosomeSignal(object = data_APL1) #fragment ratio 147-294: <147
data_APL2 <- TSSEnrichment(object = data_APL1, fast = FALSE)
##Here I get the error. I cant find any blacklist_region_fragments nor the pct_reads_in_peaks. Even the peak_region_fragments is placed in metadata in name of atac_peak_region_fragments.
#data_APL1$blacklist_ratio <- data_APL1$blacklist_region_fragments / data$peak_region_fragments
#data_APL1[[]]
#data$pct_reads_in_peaks <- data$peak_region_fragments / data$passed_filters * 100
VlnPlot(
object = data_APL1,
features = c('nCount_peaks',
'nucleosome_signal', 'TSS.enrichment'),
pt.size = 0.1,
ncol = 5
)
data_APL1 <- subset(
x = data_APL1,
subset = atac_peak_region_fragments > 1000 &
atac_peak_region_fragments < 100000 &
nucleosome_signal < 4 &
TSS.enrichment > 1)
##Created 3 SeuratObjects from 3 three samples
#add sample information for merging samples
Control$dataset <- "Control"
APL1$dataset <- "APL1"
APL2$dataset <- "APL2"
merged_atac <- merge(Control, y = c(APL1, APL2), add.cell.ids = c("Control", "APL1", "APL2"), project = "Integrated_ATAC" )
merged_atac <- FindTopFeatures(merged_atac, min.cutoff = 'q0')
merged_atac <- RunTFIDF(merged_atac)
merged_atac <- RunSVD(merged_atac)
merged_atac
#An object of class Seurat
181982 features across 3848 samples within 1 assay
Active assay: peaks (181982 features, 181982 variable features)
1 dimensional reduction calculated: lsi
merged_atac <- RunUMAP(object = merged_atac, reduction = 'lsi', dims = 2:30)
merged_atac <- FindNeighbors(object = merged_atac, reduction = 'lsi', dims = 2:30)
merged_atac <- FindClusters(object = merged_atac, verbose = FALSE, algorithm = 3, resolution = .4)
DimPlot(object = merged_atac, label = TRUE) + NoLegend()
DimPlot(object = merged_atac, label = TRUE, group.by = "dataset") + NoLegend()
#Batch correction using Harmony
integrated_atac_harmony <- RunHarmony(object = merged_atac, group.by.vars = 'dataset', reduction = 'lsi', assay.use = 'peaks', project.dim = FALSE)
integrated_atac_harmony <- RunUMAP(integrated_atac_harmony, dims = 2:10, reduction = 'harmony')
DimPlot(integrated_atac_harmony, group.by = 'dataset', pt.size = 0.5)
# Do UMAP and clustering using ** Harmony embeddings instead of PCA **
integrated_atac_harmony <- integrated_atac_harmony %>%
RunUMAP(reduction = 'harmony', dims = 2:10) %>%
FindNeighbors(reduction = "harmony", dims = 2:10) %>%
FindClusters(resolution = 0.3)
# visualize
DimPlot(integrated_atac_harmony, reduction = 'umap', group.by = 'dataset')
DimPlot(integrated_atac_harmony, reduction = 'umap')
gene.activities <- GeneActivity(integrated_atac_harmony)
##Here I get the error
##Error in GeneActivity(integrated_atac_harmony) :
No fragment information found for requested assay
The fragment files were present when I created the Seurat objects. Could you please help me find the cause of this issue, and suggest how to improve the pipeline for further integration with the merged scRNA dataset?
Hi,
Thank you so much for your walkthrough video!
I'm trying to reproduce your analysis on my samples. Instead of .csv files, I have the filtered_feature_bc_matrix output of Cell Ranger.
How could I adapt your integration code to preprocess all the samples at once? By creating a folder containing the filtered_feature_bc_matrix of every sample and iterating through them, or by converting adata to .csv? I tried both but I'm stuck.
Thanks!
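A minimal sketch of the folder-iteration idea, assuming each sample has its own directory somewhere above a `filtered_feature_bc_matrix` folder (for example `root/<sample>/outs/filtered_feature_bc_matrix`). The function name and the directory layout are assumptions, not something taken from the tutorial.

```python
import tempfile
from pathlib import Path


def find_10x_dirs(root):
    """Map each sample name to the path of its filtered_feature_bc_matrix folder."""
    root = Path(root)
    found = {}
    for matrix_dir in sorted(root.rglob("filtered_feature_bc_matrix")):
        if matrix_dir.is_dir():
            # The sample name is taken from the first folder level under root.
            sample = matrix_dir.relative_to(root).parts[0]
            found[sample] = matrix_dir
    return found


# Demo on an empty directory tree that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("sampleA", "sampleB"):
        (Path(tmp) / sample / "outs" / "filtered_feature_bc_matrix").mkdir(parents=True)
    sample_names = list(find_10x_dirs(tmp))

# With real data, each folder could then be read and tagged, roughly:
#   adatas = []
#   for sample, path in find_10x_dirs("path/to/root").items():
#       adata = sc.read_10x_mtx(path)
#       adata.obs["Sample"] = sample
#       adatas.append(adata)
```

The per-sample preprocessing function from the tutorial could take the place of the bare `sc.read_10x_mtx` call, as long as it reads the folder instead of a .csv file.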
I ran across your tutorial here as part of looking into this Biostars post.
I don't know where you suggest people run this code; however, if it is in standard Jupyter, I'd suggest you update your examples of pip install to not use the exclamation point. For example, in the Jupyter notebook I referenced, the fifth cell has an example. I would suggest you change it to %pip install scvi-tools. In the last few years, the magic command %pip install (and the related %conda install) were added to ensure that when running pip install in Jupyter notebooks, the installation occurs in the environment that backs the notebook kernel. This is something that using the exclamation point with pip install failed to do, which caused all sorts of issues. You can read more about the modern %pip install and %conda install magic commands here.
Unfortunately, I think Google Colab derived their offshoot of the Jupyter interface before that happened, and so it isn't consistent. I'm not 100% sure about the Google Colab notebooks and what is best; however, for standard Jupyter use of the magic command is best. And so if you direct people using your stuff to Google Colab mostly, I can understand using the outdated syntax.
It's also good to be aware that because modern Jupyter installations most often have automagics enabled by default, pip install without any symbol will use the modern magic command behind the scenes, and so even no symbol is better than suggesting the exclamation point with pip install now. However, explicit is best for learners and future you.