Swarbricklab-code / BrCa_cell_atlas
Data processing and analysis-related code associated with the study "A single-cell and spatially resolved atlas of human breast cancers".
Hi,
I'm Pratap, a senior bioinformatician at the National Cancer Centre Singapore. I would like to request the metadata CSV file mentioned on GitHub at the link below, so that I can automate its use in my analysis.
Best regards
Pratap
Hi, I'm interested in your gene module analysis, but I cannot work out the exact steps from the descriptions in the paper. Could you share the code? Thank you.
Hi! I am wondering how you combined the pseudobulk data and the TCGA data before hierarchical clustering, since I couldn't find this code in the repo. Could you explain how you normalized the two datasets, which come from different platforms?
Thank you
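For readers with the same question, a minimal sketch (an assumption, not the authors' pipeline) of one common way to place pseudobulk scRNA-seq and TCGA bulk profiles on a comparable scale: log-transform each dataset, then z-score each gene within each dataset before combining the samples for hierarchical clustering.

```python
import math
from statistics import mean, stdev

def zscore_per_gene(matrix):
    """matrix: dict gene -> list of log-expression values (one per sample).
    Center/scale each gene within its own dataset to remove platform shifts."""
    out = {}
    for gene, values in matrix.items():
        mu, sd = mean(values), stdev(values)
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out

def combine(pseudobulk, tcga):
    """Z-score each dataset separately, then join sample columns on the
    genes shared by both platforms."""
    pb, tc = zscore_per_gene(pseudobulk), zscore_per_gene(tcga)
    shared = sorted(set(pb) & set(tc))
    return {g: pb[g] + tc[g] for g in shared}

# toy example: two genes, log2(count + 1)-transformed
pseudobulk = {"ESR1": [math.log2(c + 1) for c in (120, 30, 400)],
              "ERBB2": [math.log2(c + 1) for c in (15, 900, 60)]}
tcga = {"ESR1": [math.log2(c + 1) for c in (80, 10)],
        "ERBB2": [math.log2(c + 1) for c in (40, 700)]}
merged = combine(pseudobulk, tcga)
# each gene row now spans 5 samples (3 pseudobulk + 2 TCGA) on one scale
```

Rank-based transforms or ComBat-style batch correction are common alternatives when platform effects are stronger than a per-gene shift and scale.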
Hi
Can you please provide the RData files for the major, minor and subset cell types used in the spatial analysis? Also, could you explain why temp_sample_id <- c(2,5,7:10) was chosen, i.e. from which metadata these numbers come?
Thanks in advance
Hi,
I checked the marker signature used to infer cell types with Garnett, and the cell types do not match the ones you have on the Broad Institute portal. Was another file used for it?
Thanks in advance
Best
Miguel
Thanks for your great work! I am just wondering whether I can apply ecotyper (https://github.com/digitalcytometry/ecotyper) on our bulk RNA-seq, and treat your well annotated single-cell dataset as discovery. Thanks!
I see that the following code was used to run InferCNV:
infercnv::run(
  initial_infercnv_object,
  num_threads = numcores - 1,
  out_dir = out_dir,
  cutoff = 0.1,
  window_length = 101,
  max_centered_threshold = 3,
  cluster_by_groups = FALSE,
  plot_steps = FALSE,
  denoise = TRUE,
  sd_amplifier = 1.3,
  analysis_mode = "samples"
)
The default value for sd_amplifier is 1.5. I was curious about your decision to adjust this de-noising parameter. I was hoping you would be able to share insights that could help me plan my future analyses and pipelines. Thank you!
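To make the question concrete, here is a simplified sketch of the denoising idea as I understand it (an assumption about inferCNV's behaviour, not its actual source): with denoise=TRUE, residual values within sd_amplifier standard deviations of a reference mean are treated as noise and flattened to the mean, while values outside that band are kept. A smaller sd_amplifier (1.3 vs the default 1.5) narrows the band, so borderline signal survives the denoising.

```python
from statistics import mean, stdev

def denoise(values, ref_mean, ref_sd, sd_amplifier=1.5):
    """Flatten values within ref_mean +/- sd_amplifier*ref_sd to the mean;
    keep everything outside that band as signal."""
    lo, hi = ref_mean - sd_amplifier * ref_sd, ref_mean + sd_amplifier * ref_sd
    return [ref_mean if lo <= v <= hi else v for v in values]

# reference (normal) cells define the noise band
ref = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
mu, sd = mean(ref), stdev(ref)

obs = [1.0, 1.1, 1.25, 1.6]  # tumour bins, increasingly amplified
kept_default = denoise(obs, mu, sd, sd_amplifier=1.5)  # 1.1 flattened
kept_tighter = denoise(obs, mu, sd, sd_amplifier=1.3)  # 1.1 survives
```

With 1.5, the 1.1 bin falls inside the band and is flattened; with 1.3, it falls just outside and is retained, illustrating why a lower sd_amplifier is the less aggressive choice.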
Hi there!
Many thanks for this super detailed and cool dataset; it has been a huge help for my current project :).
Just wondering if you might share the original scanned H&E images for the spatial transcriptomics samples. I am conducting some automated image analysis and require H&E images with post-scan parameters. Unfortunately, the images in the images.pdf file have lost some key information, and the Hi_res/Lo_res images lack the resolution I need. It would be very kind and helpful of you to provide the original H&E images, and I sincerely appreciate your help.
Best,
Yuxuan
Hello, I have a question about the number of cells present in the published analyses. The Figure 1A legend quotes ~130k cells for the UMAP; however, the published data on the Broad Single Cell Portal seem to contain ~100k cells. Would you be able to help clarify this discrepancy and confirm how many cells were used to generate the main Figure 1A UMAP?
Thank you!
Were any steps taken to filter the background noise from unbound antibody from the CITE-seq data?
Hello,
Thank you so much for providing this data. I am new to the field of single-cell analysis, so this might be a naive question. I was looking to reproduce the inferCNV pipeline and was wondering where I can get the gene positions file infercnv_gene_order.txt.
Thanks.
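For others looking for the same file: inferCNV expects a tab-separated gene order file (gene, chromosome, start, stop, no header), and one can usually be rebuilt from the GTF used for alignment. A minimal sketch (not the authors' file; the toy GTF line below is an illustration):

```python
import re

def gtf_to_gene_order(gtf_lines):
    """Extract (gene_name, chrom, start, stop) from 'gene' records of a GTF."""
    rows = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        chrom, _, feature, start, stop, _, _, _, attrs = line.rstrip("\n").split("\t")
        if feature != "gene":
            continue
        m = re.search(r'gene_name "([^"]+)"', attrs)
        if m:
            rows.append((m.group(1), chrom, int(start), int(stop)))
    return rows

gtf = [
    'chr1\tHAVANA\tgene\t65419\t71585\t.\t+\t.\tgene_id "ENSG00000186092"; gene_name "OR4F5";',
    'chr1\tHAVANA\texon\t65419\t65433\t.\t+\t.\tgene_id "ENSG00000186092"; gene_name "OR4F5";',
]
rows = gtf_to_gene_order(gtf)
# write each row tab-separated as infercnv_gene_order.txt
text = "\n".join("\t".join(map(str, r)) for r in rows)
```

The inferCNV repository also ships a helper script for this conversion, which is worth checking before rolling your own.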
Hi,
I am working with your data and trying to reproduce the SCPAM50 calls from the metadata provided for the epithelial cells. I follow the code from Calculatingscoresandplotting.R and Highest_calls.R, but I am unable to replicate the results.
What I did was the following:
tocalc<-as.data.frame(Mydata@[email protected])
This means it cannot be the integrated data, so I wonder which data are used to infer the SC subtype calls. I would really appreciate some insight on this. I am not sure what input files like "TNBCmerged_Training_SwarbrickInferCNV.txt" contain, or even what Mydata is here. What exactly do those files contain (e.g. all cells or only epithelial; curated cells; integrated or just normalized RNA data)? If I could get the input files used to generate those results it would help me a lot, as well as the script or steps to follow to reach the calls available on the Broad Institute portal.
Sorry in advance for the convoluted question.
Thanks in advance,
Best,
Miguel
Hi, I am looking to use the code chunk:
"sigdat <- read.table("SinglecellMolecularSubtypesignaturesincludingSWARBRICKnormals_SEPTEMBER2019.txt",sep='\t',header=F,row.names=1,fill=T)"
in the SCSubtype functions.
Could you please direct me to this file or post it?
Thank you
@dlroden
Hi,
In issue #5, you mentioned "For the Pseudobulk, we just summed up all the reads for each gene across all cells".
I'm not sure whether you used the sum or the average across the cells of each sample. I am wondering whether the difference in cell numbers between samples would influence the result: for example, if sample A contains more cells than sample B, the pseudobulk for A might contain more gene UMIs than B, not because of higher real expression but simply because of cell numbers.
If you used the sum, what did you do afterwards to eliminate this influence?
Thank you a lot
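For context on the cell-count concern, a minimal sketch (an assumption, not the authors' pipeline): summing reads per gene across all cells of a sample gives raw pseudobulk counts, and the dependence on cell number is then removed by library-size normalization such as counts per million, since a sample with more cells simply has a larger library.

```python
def pseudobulk(cells):
    """cells: list of dicts gene -> UMI count, one dict per cell of a sample.
    Returns summed counts per gene."""
    total = {}
    for cell in cells:
        for gene, n in cell.items():
            total[gene] = total.get(gene, 0) + n
    return total

def cpm(counts):
    """Counts per million: rescale so total library size is 10^6."""
    lib = sum(counts.values())
    return {g: n / lib * 1e6 for g, n in counts.items()}

# sample A has twice the cells of sample B, but the same expression profile
sample_a = pseudobulk([{"TP53": 2, "GAPDH": 8}] * 100)
sample_b = pseudobulk([{"TP53": 2, "GAPDH": 8}] * 50)
# raw sums differ (cell-number effect), but CPM values are identical
```

Averaging across cells instead of summing would also remove the cell-count effect, but the sum preserves count statistics, which downstream tools that model counts generally prefer.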
Hi @dlroden, @johnyaku, @sunnyzwu, and @gAleryani
In the methods, the following approach was taken to identify thresholds for calling cancer cells:
Cells were plotted with respect to both their genomic instability and correlation scores. Partitioning around medoids clustering was performed using the pamk function in the R package cluster v.2.0.7-1 to choose the optimum value for k (between 2 and 4) using silhouette scores and the pam function to apply the clustering. Thresholds defining normal and neoplastic cells were set at 2 cluster s.d. to the left and 1.5 s.d. below the first cancer cluster means. For tumors where partitioning around medoids could not define more than 1 cluster, the thresholds were set at 1 s.d. to the left and 1.25 s.d. below the cluster means.
I was wondering what motivated your decision to apply partitioning in the scatter plot of genomic instability and correlation scores? The original inferCNV paper in glioblastoma used hard thresholds of 0.01 and 0.4 respectively for calling cancer cells. Based on my experience, these hard thresholds work well for samples that contain a clear population of copy number high cells and a clear population of copy number low cells in the scatter plot of genomic instability and correlation scores. However (in my experience), these hard thresholds do not work well for samples that contain only one large population, or 'blob', of cells in the scatter plot of genomic instability and correlation scores.
I was wondering if your group made similar observations, and if this motivated your decision to apply partitioning via medoids clustering to infer robust thresholds? Moreover, did you find that this partitioning approach was an improvement over the hard threshold approach?
I would think so, based on code from your inferCNV scripts:
# distinguish samples with 1 and > 1 clusters:
mono_samples <- c("CID3586", "CID3921", "CID3941", "CID3948", "CID4067",
"CID4290A", "CID4461", "CID4495", "CID4515", "CID4523", "CID4535",
"CID44041", "CID44991", "CID45171", "CID4513", "CID4398")
multi_samples <- c("CID3963", "CID4066", "CID4463", "CID4465",
"CID4471", "CID4530N", "CID44971")
For patient 1160920F from https://zenodo.org/record/4739739#.Y007vNJBxhE, I would like to know which deconvolution output here corresponds to it:
https://github.com/Swarbricklab-code/BrCa_cell_atlas/tree/main/spatial_analysis/stsc/major
[182963_1160920F-count-matrix.tsv]
[182968_1160920F-count-matrix.tsv]
[182969_1160920F-count-matrix.tsv]
Based on the metadata here:
https://github.com/Swarbricklab-code/BrCa_cell_atlas/blob/main/spatial_analysis/Meta_Data.csv
I bet it's 182968_1160920F.
Right ? Thanks.
Update: also, the index given for the spots corresponds to the merge of pxl_row_in_fullres and pxl_col_in_fullres, not array_row and array_col. Thanks.
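Building on that observation, a minimal sketch of reconstructing such spot indices from a spaceranger tissue_positions_list.csv (the exact separator and column layout are assumptions based on the comment above, not verified against the Zenodo files):

```python
import csv, io

# spaceranger v1 tissue_positions_list.csv has no header; columns are:
# barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres
raw = "AAACAAGTATCTCCCA-1,1,50,102,7682,8126\n"

def spot_index(row):
    """Merge the full-resolution pixel coordinates (not array coordinates)
    into a single spot identifier; the 'x' separator is an assumption."""
    barcode, in_tissue, a_row, a_col, px_row, px_col = row
    return f"{px_row}x{px_col}"

indices = [spot_index(r) for r in csv.reader(io.StringIO(raw))]
```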
Can you please direct me to the code used to generate the pseudobulk RNA-seq from the paper, and explain how to run the PAM50 subtyping on the pseudobulk?
I see the bulk RNA-seq PAM50 code, but I want to apply the pseudobulk method to my own breast samples, so that would be appreciated.
Thank you
Hello, would it be possible to make the batch corrected count matrix used to generate the main UMAPs publicly available? This would be very helpful for analyzing the effect different batch correction methods have on the data and for computing a nearest neighbors graph for the corrected data.
Thank you!
Hi
Thanks for the prompt response to my previous question.
How is AUCell different from GSEA or other gene set enrichment tools? Could you please elaborate?
Thanks
Tahseen
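For readers with the same question, the key difference is that classic GSEA tests a gene set against one ranked list derived from a comparison between sample groups, whereas AUCell scores each cell independently: it ranks that cell's genes by expression and measures how well the signature is recovered among the top-ranked genes, so it needs no between-cell normalization. A deliberately simplified sketch of that idea (the real AUCell package computes a proper area under the recovery curve, not this single-cutoff fraction):

```python
def aucell_like_score(cell_expr, signature, top_frac=0.05):
    """cell_expr: dict gene -> expression for ONE cell.
    Fraction of signature genes found in the top `top_frac` of that cell's
    expression ranking."""
    ranked = sorted(cell_expr, key=cell_expr.get, reverse=True)
    top = set(ranked[: max(1, int(len(ranked) * top_frac))])
    hits = sum(g in top for g in signature)
    return hits / len(signature)

# toy cell with 100 genes; g0 and g1 are the most highly expressed
cell = {f"g{i}": 100 - i for i in range(100)}
score_hi = aucell_like_score(cell, ["g0", "g1"])    # both in top 5%
score_lo = aucell_like_score(cell, ["g50", "g60"])  # neither in top 5%
```

Because the score depends only on within-cell ranks, it is robust to differences in sequencing depth between cells, which is a large part of AUCell's appeal for single-cell and spatial data.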
Dear all,
Thanks for sharing the awesome work.
https://zenodo.org/record/4739739#.Yhpmm-jMJPY
Could you please share the original H&E images in JPG or TIFF format, just like the 10x SRT data?
https://www.10xgenomics.com/resources/datasets/human-breast-cancer-ductal-carcinoma-in-situ-invasive-carcinoma-ffpe-1-standard-1-3-0
Hi @sunnyzwu,
Very impressive paper and Git repo!
I am looking at your GEO repository at https://www-ncbi-nlm-nih-gov.libproxy.lib.unc.edu/geo/query/acc.cgi?acc=GSE176078.
I have downloaded and uncompressed GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz to reveal the standard output of cellranger (mtx, features.tsv and barcodes.tsv).
How was this multi-sample dataset created exactly? Specifically, I would like to know which of the following methods were used to make the files in GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz:
Please let me know which method (1,2, or 3) most closely aligns with how this file was constructed or if a different approach was used to make the files in GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz.
Thanks in advance!
Hi, I would like to know why I get an error like "the compressed file format is unknown or the contents are corrupted" when I unzip the file. When I unzip the other files downloaded from Zenodo, however, everything goes well. Is there a problem with the uploaded file, and could you please send me the latest "filtered_count_matrices.tar.gz" that can be unzipped? Here is my email address: [email protected]. Thank you, and have a good day.
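Generic troubleshooting for anyone hitting this (not specific to this repo): before re-requesting the file, it is worth checking whether the download itself is truncated or corrupted, e.g. by comparing the size/checksum against what Zenodo reports and by test-reading the gzip stream to the end.

```python
import gzip, hashlib

def md5sum(path, chunk=1 << 20):
    """MD5 of a file, read in chunks; compare against the checksum Zenodo shows."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def gzip_ok(path):
    """Stream-decompress to the end; False means the gzip layer is damaged
    (e.g. a truncated download), which matches the 'unknown format or
    corrupted contents' error from archive tools."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):
                pass
        return True
    except (OSError, EOFError):
        return False

# usage: gzip_ok("filtered_count_matrices.tar.gz")
```

If gzip_ok returns False or the MD5 differs from Zenodo's, re-downloading usually fixes it; if both check out, the problem is more likely on the upload side.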
Hi,
I can't find the code for the subclustering of epithelial cells shown in Fig. 1d.
Could you please add it?
Thank you
Hi, are there .bam files available for the spatial transcriptomics data?
Thank you
Hi, I have a batch of breast cancer single-cell data; can I directly call the Highest_calls.R script?
I called this code directly, using the NatGen_Supplementary_table_S4.csv file, but the results don't seem right: the cells in each sample (after extracting the cancer cells) are divided into the four subtypes, and the calls for the dozen or so samples are inconsistent with the clinical results. Do you know why? Is this result reasonable? I can't work out where the problem is.
Looking forward to your reply!
Specifically, there are 3 related files:
BrCa_cell_atlas/spatial_analysis/append_stereoscope_data.R
Lines 40 to 50 in adfbc29
Hi
Thanks for the prompt answers. I have another request: can you please provide the stereoscope script, as I couldn't find it in the repo? Also, could you provide this file:
temp_path <- "/share/ScratchGeneral/sunwu/projects/MINI_ATLAS_PROJECT/spatial/Garvan_PhaseII_spaceranger_loupe_annotations/"
Thanks in advance!
Hi
Thanks for providing the code for your interesting paper.
However, I am stuck at the AUCell analysis for the spatial data, where I cannot find the file used for temp_metagenes:
temp_metagenes <- read.delim("/share/ScratchGeneral/sunwu/projects/MINI_ATLAS_PROJECT/Sept2019/11_visium_brca_metagenes/gene_clusters.txt")
It would be a great help if you could provide the file.
Thanks!
Hi,
I would like to have a look at your spatial data; however, the h5 files are missing.
Could you please add them?
Thank you