swarbricklab-code / brca_cell_atlas

Data processing and analysis related code associated with the study "A single-cell and spatially resolved atlas of human breast cancers".

R 94.54% Shell 3.62% Python 1.84%

brca_cell_atlas's People

Contributors

galeryani, johnyaku, sunnyzwu

brca_cell_atlas's Issues

Code about the gene module

Hi, I'm interested in your gene module analysis, but I cannot work out the exact steps from the description in the paper. Could you share the code? Thank you.

Garnett subtypes

Hi,

I checked the marker signature used to infer cell types with Garnett, and the cell types do not match the ones you have at the Broad Institute. Was another file used for it?

Thanks in advance

Best

Miguel

sd_amplifier set to 1.3 for de-noising?

I see that the following code was used to run InferCNV:

infercnv::run(
        initial_infercnv_object,
        num_threads=numcores-1,
        out_dir=out_dir,
        cutoff=0.1,
        window_length=101,
        max_centered_threshold=3,
        cluster_by_groups=F,
        plot_steps=F,
        denoise=T,
        sd_amplifier=1.3,
        analysis_mode = "samples"
      )

The default value for sd_amplifier is 1.5. I am curious about your decision to adjust this de-noising parameter. I was hoping you could share insights that would help me plan my future analyses and pipelines. Thank you!
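For readers following this question, here is a minimal sketch of what lowering sd_amplifier does, assuming the behaviour described in the inferCNV documentation: noise is the band mean(reference_cells) +/- sd(reference_cells) * sd_amplifier. The `denoise` function and all data below are hypothetical simplifications, not inferCNV's implementation and not the authors' rationale.

```r
set.seed(42)
# Hypothetical residual expression values for reference (normal) cells.
reference <- rnorm(1000, mean = 1, sd = 0.05)
# Hypothetical values observed in tumour cells.
observed <- c(0.90, 0.93, 0.97, 1.00, 1.04, 1.07, 1.10)

# Simplified de-noising: values inside mean +/- sd_amplifier * sd of the
# reference are treated as noise and reset to the reference mean.
denoise <- function(values, reference, sd_amplifier) {
  centre <- mean(reference)
  half_width <- sd_amplifier * sd(reference)
  values[abs(values - centre) <= half_width] <- centre
  values
}

# Count how many observed values survive de-noising at each setting.
kept_default <- sum(denoise(observed, reference, 1.5) != mean(reference))
kept_lower <- sum(denoise(observed, reference, 1.3) != mean(reference))
```

A smaller sd_amplifier narrows the noise band, so the same or more values are retained as signal, at the cost of retaining more noise.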

Can't find codes

Hi, I can't find the source code:
[image]
Can I get the source code? Thanks!

Request for Original H&E Images for Spatial Transcriptomics Slides

Hi there!

Many thanks for this super detailed and COOL dataset. It has been a huge help for my current project :).

I was wondering whether you could share the original scanned H&E images for the spatial transcriptomics samples. I am conducting some automated image analysis and need the H&E images with their post-scan parameters. Unfortunately, the images in the images.pdf file have lost some key information, and the Hi_res/Lo_res images lack the resolution I need. It would be very kind and helpful of you to provide the original H&E images, and I sincerely appreciate your help.

Best,

Yuxuan

Single Cell Count

Hello, I have a question about the number of cells present in the published analyses. In the Figure 1.A legend there are ~130k cells quoted for the UMAP. However, in the published data on the Broad Single Cell Portal there seem to be ~100k cells. Would you please be able to help clarify this discrepancy and how many cells were used to generate the main Fig.1.A UMAP.

Thank you!

CITE-seq data filtering

Were any steps taken to filter the background noise from unbound antibody from the CITE-seq data?

Gene Positions data for running inferCNV.

Hello,

Thank you so much for providing this data. I am new to the field of single-cell analysis, so this might be a naive question. I am trying to reproduce the inferCNV pipeline and was wondering where I can get the gene positions file: infercnv_gene_order.txt.

Thanks.

Reproducing SC subtype calls

Hi,

I am working with your data and trying to reproduce the SCPAM50 calls from the metadata provided for the epithelial cells. I followed the code in Calculatingscoresandplotting.R and Highest_calls.R, but I am unable to replicate the results.

What I did was the following:

  • I generated a Seurat object for each of the breast cancer subtypes from the samples used. Then I generated another Seurat object for the test set, again using the samples stated in the supplementary material. I normalized the data and ran FindVariableFeatures as in Calculatingscoresandplotting.R.
  • Then I found the anchors and integrated the data.
  • This is where it gets complicated for me. I checked the GitHub issue opened here, and the response was that the integrated data between training and test was used for inferring the SC subtypes. However, in Calculatingscoresandplotting.R, once the data has been integrated, there is a comment saying: #Calculating SC50 scores on the 'RNA' scaled data, see here. Additionally, in Highest_calls.R we have the input Mydata (I am not sure what it is), but from the code (here) we can see that: tocalc<-as.data.frame(Mydata@assays$RNA@scale.data). This means it cannot be the integrated data. So I wonder which data is used to infer the SC subtype calls.

I would really appreciate some insight on this. I am not sure what input files like "TNBCmerged_Training_SwarbrickInferCNV.txt" contain, or even what Mydata is here. What do those files contain exactly (e.g. all cells or only epithelial cells; curated cells; integrated or just normalized RNA data)? If I could get the input files used to generate those results, it would help me a lot. I would also like to know which script was used, or the steps to follow, to reach the calls available at the Broad Institute.

Sorry in advance for the convoluted question.

Thanks in advance,

Best,

Miguel

Looking for input file

Hi, I am looking to use the code chunk:
"sigdat <- read.table("SinglecellMolecularSubtypesignaturesincludingSWARBRICKnormals_SEPTEMBER2019.txt",sep='\t',header=F,row.names=1,fill=T)"
in the SCSubtype functions.

Could you please direct me to this file or post it?

Thank you

Question about Pseudobulk

@dlroden
Hi,

In issue #5, you mentioned "For the Pseudobulk, we just summed up all the reads for each gene across all cells".
I'm not sure whether you used the sum or the average across the cells of each sample. I am wondering whether differences in cell numbers between samples would influence the result. For example, if sample A contains more cells than sample B, then in the pseudobulk sample A might show more UMIs per gene than B, not because of real expression differences but because of the number of cells.
If you used the sum, what did you do afterwards to remove this effect?

Thank you a lot
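To make the concern above concrete, here is a minimal base-R sketch of sum-based pseudobulk followed by counts-per-million scaling, one common way to remove the dependence on cell number. The matrix, the sample labels and the choice of CPM are illustrative assumptions; this is not a statement of what the authors did downstream.

```r
# Toy genes-x-cells UMI count matrix (hypothetical data, not from the atlas).
counts <- matrix(
  c(1, 0, 2, 1, 3,
    4, 2, 6, 5, 9,
    0, 1, 1, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:5))
)
# Sample of origin for each cell: sample A has 3 cells, sample B has 2.
sample_id <- c("A", "A", "A", "B", "B")

# Sum-based pseudobulk: add up UMIs per gene across the cells of each sample.
# rowsum() groups rows, so transpose to cells-x-genes first, then back.
pseudobulk_sum <- t(rowsum(t(counts), group = sample_id))

# Counts-per-million: divide each sample column by its total UMI count, so a
# sample with more cells does not look more highly expressed overall.
pseudobulk_cpm <- t(t(pseudobulk_sum) / colSums(pseudobulk_sum)) * 1e6
```

After scaling, every sample column sums to one million regardless of how many cells contributed, so comparisons between samples reflect relative expression rather than cell number.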

Identifying thresholds for cancer cell detection with medoids clustering

Hi @dlroden, @johnyaku, @sunnyzwu, and @gAleryani

In the methods, the following approach was taken to identify thresholds for calling cancer cells:

Cells were plotted with respect to both their genomic instability and correlation scores. Partitioning around medoids clustering was performed using the pamk function in the R package cluster v.2.0.7-1 to choose the optimum value for k (between 2 and 4) using silhouette scores and the pam function to apply the clustering. Thresholds defining normal and neoplastic cells were set at 2 cluster s.d. to the left and 1.5 s.d. below the first cancer cluster means. For tumors where partitioning around medoids could not define more than 1 cluster, the thresholds were set at 1 s.d. to the left and 1.25 s.d. below the cluster means.
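The quoted procedure can be sketched roughly as follows. This is a minimal illustration on simulated scores, not the authors' script. It selects k by average silhouette width using `pam` from the cluster package directly, rather than `pamk`. The axis assignment (instability score on x, correlation on y) and the rule for picking the "cancer" cluster are assumptions made for the example.

```r
library(cluster)  # pam(); a recommended package shipped with R

set.seed(1)
# Hypothetical per-cell scores: 60 normal-like and 60 cancer-like cells.
scores <- rbind(
  cbind(cna_value = rnorm(60, 0.01, 0.002), cor_value = rnorm(60, 0.1, 0.05)),
  cbind(cna_value = rnorm(60, 0.05, 0.005), cor_value = rnorm(60, 0.7, 0.05))
)
scaled <- scale(scores)

# Choose k between 2 and 4 by average silhouette width, then apply PAM.
ks <- 2:4
avg_sil <- sapply(ks, function(k) pam(scaled, k)$silinfo$avg.width)
best_k <- ks[which.max(avg_sil)]
fit <- pam(scaled, best_k)

# Assumption: the cancer cluster is the one with the highest mean CNA score.
cna_means <- tapply(scores[, "cna_value"], fit$clustering, mean)
cancer_id <- as.integer(names(which.max(cna_means)))
cancer <- scores[fit$clustering == cancer_id, , drop = FALSE]

# Thresholds 2 s.d. to the left (x axis) and 1.5 s.d. below (y axis)
# the cancer cluster mean, as in the quoted methods text.
x_threshold <- mean(cancer[, "cna_value"]) - 2 * sd(cancer[, "cna_value"])
y_threshold <- mean(cancer[, "cor_value"]) - 1.5 * sd(cancer[, "cor_value"])
```

Because the thresholds are derived from each sample's own cluster mean and spread, they move with the data instead of being fixed in advance.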

I was wondering what motivated your decision to apply partitioning in the scatter plot of genomic instability and correlation scores? The original inferCNV paper in glioblastoma used hard thresholds of 0.01 and 0.4 respectively for calling cancer cells. Based on my experience, these hard thresholds work well for samples that contain a clear population of copy number high cells and a clear population of copy number low cells in the scatter plot of genomic instability and correlation scores. However (in my experience), these hard thresholds do not work well for samples that contain only one large population, or 'blob', of cells in the scatter plot of genomic instability and correlation scores.

I was wondering if your group made similar observations, and if this motivated your decision to apply partitioning via medoids clustering to infer robust thresholds? Moreover, did you find that this partitioning approach was an improvement over the hard threshold approach?

I would think so, based on code from your inferCNV scripts:

# distinguish samples with 1 and > 1 clusters:
mono_samples <- c("CID3586", "CID3921", "CID3941", "CID3948", "CID4067", 
  "CID4290A", "CID4461", "CID4495", "CID4515", "CID4523", "CID4535", 
  "CID44041", "CID44991", "CID45171", "CID4513", "CID4398")
multi_samples <- c("CID3963", "CID4066", "CID4463", "CID4465", 
  "CID4471", "CID4530N", "CID44971")

Deconvolution file corresponding to 1160920F Patient counts from the zenodo repository.

For patient 1160920F from https://zenodo.org/record/4739739#.Y007vNJBxhE, I would like to know which of the following deconvolution outputs corresponds to it:
https://github.com/Swarbricklab-code/BrCa_cell_atlas/tree/main/spatial_analysis/stsc/major
[182963_1160920F-count-matrix.tsv]
[182968_1160920F-count-matrix.tsv]
[182969_1160920F-count-matrix.tsv]

Judging from the metadata here:
https://github.com/Swarbricklab-code/BrCa_cell_atlas/blob/main/spatial_analysis/Meta_Data.csv
I bet it's 182968_1160920F.
Is that right? Thanks.

Update: the index given for the spots corresponds to pxl_row_in_fullres X pxl_col_in_fullres merged together.
It is not array_row X array_col. Thanks
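If the spot index is indeed built from the full-resolution pixel coordinates, it can be mapped back to barcodes and array positions by rebuilding the same key from a Space Ranger tissue positions table. The sketch below uses hypothetical rows. The "x" separator and the example values are assumptions, so check the actual index format in the file before relying on it.

```r
# Hypothetical rows laid out like a Space Ranger tissue positions table.
positions <- data.frame(
  barcode = c("AAAC-1", "AAAG-1", "AACT-1"),
  array_row = c(0L, 1L, 2L),
  array_col = c(16L, 17L, 18L),
  pxl_row_in_fullres = c(1200L, 1350L, 1500L),
  pxl_col_in_fullres = c(3400L, 3550L, 3700L),
  stringsAsFactors = FALSE
)

# Hypothetical spot indices as they might appear in a deconvolution output.
spot_index <- c("1500x3700", "1200x3400")

# Rebuild the key from pixel coordinates and look up each spot index.
key <- paste0(positions$pxl_row_in_fullres, "x", positions$pxl_col_in_fullres)
matched <- positions[match(spot_index, key),
                     c("barcode", "array_row", "array_col")]
```

Any index that fails to match comes back as a row of NA values, which is a quick way to detect a wrong separator or a mismatched sample.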

Generating pseudobulk

Can you please direct me to the code used to generate the pseudobulk RNA-seq in the paper, and explain how to run the PAM50 subtyping on the pseudobulk?
I can see the bulk RNA-seq PAM50 code, but I want to apply the pseudobulk method to my own breast samples, so this would be appreciated.
Thank you

Batch Corrected Count Matrix

Hello, would it be possible to make the batch corrected count matrix used to generate the main UMAPs publicly available? This would be very helpful for analyzing the effect different batch correction methods have on the data and for computing a nearest neighbors graph for the corrected data.

Thank you!

How was GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz created?

Hi @sunnyzwu,

Very impressive paper and Git repo!

I am looking at your GEO repository at https://www-ncbi-nlm-nih-gov.libproxy.lib.unc.edu/geo/query/acc.cgi?acc=GSE176078.

I have downloaded and uncompressed GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz to reveal the standard output of cellranger (mtx, features.tsv and barcodes.tsv).

How was this multi-sample dataset created exactly? Specifically, I would like to know which of the following methods were used to make the files in GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz:

  1. Ran cellranger aggr to combine all samples into one cellranger count output
  2. Decomposed a CCA-integrated Seurat object into mtx, features, and barcodes files
  3. Decomposed a non-integrated Seurat object (using Seurat's merge()) into mtx, features, and barcodes files

Please let me know which method (1,2, or 3) most closely aligns with how this file was constructed or if a different approach was used to make the files in GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz.

Thanks in advance!
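For illustration only, option 3 above (merging per-sample matrices without integration, then writing mtx, features and barcodes files) looks roughly like the sketch below. It uses plain Matrix objects instead of Seurat. The sample names, barcodes and single-column features file are hypothetical simplifications. It does not indicate how the GEO archive was actually produced.

```r
library(Matrix)  # sparse matrices and writeMM(); a recommended package shipped with R

# Two hypothetical per-sample count matrices sharing the same genes.
genes <- c("geneA", "geneB", "geneC")
s1 <- Matrix(c(1, 0, 2, 0, 3, 0), nrow = 3, sparse = TRUE,
             dimnames = list(genes, c("AAAC", "AAAG")))
s2 <- Matrix(c(0, 4, 0, 5, 0, 6, 7, 0, 0), nrow = 3, sparse = TRUE,
             dimnames = list(genes, c("AAAC", "CCCT", "GGGT")))

# Prefix barcodes with a sample name so they stay unique after merging.
colnames(s1) <- paste0("sample1_", colnames(s1))
colnames(s2) <- paste0("sample2_", colnames(s2))
merged <- cbind(s1, s2)

# Write the merged matrix in a cellranger-like three-file layout.
out_dir <- file.path(tempdir(), "merged_counts")
dir.create(out_dir, showWarnings = FALSE)
writeMM(merged, file.path(out_dir, "matrix.mtx"))
writeLines(rownames(merged), file.path(out_dir, "features.tsv"))
writeLines(colnames(merged), file.path(out_dir, "barcodes.tsv"))
```

A matrix written this way holds raw, uncorrected counts with sample-prefixed barcodes, which is one way to tell it apart from an integrated object.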

About the "filtered_count_matrices.tar.gz" download on zenodo

Hi, I would like to know why I get an error like "the compressed file format is unknown or the contents are corrupted" when I unzip this file. The other files I downloaded from Zenodo unzip without problems. Is there a problem with the uploaded file, and could you please send me the latest "filtered_count_matrices.tar.gz" that can be unzipped? Here is my email address: [email protected]. Thank you, and have a good day.

BAM files

Hi, are there .bam files available for the spatial transcriptomic data?

Thank you

scSubtype-related code: is it possible to call it directly?

Hi, I have a batch of breast cancer single-cell data. Can I call the Highest_calls.R script directly?

I called this code directly with the NatGen_Supplementary_table_S4.csv file, but the results don't seem right: it seems to divide the cells in each sample (after extracting the cancer cells) into the four subtypes, and this happens for all of the dozen or so samples. Do you know why? Is this result reasonable? At present it is not consistent with the clinical results, but I don't know where the problem is.

Looking forward to your reply!

Request for providing script for running stereoscope for spatial data

Hi

Thanks for the prompt answers. I have another request. Could you please provide the stereoscope script, as I couldn't find it in the repo? Also, could you provide the files at this path:
temp_path <- "/share/ScratchGeneral/sunwu/projects/MINI_ATLAS_PROJECT/spatial/Garvan_PhaseII_spaceranger_loupe_annotations/"

Thanks in advance!

Request for gene_clusters.txt file for AUCell analysis for spatial data

Hi

Thanks for providing the code for your interesting paper.
However, I am stuck at the AUCell analysis for the spatial data, where I cannot find the file used for temp_metagenes:
temp_metagenes <- read.delim("/share/ScratchGeneral/sunwu/projects/MINI_ATLAS_PROJECT/Sept2019/11_visium_brca_metagenes/gene_clusters.txt")

https://github.com/Swarbricklab-code/BrCa_cell_atlas/blob/main/spatial_analysis/brca_gene_module_spatial_AUCell.R

It would be really a great help if you can provide the file.

Thanks!

missing h5 files from Zenodo

Hi,

I would like to have a look at your spatial data, but the h5 files are missing.
Could you please add them?

Thank you
