basicsworkflow's People

Contributors

alanocallaghan, catavallejos, nilseling

Forkers

alanocallaghan

basicsworkflow's Issues

Full pdf does not compile

Maybe this is due to the docker configuration? It runs well for the chunks that have cache available but gets stuck after some point. For now, I added eval = FALSE in order to generate a full pdf.

Move some diagnostics to supplementary?

Seems like having 32 trace plots in the main paper is excessive. It could be better to include a handful, with a recommendation to plot a few, and put the rest in the supplementary material.

Final (?) to-do list

Adding this here as not everything has been documented in individual issues.

Note: 🔴 marks things that require changes in the BASiCS library.

  • Revise abstract
  • Revise introduction section. Particularly focus on intro to BASiCS paragraph. Yes, in general UMIs are preferable, but our model is not a NB distribution (we have two sources of over-dispersion). Also, UMI protocols tend to be sparser and some may not work well with BASiCS (e.g. snRNAseq).
  • Run chains with EB prior -> in progress by @alanocallaghan
  • Upload EB chains to gitlab
  • Update workflow code to use EB chains
  • Add EB option in param description for BASiCS_MCMC #33
  • Re-run chains with a seed and update seed in workflow.
  • Add ref for EB. http://www.biostat.jhsph.edu/~fdominic/teaching/bio656/labs/labs09/Casella.EmpBayes.pdf
  • Edit caption for dispersion-vs-mean figure. #37 ?
  • Edit HVG/LVG to use threshold directly on epsilon ? 🔴 See catavallejos/BASiCS#203
  • Edit caption/size for figure plot-example-vg-naive. Potentially remove function.
  • Current version removes diff over-dispersion testing as we want to encourage residual over-dispersion instead. If left like that, add a brief sentence to say that it's possible. OK?
  • Look for another workflow to reference for GO analysis. https://f1000research.com/articles/5-1281/v3
  • Edit caption and size for visualise-DE-plot plot. I tried, but the use of cowplot within BASiCS_PlotDE doesn't enable me to change some of the figure settings (e.g. legend text size). See catavallejos/BASiCS#200 🔴
  • Change plots in diff testing section to use denoised counts.
  • Use a version of BASiCS_DenoisedCounts that excludes spike-ins 🔴. See catavallejos/BASiCS#196 This is no longer needed.
  • Update figs for diff variability testing. Heatmap temporarily stayed, but a better visualisation is required. Some high level interpretation needs to be written.
  • Review and update no-spikes section once EB chains are ready.
  • Remove ESS from no-spikes section? We can just refer to what was done before.
  • Add ref to 'stable genes' paper. See here. (guess you mean here)
  • Remove uses of @ operator

Docker warning

docker run -p 8787:8787 -v $(pwd):/home/rstudio/mycode -e PASSWORD=bioc alanocallaghan/bocker
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
[s6-init] making user provided files available at /var/run/s6/etc...if: fatal: child crashed with signal 11
exited 0.
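
The warning above says that no specific platform was requested. A minimal sketch of requesting one explicitly with the `--platform` flag of `docker run` is shown below. The helper name `bocker_cmd` is hypothetical, and the thread does not confirm whether running the amd64 image under emulation avoids the `signal 11` crash in `s6-init`; it may only silence the warning.

```shell
#!/usr/bin/env bash
# Dry-run helper (hypothetical name): prints the docker command instead of
# running it, so the flags can be checked before starting the container.
# The platform defaults to linux/amd64, the image platform named in the warning.
bocker_cmd() {
  local platform="${1:-linux/amd64}"
  echo "docker run --platform ${platform}" \
    "-p 8787:8787" \
    "-v $(pwd):/home/rstudio/mycode" \
    "-e PASSWORD=bioc" \
    "alanocallaghan/bocker"
}

# Print the command with the default platform.
bocker_cmd
```

Once the printed command looks right, it can be pasted into the terminal and run as is.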

Build issue on Mac

Hi @alanocallaghan

thanks so much for setting this up - really great stuff!

I get the following issue when running it on my mac using sudo make. I also get the same issue when running it in Rstudio within the container.

nils@DQBM-NBM-NIEL BASiCSWorkflow % make     
docker run -v /Users/nils/Github/BASiCSWorkflow:/home/rstudio/mycode \
		-w /home/rstudio/mycode \
		alanocallaghan/bocker:0.1.0 \
		/bin/bash \
		-c 'Rscript -e "rmarkdown::render(\"Workflow.Rmd\")"'
Unable to find image 'alanocallaghan/bocker:0.1.0' locally
0.1.0: Pulling from alanocallaghan/bocker
Digest: sha256:b99e0218ff2f52d01182f6dd1e51aa841cc3fe550e7b18850b7f626d5e545e06
Status: Downloaded newer image for alanocallaghan/bocker:0.1.0
Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.12' is out-of-date; the current release version '3.13'
  is available with R version '4.1'; see https://bioconductor.org/install


processing file: Workflow.Rmd
  |.                                                                     |   1%
  ordinary text without R code

  |..                                                                    |   2%
label: setup_knitr (with options) 
List of 2
 $ include: logi FALSE
 $ cache  : logi FALSE

  |..                                                                    |   3%
   inline R code fragments

  |...                                                                   |   4%
label: overview (with options) 
List of 4
 $ out.width : symbol out_width
 $ out.height: symbol out_height
 $ fig.cap   : chr "Graphical overview for the scRNA-seq analysis workflow described in this manuscript. Starting from a matrix of "| __truncated__
 $ echo      : logi FALSE

  |....                                                                  |   5%
  ordinary text without R code

  |.....                                                                 |   7%
label: unnamed-chunk-1
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'

The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: 'Biobase'

The following object is masked from 'package:MatrixGenerics':

    rowMedians

The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians

  |.....                                                                 |   8%
   inline R code fragments

  |......                                                                |   9%
label: unnamed-chunk-2
  |.......                                                               |  10%
   inline R code fragments

  |........                                                              |  11%
label: unnamed-chunk-3
    Welcome to 'BASiCS'. If you used 'BASiCS' before its release in
    Bioconductor, please visit:
    https://github.com/catavallejos/BASiCS/wiki.
  |........                                                              |  12%
   inline R code fragments

  |.........                                                             |  13%
label: naive-data
trying URL 'https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-4888/E-MTAB-4888.processed.1.zip'
Content type 'application/zip' length 7200152 bytes (6.9 MB)
==================================================
downloaded 6.9 MB

  |..........                                                            |  14%
   inline R code fragments

  |...........                                                           |  15%
label: selecting-serum-cells
trying URL 'https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-4888/E-MTAB-4888.additional.1.zip'
Content type 'application/zip' length 33455 bytes (32 KB)
==================================================
downloaded 32 KB

  |............                                                          |  16%
  ordinary text without R code

  |............                                                          |  18%
label: CD4-SCE-object
  |.............                                                         |  19%
   inline R code fragments

  |..............                                                        |  20%
label: naive-activated-CD4-SCE-object
  |...............                                                       |  21%
   inline R code fragments

  |...............                                                       |  22%
label: obtain-gene-symbols
  |................                                                      |  23%
  ordinary text without R code

  |.................                                                     |  24%
label: unnamed-chunk-4
  |..................                                                    |  25%
   inline R code fragments

  |..................                                                    |  26%
label: unnamed-chunk-5
  |...................                                                   |  27%
  ordinary text without R code

  |....................                                                  |  29%
label: PerCellQC (with options) 
List of 1
 $ fig.cap: chr "Cell-level QC metrics. The total number of endogenous read-counts (excludes non-mapped and intronic reads) is p"| __truncated__

  |.....................                                                 |  30%
  ordinary text without R code

  |......................                                                |  31%
label: experimental-condition-batch (with options) 
List of 1
 $ fig.cap: chr "Cell-level QC metrics according to cell-level metadata. The total number of endogenous reads (excludes non-mapp"| __truncated__

  |......................                                                |  32%
   inline R code fragments

  |.......................                                               |  33%
label: pca-visualisation-stimulus-batch (with options) 
List of 1
 $ fig.cap: chr "First two principal components of log-transformed expression counts after scran normalisation. Colour indicates"| __truncated__

  |........................                                              |  34%
   inline R code fragments

  |.........................                                             |  35%
label: gene-selection (with options) 
List of 1
 $ fig.cap: chr "Average read-count for each gene is plotted against the number of cells in which that gene was detected. Dashed"| __truncated__

  |.........................                                             |  36%
  ordinary text without R code

  |..........................                                            |  37%
label: spike-ins-present
  |...........................                                           |  38%
   inline R code fragments

  |............................                                          |  40%
label: SCE-separation
  |............................                                          |  41%
   inline R code fragments

  |.............................                                         |  42%
label: BatchInfo
  |..............................                                        |  43%
   inline R code fragments

  |...............................                                       |  44%
label: spike-in_download
trying URL 'https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095046.txt'
Content type 'text/plain' length 4086 bytes
==================================================
downloaded 4086 bytes

  |................................                                      |  45%
  ordinary text without R code

  |................................                                      |  46%
label: ercc-mul
  |.................................                                     |  47%
  ordinary text without R code

  |..................................                                    |  48%
label: spike-info
  |...................................                                   |  49%
   inline R code fragments

  |...................................                                   |  51%
label: MCMC-run (with options) 
List of 1
 $ eval: logi FALSE

  |....................................                                  |  52%
  ordinary text without R code

  |.....................................                                 |  53%
label: download-chain-naive
trying URL 'https://zenodo.org/record/5243265/files/chain_naive.Rds'
Content type 'application/octet-stream' length 119109227 bytes (113.6 MB)
==================================================
downloaded 113.6 MB

trying URL 'https://zenodo.org/record/5243265/files/chain_active.Rds'
Content type 'application/octet-stream' length 117080996 bytes (111.7 MB)
==================================================
downloaded 111.7 MB

  |......................................                                |  54%
  ordinary text without R code

  |......................................                                |  55%
label: displayChainBASiCS
  |.......................................                               |  56%
   inline R code fragments

  |........................................                              |  57%
label: convergence-naive (with options) 
List of 1
 $ fig.cap: chr "Trace plot, marginal histogram, and autocorrelation function of the posterior distribution of the mean expressi"| __truncated__

  |.........................................                             |  58%
   inline R code fragments

  |..........................................                            |  59%
label: mcmc-diag (with options) 
List of 1
 $ fig.cap: chr "MCMC diagnostics for gene-specific mean expression parameters; naive CD4+ T cells. A: Geweke Z-score for mean e"| __truncated__

Loading required package: viridisLite
  |..........................................                            |  60%
   inline R code fragments

  |...........................................                           |  62%
label: dispersion-vs-mean (with options) 
List of 3
 $ fig.width : num 6
 $ fig.height: num 6
 $ fig.cap   : chr "Comparison of gene-specific transcriptional variability estimates and mean expression estimates obtained for ea"| __truncated__

  |............................................                          |  63%
   inline R code fragments

  |.............................................                         |  64%
label: plot-vg-naive (with options) 
List of 1
 $ fig.cap: chr "HVG and LVG detection using BASiCS. For each gene, BASiCS posterior estimates (posterior medians) associated to"| __truncated__

For LVG detection task:
the posterior probability threshold chosen via EFDR calibrationis too low. Probability threshold automatically set equal to'ProbThreshold'.
  |.............................................                         |  65%
   inline R code fragments

  |..............................................                        |  66%
label: plot-example-vg-naive (with options) 
List of 1
 $ fig.cap: chr "BASiCS denoised counts for example HVG and LVG with similar overall levels of expression."

  |...............................................                       |  67%
   inline R code fragments

  |................................................                      |  68%
label: mean-expression-testing
-------------------------------------------------------------
Log-fold change thresholds are now set in a log2 scale. 
Original BASiCS release used a natural logarithm scale.
-------------------------------------------------------------
Offset estimate: 0.5987
(ratio Naive vs Active).
To visualise its effect, please use 'PlotOffset = TRUE'.
-------------------------------------------------------------

For Differential mean task:
the posterior probability threshold chosen via EFDR calibrationis too low. Probability threshold automatically set equal to'ProbThresholdM'.
  |................................................                      |  69%
   inline R code fragments

  |.................................................                     |  70%
label: visualise-DE-mean-plot (with options) 
List of 3
 $ fig.height: num 8
 $ fig.width : num 6
 $ fig.cap   : chr "Upper panel presents the MA plot associated to the differential mean expression test between naive and active c"| __truncated__

  |..................................................                    |  71%
  ordinary text without R code

  |...................................................                   |  73%
label: offset-denoised
  |....................................................                  |  74%
   inline R code fragments

  |....................................................                  |  75%
label: heatmap-diffexp (with options) 
List of 3
 $ fig.width : num 7
 $ fig.height: num 6
 $ fig.cap   : chr "Heatmap displays normalised expression values (log10(x + 1) scale) for naive and active cells. Genes are strati"| __truncated__

Loading required package: grid
========================================
ComplexHeatmap version 2.6.2
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
  genomic data. Bioinformatics 2016.

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================

========================================
circlize version 0.4.11
CRAN page: https://cran.r-project.org/package=circlize
Github page: https://github.com/jokergoo/circlize
Documentation: https://jokergoo.github.io/circlize_book/book/

If you use it in published research, please cite:
Gu, Z. circlize implements and enhances circular visualization
  in R. Bioinformatics 2014.

This message can be suppressed by:
  suppressPackageStartupMessages(library(circlize))
========================================

  |.....................................................                 |  76%
   inline R code fragments

  |......................................................                |  77%
label: diff-res-plot (with options) 
List of 3
 $ fig.height: num 8
 $ fig.width : num 6
 $ fig.cap   : chr "Upper panel presents the MA plot associated to the differential residual over-dispersion test between naive and"| __truncated__

  |.......................................................               |  78%
  ordinary text without R code

  |.......................................................               |  79%
label: combine-results
Quitting from lines 1496-1505 (Workflow.Rmd) 
Error in ggplot(table_combined_de) : object 'table_combined_de' not found
Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> ggplot
In addition: Warning messages:
1: Removed 18 rows containing missing values (geom_point). 
2: Removed 127 rows containing missing values (geom_point). 
3: Removed 127 rows containing missing values (geom_point). 

Execution halted
make: *** [Workflow.pdf] Error 1

Here is also my sessionInfo (of course only the systems part applies here):

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       fansi_0.5.0      crayon_1.4.1     utf8_1.2.2       digest_0.6.28    later_1.3.0      R6_2.5.1         lifecycle_1.0.1 
 [9] magrittr_2.0.1   evaluate_0.14    pillar_1.6.3     rlang_0.4.11     fs_1.5.0         promises_1.2.0.1 vctrs_0.3.8      ellipsis_0.3.2  
[17] rmarkdown_2.11   tools_4.1.0      compiler_4.1.0   httpuv_1.6.3     xfun_0.26        fastmap_1.1.0    pkgconfig_2.0.3  htmltools_0.5.2 
[25] knitr_1.36       tibble_3.1.5 

Publish repo

Need to make repo public when submitting, and later archive in Zenodo

edit overview figure

Tasks to be grouped under a single heading:

  • Normalization
  • Variance decomposition
  • HVG/LVG detection

Final edits - to-do

  • Run MCMC with longer number of iterations
  • Update workflow to use new chains (update code + text + running times)
  • Update running times with times provided by @alanocallaghan
  • Update documentation for PriorMu/PriorParam
  • Geweke: why are there red lines at +-3? What do they signify? (explain in Fig caption)
  • Once convergence/ESS is checked in longer chains, merge Figs 7 and 8 focusing on naive cells only. Update associated text.
  • Fig 9. Use same x and y-axes in similar panels (?) -> Not critical, likely to leave as is
  • Remove description of PercentileThreshold
  • Fig 10, change NA label to Not HVG/LVG.
  • Edit x- and y-labels in Fig 5 (it's per gene, not cell!)
  • Update lines 823-824 to use chain_naive.Rds and chain_active.Rds once final chains are in gitlab
    - [ ] In QC section, make a comment about the sparsity of the data
  • Edit examples in diff variability section
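
On the Geweke caption item above: the standard Geweke diagnostic compares the mean of an early window of the chain with the mean of a late window. The notation below is illustrative rather than taken from the workflow: $\bar{x}_A$ and $\bar{x}_B$ are the window means, $n_A$ and $n_B$ the window lengths, and $\hat{S}_A(0)$, $\hat{S}_B(0)$ the spectral density estimates at frequency zero.

```latex
Z = \frac{\bar{x}_A - \bar{x}_B}
         {\sqrt{\hat{S}_A(0)/n_A + \hat{S}_B(0)/n_B}}
```

If the chain is stationary, $Z$ is approximately standard normal, so $|Z| > 3$ has probability of about 0.003. Assuming the red lines mark this rule of thumb, the caption could say that values outside $\pm 3$ suggest a lack of convergence for that parameter.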

Changes after initial draft

  • Removed refs to simpleSingleCell as there is a redirection notice to OSCA
  • Further simplified the QC section to avoid duplication with the methods section (e.g. to explain that metrics are used to detect doublets, etc).
  • Gene-level QC suggests that thresholds require detailed exploration of gene-level metrics. However, we don't have any. Should we change this?

Major comments

Code-related

  • Update R version
  • Use droplet data instead, with more heterogeneity?
  • logit transform volcano plot?
  • Demonstrate reasons behind prior choice/prior sensitivity checks (PPC)

Alan: I think prior predictive checks of the regression model don't work because of the inverse gamma on scale, so I'll probably just put some text in about the prior justification and refer to the old papers.
Cata: I agree that PPC are likely to fail. Supp Text 3 of the original BASiCS paper shows some sensitivity with respect to the hyper-parameters used for $\theta$ and $\delta_i$. Of course, those results are rather outdated as the prior for $\delta_i$ now follows the approach in Eling et al.

  • Multiple chains? Rhat diagnostic.

Alan: I am working on this at the moment between a new package and BASiCS itself
BASiCStan is now on Bioc. Will implement single chain Rhat for BASiCS and say maybe we'll do multiple chains in future but not now.
Cata: Thanks for uploading BASiCStan to BioC! For this paper, I think we can argue that the sampler is rather time-consuming and therefore it's unlikely that users will routinely run multiple chains. We can also mention that we have generally observed the sampler behaves well in a large number of contexts. However, the ability to run multiple chains will improve once we publish the scalability paper (which we should try to submit soon!)
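
For reference on the single-chain Rhat mentioned above, this is the textbook Gelman-Rubin statistic applied to a chain split into $m$ segments of length $n$ (for example, $m = 2$ halves). It is a sketch of the standard definition, not necessarily what BASiCS will implement; $\bar{x}_j$ and $s_j^2$ denote the mean and variance of segment $j$, and $\bar{x}$ the overall mean.

```latex
B = \frac{n}{m - 1} \sum_{j=1}^{m} (\bar{x}_j - \bar{x})^2,
\qquad
W = \frac{1}{m} \sum_{j=1}^{m} s_j^2,
\qquad
\hat{R} = \sqrt{\frac{\frac{n - 1}{n} W + \frac{1}{n} B}{W}}
```

Values of $\hat{R}$ close to 1 indicate that the segments agree with each other; larger values suggest the chain has not yet converged.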

  • How to manipulate/visualise credible intervals?

Alan: BASiCS_Summary I guess
Cata: Yes, that would give you credible intervals for individual parameters. Should we edit the BASiCS_Test function to also return HPD intervals for LFCs? That should be easy to implement.

  • Simplify and reduce code chunks (particularly plotting); plotting is verbose (less customisation maybe)

Alan: This is tough, might just ditch some figures and make some a bit more "ugly"
eg for the sake of a workflow having ugly axis labels is probably okay? need to check other f1000 papers

Text-related

  • What type of experiments/experimental designs?

BASiCSWorkflow/Workflow.Rmd

Lines 147 to 155 in a40c412

of noise in scRNA-seq datasets [@Vallejos2015; @Vallejos2016; @Eling2017]. The
model was motivated by supervised experimental designs in which experimental
conditions corresponds to groups of cells defined *a priori* (e.g. selected
cell types obtained through FACS sorting). However, the approach can also be
used in cases were the groups of cells of interest are computationally
identified through clustering. In such case, the model implemented in
`r Biocpkg("BASiCS")` does not account for issues associated to post-selection
inference, where the same data is analysed twice: first to perform clustering
and then to compare expression profiles between clusters [@Lahnemann2020].

  • Better model summary (with schematics)

    • Explain joint prior
    • explain variance decomp better
  • How does it deal with biological replicates?

Cata: it does not. That's scive. :-)

  • Computationally challenging - so why not just MLE or MAP?

Cata: We can argue against a simple MLE based on an e.g. NB model because our main focus is on estimates of variability. Inference about means is simpler and, indeed, an MLE approach is likely sufficient (at some point I remember comparing point estimates with DESeq2, etc). This is not the case for over-dispersion parameters. Here is where the use of a Bayesian approach becomes more useful. In particular, we have seen that the joint mean/dispersion prior (aka the regression approach) improves inference of transcriptional variability for lowly expressed genes and sparse datasets. In terms of MAP, I think the argument to give is that we want to retain posterior uncertainty to enable differential testing. We can add a couple of sentences about this when describing the method.

  • Volcano plot probs seem to bunch around 0.5, why? Are they calibrated?
  • How does it work with spatial data?

Cata: We have not tested that.

  • Address double dipping

Cata: Whilst we highlight that this is an important issue, we have not addressed it yet.

  • What happens with violation of model assumptions?
  • Dealing with a BASiCS run per population (muscat approach...? not sure what that refers to)
  • Divide and conquer MCMC/other computational approaches for speed - downsampling, filtering

Manuscript-related

  • Better navigation (eg linking to sections more)
  • make available as bioc/github workflow with better navigation
    Response: I would say "We'll do this with a later/final version". Actually think this is the best way to maintain it to ensure it still runs in later versions of Bioc etc...

submission checklist

  • article type: "software tool article"?
  • Publish repo, formerly #38
  • author contributions (entered as below)
    • aoc: Conceptualization, Software, Visualization, Writing – Original Draft Preparation
    • ne: Conceptualization, Methodology, Software, Writing – Original Draft Preparation, Writing – Review & Editing
    • jcm: Project Administration, Supervision, Writing – Review & Editing
    • cav (corresponding): Conceptualization, Project Administration, Software, Supervision, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing
  • enter funding bodies/grant nos
  • payment information (eventually, presumably...)
  • cover letter
  • gateway: bioc
  • final run on Cata's laptop to ensure reproducibility -> Nils to run
  • propose reviewers

Rename .Rds files

use chain_active.Rds and chain_naive.Rds to be compatible with the storage options in BASiCS_MCMC

Pre-submission edits

Need to get the changes we did after submitting the original tex file and apply them to the Rmd.

Changes for current draft (compiled on June 5)

  • The F1000 template does not distinguish between sub-sections (##) and sub-sub-sections (###). As such, I removed the distinction between those.
  • Further simplification of the MCMC diagnostics section.
    • Re-organise text to focus on convergence first, then ESS.
    • Focus on mean parameters only. Diagnostics for delta could be added as supplementary material.
    • Histogram of Z-scores was removed (redundant with respect to the other plots)
    • Plot of Z-scores vs percentage of zeroes was removed to be consistent with ESS plots
    • Removes comment about high absolute Z-score for lowly expressed genes. Trend is not so clear after applying the more stringent gene filter.
    • Same for the connection between ESS and Z-scores.
  • Normalisation section
    • Removes BASiCS_DenoisedRates as it's not used anywhere else (and it's less directly comparable to scran normalisation)
    • Removes intro text that was left-over from old structure
    • Removes BASiCS_DenoisedCounts as it's best to apply the function after offset correction.
    • Adds Pearson's correlation to scatterplots
    • I will temporarily comment out this section: due to the offset effect, it's hard to interpret the scale of the scaling factors when comparing to scran. I believe that the explanation will distract from the key messages of the paper. Potentially move to supplementary material.

F1000Research to-do list

  • Add Zenodo link / DOI to .bib file
  • Cite Zenodo code within the article: line 221
  • Edit Operation section to include system requirements and link to Docker. See example here
  • Re-structure Methods to have Implementation and Operation subsections -> Done / @alanocallaghan to review
  • @alanocallaghan to review Software availability section
  • Is the Docker public? I am also asked to login in order to access it
  • Provide payment information -> only @alanocallaghan can do this

Remove ess histograms?

I've used these mostly with smaller data (~5000 genes) where they're informative. With 9k genes it seems you need to log the y axis for them to be informative. Since we plot ESS in different forms already they're somewhat redundant anyway

To-do

Draft rebuttal letter available here

Prioritised TO-DO list

  • Finish edits in data prep script - @catavallejos
  • Update to R 4.3/Bioc release @alanocallaghan
  • Re-run chains - @alanocallaghan -> all minus 1!
  • Make sure that the Overview.svg fig works - @alanocallaghan
  • Edit $\theta$ panel in schematic BASiCS fig - @alanocallaghan just details to be fixed
  • Create related paragraph in Methods (see comment in lines 309-312) and edit the remainder of the BASiCS model description in that Section. Add a table highlighting differences across diff variations - @catavallejos WIP, but needs to be shortened
  • Clarify priors, independent/joint + EB - @catavallejos WIP
    - [ ] Edit model specification in supplementary section - @catavallejos supps are not allowed!!
  • Run MCMC chains with/without EB - @alanocallaghan

Reviewer 1

  • Schematic figure (see discussion in #45)
  • Create related paragraph in Methods (see comment in lines 309-312) and edit the remainder of the BASiCS model description in that Section. Perhaps add a table summarising the different model variations? Here is a WIP draft - @catavallejos

| Paper | Normalisation (cell-specific parameters) |
| --- | --- |
| Vallejos et al (2015) | $s_j$: sequencing depth and other technical factors; $\phi_j$: mRNA content |

  • Check enough comments are provided throughout the text
  • See what we can do about ggplot2
    * catavallejos/BASiCS#270
  • Explain EB and other elements of the prior. Would it be better to have a supp section with more details?

Cata is working on this

  • Create supp section for visualising HPDs etc (mention in text)
  • Check volcano plots concentration around 0.5, calibration?
  • Add [@MARX2021] and [@Aijo2019] to bibtex
  • check all libraries are loaded at the start
  • Finish incomplete sentence in rebuttal about PCA axis
  • Check PR about spike-in calculation: catavallejos/BASiCS#270
  • Check if R 4.2 is still the obvious choice. Likely outdated by now!

Reviewer 2

  • #50
  • See what we can do about ggplot2
  • Better navigation?
  • GH/Bioc workflow

Alan: I would say "We'll do this with a later/final version". Actually think this is the best way to maintain it to ensure it still runs in later versions of Bioc etc...

  • Draft response about muscat question.
  • Check Fig 15 and interpretation

Reviewer 3

  • Draft response for 5 (already addressed in text)
  • Address 8 (tempted not to run additional analysis here)

Others

  • Make sure that running times correspond to the new data
  • Fix data download links
  • Fix references to Supplementary Material. Hyperlink? Marked with an "ADD-REF" tag in text.

Minor comments from all reviews

These are mostly text edits that should be quick and easy to address.

  • In the chunk starting ## Moles per microliter, I had to edit the first line of the code chunk to get it to work. Perhaps it has been malformatted?

  • \log is a symbol and should always be lower-case (note: I'm not making all the "log2" in plot labels subscripted)

  • In Figure 12, there is a typo in the legend: "\log_{1}0" should be "\log_{10}".

  • In the abstract, you write "strong technical noise". I don’t quite understand what "strong" is adding here.

  • In the first code chunks, in `website <- " https …"` I think there is an initial space which stops this from running correctly.

  • Remember to include spaces after commas e.g. [a, ] rather than [a,], it is much clearer to read.

  • I think it would be better to load all the packages at the beginning, I had to restart my session several times to install the packages that I realized I needed x% of the way through the workflow.

  • newlines in website strings resulting in mal-formed URL.

  • ^ rendered in some odd font that wouldn't execute when I copied a chunk into Rstudio.

  • `concentration.in..Mix.1.attomoles.ul` was missing a `.`

  • It seemed that when I ran the BASiCS_DetectHVG function as specified, I got a couple of warning messages: "The posterior probability threshold chosen via EFDR calibration is too low. Probability threshold automatically set equal to 'ProbThresholdM'."
    Response: This is a message, not a warning. Could add the EFDR/EFNR plot to the manuscript and explain that it's fine.

  • Figure 15: Caption says "log2 change in expression against against log mean expression for genes with higher residual over-dispersion in naive (A) cells and active (D) cells", but seems to refer to panels A and C.

    • More broadly, it's not really clear what sort of conclusions we are supposed to draw from this sort of figure.
  • Introduction, 2nd paragraph: “Moreover, these variability estimates can also be inflated by the technical noise that is typically observed in scRNA-seq data”
    Comment: Please illustrate with examples of technical noise as presented above for extrinsic and intrinsic noise, e.g. RNA differential degradation, ligation bias, etc.

  • Introduction, 3rd paragraph: “However, despite the benefits associated to the use of spike-ins and UMIs, these are not available for all scRNA-seq protocols.”
    Comment: Please provide a reason - something like “due to the very nature of the assay, which isolates library prep from external spike-ins and uses UMIs to map single-cell libraries…” or that spike-ins added to the library after pooling limit the full advantage of using them across single-cell populations.

Cata: I think our response needs to differentiate between spike-ins and UMIs. We could write something along the lines of what Haque et al (2017) have: In general, spike-in RNAs are not compatible with droplet-based approaches, whereas UMIs are typically used in protocols where only the 3’-ends of transcripts are sequenced, such as CEL-seq2, Drop-seq and MARS-seq [10, 45, 60]. Note, however, that technology is evolving (see this paper). Perhaps we can change the sentence to say "... these are not routinely available ...". I haven't been able to find a reference yet, but I believe that spike-ins in 10X could be expensive, as you would technically need to add spike-ins to all droplets, even though some droplets won't capture a cell. For now, I added a note in the Rmd file: "[EXPAND to provide reasons why]"

  • Methods, 2nd paragraph: “Mean parameters μi quantify the overall expression for each gene i across the cell population under study.”
    Comment: Please define cell population as sample (all cells within a scRNA-seq assay) or post-processing UMAP, t-SNE, or similar clustering grouping. I understood the latter, but readers should be certain, and we need to understand which step these parameters refer to.

BASiCSWorkflow/Workflow.Rmd

Lines 281 to 287 in 23f82e8

across the cell population under study. This could correspond to a group that
is set *a priori* by the experimental design (e.g. naive or stimulated
CD4^+^ T cells in [@Martinez-jimenez2017]) or to groups of cells that were
computationally identified through clustering (note that `r Biocpkg("BASiCS")`
does not account for issues associated to post-selection inference
[@Lahnemann2020] in this case, but this is an inherent limitation of most
differential expression tools).

  • After Figure 5: “These thresholds can vary across datasets and should be informed by gene-specific QC metrics such as those shown in Figure 5 as well as prior knowledge about the cell types and conditions being studied, where available”.
    Comment: I understand this is not the intent of this study, but if possible include some rationale behind thresholds for QCing gene exclusion based on types and conditions. Maybe bringing the choice for the example explored here.

Cata: I think the statement we had about prior knowledge is a bit vague. I was also looking at OSCA and it seems that this step has been removed from their examples (the ones I saw were doing feature selection by identifying HVGs directly). Maybe we need to rephrase a bit?

  • After spike-in calculation step: “To update the sce_naive and sce_active objects, the user must create a data.frame whose first column contains the spike-in labels (e.g. ERCC-00130) and whose second column contains the number of molecules calculated above. We add this as row metadata for altExp (sce_naive) and altExp (sce_active).”
    Comment: By ‘update’, do the authors mean ‘update after normalisation with spike-in’? If so, please specify.

To add this information to the existing `sce_naive` and `sce_active` objects,

  • MCMC diagnostics: I would kindly ask if they can add the reason why this is needed: “to ensure that comparisons of gene expression are not random” or something in that direction. Readers will appreciate it.

BASiCSWorkflow/Workflow.Rmd

Lines 498 to 500 in 8dec9e9

the MCMC reached its stationary distribution. Lack of convergence can lead to
spurious results that do not accurately reflect the underlying biology of the
population of cells understudy.
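Two of the minor comments above concern the `website` string (an initial space, and newlines that produce a malformed URL). One defensive option, sketched here with a placeholder address (`example.org`, `data/` and `counts.txt` are hypothetical and not the workflow's real download location), is to strip all whitespace before building the URL:

```r
# Hypothetical URL pieces; the real address used in the workflow is not shown.
# The string deliberately contains a leading space and an embedded newline.
website <- " https://example.org/
data/"
file <- "counts.txt"

# Remove every whitespace character (leading space, newlines from wrapping).
website_clean <- gsub("[[:space:]]+", "", website)

# paste0() joins the pieces without inserting a separator.
url <- paste0(website_clean, file)
url
```

This assumes the address itself contains no legitimate spaces; the simpler fix is to keep the string on a single line in the Rmd source.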

Summarising main changes / discussion points

Creating this issue to summarise the main changes I implemented in the workflow and other discussion points:

  • Introduction, Sources of variability in scRNA-seq data, Methods and Reproducibility sections already edited and ready for review.

  • The current format removes the GO subsection from Methods. Instead, my proposal is to focus on the main libraries (i.e. SingleCellExperiment, scater, scran and BASiCS).

  • Proposed changes for CD4T analysis:
    * Use a more stringent filter to only include those genes detected in at least 5 cells (current filter seems to be too liberal). This will require us to update the chains in gitlab (I am re-running them), but will keep the structure/results largely unaffected.
    * In the QC/exploratory section, use the scran deconvolution size factors rather than the spike-in ones. This doesn't alter the results, but is more consistent with the normalised data that will be obtained from BASiCS (which uses two sets of scaling factors).
    * Potentially do cell QC before gene filtering. Not very critical, but just to be more consistent with OSCA structure (QC before feature selection) and the scater vignette.
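For reference, the proposed "detected in at least 5 cells" filter could look roughly like the sketch below. It uses a toy base R matrix (the `counts` object and gene/cell names are made up for illustration) and treats any non-zero count as a detection, which is an assumption about how detection is defined in the workflow.

```r
# Toy count matrix (hypothetical): 4 genes x 8 cells.
counts <- matrix(
  c(0, 0, 0, 0, 0, 0, 0, 3,   # detected in 1 cell
    1, 2, 0, 0, 4, 1, 1, 0,   # detected in 5 cells
    5, 5, 5, 5, 5, 5, 5, 5,   # detected in 8 cells
    0, 1, 0, 1, 0, 1, 0, 1),  # detected in 4 cells
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:8))
)

# Number of cells in which each gene has a non-zero count.
n_detected <- rowSums(counts > 0)

# Keep genes detected in at least 5 cells; drop = FALSE preserves the
# matrix structure even if only one gene passes the filter.
keep <- n_detected >= 5
counts_filtered <- counts[keep, , drop = FALSE]
```

With a `SingleCellExperiment`, the same logical vector could presumably be computed from `counts(sce)` and used to subset the rows of the object.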

Grant information

Please state who funded the work discussed in this article, whether it is your
employer, a grant funder etc. Please do not list funding that you have that is
not relevant to this specific piece of research. For each funder, please state
the funder’s name, the grant number where applicable, and the individual to
whom the grant was assigned. If your work was not funded by any grants,
please include the line: 'The author(s) declared that no grants were involved
in supporting this work.'
