Full pdf does not compile

Maybe this is due to the Docker configuration? It runs well for the chunks that have a cache available but gets stuck after some point. For now, I added eval = FALSE in order to generate a full PDF.

Move some diagnostics to supplementary?

Having 32 trace plots in the main paper seems excessive. It could be better to include a handful, with a recommendation to plot a few, and put the rest in the supplementary material.

proc.time() in MCMC-run chunk?

Output is not used in the workflow. Remove?

Variability section needs to be linked to others

It is introduced but doesn't link well with the following section.

Final (?) to-do list

Adding this here as not everything has been documented in individual issues.

Note: 🔴 marks things that require changes in the BASiCS library.

Docker warning

docker run -p 8787:8787 -v $(pwd):/home/rstudio/mycode -e PASSWORD=bioc alanocallaghan/bocker
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
[s6-init] making user provided files available at /var/run/s6/etc...if: fatal: child crashed with signal 11
exited 0.

Build issue on Mac

Hi @alanocallaghan

thanks so much for setting this up - really great stuff!

I get the following issue when running it on my mac using sudo make. I also get the same issue when running it in Rstudio within the container.

nils@DQBM-NBM-NIEL BASiCSWorkflow % make     
docker run -v /Users/nils/Github/BASiCSWorkflow:/home/rstudio/mycode \
		-w /home/rstudio/mycode \
		alanocallaghan/bocker:0.1.0 \
		/bin/bash \
		-c 'Rscript -e "rmarkdown::render(\"Workflow.Rmd\")"'
Unable to find image 'alanocallaghan/bocker:0.1.0' locally
0.1.0: Pulling from alanocallaghan/bocker
Digest: sha256:b99e0218ff2f52d01182f6dd1e51aa841cc3fe550e7b18850b7f626d5e545e06
Status: Downloaded newer image for alanocallaghan/bocker:0.1.0
Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.12' is out-of-date; the current release version '3.13'
  is available with R version '4.1'; see https://bioconductor.org/install


processing file: Workflow.Rmd
  |.                                                                     |   1%
  ordinary text without R code

  |..                                                                    |   2%
label: setup_knitr (with options) 
List of 2
 $ include: logi FALSE
 $ cache  : logi FALSE

  |..                                                                    |   3%
   inline R code fragments

  |...                                                                   |   4%
label: overview (with options) 
List of 4
 $ out.width : symbol out_width
 $ out.height: symbol out_height
 $ fig.cap   : chr "Graphical overview for the scRNA-seq analysis workflow described in this manuscript. Starting from a matrix of "| __truncated__
 $ echo      : logi FALSE

  |....                                                                  |   5%
  ordinary text without R code

  |.....                                                                 |   7%
label: unnamed-chunk-1
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'

The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: 'Biobase'

The following object is masked from 'package:MatrixGenerics':

    rowMedians

The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians

  |.....                                                                 |   8%
   inline R code fragments

  |......                                                                |   9%
label: unnamed-chunk-2
  |.......                                                               |  10%
   inline R code fragments

  |........                                                              |  11%
label: unnamed-chunk-3
    Welcome to 'BASiCS'. If you used 'BASiCS' before its release in
    Bioconductor, please visit:
    https://github.com/catavallejos/BASiCS/wiki.
  |........                                                              |  12%
   inline R code fragments

  |.........                                                             |  13%
label: naive-data
trying URL 'https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-4888/E-MTAB-4888.processed.1.zip'
Content type 'application/zip' length 7200152 bytes (6.9 MB)
==================================================
downloaded 6.9 MB

  |..........                                                            |  14%
   inline R code fragments

  |...........                                                           |  15%
label: selecting-serum-cells
trying URL 'https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-4888/E-MTAB-4888.additional.1.zip'
Content type 'application/zip' length 33455 bytes (32 KB)
==================================================
downloaded 32 KB

  |............                                                          |  16%
  ordinary text without R code

  |............                                                          |  18%
label: CD4-SCE-object
  |.............                                                         |  19%
   inline R code fragments

  |..............                                                        |  20%
label: naive-activated-CD4-SCE-object
  |...............                                                       |  21%
   inline R code fragments

  |...............                                                       |  22%
label: obtain-gene-symbols
  |................                                                      |  23%
  ordinary text without R code

  |.................                                                     |  24%
label: unnamed-chunk-4
  |..................                                                    |  25%
   inline R code fragments

  |..................                                                    |  26%
label: unnamed-chunk-5
  |...................                                                   |  27%
  ordinary text without R code

  |....................                                                  |  29%
label: PerCellQC (with options) 
List of 1
 $ fig.cap: chr "Cell-level QC metrics. The total number of endogenous read-counts (excludes non-mapped and intronic reads) is p"| __truncated__

  |.....................                                                 |  30%
  ordinary text without R code

  |......................                                                |  31%
label: experimental-condition-batch (with options) 
List of 1
 $ fig.cap: chr "Cell-level QC metrics according to cell-level metadata. The total number of endogenous reads (excludes non-mapp"| __truncated__

  |......................                                                |  32%
   inline R code fragments

  |.......................                                               |  33%
label: pca-visualisation-stimulus-batch (with options) 
List of 1
 $ fig.cap: chr "First two principal components of log-transformed expression counts after scran normalisation. Colour indicates"| __truncated__

  |........................                                              |  34%
   inline R code fragments

  |.........................                                             |  35%
label: gene-selection (with options) 
List of 1
 $ fig.cap: chr "Average read-count for each gene is plotted against the number of cells in which that gene was detected. Dashed"| __truncated__

  |.........................                                             |  36%
  ordinary text without R code

  |..........................                                            |  37%
label: spike-ins-present
  |...........................                                           |  38%
   inline R code fragments

  |............................                                          |  40%
label: SCE-separation
  |............................                                          |  41%
   inline R code fragments

  |.............................                                         |  42%
label: BatchInfo
  |..............................                                        |  43%
   inline R code fragments

  |...............................                                       |  44%
label: spike-in_download
trying URL 'https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095046.txt'
Content type 'text/plain' length 4086 bytes
==================================================
downloaded 4086 bytes

  |................................                                      |  45%
  ordinary text without R code

  |................................                                      |  46%
label: ercc-mul
  |.................................                                     |  47%
  ordinary text without R code

  |..................................                                    |  48%
label: spike-info
  |...................................                                   |  49%
   inline R code fragments

  |...................................                                   |  51%
label: MCMC-run (with options) 
List of 1
 $ eval: logi FALSE

  |....................................                                  |  52%
  ordinary text without R code

  |.....................................                                 |  53%
label: download-chain-naive
trying URL 'https://zenodo.org/record/5243265/files/chain_naive.Rds'
Content type 'application/octet-stream' length 119109227 bytes (113.6 MB)
==================================================
downloaded 113.6 MB

trying URL 'https://zenodo.org/record/5243265/files/chain_active.Rds'
Content type 'application/octet-stream' length 117080996 bytes (111.7 MB)
==================================================
downloaded 111.7 MB

  |......................................                                |  54%
  ordinary text without R code

  |......................................                                |  55%
label: displayChainBASiCS
  |.......................................                               |  56%
   inline R code fragments

  |........................................                              |  57%
label: convergence-naive (with options) 
List of 1
 $ fig.cap: chr "Trace plot, marginal histogram, and autocorrelation function of the posterior distribution of the mean expressi"| __truncated__

  |.........................................                             |  58%
   inline R code fragments

  |..........................................                            |  59%
label: mcmc-diag (with options) 
List of 1
 $ fig.cap: chr "MCMC diagnostics for gene-specific mean expression parameters; naive CD4+ T cells. A: Geweke Z-score for mean e"| __truncated__

Loading required package: viridisLite
  |..........................................                            |  60%
   inline R code fragments

  |...........................................                           |  62%
label: dispersion-vs-mean (with options) 
List of 3
 $ fig.width : num 6
 $ fig.height: num 6
 $ fig.cap   : chr "Comparison of gene-specific transcriptional variability estimates and mean expression estimates obtained for ea"| __truncated__

  |............................................                          |  63%
   inline R code fragments

  |.............................................                         |  64%
label: plot-vg-naive (with options) 
List of 1
 $ fig.cap: chr "HVG and LVG detection using BASiCS. For each gene, BASiCS posterior estimates (posterior medians) associated to"| __truncated__

For LVG detection task:
the posterior probability threshold chosen via EFDR calibrationis too low. Probability threshold automatically set equal to'ProbThreshold'.
  |.............................................                         |  65%
   inline R code fragments

  |..............................................                        |  66%
label: plot-example-vg-naive (with options) 
List of 1
 $ fig.cap: chr "BASiCS denoised counts for example HVG and LVG with similar overall levels of expression."

  |...............................................                       |  67%
   inline R code fragments

  |................................................                      |  68%
label: mean-expression-testing
-------------------------------------------------------------
Log-fold change thresholds are now set in a log2 scale. 
Original BASiCS release used a natural logarithm scale.
-------------------------------------------------------------
Offset estimate: 0.5987
(ratio Naive vs Active).
To visualise its effect, please use 'PlotOffset = TRUE'.
-------------------------------------------------------------

For Differential mean task:
the posterior probability threshold chosen via EFDR calibrationis too low. Probability threshold automatically set equal to'ProbThresholdM'.
  |................................................                      |  69%
   inline R code fragments

  |.................................................                     |  70%
label: visualise-DE-mean-plot (with options) 
List of 3
 $ fig.height: num 8
 $ fig.width : num 6
 $ fig.cap   : chr "Upper panel presents the MA plot associated to the differential mean expression test between naive and active c"| __truncated__

  |..................................................                    |  71%
  ordinary text without R code

  |...................................................                   |  73%
label: offset-denoised
  |....................................................                  |  74%
   inline R code fragments

  |....................................................                  |  75%
label: heatmap-diffexp (with options) 
List of 3
 $ fig.width : num 7
 $ fig.height: num 6
 $ fig.cap   : chr "Heatmap displays normalised expression values (log10(x + 1) scale) for naive and active cells. Genes are strati"| __truncated__

Loading required package: grid
========================================
ComplexHeatmap version 2.6.2
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
  genomic data. Bioinformatics 2016.

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================

========================================
circlize version 0.4.11
CRAN page: https://cran.r-project.org/package=circlize
Github page: https://github.com/jokergoo/circlize
Documentation: https://jokergoo.github.io/circlize_book/book/

If you use it in published research, please cite:
Gu, Z. circlize implements and enhances circular visualization
  in R. Bioinformatics 2014.

This message can be suppressed by:
  suppressPackageStartupMessages(library(circlize))
========================================

  |.....................................................                 |  76%
   inline R code fragments

  |......................................................                |  77%
label: diff-res-plot (with options) 
List of 3
 $ fig.height: num 8
 $ fig.width : num 6
 $ fig.cap   : chr "Upper panel presents the MA plot associated to the differential residual over-dispersion test between naive and"| __truncated__

  |.......................................................               |  78%
  ordinary text without R code

  |.......................................................               |  79%
label: combine-results
Quitting from lines 1496-1505 (Workflow.Rmd) 
Error in ggplot(table_combined_de) : object 'table_combined_de' not found
Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> ggplot
In addition: Warning messages:
1: Removed 18 rows containing missing values (geom_point). 
2: Removed 127 rows containing missing values (geom_point). 
3: Removed 127 rows containing missing values (geom_point). 

Execution halted
make: *** [Workflow.pdf] Error 1

Here is also my sessionInfo (of course only the systems part applies here):

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       fansi_0.5.0      crayon_1.4.1     utf8_1.2.2       digest_0.6.28    later_1.3.0      R6_2.5.1         lifecycle_1.0.1 
 [9] magrittr_2.0.1   evaluate_0.14    pillar_1.6.3     rlang_0.4.11     fs_1.5.0         promises_1.2.0.1 vctrs_0.3.8      ellipsis_0.3.2  
[17] rmarkdown_2.11   tools_4.1.0      compiler_4.1.0   httpuv_1.6.3     xfun_0.26        fastmap_1.1.0    pkgconfig_2.0.3  htmltools_0.5.2 
[25] knitr_1.36       tibble_3.1.5

Publish repo

Need to make repo public when submitting, and later archive in Zenodo

edit overview figure

Tasks to be placed under a single group:

Normalization
Variance decomposition
HVG/LVG detection

Changes after initial draft

Removed refs to simpleSingleCell as there is a redirection notice to OSCA
Further simplified the QC section to avoid duplication with the methods section (e.g. to explain that metrics are used to detect doublets, etc).
Gene-level QC suggests that thresholds require detailed exploration of gene-level metrics. However, we don't have any. Should we change this?

Code-related

Update R version
Use droplet data instead, with more heterogeneity?
logit transform volcano plot?
Demonstrate reasons behind prior choice/prior sensitivity checks (PPC)

Alan: I think prior predictive checks of the regression model don't work because of the inverse gamma on scale, so I'll probably just put some text in about the prior justification and refer to the old papers.
Cata: I agree that PPC are likely to fail. Supp Text 3 of the original BASiCS paper shows some sensitivity with respect to the hyper-parameters used for $\theta$ and $\delta_i$. Of course, those results are rather outdated as the prior for $\delta_i$ now follows the approach in Eling et al.

Multiple chains? Rhat diagnostic.

Alan: I am working on this at the moment between a new package and BASiCS itself
BASiCStan is now on Bioc. Will implement single-chain Rhat for BASiCS and say maybe we'll do multiple chains in future but not now.
Cata: Thanks for uploading BASiCStan to Bioc! For this paper, I think we can argue that the sampler is rather time-consuming and therefore it's unlikely that users will routinely run multiple chains. We can also mention that we have generally observed that the sampler behaves well in a large number of contexts. However, the ability to run multiple chains will improve once we publish the scalability paper (which we should try to submit soon!)
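One way to get an Rhat-style diagnostic from a single chain is to split it into two halves and treat them as separate chains. The following is a minimal base-R sketch of that idea under stated assumptions, not the BASiCS implementation: `split_rhat` is a hypothetical helper and the draws are simulated.

```r
# Split-chain Rhat for one parameter: treat the two halves of a single
# chain as if they were two independent chains.
# `split_rhat` is a hypothetical helper, not part of BASiCS.
split_rhat <- function(draws) {
  n <- floor(length(draws) / 2)          # draws per half
  halves <- cbind(draws[seq_len(n)], draws[n + seq_len(n)])
  W <- mean(apply(halves, 2, var))       # within-half variance
  B <- n * var(colMeans(halves))         # between-half variance
  var_plus <- (n - 1) / n * W + B / n    # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(1)
stationary <- rnorm(2000)                                 # well-mixed chain
drifting <- rnorm(2000) + seq(0, 3, length.out = 2000)    # trending chain
rhat_stationary <- split_rhat(stationary)  # close to 1
rhat_drifting <- split_rhat(drifting)      # clearly above 1
```

A value near 1 only says the two halves agree with each other; it cannot detect a chain that is stuck in the same wrong region throughout, which is the usual argument for running multiple chains when that is affordable.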

How to manipulate/visualise credible intervals?

Alan: BASiCS_Summary I guess
Cata: Yes, that would give you credible intervals for individual parameters. Should we edit the BASiCS_Test function to also return HPD intervals for LFCs? That should be easy to implement.
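As a rough illustration of what an interval for a log-fold change would involve, the base-R sketch below computes an equal-tailed and an HPD interval from per-iteration draws. It makes several assumptions: the draws are simulated stand-ins for posterior samples of $\mu_i$ in each group, `hpd_interval` is a hypothetical helper, and no offset correction is applied.

```r
# Equal-tailed and HPD intervals from a vector of posterior draws.
# `hpd_interval` is a hypothetical helper; the draws are simulated here.
hpd_interval <- function(draws, prob = 0.95) {
  sorted <- sort(draws)
  n <- length(sorted)
  k <- ceiling(prob * n)                 # number of draws inside the interval
  starts <- seq_len(n - k + 1)
  widths <- sorted[starts + k - 1] - sorted[starts]
  best <- which.min(widths)              # shortest window containing k draws
  c(lower = sorted[best], upper = sorted[best + k - 1])
}

set.seed(1)
# Simulated stand-ins for per-iteration draws of mu_i in two groups
mu_naive <- rlnorm(1000, meanlog = 2.0, sdlog = 0.1)
mu_active <- rlnorm(1000, meanlog = 2.5, sdlog = 0.1)
lfc <- log2(mu_active / mu_naive)        # one log2 fold change per draw

equal_tailed <- quantile(lfc, c(0.025, 0.975))  # equal-tailed 95% interval
hpd <- hpd_interval(lfc)                        # 95% HPD interval
```

The key point is that the fold change is computed per iteration before summarising, so the interval reflects the joint posterior uncertainty of both groups.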

Simplify and reduce code chunks (particularly plotting); plotting is verbose (less customisation maybe)

Alan: This is tough, might just ditch some figures and make some a bit more "ugly"
eg for the sake of a workflow having ugly axis labels is probably okay? need to check other f1000 papers

Text-related

What type of experiments/experimental designs?

BASiCSWorkflow/Workflow.Rmd

Lines 147 to 155 in a40c412

 of noise in scRNA-seq datasets [@Vallejos2015; @Vallejos2016; @Eling2017]. The 
 model was motivated by supervised experimental designs in which experimental 
 conditions corresponds to groups of cells defined *a priori* (e.g. selected 
 cell types obtained through FACS sorting). However, the approach can also be 
 used in cases were the groups of cells of interest are computationally 
 identified through clustering. In such case, the model implemented in 
 `r Biocpkg("BASiCS")` does not account for issues associated to post-selection 
 inference, where the same data is analysed twice: first to perform clustering 
 and then to compare expression profiles between clusters [@Lahnemann2020].

Better model summary (with schematics)
- Explain joint prior
- explain variance decomp better
How does it deal with biological replicates?

Cata: it does not. That's scive. :-)

Computationally challenging - so why not just MLE or MAP?

Cata: We can argue against a simple MLE based on, e.g., an NB model because our main focus is on estimates of variability. Inference about means is simpler and, indeed, an MLE approach is likely sufficient (at some point I remember comparing point estimates with DESeq2, etc). This is not the case for over-dispersion parameters. Here is where the use of a Bayesian approach becomes more useful. In particular, we have seen that the joint mean/dispersion prior (aka the regression approach) improves inference of transcriptional variability for lowly expressed genes and sparse datasets. In terms of MAP, I think the argument to give is that we want to retain posterior uncertainty to enable differential testing. We can add a couple of sentences about this when describing the method.

Volcano plot probs seem to bunch around 0.5, why? Are they calibrated?
How does it work with spatial data?

Cata: We have not tested that.

Address double dipping

Cata: Whilst we highlight that this is an important issue, we have not addressed it yet.

What happens with violation of model assumptions?
Dealing with a BASiCS run per population (muscat approach...? not sure what that refers to)
Divide and conquer MCMC/other computational approaches for speed - downsampling, filtering

Manuscript-related

Better navigation (eg linking to sections more)
make available as bioc/github workflow with better navigation
Response: I would say "We'll do this with a later/final version". Actually think this is the best way to maintain it to ensure it still runs in later versions of Bioc etc...

Caption and size figures consistently

Group MCMC and convergence for naive and active cells

Re-run chains using current filter

Document does not compile because of incompatible dimensions between data and chains
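A quick way to catch this kind of mismatch before rendering is to compare gene names between the filtered data and the stored draws. The sketch below uses simulated stand-ins with hypothetical names, and assumes that the draws are held as an iterations-by-genes matrix with gene names as column names.

```r
# Compare the genes in a filtered count matrix with those in stored draws.
# Both objects are simulated stand-ins with hypothetical names; in the
# workflow they would come from the filtered data and a saved chain.
counts <- matrix(rpois(40, 5), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), NULL))
mu_draws <- matrix(rlnorm(50), nrow = 10,
                   dimnames = list(NULL, paste0("gene", 1:5)))

missing_in_chain <- setdiff(rownames(counts), colnames(mu_draws))
extra_in_chain <- setdiff(colnames(mu_draws), rownames(counts))
chain_matches_data <- length(missing_in_chain) == 0 &&
  length(extra_in_chain) == 0

chain_matches_data   # FALSE
extra_in_chain       # "gene5"
```

Here the chain contains a gene that the current filter removed, which is the situation in which the chains would need to be re-run.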

Gene filtering needs to be applied to spike-ins

This may require running the chains again 😢

Final (minor) changes to Figure 1

submission checklist

article type: "software tool article"?
Publish repo, formerly #38
author contributions (entered as below)
- aoc: Conceptualization, Software, Visualization, Writing – Original Draft Preparation
- ne: Conceptualization, Methodology, Software, Writing – Original Draft Preparation, Writing – Review & Editing
- jcm: Project Administration, Supervision, Writing – Review & Editing
- cav (corresponding): Conceptualization, Project Administration, Software, Supervision, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing
enter funding bodies/grant nos
payment information (eventually, presumably...)
cover letter
gateway: bioc
final run on Cata's laptop to ensure reproducibility -> Nils to run
propose reviewers

Code for plots on same page as plot

Rename .Rds files

use chain_active.Rds and chain_naive.Rds to be compatible with the storage options in BASiCS_MCMC
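A minimal base-R sketch of the renaming step is shown below. The old file names and the temporary directory are hypothetical placeholders, and the dummy files exist only so that the example runs on its own.

```r
# Rename stored chains to the chain_<RunName>.Rds pattern.
# The old file names below are hypothetical placeholders.
store_dir <- file.path(tempdir(), "rds")
dir.create(store_dir, showWarnings = FALSE)
old_names <- c(naive = "naive_old.Rds", active = "active_old.Rds")
for (f in old_names) saveRDS(list(), file.path(store_dir, f))  # dummy files

new_names <- paste0("chain_", names(old_names), ".Rds")
renamed <- file.rename(file.path(store_dir, old_names),
                       file.path(store_dir, new_names))
all(renamed)
# With this naming, a call such as
#   BASiCS_LoadChain(RunName = "naive", StoreDir = store_dir)
# is expected to find chain_naive.Rds, assuming the BASiCS storage
# convention of chain_<RunName>.Rds.
```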

Pre-submission edits

Need to get the changes we did after submitting the original tex file and apply them to the Rmd.

Fix caption and size for Fig 7.

Changes for current draft (compiled on June 5)

The F1000 template does not distinguish between sub-sections (##) and sub-sub-sections (###). As such, I removed the distinction between those.
Further simplification of the MCMC diagnostics section.
- Re-organise text to focus on convergence first, then ESS.
- Focus on mean parameters only. Diagnostics for delta could be added as supplementary material.
- Histogram of Z-scores was removed (redundant info with respect to other plots)
- Plot of Z-scores vs percentage of zeroes was removed to be consistent with ESS plots
- Removes comment about high absolute Z-score for lowly expressed genes. Trend is not so clear after applying the more stringent gene filter.
- Same for the connection between ESS and Z-scores.
Normalisation section
- Removes BASiCS_DenoisedRates as it's not used anywhere else (and it's less directly comparable to scran normalisation)
- Removes intro text that was left-over from old structure
- Removes BASiCS_DenoisedCounts as it's best to apply the function after offset correction.
- Adds Pearson's correlation to scatterplots
- I will temporarily comment out this section: due to the offset effect, it's hard to interpret the scale of the scaling factors when comparing to scran, and I believe that the explanation will distract from the key messages of the paper. Potentially move to supplementary material.

Create a cartoon/diagram to illustrate the BASiCS model(s)

edit the model description accordingly when ready
highlight the importance of our prior specification (Eling et al results)

F1000Research to-do list

Add Zenodo link / DOI to .bib file
Cite Zenodo code within the article: line 221
Edit Operation section to include system requirements and link to Docker. See example here
Re-structure Methods to have Implementation and Operation subsections -> Done / @alanocallaghan to review
@alanocallaghan to review Software availability section
Is the Docker image public? I am also asked to log in in order to access it
Provide payment information -> only @alanocallaghan can do this

I've used these mostly with smaller data (~5000 genes) where they're informative. With 9k genes it seems you need to log the y axis for them to be informative. Since we plot ESS in different forms already they're somewhat redundant anyway

To-do

Draft rebuttal letter available here

Prioritised TO-DO list

Finish edits in data prep script - @catavallejos
Update to R 4.3/Bioc release @alanocallaghan
Re-run chains - @alanocallaghan -> all minus 1!
Make sure that the Overview.svg fig works - @alanocallaghan
Edit $\theta$ panel in schematic BASiCS fig - @alanocallaghan just details to be fixed
Create related paragraph in Methods (see comment in lines 309-312) and edit the remainder of the BASiCS model description in that Section. Add a table highlighting differences across the different variations - @catavallejos WIP, but needs to be shortened
Clarify priors, independent/joint + EB - @catavallejos WIP
~~Edit model specification in supplementary section - @catavallejos~~ supps are not allowed!!
Run MCMC chains with/without EB - @alanocallaghan

Reviewer 1

Schematic figure (see discussion in #45)
Create related paragraph in Methods (see comment in lines 309-312) and edit the remainder of the BASiCS model description in that Section. Perhaps add a table summarising the different model variations? Here is a WIP draft - @catavallejos

| Paper | Normalisation (cell-specific parameters) |
| --- | --- |
| Vallejos et al (2015) | $s_j$: sequencing depth and other technical factors; $\phi_j$: mRNA content |

Check enough comments are provided throughout the text
See what we can do about ggplot2
* catavallejos/BASiCS#270
Explain EB and other elements of the prior. Would it be better to have a supp section with more details?

Cata is working on this

Create supp section for visualising HPDs etc (mention in text)
Check volcano plots concentration around 0.5, calibration?
Add [@MARX2021] and [@Aijo2019] to bibtex
check all libraries are loaded at the start
Finish incomplete sentence in rebuttal about PCA axis
Check PR about spike-in calculation: catavallejos/BASiCS#270
Check if R 4.2 is still the obvious choice. Likely outdated by now!

Reviewer 2

#50
See what we can do about ggplot2
Better navigation?
GH/Bioc workflow

Alan: I would say "We'll do this with a later/final version". Actually think this is the best way to maintain it to ensure it still runs in later versions of Bioc etc...

Draft response about muscat question.
Check Fig 15 and interpretation

Reviewer 3

Draft response for 5 (already addressed in text)
Address 8 (tempted not to run additional analysis here)

Others

Make sure that running times correspond to the new data
Fix data download links
Fix references to Supplementary Material. Hyperlink? Marked with an "ADD-REF" tag in text.

Final reproducibility steps

MCMC chains on Zenodo
Link Zenodo not gitlab in text
GPL2 in this repo? See https://github.com/VallejosGroup/BASiCSWorkflow/blob/master/LICENSE

I think that's it, actually!

Minor comments from all reviews

These are mostly text edits that should be quick and easy to address.

Cata: I think our response needs to differentiate between spike-ins and UMIs. We could write something along the lines of what Haque et al (2017) have: In general, spike-in RNAs are not compatible with droplet-based approaches, whereas UMIs are typically used in protocols where only the 3’-ends of transcripts are sequenced, such as CEL-seq2, Drop-seq and MARS-seq [10, 45, 60]. Note, however, that technology is evolving (see this paper). Perhaps we can change the sentence to say "... these are not routinely available ...". I haven't been able to find a reference yet, but I believe that spike-ins in 10X could be expensive, as you would technically need to add spike-ins to all droplets, even though some droplets won't capture a cell. For now, I added a note in the Rmd file: "[EXPAND to provide reasons why]"

Methods, 2nd paragraph: “Mean parameters μi quantify the overall expression for each gene i across the cell population under study.”
Comment: Please define cell population as sample (all cells within a scRNA-seq assay) or post-processing UMAP, t-SNE, or similar clustering grouping. I understood the latter, but readers should be certain, and we need to understand which step these parameters refer to.

BASiCSWorkflow/Workflow.Rmd

Lines 281 to 287 in 23f82e8

 across the cell population under study. This could correspond to a group that 
 is set *a priori* by the experimental design (e.g. naive or stimulated 
 CD4^+^ T cells in [@Martinez-jimenez2017]) or to groups of cells that were 
 computationally identified through clustering (note that `r Biocpkg("BASiCS")` 
 does not account for issues associated to post-selection inference 
 [@Lahnemann2020] in this case, but this is an inherent limitation of most 
 differential expression tools).

After Figure 5: “These thresholds can vary across datasets and should be informed by gene-specific QC metrics such as those shown in Figure 5 as well as prior knowledge about the cell types and conditions being studied, where available”.
Comment: I understand this is not the intent of this study, but if possible include some rationale behind thresholds for QCing gene exclusion based on types and conditions. Maybe bringing the choice for the example explored here.

Cata: I think the statement we had about prior knowledge is a bit vague. I was also looking at OSCA and it seems that this step has been removed from their examples (the ones I saw were doing feature selection by identifying HVGs directly). Maybe we need to rephrase a bit?

After spike-in calculation step: “To update the sce_naive and sce_active objects, the user must create a data.frame whose first column contains the spike-in labels (e.g. ERCC-00130) and whose second column contains the number of molecules calculated above. We add this as row metadata for altExp (sce_naive) and altExp (sce_active).”
Comment: By ‘update’, do the authors mean ‘update after normalisation with spike-in’? If so, please specify.

BASiCSWorkflow/Supplements.Rmd

Line 534 in 64f1ce8

To add this information to the existing `sce_naive` and `sce_active` objects,

MCMC diagnostics: I would kindly ask if they can add the reason why this is needed: “to ensure that comparisons of gene expression are not random” or something in that direction. Readers will appreciate it.

BASiCSWorkflow/Workflow.Rmd

Lines 498 to 500 in 8dec9e9

 the MCMC reached its stationary distribution. Lack of convergence can lead to 
 spurious results that do not accurately reflect the underlying biology of the 
 population of cells understudy.

Missing references

@Saelens2019

Revise code and add further comments throughout

Use readRDS or BASiCS_LoadChain?

BASiCS_LoadChain was useful when we changed the format of the BASiCS_MCMC object, not sure it's required anymore
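
For reference when deciding, here is a minimal sketch of the plain `readRDS` route. It uses a stand-in list rather than a real `BASiCS_Chain` object, and the run name and the `chain_<RunName>.Rds` file naming are assumptions made for illustration; the `BASiCS_LoadChain` call is shown only as a comment and is not run.

```r
# Stand-in for a chain object; a real one would be returned by BASiCS_MCMC().
chain_stub <- list(
  mu    = matrix(c(1.5, 2.0, 0.7, 3.1), nrow = 2),
  delta = matrix(c(0.4, 0.9, 1.2, 0.3), nrow = 2)
)

store_dir  <- tempdir()
run_name   <- "naive"  # hypothetical run name
chain_file <- file.path(store_dir, paste0("chain_", run_name, ".Rds"))

# Save and reload with base R only.
saveRDS(chain_stub, chain_file)
chain_loaded <- readRDS(chain_file)

# The reloaded object matches what was saved.
round_trip_ok <- identical(chain_stub, chain_loaded)

# With BASiCS installed and a real chain on disk, the alternative would be:
# chain_loaded <- BASiCS::BASiCS_LoadChain(run_name, StoreDir = store_dir)
```

If the stored chains are already in the current format, `readRDS` avoids an extra dependency on the loader's file-naming convention.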

Fix gene-level QC

Review intro+methods

Summarising main changes / discussion points

Creating this issue to summarise the main changes I implemented in the workflow and other discussion points:

Introduction, Sources of variability in scRNA-seq data, Methods and Reproducibility sections already edited and ready for review.
The current format removes the GO subsection from Methods. My proposal is to focus instead on the main libraries (i.e. SingleCellExperiment, scater, scran and BASiCS).
Proposed changes for CD4T analysis:
* Use a more stringent filter to only include those genes detected in at least 5 cells (current filter seems to be too liberal). This will require us to update the chains in gitlab (I am re-running them), but will keep the structure/results largely unaffected.
* In the QC/exploratory section, use the scran deconvolution size factors rather than the spike-in ones. This doesn't alter the results, but is more consistent with the normalised data that will be obtained from BASiCS (which uses two sets of scaling factors).
* Potentially do cell QC before gene filtering. Not very critical, but just to be more consistent with OSCA structure (QC before feature selection) and the scater vignette.
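
The proposed gene filter could look roughly like the sketch below. It runs on a small made-up count matrix (genes in rows, cells in columns); the gene names and counts are hypothetical and are not taken from the CD4T data.

```r
# Toy count matrix: 4 genes x 10 cells (hypothetical values).
counts <- rbind(
  geneA = c(3, 0, 1, 2, 0, 4, 1, 0, 2, 1),  # detected in 7 cells
  geneB = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0),  # detected in 1 cell
  geneC = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),  # detected in 5 cells
  geneD = c(0, 2, 0, 0, 3, 0, 0, 1, 0, 0)   # detected in 3 cells
)

min_cells <- 5  # keep genes detected in at least this many cells

# A gene counts as "detected" in a cell when its count is non-zero.
n_detected <- rowSums(counts > 0)
keep <- n_detected >= min_cells

counts_filtered <- counts[keep, , drop = FALSE]
```

With a `SingleCellExperiment`, the same logical vector could be computed from `counts(sce)` and applied as `sce[keep, ]`.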

comment about contamination in the text? (could be when interpreting HVGs or diff var)

Add EB prior to main description of BASiCS usage

PSM/SM GO (variability)

Grant information

Please state who funded the work discussed in this article, whether it is your
employer, a grant funder etc. Please do not list funding that you have that is
not relevant to this specific piece of research. For each funder, please state
the funder’s name, the grant number where applicable, and the individual to
whom the grant was assigned. If your work was not funded by any grants,
please include the line: 'The author(s) declared that no grants were involved
in supporting this work.'

	of noise in scRNA-seq datasets [@Vallejos2015; @Vallejos2016; @Eling2017]. The
	model was motivated by supervised experimental designs in which experimental
	conditions corresponds to groups of cells defined a priori (e.g. selected
	cell types obtained through FACS sorting). However, the approach can also be
	used in cases were the groups of cells of interest are computationally
	identified through clustering. In such case, the model implemented in
	`r Biocpkg("BASiCS")` does not account for issues associated to post-selection
	inference, where the same data is analysed twice: first to perform clustering
	and then to compare expression profiles between clusters [@Lahnemann2020].
