R package: Parent-specific Copy-number Estimation Pipeline using HT-Seq Data

costellopscnseq's Introduction

CostelloPSCNSeq: Parent-Specific Copy-Number Estimation Pipeline using HT-Seq Data

Requirements

Required data

This pipeline requires paired tumor-normal data. This is because the allele-specific copy numbers are inferred from the allelic imbalance in the tumor at heterozygous SNPs. The heterozygous SNPs are identified from the allelic signals in the matched normal.

Required software

This pipeline is implemented in R and requires R packages aroma.seq, sequenza (Favero et al. 2015), and PSCBS (Bengtsson et al. 2010, Olshen et al. 2011). To install these packages and all of their dependencies to the current working directory, call the following from R:

if (!requireNamespace("pak")) install.packages("pak")
pak:::proj_create()
pak::pkg_install("sequenza@2.1.2")
pak::pkg_install("HenrikBengtsson/aroma.seq")
pak::pkg_install("HenrikBengtsson/CostelloPSCNSeq@develop")

In addition to the above R dependencies, the pipeline requires that samtools (Li et al. 2009) is on the PATH.
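A quick pre-flight check can confirm that the required executables are discoverable before a long run is started. The following is a minimal shell sketch; the helper name `require_tool` is hypothetical and not part of the pipeline.

```bash
# Hypothetical helper (not part of CostelloPSCNSeq): succeed only if the
# named executable can be found on the PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "Found $1: $(command -v "$1")"
  else
    echo "ERROR: '$1' was not found on the PATH" >&2
    return 1
  fi
}
```

For example, `require_tool samtools && require_tool Rscript` returns a non-zero status if either tool is missing.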

Important: The pipeline does not work with sequenza (>= 3.0.0; 2019-05-09). This is the reason why we install sequenza 2.1.2. The use of pak:::proj_create() causes all packages to be installed to ./r-packages/, which avoids clashing with the R package library that is typically installed under ~/R/.

Setup (once)

Run Rscript 0.setup.R once. This will set up links to shared annotation data sets and lab data files on the TIPCC compute cluster.
Make sure ./config.yml is correct. It specifies the default analysis settings. The individual entries can be overridden by individual command-line options to the Rscript calls below.
Configure parallel processing following the instructions in Section 'Configure parallel processing' below.

Data processing

The following scripts should be run in order:

Rscript -e CostelloPSCNSeq::pscnseq --args --what=mpileup  --verbose=TRUE  # ~25 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=sequenza --verbose=TRUE  # ~60 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=pscbs    --verbose=TRUE  #  ~5 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=reports  --verbose=TRUE  #  ~2 min

Comment: The timings listed are typical run times for our test tumor-normal sample.
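Because the steps are meant to run in order, it can be convenient to wrap the four calls in a loop that stops at the first failure. The following is a minimal shell sketch, not part of the package; the function name `run_pipeline` is an assumption, and the command to run is passed in as arguments so that the loop does not depend on any particular installation.

```bash
# Hypothetical wrapper: run the four steps in order and stop at the first
# failure. The command to execute is given as arguments; '--what=<step>' and
# '--verbose=TRUE' are appended to it for each step.
run_pipeline() {
  local step
  for step in mpileup sequenza pscbs reports; do
    echo "=== Step: ${step} ===" >&2
    if ! "$@" "--what=${step}" --verbose=TRUE; then
      echo "Step '${step}' failed; stopping" >&2
      return 1
    fi
  done
}
```

With the calls listed above, this would be invoked as `run_pipeline Rscript -e CostelloPSCNSeq::pscnseq --args`.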

You may want to adjust inst/config.yml to process other data sets. Alternatively, you can specify a file other than this default via command-line option --config, e.g. Rscript -e "CostelloPSCNSeq::pscnseq(what='mpileup', verbose=TRUE)" --config=config_set_a.yml.

Data processing via scheduler

To process the above four steps via the Torque/PBS scheduler, use:

$ qsub -d $(pwd) inst/scripts/1-4.submit_all.pbs

This will in turn submit the corresponding PBS scripts 1.mpileup.pbs, 2.sequenza.pbs, 3.pscbs.pbs, and 4.reports.pbs to the scheduler. Those PBS scripts are located in inst/scripts/ and "freeze" software versions to R 3.6.1 and samtools 1.3.1.

Configure parallel processing

The pipeline supports both sequential and parallel processing on a large number of backends and compute resources. By default the pipeline is configured to process the data sequentially on the current machine, but this can easily be changed to run in parallel, say, on a compute cluster. In order not to clutter up the analysis scripts, these settings are preferably done in a separate .future.R (loaded automatically by the future framework) in the project root directory.

To process data via a TORQUE / PBS job scheduler using the future.batchtools package, try the configuration that we use for our UCSF TIPCC cluster:

# Copy the future configuration file to the project directory
$ cp inst/future-configs/batchtools/.future.R .

# Copy the batchtools TORQUE template to the project directory
$ cp inst/future-configs/batchtools/batchtools.torque.tmpl .

# Install the future.batchtools package
$ Rscript -e "install.packages('future.batchtools')"

These should be generic enough to also run on other TORQUE / PBS systems. If not, see the batchtools package for how to configure the template file.

You can verify that it works by trying the following in the project directory:

> library("future")
Using future plan:
plan(list(samples = tweak(batchtools_torque, label = "sample", 
    resources = list(vmem = "4gb")), chromosomes = tweak(batchtools_torque, 
    label = "chr", resources = list(vmem = "8gb"))))

This confirms that as soon as the future package is loaded, it will source the .future.R script, which in turn will set up the parallel settings. It is .future.R that reports on the future plan used.

Next, we can try to submit a job to the scheduler using these settings by:

> x %<-% Sys.info()[["nodename"]]

In this step, future.batchtools will import the batchtools.torque.tmpl file that we copied to the project working directory. If it fails to locate that file, there will be an error. If it succeeds, a batchtools job will be submitted to the job scheduler, which can be seen by calling qstat -u $USER in another shell.

Finally, if we try to look at the value of x;

> x
[1] "n17"
>

it will block until the job is finished and then its value will be printed. Here we see that the job was running on compute node n17.
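Since a missing template file only surfaces as an error once a future is created, it may help to confirm beforehand that both configuration files are in place. The following shell sketch checks for the two files copied in the instructions above; the function name `check_future_config` is hypothetical.

```bash
# Hypothetical pre-flight check: confirm that .future.R and the batchtools
# template are present in the given project directory (default: current
# directory).
check_future_config() {
  local dir="${1:-.}" file status=0
  for file in .future.R batchtools.torque.tmpl; do
    if [ -f "${dir}/${file}" ]; then
      echo "OK: ${dir}/${file}"
    else
      echo "MISSING: ${dir}/${file}" >&2
      status=1
    fi
  done
  return "$status"
}
```

Calling `check_future_config` in the project root returns a non-zero status if either file is absent.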

Help

> help("pscnseq", package = "CostelloPSCNSeq")

pscnseq            package:CostelloPSCNSeq             R Documentation

Calling the Parent-Specific Copy-Number Pipeline Step by Step

Description:

     Calling the Parent-Specific Copy-Number Pipeline Step by Step

Usage:

     pscnseq(
       what = c("mpileup", "sequenza", "pscbs", "reports"),
       dataset = NULL,
       organism = NULL,
       chrs = NULL,
       samples = NULL,
       fasta = NULL,
       gcbase = NULL,
       bam_pattern = NULL,
       binsize = NULL,
       config = "config.yml",
       session_details = !interactive(),
       verbose = TRUE,
       ...
     )
     
Arguments:

    what: (character) The step to be performed; in order, one of
          '"mpileup"', '"sequenza"', '"pscbs"', or '"reports"'.

 dataset: (character) The name of the dataset as on file.

organism: (character) The name of the organism as on file.

    chrs: (character vector) The name of the chromosomes to be
          processed, e.g. 'c("1", "2", "X")'.

 samples: (character) Pathname to a tab-delimited sample specification
          file, typically named '*.tsv', e.g. 'samples.tsv'.

   fasta: (character) The pathname to the FASTA reference file,
          typically named '*.fa' or '*.fasta', e.g. 'hg19.fa'.

  gcbase: (character) The pathname to the GC-content annotation file,
          typically named '*.txt.gz', e.g. 'hg19.gc50Base.txt.gz'.

bam_pattern: (character; optional) Regular expression to identify
          subset of BAM files to be processed.  If NULL (default), then
          BAM files matching .bwa.realigned.rmDups(|.recal)(|.bam)$ are
          included.

 binsize: (integer or numeric) The bin size (in basepairs) used for
          binning reads into bins that then are passed to the
          segmentation method.

  config: (character) Pathname to YAML configuration file. If NULL,
          then the configuration file is skipped.

session_details: (logical) If TRUE, session details are reported before
          starting the processing and after it completed.

 verbose: (logical) If TRUE, then verbose output is produced, otherwise
          not.

     ...: Not used.

Value:

     Returns what the called 'pscnseq_nnn()' function returns, i.e.
     'pscnseq_mpileup()', 'pscnseq_sequenza()', 'pscnseq_pscbs()', or
     'pscnseq_reports()'.

Format of the samples file:

     The 'samples' argument should specify the pathname to a
     TAB-delimited file that provides annotation data for the samples to
     be processed. This file should have a row of TAB-delimited column
     headers followed by rows of samples with corresponding, TAB-delimited
     cells. The samples file must provide columns 'Patient_ID',
     'Sample_ID', and 'A0'. Any other columns are ignored. This
     pipeline processes tumor-normal pairs.  The pairs processed are
     inferred from (Patient_ID, Sample_ID).  Specifically, for each
     unique 'Patient_ID', the sample entry with 'Sample_ID == "Normal"'
     is used as the normal reference.  There must be exactly one such
     entry per patient. Each patient may have one or more tumor samples,
     which are identified as 'Sample_ID != "Normal"'.

     For example, the below 'samples.tsv' file specifies two
     tumor-normal pairs 'Primary-v1' vs 'Normal' and 'Primary-v2' vs
     'Normal' for one patient named 'Patient123'.  This file also
     specifies fields 'SF' and 'Kit', which may be used in other
     pipelines but are ignored by this pipeline.

     Patient_ID      Sample_ID       SF      Kit     A0
     Patient123      Normal  SF00121N        Xgen Exome Research Panel       X00001
     Patient123      Primary-v1      SF00121-v1      Xgen Exome Research Panel       X00002
     Patient123      Primary-v2      SF00121-v2      Xgen Exome Research Panel       X00003
     
     This

Configuration File:

     The default arguments can be set in a YAML-formatted
     configuration file as given by argument 'config'.  The default is
     to look for a file named 'config.yml' in the current directory.
     To skip this file, specify 'config = NULL'.  An example of such a
     file is:

     organism: Homo_sapiens
     chromosomes: c(1:22, "X", "Y", "M")
     fasta: annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa
     gcbase: annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.gc50Base.txt.gz
     dataset: CostelloP_2015-Exome,bwa,realigned,rmDups,recal
     binsize: 100e3
     samples: sampleData/samples.tsv
     
Specifying arguments via command-line options:

     The arguments can be overridden by command-line options, e.g.
     '--organism=Homo_sapiens' will take precedence over argument
     'organism', which in turn will take precedence over what is
     specified in the configuration file.

How to call pipeline from the command line:

     Below is how you could run the pipeline step by step.  The
     '--args' option tells 'Rscript' that any options following should
     be passed as arguments to this function.

     Rscript -e CostelloPSCNSeq::pscnseq --args --help
     Rscript -e CostelloPSCNSeq::pscnseq --args --what=mpileup   # ~25 min
     Rscript -e CostelloPSCNSeq::pscnseq --args --what=sequenza  # ~60 min
     Rscript -e CostelloPSCNSeq::pscnseq --args --what=pscbs     #  ~5 min
     Rscript -e CostelloPSCNSeq::pscnseq --args --what=reports   #  ~2 min
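The rules for the samples file given in the help page above (required columns, exactly one 'Normal' entry per patient) can be checked before a run. The following is a minimal shell sketch using awk; the function name `check_samples` is hypothetical, and the pipeline itself may perform different or additional checks.

```bash
# Hypothetical validator for the samples file described above: checks that the
# required columns exist and that each patient has exactly one "Normal" row.
check_samples() {
  awk -F '\t' '
    NR == 1 {
      # Map column names to their positions
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("Patient_ID" in col) || !("Sample_ID" in col) || !("A0" in col)) {
        print "ERROR: missing required column(s)" > "/dev/stderr"
        bad = 1
        exit
      }
      next
    }
    NF > 0 {
      patient = $(col["Patient_ID"])
      seen[patient] = 1
      if ($(col["Sample_ID"]) == "Normal") normals[patient]++
    }
    END {
      if (bad) exit 1
      for (p in seen) {
        if (normals[p] != 1) {
          printf("ERROR: %s has %d Normal rows, expected 1\n", p, normals[p]) > "/dev/stderr"
          bad = 1
        }
      }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}
```

For example, `check_samples sampleData/samples.tsv` returns a non-zero status and prints a message if a patient has no normal sample or more than one.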

References

Bengtsson H, Neuvial P, Speed TP. TumorBoost: Normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays, BMC Bioinformatics, 2010. DOI: 10.1186/1471-2105-11-245, PMID: 20462408, PMCID: PMC2894037
Olshen AB, Bengtsson H, Neuvial P, Spellman PT, Olshen RA, Seshan VA. Parent-specific copy number in paired tumor-normal studies using circular binary segmentation, Bioinformatics, 2011. DOI: 10.1093/bioinformatics/btr329, PMID: 21666266, PMCID: PMC3137217
Favero F, Joshi T, Marquard AM, Birkbak NJ, Krzystanek M, Li Q, Szallasi Z, Eklund AC. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data, Annals of Oncology, 2015. DOI: 10.1093/annonc/mdu479, PMID: 25319062, PMCID: PMC4269342
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics, 2009. DOI: 10.1093/bioinformatics/btp352, PMID: 19505943, PMCID: PMC2723002

Installation

R package CostelloPSCNSeq is only available via GitHub and can be installed in R as:

remotes::install_github("HenrikBengtsson/Costello-PSCN-Seq", ref="master")

Pre-release version

To install the pre-release version that is available in Git branch develop on GitHub, use:

remotes::install_github("HenrikBengtsson/Costello-PSCN-Seq", ref="develop")

This will install the package from source.

Contributions

This Git repository uses the Git Flow branching model (the git flow extension is useful for this). The develop branch contains the latest contributions and other code that will appear in the next release, and the master branch contains the code of the latest release.

Contributing to this package is easy. Just send a pull request. When you send your PR, make sure develop is the destination branch on the CostelloPSCNSeq repository. Your PR should pass R CMD check --as-cran, which will also be checked by the continuous-integration services when the PR is submitted.

We abide by the Code of Conduct of Contributor Covenant.

Software status

Resource	GitHub	GitHub Actions	Travis CI	AppVeyor CI
Platforms:	Multiple	Multiple	Linux & macOS	Windows

costellopscnseq's Issues

Issue loading required packages using README.md instructions

My PSCN run failed today, with the error being that the aromaseq + sequenza libraries were not available. I attributed this to the latest R update (we recently moved our cluster to v3.5.0), though of course there may be some other more obvious reason behind it. I then tried to reinstall these packages using the instructions in the README.md:

source("http://callr.org/install#HenrikBengtsson/aroma.seq,sequenza")

While everything appears to be installing for a while, at some point, it fails and exits before completing. These are the last 3 lines of code before exit, including the error message:

DONE (aroma.seq)
Error in username %||% getOption("github.user") %||% stop("Unknown username.") :
Unknown username.

I checked the sourced code, but the github username (HenrikBengtsson) seemed pretty clearly specified in it, and I couldn't find the error message within the code itself to trace back when it was occurring. Any idea what might be going on? Thanks!

Patient260 Incomplete Chromosome Coverage

I have run into a peculiar issue with Patient260. Large regions of the genome are absent from the final copy number report + rds output files:

When inspecting the individual bins in the $data structure of the final .rds files, many bins are missing from each sample. From the LoH plot for this patient (included below), it does not appear that there is an issue with identifying enough heterozygous SNPs.

Have you ever seen a case like this before?

Many thanks for any help you can give! I have attached a few of the .pdf final files as examples, as well as coverage and quality plots - while the normal has better coverage than most of the tumor samples, it isn't that much higher. The rds files are in /costellolab/shilz/data/cotient260tmp/

Patient260,Z00520_vs_Z00249,PairedPSCBS,report.pdf
Patient260,Z00519_vs_Z00249,PairedPSCBS,report.pdf
Patient260,Z00314_vs_Z00249,PairedPSCBS,report.pdf
overall_basequality.pdf
coverage.pdf
samples.txt

TIPCC: annotationData -> /cbc/data,public/annotationData is broken

@ivan108 wrote:

Hi Henrik,

I am trying to run Costello-PSCN-Seq pipeline, the frozen version Stephanie was using before she left.

However, the hardcoded symbolic link

annotationData -> /cbc/data,public/annotationData

doesn't work anymore, giving error "cannot access annotationData: Permission denied".

Could you fix that?
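To narrow down a problem like this, it can help to check whether the link itself exists and whether its target is readable by the current user. The following shell sketch does that for any path; the helper name `check_link` is hypothetical.

```bash
# Hypothetical diagnostic: report whether a path is a symbolic link and
# whether its target can actually be read by the current user.
check_link() {
  local path="$1"
  if [ ! -L "$path" ]; then
    echo "NOT A LINK: $path" >&2
    return 1
  fi
  if [ -r "$path" ]; then
    echo "OK: $path -> $(readlink "$path")"
  else
    echo "BROKEN OR UNREADABLE: $path -> $(readlink "$path")" >&2
    return 1
  fi
}
```

Running `check_link annotationData` in the project directory prints the link target and fails if the target is missing or not readable.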

Project folder out of date

Oh my, I just checked the /home/jocostello/repositories/HenrikBengtsson/Costello-PSCN-Seq project folder. It looks like it has not been updated in a while, so it's out of sync with this repo. That makes me wonder about the R package versions too. I also see there are lots of new scripts in there so I cannot figure out which ones are used and which aren't.

@ivan108, exactly how do you run these scripts?

Assuming these are still run with:

module purge
module load CBC r/3.4.4 samtools/1.3.1

what does

library("aroma.seq")
sessionDetails()

report?

future.batchtools not connecting to Torque server

I am getting an error when trying to run 1.mpileup.R, which is the first step of the pipeline.

module load CBC r/3.4.4
cd /home/jocostello/repositories/HenrikBengtsson/Costello-PSCN-Seq
Rscript 0.setup.R
qsub -l vmem=200gb -d "${PWD}" -M "${EMAIL}" -m ae 1.mpileup.pbs

Error: Listing of jobs failed (exit code 33);
cmd: 'qselect -u $USER -s EHRT'
output:
socket_connect_unix failed: 15137
qselect: cannot connect to server (null) (errno=15137) could not connect to trqauthd
Execution halted
Error : Listing of jobs failed (exit code 33);
cmd: 'qselect -u $USER -s EHRT'
output:
socket_connect_unix failed: 15137
qselect: cannot connect to server (null) (errno=15137) could not connect to trqauthd

It seems batchtools fails to connect to the Torque server? Any ideas how to fix that?

Thanks!
Ivan
cc/ @SRHilz

Chr 1-25: Need to increase to `vmem=2gb` in .future.R

While testing with chr 1-22, X, Y, and M, I ran into the following in the 2.sequenza.R step:

$ tail /home/henrik/projects/CostelloJ-PSCN-Seq_tests/.future/20170626_111627-54Dsqf/sample_1_38265500/logs/joba810736cd1202b8e75d829f18ca5d611.log
[...]
 Chromosome 'M'...done
pileup2seqz...done
=>> PBS: job killed: vmem 1076252672 exceeded limit 1073741824

The solution should be to bump up from vmem=1gb to vmem=2gb in .future.R.

I'll update this in this repository; there'll be other updates soon too, so you shouldn't have to do anything unless you want to do it right away.
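In the log above, the job used 1076252672 bytes against a limit of 1073741824 bytes (1 GiB). To find such kills across many job logs, a small scanner can be used. The following is a shell sketch assuming the log line format shown above; the function name `find_vmem_kills` is hypothetical.

```bash
# Hypothetical log scanner: for every "job killed: vmem ... exceeded limit ..."
# line in the given log files, print the file name, the memory used, and the
# limit (both in bytes, as reported by the scheduler).
find_vmem_kills() {
  awk '
    /job killed: vmem [0-9]+ exceeded limit [0-9]+/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "vmem")  used  = $(i + 1)
        if ($i == "limit") limit = $(i + 1)
      }
      printf "%s: used=%s limit=%s\n", FILENAME, used, limit
    }
  ' "$@"
}
```

It takes one or more log files as arguments and prints nothing for logs without such a line.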

ROBUSTNESS: Use '/usr/bin/env bash' shebangs

We're now using:

#!/bin/bash

but it's more portable to use

#!/usr/bin/env bash
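To locate the scripts that still need this change, the first line of each file can be inspected. The following is a minimal shell sketch; the function name `list_hardcoded_shebangs` is hypothetical, and it scans all regular files under the given directory.

```bash
# Hypothetical helper: list files under a directory whose first line is the
# hard-coded '#!/bin/bash' shebang.
list_hardcoded_shebangs() {
  local dir="${1:-.}" file
  find "$dir" -type f | sort | while IFS= read -r file; do
    if [ "$(head -n 1 "$file" 2>/dev/null)" = '#!/bin/bash' ]; then
      echo "$file"
    fi
  done
}
```

For example, `list_hardcoded_shebangs inst/scripts` prints one matching file per line.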

1.mpileup.R gives error

@SRHilz wrote:

I have finally gotten the chance to try running the Costello PSCN pipeline on my own (I want to eventually be able to run it for our lab), and have two quick questions I wanted to see if you knew the answers to:

These error messages occur at the end of running 1.mpileup.R, and I wanted to see if you knew how to resolve them:

 Collect and resolve all futures...
  A 'listenv' vector with 42 unnamed elements.
 Collect and resolve all futures...done
 Gather and rearrange...
Error: BatchJobError in BatchJobsFuture ('sample_1'): 'Error : BatchJobExpiration: Future ("chr_chr19") expired: /home/shilz/tools/Costello-PSCN-Seq/.future/20170131_192521-5a8wXh/chr_chr19_1326258390-files '
In addition: There were 50 or more warnings (use warnings() to see the first 50)
 Gather and rearrange...done
mpileup()...done
Execution halted
There were 11 warnings (use warnings() to see them)

(full run output file attached [by email]).

I then get error messages right away when I try to run 2.sequenza.R, such as:

Error : BatchJobError in BatchJobsFuture ('sample_1'): 'Error in eval(expr, envir, enclos) :    Sample #1 (Patient300,Z00363_vs_Z00346) of 38: Missing tumor "Z00363" for given pattern: ",chr=9$" '
Error : BatchJobError in BatchJobsFuture ('sample_2'): 'Error in eval(expr, envir, enclos) :    Sample #2 (Patient300,SF10711_9-1-22_vs_Z00346) of 38: Missing tumor "SF10711_9-1-22" for given pattern: ",chr=19$"
 '
Error : BatchJobError in BatchJobsFuture ('sample_3'): 'Error in installPkg(pkg, version = version, repos = repos, ..., quietly = quietly,  :    Failed to install package: sequenza '
(full run output file also attached [by email])

Which directory will my files (such as the pileup files) be output to?

Session details

Using future plan:
plan(list(samples = tweak(batchjobs_torque, label = "sample", 
    resources = list(vmem = "1gb")), chromosomes = tweak(batchjobs_torque, 
    label = "chr", resources = list(vmem = "5gb"))))
aroma.seq v0.7.0-9000 successfully loaded. See ?aroma.seq for help.

= sessionInfo() ===============================================================
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] aroma.seq_0.7.0-9000    future.BatchJobs_0.13.1 future_1.1.1-9000      
[4] aroma.core_3.0.0        R.devices_2.15.1        R.filesets_2.10.0      
[7] R.utils_2.5.0           R.oo_1.21.0             R.methodsS3_1.7.1      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8                RColorBrewer_1.1-2        
 [3] GenomeInfoDb_1.10.0        XVector_0.14.0            
 [5] methods_3.3.2              base64enc_0.1-4           
 [7] bitops_1.0-6               BatchJobs_1.6             
 [9] tools_3.3.2                zlibbioc_1.20.0           
[11] digest_0.6.10              lattice_0.20-34           
[13] RSQLite_1.1                memoise_1.0.0             
[15] checkmate_1.8.2            R.cache_0.12.0            
[17] Matrix_1.2-7.1             DBI_0.5-1                 
[19] parallel_3.3.2             R.rsp_0.30.0              
[21] hwriter_1.3.2              stringr_1.1.0             
[23] gtools_3.5.0               Biostrings_2.42.0         
[25] S4Vectors_0.12.0           PSCBS_0.62.0              
[27] globals_0.7.1              IRanges_2.8.0             
[29] grid_3.3.2                 stats4_3.3.2              
[31] Biobase_2.34.0             listenv_0.6.0             
[33] DNAcopy_1.48.0             BiocParallel_1.8.1        
[35] fail_1.3                   latticeExtra_0.6-28       
[37] sendmailR_1.2-1            magrittr_1.5              
[39] GenomicAlignments_1.10.0   backports_1.0.4           
[41] BBmisc_1.10                Rsamtools_1.26.1          
[43] codetools_0.2-15           matrixStats_0.51.0        
[45] BiocGenerics_0.20.0        GenomicRanges_1.26.1      
[47] SummarizedExperiment_1.4.0 ShortRead_1.32.0          
[49] brew_1.0-6                 stringi_1.1.2             

= .libPaths ===================================================================
[1] "/home/shilz/R/x86_64-pc-linux-gnu-library/3.3"                   
[2] "/home/shared/cbc/R/site-library/x86_64-pc-linux-gnu-library/3.3" 
[3] "/home/shared/cbc/software_cbc/R/R-3.3.2-20161031/lib64/R/library"

cc/ @SRHilz

WISH: Add `--cleanup` to each script / step (was 5.cleanup.R)

Add a cleanup step (5.cleanup.R) after:

Rscript 1.mpileup.R
- write to folder seqzData/<dataset>,mpileup/
Rscript 2.sequenza.R
- reads from folder seqzData/<dataset>,mpileup/
- write to folder seqzData/<dataset>,seqz/
- when complete, the above input folder is no longer needed.
Rscript 3.pscbs.R
- reads from folder seqzData/<dataset>,seqz/
- writes to folder pscbsData/<dataset>,seqz,<binsize>,tcn=2/
- when complete, the above input folder is no longer needed.
Rscript 4.reports.R
- reads from folder pscbsData/<dataset>,seqz,<binsize>,tcn=2/
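As an illustration of what such a cleanup could look like, here is a shell sketch that follows the folder layout listed above. The function name `cleanup_step` and the `--delete` flag are hypothetical; without the flag, the function only reports what it would remove.

```bash
# Hypothetical cleanup helper following the folder layout listed above.
# Usage: cleanup_step <completed-step> <dataset> [--delete]
# Without --delete it only prints the folder that would be removed.
cleanup_step() {
  local step="$1" dataset="$2" mode="${3:-}" dir
  case "$step" in
    sequenza) dir="seqzData/${dataset},mpileup" ;;  # input of 2.sequenza.R
    pscbs)    dir="seqzData/${dataset},seqz" ;;     # input of 3.pscbs.R
    *)        echo "Nothing to clean up after step '${step}'" >&2; return 0 ;;
  esac
  if [ ! -d "$dir" ]; then
    echo "Already absent: ${dir}"
  elif [ "$mode" = "--delete" ]; then
    rm -rf -- "$dir" && echo "Removed: ${dir}"
  else
    echo "Would remove: ${dir}"
  fi
}
```

The dry-run default is a deliberate choice, because the intermediate files take a long time to regenerate.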

Failing to run `1.mpileup.R`

The original issue has moved to Issue #7.

Version 0.3.0 of pipeline released

I've updated this repository to v0.3.0.

To update, do a git pull on the master branch to get the updated R scripts and make sure to update to aroma.seq 0.9.1.

I've also dropped the default .future.R file so that the instructions work for everyone. Instead, I've added a Section 'Configure parallel processing' to the end of the README with commands for configuring compute-cluster processing.

POTENTIAL BUG: Lack of newline at end of sampleData/samples.tsv causes 3.pscbs.R to fail

Issue

Lack of newline at end of sampleData/samples.tsv may cause 3.pscbs.R to fail.

Action

Reproduce
Detect the mistake and give an informative error message.
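Until the pipeline detects this itself, the condition can be tested from the shell. The following is a minimal sketch; the function name `has_trailing_newline` is hypothetical.

```bash
# Hypothetical check: succeed only if the file is non-empty and its last byte
# is a newline character.
has_trailing_newline() {
  [ -s "$1" ] && [ "$(tail -c 1 "$1" | wc -l | tr -d ' ')" = "1" ]
}
```

For example, `has_trailing_newline sampleData/samples.tsv || echo >> sampleData/samples.tsv` appends the missing newline only when needed.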

Turn these scripts into an R package

The next level of this pipeline is to turn its scripts into an R package. This will make it easier to install, easier to document (help pages and vignettes) and we can add tests that can be automatically run locally, on CI and across OSes.

DOCS: Add PMC numbers with URLs for all references

1.mpileup.R - BatchtoolsError write permission error

I've just tried to run a new batch of exomes through the copy number pipeline, and have encountered a new error while running 1.mpileup.R.

Almost all of the mpileup files for each sample + chromosome are successfully generated:

[shilz@n6 THetA]$ ll /costellolab/shilz/tools/Costello-PSCN-Seq/seqzData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal,mpileup/Homo_sapiens/
total 282747836
-rw-r--r-- 1 shilz costellolab  810866744 Oct 15 21:31 Patient276,Z00483-trim,chr=1.mpileup.gz
-rw-r--r-- 1 shilz costellolab  237568378 Oct 15 20:42 Patient276,Z00483-trim,chr=10.mpileup.gz
-rw-r--r-- 1 shilz costellolab  415875698 Oct 15 20:59 Patient276,Z00483-trim,chr=11.mpileup.gz
-rw-r--r-- 1 shilz costellolab  404682127 Oct 15 20:59 Patient276,Z00483-trim,chr=12.mpileup.gz
-rw-r--r-- 1 shilz costellolab  170785520 Oct 15 21:11 Patient276,Z00483-trim,chr=13.mpileup.gz
-rw-r--r-- 1 shilz costellolab  257024396 Oct 15 20:49 Patient276,Z00483-trim,chr=14.mpileup.gz
-rw-r--r-- 1 shilz costellolab  344659939 Oct 15 21:01 Patient276,Z00483-trim,chr=15.mpileup.gz
-rw-r--r-- 1 shilz costellolab  376578483 Oct 15 20:57 Patient276,Z00483-trim,chr=16.mpileup.gz

However, two seem to not complete when the script hits an error and stops running:

-rw-r--r-- 1 shilz costellolab 1610939722 Oct 15 20:29 Patient372,Z00496-trim,chr=9.mpileup
-rw-r--r-- 1 shilz costellolab  466056895 Oct 15 20:59 Patient372,Z00496-trim,chr=9.mpileup.gz.tmp

and

-rw-r--r-- 1 shilz costellolab 1326738204 Oct 15 21:00 Patient372,Z00500-trim,chr=16.mpileup
-rw-r--r-- 1 shilz costellolab  409682667 Oct 15 21:27 Patient372,Z00500-trim,chr=16.mpileup.gz.tmp

I have tried simply re-running 1.mpileup.R, but it throws an error that these intermediate tmp files already exist. When I then delete them and try rerunning, it also does not work unfortunately.

Here are the last few dozen lines from the Rscript*.o* output file from the R run, with the part that seems potentially problematic highlighted:

  A 'listenv' vector with 27 unnamed elements.
 Collect and resolve all futures...done
 Gather and rearrange...
**Error: BatchtoolsError in BatchtoolsFuture ('sample_7'): 'Error : BatchtoolsError in BatchtoolsFuture ('chr_chrX'): 'Error in getWritablePathname.Arguments(static, ...) : 
  No write permission for directory: seqzData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal,mpileup/Homo_sapiens''**
 Gather and rearrange...done
mpileup()...done
Execution halted
Warning messages:
1: In delete.BatchtoolsFuture(future, onRunning = "skip", onMissing = "ignore",  :
  Will not remove batchtools registry, because the status of the batchtools was 'defined', 'error', 'started', 'submitted' and option 'future.delete' is FALSE or running in an interactive session: '/costellolab/shilz/tools/Costello-PSCN-Seq/.future/20171015_201234-e7GtcR/sample_26_1483439475'
2: In delete.BatchtoolsFuture(future, onRunning = "skip", onMissing = "ignore",  :
  Will not remove batchtools registry, because the status of the batchtools was 'defined', 'error', 'started', 'submitted' and option 'future.delete' is FALSE or running in an interactive session: '/costellolab/shilz/tools/Costello-PSCN-Seq/.future/20171015_201234-e7GtcR/sample_17_1062549309'
3: In delete.BatchtoolsFuture(future, onRunning = "skip", onMissing = "ignore",  :
  Will not remove batchtools registry, because the status of the batchtools was 'defined', 'error', 'started', 'submitted' and option 'future.delete' is FALSE or running in an interactive session: '/costellolab/shilz/tools/Costello-PSCN-Seq/.future/20171015_201234-e7GtcR/sample_13_1838446915'
4: In delete.BatchtoolsFuture(future, onRunning = "skip", onMissing = "ignore",  :
  Will not remove batchtools registry, because the status of the batchtools was 'defined', 'error', 'started', 'submitted' and option 'future.delete' is FALSE or running in an interactive session: '/costellolab/shilz/tools/Costello-PSCN-Seq/.future/20171015_201234-e7GtcR/sample_9_1953818508'
5: In delete.BatchtoolsFuture(future, onRunning = "skip", onMissing = "ignore",  :
  Will not remove batchtools registry, because the status of the batchtools was 'defined', 'error', 'started', 'submitted' and option 'future.delete' is FALSE or running in an interactive session: '/costellolab/shilz/tools/Costello-PSCN-Seq/.future/20171015_201234-e7GtcR/sample_7_2035481989'

I wanted to see if you knew what might be causing this, before I delve into my own investigation? Let me know if there is any more info I can provide.

Many thanks!
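One way to start investigating an error like this is to confirm that the output directory is writable and to list temporary files left behind by the interrupted run. The following is a shell sketch; the function name `check_output_dir` is hypothetical. The check is only meaningful when run as the same user and on the same host as the failing job.

```bash
# Hypothetical pre-flight check for an output directory: verify that it is
# writable and list leftover '*.tmp' files from an interrupted run.
check_output_dir() {
  local dir="$1"
  if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
    echo "ERROR: no write permission for directory: ${dir}" >&2
    return 1
  fi
  find "$dir" -maxdepth 1 -type f -name '*.tmp' | sort
}
```

It prints one leftover file per line and returns a non-zero status if the directory is missing or not writable.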

Pipeline does not work with sequenza (>= 3.0.0)

Issue

The pipeline does not work with sequenza (>= 3.0.0) because it has dropped essential Python scripts. If tried, we get the error:

Error in system.file("exec", "sequenza-utils.py", package = "sequenza",  : 
  no file found

Workaround

For now, we need to install sequenza 2.1.2;

install.packages("https://cran.r-project.org/src/contrib/Archive/sequenza/sequenza_2.1.2.tar.gz")

Tasks

Document this in the README
Use ~~packrat~~ pak to install sequenza 2.1.2 locally so that update.packages() won't break the pipeline.
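A guard that fails early when an incompatible sequenza version is installed could be built around a version comparison. The following is a shell sketch; the function name `version_lt` is hypothetical, and it relies on `sort -V` (version sort) as provided by GNU coreutils.

```bash
# Hypothetical helper: succeed if version $1 is strictly lower than version $2.
# Relies on 'sort -V' (version sort), which GNU coreutils provides.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

The installed version can be obtained with `Rscript -e 'cat(as.character(packageVersion("sequenza")))'` and then tested with `version_lt "$v" 3.0.0`.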

4.reports.pbs crashing

Make 0.setup.R agile to config.yml

Creating a new issue based on the below:

@ivan108 wrote:

I wonder what does it take to move that path into config.yml?

@HenrikBengtsson replied:

One easier, quicker approach is to be able to override the default via an env var. See ?Sys.getenv.

BUG: 1.mpileup.R gives Error: length(work) == length(todo) is not TRUE

Running qcmd --exec Rscript 1.mpileup.R with chrs=c(1:22, "X", "Y", "M") gave an error:

$ Rscript_222853-20170625.o880722
[...]
 Processing incomplete BAMs (i.e. missing one or more chromosomes)...done
 Collect and resolve all futures...
  A 'listenv' vector with 2 unnamed elements.
 Collect and resolve all futures...done
 Gather and rearrange...
Error: length(work) == length(todo) is not TRUE

Rerunning it again, and all is good.

Action

Try to reproduce this.
If reproducible, fix it.

Use another exome kit?

@ivan108 wrote:

I need to use a different exome kit, Stephanie mentioned that a target list should be somewhere linked properly to the corresponding .bed file? How do I do that?

TROUBLESHOOT: 3.pscbs.R

Running qcmd --exec Rscript 3.pscbs.R with chromosomes: c(1:22, "X", "Y", "M") gave:

[...]
Sample 1 ('Patient300,Z00363_vs_Z00346') of 1 ...
...
Loading required package: methods

Attaching package: 'methods'

The following objects are masked from 'package:R.oo':

    getClasses, getMethods

Sample 1 ('Patient300,Z00363_vs_Z00346') of 1 ...
...done
Error: Log file for job with id 1 not available
Execution halted
Warning message:
In delete.BatchtoolsFuture(future, onRunning = "skip", onMissing = "ignore",  :
  Will not remove batchtools registry, because the status of the batchtools was 'defined', 'error', 'running', 'started', 'submitted', 'system' and option 'future.delete' is FALSE or running in an interactive session: '/home/henrik/projects/CostelloJ-PSCN-Seq_tests/.future/20170625_234757-0ss9xx/sample_1_1462898678'

Action

Check if it can be replicated.
If so, troubleshoot and fix bug.

DOCS: Clarify which steps require tumor-normal pairs

Clarify which steps require tumor-normal pairs, namely:

Sequenza
Paired PSCBS

Question: Are matched tumor-normal pairs required?
Answer: It should be possible to replace the paired PSCBS step with a non-paired version. However, I think Sequenza requires paired tumor-normal data:

"Here we describe Sequenza, a software package that uses paired tumor-normal exome or whole-genome sequencing data to estimate tumor cellularity and ploidy and to infer allele-specific tumor copy number profiles. Using publicly available matched tumor-normal data, we compare the results of exome sequence data ..." (Sequenza abstract)

Proper chromosome tags (e.g. `chrs=1-22,X`) in several places

So far, the pipeline has assumed that only Chrs 1-22 are processed, but not Chr X, Chr Y, and Chr M. The following things need to / should be fixed:

Critical
- 1.mpileup.R will give an error with non-numeric elements in chrs. FIXED: Now supports chrs = c(1:22, "X", "Y", "M").
Important
- 3.pscbs.R outputs to pscbsData/<dataset>,<tags>/Homo_sapiens/<sample>,<tags>,chrs=<chrsTag>,PairedPSCBS.rds. If we use chrs = c(1:22, "X", "Y", "M"), we still get that <chrsTag> is chrs=1:22. FIXED: Now gives chrs=1-25 (because PSCBS translates X, Y and M to 23, 24, and 25).
Misc
- 3.pscbs.R and 4.reports.R output log messages listing the chromosomes; these don't handle non-numeric chrs values. FIXED: They now output as 1-25.
- ~~4.reports.R should include a <chrsTag> in its output directory.~~ NOT FIXED: Not necessary; let's keep the current default of PSCBS::report().
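
The chromosome tag described above can be sketched in base R. This is only an illustration, not the package's actual code: the helpers chr_to_index() and chrs_tag() are hypothetical names, and the only thing taken from the notes above is the X = 23, Y = 24, M = 25 convention and the chrs=1-25 output.

```r
# Hypothetical helper: map chromosome labels to integer indices,
# using the X = 23, Y = 24, M = 25 convention mentioned above.
chr_to_index <- function(chrs) {
  chrs <- as.character(chrs)
  map <- c(X = 23L, Y = 24L, M = 25L)
  is_named <- chrs %in% names(map)
  idxs <- integer(length(chrs))
  idxs[is_named] <- unname(map[chrs[is_named]])
  idxs[!is_named] <- suppressWarnings(as.integer(chrs[!is_named]))
  if (anyNA(idxs)) stop("Unknown chromosome label(s) in 'chrs'")
  idxs
}

# Hypothetical helper: collapse the indices into a compact range tag.
chrs_tag <- function(chrs) {
  idxs <- sort(unique(chr_to_index(chrs)))
  # Start a new run wherever consecutive indices are not adjacent
  run <- cumsum(c(1L, diff(idxs) != 1L))
  parts <- tapply(idxs, run, function(x) {
    if (length(x) == 1L) as.character(x) else sprintf("%d-%d", min(x), max(x))
  })
  paste0("chrs=", paste(parts, collapse = ","))
}

print(chrs_tag(c(1:22, "X", "Y", "M")))
```

Mapping the labels to integers first keeps the range-collapsing step purely numeric, so non-contiguous selections such as c(1:3, "X") fall out as separate comma-separated runs.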

Add assertion that BAM index files exist

Add assertion that BAM index files exist. If *.bai files are missing, then 1.mpileup.R will produce an empty *.mpileup.tmp file and the stdout/stderr output will (only) show something like:

[...]
  BAM #8 ('COH-148,SC300109') of 8...done
  A 'listenv' vector with 8 elements (unnamed).                            
 Processing incomplete BAMs (i.e. missing one or more chromosomes)...done
 Collect and resolve all futures...                                                                                             
Execution halted                                               
Error: cannot allocate vector of size 1 Kb                                                                                     
Error: cannot allocate vector of size 0 Kb                     
Error: cannot allocate vector of size 0 Kb                                                                                   
Error: cannot allocate vector of size 0 Kb 
[...]
Error: cannot allocate vector of size 0 Kb

This is not very informative. However, the log files of the individual batchtools jobs will show something like:

[...]
      Running samtools 'mpileup'...                                                                                           
       Calling samtools executable...                                                                                         
        Executable: /home/shared/cbc/software_cbc/samtools-1.3.1/samtools                                               
        Arguments passed to system2():                                                                                       
        List of 1                                                                                                            
         $ stdout: chr "seqzData/ZivE_2019-LatinaBC,bwa,realigned,rmDups,recal,mpileup/Homo_sapiens/COH-148,SC299980,chr=1.mpileup.tmp"        
        Arguments passed to samtools:                                                                                        
        List of 5                                                                                                                                                                                              
         $  : chr "mpileup"                                                                                                                                                                            
         $ f: chr "'annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa'"                                                                                       
         $ r: chr "chr1"                                                                                                                                                     
         $ Q: num 20                                                                                                                                                                                           
         $  : chr "'bamData/ZivE_2019-LatinaBC,bwa,realigned,rmDups,recal/Homo_sapiens/COH-148/SC299980_GACTAGTA_L005.bwa.realigned.rmDups.bam'"                                                               
        Command line options:                                                                                                                                                                      
        [1] "mpileup"                                                                                                                                                                                        
        [2] "-f 'annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa'"                                                                                                                            
        [3] "-r chr1"                                                                                                                                                                                                
        [4] "-Q 20"                                                                                                                                                                                          
        [5] "'bamData/ZivE_2019-LatinaBC,bwa,realigned,rmDups,recal/Homo_sapiens/COH-148/SC299980_GACTAGTA_L005.bwa.realigned.rmDups.bam'"                                                                   
        System call:                                                                                                                                                                                 
        [1] "/home/shared/cbc/software_cbc/samtools-1.3.1/samtools mpileup -f 'annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa' -r chr1 -Q 20 'bamData/ZivE_2019-LatinaBC,bwa,realigned,rmDups,recal/Homo_sapiens/COH-148/SC299980_GACTAGTA_L005.bwa.realigned.rmDups.bam'"
        List of 1                                                                                                                      
         $ stdout: chr "seqzData/ZivE_2019-LatinaBC,bwa,realigned,rmDups,recal,mpileup/Homo_sapiens/COH-148,SC299980,chr=1.mpileup.tmp"    
        system2() call...                                                                                                    
         List of 3                                                                                                                             
          $ command: Named chr "/home/shared/cbc/software_cbc/samtools-1.3.1/samtools"                                    
           ..- attr(*, "names")= chr "samtools"                                                                           
           ..- attr(*, "version")=Classes 'package_version', 'numeric_version'  hidden list of 1                      
           .. ..$ : int [1:3] 1 3 1                                                                                                                                                              
          $ args   : chr [1:5] "mpileup" "-f 'annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa'" "-r chr1" "-Q 20" ...                 
          $ stdout : chr "seqzData/ZivE_2019-LatinaBC,bwa,realigned,rmDups,recal,mpileup/Homo_sapiens/COH-148,SC299980,chr=1.mpileup.tmp"                    
[mpileup] fail to load index for bamData/ZivE_2019-LatinaBC,bwa,realigned,rmDups,recal/Homo_sapiens/COH-148/SC299980_GACTAGTA_L005.bwa.realigned.rmDups.bam    
        system2() call...done                                                                                                                                                                    
       Calling samtools executable...done                                                                                                                                                        
      Running samtools 'mpileup'...done 
                                                                                                                                                               
### [bt]: Job terminated successfully [batchtools job.id=1]                                                                                                                                    
### [bt]: Calculation finished!
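
A minimal sketch of such an assertion, in base R, could look as follows. The function name assert_bam_indexed() is hypothetical, and the sketch assumes that an index is stored next to the BAM file as either <name>.bam.bai or <name>.bai; it is not the check that 1.mpileup.R actually performs.

```r
# Hypothetical helper: stop early if any BAM file lacks an index file.
# Assumes the index is named either <name>.bam.bai or <name>.bai.
assert_bam_indexed <- function(bams) {
  has_index <- vapply(bams, function(bam) {
    candidates <- c(paste0(bam, ".bai"), sub("[.]bam$", ".bai", bam))
    any(file.exists(candidates))
  }, logical(1))
  if (!all(has_index)) {
    stop("Missing BAM index (*.bai) for: ",
         paste(sQuote(bams[!has_index]), collapse = ", "), call. = FALSE)
  }
  invisible(bams)
}

# Usage on throwaway files: 'a.bam' has an index, 'b.bam' does not
d <- tempfile("bamcheck")
dir.create(d)
a <- file.path(d, "a.bam")
b <- file.path(d, "b.bam")
invisible(file.create(a, paste0(a, ".bai"), b))
res <- tryCatch(assert_bam_indexed(c(a, b)),
                error = function(e) conditionMessage(e))
print(res)
```

Running a check like this before any samtools jobs are submitted would surface the problem as one clear error message instead of the "fail to load index" line buried in a per-job log.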

resolve(..., value) is deprecated

From the future package:

Argument 'value' of resolve() is deprecated. Use 'result' instead.
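
As a hedged illustration of the replacement argument, the sketch below resolves a list of futures with result = TRUE instead of value = TRUE. It assumes the future package is installed and uses the sequential backend purely for demonstration; the variable names fs and vals are made up, and this is not the pipeline's own call.

```r
# Runs only if the 'future' package is available
if (requireNamespace("future", quietly = TRUE)) {
  future::plan("sequential")  # illustrative backend choice
  # Three toy futures standing in for the pipeline's per-sample jobs
  fs <- lapply(1:3, function(i) future::future(i^2))
  # Previously: resolve(fs, value = TRUE); now 'result' is used instead
  fs <- future::resolve(fs, result = TRUE)
  vals <- vapply(fs, future::value, numeric(1))
  print(vals)
}
```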

1.mpileup: Pathname not found

Getting a weird error at the beginning of a 1.mpileup run:

Exception: Pathname not found: /home/jocostello/repositories/HenrikBengtsson/Costello-PSCN-Seq/annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa (/home/jocostello/repositories/HenrikBengtsson/Costello-PSCN-Seq/ exists, but nothing beyond)

However, the pathname actually exists and is readable:

>ls -l /home/jocostello/repositories/HenrikBengtsson/Costello-PSCN-Seq/annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa
-rw-r--r-- 1 henrik cbc 3199905909 Apr  6  2015 /home/jocostello/repositories/HenrikBengtsson/Costello-PSCN-Seq/annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa
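
One way to narrow this down is to ask the same R session that raised the error how far down the path it can see. The sketch below is a diagnostic only; deepest_existing() is a hypothetical helper, not part of the pipeline, and it makes no claim about the actual cause of the error.

```r
# Hypothetical helper: return the deepest leading part of 'path' that
# exists from the point of view of the current R process.
deepest_existing <- function(path) {
  while (!file.exists(path)) {
    parent <- dirname(path)
    if (identical(parent, path)) break  # reached the top; nothing exists
    path <- parent
  }
  path
}

# Usage on a throwaway directory: only 'd' itself exists
d <- tempfile("pathcheck")
dir.create(d)
target <- file.path(d, "annotationData", "organisms", "hg19.fa")
found <- deepest_existing(target)
print(found)
# Read permission as seen by R (0 means readable)
print(file.access(found, mode = 4))
```

If the deepest existing component differs between the failing R session and an interactive shell, the two are not seeing the same file system state, for example because of a link that resolves differently or a different working environment.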

DOCS: Add installation instructions

Add installation instructions, i.e. basically which R packages need to be installed before launching the scripts.

1.mpileup.R: Warning messages

When running 1.mpileup.R, I get the following warning messages:

= warnings() ==================================================================
Warning messages:
1: In getIndexFile.BamDataFile(this, create = FALSE) :
Detected outdated index file and recreated it: bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00047.bwa.realigned.rmDups.recal.bam.bai
2: In file.remove(pathnameIDX) :
cannot remove file 'bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00047.bwa.realigned.rmDups.recal.bam.bai', reason 'Permission denied'
3: In getIndexFile.BamDataFile(this, create = FALSE) :
Detected outdated index file and recreated it: bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00047.bwa.realigned.rmDups.recal.bam.bai
4: In file.remove(pathnameIDX) :
cannot remove file 'bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00047.bwa.realigned.rmDups.recal.bam.bai', reason 'Permission denied'
5: In getIndexFile.BamDataFile(this, create = FALSE) :
Detected outdated index file and recreated it: bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00047.bwa.realigned.rmDups.recal.bam.bai
6: In file.remove(pathnameIDX) :
cannot remove file 'bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00047.bwa.realigned.rmDups.recal.bam.bai', reason 'Permission denied'
7: In getIndexFile.BamDataFile(this, create = FALSE) :
Detected outdated index file and recreated it: bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00047.bwa.realigned.rmDups.recal.bam.bai
8: In file.remove(pathnameIDX) :
cannot remove file 'bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00047.bwa.realigned.rmDups.recal.bam.bai', reason 'Permission denied'
9: In getIndexFile.BamDataFile(this, create = FALSE) :
Detected outdated index file and recreated it: bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00048.bwa.realigned.rmDups.recal.bam.bai
10: In file.remove(pathnameIDX) :
cannot remove file 'bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00048.bwa.realigned.rmDups.recal.bam.bai', reason 'Permission denied'
11: In getIndexFile.BamDataFile(this, create = FALSE) :
Detected outdated index file and recreated it: bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00048.bwa.realigned.rmDups.recal.bam.bai
12: In file.remove(pathnameIDX) :
cannot remove file 'bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00048.bwa.realigned.rmDups.recal.bam.bai', reason 'Permission denied'
13: In getIndexFile.BamDataFile(this, create = FALSE) :
Detected outdated index file and recreated it: bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00048.bwa.realigned.rmDups.recal.bam.bai
14: In file.remove(pathnameIDX) :
cannot remove file 'bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00048.bwa.realigned.rmDups.recal.bam.bai', reason 'Permission denied'
15: In getIndexFile.BamDataFile(this, create = FALSE) :
Detected outdated index file and recreated it: bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00048.bwa.realigned.rmDups.recal.bam.bai
16: In file.remove(pathnameIDX) :
cannot remove file 'bamData/CostelloP_2015-Exome,bwa,realigned,rmDups,recal/Homo_sapiens/Patient68/Z00048.bwa.realigned.rmDups.recal.bam.bai', reason 'Permission denied'
Rscript_203901-20170627.o882206.txt
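
The warnings above combine two conditions: an index that is considered outdated and an index that cannot be removed. A rough base-R sketch for checking both ahead of time follows. The helper check_bam_index() is hypothetical, and it assumes that "outdated" means the index has an older modification time than its BAM file; how aroma.seq decides this is not stated here.

```r
# Hypothetical helper: per BAM file, report whether its index exists,
# is older than the BAM file, and could be removed by the current user.
check_bam_index <- function(bams) {
  bais <- paste0(bams, ".bai")  # <name>.bam.bai naming, as in the warnings
  data.frame(
    bam       = bams,
    exists    = file.exists(bais),
    # Assumption: "outdated" means the index is older than the BAM file
    outdated  = file.mtime(bais) < file.mtime(bams),
    # On POSIX systems, deleting a file needs write access to its directory
    removable = file.access(dirname(bais), mode = 2) == 0,
    stringsAsFactors = FALSE
  )
}

# Usage on throwaway files: make the index one hour older than the BAM
d <- tempfile("idxcheck")
dir.create(d)
bam <- file.path(d, "s.bam")
bai <- paste0(bam, ".bai")
invisible(file.create(bam, bai))
invisible(Sys.setFileTime(bai, Sys.time() - 3600))
res <- check_bam_index(bam)
print(res[, c("exists", "outdated", "removable")])
```

An index that is reported as outdated but not removable matches the situation in the warnings, where the index is shared and owned by another user.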

qselect: cannot connect to server

Stephanie and I have been experiencing the same error for the last few days:

[...]
 Processing incomplete BAMs (i.e. missing one or more chromosomes)...done
 Collect and resolve all futures...
  A 'listenv' vector with 3 elements (unnamed).
 Collect and resolve all futures...done
 Gather and rearrange...
Error: Listing of jobs failed (exit code 33);
cmd: 'qselect -u $USER -s EHRT'
output:
socket_connect_unix failed: 15137
qselect: cannot connect to server (null) (errno=15137) could not connect to trqauthd

while running the first step of the pipeline, e.g.:

qsub -d $(pwd) -l vmem=200gb -M [email protected] -m ae 1.mpileup.pbs

Any ideas?
Thanks!
