aryarm / varca

Use an ensemble of variant callers to call variants from ATAC-seq data

License: MIT License

Python 55.79% Shell 24.99% Awk 2.43% R 16.80%
random-forest machine-learning snakemake atac-seq-data variant-calling

varca's Introduction


varCA

A pipeline for running an ensemble of variant callers to predict variants from ATAC-seq reads.

The entire pipeline is made up of two smaller subworkflows. The prepare subworkflow calls each variant caller and prepares the resulting data for use by the classify subworkflow, which uses an ensemble classifier to predict the existence of variants at each site.

Using our Code Ocean compute capsule, you can execute VarCA v0.2.1 on example data without downloading or setting up the project. To interpret the output of VarCA, see the output sections of the prepare subworkflow and the classify subworkflow in the rules README.

download

Execute the following command or download the latest release manually.

git clone https://github.com/aryarm/varCA.git

Also consider downloading the example data.

cd varCA
wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz | tar xvzf -

setup

The pipeline is written as a Snakefile which can be executed via Snakemake. We recommend installing version 5.18.0:

conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'

We highly recommend installing Snakemake via conda, as shown above, so that you can use the --use-conda flag when calling snakemake; this lets Snakemake automatically handle all of the pipeline's dependencies. Otherwise, you must manually install the dependencies listed in the env files.
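For example, if you choose the manual route, each environment could be created directly from its env file; this is just a sketch, and the environment name here is arbitrary:

conda env create -n varca-prepare -f envs/prepare.yml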

execution

  1. Activate snakemake via conda:

    conda activate snakemake
    
  2. Execute the pipeline on the example data

    Locally:

    ./run.bash &
    

    or on an SGE cluster:

    ./run.bash --sge-cluster &
    

Output

VarCA will place all of its output in a new directory (out/, by default). Log files describing the progress of the pipeline will also be created there: the log file contains a basic description of the progress of each step, while the qlog file is more detailed and will contain any errors or warnings. You can read more about the pipeline's output in the rules README.

Executing the pipeline on your own data

You must modify the config.yaml file to specify paths to your data. The config file is currently configured to run the pipeline on the example data provided.

Executing each portion of the pipeline separately

The pipeline is made up of two subworkflows. These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the rules README for execution instructions and a description of the outputs. You will need to execute the subworkflows separately if you ever want to create your own trained models.

Reproducing our results

We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow these steps to create all of the plots and tables in our paper.

If this is your first time using Snakemake

We recommend that you run snakemake --help to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the -n -p -r flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow, i.e. the all rule).
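For example, a dry run of the master pipeline might look like this (assuming the snakemake environment is active and you are in the varCA directory):

snakemake --use-conda -n -p -r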

Note that Snakemake will not recreate output that it has already generated, unless you request it. If a job fails or is interrupted, subsequent executions of Snakemake will just pick up where it left off. This can also apply to files that you create and provide in place of the files it would have generated.

By default, the pipeline will automatically delete some files it deems unnecessary (ex: unsorted copies of a BAM). You can opt to keep these files instead by providing the --notemp flag to Snakemake when executing the pipeline.
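Since run.bash forwards any arguments to snakemake (see below), one hedged way to do this is:

./run.bash --notemp &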

files and directories

  • Snakefile: A Snakemake pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:

    1. the prepare subworkflow, which prepares the reads for classification and
    2. the classify subworkflow, which creates a VCF containing predicted variants

  • rules/: Snakemake rules for the prepare and classify subworkflows. You can either execute these subworkflows from the master Snakefile or individually as their own Snakefiles. See the rules README for more information.

  • configs/: Config files that define options and input for the pipeline and the prepare and classify subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out the config file for the pipeline.

  • callers/: Scripts for executing each of the variant callers used by the prepare subworkflow. Small pipelines can be written for each caller by using a special naming convention. See the caller README for more information.

  • Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are an adaptation of @Arkosen's BreakCA code.

  • scripts/: Various scripts used by the pipeline. See the script README for more information.

  • run.bash: An example bash script for executing the pipeline using snakemake and conda. Any arguments to this script are passed directly to snakemake.
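For example, to limit the pipeline to a particular number of cores (the number here is illustrative):

./run.bash -j 12 &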

citation

There is an option to "Cite this repository" on the right sidebar of the repository homepage.

Massarat, A. R., Sen, A., Jaureguy, J., Tyndale, S. T., Fu, Y., Erikson, G., & McVicker, G. (2021). Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq. Nucleic Acids Research, gkab621. https://doi.org/10.1093/nar/gkab621

varca's People

Contributors

aryarm


varca's Issues

2vcf.py exits with a non-zero exit code when executed by snakemake

For some reason, the 2vcf.py script sometimes finishes with a non-zero exit code. It isn't clear what might be causing this. The usual warnings that I added are printed, but the warnings don't seem to be the cause of the non-zero exit code. No other information appears to explain why the exit code isn't 0. In fact, the VCF output looks correct.

This is a problem because the non-zero exit code causes Snakemake to remove the output after it is created.

Single-end support

The ATAC-seq data we work with is single-end, which is not supported at the moment. Ideally, of course, this feature could be introduced. However, in the meantime, I was wondering: does this limitation still apply if I provide aligned BAM files and we do the peak calling ourselves? If I skip those steps, will VarCA be able to handle single-end data?

Thanks!

making manta and strelka less picky

(Note that all of the following plots were produced after filtering sites with a gatk-indel~DP < 10)

When I first ran manta and strelka (with the loosest config parameters possible), I got the following precision-recall plot:
[precision-recall plot: all callers, indels]
It didn't look great for manta and strelka.
Next, I decided to feed the candidate indels from manta to strelka and enabled the --exome flag to disable various read filters that are probably more useful for whole-genome sequencing than for ATAC-seq data:
[precision-recall plot: all callers, indels, after the change]
At this point, manta is still pretty picky, but strelka seems to have improved a lot.

I think it's important to remember that manta and strelka are supposed to complement each other. manta finds large indels and strelka finds small ones. Thus, it's probably unreasonable to expect each of them to identify all of the indels on their own.

My next plan of action is to make strelka's config parameters stricter. And maybe reduce the number of candidate indels that manta passes to strelka somehow.
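For reference, the Manta-to-Strelka handoff described above usually looks roughly like the sketch below. The paths, sample names, and thread counts are placeholders, and the exact flags should be double-checked against the Manta/Strelka documentation and VarCA's callers/ scripts:

# 1) configure and run manta in exome mode (placeholder inputs)
configManta.py --bam sample.bam --referenceFasta reference.fa --exome --runDir manta_run
manta_run/runWorkflow.py -m local -j 4
# 2) pass manta's candidate small indels to strelka, also in exome mode
configureStrelkaGermlineWorkflow.py --bam sample.bam --referenceFasta reference.fa --exome \
    --indelCandidates manta_run/results/variants/candidateSmallIndels.vcf.gz --runDir strelka_run
strelka_run/runWorkflow.py -m local -j 4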

Error in rule predict

Describe the bug

pred= predict(fit, newdata= test, type="prob")
Error in [.data.frame(data, , forest$independent.variable.names, drop = FALSE) :
undefined columns selected

To Reproduce
Steps to reproduce the behavior:
  1. run predict_RF.R
  2. when the script reaches pred = predict(fit, newdata = test, type = "prob")
  3. the error shown above is raised ("undefined columns selected")

Runtime information

  • VarCA Version: [e.g. v0.2.1]
  • Snakemake Version: [e.g. v5.18.1]

make channel specifications in environment files explicit

Some of our environment files use "default" channels, but those can differ from user to user! So we should be explicit about which channel each package belongs to, and we should add "nodefaults" to the channels list of every conda environment to prevent the default channels from being considered.

output in VCF format

Add an option to the classify pipeline that would allow it to output predictions in the VCF format

add GT info to the VCFs

I've successfully run the example. I don't understand the output

out/merged_snp/jurkat_chr1/final.tsv.gz

What is the final call? And what should I use as the measure of confidence?

Thanks.
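A quick way to peek at the first few columns of that file while consulting the output documentation in the rules README (the column range here is arbitrary):

zcat out/merged_snp/jurkat_chr1/final.tsv.gz | head -n 3 | cut -f 1-8 | column -t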

a smaller test dataset

Our current test dataset comprises all of chr1 in two different samples: the Jurkat sample and the MOLT4 cell line. It takes about an hour to run the entire pipeline with this dataset.

Ideally, we would have a dataset that runs in under 10 mins or so. This could then be incorporated into a Github CI pipeline that runs automatically upon release of each major and minor version increment, so that we can know when a change that we've made to the code leads to a change in the results.

  • find SNVs and indels supported by all callers
  • choose just one or two peaks that overlap those variants from each of the two samples
  • subset the example dataset to reads that only overlap those peaks (see the sketch after this list)
  • also try to subset the reference genome that is packaged with the example data, since the ref genome appears to be the largest file, right now
  • rerun the pipeline with the smaller dataset and tweak the dataset as necessary to make it run quickly
  • use snakemake --generate-unit-tests to create a bunch of tests that can be executed using pytest
    • I'm running into issues with this. It doesn't work for outputs marked as pipe and there are some problems with other directories (see snakemake/snakemake#1104)
    • fix issues and ensure test coverage is appropriate
    • remove any unnecessary tests to ensure the test directory is small and can be properly included in version history (edit: this won't be possible, after all - b/c the test directory has to include the outputs of each rule ugh)
  • (optionally) create a Github action like this one to execute pytest upon each major or minor version increment and confirm the tests pass successfully
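The read-subsetting step referenced above could look roughly like this; the file names are placeholders, and samtools view -L keeps only reads overlapping regions in the BED file:

samtools view -b -L subset_peaks.bed jurkat_chr1.bam > jurkat_subset.bam
samtools index jurkat_subset.bam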

add genotypes (GT field) to final vcf

so that we can use this for allele specific expression (and make a conclusion about whether our performance differs for het alts vs hom alts)

you could start by using the genotypes output by platinum genomes

ChildIOException: File/directory is a child to another output:

Hi,
I'm trying to run the example dataset. (snakemake 5.19.2 ) My server crashed during the workflow and I tried to re-run it.

$ ./run.bash --sge-cluster && echo OK
$ 

this is the content of the log file:

Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
('/SCRATCH-BIRD/users/lindenbaum-p/work/20210312.varCA/merged_snp/jurkat_chr1/norm.tsv.gz', result)
('/SCRATCH-BIRD/users/lindenbaum-p/work/20210312.varCA/merged_snp/jurkat_chr1/norm.tsv.gz', norm_nums)

can you help me please ?

handling a scATAC custom reference genome from 10x

scATAC-seq BAM files from CellRanger can be used with VarCA, but there is an important caveat that one should be aware of:

tldr; if you give VarCA BAM files, you must provide the exact same reference genome as was used to align reads for the BAM!

The reads in the BAM files from CellRanger are often aligned against a custom reference genome. This can cause issues if you try to use a standard (non-custom) reference genome with VarCA, especially if any of the chromosomes are named differently between the custom reference and the standard one. In the past, I have found that the non-canonical chromosomes are usually different.

Option 1

One option is to discard reads belonging to non-canonical chromosomes (since these are usually less than 1% of the total number of reads) and then alter the header of the BAM to match the contigs in the reference:

# 1) extract only reads belonging to chromosomes 1-22 and chromosome X, Y, and M; note that this does not change the header
# 2) convert the BAM to SAM
# 3) remove all SQ tags in the header; these list the contigs in the reference
# 4) replace contig names in the header with those from the reference genome (using the
#    approach in https://www.biostars.org/p/289770/#289782)
#    We use "-f2 -F4" to drop any pairs where a single mate mapped to a non-canonical contig,
#    and we redirect stderr to /dev/null to ignore warnings about such reads.
#    We also drop any secondary alignments "-F256"
samtools view --no-PG -bh old.bam chr{1..22} chr{X,Y,M} | \
samtools view --no-PG -h | \
grep -Ev '^@SQ' | \
samtools view --no-PG -f2 -F4 -F256 -bhT reference.fa - 2>/dev/null >new.bam
# Note that we take a slightly different approach from the one described in
# https://www.biostars.org/p/289770/#289782 by using grep to remove the @SQ tags
# instead of nuking the header altogether, allowing us to retain the @RG and other tags
# Lastly, we must index the resulting BAM file for VarCA
samtools index new.bam

You should try this strategy only after verifying that the canonical chromosomes in your standard reference are the same as those in the custom reference. You can use samtools idxstats for this.
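As a sketch of that check (file names are placeholders; the .fai index comes from samtools faidx):

# list the canonical contigs and their lengths from the BAM header and from the reference index
samtools idxstats old.bam | cut -f 1,2 | grep -E '^chr([0-9]+|X|Y|M)[[:space:]]' | sort > bam_contigs.txt
cut -f 1,2 reference.fa.fai | grep -E '^chr([0-9]+|X|Y|M)[[:space:]]' | sort > ref_contigs.txt
diff bam_contigs.txt ref_contigs.txt && echo "canonical contigs match"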

Option 2

Another option is to provide VarCA with the custom reference, if you have it. I haven't tested this but theoretically it should work.

Option 3

Finally, the last option is to provide VarCA with FASTQs (if you have them) rather than BAMs. VarCA will then redo the alignment step and create its own BAM files. I hesitate to recommend this option because I haven't tried it, and I doubt that the alignment performed by VarCA will be as robust to issues with scATAC. The alignment step in VarCA was initially designed with bulk ATAC in mind. However, for some users this might be the simplest and easiest strategy, especially if you're willing to perform alignment twice.

only install the necessary variant caller dependencies

Our pipeline requires that users install every variant caller at runtime, even if they don't actually use some of them.
For example, DELLY is not used by the pipeline by default, but it is still installed by Snakemake when it is executed for the first time.

Is there a way to improve this behavior so that only the required dependencies are installed at runtime? Currently, the answer is no.

Why? Well, there are only two steps in the Snakemake pipeline that execute the variant callers in the ensemble: the prepare_caller rule and the run_caller rule. Both steps must be general enough to work for any variant caller. The inputs and outputs of those rules dynamically adapt to each caller based on a single wildcard. If we wanted the dependencies of the rule to change too, we would need its conda env directive to change based on the caller wildcard. But Snakemake currently offers no way of doing this; you can't provide a lambda function to the env directive like you can for input, output, and params.

I really only see one solution to this issue, then: I submit a pull request (or feature request) for snakemake that adds the functionality we desire. I can't really think of anything else short of some sort of major refactor?

example data

provide example data for running the pipeline from start to finish

Exit (1) error when running pipeline using example data.

Hello,

I recently downloaded VarCA to run on my ATAC data. Installation was apparently successful, but upon running "./run.bash &" from within the VarCA directory I get an exit (1) error. All dependencies were installed correctly, as I did not get any errors. Just wondering what could be triggering this error.

I am running Python 3.10.5 on a MacBook Pro 2.4 GHz 8-Core Intel Core i9, 64GB RAM with OS X 12.4.

Thanks for your help,

Ricardo

Error: "An output file is marked as pipe, but consuming jobs are part of conflicting groups."

Hi @aryarm ,

I am running your pipeline starting with called peaks and a .BAM file. However, when doing a dry-run, I get the following error:

"Building DAG of jobs...
WorkflowError in line 350 of /home/######/projects/varCA/rules/prepare.smk:
An output file is marked as pipe, but consuming jobs are part of conflicting groups."

My sample.tsv file looks like:

EMC378 /absPathTo/EMC378.filtered.bam /absPathTo/EMC378_peaks.bed
EMC385 /absPathTo/EMC385.filtered.bam /absPathTo/EMC385_peaks.bed
....

In my config.yaml I define:

sample_file: /absPathTo/BSF_1013_2000HIV_hg38_perIndiv_atac_varCA_samples.tsv
genome: /absPathTo/genome.fa
snp_callers: [gatk-snp, varscan-snp, vardict-snp]
indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strelka]
snp_filter: ['gatk-snp~DP>5']
indel_filter: ['gatk-indel~DP>5']
snp_model: data/snp.rda
indel_model: data/indel.rda

I cloned your repository on Jul 26 13:11.
I am running Snakemake 6.6.0

I run the following command to do a dry-run:
snakemake --config out="$out_path" -p -n

I might have misunderstood something in the way different folders need to be defined, so apologies in advance if that is the case.

Thanks!

PS. Could it be that the example data is no longer available?

Vardict error.

Hi, I got the following error. (I added set -x; set -e; set -o pipefail to view the output of callers/vardict. I also set $sample=""; at line 32 of var2vcf_valid.pl.)

(...)
rule prepare_caller:
    input: /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20210312.varCA/BAMS/CM_Br334_C02_p20_run3.sorted.renamed.bam, /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20210312.varCA/20210312.ATAC/peaks/CM_Br334_C02_p20_run3.sorted/peaks.bed, /LAB-DATA/BiRD/resources/species/human/cng.fr/hs37d5/hs37d5_all_chr.fasta, callers/vardict
    output: /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20210312.varCA/20210312.ATAC/callers/CM_Br334_C02_p20_run3.sorted/vardict
    jobid: 0
    wildcards: sample=CM_Br334_C02_p20_run3.sorted, caller=vardict

+ set -e
+ bgzip
+ vardict-java -G /LAB-DATA/BiRD/resources/species/human/cng.fr/hs37d5/hs37d5_all_chr.fasta -N CM_Br334_C01_p32_run4.sorted -b /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20210312.varCA/BAMS/CM_Br334_C01_p32_run4.sorted.renamed.bam -v -c 1 -S 2 -E 3 /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20210312.varCA/20210312.ATAC/peaks/CM_Br334_C01_p32_run4.sorted/peaks.bed -VS SILENT
+ var2vcf_valid.pl
+ teststrandbias.R
/LAB-DATA/BiRD/users/lindenbaum-p/notebook/2021/20210312.varCA/varCA/.snakemake/conda/2501cca1/lib/R/bin/R: line 238: /LAB-DATA/BiRD/users/lindenbaum-p/notebook/2021/20210312.varCA/varCA/.snakemake/conda/2501cca1/lib/R/etc/ldpaths: No such file or directory

docker

we could simplify dependency installation even more using docker

Update
Actually, using docker does not simplify dependency installation, since when our docker container is created, it will have to install everything it needs using conda anyway.
But docker can improve reproducibility by making the pipeline executable on other platforms (ex: Windows or Mac). That is the only reason why we chose to pursue this issue.

don't use gatk selectvariants + allow varca to output SNVs and indels in the same VCF

Many callers output both InDels and SNVs in the same VCF.
In order to separate them from each other before outputting to the final TSVs, we use gatk SelectVariants. It conveniently allows us to keep GVCF blocks where there are no variants. However, there is no way to request it to label InDels as no-call when selecting SNVs and vice versa. It simply filters them out. This means that we lack depth and other valuable information at those sites.

I haven't been able to find a tool that achieves the behavior that we want, so I think we might have to write a custom script. We already have the classify.awk script, but it doesn't really work for every type of VCF ALT allele and it can only accept REF and ALT columns as input (and nothing else).
We should

  • modify classify.awk to work with
    • BND alleles
    • MIXED alleles
  • write a bash script to filter VCFs using classify.awk

varscan precision recall curve

Approximately 1.5 weeks ago, I got a varscan plot that looked like this:
[precision-recall plot: gatk vs varscan]
I don't remember what state the code was in when I created this plot, nor do I remember the parameters I passed to the code that creates the plots.
Then, I made changes to the code that produces the plots and got this instead:
[precision-recall plot: varscan, after the changes]
We think the first one is correct because it passes through the single-point precision-recall calculation at its inflection point (as both gatk and vardict currently do). So what went wrong with the varscan plot?

Well, one of the changes I made to the plot-creation code affected how I was labeling non-variants. I'm not sure what I was doing before, but I don't think non-variants were being given the correct PVAL. I also don't think this change is what broke the plot: I checked, and mislabeling the non-variants doesn't reproduce the plot I had before.

Another thing I considered is the precision with which the PVALs are read into Python. I'm currently using a float64, which should be more than enough space to store each PVAL, but I might have been using something of smaller precision before. Unfortunately, using less precision didn't return the plots to the way they were either.

ldpaths error when using prepare.yaml conda env

When running the prepare subworkflow, users of VarCA might occasionally run into the following error:

$CONDA_PREFIX/lib/R/bin/R: line 238: $CONDA_PREFIX/lib/R/etc/ldpaths: No such file or directory

This error can happen when Snakemake activates the prepare conda environment from two different processes at the same time on an NFS file system. This is usually a rare occurrence unless you are running many samples through the pipeline simultaneously. The problem (and some potential solutions) is discussed at length in this Github issue.

If this error happens to you, it is essentially because of bad luck. Because the issue leads to a corrupted R installation within the prepare conda environment that Snakemake created, we recommend deleting the prepare conda environment, so that Snakemake can automatically recreate it afresh. You can determine where the prepare conda environment was created by running snakemake --list-conda-envs and then delete it using rm -r.
To reduce your chances of running into this problem in the future, you can limit the number of jobs that run simultaneously by passing a smaller number of cores to Snakemake via the -j parameter in the run.bash script. The tradeoff of this approach is that VarCA may take longer to finish executing.
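As a concrete sketch of the cleanup step (the environment path below is a placeholder for whatever --list-conda-envs reports for the prepare environment):

# find where Snakemake created its conda environments, then remove the corrupted one
snakemake --list-conda-envs
rm -r .snakemake/conda/<prepare-env-hash>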

There are currently no plans for us to resolve this issue, since it is a bug within conda-forge and not within VarCA. But feel free to let us know below if you have any suggestions.

Classification model for mouse?

Hi,

Thank you for developing this useful tool. I would like to use varCA for mouse samples.

Can I use the current classification model?

If the answer is no, could you give me some advice on training a SNP model for mouse?

Thank you for your effort.

Bests,
Yiwei

skip rmdups and peak calling steps if input is BAM

Instead of FASTQs, users can provide their own BAM files as input to the pipeline. However, the current behavior still calls samtools to remove duplicates from the BAM files (and MACS2 to call peaks from them).

Instead, I could have it skip the duplicate-removal steps when a BAM file is provided. And I could have it skip the peak-calling step when a BED file is provided along with the BAM file.

Varscan generates a REF allele that is not compatible with GATK.

Hi again (sorry)

I got this new GATK error with VarScan output:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 49313350: unparsable vcf record with allele R, for input source: /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20210312.varCA/20210312.ATAC/callers/D02_i_WT8288_C03_p20_20/varscan/varscan.vcf.gz

The problem is that the REF column contains an allele ("R") that is not compatible with GATK:

gunzip -c  /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20210312.varCA/20210312.ATAC/callers/A12_CM_WT8288_C03_p32_12/varscan/varscan.vcf.gz | awk '($4=="R")'
hs37d5	31982609	.	R	.	.	PASS	ADP=7;WT=0;HET=0;HOM=0;NC=1	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	./.:.:7
hs37d5	32894833	.	R	.	.	PASS	ADP=1;WT=0;HET=0;HOM=0;NC=1	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	./.:.:1
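A hedged workaround until this is handled upstream: drop records whose REF is not a plain A/C/G/T/N sequence before the VCF reaches GATK (file names are illustrative):

gunzip -c varscan.vcf.gz | awk -F '\t' -v OFS='\t' '/^#/ || $4 ~ /^[ACGTN]+$/' | bgzip > varscan.clean.vcf.gz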

create a snakemake file for running both pipelines all at once

The prepare and classify pipelines are split into two different pipelines in case the user wants to tweak the output of the prepare pipeline before feeding it to the classify pipeline.

But the majority of users won't want to do that. It'll be more convenient for them if there's one single Snakemake file that can run both pipelines at once. This should be easy to do if I just import the rules defined in Snakefile-prepare and Snakefile-classify into the big Snakemake file.

Also, consider re-organizing the files in the repository to match Snakemake's recommendations. This might be especially useful if we ever output figures in a Snakemake report or Jupyter notebook (see issue #7).

ERROR conda.core.link:_execute(730)

Describe the bug
ERROR conda.core.link:_execute(730): An error occurred while installing package 'bioconda::bioconductor-genomeinfodbdata-1.2.1-r351_0'.

To Reproduce
Could not create conda environment from /lustre/grp/bitcap/jinxx/source/varCA/rules/../envs/prepare.yml:

here is my log file :
Building DAG of jobs...
Creating conda environment envs/prepare.yml...
Downloading and installing remote packages.
CreateCondaEnvironmentException:
Could not create conda environment from /lustre/grp/bitcap/jinxx/source/varCA/rules/../envs/prepare.yml:
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
ERROR conda.core.link:_execute(730): An error occurred while installing package 'bioconda::bioconductor-genomeinfodbdata-1.2.1-r351_0'.
Rolling back transaction: ...working... done
class: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.1-r351_0
location of failed script: /lustre/grp/bitcap/jinxx/source/varCA/.snakemake/conda/942e2afa/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==

==> script output <==
stdout: /lustre/grp/bitcap/jinxx/source/varCA/.snakemake/conda/942e2afa/share/bioconductor-genomeinfodbdata-1.2.1-0/GenomeInfoDbData_1.2.1.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 2fd536521151e2ff37217b5cfee8cec4:
https://bioconductor.org/packages/3.9/data/annotation/src/contrib/GenomeInfoDbData_1.2.1.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.1.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.1_src_all.tar.gz

stderr: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 415 100 415 0 0 374 0 0:00:01 0:00:01 --:--:-- 374
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
curl: (7) Failed to connect to bioarchive.galaxyproject.org port 443: No route to host
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
curl: (7) Failed to connect to depot.galaxyproject.org port 443: No route to host

return code: 1

kwargs:
{}

Traceback (most recent call last):
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1082, in call
return func(*args, **kwargs)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda_env/cli/main.py", line 80, in do_call
exit_code = getattr(module, func_name)(args, parser)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda_env/cli/main_create.py", line 142, in execute
result[installer_type] = installer.install(prefix, pkg_specs, args, env)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda_env/installers/conda.py", line 59, in install
unlink_link_transaction.execute()
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/core/link.py", line 281, in execute
self._execute(tuple(concat(interleave(itervalues(self.prefix_action_groups)))))
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/core/link.py", line 744, in _execute
raise CondaMultiError(tuple(concatv(
conda.CondaMultiErrorclass: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.1-r351_0
location of failed script: /lustre/grp/bitcap/jinxx/source/varCA/.snakemake/conda/942e2afa/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==

==> script output <==
stdout: /lustre/grp/bitcap/jinxx/source/varCA/.snakemake/conda/942e2afa/share/bioconductor-genomeinfodbdata-1.2.1-0/GenomeInfoDbData_1.2.1.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 2fd536521151e2ff37217b5cfee8cec4:
https://bioconductor.org/packages/3.9/data/annotation/src/contrib/GenomeInfoDbData_1.2.1.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.1.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.1_src_all.tar.gz

stderr: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 415 100 415 0 0 374 0 0:00:01 0:00:01 --:--:-- 374
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
curl: (7) Failed to connect to bioarchive.galaxyproject.org port 443: No route to host
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
curl: (7) Failed to connect to depot.galaxyproject.org port 443: No route to host

return code: 1

kwargs:
{}

: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/lustre/grp/bitcap/jinxx/miniconda3/bin/conda-env", line 7, in
sys.exit(main())
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda_env/cli/main.py", line 91, in main
return conda_exception_handler(do_call, args, parser)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1374, in conda_exception_handler
return_value = exception_handler(func, *args, **kwargs)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1085, in call
return self.handle_exception(exc_val, exc_tb)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1116, in handle_exception
return self.handle_application_exception(exc_val, exc_tb)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1132, in handle_application_exception
self._print_conda_exception(exc_val, exc_tb)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1136, in _print_conda_exception
print_conda_exception(exc_val, exc_tb)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1059, in print_conda_exception
stderrlog.error("\n%r\n", exc_val)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/logging/init.py", line 1475, in error
self._log(ERROR, msg, args, **kwargs)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/logging/init.py", line 1589, in _log
self.handle(record)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/logging/init.py", line 1598, in handle
if (not self.disabled) and self.filter(record):
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/logging/init.py", line 806, in filter
result = f.filter(record)
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/gateways/logging.py", line 61, in filter
record.msg = record.msg % new_args
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/init.py", line 132, in repr
errs.append(e.repr())
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/init.py", line 71, in repr
return '%s: %s' % (self.class.name, text_type(self))
File "/lustre/grp/bitcap/jinxx/miniconda3/lib/python3.9/site-packages/conda/init.py", line 90, in str
return text_type(self.message % self._kwargs)
ValueError: unsupported format character 'T' (0x54) at index 852

File "/lustre/grp/bitcap/jinxx/miniconda3/envs/snakemake/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 333, in create

run the post-processing steps in the prepare pipeline in parallel

At the end of the prepare pipeline, a couple of post-processing steps are performed on the merged TSV before we feed it to the classify pipeline. All of the scripts used in these steps support reading from stdin and writing to stdout, except for fillna.bash.

  • remove the first parameter from fillna.bash and make it read the TSV from stdin, instead
  • connect all of the post-processing steps together via pipes (see the sketch after this list)
    • this will allow us to save on file IO and wasted time compressing and uncompressing the file between steps
  • remove extra config params that nobody uses (like keepna, pure_numerics, and friends) - they just make things more complicated
  • mark extra files as temp
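The piping idea referenced above, as a very rough sketch; apart from fillna.bash, the script names and file names are placeholders for the real post-processing steps:

zcat merged.tsv.gz | \
    scripts/some_postprocessing_step.bash | \
    scripts/fillna.bash | \
    gzip > prepared.tsv.gz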

update dependencies and freeze them

update them to their newest versions and then freeze them in the envs file
(specifically, consider updating mlr)

this will decrease the chance of bugs arising in the future from new versions of the dependencies
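One hedged way to capture the exact versions of an environment that Snakemake has already solved (the prefix is a placeholder for whatever snakemake --list-conda-envs reports, and the output file name is just an example):

conda env export --no-builds -p .snakemake/conda/<env-hash> > envs/prepare.pinned.yml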

Not all of the samples requested have provided input

After preparing the required input, the pipeline can't seem to find the specified files or output directory. I don't see in the log files whether or not my sample file is recognized. I am hoping that there is an obvious issue with my file paths or the config, but I'm just not seeing it. Any help would be much appreciated.

Also, sample data ran to completion.

Log file:
(snakemake) root@c844f1072fc5:/varCA# cat out/*

/varCA/Snakefile:51: UserWarning: Not all of the samples requested have provided input. Proceeding with as many samples as is possible...
rule all:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Singularity containers: ignored
Job counts:
count jobs
1 all
1
[Thu Sep 9 11:59:06 2021]
localrule all:
jobid: 0
[Thu Sep 9 11:59:06 2021]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /varCA/.snakemake/log/2021-09-09T115906.454263.snakemake.log

My config (note that the output directory I specify is ignored):

(snakemake) root@c844f1072fc5:/varCA# grep -vF '#' configs/config.yaml | sed '/^$/d'

sample_file: DATA/data/samples.tsv
SAMP_NAMES: [2294, 2296]
genome: DATA/bwa/genome.fa
out: DATA/out_01
snp_callers: [gatk-snp, varscan-snp, vardict-snp]
indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strelka]
snp_filter: ['gatk-snp~DP>10']
indel_filter: ['gatk-indel~DP>10']
snp_model: DATA/data/snp.rda
indel_model: DATA/data/indel.rda

Sample file:

(snakemake) root@c844f1072fc5:/varCA# cat DATA/data/samples.tsv

2294 DATA/2294.dup.fix.bam DATA/2294.bed
2296 DATA/2296.dup.fix.bam DATA/2296.bed

BAM and bed files referenced in samples.tsv are present:

(snakemake) root@c844f1072fc5:/varCA# ls DATA/229* | xargs -n 1 basename

2294.bam
2294.bam.bai
2294.bed
2294.dup.bam
2294.dup.bam.bai
2294.dup.fix.bam
2294.dup.fix.bam.bai
2294_peaks.narrowPeak
2296.bam
2296.bam.bai
2296.bed
2296.dup.bam
2296.dup.bam.bai
2296.dup.fix.bam
2296.dup.fix.bam.bai
2296_peaks.narrowPeak

As are indexes:

root@c844f1072fc5:/varCA# ls DATA/bwa | xargs -n 1 basename

genome.dict
genome.fa
genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.fai
genome.fa.pac
genome.fa.sa

And models:

(snakemake) root@c844f1072fc5:/varCA# ls DATA/data | xargs -n 1 basename

README.md
indel.rda
indel.tsv.gz
samples.tsv
snp.rda
snp.tsv.gz

prepare.yml environment yields conflicts with strict channel_priority

If channel_priority is set to "strict" in one's .condarc, it becomes impossible to install the prepare.yml environment file. conda will just spit out a bunch of conflicts. Ideally, all of our environment files would work with strict channel-priority mode, since conda is set to make that the default in v5.0 (source).

We have two options.

  1. make the prepare.yml environment file work with strict channel priority configurations
  2. break up the prepare.yml environment file into separate environment files: one for each variant caller

Option 2 is probably better in the long-run. But option 1 might be more feasible, since I'm not sure how I would achieve option 2 (see #19).

output figures in a snakemake report

The classify pipeline creates several figures as part of its output. Snakemake has a report() function for marking these figures as such. Consider using it.

That way, the user could have the option of whether to generate the plots (rather than making them required outputs at the end of the classify pipeline). And, we could keep them in a nice, tidy HTML file for them to peruse whenever they're trying to test a trained model.

VarDict-java produces malformed VCF

I'm creating this issue to record a problem encountered by a user (through personal correspondence). They received the following error message:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 165135: unparsable vcf record with allele CCCCCTCCCCACTGTTCCAGTAGTCACTCCCTGGCTCCTCCCCAGGCCTCT<dup-8>AGGCCTCTGCTGCTCCTCCCCACTGTGTTCCAGTAGTCACTCCCTGGCTG

Based on the error message, it sounds like the VarDict-java tool is creating a malformed VCF allele:

CCCCCTCCCCACTGTTCCAGTAGTCACTCCCTGGCTCCTCCCCAGGCCTCT<dup-8>AGGCCTCTGCTGCTCCTCCCCACTGTGTTCCAGTAGTCACTCCCTGGCTG

The <dup-8> part of that allele is not valid in the VCF format, so GATK flags it and raises an exception.

It appears that someone else has already reported the issue in the VarDict repo. In the meantime, if anyone else encounters this while we wait for the issue to be resolved, I would recommend just discarding those alleles manually using awk just like we did in #25 . For example, you could edit line 17 of the callers/vardict file from this

teststrandbias.R | var2vcf_valid.pl | bgzip > "$output_dir/vardict.vcf.gz" && \

to this

teststrandbias.R | var2vcf_valid.pl | \
awk -F $"\t" -v 'OFS=\t' '/^#/ || $5 !~ /<dup/' | \
bgzip > "$output_dir/vardict.vcf.gz" && \

This will simply remove any lines in the VCF where the fifth column (for the ALT alleles) contains <dup. Ideally, we would keep those lines in the file and fix those alleles so that they are valid, since they potentially represent real structural variants that should be reported in VarCA's output. But without further information, I can't know what the correct allele should be, so I don't know how to properly change it using awk.

warn or error gracefully if index files don't exist

for either the reference genome, the BAM files, or the fasta files

we've been depending on the tools in our workflow to handle these errors, but we could also just handle them at the workflow level by listing them as dependencies of each rule in the Snakefiles
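As a stopgap at the shell level (separate from the rule-input approach described above), a sketch of a simple existence check with placeholder file names:

for f in genome.fa.fai sample.bam.bai; do
    [ -e "$f" ] || { echo "ERROR: missing index file: $f" >&2; exit 1; }
done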

excessive memory usage when predicting variants

The problem

@Jaureguy760 discovered that the predict_RF.R script uses an excessive amount of memory (up to ~400 times the size of its input data!). We should investigate why this is happening.

It might just be a quirk of the R tools we are using (ie mlr and ranger), so the following are some other solutions we could use if we can't get mlr and ranger to behave themselves.

Solution 1

There should be a way to make predictions on slices of the data at a time, so that we don't load the entire dataset into memory at once. Maybe we could do predictions on just 1000 rows at a time?

Solution 2

We could declare the expected memory usage of the predict rule via the resources directive, so that the cluster environment will know not to run too many of these jobs at the same time. We should add this to the predict rule:

resources:
    mem_mb=lambda wildcards, input, attempt: int(input.size_mb*500*attempt)

And add this parameter to the qsub command in the run.bash script:

-l h_vmem={resources.mem_mb}

snakemake validation

There are a lot of config options and it might be hard for a user to know how to fill them, despite all of the documentation I've provided. It would be great if we could validate the config files that the user provides, so that the user has instant feedback on whether they filled everything out correctly.

Snakemake lets us use JSON schemas to do this, but I think we'll need something a lot more robust. While JSON schemas might allow us to conditionally require options based on other options, I doubt it will allow us to validate the format and content of some of the pipeline's inputs in the way that I want them to. For example, it would be great if we could notify the user in advance if

  1. Their BAM files don't have read group information.
  2. Their trained RF model doesn't support columns in the datasets for which they want to predict variants
  3. The files they provided are not in the correct TSV format, or are otherwise missing some columns
  4. etc

It feels like those sorts of checks will require much more complicated validation logic than JSON schemas provide. Perhaps the best way to proceed would be to create a validation python module that uses argparse or something similar? We could import that module in the Snakefiles.

Update (10/22/20): There is an alternative to importing the validation module in the Snakefiles. Instead, we could create a single python script run.py that executes Snakemake (much like run.bash). And then, we could import the validation module there. This would also offer us the benefit of being able to place more complicated validation/preparation logic there in the future.
wait - no, that won't work because we won't have access to the dependencies that we need within that validation module unless it's running as a rule or checkpoint

mamba

Is your feature request related to a problem? Please describe.
I am trying to run varCA in mamba instead of conda

Describe the solution you'd like
In the run.bash there is a --use-conda flag. Is there a way to run the entire pipeline using mamba instead of conda? I am able to install snakemake using mamba, but when I execute run.bash it still tries to run the pipeline using conda and this fails.

