replikation / what_the_phage

WtP: Phage identification via nextflow and docker or singularity

Home Page: https://mult1fractal.github.io/wtp-documentation/

License: GNU General Public License v3.0

Nextflow 86.27% Dockerfile 3.44% R 2.61% Shell 4.65% Python 3.03%
phage-identification phage-sequences nextflow-pipelines nextflow phages bioinformatics

what_the_phage's Introduction


What the Phage (WtP)

  • by Christian Brandt & Mike Marquet
  • this tool is under active development; feel free to report issues and add suggestions
  • Use a release candidate for a stable experience via -r e.g. -r v1.2.0
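A minimal pinned invocation could look like this sketch (the input file name is a placeholder; the flags match commands shown elsewhere on this page):

  # hypothetical example: run WtP pinned to release v1.2.0 on an assembly
  nextflow run replikation/What_the_Phage -r v1.2.0 --fasta assembly.fa --cores 8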

Publication:

What the Phage: A scalable workflow for the identification and analysis of phage sequences

M. Marquet, M. Hölzer, M. W. Pletz, A. Viehweger, O. Makarewicz, R. Ehricht, C. Brandt

doi: https://doi.org/10.1093/gigascience/giac110

What is this repo

TL;DR

  • WtP is a scalable and easy-to-use workflow for phage identification and analysis. Our tool currently combines 12 established phage identification tools
  • An attempt to streamline the usage of various phage identification and prediction tools
  • The main focus is stability and data filtering/analysis for the user
  • The tool is intended for assembled fasta files to predict phages in contigs
  • Proper prophage detection is not implemented - but CheckV reports them
  • A full report can be found here


Documentation

  • The documentation contains:
    • General information
    • Installation guide
    • Tool overview
    • Result interpretation
    • Troubleshooting

what_the_phage's People

Contributors

chris-rands, hoelzer, luizirber, mult1fractal, papanikos, replikation


what_the_phage's Issues

multiPhate v1.1.0 Phage annotation

what is multiPhATE

multiPhate

  • fully automated computational pipeline for identifying and annotating phage genes in genome sequence
  • controlled by a configuration file, which can be tailored to run specific gene finders and to blast sequences against specific phage- and virus-centric data sets, in addition to more generic (genome, protein) data sets
  • runs at least one gene finding algorithm, then annotates the genome, gene, and protein sequences using blast and a set of fasta sequence databases, and uses an hmm search against the pVOG database. If more than one gene finder is run, PhATE will also provide a side-by-side comparison of the genes called by each gene caller.
  • Classification of each protein sequence into a pVOG group is followed by generation of an alignment-ready fasta file. By convention, genome sequence files end with extension, ".fasta"; gene nucleotide fasta files end with, ".fnt", and cds amino-acid fasta files end with, ".faa"

building in docker:

  • build a Dockerfile with biopython, emboss, blast, glimmer, prodigal, hmmer, trnascan-se.
  • use conda to set up everything except phanotate (see the sketch after the database list below)
needed databases:
  • download the databases automatically (BLAST, HMM); specify the path to the databases in the config file
  • NCBI virus genomes - ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
  • NCBI Refseq Protein - download using blast+: /bin/update_blastdb.pl refseq_protein
  • NCBI Refseq Gene - download using blast+: /bin/update_blastdb.pl refseqgene. This database contains primarily human sequences. To acquire bacterial gene sequences you may need to download them via http://ncbi.nlm.nih.gov/gene and process the data to generate a fasta data set. Support for doing this is not provided with the multiPhATE distribution
  • NCBI Swissprot - download using blast+: /bin/update_blastdb.pl swissprot
  • NR - ftp://ftp.ncbi.nlm.nih.gov/nr/
  • KEGG virus subset - (available by license) http://www.kegg.jp/kegg/download/
    • KEGG associated files - T40000.pep, T40000.nuc, vg_enzyme.list, vg_genome.list, vg_ko.list, vg_ncbi-geneid.list, vg_ncbi-proteinid.list, vg_pfam.list, vg_rs.list, vg_tax.list, vg_uniprot.list
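A conda-based setup for that container could look like the sketch below (channel choices are assumptions; phanotate is installed separately as noted above):

  # hypothetical conda environment for the multiPhATE dependencies (bioconda/conda-forge assumed)
  conda create -n multiphate -c bioconda -c conda-forge biopython emboss blast glimmer prodigal hmmer trnascan-se
  conda activate multiphate
  # phanotate is not covered here and would be built from source inside the image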

Test fastq option w/ long reads throws Error in library(ggplot2) : there is no package called ‘ggplot2’

I tested the wf again with some nanopore long reads using the new --fastq input option:

nextflow run phage.nf --fastq ../virify/results/SRR8811964_1.fastq/SRR8811964_1.fastq.unclassified.fastq --databases /homes/mhoelzer/data/nextflow-databases/ --workdir /homes/mhoelzer/data/nextflow-work-mhoelzer --cachedir /hps/nobackup2/singularity/mhoelzer/ -profile lsf

Seems to work very well until the final step:

[14/e74e2b] process > read_validation_wf:removeSmallReads (13) [100%] 13 of 13, cached: 7 ✔
[0e/1c10e1] process > read_validation_wf:fastqTofasta (4)      [100%] 13 of 13 ✔
[ac/08b407] process > metaphinder_wf:metaphinder (2)           [100%] 13 of 13 ✔
[cc/b561f6] process > metaphinder_wf:filter_metaphinder (1)    [100%] 1 of 1 ✔
[ac/7f5adf] process > virfinder_wf:virfinder (9)               [100%] 13 of 13 ✔
[8c/010786] process > virfinder_wf:filter_virfinder (1)        [100%] 1 of 1 ✔
[8d/9e7b2b] process > pprmeta_wf:pprmeta (8)                   [100%] 13 of 13 ✔
[be/254b31] process > pprmeta_wf:filter_PPRmeta (1)            [100%] 1 of 1 ✔
[17/009afe] process > parse_reads (1)                          [100%] 1 of 1 ✔
[ef/634e2e] process > r_plot_reads (1)                         [100%] 1 of 1, failed: 1 ✘

that errors with:

Error executing process > 'r_plot_reads (1)'

Caused by:
  Process `r_plot_reads (1)` terminated with an error exit status (1)

Command executed:

  #!/usr/bin/Rscript

  library(ggplot2)

  inputdata <- read.table("SRR8811964_1.fastq.unclassified_summary.csv", header = TRUE, sep = ";")

  pdf("phage-distribution.pdf", height = 6, width = 10)
    ggplot(data=inputdata, aes(x=type, y=amount)) +
    geom_bar(stat="identity") +
    theme(legend.position = "none") +
    coord_flip()
  dev.off()

Command exit status:
  1

Command output:
  (empty)

Command error:
  Error in library(ggplot2) : there is no package called ‘ggplot2’
  Execution halted
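The r_plot container is evidently missing the ggplot2 package; a hedged workaround is to install it into the image (or once at runtime) before the plotting step:

  # hypothetical fix: ensure ggplot2 is available inside the r_plot container
  Rscript -e 'if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2", repos = "https://cloud.r-project.org")'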

-resume is not working anymore?

Is it possible that -resume is not working anymore? Maybe because of the rnd IDs you are producing?

nextflow run phage.nf --fasta test-data/OX2_draft.fa --cores 4 --resume

does all the steps again for me (except stored databases)
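One hedged observation: Nextflow's built-in resume option takes a single dash, so --resume is parsed as a pipeline parameter and silently ignored. Independently, per-run random values inside process scripts (the rnd=... lines visible in other issues here) change the task hash and would defeat caching even with a correct flag:

  # core Nextflow options use a single dash; --resume becomes params.resume instead
  nextflow run phage.nf --fasta test-data/OX2_draft.fa --cores 4 -resume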

Own Database of phages on ncbi

Own database

  • create a list of accession numbers of phages within the repository under database/
    • we collect these via NCBI eutils during the nextflow run and build a database (a sketch follows this list)
    • see this process
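A hedged sketch of that eutils step using Entrez Direct (accessions.txt and the output names are placeholders):

  # hypothetical: fetch phage genomes by accession number and build a nucleotide BLAST DB
  epost -db nuccore -input accessions.txt | efetch -format fasta > phage_references.fa
  makeblastdb -in phage_references.fa -dbtype nucl -out phage_db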

PhageTerm: detection of phage termini and packaging mechanism

I just stumbled over this tool:

PhageTerm publication

that sounds interesting. It seems they are using different known phage replication strategies (such as DTR, direct terminal repeats) to describe phage sequences in more detail, beyond simply stating 'this could be a phage'.

Maybe it is possible to put PhageTerm on top of the already identified phages by all the other tools to provide additional information about their replication/packaging mechanisms.

Some evaluation testing

As I have to prepare some virus test data sets anyway, I just ran 4 assemblies through WtP. Honestly, I do not know exactly what to expect as an outcome from the files (I have to clarify with someone who was working on a virus pipeline here before and used them).

The assemblies are:

  • High_confidence_putative_viral_contigs.fna
  • Low_confidence_putative_viral_contigs.fna
  • tara_virus.fa
  • putative_prophages.fna

Pipeline finished successfully (yeah, on the LSF system in 28m).

However, I am concerned with the output, in particular not any hit with VirFinder. How do you filter the VirFinder output at the moment? And can you also please clarify again how the VirSorter output is filtered currently?
(and of course, some heat map plot tuning is needed, but this might anyway be resolved by including UpSetR and might not be that important for now.)

Results

High_confidence_putative_viral_contigs.pdf
Low_confidence_putative_viral_contigs.pdf
putative_prophages.pdf
tara_virus.pdf

upsetr fails if a txt file is empty

I just observed that the R code fails if an input file is empty.

Example:

nextflow run phage.nf --fasta test-data/PERVI_draft.fa

The output of virsorter is empty. Thus, upsetr code fails with

Command error:
  Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
    no lines available in input
  Calls: sort -> read.csv -> read.table
  Execution halted
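A hedged guard would be to drop zero-byte tool outputs before they reach the R code, so read.csv never sees an empty file (the *.list suffix is an assumption based on the filter outputs shown in other issues):

  # hypothetical: only hand non-empty tool outputs to the UpSetR step
  for f in *.list; do [ -s "$f" ] || rm "$f"; done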

Phage prediction tools to include

List of Tools

Implement?

Prophage-tools

  • Phaster
  • Virsorter
  • Phage_Finder
  • ProphET
  • Prophinder
  • PhiSpy
  • uv ?? phiweger

other interesting tools

Not possible to implement

deepvirfinder terminated with an error exit status (130)

Sorry guys,

I just tried some larger input files (30k - 160k contigs, 8 FASTA), also to see how the #29 UpSetR output looks. I ran this on the cluster, so I am not entirely sure if it is a cluster issue or also occurs locally. Calculations are running locally at the moment but take time.

Error executing process > 'deepvirfinder_wf:deepvirfinder (7)'

Caused by:
  Process `deepvirfinder_wf:deepvirfinder (7)` terminated with an error exit status (130)

Command executed:

  rnd=0.3255814060400286
  dvf.py -c 8 -i ERR579308_host_filtered_filt500bp.fa -o ERR579308_host_filtered_filt500bp
  cp ERR579308_host_filtered_filt500bp/*.txt ERR579308_host_filtered_filt500bp_${rnd//0.}.list

Command exit status:
  130

Command output:
  1. Loading Models.
     model directory /DeepVirFinder/models
  2. Encoding and Predicting Sequences.
     processing line 1
     processing line 156114

Command error:
  Using Theano backend.
  INFO (theano.gof.compilelock): Waiting for existing lock by process '214427' (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
  INFO (theano.gof.compilelock): Waiting for existing lock by process '10407' (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
  INFO (theano.gof.compilelock): Waiting for existing lock by process '10408' (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
  INFO (theano.gof.compilelock): Waiting for existing lock by process '65080' (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
  INFO (theano.gof.compilelock): Waiting for existing lock by unknown process (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
  INFO (theano.gof.compilelock): Waiting for existing lock by process '10406' (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
  INFO (theano.gof.compilelock): Waiting for existing lock by process '10409' (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
  INFO (theano.gof.compilelock): Waiting for existing lock by process '291597' (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
  INFO (theano.gof.compilelock): Waiting for existing lock by unknown process (I am process '302292')
  INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir
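Exit status 130 corresponds to termination by SIGINT, which fits a task that sat waiting on Theano's compile lock until it was killed. A hedged mitigation is to give each task its own compile directory so concurrent jobs sharing one home directory do not contend for a single lock (THEANO_FLAGS/base_compiledir is standard Theano configuration; the rest mirrors the failing command):

  # hypothetical: per-task Theano compile dir to avoid lock contention across cluster jobs
  export THEANO_FLAGS="base_compiledir=$PWD/.theano"
  dvf.py -c 8 -i ERR579308_host_filtered_filt500bp.fa -o ERR579308_host_filtered_filt500bp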

Filter Results from tools for further application (smatools, taxonomy, heatmap)

Metaphinder

Output from tool:

contigID  classification  ANI [%]  merged coverage [%]  number of hits  size [bp]
ctg1      phage           75.357   99.967               71              77616
ctg2      phage           12.049   16.991               61              45788
ctg3      phage           61.742   83.422               80              56341
ctg4      phage           75.595   99.995               55              18684
ctg5      phage           54.507   72.777               12              11233
ctg6      phage           58.68    78.916               32              10420
ctg7      phage           71.843   95.827               47              11046
ctg8      phage           51.998   67.979               14              8713
ctg9      phage           75.518   99.99                52              9716
ctg10     phage           74.659   98.796               47              8641

export LC_NUMERIC=en_US.utf-8
###### print contig only ########
 mkdir sorted_contig_only
 sort  -g  -k4,4 *.txt | awk '$2>=phage' | awk '{ print $1 }' | tail -n+2 > sorted_contig_only/sorted_contig_only.txt

output of the filter script: sorted_contig_only.txt

ctg7
ctg11
ctg10
ctg14
ctg1
ctg9
ctg4

printed contigs classified as 'phage' into a new file
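Note that awk '$2>=phage' compares column 2 against the uninitialized awk variable phage, i.e. the empty string, so every row passes. A corrected sketch (the same fix is proposed in the Metaphinder filter issue further down):

  # hypothetical corrected filter: compare against the literal string "phage"
  sort -g -k4,4 *.txt | awk '$2 == "phage"' | awk '{ print $1 }' | tail -n+2 > sorted_contig_only/sorted_contig_only.txt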

Mock data for Benchmarking

Mock data and real data for benchmarking

Mock data

  • see this issue for phage data
    • create reads out of this data and/or mix it with refseq genome data
    • use pomoxis for simulated ONT reads
      • try phage prediction on raw data/reads

Mockdata sources (paste here)

real data

  • utilize Mike's real metagenomic data
  • assemble and predict
  • check how it differs from screening against fastq

Pipeline execution on a cluster using LSF schedule system and Singularity images

I just wanted to give the pipeline a try here at the EBI server structure using Singularity and LSF to schedule jobs. I made a new branch and added the following profile to the configuration to get it ready to run:

  lsf {
      process.executor = 'lsf'
      singularity {
          enabled = true
          cacheDir = "/hps/nobackup2/singularity/mhoelzer"
      }
      workDir = "/hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-$USER"
      process {
          cache = "lenient"
          errorStrategy = "rerty"
          maxRetries = 1
          withLabel: virsorter { cpus = 24; memory = '24 GB'; container = 'multifractal/virsorter:latest' }
          withLabel: deepvirfinder { cpus = 24; memory = '24 GB'; container = 'multifractal/deepvirfinder:latest' }
          withLabel: marvel { cpus = 24; memory = '24 GB'; container = 'multifractal/marvel:latest' }
          withLabel: metaphinder { cpus = 24; memory = '24 GB'; container = ' multifractal/metaphinder:latest' }
          withLabel: pprmeta { cpus = 24; memory = '24 GB'; container = 'multifractal/ppr-meta:latest' }
          withLabel: r_plot { cpus = 4; memory = '4 GB'; container = 'replikation/r-phage-plot:latest' }
          withLabel: ubuntu { cpus = 4; memory = '4 GB'; container = 'ubuntu:bionic' }
          withLabel: virfinder { cpus = 24; memory = '24 GB'; container = 'multifractal/virfinder:latest' }
      }
  }

The pipeline starts well but I get problems while pulling the docker images and converting them to singularity images:

Error executing process > 'metaphinder_wf:metaphinder (1)'

Caused by:
  Failed to pull singularity image
  command: singularity pull  --name \ multifractal-metaphinder-latest.img docker:// multifractal/metaphinder:latest > /dev/null
  status : 255
  message:
    WARNING: pull for Docker Hub is not guaranteed to produce the
    WARNING: same image on repeated pull. Use Singularity Registry
    WARNING: (shub://) to pull exactly equivalent images.
    ERROR: Unknown container build Singularity recipe format: multifractal-metaphinder-latest.img
    ABORT: Aborting with RETVAL=255
    ERROR: pulling container failed!

I think this could be related to the fact that no tags are used for the docker images. There seems to be some issue with simply using :latest when converting docker images to singularity images.

I totally understand that this is not your main focus now, but getting the workflow running also on cluster systems with a job scheduler like LSF and Singularity support might be helpful in the future (and also interesting for some potential users).

I am not entirely sure, but this might be fixed by simply tagging your Docker images on DockerHub.
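Two hedged observations from the log above: the failing pull command contains a stray leading space in the image name (docker:// multifractal/metaphinder:latest, matching the space inside the metaphinder container string in the profile), and errorStrategy = "rerty" looks like a typo for retry. Pinning image tags is still worthwhile; a sketch (the 0.1 tag mirrors the tag used in a later issue):

  # hypothetical: publish a pinned tag instead of relying on :latest
  docker tag multifractal/metaphinder:latest multifractal/metaphinder:0.1
  docker push multifractal/metaphinder:0.1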

Metaphinder own blast db: BLAST db error

Hey, I hope you had some nice relaxed holidays and way too much food over xmas! :D

I tried to run the current master branch (on my laptop, Linux, default profile) but got a BLAST DB error from the metaphinder_own_db module:

(base) ➜  What_the_Phage git:(master) ✗ nextflow run phage.nf --fasta test-data/POI_draft.fa --cores 4 

...

Error executing process > 'metaphinder_own_DB_wf:metaphinder_own_DB (1)'

Caused by:
  Process `metaphinder_own_DB_wf:metaphinder_own_DB (1)` terminated with an error exit status (1)

Command executed:

  rnd=0.9202244176289235
  mkdir POI_draft
  MetaPhinder.py -i POI_draft.fa -o POI_draft -d blast_phage_DB/blast_phage_DB/phage_db
  mv POI_draft/output.txt POI_draft_${rnd//0.}.list

Command exit status:
  1

Command output:
  parsing commandline options...
  running BLAST...
  calculating ANI...

Command error:
  BLAST Database error: No alias or index file found for nucleotide database [blast_phage_DB/blast_phage_DB/phage_db] in search path [/tmp/nextflow-phages-martin/95/6a0a1fc0eed979f31534caff46ade8::]
  Traceback (most recent call last):
    File "/MetaPhinder/MetaPhinder.py", line 228, in <module>
      if (old_id != str(l[0])) and (old_id != ""):
  NameError: name 'l' is not defined

Work dir:
  /tmp/nextflow-phages-martin/95/6a0a1fc0eed979f31534caff46ade8

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

I did not look into details, maybe it's something you can easily fix. Seems just like some bash syntax error in the if clause?
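Actually, the traceback is Python, and the primary failure is the missing BLAST index; the NameError is just MetaPhinder crashing afterwards. A hedged check is whether the staged database directory really contains index files under the path the module passes to -d:

  # hypothetical: verify the nucleotide BLAST index exists where MetaPhinder expects it
  ls blast_phage_DB/blast_phage_DB/phage_db.n*    # .nhr/.nin/.nsq files should be present
  # if they are missing, rebuild the index from the reference FASTA
  makeblastdb -in phage_references.fa -dbtype nucl -out blast_phage_DB/blast_phage_DB/phage_db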

Chromomap

  • one chromomap for coverage
  • one chromomap for counts (ORFs)

Evaluation of large input files

This issue is for documenting the behavior of WtP for large input files. Based on this, @replikation might implement FASTA chunking to increase the speed of the pipeline.

case 1, aquadiva sample

bsub -n 4 -M 8.0G -R "rusage[mem=8.0G]" "nextflow run phage.nf --fasta /homes/mhoelzer/data/calc/aquadiva_kaiju/spades/H14_0_2_1/scaffolds.fasta --output /homes/mhoelzer/data/calc/aquadiva_kaiju/wtp/H14_0_2_1 -profile ebi --mp -resume"

(excluded metaphinder because of a previous issue)

  • 555 MB
  • 1,152,661 contigs
  • largest contigs:
    • 46863 NODE_1_length_46863_cov_23.219813
    • 40744 NODE_2_length_40744_cov_42.929662
    • 26449 NODE_3_length_26449_cov_28.914564
    • 23605 NODE_4_length_23605_cov_11.299066
    • 22256 NODE_5_length_22256_cov_27.842575
    • 22209 NODE_6_length_22209_cov_8.825675
    • 21099 NODE_7_length_21099_cov_6.212270
    • 20898 NODE_8_length_20898_cov_8.155352
    • 20440 NODE_9_length_20440_cov_6.053373
    • 18557 NODE_10_length_18557_cov_6.418549

started: Dec 31 12:50

Tools completed

  • virsorter: ~1h
  • virfinder: ~48h
  • deepvirfinder: running...
  • marvel: running...
  • pprmeta: ~8h

The job was aborted by the cluster after 2.5 days for unclear reasons. No stats for deepvirfinder and marvel.

VirSorter filtering

I saw that you filter the VirSorter output to only collect phages from the files

  • VIRSorter_cat-1.fasta
  • VIRSorter_cat-2.fasta
cat !{results}/Predicted_viral_sequences/VIRSorter_cat-[1,2].fasta | grep ">" | sed -e s/\\>VIRSorter_//g | sed -e s/-cat_1//g |\
  sed -e s/-cat_2//g  > virsorter.txt

Why not also include the cat-3 phages? To reduce the amount of false-positive hits? However, to have a somewhat fair comparison with the other tools it might be worth also including cat-3 phages identified by VirSorter?

And are you not interested in prophages at all?
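Extending the quoted filter to category 3 would be a small change; a sketch (whether cat-3 should be trusted is exactly the open question above):

  # hypothetical: also collect VirSorter category-3 predictions
  cat !{results}/Predicted_viral_sequences/VIRSorter_cat-[1,2,3].fasta | grep ">" | sed -e s/\\>VIRSorter_//g | sed -e 's/-cat_[123]//g' > virsorter.txt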

Metaphinder Filter is not correctly working

I think I found out why Metaphinder has such good results in your test data sets ;)

The filtering simply selects all contigs.

Your command in modules/filter_metaphinder.nf:

sort  -g  -k4,4 SRR8811960_1.unclassified.txt | awk '$2>=phage' | awk '{ print $1 }' | tail -n+2

needs to be changed to

sort  -g  -k4,4 SRR8811960_1.unclassified.txt | awk '$2>="phage"' | awk '{ print $1 }' | tail -n+2

or why not look for

awk '$2=="phage"'

?

I found this because of my Nanopore test run where VirSorter found 220 phages and Metaphinder reported 365,686 -- basically all of my input reads. By changing the above command I get 63,794 hits, still a lot but now only those labeled as phage.

@Stormrider935 please confirm that I interpreted the output file of Metaphinder correctly and if so, fix the filter step.

Run @LSF cluster: PPRMeta

Execution from the nextflow pipeline throws:

Error executing process > 'pprmeta_wf:pprmeta (1)'

Caused by:
  Process `pprmeta_wf:pprmeta (1)` terminated with an error exit status (127)

Command executed:

  cp PPR-Meta/* .
  ./PPR_Meta T7_draft.fa T7_draft.csv

Command exit status:
  127

Command output:
  (empty)

Command error:
  ./PPR_Meta: error while loading shared libraries: libmwlaunchermain.so: cannot open shared object file: No such file or directory

The git/database is cloned correctly.
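libmwlaunchermain.so belongs to the MATLAB Runtime (PPR-Meta ships as a compiled MATLAB binary), so the runtime libraries are presumably not on the loader path inside the container. A hedged fix, with the runtime version and install prefix as assumptions:

  # hypothetical: point the loader at the MATLAB Runtime that PPR_Meta was compiled against
  MCR=/usr/local/MATLAB/MATLAB_Runtime/v94
  export LD_LIBRARY_PATH="$MCR/runtime/glnxa64:$MCR/bin/glnxa64:$MCR/sys/os/glnxa64:$LD_LIBRARY_PATH"
  ./PPR_Meta T7_draft.fa T7_draft.csv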

Missing output file(s) `.sbt.phages` expected by process `sourmash_database:sourmash_download_DB`

  • I wanted to run one of our test-data files before starting with the integration of VIBRANT

  • my command nextflow run /home/mike/bioinformatics/What_the_Phage/phages.nf --fasta What_the_Phage/test-data/OX2_draft.fa

  • got this error:

Error executing process > 'sourmash_database:sourmash_download_DB'

Caused by:
  Missing output file(s) `.sbt.phages` expected by process `sourmash_database:sourmash_download_DB`

Command executed:

  sourmash compute --scaled 100 -k 21 --singleton --seed 42 -p 8 -o phages.sig phage_references.fa
  sourmash index phages phages.sig

Command exit status:
  0

Command output:
  (empty)

Command error:
  time taken to save signatures is 0.00039 seconds
  [... repeated "time taken to save signatures" lines while nodes 8148 through 8155 of 8155 were saved ...]
  8155 of 8155 nodes saved
  
  Finished saving nodes, now saving SBT json file.

Work dir:
  /tmp/nextflow-phages-mike/0e/5af05f70bd3177b06e6a4ff2d99383

  • the .sbt. file is missing in the folder /tmp/nextflow-phages-mike/0e/5af05f70bd3177b06e6a4ff2d99383 (it contains only phage_references.fa and phages.sig)

  • I also tried this command nextflow run replikation/What_The_Phage --fasta What_the_Phage/test-data/OX2_draft.fa and got the same error
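A hedged guess: a sourmash version mismatch. Older sourmash index releases write <name>.sbt.json plus a hidden .sbt.<name>/ directory, while newer ones write a single zip-based index, so the expected .sbt.phages output may simply never appear with the container's version. Checking what the step actually produced would confirm this:

  # hypothetical: inspect the sourmash version and what the index step wrote (including hidden files)
  sourmash --version
  ls -la    # look for phages.sbt.json, a .sbt.phages/ directory, or phages.sbt.zip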

Automatic Tool selection based on input

Automatic Tool selection

Solution for

#45 #46

Idea

  • some identification tools perform badly for certain "assembly" stats (e.g. too many contigs, or too large contigs)
  • we will implement an "assembly stats" process prior to the analysis with 3 identical output channels
  • depending on assembly stats
    output[0] -> always available; stable tools use this
    output[1] -> optional true; tools that can't handle n > x contigs use this; depending on assembly stats this channel might be empty, thus deactivating the tools linked to this channel
    output[2] -> optional true; tools that can't handle contigs of size > X bp use this; depending on assembly stats this channel might be empty, thus deactivating the tools linked to this channel

Implementation:

The question remains whether the .concat channel at the end can handle a missing output channel (e.g. marvel).
We might need to add dummy files that get filtered out (see the sketch below).
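A hedged bash-side sketch of the dummy-file idea: every tool wrapper guarantees an output file, and the merge step filters the placeholders out again (file names are placeholders):

  # hypothetical: emit a placeholder when a tool was skipped, drop it at the merge step
  [ -s marvel_output.list ] || echo "DUMMY" > marvel_output.list
  grep -h -v '^DUMMY$' *_output.list > all_hits.list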

Add the version tag to the configs

@Stormrider935

  • add the docker version tags to the config file; they are all tagged as latest instead of 0.1 etc.
  • make a test run and push it to master if it works
  • also see #10

.splitFasta()

add .splitFasta()

  • add a --splitFasta file input and a merge step to the workflow to allow the computation of large files on smaller systems
  • maybe control the "split amount" via a flag so users can adjust it until it works?

Splitting large input files into single FASTAs (Marvel, ...) produces many files

I think the current procedure of splitting a multi-FASTA file into single FASTA files containing only one contig can cause trouble for large input files (as done for Marvel).

(base) [mhoelzer@noah-login-01 ~]$ ls data/nextflow-work-mhoelzer/b1/139ca96b107d491ef6fb4d534361f9/H52_0_1_2_contigs/*.fa | wc -l
-bash: /usr/bin/ls: Argument list too long

Let's say you have an assembly with 500k contigs; then you produce a single folder with 500k files.

I recommend, if possible, working with subfolders and storing only 1k or so files per folder (see the sketch below).
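A hedged sketch of that bucketing, with 1,000 files per subfolder (directory names are placeholders):

  # hypothetical: distribute single-contig FASTAs into numbered subfolders of 1000 files each
  i=0
  for f in contigs/*.fa; do
    d="contigs/bucket_$((i / 1000))"
    mkdir -p "$d" && mv "$f" "$d/"
    i=$((i + 1))
  done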

PPR meta nextflow integration

nextflow run phage.nf --fasta 'test-data/*.fa'

N E X T F L O W ~ version 19.07.0
Launching phage.nf [festering_nightingale] - revision: 01c9983eeb
WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE

Profile: standard

Current User: mike
Nextflow-version: 19.07.0
Starting time: 27-07-2019 13:22 UTC
Workdir location:
/tmp/nextflow-work-mike

CPUs to use: 8
Output dir name: results

executor > local (21)
[skipped ] process > pprgetdeps [100%] 1 of 1, stored: 1 ✔
[ca/529358] process > virsorterGetDB [100%] 1 of 1, failed: 1
[0f/7e955e] process > deepvirfinder (1) [100%] 3 of 3, failed: 3
[00/e6fce0] process > marvel (4) [100%] 2 of 2, failed: 2
[19/2e4430] process > metaphinder (2) [100%] 5 of 5, failed: 5
[15/78388d] process > virfinder (5) [100%] 5 of 5, failed: 5
[- ] process > virsorter -
[80/0d0998] process > pprmeta (5) [100%] 5 of 5, failed: 5
[skipping] Stored process > pprgetdeps
[POII_draft, /home/mike/install_software/wf_phage_benchmark/test-data/POII_draft.fa]
[OX2_draft, /home/mike/install_software/wf_phage_benchmark/test-data/OX2_draft.fa]
[POI_draft, /home/mike/install_software/wf_phage_benchmark/test-data/POI_draft.fa]
[PERVI_draft, /home/mike/install_software/wf_phage_benchmark/test-data/PERVI_draft.fa]
[CRC_meta, /home/mike/install_software/wf_phage_benchmark/test-data/CRC_meta.fa]
WARN: Killing pending tasks (20)
Error executing process > 'pprmeta (1)'

Caused by:
Process pprmeta (1) terminated with an error exit status (126)

Command executed:

cp PPR-Meta/* .
./PPR_Meta POII_draft.fa POII_draft.csv

Command exit status:
126

Command output:
(empty)

Command error:
.command.sh: line 3: ./PPR_Meta: Permission denied

Work dir:
/tmp/nextflow-work-mike/ca/5bba323f6ad062d7e5e2dfc504d09b

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

in the work dir:

/tmp/nextflow-work-mike/ca$ ls
529358185c73fcf5081afe2834496a  5bba323f6ad062d7e5e2dfc504d09b

mike@workstation_iimk:/tmp/nextflow-work-mike/ca$ ls -l
total 8
drwxrwxr-x 2 mike mike 4096 Sep 6 08:51 529358185c73fcf5081afe2834496a
drwxrwxr-x 2 mike mike 4096 Sep 6 08:51 5bba323f6ad062d7e5e2dfc504d09b

.command.sh: No such file or directory
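Exit status 126 with 'Permission denied' usually means the execute bit is missing on the binary, plausibly lost during the copy; a hedged fix inside the process script (file names taken from the failing command above):

  # hypothetical: restore the execute bit after copying PPR-Meta into the work dir
  cp PPR-Meta/* .
  chmod +x PPR_Meta
  ./PPR_Meta POII_draft.fa POII_draft.csv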

Expected behaviour

Input (all_contigs.fa) is formatted accordingly for later processes

Observed behaviour

Error executing process > 'fasta_validation_wf:input_suffix_check (1)'

Caused by:
  Process `fasta_validation_wf:input_suffix_check (1)` terminated with an error exit status (1)

Command executed:

  case "all_contigs.fa" in
      *.gz) 
          zcat all_contigs.fa > all_contigs.fa
          ;;
      *.fna)
          cp all_contigs.fa all_contigs.fa
          ;;
      *.fasta)
          cp all_contigs.fa all_contigs.fa
          ;;
      *.fa)
          ;;
      *)
          echo "file format not supported...what the phage...(.fa .fasta .fna .gz is supported)"
          exit 1
  esac
  
  # tr whitespace at the end of lines
  sed 's/[[:blank:]]*$//' -i all_contigs.fa
  # remove ' and "
  tr -d "'"  < all_contigs.fa | tr -d '"' | tr -d "[]" > tmp.file && mv tmp.file all_contigs.fa
  # replace ( ) | . , / and whitespace with _
  sed 's#[()|.,/ ]#_#g' -i all_contigs.fa
  # remove empty lines
  sed '/^$/d' -i all_contigs.fa

Command exit status:
  1

Command output:
  (empty)

Command error:
  INFO:    Convert SIF file to sandbox...
  WARNING: underlay of /etc/localtime required more than 50 (72) bind mounts
  /bin/bash: .command.sh: No such file or directory
  INFO:    Cleaning up image...

Environment

Nextflow 20.01.0 (as intended by WtP), standard execution profile; .command.out in the process work dir is empty and .command.sh is present in the work dir.

Metaphinder validation testing

  • I put together 2 data-sets for validation testing :
  1. 10 files (pos_phage_x.fa)

  2. 10 files (neg_phage_x.fa)

structure of the files for validation testing:
>pos.phage.0
GCTACCTGGTGCACCGTGCCGCCTTGCCAGCCCAATGAGTGCATGAGATGTAAAGGAGTATTTGTCATGGTATCGCCCTTTCTTTGATGTGCGCCCTGGTTG

structure of the What_the_Phage/test-data/*.fa
>ctg1 len=150545
GCTACCTGGTGCACCGTGCCGCCTTGCCAGCCCAATGAGTGCATGAGATGTAAAGGAGTATTTGTCATGGTATCGCCCTTTCTTTGATGTGCGCCCTGGTTG

  • the idea was an easy readout system for the phage-distribution.pdf output from WtP

  • I tested WtP on the data sets:
    nextflow run replikation/What_The_Phage --fasta pos_phage_1.fa

Error executing process > 'metaphinder_wf:metaphinder (1)'

Caused by:
  Process `metaphinder_wf:metaphinder (1)` terminated with an error exit status (1)

Command executed:

  rnd=0.18523026702354528
  mkdir pos_phage_1
  MetaPhinder.py -i pos_phage_1.fa -o pos_phage_1 -d /MetaPhinder/database/ALL_140821_hr
  mv pos_phage_1/output.txt pos_phage_1_${rnd//0.}.list

Command exit status:
  1

Command output:
  parsing commandline options...

Command error:
  Traceback (most recent call last):
    File "/MetaPhinder/MetaPhinder.py", line 154, in <module>
      contigID,size = get_contig_size(contigfile)
    File "/MetaPhinder/MetaPhinder.py", line 29, in get_contig_size
      if l[0] == ">":
  IndexError: string index out of range
  • I tried to perform the task directly in the docker container
    docker run --rm -it -v $PWD:/home/mike/bioinformatics/Phage_fasta_files/pos_neg.phage_data/allinone/ multifractal/metaphinder:0.1 /bin/bash

  • and inside:
    MetaPhinder.py -i pos_phage_1.fa -o output.txt -d /MetaPhinder/database/ALL_140821_hr

parsing commandline options...
Traceback (most recent call last):
  File "/MetaPhinder/MetaPhinder.py", line 154, in <module>
    contigID,size = get_contig_size(contigfile)
  File "/MetaPhinder/MetaPhinder.py", line 29, in get_contig_size
    if l[0] == ">":
IndexError: string index out of range

  • as a control I downloaded the T7 bacteriophage genome FASTA file and got the same problem in nextflow and docker

  • as an additional control I tried the commands above with our current What_the_Phage/test-data/*.fa test data: no problems

  • The github page says this:

MetaPhinder classifies metagenomic contigs as of phage origin or not based on a
comparison to a phage database.
The script relies on BLAST which must be installed on your machine prior to running
MetaPhinder.
The input to MetaPhinder is a FASTA file of metagenomic contigs.

  • I will try to find contig databases and put together a new set of phage-positive and -negative data that looks like our current What_the_Phage/test-data/*.fa test data

  • We should mention in the readme that you can't use some random fasta files as input, otherwise metaphinder will complain (see the normalization sketch below)
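A hedged normalization step before MetaPhinder (seqkit is an assumption; any FASTA re-wrapper works): get_contig_size indexes the first character of every line, so blank lines or unwrapped records crash it:

  # hypothetical: re-wrap sequences and strip blank lines before MetaPhinder
  seqkit seq -w 60 pos_phage_1.fa | sed '/^$/d' > pos_phage_1.clean.fa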

`fasta_validation_wf` declares 0 input channels but 1 were specified

  • I wanted to run the workflow nextflow run replikation/What_the_Phage and nextflow run phage.nf so I could try to implement samtools and grab positive contigs, but instead I got this error:

Profile: standard

Current User: mike
Nextflow-version: 19.07.0
Starting time: 27-07-2019 13:22 UTC
Workdir location:
/tmp/nextflow-phages-mike

CPUs to use: 8
Output dir name: results

Workflow fasta_validation_wf declares 0 input channels but 1 were specified

-- Check script 'What_the_Phage/phage.nf' at line: 179 or see '.nextflow.log' file for more details

commands
  • nextflow run What_the_Phage/phage.nf --fasta What_the_Phage/test-data/T7_draft.fa
  • nextflow run replikation/What_the_Phage --fasta What_the_Phage/test-data/T7_draft.fa

sourmash with kmer based json DB

sourmash / kmer approach implementation

  • not sure about the quality but I want to test this

database

  • we need to pre-build a phage DB. @stormrider, can you collect links to DBs for this and report them here?

implementation

  • depending on the size we could either pre-build a DB or compute it live while WtP is running so it is always up to date (via efetch)

RUN @LSF cluster: MARVEL

[e4/19acd4] process > marvel_wf:input_suffix_check (1) [100%] 1 of 1 ✔
[8e/c04eb1] process > marvel_wf:marvel (1)             [100%] 1 of 1, failed: 1 ✘
[-        ] process > marvel_wf:filter_marvel          -
Error executing process > 'marvel_wf:marvel (1)'

Caused by:
  Process `marvel_wf:marvel (1)` terminated with an error exit status (127)

Command executed:

  mkdir fasta_dir_T7_draft
        cp T7_draft.1.fa T7_draft.2.fa T7_draft.3.fa T7_draft.4.fa T7_draft.5.fa T7_draft.6.fa fasta_dir_T7_draft/
        # Marvel
        marvel_bins.py -i fasta_dir_T7_draft -t 8 > results.txt
        # getting contig names
        filenames=$(grep  "T7_draft\." results.txt | cut -f2 -d " ")
        while IFS= read -r samplename ; do
         head -1 fasta_dir_T7_draft/${samplename}.fa >> T7_draft.txt
        done < <(printf '%s
  ' "${filenames}")

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 5: marvel_bins.py: command not found

sourmash error for NCBI phage FASTA header

Input:
https://www.rna.uni-jena.de/supplements/wtp/zheng_2019.fasta

Error:

Caused by:
  Process `sourmash_wf:split_multi_fasta (2)` terminated with an error exit status (1)

Command executed:

  mkdir zheng_2019_contigs/
  
  while read line
    do
  if [[ ${line:0:1} == '>' ]]
  then
    outfile=${line#>}.fa
    echo ${line} > zheng_2019_contigs/${outfile}
  else
    echo ${line} >> zheng_2019_contigs/${outfile}
  fi
    done < zheng_2019.fa

Command exit status:
  1

Command output:
  (empty)

I think the problem is parsing such crazy NCBI FASTA headers?

  .command.sh: line 9: zheng_2019_contigs/gi|1003341871|gb|KU310943_1|__Pseudomonas__phage__YMC11/07/P54_PAE_BP___complete__genome.fa: No such file or directory

You might consider renaming all FASTA ids to something useful at the beginning of WtP and then renaming them back at the end (a sketch follows).
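A hedged renaming sketch (awk; file names are placeholders): give every record a simple id and keep a mapping table so the original headers can be restored at the end:

  # hypothetical: rename headers to contig_N and record the mapping for later restoration
  awk '/^>/ { n++; print ">contig_" n "\t" $0 > "header_map.tsv"; print ">contig_" n; next } { print }' zheng_2019.fa > zheng_2019.renamed.fa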

Length cut-off for input sequences

  • @hoelzer: For example, the Tara ocean virome paper uses a combination of virsorter/virfinder and a length cutoff of 1.5 kb. Personal communication with Matt Sullivan: he would not trust any classified contig below 5 kb ;) and the length filter will reduce the run time of all tools and yield a better heatmap/upset figure that is more reliable. At the moment pprmeta and virfinder report insanely high numbers because they also predict almost all super-short contigs as phages (a filtering sketch follows below).
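A minimal pre-filter along those lines (seqkit is an assumption; the 1,500 bp threshold matches the Tara cutoff mentioned above):

  # hypothetical: drop contigs shorter than 1.5 kb before running the identification tools
  seqkit seq -m 1500 assembly.fa > assembly.filt1500.fa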

Nanopore data: MARVEL can not handle sequences longer than 65535 Unicode code units

Nothing important for now, but I was curious and just threw some Nanopore reads into the workflow; MARVEL reported:

Error executing process > 'marvel_wf:marvel (1)'

Caused by:
  Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
/groovy/script/ScriptB029065F62D4DB146CF4723657DAAB3D: 1: String too long. The given string is 13054086 Unicode code units long, but only a maximum of 65535 is allowed.
 @ line 1, column 16.
   __$$_out.print("""
                  ^

1 error

So it seems a length filter needs to be applied for MARVEL when it is used with long sequences. Are all input test data sets actually smaller than this size? I mean, it is totally possible that someone just starts the workflow with a FASTA file containing longer sequences.

Data

  • Discussion of data we want to use and include

Publications overview

  • overview of publications that benchmark phage identification tools
  • and a short summary about each tool and/or benchmark links

Add optional parameters for workdir, params.cloudDatabase, and cacheDir

Currently, the LSF configuration looks like this:

 lsf {
        params.cloudProcess = true
        params.cloudDatabase = '/homes/mhoelzer/data/nextflow-databases/'
        workDir = "/hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-$USER"

        process.executor = 'lsf'
        singularity {
            enabled = true
            autoMounts = true
            cacheDir = "/hps/nobackup2/singularity/mhoelzer"
        }
        params.cpus = params.cores

        process {
            cache = "lenient"
            withLabel: virsorter { cpus = 8; memory = '8 GB'; container = 'multifractal/virsorter:latest' }
} }

So this is fixed to my environment at the EBI LSF cluster. Do you agree that it would be good to have input parameters for

  • cloudDatabase // where the databases are stored; default could be ./nextflow-autodownload-databases
  • workDir // where the work files are written; this can be a lot of stuff and I have to define a specific place for it on the cluster here; default could be /tmp/nextflow-phages-$USER
  • cacheDir // this tells Singularity where the images are

If you agree, I would implement this.
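Such parameters already appear in a later test run on this page, so an invocation could then look like this sketch (paths are placeholders):

  # hypothetical: override databases, work dir and Singularity cache per run
  nextflow run phage.nf --fasta assembly.fa --databases /data/nextflow-databases/ --workdir /scratch/nextflow-work-$USER --cachedir /scratch/singularity -profile lsf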

deactivating all tools except one fails upsetr

  • upsetR won't plot if it has only one input file

  • reproduce via:

    • ./phage.nf -profile git_action
  • WtP ignores this error but still creates the heatmap (so no big issue)

  • Error message:

Error in 1:ncol(data) : argument of length 0
Calls: upset -> FindStartEnd
  • that's all the error messages here

Get phage positive reads

enhancement

  • WtP should extract the "phage"-positive hits from the fastq files and give them to the user as a result output
  • similar to samtools and contigs
  • this would make downstream analysis much more useful after the classification (see the sketch below)
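A hedged extraction sketch (seqkit is an assumption; the id list would come from the per-tool filter outputs):

  # hypothetical: pull phage-positive reads out of the original FASTQ by id
  seqkit grep -f phage_positive_ids.txt reads.fastq > phage_positive.fastq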
