fmalmeida / mpgap

Multi-platform genome assembly pipeline for Illumina, Nanopore and PacBio reads

Home Page: https://mpgap.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Languages: Nextflow 94.38%, Dockerfile 5.62%
Topics: hybrid-assemblies, illumina, pipeline, genome-assembly, polish, pacbio, nanopore, unicycler, spades, flye

mpgap's Introduction

Badges: F1000 paper · GitHub release · documentation · Nextflow · run with Docker · run with Singularity · license · follow on Twitter · Zenodo archive


MpGAP pipeline

A generic multi-platform genome assembly pipeline


See the documentation »

Report Bug · Request Feature

About

MpGAP is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. It is an easy-to-use pipeline that adopts well-known software for de novo genome assembly of Illumina, PacBio and Oxford Nanopore sequencing data, through Illumina-only, long-reads-only or hybrid modes.

This pipeline wraps up the following software:

Assemblers: Hifiasm, Canu, Flye, Raven, Shasta, wtdbg2, Haslr, Unicycler, SPAdes, Shovill, MEGAHIT
Polishers: Nanopolish, Medaka, gcpp, Polypolish and Pilon
Quality check: QUAST, BUSCO and MultiQC

Release notes

Are you curious about changes between releases? See the changelog.

  • We strongly recommend using the latest version hosted in the master branch, which is Nextflow's default.
    • The latest version will always receive support and bug fixes, and it generally maintains the same processes found in previous versions (things are mainly added rather than removed).
    • But, if you really need to execute an earlier release, please see the instructions for doing so (a minimal example is sketched below).
  • Versions below 3.0 are no longer supported.
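
If you do need to run an earlier release, Nextflow's built-in -r flag selects a tagged revision. A minimal sketch, assuming a hypothetical tag name (list the valid ones with nextflow info fmalmeida/mpgap):

# pull and run a specific tagged release instead of master
# v3.1 is an illustrative tag; replace it with a real one from the releases page
nextflow pull fmalmeida/mpgap -r v3.1
nextflow run fmalmeida/mpgap -r v3.1 -profile docker --help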

Further reading

This pipeline has two complementary pipelines (also written in Nextflow) for NGS preprocessing and prokaryotic genome annotation, which together give the user a complete workflow for bacterial genomics analyses.

Feedback

In the pipeline we always try to design the workflow and its execution dynamics to be as generic as possible and suited to the widest range of use cases.

Therefore, feedback is very welcome. If you believe that your use case is not covered by the pipeline, have enhancement ideas or found a bug, please do not hesitate to open an issue to discuss it.

Installation

  1. Install Nextflow:

    curl -s https://get.nextflow.io | bash
  2. Give it a try:

    nextflow run fmalmeida/mpgap --help
  3. Download required tools

    • for docker

      # for docker
      docker pull fmalmeida/mpgap:v3.2
      
      # run
      nextflow run fmalmeida/mpgap -profile docker [options]
    • for singularity

      # for singularity --> prepare env variables
      # remember to properly set NXF_SINGULARITY_LIBRARYDIR
      # read more at https://www.nextflow.io/docs/latest/singularity.html#singularity-docker-hub
      export NXF_SINGULARITY_LIBRARYDIR=<path in your machine>    # Set a path to your singularity storage dir
      export NXF_SINGULARITY_CACHEDIR=<path in your machine>      # Set a path to your singularity cache dir
      export SINGULARITY_CACHEDIR=<path in your machine>          # Set a path to your singularity cache dir
      
      # TODO: ADD Information about TMPDIR
      
      # run
      nextflow run fmalmeida/mpgap -profile singularity [options]
    • for conda

      # for conda
      # it is better to create envs with mamba for faster solving
      wget https://github.com/fmalmeida/mpgap/raw/master/environment.yml
      conda env create -f environment.yml   # advice: use mamba
      
      # must be executed from the base environment
      # This tells nextflow to load the available mpgap environment when required
      nextflow run fmalmeida/mpgap -profile conda [options]

      🎯 Please make sure to also download its BUSCO databases. See the explanation in the "Note on conda" section below.

  4. Start running your analysis

    nextflow run fmalmeida/mpgap -profile <docker/singularity/conda>

🔥 Please read the documentation below on selecting between the conda, docker and singularity profiles, since the tools are made available differently depending on the chosen profile.

Quickstart

A few test datasets have been made available so that users can quickly try out the features of the pipeline:

# short-reads
nextflow run fmalmeida/mpgap -profile test,sreads,<docker/singularity>

# long-reads
nextflow run fmalmeida/mpgap -profile test,lreads,<ont/pacbio>,<docker/singularity>

# hybrid
nextflow run fmalmeida/mpgap -profile test,hybrid,<ont/pacbio>,<docker/singularity>

Documentation

Complete online documentation. »

Selecting between profiles

Nextflow profiles are a set of "sensible defaults" for the resource requirements of each of the steps in the workflow that can be enabled with the command-line flag -profile. You can learn more about Nextflow profiles in the Nextflow documentation.

The pipeline has "standard profiles" set to run the workflows with either conda, docker or singularity using the local executor, which is Nextflow's default and basically runs the pipeline processes on the computer where Nextflow is launched. If you need to run the pipeline with another executor such as SGE, LSF or SLURM, take a look at Nextflow's manual page on how to configure one in a new custom profile in your personal copy of the MpGAP config file, and take advantage of the fact that Nextflow allows multiple profiles to be used at once, e.g. -profile conda,sge. A sketch of such a custom profile is shown below.
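
The sketch below is illustrative only; the profile name, queue and the SLURM executor are assumptions and must be adapted to your cluster and to the pipeline's actual config layout:

# write a small custom config declaring a scheduler profile (SLURM used as an example)
cat > my_hpc.config << 'EOF'
profiles {
  slurm_custom {
    process.executor = 'slurm'
    process.queue    = 'long'   // hypothetical queue name
  }
}
EOF

# combine it with one of the pipeline's software profiles
nextflow run fmalmeida/mpgap -c my_hpc.config -profile conda,slurm_custom [options]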

By default, if no profile is chosen, the pipeline will try to load the tools from the local machine's $PATH. The available pre-set profiles for this pipeline are docker, conda and singularity; you can choose between them as follows:

  • conda

    # must be executed from the base environment
    # This tells nextflow to load the available mpgap environment when required
    nextflow run fmalmeida/mpgap -profile conda [options]
  • docker

    nextflow run fmalmeida/mpgap -profile docker [options]
  • singularity

    nextflow run fmalmeida/mpgap -profile singularity [options]

Note on conda

📖 Please use conda only as a last resort

Instructions to create the required conda environment are found in the installation section.

The conda profile will only work on linux-64 machines, because some of the tools only provide binaries for this platform, and others that had to be placed in the "bin" directory to avoid version incompatibilities were also compiled for linux-64. A few examples are: wtdbg2, ALE (used as an auxiliary tool in the Pilon polishing step), SPAdes v3.13 for Unicycler, and others.

Therefore, be aware that -profile conda will only work on linux-64 machines. Users on other systems must use docker or singularity.

Finally, the main conda packages in the environment.yml file have been "frozen" to specific versions to make environment solving faster. If you see that a tool has a new release and would like it updated in the pipeline, please open an issue.

Also, since the automatic download of BUSCO databases is broken in QUAST 5.0.2, when using conda you must download the BUSCO databases yourself so that QUAST can properly run the assembly quality check step.

CONDA_PREFIX is the base/root directory of your conda installation

# create the directory
mkdir -p $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco/

# bacteria db
wget -O $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco/bacteria.tar.gz https://busco.ezlab.org/v2/datasets/bacteria_odb9.tar.gz

# eukaryota db
wget -O $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco/eukaryota.tar.gz https://busco.ezlab.org/v2/datasets/eukaryota_odb9.tar.gz

# fungi db
wget -O $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco/fungi.tar.gz https://busco.ezlab.org/v2/datasets/fungi_odb9.tar.gz
chmod -R 777 $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco

# get the augustus database
# must be executed last because its download links for bacteria, fungi and eukaryota are broken
# it only works for augustus
conda activate mpgap-3.2 && quast-download-busco

Explanation of hybrid strategies

Hybrid assemblies can be produced with two available strategies. Please read more about the strategies and how to set them up in the online documentation.

➡️ They are chosen with the parameter --hybrid_strategy.

Strategy 1

It uses the hybrid assembly modes from Unicycler, Haslr and/or SPAdes.

Strategy 2

It produces a long-reads-only assembly and then polishes it (corrects errors) with short reads using Pilon. By default, it runs 4 rounds of polishing (params.pilon_polish_rounds).

Example:

# run the pipeline setting the desired hybrid strategy globally (for all samples)
nextflow run fmalmeida/mpgap \
  --output output \
  --max_cpus 5 \
  --input "samplesheet.yml" \
  --hybrid_strategy "both"

🔥 This will perform, for all samples, both strategy 1 and strategy 2 hybrid assemblies. Please read more about it in the manual reference page and the samplesheet reference page.
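
The strategy can also be set per sample inside the samplesheet. The sketch below follows the samplesheet examples shown elsewhere on this page, but the illumina/nanopore layout and the per-sample hybrid_strategy key are assumptions; confirm them in the samplesheet reference page:

# hypothetical samplesheet for a single hybrid sample (Illumina + Nanopore)
cat > samplesheet.yml << 'EOF'
samplesheet:
  - id: sample_1
    illumina:
      - sample_1_R1.fastq.gz
      - sample_1_R2.fastq.gz
    nanopore: sample_1_ont.fastq.gz
    genome_size: 4m
    hybrid_strategy: 2          # per-sample override (assumed key name)
EOF

# run it
nextflow run fmalmeida/mpgap --input samplesheet.yml -profile docker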

Usage

For understanding pipeline usage and configuration, users must read the complete online documentation »

Using the configuration file

All parameters shown above can be, and are advised to be, set through the configuration file. When a configuration file is used, the pipeline is executed as nextflow run fmalmeida/mpgap -c ./configuration-file. Your configuration file is what tells the pipeline which type of data you have and which processes to execute; therefore, it needs to be correctly configured.

  • To create a configuration file in your working directory:

    nextflow run fmalmeida/mpgap --get_config
    

Interactive graphical configuration and execution

Via NF tower launchpad (good for cloud env execution)

Nextflow has an awesome feature called NF Tower. It allows users to quickly customise and set up the execution and configuration of cloud environments to run any Nextflow pipeline from nf-core, GitHub (this one included), Bitbucket, etc. Because the pipeline ships a compliant JSON schema for its parameters, configuring it in NF Tower is easier, since the system will render an input form.

Check out more about this feature at: https://seqera.io/blog/orgs-and-launchpad/

Via nf-core launch (good for local execution)

Users can trigger a graphical and interactive pipeline configuration and execution using the nf-core launch utility. nf-core launch starts an interactive form in your web browser or command line so that you can configure the pipeline step by step and then start its execution.

# Install nf-core
pip install nf-core

# Launch the pipeline
nf-core launch fmalmeida/mpgap


Known issues

  1. Whenever unicycler is used with unpaired reads, an odd platform-specific SPAdes-related crash seems to happen randomly, as discussed in rrwick/Unicycler#188.
  • As a workaround, Ryan suggests using the --no_correct parameter, which solves the issue and does not negatively impact assembly quality.
  • Therefore, if you run into this error when using unpaired data, you can activate this workaround (see the example commands below) with:
    • --unicycler_additional_parameters " --no_correct "
  2. Sometimes the shovill assembler can fail, and cause the pipeline to fail, due to problems estimating the genome size. This is actually very simple to solve: instead of letting shovill estimate the genome size, you can pass the information to it and prevent the failure:
    • --shovill_additional_parameters " --gsize 3m "
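
For example, the two workarounds above could be appended to a normal run as sketched below (the samplesheet name and the 3m genome size are placeholders):

# unpaired short reads: skip unicycler's read-correction step
nextflow run fmalmeida/mpgap \
  --input "samplesheet.yml" \
  --unicycler_additional_parameters " --no_correct " \
  -profile docker

# shovill failing on genome size estimation: pass the expected size explicitly
nextflow run fmalmeida/mpgap \
  --input "samplesheet.yml" \
  --shovill_additional_parameters " --gsize 3m " \
  -profile docker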

Citation

In order to cite this pipeline, please refer to:

Almeida FMd, Campos TAd and Pappas Jr GJ. Scalable and versatile container-based pipelines for de novo genome assembly and bacterial annotation. [version 1; peer review: awaiting peer review]. F1000Research 2023, 12:1205 (https://doi.org/10.12688/f1000research.139488.1)

Additionally, archived versions of the pipeline are also found in Zenodo.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the GPLv3.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, users are encouraged to cite the programs used in this pipeline. Links to the tools and data used in this pipeline are in the list of tools.

mpgap's People

Contributors

fmalmeida · scintilla9


mpgap's Issues

Add example of non-bacterial dataset analysis (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Generate a new page in the web documentation showing the analysis of a fungal or plant sequencing dataset. Make sure it has the necessary command lines from input to output, so that the analysis can be reproduced, and also add an overview of the generated results to the web page.

Once done, check how easily we can update the paper to provide an additional Zenodo archive for the non-bacterial analysis (ngs-preprocess + MpGAP).

Add a simple parameter to handle starting memory settings

This issue relates to issues #52 and #59, where users faced memory errors and had to adapt the config so that they could use more memory on the first try, instead of having to wait for retries.

By default, the pipeline first tries with a small amount of resources and then uses the full amount specified by the max parameters:

// Assemblies will first try to adjust themselves to a parallel execution
    // If it is not possible, then it waits to use all the resources allowed
    withLabel:process_assembly {
      cpus   = {  if (task.attempt == 1) { check_max( 6 * task.attempt, 'cpus'       ) } else { params.max_cpus   } }
      memory = {  if (task.attempt == 1) { check_max( 20.GB * task.attempt, 'memory' ) } else { params.max_memory } }
      time   = {  if (task.attempt == 1) { check_max( 24.h * task.attempt, 'time'    ) } else { params.max_time   } }
      
      // retry at least once to try it with full resources
      errorStrategy = { task.exitStatus in [1,21,143,137,104,134,139,247] ? 'retry' : 'finish' }
      maxRetries    = 1
      maxErrors     = '-1'
    }

    // Quast sometimes can take too long
    withName:quast {
      cpus   = {  if (task.attempt == 1) { check_max( 4 * task.attempt, 'cpus'       ) } else { params.max_cpus   } }
      memory = {  if (task.attempt == 1) { check_max( 10.GB * task.attempt, 'memory' ) } else { params.max_memory } }
      time   = {  if (task.attempt == 1) { check_max( 12.h * task.attempt, 'time'    ) } else { params.max_time   } }
      
      // retry at least once to try it with full resources
      errorStrategy = { task.exitStatus in [21,143,137,104,134,139,247] ? 'retry' : 'finish' }
      maxRetries    = 1
      maxErrors     = '-1'
    }

It would probably be good to also define a parameter to configure the starting memory and thread amounts, which would be used in the first attempt of these modules.

Maybe --start_asm_mem & --start_asm_cpus.
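
If such parameters were added (later run logs on this page already show a start_asm_mem value), usage could look like the sketch below; the flag names, values and value format are illustrative only:

# hypothetical: request a larger first-attempt allocation for assembly processes
nextflow run fmalmeida/mpgap \
  --input "samplesheet.yml" \
  --start_asm_mem "40.GB" \
  --start_asm_cpus 10 \
  -profile docker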

add testing git actions for PRs

To facilitate contributions and updates, it would be great to create GitHub Actions that test the pipeline with the available profiles and technologies.

quast generating empty files

So, I've consistently observed that the quast step fails, and even though the pipeline points to the work directory so the files can be checked, these appear to be empty. (A screenshot of the quast folder was attached.)

I attach the logs and the files that were not 0-sized.
I will let you know if it keeps failing during my tests, and whether the fix you provided in the config file allows me to bypass the error and avoid the pipeline crash. Thank you so much.
nextflow.log.txt
output.log.txt
quast_files.zip

add a third hybrid strategy

Add another hybrid strategy for samples where this might be the best option.

This strategy would perform a short-reads assembly and then scaffold it with long reads.

allow hifiasm to use hi-c and parental data

Now that hifiasm has been included in the pipeline, we must assess how the pipeline can be adapted to allow the user to pass Hi-C and/or parental data to the tool, as specified on its webpage.

It is a follow-up of #70

update documentation about the configuration in either config or samplesheet

The pipeline now has a few configuration parameters that can be set globally for all samples at once (when set via the CLI or config file), or specifically for a single sample (when passed inside the samplesheet).

Although a few of them are documented in the manual, others are not.

So we need to revise the documentation to make sure all these parameters are properly described in the manual and not only inside the config file.

Also revise the help message.
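
As a concrete illustration of the two levels, genome_size can be given once on the command line (applying to all samples) or inside the samplesheet for a single sample; both forms already appear elsewhere on this page:

# globally, for all samples, via the CLI (or the config file)
nextflow run fmalmeida/mpgap --input samplesheet.yml --genome_size 6m -profile docker

# per sample, inside the samplesheet:
#   samplesheet:
#     - id: sample_1
#       nanopore: sample_1_ont.fastq.gz
#       genome_size: 6m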

Include option for high quality long reads

Add a parameter that tells the pipeline to treat high-quality input long reads as corrected reads. This should trigger, whenever available, the parameters in each assembler that are specific to corrected long reads.

Examples:

  • Flye:
    • --pacbio-corr
    • --nano-corr
  • Canu:
    • -corrected

etc.

No such variable: USER

Dear developers of MpGAP,

I'm trying to run your assembly pipeline on ONT reads (after adapter removal with Porechop and length filtering with Filtlong) using the following command:

./nextflow run fmalmeida/mpgap --longreads /storage/ONT_results_FLO-MIN/pass/ONT_FLO-MIN_filtered.fastq.gz --lr_type nanopore --assembly_type longreads-only --try_canu --try_flye --try_unicycler --genomeSize 5m --outdir /storage/ONT_results_FLO-MIN/assemblies --threads 8

And I get the following error message:

N E X T F L O W ~ version 20.10.0
Launching fmalmeida/mpgap [nasty_pasteur] - revision: 9860b84 [master]
Docker-based, fmalmeida/mpgap, generic genome assembly pipeline

No such variable: USER

Is there a way to fix this in the CLI so that the pipeline runs correctly?

Thanks in advance for your kind help.

Best wishes

JL

mkdir: cannot create directory

Hi, so thanks again for the help. I'm slowly working through the pipeline on my HPC. It seems to be only a matter of specifying environment variables to make sure the root directory and /tmp/ are not filled, and then allocating enough resources (CPUs and memory).
So at last it is running the Pilon processes, but I've encountered an error I wanted to ask you about. Maybe it is related to the pipeline not resuming properly even when -resume is provided to the nextflow command?
(screenshot attached)

This may be related to the pipeline and not my system? Checking the logs I see the same error with other folders, such as "wtdbg2" instead of "flye", I guess for the different Pilon runs on the different assemblies. Maybe some mkdir or mv commands should be forced to allow replacing existing files when resuming runs? Thanks!

conda?

Looks like a great pipeline. Any chance you can have it in conda?

Add option hifi

Add an option to use PacBio HiFi reads in assemblers where an option for it is available, such as Canu, Flye, etc.

error in running MpGAP with SRR8482585_30X_{1,2}.fastq.gz and SRX5299443_30X.fastq.gz

When I tried to test the complete workflow of your code, I found that the example files at https://figshare.com/ndownloader/articles/14036585/versions/4 pass through the ngs-preprocess pipeline, but when I continue with the MpGAP pipeline an error occurs. The command I tried is as follows:
nextflow run fmalmeida/mpgap \
  --output mpgap_assmbly \
  --max_cpus 20 \
  --genome_size 6m \
  --input ./mpgap_samplesheet.yml \
  --hybrid-strategy both \
  -profile docker

running log
.nextflow.log

NOTE: Process `HYBRID:strategy_2_pilon (aspergillus_terreus:strategy_2)` terminated with an error exit status (137) -- Execution is retried (1)

(base) omic@omic-Precision-7920-Tower:~/funig8$ nextflow run fmalmeida/mpgap   --output _ASSEMBLY   --max_cpus 5   --skip_spades   --input "samplesheet.yml"   --unicycler_additional_parameters ' --mode conservative '   -profile docker

curl: (28) SSL connection timeout

 N E X T F L O W   ~  version 24.04.2

Launching `https://github.com/fmalmeida/mpgap` [happy_meninsky] DSL2 - revision: 9f2475ff11 [master]



------------------------------------------------------
  fmalmeida/mpgap v3.2
------------------------------------------------------
Core Nextflow options
  revision                       : master
  runName                        : happy_meninsky
  containerEngine                : docker
  container                      : [.*:fmalmeida/mpgap@sha256:d0c421d2caa6bfb6fbaad36b4182746485f750c82524b7b738b0d190505c8098]
  launchDir                      : /home/omic/funig8
  workDir                        : /home/omic/funig8/work
  projectDir                     : /home/omic/.nextflow/assets/fmalmeida/mpgap
  userName                       : omic
  profile                        : docker
  configFiles                    : /home/omic/.nextflow/assets/fmalmeida/mpgap/nextflow.config

Input/output options
  input                          : samplesheet.yml
  output                         : _ASSEMBLY

Computational options
  start_asm_mem                  : 20 GB
  max_cpus                       : 5
  max_memory                     : 40 GB

Turn assemblers and modules on/off
  skip_spades                    : true

Software' additional parameters
  unicycler_additional_parameters:  --mode conservative

Generic options
  tracedir                       : _ASSEMBLY/pipeline_info

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use fmalmeida/mpgap for your analysis please cite:

* The pipeline
  https://doi.org/10.12688/f1000research.139488.1

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/fmalmeida/mpgap#citation
------------------------------------------------------

    Launching defined workflows!
    By default, all workflows will appear in the console "log" message.
    However, the processes of each workflow will be launched based on the inputs received.
    You can see that processes that were not launched have an empty [-       ].
  
executor >  local (27)
[-        ] SHORTREADS_ONLY:unicycler      -
[-        ] SHORTREADS_ONLY:shovill        -
[-        ] SHORTREADS_ONLY:megahit        -
[-        ] LONGREADS_ONLY:canu            -
[-        ] LONGREADS_ONLY:flye            -
[-        ] LONGREADS_ONLY:unicycler       -
[-        ] LONGREADS_ONLY:raven           -
[-        ] LONGREADS_ONLY:shasta          -
[-        ] LONGREADS_ONLY:wtdbg2          -
[-        ] LONGREADS_ONLY:hifiasm         -
[-        ] LONGREADS_ONLY:medaka          -
[-        ] LONGREADS_ONLY:nanopolish      -
[-        ] LONGREADS_ONLY:gcpp            -
[a2/3c8f43] HYB…gillus_terreus:strategy_1) | 0 of 1
[e7/cdbb77] HYB…gillus_terreus:strategy_1) | 1 of 1 ✔
[64/054208] HYB…gillus_terreus:strategy_2) | 0 of 1
[5f/24fa65] HYB…gillus_terreus:strategy_2) | 1 of 1 ✔
[68/c6a255] HYB…gillus_terreus:strategy_2) | 1 of 1 ✔
[3c/83e89b] HYB…gillus_terreus:strategy_2) | 1 of 1 ✔
[-        ] HYBRID:strategy_2_shasta       -
[-        ] HYBRID:strategy_2_hifiasm      -
[8b/7f6c0e] HYB…gillus_terreus:strategy_2) | 1 of 1 ✔
[44/55e4cd] HYB…gillus_terreus:strategy_2) | 4 of 8, failed: 4, retries: 4
[36/4da361] HYB…gillus_terreus:strategy_2) | 3 of 7, failed: 3, retries: 3
[29/653896] ASS…gillus_terreus:strategy_2) | 0 of 5
Plus 5 more processes waiting for tasks…
[1a/5c45fa] NOTE: Process `HYBRID:strategy_2_pilon (aspergillus_terreus:strategy_2)` terminated with an error exit status (137) -- Execution is retried (1)

Improve quality assessment

Add an option to the pipeline, such as a parameter called --eukaryotes, which will tell the quality assessment tools (QUAST and BUSCO) to run with their configurations for eukaryotes.

use nf-core framework for CLI help and log messages

Change the pipeline configurations a little bit to:

  • Better reorganize the config files, updating standard so that it does not load any other profile, as is common for NF pipelines
  • Better separate default params from the main config and script
  • Use resource labels to better manage parallel jobs, as is done by nf-core
  • Use more of the nf-core framework and Groovy libs to provide beautiful and cleaner CLI help and log messages

No valid choice in the pipeline parameters for hybrid strategies and problem with -c config

Hi, thanks for the impressive work and pipeline

So I'm trying to use it on our data, and getting a couple of errors for starters.
If I run the pipeline with --input XXXX.yml and --hybrid_strategy 2, I get the error:

Launching https://github.com/fmalmeida/mpgap [golden_bell] DSL2 - revision: c1d2ab6 [master]
ERROR: Validation of pipeline parameters failed!

  • --hybrid_strategy: '2' is not a valid choice (Available choices: 1, 2, both)
  • --hybrid_strategy: '2' is not a valid choice (Available choices: 1, 2, both)

"both" seems to be working with this syntax, but not 1 nor 2.

If I run the pipeline trying to provide the config file with -c, I get the error:

N E X T F L O W ~ version 22.04.5
Unknown method invocation call on BigDecimal type -- Did you mean?
scale

I don't have much experience with nextflow, so I may be missing something "easy". Hope you can comment and help. Thanks for the support.

add the possibility of running directly from SRA IDs

Include a way to automatically download data from SRA and run the pipeline.

The bottleneck here is identifying a way for the pipeline to fetch multiple SRAs for a single sample, e.g. in the case of a hybrid assembly.

ERROR ~ Error executing process > 'HYBRID:strategy_2_wtdbg2 (ecoli_30X:strategy_2)'

(mpgap-3.2) omic@omic-Precision-7920-Tower:~/Downloads/assembly$ nextflow run fmalmeida/mpgap --output _ASSEMBLY --max_cpus 5 --skip_spades --input "samplesheet.yml" --unicycler_additional_parameters ' --mode conservative ' -profile docker

N E X T F L O W ~ version 24.04.2

Launching https://github.com/fmalmeida/mpgap [mighty_mirzakhani] DSL2 - revision: 9f2475f [master]


fmalmeida/mpgap v3.2

Core Nextflow options
revision : master
runName : mighty_mirzakhani
containerEngine : docker
container : [.*:fmalmeida/mpgap@sha256:d0c421d2caa6bfb6fbaad36b4182746485f750c82524b7b738b0d190505c8098]
launchDir : /home/omic/Downloads/assembly
workDir : /home/omic/Downloads/assembly/work
projectDir : /home/omic/.nextflow/assets/fmalmeida/mpgap
userName : omic
profile : docker
configFiles : /home/omic/.nextflow/assets/fmalmeida/mpgap/nextflow.config

Input/output options
input : samplesheet.yml
output : _ASSEMBLY

Computational options
start_asm_mem : 20 GB
max_cpus : 5
max_memory : 40 GB

Turn assemblers and modules on/off
skip_spades : true

Software' additional parameters
unicycler_additional_parameters: --mode conservative

Generic options
tracedir : _ASSEMBLY/pipeline_info

!! Only displaying parameters that differ from the pipeline defaults !!

If you use fmalmeida/mpgap for your analysis please cite:


Launching defined workflows!
By default, all workflows will appear in the console "log" message.
However, the processes of each workflow will be launched based on the inputs received.
You can see that processes that were not launched have an empty [-       ].

executor > local (8)
[- ] SHORTREADS_ONLY:unicycler -
[- ] SHORTREADS_ONLY:shovill -
[- ] SHORTREADS_ONLY:megahit -
[- ] LONGREADS_ONLY:canu -
[- ] LONGREADS_ONLY:flye -
[- ] LONGREADS_ONLY:unicycler -
[- ] LONGREADS_ONLY:raven -
[- ] LONGREADS_ONLY:shasta -
[- ] LONGREADS_ONLY:wtdbg2 -
[- ] LONGREADS_ONLY:hifiasm -
[- ] LONGREADS_ONLY:medaka -
[- ] LONGREADS_ONLY:nanopolish -
[- ] LONGREADS_ONLY:gcpp -
[8c/80f386] HYBRID:strategy_1_unicycler (ecoli_30X:strategy_1) [100%] 1 of 1, failed: 1 ✘
[7c/ee2d18] HYBRID:strategy_1_haslr (ecoli_30X:strategy_1) [100%] 1 of 1, failed: 1 ✘
[1c/7053ec] HYBRID:strategy_2_canu (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[3c/9be87a] HYBRID:strategy_2_flye (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[8a/54c676] HYBRID:strategy_2_unicycler (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[17/121160] HYBRID:strategy_2_raven (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[e6/4fdd45] HYBRID:strategy_2_shasta (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[- ] HYBRID:strategy_2_hifiasm -
[83/f20e7f] HYBRID:strategy_2_wtdbg2 (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[- ] HYBRID:strategy_2_medaka -
[- ] HYBRID:strategy_2_nanopolish -
[- ] HYBRID:strategy_2_gcpp -
[- ] HYBRID:strategy_2_pilon -
[- ] HYBRID:strategy_2_polypolish -
[- ] ASSEMBLY_QC:quast -
[- ] ASSEMBLY_QC:CUSTOM_DUMPSOFTWAREVERSIONS -
[- ] ASSEMBLY_QC:multiqc -
Execution cancelled -- Finishing pending tasks before exit
Pipeline completed at: 2024-06-03T08:53:15.368670472+04:00
Execution status: failed
Execution duration: 5s
Do not give up, we can fix it!
WARN: The operator first is useless when applied to a value channel which returns a single value by definition
ERROR ~ Error executing process > 'HYBRID:strategy_2_wtdbg2 (ecoli_30X:strategy_2)'

Caused by:
Process HYBRID:strategy_2_wtdbg2 (ecoli_30X:strategy_2) terminated with an error exit status (127)

Command executed:

# run wtdbg2
wtdbg2.pl \
    -t 5 \
    -x ont \
    -g 4m \
    -o ecoli_30X \
    SRX5299443_30X.fastq.gz

# rename results
cp ecoli_30X.cns.fa wtdbg2_assembly.fasta

# get version
# --version command does not exist
cat <<-END_VERSIONS > versions.yml
"HYBRID:strategy_2_wtdbg2":
    wtdbg2: 2.5
END_VERSIONS

Command exit status:
127

Command output:
(empty)

Command error:
.command.run: line 296: docker: command not found

Work dir:
/home/omic/Downloads/assembly/work/83/f20e7fb57111d1073362c72536c288

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

-- Check '.nextflow.log' file for details

Update CLI help message

Some of the workflow parameters are explained in the online documentation (readthedocs) but not in the command-line help. Fix it!

  • nf tower options
  • parallel jobs options

add polypolish tool

Pilon is the tool used for polishing long-read assemblies in the pipeline.

It would be nice to also add the Polypolish tool as a second short-read polisher for long-read assemblies, together with Pilon.

By default, the pipeline would polish long-read assemblies with both, but users could choose to skip one of them.

Shovill with all assemblers?

To date, Shovill is executed by default with "spades" as its base assembler. However, the software also supports using megahit and skesa.

It is already possible for users to change the default assembler for Shovill, e.g. --shovill_additional_parameters " --assembler skesa ".

However, this only changes the selected assembler and executes only that one. The idea is:

Is it possible to create a default rule that makes the pipeline create a Shovill assembly with each possible assembler?

change to unicycler v0.5.0?

Unicycler has now made a major release, v0.5.0. So it would be nice to have the pipeline use this version.

For that, a few fixes to the pipeline's environment and scripts would be required:

  1. Unicycler now accepts the newest SPAdes version, thus the v3.13 binaries would no longer be necessary.
  2. Unicycler no longer corrects reads prior to assembly, thus the information about --no_correct should be removed.
  3. Unicycler no longer polishes the assembly at the end, thus a new Pilon polishing step is required after its assembly.
    • This already happens for hybrid assemblies, but it should also be performed for Illumina assemblies.
  4. Unicycler has now discontinued its unicycler_polish script, which was used inside MpGAP's Pilon polishing module for paired-end reads.
    • Thus, this module needs to be updated so that it does not use this script and performs only a single round of Pilon polishing with either single-end (which is already the case) or paired-end reads.
    • This removes the dependency on the ALE binaries.

Obs: for now, this release does not impact the pipeline since it is pinned to v0.4.8. However, to use the new version, these observations should be addressed.

Requesting support with error "Explicit 'name separator' in class"

Hi,

Thanks for your kind support in the past. I've been using mpgap routinely in our projects, but now I've been struggling for a while with a new installation... maybe it's trivial, but I'm stuck and wondered if you could comment on it.
I'm using singularity. Everything seems to be working and the pipeline starts, but then it ends with the error:

Explicit 'name separator' in class near index 8
[dataset/pacbio.fastq]
^
-- Check script '/.nextflow/assets/fmalmeida/mpgap/./workflows/parse_samples.nf' at line: 66 or see '.nextflow.log' file for more details

I believe I've used the same syntax as always, and the one suggested in the manual in the yml samplesheet:
samplesheet:
  - id: Sol_test_1
    pacbio:
      - 'dataset/pacbio.fastq'
    genome_size: 39.11m
    wtdbg2_technology: rs
    corrected_long_reads: false

I'm attaching the nextflow.log. It may be something trivial and nextflow-related... but could you comment please? I've unsuccessfully tried using different quotes in the yml file, and even placing the file in the same folder.

Thanks!

100% missing in Busco

Hi Dr Almeida,

I was playing around with the pipeline and everything went well; however, I found an issue with BUSCO.
The output summary showed that no BUSCOs were found in the query genome (100% missing) against bacteria_odb9. I also found some issues on the QUAST GitHub that seem related to LD_LIBRARY_PATH (ablab/quast#88).

I also tried a standalone BUSCO container (v5.4.7) with bacteria_odb10 and got 100% complete single-copy BUSCOs. This confirms that the assembled genome itself is fine.

Could you please kindly check if it could be fixed?

Best,
CW

Add an option for multiple samples

Add an option to facilitate and organize the execution of the pipeline for multiple samples. Maybe create something using a YAML syntax, as is done in bacannot.

problem with longreads_only assembly

[intergalactic_knuth] Nextflow Workflow Report.pdf

Hi,

I am trying to assemble a plant genome (~800m) from PacBio Revio reads.

here is the command I use
nextflow -bg run fmalmeida/mpgap --output _ASSEMBLY --max_cpus 20 --skeep_wtdbg2 --genome_size 800m --input MPGAP_samplesheet1.yml -profile docker

here is the yml file contents

samplesheet:
  - id: sample_5
    pacbio: HMW_DNA_m84126_231020_112323_s3.hifi_reads.fastq.gz

The process started, but at some point I get error messages similar to the following for all the assemblers:

[Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: LONGREADS_ONLY:canu (sample_5); status: COMPLETED; exit: 1; error: -; workDir: /mnt/data/guyh/Trifolium/Revio/work/a8/a8499e26430751241cde25981ce53b]

A pdf version of mpgap report is attached

Can you please advise?

Thank you in advance.
Guy

Incomplete pipeline and different errors when using nanopore reads files with different sizes (900 mb vs 11Gb)

Describe the bug
I encountered an issue while running the pipeline with two barcoded genome samples (Barcode04 and Barcode06). These samples produced exceptionally large output files: Barcode04 resulted in an 11GB file, while Barcode06 generated a massive 980GB file. Both runs also exhibited different errors. It's worth noting that the reference genome size for these samples, which are from hummingbirds, is approximately 1.5GB. The sequencing was performed using the Nanopore Promethion 10.4.1 platform, and basecalling was done with the Super Accurate algorithm.

To Reproduce
Steps to reproduce the behavior:
Run the following command line with the files in the respective folders
nextflow run fmalmeida/mpgap --output output_barcode04_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode04.yml" -profile docker

or

nextflow run fmalmeida/mpgap --output output_barcode06_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode06.yml" -profile docker

Expected behavior
Output folders with the results of the pipeline

Archive.zip

add trycycler

Add the trycycler tool, as an option, to generate a consensus assembly from the long-read assemblies.

Enhance documentation (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Create an "Output" page to help users understand the output structure and point to the relevant tool-specific links, as is done in the bacannot documentation page, which gives users an interpretation of the generated results, including the directory structure and relevant links to tool-specific reference material.

new directory called "final_output"

Add a rule to the pipeline so that all the assemblies of a sample have a copy stored in a single folder, e.g. final_output, so that it is easier for users to select and retrieve the assemblies they want.
