fmalmeida / mpgap

Multi-platform genome assembly pipeline for Illumina, Nanopore and PacBio reads

Home Page: https://mpgap.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Languages: Nextflow 94.38%, Dockerfile 5.62%
Topics: hybrid-assemblies, illumina, pipeline, genome-assembly, polish, pacbio, nanopore, unicycler, spades, flye

mpgap's Introduction

Badges: F1000 paper · GitHub release · documentation · Nextflow · run with Docker · run with Singularity · license · follow on Twitter · Zenodo archive


MpGAP pipeline

A generic multi-platform genome assembly pipeline


See the documentation »

Report Bug · Request Feature

About

MpGAP is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. It is an easy-to-use pipeline that adopts well-known software for de novo genome assembly of Illumina, PacBio and Oxford Nanopore sequencing data, through Illumina-only, long-reads-only or hybrid modes.

This pipeline wraps up the following software:

Assemblers: Hifiasm, Canu, Flye, Raven, Shasta, wtdbg2, Haslr, Unicycler, SPAdes, Shovill, MEGAHIT
Polishers: Nanopolish, Medaka, gcpp, Polypolish and Pilon
Quality check: QUAST, BUSCO and MultiQC

Release notes

Are you curious about changes between releases? See the changelog.

  • We strongly recommend using the latest version hosted in the master branch, which is Nextflow's default.
    • The latest version will always receive support and bug fixes, and it generally maintains the same processes found in previous versions (things are mainly added rather than removed).
    • But, if you really need to execute an earlier release, please see the instructions for doing so (a minimal example is sketched below).
  • Versions below 3.0 are no longer supported.
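
If you do need to run an earlier release, Nextflow's built-in -r flag selects a tagged revision. A minimal sketch, assuming a hypothetical tag name (list the valid ones with nextflow info fmalmeida/mpgap):

# pull and run a specific tagged release instead of master
# v3.1 is an illustrative tag; replace it with a real one from the releases page
nextflow pull fmalmeida/mpgap -r v3.1
nextflow run fmalmeida/mpgap -r v3.1 -profile docker --help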

Further reading

This pipeline has two complementary pipelines (also written in Nextflow) for NGS preprocessing and prokaryotic genome annotation, which together give the user a complete workflow for bacterial genomics analyses.

Feedback

In the pipeline we always try to design the workflow and its execution dynamics to be as generic as possible and suited to the widest range of use cases.

Therefore, feedback is very welcome. If you believe that your use case is not covered by the pipeline, have enhancement ideas or found a bug, please do not hesitate to open an issue to discuss it.

Installation

  1. Install Nextflow:

    curl -s https://get.nextflow.io | bash
  2. Give it a try:

    nextflow run fmalmeida/mpgap --help
  3. Download required tools

    • for docker

      # for docker
      docker pull fmalmeida/mpgap:v3.2
      
      # run
      nextflow run fmalmeida/mpgap -profile docker [options]
    • for singularity

      # for singularity --> prepare env variables
      # remember to properly set NXF_SINGULARITY_LIBRARYDIR
      # read more at https://www.nextflow.io/docs/latest/singularity.html#singularity-docker-hub
      export NXF_SINGULARITY_LIBRARYDIR=<path in your machine>    # Set a path to your singularity storage dir
      export NXF_SINGULARITY_CACHEDIR=<path in your machine>      # Set a path to your singularity cache dir
      export SINGULARITY_CACHEDIR=<path in your machine>          # Set a path to your singularity cache dir
      
      # TODO: ADD Information about TMPDIR
      
      # run
      nextflow run fmalmeida/mpgap -profile singularity [options]
    • for conda

      # for conda
      # it is better to create envs with mamba for faster solving
      wget https://github.com/fmalmeida/mpgap/raw/master/environment.yml
      conda env create -f environment.yml   # advice: use mamba
      
      # must be executed from the base environment
      # This tells nextflow to load the available mpgap environment when required
      nextflow run fmalmeida/mpgap -profile conda [options]

      🎯 Please make sure to also download its BUSCO databases. See the explanation in the "Note on conda" section below.

  4. Start running your analysis

    nextflow run fmalmeida/mpgap -profile <docker/singularity/conda>

🔥 Please read the documentation below on selecting between the conda, docker and singularity profiles, since the tools are made available differently depending on the chosen profile.

Quickstart

A few test datasets have been made available so that users can quickly try out the features of the pipeline:

# short-reads
nextflow run fmalmeida/mpgap -profile test,sreads,<docker/singularity>

# long-reads
nextflow run fmalmeida/mpgap -profile test,lreads,<ont/pacbio>,<docker/singularity>

# hybrid
nextflow run fmalmeida/mpgap -profile test,hybrid,<ont/pacbio>,<docker/singularity>

Documentation

Complete online documentation. »

Selecting between profiles

Nextflow profiles are a set of "sensible defaults" for the resource requirements of each of the steps in the workflow that can be enabled with the command-line flag -profile. You can learn more about Nextflow profiles in the Nextflow documentation.

The pipeline has "standard profiles" set to run the workflows with either conda, docker or singularity using the local executor, which is Nextflow's default and basically runs the pipeline processes on the computer where Nextflow is launched. If you need to run the pipeline with another executor such as SGE, LSF or SLURM, take a look at Nextflow's manual page on how to configure one in a new custom profile in your personal copy of the MpGAP config file, and take advantage of the fact that Nextflow allows multiple profiles to be used at once, e.g. -profile conda,sge. A sketch of such a custom profile is shown below.
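
The sketch below is illustrative only; the profile name, queue and the SLURM executor are assumptions and must be adapted to your cluster and to the pipeline's actual config layout:

# write a small custom config declaring a scheduler profile (SLURM used as an example)
cat > my_hpc.config << 'EOF'
profiles {
  slurm_custom {
    process.executor = 'slurm'
    process.queue    = 'long'   // hypothetical queue name
  }
}
EOF

# combine it with one of the pipeline's software profiles
nextflow run fmalmeida/mpgap -c my_hpc.config -profile conda,slurm_custom [options]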

By default, if no profile is chosen, the pipeline will try to load the tools from the local machine's $PATH. The available pre-set profiles for this pipeline are docker, conda and singularity; you can choose between them as follows:

  • conda

    # must be executed from the base environment
    # This tells nextflow to load the available mpgap environment when required
    nextflow run fmalmeida/mpgap -profile conda [options]
  • docker

    nextflow run fmalmeida/mpgap -profile docker [options]
  • singularity

    nextflow run fmalmeida/mpgap -profile singularity [options]

Note on conda

📖 Please use conda only as a last resort

Instructions to create the required conda environment are found in the installation section.

The conda profile will only work on linux-64 machines, because some of the tools only provide binaries for this platform, and others that had to be placed in the "bin" directory to avoid version incompatibilities were also compiled for linux-64. A few examples are: wtdbg2, ALE (used as an auxiliary tool in the Pilon polishing step), SPAdes v3.13 for Unicycler, and others.

Therefore, be aware that -profile conda will only work on linux-64 machines. Users on other systems must use docker or singularity.

Finally, the main conda packages in the environment.yml file have been "frozen" to specific versions to make environment solving faster. If you see that a tool has a new release and would like it updated in the pipeline, please open an issue.

Also, since the automatic download of BUSCO databases is broken in QUAST 5.0.2, when using conda you must download the BUSCO databases yourself so that QUAST can properly run the assembly quality check step.

CONDA_PREFIX is the base/root directory of your conda installation

# create the directory
mkdir -p $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco/

# bacteria db
wget -O $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco/bacteria.tar.gz https://busco.ezlab.org/v2/datasets/bacteria_odb9.tar.gz

# eukaryota db
wget -O $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco/eukaryota.tar.gz https://busco.ezlab.org/v2/datasets/eukaryota_odb9.tar.gz

# fungi db
wget -O $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco/fungi.tar.gz https://busco.ezlab.org/v2/datasets/fungi_odb9.tar.gz
chmod -R 777 $CONDA_PREFIX/envs/mpgap-3.2/lib/python3.8/site-packages/quast_libs/busco

# get the augustus database
# must be executed last because its download links for bacteria, fungi and eukaryota are broken
# it only works for augustus
conda activate mpgap-3.2 && quast-download-busco

Explanation of hybrid strategies

Hybrid assemblies can be produced with two available strategies. Please read more about the strategies and how to set them up in the online documentation.

➡️ They are chosen with the parameter --hybrid_strategy.

Strategy 1

It uses the hybrid assembly modes from Unicycler, Haslr and/or SPAdes.

Strategy 2

It produces a long-reads-only assembly and then polishes it (corrects errors) with short reads using Pilon. By default, it runs 4 rounds of polishing (params.pilon_polish_rounds).

Example:

# run the pipeline setting the desired hybrid strategy globally (for all samples)
nextflow run fmalmeida/mpgap \
  --output output \
  --max_cpus 5 \
  --input "samplesheet.yml" \
  --hybrid_strategy "both"

🔥 This will perform, for all samples, both strategy 1 and strategy 2 hybrid assemblies. Please read more about it in the manual reference page and the samplesheet reference page.
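
The strategy can also be set per sample inside the samplesheet. The sketch below follows the samplesheet examples shown elsewhere on this page, but the illumina/nanopore layout and the per-sample hybrid_strategy key are assumptions; confirm them in the samplesheet reference page:

# hypothetical samplesheet for a single hybrid sample (Illumina + Nanopore)
cat > samplesheet.yml << 'EOF'
samplesheet:
  - id: sample_1
    illumina:
      - sample_1_R1.fastq.gz
      - sample_1_R2.fastq.gz
    nanopore: sample_1_ont.fastq.gz
    genome_size: 4m
    hybrid_strategy: 2          # per-sample override (assumed key name)
EOF

# run it
nextflow run fmalmeida/mpgap --input samplesheet.yml -profile docker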

Usage

For understanding pipeline usage and configuration, users must read the complete online documentation »

Using the configuration file

All parameters shown above can be, and are advised to be, set through the configuration file. When a configuration file is used, the pipeline is executed as nextflow run fmalmeida/mpgap -c ./configuration-file. Your configuration file is what tells the pipeline which type of data you have and which processes to execute; therefore, it needs to be correctly configured.

  • To create a configuration file in your working directory:

    nextflow run fmalmeida/mpgap --get_config
    

Interactive graphical configuration and execution

Via NF tower launchpad (good for cloud env execution)

Nextflow has an awesome feature called NF Tower. It allows users to quickly customise and set up the execution and configuration of cloud environments to run any Nextflow pipeline from nf-core, GitHub (this one included), Bitbucket, etc. Because the pipeline ships a compliant JSON schema for its parameters, configuring it in NF Tower is easier, since the system will render an input form.

Check out more about this feature at: https://seqera.io/blog/orgs-and-launchpad/

Via nf-core launch (good for local execution)

Users can trigger a graphical and interactive pipeline configuration and execution using the nf-core launch utility. nf-core launch starts an interactive form in your web browser or command line so that you can configure the pipeline step by step and then start its execution.

# Install nf-core
pip install nf-core

# Launch the pipeline
nf-core launch fmalmeida/mpgap


Known issues

  1. Whenever unicycler is used with unpaired reads, an odd platform-specific SPAdes-related crash seems to happen randomly, as discussed in rrwick/Unicycler#188.
  • As a workaround, Ryan suggests using the --no_correct parameter, which solves the issue and does not negatively impact assembly quality.
  • Therefore, if you run into this error when using unpaired data, you can activate this workaround (see the example commands below) with:
    • --unicycler_additional_parameters " --no_correct "
  2. Sometimes the shovill assembler can fail, and cause the pipeline to fail, due to problems estimating the genome size. This is actually very simple to solve: instead of letting shovill estimate the genome size, you can pass the information to it and prevent the failure:
    • --shovill_additional_parameters " --gsize 3m "
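
For example, the two workarounds above could be appended to a normal run as sketched below (the samplesheet name and the 3m genome size are placeholders):

# unpaired short reads: skip unicycler's read-correction step
nextflow run fmalmeida/mpgap \
  --input "samplesheet.yml" \
  --unicycler_additional_parameters " --no_correct " \
  -profile docker

# shovill failing on genome size estimation: pass the expected size explicitly
nextflow run fmalmeida/mpgap \
  --input "samplesheet.yml" \
  --shovill_additional_parameters " --gsize 3m " \
  -profile docker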

Citation

In order to cite this pipeline, please refer to:

Almeida FMd, Campos TAd and Pappas Jr GJ. Scalable and versatile container-based pipelines for de novo genome assembly and bacterial annotation. [version 1; peer review: awaiting peer review]. F1000Research 2023, 12:1205 (https://doi.org/10.12688/f1000research.139488.1)

Additionally, archived versions of the pipeline are also found in Zenodo.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the GPLv3.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, users are encouraged to cite the programs used in this pipeline. Links to the tools and data used in this pipeline are in the list of tools.

mpgap's People

Contributors

fmalmeida · scintilla9


mpgap's Issues

Add example of non-bacterial dataset analysis (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Generate a new page in the web documentation showing the analysis of a fungal or plant sequencing dataset. Make sure it has the necessary command lines from input to output, so that the analysis can be reproduced, and also add an overview of the generated results to the web page.

Once done, check how easily we can update the paper to provide an additional Zenodo archive for the non-bacterial analysis (ngs-preprocess + MpGAP).

Add a simple parameter to handle starting memory settings

This issue relates to issues #52 and #59, where users faced memory errors and had to adapt the config so that they could use more memory on the first try, instead of having to wait for retries.

By default, the pipeline first tries with a small amount of resources and then uses the full amount specified by the max parameters:

// Assemblies will first try to adjust themselves to a parallel execution
    // If it is not possible, then it waits to use all the resources allowed
    withLabel:process_assembly {
      cpus   = {  if (task.attempt == 1) { check_max( 6 * task.attempt, 'cpus'       ) } else { params.max_cpus   } }
      memory = {  if (task.attempt == 1) { check_max( 20.GB * task.attempt, 'memory' ) } else { params.max_memory } }
      time   = {  if (task.attempt == 1) { check_max( 24.h * task.attempt, 'time'    ) } else { params.max_time   } }
      
      // retry at least once to try it with full resources
      errorStrategy = { task.exitStatus in [1,21,143,137,104,134,139,247] ? 'retry' : 'finish' }
      maxRetries    = 1
      maxErrors     = '-1'
    }

    // Quast sometimes can take too long
    withName:quast {
      cpus   = {  if (task.attempt == 1) { check_max( 4 * task.attempt, 'cpus'       ) } else { params.max_cpus   } }
      memory = {  if (task.attempt == 1) { check_max( 10.GB * task.attempt, 'memory' ) } else { params.max_memory } }
      time   = {  if (task.attempt == 1) { check_max( 12.h * task.attempt, 'time'    ) } else { params.max_time   } }
      
      // retry at least once to try it with full resources
      errorStrategy = { task.exitStatus in [21,143,137,104,134,139,247] ? 'retry' : 'finish' }
      maxRetries    = 1
      maxErrors     = '-1'
    }

It would probably be good to also define a parameter to configure the starting memory and thread amounts, which would be used in the first attempt of these modules.

Maybe --start_asm_mem & --start_asm_cpus.
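
If such parameters were added (later run logs on this page already show a start_asm_mem value), usage could look like the sketch below; the flag names, values and value format are illustrative only:

# hypothetical: request a larger first-attempt allocation for assembly processes
nextflow run fmalmeida/mpgap \
  --input "samplesheet.yml" \
  --start_asm_mem "40.GB" \
  --start_asm_cpus 10 \
  -profile docker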

add testing git actions for PRs

To facilitate contributions and updates, it would be great to create GitHub Actions that test the pipeline with the available profiles and technologies.

quast generating empty files

So, I've consistently observed that the quast step fails, and even though the pipeline points to the work directory so the files can be checked, these appear to be empty. (A screenshot of the quast folder was attached.)

I attach the logs and the files that were not 0-sized.
I will let you know if it keeps failing during my tests, and whether the fix you provided in the config file allows me to bypass the error and avoid the pipeline crash. Thank you so much.
nextflow.log.txt
output.log.txt
quast_files.zip

add a third hybrid strategy

Add another hybrid strategy for samples where this might be the best option.

This strategy would perform a short-reads assembly and then scaffold it with long reads.

allow hifiasm to use hi-c and parental data

Now that hifiasm has been included in the pipeline, we must assess how the pipeline can be adapted to allow the user to pass Hi-C and/or parental data to the tool, as specified on its webpage.

It is a follow-up of #70

update documentation about the configuration in either config or samplesheet

The pipeline now has a few configuration parameters that can be set globally for all samples at once (when set via the CLI or config file), or specifically for a single sample (when passed inside the samplesheet).

Although a few of them are documented in the manual, others are not.

So we need to revise the documentation to make sure all these parameters are properly described in the manual and not only inside the config file.

Also revise the help message.
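
As a concrete illustration of the two levels, genome_size can be given once on the command line (applying to all samples) or inside the samplesheet for a single sample; both forms already appear elsewhere on this page:

# globally, for all samples, via the CLI (or the config file)
nextflow run fmalmeida/mpgap --input samplesheet.yml --genome_size 6m -profile docker

# per sample, inside the samplesheet:
#   samplesheet:
#     - id: sample_1
#       nanopore: sample_1_ont.fastq.gz
#       genome_size: 6m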

Include option for high quality long reads

Add a parameter that tells the pipeline to treat high-quality input long reads as corrected reads. This should trigger, whenever available, the parameters in each assembler that are specific to corrected long reads.

Examples:

  • Flye:
    • --pacbio-corr
    • --nano-corr
  • Canu:
    • -corrected

etc.

No such variable: USER

Dear developers of MpGAP,

I'm trying to run your assembly pipeline on ONT reads (after adapter removal with Porechop and length filtering with Filtlong) using the following command:

./nextflow run fmalmeida/mpgap --longreads /storage/ONT_results_FLO-MIN/pass/ONT_FLO-MIN_filtered.fastq.gz --lr_type nanopore --assembly_type longreads-only --try_canu --try_flye --try_unicycler --genomeSize 5m --outdir /storage/ONT_results_FLO-MIN/assemblies --threads 8

And I get the following error message:

N E X T F L O W ~ version 20.10.0
Launching fmalmeida/mpgap [nasty_pasteur] - revision: 9860b84 [master]
Docker-based, fmalmeida/mpgap, generic genome assembly pipeline

No such variable: USER

Is there a way to fix this in the CLI so that the pipeline runs correctly?

Thanks in advance for your kind help.

Best wishes

JL

mkdir: cannot create directory

Hi, so thanks again for the help. I'm slowly working through the pipeline on my HPC. It seems to be only a matter of specifying environment variables to make sure the root directory and /tmp/ are not filled, and then allocating enough resources (CPUs and memory).
So at last it is running the Pilon processes, but I've encountered an error I wanted to ask you about. Maybe it is related to the pipeline not resuming properly even when -resume is provided to the nextflow command?
(screenshot attached)

This may be related to the pipeline and not my system? Checking the logs I see the same error with other folders, such as "wtdbg2" instead of "flye", I guess for the different Pilon runs on the different assemblies. Maybe some mkdir or mv commands should be forced to allow replacing existing files when resuming runs? Thanks!

conda?

Looks like a great pipeline. Any chance you can have it in conda?

Add option hifi

Add an option to use PacBio HiFi reads in assemblers where an option for it is available, such as Canu, Flye, etc.

error in running MpGAP with SRR8482585_30X_{1,2}.fastq.gz and SRX5299443_30X.fastq.gz

When I tried to test the complete workflow of your code, I found that the example files at https://figshare.com/ndownloader/articles/14036585/versions/4 pass through the ngs-preprocess pipeline, but when I continue with the MpGAP pipeline an error occurs. The command I tried is as follows:
nextflow run fmalmeida/mpgap \
  --output mpgap_assmbly \
  --max_cpus 20 \
  --genome_size 6m \
  --input ./mpgap_samplesheet.yml \
  --hybrid-strategy both \
  -profile docker

running log
.nextflow.log

NOTE: Process `HYBRID:strategy_2_pilon (aspergillus_terreus:strategy_2)` terminated with an error exit status (137) -- Execution is retried (1)

(base) omic@omic-Precision-7920-Tower:~/funig8$ nextflow run fmalmeida/mpgap   --output _ASSEMBLY   --max_cpus 5   --skip_spades   --input "samplesheet.yml"   --unicycler_additional_parameters ' --mode conservative '   -profile docker

curl: (28) SSL connection timeout

 N E X T F L O W   ~  version 24.04.2

Launching `https://github.com/fmalmeida/mpgap` [happy_meninsky] DSL2 - revision: 9f2475ff11 [master]



------------------------------------------------------
  fmalmeida/mpgap v3.2
------------------------------------------------------
Core Nextflow options
  revision                       : master
  runName                        : happy_meninsky
  containerEngine                : docker
  container                      : [.*:fmalmeida/mpgap@sha256:d0c421d2caa6bfb6fbaad36b4182746485f750c82524b7b738b0d190505c8098]
  launchDir                      : /home/omic/funig8
  workDir                        : /home/omic/funig8/work
  projectDir                     : /home/omic/.nextflow/assets/fmalmeida/mpgap
  userName                       : omic
  profile                        : docker
  configFiles                    : /home/omic/.nextflow/assets/fmalmeida/mpgap/nextflow.config

Input/output options
  input                          : samplesheet.yml
  output                         : _ASSEMBLY

Computational options
  start_asm_mem                  : 20 GB
  max_cpus                       : 5
  max_memory                     : 40 GB

Turn assemblers and modules on/off
  skip_spades                    : true

Software' additional parameters
  unicycler_additional_parameters:  --mode conservative

Generic options
  tracedir                       : _ASSEMBLY/pipeline_info

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use fmalmeida/mpgap for your analysis please cite:

* The pipeline
  https://doi.org/10.12688/f1000research.139488.1

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/fmalmeida/mpgap#citation
------------------------------------------------------

    Launching defined workflows!
    By default, all workflows will appear in the console "log" message.
    However, the processes of each workflow will be launched based on the inputs received.
    You can see that processes that were not launched have an empty [-       ].
  
executor >  local (27)
[-        ] SHORTREADS_ONLY:unicycler      -
[-        ] SHORTREADS_ONLY:shovill        -
[-        ] SHORTREADS_ONLY:megahit        -
[-        ] LONGREADS_ONLY:canu            -
[-        ] LONGREADS_ONLY:flye            -
[-        ] LONGREADS_ONLY:unicycler       -
[-        ] LONGREADS_ONLY:raven           -
[-        ] LONGREADS_ONLY:shasta          -
[-        ] LONGREADS_ONLY:wtdbg2          -
[-        ] LONGREADS_ONLY:hifiasm         -
[-        ] LONGREADS_ONLY:medaka          -
[-        ] LONGREADS_ONLY:nanopolish      -
[-        ] LONGREADS_ONLY:gcpp            -
[a2/3c8f43] HYB…gillus_terreus:strategy_1) | 0 of 1
[e7/cdbb77] HYB…gillus_terreus:strategy_1) | 1 of 1 ✔
[64/054208] HYB…gillus_terreus:strategy_2) | 0 of 1
[5f/24fa65] HYB…gillus_terreus:strategy_2) | 1 of 1 ✔
[68/c6a255] HYB…gillus_terreus:strategy_2) | 1 of 1 ✔
[3c/83e89b] HYB…gillus_terreus:strategy_2) | 1 of 1 ✔
[-        ] HYBRID:strategy_2_shasta       -
[-        ] HYBRID:strategy_2_hifiasm      -
[8b/7f6c0e] HYB…gillus_terreus:strategy_2) | 1 of 1 ✔
[44/55e4cd] HYB…gillus_terreus:strategy_2) | 4 of 8, failed: 4, retries: 4
[36/4da361] HYB…gillus_terreus:strategy_2) | 3 of 7, failed: 3, retries: 3
[29/653896] ASS…gillus_terreus:strategy_2) | 0 of 5
Plus 5 more processes waiting for tasks…
[1a/5c45fa] NOTE: Process `HYBRID:strategy_2_pilon (aspergillus_terreus:strategy_2)` terminated with an error exit status (137) -- Execution is retried (1)

Improve quality assessment

Add an option to the pipeline, such as a parameter called --eukaryotes, which will tell the quality assessment tools (QUAST and BUSCO) to run with their configurations for eukaryotes.

use nf-core framework for CLI help and log messages

Change the pipeline configurations a little bit to:

  • Better reorganize the config files, updating standard so that it does not load any other profile, as is common for NF pipelines
  • Better separate default params from the main config and script
  • Use resource labels to better manage parallel jobs, as is done by nf-core
  • Use more of the nf-core framework and Groovy libs to provide beautiful and cleaner CLI help and log messages

No valid choice in the pipeline parameters for hybrid strategies and problem with -c config

Hi, thanks for the impressive work and pipeline

So I'm trying to use it on our data, and getting a couple of errors for starters.
If I run the pipeline with --input XXXX.yml and --hybrid_strategy 2, I get the error:

Launching https://github.com/fmalmeida/mpgap [golden_bell] DSL2 - revision: c1d2ab6 [master]
ERROR: Validation of pipeline parameters failed!

  • --hybrid_strategy: '2' is not a valid choice (Available choices: 1, 2, both)
  • --hybrid_strategy: '2' is not a valid choice (Available choices: 1, 2, both)

"both" seems to be working with this syntax, but not 1 nor 2.

If I run the pipeline trying to provide the config file with -c, I get the error:

N E X T F L O W ~ version 22.04.5
Unknown method invocation call on BigDecimal type -- Did you mean?
scale

I don't have much experience with nextflow, so I may be missing something "easy". Hope you can comment and help. Thanks for the support.

add the possibility of running directly from SRA IDs

Include a way to automatically download data from SRA and run the pipeline.

The bottleneck here is identifying a way for the pipeline to fetch multiple SRAs for a single sample, e.g. in the case of a hybrid assembly.

ERROR ~ Error executing process > 'HYBRID:strategy_2_wtdbg2 (ecoli_30X:strategy_2)'

(mpgap-3.2) omic@omic-Precision-7920-Tower:~/Downloads/assembly$ nextflow run fmalmeida/mpgap --output _ASSEMBLY --max_cpus 5 --skip_spades --input "samplesheet.yml" --unicycler_additional_parameters ' --mode conservative ' -profile docker

N E X T F L O W ~ version 24.04.2

Launching https://github.com/fmalmeida/mpgap [mighty_mirzakhani] DSL2 - revision: 9f2475f [master]


fmalmeida/mpgap v3.2

Core Nextflow options
revision : master
runName : mighty_mirzakhani
containerEngine : docker
container : [.*:fmalmeida/mpgap@sha256:d0c421d2caa6bfb6fbaad36b4182746485f750c82524b7b738b0d190505c8098]
launchDir : /home/omic/Downloads/assembly
workDir : /home/omic/Downloads/assembly/work
projectDir : /home/omic/.nextflow/assets/fmalmeida/mpgap
userName : omic
profile : docker
configFiles : /home/omic/.nextflow/assets/fmalmeida/mpgap/nextflow.config

Input/output options
input : samplesheet.yml
output : _ASSEMBLY

Computational options
start_asm_mem : 20 GB
max_cpus : 5
max_memory : 40 GB

Turn assemblers and modules on/off
skip_spades : true

Software' additional parameters
unicycler_additional_parameters: --mode conservative

Generic options
tracedir : _ASSEMBLY/pipeline_info

!! Only displaying parameters that differ from the pipeline defaults !!

If you use fmalmeida/mpgap for your analysis please cite:


Launching defined workflows!
By default, all workflows will appear in the console "log" message.
However, the processes of each workflow will be launched based on the inputs received.
You can see that processes that were not launched have an empty [-       ].

executor > local (8)
[- ] SHORTREADS_ONLY:unicycler -
[- ] SHORTREADS_ONLY:shovill -
[- ] SHORTREADS_ONLY:megahit -
[- ] LONGREADS_ONLY:canu -
[- ] LONGREADS_ONLY:flye -
[- ] LONGREADS_ONLY:unicycler -
[- ] LONGREADS_ONLY:raven -
[- ] LONGREADS_ONLY:shasta -
[- ] LONGREADS_ONLY:wtdbg2 -
[- ] LONGREADS_ONLY:hifiasm -
[- ] LONGREADS_ONLY:medaka -
[- ] LONGREADS_ONLY:nanopolish -
[- ] LONGREADS_ONLY:gcpp -
[8c/80f386] HYBRID:strategy_1_unicycler (ecoli_30X:strategy_1) [100%] 1 of 1, failed: 1 ✘
[7c/ee2d18] HYBRID:strategy_1_haslr (ecoli_30X:strategy_1) [100%] 1 of 1, failed: 1 ✘
[1c/7053ec] HYBRID:strategy_2_canu (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[3c/9be87a] HYBRID:strategy_2_flye (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[8a/54c676] HYBRID:strategy_2_unicycler (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[17/121160] HYBRID:strategy_2_raven (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[e6/4fdd45] HYBRID:strategy_2_shasta (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[- ] HYBRID:strategy_2_hifiasm -
[83/f20e7f] HYBRID:strategy_2_wtdbg2 (ecoli_30X:strategy_2) [100%] 1 of 1, failed: 1 ✘
[- ] HYBRID:strategy_2_medaka -
[- ] HYBRID:strategy_2_nanopolish -
[- ] HYBRID:strategy_2_gcpp -
[- ] HYBRID:strategy_2_pilon -
[- ] HYBRID:strategy_2_polypolish -
[- ] ASSEMBLY_QC:quast -
[- ] ASSEMBLY_QC:CUSTOM_DUMPSOFTWAREVERSIONS -
[- ] ASSEMBLY_QC:multiqc -
Execution cancelled -- Finishing pending tasks before exit
Pipeline completed at: 2024-06-03T08:53:15.368670472+04:00
Execution status: failed
Execution duration: 5s
Do not give up, we can fix it!
WARN: The operator first is useless when applied to a value channel which returns a single value by definition
ERROR ~ Error executing process > 'HYBRID:strategy_2_wtdbg2 (ecoli_30X:strategy_2)'

Caused by:
Process HYBRID:strategy_2_wtdbg2 (ecoli_30X:strategy_2) terminated with an error exit status (127)

Command executed:

# run wtdbg2
wtdbg2.pl \
    -t 5 \
    -x ont \
    -g 4m \
    -o ecoli_30X \
    SRX5299443_30X.fastq.gz

# rename results
cp ecoli_30X.cns.fa wtdbg2_assembly.fasta

# get version
# --version command does not exist
cat <<-END_VERSIONS > versions.yml
"HYBRID:strategy_2_wtdbg2":
    wtdbg2: 2.5
END_VERSIONS

Command exit status:
127

Command output:
(empty)

Command error:
.command.run: line 296: docker: command not found

Work dir:
/home/omic/Downloads/assembly/work/83/f20e7fb57111d1073362c72536c288

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

-- Check '.nextflow.log' file for details

Update CLI help message

Some of the workflow parameters are explained in the online documentation (readthedocs) but not in the command-line help. Fix it!

  • nf tower options
  • parallel jobs options

add polypolish tool

Pilon is the tool used for polishing long-read assemblies in the pipeline.

It would be nice to also add the Polypolish tool as a second short-read polisher for long-read assemblies, together with Pilon.

By default, the pipeline would polish long-read assemblies with both, but users could choose to skip one of them.

Shovill with all assemblers?

To date, Shovill is executed by default with "spades" as its base assembler. However, the software also supports using megahit and skesa.

It is already possible for users to change the default assembler for Shovill, e.g. --shovill_additional_parameters " --assembler skesa ".

However, this only changes the selected assembler and executes only that one. The idea is:

Is it possible to create a default rule that makes the pipeline create a Shovill assembly with each possible assembler?

change to unicycler v0.5.0?

Unicycler has now made a major release, v0.5.0. So it would be nice to have the pipeline use this version.

For that, a few fixes to the pipeline's environment and scripts would be required:

  1. Unicycler now accepts the newest SPAdes version, thus the v3.13 binaries would no longer be necessary.
  2. Unicycler no longer corrects reads prior to assembly, thus the information about --no_correct should be removed.
  3. Unicycler no longer polishes the assembly at the end, thus a new Pilon polishing step is required after its assembly.
    • This already happens for hybrid assemblies, but it should also be performed for Illumina assemblies.
  4. Unicycler has now discontinued its unicycler_polish script, which was used inside MpGAP's Pilon polishing module for paired-end reads.
    • Thus, this module needs to be updated so that it does not use this script and performs only a single round of Pilon polishing with either single-end (which is already the case) or paired-end reads.
    • This removes the dependency on the ALE binaries.

Obs: for now, this release does not impact the pipeline since it is pinned to v0.4.8. However, to use the new version, these observations should be addressed.

Requesting support with error "Explicit 'name separator' in class"

Hi,

Thanks for your kind support in the past. I've been using mpgap routinely in our projects, but now I've been struggling for a while with a new installation... maybe it's trivial, but I'm stuck and wondered if you could comment on it.
I'm using singularity. Everything seems to be working and the pipeline starts, but then it ends with the error:

Explicit 'name separator' in class near index 8
[dataset/pacbio.fastq]
^
-- Check script '/.nextflow/assets/fmalmeida/mpgap/./workflows/parse_samples.nf' at line: 66 or see '.nextflow.log' file for more details

I believe I've used the same syntax as always, and the one suggested in the manual in the yml samplesheet:
samplesheet:
  - id: Sol_test_1
    pacbio:
      - 'dataset/pacbio.fastq'
    genome_size: 39.11m
    wtdbg2_technology: rs
    corrected_long_reads: false

I'm attaching the nextflow.log. It may be something trivial and nextflow-related... but could you comment please? I've unsuccessfully tried using different quotes in the yml file, and even placing the file in the same folder.

Thanks!

100% missing in Busco

Hi Dr Almeida,

I was playing around with the pipeline and everything went well; however, I found an issue with BUSCO.
The output summary showed that no BUSCOs were found in the query genome (100% missing) against bacteria_odb9. I also found some issues on the QUAST GitHub that seem related to LD_LIBRARY_PATH (ablab/quast#88).

I also tried a standalone BUSCO container (v5.4.7) with bacteria_odb10 and got 100% complete single-copy BUSCOs. This confirms that the assembled genome itself is fine.

Could you please kindly check if it could be fixed?

Best,
CW

Add an option for multiple samples

Add an option to facilitate and organize the execution of the pipeline for multiple samples. Maybe create something using a YAML syntax, as is done in bacannot.

problem with longreads_only assembly

[intergalactic_knuth] Nextflow Workflow Report.pdf

Hi,

I am trying to assemble a plant genome (~800m) from PacBio Revio reads.

here is the command I use
nextflow -bg run fmalmeida/mpgap --output _ASSEMBLY --max_cpus 20 --skeep_wtdbg2 --genome_size 800m --input MPGAP_samplesheet1.yml -profile docker

here is the yml file contents

samplesheet:
  - id: sample_5
    pacbio: HMW_DNA_m84126_231020_112323_s3.hifi_reads.fastq.gz

The process started, but at some point I get error messages similar to the following for all the assemblers:

[Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: LONGREADS_ONLY:canu (sample_5); status: COMPLETED; exit: 1; error: -; workDir: /mnt/data/guyh/Trifolium/Revio/work/a8/a8499e26430751241cde25981ce53b]

A pdf version of mpgap report is attached

Can you please advise?

Thank you in advance.
Guy

Incomplete pipeline and different errors when using nanopore reads files with different sizes (900 mb vs 11Gb)

Describe the bug
I encountered an issue while running the pipeline with two barcoded genome samples (Barcode04 and Barcode06). These samples produced exceptionally large output files: Barcode04 resulted in an 11GB file, while Barcode06 generated a massive 980GB file. Both runs also exhibited different errors. It's worth noting that the reference genome size for these samples, which are from hummingbirds, is approximately 1.5GB. The sequencing was performed using the Nanopore Promethion 10.4.1 platform, and basecalling was done with the Super Accurate algorithm.

To Reproduce
Steps to reproduce the behavior:
Run the following command line with the files in the respective folders
nextflow run fmalmeida/mpgap --output output_barcode04_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode04.yml" -profile docker

or

nextflow run fmalmeida/mpgap --output output_barcode06_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode06.yml" -profile docker

Expected behavior
Output folders with the results of the pipeline

Archive.zip

add trycycler

Add the trycycler tool, as an option, to generate a consensus assembly from the long-read assemblies.

Enhance documentation (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Create an "Output" page to help users understand the output structure and point to the relevant tool-specific links, as is done in the bacannot documentation page, which gives users an interpretation of the generated results, including the directory structure and relevant links to tool-specific reference material.

new directory called "final_output"

Add a rule to the pipeline so that all the assemblies of a sample have a copy stored in a single folder, e.g. final_output, so that it is easier for users to select and retrieve the assemblies they want.
