Automated long-read metagenomics workflow that takes either PacBio HiFi or Nanopore sequencing reads as input and generates characterized metagenome-assembled genomes (MAGs). The mmlong2 workflow is a continuation of mmlong.
Note: mmlong2 uses several large databases for genome bin analysis. If you are only interested in getting the MAGs, check out mmlong2-lite.
Overview of mmlong2 workflow in Nanopore-only mode:
Installation (Conda):
A local Conda environment containing all the required software dependencies can be created with the commands below. To obtain microbial genome taxonomy and annotation results, the databases also have to be set up.
conda create --prefix mmlong2 -c conda-forge -c bioconda snakemake=7.26.0 singularity=3.8.6 zenodo_get=1.3.4 pv=1.6.6 pigz=2.6 tar=1.34 -y
conda activate ./mmlong2 || source activate ./mmlong2 && zenodo_get -r 8027235 -o mmlong2/bin
pv mmlong2/bin/sing-mmlong2-lite-*.tar.gz | pigz -dc - | tar xf - -C mmlong2/bin/.
pv mmlong2/bin/sing-mmlong2-proc-*.tar.gz | pigz -dc - | tar xf - -C mmlong2/bin/.
chmod +x mmlong2/bin/mmlong2
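After the installation commands finish, it can help to confirm that the wrapper script and the main tools are in place before starting a long run. The following is a minimal sketch: `check_mmlong2_install` is a hypothetical helper, not something shipped with mmlong2. It only checks for the `mmlong2/bin/mmlong2` script made executable above, plus any tool names passed as extra arguments.

```shell
# check_mmlong2_install PREFIX [TOOL...]
# Hypothetical helper (not part of mmlong2): checks that the wrapper script
# inside the Conda prefix is executable and that the named tools are on PATH.
check_mmlong2_install() {
    local prefix="${1:-mmlong2}"
    local status=0
    local tool
    if [ "$#" -gt 0 ]; then shift; fi

    if [ -x "$prefix/bin/mmlong2" ]; then
        echo "ok:      $prefix/bin/mmlong2"
    else
        echo "missing: $prefix/bin/mmlong2 (absent or not executable)"
        status=1
    fi

    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok:      $tool"
        else
            echo "missing: $tool (not on PATH; is the environment activated?)"
            status=1
        fi
    done
    return "$status"
}

# Example call, run from the directory that holds the ./mmlong2 prefix.
# The function returns non-zero if anything is missing.
check_mmlong2_install mmlong2 snakemake singularity || echo "installation looks incomplete"
```

The tool names in the example call are taken from the `conda create` command; the check says nothing about whether the downloaded containers were extracted correctly.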
Quick-start (AAU bioserver users):
conda activate /projects/microflora_danica/mmlong2/conda/mmlong2-v0.9.2
mmlong2 -h
Usage example for Nanopore-only mode:
mmlong2 -np [Nanopore_reads.fastq] -p [Processes/Threads] -o [Output_dir]
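Since the workflow also accepts PacBio HiFi reads through `-pb`, the equivalent call for that mode differs only in the input flag. The sketch below uses a hypothetical helper, `build_mmlong2_cmd`, that only prints the command line so it can be inspected before running; the file and directory names are placeholders.

```shell
# build_mmlong2_cmd READ_TYPE READS THREADS OUTDIR
# Hypothetical helper (not part of mmlong2): prints an mmlong2 command line
# for the given read type without running anything.
build_mmlong2_cmd() {
    local read_type="$1" reads="$2" threads="$3" outdir="$4"
    local flag
    case "$read_type" in
        nanopore) flag="-np" ;;   # Nanopore reads
        pacbio)   flag="-pb" ;;   # PacBio HiFi reads
        *) echo "unknown read type: $read_type" >&2; return 1 ;;
    esac
    printf 'mmlong2 %s %s -p %s -o %s\n' "$flag" "$reads" "$threads" "$outdir"
}

# Placeholder inputs: prints the PacBio HiFi counterpart of the Nanopore example.
build_mmlong2_cmd pacbio hifi_reads.fastq 16 mmlong2_pb
```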
Full usage:
MAIN INPUTS:
-np --nanopore_reads Path to Nanopore reads (default: none)
-pb --pacbio_reads Path to PacBio HiFi reads (default: none)
-o --output_dir Output directory name (default: mmlong2)
-p --processes Number of processes/multi-threading (default: 3)
-cov --coverage CSV dataframe for differential coverage binning (e.g. NP/PB/IL,/path/to/reads.fastq)
-run --run_until Run pipeline until a specified stage completes
(e.g. assembly polishing binning taxonomy annotation variants)
ADDITIONAL INPUTS:
-tmp --temporary_dir Directory for temporary files (default: none)
-med1 --medaka_model_polish Medaka polishing model (default: r1041_e82_400bps_sup_v4.2.0)
-med2 --medaka_model_variant Medaka variant calling model (default: r1041_e82_400bps_sup_variant_v4.2.0)
-sem --semibin_model Binning model for SemiBin (default: global)
-fmo --flye_min_ovlp Minimum overlap between reads used by Flye assembler (default: auto)
-fmc --flye_min_cov Minimum initial contig coverage used by Flye assembler (default: 3)
-mlc --min_len_contig Minimum assembly contig length (default: 3000)
-mlb --min_len_bin Minimum genomic bin size (default: 250000)
-slv --silva Silva database to use (default: none)
-mds --midas Midas database to use (default: none)
-gnc --gunc Gunc database to use (default: none)
-bkt --bakta Bakta database to use (default: none)
-kj --kaiju Kaiju database to use (default: none)
-gdb --gtdb GTDB-tk database to use (default: none)
-x1 --extra_inputs1 Extra inputs for the MAG production part of the Snakemake workflow (default: none)
-x2 --extra_inputs2 Extra inputs for the MAG processing part of the Snakemake workflow (default: none)
MISCELLANEOUS INPUTS:
-h --help Print help information
-v --version Print workflow version number
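The `-cov` option expects a CSV file in which each row gives a read type (NP, PB or IL) followed by the path to a read file. The sketch below writes such a file and runs a basic format check on it. The file name, the paths, the absence of a header row, and the `validate_cov_csv` helper are all assumptions made for illustration; IL presumably refers to Illumina short reads.

```shell
# Write a hypothetical coverage file: one row per additional read set,
# read type first (NP, PB or IL), then the path to the reads.
cat > coverage.csv <<'EOF'
NP,/path/to/sample2_nanopore.fastq
IL,/path/to/sample2_illumina.fastq
EOF

# validate_cov_csv FILE
# Hypothetical sanity check (not part of mmlong2): every row must have
# exactly two comma-separated fields and a recognized read type.
validate_cov_csv() {
    awk -F',' '
        BEGIN { bad = 0 }
        NF != 2 || $1 !~ /^(NP|PB|IL)$/ || $2 == "" {
            printf "line %d looks malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

validate_cov_csv coverage.csv && echo "coverage.csv looks well-formed"

# The file could then be passed to the workflow, here combined with -run
# to stop once binning has completed (placeholder inputs):
# mmlong2 -np reads.fastq -cov coverage.csv -run binning -p 16 -o mmlong2_out
```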
Overview of result files:
assembly.fasta - assembled and polished metagenome
rRNA.fa - rRNA sequences, recovered from the polished metagenome
rRNA_16S.fa - 16S rRNA sequences, recovered from the polished metagenome
<name>_contigs.tsv - dataframe for metagenome contig metrics
<name>_bins.tsv - dataframe for automated binning results
<name>_general.tsv - workflow results, summarized into a single row
dependencies.csv - list of dependencies used and their versions
bins - directory for metagenome assembled genomes
bakta - directory, containing bin annotation results from bakta
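Once a run has finished, a quick way to see which of the result files listed above were produced is to search the output directory for them. The following is a rough sketch: `list_mmlong2_results` is a hypothetical helper, and because the exact location of each file inside the output directory is not stated here, it searches the whole directory tree by name.

```shell
# list_mmlong2_results OUTDIR
# Hypothetical helper (not part of mmlong2): reports which of the documented
# result files and directories can be found anywhere beneath OUTDIR.
list_mmlong2_results() {
    local outdir="${1:-mmlong2}"   # default output directory name of mmlong2
    local pattern
    for pattern in assembly.fasta rRNA.fa rRNA_16S.fa \
                   '*_contigs.tsv' '*_bins.tsv' '*_general.tsv' \
                   dependencies.csv bins bakta; do
        # A non-empty first match means at least one entry with that name exists.
        if [ -n "$(find "$outdir" -name "$pattern" 2>/dev/null | head -n 1)" ]; then
            echo "found:   $pattern"
        else
            echo "missing: $pattern"
        fi
    done
}

# Example call on the default output directory.
list_mmlong2_results mmlong2
```

A "missing" entry is not necessarily an error, for example when `-run` was used to stop the pipeline at an earlier stage.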
Additional documentation:
Comments:
- The workflow assumes that the input reads have been quality-filtered and adapter/barcode sequences have been trimmed off.
- The workflow is long-read-based and requires either Nanopore or PacBio HiFi reads. It doesn't feature an Illumina-only mode.
- If the workflow crashes, it can be resumed by re-running the same command. Some of the intermediate files might have to be removed first for compatibility.
- It is recommended to run the workflow from a screen session. This can be achieved with e.g.
screen -R mmlong2
and then running the workflow.
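Because a crashed run can be resumed by re-running the same command, a small retry wrapper can be convenient for unattended runs inside a screen session. This is a minimal sketch under that assumption: `run_with_retries` is a hypothetical helper, not part of mmlong2, and it will not help in cases where intermediate files first have to be removed by hand.

```shell
# run_with_retries MAX CMD [ARG...]
# Hypothetical helper (not part of mmlong2): runs CMD and, if it fails,
# re-runs the identical command until it succeeds or MAX attempts are used.
run_with_retries() {
    local max="$1"
    shift
    local attempt=1
    while [ "$attempt" -le "$max" ]; do
        echo "attempt $attempt of $max: $*" >&2
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up after $max attempts" >&2
    return 1
}

# Inside a screen session (screen -R mmlong2), with placeholder inputs:
# run_with_retries 2 mmlong2 -np reads.fastq -p 16 -o mmlong2_out
```

Keeping the attempt limit low avoids repeating a failure that needs manual attention.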
Future improvements
Suggestions on improving the workflow or fixing bugs are always welcome.
Please use the GitHub Issues section or send an e-mail to [email protected] to provide feedback.