Licence: GNU General Public License v3.0 (copy provided in directory)
Author: Tom van Wijk
Contact: [email protected]
This script was developed for the assembly of whole genome sequencing
(WGS) data of bacterial isolates. It was developed specifically for
paired-end Illumina sequencing data, but may work, or be easily
adapted to work, with other sequencing data formats.
Quality reports of the raw data are generated using FastQC and MultiQC.
The reads are quality trimmed from both ends using erne-filter
and assembled de novo into contigs and scaffolds using SPAdes.
The assemblies are assessed using QUAST, and a quality report is generated.
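The steps described above can be sketched as a list of command lines for one sample. This is only an illustration under stated assumptions, not necessarily what assembly.py runs: the function name `build_commands`, the output directory layout, and the `sample` output prefix are hypothetical; the trimmed file names (`<prefix>_1.fastq`, `<prefix>_2.fastq`) assume erne-filter's usual output naming; and the flags are those of the tool versions listed in this README as far as known, so check each tool's `--help` before relying on them. The defaults for threads and memory mirror the script's documented defaults.

```python
import os


def build_commands(forward, reverse, outdir, threads=4, memory=13):
    """Return the command lines for one sample, in pipeline order.

    Each command is a list of arguments suitable for subprocess.call().
    Directory names and the 'sample' prefix are hypothetical.
    """
    fastqc_dir = os.path.join(outdir, "fastqc")
    # Output prefix for erne-filter; assumed to yield <prefix>_1.fastq and <prefix>_2.fastq
    trimmed = os.path.join(outdir, "trimmed", "sample")
    spades_dir = os.path.join(outdir, "spades")
    quast_dir = os.path.join(outdir, "quast")
    threads, memory = str(threads), str(memory)
    return [
        # 1. Quality reports of the raw reads
        ["fastqc", "-o", fastqc_dir, "-t", threads, forward, reverse],
        ["multiqc", fastqc_dir, "-o", os.path.join(outdir, "multiqc")],
        # 2. Quality trimming of both reads
        ["erne-filter", "--query1", forward, "--query2", reverse,
         "--output-prefix", trimmed, "--threads", threads],
        # 3. De novo assembly of the trimmed reads
        ["spades.py", "-1", trimmed + "_1.fastq", "-2", trimmed + "_2.fastq",
         "-o", spades_dir, "-t", threads, "-m", memory],
        # 4. Assessment of the assembly
        ["quast.py", "-o", quast_dir, "-t", threads,
         os.path.join(spades_dir, "scaffolds.fasta")],
    ]
```

Printing `" ".join(cmd)` for each entry shows the commands without running anything, which is a safe way to inspect them first.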
- Linux operating system. This script was developed on Linux Ubuntu.
WARNING: experiences when using different operating systems may vary.
- python 2.7.x
- python libraries as listed in the import section
- erne-filter 2.1.1 (http://erne.sourceforge.net/)
- SPAdes 3.10.0 (http://cab.spbu.ru/files/release3.10.0/SPAdes-3.10.0-Linux.tar.gz)
- QUAST 4.4 (https://downloads.sourceforge.net/project/quast/quast-4.4.tar.gz)
- FastQC
- MultiQC
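To verify that the external tools listed above are installed, a small check along these lines may help. It is a minimal sketch and not part of the pipeline: the helper name `find_missing_tools` is hypothetical, and the executable names (`fastqc`, `multiqc`, `erne-filter`, `spades.py`, `quast.py`) are the names these tools usually install under, which is an assumption here. It relies on `shutil.which`, which needs Python 3.3 or later, so run it with a Python 3 interpreter rather than the 2.7 interpreter the pipeline itself requires.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["fastqc", "multiqc", "erne-filter", "spades.py", "quast.py"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This check only confirms that the executables can be found; it does not compare the installed versions with the ones listed above.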
- Clone the assembly_pipeline repository to the desired location on your system.
git clone https://github.com/Papos92/assembly_pipeline.git
- Add the directory that contains assembly.py to the PATH variable:
export PATH=$PATH:/path/to/assembly_pipeline
(It is recommended to add this command to your ~/.bashrc file.)
The script can be run with the following command:
assembly.py -i 'inputdir' -o 'outputdir' -t 'threads' -m 'memory' -x 'savetemp'
- 'inputdir': Location of the input directory. (Required)
Should contain only the uncompressed (.fastq) or compressed (.fastq.gz) sequence files with the raw forward and reverse reads. The file names must contain an _R1 and an _R2 tag for the forward and reverse reads, respectively.
Each sample (a pair of forward and reverse files) is treated as a separate isolate. It is not (yet) possible to process isolates that are split across multiple samples; this will be added when required.
- 'outputdir': Location of the output directory. (Default = inputdir)
The output will be stored in multiple directories inside this directory.
- 'threads': Number of threads (virtual CPU cores) to be used. (Default = 4)
- 'memory': Maximum amount of RAM (GB) to be used. (Default = 13)
If the machine runs out of RAM, SPAdes will crash, so adjust this parameter to suit your machine.
- 'savetemp': Set to true to save the temporary files and directories generated by the pipeline. (Default = false)
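Because every sample must consist of one _R1 and one _R2 file, it can be useful to check an input directory before starting a run. The following is a minimal sketch of such a check. The function name `pair_reads` and the file name pattern are assumptions made for illustration; the exact matching rules used by assembly.py are not documented here and may differ.

```python
import re

# Matches e.g. "isolate1_R1.fastq" or "isolate1_R2_001.fastq.gz".
# This pattern is an assumption, not the one used by assembly.py.
READ_FILE = re.compile(
    r"^(?P<sample>.+)_R(?P<read>[12])(?P<rest>.*)\.fastq(\.gz)?$"
)


def pair_reads(filenames):
    """Group file names by sample.

    Returns (pairs, incomplete, unmatched):
      pairs      - {sample: {"R1": name, "R2": name}} for complete samples
      incomplete - sorted sample names with only one of the two reads
      unmatched  - file names that do not look like read files
    """
    samples = {}
    unmatched = []
    for name in sorted(filenames):
        match = READ_FILE.match(name)
        if match is None:
            unmatched.append(name)
            continue
        # Everything except the _R1/_R2 tag identifies the sample.
        key = match.group("sample") + match.group("rest")
        samples.setdefault(key, {})["R" + match.group("read")] = name
    pairs = {k: v for k, v in samples.items() if len(v) == 2}
    incomplete = sorted(k for k, v in samples.items() if len(v) != 2)
    return pairs, incomplete, unmatched
```

Calling `pair_reads(os.listdir("/path/to/inputdir"))` (after `import os`) lists the complete samples, the samples that lack a mate file, and any stray files that should be removed from the input directory.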