longtianpy / skewgd_v1

Whole Genome Duplication detection pipeline


skewgd_v1's Introduction

SkewGD

Summation of Ks pairs for Exploration of Whole Genome Duplications -- Whole Genome Duplication detection pipeline

INTRODUCTION

WGD_detection is a script that chains several bioinformatics tools and calculations to visualize gene duplication events from the full set of coding sequences (CDS) of an organism.

INPUT: CDS file in FASTA format;

OUTPUT: kS distribution data in CSV format and a histogram, both written to the user-specified working directory (set by -d).

WORKFLOW

CDS translation to protein sequences;
Pairwise all-against-all protein comparison (self-BLAST) by BLASTP;
Extraction of sequence pairs by identity (default: 50%) and coverage (default: 30%);
Markov chain clustering by MCL;
Sequence alignment of each cluster by MUSCLE;
For each alignment, back-translation of the aligned protein sequences to nucleotide sequences according to the input CDS;
Pairwise estimation of synonymous substitution rates (kS) on each nucleotide sequence alignment by yn00 from PAML;
kS correction and gene duplication event clustering;
Data visualization
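
The back-translation step in the workflow above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the helper name `back_translate` is hypothetical, and it assumes each aligned protein is the codon-by-codon translation of its CDS, so that every residue maps to exactly three nucleotides.

```python
def back_translate(aligned_protein, cds, gap="-"):
    """Thread a CDS onto its aligned protein sequence, codon by codon.

    Hypothetical helper: assumes `aligned_protein` with gaps removed is the
    codon-by-codon translation of `cds`.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")

    codons = []
    pos = 0  # index of the next unused nucleotide in the CDS
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)  # one protein gap becomes three nucleotide gaps
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)
```

Keeping gaps in multiples of three preserves the reading frame, which a codon-based kS estimate depends on.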

DEPENDENCIES AND REQUIREMENTS

WGD_detection is developed in Python 2.x, relies on the Python modules and external software listed below, and is compatible with Python 3.

When the pipeline starts, it first checks that every dependency is correctly installed.
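
A dependency check of this kind could look roughly like the sketch below. It is not taken from the pipeline: `find_missing` is a hypothetical name, and the sketch uses `shutil.which`, which is available in Python 3 only (under Python 2, `distutils.spawn.find_executable` plays a similar role).

```python
import shutil


def find_missing(executables, which=shutil.which):
    """Return the labels of executables that cannot be located.

    `executables` maps a label (e.g. "blastp") to a command name or path.
    `which` is injectable so the check can be exercised without the tools installed.
    """
    return [label for label, cmd in sorted(executables.items())
            if which(cmd) is None]
```

A caller could pass the paths given by --blastp, --makeblastdb, --muscle, --mcl and --yn00 and stop with a clear message if the returned list is not empty.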

For information about installing the dependencies, see below. The version numbers listed are the versions this pipeline was developed with; using the newest version is recommended.

Python 2.x
Modules can be installed using pip: pip install [module_name]
Pandas v0.16.2
BioPython v1.64
Seaborn v0.7.0
BLAST for Linux v2.2.31: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+ (Ubuntu users can install it directly with sudo apt-get install ncbi-blast+)
MCL v14-137
MUSCLE v3.8.31
YN00 v4.8 (Ubuntu users can install it directly with sudo apt-get install paml)

USAGE

usage: WGD_detection.py [-h] [-i NUCLEOTIDE_CDS] [-I CDS_FOLDER]
                        [-o OUTPUT_PREF] [-d WORKING_DIR] 
                        [--blastp BLASTP] [--makeblastdb MAKEBLASTDB]
                        [--muscle MUSCLE] [--mcl MCL]
                        [-yn00 YN00_PATH]
                        [--identity IDENTITY] [--coverage COVERAGE]
                        [--blastp_threads BLASTP_THREADS]
                        [--mcl_threads MCL_THREADS]
                        [--mcl_inflation MCL_INFLATION]
                        [--cluster_aln_threads CLUSTER_ALN_THREADS]

Generate a kS distribution histogram to detect Whole Genome Duplication (WGD)
events, taking the full coding sequences of an organism as input.

optional arguments:
  -h, --help            show this help message and exit
  -i NUCLEOTIDE_CDS     Full coding sequences of the organism of interest.
  -I CDS_FOLDER         A directory with CDS files of different organisms
                        only. NOTE: This option cannot be used with -i at the
                        same time. Options for threads need to be set to a
                        reasonable number since a maximum of 2 files can be
                        running at the same time.
  -o OUTPUT_PREF        Prefix for the MCL clustered files. 
                        Default: Prefix of input file.
                        This option is ignored if otherwise indicated when 
                        "-I" is used.
  -d WORKING_DIR        Working directory to store intermediate files of each
                        step. Default: ./ .
  --blastp BLASTP       File path to blastp executable. Default:
                        /usr/bin/blastp .
  --makeblastdb MAKEBLASTDB
                        File path to makeblastdb executable. Default:
                        /usr/bin/makeblastdb .
  --muscle MUSCLE       File path to MUSCLE executable.
  --mcl MCL             File path to MCL executable. Default: /usr/bin/mcl .
  --yn00 YN00_PATH      File path to yn00 executable. Default: /usr/bin/yn00 .
  --identity IDENTITY   Threshold of percentage identity in BLAST result.
                        Default: 50 .
  --coverage COVERAGE   Threshold of percentage alignment coverage in BLAST
                        result. Default: 30 .
  --blastp_threads BLASTP_THREADS
                        Number of threads for running BLASTp. Default: 8 .
  --mcl_threads MCL_THREADS
                        Number of threads for running MCL. Default: 1 .
  --mcl_inflation MCL_INFLATION
                        Tune the granularity of clustering. Usually chosen
                        from the range [1.2, 5.0]. 5.0 makes the clustering
                        fine-grained and 1.2 makes it coarse. Default: 2.0 .
  --cluster_aln_threads CLUSTER_ALN_THREADS
                        Number of threads for parallelling the alignment of
                        clusters. Default: 8 .
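
To illustrate how the --identity and --coverage thresholds above might be applied to BLAST results, here is a minimal sketch. It is an assumption-laden example rather than the pipeline's own filter: `filter_hits` is a hypothetical name, the input is assumed to be tab-separated with the columns qseqid, sseqid, pident, length and qlen, coverage is assumed to mean alignment length divided by query length, and the thresholds are treated as inclusive.

```python
def filter_hits(lines, identity=50.0, coverage=30.0):
    """Keep (query, subject) pairs that pass both thresholds.

    Assumed columns (tab-separated): qseqid, sseqid, pident, length, qlen.
    Assumed coverage: 100 * alignment length / query length.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        qseqid, sseqid, pident, length, qlen = line.split("\t")[:5]
        if qseqid == sseqid:
            continue  # skip self-hits
        cov = 100.0 * int(length) / int(qlen)
        if float(pident) >= identity and cov >= coverage:
            pairs.append((qseqid, sseqid))
    return pairs
```

Raising either threshold yields fewer, more similar pairs for MCL to cluster; lowering them admits more distant paralogs.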


skewgd_v1's Issues

Failed test run with A. thaliana

aurebg@begonia:/data/aurebg_projects/WGD$ python /data/software/SkewGD/WGD_detection.py -i Athaliana_Phytozome167_TAIR10.cds.fa -o WGD_Atha/Atha_MCL -d WGD_Atha/ -y /usr/local/bin/yn00

** (WGD_detection.py:10372): WARNING **: Couldn't connect to accessibility bus: Failed to connect to socket /tmp/dbus-Cf5K2FmXPG: Connection refused
Translating CDS to proteins...

/usr/local/lib/python2.7/dist-packages/Bio/Seq.py:2041: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
Self-blasting, this may take long...

Building a new DB, current time: 05/06/2016 16:25:21
New DB name: /data/aurebg_projects/WGD/Athaliana_Phytozome167_TAIR10.cds.fa.protein
New DB title: Athaliana_Phytozome167_TAIR10.cds.fa.protein
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 27416 sequences in 1.65178 seconds.
Clustering...
..[mcl] new tab created
[mcl] pid 19353
ite ------------------- chaos time hom(avg,lo,hi) expa expb expc fmv
1 ................... 30.09 0.08 1.02/0.02/3.66 2.17 2.17 2.17 0
2 ................... 47.06 0.18 0.90/0.19/1.55 2.34 1.35 2.94 7
3 ................... 25.04 0.22 0.83/0.11/2.87 2.30 0.69 2.02 10
4 ................... 17.66 0.12 0.82/0.17/3.61 1.93 0.69 1.40 6
5 ................... 20.96 0.06 0.85/0.21/2.28 1.28 0.72 1.01 1
6 ................... 10.83 0.04 0.88/0.20/2.51 1.06 0.74 0.74 0
7 ................... 5.50 0.03 0.91/0.39/2.26 1.02 0.74 0.55 0
8 ................... 2.18 0.02 0.94/0.46/1.36 1.01 0.75 0.41 0
9 ................... 1.39 0.01 0.96/0.54/1.34 1.00 0.80 0.33 0
10 ................... 0.95 0.01 0.98/0.52/1.19 1.00 0.86 0.28 0
11 ................... 0.64 0.01 0.99/0.63/1.22 1.00 0.90 0.26 0
12 ................... 0.48 0.01 0.99/0.58/1.00 1.00 0.94 0.24 0
13 ................... 0.34 0.01 1.00/0.76/1.00 1.00 0.97 0.23 0
14 ................... 0.25 0.01 1.00/0.76/1.00 1.00 0.98 0.23 0
15 ................... 0.25 0.01 1.00/0.76/1.00 1.00 0.99 0.22 0
16 ................... 0.24 0.01 1.00/0.76/1.00 1.00 0.99 0.22 0
17 ................... 0.37 0.01 1.00/0.75/1.00 1.00 1.00 0.22 0
18 ................... 0.50 0.01 1.00/0.63/1.00 1.00 1.00 0.22 0
19 ................... 0.28 0.01 1.00/0.80/1.00 1.00 1.00 0.22 0
20 ................... 0.03 0.01 1.00/0.97/1.00 1.00 1.00 0.22 0
21 ................... 0.00 0.01 1.00/1.00/1.00 1.00 1.00 0.22 0
22 ................... 0.00 0.01 1.00/1.00/1.00 1.00 1.00 0.22 0
[mcl] jury pruning marks: <99,99,99>, out of 100
[mcl] jury pruning synopsis: <99.0 or perfect> (cf -scheme, -do log)
[mcl] output is in Athaliana_Phytozome167_TAIR10.cds.fa.protein.mcl_out
[mcl] 5055 clusters found
[mcl] output is in Athaliana_Phytozome167_TAIR10.cds.fa.protein.mcl_out

Please cite:
Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis,
University of Utrecht, May 2000.
( http://www.library.uu.nl/digiarchief/dip/diss/1895620/full.pdf
or http://micans.org/mcl/lit/svdthesis.pdf.gz)
OR
Stijn van Dongen, A cluster algorithm for graphs. Technical
Report INS-R0010, National Research Institute for Mathematics
and Computer Science in the Netherlands, Amsterdam, May 2000.
( http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z
or http://micans.org/mcl/lit/INS-R0010.ps.Z)

Matching clusters...
5049
Traceback (most recent call last):
  File "/data/software/SkewGD/WGD_detection.py", line 125, in <module>
    main()
  File "/data/software/SkewGD/WGD_detection.py", line 41, in main
    afa_file_list = Hong_wrapper(nucleotide_cds=nucleotide_cds, identity=identity, coverage=coverage, output_prefix=out_prefix,working_dir=working_dir)
  File "/data/software/SkewGD/WGD_detection.py", line 95, in Hong_wrapper
    process_cluster_all.process_cluster(mcl_out=mcl_out, protein_cds=protein_cds, output_prefix=output_prefix,working_dir=working_dir)
  File "/data/software/SkewGD/process_cluster_all.py", line 41, in process_cluster
    with open(output_prefix+str(i)+'.txt','w') as output:
IOError: [Errno 2] No such file or directory: 'WGD_Atha/Atha_MCL1.txt'
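
The traceback ends in `open()` failing on 'WGD_Atha/Atha_MCL1.txt', which is what happens when the directory part of the output prefix does not exist relative to the process's current directory. One defensive option, sketched here under that assumption and not as the project's actual fix, is to create the parent directory before writing. The helper name `open_cluster_file` is hypothetical.

```python
import os


def open_cluster_file(output_prefix, index):
    """Open '<output_prefix><index>.txt' for writing.

    Hypothetical helper: creates the parent directory of the prefix first,
    so a prefix such as 'WGD_Atha/Atha_MCL' works even if 'WGD_Atha' is missing.
    """
    path = "{}{}.txt".format(output_prefix, index)
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    return open(path, "w")
```

Whether this addresses the reported failure depends on how the pipeline combines -o and -d, which the log alone does not show.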

Argument for threads

The number of threads should be configurable. For example, process_blast.py uses '-num_threads 8', but the program should accept any number specified by the user, depending on the available resources (for example, the Bombarely lab's server has 64 threads).
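
A minimal sketch of how a user-supplied thread count could be passed through to blastp is shown below. The option name `--blastp_threads` matches the one in the usage text; the functions `parse_args` and `build_blastp_command` and the file names in the usage note are illustrative assumptions, not code from process_blast.py.

```python
import argparse


def build_blastp_command(query, db, out, threads, blastp="blastp"):
    """Build a blastp argument list with -num_threads taken from the caller."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return [blastp, "-query", query, "-db", db, "-outfmt", "6",
            "-out", out, "-num_threads", str(threads)]


def parse_args(argv=None):
    """Parse only the thread option; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Thread option sketch")
    parser.add_argument("--blastp_threads", type=int, default=8,
                        help="Number of threads for running BLASTp. Default: 8")
    return parser.parse_args(argv)
```

For example, `build_blastp_command("cds.protein", "cds.protein", "hits.tsv", parse_args(["--blastp_threads", "64"]).blastp_threads)` yields a command list ending in `-num_threads 64`, suitable for `subprocess.call`.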

Warning messages Run 1

I have these warning messages:
1- ** (WGD_detection.py:10372): WARNING **: Couldn't connect to accessibility bus: Failed to connect to socket /tmp/dbus-Cf5K2FmXPG: Connection refused
Translating CDS to proteins...
2- /usr/local/lib/python2.7/dist-packages/Bio/Seq.py:2041: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
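
The second warning says that some CDS lengths are not a multiple of three and suggests trimming before translation. A minimal sketch of that trimming is below; `trim_to_codons` is a hypothetical helper, and whether silently trimming incomplete CDS is appropriate, rather than reporting or skipping them, is a judgment call for the pipeline.

```python
def trim_to_codons(seq):
    """Drop trailing nucleotides so that len(seq) is a multiple of three."""
    return seq[:len(seq) - len(seq) % 3]
```

Applied to each sequence before translation, this removes the trailing one or two nucleotides that trigger the warning.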

Steps Information

The pipeline runs through several steps. It would be informative for the user if, while the pipeline is running, it printed messages like:

====================
SkewGD pipeline starts
Date XXXXXXXX
====================

Step 1 of Z: Translating CDS to proteins start (date)
XXXXXX CDS have been translated to proteins

Step 2 of Z: Building the selfblast database (date)

Step 3 of Z: Performing selfblast (date)
XXXXX hits were detected excluding selfhits.

....
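
Messages of the kind requested above could be produced by a small formatting helper. The sketch below is only an illustration: `step_message` is a hypothetical name and the timestamp format is an assumption.

```python
from datetime import datetime


def step_message(step, total, description, now=None):
    """Format a progress line: 'Step <step> of <total>: <description> (<timestamp>)'."""
    now = now or datetime.now()
    return "Step {} of {}: {} ({})".format(
        step, total, description, now.strftime("%Y-%m-%d %H:%M:%S"))
```

Each pipeline stage would print one such line when it starts, followed by a summary count once it finishes.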

Software Version

The pipeline may be modified in the future, so it should report a program version number.