INTRODUCTION

PhiSpy is a computer program written in C++, Python, and R to identify prophages in complete bacterial genome sequences.

Initial versions of PhiSpy were written by

Sajia Akhter, PhD Student, Edwards Bioinformatics Lab (http://edwards.sdsu.edu/labsite/), Computational Science Research Center (http://www.csrc.sdsu.edu/csrc/), San Diego State University (http://www.sdsu.edu/)

Improvements, bug fixes, and other changes were made by

Katelyn McNair, Edwards Bioinformatics Lab (http://edwards.sdsu.edu/labsite/), San Diego State University (http://www.sdsu.edu/)

SYSTEM REQUIREMENTS

The program should run on all Unix platforms, although it has not been tested on all of them.

SOFTWARE REQUIREMENTS

PhiSpy requires the following programs to be installed on the system. NOTE: You can ignore this if you're using the singularity container method of installation.

  1. Python - version 3.4 or later
  2. Biopython - version 1.58 or later
  3. gcc - GNU project C and C++ compiler - version 4.4.1 or later
  4. The R Project for Statistical Computing - version 2.9.2 or later
  5. Package randomForest in R - version 4.5-36 or later
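A quick way to see whether a machine meets part of this list is a small Python check. The following is only a sketch, not part of PhiSpy: the function names are made up for illustration, it only confirms that gcc and R are on the PATH (it does not check their versions), and it does not look for the randomForest package.

```python
import re
import shutil
import sys

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 4)
MIN_BIOPYTHON = (1, 58)


def parse_version(text):
    """Turn a version string such as '1.58' into a tuple of ints.

    Parsing stops at the first dot-separated piece that does not start
    with a digit, so '1.79.dev0' becomes (1, 79).
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.4 or later is required")
    try:
        import Bio  # Biopython
        if parse_version(Bio.__version__) < MIN_BIOPYTHON:
            problems.append("Biopython 1.58 or later is required")
    except ImportError:
        problems.append("Biopython is not installed")
    # Only presence on the PATH is checked here, not the version.
    for tool in ("gcc", "R"):
        if shutil.which(tool) is None:
            problems.append("{} was not found on the PATH".format(tool))
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

An empty result means none of these particular checks failed; it does not guarantee that the build will succeed.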

INSTALLATION

  1. % git clone https://github.com/linsalrob/PhiSpy.git
  2. % cd PhiSpy
  3. % make
  4. For ease of use, add the location of PhiSpy.py to your $PATH.

ALTERNATE INSTALLATION

  1. Get singularity
  2. Build phispy.img using this repository
  3. Run the singularity image: % singularity exec phispy.img PhiSpy.py
  4. NOTE: If you haven't used singularity before, you'll need to know about binding directories so that PhiSpy can find your input and output.
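As an illustration of the binding mentioned in step 4, the sketch below assembles a singularity command line that maps a host input directory and a host output directory into the container. It assumes the standard `--bind src:dest` option of `singularity exec`; the container-side paths `/input` and `/output` and the function name are arbitrary choices made for this example, not something the PhiSpy image defines.

```python
import os


def singularity_command(image, input_dir, output_dir, training_set=0):
    """Build (but do not run) a singularity command for PhiSpy.

    /input and /output are hypothetical mount points inside the container.
    """
    input_dir = os.path.abspath(input_dir)
    output_dir = os.path.abspath(output_dir)
    # Several bind pairs can be passed as one comma-separated value.
    binds = "{}:/input,{}:/output".format(input_dir, output_dir)
    return [
        "singularity", "exec",
        "--bind", binds,
        image,
        "PhiSpy.py", "-i", "/input", "-o", "/output",
        "-t", str(training_set),
    ]


if __name__ == "__main__":
    print(" ".join(singularity_command("phispy.img", "tests/160490.1", "output_directory", 25)))
```

The resulting list can be passed to `subprocess.call` once the image has been built and the output directory exists on the host.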

TO TEST THE PROGRAM

  1. % cd PhiSpy
  2. % python PhiSpy.py -i tests/160490.1/ -o output_directory -t 25

tests/160490.1/ is a SEED annotation directory for the genome 'Streptococcus pyogenes M1 GAS'. You will find the output files for this genome in output_directory.

TO RUN PHISPY

% ./PhiSpy.py -i organism_directory -o output_directory -c

where:

'output directory': The directory where the final output files will be created.

'organism directory': The SEED annotation directory for the input bacterial organism whose prophage(s) need to be identified.

You can download the SEED genomes from the PhAnToMe database.

Or, if you have a new genome, you can annotate it using the RAST server. After annotation, you can download the genome directory from the server.

Or, if you have the GenBank file (containing the sequence) of the genome, you can convert it using the following command: % python scripts/genbank_to_seed.py GenBank_file.gb organism_directory

Now to run PhiSpy, use organism_directory as 'organism directory'.

The program will access the following files in the organism_directory:

  1. contig file: organism_directory/contigs
  2. tbl file for peg: organism_directory/Features/peg/tbl
  3. assigned_functions file: organism_directory/assigned_functions or organism_directory/proposed_functions or organism_directory/proposed_non_ff_functions
  4. tbl file for rna: organism_directory/Features/rna/tbl
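Before starting a run it can help to confirm that these files are in place. The following is a minimal sketch, not part of PhiSpy; the function name is hypothetical, and it assumes that all four kinds of file must be present, with any one of the three function files being sufficient.

```python
import os

# Paths relative to the organism directory, as listed above.
REQUIRED_FILES = [
    ("contigs",),
    ("Features", "peg", "tbl"),
    ("Features", "rna", "tbl"),
]
# Any one of these is accepted as the functions file.
FUNCTION_FILES = [
    "assigned_functions",
    "proposed_functions",
    "proposed_non_ff_functions",
]


def missing_inputs(organism_directory):
    """Return the relative paths that PhiSpy would look for but are absent."""
    missing = []
    for parts in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(organism_directory, *parts)):
            missing.append(os.path.join(*parts))
    has_functions = any(
        os.path.isfile(os.path.join(organism_directory, name))
        for name in FUNCTION_FILES
    )
    if not has_functions:
        missing.append(" or ".join(FUNCTION_FILES))
    return missing


if __name__ == "__main__":
    for path in missing_inputs("tests/160490.1"):
        print("missing:", path)
```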

Note: The assigned_functions file may not be in the RAST genome directory. You can create it from proposed_functions and proposed_non_ff_functions, or you can use this Perl script to create an assigned_functions file for you.

REQUIRED INPUT OPTIONS

The program takes one command-line input.

When run with the -c option, it shows a list of organisms, each associated with a number, and asks for a number from the list. If the list contains a genome closely related to your organism of interest, enter its number, and PhiSpy will use that genome as the training genome. Otherwise, enter 0 to run with the generic training set.
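For scripted runs, the command line can be assembled in Python. This is a sketch under one assumption: that the -t option takes the number from the list described above, as the example commands in this README (-t 25, -t 21) suggest. The function names are illustrative only.

```python
import subprocess
import sys


def phispy_command(organism_directory, output_directory, training_set=0):
    """Build the PhiSpy command line.

    training_set=0 selects the generic training set; any other value is
    assumed to be a number taken from the list printed with -c.
    """
    return [
        sys.executable, "PhiSpy.py",
        "-i", organism_directory,
        "-o", output_directory,
        "-t", str(training_set),
    ]


def run_phispy(organism_directory, output_directory, training_set=0):
    """Run PhiSpy from the current directory and return its exit code."""
    return subprocess.call(
        phispy_command(organism_directory, output_directory, training_set)
    )


if __name__ == "__main__":
    print(" ".join(phispy_command("tests/160490.1/", "output_directory", 25)))
```

Building the argument list separately from running it makes the command easy to log or inspect before anything is executed.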

HELP

For the help menu use the -h option: % python PhiSpy.py -h

OUTPUT FILES

There are 3 output files, located in the output directory.

  1. prophage.tbl: This file has two tab-separated columns [id, location]. The id is in the format pp_number, where number is a sequential number of the prophage (starting at 1). The location is in the format contig_start_stop and encompasses the prophage.

  2. prophage_tbl.tsv: This is a tab-separated file that contains all the genes of the genome. The tenth column gives the status of each gene: a nonzero value marks a phage-like gene, and 0 marks a bacterial gene.

     This file has 16 columns:

       (i) fig_no: the id of each gene
       (ii) function: function of the gene
       (iii) contig
       (iv) start: start location of the gene
       (v) stop: end location of the gene
       (vi) position: a sequential number of the gene (starting at 1)
       (vii) rank: rank of each gene provided by random forest
       (viii) my_status: status of each gene based on random forest
       (ix) pp: classification of each gene based on its function
       (x) Final_status: the status of each gene. For prophages, this column has the number of the prophage as listed in prophage.tbl above. If the column contains a 0, we believe that it is a bacterial gene.

     If we can detect the att sites, the additional columns will be:

       (xi) start of attL
       (xii) end of attL
       (xiii) start of attR
       (xiv) end of attR
       (xv) sequence of attL
       (xvi) sequence of attR

  3. prophage_coordinates.tsv: This file has the prophage ID, contig, start, stop, and potential att sites identified for the phages.
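The first two files are simple to read from Python. The sketch below is not part of PhiSpy and rests on two assumptions: that contig names may themselves contain underscores (as in the draft Salmonella example), so the location is split from the right, and that Final_status holds 0 for bacterial genes and the prophage number otherwise, as described for column (x). Rows whose tenth column is not an integer, such as a possible header line, are skipped.

```python
import csv


def parse_location(location):
    """Split 'contig_start_stop' into (contig, start, stop).

    Splitting from the right keeps underscores inside the contig name.
    """
    contig, start, stop = location.rsplit("_", 2)
    return contig, int(start), int(stop)


def read_prophage_tbl(path):
    """Read prophage.tbl into a dict: id -> (contig, start, stop)."""
    prophages = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            prophages[row[0]] = parse_location(row[1])
    return prophages


def genes_per_prophage(path):
    """Count the genes assigned to each prophage in prophage_tbl.tsv."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 10:
                continue
            try:
                status = int(row[9])  # tenth column: Final_status
            except ValueError:
                continue  # not a number, for example a header line
            if status > 0:
                counts[status] = counts.get(status, 0) + 1
    return counts
```

For example, `parse_location("NC_002737_529631_604720")` yields the contig name `NC_002737` with the start and stop as integers.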

EXAMPLE DATA

We have provided two different example data sets.

  • Streptococcus pyogenes M1 GAS which has a single genome contig. The genome contains four prophages.

To analyze this data, you can use:

python PhiSpy.py -t 25 -i tests/160490.1/ -o tests/160490.1.output

And you should get a prophage table that has this information:

Prophage number  Contig     Start    Stop
pp_1             NC_002737  529631   604720
pp_2             NC_002737  778642   846824
pp_3             NC_002737  1191309  1255536
pp_4             NC_002737  1607352  1637214

  • Salmonella enterica serovar Enteritidis LK5

This is an early draft of the genome (the published sequence has a single contig), but this draft has 1,410 contigs and some phage-like regions.

If you run PhiSpy on this draft genome with the default parameters, you will not find any prophages because they are all filtered out for not having enough genes. By default, PhiSpy requires 30 genes in a prophage. You can alter that stringency on the command line; for example, reducing the phage gene window size to 10 results in 3 prophage regions being identified.

python PhiSpy.py -t 21 -w 10 -i tests/272989.13/ -o tests/272989.13.output

You should get a prophage table that has this information:

Prophage number  Contig             Start  Stop
pp_1             Contig_2300_10.15  1630   10400
pp_2             Contig_2294_10.15  175    11290
pp_3             Contig_2077_10.15  318    12625
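To compare your own results with the expected tables above, a few lines of Python are enough. This is only a sketch with made-up function names. It assumes the table is available as whitespace-separated text with four fields per row (id, contig, start, stop); the expected values are copied from the Salmonella table above.

```python
EXPECTED_SALMONELLA = """
pp_1 Contig_2300_10.15 1630 10400
pp_2 Contig_2294_10.15 175 11290
pp_3 Contig_2077_10.15 318 12625
"""


def parse_prophage_table(text):
    """Parse rows of 'id contig start stop' into a dict keyed by id."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].startswith("pp_"):
            continue  # skip the header row and blank lines
        rows[fields[0]] = (fields[1], int(fields[2]), int(fields[3]))
    return rows


def differing_ids(expected, observed):
    """Return the prophage ids whose entries differ or are missing on one side."""
    ids = sorted(set(expected) | set(observed))
    return [pp for pp in ids if expected.get(pp) != observed.get(pp)]


if __name__ == "__main__":
    expected = parse_prophage_table(EXPECTED_SALMONELLA)
    observed = parse_prophage_table(EXPECTED_SALMONELLA)  # replace with your own table
    print(differing_ids(expected, observed))
```

An empty list means the two tables agree on every prophage.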
