
linsalrob / phispy


Prediction of prophages from bacterial genomes

License: MIT License

Python 22.32% Makefile 0.01% R 0.09% C++ 1.28% C 0.08% Jupyter Notebook 76.22%

phispy's Introduction


What is PhiSpy?

PhiSpy identifies prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions.

Initial versions of PhiSpy were written by

Sajia Akhter ([email protected]) Edwards Bioinformatics Lab

Improvements, bug fixes, and other changes were made by

Katelyn McNair Edwards Bioinformatics Lab and Przemyslaw Decewicz DEMB at the University of Warsaw

Installation

Conda

The easiest way to install for all users is to use bioconda.

conda install -c bioconda phispy

PIP

Installing with pip requires a C++ compiler and the Python header files. You should be able to install those, and then PhiSpy, like this:

sudo apt install -y build-essential python3-dev python3-pip
python3 -m pip install --user PhiSpy

This will install PhiSpy.py in ~/.local/bin which should be in your $PATH but might not be (see this detailed discussion). See the tips and tricks below for a solution to this.

Advanced Users

For advanced users, you can clone the git repository and use that (though pip is the recommended install method).

git clone https://github.com/linsalrob/PhiSpy.git
cd PhiSpy
python3 setup.py install --user --record installed_files.txt

Note that we recommend using --record to save a list of all the files that were installed by PhiSpy. If you ever want to uninstall it, or to remove everything to reinstall e.g. from pip, you can simply use the contents of that file:

cat installed_files.txt | xargs rm -f

If you have root and you want to install globally, you can change the setup command.

git clone https://github.com/linsalrob/PhiSpy.git
cd PhiSpy
python3 setup.py install

For ease of use, you may wish to add the location of PhiSpy.py to your $PATH.

Software Requirements

PhiSpy requires the following programs to be installed on the system. Most of these are likely already on your system or will be installed using the mechanisms above.

  1. Python - version 3.4 or later
  2. Biopython - version 1.58 or later
  3. gcc - GNU project C and C++ compiler - version 4.4.1 or later
  4. The Python.h header file. This is included in python3-dev that is available on most systems.

Testing PhiSpy.py

Download the Streptococcus pyogenes M1 genome

curl -Lo Streptococcus_pyogenes_M1_GAS.gb https://bit.ly/37qFArb
PhiSpy.py -o Streptococcus.phages Streptococcus_pyogenes_M1_GAS.gb

or to run it with the Streptococcus training set:

PhiSpy.py -o Streptococcus.phages -t data/trainSet_160490.61.txt Streptococcus_pyogenes_M1_GAS.gb

This uses the GenBank format file for Streptococcus pyogenes M1 GAS that we provide in the tests/ directory, and we use the training set for S. pyogenes M1 GAS that we have pre-calculated. This quickly identifies the four prophages in this genome, runs the repeat finder on all of them, and outputs the answers.

You will find the output files from this query in the directory given to -o (Streptococcus.phages in the commands above).

Download more testing data

You can also download all the genomes in tests/. These are not installed with PhiSpy if you use pip/conda, but will be if you clone the repository. Please note that these are stored on git lfs, and so if you notice an error that the files are small and don't ungzip, you may need to (i) install git lfs and (ii) use git lfs fetch to update this data.

Running PhiSpy.py

The simplest command is:

PhiSpy.py genbank_file -o output_directory

where:

  • genbank file: The input DNA sequence file in GenBank format.
  • output directory: The directory where the final output files will be created.

If you have a new genome, we recommend annotating it using the RAST server or PROKKA. RAST has a server that allows you to upload and download the genome (and can handle lots of genomes), while PROKKA is stand-alone software.

phage_genes

By default, PhiSpy.py uses strict mode, where we look for two or more genes that are likely to be a phage in each prophage region. If you increase the value of --phage_genes that will reduce the number of prophages that are predicted. Conversely, if you reduce this, or set it to 0 we will overcall mobile elements.

When --phage_genes is set to 0, PhiSpy.py will identify other mobile elements like plasmids, integrons, and pathogenicity islands. Somewhat unexpectedly, it will also identify the ribosomal RNA operons as likely being mobile since they are unlike the host's backbone!

color

If you add the --color flag, we will color the CDS based on their function. The colors are primarily used in artemis for visualizing phage regions.

file name prefixes

By default the outputs from PhiSpy.py have standard names. If you supply a file name prefix it will be prepended to all the files, so that you can run PhiSpy.py on multiple genomes and have the outputs in the same directory without overwriting each other.

gzip support

PhiSpy.py natively supports both reading and writing files in gzip format. If you provide a gzipped input file, we will write a gzipped output file.
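As an illustration of what transparent gzip handling can look like, here is a minimal Python sketch. It is not PhiSpy's own code: `open_maybe_gzip` is a hypothetical helper, and detecting compression from the two gzip magic bytes (rather than the file extension) is an assumption made for this example.

```python
import gzip


def open_maybe_gzip(path, mode="rt"):
    """Open a file as text, whether or not it is gzip-compressed.

    Hypothetical helper for illustration only (not part of PhiSpy).
    Compression is detected from the gzip magic bytes 0x1f 0x8b.
    """
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    # gzip.open and open both accept text modes such as "rt"
    return gzip.open(path, mode) if is_gzip else open(path, mode)
```

The same call then works for both `genome.gb` and `genome.gb.gz`, for example `with open_maybe_gzip("genome.gb.gz") as fh: first_line = fh.readline()`.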

HMM Searches

When also considering the signal from HMM profile search:

PhiSpy.py genbank_file -o output_directory --phmms hmm_db --threads 4 --color

where:

  • hmm_db: reference HMM profiles database to search with genome-encoded proteins (at the moment)

Training sets were searched with pVOG database HMM profiles: AllvogHMMprofiles.tar.gz. To use it:

wget http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
tar -zxvf AllvogHMMprofiles.tar.gz
cat AllvogHMMprofiles/* > pVOGs.hmm

Then use pVOGs.hmm as hmm_db.
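If you prefer to do the concatenation step from Python (for example on a system without `cat`), the following sketch is roughly equivalent to `cat AllvogHMMprofiles/* > pVOGs.hmm`. The function name `concatenate_profiles` is hypothetical, and the sketch assumes the combined file is written outside the profile directory.

```python
import shutil
from pathlib import Path


def concatenate_profiles(profile_dir, combined_path):
    """Concatenate every file in profile_dir into combined_path.

    Returns the number of files that were concatenated. Files are taken in
    sorted name order. combined_path should not be inside profile_dir.
    """
    profiles = sorted(p for p in Path(profile_dir).iterdir() if p.is_file())
    with open(combined_path, "wb") as out:
        for profile in profiles:
            with open(profile, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, so large files are fine
    return len(profiles)
```

For example, `concatenate_profiles("AllvogHMMprofiles", "pVOGs.hmm")` would produce the same kind of combined file as the shell command.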

Because this adds an extra step before PhiSpy's regular processing, the input GenBank file is updated and saved in output_directory. When the --color flag is used, an additional /color qualifier is added to the updated GenBank file so that you can easily distinguish proteins with hits to hmm_db while viewing the file in Artemis.

When running PhiSpy again on the same input data with the --phmms option, you can skip the search step with the --skip_search flag.

Another database that may be of interest is the VOGdb database. You can download all their VOGs, and then press them into a compiled format for hmmer:

curl -LO http://fileshare.csb.univie.ac.at/vog/latest/vog.hmm.tar.gz
mkdir vog
tar -C vog -xf vog.hmm.tar.gz
cat vog/* > VOGs.hmms
hmmpress VOGs.hmms

Metrics

We use several different metrics to predict regions that are prophages, and there are some optional metrics you can add. The default set of metrics are:

  • orf_length_med: median ORF length
  • shannon_slope: the slope of Shannon's diversity of k-mers across the window under consideration. You can also expand this with the --expand_slope option.
  • at_skew: the normalized AT skew across the window under consideration
  • gc_skew: the normalized GC skew across the window under consideration
  • max_direction: The maximum number of genes in the same direction

You can specify each of these options with the --metrics flag, for example:

PhiSpy.py --metrics shannon_slope

or

PhiSpy.py --metrics gc_skew

If you wish to specify more than one metric, you can either use one --metrics flag and list your options, e.g.

PhiSpy.py --metrics shannon_slope gc_skew

or provide each one, e.g.:

PhiSpy.py --metrics shannon_slope --metrics gc_skew

The default is all of these, and so omitting the --metrics flag is equivalent to

PhiSpy.py --metrics orf_length_med shannon_slope at_skew gc_skew max_direction

The choice(s) you provide are recorded in the log file.
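To make the skew and Shannon metrics listed above more concrete, here is a small Python sketch using their textbook definitions: skew as (X - Y) / (X + Y) for two base counts, and Shannon entropy over k-mer frequencies. This is an illustration only. PhiSpy's own normalization, its k-mer size, and the way it derives a slope across the window are not described here, so the function names and the default `k=3` are assumptions.

```python
from collections import Counter
from math import log2


def skew(seq, first, second):
    """Return (first - second) / (first + second) for two base counts.

    skew(seq, "G", "C") is the GC skew; skew(seq, "A", "T") is the AT skew.
    Returns 0.0 when neither base occurs.
    """
    seq = seq.upper()
    a, b = seq.count(first), seq.count(second)
    return (a - b) / (a + b) if (a + b) else 0.0


def shannon_kmer_diversity(seq, k=3):
    """Shannon entropy (in bits) of the k-mer frequency distribution."""
    seq = seq.upper()
    kmers = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(kmers.values())
    if not total:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in kmers.values())


window = "ATGGGCGCGATTTAGCGGG"  # made-up sequence for demonstration
print(skew(window, "G", "C"), skew(window, "A", "T"),
      shannon_kmer_diversity(window))
```

Computing these values in successive windows along a contig gives the kind of per-window signal that a classifier can then use.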

You can also add a few other options

  • phmms: The phmm search results
  • phage_genes: The number of genes that must be annotated as phage in the region
  • nonprophage_genegaps : The maximum number of non-phage genes between two phage-like regions that will enable them to be merged

Help

For the help menu use the -h option:

python PhiSpy.py -h

Output Files

PhiSpy has the option of creating multiple output files with the prophage data:

  1. prophage_coordinates.tsv (code: 1)

These are the coordinates of each prophage identified in the genome, and their att sites (if found), in tab-separated text format.

The columns of the file are:

    1. Prophage number
    2. The contig upon which the prophage resides
    3. The start location of the prophage
    4. The stop location of the prophage

If we can detect the att sites, the additional columns are:

    5. start of attL;
    6. end of attL;
    7. start of attR;
    8. end of attR;
    9. sequence of attL;
    10. sequence of attR;
    11. The explanation of why this att site was chosen for this prophage.

  2. GenBank format output (code: 2)

We provide a duplicate GenBank record that is the same as the input record, but we have inserted the prophage information, including att sites into the record.

If the original GenBank file was provided in gzip format this file will also be created in gzip format.

  3. prophage and bacterial sequences (code: 4)

PhiSpy can automatically separate the DNA sequences into prophage and bacterial components. If this output is chosen, we generate both fasta and GenBank format outputs:

  • GenBank files: Two files are made, one for the bacteria and one for the phages. Each contains the appropriate fragments of the genome annotated as in the original.
  • fasta files: Two files are made, the first contains the entire genome, but the prophage regions have been masked with Ns. We explicitly chose this format for a few reasons: (i) it is trivial to convert this format into separate contigs without the Ns but it is more complex to go from separate contigs back to a single joined contig; (ii) when read mapping against the genome, understanding that reads map either side of a prophage may be important; (iii) when looking at insertion points this allows you to visualize where the prophage was lying.
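As a sketch of how the N-masked sequence can be converted into separate fragments, the following Python function splits a sequence on runs of N and keeps each fragment's offset in the original coordinates. The function name `split_on_masked_regions` and the `min_run` parameter are hypothetical. Note that any genuine N runs in the assembly would be treated the same way as masked prophages, so raise `min_run` if that matters for your data.

```python
import re


def split_on_masked_regions(sequence, min_run=1):
    """Split a sequence on runs of N that are at least min_run bases long.

    Returns a list of (start, fragment) tuples, where start is the 0-based
    offset of the fragment in the original sequence.
    """
    fragments = []
    position = 0
    for match in re.finditer(r"[Nn]{%d,}" % min_run, sequence):
        if match.start() > position:
            fragments.append((position, sequence[position:match.start()]))
        position = match.end()
    if position < len(sequence):
        fragments.append((position, sequence[position:]))
    return fragments
```

Keeping the offsets means that coordinates on a fragment can still be mapped back to the original contig.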
  4. prophage_information.tsv (code: 8)

This is a tab-separated file, and is the key file to assess prophages in genomes (see assessing predictions, below). The file contains all the genes of the genome, one per line. The tenth column represents the status of a gene. If this column is 0 then we consider this a bacterial gene. If it is non-zero it is probably a phage gene, and the higher the score the more likely we believe it is a phage gene. This is the raw data that we use to identify the prophages in your genome.

This file has 16 columns:

    1. The id of each gene;
    2. function: function of the gene (or product from a GenBank file);
    3. contig;
    4. start: start location of the gene;
    5. stop: end location of the gene;
    6. position: a sequential number of the gene (starting at 1);
    7. rank: rank of each gene provided by random forest;
    8. my_status: status of each gene based on random forest;
    9. pp: classification of each gene based on their function;
    10. Final_status: the status of each gene. For prophages, this column has the number of the prophage as listed in prophage.tbl above; If the column contains a 0 we believe that it is a bacterial gene. Otherwise we believe that it is possibly a phage gene.

If we can detect the att sites, the additional columns are:

    11. start of attL;
    12. end of attL;
    13. start of attR;
    14. end of attR;
    15. sequence of attL;
    16. sequence of attR;

  5. prophage.tsv (code: 16)

This is a simpler version of the prophage_coordinates.tsv file that only has prophage number, contig, start, and stop.

  6. GFF3 format (code: 32)

This is the prophage information suitable for insertion into a GFF3 file. This is a legacy file format; because GFF3 is no longer widely supported, this file only has the prophage coordinates. Please post an issue on GitHub if more complete GFF3 files are required.

  7. prophage.tbl (code: 64)

This file has two columns separated by tabs [prophage_number, location]. This is also a legacy file that is not generated by default. The prophage number is a sequential number of the prophage (starting at 1), and the location is in the format contig_start_stop, which encompasses the prophage.

  8. test data (code: 128)

This file has the data used in the random forest. The columns are:

  • Identifier
  • Median ORF length
  • Shannon slope
  • Adjusted AT skew
  • Adjusted GC skew
  • The maximum number of ORFs in the same direction
  • PHMM matches
  • Status

The numbers are averaged across a window of size specified by --window_size

Choosing which output files are created.

We have provided the option (--output_choice) to choose which output files are created. Each file above has a code associated with it, and to include that file add up the codes:

| Code | File |
| --- | --- |
| 1 | prophage_coordinates.tsv |
| 2 | GenBank format output |
| 4 | prophage and bacterial sequences |
| 8 | prophage_information.tsv |
| 16 | prophage.tsv |
| 32 | GFF3 format output of just the prophages |
| 64 | prophage.tbl |
| 128 | test data used in the random forest |
| 256 | GFF3 format output for the annotated genomic contigs |

So for example, if you want to get GenBank format output (2) and prophage_information.tsv (8), then enter an --output_choice of 10.

The default is 3: you will get both the prophage_coordinates.tsv and GenBank format output files.

Note: Choice 32 will only output the prophages themselves in GFF3 format. In contrast, choice 256 outputs annotated genomes. This is probably the best choice to bring the genome into Artemis as it will handle multiple contigs correctly.

If you want all files output, use --output_choice 512.
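Because each code is a distinct power of two, adding codes together is the same as setting bits in a flag value. The following Python sketch shows how a value for --output_choice can be built from, and broken back down into, the codes in the table above. The helper names `encode_choice` and `decode_choice` are hypothetical and not part of PhiSpy, and the sketch only covers the individual codes listed in the table (it does not model the special value 512).

```python
# Codes and file descriptions copied from the table above
OUTPUT_CODES = {
    1: "prophage_coordinates.tsv",
    2: "GenBank format output",
    4: "prophage and bacterial sequences",
    8: "prophage_information.tsv",
    16: "prophage.tsv",
    32: "GFF3 format output of just the prophages",
    64: "prophage.tbl",
    128: "test data used in the random forest",
    256: "GFF3 format output for the annotated genomic contigs",
}


def encode_choice(*codes):
    """Add up the requested codes, rejecting any code not in the table."""
    unknown = set(codes) - set(OUTPUT_CODES)
    if unknown:
        raise ValueError(f"Unknown output code(s): {sorted(unknown)}")
    return sum(set(codes))  # set() so a repeated code is only counted once


def decode_choice(choice):
    """List the files selected by a summed output choice."""
    return [name for code, name in OUTPUT_CODES.items() if choice & code]
```

For example, `encode_choice(2, 8)` gives 10, and `decode_choice(3)` lists the two default outputs.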

Example Data

  • Streptococcus pyogenes M1 GAS which has a single genome contig. The genome contains four prophages.

To analyze this data, you can use:

PhiSpy.py -o output_directory -t data/trainSet_160490.61.txt tests/Streptococcus_pyogenes_M1_GAS.gb.gz

And you should get a prophage table that has this information (for example, take a look at output_directory/prophage.tbl).

| Prophage number | Contig | Start | Stop |
| --- | --- | --- | --- |
| pp_1 | NC_002737 | 529631 | 569288 |
| pp_2 | NC_002737 | 778642 | 820599 |
| pp_3 | NC_002737 | 1192630 | 1222549 |
| pp_4 | NC_002737 | 1775862 | 1782822 |

Assessing predictions

As with any software, it is critical that you assess the output from phispy to see if it actually makes sense! We start by ensuring we have the prophage_information.tsv file output (this is not output by default, and requires adding 8 to the --output_choice flag).

That is a tab-separated text file that you can import into Microsoft Excel, LibreOffice Calc, Google Sheets, or your favorite spreadsheet viewing program.

There are a few columns that you should pay attention to:

  • position (the 6th column) is the position of the gene in the genome. If you sort by this column you will always return the genome to the original order.
  • Final status (the 10th column) is whether this region is predicted to be a prophage or not. The number is the prophage number. If the entry is 0 it is not a prophage.
  • my status and pp (the 8th and 9th columns) are interim indicators about whether this gene is potentially part of a phage.

We recommend:

  1. Freeze the first row of the spreadsheet so you can see the column headers
  2. Sort the spreadsheet by the my status column and color any row red where the value in this column is greater than 0
  3. Sort the spreadsheet by the final status column and color those rows identified as a prophage green.
  4. Sort the spreadsheet by the position column.

Now all the prophages are colored green, while all the potential prophage genes that are not included as part of a prophage are colored red. You can easily review those non-prophage regions and determine whether you think they should be included in prophages. Note that in most cases you can adjust the phispy parameters to include regions you think are prophages.
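If you would rather script this review than use a spreadsheet, the following Python sketch lists the "red" rows: genes with a my status greater than 0 that were not assigned to any prophage. It relies only on the column positions described above (position is the 6th column, my status the 8th, final status the 10th). The function name `unassigned_candidates` is hypothetical, and the sketch assumes the file is uncompressed and that its first line is a header row.

```python
import csv


def unassigned_candidates(path, has_header=True):
    """Return rows with my_status > 0 but final status 0, sorted by position.

    Columns are addressed by position (0-based): 5 = position,
    7 = my_status, 9 = final status.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    if has_header:
        rows = rows[1:]  # assumption: the first line holds the column names
    hits = [r for r in rows if float(r[7]) > 0 and float(r[9]) == 0]
    return sorted(hits, key=lambda r: int(r[5]))
```

Each returned row is a list of strings, so `row[0]` and `row[1]` give the gene id and its function for manual inspection.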

Note: Ensure that while you are reviewing the results, you pay particular attention to the contig column. In partial genomes, contig breaks are very often located in prophages. This is usually because prophages often contain sequences that are repeated around the genome. We have an open issue to try and resolve this in a meaningful way.

Interactive PhiSpy

We have created a jupyter notebook example where you can run PhiSpy to test the effect of the different parameters on your prophage predictions. Change the name of the genbank file to point to your genome, and change the values in parameters and see how the prophage predictions vary!

Tips, Tricks, and Errors

If you are feeling lazy, you actually only need to use sudo apt install -y python3-pip; python3 -m pip install phispy since python3-pip requires build-essential and python3-dev!

If you try PhiSpy.py -v and get an error like this:

$ PhiSpy.py -v
-bash: PhiSpy.py: command not found

Then you can either use the full path:

~/.local/bin/PhiSpy.py -v

or add that location to your $PATH:

echo "export PATH=\$HOME/.local/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
PhiSpy.py -v

Exit (error) codes

We use a few different error codes to signify things that we could not compute. So far, we have:

| Exit Code | Meaning | Suggested solution |
| --- | --- | --- |
| 2 | No input file provided | We need a file to work with! |
| 3 | No output directory provided | We need somewhere to write the results to! |
| 10 | No training sets available | This should be in the default install. Please check your installation |
| 11 | The specific training set is not available | Check the argument passed to the --training_set parameter |
| 13 | No kmers file found | This should be in the default install. Please check your installation |
| 20 | IO Error | There was an error reading your input file. |
| 25 | Non nucleotide base found | Check for a non-standard base in your sequence |
| 26 | An ORF with no bases | This is probably a really short ORF and should be deleted. |
| 30 | No contigs | We filter contigs by length, and so try adjusting the --min_contig_size parameter, though the default is 5,000 bp and you will need some adjacent genes! |
| 40 | No ORFs in your genbank file | Please annotate your genome, e.g. using RAST or PROKKA |
| 41 | Less than 100 ORFs are in your annotated genome. This is not enough to find a prophage | Please annotate your genome, e.g. using RAST or PROKKA |
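When running PhiSpy over many genomes from a script, it can help to translate these exit codes into messages. The following Python sketch wraps the simplest command shown earlier (`PhiSpy.py genbank_file -o output_directory`). The helper names `describe_exit_code` and `run_phispy` are hypothetical, and treating an exit code of 0 as success is the usual convention rather than something stated in the table.

```python
import subprocess

# Meanings copied from the exit code table above
EXIT_CODES = {
    2: "No input file provided",
    3: "No output directory provided",
    10: "No training sets available",
    11: "The specific training set is not available",
    13: "No kmers file found",
    20: "IO Error",
    25: "Non nucleotide base found",
    26: "An ORF with no bases",
    30: "No contigs",
    40: "No ORFs in your genbank file",
    41: "Less than 100 ORFs are in your annotated genome",
}


def describe_exit_code(code):
    """Return a readable message for a PhiSpy exit code."""
    if code == 0:
        return "Success"  # assumption: 0 means success, as is conventional
    return EXIT_CODES.get(code, f"Unrecognized exit code {code}")


def run_phispy(genbank_file, output_directory):
    """Run PhiSpy.py on one genome; return (exit code, message)."""
    result = subprocess.run(["PhiSpy.py", genbank_file, "-o", output_directory])
    return result.returncode, describe_exit_code(result.returncode)
```

A batch script can then log the message for each genome and carry on with the next one instead of stopping at the first failure.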

Making your own training sets

If close relatives of your bacteria of interest are missing from the reference datasets, you can make your own training sets by providing at least one genome in which you indicate the prophage proteins. This is done by adding a new qualifier to the GenBank annotation of each CDS feature within a prophage region: /is_phage="1". This allows PhiSpy to distinguish the signal from bacterial and phage regions, and to make a training set for later use during classification with the random forest algorithm.

We provide a script - mark_prophage_features.py, to automate that process. It updates GenBank files based on PhiSpy's prophage_predictions.tsv file format or user's tab-delimited table with the following information in columns for each prophage region:

  1. path to GenBank file
  2. replicon id
  3. prophage start coordinate
  4. prophage end coordinate
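As a sketch of how such a user table could be produced, the following Python function writes one tab-delimited line per prophage region with the four columns listed above. The function name `write_prophage_table` is hypothetical, and writing the table without a header line is an assumption, so check what mark_prophage_features.py expects before relying on it.

```python
import csv


def write_prophage_table(regions, table_path):
    """Write (genbank_path, replicon_id, start, end) tuples as a TSV table.

    Returns the number of regions written. No header line is written
    (an assumption made for this sketch).
    """
    with open(table_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        count = 0
        for genbank_path, replicon_id, start, end in regions:
            writer.writerow([genbank_path, replicon_id, start, end])
            count += 1
    return count
```

For example, a region could be given as `("tests/Streptococcus_pyogenes_M1_GAS.gb.gz", "NC_002737", 529631, 569288)`, using the first prophage from the example data above.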

To make training sets out of your files use make_training_sets.py script. It allows you to update/extend PhiSpy's default training sets or overwrite them with just your data.

make_training_sets.py prepares all required input files, i.e. it makes phage/bacteria-specific kmers sets based on /is_phage="1" qualifiers, reads information about taxonomy (if requested for grouping with --use_taxonomy), calls PhiSpy in a training mode and prepares training sets.

make_training_sets.py -d input_directory -g groups_file --use_taxonomy -k kmer_size -t kmers_type --phmms hmm_db --threads num_threads --retrain

where:

  • input_directory: a directory where all GenBank files for training are stored. Note that provided path will be added to file names in groups_file.
  • groups_file: a file mapping GenBank file names with extension and the name of group they will make; each file can be assigned to more than one group - take a look at how the reference data grouping file was constructed at test_genbank_files/groups.txt.
  • use_taxonomy: this option creates groups of training sets based on taxonomy within analyzed GenBank files. If taxonomy information is missing, genome is assigned to Bacteria group.
  • kmer_size: the size of the kmers that will be produced. By default it's 12. If changed, remember to also change that parameter while running PhiSpy with the produced training sets.
  • kmers_type: type of generated kmers. By default 'all' means generating kmers by 1 nt. If changed, remember to also change that parameter while running PhiSpy with produced training sets.

Besides the flags that allow training with the phmm signal, there are also the --retrain and --absolute_retrain flags. Both trigger a complete reanalysis of the input files, but they were added for different reasons. The first should be used whenever any file previously used for training has changed, e.g. more or fewer phage proteins were marked with /is_phage="1", as it triggers preparation of new kmers files. The second additionally ignores the trainingGenome_list.txt file and therefore allows you to omit PhiSpy's default reference genomes. The same will happen when trainingGenome_list.txt is missing from PhiSpy's installation directory.

All files created while training, i.e. the phage/bacteria kmers and the testSet for each GenBank file, are stored in the PhiSpyModules/data/testSets/ directory in PhiSpy's installation directory. This saves a bit of time when adding new genomes and retraining.

Preparing GenBank files

  • it is recommended to mark prophage proteins even from prophage remnants/disrupted regions composed of a few proteins with /is_phage="1" to minimize the loss of good signal, kmers in particular,
  • don't use too many genomes (e.g. 100) as you may end up with a small set of phage-specific kmers,
  • try to pick several genomes with different prophages to increase the diversity.

phispy's People

Contributors

deprekate, laurasisk, linsalrob, pdec, scottdaniel


phispy's Issues

Output GFF3

Dear @linsalrob,

Thank you very much for the recent updates and modifications.

A few months ago I forked your repo and added some new features, such as Python 3 support and GFF3 output. I have not used it during the last few weeks, but when I come back to PhiSpy in a few weeks I will start using your version, now that it is in Python 3, as it will be much more up to date.

But I think it would be great to include a standardized format such as GFF3 among the output results.

I already did it for my forked version and it is available here.
https://github.com/JFsanchezherrero/PhiSpy/blob/9c31a60d28f9035f9d1dba8f1bc475e816b288e7/PhiSpy_tools/evaluation.py#L541

Example:

##gff-version 3
NC_002737 PhiSpy prophage_region 87661 119067 . . . ID=pp1
NC_002737 PhiSpy attL 87232 87245 . . . ID=pp1
NC_002737 PhiSpy attR 117100 117113 . . . ID=pp1
NC_002737 PhiSpy prophage_region 529631 605856 . . . ID=pp2
NC_002737 PhiSpy attL 529660 529672 . . . ID=pp2
NC_002737 PhiSpy attR 604721 604733 . . . ID=pp2
NC_002737 PhiSpy prophage_region 778642 840502 . . . ID=pp3
NC_002737 PhiSpy attL 777439 777452 . . . ID=pp3
NC_002737 PhiSpy attR 840272 840285 . . . ID=pp3
NC_002737 PhiSpy prophage_region 1191309 1241894 . . . ID=pp4
NC_002737 PhiSpy attL 1189734 1189747 . . . ID=pp4
NC_002737 PhiSpy attR 1243163 1243176 . . . ID=pp4

Please take a look and include it in your code, if not as the default then behind a flag. I think it would give PhiSpy a new feature and much more versatility.

Thanks in advance

"Contig starting with a partial ORF" warning

Dear Rob,

I am currently using PhiSpy version 4.2.21 to detect prophages in several strains of Aeromonas. For one of them (available at the NCBI: https://www.ncbi.nlm.nih.gov/nuccore/CP043323.1/), it prints this warning concerning an untested feature:

$ phispy --output_choice 512 -o CP043323_prophages CP043323.1.gbk
Feature FT688_00005 spans the origin: join{[0:44](-), [4784838:4786216](-)}
WARNING: THIS IS AN UNTESTED FEATURE!
We have not thoroughly tested conditions where an ORF on the -ve strand appears to cross the origin of the contig. We would appreciate you posting an issue on GitHub and sending Rob a copy of your genome to test!
Feature FT688_00005 spans the origin: join{[0:44](-), [4784838:4786216](-)}
WARNING: THIS IS AN UNTESTED FEATURE!
We have not thoroughly tested conditions where an ORF on the -ve strand appears to cross the origin of the contig. We would appreciate you posting an issue on GitHub and sending Rob a copy of your genome to test!
Processing 1 contigs
Making Testing Set...

Since some available genomes might have their first coordinate in the "middle" of an ORF, I was wondering whether there is any chance of missing a potential prophage in these situations. If so, what would you recommend at the moment to check that nothing is lost? Finally, and most importantly, do you plan to improve PhiSpy to deal with situations like this one?

Best,

Enrique

bug report

Hi maintainer,

When I use PhiSpy to detect the prophages of NZ_CP014862.1, I get an error like this:

Making Test Set... (need couple of minutes)
Start Classification Algorithm
Using training flag:  0
Done with classification Algorithm
As training flag is zero, considering unknown functions
Start evaluation...
Traceback (most recent call last):
  File "/home/ciillab/huangle/phiSpy.py", line 150, in <module>
    start_propgram(sys.argv)
  File "/home/ciillab/huangle/phiSpy.py", line 148, in start_propgram
    call_phiSpy(organismPath,output_dir,trainingFlag,INSTALLATION_DIR,args_parser.evaluate, args_parser.number, args_parser.window_size,args_parser.quiet)
  File "/home/ciillab/huangle/phiSpy.py", line 40, in call_phiSpy
    evaluation.call_start_end_fix(output_dir,organismPath,INSTALLATION_DIR,threshold_for_FN, phageWindowSize)
  File "/home/ciillab/huangle/source/evaluation.py", line 528, in call_start_end_fix
    fixing_start_end(output_dir+'initial_tbl.txt',output_dir+'prophage_tbl_temp.txt',organismPath,INSTALLATION_DIR)
  File "/home/ciillab/huangle/source/evaluation.py", line 352, in fixing_start_end
    repeat_list = find_repeat(dna[pp[i]['contig']][start:stop],start,INSTALLATION_DIR)
  File "/home/ciillab/huangle/source/evaluation.py", line 90, in find_repeat
    if math.fabs(int(temp[0]) - int(temp[3])) > 10000:
IndexError: list index out of range

Please address this problem, thank you.

ASCI

Hello,

Thank you for sharing your amazing software.
I would be very grateful if anyone could help me on the following issue:

SyntaxError: Non-ASCII character '\xc5' in file /Users/bognasmug/Applications/PhiSpy/modules/helper_functions.py on line 35, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

I get that error when I run:
./PhiSpy.py -i [organism_directory] -o [output_dir] -c

Choice of the longest repeats is non-deterministic

PhiSpy chooses the longest repeats flanking a prophage region and includes them in the output files. However, the repeats are stored initially as structs and then as members of a set, and then finally iterated through.

The order of the repeats may vary from run to run, and if there are two repeats of the same length that are longer than any other repeats, they may be chosen in a non-deterministic way (i.e. one may be chosen on the first run, while a different one may be chosen subsequently).

With version 3.4.7 PhiSpy prints a warning to STDERR if more than one longest repeat is found. A solution for this problem is neither obvious nor trivial to implement.
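To illustrate the tie described above, here is a small Python sketch in which repeats are held in a set and the longest one is chosen with an explicit tie-break on coordinates. This is not PhiSpy's code: the `Repeat` fields and the `longest_repeat` function are hypothetical. An explicit tie-break only makes the choice reproducible from run to run; it does not tell us which of the equally long repeats is the biologically correct one.

```python
from collections import namedtuple

# Hypothetical record for one repeat pair flanking a prophage
Repeat = namedtuple("Repeat", "first_start first_end second_start second_end")


def longest_repeat(repeats):
    """Return the longest repeat, breaking ties by the leftmost coordinates.

    Sorting on (-length, first_start, second_start) gives the same answer
    regardless of the iteration order of the input collection.
    Raises ValueError if repeats is empty.
    """
    return min(
        repeats,
        key=lambda r: (-(r.first_end - r.first_start), r.first_start, r.second_start),
    )


candidates = {
    Repeat(100, 114, 5000, 5014),
    Repeat(250, 264, 5200, 5214),  # same length as the first repeat
    Repeat(300, 310, 5300, 5310),
}
print(longest_repeat(candidates))
```

With this key, the repeat starting at position 100 is always chosen here, whichever order the set happens to be iterated in.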

Unable to finish + excessive RAM usage

Hi,

I am running PhiSpy 4.2.6 on a set of contigs, one contig per PhiSpy run, and for some of them it is not able to finish the analysis. After identifying potential prophages, RAM usage escalates until the process gets killed. I have tried on several machines with up to 125 GB of RAM with the same result.

Here is one of those problematic contigs, downloaded from NCBI and provided to PhiSpy directly. The phispy.log of the run is attached.
Installation was via conda as described in the documentation.
The test run with Streptococcus_pyogenes_M1_GAS ran smoothly; here is the log file phispy.log

Any idea what could be the reason for this behaviour?

Thank you.

Mark prophage regions at the end of contigs

Very often contig breaks in draft genomes occur within prophages. This is because the prophage contains repeat regions (like IS elements), the prophage has unusual coverage (e.g. because it is induced during growth/library prep), or they are just downright tricky!

We should mark prophage regions at the end of contigs using different criteria (fewer genes, more hypotheticals, etc) than we use for the rest of the genome. This could potentially assist in recircularizing the genome.

ValueError: Need a Nucleotide or Protein alphabet

The following is the code I ran and the error situation. Has anyone encountered this issue? BioPython version is 1.77, and PhiSpy version is 4.2.21

(PhiSpy) [kxy@zju out]$ PhiSpy.py my_output.gbk -o output_directory
Processing 34 contigs
Making Testing Set...
Start Classification Algorithm...
Using the following metric(s): {'gc_skew', 'at_skew', 'shannon_slope', 'orf_length_med', 'max_direction'}.
Running the random forest classifier with 500 trees and 2 threads
/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of n_init will change from 10 to 'auto' in 1.4. Set the value of n_init explicitly to suppress the warning
warnings.warn(
As the training flag is zero, down-weighting unknown functions
Evaluating...
Checking prophages we might have found
Potential prophages (sorted highest to lowest)
Contig Start Stop Number of potential genes Status
NODE_25_length 4943 22015 29 Dropped. No genes were identified as phage genes
NODE_17_length 67767 91585 24 Kept
NODE_31_length 48 3971 8 Dropped. No genes were identified as phage genes
NODE_3_length 36724 41861 5 Dropped. No genes were identified as phage genes
NODE_7_length 40892 42026 2 Dropped. Region too small (Not enough genes)
NODE_9_length 147704 149860 1 Dropped. Region too small (Not enough genes)
NODE_9_length 128083 130272 1 Dropped. Region too small (Not enough genes)
NODE_8_length 31922 32392 1 Dropped. Region too small (Not enough genes)
NODE_2_length 90215 91141 1 Dropped. Region too small (Not enough genes)
NODE_27_length 8285 8959 1 Dropped. Region too small (Not enough genes)
NODE_20_length 9731 10882 1 Dropped. Region too small (Not enough genes)
NODE_18_length 55595 56515 1 Dropped. Region too small (Not enough genes)
NODE_17_length 38600 40714 1 Dropped. Region too small (Not enough genes)
NODE_16_length 41140 41724 1 Dropped. Region too small (Not enough genes)
NODE_15_length 34985 35749 1 Dropped. Region too small (Not enough genes)
NODE_11_length 62124 64214 1 Dropped. Region too small (Not enough genes)
PROPHAGE: 1 Contig: NODE_17_length Start: 67767 Stop: 91585
Creating output files
Writing GenBank output file
Traceback (most recent call last):
  File "/data/users/kxy/miniconda3/envs/PhiSpy/bin/PhiSpy.py", line 10, in <module>
    sys.exit(run())
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/PhiSpyModules/main.py", line 122, in run
    main(sys.argv)
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/PhiSpyModules/main.py", line 114, in main
    PhiSpyModules.write_all_outputs(**vars(args_parser))
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/PhiSpyModules/writers.py", line 401, in write_all_outputs
    write_genbank(self)
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/PhiSpyModules/writers.py", line 98, in write_genbank
    SeqIO.write(self.record, handle, 'genbank')
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/Bio/SeqIO/__init__.py", line 531, in write
    count = writer_class(handle).write_file(sequences)
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/Bio/SeqIO/Interfaces.py", line 235, in write_file
    count = self.write_records(records, maxcount)
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/Bio/SeqIO/Interfaces.py", line 209, in write_records
    self.write_record(record)
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/Bio/SeqIO/InsdcIO.py", line 1005, in write_record
    self._write_the_first_line(record)
  File "/data/users/kxy/miniconda3/envs/PhiSpy/lib/python3.10/site-packages/Bio/SeqIO/InsdcIO.py", line 757, in _write_the_first_line
    raise ValueError("Need a Nucleotide or Protein alphabet")
ValueError: Need a Nucleotide or Protein alphabet

Additionally, because the IDs in the gbk file produced by the Prokka annotation are too long, I used the following command to shorten all LOCUS IDs in the file:

LOCUS NODE_2_length_354722_cov_51.4144354722 bp DNA linear
sed -re 's/(_length)[^=]*$/\1/' 4751.gbk > my_output.gbk
LOCUS NODE_2_length

Huge memory consumption and runtime with NNNNs

I have been waiting hours for one (out of 20k) genome to finish. There are stretches of N in the DNA that I think are causing the problem.

The genome: g.500.gb

The output of phispy:

$ tail phispy.log 
2021-05-10 13:31:12 INFO     g_500_c_0	763490	764494	1	Dropped. Not enough genes
2021-05-10 13:31:12 INFO     g_500_c_0	556601	556960	1	Dropped. Not enough genes
2021-05-10 13:31:12 INFO     PROPHAGE: 1 Contig: g_500_c_0 Start: 160559 Stop: 171884
2021-05-10 14:07:05 INFO     There were 6 repeats with the same length as the best. One chosen somewhat randomly!
2021-05-10 14:07:05 INFO     PROPHAGE: 2 Contig: g_500_c_0 Start: 334826 Stop: 374323
2021-05-10 14:11:32 INFO     PROPHAGE: 3 Contig: g_500_c_0 Start: 444209 Stop: 475789
2021-05-10 14:40:12 INFO     There were 12 repeats with the same length as the best. One chosen somewhat randomly!
2021-05-10 14:40:12 INFO     PROPHAGE: 4 Contig: g_500_c_0 Start: 589283 Stop: 606509
2021-05-10 15:29:30 INFO     There were 12 repeats with the same length as the best. One chosen somewhat randomly!
2021-05-10 15:29:30 INFO     PROPHAGE: 5 Contig: g_500_c_0 Start: 622677 Stop: 710948

And the region that contains (some of) the Ns:

$ grep '^   622[6789]' g.500.gb 
   622621 catcgattac aaatattgtc cgccatcaaa acacatccag cgggatggtt tgttgcatga
   622681 atcatgcgat attttgggtg ttggttctgt tggatacgga tcggcatatt caacacgtat
   622741 ttatagcttt cgctcaaacg ctnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
   622801 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
   622861 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
   622921 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
   622981 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn

The resources PhiSpy is using:

$ top -n1 | grep "PID\|PhiSpy"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                           
 4345 katelyn   20   0  214.1g 123.3g   1560 R 100.0  97.9 142:11.65 PhiSpy.py

Minor changes and suggestions

Hi PhiSpy team, I contacted @linsalrob directly and he told me to write my feedback here.

I noticed that when PhiSpy v3.4.5 is installed, it adds PhiSpy.py to the path, but it also adds Phispy and phispy (which don't work). This can be a little misleading even for someone who has read the README.md. I think this is a simple fix in the setup.py file.
Fixed by @linsalrob 2/24/20

prophage.tbl has two definitions in the README.md file: 1) one describes two columns without a header (which is what appears in the output), and 2) the other shows the file with a header and neatly split into three columns. I believe the second one would be nice to have, if possible.
Fixed by @linsalrob 5/20/20

The README states "If the column contains a 0 we believe that it is a bacterial gene", and in another section it says "If this column is 1 then the gene is a phage like gene; otherwise it is a bacterial gene."

It is ambiguous what a value > 1 means. I believe >= 1 is a phage region: I saw some region annotations with "phage" as a substring and the 10th column > 1. Maybe clarify the README.md.
Fixed by @linsalrob 5/20/20

Add scikit-learn >=0.21.3 to Software Requirements. I believe numpy is not showing because it will be installed when installing Biopython ?!
Fixed by @linsalrob 2/24/20

I believe that that prophage.tbl seems to contain the contig name, start and stop for the phage genome rather than the phage genes, right?

The tool takes a gbk file as input. I believe that most people who have a gbk file would also have a FASTA file. I know that the tool does not take the FASTA file, but maybe the user could optionally pass it (if wanted) and a FASTA output file would be generated with the phage genomes. I'm thinking about the end user who does not know how to parse a FASTA file to get the phage regions. If the program does it, I'm sure people will love it.

I'm working on a blog post (http://onestopdataanalysis.com/), to be published a few months from now, on how to go from a bacterial FASTA to a phage FASTA (a Docker image with prokka and phispy, then using the phispy output to parse the phage FASTA file).

I will write a script to do the parsing and would be more than happy to add to the code base if given access. Otherwise, I'm also happy to share it with the team.

Thanks

GFF3 File - Include the att sites

Would it be possible to add all the features detected by phispy (mostly att sites) to the GFF3 format? They can be quite useful for visualization in certain software, like Geneious for instance (the README said to post an issue about it if need be).

Thank you very much for this very useful tool

Impossible to use training sets kept in other location than INSTALLATION_DIR

I noticed that the below check forces users to keep their retrained reference datasets in INSTALLATION_DIR/PhiSpyModules/data directory

if not pkg_resources.resource_exists('PhiSpyModules', trainingFile):
    sys.stderr.write("FATAL: Can not find data file {}\n".format(trainingFile))
    sys.exit(-1)

The same will happen with kmers file within makeTest.py and makeTrain.py

kmers_file = 'data/phage_kmers_' + self._kmers_type + '_wohost.txt'
if not pkg_resources.resource_exists:
    sys.exit("ERROR: Kmers file {} not found".format(kmers_file))
for line in pkg_resources.resource_stream('PhiSpyModules', kmers_file):

Hence, while retraining, one needs to point to PhiSpy's installation dir (e.g. /usr/local/lib/python3.6/site-packages/PhiSpy...) and can't keep separate trainSets dirs.
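One way the lookup could be relaxed is sketched below: try the path the user supplied first, and only then fall back to the bundled data directory. This is a minimal illustration using only the standard library, not PhiSpy's code; `resolve_data_file` and `package_data_dir` are hypothetical names. Note also that in the second snippet above, `if not pkg_resources.resource_exists:` tests the function object itself, which is always truthy, so that check can never fire.

```python
import os


def resolve_data_file(name, package_data_dir):
    """Return a usable path for a training or kmers file.

    Hypothetical helper: 'name' is whatever the user passed on the command
    line, 'package_data_dir' is the directory holding the bundled data.
    """
    # 1. Accept any existing file the user points at (absolute or relative).
    if os.path.isfile(name):
        return os.path.abspath(name)
    # 2. Otherwise fall back to the copy shipped with the package.
    bundled = os.path.join(package_data_dir, name)
    if os.path.isfile(bundled):
        return bundled
    raise FileNotFoundError(
        "Can not find data file {} (also looked in {})".format(name, package_data_dir)
    )
```

With this ordering, a retrained set kept in a separate directory would be picked up without copying it into the installation directory.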

ValueError: missing molecule_type in annotations

Hi

I am getting this error when I run with output choice 255:

/.local/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py", line 744, in _write_the_first_line
raise ValueError("missing molecule_type in annotations")
ValueError: missing molecule_type in annotations

Could you please help me resolve it?
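A possible workaround, sketched under the assumption that the error comes from a recent Biopython GenBank writer requiring a `molecule_type` entry in each record's `annotations` dictionary: fill in a default before writing. The helper below is hypothetical and duck-typed, so it only needs objects with an `annotations` dict; it is not part of PhiSpy or Biopython.

```python
def ensure_molecule_type(records, default="DNA"):
    """Set annotations['molecule_type'] on every record that lacks it.

    Hypothetical helper: 'records' is any iterable of objects exposing an
    'annotations' dict (for example Biopython SeqRecord objects).
    Existing values are left untouched.
    """
    fixed = []
    for record in records:
        record.annotations.setdefault("molecule_type", default)
        fixed.append(record)
    return fixed


# Assumed usage with Biopython (not executed here):
#   from Bio import SeqIO
#   records = ensure_molecule_type(SeqIO.parse("genome.gbk", "genbank"))
#   SeqIO.write(records, "genome.fixed.gbk", "genbank")
```

Whether "DNA" is the right default depends on the input; the file names in the usage comment are placeholders.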

conda installation failed

I installed using conda install -c bioconda phispy, but running phispy produced the following error:

Traceback (most recent call last):                                                                                                                                         
  File "/scicomp/home-pure/uma2/miniconda3/envs/phispy/bin/phispy", line 7, in <module>                                                                                    
    from PhiSpy.py import main                                                                                                                                             
ModuleNotFoundError: No module named 'PhiSpy.py'; 'PhiSpy' is not a package

I fixed the error by modifying the phispy file:

  • I imported os
  • I added the directory containing PhiSpy.py to the path using sys and os: sys.path.insert(0,os.path.join(sys.path[0], '<path>/miniconda3/envs/phispy/bin'))
  • I changed the import statement from from PhiSpy.py import main to from PhiSpy import main
  • Finally, sys.argv was not being passed to main at the end of the file. So I modified the line from sys.exit(main()) to sys.exit(main(sys.argv))

These changes allowed the command phispy to work

multi-contigs

Hi,

thanks for sharing this tool.
I had expected the method to work with a multi-contig input file as well (cf. Phaster). Indeed, it clearly recognizes the number of relevant (>5 kb) contigs in the input file (e.g. "Processing 2 contigs"), but it seems to scan only the first contig. Or did I miss some parameter or tweak?

Thanks for your help,

H.

Fixing phmms data use condition

Currently the below condition is always False,

if 'phmms' not in kwargs:
    train_data = np.delete(train_data, 5, 1)
    test_data = np.delete(train_data, 5, 1)

and phmms signal columns are considered even if not used.
There was also a mistake in L43 where test_data had train_data assigned.

This should be:

 if not kwargs['phmms']: 
     train_data = np.delete(train_data, 5, 1) 
     test_data = np.delete(test_data, 5, 1)
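The reason the membership test never fires can be shown with a small standalone sketch. It assumes the keyword arguments are built with `vars()` from an argparse namespace, so the `phmms` key is always present even when the option was not given; the parser below is a hypothetical stand-in, not PhiSpy's real argument parser.

```python
import argparse

# Hypothetical stand-in for the real parser: one optional --phmms argument.
parser = argparse.ArgumentParser()
parser.add_argument("--phmms", default=None)

# Simulate a run where the user did not pass --phmms.
kwargs = vars(parser.parse_args([]))

# argparse always creates the attribute, so the key exists with value None.
key_missing = "phmms" not in kwargs   # False: the key is present
value_unset = not kwargs["phmms"]     # True: its value is None

print(key_missing, value_unset)
```

Testing the value rather than the key, as proposed above, distinguishes the two cases.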

Output

Only prophage_tbl.txt is produced as output; there is no prophage.tbl file in the output folder.

The misc_feature location off by 1

The misc_features added to the input GenBank file that represent the PhiSpy prophage calls are off by 1:

$ grep pp3 prophage_coordinates.tsv 
pp3	NC_002737	1191309	1222549	1193572	1193583	1220349	1220360	TCAGATTTTTT	AAAAAATCTGA	Longest Repeat flanking phage and within 2000 bp

$ head -n 25734 Streptococcus_pyogenes_M1_GAS.gb | tail -n 2
     misc_feature    1191310..1222549
                     /note="prophage region pp3 identified with PhiSpy v4.2.6"

It should be:

     misc_feature    1191309..1222549
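A shift like this typically appears when coordinate conventions are mixed: GenBank feature locations are 1-based and inclusive, while Python slices are 0-based and half-open. The sketch below only illustrates that conversion; it is not PhiSpy's code, and whether this is the actual cause here is an assumption. `genbank_location` is a hypothetical helper.

```python
def genbank_location(start0, end):
    """Format a 0-based, half-open interval as a GenBank 'start..end' string.

    GenBank locations are 1-based and inclusive, so only the start shifts
    by one; the half-open end already equals the inclusive 1-based end.
    """
    return "{}..{}".format(start0 + 1, end)


# A start that is truly 0-based converts as expected:
print(genbank_location(1191308, 1222549))
# If the start was already 1-based, the same conversion shifts it by one:
print(genbank_location(1191309, 1222549))
```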

phmms option cleanup

Either:

  1. allow a defined name for the protein file (some may want to keep it), or
  2. remove it automatically.

We could also write the protein sequences with md5 hashes if we do (1)

PhiSpy does not support Python 3

PhiSpy appears to support Python 2 only. This could be documented in the README files.

Ideally, Python 3 support would also be added, since Python 2 will no longer be maintained in 2020.

No bases were counted for orf

Hello, I occasionally run into this issue when running PhiSpy:

2022-03-28 12:14:53 INFO     Welcome to PhiSpy.py version 4.2.21
2022-03-28 12:14:53 INFO     Starting PhiSpy.py with the following arguments
Namespace(infile='input/genomic.gbff', output_dir='/home/ec2-user/physpy', make_training_data=None, training_set='data/trainSet_genericAll.txt', list=False, file_prefix='test', evaluate=False, number=5, min_contig_size=5000, window_size=30, nonprophage_genegaps=10, phage_genes=1, metrics=['orf_length_med', 'shannon_slope', 'at_skew', 'gc_skew', 'max_direction'], randomforest_trees=500, expand_slope=False, kmers_type='all', phmms='/home/ec2-user/VOGs.hmms', include_annotations=True, ignore_annotations=False, color=True, threads=4, output_choice=512, include_all_repeats=False, keep_dropped_predictions=False, extra_dna=2000, min_repeat_len=10, log='/home/ec2-user/physpy/test_phispy.log', quiet=False, keep=False, logger=<Logger PhiSpy (Level 5)>)
2022-03-28 12:14:54 INFO     Processing 14 contigs
2022-03-28 12:14:54 INFO     Running HMM profiles against /home/ec2-user/VOGs.hmms
2022-03-28 12:14:54 INFO     hmmsearch: writing the amino acids to temporary file /home/ec2-user/physpy/tmpio1svuew
2022-03-28 12:14:54 INFO     Searching 2613 proteins with hmmsearch.
2022-03-28 12:18:15 INFO     Completed running HMM profiles against /home/ec2-user/VOGs.hmms
2022-03-28 12:18:15 INFO     Making Testing Set...
2022-03-28 12:18:17 INFO     a total of zero total_at*total_gc
No bases were counted for orf {'start': 507191, 'stop': 508927, 'phmm': 0.18568636235841013, 'peg': 'peg', 'is_phage': 0} from 507191 to 508927
This error is usually thrown with an exceptionally short ORF that is only a  few bases. You should check this ORF and confirm it is real!

I can't find anything weird in the ORF throwing the exception. This error makes the entire run fail, which is not what I would expect given that other ORFs are simply ignored and raise warnings (e.g. when there are multiple ORFs with the same ID and all but the first are discarded).
Would it be possible to have more details about what may be triggering the error, and eventually convert this to a warning in future versions of PhiSpy? Unfortunately I cannot share the input file, and I haven't managed to replicate the error on a small example, but running PhiSpy on many random genomes downloaded from NCBI should enable you to replicate it. Let me know if I can help in any other way.
Thanks for maintaining this great tool!

Not phage function

Shouldn't there be an if rather than elif in L126?

if is_phage_func(func):
    x = 1
elif is_unknown_func(func):
    x = 0.5
elif is_not_phage_func(func):
    x = 0

It seems like each condition should be checked separately.
E.g., the current chain won't work for "phage shock protein", because it is first classified as a phage function due to the word "phage".
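One way to avoid the misclassification is to test the exclusion terms before the generic phage terms. The sketch below is only an illustration: the three keyword tuples are hypothetical and much shorter than any real list, `score_function` is not PhiSpy's function, and returning 0 when nothing matches is an assumption.

```python
# Hypothetical keyword lists, for illustration only.
NOT_PHAGE_TERMS = ("phage shock",)
PHAGE_TERMS = ("phage", "capsid", "terminase")
UNKNOWN_TERMS = ("hypothetical", "unknown")


def score_function(func):
    """Return 1 for phage-like, 0.5 for unknown, 0 for non-phage annotations."""
    text = func.lower()
    # Check exclusions first so "phage shock protein" is not caught by "phage".
    if any(term in text for term in NOT_PHAGE_TERMS):
        return 0
    if any(term in text for term in PHAGE_TERMS):
        return 1
    if any(term in text for term in UNKNOWN_TERMS):
        return 0.5
    # Assumption: anything unmatched is treated as non-phage.
    return 0
```

Reordering the checks keeps a single exclusive chain; checking each condition separately, as suggested above, would instead let a later match override an earlier one.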

output files not created

Hi!

I'm encountering a problem when running phispy (conda environment) with the provided test files.
The analysis runs with no problems and I can see that it identifies prophages but then the output files are not created and I get the following error:

 File "/Users/oihira/anaconda3/envs/prophaging/lib/python3.10/site-packages/PhiSpyModules/writers.py", line 325, in write_all_outputs
    self.record.get_entry(self.pp[i]['contig']).append_feature(SeqFeature(

TypeError: SeqFeature.__init__() got an unexpected keyword argument 'strand'.

Any idea why?

Thanks a lot!
/Oihane

#packages in environment at /Users/xxxx/anaconda3/envs/phispy:
# Name                    Version                   Build  Channel
bcbio-gff                 0.7.0              pyh7cba7a3_0    bioconda
biopython                 1.83            py310hb372a2b_0    conda-forge
bx-python                 0.10.0          py310h260c36f_0    bioconda
bzip2                     1.0.8                h10d778d_5    conda-forge
ca-certificates           2023.11.17           h8857fd0_0    conda-forge
hmmer                     3.4                  h7133b54_0    bioconda
joblib                    1.3.2              pyhd8ed1ab_0    conda-forge
libblas                   3.9.0           20_osx64_openblas    conda-forge
libcblas                  3.9.0           20_osx64_openblas    conda-forge
libcxx                    16.0.6               hd57cbcb_0    conda-forge
libexpat                  2.5.0                hf0c8a7f_1    conda-forge
libffi                    3.4.2                h0d85af4_5    conda-forge
libgfortran               5.0.0           13_2_0_h97931a8_1    conda-forge
libgfortran5              13.2.0               h2873a65_1    conda-forge
liblapack                 3.9.0           20_osx64_openblas    conda-forge
libopenblas               0.3.25          openmp_hfef2a42_0    conda-forge
libsqlite                 3.44.2               h92b6c6a_0    conda-forge
libzlib                   1.2.13               h8a1eda9_5    conda-forge
llvm-openmp               17.0.6               hb6ac08f_0    conda-forge
ncurses                   6.4                  h93d8f39_2    conda-forge
numpy                     1.26.3          py310h4bfa8fc_0    conda-forge
openssl                   3.2.0                hd75f5a5_1    conda-forge
phispy                    4.2.21          py310hbdf848b_2    bioconda
pip                       23.3.2             pyhd8ed1ab_0    conda-forge
python                    3.10.13         h00d2728_1_cpython    conda-forge
python_abi                3.10                    4_cp310    conda-forge
readline                  8.2                  h9e318b2_1    conda-forge
scikit-learn              1.3.2           py310h04b1a37_2    conda-forge
scipy                     1.11.4          py310h3f1db6d_0    conda-forge
setuptools                69.0.3             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
threadpoolctl             3.2.0              pyha21a80b_0    conda-forge
tk                        8.6.13               h1abcd95_1    conda-forge
tzdata                    2023d                h0c530f3_0    conda-forge
wheel                     0.42.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h775f41a_0    conda-forge

Parameter to indicate file name prefixes

What is the parameter required to indicate file name prefixes?

I am interested in this part of the information:

file name prefixes
By default the outputs from PhiSpy.py have standard names. If you supply a file name prefix it will be prepended to all the file so that you can run PhiSpy.py on multiple genomes and have the outputs in the same directory without overwriting each other.

I can't find anything related when typing -h

Coordinates interpretation

Hi, I have a question. Not a specialist on prophages but my understanding was that the attL/attR sequences were always flanking the prophage regions. I got the following prophage_coordinates.tsv file:

pp1             NODE_3  400     2596    1645    1657    3009    3021    GTCATAAAAGCC    GGCTTTTATGAC    Longest Repeat flanking phage and within 2000 bp
pp2             NODE_13 180     8557    3204    3216    7661    7673    TCGTCCAATTTC    GAAATTGGACGA    Longest Repeat flanking phage and within 2000 bp
pp3             NODE_14 15647   28479   14471   14484   25382   25395   AGGGAGTTTTACC   GGTAAAACTCCCT   Longest Repeat flanking phage and within 2000 bp
pp4             NODE_47 10784   27254   11925   11937   23917   23929   TGAATCGCAGTA    GCTAACCAAAGA    Longest Repeat flanking phage and within 2000 bp
...

As you can see, the prophage coordinates look weird with respect to the attL/attR sites. I'm not sure if I am missing something or if there's a bug in PhiSpy. Am I interpreting the results the wrong way?
Thanks!
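To check rows like these systematically, a small script can report where each att site lies relative to the predicted region and whether the two repeats are reverse complements. It assumes the column order is prophage id, contig, start, stop, attL start, attL end, attR start, attR end, attL sequence, attR sequence, reason; `describe_row` and `revcomp` are hypothetical helpers, not part of PhiSpy.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def describe_row(line):
    """Summarise one whitespace- or tab-separated row of the coordinates file."""
    f = line.split(None, 10)  # the last field (the reason) may contain spaces
    start, stop = sorted((int(f[2]), int(f[3])))

    def where(a, b):
        lo, hi = sorted((a, b))
        if lo >= start and hi <= stop:
            return "inside"
        if hi < start or lo > stop:
            return "outside"
        return "overlapping"

    return {
        "id": f[0],
        "attL": where(int(f[4]), int(f[5])),
        "attR": where(int(f[6]), int(f[7])),
        "reverse_complement": revcomp(f[8]) == f[9],
    }


row = ("pp1 NODE_3 400 2596 1645 1657 3009 3021 GTCATAAAAGCC GGCTTTTATGAC "
       "Longest Repeat flanking phage and within 2000 bp")
print(describe_row(row))
```

For the pp1 row above this reports attL inside the region and attR outside it, which matches the oddity described in the question.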

The glibc version is not compatible

Hello, I am using a Docker image to install PhiSpy 4.2.21, but the glibc version of the base image is 2.35, so I can't install it. Is there a pre-built PhiSpy, or another method that can deal with the problem?

Thanks!

Selecting features measurements

The idea is to be able to choose which measurements are used for classification with Random Forest. This would allow investigating the influence of each metric on prophage discovery.

Maybe this could be available in Jupyter Notebook?

Output formatting and tempRepeatDNA

Very helpful program! I could use a little help interpreting the output, though:

A) Could someone please provide a guide to the functional categories in the gene output (unfortunately named 'pp' in the "output_tbl.txt" file)? The phantome.org site seems to be out of commission, so I could not follow the link in the Akhter et al., 2012, paper.

B) Also, is there any use after evaluation for the numerous tempRepeatDNA.XXXX.pp.X.fasta files generated in the process? At first I assumed these were fasta files for putative prophages, but there is no connection between the regions listed in "output.tbl" and those in the file generation output, e.g.:

generated by PhiSpy.py :
...
Finding repeats in pp 4 from 597108 to 766988
Not checking repeats for pp 5 because it is too big: 242343
Finding repeats in pp 6 from 1896092 to 1944030
...
Finding repeats in pp 19 from 3506144 to 3527984
Finding repeats in pp 20 from 3625704 to 3715725
... etc.

vs.

from output.tbl:
MSMTP | pp | contig | start | end
RS05 | 0 | NZ_CP009505 | 1348725 | 1585237
RS14 | 1 | NZ_CP009505 | 3512461 | 3524330

Also, unless I'm misinterpreting the output, prophages seem to begin numbering at 0, not 1 as stated in the instructions.

Thank you!

Error when sequence ID is too long

There is a small issue where one of the Biopython functions has a character length limit on sequence IDs; a more informative error message might be useful. A FASTA ID

>SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT

results in a genbank file which will give a PhiSpy traceback/error

[USERID]$ PhiSpy.py testgenome.gb -o phispyTest
Traceback (most recent call last):
  File "$PATH/anaconda3/bin/PhiSpy.py", line 125, in <module>
    main(sys.argv)
  File "$PATH/anaconda3/bin/PhiSpy.py", line 48, in main
    args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank")))
  File "$PATH/anaconda3/lib/python3.8/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__
    for n, item in enumerate(content):
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 73, in __next__
    return next(self.records)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
    record = self.parse(handle, do_features)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
    if self.feed(handle, consumer, do_features):
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 465, in feed
    self._feed_first_line(consumer, self.line)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 1572, in _feed_first_line
    raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS       SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT bp   DNA linear

Changing the ID to

>SEQID_SHORT

resolves the problem.
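Until the error message is improved, IDs can be shortened before annotation while keeping a lookup table back to the original names. The sketch below works on FASTA lines with the standard library only; `shorten_fasta_ids` and the `contig_` prefix are hypothetical choices, not something PhiSpy or Biopython provides.

```python
def shorten_fasta_ids(lines, prefix="contig_"):
    """Rename FASTA records to short sequential IDs.

    Returns (new_lines, mapping), where mapping goes from each new short ID
    back to the original ID (the first word of the header line).
    """
    new_lines = []
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split()
            original = words[0] if words else ""
            short = "{}{}".format(prefix, len(mapping) + 1)
            mapping[short] = original
            new_lines.append(">" + short)
        else:
            new_lines.append(line)
    return new_lines, mapping


fasta = [">SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT some description", "ACGTACGT"]
renamed, id_map = shorten_fasta_ids(fasta)
print(renamed, id_map)
```

The mapping can be written to a side file so that the prophage calls can be traced back to the original contig names.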

Keep CDS parts in original frame

When there's a gap in a complex feature, the new feature needs to have a start position shifted by one. That way newly created features stay in the original frame.

The following lines require change:

merged = FeatureLocation(p.start, p.end, strand)

to

merged = FeatureLocation(p.start - 1, p.end, strand)

and

merged = FeatureLocation(p.start, p.end, strand)

to

merged = FeatureLocation(p.start, p.end - 1, strand)

output files

Hello,

I'm running PhiSpy v3.2 on a local computer and I noticed that the output is a bit different from v2.3. In version 3.2 I get the prophage_tbl file and a couple of files with repeat annotations, e.g. "tempRepeatDNA.12438.pp.10.fasta.repeatfinder_1516879088".
What happened to initial_tbl.txt and prophage.tbl?
I ran PhiSpy with the command:
python PhiSpy.py -i BX927147 -o BX927147.out

Thanks!

Ovidiu

prophage fasta file

Hi,
I am using PhiSpy to predict prophages from bacteria, but the prophage FASTA file has been masked with Ns. You say it is trivial to convert this format into separate contigs without the Ns, but that it is more complex to go from separate contigs back to a single joined contig. So how can I simply convert the format into separate contigs without Ns? Or does PhiSpy.py have an option to do that?
I also have another question. I got an error like this:
"No bases were counted for orf {'start': 1570484, 'stop': 1570485, 'phmm': 0.0, 'peg': 'peg', 'is_phage': 0} from 1570484 to 1570485
This error is usually thrown with an exceptionally short ORF that is only a few bases. You should check this ORF and confirm it is real!"
and I checked the gbk file. This gene looks like this
" gene join(1570484..1570485,1..994)
/locus_tag="SMAR_RS00005"
/old_locus_tag="Smar_0001"
/db_xref="GeneID:4907656"
CDS join(1570484..1570485,1..994)
/locus_tag="SMAR_RS00005"
/old_locus_tag="Smar_0001"
/inference="COORDINATES: similar to AA
sequence:RefSeq:WP_013143107.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="TIGR00269 family protein"
/protein_id="WP_052833761.1"
/db_xref="GeneID:4907656"
/translation="MVNCSICGRPAVYVNRISGQAYCKKHFLEYFDKKVRRTIRKYKM
FSSREHIVVAVSGGKDSLSLLHYLYNLSKRVPGWKITALLIDEGIGGYRDITKKDFLR
VVNELGVNYKIASFKEYLGYTLDEIVRIGREKGLPYLPCSYCGVFRRYLLNKVARDLG
GTVLATAHNLDDVIQTYVMNIINNSWDKILRLAPVTGPLDHPKFVRRAKPFYEILEKE
TTLYSILNNLYPKFVECPYARFNIRWMIRRQLNELEEKYPGTKYSLLRSLLRIISILS
KHRDEIIQGEIKTCKVCGEPSAHEICRACLYRYELGIMREDERKIVEEVLGKKKK"
"
So how can I solve this problem?
Sorry for my ignorance...I am a new bird...
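For the first question, one simple approach is to cut the masked sequence at every run of Ns and keep the remaining pieces with their original coordinates. This is a minimal sketch using only the standard library; it assumes masking is encoded as runs of `N`, and the function name and the `min_n` threshold are hypothetical. It does not address whether PhiSpy itself has an option for this.

```python
import re


def split_on_n_runs(seq, min_n=10):
    """Split a sequence at runs of at least 'min_n' Ns.

    Returns a list of (start, end, subsequence) tuples, where start and end
    are 1-based inclusive coordinates in the original sequence.
    """
    pieces = []
    pos = 0
    for match in re.finditer(r"[Nn]{%d,}" % min_n, seq):
        if match.start() > pos:
            pieces.append((pos + 1, match.start(), seq[pos:match.start()]))
        pos = match.end()
    if pos < len(seq):
        pieces.append((pos + 1, len(seq), seq[pos:]))
    return pieces


# Each piece can be written as its own FASTA record, e.g. ">name_start_end".
for start, end, sub in split_on_n_runs("ACGT" + "N" * 12 + "GGCC"):
    print(">piece_{}_{}".format(start, end))
    print(sub)
```

Keeping the coordinates in the record names makes it possible to map the pieces back onto the original contig later.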

numpy.core.umath_tests

/usr/lib64/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
from numpy.core.umath_tests import inner1d

Lower the gene window

(helper_functions.py)

The current number is 30, which seems a little high: when I run it on E. coli K-12, no prophages are found.

This might be a bug; it looks like the -n and -w flags got mixed up.
This line looks like it should be self.number as noted here

ValueError: missing molecule_type in annotations

Hi
I get the following error with --output_choice 4 or --output_choice 8. I don't get bacteria.fasta, bacteria.gbk and phage.gbk, but I do get phage.fasta. What could be the reason? All other options work fine.

PhiSpy.py Streptococcus_pyogenes_M1_GAS.gbk -o M.phages -p M1 --threads 4 --log M1.log --output_choice 4

ValueError: missing molecule_type in annotations

Thank you so much.
