deprekate / prfect

Software to predict the occurrence of programmed ribosomal frameshifting in bacterial, phage, and viral genomes

License: GNU General Public License v3.0

Makefile 0.28% Python 99.72%
bacteria genes genomics phages annotation bioinformatics

prfect's Introduction

prfect

PRFect is a tool to predict programmed ribosomal frameshifting in eukaryotic, prokaryotic, and viral genomes

The published manuscript is available at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05701-0


PRFect takes as input the genome and its annotated CoDing Sequences (CDS) as a GenBank file.
       *  If you only have a FASTA file, we recommend our new gene caller Genotate, which is
          the only gene caller that can call gene fragments.

PRFect searches through a GenBank file looking for eight different slippery site motifs associated with backwards (-1) frameshifts and two motifs associated with forward (+1) frameshifts. When a motif is encountered, various cellular properties and factors are assessed, and PRFect predicts whether the site is involved in programmed ribosomal frameshifting.
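
To make the idea of a motif scan concrete, here is a minimal sketch that looks only for the classic X XXY YYZ heptamer (three identical bases, three identical bases, then any base). It is an illustration under stated assumptions, not PRFect's implementation: the function name `find_heptamers` is hypothetical, and PRFect's own eight -1 motifs and two +1 motifs are not reproduced here. The two slippery sequences reported in the examples in this README (tttaaac and gggaaag) both fit this pattern.

```python
import re

# Hypothetical illustration: matches only the XXXYYYZ heptamer pattern.
# The lookahead lets overlapping candidates be reported as well.
HEPTAMER = re.compile(r"(?=(([acgt])\2\2([acgt])\3\3[acgt]))")

def find_heptamers(seq):
    """Return (0-based offset, heptamer) for every XXXYYYZ match in seq."""
    seq = seq.lower()
    return [(m.start(), m.group(1)) for m in HEPTAMER.finditer(seq)]

print(find_heptamers("ccgtttaaacgg"))   # contains tttaaac
print(find_heptamers("atgggaaagtgt"))   # contains gggaaag
```

A motif hit alone is only a candidate; as described above, the prediction also depends on the other properties assessed at each site.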


To install:

python3 -m pip install prfect

To run:

prfect.py input.gbk

An example genome for SARS-CoV-2 is provided in the test folder. The SARS-CoV-2 genome contains 12 genes, the first of which is a PRF gene, denoted as such by the join keyword. Any gene already annotated with the join keyword is split into its two parts, predicted anew, and tagged with /label=1 to indicate a TruePositive. When the genome is run through PRFect, the known PRF gene is correctly predicted to use programmed ribosomal frameshifting.

$ prfect.py test/covid19.gbk 

     CDS             join(266..13468,13468..21555)
                     /ribosomal_slippage
                     /direction=-1
                     /motif=is_threethree
                     /slippery_sequence=tttaaac
                     /label=1
                     /locus=NC_045512
                     /product="ORF1ab polyprotein"
                     /product="ORF1ab polyprotein"
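
The join location in the output above can be read as two parts whose boundary encodes the shift. The following sketch is an assumption-laden illustration rather than PRFect's code: the helper names `split_join` and `shift_between` are hypothetical, and it handles only a two-part, forward-strand join. It shows that when the second part starts on the last base of the first part, the offset is -1.

```python
import re

def split_join(location):
    """Split 'join(a..b,c..d)' into [(a, b), (c, d)] as 1-based inclusive ints."""
    match = re.fullmatch(r"join\((.+)\)", location.strip())
    if match is None:
        raise ValueError(f"not a join location: {location!r}")
    parts = []
    for piece in match.group(1).split(","):
        start, end = piece.split("..")
        parts.append((int(start), int(end)))
    return parts

def shift_between(parts):
    """Offset between two parts: -1 if they share one base, +1 if one base is skipped."""
    (_, first_end), (second_start, _) = parts
    return second_start - first_end - 1

parts = split_join("join(266..13468,13468..21555)")
print(parts, shift_between(parts))
```

For the SARS-CoV-2 example this prints the two parts and -1, which agrees with the /direction=-1 tag in the output.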

Another example is bacteriophage lambda, whose tail assembly chaperone gene (geneG and geneGT) is known to frameshift. The current GenBank annotation file (NC_001416) does not denote the gene with the join keyword, so the two pieces appear as separate CDS features. When the genome is run through PRFect, the gene is correctly identified as a single PRF gene, with /label=0 to indicate that it is an UnknownPositive.

$ prfect.py test/lambda.gbk

     CDS             join(9711..10115,10115..10549)
                     /ribosomal_slippage
                     /direction=-1
                     /motif=is_threethree
                     /bases=gggaaag
                     /label=0
                     /locus=NC_001416
                     /product="minor tail protein G"
                     /product="tail assembly protein T"

You can list all the slippery sites that PRFect checked, both to confirm that it evaluated a given site and to see whether there were any near hits. The --dump flag shows the calculated cellular properties at each potential slippery site:

$ prfect.py test/lambda.gbk --dump | head
LOCUS      SLIPSITE   LOC  LABEL  N  DIR RBS1 RBS2  A0     A1     LF50    HK50    LF100   HK100  PRED  PROB  MOTIF
NC_001416  gcaaaacgc  4278   0  159   1   13   1.8  0.015  0.025  -0.24   -0.236  -0.523  -0.306   0    1.0  three
NC_001416  ggaaagtgt  10115  0   18  -1    2     0  0.004  0.024  -0.313  -0.287  -0.668  -0.404  -1   0.88  threethree  
NC_001416  gcgaaagca  31034  0   30   1    2   1.0  0.029  0.032  -0.282  -0.243  -0.477  -0.326   0    1.0  three
NC_001416  tggaaacgc  33370  0   72   1    1     0  0.015  0.028  -0.124  -0.118  -0.482  -0.36    0    1.0  three
NC_001416  cgtaaatta  33388  0   90   1    0     0  0.009  0.012  -0.15   -0.138  -0.291  -0.237   0    1.0  three
NC_001416  gcagggtgg  33442  0  144   1    0     0  0.017  0.021  -0.092  -0.039  -0.388  -0.274   0    1.0  three
NC_001416  gaaaaggag  42081  0   42  -1    0     0  0.027  0.013  -0.246  -0.149  -0.176  -0.105   0    1.0  twofour
NC_001416  aaaaccttc  42206  0   66  -1    0     0  0.015  0.014  -0.403  -0.266  -0.367  -0.249   0    1.0  fivetwo
NC_001416  cgaaaaaat  43240  0    6   1    2     0  0.019  0.023  -0.513  -0.245  -0.395  -0.294   0   0.98  four

The columns are:

LOCUS     id of the sequence
SLIPSITE  bases of the slippery site
LOC       location of the slippery site within the sequence (base position)
LABEL     whether the slippery site is already annotated: 0 not a joined gene, 1 a joined gene, -1 a joined gene but is >10bp away 
N         distance of the slippery site from the in-frame stop codon
DIR       direction of the shift
RBS1      Prodigal-like ribosomal binding site interference score
RBS2      RAST-like ribosomal binding site interference score
A0        frequency of the A-site codon usage in all genes
A1        frequency of the +1 A-site codon usage in all genes
LF50      normalized LinearFold minimum free energy calculation of the downstream 50bp window
HK50      normalized HotKnots minimum free energy calculation of the downstream 50bp window
LF100     normalized LinearFold minimum free energy calculation of the downstream 100bp window
HK100     normalized HotKnots minimum free energy calculation of the downstream 100bp window
PRED      type of shift predicted by PRFect to occur: -1 backwards, 0 no shift, +1 forwards
PROB      how sure PRFect was for the predicted (PRED) type
MOTIF     slippery sequence motif
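
As a sketch of how the dump might be consumed downstream, the snippet below parses a few rows copied from the lambda example into dictionaries keyed by column name and keeps only the rows where PRED is not 0. The helper `parse_dump` is hypothetical and simply assumes whitespace-separated columns, as in the output shown above.

```python
# Three rows copied from the lambda --dump example above.
DUMP = """\
LOCUS      SLIPSITE   LOC  LABEL  N  DIR RBS1 RBS2  A0     A1     LF50    HK50    LF100   HK100  PRED  PROB  MOTIF
NC_001416  gcaaaacgc  4278   0  159   1   13   1.8  0.015  0.025  -0.24   -0.236  -0.523  -0.306   0    1.0  three
NC_001416  ggaaagtgt  10115  0   18  -1    2     0  0.004  0.024  -0.313  -0.287  -0.668  -0.404  -1   0.88  threethree
NC_001416  cgaaaaaat  43240  0    6   1    2     0  0.019  0.023  -0.513  -0.245  -0.395  -0.294   0   0.98  four
"""

def parse_dump(text):
    """Turn whitespace-separated dump text into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]

rows = parse_dump(DUMP)
# Keep only sites where a shift (-1 or +1) was predicted.
shifts = [r for r in rows if int(r["PRED"]) != 0]
for r in shifts:
    print(r["LOCUS"], r["LOC"], r["PRED"], r["PROB"], r["MOTIF"])
```

Because PROB is the confidence in whatever PRED says, a row with PRED 0 and PROB 1.0 reads as a confident "no shift", not as a strong frameshift candidate; only the row at LOC 10115 survives the filter here.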

You can also use the -s flag to scale the MFE calculations to account for extreme GC content, temperature, or salinity:

$ prfect.py test/lambda.gbk -s 1.5 --dump | head -n 3
LOCUS      SLIPSITE   LOC  LABEL  N  DIR RBS1 RBS2  A0     A1     LF50    HK50    LF100   HK100  PRED  PROB  MOTIF
NC_001416  gcaaaacgc  4278   0  159   1   13   1.8  0.015  0.025  -0.36   -0.354  -0.785  -0.459   0    1.0  three
NC_001416  ggaaagtgt  10115  0   18  -1    2     0  0.004  0.024  -0.47   -0.431  -1.002  -0.606  -1  0.999  threethree  

Note that the MFE values are scaled by a factor of 1.5 (50% larger in magnitude) compared with the dump above, which also makes the trained model more confident in the backward -1 PREDiction at LOCation 10115 (PROB rises from 0.88 to 0.999).
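
The scaling can be checked by hand against the two dumps. The sketch below takes the four MFE columns for the site at LOC 10115 from each dump and confirms that the scaled values equal the unscaled ones times 1.5, within the rounding of the printed output. That -s acts as a simple multiplier is an inference from these numbers, not a statement about PRFect's internals.

```python
import math

# MFE columns for the site at LOC 10115, copied from the two dumps above.
unscaled = {"LF50": -0.313, "HK50": -0.287, "LF100": -0.668, "HK100": -0.404}
scaled = {"LF50": -0.47, "HK50": -0.431, "LF100": -1.002, "HK100": -0.606}
SCALE = 1.5  # the value passed to -s in the example

matches = {}
for name, value in unscaled.items():
    expected = value * SCALE
    # Printed values are rounded to three decimals, so allow that much slack.
    matches[name] = math.isclose(expected, scaled[name], abs_tol=0.001)
    print(f"{name}: {value} * {SCALE} = {expected:.4f} (reported {scaled[name]})")

print(all(matches.values()))
```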

prfect's Issues

AttributeError: 'HistGradientBoostingClassifier' object has no attribute '_preprocessor'

When running prfect.py prfect/test/covid19.gbk, I get the following error:

Error
Traceback (most recent call last):
  File "/home/sc-linux1/miniconda3/envs/prfect2/bin/prfect.py", line 181, in <module>
    if has_prf(metrics):
       ^^^^^^^^^^^^^^^^
  File "/home/sc-linux1/miniconda3/envs/prfect2/bin/prfect.py", line 141, in has_prf
    prob = clf.predict_proba(row.loc[:,clf.feature_names_in_])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sc-linux1/miniconda3/envs/prfect2/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 2178, in predict_proba
    raw_predictions = self._raw_predict(X)
                      ^^^^^^^^^^^^^^^^^^^^
  File "/home/sc-linux1/miniconda3/envs/prfect2/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1264, in _raw_predict
    X = self._preprocess_X(X, reset=False)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sc-linux1/miniconda3/envs/prfect2/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 266, in _preprocess_X
    if self._preprocessor is None:
       ^^^^^^^^^^^^^^^^^^
AttributeError: 'HistGradientBoostingClassifier' object has no attribute '_preprocessor'. Did you mean: '_preprocess_X'?

My environment:
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
bzip2 1.0.8 hd590300_5 conda-forge
ca-certificates 2023.11.17 hbcca054_0 conda-forge
genbank 0.110 pypi_0 pypi
hotknots 2.4 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
libexpat 2.5.0 hcb278e6_1 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h807b86a_4 conda-forge
libgomp 13.2.0 h807b86a_4 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libsqlite 3.44.2 h2797004_0 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
linearfold 1.1 pypi_0 pypi
ncurses 6.4 h59595ed_2 conda-forge
numpy 1.26.3 pypi_0 pypi
openssl 3.2.0 hd590300_1 conda-forge
packaging 23.2 pypi_0 pypi
pandas 2.2.0 pypi_0 pypi
pip 23.3.2 pyhd8ed1ab_0 conda-forge
prfect 0.38 pypi_0 pypi
python 3.12.1 hab00c5b_1_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
pytz 2023.4 pypi_0 pypi
readline 8.2 h8228510_1 conda-forge
scikit-learn 1.4.0 pypi_0 pypi
scipy 1.12.0 pypi_0 pypi
score-rbs 0.5 pypi_0 pypi
setuptools 69.0.3 pyhd8ed1ab_0 conda-forge
six 1.16.0 pypi_0 pypi
sklearn 0.0.post12 pypi_0 pypi
threadpoolctl 3.2.0 pypi_0 pypi
tk 8.6.13 noxft_h4845f30_101 conda-forge
tzdata 2023.4 pypi_0 pypi
wheel 0.42.0 pyhd8ed1ab_0 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge

Exact package dependencies?

Hi,
Could you please post the exact versions of the dependencies? I'm hitting a wall with hotknots and LinearFold on Python 3.12.2:

Building wheels for collected packages: hotknots, LinearFold
  Building wheel for hotknots (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [912 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-312
      creating build/lib.linux-x86_64-cpython-312/hotknots

...

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      running bdist_wheel
      running build
      running build_ext
      building 'LinearFold' extension
      creating build
      creating build/temp.linux-x86_64-cpython-312
      creating build/temp.linux-x86_64-cpython-312/src
      gcc -pthread -B /home/leon/miniconda3/envs/prfect_env/compiler_compat -fno-strict-overflow -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/leon/miniconda3/envs/prfect_env/include -fPIC -O2 -isystem /home/leon/miniconda3/envs/prfect_env/include -fPIC -I. -I... -I/tmp/pip-install-ia7nz1_1/linearfold_757d1fe791ca4ed79fdd218b5b314571/src -I/tmp/pip-install-ia7nz1_1/linearfold_757d1fe791ca4ed79fdd218b5b314571/src/Utils -I/home/leon/miniconda3/envs/prfect_env/include/python3.12 -c src/python.cpp -o build/temp.linux-x86_64-cpython-312/src/python.o -w -Dlv -Dis_cube_pruning -Dis_candidate_list -std=c++11
      src/python.cpp:58:1: sorry, unimplemented: non-trivial designated initializers not supported
       };
       ^
      src/python.cpp:58:1: sorry, unimplemented: non-trivial designated initializers not supported
      src/python.cpp:58:1: sorry, unimplemented: non-trivial designated initializers not supported
      src/python.cpp:58:1: sorry, unimplemented: non-trivial designated initializers not supported
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for LinearFold
  Running setup.py clean for LinearFold
Failed to build hotknots LinearFold
ERROR: Could not build wheels for hotknots, LinearFold, which is required to install pyproject.toml-based projects

and I'm not sure what to adjust:

# packages in environment at /home/leon/miniconda3/envs/prfect_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                hd590300_5    conda-forge
ca-certificates           2024.2.2             hbcca054_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h807b86a_5    conda-forge
libgomp                   13.2.0               h807b86a_5    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.45.2               h2797004_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
ncurses                   6.4.20240210         h59595ed_0    conda-forge
openssl                   3.2.1                hd590300_1    conda-forge
pip                       24.0               pyhd8ed1ab_0    conda-forge
python                    3.12.2          hab00c5b_0_cpython    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                69.2.0             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2024a                h0c530f3_0    conda-forge
wheel                     0.43.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Thanks.

ModuleNotFoundError: No module named 'sklearn._loss.loss'

Dear developer,

I am running into an error (see below)

Thanks for looking into.

best regards
jonas

///////////
prfect.py sequence.gbk
Traceback (most recent call last):
  File "/usr/local/bin/prfect.py", line 163, in <module>
    clf = pickle.load(open(path, 'rb'))
ModuleNotFoundError: No module named 'sklearn._loss.loss'
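
Errors of this kind when unpickling a model are often seen when the installed scikit-learn release differs from the one that saved the model. Whether that is the cause in this report is an assumption. A small diagnostic sketch, using only the standard library and two hypothetical helper names, reports the installed version and whether the module named in the traceback can be found:

```python
import importlib.util
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def module_available(module_name):
    """Return True if the named module can be located in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no usable spec.
        return False

print("scikit-learn:", installed_version("scikit-learn"))
print("sklearn._loss.loss found:", module_available("sklearn._loss.loss"))
```

Including this output in a report makes it easier to compare the environment against the one the model was trained in.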

Pip3 install triggers error with sklearn

pip3 install prfect
Collecting prfect
Using cached prfect-0.38-py3-none-any.whl (909 kB)
Requirement already satisfied: scikit-learn>=0.24.0 in c:\users\x\appdata\local\programs\python\python310\lib\site-packages (from prfect) (1.3.1)
Collecting sklearn (from prfect)
Using cached sklearn-0.0.post10.tar.gz (3.6 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
rather than 'sklearn' for pip commands.

  Here is how to fix this error in the main use cases:
  - use 'pip install scikit-learn' rather than 'pip install sklearn'
  - replace 'sklearn' by 'scikit-learn' in your pip requirements files
    (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
  - if the 'sklearn' package is used by one of your dependencies,
    it would be great if you take some time to track which package uses
    'sklearn' instead of 'scikit-learn' and report it to their issue tracker
  - as a last resort, set the environment variable
    SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error

  More information is available at
  https://github.com/scikit-learn/sklearn-pypi-package

  If the previous advice does not cover your use case, feel free to report it at
  https://github.com/scikit-learn/sklearn-pypi-package/issues/new
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Column Meanings

Hello,

I am currently working with SEA-PHAGE data and have been using PRFect for an independent project, and I have a few questions. One concerns a probability of 1.0: if the other predicted slippery sites also have a probability of 1.0 or very close to it, why was the predicted site chosen over the others? Also, is there more information on the interpretation of each column?

Thanks for reading!

Fix bug with fully nested genes not being pairwise compared correctly

There is an issue with the current version when fully nested genes are present in the genome.

Given the layout of two genes in different frames where one is fully nested inside the other such as the case below:

geneA  |------------------->
geneB           |---->     

The current version only does a single pairwise comparison, for shifting from A into B.

We need to check both A into B and B into A.
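
One way to cover both directions is to enumerate ordered pairs instead of unordered ones. The sketch below is a hypothetical illustration, not the project's code: the `Gene` tuple, its `frame` field, and the function names are assumptions made for the example.

```python
from itertools import permutations
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based inclusive
    end: int     # 1-based inclusive
    frame: int   # hypothetical reading-frame index

def overlaps(a, b):
    """True when the two genes share at least one base."""
    return a.start <= b.end and b.start <= a.end

def shift_candidates(genes):
    """Yield every ordered (source, target) pair of overlapping genes in different frames."""
    for src, dst in permutations(genes, 2):
        if src.frame != dst.frame and overlaps(src, dst):
            yield src.name, dst.name

# geneB is fully nested inside geneA, as in the diagram above.
genes = [Gene("geneA", 1, 300, 0), Gene("geneB", 120, 180, 1)]
print(list(shift_candidates(genes)))
```

With `permutations` the nested case yields both geneA into geneB and geneB into geneA, whereas `combinations` would yield only the first.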
