
oliverdutton / cspred


This project forked from thglab/cspred


UCBShift is a program for predicting chemical shifts for backbone atoms and β-carbon of a protein in solution. It utilizes a machine learning module that makes predictions from features extracted from the 3D structures of the proteins.



UCBShift

UCBShift is a program for predicting chemical shifts for the backbone atoms and β-carbon of a protein in solution. The program implements two mechanisms: a transfer prediction module that uses both sequence alignment and structure alignment to select references for shift replication, and a machine learning module, based on an ensemble of decision trees, that takes features extracted from a PDB file and makes reliable chemical shift predictions. Combined, the two modules achieve state-of-the-art accuracy on a "real-world" dataset, with root-mean-square errors between predicted and experimental values of 0.38, 0.22, 1.31, 0.97, 1.29 and 2.16 ppm for H, Hα, C, Cα, Cβ and N, respectively.
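
The accuracy figures above are root-mean-square errors (RMSE) computed per atom type. As a minimal sketch of that metric, assuming the standard definition (this is not the authors' evaluation script, and the function name `rmse` is a hypothetical helper), the calculation for one atom type looks like this:

```python
import math


def rmse(predicted, experimental):
    """Root-mean-square error between two equal-length lists of shifts (ppm)."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    # Mean of the squared differences, then the square root.
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))
```

The metric would be applied separately to each atom type (H, Hα, C, Cα, Cβ, N), since their shift ranges differ by an order of magnitude.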

Publication

Li, J., Bennett, K. C., Liu, Y., Martin, M. V., & Head-Gordon, T. (2020). Accurate prediction of chemical shifts for aqueous protein structure on “Real World” data. Chemical Science, 11(12), 3180-3191. DOI: 10.1039/C9SC06561J

Software package requirements

Python and Python packages

External programs needed

Usage

Because the trained models are large, the saved model files must be downloaded separately. After downloading the models.tgz file, extract it into the models/ folder, so that models/ contains 18 .sav files.
Once the Python packages and external programs are configured correctly, the trained models can be used as is. The CSpred.py script is the entry point to the UCBShift chemical shift predictor.
The simplest way to use UCBShift is to run the CSpred.py script directly on your protein of interest. A [shifts.csv] file is generated in the directory from which you ran the script. For example:

python CSpred.py your_input_protein.pdb
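
The setup and the basic invocation above can be checked from Python before a long run. The following is a minimal sketch, not part of UCBShift: the helper names `count_saved_models`, `build_command` and `run_prediction` are hypothetical, and the code assumes only the `models/` layout (18 `.sav` files) and the command line described above.

```python
import subprocess
import sys
from pathlib import Path

EXPECTED_MODELS = 18  # number of .sav files expected under models/


def count_saved_models(models_dir):
    """Return how many .sav files sit directly under models_dir (0 if missing)."""
    path = Path(models_dir)
    if not path.is_dir():
        return 0
    return len(list(path.glob("*.sav")))


def build_command(pdb_file, script="CSpred.py"):
    """Build the basic command line: python CSpred.py your_input_protein.pdb."""
    return [sys.executable, script, str(pdb_file)]


def run_prediction(pdb_file, models_dir="models"):
    """Check the model files, then run the predictor on one PDB file."""
    found = count_saved_models(models_dir)
    if found != EXPECTED_MODELS:
        raise RuntimeError(
            f"expected {EXPECTED_MODELS} .sav files in {models_dir}/, found {found}"
        )
    # shifts.csv is written to the current working directory.
    subprocess.run(build_command(pdb_file), check=True)
```

Calling `run_prediction("your_input_protein.pdb")` from the repository root would fail early with a clear message if the model archive was not extracted, instead of failing partway through a prediction.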

Advanced options are described below:

usage: CSpred.py [-h] [--batch] [--output OUTPUT] [--worker WORKER]
                 [--shifty_only] [--shiftx_only] [--pH PH] [--test]
                 input

positional arguments:
  input                 The query PDB file or list of PDB files for which the
                        shifts are calculated

optional arguments:
  --batch, -b           If toggled, input accepts a text file specifying all
                        the PDB files need to be calculated (Each line is a
                        PDB file name. If pH values are specified, followed
                        with a space)
  --output OUTPUT, -o OUTPUT
                        Filename of generated output file. A file [shifts.csv]
                        is generated by default. If in batch mode, you should
                        specify the path for storing all the output files.
                        Each output file has the same name as the input PDB
                        file name.
  --worker WORKER, -w WORKER
                        Number of CPU cores to use for parallel prediction in
                        batch mode.
  --shifty_only, -y, -Y
                        Only use UCBShift-Y (transfer prediction) module.
                        Equivalent to executing UCBShift-Y directly with
                        default settings
  --shiftx_only, -x, -X
                        Only use UCBShift-X (machine learning) module. No
                        alignment results will be utilized or calculated
  --pH PH, -pH PH, -ph PH
                        pH value to be considered. Default is 5
  --test, -t            If toggled, use test mode for UCBShift-Y prediction
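
Batch mode needs a list file and an output directory. The sketch below shows one way to prepare both from Python. It reads the help text as "one PDB file name per line, optionally followed by a space and a pH value"; that reading is an assumption, and the helper names `write_batch_list` and `build_batch_command` are hypothetical.

```python
from pathlib import Path


def write_batch_list(entries, list_path):
    """Write one line per PDB file, appending the pH after a space when given.

    entries: iterable of (pdb_path, ph) pairs, where ph may be None.
    Returns the lines that were written.
    """
    lines = []
    for pdb_path, ph in entries:
        lines.append(str(pdb_path) if ph is None else f"{pdb_path} {ph}")
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


def build_batch_command(list_path, output_dir, workers):
    """Assemble a batch-mode command line from the flags in the help text."""
    return [
        "python", "CSpred.py",
        "--batch",
        "--output", str(output_dir),  # in batch mode this is a directory
        "--worker", str(workers),     # CPU cores used for parallel prediction
        str(list_path),
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. According to the help text, each output file then takes the name of its input PDB file.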

To run the transfer prediction module (UCBShift-Y, called SHIFTY++ in the help text below) with more options, run ucbshifty.py directly:

usage: ucbshifty.py [-h] [--output OUTPUT] [--strict STRICT] [--secondary]
                    [--test] [--exclude] [--shifty]
                    input

This program executes both sequence-based alignment (using BLAST) and
structure-based alignment (using mTM-align) to find the best alignment for a
specific pdb file with entities in the refDB database, and use the average
chemical shifts from refDB to predict the chemical shifts for backbone H/C/N
atom chemical shifts for the query protein

positional arguments:
  input                 The query PDB file for which the shifts are calculated

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        Filename of generated output file. A file [shifts.csv]
                        is generated by default
  --strict STRICT, -s STRICT
                        Strict level of shift transfer: 0 - Strict, only the
                        exact matching residue shifts are transferred 1 -
                        Normal, transfer the shifts for residues that are the
                        same or have positive substitution scores (from
                        BLOSUM62) 2 - Permissive, transfer all shifts
                        regardless of the likeliness of substitution.
  --secondary, -2       If this flag is set, the output will be secondary
                        shifts (observed shifts-random coil shifts) instead of
                        observed shifts
  --test, -t            If this flag is set, the test BLAST database is used,
                        which means all sequences in the validation set and
                        test set will not be included in the BLAST search
                        database
  --exclude, -e         Exclude mode, another way of analyzing the performance
                        of SHIFTY++. When selecting sequences going to the
                        structure alignment, those completely identical
                        examples are excluded.
  --shifty, -y          SHIFTY mode, only the top hit from sequence alignment
                        is considered for shift transfer
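
The three `--strict` levels amount to a per-residue decision on whether a reference shift may be transferred. The sketch below illustrates that rule as the help text describes it; it is not the code in ucbshifty.py. The function names are hypothetical, and the score table is a small hand-copied subset of BLOSUM62 used only for illustration.

```python
# Illustrative subset of BLOSUM62 scores; a real run would use the full matrix.
BLOSUM62_SUBSET = {
    frozenset("IV"): 3,
    frozenset("DE"): 2,
    frozenset("KR"): 2,
    frozenset("AW"): -3,
}


def substitution_score(a, b):
    """Look up the score for two different residues (KeyError if not in the subset)."""
    return BLOSUM62_SUBSET[frozenset((a, b))]


def should_transfer(query_res, ref_res, strict):
    """Decide whether the reference residue's shifts may be transferred.

    strict = 0: only exact residue matches
    strict = 1: exact matches or positive substitution scores
    strict = 2: always transfer
    """
    if strict not in (0, 1, 2):
        raise ValueError("strict must be 0, 1 or 2")
    if strict == 2 or query_res == ref_res:
        return True
    if strict == 0:
        return False
    return substitution_score(query_res, ref_res) > 0
```

For example, an isoleucine aligned to a valine is rejected at level 0 but accepted at level 1, while an alanine aligned to a tryptophan is accepted only at level 2.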

Reproducibility

You can reproduce the results by preparing all the data and retraining the model on your own machine. See PROCEDURE.md in the train_model/ folder for a complete description of how to train the model.

Contributors

jerryjohnsonlee, gerardwx
