mhcseqnet's Introduction

What is MHCSeqNet?

MHCSeqNet is an MHC ligand prediction Python package developed by the Computational Molecular Biology Group at Chulalongkorn University, Bangkok, Thailand. MHCSeqNet uses recurrent neural networks to process the amino acid sequences of the input ligand and the MHC allele, and can therefore be extended to handle peptides of any length and any MHC allele with a known amino acid sequence.

The current release was trained using only data from MHC class I and supports peptides ranging from 8 to 15 amino acids in length, but the model can be re-trained to support more alleles and wider ranges of peptide length.

Please see our preprint on bioRxiv for more information.

Models

MHCSeqNet offers two versions of its prediction model:

  1. One-hot model: This model uses data from each MHC allele to train a separate predictor for that allele. The list of supported MHC alleles for the current release can be found here

  2. Sequence-based model: This model uses data from all MHC alleles to train a single predictor that can handle any MHC allele whose amino acid sequence is known. For more information on how our model learns MHC allele information in the form of amino acid sequence, please see our preprint on bioRxiv. The list of MHC alleles used to train this model can be found here

How to install?

MHCSeqNet requires Python 3 (>= 3.4) and the following Python packages:

numpy (>= 1.14.3)
Keras (>= 2.2.0)
tensorflow (>= 1.6.0)
scipy (>= 1.1.0)
scikit-learn (>= 0.19.1)

If your system has both Python 2 and Python 3, please make sure that Python 3 is being used when following these instructions. Note that we cannot guarantee whether MHCSeqNet will work with older versions of these packages.
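The version requirements above can be checked before installing. The following is a minimal sketch, not part of MHCSeqNet: the helper names are hypothetical, and it assumes Python 3.8 or newer, where `importlib.metadata` is available. Only the leading numeric part of each version component is compared, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata  # assumption: Python 3.8+

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "numpy": "1.14.3",
    "Keras": "2.2.0",
    "tensorflow": "1.6.0",
    "scipy": "1.1.0",
    "scikit-learn": "0.19.1",
}


def version_tuple(version, width=3):
    """Turn '1.14.3' into (1, 14, 3), padding short versions with zeros."""
    numbers = []
    for part in version.split(".")[:width]:
        match = re.match(r"\d+", part)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers + [0] * (width - len(numbers)))


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Map each package name to 'ok', 'not installed', or 'older than X'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = "older than " + minimum
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(name, status)
```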

To install MHCSeqNet:

  1. Clone this repository
git clone https://github.com/cmbcu/MHCSeqNet

Or you may find other methods for cloning a GitHub repository here

  2. Install the latest versions of the 'pip' and 'setuptools' packages for Python 3 if your system does not already have them
python -m ensurepip --default-pip
pip install setuptools

If you have trouble with this step, more information can be found here

  3. Run Setup.py inside the MHCSeqNet directory to install MHCSeqNet.
cd MHCSeqNet
python Setup.py install

How to use MHCSeqNet?

MHCSeqNet can be launched through the MHCSeqNet.py script or by editing the sample scripts explained below.

MHCSeqNet.py

Instructions for using the MHCSeqNet.py script can be displayed by running:

python MHCSeqNet.py -h

usage: python MHCSeqNet.py [options] peptide_file allele_file output_file
         'peptide_file' and 'allele_file' should each contain only one column, without header row
  options:
    -p, --path                             REQUIRED: Specify the path to pre-trained model directory
                                           This should be either the 'one_hot_model' or the 'sequence_model'
                                            directory located in 'PATH/PretrainedModels/' where PATH is where
                                            MHCSeqNet was downloaded to
    -m, --model        [onehot sequence]   REQUIRED: Specify whether the one-hot model or sequence-based model will be used
    -i, --input-mode   [paired complete]   REQUIRED: Specify whether the prediction should be made for each pair of peptide
                                            and allele on the same row of each input file [paired] or for all
                                            combinations of peptides and alleles [complete]
    -h, --help                             Print this message

Sample peptide and MHC allele files can be found in the 'Sample' directory.
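The difference between the two input modes can be sketched in a few lines of Python. This is an illustration only, not MHCSeqNet's own code: the helper names are hypothetical, the ordering of the 'complete' combinations is an assumption, and the command builder assumes each option takes its value as a separate, space-separated argument as the usage text suggests.

```python
from itertools import product


def build_pairs(peptides, alleles, input_mode):
    """Return the (peptide, allele) pairs that a given input mode would score."""
    if input_mode == "paired":
        # Row i of the peptide file goes with row i of the allele file.
        if len(peptides) != len(alleles):
            raise ValueError("paired mode needs the same number of rows in both files")
        return list(zip(peptides, alleles))
    if input_mode == "complete":
        # Every peptide is combined with every allele.
        return list(product(peptides, alleles))
    raise ValueError("input_mode must be 'paired' or 'complete'")


def write_column(path, values):
    """Write one value per line, without a header row."""
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")


def build_command(model_dir, model, input_mode, peptide_file, allele_file, output_file):
    """Assemble the argument list described by the usage message."""
    return ["python", "MHCSeqNet.py",
            "-p", model_dir, "-m", model, "-i", input_mode,
            peptide_file, allele_file, output_file]
```

With two peptides and two alleles, 'paired' yields two pairs and 'complete' yields four. The list returned by `build_command` could be passed to `subprocess.run` from the directory that contains MHCSeqNet.py.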

Sample scripts

Sample scripts for running MHCSeqNet in either the 'one-hot' or the 'sequence-based' mode can be found in the 'Sample' directory. Continuing from the installation process, you may test the installation of MHCSeqNet with the following commands:

python Sample/OnehotModelPredictionExample.py
python Sample/SequenceModelPredictionExample.py

To run the sample scripts from a different location on your system, please edit the path to the pretrained model in the respective script.

bindingOnehotPredictor.load_model('./PretrainedModels/one_hot_model/')
bindingSequencePredictor.load_model('./PretrainedModels/sequence_model/')
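One way to make these paths independent of the current working directory is to build them from the location where MHCSeqNet was downloaded. A minimal sketch, assuming a hypothetical helper `pretrained_model_dir` and an `install_root` value that you supply yourself:

```python
import os


def pretrained_model_dir(install_root, model_name):
    """Return an absolute path to a model directory, ending with a separator.

    install_root is where MHCSeqNet was downloaded to (supplied by the user);
    model_name is 'one_hot_model' or 'sequence_model'.
    """
    root = os.path.abspath(os.path.expanduser(install_root))
    return os.path.join(root, "PretrainedModels", model_name) + os.sep
```

The returned string can then be passed to `load_model` in place of the relative path shown above.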

To replace the sample peptides and MHC alleles with your own lists, please edit the 'sample_data' array accordingly.

sample_data = np.array([['TYIGSLPGK', 'HLA-B*58:01'],
                        ['TYIHALDNGLF', 'HLA-A*24:02'],
                        ['AAAWICGEF', 'HLA-B*15:01'],
                        ['TWLTYHGAI', 'HLA-A*30:02'],
                        ['TWLVNSAAHLF', 'HLA-A*24:02']])

To adjust the behavior of how prediction results are output (e.g. print results to file rather than on the screen), please edit the following line:

print(result)
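For example, the print call could be replaced with a call that writes a tab-separated file. This is a sketch under assumptions: `write_predictions` is a hypothetical helper, and `result` is assumed to be a flat sequence holding one probability per row of 'sample_data' (flatten it first if the predictor returns a 2-D array).

```python
import csv


def write_predictions(path, sample_data, result):
    """Write peptide, allele, and predicted probability as tab-separated rows."""
    rows = [list(row) for row in sample_data]
    scores = [float(score) for score in result]
    if len(rows) != len(scores):
        raise ValueError("expected one prediction per (peptide, allele) row")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["peptide", "allele", "probability"])
        for (peptide, allele), score in zip(rows, scores):
            writer.writerow([peptide, allele, score])
```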

Input format

Peptide: The current release supports peptides of length 8 - 15 and does not accept ambiguous amino acids.

MHC allele: For alleles included in the training set (i.e. supported alleles listed in the models section), the model requires the 'HLA-A*XX:YY' format.

To add new MHC alleles to the sequence-based model, the names and amino acid sequences of the new alleles must first be added to the AlleleInformation.txt and supported_alleles.txt files in the sequence-based model's directory.
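The peptide and allele constraints above can be checked before running a prediction. The sketch below is not part of MHCSeqNet. It assumes that "ambiguous amino acids" means any letter outside the 20 standard one-letter codes, and the regular expression is only one reading of the 'HLA-A*XX:YY' format; a name that passes may still be missing from the supported allele list.

```python
import re

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH, MAX_LENGTH = 8, 15

# Assumed pattern: 'HLA-', one gene letter, '*', then two colon-separated numeric fields.
ALLELE_PATTERN = re.compile(r"^HLA-[A-Z]\*\d{2,3}:\d{2,3}$")


def is_valid_peptide(peptide):
    """True for 8 to 15 residues drawn only from the 20 standard amino acids."""
    return (MIN_LENGTH <= len(peptide) <= MAX_LENGTH
            and set(peptide) <= STANDARD_AMINO_ACIDS)


def is_valid_allele(allele):
    """True when the name matches the assumed 'HLA-A*XX:YY' pattern."""
    return ALLELE_PATTERN.match(allele) is not None
```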

Output

MHCSeqNet outputs a binding probability ranging from 0.0 to 1.0, where 0.0 indicates an unlikely ligand and 1.0 indicates a likely ligand.
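A common follow-up step is to rank the candidates by this probability. The sketch below is illustrative: `rank_ligands` is a hypothetical helper, and the default cutoff of 0.5 is an arbitrary choice, not a threshold recommended by MHCSeqNet.

```python
def rank_ligands(pairs, probabilities, threshold=0.5):
    """Return (peptide, allele, probability) tuples at or above the threshold,
    sorted from most to least likely ligand."""
    scored = [(peptide, allele, float(p))
              for (peptide, allele), p in zip(pairs, probabilities)]
    kept = [item for item in scored if item[2] >= threshold]
    return sorted(kept, key=lambda item: item[2], reverse=True)
```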

How to re-train MHCSeqNet?

This feature and its instructions will be added in the future.

mhcseqnet's People

Contributors

cmbcu, poomarinph

mhcseqnet's Issues

`count_amino_acid_pair_frequency` function of AminoAcidRepresentationModel class does not work properly

I don't think I fully understand what this function is doing without a detailed docstring description.

I tested the sequence "ACDEFH". Based on the function name count_amino_acid_pair_frequency, I expected the calculation to be done on adjacent pairs of amino acids, such as ('A', 'C') and ('C', 'D'), in a window_size = 1 scenario. Instead, I got this list of (line[i], line[j]) pairs:

('^', 'A')
('A', '^')
('A', 'C')
('C', 'A')
('C', 'D')
('D', 'C')
('D', 'E')
('E', 'D')
('E', 'F')
('F', 'E')
('F', 'H')
('H', 'F')

Question 1: Why would each pair get counted twice?

Then I tested window_size = 2 (the default) with the same sequence "ACDEFH" and got:

('^', 'A')
('^', 'C')
('A', '^')
('A', 'C')
('A', 'D')
('C', '^')
('C', 'A')
('C', 'D')
('C', 'E')
('D', 'A')
('D', 'C')
('D', 'E')
('D', 'F')
('E', 'C')
('E', 'D')
('E', 'F')
('E', 'H')
('F', 'D')
('F', 'E')
('F', 'H')
('H', 'E')
('H', 'F')

In addition to Question 1, I wasn't sure why ('A', 'D') can be a pair that skips 'C' in the middle, or why 'D' and 'F' can be a pair that skips 'E'. So my Question 2 is: why can proteinA and proteinB form a pair when there are other amino acids between them? I understand that the output changes with the window_size parameter. But the output of this function is eventually used to generate data in the _generate_data function, so I doubt that the program generates what you are hoping for when additional amino acids sit between the two amino acids of a pair.
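Both listings above are consistent with a symmetric context window: each position is paired with every other position at most window_size steps away, in both directions, after a '^' start token is prepended. The sketch below reproduces the two listings. It is reconstructed from the reported output only, not taken from the MHCSeqNet source, and `window_pairs` is a hypothetical name.

```python
def window_pairs(sequence, window_size=2, start_token="^"):
    """List (center, neighbor) pairs within window_size positions on either side."""
    line = start_token + sequence
    pairs = []
    for i in range(len(line)):
        first = max(0, i - window_size)
        last = min(len(line), i + window_size + 1)
        for j in range(first, last):
            if j != i:
                pairs.append((line[i], line[j]))
    return pairs
```

Under this reading, each unordered pair appears twice because it is emitted once from each endpoint, and a window of 2 pairs residues that have one residue between them.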

Training & Test Examples

Do you have example training and test data, or a description of what my own training data should look like?
