BigMHC

BigMHC is a deep learning tool for predicting MHC-I (neo)epitope presentation and immunogenicity.

See the article for more information:

All data used in this research can be freely downloaded here.

Installation

Get the BigMHC Source

git clone https://github.com/karchinlab/bigmhc.git

The repository is about 5 GB, so cloning generally takes about 3 minutes, depending on internet speed.

Environment and Dependencies

Execution is OS agnostic and does not require GPUs.

Training models with large batch sizes (e.g. 32768) requires significant GPU memory (about 94 GB total). Transfer learning requires minimal GPU memory and can be reasonably conducted on a CPU.

All methods were tested on Debian 11 using Linux 5.10.0-19-amd64, AMD EPYC 7443P, and four RTX 3090 GPUs.

Software dependencies are listed below (the versions used in the paper are in parentheses).

Required Dependencies

Optional Dependencies

  • cuda (11.7)
    • Required for GPU usage
  • magma (magma-cuda117 version 2.6.1)
    • Recommended for GPU usage

Jupyter Notebook Dependencies

Usage

There are two executable Python scripts in src: predict.py and train.py.

  • predict.py is used for making predictions using BigMHC EL and BigMHC IM
  • train.py allows you to train or retrain (transfer learning) BigMHC on new data

Both scripts, which can be run from any directory, offer help text.

  • python predict.py --help
  • python train.py --help

Examples

From within the src directory, you can run the examples below:

python predict.py -i=../data/example1.csv -m=el -t=2 -d="cpu"
python predict.py -i=../data/example2.csv -m=el -a=HLA-A*02:02 -p=0 -c=0 -d="cpu"

Predictions will be written to example1.csv.prd and example2.csv.prd in the data folder. Execution takes a few seconds. Compare your output with example1.csv.cmp and example2.csv.cmp respectively.
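
The comparison can be scripted. The sketch below is one way to do it in Python, assuming both files are plain CSV with the same row and column order, and that tiny floating-point differences between machines are acceptable. The function name csv_files_match and the tolerance values are illustrative and are not part of BigMHC.

```python
import csv
import math


def csv_files_match(path_a, path_b, rel_tol=1e-4, abs_tol=1e-6):
    """Return True if two CSV files match cell by cell.

    Identical strings always match. Cells that differ as strings but both
    parse as floats are compared with a tolerance. Anything else is a mismatch.
    """
    with open(path_a, newline="") as file_a, open(path_b, newline="") as file_b:
        rows_a = list(csv.reader(file_a))
        rows_b = list(csv.reader(file_b))

    if len(rows_a) != len(rows_b):
        return False

    for row_a, row_b in zip(rows_a, rows_b):
        if len(row_a) != len(row_b):
            return False
        for cell_a, cell_b in zip(row_a, row_b):
            if cell_a == cell_b:
                continue
            try:
                close = math.isclose(
                    float(cell_a), float(cell_b), rel_tol=rel_tol, abs_tol=abs_tol
                )
            except ValueError:
                # At least one cell is not numeric and the strings differ.
                return False
            if not close:
                return False
    return True
```

For instance, calling csv_files_match("../data/example1.csv.prd", "../data/example1.csv.cmp") from the src directory should return True if your predictions agree with the reference output within the tolerance.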

Supported Alleles

BigMHC only supports MHC-I. In order to handle different MHC naming schemes, BigMHC will perform fuzzy string matching to find the nearest MHC by name. For example, HLA-A*02:01, A*02:01, HLAA0201, and A0201 are all considered valid and equivalent allele names. Additionally, synonymous substitutions and noncoding fields are handled, so HLA-A*02:01:01 should be mapped to HLA-A*02:01.

We do not validate allele names. BigMHC will make predictions even if given nonsense or MHC-II input, as it will find the nearest valid MHC name to the provided invalid allele name. The list of alleles used in our multiple sequence alignment, to which input is mapped, can be found in the pseudosequences data file.
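
To make the behavior described above concrete, here is a minimal Python sketch of name canonicalization followed by nearest-name lookup. It is not BigMHC's actual matching code: the regular expression, the use of difflib, the function names, and the short KNOWN_ALLELES list (a stand-in for the names in the pseudosequences data file) are all assumptions made for illustration.

```python
import difflib
import re

# Hypothetical stand-in for the allele names in the pseudosequences data file.
KNOWN_ALLELES = ["HLA-A*02:01", "HLA-A*02:02", "HLA-B*07:02", "HLA-C*07:01"]

# Optional "HLA" prefix and separators, a gene letter, then two allele fields.
# Any further colon-separated fields (synonymous, noncoding) are dropped.
_ALLELE_PATTERN = re.compile(r"(?:HLA)?-?([A-Z])\*?(\d{2}):?(\d{2,3})(?::\d+)*")


def canonicalize(name):
    """Rewrite common spellings as HLA-<gene>*<field1>:<field2> when possible."""
    cleaned = name.strip().upper()
    match = _ALLELE_PATTERN.fullmatch(cleaned)
    if match is None:
        return cleaned  # leave unrecognized input for the fuzzy lookup
    gene, field1, field2 = match.groups()
    return f"HLA-{gene}*{field1}:{field2}"


def nearest_allele(name, known=KNOWN_ALLELES):
    """Return the known allele name closest to the given name."""
    canonical = canonicalize(name)
    if canonical in known:
        return canonical
    # cutoff=0.0 means some allele is always returned, even for nonsense input.
    return difflib.get_close_matches(canonical, known, n=1, cutoff=0.0)[0]
```

With this sketch, HLA-A*02:01, A*02:01, HLAA0201, A0201 and HLA-A*02:01:01 all canonicalize to HLA-A*02:01, and a nonsense string still resolves to some entry of the list. That is why it is worth checking your allele names before trusting a prediction.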

Required Arguments

  • -i or --input input CSV file
    • Columns are zero-indexed
    • Must have a column of peptides
    • Can also have a column of MHC-I allele names
  • -m or --model BigMHC model to load
    • el or bigmhc_el to load BigMHC EL
    • im or bigmhc_im to load BigMHC IM
    • Can be a path to a BigMHC model directory
    • Optional for train.py (if a model dir is specified, then transfer learn)

Required Arguments for Training

  • -t or --tgtcol column index of target values
    • Elements in this column are considered ground truth values.
  • -o or --out output directory
    • Directory to save model parameters for each epoch
    • Optional for transfer learning (defaults to model arg)

Input Formatting Arguments

  • -a or --allele allele name or allele column
    • If allele is a column index, then a single MHC-I allele name must be present in each row
  • -p or --pepcol peptide column
    • Column index in the input CSV; each row must contain one peptide sequence
  • -c or --hdrcnt header count
    • Skip the first hdrcnt rows before consuming input

Output Arguments

  • -o or --out output file or directory
    • If using predict.py, save CSV data to this file
      • Defaults to input.prd
    • If using train.py, save the retrained BigMHC model to this directory
      • If transfer learning, defaults to the base model dir
  • -z or --saveatt boolean indicating whether to save attention values
    • Only available for predict.py
    • Use 1 for true and 0 for false
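
Once predict.py has written its CSV output, a common next step is ranking peptides by score. The Python sketch below assumes the output file has a header row. The score column name is not documented here, so it is passed in as a parameter; check the header of your own .prd file for the actual name. The function name top_predictions is illustrative.

```python
import csv


def top_predictions(prd_path, score_column, n=10):
    """Return the n rows with the highest value in score_column."""
    with open(prd_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if rows and score_column not in rows[0]:
        raise KeyError(f"column {score_column!r} not found in {prd_path}")
    # Scores are read as text, so convert to float before sorting.
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows[:n]
```

Each returned row is a dictionary keyed by the header names, so the other columns in the row stay attached to the score.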

Other Optional Arguments

  • -d or --devices devices on which to run BigMHC
    • Set to all to utilize all GPUs
    • To use a subset of available GPUs, provide a comma-separated list of GPU device indices
    • Set to cpu to run on CPU (not recommended for large datasets)
  • -v or --verbose toggle verbose printing
    • Use 1 for true and 0 for false
  • -j or --jobs Number of workers for parallel data loading
    • These workers are persistent throughout the script execution
  • -f or --prefetch Number of batches to prefetch per data loader worker
    • Increasing this number can help prevent GPUs waiting on the CPU, but increases memory usage
  • -b or --maxbat Maximum batch size
    • Turn this down if running out of memory
    • If using predict.py, defaults to a value that is estimated to fully occupy the device with the least memory
    • If using train.py, defaults to 32
  • -s or --pseudoseqs CSV file mapping MHC to one-hot encoding
  • -l or --lr AdamW optimizer learning rate
    • Only available for train.py
  • -e or --epochs number of epochs for transfer learning
    • Only available for train.py
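
As a sketch of how the training-related flags combine, the Python helper below assembles a train.py argument list. It enforces the one rule stated above: an output directory is required unless a base model is given for transfer learning. The helper name and its parameter names are assumptions for illustration; the flags are the ones documented in this section.

```python
def train_args(input_path, tgtcol, base_model=None, out_dir=None,
               lr=None, epochs=None, maxbat=None, devices=None):
    """Build a train.py argument list, omitting any option left as None."""
    if base_model is None and out_dir is None:
        # -o is only optional when transfer learning from an existing model.
        raise ValueError("out_dir is required when no base model is given")

    args = ["python", "train.py", f"-i={input_path}", f"-t={tgtcol}"]
    optional = {
        "-m": base_model,  # el, im, or a path to a BigMHC model directory
        "-o": out_dir,
        "-l": lr,
        "-e": epochs,
        "-b": maxbat,
        "-d": devices,
    }
    for flag, value in optional.items():
        if value is not None:
            args.append(f"{flag}={value}")
    return args
```

For example, train_args("training.csv", tgtcol=2, base_model="im", out_dir="retrained_im", epochs=20) describes a transfer-learning run that starts from BigMHC IM and writes to a separate directory, which leaves the original model parameters untouched. The file name, directory name and epoch count here are placeholders.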

Citation

@Article{Albert2023,
	author={Albert, Benjamin Alexander and Yang, Yunxiao and Shao, Xiaoshan M. and Singh, Dipika and Smith, Kellie N. and Anagnostou, Valsamo and Karchin, Rachel},
	title={Deep neural networks predict class I major histocompatibility complex epitope presentation and transfer learn neoepitope immunogenicity},
	journal={Nature Machine Intelligence},
	year={2023},
	month={Jul},
	day={20},
	issn={2522-5839},
	doi={10.1038/s42256-023-00694-6},
	url={https://doi.org/10.1038/s42256-023-00694-6}
}

License

See the LICENSE file

bigmhc's Issues

Choice of MHC allele

Hi,

thanks for developing such a great tool.

Since there are many possible MHC I alleles, should one run the prediction of every peptide for each possible MHC allele or do you have some tricks to constrain the list of possibilities or obtain an averaged result?

I am specifically referring to the --allele option for predictions.

Thanks in advance,

Miquel

bigMHC retraining

Hi,
I have a question about retraining (transfer learning) BigMHC on a new immunogenicity dataset. I am a beginner in programming. I split the new dataset into training and testing sets.
This is my command for retraining the model:
$ python train.py -i=training.csv -a=HLA-A*02:01 -t=2 -o=pathToDirectory
I don't want to optimize the batch size and number of epochs, so I am using the same (batch, epoch) pairs as in the model: (16,23), (16,23), (8,15), (64,62), (32,27), (32,31) and (64,54).
My question is: must I retrain the model once for each of the seven (batch, epoch) pairs, save the parameters, and then replace the original parameters in the models folder with them?
I would then test the model's performance with this command:
$ python predict.py -i=testing.csv -m=im -a=HLA-A*02:01 -t=2
Please guide me on this problem.

Question about the saved models

Hi,

There are several saved model checkpoints under the model/ directory. Were these models trained on the training data only (el_train.csv), or on both the training and test data (el_train.csv and el.csv)?

I ran your code to re-train the model using the training data for 50 epochs and evaluated on the test data. However, I could not obtain the same results as directly running the saved models you provided. I could only achieve AUPR of ~0.75.

model selection code for 7 models in paper?

Hi
In the paper you selected 7 models optimized by learning rate and batch size. However, the train function provided here seems to train a model with a single fixed learning rate and batch size.
Is there any code showing how the 7 models were created?

From the paper it seems that 7 batch sizes were used: 512, 1024, 2048, 4096, 8192, 16384 and 32768. But what about the learning rate? Did you use a grid, so that for each batch size you tested different learning rates and then picked the epoch and learning rate with the best performance on the validation set?

thanks
FKG

about the model frame

Hello, and thank you to your team for developing such a useful tool. Could I use your framework to train my own data? If so, how should I proceed? I am looking forward to receiving your assistance.

Questions about input files and results

Hello, thank you for providing this tool. I have two questions: what does 'tgt' mean in the input file 'example1.csv', and is there a recommended score threshold for judging binding?

about output results

Hello, thank you for providing this tool. I have some confusion regarding the interpretation of the output results. How can I understand the affinity and immunogenicity from the BigMHC_EL and BigMHC_IM results? Is there a reasonable cutoff value?

bits files in makeseqs

Hello,
Wonderful work on predicting peptide-MHC affinity.
I am trying to generate pseudosequences by following the Jupyter notebook, but the instructions do not show how the bits file was generated. Any suggestions would be much appreciated.
Thank you!

Allele option doesn't restrict input to actual human class I alleles

As far as I understand it from your paper, BigMHC supports predictions for all human class I alleles. However, it doesn't restrict the input provided to the -a option to only human class I HLA alleles. I was able to provide non-human alleles, human class II alleles, as well as nonsense words and a prediction result was returned. Is this intended or should this option be limited to only inputs that are actual class I human HLA alleles?

The Choice of BigMHC method, BA,EL or IM?

Hi,

Thank you for this great tool.

I'm using pvactools for neo-epitope screening. Binding affinity predictors, EL predictors (including BigMHC_EL) and immunogenicity predictors (including BigMHC_IM) are available.

I'm wondering whether it is recommended to use just one of the methods above, or to use BigMHC_IM to rank the candidate peptide-MHC complexes identified by a BA or EL predictor.

Thanks again
Danhua
