This is a class project for CS478: Intro to Machine Learning and CS418: Bioinformatics at BYU. It explores the learnability of SNP pathogenicity.
See also the associated paper.
Note: all the Python code expects to be run from the root of the repository. It will not find the data files if run from another directory.
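If you are unsure whether you are in the right directory, a small check along these lines may help. It is only a sketch: `missing_paths` is a hypothetical helper that is not part of this repository, and it assumes the preprocessed `BRC_full.csv` file sits directly in the repository root, which you should adjust if the file lives elsewhere.

```python
from pathlib import Path


def missing_paths(root, expected=("BRC_full.csv",)):
    """Return the expected relative paths that do not exist under root.

    The default file name and its location at the repository root are
    assumptions; pass your own tuple of paths if the layout differs.
    """
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # Check the directory the script was launched from.
    missing = missing_paths(Path.cwd())
    if missing:
        print("Not found in the current directory:", ", ".join(missing))
        print("Try running from the root of the repository.")
    else:
        print("Data files found.")
```

An empty list means every expected file was found relative to the directory you passed in.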
The data for this project was derived from the ClinVar database of gene variants. Additional features were added from AAindex and the UCSC Genome Browser.
Note: this preprocessing has already been done for the BRC family, so the data is ready to go in the BRC_full.csv file. You only need to run these steps if you want to regenerate the files, run them on more data, or run the neural network on a new protein family or gene, such as TTN.
- Run FindProteinFamilies.py to gather the genes from the BRC family into one file.
$ python FindProteinFamilies.py
- Run GatherConservationData.py to gather the conservation data for the gene family.
$ python GatherConservationData.py
- Run preprocessing.py to combine all the amino acid features and conservation data into one file.
$ python preprocessing.py
- Now the data is ready to run through the Neural Network!
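The three preprocessing steps above can also be chained so that a failure in one step stops the rest. The following is a minimal sketch, not part of the repository: `build_commands` and `run_steps` are hypothetical helpers, and it assumes the three scripts take no arguments and are launched from the repository root, as in the commands above.

```python
import subprocess
import sys

# Script names come from the steps above; running them in this order
# with no arguments mirrors the commands shown there.
STEPS = (
    "FindProteinFamilies.py",
    "GatherConservationData.py",
    "preprocessing.py",
)


def build_commands(steps=STEPS, python=sys.executable):
    """Return one [interpreter, script] command per preprocessing step."""
    return [[python, step] for step in steps]


def run_steps(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError when a script exits with a non-zero status.
    """
    for cmd in commands:
        runner(cmd, check=True)
```

From the repository root, calling `run_steps(build_commands())` would run the scripts in sequence. The `runner` parameter exists only so the ordering can be inspected without executing anything.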
- Create a Python virtual environment in the model/ directory
$ virtualenv -p python3 model/venv
- Enter the environment
$ . ./model/venv/bin/activate
- Install the requirements
$ pip install -r model/requirements.txt
- Run the neural network
$ python model/genevariants.py
- When finished, leave the venv
$ deactivate