shen-lab / gcwgan

Guided Conditional Wasserstein GAN for De Novo Protein Design

Home Page: https://doi.org/10.1021/acs.jcim.0c00593

License: GNU General Public License v3.0

Roff 95.63% Python 4.28% Shell 0.07% Perl 0.02%
generative-adversarial-networks generative-model protein protein-design protein-sequence protein-structure

gcwgan's Introduction

DeNovoFoldDesign

Motivation: With data on protein sequences and structures accumulating quickly, this study addresses the following question: to what extent could current data alone reveal deep insights into the sequence-structure relationship, such that new sequences can be designed accordingly for novel structure folds?

Results: We have developed novel deep generative models, constructed a low-dimensional and generalizable representation of fold space, exploited sequence data with and without paired structures, and developed an ultra-fast fold predictor as an oracle providing feedback. The resulting semi-supervised gcWGAN is assessed with the oracle over 100 novel folds not in the training set and found to generate more yields and cover 3.6 times more target folds compared to a competing data-driven method (cVAE). Assessed with a structure predictor over representative novel folds (including one not even part of the basis folds), gcWGAN designs are found to have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. gcWGAN explores uncharted sequence space to design proteins by learning from current sequence-structure data. The ultra-fast data-driven model can be a powerful addition to principle-driven design methods through generating seed designs or tailoring sequence space.


Pre-requisite

* Environments:

To build the environments for this project, go to the Environments folder and run:

conda env create -f tensorflow_training.yml
conda env create -f DeepDesign_acc.yml

For the oracle (modified DeepSF), add the following function to the file <path where keras was installed>/keras/activations.py:

def leakyrelu(x, alpha=0.1, max_value=None):
    return K.relu(x, alpha=alpha, max_value=max_value)
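
For readers unfamiliar with this activation, the plain-Python sketch below illustrates what the added function is expected to compute for a single number. It is an illustration only: `leakyrelu_scalar` is a hypothetical name that does not exist in Keras or in this repository, and the real function operates on backend tensors through K.relu.

```python
def leakyrelu_scalar(x, alpha=0.1, max_value=None):
    """Plain-Python illustration of a leaky ReLU applied to one number."""
    # Positive inputs pass through unchanged; negative inputs are scaled by alpha.
    y = x if x > 0 else alpha * x
    # Optional upper clip, mirroring the max_value argument of the Keras function.
    if max_value is not None:
        y = min(y, max_value)
    return y


# Illustrative inputs: a negative value, a positive value, and a clipped value.
for value in (-2.0, 2.0, 5.0):
    print(value, leakyrelu_scalar(value, max_value=3.0))
```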

* Backend of Keras:

In this project we used two Keras backends, Theano and TensorFlow. The backend can be set in the Keras configuration file ~/.keras/keras.json.

  • When training gcWGAN, set the backend to be tensorflow as follows:
{
    "epsilon": 1e-07,
    "floatx": "float32",
    "image_dim_ordering":"tf",
    "image_data_format": "channels_last",
    "backend": "tensorflow"
}
  • Otherwise (training cWGAN, pretraining and generating sequences), set the backend to be theano as follows:
{
    "epsilon": 1e-07,
    "floatx": "float32",
    "image_dim_ordering":"tf",
    "image_data_format": "channels_last",
    "backend": "theano"
}
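
Because the two settings above differ only in the "backend" field, the switch can be scripted. The following is a minimal sketch and not part of the repository: `set_keras_backend` is a hypothetical helper, and the configuration path is passed in explicitly rather than assumed.

```python
import json
from pathlib import Path


def set_keras_backend(config_path, backend):
    """Rewrite the "backend" field of a keras.json file, keeping other keys."""
    if backend not in ("tensorflow", "theano"):
        raise ValueError("backend must be 'tensorflow' or 'theano'")
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text()) if path.exists() else {}
    config["backend"] = backend
    # Create the directory if needed, then write the updated settings back.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

For example, calling set_keras_backend(Path.home() / ".keras" / "keras.json", "theano") before generating sequences, and the same call with "tensorflow" before training gcWGAN, would follow the two cases listed above.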

* Check Points:

  • To train the model (for cWGAN and gcWGAN): go directly to the cWGAN or gcWGAN folder and follow the instructions.
  • To apply our model for evaluation or sequence generation (for Model_Test and Model_Evaluation): go to the Checkpoints folder and download the related checkpoints into the correct path according to the instructions.

Our checkpoints were obtained after 100 epochs of training. If you have already downloaded our checkpoints but want to retrain the model with the same hyper-parameters, the downloaded ones may be replaced if the training process reaches the 100th epoch.


Table of contents:

  • Environments: Contains the *.yml files with which you can build the required environments.
  • Data: Contains the original data, processed data and related processing scripts.
  • Oracle: Contains the scripts for the sequence features and for applying the oracles.
  • cWGAN: Contains the scripts for cWGAN model training and validation (hyper-parameter tuning).
  • gcWGAN: Contains the scripts for gcWGAN model training.
  • Model_Evaluation: Contains the scripts for model performance evaluation.
  • Model_Apply: Contains the scripts to apply the trained model.
  • Generated_Results: Contains the sequence samples generated by our model for the evaluation part (except the yield ratio part, which can be too large to upload) and the selected structure predictions from Rosetta based on gcWGAN.

Model Application

In this part you can apply our models to generate protein sequences for a given protein fold (*.pdb file). With the scripts you can represent the given fold as a 20-dimensional vector and send it to the generator for sequence generation. Go to the Model_Apply folder for more details.

Some examples of the generated sequences (10 sequences based on gcWGAN that pass the oracle):

>1
MIAPDQTIEKYVKFMAPVFTTTEYLKIVEMEEKGITTIAHGPVIHTARNPYAEVRLVSVTHELLIELQASGFLNISKTICLFETGIDENKEVLIDKDDYKEEPLLVDLFLEMEGPMDGQEIMTKLVRVPVMGQSLKPYAVKKAGVIKSAKHVG
>2
PCYALTVEAVENLLQAPAVRTLQKDEGLTPRLQPGIAAYASFIAGGAGCGLTRGSSDNMAKALIQEIEKTLRAVELTPATVQILVNNNEVKLPEKEKPNAIAKGILTVNLISKMDEFTKLVLVGENYTAILIDHIAKHKVGPV
>3
MCYDIAQSYLNFMMINGTVLIQTATRTLCPAVHSACRYDYIKVTAAKGNIVTDIGLMYFVRNMELVGPLMTATVAISKSIYTVQKATKETVNEMRTLQVAGTRTMFCRIYHVDMTKMMMQTGISIVGEKKPTRHDAEITYDQLAGHLVPLAHLKKL
>4
CTKAQRGVHKIYEVEKNYMPNRTLGDPNSLRIDSIGIRPVNERKDNTRYVAKKAKAILAKKDIMYCLPINIDVVKVTSTLDNYLDGDPYSKRPRFDDNLIKAVIPTDVALKPSPRYDVQAGRETPPAYTAVVQRFFSVKLNRL
>5
CPNVYQKLLYSMTEGPMDIGPVEVGQLLAVIPSAIGKVVSEITTSVHPAAPFEEAARVTAMAQRAALQYSTQTYLVGKESIALMYGKYRALHQDLARMVLADGQTADVQEVVPIIADIQRMHPAGQVAPRLIESGVVTASVLMTAA
>6
LLHGKLEVFHKCVAKADEASGLTFFHCGCSAYVTSEAAKGRYRPRACSTVHYFEKGATIPGLQYTNMYENAMVCTSKIRIYLEAMNMAPNVPLHRAAKYDNVSAALTANNNKVALIAEYYVTALLEGEVTQHLEEYKKNPPPELYEEIC
>7
MNKINIKYCPFNFNKVFRKEAFITQMAGENMAVLKELSEQIDHCSCFHKNTARQLLHRAEDGPVTEVETLLELRAAMICCFRRRAPRLVLGSSMSTTVITKCIAICTGQPYPGNGPPTTLGQPACSGVEVINNQAAIVIQTVEQRFILMTPGK
>8
CTVTAVQEFTENYGGLPLYVTRNQTLAPADKRLTPRYAGNFPEGAEVPAPNLAQTSPGVTYGKNIGRYLKNGLPDVAICTSPNLNLSGAYPDIVKYNYQQPEVFIRQYHPGNEMDVVKALEQFSSELLPGKTMSIVVNSYNNLADK
>9
CETTIDIEASVISQVIAVIVALTPIHKYAHASSKALASGASDVNVGPKLVAYIGKIAYSDPPIDLIPPVKVVVALLAPELAGVTAADYISYNEGKPATGESAGNAAFADGTTTIAPQRTIYEGEHKARINIITIADGAPLGSHEIP
>10
PEPDLVLTCTNLSFSAMVSCLRETSAFAGVEYAYNGIHPAGSCCLAAMKKGFFPHTEGMNALVIEPTPPVPCAPTKDLVQNKIQKAKLLPPAATTADEYSETLGQEDFLKLLTNPKITEKKKSPTTLILVTVNSELMISPVYFTGPLMKELLYHCNGEN
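
The sequences above are listed in FASTA layout (a ">" header line followed by the sequence). A minimal sketch for loading such output is given below, assuming nothing beyond that layout. `read_fasta` is a hypothetical helper, not a script from the Model_Apply folder, and the two toy records are truncated placeholders rather than full designs.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input: truncated placeholders, not complete generated sequences.
example = ">1\nMIAPDQ\n>2\nPCYALT\n"
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```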

Training Process:

In the cWGAN and gcWGAN folders there are scripts for training our two models. For cWGAN there are also scripts for validation (hyper-parameter tuning), and for gcWGAN there are also scripts for the warm start. Go to the cWGAN folder or gcWGAN folder for more details.

Some examples of the generated sequences during the training process:

fold a.39: vvaitfdnvhfpcshapltkaltvkklqvsannvsllvfddakmtkkidiekaikgfymmknnpqaqleiierftpttrgkpvikpiasftltspeilgkegykk!!!!!!!!!!!!!!!!!!!itkmlidavks!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.78: leemskvgntpaltyreardvavigifnngkqmksrddvtdeaddyqceidpisnllelgallpplhvaetkmllyykneakmhlfegag!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.240: tlaedippklpleveqcneiivdaqnkryvavgealllitcpmlqnnsmsttcgyrfeakskdgvicespeeglqndtthyachkraaavqiptekkttvyrlhacttklegcaeadnrvladvgldgivqravcdivttfsaevnp!!!!!!!!!!!!!
fold d.227: sckpglplvcagkkstyleklltgylvyslladyispkaleeavisekkpniampafatmpslvaddvtaliakkglqnaakcpndhmeiyeaeedpaiigqgynkhqgvgcnivvmagaipdeqkvenlrsliei!!!!!!!!!!!!!!!!!!!!!!!!
fold d.301: mtakstvqlpaeykgqniaeilnnvafnlaaivysattivayramacfpcgeknykeilgkvltlfidkhpiqnnr!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.223: mqtyeeavtlgltneqqtgknvtpiniaeekllvtnglvcqapalpvneevliklsentdnikpllciigkkseaispcsfraeeafdrsadymankatimcrkgnyaiilhsdgeellaihqtsgviirlghvpgkknrymppgaliplcngp!!!!!!
fold a.216: eelakrmiqrapdveligknkiatelkrlcllirgqtaanimnvillcataisvipkkskpasqyeetvnpadlakeiilqekkeaftriltteylvtsllkmypvhkvpkp!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold b.60: qpifvtykrlnrlallkshplhkdpkyltavlvmeldpsslpvavqpqrvvtiqsccpiiepsappeecdiqapnklkallendkptsqn!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold c.9: ayfereelilltpthggnepktldiptnlpakilgtplrkvklasqkellgeahpnnavstlideayylgdeqrevvvlteqekkagpidithyvngtegsckkpnisdsptphakafkqilkemqariqhhkelittalerlkn!!!!!!!!!!!!!!!
fold a.180: mpeecvctglepgevrrqngvipllnqgfhavltpagktylccttatknqvivhmfcqtaaeniyaeitvsylrtaatstylefmkhccqnvssihygiymslmdllkeyvveklv!e!!!!!!!!!!!!!!!!!!!iaeqipearkyaaalvg!!!!!!
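
The samples above appear to be fixed-width strings in which "!" fills the positions after the sequence. This README does not state what "!" means, so the sketch below rests on an assumption: everything from the first "!" onward is treated as padding and dropped. `parse_training_sample` is a hypothetical helper for illustration only.

```python
def parse_training_sample(line, pad="!"):
    """Split a 'fold <name>: <sequence>' line into (fold, trimmed sequence)."""
    label, _, raw = line.partition(":")
    # Drop the leading word "fold" to keep only the fold identifier.
    fold = label.replace("fold", "", 1).strip()
    # Assumption: the first pad symbol marks the end of the sequence.
    sequence = raw.strip().split(pad, 1)[0]
    return fold, sequence


# Toy line in the same layout as the samples above (shortened placeholder).
fold, sequence = parse_training_sample("fold a.39: vvaitfdnvh!!!!!!")
print(fold, sequence, len(sequence))
```

Under this assumption, residues that appear after an interior run of "!" (as in the a.39 and a.180 samples) are discarded; a different convention would need a different rule.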

Evaluate Model Performance:

This part contains the scripts we used to evaluate the performance of our model. We also generated several sequences with the previous state-of-the-art model, cVAE, and applied our evaluation method to them for comparison. Model evaluation consists of three parts: model accuracy, sequence generating rate, and sequence diversity and novelty. For model accuracy, we calculated the yield ratio for all the training, validation and test folds. Go to the Model_Evaluation folder for more details.
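
The exact yield ratio formula is not given in this README. As a rough sketch, assuming it is the fraction of generated sequences that the oracle accepts for the target fold, it could be computed as follows. `yield_ratio` and `toy_oracle` are hypothetical names; the real oracle is the modified DeepSF model, not the stand-in rule used here.

```python
def yield_ratio(sequences, passes_oracle):
    """Fraction of sequences for which passes_oracle(sequence) is true."""
    if not sequences:
        return 0.0
    passed = sum(1 for sequence in sequences if passes_oracle(sequence))
    return passed / len(sequences)


def toy_oracle(sequence):
    """Stand-in for the real oracle: accepts sequences starting with 'M'."""
    return sequence.startswith("M")


# Toy sequences (placeholders), three of which satisfy the stand-in rule.
generated = ["MIAPDQ", "PCYALT", "MCYDIA", "MNKINI"]
print(yield_ratio(generated, toy_oracle))
```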


Citation:

@article{gcWGAN,
author = {Karimi, Mostafa and Zhu, Shaowen and Cao, Yue and Shen, Yang},
title = {De Novo Protein Design for Novel Folds Using Guided Conditional Wasserstein Generative Adversarial Networks},
journal = {Journal of Chemical Information and Modeling},
volume = {60},
number = {12},
pages = {5667-5681},
year = {2020},
doi = {10.1021/acs.jcim.0c00593},
note ={PMID: 32945673},
URL = {https://doi.org/10.1021/acs.jcim.0c00593},
eprint = {https://doi.org/10.1021/acs.jcim.0c00593}
}

Contacts:

Yang Shen: [email protected]

Mostafa Karimi: [email protected]

Shaowen Zhu: [email protected]

Yue Cao: [email protected]

gcwgan's People

Contributors

shaowen1994, shen-lab

gcwgan's Issues

Errors about Tensorflow or theano?

Dear developers,

When the backend is Tensorflow:

CUDA_VISIBLE_DEVICES=0 python Nov_SeqGenerator_Filter_r1py3.py ./DeepSF_modified/DeepSF_model_weight_more_folds/model-train-weight-DLS2F.h5 ALL 100 test_generated_file_index test_job_index M_DeepSF

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/type.py", line 270, in dtype_specs
}[self.dtype]
KeyError: 'object'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/basic.py", line 250, in constant_or_value
TensorType(dtype=x_.dtype, broadcastable=bcastable),
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/type.py", line 51, in __init__
self.dtype_specs() # error checking is done there
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/type.py", line 273, in dtype_specs
% (self.__class__.__name__, self.dtype))
TypeError: Unsupported dtype for TensorType: object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/basic.py", line 206, in as_tensor_variable
return constant(x, name=name, ndim=ndim)
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/basic.py", line 264, in constant
dtype=dtype)
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/basic.py", line 259, in constant_or_value
raise TypeError("Could not convert %s to TensorType" % x, type(x))
TypeError: ('Could not convert ? to TensorType', <class 'tensorflow.python.framework.tensor_shape.Dimension'>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "Nov_SeqGenerator_Filter_r1py3.py", line 142, in <module>
M_DLS2F_CNN = model_from_json(M_loaded_model_json, custom_objects={'K_max_pooling1d': Oracle_Filters.K_max_pooling1d})
File "/usr/local/lib/python3.6/dist-packages/keras/models.py", line 213, in model_from_json
return layer_from_config(config, custom_objects=custom_objects)
File "/usr/local/lib/python3.6/dist-packages/keras/utils/layer_utils.py", line 40, in layer_from_config
custom_objects=custom_objects)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 2582, in from_config
process_layer(layer_data)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 2577, in process_layer
layer(input_tensors[0])
File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 572, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 635, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 166, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/root/projects/newFoldDesign/toolkits/gcWGAN/Oracle_py3/Oracle_Filters.py", line 296, in call
output = x[T.arange(x.shape[0]).dimshuffle(0, "x", "x"),
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/basic.py", line 5396, in arange
start, stop, step = map(as_tensor_variable, (start, stop, step))
File "/usr/local/lib/python3.6/dist-packages/theano/tensor/basic.py", line 212, in as_tensor_variable
raise AsTensorError("Cannot convert %s to TensorType" % str_x, type(x))
theano.tensor.var.AsTensorError: ('Cannot convert ? to TensorType', <class 'tensorflow.python.framework.tensor_shape.Dimension'>)

How could I handle this issue? Should I modify the theano type or T.arange?

Thank you very much.
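
The last frame of this traceback shows theano's T.arange in Oracle_Filters.py receiving a TensorFlow Dimension object, which suggests that the Theano-based oracle code was run under the TensorFlow backend; the README asks for the theano backend when generating sequences. A pre-flight check along the following lines may help catch such a mismatch early. It is a sketch under assumptions: `effective_keras_backend` is a hypothetical helper, and it relies on Keras letting the KERAS_BACKEND environment variable override keras.json.

```python
import json
import os


def effective_keras_backend(config_path, environ=None):
    """Return the backend name Keras would likely pick, or None if unknown."""
    environ = os.environ if environ is None else environ
    backend = None
    if os.path.exists(config_path):
        with open(config_path) as handle:
            backend = json.load(handle).get("backend")
    # The environment variable, when set, takes precedence over the file.
    return environ.get("KERAS_BACKEND", backend)
```

A generation script could call this with the path to keras.json and stop with a clear message if the result is not "theano".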

ValueError: Invalid activation function: leakyrelu

Hi developers,

An "Invalid activation function" error occurs:

python Nov_SeqGenerator_Filter.py ./DeepSF_modified/DeepSF_model_weight_more_folds/model-train-weight-DLS2F.h5 ALL 100 test_generated_file_index test_job_index M_DeepSF

Using TensorFlow backend.
./DeepSF_modified/DeepSF_model_weight_more_folds/model-train-weight-DLS2F.h5
Modified: True
Original: False
Uppercase local vars:
BATCH_SIZE: 200
DATA_DIR: ../Data/Datasets/Final_Data/
DIM: 512
GEN_NUM: 100
JOB_INDEX: test_job_index
K: <module 'keras.backend' from '/usr/local/lib/python3.6/dist-packages/keras/backend/__init__.py'>
KIND: ALL
LAMBDA: 10
MAX_LEN: 160
MAX_N_EXAMPLES: 50000
MIN_LEN: 60
SEQ_LEN: 160
THRESHOLD: 0.21
TOP_NUM: 10
loading dataset...
loaded 20125 lines in dataset
Data loading successfully!
WARNING: Logging before flag parsing goes to stderr.
W0214 17:46:15.818645 140365499090752 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:321: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

Traceback (most recent call last):
File "Nov_SeqGenerator_Filter_r1py3.py", line 143, in <module>
M_DLS2F_CNN = model_from_json(M_loaded_model_json, custom_objects={'K_max_pooling1d': Oracle_Filters.K_max_pooling1d})
File "/usr/local/lib/python3.6/dist-packages/keras/models.py", line 213, in model_from_json
return layer_from_config(config, custom_objects=custom_objects)
File "/usr/local/lib/python3.6/dist-packages/keras/utils/layer_utils.py", line 40, in layer_from_config
custom_objects=custom_objects)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 2582, in from_config
process_layer(layer_data)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 2560, in process_layer
custom_objects=custom_objects)
File "/usr/local/lib/python3.6/dist-packages/keras/utils/layer_utils.py", line 42, in layer_from_config
return layer_class.from_config(config['config'])
File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 1025, in from_config
return cls(**config)
File "/usr/local/lib/python3.6/dist-packages/keras/layers/convolutional.py", line 106, in __init__
self.activation = activations.get(activation)
File "/usr/local/lib/python3.6/dist-packages/keras/activations.py", line 55, in get
return get_from_module(identifier, globals(), 'activation function')
File "/usr/local/lib/python3.6/dist-packages/keras/utils/generic_utils.py", line 125, in get_from_module
str(identifier))
ValueError: Invalid activation function: leakyrelu
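
The traceback shows keras/activations.py looking the name up in its own module globals, which is consistent with the README prerequisite of adding a leakyrelu function to that file. A quick way to check whether an installed copy has been patched is sketched below; `has_leakyrelu` is a hypothetical helper that only inspects the source text, and the file path has to be supplied by the user.

```python
import re


def has_leakyrelu(activations_source):
    """Return True if the source text defines a top-level leakyrelu function."""
    pattern = r"^def\s+leakyrelu\s*\("
    return re.search(pattern, activations_source, re.MULTILINE) is not None


# Toy source strings standing in for the contents of keras/activations.py.
unpatched = "def relu(x, alpha=0., max_value=None):\n    pass\n"
patched = unpatched + "\ndef leakyrelu(x, alpha=0.1, max_value=None):\n    pass\n"
print(has_leakyrelu(unpatched), has_leakyrelu(patched))
```

In practice the source text would be read from <path where keras was installed>/keras/activations.py in the environment that runs the oracle.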
