uci-cbcl / danq

A hybrid convolutional and recurrent neural network for predicting the function of DNA sequences

License: Other


README for DanQ

DanQ is a hybrid convolutional and recurrent neural network model for predicting the function of DNA de novo from sequence.

Citing DanQ

Quang, D. and Xie, X. "DanQ: a hybrid convolutional and recurrent neural network for predicting the function of DNA sequences", NAR, 2015.

INSTALL

DanQ uses a lot of bleeding-edge software packages, and updates to these packages are often not backwards compatible. I have therefore listed the version numbers of the packages in the configuration that worked for me. For the record, I am using Ubuntu Linux 14.04 LTS with an NVIDIA Titan Z GPU.

Required

[Python] (https://www.python.org) (2.7.10). The easiest way to install Python and all of the necessary dependencies is to download and install [Anaconda] (https://www.continuum.io) (2.3.0). I listed the versions of Python and Anaconda I used, but the latest versions should be fine. If you're curious as to what packages in Anaconda are used, they are: [numpy] (http://www.numpy.org/) (1.10.1), [scipy] (http://www.scipy.org/) (0.16.0), and [h5py] (http://www.h5py.org) (2.5.0).
[Theano] (https://github.com/Theano/Theano) (latest). At the time I wrote this, Theano 0.7.0 was already included in Anaconda. However, it is missing some crucial helper functions. You need to git clone the latest bleeding-edge version, since there isn't a version number for it:

$ git clone git://github.com/Theano/Theano.git
$ cd Theano
$ python setup.py develop

[keras] (https://github.com/fchollet/keras/releases/tag/0.2.0) (0.2.0). Deep learning package that uses the Theano backend. I'm in the process of upgrading to version 0.3.0 with the TensorFlow backend.
[seya] (https://github.com/EderSantana/seya) (???). I had to modify the source code of this package a little bit. You can try getting the latest version from Github, but for your convenience I've uploaded my copy of the package. You can install it as follows:

$ tar zxvf DanQ_seya.tar.gz
$ cd DanQ_seya
$ python setup.py install

I will likely improve DanQ soon and drop the dependency on seya.

Optional

[CUDA] (https://developer.nvidia.com/cuda-toolkit-65) (6.5). Theano can use either CPU or GPU, but a GPU is practically necessary for a network and dataset this large.
[cuDNN] (https://developer.nvidia.com/cudnn) (2). Significantly speeds up convolution operations.

USAGE

You need to first download the training, validation, and testing sets from DeepSEA. You can download the datasets from [here] (http://deepsea.princeton.edu/media/code/deepsea_train_bundle.v0.9.tar.gz). After you have extracted the contents of the tar.gz file, move the 3 .mat files into the data/ folder.

If you have everything installed, you can train a model as follows:

$ python DanQ_train.py

On my system, each epoch took about 6 hours. Whenever the validation loss reaches a new minimum at the end of a training epoch, the best weights are stored in [DanQ_bestmodel.hdf5] (https://cbcl.ics.uci.edu/public_data/DanQ/DanQ_bestmodel.hdf5). I've already uploaded the fully trained model at that hyperlink. You can see motif results, including visualizations and TOMTOM comparisons to known motifs, in the motifs/ folder. Likewise, you can also train a much larger model where about half of the motifs are initialized with JASPAR motifs:

$ python DanQ-JASPAR_train.py

Weights are saved to the file [DanQ-JASPAR_bestmodel.hdf5] (https://cbcl.ics.uci.edu/public_data/DanQ/DanQ-JASPAR_bestmodel.hdf5) whenever the validation loss is lowered. Motif results for this model are also stored in the motifs/ folder.

For your convenience, I've posted the current ROC AUC and PR AUC statistics comparing DanQ and DanQ-JASPAR with DeepSEA.

If you do not want to train a model from scratch and just want to do predictions, I've included test scripts for both models and the file example.h5 in the data folder. This is the same hdf5 file that is generated using the example from the DeepSEA package. The test scripts here have the same input and output formats as the prediction script from DeepSEA, so you can replace the prediction step of the DeepSEA pipeline (i.e. the 2_DeepSEA.lua script) with the test scripts here:

$ python DanQ_test.py data/example.h5 data/example_DanQ_pred.h5

To-Do

Annotate genetic variation (xgboost model files are currently included, but not detailed at the moment)
Improve DanQ architecture

danq's Issues

AUC Class Ordering

Hi,

Could you provide the code used to generate the ROC and PR AUC curves in aucs.txt? How do you map the class prediction results (919) to the alphabetically ordered list of cell type names that are in aucs.txt?

Also, do you happen to have the code that DeepSEA used for preprocessing to get the train, valid, and test mats? A request for that code in their repository has not been responded to.

Thanks in advance.
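
For readers with the same question, per-class ROC AUC can be computed from a label matrix and a prediction matrix without extra dependencies. The following is only a minimal sketch, not the code used to produce aucs.txt: the function names are hypothetical, and it assumes the columns of both matrices are in the same (unspecified) class order. Mapping those columns to cell type names still requires the ordering the question asks about.

```python
import numpy as np

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    # Start positions of runs of equal values in the sorted array.
    starts = np.flatnonzero(np.r_[True, sorted_vals[1:] != sorted_vals[:-1]])
    ends = np.r_[starts[1:], len(values)]
    for s, e in zip(starts, ends):
        ranks[order[s:e]] = 0.5 * (s + e - 1) + 1.0
    return ranks

def roc_auc_per_class(y_true, y_score):
    """ROC AUC per column; NaN where a class lacks positives or negatives."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    aucs = np.full(y_true.shape[1], np.nan)
    for j in range(y_true.shape[1]):
        pos = y_true[:, j] == 1
        n_pos, n_neg = int(pos.sum()), int((~pos).sum())
        if n_pos == 0 or n_neg == 0:
            continue  # AUC is undefined for this class
        ranks = average_ranks(y_score[:, j])
        # Mann-Whitney U statistic, normalised to the range [0, 1].
        aucs[j] = (ranks[pos].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
    return aucs
```

Classes with no positive examples in the evaluated set come back as NaN rather than raising, so they can be filtered out before averaging.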

Link to Supplementary Data missing

The link to Supplementary Data on the journal website (https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw226#42670419) does not work. Would you be able to contact the journal to get this fixed?

[Question] Can DanQ identify motifs that are separated by spacer sequence?

Have you tested the ability of DanQ to identify motif sequences that are separated by a spacer sequence? For example, suppose kernel_size=26 in Conv1D and there is an 8 bp motif (motif_A), followed by a 10 bp spacer, followed by another 8 bp motif (motif_B). Would DanQ be able to detect it, with non-specific probabilities in the middle 10 bp region? Just curious whether it can handle split motifs.
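
The arithmetic in this question (8 + 10 + 8 = 26) can be illustrated with a hand-built kernel. The sketch below only shows that a 26-position kernel with zero weights over the spacer can represent such a split motif. It says nothing about whether DanQ actually learns kernels like this. The motif strings, the helper names, and the A,C,G,T channel order are all made up for illustration.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order, for illustration only

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, ALPHABET.index(base)] = 1.0
    return x

def split_motif_kernel(motif_a, spacer_len, motif_b):
    """Kernel that rewards motif_a and motif_b and ignores the spacer."""
    return np.concatenate(
        [one_hot(motif_a), np.zeros((spacer_len, 4)), one_hot(motif_b)])

def scan(seq, kernel):
    """Valid cross-correlation of the kernel along a one-hot sequence."""
    x = one_hot(seq)
    width = kernel.shape[0]
    return np.array([(x[i:i + width] * kernel).sum()
                     for i in range(len(seq) - width + 1)])

# Hypothetical motifs: 8 bp + 10 bp spacer + 8 bp = 26 kernel positions.
kernel = split_motif_kernel("GATTACAG", 10, "CCGTTAGC")
seq = "TTTT" + "GATTACAG" + "ACGTACGTAC" + "CCGTTAGC" + "AAAA"
scores = scan(seq, kernel)
best = int(scores.argmax())  # offset at which both half-motifs line up
```

Because the spacer rows are all zero, any 10 bp sequence between the two half-motifs yields the same score at the aligned offset.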

How to convert the convolution kernels to motifs?

Hello! Can you explain in detail how to convert the convolution kernels to motifs? Although I have read the DeepBind paper, I could not understand the procedure or even find the corresponding section. Thanks!
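
One commonly described way to do this is to scan sequences with a kernel, keep the windows that activate it strongly, and count the bases at each aligned position. The sketch below follows that idea under stated assumptions: it is not the repository's motif code, the 0.5-of-maximum threshold is an illustrative default, the function names are hypothetical, and the kernel is assumed to be arranged as a (width, 4) array in the same channel order as the one-hot input.

```python
import numpy as np

def kernel_to_pfm(kernel, onehot_seqs, threshold_frac=0.5):
    """Count bases in every window whose activation exceeds a fraction of the max.

    kernel:      (width, 4) weights of a single convolution filter
    onehot_seqs: (n_seqs, length, 4) one-hot encoded sequences
    """
    kernel = np.asarray(kernel, dtype=float)
    onehot_seqs = np.asarray(onehot_seqs, dtype=float)
    n_seqs, length, _ = onehot_seqs.shape
    width = kernel.shape[0]
    n_windows = length - width + 1
    # Activation of the kernel at each window of each sequence (bias omitted).
    acts = np.array([[(onehot_seqs[n, i:i + width] * kernel).sum()
                      for i in range(n_windows)]
                     for n in range(n_seqs)])
    counts = np.zeros((width, 4))
    if acts.max() <= 0:
        return counts  # the kernel never fires on these sequences
    cutoff = threshold_frac * acts.max()
    for n, i in zip(*np.where(acts > cutoff)):
        counts += onehot_seqs[n, i:i + width]
    return counts

def pfm_to_frequencies(counts):
    """Normalise each position of a count matrix to sum to 1 (zero rows stay zero)."""
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```

The resulting count matrix can then be written out in a motif file format and compared to known motifs with a tool such as TOMTOM, which the README mentions.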

ImportError: cannot import name downsample

When I run this model with "python DanQ_train.py", I get the following error:
Traceback (most recent call last):
  File "DanQ_train.py", line 8, in <module>
    from keras.models import Sequential
  File "build/bdist.linux-x86_64/egg/keras/models.py", line 15, in <module>
  File "build/bdist.linux-x86_64/egg/keras/utils/layer_utils.py", line 10, in <module>
  File "build/bdist.linux-x86_64/egg/keras/layers/convolutional.py", line 6, in <module>
ImportError: cannot import name downsample
What should I do?

Long Training Time

I have been trying to run DanQ_train.py (after tweaking the bidirectional LSTM implementation by using keras instead of seya), but training one epoch on the NVIDIA GTX 1080 takes over 200 hours according to the Keras ETA (batch size 100). How many samples did you use in the training data set? Did you use all 4.4 million samples in the DeepSEA train.mat?

I am using:
Keras 1.2.0
Tensorflow 0.12.0
CUDA 8.0
CuDNN 5.1
Python 2.7.13

A question about your paper.

You said that the predicted probability for each sequence was computed as the average of the probability predictions for the forward and reverse complement sequence pairs, similar to DeepSEA's evaluation experiments. But I can't find where you do this in your code. It seems that you only predict the probabilities for the forward sequence.
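
For reference, the averaging described in this question can be sketched as follows. This is a minimal illustration and not code from the repository: `predict_fn` is a hypothetical stand-in for something like `model.predict`, and the inputs are assumed to be one-hot arrays shaped (n, length, 4).

```python
import numpy as np

def reverse_complement(onehot):
    """Reverse-complement a batch of one-hot sequences shaped (n, length, 4).

    Flipping the channel axis swaps A<->T and C<->G for both the A,C,G,T and
    the A,G,C,T channel orders; flipping the length axis reverses the sequence.
    """
    return np.asarray(onehot)[:, ::-1, ::-1]

def predict_both_strands(predict_fn, onehot):
    """Average the predictions for the forward and reverse-complement inputs."""
    forward = predict_fn(onehot)
    reverse = predict_fn(reverse_complement(onehot))
    return (forward + reverse) / 2.0
```

Some datasets already store each sequence alongside its reverse complement, in which case the averaging would be done over paired rows instead. Whether that applies here is not stated in this thread.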

Data?

I'd like to try your model, but I have no idea what sort of format it expects the data to be in. Any chance you can upload a sample to GitHub?
Thanks.
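
Judging from code posted elsewhere in these issues, the first layer is declared with input_length=1000 and input_dim=4, and the test inputs are transposed with axes=(0,2,1) before prediction. That suggests the stored arrays are shaped (n, 4, 1000) and the model expects (n, 1000, 4). The sketch below illustrates that reordering on a dummy array. The helper name is hypothetical and the shapes are inferred rather than confirmed by the author.

```python
import numpy as np

def to_model_input(xdata):
    """Reorder (n, 4, length) arrays to the (n, length, 4) layout."""
    xdata = np.asarray(xdata)
    if xdata.ndim != 3 or xdata.shape[1] != 4:
        raise ValueError("expected shape (n, 4, length), got %r" % (xdata.shape,))
    return np.transpose(xdata, axes=(0, 2, 1))

# Dummy stand-in for testmat['testxdata']: 2 sequences, 4 channels, 1000 positions.
dummy = np.zeros((2, 4, 1000), dtype=np.uint8)
X = to_model_input(dummy)  # shape (2, 1000, 4)
```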

Using test.mat to predict the scores, but some scores are NaN?

Hi,
Recently I tried to run the code and use the 'roc_curve' function from the sklearn library to plot the ROC curve. But when I use test.mat to predict the scores, I find that some scores are NaN, as shown below.
[[ 3.01186283e-05 1.29807665e-06 8.04380979e-05 ..., 1.28236379e-05
1.28918364e-05 5.43438178e-03]
[ 3.00640259e-05 1.30328704e-06 8.17851032e-05 ..., 1.26049072e-05
1.26921959e-05 5.40011935e-03]
[ nan nan nan ..., nan nan nan] ...,
[ 3.92721742e-02 4.11137007e-03 3.05633545e-02 ..., 8.36040080e-02
7.93742165e-02 7.82782771e-03]
[ 3.92719842e-02 4.11114981e-03 3.05634644e-02 ..., 8.36120248e-02
7.93785900e-02 7.82747846e-03]
[ 3.92720513e-02 4.11104271e-03 3.05638630e-02 ..., 8.36143717e-02
7.93796182e-02 7.82728661e-03]]
I don't know why this occurs. I load the bestmodel you provided as the weights. The NaN values cause the ROC AUC to stay at 0.5, so I would like to know where I went wrong. My code is as follows.
Thanks,
pu

`forward_lstm = LSTM(input_dim=320, output_dim=320, return_sequences=True)
backward_lstm = LSTM(input_dim=320, output_dim=320, return_sequences=True)
brnn = Bidirectional(forward=forward_lstm, backward=backward_lstm, return_sequences=True)

print 'building model'

model = Sequential()
model.add(Convolution1D(input_dim=4,
                        input_length=1000,
                        nb_filter=320,
                        filter_length=26,
                        border_mode="valid",
                        activation="relu",
                        subsample_length=1))

model.add(MaxPooling1D(pool_length=13, stride=13))

model.add(Dropout(0.2))

model.add(brnn)

model.add(Dropout(0.5))

model.add(Flatten())

model.add(Dense(input_dim=75*640, output_dim=925))
model.add(Activation('relu'))

model.add(Dense(input_dim=925, output_dim=919))
model.add(Activation('sigmoid'))

print 'compiling model'
model.compile(loss='binary_crossentropy', optimizer='rmsprop', class_mode="binary")

model.load_weights('data/DanQ_bestmodel.hdf5')

print 'loading test data'
testmat = scipy.io.loadmat('data/test.mat')
X, y = np.transpose(testmat['testxdata'],axes=(0,2,1)), testmat['testdata']

print 'predicting on test sequences'
test_predicts = model.predict(X)

print 'printing the ROC AUC'
fpr, tpr, thresholds = roc_curve(y[:,13], test_predicts[:,13])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1, label='ROC(area = %0.2f)'% roc_auc )
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()`
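
When some prediction rows come back as NaN, a first diagnostic step is to find which rows are affected and exclude them before computing any metric. The sketch below does only that. It does not explain or fix the cause of the NaN values, and the function name is hypothetical.

```python
import numpy as np

def drop_nan_rows(y_true, y_pred):
    """Keep only rows whose predictions are all finite.

    Returns the filtered labels, the filtered predictions, and the indices
    of the rows that were dropped.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred, dtype=float)
    finite = np.isfinite(y_pred).all(axis=1)
    return y_true[finite], y_pred[finite], np.flatnonzero(~finite)
```

The returned indices show whether the bad rows are isolated sequences or a whole prediction batch, which helps narrow down where the problem originates.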

Binary Classifier experiment - log loss

Hi,
Fantastic Library!
I am trying to use the library for a binary classifier experiment, using the log loss function to train the model. This is for a university experiment on benchmarking different models. Would you have time to provide an example of how to use the library to achieve this?
A visualisation of how the model learns would also be helpful.
Many thanks,
Best,
Andrew

Updated version with tensorflow backend

Hi,

Could you provide updated DanQ model trained with keras with tensorflow backend?

Best

One hot matrix

Hello,
I'm not very clear on how the one-hot matrix is encoded. In Figure 1 of your paper, the order appears to be A,C,G,T, but the first section (Features and data) under Materials and methods says A,G,C,T. Which is the right way to read the matrix?

Kind regards,
Jing
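
To make the two candidate orderings concrete, here is a small one-hot encoder with an explicit channel order. It is a sketch for illustration only: the function name is hypothetical, and it does not settle which order the DeepSEA data actually uses.

```python
import numpy as np

def one_hot_encode(seq, alphabet):
    """Encode a DNA string as a (len(seq), 4) matrix in the given channel order.

    Bases not in the alphabet (for example N) are encoded as all zeros.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    x = np.zeros((len(seq), len(alphabet)), dtype=np.uint8)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            x[pos, index[base]] = 1
    return x

acgt = one_hot_encode("GAT", "ACGT")  # G is channel 2 in this order
agct = one_hot_encode("GAT", "AGCT")  # G is channel 1 in this order
```

Only the C and G channels differ between the two orders, so a mismatch would swap C and G while leaving A and T in place.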

pre-computed score

Hi,

Just wondering: are there any pre-computed scores available for the hg19 assembly?

Thanks,
Hurley

kernel visualization to motif/position frequency matrices

Hi, I read the DanQ paper and want to do the same thing: convert the learnt kernels to position frequency matrices, or motifs. I was trying to find the code here, but did not see it. Can anyone help? Much appreciated!

Why do you use a length-919 binary target vector, and how do you encode it?

Crash when compiling

Hey Daniel,

I am trying to compile your model so I can use it as a pre-trained model. However I am running into an error when using keras 0.2.0, your version of seya and theano 0.8.0 (0.7.0 and 0.9.0 did not work) and all other provided versions of packages. In the description you wrote that you might be switching to tensorflow and later versions of keras. Do you happen to have a new version of your model that's more easily portable? Thanks in advance.

Error in Code

Hello
I ran your code, but your code has an error in the part shown in the figure.
Can you figure out what the error is?
Thanks