ilkarman / deeplearningframeworks
Demo of running NNs across different frameworks
License: MIT License
I'm trying to replace the basic GRU Cell I currently have:
cell = tf.contrib.rnn.GRUCell(NUMHIDDEN)
outputs, states = tf.contrib.rnn.static_rnn(cell, word_list, dtype=tf.float32)
With the CuDNN version:
cudnn_cell = tf.contrib.cudnn_rnn.CudnnGRU(num_layers=1,
                                           num_units=NUMHIDDEN,
                                           input_size=EMBEDSIZE)  # Set params
params_size_t = cudnn_cell.params_size()
params = tf.Variable(tf.ones([params_size_t]), validate_shape=False)
input_h = tf.Variable(tf.ones([1, BATCHSIZE, NUMHIDDEN]))
outputs, states = cudnn_cell(is_training=True,
                             input_data=word_list,
                             input_h=input_h,
                             params=params)
However, when I do this my model starts to predict randomly, and accuracy drops from 0.86 to 0.5.
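One thing worth checking (an untested sketch, not a confirmed fix): tf.ones gives every weight in the flat params buffer the same value, so the GRU cannot break symmetry; also, CudnnGRU consumes a single time-major 3-D tensor rather than static_rnn's list of per-step tensors:
# Untested sketch: random init instead of tf.ones, zero initial state,
# and stacking the per-step list into [max_time, batch_size, embed_size]
params = tf.Variable(tf.random_uniform([params_size_t], -0.1, 0.1),
                     validate_shape=False)
input_h = tf.Variable(tf.zeros([1, BATCHSIZE, NUMHIDDEN]))
word_tensor = tf.stack(word_list)  # list of [batch, embed] -> [time, batch, embed]
outputs, states = cudnn_cell(is_training=True,
                             input_data=word_tensor,
                             input_h=input_h,
                             params=params)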
I remember TensorFlow in the VGG-style benchmark used to take ~300s to train, and now it takes 173s. Is that due to a config or version change?
@ThomasDelteil I have been trying to re-run the mxnet example on V100s, however still end up with the same error as on the P100s:
MXNetError: [11:35:49] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered
MXNet: 1.3.0
GPU: ['Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB']
CUDA Version 9.0.176
CuDNN Version 7.0.5
Also, do you know if any further updates to MXNet have reduced the need for the boilerplate code, e.g. "Hot fixing DataLoader for multi-processing and RecordFileDataset"? And perhaps we could avoid tfrecords-style record files and just read the raw images, as with the other frameworks? It would be cool to match the conciseness of other frameworks (e.g. PyTorch).
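For example (an untested sketch, assuming the MXNet >= 1.3 Gluon API; the dataset path is hypothetical):
from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import ImageFolderDataset

# Raw images on disk, no record files; DataLoader handles multi-processing natively
dataset = ImageFolderDataset('/data/chestxray/images')  # hypothetical path
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)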
@kirnap @denizyuret Would it be possible at some point to directly compare the inference speed of Knet on a pre-trained resnet50 model similar to these notebooks?
I'm not sure if there is a converter for, say, Caffe models to Knet format?
The CNN and RNN training times are very impressive!
The current code also applies dropout at test time. The correct code is shown below. I'm also not sure whether dropout should be active when computing train accuracies.
import numpy as np
import os
import sys
import tensorflow as tf
from common.params import *
from common.utils import *
print("OS: ", sys.platform)
print("Python: ", sys.version)
print("Numpy: ", np.__version__)
print("Tensorflow: ", tf.__version__)
OS: linux
Python: 3.6.0 (default, May 9 2017, 15:45:21)
[GCC 5.4.0 20160609]
Numpy: 1.13.1
Tensorflow: 1.3.0
def create_symbol(training):
    conv1 = tf.layers.conv2d(X, filters=50, kernel_size=(3, 3), padding='same')
    relu1 = tf.nn.relu(conv1)
    conv2 = tf.layers.conv2d(relu1, filters=50, kernel_size=(3, 3), padding='same')
    relu2 = tf.nn.relu(conv2)
    pool1 = tf.layers.max_pooling2d(relu2, pool_size=(2, 2), strides=(2, 2), padding='valid')
    drop1 = tf.layers.dropout(pool1, 0.25, training=training)
    conv3 = tf.layers.conv2d(drop1, filters=100, kernel_size=(3, 3), padding='same')
    relu3 = tf.nn.relu(conv3)
    conv4 = tf.layers.conv2d(relu3, filters=100, kernel_size=(3, 3), padding='same')
    relu4 = tf.nn.relu(conv4)
    pool2 = tf.layers.max_pooling2d(relu4, pool_size=(2, 2), strides=(2, 2), padding='valid')
    drop2 = tf.layers.dropout(pool2, 0.25, training=training)
    flatten = tf.reshape(drop2, shape=[-1, 100*8*8])
    fc1 = tf.layers.dense(flatten, 512, activation=tf.nn.relu)
    drop4 = tf.layers.dropout(fc1, 0.5, training=training)
    logits = tf.layers.dense(drop4, N_CLASSES, name='output')
    return logits
def init_model(m):
    # Single-class labels, don't need dense one-hot
    # Expects unscaled logits, not output of tf.nn.softmax
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=m, labels=y)
    loss = tf.reduce_mean(xentropy)
    optimizer = tf.train.MomentumOptimizer(learning_rate=LR, momentum=MOMENTUM)
    training_op = optimizer.minimize(loss)
    return training_op
%%time
# Data into format for library
#x_train, x_test, y_train, y_test = mnist_for_library(channel_first=False)
x_train, x_test, y_train, y_test = cifar_for_library(channel_first=False)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
print(x_train.dtype, x_test.dtype, y_train.dtype, y_test.dtype)
Preparing train set...
Preparing test set...
Done.
(50000, 32, 32, 3) (10000, 32, 32, 3) (50000,) (10000,)
float32 float32 int32 int32
CPU times: user 544 ms, sys: 224 ms, total: 768 ms
Wall time: 766 ms
%%time
# Place-holders
X = tf.placeholder(tf.float32, shape=[None, 32, 32, 3])
y = tf.placeholder(tf.int32, shape=[None])
training = tf.placeholder(tf.bool)
# Initialise model
sym = create_symbol(training)
CPU times: user 76 ms, sys: 4 ms, total: 80 ms
Wall time: 78.4 ms
%%time
model = init_model(sym)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# Accuracy logging
correct = tf.nn.in_top_k(sym, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
CPU times: user 360 ms, sys: 632 ms, total: 992 ms
Wall time: 1.04 s
%%time
for j in range(EPOCHS):
    for data, label in yield_mb(x_train, y_train, BATCHSIZE, shuffle=True):
        sess.run(model, feed_dict={X: data, y: label, training: True})
    # Log accuracy on the last mini-batch of the epoch
    acc_train = sess.run(accuracy, feed_dict={X: data, y: label, training: True})
    print(j, "Train accuracy:", acc_train)
0 Train accuracy: 0.546875
1 Train accuracy: 0.484375
2 Train accuracy: 0.671875
3 Train accuracy: 0.65625
4 Train accuracy: 0.609375
5 Train accuracy: 0.765625
6 Train accuracy: 0.765625
7 Train accuracy: 0.796875
8 Train accuracy: 0.90625
9 Train accuracy: 0.734375
CPU times: user 1min 21s, sys: 11.9 s, total: 1min 33s
Wall time: 1min 20s
%%time
n_samples = (y_test.shape[0]//BATCHSIZE)*BATCHSIZE
y_guess = np.zeros(n_samples, dtype=np.int)
y_truth = y_test[:n_samples]
c = 0
pred = tf.argmax(sym, 1)  # Build the op once, outside the loop, to avoid growing the graph
for data, label in yield_mb(x_test, y_test, BATCHSIZE):
    output = sess.run(pred, feed_dict={X: data, training: False})
    y_guess[c*BATCHSIZE:(c+1)*BATCHSIZE] = output
    c += 1
CPU times: user 3.83 s, sys: 152 ms, total: 3.98 s
Wall time: 3.58 s
print("Accuracy: ", sum(y_guess == y_truth)/len(y_guess))
Accuracy: 0.770532852564
We compared some of the top deep learning frameworks using CNNs on CIFAR. It would be great if the community could vote on which problem you would like to see next. I've added some options:
cc @botev, @souptc, @YusukeSuzuki, @Yangqing, @ppwwyyxx, @miguelvr, @msalvaris, @ilkarman, @piiswrong, @soumith, @n17s
Recently MXNet's new high-level framework Gluon has been getting a lot of attention; it supports hybrid imperative and symbolic networks. Could you add tests for Gluon, and also DyNet?
Finally, thank you for your work in giving us a clear picture.
That's why MXNet is very fast in your benchmark. My code is here:
def create_symbol():
    # https://mxnet.incubator.apache.org/api/python/rnn.html
    data = mx.symbol.Variable('data')
    embedded_step = mx.symbol.Embedding(data=data, input_dim=MAXFEATURES, output_dim=EMBEDSIZE)
    gru_cell = mx.rnn.GRUCell(num_hidden=NUMHIDDEN)
    # Initialize its hidden and memory states.
    # 'begin_state' method takes an initialization function, and uses 'zeros' by default.
    begin_state = gru_cell.begin_state()
    # Unroll the cell over all MAXLEN time steps for a batch
    output, states = gru_cell.unroll(length=MAXLEN, inputs=embedded_step, merge_outputs=False)
    # output, states = gru_cell(embedded_step, begin_state)  # ***WRONG*** - runs a single time step only
    # FC out
    fc1 = mx.symbol.FullyConnected(data=output[-1], num_hidden=2)
    # Label
    input_y = mx.symbol.Variable('softmax_label')
    m = mx.symbol.SoftmaxOutput(data=fc1, label=input_y, name="softmax")
    return m
@mitmul Thank you for highlighting my typo in your PR; I wanted to highlight two further issues I am facing here.
Toggling between single and multi-gpu (4x) improves the time taken from 47min15s to 14min43s; however, for some reason the AUC also drops from 0.8028 (which matches all other examples) to 0.56. This does not happen, for example, with PyTorch. There is also a difference in validation/main/loss, which ends at 0.23 for multi-gpu but 0.15 for single-gpu.
I wondered if there has been an update to the pre-trained densenet model so that I no longer have to override CaffeFunction with a custom class to reduce the memory footprint? The custom __call__ lets me use a batch of 56 instead of 32; however, I am still not able to get the low memory footprint of the other frameworks, which lets me run a batch of 64.
Chainer: 4.1.0
CuPy: 4.1.0
Numpy: 1.14.1
GPU: ['Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB']
CUDA Version 9.0.176
CuDNN Version 7.0.5
I've just noticed your remark about the flags for test and train in PyTorch and TensorFlow. In fact, there is a similar thing for Lasagne which I totally forgot about. Could you change block 9 to:
%%time
# Compile functions
train_func = theano.function([X.input_var, y], [loss, accuracy], updates=updates)
pred = L.get_output(net, deterministic=True)
pred_func = theano.function([X.input_var], T.argmax(pred, axis=1))
Hi @ilkarman,
This project is really good. I want to download the VM image to run these deep learning frameworks. Can you share the download URL? Or I could buy the VM image. Thanks a lot.
Hey @ilkarman, there is mention of an incompatibility with keras >2.1.4; can this be fixed? I'd like to try the keras-mxnet backend to see if there is any difference.
Thanks!
I'm having problems matching the MXNet AUC with the other frameworks like Keras, TF, or PyTorch.
In this notebook I get an AUC of 0.73, whereas in PyTorch, for example, I get almost 0.80.
Any guidance @ThomasDelteil?
PaddlePaddle?
Let's do the multi-gpu notebooks using CUDA 9 + cuDNN 7 and updated frameworks (e.g. TF 1.6 instead of 1.4) rather than CUDA 8.
Nice repo; thanks a lot to the authors for their work. Could we also add results from Caffe for comparison?
Now that apparently DenseNet can be used - see issue.
I am sorry for asking this, because I know how much of a pain Caffe is to work with. However, you could probably use MMdnn to quickly create the networks.
Caffe2 appears to be optimised for CPU inference using Intel's MKL library. In terms of GPU training times it's one of the fastest frameworks. However, for inference I can't get much speed out of it (on both GPU and CPU).
You can see here that timings for feature extraction on a resnet-50 model are:
DL Library | Images/s GPU | Images/s CPU
---|---|---
Tensorflow | 155 | 11
MXNet(w/mkl) | 129 | 25
MXNet | 130 | 8
PyTorch | 130 | 6
CNTK | 117 | 8
Chainer | 107 | 3
Keras(TF) | 98 | 5
ONNX_Caffe2 | 75 | 6
Caffe2 | 71 | 6
Keras(CNTK) | 46 | 4
So I also tried it with a different model (PyTorch converted to ONNX), but it's still not as fast as I would expect. This is the same environment that had blazingly fast CNN training times, so I wonder if I'm just running inference in a non-optimal way?
It's an amazing framework and the results are very surprising to me.
So I've just noticed here that the default value for theano.config.floatX is float64. Thus unless you have edited your .theanorc it might have been running on float64 rather than float32. If you don't want to edit the environment file you can set the value, similar to the cuDNN flags, by adding theano.config.floatX = "float32". Also, for good practice, I suggest adding theano.config.warn_float64 = "warn".
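For example (a minimal sketch; these config attributes exist in Theano 0.9/1.0 and should be set right after import, before any graph is built):
import theano
theano.config.floatX = "float32"       # default is float64 unless .theanorc overrides it
theano.config.warn_float64 = "warn"    # warn whenever a float64 tensor is created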
The current Keras_CNTK test case is running with the channels-last format, which will lead to a performance degradation in CNTK; that's why you see the warning:
/anaconda/envs/py35/lib/python3.5/site-packages/cntk/core.py:82: RuntimeWarning: data is not C contiguous; rearrange your data/computation to avoid costly data conversions
RuntimeWarning)
Could we test with the 'channels_first' format in Keras? I manually ran it on my box, and Keras_CNTK is only around 20% slower than the native CNTK implementation.
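A minimal sketch of the switch, assuming Keras 2.x:
import keras.backend as K
K.set_image_data_format('channels_first')  # default is 'channels_last'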
@ilkarman Love the benchmarks.
It would be interesting to see which of the current platforms is able to scale the best to take advantage of cloud resources. Have you considered expanding to multi-GPU and to multi-node benchmarks?
I'm having some issues with the Chainer multi-GPU examples and I was hoping someone could give me some guidance to fix them. @Crissman, if you get a chance I would really appreciate your feedback.
First I had to truncate the batch-norm param (however, in the prototxt they are already 1e-5, so I'm not sure why they become smaller when imported):
def truncate_bn(sym):
    # Need to truncate batchnorm - eps
    for layer in list(sym._children):
        if "bn" in layer:
            if sym.__dict__[layer].eps < 1e-5:
                sym.__dict__[layer].eps = 1e-5
Second, I had to update to 4.0.0b3 to handle the average pooling layer in the pretrained model.
Third, I modified the chainer.links.caffe.CaffeFunction as noted here to save only the layers that are needed for the final computation, not all of them:
import collections
from chainer.links.caffe import CaffeFunction

class CaffeFunctionDenseNet121(CaffeFunction):
    # Standard function saves all variables so cannot use big batch
    # This lets me run BATCH of 56 over 32 - still can't get to 64
    # https://github.com/chainer/chainer/blob/master/chainer/links/caffe/caffe_function.py#L176
    def __call__(self, inputs, **kwargs):
        variables = dict(inputs)
        # Pools not to save
        # These layers are not concatenated
        _NOSAVE = set(['pool5', 'concat_5_16', 'concat_4_24', 'concat_3_12', 'concat_2_6'])
        # Forward through all layers
        for func_name, bottom, top in self.layers:
            func = self.forwards[func_name]
            # Concat ops require some previous layers that are saved
            if "concat" in func_name:
                input_vars = tuple([variables[bottom[0]], variables['data']])
            else:
                input_vars = tuple([variables['data']])
            output_vars = func(*input_vars)
            # Delete layers for concat once used
            if "concat" in func_name:
                del variables[bottom[0]]
            if not isinstance(output_vars, collections.Iterable):
                output_vars = output_vars,
            # Save to dict
            variables['data'] = output_vars[0]
            top = top[0]
            # Save for concat
            if ("pool" in top) and (top not in _NOSAVE):
                variables[top] = output_vars[0]
            elif ("concat" in top) and (top not in _NOSAVE):
                variables[top] = output_vars[0]
        return tuple([variables['data']])
With these three changes I am able to train DenseNet-121 with a batch size of 56 before exhausting my GPU VRAM; without the modification to the __call__() method I can only run 32 -> and this speeds up the model by around 30 minutes (over 5 epochs). I believe the reason it is still slower than the rest is that the batch size is still too small, but I'm not sure what else I can do. By comparison, PyTorch runs at half the memory usage this currently does.
It seems that CaffeFunction is not the preferred way of loading and fine-tuning pretrained models? I think it would be possible to copy the weights into a model that has been defined already like this; however, the names and structure are different, so it might almost be easier to write a new implementation to match the names from the pre-trained Caffe model.
Transfer learning appears quite popular, so it would be great if the above were possible. I'm not sure if I'm doing it wrong.
For some reason, when I run with multi-gpu it takes longer to complete 5 epochs. I checked that all my GPUs are used and raised an issue. I'm not sure if this is specific to the CaffeFunction?
Also, the multi-GPU methods have a much lower AUC (0.7 vs 0.8 for the other frameworks). I adopt a linear LR scaling rule and get 0.8 running Chainer single-gpu, and with all other frameworks single and multi-gpu, so I'm not sure what's happening.
I would appreciate any help finalising this notebook since, aside from the three points above, I really do like the interface.
CUDA 9 has been out for a while, and CUDA 10 is gaining support for later-generation cards. Any thought given to updating this to later CUDA and framework versions?
When processing the IMDB data for the RNN notebooks, I got an error suggesting to allow pickle.
I found the following fix works for me:
with np.load('imdb.npz', allow_pickle=True) as f:
instead of:
with np.load('imdb.npz') as f:
Hi - thanks for setting up the benchmark. I want to quickly put a performance note so folks are not surprised when seeing the perf numbers.
First, if one measures a speed difference on basic models like CNNs, something is wrong in how one uses the frameworks :)
In this case, the major performance difference comes from I/O, not the framework itself. If you look at the TensorFlow and Caffe2 examples, the data is provided with a feed approach - this is usually bad for performance; instead one should use a db or input iterator. Under the hood, this makes prefetching and other optimizations possible, which is particularly important for performance.
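To illustrate the point (a minimal sketch, assuming TF >= 1.4 and the x_train/y_train/BATCHSIZE names from the notebooks):
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(buffer_size=50000)
           .batch(BATCHSIZE)
           .prefetch(1))
X_batch, y_batch = dataset.make_one_shot_iterator().get_next()
# Build the graph on X_batch/y_batch directly instead of feeding placeholders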
Redo MXNet high_api example using .fit() to match Tensorflow and CNTK.
Should really use tf.nn.dynamic_rnn() and not tf.contrib.rnn.static_rnn(). Note the difference in input shape:
inputs: The RNN inputs. If time_major == False (default), this must be a Tensor of shape: [batch_size, max_time, ...], or a nested tuple of such elements. If time_major == True, this must be a Tensor of shape: [max_time, batch_size, ...], or a nested tuple of such elements. This may also be a (possibly nested) tuple of Tensors satisfying this property. The first two dimensions must match across all the inputs, but otherwise the ranks and other shape components may differ. In this case, input to cell at each time-step will replicate the structure of these tuples, except for the time dimension (from which the time is taken). The input to cell at each time step will be a Tensor or (possibly nested) tuple of Tensors each with dimensions [batch_size, ...].
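A minimal sketch of the swap, assuming TF 1.x and a batch-major embedded input ('embedded' stands in for the notebooks' embedding output):
cell = tf.contrib.rnn.GRUCell(NUMHIDDEN)
outputs, states = tf.nn.dynamic_rnn(cell,
                                    embedded,  # [batch_size, max_time, embed_size]
                                    dtype=tf.float32)
# static_rnn instead takes a Python list of max_time tensors, each of shape
# [batch_size, embed_size], which is the shape difference quoted above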
The way that TF does checkpointing with:
tf.estimator.train_and_evaluate(nn, train_spec, eval_spec)
seems to result in a lot of I/O lag: it saves the params to disk after every epoch, runs validation, then loads the model again and repeats.
Is there an easier way to just keep this in memory (like other frameworks, e.g. PyTorch) and just save to disk once at the end?
For example, running on a pure numpy array:
nn.train(tf.estimator.inputs.numpy_input_fn(
    fake_X,
    fake_y,
    shuffle=False,
    num_epochs=EPOCHS,
    batch_size=BATCHSIZE))
This takes 14min30s with TF and 16min52s with Keras. However, the train_and_evaluate loop takes 21min49s with TF and 20min16s with Keras.
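One workaround I'd try (an untested sketch, assuming the TF 1.x Estimator API; model_fn is hypothetical) is to checkpoint rarely instead of after every epoch:
config = tf.estimator.RunConfig(save_checkpoints_secs=None,
                                save_checkpoints_steps=1000000,  # effectively once
                                keep_checkpoint_max=1)
nn = tf.estimator.Estimator(model_fn=model_fn, config=config)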
import numpy as np
import tensorflow as tf
from tensorpack import *
from common.params import *
from common.utils import *
def create_symbol(X, training, n_classes=N_CLASSES):
    # Tensorflow requires a flag for training in dropout
    conv1 = tf.layers.conv2d(X, activation=tf.nn.relu, filters=50, kernel_size=(3, 3),
                             padding='same', data_format='channels_first')
    conv2 = tf.layers.conv2d(conv1, filters=50, kernel_size=(3, 3),
                             padding='same', data_format='channels_first')
    pool1 = tf.layers.max_pooling2d(conv2, pool_size=(2, 2), strides=(2, 2),
                                    padding='valid', data_format='channels_first')
    relu2 = tf.nn.relu(pool1)
    drop1 = tf.layers.dropout(relu2, 0.25, training=training)
    conv3 = tf.layers.conv2d(drop1, activation=tf.nn.relu, filters=100, kernel_size=(3, 3),
                             padding='same', data_format='channels_first')
    conv4 = tf.layers.conv2d(conv3, filters=100, kernel_size=(3, 3),
                             padding='same', data_format='channels_first')
    pool2 = tf.layers.max_pooling2d(conv4, pool_size=(2, 2), strides=(2, 2),
                                    padding='valid', data_format='channels_first')
    relu4 = tf.nn.relu(pool2)
    drop2 = tf.layers.dropout(relu4, 0.25, training=training)
    flatten = tf.reshape(drop2, shape=[-1, 100*8*8])
    fc1 = tf.layers.dense(flatten, 512, activation=tf.nn.relu)
    drop3 = tf.layers.dropout(fc1, 0.5, training=training)
    logits = tf.layers.dense(drop3, n_classes, name='output')
    return logits
def tower_func(x, y):
    logits = create_symbol(x, training=get_current_tower_context().is_training)
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
    loss = tf.reduce_mean(xentropy)
    # Accuracy logging
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
    tf.summary.scalar('train_accuracy', accuracy)
    return loss
def get_optimizer(learning_rate=LR, momentum=MOMENTUM):
    return tf.train.MomentumOptimizer(learning_rate, momentum)
x_train, x_test, y_train, y_test = cifar_for_library(channel_first=True)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
print(x_train.dtype, x_test.dtype, y_train.dtype, y_test.dtype)
def generator(train=True):
    if train:
        while True:
            yield from yield_mb(x_train, y_train, BATCHSIZE, shuffle=True)
    else:
        while True:
            yield from yield_mb(x_test, y_test, BATCHSIZE)
df_train = PrefetchDataZMQ(DataFromGenerator(generator(True)), 1)
df_test = FixedSizeData(DataFromGenerator(generator(False)), len(x_test) // BATCHSIZE)
trainer = SimpleTrainer()
trainer.setup_graph(
    inputs_desc=[InputDesc(tf.float32, [None, 3, 32, 32], 'image'),
                 InputDesc(tf.int32, [None], 'label')],
    input=QueueInput(df_train),
    get_cost_fn=tower_func,
    get_opt_fn=get_optimizer
)
trainer.train_with_defaults(
    callbacks=[
        PeriodicCallback(
            InferenceRunner(df_test, [ScalarStats('accuracy')]),
            every_k_epochs=10)
    ],
    steps_per_epoch=len(x_train) // BATCHSIZE,
    max_epoch=EPOCHS
)
The above code, based on tensorpack, is equivalent to Tensorflow_CNN.ipynb but runs 20% faster on my machine (CUDA 9, TF 1.5, cuDNN 7, GTX 1080).
data = cuda.to_gpu(data)
target = cuda.to_gpu(target)
Ilia,
With the latest release of Chainer version 3.1, Chainer supports cuDNN auto-tuner for convolutional networks.
After installing Chainer v3.1, please add this line after the module imports to benchmark Chainer: chainer.global_config.autotune = True
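That is (for Chainer >= 3.1):
import chainer
chainer.global_config.autotune = True  # enable the cuDNN convolution auto-tuner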
Here is our test run on AWS: https://github.com/unnonouno/DeepLearningFrameworks/blob/autotune/Chainer_CNN-autotune.ipynb
Please let me know if you have any issues with implementation.
Thanks!