ilkarman / deeplearningframeworks
Demo of running NNs across different frameworks
License: MIT License
I'm trying to replace the basic GRU Cell I currently have:
cell = tf.contrib.rnn.GRUCell(NUMHIDDEN)
outputs, states = tf.contrib.rnn.static_rnn(cell, word_list, dtype=tf.float32)
With the CuDNN version:
cudnn_cell = tf.contrib.cudnn_rnn.CudnnGRU(num_layers=1,
                                           num_units=NUMHIDDEN,
                                           input_size=EMBEDSIZE)  # Set params
params_size_t = cudnn_cell.params_size()
params = tf.Variable(tf.ones([params_size_t]), validate_shape=False)
input_h = tf.Variable(tf.ones([1, BATCHSIZE, NUMHIDDEN]))
outputs, states = cudnn_cell(is_training=True,
                             input_data=word_list,
                             input_h=input_h,
                             params=params)
However, when I do this my model starts to predict randomly, and accuracy drops from 0.86 to 0.5.
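One thing worth checking (an untested sketch, not a confirmed fix): tf.ones gives every weight in the flat params buffer the same value, so the GRU cannot break symmetry; also, CudnnGRU consumes a single time-major 3-D tensor rather than static_rnn's list of per-step tensors:
# Untested sketch: random init instead of tf.ones, zero initial state,
# and stacking the per-step list into [max_time, batch_size, embed_size]
params = tf.Variable(tf.random_uniform([params_size_t], -0.1, 0.1),
                     validate_shape=False)
input_h = tf.Variable(tf.zeros([1, BATCHSIZE, NUMHIDDEN]))
word_tensor = tf.stack(word_list)  # list of [batch, embed] -> [time, batch, embed]
outputs, states = cudnn_cell(is_training=True,
                             input_data=word_tensor,
                             input_h=input_h,
                             params=params)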
I remember TensorFlow in the VGG-style benchmark used to take ~300s to train, and now it takes 173s. Is that due to a config or version change?
@ThomasDelteil I have been trying to re-run the mxnet example on V100s, however still end up with the same error as on the P100s:
MXNetError: [11:35:49] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered
MXNet: 1.3.0
GPU: ['Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB']
CUDA Version 9.0.176
CuDNN Version 7.0.5
Also, do you know if any further updates to MXNet have reduced the need for the boilerplate code, e.g. "Hot fixing DataLoader for multi-processing and RecordFileDataset"? And perhaps we could avoid tfrecords-style record files and just read the raw images, as with the other frameworks? It would be cool to match the conciseness of other frameworks (e.g. PyTorch).
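For example (an untested sketch, assuming the MXNet >= 1.3 Gluon API; the dataset path is hypothetical):
from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import ImageFolderDataset

# Raw images on disk, no record files; DataLoader handles multi-processing natively
dataset = ImageFolderDataset('/data/chestxray/images')  # hypothetical path
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)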
@kirnap @denizyuret Would it be possible at some point to directly compare the inference speed of Knet on a pre-trained resnet50 model similar to these notebooks?
I'm not sure if there is a converter for, say, Caffe models to Knet format?
The CNN and RNN training times are very impressive!
The current code also applies dropout at test time. The correct code is shown below. I'm also not sure whether dropout should be active when computing train accuracies.
import numpy as np
import os
import sys
import tensorflow as tf
from common.params import *
from common.utils import *
print("OS: ", sys.platform)
print("Python: ", sys.version)
print("Numpy: ", np.__version__)
print("Tensorflow: ", tf.__version__)
OS: linux
Python: 3.6.0 (default, May 9 2017, 15:45:21)
[GCC 5.4.0 20160609]
Numpy: 1.13.1
Tensorflow: 1.3.0
def create_symbol(training):
    conv1 = tf.layers.conv2d(X, filters=50, kernel_size=(3, 3), padding='same')
    relu1 = tf.nn.relu(conv1)
    conv2 = tf.layers.conv2d(relu1, filters=50, kernel_size=(3, 3), padding='same')
    relu2 = tf.nn.relu(conv2)
    pool1 = tf.layers.max_pooling2d(relu2, pool_size=(2, 2), strides=(2, 2), padding='valid')
    drop1 = tf.layers.dropout(pool1, 0.25, training=training)
    conv3 = tf.layers.conv2d(drop1, filters=100, kernel_size=(3, 3), padding='same')
    relu3 = tf.nn.relu(conv3)
    conv4 = tf.layers.conv2d(relu3, filters=100, kernel_size=(3, 3), padding='same')
    relu4 = tf.nn.relu(conv4)
    pool2 = tf.layers.max_pooling2d(relu4, pool_size=(2, 2), strides=(2, 2), padding='valid')
    drop2 = tf.layers.dropout(pool2, 0.25, training=training)
    flatten = tf.reshape(drop2, shape=[-1, 100*8*8])
    fc1 = tf.layers.dense(flatten, 512, activation=tf.nn.relu)
    drop4 = tf.layers.dropout(fc1, 0.5, training=training)
    logits = tf.layers.dense(drop4, N_CLASSES, name='output')
    return logits
def init_model(m):
    # Single-class labels, don't need dense one-hot
    # Expects unscaled logits, not output of tf.nn.softmax
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=m, labels=y)
    loss = tf.reduce_mean(xentropy)
    optimizer = tf.train.MomentumOptimizer(learning_rate=LR, momentum=MOMENTUM)
    training_op = optimizer.minimize(loss)
    return training_op
%%time
# Data into format for library
#x_train, x_test, y_train, y_test = mnist_for_library(channel_first=False)
x_train, x_test, y_train, y_test = cifar_for_library(channel_first=False)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
print(x_train.dtype, x_test.dtype, y_train.dtype, y_test.dtype)
Preparing train set...
Preparing test set...
Done.
(50000, 32, 32, 3) (10000, 32, 32, 3) (50000,) (10000,)
float32 float32 int32 int32
CPU times: user 544 ms, sys: 224 ms, total: 768 ms
Wall time: 766 ms
%%time
# Place-holders
X = tf.placeholder(tf.float32, shape=[None, 32, 32, 3])
y = tf.placeholder(tf.int32, shape=[None])
training = tf.placeholder(tf.bool)
# Initialise model
sym = create_symbol(training)
CPU times: user 76 ms, sys: 4 ms, total: 80 ms
Wall time: 78.4 ms
%%time
model = init_model(sym)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# Accuracy logging
correct = tf.nn.in_top_k(sym, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
CPU times: user 360 ms, sys: 632 ms, total: 992 ms
Wall time: 1.04 s
%%time
for j in range(EPOCHS):
    for data, label in yield_mb(x_train, y_train, BATCHSIZE, shuffle=True):
        sess.run(model, feed_dict={X: data, y: label, training: True})
    # Log accuracy on the last mini-batch of the epoch
    acc_train = sess.run(accuracy, feed_dict={X: data, y: label, training: True})
    print(j, "Train accuracy:", acc_train)
0 Train accuracy: 0.546875
1 Train accuracy: 0.484375
2 Train accuracy: 0.671875
3 Train accuracy: 0.65625
4 Train accuracy: 0.609375
5 Train accuracy: 0.765625
6 Train accuracy: 0.765625
7 Train accuracy: 0.796875
8 Train accuracy: 0.90625
9 Train accuracy: 0.734375
CPU times: user 1min 21s, sys: 11.9 s, total: 1min 33s
Wall time: 1min 20s
%%time
n_samples = (y_test.shape[0]//BATCHSIZE)*BATCHSIZE
y_guess = np.zeros(n_samples, dtype=np.int)
y_truth = y_test[:n_samples]
c = 0
pred = tf.argmax(sym, 1)  # Build the op once, outside the loop, to avoid growing the graph
for data, label in yield_mb(x_test, y_test, BATCHSIZE):
    output = sess.run(pred, feed_dict={X: data, training: False})
    y_guess[c*BATCHSIZE:(c+1)*BATCHSIZE] = output
    c += 1
CPU times: user 3.83 s, sys: 152 ms, total: 3.98 s
Wall time: 3.58 s
print("Accuracy: ", sum(y_guess == y_truth)/len(y_guess))
Accuracy: 0.770532852564
We compared some of the top deep learning frameworks using CNNs on CIFAR. It would be great if the community could vote on which problem you would like to see next. I've added some options:
cc @botev, @souptc, @YusukeSuzuki, @Yangqing, @ppwwyyxx, @miguelvr, @msalvaris, @ilkarman, @piiswrong, @soumith, @n17s
Recently MXNet's new high-level framework Gluon has been getting a lot of attention; it supports hybrid imperative and symbolic networks. Could you add tests for Gluon, and also DyNet?
Finally, thank you for your work in giving us a clear picture.
That's why MXNet is very fast in your benchmark. My code is here:
def create_symbol():
    # https://mxnet.incubator.apache.org/api/python/rnn.html
    data = mx.symbol.Variable('data')
    embedded_step = mx.symbol.Embedding(data=data, input_dim=MAXFEATURES, output_dim=EMBEDSIZE)
    gru_cell = mx.rnn.GRUCell(num_hidden=NUMHIDDEN)
    # Initialize its hidden and memory states.
    # 'begin_state' method takes an initialization function, and uses 'zeros' by default.
    begin_state = gru_cell.begin_state()
    # Unroll the cell over all MAXLEN time steps for a batch
    output, states = gru_cell.unroll(length=MAXLEN, inputs=embedded_step, merge_outputs=False)
    # output, states = gru_cell(embedded_step, begin_state)  # ***WRONG*** - runs a single time step only
    # FC out
    fc1 = mx.symbol.FullyConnected(data=output[-1], num_hidden=2)
    # Label
    input_y = mx.symbol.Variable('softmax_label')
    m = mx.symbol.SoftmaxOutput(data=fc1, label=input_y, name="softmax")
    return m
@mitmul Thank you for highlighting my typo in your PR; I wanted to highlight two further issues I am facing here.
Toggling between single and multi-gpu (4x) improves the time taken from 47min15s to 14min43s; however, for some reason the AUC also drops from 0.8028 (which matches all other examples) to 0.56. This does not happen, for example, with PyTorch. There is also a difference in validation/main/loss, which ends at 0.23 for multi-gpu but 0.15 for single-gpu.
I wondered if there has been an update to the pre-trained densenet model so that I no longer have to override CaffeFunction with a custom class to reduce the memory footprint? The custom __call__ lets me use a batch of 56 instead of 32; however, I am still not able to get the low memory footprint of the other frameworks, which lets me run a batch of 64.
Chainer: 4.1.0
CuPy: 4.1.0
Numpy: 1.14.1
GPU: ['Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB', 'Tesla V100-PCIE-16GB']
CUDA Version 9.0.176
CuDNN Version 7.0.5
I've just noticed your remark about the flags for test and train in PyTorch and TensorFlow. In fact, there is a similar thing for Lasagne which I totally forgot about. Could you change block 9 to:
%%time
# Compile functions
train_func = theano.function([X.input_var, y], [loss, accuracy], updates=updates)
pred = L.get_output(net, deterministic=True)
pred_func = theano.function([X.input_var], T.argmax(pred, axis=1))
Hi @ilkarman,
This project is really good. I want to download the VM image to run these deep learning frameworks. Can you share the download URL? Or I could buy the VM image. Thanks a lot.
Hey @ilkarman, there is mention of an incompatibility with keras >2.1.4; can this be fixed? I'd like to try the keras-mxnet backend to see if there is any difference.
Thanks!
I'm having problems matching the MXNet AUC with the other frameworks like Keras, TF, or PyTorch.
In this notebook I get an AUC of 0.73, whereas in PyTorch, for example, I get almost 0.80.
Any guidance @ThomasDelteil?
PaddlePaddle?
Let's do the multi-gpu notebooks using CUDA 9 + cuDNN 7 and updated frameworks (e.g. TF 1.6 instead of 1.4) rather than CUDA 8.
Nice repo; thanks a lot to the authors for their work. Could we also add results from Caffe for comparison?
Now that apparently DenseNet can be used - see issue.
I am sorry for asking this, because I know how much of a pain Caffe is to work with. However, you could probably use MMdnn to quickly create the networks.
Caffe2 appears to be optimised for CPU inference using Intel's MKL library. In terms of GPU training times it's one of the fastest frameworks. However, for inference I can't get much speed out of it (on both GPU and CPU).
You can see here that timings for feature extraction on a resnet-50 model are:
DL Library | Images/s GPU | Images/s CPU
---|---|---
Tensorflow | 155 | 11
MXNet(w/mkl) | 129 | 25
MXNet | 130 | 8
PyTorch | 130 | 6
CNTK | 117 | 8
Chainer | 107 | 3
Keras(TF) | 98 | 5
ONNX_Caffe2 | 75 | 6
Caffe2 | 71 | 6
Keras(CNTK) | 46 | 4
So I also tried it with a different model (PyTorch converted to ONNX), but it's still not as fast as I would expect. This is the same environment that had blazingly fast CNN training times, so I wonder if I'm just running inference in a non-optimal way?
It's an amazing framework and the results are very surprising to me.
So I've just noticed here that the default value for theano.config.floatX is float64. Thus unless you have edited your .theanorc it might have been running on float64 rather than float32. If you don't want to edit the environment file you can set the value, similar to the cuDNN flags, by adding theano.config.floatX = "float32". Also, for good practice, I suggest adding theano.config.warn_float64 = "warn".
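For example (a minimal sketch; these config attributes exist in Theano 0.9/1.0 and should be set right after import, before any graph is built):
import theano
theano.config.floatX = "float32"       # default is float64 unless .theanorc overrides it
theano.config.warn_float64 = "warn"    # warn whenever a float64 tensor is created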
The current Keras_CNTK test case is running with the channels-last format, which will lead to a performance degradation in CNTK; that's why you see the warning:
/anaconda/envs/py35/lib/python3.5/site-packages/cntk/core.py:82: RuntimeWarning: data is not C contiguous; rearrange your data/computation to avoid costly data conversions
RuntimeWarning)
Could we test with the 'channels_first' format in Keras? I manually ran it on my box, and Keras_CNTK is only around 20% slower than the native CNTK implementation.
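A minimal sketch of the switch, assuming Keras 2.x:
import keras.backend as K
K.set_image_data_format('channels_first')  # default is 'channels_last'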
@ilkarman Love the benchmarks.
It would be interesting to see which of the current platforms is able to scale the best to take advantage of cloud resources. Have you considered expanding to multi-GPU and to multi-node benchmarks?
I'm having some issues with the Chainer multi-GPU examples and I was hoping someone could give me some guidance to fix them. @Crissman, if you get a chance I would really appreciate your feedback.
First I had to truncate the batch-norm param (however, in the prototxt they are already 1e-5, so I'm not sure why they become smaller when imported):
def truncate_bn(sym):
    # Need to truncate batchnorm - eps
    for layer in list(sym._children):
        if "bn" in layer:
            if sym.__dict__[layer].eps < 1e-5:
                sym.__dict__[layer].eps = 1e-5
Second, I had to update to 4.0.0b3 to handle the average pooling layer in the pretrained model.
Third, I modified the chainer.links.caffe.CaffeFunction as noted here to save only the layers that are needed for the final computation, not all of them:
import collections
from chainer.links.caffe import CaffeFunction

class CaffeFunctionDenseNet121(CaffeFunction):
    # Standard function saves all variables so cannot use big batch
    # This lets me run BATCH of 56 over 32 - still can't get to 64
    # https://github.com/chainer/chainer/blob/master/chainer/links/caffe/caffe_function.py#L176
    def __call__(self, inputs, **kwargs):
        variables = dict(inputs)
        # Pools not to save
        # These layers are not concatenated
        _NOSAVE = set(['pool5', 'concat_5_16', 'concat_4_24', 'concat_3_12', 'concat_2_6'])
        # Forward through all layers
        for func_name, bottom, top in self.layers:
            func = self.forwards[func_name]
            # Concat ops require some previous layers that are saved
            if "concat" in func_name:
                input_vars = tuple([variables[bottom[0]], variables['data']])
            else:
                input_vars = tuple([variables['data']])
            output_vars = func(*input_vars)
            # Delete layers for concat once used
            if "concat" in func_name:
                del variables[bottom[0]]
            if not isinstance(output_vars, collections.Iterable):
                output_vars = output_vars,
            # Save to dict
            variables['data'] = output_vars[0]
            top = top[0]
            # Save for concat
            if ("pool" in top) and (top not in _NOSAVE):
                variables[top] = output_vars[0]
            elif ("concat" in top) and (top not in _NOSAVE):
                variables[top] = output_vars[0]
        return tuple([variables['data']])
With these three changes I am able to train DenseNet-121 with a batch size of 56 before exhausting my GPU VRAM; without the modification to the __call__() method I can only run 32 -> and this speeds up the model by around 30 minutes (over 5 epochs). I believe the reason it is still slower than the rest is that the batch size is still too small, but I'm not sure what else I can do. By comparison, PyTorch runs at half the memory usage this currently does.
It seems that CaffeFunction is not the preferred way of loading and fine-tuning pretrained models? I think it would be possible to copy the weights into a model that has been defined already like this; however, the names and structure are different, so it might almost be easier to write a new implementation to match the names from the pre-trained Caffe model.
Transfer learning appears quite popular, so it would be great if the above were possible. I'm not sure if I'm doing it wrong.
For some reason, when I run with multi-gpu it takes longer to complete 5 epochs. I checked that all my GPUs are used and raised an issue. I'm not sure if this is specific to the CaffeFunction?
Also, the multi-GPU methods have a much lower AUC (0.7 vs 0.8 for the other frameworks). I adopt a linear LR scaling rule and get 0.8 running Chainer single-gpu, and with all other frameworks single and multi-gpu, so I'm not sure what's happening.
I would appreciate any help finalising this notebook since, aside from the three points above, I really do like the interface.
CUDA 9 has been out for a while, and CUDA 10 is gaining support for later-generation cards. Any thought given to updating this to later CUDA and framework versions?
When processing the IMDB data for the RNN notebooks, I got an error suggesting to allow pickle.
I found the following fix works for me:
with np.load('imdb.npz', allow_pickle=True) as f:
instead of:
with np.load('imdb.npz') as f:
Hi - thanks for setting up the benchmark. I want to quickly put a performance note so folks are not surprised when seeing the perf numbers.
First, if one measures a speed difference on basic models like CNNs, something is wrong in how one uses the frameworks :)
In this case, the major performance difference comes from I/O, not the framework itself. If you look at the TensorFlow and Caffe2 examples, the data is provided with a feed approach - this is usually bad for performance; instead one should use a db or input iterator. Under the hood, this makes prefetching and other optimizations possible, which is particularly important for performance.
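To illustrate the point (a minimal sketch, assuming TF >= 1.4 and the x_train/y_train/BATCHSIZE names from the notebooks):
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(buffer_size=50000)
           .batch(BATCHSIZE)
           .prefetch(1))
X_batch, y_batch = dataset.make_one_shot_iterator().get_next()
# Build the graph on X_batch/y_batch directly instead of feeding placeholders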
Redo MXNet high_api example using .fit() to match Tensorflow and CNTK.
Should really use tf.nn.dynamic_rnn() and not tf.contrib.rnn.static_rnn(). Note the difference in input shape:
inputs: The RNN inputs. If time_major == False (default), this must be a Tensor of shape: [batch_size, max_time, ...], or a nested tuple of such elements. If time_major == True, this must be a Tensor of shape: [max_time, batch_size, ...], or a nested tuple of such elements. This may also be a (possibly nested) tuple of Tensors satisfying this property. The first two dimensions must match across all the inputs, but otherwise the ranks and other shape components may differ. In this case, input to cell at each time-step will replicate the structure of these tuples, except for the time dimension (from which the time is taken). The input to cell at each time step will be a Tensor or (possibly nested) tuple of Tensors each with dimensions [batch_size, ...].
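A minimal sketch of the swap, assuming TF 1.x and a batch-major embedded input ('embedded' stands in for the notebooks' embedding output):
cell = tf.contrib.rnn.GRUCell(NUMHIDDEN)
outputs, states = tf.nn.dynamic_rnn(cell,
                                    embedded,  # [batch_size, max_time, embed_size]
                                    dtype=tf.float32)
# static_rnn instead takes a Python list of max_time tensors, each of shape
# [batch_size, embed_size], which is the shape difference quoted above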
The way that TF does checkpointing with:
tf.estimator.train_and_evaluate(nn, train_spec, eval_spec)
seems to result in a lot of I/O lag: it saves the params to disk after every epoch, runs validation, then loads the model again and repeats.
Is there an easier way to just keep this in memory (like other frameworks, e.g. PyTorch) and just save to disk once at the end?
For example, running on a pure numpy array:
nn.train(tf.estimator.inputs.numpy_input_fn(
    fake_X,
    fake_y,
    shuffle=False,
    num_epochs=EPOCHS,
    batch_size=BATCHSIZE))
This takes 14min30s with TF and 16min52s with Keras. However, the train_and_evaluate loop takes 21min49s with TF and 20min16s with Keras.
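One workaround I'd try (an untested sketch, assuming the TF 1.x Estimator API; model_fn is hypothetical) is to checkpoint rarely instead of after every epoch:
config = tf.estimator.RunConfig(save_checkpoints_secs=None,
                                save_checkpoints_steps=1000000,  # effectively once
                                keep_checkpoint_max=1)
nn = tf.estimator.Estimator(model_fn=model_fn, config=config)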
import numpy as np
import tensorflow as tf
from tensorpack import *
from common.params import *
from common.utils import *
def create_symbol(X, training, n_classes=N_CLASSES):
    # Tensorflow requires a flag for training in dropout
    conv1 = tf.layers.conv2d(X, activation=tf.nn.relu, filters=50, kernel_size=(3, 3),
                             padding='same', data_format='channels_first')
    conv2 = tf.layers.conv2d(conv1, filters=50, kernel_size=(3, 3),
                             padding='same', data_format='channels_first')
    pool1 = tf.layers.max_pooling2d(conv2, pool_size=(2, 2), strides=(2, 2),
                                    padding='valid', data_format='channels_first')
    relu2 = tf.nn.relu(pool1)
    drop1 = tf.layers.dropout(relu2, 0.25, training=training)
    conv3 = tf.layers.conv2d(drop1, activation=tf.nn.relu, filters=100, kernel_size=(3, 3),
                             padding='same', data_format='channels_first')
    conv4 = tf.layers.conv2d(conv3, filters=100, kernel_size=(3, 3),
                             padding='same', data_format='channels_first')
    pool2 = tf.layers.max_pooling2d(conv4, pool_size=(2, 2), strides=(2, 2),
                                    padding='valid', data_format='channels_first')
    relu4 = tf.nn.relu(pool2)
    drop2 = tf.layers.dropout(relu4, 0.25, training=training)
    flatten = tf.reshape(drop2, shape=[-1, 100*8*8])
    fc1 = tf.layers.dense(flatten, 512, activation=tf.nn.relu)
    drop3 = tf.layers.dropout(fc1, 0.5, training=training)
    logits = tf.layers.dense(drop3, n_classes, name='output')
    return logits
def tower_func(x, y):
    logits = create_symbol(x, training=get_current_tower_context().is_training)
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
    loss = tf.reduce_mean(xentropy)
    # Accuracy logging
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
    tf.summary.scalar('train_accuracy', accuracy)
    return loss
def get_optimizer(learning_rate=LR, momentum=MOMENTUM):
    return tf.train.MomentumOptimizer(learning_rate, momentum)
x_train, x_test, y_train, y_test = cifar_for_library(channel_first=True)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
print(x_train.dtype, x_test.dtype, y_train.dtype, y_test.dtype)
def generator(train=True):
    if train:
        while True:
            yield from yield_mb(x_train, y_train, BATCHSIZE, shuffle=True)
    else:
        while True:
            yield from yield_mb(x_test, y_test, BATCHSIZE)
df_train = PrefetchDataZMQ(DataFromGenerator(generator(True)), 1)
df_test = FixedSizeData(DataFromGenerator(generator(False)), len(x_test) // BATCHSIZE)
trainer = SimpleTrainer()
trainer.setup_graph(
    inputs_desc=[InputDesc(tf.float32, [None, 3, 32, 32], 'image'),
                 InputDesc(tf.int32, [None], 'label')],
    input=QueueInput(df_train),
    get_cost_fn=tower_func,
    get_opt_fn=get_optimizer
)
trainer.train_with_defaults(
    callbacks=[
        PeriodicCallback(
            InferenceRunner(df_test, [ScalarStats('accuracy')]),
            every_k_epochs=10)
    ],
    steps_per_epoch=len(x_train) // BATCHSIZE,
    max_epoch=EPOCHS
)
The above code, based on tensorpack, is equivalent to Tensorflow_CNN.ipynb but runs 20% faster on my machine (CUDA 9, TF 1.5, cuDNN 7, GTX 1080).
data = cuda.to_gpu(data)
target = cuda.to_gpu(target)
Ilia,
With the latest release of Chainer version 3.1, Chainer supports cuDNN auto-tuner for convolutional networks.
After installing Chainer v3.1, please add this line after the module imports to benchmark Chainer: chainer.global_config.autotune = True
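That is (for Chainer >= 3.1):
import chainer
chainer.global_config.autotune = True  # enable the cuDNN convolution auto-tuner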
Here is our test run on AWS: https://github.com/unnonouno/DeepLearningFrameworks/blob/autotune/Chainer_CNN-autotune.ipynb
Please let me know if you have any issues with implementation.
Thanks!