hewlettpackard / dlcookbook-dlbs

Deep Learning Benchmarking Suite

Home Page: https://www.hpe.com/software/dl-cookbook

License: Apache License 2.0

Shell 8.68% Python 73.71% CMake 0.70% C++ 13.17% HTML 0.30% Dockerfile 3.32% Cuda 0.05% Makefile 0.04% C 0.04%
deep-learning neural-network benchmarking tensorflow caffe caffe2 mxnet tensorrt

dlcookbook-dlbs's Introduction

Deep Learning Benchmarking Suite

Deep Learning Benchmarking Suite (DLBS) is a collection of command line tools for running consistent and reproducible deep learning benchmark experiments on various hardware/software platforms. In particular, DLBS:

  1. Provides implementations of a number of neural networks in order to enforce apples-to-apples comparison across all supported frameworks. Supported models include various VGGs, ResNets, AlexNet and GoogleNet models. DLBS can support many more models via integration with third party benchmark projects such as Google's TF CNN Benchmarks or Tensor2Tensor.
  2. Benchmarks single node multi-GPU or CPU platforms. Supported frameworks include various forks of Caffe (BVLC/NVIDIA/Intel), Caffe2, TensorFlow, MXNet and PyTorch. DLBS also supports NVIDIA's inference engine TensorRT, for which DLBS provides a highly optimized benchmark backend.
  3. Supports inference and training phases.
  4. Supports synthetic and real data.
  5. Supports bare metal and docker environments.
  6. Supports single/half/int8 precision and uses tensor cores with Volta GPUs.
  7. Is based on a modular architecture enabling easy integration with other projects such as Google's TF CNN Benchmarks and Tensor2Tensor or NVIDIA's NVCNN, NVCNN-HVD or similar.
  8. Supports raw performance metric (number of data samples per second like images/sec).

Supported platforms

Deep Learning Benchmarking Suite was tested on various servers with Ubuntu / RedHat / CentOS operating systems with and without NVIDIA GPUs. We have had limited success running DLBS on top of AMD GPUs, but this is mostly untested. It may not work with Mac OS because some of the tools we use (for instance, sed) have a slightly different command line API there. We will fix this in one of the next releases.

Installation

  1. Install Docker and NVIDIA Docker for containerized benchmarks. Read here why we prefer to use docker and here for installing/troubleshooting tips. This is not required. DLBS can work with bare metal framework installations.

  2. Clone Deep Learning Benchmarking Suite from GitHub

    git clone https://github.com/HewlettPackard/dlcookbook-dlbs dlbs
  3. The benchmarking suite mostly uses modules from the standard Python library (Python 2.7). Optional dependencies that do not influence the benchmarking process are listed in python/requirements.txt. If they are not found, the code that uses them will be disabled.

  4. Build/pull docker images for containerized benchmarks or build/install host frameworks for bare metal benchmarks.

    1. TensorFlow
    2. BVLC Caffe
    3. NVIDIA Caffe
    4. Intel Caffe
    5. Caffe2
    6. MXNet
    7. TensorRT
    8. PyTorch

    There are several ways to get Docker images. Read here about various options including images from NVIDIA GPU Cloud. We may not support the newest framework versions due to API change.

    Our recommendation is to use docker images specified in default DLBS configuration. Most of them are docker images from NVIDIA GPU Cloud.

Quick start

Assuming a CUDA-enabled GPU is present, execute the following commands to run a simple experiment with the ResNet50 model:

git clone https://github.com/HewlettPackard/dlcookbook-dlbs.git ./dlbs   # Install benchmarking suite

cd ./dlbs  &&  source ./scripts/environment.sh                           # Initialize host environment
python ./python/dlbs/experimenter.py help --frameworks                   # List supported DL frameworks
docker pull nvcr.io/nvidia/tensorflow:18.07-py3                          # Pull TensorFlow docker image from NGC

# Benchmark TensorFlow (nvtfcnn backend) with ResNet50 and AlexNetOWT models
# on 1, 2 and 4 GPUs using mixed-precision training, and write results to log files.
python $experimenter run \
       -Pexp.framework='"nvtfcnn"' \
       -Vexp.model='["resnet50", "alexnet_owt"]' \
       -Vexp.gpus='["0", "0,1", "0,1,2,3"]' \
       -Pexp.dtype='"float16"' \
       -Pexp.log_file='"${HOME}/dlbs/logs/${exp.id}.log"'

# Parse log files, then print and write a summary to results.json.
python $logparser '${HOME}/dlbs/logs/*.log' \
       --output_file '${HOME}/dlbs/results.json'

# Parse the summary file and build a weak scaling report
# using batch time as the performance metric.
python $reporter --summary_file '${HOME}/dlbs/results.json' \
                 --type 'weak-scaling' \
                 --target_variable 'results.time'

This configuration will run 6 benchmarks (2 models times 3 GPU configurations). DLBS can support multiple benchmark backends for Deep Learning frameworks. In this particular example DLBS uses TensorFlow's nvtfcnn benchmark backend from NVIDIA, which is optimized for single/multi-GPU systems. The introduction section contains more information on what backends actually represent and which ones users should be using.
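The count of 6 comes from combining every value of one variable with every value of the other. The sketch below illustrates that expansion only. It assumes that `-V` values are JSON-encoded lists (which their quoting suggests); the `build_plan` helper and the dictionary layout are hypothetical and are not taken from the DLBS code.

```python
import itertools
import json

# Hypothetical stand-in for the -V arguments of the command above.
# Treating the values as JSON is an assumption based on how they are quoted.
variables = {
    "exp.model": '["resnet50", "alexnet_owt"]',
    "exp.gpus": '["0", "0,1", "0,1,2,3"]',
}


def build_plan(variables):
    """Expand JSON-encoded lists into one dict per combination of values."""
    names = sorted(variables)
    value_lists = [json.loads(variables[name]) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


plan = build_plan(variables)
print(len(plan))  # 2 models x 3 GPU configurations
for experiment in plan:
    print(experiment)
```

Adding a third variable with two values, for example two batch sizes, would double the plan to 12 entries under the same assumption.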

The introduction contains more examples of what DLBS can do.
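As a rough illustration of what a weak scaling report over batch time expresses, the sketch below converts batch times into throughput and compares each GPU count against ideal linear scaling from one GPU. This is not the reporter's implementation. The function names, the assumption that batch time is in milliseconds, and the sample numbers are all hypothetical.

```python
def throughput(replica_batch, num_gpus, batch_time_ms):
    """Samples per second for one benchmark (batch time assumed to be in ms)."""
    effective_batch = replica_batch * num_gpus
    return 1000.0 * effective_batch / batch_time_ms


def weak_scaling_efficiency(batch_times_ms, replica_batch):
    """Map GPU count -> throughput relative to ideal linear scaling from 1 GPU."""
    base = throughput(replica_batch, 1, batch_times_ms[1])
    return {
        n: throughput(replica_batch, n, t) / (n * base)
        for n, t in sorted(batch_times_ms.items())
    }


# Made-up batch times in ms keyed by number of GPUs; these are not measured data.
example_times = {1: 100.0, 2: 110.0, 4: 125.0}
efficiency = weak_scaling_efficiency(example_times, replica_batch=64)
for num_gpus, value in efficiency.items():
    print(num_gpus, round(value, 3))
```

With a fixed per-GPU batch, an efficiency of 1.0 means the batch time did not grow as GPUs were added.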

Documentation

We host documentation here.

More information

License

Deep Learning Benchmarking Suite is licensed under Apache 2.0 license.

Contributing

All contributors must include acceptance of the DCO (Developer Certificate of Origin). Please, read this document for more details.

Contact us

dlcookbook-dlbs's People

Contributors

insop, peholland, robertengelmann, sergey-serebryakov

dlcookbook-dlbs's Issues

validation

Hi,
I am trying to run validation benchmarking for inference and accuracy. My configuration is as follows.
System: CPU, Ubuntu 16.04
Framework: tensorflow
Model: alexnet
Action: validation of inference time and accuracy

After running experimenter.py, I got the following error. I added some logging to investigate, but the lookup of tensorflow.launcher raises a KeyError.

Below is the line of code that is getting error in launcher.py

    framework_key = 'exp.framework_family'
    if framework_key not in experiment:
        framework_key = 'exp.framework'
        print((experiment[framework_key]))
    command = [experiment['%s.launcher' % (experiment[framework_key])]]

The command and log follow.



python python/dlbs/experimenter.py run --log-level=error -Pexp.num_warmup_batches=10 -Pexp.num_batches=100 -Pexp.framework='"tensorflow"' -Pexp.docker=true -Pexp.replica_batch=16 -Pexp.gpus='""' -Vexp.model='["alexnet"]' -Pexp.docker_image='"hpe/tensorflow:cpu"' -Vexp.phase='["training"]' -Pexp.log_file='"./benchmarks/my_experiment/tf.log"' 

--------------------------------------------------------------
Experimenter pid 15397. Run this to gracefully terminate me:
	kill -USR1 15397
I will terminate myself as soon as current benchmark finishes.
--------------------------------------------------------------
Traceback (most recent call last):
  File "python/dlbs/experimenter.py", line 368, in <module>
    experimenter.execute()
  File "python/dlbs/experimenter.py", line 337, in execute
    Launcher.run(self.plan, self.__progress_file)
  File "/home/brgupta/workspace/dlbs/python/dlbs/launcher.py", line 198, in run
    command = [experiment['%s.launcher' % (experiment[framework_key])]]
KeyError: u'tensorflow.launcher'

This is a very common error in Python when a key doesn't exist, but I believe the values I pass on the command line are correct. Please help me understand this error.

Thanks in advance.
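For readers following the traceback: the failing key is built from the framework name, so the KeyError says that a parameter named tensorflow.launcher is absent from the experiment dictionary, not that the framework value itself was rejected. The sketch below reproduces that lookup with a more descriptive error. The `launcher_for` helper and both sample dictionaries are hypothetical; why the parameter is missing in this particular setup cannot be determined from the log alone.

```python
def launcher_for(experiment):
    """Return the launcher entry, or raise a KeyError that names what is missing."""
    framework_key = 'exp.framework_family'
    if framework_key not in experiment:
        framework_key = 'exp.framework'
    launcher_key = '%s.launcher' % experiment[framework_key]
    if launcher_key not in experiment:
        known = sorted(k for k in experiment if k.endswith('.launcher'))
        raise KeyError('%s is not defined; known launchers: %s' % (launcher_key, known))
    return experiment[launcher_key]


# Hypothetical experiment dictionaries, for illustration only.
complete = {'exp.framework': 'tensorflow', 'tensorflow.launcher': '/path/to/launcher.sh'}
incomplete = {'exp.framework': 'tensorflow'}

print(launcher_for(complete))
try:
    launcher_for(incomplete)
except KeyError as err:
    print('lookup failed:', err)
```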

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'fp32_vars/dense/bias/Momentum'

@sergey-serebryakov hi , I've successfully tried GPU benchmarking by using the TensorFlow image, so I wanted to try CPU benchmarking. Additionally, I have successfully tested caffe's CPU benchmark.
My command line:
python ./python/dlbs/experimenter.py run -Pexp.framework='"nvtfcnn"' -Pexp.gpus='""' -Vexp.docker=true -Pexp.log_file='"./benchmarks/my_experiment/tf_${exp.model}_${exp.replica_batch}.log"' -Vexp.model='["alexnet", "googlenet"]' -Vexp.replica_batch='[2, 4]' -Ptensorflow.docker_image='"tensorflow/tensorflow:latest"' -Pexp.docker_launcher='"nvidia-docker"'

My Docker image list is as follows:

docker images
REPOSITORY                           TAG                            IMAGE ID            CREATED             SIZE
registry.docker-cn.com/nvidia/cuda   latest                         1cc6f1613121        3 weeks ago         2.24GB
nvidia/cuda                          9.0-cudnn7-devel-ubuntu16.04   afc5ab1e9a0d        3 weeks ago         2.59GB
tensorflow/tensorflow                latest                         2054925f3b43        4 weeks ago         1.34GB
nvcr.io/nvidia/tensorflow            18.07-py3                      8289b0a3b285        5 months ago        3.34GB
bvlc/caffe                           cpu                            0b577b836386        7 months ago        1.64GB
hpe/intel_caffe                      cpu                            0b577b836386        7 months ago        1.64GB
bvlc/caffe                           gpu                            ba28bcb1294c        7 months ago        3.38GB
hpe/bvlc_caffe                       cuda9-cudnn7                   ba28bcb1294c        7 months ago        3.38GB

error:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'fp32_vars/dense/bias/Momentum': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
         [[Node: fp32_vars/dense/bias/Momentum = VariableV2[_class=["loc:@fp32_vars/dense/bias"], container="", dtype=DT_FLOAT, shape=[1000], shared_name="", _device="/device:GPU:0"]()]]

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[14086,1],3]
  Exit code:    1
--------------------------------------------------------------------------
__results.end_time__= "2018-12-04:02:46:45:780"
__results.proc_pid__= 7988
__exp.model_title__="GoogleNet"
__exp.status__="failure"
__exp.status_msg__="No results have been found in this log file"

When I remove the argument "-Pexp.docker_launcher='"nvidia-docker"'", I get a new error:
python: can't open file '.py': [Errno 2] No such file or directory
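The double-underscore lines at the end of the log above (such as `__exp.status__="failure"`) carry the benchmark parameters and results. As a minimal sketch of how such lines could be collected into a dictionary, assuming each one has the form `__key__= value` with a JSON-compatible value, consider the following. The regular expression and the `parse_log_lines` helper are illustrative and are not the DLBS log parser.

```python
import json
import re

# Matches lines such as: __exp.status__="failure"
PARAM_LINE = re.compile(r'^__(?P<key>.+?)__\s*=\s*(?P<value>.+)$')


def parse_log_lines(lines):
    """Collect __key__=value lines into a dict, decoding JSON values when possible."""
    params = {}
    for line in lines:
        match = PARAM_LINE.match(line.strip())
        if not match:
            continue  # ordinary framework output, not a parameter line
        raw = match.group('value').strip()
        try:
            params[match.group('key')] = json.loads(raw)
        except ValueError:
            params[match.group('key')] = raw  # keep undecodable values as text
    return params


sample = [
    'mpiexec detected that one or more processes exited with non-zero status',
    '__results.end_time__= "2018-12-04:02:46:45:780"',
    '__results.proc_pid__= 7988',
    '__exp.status__="failure"',
]
parsed = parse_log_lines(sample)
print(parsed)
```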

Feature request: Add PyTorch support

Through some customer conversations, it's becoming apparent that we'll want PyTorch support sooner or later. How much of a lift would it be to add PyTorch support?

Plan has not been validated. See reason (s) above.

I tried the following command, but got no result.

python ./python/dlbs/experimenter.py run\
             -Pexp.framework='"tensorflow"'\
             -Pexp.gpus='"0"'\
             -Vexp.docker=true\
             -Pexp.log_file='"./benchmarks/my_experiment/tf_${exp.model}_${exp.replica_batch}.log"'\
             -Vexp.model='["alexnet", "googlenet"]'\
             -Vexp.replica_batch='[2, 4]'\
             -Ptensorflow.docker_image='"hpe/tensorflow:cuda9-cudnn7"'
WARNING:root:Module 'matplotlib' cannot be imported, certain system information will not be available
INFO:root:Plan was built with 4 experiments
Plan was built with 4 experiments
====================== VALIDATION REPORT =======================
=========================== MESSAGES ===========================
[
    {
        "check_name": "CanRunDocker", 
        "output": "Docker version 18.09.0, build 4d60db4\n", 
        "cmd": "nvidia-docker --version", 
        "retcode": 0
    }
]
======================== FRAMEWORK STATS =======================
{
    "tensorflow": {
        "num_disabled": 0, 
        "docker_images": [
            "hpe/tensorflow:cuda9-cudnn7"
        ], 
        "num_exps": 4, 
        "num_cpu_exps": 0, 
        "num_gpu_exps": 4, 
        "num_host_exps": 0, 
        "num_docker_exps": 4
    }
}
============================ ERRORS ============================
Other errors:
[
    {
        "check_name": "DockerImageExists", 
        "output": "[]\n Error: No such image: hpe/tensorflow:cuda9-cudnn7\n", 
        "cmd": "docker inspect --type=image hpe/tensorflow:cuda9-cudnn7", 
        "retcode": 1
    }
]
========================= PLAN SUMMARY =========================
Is plan OK ................................ False
Total number of experiments (plan size).....4
Number of disabled experiments ............ 0
Number of active experiments .............. 4
Log files collisions ...................... NO
================================================================
WARNING:root:Plan has not been validated. See reason (s) above.
Plan has not been validated. See reason (s) above.
WARNING:root:If you believe validator is wrong, rerun experimenter with `--no-validation` flag.
If you believe validator is wrong, rerun experimenter with `--no-validation` flag.

My PC environment is as follows:
Ubuntu16.04
Docker version 18.09.0
NVIDIA 1080Ti

$ docker images:

REPOSITORY                           TAG                 IMAGE ID            CREATED             SIZE
registry.docker-cn.com/nvidia/cuda   latest              1cc6f1613121        2 weeks ago         2.24GB
nvcr.io/nvidia/tensorflow            18.07-py3           8289b0a3b285        5 months ago        3.34GB
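The validation report fails on the DockerImageExists check, which runs `docker inspect --type=image` against hpe/tensorflow:cuda9-cudnn7, an image that does not appear in the `docker images` listing above. The sketch below shows the same kind of check in Python 3. The `docker_image_exists` helper and its injectable `run` argument are hypothetical, and the stub runner exists only so the example works without Docker installed.

```python
import subprocess


def docker_image_exists(image, run=subprocess.call):
    """Return True if `docker inspect --type=image IMAGE` exits with code 0."""
    cmd = ['docker', 'inspect', '--type=image', image]
    try:
        return run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0
    except OSError:
        # The docker executable itself could not be started.
        return False


# Stub runner that mimics the local image list from this report (illustration only).
LOCAL_IMAGES = {
    'registry.docker-cn.com/nvidia/cuda:latest',
    'nvcr.io/nvidia/tensorflow:18.07-py3',
}


def stub_run(cmd, **kwargs):
    return 0 if cmd[-1] in LOCAL_IMAGES else 1


print(docker_image_exists('hpe/tensorflow:cuda9-cudnn7', run=stub_run))
print(docker_image_exists('nvcr.io/nvidia/tensorflow:18.07-py3', run=stub_run))
```

Dropping the `run=stub_run` argument makes the helper call the real `docker` command instead.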

No devices for CPU-only run

I am having some issues with a CPU-only run of the benchmark suite on one of my GPU-less nodes. I am trying to use the TensorFlow framework for my test, but I get an error in the log (see below). Looking at the code, it appears (from my angle at least) that tf_cnn_benchmarks.py is looking for GPU devices, of which I have none, so self.devices ends up empty, hence the divide by zero error. Perhaps I am off-base though?

I looked at several of the CPU-only commands in the Wiki guide, but none of them have been successful. In my testing, most of them still appear to require a GPU.

Environment: CentOS 7.4 64-bit
Docker Build: hpe/tensorflow:cpu
Framework: Tensorflow
Command:

python python/dlbs/experimenter.py run -Pexp.framework='"tensorflow"' -Pexp.env='"docker"' -Pexp.phase='"training"' -Pexp.gpus='""' -Pexp.model='"alexnet"' -Pexp.device_batch='"16"' -Pexp.log_file='"./benchmarks/my_experiment/tf.log"' -Ptensorflow.docker.image='"hpe/tensorflow:cpu"' -Pexp.device='"cpu"'
----------------------------
Starting framework launcher.
----------------------------
01-04-18 17:55:58 [INFO]    /root/dlbs/scripts/launchers/tensorflow_hpm.sh
__exp.framework_title__= "TensorFlow"
__tensorflow.version__= "1.0.1"
__results.start_time__= "2018-01-04:23:56:01:230"
PYTHONPATH=/workspace/tf_cnn_benchmarks:/workspace/tf_cnn_benchmarks: CUDA_VISIBLE_DEVICES=
BenchmarkCNN::__init__ time=0.056028 ms
TensorFlow:  1.0
Model:       alexnet
Mode:        training
Batch size:  0 global
Traceback (most recent call last):
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1454, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1449, in main
    bench.print_info()
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 970, in print_info
    log_fn('             %s per device' % (self.batch_size / len(self.devices)))
ZeroDivisionError: integer division or modulo by zero
__results.end_time__= "2018-01-04:23:56:02:826"
__results.proc_pid__= 44072

Thanks for putting this together by the way! I'm really excited to take a deeper dive into it once I get this issue resolved. Let me know if you need any additional information.
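The failing expression in the traceback divides the batch size by `len(self.devices)`, so an empty device list is enough to trigger the ZeroDivisionError. The sketch below isolates that arithmetic and adds an explicit guard. The `per_device_batch` function is a hypothetical illustration, not a patch for tf_cnn_benchmarks.py, and it does not explain why the device list was empty in this run.

```python
def per_device_batch(batch_size, devices):
    """Split a global batch evenly across devices, failing clearly if there are none."""
    if not devices:
        raise ValueError('no compute devices configured; expected at least one, '
                         'for example a CPU device')
    return batch_size // len(devices)


print(per_device_batch(16, ['/cpu:0']))

try:
    per_device_batch(16, [])
except ValueError as err:
    print('configuration problem:', err)
```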

SyntaxError: invalid syntax in File "/workspace/mxnet_benchmarks/benchmarks.py", line 390

Hi, here I come again. I used the MXNet framework and got a syntax error.
My command:

 python $experimenter run --log-level=warning -Pexp.framework='"mxnet"' -Pexp.gpus='0'  \
-Pexp.docker=true -Pmonitor.frequency=0.1 -Vexp.replica_batch='[16]' -Pexp.num_warmup_batches=10\
-Pexp.num_batches=10 -Pmxnet.cudnn_autotune='"false"' -Vexp.model='["alexnet"]' \
-Pexp.phase='"training"'  \
-Pexp.log_file='"${BENCH_ROOT}/mxnet/${exp.model}_${exp.effective_batch}.log"'\
-Pmxnet.docker_image='"nvcr.io/nvidia/mxnet:18.11-py3"'

error:

__results.start_time__= "2018-12-18:08:54:20:839"
  File "/workspace/mxnet_benchmarks/benchmarks.py", line 390
    except Exception, e:
                    ^
SyntaxError: invalid syntax

@sergey-serebryakov
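The line flagged in the error, `except Exception, e:`, is Python 2 syntax and is rejected by a Python 3 interpreter. The `py3` suffix in the image tag suggests that the container runs Python 3, although that is an inference from the tag rather than something the log confirms. The form below is accepted by both Python 2.6+ and Python 3; the `safe_divide` function is only a placeholder to carry the `except` clause.

```python
def safe_divide(a, b):
    """Divide a by b, returning a short message instead of raising."""
    try:
        return a / b
    except Exception as e:  # "as" form works on Python 2.6+ and Python 3
        return 'failed: %s' % e


print(safe_divide(4, 2))
print(safe_divide(1, 0))
```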

Issue running ResNet models

Hello again! I'm having difficulties running ResNet models at the moment. No matter which model I use, I always get a ValueError that the dimensions must be equal, but I can't quite track down where the discrepancy is coming from, with the exception that some of the convolutions are giving results of different dimensions. Not sure if you guys have run into this before. Here is the end of the output of my log:

BenchmarkCNN::__init__ time=0.061035 ms
TensorFlow:  1.4
Model:       resnet50
Mode:        training
Batch size:  16 global
             16 per device
Devices:     ['/cpu:0']
Data format: NHWC
Optimizer:   sgd
Variables:   replicated
Use NCCL:    False
==========
__exp.model_title__="ResNet50"
Generating model
Adding preprocessing for resnet50
Traceback (most recent call last):
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1454, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1450, in main
    bench.run()
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 995, in run
    self._benchmark_cnn()
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1035, in _benchmark_cnn
    (enqueue_ops, fetches) = self._build_model()
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1208, in _build_model
    gpu_grad_stage_ops)
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1358, in add_forward_pass_and_gradients
    self.model_conf.add_inference(network)
  File "/root/dlbs/python/tf_cnn_benchmarks/resnet_model.py", line 78, in add_inference
    dim_match=False, bottle_neck=bottle_neck)
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 527, in residual_unit
    self.top_layer = tf.nn.relu(shortcut + res)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 183, in add
    "Add", x=x, y=y, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2958, in create_op
    set_shapes_for_outputs(ret)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2209, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2159, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 627, in call_cpp_shape_fn
    require_shape_fn)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 691, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Dimensions must be equal, but are 54 and 52 for 'v0/tower_0/add' (op: 'Add') with input shapes: [16,54,55,256], [16,52,55,256].
__results.end_time__= "2018-01-19:16:21:20:504"
__results.proc_pid__= 5009

Python 3 support

Hi,

The docs state that this project is based on Python 2.7. Have you already tested it on Python 3, and should it run there as well?
