dlsys-course / assignment2-2017

(Spring 2017) Assignment 2: GPU Executor

Home Page: http://dlsys.cs.washington.edu/

Python 69.17% C++ 22.48% C 1.46% Makefile 1.24% Cuda 5.64%
gpu-kernels computation-graph deep-learning neural-nets gpu-executor

Assignment 2: GPU Graph Executor

In this assignment, we will implement a GPU graph executor that can train simple neural nets such as multilayer perceptron (MLP) models.

Our code should be able to construct a simple MLP model using the computation graph API implemented in Assignment 1, and train and test the model using either numpy or the GPU. If you implement everything correctly, you should see a clear speedup when training neural nets with the GPU executor compared to the numpy executor.

The key concepts and data structures that we will need to implement are

  • Shape inference on computation graph given input shapes.
  • GPU executor memory management for computation graph.
  • GPU kernel implementations of common operators, e.g. Relu, MatMul, Softmax.
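
To make the first item concrete, here is a minimal sketch of shape inference over a tiny graph. It is not the assignment's API: the graph encoding, the `SHAPE_RULES` table, and every function name below are hypothetical and chosen only for illustration. The idea is to walk the nodes in topological order and derive each output shape from the shapes of its inputs.

```python
# Hypothetical mini-graph: each node maps to (op_name, input_names).
# None of these names come from the assignment's autodiff.py.

def infer_matmul(a, b):
    # (n, k) x (k, m) -> (n, m)
    assert len(a) == 2 and len(b) == 2 and a[1] == b[0], "incompatible matmul shapes"
    return (a[0], b[1])


def infer_same(a, *rest):
    # Elementwise ops (add, relu, ...) keep the shape of their inputs.
    assert all(s == a for s in rest), "elementwise inputs must share a shape"
    return a


SHAPE_RULES = {"matmul": infer_matmul, "add": infer_same, "relu": infer_same}


def infer_shapes(topo_order, graph, feed_shapes):
    """Walk nodes in topological order and record each node's output shape."""
    shapes = dict(feed_shapes)
    for name in topo_order:
        if name in shapes:  # fed input, shape already known
            continue
        op, inputs = graph[name]
        shapes[name] = SHAPE_RULES[op](*(shapes[i] for i in inputs))
    return shapes


graph = {
    "h": ("matmul", ["x", "W1"]),
    "a": ("relu", ["h"]),
    "y": ("matmul", ["a", "W2"]),
}
order = ["x", "W1", "W2", "h", "a", "y"]
feeds = {"x": (1000, 784), "W1": (784, 256), "W2": (256, 10)}
shapes = infer_shapes(order, graph, feeds)
print(shapes["y"])  # (1000, 10)
```

Once every node's shape is known, an executor can size its output buffers before running any kernel, which is what makes up-front memory management possible.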

Overview of Module

  • python/dlsys/autodiff.py: Implements computation graph, autodiff, GPU/Numpy Executor.

  • python/dlsys/gpu_op.py: Exposes Python function to call GPU kernels via ctypes.

  • python/dlsys/ndarray.py: Exposes Python GPU array API.

  • src/dlarray.h: Header for GPU array.

  • src/c_runtime_api.h: C API header for GPU array and GPU kernels.

  • src/gpu_op.cu: CUDA implementation of kernels.

What do you need to do?

Understand the code skeleton and tests. Fill in the implementation wherever it is marked """TODO: Your code here""".

Only two files contain TODOs for you.

  • python/dlsys/autodiff.py
  • src/gpu_op.cu

Special note

Do not change Makefile to use cuDNN for GPU kernels.

Environment setup

  • If you don't have a GPU machine, you can use an AWS GPU instance. For AWS setup instructions, see lab1.
  • Otherwise, you need to install the CUDA toolkit (instructions) on your own machine and set the environment variables.
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    export PATH=/usr/local/cuda/bin:$PATH

Test cases

There are 12 tests in tests/test_gpu_op.py. We will grade your GPU kernel implementations based on those tests. We will also grade your implementation of shape inference and memory management based on tests/mnist_dlsys.py.

Compile

export PYTHONPATH="${PYTHONPATH}:/path/to/assignment2/python"
make

Run all tests with

# sudo pip install nose
nosetests -v tests/test_gpu_op.py
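
When a kernel test such as test_softmax fails, it can help to compare the GPU output against a plain CPU reference. The following is a standard-library sketch of a numerically stable row-wise softmax. It is not taken from tests/test_gpu_op.py, and the function name is hypothetical. Subtracting each row's maximum before exponentiating avoids overflow without changing the result.

```python
import math


def softmax_rows(matrix):
    """Row-wise softmax over a list of lists, stabilized by subtracting the row max."""
    result = []
    for row in matrix:
        row_max = max(row)
        exps = [math.exp(value - row_max) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# The second row would overflow math.exp without the max subtraction.
probs = softmax_rows([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
for row in probs:
    print(row, sum(row))
```

A GPU softmax kernel can apply the same max-subtraction per row; how the per-row reduction is parallelized is left to the implementation.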

Run neural nets training and testing with

# see cmd options with 
# python tests/mnist_dlsys.py -h

# run logistic regression on numpy
python tests/mnist_dlsys.py -l -m logreg -c numpy
# run logistic regression on gpu
python tests/mnist_dlsys.py -l -m logreg -c gpu
# run MLP on numpy
python tests/mnist_dlsys.py -l -m mlp -c numpy
# run MLP on gpu
python tests/mnist_dlsys.py -l -m mlp -c gpu

If your implementation is correct, you should see

  • a generally decreasing loss value over epochs, with a similar decrease for numpy and GPU execution
  • a dev set accuracy of about 92% for logreg and about 97% for MLP on mnist, using the parameters we provided in mnist_dlsys.py
  • GPU execution being noticeably faster than numpy. However, if you do not reuse memory across executor.run calls, your GPU execution will incur memory allocation overhead.

Profile GPU execution with

nvprof python tests/mnist_dlsys.py -l -m mlp -c gpu

If GPU memory management is done right, e.g. GPU memory is reused across executor.run calls, the cudaMalloc "Calls" count should not increase with the number of training epochs (set with the -e option).

# Run 10 epochs
nvprof python tests/mnist_dlsys.py -l -m mlp -c gpu -e 10
#==2263== API calls:
#Time(%)      Time     Calls       Avg       Min       Max  Name
# 10.19%  218.65ms        64  3.4164ms  8.5130us  213.90ms  cudaMalloc

# Run 30 epochs
nvprof python tests/mnist_dlsys.py -l -m mlp -c gpu -e 30
#==4333== API calls:
#Time(%)      Time     Calls       Avg       Min       Max  Name
#  5.80%  340.74ms        64  5.3240ms  15.877us  333.80ms  cudaMalloc
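
The reuse pattern behind those constant call counts can be sketched in plain Python. This is an assumption-laden illustration, not the assignment's executor: `CountingAllocator` stands in for the device allocator, `ReusingExecutor` is a hypothetical class, and `bytearray` plays the role of GPU memory. Buffers are allocated once per set of inferred shapes and reallocated only when the shapes change.

```python
class CountingAllocator:
    """Stand-in for a device allocator; counts calls the way nvprof counts cudaMalloc."""

    def __init__(self):
        self.calls = 0

    def alloc(self, shape):
        self.calls += 1
        size = 1
        for dim in shape:
            size *= dim
        return bytearray(4 * size)  # pretend float32 storage


class ReusingExecutor:
    """Hypothetical executor that keeps one buffer per node between runs."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.node_shapes = None
        self.buffers = {}

    def run(self, node_shapes):
        # Reallocate only when the inferred shapes change (e.g. a new batch size).
        if node_shapes != self.node_shapes:
            self.buffers = {
                name: self.allocator.alloc(shape)
                for name, shape in node_shapes.items()
            }
            self.node_shapes = dict(node_shapes)
        return self.buffers


allocator = CountingAllocator()
executor = ReusingExecutor(allocator)
node_shapes = {"h": (1000, 256), "y": (1000, 10)}
for _ in range(30):  # 30 runs with the same shapes
    executor.run(node_shapes)
print(allocator.calls)  # 2: one allocation per node, independent of the run count
```

Keying the cache on the full shape dictionary is the simplest choice; a finer-grained pool that reuses individual buffers of matching size is another option.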

Grading rubric

  • test_gpu_op.test_array_set ... 1 pt

  • test_gpu_op.test_broadcast_to ... 1 pt

  • test_gpu_op.test_reduce_sum_axis_zero ... 1 pt

  • test_gpu_op.test_matrix_elementwise_add ... 1 pt

  • test_gpu_op.test_matrix_elementwise_add_by_const ... 1 pt

  • test_gpu_op.test_matrix_elementwise_multiply ... 1 pt

  • test_gpu_op.test_matrix_elementwise_multiply_by_const ... 1 pt

  • test_gpu_op.test_matrix_multiply ... 2 pt

  • test_gpu_op.test_relu ... 1 pt

  • test_gpu_op.test_relu_gradient ... 1 pt

  • test_gpu_op.test_softmax ... 1 pt

  • test_gpu_op.test_softmax_cross_entropy ... Implemented by us.

  • mnist with MLP using numpy ... 1 pt

  • mnist with MLP using gpu ... 2 pt

Submitting your work

Please submit your assignment2.tar.gz to Catalyst dropbox under Assignment 2. Due: 5/9/2017, 5pm.

# compress
tar czvf assignment2.tar.gz assignment2/

Contributors

icemelon, zhangqiaorjc

assignment2-2017's Issues

Shape doesn't match in the gradient of SoftmaxCrossEntropyOp

Hi there.

I think the shapes don't match in the gradient of SoftmaxCrossEntropyOp.

# class SoftmaxCrossEntropyOp
def gradient(self, node, output_grad):
    grad_A = (softmax_op(node.inputs[0]) + -1 * node.inputs[1])*output_grad

The forward output of SoftmaxCrossEntropyOp has shape (1,), so I think output_grad also has shape (1,).

However, the shape of (softmax_op(node.inputs[0]) + -1 * node.inputs[1]) is not (1,).

MulOp needs to handle input_vals[0].shape != input_vals[1].shape

The shape of (softmax_op(node.inputs[0]) + -1 * node.inputs[1]) doesn't match that of output_grad.

I fixed it by adding broadcastto_op:

grad_A = (softmax_op(node.inputs[0]) + -1 * node.inputs[1]) * broadcastto_op(output_grad, node.inputs[0])
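
The mismatch can be reproduced outside the assignment code with a small standard-library sketch. All helper names here (`shape_of`, `strict_mul`, `broadcast_scalar_to`) are hypothetical and the values are made up. `strict_mul` mimics an elementwise kernel that requires both operands to have the same shape, which is an assumption about how a simple GPU multiply kernel behaves; numpy itself would broadcast silently.

```python
def shape_of(m):
    """Shape of a 1-D or 2-D nested list."""
    return (len(m), len(m[0])) if isinstance(m[0], list) else (len(m),)


def strict_mul(a, b):
    """Elementwise multiply that requires equal shapes, like a simple kernel."""
    if shape_of(a) != shape_of(b):
        raise ValueError("shape mismatch: %s vs %s" % (shape_of(a), shape_of(b)))
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def broadcast_scalar_to(value, like):
    """Expand a single gradient value to the 2-D shape of `like`."""
    rows, cols = shape_of(like)
    return [[value] * cols for _ in range(rows)]


softmax_minus_label = [[0.1, -0.3, 0.2], [0.5, -0.6, 0.1]]  # shape (2, 3)
output_grad = [1.0]                                          # shape (1,)

try:
    strict_mul(softmax_minus_label, output_grad)
except ValueError as err:
    print("without broadcast:", err)

# Broadcasting the upstream gradient first makes the shapes agree.
grad = strict_mul(
    softmax_minus_label,
    broadcast_scalar_to(output_grad[0], softmax_minus_label),
)
print(shape_of(grad))  # (2, 3)
```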
