Hardware Acceleration for Machine Learning

An image identifier implemented on an FPGA to achieve fast, real-time multiple-object detection.

TEAM MEMBERS: Zuxiong Tan, Samyak Jain, Chenxi Li

Project Goals:

Find a state-of-the-art multiple-object detection model
Measure its inference performance on a GPU
Deploy the model on an FPGA DPU to achieve real-time inference
Measure the inference performance on the FPGA
Compare the GPU and FPGA performance

Make a roofline plot
Calculate the memory bandwidth of the deep learning program on the GPU and the FPGA
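
The roofline plot mentioned above bounds attainable throughput by the smaller of the device's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity. The following is a minimal sketch of that bound. The function names and the device numbers are placeholders chosen for illustration, not measurements from this project.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_gflops / bandwidth_gbs


# Placeholder device numbers, for illustration only (not measured values).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

if __name__ == "__main__":
    for intensity in (1.0, 10.0, 100.0):
        bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
        print(f"intensity={intensity:6.1f} FLOP/byte -> bound={bound:7.1f} GFLOP/s")
```

Plotting the bound against arithmetic intensity on log-log axes gives the familiar roofline shape: a sloped memory-bound segment that flattens at the ridge point.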

What is the DPU

The Xilinx® Deep Learning Processor Unit (DPU) is a programmable engine optimized for convolutional neural networks. The unit includes a high performance scheduler module, a hybrid computing array module, an instruction fetch unit module, and a global memory pool module. The DPU uses a specialized instruction set, which allows for the efficient implementation of many convolutional neural networks. Some examples of convolutional neural networks which have been deployed include VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, FPN, and many others.
The DPU IP can be implemented in the programmable logic (PL) of the selected Zynq®-7000 SoC or Zynq UltraScale+™ MPSoC devices with direct connections to the processing system (PS). The DPU requires instructions to implement a neural network and accessible memory locations for input images as well as temporary and output data. A program running on the application processing unit (APU) is also required to service interrupts and coordinate data transfers.
Source: https://www.xilinx.com/products/design-tools/ai-inference/ai-developer-hub.html#edge

DPU Development Flow (Using DNNDK)

The DPU requires a device driver which is included in the Xilinx Deep Neural Network Development Kit (DNNDK) toolchain.
The DNNDK User Guide (UG1327) describes how to use the DPU with the DNNDK tools. The basic development flow is shown in the following figure. First, use Vivado to generate the bitstream. Then, download the bitstream to the target board and install the DPU driver. For instructions on how to install the DPU driver and dependent libraries, refer to the DNNDK User Guide (UG1327): https://www.xilinx.com/support/documentation/user_guides/ug1327-dnndk-user-guide.pdf

Similar Products:

NVIDIA Deep Learning Accelerator (NVDLA):

This is a free and open architecture that promotes a standard way to design deep learning inference accelerators. NVDLA is scalable, highly configurable, and designed to simplify integration and portability. The hardware supports a wide range of IoT devices.
NVDLA overview: http://nvdla.org

Google's Tensor Processing Unit (TPU):

The TPU is tailored to machine learning applications, allowing the chip to be more tolerant of reduced computational precision, which means it requires fewer transistors per operation. TPUs power many applications at Google, including RankBrain, which improves the relevance of search results, and Street View, which improves the accuracy and quality of Google's maps and navigation.
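
To illustrate what "reduced computational precision" can mean in practice, here is a minimal sketch of symmetric 8-bit linear quantization. It is a generic textbook scheme used only as an example. It is not a description of how the TPU (or the DPU) quantizes values internally, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.25, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, f"scale={scale:.6f}", f"worst error={worst:.6f}")
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error of at most about half the scale factor per value.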
TPU overview: https://cloud.google.com/blog/products/gcp/quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip

Sprint 1

Manage to run YOLO on a GPU
Compare YOLO's performance on the GPU and on the CPU
Get an FPGA
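
A comparison like the one planned above needs a consistent way to time a single prediction on each device. The sketch below shows one possible harness. The name `time_inference` and the stand-in `fake_detector` are hypothetical, and the real YOLO call would replace the stand-in.

```python
import statistics
import time


def time_inference(run_once, warmup=1, repeats=5):
    """Return the median wall-clock seconds of `run_once` over `repeats` calls."""
    for _ in range(warmup):
        run_once()  # untimed calls to absorb one-off setup costs
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_detector():
    """Hypothetical stand-in for a YOLO forward pass on one image."""
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"median latency: {time_inference(fake_detector):.6f} s")
```

Using the median rather than the mean keeps a single slow run (for example, one delayed by another process) from skewing the result.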

Sprint 2 (Lots of work on reverse-engineering darknet YOLO)

Refactored the YOLO code obtained from https://pjreddie.com/darknet/yolo/
Rewrote YOLO with the DNNDK API
Looked into different methods to run the given C code on an FPGA:
- Use the OpenCL framework to run the code on an Intel FPGA. This can be done with the Intel FPGA SDK for OpenCL
- Convert the code into HDL to run on a Xilinx FPGA
  - Implemented the DPU in Vivado and ran some simulation tests

Results from Sprint 1

Time taken to detect objects in a single image

Prediction on BU SCC GPU: 0.925530 seconds.
Prediction on CPU (single core, Intel Core i5): 19.457083 seconds.
GPU Spec:
- Tesla P100 PCIe 16GB
- Width: 64 bits
- Clock: 33MHz
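
The two single-image timings above imply a GPU speedup of roughly 21 times over the single-core CPU run. The short sketch below makes the arithmetic explicit. The helper names are hypothetical, and the only inputs are the two measured times reported above.

```python
GPU_SECONDS = 0.925530   # BU SCC GPU, from the Sprint 1 results
CPU_SECONDS = 19.457083  # single-core Intel Core i5, from the Sprint 1 results


def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


def images_per_second(seconds_per_image):
    """Throughput implied by a per-image latency."""
    return 1.0 / seconds_per_image


if __name__ == "__main__":
    print(f"GPU speedup over CPU: {speedup(CPU_SECONDS, GPU_SECONDS):.2f}x")
    print(f"GPU throughput: {images_per_second(GPU_SECONDS):.3f} images/s")
    print(f"CPU throughput: {images_per_second(CPU_SECONDS):.3f} images/s")
```

Note that this ratio comes from one image on each device, so it is a single data point rather than an averaged benchmark.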

Sprint 3

Achieved object detection using an FPGA-based hardware accelerator
Compared the performance and power efficiency of the FPGA, GPU, and CPU

System Diagram

The graph above shows the system diagram of the design, which uses the YOLOv2 model with Darknet-19. In this design we used the CPU as the co-processor and the FPGA to accelerate the computation. The acceleration card we used is the Xilinx Alveo U200 with Xilinx ML Suite, and we developed on AWS (Amazon Web Services).

Performance

According to the graph, the GPU runs 15.5 times faster than the CPU, and the FPGA runs 4.9 times faster than the CPU.

Power efficiency

Power efficiency is defined as speed/power. By this measure the GPU is 5.89 times better than the CPU, and the FPGA is 52.6 times better than the CPU.
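
Because power efficiency is speed divided by power, the reported ratios also imply how much power each accelerator draws relative to the CPU: relative power = relative speed / relative efficiency. The sketch below works this out. It assumes the "speed" in the efficiency formula is the same quantity as the speedups quoted in the Performance section, which the source does not state explicitly, and the function name is hypothetical.

```python
# Ratios relative to the CPU, as reported in this document.
GPU_SPEEDUP, GPU_EFFICIENCY_GAIN = 15.5, 5.89
FPGA_SPEEDUP, FPGA_EFFICIENCY_GAIN = 4.9, 52.6


def relative_power(speedup, efficiency_gain):
    """Power draw relative to the CPU, from efficiency = speed / power."""
    return speedup / efficiency_gain


if __name__ == "__main__":
    gpu = relative_power(GPU_SPEEDUP, GPU_EFFICIENCY_GAIN)
    fpga = relative_power(FPGA_SPEEDUP, FPGA_EFFICIENCY_GAIN)
    print(f"GPU draws about {gpu:.2f}x the CPU's power")
    print(f"FPGA draws about {fpga:.3f}x the CPU's power")
```

Under that assumption, the GPU would draw roughly 2.6 times the CPU's power and the FPGA roughly 0.09 times, which is where the FPGA's efficiency advantage comes from despite its smaller speedup.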

User Stories:

Navigation for Robots
Surveillance
Self-driving cars using the YOLOv2 algorithm
