ML-project2-Vehicle-Detection

Overview:

Vehicle detection is an important part of self-driving car technology. The camera is the most common sensor in a self-driving car, and if we can detect the surrounding vehicles from its images, we can plan a trajectory and avoid collisions. In this project, we implement a system whose input is an image from an on-vehicle camera and whose output looks like the following:

Motivation:

In this project, we use the YOLO network to detect vehicles. This approach differs from traditional vehicle detection algorithms. In its latest version, the detection speed on a GPU can largely meet the requirements of real-time detection. Because YOLO is both important and distinctive, we chose to implement it in this project, in order to learn its technical details and to practice implementing and training neural networks.

Dataset:

In this project, we use the BDD100K dataset, which includes 100,000 images of 1280 * 720 pixels. Here is an example:

There is also a JSON label file, from which we can read the ground-truth bounding boxes:

[
   {
      "name": str,
      "timestamp": 1000,
      "category": str,
      "bbox": [x1, y1, x2, y2],
      "score": float
   }
]

According to the dataset info: box coordinates are integers measured from the top left image corner (and are 0-indexed). [x1, y1] is the top left corner of the bounding box and [x2, y2] is the lower right corner. name is the name of the video that the frame is extracted from. It is composed of two 8-character identifiers joined by '-', such as c993615f-350c682c. Candidates for category are ['bus', 'traffic light', 'traffic sign', 'person', 'bike', 'truck', 'motor', 'car', 'train', 'rider']. In the current data, all the image timestamps are 1000. In our case, we use only the attributes "bbox" and "category".
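
As a minimal sketch of reading this label format, the snippet below parses the JSON text and keeps only "category" and "bbox", grouped by "name" so that boxes of the same frame stay together. The function name `load_boxes` and the sample record (coordinates and score) are made up for illustration and are not taken from the dataset.

```python
import json


def load_boxes(label_text):
    """Parse label JSON text into {name: [(category, bbox), ...]}.

    Only "category" and "bbox" are kept per record; "name" is used
    as the grouping key, and "timestamp" and "score" are ignored.
    """
    boxes_by_name = {}
    for record in json.loads(label_text):
        x1, y1, x2, y2 = record["bbox"]  # top left and lower right corners
        entry = (record["category"], [x1, y1, x2, y2])
        boxes_by_name.setdefault(record["name"], []).append(entry)
    return boxes_by_name


# Hypothetical record in the format described above.
sample = """[
  {"name": "c993615f-350c682c", "timestamp": 1000, "category": "car",
   "bbox": [100, 200, 300, 350], "score": 1.0}
]"""
boxes = load_boxes(sample)
print(boxes)
```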

During training, we found that the annotations were too fine-grained: many labels overlapped with each other, which also made the network harder to train. We therefore preprocessed the dataset before training: for each category, we kept only the 5 objects with the largest bbox area.
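
This filtering step could look roughly like the sketch below. It assumes that the limit of 5 objects per category is applied per image, which the description above does not state explicitly, and the helper names `bbox_area` and `keep_largest` are hypothetical.

```python
from collections import defaultdict


def bbox_area(bbox):
    """Area of a box given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bbox
    return max(0, x2 - x1) * max(0, y2 - y1)


def keep_largest(objects, per_category=5):
    """Keep the `per_category` largest boxes of each category.

    `objects` is a list of (category, bbox) pairs for ONE image
    (assumption: the limit is applied per image).
    """
    by_category = defaultdict(list)
    for category, bbox in objects:
        by_category[category].append(bbox)

    kept = []
    for category, bboxes in by_category.items():
        largest = sorted(bboxes, key=bbox_area, reverse=True)[:per_category]
        kept.extend((category, bbox) for bbox in largest)
    return kept


# Seven cars of growing size and one person: the two smallest cars are dropped.
objects = [("car", [0, 0, s, s]) for s in range(1, 8)] + [("person", [0, 0, 2, 2])]
kept = keep_largest(objects)
print(len(kept))
```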

Method:

The main idea is to use a pre-trained neural network, the YOLO network, as the base model and to retrain it to make it more suitable for our task. We then apply this detection model to images as well as videos. Finally, we implement a traditional object detector based on SVM in order to compare it with YOLO.

Plan:

We divided the project into 3 parts: data processing, training the YOLO model, and evaluating the model and applying it to videos.

Name   | Work
Xi     | data processing
Martin | YOLO model
Ziyuan | applying the model

YOLO network

YOLO's name comes from "you only look once", which captures the main idea of the network: it reads in an image and predicts both the regions of the objects in the image and the category of the object in each region. Because the image only needs to pass through the neural network once, detection can be very fast.

Architecture:

Its architecture is as follows:

In our implementation, the structure is as shown in the following table:

Layer                             | Details
Inception model (first 20 layers) | well pre-trained layers, used to extract features, output size = {6 * 6}
Convolutional layer               | filter size = {3x3x1024}
Convolutional layer               | filter size = {3x3x1024-s-2} (stride 2)
Convolutional layer               | filter size = {3x3x1024}
Convolutional layer               | filter size = {3x3x1024}
Dense layer                       | size = {4096}
Dense layer                       | size = {4500}

Finally, we reshape the output of the network into a 3D tensor of size grid size * grid size * (number of classes + number of anchor boxes * 5), in our case 15 * 15 * (10 + 2 * 5) = 4500 values, as shown in the following (source: deepsystem.io):
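
To make the size arithmetic concrete, here is a small sketch that pulls the values of one grid cell out of the flat 4500-element output. The memory layout is an assumption made for illustration (row-major cells, class scores first, then 5 values per box such as x, y, w, h and confidence). The actual ordering in our network may differ, and `split_cell` is a hypothetical helper.

```python
GRID = 15          # grid size
NUM_CLASSES = 10   # number of categories
NUM_BOXES = 2      # anchor boxes per cell
CELL_DEPTH = NUM_CLASSES + NUM_BOXES * 5  # 20 values per grid cell


def split_cell(flat_output, row, col):
    """Return (class_scores, boxes) for grid cell (row, col).

    Assumed layout: cells stored row by row; inside a cell the class
    scores come first, followed by NUM_BOXES groups of 5 values.
    """
    expected = GRID * GRID * CELL_DEPTH
    if len(flat_output) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat_output)}")

    start = (row * GRID + col) * CELL_DEPTH
    cell = flat_output[start:start + CELL_DEPTH]
    class_scores = cell[:NUM_CLASSES]
    boxes = [
        cell[NUM_CLASSES + 5 * b:NUM_CLASSES + 5 * (b + 1)]
        for b in range(NUM_BOXES)
    ]
    return class_scores, boxes


# Dummy output whose values equal their own index, to make the slicing visible.
flat = list(range(GRID * GRID * CELL_DEPTH))
scores, boxes = split_cell(flat, 0, 1)
print(scores, boxes)
```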

The loss function is shown below. The main idea is to convert object detection into a regression problem:
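
For reference, the sum-squared-error loss from the original YOLO paper can be written as below. This is the paper's formulation, not necessarily the exact loss used in our implementation. Here $S$ is the grid size, $B$ the number of boxes per cell, $\mathbb{1}_{ij}^{obj}$ indicates that box $j$ in cell $i$ is responsible for an object, $C$ is the box confidence and $p_i(c)$ the class probability. Hatted symbols are predictions.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2
  + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}
      \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes}
      \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

Every term is a squared difference between a predicted and a target value, which is why detection becomes a regression problem. The paper sets $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.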

Training process

During training, we found that training such a large neural network is very time-consuming and that the network overfits very easily. We tried several remedies, including image augmentation and adding dropout, but the network still overfit in all of the following configurations:

Inception model with frozen weights. No image augmentation. Batch size 10. Steps_per_epoch 1000. Epochs 100. GPU: NVIDIA Tesla P4
Inception model with all weights set to trainable. Image augmentation. Batch size 16. Steps_per_epoch 1000. Epochs 100. GPU: NVIDIA Tesla P100
Inception model with all weights set to trainable. No image augmentation. Batch size 16. Steps_per_epoch 1000. Epochs 100. GPU: NVIDIA Tesla V100

So we used the trained darknet model provided by the author of YOLO, which produced impressive results:

Comparison with a traditional detection model

Besides the YOLO model, we built a traditional detection model based on SVM. We use a two-class dataset to train the SVM to classify {"car", "non-car"}. The main workflow is as follows:

  1. Extract features and train the SVM model on the training data.
  2. Pick a test image and slide windows of different sizes over it.
  3. Resize the window images and classify them with the trained SVM, labeling the "car" boxes.
  4. Reduce redundant boxes with non-maximum suppression.
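
Step 4 can be sketched as a greedy non-maximum suppression. This is a generic illustration, not necessarily the routine used in this project. It assumes every "car" window comes with a score (for example the SVM decision value), and the IoU threshold of 0.5 is likewise an assumed value.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, best score first.

    A box is kept only if it does not overlap an already kept,
    higher-scoring box by more than `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping windows and one separate window.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```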

In detail, we use HOG (histogram of oriented gradients) features from the YUV channels, which are frequently used in image classification problems:

And here is an example:

As we can see, an important step of the SVM detector is sliding windows of different sizes over the image, which requires many classifications for every single image and seriously reduces detection efficiency. In addition, a major disadvantage of this SVM-based detector is its poor generalization ability, which may be related to the features we chose.

Compare the detections on a clear image:
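
To give a feeling for the cost, the sketch below enumerates square sliding windows and counts how many classifier calls a single 1280 * 720 frame would need. The window sizes (64, 128, 256) and the 50% overlap are hypothetical values chosen for illustration, not the settings used in our detector.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield (x1, y1, x2, y2) for square windows that fit inside the image."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, x + window, y + window)


def count_windows(image_w, image_h, window_sizes, overlap=0.5):
    """Total number of windows (classifier calls) over all window sizes."""
    total = 0
    for window in window_sizes:
        stride = max(1, int(window * (1 - overlap)))
        total += sum(1 for _ in sliding_windows(image_w, image_h, window, stride))
    return total


# Hypothetical settings: three window sizes with 50% overlap.
n_calls = count_windows(1280, 720, window_sizes=(64, 128, 256))
print(n_calls)
```

Each window also needs its own feature extraction and SVM evaluation, whereas YOLO passes the whole image through the network once.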

SVM detector:

and YOLO detector:

YOLO detects vehicles better, even those far away from the camera. Next, we apply both detectors to "dirty" but "real" images:

SVM detector:

and YOLO detector:

Contributors:

chrishuxi, horczech
