ML-project2-Vehicle-Detection

Overview:

Vehicle detection is an important part of self-driving car technology. The camera is the most common sensor on a self-driving car, and if we can detect the surrounding vehicles in its images, we can plan a trajectory and avoid collisions. In this project, we implement a system whose input is an image from an on-vehicle camera and whose output looks like the following:

Motivation:

In this project, we use the YOLO network to detect vehicles, an approach that differs from traditional vehicle detection algorithms. In its latest version, the detection speed on a GPU can largely meet the requirements of real-time detection. We chose to implement YOLO in order to learn its technical details and to practice implementing and training neural networks.

Dataset:

In this project, we use the BDD100K dataset, which includes 100,000 images of 1280 x 720 pixels. Here is an example:

The dataset also comes with a JSON label file, from which we read the ground-truth bounding boxes:

[
   {
      "name": str,
      "timestamp": 1000,
      "category": str,
      "bbox": [x1, y1, x2, y2],
      "score": float
   }
]

According to the dataset documentation, box coordinates are integers measured from the top-left corner of the image (0-indexed). [x1, y1] is the top-left corner of the bounding box and [x2, y2] the lower-right corner. name is the name of the video from which the frame was extracted. It consists of two 8-character identifiers joined by '-', such as c993615f-350c682c. Candidates for category are ['bus', 'traffic light', 'traffic sign', 'person', 'bike', 'truck', 'motor', 'car', 'train', 'rider']. In the current data, all the image timestamps are 1000. In our case, we use only the attributes "bbox" and "category".

During training, we found that the annotations were very fine-grained, so many labels overlapped one another, which also makes the network harder to train. We therefore preprocessed the dataset before training: for each category, we kept only the 5 objects with the largest bbox area.
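The filtering step can be sketched in a few lines of Python. This is a minimal sketch, not the project's actual preprocessing script: the function names are hypothetical, and it assumes the limit of 5 applies per image, with "name" identifying the image.

```python
from collections import defaultdict


def bbox_area(bbox):
    """Area of a box given as [x1, y1, x2, y2] (top-left, lower-right)."""
    x1, y1, x2, y2 = bbox
    return max(0, x2 - x1) * max(0, y2 - y1)


def keep_largest_per_category(labels, k=5):
    """Keep the k labels with the largest bbox area for each category.

    Assumption: labels are grouped by (name, category), i.e. the limit
    is applied separately to every image.
    """
    groups = defaultdict(list)
    for label in labels:
        groups[(label["name"], label["category"])].append(label)

    kept = []
    for group in groups.values():
        # Largest boxes first, then truncate to k entries.
        group.sort(key=lambda lab: bbox_area(lab["bbox"]), reverse=True)
        kept.extend(group[:k])
    return kept
```

Sorting each group is enough here because a single image rarely holds more than a few dozen boxes of one category.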

Method:

The main idea is to use a pre-trained neural network, YOLO, as the base model and to retrain it so that it is better suited to our task. We then apply this detection model to both images and videos. Finally, we implement a traditional object detector based on an SVM in order to compare it with YOLO.

Plan:

We divided the whole project into 3 parts: processing the data, training the YOLO model, and evaluating the model and applying it to videos.

Name	Work
Xi	data process
Martin	YOLO model
Ziyuan	apply model

YOLO network

YOLO stands for "you only look once", which captures the main idea of the network: it reads in an image and predicts both the regions that contain objects and the category of the object in each region. Because the image passes through the neural network only once, detection can be very fast.

Architecture:

Its architecture is as follows:

In our implementation, the structure is shown in the following table:

Layer	Details
Inception model (first 20 layers)	pre-trained layers used to extract features, output size = {6 * 6}
Convolutional layer	filter size = {3x3x1024}
Convolutional layer	filter size = {3x3x1024-s-2}
Convolutional layer	filter size = {3x3x1024}
Convolutional layer	filter size = {3x3x1024}
Dense layer	size = {4096}
Dense layer	size = {4500}

Finally, we reshape the output of the network into a 3D tensor of size grid size * grid size * (number of classes + number of anchor boxes * 5), in our case 15 * 15 * (10 + 2 * 5), as shown in the following (source: deepsystem.io):

The loss function is shown below. The main idea is to treat object detection as a regression problem:
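To make the shape arithmetic concrete, here is a small NumPy sketch that reshapes the 4500 outputs of the last dense layer into the 15 * 15 * 20 tensor and splits each cell into class scores and boxes. The per-cell layout (class scores first, then the two boxes, each as x, y, w, h, confidence) and the function name are assumptions for illustration; the trained network may order the values differently.

```python
import numpy as np

GRID = 15          # grid cells per side
NUM_CLASSES = 10   # BDD100K categories
NUM_BOXES = 2      # anchor boxes per grid cell
BOX_VALUES = 5     # assumed: x, y, w, h, confidence


def split_output(flat):
    """Reshape a flat network output into per-cell class scores and boxes.

    Assumed layout per cell: [class scores | box 1 | box 2].
    """
    depth = NUM_CLASSES + NUM_BOXES * BOX_VALUES  # 10 + 2 * 5 = 20
    grid = np.asarray(flat).reshape(GRID, GRID, depth)
    class_scores = grid[..., :NUM_CLASSES]
    boxes = grid[..., NUM_CLASSES:].reshape(GRID, GRID, NUM_BOXES, BOX_VALUES)
    return class_scores, boxes


flat = np.arange(GRID * GRID * 20, dtype=np.float32)  # stand-in for a real prediction
class_scores, boxes = split_output(flat)
print(class_scores.shape, boxes.shape)  # (15, 15, 10) (15, 15, 2, 5)
```

Note that 15 * 15 * 20 = 4500, which matches the size of the final dense layer in the table above.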
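For reference, the sum-of-squares loss from the original YOLO paper has the form below. This is the published formulation, not necessarily the exact loss used in this project. Here S is the grid size (15 in our case), B the number of boxes per cell (2 in our case), C the box confidence, p(c) the class probability, and the indicator 1^{obj}_{ij} selects the box j in cell i that is responsible for an object.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left( C_i - \hat{C}_i \right)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

Every term is a squared error, which is why detection can be trained as a single regression problem. The weights \lambda_{\text{coord}} and \lambda_{\text{noobj}} balance localization errors against the many cells that contain no object.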

Training process

During training, we found that training such a large neural network is very time-consuming and that the network overfits very easily. We tried several remedies, including image augmentation and adding dropout, but the network still overfit:

Inception model with frozen weights. No image augmentation. Batch size 10. Steps_per_epoch 1000. Epochs 100. GPU: NVIDIA Tesla P4

Inception model with all weights set to trainable. Image augmentation. Batch size 16. Steps_per_epoch 1000. Epochs 100. GPU: NVIDIA Tesla P100

Inception model with all weights set to trainable. No image augmentation. Batch size 16. Steps_per_epoch 1000. Epochs 100. GPU: NVIDIA Tesla V100

So we used the pre-trained Darknet model provided by the author, which gave impressive results:

Comparison with a traditional detection model

Besides the YOLO model, we built a traditional detection model based on an SVM. We use a two-class dataset to train the SVM to classify {"car", "non-car"}. The main workflow is as follows:

Extract features and train the SVM model on the training data.

Slide windows of different sizes over the test image.

Resize each window image, classify it with the trained SVM, and label the "car" boxes.

Reduce redundant boxes with non-maximum suppression.

In detail, we use HOG (histogram of oriented gradients) features computed from the YUV channels, which are widely used in image classification problems:
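The window generation and the final suppression step of the workflow above can be sketched with NumPy as follows. This is an illustrative sketch only: the window sizes, the 50% overlap, the IoU threshold, and the function names are assumptions, not the values used in the project.

```python
import numpy as np


def sliding_windows(img_w, img_h, window_sizes=(64, 96, 128), overlap=0.5):
    """Yield square windows (x1, y1, x2, y2) of several sizes across the image."""
    for size in window_sizes:
        step = max(1, int(size * (1 - overlap)))
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                yield (x, y, x + size, y + size)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the boxes that are kept."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if len(boxes) == 0:
        return []
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[best], x2[rest]) - np.maximum(x1[best], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[best], y2[rest]) - np.maximum(y1[best], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the best box too much.
        order = rest[iou <= iou_threshold]
    return keep
```

In the full detector, each window would be resized, turned into a feature vector, and scored by the SVM; the windows classified as "car" and their scores are then passed to `non_max_suppression`.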
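To illustrate what such a feature vector contains, here is a deliberately simplified NumPy sketch. It converts RGB to YUV with the BT.601 weights and builds per-cell histograms of gradient orientation for each channel. It omits the block normalization of full HOG, and the cell size, bin count, and function names are assumptions; a library implementation would normally be used in practice.

```python
import numpy as np


def rgb_to_yuv(rgb):
    """Convert an HxWx3 RGB array (floats in [0, 1]) to YUV (BT.601 weights)."""
    m = np.array([[0.299, 0.587, 0.114],
                  [-0.14713, -0.28886, 0.436],
                  [0.615, -0.51499, -0.10001]])
    return rgb @ m.T


def hog_channel(channel, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation (no block normalization)."""
    gy, gx = np.gradient(channel.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    bin_idx = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)
    rows, cols = channel.shape[0] // cell, channel.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            # Each pixel votes for its orientation bin, weighted by gradient magnitude.
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=magnitude[sl].ravel(),
                                     minlength=bins)
    return hist.ravel()


def hog_yuv(rgb, cell=8, bins=9):
    """Concatenate the simplified HOG features of the Y, U and V channels."""
    yuv = rgb_to_yuv(rgb)
    return np.concatenate([hog_channel(yuv[..., i], cell, bins) for i in range(3)])
```

For a 64 x 64 window with these settings, the result has 8 * 8 * 9 * 3 = 1728 values, which would be the input vector for the SVM.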

And here is an example:

As we can see, an important step of the SVM detector is sliding windows of different sizes over the image, which requires many classifications for every single image and seriously reduces detection efficiency. In addition, a major disadvantage of this SVM-based detector is its poor generalization ability, which may be related to the features we chose.

Compare the results on a clear image:

SVM detector:

and YOLO detector:

YOLO detects vehicles better, even those far away from the camera. The difference also shows on "dirty" but "real" images:

SVM detector:

and YOLO detector:
