
Obstacle Detection and Classification using Deep Learning for Tracking in High-Speed Autonomous Driving

Gowdham Prabhakar, Binsu Kailath
Electronic System Design
IIITDM Kancheepuram
Chennai, India
{eds14m006, bkailath}@iiitdm.ac.in

Sudha Natarajan, Rajesh Kumar
Autonomous Vehicle Program – R&D
Tata Elxsi Limited
Chennai, India
{sudha.n,rajesh}@tataelxsi.co.in

Abstract—On-road obstacle detection and classification is one of the key tasks in the perception system of self-driving vehicles. Since vehicle tracking involves localization and association of vehicles between frames, detection and classification of vehicles are necessary. Vision-based approaches are popular for this task because they are cost-effective and because the appearance information in vision data is useful. In this paper, a deep learning system using a region-based convolutional neural network trained with the PASCAL VOC image dataset is developed for the detection and classification of on-road obstacles such as vehicles, pedestrians and animals. The implementation of the system on a Titan X GPU achieves a processing frame rate of at least 10 fps for a VGA resolution image frame. This sufficiently high frame rate using a powerful GPU demonstrates the suitability of the system for highway driving of autonomous cars. The detection and classification results on images from KITTI and iRoads, and also from Indian roads, show that the performance of the system is invariant to object shape and view, and to different lighting and climatic conditions.

Keywords—Autonomous driving, object detection, object classification, deep learning, convolutional neural network, R-CNN

I. INTRODUCTION

A collision avoidance system is a key component of self-driving vehicles, and obstacle detection is one of its main tasks. The best-known approach to obstacle detection uses active sensors such as lidars, lasers and millimeter-wave radars. Their main advantage is that they measure distance directly using limited computing resources. However, these active sensors have many drawbacks, such as slow scanning speed and low spatial resolution. Moreover, interference among sensors of the same type creates a serious problem when a number of vehicles move closely together in the same direction. Optical sensors, such as conventional cameras, collect data in a non-intrusive way and are generally referred to as passive sensors. Cost is one of the major reasons for preferring passive sensors to active sensors. Moreover, visual information plays a key role in several applications, such as object identification, traffic sign recognition and lane detection. On the other hand, because of the variability within classes, obstacle detection is highly challenging when designed and deployed with optical sensors. Environmental conditions add a further challenge to acquiring high-quality images.

Existing vision-based on-road obstacle detection techniques [1], [2] have not yet matured because of issues such as variability in vehicle shapes, cluttered environments and illumination conditions. Deep learning [3] has shown great promise in recent years in the field of object detection and recognition. Convolutional Neural Networks (CNNs) are well suited to vision-based approaches and lend themselves to Graphics Processing Unit (GPU) acceleration in real-time applications. GPUs, originally designed for 3D modeling and rendering, are now solving classic image processing problems and provide a tremendous improvement in speed over CPU-only implementations. When deployed in the perception system of autonomous vehicles, GPUs can process video frames at a sufficiently high frame rate and facilitate high-speed driving by detecting obstacles early enough for motion planning to avoid collisions.

In this paper, we address the detection and classification of on-road objects using Faster Region-based CNN (R-CNN), a variant of CNN, and its implementation on a GPU. We employ a pre-trained network model, ZF Net, fine-tuned for the 20 object classes of the PASCAL VOC 2012 dataset [4] in our detection and classification system. During the real-time detection phase, the detections are filtered so that the system reports only the classes which correspond to on-road objects. The outputs of the system are the rectangular bounding boxes and class information of objects, which are useful parameters for motion planning of the self-driving vehicle.

This paper is organized as follows. The next section briefly describes the conventional CNN and its variant R-CNN. Section III explains the on-road obstacle detection and classification system. The GPU implementation of the system using the Caffe framework, along with the results and performance, is given in Section IV. Section V concludes the paper.

978-1-5090-6255-3/17/$31.00 ©2017 IEEE


2017 IEEE Region 10 Symposium (TENSYMP)

Fig. 1. Functionality of R-CNN. Pipeline stages: input image, extract region proposals, warped image, compute CNN features, classify regions (CNN followed by SVM; example per-class outputs: Person? No, Car? Yes, Dog? No).

II. CONVOLUTIONAL NEURAL NETWORK – BACKGROUND

CNNs [5] are multi-layer neural networks designed specifically for 2D data such as images and video. They are motivated by minimal data preprocessing requirements: they largely receive the raw input image and extract features on their own. Small portions of the image are fed to the bottom-most layer of the hierarchical structure, and the information is passed through several layers of the network. Digital filtering is performed so that the salient features of the data are obtained at each layer. The initial network layer has a feature map which is obtained as the result of the convolution process with an added bias. The next stage is a subsampling process, which typically reduces the dimensionality by performing a 2x2 averaging operation. The feature map after subsampling receives a trainable weight and bias and is then fed to an activation function. Finally, the outputs of the activation function are forwarded to a feedforward fully connected network which gives the final result of the system. The convolution and subsampling layers can be repeated in a CNN. The CNN therefore autonomously extracts salient features from images and classifies them. The weights are updated during the training process by means of backpropagation with reference to the loss function. A loss function such as the Support Vector Machine (SVM) or Softmax loss is generally used on the final Fully Connected (FC) layer.

A. Region-Based CNN

CNNs are capable of giving better mean Average Precision (mAP) in object classification but consume a lot of time when applied directly to object detection. In order to optimize the detection time as well as the training time, a modified version of CNN called R-CNN [6] was proposed, which exhibits reasonably good performance on PASCAL VOC 2012. It is made up of three modules. The first module generates a set of category-independent region proposals. In the second module, a large CNN extracts a fixed-length feature vector from each region proposal. The third module consists of a set of class-specific linear SVMs. Fig. 1 explains the working of R-CNN. To reduce the computational burden of proposal generation, Fast R-CNN and then Faster R-CNN were proposed.

B. Fast R-CNN

In Fast R-CNN, a fixed-length feature vector is extracted from the feature map for each object proposal by a Region of Interest (RoI) pooling layer. Each feature vector is fed to a set of fully connected layers, which finally branch into two sibling output layers. The first layer produces softmax probability estimates over the different object classes and the other layer outputs the bounding box coordinates.

C. Faster R-CNN

Faster R-CNN [7] is a more recent realization of R-CNN with better mAP and detection time. In this method, region proposals are made by a separate convolutional network called the Region Proposal Network (RPN). The convolutional features are shared with the detection network.

To deal with varying aspect ratios and scales of objects, anchors are introduced in the RPN. An anchor is placed at each sliding location of the convolutional maps and thus at the center of each spatial window. Each anchor is associated with an aspect ratio and a scale. The RPN can be trained end-to-end using Stochastic Gradient Descent (SGD) for both the classification and regression branches. Two kinds of training are available: (1) approximate joint training and (2) alternating training. In approximate joint training, both networks are trained simultaneously, while in alternating training the RPN is trained first and the proposals it generates are used to train Fast R-CNN. Faster R-CNN achieved 73.2% mAP on PASCAL VOC 2012 using 300 proposals per image.

III. OUR OBSTACLE DETECTION AND CLASSIFICATION SYSTEM

A. System Design

The designed system, shown in Fig. 2, employs Faster R-CNN implemented on a GPU for our application: detection and classification of on-road objects. The R-CNN was trained as follows. A ZF Net pre-trained with ImageNet was used to initialize the weights. The PASCAL VOC 2012 dataset was used to fine-tune the network for detecting its 20 object classes: bicycle, sofa, TV-monitor, boat, aeroplane, bus, cow, car, bottle, chair, sheep, dining table, bird, cat, person, dog, potted plant, horse, train and motorbike. Approximate joint
training was performed as it is faster than alternating training. Since the training set contains many non-road objects, the network would ideally be retrained with only the on-road object classes. We have not, however, retrained the network. Instead, the non-road object detections are masked in the detection phase.

Fig. 2. Block diagram of the system: trained Faster R-CNN, on-road obstacle filter, annotation/object tracking.

B. Function of the System

For real-time detection on video, each image frame is fed to the system. The frame is processed by the trained R-CNN module to obtain the bounding boxes of various class-specific objects. Since the system is designed for detecting only on-road objects, a filter masks the remaining classes so that only on-road objects are recognized. The processed image is then annotated with bounding boxes, each tagged with the respective class name on top of the detected object. The bounding box colors are listed in Table I. These bounding boxes are fed to a tracking module for motion planning of an autonomous vehicle.

TABLE I. LIST OF ON-ROAD OBSTACLES AND THEIR RESPECTIVE COLORS OF THE BOUNDING BOXES

Class name    Color of the Bounding-Box
Bicycle       Red
Bus           Blue
Car           Blue
Cat           Cyan
Cow           Cyan
Dog           Cyan
Motorbike     Red
Person        Yellow

IV. GPU IMPLEMENTATION AND RESULTS

The obstacle detection and classification system was implemented on an Ubuntu workstation with an NVIDIA GeForce GTX 980 Ti GPU. The GPU has 6GB of graphics memory and 2816 CUDA cores. The workstation is powered by an Intel i7-6700 with 16GB RAM. There are two modules in the proposed system. The main module runs on the CPU. The second module, which includes the Caffe framework of R-CNN, runs on the GPU. This framework has a C++ library with MATLAB and Python bindings used for training general-purpose convolutional neural networks and other deep networks and deploying them efficiently on commodity architectures.

A. Detection Results

The implementation was tested on a variety of datasets in different climatic conditions. Images were taken from public datasets such as KITTI [8] (size: 1392x512) and iRoads [9] (size: 640x360). Video frames from shots taken on a Bangalore road (size: 1920x1080) and a Chennai road (size: 1920x1080) from a camera on board a vehicle were also considered. Apart from these, on-road animal images (size: 1025x680, 1001x608) were also tested. Fig. 3 to Fig. 6 show the performance of the ZF Net model of Faster R-CNN on GPU. The objects in Fig. 3 were detected in 48ms for 300 object proposals. The objects in Fig. 4 were detected in 70ms for 163 object proposals. The objects in Figs. 5 and 6 were detected in around 90ms and 60ms respectively for 300 object proposals. The results show the robustness of the approach to different views of objects as well as to lighting conditions. The detection time is less than 100ms for an image of considerable size.

The detection time was also computed for images of various standard display resolutions. The bar chart in Fig. 7 displays these results. It can be seen that most of the resolutions can be processed at a frame rate of 10 fps.

B. Performance Evaluation

The performance of the system is measured in terms of detection accuracy. The commonly used metric for detection accuracy is the mean Average Precision (mAP). It is calculated by first taking the Average Precision (AP) of each class and then averaging all the APs, as defined in the equations below.

\[ \text{precision } (P) = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

\[ \text{average precision } (AP) = \frac{\sum_{\forall\, \text{True Positives}} P}{\text{True Positives}} \]

\[ \text{mean average precision } (mAP) = \frac{\sum_{\forall\, \text{Classes}} AP}{\text{Number of Classes}} \]

A detected object is a true positive only if the Intersection over Union (IoU) of the ground truth and detected bounding boxes is ≥0.5. Table II shows the mAP calculation for the Kitti_drive0005 video shot containing 153 video frames. Tables III and IV show the mAP calculation for the Chennai road dataset of 50 images and the Bangalore road dataset of 100 images respectively.
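
The true-positive criterion and the precision definitions of this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the evaluation code behind the reported results: the `(x1, y1, x2, y2)` box layout, the function names, and the assumption that the detections of a class are visited in ranked order (for example by descending confidence) while precision is accumulated are all introduced here for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detected_box, ground_truth_box, threshold=0.5):
    """A detection counts as a true positive when IoU >= 0.5."""
    return iou(detected_box, ground_truth_box) >= threshold


def average_precision(hits):
    """AP for one class.

    hits: one boolean per detection (True = true positive), in ranked
    order. The ranking is an assumption of this sketch.
    """
    true_positives = 0
    precisions = []
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            # Precision so far: TP / (TP + FP) over the first `rank` detections.
            precisions.append(true_positives / rank)
    # Sum of P over all true positives, divided by the number of true positives.
    return sum(precisions) / true_positives if true_positives else 0.0
```

For example, `average_precision([True, False, True])` averages the precisions 1/1 and 2/3 taken at the two true positives.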

Fig. 3. Person and cars detected on KITTI image

Fig. 4. Cars detected during a rainy day on iRoads image

Fig. 5. Cars, bus and person detected on Chennai and Bangalore Highways
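
The class-colored boxes in Figs. 3 to 5 come from the on-road filter and annotation step described in Section III-B. A minimal sketch of that filtering step is given here, using the class-to-color mapping of Table I. The detection layout `(class_name, score, box)` and the names `ON_ROAD_COLORS` and `filter_on_road` are hypothetical; how the implemented system represents its detections is not specified in the paper.

```python
# Class-to-color mapping taken from Table I (on-road classes only).
ON_ROAD_COLORS = {
    "bicycle": "red",
    "bus": "blue",
    "car": "blue",
    "cat": "cyan",
    "cow": "cyan",
    "dog": "cyan",
    "motorbike": "red",
    "person": "yellow",
}


def filter_on_road(detections):
    """Keep only detections whose class is listed in Table I.

    detections: iterable of (class_name, score, (x1, y1, x2, y2)) tuples
    (a hypothetical layout chosen for this sketch).
    Returns a list of dicts carrying the bounding-box color for annotation.
    """
    kept = []
    for class_name, score, box in detections:
        color = ON_ROAD_COLORS.get(class_name.lower())
        if color is None:
            continue  # non-road class such as sofa or bottle: masked out
        kept.append({"class": class_name, "score": score, "box": box, "color": color})
    return kept


# Usage with made-up detections: the sofa is masked, the car and person remain.
frame_detections = [
    ("car", 0.98, (120, 200, 340, 330)),
    ("sofa", 0.61, (10, 10, 60, 40)),
    ("person", 0.87, (400, 180, 450, 320)),
]
on_road = filter_on_road(frame_detections)
```

Masking at this stage leaves the trained 20-class network untouched, which matches the decision in Section III-A not to retrain it.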

TABLE II. MEAN AVERAGE PRECISION FOR KITTI_DRIVE0005 VIDEO

Class name    Average Precision (AP)
Bus           1
Car           0.916
Motorbike     0
Person        0.952
mAP (%)       71.7

TABLE III. MEAN AVERAGE PRECISION FOR CHENNAI ROAD VIDEO

Class name    Average Precision (AP)
Bus           0.62
Car           1
Motorbike     1
Person        1
mAP (%)       90.5


Fig. 6. Animals on road are detected along with pedestrians and cars

Fig. 7. Detection time for various image resolutions
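
The link between per-frame detection time and frame rate is simple arithmetic: a frame that takes t milliseconds allows at most 1000/t frames per second. The sketch below applies this to the detection times quoted in Section IV-A. The function name `fps_from_ms` is illustrative, the times for Figs. 5 and 6 are the approximate values given in the text, and overheads outside detection (image capture, annotation) are ignored as a simplifying assumption.

```python
def fps_from_ms(detection_time_ms):
    """Upper bound on frame rate when each frame takes detection_time_ms."""
    return 1000.0 / detection_time_ms


# Detection times reported in Section IV-A (Figs. 5 and 6 are approximate).
reported_ms = {"Fig. 3": 48, "Fig. 4": 70, "Fig. 5": 90, "Fig. 6": 60}

for figure, ms in reported_ms.items():
    print(f"{figure}: {ms} ms -> {fps_from_ms(ms):.1f} fps")
```

A detection time of 100 ms corresponds to exactly 10 fps, so every time listed here is above the 10 fps mark.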

TABLE IV. MEAN AVERAGE PRECISION FOR BANGALORE ROAD VIDEO

Class name    Average Precision (AP)
Bus           0.9805
Car           0.9209
Motorbike     1
Person        0.9805
mAP (%)       97.42

V. CONCLUSION

A vision-based object detection system for on-road obstacles was realized using Faster R-CNN and implemented on GPU. This detection is useful for estimating the trajectory of the moving vehicles and other on-road objects. The performance was evaluated on images from benchmark datasets and Indian roads. The deep learning network is found to be robust to variation in object view, lighting and climatic conditions. A frame rate of more than 10 fps was achieved in processing a video. The resulting bounding boxes of detected objects and their classes are useful for the subsequent motion planning and control subsystems of a self-driving vehicle. Although the system we developed detects almost all on-road obstacles, it sometimes fails to detect vehicles such as autos (auto-rickshaws) that are commonly seen on Indian roads but are not in the PASCAL training dataset.

Our future work will focus on improving the detection performance for Indian road scenarios. We plan to create a dataset for Indian road vehicles and retrain the network with this dataset. The performance can be further improved by using a much wider network model such as GoogLeNet. We also plan to optimize the network for on-road objects so that it can be realized on an embedded GPU platform like the Jetson TX1.

REFERENCES

[1] A. Mukhtar, L. Xia and T.B. Tang, "Vehicle detection techniques for collision avoidance systems: A review," IEEE Transactions on Intelligent Transportation Systems, Vol. 16, No. 5, Oct. 2015.
[2] S. Sivaraman and M.M. Trivedi, "Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis," IEEE Transactions on Intelligent Transportation Systems, Vol. 14, No. 4, Dec. 2013.
[3] I. Arel, D. C. Rose and T. P. Karnowski, "Deep Machine Learning - A New Frontier in Artificial Intelligence Research [Research Frontier]," IEEE Computational Intelligence Magazine, Vol. 5, No. 4, pp. 13-18, Nov. 2010.
[4] PASCAL VOC 2012 Dataset, http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html
[5] M.D. Zeiler and R. Fergus, "Visualizing and understanding convolutional neural networks," Proceedings of the European Conference on Computer Vision, 2014.
[6] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Region-Based Convolutional Networks for Accurate Object Detection and Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 38, No. 1, pp. 142-158, Jan. 2016.
[7] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2016.2577031.
[8] KITTI Dataset, http://www.cvlibs.net/datasets/kitti/raw_data.php
[9] iRoads Dataset, https://www.cs.auckland.ac.nz/~m.rezaei/Publications/iROADS%20Dataset.pdf