
Vehicle Detection in Aerial Images
Michael Ying Yang, Wentong Liao, Xinbo Li, Yanpeng Cao and Bodo Rosenhahn
Abstract
Vehicle detection in aerial images is widely used in many applications. Compared with object detection in ground-view images, vehicle detection in aerial images remains a challenging problem because of the small vehicle size and the complex background. In this paper, we propose a novel double focal loss convolutional neural network (DFL-CNN) framework. In the proposed framework, skip connections are used in the CNN structure to enhance feature learning. Also, the focal loss function is used to substitute for the conventional cross-entropy loss function in both the region proposal network (RPN) and the final classifier. We further introduce the first large-scale vehicle detection dataset, ITCVD, with ground truth annotations for all the vehicles in the scene. We demonstrate the performance of our model on the existing benchmark German Aerospace Center (DLR) 3K dataset as well as the ITCVD dataset. The experimental results show that our DFL-CNN outperforms the baselines on vehicle detection.
Introduction
Vehicle detection in aerial images is widely used in many applications, e.g., traffic monitoring, vehicle tracking for security purposes, and parking lot analysis and planning. Therefore, this topic has attracted increasing attention in both academia and industry (Gleason et al. 2011; Liu and Mattyus 2015; Chen et al. 2016). However, compared with object detection in ground-view images, vehicle detection in aerial images poses many different challenges, such as small vehicle size and complex background. See Figure 1 for an illustration.
Figure 1. Vehicle detection results on the proposed dataset.
Before the emergence of deep learning, hand-crafted features combined with a classifier were the most widely adopted approach to detecting vehicles in aerial images (Zhao and Nevatia 2003; Liu and Mattyus 2015; Gleason et al. 2011). However, hand-crafted features lack generalization ability, and the adopted classifiers need to be modified to adapt to the characteristics of the features. Some previous works also attempted to use shallow neural networks (LeCun et al. 1990) to learn features specifically for vehicle detection in aerial images (Cheng et al. 2012; Chen et al. 2014). However, the representational power of these shallow networks is limited, and their performance meets a bottleneck. Moreover, such methods localize vehicles with a sliding-window search. These sliding-window methods incur high computational cost, and the window sizes must be carefully chosen to adapt to the different sizes of objects of interest in the dataset.
In recent years, deep convolutional neural networks (DCNNs) have achieved great success in different tasks, especially object detection and classification (Krizhevsky et al. 2012; LeCun et al. 2015). In particular, the series of methods based on the region convolutional neural network (R-CNN) (Girshick et al. 2014; Girshick 2015; Ren et al. 2015) has pushed forward the progress of object detection significantly. Notably, Faster R-CNN (Ren et al. 2015) proposes the region proposal network (RPN) to localize possible objects instead of traditional sliding-window search methods, and achieves state-of-the-art accuracy on different datasets. However, these existing state-of-the-art detectors cannot be directly applied to detect vehicles in aerial images, due to the different characteristics of ground-view and aerial-view images (Xia et al. 2017). The appearance of the vehicles is monotone, as shown in Figure 1. It is difficult to learn and extract representative features to distinguish them from other objects. In particular, in a dense parking lot, it is hard to separate individual vehicles.
Moreover, the background in aerial images is much more complex than in natural scene images. For example, windows on facades or special structures on roofs are background objects that confuse detectors and classifiers. Furthermore, compared to vehicle sizes in ground-view images, the vehicles in aerial images are much smaller (ca. 50 × 50 pixels), while the images have very high resolution (normally larger than 5000 × 2000 pixels). Lastly, a large-scale and well-annotated dataset is required to train a well-performing DCNN. However, there is no public large-scale dataset, such as ImageNet (Deng et al. 2009) or ActivityNet (Caba Heilbron et al. 2015), for vehicle detection in aerial images.
To address these problems, we propose a specific framework for vehicle detection in aerial images, as shown in Figure 2. The novel framework, called the double focal loss convolutional neural network (DFL-CNN), consists of three main parts: 1) A skip connection from the shallow layer to the deep layer is added to learn features that contain rich detail information. 2) The focal loss function (Lin et al. 2017) is adopted in the RPN instead of the traditional cross-entropy loss. This modification addresses the class imbalance problem when the RPN decides whether a proposal is likely to be an object of interest or not. 3) The focal loss function also replaces the cross-entropy loss in the classifier, where it is used to handle the problem of easy positive examples and hard negative examples during training. Furthermore, we introduce a novel large-scale and well-annotated dataset for quantitative vehicle detection evaluation, ITCVD. Towards this goal, we collected
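For reference, the focal loss of Lin et al. (2017) that replaces cross-entropy in parts 2) and 3) above can be sketched as follows. This is a minimal NumPy illustration of the published binary formula, FL(p_t) = −α_t (1 − p_t)^γ log(p_t); the γ and α values shown are the defaults suggested by Lin et al., not values stated in this paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al. 2017).

    p: predicted probability of the positive class; y: label in {0, 1}.
    gamma down-weights well-classified (easy) examples so training
    focuses on hard ones; alpha balances positive/negative classes.
    """
    p = np.clip(p, 1e-7, 1.0 - 1e-7)            # numerical stability
    p_t = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ = 0 and α_t = 1 this reduces to plain cross-entropy; increasing γ shrinks the loss contribution of easy examples, which is what lets the RPN and classifier cope with the severe foreground/background imbalance in aerial images.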
Michael Ying Yang is with the Scene Understanding Group, ITC Faculty, University of Twente. Wentong Liao, Xinbo Li, and Bodo Rosenhahn are with the Institute for Information Processing, Leibniz University Hannover. Yanpeng Cao is with the School of Mechanical Engineering, Zhejiang University (corresponding author).
Photogrammetric Engineering & Remote Sensing
Vol. 85, No. 4, April 2019, pp. 297–304.
0099-1112/18/297–304
© 2019 American Society for Photogrammetry
and Remote Sensing
doi: 10.14358/PERS.85.4.297