
Vehicle Detection in Aerial Images
Michael Ying Yang, Wentong Liao, Xinbo Li, Yanpeng Cao and Bodo Rosenhahn
Abstract
Vehicle detection in aerial images is widely used in many applications. Compared with object detection in ground-view images, vehicle detection in aerial images remains a challenging problem because of the small vehicle size and the complex background. In this paper, we propose a novel double focal loss convolutional neural network (DFL-CNN) framework. In the proposed framework, skip connections are used in the CNN structure to enhance feature learning. Also, the focal loss function is used to substitute for the conventional cross-entropy loss function in both the region proposal network (RPN) and the final classifier. We further introduce the first large-scale vehicle detection dataset, ITCVD, with ground truth annotations for all the vehicles in the scene. We demonstrate the performance of our model on the existing benchmark German Aerospace Center (DLR) 3K dataset as well as the ITCVD dataset. The experimental results show that our DFL-CNN outperforms the baselines on vehicle detection.
Introduction
Vehicle detection in aerial images is widely used in many applications, e.g., traffic monitoring, vehicle tracking for security purposes, and parking lot analysis and planning. Therefore, this topic has attracted increasing attention in both academia and industry (Gleason et al. 2011; Liu and Mattyus 2015; Chen et al. 2016). However, compared with object detection in ground-view images, vehicle detection in aerial images poses many different challenges, such as small vehicle size and complex background. See Figure 1 for an illustration.
Figure 1. Vehicle detection results on the proposed dataset.
Before the emergence of deep learning, hand-crafted features combined with a classifier were the most widely adopted approach to detecting vehicles in aerial images (Zhao and Nevatia 2003; Liu and Mattyus 2015; Gleason et al. 2011). However, hand-crafted features lack generalization ability, and the adopted classifiers need to be modified to adapt to the characteristics of the features. Some previous works also attempted to use shallow neural networks (LeCun et al. 1990) to learn features specifically for vehicle detection in aerial images (Cheng et al. 2012; Chen et al. 2014). However, the representational power of these shallow networks is limited, and their performance meets a bottleneck. Moreover, such methods localize vehicles with a sliding-window search. These sliding-window methods incur high computational cost, and the window sizes must be carefully chosen to adapt to the different sizes of objects of interest in the dataset.
In recent years, deep convolutional neural networks (DCNNs) have achieved great success in different tasks, especially object detection and classification (Krizhevsky et al. 2012; LeCun et al. 2015). In particular, the series of methods based on the region convolutional neural network (R-CNN) (Girshick et al. 2014; Girshick 2015; Ren et al. 2015) has pushed forward the progress of object detection significantly. Notably, Faster R-CNN (Ren et al. 2015) proposes the region proposal network (RPN) to localize possible objects instead of traditional sliding-window search methods, and achieves state-of-the-art accuracy on different datasets. However, these existing state-of-the-art detectors cannot be directly applied to detect vehicles in aerial images, due to the different characteristics of ground-view and aerial-view images (Xia et al. 2017). The appearance of the vehicles is monotone, as shown in Figure 1. It is difficult to learn and extract representative features to distinguish them from other objects. In particular, in a dense parking lot, it is hard to separate individual vehicles.
Moreover, the background in aerial images is much more complex than in natural scene images. For example, windows on facades or special structures on roofs are background objects that confuse detectors and classifiers. Furthermore, compared to vehicle sizes in ground-view images, the vehicles in aerial images are much smaller (ca. 50 × 50 pixels), while the images have very high resolution (normally larger than 5000 × 2000 pixels). Lastly, a large-scale and well-annotated dataset is required to train a well-performing DCNN. However, there is no public large-scale dataset, such as ImageNet (Deng et al. 2009) or ActivityNet (Caba Heilbron et al. 2015), for vehicle detection in aerial images.
To address these problems, we propose a specific framework for vehicle detection in aerial images, as shown in Figure 2. The novel framework, called the double focal loss convolutional neural network (DFL-CNN), consists of three main parts: 1) A skip connection from the shallow layer to the deep layer is added to learn features that contain rich detail information. 2) The focal loss function (Lin et al. 2017) is adopted in the RPN instead of the traditional cross-entropy loss. This modification addresses the class imbalance problem when the RPN decides whether a proposal is likely to be an object of interest or not. 3) The focal loss function also replaces the cross-entropy loss in the classifier, where it is used to handle the problem of easy positive examples and hard negative examples during training. Furthermore, we introduce a novel large-scale and well-annotated dataset for quantitative vehicle detection evaluation, ITCVD. Towards this goal, we collected
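For reference, the focal loss of Lin et al. (2017) that replaces cross-entropy in parts 2) and 3) above can be sketched as follows. This is a minimal NumPy illustration of the published binary formula, FL(p_t) = −α_t (1 − p_t)^γ log(p_t); the γ and α values shown are the defaults suggested by Lin et al., not values stated in this paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al. 2017).

    p: predicted probability of the positive class; y: label in {0, 1}.
    gamma down-weights well-classified (easy) examples so training
    focuses on hard ones; alpha balances positive/negative classes.
    """
    p = np.clip(p, 1e-7, 1.0 - 1e-7)            # numerical stability
    p_t = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ = 0 and α_t = 1 this reduces to plain cross-entropy; increasing γ shrinks the loss contribution of easy examples, which is what lets the RPN and classifier cope with the severe foreground/background imbalance in aerial images.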
Michael Ying Yang is with the Scene Understanding Group, ITC Faculty, University of Twente. Wentong Liao, Xinbo Li, and Bodo Rosenhahn are with the Institute for Information Processing, Leibniz University Hannover. Yanpeng Cao is with the School of Mechanical Engineering, Zhejiang University (corresponding author).
Photogrammetric Engineering & Remote Sensing
Vol. 85, No. 4, April 2019, pp. 297–304.
0099-1112/18/297–304
© 2019 American Society for Photogrammetry
and Remote Sensing
doi: 10.14358/PERS.85.4.297