A CNN-Based Method of Vehicle Detection from Aerial Images Using Hard Example Mining

Recently, deep learning techniques have come to play a practical role in vehicle detection. While much effort has been spent on applying deep learning to vehicle detection, the effective use of training data has not been thoroughly studied, although it has great potential for improving training results, especially when training data are sparse. In this paper, we proposed using hard example mining (HEM) in the training process of a convolutional neural network (CNN) for vehicle detection in aerial images. We applied HEM to stochastic gradient descent (SGD) to choose the most informative training data by calculating the loss values in each batch and employing the examples with the largest losses. In each iteration, we selected 100 examples for training out of batches of either 500 or 1000 examples, and we tested different ratios of positive to negative examples in the training data to evaluate how this balance affects performance. In all cases, our method outperformed plain SGD. In experiments on images of New York, the F1 score of our method was 0.02 higher than that of a CNN trained with plain SGD.


Introduction
Recently, vehicle detection methods have achieved very high performance owing to deep learning techniques; moreover, many more sources of high-resolution aerial and satellite images have become available and affordable. Worldview3 by Digital Globe [1] provides images with a resolution of 0.3 m per pixel, and now many startup companies such as Planet Labs [2] and Black Sky [3] plan to launch small satellites and provide images with a resolution typically around one meter per pixel. For aerial images, in Japan, NTT Geospace [4] provides aerial images that cover 83% of Japan and updates them frequently. In this context, vehicle detection is now being applied to practical issues such as traffic volume surveys and the estimation of economic activity on the ground.
Research and development of object detection techniques have progressed significantly in recent years owing to advances in deep learning techniques, in particular the convolutional neural network (CNN). Region-based CNN (R-CNN) [5] was one of the earliest algorithms to employ a CNN for object detection and to demonstrate its great capability. In R-CNN, image regions that possibly contain target objects (called "region proposals") are chosen by a selective search algorithm [6], and then a CNN is applied to map target objects in the region proposals. Following R-CNN, many descendants have been proposed. Fast R-CNN [7] and the Spatial Pyramid Pooling network (SPP-net) [8] have improved accuracy and runtime over R-CNN by utilizing an RoI pooling layer (a special case of the spatial pyramid pooling (SPP) layer) and an SPP layer, respectively. They compute a feature map from an entire image only once, and by utilizing the RoI pooling layer or SPP layer, and the candidate weak classifiers in the next stage were imposed on well classifying those weighted hard examples. While their attempt succeeded, they did not focus on improving the feature learning part in terms of the effective use of training data, which is more straightforward.
In this paper, we proposed the application of HEM to the feature learning process of a CNN model for vehicle detection from high-resolution aerial images.

Methodology
We applied HEM to stochastic gradient descent (SGD), a commonly used algorithm in deep learning training. Specifically, we used a large batch size and, in each batch, calculated the loss values and employed only the examples with the largest loss values for training. In this way, we could always use the most informative examples for training and thereby improve accuracy.
The details are as follows. In Section 2.1, we introduce our basic methodology and its drawbacks. We first introduce our vehicle detection steps and then explain the characteristics of SGD where there is room for improvement. In Section 2.2, we briefly introduce the related studies of HEM and explain the details of our method. In Section 2.3, we explain our method of accuracy assessment for the experiments in this paper.

Basic Methodology
In this paper, we used a simple sliding window method for vehicle detection. Candidate bounding boxes were scattered densely over an entire image, and then those containing no vehicles were screened out. HEM was applied to the training of the CNN used for the screening. The CNN architecture employed was also simple. We employed a simple sliding window method and CNN architecture because we mainly focused on the effectiveness of our HEM method. Our HEM method is easily scaled, for instance, by replacing the CNN architecture with a richer one, such as the model used in [18].
Our HEM method was actually a variant of Online Hard Example Mining (OHEM) [22], which was originally designed for Fast R-CNN, but required modifications to suit our method as our training process and the Fast R-CNN training process were different. (The details are described in Section 2.2.)

Vehicle Detection Methodology
We structured the algorithm based on the method of [12]. Figure 1 shows our CNN architecture.
Remote Sens. 2018, 10, 124
In Figure 1, CONV, BN, ReLU, POOL, and FC represent the convolutional layer, batch normalization layer, Rectified Linear Unit layer, pooling layer, and fully connected layer, respectively. While we simplified the CNN architecture of [12], we added batch normalization [23] layers to accelerate the learning process.
Figure 1. The CNN architecture employed in this study. We introduced batch normalization layers to accelerate learning.

The window size was set so that it finally became 50 pixels before classification, which was large enough to cover a typical vehicle size (see details of our data in Section 3.1). The detailed vehicle detection steps are as follows:

• Threshold a test image by pixel intensity greater than 60 or less than 100 and calculate gradient images, yielding three gradient images (Figure 2).
• Generate sliding windows that overlap each other on half of width and height (Figure 3).
• Move centers of the windows to geometric centers, which represent possible positions of objects in windows. Geometric centers are calculated as Equation (1):

$$\mathbf{g}_{\mathrm{center}} = \frac{1}{S} \sum_{i=1}^{W} \sum_{j=1}^{H} I_{i,j}\, \mathbf{p}_{i,j} \qquad (1)$$

where $\mathbf{g}_{\mathrm{center}}$ is a vector that expresses the pixel position of a geometric center; W and H are the width and height of a window patch, respectively (both are equal to the window size); vector $\mathbf{p}_{i,j}$ is a pixel position (i, j) (1 ≤ i ≤ W, 1 ≤ j ≤ H); $I_{i,j}$ is the gradient intensity value of a pixel (i, j); and S is the sum of the gradient intensity values at all pixels in a window patch (Figure 4a,b).
• Enlarge them by a factor of √2, and move them to the new geometric centers (Figure 4c,d).
• Discard unnecessary windows that were close to the others. We regarded the windows whose centers were within a distance of 0.15 of window size as unnecessary (Figure 5).
• Apply a CNN to RGB pixels in the windows remaining after the above steps.
• Examine if the windows had overlapping windows with more than 0.5 of IoU from the highest probability of vehicle existence to the lowest. If a window had overlapping windows, the overlapping windows were discarded (this is called non-maximum suppression).

We did not use any meta information such as shadow directions. In terms of sliding window accuracy, we evaluated the negative impact of clutter and shadows in Appendix A.
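Two of the steps above, the geometric center of Equation (1) and the final non-maximum suppression, can be sketched in a few lines of NumPy. This is only an illustrative sketch under our own assumptions, not the implementation used in the experiments: the function names, the (x_min, y_min, x_max, y_max) box format, and the fallback to the patch middle when S = 0 are all hypothetical choices.

```python
import numpy as np

def geometric_center(grad_patch):
    """Intensity-weighted centroid of a gradient patch, following Equation (1).

    grad_patch: 2-D array of non-negative gradient intensities, shape (H, W).
    Returns (x, y) in 1-based pixel coordinates.
    """
    h, w = grad_patch.shape
    s = grad_patch.sum()
    if s == 0:
        # Assumption: for a flat patch (S = 0) fall back to the patch middle.
        return (w + 1) / 2.0, (h + 1) / 2.0
    ys, xs = np.mgrid[1:h + 1, 1:w + 1]  # pixel positions p_{i,j}
    return (xs * grad_patch).sum() / s, (ys * grad_patch).sum() / s

def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if its IoU with every already kept box does not exceed the threshold."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[m]) <= iou_threshold for m in keep):
            keep.append(k)
    return keep
```

For example, `non_maximum_suppression(boxes, scores)` returns the indices of the surviving windows, so the remaining detections are `[boxes[k] for k in keep]`.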


Stochastic Gradient Descent (SGD) and Room for Improvement
SGD is an algorithm for optimizing parameters in machine learning that is commonly used in deep learning. First, we explain gradient descent (also called batch gradient descent), on which SGD is based. In machine learning, parameters are optimized by minimizing the objective function (also often called the loss function). In gradient descent, a parameter is updated by

$$\theta \leftarrow \theta - \alpha \nabla_{\theta} J(\theta) \qquad (2)$$

where θ is the parameter, α is the learning rate, J is the objective function, and its derivative ∇_θ J(θ) is called the gradient. In gradient descent, gradients are calculated over all examples in the training data and used to update θ [24,25]. This is repeated until convergence. However, this becomes inefficient or infeasible when the training dataset is huge [24,25].

Hence, in SGD, a small number of examples (called a minibatch) are sampled from the entire training dataset and used for training. Minibatches are sampled randomly, because giving training data in some meaningful order can bias gradients and lead to poor convergence [25]. Specifically, all of the training data are first shuffled [25] and partitioned (usually equally) into minibatches, and then each minibatch is processed for optimization in order. SGD relies on the assumption that each minibatch approximates the entire training dataset well [24]. Strictly speaking, this procedure should be called minibatch gradient descent, and SGD originally meant using only a single training example [24]; however, we use the term SGD as it is commonly used in a deep learning context. Processing one minibatch is called an iteration, and processing the entire dataset is called an epoch. Training continues over epochs until convergence.

Now let the weight variables of our model be W, minibatch input data be X, labels (the numbers which express the classes) of X be T, and loss function be L(W, X, T). Note that if x and t are single examples of X and T, respectively, L(W, X, T) must be the summation of all L(W, x, t) [26].

Given concrete input X and labels T, we can regard X and T as constants of L; therefore, L is regarded as a function of W. Taking θ = W and J = L, we can interpret Equation (2) as follows: it updates W so that the loss function becomes smaller. As a consequence, the model becomes able to classify input well. The gradient of each weight variable is calculated by propagating derivatives from the tail to the head of the model based on the chain rule, which is called back propagation [26]. Conversely, calculating the output or loss function of a model when given an input is called forward propagation.
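The shuffle, partition, and update cycle described above can be illustrated with a minimal NumPy sketch. It is not the training code of this paper: the function names `minibatch_sgd` and `mse_grad`, the default hyperparameters, and the toy linear model with a mean squared error loss are assumptions chosen only to keep the example self-contained.

```python
import numpy as np

def minibatch_sgd(grad_fn, w, X, T, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Plain minibatch SGD: W <- W - alpha * gradient, one minibatch at a time.

    grad_fn(w, X_batch, T_batch) must return the gradient of the loss
    with respect to w for the given minibatch.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):                      # one pass over the data = one epoch
        order = rng.permutation(n)               # shuffle all training data
        for start in range(0, n, batch_size):    # one minibatch = one iteration
            idx = order[start:start + batch_size]
            w = w - lr * grad_fn(w, X[idx], T[idx])
    return w

def mse_grad(w, X, T):
    """Gradient of the mean squared error of the toy linear model X @ w."""
    return 2.0 * X.T @ (X @ w - T) / X.shape[0]

# Usage on synthetic, noise-free data (hypothetical toy problem).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
w_true = np.array([2.0, -3.0])
w_fit = minibatch_sgd(mse_grad, np.zeros(2), X_toy, X_toy @ w_true,
                      lr=0.1, batch_size=20, epochs=200)
```

Because every example in a minibatch contributes to the averaged gradient, a few examples with large losses have limited influence on each update, which is the behavior discussed next.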
As training progresses, most of the loss values in a minibatch become very small. However, there are still some examples whose loss values are relatively large. We can find analogs of these in the test results. When we conduct vehicle detection with a trained classifier, many of the bounding boxes are classified correctly, but some can still be misclassified. These are sometimes called hard examples. For instance, they may have vehicle-like features that are difficult to discriminate (see Figure 6 for examples). Such examples are likely to give clues for discriminating confusing features; therefore, utilizing them in the training process is expected to yield better accuracy. However, they do not contribute sufficiently to learning in ordinary SGD. As described above, most loss values in a minibatch become very small as training progresses, and gradients calculated over a minibatch are aggregated and averaged. This means that the few informative examples are diluted by the much larger part of the minibatch that does not contribute to improving accuracy. In this way, hard examples contribute little to learning. To address this, we needed to choose the informative examples and preferentially use them for training, which is called hard example mining.


Hard Example Mining (HEM) in SGD Training
In HEM, hard examples, which are difficult to classify correctly, are weighted more than other examples for training. Typically, examples are selected as hard if the current classifier has difficulty classifying them correctly. HEM has been conventionally used in machine learning, e.g., for SVM training. For pedestrian detection, Dalal and Triggs [13] searched for hard examples with a preliminarily trained detector and additionally used them for training a final detector. Felzenszwalb et al. [27] iteratively updated the training data subset by discarding easy examples that were correctly classified beyond the current classifier's margin and adding hard examples that violated the current classifier's margin. Using a non-SVM method, Tang et al. [16] adopted a cascade of boosted classifiers of shallow decision trees as the final classification part of their vehicle detection method. In each stage of their Real AdaBoost [21] training, a weak classifier that best classified the training data was selected as part of the final classifier. The misclassified examples were weighted, and the candidate weak classifiers in the next stage were forced to focus on classifying those weighted hard examples correctly.
In object detection by deep learning, a heuristic method has been previously used. In Fast R-CNN [7] and SPP-net [8], when sampling reference background patches for training data, if the IoU between a background patch and a foreground patch is lower than 0.1, the sampled background patch is excluded from the training data, because such a patch is easily classified as background and is therefore not a hard example. If a background patch overlaps a foreground patch to a large extent, such as when the IoU is much higher than 0.1, the patch is chosen as training data, because it is likely to be confused with the foreground and is therefore useful as a hard example. This improves accuracy to some extent but is suboptimal, as there could be some hard examples among the excluded patches.
To address this, Shrivastava et al. [22] proposed Online Hard Example Mining (OHEM). In OHEM, the loss values of all region proposals in an image are calculated by the current classifier and only examples with the largest losses are picked for a minibatch. OHEM further improved accuracy over the heuristic method.
However, we could not directly apply OHEM to our method because OHEM is designed for Fast R-CNN, whose training process is different from ours. In Fast R-CNN training, an image is randomly selected from all training images, region proposals are calculated in the image, and 64 of them are selected for a minibatch (in practice, their minibatch consists of 128 examples from two images). As Fast R-CNN employs RoI pooling (in which the feature map is calculated from an entire image only once and region proposals are classified by projecting each of them onto the feature map), this image-wise training is effective. OHEM replaces the selection of region proposals for a minibatch and also benefits from RoI pooling in terms of effective computation. Meanwhile, in our algorithm proposed in Section 2.1.1, we preliminarily extracted patches from all training images and sampled minibatches randomly from them. We therefore needed to adapt OHEM to our training procedure.
Here we explain our method in detail. Figure 7 shows an overview of the algorithm.
In this way, only the examples with the largest losses, which are the most informative ones, are always used for training. We expect our method to promote the learning of finer features, and it should also find the optimal balance of positive and negative examples in the training examples, as in [22]. Recall that SGD is based on the assumption that each minibatch approximates the entire training dataset well, as described in Section 2.1.2. From this viewpoint, our proposed method approximates checking the loss values of all the training data and selecting only the most informative examples for training in each iteration.
In plain SGD, the entire training dataset is split into minibatches and each minibatch is used for training, which means that all the examples in the training data are used in one epoch. In the proposed method, however, only a part of the examples is used for training in one epoch, because we select only the examples with the largest loss values. Therefore, we compared plain SGD and the proposed method at the same number of iterations, not at the same number of epochs.
Here, we also explain the implementation details. As is common practice, we adopted softmax cross entropy as the loss function, defined as follows:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} t_{n,c} \log y_{n,c}$$

where y is the softmax output; N is the minibatch size; C is the number of classes; and t is a label. Only one true label among the C classes is 1 and the others are 0. According to this equation, when we want to sort the examples of a minibatch by loss value, we only need to check the prediction results, i.e., the last activation of the model corresponding to the true class.

There are two ways to implement the proposed method. The first is to calculate all of the loss values in a checkbatch, set the loss values to zero (except the largest ones), and train. The second is to preliminarily select the examples with the worst prediction results in a checkbatch, calculate their loss values, and train. We adopted the second procedure in this paper, mainly because we introduced batch normalization layers [23] into our CNN, as mentioned in Section 2.1.1. Batch normalization normalizes a minibatch so that the mean and the variance of the minibatch become 0 and 1, respectively. If the first implementation is adopted, the batch normalization needs to be recalculated after the examples with the largest losses in a checkbatch are selected, because the minibatch statistics generally change. However, we cannot recalculate the batch normalization by a linear transformation of the previous forward propagation result, because our CNN also has non-linear ReLU layers, as mentioned in Section 2.1.1. Therefore, the loss values of the selected examples need to be recalculated after all. For this reason, we adopted the second implementation.
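The selection step of the second implementation can be sketched as follows. This is a hedged illustration rather than the code used in the experiments: it assumes that the model output for a checkbatch is available as an array of logits, and the names `softmax`, `select_hard_examples`, and `per_example_loss` are hypothetical. Because the per-example cross entropy is −log of the softmax output for the true class, picking the examples with the lowest true-class probability is equivalent to picking those with the largest losses.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def select_hard_examples(logits, labels, k):
    """Indices of the k checkbatch examples with the worst predictions,
    i.e., the lowest softmax output for the true class."""
    probs = softmax(logits)
    p_true = probs[np.arange(len(labels)), labels]
    return np.argsort(p_true, kind="stable")[:k]

def per_example_loss(logits, labels):
    """Softmax cross entropy of each example (before averaging over the batch)."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])

# Usage: pick 100 training examples out of a checkbatch of 500 (random stand-in data).
rng = np.random.default_rng(0)
check_logits = rng.normal(size=(500, 2))
check_labels = rng.integers(0, 2, size=500)
hard_idx = select_hard_examples(check_logits, check_labels, k=100)
```

Only the examples indexed by `hard_idx` would then be passed through the network again in training mode to compute the loss and gradients.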
Batch normalization has a training mode and a testing mode, and different statistics are used to normalize a minibatch in each mode. In the training mode, it uses the statistics of the current minibatch, and in testing mode it uses the statistics of all the data that have been used for training. We used the testing mode in the loss checking process, because our idea was to use examples that were difficult to classify by the current classifier.

Vehicle Detection Criteria
We adopted the same criteria as [12]. When a window was detected as containing a vehicle, it was judged a true positive (TP) if the distance between its center and the center of any groundtruth was smaller than 0.45 of the window size; otherwise, it was judged a false positive (FP). A groundtruth was judged to be detected if it had at least one corresponding TP. Under these criteria, one vehicle can have multiple TPs, all but one of which are redundant. One TP is allowed to detect only one vehicle.

Quantitative Measure
We calculated the recall rate (RR), precision rate (PR), and false alarm rate (FAR) as per [12,14]. We also calculated the F1 scores by using the obtained RR and PR scores.
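As an illustration of how the criteria of Section 2.3 translate into scores, the following sketch matches detections to groundtruth centers and computes RR, PR, and the F1 score. Several points are our assumptions rather than statements from [12,14]: each detection is assigned to its nearest groundtruth, every TP window (including redundant ones) counts toward PR, and FAR is omitted because its exact definition is not reproduced in this section. The function name `detection_scores` is hypothetical.

```python
import numpy as np

def detection_scores(det_centers, gt_centers, window_size=50, ratio=0.45):
    """Recall (RR), precision (PR), and F1 from detection and groundtruth centers.

    A detection is a TP if its nearest groundtruth center is closer than
    ratio * window_size; otherwise it is an FP. A groundtruth counts as
    detected if at least one TP is assigned to it.
    """
    det = np.asarray(det_centers, dtype=float).reshape(-1, 2)
    gt = np.asarray(gt_centers, dtype=float).reshape(-1, 2)
    if len(det) == 0 or len(gt) == 0:
        return {"recall": 0.0, "precision": 0.0, "f1": 0.0}
    # Pairwise distances, shape (number of detections, number of groundtruths).
    dist = np.linalg.norm(det[:, None, :] - gt[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)  # assumption: one TP detects only its nearest vehicle
    is_tp = dist[np.arange(len(det)), nearest] < ratio * window_size
    n_detected = len(set(nearest[is_tp].tolist()))
    recall = n_detected / len(gt)
    precision = int(is_tp.sum()) / len(det)  # assumption: redundant TPs count as TPs
    total = precision + recall
    f1 = 0.0 if total == 0 else 2.0 * precision * recall / total
    return {"recall": recall, "precision": precision, "f1": f1}
```

The F1 score is the harmonic mean of PR and RR, so it rewards a classifier only when both are high.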

Experiment and Results
We evaluated the performance of our method by comparing it with plain SGD training (hereinafter called the normal method). First, we trained CNNs by the normal and proposed methods, and then conducted vehicle detection with those classifiers. An overview of the experiment is shown in Figure 8.  We used sparse training data as explained in Sections 3.1 and 3.2. We conducted preliminary experiments using the normal method and found that it still had room for improvement, because the FAR in the result was high. We aimed to reduce false positives and improve the accuracy by using the proposed method, which was the first motivation of this paper.

Training and Test Images
We downloaded aerial ortho images of New York from the U.S. Geological Survey (USGS), cut out areas of harbors and malls, and used them for training and testing. The pixel size of all images was 0.15 m. Table 1 shows the images and their attributes. The train_2 image was taken in the spring of 2013, and the rest was taken in April-May of 2014. These were the only images used in this paper.


Data Preparation
We prepared groundtruth maps of all the images described in Section 3.1 by choosing a pixel in the center of each vehicle by hand. Then we generated the training dataset by extracting patches from the training images. For positive examples, we first extracted bounding boxes around the dots in the groundtruth maps as groundtruth patches. The window size was 50 pixels, which was designed to well cover the typical size of vehicles. To increase the variance of the positive examples, we generated 10 rotated duplications of each groundtruth patch at rotation angles from 9° to 90° in increments of 9°. This is called data augmentation. Then, for negative examples, we extracted background patches randomly where the IoU between a candidate patch and any groundtruth was lower than 0.4. These types of methods are commonly used. The authors in [14] used similar methods for data preparation. Finally, all patches were resized to 48 by 48 pixels, which was the input size of our CNN.
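The patch-sampling rules above can be sketched as follows. This is a minimal sketch and not the authors' code: the helper names (`rotation_angles`, `box_around`, `iou`, `sample_negative_centers`) are hypothetical, windows are assumed to be axis-aligned 50-pixel squares centered on a pixel, and negatives are assumed to be drawn by simple rejection sampling.

```python
import random

WINDOW = 50          # window size in pixels (from the text)
IOU_THRESHOLD = 0.4  # a negative must overlap every groundtruth by less than this


def rotation_angles(step=9, stop=90):
    """Angles used for augmentation: 9, 18, ..., 90 degrees (10 values)."""
    return list(range(step, stop + 1, step))


def box_around(cx, cy, size=WINDOW):
    """Axis-aligned square (x0, y0, x1, y1) centered on (cx, cy)."""
    half = size / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def sample_negative_centers(vehicle_centers, width, height, count,
                            seed=0, max_tries=100000):
    """Rejection-sample background centers whose window has IoU < 0.4
    with every groundtruth window. May return fewer than `count`
    centers if `max_tries` is exhausted."""
    rng = random.Random(seed)
    truth = [box_around(x, y) for x, y in vehicle_centers]
    half = WINDOW // 2
    negatives = []
    tries = 0
    while len(negatives) < count and tries < max_tries:
        tries += 1
        cx = rng.randint(half, width - half)
        cy = rng.randint(half, height - half)
        candidate = box_around(cx, cy)
        if all(iou(candidate, t) < IOU_THRESHOLD for t in truth):
            negatives.append((cx, cy))
    return negatives
```

Under these assumptions, two 50-pixel windows shifted by 25 pixels along one axis have an IoU of 1/3, so such a window would still qualify as a background patch.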

We generated five different training datasets. The groundtruth patches were always the same, whereas the number of sampled background patches differed. The ratios of background patches to groundtruth patches (without the augmented ones) were 100:1, 200:1, 300:1, 400:1, and 500:1 (hereinafter called ×100, ×200, ×300, ×400, and ×500), respectively. For instance, in the case of ×100, the ratio of positive examples (including augmented ones) to negative examples was 11:100. We used these datasets because the balance of positive and negative examples in the training data generally affects the result, and we aimed to evaluate this effect. As the background area is generally larger than the foreground area in an image, it is common to use more negative examples than positive examples in the training data for better accuracy [1,7,8,11,14]. This is more conspicuous in the case of vehicle detection, because vehicles are small objects. Taking these points into account, we started from a positive-to-negative ratio of 11:100.
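The 11:100 figure follows from counting each groundtruth patch together with its 10 rotated copies. The short sketch below reproduces that arithmetic for all five datasets; the function name `positive_to_negative` is hypothetical and is used only for illustration.

```python
N_ROTATIONS = 10  # rotated copies generated per groundtruth patch (from the text)


def positive_to_negative(background_multiplier, n_rotations=N_ROTATIONS):
    """Return (positives, negatives) per single groundtruth patch."""
    positives = 1 + n_rotations        # original patch plus its rotated copies
    negatives = background_multiplier  # background patches per groundtruth patch
    return positives, negatives


for k in (100, 200, 300, 400, 500):
    pos, neg = positive_to_negative(k)
    share = pos / (pos + neg)
    print(f"x{k}: {pos}:{neg} (positives are {share:.1%} of the dataset)")
```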
In each training dataset, we randomly selected one-tenth of the dataset and used it as a fixed hold-out validation dataset. During training, we calculated the loss and accuracy on this validation dataset in every epoch, which was its only use.

Training Results
We initialized the CNN weight variables at random. We used the Adam solver [28], and the number of training iterations was equivalent to 100 epochs in the normal method (for the proposed method, this corresponds to, e.g., 500 epochs when ncheck was five times larger than nlearn). The batch size for learning (i.e., nlearn) was always 100. These settings were decided empirically based on preliminary experiments. In every epoch during training, the mean loss and mean accuracy were calculated for the training and validation datasets.
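The selection step at the heart of HEM, together with the epoch bookkeeping mentioned above, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: per-example losses are assumed to be already available as a list for one checkbatch, and the names `select_hard_examples` and `data_epochs` are hypothetical.

```python
import heapq


def select_hard_examples(losses, n_learn=100):
    """Return the indices of the n_learn largest losses in a checkbatch,
    ordered from hardest to easiest."""
    return heapq.nlargest(n_learn, range(len(losses)), key=losses.__getitem__)


def data_epochs(normal_epochs, n_check, n_learn=100):
    """Passes over the training data consumed when every update inspects
    n_check examples but learns from only n_learn of them."""
    return normal_epochs * n_check // n_learn


# Toy checkbatch of five losses; keep the two hardest examples.
toy_losses = [0.1, 2.0, 0.5, 3.0, 0.05]
print(select_hard_examples(toy_losses, n_learn=2))  # indices of 3.0 and 2.0
print(data_epochs(100, n_check=500))                # 500
```

In a training loop, the selected indices would be used to build the minibatch for the backward pass, which is why each iteration costs an extra forward pass over the whole checkbatch.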
As the shapes of all of the graphs were similar for the different conditions, we present only one example. Figure 9 shows the training curves where the background patch amount was ×200 (one of the results of repeated experiments). In Figure 9, all methods seem to have sufficiently converged. For training loss and accuracy, the fluctuations of the proposed method were larger than those of the normal method. This seems natural because in every iteration, the CNN is updated and then used to calculate loss values and select training examples, which means that the criteria for selecting training examples changed in every iteration. Figure 10 shows the moving average of Figure 9d, which aims to show the convergence trend more clearly. In Figure 10, we averaged over about 6350 iterations, which corresponded to two epochs of the normal method.
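A moving average of this kind can be computed as sketched below. Whether the authors used a trailing or a centered window is not stated, so the trailing window here is an assumption, and the function name `moving_average` is hypothetical. For Figure 10, the window would be about 6350 iterations.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average. Early points average over the values
    seen so far, until a full window is available."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]   # value about to be pushed out of the window
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


print(moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```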
As Figure 10 shows, while the accuracy in the normal method still improved slightly toward the last epochs, it improved much faster in the proposed method. This means that the proposed method markedly accelerated convergence. In addition, the curve of HEM1000 seems to have converged slightly earlier than that of HEM500. This indicates that the larger ncheck accelerated convergence more.
To evaluate the training results, we compared the final values of validation loss and validation accuracy under different conditions. To mitigate fluctuations, we calculated the moving average over the iterations corresponding to 10 epochs of the normal method and then averaged the repeated experiments. Table 3 shows the statistics, which include the standard deviations and standard errors. While validation losses were not necessarily smaller in the proposed method than in the normal method, validation accuracies were always higher in the proposed method, by margins that were clear relative to the standard errors. Although the main purpose of this validation was to check for overfitting, this result is evidence that the proposed method yielded better generalization. When we compared the validation accuracies of HEM500 and HEM1000, we could not see a significant difference. Note that this accuracy calculation included classifying the background patches. The performance of our system in terms of vehicle detection is evaluated in Section 3.5.
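The statistics reported for repeated experiments (mean, standard deviation, and standard error) can be computed as in the sketch below. It assumes the sample standard deviation (n - 1 in the denominator), which the text does not specify. The comparison rule in `clearly_higher` is a hypothetical illustration of reading a difference against standard errors, not the authors' stated criterion, and the accuracy values in the usage lines are made up.

```python
import math
import statistics


def summarize(runs):
    """Mean, sample standard deviation, and standard error of repeated runs."""
    mean = statistics.mean(runs)
    std = statistics.stdev(runs)       # sample standard deviation (n - 1)
    sem = std / math.sqrt(len(runs))   # standard error of the mean
    return mean, std, sem


def clearly_higher(a, b):
    """True if mean(a) exceeds mean(b) by more than the sum of both
    standard errors (an illustrative rule of thumb)."""
    mean_a, _, sem_a = summarize(a)
    mean_b, _, sem_b = summarize(b)
    return mean_a - mean_b > sem_a + sem_b


# Hypothetical validation accuracies from three repeated runs each.
print(clearly_higher([0.95, 0.96, 0.97], [0.90, 0.91, 0.92]))  # True
```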

Vehicle Detection Results
We conducted vehicle detection using the method described in Section 2.1.1 and the test images described in Section 3.1. Results of repeated experiments were averaged. Figure 11 shows the F1-measure results and Table 4 shows the statistics of all quantitative measures, which include the standard deviations and standard errors.

Figure 11. F1 scores in each condition. Our proposed method improved the scores in all cases over the normal method.
In terms of F1 scores, while almost all of the standard errors were smaller than 0.01, the proposed method improved the scores by over 0.02 compared to the normal method in most cases, which demonstrates the effectiveness of our proposed method. Moreover, because the standard deviations of all methods were similar, we can say that our proposed method worked stably. When we compared the results in terms of the background patch amount, the F1 scores in the normal method tended to be higher when the background patch amount was larger, and this also seems to apply to the proposed method.
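For reference, the F1 score is the harmonic mean of the precision rate (PR) and the recall rate (RR). The sketch below shows only this standard formula. How PR and RR are counted from detections follows the definitions given earlier in the paper and is not reproduced here, and the input values in the usage line are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision (PR) and recall (RR)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical PR and RR values, for illustration only.
print(round(f1_score(0.6, 0.8), 4))  # 0.6857
```

Because the harmonic mean is dominated by the smaller of the two rates, a gain in PR translates into a smaller gain in F1 when RR stays roughly constant.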
When we compared the F1 scores of HEM500 and HEM1000, from ×100 to ×500, HEM500 won in two cases and HEM1000 won in the other three cases, which seems almost even. The score differences were less than those between the normal method and the proposed method.
Figure 12 plots the FAR versus the RR. As can be seen, the proposed method greatly reduced the FAR while retaining nearly the same RR. This mainly contributed to accuracy improvement, because our training data were sparse and the FARs were relatively large throughout our experiments. Note that the power of the proposed method was not restricted to FAR reduction, because the most informative examples were automatically selected in every checkbatch.
Although the non-maximum-suppression (NMS) was properly applied, there were some duplicated detections (redundant TPs) due to a limitation of NMS. However, this does not affect accuracy assessment according to the definitions of PR and F1. Figure 13 shows an example of good and bad results by HEM500 where the background patch amount was ×400. We chose the classifiers that achieved the best F1 scores from repeated experiments. While many FPs were reduced in the pair of images in Figure 13a, a few vehicles became undetected in Figure 13b. There seems to have been a kind of trade-off, while overall accuracy was improved.
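The limitation of NMS mentioned above can be illustrated with a generic greedy NMS sketch. This is the textbook procedure, not necessarily the exact variant used in this paper: the IoU threshold of 0.5, the `(score, box)` representation, and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_maximum_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (score, box) pairs: visit detections from the
    highest score down and keep one only if it does not overlap an
    already kept box by more than the threshold."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (0, 0, 50, 50)),    # best window on a vehicle
    (0.8, (5, 5, 55, 55)),    # heavy overlap with the first: suppressed
    (0.7, (30, 0, 80, 50)),   # IoU 0.25 with the first: survives
]
print(non_maximum_suppression(detections))
```

In this toy case the third window overlaps the first by an IoU of only 0.25, so it survives suppression even if it covers part of the same vehicle. This is one way a redundant TP can remain after NMS.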
Figure 13. An example of good and bad cases in the tested images. In each pair of images, the left one shows the result of the normal method and the right one shows the result of HEM500. (a) Good case. On the right, FPs were much reduced; (b) bad case. On the right, some vehicles became undetected.

Improvement Extent
While our method improved accuracy, the improvement in F1 did not seem very large. We investigated the reasons. First, the improvement differed between the test_1 and test_2 images. Table 5 shows the F1 result for each test image in the case of ×400. As can be seen, the result for the test_2 image improved considerably, whereas the result for test_1 was already good, because the test_1 image had very similar features to the train_1 image, and it did not improve much (it became slightly worse in this case) with our method. Thus, the improvement over both test images combined was relatively small.
The second reason was redundant TPs. While our method greatly reduced FPs, the improvement of PR was relatively small. In Table 4, in the case of ×400, while FAR reduction was about 0.13 on average, PR improvement was about 0.06 on average. This was because not only FPs, but also redundant TPs hurt PR. As we checked, the ratios of redundant TPs to detected vehicles in normal, HEM500, and HEM1000 were about 0.31, 0.25, and 0.27, respectively. Our method also seems to have reduced redundant TPs, probably because redundant TPs were relatively distant from the vehicles. For example, in one experiment using the normal method with ×400, the average distances between vehicles and TPs that detected vehicles, and between vehicles and redundant TPs, were 6.6 pixels and 11.5 pixels, respectively. As our training data only included positive examples that exactly matched the locations of vehicles, such redundant TPs would have been reduced by HEM. However, the reduction was smaller than that of FPs, because the redundant TPs overlap vehicles to some extent. This diluted the improvement of FAR.
The third reason was the decrease in RR. In Table 4, in the case of ×400, RR decreased by 0.02 on average. As mentioned above, our HEM method seems to have reduced TPs that were relatively distant from vehicles, and a small part of them may have detected vehicles before applying HEM. This slightly hurt the RR.
The second and third reasons came from the inaccuracy of sliding windows. By replacing sliding windows with a more accurate method such as RPN, we can see the improvement by HEM more explicitly.
The best average F1 score throughout our experiments was 0.71, which is actually not a state-of-the-art result. For instance, [16] reported an F1 of 0.83, and [18] reported very high F1 of 0.94. Although it is not a fair comparison, because the training data and test settings are totally different, 0.71 is not the best. The primary reason would be the insufficiency of our training data. The authors in [16] used the Munich dataset [29], which has 9433 annotated vehicles, and [18] proposed and used the COWC dataset, which has 32,716 annotated vehicles. Compared to those datasets, our training data were sparse. Although our HEM could improve accuracy, sufficient training data were necessary for high performance. Another reason would be the simplicity of our CNN architecture. When we compared just the numbers of convolutional layers in each method, [16] had five and [18] had 50, whereas our model only had three.
We can combine our HEM method with adequate training data and a richer CNN architecture for higher accuracy. Moreover, as discussed above, the accuracy would improve further by improving or replacing the sliding window method. Our HEM method can easily scale with those options.

Training Loss Values and Duration
Shrivastava et al. [22] reported that the loss values during training became smaller by using their OHEM method, because they conducted a fair comparison where all region proposals in an image (not just the ones selected for a minibatch) were used to calculate the loss values in every method. Meanwhile, the loss values of our method seemed to be better than those of the normal method; however, the difference was very small (Figure 9a). This was because those loss values were calculated from the examples that were actually used for training; those examples had relatively large loss values, because we chose such examples for training. If we evaluated the loss values over a checkbatch, the trend would be similar to [22].
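The bias described above is easy to reproduce: the mean loss of the hardest examples in a checkbatch can never be lower than the mean loss of the whole checkbatch. The sketch below uses synthetic losses drawn from an exponential distribution. The distribution, the seed, and the function names are assumptions made purely for illustration.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def selected_vs_checkbatch_loss(losses, n_learn):
    """Mean loss over the n_learn hardest examples and over the whole checkbatch."""
    hardest = sorted(losses, reverse=True)[:n_learn]
    return mean(hardest), mean(losses)


rng = random.Random(0)
# Synthetic per-example losses for one checkbatch of 500 examples.
checkbatch = [rng.expovariate(1.0) for _ in range(500)]
hard_mean, all_mean = selected_vs_checkbatch_loss(checkbatch, n_learn=100)
print(f"selected examples: {hard_mean:.3f}, whole checkbatch: {all_mean:.3f}")
```

A training curve computed only from the selected examples therefore overstates the loss relative to a curve computed over every example in the checkbatch.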
Our method took more time to train for the same number of iterations than the normal method, because of the overhead of calculating loss values in a checkbatch. For instance, for the runs in Figure 9, the training durations of normal, HEM500, and HEM1000 were 2.5, 6, and 10 h, respectively, on a Tesla K20X manufactured by NVIDIA (Santa Clara, CA, USA). In the original OHEM, it took approximately 1.7 times longer in one iteration to select 64 out of around 4000 region proposals in an image. Our method is more time-consuming than theirs, because our method requires extra forward propagation calculations as described in Section 2.2, whereas the original OHEM can calculate the loss values of region proposals with small additional costs due to RoI pooling.
However, our method strongly accelerates convergence. Therefore, training could be stopped much earlier in the proposed method, which would offset this drawback.
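The trade-off can be quantified from the reported durations. The sketch below computes the per-iteration overhead of each HEM setting and the fraction of iterations at which its wall-clock time would match a full run of the normal method. It assumes that training time scales linearly with the number of iterations, and the function names are hypothetical.

```python
DURATION_HOURS = {"normal": 2.5, "HEM500": 6.0, "HEM1000": 10.0}  # from the text


def overhead_factor(method, baseline="normal"):
    """How many times longer a method takes for the same number of iterations."""
    return DURATION_HOURS[method] / DURATION_HOURS[baseline]


def break_even_fraction(method, baseline="normal"):
    """Fraction of the iterations at which the method's wall-clock time
    equals the baseline's full run (assuming time is linear in iterations)."""
    return DURATION_HOURS[baseline] / DURATION_HOURS[method]


for name in ("HEM500", "HEM1000"):
    print(f"{name}: {overhead_factor(name):.1f}x per iteration, "
          f"break-even at {break_even_fraction(name):.0%} of the iterations")
```

Under this assumption, HEM500 would need to reach the target accuracy within roughly 42% of the iterations, and HEM1000 within 25%, to match the training time of the normal method.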

Source of Accuracy Improvement
In our experiment, more background patches tended to yield better F1 scores in the normal method. This may be because more background patches contain more "negative" features to be discriminated from vehicles, and it kept improving the accuracy in the range of our experiments. FP reduction seems to have contributed more to the performance improvement, because the FAR was relatively high in our experiments.
To confirm how accuracy improvement continued, we conducted additional experiments. We gradually increased the amount of background patches to 1000 times the number of groundtruths. Figure 14 shows all the F1 scores of vehicle detection tests by the normal method, including the additional experiments. The standard deviations and standard errors of the additional experiments had values similar to the previous experiments. As can be seen, the accuracy does not seem to have improved after ×600, which indicates the best balance was around ×600. Although the best score was at ×800, which was about 0.7, the difference from ×600 was smaller than the standard errors, and was thus negligible.
When we compared these results with those of the proposed method, four out of 10 scores of the proposed method surpassed the best scores of the normal method, which had the best balance of positive and negative examples in the training data. This fact proves that our method improved accuracy by learning finer features, and not only by balancing positive and negative examples.
Continuing to increase negative examples could lead to a worse result. Even in such cases, our method is expected to find the optimal balance between positive and negative examples during training, which we could not confirm explicitly in our experiments. This will be our future work.

HEM500 vs. HEM1000
The HEM500 and HEM1000 scores were almost even, and the score differences were less than those between the normal and proposed methods. Considering that the validation accuracy results of HEM500 and HEM1000 were not very different, and taking these results into account, we can suppose that the performances of HEM500 and HEM1000 were similar, because an ncheck value of 500 was large enough, and increasing it did not improve accuracy in the range of our experiments. However, our experiments were not enough to prove this conjecture. Moreover, we could test only two values of ncheck due to our limited computing resources. Further verification will be our future work.

Usability of Our Method
In our experiments, we used a sparse training dataset. However, our method does not depend on model architecture or training data; therefore, it would be effective even when training data are abundant.
In this paper, we used our CNN as a classifier. However, it could also be used as a feature extractor by removing the last fully connected layer, and it could be combined with other classifiers. For instance, Tang et al. [16] used a cascade of boosted classifiers that was fed features extracted by a CNN. We would be able to further improve accuracy by combining our method with such a method.

Conclusions
We applied HEM to the SGD training of a CNN vehicle classifier, which successfully promoted learning finer features and improved accuracy. It took more time to train for the same number of iterations; however, this could be offset by stopping training earlier, which would be acceptable as the proposed method markedly accelerates convergence. Although we used sparse data in our experiments, our method would be effective even when training data are abundant. However, we could not confirm the effect of balancing positive and negative examples in the training data explicitly, and the effect of checkbatch size was not sufficiently determined. These issues will form the basis of our future work.

Appendix A
In this appendix, we present an example of HEM500 with ×400 to further investigate the result.

Effect of Clutters and Shadows
To evaluate how clutter and shadows impeded vehicle detection, we visualized undetected vehicles. Figure A1 shows 100 randomly selected patches out of the 211 undetected vehicles of test_1 and all 31 undetected vehicle patches of test_2. As can be seen in Figure A1b, only one undetected vehicle was covered by any clutter or shadows. Although we could not fairly evaluate the robustness of our sliding window method, because our test images did not originally have many obstacles, there seem to have been only a few vehicles that were not detected due to clutter and shadows.
Furthermore, to evaluate how clutter and shadows confused CNN, we visualized the FP patches. Figure A2 shows all 18 FP patches of test_1 and 100 out of 231 randomly selected FP patches of test_2.
In Figure A2b, several patches with shadows were misclassified as vehicles. This was probably because the shadows looked like straight edge features that are similar to vehicle features, thus confusing CNN. This seems to have been caused by the insufficiency of the classification performance of CNN rather than the inaccuracy of the sliding windows. These misclassifications would be further reduced by using abundant training data.