Small-Sized Vehicle Detection in Remote Sensing Image Based on Keypoint Detection

Abstract: Vehicle detection in remote sensing images is a challenging task due to the small size of the objects and the interference of a complex background. Traditional methods require a large number of anchor boxes, and the intersection over union between these anchor boxes and the objects' ground-truth boxes needs to be high enough. Moreover, the size and aspect ratio of each anchor box need to be designed manually, and more anchor boxes have to be set for small objects. To solve these problems, we regard the small object as a keypoint in the relevant background and propose an anchor-free vehicle detection network (AVD-kpNet) to robustly detect small-sized vehicles in remote sensing images. The AVD-kpNet framework fuses features across layers with a deep layer aggregation architecture, preserving the fine features of small objects. First, considering the correlation between the object and the surrounding background, a 2D Gaussian distribution strategy is adopted to describe the ground truth, instead of a hard-label approach. Moreover, we redesign the corresponding focal loss function. Experimental results demonstrate that our method achieves higher accuracy for the small-sized vehicle detection task in remote sensing images compared with several advanced methods.


Introduction
The development of vehicle detection technology for remote sensing images makes it possible to obtain traffic information in time, which is significant for road traffic monitoring, management and scheduling applications. However, in remote sensing images with low resolution, a vehicle object covers only a few pixels and lacks shape, texture, contour and other features. In addition, the background often interferes with the object information during detection. The detection of small-sized vehicles in remote sensing images is therefore a difficult problem.
There are many definitions of small objects; we use the definition of the International Society for Optical Engineering (SPIE), that is, an object occupying no more than 80 pixels. The current literature on vehicle detection in remote sensing images can be divided into descriptor-based and feature-learning-based methods. Traditional descriptor-based approaches generally consist of three stages: vehicle localization, feature extraction and classification.
Currently, most object detection methods are based on the idea of feature detection. A feature generally refers to a part of interest in the image, such as a point, a line segment or a region. Here, feature detection consists of scanning all positions of the image, converting the image into feature values, and judging whether the feature at a position belongs to a given type. Generally speaking, the proportion of the image occupied by objects in remote sensing images is small, and the background is more complex than in natural images. Object features are affected by background features, which makes it more difficult to detect specific features in remote sensing images.

Related Work
A sliding-window approach [2][3][4][5][6] is one of the most widely used methods for vehicle localization. However, many preset parameters, such as the window size and sliding step size, have a great influence on detection performance. Some studies have looked at alleviating these shortcomings using techniques such as simple linear iterative clustering [7] or an algorithm based on edge-weighted centroidal Voronoi tessellations [8]. These methods use hard labels, so for the small-sized object detection task in remote sensing images the relationship between the object and the background is ignored. Moreover, for this task the number of sliding windows is very large, resulting in a lot of redundant computation.
Liu et al. [2] used a sliding window to collect the areas to be processed and applied a fast binary detector using integral channel features in a soft-cascade structure; a multiclass classifier was applied to the output of the binary detector to give the direction and type of vehicle. ElMikaty et al. [3] used a sliding-window approach consisting of window evaluation, feature extraction and encoding, classification and postprocessing. Xu et al. [4] improved the vehicle detection performance of the original Viola-Jones detector by rotating the image. Wu et al. [5] proposed a detector for optical remote sensing imagery with processing steps of channel feature extraction, feature learning, fast image pyramid matching and a boosting strategy. However, these techniques are more suitable for large-sized objects in high-resolution remote sensing images (RSIs); for small-sized objects, they have difficulty extracting detailed features.
With the rapid development of deep convolution networks, CNNs have shown a good ability for image abstraction. In the field of computer vision, CNN technology has become a research hotspot for semantic segmentation [9][10][11], object detection [12][13][14][15][16], image classification [17,18] and human pose estimation [19]. CNNs have achieved good results not only in high-level semantic feature detection, but also in low-level feature detection. For example, some researchers used CNNs for edge detection [20][21][22][23]. The results showed that CNNs are better than traditional methods for edge detection in natural images. The application of a CNN's ability for image abstraction in detection also greatly promotes the development of object detection.
CNNs can learn features from remote sensing images and detect objects. The literature [24][25][26][27][28] primarily used fast/faster region-based CNN (R-CNN) frameworks to detect vehicles. Chen et al. [29] used transfer-compression learning to train a shallow network, with a threshold-based location method. Audebert et al. [30] used a segment-before-detect method for segmentation and subsequent detection and classification of several varieties of vehicles in high-resolution remote sensing images. Koga et al. [31] tailored correlation alignment domain adaptation (DA) and adversarial DA for the vehicle detector to improve performance on different data sets. Mandal et al. [32] constructed a one-stage vehicle detection network by introducing blocks composed of two convolutional layers and five residual feature blocks at multiple scales. Detection methods based on region classification have disadvantages: for small-sized objects, more anchors need to be designed manually, and the anchor-related hyper-parameters are usually data-set-dependent, which makes them difficult to tune and generalize. Taking CornerNet and CenterNet as representatives, the idea of region classification and regression is abandoned, and the object position is determined by detecting the center or corners of the object [33,34]. It is not necessary to set anchors when small-sized objects are viewed as keypoints in the related feature map.
Heatmaps have been used in human pose estimation, such as marking each joint position of the human body [19] or different parts [35]. In remote sensing images, there is a certain correlation between the object and the surrounding background. In this paper, we use 2D Gaussian heatmaps to reflect the uncertainty of pixel labels around the center of the object. The ground truth of the heatmap covers the object and the pixels around the object.

Proposed Framework
We use a nonstandard two-dimensional Gaussian function to construct the ground truth. The AVD-kpNet framework consists of two parts: a feature extraction network and a detection head. The feature extraction network integrates shallow features with deep features. The detection head predicts the category of each pixel in the output heatmap and the position offset of keypoints. We design a variant of focal loss to penalize the difference between the category of each pixel output by the network and the ground truth. We use an L1 loss function to penalize the keypoints position offset caused by pooling in the network. The output of AVD-kpNet is the heatmap of the object category probability. We determine the location of the object instance according to the maximum value of the neighborhood on the heatmap.

Overall Architecture
We designed an anchor-free object detector, named AVD-kpNet, to simultaneously perform small-sized vehicle object localization and classification. The proposed detector utilizes the AVD-kpNet framework to learn the salient feature maps from the input image. The entire architecture of the proposed remote sensing object detector is shown in Figure 1.
Due to the small number of pixels of the object, most of the information will be lost if the image is downsampled through pooling layers many times. Therefore, it is necessary to fuse information at different resolutions. The backbone network of AVD-kpNet is a deep layer aggregation (DLA) network. DLA combines the advantages of dense connections and feature pyramids and can aggregate semantic and spatial information. As shown in [36,37], a DLA structure performs well in detection and recognition tasks.
In order to resolve the feature maps into detection results, a detection head is added after the feature extraction part. The detection head contains a 3 × 3 and a 1 × 1 convolutional layer, followed by an offset prediction and a heatmap of category predictions. We take the local maximum positions as the predicted coordinates of objects, as shown in Figure 2.

Keypoint-Based Prediction Module
Most object detection systems tend to train a deep neural network to extract the deep features of candidate regions, and then predict the class probability of these regions. Using an anchor-based method, it is necessary to design the anchors in advance. For small objects, due to the small number of object pixels, only very accurately placed anchors can be used as positive samples, and these account for only a small fraction of all the preset anchors. Some related studies have shown that the classification score and positioning accuracy of the prediction results of anchor-based methods are not consistent. This inconsistency means that anchors with inaccurate positioning may be selected in the NMS process or when the detection result is selected according to the classification confidence [38]. A small amount of pixel deviation will lead to a decline in small-sized object positioning accuracy.

In small-sized object detection, each pixel belonging to the small-sized object has a great influence on the final detection result. In contrast, a conventional-scale object has many pixels, so the loss of a few pixels does not have a great impact on the semantic information of the object. At the same time, since the boundary pixels of small-sized objects are fused with the background, the anchor-based method will introduce large errors.
In remote sensing images, vehicle objects usually appear in a specific background; the object is correlated with the surrounding area. Thus, we adopt a 2D Gaussian heatmap to label the center of the object as the keypoint, and the closer to the center, the greater the confidence probability of the object. In the heatmap of the same keypoint, some pixels belonging to the object and background are marked at the same time. In our proposal, heatmaps are set to be ground truths at the stage of network training. Each heatmap contains the ground truth of all instances of the same category with the normalized Gaussian function whose parameters vary with the size of objects. The ground truth of the object is given by a nonstandard two-dimensional Gaussian function.
$$G_{xyc} = \exp\!\left(-\frac{(x - p_x)^2 + (y - p_y)^2}{2\sigma_p^2}\right) \tag{1}$$

where $(x - p_x)$ and $(y - p_y)$ represent the distances to the center of the object and $\sigma_p$ is the radius of the Gaussian kernel, calculated as:

$$\sigma_p = \mathrm{avg}(w, h) \tag{2}$$

where avg(w, h) is the average of the width and height of the minimum bounding rectangle of the object. This makes the ground-truth value around the center of the object nonzero: the ground truth describes the distance relationship between the other pixels and the keypoint. The size of the predicted heatmap is W × H × C, where C represents the number of categories. If two Gaussian functions of the same class c overlap, we keep the element-wise maximum.
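As an illustration, the ground-truth rendering described above can be sketched in NumPy. The function name and the kernel radius σ_p = avg(w, h) are our assumptions for this sketch, not necessarily the paper's exact implementation.

```python
import numpy as np

def gaussian_heatmap(shape, centers, sizes):
    """Render one category channel of the ground-truth heatmap.

    shape   -- (H, W) of the output heatmap
    centers -- list of (px, py) keypoint coordinates (object centers)
    sizes   -- list of (w, h) minimum-bounding-rectangle sizes
    Overlapping Gaussians of the same class keep the element-wise maximum.
    """
    H, W = shape
    heat = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for (px, py), (w, h) in zip(centers, sizes):
        sigma = (w + h) / 2.0          # avg(w, h); assumed radius formula
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)     # element-wise max for overlapping objects
    return heat
```

The peak of each Gaussian is 1 at the object center and decays smoothly into the surrounding background, so nearby pixels act as soft positive samples.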
As can be seen from Figure 3, our annotation method regards the target and the background around the target as positive samples of different degrees at the same time.
The heatmaps of objects with the same avg(w, h) are identical. We consider a square region with side length avg(w, h). When the predicted position and the ground truth of the keypoint satisfy Equation (3), the Gaussian function value of the ground truth is 0.78 and we define the position as close to the object keypoint. When they satisfy Equation (4), the Gaussian function value of the ground truth is 0.61 and we define the position as far from the object keypoint. This threshold range can be used to judge whether a detection is correct in the experiments.
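As a quick numerical check, the two quoted thresholds follow from the Gaussian ground truth under our assumption that σ_p equals avg(w, h): a prediction at distance avg(w, h)/√2 from the keypoint scores e^(−1/4) ≈ 0.78, and one at distance avg(w, h) scores e^(−1/2) ≈ 0.61.

```python
import math

avg = 1.0  # any avg(w, h): the threshold values are scale-free
close = math.exp(-((avg / math.sqrt(2)) ** 2) / (2 * avg ** 2))  # D = avg/sqrt(2)
far = math.exp(-(avg ** 2) / (2 * avg ** 2))                     # D = avg
print(round(close, 2), round(far, 2))  # 0.78 0.61
```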

Loss Function
Let $p_{cij}$ be the score at location $(i, j)$ for class $c$ in the predicted heatmap and let $g_{cij}$ be the score at location $(i, j)$ for class $c$ in the ground-truth heatmap. We design a variant of focal loss:

$$L_{point} = -\frac{1}{N}\sum_{c,i,j}\begin{cases}\left(1 - p_{cij}\right)^{\alpha}\log\left(p_{cij}\right), & g_{cij} = 1\\ \left(1 - g_{cij}\right)^{\beta}\, p_{cij}^{\alpha}\log\left(1 - p_{cij}\right), & \text{otherwise}\end{cases} \tag{5}$$

where N is the number of objects in an image, α is used to limit the dominance of gradients caused by easy examples and β is the hyper-parameter that controls the contribution of each keypoint. For the parameter values, we refer to [39] and performed fine-tuning during the experiments; we set α to 4 and β to 2 in all experiments. When $g_{cij}$ is 0 or 1, the formula for $L_{point}$ degenerates to a general focal loss:

$$L = -\frac{1}{N}\sum_{c,i,j}\begin{cases}\left(1 - p_{cij}\right)^{\alpha}\log\left(p_{cij}\right), & g_{cij} = 1\\ p_{cij}^{\alpha}\log\left(1 - p_{cij}\right), & g_{cij} = 0\end{cases} \tag{6}$$

After image downsampling, the keypoint position of the ground truth deviates by a sub-pixel amount, so we predict a local offset for each center point, trained with an L1 loss:

$$L_{offset} = \frac{1}{N}\sum_{p}\left| O_p - \left(\frac{g}{d_r} - p\right)\right| \tag{7}$$

where $O_p$ is the predicted offset, $d_r$ is the downsampling rate of the network, $g$ is the center position of the object in the real heatmap (which is also the position of the maximum heatmap value) and $p$ is the local maximum position of the predicted heatmap.
In conclusion, the total loss function is:

$$L = L_{point} + \lambda_{off}\, L_{offset} \tag{8}$$

where $\lambda_{off}$ is the weight of $L_{offset}$. Similar to [33], we set $\lambda_{off}$ to 1.
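The point and offset losses described above can be sketched in NumPy as follows. The variant form is our reading of the text (positives where g = 1 get the standard focal term, every other pixel is down-weighted by (1 − g)^β), and the function names are hypothetical, so treat this as an assumption rather than the exact implementation.

```python
import numpy as np

def point_loss(p, g, alpha=4.0, beta=2.0):
    """Variant focal loss over a predicted heatmap p and soft ground truth g.

    p, g -- arrays of shape (C, H, W); g is the Gaussian soft-label heatmap.
    Positives (g == 1) use the focal term; all other pixels are
    down-weighted by (1 - g)**beta so near-center pixels are penalized less.
    """
    eps = 1e-12
    pos = (g == 1.0)
    n = max(1, int(pos.sum()))                 # number of object keypoints
    pos_term = ((1 - p) ** alpha * np.log(p + eps))[pos].sum()
    neg_term = ((1 - g) ** beta * p ** alpha * np.log(1 - p + eps))[~pos].sum()
    return -(pos_term + neg_term) / n

def offset_loss(o_pred, g_center, p_peak, d_r):
    """L1 penalty on the sub-pixel offset lost to downsampling by d_r."""
    target = np.asarray(g_center, dtype=float) / d_r - np.asarray(p_peak, dtype=float)
    return np.abs(np.asarray(o_pred, dtype=float) - target).sum()
```

For example, a ground-truth center at pixel 10 in the input maps to 2.5 on a 4×-downsampled heatmap, so a peak at cell 2 should predict an offset of 0.5.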

Data Set
We collected 877 high-resolution remote sensing images from DOTA [40] and Google Earth. The original sizes of the images vary from 800 × 800 pixels to 4000 × 4000 pixels. The images include roads, trees, houses and other kinds of backgrounds, and the spatial resolution ranges from 0.1 m to 0.3 m. We resampled these high-resolution images with a downscaling factor of 5, so that each vehicle object occupies fewer than 80 pixels, conforming to the definition of small-sized objects in this paper. We cropped all the images to 512 × 512 pixels and marked the ground truth of each image by the heatmap method illustrated in Section 3.2. We divided vehicle objects into two categories: the first, called vehicle_1, mainly includes sedans, hatchbacks and similarly sized cars with an actual length of around 4 m; the second, called vehicle_2, mainly includes container cars and buses with an actual length of more than 10 m. There were about 1700 instances of vehicle_1 and 1500 instances of vehicle_2. The average pixel size of vehicle_1 and vehicle_2 in each resampled image was 2 × 4~5 × 10 and 2 × 9~4 × 20, respectively. Figure 4 shows some examples of the employed data set.
In our experiments, we divided these images into three sets: 1033 training images, 295 validation images and 149 test images, all of 512 × 512 pixels. For the training of AVD-kpNet, we used zero-mean normalization, a commonly used data normalization method that subtracts the average value of the training data from the input image and then divides by the standard deviation. Moreover, we used mosaic data augmentation: we cropped the images into small images of 128 × 128 pixels with a sliding step of 128 and a random shift of 0 to 32 pixels, stitched 16 of these crops into a new picture, and fed that picture to the neural network for training. This preprocessing makes the input image contain a richer background and significantly reduces the need for a large mini-batch size; this data enhancement method has been applied in YOLOv4 [41] with good results. For inference, the input image was not cropped. Several augmentation techniques were also adopted to increase the amount of data: we randomly flipped images horizontally and vertically, and an angle randomly chosen from (90°, 180°, 270°) was used to rotate the images. The normalization and augmentation sequence of the training process is shown in Figure 5.
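The mosaic step above can be sketched as follows. This is a simplified version under stated assumptions: crop origins are drawn randomly per tile rather than walking a strict sliding grid over each source image, and the function name is hypothetical.

```python
import numpy as np

def mosaic(images, tile=128, grid=4, max_shift=32, rng=None):
    """Stitch grid*grid crops of size tile into one (tile*grid)^2 training image.

    images -- list of H×W×3 arrays, each at least tile + max_shift on a side.
    Each crop origin gets a random shift in [0, max_shift), mirroring the
    sliding-step-128-plus-random-shift scheme described in the text.
    """
    rng = rng or np.random.default_rng()
    out = np.zeros((tile * grid, tile * grid, 3), dtype=images[0].dtype)
    for k in range(grid * grid):
        img = images[k % len(images)]            # cycle through source images
        dy, dx = rng.integers(0, max_shift, size=2)
        crop = img[dy:dy + tile, dx:dx + tile]
        r, c = divmod(k, grid)
        out[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = crop
    return out
```

With the defaults, 16 crops of 128 × 128 pixels yield one 512 × 512 training image, matching the input size used here.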

Figure 5. Data normalization and augmentation during training.

Evaluation Metrics
In the inference stage, the local maximum value of the heatmap is taken as the keypoint position. As can be seen from Figure 6, all the response points on the heatmap are compared with their eight adjacent points. If the response value of the point is greater than or equal to its eight adjacent points, it is retained. After finding the maximum position of the neighborhood, only the loss of offset is calculated for this point.
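The eight-neighbour comparison above amounts to a 3 × 3 local-maximum filter over the heatmap; a minimal NumPy sketch follows (the function name and confidence threshold are our assumptions).

```python
import numpy as np

def heatmap_peaks(heat, thresh=0.1):
    """Return (y, x) positions whose response >= all 8 adjacent points.

    heat -- 2D array (one category channel). The border is padded with -inf
    so edge pixels are compared only against in-image neighbours.
    """
    H, W = heat.shape
    padded = np.pad(heat, 1, constant_values=-np.inf)
    # Stack the 8 shifted neighbour views and take their element-wise maximum.
    shifts = [padded[1 + dy:H + 1 + dy, 1 + dx:W + 1 + dx]
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)
              if (dy, dx) != (0, 0)]
    neigh_max = np.max(np.stack(shifts), axis=0)
    ys, xs = np.where((heat >= neigh_max) & (heat > thresh))
    return list(zip(ys.tolist(), xs.tolist()))
```

Only the retained peaks then receive the offset correction described above.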

The real distance (denoted D) between the predicted coordinate position and the ground truth is used to judge the consistency between the predicted position and the actual situation. We then set a threshold to judge whether the position prediction is accurate. In our experiments, a category prediction is considered a true positive (TP) if the object keypoint similarity (OKS) is above 0.78, and a false positive (FP) if the OKS is less than 0.78. If an object is not detected and recognized, it is counted as a false negative (FN):

$$\mathrm{OKS} = e^{-\frac{D^2}{2\,\mathrm{avg}(w,h)^2}} \tag{9}$$
where avg(w, h) is the average of the width and height of the minimum bounding rectangle of the object. We use average precision (AP) and mean average precision (mAP) to measure the performance of the detection model.
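The OKS-based matching rule can be made concrete with a small helper; the function names are hypothetical, but the formula and the 0.78 threshold follow the definitions above.

```python
import math

def oks(pred, gt, avg_wh):
    """Object keypoint similarity between a predicted and a true center."""
    d2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    return math.exp(-d2 / (2 * avg_wh ** 2))

def judge(pred, gt, avg_wh, thresh=0.78):
    """Label a detection TP or FP by its keypoint similarity."""
    return "TP" if oks(pred, gt, avg_wh) >= thresh else "FP"
```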

$$\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{AP}_c \tag{10}$$

The false alarm rate (FAR) is an important evaluation metric in practical tasks. If the FAR is too high, it will have a negative impact on practical application tasks. Therefore, the FAR is introduced to further evaluate the performance of the detection model:

$$\mathrm{FAR} = \frac{\text{number of detected false alarms}}{\text{number of detected candidates}} \tag{11}$$
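The counting conventions behind these metrics can be summarized in one small helper (a sketch with a hypothetical name): detected candidates are TP + FP, and detected false alarms are the FPs, matching the FAR definition above.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and false alarm rate from matched detection counts.

    tp + fp is the number of detected candidates; fp is the number of
    detected false alarms.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    far = fp / (tp + fp) if tp + fp else 0.0
    return precision, recall, far
```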

Experiment Results and Analysis
For setting the parameters of the training process, we referred to [1] and fine-tuned them in the experiments. We set the learning rate to 0.01, trained for 150 epochs with a batch size of 8, and used stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of $10^{-4}$. Training started with a learning rate of 0.01, reduced by a factor of 10 every 50 epochs. We did not use pretrained weights. Figure 7 shows some examples of detection results, and Figure 8 shows enlarged views of some smaller objects.
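The step schedule above (start at 0.01, divide by 10 every 50 epochs over 150 epochs) can be written as a one-line function; the name is ours.

```python
def learning_rate(epoch, base_lr=0.01, drop_every=50, factor=0.1):
    """Step schedule: base_lr divided by 10 every drop_every epochs."""
    return base_lr * factor ** (epoch // drop_every)
```

Over the 150 epochs used here this yields three phases: 0.01, 0.001 and 0.0001.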

The test results of the other networks were calculated under the condition of Intersection over Union (IoU) ≥ 0.5. Table 1 reports the AP, mAP and FAR values for the different methods.

In order to verify the contribution of the proposed soft label and the new loss function, ablation experiments were carried out on the existing data sets. The hard labels in Table 1 were produced by marking a circle whose radius was the average of the length and width of the object: pixels inside the circle were set to 1 as positive samples, and pixels outside were set to 0 as negative samples. Without changing the network structure, using hard labels and the original focal loss, the mAP is 35.80% and the FAR is 45.2%. Compared with the anchor-based YOLOv4 model, this is a slight improvement: the mAP increases by 3.70% and the FAR decreases by 4.3%. When the detection model uses soft labels and the proposed loss function, performance improves further: the mAP increases by 7.85% and the FAR decreases by 6.9%.
In order to analyze the impact of relatively small training data sets on our algorithm, we trained models with 2/3, 1/2, 1/4 and 1/8 of the training data, respectively. Each model was tested on the complete test set; the results are shown in Table 2. As Table 2 shows, detection performance improves gradually as the training set grows, and the model trained with 2/3 of the training data already approaches the performance obtained with all of it. From the overall analysis we draw the following conclusions: (1) AVD-kpNet obtains the best AP, mAP and FAR among the compared methods; (2) the detection performance for vehicle_2 is better than for vehicle_1, possibly because vehicle_1 instances retain too little detail to be detected reliably; (3) for the same network structure, the model trained with a Gaussian heatmap as ground truth performs better. For the AVD-kpNet framework with hard labels, a circle radius that is too small fails to cover the object completely, so the uncovered part is treated as background, while a radius that is too large marks many background pixels as object.

Conclusions
In this paper, a framework named AVD-kpNet was proposed for small-sized vehicle detection in remote sensing images. We regarded small-sized objects as keypoints in the background. The outputs of the AVD-kpNet framework are heatmap layers of objects, and the positions of object instances are obtained by locating peak regions in these layers. We proposed a method to construct the ground truth of vehicles in remote sensing images using a 2D Gaussian distribution; the purpose of this marking method is to model the relationship between small-sized objects and the surrounding background, so as to improve the detection of keypoints. Moreover, we designed a variant of focal loss to reduce the impact of easy examples. Comparing our proposal with several other methods, we conclude that AVD-kpNet achieves state-of-the-art results. In further research, we will improve the robustness of the model for cases where the object is partially occluded.

Conflicts of Interest:
The authors declare no conflict of interest.