Improved Faster RCNN Based on Feature Amplification and Oversampling Data Augmentation for Oriented Vehicle Detection in Aerial Images



Introduction
The development of high-resolution remote sensing imagery makes vehicle detection possible, which is important for autonomous driving and traffic monitoring [1][2][3]. In more specific tasks, the types and orientations of vehicles are required so that traffic conditions can be better scheduled [4]. Therefore, oriented detection of multiple vehicle types is significant. Common vehicles in aerial images, such as cars, tractors, vans, planes and pick-ups, are studied in this paper. Oriented vehicle detection is more challenging than general multi-class object detection since the differences between vehicle categories are small and vehicles are usually small objects [5].
Two definitions of small objects are commonly used. One is relative size: an object is regarded as small if its size is below 10% of the original image. The other is absolute size: an object is regarded as small if it is below 32 × 32 pixels. In high-resolution remote sensing images, vehicles usually occupy a small area below 10% of the image size or smaller than 32 × 32 pixels. We take the Vehicle Detection in Aerial Imagery (VEDAI) [6] dataset as an example to demonstrate this. The size of each image is 1024 × 1024 pixels, and the dataset contains 9 vehicle categories. The statistical results shown in Figure 1 demonstrate that most types of vehicles in this dataset are small objects.
Traditional vehicle detection methods usually include four steps: (1) data preprocessing, such as improving the image quality and increasing the contrast between vehicles and their backgrounds; (2) determination of potential vehicle positions by calculating the contrast between different parts of the image; (3) segmentation, performed to accurately extract potential vehicle locations from the background; and (4) recognition of vehicles based on features extracted from the potential regions. Recent vehicle detection methods differ fundamentally from traditional methods since they try to decrease the influence of intermediate decisions on the detection results by using machine learning. They can be divided into handcrafted feature-based methods and deep learning-based methods according to the type of extracted features used [7]. Before 2012, handcrafted feature-based approaches were the mainstream algorithms for vehicle detection. However, handcrafted features, including Viola-Jones detectors [8], Bag of Words (BOW) [9], Deformable Parts Model (DPM) [10] and Histogram of Oriented Gradients (HOG) [11], cannot represent vehicles well because they lack the semantic information that is important for recognizing vehicles.

The development of vehicle detection approaches has been accelerated since deep learning architectures appeared in 2012. Existing deep learning-based vehicle detection approaches can be divided, according to the detection process employed, into one-stage approaches such as Single Shot Multi-Box Detector (SSD) [12], You Only Look Once (YOLO) [13], YOLOv2 [14], YOLOv3 [15] and YOLOv4 [16], and two-stage methods such as Region CNN (RCNN) [17], Spatial Pyramid Pooling Network (SPP-Net) [18], Fast RCNN [19] and Faster RCNN [20]. Compared with one-stage approaches, two-stage methods can achieve a higher precision ratio at speeds that can meet real-time requirements. Therefore, this paper mainly investigates two-stage deep learning-based approaches.

When two-stage deep learning-based methods are applied to small vehicle detection, the following limitations may exist:

1. The quantity imbalance between diverse types of vehicles in the training dataset, caused by the random frequency and spatial distribution of vehicles, has a negative influence on training the network. Convolutional Neural Network (CNN) models tend to focus on vehicle categories with a larger number of samples, which may harm the detection of vehicle categories with fewer samples.

2. The features of small objects are less detailed than those of large or medium objects, which increases the difficulty of detecting vehicles. Features extracted by a CNN contain more semantic information, but the pooling operations in the CNN reduce the detailed information hidden in deep features, which decreases the discriminative ability of the features in distinguishing different vehicles.

3. There may exist high inter-class similarity in vehicle detection.

Considering the above problems and the research status of addressing each problem described in Section 2, this paper proposes an oriented vehicle detection framework based on improved Faster RCNN for aerial images. The major contributions of this paper can be summarized as follows.

• Different from basic data augmentation methods, we propose a data augmentation strategy based on oversampling and stitching to reduce the negative impact of category imbalance and to construct a dataset with a balanced number of samples.

• The pooling operations in a CNN may reduce the discriminative ability of features in distinguishing small objects. We amplify the last feature map by bilinear interpolation to increase the representational capability of the features with simple operations.

• Considering the small inter-class diversity between different types of vehicles, center loss is introduced into the loss function in order to increase the model's ability to distinguish different vehicle types.

• Considering the random orientation of vehicles, oriented bounding boxes and horizontal bounding boxes are jointly trained in the same framework so as to determine the position of vehicles more accurately.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to related work. Oriented vehicle detection based on feature amplification and oversampling data augmentation in aerial images is presented in Section 3. Section 4 describes the implementation details and the dataset, along with vehicle detection results and ablation studies. Section 5 discusses and analyzes the experimental results of Section 4. Section 6 concludes with the experimental findings and future directions.

Class Imbalance Problem
Class imbalance of objects in the training dataset is a common problem in object detection. There are usually two types of category imbalance: foreground-background imbalance and foreground-foreground imbalance [21]. Foreground-foreground imbalance is studied in this paper since it negatively affects multi-class object detection.
Numerous studies have addressed the foreground-foreground imbalance problem in the computer vision field. Ouyang et al. [22] proposed fine-tuning the distribution of under-represented categories by clustering similar categories. Oksuz et al. [23] proposed a foreground-balanced sampling method, which decreases the imbalance between the distributions of different objects within each batch by assigning a probability to each true bounding box. Wang et al. [24] proposed a sample exchange strategy that generates new samples and decreases the imbalance by exchanging objects of the same type between different natural images.
The above methods are mainly aimed at the class imbalance problem in natural imagery. In the remote sensing field, few researchers have considered the imbalance between types of training samples, and existing data augmentation methods are usually aimed at enhancing the generalization ability of the model. An imbalanced class distribution makes network training favor the vehicle categories with larger numbers of samples, which leads to poor results for the other categories. Therefore, this paper designs an oversampling and stitching data augmentation method for aerial images so that the numbers of different vehicles can be balanced when training the CNN.

Representation of Small Objects
Due to low resolution, blurred images, limited information and greater noise, small object detection has long been a difficult problem in object detection. Researchers have carried out several works to improve the representation of small objects. These methods mainly consist of improving the resolution of images containing small objects and enriching the detail information of the feature maps describing them.
In terms of improving image resolution, Ji et al. [25] fused an object detection network with an image super-resolution reconstruction network in order to increase the resolution of the original images. Singh et al. [26] proposed establishing multi-scale pyramids by resizing the training images. Moktari et al. [27] proposed a joint super-resolution and vehicle detection network that generates high-resolution images of vehicles from low-resolution aerial images. By increasing the image resolution, the discriminative ability of the features can be enhanced.
In terms of enriching the detail information of feature maps, Hilal et al. [28] proposed continuously deconvolving feature maps in order to increase the ability of shallow features to distinguish diverse objects. Mandal et al. [29] proposed AVDNet, which enlarges feature maps for vehicles by introducing ConvRes modules at different scale layers. Lin et al. [30] proposed layer-by-layer prediction using feature pyramids to detect multi-scale objects, which predicts outputs from the feature map of each CNN layer and finally selects the optimal detection results.
The above methods based on feature pyramids or image pyramids increase the computational cost and place high demands on graphics hardware. In addition, complex deep learning architectures may not achieve the desired results in detecting diverse vehicles. Different from existing complex architectures, in this paper we perform a simple but effective bilinear interpolation to amplify the feature map and enrich its detail information while maintaining the deep semantic information, which may increase the discriminative power of the features in representing vehicles.

The Discriminative Ability of Features
Vehicles are typical small-scale objects, and the differences between diverse types of vehicles are relatively small. Therefore, it is difficult to distinguish the specific category of each vehicle.
In terms of increasing the discriminative ability of features in distinguishing diverse objects, Li et al. [31] proposed using contextual features for increased discrimination. Contextual information related to the objects has been proved helpful for improving the capability of the features [32]. Deng et al. [33] used different feature layers to extract candidate regions in the region proposal network (RPN) for an increased recall ratio of multi-class objects; deep and shallow features are also concatenated to increase the precision ratio in the classification stage.
However, the above methods mainly address multi-category geographic object detection in remote sensing images rather than vehicle detection tasks. Compared with multi-class object detection, vehicle detection exhibits higher inter-class similarity, which may increase the possibility of misclassification. In this paper, center loss [34], which can control intra-class differences, is introduced to improve the capability of the features to distinguish different vehicles.

Oriented Object Detection
Remote sensing images are acquired by overhead sensors, and vehicles are moving objects with random spatial distribution and arbitrary orientation. Traditional horizontal bounding boxes can only roughly describe the positions of vehicles. Recently, some researchers have studied object detection algorithms with oriented bounding boxes. Ma et al. [35] proposed an oriented text detection algorithm to detect inclined text. Yang et al. [36] proposed a multi-oriented ship detection algorithm for remote sensing images. Ding et al. [37] proposed an oriented multi-class object detection method for aerial images.
Few studies have attempted to obtain both horizontal and oriented detection results within one CNN to get a more accurate position of the vehicles. Therefore, this paper designs a joint training loss function for horizontal and oriented bounding boxes to regress the vehicle position and direction.

Overall Architecture
In this part, we introduce the proposed oriented vehicle detection algorithm for aerial images based on feature amplification and oversampling data augmentation. This paper takes Faster RCNN as the research basis and makes improvements on it. Figure 3 depicts the overall structure of the algorithm. The basic feature extractor in the proposed framework is ResNet101 [38]. The proposed framework mainly consists of three parts: (1) oversampling and stitching data augmentation, (2) enlarging feature maps and (3) a joint training loss function combining center loss with horizontal and oriented bounding boxes. Each step can be illustrated as follows. First of all, we perform oversampling and stitching data augmentation on the training dataset, increasing the frequency of vehicles with fewer training samples to synthesize a new dataset.
In the RPN stage, we set up multi-scale and multi-shape horizontal anchors and select positive and negative samples for training the RPN by calculating the overlap between anchors and ground truths.
In the classification stage, we amplify the feature map to increase its ability to represent vehicles. Considering the orientation of vehicles, we propose a multi-task loss function, which jointly trains oriented and horizontal bounding boxes and introduces the center loss to decrease the within-class difference.

Data Augmentation for Foreground-Foreground Imbalance Problem by Oversampling and Stitching
Motivation of data augmentation by oversampling and stitching: the proposed data augmentation method aims to address the foreground-foreground category imbalance problem. This is a common problem in vehicle detection since the frequency and location of different vehicle types in aerial images are random. When there exists a large quantity variance between diverse vehicle types, objects may be over-represented or under-represented in the training process.
Two factors contribute to the foreground-foreground category imbalance, namely the imbalanced category distribution in the dataset and that within a batch of samples. We counted the number of the 9 types of vehicles in the VEDAI dataset; Table 1 reports the statistics in descending order of vehicle number. These statistics show that there exists a serious foreground-foreground category imbalance in the VEDAI dataset, which negatively affects the detection results for vehicles with a small number of samples. In addition, vehicles occupy small image areas, and vehicles with lower frequency usually have fewer matched anchors, which may make it harder for the network to learn useful information about them.
Considering that each image contains only a small number of vehicles while the background occupies a large image area, this paper designs a data augmentation method based on oversampling and stitching in order to decrease the impact of foreground-foreground imbalance on the training process. The central idea of the proposed method, shown in Figure 4, can be illustrated as follows; a code sketch of the core stitching operation follows Step 6.

Step 1: Augment the original training images by rotating them by 90°, 180° and 270° to generate the rotation dataset, ensuring diversity of object direction.
Step 2: Segment each vehicle from the rotation dataset of Step 1 according to the type and location of the vehicles in order to establish the vehicle template dataset. Meanwhile, considering that vehicles in each image occupy only a small area, the images in the rotation dataset with fewer than 10 vehicles are selected as the background image dataset.
Step 3: Count the number of vehicles of each category in the rotation dataset. We take the most numerous type as the expansion benchmark. In order to keep a balance between the quantities of vehicles, the number of vehicles of each category to be augmented is calculated.
Step 4: For each type of vehicle, a certain number of images from the background dataset and a random vehicle from the template dataset are used to synthesize new training images. We try to make each synthesized image include all types of vehicles to reduce the imbalanced distribution of the samples within a training batch.
Step 5: Considering the random location of vehicles in geographic space, randomly generate the position of the vehicles in the background images. In order to avoid repetition, we check whether there is any overlap between the positions of newly generated vehicles and those of the original vehicles in the image. When the overlap is 0, image synthesis is performed: the gray values of the generated vehicles replace those of the original pixels in the background image.
Step 6: Repeat Steps 4 and 5 until the number of vehicles from different categories in the training dataset is balanced.
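For clarity, the following is a minimal sketch of the stitching core of Steps 4 and 5, assuming images are held as NumPy arrays, templates are grouped per class in a dictionary, and boxes are axis-aligned (x1, y1, x2, y2) tuples; all function names here are illustrative and are not taken from the paper's implementation.

```python
import random

def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def stitch_vehicle(background, occupied, template, max_tries=50):
    """Step 5: paste one vehicle template at a random position whose box
    has zero overlap with every vehicle already in the image."""
    bg_h, bg_w = background.shape[:2]   # background is a NumPy image array
    t_h, t_w = template.shape[:2]
    for _ in range(max_tries):
        x1 = random.randint(0, bg_w - t_w)
        y1 = random.randint(0, bg_h - t_h)
        box = (x1, y1, x1 + t_w, y1 + t_h)
        if any(boxes_overlap(box, b) for b in occupied):
            continue  # overlap > 0: resample the position
        # Replace the background gray values with the template pixels.
        background[y1:y1 + t_h, x1:x1 + t_w] = template
        occupied.append(box)
        return box
    return None  # no free position found after max_tries

def synthesize_image(background, existing_boxes, templates_by_class):
    """Step 4: try to place one randomly chosen template of every vehicle
    class into a sparse background image; returns the new annotations."""
    occupied = list(existing_boxes)
    annotations = []
    for cls, templates in templates_by_class.items():
        box = stitch_vehicle(background, occupied, random.choice(templates))
        if box is not None:
            annotations.append((cls, box))
    return background, annotations
```

Repeating `synthesize_image` over backgrounds, as in Step 6, keeps adding under-represented classes until the per-class counts are approximately equal.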

Amplification of Deep Features for Small Objects
Motivation of deep feature amplification: pooling operations decrease the number of deep neural network parameters but may lose details of the feature maps of small objects. Feature amplification enlarges the deep feature map and restores its detailed information. We apply bilinear interpolation to the last feature map to increase the capability of the features in representing small objects with simple operations.
ResNet101 is the backbone for feature extraction in this paper. Figure 5 shows the structure of ResNet101, which includes four pooling (downsampling) operations. If a vehicle of 32 × 32 pixels undergoes 4 such operations, the corresponding feature map is 2 × 2 pixels, which cannot fully describe the information of a vehicle. Since the differences between the appearances of different vehicle types are relatively small, the detailed information of the feature map plays a very important role in distinguishing vehicles. Therefore, we propose to amplify the feature maps and increase the discriminative ability of the features for vehicles. There are two main methods for upsampling a feature map: interpolation and deconvolution. However, deconvolution usually produces checkerboard artifacts, which is not conducive to a detailed description of the features. Therefore, we adopt interpolation, specifically bilinear interpolation of the last feature map, which can be illustrated as follows.
Assume that the original feature map size is $w \times h$ pixels and the enlarged feature map size is $W \times H$ pixels, with pixel values $f(x, y)$ and $f(X, Y)$ respectively. To obtain the pixel value $f(X, Y)$ of the enlarged map, the corresponding position in the original feature map is found according to the enlargement ratio, as shown in Equation (1):

$$f(X, Y) = f\left(\frac{w}{W} X, \frac{h}{H} Y\right) \quad (1)$$

If the computed position is not an integer, the value is obtained by interpolating the surrounding pixels of the original feature map and assigning the result to $f(X, Y)$.

As shown in Figure 6, the central idea of bilinear interpolation is to obtain the value of the pixel to be interpolated from its four adjacent points $f(i, j)$, $f(i + 1, j)$, $f(i, j + 1)$, $f(i + 1, j + 1)$ by linear interpolation in the vertical and horizontal directions. Suppose the floating-point coordinates of the pixel to be interpolated are $(i + u, j + v)$, where $i$, $j$ are the integer parts of the coordinates and $u$, $v$ are the decimal parts with range $[0, 1)$. Then the pixel value $f(i + u, j + v)$ is determined by the values of the four surrounding pixels, as shown in Equation (2):

$$f(i + u, j + v) = (1 - u)(1 - v) f(i, j) + (1 - u) v\, f(i, j + 1) + u (1 - v) f(i + 1, j) + u v\, f(i + 1, j + 1) \quad (2)$$

where $f(i, j)$ represents the pixel value at location $(i, j)$ in the original feature map.
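The following is a direct, deliberately unoptimized NumPy transcription of Equations (1) and (2), useful for checking the arithmetic; in a real pipeline the framework's built-in bilinear resize would be used instead.

```python
import numpy as np

def bilinear_upsample(feat, scale=2.0):
    """Enlarge an (h, w, c) feature map following Equations (1) and (2)."""
    h, w, c = feat.shape
    H, W = int(h * scale), int(w * scale)
    out = np.empty((H, W, c), dtype=feat.dtype)
    for Y in range(H):
        for X in range(W):
            # Equation (1): map the target pixel back to source coordinates.
            y, x = (h / H) * Y, (w / W) * X
            i, j = int(y), int(x)        # integer parts
            u, v = y - i, x - j          # fractional parts in [0, 1)
            i1, j1 = min(i + 1, h - 1), min(j + 1, w - 1)
            # Equation (2): bilinear blend of the four neighbouring values.
            out[Y, X] = ((1 - u) * (1 - v) * feat[i, j]
                         + (1 - u) * v * feat[i, j1]
                         + u * (1 - v) * feat[i1, j]
                         + u * v * feat[i1, j1])
    return out

# On the GPU the same operation is a one-liner, e.g. in TensorFlow:
# tf.image.resize(feature_map, [2 * h, 2 * w], method="bilinear")
```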

Multi-Task Loss Function for Joint Horizontal and Oriented Bounding Boxes
Motivation of the multi-task loss function: the multi-task loss function aims to detect horizontal and oriented vehicles simultaneously by combining the loss of horizontal bounding boxes with that of oriented bounding boxes. In addition, vehicle detection is difficult due to the diversity in object appearance and the small differences between vehicle types. Therefore, we also introduce center loss into the multi-task loss function to improve the discriminative ability of the features.
Traditional object detection methods usually use horizontal bounding boxes $(x_{min}, y_{min}, x_{max}, y_{max})$ to describe the positions of objects. However, vehicles in aerial images usually have arbitrary orientation. For an oriented vehicle, we can describe the position more accurately by the coordinates of its four corners $(x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4)$.
As shown in Figure 7, when detecting an object that contains direction information, two types of anchors are usually used [39], namely horizontal anchors and oriented anchors. In ROI pooling, a horizontal anchor contains more contextual information around the object than an oriented anchor, which can assist object recognition. Therefore, horizontal anchors rather than oriented anchors are adopted in this article.

$$L(P_H, P_R, t, t^*) = L^H_{cls}(P_H, P^*_H) + L^R_{cls}(P_R, P^*_R) + \lambda_1 \sum_{i \in \{x,y,w,h\}} L^H_{reg}(t_i, t^*_i) + \lambda_2 \sum_{i \in \{x,y,w,h,\theta\}} L^R_{reg}(t_i, t^*_i) + \lambda_3 L_{Centerloss} \quad (3)$$

As shown in Equation (3), the proposed loss function consists of 5 parts, namely the cross-entropy loss of oriented objects $L^R_{cls}(P_R, P^*_R)$ in Equation (4), the cross-entropy loss of horizontal objects $L^H_{cls}(P_H, P^*_H)$ in Equation (5), the location loss functions of the oriented objects $\sum_{i \in \{x,y,w,h,\theta\}} L^R_{reg}(t_i, t^*_i)$ and of the horizontal objects $\sum_{i \in \{x,y,w,h\}} L^H_{reg}(t_i, t^*_i)$ in Equation (6), and the center loss function $L_{Centerloss}$ in Equation (9). $\lambda_1$, $\lambda_2$, $\lambda_3$ are the balance parameters.
In terms of classification, $P_H$ and $P_R$ are the probabilities that the predicted horizontal and oriented bounding boxes belong to each category, respectively; $P^*_H$ and $P^*_R$ are the true categories of the horizontal and oriented bounding boxes, respectively.
In the location regression, we convert the four corners to $(x, y, w, h, \theta)$ in order to describe the position of oriented vehicles, where $(x, y)$ are the coordinates of the vehicle center, $(w, h)$ are the width and height of the vehicle and $\theta$ is the angle from the horizontal direction. For the horizontal bounding boxes, $t^*$ represents the offset vector between true bounding boxes and positive anchors, composed of $x, y, w, h$; for the oriented bounding boxes, $t^*$ consists of $x, y, w, h, \theta$. $t$ represents the corresponding predicted coordinate vector. In Equations (7) and (8), $x^*$, $x_a$ and $x$ represent the true bounding box, the anchor and the predicted box respectively; $y$, $w$, $h$ and $\theta$ are represented in a way similar to $x$.
We introduce center loss, as shown in Equation (9), to decrease the within-class differences existing in the features and increase their ability to distinguish diverse vehicles:

$$L_{Centerloss} = \frac{1}{2} \sum_{n=1}^{m} \left\| x_n - c_{y_n} \right\|_2^2 \quad (9)$$

where $m$ is the batch size in the object classification stage, $c_{y_n}$ is the feature center of the $y_n$-th category and $x_n$ is the feature before the fully connected layer.
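A minimal TensorFlow sketch of the center loss term and the combined loss of Equation (3) is given below. The center update rate `alpha` and the `lam*` balance weights are our assumptions (the paper does not state their values), and the center update is a simplified variant that ignores per-class sample counts.

```python
import tensorflow as tf

def center_loss(features, labels, centers, alpha=0.5):
    """Equation (9): half the summed squared distance between each feature
    x_n and the center c_{y_n} of its class. `centers` is a trainable
    (num_classes, feat_dim) variable, e.g.
    centers = tf.Variable(tf.zeros([num_classes, feat_dim]))."""
    batch_centers = tf.gather(centers, labels)   # c_{y_n} for each sample
    diff = features - batch_centers              # x_n - c_{y_n}
    loss = 0.5 * tf.reduce_sum(tf.square(diff))
    # Move each used center a small step toward its assigned features.
    centers.scatter_add(tf.IndexedSlices(alpha * diff, labels))
    return loss

def total_loss(l_cls_h, l_cls_r, l_reg_h, l_reg_r, l_center,
               lam1=1.0, lam2=1.0, lam3=0.01):
    """Equation (3): joint horizontal/oriented loss with center loss."""
    return l_cls_h + l_cls_r + lam1 * l_reg_h + lam2 * l_reg_r + lam3 * l_center
```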

Dataset Description
The experimental dataset in this paper is Vehicle Detection in Aerial Imagery (VEDAI). The VEDAI dataset contains aerial images of 1024 × 1024 pixels extracted from the publicly available Utah AGRC database. The ground sampling distance (GSD) of the original imagery is 12.5 cm/pixel. Each image contains various types of small vehicles, backgrounds and potentially confusing objects. The dataset covers 9 vehicle categories, namely van, tractor, pick-up, car, camping car, truck, boat, plane and other vehicles, and this paper targets vehicle detection over these nine categories. On average, there are 5.5 vehicles per image, accounting for approximately 0.7% of the entire image. We split the dataset into two groups by randomly selecting 50% of the images for training and the remaining 50% for testing. The experiments were repeated twice to reduce measurement error.
The number of vehicles of each category in the original training dataset is shown in Table 2; the numbers of different vehicles in the training set are extremely imbalanced. Images before and after the proposed data augmentation are shown in Figure 8, where the red circles indicate newly generated objects. The original images before data augmentation usually contain only a few types of vehicles. After applying the proposed data augmentation, each image contains 9 different types of vehicles, with randomly generated vehicle positions. The proposed method helps to increase the frequency of the categories with fewer samples and to increase the background diversity of vehicles to a certain extent.

Experimental Setup
ResNet101 pre-trained on the ImageNet dataset [40] is used for feature extraction in this paper. The method is implemented in the TensorFlow framework on Ubuntu 16.04, with a GTX 1080 Ti GPU with 11 GB of memory. The mini-batch sizes of the RPN stage and the classification stage are 256 and 512, respectively. The learning rate is 0.003 for the first 30,000 iterations and 0.00003 for the subsequent 70,000 iterations, with the maximum number of iterations set to 100,000. The momentum is 0.9 and the weight decay is 0.0001.
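The training schedule above can be expressed as a piecewise-constant learning rate; the sketch below is a modern TF2 equivalent (the paper's original implementation predates these APIs), with the 0.0001 weight decay assumed to be applied via kernel regularizers on the layers.

```python
import tensorflow as tf

# 0.003 for the first 30,000 iterations, then 0.00003 for the remaining 70,000.
learning_rate = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[30000], values=[0.003, 0.00003])
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)
```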
In the RPN stage, we set horizontal anchors with various shape and scale parameters and set overlap thresholds between anchors and true objects to select positive and negative samples for training the RPN. In this article, the anchor scale parameter is set to (8, 16, 32, 64, 128), and the shape (aspect ratio) parameter is set to (1, 1/2, 2/1, 1/3, 3/1, 1/4, 4/1, 1/5, 5/1, 1/6, 6/1, 1/7, 7/1). Anchors with an IoU overlap below 0.3 are treated as negative samples, while those above 0.7 are treated as positive samples; only anchors whose overlap with the ground truth meets these conditions are used to train the RPN. A sketch of the resulting per-cell anchor set is given below.
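The sketch enumerates the 5 × 13 = 65 anchors per feature-map cell implied by the parameters above. We read the scale parameter as the anchor side length in pixels, which is an assumption; the paper does not spell out its exact anchor parameterization.

```python
import numpy as np

SCALES = [8, 16, 32, 64, 128]
RATIOS = [1, 1/2, 2, 1/3, 3, 1/4, 4, 1/5, 5, 1/6, 6, 1/7, 7]

def make_cell_anchors(scales=SCALES, ratios=RATIOS):
    """Generate the 65 anchors for one feature-map cell, centred at the
    origin: an anchor of scale s and ratio r keeps area s*s with
    height/width = r."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    return np.asarray(anchors)  # shape (65, 4), boxes as (x1, y1, x2, y2)

# Shifting these 65 boxes to every feature-map position at the backbone
# stride yields the full anchor set that is matched against ground truths.
```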
We carried out comparative experiments with baseline approaches that can detect oriented vehicles to demonstrate the effectiveness of the proposed algorithm. The baseline methods include Faster RCNN, Feature Pyramid Network (FPN) and Dense Feature Pyramid Network (DFPN); FPN and DFPN are evaluated with both rotated anchors (RA) and horizontal anchors (HA).

Evaluation Metric
Mean average precision (mAP) is the comprehensive quality evaluation metric used in this paper, representing the mean of the average precision over the vehicle types. The higher the mAP, the better the vehicle detection performance. Equation (10) shows how the average precision of each vehicle type is calculated:

$$AP = \sum_{n} (R_n - R_{n-1}) P_n \quad (10)$$
where $R_n$ and $P_n$ represent the recall ratio and the precision ratio at the $n$-th threshold. The precision ratio and recall ratio are defined in Equations (11) and (12):

$$P = \frac{TP}{TP + FP} \quad (11)$$

$$R = \frac{TP}{TP + FN} \quad (12)$$
where FP and TP are the numbers of wrongly and accurately detected vehicles, and FN is the number of undetected vehicles. If the Intersection over Union (IoU) [40] between the true location $B_{gt}$ and the predicted location $B_p$, computed by Equation (13), is above 0.5, the bounding box is counted as a TP; otherwise, it is counted as an FP.

$$IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \quad (13)$$
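A minimal sketch of this evaluation protocol, assuming detections for one class have already been matched against ground truths, is as follows; the function names are ours.

```python
import numpy as np

def iou(a, b):
    """Equation (13): intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(scores, is_tp, num_gt):
    """Equations (10)-(12): AP as the precision-weighted sum of recall
    increments. `scores` are detection confidences, `is_tp[k]` says whether
    detection k matched an unmatched ground truth with IoU > 0.5, and
    `num_gt` = TP + FN is the number of ground-truth vehicles."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt                   # Equation (12)
    precision = cum_tp / (cum_tp + cum_fp)     # Equation (11)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):        # Equation (10)
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

The mAP then averages `average_precision` over the nine vehicle categories.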

Detection Results and Comparison with Baseline Methods
Figure 9 shows examples of the detection results of the proposed method, which detects diverse types of vehicles well. While determining the specific category of each vehicle, the proposed method simultaneously obtains good detection results for both the horizontal and the oriented bounding boxes. Table 3 shows the horizontal and oriented detection results of the baseline approaches. All baseline approaches are run on the merged dataset produced by the rotation augmentation and the oversampling augmentation proposed in this paper. The detection results of the proposed framework are better than those of the Faster RCNN, DFPN and FPN algorithms for both horizontal and oriented bounding boxes. Rotated anchors contain less context information than horizontal anchors, so the detection accuracy of FPN and DFPN with rotated anchors is obviously lower than that of the variants with horizontal anchors. The proposed framework with horizontal anchors improves on the Faster RCNN method: the enlarged feature maps increase the discriminative ability of the features by restoring the detailed information of the feature maps, and center loss relatively increases the gap between the features of different vehicle types by reducing the intra-class diversity within each type, which may decrease the misclassification of similar vehicle types. Therefore, the overall detection accuracy of the proposed approach is higher than that of the Faster RCNN algorithm. The FPN method establishes a feature pyramid to detect multi-scale objects. FPN with horizontal anchors achieves the highest detection accuracy for the camping car, tractor and van categories since it selects features suitable for detecting a certain type of vehicle from the multi-layer feature pyramid. However, its detection results for other categories are unsatisfactory, especially for the airplane, which is the category with the smallest number of samples in the original dataset, limiting the diversity of airplane samples. FPN increases the number of training parameters to build the feature pyramid and thus requires a larger number of samples, which may decrease the vehicle detection performance.
Compared with FPN, DFPN makes fuller use of multi-layer features to build a tighter feature pyramid between different feature layers, which makes the network better adapt to multi-scale objects. Therefore, the detection accuracy of DFPN with rotated anchors is slightly higher than that of FPN, while the accuracy of DFPN with horizontal anchors is comparable with that of FPN with horizontal anchors.
The same problem exists in the DFPN method: the dense feature pyramid may increase the number of network training parameters, which may lead to unsatisfactory detection accuracy for most vehicle categories. The horizontal detection results of FPN and DFPN are better than those of Faster RCNN because of the multi-layer features. However, for both DFPN and FPN with horizontal anchors there is a large gap between the results for horizontal bounding boxes and those for oriented bounding boxes, because it is difficult for these methods to regress the direction of oriented objects. FPN captures the direction information better than DFPN: the oriented results of DFPN with horizontal anchors are about 2% mAP lower than those of FPN.
In the proposed method, a multi-task loss function that introduces center loss constrains the detection results of both the oriented and the horizontal bounding boxes. The accuracy difference between the horizontal and oriented bounding boxes of the proposed method is small, and its overall accuracy is better than that of the other methods.

Comparison of Different Data Augmentation Methods
We compare the effect of different data augmentation approaches on the detection results in this section. We adopt three kinds of training datasets for the experiments: the dataset after rotation augmentation (R), the dataset after the proposed oversampling and stitching augmentation (O), and the merged dataset (M) combining both. Among them, the mAP of the merged dataset is the highest for both horizontal and oriented bounding boxes, with 56.3% and 53.8% respectively, better than the R and O datasets. The mAP of the O dataset reaches 55% and 53.2% for the horizontal and oriented bounding boxes, better than the 52.6% and 50.2% mAP of the R dataset.
As can be seen in Tables 4 and 5, the rotation augmentation method does not address the uneven distribution of vehicle numbers in the dataset; the trained network still favors vehicle categories with a large number of samples. After the proposed oversampling and stitching augmentation, the frequency of vehicle categories with fewer training samples is increased. Therefore, the network trained on the dataset after oversampling and stitching augmentation can better distinguish diverse types of vehicles.
As shown in Tables 4 and 5, after the proposed data augmentation the recall ratios of the pick-up, truck, other, tractor, boat, van and plane categories are improved, and the corresponding mAP is also improved. The car, pick-up and camping car categories have the most training samples in the rotated training dataset. As the numbers of vehicles in the other categories are increased by the proposed augmentation, the tendency of the network toward categories with larger numbers of training samples is reduced. Therefore, the recall ratio and average precision of these majority categories decrease slightly, while those of the remaining categories increase after the proposed data augmentation.
Although the dataset based on oversampling and stitching augmentation method can increase accuracy of vehicle detection to some degree.This type of synthesis method increases the complexity of the background objects.Moreover, a certain gap exists between the synthetic and real imagery.In order to decrease the negative influence of background diversity on object detection, we combine O dataset with R dataset to form a new training dataset M. According to the results of the merged the merge dataset generated in this paper.The accuracy of horizontal and oriented vehicle detection are improved by 3% mAP and 5% mAP respectively.The center loss function improves the detection results by 2% mAP since it can reduce the intra-class feature differences of the different categories.The proposed framework achieve a mean average precision of 60.4% and 60.1% in detecting horizontal and oriented bounding boxes respectively.All the improvements in this article have achieved accuracy improvements of approximately 8% mAP totally in detecting horizontal and oriented bounding boxes, respectively.dataset will further have a negative impact on training the network.In this section, we discuss the proposed data augmentation methods on the number of objects and positive samples.In order to increase the angle diversity of trained objects, rotation augmentation is performed.As shown in Figure 10a, the number of vehicles of Group 1 varies from 83 to 2798.And as shown in Figure 11a, the number of vehicles of Group 2 varies from 108 to 2680.The rotation augmentation cannot solve the category imbalance, but increase the difference between the numbers of different classes.The small size of the vehicles, the center deviation caused by pooling operation, and the anchor parameters make detection of small objects difficult.The category imbalance between vehicles in the dataset will further have a negative impact on training the network.In this section, we discuss the proposed data augmentation methods on the number of objects and positive samples.
In order to increase the angle diversity of trained objects, rotation augmentation is performed.As shown in Figure 10a, the number of vehicles of Group 1 varies from 83 to 2798.And as shown in Figure 11a, the number of vehicles of Group 2 varies from 108 to 2680.The rotation augmentation cannot solve the category imbalance, but increase the difference between the numbers of different classes.The proposed the oversampling and stitching data augmentation method can make the number of each type of vehicle relatively equal.We take the most numerous type of the car as an expansion benchmark and increase the frequency of remaining 8 vehicle categories in different background images.As shown in Figure 10a, the number of vehicles after proposed method ranges from 2519-2798 in Group 1. Similarly, as shown in Figure 11a, the number of vehicles ranges from 2472-2680 in Group 2. We find that the vehicle number has been balanced after the proposed data augmentation method.
In the region proposal stage, the horizontal anchors are used to selective the samples for the network.For small vehicles that are difficult to invest in network training, increasing the number of samples is conducive to train the model.We separately count the number of positive samples (IoU > 0.7) extracted from R and O dataset on the last feature map extracted by resnet101 so as to prove the role of the proposed data augmentation method.The statistical results are shown in Figures 10b and  11b.The proposed data augmentation approach can effectively increase the amount of positive The small size of the vehicles, the center deviation caused by pooling operation, and the anchor parameters make detection of small objects difficult.The category imbalance between vehicles in the dataset will further have a negative impact on training the network.In this section, we discuss the proposed data augmentation methods on the number of objects and positive samples.
In the region proposal stage, horizontal anchors are used to select training samples for the network. For small vehicles that are difficult to involve in network training, increasing the number of samples is conducive to training the model. To verify the role of the proposed data augmentation method, we separately count the number of positive samples (IoU > 0.7) extracted from the R and O datasets on the last feature map produced by ResNet-101. The statistical results are shown in Figures 10b and 11b. The proposed data augmentation approach effectively increases the number of positive samples used for training the RPN and increases the diversity of training samples. Therefore, the proposed data augmentation method is an effective way to increase the number of effective samples and the ability of the network to detect small objects.
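The positive-sample statistic itself can be reproduced with a small routine that, for a set of horizontal anchors, counts those whose IoU with any ground-truth box exceeds 0.7; the sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) format and only illustrates the counting criterion, it is not the paper's evaluation code.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4),
    both given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = np.split(anchors, 4, axis=1)        # each (N, 1)
    gx1, gy1, gx2, gy2 = gt_boxes.T                           # each (M,)
    inter_w = np.clip(np.minimum(ax2, gx2) - np.maximum(ax1, gx1), 0, None)
    inter_h = np.clip(np.minimum(ay2, gy2) - np.maximum(ay1, gy1), 0, None)
    inter = inter_w * inter_h                                 # (N, M)
    area_a = (ax2 - ax1) * (ay2 - ay1)                        # (N, 1)
    area_g = (gx2 - gx1) * (gy2 - gy1)                        # (M,)
    return inter / (area_a + area_g - inter)

def count_positive_anchors(anchors, gt_boxes, thresh=0.7):
    """Anchors treated as positive RPN samples (IoU with some GT > thresh)."""
    if len(gt_boxes) == 0:
        return 0
    return int((iou_matrix(anchors, gt_boxes).max(axis=1) > thresh).sum())
```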

Analysis of the Feature Amplification Parameters
To verify that vehicle detection results can be enhanced by the feature map magnification operation, we use the merged dataset to examine the magnification of deep feature maps and analyze the impact of different amplification parameters on vehicle detection. Two interpolation methods, bilinear interpolation and nearest neighbor (NN) interpolation, are used for comparison. H and O denote the horizontal and oriented detection results, respectively.
Bold values indicate the best average precision of horizontal or oriented bounding boxes in each row of Table 11. The detection accuracies of horizontal and oriented vehicles obtained by the proposed method without feature amplification are 58.9% mAP and 58.3% mAP, respectively. The experimental results show that enlarging the feature map by bilinear interpolation improves the detection accuracy to a certain extent. When bilinear interpolation is used to enlarge the feature map, we investigate the influence of different magnifications on the detection results. Among the comparison experiments with amplification multiples of 1.5, 2.0, 2.5, and 3.0, bilinear amplification with a multiple of 2.0 yields the largest gain, about 2% mAP for both horizontal and oriented vehicles.
We also perform amplification experiments with nearest neighbor interpolation at a multiple of 2.0. Nearest neighbor amplification yields lower accuracy than the method without feature map amplification, since it causes a jagged effect on the enlarged feature map, which is not beneficial for representing the truck, car, others, and tractor categories.
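For reference, the amplification step can be written with PyTorch's F.interpolate, enlarging the backbone feature map by a chosen multiple before it reaches the detection head; the 2.0x bilinear setting corresponds to the best-performing configuration above, and the nearest mode reproduces the jagged baseline. This is a minimal sketch with illustrative tensor shapes, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def amplify_feature_map(feat, scale=2.0, mode="bilinear"):
    """Enlarge a deep feature map before the detection head.

    feat  : (N, C, H, W) tensor, e.g. the last ResNet-101 feature map.
    scale : amplification multiple (1.5, 2.0, 2.5 and 3.0 were compared).
    mode  : 'bilinear' (smooth) or 'nearest' (jagged baseline).
    """
    if mode == "bilinear":
        return F.interpolate(feat, scale_factor=scale, mode="bilinear",
                             align_corners=False)
    return F.interpolate(feat, scale_factor=scale, mode="nearest")

feat = torch.randn(1, 2048, 32, 32)        # dummy backbone output
enlarged = amplify_feature_map(feat, 2.0)  # -> (1, 2048, 64, 64)
```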

Conclusions
Vehicle detection in remote sensing images is difficult because of the small size of vehicles and the class imbalance. To enhance the vehicle detection results, we propose an oriented vehicle detection method for aerial images consisting of three indispensable parts: an oversampling and stitching data augmentation method, amplification of the feature map, and a joint training loss function for horizontal and oriented bounding boxes with the center loss. The three parts respectively address the foreground-foreground category imbalance, the reduced discriminative ability of features caused by the pooling operation, and the small inter-class diversity between types of oriented vehicles. The experiments on the VEDAI dataset support the following conclusions.
(1) The proposed framework outperforms most previous vehicle detection approaches. The method proposed in this paper can effectively detect oriented vehicles with an 8% higher mAP than the original Faster RCNN approach. (2) The proposed oversampling and stitching data augmentation method is an effective way to address class imbalance. Datasets that combine oversampling and stitching data augmentation with rotation augmentation improve mAP by about 3%, since they increase the number of effective samples in the network and reduce the imbalance between vehicle categories to a certain extent. (3) The amplified feature map helps the network better distinguish different categories of vehicles, improving mAP by about 3%. (4) The multi-task loss function produces the horizontal and oriented detection results simultaneously, and the center loss improves the accuracy since it reduces the intra-class diversity of the vehicle categories to a certain extent.
Although the proposed method improves mAP compared with the Faster RCNN method, the overall precision ratio is still low, mainly because a large number of background regions are wrongly detected as foreground objects. In the future, we will study how to increase the ability of features to distinguish small objects.

Figure 1 .
Figure 1.Average length and width of different vehicles in VEDAI dataset.

Figure 2 .
Figure 2. Examples of different vehicle types in VEDAI dataset.

Figure 3 .
Figure 3.The flowchart of oriented vehicle detection based on feature amplification and oversampling data augmentation in aerial images.

Figure 4 .
Figure 4. Schematic of oversampling and stitching data augmentation for foreground-foreground class imbalance problem.Step 1: Augment the original training images by rotating them with angles of 90 • , 180 • , 270 • to generate the rotation dataset ensuring the diversity of object direction.
Let $f(x, y)$ and $f(X, Y)$ denote pixel values on the original and the enlarged feature maps, respectively. To obtain the pixel value of a point $(X, Y)$ on the enlarged feature map, the corresponding position on the original feature map is computed according to the enlargement ratio, as shown in Equation (1), where $(w, h)$ and $(W, H)$ are the sizes of the original and enlarged feature maps:
$(x, y) = \left( X \cdot \tfrac{w}{W},\; Y \cdot \tfrac{h}{H} \right)$.    (1)
If the computed position is not an integer, it is decomposed as $(i + u, j + v)$ with integer part $(i, j)$ and fractional part $(u, v)$, and the value $f(i + u, j + v)$ is determined from the four surrounding pixels, as shown in Equation (2), where $f(i, j)$ denotes the pixel value at location $(i, j)$ in the original image:
$f(i + u, j + v) = (1 - u)(1 - v) f(i, j) + (1 - u) v\, f(i, j + 1) + u (1 - v) f(i + 1, j) + u v\, f(i + 1, j + 1)$.    (2)
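A small NumPy rendering of Equation (2) is given below as a sanity check of the interpolation rule; it is a generic sketch of standard bilinear interpolation rather than the paper's code.

```python
import numpy as np

def bilinear_sample(f, x, y):
    """Value of f at a non-integer location (x, y) from Equation (2);
    f is a 2-D array and x, y index rows and columns respectively."""
    i, j = int(np.floor(x)), int(np.floor(y))
    u, v = x - i, y - j
    return ((1 - u) * (1 - v) * f[i, j] + (1 - u) * v * f[i, j + 1]
            + u * (1 - v) * f[i + 1, j] + u * v * f[i + 1, j + 1])

f = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(bilinear_sample(f, 0.5, 0.5))  # 2.5, the mean of the four neighbours
```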

Figure 6 .
Figure 6.Flow diagram of bilinear interpolation to enlarge feature map.

Figure 7 .
Figure 7. Two types of anchor setting methods of oriented objects.(a) Horizontal anchor.(b) Rotated anchor.
In the joint loss function, the balance parameters weight the individual terms. $P_H$ and $P_R$ are the probabilities that the predicted horizontal and oriented bounding boxes belong to each category, respectively; $P_H^*$ and $P_R^*$ are the true categories of the horizontal and oriented bounding boxes, respectively.
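Only these classification terms survive in the extracted text, so the following is a hedged sketch of how such a joint loss is commonly assembled for a two-branch (horizontal plus oriented) head with an added center loss; the balance parameters, the center-loss weight, and the omission of positive-sample masking are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(cls_h, cls_r, reg_h, reg_r, roi_feats, centers,
               labels_h, labels_r, reg_targets_h, reg_targets_r,
               lam_reg=1.0, lam_center=0.01):
    """Sketch of a multi-task loss with horizontal (H) and oriented (R)
    branches plus a center loss; masking of negative samples in the
    regression terms is omitted for brevity."""
    loss_cls_h = F.cross_entropy(cls_h, labels_h)     # uses P_H vs P_H*
    loss_cls_r = F.cross_entropy(cls_r, labels_r)     # uses P_R vs P_R*
    loss_reg_h = F.smooth_l1_loss(reg_h, reg_targets_h)
    loss_reg_r = F.smooth_l1_loss(reg_r, reg_targets_r)
    # Center loss: pull each RoI feature toward its class center to
    # reduce intra-class diversity.
    loss_center = ((roi_feats - centers[labels_h]) ** 2).sum(dim=1).mean()
    return (loss_cls_h + loss_cls_r
            + lam_reg * (loss_reg_h + loss_reg_r)
            + lam_center * loss_center)
```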

Figure 8 .
Figure 8.Comparison of images before and after augmentation by the oversampling and stitching method.(a-d) are original images from VEDAI dataset, and (e-h) are images synthesized by the proposed method corresponding to (a-d).

Figure 9 .
Figure 9. Examples of the detection results by the proposed method.The odd rows are the detection results of the horizontal bounding boxes (HBB).The even rows are the detection results of the oriented bounding boxes (OBB).

Table 1 .
Number of each vehicle category in VEDAI dataset.

In the bounding-box regression targets, the three versions of each coordinate represent the true bounding box, the anchor, and the predicted box, respectively; $y$, $w$, $h$, and $\theta$ are parameterized in the same way as $x$.
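For reference, the regression targets usually take the standard rotated Faster RCNN form shown below; the choice of $x$, $x_a$, $x^*$ for the predicted box, anchor, and true box is the conventional notation and is assumed here rather than copied from the paper.

```latex
t_x = \frac{x - x_a}{w_a}, \quad
t_y = \frac{y - y_a}{h_a}, \quad
t_w = \log\frac{w}{w_a}, \quad
t_h = \log\frac{h}{h_a}, \quad
t_\theta = \theta - \theta_a,
\qquad
t_x^* = \frac{x^* - x_a}{w_a}, \;\ldots
```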

Table 2 .
Vehicle number statistics in training dataset of VEDAI.

Table 3 .
Horizontal and oriented detection results of comparison approaches.

Table 9 .
The mean of ablation experimental results of OBB from two groups.