Rapid Vehicle Detection in Aerial Images under the Complex Background of Dense Urban Areas

Abstract: Vehicle detection in aerial remote sensing images under the complex background of urban areas has always received great attention in the field of remote sensing; however, remote sensing images usually cover a large area, vehicles occupy few pixels, and the background is complex. Therefore, compared with object detection in ground-view images, vehicle detection in aerial images remains a challenging problem. In this paper, we propose a single-scale rapid convolutional neural network (SSRD-Net). In the proposed framework, we design a global relational (GR) block to enhance the fusion of local and global features; moreover, we adjust the image segmentation method to unify the vehicle size in the input image, thus simplifying the model structure and improving the detection speed. We further introduce an aerial remote sensing image dataset with rotated bounding boxes (RO-ARS), which covers complex backgrounds such as snow, cloud, and fog scenes. We also design a data augmentation method to generate more images with clouds and fog. Finally, we evaluate the performance of the proposed model on several datasets; the experimental results show that recall and precision are improved compared with existing methods.


Introduction
With the development of remote sensing technology, the quantity of remote sensing images has greatly increased. Compared with spaceborne remote sensing images, aerial remote sensing images have the advantages of large imaging scale, high resolution, and accurate geometric correction [1][2][3]. Therefore, aerial remote sensing remains an important remote sensing approach; it usually uses airplanes or balloons as working platforms, with flying altitudes between hundreds of meters and tens of kilometers. In aerial remote sensing images, vehicle detection is an indispensable technology in civil and military surveillance, such as traffic management and urban planning [4][5][6][7]; however, manual interpretation for vehicle identification has low data utilization and poor timeliness, and is easily affected by the interpreter's physical condition, state of mind, and subjective judgment. Therefore, it is particularly necessary to detect vehicles in remote sensing images automatically, efficiently, and accurately.
Compared with general images, aerial remote sensing images have a unique perspective, and the task becomes challenging for the following reasons:
Large field of view (FOV): Aerial remote sensing images are taken by high-resolution imaging sensors, and the obtained images generally combine a large field of view with high resolution, so each target occupies few pixels. Simply down-sampling to the input size required by most algorithms is therefore not suitable.
Large scale range: Since the collection heights and sensor parameters of remote sensing images differ, the scales of similar targets are inconsistent. In general, objects of interest in aerial images are very small and densely clustered.
Special perspective: Aerial imagery is a top view, so ground targets have complete rotation invariance and arbitrary orientation angles; on the other hand, targets do not occlude one another.
Complex background: Urban remote sensing images contain a large number of objects with characteristics similar to vehicle targets; moreover, aerial images are susceptible to weather such as clouds and fog, so the impact of complex weather conditions must be considered.
Traditional target detection algorithms [8], such as the Viola–Jones detector [9,10], the HOG detector [11], and the deformable part model (DPM) [12][13][14][15], are usually designed around geometric features, spatial relationships, target contours, and other hand-crafted cues [16]. These algorithms achieve only low accuracy, while higher-accuracy methods, such as frame differencing, can only detect vehicles in motion. Such hand-crafted features are not robust to the diversity of the environment and adapt poorly to the needs of actual scenes. Since the introduction of convolutional neural networks in the ImageNet large-scale visual recognition challenge (ILSVRC) [17,18], network models based on deep learning have achieved remarkable results in the field of target detection. Meanwhile, the design of large, high-quality general-purpose labeled datasets, such as Pascal VOC [19,20], LVIS [21], and MS COCO [22], has also promoted the progress of target detection technology.
Many recent works have exploited SOTA detectors, such as Faster R-CNN [27], deformable R-CNN [39], and YOLOv4 [33]. Observing the input sizes of these models, Faster R-CNN resizes the short side of the input image to 600 pixels, and YOLO runs on 608 × 608-pixel inputs. None of these models can directly receive the typical size of aerial remote sensing images (ITCVD [40]: ~5616 × 3744 pixels; DOTA [41]: 4000 × 4000 pixels).
In order to meet the requirements of these standard architectures, resizing the image is not feasible, because it directly eliminates small targets (MS COCO definition [22]: <32 × 32 pixels). To solve this problem, existing algorithms usually segment the original image first. The YOLT [42] model adopts a "sliding window" method for cropping and designs a 15% overlap to ensure that all regions are analyzed; however, the size of the target object depends on the shooting height and camera parameters. When the original image is cropped at a fixed size, the target pixel count still has a large dynamic range, which degrades detection ability.
Existing detection models contain many down-sampling layers, which expand the receptive field. Vehicle targets in aerial remote sensing images are relatively small (ITCVD [40]: ~30 × 15 pixels) and have few features; they are easily submerged by background features, and it is difficult to extract effective feature information. Existing algorithms usually use feature fusion to improve the ability to detect small targets. Specifically, the SSD model [34] uses a pyramidal feature hierarchy, and the Mask R-CNN [43] model uses the Feature Pyramid Network (FPN) [44] structure. Ablation experiments show that such structures help detect small targets; however, current techniques are still suboptimal for aerial images, which contain a large amount of redundant information that reduces detection efficiency.
In addition, several studies indicate that enhancing the fusion of contextual information is crucial for small-target detection. Some studies have adopted long short-term memory networks (LSTM) [45] and spatial memory networks (SMN) [46] to enhance target features. For instance, AC-FCN [47] pointed out that the information between targets can help improve detection capability; however, the structures of such methods are usually complex, and they remain simple feed-forward networks that easily lose feature information. Moreover, FA-SSD [48] improves the extraction of context information for small targets by using more high-level abstract features. These methods achieve good results, but they are not suitable for aerial images because they lack real-time capability.
We also notice that the lack of datasets is another important reason why aerial images are difficult to process. The labeled boxes of some datasets do not provide orientations, so the labeled boxes in dense target areas overlap heavily, which greatly affects target detection in dense areas. Research on deep learning-based target detection algorithms is inseparable from the support of data. Many scholars have established target detection datasets for remote sensing images, such as DOTA [41], VEDAI [49], and DLR 3K [50]; however, most of the images in the DOTA dataset come from Google Earth and were taken by spaceborne remote sensing satellites, so they cannot truly reflect the perspective of aerial remote sensing images. The VEDAI dataset has a small number of vehicles, sparse distribution, and simple backgrounds, all of which make vehicle targets relatively easy to detect. Although the DLR 3K dataset is more challenging and authentic, it contains only 20 aerial images, too few for training a convolutional neural network model. Furthermore, for the cloud and fog phenomena in aerial remote sensing images, there are usually two solutions: one is to improve the image quality through a haze removal algorithm, such as DCP (Dark Channel Prior) [51,52], MC (Maximum Contrast) [53], or CAP (Color Attenuation Prior) [54]. The other is to train on images containing haze and constrain the features through the objective function. Nevertheless, existing haze removal algorithms usually produce halo effects or color distortion [55], and existing datasets do not include the haze phenomenon.
As described, although the performance of the above models is impressive, none of the existing frameworks can handle aerial remote sensing images well. To address these problems, we propose several targeted schemes. The main contributions of this paper are as follows: (1) An adaptive image segmentation method based on the parameters of the aerial platform and camera is proposed; this method limits the size of the target to a small range by dynamically adjusting the crop size, which plays a major role in improving the speed and accuracy of detection. (2) In view of the high speed and accuracy of YOLOv4 [33], this paper uses YOLOv4 as the main framework for vehicle detection. We present the single-scale rapid convolutional neural network (SSRD-Net), with a denser prediction anchor structure and an optimized feature fusion structure to improve the ability to detect small targets. (3) We design an aerial remote sensing image dataset (RO-ARS) with rotated boxes.
The dataset has annotated flight height and camera focal length. In order to improve the authenticity of the dataset, we propose affine transformation and haze simulation methods to augment the dataset.
The rest of this paper is organized as follows: Section 2 starts with the proposed image cutting method and then introduces the details of the proposed dataset, including the affine transformation and haze simulation methods. Furthermore, the details of the proposed SSRD-Net model are introduced. In Section 3, through several experiments, the image segmentation method, affine transformation, and haze simulation method are evaluated and discussed. Finally, the conclusions are provided in Section 4.

Method
The framework of our proposed vehicle detection method is illustrated in Figure 1. It is mainly composed of three parts: image pretreatment, feature extraction, and image mosaicking. During image pretreatment, the original large-scale images are cropped into small-scale blocks for training and testing. Through the proposed model (SSRD-Net), we obtain the vehicle detection result for each block. In the end, we mosaic all the blocks together and use the non-maximum suppression (NMS) [56] method to eliminate duplicate targets. In this section, we give the details of each part of the proposed framework and discuss how our method improves the accuracy of vehicle detection in aerial images.
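The mosaicking step above depends on NMS to merge duplicate detections coming from overlapping blocks. A minimal sketch follows, assuming for simplicity axis-aligned boxes already mapped into global mosaic coordinates (the full model produces rotated boxes and uses the skew IoU described later); all names here are illustrative:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over axis-aligned boxes.

    boxes: (N, 4) array of [x1, y1, x2, y2] in global (mosaic) coordinates.
    scores: (N,) confidence scores. Returns indices of kept boxes.
    """
    order = np.argsort(scores)[::-1]        # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap box i above the threshold.
        order = order[1:][iou <= iou_thresh]
    return keep
```

Because a target near a block edge is detected in two adjacent cutouts, the duplicate with the lower confidence is suppressed once both are expressed in mosaic coordinates.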


Image Segmentation
As described above, feeding the original image into the detection model is not suitable for aerial remote sensing images with a large field of view. If the original image is cropped at a fixed size, the targets in the resulting blocks still span a large scale range. This requires the detection model to have multi-scale target detection capabilities, which reduces the detection efficiency of the method. In order to solve the problem of inconsistent target scales in aerial remote sensing images at different flight heights and camera focal lengths, we propose an adaptive cutting method.
By analyzing the shooting conditions of aerial remote sensing images, we know that the shooting angle is usually close to vertical to the ground. According to the schematic diagram of the camera in Figure 1, the parameter relationship can be represented as
w / w_t = v / h, (1)
where w is the optical size of the target on the CMOS/CCD, w_t is its physical size, v is the image distance, and h is the object distance. The basic relationship among the focal length (f), object distance (h), and image distance (v) can be expressed by the Gaussian imaging equation:
1/f = 1/v + 1/h. (2)
Obviously, the object distance (h) is much larger than the image distance (v) in aerial remote sensing, so we conclude:
v ≈ f. (3)

According to Formulas (1) and (3), the actual number of pixels occupied by the target can be expressed as
n_t = (w_t · f) / (h · p), (4)
where p denotes the pixel size. It can be concluded that the number of target pixels depends on the flight altitude and camera focal length. Accordingly, we partition images of arbitrary size into manageable cutouts according to the number of target pixels. Partitioning takes place via a sliding window with overlap. The partition size (L_P) and overlap (L_o) are defined as
L_P = a · n_t, L_o = b · n_t, (5)
where a and b are hyperparameters. By default, a is equal to the number of output grids (N_grid) so that each grid cell corresponds to only one target. To avoid omissions caused by the segmentation of a target, b is set to a number greater than 1 (1.5 by default). During the mosaicking process, non-maximum suppression over this overlap is necessary to refine detections at the edges of the cutouts.
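A small numerical sketch of the pixel-count and partition-size relations above (function and parameter names are our own; lengths are in meters, with a typical 5 m vehicle, 50 mm focal length, 1000 m altitude, and 5 µm pixel pitch as illustrative values):

```python
def target_pixels(w_t, f, h, p):
    """Approximate number of pixels spanned by a target of physical size w_t,
    given focal length f, flight altitude h, and pixel pitch p (all meters),
    via n_t = w_t * f / (h * p), valid because h >> v implies v ~ f."""
    return w_t * f / (h * p)

def crop_params(w_t, f, h, p, a=76, b=1.5):
    """Adaptive sliding-window parameters: partition size L_P = a * n_t and
    overlap L_o = b * n_t, both in pixels. By default a equals the number of
    output grid cells (so each cell covers roughly one target) and b > 1
    prevents a target from being split without appearing whole in any block."""
    n_t = target_pixels(w_t, f, h, p)
    return round(a * n_t), round(b * n_t)
```

For the illustrative values above, a vehicle spans about 50 pixels, giving a 3800-pixel crop with a 75-pixel overlap; halving the altitude doubles both, keeping the target's apparent size in the crop constant.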

Data Augmentation
According to the research status described in Section 1, the number of pictures in the aerial image data set is insufficient. It is a heavy task to construct a large number of aerial remote sensing images. To enrich the content of the dataset and improve the robustness of our model, we design a method to increase the dataset size.
Affine transformation describes the mapping between two images, which can be regarded as the superposition of a linear transformation and a translation; it plays an important role in image correction [57][58][59], image registration [60], etc. In Cartesian coordinates, it is expressed as
[u, v]^T = [[a_11, a_12], [a_21, a_22]] [x, y]^T + [t_x, t_y]^T,
where (x, y) are the original pixel coordinates and (u, v) are the coordinates after the affine transformation. Its basic transformations, shown in Figure 2, include translation, scale, rotation, reflection, and shear. Notably, aerial remote sensing images can be rotated at any angle, while the scale is limited (0.8-1.2 by default).
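A minimal sketch of the point-wise affine mapping used for augmentation (rotation, isotropic scale, translation; helper names are illustrative). The same matrix applied to an image grid also transforms the rotated-box annotations:

```python
import numpy as np

def affine_matrix(angle_deg=0.0, scale=1.0, tx=0.0, ty=0.0):
    """2x3 affine matrix combining rotation, isotropic scale, and translation."""
    t = np.deg2rad(angle_deg)
    c, s = np.cos(t) * scale, np.sin(t) * scale
    return np.array([[c, -s, tx],
                     [s,  c, ty]])

def warp_points(points, M):
    """Apply the affine map (u, v)^T = A (x, y)^T + t to an (N, 2) point array."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    return pts @ M.T
```

Sampling the rotation angle uniformly over [0°, 360°) and the scale over [0.8, 1.2], as stated in the text, yields augmented boxes that remain valid rotated rectangles.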
Moreover, to address the complex cloud and fog weather in aerial images, we propose an image degradation method that simulates cloud interference. Retinex (retina-cortex) theory [51,52] points out that the observable information of an object is determined by two factors: the reflection properties of the object and the light intensity around it. The light intensity determines the dynamic range of all pixels in the original image, while the inherent property (color) of the original image is determined by the reflection coefficient of the object.
As shown in the left of Figure 3, the object is illuminated by global atmospheric light, and the reflected light forms the image. The process can be expressed as
I(x) = J(x) · t(x) + A · (1 − t(x)),
where I(x) is the observed intensity, A is the global atmospheric light, J(x) is the scene radiance, and t(x) is the medium transmission, describing the portion of the light that is not scattered and reaches the camera. The goal of haze simulation is to create A and t(x), so we only need to add global atmospheric light noise to the original image. The noise can be considered a random process, and we use Perlin noise [61,62] and fractal Brownian motion [63,64] to simulate it.
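The degradation model I(x) = J(x)t(x) + A(1 − t(x)) can be applied directly to a normalized image once a transmission map is available; a sketch (function and parameter names are ours, with intensities in [0, 1]):

```python
import numpy as np

def add_haze(image, transmission, atmospheric_light=0.9):
    """Composite haze onto an image using the atmospheric scattering model
    I(x) = J(x) * t(x) + A * (1 - t(x)).

    image: clean scene radiance J, values in [0, 1], shape (H, W) or (H, W, 3).
    transmission: per-pixel t(x) in [0, 1]; lower values mean denser haze.
    """
    t = transmission[..., None] if image.ndim == 3 else transmission
    return image * t + atmospheric_light * (1.0 - t)
```

In the full pipeline the transmission map would come from the fused Perlin noise described next, so the haze density varies smoothly across the scene rather than being spatially uniform.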
As shown in Figure 3, we define a lattice structure in which each vertex has a random gradient vector (a_00, a_01, a_10, a_11 in Figure 3); therefore, each coordinate in the noise map is surrounded by four vertices. The dot product of the distance vector and the gradient vector is defined as
d_ij = a_ij · (u − i, v − j),
where a_ij is the gradient vector of the corresponding corner and (u − i, v − j) is the distance vector between the target point and that vertex. The noise at this point is then obtained by interpolating the four dot products:
n(u, v) = lerp(lerp(d_00, d_10, s(u − i)), lerp(d_01, d_11, s(u − i)), s(v − j)),
where s(t) is the weight function, which needs to satisfy s(0) = 0, s(0.5) = 0.5, and s(1) = 1. To make the noise more natural, the first and second derivatives of the smoothing function we used are zero at both t = 0 and t = 1:
s(t) = 6t^5 − 15t^4 + 10t^3.
Ultimately, we obtain the simulated noise shown in Figure 3. As can be seen, the appearance of the noise is determined by the number of lattice structures. To simulate the effect of clouds and fog more realistically, we fuse noises at different scales:
N(u, v) = Σ_n q^n · n_L(n)(u, v),
where q is a scaling factor (0.7 by default) and L(n) represents the number of lattice structures at scale n. Notably, the resolution of the noise map remains the same at every scale to ensure the additivity of the noises.
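A compact sketch of this gradient-noise construction and the multi-scale fusion (our own implementation: the fade polynomial and the default q = 0.7 follow the text, while the base lattice count, octave doubling, and seed are illustrative assumptions):

```python
import numpy as np

def fade(t):
    # Smoothing function s(t) = 6t^5 - 15t^4 + 10t^3: s(0)=0, s(1)=1,
    # and first/second derivatives vanish at both endpoints.
    return t * t * t * (t * (6 * t - 15) + 10)

def perlin(size, lattices, rng):
    """2D gradient (Perlin) noise on a size x size map with `lattices` cells
    per axis: random unit gradients at lattice vertices, corner dot products
    d_ij = a_ij . (u - i, v - j), then smoothed bilinear interpolation."""
    grad = rng.standard_normal((lattices + 1, lattices + 1, 2))
    grad /= np.linalg.norm(grad, axis=-1, keepdims=True)
    lin = np.linspace(0, lattices, size, endpoint=False)
    u, v = np.meshgrid(lin, lin, indexing="ij")
    i, j = u.astype(int), v.astype(int)
    fu, fv = u - i, v - j                      # local coordinates in the cell

    def dot(di, dj):                           # dot product with corner (i+di, j+dj)
        g = grad[i + di, j + dj]
        return g[..., 0] * (fu - di) + g[..., 1] * (fv - dj)

    su, sv = fade(fu), fade(fv)
    nx0 = dot(0, 0) * (1 - su) + dot(1, 0) * su
    nx1 = dot(0, 1) * (1 - su) + dot(1, 1) * su
    return nx0 * (1 - sv) + nx1 * sv

def fbm(size, base_lattices=4, octaves=4, q=0.7, seed=0):
    """Fractal Brownian motion: sum q^n-weighted Perlin octaves on the same
    map resolution, doubling the lattice count per octave."""
    rng = np.random.default_rng(seed)
    return sum(q ** n * perlin(size, base_lattices * 2 ** n, rng)
               for n in range(octaves))
```

Scaling the fused map into [0, 1] yields a plausible transmission field: low-frequency octaves give cloud banks, high-frequency ones give wispy detail.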

The Proposed SSRD-Net
Motivated by YOLOv4 [33], our approach uses a one-stage object detection strategy. In this section, we will give the details for each of the sub-networks. We design a single-scale vehicle detector, named SSRD-Net, to simultaneously perform small-sized vehicle object localization and classification.

Overall Architecture
In recent years, hierarchical detection models such as Feature Pyramid Networks (FPN) [44] have achieved good performance. Usually, these models must stack many convolutional layers to ensure an appropriate receptive field. In the detection of small-sized objects, each pixel belonging to the object has a great influence on the final detection result, and an excessively deep network structure causes target features to be submerged by environmental information. Since we have already unified the target scale as described in Section 2.1, we propose several strategies to reduce the depth of the model, increase the number of feature channels, and remove irrelevant structures from the model.
As shown in Figure 4, the designed model mainly consists of four parts: input, backbone, neck, and head. The input is an RGB aerial image resized to 608 × 608 pixels. The detection backbone extracts features from the image through a series of convolutional structures. The detection neck is a feature extraction network that combines shallow and deep features. The detection head predicts the category of each pixel in the output heat map, the position offsets of the bounding boxes, and the deflection angle.
In the detection backbone, the Focus structure divides the target into smaller pixel sizes; however, each pixel belonging to a small-sized object has a great influence on the final detection result, so this structure is not friendly to small targets. Therefore, we introduce an up-sample structure before Focus to increase the number of channels and reduce the depth of the network.
In the detection neck, small targets can only be detected by a 76 × 76 grid, so we eliminate the 38 × 38 and 19 × 19 grid modules in the original YOLO method; this significantly reduces the computational complexity of the model and improves the detection efficiency of the network. In addition, due to the lack of effective communication between receptive fields of different sizes, these models are limited in their ability to express generated features. We introduced the global relational (GR) block to alleviate these limitations.
In the detection head, we use a detection frame with a rotation angle, which effectively distinguishes dense targets, prevents large overlaps between detection frames, and further improves the actual detection effect. Table 1 shows the network structure of the proposed model. In Table 1, "from" denotes the input of the block, "number" the number of repetitions, and "Param" the parameter count of the block. The backbone of the model adopts the CSPDarknet53 architecture, which effectively extracts feature information from different receptive fields. The feature fusion part removes unnecessary output structures and improves the detection speed of the model.

Global Relational Block
The location of a target in an aerial image is arbitrary, and there are a large number of similar targets in the urban context. A simple stack of convolution operators makes the network focus only on local neighborhoods and cannot sensitively capture global relationships across the entire feature map. Global context-aware blocks have been built into many detection tasks via the aggregation of convolution operators in the same layer. Based on this observation, the design of the GR block was inspired by Non-local Neural Networks [65], Double Attention Networks [66], and Compact Generalized Non-local Networks [67].
The key idea of the global context-aware block is that the response at a location is the weighted sum of the features at all locations. We define the block in a convolutional neural network as
y_i = (1/C(x)) Σ_j f(x_i, x_j) g(x_j),
where i is the index of an output position and j enumerates all possible positions, x is the input feature map, and y is the output feature map of the GR block. f(x_i, x_j) is the correlation measurement function between two points in the feature map, g(x) is a convolutional mapping of x, and the response is normalized by the factor C(x). We define f(x_i, x_j) as the dot-product similarity:
f(x_i, x_j) = θ(x_i)^T φ(x_j),
where θ and φ are convolutional structures to be trained, so that there is a pairwise connection between x_i and x_j. We set the normalization factor to C(x) = N, where N is the number of positions in the input feature map, because this simplifies gradient computation.
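The response above can be sketched with plain matrices standing in for the 1 × 1 convolutions (θ, φ, and g become projection matrices on a flattened feature map); this is an illustration of the normalized non-local response, not the trained block:

```python
import numpy as np

def gr_response(x, theta, phi, g):
    """Dot-product non-local response y_i = (1/N) * sum_j f(x_i, x_j) g(x_j),
    with f(x_i, x_j) = theta(x_i)^T phi(x_j) and normalization C(x) = N.

    x: (N, C) flattened feature map (N spatial positions, C channels).
    theta, phi, g: (C, C') projection matrices standing in for 1x1 convs.
    """
    n = x.shape[0]
    f = (x @ theta) @ (x @ phi).T      # (N, N) pairwise similarities
    return (f @ (x @ g)) / n           # normalized weighted sum over positions
```

Because every output position aggregates all input positions, a vehicle-like response at one location can be strengthened or suppressed by similar patterns elsewhere in the scene, which is exactly the long-range interaction a convolution stack lacks.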
In detail, as shown in Figure 5, given the discriminative feature map M_i^{C×H×W}, we transform it into a latent space (Q^{C/2×H×W}, K^{C/2×H×W}, V^{C/2×H×W}) using separate convolutional layers. These are then reshaped to Q_r^{HW×C/2}, K_r^{C/2×HW}, and V_r^{HW×C/2}. Following the dot-product similarity defined above, the relationship between each pair of subregions is captured by the matrix T^{HW×HW}:
T = (1/N) Q_r K_r.

The aggregated feature is computed as F_r^{HW×C/2} = T V_r and reshaped back to C/2 × H × W. G^{C×H×W} is then calculated by a convolutional layer with 1 × 1 filters on F_r. To prevent network degradation, we define R^{2C×H×W} as the channel-wise concatenation of the input and the context feature:
R = [M_i, G].
Finally, we obtain the output layer M_o^{C×H×W} through a convolutional layer with 1 × 1 filters. With the iterative updating of the weights, this block gradually extracts useful context information to correct the prediction of each pixel.
The GR block enhances the discrimination of pixel features by constructing a pixel-to-pixel relationship matrix. It is completely differentiable, so it can easily be optimized through backpropagation. The GR block has the same input and output dimensions, so it can easily be integrated into our detection model.

Prediction
Based on the regression method, we design the target detection model. As analyzed in Equation (5) of Section 2.1, each grid cell corresponds to one target. We divide the image into an S × S grid (76 × 76 by default), and each grid cell predicts several bounding boxes, together with their confidences, class probabilities, and rotation angles.
As shown in Figure 6, each cell in the grid is designed with anchors centered on the cell. The output of the model includes the center coordinates of the bounding box (b_x, b_y), the long side and short side of the bounding box (b_l, b_s), and the angle a. We define them as
b_x = (i + σ(t_x)) · c_x,
b_y = (j + σ(t_y)) · c_y,
b_l = p_l · e^{t_l},
b_s = p_s · e^{t_s}.
Here, (i, j) is the coordinate of the corresponding grid cell, (c_x, c_y) is the pixel size of a cell, σ(t_x) and σ(t_y) are the coordinate offsets within the cell, and p_l and p_s are the long side and short side of the anchors. We consider the rotation angle as a classification result, so each bounding box has 180 labels for angle recognition.
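A sketch of decoding one cell's raw outputs into a rotated box, assuming the YOLO-style decoding b_x = (i + σ(t_x)) · c_x and b_l = p_l · e^{t_l}, which is consistent with the symbols the text defines (grid coordinate (i, j), cell pixel size c_x, c_y, anchor sides p_l, p_s, 180-way angle head); function and argument names are ours:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tl, ts, angle_logits, i, j, cx, cy, pl, ps):
    """Decode raw network outputs of grid cell (i, j) into a rotated box.

    (tx, ty): raw center offsets, squashed by the sigmoid into the cell;
    (tl, ts): raw log-scale factors applied to the anchor sides (pl, ps);
    angle_logits: 180 scores, one label per degree of rotation.
    """
    bx = (i + sigmoid(tx)) * cx
    by = (j + sigmoid(ty)) * cy
    bl = pl * math.exp(tl)
    bs = ps * math.exp(ts)
    angle = max(range(len(angle_logits)), key=angle_logits.__getitem__)
    return bx, by, bl, bs, angle
```

Treating the angle as a 180-way classification avoids the discontinuity a direct angular regression would face when the orientation wraps around.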
Compared with other methods, the non-horizontal box has one more angle dimension, and the detection box does not need to consider the target category. We design a variant of focal loss to penalize the difference between the category of each pixel output by the network and the ground truth. For the output grid (S × S), each cell generates B bounding boxes, and each bounding box contains the center coordinates (x, y), long side (l), short side (s), object confidence (c), and angle (a). The object loss (L_obj) and angle loss (L_angle) are calculated by binary cross-entropy (BCE). We define them as:

L_obj = −Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{obj} [ĉ log(c) + (1 − ĉ) log(1 − c)],
L_angle = −Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{obj} [â log(a) + (1 − â) log(1 − a)],

where â and ĉ are the ground-truth values of a and c, and I_{ij}^{obj} denotes whether the object appears in bounding box predictor j of cell i. The object loss and angle loss are treated as multi-classification problems, so the cross-entropy loss function is adopted. For the SoftMax activation used in the model, the cross-entropy loss avoids the problem of the activation function entering its saturation region, where the gradient vanishes.
Considering three geometric factors, the overlap area, the center point distance, and the aspect ratio, we use CIoU [68] to calculate the bounding box loss (L_box):

L_box = 1 − IoU + ρ²((x, y), (x̂, ŷ))/c² + αv,  v = (4/π²)(arctan(l̂/ŝ) − arctan(l/s))²,  α = v/((1 − IoU) + v),

where (x̂, ŷ, l̂, ŝ) is the ground truth of (x, y, l, s), c is the diagonal length of the smallest enclosing box covering the two boxes, α is the weight parameter, and v represents the consistency of the aspect ratio. The CIoU loss considers the aspect ratio of the bounding box, which improves the regression accuracy. Finally, the total loss can be expressed as:

L = L_box + L_obj + L_angle.

Examples of skew IoU computation are shown in Figure 7, and the optimization process is summarized in Algorithm 1.
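As a hedged sketch of the CIoU term (for axis-aligned boxes only; the rotated case in the paper additionally carries the angle), the loss above can be computed as:

```python
import math

def ciou_loss(box, gt):
    """CIoU loss for axis-aligned (x, y, l, s) center-size boxes,
    following the formulation referenced as [68]; rotation is ignored
    in this simplified sketch."""
    x, y, l, s = box
    xh, yh, lh, sh = gt
    # Intersection / union of the two axis-aligned boxes.
    ix = max(0.0, min(x + l/2, xh + lh/2) - max(x - l/2, xh - lh/2))
    iy = max(0.0, min(y + s/2, yh + sh/2) - max(y - s/2, yh - sh/2))
    inter = ix * iy
    union = l * s + lh * sh - inter
    iou = inter / union
    # Normalised centre distance: rho^2 over the enclosing-box diagonal c^2.
    cw = max(x + l/2, xh + lh/2) - min(x - l/2, xh - lh/2)
    ch = max(y + s/2, yh + sh/2) - min(y - s/2, yh - sh/2)
    rho2 = (x - xh) ** 2 + (y - yh) ** 2
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term v and its weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(lh / sh) - math.atan(l / s)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

A perfectly matched prediction yields all three penalty terms equal to zero, so the loss vanishes.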

Algorithm 1 Skew IoU computation
Input: Vertex coordinates of rotating bounding boxes B_1, B_2
Output: IoU between rotating bounding boxes B_1, B_2
1: Set u ← ∅, intersection area S ← 0
2: Add intersection points of B_1 and B_2 to u
3: Add the vertices of B_1 inside B_2 to u
4: Add the vertices of B_2 inside B_1 to u
5: Set c ← the mean coordinate of the points in u
6: Compare the coordinates of each point in u with c and sort u into anticlockwise order
7: Split the convex polygon into n triangles
8: For each triangle i in n do
9:   S_i ← √(p(p − a)(p − b)(p − c)), with p = (a + b + c)/2 (Heron's formula)
10:  S ← S + S_i
11: End for
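A minimal Python sketch of Algorithm 1 (a simplified reimplementation for illustration, not the paper's code) might read:

```python
import math

def _seg_intersect(p1, p2, p3, p4):
    # Intersection point of segments p1-p2 and p3-p4, or None.
    d1 = (p2[0] - p1[0], p2[1] - p1[1])
    d2 = (p4[0] - p3[0], p4[1] - p3[1])
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:                       # parallel or collinear
        return None
    dx, dy = p3[0] - p1[0], p3[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    u = (dx * d1[1] - dy * d1[0]) / denom
    if 0.0 <= t <= 1.0 and 0.0 <= u <= 1.0:
        return (p1[0] + t * d1[0], p1[1] + t * d1[1])
    return None

def _inside(pt, poly):
    # True if pt lies inside (or on the boundary of) convex polygon poly.
    sign = 0
    for i in range(len(poly)):
        a, b = poly[i], poly[(i + 1) % len(poly)]
        cross = (b[0]-a[0]) * (pt[1]-a[1]) - (b[1]-a[1]) * (pt[0]-a[0])
        if abs(cross) < 1e-12:
            continue
        s = 1 if cross > 0 else -1
        if sign == 0:
            sign = s
        elif s != sign:
            return False
    return True

def _fan_area(pts):
    # Sort anticlockwise around the centroid, split into triangles,
    # and sum their areas with Heron's formula (steps 5-11).
    if len(pts) < 3:
        return 0.0
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    pts = sorted(pts, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    area = 0.0
    for i in range(len(pts)):
        p, q = pts[i], pts[(i + 1) % len(pts)]
        a = math.dist(p, q)
        b = math.dist((cx, cy), p)
        c = math.dist((cx, cy), q)
        s = (a + b + c) / 2.0
        area += math.sqrt(max(0.0, s * (s - a) * (s - b) * (s - c)))
    return area

def skew_iou(b1, b2):
    # b1, b2: lists of four (x, y) vertices of the rotated boxes.
    u = []
    for i in range(4):                           # steps 2-4: collect points
        for j in range(4):
            p = _seg_intersect(b1[i], b1[(i+1) % 4], b2[j], b2[(j+1) % 4])
            if p is not None:
                u.append(p)
    u += [v for v in b1 if _inside(v, b2)]
    u += [v for v in b2 if _inside(v, b1)]
    inter = _fan_area(u)
    union = _fan_area(list(b1)) + _fan_area(list(b2)) - inter
    return inter / union if union > 0 else 0.0
```

For two unit squares offset by half a side, the intersection is 0.25 and the union 1.75, so the skew IoU is 1/7, matching the hand computation.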

Result
In this section, we discuss the setup and preprocessing of the dataset. Then, we evaluate the proposed detection method, and compare it with the state-of-the-art target detectors.

Datasets
Table 2 below compares different optical remote sensing datasets; F/H refers to the camera focal length and flying height.
The comparison shows that remote sensing datasets usually have large-scale characteristics. The source images are mainly acquired from Google Earth, and most of the datasets are annotated with horizontal bounding boxes. None of the existing datasets contains cloud scenes, and F/H data are lacking. Among the datasets above, we evaluate our method on the ITCVD, DLR 3K, and our RO-ARS datasets.

Image Segmentation
Since ITCVD and DLR 3K lack focal length and flight height data, the crop size cannot be calculated for them. To verify the effectiveness of the adaptive segmentation method proposed in Section 2.1, we performed size statistics on the RO-ARS dataset. The size distributions under different cutting methods are shown in Figure 8: the left panel is the original width-height distribution of the bounding boxes, the middle panel is the width-height distribution obtained by the proposed resize method, and the right panel shows the long side-short side distribution after labeling with rotating bounding boxes.
The setting of anchors in target detection models depends on the target size distribution. After clustering with the k-means method, each color in Figure 8 represents a cluster. Analyzing the size distribution of the bounding boxes, we find that the rotating bounding boxes after resizing unify the target size better; this design allows the model to meet its needs with fewer anchors.
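As a hedged illustration of this clustering step (plain Euclidean k-means on (long side, short side) pairs; YOLO-style pipelines often use an IoU-based distance instead), anchors could be derived as:

```python
import random

def kmeans_anchors(sizes, k=3, iters=50, seed=0):
    """Cluster (long, short) box sizes into k anchor shapes with
    plain k-means. A simplified sketch, not the paper's exact setup."""
    random.seed(seed)
    centers = random.sample(sizes, k)            # initial centers from data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for l, s in sizes:                       # assign each box to nearest center
            idx = min(range(k),
                      key=lambda c: (l - centers[c][0]) ** 2
                                  + (s - centers[c][1]) ** 2)
            clusters[idx].append((l, s))
        centers = [                              # recompute centers as cluster means
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[c]
            for c, cl in enumerate(clusters)
        ]
    return sorted(centers)
```

On well-separated size clusters the means converge to the cluster centroids within a few iterations regardless of initialization.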
In addition, we also count the size and location distribution of vehicle targets, as shown in Figure 9.
It can be seen that the target sizes are relatively concentrated while the positions are evenly distributed across the image, which reflects the dataset's insensitivity to target position.

Angles Distribution
Through analysis of the rotating bounding boxes, we count the angle distribution of the bounding boxes, as shown in Figure 10. In Figure 10, D(x) is the variance and σ is the standard deviation, defined as:

D(x) = (1/n) Σ_{i=1}^{n} (x_i − x̄)²,  σ = √(D(x)).

In the original dataset, the rotation angles of 0° and 90° each exceed 20% of the bounding boxes. After affine transformation, the variance and standard deviation are smaller than the original ones and the angular distribution is more even. This helps the network learn the angle characteristics of targets and also proves the importance of the affine transformation.

Cloud Simulation
Current aviation datasets are designed for sunny weather; however, bad weather, including cloud and fog, is inevitable in outdoor applications, so an aerial remote sensing dataset must include enough complex-weather images. Examples of the cloud simulation method described in Section 2.2 are shown in Figure 11.
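A minimal sketch of the alpha-blending idea behind such a fog augmentation (the method in Section 2.2 is more elaborate; the `density` parameter and the noise scale here are illustrative assumptions):

```python
import numpy as np

def add_fog(img, density=0.5, fog_color=255, seed=0):
    """Blend an image toward a fog colour using a smoothed random
    transparency mask. Illustrative only; not the paper's simulator."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    # Low-resolution noise, upsampled by block repetition, gives
    # smooth cloud-like patches instead of per-pixel speckle.
    coarse = rng.random((h // 8 + 1, w // 8 + 1))
    alpha = np.kron(coarse, np.ones((8, 8)))[:h, :w] * density
    if img.ndim == 3:
        alpha = alpha[..., None]                 # broadcast over channels
    return (img * (1 - alpha) + fog_color * alpha).astype(np.uint8)
```

The `density` knob controls how heavily the scene is obscured, which makes it easy to generate a range of weather severities for training.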

Model Size
The proposed SSRD method is an improvement on the YOLOv5 model. We compile statistics on the parameters of currently popular target detection networks in Table 3.

As can be seen from Table 3, the SSRD method has a smaller parameter size than the other models. The convolution depth of the SSRD-tiny model is 0.5 times that of SSRD-base, and its number of BottleneckCSP layers is 1/3 that of SSRD-base. Some application scenarios, such as embedded devices, place strict restrictions on model size. The parameter size is positively related to the size of the exported model, so a small, dedicated model has great advantages.

Evaluation Metrics
To verify the effectiveness of our proposed method, we conduct a qualitative and quantitative comparison with currently popular target detectors. The metrics of recall/precision rate, F1-score, and average precision (AP) are used, which are formally defined as:

P = TP/(TP + FP),  R = TP/(TP + FN),  F1 = 2PR/(P + R).

The area under the Precision(P)-Recall(R) curve is defined as AP. Since it is relatively difficult to calculate the integral directly, the interpolation method is introduced, and AP is calculated as:

AP = Σ_k (R_k − R_{k−1}) · max_{R̃ ≥ R_k} P(R̃).

The proposed method was tested and evaluated on a computer with an Intel Core i7-10700F 2.90 GHz CPU, 16 GB of memory, and a GeForce GTX 2060Ti GPU with 6 GB of memory, implemented using the open-source PyTorch framework.
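The all-point interpolation described above can be sketched as follows (recalls are assumed sorted in increasing order):

```python
def average_precision(recalls, precisions):
    """All-point interpolated AP: make precision monotonically
    non-increasing from right to left, then integrate P over R."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Interpolation: precision at recall R becomes max precision
    # over all recalls >= R.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas under the interpolated curve.
    ap = 0.0
    for i in range(1, len(mrec)):
        ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap
```

For example, two operating points (R, P) = (0.5, 1.0) and (1.0, 0.5) give an AP of 0.75.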
During training, stochastic gradient descent (SGD) [73] is used to optimize the parameters, with a base learning rate of 1 × 10⁻⁴. The weights are initialized with the Kaiming distribution [74]. The IoU threshold of non-maximum suppression (NMS) is 0.65 for inference.

Ablation Experiments
Since the proposed RO-ARS dataset contains a large number of images with cloud and fog, we test the SSRD method on the ITCVD and DLR 3K datasets, which contain no cloud or fog images, to illustrate the necessity of the proposed cloud simulation method.
The ITCVD and DLR 3K datasets lack the relevant parameters, so the split size cannot be calculated by the method described in Section 2.1. To fit the design of the SSRD model, we calculated the median target size (s_m) of each image in the dataset and then substituted k = s_m into Equation (5). After cropping the images according to the calculated value, we resize them to the input size of the model (608 × 608). Finally, we obtain a dataset with a uniform target scale.
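As a hedged sketch of this preprocessing (Equation (5) is not reproduced here, so the hypothetical `target_size` below stands in for the constant it fixes):

```python
from statistics import median

def crop_plan(box_sizes, target_size=24, input_size=608):
    """Choose a crop side length so that, after resizing the crop to
    input_size, the median vehicle ends up near target_size pixels.
    target_size is an assumed stand-in for the value fixed by
    Equation (5) in the paper."""
    s_m = median(box_sizes)                      # median vehicle size (px)
    crop = round(s_m * input_size / target_size) # crop side in source pixels
    return s_m, crop
```

When the median box already equals the target size, the crop side equals the model input size and no rescaling of target scale occurs.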
Cloud and fog are simulated for each image in this experiment, and the results are shown in Table 4. It can be seen that when the training dataset lacks cloud and fog images, the model performs poorly under foggy conditions.
Adding a proper proportion of simulated cloud and fog images during training enhances the robustness of the model, which reflects the importance of cloud and fog simulation. Because the ITCVD and DLR 3K datasets lack complex meteorological environments, models trained on them struggle to adapt to detection tasks in real complex environments; this also provides a rationale for adding complex-weather images to the RO-ARS dataset.
Finally, we analyze the results obtained by training the model with different strategies, as shown in Table 5. Since the segmentation size depends on the relevant parameters, the number of blocks per image is not always the same, so the per-block detection frame rate is the more reliable measure.
The proposed model achieves a high level of detection speed and accuracy. The following conclusions can be drawn from the experimental data in Table 5. First, comparing SSRD-base with SSRD-base (No GR block), the precision increases by 5.03%, F1 by 0.0644, AP@0.5 by 6.33%, and AP@0.5:0.95 by 3.63%; this proves the effectiveness of the proposed GR block. Comparing SSRD-base with SSRD-base (No up-sample), the up-sample block increases AP@0.5 by 3.19% and AP@0.5:0.95 by 2.38%. These experiments show that the proposed up-sample block is beneficial.
Second, small targets in complex backgrounds are difficult to detect with traditional methods (HOG + SVM). Compared with other neural networks, our algorithm achieves better results. The precision of the YOLO and SSD models is insufficient, with a large number of false detections in the results, while the detection speed of the Faster-RCNN model is too slow to meet the requirements of practical applications.
Finally, we show some detection results of our proposed method on different datasets (ITCVD, DLR 3K, RO-ARS), as shown in Figure 12.

Discussion
In this work, we propose a new remote sensing vehicle dataset, RO-ARS, which considers both the angular distribution of the rotating bounding boxes and the diversity of the meteorological environment. We use affine transformation to enhance the robustness of the model and design a cloud simulation method to increase the proportion of images with cloud and fog. We also analyze the vehicle detection characteristics of aerial remote sensing images and design an adaptive high-resolution image cropping scheme to improve the detection speed.
Inspired by the impressive performance of YOLOv5 in target detection, we propose a vehicle detection model suitable for aerial remote sensing images in a complex urban background. The experimental results demonstrate that the SSRD method we proposed can achieve the highest scores on AP@0.5 (72.23%), AP@0.5:0.95 (31.69%) and F1-score (0.6722) with real-time detection speed (49.6 fps). In this work, we propose a GR block and conduct a quantitative evaluation through the ablation experiments, which proves that the GR block has excellent performance in improving the precision and reducing the false detection.
In Section 3, we analyze the relationship between network depth and detection ability. A shallow neural network has higher recall and frame rate, but its poor precision prevents better performance on AP and F1. YOLOv5x, with a deeper network, performs poorly on small targets, which shows that simply increasing network depth cannot solve the problem of vehicle detection in aerial remote sensing images.

Conclusions
The model proposed in this paper provides a feasible solution for vehicle detection in aerial remote sensing, and experiments show that it performs well. The image segmentation and cloud simulation methods have a positive effect on target recognition; however, the variety of cloud simulations is still limited. To enhance the practicality of the model, improving the detection speed remains an important research direction.
In the future, we will further improve our research topics in several aspects, such as feature extraction and fusion, the diversification of cloud simulation and model compression. Using unsupervised or weakly supervised models to reduce the model's dependence on datasets is also one of the important design directions.