Sparse Label Assignment for Oriented Object Detection in Aerial Images

Abstract: Object detection in aerial images has received extensive attention in recent years. The current mainstream anchor-based methods directly divide the training samples into positives and negatives according to the intersection-over-union (IoU) of the preset anchors. This label assignment strategy assigns densely arranged samples for training, which leads to a suboptimal learning process and causes the model to suffer from serious duplicate detections and missed detections. In this paper, we propose a sparse label assignment strategy (SLA) to select high-quality sparse anchors based on the posterior IoU of detections. In this way, the inconsistency between classification and regression is alleviated, and better performance can be achieved through balanced training. Next, to accurately detect small and densely arranged objects, we use a position-sensitive feature pyramid network (PS-FPN) with a coordinate attention module to extract position-sensitive features for accurate localization. Finally, the distance rotated IoU loss is proposed to eliminate the inconsistency between the training loss and the evaluation metric for better bounding box regression. Extensive experiments on the DOTA, HRSC2016, and UCAS-AOD datasets demonstrate the superiority of the proposed approach.


Introduction
Object detection is an important and challenging task in the field of computer vision. With the rapid development of deep learning, a series of models based on convolutional neural networks (CNN) have been proposed to achieve accurate object detection [1][2][3][4][5][6][7]. Different from the objects in the natural scenes, the objects in aerial images are often densely arranged and have large variation in scales, aspect ratios, and orientations, which makes it difficult to achieve accurate detection.
In recent years, many rotation detectors have been proposed to introduce the additional orientation prediction to detect arbitrary-oriented objects in aerial images [8][9][10][11][12][13][14][15]. These detectors first densely preset a large number of prior boxes (also called anchors) to align with the ground-truth (GT) objects. Then positive samples are selected according to the intersection-over-union (IoU) for bounding box regression. This process is also called label assignment. Due to the fact that the objects in the aerial images have large variation in scale, shape, and orientation, more anchors need to be laid to match the objects well. Therefore, this dense training sample selection strategy is denoted as dense label assignment in the paper.
Dense label assignment brings many intractable problems to object detection in aerial images. Firstly, most of the massive predefined anchors are backgrounds, which aggravates the foreground-background imbalance during training [16], especially for one-stage detectors. Secondly, the dense prediction suffers from inconsistency between classification and regression in object detection, thereby degrading detection performance. Specifically, the densely arranged anchors often lead to the case where multiple positive samples predict the same object. However, a high classification score for the detections of these positives does not guarantee precise localization, as shown in many previous works [17][18][19][20]. Therefore, false duplicate detections may occur after the non-maximum suppression (NMS) process. For example, the upper part of Figure 1 shows local duplicate detections of a large object. It can be seen that the high-quality detection (blue) is suppressed by the low-quality detection (green) due to its relatively low classification score (0.94 vs. 0.96). Besides, the local false detection box (the detection box with a score of 0.91) cannot be suppressed.

[Figure 1. Legend: ground-truth boxes, output detections, suppressed detections. Upper panel: duplicate detections of the same object. Lower panel: missed detections of densely arranged objects.]
Moreover, dense object detection in aerial images suffers from missed detections due to dense label assignment. The bottom of Figure 1 shows an example of missed detections of densely arranged ships. The output detection box (green box with a score of 0.93) with poor localization accuracy suppresses the more accurate predictions (blue boxes with scores of 0.90 and 0.76), leading to missed detections of ships. In the above cases, dense positive samples lead to highly overlapping detections. However, the corresponding classification scores are not effective in distinguishing their localization accuracy, thereby resulting in poor detection performance.
Due to the problems mentioned above, we suggest that dense label assignment is not conducive to object detection in aerial images. In this article, we propose a sparse label assignment (SLA) strategy to achieve superior training sample selection and improve densely arranged oriented object detection in aerial images. Firstly, we perform forward propagation to obtain the posterior detections corresponding to the preset anchors. Next, posterior non-maximum suppression (P-NMS) is conducted on the detection boxes according to their localization accuracy. For the remaining detections, their corresponding initial anchors are the high-quality positives and can be used for loss calculation. These selected anchors have varying IoU distributions with limited overlap with each other, which can reduce the misjudgments caused by the weak correlation between classification and regression. Besides, we perform IoU-balanced representative sampling for negatives to alleviate the imbalance between foreground and background in the one-stage detector.
Since there is generally no large overlap between objects in aerial images, the posterior non-maximum suppression works well in this case. Therefore, sparse label assignment for object detection in aerial images is more suitable.
For accurate detection of densely arranged independent objects, we further propose a position-sensitive feature pyramid network (PS-FPN) to improve the localization performance. PS-FPN uses the coordinate attention module to encode localization information into multi-scale features. The position-sensitive feature maps are then used for high-quality object detection. Finally, a novel distance rotated IoU (D-RIoU) loss function is adopted for rotated bounding box regression to achieve faster convergence and consistency between the training loss and the localization accuracy.
The proposed sparse label assignment strategy is conducive to high-precision object detection with little additional overhead. Our proposed methods can be applied to existing models to achieve better detection performance. Extensive experiments on public benchmark datasets of aerial images, HRSC2016 [21] and DOTA [22] prove the superiority of our model.
The contributions of this article can be summarized as follows:
• We suggest that the dense label assignment strategy causes serious false duplicate detections and missed detections in aerial images, which degrades the detection performance;
• A novel sparse label assignment (SLA) strategy is proposed to achieve training sample selection based on their posterior IoU distribution. The posterior non-maximum suppression and representative sampling are used for the selection of positives and negatives, respectively, to improve detection performance;
• The position-sensitive feature pyramid network (PS-FPN) is adopted to extract feature maps for better localization performance. Besides, a novel distance rotated IoU (D-RIoU) loss is proposed to solve the misalignment between training loss and localization accuracy.
The rest of this paper is organized as follows. Section 2 reviews the related work of generic object detection and object detection in aerial images. Section 3 introduces our method in detail. Section 4 shows the ablation experiments of the proposed methods and the performance on different datasets. Section 5 concludes the paper.

Generic Object Detection
In recent years, methods based on convolutional neural networks have greatly improved the performance of object detection. A series of CNN-based detectors have been proposed to achieve high-quality object detection [1][2][3][6][7]. These methods can be divided into two categories: two-stage detectors and one-stage detectors. Two-stage detectors, such as Faster R-CNN [1] and R-FCN [2], first generate candidate regions and then perform classification and regression on these regions to obtain the final detections. Two-stage detectors often have high accuracy, but their inference speed is slow. One-stage detectors, such as the YOLO series [3,5,6] and SSD [7], achieve object detection by one-step prediction. The inference speed of one-stage detectors is faster, but their detection accuracy is often slightly lower than that of the two-stage framework.
To achieve better detection performance, current detectors tend to densely preset many anchor boxes to achieve good spatial alignment with ground-truth (GT) objects. Then the samples with high IoU with the GT boxes are selected as positive samples for training. This offset-based regression method effectively constrains the search space of parameters and accelerates the network convergence [1]. However, a large number of predefined anchors are required to achieve good spatial alignment with the GT boxes for sufficient prior semantic knowledge. This causes serious imbalances during training and leads to performance degradation. To solve these problems, a series of sampling methods have been proposed to alleviate the imbalance between training samples. For example, focal loss [16] reduces the weight of easy samples to avoid the loss being dominated by a large number of simple negative samples. Li et al. [23] utilize a gradient harmonizing mechanism to balance the gradient flow from different samples. Libra R-CNN [24] proposed IoU-balanced sampling for reducing the imbalance during label assignment.

Object Detection in Aerial Images
Object detection in aerial images has received extensive attention due to its wide range of application scenarios. With the great breakthrough made by CNN methods, object detection in aerial images has also made considerable progress.
Different from objects in natural images, objects in aerial images often have large variations in scale, aspect ratio, and orientation, and many scenes contain densely arranged small objects. Therefore, it is hard to detect objects in aerial images. Some previous detectors directly introduced additional angle prediction based on the generic detectors to locate oriented objects in aerial images [8,25,26]. Although progress has been achieved, these methods do not consider the large variation in scale, shape, and orientation of objects in aerial images and, therefore, cannot further improve the detection performance.
Recently, a series of works have been proposed to improve the performance of rotation detectors from many aspects. Some studies designed better features to improve detection accuracy [27][28][29][30]. For example, CAD-Net [27] constructs attention-modulated features, as well as global and local contexts to detect objects of different scales. Wang et al. [28] proposed a unified feature-merged network to aggregate the context information in multiple scales for better small object detection. CFC-Net [29] improves performance by building features suitable for classification and regression tasks, respectively. Fu et al. [30] proposed a feature-fusion architecture to handle the problem of multi-scale objects by generating a multi-scale feature hierarchy. The combination of the features of shallow layers with semantic representations and the feature maps of top layers with low-level information helps to detect objects with different scales.
The representation of oriented objects is a unique problem for object detection in aerial images, which has been discussed in some recent works [31][32][33][34][35][36]. Yang et al. [31] suggested that the rotated rectangle representation is subject to boundary problems that make the network hard to converge. To solve the problem, the circular smooth label [31] and densely coded labels [32] were proposed to convert angle regression into fine-grained angle classification to avoid out-of-bounds angles. Qian et al. [33] and Ming et al. [34] construct multiple representations of oriented objects to unify boundary conditions for better bounding box regression optimization. Yang et al. [35] discussed the inconsistency between the localization accuracy and the loss caused by the boundary problems of the oriented rectangle, and proposed the Gaussian Wasserstein distance loss to achieve consistent regression optimization.
There are also some works that improve object detection in aerial images through label assignment. Object detection methods in aerial images often follow the label assignment methods of generic object detection. That is, the positives and negatives are selected according to the preset IoU threshold [4]. Although some novel methods have been proposed to improve the label assignment strategy [37][38][39], these works do not take into account the characteristics of aerial image targets. Recently, some label assignment methods have been proposed for rotated object detection in aerial images [10,20,40]. Ming et al. [20] observed the inconsistency of localization ability before and after bounding box regression, and proposed a dynamic anchor learning strategy to adaptively select the optimal anchors for rotated object detection. Zhong et al. [10] decoupled the rotated bounding box into a horizontal bounding box to reduce the instability of the angle during the anchor matching process. Xiao et al. [40] used an adaptive IoU threshold for training sample selection to keep a balance between positive and negative anchors.

The Proposed Method
The overall framework of our method is shown in Figure 2. Our proposed model consists of three parts: sparse label assignment strategy (SLA) for training sample selection, position-sensitive feature pyramid network (PS-FPN) for feature extraction, and distance rotated IoU loss (D-RIoU) for network training. The following sections will introduce these modules in detail.

Sparse Label Assignment for Efficient Training Sample Selection
The current rotation detectors use densely arranged anchors to achieve object detection in aerial images. However, the massive preset anchors are redundant for the detection task. On the one hand, the redundant negatives cause the training loss to be dominated by low-quality background. On the other hand, redundant positives induce misalignment between classification scores and regression accuracy, as discussed in Section 1 and shown in Figure 3a. The redundancy and imbalance of training samples are among the crucial factors that restrict the performance of the one-stage detector.
It has been proved in some previous work that the detector can achieve good performance without using dense anchors during training [6,41,42]. For example, YOLOv3 [6] only uses one anchor with the highest IoU as the positive sample for training. Multiple anchor learning method [42] constructs anchor bags and selects the most representative anchors from each bag as training samples.
Inspired by these works, we introduce the sparse label assignment strategy, which uses sparse anchors to alleviate the problems of duplicate detection and missed detection in aerial images. Sparse label assignment includes two parts: posterior suppression for positives and IoU-balanced representative sampling for negatives.
For positive samples, densely arranged anchors produce dense predictions. However, the inconsistency between classification and regression interferes with selecting accurate detections from dense predictions. The posterior non-maximum suppression (P-NMS) is proposed to select high-quality positives according to the localization accuracy of detections. The algorithm is shown in Algorithm 1. Specifically, we first select anchors whose IoUs with GT are higher than the threshold (usually 0.5) as preliminary positive samples. Next, we calculate the posterior IoU between the GT boxes and the detection boxes regressed from initial positives. Finally, the IoU score is regarded as the confidence of the detections, and non-maximum suppression is performed on the detections. For the remaining detection boxes after P-NMS, we treat the corresponding initial anchors as positive samples for training.

Algorithm 1 Posterior non-maximum suppression.
Input: $A = \{a_1, a_2, \ldots, a_N\}$ is an $N \times 5$ matrix of initially selected positive anchors. $B = \{b_1, b_2, \ldots, b_N\}$ is an $N \times 5$ matrix of detection boxes corresponding to $A$. $G = \{g_1, g_2, \ldots, g_N\}$ is an $N \times 5$ matrix of GT boxes assigned to the corresponding anchors in $A$. $\mathrm{RIoU}(\cdot)$ calculates the IoU between rotated boxes. $N_0$ is the NMS threshold. $t$ denotes the training progress, and $t \in [0, 1]$. $\mathrm{schedule}(\cdot)$ dynamically schedules the NMS threshold according to the training progress.
Output: $T$ is a matrix of final selected positive samples.
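The selection procedure described for P-NMS can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the posterior IoU of every detection with its GT box and the pairwise rotated IoU between detections have already been computed. The function names are hypothetical, and `linear_schedule` is only an assumed stand-in for schedule(·) that decays the threshold from 1.0 (no suppression) to $N_0$; the schedule actually used in the paper may differ.

```python
def linear_schedule(n0, t):
    """Hypothetical stand-in for schedule(.): the threshold decays linearly
    from 1.0 (no suppression) at t = 0 to n0 at t = 1."""
    return 1.0 - (1.0 - n0) * t


def posterior_nms(posterior_iou, pairwise_iou, threshold):
    """Greedy NMS that ranks detections by posterior IoU instead of class score.

    posterior_iou[i]   -- RIoU between detection i and its assigned GT box
    pairwise_iou[i][j] -- RIoU between detections i and j
    Returns the indices of the anchors kept as positives.
    """
    order = sorted(range(len(posterior_iou)),
                   key=lambda i: posterior_iou[i], reverse=True)
    keep = []
    for i in order:
        # Keep detection i only if it does not overlap a better-localized,
        # already kept detection by more than the threshold.
        if all(pairwise_iou[i][k] <= threshold for k in keep):
            keep.append(i)
    return keep


# Three preliminary positives; detections 0 and 1 overlap heavily.
posterior_iou = [0.62, 0.85, 0.70]
pairwise_iou = [[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]]
early = posterior_nms(posterior_iou, pairwise_iou, linear_schedule(0.5, 0.0))
late = posterior_nms(posterior_iou, pairwise_iou, linear_schedule(0.5, 1.0))
```

With these toy values, every anchor survives at the start of training, while at the end anchor 0 is suppressed by the better-localized anchor 1.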
Note that the detection results are unstable in the early stage of training [20]. Therefore, the IoU scores are also unreliable in this phase. We adopt a dynamically scheduled NMS threshold to increase the suppression intensity gradually. The threshold adjustment strategy of NMS is given by Equation (1), in which $N_0$ is the predefined NMS threshold and $t \in [0, 1]$ denotes the training progress. Through Equation (1), the threshold of the posterior NMS gradually decreases during training and, thus, the suppression intensity is gradually increased. In this way, we not only ensure a stable training process but also improve the detection performance by suppressing redundant positive samples.
For example, as shown in Figure 3a, the model trained with dense label assignment (DLA) predicts two highly overlapped detections. However, we cannot guarantee that the more accurate one (blue box) is output. As shown in Figure 3b, this issue can be resolved through SLA by suppressing the positives with suboptimal predictions. SLA ensures sparse valid predictions for each location on the feature maps.
Since P-NMS further reduces the number of positives and aggravates the imbalance between foreground and background samples, it is also vital to conduct sparse sampling for negatives. The intuitive method is to perform NMS operations on negative samples, but it is not feasible in practice for the following two reasons:
• Firstly, the number of negatives is much larger than that of positives, and the implementation of NMS on them requires huge memory and is very time-consuming;
• Secondly, the detector does not perform regression supervision on negative samples, so the IoU between the GT boxes and the predictions of negatives is meaningless.
We use representative sampling for negative samples to achieve balanced training. The algorithm is shown in Algorithm 2. We first divide the anchors into three categories: positive samples, hard samples, and background samples. Positive samples are obtained from the initial positives via Algorithm 1. Background samples are anchors that mostly contain background. These negatives have an IoU less than the threshold $T_{bg}$ (set to 0.1 in our experiments). The hard samples contain parts of objects and are hard to classify; their IoUs are in $[T_{bg}, T_{neg}]$ ($T_{neg}$ is set to 0.4 in our experiments). Next, random sampling is carried out in the different types of samples according to the number of positive samples at a ratio of 1 : α : β. For example, if there are $N_p$ positives after P-NMS, then we randomly select $\alpha \cdot N_p$ samples from the hard samples and $\beta \cdot N_p$ samples from the background samples. The total number of negatives used for training is $(\alpha + \beta) \cdot N_p$. On the one hand, representative sampling ensures that the number of negative samples changes dynamically according to the positive samples, which helps to avoid the training loss being dominated by massive negatives. On the other hand, the sampling of hard examples enhances the robustness of the classifier to reduce false detections.
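The sampling rule above can be illustrated with a short Python sketch. It is a simplified reading of the text rather than the authors' code: the function name and arguments are hypothetical, each anchor is assumed to be summarized by its maximum IoU with any GT box, and clamping the sample counts to the number of available anchors is an added assumption for the case where a pool is too small.

```python
import random


def representative_sampling(max_ious, num_pos, t_bg=0.1, t_neg=0.4,
                            alpha=2, beta=100, seed=None):
    """Sample negatives at a ratio of 1 : alpha : beta
    (positives : hard samples : background samples).

    max_ious -- maximum IoU of every anchor with any GT box (assumed input)
    num_pos  -- number of positives N_p left after P-NMS
    Returns (hard_indices, background_indices).
    """
    rng = random.Random(seed)
    hard = [i for i, iou in enumerate(max_ious) if t_bg <= iou <= t_neg]
    background = [i for i, iou in enumerate(max_ious) if iou < t_bg]
    # Assumption: never request more samples than the pool contains.
    n_hard = min(int(alpha * num_pos), len(hard))
    n_bg = min(int(beta * num_pos), len(background))
    return rng.sample(hard, n_hard), rng.sample(background, n_bg)


# 500 background anchors, 10 hard anchors, 3 anchors above the positive range.
max_ious = [0.05] * 500 + [0.2] * 10 + [0.7] * 3
hard_idx, bg_idx = representative_sampling(max_ious, num_pos=3, seed=0)
```

With $N_p = 3$ and the ratio 1:2:100, this draws 6 hard samples and 300 background samples, so the number of negatives scales with the number of positives instead of with the number of anchors.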

Position-Sensitive Feature Pyramid Network
The aerial images often contain many small and densely arranged objects. For these objects, a slight deviation in coordinate prediction may cause severe performance degradation, so accurate localization is particularly important. We propose the position-sensitive feature pyramid network (PS-FPN) to embed the localization information into the feature pyramid through the coordinate attention module (CAM) (see Figure 2).

Algorithm 2 Representative sampling for negative samples.
Input: $A = \{a_1, a_2, \ldots, a_N\}$ is an $N \times 5$ matrix of initially selected positive anchors. $G = \{g_1, g_2, \ldots, g_N\}$ is an $N \times 5$ matrix of GT boxes assigned to the corresponding anchors in $A$. $\mathrm{RIoU}(\cdot)$ calculates the IoU between rotated boxes. $N_p$ is the number of positive samples obtained through P-NMS. $T_{neg}$ and $T_{bg}$ are the thresholds for defining negative and background samples, respectively. α and β are the constant coefficients of the sampling strategy. $\mathrm{sample}(T, t)$ is a sampling function that randomly selects $t$ elements from the set $T$.
Output: $T$ is a matrix of final selected negative samples.
The attention mechanism has been widely used in the field of computer vision with great success [43][44][45]. However, many attention methods use global average pooling (GAP), which is harmful to the encoding of positional information. For example, SE block [44] and CBAM [45] adopt GAP and global max pooling (GMP) to compress the feature tensor into a channel-wise vector to capture the dependence along the channel direction, as shown in Figure 4. Motivated by Hou et al. [43], who built spatially selective attention maps for the backbones of mobile networks, we embed the coordinate attention module (CAM) into the feature pyramid to extract position-sensitive feature maps. The structure of CAM is shown in Figure 4.
Given the input feature map $F \in \mathbb{R}^{C \times H \times W}$, we first construct the direction-sensitive features $F_x = \mathrm{Pool}_{1 \times W}(F)$ and $F_y = \mathrm{Pool}_{H \times 1}(F)$, in which $\mathrm{Pool}_{1 \times W}$ and $\mathrm{Pool}_{H \times 1}$ are the average pooling kernels with sizes of 1 × W and H × 1, respectively, and $F_x \in \mathbb{R}^{C \times H \times 1}$ and $F_y \in \mathbb{R}^{C \times 1 \times W}$ are the direction-sensitive features. For example, for the given input feature $F$ with the size of C × H × W, $\mathrm{Pool}_{1 \times W}$ conducts pooling with the kernel of 1 × W on $F$, and we obtain an output feature with the size of C × H × 1. Next, we concatenate the two tensors and squeeze the result to reduce the parameters. The concatenation of $F_x$ and $F_y$ has the size of C × 1 × (W + H). It is then squeezed via a 1 × 1 convolution operation that reduces the channels to C/r. The generated $M \in \mathbb{R}^{C/r \times 1 \times (W+H)}$ is further split into $F_x \in \mathbb{R}^{C \times H \times 1}$ and $F_y \in \mathbb{R}^{C \times 1 \times W}$ to encode the position information. These tensors are passed through the sigmoid function σ to form directional attention maps, which are then used to re-weight the original feature and obtain a direction-sensitive feature map $F'$. CAM uses horizontal and vertical pooling to encode spatial coordinate information into features. Therefore, compared with the attention mechanisms that use global average pooling, the feature pyramid encoded by CAM can more accurately extract the localization information of the objects and achieve accurate bounding box prediction.
Note that the receptive fields of the different feature maps of the FPN vary. It is not suitable to use shared weights to learn the localization coding of multi-scale objects. Therefore, we use independent CAM modules for position coding at each level of the multi-scale features. Different from many heavy non-local or self-attention methods that bring a massive amount of computational cost, CAM is lightweight and only introduces a few convolutional layers, but achieves substantial performance gains.
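To make the role of directional pooling concrete, the following pure-Python sketch applies it to a single-channel feature map. It is deliberately simplified and should not be read as the CAM implementation: the concatenation and the learned 1 × 1 convolutions are omitted, the function names are hypothetical, and multiplying the input by both sigmoid gates is an assumed form of the re-weighting step.

```python
import math


def directional_pool(feature):
    """Average a single-channel H x W map along each axis.

    Returns (f_x, f_y): f_x has H entries (pooled over the width),
    f_y has W entries (pooled over the height).
    """
    h, w = len(feature), len(feature[0])
    f_x = [sum(row) / w for row in feature]
    f_y = [sum(feature[i][j] for i in range(h)) / h for j in range(w)]
    return f_x, f_y


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_reweight(feature):
    """Re-weight each position by a row gate and a column gate.

    Simplification: the gates come directly from the pooled features,
    without the learned convolutions used in the actual module.
    """
    f_x, f_y = directional_pool(feature)
    gate_x = [sigmoid(v) for v in f_x]
    gate_y = [sigmoid(v) for v in f_y]
    return [[feature[i][j] * gate_x[i] * gate_y[j]
             for j in range(len(gate_y))]
            for i in range(len(gate_x))]


feature = [[1.0, 3.0],
           [5.0, 7.0]]
f_x, f_y = directional_pool(feature)    # row means and column means
reweighted = coordinate_reweight(feature)
```

Unlike global average pooling, which would reduce this map to a single number, the two pooled vectors keep one value per row and one per column, so the gate applied to a position depends on where that position lies.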

Distance Rotated IoU Loss for Bounding Box Regression
Another thorny issue in object detection in aerial images is the inconsistency between training loss and localization accuracy. The current mainstream regression loss function is the smooth-$L_1$ loss, which uses the offsets of the prediction box and GT box relative to the anchor for training. However, the smooth-$L_1$ loss cannot accurately represent the localization accuracy of the detections. For example, as shown in Figure 5, the two different detection boxes have the same rotated IoU (RIoU) with the GT box, but their regression losses are different. Under the supervision of the smooth-$L_1$ loss, the detector pays more attention to the case on the right in Figure 5. However, the detection box on the left has only a tiny angle offset relative to the GT box, which is easy to optimize. The inconsistency between the regression loss function and the localization accuracy of the detections hinders the optimization of the regression, making it hard for the network to converge.
IoU loss has achieved great success in generic object detection [46,47]. It is feasible to directly use the rotated IoU to guide the regression in oriented object detection, but it is not optimal. Aerial images contain many objects with large aspect ratios, such as bridges, large vehicles, and ships. A slight deviation between the center of the detection box and that of the GT box will result in a sharp drop in rotated IoU. Therefore, the accurate prediction of the center point is critical in aerial image object detection.
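The sensitivity of rotated IoU to center deviation for elongated objects can be checked numerically. The sketch below computes the rotated IoU of two boxes by clipping one convex polygon against the other (Sutherland-Hodgman clipping). It is an illustration under stated assumptions, not the implementation used in the paper: boxes are assumed to be given as (cx, cy, w, h, angle) with the angle in radians, and all function names are hypothetical.

```python
import math


def box_to_polygon(cx, cy, w, h, angle):
    """Corners of a rotated box in counter-clockwise order (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]


def polygon_area(poly):
    """Shoelace formula."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def clip(subject, clipper):
    """Clip a convex polygon by a convex counter-clockwise polygon."""
    output = subject
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        current, output = output, []
        if not current:
            break

        def side(p):
            # Positive when p lies to the left of edge a -> b (inside).
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        for j in range(len(current)):
            p, q = current[j], current[(j + 1) % len(current)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output


def rotated_iou(box1, box2):
    p1, p2 = box_to_polygon(*box1), box_to_polygon(*box2)
    inter_poly = clip(p1, p2)
    inter = polygon_area(inter_poly) if len(inter_poly) >= 3 else 0.0
    return inter / (polygon_area(p1) + polygon_area(p2) - inter)


# The same one-unit center shift applied to an elongated box and a square box.
iou_long = rotated_iou((0, 0, 10, 2, 0.0), (0, 1, 10, 2, 0.0))    # about 0.333
iou_square = rotated_iou((0, 0, 4, 4, 0.0), (0, 1, 4, 4, 0.0))    # about 0.6
```

A one-unit shift across the short side leaves the 10 × 2 box with a rotated IoU of about 0.33, while the 4 × 4 box still reaches 0.6, which is the effect that motivates an explicit center-distance term.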
We propose the distance rotated IoU (D-RIoU) loss to solve the above problems. D-RIoU loss uses rotated IoU to guide the regression process while taking into account the deviation of the center point. The formula is as follows:
$$L_{DRIoU}(p, g) = 1 - \mathrm{RIoU}(p, g) + \frac{d^2(p, g)}{c^2} \quad (5)$$
in which $p$ and $g$ denote the prediction box and the GT box, respectively. $\mathrm{RIoU}(\cdot)$ calculates the rotated IoU between $p$ and $g$. $d(\cdot)$ calculates the distance between the center points of $p$ and $g$. $c$ is the diagonal of the smallest enclosing rectangle of $p$ and $g$. The smallest enclosing rectangle of two oriented bounding boxes is shown in Figure 6a. The performance evaluation of D-RIoU loss is shown in Figure 6b. G-RIoU loss is extended from GIoU loss [46] in generic object detection and is as follows:
$$L_{GRIoU}(p, g) = 1 - \mathrm{RIoU}(p, g) + \frac{|e \setminus (p \cup g)|}{|e|} \quad (6)$$
in which $e$ is the smallest enclosing box of $p$ and $g$. G-RIoU helps to optimize the anchors that have no intersection area with the GT boxes. It can be seen that the model trained with D-RIoU loss achieves faster network convergence and better performance. This is because D-RIoU loss focuses on the convergence of the center point of the object, which is vital for oriented object detection.
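As a worked illustration of how the center term separates predictions that plain rotated IoU cannot distinguish, the sketch below evaluates a D-RIoU-style loss for two predictions with identical RIoU. It assumes the DIoU-style form $1 - \mathrm{RIoU} + d^2/c^2$ built from the symbols defined above; the function name is hypothetical, and the rotated IoU and the enclosing diagonal $c$ are taken as precomputed inputs.

```python
import math


def d_riou_loss(riou, center_p, center_g, c):
    """D-RIoU-style loss under the assumed form 1 - RIoU + d^2 / c^2.

    riou     -- rotated IoU between prediction and GT box (precomputed)
    center_p -- (x, y) center of the prediction box
    center_g -- (x, y) center of the GT box
    c        -- diagonal of the smallest enclosing rectangle (precomputed)
    """
    d = math.hypot(center_p[0] - center_g[0], center_p[1] - center_g[1])
    return 1.0 - riou + (d * d) / (c * c)


# Two predictions with the same rotated IoU but different center offsets.
aligned = d_riou_loss(0.5, (0.0, 0.0), (0.0, 0.0), c=10.0)   # 0.5
shifted = d_riou_loss(0.5, (3.0, 0.0), (0.0, 0.0), c=10.0)   # 0.5 + 9/100
```

The prediction whose center is off by 3 units receives the larger loss (0.59 against 0.50), so the gradient continues to pull the center toward the GT center even when the overlap is unchanged.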
With the proposed D-RIoU loss, the training loss for the model consists of two terms. $L_{cls}(t, t^*)$ is the binary cross entropy (BCE) loss for classification, in which $t$ and $t^*$ are the predicted score and the classification label, respectively. $L_{DRIoU}(p, g)$ is the D-RIoU loss for bounding box regression as defined in Equation (5).

HRSC2016
HRSC2016 [21] is a challenging high-resolution ship detection dataset with a total of 1061 images. The image sizes range from 300 × 300 to 1500 × 900. The dataset contains a large number of rotated ships with large aspect ratios. All objects are annotated with oriented bounding boxes. The total dataset is divided into a training set, a validation set, and a test set, including 436, 181, and 444 images, respectively.
We conducted the ablation study and main experiments on the HRSC2016 dataset. The images are resized to 384 × 384 and 768 × 768 for training and testing. We use the Adam optimizer for training, and the learning rate is set to $2 \times 10^{-4}$. We trained the model for 25,000 iterations on an RTX 2080Ti GPU with the batch size set to 8.

UCAS-AOD
UCAS-AOD [48] is an aerial plane and car detection dataset. It contains 1510 images, including 1000 images for planes and 510 images for cars. The objects are annotated with both oriented bounding boxes and horizontal bounding boxes. Since there is no official division of the dataset, we randomly divide the total dataset into a training set, a validation set, and a test set with the ratio of 5:2:3.
The images are resized to 768 × 768. We use the Adam optimizer for training, and the learning rate is set to $2 \times 10^{-4}$. We trained the model for 20,000 iterations on an RTX 2080Ti GPU with the batch size set to 8.

DOTA
DOTA [22] is the largest public dataset for oriented object detection in aerial images. The images in DOTA range in size from 800 × 800 to 20,000 × 20,000 pixels and contain objects with a wide variety of scales, orientations, and shapes. It includes 2806 aerial images with 188,282 annotated instances. There are 15 categories in total, including plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). The total dataset is divided into a training set, a validation set, and a testing set with the proportions of 1/2, 1/6, and 1/3, respectively.
Because the images in DOTA are too large, we crop the original images into 768 × 768 patches with a stride of 200 for training and testing. The Adam optimizer is used for training, and the learning rate is set to $2 \times 10^{-4}$. We trained the model on an RTX 2080Ti GPU for 500,000 iterations with the batch size set to 8.
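The cropping step can be sketched as a computation of patch origins along one image axis. This is only an illustration under assumptions: the function name is hypothetical, "stride" is read literally as the step between consecutive patch origins, and appending a final patch aligned with the image border is an assumed way of covering the remainder that the text does not specify.

```python
def tile_origins(length, patch=768, stride=200):
    """Start offsets of the patches along one image axis.

    Assumption: a last patch flush with the border is added when the
    regular grid does not reach the end of the image.
    """
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)
    return origins


xs = tile_origins(1168)   # [0, 200, 400]
ys = tile_origins(800)    # [0, 32]
patches = [(x, y, x + 768, y + 768) for y in ys for x in xs]
```

For a hypothetical 1168 × 800 image, this yields six overlapping 768 × 768 patches.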

Evaluation of the Proposed Modules
We conducted experiments on the HRSC2016 dataset to prove the effectiveness of the proposed modules. We used RetinaNet with ResNet50 as the baseline model. The images are resized to 384 × 384, and no data augmentation is adopted. The experimental results are shown in Table 1. The proposed SLA strategy significantly improves the high-precision detection performance, achieving an increase of 7.49% in $AP_{75}$. This indicates that dense samples are harmful to high-quality detection and that sparse training samples selected via SLA achieve better performance. The PS-FPN extracts position-sensitive features for precise localization. It extracts the boundary features of the object by CAM, thus improving the detection performance. As a result, PS-FPN improves $AP_{50}$ and $AP_{75}$ by 0.65% and 3.28%, respectively. Our model further improves $AP_{75}$ by 4.04% when trained with the novel D-RIoU loss. This shows that the D-RIoU guidance contributes to high-precision detections compared with the smooth-$L_1$ loss. D-RIoU loss achieves consistency between the training loss and the evaluation performance through the rotated IoU. Besides, it additionally considers the importance of the center point for high-performance oriented object detection in aerial images.

Evaluation of Sparse Label Assignment
We conducted experiments to evaluate the effect of sparse label assignment, and the results are shown in Table 2. The best performance is obtained with P-NMS and a sampling ratio of 1:2:100, which achieves an $AP_{50}$ of 86.08% and an $AP_{75}$ of 55.60%. It can be found that the representative sampling of negatives and the suppression of positives can effectively improve the high-precision detection performance represented by $AP_{75}$. However, unsuitable hyperparameters may lead to a slight decrease in recall, resulting in a decrease in $AP_{50}$. For example, the models that do not consider the hard samples (ID1-ID5) achieve a higher $AP_{75}$ than the baseline, but their $AP_{50}$ is slightly lower.
The performance comparison of ID13 (85.72%) and ID15 (86.08%) shows that posterior NMS with the adaptive threshold obtained by Equation (1) can avoid the instability in the early training stage and optimize high-quality detection. NMS with a fixed posterior threshold may lead to the neglect of high-quality positives and reduce the recall (such as 85.09% for ID12).
For the representative sampling of negatives, the IoU interval division and the sampling ratio are both important. Training with hard samples helps to improve the robustness of the classification network and avoids giving high confidence to low-quality detections. It can be seen from Table 2 that the models not trained with hard samples (ID1-ID5) have lower detection performance than the models trained with them (ID6-ID15). Besides, the model achieves better performance when the sampling ratio is 1:2:100, since this ratio is more consistent with the real IoU distribution of the anchors.

Evaluation of Position-Sensitive Feature Pyramid Network
The ablation study on PS-FPN is shown in Table 3. PS-FPN further improves the detection performance on top of SLA. Among the compared settings, the best performance reaches 86.73% on the HRSC2016 dataset with the channel compression ratio r = 32. Note that if the feature maps of different levels share the parameters of CAM, the performance drops by 0.32%. Position coding is sensitive to the scale of objects, so a parameter-independent CAM adapts better to features of different scales and thereby achieves more accurate coordinate coding.
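The direction-specific pooling behind coordinate attention can be illustrated with a small sketch. It is a simplification under stated assumptions: it handles a single-channel feature map stored as nested lists, and it omits the learned 1 × 1 convolutions in which the channel compression ratio r would act, so the pooled values are gated directly. The function names are hypothetical.

```python
import math


def directional_pool(feat):
    """Average-pool an H x W map along each axis separately."""
    h, w = len(feat), len(feat[0])
    pooled_h = [sum(row) / w for row in feat]  # one value per row
    pooled_w = [sum(feat[i][j] for i in range(h)) / h for j in range(w)]  # one value per column
    return pooled_h, pooled_w


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_gate(feat):
    """Reweight a single-channel map by row-wise and column-wise attention.

    The learned transforms of a real coordinate attention module are
    omitted here (simplifying assumption).
    """
    pooled_h, pooled_w = directional_pool(feat)
    a_h = [sigmoid(v) for v in pooled_h]
    a_w = [sigmoid(v) for v in pooled_w]
    return [[feat[i][j] * a_h[i] * a_w[j] for j in range(len(a_w))]
            for i in range(len(a_h))]


feat = [[0.0, 0.0, 0.0],
        [0.0, 3.0, 0.0],
        [0.0, 0.0, 0.0]]
pooled_h, pooled_w = directional_pool(feat)
out = coordinate_gate(feat)
print(pooled_h, pooled_w)
```

Because the two pooled vectors keep the row index and the column index separately, the gate retains where the response is located along each axis, which a single global average would discard. In a parameter-independent setup, each pyramid level would own its own learned transforms. That part is not shown here.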

Evaluation of Distance Rotated IoU
We compared the performance of different RIoU-based loss functions, and the results are shown in Table 4. The baseline model is RetinaNet trained with smooth-L1 loss, and the images are resized to 768 × 768 here. RIoU (linear) and RIoU (log) denote the linear and logarithmic forms of the RIoU loss, respectively. Most RIoU-based losses improve high-quality detection performance compared with smooth-L1. For example, the AP75 of RIoU (log) is 2.76% higher than that of smooth-L1. However, G-RIoU does not perform well in oriented object detection: its AP75 is even 3.64% lower than that of smooth-L1. We attribute this to the following two problems:
1. When assigning positive labels for training, we ensure that each object is assigned at least one anchor, namely the one with the largest IoU. Therefore, anchors that do not intersect the objects are not used for regression at all, and G-RIoU loss behaves similarly to RIoU loss (linear).
2. The intersection between two rotated rectangles is very sensitive to the angles and aspect ratios, so the smallest enclosing rectangle is difficult to converge during training.
The model trained with our D-RIoU loss achieves an AP50 of 87.92% and an AP75 of 59.15%, outperforming the mainstream smooth-L1 loss by 1.53% and 4.27%, respectively. It is also superior to the other RIoU-based losses, which shows that supervising the center distance benefits oriented object detection.
We visualized some detection results from the models trained with different loss functions on DOTA, as shown in Figure 7. A tiny position deviation degrades localization more for small objects than for large ones, but smooth-L1 loss treats them equally, which leads to poor performance on small objects. As shown in the first row of Figure 7, the model trained with smooth-L1 loss for regression suffers from missed detections and inaccurate localization when detecting densely arranged objects.
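To make the loss variants compared in Table 4 more concrete, the sketch below writes them as functions of a precomputed rotated IoU. The exact definitions used in the paper are not reproduced in this section, so the forms here are assumptions based on common conventions: `1 - IoU` for the linear form, `-ln(IoU)` for the logarithmic form, and a Distance-IoU-style center penalty for the distance variant. All function and parameter names are hypothetical.

```python
import math


def riou_linear(iou):
    """Linear form (assumed): the loss falls linearly as the IoU rises."""
    return 1.0 - iou


def riou_log(iou, eps=1e-6):
    """Logarithmic form (assumed): low-IoU boxes are penalized more steeply."""
    return -math.log(max(iou, eps))


def distance_riou(iou, center_pred, center_gt, enclosing_diag):
    """Linear IoU term plus a normalized squared center distance.

    Modeled on the Distance-IoU idea (assumption); enclosing_diag is the
    diagonal length of a box that encloses both rectangles.
    """
    dx = center_pred[0] - center_gt[0]
    dy = center_pred[1] - center_gt[1]
    return 1.0 - iou + (dx * dx + dy * dy) / (enclosing_diag ** 2)


loss_lin = riou_linear(0.5)
loss_log = riou_log(1.0)
loss_dist = distance_riou(0.5, (0.0, 0.0), (3.0, 4.0), enclosing_diag=10.0)
print(loss_lin, loss_log, loss_dist)
```

Since each term is a ratio, a given loss value means the same relative misalignment for a small vehicle and for a large ship, whereas smooth-L1 on raw offsets does not have this property. The center term stays informative even when two elongated boxes overlap only slightly.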
Unlike smooth-L1 loss, D-RIoU loss uses the rotated IoU to normalize the regression loss of objects of different scales, so it performs well on small objects (see the second and third columns of the second row in Figure 7). Moreover, D-RIoU loss imposes additional center point supervision, which is conducive to the regression of objects with large aspect ratios (see the second row and the first column of Figure 7).

Results on HRSC2016
Table 5 shows the performance comparison of different methods on the HRSC2016 dataset. Our method outperforms the other compared methods, achieving an mAP of 89.51%. Even with a smaller input size of 384 × 384 and a lightweight ResNet-50 as the backbone, our model still achieves an mAP of 87.14%. We also compare high-quality detection performance, as shown in Table 6. Because the sparse label assignment method effectively alleviates the performance degradation caused by redundant training samples, our method performs well on high-precision detection. The proposed model achieves the highest AP75 of 68.12% among the compared single-stage detectors, which demonstrates the advantage of our method.
Table 6. Comparisons with high-quality detection performance on HRSC2016 dataset.

(Table 6 body: the compared methods include RetinaNet [16] and ATSS; the remaining rows are not recoverable.)
We further visualized some detection results, as shown in Figure 8. Our model accurately detects ships in complex remote sensing scenes. Even for densely arranged long, narrow ships that are difficult to detect, our method still performs well and outputs high-quality detection results (see the third row in Figure 8).

Results on UCAS-AOD
Table 7 shows the experimental results on the UCAS-AOD dataset. Our method achieves the best performance among the compared methods, reaching an mAP of 89.44%, and outperforms the advanced two-stage detector RoI transformer [50] by 0.49%. RIDet [34] is also a recent high-quality oriented detector with an anchor refinement module. Our method performs better than these refinement-based approaches (RoI transformer and RIDet here), which demonstrates its effectiveness.
We visualized some of the detection results, as shown in Figure 9. Sparse label assignment is suitable for oriented object detection because there is generally no large overlap between objects in aerial images. Our method outputs high-quality detections even for densely arranged small objects (such as the small vehicles and planes in Figure 9).

Results on DOTA
We compared our method with some advanced algorithms on the DOTA dataset, and the results are shown in Table 8. Our method achieves an mAP of 76.36%, which is the highest among the compared models. Although our baseline is the one-stage detector RetinaNet, it outperforms some advanced two-stage methods after adopting the proposed modules. The visualization of some detections is shown in Figure 10. The objects in the DOTA dataset vary greatly in scale, and there are many scenes in which the objects are densely arranged. Our model does not suffer from duplicate detections of large objects and localizes them accurately (see the soccer ball field in the first row and second column, and the roundabout in the second row and first column in Figure 10). This can be attributed to SLA, which alleviates the inconsistency between classification and regression and helps suppress redundant detections. Moreover, densely arranged small objects in aerial images, such as small vehicles and small ships, are also difficult to detect. Owing to the localization features extracted by PS-FPN and the efficient supervision of D-RIoU loss, our method achieves superior performance on dense objects. As shown in the last row of Figure 10, our model accurately detects dense small objects in aerial images, with almost no missed detections.

Conclusions
In this paper, we analyzed the drawbacks of the current dense label assignment strategy for object detection in aerial images and proposed a sparse label assignment strategy (SLA). SLA uses the posterior IoU of the detections to perform posterior non-maximum suppression (P-NMS) and thereby selects sparse, high-quality anchors for training. In this way, the inconsistency between classification and regression is alleviated, and the imbalance of the training samples is resolved. To further improve the detection of densely arranged small objects in aerial images, we proposed a position-sensitive feature pyramid network (PS-FPN). PS-FPN uses the coordinate attention module to extract position-sensitive features via direction-specific pooling for accurate localization. Finally, the distance rotated IoU loss function (D-RIoU) is proposed to normalize the loss contribution of objects with different scales during training. In addition, the center point constraint in D-RIoU loss helps to detect objects with large aspect ratios accurately. Extensive ablation experiments on aerial image datasets have confirmed the superiority of our method. Based on the simple RetinaNet, we achieved an mAP of 76.36% on the DOTA dataset, 89.51% on the HRSC2016 dataset, and 89.43% on the UCAS-AOD dataset, which is superior to many advanced rotation detectors. In the future, we will further study the optimization process of anchors during regression to explore the distribution of high-quality anchors, which should help to achieve better bounding box regression and higher detection performance.