Multi-Sector Oriented Object Detector for Accurate Localization in Optical Remote Sensing Images

Abstract: Oriented object detection in optical remote sensing images (ORSIs) is a challenging task since the targets in ORSIs are arbitrarily oriented, small in scale, and densely packed. Current state-of-the-art oriented object detection models used in ORSIs primarily evolved from anchor-based and direct regression-based detection paradigms. Nevertheless, they still suffer from the design difficulty of handcrafted anchor definitions and the learning complexity of direct localization regression. To tackle these issues, in this paper, we proposed a novel multi-sector oriented object detection framework called MSO²-Det, which quantizes the scale and orientation prediction of targets in ORSIs via an anchor-free classification-to-regression approach. Specifically, we first represented the arbitrarily oriented bounding box as four scale offsets and angles in four quadrant sectors of the corresponding Cartesian coordinate system. Then, we divided the scale and angle space into multiple discrete sectors and obtained more accurate localization information by a coarse-granularity classification to fine-grained regression strategy. In addition, to decrease the angular-sector classification loss and accelerate the network's convergence, we designed a smooth angular-sector label (SASL) that smoothly distributes label values with a definite tolerance radius. Finally, we proposed a localization-aided detection score (LADS) to better represent the confidence of a detected box by combining the category-classification score and the sector-selection score. The proposed MSO²-Det achieves state-of-the-art results on three widely used benchmarks, including the DOTA, HRSC2016, and UCAS-AOD data sets.


Introduction
With the development of aerospace technology and sensor technology, remote sensing technology is entering a new stage that can quickly and accurately provide a variety of massive Earth observation data and facilitate widely applied research. Moreover, the demands of people for high-resolution optical remote sensing images (ORSIs) continue to increase. As a key task of remote sensing data information extraction, object detection in ORSIs plays an important role in many remote sensing applications, such as traffic supervision, resource exploration, military investigation, land management, and smart city construction. In recent years, although the related research on object detection has already made significant progress, ORSI implementations remain a challenging task due to the unique morphological characteristics of ORSI targets, such as varying scales, dense arrangement, arbitrary direction, and complex backgrounds.
In recent years, deep learning methods, especially deep convolutional neural networks (DCNNs), have made great progress in the field of object detection (e.g., Faster R-CNN [1], YOLO [2], SSD [3], and RetinaNet [4]). Although the DCNN-based object detection approaches have achieved promising results in natural scene images, there are
In this article, we designed an anchor-free multi-sector oriented object detector (MSO²-Det) that adopts the partitioning idea and multi-sector mechanisms to quantize the regression space of the scales and orientation of ORSI objects. Our multi-sector mechanisms are threefold. First, as depicted in Figure 2b, we divided the coordinate space into four quadrant sectors and represented the arbitrarily oriented bounding box (AOBB) as four scale and angle parameters in the Cartesian coordinate system. Based on this quadrant-sector mechanism, we represented the targets by ((x, y), O_p, θ_p), p = 1, 2, 3, 4. Specifically, targets are described by an in-box point, four scale diameters that are offset from the in-box point to the four boundaries, and four angles between the four scale diameters and the reference x-axis. By dividing the coordinate system into four sectors, each quadrant sector is responsible for regressing its own scale offset and angle to build an entire bounding box, which enhances the convergence performance of the network and addresses the order ambiguity problem of the angle and boundary. Second, instead of directly regressing the scales of the four diameters, we divided the scale space into multiple scale sectors and then employed a classification-to-regression strategy to obtain a more accurate location of the targets. Specifically, we first adopted a coarse-granularity classification approach to determine to which sector the scale range belongs. Then, the corresponding regression network refines the coarse localization with the selected sector scale by a fine-grained regression strategy.
Compared with the direct regression method, the network that combines regression and classification is easier to train and converges more readily while obtaining a more accurate bounding box scale. Third, we designed a smooth angular-sector label (SASL) to smoothly distribute the label value, which reduces the missed-detection rate and improves the detection accuracy. In addition, we adopted a localization-aided detection score (LADS) that better represents the confidence of a detected box by combining the category-classification and sector-selection scores, in contrast to the previous category-based confidence decision method. This localization-aided method dramatically improves the performance of detection. The contributions of this article are summarized as follows:
1. We proposed an innovative representation, i.e., quadrant sectors, for AOBBs in ORSIs. The proposed representation of AOBBs addresses the ambiguity problem of the boundary and the angle well, while enhancing the convergence performance of the network;
2. We proposed a classification-to-regression strategy to obtain the accurate localization of the ORSI targets with discrete scale and angular sectors. This strategy makes it easier for the network to learn the scale and orientation information of the AOBB;
3. We designed a smooth angular-sector label (SASL) that smoothly distributes label values with a definite tolerance radius. With this label, the missed-detection rate is reduced and the detection accuracy is dramatically improved;
4. To obtain a more accurate confidence of the detected boxes, we proposed the fusion of classification and localization information and thus achieved promising results on the DOTA, HRSC2016, and UCAS-AOD data sets.
The remainder of this article is organized as follows. The related work is concisely reviewed in Section 2. The details of the proposed method are introduced in Section 3. In Section 4, the experimental results are analyzed in detail. Finally, the conclusions of this article are presented in Section 5.

Related Works
According to the geometric characteristics of the bounding box, most of the existing object detectors in ORSIs can be roughly classified into two types: axis-aligned object detection and arbitrarily oriented object detection methods. In this section, the related works of axis-aligned object detection and arbitrarily oriented object detection models are briefly reviewed.

Axis-Aligned Object Detection in ORSIs
Real-time, precise object detection has long been both a research hotspot and a difficult problem in the field of machine vision. In recent years, as a significant and challenging branch of computer vision, object detection in ORSIs has developed rapidly. Traditional object detection algorithms are based on the excellent texture description ability of handcrafted features (e.g., the histogram of oriented gradients [11], the scale-invariant feature transform [12], and deformable part-based models [13]) and follow the paradigm of sliding windows. Gradually, the performance of manual feature selection techniques became saturated. Due to the robust learning ability and the high-level feature representation capability of deep convolutional neural networks (DCNNs) for images, a large number of DCNN-based object detectors have been proposed in natural image object detection and ORSI object detection. These detectors are used to detect targets described by an axis-aligned bounding box (AABB) and can be categorized into two main branches: multi-stage and one-stage object detection.

Multi-Stage Object Detection Method
The DCNN-based multi-stage object detectors divide the detection process of the axis-aligned bounding box (AABB) into several core computational steps and thereby achieve higher accuracy. As the originator of the multi-stage detection method, the R-CNN [14] first extracts the target proposals by selective search and then utilizes the CNN to determine the category and refine the location of the object proposal. Fast R-CNN [15] inputs the whole image to extract the features by the CNN and then generates the features of each region proposal by RoI pooling for the subsequent classifiers and fine regressors. Faster R-CNN [1] implements a CNN-based region proposal network (RPN) to generate the feature information of the region proposal, so that end-to-end detection is realized. Based on the Faster R-CNN framework, the Cascade R-CNN [16] cascades multiple R-CNN networks based on different IoU thresholds to continuously optimize the resulting proposals and obtain more accurate detection results. Aimed at the characteristics of ORSI targets, some recent works have applied the multi-stage detection methods to the ORSI object detection field. For example, Deconv R-CNN [17] utilizes a network with a deconvolution layer after the last convolution layer of the Faster R-CNN backbone network for ORSI small target detection. Yang et al. [18] proposed a cluster proposal network (CPN) that addresses the target clustering and scale adjustment issues of aerial image targets. To boost multi-class and multi-scale detection capabilities, FRPNet [19] is designed with a feature-reflowing pyramid structure to generate high-quality feature representations for each scale by fusing fine-grained features from the lower adjacent layer. Chen et al. [20] introduced a multi-scale spatial and channel-wise attention (MSCA) mechanism to eliminate the interference of complex backgrounds. Lu et al. [21] designed a gated axis-concentrated localization network (GACL-Net) to improve the performance of small-scale detection in ORSIs.

One-Stage Object Detection Methods
One-stage object detection methods (e.g., YOLO [2], SSD [3], and RetinaNet [4]), which abandon the region proposal stage, directly generate the category probability and position coordinate values of the object. With a single feedforward CNN baseline, the final detection result can be obtained directly. Therefore, these types of methods are considered to be faster and simpler in design. In the field of ORSIs, one-stage detectors are becoming increasingly popular. For example, MRFF-YOLO [22] introduced a multi-receptive field model to enhance the performance of small-scale target extraction. Based on the SSD paradigm, AF-SSD [23] improves the performance of ORSI object detection by designing enhancement modules such as the encoding-decoding module and spatial and channel attention modules. Sun et al. [24] proposed an adaptive saliency-biased loss (ASBL) to train the RetinaNet and dramatically improved the performance of detection in ORSIs. In addition, the works in [25,26] proposed advanced object detection architectures that involve both spatial and temporal domain information in the decision. However, these axis-aligned bounding box object detectors are still confronted with the challenge of arbitrary orientations in ORSIs. More auxiliary network structures are required for arbitrarily oriented objects in ORSIs.

Arbitrarily Oriented Object Detection in ORSIs
Given the orientation characteristic of remote sensing objects, a good alternative is the use of an arbitrarily oriented bounding box to describe the ORSI targets. These arbitrarily oriented object detectors for ORSIs can be roughly divided into two categories: anchor-based and direct regression-based object detection methods.

Anchor-Based Object Detection Method
For an optical remote sensing image, anchor-based detectors first make use of many fixed anchors as references and then either regress the localization offset of the bounding box or generate the region proposals on the basis of anchors and decide whether the corresponding proposal belongs to a certain category. Liu et al. [27] transformed the original region-of-interest (RoI) pooling layer and AABB regression representation into a rotated RoI and AOBB regression model for the ship detection task in ORSIs. The work in [28] introduced the feature pyramid network (FPN) and the cascade image to obtain abundant semantic information for regressing the offsets between the AOBB and the AABB. RoI Transformer [29] upgrades the horizontal RoI to an oriented RoI by a supervised RoI learner design. To effectively detect ships, the work in [30] proposed a rotated region proposal network (R²PN) and a rotated RoI layer to generate oriented proposals and extract features from inclined regions, respectively. Based on the FPN structure and a novel spatial and scale-aware attention mechanism, CAD-Net [31] introduced a global and local context network to collect the scene and object-level contextual information for accurate and efficient AOBB object detection in ORSIs.

Anchor-Free Object Detection Method
While anchor-based detection strategies have demonstrated promising results in ORSIs, they are unable to escape the inefficient and inflexible manual designs of multi-scale, multi-orientation anchors. Recently, the ORSI target detection field has seen an upsurge of anchor-free approaches. Typically, these methods are classified into two categories: keypoint-based and dense prediction-based detectors.
In regard to keypoint-based methods, CornerNet [32] utilizes the upper left and lower right corners of the AABB to locate the objects. CenterNet [33] proposes a center-based paradigm to represent the target and then regresses the offsets of the center and the corresponding distances between the four boundaries and the center. Combining CornerNet [32] and CenterNet [33], Chen et al. [34] utilized an end-to-end FCN to identify the ship AOBBs according to the predicted corners, center, and corresponding angle of the ship. The OPLD [35] transforms an accurate localization task from a regression problem to a keypoint estimation problem and then combines the endpoint scores with the classification score to improve the final detection quality. Shi et al. [5] decomposed the vehicle detection problem in ORSIs into one central point classification and three parameter regression subtasks to predict the central point, scales, orientation, and offsets of the vehicle central point. HRPNet [36] introduced polar coordinates and transformed the detection task of the arbitrarily oriented bounding box into the regression of one polar angle and four polar radii. GRS-Det [37] employs an anchor-free ship detection algorithm based on a U-shape network and a rotation Gaussian mask.
For dense prediction-based methods, DenseBox [38] utilizes a fully convolutional network (FCN) to obtain the pixel-level prediction of confidence and the location of AABBs. FCOS [39] follows the FCN structure and implements center-ness to suppress the low-quality detected boxes. For ORSI targets, IENet [40] modifies the FCOS structure with an oriented regression branch enhanced by a self-attention mechanism. Similarly, TOSO [41] designed a robust Student's T distribution-aided one-stage orientation detector to address oriented target detection in ORSIs. Xiao et al. [42] proposed to detect the arbitrarily oriented objects in ORSIs by predicting the axis of the object at the pixel level of feature maps. Different from the aforementioned methods that directly regress the scales or the angle of the AOBB, our proposed MSO²-Det quantizes the boundless regression spaces by a classification-to-regression multi-sector strategy, which accelerates the convergence of the network and obtains more accurate localization of AOBBs in ORSIs.

Localization-Guided Detection Confidence
Many works have verified that combining the localization quality score with the classification score helps to identify high-quality detection results, and many are committed to correcting the final detection confidence by the localization score. The work in [43] proposed to transform the task of predicting the intersection over union (IoU) between the predicted box and ground truth into a classification task and then used the predicted IoU to optimize the final detection confidence. IoU-Net [44] corrects the detected bounding box score by an IoU regression branch. The work in [45] combined the IoU score that was predicted by a fused scoring network with the classification score for the final detection confidence. Wu et al. [46] predicted the IoU for each detected box and utilized the product of the predicted IoU and the classification score to compute the final detection confidence, which effectively boosted the localization accuracy. OPLD [35] uses the class-agnostic keypoint-estimation score to guide the detection score of the AOBB in ORSIs. Inspired by these methods, our MSO²-Det combines the category-classification score with the localization sector-selection score, which provides a more reasonable final detection confidence.

Methodology
The pipeline of the proposed multi-sector oriented object detector (MSO²-Det) is illustrated in Figure 3. It mainly includes two modules: the multi-level feature extraction backbone network and the multi-level prediction head for object classification and localization. Given an input image, the backbone network generates a multi-level feature map by a feature pyramid network (FPN), which is used for the subsequent multi-level prediction head. Note that each level of the FPN is followed by a prediction head to detect targets with different scales. For each position on the feature map of different levels, the classification branch of the prediction head is responsible for the prediction of the category confidence score. Meanwhile, in order to predict the accurate localization of objects, we designed a sector-based localization branch to pinpoint the ORSI targets. Specifically, the localization branch of the prediction head is composed of the scale-sector classification, scale-sector regression, angular-sector classification, and angular-sector regression prediction sub-branches. Combining the scale-sector classification and regression, we can obtain an accurate scale of the targets. The angular-sector classification and regression sub-branches are in charge of precise angle prediction. In addition, to obtain more accurate localization confidence, we adopted a localization-guided detection confidence strategy that combines the category-classification score with the sector-selection score and dramatically improves the localization quality.

Multi-Level Feature Extraction Network
As shown in Figure 3, a 101-layer residual network (ResNet-101) [47] backbone was deployed to extract features from the input training or testing images, followed by a feature pyramid network (FPN) [48], which was implemented to detect objects with different sizes on multi-level feature maps. The output feature maps of ResNet-101 were down-sampled 32 times over five stages, and, following the design of FCOS, we only utilized three levels of backbone features (Stages 3 to 5) to build the multi-scale feature pyramid. We defined C_3, C_4, and C_5 as the feature maps in Stages 3, 4, and 5 of the ResNet-101 backbone. In addition, to enhance the ability to model geometric transformations, we replaced the 3 × 3 convolution in C_3, C_4, and C_5 with DCN-v2 (modulated deformable convolution). Meanwhile, P_i represents the feature maps of different levels used for final classification and localization prediction that are obtained by the FPN. In our method, five levels of feature maps {P_3, P_4, P_5, P_6, P_7} were utilized, where P_3, P_4, and P_5 were generated from the backbone network's feature maps C_3, C_4, and C_5, followed by a 1 × 1 convolutional layer with top-down connections. P_6 and P_7 were obtained by applying a stride-2 convolutional layer to P_5 and P_6, respectively. Finally, the prediction heads were attached to the feature maps at different levels.
Let F_l ∈ R^{H×W×C} be the feature maps with size (H, W) at layer l ∈ {3, 4, 5, 6, 7} of the network, s = 2^l be the total stride up to the l-th layer, and C represent the number of ORSI target categories. Each location (x, y) on the feature map can be mapped back onto the corresponding position (x · s + s/2, y · s + s/2) of the input image. A location is considered a positive sample if it lies within a distance d = 1.25 × s of the center point (x_c, y_c) of a ground truth AOBB belonging to category label c and the range of the scale sector lies in the regression range of the l-th layer. We defined the regression ranges for FPN levels 3 to 7 as (0, 64], (64, 128], (128, 256], (256, 512], and (512, ∞), respectively. Otherwise, the location is considered a negative sample with c = 0, which denotes the background.

Classification Branch of the Prediction Head
Figure 3 illustrates the network details of the prediction heads. For the classification branch, a four-layer convolution stack with 3 × 3 kernels and 256 channels was employed to extract the features f^i_cls ∈ R^{H×W×256}, i = 3, 4, 5, 6, 7, from the i-th level of the FPN. The final feature map for predicting object multi-category probability scores can be calculated as:
where F^i_cls ∈ R^{H×W×C} denotes the final category-classification prediction map, Conv1×1 indicates the convolutional operation with 1 × 1 kernels and C channels (i.e., the total category number), GN represents the group normalization, and σ denotes the ReLU activation function. At the inference stage, the final layer of the classification branch network predicts a C-dimensional vector of classification labels at the location (x, y).
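The label assignment that supervises this branch, i.e., the positive-sample rule and the per-level regression ranges given in the feature-extraction subsection, can be illustrated with a minimal Python sketch. It assumes that the distance to the box center is Euclidean and that the box scale can be summarized by a single number; the names `LEVEL_RANGES`, `to_image_coords`, and `is_positive` are hypothetical and are not taken from the paper.

```python
import math

# Regression range (low, high] for each FPN level, as listed in the text.
LEVEL_RANGES = {
    3: (0, 64),
    4: (64, 128),
    5: (128, 256),
    6: (256, 512),
    7: (512, math.inf),
}


def to_image_coords(x, y, level):
    """Map a feature-map location back to input-image coordinates."""
    s = 2 ** level  # total stride of this level
    return x * s + s / 2, y * s + s / 2


def is_positive(x, y, level, center, scale):
    """Return True if the location passes both tests for this level.

    `center` is the (x_c, y_c) of a ground-truth box. `scale` is a single
    number standing in for the box scale (an assumption of this sketch).
    """
    s = 2 ** level
    px, py = to_image_coords(x, y, level)
    # Euclidean distance is an assumption; the text only says "distance".
    dist = math.hypot(px - center[0], py - center[1])
    low, high = LEVEL_RANGES[level]
    return dist <= 1.25 * s and low < scale <= high
```

For example, at level 3 (stride 8) the location (0, 0) maps to image position (4, 4), so it can only be positive for boxes whose center lies within 10 pixels of that position and whose scale falls in (0, 64].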

Multi-Sector Design
As shown in Figure 4, in order to obtain the accurate localization of the AOBB target, we represented the target by a multi-sector model. For each in-box point in the ORSI target, we represented it by ((x, y), O_p, θ_p), p = 1, 2, 3, 4, where (x, y) indicates the coordinate of the in-box point, O_p denotes the perpendicular distance (scale offset) from the in-box point (x, y) to each of the four boundaries, and θ_p represents the angles between the four scale diameters and the reference x-axis. For convenience, we only took the angle θ ∈ [0°, 90°) in the first quadrant to represent the AOBB target, and the angles of the 2nd, 3rd, and 4th quadrants can be calculated as θ + 90°, θ + 180°, and θ + 270°, respectively. Note that the detailed descriptions of the scale offsets and SASL can be found in Appendix A (Algorithms A1 and A2).
Meanwhile, for the localization branch of the prediction head in Figure 3, we also first deployed a four-layer convolution stack with 3 × 3 kernels and 256 channels to extract the features f^i_loc ∈ R^{H×W×256}, i = 3, 4, 5, 6, 7, from the i-th level of the FPN. Then, similar to (1), we employed a ReLU + GN + Conv1×1 operation to obtain the feature maps F^i_{ss-reg} ∈ R^{H×W×4N} for scale-sector regression, F^i_{ss-cls} ∈ R^{H×W×4N} for scale-sector classification, F^i_{as-reg} ∈ R^{H×W×1} for angular-sector regression, and F^i_{as-cls} ∈ R^{H×W×M} for angular-sector classification.
The motivation of this multi-sector design can be summed up in two points. One is divide-and-conquer. The Cartesian coordinate system is divided into four independent quadrant sectors, so the regression task of each sector becomes more definite, which effectively eliminates the ambiguity of the regression parameter definition. The other is coarse-to-fine. By discretizing the regression range into multiple coarse scale sectors and angular sectors in the four quadrant sectors, we can shrink the regression range and then perform fine-tuning in a smaller regression interval adapted to the object size, which is instrumental in detecting remote sensing objects with various resolutions. The core mechanisms of the multi-sector model for our method are detailed as follows.
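As a geometric illustration of this representation, the following Python sketch recovers the four corners of an AOBB from an in-box point, the four offsets, and the first-quadrant angle. It assumes that each offset is the perpendicular distance from the in-box point to its boundary and that angles are measured in a standard mathematical frame; the function name `decode_corners` is hypothetical and is not taken from the paper.

```python
import math


def decode_corners(x, y, offsets, theta_deg):
    """Return the four box corners from an in-box point, offsets, and theta.

    offsets[p] is the perpendicular distance to the boundary of quadrant
    p + 1; the four directions are theta, theta + 90, theta + 180, and
    theta + 270 degrees.
    """
    dirs = []
    for p in range(4):
        a = math.radians(theta_deg + 90 * p)
        dirs.append((math.cos(a), math.sin(a)))
    corners = []
    for p in range(4):
        q = (p + 1) % 4  # index of the adjacent boundary
        # Adjacent directions are orthogonal, so the corner where the two
        # boundaries meet is the point shifted by both offsets.
        cx = x + offsets[p] * dirs[p][0] + offsets[q] * dirs[q][0]
        cy = y + offsets[p] * dirs[p][1] + offsets[q] * dirs[q][1]
        corners.append((cx, cy))
    return corners
```

With θ = 0 and offsets (1, 2, 3, 4) around the origin, the sketch yields an axis-aligned box spanning x ∈ [−3, 1] and y ∈ [−4, 2]. In image coordinates, where the y-axis points downward, the same formula applies with the orientation mirrored.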

Quadrant Sector
As described in [10], if we adopted the representation in Figure 2a, the regression parameters, such as w and h, of the target AOBB would be measured in one fixed rotating coordinate system, which results in an inherent ambiguity problem in the regression parameter definition and makes it hard for the network to converge. Therefore, we took the in-box point as the origin and split the AOBB of the ORSI target with the corresponding x-axis and y-axis. As shown in Figure 4, the Cartesian coordinate system is divided into four quadrant sectors, namely, Q_1, Q_2, Q_3, and Q_4, and the network then regresses the respective diameter belonging to the corresponding quadrant sector. This representation of the AOBB is more distinct and enhances the convergence performance of the network.

Scale Sector
As shown in Figure 2b, we reconstructed the AOBB of the object by calculating the scale offset between the regression point and the four AOBB boundaries in the four quadrants Q_1, Q_2, Q_3, and Q_4. Formally, if location (x_r, y_r) is associated with a bounding box B_in, the training regression targets (i.e., scale offsets) {O*_1, O*_2, O*_3, O*_4} for the location can be calculated by Algorithm A1. Instead of directly regressing the scale offsets, we adopted a classification-to-regression strategy to obtain the values of the scale offsets. Specifically, as illustrated in Figure 4, we divided the scale regression space into N scale sectors, where N = 5 in our method. We defined the ranges of the N sectors as (0, 32], (32, 64], (64, 128], (128, 256], and (256, ∞). If the scale offset falls into a certain scale sector S_n, n ∈ {0, 1, 2, 3, 4}, it is assigned a scale regression parameter S_n = 32 · 2^n. We used a one-hot label for scale-sector selection prediction. We defined s_{j,n} as the predicted scale-sector classification score for the quadrant j ∈ {1, 2, 3, 4} within the n-th scale sector, and the final predicted confidence score p_{j,n} was formulated as:
We regressed the scale offsets O_1, O_2, O_3, and O_4 by a classification-to-regression strategy. In particular, we identified which scale sector the scale offsets belong to as follows:
where n_j denotes the scale sector into which the scale offset of the j-th quadrant falls. Then, the regression of the scale sector was formulated as:
where O_j and O*_j are the scale offsets of the predicted bounding box and ground truth bounding box in the j-th quadrant (likewise for S_{n_j} and n_j), respectively. As illustrated in Figure 4, the scale-sector classification performs N classifications for sector selection in each of the four quadrants. The scale-sector regression performs scale predictions for the sector selected by the scale-sector classification branch.
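The sector assignment described above can be sketched in Python as follows. The sector boundaries, N = 5, and S_n = 32 · 2^n come from the text; the normalized regression target `offset / S_n` and the helper names (`scale_sector`, `encode_offset`, `decode_offset`, `one_hot`) are assumptions made only for illustration and should not be read as the paper's regression formula.

```python
NUM_SECTORS = 5  # N in the text


def scale_sector(offset):
    """Index n of the sector (0,32], (32,64], ..., (256,inf) holding offset."""
    if offset <= 0:
        raise ValueError("offset must be positive")
    for n in range(NUM_SECTORS - 1):
        if offset <= 32 * 2 ** n:
            return n
    return NUM_SECTORS - 1  # open-ended last sector


def sector_scale(n):
    """Scale regression parameter S_n = 32 * 2**n."""
    return 32 * 2 ** n


def one_hot(n):
    """One-hot label for scale-sector selection."""
    return [1 if k == n else 0 for k in range(NUM_SECTORS)]


def encode_offset(offset):
    """Coarse sector index plus a fine target (hypothetical normalization)."""
    n = scale_sector(offset)
    return n, offset / sector_scale(n)


def decode_offset(n, t):
    """Invert encode_offset: rescale the fine value by the sector scale."""
    return sector_scale(n) * t
```

For instance, an offset of 100 pixels falls in sector n = 2 with S_2 = 128, so under this hypothetical normalization the fine regression target is 100 / 128 ≈ 0.78, a value in a much narrower range than the raw offset.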

Angular Sector
For the arbitrarily oriented objects in ORSIs, the direction of the AOBB has a great impact on the detection performance. The IoU between the predicted box and ground truth may decrease considerably even with a small angle bias. To obtain more accurate angle information, we also employed a classification-to-regression method to predict the angle θ ∈ [0°, 90°) in the first quadrant. To be more concrete, we split the angle range [0°, 90°) into M angular sectors, where M was set to 90 and each sector had an interval I_θ = 1°. Therefore, we divided the angular space as {(0°, 1°], (1°, 2°], ···, (89°, 90°]}. If the angle θ falls into a certain angular sector A_m, the network will regress the angle bias as follows:
where θ and θ* denote the predicted result and ground truth of the first quadrant angle, respectively. Meanwhile, m denotes that the ground truth angle θ* belongs to the m-th angular sector. We defined p_{θ,l}, l ∈ {1, 2, ···, M − 1, M}, as the predicted angular-sector classification score within the l-th angular sector and m* = argmax_l(p_{θ,l}) as the parameter for angle bias regression.
Moreover, we designed a smooth angular-sector label (SASL) to smoothly assign the label value with a certain tolerance R and obtain robust angular-sector prediction. The procedure of SASL generation is summarized in Algorithm A2. Instead of taking the one-to-one mapping paradigm of the one-hot label for angular selection prediction, this smooth label smoothly maps the ground truth angular sector into multiple sectors and alleviates the effect of classification error. By assigning this SASL to each angular sector, predictions close to the ground truth obtain more angle tolerance and are allowed a small angle deviation, which reduces the missed-detection rate and improves the detection accuracy.
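The idea of a smooth label with a tolerance radius can be illustrated with a minimal Python sketch. Since Algorithm A2 is not reproduced in this section, the Gaussian window, the default values of `radius` and `sigma`, the 0-based indexing with sectors [m, m + 1), and the absence of wrap-around at the ends of the range are all assumptions of this sketch rather than the paper's SASL definition.

```python
import math

M = 90          # number of angular sectors (from the text)
INTERVAL = 1.0  # degrees per sector (from the text)


def angular_sector(theta):
    """0-based index of the sector containing theta, for theta in [0, 90)."""
    return min(int(theta // INTERVAL), M - 1)


def smooth_label(m, radius=3, sigma=1.5):
    """Soft label that peaks at sector m and decays within `radius` sectors.

    The Gaussian window is a hypothetical choice; sectors farther than
    `radius` from m keep a label value of zero.
    """
    label = [0.0] * M
    for k in range(max(0, m - radius), min(M, m + radius + 1)):
        label[k] = math.exp(-((k - m) ** 2) / (2 * sigma ** 2))
    return label
```

Compared with a one-hot label, a prediction that selects a neighboring sector within the radius is penalized less, which is the tolerance effect described above.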

Localization-Aided Detection Score
To obtain more accurate detection confidence, we designed a localization-aided detection score. Most detectors only use classification scores as the measure of detected box quality. Nevertheless, a high-quality detection result represents not only precise category classification, but also accurate localization. Therefore, it is inaccurate to evaluate the quality of detection results only by classification scores. To tackle this issue, we proposed to combine the classification score with a localization confidence score (i.e., the scale-sector and angular-sector selection confidence scores), which is formulated as:
where P_j = max_n(p_{j,n}), j ∈ {1, 2, 3, 4}, represents the maximum confidence of scale-sector classification in each of the four quadrants and P_θ = max_l(p_{θ,l}), l ∈ {1, 2, ···, M − 1, M}, denotes the maximum probability of the angular-sector selection score. P_cls and P_loc are the prediction results of the classification and localization branches, respectively. In our experiment, the parameter α ∈ [0, 1] was introduced to balance the contributions of the classification and localization scores in the final detection score. By taking localization quality into account, the detection score can better represent the confidence of detected bounding boxes. For each location, we kept a prediction as a definite detection if its final confidence score p_fin was higher than 0.05.
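A minimal Python sketch of such a score fusion is given below. The maxima P_j and P_θ and the 0.05 threshold follow the text, whereas averaging the five maxima into P_loc and fusing with a weighted geometric mean are assumptions chosen only for illustration; the fusion formula of the paper is not reproduced in this section, and the function names are hypothetical.

```python
def localization_score(scale_scores, angle_scores):
    """Combine sector-selection confidences into one localization score.

    scale_scores: four lists, one per quadrant, of scale-sector scores p_{j,n}.
    angle_scores: list of angular-sector scores p_{theta,l}.
    """
    p_quadrants = [max(row) for row in scale_scores]  # P_j = max_n p_{j,n}
    p_theta = max(angle_scores)                       # P_theta = max_l p_{theta,l}
    # Assumption: a plain average of the five maxima.
    return (sum(p_quadrants) + p_theta) / 5


def fused_score(p_cls, p_loc, alpha=0.5):
    """Fuse classification and localization scores with alpha in [0, 1].

    Assumption: weighted geometric mean; alpha = 1 ignores localization.
    """
    return p_cls ** alpha * p_loc ** (1 - alpha)


def keep_detection(p_fin, threshold=0.05):
    """Keep predictions whose final score exceeds the threshold."""
    return p_fin > threshold
```

Under these assumptions, a box with a high category score but uncertain sector selection receives a lower final score than a box that is confident in both branches, so it ranks lower in score-based post-processing.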

Loss Function
Our MSO²-Det is an end-to-end framework, and the multi-task training loss function was formulated as follows:
where 1{·} represents an indicator function that returns one if c = 1 (i.e., positive sample) and otherwise returns zero. L_cls represents the feature point category-classification loss. L_sc and L_sr indicate the sector classification and regression losses, respectively. In our method, we set the loss weights λ to 0.5.

Classification Loss
The category classification loss L_cls is calculated by the focal loss [4] function as follows:
where N_pos indicates the number of positive targets in the ground truth. p_{x,y} and c_{x,y} represent the predicted probability score and ground truth of the category, respectively. In our experiment, we set α and γ to 0.25 and 2, respectively.
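For reference, the standard focal loss of [4] can be sketched in a few lines of Python. The per-element form below is the usual binary formulation with the common defaults α = 0.25 and γ = 2; the normalization by the number of positive samples in `mean_focal_loss` is an assumption consistent with the role of N_pos, and both function names are hypothetical.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one predicted probability p and a binary label y."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # avoid log(0)
    if y == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)


def mean_focal_loss(probs, labels, n_pos):
    """Sum the per-element losses and normalize by the positive count."""
    total = sum(focal_loss(p, y) for p, y in zip(probs, labels))
    return total / max(n_pos, 1)
```

The modulating factor (1 − p)^γ shrinks the loss of well-classified samples, so the many easy background locations contribute little to the gradient.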

Sector Classification Loss
The sector-classification loss L_sc of the scale sector and angular sector is calculated as follows:
where p_{j,n} and p*_{j,n} are the predicted scale-sector classification score and ground truth label of each feature point, respectively. p_{θ,m} and L_m are the predicted angular-sector probability distribution and smooth angular-sector label of the ground truth θ in each feature point (x, y), respectively. SCE and CE represent the sigmoid cross-entropy loss and cross-entropy loss, respectively. Note that we omitted the mark (x, y) for simplicity.

Sector Regression Loss
The scale-sector and angular-sector regression losses were formulated via the smooth L1 regression loss function. The formula is defined as follows:
where t_j, j ∈ {1, 2, 3, 4, θ}, represents the regression targets of the scale-sector and angular-sector offsets of the positive samples, which are defined in (4) and (5), respectively.
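The smooth L1 function itself is standard and can be written as a short Python sketch. The transition point `beta = 1.0` is the common default and an assumption here, since the value used in the paper is not stated in this section.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff ** 2 / beta
    return diff - 0.5 * beta
```

Both pieces meet at |pred − target| = beta, so the loss stays continuous while large residuals are penalized only linearly.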

Experiments and Results Analysis
In this section, we first introduce three public optical remote sensing image data sets and evaluation metrics and then analyze the implementation details of the training and detection inference of the network. Next, the superiority of the proposed method is analyzed in comparison with the state-of-the-art detectors. Finally, some promising detection results are displayed.

Data Sets and Evaluation Metrics
In our experiments, we chose three oriented optical remote sensing image data sets: the DOTA [49] data set, the HRSC2016 data set [50], and the UCAS-AOD [51] data set.

DOTA Data Set
DOTA consists of 2806 aerial images that contain a total of 188,282 instances annotated with both horizontal and oriented bounding boxes. The categories of the data set include plane, ship, storage tank, baseball diamond, tennis court, swimming pool, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer field, and basketball court. These 15 categories correspond to 14 main categories, since small vehicle and large vehicle are sub-classes of the vehicle category. In this data set, the proportions of training, validation, and test images are 1/2, 1/6, and 1/3, respectively. The size of each image falls within the range of 800 × 800 to 4000 × 4000 pixels. In the experiments, we only used the annotations of the arbitrarily oriented bounding boxes. The images were cropped into patches of 512 × 512, 800 × 800, and 1024 × 1024 pixels with an overlap of 0.2.
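The cropping step can be sketched along one image axis as follows. This is only an illustration: reading "0.2" as the fraction of a patch shared by adjacent crops, and shifting the last crop back so that it ends at the image border, are both assumptions, and `tile_origins` is a hypothetical helper.

```python
def tile_origins(length, tile, overlap=0.2):
    """Start offsets of crops of size `tile` along an axis of size `length`."""
    if length <= tile:
        return [0]
    stride = int(tile * (1.0 - overlap))  # e.g. 1024 -> stride of 819 pixels
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:
        # Add a final crop aligned with the border so no pixels are dropped.
        origins.append(length - tile)
    return origins
```

For a 4000-pixel axis and 1024-pixel patches this gives five crops starting at 0, 819, 1638, 2457, and 2976; applying it to both axes yields the 2D crop grid.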

HRSC2016 Data Set
HRSC2016 is a public data set for arbitrarily oriented ship detection in ORSIs. It contains a total of 1061 images, with sizes from 300 × 300 to 1500 × 900 pixels, that were captured from six famous ports. The training, validation, and test sets contain 436, 181, and 444 images, respectively.

UCAS-AOD Data Set
The UCAS-AOD data set consists of two types of targets: airplane and car, which are labeled with oriented bounding boxes. It includes 1000 plane images and 510 car images, which contain 7482 objects and 7144 objects, respectively. The scale of the UCAS-AOD image is 1280 × 659 pixels. In the experiment, we randomly divided the training and testing set according to the ratio of 7:3.

Evaluation Metrics
A predicted box is regarded as a true positive (TP) if the IoU between the predicted box and the ground truth exceeds the preset threshold; otherwise, it is a false positive (FP). If a ground truth box has not been detected correctly, it is labeled a false negative (FN). Precision = TP/(TP + FP) denotes the proportion of true positives among all predicted positive samples, while recall = TP/(TP + FN) indicates the ratio of correctly detected positive samples to all positive samples. Combining precision and recall, the F1-score = 2 · precision · recall/(precision + recall) evaluates the one-class object detection performance comprehensively. For multi-category object detection, we used the mean average precision (mAP), which is defined as the mean value of the AP over all categories, to evaluate the detection accuracy. Meanwhile, we recorded the number of images that can be processed per second (i.e., frames per second (FPS)) and the number of model parameters to evaluate the detection speed and complexity of the methods.
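These metrics can be computed from the TP, FP, and FN counts as in the short sketch below, which uses the standard definitions precision = TP/(TP + FP) and recall = TP/(TP + FN). The function names and the dictionary layout for per-category AP are illustrative assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_average_precision(ap_per_class):
    """mAP as the mean of the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For example, 8 true positives with 2 false positives and 8 missed objects give a precision of 0.8 and a recall of 0.5.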

Experimental Details
The experiments were run on a hardware platform with an Intel Xeon E5-2603 v4 CPU (1.70 GHz × 6) and two NVIDIA GeForce GTX 1080 Ti GPUs with 11 GB of memory each. We used the deep learning framework PyTorch 1.0 on the Ubuntu 16.04 operating system. In our method, ResNet-101, initialized with weights pre-trained on ImageNet, was used as the backbone network. We used stochastic gradient descent (SGD) to optimize the network and set the initial learning rate to 0.001. The learning rate was reduced by a factor of 1.8 every 20 k iterations, with a batch size of 32. The weight decay and momentum were set to 0.0001 and 0.9, respectively. We resized the input image to 1024 × 1024 and randomly applied data augmentation to enlarge the data set, including horizontal and vertical flipping, rotation, cropping, and color dithering. We trained the network for approximately 50 epochs on the DOTA data set and 150 epochs on the UCAS-AOD and HRSC2016 data sets.
We utilized ResNet-101+FPN as the backbone network to tune the parameters of our method on the UCAS-AOD data set. First, the parameters α and γ in (8) are two factors that can have a vital impact on the detection results, so we analyzed the sensitivity of MSO 2 -Det to these two values. We tested α ∈ {0.1, 0.25, 0.5, 0.75, 0.9} and γ ∈ {0, 0.2, 0.5, 1, 2, 5}. Figure 5 shows that the best performance was achieved with α = 0.25 and γ = 2; therefore, these two parameters were empirically set to 0.25 and 2, respectively. Meanwhile, as shown in Table 1, we tested λ ∈ {0.01, 0.1, 0.2, 0.5, 0.75, 1} in (7) and achieved the highest mAP of 96.33% when λ = 0.5. Therefore, we chose 0.5 as the value of λ.

Network Inference
The inference of our network is straightforward: we input the image and forward it through the network. The classification branch of the prediction head outputs a C-dimensional vector for the C category predictions. In addition, corresponding to the training targets, the final layer of the localization branch predicts an M-dimensional vector for angular-sector selection, a one-dimensional vector for angular-sector bias, a 4N-dimensional vector for scale-sector selection, and a 4N-dimensional vector for scale-sector bias. Each point of the FPN feature map can be mapped back onto the input image coordinate (x, y). Then, we can obtain the scale offsets O_1, O_2, O_3, O_4 and the angle θ according to (4) and (5), and decode the bounding box illustrated in Figure 4 by the following formula: In our method, we only decoded bounding box predictions from at most the 1k top-scoring predictions (ranked by the score P_fin) per FPN level, after thresholding the detector confidence at 0.05. The top predictions from all levels were merged, and oriented non-maximum suppression with a threshold of 0.5 was applied to yield the final detection results.
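The candidate-selection step before non-maximum suppression can be sketched as follows. The data layout (one list of final scores per FPN level) and the function name are assumptions for illustration; the oriented NMS itself is omitted because it requires a rotated-box IoU routine that the text does not specify.

```python
def select_candidates(scores_per_level, score_thresh=0.05, top_k=1000):
    """Threshold, keep the top-k per FPN level, then merge all levels.

    scores_per_level: list (one entry per FPN level) of lists of final scores.
    Returns (score, level, index) tuples sorted by descending score.
    """
    merged = []
    for level, scores in enumerate(scores_per_level):
        kept = [(s, level, i) for i, s in enumerate(scores) if s > score_thresh]
        kept.sort(key=lambda item: item[0], reverse=True)
        merged.extend(kept[:top_k])  # at most top_k candidates per level
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged
```

The returned indices identify which feature points should be decoded into boxes and passed on to oriented NMS.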

Ablation Study
We conducted some ablation experiments on the UCAS-AOD data set to verify the effectiveness of the proposed smooth angular-sector label (SASL) and localization-aided detection score (LADS). All models for impartial comparison were based on ResNet101-FPN with data augmentation.

SASL
In our method, we transformed the regression of the object orientation angle into a discrete, fine-grained angular-sector classification problem. In the experiment, we found that the one-hot label used in the baseline model, which adopts a point-to-point mapping between the ground truth and the predicted angular sector, was agnostic to the angular distance between a false angular-sector prediction and the ground truth. All false angle classifications were assigned an equal loss, whereas predictions close to the ground truth should be assigned a smaller classification loss. To tackle this problem, we designed an angular-sector label that smoothly distributes the label value within a definite tolerance radius. Our baseline model without SASL and LADS only achieved 90.56% mAP. Integrated with SASL, the performance of our model was improved by 2.22% compared with the baseline model, because SASL tolerates angle predictions that fall within a defined error limit of the ground truth.
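The idea of a smoothed sector label can be illustrated with the sketch below. The exact window used by SASL is defined in the paper's algorithm and is not reproduced here; the triangular decay, the non-circular treatment of the sector range, and the function name are all assumptions chosen only to show how a tolerance radius R softens a one-hot label.

```python
def smooth_angular_sector_label(gt_sector, num_sectors, radius):
    """Label of length num_sectors peaking at gt_sector.

    Sectors within `radius` of the ground truth receive a value that decays
    linearly with distance (assumed window); all others receive zero.
    With radius = 0 the result is the ordinary one-hot label.
    """
    label = [0.0] * num_sectors
    for m in range(num_sectors):
        distance = abs(m - gt_sector)
        if distance <= radius:
            label[m] = 1.0 - distance / (radius + 1.0)
    return label
```

Under such a label, predicting a neighbouring sector is penalized less than predicting a distant one, which is the behaviour the one-hot label lacks.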

LADS
A single classification score cannot comprehensively assess the final quality of a detected box. Therefore, we used the average value of the four scale-sector and the angular-sector classification scores as the localization quality P_loc of the AOBB to aid the evaluation of the detected box quality. Then, as shown in (6), we took the weighted product of the localization score and the classification score as the final detection confidence P_fin, which takes both classification and localization confidence into account. By using LADS to reflect the confidence of the detected AOBB, the detection performance was improved by 3.42% compared with the baseline model. This improvement indicated that the localization score increased the accuracy significantly and that LADS enabled a better assessment of the quality of the detected box.
As shown in Table 2, the proposed MSO 2 -Det that combines SASL and LADS achieved a total mAP improvement of 5.77% over the baseline model, pushing the mAP to 96.33%, which illustrates that these two methods are complementary and can effectively improve the detection performance. Meanwhile, Figure 6 shows some detection results from the baseline model (first row) and the full implementation of the proposed MSO 2 -Det (second row). The green, red, and yellow boxes indicate true positives (TPs), false positives (FPs), and false negatives (FNs), respectively. We can see that the addition of SASL and LADS effectively decreases the number of FPs and FNs and improves the recall and precision. Moreover, we recorded the PRCs for car and plane objects on the UCAS-AOD data set with the four implementation models (baseline, MSO 2 -Det w/o SASL, MSO 2 -Det w/o LADS, and MSO 2 -Det) in Figure 7 and concluded that the fully implemented MSO 2 -Det outperformed the other three models in terms of AP by a large margin, which further proved the effectiveness of our SASL and LADS. Figure 8 shows the curves of the validation mAP and losses obtained by MSO 2 -Det and MSO 2 -Det without SASL over 150 training epochs. It can be seen that MSO 2 -Det with the SASL component yielded a higher validation mAP and a smaller loss and converged faster, which demonstrated that SASL played an active role in speeding up the convergence of the network and improving the detection accuracy.

Analysis of Hyper-Parameters
In this section, we performed a sequence of comparison experiments with the proposed MSO 2 -Det on the HRSC2016 data set to analyze the effect of the key parameters.

Smooth Radius of SASL
The smooth radius R of the SASL is a crucial parameter, as discussed in Algorithm A2. It reflects the maximum error tolerance of the angular-sector classification, so it is vital to determine its optimal range for MSO 2 -Det. In the experiments reported in Table 3a, the value of α in LADS was fixed at 0.4. First, when R was zero, SASL degenerated to the original one-hot label, and the F1-score and AP of MSO 2 -Det only reached 0.8875 and 0.8956, respectively. Then, as R increased, the F1-score and mAP of MSO 2 -Det on the HRSC2016 data set gradually increased until R = 5, which further verified the effectiveness of the smooth radius. However, when R was increased further, the tolerance of the angular-sector error became excessive, and the performance degraded. Taking the above into consideration, the smooth radius R was set to five in our method.
When the combination of localization and classification scores is used as the detection confidence, the trade-off between these two scores determines the relative importance of the classification and localization tasks. To test the influence of the trade-off factor α on our method, we first set the smooth radius R to five based on the analysis above and then explored different α values in Table 3b. First, if we only considered the localization confidence, i.e., α = 0, the detector suffered a considerable performance degradation (an mAP of only 0.7853) because the localization score does not contain any category information. Similarly, a detection score that only takes the classification confidence into account lacks localization information. Then, by gradually increasing the value of α from zero to one, we found that when α equaled 0.6, the F1-score and mAP achieved the highest values of 0.8854 and 0.8744, respectively. These experimental results demonstrated that this pattern of information fusion effectively improved the detection performance.

Numbers of Scale and Angular Sectors
To find suitable settings for the number of scale sectors N and the number of angular sectors M in our method, we conducted parameter optimization experiments; the results are shown in Table 4. First, the number of angular sectors was set to 45, 90, and 180, which means that the angular space was divided into 45, 90, and 180 sectors, each spanning 2°, 1°, and 0.5°, respectively. Then, the number of scale sectors was increased from two to six. We listed all the combined results and found that when N and M were extremely small or large, the performance dropped sharply (as in (M = 180, N = 6) or (M = 45, N = 2)) because these settings destroyed the balance between the classification and regression tasks. For example, when M equaled 180, the demand for classification accuracy increased, making it more difficult for the CNN to learn and converge. Under this consideration, the number of scale sectors N and the number of angular sectors M were set to five and ninety, respectively, for optimal detection performance.
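The relation between the number of angular sectors and the sector width, and the resulting split of an angle into a coarse sector index plus a fine in-sector offset, can be sketched as below. The 90° angular range is inferred from the stated sector widths (45 sectors of 2°), and the offset normalization to [0, 1) is an assumption; the authoritative definitions are those in (4) and (5). All function names are hypothetical.

```python
def sector_width(num_sectors, angle_range_deg=90.0):
    """Width of one angular sector in degrees."""
    return angle_range_deg / num_sectors


def encode_angle(theta_deg, num_sectors, angle_range_deg=90.0):
    """Split an angle into (sector index, normalized offset inside the sector)."""
    width = sector_width(num_sectors, angle_range_deg)
    index = min(int(theta_deg // width), num_sectors - 1)  # classification target
    offset = theta_deg / width - index                     # regression target
    return index, offset


def decode_angle(index, offset, num_sectors, angle_range_deg=90.0):
    """Recover the angle from the selected sector and its regressed offset."""
    return (index + offset) * sector_width(num_sectors, angle_range_deg)
```

A larger M shrinks each sector, which makes the regression easier but the classification harder; this is the balance the experiment in Table 4 explores.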

Comparison with State-of-the-Art Detectors
We compared the performance of the proposed MSO 2 -Det with the state-of-the-art oriented detectors on three data sets: DOTA [49], UCAS-AOD [51], and HRSC2016 [50].

DOTA
To comprehensively verify the superiority of our method, we performed a series of precision and speed comparison experiments on the DOTA data set. First, we compared the AP in 15 categories of objects and the mAP value of fifteen deep learning-based methods. All models listed in Table 5 adopted ResNet-101-FPN as the backbone network, except that RRPN [55], R 2 CNN, and O 2 -DNet adopted VGG-16 and ResNet101, respectively. Note that data augmentation was applied for a fair comparison with all the compared methods. In terms of the mAP values over fifteen categories of remote sensing targets, six of the compared detectors had mAP values over 70%, and the proposed MSO 2 -Det achieved an mAP of 76.63%, which outperformed these top six detectors, i.e., R 3 Det, O 2 -DNet, SCRDet, Gliding Vertex, WPSGA-Net, and OPLD, by 4.94%, 5.51%, 4.02%, 1.51%, 0.60%, and 0.20%, respectively. In addition, the AP values of small-scale and densely arranged objects (e.g., plane and storage tank), large aspect ratio objects (e.g., ship and harbor), and easily confused objects (e.g., baseball diamond and ground track field) with MSO 2 -Det were higher than those of all compared methods, which demonstrated the superiority of our method for remote sensing object detection. Figure 9 displays the mAP-IoU curves of our model and the other four anchor-free models. Note that a higher IoU threshold represents more accurate detection results. It can be seen that the mAPs of our model were always higher than those of the other four anchor-free models, which indicated that our model was more accurate in the ORSI object detection task. Moreover, as indicated in Table 6, we compared the speed, accuracy, and model parameters with the other four anchor-free methods and four anchor-based models. Note that the computational burden of post-processing is also included.
Our method achieved the highest accuracy of 76.63% while maintaining a speed of 7.67 FPS with 218.5 MB of parameters, which was faster and more lightweight than all compared anchor-based models and most anchor-free models except for O 2 -DNet and TOSO. The experimental results indicated that our model was relatively efficient and lightweight, although it was still heavier than some state-of-the-art anchor-free methods due to the additional stages of sector processing. The visualization results on the DOTA data set are shown in Figure 10.

UCAS-AOD
In addition, we evaluated the proposed method on the UCAS-AOD data set and compared it with several advanced oriented object detectors, namely R-DFPN [52], S 2 ARN [53], RetinaNet-H [54], ICN [28], R 3 Det, and WPSGA-Net [9], as shown in Table 2. We can see that our method achieved state-of-the-art performance, and the detection accuracy AP of the small car exceeded that of other compared detectors, which indicated that our method was robust to densely arranged ORSI objects.

HRSC2016
To test the performance of our MSO 2 -Det, we compared it with eleven ship detectors, which included several state-of-the-art methods such as RoI Transformer [29], R 3 Det [54], Gliding Vertex [58], and GRS-Det [37]. The comparison results are reported in Table 7. We can see that MSO 2 -Det outperformed all compared methods in terms of mAP. Compared with the methods that adopted data augmentation and resized the input image to 800 × 800 (R 2 PN [30], RetinaNet-H [54], R 3 Det [54], and GRS-Det [37]), MSO 2 -Det outperformed them by 10.61%, 7.32%, 1.07%, and 0.64%, respectively, which indicated the superiority of our method in the ship detection task. The detection results on HRSC2016 are visualized in Figure 11. It can be noticed that the ship with a large aspect ratio, which increased the difficulty of network convergence, can be detected well, and the detected box gave a compact peripheral outline of the ship.

Conclusions
In this paper, we abandoned the anchor mechanism and the direct regression paradigm and proposed MSO 2 -Det, which tackles the prediction of bounding box scale and orientation via a successive coarse-granularity classification to fine-grained regression strategy in the discrete scale and angular sector space. Furthermore, we designed a smooth angular-sector label to speed up the network's convergence and improve the detection performance. In addition, to obtain a more accurate detection confidence, we adopted a localization-aided detection score that combines the category-classification score with the localization sector-selection score. Extensive experimental results and ablation studies on the DOTA, UCAS-AOD, and HRSC2016 data sets demonstrated the effectiveness of our method for arbitrarily oriented object detection in optical remote sensing images. In future work, we will design a more lightweight and efficient backbone network to improve the real-time performance of the detector for oriented targets in optical remote sensing images.