Mask OBB: A Semantic Attention-Based Mask Oriented Bounding Box Representation for Multi-Category Object Detection in Aerial Images

Object detection in aerial images is a fundamental yet challenging task in the remote sensing field. As most objects in aerial images are in arbitrary orientations, oriented bounding boxes (OBBs) have a clear advantage over traditional horizontal bounding boxes (HBBs). However, regression-based OBB detection methods always suffer from ambiguity in the definition of learning targets, which decreases the detection accuracy. In this paper, we provide a comprehensive analysis of OBB representations and cast the OBB regression as a pixel-level classification problem, which can largely eliminate the ambiguity. The predicted masks are subsequently used to generate OBBs. To handle huge scale changes of objects in aerial images, an Inception Lateral Connection Network (ILCN) is utilized to enhance the Feature Pyramid Network (FPN). Furthermore, a Semantic Attention Network (SAN) is adopted to provide the semantic feature, which can help distinguish the object of interest from the cluttered background effectively. Empirical studies show that the entire method is simple yet efficient. Experimental results on two widely used datasets, i


Introduction
It is a fundamental problem in Earth Vision to achieve accurate and robust object detection in aerial images, which is very challenging due to four issues:
• arbitrary orientations: unlike natural images in which objects are generally oriented upward, objects in aerial images often appear with arbitrary orientations since aerial images are typically taken with a bird's-eye view [1,2].
• densely packed objects: it is hard to separate small crowded objects like vehicles in parking lots [3].
• huge scale variations: scale changes of objects in aerial images captured with various platforms and sensors are usually huge [2,4].
• cluttered background: the background in aerial images is cluttered and normally contains a large number of uninteresting objects [5].
To tackle these issues, we need a robust object detection method for aerial images which is resilient to the aforementioned appearance variations.
With the development of deep learning technology, modern generic object detection methods based on a horizontal bounding box (HBB) have achieved great success in natural scenes. They can be organized into two main categories: two-stage and single-stage detectors. Two-stage detectors were first introduced with the Region-based Convolutional Neural Network (R-CNN) [6]. R-CNN generates object proposals by Selective Search [7], then classifies and refines the proposal regions by a Convolutional Neural Network (CNN). To eliminate the duplicated computation in R-CNN, Fast R-CNN [8] extracts the feature of the whole image once, then generates region features through Region of Interest (RoI) Pooling. Faster R-CNN [9] introduces a Region Proposal Network (RPN) to generate the region proposals efficiently. Some researchers further extend the work of Faster R-CNN for better performance, like Region-based Fully Convolutional Network (R-FCN) [10], Deformable R-FCN [11], Light Head R-CNN [12], Scale Normalization for Image Pyramids (SNIP) [13], SNIP with Efficient Resampling (SNIPER) [14], etc. Unlike two-stage detectors, single-stage detectors directly estimate class probabilities and bounding box offsets with a single CNN, like You Only Look Once (YOLO) [15][16][17], Single Shot Multibox Detector (SSD) [18] and RetinaNet [19]. Compared with two-stage detectors, single-stage detectors are much simpler and more efficient, because there is no need to produce region proposals.
Similarly, object detection methods based on HBB are widely used for object detection in aerial images. Han et al. [20] propose R-P-Faster R-CNN for detecting small objects in aerial images. Xu et al. [21] use Deformable Convolutional Network (DCN) [11] to address geometric modeling in aerial image object detection and propose an Aspect Ratio Constrained Non Maximum Suppression (arcNMS) to reduce the increase of false region proposals. Guo et al. [22] propose a multi-scale CNN and multi-scale object proposal network for geospatial object detection in high resolution satellite images. Li et al. [23] propose a hierarchical selective filtering network (HSF-Net) to detect ships with various scales efficiently. Pang et al. [24] design a Tiny-Net backbone and a global attention block to detect tiny objects in large-scale aerial images. Dong et al. [25] propose the Sig-NMS to replace traditional NMS for improving the detection accuracy of small objects. These methods greatly promote the development of object detection in the remote sensing field.
However, HBBs are not suitable for describing objects in aerial images, since these objects often have arbitrary orientations. To deal with this challenge, instead of using HBBs, some datasets [2,26-29] use oriented bounding boxes (OBBs) to annotate objects in aerial images. OBBs can not only compactly enclose oriented objects, but also retain the orientation information, which is very useful for further processing. Many works [1,2,30-32] handle this problem as a regression task and directly regress oriented bounding boxes. We call them regression-based methods. For instance, DRBox [33] redesigns the SSD [18] to regress oriented bounding boxes by multi-angle prior oriented bounding boxes. Xia et al. [2] propose the FR-O, which regresses the offsets of OBBs relative to HBBs. ICN [30] combines an image cascade and a feature pyramid network to extract features for regressing the offsets of OBBs relative to HBBs. Ma et al. [34] design a Rotation Region Proposal Network (RRPN) to generate prior proposals with the object orientation information, and then regress the offsets of OBBs relative to oriented proposals. R-DFPN [35] adopts RRPN and puts forward the Dense Feature Pyramid Network to solve the narrow width problems of ships. Ding et al. [1] design a rotated RoI learner to transform horizontal RoIs to oriented RoIs by a supervised method. All these regression-based methods can be summarized as regressing the offsets of OBBs relative to HBBs or OBBs, and they rely on an accurate representation of the OBB.
Nevertheless, regression-based methods encounter the problem of ambiguity in the regression target. For example, an OBB represented as {(x_i, y_i) | i = 1, 2, 3, 4} (point-based OBB) has four different representations if we just change the order of the vertexes. To make the OBB representation unique, some tricks, such as defining the first vertex by a certain rule [2], are used to alleviate the ambiguity problem. The ambiguity is still unsolved, because two similar region features may have obviously different OBB representations. For instance, Figure 1e,f show the ambiguity problem of regression-based OBB intuitively. When the angle of the OBB is near π/4 or 3π/4, the ambiguity of adjacent points is the most serious. Specifically, by the definition in [2], the OBB in Figure 1e can be represented as (x1, y1, x2, y2, x3, y3, x4, y4) (point-based OBB), but in Figure 1f, which is very similar to Figure 1e, the OBB needs to be represented as (x2, y2, x3, y3, x4, y4, x1, y1) (point-based OBB). Although the OBBs (θ-based OBBs, point-based OBBs, and h-based OBBs) in Figure 1e,f are completely different, they have similar feature maps. Due to the ambiguity, the training is hard to converge, and the mAP of the HBB task is often much higher than that of the OBB task even with the same backbone network. In this paper, we give an experimental analysis of different regression-based methods (see Figure 1a-c).
In order to eliminate the ambiguity, we represent an object region as a binary segmentation map and treat the problem of detecting an oriented object as pixel-level classification for each proposal. Then, the oriented bounding boxes are generated from the predicted masks by post-processing, and we call this kind of OBB representation a mask-oriented bounding box representation (Mask OBB). By using Mask OBB, the convergence is faster and the gap in mAP between the HBB task and the OBB task is greatly reduced compared with regression-based methods. As shown in Figure 1d, detection results of Mask OBB are better than those of regression-based OBB representation methods.
Despite their relevance, segmentation-based oriented object detection methods in remote sensing have been poorly exploited when compared with regression-based methods. There are just some segmentation-based methods in the field of oriented text detection. For instance, Ref. [36] presents Fused Text Segmentation Networks to detect and segment the text instance simultaneously. Ref. [37] finishes the detection and recognition task on the mask branch by predicting word segmentation maps. These segmentation-based methods are restricted to single-category (text) object detection, while there are many different categories to discern in aerial images, as in the DOTA dataset [2]. Our proposed segmentation-based method Mask OBB can handle multi-category oriented object detection in aerial images. It is based on the instance segmentation framework Mask R-CNN [38], which adds a mask branch to Faster R-CNN to obtain pixel-level segmentation predictions. To the best of our knowledge, this work is the first multi-category segmentation-based oriented object detection method in the remote sensing field.
Besides the arbitrary orientation problem, the huge scale changes of objects in aerial images are also challenging. Some works [30,32,35] use a Feature Pyramid Network (FPN) [39] to handle the scale problem by fusing low-level and high-level features. In this paper, we design an Inception Lateral Connection Network (ILCN) to further enhance the FPN for solving the scale change problem. Unlike the original FPN, we use the inception structure [40][41][42][43] instead of one 1 × 1 convolutional layer as the lateral connection. In the ILCN, besides the original 1 × 1 convolutional layer, three additional layers with different receptive fields are added. We call this enhanced FPN an Inception Lateral Connection Feature Pyramid Network (ILC-FPN). Experimental results show that ILC-FPN can handle large scale variations in aerial images efficiently.
In addition, in aerial images, the background is cluttered and normally contains a large number of uninteresting objects. To distinguish interesting objects from the cluttered background, the attention mechanism, which has proven to be promising in many vision applications such as image classification [44][45][46] and general object detection [47,48], is used in some aerial image object detection works [49][50][51]. Specifically, inspired by [45,49,52] proposes a Feature Attention FPN (FA-FPN) which contains channel-wise attention and pixel-wise attention to effectively capture the foreground information and restrain the background in aerial images. CAD-Net [50] designs a spatial-and-scale-aware attention module to guide the network to focus on more informative regions and features as well as more appropriate feature scales in aerial images. Chen et al. [51] propose a multi-scale spatial and channel-wise attention (MSCA) mechanism to make the network pay more attention to objects in aerial images, as human vision does. Unlike the aforementioned attention modules, which are unsupervised, we use the semantic segmentation map converted from oriented bounding boxes as the target of the semantic segmentation network and design a Semantic Attention Network (SAN) to learn semantic features for predicting HBBs and OBBs efficiently.
Overall, our complete model can achieve 75.33% and 76.98% mAP on the OBB task and HBB task of the DOTA dataset, respectively. At the same time, it achieves 96.70% mAP on the OBB task of the HRSC2016 dataset. The main contributions of this paper can be summarized as follows:

• We address the influence of ambiguity of regression-based OBB representation methods for oriented bounding box detection, and propose a mask-oriented bounding box representation (Mask OBB). As far as we know, we are the first to treat multi-category oriented object detection in aerial images as a problem of pixel-level classification. Extensive experiments demonstrate its state-of-the-art performance on both DOTA and HRSC2016 datasets.
• We propose an Inception Lateral Connection Feature Pyramid Network (ILC-FPN), which can provide better features to handle huge scale changes of objects in aerial images.
• We design a Semantic Attention Network (SAN) to distinguish interesting objects from cluttered background by providing semantic features when predicting HBBs and OBBs.
The rest of the paper is organized as follows: Section 2 presents the proposed method, including Mask OBB, ILC-FPN, SAN, and the overall network architecture. Then we give the experimental results in Section 3. Finally, we discuss the model settings in Section 4 and draw the conclusions in Section 5.
Although these representations ensure the uniqueness of the OBB's definition with some rules, they still admit extreme conditions. In these conditions, a tiny change of the OBB angle results in a large change of the OBB representation. We denote the angle values in these conditions as discontinuity points. For oriented object detectors, similar features extracted by the detector at close positions are supposed to generate similar position representations. However, the OBB representations of these similar features differ greatly near discontinuity points. This forces the detector to learn totally different position representations for similar features, which impedes the training process and clearly deteriorates the detector's performance.
Specifically, for the point-based OBB representation, to ensure the uniqueness of the OBB definition, Xia et al. [2] choose the vertex closest to the "top left" vertex of the corresponding HBB as the first vertex. Then the other vertexes are fixed in clockwise order, so we get a unique representation of the OBB. Nevertheless, this mode still allows a discontinuity point, as illustrated in Figure 1e,f. When l1 on the horizontal bounding box is shorter than l2, the OBB is represented with R1 = (x1, y1, x2, y2, x3, y3, x4, y4) (point-based OBB), as Figure 1e shows. Otherwise, the OBB is represented with R2 = (x4, y4, x1, y1, x2, y2, x3, y3) (point-based OBB), as Figure 1f shows. When the length of l1 increases with θ, until θ approaches and surpasses π/4, the OBB representation jumps from R1 to R2, and vice versa. Hence π/4 is a discontinuity point in this mode.
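The jump at π/4 can be reproduced numerically. The following is a minimal sketch, assuming image coordinates (y pointing down), a rectangle centered at the origin, and the first-vertex rule of [2] as described above; the helper names `rect_corners`, `first_vertex_index` and `point_based_obb` are hypothetical and not taken from any released code.

```python
import math

def rect_corners(cx, cy, w, h, theta):
    """Corners of a rotated rectangle in a fixed clockwise order (indices 0..3)."""
    c, s = math.cos(theta), math.sin(theta)
    local = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in local]

def first_vertex_index(corners):
    """Index of the corner closest to the top-left corner of the enclosing HBB."""
    x_min = min(x for x, _ in corners)
    y_min = min(y for _, y in corners)
    dists = [math.hypot(x - x_min, y - y_min) for x, y in corners]
    return dists.index(min(dists))

def point_based_obb(corners):
    """Rotate the vertex list so that it starts at the chosen first vertex."""
    k = first_vertex_index(corners)
    return corners[k:] + corners[:k]

# Two boxes that differ by only 2 degrees, on either side of pi/4.
below = rect_corners(0, 0, 4, 2, math.radians(44))
above = rect_corners(0, 0, 4, 2, math.radians(46))
rep_below, rep_above = point_based_obb(below), point_based_obb(above)

# Largest change of any regression target coordinate between the two encodings.
jump = max(abs(p - q) for a, b in zip(rep_below, rep_above) for p, q in zip(a, b))
print(first_vertex_index(below), first_vertex_index(above), round(jump, 2))
```

In this sketch the first vertex switches from corner 0 to corner 3 as θ crosses π/4, so the encoded vector is cyclically shifted even though the two boxes almost coincide.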

Mask OBB Representation
To handle the ambiguity problem, we represent an oriented object as a binary segmentation map, which ensures uniqueness naturally, and the problem of detecting an OBB can be treated as pixel-level classification for each proposal. Then the OBBs are generated from the predicted masks by post-processing, and we call this kind of OBB representation a mask-oriented bounding box representation (Mask OBB). Under this representation, there is neither a discontinuity point nor an ambiguity problem.
Furthermore, aerial image datasets like DOTA [2] and HRSC2016 [26] give regression-based oriented bounding boxes as ground truth. Specifically, DOTA and HRSC2016 use point-based OBBs {(x_i, y_i) | i = 1, 2, 3, 4} [2] and θ-based OBBs {(cx, cy, h, w, θ)} [26] as the ground truth, respectively. However, pixel-level classification requires pixel-level annotations. To handle this problem, pixel-level annotations are converted from the original OBB ground truths. Specifically, pixels inside OBBs are labeled as positive and pixels outside are labeled as negative, which yields the pixel-level annotations used as the pixel-level classification ground truth. In the inference stage, the predicted Mask OBB needs to be converted to a point-based OBB and a θ-based OBB for performance evaluation on the DOTA and HRSC2016 datasets, respectively. We calculate the minimum area oriented bounding box of the predicted segmentation map by the Topological Structural Analysis Algorithm [54]. The minimum area oriented bounding box has the same representation as the θ-based OBB, which can be directly used on the HRSC2016 dataset for calculating mAP. For DOTA, the four vertexes of the minimum area oriented bounding box can be used for evaluating performance.
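The two conversions described above (OBB to pixel-level annotation, and predicted mask back to an OBB) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: pixels are tested at their centers, and instead of the Topological Structural Analysis Algorithm [54] the sketch swaps in a convex hull of the foreground pixel centers followed by a search over hull-edge directions for the minimum-area rectangle. All function names are hypothetical.

```python
import math

def obb_corners(cx, cy, w, h, theta):
    """Four corners of a theta-based OBB in cyclic order."""
    c, s = math.cos(theta), math.sin(theta)
    local = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in local]

def inside_convex(px, py, poly):
    """True if (px, py) lies inside or on a convex polygon given in cyclic order."""
    signs = []
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        signs.append((x2 - x1) * (py - y1) - (y2 - y1) * (px - x1))
    return all(v >= 0 for v in signs) or all(v <= 0 for v in signs)

def obb_to_mask(poly, height, width):
    """Label pixels whose centers fall inside the OBB as 1 and all others as 0."""
    return [[1 if inside_convex(x + 0.5, y + 0.5, poly) else 0
             for x in range(width)] for y in range(height)]

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in cyclic order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def min_area_rect(points):
    """Minimum-area enclosing rectangle as (cx, cy, w, h, theta), theta in radians."""
    hull = convex_hull(points)
    best = None
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        theta = math.atan2(y2 - y1, x2 - x1)
        c, s = math.cos(theta), math.sin(theta)
        us = [x * c + y * s for x, y in hull]    # coordinates along the edge
        vs = [-x * s + y * c for x, y in hull]   # coordinates across the edge
        w, h = max(us) - min(us), max(vs) - min(vs)
        if best is None or w * h < best[0]:
            uc, vc = (max(us) + min(us)) / 2, (max(vs) + min(vs)) / 2
            best = (w * h, uc * c - vc * s, uc * s + vc * c, w, h, theta)
    return best[1:]

def mask_to_obb(mask):
    """Recover a theta-based OBB from the foreground pixels of a binary mask."""
    pts = [(x + 0.5, y + 0.5)
           for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    return min_area_rect(pts)

gt_poly = obb_corners(32, 32, 30, 10, math.radians(30))
mask = obb_to_mask(gt_poly, 64, 64)
print(mask_to_obb(mask))
```

Because the mask is the same for every vertex ordering of the ground-truth OBB, no first-vertex rule is needed at this step; the recovered angle is only defined modulo 90 degrees, with width and height swapping accordingly.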

Overall Pipeline
The overall architecture of our method is illustrated in Figure 3. Our network is a two-stage method based on Mask R-CNN [55], which is known as an instance segmentation framework. In the first stage, a number of region proposals are generated by a Region Proposal Network (RPN) [9]. In the second stage, after the RoI Align [55] operation for each proposal, aligned features extracted from ILC-FPN features and SAN's semantic features are fed into the HBB branch and OBB branch to generate the HBBs and instance masks. Finally, the OBBs are obtained by the OBB branch based on the predicted instance masks. In this work, we apply the ILC-FPN, which will be detailed in Section 2.4, with ResNet [56] as the backbone. Each level of the pyramid can be used for detecting objects at a different scale. We denote the outputs of conv2, conv3, conv4, and conv5 of ResNet as {C2, C3, C4, C5}, and call the final feature map set of ILC-FPN {P2, P3, P4, P5, P6}. Note that {P2, P3, P4, P5, P6} have strides of {4, 8, 16, 32, 64} pixels with respect to the input image.
The Region Proposal Network (RPN) [9] is used to generate region proposals for the second stage on the outputs of ILC-FPN. Following [39], we assign anchors of a single scale to each level. Specifically, we set five anchors with areas of {32², 64², 128², 256², 512²} pixels on the five levels {P2, P3, P4, P5, P6}, respectively. Different anchor aspect ratios {1:2, 1:1, 2:1} are also adopted at each level. Thus, in total, there are 15 anchors over the pyramid. Note that no special design for objects in aerial images is adopted in the RPN.
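The anchor configuration above can be enumerated directly. The following is a minimal sketch, assuming that each ratio is read as height:width and that every anchor keeps the area assigned to its level; the function name `make_anchors` is hypothetical.

```python
import math

def make_anchors(sizes=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return {level: [(width, height), ...]} with one scale and three ratios per level."""
    levels = ("P2", "P3", "P4", "P5", "P6")
    anchors = {}
    for level, size in zip(levels, sizes):
        area = float(size * size)
        shapes = []
        for ratio in ratios:           # ratio = height / width (assumed convention)
            width = math.sqrt(area / ratio)
            height = width * ratio     # keeps width * height equal to the level's area
            shapes.append((width, height))
        anchors[level] = shapes
    return anchors

anchors = make_anchors()
print(sum(len(shapes) for shapes in anchors.values()))  # 5 levels x 3 ratios
```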
RoI Align [55] is adopted to extract the region features of the proposals produced by the RPN from the outputs of ILC-FPN and SAN. Compared with RoI Pooling [9], RoI Align preserves more accurate location information, which is quite beneficial for the segmentation task in the OBB branch. Through RoI Align, all proposals are resized to 7 × 7 for the HBB branch and 14 × 14 for the OBB branch. The HBB branch aims to regress HBBs and classify objects. The OBB branch predicts a 28 × 28 mask from each proposal by four convolutional layers and one deconvolutional layer. In the training stage, the OBB branch just generates masks to calculate the OBB branch loss, and in the inference stage, the OBB branch generates the OBB set {(cx, cy, h, w, θ)} from the predicted masks by post-processing.

Inception Lateral Connection Feature Pyramid Network
Objects in aerial images are very complicated, and relative scales vary greatly between different categories. For example, the size of a ground track field is about 1500 times the size of a small vehicle in DOTA. Even within the same category, such as ship, the sizes range from about 10 × 10 to 400 × 400 pixels in different images. The huge scale gap leads to poor performance of normal object detection methods. To handle this, we need a strong feature extraction network to enhance the backbone.
In a convolutional neural network, low-level features lack semantic information but have accurate location information. On the contrary, high-level features have rich semantic information but relatively rough location information. Making full use of low-level and high-level features can handle the scale problem to a certain extent. FPN [39] is an effective method to fuse low-level and high-level features via the top-down pathway and lateral connections. However, the original FPN [39] simply utilizes one 1 × 1 convolutional layer as the lateral connection to fuse features C_i and P_{i+1}. This fusion strategy cannot effectively handle the features of very large objects.
In order to solve this problem, we design an Inception Lateral Connection Network (ILCN), which uses the inception structure to enhance feature propagation. Figure 4 shows the architecture of ILCN. Based on ILCN, we design an enhanced FPN, and we call it the Inception Lateral Connection Feature Pyramid Network (ILC-FPN). Unlike the original FPN, which uses one 1 × 1 convolutional layer as the lateral connection, we use the inception structure [42] as the lateral connection. Besides the original 1 × 1 convolutional layer, three additional layers are added in the lateral connection. As shown in Figure 4, these extra layers include a 5 × 5 convolutional layer, a 3 × 3 convolutional layer, and a max pooling layer. The output of the new lateral connection is a concatenation of these layers' outputs. Thanks to the different convolution kernel sizes in the ILC-FPN, the same-level feature in FPN can better handle large scale variations. The experimental results show that ILC-FPN can significantly improve the detection performance due to a better feature fusion strategy.

Semantic Attention Network
To further help the model extract the interesting objects from the cluttered background in aerial images, we design a Semantic Attention Network (SAN) to extract the semantic feature of the whole image. Note that the RPN, the Semantic Attention Network, the OBB branch and the HBB branch are jointly trained end-to-end.
In the Semantic Attention Network, the key component is the feature extraction module, which is called the Semantic Feature Extraction (SFE) module. To generate the semantic feature from the outputs of ILC-FPN, we design a simple architecture to incorporate higher-level feature maps with context information and lower-level feature maps with location information for better feature representation. Figure 5 illustrates the details of SFE. For the output of ILC-FPN, the feature maps of the different levels are upsampled or downsampled to the same spatial scale. The result is a set of feature maps at the same scale, which are then element-wise summed. Four 3 × 3 convolutions are then used to obtain the output of SFE, and one 1 × 1 convolution is applied to the SFE's output to generate the class label map and the semantic feature map. The class label map is used to calculate the semantic segmentation loss, and the semantic feature map is fused with the HBB branch feature map and the OBB branch feature map. Specifically, given a proposal from the RPN, we use RoI Align to extract a feature patch (e.g., 7 × 7 for the HBB branch and 14 × 14 for the OBB branch) from the corresponding level of the ILC-FPN outputs as the region feature. At the same time, we also apply RoI Align on the semantic feature map to obtain another feature patch of the same shape as the region feature, and then combine the two feature patches by element-wise sum. The result is treated as the new region feature for the HBB branch and the OBB branch. Experimental results demonstrate that SAN can improve the detection performance on the OBB task and the HBB task.
In addition, aerial image datasets have no semantic segmentation ground truth. To calculate the semantic segmentation loss, we generate the semantic segmentation ground truth from the OBB ground truth. Specifically, over the whole image, pixels inside an OBB are labeled with the class of that OBB and pixels outside are labeled as background. Figure 6 shows the OBB ground truth and the corresponding semantic segmentation ground truth.

Multi-Task Learning
Multi-task learning has been proved to benefit the performance of multiple tasks [55]. It enables the network to learn HBBs, OBBs and semantic segmentation at the same time from the same input image. The overall loss function takes the form of multi-task learning:

$$L = L_{RPN} + \alpha_1 L_{CLS} + \alpha_2 L_{HBB} + \alpha_3 L_{OBB} + \alpha_4 L_{SEG}$$

where $L_{RPN}$ is the region proposal network loss, $L_{CLS}$ is the classification loss, $L_{HBB}$ is the horizontal bounding box loss, $L_{OBB}$ is the oriented bounding box loss, and $L_{SEG}$ is the semantic segmentation loss. Specifically, $L_{RPN}$, $L_{CLS}$ and $L_{HBB}$ are the same as in Faster R-CNN [9], $L_{OBB}$ is the same as the loss of the instance segmentation branch of Mask R-CNN [55] in the training stage, and $L_{SEG}$ is computed as the per-pixel cross entropy loss between the predicted and the ground truth labels. $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$ are the weights of these sub-losses. The ground truth of the OBB branch is described in Section 2.2. We calculate the OBB's horizontal bounding box as the ground truth of the RPN and the HBB branch. The ground truth of the semantic segmentation branch is described in Section 2.5.
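As an illustration of how the sub-losses are combined, the sketch below assumes that the overall loss is a weighted sum in which α1 to α4 weight the classification, HBB, OBB and segmentation terms and the RPN term is unweighted. It also shows the per-pixel cross entropy used for the segmentation term. The function names are hypothetical and the default weights are placeholders, not the values used in the experiments.

```python
import math

def pixel_cross_entropy(logits, labels):
    """Mean per-pixel cross entropy.

    logits[y][x] is a list of class scores, labels[y][x] is the true class index.
    """
    total, count = 0.0, 0
    for row_logits, row_labels in zip(logits, labels):
        for scores, label in zip(row_logits, row_labels):
            m = max(scores)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(v - m) for v in scores))
            total += log_sum_exp - scores[label]
            count += 1
    return total / count

def total_loss(l_rpn, l_cls, l_hbb, l_obb, l_seg, alphas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five sub-losses (placeholder weights, see lead-in)."""
    a1, a2, a3, a4 = alphas
    return l_rpn + a1 * l_cls + a2 * l_hbb + a3 * l_obb + a4 * l_seg

# A 1 x 2 "image" with two classes and uninformative predictions.
l_seg = pixel_cross_entropy([[[0.0, 0.0], [0.0, 0.0]]], [[0, 1]])
print(l_seg, total_loss(0.1, 0.2, 0.3, 0.4, l_seg))
```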

Experiments
In this section, we describe the implementation of the proposed method in detail and compare its performance with state-of-the-art methods on the DOTA [2] and HRSC2016 [26] datasets.

Datasets and Evaluation Metrics
DOTA Dataset
DOTA [2] is a dataset for multi-category object detection in aerial images. It contains 2806 images from different cameras and platforms. The image sizes vary from about 800 × 800 to 4000 × 4000 pixels. There are 15 object categories: baseball diamond (BD), ground track field (GTF), small vehicle (SV), large vehicle (LV), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), swimming pool (SP), helicopter (HC), bridge (BR), harbor (HA), ship (SH), plane (PL). Each object in this dataset is annotated with an arbitrary quadrilateral, which is the same as a point-based OBB. The training, validation and test sets include 1/2, 1/6 and 1/3 of the dataset, respectively. DOTA aims at two tasks, the Horizontal Bounding Box task (HBB task) and the Oriented Bounding Box task (OBB task), and provides an evaluation server. It is one of the largest and most challenging aerial image object detection datasets.

HRSC2016 Dataset
HRSC2016 [26] is a dataset for ship detection in aerial images. It contains 1061 images collected from Google Earth and has more than 20 categories of ships. The image size ranges from 300 × 300 to 1500 × 900 pixels. It consists of 436 training images, 181 validation images and 444 test images. Objects in HRSC2016 are annotated with θ-based OBBs.

Evaluation Metrics
For DOTA, we submit our results on the test set to the official evaluation server to obtain the mean Average Precision (mAP). For the HRSC2016 dataset, we report the standard VOC-style AP metric with an Intersection over Union (IoU) threshold of 0.5.
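For readers unfamiliar with the metric, a VOC-style AP for one class can be sketched as follows. The sketch assumes that each detection has already been marked as a true or false positive by matching it to ground truth at the 0.5 IoU threshold, and it uses the all-point interpolation variant of the VOC protocol (the text does not say whether this or the 11-point variant is used). The function name `average_precision` is hypothetical.

```python
def average_precision(scores, is_true_positive, num_gt):
    """Area under the interpolated precision-recall curve for one class."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:  # sweep detections from the highest to the lowest score
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Interpolation: precision at recall r is the best precision at any recall >= r.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap

# Three detections, two ground-truth objects: hit, miss, hit.
print(average_precision([0.9, 0.8, 0.7], [True, False, True], num_gt=2))
```

The mAP is then the mean of this value over all categories.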

Implementation Details
Our model is implemented with PyTorch [57].We use SGD with a weight decay of 0.0001 and momentum of 0.9 on 4 NVIDIA Titan Xp GPUs with a total of 8 images per mini-batch (2 images per GPU).We train 12 epochs in total with an initial learning rate of 0.01, and decrease it by a factor of 0.1 at epoch 9 and 11.The batch size of RPN and Fast R-CNN is set to 256 and 512 per image with a sample ratio 1 : 3 of positive to negatives.In the multi-task loss function, we set We use ResNet-50 [56] with FPN [39] as the backbone for all experiments, if not specified otherwise.All models are trained on the training and validation sets, then evaluated on the test set.
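The step schedule described above can be written out explicitly. This is a minimal sketch, assuming that epochs are numbered from 1 and that each decay takes effect from the start of epochs 9 and 11; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(9, 11), gamma=0.1):
    """Learning rate for a given 1-based epoch under a step decay schedule."""
    num_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** num_decays

for epoch in range(1, 13):  # 12 training epochs in total
    print(epoch, learning_rate(epoch))
```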
Multi-scale training and testing (MSTT) and data augmentation (Data Aug.) technologies are applied in final results of DOTA and HRSC2016 datasets when compared with state-of-the-art detectors in Section 3.4.
For the DOTA dataset, we use three scales {(1024, 1024), (896, 896), (768, 768)} to apply MSTT in both the training and inference stages. For data augmentation, we resize the original images at two scales (1.0 and 0.5) before dividing the images into patches. After resizing, we divide the resized images into 1024 × 1024 patches with an overlap of 200 pixels in both the training and inference stages. In addition, each image is randomly flipped with a probability of 0.5 and randomly rotated by an angle from the angle set {0°, 90°, 180°, 270°}.
For the HRSC2016 dataset, in the training stage, the long sides of the input images are resized to 1024 pixels and the short sides are randomly resized to a range of [800, 1024] pixels, and in the inference stage, we use three scales {(1280 × 1024), (1024 × 800), (800 × 600)} to do MSTT. For data augmentation, we first augment the data five times by randomly rotating the input images with an angle in the range [−90°, 90°] before training. Then, in the training stage, each image is randomly flipped with a probability of 0.5 and randomly rotated by an angle from the angle set {0°, 90°, 180°, 270°}.
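The DOTA patch division described above (1024 × 1024 patches with an overlap of 200) implies a stride of 824 pixels. The sketch below computes the patch origins along one axis. How the image border is handled is not stated in the text, so shifting the last patch to sit flush with the border is an assumption, as are the function names.

```python
def patch_origins(length, patch=1024, overlap=200):
    """Start offsets of patches along one image axis."""
    stride = patch - overlap
    if length <= patch:
        return [0]  # the image fits in a single patch along this axis
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)  # assumed: last patch is flush with the border
    return starts

def patch_grid(height, width, patch=1024, overlap=200):
    """Top-left (x, y) corners of all patches covering an image."""
    return [(x, y)
            for y in patch_origins(height, patch, overlap)
            for x in patch_origins(width, patch, overlap)]

print(patch_origins(4000))
print(len(patch_grid(4000, 4000)))
```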
In more detail, in the training stage, the proposal number of the RPN is set to 2000, the HBB branch runs on all these proposals, and the OBB branch runs only on positive proposals, which have an IoU overlap with a ground-truth bounding box of at least 0.5. In the inference stage, the proposal number of the RPN is also set to 2000. We run the HBB branch on these proposals, followed by Non Maximum Suppression (NMS) [58], and the OBB branch is then applied to the 500 horizontal bounding boxes with the highest scores. Finally, the oriented bounding boxes are generated from the predicted masks in the OBB branch by post-processing.
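The inference-time selection of boxes for the OBB branch can be sketched as greedy NMS followed by a top-k cut. This is a simplified, class-agnostic illustration: the NMS IoU threshold of 0.5 is a hypothetical value (the text does not give it), boxes are (x1, y1, x2, y2) tuples, and the function names are not taken from any released code.

```python
def iou(a, b):
    """IoU of two horizontal boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS; returns kept indices ordered by decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep

def select_for_obb_branch(boxes, scores, iou_thr=0.5, top_k=500):
    """Indices of the HBBs passed on to the OBB (mask) branch."""
    return nms(boxes, scores, iou_thr)[:top_k]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(select_for_obb_branch(boxes, scores))
```

Running the mask head only on the surviving boxes keeps the cost of the OBB branch bounded regardless of how many proposals the RPN emits.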

Comparison of Different OBB Representations
OBBs can be represented in a variety of ways, as shown in Figure 1. θ-based OBB, point-based OBB and h-based OBB are the most common representation methods. In this section, we first study the different "first vertex" definition methods, which affect the performance of point-based OBB and h-based OBB, in Table 1, and then we study the effect of different OBB representations in Table 2. For a fair comparison, we re-implement the above three bounding box representations on the same basic network structure as Mask OBB. Note that in our implementations, we use the box-encoding functions φ_θ(g; p), φ_point(g; p), φ_h(g; p) to encode the ground truth boxes of θ-based OBB, point-based OBB and h-based OBB with respect to their matching proposals generated by the RPN, which can be obtained by the following equations:
For the first vertex definition, we compare two different methods. One is the same as in [2], which chooses the vertex closest to the "top left" vertex of the corresponding HBB, and we call this method "best point". The other one is defined by ourselves: it chooses the "extreme top" vertex of the OBB as the first vertex, and the other vertexes are then fixed in clockwise order. We call this method "extreme point". As shown in Table 1, the "best point" method significantly outperforms the "extreme point" method on the OBB task of the DOTA dataset. This shows that different "first vertex" definition methods significantly affect the mAP of the OBB task. Thus, to obtain good performance on the OBB task with point-based OBB and h-based OBB representations, one has to design a special "first vertex" definition method which can represent the OBB uniquely.
For the different OBB representations, the gap between HBB and OBB performance is larger for the θ-based OBB, point-based OBB and h-based OBB representations than for Mask OBB. Theoretically, changing from the prediction of HBBs to OBBs should not affect the classification precision, but as shown in Table 2, the methods which use regression-based OBB representations have higher HBB task performance than OBB task performance. We argue that the reduction is due to low-quality localization, which is caused by the discontinuity point as discussed in Section 2.1. There should not be such a large gap between the performance on the HBB and OBB tasks if the representation of the OBB is well defined. The result of Mask OBB verifies this. In addition, the mAPs of Mask OBB on the HBB and OBB tasks are nearly all higher than those of the other three OBB representations in our implementations. For other implementations, FR-O [2] uses point-based OBB and gets 60.46% HBB mAP and 54.13% OBB mAP, and the gap is 6.33%. ICN [30] also uses point-based OBB and gets 72.45% HBB mAP and 68.16% OBB mAP, and the gap is 4.29%. SCRDet [59] uses θ-based OBB and gets 72.61% OBB mAP and 75.35% HBB mAP, and the gap is 2.74%. Li et al. [49] also use θ-based OBB and get 73.28% OBB mAP and 75.38% HBB mAP, and the gap is 2.10%. Note that the performances of ICN, SCRDet and Li et al. are obtained by using other modules and data augmentation technology. The gaps between the HBB task and the OBB task of these methods (6.33%, 4.29%, 2.74%, 2.10%) are all higher than that of Mask OBB (0.17%). Therefore, we can draw the conclusion that Mask OBB is a better representation for the oriented object detection problem.
Figure 7 shows some visualization results of our implementations using different OBB representation methods on the OBB task of the DOTA dataset. We can observe that the detection results are very poor when the angles of objects relative to HBBs are near π/4 or 3π/4 for θ-based OBB, point-based OBB and h-based OBB, whereas Mask OBB can compactly enclose oriented objects.

Comparison with State-of-the-Art Detectors
We compare the performance of our method with state-of-the-art methods on the OBB and HBB tasks of two datasets, DOTA and HRSC2016.

Results on DOTA Dataset
We compare our method with state-of-the-art methods on the OBB and HBB tasks of the DOTA dataset in Tables 3 and 4. Besides the official baseline given by DOTA, we also compare the proposed model with RRPN [34], R2CNN [53], R-DFPN [35], ICN [30], RoI Transformer [1], SCRDet [59] and Li et al. [49], which were introduced in Section 1. Note that some methods only report mAP for the OBB detection task. The results in Tables 3 and 4 are obtained by using Soft NMS [60], MSTT and Data Aug. With ResNet-50, our method achieves 74.86% and 75.98% mAP on the OBB task and HBB task of DOTA, respectively, and outperforms all compared methods, even those that use ResNet-101. In addition, with ResNeXt [61], our method achieves 75.33% and 76.98% mAP on the OBB task and HBB task of DOTA, respectively. We note that our method has only a small gap (1.65%) between its OBB task mAP and HBB task mAP. Our method outperforms all methods evaluated on this dataset. Figures 8 and 9 show some visualization results on the DOTA dataset.

Results on HRSC2016 Dataset
Table 5 compares our method with state-of-the-art methods on the HRSC2016 OBB detection task. Our full model achieves 96.70% mAP on the OBB detection task. It outperforms all other methods evaluated on this dataset, with an improvement of around 4.8 points in mAP. Some visualization results on the HRSC2016 dataset are displayed in Figure 10.

Ablation Study
To verify the effectiveness of our approach, we conduct a series of comparative experiments on the DOTA test set. Table 6 summarizes the results of our models with different settings on the DOTA dataset. The detailed comparison is given in the following.
Baseline setup. Mask R-CNN, extended to the oriented object detection task without any other components, is used as the baseline of the ablation experiments. To ensure the fairness and accuracy of the experiments, all experimental data and parameter settings are kept strictly consistent. We use ResNet-50-FPN as the backbone and mAP as the indicator of model performance. The mAP results on DOTA reported here are obtained by submitting results to the official DOTA evaluation server. In our implementation, the baseline achieves 69.97% and 70.14% mAP on the OBB task and HBB task, respectively.
Effect of ILC-FPN. As discussed in Section 2.4, the inception structure can help FPN to better handle large scale variations in aerial images. From the experimental results in Table 6, we can observe that ILC-FPN improves the detection performance by 1.12% on the OBB task and 1.47% on the HBB task, because the ILCN in ILC-FPN helps FPN to extract more discriminative features of objects in aerial images.
Effect of SAN. Table 6 shows that our Semantic Attention Network improves mAP by 0.72% on the OBB task and 0.88% on the HBB task compared with the baseline. Compared with baseline+ILC-FPN, it also improves mAP by 0.34% on the OBB task and 0.80% on the HBB task. This shows the importance of whole-image semantic features in aerial image detection.

Failure Cases
Figure 11 shows some failure cases. On DOTA, errors are most likely to occur on long, narrow objects, parts of which may be detected as separate objects. For instance, as shown in the second and third images of row 1, local parts of a harbor and a bridge are detected in some situations. Some failures are caused by objects with a large extent, such as large-vehicle and basketball court. However, this is not always the case, as the ship in the second image of row 2 shows, and some other failure cases are caused by huge objects such as the ground track field and soccer ball field shown in the third image of row 2.

Conclusions
In this paper, we analyzed the influence of different OBB representations on oriented object detection in aerial images, which exposed the shortcomings of typical regression-based OBB representations such as the θ-based, point-based and h-based OBB representations. Based on this analysis, the Mask OBB representation is proposed to tackle the ambiguity in regression-based OBB. In addition, we proposed the Inception Lateral Connection Feature Pyramid Network (ILC-FPN), which uses the Inception Lateral Connection Network (ILCN) to enhance the feature extraction ability of FPN for handling the scale variation problem in aerial images. Furthermore, we proposed the Semantic Attention Network (SAN) to extract semantic features that further enhance the features used for generating HBBs and OBBs. Experimental results on the DOTA and HRSC2016 datasets demonstrated the importance of representations in multi-category arbitrary-oriented object detection. Notably, our method achieves 75.33% and 76.98% mAP on the OBB task and HBB task of the DOTA dataset, respectively. It also achieves 96.70% mAP on the OBB task of the HRSC2016 dataset.

Figure 1. (a-c) Failure modes of regression-based OBB representations. Specifically, (a) is the result from the θ-based OBB representation (cx, cy, h, w, θ), (b) is the result from the point-based OBB representation {(x_i, y_i) | i = 1, 2, 3, 4}, and (c) is the result from the h-based OBB representation (x_1, y_1, x_2, y_2, h). (d) Result from the Mask OBB representation. (e,f) Borderline states of regression-based OBB representations. The full line, dashed line and gray region represent the horizontal bounding box, the oriented bounding box and the oriented object, respectively. The feature map of the left instance should be very similar to that of the right one, but under the definition in [2] for choosing the first vertex (yellow vertex of the OBB in (e,f)), the coordinates of the θ-based, point-based and h-based OBB representations differ greatly. This ambiguity makes the learning unstable and explains the results in (a-c). The Mask OBB representation avoids the ambiguity problem and obtains better detection results.
Figure 2 illustrates the point-based OBBs and the converted Mask OBBs on DOTA images. The highlighted points are the original ground truth, and the highlighted regions inside the point-based OBBs are the new ground truth for pixel-level classification, which is well known as the instance segmentation problem. Unlike point-based OBB, h-based OBB and θ-based OBB, the definition of Mask OBB is unique no matter how the point-based OBB changes. With Mask OBB, the problem of ground truth ambiguity is solved by construction, and no discontinuity points can arise in this representation.
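The uniqueness property can be illustrated with a small sketch. It rasterizes a four-point OBB into a binary mask and shows that re-ordering the vertices changes the point-based target vector but leaves the mask unchanged. This is an illustrative, pure-Python example under stated assumptions: the function names `point_in_convex_quad` and `obb_to_mask`, the pixel-center sampling rule and the toy 8x8 grid are all hypothetical choices, not the rasterization actually used to build the training targets.

```python
def point_in_convex_quad(px, py, quad):
    """Return True if (px, py) lies inside or on the convex quadrilateral `quad`.

    The test accepts either winding order: the point is inside when the
    cross products against all four edges share the same sign.
    """
    crosses = []
    for i in range(4):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % 4]
        crosses.append((x2 - x1) * (py - y1) - (y2 - y1) * (px - x1))
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)


def obb_to_mask(quad, height, width):
    """Rasterize a four-point OBB into a binary mask (list of rows).

    Assumption: pixel (row, col) is labeled 1 when its center
    (col + 0.5, row + 0.5) falls inside the box.
    """
    return [
        [1 if point_in_convex_quad(c + 0.5, r + 0.5, quad) else 0
         for c in range(width)]
        for r in range(height)
    ]


# The same rotated square written with three different vertex orderings.
quad_a = [(4, 1), (7, 4), (4, 7), (1, 4)]
quad_b = quad_a[1:] + quad_a[:1]      # different first vertex
quad_c = list(reversed(quad_a))       # opposite winding order

mask_a = obb_to_mask(quad_a, 8, 8)
mask_b = obb_to_mask(quad_b, 8, 8)
mask_c = obb_to_mask(quad_c, 8, 8)

# The regression targets differ, but the pixel-level targets are identical.
print(quad_a != quad_b, mask_a == mask_b == mask_c)
```

Because the mask depends only on the region the box covers, no rule for choosing a first vertex is needed, which is why the pixel-level target has no borderline states.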

Figure 2. Samples illustrating the mask oriented bounding box representation (Mask OBB). The highlighted points are the original ground truth (point-based OBB), and the highlighted regions inside the point-based OBBs are the ground truth for pixel-level classification.

Figure 3. Overview of the pipeline of our method. ILCN is the Inception Lateral Connection Network. RPN is the Region Proposal Network. SAN is the Semantic Attention Network, and SFE is the Semantic Feature Extraction module in SAN. Horizontal bounding boxes and oriented bounding boxes are generated by the HBB and OBB branches, respectively.

Figure 4. Illustration of the Inception Lateral Connection Network (ILCN). ILCN uses modified inception blocks as the lateral connections in ILC-FPN to better fuse features and enhance the original FPN.

Figure 5. Illustration of the Semantic Feature Extraction (SFE) module. SFE is the feature extraction module in the Semantic Attention Network (SAN). SFE simply upsamples and downsamples the outputs of ILC-FPN to provide features for predicting HBBs and OBBs.

Figure 6. OBB ground truth (left) and the corresponding semantic segmentation ground truth (right). Different colors in the semantic segmentation ground truth denote different categories.

Figure 7. Visualization of detection results using different OBB representation methods on the OBB task of the DOTA dataset. Compared with the other OBB representation methods, Mask OBB can compactly enclose oriented objects.

Figure 9. Visualization of detection results using our method on HBB task of DOTA.

Figure 11.Some typical failure predictions of our method.

Table 1. Comparison of different first vertex definition methods in terms of the mAP of the point-based OBB and h-based OBB representations. The "best point" method significantly outperforms the "extreme point" method on the OBB task of the DOTA dataset. The best result for each OBB representation is highlighted in bold.
where g and p denote ground truth box and proposal, respectively.

Table 2. Comparison of different methods in terms of the gap in mAP between the HBB and OBB tasks. The best (smallest) gap is highlighted in bold.

Table 3. Quantitative comparison of the baselines and our method on the OBB task on the test set of DOTA (%). The best result in each category is highlighted in bold.

Figure 8. Visualization of detection results using our method on OBB task of DOTA.

Table 5. Comparison with the state-of-the-art methods on the HRSC2016 OBB task. The best result is highlighted in bold.

Table 6. Ablation study of each component of our proposed method on DOTA. ILC-FPN is the Inception Lateral Connection Feature Pyramid Network, and SAN is the Semantic Attention Network.