FCOSR: A Simple Anchor-free Rotated Detector for Aerial Object Detection

Existing anchor-based oriented object detection methods have achieved remarkable results, but they require manually preset boxes, which introduce additional hyperparameters and computations. Existing anchor-free methods usually have complex architectures and are not easy to deploy. Our goal is an algorithm that is simple and easy to deploy for aerial image detection. In this paper, we present a one-stage anchor-free rotated object detector (FCOSR) based on FCOS, which can be deployed on most platforms. FCOSR has a simple architecture consisting only of convolution layers. Our work focuses on the label assignment strategy for the training phase. We use an ellipse center sampling method to define a suitable sampling region for the oriented bounding box (OBB). The fuzzy sample assignment strategy provides reasonable labels for overlapping objects. To solve the insufficient sampling problem, a multi-level sampling module is designed. These strategies allocate more appropriate labels to training samples. Our algorithm achieves 79.25, 75.41, and 90.15 mAP on the DOTA1.0, DOTA1.5, and HRSC2016 datasets, respectively. FCOSR demonstrates performance superior to other methods in single-scale evaluation. We convert a lightweight FCOSR model to TensorRT format; it achieves 73.93 mAP on DOTA1.0 at a speed of 10.68 FPS on a Jetson Xavier NX at single scale. The code is available at: https://github.com/lzh420202/FCOSR


Introduction
The object detection task usually uses a horizontal bounding box (HBB) to localize the target and give its category. In recent years, many excellent HBB framework algorithms have been proposed, including the YOLO series [1][2][3], the R-CNN series [4][5][6][7], RetinaNet [8], FCOS [9], and CenterNet [10], etc. These methods have achieved remarkable results in object detection tasks. There are many challenges in single-image aerial object detection, such as arbitrary orientation, dense objects, and a wide range of resolutions. These problems make it difficult for HBB algorithms to detect aerial objects effectively. Therefore, the aerial object detection task converts the HBB into an oriented bounding box (OBB) by adding a rotation angle. At present, oriented object detectors are generally modified from HBB algorithms and can be divided into two types: anchor-based methods [11][12][13][14][15][16][17][18][19][20][21][22][23] and anchor-free methods [24][25][26][27][28][29][30][31][32][33]. The anchor-based methods usually require manually preset boxes, which not only introduce additional hyperparameters and computations but also directly affect the performance of the model. The anchor-free methods remove the preset boxes and reduce the prior information, which makes them more adaptable than the anchor-based methods. In this paper, we propose a one-stage anchor-free rotated object detector (FCOSR) based on FCOS [9] and the 2-dimensional (2D) gaussian distribution. Our method directly predicts the center point, width, height, and angle of the object. The main work of this paper focuses on the training stage. Benefiting from the redesigned label assignment strategy, our method can predict the OBB of the target directly and accurately with few modifications to the FCOS architecture. Compared with refined two-step methods, our method is not only simpler but also contains only convolutional layers, so it is easier to deploy to most platforms. A series of experiments on the DOTA [34] and HRSC2016 [35] datasets verify the
effectiveness of our method. Our contributions are as follows: (1) We propose a one-stage anchor-free aerial oriented object detector, which is simple, fast, and easy to deploy. (2) We design a set of label assignment strategies based on the 2D gaussian distribution and the characteristics of aerial images. These strategies assign more appropriate labels to training samples. (3) Our method achieves 79.25, 75.41, and 90.15 mAP on the DOTA1.0, DOTA1.5, and HRSC2016 datasets, respectively. Compared with other anchor-free methods, FCOSR achieves state-of-the-art results. Compared with anchor-based methods, FCOSR surpasses many two-stage methods at single scale. Our model greatly reduces the gap between anchor-free and anchor-based methods. In terms of speed and accuracy, FCOSR presents excellent performance and surpasses the current mainstream models.

Related Work

Anchor-based methods
The anchor-based methods need to manually preset a series of standard boxes (anchors) for boundary regression and refinement. Early methods used anchors with multiple angles and multiple aspect ratios to detect oriented objects [11,12]. However, increasing the number of preset angles leads to a rapid increase in anchors and computations, which makes the model difficult to train. As a two-stage method, ROI Transformer [13] converts horizontal proposals into OBB format through the RROI learning module, then extracts the features in the rotated proposal for subsequent classification and regression. This method replaces the preset angles by letting the network predict the angle value, which greatly reduces anchors and computations. Since then, many ROI-Transformer-based methods have appeared and achieved good results. ReDet [14] introduces rotation-invariant convolution (e2cnn) [37] into the whole model and extracts rotation-invariant features by using RiROI alignment. Oriented R-CNN [15] replaces the RROI learning module in ROI Transformer [13] with a lighter and simpler oriented region proposal network (oriented RPN). R3Det [18] is a refined one-stage oriented object detection method, which obtains OBB results by refining anchors in HBB format through the feature refinement module (FRM). S2ANet [19] is composed of a feature alignment module (FAM) and an oriented detection module (ODM). FAM generates high-quality OBB anchors. ODM adopts active rotating filters to produce orientation-sensitive and orientation-invariant features to alleviate the inconsistency between classification score and localization accuracy. CSL [20] converts angle prediction into a classification task to solve the problem of discontinuous rotation angles. DCL [21] uses dense coding on the basis of CSL [20] to improve training speed. It also uses angle distance and aspect-ratio-sensitive weighting to improve accuracy.

Anchor-free methods
Current anchor-free methods mostly adopt a one-stage architecture. IENet [24] develops a branch interactive module with a self-attention mechanism, which can fuse features from the classification and regression branches. The anchor-free methods directly predict the bounding box of the target, which places certain limitations on the loss design for the regression task. GWD [22], KLD [23], and ProbIoU [25] use the distance metric between two 2D gaussian distributions to represent the loss, which provides a new regression loss scheme for anchor-free methods. PIoU [26] designs an IoU loss function for OBBs based on pixel statistics. BBAVectors [27] and PolarDet [28] define the OBB with a bbav vector and polar coordinates, respectively. CenterRot [29] uses deformable convolution (DCN) [38] to fuse multi-scale features. AROA [30] leverages attention mechanisms to refine the performance of remote sensing object detection in a one-stage anchor-free network framework.

FCOSR
As shown in Figure 1, our method takes the FCOS [9] architecture as the baseline. The network directly predicts the center point (x and y), width, height, and rotation angle of the target (OpenCV format). We determine the convergence target of each feature map through the label assignment module. Our algorithm does not introduce additional components into the architecture. It removes the centerness branch [9], which makes the network simpler and easier to deploy. The work of this paper focuses on label assignment in the training stage.

Network outputs
The network output contains a C-dimensional vector from the classification branch and a 5-dimensional (5D) vector from the regression branch. Unlike FCOS [9], we want each component of the regression output to have a different range. The offsets can be negative, the width and height must be positive, and the angle must be limited to 0-90 degrees. These simple processes are defined by Eq. 1.
Reg_xy, Reg_wh, and Reg_θ indicate the direct outputs from the last layer of the regression branch. k is a learnable adjustment factor and s is the down-sampling ratio (stride) of the multi-level feature maps. Elu [39] is an improved version of ReLU.
Through the calculation of the above equation, the output is converted into a new 5D vector (offset_x, offset_y, w, h, angle). Adding the offsets to the sampling point coordinates yields the target OBBs.
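Since Eq. 1 is not reproduced here, the decoding step can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: offsets are scaled by the stride, width/height pass through a shifted Elu to stay positive, and the angle is wrapped into [0, 90); the exact form in the paper may differ.

```python
import numpy as np

def elu(x):
    # Elu activation [39]: identity for x > 0, exp(x) - 1 otherwise.
    return np.where(x > 0, x, np.exp(x) - 1.0)

def decode_regression(reg, stride, k=1.0):
    """Hypothetical decoding of the raw 5D regression output.

    reg: (..., 5) array of (Reg_x, Reg_y, Reg_w, Reg_h, Reg_theta).
    stride: down-sampling ratio s of the feature map.
    k: learnable adjustment factor (a plain constant in this sketch).
    """
    offset = reg[..., 0:2] * stride                  # offsets may be negative
    wh = (elu(k * reg[..., 2:4]) + 1.0) * stride     # Elu(x) + 1 > 0 keeps w, h positive
    angle = np.mod(reg[..., 4:5], 90.0)              # wrap the angle into [0, 90) degrees
    return np.concatenate([offset, wh, angle], axis=-1)
```

Adding the decoded offsets to the sampling point coordinates then gives the OBB centers.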

Ellipse center sampling
Center sampling is a strategy that concentrates sampling points close to the center of the target, which helps reduce low-quality detections and improves model performance. This strategy is adopted in FCOS [9], YOLOX [3], and other networks, and stably improves accuracy. However, two problems arise when directly migrating the horizontal center sampling strategy to oriented object detection. First, the horizontal center sampling area is usually a 3×3 or 5×5 square [3,9], so the angle of the OBB affects the sampling range. Second, the short edge further reduces the number of sampling points for large-aspect-ratio targets. The most intuitive center sampling should be a circular area within a certain range around the center of the target, but the short edge limits the range of center sampling. To reduce these negative influences, we propose an elliptical center sampling method (ECS) based on the 2D gaussian distribution. We use the OBB parameters (cx, cy, w, h, θ) to define a 2D gaussian distribution [25] (Eq. 2): μ = (cx, cy)^T and Σ = Rθ Σ0 Rθ^T, where Σ is the covariance matrix, Σ0 = diag(w^2/12, h^2/12) is the covariance matrix when the angle equals 0, μ is the mean, and Rθ is the rotation transformation matrix.
The contour of the probability density function of a 2D gaussian distribution is an elliptic curve. Eq. 3 gives the probability density in the general case: f(X) = (2π)^-1 |Σ|^(-1/2) exp(-(X-μ)^T Σ^-1 (X-μ)/2).
X indicates a coordinate (2D vector). We remove the normalization term from f(X) and get g(X) = exp(-(X-μ)^T Σ^-1 (X-μ)/2) ∈ (0,1], so the elliptic contour of the 2D gaussian distribution can be expressed as g(X) = C (Eq. 4). When C = C0 = e^(-3/2) ≈ 0.223, the elliptical contour line is exactly inscribed in the OBB. The range of the elliptic curve expands as C decreases, which means the effective range of C is [C0, 1]. Considering that there are many small objects in aerial images, we set C to 0.23 to prevent insufficient sampling caused by a small sampling area. The center sampling area of a target is determined by g(X): when g(X) is greater than C, the point X is in the sampling area. The elliptical area defined by a target with large aspect ratio has a slender shape, so the part along the long-axis direction lies far away from the center area. To solve this problem, we shrink the ellipse sampling region by modifying the gaussian distribution. Eq. 5 defines the new original covariance matrix: Σ0 = diag(min(w,h)·w/12, min(w,h)·h/12). The length of the ellipse major axis thus shrinks to √(wh), while the minor axis remains unchanged. Figure 2 shows the ellipse center area of an OBB. Compared with horizontal center sampling, ellipse center sampling is more suitable for OBBs, and the sampling area of large-aspect-ratio targets is more concentrated by shrinking the long axis.
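The ellipse center sampling test can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the helper names are our own, and the shrunk covariance follows the diag(min(w,h)·w, min(w,h)·h)/12 form described above.

```python
import numpy as np

def obb_gaussian(cx, cy, w, h, theta, shrink=True):
    """Build the 2D gaussian (mu, Sigma) for an OBB; theta in radians."""
    if shrink:
        # Shrunk covariance: major axis reduced to sqrt(w*h).
        m = min(w, h)
        sigma0 = np.diag([m * w, m * h]) / 12.0
    else:
        sigma0 = np.diag([w * w, h * h]) / 12.0
    r = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return np.array([cx, cy]), r @ sigma0 @ r.T

def ellipse_center_mask(points, cx, cy, w, h, theta, c=0.23):
    """Return True for points inside the elliptical center-sampling region.

    g(X) = exp(-(X-mu)^T Sigma^-1 (X-mu) / 2) in (0, 1]; a point is a
    positive sample when g(X) >= c.
    """
    mu, sigma = obb_gaussian(cx, cy, w, h, theta)
    d = points - mu
    maha = np.einsum('ni,ij,nj->n', d, np.linalg.inv(sigma), d)
    return np.exp(-0.5 * maha) >= c
```

With c near e^(-3/2), points well outside the inscribed ellipse fall below the threshold and are treated as negatives.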

Fuzzy sample label assignment
FCOS [9] reduces fuzzy (ambiguous) samples by allocating targets of different scales to feature maps with different strides. For targets with similar scales, FCOS [9] assigns the label of the smaller target to ambiguous sampling points. Obviously, this fuzzy sample label assignment method based on the minimum-area principle has difficulty dealing with complex scenes, such as aerial scenes. We design a fuzzy sample label assignment method (FLA) to assign ambiguous sample labels based on the 2D gaussian distribution. The gaussian distribution presents a bell shape, and the response is largest at its center. The response becomes smaller as the sampling point moves away from the center of the distribution. We approximately take the 2D gaussian distribution as the distance measure between a sampling point and a target center. The center distance J(X) is defined by Eq. 6.
For any target, we calculate the value of J(X) at each sampling point. A larger value of J(X) means that X is closer to the target. When a sampling point is covered by multiple targets at the same time, we assign the label of the target with the largest J(X) to the sampling point. A simple fuzzy sample label assignment diagram is shown in Figure 3.

Multi-level sampling
The sampling range of large-aspect-ratio targets is mainly limited by the short edge. As shown in Figure 4, when the stride of a feature map is greater than the length of the short edge, the target may be too narrow to be sampled effectively. Therefore, in view of the possibility of insufficient sampling, we add a simple supplementary sampling scheme, which determines whether to allocate labels in lower-level feature maps by comparing the short edge with the stride. We assign labels to feature maps that satisfy the following two conditions. First, the ratio between the short edge of the target and the stride of the feature map is less than 2. Second, the long edge of the minimum bounding rectangle of the target is larger than the acceptance range of the feature map. The multi-level sampling strategy (MLS) allows us to add targets that cannot be sampled effectively to lower-level feature maps. A lower-level feature map has denser sampling points, which alleviates the insufficient sampling problem.
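The two conditions can be sketched as a per-level check. This is a hedged illustration: the stride list and regression (acceptance) ranges in the test below are typical FPN settings, not values from the paper.

```python
def extra_sampling_levels(short_edge, long_edge, strides, regress_ranges):
    """Decide which extra (lower) FPN levels a target is also assigned to.

    A level qualifies when (1) short_edge / stride < 2, i.e. the target is
    too narrow to be sampled densely enough at that stride, and (2)
    long_edge exceeds the upper bound of the level's acceptance range,
    i.e. the target would normally be assigned to a higher level.
    strides and regress_ranges are per-level lists, low to high.
    """
    levels = []
    for i, (stride, (low, high)) in enumerate(zip(strides, regress_ranges)):
        if short_edge / stride < 2 and long_edge > high:
            levels.append(i)
    return levels
```

A narrow 10×200 target, for example, would be re-sampled on the lower levels whose acceptance range it exceeds, giving it denser sampling points.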

Target Loss
The loss of FCOSR consists of a classification loss and a regression loss. Quality focal loss (QFL) [40] is used as the classification loss, mainly because we remove the centerness branch of the original FCOS [9] algorithm. The regression uses the ProbIoU loss [25].
QFL [40] is a part of generalized focal loss (GFL) [40]. It unifies the training and testing processes by replacing the one-hot label with the IoU value between the prediction and the ground truth. QFL [40] suppresses low-quality detection results and also improves the performance of the model. Eq. 7 gives the definition of QFL [40].
QFL(σ) = -|y - σ|^β ((1 - y) log(1 - σ) + y log σ), where σ is the predicted score, y represents the IoU soft label, and the parameter β (using the recommended value 2) controls the down-weighting rate smoothly.
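This definition can be written directly in NumPy for a single prediction (clipping added for numerical stability; this is an illustrative sketch, not the training implementation):

```python
import numpy as np

def quality_focal_loss(sigma, y, beta=2.0):
    """Quality focal loss [40] for one prediction.

    sigma: predicted classification score in (0, 1).
    y: soft target, i.e. the IoU between prediction and ground truth
       (0 for negatives). beta smoothly down-weights easy examples.
    """
    sigma = np.clip(sigma, 1e-7, 1 - 1e-7)  # avoid log(0)
    modulator = np.abs(y - sigma) ** beta   # |y - sigma|^beta focusing term
    bce = (1 - y) * np.log(1 - sigma) + y * np.log(sigma)
    return -modulator * bce
```

The loss vanishes when the predicted score matches the IoU target exactly and grows with the gap, so well-localized but low-scored detections are pushed up while overconfident poor ones are suppressed.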
ProbIoU loss [25] is a kind of IoU loss specially designed for oriented objects. It represents the IoU between OBBs through the distance between 2D gaussian distributions, which is similar to GWD [22] and KLD [23]. The overall loss is the sum of the classification and regression losses over all locations on the multi-level feature maps, normalized by the number of positive samples Npos.
Datasets

For DOTA [34], we use both the train and validation sets for training and the test set for testing. All images are cropped into 1024×1024 patches with a gap of 512, and the multi-scale training scales for DOTA1.0 are {0.5, 1.0}, while those for DOTA1.5 are {0.5, 1.0, 1.5}. We also apply random flipping and random rotation augmentation during training.
HRSC2016 [35] is a challenging ship detection dataset with OBB annotations, which contains 1061 aerial images with sizes ranging from 300×300 to 1500×900. It includes 436, 181, and 444 images in the train, validation, and test sets, respectively. We use both the train and validation sets for training and the test set for testing. All images are resized to 800×800 without changing the aspect ratio. Random flipping and random rotation are applied during training.

Implementation Details
We adopt ResNext50 [42] with FPN [36] as the backbone of FCOSR. We train the model for 36 epochs on DOTA and 40k iterations on HRSC2016. We use the SGD optimizer to train the DOTA model with an initial learning rate (LR) of 0.01; the LR is divided by 10 at epochs {24, 33}. The initial LR of the HRSC2016 model is set to 0.001, with steps at {30k, 36k} iterations. The momentum and weight decay are 0.9 and 0.0001, respectively. We use an Nvidia DGX Station (4 V100 GPUs@32G) with a total batch size of 16 for training, and a single 2080Ti GPU for testing. We adopt Jetson Xavier NX with TensorRT as the embedded deployment platform. The NMS threshold is set to 0.1 when merging image patches, and the confidence threshold is set to 0.1 during testing. Inspired by rotation-equivariant CNNs [14,37], we adopt a new rotation augmentation method, which uses a 2-step rotation to generate random augmentation data. First, we rotate the image randomly by 0, 90, 180, or 270 degrees with equal probability. Second, we rotate the image randomly by 30 or 60 degrees with 50% probability. Our implementation is based on mmdetection [41].
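The 2-step rotation sampling can be sketched as follows (the function name is ours; the actual augmentation also rotates the OBB annotations accordingly):

```python
import random

def sample_rotation_angle():
    """Two-step random rotation from the augmentation scheme above.

    Step 1: a right-angle rotation (0/90/180/270) with equal probability.
    Step 2: an extra 30 or 60 degrees with 50% probability.
    """
    angle = random.choice([0, 90, 180, 270])
    if random.random() < 0.5:
        angle += random.choice([30, 60])
    return angle % 360
```

Every sampled angle is therefore a multiple of 30 degrees, covering orientations that plain right-angle flips would miss.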

Ablation Studies
We perform a series of experiments on the DOTA1.0 test set to evaluate the effectiveness of the proposed method. We use ResNext50 [42] as the backbone and call this model FCOSR-M. Other models in the FCOSR series can be found in the next section. We train and test the model at single scale.
As shown in Table 1, the baseline mAP of FCOSR-M is 70.4, which increases by 4.03 with the addition of rotation augmentation. When QFL [40] is used instead of focal loss, the detection result of the model gains 0.91 mAP. We then add the ECS and FLA modules, which increase the detection result to 76.80 mAP. Finally, we add the MLS module and achieve 77.15 mAP on DOTA1.0 at single scale. Through the use of multiple modules, FCOSR-M achieves a very significant performance improvement. These modules do not introduce any additional computation during inference, which makes FCOSR a very simple, fast, and easy-to-deploy OBB detector.

More Backbones
We use a variety of other backbones to replace ResNext50 [42] in the FCOSR model. We adopt Mobilenetv2 [43] as the backbone and name the result FCOSR-S. We also test FCOSR on ResNext101 [42] with 64 groups and width 4, and name it FCOSR-L. Model size, FLOPs, FPS, and mAP on DOTA are shown in Table 2.
To deploy FCOSR on the embedded platform, we perform lightweight processing on the model. We adjust the output stage of the backbone on the basis of FCOSR-S, and replace the extra convolutional layers of the FPN with pooling layers. We call it FCOSR-lite. On this basis, we further reduce the feature channels of the head from 256 to 128, and name it FCOSR-tiny. The above two models are converted to TensorRT 16-bit format and tested on Jetson Xavier NX. The results are shown in Table 3. The lightweight FCOSR achieves an excellent balance between speed and accuracy on Jetson Xavier NX. The lightest tiny model achieves 73.93 mAP at 10.68 FPS. This is a successful attempt to deploy a high-performance oriented object detector on edge computing devices.

Speed VS Accuracy
We test the inference speed of the FCOSR series models and other open-source mainstream models, including R3Det [18], ReDet [14], S2ANet [19], Faster-RCNN-O [6], Oriented R-CNN [15], and RetinaNet-O [8]. The FCOSR series models surpass the existing mainstream models in both speed and accuracy, which also proves that, through reasonable label assignment, even a very simple model can achieve excellent performance.

Comparison with state-of-the-arts
Results on DOTA1.0. As shown in Table 4, we compare the FCOSR series with other state-of-the-art methods on the DOTA1.0 OBB task. FCOSR-L achieves 77.39 mAP, exceeding all single-scale methods and most multi-scale methods. Multi-scale FCOSR-M achieves 79.25 mAP, and the gap with S2ANet [19] is reduced to 0.17. Although there is still a certain gap compared with anchor-based models, our algorithm achieves state-of-the-art results among anchor-free methods. We visualize part of the DOTA1.0 test set results in Figure 6.
Results on DOTA1.5. As shown in Table 5, we also conduct all experiments on the FCOSR series. However, few methods have currently been evaluated on the DOTA1.5 dataset, so we directly use some results from ReDet [14]. At single scale, the FCOSR-M and FCOSR-L models achieve 68.74 and 69.96 mAP, respectively; FCOSR-L outperforms all single-scale models. At multi-scale, FCOSR-L achieves 75.41 mAP, which is the second-best result on the DOTA1.5 test set.
Results on HRSC2016. As shown in Table 6, the FCOSR series models surpass all anchor-free models and achieve 95.74 mAP under the VOC2012 metric. The FCOSR series exceeds 90 mAP under the VOC2007 metric. FCOSR-L even surpasses S2ANet [19], which further proves that the anchor-free method we propose is no weaker than anchor-based methods. The visualization of detection results is shown in Figure 7.

Conclusion
This paper proposed a one-stage anchor-free oriented detector whose label assignment consists of three parts: ellipse center sampling, fuzzy sample label assignment, and multi-level sampling. Ellipse center sampling provides a more suitable sampling area for rotated objects. The fuzzy sample label assignment method divides the sampling area of overlapping targets more reasonably. The multi-level sampling method alleviates the insufficient sampling problem of targets with large aspect ratios. Thanks to its simple architecture, FCOSR does not need any special computing unit for inference. Therefore, it is a very fast and easy-to-deploy model for most platforms. Our experiments on lightweight backbones have also shown satisfactory results. Extensive experiments on the DOTA and HRSC2016 datasets demonstrate the effectiveness of our method.

(4) We convert a lightweight FCOSR to TensorRT format and successfully migrate it to Jetson Xavier NX. The TensorRT model achieves 73.93 mAP at 10.68 FPS on the DOTA1.0 test set.

Figure 1. FCOSR architecture. The outputs of the backbone with feature pyramid network (FPN) [36] are multi-level feature maps P3-P7. The head is shared across all multi-level feature maps. The predictions on the left of the head are the inference part; the other components are only effective during the training stage. LAM means label assignment module, which allocates labels to each feature map. H and W are the height and width of the feature map. Stride is the down-sampling ratio for multi-level feature maps. C represents the number of categories, and the regression branch directly predicts the center point, width, height, and angle of the target.

Figure 2. Ellipse center area of OBB. The oriented box represents the OBB of the target, and the shaded area represents the sampling region. (a) general sampling region, (b) horizontal center sampling region, (c) original elliptical region, (d) shrunk elliptical region.

Figure 3. A fuzzy sample label assignment demo. (a) is a 2D label assignment area diagram, and (b) is a 3D visualization of J(X) for two targets. The red OBB and area represent the court target, and the blue represents the ground track field. After J(X) calculation, the smaller area inside the red ellipse is allocated to the court, and the other blue area is allocated to the ground track field.

Figure 4. Multi-level sampling. (a) insufficient sampling; the green points in the diagram are sampling points. The ship is so narrow that there are no sampling points inside it. (b) a multi-level sampling demo. The red line indicates that the target, following the FCOS guideline, is assigned to H6, where it is too narrow to be sampled effectively. The blue line indicates that the target is also assigned to lower levels of features according to the multi-level guideline. The target is thus sampled at three different scales to handle the insufficient sampling problem.

Table 5. Results on the DOTA1.5 OBB task.

Figure 6. The FCOSR-M detection results on the DOTA1.0 test set. The confidence threshold is set to 0.3 when visualizing these results.

Figure 7. The FCOSR-L detection results on HRSC2016. The confidence threshold is set to 0.3 when visualizing these results.

Table 1. The results of ablation experiments for FCOSR-M. ✓ means the module is used. Rot means the 2-step rotation augmentation. QFL means quality focal loss; when QFL is not used, we use focal loss instead. ECS means ellipse center sampling. FLA means the fuzzy sample label assignment module. MLS means multi-level sampling.

Table 4. Results on the DOTA1.0 OBB task. H104 means Hourglass 104. * indicates multi-scale training and testing. ⁑ means rotated test mode during multi-scale testing. The results in red and blue indicate the best and 2nd-best results of each column, respectively.

Table 2. FCOSR series model size, FLOPs, FPS, and mAP comparison. ResNext means ResNeXt [42], and Mobile v2 means Mobilenet v2 [43]. FLOPs and FPS are measured with a 1024×1024 image size on a single 2080Ti device; the second FPS value represents the speed of the rotation test mode. mAP is the result on DOTA1.0 with single-scale evaluation.

Table 3. Lightweight FCOSR test results on Jetson Xavier NX. Model size is the TensorRT engine file size (.trt). FPS is the average speed over 2000 inferences with a 1024×1024 image size. mAP is the result on the DOTA1.0 test set at single scale.