ADT-Det: Adaptive Dynamic Refined Single-Stage Transformer Detector for Arbitrary-Oriented Object Detection in Satellite Optical Imagery

Abstract: The detection of arbitrary-oriented and multi-scale objects in satellite optical imagery is an important task in remote sensing and computer vision. Despite significant research efforts, such detection remains largely unsolved due to the diversity of patterns in orientation, scale, aspect ratio, and visual appearance; the dense distribution of objects; and extreme imbalances between categories. In this paper, we propose an adaptive dynamic refined single-stage transformer detector to address the aforementioned challenges, aiming to achieve high recall and speed. Our detector realizes rotated object detection with RetinaNet as the baseline. Firstly, we propose a feature pyramid transformer (FPT) to enhance feature extraction of the rotated object detection framework through a feature interaction mechanism. This is beneficial for the detection of objects with diverse patterns in terms of scale, aspect ratio, visual appearance, and dense distributions. Secondly, we design two special post-processing steps for rotated objects with arbitrary orientations, large aspect ratios, and dense distributions. The output features of FPT are fed into the post-processing steps. The first step performs the preliminary regression of locations and angle anchors for the refinement step. The refinement step first performs adaptive feature refinement and then gives the final object detection result. The main architecture of the refinement step is dynamic feature refinement (DFR), which is proposed to adaptively adjust the feature map and reconstruct a new feature map for arbitrary-oriented object detection to alleviate the mismatches between rotated bounding boxes and axis-aligned receptive fields. Thirdly, the focal loss is adopted to deal with the category imbalance problem.
Experiments on two challenging satellite optical imagery public datasets, DOTA and HRSC2016, demonstrate that the proposed ADT-Det detector achieves a state-of-the-art detection accuracy (79.95% mAP for DOTA and 93.47% mAP for HRSC2016) while running very fast (14.6 fps with a 600 × 600 input image size).


Introduction
In the past few decades, Earth observation satellites have been monitoring changes in the Earth's surface, and the amount and resolution of satellite optical images have greatly improved. The task of object detection in satellite optical images is to localize objects of interest (such as vehicles, ships, aircraft, buildings, airports, and ports) and identify their categories. This has numerous practical applications in satellite remote sensing and computer vision, including warning of natural disasters, Earth surveying and mapping, surveillance, and traffic planning. Much progress in general-purpose horizontal detectors has been achieved by advances in deep convolutional neural networks (DCNNs) and the emergence of large datasets [1]. However, unlike natural images, which are usually taken from horizontal perspectives, satellite optical images pose several challenges for object detection:
• Large-scale difference. Objects in satellite images vary hugely in size [5]. There are small objects such as cars, ships, aircraft, and small houses, as well as large objects such as ports, airports, ground track fields, bridges, and large buildings. In addition, the size of objects within the same category (such as large aircraft and small aircraft) in the same image also varies greatly.
• Dense distribution. There are many densely distributed objects in satellite optical images, such as cars and ships [5].
• Large aspect ratio. There are many objects with large aspect ratios in satellite optical images, such as large vehicles, ships, harbors, and bridges. The mismatch between the ground-truth bounding box and the predicted bounding box of these objects is very sensitive to the rotation angle of the objects [4].
• Category imbalance. Satellite optical imagery datasets are long-tailed, and the number of instances in each category varies greatly. For example, the number of small vehicles is about 105 times larger than that of soccer ball fields in satellite optical imagery.
Recent research [6][7][8][9] has focused on the design of rotation detectors, which apply rotated regions of interest (RRoI) instead of horizontal regions of interest (HRoI). To meet the above challenges, a framework for rotated object detection consisting of a rotation learning stage and a feature refinement stage is proposed to improve the detection accuracy. Despite the fact that some newly developed rotated object detection methods [10][11][12][13][14] have made some progress in this area, their performance still falls considerably below that required for real-world applications. A main reason for their low detection performance is improper feature extraction for instances with arbitrary orientations, large aspect ratios, and dense distributions. As shown in Figure 2a, the general receptive field of deep neural network-based detectors is axis-aligned and square, representing a mismatch with the actual shape of the instances, and this usually produces false detections. Thus, our goal is to design a special feature pyramid transformer and feature refinement module which can be adjusted adaptively according to the angle and scale of the instance, as shown in Figure 2b. Then, we introduce the above methods into the rotated object detection framework to help extract more accurate features. In this paper, we propose an adaptive dynamic refined single-stage transformer detector to address the aforementioned challenges, aiming to achieve a high recall and speed. Our detector realizes rotated object detection with RetinaNet as the baseline. Firstly, the feature pyramid transformer (FPT) is introduced into the traditional feature pyramid network (FPN) to enhance feature extraction through a feature interaction mechanism. This is beneficial for the detection of multi-scale objects and densely distributed objects. Secondly, the output features of FPT are fed into two post-processing steps. 
In the first step, the preliminary regression of locations and angle anchors for the refinement step is performed. In the refinement step, adaptive feature refinement is performed first and then the final object detection result is given precisely. The main architecture of the refinement step is the dynamic feature refinement (DFR), which is proposed to adaptively adjust the feature map and reconstruct a new feature map for arbitrary-oriented object detection to alleviate the mismatches between rotated bounding boxes and axis-aligned receptive fields. Experiments are carried out on two challenging satellite optical imagery public datasets, DOTA and HRSC2016, to demonstrate that our method outperforms previous state-of-the-art methods while running very fast.
The contributions of this work are three-fold: (1) We propose a feature pyramid transformer for the feature extraction of the rotated object detection framework. This is beneficial for detecting objects with diverse patterns in terms of scale, aspect ratio, and visual appearance, and helps with the handling of challenging scenes with densely distributed instances through a feature interaction mechanism.
(2) We propose a dynamic feature refinement method for rotated objects with arbitrary orientations, large aspect ratios, and dense distributions. This can help to alleviate the bounding box mismatch problem.
(3) The proposed ADT-Det detector outperforms previous state-of-the-art detectors in terms of accuracy while running very fast.

Related Studies
Along with the wide application of satellite remote sensing and unmanned aerial vehicles, the amount of satellite optical imagery is increasing tremendously and object detection in satellite optical imagery has received increasing attention in the computer vision and remote sensing communities. Researchers have introduced DCNN-based detectors for object detection in satellite optical imagery, and oriented bounding boxes have been used instead of horizontal bounding boxes to reduce the mismatch between the predicted bounding box and corresponding objects. DCNN-based detectors are now reported as state-of-the-art.
In this section, we briefly review some previous well-known object detection methods in satellite or aerial optical images. In Section 2.1, we review the current mainstream detectors used for satellite optical image detection. In Section 2.2, we summarize some classical designs of DCNN-based detectors that can improve the detection performance.

The Mainstream Detectors for Object Detection in Satellite Optical Imagery
The current mainstream detectors for satellite optical image detection are rotation detectors. Existing rotation detectors mostly employ oriented bounding boxes as alternatives to horizontal bounding boxes. Generally, these detectors can be organized into two main categories: multi-stage detectors and single-stage detectors.
The framework of multi-stage detectors includes a pre-processing stage for region proposal and one or more post-processing stages to regress the bounding box of an object and identify its category. In the pre-processing stage, classification-independent region proposals are generated from an input image. Then, CNNs with a special architecture are used to extract features from these regions, and regression and classification are performed over the next several stages [3,4]. In the last stage, the final detection results are generated by non-maximum suppression (NMS) or other methods. To the best of our knowledge, RoI-Transformer [2] and SCRDet [15] are state-of-the-art multi-stage rotated object detectors. The RoI-Transformer is a two-stage rotated object detector. Its first stage is an RRoI Learner that generates a transformation from a horizontal bounding box to an oriented bounding box by learning from the annotated data. One important task in the second stage is RoI alignment, which extracts rotation-invariant features from the oriented RoI for subsequent object regression and classification. SCRDet introduced SF-Net [16] and MDA-Net into Faster-RCNN [17] to detect small and densely distributed objects. By introducing the Intersection over Union (IoU) factor into the traditional smooth L1 loss function, its IoU-Smooth L1 loss makes the angle regression smoother. Generally, the numerous redundant region proposals make multi-stage detectors more accurate than anchor-free detectors. However, they rely on a more complicated structure, which greatly reduces their speed.
Single-stage object detectors drop the complex and redundant region proposal network and directly regress the bounding box and identify the category of objects. YOLO [18][19][20] treats object detection as a regression task: image pixels are regressed to spatially separated bounding boxes and associated class probabilities using the GoogLeNet network. Its improved versions are YOLOv2 and YOLO9000, in which GoogLeNet is replaced by the simpler Darknet-19 and some special strategies (e.g., batch normalization) are introduced. Liu et al. [21] proposed SSD to preserve real-time speed while keeping the detection accuracy as high as possible. Just like YOLO, SSD predicts a fixed number of bounding boxes and scores for the presence of object categories in these boxes, followed by an NMS [22] step to generate the final detection result. As observed in [5], the detection performance of general single-stage methods is considerably lower than that of multi-stage methods. Recently, R3Det [4] and R4Det [3] demonstrated high performance in detecting rotated objects in satellite optical images. R3Det adopts RetinaNet [23] as the baseline and adds a refinement stage to the network. The focal loss alleviates the imbalance between positive and negative samples. R4Det proposed a single-stage object detection framework by introducing the recursive feature pyramid (RFP) into RetinaNet to integrate feature maps of different levels.

In many DCNN-based object detection frameworks, FPN is a basic component used to extract multi-level features for detecting objects at different scales. Low-level features carry less semantic information but have a higher resolution; conversely, high-level features carry more semantic information but have a lower resolution. In order to make full use of low-level and high-level features at the same time, Lin et al. [24] proposed a generic FPN approach to fuse a multi-scale feature pyramid with a top-down pathway and lateral connections. This has become a standard baseline and performs well in feature extraction. Using a feature pyramid transformer [25] is an effective way to perform feature interaction between different scales and spaces. The transformed feature pyramid has a richer context than the original pyramid while maintaining the same size. In this paper, we introduce an FPT to enhance feature interaction in the feature fusion step.

Spatial Transformer Network
Atrous convolution [26] is an early form of spatial transformation in convolutional networks. It enlarges the receptive field by inserting holes into the standard convolution kernel. Many improvements to dilated convolution have been proposed in recent years. Atrous spatial pyramid pooling (ASPP) [27] and DenseASPP [28] obtained better results by combining convolutions with different dilation rates in various forms. The Deformable Convolutional Network (DCN) [29] provides new ideas for spatial transformer networks. DCN can adjust the sampling locations of the convolution kernels to make the receptive field better fit the feature map. The receptive field of a standard convolution is axis-aligned and square, whereas DCN can adjust it dynamically according to the feature shape. We therefore expect that introducing DCN into the feature extraction for rotated object detection can improve the detection performance.
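The effect of inserting holes into a kernel can be illustrated with a minimal one-dimensional sketch. The function names `dilated_conv1d` and `effective_kernel_size` are hypothetical and chosen for illustration only; the sketch assumes a "valid" cross-correlation without padding and is not taken from [26].

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a kernel whose taps are `dilation` samples apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1-D cross-correlation with holes between the kernel taps."""
    span = effective_kernel_size(len(kernel), dilation)
    outputs = []
    for start in range(len(signal) - span + 1):
        # Each tap i reads the input at start + i * dilation.
        outputs.append(
            sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        )
    return outputs


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    print(dilated_conv1d(x, [1, 1, 1], dilation=1))
    print(dilated_conv1d(x, [1, 1, 1], dilation=2))
```

With the same three weights, a dilation of 2 covers five input samples instead of three, which is the enlargement of the receptive field described above.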

Refined Object Detectors
The research in [30] indicates that a low IoU threshold usually produces noisy detections. However, due to the mismatch between the optimal IoU of the detector and the IoU of the input hypothesis, detection performance tends to degrade as the IoU threshold increases. To address these problems, Cascade RCNN [30] uses multiple stages with sequentially increasing IoU thresholds to train detectors. The main idea of RefineDet [31] is to first coarsely adjust the locations and sizes of anchors using an anchor refinement module. This is then followed by a regression branch to obtain more precise box information. Unlike two-stage detectors, current single-stage detectors with a refinement stage have not resolved this issue well. Feature misalignment is still one of the main reasons for the poor performance of refined single-stage detectors.
In this paper, we propose an adaptive dynamic refined single-stage transformer detector to address the aforementioned challenges, aiming to achieve a high recall and speed. Our detector realizes rotated object detection with RetinaNet as the baseline to achieve the detection of multi-scale objects and densely distributed objects. Firstly, the feature pyramid transformer (FPT) is introduced into the traditional feature pyramid network (FPN) to enhance feature extraction through a feature interaction mechanism. Secondly, the output features of FPT are fed into two post-processing steps considering the mismatch between the rotated bounding box and the general axis-aligned receptive fields of CNN. Dynamic Feature Refinement (DFR) is introduced to the refinement step. The key idea of DFR is to adaptively adjust the feature map and reconstruct a new feature map for arbitrary-oriented object detection to alleviate the mismatches between the rotated bounding box and the axis-aligned receptive fields. Extensive experiments and ablation studies show that our method can achieve state-of-the-art results in the task of object detection.

Methodology
In this section, we first describe our network architecture for arbitrary rotated object detection in Section 3.1. We then propose the feature pyramid transformer and dynamic feature refinement, which are our main contributions, in Sections 3.2 and 3.3, respectively. Finally, we show the details of our RetinaNet-based rotation detection method and the loss function in Section 3.4.

Network Architecture
The overall architecture of the proposed ADT-Det detector is sketched in Figure 3. Our pipeline improves upon RetinaNet and consists of a backbone network and two post-processing steps. The FPN network is utilized as the backbone and a feature pyramid transformer is proposed to enhance feature extraction for densely distributed instances. Then, the backbone is attached to the post-processing steps. These consist of two sub-steps, a first sub-step and a refinement sub-step, which are described in detail in Sections 3.3 and 3.4. In the first sub-step, the preliminary regression of locations and angle anchors for the refinement sub-step is performed. In the refinement sub-step, adaptive feature refinement is performed first and then the final object detection result is given. The main architecture of the refinement sub-step is the dynamic feature refinement (DFR), which is proposed to adaptively adjust the feature map and reconstruct a new feature map for rotated object detection (the detailed architecture of DFR is shown in Section 3.3). In the refinement sub-step, the feature fusion module (FFM) is an important step that dynamically counteracts the mismatch between the rotated object and the axis-aligned receptive fields of neurons. The overall framework is end-to-end trainable with high efficiency.

Figure 3. Overall architecture of the proposed ADT-Det detector. An FPN network is used as the backbone network and a feature pyramid transformer is proposed to enhance the feature extraction. Then, the backbone is attached to the post-processing steps, which consist of two sub-steps: a first sub-step and a refinement sub-step. In the first sub-step, the preliminary regression of locations and angles for the refinement sub-step is performed. In the refinement sub-step, adaptive feature refinement is performed first and then the final object detection result is given.

Feature Pyramid Transformer
We introduce a feature pyramid transformer (FPT) and add it between the backbone FPN network and the post-processing network to produce features with stronger semantic information. Its architecture is shown in Figure 4. Firstly, the features from FPN are transformed and re-arranged. Then, the output features are concatenated with the original feature map to obtain the concatenated features. Finally, the Conv3×3 operation is carried out to reduce the channel and obtain the transformed feature pyramid.
The FPT is a lightweight network that enhances features through feature interaction across multiple scales and layers. It allows features of different levels to interact across space and scale. The FPT consists of three transformers: a self-transformer, a grounding transformer, and a rendering transformer. The self-transformer is introduced to capture objects that co-occur on the same feature map. The grounding transformer is a top-down non-local interaction that is used to enhance shallow features with features from other levels. As shown in Figure 5a,b, the inputs of the self-transformer and the grounding transformer are $q_i$, $k_j$, and $v_j$, where $q_i = f_q(X_i)$ represents the i-th query; $k_j = f_k(X_j)$ represents the j-th key; $v_j = f_v(X_j)$ represents the j-th value; and $f_q(\cdot)$, $f_k(\cdot)$, and $f_v(\cdot)$ compute the queries, keys, and values from the feature map, respectively. The self-transformer adopts the dot product as the similarity function $F_{sim}$ to capture co-occurring features in the same feature map. The output of $F_{sim}$ is fed to the normalization function $F_{norm}$ to generate the weights $w_{(i,j)}$. Lastly, we multiply $v_j$ and $w_{(i,j)}$ to obtain the transformed feature $\hat{X}$. Unlike the self-transformer, the grounding transformer is a top-down non-local interaction that strengthens shallow features with deep features. It uses the Euclidean distance to measure the similarity between deep features and shallow features. The rendering transformer works in a bottom-up fashion and interacts with the entire feature map, rendering higher-level semantic features with lower-level features. The transformation process is shown in Figure 5c. First, we calculate the weight $w$ of $Q$ through global average pooling from the shallow feature $K$. Then, the weighted $Q$ ($Q_{att}$) and $V$ are refined by Conv3×3 to reduce the size of the feature map. Finally, the refined $Q_{att}$ and the down-sampled $V$ ($V_{down}$) are summed and processed by another Conv3×3 for rendering.
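The query/key/value interaction shared by the self-transformer and the grounding transformer can be sketched in a few lines of plain Python. This is a minimal illustration under several assumptions that are not stated in the text: the projections f_q, f_k, and f_v are taken to be the identity, F_norm is taken to be a softmax, the Euclidean similarity is taken to be the negative squared distance, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (assumed F_norm)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_similarity(q, k):
    """F_sim of the self-transformer: dot product of query and key."""
    return sum(a * b for a, b in zip(q, k))


def neg_sq_euclidean(q, k):
    """Assumed F_sim of the grounding transformer: closer means more similar."""
    return -sum((a - b) ** 2 for a, b in zip(q, k))


def interact(queries, keys, values, similarity):
    """For each q_i: w_(i,j) = F_norm(F_sim(q_i, k_j)); output_i = sum_j w_(i,j) * v_j."""
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([similarity(q, k) for k in keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    queries = [[1.0, 0.0]]
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[2.0], [4.0]]
    print(interact(queries, keys, values, dot_similarity))
    print(interact(queries, keys, values, neg_sq_euclidean))
```

In both cases the query is most similar to the first key, so the output is pulled toward the first value; only the similarity function differs between the two calls.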

Dynamic Feature Refinement
When detecting instances with arbitrary orientations, large aspect ratios, and dense distributions, the main reason for low detection performance is the feature misalignment problem, which is caused by differences in scale and rotation between the oriented bounding box and the axis-aligned receptive fields. To alleviate the feature misalignment problem, we introduce dynamic feature refinement (DFR) to obtain the refined accurate bounding box. The architecture of DFR is shown in the bottom of Figure 6.

Figure 6. Architecture of the post-processing step. This consists of two sub-steps: the first sub-step and the refinement sub-step. Top: the first sub-step, which performs the preliminary regression of angle anchors for the refinement sub-step. Bottom: the refinement sub-step, which performs feature fusion and adaptive feature refinement and then gives the final object detection result. On the left of the refinement sub-step is the feature fusion module, followed by the feature refinement module. On the right are two subnetworks, which perform object classification and regression.
We adopt a feature fusion module (FFM) to counteract the mismatches between arbitrarily oriented objects and axis-aligned receptive fields. It dynamically and adaptively aggregates the features extracted with various kernel sizes, shapes (aspect ratios), and angles. The FFM takes the i-th stage feature map $X \in \mathbb{R}^{H \times W \times C}$ as input and consists of two branches. In one branch, $X$ is connected to the classification and regression subnetworks to decode the location feature information. This is a standard subnetwork taken from RetinaNet. The task of this branch is to generate initial location information and decode the angle feature information. In the other branch, we compress $X$ with a Conv1×1 layer and aggregate the improved information using batch normalization and ReLU. In order to further deal with the mismatches between rotated objects and axis-aligned receptive fields, we introduce the adaptive convolution (AdaptConv) into our DFR.
The AdaptConv is inspired by [32], and the implementation details are illustrated in Figure 7. Similar to DCN in [29], $\mathcal{R}$ denotes the regular grid that defines the receptive field and dilation. For a 3 × 3 kernel, we have:

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

The output of AdaptConv is:

$$X_i(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot X_c(p_0 + p_n + \delta p_n)$$

where $p_0$ is a location on the output feature map, $p_n$ represents the locations in $\mathcal{R}$, $w$ denotes the kernel weights, and $\delta p_n$ is the offset field for each location $p_n$. In our method, we redefine the offset field $\delta p_n$ so that DCN can be transformed into a regular convolution with angle information. The offset of AdaptConv is defined as follows:

$$\delta p_n = M_r(\theta) \cdot p_n - p_n$$

where $M_r(\theta) \in \mathbb{R}^{H \times W \times 1}$ is the angle feature information that is split and resized from the location feature information. As shown in the bottom of Figure 6, in order to cope with objects with large aspect ratios, we use a three-split AdaptConv with 3 × 3, 1 × 3, and 3 × 1 kernels, whose outputs are denoted as $X_i \in \mathbb{R}^{H \times W \times C}$ ($i \in \{1, 2, 3\}$), to extract multiple features from $X_c \in \mathbb{R}^{H \times W \times C}$.

To let the receptive fields of neurons adjust to the features dynamically, we adopt an attention mechanism to integrate the features from the above three-split process. Let the attention map be $A_i \in \mathbb{R}^{H \times W \times 1}$ ($i \in \{1, 2, 3\}$). The computation is as follows. Firstly, $X_i$ is fed into the attention block, which is composed of Conv1×1 and the batch normalization operation. Secondly, $A_i$ ($i = 1, 2, 3$) is sent to SoftMax to obtain the normalized selection weight $A'_i$:

$$A'_i = \mathrm{SoftMax}([A_1, A_2, A_3])$$

Here, the SoftMax can be described as follows. Suppose $v$ is a vector and $v_i$ represents the i-th element in $v$. In this case, the SoftMax value of this element is formulated by:

$$\mathrm{SoftMax}(v_i) = \frac{e^{v_i}}{\sum_j e^{v_j}}$$

where the calculation result is between 0 and 1 and the sum of the SoftMax values of all elements is 1. Thirdly, the feature map $Y$ is obtained by implementing a ReLU operation on the weighted sum of the three features:

$$Y = \mathrm{ReLU}\left(\sum_{i=1}^{3} A'_i \cdot X_i\right)$$

where $Y \in \mathbb{R}^{H \times W \times C}$ is the output feature. The adjusted feature map $Y$ is then sent to the feature refinement module (as shown in the middle of Figure 6) to reconstruct the features and achieve feature alignment.
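The two ingredients of this step, angle-dependent sampling offsets and attention-weighted fusion of the three kernel shapes, can be sketched at a single spatial location. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it treats M_r(θ) as a 2-D rotation applied to each kernel location (as in the rotation-based offsets of [32]), measures θ in radians, and uses hypothetical function names throughout.

```python
import math


def regular_grid(kernel_h, kernel_w):
    """Kernel sampling locations p_n as (dy, dx) pairs centred on (0, 0)."""
    return [
        (dy, dx)
        for dy in range(-(kernel_h // 2), kernel_h // 2 + 1)
        for dx in range(-(kernel_w // 2), kernel_w // 2 + 1)
    ]


def rotated_offsets(grid, theta):
    """Offsets delta p_n = R(theta) p_n - p_n, assuming M_r(theta) acts as a rotation."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = []
    for dy, dx in grid:
        rot_x = c * dx - s * dy
        rot_y = s * dx + c * dy
        offsets.append((rot_y - dy, rot_x - dx))
    return offsets


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(branch_features, attention_logits):
    """Y = ReLU(sum_i A'_i * X_i) at one location; each X_i is a C-dim vector."""
    weights = softmax(attention_logits)
    channels = len(branch_features[0])
    return [
        max(0.0, sum(w * x[ch] for w, x in zip(weights, branch_features)))
        for ch in range(channels)
    ]


if __name__ == "__main__":
    for shape in ((3, 3), (1, 3), (3, 1)):
        grid = regular_grid(*shape)
        print(shape, rotated_offsets(grid, math.radians(30.0)))
    print(fuse([[1.0, -2.0], [3.0, -4.0], [5.0, -6.0]], [0.0, 0.0, 0.0]))
```

With θ = 0 every offset vanishes and the operation reduces to a regular convolution, which matches the statement that DCN is turned into a regular convolution carrying angle information.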
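The feature alignment details are illustrated in Figure 8. For each feature map, the aligned feature vectors are obtained through interpolation, according to the five coordinates (orange points) of the refined bounding box. Following the method described in [4], we use feature bilinear interpolation to generate more accurate feature vectors and replace the original feature vectors, as illustrated in Figure 8b. The bilinear interpolation is formulated as follows:

$$val = val_{lt} \cdot area_{rb} + val_{rt} \cdot area_{lb} + val_{rb} \cdot area_{lt} + val_{lb} \cdot area_{rt}$$

where $val$ denotes the result of bilinear interpolation, $val_{lt}$, $val_{rt}$, $val_{rb}$, and $val_{lb}$ denote the feature values at the left-top, right-top, right-bottom, and left-bottom neighboring grid points, and each $area$ term denotes the area of the rectangle spanned by the sampling point and the diagonally opposite neighbor.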
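The interpolation step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single-channel feature map stored as a nested list, unit grid spacing, an angle in radians, and that the five sampling points are the box center and its four corners; the helper names are hypothetical.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at the fractional location (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    val_lt, val_rt = feature[y0][x0], feature[y0][x1]
    val_lb, val_rb = feature[y1][x0], feature[y1][x1]
    # Each neighbour is weighted by the area of the diagonally opposite rectangle.
    area_rb = (1.0 - dy) * (1.0 - dx)
    area_lb = (1.0 - dy) * dx
    area_rt = dy * (1.0 - dx)
    area_lt = dy * dx
    return val_lt * area_rb + val_rt * area_lb + val_rb * area_lt + val_lb * area_rt


def box_sample_points(x, y, w, h, theta):
    """Centre plus four corners of a rotated box (assumed five sampling points)."""
    c, s = math.cos(theta), math.sin(theta)
    points = [(x, y)]
    for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
        ox, oy = sx * w / 2.0, sy * h / 2.0
        points.append((x + c * ox - s * oy, y + s * ox + c * oy))
    return points


if __name__ == "__main__":
    fmap = [[0.0, 10.0], [20.0, 30.0]]
    print(bilinear_sample(fmap, 0.5, 0.5))
    for px, py in box_sample_points(0.5, 0.5, 0.6, 0.4, math.radians(20.0)):
        print(round(bilinear_sample(fmap, py, px), 3))
```

Sampling at the five points of the refined box, rather than at the original grid location, is what realigns the feature vector with the rotated object.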

RetinaNet-Based Rotation Detection and Loss Function
We achieve rotated bounding box detection by using the oriented rectangle representation method proposed in [4]. For completeness, we briefly introduce the method here. We use a vector with five parameters (x, y, w, h, θ) to represent an arbitrarily oriented bounding box, where (x, y) denotes the coordinates of the bounding box center, w and h denote the width and height of the bounding box, and θ denotes the rotation angle of the bounding box relative to the horizontal direction. Compared to the horizontal bounding box, an additional angular offset must be predicted in the regression subnet, for which the rotated bounding box is described as follows:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a), \quad t_\theta = \theta - \theta_a$$

$$t'_x = (x' - x_a)/w_a, \quad t'_y = (y' - y_a)/h_a, \quad t'_w = \log(w'/w_a), \quad t'_h = \log(h'/h_a), \quad t'_\theta = \theta' - \theta_a$$

where $(x, x_a, x')$ correspond to the ground-truth box, the anchor box, and the predicted box, respectively (likewise for y, w, h, θ).

The multi-task loss function is defined in Equation (10), which combines the regression loss with the classification loss $L_{cls}(p_n, t_n)$. Here, $N$ denotes the anchor number and $t'_n$ denotes a binary value ($t'_n = 1$ for the foreground and $t'_n = 0$ for the background). $v'_{nj}$ denotes the predicted offset vectors, $v_{nj}$ denotes the vector of the ground truth, $t_n$ denotes the instance label, and $p_n$ denotes the probability of the categories calculated by the sigmoid function. The hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ control the trade-off and are set to 1 by default.

The classification loss $L_{cls}$ is implemented using focal loss. In [23], the authors noticed that the imbalance between instance categories results in a low accuracy for a single-stage detector compared with that of a two-stage detector. They proposed focal loss to address this problem. Thus, we use focal loss to optimize our classification loss, whereby our detector maintains single-stage speed while improving the detection accuracy. Equation (11) shows the cross-entropy loss function from which focal loss is derived:

$$CE(p, y) = -\log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \quad (11)$$

where $y \in \{\pm 1\}$ specifies the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class with the label $y = 1$. Furthermore, a weighting factor $\alpha_t \in [0, 1]$ and a modulating factor $(1 - p_t)^\gamma$ ($\gamma \geq 0$) are introduced, as shown in Equation (12):

$$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) \quad (12)$$

The weighting factor balances the contributions of positive and negative instances, while the modulating factor down-weights well-classified examples, meaning that the training is relatively more focused on hard samples.
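The relation between cross-entropy and focal loss for a single binary prediction can be sketched as below. The function name `focal_loss` and the clamping constant `eps` are illustrative choices, and the defaults α = 0.25 and γ = 2.0 are the values reported in the implementation details of this paper.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one prediction.

    p: estimated probability of the positive class (label y = 1).
    y: ground-truth label, +1 or -1.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
    print(focal_loss(0.9, 1), focal_loss(0.1, 1))
    # With gamma = 0 the loss reduces to alpha-weighted cross-entropy.
    print(focal_loss(0.8, 1, alpha=0.5, gamma=0.0), -0.5 * math.log(0.8))
```

Setting γ = 0 removes the modulating factor, which makes it easy to check the sketch against plain cross-entropy.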
In the rotated object detection task, the regression loss can become very large at the angular boundary due to the periodicity of the angle. Therefore, the model has to regress in other, more complex forms, increasing the difficulty of regression. Yang et al. [15] proposed a loss function by introducing an IoU constant factor into the traditional smooth L1 loss. The smooth L1 loss is expressed by:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

The new regression loss can be divided into two parts, as shown in Equation (10).
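The boundary problem described above can be made concrete with a small sketch of the five-parameter offset encoding and the plain smooth L1 loss. It assumes that angles are in radians and that a rectangle is unchanged by a rotation of π, so that two angles near opposite ends of the range describe almost the same box; the function names are hypothetical and the IoU factor of [15] is not modelled here.

```python
import math


def encode_box(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h, t_theta) of a box relative to an anchor."""
    x, y, w, h, theta = box
    xa, ya, wa, ha, theta_a = anchor
    return (
        (x - xa) / wa,
        (y - ya) / ha,
        math.log(w / wa),
        math.log(h / ha),
        theta - theta_a,
    )


def smooth_l1(diff):
    """Smooth L1 loss of a scalar difference."""
    diff = abs(diff)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


def regression_loss(pred_offsets, gt_offsets):
    """Sum of smooth L1 terms over the five offsets."""
    return sum(smooth_l1(p - g) for p, g in zip(pred_offsets, gt_offsets))


if __name__ == "__main__":
    anchor = (50.0, 50.0, 40.0, 10.0, 0.0)
    gt = (50.0, 50.0, 40.0, 10.0, math.pi / 2 - 0.01)
    # Almost the same rectangle, expressed from the other end of the angle range.
    pred = (50.0, 50.0, 40.0, 10.0, -math.pi / 2 + 0.01)
    print(regression_loss(encode_box(pred, anchor), encode_box(gt, anchor)))
```

Although the two boxes differ by a rotation of only 0.02 rad, the angle offsets differ by almost π, so the plain smooth L1 loss is large; this is the situation that the IoU factor in [15] is meant to mitigate.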

Benchmark Datasets
Extensive experiments and ablation studies were conducted. We compared our detector with 8 other well-known detectors through experiments on two challenging satellite optical image benchmarks: DOTA [5] and HRSC2016 [33].
DOTA is the largest and most challenging dataset with both horizontal and oriented bounding box annotations for object detection in satellite or aerial optical images. It contains 2806 satellite images, whose sizes range from 800 × 800 to 4000 × 4000. DOTA contains objects with a wide variety of scales, orientations, and appearances. These images have been annotated by experts using 15 common object categories. The object categories include plane (PL), ship (SH), large vehicle (LV), small vehicle (SV), helicopter (HC), tennis court (TC), bridge (BR), ground track field (GTF), basketball court (BC), baseball diamond (BD), soccer ball field (SBF), storage tank (ST), roundabout (RA), harbor (HA), and swimming pool (SP). Among them, there are huge numbers of densely distributed objects, such as small vehicles, large vehicles, ships, and planes. There are many object categories with large aspect ratios, such as large vehicles, ships, harbors, and bridges. Two detection tasks, with horizontal bounding boxes and with oriented bounding boxes, can be performed on DOTA. In our experiment, we chose the task of detecting objects with oriented bounding boxes. An official website (https://captain-whu.github.io/DOTA/dataset.html (accessed on 1 January 2018)) is provided for the submission of the results. DOTA contains 1403 training images, 468 validation images, and 935 testing images, which are randomly selected from the original images.
HRSC2016 [33] is a challenging satellite optical imagery dataset for ship detection. It contains 1061 images collected from Google Earth and over 20 categories of ship instances with different shapes, orientations, sizes, and backgrounds. The images with the scenario of ships close to the shore in HRSC2016 were collected from six famous harbors, while the other images show the scenario of ships on the sea. The image size ranges between 300 × 300 and 1500 × 900. HRSC2016 contains 436 training images, 181 validation images, and 444 testing images. During the training and testing, we resized the images to 800 × 800. In our experiment, we chose the task of detecting ships with an orientated bounding box.

Implementation Details
We adopted ResNet101 FPN as the backbone of the experiment. The hyperparameters of the multi-task loss function were set to λ1 = 4, λ2 = 1, and λ3 = 2. The hyperparameters of the focal loss were set to α = 0.25 and γ = 2.0. SGD [34] was adopted as the optimizer. The initial learning rate was set to 0.04 and the learning rate was divided by 10 at each decay step. The momentum and weight decay were set to 0.9 and 0.0001, respectively. The learning rate warmup was set to 500 iterations. We used MMDetection [35] for training and trained all the models for 12 epochs on DOTA and 36 epochs on HRSC2016. We used a server with 4 NVIDIA TITAN Xp GPUs, using all 4 GPUs with a total batch size of 8 for training and a single GPU for inference.
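The learning rate schedule described above (linear warmup for 500 iterations, then a division by 10 at each decay step) can be sketched as a plain function. The text does not state the decay epochs or the warmup start ratio, so the values `decay_epochs=(8, 11)` and `warmup_ratio=1/3` below are assumptions chosen only for illustration, and `learning_rate` is a hypothetical helper rather than an MMDetection API.

```python
def learning_rate(iteration, iters_per_epoch, base_lr=0.04, warmup_iters=500,
                  warmup_ratio=1.0 / 3.0, decay_epochs=(8, 11)):
    """Step schedule with linear warmup.

    decay_epochs and warmup_ratio are assumed values, not taken from the paper.
    """
    if iteration < warmup_iters:
        progress = iteration / warmup_iters
        return base_lr * (warmup_ratio + (1.0 - warmup_ratio) * progress)
    epoch = iteration // iters_per_epoch
    num_decays = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * (0.1 ** num_decays)


if __name__ == "__main__":
    for it in (0, 250, 500, 8000, 11000):
        print(it, learning_rate(it, iters_per_epoch=1000))
```

The number of iterations per epoch depends on the dataset size and the batch size of 8, so it is left as a parameter.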

Ablation Study
In order to evaluate the impact of DFR, FPT, and data augmentation on our detector, we conducted ablation studies on the DOTA and HRSC2016 datasets. ResNet-50 pretrained on ImageNet was used as the backbone in the experiments. The weight decay and momentum were set to 0.0001 and 0.9, respectively. Detectors were trained using 4 GPUs with a total of 8 images per mini-batch (two images per GPU).

Ablation Study for DFR
In this subsection, we present the ablation study results for the original feature refinement module (FRM) and the proposed DFR. As shown in Table 1, RetinaNet has a 62.22% accuracy. By introducing FRM, R3Det (RetinaNet with refinement) obtained a 71.69% accuracy with ResNet101-FPN as the backbone and no multi-scale training or testing. FRM thus improved the accuracy by 9.47%. In this study, we introduced DFR instead of FRM to address feature misalignment. The accuracy with DFR was 73.10%, which is 1.41% higher than the accuracy with FRM. As shown in Table 2, the accuracy for some hard instance categories, such as BR, SV, LV, SH, and RA, increased by 2.06%, 7.71%, 2.8%, 9.42%, and 2.84%, respectively. We can see that the proposed DFR has a significant effect on improving the performance.

Ablation Study on FPT
As shown in Table 1, the accuracy was 73.10% without FPT and 73.77% with FPT. The proposed FPT therefore yields a modest improvement of 0.67%.
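The incremental gains quoted in the DFR and FPT ablations can be checked with a few lines of arithmetic. The sketch below only restates the accuracies given in the text; the row labels and the helper `incremental_gains` are introduced here for illustration.

```python
# Accuracies (%) quoted in the text for each configuration, in order.
ablation = [
    ("RetinaNet baseline", 62.22),
    ("+ FRM (R3Det)", 71.69),
    ("FRM replaced by DFR", 73.10),
    ("+ FPT", 73.77),
]


def incremental_gains(rows):
    """Return (label, gain over the previous row) in percentage points."""
    gains = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        gains.append((name, round(cur - prev, 2)))
    return gains


gains = incremental_gains(ablation)
for name, gain in gains:
    print(f"{name}: +{gain:.2f}")
```

The differences are absolute differences in accuracy (percentage points), not relative improvements.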

Ablation Study for Data Augmentation
A previous study showed that data augmentation is a very effective way to improve detection performance by enriching training datasets. In this subsection, we study the impact of data augmentation on the detection accuracy of our detector. The data augmentation methods used in the experiment include horizontal and vertical flipping, random graying, multi-scale training, and random rotation. As shown in Table 1, data augmentation improved the detection accuracy from 73.77% to 76.89%.
We compared our proposed detector with some state-of-the-art detectors on the DOTA dataset. The results reported here were obtained by submitting our detection results to the official DOTA evaluation server. The detectors involved in this experiment can be divided into three groups: multi-stage, anchor-free, and single-stage detectors. As shown in Table 3, the latest multi-stage detectors, such as RoI-Transformer [2], SCRDet [15], Gliding Vertex [10], and APE [36], achieved 69.56%, 72.61%, 75.02%, and 75.75% mAP, respectively. The anchor-free method DRN [32] achieved a 73.23% mAP. The single-stage detectors R3Det and R4Det with ResNet-152 achieved 73.73% and 75.84% accuracies, respectively. Our ADT-Det with ResNet-152 achieved the highest accuracy of 77.43%, which is 1.59% higher than the previous best result.
The work on R4Det [3] showed that feature recursion is an effective way to improve detection accuracy. We also adopted feature recursion in our pipeline; with it, our detector outperformed the state-of-the-art methods and achieved a 79.95% accuracy.
Some of the detection results of our detector are visualized in Figure 9. The results demonstrate that our detector can accurately detect most objects with arbitrary orientations, large aspect ratios, huge scale differences, and dense distributions.
HRSC2016 contains many ship instances with large aspect ratios and arbitrary orientations. RRPN was originally developed for oriented scene text detection. RoI-Transformer and R3Det are advanced satellite optical imagery detection methods. We performed comparative experiments with these methods, and the results are shown in Table 4. The scene text detection method is competitive on this satellite optical imagery dataset: RRPN [13] achieved a 79.08% mAP. Under the PASCAL VOC2007 metrics, the well-known multi-stage rotated object detector RoI-Transformer [2] achieved an 86.20% accuracy. The state-of-the-art single-stage methods, R3Det [4] and R4Det [3], achieved 89.26% and 89.56% accuracies, respectively. The proposed ADT-Det detector achieved the best detection performance, with an accuracy of 89.75%. This accuracy is close to the accuracy for ship detection in the DOTA experiment (88.94%), which further supports the advantage of using DFR to reduce the mismatch between arbitrarily oriented objects and axis-aligned receptive fields. Evaluated under the PASCAL VOC2012 metrics, the anchor-free method DRN achieved a 92.7% accuracy, while the proposed ADT-Det detector (with ResNet-152) achieved the best detection result, with an accuracy of 93.47%.
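The HRSC2016 results are reported under two protocols, so it helps to see how they differ. The PASCAL VOC2007 metric averages the interpolated precision at 11 recall thresholds, whereas the VOC2012 metric integrates the area under the interpolated precision-recall curve. The sketch below implements both on a toy ranking of detections; the helper names and the toy data are illustrative and are not taken from the paper's evaluation code.

```python
from typing import List, Sequence, Tuple


def pr_curve(tp_flags: Sequence[int], num_gt: int) -> Tuple[List[float], List[float]]:
    """Recall and precision after each detection, sorted by descending score."""
    recall, precision = [], []
    tp = 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        recall.append(tp / num_gt)
        precision.append(tp / rank)
    return recall, precision


def ap_voc2007(recall: Sequence[float], precision: Sequence[float]) -> float:
    """11-point interpolated AP (recall thresholds 0.0, 0.1, ..., 1.0)."""
    total = 0.0
    for i in range(11):
        t = i / 10
        candidates = [p for r, p in zip(recall, precision) if r >= t]
        total += max(candidates) if candidates else 0.0
    return total / 11


def ap_voc2012(recall: Sequence[float], precision: Sequence[float]) -> float:
    """Area under the monotonically non-increasing precision envelope."""
    r = [0.0] + list(recall) + [1.0]
    p = [0.0] + list(precision) + [0.0]
    # Make precision non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))


# Toy example: 7 ranked detections (1 = true positive), 5 ground-truth ships.
recall, precision = pr_curve([1, 1, 0, 1, 0, 1, 1], num_gt=5)
ap07 = ap_voc2007(recall, precision)
ap12 = ap_voc2012(recall, precision)
```

Because the two protocols can assign different values to the same set of detections, accuracies reported under VOC2007 and VOC2012 should only be compared within the same metric.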

Speed Comparison
Comparison experiments for detection speed and accuracy were carried out on HRSC2016. In the experiment, our ADT-Det detector was compared with eight other well-known methods. The detailed results are listed in Table 4, and the overall comparison is visualized in Figure 10. The multi-stage detector RoI-Transformer achieved an 86.2% accuracy and a speed of 6 fps when using ResNet101 as the backbone with an input image size of 512 × 800. The single-stage R3Det detector achieved an 89.26% accuracy and a speed of 10 fps. The existing state-of-the-art single-stage detector R4Det achieved an 89.56% accuracy, but its detection speed was slower than that of R3Det. Our ADT-Det detector achieved an 89.75% accuracy when evaluated under the PASCAL VOC2007 metrics and a speed of 12 fps with an input image size of 800 × 800. Furthermore, it reached 14.6 fps with an input image size of 600 × 600. The results demonstrate that our ADT-Det detector achieves the highest accuracy of all the investigated detectors while running very fast.
Table 4. Evaluation results with the accuracy and speed of some well-known detectors on HRSC2016. All models were evaluated under ResNet-152. * indicates that the result was evaluated under the PASCAL VOC2012 metrics.
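The frame rates quoted in the speed comparison can be converted to per-image latency, which is often easier to interpret. The sketch below only restates the fps values given in the text; the dictionary labels and the helper `latency_ms` are introduced here for illustration.

```python
def latency_ms(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per image."""
    return 1000.0 / fps


# Speeds quoted in the text (input size noted where it is stated).
reported_fps = {
    "RoI-Transformer (512 x 800)": 6.0,
    "R3Det": 10.0,
    "ADT-Det (800 x 800)": 12.0,
    "ADT-Det (600 x 600)": 14.6,
}

latencies = {name: latency_ms(fps) for name, fps in reported_fps.items()}
for name, ms in latencies.items():
    print(f"{name}: {ms:.1f} ms per image")
```

Since the quoted speeds were measured at different input sizes, the latencies indicate relative cost only under those respective settings.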

Conclusions
In this work, we identified inappropriate feature extraction as the primary obstacle preventing the high-performance detection of instances with arbitrary directions, large aspect ratios, and dense distributions. To address this obstacle, we proposed an adaptive dynamic refined single-stage transformer detector that aims to achieve high recall and speed. Our detector realizes rotated object detection with RetinaNet as the baseline to detect multi-scale and densely distributed objects. Firstly, the feature pyramid transformer (FPT) was introduced into the traditional feature pyramid network (FPN) to enhance feature extraction through a feature interaction mechanism. Secondly, the output features of FPT were fed into two post-processing steps, which account for the mismatch between the rotated bounding box and the axis-aligned receptive fields of a CNN. Dynamic feature refinement (DFR) was introduced in the refinement step. The key idea of DFR is to adaptively adjust the feature map and reconstruct a new feature map for arbitrary-oriented object detection, alleviating the mismatch between the rotated bounding box and the axis-aligned receptive fields. Extensive experiments and ablation studies were carried out on two challenging public satellite optical imagery datasets, DOTA and HRSC2016. The proposed detector achieved a 79.95% mAP on DOTA and a 93.47% mAP on HRSC2016, with a running speed of 14.6 fps for a 600 × 600 input image size. The results show that our method achieves state-of-the-art results for object detection on these optical imagery datasets.