RODFormer: High-Precision Design for Rotating Object Detection with Transformers

Aiming at Transformers' lack of a local spatial receptive field and the discontinuous boundary loss in rotating object detection, in this paper we propose a Transformer-based high-precision rotating object detection model (RODFormer). Firstly, RODFormer uses a structured Transformer architecture to collect feature information at different resolutions, enlarging the range over which features are gathered. Secondly, a new feed-forward network (spatial-FFN) is constructed. Spatial-FFN fuses the local spatial features of 3 × 3 depthwise separable convolutions with the global channel features of a multilayer perceptron (MLP) to remedy the deficiency of the FFN in local spatial modeling. Finally, based on the spatial-FFN architecture, a detection head is built with the CIOU-smooth L1 loss function, which regresses only the horizontal box when the rotated box is close to horizontal, so as to alleviate the loss discontinuity of the rotated box. Ablation experiments of RODFormer on the DOTA dataset show that the structured Transformer module, the spatial-FFN module and the CIOU-smooth L1 loss function module are all effective in improving the detection accuracy of RODFormer. Compared with 12 rotating object detection models on the DOTA dataset, RODFormer achieves the highest average detection accuracy (up to 75.60%), that is, RODFormer is more competitive in rotating object detection accuracy.


Introduction
Object detection is a core task in the field of computer vision and the basis for object tracking and behavior recognition [1]. Objects at arbitrary orientations are widely distributed in application scenarios, such as scene text detection and pipeline object detection. Because such objects are often small, inclined and dense, detecting oriented objects is very difficult [2]. Therefore, researchers have proposed many rotating object detection algorithms, including RRPN [3], R²CNN [4], RoI-Transformer [5] and SCRDet [6]. Although these methods achieve good performance, they are sensitive to anchor hyperparameters and are prone to loss discontinuity, leading to a decline in object detection accuracy.
Considering the complexity of detecting objects and the large number of small, scattered rotated objects, the DETR model constructed by Carion et al. [7] was the first to apply the Transformer to object detection. The Transformer is a fully attention-based encoder-decoder model built by Vaswani et al. [8]. DETR directly detects all objects by introducing object queries, achieving true end-to-end detection with less feature information. However, during Transformer initialization, each query uses the same weights for all positions, which makes training the model very expensive. To solve this problem, Zhu et al. [9] sparsely sampled the dense keys in the attention operation so that each query only needs to aggregate sparse keys, reducing the computational cost of the model. Dai et al. [10] constructed the UP-DETR model, which pretrains the object queries of DETR with a new pretext task (multi-query localization) and improves the convergence speed of the Transformers in DETR through unsupervised pretraining. Although DETR and its improved variants effectively improve training, the Transformer contains a feed-forward network (FFN) [11] composed of an MLP. Compared with convolutional layers, FFNs are more efficient and can better model long-range dependencies and position patterns. However, the fully connected layer operates on the global receptive field of each channel. For small-object detection, if the fully connected structure is retained, the object to be detected is submerged in the averaged background features, so the MLP lacks local spatial modeling ability [12].
To improve the Transformer's ability to extract local information, researchers improved DETR from three aspects: sparse attention [9], spatial priors [13] and structural redesign [14]. For example, Ding et al. [15] proposed the RepMLP model for image recognition, which utilizes the local spatial-modeling capability of a CNN to improve its local information-collection ability. Beal et al. [16] combined ViT and an RPN to construct ViT-FRCNN and verified for the first time that a Transformer backbone can be applied directly to images while maintaining a good classification effect. Wang et al. [17] proposed the pyramid vision transformer (PVT) model, which uses global down-sampling to design global subsampled attention (GSA). Liu et al. [18] constructed the Swin Transformer model, whose backbone is organized as a local-to-global combination, thereby avoiding the quadratic computation of the algorithm and improving convergence speed. Although the above methods improve the Transformer's ability to extract local information to a certain extent, the addition of a CNN in RepMLP greatly increases the computational load, and the lack of connections between different windows in the Swin Transformer limits the receptive field. At the same time, these methods are anchor-based, and anchor-based detection ignores the matching mechanism for extreme samples, which is not in line with the design philosophy of DNNs.
In order to solve the problems of anchors, researchers proposed the anchor-free method [18]. Compared with anchor-based detection, the biggest advantage of anchor-free detection is speed. Anchor-free methods do not require preset anchors but only regress the object center point, width and height on the feature map, which improves the information collection of the model and alleviates the boundary-loss discontinuity of the rotated box. For example, SCRDet introduces an IoU constant factor into the smooth L1 loss [19] to alleviate the boundary problem of the rotated box. Zhao et al. [20] proposed a polar coordinate method (PolarDet), which uses center-point positioning, orients with four polar angles and measures with a polar ratio system to further improve detection accuracy. Han et al. [21] proposed a single-shot alignment network (S²A-Net) that employs an active rotation filter to encode orientation information, alleviating the inconsistency between classification scores and localization accuracy.
In view of the Transformer's lack of local spatial modeling capability, we provide an effective hybrid architecture, spatial-FFN, to further improve object detection accuracy. Spatial-FFN combines the local spatial characteristics of 3 × 3 depthwise separable convolution with the global channel characteristics of an MLP. Using the structured Transformer stage module (STS), the spatial-FFN module (SFM) and the CIOU-smooth L1 loss function (C-SL1) as the ablation modules of RODFormer, the results show that all three modules (STS, SFM and C-SL1) are effective in improving the object detection accuracy of the model. On the DOTA [22] dataset, comparing RODFormer with 12 rotating object detection models, RODFormer has the highest mAP value, that is, RODFormer has the best object detection effect.

Methods
The RODFormer framework is shown in Figure 1. RODFormer is mainly composed of a backbone, neck and head. Firstly, RODFormer's backbone uses structured Transformers to extract image features. Secondly, the obtained multi-level features are input into the neck and enhanced by the constructed spatial-FFN structure to compensate for the FFN's lack of local spatial modeling ability. Finally, the enhanced multi-level features are passed to the head, and a first-order, anchor-free, eight-parameter regression method is used for prediction, which reduces the complexity of a two-stage structure and alleviates the loss discontinuity of the rotated box.



Backbone
(1) Structured design. Unlike ViT, which always uses patches of 16 × 16 size, smaller patches are beneficial for predicting dense small objects. For an image of H × W × 3 size, RODFormer constructs four stages with different resolutions to realize the structuring of the Transformer and divides the input image by patch partition into patches of 4 × 4 size. From stage 1 to stage 4, the feature resolution of stage i is H/2^(i+1) × W/2^(i+1) × C_i, i ∈ {1, 2, 3, 4}, with C_{i+1} > C_i. Each stage consists of blocks with the same structure but different numbers (the specific parameters are shown in Table 1 in Section 2.1). The structure of each block is shown in Figure 2.
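The per-stage feature resolutions above can be sketched as follows; this is an illustrative helper (not the authors' code), and the channel widths C_i are hypothetical examples.

```python
# Sketch: compute the per-stage feature-map shapes H/2^(i+1) x W/2^(i+1) x C_i
# for a 1024 x 1024 input. The channel list [64, 128, 256, 512] is an
# assumption for illustration; the paper's values are in its Table 1.
def stage_resolutions(H, W, channels):
    """Return (height, width, channels) for stages i = 1..4."""
    shapes = []
    for i, C in enumerate(channels, start=1):
        shapes.append((H // 2 ** (i + 1), W // 2 ** (i + 1), C))
    return shapes

shapes = stage_resolutions(1024, 1024, channels=[64, 128, 256, 512])
# Stage 1 downsamples by 4 (4 x 4 patches); each later stage halves again.
```

Note that the stage-1 resolution H/4 × W/4 follows directly from the 4 × 4 patch partition.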


(2) Global subsampled attention. Traditional Transformer object detection adopts the self-attention model [23]. The long-distance information-capture ability of self-attention is comparable to that of an RNN and far exceeds that of a CNN. Self-attention includes scaled dot-product attention (Equation (1)) and multi-head attention (Equation (2)):

Attention(Q, K, V) = softmax(QKᵀ/√d_k)V (1)

MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O, head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) (2)
where Q, K and V are the query, key and value vector sequences, respectively, and √d_k scales the inner product so that the softmax result does not saturate at 0 or 1.
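A minimal numpy sketch of Equation (1), scaled dot-product attention, for a single head; this is an illustration of the standard operation, not RODFormer's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling keeps softmax away from 0/1
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 8)
```

Multi-head attention (Equation (2)) simply runs this operation h times on linearly projected Q, K, V and concatenates the results.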
To add global attention while limiting the complexity of self-attention, the easiest way is to append a global attention layer after each local attention block so that information can be exchanged across windows, but this further increases the time complexity to O((k1 × k2 × m × n)² × d). Therefore, RODFormer adopts the global subsampled attention (GSA) method [24,25], which uses a subsampling function to reduce the time complexity of the whole process (local plus global attention) to O(H²W²d/(k1k2) + k1k2HWd). Here the feature map is divided into m × n sub-windows of size k1 × k2. Following Ref. [17], the sequence length is reduced using the reduction ratio R:

K_new = Norm(Reshape(K, N/R, C · R)W^S) (3)
where K is the sequence to be reduced; the new K has dimension N/R × C, so the complexity of the attention mechanism is reduced from O(N²) to O(N²/R²). From stage 1 to stage 4, R is set to [1, 2, 4, 8].
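The key-sequence reduction can be sketched as follows. This is a hedged illustration: it uses average pooling as the subsampling function, which is one common choice in PVT-style spatial-reduction attention, and may differ from the learned projection actually used.

```python
import numpy as np

def reduce_sequence(K, R):
    """Subsample a (N, C) token sequence to (N // R, C) by average pooling,
    so attention over the reduced keys is cheaper than over all N tokens."""
    N, C = K.shape
    assert N % R == 0, "sequence length must be divisible by R"
    return K.reshape(N // R, R, C).mean(axis=1)

K = np.arange(32, dtype=float).reshape(8, 4)  # N=8 tokens, C=4 channels
K_red = reduce_sequence(K, R=2)               # shape (4, 4)
```

Each query then attends to N/R keys instead of N, which is where the complexity saving comes from.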
(3) Spatial-FFN. Because the resolution of the positional encoding (PE) is fixed, when the resolution is layered the position code must be interpolated, which decreases accuracy. Each layer of the encoder and decoder in a traditional Transformer contains a channel-based, globally modeling FFN (Equation (4)). The FFN consists of two linear transformations with a ReLU activation:

FFN(x) = max(0, xW1 + b1)W2 + b2 (4)

A CNN obtains the global feature information of the image through local perception (Equation (5)):

y = f(w · x + b) (5)
where W1 and W2 are weight matrices, and b1 and b2 are bias vectors; f represents the nonlinear activation function, w and b represent the weight and bias of the fully connected layer, respectively, and x represents the input feature. However, attention in Transformers is position-invariant and requires position embedding to determine feature information. At the same time, the FFN lacks local spatial modeling capability, while a CNN adds complexity to the network structure. Therefore, RODFormer introduces spatial-FFN, which integrates the global capability of the FFN and the local capability of 3 × 3 depthwise separable convolution [26] by means of upsampling through layer normalization [27]. Because point-wise convolution and position-wise FFN are equivalent, introducing spatial-FFN into each Transformer block enables the network to model local spatial relationships. The built-in properties of spatial-FFN allow position information to be removed from the network without affecting performance, which both strengthens the local modeling effect of the network and avoids the structural complexity of CNNs. The structure of spatial-FFN is shown in Figure 3.
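The spatial-FFN idea can be sketched in numpy as a depthwise 3 × 3 spatial branch fused with a per-position MLP channel branch. This is an assumption-laden illustration, not the authors' implementation: the fusion here is elementwise addition, and normalization/upsampling details are omitted.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depthwise 3x3 convolution with zero padding.
    x: (H, W, C); kernels: (3, 3, C), one 3x3 filter per channel."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] * kernels[i, j, :]
    return out

def spatial_ffn(x, kernels, W1, W2):
    """Fuse a local spatial branch with a global channel (MLP) branch."""
    local = depthwise_conv3x3(x, kernels)   # local spatial features
    hidden = np.maximum(x @ W1, 0.0)        # MLP branch with ReLU (Eq. 4)
    global_ = hidden @ W2                   # project back to C channels
    return local + global_                  # simple additive fusion (assumed)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))
y = spatial_ffn(x, rng.standard_normal((3, 3, 16)),
                rng.standard_normal((16, 64)), rng.standard_normal((64, 16)))
```

The depthwise branch supplies the local receptive field the plain FFN lacks, while the MLP branch keeps the channel-wise global modeling.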

In Figure 3, the two branches are the MLP and the 3 × 3 depthwise separable convolution, and + stands for upsampling. Layer normalization (LN) normalizes the different channels of the same sample. Because LN is independent of the batch size, the number of samples does not affect the amount of data involved in the LN calculation. The expression of LN is:

h = g ⊙ (a − μ)/√(σ² + ε) (6)

where σ and μ are the normalized statistics of LN, a is the value to be normalized, g is the gain and ε prevents division by zero.
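The LN expression above can be sketched directly in numpy; a minimal per-sample normalization over the channel axis.

```python
import numpy as np

def layer_norm(a, g, eps=1e-5):
    """Normalize each sample across its channels: g * (a - mu) / (sigma + eps).
    Statistics are computed per sample, so batch size is irrelevant."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return g * (a - mu) / (sigma + eps)

a = np.array([[1.0, 2.0, 3.0, 4.0]])
h = layer_norm(a, g=np.ones(4))
# After LN each row has approximately zero mean and unit variance.
```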
(4) Patch merging. For each patch of the image, ViT uses patch merging to unify N × N × 3 patches into a 1 × 1 × C vector and obtain a hierarchical feature map. However, ViT combines non-overlapping feature blocks and cannot maintain the local continuity of patches. Therefore, following Ref. [28], the overlapping patch-merging method is used to convert the feature dimensions of the different stages into C1, C2, C3 and C4, respectively, before sending them into the Transformer blocks.
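Overlapping patch merging can be sketched as a strided sliding window whose stride is smaller than its kernel. The kernel/stride values (3 and 2) below are illustrative assumptions; the point is that adjacent windows share pixels, preserving local continuity at patch borders.

```python
import numpy as np

def overlapping_patches(x, k=3, s=2):
    """x: (H, W) single-channel map; returns (n_h, n_w, k*k) flattened
    overlapping windows (kernel k, stride s < k so windows overlap)."""
    H, W = x.shape
    n_h = (H - k) // s + 1
    n_w = (W - k) // s + 1
    out = np.empty((n_h, n_w, k * k))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = x[i * s:i * s + k, j * s:j * s + k].ravel()
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
p = overlapping_patches(x)  # (3, 3, 9): neighboring windows share pixels
```

In practice this is usually implemented as a strided convolution with kernel larger than stride, followed by a linear projection to the stage's channel width.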

Neck and Heads
(1) Neck. The neck is used to collect and strengthen different feature maps, and it consists of multiple bottom-up and top-down paths. The PANet structure adopts bottom-up path augmentation to recover, through adaptive feature pooling, the information corrupted between each proposal region and all feature levels; its structure is shown in Figure 4a. RODFormer uses the bidirectional fusion method shown in Figure 4b to enhance the feature information of the image. All features adopt the spatial-FFN structure. RODFormer changes the global channel of the FFN through spatial-FFN, avoiding the structural complexity of a CNN and the limitations of the FFN.

(2) Head. Using the structured Transformer, four parallel output predictions are obtained from the four Transformer stages. The head includes a classification branch and a regression branch, both of which use spatial-FFN to predict image features. The structure of the head is shown in Figure 5. The classification branch classifies objects and object categories, and the regression branch handles the regression of the rotated box.
To reduce the complexity of the structure, RODFormer adopts a first-order, anchor-free mode, which reduces the number of predictions per position from 3 to 1. RODFormer directly predicts the two offsets of the grid center point, as well as the height and width of the predicted box. RODFormer adopts the eight-parameter regression method [29]. In addition to the four basic regression points, this definition also includes the four regression points of the horizontal box. The upper-left corner is defined as the starting point, and the remaining points are arranged in counterclockwise order, as shown in Figure 6. To solve the loss discontinuity of the rotated box, when the ratio of the rotated-box area to the area of the original horizontal box is close to 1, only the horizontal box is regressed.
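The horizontal-box fallback rule can be sketched as a simple area-ratio test. The 0.95 threshold below is an illustrative assumption; the paper only states that the ratio must be "close to 1".

```python
def regression_mode(rot_area, horiz_area, thresh=0.95):
    """Decide which box to regress: when the rotated box covers nearly the
    same area as its horizontal bounding box, regress only the horizontal
    box to avoid the boundary-loss jump near zero rotation.
    thresh=0.95 is illustrative, not taken from the paper."""
    ratio = rot_area / horiz_area  # a rotated box never exceeds its hull
    return "horizontal" if ratio >= thresh else "rotated"

# A box rotated ~45 degrees fills only ~50% of its horizontal hull:
mode = regression_mode(50.0, 100.0)  # -> "rotated"
```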
Since this paper predicts remote sensing rotated images under the Transformer framework, the prediction is not an ordered set like a traditional object detection result but an unordered set. To limit the gradient value, the horizontal box of RODFormer adopts the CIOU loss function (with coefficient 2) (Equation (7)) [30], and the rotated box adopts smooth L1 as the loss function (Equation (8)) [19]:

L_CIOU = 1 − IoU(A, B) + ρ²(A_ctr, B_ctr)/c² + αv, v = (4/π²)(arctan(w_gt/h_gt) − arctan(w/h))², α = v/((1 − IoU) + v) (7)

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise (8)

where A and B are the predicted box and the ground-truth box, respectively; A_ctr and B_ctr are the coordinates of their center points; ρ is the Euclidean distance; c is the diagonal length of the minimum bounding box of A and B; w_gt and h_gt are the width and height of the ground-truth box; w and h are the width and height of the predicted box; and x is the elementwise difference between A and B.
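A minimal sketch of Equations (7) and (8) for axis-aligned (x1, y1, x2, y2) boxes; an illustration of the standard CIoU and smooth L1 formulas, not the paper's training code.

```python
import numpy as np

def ciou_loss(A, B):
    """CIoU loss: 1 - IoU + center-distance term + aspect-ratio term.
    A is the predicted box, B the ground-truth box, as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = A
    bx1, by1, bx2, by2 = B
    # Intersection over union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Squared center distance over squared enclosing-box diagonal (rho^2 / c^2)
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term v and trade-off weight alpha
    v = (4 / np.pi ** 2) * (np.arctan((bx2 - bx1) / (by2 - by1))
                            - np.arctan((ax2 - ax1) / (ay2 - ay1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear beyond |x| = 1."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)
```

For identical boxes the CIoU loss is exactly 0; the smooth L1 branch keeps gradients bounded for large regression errors.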

Experiments
In this section, we demonstrate the effectiveness of RODFormer on the commonly used, publicly available aerial dataset (DOTA). After introducing the experimental settings and evaluation metrics, RODFormer is compared with state-of-the-art models.

Datasets
The main current object detection datasets are shown in Table 1. As Table 1 shows, the DOTA dataset contains the smallest objects (10-50 pixels), and since this paper mainly targets small objects, the DOTA dataset is used as the training sample.
Since object sizes in the DOTA dataset vary greatly, training RODFormer is difficult. For example, a boat can be as small as 40 pixels, while a bridge can be as large as 1200 pixels. At the same time, objects in the DOTA dataset also exhibit high spatial resolution and large aspect ratios. Therefore, in order to detect small objects effectively, the DOTA dataset needs to be cropped into images of the same size before training RODFormer.
The DOTA dataset consists of 2806 aerial images with 188,282 instance objects annotated by horizontal and rotated boxes. The DOTA dataset has 15 common detection categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP) and helicopter (HC). The training, validation and test sets comprise 1411, 458 and 937 images, respectively. Image sizes range from 800 × 800 to 4000 × 4000 pixels. Due to the large variation range of objects in the DOTA dataset (10-50 pixels), testing on the DOTA dataset has more practical application value.

Experimental Environment and Evaluation Index
The experimental environment is shown in Table 2. Following Ref. [22], RODFormer uses average precision (AP) and mean average precision (mAP) to evaluate the detection accuracy of the model. The total batch size of RODFormer is set to 16, i.e., 16 images per training step. The total number of training epochs, the initial learning rate and the weight decay rate are set to 300, 0.0001 and 0.0001, respectively. The IoU threshold for the AP score is set to 0.1.
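The AP metric above can be sketched as the area under the precision-recall curve with the usual monotonic-precision envelope. The all-point interpolation used here is an assumption (the paper does not specify its interpolation scheme); mAP is then the mean of AP over the 15 classes.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]  # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Perfect precision up to recall 0.5, then precision 0.5: AP = 0.75
ap = average_precision(np.array([0.5, 1.0]), np.array([1.0, 0.5]))
```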

Experimental Results and Analysis
Since the images in the DOTA dataset are large, before training they are cropped into smaller images of 800 × 800 pixels. New label information is generated for the cropped images to facilitate model training. About 28,000 small images are obtained after cropping.
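The cropping step can be sketched as tiling each large image into 800 × 800 windows. The non-overlapping stride and inward shift of the last row/column are assumptions; the paper does not state its crop stride or overlap handling.

```python
def tile_origins(H, W, tile=800):
    """Top-left corners of tile x tile crops covering an H x W image.
    The last row/column of tiles is shifted inward so every crop stays
    fully inside the image (assumed behavior, not stated in the paper)."""
    ys = list(range(0, max(H - tile, 0) + 1, tile))
    xs = list(range(0, max(W - tile, 0) + 1, tile))
    if H > tile and ys[-1] != H - tile:
        ys.append(H - tile)
    if W > tile and xs[-1] != W - tile:
        xs.append(W - tile)
    return [(y, x) for y in ys for x in xs]

origins = tile_origins(4000, 4000)  # 5 x 5 = 25 tiles for a 4000 px image
```

Box labels falling inside each tile would then be re-expressed in tile-local coordinates to produce the new label files.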
(1) Ablation experiments. Table 3 shows the ablation results of RODFormer; bold indicates the maximum value in each column, and all values are in %. As shown in Table 3, when the backbone is ResNet50 or ResNet125, the mAP is 63.89% and 66.85%, respectively. When using ViT-B4, the mAP is only 69.20%. Adding the STS module increases the mAP by 1.18%; adding the SFM module increases it by a further 3.06%; finally, adding the C-SL1 module increases it by another 2.16%, reaching 75.60%. The results in Table 3 show that the mAP rises as each module is added cumulatively, that is, the STS, SFM and C-SL1 modules all help to improve the detection accuracy of rotated objects, proving the effectiveness of each module.
It can be seen from Table 4 that RODFormer has the best mAP. For single-class detection accuracy, RODFormer's AP values are within 0.5% of the best results on PL, TC, and BC. At the same time, the results demonstrate that RODFormer has the highest detection accuracy when detecting small and dense rectangular objects, such as BR, SV, LV and SH, mainly because RODFormer uses depthwise separable convolution and Transformer to effectively combine local and global visual representation information. The clever combination of these two approaches can efficiently encode local and global information. Dense objects have richer local features, which lead to higher accuracy. For the detection of sparse targets, due to the lack of local spatial features, the accuracy of RODFormer is slightly lower than that of other detection networks. Due to the removal of positional embeddings and reduced computational cost, our method outperforms other methods in overall accuracy.
To visually display the detection effect of RODFormer, some detection results of RODFormer and IRetinaNet on the DOTA dataset are visualized in Figure 7. Figure 8 shows the visualization results of RODFormer on several categories of images in the DOTA dataset. As can be seen from Figure 7, IRetinaNet is prone to misjudging small, dense targets such as ships and large and small vehicles. On sparse objects, RODFormer has accuracy similar to IRetinaNet, but RODFormer's eight-parameter regression makes the predicted box fit better, for example on ground track fields and tennis courts. Compared with IRetinaNet, RODFormer is superior in overall accuracy, which demonstrates RODFormer's handling of local spatial features and provides ideas for the development of subsequent Transformers.
As shown in Figure 8, this paper presents a small image containing a typical scene, and RODFormer can accurately detect the position of the object. The results show that RODFormer can handle challenging situations, namely rotating objects in high-density or cluttered scenes.
For example, in Figure 8a, the main objects in the image are boats. Since the rotation angles of these objects are almost equal, the use of distance loss suffers from a boundary problem, i.e., it is difficult for the model to distinguish which side is longer. To reduce the loss, RODFormer uses an eight-parameter regression method to synchronously regress anchor box corners. Therefore, we can see that the method can effectively solve the boundary problem and obtain accurate results.
Our method is an improvement on Transformer. To verify the combined effect of these techniques, we conduct ablation experiments and compare the accuracy with that of some other methods. Based on the above experimental results, RODFormer is better than other methods in object detection.


Conclusions
In this paper, the RODFormer model is proposed for rotating object detection. RODFormer is composed of a backbone, neck and head. The structured Transformer is used as the backbone to avoid the structural complexity of a CNN, and the spatial-FFN is designed to solve the FFN's lack of a local spatial receptive field. A lightweight detection head is built on the spatial-FFN architecture to alleviate the loss-discontinuity problem of the rotated box. Ablation results show that the three proposed modules (the structured Transformer stage module, the spatial-FFN module and the CIOU-smooth L1 loss function module) all help to improve detection accuracy. On the DOTA dataset, RODFormer is compared with 12 advanced rotating object detection models. The results show that RODFormer's mAP is the best; for dense object detection, RODFormer's accuracy is higher because dense objects provide richer local features. Conversely, for sparse object detection, RODFormer's accuracy is slightly behind other detection networks due to insufficient local spatial features. The visual comparison results further demonstrate the good detection performance of RODFormer on the DOTA dataset.