Hybrid Network Model: TransConvNet for Oriented Object Detection in Remote Sensing Images

Abstract: The complexity of backgrounds, the diversity of object scale and orientation, and the inherent limitations of convolutional neural networks (CNNs) have long been challenges for oriented object detection in remote sensing images (RSIs). This paper designs a hybrid network model to meet these challenges and further improve oriented object detection. The inductive bias of a CNN makes the network translation invariant, but it is difficult to adapt to RSIs, in which objects appear at arbitrary orientations. Therefore, this paper designs a hybrid network, TransConvNet, which integrates the advantages of CNNs and self-attention-based networks: it emphasizes the aggregation of global and local information, compensates for the CNN's lack of rotation invariance with strong contextual attention, and adapts to the arbitrary orientation of objects in RSIs. In addition, to handle the influence of complex backgrounds and multiple scales, an adaptive feature fusion network (AFFN) is designed to improve the information representation ability of feature maps at different resolutions. Finally, an adaptive weight loss function is used to train the network and further improve detection. Extensive comparison and ablation experiments on the DOTA, UCAS-AOD, and VEDAI data sets demonstrate the effectiveness of the proposed method for oriented object detection in remote sensing images.


Introduction
As one of the important contents of computer vision task, object detection in remote sensing images (RSIs) has broad application prospects in the fields of environmental monitoring, military investigation, national security, intelligent transportation, heritage site reconstruction, and so on.
Different from natural images, RSIs are characterized by variable object orientation, a large proportion of small objects, dense object distribution, large variation in object shape, multiple scales, and complex backgrounds. To address these characteristics, many studies have proposed improved models and achieved good results. Yang et al. [1] built on Faster R-CNN and proposed a clustering proposal network to replace the RPN structure of the original network, capturing large-scale clustered targets in aerial images. Chen et al. [2] proposed a multi-scale spatial and channel-aware attention mechanism to optimize the features in the Faster R-CNN [3] backbone, effectively suppressing the interference of complex backgrounds in RSIs. Wang et al. [4] designed a feature reflow pyramid network to fuse adjacent high-level and low-level features, which effectively improved detection of multi-category and multi-scale targets in RSIs. For the multi-directional targets ubiquitous in optical remote sensing images, Zhang et al. [5] replaced the RPN structure of the Faster R-CNN network with a rotated region proposal network that generates oriented proposals.
Extensive experiments are performed on the DOTA data set [18], the UCAS-AOD data set [19], and the VEDAI data set [20]. The contributions of the AFFN and AWLF to detection performance are verified by an ablation study, and comparisons with advanced models show that our method achieves the best detection results. The main contributions of this paper are summarized as follows:
• We design a hybrid backbone network named TransConvNet, which improves object detection in RSIs with complex backgrounds by extracting stronger feature information.
• According to the multi-category and multi-scale characteristics of RSIs, an adaptive feature fusion network (AFFN) is proposed so that feature maps at different resolutions and stages contain balanced semantic and detailed information, which improves detection accuracy.
• The adaptive weight loss function (AWLF) is employed for multi-task prediction to balance the losses of the different tasks and better train the network.
The remainder of this article is organized as follows. The related work is briefly reviewed in Section 2. The detail of the proposed method is introduced in Section 3. In Section 4, the experimental results are reported and analyzed. Finally, the conclusion is presented in Section 5.

Convolutional Neural Network
The strong feature extraction ability of CNNs keeps them the mainstream choice of backbone network in computer vision (e.g., VGG [21], ResNet [22], EfficientNet [23], MobileNet [24]). The residual connection introduced by ResNet [22] allows networks to go much deeper and greatly improves performance. In this paper, we also draw on this structure and propose a new network structure on its basis.

Self-Attention Based Network
The success of the self-attention-based model in machine translation and natural language processing [10] has inspired attempts to apply this idea to image recognition [11,25,26] and object detection [12,27]. The core of these models is the self-attention mechanism, which obtains a global representation through the calculation of self-attention. An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, and values are vectors obtained from linear mappings of the input. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. At present, the calculation of self-attention can be divided into scaled dot-product attention, multi-head attention, and local window attention. They can be described as follows:

Scaled Dot-Product Attention
Value (V) is weighted according to the similarity between query (Q) and key (K); Q and K of dimension d_k and V of dimension d_v are obtained from the same input through linear transformations. Scaled dot-product attention is defined as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Multi-Head Attention
Instead of performing a single attention function, multi-head attention linearly projects the queries, keys, and values h times and performs the attention function in parallel. These are concatenated and once again projected, resulting in the final values. The calculation process is as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
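For concreteness, scaled dot-product attention can be sketched in a few lines of NumPy (a minimal, framework-free illustration, not the implementation used in this paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (..., n_q, n_k)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights
```

Multi-head attention simply runs this function h times on h different learned projections of Q, K, and V and concatenates the results.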

Local Window Attention
In the standard transformer mechanism [10], the computation of global attention makes the computational complexity grow quadratically with the length of the input sequence, which is unsuitable for computer vision problems that require dense predictions or high-resolution images. In [13], the authors proposed self-attention within a local window to reduce the computational and storage complexity: the image is evenly divided into non-overlapping windows, each window contains M × M patches, and the multi-head attention calculation is performed within each window.
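The non-overlapping window split can be sketched as follows (a NumPy illustration; padding the feature map to a multiple of M is assumed to happen beforehand):

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping M x M windows.

    Returns an array of shape (num_windows, M*M, C); attention is then
    computed independently inside each window, so the cost is linear in
    the number of windows rather than quadratic in H*W.
    """
    H, W, C = x.shape
    assert H % M == 0 and W % M == 0, "pad H and W to multiples of M first"
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)
```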

Feature Fusion Network
In many deep learning tasks (such as object detection and image segmentation), fusing features of different scales is an important means of improving performance. Low-level features have higher resolution and contain more location and detail information, but they carry weaker semantics and more noise because they pass through fewer convolutions. High-level features have stronger semantic information but low resolution and poor perception of details. Fusing them efficiently is key to improving the model. SSD [28] and MS-CNN [29] do not fuse features; they predict on multi-scale features separately and then integrate the prediction results. FPN [14] performs feature pyramid fusion and predicts after fusion. To better address the problems caused by scale changes in object detection, M2Det [30] proposes a more effective feature pyramid structure, MLFPN (multi-level feature pyramid network). For FPN-based single-stage detectors, the inconsistency between different feature scales is the main limitation; YOLOv3-ASFF [31] proposes a new data-driven pyramid feature fusion method, called adaptive spatial feature fusion (ASFF).

Materials and Methods
The overall framework of the detection method in this paper is presented in Figure 1. It mainly includes the TransConvNet backbone network, the adaptive feature fusion network (AFFN), and the multi-task prediction head. The TransConvNet backbone network (shown in Figure 2) includes four stages, in which S1, S2, S3, and S4 denote the feature maps extracted at the first, second, third, and fourth stages of the network, with sizes 1/4, 1/8, 1/16, and 1/32 of the original input image size, respectively. S2, S3, and S4 are sent into the AFFN to obtain feature maps P2, P3, P4, P5, and P6 with different resolutions, which are 1/8, 1/16, 1/32, 1/64, and 1/128 of the original size, respectively. These five feature maps are sent to the detection head network to predict the object classification confidence, center confidence, angle, and distance offsets, respectively. Feature maps of different sizes predict objects in different size ranges: the regression ranges of P2, P3, P4, P5, and P6 are (0, 64], (64, 128], (128, 256], (256, 512], and (512, ∞), respectively. The following is a detailed introduction to the TransConvNet backbone network (shown in Figure 2), the adaptive feature fusion network (the AFFN module in Figure 1), the multi-task detector head, and the adaptive weight loss function.

TransConvNet Backbone Network
As shown in Figure 2, the TransConvNet backbone network is an essential component of our method for feature extraction. It is mainly composed of a Patchify Stem and four stages, each containing different numbers of transformer blocks and conv blocks.

Patchify Stem
The main function of the Patchify Stem is to convert a 2D RGB image (H × W × 3) into 1D sequence data. In classical papers, such as ViT [11] and Swin Transformer [13], the usual approach is to first hard-segment the image into non-overlapping image blocks (P × P), and then linearly encode each image block through a linear embedding layer, projecting it to dimension C and converting it into 1D sequence data. In the code implementation, a convolution with kernel size P × P and stride P is used. However, this kind of hard segmentation easily prevents the network from modeling the local structure of the image (such as edges and lines), so a large number of training samples is needed to reach performance similar to a CNN.

In this paper, we adopt two convolution operations with a convolution kernel size of (3 × 3) and a stride of 2 to realize the aggregation of local features, and then connect a (1 × 1) convolution to adjust the dimension to C and realize the integration of cross-channel information. Finally, the feature of H/4 × W/4 × C is obtained and transformed into sequence data of (H/4 × W/4) × C through reshape operation.
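The shape bookkeeping of the stem can be traced as follows (a sketch assuming padding of 1 on the two 3 × 3 stride-2 convolutions):

```python
def patchify_stem_shapes(H, W, C):
    """Trace tensor shapes through the stem described above: two 3x3
    stride-2 convolutions (padding 1 assumed) aggregate local features,
    a 1x1 convolution sets the channel dimension to C, and a reshape
    flattens the H/4 x W/4 map into sequence data."""
    h, w = H, W
    for _ in range(2):  # 3x3 conv, stride 2, padding 1
        h = (h + 2 * 1 - 3) // 2 + 1
        w = (w + 2 * 1 - 3) // 2 + 1
    # the 1x1 conv keeps spatial size and sets channels to C
    return (h, w, C), (h * w, C)  # feature map shape, sequence shape
```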

Transformer Block and Conv Block
Each stage of the backbone network includes two modules: transformer block and conv block.
The transformer block is shown in Figure 2a. In order to reduce the complexity of the self-attention calculation, the input data are first passed through a (1 × 1) convolution to reduce the dimension. By arranging the windows, the image is evenly divided into windows of size (M × M) in a non-overlapping manner, the attention calculation is performed inside each window, and a (1 × 1) convolution at the output increases the dimension again. The operation in the transformer block can be expressed as:

z^l = W-MSA(LN(z^(l−1))) + z^(l−1)

where z^(l−1) and z^l are the input and output of the l-th transformer block, LN is layer normalization, and W-MSA is the window-based multi-head self-attention calculation.
After the transformer block, the conv block is connected (shown in Figure 2b) to realize the fusion of data and the interaction of data between windows, and to model the global relationship. This attention with local and global receptive fields helps the model learn strong informative representations of images.
Each conv block consists of three convolutional layers with 1 × 1, 3 × 3, and 1 × 1 filters; downsampling is performed in the first conv block of each stage with a stride of 2. The process can be expressed as:

ẑ^(l+1) = Conv_1×1(Conv_3×3(Conv_1×1(z^l)))

Residual connections follow two simple design rules: (a) after the first conv block, the feature map size is halved; (b) after the second conv block, the feature map size is unchanged. The residual connections can be expressed as:

z^(l+1) = ẑ^(l+1) + F(z^l)

where z^l and z^(l+1) are the input and output of the l-th conv block, and F is an identity mapping when the feature map size is unchanged and a strided projection when it is halved. This paper builds four model variants, TransConvNet-T, TransConvNet-S, TransConvNet-B, and TransConvNet-L, as shown in Table 1.

Adaptive Feature Fusion Network
The network structure is shown in the AFFN module in Figure 1. The adaptive feature fusion network is divided into two stages. The first stage is the FPN [14] structure, which transfers high-level semantic information to the low-level feature maps from top to bottom. The second stage is the PAN [15] structure: to make up for the detail information lost in the backbone network due to downsampling, a bottom-up feature fusion path is constructed, so that detailed location information is transferred to the deep-level feature maps. In addition, channel attention and spatial attention are respectively introduced into the two fusion paths to improve the fusion effect. The final output feature maps at different scales contain both strong semantic information and rich detailed information. The specific process is as follows. The feature maps S2, S3, and S4 extracted from the backbone network are sent to the AFFN. In the FPN structure, S2, S3, and S4 carry out top-down feature fusion. First, each level's feature map passes through a 1 × 1 convolutional layer to keep the number of channels consistent, and Q4 is obtained by direct convolution of S4. Then S4 is upsampled by bilinear interpolation to increase the resolution of the feature map. After the channel attention module (as shown in Figure 3), the feature maps in different channels are reweighted to emphasize important features and suppress unimportant ones. The recalibrated features are added to and fused with S3 to obtain Q3. With the same operation, Q3 is added to and fused with S2 after upsampling and the channel attention module to obtain Q2.
Then, in the second stage, the PAN structure, Q2, Q3, and Q4 perform bottom-up feature fusion. P2 is obtained directly from Q2 through a 3 × 3 convolution layer. P2 is downsampled to reduce the resolution of the feature map, passed through the spatial attention module (as shown in Figure 4) to focus on the positions of effective information, and fused with Q3; P3 is finally obtained after a 3 × 3 convolutional operation. With the same operation, P3 is added to and fused with Q4 after downsampling and spatial attention, and then P4 is obtained through a 3 × 3 convolution layer. P5 and P6 are produced by applying a convolutional layer with a stride of 2 on P4.
In our method, the output feature maps P2, P3, P4, P5, and P6 of the AFFN are respectively allocated the regression ranges (0, 64], (64, 128], (128, 256], (256, 512], and (512, ∞), so that small objects are detected on low-level high-resolution feature maps and large objects are detected on high-level low-resolution feature maps. Through the adaptive feature fusion network, the feature maps at different resolutions balance semantic and location information, which improves the recall and accuracy of object detection.
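The channel reweighting step can be sketched as follows (a squeeze-and-excitation-style gate is assumed here; the exact module in Figure 3 may differ, and the weights below are random placeholders for illustration):

```python
import numpy as np

def channel_attention(x, reduction=4):
    """Channel reweighting sketch: global average pooling (squeeze),
    two fully connected layers with a bottleneck (excitation), and a
    sigmoid gate that rescales each channel of the (H, W, C) input."""
    H, W, C = x.shape
    rng = np.random.default_rng(0)  # placeholder weights, not learned
    W1 = rng.standard_normal((C, C // reduction)) * 0.1
    W2 = rng.standard_normal((C // reduction, C)) * 0.1
    s = x.mean(axis=(0, 1))                               # squeeze: (C,)
    g = 1.0 / (1.0 + np.exp(-(np.maximum(s @ W1, 0) @ W2)))  # gate in (0, 1)
    return x * g                                          # reweight channels
```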

Multi-Task Detector Head
As displayed in Figure 1, the multi-task detector head has four subtasks including classification, centerness confidence, distance offset, and angle prediction.
The outputs P_i (i = 2, 3, 4, 5, 6) of the AFFN are sent to the multi-scale detection heads, respectively. Consistent with the setting of FCOS [32], first, the feature map F_i ∈ R^(h×w×256) (i = 2, 3, 4, 5, 6) is obtained through a 3 × 3 convolutional layer stacked four times. It is then divided into four branches; after normalization, an activation function, and a 1 × 1 convolutional layer, the feature maps f_cls ∈ R^(h×w×N), f_center ∈ R^(h×w×1), f_θ ∈ R^(h×w×1), and f_offset ∈ R^(h×w×4) are obtained. Finally, these four feature maps are used for multi-task prediction and regression, respectively, including object classification confidence prediction, center confidence prediction, angle regression, and distance offset regression.
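The layout of the four branches can be summarized by their output shapes (a simple illustration of the head described above; N is the number of classes):

```python
def head_output_shapes(h, w, num_classes):
    """Output shapes of the four detector-head branches: classification,
    centerness confidence, angle, and the four distance offsets Q1..Q4."""
    return {
        "f_cls": (h, w, num_classes),  # per-class confidence
        "f_center": (h, w, 1),         # centerness confidence
        "f_theta": (h, w, 1),          # angle regression
        "f_offset": (h, w, 4),         # distance offsets Q1..Q4
    }
```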

Box Regression
At present, remote sensing object detection usually adopts five parameters to represent the oriented bounding box (OBB), that is (x_c, y_c, w, h, θ), as shown in Figure 5a, where (x_c, y_c) represents the coordinates of the center point, (w, h) represents the width and height of the bounding box, and θ represents the angle between the horizontal axis through the lowest point of the OBB and the first edge encountered by counterclockwise rotation; this first edge is taken as the width of the object box, i.e., w.
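For reference, the five-parameter box can be converted to its four corner points with a standard rotation (an illustrative sketch; θ is taken in radians):

```python
import math

def obb_to_corners(xc, yc, w, h, theta):
    """Convert a five-parameter oriented box (xc, yc, w, h, theta) to
    its four corner coordinates by rotating the axis-aligned corners
    around the center."""
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(xc + c * dx - s * dy, yc + s * dx + c * dy) for dx, dy in pts]
```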
Figure 5. Representation of the target used in our method.
However, this representation method is accompanied by a typical boundary discontinuity problem when regressing rotated-box coordinates. Besides, our paper adopts the pixel-wise predictive regression of FCOS [32] to generate bounding boxes. Therefore, we redefine a new representation method. As shown in Figure 5b, the oriented bounding box can be expressed as (x, y, Q_i, θ) (i = 1, 2, 3, 4), where (x, y) represents the coordinates of a point inside the box, and a Cartesian coordinate system is established with this point as the origin; Q_i (i = 1, 2, 3, 4) represents the distance offset from the point to the bounding box located in the i-th quadrant, and θ is the angle between the x-axis and Q_1.
For each location (x, y) on the feature map f_cls ∈ R^(h×w×N), we can map it back onto the input image as (xs + s/2, ys + s/2), which is near the center of the receptive field of the location (x, y), where s is the stride of the feature map. If the point falls within a ground-truth bounding box, it is a positive sample; otherwise, it is a negative sample. For a positive sample, the distances from the point to the ground-truth box and the angle are regressed, with regression target (Q*_i, θ*) = (Q*_1, Q*_2, Q*_3, Q*_4, θ*). Given the coordinates of the four vertices of the ground-truth box {(P^i_x, P^i_y) | i = 1, 2, 3, 4} and a point (a_x, a_y) inside the box, the procedure for finding Q*_i is given in Algorithm 1.

Algorithm 1: Distance offset calculation procedure
Input: {(P^i_x, P^i_y) | i = 1, 2, 3, 4}: coordinates of the four vertices of the ground truth; (a_x, a_y): coordinates of the regression point
Output: regression distance offset targets Q*_i
1: set P^1_x = P_x^max; the remaining vertices are arranged counterclockwise based on P^1
2: for i = 1 to 4, compute Q*_i, the distance offset from (a_x, a_y) to the bounding box in the i-th quadrant
3: return Q*_i
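The vertex-ordering step of Algorithm 1 can be sketched as follows, under the assumption (made for this illustration only) that each Q*_i is the distance from the regression point to the i-th reordered vertex:

```python
import math

def distance_offsets(vertices, point):
    """Sketch of Algorithm 1: reorder the four vertices so that P1 is
    the vertex with the largest x-coordinate and the rest follow
    counterclockwise around the regression point, then return the
    distance from the point to each ordered vertex (an assumed reading
    of the offset definition)."""
    ax, ay = point
    p1 = max(vertices, key=lambda p: p[0])          # P1: largest x
    a1 = math.atan2(p1[1] - ay, p1[0] - ax)
    def ccw_key(p):                                  # angle relative to P1
        return (math.atan2(p[1] - ay, p[0] - ax) - a1) % (2 * math.pi)
    ordered = sorted(vertices, key=ccw_key)
    return [math.hypot(px - ax, py - ay) for px, py in ordered]
```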

Center Confidence
This paper adopts the same settings as in [32] to suppress the low-quality bounding boxes predicted by points far from the center of an object. In training, center confidence is trained as a sub-task with the cross-entropy function as the loss function. At inference, the product of the classification confidence and the center confidence is taken as the final score; if the score is greater than 0.05, the prediction is kept as a positive sample. Following the centerness of FCOS [32], applied to the distance offsets of our representation, the center confidence can be expressed as:

centerness* = sqrt( (min(Q*_1, Q*_3) / max(Q*_1, Q*_3)) × (min(Q*_2, Q*_4) / max(Q*_2, Q*_4)) )
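The inference-time scoring rule described above can be sketched as:

```python
def final_score(cls_conf, center_conf):
    """Final detection score: the product of the classification
    confidence and the center confidence."""
    return cls_conf * center_conf

def keep_detection(cls_conf, center_conf, threshold=0.05):
    """Keep a prediction as a positive sample if its score exceeds 0.05."""
    return final_score(cls_conf, center_conf) > threshold
```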

Adaptive Weight Loss Function
The multitask learning loss function is composed of one classification loss for classification prediction, one centerness confidence loss for centerness prediction, and two regression loss for distance offset and angle prediction. To jointly learn these four subtasks, the uncertainty weighted loss is employed to capture the relative confidence among the four subtasks.
The classification loss function adopts the focal loss function [33]. The formula is defined as follows:

L_cls = (1 / N_pos) Σ_(x,y) FL(p_(x,y), c*_(x,y))     (11)

where FL(·) denotes the focal loss with weighting factor α and focusing parameter γ, N_pos represents the number of positive samples, and p_(x,y) and c*_(x,y) are the predicted classification confidence and its true value at point (x, y). In the experiments, α and γ are set to 0.25 and 2, respectively.
The center confidence loss function adopts the cross-entropy loss function, and the calculation method is as follows:

L_center = (1 / N_pos) Σ_(x,y) 1_(c_(x,y)=1) CE(centerness_(x,y), centerness*_(x,y))

where CE(·) is the binary cross-entropy, c_(x,y) ∈ {0, 1} indicates that the point (x, y) is a negative or a positive sample, and centerness* ∈ [0, 1] depicts the normalized distance from the location to the center of the object that the location is responsible for, as shown in Equation (10). The bounding box regression loss consists of two parts: the distance offset loss and the angle loss. They are respectively defined as follows:

L_offset = (1 / N_pos) Σ_(x,y) 1_(c_(x,y)=1) smooth_L1(t_(x,y) − t*_(x,y))
L_θ = (1 / N_pos) Σ_(x,y) 1_(c_(x,y)=1) smooth_L1(θ_(x,y) − θ*_(x,y))

where 1_(c_(x,y)=1) represents an indicator function that returns one if c_(x,y) = 1 (i.e., a positive sample) and otherwise returns zero. θ_(x,y) and t_(x,y) = {Q_i | i = 1, 2, 3, 4}_(x,y) indicate the predicted values of the angle and offsets at the point (x, y), respectively, and θ*_(x,y) and t*_(x,y) represent their true values. The smooth L1 loss can be calculated as:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise

Finally, this paper uses the uncertainty loss to balance these multi-task losses. The specific formula is as follows:

L = (1 / (2δ_1²)) L_cls + (1 / (2δ_2²)) L_center + (1 / (2δ_3²)) L_offset + (1 / (2δ_4²)) L_θ + Σ_(i=1)^4 log δ_i

where δ_1, δ_2, δ_3, and δ_4 are learnable uncertainty weighting factors used to balance the multi-task losses. A detailed introduction to the uncertainty-weighted loss is given in reference [34].
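The uncertainty weighting can be sketched as follows (an illustrative sketch in the style of [34], not the paper's code; parameterizing each δ_i through its log-variance s_i = log δ_i² is a common trick to keep the learned weights positive):

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses L_i as sum_i [ L_i / (2 * delta_i^2) + log delta_i ],
    with delta_i^2 = exp(s_i) so that 0.5 * s_i equals log delta_i."""
    total = 0.0
    for L, s in zip(losses, log_vars):
        total += 0.5 * math.exp(-s) * L + 0.5 * s
    return total
```

In training, the `log_vars` would be learnable parameters updated together with the network weights.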

Experiments and Results Analysis
In this section, three public optical remote sensing image data sets and evaluation metrics are first introduced. Then, the contribution of TransCovNet backbone, AFFN, and AWLF are analyzed. Next, the superiority of the proposed method is analyzed in comparison with the state-of-the-art detectors. Finally, some promising detection results are displayed.

Data Set and Training Details
In our experiments, we chose three oriented optical remote sensing image data sets: the DOTA [18] data set, the UCAS-AOD [19] data set, and the VEDAI [20] data set.

DOTA Data Set
The DOTA data set consists of 2806 remote sensing images from different sensors and platforms, ranging from 800 × 800 to 4000 × 4000 pixels, which contain 188,282 object instances with different scales, orientations, and shapes. The categories of the data set include plane, helicopter, swimming pool, roundabout, harbor, basketball court, soccer ball field, tennis court, ground track field, baseball diamond, storage tank, bridge, ship, small vehicle, and large vehicle. In this data set, the proportions of training, validation, and test images are 1/2, 1/6, and 1/3, respectively. Multiple sizes were used to crop the images: 512 × 512, 800 × 800, and 1024 × 1024, with an overlap of 0.2.
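The overlapping cropping scheme can be sketched as follows (an illustrative helper; the exact tiling code used for DOTA is not specified in the paper):

```python
def tile_positions(size, crop, overlap=0.2):
    """Top-left offsets for cropping a large image dimension into
    fixed-size tiles with the given fractional overlap; the last tile
    is clamped to the image edge so nothing is missed."""
    stride = int(crop * (1 - overlap))
    positions, pos = [], 0
    while True:
        if pos + crop >= size:              # last tile: clamp to edge
            positions.append(max(size - crop, 0))
            break
        positions.append(pos)
        pos += stride
    return positions
```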

UCAS-AOD Data Set
The UCAS-AOD data set mainly contains two types of targets: plane and car. There are 1000 plane images and 520 car images, including 7482 plane targets and 7144 car targets. The target is annotated with oriented bounding boxes consisting of four vertex coordinates. In the experiment, the data are randomly divided into training set, validation set, and test set according to the ratio of 7:2:1.

VEDAI Data Set
The VEDAI data set is a data set for vehicle detection in aerial images, with a total of 1210 images containing 3640 vehicle objects. In this experiment, the data set is randomly divided into training set and test set according to 9:1.

Training Details
The experiments in this article use the PyTorch deep learning framework, with 2 GTX 1080Ti GPUs for accelerated training and a per-GPU batch size of 4. The Adam optimizer [35] with an initial learning rate of 1 × 10^−4 is used to optimize the network. In this experiment, the DOTA, UCAS-AOD, and VEDAI data sets were trained for 100, 120, and 140 epochs, respectively, and the learning rate was decayed by a factor of 10 at the 80th, 100th, and 120th epochs, respectively.
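The step learning-rate schedule can be expressed as a small helper (a sketch of the recipe above; in practice PyTorch's MultiStepLR serves the same purpose):

```python
def learning_rate(epoch, milestones, base_lr=1e-4, gamma=0.1):
    """Step schedule: start at base_lr and multiply by gamma (a factor
    of 10 decay) at each milestone epoch, e.g. milestones=[80] for the
    100-epoch DOTA run."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** decays)
```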

Evaluation Metrics
To quantitatively evaluate the performance of the proposed method, we adopt five widely used evaluation metrics, namely, precision, recall, average precision (AP), mean average precision (mAP), and F1 score. The calculation formulas are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. Precision represents the ratio of correctly identified positive objects to detected positive samples, and recall represents the ratio of correctly identified positive objects to all positive samples. The AP metric is measured by the area under the precision-recall curve, which comprehensively measures the precision and recall of a given class.
mAP can be used as an evaluation index for multi-class target detection accuracy:

mAP = (1 / N_c) Σ_(i=1)^(N_c) AP_i     (20)

where N_c represents the number of categories in the data set. The F1 score comprehensively evaluates single-class object detection performance:

F1 = 2 P_i R_i / (P_i + R_i)

where P_i and R_i represent the precision and recall of the i-th category.
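The precision, recall, and F1 formulas above can be checked with a few lines:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from counts of true positives, false
    positives, and false negatives, as defined above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```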

Backbone Network Performance Analysis
Experiments are carried out on the UCAS-AOD data set to compare three detection methods (RoI-Transformer [8], R3Det [36], and our method), each paired with three backbone networks: ResNet50 [22], Swin Transformer-T [13], and TransConvNet-T. The detection performance is reported in Table 2. It can be seen that each detection method improves, to varying degrees, when using the backbone network of this paper. For RoI-Transformer, the mAP value is increased by 5.24% compared with the CNN of the same level and by 2.05% compared with the Swin-T backbone network; for R3Det, the mAP value is increased by 5.05% compared with the CNN at the same level and by 1.44% compared with the Swin-T backbone network; for the detection method in this paper, the mAP value is 4.95% higher than with the convolutional backbone of the same level and 1.96% higher than with the Swin-T backbone network. Experiments are also performed on the VEDAI data set: using the adaptive feature fusion network, detection head, and loss function proposed in this paper, the performance of TransConvNet at four different capacities is tested, as shown in Table 3.

Ablation Study
The adaptive feature fusion network (AFFN) and the adaptive weight loss function (AWLF) are the two important components proposed in this paper. In order to test their contribution to detection performance, ablation experiments are carried out on the VEDAI data set. The baseline model uses TransConvNet-B as the backbone network, with the FPN feature fusion strategy and an equal-weight loss function.
As shown in Table 4, after adopting the AFFN, the recall rate is increased by 3.44%, the precision rate by 2.22%, and the F1 score by 2.9%. After adopting the AWLF, the recall rate, precision rate, and F1 score are increased by 1.7%, 1.44%, and 1.6%, respectively. Of the two, the former contributes more to the improvement of recall and precision, which may benefit from the internal mechanism of the AFFN, which lets each feature map evenly contain the information representation required for detection. Figure 6 shows visualization results of the baseline model after adopting the AFFN and the AWLF. The first row is the detection result of the baseline model; there are some missed detections, and the bounding boxes are not accurate enough. The second row is the detection result after adopting the adaptive feature fusion network; the missed detections are reduced. The third row is the detection result after adopting the adaptive weight loss function; the detected bounding boxes are more accurate. The last row is the detection result when both the adaptive feature fusion network and the adaptive weight loss function are used.

Comparison with State-of-the-Art Methods
To comprehensively verify the superiority of our method, we conduct comparative experiments on the DOTA data set. As shown in Table 5, we compare the AP in 15 categories and the mAP value with eight other deep learning-based methods. Among them, RoI-Trans [8], CAD-Net [6], R3Det [36], SCRDet [7], GV [37], BBAVectors [38], and CSL [39] adopt ResNet-101-FPN as the backbone network, while RRPN [40] uses VGG-16. Note that data augmentation was applied for a fair comparison with all the compared methods. In Table 5, red and blue denote the best and second-best detection results, respectively. As the table shows, the mAP of our method reaches 78.41%, which is 2.24% higher than CSL, the best-performing anchor-based model, and 3.05% higher than BBAVectors, which is based on the anchor-free mechanism, achieving state-of-the-art performance on the DOTA data set. In addition, the detection performance on densely arranged small targets, such as small vehicles and large vehicles, improves significantly, reaching AP values of 80.23% and 82.43%, respectively, which are 1.97% and 2.03% higher than the second best. Our method also achieves the best results for bridges and ships, and the second-best results for soccer ball fields and roundabouts. Finally, we present visualizations of the detection results on the DOTA data set in Figure 7.

Conclusions
To cope with the complexity of remote sensing image backgrounds, the diversity of target scale and orientation, and the shortcomings of convolutional neural networks, this paper proposes a new backbone network, TransConvNet, which combines the advantages of convolutional neural networks and self-attention-based networks to better extract image information representations for object detection. On this basis, this paper also proposes an adaptive feature fusion network, which makes the feature representations at different image resolutions contain balanced semantic information and location detail, further improving the accuracy of object detection and, in particular, the recall rate. Finally, the network is trained with the adaptive weight loss function, which further improves the results. We performed extensive comparison and ablation experiments on the DOTA, UCAS-AOD, and VEDAI data sets; the experimental results prove the effectiveness of our method for oriented object detection in remote sensing images. In future work, we will design a more lightweight and efficient backbone network to improve the real-time performance of the detector for detecting oriented targets in optical remote sensing images.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.