Transformer with Transfer CNN for Remote-Sensing-Image Object Detection

Object detection in remote-sensing images (RSIs) has long been a vibrant research topic in the remote-sensing community. Recently, deep-convolutional-neural-network (CNN)-based methods, including region-CNN-based and You-Only-Look-Once-based methods, have become the de facto standard for RSI object detection. CNNs are good at local feature extraction, but they have limitations in capturing global features. In contrast, the attention-based Transformer can capture long-range relationships in RSIs. Therefore, the Transformer for Remote-Sensing Object Detection (TRD) is investigated in this study. Specifically, the proposed TRD is a combination of a CNN and a multiple-layer Transformer with encoders and decoders. To detect objects in RSIs, a modified Transformer is designed to aggregate features of global spatial positions on multiple scales and model the interactions between pairwise instances. Then, because the source data set (e.g., ImageNet) and the target data set (i.e., the RSI data set) are quite different, the TRD with a transferring CNN (T-TRD) based on the attention mechanism is proposed to adjust the pre-trained model, reducing the difference between the data sets for better RSI object detection. Because the training of a Transformer always requires abundant, well-annotated training samples, and the number of training samples for RSI object detection is usually limited, data augmentation is combined with the Transformer to avoid overfitting and improve the detection performance.
The proposed T-TRD with data augmentation (T-TRD-DA) is tested on two widely used data sets (i.e., NWPU VHR-10 and DIOR), and the experimental results reveal that the proposed models provide competitive performance (i.e., mean average precision (×100) of 87.9 and 66.8, which is at most 5.9 and 2.4 higher than the comparison methods on the NWPU VHR-10 and DIOR data sets, respectively) compared to the competitive benchmark methods, which shows that the Transformer-based method opens a new window for RSI object detection.


Introduction
Object detection in remote-sensing images (RSIs) is used to answer one of the most basic questions in the remote-sensing (RS) community: What and where are the objects (such as a ship, vehicle, or aircraft) in the RSIs? In general, the objective of object detection is to build models to localize and recognize different ground objects of interest in high-resolution RSIs [1]. Because object detection is a fundamental task for the interpretation of high-resolution RSIs, a great number of methods have been proposed to handle the issue of RSI object detection in the last decade [2].
The traditional RSI object-detection methods focus on constructing effective features for objects of interest and training a classifier from a set of annotated RSIs. They usually acquire object regions with sliding windows and then try to recognize each region. A variety of feature-extraction methods, e.g., bag-of-words (BOW) [3], scale-invariant feature transform [4], and their extensions, have been explored for representing objects, and classifiers trained on such features were then used for recognition. However, it seems that CNN-based methods, whether one-stage or two-stage, have reached a bottleneck in their progress.
Recently, the attention-based Transformer presented by Vaswani et al. [25] has become the standard model for machine translation. Numerous studies have demonstrated that the Transformer can also be efficient at image-processing tasks, and they have achieved breakthroughs. The Transformer is able to capture long-range relationships in RSIs [26][27][28], which tackles the difficulty CNN-based methods have in capturing global features. Therefore, there have been a number of successful studies focusing on Transformer-based models in the RS community. Inspired by the Vision Transformer [26], He et al. [29] proposed a Transformer-based hyperspectral image-classification method. They introduced the spatial-spectral Transformer, using a CNN to extract spatial features of hyperspectral images and a densely connected Transformer to learn the spectral relationships. Hong et al. [30] presented a flexible backbone network for hyperspectral images named SpectralFormer, which exploited the spectral-wise sequence attributes of hyperspectral images in order to sequentially feed them into the Transformer. Zhang et al. [31] proposed a Transformer-based method for remote-sensing scene classification, which designed a new bottleneck based on multi-head self-attention (MHSA) for image embedding and cascaded encoder blocks to enhance accuracy. They all achieved state-of-the-art performance, which shows the potential of the Transformer for various tasks in RSI processing. However, for RSI object detection, the number of Transformer-based studies is still limited. Zheng et al. [32] proposed an adaptive, dynamically refined one-stage detector based on the feature-pyramid Transformer, which embedded a Transformer in the FPN in order to enhance its feature-fusion capacity. Xu et al.
[33] proposed a local-perception backbone based on the Swin Transformer for RSI object detection and instance segmentation, and they investigated the performance of their backbone on different detection frameworks. In their studies, the Transformer worked as a feature-interaction module, i.e., a backbone or feature-fusion component, which is adaptable to various detection frameworks. Overall, since the Transformer has enormous potential to promote a unification of the architectures of various tasks in artificial intelligence, it is essential to further explore Transformer-based RSI object detectors.
In this paper, we investigate a neoteric Transformer-based remote-sensing object-detection (TRD) framework. The proposed TRD is inspired by the detection Transformer [28], which takes features obtained from a CNN backbone as the input and directly outputs a set of detected objects. The existing Transformer-based RSI object detectors [32,33] are still highly dependent on existing detection frameworks composed of various surrogate-task components, such as duplicated-prediction elimination. The proposed TRD abandons the conventional complicated structure in favor of an independent and more end-to-end framework. Additionally, the CNN backbone in the TRD is trained with transfer learning. To reduce the divergence between the source domain and the target domain, the T-TRD is proposed, which adjusts the pre-trained CNN with the attention mechanism for a better transfer. Moreover, since the quantity of reliable training samples for RSI object detection is usually insufficient for training a Transformer-based model, the T-TRD-DA explores data augmentation composed of sample expansion and multiple-sample fusion to enrich the training samples and prevent overfitting. We hope that our research will inspire the development of RSI object-detection components based on the Transformer.
In summary, the following are the main contributions of this study.
(1) An end-to-end Transformer-based RSI object-detection framework, TRD, is proposed, in which the Transformer is remolded in order to efficiently integrate features of global spatial positions and capture the relationships of feature embeddings and object instances. Additionally, the deformable attention module is introduced as an essential component of the proposed TRD, which only attends to a sparse set of sampling features and mitigates the problem of high computational complexity. Hence, the TRD can process RSIs on multiple scales and recognize the objects of interest within them.  (2) The pre-trained CNN is used as the backbone for feature extraction. Furthermore, in order to mitigate the difference between the two data sets (i.e., ImageNet and the RSI data set), the attention mechanism is used in the T-TRD to reweight the features, which further improves the RSI detection performance. Therefore, the pre-trained backbone is better transferred and obtains discriminant pyramidal features.
(3) Data augmentations, including sample expansion and multiple-sample fusion, are used to enrich the diversity of orientations, scales, and backgrounds of training samples. In the proposed T-TRD-DA, the impact of using insufficient training samples for Transformer-based RSI object detection is alleviated.

The Proposed Transformer-Based RSI Object-Detection Framework
Figure 1 shows the overall architecture of the proposed Transformer-based RSI object-detection framework. First, a CNN backbone with attention-based transfer learning is used to extract multi-scale feature maps of the RSIs. The feature maps from the shallower layers have higher resolutions, which benefits the detection of small object instances, while the high-level features have wide receptive fields and are appropriate for large-object detection and global spatial-information fusion. The features of all levels are embedded together in a sequence. The sequence of embedded features undergoes the encoder and decoder of the Transformer-based detection head and is transformed into a set of predictions with categories and locations. As the figure shows, a point in the input embeddings from the high-level feature map tends to recognize a large instance, while one from the low-level map is inclined to recognize a small instance. The detailed introduction of the proposed Transformer-based RSI object-detection framework starts with the framework of the proposed TRD and the effective deformable attention module in its Transformer. Subsequently, the attention-based transferring backbone and the data augmentation are introduced in detail.

Figure 1. The overview architecture of the proposed Transformer-based RSI object-detection framework.

The Framework of the Proposed TRD
Figure 2 shows the framework of the proposed TRD. A CNN backbone is first used to extract pyramidal multi-scale feature maps from an RSI. These maps are then embedded with the 2D positional encoding and converted to a sequence that can be input into the Transformer. The Transformer is remolded in order to process the sequence of image embeddings and make predictions of detected object instances.

The feature pyramid of the proposed TRD can be obtained by a well-designed CNN, and in this study, a detection backbone based on ResNet [34] is adopted. The convolutional backbone takes an RSI I ∈ R^(3×H_0×W_0) of an arbitrary size H_0 × W_0 as the input and generates hierarchical feature maps. Specifically, the ResNet generates hierarchical maps from the outputs of its last three stages, which are denoted as {f_1, f_2, f_3}, with f_l ∈ R^(C_l×H_l×W_l). Those of the other stages are not included due to their restricted receptive fields and the additional computational complexity. Then, the feature map at each level undergoes a 1 × 1 convolution, mapping its C_l channels to a smaller, uniform dimension d.
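As a concrete illustration of this channel mapping, a 1 × 1 convolution is simply a per-pixel linear projection over the channel dimension. The following NumPy sketch uses illustrative stage shapes, random weights, and d = 256; the actual backbone configuration and learned parameters may differ:

```python
import numpy as np

def conv1x1(feature_map, weight):
    """Apply a 1x1 convolution: a per-pixel linear map over channels.
    feature_map: (C_l, H_l, W_l); weight: (d, C_l) -> returns (d, H_l, W_l)."""
    c, h, w = feature_map.shape
    # Flatten the spatial dims, project the channels, restore the layout.
    return (weight @ feature_map.reshape(c, h * w)).reshape(-1, h, w)

rng = np.random.default_rng(0)
d = 256  # uniform embedding dimension (illustrative)
# Hierarchical maps from the last three backbone stages (illustrative shapes).
pyramid = [rng.normal(size=(512, 64, 64)),
           rng.normal(size=(1024, 32, 32)),
           rng.normal(size=(2048, 16, 16))]
projected = [conv1x1(f, rng.normal(size=(d, f.shape[0]))) for f in pyramid]
# All levels now share d channels while keeping their own resolutions.
```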
Hence, a three-level feature pyramid is obtained, which is denoted as {x_1, x_2, x_3} with x_l ∈ R^(d×H_l×W_l). Additionally, a lower-resolution feature map x_4 is acquired by a 3 × 3 convolution on x_3.
The feature pyramid is further processed to be fed into the Transformer. The MHSA in the Transformer aggregates the elements of the input without discriminating their positions; hence, the Transformer is permutation-invariant. To alleviate this problem, we need to embed spatial information in the feature maps. Therefore, after the L-level feature pyramid {x_l} (l = 1, ..., L) is extracted from the convolutional backbone, 2D position encodings are supplemented at each level. Specifically, the sine and cosine positional encoding of the original Transformer is extended to column and row positional encodings, respectively. They are both acquired by encoding along the row or column dimension over half of the d channels, and then duplicated along the other spatial dimension. The final positional encodings are the concatenation of the two.
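The column/row sine-cosine scheme described above can be sketched as follows; the temperature constant 10000 is taken from the original Transformer, and the small h, w, and d values are illustrative only:

```python
import numpy as np

def positional_encoding_2d(h, w, d):
    """Sine-cosine encodings computed separately for rows and columns,
    each over half of the d channels, then concatenated (a sketch)."""
    assert d % 4 == 0
    half = d // 2
    freqs = 10000.0 ** (np.arange(half // 2) / (half // 2))  # (d/4,)
    ys = np.arange(h)[:, None] / freqs          # row positions, (h, d/4)
    xs = np.arange(w)[:, None] / freqs          # column positions, (w, d/4)
    row = np.concatenate([np.sin(ys), np.cos(ys)], axis=-1)  # (h, d/2)
    col = np.concatenate([np.sin(xs), np.cos(xs)], axis=-1)  # (w, d/2)
    # Duplicate each encoding along the other spatial dimension.
    row = np.broadcast_to(row[:, None, :], (h, w, half))
    col = np.broadcast_to(col[None, :, :], (h, w, half))
    return np.concatenate([row, col], axis=-1)  # (h, w, d)

pe = positional_encoding_2d(8, 8, 16)
```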
The Transformer expects a sequence consisting of elements of equal dimensions as input. Therefore, the multi-scale position-encoded feature maps {x_l} (l = 1, ..., L) are flattened in the spatial dimensions, yielding L sequences of length H_l × W_l. The input sequence is obtained by concatenating the sequences from the L levels, and it consists of Σ_{l=1}^{L} H_l × W_l tokens of dimensionality d. Each pixel in the feature pyramid is treated as an element of the sequence. The Transformer then models the interactions of the feature points and recognizes the object instances of concern from the sequence.
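The flattening and concatenation can be illustrated in a few lines (the shapes are illustrative):

```python
import numpy as np

d = 8  # embedding dimension (illustrative)
# Four position-encoded pyramid levels with a shared channel dimension d.
levels = [np.zeros((d, 16, 16)), np.zeros((d, 8, 8)),
          np.zeros((d, 4, 4)), np.zeros((d, 2, 2))]
# Flatten each level's spatial dims into H_l * W_l tokens, then concatenate.
tokens = np.concatenate([f.reshape(d, -1).T for f in levels], axis=0)
# 16*16 + 8*8 + 4*4 + 2*2 = 340 tokens, each a d-dimensional element.
```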
The original Transformer adopted an encoder-decoder structure using stacked self-attention layers and point-wise fully connected layers, and the decoder was auto-regressive, generating one element at a time and appending it to the input sequence for the next generation [25]. In contrast, the Transformer here replaces the MHSA layers of the encoder with deformable attention layers, which are more attractive for modeling the relationships between feature points due to their lower computational and memory complexity. In addition, the decoder adopts a non-autoregressive structure, which decodes the elements in parallel. The details are as follows: The encoder takes the sequence of feature embeddings as the input and outputs a sequence of spatial-aware elements. The encoder consists of N cascaded encoder layers. In each encoder layer, the sequence undergoes a deformable multi-head attention layer and a feed-forward layer, both of which are accompanied by a layer normalization and a residual computation, and the encoder layer outputs a sequence of the same length. The deformable attention layers aggregate the features at positions in an adaptive field, obtaining feature maps with distant relationships. The feature points can be used to compose the input sequence of the decoder. To reduce the computational complexity, the feature points are fed into a scoring network, specifically a three-layer FFN with a softmax layer, which can be regarded as a binary classifier of foreground and background. The N_p highest-scored points constitute a fixed-length sequence, which is fed into the decoder. The encoder thus endows the multi-scale feature maps with global spatial information and then selects a fixed-quantity set of spatial-aware feature points, which are more easily used for detecting object instances.
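The foreground scoring and top-N_p selection can be sketched as below; the random linear scorer here is only a stand-in for the three-layer FFN with softmax described above:

```python
import numpy as np

def select_top_points(encoder_out, score_fn, n_p):
    """Score every feature point as foreground vs. background and keep
    the n_p highest-scored ones as the fixed-length decoder input."""
    scores = score_fn(encoder_out)            # one score per token
    keep = np.argsort(scores)[::-1][:n_p]     # indices of the top n_p
    return encoder_out[keep], keep

rng = np.random.default_rng(1)
seq = rng.normal(size=(340, 8))               # encoder output (illustrative)
w = rng.normal(size=8)                        # stand-in for the FFN scorer
points, idx = select_top_points(seq, lambda z: z @ w, n_p=100)
# points is a fixed-length (100, 8) sequence fed into the decoder.
```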
The decoder takes the sequence of essential feature points as the input and outputs a sequence of object-aware elements in parallel. The decoder contains M cascaded decoder layers, each consisting of an MHSA layer, an encoder-decoder attention layer, and a feed-forward layer, each followed by a layer normalization and a residual computation. The MHSA layers capture interactions between pairwise feature points, which benefits constraints related to object instances, such as preventing duplicate predictions. Each encoder-decoder attention layer takes the elements from the previous layer in the decoder as queries and those from the output of the last encoder layer as memory keys and values. It enables the feature points to attend to feature contexts at different scale levels and global spatial positions. The output embeddings of each decoder layer are fed into a layer normalization and the prediction heads, which share a common set of parameters across layers.
The prediction heads further decode the output embeddings from the decoder into object categories and bounding-box coordinates. Similar to most modern end-to-end object-detection architectures, the prediction head is divided into two branches for classification and regression. In the classification branch, a linear projection with a softmax function is used to predict the category of each embedding. A special 'background' category is appended to the classes, meaning that no object of concern is detected from the query. In the regression branch, a three-layer fully connected network with the ReLU function is utilized to produce the normalized coordinates of the bounding boxes. In total, the heads generate a set of N_p predictions, each consisting of a class and the corresponding box position. The final prediction results are obtained by removing the 'background' predictions.
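A minimal sketch of this post-processing (softmax classification with a 'background' class, then background removal) under illustrative logits and boxes:

```python
import numpy as np

def decode_predictions(class_logits, boxes, background_idx):
    """Apply softmax to the classification logits, then drop queries whose
    most likely class is the special 'background' category."""
    e = np.exp(class_logits - class_logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    labels = probs.argmax(axis=-1)
    keep = labels != background_idx
    return labels[keep], probs[keep].max(axis=-1), boxes[keep]

# Three queries over 2 object classes plus background (index 2); boxes are
# normalized (cx, cy, w, h) coordinates (illustrative values).
logits = np.array([[4.0, 0.1, 0.2], [0.1, 0.2, 5.0], [0.3, 3.0, 0.1]])
boxes = np.array([[.5, .5, .2, .2], [.1, .1, .05, .05], [.7, .3, .4, .3]])
labels, scores, kept_boxes = decode_predictions(logits, boxes, background_idx=2)
# The middle query is classified as background and removed.
```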
The proposed TRD takes full advantage of the relationship-capturing capacity of the Transformer and rebuilds the original structure and embedding scheme. It explores a Transformer-based paradigm for RSI object detection.

The Deformable Attention Module
To enhance the detection performance for small object instances, the idea of utilizing multi-scale feature maps is explored, in which the low-level, high-resolution feature maps are conducive to recognizing small objects. However, high-resolution feature maps result in high computational and memory complexity for the conventional MHSA-based Transformer, because the MHSA layers measure the compatibility of each pair of reference points. In contrast, the deformable attention module only pays attention to a fixed-quantity set of essential sampling points at several adaptive positions around the reference point, which enormously decreases the computational and memory complexity. Thus, the Transformer can be effectively extended to the aggregation of multi-scale features of RSIs. Figure 3 shows the diagram of the deformable attention module. The module generates a specific quantity of sampling offsets and attention weights for each element at each scale level. The features at the sampling positions of the maps at different levels are aggregated into a spatial- and scale-aware element.

The input sequence of the embedded feature elements is denoted as x. At each level, the normalized location of the q-th feature element is denoted as p̂_q ∈ [0, 1]^2, which can be re-scaled to the practical coordinates at the l-th level with a mapping function φ_l(p̂_q).
For each element, which is represented as x(φ_l(p̂_q)), a 3LK-channel linear projection is used to obtain LK sets of sampling offsets Δp_lkq ∈ R^2 and attention weights a_lkq ∈ [0, 1], which are normalized such that Σ_{l=1}^{L} Σ_{k=1}^{K} a_lkq = 1. Then, the features of the LK sampling points x(φ_l(p̂_q) + Δp_lkq) are calculated from the input feature maps by bilinear interpolation. They are aggregated by multiplying with the attention weights a_lkq, generating a spatial- and scale-aware element. Therefore, the output sequence of the deformable attention module is calculated with (1):

y = Σ_{l=1}^{L} Σ_{k=1}^{K} A_lk ⊙ x(p_l + Δp_lk),    (1)
where l indexes the L feature levels and k indexes the K sampled points for the keys and values, respectively. Here, p_l is the sequence of the practical coordinates {φ_l(p̂_0), φ_l(p̂_1), · · · }, Δp_lk indicates the sequence of the k-th sampling offsets {Δp_lk0, Δp_lk1, · · · }, and A_lk is composed of the normalized attention weights a_lkq. The deformable attention mechanism resolves the problem of processing spatial features with self-attention computations. It is extremely appropriate for Transformers in computer-vision tasks, and it is adopted in the proposed TRD detector.
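The aggregation for a single query can be sketched as follows; this single-head NumPy version omits the learned linear projections and uses random offsets and weights purely for illustration:

```python
import numpy as np

def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap of shape (d, H, W) at fractional (y, x)."""
    d, h, w = fmap.shape
    y = np.clip(y, 0, h - 1); x = np.clip(x, 0, w - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    return ((1-wy)*(1-wx)*fmap[:, y0, x0] + (1-wy)*wx*fmap[:, y0, x1]
            + wy*(1-wx)*fmap[:, y1, x0] + wy*wx*fmap[:, y1, x1])

def deform_attn_query(pyramid, p_hat, offsets, weights):
    """Sample K points per level around the reference phi_l(p_hat) and
    aggregate them with the normalized attention weights a_lkq."""
    out = 0.0
    for l, fmap in enumerate(pyramid):
        _, h, w = fmap.shape
        ref = np.array([p_hat[0] * (h - 1), p_hat[1] * (w - 1)])  # phi_l
        for k in range(offsets.shape[1]):
            dy, dx = offsets[l, k]
            out = out + weights[l, k] * bilinear(fmap, ref[0] + dy, ref[1] + dx)
    return out

rng = np.random.default_rng(2)
pyramid = [rng.normal(size=(8, 16, 16)), rng.normal(size=(8, 8, 8))]
offsets = rng.normal(size=(2, 4, 2))                 # L=2 levels, K=4 points
weights = rng.random(size=(2, 4)); weights /= weights.sum()  # sums to one
y_q = deform_attn_query(pyramid, np.array([0.3, 0.7]), offsets, weights)
# y_q is one spatial- and scale-aware d-dimensional output element.
```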

The Attention-Based Transferring Backbone
In general, a deep CNN can obtain discriminative features of RSIs for object detection. However, because RSI object-detection tasks usually have limited training samples and deep models always contain numerous parameters, deep-learning-based RSI object-detection methods usually face the problem of overfitting.
To address the overfitting issue, transfer learning is used in this study. In the proposed T-TRD detector, a pre-trained CNN model is used as the backbone for RSI feature extraction, and then the Transformer-based detection head is used to complete the object-detection task. In CNNs, the first few convolution operations extract low-level and mid-level features such as blobs, corners, and edges, which are common features for image processing [35].
In RSI object detection, the proper re-usage of low-level and mid-level representations will significantly improve the detection performance. However, because the spatial resolution and imaging environment of ImageNet and RSIs are quite different, the attention mechanism is used in this study to adjust the pre-trained model for better RSI object detection.
In the original attention mechanism, more attention is paid to the important regions in an image and the selected regions are assigned by different weights. Such an attention mechanism has been proved to be effective in text entailment and sentence representations [36,37].
Motivated by the attention mechanism, we re-weight the feature maps to reduce the difference between the two data sets (i.e., RSI and ImageNet). Specifically, the feature maps in the pre-trained model are re-weighted and then transferred to the backbone for RSI object detection. Feature maps with higher attention scores provide transferred features that are more important for the following feature extraction. Figure 4 shows the framework of the proposed attention-based transferring backbone. As shown, the model pre-trained on the source-domain image data set is transferred to the backbone of the T-TRD. The attention weights are obtained with global average pooling and a non-linear projection. Finally, the feature maps are re-weighted according to the attention weights. The detailed steps are defined below.
First, the feature maps of a convolutional layer are aggregated into channel-wise statistics by a global average pooling layer. Specifically, each feature map is averaged over its spatial dimensions H × W by the following formula:

v = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u(i, j),

where u refers to the input feature map and v indicates the aggregated information of the whole feature map. Next, to capture the relationships between feature maps of different importance, a neural network that consists of two fully connected (FC) layers and a ReLU operation is utilized. To limit the model complexity, the first FC layer maps the total number of feature maps to a fixed value (i.e., 128), followed by a ReLU non-linearity. The second FC layer then restores the number of feature maps to its initial dimension. By learning the parameters of this network through backpropagation, the interactions reflecting the importance of the different feature maps can be obtained. Finally, the attention values of the different feature maps are output by the sigmoid function, which restricts the values to between zero and one. Each feature map is multiplied by the obtained attention value to distinguish the degrees of importance of the different feature maps.
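The three steps (global average pooling, FC-ReLU-FC, sigmoid re-weighting) can be sketched in NumPy; the bottleneck width here is 16 rather than 128, and the weights are random stand-ins for the learned parameters:

```python
import numpy as np

def relu(z): return np.maximum(z, 0.0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def attention_reweight(feats, w1, w2):
    """Squeeze: global average pooling per channel; excite: FC-ReLU-FC with
    a sigmoid; scale: multiply each feature map by its attention value."""
    v = feats.mean(axis=(1, 2))            # channel-wise statistics, (C,)
    a = sigmoid(w2 @ relu(w1 @ v))         # attention values in (0, 1)
    return feats * a[:, None, None], a

rng = np.random.default_rng(3)
feats = rng.normal(size=(32, 8, 8))        # pre-trained feature maps (illustrative)
w1 = rng.normal(size=(16, 32))             # first FC: reduce to the bottleneck
w2 = rng.normal(size=(32, 16))             # second FC: restore the channel count
reweighted, att = attention_reweight(feats, w1, w2)
# Transferred maps are now scaled by their learned importance.
```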
The above steps are used in the proposed attention-based transferring backbone. The transferring features from ImageNet to RSI re-weighted by the attention values could boost the feature discriminability, thereby reducing the difference between the two data sets by learning more important transferring features and weakening less important features.
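The re-weighting steps above can be sketched as follows (a minimal NumPy illustration with randomly initialized FC weights; the reduced dimension of 128 follows the text, while the array names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention_reweight(feats, w1, b1, w2, b2):
    """Re-weight transferred feature maps with channel-wise attention.

    feats: (C, H, W) feature maps from the pre-trained backbone.
    w1, b1: first FC layer, mapping C channels to a reduced dimension (128).
    w2, b2: second FC layer, restoring the reduced dimension back to C.
    """
    # Squeeze: global average pooling over the H x W spatial dimensions.
    v = feats.mean(axis=(1, 2))                          # (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = np.maximum(v @ w1 + b1, 0.0)                # (128,)
    scores = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # (C,), in (0, 1)
    # Re-weight each feature map by its attention score.
    return feats * scores[:, None, None], scores

C, H, W, R = 256, 16, 16, 128
feats = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C, R)) * 0.01, np.zeros(R)
w2, b2 = rng.standard_normal((R, C)) * 0.01, np.zeros(C)
reweighted, scores = channel_attention_reweight(feats, w1, b1, w2, b2)
```

In training, w1, b1, w2, and b2 are learned through backpropagation, so the scores come to reflect which transferred feature maps matter for the RSI task.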

Data Augmentation for RSI Object Detection
As reported, Transformer-based vision models are more likely to overfit than CNNs of equivalent computational complexity on limited data sets [26]. However, the quantities of training samples in RSI data sets for object detection are usually deficient. Additionally, objects in an RSI sample are usually sparsely distributed, which makes training the proposed Transformer-based detection models inefficient. Hence, a data-augmentation method, which is composed of sample expansion and multiple-sample fusion, is merged into the training strategy of the T-TRD to improve the detection performance.
Let X = {x_1, x_2, ..., x_N} be the training samples. We define a set of four right-angle rotation transformations T_R = {t_R0, t_R1, t_R2, t_R3} and a set of two horizontal-flip transformations T_F = {t_F0, t_F1}. Both sets are applied to all the training samples, generating an extended sample set 8 times (4 rotations × 2 flip states) the size of the original.
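The sample expansion can be sketched as follows (a minimal NumPy illustration; in the actual detector the bounding-box annotations must be transformed consistently with the image, which is omitted here):

```python
import numpy as np

def expand_sample(x):
    """Apply the 4 right-angle rotations x 2 horizontal-flip states,
    yielding the 8 transformed copies of one training image."""
    out = []
    for flip in (False, True):                # t_F0 (identity), t_F1 (flip)
        xf = x[:, ::-1] if flip else x
        for k in range(4):                    # t_R0 .. t_R3 (k * 90 degrees)
            out.append(np.rot90(xf, k))
    return out

# A generic asymmetric "image" produces 8 distinct transformed samples.
x = np.arange(16).reshape(4, 4)
expanded = expand_sample(x)
```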

With the data augmentation, the problem of insufficient training samples is mitigated. The proposed T-TRD-DA trains a Transformer-based detection model on an enhanced training data set with more diversity of scale, orientation, background, etc., which prevents the proposed deep model from overfitting.

Data Description
The proposed TRD, T-TRD and T-TRD-DA are evaluated on the NWPU VHR-10 [6] and DIOR [2] data sets, which are both widely-used public data sets for multi-class object detection in RSIs.
The NWPU VHR-10 data set contains 800 very-high-resolution RSIs collected from Google Earth and the Vaihingen data set [38]. There is an annotated 'positive image set' and a 'negative image set'. The 150 images in the 'negative image set' contain no object in the concerned categories, which are used for exploring semi-supervised and weakly-supervised algorithms. The 650 images in the 'positive image set' were annotated with 10 categories of objects, which are used in the experiment and divided into a training set with 130 images, a validation set with 130 images, and a testing set with 390 images.
The DIOR data set is one of the most challenging large-scale benchmark data sets for RSI object detection. There are 23,463 images acquired from Google Earth, and 20 categories of 192,472 objects annotated in the DIOR data set. Compared with other data sets, the images and object instances of the data set have higher intra-class variation and inter-class similarity. Therefore, the DIOR data set is considered appropriate for the training and evaluation of RSI object detectors, especially deep-learning-based detectors. In the experiments, the quantities of the training set, the validation set, and the testing set are 5862, 5863, and 11,738, respectively, according to the official setting in [2].

Evaluation Metrics
In the experiments, the average precision (AP) for each category and the mean average precision (mAP) are utilized to evaluate the proposed detectors. In general, the AP for the c-th category, AP_c, is calculated from the recall values (R) and the corresponding precision values P_c(R) with formula (3), which is also the area under the precision-recall curve of the category, and the mAP is calculated by averaging AP_c over the C categories with formula (4):

AP_c = ∫_0^1 P_c(R) dR, (3)

mAP = (1 / C) Σ_{c=1}^{C} AP_c. (4)
For a specific category, to obtain the precision-recall curve, we need to calculate pairs of Precision values with formula (5) and Recall values with formula (6):

Precision = TP / (TP + FP), (5)

Recall = TP / (TP + FN). (6)

Specifically, assume that a total of K bounding boxes are classified into the category. Each prediction result includes the coordinates and the classification confidence of a bounding box. A bounding box is a true positive (TP) if the IOU between it and a ground-truth (GT) box is larger than the threshold γ; otherwise, it is considered a false positive (FP). In addition, if more than one TP bounding box corresponds to one GT box, the box with the largest IOU is reserved as the TP, and the others are considered FPs. If a GT box has no corresponding TP, the GT box is counted as a false negative (FN). In formulas (5) and (6), TP, FP, and FN represent the quantities of TP, FP, and FN boxes; therefore, Precision and Recall are dimensionless, and TP + FN is equal to the number of GT boxes, Num(GT). In practice, the bounding boxes are sorted according to their confidence, and the Precision and Recall values are calculated with the first k bounding boxes each time. The precision-recall curve is obtained by taking k from 1 to K. In the experiments, the IOU threshold γ is set to 0.5 according to the benchmarks of object detection in RSIs.
The Precision can be considered as the percentage of correct predictions among all predictions, and the Recall as the proportion of GT boxes that are detected among all GT boxes. The precision-recall curve reflects the relationship between the two. A better detector should have both higher Precision and higher Recall; therefore, its mAP should also be higher.
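The matching rule and the AP computation described above can be sketched as follows (a simplified NumPy illustration using plain step integration of the precision-recall curve; benchmark toolkits may differ in interpolation and tie-breaking details):

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision(preds, gts, gamma=0.5):
    """AP for one category; mAP is the mean of AP over all categories.

    preds: list of (confidence, box); gts: list of GT boxes.
    A prediction is TP if its best-IOU GT exceeds gamma and that GT is
    not already matched; extra matches to the same GT count as FP.
    """
    preds = sorted(preds, key=lambda p: -p[0])      # sort by confidence
    matched = [False] * len(gts)
    tp = np.zeros(len(preds))
    for i, (_, box) in enumerate(preds):
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            v = iou(box, g)
            if v > best:
                best, best_j = v, j
        if best > gamma and not matched[best_j]:
            matched[best_j] = True
            tp[i] = 1.0
    cum_tp = np.cumsum(tp)
    recall = cum_tp / len(gts)                        # TP / (TP + FN)
    precision = cum_tp / np.arange(1, len(preds) + 1) # TP / (TP + FP)
    # Area under the precision-recall curve (step integration).
    prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev) * precision))

gts = [(0, 0, 10, 10)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60))]
ap = average_precision(preds, gts)
```

With the perfect box ranked first, the single GT is recovered at full precision, so the AP is 1.0; swapping the confidences drops it to 0.5, showing how confidence ranking affects the metric.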

Baseline Methods
In the experiments, nine baseline methods, which are widely used as comparison benchmarks for object detection in RSIs, are adopted to evaluate the proposed detectors. To be specific, on the NWPU VHR-10 data set, the baseline methods include traditional methods such as SSCBOW [5] and COPD [6], and deep-learning-based methods such as RICNN [10], R-P-Faster R-CNN [39], YOLO v3 [20], Deformable R-FCN [40], Faster RCNN [12], and Faster RCNN with FPN [17]. As for the DIOR data set, region-proposal-based methods including RICNN, Faster RCNN, Faster RCNN with FPN, and Mask RCNN [41] with FPN, and the anchor-based method YOLO v3 are selected for a comprehensive comparison.

Implementation Details
ResNet [34] is recognized as one of the most effective backbone networks in the object-detection community. The residual operation of ResNet solves the degradation problem in deep networks; therefore, it can scale to larger networks and extract high-level semantic features. We adopt the ImageNet-pre-trained ResNet-50, in line with the choices of most baseline methods. To distinguish the feature maps of different scales, in addition to the 2D positional encoding, learnable scale-level encodings are also embedded in the multi-scale feature maps.
The encoder and decoder of the transformer both have six attention modules, and each module consists of eight attention heads. The dimension d of the input embeddings is set to 256. The number of sampled keys for each deformable attention calculation K is set to 4. On the NWPU VHR-10 data set, the number of selected feature points N p is set to 300. However, on the DIOR data set, the number is set to 600, because images may have more than 300 object instances in the DIOR data set.
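The deformable attention used here samples only K points per query instead of attending to all spatial positions. A minimal single-query, single-scale sketch (value/output projections and multi-head logic omitted; all names and values are illustrative, and sampling locations are assumed to lie inside the feature map):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinear interpolation of feat (H, W, d) at fractional (y, x),
    assumed to be in-bounds."""
    y0, x0 = int(y), int(x)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0]
            + (1 - dy) * dx * feat[y0, x0 + 1]
            + dy * (1 - dx) * feat[y0 + 1, x0]
            + dy * dx * feat[y0 + 1, x0 + 1])

def deformable_attention(feat, ref, offsets, logits):
    """One deformable-attention aggregation for a single query.

    feat: (H, W, d) feature map; ref: (y, x) reference point;
    offsets: (K, 2) learned sampling offsets; logits: (K,) attention logits.
    Only K sampled points around the reference are aggregated, instead
    of all H * W positions.
    """
    w = np.exp(logits - logits.max())
    w /= w.sum()                              # softmax over the K sampled keys
    out = np.zeros(feat.shape[-1])
    for k in range(len(offsets)):
        out += w[k] * bilinear_sample(feat, ref[0] + offsets[k, 0],
                                      ref[1] + offsets[k, 1])
    return out

feat = np.ones((8, 8, 4))                     # constant map: output must be 1s
offsets = np.array([[0.5, -0.5], [1.2, 0.3], [-0.7, 0.9], [0.0, 0.0]])  # K = 4
logits = np.array([0.1, 0.5, -0.3, 0.2])
out = deformable_attention(feat, (3.0, 3.0), offsets, logits)
```

Because the attention weights sum to one, sampling a constant feature map returns the constant, which is a quick sanity check on the softmax and the interpolation.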
The detectors are trained with the AdamW optimizer, setting the weight decay to 1 × 10⁻⁴. The initial learning rate of the Transformer is set to 1 × 10⁻⁴, while that of the other learnable parameters is set to 1 × 10⁻⁵. The combined loss function in [28] is used for optimization, except that the classification term is replaced with the Focal Loss [42]. Other training strategies and parameter initializations also follow [28].
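The two-tier learning-rate setting can be expressed as optimizer parameter groups, sketched below (hypothetical: the "transformer." name prefix and the helper function are illustrative, not taken from the implementation; the actual grouping follows [28]):

```python
def make_param_groups(named_params):
    """Split named parameters into two AdamW groups: Transformer
    parameters at lr = 1e-4, all other learnable parameters at 1e-5,
    both with weight decay 1e-4."""
    transformer, others = [], []
    for name, p in named_params:
        (transformer if name.startswith("transformer.") else others).append(p)
    return [
        {"params": transformer, "lr": 1e-4, "weight_decay": 1e-4},
        {"params": others, "lr": 1e-5, "weight_decay": 1e-4},
    ]

groups = make_param_groups([("transformer.encoder.w", "p1"),
                            ("backbone.conv1.w", "p2")])
```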
The proposed methods are implemented using MMDetection [43], which is an open-source object-detection framework presented by Open MMLab. The experiments are executed on a scientific computing workstation with Intel Xeon Silver CPUs and dual Tesla V100 MAX-Q GPUs with a total of 32 GB memory.

Experimental Results and Discussion
The proposed Transformer-based detectors are trained on the two data sets. Both qualitative inference results and quantitative evaluation results are provided and analyzed. For the qualitative inference results in Figures 6-8, the regions surrounded by blue bounding boxes indicate ground truth, and the detection results are marked with red bounding boxes. Additionally, the category and confidence value of each detected box are given. In the quantitative evaluation results, the APs and mAPs magnified by 100 are reported, and the precision-recall curve of each category is given. Additionally, the results of the ablation experiments are appended to demonstrate the effectiveness of the modules in the proposed methods. For the quantitative evaluation results in Tables 1-5, the bold numbers represent the best performance compared to the other methods. At last, comparisons of the computational complexities and inference speeds between the proposed methods and the baseline methods are exhibited.

Comparison Results on the NWPU VHR-10 Data Set
Figure 6 shows the qualitative inference results of the proposed Transformer-based detectors on the NWPU VHR-10 data set. As illustrated in the figures, the proposed T-TRD-DA can detect most object instances in RSIs and correctly identify their categories. Even if the object instances are small, which are hard to detect, the T-TRD-DA still performs well. Figure 7 provides a qualitative comparison between the proposed T-TRD-DA and YOLO v3. In Figure 7a,b, the smaller storage tanks are all detected by the proposed T-TRD-DA, while YOLO v3 omits some of them. In Figure 7c,d, the T-TRD-DA recognizes almost all vehicles, while YOLO v3 leaves out more than half of them. As a consequence, in contrast to YOLO v3, the proposed T-TRD-DA is shown not to be susceptible to small-scale objects, clustered objects, objects obscured by shadows, etc. Table 1 shows the comparison results on the NWPU VHR-10 data set, where ST denotes the storage tank, BD the baseball diamond, TC the tennis court, BC the basketball court, and GT the ground-track field.
As shown in the table, the CNN-based methods exhibit a noticeable advantage over the traditional BOW-based SSCBOW method and the SVM-based COPD method. Among these CNN-based methods for object detection in RSIs, the Faster RCNN is the most representative one, which can swiftly provide region proposals and then make precise predictions. The FPN is often used for multi-scale fusion of the features extracted from the CNN backbone, which effectively enhances the detection capability for small object instances. Therefore, the Faster RCNN with FPN is a relatively competitive baseline method for object detection in RSIs. Nevertheless, the proposed TRD outperforms all the baseline methods and surpasses the Faster RCNN with FPN baseline with a 0.02 improvement in mAP. With the same backbone to extract features of RSIs, the Transformer-based detection head of the TRD exhibits its powerful detection capability and exceeds the CNN-based detection heads, which demonstrates the feasibility of using the Transformer for object detection in RSIs. Furthermore, with the attention-based transferring backbone and the data augmentation, the T-TRD-DA achieves a better detection performance, with an mAP that reaches 0.879, and obtains outstanding APs in all categories. As a consequence, the improvements constitute efficient progress for the proposed Transformer-based RSI object-detection framework.
Additionally, the comparison results of the proposed methods and baseline methods on objects of specific scale ranges, i.e., large, middle, small, are reported in Table 2. The mAP of the Faster RCNN baseline is limited in its detection of small objects because its backbone only outputs the highest-level features, which have low resolution and cause poor detection performance. The FPN capable of multi-scale feature fusion effectively solves this problem. Therefore, the Faster RCNN with FPN baseline achieves great improvement on small objects. The proposed TRD and T-TRD-DA can aggregate multi-scale features without FPN, and they also have outstanding detection capacities for small objects. Moreover, the proposed Transformer-based detectors also perform well on large objects and middle objects, which means a better overall detection capability.

Comparison Results on the DIOR Data Set
To further evaluate the effectiveness of the proposed Transformer-based detectors, the detectors are trained on the DIOR data set and compared with more competitive baseline methods. Figure 8 shows the qualitative inference results of the proposed T-TRD-DA on the DIOR data set. It is obvious that the proposed T-TRD-DA exhibits an intuitively satisfactory detection capability on the large-scale challenging data set. The precision-recall curves of each category are provided in Figure 9, which intuitively show the detailed relationship between precision and recall. ETA and ESA are the abbreviations of expressway toll station and expressway service area, respectively. It can be seen that the proposed T-TRD-DA detector exhibits a superior performance in most categories, such as airplane, ground-track field, tennis court, etc. Table 3 shows the results on the DIOR data set and compares the proposed TRD and T-TRD-DA to five representative deep-learning-based methods, including the AP values of the 20 categories and the mAP. Among these baseline methods, the Mask RCNN, which was originally designed for object-instance segmentation, is extended from the Faster RCNN and achieves state-of-the-art object-detection performance.
With the FPN, both the Faster RCNN and the Mask RCNN can detect objects with a wide variety of scales and achieve considerable gains in overall detection performance. Additionally, as shown in Table 4, compared to the Faster RCNN and the Faster RCNN with FPN, the proposed TRD acquires outstanding detection capacity in all three scale ranges, especially on small objects. The proposed T-TRD-DA achieves the best performance, which is attributed to the multi-scale-feature embedding. Above all, with the powerful context-modeling capabilities of the Transformer, the proposed Transformer-based detectors can accurately detect the objects of interest in complicated RSIs.

Ablation Experiments
Four sets of ablation experiments on both data sets are performed to evaluate the effectiveness of the improvements in the proposed T-TRD-DA, and the results are reported in Table 5. The results indicate that both the attention-based transferring backbone and the data augmentation benefit the detection performance of the TRD. The transferring backbone utilizes the knowledge learned from the source-domain data to extract more effective features of the RSIs, and then uses the attention mechanism to adaptively regulate the channel-wise features. Additionally, the data augmentation enriches the orientations, scales, and backgrounds of the object instances, which strengthens the generalization performance of the detectors. Therefore, the final T-TRD-DA achieves a competitive detection capability and indicates the great potential of the Transformer for RSI object detection.

Comparison of the Computational Complexity and Inference Speed
To evaluate the computational efficiency of the methods, the floating-point operations (FLOPs) and the inference speeds of the proposed Transformer-based methods and three baseline methods are reported in Table 6. The FLOPs and FPS of each method are measured with the analysis tools of MMDetection, with inputs of size 800 × 800 from both data sets. As is shown, the FLOPs of the proposed Transformer-based detection models are close to those of the baseline models, and are only higher than those of YOLO v3. However, due to the high computational cost of the Transformer, the inference speeds still leave room for improvement.

Discussion
In the experiments, the proposed Transformer-based methods were evaluated and compared with the state-of-the-art CNN-based RSI object-detection frameworks. The experiments demonstrated the effectiveness of the proposed Transformer-based frameworks and their advantages over the CNN-based frameworks.
From the qualitative inference results in Figures 6-8, it could be seen that the proposed T-TRD-DA could accurately recognize objects of various categories, scales, and orientations.
The predicted bounding boxes were very close to the GT boxes. Additionally, from the quantitative evaluation results in Tables 1 and 3, the TRD and T-TRD-DA achieved 82.9 and 87.9 on the NWPU VHR-10 data set, and obtained 64.6 and 66.8 on the DIOR data set in terms of centuple mAP, respectively.
From the ablation experiments in Table 5, compared with the TRD, the proposed T-TRD obtained an improvement of 0.6 in terms of centuple mAP on the NWPU VHR-10 data set. The gain was modest, but it showed that proper adjustment of the feature maps led to better RSI detection performance. Moreover, the TRD-DA improved by 3.7 in terms of centuple mAP on the NWPU VHR-10 data set. The overfitting problem caused by limited training samples was mitigated by the data augmentation in the TRD-DA. With the two improvements combined, the proposed T-TRD-DA improved by 5.0 in terms of centuple mAP on the NWPU VHR-10 data set. Therefore, the attention-based transferring backbone and the data augmentation were both efficient and indispensable in the proposed T-TRD-DA.
From Tables 1 and 3, the proposed TRD and T-TRD-DA methods both exceeded all the competitive CNN-based RSI object-detection methods. For example, Faster RCNN only obtained 0.554 in terms of mAP on the DIOR dataset. The proposed TRD, which was based on a well-designed Transformer, obtained 0.646 in terms of mAP on the DIOR dataset. The results of the comparison experiments revealed the advantages of the proposed Transformer-based methods, which are discussed as follows.
Firstly, CNN-based methods were good at object detection in general. However, for RSI object-detection tasks, due to the large spatial size of RSIs (e.g., the images in the DIOR data set are 800 × 800), it was difficult for CNNs to obtain a global representation of the RSIs. The Transformer was good at capturing long-distance relationships; hence, it could obtain more discriminative features.
Secondly, CNN-based methods usually required the FPN [14] for multi-scale feature fusion to improve the performance on small objects. From Tables 2 and 4, the TRD and T-TRD-DA performed better on objects of various scales than the CNN-based methods with the FPN, especially on small objects. In contrast to the FPN, which added the down-sampled features at the same positions of all the scales, the proposed Transformer-based frameworks could adaptively integrate features at various crucial positions of different scales; therefore, they achieved impressive small-object-detection performance.
Additionally, the representative CNN-based frameworks, such as the Faster RCNN [12] or YOLO v3 [20], were usually based on anchors. However, the settings of the sizes, number, and aspect ratios for anchor generation affected the detection performance. The proposed TRD and T-TRD-DA aggregated the pyramidal features and acquired spatial- and level-aware feature points for representing instances. Therefore, the proposed methods were anchor-free and convenient to train.
Moreover, from Table 6, although deformable attention was developed in the TRD and T-TRD-DA to simplify the calculation of the Transformer, the inference speeds of the proposed methods were slower than those of the CNN-based methods. More research into the improvement of inference speed is required.
Above all, in this study, a modified Transformer combined with a transfer CNN was proposed for RSI object detection. Elaborate experiments and analyses have indicated the superiority of the proposed Transformer-based frameworks. In addition, the disadvantages have been analyzed for further research on developing Transformer-based RSI object-detection methods.

Conclusions
In this study, Transformer-based frameworks were explored for RSI object detection. It was found that the Transformer was good at obtaining long-distance relationships; therefore, it could capture global spatial- and scale-aware features of RSIs and detect objects of interest. The proposed TRD used the pre-trained CNN to extract local discriminative features, and the Transformer was modified to process the feature pyramid of an RSI and predict the categories and box coordinates of the objects in an end-to-end manner. By combining the advantages of the CNN and the Transformer, the experimental results in diverse terms demonstrated that the TRD achieved impressive RSI object-detection performance for objects of different scales, especially small objects.
There was still a lot of room for improvement in the TRD. On the one hand, the use of the pre-trained CNN faced the problem of data-set shift (i.e., the source data set and the target data set were quite different). On the other hand, there were insufficient training samples for RSI object detection to train a Transformer-based model. Hence, to further improve the performance of the TRD, an attention-based transferring backbone and data augmentation were combined with the TRD to formulate the T-TRD-DA. The ablation experiments on various structures, i.e., TRD, T-TRD, TRD-DA, and T-TRD-DA, have shown that the two improvements as well as their combination were efficient. The T-TRD-DA was proved to be a state-of-the-art RSI object-detection framework.
Compared with the CNN-based frameworks, the proposed T-TRD-DA was demonstrated to be a better detection architecture. There were no anchors, non-maximum suppression, or FPN in the proposed frameworks. Nevertheless, the T-TRD-DA exceeded YOLO v3 and the Faster RCNN with FPN in detecting small objects. As an early-stage Transformer-based detection method, the T-TRD-DA showed the potential of Transformer-based RSI object-detection methods. However, the proposed Transformer-based frameworks have the problem of low inference speed, which is another topic for further research.
Some recent modifications of the Transformer, including the self-training Transformer and the transferring Transformer, can be investigated for RSI object detection in the near future.
The findings reported in this study have some implications for effective RSI object detection, which show that Transformer-based methods have huge research value in the area of RSI object detection.