TRS: Transformers for Remote Sensing Scene Classification

Abstract: Remote sensing scene classification remains challenging due to the complexity and variety of scenes. With the development of attention-based methods, Convolutional Neural Networks (CNNs) have achieved competitive performance in remote sensing scene classification tasks. As an important attention-based model, the Transformer has achieved great success in the field of natural language processing. Recently, the Transformer has been used for computer vision tasks. However, most existing methods divide the original image into multiple patches and encode the patches as the input of the Transformer, which limits the model's ability to learn the overall features of the image. In this paper, we propose a new remote sensing scene classification method, Remote Sensing Transformer (TRS), a powerful "pure CNNs → Convolution + Transformer → pure Transformers" structure. First, we integrate self-attention into ResNet in a novel way, using our proposed Multi-Head Self-Attention layer instead of the 3 × 3 spatial convolutions in the bottleneck. Then we connect multiple pure Transformer encoders to further improve the representation learning performance, depending entirely on attention. Finally, we use a linear classifier for classification. We train our model on four public remote sensing scene datasets: UC-Merced, AID, NWPU-RESISC45, and OPTIMAL-31. The experimental results show that TRS exceeds the state-of-the-art methods and achieves higher accuracy.


Introduction
With the rapid development of remote sensing technology and the emergence of more sophisticated remote sensing sensors, remote sensing technologies have been widely used in various fields [1][2][3][4]. As one of the core tasks of remote sensing, remote sensing scene classification is often used as a benchmark to measure the understanding of remote sensing scene images. The progress of remote sensing scene classification often promotes the improvement of other related tasks, such as remote sensing image retrieval and target detection [1,2].
The traditional remote sensing scene classification method mainly relies on the spatial features of images [5,6]. However, error rates remain high for complex remote sensing scenes. In recent years, with the development of deep learning, many deep convolutional neural network models have made significant progress in remote sensing scene classification. The convolution operation can effectively capture the local information of an image. The authors of [7,8] proved that different features can be extracted by convolutional layers of different depths. To aggregate global features, neural networks based on convolution operations need to stack multiple layers [9]. He et al. [10] proposed ResNet to make Convolutional Neural Networks (CNNs) deeper and easier to train. However, Liang et al. [11] suggested that relying only on a fully connected layer for classification ignores the features of different convolutional layers in CNNs. Compared with stacking more layers to improve the accuracy of remote sensing scene classification, establishing relationships among local features through an attention mechanism is a more effective approach.
The self-attention-based structure proposed in the Transformer [12] is dominant in natural language processing (NLP) tasks. Self-attention can learn abundant features from long-sequence data and establish dependency relationships between different features. BERT and GPT [13][14][15][16] were proposed based on the Transformer architecture. Inspired by the success of NLP, many researchers have applied self-attention to computer vision tasks. Wang et al. [17] and Ramachandran et al. [18] proposed special attention modes to completely replace the convolution operation, but these have not yet been scaled effectively on modern hardware accelerators. SENet [19], CBAM [20], SKNet [21], and Non-Local Net [22] combine self-attention with CNNs (such as ResNet). However, convolution operations are still the core of these methods, and self-attention is added to the bottleneck structure in the form of additional modules. Recently, applications of the Transformer architecture to computer vision tasks have shown great promise. Dosovitskiy et al. [23] proposed the Vision Transformer (ViT). The ViT directly inputs the image into the standard Transformer encoder, which can learn the dependencies between different positions of the image well but ignores the overall semantic features of the image, and the accuracy of the ViT is only close to that of CNNs. Several works have also used the "Convolution + Transformer" structure. Touvron et al. [24] used knowledge distillation to allow a CNN to assist in training the ViT, but this made training difficult. Carion et al. [25] proposed the end-to-end DETR, which uses CNNs as the backbone to extract features and connects them with Transformers to complete object detection. However, DETR has not been proven effective for image classification.
Due to the lack of inductive bias [25,26], the number of images in remote sensing scene datasets is not enough for the Transformer to achieve good results without an ImageNet1K pre-trained model. Therefore, we need to combine CNNs with Transformers. Existing "Convolution + Transformer" models reshape the outputs of the CNN backbone and connect them with Transformers. We believe that these models ignore the information contained in the three-dimensional representations of images. Therefore, we aim to design a Transformer capable of processing three-dimensional matrices as a transition module between CNNs and standard Transformers. We also find a close relationship between the standard bottleneck structure and the Transformer architecture (for details, see Section 3.4), which motivates our proposed MHSA-Bottleneck.
In this paper, we develop a remote sensing Transformer (TRS) based on ResNet50 and Transformer architecture, which significantly boosts the remote sensing scene classification performance and reduces the dependence of the model on convolution operation. We propose a novel "pure CNNs → Convolution + Transformer → pure Transformers" structure. Different from the conventional "Convolution + Transformer" method, we do not simply connect the CNNs and Transformers, but integrate the Transformers into CNNs. We replace the last three bottlenecks of ResNet50 with multiple Transformer encoders and design the MHSA-Bottleneck. We replace the 3 × 3 spatial convolutions in the bottleneck with position-encoded Multi-Head Self-Attention rather than using the attention mechanism as an auxiliary module to the convolution module. Our contribution is not only the successful application of Transformers to remote sensing classification tasks, but also the provision of a special way to understand bottleneck structure.
We summarize our contributions as follows: (1) We apply the Transformer to remote sensing scene classification and propose a novel "pure CNNs → CNN + Transformer → pure Transformers" structure called TRS. The TRS combines Transformers with CNNs to achieve better classification accuracy. (2) We propose the MHSA-Bottleneck, which uses Multi-Head Self-Attention instead of the 3 × 3 spatial convolutions. The MHSA-Bottleneck has fewer parameters and performs better than the standard bottleneck and other bottlenecks improved by attention mechanisms. (3) We also provide a novel way to understand the structure of the bottleneck: we demonstrate the connection between the MHSA-Bottleneck and the Transformer, and regard the MHSA-Bottleneck as a 3D Transformer. (4) We complete training on four public datasets: NWPU-RESISC45, UC-Merced, AID, and OPTIMAL-31. The experimental results prove that TRS surpasses existing state-of-the-art CNN methods.
The rest of this paper is organized as follows. Section 2 introduces our related work, and Section 3 introduces the structure and algorithm of the TRS in detail. The ablation study and state-of-the-art comparison are shown in Section 4. Section 5 presents the conclusion of our article.

CNNs in Remote Sensing Scene Classification
CNNs have been the dominant method of image classification since AlexNet [27] won the ImageNet competition in 2012. The emergence of various CNNs has made a great contribution to the improvement of image classification accuracy, and these deep models also perform well on remote sensing datasets. Cheng et al. [28] fine-tuned AlexNet [27], GoogleNet [29], VGGNet [9], etc., and proposed a benchmark for remote sensing scene classification. Due to the excellent performance of the optimized VGG-16 [30], it is often used as the backbone for feature extraction. ResNet [10] increases the depth of the network, reduces the model parameters, and improves the training speed by using residual modules. EfficientNet [31] balanced the depth and width of the network to obtain better results. Bi et al. [32] proposed an Attention Pooling-based Dense Connected Convolutional Neural Network (APDC-Net) as the backbone and adopted a multi-level supervision strategy. Hu et al. [33] argued that the abundance of prior information is an important factor affecting the accuracy of remote sensing scene classification, and proposed pre-training the model on ImageNet. Li et al. [34] used different convolutional layers of a pre-trained CNN to extract information. Zhang et al. [35] proposed the Gradient Boosting Random Convolutional Network (GBRCN), which selects different deep convolutional neural network models for different remote sensing scenes. The problem with CNNs is that they can only focus on local information within the size of each convolution kernel. To address this problem, GBNet [36] integrated layered feature aggregation into an end-to-end network. Xu et al. [37] proposed the Lie Group Regional Influence Network (LGRIN), which combines Lie group machine learning with CNNs and achieved state-of-the-art performance.

Attention in CNNs
Although integrating multi-layer features and increasing the depth of the network can improve classification accuracy, using local information to establish dependencies is clearly a better choice. The attention mechanism is widely used to obtain global information in CNNs. For example, Wang et al. [38] proposed a "CNN + LSTM" model in ARCNet, which used an LSTM instead of feature fusion to establish connections between multiple layers. Yu et al. [39] presented Attention GANs, which are optimized with attention and feed the learned features into an SVM [40][41][42] or KNN [43][44][45][46][47][48][49][50] for classification. There are also methods that optimize the bottleneck of ResNet with their own self-attention. SENet [19] proposes the Squeeze-and-Excitation (SE) module to learn the relationships between channels: the Squeeze operation obtains channel-level global features from the feature map, and the SE module then performs an Excitation operation on the global features. CBAM [20] adds spatial attention on top of SENet. Non-Local Net [22] combines the Transformer and non-local algorithms to capture long-range dependencies through global attention. ResNeSt [51] introduces the Split-Attention block to realize multi-layer feature-map attention. There are three differences between the MHSA-Bottleneck and the bottlenecks optimized by the above methods: (1) Compared with Non-Local Net, the MHSA-Bottleneck uses multiple attention heads and adds position embedding. (2) The MHSA-Bottleneck uses Multi-Head Self-Attention instead of the 3 × 3 spatial convolutions, whereas SENet, CBAM, and Non-Local Net are usually added to the bottleneck structure in the form of additional modules, which increases model parameters and computational cost. (3) The convolution mechanism is still the core of SENet, CBAM, Non-Local Net, and ResNeSt; the MHSA-Bottleneck removes the 3 × 3 spatial convolutions and relies on Multi-Head Self-Attention for learning.
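For reference, the Squeeze (global average pooling) and Excitation (two fully connected layers with a sigmoid gate) operations of the SE module can be sketched as follows. This is a minimal NumPy sketch; the weight shapes and the choice of reduction ratio are illustrative assumptions rather than [19]'s exact configuration:

```python
import numpy as np

def se_block(x, w1, w2):
    """Minimal Squeeze-and-Excitation. x: feature map (H, W, C);
    w1: (C, C//r) and w2: (C//r, C) are the excitation FC weights."""
    # Squeeze: global average pooling -> per-channel descriptor of shape (C,)
    z = x.mean(axis=(0, 1))
    # Excitation: FC -> ReLU -> FC -> sigmoid, producing a gate in (0, 1) per channel
    s = np.maximum(z @ w1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))
    # Rescale: channel-wise reweighting of the original feature map
    return x * s
```

The gate `s` lets the network emphasize informative channels and suppress less useful ones at negligible parameter cost, which is why SE modules attach cheaply to existing bottlenecks.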

Transformer in Vision
The Transformer was originally proposed by [12] for natural language processing tasks. Self-attention was introduced in the Transformer, which has the advantage of performing global calculations on input sequences and summarizing the information for an update. In the fields of NLP and speech recognition, Transformers are replacing Recurrent Neural Networks [52][53][54][55]. Recently, several works have applied the Transformer to computer vision. Parmar et al. [56] initially used each pixel of the image as the input of the Transformer, which greatly increased the computational cost. Child et al. [57] proposed Sparse Transformers, which are scalable modules suitable for image processing tasks. The Vision Transformer [23] divides a picture into multiple patches as the input of the model, where the size of each patch is 16 × 16 or 14 × 14. However, the ViT ignores the overall semantic features of the image and requires additional datasets to assist training [58]. Bello et al. [59] proposed a combination of CNNs and Transformers. DETR [25] used Transformers to further process the 2D image representations output by CNNs. Tokens-to-Token (T2T) ViT [60] designed a deep and narrow backbone and proposed a "Tokens-to-Token module" to model local information. DeiT [24] drew on T2T and uses knowledge distillation [61,62] to improve the original ViT. The PVT [25] combines convolution and the ViT to make it more suitable for downstream tasks. The Swin Transformer [63] uses window attention to combine global and local information and is one of the best-performing models. However, the Swin Transformer and PVT still lack inductive bias and need large amounts of data to complete training. Recently, many researchers have applied the Transformer to remote sensing tasks. MSNet [64] proposes a network fusion method for remote sensing spatiotemporal fusion. Bazi et al. [65] applied the ViT structure to remote sensing scene classification. Xu et al. [66] combine the Swin Transformer and UperNet for remote sensing image segmentation. These methods all migrate existing Transformer structures to remote sensing tasks, and the lack of inductive bias remains unresolved.

Overview of TRS
The design of the TRS is based on the ResNet50 architecture and consists of four parts: the stem unit, standard bottleneck, MHSA-Bottleneck, and Transformer encoder. Figure 1 demonstrates the overall architecture of the TRS. First, CNNs (stem unit + bottleneck) and the MHSA-Bottleneck learn the 3D representation of input images. Then, position embedding and a class token are added to the representation, which is passed to the Transformer encoders. Finally, a linear classifier completes the classification. The details of the model are described in Table 1.

Stem Unit
CNNs usually start from a stem unit that quickly reduces the image resolution. Similar to ResNet50, the TRS starts with a 7 × 7 convolution with stride 2 and zero-padding of 3. As the Transformer has strict restrictions on the input size, different convolution operations are selected for images of different sizes. For example, when the resolution of the remote sensing image is 600 × 600, we use two 7 × 7 convolution kernels with strides of 5 and 1, respectively, as the stem unit.
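The spatial sizes produced by these stem configurations follow from standard convolution arithmetic. The sketch below assumes zero-padding of 3 for each 7 × 7 kernel (the padding for the 600 × 600 case is not stated explicitly in the text):

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor division, as in most frameworks)."""
    return (size + 2 * padding - kernel) // stride + 1

# 224 x 224 input: single 7x7, stride 2, padding 3 (ResNet50-style stem)
s1 = conv_out(224, kernel=7, stride=2, padding=3)   # 112

# 600 x 600 input: two 7x7 kernels with strides 5 and 1 (assumed padding 3)
s2 = conv_out(600, kernel=7, stride=5, padding=3)   # 120
s3 = conv_out(s2, kernel=7, stride=1, padding=3)    # 120 (stride 1 preserves size)
```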

Transformer Architecture
We only chose the Transformer encoder as a component of the TRS. As shown in Figure 2, the overall Transformer encoder architecture consists of three parts: Multi-Head Self-Attention, position embedding, and the feed-forward network.

Multi-Head Self-Attention: Multi-Head Self-Attention is an important component for modeling the relationships between feature representations in the Transformer. As shown in Table 1, the output of Stage4 (S4) is I = (14, 14, 1024). We feed I into a convolution with a kernel size of 1 × 1 to obtain I' = (14, 14, d). We add the class token to I' and flatten the first two dimensions, obtaining N d-dimensional vectors as the input of the Transformer encoder (N = 14 × 14 + 1). Let M = (N, d) denote the input of the Transformer. The self-attention layer proposed in [14] uses the query, key, and value matrices (QKV) to train an associative memory. The QKV matrices are calculated as

Q = MW_Q, K = MW_K, V = MW_V, (1)

where W_Q, W_K, and W_V are trainable matrices. We use the inner product to match the Q matrix and the K matrix, and d^{1/2} to complete the normalization. Then, the Softmax function processes the normalized inner product result. The output of self-attention is expressed as

Attention(Q, K, V) = Softmax(QK^T / d^{1/2})V. (2)

The authors of [14] proved that multiple attention heads can learn detailed information and improve classification performance. Multi-Head Self-Attention divides Q, K, and V into several attention heads. Setting the number of attention heads to h and d' = d/h, we use h heads of size (N, d') for the calculation according to (2). Finally, we remap the output matrix to (N, d).
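As a concrete illustration, the multi-head computation above can be sketched in NumPy. This is a minimal single-layer sketch, not the paper's implementation: random matrices stand in for the trained W_Q, W_K, W_V, and each head is scaled by the square root of its own dimension d', as is conventional:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(M, Wq, Wk, Wv, h):
    """M: (N, d) token matrix; Wq/Wk/Wv: (d, d) trainable projections; h: number of heads."""
    N, d = M.shape
    dh = d // h                                   # per-head dimension d' = d / h
    Q, K, V = M @ Wq, M @ Wk, M @ Wv              # query / key / value projections
    # split into h heads of size (N, d')
    Q = Q.reshape(N, h, dh).transpose(1, 0, 2)
    K = K.reshape(N, h, dh).transpose(1, 0, 2)
    V = V.reshape(N, h, dh).transpose(1, 0, 2)
    # scaled dot-product attention per head
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))
    # concatenate heads and remap the output back to (N, d)
    return (A @ V).transpose(1, 0, 2).reshape(N, d)
```

With the TRS settings, N = 14 × 14 + 1 = 197 tokens of dimension d enter the encoder and leave with the same shape, so encoders stack freely.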
Position Embedding: The self-attention structure in the Transformer cannot capture the order of the input sequence, so we use position embedding to supplement the position information of the remote sensing image [14]. Since different dimensions should use different encoding functions, we use the Sine and Cosine functions to compute the absolute position for the even and odd dimensions of the vector, respectively:

PE(pos, 2i) = sin(pos / λ^{2i/d}), PE(pos, 2i + 1) = cos(pos / λ^{2i/d}), (3)

where pos ∈ N, i ∈ d, and λ is a hyperparameter that controls the wavelength of the periodic function. Position embedding is added to the Q and K matrices.

Feed-forward network (FFN): The FFN is composed of two fully connected layers, FC1 and FC2. FC1 changes the input dimension from (N, d) to (N, 4d), and FC2 changes it from (N, 4d) back to (N, d). Gaussian Error Linear Units (GeLU) [67] are used as the activation function of FC1. GeLU combines dropout, Zoneout, and ReLU [68]:

GeLU(x) = xδ(x), (4)

where δ(x) is the cumulative distribution function of the normal distribution. Assuming that δ(x) follows the standard normal distribution, the approximate calculation of GeLU [67] is

GeLU(x) ≈ 0.5x(1 + tanh((2/π)^{1/2}(x + 0.044715x³))). (5)

We process the output of FC2 with dropout at a rate of 0.1. Layer Normalization [69] is used for normalization in the Transformer; we apply it after the Multi-Head Self-Attention and the FFN, respectively. In the TRS, we replace the last three bottlenecks of ResNet50 with multiple Transformer encoders. Finally, a fully connected layer activated by a Softmax function predicts the categories of the remote sensing scenes.
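The sinusoidal position embedding and the tanh approximation of GeLU described above can be sketched in NumPy as follows (λ defaults to the paper's value of 10,000; the even dimensions get sine and the odd dimensions get cosine):

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GeLU: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def position_embedding(N, d, lam=10000.0):
    """Sinusoidal absolute position embedding for N positions of dimension d (d even)."""
    pe = np.zeros((N, d))
    pos = np.arange(N)[:, None]          # positions 0..N-1, column vector
    i = np.arange(0, d, 2)[None, :]      # even dimension indices 2i
    pe[:, 0::2] = np.sin(pos / lam ** (i / d))   # sine on even dimensions
    pe[:, 1::2] = np.cos(pos / lam ** (i / d))   # cosine on odd dimensions
    return pe
```

Because each dimension oscillates at a different wavelength, every position receives a distinct, smoothly varying code, and relative offsets correspond to fixed linear transformations of the embedding.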

MHSA-Bottleneck Architecture
As shown in Figure 3, the MHSA layer is used to replace the 3 × 3 spatial convolutions in the bottleneck. The standard Transformer can only take a two-dimensional matrix as input, which ignores the three-dimensional relationships that exist in the feature map. To address this problem, we designed the MHSA-Bottleneck for three-dimensional matrices as an intermediate structure between the CNNs and Transformer encoders. Recent work [70] suggests that excessive use of Batch Normalization (BN) [71] affects the independence of training samples within a batch, so we replaced BN with Group Normalization [72]. We use 1 × 1 convolutions to obtain the Q, K, and V matrices. We find that the absolute position embedding used in the Transformer encoder does not perform well for a three-dimensional matrix, so we use the relative position coding of [18]. The attention calculation is shown in (6), and the architecture of the MHSA layer is shown in Figure 4:

Z = Softmax((QK^T + QP^T) / d^{1/2})V, (6)

where P is the relative position embedding matrix. We used the MHSA-Bottleneck to replace 6 bottlenecks in ResNet50. We do not replace the spatial convolutions of all bottlenecks with MHSA layers because, in our experiments, self-attention did not extract image edge features and semantic features as well as CNNs. The specific experimental results are shown in Section 4. In addition, the relationship between MHSA-Bottlenecks and Transformers is another important contribution. As shown in Figure 5, we believe that stacked MHSA-Bottlenecks can be regarded as a Transformer encoder that processes three-dimensional matrices: (1) We regard conv1 and conv2 in Figure 5b as the FFN in the Transformer architecture.
Both the FFN and these convolutional layers expand a certain dimension of the input matrix by a factor of 4 and then compress it back to its original size. (2) Both the MHSA-Bottleneck and the Transformer architecture use residual connections.
The specific differences can be found in Figure 5.
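The MHSA layer's attention with relative position can be sketched as follows. This is a simplified, single-head sketch in which P is collapsed into one (N, d) matrix over the flattened spatial positions, whereas [18] factorizes separate relative height and width embeddings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_relative(X, Wq, Wk, Wv, P):
    """Self-attention with a relative position term (cf. (6)).
    X: (N, d) flattened feature map; Wq/Wk/Wv: (d, d); P: (N, d) position matrix."""
    d = X.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # content-content term QK^T plus content-position term QP^T, jointly normalized
    logits = (Q @ K.T + Q @ P.T) / np.sqrt(d)
    return softmax(logits) @ V
```

The content-position term lets each query attend according to where a feature is, not just what it is, which is what the 3 × 3 convolution's fixed spatial offsets provided before the replacement.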

Experiments
In this section, we introduce the datasets, training details, and evaluation protocol used in the experiment. We perform a comprehensive ablation study from three aspects and then provide state-of-the-art comparisons.

Dataset Description
UC-Merced Dataset: The UC-Merced dataset is one of the most classic datasets in remote sensing scene classification, containing 2100 remote sensing scene images across 21 scene types. Each type contains 100 pictures, and the resolution of each picture is 256 × 256. UC-Merced was first proposed in [73], and the authors used the data again in [74]. The dataset was collected by the U.S. Geological Survey.
Aerial Image Dataset: The Aerial Image Dataset (AID) [75] was collected from Google Earth by Wuhan University. AID contains a total of 10,000 remote sensing scene images covering 30 scene categories, making it a large-scale dataset.
NWPU-RESISC45 Dataset: NWPU-RESISC45 (NWPU) [9] is a large-scale dataset created by Northwestern Polytechnical University. NWPU consists of 31,500 images of 256 × 256 pixels, covering 45 remote sensing scene categories with 700 images each.
OPTIMAL-31 Dataset: OPTIMAL-31 [38] was also collected from Google Earth by Wuhan University. OPTIMAL-31 is a relatively small dataset composed of complex remote sensing scenes, with 1860 images in total. Each type contains 60 pictures, and the resolution of each picture is 256 × 256.

Training Details
The training equipment we used is shown in Table 2. All experiments are trained for 80 epochs with Adam as the optimizer. Since we adopt distributed training on 4 × NVIDIA TITAN Xp GPUs, we set the initial learning rate to 0.0004 (0.0001 × 4) and the weight decay to 0.00001. We reshape UC-Merced, NWPU, and OPTIMAL-31 to 224 × 224 and set the batch size to 64; we reshape AID to 600 × 600 and set the batch size to 16. We use 12 Transformer encoders instead of the last three bottlenecks in ResNet50, and the number of Multi-Head Self-Attention heads in the Transformers is also 12. The number of heads in the MHSA-Bottleneck is set to 6, the hyperparameter λ of the absolute position embedding is set to 10,000, and d is set to 384.
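The learning-rate choice follows the common linear scaling rule (per-GPU base rate × number of GPUs). The sketch below restates the hyperparameters above; the `scaled_lr` helper and the `config` dictionary are ours, for illustration only:

```python
def scaled_lr(base_lr, num_gpus):
    """Linear scaling rule: multiply the per-GPU base learning rate by the GPU count."""
    return base_lr * num_gpus

# TRS training configuration as reported in the paper
config = {
    "epochs": 80,
    "optimizer": "Adam",
    "lr": scaled_lr(0.0001, 4),     # 0.0001 per GPU x 4 GPUs = 0.0004
    "weight_decay": 1e-5,
    "encoders": 12,                  # Transformer encoders replacing the last 3 bottlenecks
    "transformer_heads": 12,
    "mhsa_bottleneck_heads": 6,
    "lambda": 10000,                 # wavelength hyperparameter of absolute position embedding
    "d": 384,                        # token dimension
}
```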
For UC-Merced, we set the training ratio to 50% and 80%. For AID, we set the training ratio to 20% and 50%. For NWPU, we set the training ratio to 10% and 20%. For OPTIMAL-31, we set the training ratio to 80%. Training code will be available at: https://github.com/zhangjrjlu/TRS, accessed on 14 October 2021.

Comparison with CNNs State-of-the-Art
The main purpose of this paper is to demonstrate that optimizing CNNs with Transformers can improve network performance. Therefore, we do not compare TRS with methods based on traditional handcrafted features. We use overall accuracy as our evaluation metric, and all comparison results are those reported by other researchers. In addition, we only use ImageNet1K pre-trained parameters in S1, S2, and S3 of the model.
UC-Merced Dataset: The experimental results are shown in Table 3. The "-" in Table 3 means that the model did not complete the experiment under 50% or 80% training; this notation applies to the experimental results of all datasets and will not be explained hereafter. When the training ratio is 50%, Xu et al. [37] designed Lie group features and proposed a new pooling method to improve the training effect, obtaining an accuracy of 98.61 ± 0.11%. Our TRS achieves 98.76 ± 0.23% accuracy, which is 0.15% higher than Xu's method. When the training ratio is 80%, our method achieves 99.52 ± 0.17% accuracy, which is 0.54% higher than EfficientNet-B3-aux [76] and 0.55% higher than Contourlet CNN [77]. ResNeXt101 + MTL [78] uses multitask learning and achieves an accuracy of 99.11 ± 0.25%, but it is still 0.41% lower than our method. ARCNet + VGGNet16 introduced multi-layer LSTMs optimized for the UC-Merced dataset and achieves an accuracy of 99.12 ± 0.40%. TRS is 0.40% higher than ARCNet + VGGNet16, which not only proves the effectiveness of our method but also shows that Transformers outperform LSTMs. The confusion matrices of the results on the UC-Merced test set are shown in Figure 6.

Comparison with Other Attention Models
In this section, we compare TRS with other models that use attention to optimize ResNet50. As shown in Table 7, our TRS is 3.52% and 6.65% higher than the standard ResNet50 on AID and NWPU, respectively. Our method is also 1.77% and 2.06% higher than ResNeSt, which is known as the best-improved version of ResNet.
Interpretability is an important evaluation criterion for deep models. We visualize Class Activation Mapping (CAM) [86] with Grad-CAM and Guided-Backpropagation (GB) [87] to show that TRS is interpretable, and compare the visualized results with SENet [19], Non-Local Net [22], and ResNeSt [51]. Class activation mapping shows how each pixel of the image affects the output of the model, and Guided-Backpropagation shows the features extracted by the model. We selected 10 remote sensing scene classes: (a) Airplane, (b) Baseball diamond, (c) Basketball court, (d) Bridge, (e) Church, (f) Freeway, (g) Lake, (h) Roundabout, (i) Runway, (j) Thermal power station. The experimental results shown in Figure 10 demonstrate that TRS has more powerful performance than the other attention models, and they also explain, from the perspective of interpretability, why TRS achieves higher remote sensing scene classification accuracy.

Comparison with Other Transformers
We also compared the TRS with other Transformers. The experimental results are shown in Tables 8 and 9. TRS has obvious advantages over other excellent Transformers. The experimental results show that the ViT [23] based on global attention does not show strong performance for remote sensing scene classification, but the ViT-Hybrid [23] does well. The Swin Transformer [63] uses windows to complete local and global attention, and achieves better results than CNNs. However, the TRS achieves higher accuracy than the Swin Transformer. For the experiments of these Transformers, our code was based on the Timm package (https://github.com/rwightman/pytorch-image-models, accessed on 14 October 2021), and we used the ImageNet1k pretrained model.

Training, Testing Time and Parameters
Training and testing time can intuitively reflect the efficiency of a model. Acc. in the table refers to overall accuracy, and FLOPs refers to floating-point operations. All experiments are performed on an NVIDIA RTX 2080 Ti GPU. To compare the time it takes to train and test each model for one epoch, we use the tqdm package. As shown in Table 10, the time TRS takes to train and test one epoch is very close to that of ResNet-101; nevertheless, the former's accuracy is higher than the latter's. Compared with the models whose accuracy is close to ours, the training and testing times of Swin-Base are 5 s and 0.9 s slower than TRS, respectively, and the training and testing times of ViT-Hybrid are 6.9 s and 2.9 s slower than our model. We also report the parameters and FLOPs of the models. The weight parameters and FLOPs of TRS are 46.3M and 8.4G, respectively, exceeding ResNet-101 by only 0.3M and 0.8G. However, the weight parameters and FLOPs of TRS are 41.7M and 7G lower than those of Swin-Base, which ranks second in accuracy.

Ablation Study
In the ablation study, we explored how the components of the TRS affect the performance of the model. In order to obtain more convincing results, we chose to conduct ablation experiments on two datasets with different resolutions, AID and NWPU. The training ratios of AID and NWPU were 50% and 20%, respectively.

Number of Encoder Layers and Self-Attention Heads
We changed the number of encoder layers and self-attention heads to evaluate the importance of the Transformer architecture. The experimental results are shown in Table 11. When there was no encoder layer, we used GlobalAvgPooling to process the output of Stage3 (S3) in Table 1 and a fully connected layer for scene classification. We found that, without the Transformer encoder, the classification accuracy on AID and NWPU decreased by 7.01% and 9.26%, respectively. When using three encoder layers, the accuracy of the TRS was 0.67% and 1.66% higher than ResNet50, respectively. TRS achieves the best accuracy when the number of Transformer encoders is 12 and, likewise, when the number of Multi-Head Self-Attention heads is 12. These results show that the Transformer architecture is effective for remote sensing scene classification.

We also conducted an ablation study on the arrangement of the MHSA-Bottleneck in the TRS. Several structures are shown in Figure 11 for comparison, and the experimental results are shown in Table 12. The accuracy of TRS (a) was 2.28% and 3.61% lower than TRS (d), respectively. The results demonstrate the importance of using our proposed MHSA-Bottleneck as an intermediate structure between CNNs and Transformer encoders. In TRS (b), all bottlenecks were replaced by the MHSA-Bottleneck, and the accuracy was lower than that of TRS (a). This shows that relying only on self-attention to learn the relationships between features cannot achieve good results, and combining CNNs' feature extraction ability with self-attention performs better. We also tested the number of self-attention heads and the normalization methods of the MHSA-Bottleneck; the experimental results are shown in Table 13.

Our position embedding comes in two forms: absolute position embedding and relative position embedding.
We tried combinations of these two encoding methods, and the results of the experiment are shown in Table 14. Without position embedding, the accuracy of TRS is 4.52% and 2.33% lower than that of ResNet50, respectively; position embedding is therefore required. The accuracy of using relative or absolute position embedding in the Transformers is almost the same, while using relative position embedding in the MHSA-Bottleneck yields higher accuracy than absolute position embedding.
From these ablation experiments, we conclude that the Transformer encoders, MHSA-Bottlenecks, and position embedding all contribute to the performance of TRS.

Application Scenarios
The Transformer is effective at learning global information, which is the current development trend in remote sensing image tasks. However, current work suggests that local information must also be incorporated to obtain better results. Our proposed model uses CNNs to extract local information and Transformers to extract global information. From the results in Table 10, we conclude that the parameters of existing models are redundant for remote sensing scene classification tasks, and there is no need to enlarge models to improve performance. We believe that obtaining better performance with a limited number of parameters is a better way to improve remote sensing scene classification, and TRS performs outstandingly in this regard.
At the same time, our model can not only be used for remote sensing scene classification, but can also provide rich features for downstream tasks. Downstream tasks can obtain image features at different resolutions as the input of an FPN [88], which current methods applying the Transformer to remote sensing scene classification, such as [25,57,89], cannot do well. The features extracted from TRS S2-S5 can be fed into an FPN to complete downstream tasks.

Conclusions
In this paper, we proposed the TRS, a new design for remote sensing scene classification based on the Transformer. We successfully used the Transformer for remote sensing scene classification for the first time, and proposed a novel "pure CNNs → Convolution + Transformer → pure Transformers" structure. We designed the MHSA-Bottleneck, replacing spatial convolution with Multi-Head Self-Attention, and provided a new way to understand the MHSA-Bottleneck as a Transformer that processes three-dimensional matrices. We also replaced the last bottlenecks with multiple standard Transformer encoders. The experimental results on the four public datasets demonstrate that the TRS is robust, surpasses previous work, and achieves state-of-the-art performance.
We hope not only to apply the Transformer encoder to remote sensing scene classification, but also to combine the Transformer encoder and decoder for other remote sensing tasks. In future work, we will attempt to apply the complete Transformer architecture (encoder + decoder) to remote sensing tasks.