Remote Sensing Image Scene Classiﬁcation via Self-Supervised Learning and Knowledge Distillation

: The main challenges of remote sensing image scene classiﬁcation are extracting discriminative features and making full use of the training data. The current mainstream deep learning methods usually only use the hard labels of the samples, ignoring the potential soft labels and natural labels. Self-supervised learning can take full advantage of natural labels. However, it is difﬁcult to train a self-supervised network due to the limitations of the dataset and computing resources. We propose a self-supervised knowledge distillation network (SSKDNet) to solve the aforementioned challenges. Speciﬁcally, the feature maps of the backbone are used as supervision signals, and the branch learns to restore the low-level feature maps after background masking and shufﬂing. The “dark knowledge” of the branch is transferred to the backbone through knowledge distillation (KD). The backbone and branch are optimized together in the KD process without independent pre-training. Moreover, we propose a feature fusion module to fuse feature maps dynamically. In general, SSKDNet can make full use of soft labels and has excellent discriminative feature extraction capabilities. Experimental results conducted on three datasets demonstrate the effectiveness of the proposed approach.


Introduction
Remote sensing images are a valuable data source for observing the earth surface [1]. The rapid development of remote sensing instruments provides us with opportunities to obtain high-resolution images. Remote sensing technology has been widely used in many practical applications, such as remote sensing image scene classification (RSISC) [2], object detection [3], semantic segmentation [4], change detection [5], and image fusion [6,7]. Scene classification is an active research topic in the intelligent interpretation of remote sensing images, and it aims to distinguish a predefined semantic category. For the last few decades, extensive research work on RSISC has been undertaken, driven by its real-world applications, such as urban planning [8], natural hazard detection [9], and geospatial object detection [10]. As the resolution of remote sensing images increases, RSISC has become an important and challenging task.
It is important for RSISC to establish contextual information [11]. For example, pixelbased classification methods can distinguish the basic land-cover types such as vegetation, water, buildings, etc. When contextual information is considered, scene classification can distinguish high-level semantic information such as residential areas, industrial areas, etc. Convolutional neural networks (CNNs) expand the receptive field by stacking convolutional layers. However, the global interactions of the local elements in the image are not directly modeled when using stacked receptive fields. Differing from CNNs, Vision Transformer (ViT) [12] directly establishes contextual information between different patches via a self-attention mechanism and achieves remarkable performance in the field of computer  [20] and the NWPU-RESISC45 dataset [2]. When the number of training samples is limited, acquiring more labels to supervise the network is an effective solution to alleviate the lack of data. We design a selfsupervised branch that takes the feature maps of the backbone network as labels and uses self-supervised methods to process the feature maps of the backbone network as the input of the branch network, and the output of the branch network is used as the prediction result. Specifically, it includes a self-supervised block and a feature fusion block. The selfsupervised block uses two methods to preprocess the feature maps. One is to shuffle the feature map after dividing the region. Noroozi et al. [21] argues that solving jigsaw puzzles can be used to teach a system that an object is made of parts and what these parts are. It is beneficial for the network to understand the relative position information between different image patches. The other is to mask the background region [15,22]. Remote sensing scene images often contain a large amount of background information, and background masking can mask redundant backgrounds and enhance the discriminative ability of the branch network. This is an efficient solution to address the challenge of discriminative feature extraction. Different from SKAL [22], we propose masking the background by dividing the patch. Subsequent experiments show that this masking method is relatively flexible.
We adopt a feature pyramid network (FPN) [23] structure as the feature fusion block. The FPN structure has been used extensively in remote sensing [24,25]. It adopts a topdown method to fuse feature maps from different stages, and high-level features play a guiding role for low-level features. However, low-level and mid-level features are also essential for improving the features extracted in deep layers from a coarse level to a fine level [26]. Thus, we add a bottom-up fusion process to the branch network. Different locations in the image should have different attention weights, and due to the inherent density advantage of the image, the adjacent regions tend to have similar attention weight. Motivated by this, we propose a feature fusion module; it dynamically weights the feature map by assigning different attention weights to the patches.
The large memory and computation cost of branch structure cannot be ignored, so we embed the knowledge distillation method in SSKDNet. Inspired by FRSKD [27] and BAKE [28], the branch labels are served as soft logits to distil knowledge to the backbone, and the backbone integrates sample-similarity-level soft logits to guide itself. We only adopt the backbone during inference to reduce the computational load.
The main contributions of this article are summarized as follows: 1.
The performance of the Cycle MLP [18] model in RSISC is explored, and the SSKDNet model is proposed, enhancing the discriminative ability of Cycle MLP through selfsupervised learning and knowledge distillation.

2.
We propose a self-supervised learning branch to learn the backbone feature maps. It includes a self-supervised block and a feature fusion block. The self-supervised block masks background regions by an attention mechanism and shuffles the feature map after dividing the region to improve the discriminative ability of the branch. The feature fusion block dynamically weights the feature maps of different stages, enhancing the high-level features from a coarse level to a fine level.

3.
The backbone integrates the "dark knowledge" of the branch via knowledge distillation to reduce the computation of the inference. It ensembles the sample-similaritylevel soft logits to guide itself, which fully uses the similarity information between samples. Moreover, SSKDNet dynamically weights multiple losses via a principled loss function [29] to reduce training costs.
The remainder of this article is organized as follows. Section 2 presents related work on RSISC, MLP, self-supervised learning and knowledge distillation. In Section 3, our proposed model is described in detail. In Section 4, the effectiveness of SSKDNet is demonstrated through experiments on three datasets. Section 5 discusses the advantages of our self-supervised branch. Section 6 provides the concluding remarks.

Remote Sensing Image Scene Classification Methods
Early feature extraction methods in RSISC are mainly based on handcrafted feature extraction. Handcrafted feature-based methods always use some salient feature descriptors. Typical feature descriptors are scale-invariant feature transformation (SIFT) [30], color histogram (CH) [31], texture descriptors (TD) [32], and HOG [33]. These methods mainly extract low-level information from images. CH and TD can represent an entire image with features, so they can be directly applied for remote scene classification. However, SIFT and HOG cannot directly represent an entire image because of their local characteristics. In order to represent an entire image with local descriptors, SIFT and HOG are encoded by some encoding methods, such as a vector of locally aggregated descriptors [34] and bag-ofvisual-words (BoVWs) [20]. These approaches rely heavily on manual design, and it is not easy for them to achieve satisfactory results.
With the development of artificial intelligence, deep learning methods have been widely used in RSISC [1]. Traditional deep learning methods include CNNs, generative adversarial networks (GANs), and graph convolutional networks (GCNs). For CNNs, Li et al. [35], Wu et al. [36], and Shi et al. [37] proposed an attention module to enhance the extraction ability of discriminative features; Shi et al. [38,39] designed a lightweight module to reduce model parameters; and Bi et al. [19] and Wang et al. [22] further improved the classification accuracy by a key area detection strategy. For GANs, Ma et al. [40] proposed a supervised progressive growing generative adversarial network and significantly improved the classification accuracy in the case of limited samples; Gu et al. [41] introduced a pseudo-labelled sample generation method to solve the geographic errors of generated pseudo-labelled; and Ma et al. [42] designed a framework with error-correcting boundaries and a feature adaptation metric. For GCNs, Xu et al. [43] introduced the use of graph convolution to effectively capture the potential context relationships of the scene images; Peng et al. [44] proposed a multi-output network combining GCNs and CNNs. Recently, some novel methods have also been applied to RSISC, such as the CapsNet [45], neural architecture search [46], meta learning [47], etc. Deep learning methods have become the mainstream natural image feature extraction method due to their powerful feature extraction capabilities. They can automatically learn the most discriminative semantic-level features from the raw images compared with handcrafted feature-based methods. Furthermore, CNN models are end-to-end trainable architectures rather than complex multi-stage architectures [22]. However, these methods ignore logits and potential natural labels. SSKDNet can effectively use this supervision information.

MLP
While CNNs have been the main model for computer vision [48], recently, the ViT [12] introduced transformers into computer vision and attained SOTA performance. The transformer architecture computes representations for each individual token in parallel and aggregates spatial information across tokens via a multi-head self-attention mechanism. It can effectively establish contextual information. Lv et al. [11] explored the performance of the ViT model in RSISC and proposed the SCViT model. Hao et al. [49] designed a twostream swin transformer network for special processing of remote sensing images to make the network focus on the target object in the scene. However, the computational complexity of self-attention is proportional to the square of tokens. In a series of works, refs. [50,51] proposed a computational approach from local attention to global attention to improve this problem. At the same time, researchers began to consider whether the self-attention mechanism is necessary. Tolstikhin et al. [16] designed a mixed MLP to replace the self-attention mechanism and achieved better results than ViT. Tang et al. [17] further proposed a sparse MLP to reduce the computational complexity. Liu et al. [15] designed the spatial gating unit to enhance the information interaction of tokens, further improving the performance of MLP. However, the complexity of these methods is still quadratic with the size of the image. Recently, Chen et al. [18] proposed a Cycle MLP, which achieved linear computational complexity through the dynamic shift and achieved competitive results in object detection, strength segmentation, and semantic segmentation. In SSKDNet, Cycle MLP is used as the backbone network.

Self-Supervised Learning
Self-supervised learning of visual representations is a fast-growing subfield of unsupervised learning. In recent years, many methods have been proposed and applied. Doersch et al. [52] and Wang et al. [53] proposed to use the location information and pixel values of the images as supervised labels. Bert [54] applied mask language model training to the NLP domain. He et al. [55] further verified the effectiveness of this class of methods by applying self-supervised masks to the computer vision domain. In scene classification, Zhao et al. [56] introduced a multitask learning framework that combines the tasks of self-supervised learning and scene classification. Stojnic et al. [57] showed that for the downstream task of remote sensing images, using self-supervised pre-training on remote sensing images can give better results than supervised pre-training on images of natural scenes. These methods have achieved SOTA performance. However, in scene classification, the effectiveness of self-supervised learning remains to be further explored.

Knowledge Distillation
Knowledge distillation transfers knowledge from a cumbersome teacher model to a lightweight student model. The pioneering work [58] performs knowledge representation of a teacher model using the softmax output layer, which converts the probability into soft logits with a temperature parameter. Following this, many works proposed new approaches to knowledge distillation. Tung et al. [59] proposed that similar samples have pairwise activation similarities and supervised the student network by pairwise activation similarities of similar samples of the teacher network. Zagoruyko et al. [60] proposed an attention map distillation method. Ji et al. [27] utilized both feature map distillation and soft logit distillation.
Knowledge distillation is also used to improve the RSISC accuracy. The discriminative modality distillation approach was introduced in [61]. Xu et al. [14] proposed an end-to-end method by employing ViT as an excellent teacher for guiding small networks in RSISC. Zhao et al. [62] introduced a novel pair-wise similarity knowledge distillation method. Chen et al. [63] introduced a knowledge distillation framework, which makes the output of the student and teacher models match. Zhang et al. [64] proposed a novel noisy label distillation method. The self-distillation learning method has few relevant research works in RSISC. We propose an end-to-end network architecture that combines the self-supervised learning method and knowledge distillation.

Materials and Methods
This section explains the specific details of the proposed SSKDNet for RSISC. The overall structure of the SSKDNet model is displayed in Figure 2. It consists of a backbone and a self-supervised branch. The backbone adopts Cycle MLP [18], and the self-supervised branch includes a self-supervised block and a feature fusion block. We present the backbone in Section 3.1, the self-supervised branch in Section 3.2, knowledge distillation in Section 3.3, and loss functions of SSKDNet in Section 3.4.

Backbone
Cycle MLP [18] is used as our backbone in the feature extraction stage. We denote stage1-stage5 as f 1 (.)f 5 (.). f 1 (.) represents patch embedding, which downsamples a quarter of the original input spatial dimension. f 2 (.)f 5 (.) stack 3, 4, 24, and 3 cycle fully connected (FC) layers (Section 3.1.1), respectively. The outputs have 96, 192, 384, and 768 channel dimensions, respectively. After each stage, the spatial dimension is downsampled by half. The feature maps of the backbone can be formulated as follows.
where E i is the output feature map of f i (.), and E i−1 is the feature map from the former layer. Specifically, the feature map E 0 is the input image. For the j-th sample, the predictive probability vector predict 1 can be obtained via a softmax function on the logits where M is the category number. The probability of class k can be formulated as follows.

Cycle MLP
We denote the input feature map as X ∈ R H×W×C , where H, W, C are the height of the feature map, the width of the feature map and the number of channels. As shown in Figure 3a, channel FC allows communication between different channels, and it can handle various input scales. However, it cannot provide spatial information interactions. As shown in Figure 3b, spatial FC allows communication between different spatial locations. However, its computational complexity is quadratic with the image scale. As shown in Figure 3c, Cycle FC allows communication between channels and spatial locations by adding a shift operation, and it has linear complexity.
We use subscripts to index the feature map. For example, X i,j,c is the value of the c-th channel at the spatial position (i, j), and X i,j,: are values of all channels at the spatial position (i, j). The Cycle FC can be formulated as follows.
where S H and S W are stepsizes along with the height and width dimensions, respectively. δ i (c) and δ j (c) are spatial offsets on the c-th channel. W ml p ∈ R C×C out and b ∈ R C out are parameters of Cycle FC. Figure 3d shows the Cycle FC when S H = 3 and S W = 1. This image is quoted from [18] Notably, the index in Figure 3d has been modified to start from δ i (0), and the last index is δ i (8) .

Self-Supervised Branch
The self-supervised branch includes a self-supervised block and a feature fusion block. The self-supervised block is shown in Figure 4. This block performs patch embedding on the backbone feature maps E 2 -E 5 , and then feeds the features into the gMLP [15] to extract local information S 2 -S 5 . The final feature maps T 2 -T 5 are obtained by passing the adjacent S 2 -S 5 through a Cross MLP, which is an information interaction module. Notably, E 2 -E 5 are processed differently. E 2 masks a part of the background by background masking. E 3 shuffles the regions after dividing the regions by jigsaw puzzle. E 4 is fed directly into the gMLP after patch embedding. E 5 is utilized to calculate the attention map. The feature fusion block is shown at the bottom of Figure 2, which is achieved by a top-down and bottom-up bidirectional fusion. We use a feature fusion module to fuse adjacent feature maps and use the feature maps P 2 -P 5 as the self-supervised prediction results.

Background Masking
Masking background regions is an effective way to extract discriminative features [22]. As shown in Figure 4a,d, we calculate the attention map of the high-level feature map E 5 . The attention map attn for each position (i, j) is calculated as follows.
where E 5,i,j,k is the element of E 5 , and C is the number of channels of E 5 . We upsample the attention map attn to the same spatial shape as E 2 by the nearest interpolation, record the positions of 75% of feature points with the lowest attention weights, and then delete the corresponding tokens in the feature map E 2 . We pad a vector 1 with an initial value of 1 to replace the deleted tokens. It is well known that high-level feature maps contain high-level semantic information, and the network tends to pay more attention to points with higher activation values. Thus, the channel means of the feature map E 5 can effectively calculate the attention map of the network. Later experiments can also verify this.

Jigsaw Puzzle
We expect SSKDNet to learn more location information about images. Noroozi et al. [21] proposed to divide the feature map into different regions, erase the edge information and randomly shuffle the regions. As shown in Figure 4b, we adopt a similar method to divide the feature map into four regions, erase the edge information of the feature map, and then shuffle them randomly.

Cross MLP
Compared with convolution and transformer, gMLP [15] alternately constructs FC layers along the channel dimension and the token dimension, which can extract global features easily. As shown in Figure 5, we propose a Cross MLP module to obtain more complete semantic information from adjacent feature maps. The Cross MLP has two parallel gMLP with a larger receptive field. It can be formulated as follows.
X ch = Concat(P emb (X 1 ), P emb (X 2 ), dim = channel) X tok = Concat(P emb (X 1 ), P emb (X 2 ), dim = token) Cross MLP(X 1 , X 2 ) = FC channel (gMLP(X ch )) + FC token (gMLP(X tok ))) (6) where X 1 ∈ R H 1 ×W 1 ×C 1 and X 2 ∈ R H 2 ×W 2 ×C 2 are the input feature maps. P emb (.) represents patch embedding with patchsize = 1. P emb (X 1 ) and P emb (X 2 ) are split tokens, where N is the number of tokens. We concatenate P emb (X 1 ) and P emb (X 2 ) along the channel dimension and token dimension to obtain X ch and X tok by Concat, respectively. gMLP [15] is used for information interaction between channels and tokens, and FC channel and FC token are FC layers along the channel dimension and token dimension to adjust the size of X ch and X tok , respectively. As shown in Figure 4, the feature maps S 2 -S 5 are the input. We can obtain the self-supervised block output T 2 -T 5 from the following formula.
As shown in Table 1, we show the parameter settings for patch embedding in the selfsupervised block, where p i and c i are patch size and the number of channels, respectively, and (a)-(d) correspond to the different branchs of Figure 4.

Feature Fusion Block
We dynamically weight adjacent feature maps by setting a weight for each patch of the feature map. As shown in Figure 6, for the input X 1 ∈ R H 1 ×W 1 ×C 1 and X 2 ∈ R H 2 ×W 2 ×C 2 , the feature fusion module can be formulated as follows.
where Conv(·) denotes a 3 × 3 convolutional layer, and upsample is the nearest interpolation. We adjust X 2 to the same spatial and channel shape as X 1 by Conv and upsample (when the spatial size of X 2 is greater than X 1 , replace the upsample with the downsample).
Dividing the feature map into patches allows the network to flexibly assign different attention weights to patches. The top-down feature maps M 2 − M 5 can be formulated as follows.
Similarly, the bottom-up feature maps P 2 − P 5 can be formulated as follows.
The difference between M i and P i is that upsample is substituted by downsample, and the downsample adopts convolution with kernel size (2 × 2) and stride = 2.
For the setting of parameters w j 1 and w j 2 , we initialize w j 1 and w j 2 to 1 and normalize them by the softmax function. It can be formulated as follows.
The weights are normalized to the range 0 to 1 by softmax to represent the importance of each input, which is an efficient attention calculation method. In the feature fusion block, we set the patch size of the patch embedding to 6, 8, and 16 in the top-down approach and set the patch size to 8, 4, and 2 in the bottom-up approach, respectively. The other settings are the same as FRSKD [27].

Feature Map Loss
For the backbone feature map E t ∈ R H×W×C , we denote E t ∈ R H×W×1 as the channel mean of E t . e avg and e std are the mean and variance of E t , respectively. The normalization process can be formulated as follows.
Similarly, for the branch feature map P t , we can calculate the normalized feature map P t . The feature map loss can be formulated as follows.
The gradients through E t,i,j are not propagated to avoid the model collapse issue.

Knowledge Distillation
Knowledge distillation can be viewed as normalizing the training of the student network with soft targets that carry the "dark knowledge" of the teacher network. We combine two knowledge distillation methods so that the backbone can obtain knowledge from the branch and itself. The knowledge distillation methods include soft logit distillation and batch distillation.
We denote Z = [z 1 , z 2 , . . . , z N ] ∈ R N×M as the logit vectors of the backbone in the same batch. For the j-th sample, the backbone logit vector z j = [z 1 j , z 2 j , . . . , z M j ] ∈ R 1×M , where M is the number of categories, and N is the number of samples in a batch.

Soft Logits Distillation
The soft logit output by the network contains more information about sample than just the class label. For the j-th sample, the backbone soft logits p τ j ∈ R 1×M can be obtained by a softmax function, and the soft logits of the k-th class can be formulated as follows.
The τ is the temperature hyper-parameter which is used to soften the logits. Similarly, we can obtain the branch soft logits q τ j ∈ R 1×M . The Kullback-Leibler (KL) loss is used to measure the similarity of two distributions. The KL loss can be formulated as follows.

Batch Distillation
Samples with high visual similarities are expected to make more consistent predictions about their predicted class probabilities, regardless of these truth labels. In this part, the batch knowledge ensembling (BAKE) [28] is introduced to obtain excellent soft logits. It achieves knowledge ensembling by aggregating the "dark knowledge" from different samples in the same batch. We can obtain the pairwise similarity matrix A ∈ R N×N by the dot product of the logits. A i,j represents the affinity between the samples with indexes i and j, and it can be formulated as follows.
where σ( f ) = f / f 2 . We normalize each row of the affinity matrix A to obtainÂ, For the i-th sample, the soft logits p τ j ∈ R 1×M , we denote the soft logits of samples within the same batch as P τ = [p τ 1 , . . . , p τ N ] T ∈ R N×M . Based on Equations (17) and (18), we can obtain the soft logits Q τ = [q τ 1 , . . . , q τ N ] T ∈ R N×M aggregated from the sample information, where ω ∈ [0, 1] is the weighting factor. For the j-th sample, the batch KL loss can be formulated as follows.
Notably, in Equation (20), q τ j is different from Equation (16). The parameters for knowledge distillation are set the same as in BAKE [28].

Loss Function of SSKDNet
The SSKDNet is optimized by minimizing the loss function. Our loss function consists of three parts, including knowledge distillation loss (Equations (16) and (20)), cross-entropy loss (Equations (22) and (23)), and feature map loss (Equation (14)). The total loss can be formulated as follows.
where predict 1 and predict 2 are the backbone and branch predictions, respectively. L ce 1 and L ce 2 represent the cross-entropy loss functions from the backbone and self-supervised branch. α 1 , α 2 , α 3 , α 4 , and α 5 are the weight coefficients. We introduce a multi-task loss function [29] to reduce the extra cost of hyperparameters. Specifically, we set the weight coefficients as learnable variables and optimize them using the following formula.
log(α i ) (24) where α 1 , α 2 , α 3 , α 4 , and α 5 are learnable parameters. This loss function is smoothly differentiable and is well-formed such that the task weights do not converge to zero. In contrast, directly learning the weights using a simple linear sum of losses would result in weights that quickly converge to zero.

Results
In this section, we conduct comprehensive experiments and analyses on three public scene classification datasets. We first introduce these datasets, evaluation metrics and experimental settings. Then, we compare the performance differences between our method and several SOTA methods. Finally, we perform ablation experiments for each module.
(1) The UCM dataset is one of the oldest datasets in the RSISC domain. It has 21 classes ("harbour", "intersection", "overpass", "chaparral", etc.) and 1000 images per category extracted from the United States Geological Survey National Map, Urban Area Imagery collection. The size of each image is 256 × 256 pixels. Some sample examples are shown in Figure 7. (2) The AID dataset is a high-spatial-resolution dataset for aerial scene classification, which has a size of 600 × 600 pixels. It has 30 classes ("church", "medium residential", "storage tanks", "port", "playground", etc.). All images are extracted from Google Earth Imagery. Some sample examples are shown in Figure 8.

Evaluation Metrics
The overall accuracy (OA) and confusion matrix were used as the primary quantitative measures. OA is the proportion of the number of correctly predicted samples in the total data, which reflects the overall performance of the classification. The confusion matrix is a two-dimensional table that analyses the between-class classification errors and confusion degree. Rows and columns represent all samples of a predicted class and samples of a true class, respectively.
We randomly split datasets using the same training ratio, repeated the experiment five times, and reported the mean and standard deviation.

Experimental Setup
We followed the conventional settings. A total of 80% of images from the UCM dataset were randomly selected for training, and the remainder data were used for evaluation. Similarly, we randomly selected 20% and 50% of the images from AID for training and 10% and 20% of the images of NWPU for training. During training, the image size was set to 256 × 256. The image augmentation methods included random horizontal flipping, random resized cropping and random erasing. All models were trained for 60 epochs, with a batch size set to 50 and AdamW (with a weight decay of 0.05) [66] chosen as optimizer. The initial learning rate was set to 0.0001, divided by 10 every 30 epochs. We used an RTX3090 GPU in all experiments. The implementation was based on Python 3.8 with Pytorch.

Comparison with Other Methods
This subsection describes the experimental results on the three remote sensing scene image datasets in detail. We divided the compared methods into two categories: classic CNNs-based methods, including GoogLeNet [2], VGG-VD-16 [2], SCCov [67], ResNet50 [68], ResNet-50+EAM [68], LCNN-BFF [69], BiMobileNet [70], DFAGCN [43], and MF 2 CNet [71], and ViT-based methods, including Fine-tune ViT [11] and ET-GSNet [14]. The backbone Cycle MLP [18] was also included for comparison. Table 2 shows the comparison results on the UCM dataset. Since there are only 420 test images in this dataset, many current methods can easily achieve more than 99% accuracy. The classification accuracy tends to be saturated. Compared with other algorithms, our method achieves a higher accuracy of 99.62%, with an improvement of 0.1%. The confusion matrix on the UCM dataset is shown in Figure 10. Only one image in medium residential is misclassified as tennis courts, while other classes can achieve 100% accuracy.  Table 3 shows the results on the AID dataset. It can be seen that when the training ratios were 20% and 50%, our method achieved the best performance of 95.96% and 97.45%, respectively. Compared with other algorithms, the average OAs are improved by 0.38% and 0.27%, respectively. The confusion matrix on the AID dataset is shown in Figure 11, in which 29 categories can achieve more than 99% accuracy. The most confusing categories are park and square. It can be seen in Figure 8 that the park and square categories have few distinct features, which makes them difficult to classify. Table 4 presents the performance of our algorithm on the NWPU dataset. When the training ratios were 10% and 20%, we achieved results of 92.77% and 94.92%, which are 0.05% and 0.42% higher than the previous best method, respectively. The confusion matrix of the NWPU dataset is shown in Figure 12. Only two categories are lower than 90% accuracy, palace and church. These two categories are relatively complex.
The computation costs (FLOPs) and parameters of the methods GoogLeNet [2], VGG-VD-16 [2], ResNet50 [68], Fine-tune ViT [11], BiMobileNet [70], MF 2 CNet [71] and Cycle MLP [18] are compared with our model. As shown in Table 5, SSKDNet has slightly more parameters than some baseline models, but we achieve higher accuracy. For instance, when compared with ResNet50 [68] on the AID dataset, we have an improvement of 3.41%. We only use the backbone during inference, so the inference speed of SSKDNet is same as Cycle MLP [18]. Compared with the ViT [11], we have a 1.08% improvement at lower FLOPs and parameters.

Ablation Experiments
We validated the effectiveness of our module through ablation experiments. For each experiment, we set the same parameters and removed only one module at a time. Experiment 1 removed the background masking module and the jigsaw puzzle module in Figure 4. Experiment 2 removed the Cross MLP module in Figure 4. Experiment 3 removed the feature fusion module. Table 6 shows the performance on the AID dataset. When the background masking module and the jigsaw puzzle module are removed, the classification accuracy drops by 0.19%. When the Cross MLP module is removed, the classification accuracy drops by 0.16%. When the feature fusion module is removed, the accuracy drops by 0.1%. These results demonstrate the effectiveness of our proposed modules. Next, we conducted ablation experiments regarding the loss function on the AID dataset. The batch distillation loss (20) was removed in Experiment 1, the soft logit loss (16) was removed in Experiment 2, and the feature map loss (14) was removed in Experiment 3. As shown in Table 7, the accuracy drops by 0.32% when the batch distillation loss is removed, 0.16% when the soft logits loss is removed, and 0.21% when the feature map loss is removed. The results showed that each loss is an important element. In addition, we also experimented with the masking ratio of the background masking module on the AID dataset. Figure 13 shows the influence of the masking ratio. We found that the network accuracy is the highest when the masking ratio is 75%. On the one hand, since the image is highly dense, the masking area can copy features from surrounding areas when the masking ratio is low. A high masking ratio can force the branch network to learn more information from the image. On the other hand, the discriminative regions usually occupy only a small part due to the high resolution of remote sensing images. A high masking ratio can make the branch network extract more precise discriminant features. Figure 13. The influence of masking ratio.

Discussion
In SSKDNet, the backbone and branch are optimized together. On the one hand, the self-supervised branch learns the feature map of the backbone by feature map loss. On the other hand, the backbone integrates the "dark knowledge" of the branch via soft logits. This is a mutual learning process.
We visualized the background masking module on the UCM dataset. Figures 14 and 15 show our background masking results on the original images with 75% and 50% scales, respectively. By the 30th epoch, the network can already accurately determine the background location. These results show that the branch can detect the background and discriminative region locations significantly. SKAL [22] uses a similar method to calculate the discriminative regions. The difference is that they can only crop out a continuous square region. In contrast, we use patch embedding to process each token attention information, and the extracted tokens can be discontinuous in spatial position, which is relatively flexible.
We also selected the UCM dataset to generate the energy maps of P 2 -P 5 at different training stages to explore the branch learning process. As shown in Figures 16 and 17, the attention of the feature map gradually transfers to the discriminative area as the training progresses. Additionally, even the low-level feature map P 2 can focus on the discriminative area. The results show that our network can extract accurate discriminative regional features and further verify the effectiveness of the self-supervised branch.    To further demonstrate the classification results of our SSKDNet, we used T-SNE [72] to map the prediction vectors in the high-dimensional space to the low-dimensional space. In addition, we visualize Cycle MLP and SSKDNet by randomly selecting 1000 test samples on the AID and UCM datasets, respectively. Figure 18a,b represent the visualization results of Cycle MLP and SSKDNet on the AID dataset, respectively. Figure 18c,d represent the visualization results of SSKDNet on the UCM dataset, respectively. It can be seen that SSKDNet has smaller intraclass distance and larger interclass distance than Cycle MLP.

Conclusions
This article investigated the Cycle MLP models for RSISC. We have also proposed the SSKDNet model to improve the discriminative ability of the Cycle MLP model. First, a selfsupervised branch is introduced, which generates labels through feature maps to alleviate the problem of insufficient training data. Second, an attention-based feature fusion module is introduced to fuse adjacent feature maps dynamically. Finally, a knowledge distillation method is proposed to distil the "dark knowledge" of the branch to the backbone. Moreover, a multi-task loss function weighting method is introduced to weight the SSKDNet loss function dynamically. We evaluate the performance of our model on three public datasets, achieving 99.62%, 95.96%, and 92.77% accuracy on the UCM, AID, and NWPU datasets, respectively. Results on multiple datasets show that SSKDNet has more competitive classification accuracy than some SOTA methods, and MLP can achieve a similar effect to the self-attention mechanism. For future work, we will try to reduce the time consumption of the SSKDNet method in the training phase and further improve the performance on small-scale datasets.