Learn by Yourself: A Feature-Augmented Self-Distillation Convolutional Neural Network for Remote Sensing Scene Image Classification

Abstract: In recent years, with the rapid development of deep learning technology, great progress has been made in remote sensing scene image classification. Compared with natural images, remote sensing scene images are usually more complex, with high inter-class similarity and large intra-class differences, which makes it difficult for commonly used networks to effectively learn their features. In addition, most existing methods adopt hard labels to supervise the network model, which makes the model prone to losing fine-grained information about ground objects. To solve these problems, a feature-augmented self-distillation convolutional neural network (FASDNet) is proposed. First, ResNet34 is adopted as the backbone network to extract multi-level features of images. Next, a feature augmentation pyramid module (FAPM) is designed to extract and fuse multi-level feature information. Then, auxiliary branches are constructed to provide additional supervision information. Self-distillation is applied between the feature augmentation pyramid module and the backbone network, as well as between the backbone network and the auxiliary branches. Finally, the proposed model is jointly supervised using feature distillation loss, logits distillation loss, and cross-entropy loss. Extensive experiments are conducted on four widely used remote sensing scene image datasets, and the experimental results show that the proposed method is superior to some state-of-the-art classification methods.


Introduction
Remote sensing scene classification is the task of assigning a semantic label to a specific scene. It has received extensive attention in recent years and is mainly used in urban planning, environmental surveying, natural disaster detection, and land use [1][2][3][4]. High-resolution remote sensing images are characterized by complex content, diverse semantics, and multi-scale targets, which makes it difficult to classify remote sensing scene images accurately. Improving the classification accuracy of remote sensing scene images has therefore become a research hotspot in the field of remote sensing. Traditional feature extraction uses hand-crafted features (e.g., texture features [5,6], spectral features [7,8], color features [9,10], and shape features [11,12]), and traditional classifiers include support vector machines [13] and decision trees [14]. Because hand-crafted features struggle to fully describe the information in high-resolution remote sensing scene images, traditional classifiers built on them cannot classify scenes well, and the classification performance of traditional models cannot meet practical requirements. With the development of the deep convolutional neural network (DCNN) [15], DCNN-based classification methods have become increasingly popular, and many methods have been proposed to distinguish remote sensing scene images [16][17][18][19]. In these methods, the role of the feature extractor is to map the remote sensing scene image to appropriate visual features, while the role of the classifier is to assign the visual features to semantic classes. Convolutional neural networks (CNNs) excel at learning expressive features and have achieved good performance in remote sensing scene classification.
In conventional CNNs, one-hot ground-truth labels are used to guide feature learning. However, one-hot ground-truth labels only carry category information (i.e., which category the input image belongs to) and cannot express the relationship between categories. For example, under one-hot encoding, "Dense Residential Area" is equally distant from "Medium Residential Area" and "Airport", although semantically "Dense Residential Area" is closer to "Medium Residential Area" than to "Airport". In this case, category information alone cannot accurately describe images, which leads to insufficient supervision in training. As shown in Figure 1, the label of the scene is "bridge", but in addition to the bridge the scene also contains "river", "tree", "car", and "house". If only bridge information is considered in the process of feature learning, the other semantic information will reduce the discrimination of the learned features.
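The claim about one-hot distances can be checked with a small sketch. The three class names come from the example above; the soft-label values are hypothetical numbers chosen only for illustration and are not taken from any experiment.

```python
import math


def one_hot(index, num_classes):
    """Return a one-hot vector with a 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


classes = ["dense_residential", "medium_residential", "airport"]
dense, medium, airport = (one_hot(i, len(classes)) for i in range(3))

# Under one-hot encoding, every pair of distinct classes is sqrt(2) apart.
d_dense_medium = euclidean(dense, medium)
d_dense_airport = euclidean(dense, airport)

# Hypothetical soft labels (illustrative values only): similar classes
# share probability mass, so their label vectors end up closer together.
soft_dense = [0.70, 0.25, 0.05]
soft_medium = [0.25, 0.70, 0.05]
soft_airport = [0.05, 0.05, 0.90]
s_dense_medium = euclidean(soft_dense, soft_medium)
s_dense_airport = euclidean(soft_dense, soft_airport)

print(d_dense_medium, d_dense_airport)
print(s_dense_medium, s_dense_airport)
```

With one-hot labels the two distances are identical, whereas soft labels of this kind can place "Dense Residential Area" nearer to "Medium Residential Area" than to "Airport".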
Remote Sens. 2023, 15, x FOR PEER REVIEW
Figure 1. The remote sensing scene image above is manually labeled with the semantic class "bridge". Besides the bridge, it contains multiple other land covers, including "River", "Forest", "Car", and "Residential". If only the bridge is considered in the feature learning process, the content corresponding to the other semantics will reduce the discriminability of the learned features.
Knowledge distillation is a method of transferring the knowledge of a pre-trained teacher network to a student network, so that the small network can replace the large teacher network at the deployment stage. The concept of knowledge distillation [20] was originally proposed by Hinton et al. and has been widely used in various fields and tasks. Its basic principle is to train a smaller, more lightweight model to learn the knowledge of a larger and more complex model. Typically, the complex model is referred to as the "teacher model", while the simplified model is referred to as the "student model". The teacher model can be a deep neural network or another complex model, while the student model is usually a shallower or narrower neural network. By taking the output of the teacher model together with the corresponding labels as the training target of the student model, the student model can gain more knowledge from the teacher model and gradually approach or even exceed the performance of the teacher model during learning. One of the main advantages of knowledge distillation is that it can significantly reduce model complexity and computational resource requirements while maintaining relatively high performance. This gives knowledge distillation broad application potential in resource-constrained environments such as mobile devices, embedded systems, and edge computing. In addition, knowledge distillation can be used as a model compression method that reduces the cost of storage and inference by transferring the knowledge of complex models into simplified models.
Self-distillation (SD) [21] is a technique based on knowledge distillation. SD extracts knowledge from an already trained model and uses this knowledge to retrain the same model, thereby further improving the performance of the model. The core idea of the self-distillation method is to improve performance by letting the model learn its own knowledge. During training, the model uses its own soft labels as targets instead of hard labels.
Auxiliary classifiers [22,23] can enhance the performance of the main classifier by providing additional information. In this paper, an auxiliary classifier is used to learn the distribution of the classification results output by the main classifier. During the training phase, the auxiliary classifier is trained together with the main classifier. Auxiliary classifiers provide additional supervisory signals that help the main classifier better understand the data distribution. Because remote sensing images are costly to acquire and the datasets are relatively small, auxiliary classifiers can help alleviate the overfitting of the main classifier: they act as an additional regularizer that reduces the degree of overfitting. Auxiliary classifiers can also increase the diversity of the model. If the auxiliary classifiers produce predictions that differ from those of the main classifier, the difference between them can help improve the overall classification performance. In general, the basic principle of the auxiliary classifier is to strengthen training, reduce overfitting, and improve diversity by providing additional information.
In order to train a compact model to achieve high classification performance and overcome the drawbacks of traditional distillation, a new self-distillation framework is proposed.
The main contributions of this paper are as follows.
1. This paper proposes a new self-distillation framework that effectively combines feature distillation and logits distillation to solve the problem of losing fine-grained information in traditional hard-label supervised models. This enables the backbone network to extract more representative image features and improves the generalization performance and adversarial robustness of the model. Extensive experiments on four commonly used remote sensing scene image classification datasets demonstrate the effectiveness of the proposed method.
2. To exploit the complementary advantages of multi-level features, a feature augmentation pyramid module is carefully designed. It fuses the top-level features with the low-level features through deconvolution to enrich the features, so that the semantic features extracted by the deep layers can be learned by the shallow layers.
3. A method of adding two auxiliary classifiers in the middle layers is proposed. They are trained through distillation to provide additional supervisory information and help the network converge faster. To ensure that the shallow auxiliary classifiers and the main classifier share similar feature representations, a bottleneck structure is added to the middle layers of the backbone network to encourage them to learn similar features.
The rest of this paper is organized as follows. Section 2 briefly introduces the research on remote sensing scene classification, and Section 3 describes the proposed method in detail. Section 4 presents the experimental results and discussion. Section 5 gives the conclusion and prospects.

Classification of Remote Sensing Scene Images
Over the past few decades, many methods for the classification of remote sensing scene images have been proposed. Initially, these methods were mainly based on hand-crafted features, such as gradient histograms [24], scale-invariant feature transforms [25], and the bag-of-visual-words (BoVW) model [26]. Although these methods yield impressive representations, hand-crafted features cannot fully capture the complex content of remote sensing scenes. In recent years, the convolutional neural network (CNN) has performed well in extracting representational features and is widely used in image classification and target detection. It has achieved great success in remote sensing scene image classification, and many CNN-based methods have been proposed. For example, Li et al. [27] proposed a deep feature fusion network for remote sensing scene classification. Zhao et al. [28] proposed a structure that fuses local spectral features, global texture features, and local structural features. Wang et al. [29] used an attention mechanism to adaptively select key regions of an image and then fused features to produce more representative features. The key filter bank network (KFBNet) [30] uses a key filter bank to capture discriminative local details while preserving local features. Shi et al. [31] proposed a multi-branch fusion attention network, which fuses spatial attention and channel attention into the ResNet backbone network. Shi et al. [32] proposed a dense fusion of multi-level features, which uses 3 × 3 depthwise separable convolution and 1 × 1 standard convolution to extract the information of the current layer and fuse it with the features extracted from the previous layer. Deng et al. [33] proposed a deep neural network incorporating contextual features, which uses the pre-trained VGG-16 as a feature extractor to obtain feature maps. The feature maps are then fed into two parallel modules, global average pooling (GAP) and long short-term memory (LSTM), to extract global and contextual features, respectively, and the global and contextual features are finally concatenated. Meng et al. [34] proposed a multi-layer feature fusion network based on spatial attention and a gating mechanism, which uses a backbone network to extract multi-layer convolutional features and then a spatial attention module to aggregate the multi-layer features for classification. Wang et al. [35] proposed an enhanced feature pyramid network based on deep semantic embeddings, which exploits multi-level and multi-scale features and introduces a feature fusion module to fuse the features of two branches. Zhang et al. [36] proposed a distributed convolutional neural network.

Knowledge Distillation
The idea of the original knowledge distillation is to guide the training of the student model by using the soft targets of the teacher model. The teacher network usually produces class probabilities with a softmax output layer that has a temperature hyperparameter, which converts the logit z_i computed for each class into a probability q_i. The calculation can be represented as
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
where T is the temperature, usually set to 1. Using a higher value of T produces a softer probability distribution over the classes. As shown in Figure 2, an image is input into the network, the resulting output is fed into the softmax with the temperature hyperparameter, and a soft-label output is obtained. In the output soft label, for an input image labeled as a forest, the other categories also receive a certain probability in addition to the forest category.
Soft labels are the probability distributions output by the teacher model, which provide richer information to help the student model learn. To transfer knowledge effectively, an appropriate loss function needs to be defined to measure the difference between the output of the student model and the output of the teacher model. Commonly used loss functions include the mean squared error [37], cross-entropy loss [37], and KL divergence [38]. The mean squared error loss measures the numerical difference between the outputs, while the cross-entropy loss measures the difference between the output probability distributions. KL divergence (Kullback-Leibler divergence), also known as relative entropy, is an indicator of the difference between two probability distributions. The calculation can be represented as
$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$
where P(i) represents the distribution predicted by the teacher network, and Q(i) represents the distribution predicted by the student network. KL divergence measures the information lost when the model distribution is used to approximate the true distribution; here, the true distribution is approximated by the output of the teacher network. In the process of knowledge distillation, a large teacher model is usually trained first and then used to generate soft labels, which are used together with the output of the student model to train the student model. During training, different weights can be used to balance the relative importance of the hard and soft targets. Choosing an appropriate teacher model is crucial to the effect of knowledge distillation. In general, the teacher model should be complex and accurate enough to provide high-quality soft targets; a commonly used teacher model is a pre-trained deep neural network. The student model is usually more lightweight and simplified than the teacher model, so that it can be deployed where computing resources are constrained. Common student model design strategies include using shallow network structures and reducing the number of network parameters.
With the deepening of research, many improved and extended knowledge distillation methods have emerged. For example, the FitNets method [39] introduced the concept of intermediate layer alignment to align the intermediate layer outputs of the teacher model and the student model. The attention transfer method [40] learned knowledge from the teacher network by having the student network imitate the attention maps of the teacher network. The relational knowledge distillation method [41] exploited relational modeling to improve knowledge distillation. The comprehensive overhaul of feature distillation [42] designed a new distillation loss, distilled the features before the ReLU function, and retained negative values before distillation. Ahn et al. [43] proposed a variational information distillation framework, which transfers the knowledge learned by the convolutional network to the multi-layer perceptron (MLP) and maximizes the mutual information of the two neural networks by maximizing the variational lower bound.
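The temperature-scaled softmax and the KL divergence described in this section can be illustrated with a minimal sketch in plain Python. The logit values below are hypothetical, and the function names are chosen for this example only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); terms with P(i) = 0 are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for a three-class problem (illustration only).
teacher_logits = [4.0, 2.0, 0.5]
student_logits = [3.0, 2.5, 0.2]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

# Teacher distribution P and student distribution Q at the same temperature.
p_teacher = softmax_with_temperature(teacher_logits, temperature=4.0)
q_student = softmax_with_temperature(student_logits, temperature=4.0)
kd = kl_divergence(p_teacher, q_student)

print(sharp, soft, kd)
```

Raising the temperature lowers the peak probability and spreads mass over the other classes, which is the "softer" distribution that the student is asked to match.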
Because selecting a teacher network and training a large teacher network are both difficult, some studies have proposed self-distillation algorithms. A self-distillation framework distills knowledge within the network itself: the network is first divided into several parts, and the knowledge from the deep layers of the network is then squeezed into the shallow layers. Zhang et al. [44] proposed a self-distillation framework that uses ResNet as the backbone network and extracts the outputs of the intermediate layers through a bottleneck structure. The output of the deep layers is used as a soft label to supervise the distribution of the shallow layers, so that the shallow layers of the network learn the distribution of the deep layers. Ji et al. [45] proposed a self-distillation framework for feature refinement, which enhances feature maps through lateral convolutions for the purpose of self-knowledge distillation. Hu et al. [46] proposed a hierarchical self-distillation feature learning framework, in which the distribution generated by the shallow network is supervised by the distribution generated by the deep network. They also proposed a gradient separation and fusion module, in which the gradient generated by the final classification output is not back-propagated to the backbone network.
Influenced by the above work, knowledge distillation methods have been introduced into remote sensing image analysis. Wei et al. proposed MSH-Net [47] to assist models with missing modalities by reconstructing complete modality-shared features from the incomplete modalities available at inference. Within it, the Joint Adaptive Distillation (JAD) method guides the model to learn modality-shared knowledge from multimodal models by matching the joint probability distribution between the representation and the ground truth. Hu et al. [48] proposed variational self-distillation, which distills between deep and shallow layers through Variational Knowledge Transfer (VKT) and uses the prediction vector of class entanglement information as supplementary class information. Li et al. proposed dual knowledge distillation [49], which designs a dual attention loss and a spatial structure loss; the two loss functions can effectively transfer the knowledge learned by the teacher network to the student network. Liu et al. proposed cross-model knowledge distillation, using a model pre-trained on RGB images as a teacher model to guide multispectral scene classification [50]. Lin et al. [51] proposed a pyramid network, which uses interpolation to generate high-resolution feature maps.
Unlike these methods, we propose a feature-augmented self-distillation network. In this architecture, the teacher network is an extension of the student network and belongs to the same network. We employ a pyramid module to fuse the top-level feature maps with the lower multi-level features through deconvolution. The feature maps of the backbone network are then supervised by the fused features through feature distillation. At the same time, we add two auxiliary branches to the backbone network and use the soft-label distillation loss of the auxiliary branches to supervise the shallow layers in learning representational features. The classifier of the backbone network is supervised by a combination of soft and hard labels.

Methodology
In this paper, a feature-augmented self-distillation convolutional neural network (FASDNet) is proposed, as shown in Figure 3. It consists of the backbone classifier network (the gray area in the middle of the figure), the self-teacher network (the green area at the bottom), and two auxiliary branches (the blue area at the top). ResNet34 is utilized as the backbone network to generate multi-layer features, which are then sent to the feature augmentation pyramid module to generate refined feature maps. The green area is called the self-teacher network because it is a branch that extends from the backbone network within the same model. It generates more refined feature maps, which guide model learning through feature distillation. The core of a self-teacher network is that the model uses self-generated labels or targets for training. The backbone network contains four convolution blocks, which generate the feature maps S_1, S_2, S_3, and S_4 from bottom to top. The shape of S_1 is B × C × H × W, where B, C, H, and W represent the batch size, number of channels, height, and width of the feature map, respectively. The self-teacher network takes the horizontal feature maps of the backbone network as input, and its convolutional blocks sequentially generate the feature maps T_1, T_2, T_3, and T_4 from bottom to top. The shape of T_1 is B × 2C × H × W: the number of horizontal convolution kernels is set to 2C, so that the feature map T has twice as many channels as the feature map S. The two auxiliary branches adopt the convolutional bottleneck structure to further learn the representation of shallow features. The final loss is jointly determined by the feature distillation loss, the soft-label distillation loss, and the ground-truth loss. Combining these loss functions applies strong supervision to the network, which reduces the risk of overfitting.
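The relation between the shapes of S_i and T_i can be made concrete with a short sketch. It assumes the standard ResNet34 stage widths (64, 128, 256, 512) and cumulative strides (4, 8, 16, 32), that each S_i is the output of one residual stage, and a 224 × 224 input; the input size and the helper name `feature_map_shapes` are assumptions made for this example.

```python
def feature_map_shapes(batch_size, input_size=224, width=2):
    """List the (B, C, H, W) shapes of S_i and T_i for a ResNet34-style backbone.

    Assumptions: standard ResNet34 stage channels and strides, a square
    input of `input_size` pixels, and T_i having `width` times the
    channels of S_i at the same spatial resolution.
    """
    stage_channels = [64, 128, 256, 512]  # output channels of the four stages
    stage_strides = [4, 8, 16, 32]        # cumulative downsampling per stage
    shapes = []
    for stage, (channels, stride) in enumerate(
        zip(stage_channels, stage_strides), start=1
    ):
        side = input_size // stride
        shapes.append(
            {
                "stage": stage,
                "S": (batch_size, channels, side, side),
                "T": (batch_size, width * channels, side, side),
            }
        )
    return shapes


for entry in feature_map_shapes(batch_size=8):
    print(entry["stage"], entry["S"], entry["T"])
```

Under these assumptions, S_1 has shape 8 × 64 × 56 × 56 and T_1 has shape 8 × 128 × 56 × 56, matching the B × C × H × W and B × 2C × H × W pattern described above.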

Self-Distillation
The idea of the self-distillation method is to let the model generate supervisory information by itself and to use this information for self-learning. It can also be used for model compression: the self-teacher network is included during training, whereas only the backbone classifier network needs to be deployed at the testing stage, so the complex model is compressed into a smaller and lighter model with lower resource requirements. By introducing soft labels, the model learns the distribution of the data rather than only the hard labels, which helps improve the generalization performance of the model. The self-distillation method can also tolerate perturbations of the input data, thus improving the robustness of the model.
The proposed FASDNet adopts the green area in Figure 3 as the self-teacher network, whose feature maps are denoted by T_i and whose output soft label is denoted by p^T. The gray area is utilized as the backbone (classifier) network, whose feature maps are denoted by S_i. The feature maps of the self-teacher network are used to guide the feature maps of the classifier network; in other words, the classifier network learns the feature representation of the self-teacher network. For feature distillation, the distillation loss can be represented as
$$L_F(T, S; \theta_c, \theta_t) = \sum_{i=1}^{n} \left\| \phi(T_i) - \phi(S_i) \right\|_2$$
where ϕ represents pooling of the feature maps along the channel dimension followed by L2 normalization, θ_c represents the parameters of the classification network, θ_t represents the parameters of the self-teacher network, S represents the feature maps of the classification network, and T represents the feature maps of the self-teacher network. Because ϕ pools over the channel dimension, T_i and S_i can be compared even though T_i has twice as many channels as S_i. L_F enables the classification network to learn the enhanced feature maps of the self-teacher network, which reduces the gap between the classification network and the self-teacher network. At the same time, the self-distillation method also uses soft labels for distillation, and the distillation loss is
$$L_{KD}(x; \theta_c, \theta_t, T) = D_{KL}\left( \mathrm{softmax}\left( \frac{f_t(x; \theta_t)}{T} \right) \,\Big\|\, \mathrm{softmax}\left( \frac{f_c(x; \theta_c)}{T} \right) \right)$$
where f_c is the classifier network, f_t is the self-teacher network, D_KL represents the KL divergence between two distributions, L_KD represents the knowledge distillation loss, x represents the input tensor after data augmentation, θ_c represents the parameters of the classification network, θ_t represents the parameters of the self-teacher network, and T here represents the temperature hyperparameter. In addition to the distillation losses, the classifier network and the self-teacher network adopt the cross-entropy loss to learn the true labels. The cross-entropy loss can be represented as
$$L_{CE} = -\sum_{i=1}^{N} y_i \log p_i^{S} - \sum_{i=1}^{N} y_i \log p_i^{T}$$
where N is the number of training samples, y_i represents the true label, p_i^S represents the output obtained from the backbone network after the fully connected layer and softmax, and p_i^T represents the output obtained from the self-teacher network after the fully connected layer and softmax.
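The three loss terms of this section can be sketched in plain Python on toy data. Several choices here are assumptions rather than details taken from the paper: ϕ is implemented as the channel-wise mean of squared activations followed by L2 normalization (one common form of channel pooling), the L2 distance is not squared, the weights `alpha` and `beta` are hypothetical, and all feature and logit values are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def phi(feature_map):
    """Channel pooling plus L2 normalization for a [C][H][W] nested list.

    Assumption: pooling is the mean of squared activations over channels.
    The result is a flat vector of length H * W, independent of C.
    """
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    pooled = [
        sum(feature_map[c][h][w] ** 2 for c in range(channels)) / channels
        for h in range(height)
        for w in range(width)
    ]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


def feature_distillation_loss(teacher_maps, student_maps):
    """Sum over levels of the L2 distance between phi(T_i) and phi(S_i)."""
    loss = 0.0
    for t_map, s_map in zip(teacher_maps, student_maps):
        diff = [a - b for a, b in zip(phi(t_map), phi(s_map))]
        loss += math.sqrt(sum(d * d for d in diff))
    return loss


def kd_loss(student_logits, teacher_logits, temperature):
    """KL divergence from the teacher's to the student's softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def cross_entropy(logits, label):
    """Cross-entropy of a single sample against a hard label index."""
    return -math.log(softmax(logits)[label])


# Toy data (hypothetical): one level, student with C = 1, teacher with 2C = 2.
student_map = [[[1.0, 2.0], [3.0, 4.0]]]
teacher_map = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 1.0], [4.0, 3.0]]]
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [2.5, 0.5, 0.2]
label = 0
alpha, beta = 1.0, 1.0  # hypothetical loss weights, not from the paper

l_f = feature_distillation_loss([teacher_map], [student_map])
l_kd = kd_loss(student_logits, teacher_logits, temperature=4.0)
l_ce = cross_entropy(student_logits, label) + cross_entropy(teacher_logits, label)
total = l_ce + alpha * l_kd + beta * l_f
print(l_f, l_kd, l_ce, total)
```

Note that `phi` returns a vector whose length depends only on the spatial size, which is what allows a teacher map with 2C channels to be compared with a student map with C channels.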

Feature Augmentation Pyramid Module (FAPM)
The purpose of the self-teacher network is to provide refined feature maps and soft labels for the classifier network. Assume that the classifier network is divided into n blocks; the input of the self-teacher network is then the set of feature maps S_1, S_2, ..., S_n of the classifier network. The overall architecture of the feature augmentation pyramid module (FAPM) is shown in Figure 4. We refer to the green blocks in Figure 4 as the feature augmentation pyramid module, which mainly consists of deconvolution and convolution. Deconvolution is used to upsample deep features with rich semantic information, which are then fused with the features obtained from horizontal convolution. The fused features are further processed using convolution.

The feature pyramid network can generate multi-scale and multi-level feature maps. In this paper, deconvolution is utilized as the upsampling technique: it upsamples the semantically rich feature map by a factor of two. Compared to interpolation-based upsampling, deconvolution enables the model to learn from the training data how to generate high-resolution features with more semantic significance. Through deconvolution, the semantically rich features are combined with the low-level features of the network to achieve spatial feature enhancement. Specifically, a top-down and bottom-up path design is adopted. Before the top-down path, a horizontal convolution layer is applied:

L_i = Conv(S_i; d_i),

where Conv denotes the convolution, batch normalization, and ReLU activation operations. The convolution includes parallel 1 × 1, 1 × 3, and 3 × 1 convolutions. The output dimension of Conv is d_i, and we set d_i = w × c_i, where c_i denotes the number of channels of S_i and w is a width hyperparameter, which is set to 2 here. The 1 × 3 and 3 × 1 convolutions are direction-sensitive: their kernels carry separate weights for the horizontal and vertical directions, so they can better capture the directional features in the input data. Compared with the traditional 3 × 3 convolution kernel, the 1 × 3 and 3 × 1 kernels also have fewer parameters. When the numbers of input and output channels are both 1, the 1 × 3 and 3 × 1 kernels together require 6 weight parameters, while a 3 × 3 convolution requires 9. This reduces the complexity and computational cost of the model.

In the top-down path, the deeper feature map is upsampled by deconvolution and fused with the feature map obtained from horizontal convolution to give the intermediate output P_i. The kernel size of the deconvolution is 2 × 2 and the stride is 2. The feature map obtained after deconvolution is added element by element to the feature map generated by horizontal convolution. In the bottom-up path, the downsampling adopts a combination of max pooling and 1 × 1 convolution, where Conv_1×1 represents a 1 × 1 convolution, T_i represents the feature map of each level of the self-teacher network, and ϕ_i represents the parameters of the convolution kernel. Here, P_i represents the intermediate output of level i in the top-down path, T_i represents the output of level i in the bottom-up path, and w_P and w_T represent normalized weighting parameters. In both paths, Conv represents the combination of convolution, batch normalization, and ReLU activation: a 3 × 3 depth-separable convolution Conv_dsc is applied first, a 1 × 1 point convolution then mixes the feature maps along the channel dimension, and batch normalization and the ReLU activation function follow.
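The parameter savings mentioned above can be checked with a few lines of arithmetic. The sketch below counts weights only (biases and batch normalization parameters are ignored); the function names are hypothetical, and the stage widths 64, 128, 256, and 512 are the standard ResNet34 stage channel counts, used here as the c_i values under the stated choice w = 2.

```python
def conv_weights(kernel_h, kernel_w, c_in, c_out):
    """Weight count of a standard convolution (bias ignored)."""
    return kernel_h * kernel_w * c_in * c_out


def horizontal_conv_weights(c_in, c_out):
    """Parallel 1x1, 1x3 and 3x1 branches, as in the horizontal convolution."""
    return (conv_weights(1, 1, c_in, c_out)
            + conv_weights(1, 3, c_in, c_out)
            + conv_weights(3, 1, c_in, c_out))


def depthwise_separable_weights(kernel, c_in, c_out):
    """kernel x kernel depthwise convolution followed by a 1x1 point convolution."""
    return kernel * kernel * c_in + c_in * c_out


WIDTH = 2  # width hyperparameter w from the text
STAGE_CHANNELS = (64, 128, 256, 512)  # ResNet34 stage widths, used as c_i

for c_i in STAGE_CHANNELS:
    d_i = WIDTH * c_i  # output dimension d_i = w * c_i
    print(c_i, d_i,
          horizontal_conv_weights(c_i, d_i),   # three parallel branches
          conv_weights(3, 3, c_i, d_i))        # a single 3x3 convolution
```

For a single input and output channel, the 1 × 3 and 3 × 1 pair gives 6 weights against 9 for a 3 × 3 kernel, which matches the figures quoted in the text; the depth-separable count shows the same kind of saving for the Conv blocks in the two paths.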

Auxiliary Classifier
Two additional branches are introduced into the middle layers of the network to assist training. They provide additional supervision during training, accelerate model convergence, and improve model generalization. The distillation loss between the backbone network classifier and the auxiliary classifiers provides additional supervisory signals, which help the gradient propagate back to the shallower layers of the network more effectively and thereby improve the training of the network. The auxiliary branches also act as a regularizer: introducing additional tasks in the middle layers forces the network to learn representations that are effective for multiple tasks, which improves the generalization performance of the model.

Because the shallow auxiliary classifiers and the main classifier sit at different positions in the network, they may learn different feature representations of the data, which leads to inconsistent weight updates for different parts of the network. In this case, the weight adjustments of the shallow auxiliary classifier and the main classifier may not be coordinated, resulting in inconsistent classification results. To ensure that the shallow auxiliary classifier and the main classifier share similar feature representations, we add a bottleneck structure in the middle layers of the backbone network to encourage them to learn similar features. Soft labels are used instead of hard labels to supervise the shallow classifiers during training. A good shallow classifier can obtain more discriminative features, which in turn improves the performance of the deep classifier.

The bottleneck structure of the auxiliary classifier is shown in Figure 5. The blue square in the figure represents a 3 × 3 depth-separable convolution, which further extracts local spatial features and uses fewer parameters than an ordinary convolution. The orange square represents a 1 × 1 point convolution, which is used to increase the dimension of the feature map. The purple square represents the batch normalization layer and the yellow square represents the activation function; the result finally passes through a global average pooling layer. The bottleneck structure can be represented as

Downsample = Avgpool2d(ConvBNReLU(S_i)),

where ConvBNReLU represents the stacking of convolution, batch normalization, and ReLU activation functions, as shown in Figure 5, and Avgpool2d represents global average pooling. Using the bottleneck structure for downsampling can reduce the difference between the shallow classifier and the deep classifier.

Figure 5. Bottleneck structure of the auxiliary classifier (Conv3×3, Conv1×1, BN, ReLU, AvgPool2d).

The two auxiliary classifiers are supervised with soft labels through the losses L_Aux1(x; θ_c, T) and L_Aux2(x; θ_c, T), which together form the total loss of the auxiliary classifiers. Here, f_c represents the classifier network, and f_Aux1 and f_Aux2 represent the auxiliary classifier networks. KL divergence is used as the distance metric so that the shallow classifiers learn the output distribution of the deeper classifiers. L_Aux1(x; θ_c, T) represents the KL divergence between the classification output of the backbone network and that of the deep auxiliary classifier, and L_Aux2(x; θ_c, T) denotes the KL divergence between the deep auxiliary classifier and the shallow auxiliary classifier.

The supervised loss of the whole network consists of four parts. The first part is the feature distillation loss between the backbone network feature map and the self-teacher network feature map. The second part is the logits distillation loss between the backbone network classifier and the self-teacher classifier. The third part is the logits distillation loss between the backbone network classifier and the auxiliary classifiers. The fourth part is the cross-entropy loss between the classifier network and the real label together with the cross-entropy loss between the self-teacher network and the real label. The overall loss function combines these four parts: the feature distillation loss L_F(T, S; θ_c, θ_t), the logits distillation loss L_KD(x; θ_c, θ_t, T), the auxiliary distillation loss, and the cross-entropy loss.
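For orientation, one form of the auxiliary and overall losses that is consistent with the definitions above is written out below. It is a hedged reconstruction rather than the paper's exact formulation: the direction of each KL term, the assignment of f_Aux1 to the deep branch and f_Aux2 to the shallow branch, the symbols L_Aux and L_CE, and the unit weights in the final sum are all assumptions.

```latex
% Assumed forms only; KL direction, branch naming, and unit weights are illustrative.
\begin{align}
L_{Aux1}(x;\theta_c,T) &= D_{KL}\!\left(\mathrm{softmax}\!\left(\tfrac{f_c(x;\theta_c)}{T}\right)
  \,\Big\|\, \mathrm{softmax}\!\left(\tfrac{f_{Aux1}(x;\theta_c)}{T}\right)\right), \\
L_{Aux2}(x;\theta_c,T) &= D_{KL}\!\left(\mathrm{softmax}\!\left(\tfrac{f_{Aux1}(x;\theta_c)}{T}\right)
  \,\Big\|\, \mathrm{softmax}\!\left(\tfrac{f_{Aux2}(x;\theta_c)}{T}\right)\right), \\
L_{Aux}(x;\theta_c,T) &= L_{Aux1}(x;\theta_c,T) + L_{Aux2}(x;\theta_c,T), \\
Loss &= L_F(T,S;\theta_c,\theta_t) + L_{KD}(x;\theta_c,\theta_t,T)
  + L_{Aux}(x;\theta_c,T) + L_{CE}(x;\theta_c,\theta_t).
\end{align}
```

The cascade structure (backbone to deep branch, deep branch to shallow branch) follows the description of the two KL terms; any weighting coefficients between the four parts would simply multiply the corresponding terms in the last line.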

Implementation Details
The process of the proposed FASDNet is as follows. Firstly, the original remote sensing scene image is preprocessed. Then, the data are input into the backbone network to obtain feature maps S_1, S_2, S_3, and S_4 at different stages. Horizontal convolution is applied to S_1, S_2, S_3, and S_4 to obtain feature maps L_1, L_2, L_3, and L_4. Specifically, each feature map S_i (i = 1, 2, 3, 4) is fed into parallel 1 × 1, 1 × 3, and 3 × 1 convolutions, and the output features of the three branches are added element by element to obtain the aggregated features. The 1 × 3 and 3 × 1 convolutions are direction-sensitive and can better capture the directional features of the input data. The deconvolution in the feature augmentation pyramid module upsamples features with rich semantic information and fuses the upsampled features with the features obtained using horizontal convolution. The fused features are further processed through convolution to obtain an enhanced feature map, represented by P_i. After the top-down path, the multi-level feature maps enhanced by the pyramid module are obtained.

Next is the bottom-up path. First, the feature maps L_1 and P_1 are fused to obtain T_1, and then T_1 is downsampled and fused with L_2 and P_2 to obtain T_2. Through this bottom-up approach, the feature maps T_1, T_2, T_3, and T_4 are obtained, and T_4 then passes through a linear layer to obtain the output of the self-teacher network. Two auxiliary branches are added after the two middle feature maps of the backbone network. Each auxiliary branch consists of a bottleneck downsampling structure and an auxiliary classifier. Additional supervision is provided through distillation between the auxiliary branches and the backbone network.

The final supervised loss consists of four parts: feature distillation between the self-teacher network and the backbone network, logits distillation between the output of the self-teacher network and the output of the backbone network, logits distillation between the output of the backbone network and the outputs of the auxiliary branches, and the cross-entropy losses between the real label and the outputs of the self-teacher network and the backbone classifier network. The model parameters are updated through the total loss to obtain the trained model. The specific process of FASDNet is shown in Algorithm 1.

Algorithm 1
1. Preprocess the data to obtain the input tensor x.
2. Input x to the backbone network to obtain feature maps S_1, S_2, S_3, S_4 at different stages.
3. Use horizontal convolution to enhance the feature maps obtained in step 2, L_i = Conv(S_i; d_i).
4. Input the feature maps obtained in step 3 into the feature augmentation pyramid module to obtain the enhanced feature maps P_i.
5. Combine the resulting enhanced feature map with the feature map of the horizontal convolution and the feature map after max pooling to obtain T_i.
6. Send the feature maps of the middle two layers of the backbone network to the auxiliary branches, Downsample = Avgpool2d(ConvBNReLU(S_i)), and obtain the outputs of the auxiliary classifiers.
7. Calculate the overall supervised loss.
8. Update the model parameters through the overall supervised loss.
9. Obtain the output of the classifier.
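The spatial bookkeeping behind steps 2 to 5 can be traced with a short sketch. It assumes the usual ResNet34 stage strides of 4, 8, 16, and 32 relative to the input, the 2 × 2, stride-2 deconvolution stated in the text, and a 2 × 2, stride-2 max pooling for the bottom-up downsampling; the pooling size is not given in the text and is an assumption, as are the helper names.

```python
def deconv_out(size, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def pool_out(size, kernel=2, stride=2):
    """Output length of max pooling without padding (floor mode)."""
    return (size - kernel) // stride + 1


def stage_sizes(input_size, strides=(4, 8, 16, 32)):
    """Spatial sizes of S_1..S_4 for a square input."""
    return [input_size // s for s in strides]


def check_pyramid(input_size):
    """Check that both paths produce maps that can be added element by element."""
    sizes = stage_sizes(input_size)
    pairs = list(zip(sizes[:-1], sizes[1:]))
    # Top-down: the deconvolved deeper level must match the shallower level.
    top_down_ok = all(deconv_out(deep) == shallow for shallow, deep in pairs)
    # Bottom-up: the pooled shallower level must match the deeper level.
    bottom_up_ok = all(pool_out(shallow) == deep for shallow, deep in pairs)
    return sizes, top_down_ok, bottom_up_ok


print(check_pyramid(448))
```

With the 448 × 448 input used in the experiments, the four levels have sizes 112, 56, 28, and 14, so each deconvolution exactly doubles a deeper map to the size of the level above it and each pooling step halves it back.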

Experiments
To evaluate the effectiveness of the proposed FASDNet, some experiments are performed on four public and challenging datasets, i.e., UC-Merced dataset [26], RSSCN7 dataset [52], AID [53], and NWPU-RESISC45 dataset [54], and the proposed FASDNet is compared with some advanced classification methods proposed in recent years.The experimental results show that the classification performance of the proposed method is superior to those of some state-of-the-art methods on all datasets.

Datasets
In this section, the four datasets used in the experiments are introduced briefly. Some examples selected from these datasets are shown in Figure 6. Because the number of images differs significantly between datasets, we use a larger training ratio for smaller datasets and a smaller training ratio for larger datasets. The training ratios on the four datasets are the same as those used in previous work [55][56][57]. The information of the four datasets is described in Table 1.

(1) UC-Merced: UC-Merced is a commonly used remote sensing image dataset for land-use scene classification. This dataset was created and provided by the University of California, Merced. It contains 21 scene categories with 100 images each, for a total of 2100 images. Each image has a resolution of 256 × 256 pixels and is a color image (RGB format). The images cover different types of ground features such as cities, farmlands, forests, rivers, and parks. For the UC-Merced dataset, we use training ratios of 50% and 80%, and the remaining 50% and 20% are used for testing.
(2) RSSCN7: RSSCN7 is a dataset for remote sensing scene classification. It contains seven common remote sensing scene categories, namely Grass, Field, Industry, RiverLake, Forest, Resident, and Parking. Each category contains 400 images, for a total of 2800 images. Each image has a resolution of 400 × 400 pixels and is a color image (RGB format). For the RSSCN7 dataset, we use a training ratio of 50%, and the remaining 50% is used for testing.
(3) AID: The AID (Aerial Image Dataset) is a widely used dataset for aerial scene classification. It was created by researchers at Wuhan University and their collaborators and contains 30 scene categories, covering cities, farmland, forests, grasslands, roads, rivers, lakes, buildings, and other types of scenes. Each category contains approximately 200-400 images, for a total of 10,000 images. Each image has a resolution of 600 × 600 pixels and is a color image (RGB format). For the AID, we use training ratios of 20% and 50%, and the remaining 80% and 50% are used for testing.
(4) NWPU-RESISC45: NWPU-RESISC45 is a widely used dataset for remote sensing image scene classification (RESISC) created by Northwestern Polytechnical University (NWPU) in China. It contains 45 scene categories with 700 images each, for a total of 31,500 images. Each image has a resolution of 256 × 256 pixels and is a color image (RGB format). The images cover different geographical environments and scenes, including cities, farmlands, rivers, forests, grasslands, airports, etc. For the NWPU-RESISC45 dataset, we use training ratios of 10% and 20%, and the remaining 90% and 80% are used for testing.
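The training ratios above can be turned into concrete splits as follows. This is a minimal sketch that assumes the split is drawn at random within each class (a stratified split), which the text does not state explicitly; `stratified_split` and its arguments are hypothetical names.

```python
import random


def stratified_split(labels, train_ratio, seed=0):
    """Split sample indices into train and test sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)  # random selection within the class
        n_train = round(len(indices) * train_ratio)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# UC-Merced layout: 21 classes with 100 images each.
uc_merced_labels = [c for c in range(21) for _ in range(100)]
train_idx, test_idx = stratified_split(uc_merced_labels, train_ratio=0.8)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the class balance of the training set identical to that of the full dataset, which matters most for the low training ratios used on AID and NWPU-RESISC45.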

Experimental Details
All experiments are implemented in PyTorch on a workstation with a GeForce RTX 3090. The ResNet34 backbone is initialized with parameters pre-trained on ImageNet [58], and the rest of the network is randomly initialized. The Adam (adaptive moment estimation) optimizer is adopted, the initial learning rate is set to 0.0001, and the model is trained for 150 epochs. The cosine decay learning rate adjustment is used, and the learning rate decays to 0.1 times the original at the 30th, 50th, and 100th epochs. We first resize the image to 448 × 448. Random horizontal flip, random vertical flip, and random rotation with fixed angles are adopted for data augmentation, together with a color augmentation method. In the experiments, the overall accuracy (OA) is adopted to evaluate the effectiveness of our proposed method. OA is the number of correctly predicted images in the test set divided by the total number of images in the test set. To obtain reliable results, the final results are averaged over 10 runs. In addition, the confusion matrix is adopted to analyze the prediction results of different categories.
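The evaluation protocol described above (OA, a confusion matrix, and averaging over repeated runs) can be written down in a few lines. The sketch below is illustrative only; the helper names are hypothetical, and the use of the sample standard deviation for the reported spread is an assumption, since the text does not say which deviation is used.

```python
from statistics import mean, stdev


def overall_accuracy(y_true, y_pred):
    """OA: correctly predicted test images divided by all test images."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def summarize_runs(accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return mean(accuracies), stdev(accuracies)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(overall_accuracy(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, num_classes=3))
```

The diagonal of the confusion matrix holds the correctly classified images, so OA equals the trace of the matrix divided by the number of test images; the off-diagonal entries show which scene categories are confused with each other.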

Experimental Results and Analysis
To evaluate the effectiveness of our proposed method, a series of experiments are conducted on four datasets.Some advanced classification methods using multi-layer feature aggregation and global deep features are used for comparison.The experimental results are listed in Table 2.
Table 2. The remote sensing scene classification methods from recent years used for comparison, where * indicates a classification method based on global deep features, • indicates a classification method based on multi-layer feature aggregation, and † indicates an image classification method using knowledge distillation.
Methods | Year
[66] • | JSTARS2021
EFPN-DSE-TDFF [35] • | TGRS2021

(1) Classification results on the UC-Merced dataset: Some methods with good classification performance on the UC-Merced dataset in recent years are chosen for comparison with the proposed FASDNet. The experimental results are listed in Table 3. The classification accuracy of the proposed method reaches 99.90% when the training ratio is 80%, which exceeds all comparison methods. The OA of the proposed FASDNet is 0.33% higher than that of EMTCAL, which also uses a ResNet34 backbone. It is 0.08% higher than that of the SAGN method, which uses a dense network to extract underlying features and then uses graph convolution to further aggregate them. It is also 0.33% higher than that of VSDNet-ResNet34, which also uses a distillation method.

Table 3. Comparison of our proposed method with some methods proposed in recent years on the UC-Merced dataset.

(2) Classification results on the RSSCN7 dataset: The proposed method is compared with some methods proposed in recent years on the RSSCN7 dataset. The experimental results are shown in Table 4. The OA of our proposed method is 97.79%, which is 1.78%, 1.81%, and 2.25% higher than those of MLF2Net_SAGM, PCANet, and Contourlet CNN, respectively. The experimental results show that our proposed method has good feature representation ability.

Table 4. Comparison of our proposed method with some methods proposed in recent years on the RSSCN7 dataset.
Method | OA (50%)
BoVW(SIFT) [53] | 81.34 ± 0.55
Tex-Net-LF_VGG-M [70] | 91.25 ± 0.57
ResNet50 [70] | 93.12 ± 0.55
WSPM-CRC-ResNet152 [71] | 93.9
Tex-Net-LF_Resnet50 [70] | 94.00 ± 0.57
DFAGCN [67] | 94.14 ± 0.44
SE-MDPMNet [72] | 94.71 ± 0.15
Contourlet CNN [17] | 95.54 ± 0.71
PCANet [18] | 95.98 ± 0.56
MLF2Net_SAGM [34] | 96.01 ± 0.23
FASDNet (ours) | 97.79 ± 0.14

The confusion matrix of the proposed FASDNet on the RSSCN7 dataset is shown in Figure 8, which shows that the proposed method provides good classification performance. The classification accuracy rate of all scenes is up to 97%, and the classification accuracy rate of the "Forest" scene can reach 99%. It can be seen from the figure that some "Field" scenes are misclassified as "Grass", and some "Grass" scenes are misclassified as "Field". This is due to the strong inter-class similarity between the "Grass" and "Field" scenes. There is also misclassification between the "Industry" and "Parking" scenes, because the "Industry" scene contains many parking areas, while the "Parking" scene contains many industrial-style buildings. This makes it difficult for our proposed method to distinguish them as well. Nevertheless, the proposed method still achieves excellent classification performance.

(3) Classification results on the AID: Some methods proposed in recent years are selected for comparison with our proposed method, and the experimental results are shown in Table 5. With a training ratio of 20% on the AID, the classification accuracy of the proposed FASDNet is 95.68%. It is 0.78% higher than that of the SAGN method, 1.97% higher than that of the MBFANet method, and 1.26% higher than that of the EMTCAL method. When the training ratio of the AID is 50%, the classification accuracy of the proposed FASDNet is 97.84%, which is 1.07% higher than that of the SAGN method, 0.91% higher than that of the MBFANet method, and 1.43% higher than that of the EMTCAL method. The experimental results demonstrate the effectiveness of our proposed method.
Table 5. Comparison of our proposed method with some methods proposed in recent years on the AID.
Method | OA (20%) | OA (50%)
GoogLeNet [53] | 83.44 ± 0.40 | 86.39 ± 0.55
VGG-16 [53] | 86.59 ± 0.29 | 89.64 ± 0.36
VGG-16-CapsNet [60] | 91.63 ± 0.19 | 94.74 ± 0.17
SCCov [19] | 93.12 ± 0.25 | 96.10 ± 0.16
VGG-VD16 + MSCP + MRA [59] | 92.21 ± 0.17 | 95.56 ± 0.18
GBNet + global feature [57] | 92.20 ± 0.23 | 95.48 ± 0.12
MIDC-Net_CS [62] | 88.51 ± 0.41 | 92.95 ± 0.17
EFPN-DSE-TDFF [35] | 94.02 ± 0.21 | 94.50 ± 0.30
ACNet [64] | 92.71 ± 0.14 | 95.31 ± 0.37
DFAGCN [67] | - | 94.88 ± 0.22
MG-CAP(Sqrt-E) [61] | 93.34 ± 0.18 | 96.12 ± 0.12
MSA-Network [65] | 93.53 ± 0.21 | 96.01 ± 0.43
ACR-MLFF [63] | 92.73 ± 0.12 | 95.06 ± 0.33
EMTCAL [68] | 94.69 ± 0.14 | 96.41 ± 0.23
MBFANet [31] | 93.98 ± 0.15 | 96.93 ± 0.16
SAGN [55] | 95.17 ± 0.12 | 96.77 ± 0.18
VSDNet-ResNet34 [48] | 96.00 ± 0.18 | 97.28 ± 0.

The confusion matrix under the 50% training ratio on the AID is shown in Figure 9. Among the 30 categories, 28 categories have an accuracy of more than 90%, and only two categories, "Resort" and "Square", do not reach 90%. The "Resort" scene category is mainly misclassified as schools and parking lots. The "Square" scene category is mainly misclassified as parking lots, central areas, and schools. This is due to the high inter-class similarity of these scene categories. Further improving the performance of FASDNet is our future work.

(4) Classification results on the NWPU-RESISC45 dataset: The experimental results of different classification methods are summarized in Table 6. The overall accuracy of the proposed method reaches 92.89% under the 10% training ratio and 94.95% under the 20% training ratio. Compared with other methods, our proposed method achieves the best classification results under both training ratios. At the 10% training ratio, the OA of the proposed FASDNet is 0.76% higher than that of the VSDNet-ResNet34 method, 1.16% higher than that of the SAGN method, 1.28% higher than that of the MBFANet method, and 1.26% higher than that of the EMTCAL method. At the 20% training ratio, the OA of the proposed FASDNet is 0.27% higher than that of the VSDNet-ResNet34 method, 1.46% higher than that of the SAGN method, 0.94% higher than that of the MBFANet method, and 1.30% higher than that of the EMTCAL method. These experimental results validate the effectiveness of our proposed method on the remote sensing scene classification task.

Evaluation of Size of Models
The floating-point operations (FLOPs) and parameter counts of some network models are listed in Table 7. The FLOPs measure the computational complexity of the model. Table 7 shows that, compared with EMTCAL, the proposed method has fewer FLOPs and parameters while achieving a classification accuracy 1.43% higher than that of EMTCAL, which demonstrates the advantages of the proposed method. Compared with methods such as GoogLeNet, SE-MDPMNet, and Contourlet CNN, the proposed method is at a disadvantage in terms of parameter count and FLOPs, but it greatly surpasses these models in classification accuracy. It is worth mentioning that our model only needs to deploy the backbone classifier network during the deployment phase, which reduces the model complexity and the computational resource requirements. From Table 7, it can also be seen that the FLOPs and the number of parameters of the model during deployment are lower than those during training.

Discussion
To comprehensively evaluate the effectiveness of our proposed method, ablation experiments and a heat map analysis are conducted. Grad-CAM uses the features of the last convolutional layer to generate an attention map, also called a heat map, that highlights the important regions of an image. In these experiments, some scene images, such as "parking lot", "residential", and "river", are randomly selected from the RSSCN7 dataset. The heat maps obtained by only the backbone network and by the backbone network combined with the distillation method are shown in Figure 11.
We can see from the figure that, for the "Parking" scene, the method using only the backbone network cannot accurately focus on the parking area; in addition to the parking area, the network also attends to other parts of the image. After combining the backbone with our proposed distillation method, the network is clearly more focused on the parking area. For the "Resident" scene, when only the backbone network is used, the network is biased toward part of the region of interest and ignores similar surrounding objects, so only limited features are available for classification. In contrast, our proposed method focuses on the target region very well. For the "RiverLake" scene, the method without distillation focuses only on the edge information of the scene and cannot fully extract the target, which affects the classification accuracy. After using the distillation method, the network focuses more on the complete region of interest.
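The heat maps discussed above follow the standard Grad-CAM recipe: each channel of the last convolutional feature map is weighted by the spatial average of the class-score gradient for that channel, the weighted channels are summed, and a ReLU keeps only positive evidence. The following is a minimal pure-Python sketch of that computation on nested lists; the function name `grad_cam` and the toy inputs are hypothetical, and a real pipeline would obtain the activations and gradients from the trained network.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map.

    activations: list of K feature maps (each H x W) from the last conv layer.
    gradients:   gradients of the class score w.r.t. those maps, same shape.
    Returns an H x W map normalized to [0, 1] (all zeros if no positive evidence).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of each gradient map.
    weights = [sum(map(sum, grad)) / (height * width) for grad in gradients]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: two 2x2 channels; the second channel has a negative weight.
toy_activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
toy_gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
toy_cam = grad_cam(toy_activations, toy_gradients)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image, which is how visualizations of this kind are usually rendered.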
To verify the effectiveness of our three proposed modules, ablation experiments were conducted on the four datasets. The results are shown in Tables 8-11. Each result given in the tables is the average of 10 repeated experiments. In the first case, the network includes only the backbone network, which yields the worst classification performance. In the second case, the classification accuracy is improved by combining the distillation method with the backbone network. In the third case, adding both the distillation method and the feature augmentation pyramid module further improves the classification accuracy compared with using distillation alone. In the fourth case, adding the distillation method and the auxiliary branches also improves the classification accuracy compared with using distillation alone. The last case combines all modules, i.e., the distillation method, the feature augmentation pyramid module, and the auxiliary branches. The four tables show that the highest classification accuracy is achieved when the network includes all three modules. The ablation study demonstrates the effectiveness of the main modules in FASDNet.
To verify the effect of the temperature hyperparameter on model performance, ablation experiments are carried out using the proposed FASDNet on the AID with a training ratio of 50%. The temperature hyperparameter T is set to 1, 2, 4, 6, and 8. The experimental results are listed in Table 12. It can be seen from Table 12 that the highest classification accuracy is achieved when T is 4. When T is less than 4, the network performance improves as the temperature increases; when T is greater than 4, the network performance degrades as the temperature increases. Therefore, we use 4 as the temperature hyperparameter when training on the other datasets.
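The role of the temperature can be illustrated with a small sketch of temperature-scaled softmax and a logits distillation loss. This is an assumption-laden illustration rather than the exact loss used in FASDNet: it follows the common convention of a KL divergence between softened teacher and student outputs multiplied by T squared, and the function names and toy logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / T; a larger T gives a softer distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logits_distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    The T^2 factor is the common convention, assumed here for illustration.
    """
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    # Clamp q to avoid division by zero if a probability underflows.
    kl = sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.0]
student = [2.0, 2.0, 0.0]

sharp = softmax_with_temperature(teacher, 1)  # close to one-hot
soft = softmax_with_temperature(teacher, 4)   # spreads mass to other classes
loss_at_4 = logits_distillation_loss(student, teacher, 4)
```

A very small T leaves the targets close to one-hot labels, while a very large T flattens them toward a uniform distribution; both extremes carry little inter-class information, which is consistent with an intermediate value performing best in Table 12.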

Conclusions
In this paper, a novel remote sensing scene classification method named FASDNet is proposed. It mainly comprises three newly designed modules: the feature augmentation pyramid module, the self-teacher network, and the auxiliary classifier. First, ResNet34 is utilized as the backbone network to extract multi-level features of the images. Then, a feature augmentation pyramid module is designed to fuse rich deep semantic information and shallow features step by step through transposed convolution. Next, the backbone network learns the aggregated features through feature distillation, and logits distillation is used as a regularization method to reduce the confidence of the network predictions, thereby improving the robustness of the model. Finally, auxiliary branches are added after the feature maps S2 and S3 generated by the backbone network. Knowledge distillation is also applied to the auxiliary branches, which provides additional supervision information and helps the model to learn more effectively. The proposed FASDNet is verified on four widely used remote sensing classification datasets. The experimental results show that, compared with other advanced classification methods, the proposed FASDNet has significant advantages in the classification of remote sensing scene images.
Although our proposed FASDNet method achieves excellent performance, it still has some shortcomings. In future work, we will integrate the three proposed modules into other advanced networks to improve their generalization. In addition, designing specialized networks that are more suitable for remote sensing scene classification is also one of our ongoing efforts.

Figure 3. The overall framework of the proposed FASDNet.

Figure 5. The bottleneck convolution structure of the auxiliary classifier.

Figure 6. Randomly selected sample images from the four datasets.

Figure 7. Confusion matrix on the UC-Merced dataset with 80% training ratio.

Figure 8. The confusion matrix obtained under the 50% training ratio of the RSSCN7 dataset.

Figure 9. The confusion matrix obtained under the 50% training ratio of the AID.

Figure 10. The confusion matrix obtained under the 20% training ratio of the NWPU dataset.

Figure 11. The heat maps obtained using different methods. The first row shows the heat maps obtained using the backbone network combined with the distillation method. The second row shows the heat maps obtained with only the backbone network.

Table 1. Data information of the four datasets.

Table 4. Comparison of our proposed method with some methods proposed in recent years on the RSSCN7 dataset.

Table 5. Comparison of our proposed method with some methods proposed in recent years on the AID.

Table 6. Comparison of our proposed method with some methods proposed in recent years on the NWPU dataset.

Table 7. Complexity evaluation of some models.

Table 8. Some ablation experiments of the proposed FASDNet on the UC-Merced dataset.

Table 9. Some ablation experiments of the proposed FASDNet on the RSSCN7 dataset.

Table 10. Some ablation experiments of the proposed FASDNet on the AID.

Table 11. Some ablation experiments of the proposed FASDNet on the NWPU dataset.

Table 12. The experimental results obtained by the proposed FASDNet under the 50% training ratio on the AID when the temperature hyperparameters are 1, 2, 4, 6, and 8.