Collaborative Consistent Knowledge Distillation Framework for Remote Sensing Image Scene Classification Network

Abstract: For remote sensing image scene classification tasks, small-scale deep neural networks tend to have low classification accuracy and fail to reach the accuracy required in real-world application scenarios. Although large deep neural networks can improve the classification accuracy of remote sensing image scenes to some extent, they also have more parameters and cannot be used on existing embedded devices. The main reason is that large deep networks contain a large number of redundant parameters, which makes them difficult to deploy on embedded devices and also reduces the classification speed. Considering the contradiction between hardware constraints and classification accuracy requirements, we propose a collaborative consistent knowledge distillation method, called CKD, for improving the classification accuracy of remote sensing image scenes on embedded devices. In essence, our method addresses two aspects: (1) We design a multi-branch fused redundant feature mapping module, which significantly alleviates the parameter redundancy problem. (2) To improve the classification accuracy of the deep model on embedded devices, we propose a knowledge distillation method based on mutually supervised learning. Experiments were conducted on two remote sensing image classification datasets, SIRI-WHU and NWPU-RESISC45, and the experimental results showed that our approach significantly reduced the number of redundant parameters in the deep network; the number of parameters decreased from 1.73 M to 0.90 M. In addition, compared to a series of student sub-networks obtained with existing knowledge distillation methods, the student sub-networks obtained by CKD performed significantly better for remote sensing scene classification on the two datasets, with average accuracies of 0.943 and 0.916, respectively.


Introduction
Over the past few years, deep neural networks have achieved state-of-the-art performance in computer vision [1][2][3][4], natural language processing [5][6][7], reinforcement learning [8][9][10], and various other fields [11][12][13]. However, as networks have grown deeper and wider, for example from the shallow LeNet to the wider Inception structure in GoogLeNet, the deeper Resnet convolutional architecture, and the currently popular transformer architecture, the number of parameters of deep models has kept growing. This leads to a series of problems, such as redundant network parameters, more demanding hardware requirements, and difficulty in training the model, and it severely limits the application of large deep models under low-memory or strict real-time conditions. In recent years, research [14,15] on developing faster and smaller models based on the idea of knowledge distillation to solve these problems has progressed rapidly. In traditional knowledge-distillation-based model compression methods [16][17][18][19], the smaller student network is typically guided by a larger teacher network. The primary purpose is to enable the student network to achieve competitive or even superior task performance by learning the prior knowledge of the teacher network.
The key to achieving this goal is mainly related to two aspects: on the one hand, how to design the network structure of the teacher-student model; on the other hand, how to transfer important features from the large-sized teacher model to the small-sized student model in a more efficient way.
We observed the following phenomena when performing model optimization with the standard knowledge distillation methodology:
I. When we trained a small-sized student network independently, it was usually more difficult to find the ideal model parameters to meet the relevant task requirements.
II. Compared with training a small-sized student network independently, when a large-sized teacher network was trained independently, better task performance could be obtained, but the model parameters of the teacher network were not optimal because of the significant parameter redundancy in the teacher network.
III. When the teacher-student models were trained jointly, the parameter redundancy present in the teacher model was usually detrimental to the optimization of the student model.
In this paper, considering that current high-precision remote sensing image classification models require high-performance hardware devices, which are difficult to deploy on embedded devices with low performance, our goal was to solve the parameter redundancy problem in the teacher-student model and obtain a small deep neural network with powerful feature extraction capabilities that can be easily deployed on lower-performance hardware devices and meet the accuracy requirements for remote sensing image classification. To address these issues, we propose a collaborative consistency knowledge distillation framework.
Firstly, different from the previous convolutional neural networks, a plug-and-play redundant feature mapping module was designed for the redundant parameters in the teacher-student model. Specifically, this module contains both multi-branch feature extraction and fusion components, as well as redundant mapping convolution components. On the one hand, we can obtain an equivalent convolution kernel with stronger feature extraction capability with multi-branch feature extraction and fusion and utilize this equivalent convolution kernel to extract richer task-related high-level semantic information. On the other hand, the redundant mapping convolution component was used to generate the intrinsic feature maps of the inputs, and the redundant feature maps were further obtained by a series of low-cost linear operations, which greatly reduced the redundant parameters of the network.
Secondly, our CKD framework starts with a powerful and pre-trained teacher network and performs a one-way prior knowledge transfer to two untrained student sub-networks of different depths. In addition, for both student sub-networks, we propose that the student sub-networks not only absorb prior knowledge derived from the teacher, but also extract high-level semantic features that the other possesses via mutual supervised learning. The experimental results showed that the student sub-networks obtained by training in this way have better task-relevant model parameters.
In summary, our contributions are as follows:
(1) To reduce the parameter redundancy of remote sensing image classification models and facilitate their deployment on embedded devices with low performance, we propose a plug-and-play multi-branch fused redundant feature mapping module. The equivalent convolutional kernel obtained by this module has a more powerful feature extraction capability and can more effectively reduce the parameter redundancy of the network.
(2) We propose a collaborative consistent knowledge distillation framework to obtain a more robust backbone network. In contrast to the traditional knowledge distillation framework, we guide a pair of student sub-networks of different depths through a teacher model, where the student sub-networks not only learn prior knowledge from the teacher network, but also acquire the knowledge held by each other through mutual supervised learning.
(3) The experimental results on two benchmark datasets (SIRI-WHU, NWPU-RESISC45) showed that our approach provided a significant improvement over a series of existing deep models and state-of-the-art knowledge distillation networks on the remote sensing image scene classification task. In addition, the student sub-network obtained with the CKD framework had a more powerful feature extraction capability, as well as fewer parameters, so it can be widely used as a feature extraction network in various embedded devices.

Remote Sensing Image Scene Recognition
The remote sensing image recognition task recognizes and classifies scene topics by analyzing the compositional relationships of the targets in the image scene. Existing approaches mainly fall into methods based on mid-level features and deep learning methods. The main approaches based on mid-level features include bag-of-visual-words models [20], sparse representation [21], Fisher vector coding [22], and so on. However, these traditional methods can hardly meet the accuracy requirements for remote sensing image scene classification on embedded devices.
In recent years, deep learning models have performed well in remote sensing image recognition tasks. Yao et al. [23] utilized a pre-trained deep learning network for feature extraction of remote sensing scenes and adopted a random forest classifier for scene recognition of remote sensing images. Cheng et al. [24] combined deep learning with metric learning, and the problem of high similarity between remote sensing scenes and large intraclass differences was well solved by discriminative convolutional neural networks. Gong et al. [25] combined the attention mechanism with the deep learning model, which solved the overfitting problem of the deep learning model in remote sensing image processing to some extent. However, although these models can obtain better accuracy for remote sensing image scene classification, they are difficult to deploy on embedded devices with low performance due to having more model parameters.

Knowledge Distillation
In recent years, deep learning methods have achieved great success in the field of knowledge distillation [15,26,27]. In accordance with whether the teacher model is updated simultaneously with the student model, the learning schemes for knowledge distillation are mainly divided into two categories: offline knowledge distillation [28][29][30][31][32][33] and online knowledge distillation [34][35][36][37].
The training for the offline knowledge distillation method needs to be performed in stages. In the first stage, the large-scale teacher network is trained on the relevant training dataset until the network converges. In the second stage, the trained teacher network is used to extract the relevant features of the input data, and these features are then utilized to guide the training of the student network. The knowledge transfer from the pre-trained teacher network to the student network is enabled by the two stages. Bucilua et al. [16] advocated the use of knowledge transfer for compressing models as early as 2006, transferring knowledge from a large-scale complicated model to a lightweight model. The idea was adopted by Hinton et al. [19] in 2015, when the concept of knowledge distillation (KD) was formally defined and a detailed training method for knowledge distillation networks was given. FitNets [28] further extends the idea of knowledge distillation by adding an intermediate layer of knowledge distillation to the teacher network and boosts the training speed of the knowledge distillation network with the guidance of the intermediate layer feature map. Inspired by this, RKD [29] combines the output of multiple teacher models to produce structural units, which work together to guide student learning, driving better guidance for student models. CRD [30] introduces contrastive learning for knowledge distillation and trains the student network to learn more useful knowledge from the data representation of the teacher network. Li et al. proposed LKD [31], a local correlation exploration framework for knowledge distillation, which uses the intra-instance local relationship, the inter-instance relation on the same local location, and the inter-instance relation across different local locations for modeling. Xu et al. [32] proposed a feature-normalized knowledge distillation scheme by introducing a sample-specific correction factor instead of the uniform temperature T. Considering the ensemble knowledge distillation as a multi-objective optimization problem, Du et al. [33] investigated the diversity of teacher models in gradient spaces.
Unlike the process of offline distillation, the online knowledge distillation process updates the entire knowledge distillation framework simultaneously, that is the teacher model and the student model are updated in parallel. Over the last few years in particular, a series of online knowledge distillation methods have been proposed. For example, Lan et al. [34] proposed a learning framework for single-stage online distillation. Specifically, the framework establishes powerful online teacher models to enhance the learning of the target network while only training a single multi-branch network. Zhang et al. [35] proposed a deep mutual learning strategy that allows student models to learn collaboratively and teach each other throughout the training process. Yao et al. [36] designed an improved bidirectional knowledge distillation method, the dense cross-layer mutual distillation framework (DCM). Wu et al. [37] found that collaborative learning and mutual learning cannot build the online high-capacity teacher network, while the online integration ignores the collaboration between branches, which leads to the proposal of a novel peer collaborative learning approach for online knowledge distillation.

Methodology
For tasks associated with the field of computer vision, the number of parameters in the backbone network increases dramatically as the depth of the network is continuously deepened, which causes significant parameter redundancy from the backbone network, hence affecting the performance of several computer vision tasks. In order to reduce the redundant parameters of the backbone network and obtain a more powerful CNN feature extractor, we propose a collaborative consistency distillation framework, which can effectively deal with the parameter redundancy problem with the increasing depth of the network, while making the obtained backbone network have excellent feature extraction capability. As a result, it can better support various downstream computer vision tasks. The overall pipeline of the CKD framework is illustrated in Figure 1.

Redundant Feature Mapping Module
In contrast to the ordinary convolution in previous convolutional neural networks [38][39][40][41][42][43], our proposed redundant feature mapping module can be inserted into any network structure to improve the model structure, enhance the model feature extraction capability, and reduce the parameter redundancy and floating point operations of the model. The relevant structure of the redundant feature mapping module is shown in Figure 2, which mainly includes two aspects:
(1) Multi-branch feature extraction and fusion: Different from previous work, our objective was to obtain equivalent convolutional kernels with stronger feature extraction capability. In other words, the obtained single-branch k × k equivalent convolution kernel has multi-scale feature extraction capability. As shown in Figure 2, MRFM enhances the feature extraction capability of the CNN with three parallel branches, which employ k × k, 1 × k, and k × 1 convolutional kernels, respectively. When the network training is complete, the convolutional kernels of the three sizes are fused into an equivalent k × k convolutional kernel with stronger extraction ability. The equivalent fusion mainly consists of two processes, BN fusion and branch fusion.
Figure 1. The overall pipeline of the CKD framework. For the network input, deep semantic features for remote sensing image scene classification can be obtained in two ways: On the one hand, we used the teacher sub-networks to extract deep semantic features for guiding the student sub-networks to extract more refined classification feature information. On the other hand, the classification feature information was reinforced by mutual supervision among student sub-networks to enable the student sub-networks to obtain higher classification accuracy. Part A demonstrates in detail the components of our proposed CKD framework. Specifically, it contains a multi-branch teacher sub-network, a single-branch teacher sub-network, and a pair of student sub-networks of different depths. Part B presents the basic block components of the student sub-networks and the teacher sub-networks in the CKD framework.

BN fusion:
To prevent overfitting and accelerate network training, each branch of the MRFM module performs the BN operation shown in Equation (1) after its convolution operation.
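A form of Equation (1) consistent with the symbol definitions that follow can be sketched as below. It assumes standard batch normalization applied to the convolution output of the j-th kernel; the superscript (j) on F and the omission of the usual small stabilizing constant in the denominator are assumptions.

```latex
O_{:,:,j} = \left( \sum_{k=1}^{C} M_{:,:,k} * F^{(j)}_{:,:,k} - \mu_j \right) \frac{\gamma_j}{\sigma_j} + \beta_j \tag{1}
```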
where M ∈ R^{U×V×C} denotes the input feature maps of size U × V with C channels, and M_{:,:,k} denotes the input feature map of the k-th channel. F ∈ R^{H×W×C} indicates a convolution kernel of size H × W with C channels. The output feature map O ∈ R^{R×T×D} of size R × T with D channels is obtained after the convolution operation *. μ_j and σ_j are the mean and standard deviation of the BN operation, and γ_j and β_j are the scaling factor and offset, respectively.
Figure 2. Illustration of the redundant feature mapping module. Specifically, the redundancy mapping module consists of three aspects: the multi-branch redundancy mapping module (MRFM), the single-branch redundancy mapping module (SRFM), and the redundancy mapping convolution (Rconv) operation. It is worth noting that the single-branch redundancy mapping module is generated from the multi-branch redundancy mapping module after BN fusion and branch fusion operations.
After the above BN operation, the convolutional kernels of different sizes are fused based on the principle of additivity between 2D convolutional kernels to produce an equivalent convolutional kernel with the same feature output, and the associated process can be represented by Equation (2).
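Written out under the assumption that both kernels are applied to the same input I with compatible stride and padding, the additivity property referred to as Equation (2) takes the following form:

```latex
I * K^{(1)} + I * K^{(2)} = I * \left( K^{(1)} \oplus K^{(2)} \right) \tag{2}
```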
where I indicates a matrix that can be cropped or padded, K^{(1)} and K^{(2)} are two 2D convolution kernels with compatible dimensions, and ⊕ refers to the summation operation at the corresponding positions.
Branch fusion: As shown in Figure 2, the three feature extraction branches are reduced to one feature extraction branch, and feature extraction is completed with the equivalent convolutional kernel obtained after BN fusion. After this operation, the extracted features are equivalent to the results of the multiple feature extraction branches. In other words, the operation keeps the enhanced feature extraction ability of the multi-branch structure while reducing the network parameters, which improves the performance of the network. For the j-th convolution kernel, F′^{(j)} represents the fused convolution kernel, b_j represents the bias, and F^{(j)}, F̄^{(j)}, and F̂^{(j)} represent the k × k, 1 × k, and k × 1 convolution kernels, respectively. The result after branch fusion can be expressed as:
where μ and σ are the mean and standard deviation of the BN operation, γ and β are the scaling factor and offset, and O_{:,:,j}, Ō_{:,:,j}, and Ô_{:,:,j} are the output feature maps of the k × k, 1 × k, and k × 1 convolution kernels, respectively.
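The equivalence claimed for BN fusion and branch fusion can be checked numerically. The following is a minimal sketch in plain Python, assuming a single input channel, a single output channel, stride 1, zero "same" padding, and BN statistics without the stabilizing constant. The function names, the toy kernels, and the BN values are all hypothetical and chosen only for illustration.

```python
def conv2d_same(img, kernel):
    """Zero-padded 'same' 2D cross-correlation for odd-sized kernels."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += img[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def batch_norm(fmap, mu, sigma, gamma, beta):
    """Inference-time BN with fixed statistics (no epsilon, an assumption)."""
    return [[(v - mu) * gamma / sigma + beta for v in row] for row in fmap]


def pad_to_square(kernel, k):
    """Centre a 1 x k or k x 1 kernel inside a zero k x k kernel."""
    kh, kw = len(kernel), len(kernel[0])
    top, left = (k - kh) // 2, (k - kw) // 2
    out = [[0.0] * k for _ in range(k)]
    for i in range(kh):
        for j in range(kw):
            out[top + i][left + j] = kernel[i][j]
    return out


def fuse_branches(branches, k):
    """BN fusion (scale each kernel by gamma/sigma) followed by branch fusion
    (position-wise sum of the padded kernels, plus one merged bias)."""
    fused = [[0.0] * k for _ in range(k)]
    bias = 0.0
    for kernel, (mu, sigma, gamma, beta) in branches:
        scale = gamma / sigma
        padded = pad_to_square(kernel, k)
        for i in range(k):
            for j in range(k):
                fused[i][j] += scale * padded[i][j]
        bias += beta - mu * scale
    return fused, bias


# Toy input and three branches: (kernel, (mu, sigma, gamma, beta)).
img = [[float((3 * r + 5 * c) % 7) for c in range(5)] for r in range(5)]
branches = [
    ([[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]], (0.5, 2.0, 1.5, 0.1)),  # k x k
    ([[0.5, 1.0, 0.5]], (0.2, 1.0, 0.8, -0.3)),                                      # 1 x k
    ([[1.0], [-1.0], [0.5]], (-0.1, 0.5, 1.2, 0.05)),                                # k x 1
]

# Multi-branch output: sum of the three conv + BN branches.
multi = [[0.0] * 5 for _ in range(5)]
for kernel, bn in branches:
    branch_out = batch_norm(conv2d_same(img, kernel), *bn)
    for r in range(5):
        for c in range(5):
            multi[r][c] += branch_out[r][c]

# Single-branch output: one k x k convolution with the fused kernel and bias.
fused_kernel, fused_bias = fuse_branches(branches, 3)
single = [[v + fused_bias for v in row] for row in conv2d_same(img, fused_kernel)]

max_diff = max(abs(multi[r][c] - single[r][c]) for r in range(5) for c in range(5))
```

In this sketch the two outputs agree up to floating-point rounding, which is the sense in which the single-branch module is equivalent to the multi-branch one at inference time.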
(2) Redundant mapping convolution operation (Rconv): The feature maps extracted by existing backbone networks contain significant redundancy. To address this problem, the ordinary convolution layer is divided into two parts, as shown in the Rconv module in Figure 2, which combines the ordinary convolution operation with linear transformation operations. Specifically, we first obtain the intrinsic feature maps by ordinary convolution; second, we apply the identity transformation and a series of simple linear transformations to the intrinsic feature maps. The two operate in parallel: on the one hand, the intrinsic feature maps are preserved, and the computational burden of the network is reduced; on the other hand, the redundant information in the feature maps is preserved by the inexpensive linear mapping, which yields the redundant feature maps.
The equation for the ordinary convolution operation to generate n feature maps is expressed as: For the Rconv module, m feature maps are first generated by ordinary convolution. It can be expressed as follows.
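The two convolutions just described can be written as follows. This is a reconstruction from the symbols defined next, assuming the usual two-stage formulation of such a module; the number (6) is inferred from the later reference to Equation (7).

```latex
\begin{align}
Y  &= X * f + b \tag{6} \\
Y' &= X * f'    \tag{7}
\end{align}
```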
where X ∈ R^{c×h×w} is the input feature map, c, h, and w are the number of channels, height, and width of the input feature map, respectively, and * denotes the convolution operation. f ∈ R^{c×k×k×n} and f′ ∈ R^{c×k×k×m} denote the convolution kernels, k × k is the kernel size, and b is the bias term. For simplicity, the bias term is neglected in Equation (7). Y ∈ R^{n×h′×w′} and Y′ ∈ R^{m×h′×w′} denote the output feature maps with n and m channels, respectively, with m ≤ n, and h′ and w′ represent the height and width of the output feature maps. In addition, to obtain the required n feature maps, redundant feature information is generated by applying linear operations to the intrinsic feature maps in Y′.
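The linear mapping step can be sketched as follows, assuming one inexpensive operation per intrinsic feature map. The indexing and the number (8) are assumptions; several operations per intrinsic map are equally possible.

```latex
y_i = \Phi_i\left( y'_i \right), \qquad i = 1, \dots, m \tag{8}
```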
where y′_i denotes the i-th intrinsic feature map in Y′, y_i denotes the redundant feature map generated from y′_i, and Φ_i(·) is an inexpensive linear operation on y′_i.
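To make the saving concrete, the weight counts of an ordinary convolution layer and of an Rconv layer can be compared. The sketch below assumes that each redundant feature map is produced by a single d × d filter applied to one intrinsic map (a depthwise-style operation), that the identity mapping costs no parameters, and that biases are ignored. The function names and layer sizes are hypothetical.

```python
def conv_params(c_in, n_out, k):
    """Weights of an ordinary k x k convolution producing n_out maps."""
    return c_in * n_out * k * k


def rconv_params(c_in, n_out, k, m, d=3):
    """Weights of an Rconv-style layer: m intrinsic maps from an ordinary
    k x k convolution plus (n_out - m) maps from cheap d x d operations."""
    if not 0 < m <= n_out:
        raise ValueError("m must satisfy 0 < m <= n_out")
    primary = c_in * m * k * k      # ordinary convolution for intrinsic maps
    cheap = (n_out - m) * d * d     # one d x d filter per redundant map (assumed)
    return primary + cheap


ordinary = conv_params(c_in=64, n_out=128, k=3)
reduced = rconv_params(c_in=64, n_out=128, k=3, m=64)
saving_ratio = ordinary / reduced
```

With half of the output maps generated by cheap operations, the layer in this example needs roughly half the weights of the ordinary convolution.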

Cooperative Consistency Distillation Algorithm
The main purpose of the collaborative consistency knowledge distillation algorithm is to obtain remote sensing image scene classification models that are convenient for deployment on embedded devices. Therefore, to achieve the above goal, in contrast to the previous work, our approach consists of two main aspects. On the one hand, a single-teacher multi-student knowledge distillation model is constructed based on the proposed redundant feature mapping module, and the two student sub-networks with fewer parameters and higher accuracy are obtained based on this architecture. On the other hand, the accuracy of each student sub-network is further improved by a collaborative consistency strategy between the student sub-networks.
As shown in Figure 1 Part A, the teacher network consists of two parts: multi-branch and single-branch teacher networks. We first trained the multi-branch teacher network, which is composed of multiple multi-branch blocks. When the multi-branch teacher network is trained, the multiple feature extraction branches are transformed into a single feature extraction branch by the multi-branch feature fusion operation, which not only drastically reduces the number of parameters of the network, but also yields a single-branch teacher network with equivalent feature extraction capability. The single-branch teacher network is composed of multiple single blocks, and the structure of the multi-branch block and single block is shown in Figure 1 Part B. Second, we used the pre-trained single-branch teacher sub-network to guide the feature learning of the Student1 and Student2 sub-networks, so that each student sub-network learns as much prior knowledge as possible from the single-branch teacher sub-network. As a result, the student network not only achieves model compression, but also reaches an accuracy similar to that of the teacher sub-network. Note that the student sub-networks have different specifications, and the student sub-network S_1 has the deeper network structure. For the i-th data sample, the loss of the student sub-networks S_1 and S_2 can be expressed as:
where L_CE(p_i, q_i) is the cross-entropy loss between the predicted value q_i of the single-branch teacher network T and the predicted value p_i of the student network S, and L_CE(y_i, p_i) is the cross-entropy loss between the predicted value p_i of the student network S and the true label y_i. λ indicates the regularization weight, which balances the losses of the different components. By fitting the predicted labels of the single-branch teacher network T, the student network S is able to learn as much prior knowledge from the teacher network T as possible.
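The per-sample loss described above can be sketched in the following form. The additive combination and the placement of λ on the teacher term are assumptions consistent with the definitions given, not a confirmed transcription of the original equation.

```latex
L^{S}_{i} = L_{CE}(y_i, p_i) + \lambda \, L_{CE}(p_i, q_i), \qquad S \in \{S_1, S_2\}
```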
For the student sub-networks, the predicted outputs of the two student sub-networks S_1 and S_2 on the i-th data sample are denoted as p_i^{S_1} and p_i^{S_2}, respectively. The two student sub-networks are updated simultaneously. During the training of the student sub-network S_1, the student sub-network S_2 helps S_1 converge by using its learned classification characteristics to guide S_1. To measure the difference between the predictions p_i^{S_1} and p_i^{S_2} of the two student sub-networks, the Kullback-Leibler (KL) loss is used. Then, the KL loss of S_2 to S_1 can be expressed as:
Similarly, during the training of the student sub-network S_2, the student sub-network S_1 guides S_2 with the learned classification information, and the KL loss of S_1 to S_2 can be expressed as:
In summary, each student sub-network is required to fit not only the true label y, but also the predicted label q of the single-branch teacher sub-network T and the predicted label of the other student sub-network. Therefore, the overall loss of the student sub-networks includes the traditional supervised losses L_{S_1} and L_{S_2}, the mutual supervised losses L_{S_2→S_1} and L_{S_1→S_2} among the student sub-networks, and the distillation losses L^{T→S_1}_{Single} and L^{T→S_2}_{Single} with the single-branch teacher sub-network. The final loss of the student sub-network can be expressed by the following equation.
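As a numerical illustration of how these three terms combine, the following minimal sketch evaluates them for one sample. Several choices are assumptions rather than details confirmed by the text: the terms are summed with only the teacher term weighted by λ, the mutual term is KL(p_peer ‖ p_self), and the teacher term is the cross-entropy with the teacher prediction as the target. All function names and probability values are hypothetical.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy -sum_c target[c] * log(pred[c]) between two distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_c p[c] * log(p[c] / q[c]); zero-probability terms are skipped."""
    return sum(pc * math.log((pc + eps) / (qc + eps)) for pc, qc in zip(p, q) if pc > 0)


def student_loss(y, p_self, p_peer, q_teacher, lam=1.0):
    """Supervised + teacher distillation + mutual supervision (assumed sum)."""
    supervised = cross_entropy(y, p_self)               # fit the true label
    distill = lam * cross_entropy(q_teacher, p_self)    # fit the teacher prediction
    mutual = kl_divergence(p_peer, p_self)              # fit the other student
    return supervised + distill + mutual


y = [0.0, 1.0, 0.0]      # one-hot true label
q = [0.1, 0.8, 0.1]      # single-branch teacher prediction
p1 = [0.2, 0.6, 0.2]     # prediction of student S_1
p2 = [0.3, 0.5, 0.2]     # prediction of student S_2

loss_s1 = student_loss(y, p1, p2, q)
loss_s2 = student_loss(y, p2, p1, q)
```

In this example S_1 is closer to both the label and the teacher, so its loss is the smaller of the two; the mutual term vanishes only when the two students agree.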
The collaborative consistency distillation algorithm guides multiple student sub-networks through the teacher sub-network, while maintaining the collaborative consistency among student sub-networks through mutual supervised learning. The final goal of optimizing the parameter redundancy and improving the classification performance of student sub-networks was accomplished, and the related process is described in Algorithm 1.

Algorithm 1 Collaborative consistency distillation algorithm
Input: training set D_train, label set Y, learning rate lr
Initialization parameters: θ_stu1 for Student Sub-network 1, θ_stu2 for Student Sub-network 2
Repeat:
1: Randomly select data X from the training set D_train.
2: Pre-trained multi-branch teacher sub-model T_m.
3: Generate the single-branch teacher sub-network T_s based on the multi-branch teacher sub-network T_m.
4: Update the parameter θ_stu1 of Student Sub-network 1:
5: Update the parameter θ_stu2 of Student Sub-network 2:
End: Student sub-networks S_1 and S_2 converge.
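The update loop of Algorithm 1 can be illustrated on a toy problem. The sketch below is not the authors' implementation: the two students are small linear softmax classifiers rather than networks of different depths, the frozen single-branch teacher is replaced by a fixed hypothetical rule, and the gradient of each student loss is taken with the other student's prediction treated as a constant. The loss is assumed to be cross-entropy with the label, plus λ times cross-entropy with the teacher, plus KL from the peer.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(theta, x):
    """Linear student: softmax(W x + b)."""
    W, b = theta
    return softmax([sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)])


def teacher_predict(x):
    """Hypothetical stand-in for the frozen single-branch teacher T_s."""
    return softmax([3.0 * x[0], 3.0 * x[1]])


def sgd_update(theta, x, grad_logits, lr):
    """In-place SGD step given the loss gradient with respect to the logits."""
    W, b = theta
    for c, g in enumerate(grad_logits):
        for d, xd in enumerate(x):
            W[c][d] -= lr * g * xd
        b[c] -= lr * g


def make_data(n, rng):
    """Two-class toy data: class 0 if x0 > x1, else class 1, with a margin."""
    data = []
    while len(data) < n:
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        if abs(x[0] - x[1]) > 0.2:
            data.append((x, 0 if x[0] > x[1] else 1))
    return data


def init_student(rng, classes=2, dim=2):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(classes)]
    return (W, [0.0] * classes)


def train(data, rng, epochs=50, lr=0.1, lam=1.0):
    stu1, stu2 = init_student(rng), init_student(rng)
    for _ in range(epochs):
        rng.shuffle(data)                                  # step 1: sample data
        for x, label in data:
            y = [1.0 if c == label else 0.0 for c in range(2)]
            q = teacher_predict(x)                         # teacher guidance
            p1, p2 = predict(stu1, x), predict(stu2, x)
            # Logit gradients: (p - y) + lam * (p - q) + (p - p_peer).
            g1 = [(a - t) + lam * (a - s) + (a - o) for a, t, s, o in zip(p1, y, q, p2)]
            g2 = [(a - t) + lam * (a - s) + (a - o) for a, t, s, o in zip(p2, y, q, p1)]
            sgd_update(stu1, x, g1, lr)                    # step 4
            sgd_update(stu2, x, g2, lr)                    # step 5
    return stu1, stu2


def accuracy(theta, data):
    hits = 0
    for x, label in data:
        p = predict(theta, x)
        hits += int(p.index(max(p)) == label)
    return hits / len(data)


rng = random.Random(0)
data = make_data(60, rng)
stu1, stu2 = train(data, rng)
acc1, acc2 = accuracy(stu1, data), accuracy(stu2, data)
```

Both students are updated from predictions computed before either update, which mirrors the statement that the two student sub-networks are updated simultaneously.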

Experimentation and Results Discussion
In this section, we perform several sets of comparative experiments and rigorously analyze the experimental results of the CKD framework on NWPU-RESISC45 [44] and SIRI-WHU [45].

Datasets
NWPU-RESISC45: The NWPU-RESISC45 dataset contains a total of 45 remote sensing scenes, and each scene consists of 700 images with a size of 256 × 256 pixels. The NWPU-RESISC45 dataset exhibits rich variation in appearance, spatial resolution, illumination, background, and occlusion. SIRI-WHU: The SIRI-WHU dataset is composed of 12 categories of remote sensing scene images, with a total of 2400 images, and each category consists of 200 images with a size of 200 × 200 pixels. The data were obtained from Google Earth and mainly cover urban areas in China.

Implementation Details
We implemented our proposed collaborative consistent distillation framework in PyTorch and conducted the related experiments with 6 NVIDIA GeForce RTX 3080 Ti GPUs. In our experiments, we used the Adam optimizer [46] to optimize the parameters of the network, setting the initial learning rate to 0.01, the momentum factor to 0.9, the weight decay rate to 10^{-4}, and the batch size to 256. The model was trained for a total of 300 epochs, and the learning rate was reduced to 1/10 of its previous value every 60 epochs.
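The learning rate schedule described above amounts to a step decay. A minimal sketch follows, assuming 0-indexed epochs with the first reduction taking effect at epoch 60; the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=0.01, step=60, factor=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    return base_lr * factor ** (epoch // step)


# Learning rate at a few representative epochs of the 300-epoch run.
schedule = [(epoch, step_decay_lr(epoch)) for epoch in (0, 59, 60, 120, 180, 240, 299)]
```

Under these assumptions the rate passes through five levels over the 300 epochs, from 0.01 in the first 60 epochs down to about 10^{-6} in the last 60.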
Since the fully connected layer in the convolutional neural network restricts the input image size, the images in the dataset must be pre-processed before the relevant model is trained on the NWPU-RESISC45 and SIRI-WHU datasets. For the training set, firstly, the input images of the NWPU-RESISC45 dataset were randomly cropped to 200 × 200 after mirror padding to standardize the size of the input images. It is worth noting that the remote sensing images in the SIRI-WHU dataset were kept at their size of 200 × 200. Secondly, in order to enrich the training set and improve the generalization ability of the model, a simple random left-right flip operation was performed on the training set. Finally, the images were normalized. For the testing set, the input images were center cropped to 200 × 200, which unified the size of the input images between the training set and the testing set. Similarly, the normalization operation was performed on the testing set images.

Comparison of Remote Sensing Image Scene Classification Methods on SIRI-WHU and NWPU-RESISC45 Datasets
To evaluate the remote sensing image scene classification performance of the CKD student sub-networks, the network model was compared with other deep learning models based on the SIRI-WHU and NWPU-RESISC45 datasets. The average classification accuracy in the experiments and a series of other evaluation metrics are shown in Table 1. The CKD student sub-networks proposed in this paper had the highest classification accuracy of 0.916 for the NWPU-RESISC45 dataset and 0.943 for the SIRI-WHU dataset.

Comparison with the State-of-the-Art Knowledge Distillation Methods on the NWPU-RESISC45 and SIRI-WHU Datasets
To evaluate our CKD framework more comprehensively, we also compared CKD with recent state-of-the-art knowledge distillation methods on the SIRI-WHU and NWPU-RESISC45 datasets in Tables 2 and 3. We used two types of baselines to evaluate the classification performance of our CKD-RPO framework. Specifically, the first type consists of a series of offline distillation methods, including KD [19], FN [32], AE-KD [33], LKD [31], RKD [29], and CRD [30]. The second type consists of online knowledge distillation methods: DML [35], ONE [34], DCM [36], and PCL [37].
The experimental results are reported in Table 2, where the best results are marked in bold. The results on the SIRI-WHU and NWPU-RESISC45 datasets show that the proposed CKD achieved the best performance compared with both the offline and the online state-of-the-art knowledge distillation methods, which demonstrates that CKD is capable of improving classification performance. Although the experimental setups in these references vary slightly, our strategy appears to outperform previous state-of-the-art methods.

Table 2.
Method        Type      SIRI-WHU   NWPU-RESISC45
KD [19]       offline   91.7%      87.3%
RKD [29]      offline   91.2%      86.4%
CRD [30]      offline   91.4%      87.6%
FN [32]       offline   90.8%      -
LKD [31]      offline   -          88.4%
AE-KD [33]    offline   -          87.1%
Ours          offline   92.0%      90.5%

Table 3. The accuracy of ResNet110 while using different knowledge distillation approaches on SIRI-WHU and NWPU-RESISC45.

Comparison of the Number of Parameters among CKD Student Sub-Networks and Resnet Networks
This section compares the number of parameters of different student sub-networks in the CKD framework for the backbone networks Resnet20, Rconv_Res20, Resnet32, Rconv_Res32, Resnet56, Rconv_Res56, Resnet110, and Rconv_Res110. The number of parameters of the different student sub-networks is shown in Table 4. From Table 4, we can find that the Rconv_Res-series student sub-networks maintained comparatively fewer parameters. As the network deepened, the number of parameters in the Resnet-series student sub-networks increased more significantly than in the Rconv_Res-series student sub-networks. This also demonstrates that the redundant parameters of the Resnet-series student sub-networks increased dramatically with the increasing depth of the student sub-networks. In contrast, the Rconv_Res-series student sub-networks in the CKD framework were able to effectively eliminate redundant parameters.
The backbone networks that we employed in our experiments included typical student-level backbone networks, Resnet20 [47], Resnet32, Resnet56, and Resnet110, and large-scale teacher-level backbone networks, Resnet110 and Densenet121 [48]. Table 5 compares the top-1 accuracy [49] on the SIRI-WHU dataset obtained by various architectures under the two-student sub-network condition. We can draw the following conclusions from Table 5: (1) For Student Sub-networks 1 and 2, the collaborative consistency distillation algorithm (CKD) significantly improved the classification accuracy of each student sub-network, and the gain values indicate the gains of each student sub-network. (2) Although Rconv_Res110 is a much larger backbone network than Rconv_Res32, it still benefited from being trained with a smaller student sub-network. (3) The smaller student sub-networks can usually gain more from the collaborative consistency distillation algorithm.
To evaluate our CKD framework more comprehensively, we conducted ablation studies to analyze the contribution of the different components of the redundant feature mapping operation. This operation mainly involves three components: the Rconv module, MRFM, and SRFM. To investigate the impact of each component on the CKD framework, we built a series of Resnet-based student sub-networks of different depths together with their corresponding variants and compared their performance.
Resnet: A series of image classification models was constructed based on Resnet20, Resnet32, Resnet56, and Resnet110.
Resnet with Rconv module: Only the Rconv module was applied to a series of image classification models.
Resnet with MRFM: A series of classification models was reconstructed based on the MRFM module to predict the categories of remote sensing image scenes in the SIRI-WHU and NWPU-RESISC45 datasets.
Resnet with SRFM: First, we obtained the SRFM module through the MRFM module. Then, based on the SRFM module, a series of classification models was redesigned to predict the categories of remote sensing image scenes in the SIRI-WHU and NWPU-RESISC45 datasets.
The results of the ablation experiments of the redundant feature mapping operation in the CKD framework are shown in Figure 3. From Figure 3, we can draw the following conclusions: (1) Resnet with the Rconv module showed the worst classification performance among the methods on all datasets. This shows that reconstructing the Resnet model with only the simple Rconv module reduces the parameter redundancy of the networks but also degrades the classification performance of the model. (2) Resnet with MRFM achieved the best classification performance. However, these models had more parameters than Resnet with SRFM, while the improvement in classification accuracy was marginal; we believe that such a slight improvement does not justify the additional parameters. (3) With the number of parameters kept the same, Resnet with SRFM achieved better classification performance than Resnet with the Rconv module. This indicates that the equivalent convolutional kernel obtained by the multi-branch fusion operation exhibited a more powerful feature extraction ability, which effectively improved the classification performance of the model.
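The idea in point (3), that a multi-branch structure can be collapsed into one equivalent kernel, rests on the linearity of convolution. The sketch below illustrates this under simplifying assumptions that are not taken from the paper: a single channel, one 3x3 branch in parallel with one 1x1 branch, stride 1, zero padding, and no normalization layers. The actual branch configuration of MRFM and SRFM is not described in this section, and the helper names `conv2d_same` and `fuse_branches` are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def fuse_branches(k3, k1):
    """Fold a 1x1 branch into a 3x3 branch.

    Zero-padding the 1x1 kernel to 3x3 places its weight at the center,
    so the fused kernel is the 3x3 kernel with that weight added there.
    """
    fused = [row[:] for row in k3]  # copy so the input kernel is not modified
    fused[1][1] += k1[0][0]
    return fused


# Illustrative input and kernels; the values are made up.
image = [[1.0, 2.0, 0.0, 1.0],
         [0.0, 1.0, 3.0, 2.0],
         [2.0, 0.0, 1.0, 1.0],
         [1.0, 1.0, 0.0, 2.0]]
k3 = [[0.5, -1.0, 0.0],
      [1.0, 2.0, -0.5],
      [0.0, 0.25, 1.0]]
k1 = [[0.75]]

# Multi-branch form: run both branches and sum their outputs.
b3 = conv2d_same(image, k3)
b1 = conv2d_same(image, k1)
multi = [[a + b for a, b in zip(r3, r1)] for r3, r1 in zip(b3, b1)]

# Single-branch form: one convolution with the fused kernel.
single = conv2d_same(image, fuse_branches(k3, k1))

# Largest element-wise difference between the two forms.
max_diff = max(abs(a - b) for rm, rs in zip(multi, single) for a, b in zip(rm, rs))
```

In this simplified setting the fused kernel has the same number of weights as a plain 3x3 kernel while reproducing the output of the two-branch structure.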

Conclusions
In this work, we proposed a collaborative consistent knowledge distillation framework to reduce the number of parameters of the remote sensing image scene classification model and to facilitate its deployment on embedded devices with limited hardware resources. Our framework consists of two main components: the redundant feature mapping module and the collaborative consistency distillation algorithm. The experimental results on two benchmark datasets, SIRI-WHU and NWPU-RESISC45, showed that our framework significantly improved the remote sensing image scene classification performance of the student sub-network and substantially reduced the redundant parameters of the backbone network. In addition, the pre-trained student sub-networks obtained by the CKD framework had a powerful feature extraction ability with fewer parameters, which makes them suitable for a wide range of embedded devices.

Data Availability Statement:
The data presented in this study are available in the NWPU-RESISC45 and SIRI-WHU datasets. The details of the datasets are described in Section 4.1, and the corresponding references are [44,45].