SDLS: A Two-Stream Architecture with Self-Distillation and Local Streams for Remote Sensing Image Scene Classification

Ma, Xinliang; Luo, Junwei; Ni, Shuiping; Zhang, Xiaohong; Ding, Runze

doi:10.3390/rs18030498


by Xinliang Ma ¹, Junwei Luo ¹,²,*, Shuiping Ni ¹, Xiaohong Zhang ² and Runze Ding ²

¹ School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454003, China
² School of Software, Henan Polytechnic University, Jiaozuo 454003, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(3), 498; https://doi.org/10.3390/rs18030498

Submission received: 19 December 2025 / Revised: 28 January 2026 / Accepted: 2 February 2026 / Published: 3 February 2026

(This article belongs to the Section AI Remote Sensing)


Highlights

What are the main findings?

A local image generation module (LIGM) is proposed to construct discriminative local representations for remote sensing images, even in scenes with irregularly distributed salient regions.
Based on MobileNetV2, a two-stream framework (SDLS) tailored for CNN is proposed, integrating self-distillation and local streams for remote sensing image scene classification.
A multiplex-guided attention (MGA) module is proposed to enable cross-network attention-guided learning, where CNN features are leveraged to generate attention weights for the lightweight MobileNetV2.

What are the implications of the main findings?

The proposed LIGM provides a general method to highlight key regions in remote sensing images while reducing background interference.
The SDLS framework is adaptable to various CNN architectures, enabling high-accuracy remote sensing image scene classification.
The proposed MGA module uses three convolutional branches to generate attention weights for low-channel features, reducing information loss during channel compression.

Abstract

Remote sensing image scene classification holds significant application value and has long been a research hotspot in remote sensing. However, remote sensing images contain diverse objects and complex backgrounds. Reducing background interference while focusing on key target regions in the images remains a challenge, which limits the potential improvement of classification accuracy. In this paper, a local image generation module (LIGM) is proposed to generate weights for the original images. The resulting local images, generated by weighting the original images, effectively focus on key target regions while suppressing background regions. Based on the LIGM, a two-stream architecture with self-distillation and local streams (SDLS) is proposed. The self-distillation stream extracts features from the original images using a convolutional neural network (CNN) and two MobileNetV2 networks. Furthermore, a multiplex-guided attention (MGA) module is introduced into this stream to facilitate cross-network attention-guided learning between the CNN and MobileNetV2 features. In the local stream, a MobileNetV2 network is employed to extract features from the local images. The classification logits produced by the two streams are fused, resulting in the final SDLS classification score. Experimental results demonstrate that SDLS achieves competitive performance on multiple datasets.

Keywords: remote sensing; scene classification; two-stream architecture; self-distillation

1. Introduction

Rapid advancements in remote sensing technology and imaging equipment have significantly increased the number of remote sensing images [1]. Accurate interpretation of these images and the extraction of valuable information can promote the application of remote sensing technology. Remote sensing image scene classification [2], as a fundamental task in remote sensing image interpretation [3], assigns predefined scene labels to images based on their visual features. The accurate classification of remote sensing images benefits various applications, such as environmental monitoring, disaster assessment, and urban planning.

There are currently two primary categories of remote sensing image scene classification methods: handcrafted feature-based methods and deep learning methods. Handcrafted feature-based methods primarily rely on low-level and mid-level features extracted from images. Representative low-level features include color histograms [4,5], the scale-invariant feature transform (SIFT) [6], local binary patterns (LBP) [7], and the histogram of oriented gradients (HOG) [8]. Mid-level features [9,10] are derived from low-level features through processes such as integration and encoding. The design and extraction of discriminative and robust features are crucial for improving classification accuracy. Therefore, several research works [11,12] have proposed classification methods that combine multiple features, with the aim of making the features more distinguishable and characterizing complex scene information more comprehensively. Despite the advancements achieved through handcrafted feature-based methods, challenges persist in extracting high-level features from images. The rise of deep learning has led to the gradual replacement of handcrafted feature-based methods, because deep learning methods can automatically extract high-level image features. Currently, deep learning models for remote sensing image scene classification are mainly categorized into three types: Generative Adversarial Networks (GANs) [13,14,15], Convolutional Neural Networks (CNNs) [16,17,18], and Vision Transformers (ViTs) [19,20]. GANs are primarily used for unsupervised or semi-supervised learning, whereas CNNs and ViTs are typically used for supervised learning. Compared with GAN-based methods, CNN-based and ViT-based methods tend to achieve better classification performance. Consequently, supervised learning methods have been a hot research topic in remote sensing image scene classification.

Most supervised learning research employs various optimization strategies for model optimization to improve classification accuracy. In contrast, some studies have focused on image transformation, proposing the use of original remote sensing images to generate additional images and designing a two-stream model architecture adapted to this transformation. The original and generated images are inputs to the two-stream model architecture. The advantage of this architecture is that it can extract multiple image features for classification. For example, TEX-Net-LF [21] uses original and texture coded mapped images as inputs. The ELM-based architecture [22] employs original and saliency-detected images as inputs. Although these methods enrich input representations, the generated images do not necessarily correspond to the semantic key regions. To address this problem, the SKAL [23] is proposed to localize key regions in the original image and crop them to generate local images, which are subsequently used for feature extraction and classification. However, the quality of the local images heavily depends on the segmentation threshold used to identify key regions. To alleviate threshold sensitivity, the LML [24] generates spatially adaptive weights for each original image based on the convolutional features extracted by a CNN backbone. These weights are applied to the original image to highlight key regions while suppressing background regions.

In summary, existing image transformation methods that highlight key regions mainly rely on cropping operations or spatial weight overlay. In the local images produced by both approaches, the key regions exhibit limited completeness and accuracy. This limitation is particularly pronounced in complex scenes, such as viaducts or rivers, where key regions are distributed along curves. Cropping operations typically generate rectangular regions. These regions do not adapt well to curved distributions. Meanwhile, spatial weight overlay methods often rely on a single CNN backbone. This reliance may result in inaccurate localization of key regions.

To address these limitations, this paper proposes the local image generation module (LIGM), which generates spatial weights based on integrated convolutional features from multiple CNN backbones. Before applying the generated weights to the original image, a nonlinear transformation is introduced to emphasize key regions further. In addition, the original image is retained and incorporated into the final local representation to prevent the omission of key regions that may not be fully captured by the spatial weighting alone. Based on LIGM, a two-stream architecture named self-distillation and local streams (SDLS) is proposed. SDLS consists of a self-distillation stream and a local stream. The self-distillation stream extracts features from the original images and consists of a self-distillation framework that includes a CNN and two MobileNetV2 networks [25]. The local stream contains only one MobileNetV2, which extracts features of the local images. Note that generating the classification score for the local stream requires fusing the original and local image features. The classification score of SDLS is obtained by fusing the self-distillation stream and the local stream.

The main contributions of this paper are as follows:

(1): A strategy called LIGM is proposed, which generates local images that effectively focus on key target regions while suppressing background interference, providing higher generalization and adaptability.
(2): A two-stream architecture SDLS incorporating self-distillation and local streams is proposed. The proposed architecture is adaptable to various CNN networks and achieves competitive performance across multiple remote sensing image scene datasets.
(3): A multiplex-guided attention (MGA) module is proposed and introduced into the self-distillation stream. The MGA module facilitates cross-network attention from a large CNN to a lightweight MobileNetV2 while reducing information loss.

The paper is organized as follows. Section 2 reviews related work. Section 3 presents the construction and training of SDLS. Section 4 describes the experimental setup and analyzes the results. Section 5 provides a discussion. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Remote Sensing Image Scene Classification

From the research perspective, deep learning methods for remote sensing image scene classification primarily focus on optimizing model architectures to enhance feature extraction capabilities. In contrast, image transformation enhances input image diversity and strengthens feature complementarity, providing more comprehensive data support for classification tasks.

2.1.1. Model Optimization

Transfer learning is a widely used and effective optimization strategy. Deep features learned by deep learning models pre-trained on natural image datasets (e.g., ImageNet [26]) have demonstrated specific applicability in remote sensing image classification [27]. Consequently, pre-trained CNN models [28] are often employed as feature extractors, a practice also adopted in the proposed SDLS architecture. Meanwhile, attention mechanism and feature fusion are commonly utilized to enhance model performance. Zhao et al. [29] proposed an enhanced attention module EAM, which jointly exploits channel-wise and spatial attention mechanisms to strengthen discriminative feature representations. Motivated by the attention mechanism, Li et al. [30] introduced a selective feature fusion module to combine features from different layers of the same backbone network. While SDLS also builds on these concepts, the attention module MGA differs from EAM by directly producing 3D feature map weights rather than separate 1D channel weights or 2D spatial weights. Moreover, SDLS performs feature fusion across multiple CNN backbones rather than within a single backbone. Beyond architectural improvements, some methods that focus on modifying the training loss also fall under model optimization. For example, Xu et al. [31] proposed a joint loss combining cross-entropy and center loss to reduce intra-class variation and preserve inter-class separability. HTPL-JB [32] tackles the limited availability of labeled data through self-supervised pre-training and employs a key point sensitive loss to mitigate class imbalance. In the proposed SDLS framework, a distillation loss is additionally introduced to improve model generalization and classification accuracy.

2.1.2. Image Transformation

Image transformation mainly includes type transformation and multi-scale transformation. Type transformation involves generating various images from original remote sensing images, such as texture coded mapped images [21], saliency-detected images [22], and local images [23,24] containing key regions. Among them, texture coded mapped images and saliency-detected images primarily aim to highlight textural or visually salient features. These images provide complementary representations of the original inputs. In contrast, local images focus on semantically key regions to facilitate more targeted feature extraction. Multi-scale transformation, on the other hand, involves generating images at multiple scales [33,34,35], with each scaled image encoded for classification. The difference between the two transformations is as follows: type transformation diversifies features by altering the representation of images, while multi-scale transformation enhances image features at various scales. Current multi-scale transformation methods are relatively easy to implement and demonstrate effectiveness. Conversely, type transformation methods are relatively more complex and remain an active area of research. The key to type transformation lies in generating images from the original ones and designing an effective feature extraction architecture for these transformed images. In this paper, SDLS integrates the principles of model optimization and image transformation. Specifically, attention mechanisms and feature fusion are applied within the self-distillation stream to extract more robust and discriminative feature maps. These feature maps are then fed into LIGM, implementing image transformation.

2.2. Knowledge Distillation

Knowledge distillation [36] is an effective technique for model compression [37,38] that uses a pre-trained, complex teacher model to train a simpler student model. The main advantage of this technique is that the student model can improve accuracy without increasing model complexity. The selection of teacher and student models is highly flexible, allowing the adoption of different network architectures. As a result, knowledge distillation [39] is often employed to integrate the advantages of ViT and CNN. Xu et al. [40] employed a ViT model as the teacher network to guide ResNet training. Instead, Nabi et al. [41] used the CNN model as the teacher network and transferred its knowledge to the ViT model through distillation. However, traditional distillation methods often require the teacher and student models to be trained in separate stages, which results in high training costs. Meanwhile, the distillation effectiveness is affected by the gap [42] between the teacher and student models. Self-distillation, as a special form of knowledge distillation, does not require an additional pre-trained teacher model. For example, Yun et al. [43] used logits from different samples within the same class for distillation. Zhang et al. [44] constructed multiple branch classifiers using intermediate layers of CNN, with the CNN’s logits guiding the training of these branch classifiers. Inspired by self-distillation, the proposed SDLS framework incorporates multiple classifiers with the CNN. The representative logits from the two streams are integrated to guide the training of both the networks and classifiers in each stream.

3. Methodology

SDLS comprises a self-distillation stream and a local stream; the overall architecture is shown in Figure 1. The self-distillation stream takes the original images as input, extracts features from them, and produces a classification score. The extracted features and the original images are then passed to the LIGM, which converts the features into image weights and generates local images that highlight key regions. Subsequently, the local images serve as input to the local stream. The extracted local features are fused with those from the self-distillation stream to obtain the local stream’s classification score. Finally, the classification score of SDLS is obtained by fusing the logits of the self-distillation and local streams.

3.1. Self-Distillation Stream

In the self-distillation stream, two MobileNetV2 networks, namely Branch1_MobileNetV2 and Branch2_MobileNetV2, assist in CNN training. Based on the differences in feature map dimensions, the internal structures of both the CNN and the MobileNetV2 networks are divided into multiple blocks. An attention module is introduced to enhance information interaction between the CNN and the MobileNetV2 branches. The CNN and the two MobileNetV2 branches each take the original images as input, extract features, and produce their own classification scores. Furthermore, the self-distillation framework includes a feature fusion classifier, which performs classification on the integrated features obtained by fusing the original image features extracted by these three networks.

3.1.1. CNN

As the backbone network of the self-distillation stream, the CNN not only extracts features from the original images but also provides critical feature inputs for generating attention weights of the intermediate feature maps in the branch networks. The CNN primarily consists of a feature extraction layer and a classification layer. The feature extraction layer mainly comprises convolutional layers, while the classification layer includes pooling and fully connected layers. Based on the dimensional changes of the feature maps, the convolutional layers can be divided into four convolutional blocks. Denoting the CNN by $L_{cnn}(\cdot)$, its logits can be represented as follows:

$$L_{cnn}(z) = \underbrace{e_{1}(z) \triangleright e_{2} \triangleright e_{3} \triangleright e_{4}}_{E_{cnn}} \triangleright o. \tag{1}$$

Here, $z \in \mathbb{R}^{H \times W \times C}$ represents the original image, and $e_{i}(\cdot)$ denotes block_i in the CNN. The symbol “$\triangleright$” indicates the process of forward propagation, while $o(\cdot)$ denotes the classification layer. $E_{cnn}$ represents the feature map fed to the classification layer of the CNN. $L_{cnn}(z) \in \mathbb{R}^{B}$ denotes the logits obtained from input $z$, where $B$ represents the total number of image classes.

3.1.2. Multiplex-Guided Attention Module

The multiplex-guided attention (MGA) module, a crucial connecting module between the CNN and the branch networks, utilizes intermediate feature maps from the CNN to generate attention masks. These masks subsequently guide the feature learning in the shallow layers of the branch networks. Because the branch networks are built on MobileNetV2, whose shallow feature maps have fewer channels, the MGA module must also reduce the channel dimensionality of the CNN feature maps. A standard approach is to apply a 1 × 1 convolution, but this operation can lead to an undesirable loss of information. To address this problem, the MGA module employs a multiplex-guided convolutional scheme that utilizes multiple parallel branches.

As shown in Figure 2, the MGA module primarily consists of a standard convolution, a depthwise separable convolution, and a group convolution. The depthwise separable convolution processes spatial information channel by channel before combining the channels, while the group convolution restricts information exchange to within channel subgroups. The input feature map passes through these three types of convolution in parallel, and the resulting output feature maps are then fused. This combination is designed to yield rich, complementary features while keeping the number of parameters low.
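To make the parameter trade-off between the three branch types concrete, the following sketch counts convolution weights for each of them. The kernel size, channel counts, and group number are hypothetical values chosen only for illustration (the text does not specify them), and biases and batch-normalization parameters are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """One k x k depthwise filter per input channel, then a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def group_conv_params(c_in, c_out, k, groups):
    """Grouped k x k convolution: each output channel sees c_in / groups inputs."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return k * k * (c_in // groups) * c_out


# Hypothetical setting: compress 64 CNN channels to 24 branch channels.
C_IN, C_OUT, K, GROUPS = 64, 24, 3, 8

counts = {
    "standard": standard_conv_params(C_IN, C_OUT, K),
    "depthwise_separable": depthwise_separable_params(C_IN, C_OUT, K),
    "group": group_conv_params(C_IN, C_OUT, K, GROUPS),
}
```

With these assumed sizes, the depthwise separable and group branches together add far fewer weights than the standard branch alone, which is consistent with the stated goal of combining complementary branches at low parameter cost.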

Let the input feature map be $X \in \mathbb{R}^{h \times w \times c}$, where $h$ is the height, $w$ is the width, and $c$ is the number of channels. The output feature map after the standard convolution operation is

$$X_{conv} = \mathrm{ReLU6}(\mathrm{BN}(C_{conv}(X))), \tag{2}$$

where $C_{conv}(\cdot)$ represents the standard convolution operation, BN stands for batch normalization, and ReLU6 refers to the ReLU6 activation function. Similarly, the output feature maps after the depthwise separable convolution operation and the group convolution operation are

$$X_{depthwise} = \mathrm{ReLU6}(\mathrm{BN}(C_{depthwise}(X))), \tag{3}$$

$$X_{group} = \mathrm{ReLU6}(\mathrm{BN}(C_{group}(X))), \tag{4}$$

where $C_{depthwise}(\cdot)$ and $C_{group}(\cdot)$ represent the depthwise separable convolution and group convolution operations, respectively. Then, the output feature maps from the three operations are fused as

$$X_{fusion} = \sigma(X_{conv} + X_{depthwise} + X_{group}), \tag{5}$$

where $\sigma(\cdot)$ is the sigmoid function that transforms the fused feature map into an attention mask provided to the MobileNetV2 branch network. The attention mask $X_{fusion} \in \mathbb{R}^{h \times w \times c'}$ changes the number of channels from the original $c$ to $c'$. This adjustment ensures that it aligns with the input requirements of the internal blocks in the MobileNetV2 branch networks. In MobileNetV2, the inverted residual structure uses ReLU6 as the activation function, which is also used in Equations (2)–(4).
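As a minimal numeric sketch of Equations (2)–(5), the snippet below applies ReLU6 to three branch responses at a single position and fuses them with a sigmoid. The input numbers stand in for batch-normalized convolution outputs and are hypothetical; the convolutions themselves are omitted, and the function names are illustrative only.

```python
import math


def relu6(x):
    """ReLU6 activation: clamp the input to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def mga_mask_value(bn_conv, bn_depthwise, bn_group):
    """Fuse three branch responses at one position into an attention weight."""
    fused = relu6(bn_conv) + relu6(bn_depthwise) + relu6(bn_group)
    return sigmoid(fused)


# Hypothetical post-BN responses at one (row, column, channel) position.
weak = mga_mask_value(-1.2, -0.3, -0.5)   # every branch is zeroed by ReLU6
strong = mga_mask_value(2.0, 7.5, 0.4)    # 7.5 is clamped to 6.0 before fusion
```

One consequence of the formulation as written: each ReLU6 output is non-negative, so the fused sum is non-negative and every mask value falls in the interval [0.5, 1).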

3.1.3. Branch Networks

The branch networks constructed based on the MGA module can also serve as lightweight alternative models. The attention masks of the two branch networks are generated based on intermediate feature maps from different blocks of the CNN. Therefore, each MobileNetV2 is also divided into four blocks, denoted as Mvblock_i, to match the dimensions of the CNN feature maps. The feature map of block_i ($i = 1, 2$) from the CNN is processed by the MGA module to obtain the attention mask for the Mvblock_i feature map as follows:

$$M_{att}^{i} = \gamma(e_{1}(z) \triangleright \dots \triangleright e_{i}), \tag{6}$$

where $M_{att}^{i}$ represents the attention mask for the Mvblock_i output feature map, $\gamma$ denotes the MGA-based attention generation process defined in Equations (2)–(5), and $e_{i}(\cdot)$ denotes block_i in the CNN. Thus, the logits of Branch2_MobileNetV2 are given by

$$L_{mv2}(z) = \underbrace{\hat{e}_{1}(z) \triangleright \hat{e}_{2} \odot M_{att}^{2} \triangleright \hat{e}_{3} \triangleright \hat{e}_{4}}_{E_{mv2}} \triangleright \hat{o}, \tag{7}$$

where “$\odot$” denotes element-wise multiplication, and $E_{mv2}$ represents the feature map fed to the classification layer of Branch2_MobileNetV2. $\hat{e}_{i}(\cdot)$ and $\hat{o}(\cdot)$ denote the operation of Mvblock_i and the classification layer of MobileNetV2, respectively. $L_{mv2}(z)$ is the logits of Branch2_MobileNetV2.

The logits of Branch1_MobileNetV2 are given by

$$L_{mv1}(z) = \underbrace{\hat{e}_{1}(z) \odot M_{att}^{1} \triangleright \hat{e}_{2} \triangleright \hat{e}_{3} \triangleright \hat{e}_{4}}_{E_{mv1}} \triangleright \hat{o}, \tag{8}$$

where $E_{mv1}$ represents the feature map fed to the classification layer of Branch1_MobileNetV2, and $L_{mv1}(z)$ is the logits of Branch1_MobileNetV2.

3.1.4. Feature Fusion Classifier

The feature fusion classifier is an integrated multi-source feature ensemble classifier, constructed based on the feature maps from the CNN and branch networks. The feature maps from the final feature extraction layers of the three networks are fused, resulting in the following fused feature map $E_{f}$:

$$E_{f} = \tau(E_{cnn}) + E_{mv2} + E_{mv1}, \tag{9}$$

where the feature maps $E_{cnn}$, $E_{mv2}$, and $E_{mv1}$ are derived from Equations (1), (7), and (8), respectively. $\tau(\cdot)$ denotes the operation of the feature alignment layer. The feature alignment layer mainly consists of a standard convolutional layer, which aligns the number of channels in the CNN feature map with those in the feature maps of the branch networks:

$$\tau(E_{cnn}) = \mathrm{ReLU}(\mathrm{BN}(C_{conv}'(E_{cnn}))), \tag{10}$$

where $C_{conv}'$ represents the standard convolution operation, BN denotes batch normalization, and ReLU refers to the ReLU activation function.

The fused feature map $E_{f}$ is processed by the classification layer, resulting in the logits of the feature fusion classifier as follows:

$$L_{f}(z) = \tilde{o}_{p}(E_{f}) \triangleright \tilde{o}_{f}, \tag{11}$$

where $\tilde{o}_{p}$ represents the pooling and flattening operations in the classification layer, and $\tilde{o}_{f}$ denotes the fully connected layer in the classification layer. $L_{f}(z)$ is the logits of the feature fusion classifier.

3.2. Local Image Generation Module

The LIGM is designed to generate local images that highlight key regions, using the original image $z \in \mathbb{R}^{H \times W \times C}$ and the fused feature map $E_{f}$ from the feature fusion classifier. First, $E_{f}$ is averaged along the channel dimension to reduce the number of channels to one, resulting in the feature map $S$:

$$S(u, v, 1) = \frac{1}{D} \sum_{d=1}^{D} E_{f}(u, v, d), \tag{12}$$

where $u$, $v$, and $d$ denote the indices for height, width, and channel, respectively, and $D$ denotes the number of channels in the feature map $E_{f}$. Afterward, the feature map $S$ is normalized as follows:

$$S_{norm} = \frac{S - \min(S)}{\max(S) - \min(S)}, \tag{13}$$

where $S_{norm}$ represents the normalized feature map, and $\min(S)$ and $\max(S)$ denote the minimum and maximum values of the feature map $S$, respectively. The normalized feature map $S_{norm}$ is then upsampled to match the size of the original image:

$$S' = \mathrm{Upsample}(S_{norm}, (H, W)), \tag{14}$$

where $S'$ denotes the feature map after upsampling, which serves as the weights for the original image, and $H$ and $W$ denote the height and width of the original image, respectively. Next, the weights are raised to the fourth power to enhance the contrast between high and low weights:

$$S' \leftarrow (S')^{4}. \tag{15}$$

The weights $S' \in \mathbb{R}^{H \times W \times 1}$ are element-wise multiplied with each channel of the original image to highlight key regions. Finally, the weighted image is overlaid on the original image to obtain the local image $z_{local} \in \mathbb{R}^{H \times W \times C}$. The overlay introduces two trainable parameters, $\lambda_{1}$ and $\lambda_{2}$, to balance the two terms:

$$z_{local} = \lambda_{1} z + \lambda_{2} (z \odot S'). \tag{16}$$

Here, $\lambda_{1}$ and $\lambda_{2}$ have initial values of 0.2 and 0.8, respectively, and “$\odot$” denotes element-wise multiplication. Because $S'$ has only one channel, it is implicitly broadcast across all $C$ channels of the original image $z$ during element-wise multiplication.
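The LIGM steps in Equations (12)–(16) can be sketched with plain nested lists as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: nearest-neighbour interpolation is assumed for the upsampling step (the interpolation mode is not specified), a small `eps` is added to guard against a constant feature map, and $\lambda_{1}$, $\lambda_{2}$ are held fixed at their initial values rather than trained. All function names are hypothetical.

```python
def channel_mean(feature_map):
    """Eq. (12): average an h x w x D feature map over its D channels."""
    return [[sum(px) / len(px) for px in row] for row in feature_map]


def min_max_normalize(s, eps=1e-12):
    """Eq. (13): rescale to [0, 1]; eps (an assumption) guards constant maps."""
    flat = [v for row in s for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo + eps) for v in row] for row in s]


def upsample_nearest(s, height, width):
    """Eq. (14) with nearest-neighbour interpolation (an assumed mode)."""
    h, w = len(s), len(s[0])
    return [[s[r * h // height][c * w // width] for c in range(width)]
            for r in range(height)]


def generate_local_image(z, feature_map, lam1=0.2, lam2=0.8):
    """Eqs. (12)-(16): weight the H x W x C image z by the sharpened map."""
    height, width = len(z), len(z[0])
    weights = upsample_nearest(
        min_max_normalize(channel_mean(feature_map)), height, width)
    weights = [[v ** 4 for v in row] for row in weights]  # Eq. (15)
    # Eq. (16): the single-channel weight is shared by all C channels.
    return [[[lam1 * px + lam2 * px * weights[r][c] for px in z[r][c]]
             for c in range(width)] for r in range(height)]


# Toy 2 x 2 feature map with 2 channels and a 4 x 4 one-channel image of ones.
features = [[[0.0, 2.0], [3.0, 5.0]],
            [[1.0, 1.0], [0.0, 0.0]]]
image = [[[1.0] for _ in range(4)] for _ in range(4)]
local = generate_local_image(image, features)
```

In this toy case the top-right quadrant, where the averaged feature response is largest, keeps nearly its full intensity, while the bottom-right quadrant, where the response is smallest, is attenuated to roughly $\lambda_{1}$ times the original value.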

3.3. Local Stream

The local stream extracts features from the local image and fuses them with the original features extracted by the self-distillation stream, thereby comprehensively utilizing both the original and local image features for classification. Only one MobileNetV2 network, named Local_MobileNetV2, is included in the local stream. The local image $z_{local}$, derived from Equation (16), is used as input by this network. First, the feature map is extracted from the local image before the classification layer of Local_MobileNetV2 as follows:

$$E_{local} = \hat{e}_{1}(z_{local}) \triangleright \hat{e}_{2} \triangleright \hat{e}_{3} \triangleright \hat{e}_{4}. \tag{17}$$

Here, $E_{local}$ denotes the feature map before the classification layer of Local_MobileNetV2. The operator $\hat{e}_{i}(\cdot)$ denotes the i-th Mvblock of Local_MobileNetV2. The Local_MobileNetV2 shares the same backbone as the two MobileNetV2 branches in the self-distillation stream. Therefore, the corresponding convolutional operations are identical.

Next, the pooled and flattened feature $\tilde{o}_{p}(E_{f})$ from the feature fusion classifier in the self-distillation stream is introduced into the Local_MobileNetV2. Therefore, the logits $L_{local}(z_{local})$ of Local_MobileNetV2 are given by:

$$L_{local}(z_{local}) = \chi(\tilde{o}_{p}(E_{f}), \hat{o}_{p}(E_{local})) \triangleright \hat{o}_{f}, \tag{18}$$

where $\tilde{o}_{p}(E_{f})$ is derived from Equation (11), $\hat{o}_{p}$ represents the pooling and flattening operations, and $\hat{o}_{f}$ denotes the fully connected layer. Both $\hat{o}_{p}$ and $\hat{o}_{f}$ belong to the classification layer of Local_MobileNetV2. $\chi(\cdot, \cdot)$ represents the concatenation operation.

Finally, the logits from the feature fusion classifier and the Local_MobileNetV2 are integrated to form the logits of the SDLS architecture:

$$L(z, z_{local}) = L_{f}(z) + L_{local}(z_{local}), \tag{19}$$

where $L_{f}(z)$ and $L_{local}(z_{local})$ are derived from Equations (11) and (18), respectively. This logits-level fusion is used to form unified teacher logits for the two streams rather than to serve as an independent ensemble prediction. Note that the logits from each network or classifier in both streams must be processed through a softmax function to obtain the corresponding classification score:

$$p^{b} = \frac{\exp(L^{b} / T)}{\sum_{j=1}^{B} \exp(L^{j} / T)}. \tag{20}$$

Here, $B$ denotes the number of classes, $L^{b}$ denotes the logits for the b-th class, $p^{b}$ is the probability of the b-th class, and $T$ is the temperature for smoothing the probability distribution. The prediction is the class that maximizes the probability $p^{b}$. For notational convenience, the softmax operation with temperature $T$ is denoted by $H_{T}(\cdot)$ in the following loss formulation.
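The effect of the temperature in Equation (20) can be illustrated with a short sketch. The logits below are hypothetical values for $B = 3$ classes, and the max-subtraction step is a standard numerical-stability measure that does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Eq. (20): softmax of the logits divided by the temperature T."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]  # hypothetical logits for B = 3 classes
sharp = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=3.0)

# The prediction is the class with the highest probability.
predicted_class = max(range(len(sharp)), key=sharp.__getitem__)
```

A larger temperature flattens the distribution (the top-class probability in `smooth` is lower than in `sharp`) without changing which class is ranked first.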

3.4. Training Loss

The training procedure of SDLS is summarized in Algorithm 1. At each iteration, the original image and its corresponding local image are processed by the self-distillation and local streams, respectively. The resulting logits are fused to form unified teacher logits, and all networks and classifiers are optimized jointly. During the training of SDLS, the training loss is composed of the self-distillation stream loss and the local stream loss.

3.4.1. Self-Distillation Stream Loss

The self-distillation stream consists of three basic networks and a feature fusion classifier, each of which is supervised by the image labels. Formally, for input image $z$ with label $y$, the loss is given by

$$\mathrm{Loss}_{ce} = \sum_{j \in \{cnn, mv2, mv1, f\}} D_{CE}(H_{1}(L_{j}(z)), y), \tag{21}$$

where $H_{1}(\cdot)$ represents the softmax function at $T = 1$ in Equation (20), $D_{CE}$ denotes the cross-entropy loss function, and $L_{j}(z)$ is the logits of model $j$. Additionally, the SDLS architecture serves as the teacher model, while the three basic networks and the feature fusion classifier act as student models for self-distillation training. The distillation loss is given by

$$\mathrm{Loss}_{kd} = \sum_{j \in \{cnn, mv2, mv1, f\}} T^{2} \cdot D_{KL}(H_{3}(L_{j}(z)), H_{3}(L(z, z_{local}))), \tag{22}$$

where $H_{3}(\cdot)$ represents the softmax function at $T = 3$ in Equation (20), $D_{KL}$ represents the Kullback-Leibler divergence, and $L_{j}(z)$ is the logits of model $j$. $L(z, z_{local})$ denotes the logits of the SDLS architecture, as defined in Equation (19). Each KL divergence term is multiplied by the square of the temperature $T = 3$. Therefore, the self-distillation stream loss is given by

$$\mathrm{Loss}_{sd} = \mathrm{Loss}_{ce} + \mathrm{Loss}_{kd}. \tag{23}$$
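A minimal sketch of the self-distillation stream loss in Equations (21)–(23) is given below for a single sample. It rests on two assumptions that the text does not settle: the KL term is taken as KL(teacher ‖ student), the usual knowledge-distillation convention, and the teacher probabilities are treated as constants. All logits are hypothetical values, and the function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature softmax, as in Eq. (20)."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy against an integer class label."""
    return -math.log(probs[label])


def kl_divergence(target, approx):
    """KL(target || approx) for two discrete distributions."""
    return sum(t * math.log(t / a) for t, a in zip(target, approx) if t > 0)


def self_distillation_loss(student_logits, teacher_logits, label, temperature=3.0):
    """Eqs. (21)-(23): summed CE and T^2-scaled KL over all student heads."""
    teacher = softmax(teacher_logits, temperature)
    loss_ce = sum(cross_entropy(softmax(s), label)
                  for s in student_logits.values())
    # Assumption: KL(teacher || student), the common distillation convention.
    loss_kd = sum(temperature ** 2
                  * kl_divergence(teacher, softmax(s, temperature))
                  for s in student_logits.values())
    return loss_ce + loss_kd


# Hypothetical logits for B = 3 classes from the four student heads.
students = {"cnn": [2.0, 0.5, 0.1], "mv2": [1.5, 0.7, 0.2],
            "mv1": [1.2, 0.9, 0.3], "f": [2.5, 0.4, 0.1]}
teacher_logits = [3.9, 0.8, 0.3]  # stands in for L(z, z_local) from Eq. (19)
loss = self_distillation_loss(students, teacher_logits, label=0)
```

When a student's logits equal the teacher's, its KL term vanishes and only the cross-entropy term remains, so the distillation term penalizes only disagreement with the fused teacher distribution.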

Algorithm 1: Training Procedure of the Proposed SDLS Framework

3.4.2. Local Stream Loss

In the local stream, the input image is the local image z_{local}, and the label of z_{local} remains y. Local_MobileNetV2, as the student model, requires its output to be supervised by the label:

\mathrm{Loss}_{ce}^{local} = D_{CE}(H_{1}(L_{local}(z_{local})), y),

(24)

where L_{local}(z_{local}) is the logits of Local_MobileNetV2. Meanwhile, the distillation loss is

\mathrm{Loss}_{kd}^{local} = T^{2} \cdot D_{KL}(H_{3}(L_{local}(z_{local})), H_{3}(L(z, z_{local}))).

(25)

Here, L(z, z_{local}) denotes the logits of the SDLS architecture. Therefore, the local stream loss is given by

\mathrm{Loss}_{ls} = \mathrm{Loss}_{ce}^{local} + \mathrm{Loss}_{kd}^{local}.

(26)

In summary, the overall training loss is given by

\mathrm{Loss} = \mathrm{Loss}_{sd} + \mathrm{Loss}_{ls}.

(27)
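As an illustration of Equations (21)-(27), the following is a minimal sketch of the overall loss for a single image, written in plain Python rather than PyTorch. It is not the authors' implementation. Several points are assumptions: the teacher logits L(z, z_{local}) are passed in directly instead of being computed from Equation (19), the KL term is taken in the usual knowledge-distillation direction (teacher distribution relative to student distribution), and all names and logit values are hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax H_T; T = 1 gives the ordinary softmax."""
    scaled = [v / T for v in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy D_CE between a probability vector and a class index."""
    return -math.log(probs[label])

def kl_divergence(teacher, student):
    """KL(teacher || student); this direction is an assumption of the sketch."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)

def stream_loss(student_logits, teacher_logits, label, T=3.0):
    """Label-supervised term plus T^2-weighted distillation term for one student."""
    ce = cross_entropy(softmax(student_logits, 1.0), label)
    kd = T ** 2 * kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    return ce + kd

def sdls_loss(sd_logits, local_logits, teacher_logits, label, T=3.0):
    """Self-distillation stream loss (four students) plus local stream loss."""
    loss_sd = sum(stream_loss(l, teacher_logits, label, T) for l in sd_logits.values())
    loss_ls = stream_loss(local_logits, teacher_logits, label, T)
    return loss_sd + loss_ls

# Hypothetical 3-class logits for one image with label 0.
sd_logits = {
    "cnn": [2.0, 0.5, -1.0],
    "mv2": [1.5, 0.2, -0.5],
    "mv1": [1.0, 0.8, -0.2],
    "f": [2.5, 0.1, -1.2],
}
local_logits = [1.8, 0.4, -0.6]
teacher_logits = [4.3, 0.5, -1.8]  # stands in for L(z, z_local)
loss = sdls_loss(sd_logits, local_logits, teacher_logits, label=0)
print(f"total loss: {loss:.4f}")
```

When a student's logits coincide with the teacher's, its distillation term vanishes and only the cross-entropy term remains, which is a convenient sanity check for any implementation of these equations.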

4. Experiments and Results

The experiments are conducted using four CNN networks (ResNet18 [45], WRN50-2 [46], ShuffleNetV2-1.5 [47], and EfficientNet-b3 [48]) on four datasets, with five evaluation metrics employed to assess the effectiveness of SDLS comprehensively. The experimental process primarily encompasses seven key aspects. First, the performance of SDLS with different backbone networks is analyzed through CNN ablation experiments. Second, the performance of SDLS is compared with that of existing methods. Third, model complexity and inference time on each network and classifier are quantitatively evaluated. Fourth, the contribution of the MGA module is evaluated, accompanied by feature visualization. Fifth, an ablation study is conducted to analyze the effects of distillation and temperature. Sixth, feature fusion and logits fusion strategies are explored. Finally, LIGM is analyzed through ablation experiments and local image visualization.

All experiments are implemented in a computing environment with a single 24 GB NVIDIA GeForce RTX 3090 GPU, using Python 3.8 and PyTorch 1.12.1. All networks are trained for 50 epochs with the Adam optimizer. The initial learning rate is set to 1 \times 10^{-4}, and the learning rate is divided by 10 every 20 epochs. The batch size is set to 24. Before training on the remote sensing image datasets, all networks in the self-distillation and local streams must be pre-trained on ImageNet.
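The step decay described above can be written as a short function. This is a sketch under the assumption that epochs are counted from zero and that the decay is applied at epochs 20 and 40; the function name is hypothetical. In PyTorch, the same rule corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.1`, stepped once per epoch.

```python
def learning_rate(epoch, base_lr=1e-4, step=20, factor=0.1):
    """Learning rate in effect during a zero-indexed epoch under step decay."""
    return base_lr * factor ** (epoch // step)

# Learning rate for each of the 50 training epochs.
schedule = [learning_rate(epoch) for epoch in range(50)]
for epoch in (0, 19, 20, 39, 40, 49):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.1e}")
```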

4.1. Experimental Datasets

The AID dataset [49] comprises 10,000 images distributed across 30 classes, each with an image size of 600 × 600 pixels. The RSSCN7 dataset [50] contains 2800 images, organized into seven classes, 400 images per class, and an image size of 400 × 400 pixels. The UCM dataset [51] consists of 2100 images, classified into 21 categories, with 100 images per class and an image size of 256 × 256 pixels. The NWPU dataset [52] includes 31,500 images spanning 45 classes, 700 images per class, and an image size of 256 × 256 pixels.

SDLS is evaluated on the four datasets, each divided according to two training ratios. The training ratios for AID and RSSCN7 are set to 20% and 50%, for UCM to 50% and 80%, and for NWPU to 10% and 20%. Each dataset is randomly split five times under the corresponding training ratios, and the mean and standard deviation of the results from these five runs are reported as the final result.
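The evaluation protocol can be sketched as follows. The code draws five random train/test splits at a given training ratio and reports the mean and standard deviation of a list of OA values. Two points are assumptions: the split is drawn over the whole dataset rather than per class, and the sample standard deviation is used. The function names, seeds, and OA values are hypothetical.

```python
import random
import statistics

def random_split(num_images, train_ratio, seed):
    """Return disjoint (train, test) index lists for one random split."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    n_train = round(num_images * train_ratio)
    return indices[:n_train], indices[n_train:]

def summarize(oa_values):
    """Mean and sample standard deviation of OA over repeated runs."""
    return statistics.mean(oa_values), statistics.stdev(oa_values)

# Five splits of a 10,000-image dataset (the size of AID) at a 20% training ratio.
splits = [random_split(10000, 0.2, seed) for seed in range(5)]
print([len(train) for train, _ in splits])

# Placeholder OA values, not results from the paper.
mean_oa, std_oa = summarize([95.1, 95.4, 95.0, 95.3, 95.2])
print(f"OA = {mean_oa:.2f} +/- {std_oa:.2f}")
```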

4.2. Evaluation Metrics

The following five evaluation metrics are used in the experiment: (a) Overall Accuracy (OA), which is used to measure the classification accuracy of the model on the entire test set. (b) Confusion Matrix, which is used to analyze the classification performance of the model in each class. (c) Params, which is used to evaluate the complexity and storage requirements of the model. (d) FLOPs, which is used to measure the computational cost of the model. (e) Inference Time, which measures the average time required to process a single image.
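For concreteness, OA and the confusion matrix can be computed as in the following sketch. It assumes that rows index the true class and columns index the predicted class; the labels and predictions are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix with rows for true classes and columns for predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def overall_accuracy(matrix):
    """Fraction of all test samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_accuracy(matrix):
    """Diagonal entry divided by the row sum, for each true class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Hypothetical labels and predictions for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm, overall_accuracy(cm), per_class_accuracy(cm))
```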

4.3. CNN Ablation Experiments

The experiments are conducted using four CNN networks on the four datasets, with the original and local images resized to 224 × 224. The experimental results are shown in Table 1. All baselines use the same CNN backbones as SDLS and are trained under identical training configurations, including the optimizer, learning rate, and number of training epochs. For all baseline methods, the standard cross-entropy loss is adopted without additional task-specific tuning, ensuring a fair comparison. Experimental results demonstrate that the proposed SDLS can adapt to various backbone networks and improve classification accuracy. SDLS remains effective even with a small training dataset. For instance, using the ShuffleNetV2-1.5 network on the NWPU dataset, the OA gains for 10% and 20% training ratios are 2.62% and 2.02%, respectively. Note that the networks with OA below the baseline can be categorized into two types: (a) Some branch networks perform worse than the CNN baseline. When constructing the SDLS architecture using a CNN with a higher baseline, the OA difference between the branches and the CNN increases, and this difference is difficult to bridge by distillation alone. (b) The OA of the distilled ShuffleNetV2-1.5 model decreases. The performance of the SDLS architecture as a teacher model is influenced by the CNN network. SDLS is built on three lightweight MobileNetV2 networks. When the lightweight network ShuffleNetV2-1.5 is employed as the backbone, prediction inconsistency may arise among different networks or classifiers. As a result, the teacher logits become less reliable, leading to a reduced distillation effect.

Additionally, to explore more possibilities, the local stream network in the SDLS architecture built with EfficientNet-b3 is replaced by another EfficientNet-b3, and the impact of input image sizes of 300 × 300 and 330 × 330 on classification accuracy is investigated. When the input image size is 330 × 330, the batch size is adjusted from 24 to 16. The experiments are conducted on the AID and NWPU datasets, and the corresponding results are shown in Table 2. The experimental results show that the OA of all networks and classifiers consistently improves as the image size increases, with the highest OA achieved at an image size of 330 × 330. Combined with the data in Table 1, it can be concluded that larger image sizes offer a distinct advantage.

The confusion matrices in Figure 3, Figure 4, Figure 5 and Figure 6 provide a detailed visualization of the classification performance across different categories. For the AID and NWPU datasets, the matrices are derived from SDLS results where both the self-distillation and local stream networks are EfficientNet-b3, using an input image size of 330 × 330. The matrices for the RSSCN7 and UCM datasets are based on SDLS results where the self-distillation stream employs EfficientNet-b3 and the local stream uses MobileNetV2, with an input image size of 224 × 224. The confusion matrix results for the RSSCN7 dataset indicate that the classification accuracy for all categories is at least 94%. Notably, only a single category has an accuracy below 90% for the AID and UCM datasets. In the NWPU dataset, four out of 45 categories show a classification accuracy below 90%. A detailed analysis of these misclassified categories is provided in Section 5.2.

4.4. Comparison with Other Methods

SDLS is compared with various methods on the AID and NWPU datasets. The comparison methods include three types: CNN-based methods, GAN-based methods, and ViT-based methods, all proposed after 2020. All the data for the comparison methods are sourced from the values reported in the corresponding papers. The specific comparison results are detailed in Table 3. The comparison results indicate that SDLS achieves competitive performance on both datasets. Specifically, experimental results on the AID dataset demonstrate that SDLS achieves the highest and second-highest OA under different training ratios. Even with an input image size of 224 × 224 and a training ratio of 50%, SDLS still exhibits a performance advantage, with only one method achieving a higher OA. At a 20% training ratio, it still surpasses most methods, with only four achieving higher OA. On the NWPU dataset, when the training ratio is 20%, SDLS also achieves the second-highest and the highest OA. Even with an input image size of 224 × 224, SDLS outperforms most methods, with only three comparison methods achieving higher OA than SDLS. Note that LSDGNet is also built on EfficientNet-b3; SDLS based on EfficientNet-b3 consistently outperforms it on the AID dataset, regardless of the input image size. Overall, ViT-based methods generally outperform CNN-based methods among the comparison methods. However, SDLS, although a CNN-based method, surpasses several ViT-based methods in OA.

To further evaluate the efficiency of SDLS, we additionally compare the model complexity and OA with several methods. The comparison results are reported in Table 4. As shown in the table, SDLS demonstrates a favorable trade-off between model complexity and OA. Specifically, ResNet18_MV2_224 (SDLS) uses the fewest parameters (18.65 M) among all compared methods. The computational cost of ResNet18_MV2_224 (SDLS) is 2.85 G FLOPs, which is the second lowest. The OA of ResNet18_MV2_224 (SDLS) exceeds the OA of several existing approaches. ENet-b3_MV2_224 (SDLS) achieves the highest OA among all compared methods. The computational cost of ENet-b3_MV2_224 (SDLS) is the lowest (2.08 G FLOPs). The parameter count of ENet-b3_MV2_224 (SDLS) is 19.55 M, which is slightly higher than that of ResNet18_MV2_224 (SDLS) and lower than that of most comparison methods.

4.5. Model Complexity and Inference Time Analysis

SDLS provides networks and classifiers with flexible trade-offs between accuracy, model complexity, and inference time. As shown in Table 5, Branch1_MobileNetV2 and Branch2_MobileNetV2 achieve higher OA than their respective CNN backbone networks while using fewer parameters and FLOPs. For example, Branch1_MobileNetV2 has only 4% of the parameters and 22% of the FLOPs compared to WRN50-2, yet its OA still exceeds the baseline. In addition to model complexity, branch-level times are reported to illustrate the efficiency of individual components. All experiments are conducted on three NVIDIA GPUs, including RTX 3090, Tesla P100, and Tesla T4. As shown in Table 5, lightweight branch networks such as Branch1_MobileNetV2 and Branch2_MobileNetV2 exhibit longer inference time than the ResNet18 backbone, despite having fewer parameters and FLOPs. This is mainly because MobileNetV2 relies on depthwise separable convolutions, which are not always efficiently optimized on GPUs. However, when WRN50-2 is adopted as the backbone, the inference time of Branch1_MobileNetV2 and Branch2_MobileNetV2 remains comparable to that of the backbone network. For the Feature fusion classifier, Local_MobileNetV2, and SDLS, the inference time is further increased due to the more complex network structures and additional computational operations. Nevertheless, these models achieve higher classification accuracy. Further optimization of inference efficiency for such architectures will be explored in future work.

4.6. Analysis of the MGA Module

The contribution of the MGA module is validated on the UCM dataset with an 80% training set ratio, using ResNet18, WRN50-2, and ShuffleNetV2-1.5. The experimental results are presented in Table 6. The results indicate that, after removing the attention modules, the OA of multiple networks and classifiers decreased to varying degrees, confirming the critical role of the modules in enhancing the OA of SDLS. When ResNet18 serves as the backbone, OA decreases for both Branch2_MobileNetV2 and the feature fusion classifier. For WRN50-2 and ShuffleNetV2-1.5, the OA of all networks and classifiers, except for Branch1_MobileNetV2, declines after removing the MGA module.

In addition, we conduct a branch ablation study and compare the MGA module with two widely used attention mechanisms, SE [70] and CBAM [71]. Experiments use ResNet18 as the backbone with a 10% training ratio on the NWPU dataset. The results are shown in Table 7. For the branch ablation study, we remove each of the three branch components separately: standard convolution (w/o Conv), depthwise convolution (w/o Depthwise), and group convolution (w/o Group). The results show that Branch2_MobileNetV2 achieves the highest OA when standard convolution is removed. Branch1_MobileNetV2 achieves the highest OA when group convolution is removed. The full branch structure balances the performance between Branch2_MobileNetV2 and Branch1_MobileNetV2. ResNet18, the feature fusion classifier, Local_MobileNetV2, and SDLS achieve their highest OA with the full branch structure. For the attention module comparison, MGA consistently leads to slightly higher OA. For example, SDLS obtains an OA of 92.03% with MGA, compared to 91.82% with SE and 91.89% with CBAM. Similar trends are observed for the other networks and classifiers. Combined with the branch ablation study, these results suggest that the MGA module in the full branch design can help improve the OA of SDLS.

As a supplementary analysis, the effectiveness of the attention module across different backbones is further investigated through a comparative analysis of attention maps. The corresponding results are shown in Figure 7, where the feature maps have been processed with ReLU6 activation to ensure visualization quality. All visualized maps represent the input feature maps of Mvblock2 corresponding to different backbone networks. Here, Mvblock2 belongs to Branch1_MobileNetV2. The results demonstrate that, after being weighted by attention masks generated from different backbone networks, the feature maps can effectively focus on key target regions. As the MGA module is applied to the shallow layers of the Branch1_MobileNetV2, the resulting attention is mainly directed toward the edge information of the targets.

Furthermore, attention feature maps related to Branch1_MobileNetV2 in the SDLS architecture built on ShuffleNetV2-1.5 are visualized, as shown in Figure 8. The attention module takes the output feature map of the CNN block1 as input, and after three convolution operations, performs element-wise addition to generate the integrated feature map. Afterwards, the attention mask obtained from this integrated feature map is element-wise multiplied with the output feature map of Mvblock1, and the resulting weighted feature map is used as the input to Mvblock2. Note that the output feature map from Mvblock1 is not activated by ReLU6, which affects the quality of feature visualization. Therefore, during the visualization process, the ReLU6 activation function is additionally applied to the output feature map of Mvblock1 and the input feature map of Mvblock2. The corresponding results are highlighted by the dashed boxes in Figure 8. The visualization results show that the Mvblock1 output feature map, after being processed by the attention module, effectively focuses on the edge information of key regions while reducing background interference. It is worth noting that using only standard convolution may result in the loss of edge information. The introduction of depthwise separable and group convolutions, however, mitigates this information loss. This precisely explains why the MGA module employs a multiplex-guided convolutional scheme utilizing multiple parallel branches.

4.7. Impact of Distillation and Temperature

During the SDLS training process, distillation needs to be performed in each stream. To evaluate the contribution of distillation, experiments are conducted on the AID dataset with a 50% training set ratio, using ResNet18 and WRN50-2. The experimental results are presented in Table 8. The results indicate that the absence of distillation causes a decline in OA across all cases, confirming the critical role of distillation in enhancing the OA of the proposed SDLS framework. The impact of distillation varies across different backbone networks. The improvement in OA is most pronounced with the ResNet18 backbone, achieving a gain of 0.61%, which underscores the effectiveness of self-distillation in enhancing CNN performance. Consequently, the OA of SDLS increases by 0.29% for ResNet18 and 0.25% for WRN50-2. Furthermore, the standard deviations of SDLS remain stable or slightly decrease, indicating that the observed improvements are consistent and reliable.

To further investigate the influence of the distillation temperature on the OA, ResNet18 is selected as the backbone network and trained on the AID dataset with a 20% training ratio. The experimental results under different temperature settings are summarized in Table 9. From Table 9, it can be observed that T = 3 consistently achieves the best OA across different networks and classifiers. Therefore, T = 3 is adopted in all experiments. In contrast, excessively large temperature values (e.g., T = 7 and T = 9) lead to a severe decline in OA accompanied by significantly increased variance, which can be attributed to over-smoothing of the soft targets during distillation. These results indicate that the proposed SDLS framework is robust to moderate variations of the temperature.
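The over-smoothing effect mentioned above can be illustrated numerically. The sketch below softens one hypothetical teacher logit vector at several temperatures and reports the top-class probability and the entropy of the resulting distribution. As T grows, the soft targets approach the uniform distribution and carry less class-specific information. The logit values are illustrative and are not taken from the experiments.

```python
import math

def softened(logits, T):
    """Softmax of logits divided by the temperature T."""
    scaled = [v / T for v in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; its maximum is log(number of classes)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

teacher_logits = [6.0, 2.0, 1.0, 0.5]  # hypothetical 4-class teacher output
report = {}
for T in (1, 3, 5, 7, 9):
    probs = softened(teacher_logits, T)
    report[T] = (max(probs), entropy(probs))
    print(f"T={T}: top probability {report[T][0]:.3f}, entropy {report[T][1]:.3f}")
print(f"uniform entropy: {math.log(len(teacher_logits)):.3f}")
```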

4.8. Effect of Two-Stream Fusion Strategies

This subsection aims to systematically analyze the impact of different fusion strategies in the proposed two-stream SDLS framework. Specifically, fusion operations are investigated at two levels: logits-level fusion for combining the outputs of the two streams, and feature-level fusion for integrating multi-branch representations within the self-distillation stream. All experiments in this subsection are conducted on the AID dataset with a training ratio of 50%, where ResNet18 and WRN50-2 are, respectively, adopted as the CNN backbones for evaluation.

We first examine different logits fusion techniques for combining the outputs of the two streams. Importantly, all fusion techniques are purely mathematical and introduce no additional parameters. As shown in Equation (19), SDLS employs the summation operation by default to fuse the logits from the two streams. In addition, two alternative fusion strategies are explored: mean fusion, which averages the logits element-wise, and max fusion, which selects the element-wise maximum value between the two logits vectors. The corresponding results are presented in Table 10. The results demonstrate that mean fusion yields the highest OA for both the two-branch networks and the CNN within the self-distillation stream. Conversely, the summation technique is more advantageous for enhancing the OA of the feature fusion classifier and the overall SDLS framework. The specific fusion technique should be determined based on the practical application requirements. In this paper, the summation technique is recommended.
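The three parameter-free logits fusion techniques can be sketched as follows, with hypothetical logit vectors for the two streams. One observation follows directly from the arithmetic: mean fusion is summation scaled by 1/2, so both always predict the same class, and under a temperature-scaled softmax the mean-fused teacher at temperature T equals the sum-fused teacher at temperature 2T. This may be one reason the two techniques lead to different distillation results, although that interpretation is not stated in the source.

```python
def fuse_logits(a, b, mode="sum"):
    """Fuse two logit vectors element-wise without any learnable parameters."""
    if len(a) != len(b):
        raise ValueError("logit vectors must have the same length")
    if mode == "sum":
        return [x + y for x, y in zip(a, b)]
    if mode == "mean":
        return [(x + y) / 2 for x, y in zip(a, b)]
    if mode == "max":
        return [max(x, y) for x, y in zip(a, b)]
    raise ValueError(f"unknown fusion mode: {mode}")

def argmax(values):
    """Index of the largest entry, i.e., the predicted class."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits from the self-distillation stream and the local stream.
sd_stream = [1.0, 3.0, 0.0]
local_stream = [2.0, 1.0, 0.5]
for mode in ("sum", "mean", "max"):
    fused = fuse_logits(sd_stream, local_stream, mode)
    print(mode, fused, "predicted class:", argmax(fused))
```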

In addition to logits fusion, we further analyze different feature fusion strategies. Three commonly used fusion operations are evaluated: element-wise summation, channel-wise concatenation, and weighted averaging based on branch confidence. The quantitative results are summarized in Table 11. As shown in Table 11, weighted averaging and summation generally yield higher OA than simple concatenation. This observation indicates that feature-level fusion benefits from controlled aggregation, which avoids excessive feature dimensionality while maintaining discriminative information. Higher OA is achieved through summation, and this technique is therefore recommended in this paper. Overall, these results demonstrate that both logits-level and feature-level fusion play critical roles in the proposed SDLS framework. Appropriate fusion strategies enable effective information integration across streams and branches.

4.9. Local Image Analysis

The ResNet18 network, equipped with SDLS, is used for local image visualization on the AID dataset with a training ratio of 50%. The visualization results are shown in Figure 9. These results are obtained after the network training, indicating that the local images can effectively highlight key regions. Images labeled “Storagetanks” and “Bridge” typically contain key regions with clear edges and regular arrangements. In contrast, the key regions of the “Viaduct” and “River” classes are often arranged irregularly and follow curved distributions. This structural difference suggests that for irregularly arranged key regions, local images may struggle to capture all key regions comprehensively. Therefore, the original image is incorporated into the LIGM to address this limitation. As shown in Equation (16), two trainable parameters are introduced for generating local images. The variations of λ_{1} and λ_{2} during training are illustrated in Figure 10. The variations in λ_{1} and λ_{2} during the early stages of training (first 20 epochs) indicate continuous optimization of the local images. Afterwards, the values of both parameters gradually stabilize, indicating that the generation of local images has reached a relatively optimized state. As observed in Figure 10, λ_{1} tends to increase while λ_{2} tends to decrease during training. Motivated by this trend, we conducted an ablation study by increasing the initial value of λ_{1} and decreasing that of λ_{2}. The results are summarized in Table 12.

As shown in Table 12, a low initial value of λ_{1} generally benefits the individual networks, including Branch1_MobileNetV2, Branch2_MobileNetV2, and ResNet18. When λ_{1} is increased to 0.6, a slight improvement in OA is observed for the Feature fusion classifier, Local_MobileNetV2, and SDLS, whereas the OA of Branch1_MobileNetV2, Branch2_MobileNetV2, and ResNet18 decreases. Further increasing λ_{1} to 0.8 leads to an OA decrease for most models. The default configuration, λ_{1} = 0.2 and λ_{2} = 0.8, provides a reliable balance. These observations confirm that the default values are a reasonable choice for local image generation.

Additionally, as shown in Equation (15), the weights used for local image generation undergo a fourth power operation to enhance focus on key regions. The effect of this transformation is demonstrated by visualizing the local images generated at different training epochs, as shown in Figure 11. The results indicate that the fourth power transformation of the weights leads to a stronger emphasis on high-weight regions. As the number of epochs increases, the local images are continuously optimized. Even in the early training stages, the local images can effectively focus on the key regions of the images. This characteristic enables the SDLS framework to simultaneously optimize the models of the self-distillation and local streams during training, eliminating the complexity of staged training and ensuring high efficiency.

To investigate the impact of the exponent x in the power-based transformation of the LIGM module, we perform an ablation study by varying x. Specifically, x is varied among \{1, 2, 4, 6\}, where x = 1 corresponds to the case without power-based enhancement. The quantitative results are summarized in Table 13. As shown in Table 13, the highest OA for all evaluated networks and classifiers is achieved when x > 1, indicating that the power-based transformation consistently improves performance compared to the case without enhancement. Among the tested settings, x = 4 yields the highest OA for both Local_MobileNetV2 and the proposed SDLS, and therefore is recommended as the default configuration. When x is further increased to 6, higher OA is observed for Branch1_MobileNetV2, Branch2_MobileNetV2, and the Feature fusion classifier. However, the OA of Local_MobileNetV2 and SDLS slightly decreases under this setting, suggesting that an excessively large exponent may lead to over-amplification of local responses.
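The effect of the exponent can be illustrated with a small numerical sketch. It assumes that the weights are normalized to [0, 1] and that none is exactly zero, and it shows only the power operation itself, not the full form of Equation (15). The function names and weight values are hypothetical.

```python
def power_enhance(weights, x=4):
    """Raise each weight (assumed to lie in [0, 1]) to the power x."""
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("weights are assumed to be normalized to [0, 1]")
    return [w ** x for w in weights]

def contrast(weights, x):
    """Ratio between the largest and smallest enhanced weight (no zero weights)."""
    enhanced = power_enhance(weights, x)
    return max(enhanced) / min(enhanced)

weights = [0.5, 0.8, 1.0]  # hypothetical normalized weights at three locations
for x in (1, 2, 4, 6):
    enhanced = [round(w, 4) for w in power_enhance(weights, x)]
    print(f"x={x}: weights {enhanced}, contrast {contrast(weights, x):.0f}")
```

With these values, the ratio between the strongest and weakest location grows from 2 at x = 1 to 16 at x = 4 and 64 at x = 6, which matches the qualitative description: larger exponents suppress medium-weight regions more strongly, and a very large exponent may leave only the single strongest response.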

5. Discussion

This section analyzes the experimental results to better understand SDLS. First, statistical significance is examined to assess reliability. Second, misclassified categories are analyzed to reveal limitations and suggest potential improvements.

5.1. Statistical Significance Analysis

In this paper, SDLS is implemented on an EfficientNet-b3 backbone with an input image size of 330 × 330, which achieves the highest OA among all evaluated configurations. Statistical significance analysis is therefore performed under this setting to examine whether the observed OA gains are stable and reproducible. Experiments are conducted on the AID and NWPU datasets. For each dataset, the data are randomly split five times, and two independent runs are performed for each split, resulting in a total of ten runs. All networks and classifiers are evaluated under identical data splits to enable fair paired statistical comparisons. The OA differences between each network or classifier and the baseline of EfficientNet-b3 are analyzed using paired sample t-tests with 95% confidence intervals (CI), as summarized in Table 14. The paired differences are defined as the OA difference between each configuration and the baseline. Positive CI indicate consistent improvements, while negative CI indicate inferior OA.
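The paired analysis can be sketched as follows for ten matched runs. The code computes the mean paired OA difference, the t statistic, and a two-sided 95% CI, using the critical value of Student's t distribution with 9 degrees of freedom (approximately 2.262), which applies only when there are exactly ten pairs. The OA values are placeholders and are not results from Table 14; computing the p-value would additionally require the t distribution's cumulative distribution function, which is omitted here.

```python
import math
import statistics

T_CRIT_95_DF9 = 2.262  # two-sided 95% critical value, Student's t, 9 degrees of freedom

def paired_t_summary(model_oa, baseline_oa, t_crit=T_CRIT_95_DF9):
    """Mean paired difference, t statistic, and CI for matched runs."""
    if len(model_oa) != len(baseline_oa):
        raise ValueError("paired samples must have the same length")
    diffs = [m - b for m, b in zip(model_oa, baseline_oa)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t_stat = mean_diff / std_err  # undefined if all differences are identical
    half_width = t_crit * std_err
    return mean_diff, t_stat, (mean_diff - half_width, mean_diff + half_width)

# Placeholder OA values (%) for ten runs evaluated on identical data splits.
baseline = [96.0, 96.2, 95.9, 96.1, 96.3, 96.0, 96.1, 95.8, 96.2, 96.0]
gains = [0.3, 0.5, 0.4, 0.2, 0.6, 0.4, 0.3, 0.5, 0.4, 0.4]
model = [b + g for b, g in zip(baseline, gains)]

mean_diff, t_stat, ci = paired_t_summary(model, baseline)
print(f"mean difference {mean_diff:.3f}, t = {t_stat:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

A CI that lies entirely above zero corresponds to a significant improvement at the 5% level, while a CI that contains zero corresponds to a non-significant difference.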

On the AID dataset, Branch1_MobileNetV2 and Branch2_MobileNetV2 show statistically significant OA degradation. The CI for both branches are strictly negative, and the p-values are very small. Section 4.3 explains that the OA gap between lightweight branches and the high-OA CNN baseline is difficult to bridge by distillation alone. EfficientNet-b3 shows only marginal OA improvements. At the 50% training ratio, the CI crosses zero and the corresponding p-value is 9.49 \times 10^{-2}, indicating that the observed OA gain is not statistically significant. This limited improvement can be attributed to the already high OA of the EfficientNet-b3 before distillation, which leaves little room for further gains. From an overall perspective, this phenomenon is reasonable. Across both datasets, the statistical significance of most networks and classifiers tends to decrease as the training ratio increases, suggesting that increasing the training data does not substantially enhance their advantage over the baseline. Therefore, on the NWPU dataset at lower training ratios, EfficientNet-b3, the Feature fusion classifier, Local_MobileNetV2, and SDLS achieve statistically significant OA improvements compared with the baseline. SDLS consistently attains the highest OA across different training ratios. The CI for SDLS are entirely above zero, demonstrating robust and stable OA gains.

5.2. Analysis of Misclassified Categories

The confusion matrix in Figure 6 shows that several categories in the NWPU dataset exhibit relatively low classification accuracy. To further analyze the causes of these misclassifications, Figure 12 visualizes the top five misclassification cases. These cases include church misclassified as palace, palace misclassified as church, railway_station misclassified as railway, rectangular_farmland misclassified as terrace, and dense_residential misclassified as medium_residential. These misclassification cases are mainly associated with high inter-class similarity between the scenes of the true labels and those of the predicted classes. This similarity increases the difficulty of classification.

These cases reveal two main limitations of SDLS. In church misclassified as palace, palace misclassified as church, and railway_station misclassified as railway, the weight heatmaps generated by LIGM can focus on semantically relevant regions. However, strong inter-class similarity still limits classification accuracy. In rectangular_farmland misclassified as terrace and dense_residential misclassified as medium_residential, key regions are spatially distributed over large areas. In these cases, the generated weight heatmaps cannot fully cover all informative regions. This limitation motivates the use of the original image overlay in LIGM, which helps alleviate incomplete region coverage. Therefore, the proposed targeted improvement strategies aim to explore two directions. First, the goal is to reduce inter-class similarity. One possible approach is to introduce constraints via a loss function. Second, the goal is to improve coverage of key regions while maintaining precision. This can be achieved by making the extent of salient regions in the feature maps trainable, allowing adaptive adjustment according to each input image.

6. Conclusions

In this paper, we propose the LIGM, which generates local images by suppressing background regions while emphasizing key target regions. Based on LIGM, a two-stream architecture for SDLS is developed, which is compatible with various CNN backbones. Specifically, SDLS consists of a self-distillation stream and a local stream that extract features from the original images and the generated local images, respectively. Extensive experiments on multiple remote sensing image datasets demonstrate that SDLS achieves competitive performance. In addition, comprehensive analyses are conducted on model complexity and inference time, the MGA module, the effects of distillation and temperature, feature and logits fusion strategies, and the LIGM strategy. Together, these analyses demonstrate the effectiveness of the proposed SDLS framework from multiple perspectives.

Nevertheless, several limitations of the proposed SDLS framework remain. First, the SDLS introduces a sequential dependency between the self-distillation and local streams, since the LIGM relies on deep features from the self-distillation stream for local image generation. This dependency increases model complexity and reduces inference efficiency, despite the lightweight design of the local stream. Second, the classification performance of SDLS remains limited for semantically similar scenes and images in which key regions are spatially distributed over large areas. Third, most remote sensing applications involve multi-modal data sources, whereas the current study considers only RGB images. To address these limitations, future work will explore strategies to reduce the dependency between streams and develop a more generalizable local image generation scheme. In addition, the framework will be extended to handle multi-modal remote sensing data. Approaches to further improve model efficiency will also be investigated.

Author Contributions

Conceptualization, J.L. and X.Z.; methodology, X.M. and J.L.; software, R.D.; validation, X.M.; formal analysis, X.M. and J.L.; resources, X.M. and S.N.; data curation, R.D.; writing—review and editing, X.M. and J.L.; writing—original draft, X.M.; visualization, X.M. and R.D.; funding acquisition, J.L.; supervision, J.L., S.N. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the “Double First-Class” Discipline Construction Project for Surveying and Mapping Science and Technology (Grant No. GCCYJKT202508).

Data Availability Statement

The data are derived from public domain resources, including UCM (https://vision.ucmerced.edu/datasets/, accessed on 20 January 2026), AID (https://captain-whu.github.io/AID/, accessed on 20 January 2026), NWPU (https://www.tensorflow.org/datasets/catalog/resisc45, accessed on 20 January 2026), and RSSCN7 (https://sites.google.com/site/qinzoucn/download, accessed on 20 January 2026). The code supporting this work is available at: https://github.com/plkfans/SDLS (accessed on 20 January 2026).

Acknowledgments

We sincerely appreciate the editor and the anonymous reviewers for their thorough review and constructive suggestions, which have greatly improved the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lian, Z.; Zhan, Y.; Zhang, W.; Wang, Z.; Liu, W.; Huang, X. Recent advances in deep learning-based spatiotemporal fusion methods for remote sensing images. Sensors 2025, 25, 1093.
2. Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.-S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756.
3. Dong, Z.; Gu, Y.; Liu, T. Generative convnet foundation model with sparse modeling and low-frequency reconstruction for remote sensing image interpretation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5603816.
4. Van De Sande, K.; Gevers, T.; Snoek, C. Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1582–1596.
5. Burghouts, G.J.; Geusebroek, J.-M. Performance evaluation of local colour invariants. Comput. Vis. Image Underst. 2009, 113, 48–62.
6. Shao, W.; Yang, W.; Xia, G.-S.; Liu, G. A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization. In Proceedings of the International Conference on Computer Vision Systems, St. Petersburg, Russia, 16–18 July 2013.
7. Aptoula, E. Remote sensing image retrieval with global morphological texture descriptors. IEEE Trans. Geosci. Remote Sens. 2014, 52, 3023–3034.
8. Cheng, G.; Zhou, P.; Han, J.; Guo, L.; Han, J. Auto-encoder-based shared mid-level visual dictionary learning for scene classification using very high resolution remote sensing images. IET Comput. Vis. 2015, 9, 639–647.
9. Chen, S.; Tian, Y. Pyramid of spatial relations for scene-level land use classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 1947–1957.
10. Wu, H.; Liu, B.; Su, W.; Zhang, W.; Sun, J. Hierarchical coding vectors for scene level land-use classification. Remote Sens. 2016, 8, 436.
11. Zou, J.; Li, W.; Chen, C.; Du, Q. Scene classification using local and global features with collaborative representation fusion. Inf. Sci. 2016, 348, 209–226.
12. Risojević, V.; Babić, Z. Fusion of global and local descriptors for remote sensing image classification. IEEE Geosci. Remote Sens. Lett. 2013, 10, 836–840.
13. Yu, Y.; Li, X.; Liu, F. Attention GANs: Unsupervised deep feature learning for aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 519–531.
14. Yan, P.; He, F.; Yang, Y.; Hu, F. Semi-supervised representation learning for remote sensing image classification based on generative adversarial networks. IEEE Access 2020, 8, 54135–54144.
15. Li, J.; Liao, Y.; Zhang, J.; Zeng, D.; Qian, X. Semi-supervised DEGAN for optical high-resolution remote sensing image scene classification. Remote Sens. 2022, 14, 4418.
16. Bi, Q.; Qin, K.; Li, Z.; Zhang, H.; Xu, K.; Xia, G.-S. A multiple-instance densely-connected convnet for aerial scene classification. IEEE Trans. Image Process. 2020, 29, 4911–4926.
17. Xu, K.; Huang, H.; Deng, P.; Shi, G. Two-stream feature aggregation deep neural network for scene classification of remote sensing images. Inf. Sci. 2020, 539, 250–268.
18. Zhang, G.; Xu, W.; Zhao, W.; Huang, C.; Yk, E.N.; Chen, Y.; Su, J. A multiscale attention network for remote sensing scene images classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 9530–9545.
19. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516.
20. Guo, J.; Jia, N.; Bai, J. Transformer based on channel-spatial attention for accurate classification of scenes in remote sensing image. Sci. Rep. 2022, 12, 15473.
21. Anwer, R.M.; Khan, F.S.; Van De Weijer, J.; Molinier, M.; Laaksonen, J. Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification. ISPRS J. Photogramm. Remote Sens. 2018, 138, 74–85.
22. Yu, Y.; Liu, F. A two-stream deep fusion framework for high-resolution aerial scene classification. Comput. Intell. Neurosci. 2018, 2018, 8639367.
23. Wang, Q.; Huang, W.; Xiong, Z.; Li, X. Looking closer at the scene: Multiscale representation learning for remote sensing image scene classification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1414–1428.
24. Chen, X.; Zheng, X.; Zhang, Y.; Lu, X. Remote sensing scene classification by local–global mutual learning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506405.
25. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
26. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
27. Penatti, O.A.; Nogueira, K.; Dos Santos, J.A. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 11–12 June 2015.
28. Yuan, Y.; Fang, J.; Lu, X.; Feng, Y. Remote sensing image scene classification using rearranged local features. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1779–1792.
29. Zhao, Z.; Li, J.; Luo, Z.; Li, J.; Chen, C. Remote sensing image scene classification based on an enhanced attention module. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1926–1930.
30. Li, M.; Lei, L.; Li, X.; Sun, Y.; Kuang, G. An adaptive multilayer feature fusion strategy for remote sensing scene classification. Remote Sens. Lett. 2021, 12, 563–572.
31. Xu, K.; Huang, H.; Deng, P. Remote sensing image scene classification based on global–local dual-branch structure model. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8011605.
32. Chen, X.; Tao, H.; Zhou, H.; Zhou, P.; Deng, Y. Hierarchical and progressive learning with key point sensitive loss for sonar image classification. Multimed. Syst. 2024, 30, 380.
33. Hu, F.; Xia, G.-S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707.
34. Yang, Z.; Mu, X.; Wang, S.; Ma, C. Scene classification of remote sensing images based on multiscale features fusion. Opt. Precis. Eng. 2018, 26, 3099–3107.
35. Zheng, Z.; Fang, F.; Liu, Y.; Gong, X.; Guo, M.; Luo, Z. Joint multi-scale convolution neural network for scene classification of high resolution remote sensing imagery. Acta Geod. Cartogr. Sin. 2018, 47, 620–630. Available online: http://xb.chinasmp.com/CN/10.11947/j.AGCS.2018.20170191 (accessed on 1 February 2026).
36. Himeur, Y.; Aburaed, N.; Elharrouss, O.; Varlamis, I.; Atalla, S.; Mansoor, W.; Al-Ahmad, H. Applications of knowledge distillation in remote sensing: A survey. Inf. Fusion 2025, 115, 102742.
37. Li, Z.; Li, H.; Meng, L. Model compression for deep neural networks: A survey. Computers 2023, 12, 60.
38. Zawish, M.; Davy, S.; Abraham, L. Complexity-driven model compression for resource-constrained deep learning on edge. IEEE Trans. Artif. Intell. 2024, 5, 3886–3901.
39. Hao, Z.; Guo, J.; Han, K.; Tang, Y.; Hu, H.; Wang, Y.; Xu, C. One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023.
40. Xu, K.; Deng, P.; Huang, H. Vision transformer: An excellent teacher for guiding small networks in remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5618715.
41. Nabi, M.; Maggiolo, L.; Moser, G.; Serpico, S.B. A CNN-transformer knowledge distillation for remote sensing scene classification. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022.
42. Mirzadeh, S.I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
43. Yun, S.; Park, J.; Lee, K.; Shin, J. Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020.
44. Zhang, L.; Bao, C.; Ma, K. Self-distillation: Towards efficient and compact neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4388–4403.
45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
46. Zagoruyko, S.; Komodakis, N. Wide residual networks. In Proceedings of the British Machine Vision Conference, York, UK, 19–22 September 2016.
47. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
48. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
49. Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
50. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep learning based feature selection for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325.
51. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010.
52. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883.
53. Bi, Q.; Qin, K.; Zhang, H.; Xia, G.-S. Local semantic enhanced convnet for aerial scene recognition. IEEE Trans. Image Process. 2021, 30, 6498–6511.
54. Tang, X.; Li, M.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. EMTCAL: Efficient multiscale transformer and cross-level attention learning for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5626915.
55. Ma, J.; Li, M.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Homo–heterogenous transformer learning framework for RS scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2223–2239.
56. Lv, P.; Wu, W.; Zhong, Y.; Du, F.; Zhang, L. SCViT: A spatial-channel feature preserving vision transformer for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4409512.
57. Wang, J.; Li, W.; Zhang, M.; Chanussot, J. Large kernel sparse convnet weighted by multi-frequency attention for remote sensing scene understanding. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5626112.
58. Zhao, Y.; Liu, J.; Yang, J.; Wu, Z. EMSCNet: Efficient multisample contrastive network for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605814.
59. Zheng, F.; Lin, S.; Zhou, W.; Huang, H. A lightweight dual-branch swin transformer for remote sensing scene classification. Remote Sens. 2023, 15, 2865.
60. Dai, W.; Shi, F.; Wang, X.; Xu, H.; Yuan, L.; Wen, X. A multi-scale dense residual correlation network for remote sensing scene classification. Sci. Rep. 2024, 14, 22197.
61. Zhao, Y.; Chen, Y.; Xiong, S.; Lu, X.; Zhu, X.X.; Mou, L. Co-enhanced global-part integration for remote-sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4702114.
62. Wan, Q.; Xiao, Z.; Yu, Y.; Liu, Z.; Wang, K.; Li, D. A hyperparameter-free attention module based on feature map mathematical calculation for remote-sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5600318.
63. Yue, H.; Qing, L.; Zhang, Z.; Wang, Z.; Guo, L.; Peng, Y. MSE-Net: A novel master–slave encoding network for remote sensing scene classification. Eng. Appl. Artif. Intell. 2024, 132, 107909.
64. Chai, B.; Zhao, T.; Yang, R.; Zhang, N.; Tian, T.; Tian, J. Label semantic dynamic guidance network for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5610414.
65. Ma, J.; Jiang, W.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Multiscale sparse cross-attention network for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5605416.
66. Duan, Y.; Song, C.; Zhang, Y.; Cheng, P.; Mei, S. STMSF: Swin transformer with multi-scale fusion for remote sensing scene classification. Remote Sens. 2025, 17, 668.
67. Chen, J.; Yi, J.; Chen, A.; Jin, Z. EFCOMFF-Net: A multiscale feature fusion architecture with enhanced feature correlation for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604917.
68. Shi, J.; Liu, W.; Shan, H.; Li, E.; Li, X.; Zhang, L. Remote sensing scene classification based on multibranch fusion attention network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 3001505.
69. Lu, X.; Yang, M.; Chen, Y.; Xiong, S.; Lu, X. Multi-branch fusion-based feature enhance for remote-sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4702017.
70. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
71. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.

Figure 1. Architecture of the proposed SDLS framework. First, features are extracted from the original images using the self-distillation stream. Second, local images are generated via the LIGM. Third, features are extracted from the local images using the local stream. Finally, the representative logits from both streams are fused for final classification. FC represents the fully connected layer. $C_{1}$, $C_{2}$, $C_{3}$, and $C_{4}$ indicate the number of channels in the CNN internal feature maps. $\lambda_{1}$ and $\lambda_{2}$ represent the trainable parameters used to generate the local images. In this figure, numbers 1–3 denote the classification scores produced by Branch1_MobileNetV2, Branch2_MobileNetV2, and the CNN backbone, respectively. Number 4 represents the classification score from the feature fusion classifier, and number 5 corresponds to the classification score generated by Local_MobileNetV2.

Figure 2. Structural details of the MGA module. Conv denotes the standard convolution operation. Dconv, Pconv, and Gconv refer to depthwise convolution, pointwise convolution, and group convolution, respectively. The symbols k, s, p, and g represent kernel size, stride, padding, and groups in the convolution. $h$ and $w$ represent the height and width of the feature map, respectively. $c$ and $c'$ denote the number of channels.
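The four convolution types named in the Figure 2 caption differ mainly in how many weights they need. The sketch below counts weights with the standard formula for a 2-D convolution, (c_in / g) × c_out × k × k. The channel count of 64, the 3 × 3 kernel, and g = 4 are illustrative assumptions, not the settings of the MGA module.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1, bias: bool = False) -> int:
    """Number of learnable weights in a 2-D convolution with a k x k kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)


c = 64  # hypothetical channel count, chosen only for illustration

standard = conv_params(c, c, k=3)             # Conv: every output sees every input
depthwise = conv_params(c, c, k=3, groups=c)  # Dconv: one 3x3 filter per channel
pointwise = conv_params(c, c, k=1)            # Pconv: 1x1 channel mixing
grouped = conv_params(c, c, k=3, groups=4)    # Gconv with g = 4 (assumed value)

for name, n in [("Conv", standard), ("Dconv", depthwise),
                ("Pconv", pointwise), ("Gconv", grouped)]:
    print(f"{name:6s}{n:>8d}")
```

Under these assumed sizes, a depthwise plus a pointwise convolution together need far fewer weights than one standard convolution, which is the usual motivation for such branches in lightweight networks.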

Figure 3. The confusion matrix on the RSSCN7 dataset with a training ratio of 50%.

Figure 4. The confusion matrix on the UCM dataset with a training ratio of 50%.

Figure 5. The confusion matrix on the AID dataset with a training ratio of 50%.

Figure 6. The confusion matrix on the NWPU dataset with a training ratio of 20%.

Figure 7. Visualization of the attention feature maps of the three networks equipped with SDLS on the UCM dataset with a training ratio of 80%.

Figure 8. Visualization of the attention feature maps of the ShuffleNetV2-1.5 network equipped with SDLS on the UCM dataset with a training ratio of 80%.

Figure 9. Visualization results of the ResNet18 network equipped with SDLS on the AID dataset with a training ratio of 50%.

Figure 10. Variations of the two trainable parameters during training. (a) Variation of the parameter $\lambda_{1}$ during training. (b) Variation of the parameter $\lambda_{2}$ during training.

Figure 11. Visualization results of the ResNet18 network equipped with SDLS at different epochs on the AID dataset with a training ratio of 50%. “$S' = S'$” denotes the original weights. “$S' = (S')^{4}$” represents the weights after applying the fourth power transformation.
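To see what a fourth power transformation does to a weight map, the sketch below applies it element-wise to a small hypothetical map. It assumes the weights are already normalized to [0, 1]; the caption does not state this, and the values used here are invented for illustration.

```python
def power_transform(weights, power=4):
    """Raise every weight to `power`; assumes each weight lies in [0, 1]."""
    out = []
    for row in weights:
        for w in row:
            if not 0.0 <= w <= 1.0:
                raise ValueError("weights are assumed to be normalized to [0, 1]")
        out.append([w ** power for w in row])
    return out


# Hypothetical 2 x 2 weight map (not taken from the paper).
s = [[0.9, 0.5],
     [0.2, 1.0]]
s4 = power_transform(s)

# Contrast between a strong and a medium response, before and after.
print(s[0][0] / s[0][1], s4[0][0] / s4[0][1])
```

Under this assumption the maximum stays at 1 while smaller weights shrink much faster, so the gap between salient and background responses widens.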

Figure 12. Examples of the top five misclassified categories identified from the confusion matrix.

Table 1. Experimental OA (%) results on four datasets.

Backbone	Networks/Classifiers	AID (20%)	AID (50%)	RSSCN7 (20%)	RSSCN7 (50%)	UCM (50%)	UCM (80%)	NWPU (10%)	NWPU (20%)
ResNet18	Baseline	93.71 ± 0.16	96.10 ± 0.16	91.75 ± 0.87	94.70 ± 0.55	97.75 ± 0.49	98.71 ± 0.12	89.75 ± 0.27	92.46 ± 0.13
ResNet18	Branch1_MobileNetV2	94.33 ± 0.23	96.49 ± 0.26	93.66 ± 0.40	96.13 ± 0.41	98.46 ± 0.42	99.05 ± 0.48	90.28 ± 0.18	92.89 ± 0.07
ResNet18	Branch2_MobileNetV2	94.40 ± 0.25	96.61 ± 0.14	93.88 ± 0.50	96.00 ± 0.40	98.44 ± 0.21	99.10 ± 0.28	90.28 ± 0.15	92.85 ± 0.20
ResNet18	ResNet18	94.46 ± 0.11	96.55 ± 0.14	93.16 ± 0.61	95.47 ± 0.43	98.36 ± 0.18	98.62 ± 0.51	90.50 ± 0.08	92.82 ± 0.15
ResNet18	Feature fusion classifier	95.35 ± 0.24	97.22 ± 0.19	94.34 ± 0.40	96.37 ± 0.40	98.78 ± 0.28	99.15 ± 0.12	91.88 ± 0.05	94.05 ± 0.12
ResNet18	Local_MobileNetV2	95.41 ± 0.18	97.21 ± 0.17	94.31 ± 0.44	96.46 ± 0.28	98.69 ± 0.38	99.05 ± 0.34	91.96 ± 0.12	94.07 ± 0.11
ResNet18	SDLS	95.47 ± 0.22	97.24 ± 0.15	94.50 ± 0.48	96.42 ± 0.36	98.78 ± 0.28	99.15 ± 0.19	92.03 ± 0.05	94.11 ± 0.13
WRN50-2	Baseline	94.37 ± 0.29	96.56 ± 0.23	93.12 ± 0.61	95.69 ± 0.16	97.94 ± 0.17	98.72 ± 0.44	91.15 ± 0.18	93.64 ± 0.06
WRN50-2	Branch1_MobileNetV2	94.50 ± 0.35	96.67 ± 0.15	93.75 ± 0.56	95.97 ± 0.37	98.50 ± 0.22	99.05 ± 0.52	90.22 ± 0.20	93.01 ± 0.11
WRN50-2	Branch2_MobileNetV2	94.43 ± 0.24	96.48 ± 0.21	93.90 ± 0.47	95.90 ± 0.45	98.53 ± 0.21	99.29 ± 0.40	90.33 ± 0.29	93.05 ± 0.09
WRN50-2	WRN50-2	94.44 ± 0.25	96.62 ± 0.19	94.10 ± 0.32	95.97 ± 0.44	98.46 ± 0.23	99.05 ± 0.42	91.22 ± 0.11	93.44 ± 0.15
WRN50-2	Feature fusion classifier	95.59 ± 0.27	97.45 ± 0.18	94.78 ± 0.30	96.64 ± 0.24	98.91 ± 0.27	99.38 ± 0.32	92.52 ± 0.16	94.56 ± 0.12
WRN50-2	Local_MobileNetV2	95.66 ± 0.22	97.46 ± 0.10	94.62 ± 0.33	96.46 ± 0.30	98.91 ± 0.28	99.47 ± 0.35	92.53 ± 0.20	94.53 ± 0.18
WRN50-2	SDLS	95.64 ± 0.26	97.47 ± 0.14	94.82 ± 0.33	96.60 ± 0.22	98.90 ± 0.24	99.43 ± 0.35	92.57 ± 0.18	94.58 ± 0.16
Efficientnet_b3	Baseline	94.80 ± 0.17	96.90 ± 0.26	92.59 ± 0.76	94.87 ± 0.29	98.44 ± 0.28	98.86 ± 0.41	91.52 ± 0.15	93.94 ± 0.06
Efficientnet_b3	Branch1_MobileNetV2	94.41 ± 0.36	96.45 ± 0.10	94.00 ± 0.40	95.97 ± 0.15	98.57 ± 0.21	99.19 ± 0.44	90.17 ± 0.29	92.95 ± 0.15
Efficientnet_b3	Branch2_MobileNetV2	94.55 ± 0.27	96.60 ± 0.12	93.86 ± 0.49	95.87 ± 0.33	98.55 ± 0.30	99.05 ± 0.40	90.18 ± 0.20	92.85 ± 0.09
Efficientnet_b3	Efficientnet_b3	95.20 ± 0.29	97.06 ± 0.11	93.35 ± 0.53	95.76 ± 0.29	98.54 ± 0.24	99.05 ± 0.37	92.13 ± 0.25	94.25 ± 0.08
Efficientnet_b3	Feature fusion classifier	95.96 ± 0.27	97.62 ± 0.08	94.81 ± 0.45	96.59 ± 0.15	98.97 ± 0.23	99.38 ± 0.32	92.85 ± 0.24	94.94 ± 0.05
Efficientnet_b3	Local_MobileNetV2	95.75 ± 0.33	97.54 ± 0.19	94.68 ± 0.42	96.45 ± 0.20	98.84 ± 0.22	99.38 ± 0.28	92.60 ± 0.27	94.83 ± 0.08
Efficientnet_b3	SDLS	95.92 ± 0.29	97.59 ± 0.14	94.84 ± 0.48	96.56 ± 0.14	98.95 ± 0.22	99.43 ± 0.32	92.82 ± 0.24	94.91 ± 0.06
ShuffleNetV2-1.5	Baseline	93.44 ± 0.11	95.90 ± 0.16	92.66 ± 0.38	94.92 ± 0.50	97.67 ± 0.26	98.38 ± 0.23	89.13 ± 0.16	92.11 ± 0.12
ShuffleNetV2-1.5	Branch1_MobileNetV2	94.43 ± 0.32	96.47 ± 0.16	94.04 ± 0.37	96.09 ± 0.41	98.57 ± 0.34	98.95 ± 0.24	90.14 ± 0.28	92.74 ± 0.03
ShuffleNetV2-1.5	Branch2_MobileNetV2	94.45 ± 0.24	96.54 ± 0.17	93.91 ± 0.29	96.03 ± 0.35	98.69 ± 0.22	99.15 ± 0.12	90.17 ± 0.19	92.86 ± 0.13
ShuffleNetV2-1.5	ShuffleNetV2-1.5	93.62 ± 0.22	95.90 ± 0.23	92.53 ± 0.85	95.23 ± 0.46	97.64 ± 0.34	98.76 ± 0.35	89.21 ± 0.14	92.03 ± 0.09
ShuffleNetV2-1.5	Feature fusion classifier	95.33 ± 0.21	97.11 ± 0.14	94.57 ± 0.34	96.47 ± 0.38	98.88 ± 0.28	99.29 ± 0.15	91.59 ± 0.17	94.03 ± 0.04
ShuffleNetV2-1.5	Local_MobileNetV2	95.24 ± 0.24	97.12 ± 0.17	94.35 ± 0.40	96.43 ± 0.20	98.74 ± 0.37	99.38 ± 0.28	91.67 ± 0.24	94.07 ± 0.04
ShuffleNetV2-1.5	SDLS	95.35 ± 0.22	97.16 ± 0.11	94.50 ± 0.36	96.41 ± 0.33	98.88 ± 0.26	99.38 ± 0.19	91.75 ± 0.18	94.13 ± 0.05

Baseline refers to the OA obtained when training the network independently without SDLS. The OA values lower than the baseline are highlighted in blue, and the highest OA values are highlighted in bold. 10%, 20%, 50%, and 80% represent the training set split ratios in the datasets.

Table 2. Experimental OA (%) results on two datasets.

Networks/Classifiers	AID (20%)	AID (50%)	NWPU (10%)	NWPU (20%)
Baseline (300 × 300)	95.68 ± 0.18	97.38 ± 0.15	92.45 ± 0.13	94.57 ± 0.09
Branch1_MobileNetV2	94.93 ± 0.21	97.06 ± 0.08	90.66 ± 0.20	93.36 ± 0.14
Branch2_MobileNetV2	94.98 ± 0.25	97.08 ± 0.25	90.62 ± 0.22	93.27 ± 0.13
EfficientNet-b3	96.07 ± 0.20	97.66 ± 0.12	92.73 ± 0.14	94.78 ± 0.14
Feature fusion classifier	96.55 ± 0.13	97.91 ± 0.08	93.29 ± 0.14	95.33 ± 0.08
Local_EfficientNet-b3	96.57 ± 0.12	97.99 ± 0.07	93.50 ± 0.10	95.44 ± 0.08
SDLS	96.63 ± 0.12	98.00 ± 0.09	93.59 ± 0.10	95.46 ± 0.10
Baseline (330 × 330)	96.07 ± 0.13	97.67 ± 0.13	92.70 ± 0.13	94.77 ± 0.08
Branch1_MobileNetV2	95.37 ± 0.21	97.26 ± 0.19	91.01 ± 0.17	93.39 ± 0.08
Branch2_MobileNetV2	95.33 ± 0.23	97.19 ± 0.13	91.05 ± 0.18	93.40 ± 0.10
EfficientNet-b3	96.33 ± 0.14	97.78 ± 0.11	93.08 ± 0.11	94.94 ± 0.18
Feature fusion classifier	96.79 ± 0.18	98.05 ± 0.09	93.64 ± 0.13	95.42 ± 0.07
Local_EfficientNet-b3	96.82 ± 0.15	98.09 ± 0.07	93.87 ± 0.13	95.46 ± 0.10
SDLS	96.93 ± 0.15	98.11 ± 0.06	93.89 ± 0.08	95.52 ± 0.10

Baseline (n × n) refers to the OA obtained when training the network independently without SDLS, with the input image size set to n × n. The OA values lower than the baseline are highlighted in blue, and the highest OA values are highlighted in bold. 10%, 20%, and 50% represent the training set split ratios in the datasets.

Table 3. Comparison with other methods.

Method (Year)	AID (20%)	AID (50%)	NWPU (10%)	NWPU (20%)
! Attention GANs (2020) [13]	78.95 ± 0.23	84.52 ± 0.18	72.21 ± 0.21	77.99 ± 0.19
† MIDC-Net (2020) [16]	88.51 ± 0.41	92.95 ± 0.17	86.12 ± 0.29	87.99 ± 0.18
† TFADNN (2020) [17]	93.21 ± 0.32	95.64 ± 0.16	87.78 ± 0.11	90.86 ± 0.24
† MSA-Network (2021) [18]	93.53 ± 0.21	96.01 ± 0.43	90.38 ± 0.17	93.52 ± 0.21
† LSE-Net (2021) [53]	94.41 ± 0.16	96.36 ± 0.19	92.23 ± 0.14	93.34 ± 0.15
† SKAL_ResNet18 (2022) [23]	94.38 ± 0.10	96.76 ± 0.20	90.04 ± 0.15	92.79 ± 0.11
† LML_ResNet50 (2022) [24]	96.21 ± 0.13	97.86 ± 0.09	92.67 ± 0.15	94.73 ± 0.11
‡ ET-GSNet (2022) [40]	95.58 ± 0.18	96.88 ± 0.19	92.72 ± 0.28	94.50 ± 0.18
‡ EMTCAL (2022) [54]	94.69 ± 0.14	96.41 ± 0.23	91.63 ± 0.19	93.65 ± 0.12
‡ HHTL (2022) [55]	95.62 ± 0.13	96.88 ± 0.21	92.07 ± 0.44	94.21 ± 0.09
‡ SCViT (2022) [56]	95.56 ± 0.17	96.98 ± 0.16	92.72 ± 0.04	94.66 ± 0.10
† LSCNet (2023) [57]	95.38 ± 0.15	97.14 ± 0.14	92.80 ± 0.14	94.54 ± 0.19
‡ EMSCNet_ViT-B (2023) [58]	96.02 ± 0.18	97.35 ± 0.17	93.58 ± 0.22	95.37 ± 0.07
‡ LDBST (2023) [59]	95.10 ± 0.09	96.84 ± 0.20	93.86 ± 0.18	94.36 ± 0.12
† MDRCN (2024) [60]	93.64 ± 0.19	95.66 ± 0.18	91.59 ± 0.29	93.82 ± 0.17
† CGINet (2024) [61]	95.35 ± 0.14	97.10 ± 0.24	92.28 ± 0.17	94.38 ± 0.13
† HFAM (2024) [62]	94.71 ± 0.06	96.73 ± 0.04	90.81 ± 0.11	–
‡ MSE-Net (2024) [63]	96.30 ± 0.10	97.44 ± 0.05	92.80 ± 0.17	94.70 ± 0.16
† LSDGNet_ENet-b3 (2025) [64]	95.81 ± 0.14	97.49 ± 0.19	93.60 ± 0.17	95.05 ± 0.11
† MSCN (2025) [65]	95.86 ± 0.16	97.46 ± 0.12	92.64 ± 0.09	94.59 ± 0.11
‡ STMSF (2025) [66]	96.15 ± 0.16	97.51 ± 0.37	92.88 ± 0.16	94.95 ± 0.11
† ENet-b3_MV2_224 (SDLS)	95.92 ± 0.29	97.59 ± 0.14	92.82 ± 0.24	94.91 ± 0.06
† ENet-b3_ENet-b3_300 (SDLS)	96.63 ± 0.12	98.00 ± 0.09	93.59 ± 0.10	95.46 ± 0.10
† ENet-b3_ENet-b3_330 (SDLS)	96.93 ± 0.15	98.11 ± 0.06	93.89 ± 0.08	95.52 ± 0.10

model1_model2_n indicates that the CNN in the self-distillation stream is model1 and the network in the local stream is model2, with an input image size of n × n. MV2 and ENet-b3 refer to MobileNetV2 and EfficientNet-b3, respectively. The second-highest OA values are highlighted in green, and the highest OA values are highlighted in bold. 10%, 20%, and 50% represent the training set split ratios in the datasets. †, !, and ‡ denote methods based on CNN, GAN, and ViT, respectively.

Table 4. Comparison of model complexity and OA.

Method (Year)	Params	FLOPs	AID_20%	NWPU_10%
EFCOMFFNetv1-DenseNet (2023) [67]	19.94 M	6.02 G	95.86 ± 0.13	92.40 ± 0.15
EFCOMFFNetv2-DenseNet (2023) [67]	27.75 M	5.93 G	95.69 ± 0.15	92.36 ± 0.12
MBFANet (2023) [68]	24.48 M	4.51 G	93.98 ± 0.15	91.61 ± 0.14
CGINet (2024) [61]	26.10 M	4.14 G	95.35 ± 0.14	92.28 ± 0.17
MBFNet (2025) [69]	22.09 M	4.25 G	95.81 ± 0.13	–
ResNet18_MV2_224 (SDLS)	18.65 M	2.85 G	95.47 ± 0.22	92.03 ± 0.05
ENet-b3_MV2_224 (SDLS)	19.55 M	2.08 G	95.92 ± 0.29	92.82 ± 0.24

model1_model2_n indicates that the CNN in the self-distillation stream is model1 and the network in the local stream is model2, with an input image size of n × n. MV2 and ENet-b3 refer to MobileNetV2 and EfficientNet-b3, respectively. The highest OA values are highlighted in bold.

Table 5. Model evaluation results on the AID dataset with a 20% training ratio.

Networks/Classifiers	OA (%)	Params (M)	FLOPs (G)	Inference time on RTX 3090 (ms/image)	Inference time on Tesla P100 (ms/image)	Inference time on Tesla T4 (ms/image)
Baseline (ResNet18)	93.71 ± 0.16	11.20 (1.00×)	1.83 (1.00×)	3.12	2.37	2.31
Branch1_MobileNetV2	94.33 ± 0.23	2.43 (0.22×)	0.93 (0.51×)	7.84	5.47	5.68
Branch2_MobileNetV2	94.40 ± 0.25	2.96 (0.26×)	1.33 (0.73×)	8.68	5.97	6.25
ResNet18	94.46 ± 0.11	11.20 (1.00×)	1.83 (1.00×)	3.12	2.37	2.31
Feature fusion classifier	95.35 ± 0.24	16.35 (1.46×)	2.53 (1.38×)	17.69	12.59	12.98
Local_MobileNetV2	95.41 ± 0.18	18.61 (1.66×)	2.85 (1.56×)	24.37	17.35	17.88
SDLS	95.47 ± 0.22	18.65 (1.67×)	2.85 (1.56×)	24.39	17.89	17.84
Baseline (WRN50-2)	94.37 ± 0.29	66.89 (1.00×)	11.44 (1.00×)	9.01	8.07	9.74
Branch1_MobileNetV2	94.50 ± 0.35	2.93 (0.04×)	2.50 (0.22×)	8.95	5.95	6.18
Branch2_MobileNetV2	94.43 ± 0.24	6.45 (0.10×)	5.37 (0.47×)	11.12	7.24	7.59
WRN50-2	94.44 ± 0.25	66.89 (1.00×)	11.44 (1.00×)	9.01	8.07	9.74
Feature fusion classifier	95.59 ± 0.27	74.02 (1.11×)	12.33 (1.08×)	23.54	15.44	15.89
Local_MobileNetV2	95.66 ± 0.22	76.28 (1.14×)	12.65 (1.11×)	30.40	20.16	21.22
SDLS	95.64 ± 0.26	76.32 (1.14×)	12.65 (1.11×)	31.00	20.43	20.88

Baseline (network) refers to the OA, Params, and FLOPs obtained when training the network independently without SDLS. Inference time is measured as the average per-image latency.
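The multipliers in parentheses in Table 5 appear to be each value divided by the corresponding baseline value. The short check below reproduces a few of them from the numbers in the table; the helper name `relative_cost` is hypothetical.

```python
def relative_cost(value: float, baseline: float) -> str:
    """Format value / baseline in the 'N.NN×' style used in Table 5."""
    return f"{value / baseline:.2f}×"


# (value, baseline) pairs copied from the ResNet18 block of Table 5.
print(relative_cost(2.43, 11.20))   # Branch1_MobileNetV2 params vs. baseline params
print(relative_cost(18.65, 11.20))  # SDLS params vs. baseline params
print(relative_cost(2.85, 1.83))    # SDLS FLOPs vs. baseline FLOPs
```

The same helper can be applied to the WRN50-2 block by switching the baseline to 66.89 M parameters and 11.44 G FLOPs.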

Table 6. Experimental OA (%) results on the UCM dataset with an 80% training ratio.

Backbone	Networks/Classifiers	Without MGA (×)	With MGA (✓)
ResNet18	Branch1_MobileNetV2	99.05 ± 0.26	99.05 ± 0.48
ResNet18	Branch2_MobileNetV2	99.05 ± 0.26	99.10 ± 0.28
ResNet18	ResNet18	98.67 ± 0.36	98.62 ± 0.51
ResNet18	Feature fusion classifier	99.05 ± 0.21	99.15 ± 0.12
ResNet18	Local_MobileNetV2	99.10 ± 0.18	99.05 ± 0.34
ResNet18	SDLS	99.15 ± 0.12	99.15 ± 0.19
WRN50-2	Branch1_MobileNetV2	99.05 ± 0.21	99.05 ± 0.52
WRN50-2	Branch2_MobileNetV2	99.00 ± 0.32	99.29 ± 0.40
WRN50-2	WRN50-2	98.71 ± 0.49	99.05 ± 0.42
WRN50-2	Feature fusion classifier	99.29 ± 0.26	99.38 ± 0.32
WRN50-2	Local_MobileNetV2	99.24 ± 0.18	99.47 ± 0.35
WRN50-2	SDLS	99.29 ± 0.21	99.43 ± 0.35
ShuffleNetV2-1.5	Branch1_MobileNetV2	99.05 ± 0.26	98.95 ± 0.24
ShuffleNetV2-1.5	Branch2_MobileNetV2	99.10 ± 0.53	99.15 ± 0.12
ShuffleNetV2-1.5	ShuffleNetV2-1.5	98.76 ± 0.46	98.76 ± 0.35
ShuffleNetV2-1.5	Feature fusion classifier	99.29 ± 0.40	99.29 ± 0.15
ShuffleNetV2-1.5	Local_MobileNetV2	99.24 ± 0.46	99.38 ± 0.28
ShuffleNetV2-1.5	SDLS	99.28 ± 0.50	99.38 ± 0.19

Symbol “✓” denotes that the MGA module is introduced into SDLS, whereas “×” denotes its absence. The increased OA values after introducing the MGA module are highlighted in bold.

Table 7. Branch ablation and attention module comparison of MGA.

Networks/Classifiers	Branch ablation: w/o Conv	Branch ablation: w/o Depthwise	Branch ablation: w/o Group	Attention module: MGA	Attention module: SE	Attention module: CBAM
Branch1_MobileNetV2	90.22 ± 0.33	90.20 ± 0.23	90.34 ± 0.29	90.28 ± 0.18	90.23 ± 0.16	89.98 ± 0.20
Branch2_MobileNetV2	90.30 ± 0.15	90.16 ± 0.23	90.16 ± 0.17	90.28 ± 0.15	90.26 ± 0.25	89.91 ± 0.15
ResNet18	90.34 ± 0.18	90.47 ± 0.18	90.36 ± 0.20	90.50 ± 0.08	90.33 ± 0.19	90.50 ± 0.18
Feature fusion classifier	91.79 ± 0.20	91.86 ± 0.19	91.74 ± 0.22	91.88 ± 0.05	91.73 ± 0.20	91.76 ± 0.12
Local_MobileNetV2	91.76 ± 0.23	91.86 ± 0.25	91.73 ± 0.26	91.96 ± 0.12	91.78 ± 0.18	91.79 ± 0.16
SDLS	91.86 ± 0.22	91.95 ± 0.23	91.84 ± 0.25	92.03 ± 0.05	91.82 ± 0.19	91.89 ± 0.15

The highest OA for each network or classifier is highlighted in bold.

Table 8. Results of distillation on the AID dataset with a 50% training ratio.

Networks/Classifiers	Without distillation (×)	With distillation (✓)
Branch1_MobileNetV2	96.24 ± 0.26	96.49 ± 0.26
Branch2_MobileNetV2	96.31 ± 0.31	96.61 ± 0.14
ResNet18	95.94 ± 0.18	96.55 ± 0.14
Feature fusion classifier	96.86 ± 0.25	97.22 ± 0.19
Local_MobileNetV2	96.89 ± 0.24	97.21 ± 0.17
SDLS	96.95 ± 0.21	97.24 ± 0.15
Branch1_MobileNetV2	96.37 ± 0.06	96.67 ± 0.15
Branch2_MobileNetV2	96.39 ± 0.12	96.48 ± 0.21
WRN50-2	96.44 ± 0.13	96.62 ± 0.19
Feature fusion classifier	97.21 ± 0.13	97.45 ± 0.18
Local_MobileNetV2	97.17 ± 0.11	97.46 ± 0.10
SDLS	97.22 ± 0.12	97.47 ± 0.14

Symbol “✓” denotes that the distillation is introduced into SDLS, whereas “×” denotes its absence. The increased OA values after introducing the distillation are highlighted in bold.

Table 9. Effect of distillation temperature T.

Networks/Classifiers	$T = 1$	$T = 3$	$T = 5$	$T = 7$	$T = 9$
Branch1_MobileNetV2	94.26 ± 0.07	94.33 ± 0.23	94.22 ± 0.49	82.72 ± 3.04	14.45 ± 2.52
Branch2_MobileNetV2	94.32 ± 0.14	94.40 ± 0.25	94.28 ± 0.42	82.98 ± 3.36	14.62 ± 2.34
ResNet18	93.84 ± 0.21	94.46 ± 0.11	94.42 ± 0.25	83.46 ± 3.67	18.62 ± 2.54
Feature fusion classifier	95.29 ± 0.11	95.35 ± 0.24	95.16 ± 0.41	85.29 ± 3.19	15.14 ± 2.08
Local_MobileNetV2	95.27 ± 0.18	95.41 ± 0.18	95.28 ± 0.42	85.37 ± 3.13	15.39 ± 1.88
SDLS	95.39 ± 0.16	95.47 ± 0.22	95.30 ± 0.43	85.35 ± 3.18	15.18 ± 2.05

The highest OA for each network or classifier is highlighted in bold.
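As a reading aid for Table 9, the following minimal sketch shows how a distillation temperature $T$ softens a probability distribution. It assumes the conventional temperature-scaled softmax used in knowledge distillation; the function names and the example logits are hypothetical, and the sketch does not reproduce the SDLS training loss or explain the accuracy drop observed at $T = 7$ and $T = 9$.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    if T <= 0:
        raise ValueError("temperature T must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (nats); larger values mean a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for one image over three classes (not from the paper).
example_logits = [4.0, 1.0, 0.5]

for T in (1, 3, 5, 7, 9):  # the temperatures compared in Table 9
    probs = softmax_with_temperature(example_logits, T)
    print(T, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Raising $T$ lowers the peak probability and increases the entropy of the soft targets, so very large temperatures carry progressively less class-specific signal.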

Table 10. Results of two-stream logits fusion.

Networks/Classifiers	Fusion Techniques
Networks/Classifiers	Sum	Mean	Max
Branch1_MobileNetV2	96.49 ± 0.26	96.88 ± 0.13	96.70 ± 0.18
Branch2_MobileNetV2	96.61 ± 0.14	96.90 ± 0.11	96.66 ± 0.13
ResNet18	96.55 ± 0.14	96.80 ± 0.10	96.53 ± 0.19
Feature fusion classifier	97.22 ± 0.19	97.18 ± 0.12	96.97 ± 0.22
Local_MobileNetV2	97.21 ± 0.17	97.22 ± 0.10	97.04 ± 0.19
SDLS	97.24 ± 0.15	97.20 ± 0.12	96.98 ± 0.22
Branch1_MobileNetV2	96.67 ± 0.15	96.98 ± 0.06	96.86 ± 0.20
Branch2_MobileNetV2	96.48 ± 0.21	96.93 ± 0.05	96.90 ± 0.10
WRN50-2	96.62 ± 0.19	97.12 ± 0.23	97.01 ± 0.16
Feature fusion classifier	97.45 ± 0.18	97.40 ± 0.15	97.25 ± 0.12
Local_MobileNetV2	97.46 ± 0.10	97.42 ± 0.16	97.27 ± 0.11
SDLS	97.47 ± 0.14	97.42 ± 0.15	97.25 ± 0.12

The highest OA for each network or classifier is highlighted in bold.
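The three fusion rules named in Table 10 can be sketched as element-wise operations on the logits of the two streams. This is an illustration under the assumption that Sum, Mean, and Max act per class on two logit vectors of equal length; `fuse_logits`, `predict`, and the example values are hypothetical and are not taken from the authors' code.

```python
def fuse_logits(stream_a, stream_b, mode="sum"):
    """Fuse two per-class logit vectors element-wise."""
    if len(stream_a) != len(stream_b):
        raise ValueError("both streams must score the same classes")
    pairs = list(zip(stream_a, stream_b))
    if mode == "sum":
        return [a + b for a, b in pairs]
    if mode == "mean":
        return [(a + b) / 2 for a, b in pairs]
    if mode == "max":
        return [max(a, b) for a, b in pairs]
    raise ValueError(f"unknown fusion mode: {mode}")


def predict(logits):
    """Index of the highest-scoring class."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical logits for one image (three classes) from the two streams.
distill_stream = [2.0, 0.5, 0.1]
local_stream = [0.2, 1.9, 0.3]

for mode in ("sum", "mean", "max"):
    fused = fuse_logits(distill_stream, local_stream, mode)
    print(mode, fused, "-> class", predict(fused))
```

In this example Sum and Mean select class 1, whereas Max selects class 0, because Max keeps only the single most confident stream per class. Sum and Mean always yield the same arg-max at inference, so any gap between them in Table 10 must arise elsewhere in the pipeline (for example, during training), which this sketch does not model.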

Table 11. Results of different feature fusion strategies.

Networks/Classifiers	Fusion Techniques
Networks/Classifiers	Sum	Cat	Weighted Average
Branch1_MobileNetV2	96.49 ± 0.26	96.64 ± 0.09	96.68 ± 0.20
Branch2_MobileNetV2	96.61 ± 0.14	96.56 ± 0.11	96.50 ± 0.12
ResNet18	96.55 ± 0.14	96.39 ± 0.17	96.50 ± 0.23
Feature fusion classifier	97.22 ± 0.19	97.20 ± 0.14	97.25 ± 0.14
Local_MobileNetV2	97.21 ± 0.17	97.20 ± 0.17	97.13 ± 0.08
SDLS	97.24 ± 0.15	97.20 ± 0.11	97.28 ± 0.12
Branch1_MobileNetV2	96.67 ± 0.15	96.52 ± 0.25	96.60 ± 0.13
Branch2_MobileNetV2	96.48 ± 0.21	96.50 ± 0.13	96.56 ± 0.14
WRN50-2	96.62 ± 0.19	96.40 ± 0.23	96.54 ± 0.27
Feature fusion classifier	97.45 ± 0.18	97.36 ± 0.27	97.38 ± 0.14
Local_MobileNetV2	97.46 ± 0.10	97.36 ± 0.23	97.25 ± 0.09
SDLS	97.47 ± 0.14	97.36 ± 0.23	97.36 ± 0.11

The highest OA for each network or classifier is highlighted in bold.

Table 12. Ablation results of varying the initial values of $λ_{1}$ and $λ_{2}$ for local image generation on the AID dataset with a 50% training ratio.

Networks/Classifiers	Initial Values of $(λ_{1}, λ_{2})$
Networks/Classifiers	(0.2, 0.8)	(0.4, 0.6)	(0.6, 0.4)	(0.8, 0.2)
Branch1_MobileNetV2	96.49 ± 0.26	96.62 ± 0.16	96.54 ± 0.15	96.46 ± 0.13
Branch2_MobileNetV2	96.61 ± 0.14	96.42 ± 0.11	96.56 ± 0.21	96.54 ± 0.10
ResNet18	96.55 ± 0.14	96.39 ± 0.18	96.33 ± 0.10	96.38 ± 0.17
Feature fusion classifier	97.22 ± 0.19	97.12 ± 0.25	97.23 ± 0.16	97.15 ± 0.12
Local_MobileNetV2	97.21 ± 0.17	97.12 ± 0.19	97.22 ± 0.11	97.10 ± 0.11
SDLS	97.24 ± 0.15	97.15 ± 0.24	97.28 ± 0.10	97.17 ± 0.11

The highest OA for each network or classifier is highlighted in bold.

Table 13. Analysis of the exponent x in the power-based transformation on the AID dataset with a 20% training ratio.

Networks/Classifiers	Exponent $x$ in $S'^{x}$
Networks/Classifiers	$x = 1$	$x = 2$	$x = 4$	$x = 6$
Branch1_MobileNetV2	94.47 ± 0.30	94.41 ± 0.18	94.33 ± 0.23	94.48 ± 0.13
Branch2_MobileNetV2	94.41 ± 0.24	94.42 ± 0.13	94.40 ± 0.25	94.31 ± 0.20
ResNet18	94.36 ± 0.15	94.32 ± 0.21	94.46 ± 0.11	94.47 ± 0.19
Feature fusion classifier	95.35 ± 0.19	95.30 ± 0.13	95.35 ± 0.24	95.35 ± 0.17
Local_MobileNetV2	95.35 ± 0.22	95.39 ± 0.16	95.41 ± 0.18	95.35 ± 0.20
SDLS	95.40 ± 0.23	95.40 ± 0.12	95.47 ± 0.22	95.40 ± 0.17

$x = 1$ denotes the case without power-based enhancement. The highest OA for each network or classifier is highlighted in bold.
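To make the role of the exponent in Table 13 concrete, the sketch below applies an element-wise power to a small set of scores. It assumes that $S'$ is a map normalized to $[0, 1]$, which this section does not state explicitly; `power_enhance` and the sample values are hypothetical.

```python
def power_enhance(scores, x=1):
    """Raise each normalized score to the power x (x = 1 leaves scores unchanged)."""
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores are assumed to lie in [0, 1]")
    return [s ** x for s in scores]


# Hypothetical normalized scores, from weak to strong response.
s_prime = [0.1, 0.5, 0.9, 1.0]

for x in (1, 2, 4, 6):  # the exponents compared in Table 13
    print(x, [round(v, 4) for v in power_enhance(s_prime, x)])
```

Under this assumption, a larger exponent pushes weak responses toward zero while leaving the maximum at 1, which widens the gap between salient and background values.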

Table 14. Statistical significance analysis on the AID and NWPU datasets.

Dataset	Metric	Branch1_MobileNetV2	Branch2_MobileNetV2	EfficientNet_b3	Feature Fusion Classifier	Local_MobileNetV2	SDLS
AID 20%	Mean ± Std	95.35 ± 0.26	95.34 ± 0.24	96.36 ± 0.18	96.79 ± 0.20	96.84 ± 0.19	96.93 ± 0.16
	95% CI	[−0.921, −0.527]	[−0.925, −0.549]	[+0.162, +0.416]	[+0.576, +0.858]	[+0.606, +0.938]	[+0.708, +1.004]
	p-value	$1.65 \times 10^{-5}$	$9.79 \times 10^{-6}$	$6.17 \times 10^{-4}$	$1.08 \times 10^{-6}$	$2.31 \times 10^{-6}$	$3.61 \times 10^{-7}$
AID 50%	Mean ± Std	97.23 ± 0.18	97.22 ± 0.12	97.82 ± 0.12	98.09 ± 0.10	98.11 ± 0.07	98.15 ± 0.08
	95% CI	[−0.604, −0.316]	[−0.629, −0.319]	[−0.026, +0.274]	[+0.250, +0.554]	[+0.261, +0.567]	[+0.335, +0.589]
	p-value	$4.82 \times 10^{-5}$	$6.87 \times 10^{-5}$	$9.49 \times 10^{-2}$	$2.09 \times 10^{-4}$	$1.74 \times 10^{-4}$	$1.76 \times 10^{-5}$
NWPU 10%	Mean ± Std	91.04 ± 0.20	91.02 ± 0.21	93.03 ± 0.12	93.61 ± 0.15	93.81 ± 0.15	93.86 ± 0.13
	95% CI	[−1.783, −1.461]	[−1.746, −1.536]	[+0.261, +0.469]	[+0.825, +1.067]	[+1.026, +1.264]	[+1.095, +1.303]
	p-value	$2.89 \times 10^{-9}$	$5.73 \times 10^{-11}$	$2.36 \times 10^{-5}$	$2.59 \times 10^{-8}$	$4.30 \times 10^{-9}$	$8.38 \times 10^{-10}$
NWPU 20%	Mean ± Std	93.45 ± 0.10	93.44 ± 0.12	94.95 ± 0.17	95.43 ± 0.08	95.49 ± 0.10	95.53 ± 0.10
	95% CI	[−1.420, −1.180]	[−1.416, −1.198]	[+0.050, +0.356]	[+0.586, +0.768]	[+0.634, +0.854]	[+0.680, +0.880]
	p-value	$1.45 \times 10^{-9}$	$6.18 \times 10^{-10}$	$1.48 \times 10^{-2}$	$4.21 \times 10^{-8}$	$9.16 \times 10^{-8}$	$2.81 \times 10^{-8}$

The highest OA for each network or classifier is highlighted in bold.
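The 95% confidence intervals in Table 14 can be read as intervals on a mean OA difference relative to a reference method. The following sketch shows one common way to obtain such an interval, a paired Student-t interval over repeated runs. The test actually used, the reference method, and the number of runs are not stated in this section, so the procedure, the function name, and all run values below are assumptions made purely for illustration.

```python
import math
import statistics


def paired_diff_ci(method_runs, baseline_runs, t_crit):
    """Mean paired difference and its confidence interval.

    t_crit is the two-sided Student-t critical value for
    len(method_runs) - 1 degrees of freedom, supplied by the caller.
    """
    if len(method_runs) != len(baseline_runs) or len(method_runs) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [m - b for m, b in zip(method_runs, baseline_runs)]
    mean_d = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_d, (mean_d - t_crit * sem, mean_d + t_crit * sem)


# Hypothetical OA values (%) from five paired runs; NOT results from the paper.
method = [96.9, 97.1, 96.8, 97.0, 96.7]
baseline = [96.2, 96.3, 96.0, 96.4, 96.1]

T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value for 4 degrees of freedom

mean_d, (low, high) = paired_diff_ci(method, baseline, T_CRIT_95_DF4)
print(f"mean difference = {mean_d:.3f}, 95% CI = [{low:+.3f}, {high:+.3f}]")
```

An interval that lies entirely above zero indicates a gain over the reference at the 5% level, whereas an interval that straddles zero does not. The EfficientNet_b3 entry for AID 50% in Table 14, with an interval of [−0.026, +0.274] and a p-value of $9.49 \times 10^{-2}$, is an example of the latter.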

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ma, X.; Luo, J.; Ni, S.; Zhang, X.; Ding, R. SDLS: A Two-Stream Architecture with Self-Distillation and Local Streams for Remote Sensing Image Scene Classification. Remote Sens. 2026, 18, 498. https://doi.org/10.3390/rs18030498

