Article

MSDR-Net: Multiscale Dynamic Reasoning for Multi-Label Remote Sensing Image Classification

Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(11), 1798; https://doi.org/10.3390/rs18111798
Submission received: 23 April 2026 / Revised: 20 May 2026 / Accepted: 26 May 2026 / Published: 1 June 2026
(This article belongs to the Special Issue Advanced AI Technology for Remote Sensing Analysis (Second Edition))

Highlights

What are the main findings?
  • The proposed unified framework integrating multiscale feature representation and semantic reasoning achieves 95.88% mAP on the DIOR dataset.
  • The difficulty-weighted loss improves recognition performance for long-tail and small-scale categories by incorporating category frequency and sample difficulty.
What are the implications of the main findings?
  • Joint modeling of multiscale features and semantic dependencies is essential for multi-label classification in complex remote sensing scenarios.
  • Task-specific loss reweighting effectively mitigates class imbalance and improves performance on underrepresented categories.

Abstract

With the rapid advancement of Earth observation technologies and the growing demand for intelligent remote sensing applications, high-resolution remote sensing imagery provides critical data support for a range of downstream applications, including land monitoring and disaster assessment. In this context, multi-label remote sensing image classification has become an important research task, because a single image may contain multiple ground-object categories with complex spatial distributions and semantic co-occurrence relationships. However, challenges such as the coexistence of multiscale objects, complex semantic dependencies, and long-tail category distributions impose significant limitations on existing methods in terms of feature representation capacity and class-balanced modeling. To address these challenges, a Multiscale Dynamic Reasoning Network (MSDR-Net) is proposed. Different from methods that focus on localized optimization for a single challenge, MSDR-Net establishes a task-driven modeling framework that jointly integrates multiscale feature extraction, label-aware semantic reasoning, and long-tail category optimization within an end-to-end architecture. The proposed network consists of three core modules. The Multiscale Feature Enhancement (MSFE) module incorporates a Feature Pyramid Network-based fusion mechanism, integrating deep semantic information with shallow, detailed features to effectively enhance the representation of multiscale objects. The Dynamic Semantic Reasoning (DSR) module introduces a Transformer-based global attention mechanism that models long-range dependencies among image features, enabling the capture of complex global semantic relationships. 
In the loss optimization stage, a Difficulty-Weighted Loss (DW-Loss) is introduced, which jointly incorporates category frequency weights and prior difficulty coefficients to dynamically regulate the contributions of rare classes and hard samples during training, thereby mitigating bias induced by class imbalance. Experiments conducted on the large-scale Detection in Optical Remote Sensing Images dataset demonstrate that the proposed method achieves superior performance. Ablation studies validate the effectiveness of each component, while comparative experiments indicate that MSDR-Net achieves a mean Average Precision of 95.88%, outperforming existing state-of-the-art methods. An improvement of approximately 1.74% is observed over the strongest baseline, MSCA, with consistent advantages demonstrated across Overall F1 and Class-wise F1 metrics. By unifying multiscale feature extraction, global semantic reasoning, and balanced loss optimization within a single framework, MSDR-Net provides a robust and efficient solution for multi-label classification in complex remote sensing scenarios.

1. Introduction

With the rapid advancement of remote sensing technologies and satellite sensors, high-resolution remote sensing imagery has become a fundamental data source for global land cover monitoring, urban management, disaster assessment, and ecological environment analysis [1,2]. The authoritative international report Earth Intelligence for All highlights that future intelligent analysis of remote sensing imagery requires substantial improvements in understanding complex semantic scenes and in information-extraction capabilities to support sustainable development and refined governance. High-resolution remote sensing imagery often captures complex scenes featuring coexisting multiscale and multi-category objects, rendering traditional single-label classification methods inadequate for practical applications. Consequently, multi-label remote sensing image classification has emerged as an important research direction in the intelligent interpretation of remote sensing imagery [3]. By enabling the identification of multiple semantic categories within a single image, the accuracy of scene understanding can be improved, while providing critical support for downstream tasks such as change detection, object recognition, and land cover analysis.
However, multi-label remote sensing image classification still faces two fundamental challenges: (i) the representation of multiscale objects struggles to capture local details and global semantics simultaneously [4]; and (ii) long-tail category distributions hinder the effective learning of low-frequency classes [5,6]. Although existing studies have achieved notable progress in feature fusion, semantic modeling, and long-tail optimization through deep networks, attention mechanisms, or data augmentation, most methods remain focused on addressing challenges along a single dimension. As a result, a unified solution capable of jointly optimizing multiscale feature extraction, long-tail sample balancing, and complex semantic dependency modeling within an end-to-end framework is still lacking [4,5,6,7]. This limitation constitutes the primary bottleneck that the present study aims to address.
The Multiscale Dynamic Reasoning Network (MSDR-Net), an end-to-end multi-label classification network, is proposed to address the challenges posed by multiscale objects, complex semantic dependencies, and long-tail category distributions in remote sensing images. A task-driven unified modeling framework is established for multi-label remote sensing image classification, integrating multiscale feature enhancement, label-aware dynamic semantic reasoning, and difficulty-weighted loss optimization. Consequently, scale variations, semantic dependencies, and long-tail category distributions can be jointly addressed within a unified end-to-end architecture. The proposed network consists of three core modules. First, the Multiscale Feature Enhancement (MSFE) module is constructed using ResNet-34 and a Feature Pyramid Network (FPN) fusion. Deep features are stabilized via residual representation learning, while multi-branch convolution captures both local and global semantic information across multiple scales.
Furthermore, a top-down cross-layer fusion mechanism is adopted to jointly model high-level semantics and low-level details, thereby significantly enhancing the representation of small-scale, fine-grained objects. Second, the Dynamic Semantic Reasoning (DSR) module is built on a Transformer encoder, which incorporates two-dimensional positional encoding and label embeddings. Feature dependencies within the image are adaptively modeled via a multi-head attention mechanism, while cross-layer multilayer perceptrons (MLPs) and Dropout are employed to improve multiscale feature representation and training stability. As a result, the modeling of multi-label semantic dependencies in complex scenarios is effectively enhanced. Finally, to address the long-tail distribution and hard-sample challenges commonly observed in remote sensing data, a Difficulty-Weighted Loss (DW-Loss) is proposed. Category frequency weights and prior difficulty coefficients are jointly incorporated to dynamically regulate the loss contributions of rare classes and hard samples during training, thereby enhancing the model’s focus on underrepresented and challenging categories. Based on the aforementioned modules, MSDR-Net achieves synergistic modeling of multiscale representation learning, label-aware semantic reasoning, and long-tail category optimization. Experimental results demonstrate that, while maintaining high overall classification accuracy, MSDR-Net significantly improves recognition robustness for complex scenes, small-scale targets, and rare categories, thereby validating the effectiveness of the proposed task-driven unified modeling framework.
The main contributions of this study are summarized as follows:
  • The MSFE module is proposed, which jointly models deep semantic information and shallow, detailed features to effectively capture multiscale object characteristics in remote sensing imagery, thereby improving the representation of small-scale and complex targets.
  • The DSR module is designed to adaptively model multi-label semantic dependencies based on a Transformer encoder and a multi-head attention mechanism, enabling efficient integration of global semantic information and local details in complex scenarios.
  • DW-Loss is introduced within an end-to-end training framework, termed MSDR-Net, in which the loss contributions of long-tail categories and hard samples are dynamically regulated to enable collaborative optimization across multiple modules, thereby significantly enhancing the classification robustness of rare categories and small-scale objects.

2. Related Work

Multi-label remote sensing image classification is primarily challenged by the diversity of land-cover categories, significant scale variations, long-tail distributions of categories, and latent semantic dependencies among labels. In complex scenarios, Multi-Label Remote Sensing Image Classification (MLRIC) continues to face two fundamental issues: the representation of multiscale objects and label imbalance under long-tail distributions.
From a multiscale feature representation perspective, remote sensing imagery exhibits wide spatial coverage and complex viewing angles, where large-scale structures and small objects often coexist within a single image. This substantial scale variation imposes higher demands on feature representation. To address this issue, Tan and Le proposed a compound scaling strategy in EfficientNet to enhance cross-scale feature modeling capability; however, the computational complexity remains relatively high [8]. Subsequently, Zhu et al. introduced deformable attention in Deformable DETR, enabling flexible multiscale alignment through dynamic sampling across feature scales [9]. In the remote sensing domain, Li et al. presented a spatial-topological-semantic alignment paradigm to enhance domain adaptability for few-label cross-domain scene classification [10]. Pandey et al. explored a joint super-resolution and multi-label classification framework for remote sensing images, demonstrating that preserving high-resolution spatial details is beneficial for multi-label recognition in low-resolution satellite imagery [11]. More recently, Zhao et al. developed a multiscale sparse cross-attention network, in which sparse attention mechanisms facilitate the integration of local details and global context, leading to significant performance improvements in complex scenarios [12]. Nevertheless, it has been reported that repeated downsampling in deep convolutional networks leads to substantial resolution degradation, weakening the representation of small objects during feature fusion and imposing inherent limitations on multiscale modeling [13].
From a semantic dependency modeling perspective, multi-label remote sensing imagery typically exhibits pronounced semantic co-occurrence patterns. To capture such relationships, hierarchical semantic structures have been explored to model inter-category dependencies. For instance, Zhang et al. proposed a hierarchical knowledge graph-based approach that models multi-level semantic relationships to enhance understanding of multiscale objects [14]. Meanwhile, studies in general computer vision have demonstrated that Transformer-based architectures are highly effective in modeling complex semantic dependencies. Carion et al. established inter-object relationship modeling within the Transformer-based DETR framework, providing an effective paradigm for multi-semantic relationship learning [15]. Subsequently, Dosovitskiy et al. demonstrated that the Vision Transformer can effectively capture long-range semantic dependencies [16]. Inspired by these advances, Transformer-based interaction mechanisms have been introduced into remote sensing methods. Ou et al. proposed a view–category interactive sharing mechanism that jointly models multi-view information and category-level semantic relationships, effectively alleviating label dependency issues under incomplete annotation conditions [17]. Cao et al. introduced the pioneering CLIP-Mamba framework, integrating a pre-trained Vision-Language Model (CLIP) with a State Space Model (Mamba) for efficient and comprehensive feature fusion and semantic extraction [18]. In addition, Xia et al. proposed a latent semantic dependency model that jointly infers explicit and implicit label relationships to improve classification performance; however, its applicability in complex scenarios is constrained by limited feature representation stability and high computational cost [19]. 
Moreover, because small-object information is often lost during the deep downsampling applied to remote sensing imagery, it remains difficult for conventional Transformer architectures to achieve full-scale perception from local details to global semantics.
From the perspective of the long-tail distribution, large-scale remote sensing datasets commonly exhibit severe class imbalance, with low-frequency categories underrepresented and difficult to learn effectively. It has been demonstrated that such an imbalance leads to a pronounced bias toward high-frequency categories, thereby degrading generalization performance on rare classes [2]. To mitigate this issue, Cui et al. proposed a class-balanced loss based on the effective number of samples, in which class weights are redefined to counteract frequency imbalance; however, it remains insufficient to simultaneously address sample difficulty and distribution disparities in remote sensing scenarios [20]. Wang et al. introduced a diffusion-based noise augmentation strategy to improve tail-class performance, while Du et al. proposed a category-selective feature enhancement mechanism to enhance tail-class responses adaptively [21,22]. Wang et al. proposed DMRS, a foundation-model-based framework for long-tailed remote sensing scene recognition, further demonstrating the importance of robust representation learning for imbalanced remote sensing data [22]. In the broader computer vision domain, robust loss function designs, such as asymmetric loss and distributionally robust loss, have provided theoretical support for addressing long-tail learning [23,24]. In addition, Zhang et al. proposed a unified learning framework to address multi-label classification under long-tailed distributions and partial-label conditions [25]. Nevertheless, directly transferring these general approaches to remote sensing scenarios remains challenging, as long-tail distributions are often coupled with multiscale variations. Consequently, relying solely on loss optimization or data augmentation is insufficient to address the complexity of jointly modeling multiscale features and long-tail distributions.
Despite substantial progress in multiscale feature representation, semantic dependency modeling, and long-tail learning, most existing methods address these challenges separately. In complex multi-label remote sensing scenes, however, these issues are often coupled: small objects require fine-grained spatial details, label prediction depends on global semantic context and inter-class correlations, and rare categories are more likely to be suppressed during training. Therefore, a unified framework is required to coordinate feature representation, semantic reasoning, and class-balanced optimization within an end-to-end learning process.
Compared with existing approaches, MSDR-Net is designed to address these coupled challenges collaboratively. The MSFE module enhances multiscale spatial representation by integrating shallow details and deep semantic features. The DSR module further models long-range spatial dependencies and inter-class semantic relationships through Transformer-based reasoning. The DW-Loss introduces category-frequency weighting and prior difficulty coefficients to improve learning of rare and difficult categories. In this way, MSDR-Net does not simply stack existing modules, but coordinates feature extraction, semantic reasoning, and loss optimization for multi-label remote sensing image classification.

3. Methods

A multiscale feature extraction framework is constructed by integrating residual learning, multiscale convolution, and an FPN. Dynamic semantic reasoning is further achieved by incorporating label embedding and Transformer-based modeling. In addition, DW-Loss is proposed to effectively alleviate class imbalance and hard-sample issues, thereby improving the performance of multi-label remote sensing image classification, as shown in Figure 1.
To provide a clear description of the computational workflow and logical architecture of the proposed method, Algorithm 1 summarizes the overall training and inference procedures of MSDR-Net.
Algorithm 1: The Proposed MSDR-Net Framework
Input: Remote sensing image $\mathrm{RSI} \in \mathbb{R}^{3 \times H \times W}$, ground-truth labels $y$.
Output: Predicted probabilities $\hat{y}$, total loss $L_{DW}$.
1  Feature Extraction and Fusion: Extract hierarchical features from RSI via ResNet-34 and fuse them using FPN-PAN to yield the enhanced feature map $H_i$.
2  Sequence Generation: Flatten $H_i$ and incorporate positional encoding and label embeddings to construct the input sequence $G_{IN}$.
3  Global Reasoning: for l = 1 to L (Transformer layers) do
4                       Iteratively update the sequence representations via Multi-Head Self-Attention and MLP blocks.
5                       end for
6  Classification: Aggregate global features and derive prediction probabilities $\hat{y}$ through a linear projection layer.
7  Optimization: Compute the Difficulty-Weighted Loss $L_{DW}$ based on label difficulty and update network parameters via backpropagation.

3.1. Multiscale Feature Enhancement Module

3.1.1. Residual Representation Learning

Deep convolutional networks exhibit significant advantages in the high-level semantic representation of remote sensing imagery. However, as network depth increases, learning complete mappings solely through stacked convolutions can lead to vanishing gradients, optimization difficulties, and feature degradation. To ensure stable gradient propagation in deep architectures, a residual learning mechanism is introduced that employs identity mappings to facilitate the training of deep nonlinear transformations, as illustrated in Figure 2.
The input remote sensing image is represented as a batch tensor. The residual structure consists of multiple convolutional layers and activation functions, and the overall residual transformation can be formulated as a functional mapping
$$R_n(\mathrm{RSI}; W_R)$$
where $n$ denotes the number of residual units, and $W_R$ represents the corresponding parameter set.
By introducing an identity shortcut connection, the output of the residual module can be expressed as
$$F_1 = \mathrm{RSI} + R_n(\mathrm{RSI}; W_R)$$
The advantage of this structure lies in the fact that, when the optimal mapping approaches an identity function, the residual term $R_n(\cdot)$ only needs to be driven toward zero, thereby significantly reducing the optimization difficulty of deep architectures. Meanwhile, the identity shortcut ensures that low-level spatial structures and texture information are preserved and directly propagated to deeper semantic representations, which is particularly critical for complex high-resolution remote sensing scenarios where local objects and global backgrounds coexist.
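The role of the identity shortcut can be illustrated with a minimal NumPy sketch. The function names `residual_branch` and `residual_block`, the two-layer linear-ReLU branch, and the tensor sizes are illustrative assumptions rather than the ResNet-34 blocks used in MSDR-Net; the sketch only shows that the block reduces to an identity mapping once the residual branch is driven to zero.

```python
import numpy as np

def residual_branch(x, w1, w2):
    """Hypothetical residual branch R(x; W): linear -> ReLU -> linear."""
    return np.maximum(x @ w1, 0.0) @ w2

def residual_block(x, w1, w2):
    """Identity shortcut plus residual branch: F = x + R(x; W)."""
    return x + residual_branch(x, w1, w2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 feature vectors of width 8 (arbitrary sizes)
w1 = rng.normal(size=(8, 8))
zero_w2 = np.zeros((8, 8))       # residual branch driven to zero

out = residual_block(x, w1, zero_w2)
identity_recovered = np.allclose(out, x)   # the block behaves as an identity mapping
```

Because the shortcut carries the input forward unchanged, the branch only has to learn the deviation from identity, which is the property the paragraph above relies on.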

3.1.2. Multiscale Convolutional Feature Extraction

Remote sensing imagery exhibits significant multiscale characteristics, encompassing rich information ranging from local details to global structural patterns. To overcome the limitations of single-scale feature representation, a hierarchical multiscale feature extraction strategy is proposed. The proposed strategy exploits the hierarchical representation capabilities of convolutional neural networks (CNNs) to model land-cover features across different spatial scales explicitly.
As illustrated in Figure 3, the network is composed of multiple sequential convolutional stages. The $k$-th stage, denoted as $M_k(\cdot)$, is responsible for extracting feature representations at a specific semantic level. Let the input to the current stage be the output feature from the previous stage. The feature extraction process at the $k$-th stage can then be formulated as:
$$F_k = M_k(F_{k-1}; W_k)$$
where $W_k$ denotes the learnable parameters of the $k$-th stage, and $F_{k-1}$ represents the input feature map. For the first stage, the input corresponds to the original RSI. The output $F_k$ represents the deep feature representation extracted at the corresponding stage. Through this cascaded architecture, multiple convolutional stages jointly form a hierarchical multiscale semantic extraction sequence from shallow to deep representations:
$$F_1 = M_1(\mathrm{RSI}), \quad F_2 = M_2(F_1), \quad \ldots, \quad F_n = M_n(F_{n-1})$$
As the network depth increases, the receptive field is progressively enlarged, enabling the extraction of hierarchical dependency features ranging from local textures and geometric boundaries to high-level regional semantic structures. The final feature representation $F_n$ serves as a comprehensive, deep, multiscale semantic descriptor and is subsequently fed into the downstream decision network for efficient modeling of complex remote sensing scene structures.
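The growth of the receptive field across cascaded stages can be worked through numerically. The sketch below assumes, purely for illustration, that each stage is a single 3×3 convolution with stride 2 and that the input is 256 pixels wide; the real backbone stages contain several layers, so the numbers are not those of MSDR-Net. The helper names `receptive_fields` and `spatial_sizes` are hypothetical.

```python
def receptive_fields(layers):
    """Receptive field size after each (kernel, stride) layer of a cascade."""
    rf, jump, out = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input strides
        jump *= stride              # distance in input pixels between adjacent outputs
        out.append(rf)
    return out

def spatial_sizes(size, layers):
    """Feature-map side length after each layer, assuming 'same'-style padding."""
    sizes = []
    for _, stride in layers:
        size = -(-size // stride)   # ceiling division
        sizes.append(size)
    return sizes

stages = [(3, 2)] * 4                         # four assumed stages
rf_per_stage = receptive_fields(stages)       # [3, 7, 15, 31]
size_per_stage = spatial_sizes(256, stages)   # [128, 64, 32, 16]
```

Under these assumptions the receptive field roughly doubles per stage while the spatial resolution halves, which is the trade-off between semantic context and spatial detail that the later fusion step is meant to reconcile.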

3.1.3. Multiscale Feature Pyramid Fusion and Fixed-Scale Representation

Deep convolutional features exhibit significant differences in semantic richness and spatial resolution across different layers. Shallow features preserve rich spatial details but lack strong semantic discriminability, whereas deep features contain more discriminative semantic information but suffer from reduced spatial resolution. This cross-layer semantic gap limits the effective integration of multiscale information. To address this issue, an FPN-based fusion mechanism is introduced that achieves multiscale semantic enhancement via top-down information propagation. Furthermore, a fixed-scale feature representation is constructed to facilitate subsequent decision-making modules.
As illustrated in Figure 4, the constructed feature pyramid adopts a bidirectional fusion architecture that combines top-down and bottom-up pathways, enabling the joint modeling of cross-layer semantic and spatial information. In the top-down pathway, high-level features are progressively upsampled and fused with corresponding low-level features via lateral connections, resulting in feature representations with both high spatial resolution and strong semantic expressiveness. Let the feature map at the $l$-th layer of the backbone be denoted as $F_l$. The feature alignment, top-down fusion, and smoothing operations are defined as:
$$P_l = \varphi\left(\phi_l(F_l) + U(P_{l+1})\right)$$
where $\phi_l(\cdot)$ denotes the feature embedding function for achieving isomorphic representation across feature levels; $U(\cdot)$ represents the upsampling operator for spatial alignment; and $\varphi(\cdot)$ denotes a smoothing operation to mitigate aliasing effects introduced by upsampling.
Subsequently, to enhance the contribution of low-level high-resolution features, a bottom-up Path Aggregation Network (PAN) pathway is introduced. In the bottom-up pathway, fused low-level features are progressively downsampled and propagated to higher layers, where they are combined with corresponding high-level features through element-wise addition and convolutional smoothing, which can be recursively formulated as:
$$N_{l+1} = \phi_{\mathrm{PAN}}\left(P_{l+1} + D(N_l)\right)$$
where $D(\cdot)$ denotes the downsampling operator, and $\phi_{\mathrm{PAN}}(\cdot)$ represents the convolutional smoothing operation after bottom-up fusion. This bidirectional fusion strategy effectively enhances spatial perception capability, enabling the resulting features to preserve fine-grained spatial details while maintaining strong semantic representations.
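The two recursions can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the lateral embedding and both smoothing operations are taken as identities, upsampling is nearest-neighbour by a factor of 2, downsampling is 2×2 average pooling, and all levels already share the same channel count. None of these choices is confirmed by the paper, and the function names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map, standing in for U(.)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """2x2 average pooling of a (C, H, W) map, standing in for D(.)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def top_down(features):
    """P_l = F_l + U(P_{l+1}), computed from the coarsest level downwards."""
    pyramid = [features[-1]]
    for f in reversed(features[:-1]):
        pyramid.insert(0, f + upsample2x(pyramid[0]))
    return pyramid

def bottom_up(pyramid):
    """N_{l+1} = P_{l+1} + D(N_l), computed from the finest level upwards."""
    out = [pyramid[0]]
    for p in pyramid[1:]:
        out.append(p + downsample2x(out[-1]))
    return out

rng = np.random.default_rng(0)
# Three assumed backbone levels, finest first, with a shared channel count of 4.
features = [rng.normal(size=(4, 32 // 2**l, 32 // 2**l)) for l in range(3)]
fused = bottom_up(top_down(features))
fused_shapes = [n.shape for n in fused]
```

Each output level keeps the resolution of its input level, so the bidirectional pass changes the content of the pyramid but not its shape.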
From the multiscale feature set, an optimal level $l^*$ is selected as the output representation. The fused feature map is partitioned into subregions $\{R_i\}_{i=1}^{K}$, followed by normalization and nonlinear transformation to obtain the final feature representation, which can be formulated as:
$$H_i = \sigma\left(\mathcal{N}\left(\frac{1}{|R_i|} \sum_{x \in R_i} P(\hat{N}_{l^*})\right)\right)$$
$$H = \mathrm{Flatten}(H_i)$$
The resulting feature representation preserves both spatial structural information and semantic content, providing stable and informative inputs for downstream classification or detection tasks.
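A minimal sketch of the region-averaging step is given below. It assumes a regular grid partition, per-token standardization as the normalization, and ReLU as the nonlinearity; the paper does not state these choices, so they are placeholders, as are the names `region_tokens` and `normalise_and_activate`.

```python
import numpy as np

def region_tokens(fmap, grid):
    """Average a (C, H, W) map over a grid x grid partition -> (grid*grid, C)."""
    c, h, w = fmap.shape
    rh, rw = h // grid, w // grid
    cropped = fmap[:, :rh * grid, :rw * grid]
    pooled = cropped.reshape(c, grid, rh, grid, rw).mean(axis=(2, 4))
    return pooled.reshape(c, grid * grid).T

def normalise_and_activate(tokens, eps=1e-6):
    """Assumed choices: per-token standardization followed by ReLU."""
    mean = tokens.mean(axis=1, keepdims=True)
    std = tokens.std(axis=1, keepdims=True)
    return np.maximum((tokens - mean) / (std + eps), 0.0)

fmap = np.random.default_rng(0).normal(size=(8, 20, 20))   # arbitrary fused map
H_tokens = normalise_and_activate(region_tokens(fmap, grid=4))   # K = 16 regions
```

Whatever the input resolution, the output has a fixed number of rows (here 16), which is what allows a fixed-length sequence to be passed to the reasoning module.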

3.2. Dynamic Semantic Reasoning Module

3.2.1. Label Indexing System and Positional Encoding

As illustrated in Figure 5, after visual feature extraction and initialization, a label-indexing system is constructed to support supervised learning for multi-label classification.
  • Category index generation: Let the DIOR dataset contain $M$ object categories (e.g., airplanes, ships, and vehicles). A set of unique category indices $i = \{0, 1, \ldots, M-1\}$ is first defined, and a mapping from category names to integer indices is established.
  • Index set expansion and alignment: To achieve dimensional alignment between label information and visual features, the category index set is replicated to construct an index sequence of length N, the number of feature locations.
  • Label embedding mapping: For each input sample, the corresponding category indices derived from ground-truth annotations are mapped to high-dimensional label embedding vectors, forming the label representation space.
Mathematically, let the predicted category at the $k$-th feature location be denoted as $y_k$. The corresponding label embedding can be expressed as:
$$E_{\mathrm{label}} = W_{\mathrm{emb}}\, y_k$$
where $W_{\mathrm{emb}} \in \mathbb{R}^{M \times D}$ denotes a learnable category embedding matrix. This initialization strategy ensures that each feature location is associated with explicit semantic priors at the early stage of training, enabling rapid alignment between visual features and category labels, thereby accelerating convergence and improving classification performance.
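The embedding step can be read either as a row lookup in the embedding matrix or as a product with a one-hot encoding of the category index; the two are equivalent, as the sketch below shows. The embedding width `D = 8`, the random initialization, and the example indices are assumptions made only for illustration.

```python
import numpy as np

M, D = 20, 8                        # 20 DIOR categories; D = 8 is an arbitrary width
rng = np.random.default_rng(0)
W_emb = rng.normal(size=(M, D))     # stands in for the learnable embedding matrix

label_ids = np.array([0, 4, 17])    # hypothetical category indices for one image
E_lookup = W_emb[label_ids]         # row lookup, shape (3, D)

one_hot = np.eye(M)[label_ids]      # (3, M) one-hot rows
E_matmul = one_hot @ W_emb          # same embeddings via a matrix product
```

In practice the lookup form is preferred because it avoids materializing the one-hot matrix.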
Within the Transformer framework, the self-attention mechanism processes each feature by attending to all other positions, thereby enabling the effective modeling of global feature dependencies [14]. However, this property implies the absence of an inherent sequential processing mechanism, preventing the direct encoding of absolute positional information for input features. To compensate, a two-dimensional positional encoding is defined as:
$$W_D = \exp\left(-\frac{f}{D_{\mathrm{model}}/2} \cdot \log 10{,}000\right), \quad f = 1, \ldots, D_{\mathrm{model}}/2$$
$$P_{P_H \times P_V} = \left[\sin(pos_H \cdot W_D),\ \cos(pos_H \cdot W_D),\ \sin(pos_V \cdot W_D),\ \cos(pos_V \cdot W_D)\right]$$
where $f$ denotes the index of the frequency term, $W$ and $H$ represent the spatial width and height of the feature map output by the ResNet backbone, and $D_{\mathrm{model}}$ denotes the feature dimensionality. The feature representation augmented with positional encoding is combined with the embedded label indices $E_{\mathrm{label}}$ to construct a label-aware feature representation, which serves as the input tensor to the Transformer encoder. The formulation is defined as follows:
$$G_{IN} = H + P_{P_H \times P_V} + E_{\mathrm{label}}$$
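A two-dimensional sinusoidal encoding of this kind can be sketched as follows. To make the four sine and cosine blocks add up to exactly `d_model` channels, the sketch assumes `d_model // 4` frequencies per block and indexes them from zero; this channel layout and the function name `positional_encoding_2d` are assumptions and may differ from the authors' implementation.

```python
import numpy as np

def positional_encoding_2d(height, width, d_model, base=10000.0):
    """Return a (height * width, d_model) sinusoidal position table."""
    if d_model % 4 != 0:
        raise ValueError("d_model must be divisible by 4")
    n_freq = d_model // 4
    # Geometrically decaying frequencies, exp(-(f / n_freq) * log(base)).
    freqs = np.exp(-np.arange(n_freq) / n_freq * np.log(base))
    pos_v, pos_h = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    ang_h = pos_h.reshape(-1, 1) * freqs    # horizontal position times frequency
    ang_v = pos_v.reshape(-1, 1) * freqs    # vertical position times frequency
    return np.concatenate(
        [np.sin(ang_h), np.cos(ang_h), np.sin(ang_v), np.cos(ang_v)], axis=1
    )

pe = positional_encoding_2d(height=4, width=4, d_model=16)
```

Each grid location receives a distinct vector, so adding the table to the flattened features gives the otherwise order-agnostic attention layers access to absolute position.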

3.2.2. Multi-Head Attention Mechanism

As illustrated in Figure 4, to enhance the model’s learning capacity and representation depth, a Multi-Head Attention mechanism with $I$ attention heads is adopted and stacked across $L$ layers within the Transformer architecture. Through this design, input features are jointly processed in parallel and sequential manners, enabling comprehensive feature refinement. In addition, a Dropout mechanism is incorporated to mitigate overfitting and improve model robustness. The computation process of the multi-head attention mechanism is defined as follows:
$$Q_i, K_i, V_i = G_{IN} W_Q^i,\ G_{IN} W_K^i,\ G_{IN} W_V^i, \quad i = 1, \ldots, I$$
$$h_i = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q_i K_j^{T}}{Z}\right) V_j, \quad j = 1, \ldots, I$$
$$H_k = \mathrm{Dropout}\left(\mathrm{MultiHead}(Q_i, K_i, V_i)\right) = \mathrm{Dropout}\left(\mathrm{Concat}(h_1, \ldots, h_i, \ldots, h_I)\, W_C\right), \quad k = 1, \ldots, K$$
where $W_Q$, $W_K$, $W_V$ denote the learnable projection matrices that map the input features into query, key, and value representations, respectively; $\frac{1}{Z}$ is a scaling factor introduced to stabilize gradients across different feature dimensions; $W_C$ represents the output projection matrix used to aggregate multiple attention heads; $h_i$ denotes the $i$-th attention head; and $H_k$ denotes the output of the Multi-Head Attention module.
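For reference, a single multi-head self-attention pass can be sketched in NumPy. The sketch uses the standard scaled dot-product form with a 1/sqrt(d_head) scaling, which is an assumption about the scaling factor in the equations above, and it omits Dropout (as at inference time). The sizes and random weights are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(g_in, w_q, w_k, w_v, w_c):
    """g_in: (N, D); w_q, w_k, w_v: (I, D, d_head); w_c: (I * d_head, D)."""
    heads, weights = [], []
    for wq, wk, wv in zip(w_q, w_k, w_v):
        q, k, v = g_in @ wq, g_in @ wk, g_in @ wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # assumed 1/sqrt(d_head) scaling
        heads.append(attn @ v)
        weights.append(attn)
    return np.concatenate(heads, axis=-1) @ w_c, np.stack(weights)

rng = np.random.default_rng(0)
N, D, I, d_head = 6, 8, 2, 4                 # tokens, width, heads, per-head width
g_in = rng.normal(size=(N, D))
w_q, w_k, w_v = (rng.normal(size=(I, D, d_head)) for _ in range(3))
w_c = rng.normal(size=(I * d_head, D))
out, attn = multi_head_attention(g_in, w_q, w_k, w_v, w_c)
```

Every row of each attention map sums to one, so each output token is a convex combination of value vectors taken over all positions, which is how global dependencies enter the representation.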
Between the stacked attention layers, an MLP block and a Dropout mechanism are incorporated to enhance further the Multi-Head Attention module’s ability to capture multiscale information. This design improves the representational capacity, training stability, and the joint modeling of local and global dependencies. The output after the MLP block and Dropout operation is defined as follows:
$$H_{IN}^{k_1} = \mathrm{Dropout}\left(\mathrm{MLP}(H_k)\, W_P + b_1\right) = \mathrm{Dropout}\left(\mathrm{ReLU}(H_k + b_0)\, W_P + b_1\right), \quad k_1 = 2, \ldots, K$$
where $W_P$ denotes the weight matrix used for inter-layer transformation, and $b_0$ and $b_1$ represent the corresponding bias terms.

3.3. Supervised Learning and Loss Optimization

3.3.1. Probability Prediction

The label-related components of the Transformer encoder output tensor are extracted via an index selection module, after which logits are computed and transformed through a Sigmoid function to obtain probability values corresponding to each category. The logit transformation and Sigmoid function are defined as follows:
$$Lg_l = \mathrm{Logit}(G_{OUT}) = W_O\, G_{OUT} + b_2$$
$$\hat{y}_i = \mathrm{sigmoid}(Lg_l) = \frac{1}{1 + e^{-Lg_l}}, \quad l = 1, \ldots, L$$
where $W_O$ and $b_2$ denote the weight matrix and bias term of the logit transformation, respectively, and $\hat{y}_i$ represents the predicted probability vector. A binary vector with a length equal to the number of categories is constructed, in which the positions corresponding to the ground truth labels are set to 1, forming the multi-label ground truth vector. Since the predicted probabilities $\hat{y}_i$ lie within the range $[0, 1]$, aligning the output structure with the ground truth vector enables each element to directly represent the probability of its corresponding category.
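The mapping from logits to per-category probabilities, and the construction of the multi-hot ground truth vector, can be shown with a short plain-Python sketch. The five-category example, the logit values, and the 0.5 decision threshold are illustrative assumptions; the paper does not state a threshold here.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def multi_hot(label_ids, num_classes):
    """Binary ground-truth vector with ones at the annotated category indices."""
    vec = [0] * num_classes
    for i in label_ids:
        vec[i] = 1
    return vec

def predict_labels(logits, threshold=0.5):
    """Indices whose predicted probability reaches the (assumed) threshold."""
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]

logits = [2.0, -1.5, 0.3, -3.0, 0.7]          # hypothetical per-category logits
probs = [sigmoid(z) for z in logits]
predicted = predict_labels(logits)            # [0, 2, 4]
target = multi_hot([0, 4], num_classes=5)     # [1, 0, 0, 0, 1]
```

Because each category has its own independent sigmoid, any number of categories can be active at once, unlike a softmax over mutually exclusive classes.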

3.3.2. Difficulty-Weighted Loss

To mitigate class imbalance and improve learning of hard samples, a DW-Loss is proposed based on the Binary Cross-Entropy (BCE) loss. This loss function dynamically adjusts the contributions of different categories during optimization by jointly incorporating category-frequency weights and prior-difficulty coefficients. The DW-Loss is formulated as:
$$L_{DW} = -\frac{1}{N}\sum_{i=1}^{N}\left[ w_{freq}(c) \cdot \alpha_{hard}(c) \cdot y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$$
where $N$ denotes the total number of samples, and $y_i$ and $\hat{y}_i$ represent the ground truth label and predicted probability of the $i$-th sample, respectively. For positive samples, a composite weighting term is introduced:
  • Category frequency weight: introduced to address long-tail distribution. For category $c$, the weight is computed as the ratio of negative to positive samples in the training set:
    $$w_{freq}(c) = \frac{N_{neg}(c)}{N_{pos}(c) + \epsilon}$$
    where $N_{pos}(c)$ and $N_{neg}(c)$ denote the number of positive and negative samples for category $c$, respectively, and $\epsilon$ is a small smoothing constant to avoid division by zero. Categories with fewer samples are assigned larger weights.
  • Difficulty coefficient: introduced to emphasize hard categories, such as small-scale objects or structurally complex targets. For instance, categories like vehicles, despite having sufficient samples, are assigned higher coefficients due to their small scale and ambiguity. This coefficient is empirically set to encourage the model to focus on hard samples during early training.
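The weighting scheme above can be sketched in plain Python as follows. This is an illustrative reading of the DW-Loss, not the authors' code: the function name `dw_loss`, the probability clipping constant, and the choice to average over every sample-category entry are assumptions, and `pos_weight[c]` stands for the product of the frequency weight and the difficulty coefficient of category c.

```python
import math
from typing import List


def dw_loss(y_true: List[List[int]],
            y_prob: List[List[float]],
            pos_weight: List[float],
            clip: float = 1e-12) -> float:
    """Weighted binary cross-entropy in which only the positive term is scaled.

    y_true     : N x C matrix of 0/1 ground truth labels
    y_prob     : N x C matrix of predicted probabilities
    pos_weight : length-C vector, assumed to hold w_freq(c) * alpha_hard(c)
    clip       : assumed clipping constant that keeps log() finite
    """
    total, count = 0.0, 0
    for labels, probs in zip(y_true, y_prob):
        for c, (y, p) in enumerate(zip(labels, probs)):
            p = min(max(p, clip), 1.0 - clip)
            total += pos_weight[c] * y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    # Assumption: the mean is taken over all sample-category entries.
    return -total / count


# Hypothetical single sample with two categories: one positive, one negative.
example = dw_loss([[1, 0]], [[0.5, 0.5]], pos_weight=[2.0, 5.0])
```

With all weights set to 1 the function reduces to the standard BCE loss, which makes the effect of the weights on positive samples easy to isolate.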

4. Experiments

4.1. Dataset

To validate the effectiveness of the proposed MSDR-Net, experiments are conducted on the publicly available DIOR dataset [26]. As shown in Figure 6, DIOR is a large-scale, high-resolution remote sensing imagery dataset that encompasses diverse, complex scenes and a wide range of object categories and has been widely adopted for remote sensing image understanding tasks. The dataset contains 20 object categories, including large-scale targets such as Airplane, Airport, Ship, and Stadium, as well as small-scale or structurally complex targets such as Vehicle, Bridge, and Overpass. Significant variations across categories are observed in scale distribution, spatial density, and semantic co-occurrence patterns, reflecting typical characteristics of multi-label remote sensing scenarios.
Based on the object-level annotations, each image is reformulated as an image-level multi-label sample, allowing a single image to correspond to multiple category labels. This transformation better reflects the real-world characteristics of remote sensing scenarios, where multiple land-cover types often coexist within a single scene. No external training data are introduced in the experiments, and a fixed 8:2 split is adopted to construct the training and validation sets, ensuring the fairness and reproducibility of the experimental comparisons.
In addition to DIOR, MLRSNet is also used as a multi-label remote sensing benchmark to evaluate the cross-dataset generalization ability of MSDR-Net [1]. MLRSNet contains diverse high-resolution remote sensing scenes with multiple land-cover categories and multi-label annotations. In this study, 20 categories are selected for evaluation, including Airplane, Airport, Baseball Diamond, Basketball Court, Bridge, Freeway, Golf Course, Ground Track Field, Harbor&Port, Overpass, Parking Lot, Railway, Railway Station, Shipping Yard, Stadium, Storage Tank, Tennis Court, Terrace, Transmission Tower, and Wind Turbine. To construct the experimental subset, the first 25% of the samples of each category are retained, resulting in 12,012 images. This subset is split 80%/20% into 9609 training and 2403 validation images.

4.2. Evaluation Metrics

To comprehensively evaluate the performance of the model in multi-label remote sensing image classification, Mean Average Precision (mAP), Hamming Accuracy (HA), Overall F1-score (OF1), Class-wise F1-score (CF1), Overall Precision/Recall (OP/OR), and Class-wise Precision/Recall (CP/CR) are adopted. Among these metrics, mAP reflects the overall ranking capability of the model across different thresholds; HA measures label-wise prediction consistency; OF1 and OP/OR evaluate prediction quality from a sample-level perspective; and CF1 and CP/CR assess model performance from a class-level perspective, particularly in terms of class balance and long-tail category recognition. The corresponding formulations are defined as follows:
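$$mAP = \frac{1}{q}\sum_{i=1}^{q} AP(i), \qquad HA = 1 - \frac{1}{n}\sum_{j=1}^{n}\frac{1}{q}\left| Y_j \,\Delta\, h(X_j) \right|$$
where $q$ is the number of categories, $n$ is the number of samples, and $Y_j \,\Delta\, h(X_j)$ denotes the set of labels on which the ground truth $Y_j$ and the prediction $h(X_j)$ for sample $X_j$ disagree.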
$$CP = \frac{1}{q}\sum_{i=1}^{q}\frac{TP_i}{TP_i + FP_i}, \qquad CR = \frac{1}{q}\sum_{i=1}^{q}\frac{TP_i}{TP_i + FN_i}, \qquad CF1 = \frac{2 \times CP \times CR}{CP + CR}$$
$$OP = \frac{1}{n}\sum_{i=1}^{q}\frac{TP_i}{TP_i + FP_i}, \qquad OR = \frac{1}{n}\sum_{i=1}^{q}\frac{TP_i}{TP_i + FN_i}, \qquad OF1 = \frac{2 \times OP \times OR}{OP + OR}$$
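To make the class-level metrics concrete, the sketch below computes mAP, HA, and CP/CR/CF1 on a tiny hypothetical example. Several choices are assumptions rather than details given in the paper: AP is taken as the non-interpolated average of the precision at each positive rank, HA is taken as the fraction of label entries on which prediction and ground truth agree, predictions are thresholded at 0.5, and all function names are illustrative.

```python
from typing import List, Tuple


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Non-interpolated AP: mean precision measured at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores: List[List[float]], true: List[List[int]]) -> float:
    """Average of the per-category AP values over the q categories."""
    q = len(true[0])
    aps = [average_precision([row[c] for row in scores], [row[c] for row in true])
           for c in range(q)]
    return sum(aps) / q


def hamming_accuracy(pred: List[List[int]], true: List[List[int]]) -> float:
    """Fraction of the n * q label entries that are predicted correctly."""
    agree = sum(p == t for pr, tr in zip(pred, true) for p, t in zip(pr, tr))
    return agree / (len(true) * len(true[0]))


def classwise_prf(pred: List[List[int]], true: List[List[int]]) -> Tuple[float, float, float]:
    """CP, CR, and CF1: per-category precision and recall averaged over categories."""
    q = len(true[0])
    precisions, recalls = [], []
    for c in range(q):
        tp = sum(1 for pr, tr in zip(pred, true) if pr[c] == 1 and tr[c] == 1)
        fp = sum(1 for pr, tr in zip(pred, true) if pr[c] == 1 and tr[c] == 0)
        fn = sum(1 for pr, tr in zip(pred, true) if pr[c] == 0 and tr[c] == 1)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    cp, cr = sum(precisions) / q, sum(recalls) / q
    cf1 = 2 * cp * cr / (cp + cr) if cp + cr else 0.0
    return cp, cr, cf1


# Hypothetical data: 4 samples, 2 categories.
true = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.4, 0.8], [0.3, 0.7], [0.1, 0.6]]
pred = [[1 if s >= 0.5 else 0 for s in row] for row in scores]  # assumed 0.5 threshold

m_ap = mean_average_precision(scores, true)
ha = hamming_accuracy(pred, true)
cp, cr, cf1 = classwise_prf(pred, true)
```

Note that mAP is computed from the raw scores and is therefore threshold-free, whereas HA and the precision/recall metrics depend on the chosen decision threshold.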
Given that single-run experiments are susceptible to stochastic variations from random initialization and batch sampling, all overall evaluation metrics, except class-wise Average Precision (AP), are reported based on statistics from multiple repeated runs. Specifically, deep learning models are trained using five random seeds ([42, 2022, 7, 123, 999]), whereas traditional and comparative methods, including Support Vector Machine (SVM), Extremely Randomized Trees (ERT), Relation Network, Deep Multi-Attention, MSCA, and SFNet, are evaluated using three random seeds. All results are reported in the form of $\mu \pm \sigma$, where $\sigma$ denotes the sample standard deviation. For the validation mAP, a 95% confidence interval (CI) is additionally reported. To assess the statistical significance of performance differences between MSDR-Net and MSCA, a two-sided $t$-test is conducted with a significance level of $\alpha = 0.05$.
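The reporting protocol can be illustrated with a short sketch. The per-seed values below are hypothetical, not results from the paper, and two details are assumptions because the text does not state them: the confidence interval is built from the Student-t distribution, and the test statistic uses Welch's unequal-variance form, which suits groups of five and three runs.

```python
import math
import statistics
from typing import List, Tuple

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five runs).
T_CRIT_DF4 = 2.776


def summarize(values: List[float], t_crit: float) -> Tuple[float, float, Tuple[float, float]]:
    """Return the mean, the sample standard deviation, and a t-based confidence interval."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)                  # sample std (n - 1 denominator)
    half_width = t_crit * std / math.sqrt(len(values))
    return mean, std, (mean - half_width, mean + half_width)


def welch_t(a: List[float], b: List[float]) -> Tuple[float, float]:
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical validation mAP values for two methods (five and three seeds).
runs_a = [0.95, 0.96, 0.96, 0.95, 0.97]
runs_b = [0.94, 0.95, 0.93]

mean_a, std_a, ci_a = summarize(runs_a, T_CRIT_DF4)
t_stat, dof = welch_t(runs_a, runs_b)
```

The two-sided p-value then follows from the t distribution with `dof` degrees of freedom, for example from a statistics table or a library routine.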

4.3. Experimental Setup

Considering the significant scale variations and diverse land-cover types in high-resolution remote sensing imagery, a customized data preprocessing strategy is designed. During training, the following data augmentation strategies are applied to improve model generalization: Random Resized Crop (scale range: 0.7–1.0; aspect ratio: 0.85–1.15), Random Horizontal Flip (probability: 0.5), and rotations at multiples of 90°. The images are subsequently converted into tensors and normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], which are the standard ImageNet channel statistics. For the validation set, only deterministic preprocessing is applied, including resizing to 512 × 512 pixels, center cropping, tensor conversion, and normalization, to ensure the stability and consistency of evaluation metrics.
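The training-time augmentation can be sketched as a parameter sampler. This is a library-free illustration under stated assumptions, not the authors' pipeline: the aspect ratio is drawn log-uniformly, the crop falls back to the full image when no valid box is found, the rotation is drawn uniformly from 0°, 90°, 180°, and 270°, and the function names are hypothetical.

```python
import math
import random
from typing import Dict, Tuple


def sample_crop(height: int, width: int, rng: random.Random,
                scale: Tuple[float, float] = (0.7, 1.0),
                ratio: Tuple[float, float] = (0.85, 1.15),
                attempts: int = 10) -> Tuple[int, int, int, int]:
    """Sample (top, left, crop_height, crop_width) for a random resized crop."""
    area = height * width
    for _ in range(attempts):
        target_area = area * rng.uniform(*scale)
        # Assumption: aspect ratio (width / height) is sampled log-uniformly.
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            top = rng.randint(0, height - h)
            left = rng.randint(0, width - w)
            return top, left, h, w
    return 0, 0, height, width  # assumed fallback: keep the whole image


def sample_augmentation(height: int, width: int, rng: random.Random) -> Dict[str, object]:
    """Draw one set of augmentation parameters for a training image."""
    return {
        "crop": sample_crop(height, width, rng),
        "hflip": rng.random() < 0.5,   # horizontal flip with probability 0.5
        "rot90_k": rng.randrange(4),   # rotate by k * 90 degrees
    }


params = sample_augmentation(512, 512, random.Random(0))
```

The sampled crop would then be resized to the network input size before flipping and rotation are applied; validation images skip this sampler entirely.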
The model is trained on a single NVIDIA GeForce RTX 4070 Ti Super GPU using the PyTorch V1.12.1 framework. The key hyperparameters are set as follows: a batch size of 40, 100 training epochs, and the AdamW optimizer. A hybrid learning rate scheduling strategy is adopted, with linear warmup during the first five epochs, followed by dynamic adjustment based on validation loss, enabling more refined convergence in later training stages.
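A hybrid schedule of this kind can be sketched as follows. Only the five-epoch linear warmup and the dependence on validation loss come from the text; the class name, the base learning rate, the reduction factor, the patience, and the minimum learning rate are hypothetical values chosen for illustration.

```python
from typing import Optional


class WarmupThenPlateau:
    """Linear warmup followed by learning-rate reduction when validation loss stalls."""

    def __init__(self, base_lr: float, warmup_epochs: int = 5,
                 factor: float = 0.5, patience: int = 3, min_lr: float = 1e-6) -> None:
        self.base_lr = base_lr
        self.warmup_epochs = warmup_epochs
        self.factor = factor          # assumed multiplicative decay
        self.patience = patience      # assumed number of tolerated stalled epochs
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0
        self.scale = 1.0

    def step(self, epoch: int, val_loss: Optional[float] = None) -> float:
        """Return the learning rate for a 0-indexed epoch.

        val_loss is the most recent validation loss; it is ignored during warmup.
        """
        if epoch < self.warmup_epochs:
            return self.base_lr * (epoch + 1) / self.warmup_epochs
        if val_loss is not None:
            if val_loss < self.best:
                self.best = val_loss
                self.bad_epochs = 0
            else:
                self.bad_epochs += 1
                if self.bad_epochs > self.patience:
                    self.scale *= self.factor
                    self.bad_epochs = 0
        return max(self.base_lr * self.scale, self.min_lr)


# Hypothetical run: warmup to 1e-3, then two stalled epochs trigger one reduction.
sched = WarmupThenPlateau(base_lr=1e-3, patience=1)
warmup_lrs = [sched.step(e) for e in range(5)]
after = [sched.step(5, 1.0), sched.step(6, 1.1), sched.step(7, 1.2)]
```

In a PyTorch training loop the same behavior is commonly obtained by combining a warmup scheduler with `torch.optim.lr_scheduler.ReduceLROnPlateau`.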
The proposed DW-Loss is employed, and consistent experimental settings, including input resolution, number of training epochs, optimizer configuration, and data splits, are maintained across all compared deep learning models. For methods originally designed for different task settings, their core architectures are retained while the output layers are uniformly replaced with a 20-dimensional Sigmoid classification head to conform to the DIOR multi-label protocol. Additionally, identical data augmentation strategies, training epochs, and validation protocols are applied to ensure fair and consistent comparisons.
The DSR module is implemented as a Transformer encoder with four layers, each with eight attention heads and a hidden embedding dimension of 256. A dropout rate of 0.1 is applied to prevent overfitting. The label embedding dimension is set to 20, corresponding to the number of categories in the multi-label classification task.
The DW-Loss, formulated as a difficulty-weighted binary cross-entropy, incorporates two complementary weighting mechanisms to improve learning on rare and challenging categories. First, the category frequency weight $w_{freq}(c)$ is calculated as the ratio of negative to positive samples for each category in the training set, with a small smoothing constant $\epsilon = 1 \times 10^{-6}$ to avoid numerical instability. Second, the prior difficulty coefficient $\alpha_{hard}(c)$ is empirically assigned according to target scale, geometric complexity, and observed training difficulty. Categories that are inherently challenging, such as Vehicle, Bridge, and Overpass, are assigned higher coefficients within the range [1.0, 2.5] to emphasize their contribution during training. The final weighting vector applied in the BCE loss is obtained by multiplying the category frequency weight and the difficulty coefficient element-wise, with a maximum clamp of 50 to maintain training stability.
This configuration ensures that the MSDR-Net framework effectively emphasizes complex, small-scale, or semantically ambiguous categories, while maintaining stable convergence for frequently occurring classes.
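The construction of the final weighting vector can be sketched as follows. The smoothing constant and the clamp value follow the settings stated above, whereas the per-category counts and difficulty coefficients are hypothetical, and the function name `positive_weights` is illustrative.

```python
from typing import List


def positive_weights(n_pos: List[int], n_neg: List[int], alpha_hard: List[float],
                     eps: float = 1e-6, max_weight: float = 50.0) -> List[float]:
    """Per-category weight: (N_neg / (N_pos + eps)) * alpha_hard, clamped at max_weight."""
    return [min(neg / (pos + eps) * alpha, max_weight)
            for pos, neg, alpha in zip(n_pos, n_neg, alpha_hard)]


# Hypothetical counts for three categories in a 10,000-image training set.
n_pos = [8000, 400, 100]
n_neg = [2000, 9600, 9900]
alpha_hard = [2.5, 1.0, 1.5]   # hypothetical coefficients within [1.0, 2.5]

weights = positive_weights(n_pos, n_neg, alpha_hard)
```

In this example the frequent but difficult first category receives a weight below 1 despite its high coefficient, the second is weighted by frequency alone, and the third reaches the clamp, which shows why the upper bound matters for very rare categories.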

4.4. Performance Evaluation and Ablation Studies

4.4.1. Overall Performance and Fine-Grained Category Analysis

Table 1 presents the overall performance statistics of MSDR-Net on the DIOR validation set. Based on repeated experiments across five random seeds, the proposed model achieves a favorable balance between precision and recall in the multi-label classification task, indicating strong overall discriminative capability while maintaining robustness in both label-level consistency and class-level balance.
To further evaluate category-level performance, Table 2 reports the AP and HA results for all 20 DIOR categories. Overall, MSDR-Net achieves high AP and HA values for most categories, indicating strong category discrimination and stable label-wise prediction consistency. For categories with distinctive structures and large spatial scales, such as Airplane, Airport, Baseball Field, Chimney, Ship, Stadium, and Windmill, the AP values are close to 1.0, while the HA values are around 0.99. This demonstrates the model’s stable recognition of salient, well-structured targets. For medium-scale structured categories, including Dam, Expressway Service Area, Expressway Toll Station, Golf Field, Harbor, Tennis Court, and Train Station, most AP values exceed 0.97, further confirming the robustness of MSDR-Net for regular scene objects. Relatively lower AP values are observed for Bridge, Overpass, and Vehicle, at 0.8543, 0.8709, and 0.8444, respectively. These categories are more challenging due to small object sizes, large appearance variations, and strong background coupling. In particular, Vehicle obtains a lower HA of 0.9014, indicating that dense small-object recognition remains difficult. For relatively imbalanced categories such as Storage Tank and Train Station, MSDR-Net still maintains high AP and HA values, suggesting that DW-Loss helps improve learning for rare and difficult categories.
As illustrated in Figure 7, the Transformer module’s attention responses are visualized to evaluate MSDR-Net’s spatial attention capability in multi-label remote sensing image classification. Since the feature pyramid produces a 7 × 7 feature representation, the resulting heatmaps correspond to coarse-grained token-level semantic responses, which are subsequently upsampled and overlaid onto the original images for visualization. The high-attention regions effectively cover both large target areas and densely distributed small-scale objects, while also accurately focusing on critical regions of structurally organized targets with linear or block-like spatial distributions. In complex multi-label scenarios, multiple semantic regions can be simultaneously attended, indicating that the combination of multiscale feature representation and Transformer-based semantic reasoning effectively captures target co-occurrence patterns and inter-class spatial relationships.
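The visualization procedure can be sketched without any imaging library. The text only states that the 7 × 7 responses are upsampled and overlaid, so the nearest-neighbor interpolation, the min-max normalization, the blending factor, and the function names used here are assumptions, and the attention grid is a synthetic placeholder.

```python
from typing import List

Grid = List[List[float]]


def normalize(grid: Grid) -> Grid:
    """Rescale values to [0, 1] by min-max normalization."""
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    span = (hi - lo) or 1.0   # avoid division by zero for a constant grid
    return [[(v - lo) / span for v in row] for row in grid]


def upsample_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbor upsampling of a coarse token grid to image resolution."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def overlay(image: Grid, heat: Grid, alpha: float = 0.5) -> Grid:
    """Blend a grayscale image with a heatmap of the same size."""
    return [[(1.0 - alpha) * p + alpha * h for p, h in zip(img_row, heat_row)]
            for img_row, heat_row in zip(image, heat)]


# Synthetic 7 x 7 attention grid standing in for the token-level responses.
attn = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
heat = upsample_nearest(normalize(attn), 512, 512)
```

Because each token covers roughly a 73 × 73 pixel block at this resolution, the resulting map indicates regions rather than precise object boundaries.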
Overall, MSDR-Net demonstrates superior performance on both global and fine-grained target categories. Large-scale and common categories are accurately recognized, while substantial improvements are achieved for small objects, structurally complex targets, and rare categories. Furthermore, the attention visualization results demonstrate that the proposed model can simultaneously focus on multiple targets and semantically important regions, further validating the synergistic advantages of multiscale feature fusion, Transformer-based semantic reasoning, and DW-Loss.

4.4.2. Ablation Study

To further investigate the underlying mechanisms contributing to the performance improvements of MSDR-Net and to validate the necessity of each component within the network architecture and training strategy, a systematic cumulative ablation study is conducted. The experiments are conducted using ResNet-34 as the baseline model. Under consistent hyperparameter settings, the Transformer-based global attention module, multiscale feature pyramid, DW-Loss, and data augmentation strategies are incrementally incorporated, allowing a quantitative evaluation of the cumulative contributions of each component to the final classification performance.
The baseline model employs ResNet-34 as the feature extractor without incorporating additional attention mechanisms or multiscale fusion modules, and is optimized using the BCE loss. As shown in Table 3, this configuration achieves an mAP of 90.71%. Although the backbone provides reasonable feature extraction capability, its representation of small, densely distributed targets remains limited in remote sensing scenarios characterized by significant scale variations and complex background interference. In Exp-1, a Transformer-based global attention module is incorporated into the baseline model. By leveraging self-attention to capture long-range dependencies, the model’s ability to model contextual relationships among objects is enhanced. As a result, the mAP improves from 90.71% to 92.3% (approximately +1.6 percentage points). This improvement highlights the importance of global contextual information for understanding complex remote sensing scenes, particularly in distinguishing categories with similar local features but distinct global semantics. In Exp-2, the FPN is further integrated based on Exp-1. By combining deep semantic information with shallow spatial details, the model’s ability to represent multiscale targets is significantly enhanced. Consequently, the mAP increases to 94.8% (+2.5 percentage points compared to Exp-1), demonstrating the effectiveness of multiscale feature fusion in addressing large variations in scale, particularly for structurally distinctive targets such as bridges and overpasses. To address class imbalance and hard-sample challenges in remote sensing datasets, the proposed DW-Loss is introduced in Exp-3. By dynamically adjusting category- and sample-level weights, the model is guided to focus more on rare and difficult samples. As shown in Table 3, the mAP further increases to 95.5% (+0.7 percentage points compared to Exp-2). Although the overall gain is moderate, substantial improvements are observed for difficult and long-tail categories such as vehicles and storage tanks, effectively mitigating category bias.
In the final configuration, customized data augmentation strategies, including random cropping, flipping, and rotation, are incorporated to increase the diversity of the training data. This leads to improved generalization and robustness, resulting in a final mAP of 95.88% (approximately +0.4 percentage points compared to Exp-3). Overall, the progressive ablation results show that each component of MSDR-Net, from the global attention mechanism and multiscale feature fusion to tailored loss optimization and data augmentation strategies, contributes positively to overall performance. These components operate synergistically to form a high-performance framework for multi-label remote sensing image classification.

4.5. Cross-Dataset Validation on MLRSNet

Table 4 summarizes the overall performance metrics of MSDR-Net on the MLRSNet validation set, including mAP, HA, OF1, CF1, CP, CR, OP, and OR, together with the corresponding standard deviations and 95% confidence intervals. To further evaluate category-level performance, Table 5 presents the AP and HA results for all 20 categories in MLRSNet. Overall, MSDR-Net achieves high AP and HA values across most categories, demonstrating strong category discrimination and stable label-wise prediction consistency.
For large-scale, geometrically distinctive categories, such as Airport, Baseball Diamond, Basketball Court, Harbor, Golf Course, Terrace, Storage Tank, Transmission Tower, and Wind Turbine, AP values are above 0.98, and HA values are around 0.99, confirming that MSDR-Net can reliably recognize prominent, clearly structured targets. For medium-scale structured categories, including Freeway, Ground Track Field, Railway Station, Shipping Yard, and Stadium, AP values remain above 0.95, further demonstrating the robustness of MSDR-Net on regular scene objects. Small-scale and challenging categories, such as Parking Lot, Vehicle, Bridge, Overpass, and Railway, have relatively lower AP values (0.8950–0.9807) and slightly lower HA values (0.9097–0.9863) due to their dense spatial distribution, large appearance variation, and complex backgrounds. Nevertheless, compared to prior benchmarks, MSDR-Net achieves noticeable improvement in small-scale target recognition, indicating the effectiveness of multiscale feature enhancement and DW-Loss in improving small and difficult object predictions. In addition, long-tail categories such as Storage Tank and Railway Station, despite having relatively fewer samples, still achieve high AP and HA, demonstrating that MSDR-Net mitigates the influence of long-tail category imbalance through difficulty-weighted optimization. Collectively, these results show that MSDR-Net effectively handles multiscale objects, models complex semantic dependencies, and addresses long-tail distribution issues.
Compared to DIOR, MLRSNet shows higher AP for medium and large-scale categories, likely due to more balanced class distributions and higher image quality, while small-scale and long-tail categories remain more challenging, confirming the importance of global reasoning and DW-Loss in these scenarios.

4.6. Comparative Experiments

To comprehensively evaluate the effectiveness and competitiveness of MSDR-Net, comparisons are conducted under the DIOR multi-label protocol against a diverse set of approaches, including traditional machine learning models, classical deep learning models, relation modeling methods, and recent multiscale and semantic enhancement techniques. Specifically, the compared methods include Support Vector Machine (SVM) [27], Extremely Randomized Trees (ERT) [27], CNN [28], ResNet-34 [29], Relation Network [30], Deep Multi-Attention [31], MSCA [12], and SFNet [21]. It should be noted that the original task settings of methods such as Relation Network, Deep Multi-Attention, MSCA, and SFNet are not fully aligned with the multi-label classification setting considered in this study. Therefore, to ensure fair comparison, their core architectures are re-implemented under a unified experimental protocol, including consistent input resolution, optimizer configuration, training epochs, data augmentation strategies, and a standardized multi-label Sigmoid classification head. The comparative results are reported in Table 6.
As illustrated in Figure 8, traditional machine learning methods exhibit substantially inferior performance overall. Specifically, Support Vector Machine (SVM) and Extremely Randomized Trees (ERT) achieve mAP values of 68.42% and 73.55% on the validation set, respectively. This indicates that methods relying on hand-crafted features are insufficient for capturing the complex spatial structures and semantic co-occurrence patterns present in high-resolution remote sensing imagery. In contrast, deep learning-based approaches significantly improve classification performance. The CNN achieves an mAP of 86.37%, while ResNet-50 further improves it to 91.16%, validating the effectiveness of deep convolutional architectures and residual learning in feature representation. However, these methods primarily rely on local receptive fields, limiting their ability to model long-range semantic dependencies in multi-label scenarios.
To further enhance multi-label classification performance, subsequent methods incorporate relation modeling and attention mechanisms. The Relation Network explicitly models inter-label dependencies, achieving an mAP of 92.46% on the validation set, while Deep Multi-Attention enhances regional feature representations through multi-branch attention, achieving an mAP of 91.88%. Overall, these approaches raise the mAP to approximately 92%, demonstrating that modeling label relationships and incorporating attention mechanisms effectively enhance multi-label discrimination capability. However, these methods still exhibit notable instability, as reflected in relatively high standard deviations of 0.56–0.68, indicating increased sensitivity to random initialization.
In recent years, research has shifted from relational modeling toward integrated, multiscale, and semantic fusion frameworks. MSCA leverages sparse cross-scale attention to enable multiscale feature interaction, achieving an mAP of 94.24 ± 0.41 on the validation set, while SFNet enhances category discrimination through semantic-assisted feature fusion, achieving 93.85 ± 0.37. Although these approaches effectively mitigate challenges arising from scale variations in remote sensing imagery, a unified optimization framework that jointly models multiscale information, semantic relationships, and long-tail distributions remains lacking.
Compared with the aforementioned methods, the proposed MSDR-Net achieves the highest validation mAP of 95.88 ± 0.29, outperforming all competing approaches. Meanwhile, a training mAP of 97.87 ± 0.27 is obtained, giving a train–validation gap of approximately 1.99 percentage points, which is noticeably smaller than that of other deep models. This indicates that, while maintaining strong fitting capability, the model effectively suppresses overfitting and demonstrates superior generalization performance. In terms of performance gains, MSDR-Net improves mAP by approximately 1.64 percentage points compared with the current best-performing method, MSCA, and by 2.03 percentage points compared with SFNet. Notably, an improvement exceeding 1.5 percentage points is still achieved in the high-performance regime (above 94% mAP), further demonstrating the effectiveness of the proposed approach.
Table 7 presents the AP results of MSDR-Net and MSCA for each category. Overall, MSDR-Net outperforms MSCA across most categories, particularly for challenging categories with small-scale objects or long-tail distributions. For instance, small-scale targets such as Vehicle, Bridge, and Overpass exhibit notable improvements, with MSDR-Net achieving AP values of 0.8370, 0.8429, and 0.8941, respectively, compared to 0.7884, 0.7240, and 0.7948 for MSCA. These gains indicate that MSDR-Net’s multiscale feature enhancement and dynamic semantic reasoning effectively capture subtle spatial details, improving recognition of small or densely distributed targets.
For long-tail categories, including Storage Tank, Tennis Court, and Train Station, MSDR-Net also demonstrates superior performance (AP = 0.9509, 0.9767, and 0.9932) relative to MSCA (AP = 0.9138, 0.9513, and 0.9672), highlighting the effectiveness of the DW-Loss in mitigating class imbalance and enhancing learning on rare categories. Large-scale categories, such as Airplane, Baseball Field, Ship, and Windmill, are already well recognized by both methods, but MSDR-Net still provides small yet consistent improvements.
In summary, these results indicate that MSDR-Net achieves notable gains on small-scale, difficult, and long-tail categories, while maintaining or slightly improving performance on large and medium-scale categories. This underscores the advantages of integrating multiscale feature extraction, dynamic semantic reasoning, and difficulty-weighted loss in a unified framework.

4.7. Computational Complexity Analysis

To further evaluate the practical applicability of the proposed method, the computational complexity of MSDR-Net was compared with representative baseline and competing methods, including ResNet-34, SFNet, and MSCA. The comparison was conducted under the same input resolution of 3 × 512 × 512. The number of parameters, FLOPs, single-image inference time, and frames per second (FPS) were reported. FLOPs were calculated using THOP, and inference time was measured on the same GPU platform with a batch size of 1. Data loading and image preprocessing were excluded from the timing process.
As shown in Table 8, ResNet-34 has the lowest inference cost among the compared CNN-based baselines, with 21.80 M parameters, 19.12 GFLOPs, and an average inference time of 6.02 ms per image. However, its mAP is only 90.71%, indicating that the standard backbone alone is insufficient for classifying complex multi-label remote sensing images. SFNet achieves better classification performance than ResNet-34 with relatively low parameters and inference time, but its mAP remains lower than that of MSDR-Net.
Compared with MSCA, MSDR-Net achieves a better trade-off between accuracy and efficiency. Although the parameter count of MSDR-Net is slightly higher than that of MSCA, increasing from 38.32 M to 39.46 M, the FLOPs are reduced from 35.54 GFLOPs to 24.61 GFLOPs, corresponding to a reduction of approximately 30.8%. Meanwhile, the average inference time decreases from 12.02 ms to 8.34 ms, a reduction of approximately 30.6% in per-image latency. More importantly, MSDR-Net achieves a higher mAP of 95.88%, outperforming MSCA by approximately 1.64 percentage points.
These results indicate that the proposed MSDR-Net does not simply improve performance by introducing excessive computational overhead. Instead, by efficiently fusing multiscale features, employing Transformer-based semantic reasoning, and applying difficulty-weighted optimization, MSDR-Net achieves higher classification accuracy while maintaining moderate computational complexity. The average inference speed of 119.9 FPS further demonstrates that the proposed method has practical potential for high-resolution remote sensing image interpretation, especially in offline and near-real-time application scenarios.
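The efficiency figures quoted in this subsection follow from simple arithmetic on the reported measurements, as the short check below shows. The input numbers are taken from the text; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100.0


# Reported values for MSCA (before) and MSDR-Net (after).
flops_reduction = relative_reduction(35.54, 24.61)    # GFLOPs
latency_reduction = relative_reduction(12.02, 8.34)   # ms per image
fps = 1000.0 / 8.34                                   # images per second at batch size 1
speedup = 12.02 / 8.34                                # throughput ratio vs. MSCA

print(f"FLOPs reduction:   {flops_reduction:.1f}%")
print(f"Latency reduction: {latency_reduction:.1f}%")
print(f"Throughput:        {fps:.1f} FPS ({speedup:.2f}x MSCA)")
```

A 30.6% reduction in latency corresponds to a throughput about 1.44 times that of MSCA, which is why the two quantities should not be used interchangeably.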

4.8. Discussion

An end-to-end MSDR-Net is proposed to address three major challenges in multi-label remote sensing image classification: multiscale variation, complex semantic dependencies, and long-tail category distributions. Extensive experiments conducted on the DIOR and MLRSNet datasets demonstrate the effectiveness of the proposed framework. Existing approaches generally focus on isolated optimization objectives, where CNN-based methods emphasize multiscale feature extraction, while Transformer-based methods primarily focus on long-range dependency modeling. In contrast, MSDR-Net integrates MSFE, DSR, and DW-Loss within a single end-to-end framework. On the DIOR dataset, MSDR-Net achieves an mAP of 95.88%, outperforming the state-of-the-art MSCA method by approximately 1.64 percentage points. Notable improvements are observed for small-scale targets, such as vehicles and bridges, as well as long-tail categories, including storage tanks, demonstrating the robustness of the proposed joint modeling strategy. Compared with traditional machine learning methods, such as SVM, deep learning-based approaches exhibit substantially stronger capability in modeling the complex spatial structures of high-resolution remote sensing imagery, further highlighting the limitations of hand-crafted feature representations.
Beyond performance improvements, the proposed framework also provides methodological insights for multi-label remote sensing image classification. Experimental analysis indicates that simply increasing network depth, such as adopting ResNet-34, is insufficient for effectively addressing missed detections of small-scale targets. After introducing the Transformer-based DSR module, long-range contextual dependencies can be captured through global attention mechanisms, enabling improved recognition of occluded and spatially scattered targets. This observation is consistent with findings reported in Vision Transformer studies, where global contextual modeling has been shown to play a critical role in visual understanding.
Despite the promising performance achieved by MSDR-Net, several limitations remain. Although the inference speed reaches 119.9 FPS through architectural optimization, the incorporation of FPN/PAN structures and the Transformer encoder increases the model size to 39.46 M parameters, which is larger than that of lightweight CNN models such as ResNet-34. This increased computational cost may restrict real-time deployment on resource-constrained edge devices, such as small unmanned aerial vehicles. In addition, for extremely dense and small-scale targets, such as vehicles, the HA still leaves room for improvement, indicating that feature discriminability under severe background clutter requires further enhancement. Future work will focus on lightweight model design, including knowledge distillation and neural architecture search, as well as self-supervised and large-scale pretraining strategies, to reduce computational cost while maintaining high accuracy and improving generalization capability under challenging remote sensing scenarios.

5. Conclusions

To address the challenges of multiscale variation, complex semantic dependencies, and long-tail distributions in multi-label remote sensing image classification, an end-to-end framework, MSDR-Net, is proposed that integrates MSFE, DSR, and DW-Loss into a unified architecture for representation learning and imbalance-aware optimization. The MSFE module enables robust multiscale feature extraction through residual learning and feature pyramid fusion, while the DSR module captures long-range dependencies and inter-class correlations via positional encoding, enhancing global semantic modeling. The proposed DW-Loss further improves robustness by dynamically reweighting category contributions, effectively mitigating long-tail and hard-sample issues. Extensive experiments on DIOR and MLRSNet demonstrate that MSDR-Net achieves favorable multi-label classification performance and shows promising robustness and generalization potential on the evaluated datasets.
Despite its favorable performance on the DIOR multi-label remote sensing image classification task, MSDR-Net still has several limitations. First, due to the introduction of FPN/PAN, the Transformer encoder, and multiscale feature enhancement structures, the model’s complexity remains higher than that of lightweight CNN-based models, and further optimization is required for deployment on resource-constrained edge platforms. In addition, the recognition of small-scale, densely distributed, or background-coupled categories, such as Vehicle, Bridge, and Overpass, still leaves room for further improvement. Future work will explore lightweight model design, self-supervised or large-scale pretraining strategies, and multimodal data integration to enhance scalability and practical applicability.

Author Contributions

Conceptualization, Q.S. and H.W.; methodology, Q.S.; resources, S.W.; writing—original draft preparation, Q.S. and T.Y.; writing—review and editing, Q.S., H.W., S.W., T.Y., H.Z. and X.F.; supervision, Q.S. and X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly funded by the National Natural Science Foundation of China (grant No. 62403476).

Data Availability Statement

The datasets analyzed in this study are publicly available benchmark datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qi, X.; Zhu, P.; Wang, Y.; Zhang, L.; Peng, J.; Wu, M.; Chen, J.; Zhao, X.; Zang, N.; Mathiopoulos, P.T. MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding. ISPRS J. Photogramm. Remote Sens. 2020, 169, 337–350. [Google Scholar] [CrossRef]
  2. Möllenbrok, L.; Sumbul, G.; Demir, B. Deep Active Learning for Multi-Label Classification of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5002405. [Google Scholar] [CrossRef]
  3. Lin, Y.; He, L.; Zhong, D.; Song, Y.; Wei, L.; Xin, L. A High Spatial Resolution Aerial Image Dataset and an Efficient Scene Classification Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5616413. [Google Scholar] [CrossRef]
  4. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  5. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259. [Google Scholar] [CrossRef] [PubMed]
  6. Zhou, B.; Cui, Q.; Wei, X.S.; Chen, Z.M. BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition. In Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9719–9728. [Google Scholar]
  7. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision 2018—ECCV, Munich, Germany, 6 October 2018; pp. 833–851. [Google Scholar]
  8. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9 June 2019; pp. 6105–6114. [Google Scholar]
  9. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3 May 2021. [Google Scholar]
  10. Li, B.; Gong, L.; Wang, Q.; Guo, X.; Li, Z. Spatial-Topological-Semantic alignment for cross-domain scene classification of remote sensing images with few source labels. Int. J. Appl. Earth Obs. Geoinf. 2024, 135, 104313. [Google Scholar] [CrossRef]
  11. Pandey, S.; Echuri, P.; Vemulapalli, V.M.; Chakraborty, S. Multi-Label Classification in Remote Sensing: Leveraging High-Resolution Patches for Low-Resolution Satellite Images. In Proceedings of the 2026 IEEE Conference on Computer Vision and Pattern Recognition, Denver, CO, USA, 20–25 June 2026; pp. 1512–1520. [Google Scholar]
  12. Ma, J.; Jiang, W.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Multiscale Sparse Cross-Attention Network for Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5605416. [Google Scholar] [CrossRef]
  13. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. In Proceedings of the 2021 IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14449–14458. [Google Scholar]
  14. Zhang, X.; Hong, W.; Li, Z.; Cheng, X.; Tang, X.; Zhou, H.; Jiao, L. Hierarchical Knowledge Graph for Multilabel Classification of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5645714. [Google Scholar] [CrossRef]
  15. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23 August 2020; pp. 213–229. [Google Scholar]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Houlsby, N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3 May 2021. [Google Scholar]
  17. Ou, S.; Xue, Z.; Li, Y.; Liang, M.; Cai, Y.; Wu, J. View-Category Interactive Sharing Transformer for Incomplete Multi-View Multi-Label Learning. In Proceedings of the 2024 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27457–27466. [Google Scholar]
  18. Cao, M.; Xie, W.; Zhang, X.; Zhang, J.; Jiang, K.; Lei, J.; Li, Y. M3amba: CLIP-driven Mamba Model for Multimodal Remote Sensing Classification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 7605–7617. [Google Scholar] [CrossRef]
  19. Ji, J.; Jing, W.; Chen, G.; Lin, J.; Song, H. Multi-Label Remote Sensing Image Classification with Latent Semantic Dependencies. Remote Sens. 2020, 12, 1110. [Google Scholar] [CrossRef]
  20. Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; Belongie, S.J. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9260–9269. [Google Scholar]
  21. Du, R.; Tang, X.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. Semantic-Assisted Feature Integration Network for Multilabel Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5603015. [Google Scholar] [CrossRef]
  22. Wang, Q.; Ye, H.; Liang, D.; Huang, S.J. Diffusion-Noise-Based Augmentation for Long-Tailed Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5626114. [Google Scholar] [CrossRef]
  23. Ridnik, T.; Ben-Baruch, E.; Zamir, N.; Noy, A.; Friedman, I.; Protter, M.; Zelnik-Manor, L. Asymmetric Loss For Multi-Label Classification. In Proceedings of the 2021 IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10 October 2021; pp. 82–91. [Google Scholar]
  24. Lin, D.; Peng, T.; Chen, R.; Xie, X.; Qin, X.; Cui, Z. Distributionally Robust Loss for Long-Tailed Multi-label Image Classification. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September 2024; pp. 417–433. [Google Scholar]
  25. Zhang, W.; Liu, C.; Zeng, L.; Ooi, B.C.; Tang, S.; Zhuang, Y. Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels. In Proceedings of the 2023 IEEE International Conference on Computer Vision, Paris, France, 2 October 2023; pp. 1423–1432. [Google Scholar]
  26. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  27. Sheykhmousa, M.; Mahdianpari, M.; Ghanbari, H.; Mohammadimanesh, F.; Ghamisi, P.; Homayouni, S. Support Vector Machine Versus Random Forest for Remote Sensing Image Classification: A Meta-Analysis and Systematic Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 6308–6325. [Google Scholar] [CrossRef]
  28. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June 2016; pp. 770–778. [Google Scholar]
  30. Hua, Y.; Mou, L.; Zhu, X.X. Relation Network for Multilabel Aerial Image Classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4558–4572. [Google Scholar] [CrossRef]
  31. Sumbul, G.; Demİr, B. A Deep Multi-Attention Driven Approach for Multi-Label Remote Sensing Image Classification. IEEE Access 2020, 8, 95934–95946. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of MSDR-Net integrating multiscale feature extraction (FPN+PAN), feature homogenization, Transformer-based semantic reasoning, and DW-Loss for class-balanced multi-label classification.
Figure 2. Residual learning structure for deep feature representation with stacked convolutions and shortcut connections.
Figure 3. Multiscale convolutional feature extraction with parallel branches of different receptive fields for hierarchical feature aggregation.
Figure 4. Structure of the multiscale feature pyramid fusion and fixed-scale representation.
Figure 5. Dynamic Semantic Reasoning module: fixed-scale features combined with positional encodings and label embeddings, processed by self-attention layers for global semantic dependencies and inter-class correlations.
Figure 6. Examples of multi-label annotations in the Dataset for Object Detection in Optical Remote Sensing Images (DIOR). Colored dots above each subfigure correspond to the ground-truth categories listed in the legend, and the text below each image gives the annotated labels. (a) A sports complex scene containing Vehicle, Tennis Court, and Ground Track Field. (b) A transportation-service scene containing Basketball Court, Vehicle, and Expressway Service Area. (c) An airport scene containing Vehicle and Airplane. (d) A mixed sports-field scene containing Baseball Field, Tennis Court, Basketball Court, and Vehicle. (e) A single-label maritime scene containing Ship. (f) A single-label rural scene containing Windmill. (g) A large-scale land-cover scene containing Expressway Service Area and Golf Field. (h) A multi-object sports scene containing Baseball Field, Tennis Court, and Vehicle.
Figure 7. Transformer attention responses on the DIOR dataset. (a) Original remote sensing image. (b) Corresponding attention-overlay visualization generated by the Transformer-based dynamic semantic reasoning module. The class names below each image pair indicate the ground-truth labels of the multi-label scene. The attention heatmap is superimposed on the original image, where blue regions represent low attention responses and red/yellow regions represent high attention responses. The visualizations show that the proposed model focuses on discriminative object regions, such as vehicles, ships, airports, harbors, storage tanks, chimneys, dams, and sports fields, while suppressing irrelevant background areas.
Figure 8. Overall accuracy comparison: MSDR-Net achieves 95.88%, outperforming traditional machine learning, CNN-based, relation-and-attention-based, and multiscale fusion methods.
Table 1. Overall performance metrics of the proposed method on the DIOR validation set.

| | mAP | HA | OF1 | CF1 | CP | CR | OP | OR |
|---|---|---|---|---|---|---|---|---|
| Value μ ± σ | 0.9588 ± 0.0029 | 0.9724 ± 0.0018 | 0.8241 ± 0.0051 | 0.8738 ± 0.0042 | 0.7942 ± 0.0054 | 0.8746 ± 0.0041 | 0.7775 ± 0.0057 | 0.9589 ± 0.0025 |
| 95% CI | [0.9570, 0.9642] | [0.9602, 0.9646] | [0.8178, 0.8304] | [0.8686, 0.8790] | [0.7875, 0.8009] | [0.8695, 0.8797] | [0.7704, 0.7846] | [0.9558, 0.9620] |
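The tables report each metric as a mean and standard deviation over repeated runs together with a 95% confidence interval, but the interval formula is not stated in this part of the article. One common choice for a small number of runs is a Student-t interval, sketched below under the assumption of five runs (the seed count listed in Table 3). The function name `t_confidence_interval` and the lookup table `T_975` are illustrative. Under this assumption, the OF1 row of Table 1 (0.8241 ± 0.0051) gives the interval [0.8178, 0.8304], the same as the tabulated one.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_975 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}

def t_confidence_interval(mean, std, n):
    """95% CI for the mean of n runs, given the sample standard deviation."""
    half_width = T_975[n - 1] * std / math.sqrt(n)
    return mean - half_width, mean + half_width

# OF1 on DIOR as reported in Table 1; n = 5 runs is an assumption.
low, high = t_confidence_interval(0.8241, 0.0051, 5)
print(f"[{low:.4f}, {high:.4f}]")  # prints [0.8178, 0.8304]
```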
Table 2. Detailed statistics of the AP and HA for each category over five repeated experiments.

| Category | AP | HA |
|---|---|---|
| Airplane | 0.9905 | 0.9978 |
| Airport | 0.9918 | 0.9957 |
| Baseball Field | 0.9917 | 0.9945 |
| Basketball Court | 0.9354 | 0.9859 |
| Bridge | 0.8543 | 0.9667 |
| Chimney | 0.9967 | 0.9991 |
| Dam | 0.9779 | 0.9940 |
| Expressway Service Area | 0.9882 | 0.9961 |
| Expressway Toll Station | 0.9747 | 0.9927 |
| Golf Field | 0.9669 | 0.9945 |
| Ground Track Field | 0.9511 | 0.9765 |
| Harbor | 0.9777 | 0.9932 |
| Overpass | 0.8709 | 0.9680 |
| Ship | 0.9929 | 0.9914 |
| Stadium | 0.9974 | 0.9987 |
| Storage Tank | 0.9463 | 0.9885 |
| Tennis Court | 0.9744 | 0.9906 |
| Train Station | 0.9889 | 0.9979 |
| Vehicle | 0.8444 | 0.9014 |
| Windmill | 1 | 0.9999 |
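For readers who want to see how a per-category AP and the resulting mAP are typically obtained, the sketch below uses the standard ranking-based definition: AP is the mean of the precision values at the ranks of the positive images, and mAP is the unweighted mean of AP over categories. Whether the authors use exactly this variant (as opposed to an interpolated one) is not stated here, so treat it as an assumption. The scores and labels are made-up toy data.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive image."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_matrix, label_matrix):
    """mAP and per-class AP; both matrices are indexed as [image][class]."""
    n_classes = len(score_matrix[0])
    aps = []
    for c in range(n_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / n_classes, aps

# Toy data: 4 images, 2 classes (values invented for illustration).
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
m_ap, per_class = mean_average_precision(scores, labels)
```

In the toy data the first class has a negative image ranked between its two positives, so its AP falls below 1, while the second class ranks both positives first and reaches an AP of 1.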
Table 3. Results of stepwise cumulative ablation experiments for MSDR-Net (mAP %).

| No. | Network Configuration | Seeds | Validation mAP% | Validation 95% CI | Improvement |
|---|---|---|---|---|---|
| Baseline | ResNet34 | 5 | 90.71 ± 0.53 | [90.05, 91.37] | |
| Exp-1 | ResNet34+DSR | 5 | 92.31 ± 0.44 | [91.76, 92.86] | +1.60 |
| Exp-2 | FPN+DSR | 5 | 94.82 ± 0.38 | [94.44, 95.20] | +4.09 |
| Exp-3 | FPN+DSR+DW-Loss | 5 | 95.53 ± 0.35 | [95.18, 95.88] | +4.82 |
| MSDR-Net (Ours) | MSDR-Net | 5 | 95.88 ± 0.29 | [95.59, 96.17] | +5.17 |
Table 4. Overall performance metrics of the proposed method on the MLRSNet validation set.

| | mAP | HA | OF1 | CF1 | CP | CR | OP | OR |
|---|---|---|---|---|---|---|---|---|
| Value μ ± σ | 0.9805 ± 0.0016 | 0.9872 ± 0.0011 | 0.9044 ± 0.0091 | 0.9296 ± 0.0042 | 0.8881 ± 0.0068 | 0.9753 ± 0.0037 | 0.8451 ± 0.0064 | 0.9725 ± 0.0056 |
| 95% CI | [0.9789, 0.9821] | [0.9861, 0.9883] | [0.8954, 0.9135] | [0.9254, 0.9338] | [0.8813, 0.8949] | [0.9716, 0.9790] | [0.8387, 0.8515] | [0.9669, 0.9781] |
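The overall (OP, OR, OF1) and per-class (CP, CR, CF1) metrics are not defined in this part of the article. The sketch below follows the definitions commonly used in multi-label classification, which is an assumption here: overall metrics pool true positives, false positives, and false negatives across all classes, per-class metrics average precision and recall over classes, and each F1 is the harmonic mean of its precision and recall. The function names and toy predictions are illustrative. As a consistency check, applying the harmonic mean to the MLRSNet precision and recall values in Table 4 gives roughly 0.9043 for OF1 and 0.9297 for CF1, within rounding of the tabulated 0.9044 and 0.9296.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def safe_div(num, den):
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def overall_and_per_class(preds, labels):
    """Return (OP, OR, OF1, CP, CR, CF1) for 0/1 matrices indexed [image][class]."""
    n_classes = len(labels[0])
    tp, fp, fn = [0] * n_classes, [0] * n_classes, [0] * n_classes
    for pred_row, label_row in zip(preds, labels):
        for c, (p, y) in enumerate(zip(pred_row, label_row)):
            tp[c] += int(p == 1 and y == 1)
            fp[c] += int(p == 1 and y == 0)
            fn[c] += int(p == 0 and y == 1)
    op = safe_div(sum(tp), sum(tp) + sum(fp))   # pooled over classes
    o_r = safe_div(sum(tp), sum(tp) + sum(fn))
    cp = sum(safe_div(tp[c], tp[c] + fp[c]) for c in range(n_classes)) / n_classes
    cr = sum(safe_div(tp[c], tp[c] + fn[c]) for c in range(n_classes)) / n_classes
    return op, o_r, f1(op, o_r), cp, cr, f1(cp, cr)

# Toy thresholded predictions: 4 images, 2 classes (invented values).
preds = [[1, 0], [1, 1], [0, 1], [1, 0]]
labels = [[1, 0], [0, 1], [1, 1], [1, 0]]
op, o_r, of1, cp, cr, cf1 = overall_and_per_class(preds, labels)

# Harmonic means of the Table 4 precision/recall values.
of1_table4 = f1(0.8451, 0.9725)
cf1_table4 = f1(0.8881, 0.9753)
```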
Table 5. Detailed statistics of the AP and HA for each category over five repeated experiments.

| Category | AP | HA |
|---|---|---|
| Airplane | 0.9517 | 0.9888 |
| Airport | 0.9950 | 0.9929 |
| Baseball Field | 0.9993 | 0.9992 |
| Basketball Court | 0.9843 | 0.9767 |
| Bridge | 0.9847 | 0.9942 |
| Freeway | 0.9968 | 0.9975 |
| Golf Course | 0.9999 | 0.9963 |
| Ground Track Field | 0.9665 | 0.9754 |
| Harbor | 0.9999 | 0.9996 |
| Overpass | 0.9813 | 0.9908 |
| Parking Lot | 0.8950 | 0.9097 |
| Railway | 0.9807 | 0.9863 |
| Railway Station | 0.9529 | 0.9875 |
| Shipping Yard | 0.9881 | 0.9788 |
| Stadium | 0.9733 | 0.9921 |
| Storage Tank | 0.9999 | 0.9996 |
| Tennis Court | 0.9754 | 0.9938 |
| Terrace | 0.9999 | 0.9988 |
| Transmission Tower | 0.9949 | 0.9975 |
| Wind Turbine | 0.9998 | 0.9996 |
Table 6. Performance comparison of MSDR-Net and baseline methods on the DIOR validation set (mAP%).

| No. | Network | Seeds | Validation mAP% | Validation 95% CI |
|---|---|---|---|---|
| Exp-1 | SVM | 3 | 68.42 ± 0.08 | [68.36, 68.50] |
| Exp-2 | ERT | 3 | 73.55 ± 0.31 | [73.24, 73.86] |
| Exp-3 | CNN | 3 | 86.37 ± 0.71 | [94.44, 87.08] |
| Exp-4 | ResNet50 | 3 | 91.16 ± 0.46 | [90.70, 91.62] |
| Exp-5 | RelationNet | 3 | 92.46 ± 0.61 | [91.85, 93.07] |
| Exp-6 | Deep MultiAttention | 3 | 91.88 ± 0.68 | [91.20, 92.56] |
| Exp-7 | MSCA | 3 | 94.24 ± 0.41 | [93.83, 94.65] |
| Exp-8 | SFNet | 5 | 93.8 ± 0.37 | [93.43, 94.17] |
Table 7. The AP results of MSDR-Net and MSCA for each category.

| Category | MSDR-Net | MSCA |
|---|---|---|
| Airplane | 0.9903 | 0.9881 |
| Airport | 0.9843 | 0.9772 |
| Baseball Field | 0.9908 | 0.9825 |
| Basketball Court | 0.9345 | 0.9245 |
| Bridge | 0.8429 | 0.7240 |
| Chimney | 0.9877 | 0.9743 |
| Dam | 0.9765 | 0.9345 |
| Expressway Service Area | 0.9881 | 0.9549 |
| Expressway Toll Station | 0.9710 | 0.9464 |
| Golf Field | 0.9632 | 0.9574 |
| Ground Track Field | 0.9548 | 0.8903 |
| Harbor | 0.9800 | 0.9666 |
| Overpass | 0.8941 | 0.7948 |
| Ship | 0.9894 | 0.9798 |
| Stadium | 0.9970 | 0.9913 |
| Storage Tank | 0.9509 | 0.9138 |
| Tennis Court | 0.9767 | 0.9513 |
| Train Station | 0.9932 | 0.9672 |
| Vehicle | 0.8370 | 0.7884 |
| Windmill | 1 | 0.9999 |
Table 8. Computational complexity comparison of different methods under 512 × 512 input size.

| Method | Params (M) | FLOPs (G) | Inference Time (ms/Image) | FPS |
|---|---|---|---|---|
| ResNet-34 | 21.80 | 19.12 | 6.02 | 166.0 |
| SFNet | 16.26 | 21.57 | 6.67 | 149.9 |
| MSCA | 38.32 | 35.54 | 12.02 | 83.2 |
| MSDR-Net | 39.46 | 24.61 | 8.34 | 119.9 |
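The last two columns of Table 8 are related by simple arithmetic if throughput is measured one image at a time, which is an assumption here since the batch size is not given in this part of the article: FPS is then 1000 divided by the per-image latency in milliseconds. The short sketch below recomputes FPS from the tabulated latencies; the helper name `fps_from_latency` is illustrative.

```python
def fps_from_latency(ms_per_image):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / ms_per_image

# (latency in ms/image, reported FPS) copied from Table 8.
table8 = {
    "ResNet-34": (6.02, 166.0),
    "SFNet": (6.67, 149.9),
    "MSCA": (12.02, 83.2),
    "MSDR-Net": (8.34, 119.9),
}

derived = {name: fps_from_latency(ms) for name, (ms, _) in table8.items()}
for name, (_, reported) in table8.items():
    print(f"{name}: derived {derived[name]:.1f} FPS, reported {reported}")
```

The derived values agree with the reported FPS to within about 0.1, a gap that rounding of the latency figures can explain.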
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, Q.; Wang, H.; Wang, S.; Yang, T.; Zhao, H.; Fan, X. MSDR-Net: Multiscale Dynamic Reasoning for Multi-Label Remote Sensing Image Classification. Remote Sens. 2026, 18, 1798. https://doi.org/10.3390/rs18111798