1. Introduction
With the rapid advancement of remote sensing technologies and satellite sensors, high-resolution remote sensing imagery has become a fundamental data source for global land cover monitoring, urban management, disaster assessment, and ecological environment analysis [1,2]. The authoritative international report Earth Intelligence for All highlights that future intelligent analysis of remote sensing imagery requires substantial improvements in understanding complex semantic scenes and in information-extraction capabilities to support sustainable development and refined governance. High-resolution remote sensing imagery often captures complex scenes featuring coexisting multiscale and multi-category objects, rendering traditional single-label classification methods inadequate for practical applications. Consequently, multi-label remote sensing image classification has emerged as an important research direction in the intelligent interpretation of remote sensing imagery [3]. Identifying multiple semantic categories within a single image improves the accuracy of scene understanding and provides critical support for downstream tasks such as change detection, object recognition, and land cover analysis.
However, multi-label remote sensing image classification still faces two fundamental challenges: (i) the representation of multiscale objects struggles to capture local details and global semantics simultaneously [4]; and (ii) long-tail category distributions hinder the effective learning of low-frequency classes [5,6]. Although existing studies have achieved notable progress in feature fusion, semantic modeling, and long-tail optimization through deep networks, attention mechanisms, or data augmentation, most methods still address these challenges along a single dimension. As a result, a unified solution capable of jointly optimizing multiscale feature extraction, long-tail sample balancing, and complex semantic dependency modeling within an end-to-end framework is still lacking [4,5,6,7]. This limitation constitutes the primary bottleneck that the present study aims to address.
The Multiscale Dynamic Reasoning Network (MSDR-Net), an end-to-end multi-label classification network, is proposed to address the challenges posed by multiscale objects, complex semantic dependencies, and long-tail category distributions in remote sensing images. It establishes a task-driven unified modeling framework that integrates multiscale feature enhancement, label-aware dynamic semantic reasoning, and difficulty-weighted loss optimization, so that these three challenges are addressed jointly within a single architecture. The proposed network consists of three core modules. First, the Multiscale Feature Enhancement (MSFE) module is constructed from a ResNet-34 backbone with Feature Pyramid Network (FPN) fusion. Deep features are stabilized via residual representation learning, while multi-branch convolution captures both local and global semantic information across multiple scales.
Furthermore, a top-down cross-layer fusion mechanism is adopted to jointly model high-level semantics and low-level details, thereby significantly enhancing the representation of small-scale, fine-grained objects. Second, the Dynamic Semantic Reasoning (DSR) module is built on a Transformer encoder, which incorporates two-dimensional positional encoding and label embeddings. Feature dependencies within the image are adaptively modeled via a multi-head attention mechanism, while cross-layer multilayer perceptrons (MLPs) and Dropout are employed to improve multiscale feature representation and training stability. As a result, the modeling of multi-label semantic dependencies in complex scenarios is effectively enhanced. Finally, to address the long-tail distribution and hard-sample challenges commonly observed in remote sensing data, a Difficulty-Weighted Loss (DW-Loss) is proposed. Category frequency weights and prior difficulty coefficients are jointly incorporated to dynamically regulate the loss contributions of rare classes and hard samples during training, thereby enhancing the model’s focus on underrepresented and challenging categories. Based on the aforementioned modules, MSDR-Net achieves synergistic modeling of multiscale representation learning, label-aware semantic reasoning, and long-tail category optimization. Experimental results demonstrate that, while maintaining high overall classification accuracy, MSDR-Net significantly improves recognition robustness for complex scenes, small-scale targets, and rare categories, thereby validating the effectiveness of the proposed task-driven unified modeling framework.
The main contributions of this study are summarized as follows:
The MSFE module is proposed, which jointly models deep semantic information and shallow, detailed features to effectively capture multiscale object characteristics in remote sensing imagery, thereby improving the representation of small-scale and complex targets;
The DSR module is designed to adaptively model multi-label semantic dependencies based on a Transformer encoder and a multi-head attention mechanism, enabling efficient integration of global semantic information and local details in complex scenarios; and
DW-Loss is introduced within an end-to-end training framework, termed MSDR-Net, in which the loss contributions of long-tail categories and hard samples are dynamically regulated to enable collaborative optimization across multiple modules, thereby significantly enhancing the classification robustness of rare categories and small-scale objects.
2. Related Work
Multi-label remote sensing image classification (MLRIC) is primarily challenged by the diversity of land-cover categories, significant scale variations, long-tail distributions of categories, and latent semantic dependencies among labels. In complex scenarios, two issues remain fundamental: the representation of multiscale objects and label imbalance under long-tail distributions.
From a multiscale feature representation perspective, remote sensing imagery exhibits wide spatial coverage and complex viewing angles, where large-scale structures and small objects often coexist within a single image. This substantial scale variation imposes higher demands on feature representation. To address this issue, Tan and Le proposed a compound scaling strategy in EfficientNet to enhance cross-scale feature modeling capability; however, the computational complexity remains relatively high [8]. Subsequently, Zhu et al. introduced deformable attention in Deformable DETR, enabling flexible multiscale alignment through dynamic sampling across feature scales [9]. In the remote sensing domain, Li et al. presented a spatial-topological-semantic alignment paradigm to enhance domain adaptability for few-label cross-domain scene classification [10]. Pandey et al. explored a joint super-resolution and multi-label classification framework for remote sensing images, demonstrating that preserving high-resolution spatial details is beneficial for multi-label recognition in low-resolution satellite imagery [11]. More recently, Zhao et al. developed a multiscale sparse cross-attention network, in which sparse attention mechanisms facilitate the integration of local details and global context, leading to significant performance improvements in complex scenarios [12]. Nevertheless, it has been reported that repeated downsampling in deep convolutional networks leads to substantial resolution degradation, weakening the representation of small objects during feature fusion and imposing inherent limitations on multiscale modeling [13].
From a semantic dependency modeling perspective, multi-label remote sensing imagery typically exhibits pronounced semantic co-occurrence patterns. To capture such relationships, hierarchical semantic structures have been explored to model inter-category dependencies. For instance, Zhang et al. proposed a hierarchical knowledge graph-based approach that models multi-level semantic relationships to enhance understanding of multiscale objects [14]. Meanwhile, studies in general computer vision have demonstrated that Transformer-based architectures are highly effective in modeling complex semantic dependencies. Carion et al. established inter-object relationship modeling within the Transformer-based DETR framework, providing an effective paradigm for multi-semantic relationship learning [15]. Subsequently, Dosovitskiy et al. demonstrated that the Vision Transformer can effectively capture long-range semantic dependencies [16]. Inspired by these advances, Transformer-based interaction mechanisms have been introduced into remote sensing methods. Ou et al. proposed a view–category interactive sharing mechanism that jointly models multi-view information and category-level semantic relationships, effectively alleviating label dependency issues under incomplete annotation conditions [17]. Cao et al. introduced the pioneering CLIP-Mamba framework, integrating a pre-trained Vision-Language Model (CLIP) with a State Space Model (Mamba) for efficient and comprehensive feature fusion and semantic extraction [18]. In addition, Xia et al. proposed a latent semantic dependency model that jointly infers explicit and implicit label relationships to improve classification performance; however, its applicability in complex scenarios is constrained by limited feature representation stability and high computational cost [19]. Moreover, because deep backbones repeatedly downsample remote sensing imagery, small-object information is often lost, making it difficult for conventional Transformer architectures to achieve full-scale perception from local details to global semantics.
From the perspective of the long-tail distribution, large-scale remote sensing datasets commonly exhibit severe class imbalance, with low-frequency categories underrepresented and difficult to learn effectively. It has been demonstrated that such an imbalance leads to a pronounced bias toward high-frequency categories, thereby degrading generalization performance on rare classes [2]. To mitigate this issue, Cui et al. proposed a class-balanced loss based on the effective number of samples, in which class weights are redefined to counteract frequency imbalance; however, it remains insufficient to simultaneously address sample difficulty and distribution disparities in remote sensing scenarios [20]. Wang et al. introduced a diffusion-based noise augmentation strategy to improve tail-class performance, while Du et al. proposed a category-selective feature enhancement mechanism to enhance tail-class responses adaptively [21,22]. Wang et al. proposed DMRS, a foundation-model-based framework for long-tailed remote sensing scene recognition, further demonstrating the importance of robust representation learning for imbalanced remote sensing data [22]. In the broader computer vision domain, robust loss function designs, such as asymmetric loss and distributionally robust loss, have provided theoretical support for addressing long-tail learning [23,24]. In addition, Zhang et al. proposed a unified learning framework to address multi-label classification under long-tailed distributions and partial-label conditions [25]. Nevertheless, directly transferring these general approaches to remote sensing scenarios remains challenging, as long-tail distributions are often coupled with multiscale variations. Consequently, relying solely on loss optimization or data augmentation is insufficient to address the complexity of jointly modeling multiscale features and long-tail distributions.
Despite substantial progress in multiscale feature representation, semantic dependency modeling, and long-tail learning, most existing methods address these challenges separately. In complex multi-label remote sensing scenes, however, these issues are often coupled: small objects require fine-grained spatial details, label prediction depends on global semantic context and inter-class correlations, and rare categories are more likely to be suppressed during training. Therefore, a unified framework is required to coordinate feature representation, semantic reasoning, and class-balanced optimization within an end-to-end learning process.
Compared with existing approaches, MSDR-Net is designed to address these coupled challenges collaboratively. The MSFE module enhances multiscale spatial representation by integrating shallow details and deep semantic features. The DSR module further models long-range spatial dependencies and inter-class semantic relationships through Transformer-based reasoning. The DW-Loss introduces category-frequency weighting and prior difficulty coefficients to improve learning of rare and difficult categories. In this way, MSDR-Net does not simply stack existing modules, but coordinates feature extraction, semantic reasoning, and loss optimization for multi-label remote sensing image classification.
4. Experiments
4.1. Dataset
To validate the effectiveness of the proposed MSDR-Net, experiments are conducted on the publicly available DIOR dataset [26]. As shown in Figure 6, DIOR is a large-scale, high-resolution remote sensing imagery dataset that encompasses diverse, complex scenes and a wide range of object categories and has been widely adopted for remote sensing image understanding tasks. The dataset contains 20 object categories, including large-scale targets such as Airplane, Airport, Ship, and Stadium, as well as small-scale or structurally complex targets such as Vehicle, Bridge, and Overpass. Significant variations across categories are observed in scale distribution, spatial density, and semantic co-occurrence patterns, reflecting typical characteristics of multi-label remote sensing scenarios.
Based on the object-level annotations, each image is reformulated as an image-level multi-label sample, allowing a single image to correspond to multiple category labels. This transformation better reflects the real-world characteristics of remote sensing scenarios, where multiple land-cover types often coexist within a single scene. No external training data are introduced in the experiments, and a fixed 8:2 split is adopted to construct the training and validation sets, ensuring the fairness and reproducibility of the experimental comparisons.
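The reformulation from object-level annotations to image-level labels can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `to_multi_hot` and the ordering of the class list are assumptions introduced here, and the class names follow the 20 categories commonly listed for DIOR.

```python
# The 20 DIOR category names; the ordering used here is an assumption.
DIOR_CLASSES = [
    "Airplane", "Airport", "Baseball Field", "Basketball Court", "Bridge",
    "Chimney", "Dam", "Expressway Service Area", "Expressway Toll Station",
    "Golf Field", "Ground Track Field", "Harbor", "Overpass", "Ship",
    "Stadium", "Storage Tank", "Tennis Court", "Train Station", "Vehicle",
    "Windmill",
]


def to_multi_hot(object_labels, classes=DIOR_CLASSES):
    """Collapse the per-object category names of one image into a 0/1 vector."""
    index = {name: i for i, name in enumerate(classes)}
    target = [0] * len(classes)
    for name in object_labels:
        # Several objects of the same category still yield a single 1.
        target[index[name]] = 1
    return target


# Hypothetical image containing two vehicles, one bridge, and one overpass.
example = to_multi_hot(["Vehicle", "Vehicle", "Bridge", "Overpass"])
```

Object counts and bounding boxes are discarded in this sketch; only the presence or absence of each category is kept, which matches the image-level multi-label setting described above.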
In addition to DIOR, MLRSNet is also used as a multi-label remote sensing benchmark to evaluate the cross-dataset generalization ability of MSDR-Net [1]. MLRSNet contains diverse high-resolution remote sensing scenes with multiple land-cover categories and multi-label annotations. In this study, 20 categories are selected for evaluation, including Airplane, Airport, Baseball Diamond, Basketball Court, Bridge, Freeway, Golf Course, Ground Track Field, Harbor&Port, Overpass, Parking Lot, Railway, Railway Station, Shipping Yard, Stadium, Storage Tank, Tennis Court, Terrace, Transmission Tower, and Wind Turbine. To construct the experimental subset, the first 25% of the samples in each category are retained, resulting in 12,012 images. The subset is split 80%/20% into 9609 training and 2403 validation images.
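The split sizes quoted above follow from flooring 80% of the 12,012 retained images. The sketch below reproduces that arithmetic; the helper `split_indices` is hypothetical, and keeping the original sample order (no shuffling) is an assumption, since the text does not state how samples are assigned to each part.

```python
def split_indices(n_samples, train_fraction=0.8):
    """Return (train, validation) index lists for a fixed, order-preserving split."""
    n_train = int(n_samples * train_fraction)  # floor of the training share
    indices = list(range(n_samples))
    return indices[:n_train], indices[n_train:]


# 12,012 is the subset size reported in the text.
train_idx, val_idx = split_indices(12012)
```

A shuffled variant with a fixed random seed would give the same sizes while removing any ordering effect in the source data.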
4.2. Evaluation Metrics
To comprehensively evaluate the performance of the model in multi-label remote sensing image classification, Mean Average Precision (mAP), Hamming Accuracy (HA), Overall F1-score (OF1), Class-wise F1-score (CF1), Overall Precision/Recall (OP/OR), and Class-wise Precision/Recall (CP/CR) are adopted. Among these metrics, mAP reflects the overall ranking capability of the model across different thresholds; HA measures label-wise prediction consistency; OF1 and OP/OR evaluate prediction quality from a sample-level perspective; and CF1 and CP/CR assess model performance from a class-level perspective, particularly in terms of class balance and long-tail category recognition. The corresponding formulations are defined as follows:
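For reference, the forms of these metrics that are conventional in the multi-label classification literature can be written as below. This is a sketch under the assumption that the standard definitions are intended. The notation is introduced here for illustration: N is the number of images, C the number of categories, y and ŷ the ground-truth and predicted binary labels, and N_i^c, N_i^p, and N_i^g the numbers of correctly predicted, predicted, and ground-truth images for category i.

```latex
\begin{aligned}
\mathrm{mAP} &= \frac{1}{C}\sum_{i=1}^{C}\mathrm{AP}_i, &
\mathrm{HA} &= 1-\frac{1}{NC}\sum_{n=1}^{N}\sum_{i=1}^{C}
  \mathbb{1}\left[y_{n,i}\neq\hat{y}_{n,i}\right],\\
\mathrm{OP} &= \frac{\sum_{i=1}^{C}N_i^{c}}{\sum_{i=1}^{C}N_i^{p}}, &
\mathrm{OR} &= \frac{\sum_{i=1}^{C}N_i^{c}}{\sum_{i=1}^{C}N_i^{g}},\\
\mathrm{CP} &= \frac{1}{C}\sum_{i=1}^{C}\frac{N_i^{c}}{N_i^{p}}, &
\mathrm{CR} &= \frac{1}{C}\sum_{i=1}^{C}\frac{N_i^{c}}{N_i^{g}},\\
\mathrm{OF1} &= \frac{2\,\mathrm{OP}\cdot\mathrm{OR}}{\mathrm{OP}+\mathrm{OR}}, &
\mathrm{CF1} &= \frac{2\,\mathrm{CP}\cdot\mathrm{CR}}{\mathrm{CP}+\mathrm{CR}}.
\end{aligned}
```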
Given that single-run experiments are susceptible to stochastic variations from random initialization and batch sampling, all overall evaluation metrics, except class-wise Average Precision (AP), are reported based on statistics from multiple repeated runs. Specifically, deep learning models are trained using five random seeds ([42, 2022, 7, 123, 999]), whereas traditional and comparative methods, including Support Vector Machine (SVM), Extremely Randomized Trees (ERT), Relation Network, Deep Multi-Attention, MSCA, and SFNet, are evaluated using three random seeds. All results are reported in the form of mean ± std, where std denotes the sample standard deviation. For the validation mAP, a 95% confidence interval (CI) is additionally reported. To assess the statistical significance of performance differences between MSDR-Net and MSCA, a two-sided t-test is conducted at significance level α.
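The reporting convention above can be illustrated with a short sketch that computes the mean, the sample standard deviation, and a 95% confidence interval from per-seed results. Two points are assumptions made here: the interval is built from the Student-t quantile, which the text does not specify, and the five mAP values are invented for illustration and are not results from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}


def summarize(values):
    """Return (mean, sample std, 95% CI) for a small set of repeated runs."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Hypothetical validation mAP values (%) for five seeds, for illustration only.
runs = [95.7, 95.9, 96.0, 95.8, 95.9]
mean, std, ci = summarize(runs)
print(f"{mean:.2f} ± {std:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

With three seeds the same function uses two degrees of freedom, which widens the interval considerably for the same spread.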
4.3. Experimental Setup
Considering the significant scale variations and diverse land-cover types in high-resolution remote sensing imagery, a customized data preprocessing strategy is designed. During training, the following data augmentation strategies are applied to improve model generalization: Random Resized Crop (scale range: 0.7–1.0; aspect ratio: 0.85–1.15), Random Horizontal Flip (probability: 0.5), and rotations at multiples of 90°. The images are subsequently converted into tensors and normalized using the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], which are the standard ImageNet channel statistics. For the validation set, only deterministic preprocessing is applied, including resizing to the input resolution, center cropping, tensor conversion, and normalization, to ensure the stability and consistency of evaluation metrics.
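The random parameters behind the training augmentations can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, the log-uniform sampling of the aspect ratio and the retry-then-fallback logic are assumptions modeled on common library behavior, the rotation set {0°, 90°, 180°, 270°} is one reading of "multiples of 90°", and the 800 × 800 image size is used only as an example.

```python
import math
import random


def sample_crop_box(width, height, rng, scale=(0.7, 1.0), ratio=(0.85, 1.15), tries=10):
    """Sample (left, top, crop_w, crop_h) with area fraction and aspect ratio in range."""
    area = width * height
    for _ in range(tries):
        target_area = area * rng.uniform(*scale)
        # Log-uniform sampling treats a ratio r and its inverse 1/r symmetrically.
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    return 0, 0, width, height  # fallback: keep the full image


def sample_augmentation(width, height, rng):
    """Draw one set of augmentation parameters for a training image."""
    return {
        "crop": sample_crop_box(width, height, rng),
        "hflip": rng.random() < 0.5,  # horizontal flip with probability 0.5
        "rotation_deg": rng.choice([0, 90, 180, 270]),
    }


rng = random.Random(42)
params = sample_augmentation(800, 800, rng)
```

Restricting rotations to multiples of 90° avoids interpolation and empty corners, which is convenient for overhead imagery where no canonical orientation exists.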
The model is trained on a single NVIDIA GeForce RTX 4070 Ti Super GPU using the PyTorch V1.12.1 framework. The key hyperparameters are set as follows: a batch size of 40, 100 training epochs, and the AdamW optimizer. A hybrid learning rate scheduling strategy is adopted, with linear warmup during the first five epochs, followed by dynamic adjustment based on validation loss, enabling more refined convergence in later training stages.
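A hybrid schedule of this kind can be sketched as below: linear warmup over the first five epochs, followed by reductions when the validation loss stops improving. The class `WarmupPlateauSchedule` is hypothetical, and the base learning rate, reduction factor, patience, and minimum rate are assumed values, since the text does not report them.

```python
class WarmupPlateauSchedule:
    """Linear warmup, then reduce the learning rate when validation loss stalls."""

    def __init__(self, base_lr, warmup_epochs=5, factor=0.5, patience=3, min_lr=1e-6):
        self.base_lr = base_lr
        self.warmup_epochs = warmup_epochs
        self.factor = factor        # assumed multiplicative reduction
        self.patience = patience    # assumed number of tolerated stalled epochs
        self.min_lr = min_lr
        self.current_lr = base_lr
        self.best_loss = float("inf")
        self.stalled_epochs = 0

    def lr_for_epoch(self, epoch):
        """Learning rate used while training the given 0-indexed epoch."""
        if epoch < self.warmup_epochs:
            return self.base_lr * (epoch + 1) / self.warmup_epochs
        return self.current_lr

    def report_validation_loss(self, epoch, val_loss):
        """Update the plateau state after validating the given epoch."""
        if epoch < self.warmup_epochs:
            return  # the warmup phase ignores validation loss
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.stalled_epochs = 0
        else:
            self.stalled_epochs += 1
            if self.stalled_epochs > self.patience:
                self.current_lr = max(self.current_lr * self.factor, self.min_lr)
                self.stalled_epochs = 0


schedule = WarmupPlateauSchedule(base_lr=1e-4)
warmup_rates = [schedule.lr_for_epoch(epoch) for epoch in range(5)]
```

In a training loop, `lr_for_epoch` would be queried before each epoch and `report_validation_loss` called after each validation pass.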
The proposed DW-Loss is employed, and consistent experimental settings, including input resolution, number of training epochs, optimizer configuration, and data splits, are maintained across all compared deep learning models. For methods originally designed for different task settings, their core architectures are retained while the output layers are uniformly replaced with a 20-dimensional Sigmoid classification head to conform to the DIOR multi-label protocol. Additionally, identical data augmentation strategies, training epochs, and validation protocols are applied to ensure fair and consistent comparisons.
The DSR module is implemented as a Transformer encoder with four layers, each with eight attention heads and a hidden embedding dimension of 256. A dropout rate of 0.1 is applied to prevent overfitting. The label embedding dimension is set to 20, corresponding to the number of categories in the multi-label classification task.
The DW-Loss, formulated as a difficulty-weighted binary cross-entropy, incorporates two complementary weighting mechanisms to improve learning on rare and challenging categories. First, the category frequency weight is calculated as the ratio of negative to positive samples for each category in the training set, with a small smoothing constant to avoid numerical instability. Second, the prior difficulty coefficient is empirically assigned according to target scale, geometric complexity, and observed training difficulty. Categories that are inherently challenging—such as Vehicle, Bridge, and Overpass—are assigned higher coefficients within the range [1.0, 2.5] to emphasize their contribution during training. The final weighting vector applied in the BCE loss is obtained by multiplying the category frequency weight and the difficulty coefficient element-wise, with a maximum clamp of 50 to maintain training stability.
This configuration ensures that the MSDR-Net framework effectively emphasizes complex, small-scale, or semantically ambiguous categories, while maintaining stable convergence for frequently occurring classes.
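The weighting scheme described for DW-Loss can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the class counts and difficulty coefficients in the example are invented, and applying the weight only to the positive term of the binary cross-entropy is an assumption suggested by the negative-to-positive ratio.

```python
import math


def dw_weights(pos_counts, n_samples, difficulty, eps=1e-6, max_weight=50.0):
    """Per-category weight: (negatives / positives) * difficulty, clamped at max_weight."""
    weights = []
    for pos, coeff in zip(pos_counts, difficulty):
        freq_weight = (n_samples - pos) / (pos + eps)  # eps guards against pos == 0
        weights.append(min(freq_weight * coeff, max_weight))
    return weights


def dw_bce(logits, targets, weights):
    """Weighted binary cross-entropy for one image, averaged over categories."""
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        # Numerically stable log(sigmoid(z)) and log(1 - sigmoid(z)).
        if z >= 0:
            log_p = -math.log1p(math.exp(-z))
        else:
            log_p = z - math.log1p(math.exp(z))
        log_not_p = log_p - z
        # Assumption: the weight scales only the positive term.
        total += -(w * y * log_p + (1 - y) * log_not_p)
    return total / len(logits)


# Hypothetical counts for a frequent, a moderately rare, and a very rare category.
weights = dw_weights(pos_counts=[4000, 500, 100], n_samples=10000,
                     difficulty=[1.0, 1.5, 2.5])
loss = dw_bce(logits=[2.0, -1.0, 0.5], targets=[1, 0, 1], weights=weights)
```

In this example the third category would receive an unclamped weight of about 247.5, so the clamp at 50 is what keeps a very rare, difficult category from dominating the gradient.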
4.4. Performance Evaluation and Ablation Studies
4.4.1. Overall Performance and Fine-Grained Category Analysis
Table 1 presents the overall performance statistics of MSDR-Net on the DIOR validation set. Based on repeated experiments across five random seeds, the proposed model achieves a favorable balance between precision and recall in the multi-label classification task, indicating strong overall discriminative capability while maintaining robustness in both label-level consistency and class-level balance.
To further evaluate category-level performance, Table 2 reports the AP and HA results for all 20 DIOR categories. Overall, MSDR-Net achieves high AP and HA values for most categories, indicating strong category discrimination and stable label-wise prediction consistency. For categories with distinctive structures and large spatial scales, such as Airplane, Airport, Baseball Field, Chimney, Ship, Stadium, and Windmill, the AP values are close to 1.0, while the HA values are around 0.99. This demonstrates the model’s stable recognition of salient, well-structured targets. For medium-scale structured categories, including Dam, Expressway Service Area, Expressway Toll Station, Golf Field, Harbor, Tennis Court, and Train Station, most AP values exceed 0.97, further confirming the robustness of MSDR-Net for regular scene objects. Relatively lower AP values are observed for Bridge, Overpass, and Vehicle, at 0.8543, 0.8709, and 0.8444, respectively. These categories are more challenging due to small object sizes, large appearance variations, and strong background coupling. In particular, Vehicle obtains a lower HA of 0.9014, indicating that dense small-object recognition remains difficult. For relatively imbalanced categories such as Storage Tank and Train Station, MSDR-Net still maintains high AP and HA values, suggesting that DW-Loss helps improve learning for rare and difficult categories.
As illustrated in Figure 7, the Transformer module’s attention responses are visualized to evaluate MSDR-Net’s spatial attention capability in multi-label remote sensing image classification. Since the feature pyramid produces a spatially coarse feature representation, the resulting heatmaps correspond to coarse-grained token-level semantic responses, which are subsequently upsampled and overlaid onto the original images for visualization. The high-attention regions effectively cover both large target areas and densely distributed small-scale objects, while also accurately focusing on critical regions of structurally organized targets with linear or block-like spatial distributions. In complex multi-label scenarios, multiple semantic regions can be simultaneously attended, indicating that the combination of multiscale feature representation and Transformer-based semantic reasoning effectively captures target co-occurrence patterns and inter-class spatial relationships.
Overall, MSDR-Net demonstrates superior performance on both global and fine-grained target categories. Large-scale and common categories are accurately recognized, while substantial improvements are achieved for small objects, structurally complex targets, and rare categories. Furthermore, the attention visualization results demonstrate that the proposed model can simultaneously focus on multiple targets and semantically important regions, further validating the synergistic advantages of multiscale feature fusion, Transformer-based semantic reasoning, and DW-Loss.
4.4.2. Ablation Study
To further investigate the underlying mechanisms contributing to the performance improvements of MSDR-Net and to validate the necessity of each component within the network architecture and training strategy, a systematic cumulative ablation study is conducted. The experiments are conducted using ResNet-34 as the baseline model. Under consistent hyperparameter settings, the Transformer-based global attention module, multiscale feature pyramid, DW-Loss, and data augmentation strategies are incrementally incorporated, allowing a quantitative evaluation of the cumulative contributions of each component to the final classification performance.
The baseline model employs ResNet-34 as the feature extractor without incorporating additional attention mechanisms or multiscale fusion modules, and is optimized using the BCE loss. As shown in Table 3, this configuration achieves an mAP of 90.71%. Although the backbone provides reasonable feature extraction capability, its representation of small, densely distributed targets remains limited in remote sensing scenarios characterized by significant scale variations and complex background interference. In Exp-1, a Transformer-based global attention module is incorporated into the baseline model. By leveraging self-attention to capture long-range dependencies, the model’s ability to model contextual relationships among objects is enhanced. As a result, the mAP improves from 90.71% to 92.3% (+1.8%). This improvement highlights the importance of global contextual information for understanding complex remote sensing scenes, particularly in distinguishing categories with similar local features but distinct global semantics. In Exp-2, the FPN is further integrated based on Exp-1. By combining deep semantic information with shallow spatial details, the model’s ability to represent multiscale targets is significantly enhanced. Consequently, the mAP increases to 94.8% (+3.5% compared to Exp-1), demonstrating the effectiveness of multiscale feature fusion in addressing large-scale variations, particularly for structurally distinctive targets such as bridges and overpasses. To address class imbalance and hard-sample challenges in remote sensing datasets, the proposed DW-Loss is introduced in Exp-3. By dynamically adjusting category- and sample-level weights, the model is guided to focus more on rare and difficult samples. As shown in Table 3, the mAP further increases to 95.5% (+0.8% compared to Exp-2). Although the overall gain is moderate, substantial improvements are observed for long-tail categories such as vehicles and storage tanks, effectively mitigating category bias. In the final configuration, customized data augmentation strategies, including random cropping, flipping, and rotation, are incorporated to increase the diversity of the training data. This leads to improved generalization and robustness, resulting in a final mAP of 95.88% (+0.3% compared to Exp-3). Overall, the progressive ablation results show that each component of MSDR-Net, from the global attention mechanism and multiscale feature fusion to tailored loss optimization and data augmentation strategies, contributes positively to overall performance. These components operate synergistically to form a high-performance framework for multi-label remote sensing image classification.
4.5. Cross-Dataset Validation on MLRSNet
Table 4 summarizes the overall performance metrics of MSDR-Net on the MLRSNet validation set, including mAP, HA, OF1, CF1, CP, CR, OP, and OR, together with the corresponding standard deviations and 95% confidence intervals. To further evaluate category-level performance, Table 5 presents the AP and HA results for all 20 categories in MLRSNet. Overall, MSDR-Net achieves high AP and HA values across most categories, demonstrating strong category discrimination and stable label-wise prediction consistency.
For large-scale, geometrically distinctive categories, such as Airport, Baseball Diamond, Basketball Court, Harbor, Golf Course, Terrace, Storage Tank, Transmission Tower, and Wind Turbine, AP values are above 0.98, and HA values are around 0.99, confirming that MSDR-Net can reliably recognize prominent, clearly structured targets. For medium-scale structured categories, including Freeway, Ground Track Field, Railway Station, Shipping Yard, and Stadium, AP values remain above 0.95, further demonstrating the robustness of MSDR-Net on regular scene objects. Small-scale and challenging categories, such as Parking Lot, Vehicle, Bridge, Overpass, and Railway, have relatively lower AP values (0.8950–0.9807) and slightly lower HA values (0.9097–0.9863) due to their dense spatial distribution, large appearance variation, and complex backgrounds. Nevertheless, compared to prior benchmarks, MSDR-Net achieves noticeable improvement in small-scale target recognition, indicating the effectiveness of multiscale feature enhancement and DW-Loss in improving small and difficult object predictions. In addition, long-tail categories such as Storage Tank and Railway Station, despite having relatively fewer samples, still achieve high AP and HA, demonstrating that MSDR-Net mitigates the influence of long-tail category imbalance through difficulty-weighted optimization. Collectively, these results show that MSDR-Net effectively handles multiscale objects, models complex semantic dependencies, and addresses long-tail distribution issues.
Compared to DIOR, MLRSNet shows higher AP for medium and large-scale categories, likely due to more balanced class distributions and higher image quality, while small-scale and long-tail categories remain more challenging, confirming the importance of global reasoning and DW-Loss in these scenarios.
4.6. Comparative Experiments
To comprehensively evaluate the effectiveness and competitiveness of MSDR-Net, comparisons are conducted under the DIOR multi-label protocol against a diverse set of approaches, including traditional machine learning models, classical deep learning models, relation modeling methods, and recent multiscale and semantic enhancement techniques. Specifically, the compared methods include Support Vector Machine (SVM) [27], Extremely Randomized Trees (ERT) [27], CNN [28], ResNet-34 [29], Relation Network [30], Deep Multi-Attention [31], MSCA [12], and SFNet [21]. It should be noted that the original task settings of methods such as Relation Network, Deep Multi-Attention, MSCA, and SFNet are not fully aligned with the multi-label classification setting considered in this study. Therefore, to ensure fair comparison, their core architectures are re-implemented under a unified experimental protocol, including consistent input resolution, optimizer configuration, training epochs, data augmentation strategies, and a standardized multi-label Sigmoid classification head. The comparative results are reported in Table 6.
As illustrated in Figure 8, traditional machine learning methods exhibit substantially inferior performance overall. Specifically, Support Vector Machine (SVM) and Extremely Randomized Trees (ERT) achieve mAP values of 68.42% and 73.55% on the validation set, respectively. This indicates that methods relying on hand-crafted features are insufficient for capturing the complex spatial structures and semantic co-occurrence patterns present in high-resolution remote sensing imagery. In contrast, deep learning-based approaches significantly improve classification performance. The CNN achieves an mAP of 86.37%, while ResNet-34 further improves it to 91.16%, validating the effectiveness of deep convolutional architectures and residual learning in feature representation. However, these methods primarily rely on local receptive fields, limiting their ability to model long-range semantic dependencies in multi-label scenarios.
To further enhance multi-label classification performance, subsequent methods incorporate relation modeling and attention mechanisms. The Relation Network explicitly models inter-label dependencies, achieving an mAP of 92.46% on the validation set, while Deep Multi-Attention enhances regional feature representations through multi-branch attention, achieving an mAP of 91.88%. Overall, these approaches raise validation mAP to roughly 92%, demonstrating that modeling label relationships and incorporating attention mechanisms enhance multi-label discrimination capability. However, these methods still exhibit notable instability, as reflected in relatively high standard deviations of 0.56–0.68, indicating increased sensitivity to random initialization.
In recent years, research has shifted from relational modeling toward integrated multiscale and semantic fusion frameworks. MSCA leverages sparse cross-scale attention to enable multiscale feature interaction, while SFNet enhances category discrimination through semantic-assisted feature fusion; their validation mAP values are listed in Table 6. Although these approaches effectively mitigate challenges arising from scale variations in remote sensing imagery, a unified optimization framework that jointly models multiscale information, semantic relationships, and long-tail distributions remains lacking.
Compared with the aforementioned methods, the proposed MSDR-Net achieves the highest validation mAP of 95.88%, outperforming all competing approaches. Meanwhile, the training mAP exceeds the validation mAP by only about 2.03 percentage points, a train–validation gap that is significantly smaller than that of other deep models. This indicates that, while maintaining strong fitting capability, the model effectively suppresses overfitting and demonstrates superior generalization performance. In terms of performance gains, MSDR-Net improves mAP by approximately 1.64 percentage points over the best-performing competing method, MSCA, and by 2.08 percentage points over SFNet. Notably, an improvement exceeding 1.5 percentage points is still achieved in the high-performance regime (above 94% mAP), further demonstrating the effectiveness of the proposed approach.
Table 7 presents the AP results of MSDR-Net and MSCA for each category. Overall, MSDR-Net outperforms MSCA across most categories, particularly for challenging categories with small-scale objects or long-tail distributions. For instance, small-scale targets such as Vehicle, Bridge, and Overpass exhibit notable improvements: MSDR-Net achieves AP values of 0.8370, 0.8429, and 0.8941, respectively, compared with 0.7884, 0.7240, and 0.7948 for MSCA. These gains indicate that MSDR-Net’s multiscale feature enhancement and dynamic semantic reasoning effectively capture subtle spatial details, improving recognition of small or densely distributed targets.
For long-tail categories, including Storage Tank, Tennis Court, and Train Station, MSDR-Net also demonstrates superior performance (AP = 0.9509, 0.9767, and 0.9932) relative to MSCA (AP = 0.9138, 0.9513, and 0.9672), highlighting the effectiveness of the DW-Loss in mitigating class imbalance and enhancing learning on rare categories. Large-scale categories, such as Airplane, Baseball Field, Ship, and Windmill, are already well recognized by both methods, but MSDR-Net still provides small yet consistent improvements.
In summary, these results indicate that MSDR-Net achieves notable gains on small-scale, difficult, and long-tail categories, while maintaining or slightly improving performance on large and medium-scale categories. This underscores the advantages of integrating multiscale feature extraction, dynamic semantic reasoning, and difficulty-weighted loss in a unified framework.
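As a point of reference for the AP and mAP figures discussed above, the following minimal sketch computes per-category average precision and its mean over categories. It assumes the common non-interpolated definition (precision averaged over the ranks at which positive images occur); the exact evaluation variant used in the experiments is not stated here, and the function names and toy data are hypothetical.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP for one category (assumed definition)."""
    # Rank images by descending predicted score for this category.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            # Precision at the rank of each positive image.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_matrix, label_matrix) -> float:
    """mAP: AP averaged over categories; rows are images, columns are categories."""
    num_classes = len(score_matrix[0])
    aps = [
        average_precision(
            [row[c] for row in score_matrix],
            [row[c] for row in label_matrix],
        )
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


# Toy data (hypothetical): 4 images, 2 categories, Sigmoid-style scores in [0, 1].
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_map = mean_average_precision(scores, labels)
```

Because AP is computed independently per category before averaging, rare categories contribute to mAP with the same weight as frequent ones, which is why gains on long-tail classes are visible in the aggregate score.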
4.7. Computational Complexity Analysis
To further evaluate the practical applicability of the proposed method, the computational complexity of MSDR-Net was compared with representative baseline and competing methods, including ResNet-34, SFNet, and MSCA. All models were evaluated under the same input resolution. The number of parameters, FLOPs, single-image inference time, and frames per second (FPS) are reported. FLOPs were calculated using THOP, and inference time was measured on the same GPU platform with a batch size of 1. Data loading and image preprocessing were excluded from the timing process.
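The timing protocol described above can be sketched as follows. This is a minimal illustration, not the measurement script used in the experiments: `measure_latency_ms` and `dummy_forward` are hypothetical names, the warm-up and run counts are assumptions, and a CPU-only stand-in replaces the model so that the snippet is self-contained.

```python
import time
from statistics import mean
from typing import Callable


def measure_latency_ms(forward: Callable[[], object],
                       warmup: int = 10, runs: int = 100) -> float:
    """Average wall-clock time of one forward call, in milliseconds."""
    # Warm-up calls are discarded so that one-time setup costs are not timed.
    for _ in range(warmup):
        forward()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        forward()
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def dummy_forward() -> int:
    # Hypothetical stand-in for a single-image forward pass (batch size 1).
    return sum(i * i for i in range(10_000))


latency_ms = measure_latency_ms(dummy_forward, warmup=5, runs=20)
fps = 1000.0 / latency_ms  # FPS is the reciprocal of per-image latency
```

On a GPU, kernels typically execute asynchronously, so the device should be synchronized before each clock reading (for example with `torch.cuda.synchronize()` if PyTorch is used); otherwise the measured time reflects kernel launch rather than execution.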
As shown in
Table 8, ResNet-34 has the lowest inference cost among the compared CNN-based baselines, with 21.80 M parameters, 19.12 GFLOPs, and an average inference time of 6.02 ms per image. However, its mAP is only 90.71%, indicating that the standard backbone alone is insufficient for classifying complex multi-label remote sensing images. SFNet achieves better classification performance than ResNet-34 with relatively few parameters and a short inference time, but its mAP remains lower than that of MSDR-Net.
Compared with MSCA, MSDR-Net achieves a better trade-off between accuracy and efficiency. Although the parameter count of MSDR-Net is slightly higher than that of MSCA, increasing from 38.32 M to 39.46 M, the FLOPs are reduced from 35.54 GFLOPs to 24.61 GFLOPs, corresponding to a reduction of approximately 30.8%. Meanwhile, the average inference time decreases from 12.02 ms to 8.34 ms, an approximately 30.6% reduction in per-image latency. More importantly, MSDR-Net achieves a higher mAP of 95.88%, outperforming MSCA by approximately 1.64 percentage points.
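The relative figures in this comparison can be re-derived directly from the values quoted above. The short check below is only a worked calculation; the helper name `relative_reduction` is introduced for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100.0


# Values quoted in the text for MSCA (before) and MSDR-Net (after).
flops_drop = relative_reduction(35.54, 24.61)    # GFLOPs, about 30.8%
latency_drop = relative_reduction(12.02, 8.34)   # ms per image, about 30.6%

# Throughput follows from per-image latency at batch size 1.
msdr_fps = 1000.0 / 8.34                         # about 119.9 FPS
throughput_ratio = 12.02 / 8.34                  # about 1.44x images per second
```

Note that a 30.6% reduction in latency corresponds to roughly 1.44 times the throughput of MSCA, since throughput scales with the reciprocal of latency.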
These results indicate that the proposed MSDR-Net does not simply improve performance by introducing excessive computational overhead. Instead, by efficiently fusing multiscale features, employing Transformer-based semantic reasoning, and applying difficulty-weighted optimization, MSDR-Net achieves higher classification accuracy while maintaining moderate computational complexity. The average inference speed of 119.9 FPS further demonstrates that the proposed method has practical potential for high-resolution remote sensing image interpretation, especially in offline and near-real-time application scenarios.
4.8. Discussion
An end-to-end MSDR-Net is proposed to address three major challenges in multi-label remote sensing image classification: multiscale variation, complex semantic dependencies, and long-tail category distributions. Extensive experiments conducted on the DIOR and MLRSNet datasets demonstrate the effectiveness of the proposed framework. Existing approaches generally focus on isolated optimization objectives: CNN-based methods emphasize multiscale feature extraction, while Transformer-based methods primarily focus on long-range dependency modeling. In contrast, MSDR-Net integrates MSFE, DSR, and DW-Loss within a single end-to-end framework. On the DIOR dataset, MSDR-Net achieves an mAP of 95.88%, outperforming the state-of-the-art MSCA method by approximately 1.64 percentage points. Notable improvements are observed for small-scale targets, such as vehicles and bridges, as well as long-tail categories, including storage tanks, demonstrating the robustness of the proposed joint modeling strategy. Compared with traditional machine learning methods, such as SVM, deep learning-based approaches exhibit substantially stronger capability in modeling the complex spatial structures of high-resolution remote sensing imagery, further highlighting the limitations of hand-crafted feature representations.
Beyond performance improvements, the proposed framework also provides methodological insights for multi-label remote sensing image classification. Experimental analysis indicates that simply increasing network depth, such as adopting ResNet-34, is insufficient for effectively addressing missed detections of small-scale targets. After introducing the Transformer-based DSR module, long-range contextual dependencies can be captured through global attention mechanisms, enabling improved recognition of occluded and spatially scattered targets. This observation is consistent with findings reported in Vision Transformer studies, where global contextual modeling has been shown to play a critical role in visual understanding.
Despite the promising performance achieved by MSDR-Net, several limitations remain. Although the inference speed reaches 119.9 FPS through architectural optimization, the incorporation of FPN/PAN structures and the Transformer encoder raises the parameter count to 39.46 M, compared with 21.80 M for a lighter CNN backbone such as ResNet-34. This increased computational cost may restrict real-time deployment on resource-constrained edge devices, such as small unmanned aerial vehicles. In addition, for extremely dense and small-scale targets, such as vehicles, the HA still leaves room for improvement, indicating that feature discriminability under severe background clutter requires further enhancement. Future work will focus on lightweight model design, including knowledge distillation and neural architecture search, as well as self-supervised and large-scale pretraining strategies, to reduce computational cost while maintaining high accuracy and improving generalization capability under challenging remote sensing scenarios.
5. Conclusions
To address the challenges of multiscale variation, complex semantic dependencies, and long-tail distributions in multi-label remote sensing image classification, an end-to-end framework, MSDR-Net, is proposed that integrates MSFE, DSR, and DW-Loss into a unified architecture for representation learning and imbalance-aware optimization. The MSFE module enables robust multiscale feature extraction through residual learning and feature pyramid fusion, while the DSR module captures long-range dependencies and inter-class correlations via a Transformer encoder with positional encoding, enhancing global semantic modeling. The proposed DW-Loss further improves robustness by dynamically reweighting category contributions, mitigating long-tail and hard-sample issues. Extensive experiments on DIOR and MLRSNet demonstrate that MSDR-Net achieves favorable multi-label classification performance and shows promising robustness and generalization potential on the evaluated datasets.
Despite its favorable performance on the DIOR multi-label remote sensing image classification task, MSDR-Net still has several limitations. First, due to the introduction of FPN/PAN, the Transformer encoder, and multiscale feature enhancement structures, the model’s complexity remains higher than that of lightweight CNN-based models, and further optimization is required for deployment on resource-constrained edge platforms. In addition, the recognition of small-scale, densely distributed, or background-coupled categories, such as Vehicle, Bridge, and Overpass, still leaves room for further improvement. Future work will explore lightweight model design, self-supervised or large-scale pretraining strategies, and multimodal data integration to enhance scalability and practical applicability.