1. Introduction
Remote sensing imagery provides detailed and large-scale observations of the Earth’s surface and supports applications such as land-use mapping, urban planning, environmental monitoring, and infrastructure management [1]. Semantic segmentation, which assigns a semantic label to each pixel, has become a core technique for extracting fine-grained geospatial information from such imagery [2]. Recent advances in convolutional neural networks and transformer-based architectures have substantially improved remote sensing semantic segmentation performance [3,4]. In many practical scenarios, however, the goal is not comprehensive scene parsing over all categories, but the accurate extraction of a small number of task-critical geospatial elements, such as buildings, water bodies, or cultivated land [5]. In such settings, an important question is not only how to improve overall segmentation accuracy, but also which segmentation paradigm is more suitable for extracting a specific target of interest.
Reliable target extraction remains challenging because real-world geospatial data often exhibit severe pixel-level class imbalance. In large-scale datasets, a few dominant land-cover categories occupy most pixels, whereas many categories of practical interest are sparse and unevenly distributed [6]. Under such long-tailed conditions, optimization can be dominated by head classes, which may limit the effectiveness of standard segmentation pipelines for task-critical targets even when aggregate evaluation metrics remain acceptable. More importantly, extraction difficulty is not determined by class proportion alone. In practice, target performance is also influenced by target morphology, spatial continuity, fragmentation, boundary clarity, and semantic similarity to surrounding categories [6,7]. These factors make it difficult to judge model suitability using overall multi-class evaluation results alone.
Existing studies on remote sensing semantic segmentation have mainly focused on improving model architectures, feature representations, or loss designs within a unified multi-class formulation. Various strategies, including re-weighting, focal-style objectives, and class-balanced learning, have been explored to alleviate imbalance effects [7,8]. Although these studies provide valuable advances, they do not fully answer a practical application-oriented question: when the objective is the extraction of a single key geospatial element under highly imbalanced conditions, how do different segmentation paradigms behave, and what kinds of targets are they better suited for? Existing review papers and benchmark studies also mainly emphasize methodological development or overall multi-class performance, whereas practical deployment often requires target-aware model selection for a specific geospatial element.
Motivated by this gap, this study conducts an application-oriented comparative analysis of remote sensing segmentation under severe class imbalance. Specifically, we adopt a target-oriented evaluation setting in which each key geospatial element is analyzed as an independent target-specific extraction task. This setting is not intended to replace conventional multi-class segmentation or to imply universally higher class-wise performance. Instead, it serves as a complementary application-oriented setting for analyzing target-specific delineation behavior under severe imbalance. Based on a remote sensing dataset with pronounced long-tailed category characteristics, we benchmark eight representative segmentation models spanning different design paradigms on four key geospatial element categories with different sparsity levels and visual properties. We further include a limited comparison with conventional multi-class segmentation to clarify the role of the target-oriented setting. Through quantitative evaluation and qualitative case analysis, we examine model performance, precision–recall trade-offs, robustness, and characteristic failure modes under highly imbalanced conditions. Ultimately, this work aims to provide empirical evidence and practical guidance for segmentation model selection and protocol interpretation in highly imbalanced remote sensing applications.
The main contributions of this work are summarized as follows:
We conduct an application-oriented comparative study of eight representative segmentation paradigms across four key geospatial elements with different structural and semantic characteristics.
We build a custom remote sensing segmentation dataset with pronounced long-tailed characteristics and severe class imbalance, which serves as the experimental basis for the present protocol comparison and target-specific analysis.
We compare a conventional multi-class setting and a target-present one-vs-rest setting, and show that the latter should be interpreted as a complementary application-oriented protocol rather than a universally superior alternative.
We provide target-aware practical guidance by analyzing quantitative performance, robustness, and characteristic failure modes across different target types.
2. Related Work
Remote sensing semantic segmentation has become a core technique for extracting fine-grained geospatial information from high-resolution imagery and supports a wide range of applications, including land-use analysis, infrastructure monitoring, and environmental observation [1,2]. With the rapid development of deep learning, the field has evolved from conventional convolution-based pipelines toward a broader family of dense prediction paradigms. Meanwhile, many practical remote sensing applications prioritize the extraction of a small number of task-critical geospatial elements rather than complete scene parsing over all semantic categories. This application-oriented demand motivates a more target-aware view of model comparison under highly imbalanced conditions.
Deep Learning Paradigms for Remote Sensing Segmentation. Fully convolutional networks (FCNs) established the basis for end-to-end dense prediction and laid the foundation for modern semantic segmentation [9]. Subsequent work introduced encoder–decoder architectures and multi-scale context aggregation strategies to better cope with the large scale variation, heterogeneous textures, and complex spatial layouts common in remote sensing scenes, including U-Net-style models, DeepLab-style encoder–decoder designs, and pyramid-context approaches [10,11,12]. High-resolution representation learning and object-context modeling further improved spatial precision and semantic reasoning in dense prediction [13,14]. More recently, transformer-based segmentation models, mask-classification frameworks, state-space models, and prompt-guided adaptation of foundation models have substantially expanded the design space of remote sensing segmentation [15,16,17,18,19,20]. Although these studies have significantly advanced model design, most evaluations still emphasize overall multi-class accuracy rather than target-aware paradigm suitability under severe imbalance.
Class Imbalance and Long-Tailed Challenges in Remote Sensing. A persistent challenge in remote sensing semantic segmentation is severe pixel-level class imbalance, where a few dominant land-cover categories occupy most pixels while many categories of interest are sparse or weakly represented [6,7]. Public benchmarks such as LoveDA have also highlighted difficulties caused by multi-scale objects, complex backgrounds, and inconsistent class distributions across domains [21]. To mitigate these effects, existing studies have explored class re-weighting, class-balanced objectives, focal-style losses, and broader long-tailed learning strategies [6,7,8]. However, these methods do not directly answer a practical question: when the task is the extraction of a single key geospatial element rather than complete scene understanding, how do different segmentation paradigms behave, and which target types are they better suited for?
Target-Specific Extraction and Related Fine-Grained Tasks. Beyond unified multi-class segmentation, many remote sensing applications are inherently target-centered. Representative examples include building extraction, road extraction, water-body delineation, and flood extent mapping, where non-target categories mainly serve as contextual background and the practical objective is accurate extraction of a single geospatial element of interest [22,23,24,25]. This perspective is also related to broader fine-grained scene understanding tasks such as instance segmentation and panoptic segmentation, which provide richer target-centered descriptions than conventional semantic labeling alone [16,19,26,27]. However, most existing studies are centered on a specific target category, supervision setting, or model family, and therefore provide limited evidence on how substantially different segmentation paradigms compare when target sparsity, fragmentation, or semantic ambiguity becomes dominant.
Benchmarking and Application-Oriented Comparative Analysis. Systematic benchmarking is essential for understanding the strengths and limitations of segmentation models [2]. Existing remote sensing benchmarks and comparative studies mainly focus on multi-class land-cover mapping and typically report aggregate metrics such as mean Intersection over Union. While valuable for general scene understanding, such evaluations provide limited guidance for application-oriented scenarios in which the goal is to extract a single key geospatial element under severe class imbalance. In particular, relatively few studies jointly analyze target sparsity, spatial fragmentation, semantic ambiguity, precision–recall trade-offs, and characteristic failure modes across substantially different segmentation paradigms. This gap motivates the present study, which focuses on paradigm behavior, protocol interpretation, and model suitability for key geospatial element extraction.
In summary, existing studies have made substantial progress in remote sensing segmentation architectures, imbalance mitigation, benchmark construction, and target-specific applications. However, comparatively fewer works systematically analyze how different segmentation paradigms behave for key geospatial element extraction under severe imbalance while also clarifying the role of a complementary target-oriented analysis setting relative to conventional multi-class evaluation. This gap motivates the application-oriented comparative study conducted in this paper.
3. Materials and Methods
Figure 1 outlines the overall design of this study, including target selection, protocol construction, model comparison, and subsequent quantitative as well as qualitative analysis.
3.1. Dataset Description and Annotation Protocol
This study is based on a custom remote sensing semantic segmentation dataset containing 10,482 image tiles. The dataset was derived from satellite imagery acquired in 2024 and cropped into RGB patches of pixels for annotation, training, and evaluation. The original source image has a spatial size of 14,401 × 9601 pixels with three spectral bands. Its pixel size is approximately degrees in both horizontal and vertical directions, corresponding to an approximate ground sampling distance of about 1 m in the study area. All images are stored as 3-channel RGB uint8 JPG files, and the corresponding labels are stored as single-channel uint8 PNG masks.
The dataset contains 14 semantic categories, including background and 13 foreground geospatial classes, and exhibits a pronounced long-tailed distribution. Dominant categories such as Building and Cultivated Land occupy most pixels, whereas categories such as Water and Nursery are much less represented.
Table 1 summarizes the semantic categories and pixel-level distribution of the dataset, highlighting the severe class imbalance across categories.
Pixel-wise annotations were manually produced and reviewed following a unified category definition and annotation guideline. Quality control focused on boundary consistency, category-definition consistency, and the handling of visually ambiguous regions. Based on the dataset statistics, four representative categories were selected for comparative analysis: Building, Cultivated Land, Water, and Nursery. These targets differ in prevalence, spatial continuity, structural regularity, and semantic ambiguity, and therefore provide a suitable basis for studying paradigm-specific behavior under highly imbalanced conditions.
3.2. Target-Oriented Evaluation Protocol
To examine the effect of protocol design, we considered two related settings in this study.
Protocol A: Conventional multi-class segmentation. Protocol A follows the standard multi-class formulation, in which all images are retained and the original multi-class annotations are used directly for training and evaluation. This setting serves as the conventional reference protocol and reflects the standard full-scene segmentation scenario.
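Protocol B: Target-present one-vs-rest segmentation. Protocol B reformulates each selected category as an independent binary segmentation task. For a target class $c$, the original dataset is denoted as
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N},$$
where $x_i$ is an input image and $y_i$ is the corresponding multi-class label map. The binary target mask is defined as
$$y_i^{(c)}(p) = \begin{cases} 1, & y_i(p) = c, \\ 0, & \text{otherwise}, \end{cases}$$
where $p$ indexes the pixel positions of the label map.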
Using this formulation, we constructed four target-specific datasets for Building, Cultivated Land, Water, and Nursery. These datasets were derived from the same global split as Protocol A, but only image tiles containing the selected target were retained. This design was adopted to focus the analysis on target delineation rather than target presence detection.
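The construction of a Protocol B dataset described above can be sketched as follows. This is a minimal illustration only, assuming that label maps are nested lists of integer class IDs. The function names `to_binary_mask` and `build_target_dataset` and the class ID in the usage note are hypothetical and are not taken from the original pipeline.

```python
from typing import List, Sequence, Tuple

LabelMap = List[List[int]]  # H x W grid of integer class IDs (assumed encoding)


def to_binary_mask(label_map: LabelMap, target_id: int) -> LabelMap:
    """Map pixels of the target class to 1 and all other classes to 0."""
    return [[1 if value == target_id else 0 for value in row] for row in label_map]


def build_target_dataset(
    samples: Sequence[Tuple[object, LabelMap]], target_id: int
) -> List[Tuple[object, LabelMap]]:
    """Binarize labels and keep only tiles that contain the target class."""
    kept = []
    for image, label_map in samples:
        mask = to_binary_mask(label_map, target_id)
        # Target-present filtering: drop tiles with no target pixel.
        if any(any(row) for row in mask):
            kept.append((image, mask))
    return kept
```

For example, calling `build_target_dataset(split, target_id=3)` on one split of the shared global partition (with `3` standing in for whichever ID encodes the target) would return only the tiles of that split containing the target, each paired with its binary mask.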
3.3. Compared Segmentation Paradigms
To conduct a representative comparative study, we selected eight segmentation models covering several major design paradigms in dense prediction: U-Net, DeepLabV3+, HRNet, OCRNet, SegFormer, Mask2Former, RSMamba, and RSPrompter. These models were chosen to cover classical encoder–decoder learning, multi-scale context aggregation, high-resolution representation learning, object-context reasoning, transformer-based segmentation, mask-classification-based segmentation, state-space modeling, and prompt-guided segmentation, respectively.
U-Net and DeepLabV3+ serve as widely used baseline architectures for semantic segmentation. HRNet and OCRNet represent high-resolution and context-aware feature learning. SegFormer and Mask2Former provide transformer-based and mask-classification-based paradigms. RSMamba represents state-space-model-based segmentation, while RSPrompter introduces a prompt-guided paradigm adapted from large pre-trained visual models. Together, these models enable comparison across substantially different modeling strategies, which is important for analyzing paradigm-level behavior rather than minor architectural variation.
3.4. Experimental Setup
A shared global training, validation, and test split was first constructed from the full dataset. Protocol A directly uses this global split with the original multi-class annotations, whereas Protocol B is derived from the same split by converting each selected target into a one-vs-rest binary segmentation task and retaining only image tiles containing that target. In this way, the comparison between the two protocols is based on consistent data partitioning rather than differences in split construction.
All input images were resized to pixels. During training, data augmentation included random cropping, random flipping, photometric distortion, normalization, and padding. During testing, only deterministic preprocessing and normalization were applied. Images were normalized using a mean of [123.675, 116.28, 103.53] and a standard deviation of [58.395, 57.12, 57.375].
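The normalization step can be illustrated with a short sketch. It assumes per-channel standardization of 8-bit RGB values in R, G, B order, and the helper name `normalize_pixel` is hypothetical. The constants are those quoted above, which match the widely used ImageNet statistics expressed on the 0 to 255 scale.

```python
from typing import Sequence, Tuple

# Per-channel statistics quoted in the text (R, G, B order assumed).
MEAN = (123.675, 116.28, 103.53)
STD = (58.395, 57.12, 57.375)


def normalize_pixel(rgb: Sequence[float]) -> Tuple[float, ...]:
    """Standardize one RGB pixel: subtract the channel mean, divide by the channel std."""
    if len(rgb) != 3:
        raise ValueError("expected exactly three channel values")
    return tuple((value - mean) / std for value, mean, std in zip(rgb, MEAN, STD))
```

A pixel equal to the channel means maps to zero in every channel, and a pixel one standard deviation above the mean maps to approximately one.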
To improve comparability across model families, all compared models were trained for 300 epochs under the same train/validation/test split strategy and evaluated using the same metrics. All models were optimized using stochastic gradient descent (SGD) with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005. A poly learning-rate schedule was adopted with a power of 0.9 and a minimum learning rate of . The batch size was set to 8 images per GPU. The checkpoint with the best validation IoU was selected for final testing.
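A poly learning-rate schedule of this kind is commonly written as lr(t) = (lr_0 - lr_min) * (1 - t/T)^power + lr_min. The sketch below implements that common form as an assumption, since the text does not state the exact formula or whether the schedule is stepped per iteration or per epoch. The function name `poly_lr` is hypothetical, and the default `min_lr=0.0` is only a placeholder, not the value used in the study.

```python
def poly_lr(
    step: int,
    total_steps: int,
    base_lr: float = 0.01,
    power: float = 0.9,
    min_lr: float = 0.0,  # placeholder; the study's minimum learning rate is not given here
) -> float:
    """Polynomial decay from base_lr at step 0 to min_lr at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr
```

With a power of 0.9 the decay is close to linear but slightly slower early on, so the learning rate stays near its initial value for longer before dropping toward the minimum.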
Experiments were conducted in a PyTorch 2.1.0-based environment with CUDA 11.1 on a server equipped with six NVIDIA GeForce RTX 4090 GPUs, each with 24 GB of memory. For robustness analysis, all compared paradigms were additionally trained under Protocol B using three random seeds (0, 1, and 2), and the mean and standard deviation of IoU were reported in the robustness analysis of the Results section.
Model performance was evaluated independently using Intersection over Union (IoU), Precision, Recall, and F1-score. These metrics jointly measure segmentation overlap quality and the precision–recall trade-off, which is particularly important under highly imbalanced conditions. In addition to quantitative evaluation, representative qualitative predictions were analyzed to investigate characteristic failure modes, including target omission, fragmented segmentation, and semantic confusion with visually similar background regions.
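For reference, the four metrics can be computed from pixel counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below is a minimal illustration for binary masks stored as nested lists. The function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions rather than details taken from the study.

```python
from typing import Dict, List

Mask = List[List[int]]  # H x W binary mask (1 = target, 0 = background)


def binary_metrics(pred: Mask, target: Mask) -> Dict[str, float]:
    """Compute IoU, Precision, Recall, and F1 for one pair of binary masks."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1

    def safe_div(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "iou": safe_div(tp, tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }
```

Because IoU penalizes both FP and FN in a single ratio, reporting Precision and Recall alongside it makes it possible to tell over-activation (many FP) apart from omission (many FN).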
3.5. Statistical Analysis
To avoid over-interpreting the limited number of random seeds, statistical significance was not assessed using seed-level averages alone. Instead, paired test-image-level analyses were conducted under Protocol B. For each model and each test image, IoU values from three random seeds were first averaged. The resulting per-image IoU differences between two models were then used for statistical comparison.
For each target category, the top-performing model according to the seed-averaged IoU values was compared with the closest competing models. The 95% confidence intervals of the mean IoU differences were estimated using paired bootstrap resampling with 10,000 iterations. Paired Wilcoxon signed-rank tests were further applied to assess statistical significance. The p-values were adjusted using the Holm-Bonferroni correction to account for multiple comparisons. A difference was considered statistically significant when the adjusted p-value was below 0.05 and the 95% confidence interval did not include zero.
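The main steps of this procedure can be sketched in plain Python. The sketch assumes a percentile bootstrap over per-image IoU differences, which is one common way to form such intervals and may differ from the exact variant used in the study. All function names are hypothetical, and the Wilcoxon signed-rank p-values are taken as given inputs to the Holm step.

```python
import random
from typing import List, Sequence, Tuple


def seed_averaged(per_seed_ious: Sequence[Sequence[float]]) -> List[float]:
    """Average per-image IoU over seeds; input is one list of image IoUs per seed."""
    n_seeds = len(per_seed_ious)
    return [sum(values) / n_seeds for values in zip(*per_seed_ious)]


def paired_bootstrap_ci(
    diffs: Sequence[float], n_iter: int = 10_000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of paired per-image differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_iter):
        # Resample images with replacement, keeping each model pair intact.
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_iter)]
    upper = means[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni step-down adjustment, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of adjusted p-values along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

In use, the per-image differences would be formed as `[a - b for a, b in zip(seed_averaged(ious_a), seed_averaged(ious_b))]` for two models evaluated on the same test images. The raw p-values passed to `holm_adjust` could come from a statistics library; SciPy, for instance, provides `scipy.stats.wilcoxon` for the paired signed-rank test.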
4. Results
4.1. Protocol-Level Comparison Between Conventional and Target-Oriented Evaluation
Table 2 and Table 3 compare the conventional multi-class setting and the target-oriented setting. Protocol A serves as the standard multi-class reference, where all semantic categories are jointly learned and evaluated within the same scene-level label space. In contrast, Protocol B reformulates each selected category as a target-present one-vs-rest task, thereby emphasizing the delineation behavior of a specific target of practical interest.
The comparison shows that Protocol A generally provides stronger absolute class-wise performance on the present dataset. This result is important because it indicates that the target-oriented setting should not be interpreted as a universally superior replacement for conventional multi-class segmentation. Rather, Protocol B changes both the supervision structure and the evaluation distribution by focusing on target-present samples and merging all non-target categories into the background. Therefore, its value lies in providing a complementary application-oriented view of target-specific extraction behavior, especially under severe imbalance.
This distinction is essential for interpreting the following results. Protocol A is more suitable for reporting complete scene parsing performance, whereas Protocol B is more informative when the practical question is how reliably a particular geospatial element can be extracted once it becomes the target of interest. In this sense, the exclusion of target-absent image tiles in Protocol B is not intended to create a more realistic full-scene benchmark, but to isolate target-present delineation behavior and precision–recall trade-offs. Consequently, the results under Protocol B should be interpreted as application-oriented target analysis rather than as a replacement for standard multi-class evaluation.
4.2. Target-Specific Quantitative Results Under Protocol B
Table 3 reports the representative single-run results under Protocol B. A clear observation is that no single segmentation paradigm is uniformly optimal across all target categories. Instead, the best-performing paradigm depends strongly on the target type and on which metric is prioritized.
For Building, RSPrompter achieves the highest IoU, while HRNet obtains the highest Recall. This suggests that prompt-guided representations provide a favorable balance between target completeness and boundary precision for structured object-like targets, whereas high-resolution feature maintenance remains beneficial when stronger foreground sensitivity is required. For Cultivated Land, SegFormer achieves the best IoU, Precision, Recall, and F1-score, indicating that hierarchical multi-scale representations are particularly effective for large-area continuous targets with relatively regular contextual patterns.
The behavior on Water and Nursery further shows that target prevalence alone is insufficient to explain segmentation difficulty. Water is not only less dominant than Building and Cultivated Land, but also spatially sparse and often elongated. Under this condition, RSPrompter achieves the highest IoU, whereas HRNet maintains strong Recall. By contrast, RSMamba shows a distinctive high-Recall but low-Precision behavior, suggesting that it tends to over-activate background regions as water. Nursery presents a different type of difficulty. Its challenge is mainly caused by semantic similarity to surrounding vegetation-like regions. SegFormer performs best in this category in terms of IoU, Recall, and F1-score, while Mask2Former achieves the highest Precision but suffers from substantial Recall loss, indicating a more conservative prediction tendency.
Overall, the Protocol B results demonstrate that segmentation difficulty is jointly shaped by target prevalence, morphology, spatial continuity, fragmentation, and semantic ambiguity. Therefore, model comparison under severe imbalance should not be interpreted only through class proportion or a single aggregate metric. Instead, target-specific characteristics and metric-specific trade-offs need to be analyzed together. The paired statistical analysis further refines these observations. The leading result of RSPrompter on Building should be interpreted as a non-significant performance tendency, whereas the advantages of SegFormer on Cultivated Land and Nursery and RSPrompter on Water are statistically supported.
4.3. Robustness Across Random Seeds and Statistical Significance Analysis
To examine whether the observations under Protocol B are sensitive to random initialization, all compared paradigms were additionally trained with three different random seeds. Table 4 reports the mean and standard deviation of IoU.
The repeated-run results are generally consistent with the representative single-run observations. RSPrompter achieves the highest mean IoU on Building and Water, while SegFormer achieves the highest mean IoU on Cultivated Land and Nursery. This suggests that the main target-specific trends are not solely caused by one favorable random initialization.
The standard deviations also indicate different levels of target difficulty. Building and Cultivated Land show relatively stable behavior across most paradigms, whereas Water and Nursery exhibit stronger sensitivity for some models. This is consistent with the additional difficulty caused by sparse geometry, background activation, and vegetation-like semantic ambiguity.
Since only three random seeds were used, the repeated-run results were used mainly to describe robustness to random initialization rather than to perform seed-level significance testing. We therefore conducted paired test-image-level statistical analyses under Protocol B. For each model and each test image, IoU values from the three seeds were first averaged. The resulting per-image IoU differences were used to estimate 95% confidence intervals by paired bootstrap resampling. Paired Wilcoxon signed-rank tests were also performed, and the resulting p-values were adjusted using the Holm–Bonferroni correction.
Table 5 shows that the leading performance of RSPrompter on Building should be interpreted cautiously. Although RSPrompter obtains the highest seed-averaged IoU, its margins over SegFormer and RSMamba are small (
and
), and both confidence intervals include zero. The adjusted
p-values are also above 0.05, indicating that the differences are not statistically significant.
In contrast, the advantages of the other three targets are statistically supported. On Cultivated Land, SegFormer outperforms HRNet and OCRNet by and IoU, respectively (95% CI: and ; adjusted and ). On Water, RSPrompter outperforms HRNet and DeepLabV3+ by and IoU, respectively (95% CI: and ; adjusted and ). The largest differences are observed on Nursery, where SegFormer outperforms RSPrompter and HRNet by and IoU, respectively (95% CI: and ; adjusted for both comparisons).
Overall, the statistical analysis refines the robustness results in Table 4. The advantages of SegFormer on Cultivated Land and Nursery and RSPrompter on Water are statistically supported, whereas the leading result of RSPrompter on Building is better regarded as a non-significant performance tendency among closely competing models. These findings provide a more cautious basis for the target-dependent interpretation of paradigm suitability under Protocol B.
4.4. Computational Cost Analysis
To complement the accuracy- and robustness-based comparisons, we further report the computational cost of the eight segmentation paradigms in Table 6. The comparison includes model configuration, parameter size, FLOPs, training time, inference latency, FPS, and peak training memory under the unified experimental setting.
Table 6 shows that the compared paradigms differ substantially in computational cost. RSPrompter has the largest model size and the highest overall cost, with 122.00 M parameters, 88.90 G FLOPs, 8.12 min/epoch training time, 32.6 ms/image inference latency, and 21.10 GB peak training memory. Mask2Former is also computationally demanding, requiring 108.10 M parameters, 98.20 G FLOPs, 6.96 min/epoch, and 25.4 ms/image.
In contrast, U-Net and HRNet provide lower computational cost, with U-Net achieving the fastest inference speed among the compared models (9.2 ms/image, 108.7 FPS) and the lowest peak training memory (8.60 GB). SegFormer shows a more balanced efficiency profile: although it is not the lightest model, it requires moderate training and inference cost (2.93 min/epoch and 12.7 ms/image) while providing statistically supported advantages on Cultivated Land and Nursery. These results indicate that deployment-oriented model selection should consider both target-specific accuracy and computational efficiency. For example, RSPrompter is effective on Water but incurs substantially higher computational cost, whereas simpler or more efficient paradigms may be preferable when latency or memory is the primary constraint.
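The latency and FPS figures can be related by a simple conversion. The sketch below assumes that FPS is the reciprocal of the per-image latency, which is consistent with the U-Net values quoted above (9.2 ms/image and 108.7 FPS); the function name `fps_from_latency_ms` is hypothetical.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert per-image inference latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms
```

Applied to the U-Net latency, `round(fps_from_latency_ms(9.2), 1)` gives 108.7.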
4.5. Qualitative Analysis and Failure Modes
Figure 2 provides representative qualitative comparisons of the selected paradigms. The visual results are consistent with the quantitative findings: structurally regular or large-area targets are generally delineated more reliably, whereas sparse, elongated, or semantically ambiguous targets expose clearer inter-model differences. The qualitative analysis is not intended to provide an independent ranking of models, but to illustrate representative failure patterns behind the quantitative results.
To make the representative failure cases quantitatively interpretable, Figure 3 and Figure 4 are further annotated with patch-level IoU, FP, and FN. Here, FP and FN denote the false-positive and false-negative proportions normalized by the union area of the prediction and the ground truth, respectively. These annotations provide local error-decomposition evidence for the observed failure patterns.
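When false positives and false negatives are both normalized by the union of prediction and ground truth, the three patch-level quantities sum to one, so each patch's error can be read as a split between leakage and omission. The sketch below illustrates this decomposition for binary masks stored as nested lists. The function name `patch_error_decomposition`, the key names, and the choice to return `None` for a patch with an empty union are assumptions.

```python
from typing import Dict, List, Optional

Mask = List[List[int]]  # H x W binary mask (1 = target, 0 = background)


def patch_error_decomposition(pred: Mask, target: Mask) -> Optional[Dict[str, float]]:
    """Return IoU plus FP and FN proportions, all normalized by the union area."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    union = tp + fp + fn
    if union == 0:
        # Neither mask contains the target, so the ratios are undefined.
        return None
    return {
        "iou": tp / union,
        "fp_ratio": fp / union,  # share of the union that is background leakage
        "fn_ratio": fn / union,  # share of the union that is missed target
    }
```

A patch with a large `fn_ratio` is dominated by omission, whereas a large `fp_ratio` indicates over-activation of the background.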
For Water, the dominant failure modes are target omission, fragmented response, and background leakage. As shown in Table 3, RSMamba attains a relatively high Recall of 0.8782 on Water, but its Precision and IoU decrease to 0.3916 and 0.3882, respectively, indicating a clear tendency toward background over-activation. The annotated examples in Figure 3 further show that larger FN values correspond to incomplete extraction of sparse or weakly connected water regions, whereas larger FP values indicate leakage into surrounding water-like background structures. These observations suggest that Water is challenging not only because of its low target proportion, but also because of weak foreground continuity and elongated geometry.
For Nursery, the dominant failure mode is mainly related to semantic confusion with surrounding vegetation-like regions. As shown in Figure 4, predictions often overlap only partially with the ground truth, leading to spatial misalignment and incomplete delineation. The local annotations show both FN-dominated cases, which indicate omission or conservative prediction, and FP errors, which reflect confusion with visually similar background regions. This pattern is consistent with the quantitative results, where Mask2Former achieves high Precision but suffers from severe Recall loss, while SegFormer provides a more balanced delineation under this type of ambiguity.
Taken together, the qualitative examples and patch-level annotations show that different targets fail for different reasons. Water is mainly affected by sparse geometry, fragmented continuity, and background leakage, whereas Nursery is mainly affected by semantic ambiguity, omission, and boundary inconsistency. These results further support the need for target-aware model selection under severe class imbalance.
4.6. Practical Implications for Target-Aware Model Selection
Based on the quantitative results, repeated-run robustness analysis, statistical tests, computational cost comparison, and patch-level failure-case annotations, Table 7 summarizes application-oriented model-selection guidance under Protocol B. These recommendations should be interpreted as target-aware guidance rather than a universal ranking.
For structured object-like targets such as Building, RSPrompter achieves the highest mean IoU, but its advantage over the closest competitors is not statistically significant. Therefore, it can be considered when the highest observed IoU is prioritized, while alternative paradigms such as HRNet or SegFormer may also be reasonable depending on the required Recall, efficiency, and error tolerance. For large-area continuous targets such as Cultivated Land, SegFormer is recommended because it provides the strongest and statistically supported overall performance.
For sparse or elongated targets such as Water, RSPrompter is recommended when segmentation accuracy is the primary concern, as its advantage is statistically supported. However, its higher computational cost should be considered in deployment-oriented scenarios. Models with high Recall but low Precision should be used cautiously, as they may over-activate background regions. For semantically ambiguous targets such as Nursery, SegFormer is recommended because it provides the most robust overall performance and is less affected by vegetation-like confusion.
These findings suggest that model selection for key geospatial element extraction should be both protocol-aware and target-aware. In practice, the preferred paradigm depends not only on average segmentation accuracy, but also on target morphology, semantic ambiguity, statistical reliability, computational cost, and the application-specific tolerance for false positives and false negatives. Therefore, the central conclusion is not that a single paradigm should be preferred in all cases, but that different model families expose different strengths and failure tendencies under target-specific imbalance.
5. Discussion
5.1. Target Properties and Paradigm-Specific Suitability
The results show that target prevalence alone is insufficient to explain segmentation difficulty. Although Building and Cultivated Land are both relatively frequent targets, their metric patterns differ across paradigms. Similarly, Water and Nursery are both challenging, but their dominant error sources are different. These observations indicate that target morphology, spatial continuity, fragmentation, and semantic similarity to surrounding regions jointly influence segmentation performance.
The compared paradigms also show different suitability for different target properties. Under Protocol B, RSPrompter achieves the highest mean IoU on Building, but its advantage over the closest competitors is not statistically significant, suggesting that several paradigms remain competitive for structured object-like targets. RSPrompter shows a statistically supported advantage on Water, indicating its suitability for sparse or elongated targets, although its higher computational cost should also be considered. SegFormer shows statistically supported advantages on Cultivated Land and Nursery, suggesting that hierarchical multi-scale representations are beneficial for large-area continuous targets and semantically ambiguous vegetation-like targets. Mask2Former tends to be conservative, often trading Recall for higher Precision, whereas RSMamba shows less stable behavior across targets, especially on Water.
The patch-level error annotations further support these interpretations by linking visual failure cases to observable error mechanisms. For Water, larger FP or FN components correspond to background leakage or fragmented omission, while for Nursery, omission and confusion with vegetation-like background regions are more prominent. Therefore, paradigm suitability should be interpreted with respect to target properties, error tolerance, statistical reliability, and computational cost rather than a single aggregate metric. These interpretations are consistent with the empirical results, although the underlying internal representation mechanisms still require further verification.
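The relation between FP and FN components and the reported metrics can be sketched as follows. This is a minimal example that assumes binary prediction and ground-truth masks for a single patch, stored as nested lists of 0/1. The function name error_profile and the fp_share quantity are illustrative assumptions, not the annotation scheme used in this study.

```python
def error_profile(pred, truth):
    """Count TP/FP/FN over two equally sized binary masks and derive
    Precision, Recall, IoU, and the FP share of all erroneous pixels."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # target pixel correctly predicted
            elif p and not t:
                fp += 1      # background predicted as target (leakage)
            elif t and not p:
                fn += 1      # target pixel missed (omission)

    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "fp_share": ratio(fp, fp + fn),
    }


# Toy 2x4 patch: one leaked background pixel and one omitted target pixel.
truth_mask = [[1, 1, 0, 0],
              [1, 1, 0, 0]]
pred_mask = [[1, 1, 1, 0],
             [1, 0, 0, 0]]
profile = error_profile(pred_mask, truth_mask)
# profile["iou"] == 0.6, profile["precision"] == 0.75, profile["recall"] == 0.75
```

Two patches with the same IoU can have very different fp_share values. An FP-dominated patch suggests background leakage, whereas an FN-dominated patch suggests omission.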
5.2. Practical Implications for Key Geospatial Element Extraction
From an application perspective, the main implication of this study is that model selection for key geospatial element extraction should be both target-aware and protocol-aware rather than driven by a single aggregate metric. Protocol A is more suitable when the practical priority is absolute class-wise performance under conventional full-scene segmentation, whereas Protocol B is more informative when the goal is target-present delineation analysis and target-aware paradigm selection. These two protocols therefore serve different but complementary purposes.
Within Protocol B, when the target is structurally regular and object-like, a prompt-guided paradigm such as RSPrompter provides a strong default choice because it offers a favorable balance between Precision and Recall, although its lead on Building is not statistically significant. When the target is large-area and continuous, multiple paradigms remain competitive, and the final choice should depend on whether the application prioritizes overall overlap quality, Recall, or false-positive control. In this setting, SegFormer provides the strongest overall balance, while RSMamba may also remain useful when Precision is particularly important. For sparse or elongated targets, stable suppression of background-driven activations is as important as raw sensitivity, so robust delineation behavior should be prioritized. For semantically ambiguous vegetation-like targets, hierarchical multi-scale representations appear to be more robust than overly conservative or unstable prediction behavior.
These observations also clarify why overall multi-class metrics alone are often insufficient for deployment-oriented decisions. In practice, the cost of false positives and false negatives differs across application scenarios. Therefore, a model with the highest overall mean performance may not be the most suitable choice for a given geospatial element if its prediction behavior is misaligned with the operational objective. The present analysis under Protocol B provides a more application-relevant perspective by linking paradigm behavior to target characteristics, robustness, and error tolerance, while Protocol A remains the appropriate conventional reference for full-scene class-wise performance.
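One simple way to make the application-specific tolerance for false positives and false negatives explicit is the standard F-beta score, where beta > 1 weights Recall more heavily and beta < 1 weights Precision more heavily. The sketch below is illustrative only. The candidate names and their Precision/Recall values are hypothetical placeholders, not measurements from this study, and F-beta is a swapped-in selection criterion rather than the one used in the paper.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; beta > 1 favors Recall, beta < 1 favors Precision."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_model(candidates, beta):
    """Return the candidate name with the highest F-beta score.

    `candidates` maps a model name to a (precision, recall) pair.
    """
    return max(candidates, key=lambda name: f_beta(*candidates[name], beta))


# Hypothetical (precision, recall) pairs for two prediction behaviors.
candidates = {
    "model_conservative": (0.90, 0.60),  # few false positives, more omissions
    "model_sensitive": (0.65, 0.88),     # fewer omissions, more false positives
}

omission_averse = pick_model(candidates, beta=2.0)   # -> "model_sensitive"
leakage_averse = pick_model(candidates, beta=0.5)    # -> "model_conservative"
```

The same pair of candidates yields opposite choices under the two settings, so a single ranking cannot serve applications with different error costs.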
5.3. Limitations and Future Directions
Several limitations should be acknowledged. First, the present analysis focuses on four representative target categories, and the generalizability of the observed patterns to other geospatial elements and datasets remains to be validated. Second, this study is primarily empirical and comparative rather than methodological. The target-oriented protocol is used as a complementary application-oriented setting and should not be interpreted as universally superior to conventional multi-class segmentation. Third, although repeated-run experiments, paired statistical tests, computational cost measurements, and patch-level error annotations were included to strengthen the comparison, the conclusions remain dependent on the current dataset, implemented model configurations, hardware environment, and evaluation setting.
Future work may extend the present study toward broader target coverage, additional protocol settings, and deeper mechanism-oriented analysis, such as feature visualization, ablation studies, and more systematic examination of boundary fragmentation and semantic confusion.
6. Conclusions
This study presented an application-oriented comparative analysis of remote sensing segmentation under severe class imbalance. Using a complementary target-present setting (Protocol B), together with conventional multi-class segmentation (Protocol A), we evaluated eight representative segmentation paradigms on four key geospatial elements with different structural and semantic characteristics.
Three main conclusions can be drawn from the results. First, Protocol A generally yields higher absolute class-wise performance on the present dataset and remains the appropriate conventional reference setting for full-scene segmentation accuracy. Second, no single segmentation paradigm is universally optimal across all targets, especially under Protocol B, where model behavior is strongly target-dependent. The statistical analysis shows that the advantages of SegFormer on Cultivated Land and Nursery and RSPrompter on Water are statistically supported, whereas the leading result of RSPrompter on Building should be interpreted as a non-significant performance tendency. Third, target prevalence alone cannot fully explain segmentation difficulty; target morphology, continuity, fragmentation, and semantic ambiguity should also be considered. The patch-level error annotations further support this interpretation by showing different observable failure patterns, such as background leakage on Water and omission or semantic confusion on Nursery.
Rather than claiming a universally superior task formulation, this study provides a protocol-aware and target-aware perspective on segmentation model selection for imbalanced remote sensing applications. The results suggest that deployment-oriented model choice should be guided by the evaluation protocol, target characteristics, statistical reliability, computational cost, and application-specific error tolerance. Future work will extend this analysis to broader target sets, additional protocol settings, and deeper mechanism-oriented analysis.