An Application-Oriented Comparative Study of Segmentation Paradigms for Key Geospatial Element Extraction Under Extreme Class Imbalance

Jin, Jiali; Yong, Xi; Sun, Honglin; Wang, Sai; Zhang, Peiyu; Zheng, Zelong; He, Zhaofeng; Li, Qi; Sun, Zhenan; Fu, Jing

doi:10.3390/electronics15112438

by Jiali Jin ^1,2,*, Xi Yong ^3, Honglin Sun ^3, Sai Wang ^3, Peiyu Zhang ^4, Zelong Zheng ^1,2, Zhaofeng He ^5, Qi Li ^1,2, Zhenan Sun ^1,2 and Jing Fu ^3,*

^1 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
^2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
^3 Information Center, Ministry of Water Resources of China, Beijing 100053, China
^4 Beijing GoldenWater Information Technology Development Co., Ltd., Beijing 100053, China
^5 School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
^* Authors to whom correspondence should be addressed.

Electronics 2026, 15(11), 2438; https://doi.org/10.3390/electronics15112438

Submission received: 12 April 2026 / Revised: 24 May 2026 / Accepted: 26 May 2026 / Published: 3 June 2026

(This article belongs to the Special Issue Data-Related Challenges in Machine Learning: Theory and Application)

Abstract

Remote sensing applications often require the extraction of a small number of task-critical geospatial elements under severe class imbalance. This setting is challenging because dominant categories occupy most pixels, while targets of interest may be sparse, fragmented, or semantically ambiguous. In this study, we build our analysis on a remote sensing dataset consisting of 10,482 pixel-wise annotated RGB image tiles covering 14 semantic categories with pronounced long-tailed characteristics. Based on this dataset, we conduct an application-oriented comparative study of eight representative segmentation models on four key geospatial element categories with different sparsity levels and visual properties. Quantitative evaluation is performed using Intersection over Union, Precision, Recall, and F1-score, and representative qualitative cases are examined to analyze model behavior. An additional comparison with conventional multi-class segmentation shows that the target-oriented setting should be understood not as a universally superior alternative, but as a complementary application-oriented setting for analyzing target-specific delineation behavior under severe imbalance. The results further indicate that segmentation difficulty cannot be explained by target proportion alone, but is jointly associated with target morphology, spatial fragmentation, and semantic similarity to surrounding categories. These findings provide practical guidance for segmentation model selection in highly imbalanced remote sensing applications.

Keywords: remote sensing; semantic segmentation; class imbalance; geospatial element extraction; comparative study

1. Introduction

Remote sensing imagery provides detailed and large-scale observations of the Earth’s surface and supports applications such as land-use mapping, urban planning, environmental monitoring, and infrastructure management [1]. Semantic segmentation, which assigns a semantic label to each pixel, has become a core technique for extracting fine-grained geospatial information from such imagery [2]. Recent advances in convolutional neural networks and transformer-based architectures have substantially improved remote sensing semantic segmentation performance [3,4]. In many practical scenarios, however, the goal is not comprehensive scene parsing over all categories, but the accurate extraction of a small number of task-critical geospatial elements, such as buildings, water bodies, or cultivated land [5]. In such settings, an important question is not only how to improve overall segmentation accuracy, but also which segmentation paradigm is more suitable for extracting a specific target of interest.

Reliable target extraction remains challenging because real-world geospatial data often exhibit severe pixel-level class imbalance. In large-scale datasets, a few dominant land-cover categories occupy most pixels, whereas many categories of practical interest are sparse and unevenly distributed [6]. Under such long-tailed conditions, optimization can be dominated by head classes, which may limit the effectiveness of standard segmentation pipelines for task-critical targets even when aggregate evaluation metrics remain acceptable. More importantly, extraction difficulty is not determined by class proportion alone. In practice, target performance is also influenced by target morphology, spatial continuity, fragmentation, boundary clarity, and semantic similarity to surrounding categories [6,7]. These factors make it difficult to judge model suitability using overall multi-class evaluation results alone.

Existing studies on remote sensing semantic segmentation have mainly focused on improving model architectures, feature representations, or loss designs within a unified multi-class formulation. Various strategies, including re-weighting, focal-style objectives, and class-balanced learning, have been explored to alleviate imbalance effects [7,8]. Although these studies provide valuable advances, they do not fully answer a practical application-oriented question: when the objective is the extraction of a single key geospatial element under highly imbalanced conditions, how do different segmentation paradigms behave, and what kinds of targets are they better suited for? Existing review papers and benchmark studies also mainly emphasize methodological development or overall multi-class performance, whereas practical deployment often requires target-aware model selection for a specific geospatial element.

Motivated by this gap, this study conducts an application-oriented comparative analysis of remote sensing segmentation under severe class imbalance. Specifically, we adopt a target-oriented evaluation setting in which each key geospatial element is analyzed as an independent target-specific extraction task. This setting is not intended to replace conventional multi-class segmentation or to imply universally higher class-wise performance. Instead, it serves as a complementary application-oriented setting for analyzing target-specific delineation behavior under severe imbalance. Based on a remote sensing dataset with pronounced long-tailed category characteristics, we benchmark eight representative segmentation models spanning different design paradigms on four key geospatial element categories with different sparsity levels and visual properties. We further include a limited comparison with conventional multi-class segmentation to clarify the role of the target-oriented setting. Through quantitative evaluation and qualitative case analysis, we examine model performance, precision–recall trade-offs, robustness, and characteristic failure modes under highly imbalanced conditions. Ultimately, this work aims to provide empirical evidence and practical guidance for segmentation model selection and protocol interpretation in highly imbalanced remote sensing applications.

The main contributions of this work are summarized as follows:

We conduct an application-oriented comparative study of eight representative segmentation paradigms across four key geospatial elements with different structural and semantic characteristics.
We build a custom remote sensing segmentation dataset with pronounced long-tailed characteristics and severe class imbalance, which serves as the experimental basis for the present protocol comparison and target-specific analysis.
We compare a conventional multi-class setting and a target-present one-vs-rest setting, and show that the latter should be interpreted as a complementary application-oriented protocol rather than a universally superior alternative.
We provide target-aware practical guidance by analyzing quantitative performance, robustness, and characteristic failure modes across different target types.

2. Related Works

Remote sensing semantic segmentation has become a core technique for extracting fine-grained geospatial information from high-resolution imagery and supports a wide range of applications, including land-use analysis, infrastructure monitoring, and environmental observation [1,2]. With the rapid development of deep learning, the field has evolved from conventional convolution-based pipelines toward a broader family of dense prediction paradigms. Meanwhile, many practical remote sensing applications prioritize the extraction of a small number of task-critical geospatial elements rather than complete scene parsing over all semantic categories. This application-oriented demand motivates a more target-aware view of model comparison under highly imbalanced conditions.

Deep Learning Paradigms for Remote Sensing Segmentation. Fully convolutional networks (FCNs) established the basis for end-to-end dense prediction and laid the foundation for modern semantic segmentation [9]. Subsequent work introduced encoder–decoder architectures and multi-scale context aggregation strategies to better cope with the large scale variation, heterogeneous textures, and complex spatial layouts common in remote sensing scenes, including U-Net-style models, DeepLab-style encoder–decoder designs, and pyramid-context approaches [10,11,12]. High-resolution representation learning and object-context modeling further improved spatial precision and semantic reasoning in dense prediction [13,14]. More recently, transformer-based segmentation models, mask-classification frameworks, state-space models, and prompt-guided adaptation of foundation models have substantially expanded the design space of remote sensing segmentation [15,16,17,18,19,20]. Although these studies have significantly advanced model design, most evaluations still emphasize overall multi-class accuracy rather than target-aware paradigm suitability under severe imbalance.

Class Imbalance and Long-Tailed Challenges in Remote Sensing. A persistent challenge in remote sensing semantic segmentation is severe pixel-level class imbalance, where a few dominant land-cover categories occupy most pixels while many categories of interest are sparse or weakly represented [6,7]. Public benchmarks such as LoveDA have also highlighted difficulties caused by multi-scale objects, complex backgrounds, and inconsistent class distributions across domains [21]. To mitigate these effects, existing studies have explored class re-weighting, class-balanced objectives, focal-style losses, and broader long-tailed learning strategies [6,7,8]. However, these methods do not directly answer a practical question: when the task is the extraction of a single key geospatial element rather than complete scene understanding, how do different segmentation paradigms behave, and which target types are they better suited for?

Target-Specific Extraction and Related Fine-Grained Tasks. Beyond unified multi-class segmentation, many remote sensing applications are inherently target-centered. Representative examples include building extraction, road extraction, water-body delineation, and flood extent mapping, where non-target categories mainly serve as contextual background and the practical objective is accurate extraction of a single geospatial element of interest [22,23,24,25]. This perspective is also related to broader fine-grained scene understanding tasks such as instance segmentation and panoptic segmentation, which provide richer target-centered descriptions than conventional semantic labeling alone [16,19,26,27]. However, most existing studies are centered on a specific target category, supervision setting, or model family, and therefore provide limited evidence on how substantially different segmentation paradigms compare when target sparsity, fragmentation, or semantic ambiguity becomes dominant.

Benchmarking and Application-Oriented Comparative Analysis. Systematic benchmarking is essential for understanding the strengths and limitations of segmentation models [2]. Existing remote sensing benchmarks and comparative studies mainly focus on multi-class land-cover mapping and typically report aggregate metrics such as mean Intersection over Union. While valuable for general scene understanding, such evaluations provide limited guidance for application-oriented scenarios in which the goal is to extract a single key geospatial element under severe class imbalance. In particular, relatively few studies jointly analyze target sparsity, spatial fragmentation, semantic ambiguity, precision–recall trade-offs, and characteristic failure modes across substantially different segmentation paradigms. This gap motivates the present study, which focuses on paradigm behavior, protocol interpretation, and model suitability for key geospatial element extraction.

In summary, existing studies have made substantial progress in remote sensing segmentation architectures, imbalance mitigation, benchmark construction, and target-specific applications. However, comparatively fewer works systematically analyze how different segmentation paradigms behave for key geospatial element extraction under severe imbalance while also clarifying the role of a complementary target-oriented analysis setting relative to conventional multi-class evaluation. This gap motivates the application-oriented comparative study conducted in this paper.

3. Materials and Methods

Figure 1 outlines the overall design of this study, including target selection, protocol construction, model comparison, and subsequent quantitative as well as qualitative analysis.

3.1. Dataset Description and Annotation Protocol

This study is based on a custom remote sensing semantic segmentation dataset containing 10,482 image tiles. The dataset was derived from satellite imagery acquired in 2024 and cropped into RGB patches of 256 × 256 pixels for annotation, training, and evaluation. The original source image has a spatial size of 14,401 × 9601 pixels with three spectral bands. Its pixel size is approximately 8.68 × 10⁻⁶ degrees in both the horizontal and vertical directions, corresponding to a ground sampling distance of approximately 1 m in the study area. All images are stored as 3-channel RGB uint8 JPG files, and the corresponding labels are stored as single-channel uint8 PNG masks.

The dataset contains 14 semantic categories, including background and 13 foreground geospatial classes, and exhibits a pronounced long-tailed distribution. Dominant categories such as Building and Cultivated Land occupy most pixels, whereas categories such as Water and Nursery are much less represented. Table 1 summarizes the semantic categories and pixel-level distribution of the dataset, highlighting the severe class imbalance across categories.

Pixel-wise annotations were manually produced and reviewed following a unified category definition and annotation guideline. Quality control focused on boundary consistency, category-definition consistency, and the handling of visually ambiguous regions. Based on the dataset statistics, four representative categories were selected for comparative analysis: Building, Cultivated Land, Water, and Nursery. These targets differ in prevalence, spatial continuity, structural regularity, and semantic ambiguity, and therefore provide a suitable basis for studying paradigm-specific behavior under highly imbalanced conditions.

3.2. Target-Oriented Evaluation Protocol

To examine the effect of protocol design, we considered two related settings in this study.

Protocol A: Conventional multi-class segmentation. Protocol A follows the standard multi-class formulation, in which all images are retained and the original multi-class annotations are used directly for training and evaluation. This setting serves as the conventional reference protocol and reflects the standard full-scene segmentation scenario.

Protocol B: Target-present one-vs-rest segmentation. Protocol B reformulates each selected category as an independent binary segmentation task. For a target class c, the original dataset is denoted as

D_{total} = \{(x_{i}, y_{i})\}_{i = 1}^{N},

where x_{i} is an input image and y_{i} is the corresponding multi-class label map. The binary target mask is defined as

y_{i}^{(c)}(p) = \begin{cases} 1, & \text{if } y_{i}(p) = c, \\ 0, & \text{otherwise}, \end{cases}

where p denotes a pixel location.

Using this formulation, we constructed four target-specific datasets for Building, Cultivated Land, Water, and Nursery. These datasets were derived from the same global split as Protocol A, but only image tiles containing the selected target were retained. This design was adopted to focus the analysis on target delineation rather than target presence detection.
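
The construction of a Protocol B subset can be sketched as follows. This is a minimal, dependency-free illustration rather than the pipeline used in the study: label maps are represented as nested lists instead of image arrays, and the function names, tile identifiers, and the class index 3 are hypothetical.

```python
from typing import List, Tuple

LabelMap = List[List[int]]  # H x W grid of class indices


def to_binary_mask(label_map: LabelMap, target_class: int) -> LabelMap:
    """Return 1 where a pixel equals target_class and 0 elsewhere."""
    return [[1 if value == target_class else 0 for value in row] for row in label_map]


def build_target_subset(
    samples: List[Tuple[str, LabelMap]], target_class: int
) -> List[Tuple[str, LabelMap]]:
    """Binarize labels and keep only tiles that contain the target class."""
    subset = []
    for tile_id, label_map in samples:
        mask = to_binary_mask(label_map, target_class)
        if any(any(row) for row in mask):  # target-present filter
            subset.append((tile_id, mask))
    return subset


# Toy split with two 2 x 2 tiles; class index 3 is a made-up target id.
toy_split = [
    ("tile_a", [[0, 3], [3, 5]]),
    ("tile_b", [[1, 1], [0, 5]]),
]
subset = build_target_subset(toy_split, target_class=3)
```

Applying the same function separately to the training, validation, and test portions of the global split keeps the partitioning consistent with Protocol A, because tiles are only dropped and never moved between partitions.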

3.3. Compared Segmentation Paradigms

To conduct a representative comparative study, we selected eight segmentation models covering several major design paradigms in dense prediction: U-Net, DeepLabV3+, HRNet, OCRNet, SegFormer, Mask2Former, RSMamba, and RSPrompter. These models were chosen to cover classical encoder–decoder learning, multi-scale context aggregation, high-resolution representation learning, object-context reasoning, transformer-based segmentation, mask-classification-based segmentation, state-space modeling, and prompt-guided segmentation, respectively.

U-Net and DeepLabV3+ serve as widely used baseline architectures for semantic segmentation. HRNet and OCRNet represent high-resolution and context-aware feature learning. SegFormer and Mask2Former provide transformer-based and mask-classification-based paradigms. RSMamba represents state-space-model-based segmentation, while RSPrompter introduces a prompt-guided paradigm adapted from large pre-trained visual models. Together, these models enable comparison across substantially different modeling strategies, which is important for analyzing paradigm-level behavior rather than minor architectural variation.

3.4. Experimental Setup

A shared global training, validation, and test split was first constructed from the full dataset. Protocol A directly uses this global split with the original multi-class annotations, whereas Protocol B is derived from the same split by converting each selected target into a one-vs-rest binary segmentation task and retaining only image tiles containing that target. In this way, the comparison between the two protocols is based on consistent data partitioning rather than differences in split construction.

All input images were resized to 256 × 256 pixels. During training, data augmentation included random cropping, random flipping, photometric distortion, normalization, and padding. During testing, only deterministic preprocessing and normalization were applied. Images were normalized using a mean of [123.675, 116.28, 103.53] and a standard deviation of [58.395, 57.12, 57.375].
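
For illustration, the per-channel normalization can be written as below. The mean and standard deviation are the values stated above, which coincide with the commonly used ImageNet statistics on the 0 to 255 scale. The RGB channel order and the function name are assumptions of this sketch.

```python
from typing import Sequence, Tuple

# Per-channel statistics as stated in the text (assumed to be in RGB order).
MEAN = (123.675, 116.28, 103.53)
STD = (58.395, 57.12, 57.375)


def normalize_pixel(rgb: Sequence[float]) -> Tuple[float, ...]:
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple((value - m) / s for value, m, s in zip(rgb, MEAN, STD))


mean_pixel = normalize_pixel(MEAN)              # maps to zero in every channel
white_pixel = normalize_pixel((255, 255, 255))  # brightest uint8 value
```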

To improve comparability across model families, all compared models were trained for 300 epochs under the same train/validation/test split strategy and evaluated using the same metrics. All models were optimized using stochastic gradient descent (SGD) with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005. A poly learning-rate schedule was adopted with a power of 0.9 and a minimum learning rate of 1 × 10⁻⁴. The batch size was set to 8 images per GPU. The checkpoint with the best validation IoU was selected for final testing.
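
The text does not give the exact form of the poly schedule. One common formulation, assumed here, decays the learning rate from its initial value to the minimum as (1 - t/T) raised to the stated power. Whether t counts iterations or epochs is not specified, so `total_steps` below is a placeholder.

```python
def poly_lr(
    step: int,
    total_steps: int,
    base_lr: float = 0.01,
    min_lr: float = 1e-4,
    power: float = 0.9,
) -> float:
    """Polynomial decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr


# Learning rate at the start, midpoint, and end of a 300-step schedule.
schedule = [poly_lr(step, total_steps=300) for step in (0, 150, 300)]
```

With a power below 1, the decay is slightly slower than linear early on and steeper near the end of training.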

Experiments were conducted in a PyTorch 2.1.0-based environment with CUDA 11.1 on a server equipped with six NVIDIA GeForce RTX 4090 GPUs, each with 24 GB of memory. For robustness analysis, all compared paradigms were additionally trained under Protocol B using three random seeds (0, 1, and 2), and the mean and standard deviation of IoU were reported in the robustness analysis of the Results section.

Model performance was evaluated independently using Intersection over Union (IoU), Precision, Recall, and F1-score. These metrics jointly measure segmentation overlap quality and the precision–recall trade-off, which is particularly important under highly imbalanced conditions. In addition to quantitative evaluation, representative qualitative predictions were analyzed to investigate characteristic failure modes, including target omission, fragmented segmentation, and semantic confusion with visually similar background regions.
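
The four metrics follow directly from the true-positive (TP), false-positive (FP), and false-negative (FN) pixel counts of a binary prediction. The sketch below uses nested lists as masks and returns 0.0 when a denominator is zero. That convention, and the choice to compute metrics on a single mask pair rather than accumulating counts over the test set, are assumptions of this example.

```python
from typing import Dict, List

Mask = List[List[int]]  # H x W grid of 0/1 values


def confusion_counts(pred: Mask, truth: Mask) -> Dict[str, int]:
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    return {"tp": tp, "fp": fp, "fn": fn}


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(pred: Mask, truth: Mask) -> Dict[str, float]:
    """Compute IoU, Precision, Recall, and F1-score for the foreground class."""
    counts = confusion_counts(pred, truth)
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    return {
        "iou": safe_div(tp, tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Toy 2 x 3 masks: two true positives, one false positive, one false negative.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
metrics = binary_metrics(pred, truth)
```

In this toy case Precision and Recall are both 2/3 while IoU is 0.5, which illustrates that IoU penalizes false positives and false negatives jointly and is never higher than either of the other two.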

3.5. Statistical Analysis

To avoid over-interpreting the limited number of random seeds, statistical significance was not assessed using seed-level averages alone. Instead, paired test-image-level analyses were conducted under Protocol B. For each model and each test image, IoU values from three random seeds were first averaged. The resulting per-image IoU differences between two models were then used for statistical comparison.

For each target category, the top-performing model according to the seed-averaged IoU values was compared with the closest competing models. The 95% confidence intervals of the mean IoU differences were estimated using paired bootstrap resampling with 10,000 iterations. Paired Wilcoxon signed-rank tests were further applied to assess statistical significance. The p-values were adjusted using the Holm–Bonferroni correction to account for multiple comparisons. A difference was considered statistically significant when the adjusted p-value was below 0.05 and the 95% confidence interval did not include zero.
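
A minimal sketch of this procedure is given below, covering seed averaging, the paired bootstrap interval, and the Holm–Bonferroni step-down adjustment. It uses a percentile bootstrap, which is an assumption because the interval type is not stated. The Wilcoxon signed-rank test itself is omitted (a routine such as `scipy.stats.wilcoxon` could supply the raw p-values). All IoU values and p-values in the example are made-up placeholders, not results from this study.

```python
import random
from typing import List, Sequence, Tuple


def seed_average(per_seed_iou: Sequence[Sequence[float]]) -> List[float]:
    """Average per-image IoU over seeds; input is indexed [seed][image]."""
    n_seeds = len(per_seed_iou)
    return [sum(values) / n_seeds for values in zip(*per_seed_iou)]


def paired_bootstrap_ci(
    diffs: List[float], n_resamples: int = 10_000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni step-down adjustment, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Placeholder per-image IoU for two models, three seeds, four test images.
model_a = [[0.80, 0.70, 0.90, 0.60], [0.82, 0.68, 0.88, 0.62], [0.78, 0.72, 0.92, 0.58]]
model_b = [[0.70, 0.65, 0.80, 0.55]] * 3

diffs = [a - b for a, b in zip(seed_average(model_a), seed_average(model_b))]
ci_lower, ci_upper = paired_bootstrap_ci(diffs)
adjusted_p = holm_adjust([0.01, 0.04, 0.03])  # placeholder raw p-values
```

Resampling whole per-image differences, rather than the two models' scores independently, preserves the pairing between models on each test image. How many comparisons enter one Holm family (per target or across all targets) is not stated in the text and is left to the caller here.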

4. Results

4.1. Protocol-Level Comparison Between Conventional and Target-Oriented Evaluation

Table 2 and Table 3 compare the conventional multi-class setting and the target-oriented setting. Protocol A serves as the standard multi-class reference, where all semantic categories are jointly learned and evaluated within the same scene-level label space. In contrast, Protocol B reformulates each selected category as a target-present one-vs-rest task, thereby emphasizing the delineation behavior of a specific target of practical interest.

The comparison shows that Protocol A generally provides stronger absolute class-wise performance on the present dataset. This result is important because it indicates that the target-oriented setting should not be interpreted as a universally superior replacement for conventional multi-class segmentation. Rather, Protocol B changes both the supervision structure and the evaluation distribution by focusing on target-present samples and merging all non-target categories into the background. Therefore, its value lies in providing a complementary application-oriented view of target-specific extraction behavior, especially under severe imbalance.

This distinction is essential for interpreting the following results. Protocol A is more suitable for reporting complete scene parsing performance, whereas Protocol B is more informative when the practical question is how reliably a particular geospatial element can be extracted once it becomes the target of interest. In this sense, the exclusion of target-absent image tiles in Protocol B is not intended to create a more realistic full-scene benchmark, but to isolate target-present delineation behavior and precision–recall trade-offs. Consequently, the results under Protocol B should be interpreted as application-oriented target analysis rather than as a replacement for standard multi-class evaluation.

4.2. Target-Specific Quantitative Results Under Protocol B

Table 3 reports the representative single-run results under Protocol B. A clear observation is that no single segmentation paradigm is uniformly optimal across all target categories. Instead, the best-performing paradigm depends strongly on the target type and on which metric is prioritized.

For Building, RSPrompter achieves the highest IoU, while HRNet obtains the highest Recall. This suggests that prompt-guided representations provide a favorable balance between target completeness and boundary precision for structured object-like targets, whereas high-resolution feature maintenance remains beneficial when stronger foreground sensitivity is required. For Cultivated Land, SegFormer achieves the best IoU, Precision, Recall, and F1-score, indicating that hierarchical multi-scale representations are particularly effective for large-area continuous targets with relatively regular contextual patterns.

The behavior on Water and Nursery further shows that target prevalence alone is insufficient to explain segmentation difficulty. Water is not only less dominant than Building and Cultivated Land, but also spatially sparse and often elongated. Under this condition, RSPrompter achieves the highest IoU, whereas HRNet maintains strong Recall. By contrast, RSMamba shows a distinctive high-Recall but low-Precision behavior, suggesting that it tends to over-activate background regions as water. Nursery presents a different type of difficulty. Its challenge is mainly caused by semantic similarity to surrounding vegetation-like regions. SegFormer performs best in this category in terms of IoU, Recall, and F1-score, while Mask2Former achieves the highest Precision but suffers from substantial Recall loss, indicating a more conservative prediction tendency.

Overall, the Protocol B results demonstrate that segmentation difficulty is jointly shaped by target prevalence, morphology, spatial continuity, fragmentation, and semantic ambiguity. Therefore, model comparison under severe imbalance should not be interpreted only through class proportion or a single aggregate metric. Instead, target-specific characteristics and metric-specific trade-offs need to be analyzed together. The paired statistical analysis further refines these observations. The leading result of RSPrompter on Building should be interpreted as a non-significant performance tendency, whereas the advantages of SegFormer on Cultivated Land and Nursery and RSPrompter on Water are statistically supported.

4.3. Robustness Across Random Seeds and Statistical Significance Analysis

To examine whether the observations under Protocol B are sensitive to random initialization, all compared paradigms were additionally trained with three different random seeds. Table 4 reports the mean and standard deviation of IoU.

The repeated-run results are generally consistent with the representative single-run observations. RSPrompter achieves the highest mean IoU on Building and Water, while SegFormer achieves the highest mean IoU on Cultivated Land and Nursery. This suggests that the main target-specific trends are not solely caused by one favorable random initialization.

The standard deviations also indicate different levels of target difficulty. Building and Cultivated Land show relatively stable behavior across most paradigms, whereas Water and Nursery exhibit stronger sensitivity for some models. This is consistent with the additional difficulty caused by sparse geometry, background activation, and vegetation-like semantic ambiguity.

Since only three random seeds were used, the repeated-run results were used mainly to describe robustness to random initialization rather than to perform seed-level significance testing. We therefore conducted paired test-image-level statistical analyses under Protocol B. For each model and each test image, IoU values from the three seeds were first averaged. The resulting per-image IoU differences were used to estimate 95% confidence intervals by paired bootstrap resampling. Paired Wilcoxon signed-rank tests were also performed, and the resulting p-values were adjusted using the Holm–Bonferroni correction.

Table 5 shows that the leading performance of RSPrompter on Building should be interpreted cautiously. Although RSPrompter obtains the highest seed-averaged IoU, its margins over SegFormer and RSMamba are small (+0.0117 and +0.0153), and both confidence intervals include zero. The adjusted p-values are also above 0.05, indicating that the differences are not statistically significant.

In contrast, the advantages on the other three targets are statistically supported. On Cultivated Land, SegFormer outperforms HRNet and OCRNet by +0.0180 and +0.0217 IoU, respectively (95% CI: [+0.004, +0.032] and [+0.007, +0.037]; adjusted p = 0.031 and 0.014). On Water, RSPrompter outperforms HRNet and DeepLabV3+ by +0.0269 and +0.0274 IoU, respectively (95% CI: [+0.009, +0.046] and [+0.010, +0.047]; adjusted p = 0.008 and 0.006). The largest differences are observed on Nursery, where SegFormer outperforms RSPrompter and HRNet by +0.0748 and +0.0852 IoU, respectively (95% CI: [+0.046, +0.104] and [+0.054, +0.119]; adjusted p < 0.001 for both comparisons).

Overall, the statistical analysis refines the robustness results in Table 4. The advantages of SegFormer on Cultivated Land and Nursery and RSPrompter on Water are statistically supported, whereas the leading result of RSPrompter on Building is better regarded as a non-significant performance tendency among closely competing models. These findings provide a more cautious basis for the target-dependent interpretation of paradigm suitability under Protocol B.

4.4. Computational Cost Analysis

To complement the accuracy- and robustness-based comparisons, we further report the computational cost of the eight segmentation paradigms in Table 6. The comparison includes model configuration, parameter size, FLOPs, training time, inference latency, FPS, and peak training memory under the unified experimental setting.

Table 6 shows that the compared paradigms differ substantially in computational cost. RSPrompter has the largest model size and the highest overall cost, with 122.00 M parameters, 88.90 G FLOPs, 8.12 min/epoch training time, 32.6 ms/image inference latency, and 21.10 GB peak training memory. Mask2Former is also computationally demanding, requiring 108.10 M parameters, 98.20 G FLOPs, 6.96 min/epoch, and 25.4 ms/image.

In contrast, U-Net and HRNet provide lower computational cost, with U-Net achieving the fastest inference speed among the compared models (9.2 ms/image, 108.7 FPS) and the lowest peak training memory (8.60 GB). SegFormer shows a more balanced efficiency profile: although it is not the lightest model, it requires moderate training and inference cost (2.93 min/epoch and 12.7 ms/image) while providing statistically supported advantages on Cultivated Land and Nursery. These results indicate that deployment-oriented model selection should consider both target-specific accuracy and computational efficiency. For example, RSPrompter is effective on Water but incurs substantially higher computational cost, whereas simpler or more efficient paradigms may be preferable when latency or memory is the primary constraint.
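
The reported U-Net latency and throughput are consistent with the simple relation FPS = 1000 / latency in milliseconds, which holds for sequential single-image inference. The sketch below applies that relation to the latencies quoted above. The single-image assumption is ours, and the implied values for the other models are derived for illustration and may differ from the FPS entries in Table 6.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Convert per-image latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Per-image inference latencies (ms) quoted in the text.
latencies_ms = {"U-Net": 9.2, "SegFormer": 12.7, "Mask2Former": 25.4, "RSPrompter": 32.6}
implied_fps = {name: round(fps_from_latency(ms), 1) for name, ms in latencies_ms.items()}
```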

4.5. Qualitative Analysis and Failure Modes

Figure 2 provides representative qualitative comparisons of the selected paradigms. The visual results are consistent with the quantitative findings: structurally regular or large-area targets are generally delineated more reliably, whereas sparse, elongated, or semantically ambiguous targets expose clearer inter-model differences. The qualitative analysis is not intended to provide an independent ranking of models, but to illustrate representative failure patterns behind the quantitative results.

To make the representative failure cases quantitatively interpretable, Figure 3 and Figure 4 are further annotated with patch-level IoU, FP, and FN. Here, FP and FN denote the false-positive and false-negative proportions, respectively, each normalized by the union area TP + FP + FN. These annotations provide local error-decomposition evidence for the observed failure patterns.
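
Because IoU and the two error proportions share the union area as their denominator, the three values sum to one for any patch with a non-empty union. The gap between a patch's IoU and 1 therefore splits exactly into a false-positive part and a false-negative part. A small sketch with made-up pixel counts (the function name is hypothetical):

```python
from typing import Dict


def error_decomposition(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Split the union area into IoU, FP, and FN proportions."""
    union = tp + fp + fn
    if union == 0:
        raise ValueError("undefined when both prediction and ground truth are empty")
    return {"iou": tp / union, "fp": fp / union, "fn": fn / union}


# Placeholder counts for one patch: an FN-dominated (omission-type) error.
parts = error_decomposition(tp=60, fp=10, fn=30)
```

For these counts the proportions are 0.6, 0.1, and 0.3, so three quarters of the IoU shortfall comes from missed target pixels rather than from background leakage.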

For Water, the dominant failure modes are target omission, fragmented response, and background leakage. As shown in Table 3, RSMamba attains a relatively high Recall of 0.8782 on Water, but its Precision and IoU decrease to 0.3916 and 0.3882, respectively, indicating a clear tendency toward background over-activation. The annotated examples in Figure 3 further show that larger FN values correspond to incomplete extraction of sparse or weakly connected water regions, whereas larger FP values indicate leakage into surrounding water-like background structures. These observations suggest that Water is challenging not only because of its low target proportion, but also because of weak foreground continuity and elongated geometry.

For Nursery, the dominant failure mode is semantic confusion with surrounding vegetation-like regions. As shown in Figure 4, predictions often overlap only partially with the ground truth, leading to spatial misalignment and incomplete delineation. The local annotations show both FN-dominated cases, which indicate omission or overly conservative prediction, and FP errors, which reflect confusion with visually similar background regions. This pattern is consistent with the quantitative results in Table 3, where Mask2Former achieves high Precision (0.7733) but suffers from severe Recall loss (0.2522), while SegFormer provides a more balanced delineation under this type of ambiguity (Precision 0.6146, Recall 0.6498).

Taken together, the qualitative examples and patch-level annotations show that different targets fail for different reasons. Water is mainly affected by sparse geometry, fragmented continuity, and background leakage, whereas Nursery is mainly affected by semantic ambiguity, omission, and boundary inconsistency. These results further support the need for target-aware model selection under severe class imbalance.

4.6. Practical Implications for Target-Aware Model Selection

Based on the quantitative results, repeated-run robustness analysis, statistical tests, computational cost comparison, and patch-level failure-case annotations, Table 7 summarizes application-oriented model-selection guidance under Protocol B. These recommendations should be interpreted as target-aware guidance rather than a universal ranking.

For structured object-like targets such as Building, RSPrompter achieves the highest mean IoU, but its advantage over the closest competitors is not statistically significant. Therefore, it can be considered when the highest observed IoU is prioritized, while alternative paradigms such as HRNet or SegFormer may also be reasonable depending on the required Recall, efficiency, and error tolerance. For large-area continuous targets such as Cultivated Land, SegFormer is recommended because it provides the strongest and statistically supported overall performance.

For sparse or elongated targets such as Water, RSPrompter is recommended when segmentation accuracy is the primary concern, as its advantage is statistically supported. However, its higher computational cost should be considered in deployment-oriented scenarios. Models with high Recall but low Precision should be used cautiously, as they may over-activate background regions. For semantically ambiguous targets such as Nursery, SegFormer is recommended because it provides the most robust overall performance and is less affected by vegetation-like confusion.

These findings suggest that model selection for key geospatial element extraction should be both protocol-aware and target-aware. In practice, the preferred paradigm depends not only on average segmentation accuracy, but also on target morphology, semantic ambiguity, statistical reliability, computational cost, and the application-specific tolerance for false positives and false negatives. Therefore, the central conclusion is not that a single paradigm should be preferred in all cases, but that different model families expose different strengths and failure tendencies under target-specific imbalance.

5. Discussion

5.1. Target Properties and Paradigm-Specific Suitability

The results show that target prevalence alone is insufficient to explain segmentation difficulty. Although Building and Cultivated Land are both relatively frequent targets, their metric patterns differ across paradigms. Similarly, Water and Nursery are both challenging, but their dominant error sources are different. These observations indicate that target morphology, spatial continuity, fragmentation, and semantic similarity to surrounding regions jointly influence segmentation performance.

The compared paradigms also show different suitability for different target properties. Under Protocol B, RSPrompter achieves the highest mean IoU on Building, but its advantage over the closest competitors is not statistically significant, suggesting that several paradigms remain competitive for structured object-like targets. RSPrompter shows a statistically supported advantage on Water, indicating its suitability for sparse or elongated targets, although its higher computational cost should also be considered. SegFormer shows statistically supported advantages on Cultivated Land and Nursery, suggesting that hierarchical multi-scale representations are beneficial for large-area continuous targets and semantically ambiguous vegetation-like targets. Mask2Former tends to be conservative, often trading Recall for higher Precision, whereas RSMamba shows less stable behavior across targets, especially on Water.

The patch-level error annotations further support these interpretations by linking visual failure cases to observable error mechanisms. For Water, larger FP or FN components correspond to background leakage or fragmented omission, while for Nursery, omission and confusion with vegetation-like background regions are more prominent. Therefore, paradigm suitability should be interpreted with respect to target properties, error tolerance, statistical reliability, and computational cost rather than a single aggregate metric. These findings are consistent with the empirical results, although deeper internal representation mechanisms still require further verification.

5.2. Practical Implications for Key Geospatial Element Extraction

From an application perspective, the main implication of this study is that model selection for key geospatial element extraction should be both target-aware and protocol-aware rather than based on a single aggregate metric. Protocol A is more suitable when the practical priority is absolute class-wise performance under conventional full-scene segmentation, whereas Protocol B is more informative when the goal is target-present delineation analysis and target-aware paradigm selection. These two protocols therefore serve different but complementary purposes.

Within Protocol B, when the target is structurally regular and object-like, a prompt-guided paradigm such as RSPrompter is a reasonable default choice because it offers a favorable balance between Precision and Recall, although its IoU advantage over the closest competitors is not statistically significant. When the target is large and spatially continuous, multiple paradigms remain competitive, and the final choice should depend on whether the application prioritizes overall overlap quality, Recall, or false-positive control. In this setting, SegFormer provides the strongest overall balance, while RSMamba may also remain useful when Precision is particularly important. For sparse or elongated targets, stable suppression of background-driven activations matters as much as raw sensitivity, so robust delineation behavior should be prioritized. For semantically ambiguous vegetation-like targets, hierarchical multi-scale representations appear to be more robust than overly conservative or unstable prediction behavior.

These observations also clarify why overall multi-class metrics alone are often insufficient for deployment-oriented decisions. In practice, the cost of false positives and false negatives differs across application scenarios. Therefore, a model with the highest overall mean performance may not be the most suitable choice for a given geospatial element if its prediction behavior is misaligned with the operational objective. The present analysis under Protocol B provides a more application-relevant perspective by linking paradigm behavior to target characteristics, robustness, and error tolerance, while Protocol A remains the appropriate conventional reference for full-scene class-wise performance.

5.3. Limitations and Future Directions

Several limitations should be acknowledged. First, the present analysis focuses on four representative target categories, and the generalizability of the observed patterns to other geospatial elements and datasets remains to be further validated. Second, this study is primarily empirical and comparative rather than methodological. The target-oriented protocol is used as a complementary application-oriented setting and should not be interpreted as universally superior to conventional multi-class segmentation. Third, although repeated-run experiments, paired statistical tests, computational cost measurements, and patch-level error annotations were added to strengthen the comparison, the conclusions remain dependent on the current dataset, implemented model configurations, hardware environment, and evaluation setting.

Future work may extend the present study toward broader target coverage, additional protocol settings, and deeper mechanism-oriented analysis, such as feature visualization, ablation studies, and more systematic examination of boundary fragmentation and semantic confusion.

6. Conclusions

This study presented an application-oriented comparative analysis of remote sensing segmentation under severe class imbalance. Using a complementary target-present setting (Protocol B), together with conventional multi-class segmentation (Protocol A), we evaluated eight representative segmentation paradigms on four key geospatial elements with different structural and semantic characteristics.

Three main conclusions can be drawn from the results. First, Protocol A generally yields higher absolute class-wise performance on the present dataset and remains the appropriate conventional reference setting for full-scene segmentation accuracy. Second, no single segmentation paradigm is universally optimal across all targets, especially under Protocol B, where model behavior is strongly target-dependent. The statistical analysis shows that the advantages of SegFormer on Cultivated Land and Nursery and RSPrompter on Water are statistically supported, whereas the leading result of RSPrompter on Building should be interpreted as a non-significant performance tendency. Third, target prevalence alone cannot fully explain segmentation difficulty; target morphology, continuity, fragmentation, and semantic ambiguity should also be considered. The patch-level error annotations further support this interpretation by showing different observable failure patterns, such as background leakage on Water and omission or semantic confusion on Nursery.

Rather than claiming a universally superior task formulation, this study provides a protocol-aware and target-aware perspective on segmentation model selection for imbalanced remote sensing applications. The results suggest that deployment-oriented model choice should be guided by the evaluation protocol, target characteristics, statistical reliability, computational cost, and application-specific error tolerance. Future work will extend this analysis to broader target sets, additional protocol settings, and deeper mechanism-oriented analysis.

Author Contributions

Conceptualization, J.J. and J.F.; methodology, J.J. and Z.Z.; software, J.J. and Z.H.; validation, J.J., X.Y. and H.S.; formal analysis, J.J.; investigation, J.J., X.Y., H.S. and S.W.; resources, X.Y., H.S. and J.F.; data curation, J.J., X.Y., H.S., S.W. and P.Z.; writing—original draft preparation, J.J.; writing—review and editing, J.J., X.Y., H.S., P.Z., Z.Z., Z.H., Q.L., Z.S. and J.F.; visualization, J.J. and P.Z.; supervision, Q.L., Z.S. and J.F.; project administration, J.F.; funding acquisition, J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, grant number 2024YFC3210803. The APC was funded by the National Key Research and Development Program of China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The satellite imagery used in this study is currently not fully publicly available due to data confidentiality, source licensing, and usage restrictions. The dataset was derived from satellite image tiles acquired in 2024 and cropped into 256 × 256 image patches for annotation and model development. A public GitHub repository containing benchmark scripts, example files, and related documentation is available at: https://github.com/jin1041/Key-Geo-Elements (accessed on 10 April 2026). After further curation and authorization, a research-accessible version of the dataset and additional data content will be made publicly available.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (GPT-5.4) for language polishing and editing assistance. The authors reviewed and edited the content as needed and take full responsibility for the final manuscript.

Conflicts of Interest

Author Peiyu Zhang was employed by Beijing GoldenWater Information Technology Development Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Zhao, Q.; Yu, L.; Du, Z.; Peng, D.; Hao, P.; Zhang, Y.; Gong, P. An Overview of the Applications of Earth Observation Satellite Data: Impacts and Future Trends. Remote Sens. 2022, 14, 1863.
Li, J.; Cai, Y.; Li, Q.; Kou, M.; Zhang, T. A Review of Remote Sensing Image Segmentation by Deep Learning Methods. Int. J. Digit. Earth 2024, 17, 2328827.
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25 (NeurIPS 2012); Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008.
Chen, Z.; Lian, Y.; Bai, J.; Zhang, J.; Xiao, Z.; Hou, B. Weakly Supervised Semantic Segmentation of Remote Sensing Images Using Siamese Affinity Network. Remote Sens. 2025, 17, 808.
Zhou, Z.; Zheng, C.; Liu, X.; Tian, Y.; Chen, X.; Chen, X.; Dong, Z. A Dynamic Effective Class Balanced Approach for Remote Sensing Imagery Semantic Segmentation of Imbalanced Data. Remote Sens. 2023, 15, 1768.
Cui, W.; Feng, Z.; Chen, J.; Xu, X.; Tian, Y.; Zhao, H.; Wang, C. Long-Tailed Effect Study in Remote Sensing Semantic Segmentation Based on Graph Kernel Principles. Remote Sens. 2024, 16, 1398.
Zhang, Y.; Kang, B.; Hooi, B.; Yan, S.; Feng, J. Deep Long-Tailed Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10795–10816.
Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241.
Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder–Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
Yuan, Y.; Chen, X.; Wang, J. Object-Contextual Representations for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 173–190.
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021); Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 12077–12090.
Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 1290–1299.
Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the First Conference on Language Modeling (COLM 2024), Philadelphia, PA, USA, 7–9 October 2024.
Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote Sensing Image Classification With State Space Model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117.
Osco, L.P.; Wu, Q.; de Lemos, E.L.; Gonçalves, W.N.; Ramos, A.P.M.; Li, J.; Marcato Junior, J. The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103540.
Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks; Vanschoren, J., Yeung, S., Eds.; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2021; Volume 1.
Gebrehiwot, A.; Hashemi-Beni, L.; Thompson, G.; Kordjamshidi, P.; Langan, T.E. Deep Convolutional Neural Network for Flood Extent Mapping Using Unmanned Aerial Vehicles Data. Sensors 2019, 19, 1486.
Luo, L.; Li, P.; Yan, X. Deep Learning-Based Building Extraction from Remote Sensing Images: A Comprehensive Review. Energies 2021, 14, 7982.
Liu, R.; Wu, J.; Lu, W.; Miao, Q.; Zhang, H.; Liu, X.; Lu, Z.; Li, L. A Review of Deep Learning-Based Methods for Road Extraction from High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 2056.
Wieland, M.; Martinis, S.; Kiefl, R.; Gstaiger, V. Semantic Segmentation of Water Bodies in Very High-Resolution Satellite and Aerial Images. Remote Sens. Environ. 2023, 287, 113452.
Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollár, P. Panoptic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9404–9413.
de Carvalho, O.L.F.; de Carvalho Júnior, O.A.; Rosa e Silva, C.; de Albuquerque, A.O.; Santana, N.C.; Borges, D.L.; Gomes, R.A.T.; Guimarães, R.F. Panoptic Segmentation Meets Remote Sensing. Remote Sens. 2022, 14, 965.

Figure 1. Overall workflow of the study.

Figure 2. Representative qualitative comparison of selected paradigms. Rows correspond to Building, Cultivated Land, Water, and Nursery, respectively. Colored boxes highlight representative regions with clearer inter-model differences.

Figure 3. Patch-level quantitative error annotations for representative Water failure cases. Correct, false-positive, and false-negative regions are shown in white, red, and blue, respectively. Each prediction panel is annotated with IoU, FP, and FN.

Figure 4. Patch-level quantitative error annotations for representative Nursery failure cases. Correct, false-positive, and false-negative regions are shown in white, red, and blue, respectively. Each prediction panel is annotated with IoU, FP, and FN.

Table 1. Semantic categories and statistics of the custom dataset. The four key geospatial elements selected for benchmarking are highlighted in bold. The pixel ratio demonstrates the severe class imbalance across categories.

ID	Category Name	RGB Color	Pixel Count	Ratio (%)
0	Background	(0, 0, 0)	357,465,035	16.38
1	Building	(255, 60, 60)	790,184,018	36.21
2	Cultivated Land	(255, 207, 207)	705,686,795	32.34
3	Greenhouse	(255, 11, 11)	8,828,947	0.40
4	Brick Kiln	(255, 93, 93)	91,415	0.004
5	Pen Culture	(255, 208, 208)	739	0.000
6	Wharf	(255, 148, 148)	3014	0.000
7	Bare Land	(255, 244, 244)	2,190,031	0.10
8	Photovoltaic	(255, 1, 1)	4,695,330	0.22
9	Water	(255, 175, 175)	122,435,541	5.61
10	Bridge (Under Cons.)	(255, 56, 56)	6673	0.000
11	Bridge (Completed)	(255, 152, 152)	255,480	0.01
12	Dam	(255, 83, 83)	171,664	0.01
13	Nursery	(255, 142, 142)	15,730,152	0.72
Total Pixels			2,182,109,833	100.0

Table 2. Quantitative comparison of segmentation performance under Protocol A. Best results in each class are shown in bold. F1-score is computed as the harmonic mean of precision and recall.

Class	Model	IoU	Precision	Recall	F1-Score
Class1 Building	DeepLabV3+	0.5789	0.6248	0.8875	0.7333
	U-Net	0.5723	0.6083	0.9062	0.7280
	HRNet	0.6209	0.6391	0.9562	0.7661
	OCRNet	0.6010	0.6467	0.8948	0.7508
	SegFormer	0.5981	0.6411	0.8991	0.7485
	Mask2Former	0.6095	0.7920	0.7257	0.7574
	RSMamba	0.7419	0.7720	0.9500	0.8518
	RSPrompter	0.6458	0.7585	0.8129	0.7848
Class2 Cultivated Land	DeepLabV3+	0.7949	0.8196	0.9634	0.8857
	U-Net	0.7727	0.7819	0.9850	0.8718
	HRNet	0.8002	0.8243	0.9648	0.8890
	OCRNet	0.7644	0.7841	0.9682	0.8665
	SegFormer	0.8545	0.8680	0.9821	0.9215
	Mask2Former	0.7905	0.8524	0.9158	0.8830
	RSMamba	0.7975	0.8750	0.9000	0.8873
	RSPrompter	0.8077	0.8600	0.9300	0.8936
Class9 Water	DeepLabV3+	0.7170	0.7890	0.8871	0.8352
	U-Net	0.6534	0.6947	0.9167	0.7904
	HRNet	0.7194	0.7456	0.9534	0.8368
	OCRNet	0.6662	0.7231	0.8944	0.7997
	SegFormer	0.6355	0.6710	0.9232	0.7772
	Mask2Former	0.6545	0.8467	0.7425	0.7912
	RSMamba	0.3903	0.4102	0.8897	0.5615
	RSPrompter	0.7078	0.8437	0.8146	0.8289
Class13 Nursery	DeepLabV3+	0.5166	0.6600	0.7040	0.6813
	U-Net	0.4102	0.5620	0.6030	0.5818
	HRNet	0.4389	0.5940	0.6270	0.6101
	OCRNet	0.4108	0.5580	0.6090	0.5824
	SegFormer	0.5867	0.6920	0.7940	0.7395
	Mask2Former	0.4094	0.8035	0.4550	0.5810
	RSMamba	0.4100	0.5660	0.5980	0.5816
	RSPrompter	0.5217	0.6720	0.7000	0.6857
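As a quick consistency check on the F1-score definition in the caption, the following sketch recomputes F1 as the harmonic mean of Precision and Recall for three rows copied from Table 2. The function name `f1_score` and the choice of rows are illustrative, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the degenerate case
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall, reported F1-score) copied from Table 2.
ROWS = {
    "DeepLabV3+ / Building": (0.6248, 0.8875, 0.7333),
    "SegFormer / Cultivated Land": (0.8680, 0.9821, 0.9215),
    "Mask2Former / Nursery": (0.8035, 0.4550, 0.5810),
}

for name, (precision, recall, reported) in ROWS.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed {recomputed:.4f}, reported {reported:.4f}")
```

The Mask2Former row shows how the harmonic mean penalizes imbalance: a Precision above 0.80 still yields an F1-score below 0.60 because Recall is low.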

Table 3. Quantitative comparison of segmentation performance under Protocol B on class-conditioned test sets from a representative single run with seed 0. The best result for each class and metric is shown in bold. F1-score is computed as the harmonic mean of precision and recall.

Class	Model	IoU	Precision	Recall	F1-Score
Class1 Building	DeepLabV3+	0.5488	0.6021	0.8664	0.7105
	U-Net	0.5343	0.5850	0.8884	0.7055
	HRNet	0.5662	0.6170	0.9562	0.7500
	OCRNet	0.5849	0.6238	0.8771	0.7291
	SegFormer	0.5888	0.6183	0.8785	0.7258
	Mask2Former	0.5187	0.7400	0.7008	0.7199
	RSMamba	0.5892	0.7366	0.7134	0.7248
	RSPrompter	0.6051	0.7392	0.7411	0.7401
Class2 Cultivated Land	DeepLabV3+	0.7489	0.7957	0.9482	0.8653
	U-Net	0.6820	0.7581	0.9561	0.8457
	HRNet	0.7554	0.8004	0.9487	0.8683
	OCRNet	0.7559	0.7609	0.9547	0.8468
	SegFormer	0.7748	0.8472	0.9725	0.9055
	Mask2Former	0.7177	0.8319	0.8972	0.8634
	RSMamba	0.7338	0.8443	0.8129	0.8283
	RSPrompter	0.7358	0.8169	0.8384	0.8275
Class9 Water	DeepLabV3+	0.6935	0.7660	0.8692	0.8143
	U-Net	0.6372	0.6703	0.8996	0.7682
	HRNet	0.6936	0.7218	0.9385	0.8160
	OCRNet	0.6674	0.6995	0.8756	0.7777
	SegFormer	0.6571	0.6475	0.9070	0.7556
	Mask2Former	0.6456	0.8240	0.7190	0.7680
	RSMamba	0.3882	0.3916	0.8782	0.5417
	RSPrompter	0.7211	0.8222	0.7923	0.8070
Class13 Nursery	DeepLabV3+	0.3835	0.5214	0.5582	0.5392
	U-Net	0.3071	0.5077	0.4163	0.4575
	HRNet	0.4139	0.5439	0.5601	0.5519
	OCRNet	0.3755	0.4985	0.5330	0.5152
	SegFormer	0.5000	0.6146	0.6498	0.6318
	Mask2Former	0.2133	0.7733	0.2522	0.3804
	RSMamba	0.3609	0.5060	0.4421	0.4719
	RSPrompter	0.4237	0.5084	0.5157	0.5120

Table 4. Repeated-run robustness of all compared paradigms under the target-oriented setting. IoU is reported as mean ± standard deviation over three random seeds. The best result for each class is shown in bold.

Model	Building	Cultivated Land	Water	Nursery
DeepLabV3+	$0.5501 \pm 0.0127$	$0.7465 \pm 0.0114$	$0.6912 \pm 0.0268$	$0.3811 \pm 0.0183$
U-Net	$0.5316 \pm 0.0118$	$0.6843 \pm 0.0109$	$0.6355 \pm 0.0126$	$0.3058 \pm 0.0169$
HRNet	$0.5648 \pm 0.0097$	$0.7581 \pm 0.0095$	$0.6917 \pm 0.0112$	$0.4126 \pm 0.0141$
OCRNet	$0.5826 \pm 0.0115$	$0.7544 \pm 0.0107$	$0.6691 \pm 0.0204$	$0.3732 \pm 0.0176$
SegFormer	$0.5903 \pm 0.0109$	$0.7761 \pm 0.0099$	$0.6554 \pm 0.0191$	$0.4978 \pm 0.0138$
Mask2Former	$0.5169 \pm 0.0172$	$0.7195 \pm 0.0128$	$0.6431 \pm 0.0213$	$0.2117 \pm 0.0279$
RSMamba	$0.5867 \pm 0.0160$	$0.7325 \pm 0.0132$	$0.3974 \pm 0.0437$	$0.3576 \pm 0.0210$
RSPrompter	$0.6020 \pm 0.0125$	$0.7346 \pm 0.0111$	$0.7186 \pm 0.0170$	$0.4230 \pm 0.0197$

Table 5. Paired statistical analysis of key IoU differences under Protocol B.

Target	Model A	Model B	ΔIoU	95% CI of ΔIoU	Adjusted p-Value	Interpretation
Building	RSPrompter	SegFormer	$+0.0117$	$[-0.003, +0.026]$	$0.118$	Not significant
Building	RSPrompter	RSMamba	$+0.0153$	$[-0.001, +0.031]$	$0.071$	Not significant
Cultivated Land	SegFormer	HRNet	$+0.0180$	$[+0.004, +0.032]$	$0.031$	Significant
Cultivated Land	SegFormer	OCRNet	$+0.0217$	$[+0.007, +0.037]$	$0.014$	Significant
Water	RSPrompter	HRNet	$+0.0269$	$[+0.009, +0.046]$	$0.008$	Significant
Water	RSPrompter	DeepLabV3+	$+0.0274$	$[+0.010, +0.047]$	$0.006$	Significant
Nursery	SegFormer	RSPrompter	$+0.0748$	$[+0.046, +0.104]$	$<0.001$	Significant
Nursery	SegFormer	HRNet	$+0.0852$	$[+0.054, +0.119]$	$<0.001$	Significant
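The ΔIoU column can be cross-checked against the repeated-run means in Table 4, since each difference equals the mean IoU of Model A minus that of Model B. The sketch below performs only this check; the dictionary layout and the function name `delta_iou` are illustrative. The confidence intervals and adjusted p-values cannot be reproduced this way because they depend on per-seed results that are not listed in the tables.

```python
# Mean IoU over three seeds, copied from Table 4 (only models used in Table 5).
MEAN_IOU = {
    "DeepLabV3+": {"Water": 0.6912},
    "HRNet": {"Cultivated Land": 0.7581, "Water": 0.6917, "Nursery": 0.4126},
    "OCRNet": {"Cultivated Land": 0.7544},
    "SegFormer": {"Building": 0.5903, "Cultivated Land": 0.7761, "Nursery": 0.4978},
    "RSMamba": {"Building": 0.5867},
    "RSPrompter": {"Building": 0.6020, "Water": 0.7186, "Nursery": 0.4230},
}

# (target, model A, model B, reported delta IoU) copied from Table 5.
COMPARISONS = [
    ("Building", "RSPrompter", "SegFormer", 0.0117),
    ("Building", "RSPrompter", "RSMamba", 0.0153),
    ("Cultivated Land", "SegFormer", "HRNet", 0.0180),
    ("Cultivated Land", "SegFormer", "OCRNet", 0.0217),
    ("Water", "RSPrompter", "HRNet", 0.0269),
    ("Water", "RSPrompter", "DeepLabV3+", 0.0274),
    ("Nursery", "SegFormer", "RSPrompter", 0.0748),
    ("Nursery", "SegFormer", "HRNet", 0.0852),
]


def delta_iou(target, model_a, model_b):
    """Difference in mean IoU (model A minus model B) for one target."""
    return MEAN_IOU[model_a][target] - MEAN_IOU[model_b][target]


for target, model_a, model_b, reported in COMPARISONS:
    recomputed = delta_iou(target, model_a, model_b)
    print(f"{target}: {model_a} - {model_b} = {recomputed:+.4f} (reported {reported:+.4f})")
```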

Table 6. Computational cost comparison of the eight segmentation paradigms. FLOPs are approximately estimated under a unified $256 \times 256$ input resolution. Training time and peak memory are reported under Protocol B. Inference latency is averaged per image, and FPS is derived from latency.

Model	Configuration	Params (M)	FLOPs (G)	Train Time (min/Epoch)	Infer. Time (ms/Image)	FPS (Image/s)	Peak Train Mem. (GB)
DeepLabV3+	ResNet-101-D8	62.40	64.60	2.48	16.1	62.1	11.20
U-Net	5-stage, base-64	29.10	51.30	1.34	9.2	108.7	8.60
HRNet	HRNet-W48	66.00	33.50	2.21	13.9	71.9	10.90
OCRNet	HRNet-W48 + OCR	70.90	42.70	2.87	17.2	58.1	12.10
SegFormer	MiT-B4	64.20	24.60	2.93	12.7	78.7	10.00
Mask2Former	Swin-B	108.10	98.20	6.96	25.4	39.4	19.10
RSMamba	RSM/VMamba-based	46.20	37.50	3.28	18.7	53.5	14.50
RSPrompter	SAM ViT-B/16	122.00	88.90	8.12	32.6	30.7	21.10

Note: The reported computational costs are approximate values normalized under the same input resolution and experimental protocol. FLOPs may vary depending on the profiling tool and support for custom operators. FPS is computed as 1000 / latency (ms). Peak memory denotes training memory rather than single-image inference memory.
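The FPS conversion stated in the note can be verified directly from the latency column of Table 6. The sketch below is a simple arithmetic check; the function name `fps_from_latency` is illustrative, and the conversion implicitly assumes single-image (batch size 1) inference, which is an assumption about the measurement setup rather than a reported detail.

```python
# Inference latency (ms/image) and reported FPS, copied from Table 6.
LATENCY_MS = {
    "DeepLabV3+": 16.1, "U-Net": 9.2, "HRNet": 13.9, "OCRNet": 17.2,
    "SegFormer": 12.7, "Mask2Former": 25.4, "RSMamba": 18.7, "RSPrompter": 32.6,
}
REPORTED_FPS = {
    "DeepLabV3+": 62.1, "U-Net": 108.7, "HRNet": 71.9, "OCRNet": 58.1,
    "SegFormer": 78.7, "Mask2Former": 39.4, "RSMamba": 53.5, "RSPrompter": 30.7,
}


def fps_from_latency(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


for name, latency in LATENCY_MS.items():
    print(f"{name}: {fps_from_latency(latency):.1f} FPS (reported {REPORTED_FPS[name]})")
```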

Table 7. Target-aware model-selection guidance under Protocol B. The recommendations are application-oriented and should not be interpreted as a universal ranking.

Target Type	Class	Recommended Model	Main Rationale
Structured object-like	Building	RSPrompter/alternatives	Highest observed IoU, but the advantage is not statistically significant; alternatives may be selected by recall or efficiency needs.
Large-area continuous	Cultivated Land	SegFormer	Statistically supported advantage with stable robustness.
Sparse or elongated	Water	RSPrompter	Statistically supported accuracy advantage, but with higher computational cost.
Semantically ambiguous	Nursery	SegFormer	Statistically supported advantage and stronger robustness to vegetation-like confusion.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
