Article

CRECA-Net: Class Representation-Enhanced Class-Aware Network for Semantic Segmentation of High-Resolution Remote Sensing Images

School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(6), 950; https://doi.org/10.3390/rs18060950
Submission received: 10 December 2025 / Revised: 6 March 2026 / Accepted: 9 March 2026 / Published: 21 March 2026

Highlights

What are the main findings?
  • We propose CRECA-Net, a novel class-aware segmentation network that integrates the CPR module and the DA loss to enhance segmentation performance on high-resolution remote sensing imagery characterized by complex backgrounds and high inter-class similarity.
  • The CPR module incorporates pixel selection, confidence-aware contribution weighting, and inter-class prototype separation constraints to generate reliable and discriminative class representations for subsequent CLCA modules, while the DA loss adaptively emphasizes hard samples during training to improve overall segmentation performance.
What are the implications of the main findings?
  • CRECA-Net provides a high-accuracy framework for the fine-grained interpretation of high-resolution remote sensing scenes, supporting applications such as urban planning and land-use analysis.
  • The findings demonstrate that improving class representation quality is essential for achieving superior segmentation performance in class-aware networks, offering valuable insights for the design of segmentation models in complex remote sensing scenarios.

Abstract

High-resolution remote sensing (RS) images exhibit complex backgrounds, large intra-class variability, and low inter-class differences, posing substantial challenges for semantic segmentation. Although existing class-level contextual modeling methods partially alleviate these issues, they often overlook the importance of accurate and discriminative class representations and fail to effectively handle hard samples during training. To address these limitations, we propose CRECA-Net, a class representation-enhanced class-aware network designed from two complementary perspectives: class prototype refinement and difficulty-aware learning. Specifically, we introduce a class prototype refinement (CPR) module that improves class representations through pixel selection, confidence-aware contribution weighting, and an inter-class prototype separation loss, yielding more reliable and discriminative class centers. In addition, class-level context aggregation (CLCA) modules capture pixel-to-class prototype correlations via cross-attention to inject class-aware semantics into decoder features, thereby reducing interference from cluttered backgrounds and visually similar categories. Furthermore, a difficulty-aware (DA) loss dynamically estimates pixel-wise difficulty and redistributes the loss weights within each image, gradually shifting the learning focus from easy to hard samples while maintaining training stability. Extensive experiments on two benchmark RS segmentation datasets demonstrate that CRECA-Net consistently outperforms state-of-the-art methods across multiple evaluation metrics.

1. Introduction

Advances in remote sensing platforms and Earth observation technologies have markedly improved the spatial resolution of remote sensing (RS) imagery, enabling finer and more reliable observations of the Earth’s surface. Semantic segmentation, which assigns a semantic label to each pixel, has become a core technique for fine-grained RS scene understanding. It supports precise identification and delineation of geographic objects and serves as a fundamental tool in a wide range of applications, including road extraction [1,2,3], traffic monitoring [4,5], urban planning [6,7,8], land cover classification [9,10,11], and disaster assessment [12,13].
Driven by rapid advances in deep learning and the increasing availability of large-scale, annotated datasets, automatic segmentation of RS imagery has achieved remarkable success in recent years. However, despite these advances, accurately segmenting geospatial objects in high-resolution RS images remains a significant challenge, primarily due to several inherent characteristics of RS data.
Complex Background: RS images often exhibit highly complex and heterogeneous backgrounds, where target objects are frequently embedded in or obscured by large, cluttered surroundings. This complexity often leads to higher false-positive rates in segmentation. In certain datasets, irrelevant background regions may occupy more than 90% of the total image area [14].
Large Intra-Class Variance and Small Inter-Class Variance: Although fine spatial resolution provides richer visual cues, such as detailed textures, shapes, and subtle color variations, it also exacerbates intra-class diversity while reducing inter-class separability. This issue is further intensified by structural ambiguity and boundary uncertainty arising from complex geographic layouts and the top-down imaging perspective. Consequently, distinguishing visually similar and easily confusable categories becomes particularly challenging in high-resolution RS imagery. For instance, objects of the same category (e.g., cars) may differ considerably in size, color, and orientation even within a single scene, while similar categories (e.g., trees vs. low vegetation or rooftops vs. ground surfaces) are often difficult to distinguish due to highly similar spectral or textural properties. As illustrated in Figure 1, these factors collectively contribute to large intra-class variance (yellow arrows) and small inter-class variance (blue arrows), making purely appearance-based feature discrimination inadequate. To address these challenges, numerous studies [15,16,17,18] have attempted to improve segmentation accuracy by aggregating contextual information.
Existing context aggregation methods in semantic segmentation can be broadly categorized into two paradigms: spatial and relational context aggregation.
Spatial context aggregation methods commonly expand the receptive field by capturing multi-scale spatial contexts. Representative approaches such as PSPNet [15] and the DeepLab series [16,19,20] exploit pyramid pooling and dilated convolutions to aggregate contextual information from multiple spatial scales. Although effective, these methods rely on fixed isotropic receptive windows that aggregate contextual information in a category-agnostic manner, thereby indiscriminately integrating category-relevant and irrelevant regions. As a result, the aggregated context becomes susceptible to interference from background noise and visually similar yet semantically different categories, a problem that is particularly pronounced in RS imagery characterized by cluttered scenes and subtle inter-class differences.
To overcome these limitations, relational context aggregation methods employ attention mechanisms to adaptively model long-range dependencies, allowing contextual cues to be selected based on semantic relevance rather than spatial proximity. According to the type of relationships, these methods can be categorized into pixel-level and class-level relational modeling methods.
Pixel-level relational modeling explicitly computes pairwise dependencies among all pixels to perform dynamic feature aggregation. For example, NonLocal [17] constructs a pairwise affinity matrix to capture global dependencies, while DANet [21] introduces spatial and channel attention mechanisms to refine pixel representations. Transformer-based segmentation frameworks [22,23,24,25] generalize this idea by applying self-attention to image tokens or patches, enabling powerful global contextual reasoning. Related efforts have also explored Transformer-inspired models and frameworks for cross-view feature matching [26,27], providing valuable methodological insights for multimodal semantic segmentation [28,29]. However, the quadratic computational and memory complexity of full attention limits their scalability to high-resolution RS imagery. Although sparse attention variants [30,31] and efficient Transformer architectures [32,33,34] alleviate computational burdens through structured or hierarchical attention mechanisms, they still operate at the pixel–pixel or token–token level, failing to explicitly model class-level semantic structures. Consequently, in complex RS scenes with subtle inter-class differences, the absence of such class-aware guidance may lead the model to aggregate context from visually similar yet semantically irrelevant regions, ultimately degrading segmentation performance.
In contrast, class-level context modeling aims to suppress interference from irrelevant or confusing regions and enhance feature discriminability by explicitly capturing the relationships between pixels and class-level semantic representations. Representative approaches such as ACFNet [35] and OCRNet [18] generate class centers (or object region representations) either by averaging the embeddings of pixels assigned to each predicted class or by aggregating all pixel features using coarse segmentation probability maps. While effective in natural image segmentation, these strategies encounter notable limitations when applied to high-resolution RS imagery, which typically features complex backgrounds, low inter-class variability, and large intra-class variability. Under these conditions, existing class-level context modeling approaches encounter two fundamental limitations.
Inaccuracy and weakly discriminative class prototypes: Under these challenging characteristics of RS imagery, class prototypes tend to become highly sensitive to noise, prediction errors, and ambiguous regions. Simple averaging (Figure 2a) assigns equal importance to all pixels and disregards variations in pixel confidence and representativeness, often leading to biased class prototypes. Although probability-weighted aggregation (Figure 2b) partially alleviates this issue, the soft aggregation of all pixels inevitably allows non-target pixels to contribute to class prototypes, potentially contaminating the learned representations. Moreover, most existing methods lack explicit inter-class separation constraints, leading to insufficiently discriminative class prototypes, an issue particularly pronounced in RS imagery with inherently low inter-class separability.
Insufficient optimization of hard samples: These characteristics of RS imagery also lead to highly uneven pixel-wise segmentation difficulty. Boundary pixels, occluded regions, and pixels belonging to visually similar categories are significantly harder to classify than those in homogeneous regions. Despite this inherent imbalance, class-level context modeling frameworks are usually trained using the standard cross-entropy (CE) loss, which implicitly assumes uniform segmentation difficulty across pixels. Consequently, abundant easy pixels dominate the optimization process, while hard samples receive inadequate supervision, leading to biased feature learning and reduced robustness in complex RS scenes.
To address these challenges, we propose CRECA-Net, a class representation-enhanced class-aware segmentation framework that simultaneously improves class prototypes and strengthens hard sample learning. Concretely, CRECA-Net integrates three key components: a class prototype refinement (CPR) module, which produces more reliable and discriminative prototypes by applying pixel selection, confidence-aware contribution weighting, and inter-class prototype separation regularization; a set of class-level context aggregation (CLCA) modules, which explicitly model pixel-to-class prototype correlations via cross-attention and progressively inject class-aware context into decoder features; and a difficulty-aware (DA) loss that dynamically estimates pixel-wise segmentation difficulty and redistributes the intra-image loss to emphasize hard samples.
Our main contributions are summarized as follows:
  • We present CRECA-Net, a class-aware segmentation framework that simultaneously enhances class representation learning and difficulty-adaptive optimization, effectively alleviating segmentation difficulty arising from low inter-class separability in high-resolution RS imagery.
  • We propose a CPR module that constructs reliable and discriminative class prototypes through pixel selection, confidence-aware contribution weighting, and inter-class prototype separation regularization, thereby mitigating prototype bias and facilitating more effective class-level context modeling in subsequent CLCA modules.
  • We introduce a DA loss that dynamically estimates pixel-wise difficulty and adjusts the intra-image loss distribution to strengthen hard-sample learning, resulting in more stable optimization and enhanced robustness. Extensive experiments on two benchmark datasets demonstrate that CRECA-Net achieves consistent performance improvements with minimal computational overhead.
The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the architecture of CRECA-Net and details its key components. Section 4 describes the datasets, evaluation metrics, and experimental setup. Section 5 reports the comparative and ablation studies. Finally, Section 6 and Section 7 present the discussion of experimental results and the conclusion with future research directions, respectively.

2. Related Work

2.1. Semantic Segmentation

Semantic segmentation extends image classification to the pixel level by assigning a semantic label to each pixel, thereby enabling fine-grained scene understanding. The Fully Convolutional Network (FCN) [36] pioneered this field by replacing fully connected layers with convolutional ones to achieve end-to-end dense prediction, laying the foundation for modern segmentation networks. UNet [37] and its variants [38,39] employ encoder–decoder architectures with skip connections to restore spatial details lost during downsampling. However, due to their reliance on local convolutional operations, these networks still suffer from a limited receptive field and inadequate global context modeling. Consequently, later studies have primarily focused on spatial or relational context aggregation to enhance global feature representations and improve segmentation performance.

2.2. Spatial Context Aggregation Methods

Spatial context aggregation methods expand the receptive field by capturing contextual information from multiple spatial scales using dilated convolutions [40,41] or pooling operations. For example, PSPNet [15] employs a pyramid pooling module that performs parallel pooling with different kernel sizes to aggregate features across various spatial regions, enhancing multi-scale representations. Similarly, the DeepLab series [16,19,20] introduces the atrous spatial pyramid pooling (ASPP) module, which enlarges the receptive field via dilated convolutions with varying dilation rates while preserving spatial resolution. DenseASPP [42] further strengthens this framework by using dense connectivity for richer multi-scale feature extraction.
Several studies have further adapted these ideas to RS imagery. GCDNet [43] introduces a global context dependencies attention module to establish long-range relationships among multi-scale features extracted via dilated convolutions. AFNet [44] employs scale-feature and scale-layer attention mechanisms to adaptively enhance and fuse multi-scale features based on object sizes, thereby improving the segmentation performance for objects with large scale variations. Although effective for multi-scale context extraction, these methods rely on fixed geometric receptive fields determined by kernel shapes or dilation rates. As a result, the aggregated context is prone to including target and non-target regions, producing unreliable semantic cues and exacerbating category confusion, an issue that becomes especially severe in RS images featuring cluttered backgrounds and subtle inter-class differences.

2.3. Relational Context Aggregation Methods

Relational context aggregation methods leverage attention mechanisms to adaptively capture long-range dependencies and can be broadly categorized into pixel-level and class-level approaches according to the type of relationships they model.
Pixel-level Relational Context Aggregation: Early pixel-level approaches capture dependencies among all pixels to dynamically aggregate contextual information. For instance, NonLocal [17] models pairwise relationships across all spatial positions through non-local operations, enabling each pixel feature to access global context, while DANet [21] further introduces dual-attention to jointly capture spatial and channel dependencies. In recent years, Transformer architectures, represented by SegFormer [23], Segmenter [22], and Mask2Former [24], have further generalized pixel-level attention by applying self-attention to image tokens or patches, enabling powerful global reasoning. However, these dense attention operations require constructing full affinity matrices, resulting in quadratic computational and memory complexity.
To mitigate this limitation, numerous sparse and efficient attention mechanisms have been proposed. ANN [30] selects representative key-value pairs via pyramid sampling, CCNet [31] restricts attention computation to pixels along the same row and column, and EMANet [45] estimates attention maps through an expectation–maximization algorithm to reduce computational cost. In parallel, efficient Transformer architectures such as Swin Transformer [32], EfficientViT [33], and Biformer [34] employ window-based attention, lightweight ReLU-based attention mechanisms and bi-level routing strategies to further improve efficiency.
Given the large spatial dimensions and high computational demands of high-resolution RS imagery, numerous studies have developed efficient relational aggregation mechanisms to model long-range dependencies. MANet [46] and MAResU-Net [47] propose kernel and linear attention mechanisms to reduce the quadratic complexity of dot-product attention. ABCNet [48] leverages linear attention to construct global context paths, whereas UNetFormer [49] adopts a dual-attention design to jointly capture global and local dependencies. Similarly, SLCNet [50] explicitly supervises attention maps to guide long-range correlation modeling, and CGGLNet [51] integrates class-guided attention with adaptive multi-scale local feature enhancement. Furthermore, MSGCNet [52] employs efficient cross-attention and window-based multi-head self-attention for multiscale feature interaction and long-range dependency modeling, while AMMUNet [53] enhances global correlation modeling through multiscale attention map merging and granular multi-head self-attention. DLNet [54] further employs macro-level and micro-level self-attention in conjunction with cross-attention mechanisms to facilitate multi-scale feature extraction and fusion. Transformer-based models such as BANet [55] and DC-Swin [25] further integrate efficient Transformer variants, including ResT [56] and Swin Transformer [32], to better capture long-range dependencies with reduced computational overhead.
Despite their effectiveness in extending receptive fields and enhancing global contextual reasoning, pixel-level and token-level attention mechanisms predominantly rely on appearance-based similarity to compute affinities. This reliance can lead to context being aggregated from irrelevant background regions or visually similar yet semantically distinct categories, especially in high-resolution RS imagery where large intra-class variability and low inter-class separability make appearance-based correlations inherently unreliable. The lack of explicit class-level semantic guidance thus limits the reliability and discriminative power of pixel-level relational aggregation, motivating the development of class-aware context modeling paradigms.
Class-level Relational Context Aggregation: Class-level modeling aims to overcome the limitations of pixel-level relational aggregation by explicitly modeling relationships between pixels and class representations (i.e., class centers or class prototypes). These prototypes are generally computed either by averaging the feature embeddings of pixels within each predicted class or by aggregating all pixel features through probability-based weighting. Representative methods include ACFNet [35], which aggregates contextual information directly through class centers and coarse segmentation results, and OCRNet [18], which enhances pixel features by computing their similarity to object region representations. ISNet [57] further integrates image-level and semantic-level context to refine pixel embeddings, while MCIBI [58] maintains a feature memory bank of dataset-level class prototypes mined across images.
Class-level relational modeling has also been extensively explored in the RS domain. HMANet [59] combines class-enhanced, class-channel, and region-shuffle attention to improve aerial image segmentation. CCANet [60] introduces a class-constrained, coarse-to-fine attention module that leverages category information as explicit constraints to capture long-range context. LOGCAN++ [61] applies affine transformations to adaptively extract local class representations serving as intermediate bridges between pixels and global class centers, mitigating intra-class variance.
Despite their effectiveness in introducing semantic priors and mitigating certain limitations of pixel-level/token-level attention, existing class-level relational modeling methods still face challenges when applied to high-resolution RS imagery, which typically exhibits complex backgrounds, substantial intra-class variability, and small inter-class differences. These factors can cause current prototype estimation strategies to produce unreliable and weakly discriminative class representations. Moreover, these properties result in highly uneven pixel-wise segmentation difficulty across regions, causing hard samples to receive insufficient supervision under conventional CE loss. These issues collectively compromise the robustness and overall effectiveness of existing class-level approaches when deployed in complex RS environments.
To address these challenges, we propose CRECA-Net, a class-aware network designed to integrate class prototype refinement and difficulty-aware learning. By constructing more reliable and discriminative class prototypes while dynamically emphasizing hard samples during training, CRECA-Net achieves more robust and accurate segmentation performance on high-resolution RS imagery.

3. Proposed Method

3.1. Overall Architecture

The overall architecture of the proposed CRECA-Net is illustrated in Figure 3. The network follows a classical encoder-decoder structure with skip connections to balance global context modeling and spatial detail preservation. We employ a ResNet-50 [62] backbone, pre-trained on ImageNet [63], as the encoder for hierarchical feature extraction. Given an input RS image $I$, the encoder generates multi-scale feature maps $F_1$, $F_2$, $F_3$, and $F_4$ with spatial resolutions of 1/4, 1/8, 1/16, and 1/32 of the input size, respectively. In the decoder, all feature maps are first projected into a unified channel dimension ($d = 128$) via a $1 \times 1$ convolution. Next, we introduce a CPR module that operates on the deepest feature map $F_4$ to generate refined class prototypes $C$. These prototypes then guide top-down feature refinement through four class-level context aggregation modules ($\mathrm{CLCA}_1$–$\mathrm{CLCA}_4$). Specifically, $\mathrm{CLCA}_4$ takes $(F_4, C)$ as input and outputs an enhanced representation $F_4^{o}$. This representation is fused with the higher-resolution feature map $F_3$ and processed by $\mathrm{CLCA}_3$ to obtain $F_3^{o}$. The same refinement procedure is subsequently applied through $\mathrm{CLCA}_2$ and $\mathrm{CLCA}_1$, yielding progressively enriched features $F_2^{o}$ and $F_1^{o}$ that incorporate both semantic cues and spatial details. Finally, all enhanced feature maps $\{F_i^{o}\}_{i=1}^{4}$ are upsampled to a uniform spatial resolution and aggregated through element-wise summation, followed by a final $1 \times 1$ convolution to produce the segmentation prediction. The entire network is trained end-to-end using a joint objective that combines the DA loss and the inter-class prototype separation loss, enabling more effective learning from hard samples while encouraging discriminative prototype formation.
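To make this data flow concrete, the following PyTorch-style sketch mirrors the top-down decoder path described above under simplifying assumptions: the module and variable names (cpr, clca_blocks, _fuse) are illustrative, the lateral fusion is reduced to upsample-and-add rather than the concatenation followed by 3 × 3 convolutions used in CRECA-Net, and the snippet is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRECADecoderSketch(nn.Module):
    """Illustrative top-down decoder flow: CPR on F4, then CLCA_4 -> ... -> CLCA_1."""
    def __init__(self, cpr, clca_blocks, num_classes, d=128):
        super().__init__()
        self.cpr = cpr                          # assumed module: returns (prototypes, coarse logits) from F4
        self.clca = nn.ModuleList(clca_blocks)  # assumed modules: CLCA_1 ... CLCA_4
        self.head = nn.Conv2d(d, num_classes, kernel_size=1)

    @staticmethod
    def _fuse(deep, shallow):
        # Simplified fusion; the paper concatenates and applies 3x3 convolutions instead.
        return shallow + F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear", align_corners=False)

    def forward(self, feats):
        f1, f2, f3, f4 = feats                    # encoder features, already projected to d channels
        prototypes, coarse_logits = self.cpr(f4)  # refined class prototypes C and auxiliary mask logits
        f4o = self.clca[3](f4, prototypes)        # deepest stage first
        f3o = self.clca[2](self._fuse(f4o, f3), prototypes)
        f2o = self.clca[1](self._fuse(f3o, f2), prototypes)
        f1o = self.clca[0](self._fuse(f2o, f1), prototypes)
        size = f1o.shape[-2:]                     # upsample all refined maps to the finest resolution
        fused = sum(F.interpolate(o, size=size, mode="bilinear", align_corners=False)
                    for o in (f1o, f2o, f3o, f4o))
        return self.head(fused), coarse_logits    # final prediction and coarse mask for the auxiliary loss
```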

3.2. Class Prototype Refinement Module

High-resolution RS images often contain visually similar objects from different categories, such as trees vs. low vegetation, or buildings vs. impervious surfaces, resulting in low inter-class variance. Complex backgrounds further aggravate this issue, making it challenging for existing class prototype estimation methods to derive accurate and well-separated class prototypes. Since reliable prototypes are crucial for constructing a discriminative embedding space and guiding class-aware context reasoning, we introduce the CPR module to refine class representations from three complementary aspects: pixel selection, confidence-aware contribution weighting, and inter-class prototype separation. These strategies collectively improve the accuracy, representativeness, and discriminability of class prototypes, thus enabling subsequent CLCA modules to perform more precise class-aware context aggregation.

3.2.1. Pixel Selection

Pixel selection improves the reliability of class prototype estimation by restricting the pixels used for prototype computation. Instead of estimating class prototypes through probability-weighted averaging over all pixels, as commonly adopted in prior work, we retain only those pixels predicted as class k. This selective mechanism reduces cross-class interference and ensures that each prototype is computed from semantically consistent feature regions.
As shown in Figure 3b, given the deepest feature map $F_4 \in \mathbb{R}^{H_4 \times W_4 \times d}$, where $H_4$ and $W_4$ denote its spatial height and width, respectively, a classification head $\mathcal{H}$ implemented using a $1 \times 1$ convolution projects $F_4$ into the class space, producing the pre-classification logits:
$$D = \mathcal{H}(F_4), \quad D \in \mathbb{R}^{H_4 \times W_4 \times K} \tag{1}$$
where $K$ denotes the number of semantic categories. A coarse segmentation mask $M \in \mathbb{R}^{H_4 \times W_4}$ is then obtained by assigning each pixel to the class with the highest logit response along the category dimension. This mask is subsequently supervised using the proposed DA loss, as described in Section 3.4:
$$M(i,j) = \arg\max_{c \in \{1,\dots,K\}} D(i,j,c) \tag{2}$$
where $M(i,j)$ denotes the predicted class label for pixel $(i,j)$. Based on $M$, the feature map $F_4$ is partitioned into class-specific regions $R_k$, each containing the pixel features predicted as class $k$:
$$R_k = \left\{ F_4(i,j,:) \mid M(i,j) = k \right\}, \quad R_k \in \mathbb{R}^{N_k \times d} \tag{3}$$
where $N_k$ is the number of pixels assigned to class $k$. Similarly, the pre-classification logits $D$ are grouped into class-specific subsets:
$$D_k = \left\{ D(i,j,:) \mid M(i,j) = k \right\}, \quad D_k \in \mathbb{R}^{N_k \times K} \tag{4}$$
This pixel grouping strategy ensures that each class prototype is computed from semantically coherent and prediction-consistent regions. Consequently, the estimated prototypes better capture the intrinsic characteristics of each category and are more robust to interference from visually similar but semantically distinct classes.
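As a concrete illustration of Equations (1)–(4), the short PyTorch sketch below builds the coarse mask from the logits and gathers the features and logits of the pixels predicted as a given class; the channel-last tensor layout and function names are assumptions for readability, not the paper's implementation.

```python
import torch

def select_class_pixels(f4, classifier, k):
    """Sketch of Eqs. (1)-(4): coarse mask from logits, then keep only pixels predicted as class k.

    f4:         (H4, W4, d) deepest feature map
    classifier: callable mapping (H4, W4, d) -> (H4, W4, K) pre-classification logits
    k:          class index in [0, K-1]
    """
    logits = classifier(f4)            # D, Eq. (1)
    mask = logits.argmax(dim=-1)       # M, Eq. (2): highest logit response per pixel
    keep = (mask == k)                 # boolean selector for class k
    r_k = f4[keep]                     # R_k: (N_k, d) selected pixel features, Eq. (3)
    d_k = logits[keep]                 # D_k: (N_k, K) corresponding logits, Eq. (4)
    return r_k, d_k
```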

3.2.2. Confidence-Aware Contribution Weighting

Although the pixel selection module confines the candidate set for class k to the pixels predicted as belonging to class k, these pixels still vary in confidence and representativeness. Treating them equally may bias prototype estimation. To address this, we propose a confidence-aware contribution weighting (CACW) strategy, which assigns each pixel a contribution score computed from three complementary confidence indicators: prediction probability, logit margin, and prediction certainty. These indicators are defined as follows.
Prediction Probability Confidence: The predicted probability for the assigned class provides a direct estimate of model confidence for each pixel. Pixels with higher predicted probabilities are considered more reliable and therefore exert a stronger influence on the class prototype.
Given the pre-classification logits $D_k \in \mathbb{R}^{N_k \times K}$ (as defined above), the class probability distribution is obtained by applying the softmax operation along the category dimension:
$$P_k(i,c) = \frac{\exp\left(D_k(i,c)\right)}{\sum_{c'=1}^{K} \exp\left(D_k(i,c')\right)}, \quad P_k \in \mathbb{R}^{N_k \times K} \tag{5}$$
Here, $P_k(i,c)$ denotes the predicted probability that the $i$-th pixel is classified into category $c$ (for $c = 1, \dots, K$). The probability-based confidence vector for class $k$ is computed as:
$$\mathrm{WP}_k = \left[ P_k(i,k) \right]_{i=1}^{N_k} \tag{6}$$
This indicator naturally down-weights low-confidence pixels while emphasizing more reliable pixels.
Logit Margin Confidence: To complement the absolute confidence provided by probability values, we further incorporate the margin between the largest and second-largest logits. A larger margin indicates clearer separation between the top competing classes, reflecting stronger representativeness of the assigned class.
Let $\mathrm{Top}_i(x)$ denote the $i$-th largest element of a vector $x$. Given the pre-classification logits $D_k \in \mathbb{R}^{N_k \times K}$, the logit margin score for each pixel is computed as:
$$\mathrm{WL}_k = \left[ \mathrm{Top}_1\left(D_k(i,:)\right) - \mathrm{Top}_2\left(D_k(i,:)\right) \right]_{i=1}^{N_k} \tag{7}$$
Here, $D_k(i,:) \in \mathbb{R}^{K}$ denotes the logit vector of the $i$-th pixel.
This indicator complements the probability-based confidence by capturing relative certainty among competing classes, encouraging contributions from pixels with higher discriminativeness.
Prediction Certainty Confidence: In addition to probability and logit margin, entropy provides a measure of predictive uncertainty. Prior studies on uncertainty estimation [64,65] have shown that greater entropy indicates greater prediction uncertainty and a higher likelihood of misclassification. Accordingly, pixels with lower entropy should contribute more to the prototype.
For each pixel $i$, the entropy is computed from its predicted probability distribution $P_k(i,:)$ as:
$$\mathrm{Ent}_k(i) = -\sum_{c=1}^{K} P_k(i,c) \log P_k(i,c) \tag{8}$$
where $i = 1, \dots, N_k$ indexes the pixels assigned to class $k$. To convert entropy into a confidence-like weight, we first normalize the entropy values by the maximum possible entropy and then invert them:
$$\mathrm{WC}_k = \left[ 1 - \frac{\mathrm{Ent}_k(i)}{\log K} \right]_{i=1}^{N_k} \tag{9}$$
where log K corresponds to the maximum entropy of a uniform K-class distribution. Pixels with lower entropy, indicating higher certainty, receive larger weights. This indicator explicitly captures the overall uncertainty in predictions and provides a complementary perspective to both the probability and logit margin confidence measures.
Confidence Fusion and Normalization: After obtaining the three confidence indicators, they are integrated to derive a contribution score for each pixel. Specifically, for pixels predicted as class k, the aggregated confidence is computed as the sum of the three confidence indicators:
$$\mathrm{Weight}_k = \mathrm{WP}_k + \mathrm{WL}_k + \mathrm{WC}_k \tag{10}$$
Here, $\mathrm{Weight}_k \in \mathbb{R}^{N_k \times 1}$ represents the contribution scores of the $N_k$ pixels predicted as class $k$. To ensure comparability among pixels within the same class, these aggregated scores are normalized via a softmax function:
$$\mathrm{NWeight}_k(i) = \frac{\exp\left(\mathrm{Weight}_k(i)\right)}{\sum_{j=1}^{N_k} \exp\left(\mathrm{Weight}_k(j)\right)}, \quad i = 1, 2, \dots, N_k \tag{11}$$
The normalized scores $\mathrm{NWeight}_k(i)$ quantify each pixel’s relative importance in shaping the final class prototype.
Weighted Prototype Computation: Based on the normalized contribution weights, the class prototype for class k is computed as a weighted mean of its selected pixel features:
$$C_k = \sum_{i=1}^{N_k} \mathrm{NWeight}_k(i) \cdot R_k(i,:) \tag{12}$$
Here, $R_k(i,:)$ denotes the feature vector of the $i$-th pixel in $R_k$. Repeating this process for all $K$ categories yields the full prototype matrix:
$$C = \left[ C_1, C_2, \dots, C_K \right] \in \mathbb{R}^{K \times d}$$
where each row corresponds to a refined class representation. The complete computation procedure for class prototype generation is summarized in Algorithm 1.
Algorithm 1 Class Prototype Generation
Require: Feature map $F_4 \in \mathbb{R}^{H_4 \times W_4 \times d}$, classifier head $\mathcal{H}(\cdot)$, number of categories $K$
Ensure: Refined class prototypes $C = [C_1, C_2, \dots, C_K] \in \mathbb{R}^{K \times d}$
1: Feature Projection: Obtain the pre-classification logits $D$ according to Equation (1)
2: Mask Generation: Generate the coarse segmentation mask $M$ as defined in Equation (2)
3: for each class $k = 1$ to $K$ do
4:   Extract the feature subset $R_k$ and logit subset $D_k$ following Equations (3) and (4)
5:   Compute the probability distribution $P_k$ according to Equation (5)
6:   for each pixel $i = 1$ to $N_k$ do
7:     Compute the confidence indicators: prediction probability confidence $\mathrm{WP}_k(i)$, logit margin confidence $\mathrm{WL}_k(i)$, and prediction certainty confidence $\mathrm{WC}_k(i)$ as defined in Equations (6), (7), and (9)
8:     Compute the aggregated confidence weight $\mathrm{Weight}_k(i)$ using Equation (10)
9:   end for
10:   Compute the normalized confidence weight $\mathrm{NWeight}_k(i)$ according to Equation (11)
11:   Compute the refined class prototype $C_k$ with Equation (12)
12: end for
13: Return: Refined class prototypes $C$
This weighting scheme ensures that highly confident and semantically representative pixels contribute more strongly, while uncertain or noisy samples contribute less. As a result, the generated prototypes are more robust and discriminative, providing a reliable basis for subsequent class-aware context modeling.
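The per-class computation in Algorithm 1 can be summarized in a few lines of PyTorch. The sketch below is a simplified, hedged rendition of Equations (5)–(12) for a single class; the function and tensor names are chosen for illustration and are not taken from the authors' code.

```python
import math
import torch
import torch.nn.functional as F

def refine_prototype(r_k, d_k, k, num_classes):
    """Confidence-aware prototype for one class (a sketch of Eqs. (5)-(12)).

    r_k: (N_k, d) features of the pixels predicted as class k
    d_k: (N_k, K) their pre-classification logits
    """
    p_k = F.softmax(d_k, dim=-1)                         # Eq. (5): per-pixel class probabilities
    wp = p_k[:, k]                                       # Eq. (6): prediction probability confidence
    top2 = d_k.topk(2, dim=-1).values                    # largest and second-largest logits
    wl = top2[:, 0] - top2[:, 1]                         # Eq. (7): logit-margin confidence
    ent = -(p_k * p_k.clamp_min(1e-12).log()).sum(-1)    # Eq. (8): predictive entropy
    wc = 1.0 - ent / math.log(num_classes)               # Eq. (9): certainty confidence
    nweight = F.softmax(wp + wl + wc, dim=0)             # Eqs. (10)-(11): fused, normalized weights
    return (nweight.unsqueeze(-1) * r_k).sum(dim=0)      # Eq. (12): weighted prototype C_k
```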

3.2.3. Inter-Class Prototype Separation Loss

Although the CPR module improves the reliability of individual class prototypes, ensuring sufficient separability among different categories remains essential for robust semantic segmentation, particularly in RS imagery, where semantically distinct classes often exhibit highly similar appearances (e.g., tree vs. low vegetation, building vs. impervious surface). However, existing class-level context aggregation methods typically use the derived class centers directly for context modeling without enforcing explicit separation, which may lead to overly clustered prototypes and weakened class discriminability.
To mitigate this issue, we introduce an inter-class prototype separation loss, which explicitly encourages larger angular margins between class centers by penalizing excessive similarity. This constraint prevents feature-space collapse and promotes the formation of more discriminative and well-separated class embeddings.
Definition of Inter-class Similarity Penalty: To explicitly penalize overly close class centers, we define a penalty term based on the cosine similarity between class centers:
$$P_{\mathrm{inter}}(C_p, C_q) = \max\left(0, \ \mathrm{Sim}(C_p, C_q) - \beta\right)$$
where $C_p, C_q \in \mathbb{R}^{d}$ denote the refined prototypes for classes $p$ and $q$, respectively. The cosine similarity is defined as:
$$\mathrm{Sim}(C_p, C_q) = \frac{C_p \cdot C_q}{\lVert C_p \rVert_2 \, \lVert C_q \rVert_2}$$
The threshold $\beta \in [0, 1]$ sets the maximum allowable similarity between two class centers. Only prototype pairs whose similarity exceeds $\beta$ incur a penalty, thereby pushing overly close prototypes farther apart.
Inter-class Prototype Separation Loss: The loss evaluates pairwise similarities among all class prototypes and penalizes those pairs whose similarity exceeds the threshold $\beta$:
$$\mathcal{L}_{\mathrm{inter}} = \frac{1}{K} \sum_{p=1}^{K} \sum_{\substack{q=1 \\ q \neq p}}^{K} P_{\mathrm{inter}}(C_p, C_q)$$
Minimizing $\mathcal{L}_{\mathrm{inter}}$ effectively enlarges the angular margins among class centers, enhances inter-class separability, and produces clearer semantic boundaries, particularly in visually ambiguous regions. This loss operates solely on learned prototypes, requires no additional supervision, and introduces negligible computational overhead, making it practical for end-to-end training.
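For illustration, a minimal PyTorch sketch of the inter-class prototype separation loss defined above is given below, assuming a (K, d) prototype matrix and the threshold value reported later in the implementation details; variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def inter_class_separation_loss(prototypes, beta=0.125):
    """Penalize prototype pairs whose cosine similarity exceeds beta, averaged over K classes.

    prototypes: (K, d) refined class prototypes from the CPR module
    beta:       maximum allowable inter-class similarity (0.125 in the reported setup)
    """
    K = prototypes.size(0)
    normed = F.normalize(prototypes, dim=-1)                  # unit-norm prototypes
    sim = normed @ normed.t()                                 # (K, K) pairwise cosine similarities
    penalty = (sim - beta).clamp_min(0.0)                     # hinge penalty on overly similar pairs
    penalty = penalty - torch.diag(torch.diag(penalty))       # drop the p == q terms
    return penalty.sum() / K                                  # sum over ordered pairs, scaled by 1/K
```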
It is noteworthy that the class centers are computed exclusively from the deepest feature map F 4 , which possesses a larger receptive field and captures richer semantic information [15]. This enables more accurate coarse segmentation masks and more reliable prototype initialization. The coarse masks are then refined under ground-truth supervision.
Unlike previous works [18,61] that compute prototypes by aggregating all pixel features via probability-based weighting, our method forms each class representation using only the pixels predicted as belonging to that class, preventing cross-class interference. Furthermore, each pixel’s contribution is adaptively modulated by three complementary confidence indicators: (1) absolute confidence measured by the predicted probability; (2) relative confidence quantified by the logit margin; and (3) overall prediction certainty indicated by entropy. By combining these indicators, high-confidence and semantically representative pixels receive larger contribution weights, producing prototypes that are robust and discriminative. Furthermore, the proposed inter-class prototype separation loss enforces angular margins among prototypes, effectively alleviating the prevalent problem of low inter-class variance in high-resolution RS imagery.

3.3. Class-Level Context Aggregation Module

Self-attention-based pixel-level relational modeling has been extensively employed in RS image segmentation owing to its strong capability in capturing long-range dependencies. However, computing pairwise similarities among all pixels incurs a quadratic complexity of $O(n^2 d)$, where $n$ and $d$ denote the number of pixels and feature dimension, respectively. For high-resolution RS images, this leads to prohibitive computational and memory costs. Additionally, aggregating information from all pixels may incorporate features from visually similar but semantically irrelevant regions, especially in scenes containing complex backgrounds or visually similar land-cover categories. These factors make pixel-to-pixel affinities unreliable and ultimately degrade segmentation accuracy.
To address these limitations, we introduce the CLCA module, which replaces dense pixel-to-pixel interactions with pixel-to-class prototype correlations. Unlike standard self-attention that captures dependencies only at the pixel level without semantic guidance, CLCA leverages class prototypes to guide attention and aggregates semantic information across all pixels predicted as belonging to the same class. This design reduces the computational complexity of pixel-level self-attention from $O(n^2 d)$ to $O(nKd)$ (where $K$ denotes the number of semantic classes), while injecting explicit semantic guidance into the attention mechanism. Consequently, CLCA allows the network to capture long-range contextual cues while suppressing activations from irrelevant or confusing regions, thus enhancing the discriminability of pixel representations. Compared with generic cross-attention variants, which compute attention between two feature maps without explicit class-level guidance, CLCA explicitly incorporates class-level prototypes to guide semantic aggregation, which is not easily achievable with standard self-attention or cross-attention alone.
As illustrated in Figure 3a, the CLCA module operates at each decoder stage and takes two inputs: (1) the stage-wise feature map $F_i \in \mathbb{R}^{H_i \times W_i \times d}$ and (2) the refined class prototypes $C \in \mathbb{R}^{K \times d}$ generated by the CPR module. Following a cross-attention paradigm, $F_i$ is projected into a query matrix $Q_i = \phi_q(F_i) \in \mathbb{R}^{(H_i W_i) \times d}$, while the class prototypes are mapped to a key matrix $K_c = \phi_k(C) \in \mathbb{R}^{K \times d}$ and a value matrix $V_c = \phi_v(C) \in \mathbb{R}^{K \times d}$. Here, $\phi_q$, $\phi_k$, and $\phi_v$ represent learnable $1 \times 1$ convolutions followed by batch normalization and ReLU activation. The pixel-to-class attention map $A_i \in \mathbb{R}^{(H_i W_i) \times K}$ is computed using a scaled dot-product followed by a row-wise softmax:
$$A_i = \mathrm{Softmax}\!\left( \frac{Q_i K_c^{\top}}{\sqrt{d}} \right)$$
Based on $A_i$, the aggregated feature $\hat{F}_i^{o}$ is obtained by:
$$\hat{F}_i^{o} = A_i V_c \in \mathbb{R}^{(H_i W_i) \times d}$$
The resulting feature is first reshaped to match the spatial dimensions of the original feature $F_i$, then passed through an output mapping $\phi_o$, and fused with the original feature map $F_i$ via channel concatenation (denoted as $\oplus$). The fused feature is subsequently refined using two $3 \times 3$ convolution layers with batch normalization and ReLU activation ($\mathrm{Conv}_{3\times3}$) to produce the refined aggregated feature $F_i^{o}$:
$$F_i^{o} = \mathrm{Conv}_{3\times3}\!\left( \mathrm{Conv}_{3\times3}\left( F_i \oplus \phi_o(\hat{F}_i^{o}) \right) \right)$$
The above procedure defines a single CLCA module, which outputs $F_i^{o}$ for stage $i$.
The CLCA module is applied in a top-down manner to progressively refine decoder features. For decoder stages $i = 1, 2, 3$, the current-stage feature is first updated by fusing it with the upsampled refined aggregated feature from the deeper stage:
$$\tilde{F}_i = \mathrm{Conv}_{3\times3}\!\left( \mathrm{Upsample}(F_{i+1}^{o}) \oplus F_i \right)$$
where $\mathrm{Upsample}(\cdot)$ denotes $2\times$ bilinear upsampling. The updated feature $\tilde{F}_i$ is then fed into the CLCA module at the current stage to generate the corresponding refined aggregated output:
$$F_i^{o} = \mathrm{CLCA}_i(\tilde{F}_i, C)$$
This hierarchical refinement strategy progressively strengthens semantic consistency while preserving spatial detail. By repeatedly incorporating class-aware global cues from deep layers into fine-grained spatial structures of shallower layers, CLCA improves segmentation coherence and boundary localization.
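A compact PyTorch sketch of a single CLCA stage is shown below. For brevity it uses linear layers for the projections (the paper describes $1 \times 1$ convolutions with batch normalization and ReLU), so the layer choices and names should be read as assumptions rather than the exact CRECA-Net implementation.

```python
import torch
import torch.nn as nn

class CLCABlock(nn.Module):
    """Sketch of one CLCA stage: pixel-to-class-prototype cross-attention."""
    def __init__(self, d=128):
        super().__init__()
        self.q = nn.Linear(d, d)        # phi_q (paper: 1x1 conv + BN + ReLU)
        self.k = nn.Linear(d, d)        # phi_k
        self.v = nn.Linear(d, d)        # phi_v
        self.out = nn.Linear(d, d)      # phi_o
        self.refine = nn.Sequential(    # two 3x3 convs after channel concatenation
            nn.Conv2d(2 * d, d, 3, padding=1), nn.BatchNorm2d(d), nn.ReLU(inplace=True),
            nn.Conv2d(d, d, 3, padding=1), nn.BatchNorm2d(d), nn.ReLU(inplace=True),
        )

    def forward(self, feat, prototypes):
        """feat: (B, d, H, W) decoder feature; prototypes: (K, d) refined class prototypes."""
        b, d, h, w = feat.shape
        q = self.q(feat.flatten(2).transpose(1, 2))              # (B, HW, d) queries from pixels
        kc = self.k(prototypes)                                  # (K, d) keys from prototypes
        vc = self.v(prototypes)                                  # (K, d) values from prototypes
        attn = torch.softmax(q @ kc.t() / d ** 0.5, dim=-1)      # (B, HW, K) pixel-to-class attention
        agg = self.out(attn @ vc)                                # (B, HW, d) class-aware context
        agg = agg.transpose(1, 2).reshape(b, d, h, w)            # restore spatial layout
        return self.refine(torch.cat([feat, agg], dim=1))        # fuse with the original feature
```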

3.4. Loss Function

Difficulty-aware Loss: Semantic segmentation in high-resolution RS imagery presents substantial variability in pixel-wise classification difficulty due to complex backgrounds, large intra-class diversity, and small inter-class differences. Pixels located at object boundaries, in occluded areas, or belonging to visually similar categories are inherently harder to classify than those in homogeneous regions. However, the standard CE loss assigns equal importance to all pixels, implicitly assuming uniform difficulty. This mismatch leads to a training imbalance: (1) Early in training, gradients are dominated by abundant easy samples, impeding effective learning of hard samples; (2) Later in training, although hard samples generate larger errors, their small quantity limits their impact on optimization. Consequently, hard regions receive insufficient supervision, ultimately degrading segmentation performance.
These limitations arise from the static nature of the standard CE loss, which cannot adaptively adjust sample importance as training evolves. To address this issue, we propose a DA loss that incorporates a dynamic difficulty estimation mechanism and an adaptive loss scheduling strategy, which stabilizes early-stage optimization while progressively emphasizing challenging samples as training proceeds.
Dynamic Difficulty Estimation Mechanism: To estimate pixel-wise classification difficulty, we adopt a normalized, confidence-driven weighting scheme inspired by focal loss [66]. The difficulty weight for the i-th pixel is defined as:
$$W_i = \frac{(1 - p_i)^{\gamma}}{\sum_{j=1}^{N} (1 - p_j)^{\gamma}}$$
where $p_i \in [0, 1]$ denotes the predicted probability of the ground-truth class, $N$ is the total number of pixels, and $\gamma \geq 0$ is a focusing parameter controlling the emphasis on hard samples. The resulting difficulty-weighted loss is expressed as:
$$\mathcal{L}_{\mathrm{weight}} = \sum_{i=1}^{N} W_i \, \mathcal{L}_{\mathrm{ce}}^{i} = -\sum_{i=1}^{N} W_i \log(p_i)$$
The normalization ensures $\sum_{i=1}^{N} W_i = 1$, which preserves gradient stability by keeping the loss magnitude comparable to the standard CE loss. This difficulty-aware redistribution of pixel-wise loss within an image assigns greater weights to lower-confidence pixels, encouraging the model to focus on hard regions.
Adaptive Loss Scheduling Mechanism: Because model predictions are unreliable at early training stages, directly applying difficulty weighting may negatively impact optimization stability. To address this, we introduce an adaptive loss scheduling mechanism based on an annealing function that gradually shifts the optimization focus from easy samples to hard samples. The overall DA loss is defined as:
$$\mathcal{L}_{\mathrm{da}} = \left( 1 - \lambda(t) \right) \mathcal{L}_{\mathrm{ce}} + \lambda(t) \, \mathcal{L}_{\mathrm{weight}}$$
where $\mathcal{L}_{\mathrm{ce}}$ denotes the standard CE loss, and $\lambda(t) \in [0, 1]$ is a monotonically increasing annealing function of training step $t$. As $\lambda(t)$ increases, the loss smoothly transitions from uniform CE supervision to difficulty-aware weighting, which stabilizes early-stage optimization while progressively increasing the model’s focus on hard regions. Representative annealing strategies are listed in Table 1, where $T_{\mathrm{step}}$ denotes the annealing step count marking the end of the annealing period; its selection is discussed in Section 5.3.3.
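The following hedged PyTorch sketch combines the difficulty weighting and a linear annealing schedule into a single DA loss function. The linear schedule is only one of the options listed in Table 1, the weights are normalized over the whole batch rather than within each image for brevity, and all names are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def difficulty_aware_loss(logits, target, step, t_step, gamma=1.0, ignore_index=255):
    """Sketch of the DA loss: per-pixel CE reweighted by normalized (1 - p)^gamma,
    blended with plain CE through a linear annealing factor lambda(t)."""
    ce = F.cross_entropy(logits, target, ignore_index=ignore_index, reduction="none")  # (B, H, W)
    valid = (target != ignore_index).float()
    p = torch.exp(-ce)                                       # predicted probability of the true class
    w = ((1.0 - p) ** gamma) * valid                         # difficulty weights (focal-style)
    w = w / w.sum().clamp_min(1e-12)                         # normalized so the weights sum to 1
    l_weight = (w * ce).sum()                                # difficulty-weighted loss
    l_ce = (ce * valid).sum() / valid.sum().clamp_min(1.0)   # standard mean CE over valid pixels
    lam = min(step / float(t_step), 1.0)                     # linear annealing, lambda(t) in [0, 1]
    return (1.0 - lam) * l_ce + lam * l_weight
```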
Overall loss: The total training loss combines the DA loss with the inter-class prototype separation loss to jointly optimize pixel-level learning and inter-class discrimination:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{da}}^{\mathrm{main}} + \lambda_{\mathrm{aux}} \, \mathcal{L}_{\mathrm{da}}^{\mathrm{aux}} + \mathcal{L}_{\mathrm{inter}}$$
where $\mathcal{L}_{\mathrm{da}}^{\mathrm{main}}$ and $\mathcal{L}_{\mathrm{da}}^{\mathrm{aux}}$ represent the DA loss applied to the final prediction and the auxiliary coarse mask $M$, respectively, and $\mathcal{L}_{\mathrm{inter}}$ is the inter-class prototype separation loss introduced in Section 3.2.3. The coefficient $\lambda_{\mathrm{aux}}$ controls the relative contribution of the auxiliary branch.

4. Experiment Settings

4.1. Datasets

The performance of the proposed CRECA-Net was evaluated on two widely used benchmark datasets for RS image segmentation: the ISPRS Vaihingen [67] and ISPRS Potsdam [67] datasets.
ISPRS Potsdam Dataset: The Potsdam dataset consists of 38 orthophoto tiles, each sized at 6000 × 6000 pixels with a ground sampling distance (GSD) of 5 cm. Each tile provides four multispectral bands, near-infrared (NIR), red (R), green (G), and blue (B), along with corresponding digital surface model (DSM) and normalized DSM (NDSM) data. In our experiments, only the RGB bands were used, excluding DSM and NDSM data. The dataset contains dense annotations for six land-cover categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. Following established protocols, 23 tiles with IDs 2_10, 2_11, 2_12, 3_10, 3_11, 3_12, 4_10, 4_11, 4_12, 5_10, 5_11, 5_12, 6_7, 6_8, 6_9, 6_10, 6_11, 6_12, 7_7, 7_8, 7_9, 7_11, and 7_12 were allocated for training (note that tile 7_10 was excluded due to annotation errors), while the remaining 14 tiles were reserved for testing. Each tile was further divided into fixed-size, non-overlapping patches of 1024 × 1024 pixels for training and evaluation.
ISPRS Vaihingen Dataset: The Vaihingen dataset comprises 33 orthophoto tiles with an average resolution of 2494 × 2494 pixels and a GSD of 9 cm. Each tile includes three spectral bands, near-infrared (NIR), red (R), and green (G), together with corresponding DSM and NDSM data. The land-cover categories are the same six classes as defined for the Potsdam dataset. Following prior work, 15 tiles (IDs: 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 32, 34, and 37) were used for training, and the remaining 17 tiles were used for testing. All tiles were cropped into 1024 × 1024 patches using a sliding window with a 512-pixel stride, ensuring uniform patch dimensions across the dataset.

4.2. Evaluation Metrics

To comprehensively evaluate the performance of CRECA-Net and ensure fair comparisons with existing methods, three widely recognized evaluation metrics were used: Overall Accuracy (OA), mean Intersection over Union (mIoU), and F1 score (F1).
Overall Accuracy (OA): This metric measures the proportion of correctly classified pixels among all pixels and is defined as:
$$\mathrm{OA} = \frac{\sum_{k=1}^{K} \mathrm{TP}_k}{\sum_{k=1}^{K} \left( \mathrm{TP}_k + \mathrm{FP}_k + \mathrm{TN}_k + \mathrm{FN}_k \right)}$$
where $K$ denotes the total number of categories, and $\mathrm{TP}_k$, $\mathrm{FP}_k$, $\mathrm{TN}_k$, and $\mathrm{FN}_k$ represent the true positives, false positives, true negatives, and false negatives for the $k$-th class, respectively.
Mean Intersection over Union (mIoU): This metric measures the average overlap between the predicted and ground-truth regions across all categories, defined as the ratio of their intersection to their union:
$$\mathrm{mIoU} = \frac{1}{K} \sum_{k=1}^{K} \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k}$$
F1 Score (F1): The F1 score for class $k$ is the harmonic mean of precision and recall. The overall F1 score (mean F1) is computed as the mean of $F1_k$ over all classes.
$$F1_k = \frac{2 \times \mathrm{Precision}_k \times \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k}$$
For class $k$, $\mathrm{Precision}_k$ denotes the proportion of true positives among all predicted positives, and $\mathrm{Recall}_k$ is the ratio of true positives to the total number of actual positives, defined as:
$$\mathrm{Precision}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k}, \qquad \mathrm{Recall}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k}$$
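For reference, the three metrics can be computed from a $K \times K$ confusion matrix as in the short NumPy sketch below (rows indexing ground truth, columns indexing predictions); this is a standard formulation provided for illustration, not code from the paper.

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute OA, mIoU, and mean F1 from a K x K confusion matrix."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp                       # predicted as class k but actually another class
    fn = conf.sum(axis=1) - tp                       # labelled class k but predicted as another class
    oa = tp.sum() / conf.sum()                       # correctly classified pixels over all pixels
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return oa, iou.mean(), f1.mean()
```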

4.3. Implementation Details

To ensure consistent and fair comparisons, all experiments were implemented using the PyTorch (version 2.0.0) framework and conducted on a single NVIDIA RTX A40 GPU (NVIDIA Corporation, Santa Clara, CA, USA). We employed the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01 and a weight decay of 0.0001. A polynomial decay policy [41] was adopted to dynamically adjust the learning rate according to the schedule $\left( 1 - \frac{\mathrm{iter}}{\mathrm{max\_iter}} \right)^{\mathrm{power}}$, where the power was set to 0.9. The mini-batch size was 8, with 80 epochs on the Potsdam dataset and 150 epochs on the Vaihingen dataset. The introduced hyperparameters, including the similarity threshold $\beta$, focusing parameter $\gamma$, and auxiliary loss coefficient $\lambda_{\mathrm{aux}}$, were empirically determined through ablation studies and set to 0.125, 1.0, and 0.8, respectively, and kept fixed across all experiments. To enhance generalization, several data augmentation techniques were applied during training, including random cropping (crop size of 512 × 512), random scaling (scale factors of 0.5, 0.75, 1.0, 1.25, 1.5), random horizontal/vertical flipping, random rotation, and random Gaussian blur. During testing, multi-scale inference, flipping, and rotation were applied as test-time augmentation to improve robustness.
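A minimal sketch of the polynomial learning-rate schedule quoted above is given below; the iteration counts in the usage comment are illustrative, not values reported in the paper.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial decay used with SGD: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / float(max_iter)) ** power

# Example with the reported base LR 0.01 and power 0.9 (max_iter here is illustrative):
# poly_lr(0.01, cur_iter=5000, max_iter=20000) -> ~0.0077
```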

5. Experiment Results and Analysis

This section presents a comparative evaluation of CRECA-Net against fifteen representative semantic segmentation models. The compared methods span various architectural paradigms, including (1) the classical convolutional neural network-based framework FCN [36]; (2) spatial context aggregation models PSPNet [15] and DeepLabV3+ [16]; (3) pixel-level relational modeling methods, such as DANet [21], SegFormer [23], MAResU-Net [47], BANet [55], DC-Swin [25], UNetFormer [49], AMMUNet [53], SLCNet [50], MSGCNet [52], and CGGLNet [51]; and (4) class-level context modeling approaches comprising OCRNet [18] and LOGCAN++ [61]. All baseline implementations follow the optimal configurations reported in their original publications.

5.1. Comparison with State-of-the-Art Methods on the Potsdam Dataset

5.1.1. Quantitative Analysis

To evaluate the effectiveness of CRECA-Net, we conduct a quantitative comparison with state-of-the-art methods on the Potsdam dataset. As shown in Table 2, CRECA-Net achieves the highest OA (92.00%), MeanF1 (93.18%), and mIoU (87.44%), consistently outperforming all competing approaches. Specifically, compared with spatial context aggregation models, CRECA-Net yields substantial mIoU improvements of 10.77% and 8.85% over PSPNet and DeepLabV3+, respectively. It also surpasses recent pixel-level relational modeling methods, achieving mIoU gains of 1.16%, 1.30% and 0.41% over UNetFormer, MSGCNet, and CGGLNet, respectively. Furthermore, compared with LOGCAN++, a state-of-the-art class-level context modeling framework specifically designed for RS image segmentation, CRECA-Net achieves a notable 1.08% improvement in mIoU. This consistent superiority can be primarily attributed to the proposed CPR module, which produces more accurate and discriminative class centers. These refined class centers enable more effective class-level context modeling, thereby improving feature representations and segmentation performance. Overall, these results highlight the critical role of high-quality class prototypes in enhancing class-level context modeling for RS image segmentation.

5.1.2. Computational Complexity Analysis

To quantitatively assess computational complexity, we compare CRECA-Net with representative baselines with respect to model parameters (Params), computational cost (FLOPs), and inference speed (FPS). All evaluations were conducted on a single NVIDIA RTX A40 GPU with an input resolution of 1024 × 1024, and the results are summarized in Table 2. Among models using the same backbone, CRECA-Net achieves superior segmentation performance while maintaining comparable or lower computational cost. For instance, it outperforms LOGCAN++ by 1.08% in mIoU with similar FLOPs and Params, and surpasses CGGLNet by 0.41% in mIoU while using 38.36G fewer FLOPs and 5.39M fewer Params. This efficiency can be attributed to two design aspects: (1) the CPR module and the DA loss introduce no learnable parameters; and (2) the CLCA module performs context aggregation at the class level, which requires significantly fewer parameters than spatial or pixel-level relational modeling approaches. These findings indicate that CRECA-Net achieves a favorable trade-off between segmentation accuracy and computational cost, underscoring the advantage of class-level context modeling in high-resolution RS segmentation.

5.1.3. Qualitative Analysis

To qualitatively evaluate segmentation performance, Figure 4 compares the visual results of CRECA-Net with several representative methods on typical samples from the Potsdam dataset. As illustrated, our model produces clearer and more precise segmentation maps, especially in complex or visually ambiguous regions. For instance, in the first example, CRECA-Net accurately delineates buildings from adjacent impervious surfaces while reducing misclassification in surrounding low vegetation areas. In the second case, our model effectively distinguishes intertwined low vegetation and tree regions, demonstrating stronger class discriminability. The third image further highlights CRECA-Net’s robustness in fine-grained object segmentation. Collectively, these results indicate that CRECA-Net delivers higher recognition accuracy and improved spatial coherence compared to competing approaches.
Furthermore, to gain deeper insight into the feature learning capability of different models, we employ t-SNE [68] to visualize the high-level features extracted from the final layers of UNetFormer, LOGCAN++, and CRECA-Net on the Potsdam test set, as depicted in Figure 5. The visualization shows that in UNetFormer and LOGCAN++, the feature embeddings of the “building” and “impervious surface” classes remain relatively close, indicating limited inter-class separability. In contrast, CRECA-Net forms a distinctly separated cluster for the “building” class, which is distant from the “impervious surface” cluster. This pronounced separation demonstrates CRECA-Net’s ability to enhance feature discriminability and refine inter-class boundaries in the embedding space.

5.2. Comparison with State-of-the-Art Methods on the Vaihingen Dataset

5.2.1. Quantitative Analysis

To further evaluate the generalization capability of CRECA-Net, we conduct a quantitative comparison on the Vaihingen dataset, as presented in Table 3. CRECA-Net consistently outperforms all baseline methods across all metrics, further validating its effectiveness and generalization capability. In terms of mIoU, CRECA-Net achieves improvements of 8.29% and 8.20% over the classical spatial context aggregation models PSPNet and DeepLabV3+, respectively. Compared with pixel-level relational modeling methods, it obtains mIoU gains of 1.00%, 1.01%, and 0.88% over UNetFormer, SLCNet, and CGGLNet, respectively. Furthermore, CRECA-Net surpasses LOGCAN++, a state-of-the-art class-level context modeling method for RS image segmentation, by 0.38% in mIoU. These improvements primarily stem from the proposed CPR module, which refines class centers and facilitates more reliable class-level context aggregation. Collectively, these results underscore the superior performance of CRECA-Net across different context modeling mechanisms and demonstrate its robustness and adaptability in diverse RS scenarios.

5.2.2. Qualitative Analysis

The visual segmentation performance of CRECA-Net is further compared with several representative models on the Vaihingen test set, as illustrated in Figure 6. Visual inspection of challenging samples shows that CRECA-Net delivers more accurate and spatially coherent segmentation results. Specifically, in the first sample, our method successfully detects all vehicles while significantly reducing both missed and false detections. In the second case, it yields more complete building extraction with smoother and more continuous boundaries. Remarkably, in the third example, where shadow occlusion often causes trees to be misclassified as low vegetation, CRECA-Net effectively identifies shadowed trees, thus mitigating confusion between these visually similar categories. These visual results collectively demonstrate that CRECA-Net excels at preserving structural integrity and alleviating inter-class ambiguity in complex RS scenarios.
Likewise, we use t-SNE to visualize the high-level feature distributions of different models on the Vaihingen test set, as shown in Figure 7. Consistent with the observations on the Potsdam dataset, the feature embeddings of UNetFormer and LOGCAN++ exhibit relatively overlapping clusters for the “building”, “tree”, and “low vegetation” classes. In contrast, CRECA-Net produces a clearly isolated cluster for the “building” category, which is well separated from the “tree” and “low vegetation” clusters. These observations indicate that CRECA-Net possesses strong feature discriminability and demonstrates robust cross-dataset generalization.

5.3. Ablation Studies and Analysis

We conducted a series of ablation experiments on the Potsdam dataset to determine the optimal model architecture and to assess the influence of key components and hyperparameter settings. The detailed experimental results and analyses are presented below.

5.3.1. Ablation of the Model Structure

To systematically analyze the contribution of each major component in CRECA-Net, we conduct a comprehensive ablation study, with experimental configurations and results summarized in Table 4. Specifically, six architectural variants were constructed by gradually simplifying the full model: (1) CRECA-Net: The complete model integrating both the CPR and CLCA modules, trained with the proposed DA loss. (2) L_da → L_ce: A variant that replaces the DA loss with the standard CE loss, while retaining both the CPR and CLCA modules. (3) L_da → L_ce & CPR → WCC: Derived from variant (2) by substituting the CPR module with the weighted class center (WCC) scheme, where class centers are computed via probability-weighted aggregation (see Figure 2b). (4) L_da → L_ce & CPR → ACC: A variant of (2) in which the CPR module is replaced with the average class center (ACC) scheme, where class centers are obtained by averaging all pixel features within each predicted class (see Figure 2a). (5) -CPR-CLCA: A configuration that removes both the CPR and CLCA modules from CRECA-Net. (6) -CPR-CLCA & L_da → L_ce: The most simplified version, obtained by further replacing the DA loss with the CE loss in variant (5), leaving only the ResNet-50 backbone with a basic decoder.
The ablation results yield several key insights. First, variant (2), which preserves both the CPR and CLCA modules, achieves a 2.65% mIoU improvement over the most simplified variant (6), confirming the effectiveness of the proposed CPR and CLCA modules. Second, introducing the proposed DA loss into this configuration further enhances model performance, enabling CRECA-Net to achieve the highest mIoU of 87.44%, indicating that the DA mechanism effectively facilitates learning from hard samples. Finally, when the CPR module is replaced with either WCC or ACC, mIoU drops by 0.74% and 1.08%, respectively, highlighting the superiority of the proposed CPR mechanism in generating accurate and discriminative class centers. Overall, these ablation results confirm that the CPR module, CLCA module, and DA loss jointly contribute to the performance gains of CRECA-Net.
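For reference, the two baseline class-center schemes used in variants (3) and (4) can be sketched as follows. This is a minimal illustration assuming decoder features of shape (B, C, H, W) and coarse segmentation logits of shape (B, K, H, W); tensor names and shapes are assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def average_class_centers(feats: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """ACC (Figure 2a): average the features of all pixels assigned to each class by the coarse mask."""
    k = logits.shape[1]
    mask = F.one_hot(logits.argmax(dim=1), k).permute(0, 3, 1, 2).float()  # (B, K, H, W) hard assignment
    mask_flat, feats_flat = mask.flatten(2), feats.flatten(2)              # (B, K, HW), (B, C, HW)
    centers = torch.einsum("bkn,bcn->bkc", mask_flat, feats_flat)          # per-class feature sums
    return centers / mask_flat.sum(dim=2, keepdim=True).clamp(min=1.0)     # (B, K, C) class centers

def weighted_class_centers(feats: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """WCC (Figure 2b): probability-weighted aggregation of all pixel features."""
    prob = logits.flatten(2).softmax(dim=1)                                # (B, K, HW) soft assignment
    prob = prob / prob.sum(dim=2, keepdim=True).clamp(min=1e-6)            # weights sum to 1 per class
    return torch.einsum("bkn,bcn->bkc", prob, feats.flatten(2))            # (B, K, C) class centers
```

The CPR module replaces these schemes with pixel selection, confidence-aware contribution weighting, and the inter-class separation constraint, which is where the gains of variant (2) over variants (3) and (4) in Table 4 originate.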

5.3.2. Ablation of the CPR Module

To further investigate the internal design of the CPR module, we conducted a dedicated ablation study, with results summarized in Table 5. Using the complete CPR configuration as the baseline, we systematically evaluated two aspects: (1) the contribution of each component, by removing individual components from the CACW submodule, namely WP_k, WL_k, and WC_k, as well as the inter-class prototype separation loss L_inter; (2) the effectiveness of each confidence metric independently, by constructing variants that retain only one of WP_k, WL_k, or WC_k.
The results show that removing any component or relying on a single confidence metric consistently degrades segmentation performance. This observation suggests that all components contribute positively to the refinement of discriminative class centers, and that their complementary effects are crucial for improving downstream segmentation performance. Among all component-removal variants, removing L_inter results in the largest performance drop. This can be attributed to the high inter-class similarity commonly observed in RS images: without the inter-class prototype separation constraint imposed by L_inter, class centers tend to collapse toward each other (i.e., become less separated), reducing inter-class separability and weakening the overall discriminative power of the learned feature space. Overall, these results demonstrate that both the multi-metric confidence weighting strategy and the inter-class prototype separation loss are indispensable for constructing high-quality class centers, thereby validating the design of the proposed CPR module.
Choice of the Similarity Threshold β: As a key hyperparameter within the inter-class prototype separation constraint of the CPR module, the threshold β controls the maximum allowable cosine similarity between class centers and thus directly influences the strength of the inter-class separation constraint. A smaller β enforces a stricter angular margin by lowering the threshold that triggers the penalty between class centers. To select an appropriate value, we performed a grid search over the range β ∈ [0.05, 0.15], and the corresponding results are reported in Table 6. As β increases from 0.05 to 0.125, the mIoU gradually improves and reaches its peak value of 87.44% at β = 0.125, indicating that moderate relaxation of the separation constraint helps avoid over-penalizing semantically related classes. However, further increasing β causes a performance decline, suggesting that an overly weak separation constraint reduces inter-class discriminability. Overall, these results demonstrate a clear trade-off between enforcing class separation and preserving meaningful semantic relationships, with β = 0.125 striking an effective balance between the two.
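As a concrete reference for how β enters the constraint, a hinge-style form that penalizes class-center pairs whose cosine similarity exceeds β could be written as below; this is a plausible sketch consistent with the description above, not necessarily the exact formulation of L_inter.

```python
import torch
import torch.nn.functional as F

def inter_class_separation_loss(centers: torch.Tensor, beta: float = 0.125) -> torch.Tensor:
    """Penalize pairwise cosine similarities between class centers (K, C) that exceed the threshold beta."""
    normed = F.normalize(centers, dim=1)                      # unit-norm class prototypes
    sim = normed @ normed.t()                                 # (K, K) cosine similarity matrix
    off_diag = ~torch.eye(centers.shape[0], dtype=torch.bool, device=centers.device)
    return F.relu(sim[off_diag] - beta).mean()                # zero once all pairs fall below beta
```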

5.3.3. Analysis of Difficulty-Aware Loss

To assess the effectiveness of the proposed DA loss and investigate the influence of its key hyperparameters, we conducted a series of controlled ablation studies on the Potsdam dataset.
Effect of Difficulty-Aware Loss: As shown in Table 4, replacing the standard CE loss with the DA loss in configurations (2) and (6), which yields rows (1) and (5), respectively, leads to mIoU improvements of 0.29% and 0.73% without introducing any additional computational overhead. These improvements demonstrate that the DA loss effectively mitigates the optimization bias toward easy samples inherent in the standard CE loss by adaptively emphasizing hard samples, particularly those in boundary regions and visually ambiguous areas, leading to more effective and stable training.
Visualization and Analysis of Difficulty-Aware Loss Weight W i : To further verify whether the proposed DA loss indeed focuses on hard-to-segment regions, we visualize the pixel-wise DA loss weight maps generated during training, as shown in Figure 8. The DA mechanism assigns higher weights to pixels with higher classification difficulty, thereby guiding the network to focus on challenging pixels. As observed in the first row, higher weights are concentrated along object boundaries, where segmentation ambiguity is inherently high. In the second row, regions containing visually similar categories, such as trees vs. low vegetation and cars vs. impervious surfaces, are also assigned elevated weights, reflecting increased classification uncertainty. Additionally, in the third row, areas affected by shadow occlusion receive higher weights, indicating the model’s sensitivity to complex visual conditions. Collectively, these visual results provide clear evidence that the DA mechanism successfully captures pixel-wise classification difficulty arising from boundary ambiguity, visual similarity, and occlusion, thereby guiding the model to focus on challenging regions and improving overall segmentation performance.
Choice of the Focusing Parameter γ : The parameter γ controls the degree of emphasis placed on hard samples in the DA loss. Following the principle of focal loss [66,69], we evaluated γ within the range [0.5, 2.0], as shown in Table 7. The model’s performance gradually improves as γ increases from 0.5 to 1.0, reaching the highest mIoU of 87.44% at γ = 1.0 . However, further increasing γ to 2.0 does not yield additional performance gains, likely because excessively large γ over-penalizes low-confidence predictions and inadvertently amplifies noisy or mislabeled pixels [70]. These observations indicate that a moderate focusing strength achieves an optimal balance between emphasizing hard samples and suppressing noise.
Effect of Normalization: The dynamic difficulty estimation mechanism introduces image-level normalization of pixel-wise difficulty weights, serving two purposes: (1) redistributing the loss within each image to emphasize hard regions; and (2) maintaining consistency in the overall loss magnitude, thereby preventing images with a high density of difficult pixels from dominating the optimization.
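A minimal sketch of this weighting-plus-normalization step is given below, assuming focal-style difficulty weights w_i = (1 − p_i)^γ that are renormalized to a mean of one within each image so that the per-image loss magnitude stays comparable to the CE loss; ignore-index handling, the auxiliary branch, and the annealing schedule are omitted.

```python
import torch
import torch.nn.functional as F

def difficulty_aware_loss(logits: torch.Tensor, target: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Per-pixel CE re-weighted by image-normalized difficulty. logits: (B, K, H, W); target: (B, H, W)."""
    ce = F.cross_entropy(logits, target, reduction="none")        # (B, H, W) per-pixel CE
    p_true = torch.exp(-ce)                                       # predicted probability of the true class
    w = (1.0 - p_true) ** gamma                                   # focal-style difficulty weight
    w = w.flatten(1)
    w = w / w.mean(dim=1, keepdim=True).clamp(min=1e-8)           # image-level normalization (mean weight = 1)
    return (w * ce.flatten(1)).mean()                             # redistributes, rather than inflates, the loss
```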
To evaluate the effectiveness of this image-level normalization strategy, we compared three loss formulations, as shown in Table 8: row (4) reports the proposed DA loss with image-level normalization (without annealing), row (2) corresponds to the naïve focal loss with un-normalized weights w_i = (1 − p_i)^γ, and row (3) presents the foreground-aware loss proposed in FarSeg++ [69], which normalizes the naïve focal loss by scaling it with the ratio between the focal loss and the CE loss; row (1) provides the standard CE baseline for reference.
Without normalization, the naïve focal loss biases training toward images containing many hard samples, leading to a 0.42% decrease in mIoU relative to the standard CE baseline. In contrast, our image-level normalization balances the per-image contribution to the loss, increasing mIoU to 87.28% and surpassing the FarSeg++ normalization by 0.34%. These results indicate that image-level normalization effectively stabilizes per-image contributions during training, allowing the model to learn more evenly from the hard regions of each image and preventing overfitting to image subsets dominated by difficult samples. Overall, this confirms the critical role of image-level normalization in enabling robust and stable training of the proposed DA loss.
Choice of the Annealing Function: The annealing function enables a smooth transition from the standard CE loss to the DA loss, thereby mitigating training instability caused by unreliable difficulty estimation during the early training phase. To evaluate its impact, we compare multiple annealing strategies, as reported in rows (5)–(7) of Table 8. All annealing strategies outperform the direct application of the DA loss, confirming that a gradual transition from CE to DA loss is critical for stable training. Among them, cosine annealing achieves the best performance, reaching an mIoU of 87.44%, owing to its gentle slope at the beginning and end of training, which avoids abrupt changes in gradient contributions and facilitates stable convergence. These findings demonstrate that selecting an appropriate annealing schedule is essential for stable and effective training with the proposed DA loss.
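To make the schedule concrete, the cosine form from Table 1 and one plausible way of blending the CE and DA objectives during the transition are sketched below; the blending rule and the clipping of λ(t) at 1 once t reaches T_step are assumptions.

```python
import math

def cosine_anneal(t: int, t_step: int) -> float:
    """Cosine schedule from Table 1: lambda(t) = 0.5 * (1 - cos(pi * t / T_step)), held at 1 once t >= T_step."""
    return 0.5 * (1.0 - math.cos(math.pi * min(t, t_step) / t_step))

def scheduled_loss(ce_loss: float, da_loss: float, t: int, t_step: int) -> float:
    """Shift the objective smoothly from CE (lambda = 0) to the DA loss (lambda = 1) as training progresses."""
    lam = cosine_anneal(t, t_step)
    return (1.0 - lam) * ce_loss + lam * da_loss
```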
Choice of Annealing Step T_step: The DA loss introduces a hyperparameter T_step to control the duration of the annealing process. Since the reliability of difficulty estimation depends on the quality of the model's predictions, setting T_step too small may lead to unreliable difficulty estimates, while an overly large value may reduce the number of training iterations devoted to optimizing hard samples. To determine a stable and principled value, we reference the training dynamics of the CE-trained CRECA-Net (where the DA loss is replaced with the standard CE loss). Specifically, T_best is defined as the training step at which the CE-trained model achieves its best validation performance, and T_step is set as a fraction of T_best. Accordingly, we evaluate four settings, T_step ∈ {T_best, 0.9·T_best, 0.8·T_best, 0.7·T_best}, as summarized in Table 9. Results on both the Potsdam and Vaihingen datasets indicate that T_step = 0.8·T_best yields the best performance, while the model's performance remains relatively stable across neighboring values. These findings suggest that the DA loss is not overly sensitive to the precise choice of T_step, and a reliable setting can be derived from CE-based training behavior without introducing additional hyperparameter complexity.
Choice of Loss Coefficient λ_aux: We further analyze the influence of the auxiliary loss coefficient λ_aux, which regulates the contribution of the auxiliary segmentation branch, by fixing all other loss terms and varying λ_aux within {0.7, 0.8, 0.9, 1.0}. As shown in Table 10, increasing λ_aux from 0.7 to 0.8 improves performance, suggesting that moderate auxiliary supervision enhances optimization. However, excessively large values tend to hinder the optimization of the main prediction branch, resulting in a slight degradation of accuracy. Based on these observations, λ_aux = 0.8 is selected as the default setting, achieving a balanced trade-off between the auxiliary supervision and the main segmentation objective.
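Under the common auxiliary-supervision formulation, and assuming no terms beyond those discussed here, the role of λ_aux can be summarized as

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{main}} + \lambda_{\text{aux}} \, \mathcal{L}_{\text{aux}},$$

where λ_aux = 0.8 scales the gradient contribution of the auxiliary branch relative to the main prediction head; how the remaining terms (e.g., L_inter) are weighted in the full objective is not restated here.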

6. Discussion

The consistent performance improvements of CRECA-Net on the ISPRS Potsdam and Vaihingen datasets primarily result from the design of the CPR module, the CLCA module, and the DA loss. The CPR module enhances the reliability and discriminability of class prototypes through three key components: pixel selection, CACW, and the inter-class prototype separation loss. Specifically, pixel selection ensures that class prototypes are computed from semantically coherent and prediction-consistent regions, CACW allows high-confidence pixels to contribute more to prototype estimation, and the inter-class prototype separation loss explicitly promotes separability between the prototypes of different categories. The CLCA module employs cross-attention to model pixel-to-class prototype correlations, thereby capturing long-range contextual dependencies under semantic prototype guidance and reducing prediction ambiguity in complex scenes. In addition, the DA loss introduces a dynamic difficulty estimation mechanism with an adaptive loss scheduling strategy, which adaptively adjusts pixel-wise loss weights within each image, enabling the model to gradually shift its learning focus from easy samples to more challenging ones, such as boundary regions and visually confusing categories, while ensuring stable training.
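As an illustration of the pixel-to-prototype correlation described above, a single-head class-level cross-attention block could look like the following sketch; the projection layout, the absence of multi-head attention, and the residual fusion are simplifying assumptions rather than the exact CLCA design.

```python
import torch
import torch.nn as nn

class ClassLevelCrossAttention(nn.Module):
    """Pixels (queries) attend to class prototypes (keys/values) to inject class-aware context."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
        """feats: (B, C, H, W) decoder features; prototypes: (B, K, C) class centers from the CPR stage."""
        b, c, h, w = feats.shape
        q = self.q(feats.flatten(2).transpose(1, 2))                   # (B, HW, C) pixel queries
        k, v = self.k(prototypes), self.v(prototypes)                  # (B, K, C) prototype keys/values
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)    # (B, HW, K) pixel-to-class affinities
        ctx = attn @ v                                                 # (B, HW, C) class-aware context
        return feats + ctx.transpose(1, 2).reshape(b, c, h, w)         # residual fusion with input features
```

Because the key/value set contains only the K prototypes rather than all H × W pixels, the attention map is of size HW × K, which keeps class-aware context aggregation lightweight compared with dense pixel-to-pixel attention.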
These results indicate that jointly improving class representation quality and training dynamics provides an effective strategy for addressing the challenges of high-resolution RS image segmentation, particularly in scenes characterized by large intra-class variation and complex backgrounds. Despite these improvements, accurately delineating fine-grained boundary details and distinguishing subtle transitions between visually similar categories remains challenging, particularly in complex scenes where boundary ambiguity and category confusion are more pronounced.

7. Conclusions

In this work, we investigated the key challenges inherent in high-resolution RS image segmentation, including complex backgrounds, large intra-class variance, and small inter-class variance, all of which commonly limit the performance of conventional segmentation models. To address these challenges, we proposed CRECA-Net, a class-aware segmentation network that integrates a CPR module, CLCA modules, and a DA loss. Specifically, the CPR module constructed more reliable and discriminative class representations, the CLCA module aggregated long-range contextual information under semantic prototype guidance, and the DA loss enabled the model to focus more effectively on hard samples, such as boundary regions and visually similar categories. Extensive experiments on two benchmark datasets demonstrated that CRECA-Net consistently outperformed several state-of-the-art methods across multiple evaluation metrics, achieving superior segmentation accuracy and strong generalization in diverse RS scenarios. Future work will explore integrating contrastive learning to further enhance intra-class compactness and inter-class separability at the representation level. Additionally, edge-aware loss functions and boundary-refinement mechanisms will be incorporated to explicitly alleviate structural ambiguity and improve boundary delineation in complex RS scenes.

Author Contributions

Conceptualization, R.L.; methodology, R.L.; software, L.Y.; validation, R.L.; resources, B.C.; data curation, L.Y. and S.Z.; writing—original draft preparation, R.L.; writing—review and editing, R.L.; visualization, R.L.; supervision, L.Y. and S.Z.; funding acquisition, B.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Scientific Research Foundation of Chongqing University of Technology (Grant No. 2025ZDZ012), the Industry-University-Research Collaboration Innovation Fund for Chinese Higher Education Institutions, Ministry of Education (Grant No. 2025ZX00), the Major Project of Chongqing Technology Innovation and Application Development (Grant No. CSTB2024TIAD-STX0026), the National Science Foundation of China (Grant No. 61961040, U1903215 and 61771089), and the Sichuan Provincial Key Research and Development Program (Grant No. 2021YFQ0011).

Data Availability Statement

Vaihingen and Potsdam datasets are available at: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/default.aspx (accessed on 2 January 2025).

Acknowledgments

The authors would like to express their sincere gratitude to Dolla Mihretu Samuel for his valuable assistance in refining the English expression and improving the overall clarity and readability of the manuscript during the revision process.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lian, R.; Wang, W.; Mustafa, N.; Huang, L. Road extraction methods in high-resolution remote sensing images: A comprehensive review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5489–5507. [Google Scholar] [CrossRef]
  2. Wang, C.; Xu, R.; Xu, S.; Meng, W.; Wang, R.; Zhang, J.; Zhang, X. Toward Accurate and Efficient Road Extraction by Leveraging the Characteristics of Road Shapes. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4404616. [Google Scholar] [CrossRef]
  3. Zhang, P.; Li, J.; Wang, C.; Niu, Y. SAM2MS: An Efficient Framework for HRSI Road Extraction Powered by SAM2. Remote Sens. 2025, 17, 3181. [Google Scholar] [CrossRef]
  4. Wei, S.; Zeng, X.; Zhang, H.; Zhou, Z.; Shi, J.; Zhang, X. LFG-Net: Low-Level Feature Guided Network for Precise Ship Instance Segmentation in SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5231017. [Google Scholar] [CrossRef]
  5. Azimi, S.M.; Fischer, P.; Körner, M.; Reinartz, P. Aerial LaneNet: Lane-Marking Semantic Segmentation in Aerial Imagery Using Wavelet-Enhanced Cost-Sensitive Symmetric Fully Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2019, 57, 2920–2938. [Google Scholar] [CrossRef]
  6. Zhang, K.; Ming, D.; Du, S.; Xu, L.; Ling, X.; Zeng, B.; Lv, X. Distance Weight-Graph Attention Model-Based High-Resolution Remote Sensing Urban Functional Zone Identification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5608518. [Google Scholar] [CrossRef]
  7. Zhao, Y.; Sun, G.; Zhang, L.; Zhang, A.; Jia, X.; Han, Z. MSRF-Net: Multiscale Receptive Field Network for Building Detection From Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5515714. [Google Scholar] [CrossRef]
  8. Chicchon, M.; Colosi, F.; Malinverni, E.S.; León Trujillo, F.J. Urban Sprawl Monitoring by VHR Images Using Active Contour Loss and Improved U-Net with Mix Transformer Encoders. Remote Sens. 2025, 17, 1593. [Google Scholar] [CrossRef]
  9. Li, R.; Zheng, S.; Duan, C.; Wang, L.; Zhang, C. Land cover classification from remote sensing images based on multi-scale fully convolutional network. Geo-Spat. Inf. Sci. 2022, 25, 278–294. [Google Scholar] [CrossRef]
  10. Ma, A.; Zheng, C.; Wang, J.; Zhong, Y. Domain Adaptive Land-Cover Classification via Local Consistency and Global Diversity. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5606317. [Google Scholar] [CrossRef]
  11. Liu, L.; Tong, Z.; Cai, Z.; Wu, H.; Zhang, R.; Le Bris, A.; Olteanu-Raimond, A.M. HierU-Net: A Hierarchical Semantic Segmentation Method for Land Cover Mapping. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4404614. [Google Scholar] [CrossRef]
  12. Saleh, T.; Holail, S.; Zahran, M.; Xiao, X.; Xia, G.S. LiST-Net: Enhanced Flood Mapping with Lightweight SAR Transformer Network and Dimension-Wise Attention. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5211817. [Google Scholar] [CrossRef]
  13. Yadav, R.; Nascetti, A.; Ban, Y. Attentive Dual Stream Siamese U-Net for Flood Detection on Multi-Temporal Sentinel-1 Data. In Proceedings of the 2022 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 5222–5225. [Google Scholar] [CrossRef]
  14. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4096–4105. [Google Scholar]
  15. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  16. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  17. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  18. Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VI 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 173–190. [Google Scholar]
  19. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  20. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  21. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16-20 June 2019; pp. 3146–3154. [Google Scholar]
  22. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272. [Google Scholar]
  23. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  24. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  25. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  26. Guan, F.; Zhao, N.; Wang, H.; Fang, Z.; Zhang, J.; Yu, Y.; Jiang, L.; Huang, H. Dual-branch transformer framework with gradient-aware weighting feature alignment for robust cross-view geo-localization. Inf. Fusion 2025, 127, 103808. [Google Scholar] [CrossRef]
  27. Guan, F.; Zhao, N.; Fang, Z.; Jiang, L.; Zhang, J.; Yu, Y.; Huang, H. Multi-level representation learning via ConvNeXt-based network for unaligned cross-view matching. Geo-Spat. Inf. Sci. 2025, 28, 2344–2357. [Google Scholar] [CrossRef]
  28. Ye, Z.; Li, Y.; Li, Z.; Liu, H.; Zhang, Y.; Li, W. Attention-Multi-Scale Network for Semantic Segmentation of Multi-Modal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5610315. [Google Scholar] [CrossRef]
  29. Zhao, Y.; Qiu, L.; Yang, Z.; Chen, Y.; Zhang, Y. MGF-GCN: Multimodal interaction Mamba-aided graph convolutional fusion network for semantic segmentation of remote sensing images. Inf. Fusion 2025, 122, 103150. [Google Scholar] [CrossRef]
  30. Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 593–602. [Google Scholar]
  31. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  33. Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. Efficientvit: Multi-scale linear attention for high-resolution dense prediction. arXiv 2022, arXiv:2205.14756. [Google Scholar]
  34. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  35. Zhang, F.; Chen, Y.; Li, Z.; Hong, Z.; Liu, J.; Ma, F.; Han, J.; Ding, E. Acfnet: Attentional class feature network for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6798–6807. [Google Scholar]
  36. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  37. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar] [CrossRef]
  38. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar] [CrossRef]
  39. Zhang, Z.; Liu, Q.; Wang, Y. Road Extraction by Deep Residual U-Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
  40. Yu, F. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  41. Chen, L.C. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  42. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18-22 June 2018; pp. 3684–3692. [Google Scholar]
  43. Cui, J.; Liu, J.; Wang, J.; Ni, Y. Global context dependencies aware network for efficient semantic segmentation of fine-resolution remoted sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 2505205. [Google Scholar] [CrossRef]
  44. Liu, R.; Mi, L.; Chen, Z. AFNet: Adaptive fusion network for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7871–7886. [Google Scholar] [CrossRef]
  45. Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9167–9176. [Google Scholar]
  46. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention network for semantic segmentation of fine-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607713. [Google Scholar] [CrossRef]
  47. Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage attention ResU-Net for semantic segmentation of fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8009205. [Google Scholar] [CrossRef]
  48. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  49. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  50. Yu, D.; Ji, S. Long-range correlation supervision for land-cover classification from remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4409814. [Google Scholar] [CrossRef]
  51. Ni, Y.; Liu, J.; Chi, W.; Wang, X.; Li, D. CGGLNet: Semantic Segmentation Network for Remote Sensing Images Based on Category-Guided Global-Local Feature Interaction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5615617. [Google Scholar] [CrossRef]
  52. Zeng, Q.; Zhou, J.; Tao, J.; Chen, L.; Niu, X.; Zhang, Y. Multiscale global context network for semantic segmentation of high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622913. [Google Scholar] [CrossRef]
  53. Yang, Y.; Zheng, S.; Wang, X.; Ao, W.; Liu, Z. AMMUNet: Multiscale Attention Map Merging for Remote Sensing Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 3506718. [Google Scholar] [CrossRef]
  54. Meng, W.; Shan, L.; Ma, S.; Liu, D.; Hu, B. Dlnet: A dual-level network with self-and cross-attention for high-resolution remote sensing segmentation. Remote Sens. 2025, 17, 1119. [Google Scholar] [CrossRef]
  55. Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
  56. Zhang, Q.; Yang, Y.B. Rest: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485. [Google Scholar]
  57. Jin, Z.; Liu, B.; Chu, Q.; Yu, N. Isnet: Integrate image-level and semantic-level context for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7189–7198. [Google Scholar]
  58. Jin, Z.; Gong, T.; Yu, D.; Chu, Q.; Wang, J.; Wang, C.; Shao, J. Mining contextual information beyond image for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7231–7241. [Google Scholar]
  59. Chen, Y.; Dong, Q.; Wang, X.; Zhang, Q.; Kang, M.; Jiang, W.; Wang, M.; Xu, L.; Zhang, C. Hybrid attention fusion embedded in transformer for remote sensing image semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sensing 2024, 17, 4421–4435. [Google Scholar] [CrossRef]
  60. Deng, G.; Wu, Z.; Wang, C.; Xu, M.; Zhong, Y. CCANet: Class-constraint coarse-to-fine attentional deep network for subdecimeter aerial image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4401120. [Google Scholar] [CrossRef]
  61. Ma, X.; Lian, R.; Wu, Z.; Guo, H.; Ma, M.; Wu, S.; Du, Z.; Song, S.; Zhang, W. LOGCAN++: Adaptive Local-global class-aware network for semantic segmentation of remote sensing imagery. arXiv 2024, arXiv:2406.16502. [Google Scholar] [CrossRef]
  62. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  63. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  64. Huang, Y.H.; Proesmans, M.; Georgoulis, S.; Van Gool, L. Uncertainty based model selection for fast semantic segmentation. In Proceedings of the 2019 16th International Conference on Machine Vision Applications (MVA), Tokyo, Japan, 27–31 May 2019; pp. 1–6. [Google Scholar]
  65. Liu, Z.; Zhang, Z.; Khaki, S.; Yang, S.; Tang, H.; Xu, C.; Keutzer, K.; Han, S. Sparse refinement for efficient high-resolution semantic segmentation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2025; pp. 108–127. [Google Scholar]
  66. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  67. Rottensteiner, F.; Sohn, G.; Jung, J.; Gerke, M.; Baillard, C.; Bnitez, S.; Breitkopf, U. International Society for Photogrammetry and Remote Sensing, 2d Semantic Labeling Contest. 2020, 29. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/Default.aspx (accessed on 1 October 2025).
  68. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  69. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. FarSeg++: Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13715–13729. [Google Scholar] [CrossRef] [PubMed]
  70. Li, B.; Liu, Y.; Wang, X. Gradient harmonized single-stage detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, 27 January–1 February 2019; Volume 33, pp. 8577–8584. [Google Scholar]
Figure 1. Illustration of the large intra-class variance and small inter-class variance in remote sensing (RS) images. Yellow arrows denote intra-class variance, while blue arrows indicate inter-class variance.
Figure 2. Illustration of two representative class center computation approaches: (a) simple averaging of pixel features assigned to each predicted class using a coarse segmentation mask; (b) probability-weighted aggregation of all pixel features using coarse segmentation probability maps. Different symbols (e.g., circles, triangles) indicate pixels belonging to different classes.
Figure 3. Overall architecture of CRECA-Net, with a ResNet-50 backbone as the encoder for hierarchical feature extraction, and a decoder comprising a class prototype refinement (CPR) module for generating discriminative class centers, and class-level context aggregation (CLCA) modules for modeling class-aware contextual relationships and reducing inter-class confusion.
Figure 4. Qualitative comparison of different models on the Potsdam test set. Red dashed boxes indicate areas of focus.
Figure 5. t-SNE visualizations of the last-layer feature embeddings extracted from UNetFormer, LOGCAN++, and the proposed CRECA-Net on the Potsdam test set.
Figure 6. Qualitative comparison of different models on the Vaihingen test set. Red dashed boxes indicate areas of focus.
Figure 7. t-SNE visualizations of the last-layer feature embeddings extracted from UNetFormer, LOGCAN++, and the proposed CRECA-Net on the Vaihingen test set.
Figure 8. Visualization of DA loss weight maps. From left to right: raw image, ground truth, and the corresponding DA loss weight map generated during training. Warmer colors indicate higher difficulty-aware weights. Red dashed boxes highlight challenging regions, including visually similar categories and shadow-occluded areas.
Table 1. Different forms of annealing functions.

| Annealing Function | Hyperparameter | Mathematical Expression |
|---|---|---|
| Linear | T_step | λ(t) = t / T_step |
| Polynomial | T_step, decay_factor | λ(t) = (t / T_step)^decay_factor |
| Cosine | T_step | λ(t) = 0.5 · (1 − cos(π · t / T_step)) |
Table 2. Quantitative comparison with state-of-the-art methods on the ISPRS Potsdam dataset. Per-class segmentation performance is reported using IoU scores, with the best value highlighted in bold. Params, FLOPs, and FPS denote the number of model parameters, computational cost, and inference speed, respectively.

| Method | Backbone | Imp. Surf | Building | Lowveg | Tree | Car | OA | Mean F1 | mIoU | FLOPs (G) | Params (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FCN [36] | ResNet50 | 80.08 | 86.78 | 68.80 | 72.16 | 68.42 | 83.31 | 85.69 | 75.25 | 28.77 | 49.49 | 53.28 |
| PSPNet [15] | ResNet50-D8 | 81.02 | 87.27 | 68.71 | 71.47 | 74.90 | 83.54 | 86.64 | 76.67 | 178.64 | 48.97 | 13.93 |
| DeepLabV3+ [16] | ResNet50-D8 | 83.53 | 89.44 | 70.89 | 75.14 | 73.93 | 85.15 | 87.84 | 78.59 | 176.51 | 43.58 | 14.73 |
| DANet [21] | ResNet50-D8 | 81.22 | 87.44 | 69.19 | 73.25 | 72.59 | 83.85 | 86.68 | 76.73 | 211.04 | 49.82 | 12.37 |
| OCRNet [18] | ResNet50-D8 | 82.00 | 87.54 | 69.51 | 73.63 | 71.57 | 84.03 | 86.74 | 76.85 | 153.07 | 36.51 | 19.58 |
| SegFormer [23] | MiT-B1 | 82.30 | 89.17 | 70.45 | 74.58 | 71.81 | 84.70 | 87.25 | 77.66 | 15.42 | 13.68 | 31.25 |
| BANet [55] | ResT-Lite | 86.95 | 91.51 | 76.29 | 79.31 | 90.16 | 90.42 | 91.69 | 84.84 | 15.15 | 12.73 | 10.13 |
| MAResU-Net [47] | ResNet34 | 86.96 | 92.23 | 76.28 | 79.53 | 91.34 | 90.52 | 91.92 | 85.27 | 28.76 | 26.28 | 45.19 |
| DC-Swin [25] | Swin-S | 88.20 | 92.78 | 77.91 | 79.00 | 88.94 | 91.10 | 92.00 | 85.37 | 72.18 | 66.95 | 15.16 |
| UNetFormer [49] | ResNet18 | 88.62 | 93.63 | 77.85 | 79.27 | 92.02 | 91.35 | 92.50 | 86.28 | 11.76 | 11.73 | 85.51 |
| AMMUNet [53] | ResNet50 | 83.72 | 90.74 | 72.02 | 75.92 | 73.06 | 85.58 | 88.15 | 79.09 | 52.51 | 34.71 | 19.06 |
| SLCNet [50] | ResNet50 | 85.34 | 90.98 | 76.07 | 79.02 | 91.76 | 89.81 | 91.55 | 84.63 | 315.47 | 187.15 | 6.18 |
| MSGCNet [52] | ResNet34 | 88.25 | 92.86 | 77.56 | 79.44 | 92.61 | 91.11 | 92.42 | 86.14 | 29.11 | 27.61 | 43.27 |
| LOGCAN++ [61] | ResNet50 | 88.77 | 93.60 | 77.27 | 79.89 | 92.28 | 91.34 | 92.55 | 86.36 | 50.18 | 31.03 | 16.30 |
| CGGLNet [51] | ResNet50 | 88.92 | 94.04 | 78.42 | 80.40 | **93.35** | 91.71 | 92.93 | 87.03 | 87.94 | 36.89 | 11.09 |
| Ours | ResNet50 | **89.57** | **94.25** | **79.00** | **81.07** | 93.30 | **92.00** | **93.18** | **87.44** | 49.48 | 31.00 | 29.10 |
Table 3. Quantitative comparison with state-of-the-art methods on the ISPRS Vaihingen dataset. Per-class segmentation performance is reported using IoU scores, and the best value is highlighted in bold.

| Method | Backbone | Imp. Surf | Building | Lowveg | Tree | Car | OA | Mean F1 | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| FCN [36] | ResNet50 | 88.38 | 85.77 | 66.61 | 74.78 | 47.50 | 92.87 | 83.22 | 72.61 |
| PSPNet [15] | ResNet50-D8 | 89.69 | 86.83 | 67.23 | 74.87 | 63.22 | 93.52 | 86.21 | 76.37 |
| DeepLabV3+ [16] | ResNet50-D8 | 89.28 | 86.99 | 66.89 | 75.23 | 63.89 | 93.43 | 86.27 | 76.46 |
| DANet [21] | ResNet50-D8 | 89.43 | 86.48 | 66.69 | 74.60 | 65.10 | 93.31 | 86.30 | 76.46 |
| OCRNet [18] | ResNet50-D8 | 89.21 | 86.14 | 66.83 | 75.41 | 62.64 | 93.30 | 86.00 | 76.05 |
| SegFormer [23] | MiT-B1 | 89.45 | 85.64 | 66.38 | 74.82 | 60.24 | 93.16 | 85.46 | 75.31 |
| BANet [55] | ResT-Lite | 92.96 | 90.59 | 71.28 | 80.70 | 73.55 | 92.58 | 89.75 | 81.82 |
| MAResU-Net [47] | ResNet34 | 93.51 | 91.03 | 73.14 | 81.83 | 74.46 | 93.07 | 90.36 | 82.79 |
| DC-Swin [25] | Swin-S | 93.84 | 92.03 | 73.45 | 81.91 | 70.39 | 93.31 | 90.01 | 82.33 |
| UNetFormer [49] | ResNet18 | 93.58 | 91.34 | 73.48 | 81.98 | 77.90 | 93.20 | 90.91 | 83.66 |
| AMMUNet [53] | ResNet50 | 89.69 | 86.37 | 65.73 | 75.10 | 61.21 | 93.24 | 85.66 | 75.62 |
| SLCNet [50] | ResNet50 | 93.64 | 90.70 | 73.49 | 82.16 | 78.26 | 93.15 | 90.91 | 83.65 |
| MSGCNet [52] | ResNet34 | 93.22 | 90.57 | 73.06 | 82.22 | 73.12 | 92.96 | 90.14 | 82.44 |
| LOGCAN++ [61] | ResNet50 | 93.95 | 91.81 | 74.59 | 82.57 | 78.48 | 93.53 | 91.29 | 84.28 |
| CGGLNet [51] | ResNet50 | 93.56 | 91.59 | 73.09 | 81.41 | 79.25 | 93.15 | 90.98 | 83.78 |
| Ours | ResNet50 | **94.08** | **92.27** | **74.85** | **82.81** | **79.27** | **93.65** | **91.52** | **84.66** |
Table 4. Ablation study evaluating the contributions of major components in CRECA-Net on the Potsdam dataset. “→” indicates component or loss replacement; “-” denotes module removal; “&” represents the combination of multiple modifications; “✓” indicates the presence of the component, while “✗” denotes its absence.

| Method | L_ce | L_da | Imp. Surf | Building | Lowveg | Tree | Car | mIoU |
|---|---|---|---|---|---|---|---|---|
| (1) CRECA-Net | ✗ | ✓ | 89.57 | 94.25 | 79.00 | 81.07 | 93.30 | 87.44 |
| (2) L_da → L_ce | ✓ | ✗ | 89.66 | 94.08 | 78.89 | 80.49 | 92.63 | 87.15 |
| (3) L_da → L_ce & CPR → WCC | ✓ | ✗ | 88.61 | 93.64 | 77.81 | 80.09 | 91.92 | 86.41 |
| (4) L_da → L_ce & CPR → ACC | ✓ | ✗ | 88.10 | 93.29 | 77.47 | 79.92 | 91.58 | 86.07 |
| (5) -CPR-CLCA | ✗ | ✓ | 86.98 | 92.19 | 76.33 | 79.41 | 91.23 | 85.23 |
| (6) -CPR-CLCA & L_da → L_ce | ✓ | ✗ | 86.09 | 90.89 | 76.52 | 77.91 | 91.09 | 84.50 |
Table 5. Ablation study evaluating the internal design of the CPR module on the Potsdam dataset, with the first row serving as the baseline. “✓” indicates the presence of the component, while “✗” denotes its absence.

| WP_k | WL_k | WC_k | L_inter | Imp. Surf | Building | Lowveg | Tree | Car | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | 89.57 | 94.25 | 79.00 | 81.07 | 93.30 | 87.44 |
| ✗ | ✓ | ✓ | ✓ | 89.35 | 94.07 | 78.94 | 80.83 | 92.66 | 87.17 |
| ✓ | ✗ | ✓ | ✓ | 89.50 | 94.20 | 78.98 | 80.58 | 92.54 | 87.16 |
| ✓ | ✓ | ✗ | ✓ | 89.20 | 93.88 | 78.88 | 81.07 | 92.92 | 87.19 |
| ✓ | ✓ | ✓ | ✗ | 88.99 | 93.80 | 78.75 | 81.05 | 92.77 | 87.07 |
| ✓ | ✗ | ✗ | ✓ | 89.18 | 93.87 | 77.48 | 80.64 | 93.09 | 86.85 |
| ✗ | ✓ | ✗ | ✓ | 89.04 | 93.93 | 78.36 | 81.07 | 92.56 | 86.99 |
| ✗ | ✗ | ✓ | ✓ | 89.16 | 93.97 | 78.09 | 80.68 | 92.78 | 86.94 |
Table 6. Ablation study on the effect of the similarity threshold β on the Potsdam dataset.

| β | 0.05 | 0.075 | 0.1 | 0.125 | 0.15 |
|---|---|---|---|---|---|
| mIoU | 87.11 | 87.33 | 87.39 | 87.44 | 87.20 |
Table 7. Ablation study on the effect of the focusing parameter γ in the DA loss on the Potsdam dataset.

| γ | 0.5 | 1.0 | 1.5 | 2.0 |
|---|---|---|---|---|
| mIoU | 87.29 | 87.44 | 87.14 | 87.08 |
Table 8. Ablation study of the effect of normalization and annealing function in the DA loss on the Potsdam dataset.

| Method | Normalization | Annealing Function | mIoU |
|---|---|---|---|
| (1) Standard cross-entropy loss | – | – | 87.15 |
| (2) Loss weight w_i = (1 − p_i)^γ [66] | ✗ | – | 86.73 |
| (3) F-A Optimization [69] | Focal/CE ratio | – | 86.94 |
| (4) Difficulty estimation | Image-level | – | 87.28 |
| (5) Difficulty estimation + Linear Annealing | Image-level | Linear | 87.35 |
| (6) Difficulty estimation + Polynomial Annealing | Image-level | Polynomial | 87.40 |
| (7) Difficulty estimation + Cosine Annealing | Image-level | Cosine | 87.44 |
Table 9. Ablation study of the effect of the annealing step T_step in the DA loss on the Potsdam and Vaihingen datasets.

| T_step | T_best | 0.9·T_best | 0.8·T_best | 0.7·T_best |
|---|---|---|---|---|
| Potsdam (mIoU) | 87.27 | 87.28 | 87.44 | 87.35 |
| Vaihingen (mIoU) | 84.45 | 84.57 | 84.66 | 84.48 |
Table 10. Ablation study on the effect of the auxiliary loss coefficient λ_aux on the Potsdam dataset.

| λ_aux | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|
| mIoU | 87.29 | 87.44 | 87.38 | 87.36 |