Article

ConvNeXt with Context-Weighted Deep Superpixels for High-Spatial-Resolution Aerial Image Semantic Segmentation

1 Institute of Digital Agriculture, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
2 School of Spatial Planning and Design, Hangzhou City University, Hangzhou 310015, China
3 College of Environmental and Resource Sciences, Zhejiang University, Hangzhou 310058, China
* Author to whom correspondence should be addressed.
AI 2025, 6(11), 277; https://doi.org/10.3390/ai6110277
Submission received: 15 September 2025 / Revised: 20 October 2025 / Accepted: 21 October 2025 / Published: 22 October 2025

Abstract

Semantic segmentation of high-spatial-resolution (HSR) aerial imagery is critical for applications such as urban planning and environmental monitoring, yet challenges including scale variation, intra-class diversity, and inter-class confusion persist. This study proposes a deep learning framework that integrates convolutional neural networks (CNNs) with context-enhanced superpixel generation, using ConvNeXt as the backbone for feature extraction. The framework incorporates two key modules, namely, a deep superpixel module (Spixel) and a global context modeling module (GC-module), which synergistically generate context-weighted superpixel embeddings to enhance scene–object relationship modeling, refining local details while maintaining global semantic consistency. The proposed approach achieves mIoU scores of 84.54%, 90.59%, and 64.46% on three HSR aerial imagery benchmark datasets (Vaihingen, Potsdam, and UV6K), respectively. Ablation experiments further validate the contributions of the global context modeling module and the deep superpixel module, highlighting their synergy in improving segmentation results. This work facilitates precise spatial detail preservation and semantic consistency in HSR aerial imagery interpretation, particularly for small objects and complex land cover classes.

1. Introduction

As a core task in remote sensing image analysis, semantic segmentation of aerial imagery involves categorizing each pixel into a predefined ground object class. This task holds critical importance in urban planning [1,2], ecological investigation [3], disaster evaluation [4], and agricultural monitoring [5,6,7]. The emergence of HSR aerial imagery has provided high-precision spatial decision support for these applications. However, the unique characteristics of HSR imagery introduce challenges to HSR semantic segmentation: ground object categories exhibit significant scale variations, ranging from large-area geographical entities, like vegetation and lakes, to small targets, such as individual trees and vehicles. Additionally, intra-class objects may differ drastically in material and shape, while inter-class objects may be easily confused due to similar color or texture. These complexities make traditional handcrafted feature-based methods inadequate for capturing the fine-grained differences inherent in HSR imagery [8].
Recent advancements in deep learning have demonstrated superior feature representation capabilities, enabling simultaneous feature extraction and segmentation in practice. The deep learning approach to remote sensing semantic segmentation primarily builds upon classic frameworks such as U-Net [9], DeepLab [10,11,12], and PSPNet [13], with adaptations tailored to remote sensing data. For example, multiscale feature fusion techniques (e.g., atrous convolution, pyramid pooling) are employed to capture objects of varying sizes by fusing features from different receptive fields [14,15], mitigating the scale variation issue. Studies have also introduced attention mechanisms, with examples including channel attention (SE module) [16], spatial attention (CBAM) [17], non-local [18], and GCNet [19], to suppress background noise and enhance feature responses in key regions. While CNN-based methods have achieved remarkable progress, their development has reached a bottleneck due to the inherent limitation of receptive field size. In recent years, transformer architectures have emerged as a research hotspot for HSR remote sensing segmentation [20,21]. Vision transformers are adept at capturing long-range dependencies, offering advantages in global context perception. Pure transformer models or hybrid CNN–transformer architectures have shown unique strengths in capturing global consistency across large scenes and modeling complex object relationships [22,23]. Nevertheless, transformer-based methods still face limitations in remote sensing segmentation practice: they still require patch-wise processing when handling HSR aerial imagery, which—similar to CNN-based approaches—compromises the global context to some extent. Additionally, the self-attention mechanism of transformers focuses on global statistical information, causing features of small objects to be easily diluted by large-area background features, resulting in segmentation accuracy of small objects comparable to that of CNNs. Although studies have attempted to alleviate these issues using dynamic attention mechanisms (e.g., sparse attention, local attention) [24,25], the large-scale practical implementation of transformers in remote sensing segmentation still requires further breakthroughs.
This paper presents a newly refined deep learning framework dedicated to HSR aerial imagery semantic segmentation. To address the aforementioned challenges, our framework integrates three main contributions:
(1) Global context projection: The framework explicitly models scene–object relationships by projecting global contextual information into local feature representations, enhancing the understanding of spatial dependencies across large scenes.
(2) Context-weighted superpixel embeddings: The framework generates superpixel-level feature embeddings weighted by global context, enabling discriminative representation of region-level semantic information while preserving fine-grained details.
(3) Superpixel-guided upsampling: The framework incorporates region-level shape priors from superpixels to guide the upsampling process, optimizing edge details and reducing spatial misalignment in segmentation predictions.
Specifically, this framework adopts a transformer-inspired pure convolutional network ConvNeXt [26] as the backbone. We introduce a deep superpixel module that incorporates global context modeling, which models scene–object relationships using global context, generates context-weighted superpixel embeddings, and provides region-level shape priors to optimize edge details in semantic segmentation predictions.

2. Related Works

2.1. General Semantic Segmentation

Classical methods rely on handcrafted feature descriptors for pixel-level representation, whereas deep learning approaches like CNNs learn hierarchical features directly from pixels for pixel-wise classification. The fully convolutional network (FCN) [27] represents a landmark in pixel-wise segmentation, as it pioneered end-to-end dense prediction and set a foundational architectural paradigm for pixel-level tasks. In parallel, U-Net’s symmetric structure and skip connections demonstrate exceptional spatial detail recovery in refinement tasks, serving as a benchmark model in communities such as medical image segmentation and aerial image interpretation [28,29,30,31]. The DeepLab series introduced atrous convolutions to enlarge receptive fields [12], leveraging spatial contextual information to enhance model performance. Similarly, the Pyramid Pooling Module (PPM) integrates pyramid features of different scales to improve global context modeling [13]. While CNN convolutions primarily capture local features, self-attention mechanisms [25,32,33] have been progressively integrated into vision tasks to model long-range dependencies. Building on the achievements of transformers, vision transformer-based methods (e.g., Segformer [34], Swin-Unet [35]) have advanced pixel-level tasks, exhibiting competitive performance in semantic segmentation. Meanwhile, convolutional network structures inspired by the design of transformers, such as ConvNeXt, have also secured a prominent position in the field of semantic segmentation [36,37].
These generic approaches predominantly focus on multiscale context modeling to capture extensive dependencies and enhance scene semantic comprehension. However, HSR aerial imagery poses unique challenges: (1) larger image coverage with diverse object sizes and severe class imbalance hinders explicit modeling of complex ground objects and (2) demands for precise local detail modeling of small objects further intensify task difficulty.

2.2. Semantic Segmentation in Aerial Images

Deep learning-based semantic segmentation techniques have been widely adopted in remote sensing for applications, including building/road extraction [38,39,40], vehicle detection [41,42,43], and land cover classification [44,45,46]. While these applications follow conventional segmentation paradigms, they incorporate adaptations for HSR imagery and specific recognition targets. To address challenges like small-scale objects and inter-class similarity, studies have extracted multiscale features to enhance network performance [47,48], moderately improving segmentation accuracy for objects of varying sizes.
Attention mechanisms are extensively utilized to boost feature representation, typically by strengthening long-range dependencies within feature maps [33,49]. Leveraging contextual information also proves critical [50,51], where integrating attention and multiscale features in CNNs facilitates accurate context modeling. Transformer-based architectures have been further applied to remote sensing segmentation. These methods treat images as patch sequences, flattening patches into vectors and modeling global relationships via self-attention to capture arbitrary-range dependencies [52,53,54]. Such works validate transformers’ applicability to remote sensing tasks.
However, compared to the progressive downsampling in CNNs, the patch embedding process in vision transformers involves implicit downsampling. This leads to the loss of local correlations between pixels within patches, reduced spatial resolution in feature maps, and compromised local detail preservation. Consequently, these limitations pose significant challenges for precisely identifying fine-grained structures (e.g., building outlines and road boundaries) in aerial imagery. In recent years, several studies have incorporated superpixels to leverage object-level information, aiming to refine the edge details of semantic segmentation results [53,55,56]. However, their efficacy remains contingent on the accuracy of superpixel segmentation, and inaccurate superpixels can introduce new semantic ambiguities. How best to integrate superpixels with deep segmentation networks therefore remains an open question.

3. Methods

A schematic diagram of the proposed framework is depicted in Figure 1. A symmetric encoder is employed to process the input image and generate pyramid features. Deep features within the multiscale hierarchy are further used to generate global descriptors, leveraging the global context to reinforce the semantic consistency of the superpixel module. In the decoder, multi-level feature aggregation is performed to generate a prediction map, and segmentation boundary optimization is completed based on the generated superpixels.

3.1. Symmetric Encoder

For the symmetric encoder, ConvNeXt was employed as the backbone network in this work. As a pure convolutional deep network, ConvNeXt achieves performance comparable to vision transformers (e.g., Swin Transformer [20]). It was chosen as the feature extractor due to its structural simplicity and efficiency. ConvNeXt adopts a ResNet-like architecture: its initial “stem” layer utilizes a 4 × 4 convolution kernel and a stride of 4 to perform 4× downsampling on the input image. Four subsequent stages are then configured to extract feature maps (denoted as S_i, i = 2, 3, 4, 5), where each stage consists of stacked basic blocks. These blocks comprise modules such as depthwise convolution, layer normalization, and a multilayer perceptron (MLP). Downsampling with a stride of 2 is applied between consecutive stages, resulting in channel dimensions of 128, 256, 512, and 1024 for the four stages, respectively. Detailed specifications of the ConvNeXt network used in this study are provided in Table A1.
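For illustration, the stem and basic block described above can be sketched in PyTorch as follows. This is a minimal, simplified ConvNeXt-style block (omitting details such as layer scale and stochastic depth), not the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Minimal ConvNeXt-style block: depthwise conv -> LayerNorm -> MLP (pointwise) + residual."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)                    # applied over the channel dimension
        self.pwconv1 = nn.Linear(dim, mlp_ratio * dim)   # MLP expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(mlp_ratio * dim, dim)   # MLP projection

    def forward(self, x):                                # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                        # (B, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                        # back to (B, C, H, W)
        return shortcut + x

# Stem: 4x4 convolution with stride 4, producing the 1/4-resolution, 128-channel feature map
stem = nn.Conv2d(3, 128, kernel_size=4, stride=4)
block = ConvNeXtBlock(128)
out = block(stem(torch.randn(1, 3, 512, 512)))           # (1, 128, 128, 128)
```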
Similar to typical symmetric networks, a symmetric pathway was designed to progressively recover multiscale pyramid feature maps (P_i, i = 2, 3, 4, 5) with restored spatial resolution. The encoder and decoder pathways were connected via skip connections, and the generation of P_i is described as follows:
P_i = \mathrm{Skip}(S_i) + \mathrm{Up}(P_{i+1}, \mathrm{scale} = 2), \quad i = 2, 3, 4,
where Skip(∙) denotes a skip connection implemented via a 1 × 1 convolution layer with 256 channels and Up(∙, scale = 2) represents nearest-neighbor upsampling with a scale factor of 2. This structure aggregates deep-layer semantic information and shallow-layer spatial details, facilitating semantic modeling of remote sensing ground objects and recovery of spatial details. Additionally, PPM was introduced following S5 to obtain the context-enhanced representation P5 [57]. Finally, the multiscale pyramid feature maps Pi, with 256 channels, were output to the decoder, where feature map P2 was additionally used to generate deep superpixels, the details of which are elaborated in subsequent sections.
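A minimal sketch of this pyramid pathway is given below, assuming 256-channel 1 × 1 lateral convolutions and nearest-neighbor upsampling as described above; the PPM applied to S5 is omitted for brevity, and the module names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidDecoder(nn.Module):
    """Builds P2..P5 from backbone stages S2..S5 via P_i = Skip(S_i) + Up(P_{i+1})."""
    def __init__(self, in_dims=(128, 256, 512, 1024), out_dim=256):
        super().__init__()
        self.skips = nn.ModuleList([nn.Conv2d(d, out_dim, kernel_size=1) for d in in_dims])

    def forward(self, feats):                 # feats = [S2, S3, S4, S5], coarsest last
        p = self.skips[-1](feats[-1])         # P5 (in the full model, PPM-enhanced)
        pyramid = [p]
        for skip, s in zip(reversed(self.skips[:-1]), reversed(feats[:-1])):
            p = skip(s) + F.interpolate(p, scale_factor=2, mode="nearest")
            pyramid.insert(0, p)
        return pyramid                        # [P2, P3, P4, P5], each with 256 channels

# Example with dummy stage outputs for a 512 x 512 input
feats = [torch.randn(1, c, s, s) for c, s in [(128, 128), (256, 64), (512, 32), (1024, 16)]]
p2, p3, p4, p5 = PyramidDecoder()(feats)
```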

3.2. Deep Superpixel Module with Integrated Contextual Attention

This study builds on our previous work [53], beginning with a review of the deep superpixel (Spixel) generation process. The core concept of deep superpixels involves using a reconstruction loss to guide a fully convolutional neural network in generating a pixel–superpixel affinity matrix, which assigns each pixel to its adjacent superpixel grids. Mathematically, this process involves inputting the image I \in \mathbb{R}^{H \times W \times 3} into the neural network and outputting the affinity matrix Q \in \mathbb{R}^{H \times W \times 9}. Following Yang’s work [58], to simplify computation, the predicted affinity matrix Q only considers the membership probabilities of each pixel to its 9 neighboring superpixel grids. Typically, the affinity matrix can cluster pixels into superpixels; conversely, it can also reconstruct pixel attributes from superpixels. Mathematically, this process is described as:
f(s) = \frac{\sum_{p: s \in N_p} f_{i,j}(p) \cdot q_{i,j}(p, s)}{\sum_{p: s \in N_p} q_{i,j}(p, s)},
f'_{i,j}(p) = \sum_{s \in N_p} f(s) \cdot q_{i,j}(p, s),
where f(s) denotes the attribute of superpixel s, f_{i,j}(p) is the attribute of pixel p at coordinates (i, j), f'_{i,j}(p) represents the reconstructed attribute of pixel p, N_p is the set of neighboring superpixels of pixel p, q_{i,j}(p, s) is the probability of pixel p belonging to superpixel s, as indicated by the neural network-predicted affinity matrix, and \sum_{p: s \in N_p} q_{i,j}(p, s) is a normalization term.
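The two formulas above amount to soft pooling of pixel attributes into superpixels and reconstruction of pixel attributes from superpixels. The sketch below uses a dense pixel–superpixel assignment for clarity; the actual module restricts each pixel to its 9 neighboring superpixel grids, and the tensor shapes here are illustrative.

```python
import torch

def superpixel_pool_and_reconstruct(feat, q):
    """Simplified, dense version of the soft pooling / reconstruction above.

    feat: (B, N, C) pixel attributes (e.g., one-hot labels or coordinates), N = H*W
    q:    (B, N, S) soft assignment of each pixel to S superpixels
    """
    # f(s) = sum_p f(p) q(p, s) / sum_p q(p, s)
    sp_feat = torch.einsum("bnc,bns->bsc", feat, q) / (q.sum(dim=1).unsqueeze(-1) + 1e-6)
    # f'(p) = sum_s f(s) q(p, s)
    recon = torch.einsum("bsc,bns->bnc", sp_feat, q)
    return sp_feat, recon

# Toy example: 16 pixels, 4 superpixels, 3-channel attributes
q = torch.softmax(torch.randn(1, 16, 4), dim=-1)
feat = torch.randn(1, 16, 3)
sp_feat, recon = superpixel_pool_and_reconstruct(feat, q)
```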
Building on the implicit learning of pixel–superpixel grid relationships by deep superpixels, this study proposes a global context modeling module (GC-module) to model scene–object relationships. Context-weighted superpixel embeddings are employed to generate the affinity matrix Q, thereby improving the local details of semantic segmentation. Specifically, a projection mapping function is applied to feature map S5 to condense global information into global statistics \bar{G} \in \mathbb{R}^{1 \times 1 \times C}. This projection mapping function consists of global average pooling (GAP) and a trainable 1 × 1 convolution layer with 256 channels. Subsequently, the sigmoid function converts the global descriptor \bar{G} into a channel weight vector \bar{V}, which is then applied for channel-wise weighting of feature map P2. To avoid the loss of content information during re-weighting, a shortcut connection is introduced to sum the two features, preserving the content of the original feature map as much as possible. Finally, the fused feature map is used to generate the affinity matrix Q. We formulate this procedure as follows:
\mathrm{GAP}(S_5) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} S_5(i, j, c),
\bar{G} = \sigma\left(W_{256} \cdot \mathrm{GAP}(S_5) + b\right), \quad \bar{V} = \mathrm{sigmoid}(\bar{G}),
Q = \sigma\left(W_9 \cdot (\bar{V} \cdot P_2 + P_2) + b\right),
where W_c is a learnable 1 × 1 convolution layer with c output channels, b is a bias term, and σ denotes the ReLU activation function.
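A minimal PyTorch sketch of this procedure is shown below, assuming S5 has 1024 channels and P2 has 256 channels; the layer names are illustrative, and the affinity map Q is normalized only when it is used for superpixel aggregation.

```python
import torch
import torch.nn as nn

class GCSpixelHead(nn.Module):
    """Sketch of the GC-module + affinity prediction: global average pooling of S5,
    a 1x1 conv with ReLU, sigmoid channel weights that re-weight P2 (with a shortcut),
    and a 1x1 conv predicting the 9-channel pixel-superpixel affinity map Q."""
    def __init__(self, s5_dim=1024, p2_dim=256):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                                              # GAP(S5)
        self.proj = nn.Sequential(nn.Conv2d(s5_dim, p2_dim, 1), nn.ReLU(inplace=True))  # W_256, sigma
        self.affinity = nn.Sequential(nn.Conv2d(p2_dim, 9, 1), nn.ReLU(inplace=True))   # W_9, sigma

    def forward(self, s5, p2):
        v = torch.sigmoid(self.proj(self.gap(s5)))   # channel weight vector V
        x = v * p2 + p2                              # re-weighting + shortcut
        q = self.affinity(x)                         # affinity Q; normalized over the 9
        return q                                     # neighbors when used for aggregation

# Example: S5 at 1/32 resolution, P2 at 1/4 resolution for a 512 x 512 input
q = GCSpixelHead()(torch.randn(1, 1024, 16, 16), torch.randn(1, 256, 128, 128))
```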

3.3. Decoder

The decoder stage involves two processes: feature aggregation and superpixel refinement. Multiscale pyramid feature maps (P_i, i = 2, 3, 4, 5) are aggregated via concatenation. Prior to concatenation, features P3, P4, and P5 are upscaled to a size equivalent to 1/4 of the input image (i.e., spatially aligned with P2). The aggregated features are followed by a learnable 1 × 1 convolutional layer to produce pixel-level features with 512 channels, which are then fed into a classification head to generate semantic segmentation logits. These logits are restored to the input image size using the superpixel mapping matrix Q. This upsampling process can be interpreted as reconstructing the spatial resolution of pixel-wise semantic segmentation based on the superpixel centers defined in Q. Ultimately, a semantic object segmentation result refined by deep superpixels is obtained.
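The feature aggregation step can be sketched as follows, assuming four 256-channel pyramid levels and six classes; the final superpixel-guided restoration via Q is indicated only as a comment, since it reuses the reconstruction formula from Section 3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegDecoder(nn.Module):
    """Sketch of the decoder: upsample P3-P5 to P2's resolution, concatenate,
    fuse with a 1x1 conv (512 channels), and predict class logits at 1/4 resolution."""
    def __init__(self, in_dim=256, num_classes=6):
        super().__init__()
        self.fuse = nn.Conv2d(4 * in_dim, 512, kernel_size=1)
        self.head = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, pyramid):                                   # [P2, P3, P4, P5]
        size = pyramid[0].shape[-2:]
        ups = [pyramid[0]] + [F.interpolate(p, size=size, mode="bilinear", align_corners=False)
                              for p in pyramid[1:]]
        logits = self.head(self.fuse(torch.cat(ups, dim=1)))      # 1/4-resolution logits
        # In the full model, these logits are restored to the input resolution with the
        # superpixel affinity matrix Q instead of plain interpolation.
        return logits

logits = SegDecoder()([torch.randn(1, 256, 128 // 2**i, 128 // 2**i) for i in range(4)])
```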

3.4. Loss Function

Cross-entropy loss Lce combined with Dice loss Ldice is adopted as the semantic segmentation loss function for this experiment. The former measures the per-pixel classification error, while the latter evaluates the similarity (i.e., overlap) between the model predictions and ground-truth labels, which is tailored for imbalanced datasets. The semantic segmentation loss Lseg is denoted as:
L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik},
L_{dice} = 1 - \frac{2}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{y_{ik} \hat{y}_{ik}}{y_{ik} + \hat{y}_{ik}},
L_{seg} = L_{ce} + L_{dice},
where N denotes the number of samples, \hat{y}_{ik} denotes the predicted probability that the i-th sample belongs to class k, and y_{ik} denotes the corresponding ground-truth label.
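A minimal implementation of the combined segmentation loss is given below, assuming integer label maps; the Dice term here is averaged over classes, which is one common variant of the formula above.

```python
import torch
import torch.nn.functional as F

def seg_loss(logits, target, eps=1e-6):
    """L_seg = L_ce + L_dice.
    logits: (B, K, H, W) raw class scores; target: (B, H, W) integer labels in [0, K-1]."""
    ce = F.cross_entropy(logits, target)                          # cross-entropy, averaged over pixels

    probs = torch.softmax(logits, dim=1)                          # y_hat
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()  # y
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = (probs + onehot).sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * inter / (denom + eps)).mean()             # Dice loss, averaged over classes

    return ce + dice

loss = seg_loss(torch.randn(2, 6, 64, 64), torch.randint(0, 6, (2, 64, 64)))
```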
Additionally, an object loss L_obj is introduced in the deep superpixel branch. Following Yang’s work, the attributes of pixel p include the semantic label sem_{i,j}(p) and the coordinate vector loc_{i,j}(p), and the entire reconstruction loss is implemented as:
L_{obj} = \sum_{p} E\left(\mathrm{sem}_{i,j}(p), \mathrm{sem}'_{i,j}(p)\right) + \frac{m}{S} \left\| \mathrm{loc}_{i,j}(p) - \mathrm{loc}'_{i,j}(p) \right\|_2,
where E(∙) denotes the cross-entropy loss, sem'_{i,j}(p) and loc'_{i,j}(p) are the attributes reconstructed from superpixels via the formulas in Section 3.2, S is the superpixel sampling interval, and m is empirically set to 0.03.
Finally, the total loss of the model is a weighted sum of the above loss functions, expressed as:
L_{total} = L_{seg} + \alpha L_{obj},
where weight factor α is empirically set to 0.5.

4. Experimental Setting

4.1. Datasets

1. Vaihingen Dataset.
The Vaihingen dataset is a well-established ISPRS benchmark for urban remote sensing semantic annotation. It includes 33 HSR aerial images (ground sampling distance = 9 cm). Each image contains three multispectral bands and a normalized digital surface model (nDSM). The ground objects are annotated into six categories. Following previous studies [22,23], the dataset was split by allocating 15 images for training, 1 image for validation, and 17 for testing. In order to conduct a fair performance comparison with the results reported by existing methods, the nDSM data were excluded from this experiment.
2. Potsdam Dataset.
For the ISPRS urban semantic labeling project, the Potsdam dataset offers 38 HSR images (6000 × 6000 pixels) with a 5 cm GSD, covering larger geographical areas than Vaihingen. Each image includes annotations for six object classes. Consistent with existing research [22,23], 23 RGB images were utilized for training, 1 for validation, and 14 for testing, and the nDSM data were also excluded from this experiment.
3. UV6K Dataset.
This high-resolution urban vehicle segmentation dataset contains 6310 images (1024 × 1024 pixels) with 245,141 annotated vehicle instances [43]. The imagery features a spatial resolution of 0.1 m. As vehicles represent typical small-scale foreground objects in remote sensing tasks—characterized by severe foreground–background imbalance—this dataset effectively evaluates model performance, particularly in addressing class imbalance challenges. Following the publisher’s split, the dataset was divided into training (4100 images), testing (1710 images), and validation (500 images) subsets.
The ISPRS Vaihingen and ISPRS Potsdam datasets are two well-established high-resolution aerial image datasets that have been extensively utilized to evaluate the performance of semantic segmentation models in the remote sensing community [41]. On the other hand, these two datasets are relatively small-scale aerial image datasets. To better assess the model’s performance in small object segmentation, we introduced the UV6K dataset (Figure 2), a large-scale aerial image dataset of urban vehicles. Small vehicles, as representative geospatial objects in foreground–background imbalanced scenarios, serve as critical evaluation targets for testing segmentation robustness under such challenging conditions. Consequently, in this study, we aimed to comprehensively evaluate the segmentation performance of the developed model using three distinct aerial image datasets, ensuring robustness across varying scales and object characteristics.

4.2. Implementation Details

The proposed model was benchmarked against several advanced segmentation methods, including Deeplabv3+ [12] (spatial context modeling), OCRNet [59] (attention mechanism), Segformer (transformer-based architecture), Unetformer [22] (hybrid CNN–transformer design), and ESPNet [53]. Experimental implementation of all models was performed via PyTorch v1.13.1 on an NVIDIA RTX 3090 GPU. A batch size of four was adopted, with the maximum number of epochs set to 30. The Adam optimizer was employed with an initial learning rate of 0.0001 and a weight decay of 0.01, while cosine annealing was used to adjust the learning rate. The images were randomly cropped to 512 × 512 patches for input. Training data augmentation, including horizontal flipping, vertical flipping, and 90-degree rotation, was randomly applied. For inference, non-overlapping sliding windows of 1024 × 1024 pixels were used to process the high-resolution images, and the predicted results were stitched together. All reproduced methods were tested under the same experimental environment. Furthermore, to enable a more comprehensive comparison, experimental results from recent advanced approaches reported in the literature were further incorporated, such as large-model-based segmentation methods (e.g., Mamba [60] and SAM [61]).
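The optimizer, schedule, and non-overlapping sliding-window inference described above can be sketched as follows; `model` is a stand-in module, and the window stitching is simplified (it assumes image dimensions divisible by the window size).

```python
import torch

# Optimizer and schedule as described: Adam, lr 1e-4, weight decay 0.01, cosine annealing over 30 epochs.
model = torch.nn.Conv2d(3, 6, 1)  # stand-in module for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)  # call scheduler.step() once per epoch

@torch.no_grad()
def sliding_window_predict(model, image, num_classes=6, win=1024):
    """Non-overlapping sliding-window inference; image: (1, 3, H, W) with H, W multiples of `win`."""
    _, _, h, w = image.shape
    out = torch.zeros(1, num_classes, h, w)
    for top in range(0, h, win):
        for left in range(0, w, win):
            patch = image[:, :, top:top + win, left:left + win]
            out[:, :, top:top + win, left:left + win] = model(patch)  # stitch patch logits
    return out.argmax(dim=1)                                          # final class map
```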
Following practices in prior studies, Intersection over Union (IoU) and the F1-score were adopted as accuracy metrics. Mean IoU (mIoU) is the average of IoU across all classes, and the mean F1 (mF1) follows a similar calculation. The formulas for the two metrics are provided below:
\mathrm{IoU} = \frac{TP}{TP + FP + FN},
F1 = \frac{2 \times TP}{2 \times TP + FP + FN},
where TP corresponds to the true positives of a specific class in the confusion matrix, FP represents the false positives, and FN refers to the false negatives.
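A small example of computing per-class IoU and F1 from a confusion matrix, following the formulas above; the matrix values are illustrative.

```python
import numpy as np

def per_class_iou_f1(conf):
    """Compute per-class IoU and F1 from a K x K confusion matrix
    (rows: ground truth, columns: prediction)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1

conf = np.array([[50, 2, 1],
                 [3, 40, 2],
                 [1, 1, 30]])
iou, f1 = per_class_iou_f1(conf)
print(iou.mean(), f1.mean())   # mIoU and mF1
```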

5. Results and Analysis

5.1. Results and Comparison on the Vaihingen Dataset

Table 1 shows that the proposed model outperforms seven state-of-the-art methods, with substantial gains in both the mIoU and mF1 metrics on the Vaihingen test set. These results show the effectiveness of our method in capturing object and boundary representations in HSR aerial imagery. It is worth noting that the incorporation of global context information and deep superpixels substantially enhances semantic segmentation performance, with the proposed method achieving mIoU and mF1 values of 84.54% and 91.17%, respectively, outperforming representative methods such as DeepLabv3+, OCRNet, and SegFormer. For example, the proposed method achieves IoU values of 88.70% for the impervious surface class and 74.69% for low vegetation, improvements of 2.94% and 7.36% over UNetformer and of 0.50% and 5.53% over SAM-RS [62], respectively. Furthermore, the car class exhibits an IoU improvement of 1.16% compared to RS3Mamba [63], with a comparable IoU for the building class. This is attributed to the inherently regular shapes and distinct boundary features of these two classes, consistent with the strength of our deep superpixel module in capturing fine-grained object structure. Compared with ESPNet, there is also a modest improvement in the car class, indicating that introducing global context can enhance the superpixels.
On the other hand, the tree class, with its blurred boundaries and varying spatial scales, shows moderate but meaningful improvements. For example, our method outperforms DeepLabv3+ and SegFormer in the tree class by 4.89% and 0.22% IoU, respectively, but falls behind RS3Mamba and SAM-RS by 1.31% and 2.18% IoU, respectively. Overall, the proposed method performs competitively across all classes and maintains the lead in mIoU and mF1.
Figure 3 presents a visual comparison of the results produced by the existing advanced methods and our approach on the Vaihingen test set. The comparison in the regions of interest (purple boxes) in each scenario shows that the proposed model has a distinct advantage in capturing fine features and maintaining semantic consistency. For the first scenario (a hollow rectangular building), DeepLabv3+ and SegFormer incorrectly classified the roof greening area as “vegetation”, while the proposed model had the most complete segmentation of the “building” category, indicating that the proposed model has better performance in using context relations to solve the semantic ambiguity within objects. In the second scenario (low vegetation and trees), the proposed model was least affected by shadows, and compared to the other models, it could more accurately distinguish “trees” and “low vegetation” pixels, while other models showed boundary fragmentation or loss of small objects. For the third scenario, OCRNet, SegFormer, and the proposed model all successfully identified the cars that were shaded and not labeled in the ground truth. The proposed model further detected sparse low vegetation in the upper part of the image, indicating stronger sensitivity to small and spatially dispersed objects. These visual results are consistent with quantitative indicators, thereby confirming the validity of the presented model in complex scenarios.

5.2. Results and Comparison on the Potsdam Dataset

Table 2 presents the quantitative results on the Potsdam test set. The proposed model attains the best overall performance, with an mIoU of 90.59% and an mF1 of 93.69%, surpassing the six comparison methods. It exhibits superior performance across all assessed land cover categories, with notable improvements in the building and car classes, which are characterized by regular geometric structures and high contrast against the background. Specifically, over SegFormer, the proposed method yields mIoU and mF1 improvements of 0.93% and 0.58%, respectively, and the IoU for the building and car classes improves by 0.92% and 1.12%, respectively. Because the distinctive edge features of man-made structures are more prominent in very high-resolution aerial imagery, the proposed method’s enhanced ability to capture object boundaries leads to stronger performance.
For complex classes such as low vegetation and trees, the proposed method still demonstrates significant improvements over SBSS-MS and OCRNet. For example, the low vegetation category improves by 3.84% IoU over SBSS-MS and 2.56% IoU over OCRNet, reflecting the ability of our model to handle heterogeneous vegetation distribution through global context-driven object consistency constraints. In addition, the strong urban scene segmentation methods, i.e., SegFormer and OCRNet, also perform slightly worse than our method in the impervious surface category, indicating that the proposed framework better utilizes high-resolution spatial features to solve the fine-grained classification task of complex urban remote sensing scenes. Compared with the previous ESPNet, the proposed model achieves improvements in both overall accuracy and individual categories.
Figure 4 presents visual comparisons of the segmentation results on the Potsdam test set, with purple bounding boxes highlighting regions of interest. The proposed model exhibits precision in recognizing complex ground objects, most obviously characterized by fewer noise artifacts compared to competing methods. For the first scene, the proposed model delineates building boundaries more completely, and there is less confusion in the pixel boundaries of adjacent different objects. In the second and third scenes—where distorted objects increase boundary recognition difficulty—the proposed model demonstrates robustness, maintaining clear class separation even in highly cluttered areas. These visual improvements align with the quantitative metrics in Table 2.

5.3. Results and Comparison on the UV6K Dataset

The object types and difficulty of the UV6K dataset differ significantly from those of the aforementioned datasets. Table 3 demonstrates that our approach achieved competitive performance, with an IoU of 64.46% and an F1-score of 78.39% for the segmentation of a large number of small-scale objects (vehicles), significantly outperforming baseline methods such as DeepLabv3+ (IoU/F1: 53.63%/69.82%) and OCRNet (IoU/F1: 56.79%/72.44%). Compared to SegFormer and ESPNet, which performed well on the ISPRS datasets, our method achieved IoU improvements of 0.78% and 1.18% and F1-score improvements of 0.58% and 0.88%, respectively. Notably, the proposed model achieved comparable or even slightly superior performance to FarSeg++ (IoU/F1: 64.40%/78.30%), a strong method reported by the publishers of this dataset.
Overall, the robust efficacy of the developed method on the different datasets mentioned above demonstrates that the model can be effectively generalized to various remote sensing scenarios beyond standard benchmark datasets.
Figure 5 presents a visualization example from the UV6K dataset. Focusing on the region of interest (the purple box), the proposed model demonstrates high accuracy in identifying small objects, such as the recognition results of the disordered vehicles in the third sub-figure. Such performance places the model at the top among the compared methods. These visual observations align closely with the improvements in F1 and IoU metrics reported in Table 3.

5.4. Ablation Studies

To analyze the influence of individual components on segmentation performance, an ablation study was performed using a baseline ConvNeXt-T segmentation model without the two modules. The results are shown in Table 4, where the global context modeling module (GC-module) and deep superpixel module (Spixel) were added sequentially, and both modules were finally combined to form the proposed model.
Baseline Model: The baseline model attained an mIoU of 83.13% and an mF1-score of 90.37% without additional modules, serving as a reference for performance comparison.
Effect of the Global Context Module: Adding the GC-module alone improved mIoU by 0.49% and mF1 by 0.21%, with an increase of 1.19 M parameters. This indicates that modeling global contextual dependencies helps capture long-range semantic relationships, enhancing feature discrimination.
Effect of the Superpixel Module: Incorporating the Spixel module alone yielded a more significant improvement, with mIoU increasing by 1.05% and mF1 by 0.61%, while adding 0.005M parameters. This suggests that superpixel-based local structure modeling effectively refines spatial details, reducing false negatives in small object segmentation.
Combined Effect of Both Modules: When both the GC-module and Spixel module were integrated into the proposed model, it achieved optimal performance with an mIoU of 84.54% and an mF1 of 91.17%. The total parameter increase was 1.195 M, indicating synergistic effects between the global context and local structure modeling.
These findings indicate that the proposed modules complement each other: the GC-module enhances semantic coherence at the global level, while the Spixel module improves spatial precision at the local level, collectively leading to excellent semantic segmentation performance.

5.5. Sensitivity Analysis of the Spixel Module

To further investigate the impact of the superpixel sampling interval S (a key parameter for deep superpixels) on model performance, a sensitivity analysis was performed on the Vaihingen test set, with S set to 4, 8, and 16. The evaluation metrics included mIoU and mean F1-score. Smaller S values indicate a more fragmented segmentation, while larger S values indicate a larger average size of the generated superpixels.
The experimental results in Figure 6 show that model performance decreases with increasing S. Specifically, when S = 4, mIoU and mF1 reach their highest values. As S increases to 8, mIoU and mF1 decrease slightly, to 83.70% and 90.71%, respectively. When S is further increased to 16, mIoU and mF1 continue to decrease, to 83.48% and 90.49%, respectively. This trend suggests that a smaller sampling interval strengthens the model’s ability to capture fine-grained object boundaries by generating more, smaller superpixels, thereby improving segmentation accuracy. Furthermore, when S is set to 4, the superpixel sampling interval matches the 4× upsampling needed to restore the 1/4-resolution segmentation logits to the input size, so superpixel generation and upsampling can be treated as a single feature reconstruction step. In summary, S = 4 is selected as the optimal setting because it balances boundary detail preservation and segmentation performance.

5.6. Model Complexity Analysis

To assess the practicality of the proposed model, four key metrics were used to analyze its computational complexity and resource consumption (Table 5): floating-point operations (FLOPs), model parameters, memory footprint, and frames per second (FPS). These metrics reflect computational cost, model scale, memory requirements, and inference speed, respectively; an ideal model balances high segmentation accuracy, fast inference, and low resource consumption. Table 5 presents a comparative analysis of the proposed model against advanced semantic segmentation models on the Vaihingen test set. The proposed model achieves the highest mIoU of 84.54%, outperforming all the compared methods, including OCRNet (84.04%), SAM-RS (84.01%), and RS3Mamba (82.78%), and runs at 2.99 FPS, comparable to UNetformer (2.87). However, this accuracy gain is accompanied by higher computational complexity: the proposed model exhibits the highest FLOPs (133.21 G) and parameter count (83.69 M) among the evaluated models.
In terms of the accuracy–performance trade-off, the proposed model prioritizes segmentation accuracy, achieving a 0.50% mIoU improvement over the second-best model, OCRNet. This gain is particularly notable for fine-grained remote sensing tasks where precise pixel classification is essential, including urban land cover mapping and ecological environment investigation. While its FLOPs and parameters are higher than lightweight models, like SegFormer and UNetformer, the significant accuracy improvement demonstrates that the effort to enhance baseline model performance by introducing additional modules is worthwhile, even with the accompanying risk of increased computational complexity. In terms of memory efficiency, the proposed model demonstrates favorable memory usage: its memory footprint (1087 MB) is comparable to that of OCRNet (1080 MB) and significantly lower than those of SAM-RS (5176 MB) and RS3Mamba (2332 MB). This suggests the model optimizes memory usage while pushing accuracy boundaries, making it more feasible for deployment compared to memory-intensive alternatives. Overall, despite exceeding lightweight models in computational complexity, its superior segmentation performance and moderate memory footprint make it a practical choice for applications requiring advanced aerial image semantic segmentation, such as high-resolution urban mapping and environmental monitoring. This trade-off highlights the model’s suitability for research and industrial settings where the benefits from improved accuracy may outweigh computational costs. Reducing the computational overhead of the proposed model will be a key focus of future work to further enhance its flexibility.
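For reference, parameter count, peak GPU memory, and FPS can be measured with a simple PyTorch routine like the sketch below (FLOPs typically require a separate profiling tool); the protocol mirrors the two-image 256 × 256 setting of Table 5, but this is not the exact script used in the paper.

```python
import time
import torch

def measure(model, input_shape=(2, 3, 256, 256), device="cuda", warmup=5, iters=20):
    """Measure parameter count (M), peak GPU memory (MB), and FPS for a batch of two 256x256 images."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6

    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        for _ in range(warmup):            # warm-up runs excluded from timing
            model(x)
        torch.cuda.synchronize(device)
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize(device)
    fps = iters * input_shape[0] / (time.time() - start)
    mem_mb = torch.cuda.max_memory_allocated(device) / 1024**2
    return params_m, mem_mb, fps
```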

6. Conclusions

This study presents a refined deep learning framework for semantic segmentation of high-resolution aerial images, addressing the key challenges of scale variation and preservation of fine details. By integrating ConvNeXt as the backbone network with modules such as global context enhancement and deep superpixels, this framework achieves outstanding performance on multiple benchmark datasets. The integration of global context modeling enhances semantic consistency, while deep superpixel refinement improves local boundary fidelity, particularly beneficial for small targets and complex land cover categories. Ablation experiments confirm that the synergistic interaction of these two modules improves the model’s accuracy, where the global context module enhances feature discrimination and the superpixel module optimizes spatial details.
Maintaining an optimal balance between accuracy and efficiency, the proposed model offers practical utility for applications including urban mapping and environmental monitoring. Future research will focus on reducing resource consumption, investigating multimodal data fusion, and extending the framework to dynamic remote sensing scenarios. This study advances the frontier of HSR aerial imagery semantic segmentation by providing a robust method that combines global context understanding and local detail refinement.

Author Contributions

Conceptualization, Z.Y. and D.K.; methodology, Z.Y.; software, Z.Y. and M.G.; validation, M.G. and Y.L.; formal analysis, M.G.; investigation, Y.L.; resources, D.K.; data curation, M.D.; writing—original draft preparation, Z.Y.; writing—review and editing, Y.L. and X.T.; visualization, X.T.; supervision, D.K.; project administration, Y.L. and M.D.; funding acquisition, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 42201300, the Zhejiang Provincial Basic Public Welfare Research Project of China, grant number LGN22C130016, and the Hangzhou Joint Fund of the Zhejiang Provincial Natural Science Foundation of China, grant number LHZQN25D010002.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are openly available in ISPRS at https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/Default.aspx (accessed on 14 September 2025), and the UV6K dataset is available at https://zenodo.org/records/8404754 (accessed on 14 September 2025). The remaining data that support the findings in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. The left-path structure based on ConvNeXt in the symmetric encoder.

| Name | Kernel | Stride | Input | OutDim |
|---|---|---|---|---|
| Image (input) | – | – | – | H × W × 3 |
| Conv0 (stem) | 4 × 4, 128 | 4 | Image | 1/4 H × 1/4 W × 128 |
| Conv_S2 | [7 × 7, 128; 1 × 1, 512; 1 × 1, 128] × 3 | 1 | Conv0 | 1/4 H × 1/4 W × 128 |
| Conv_S3 | [7 × 7, 256; 1 × 1, 1024; 1 × 1, 256] × 3 | 2 | Conv_S2 | 1/8 H × 1/8 W × 256 |
| Conv_S4 | [7 × 7, 512; 1 × 1, 2048; 1 × 1, 512] × 9 | 2 | Conv_S3 | 1/16 H × 1/16 W × 512 |
| Conv_S5 | [7 × 7, 1024; 1 × 1, 4096; 1 × 1, 1024] × 3 | 2 | Conv_S4 | 1/32 H × 1/32 W × 1024 |
| PPM_feat | Pyramid Pooling Module | 1 | Conv_S5 | 1/32 H × 1/32 W × 512 |
Five different random seeds were used to conduct multiple rounds of validation on the three datasets. The corresponding statistical results can be found in Table A2, Table A3 and Table A4. It can be observed that the changes in the accuracy rates of each category and the overall accuracy rate of the proposed model are relatively small, which, to some extent, proves the robustness of the model performance.
Table A2. The statistical results obtained using five random seeds on the Vaihingen test set. Per-class columns report IoU (%).

| Method | Impervious Surface | Building | Low Vegetation | Tree | Car | mIoU (%) | mF1 (%) |
|---|---|---|---|---|---|---|---|
| The proposed model | 88.52 ± 0.25 | 93.79 ± 0.15 | 74.86 ± 0.45 | 82.54 ± 0.19 | 82.57 ± 0.74 | 84.46 ± 0.09 | 91.07 ± 0.10 |
Table A3. The statistical results obtained using five random seeds on the Potsdam test set. Per-class columns report IoU (%).

| Method | Impervious Surface | Building | Low Vegetation | Tree | Car | mIoU (%) | mF1 (%) |
|---|---|---|---|---|---|---|---|
| The proposed model | 92.50 ± 0.12 | 97.30 ± 0.15 | 83.17 ± 0.27 | 82.85 ± 0.20 | 96.44 ± 0.11 | 90.45 ± 0.12 | 93.58 ± 0.08 |
Table A4. The statistical results obtained using five random seeds on the UV6K test set.

| Method | Backbone | IoU (%) | F1 (%) |
|---|---|---|---|
| The proposed model | ConvNeXt-T | 64.40 ± 0.06 | 78.35 ± 0.05 |

References

  1. Lin, Y.; Zhang, M.; Gan, M.; Huang, L.; Zhu, C.; Zheng, Q.; You, S.; Ye, Z.; Shahtahmassebi, A.; Li, Y.; et al. Fine Identification of the Supply–Demand Mismatches and Matches of Urban Green Space Ecosystem Services with a Spatial Filtering Tool. J. Clean. Prod. 2022, 336, 130404. [Google Scholar] [CrossRef]
  2. Lin, Y.; An, W.; Gan, M.; Shahtahmassebi, A.; Ye, Z.; Huang, L.; Zhu, C.; Huang, L.; Zhang, J.; Wang, K. Spatial Grain Effects of Urban Green Space Cover Maps on Assessing Habitat Fragmentation and Connectivity. Land 2021, 10, 1065. [Google Scholar] [CrossRef]
  3. He, T.; Hu, Y.; Guo, A.; Chen, Y.; Yang, J.; Li, M.; Zhang, M. Quantifying the Impact of Urban Trees on Land Surface Temperature in Global Cities. ISPRS J. Photogramm. Remote Sens. 2024, 210, 69–79. [Google Scholar] [CrossRef]
  4. Fang, C.; Fan, X.; Wang, X.; Nava, L.; Zhong, H.; Dong, X.; Qi, J.; Catani, F. A Globally Distributed Dataset of Coseismic Landslide Mapping via Multi-Source High-Resolution Remote Sensing Images. Earth Syst. Sci. Data 2024, 16, 4817–4842. [Google Scholar] [CrossRef]
  5. Victor, N.; Maddikunta, P.K.R.; Mary, D.R.K.; Murugan, R.; Chengoden, R.; Gadekallu, T.R.; Rakesh, N.; Zhu, Y.; Paek, J. Remote Sensing for Agriculture in the Era of Industry 5.0—A Survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5920–5945. [Google Scholar] [CrossRef]
  6. Lehouel, K.; Saber, C.; Bouziani, M.; Yaagoubi, R. Remote Sensing Crop Water Stress Determination Using CNN-ViT Architecture. AI 2024, 5, 618–634. [Google Scholar] [CrossRef]
  7. Banerjee, S.; Reynolds, J.; Taggart, M.; Daniele, M.; Bozkurt, A.; Lobaton, E. Quantifying Visual Differences in Drought-Stressed Maize through Reflectance and Data-Driven Analysis. AI 2024, 5, 790–802. [Google Scholar] [CrossRef]
  8. Zhao, W.; Du, S.; Emery, W.J. Object-Based Convolutional Neural Network for High-Resolution Imagery Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3386–3396. [Google Scholar] [CrossRef]
  9. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  10. Chen, L.C.; Barron, J.T.; Papandreou, G.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4545–4554. [Google Scholar]
  11. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017. [Google Scholar] [CrossRef]
  12. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  13. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  14. Wang, J.; Zheng, Y.; Wang, M.; Shen, Q.; Huang, J. Object-Scale Adaptive Convolutional Neural Networks for High-Spatial Resolution Remote Sensing Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 283–299. [Google Scholar] [CrossRef]
  15. Liu, S.; Zhao, D.; Zhou, Y.; Tan, Y.; He, H.; Zhang, Z.; Tang, L. Network and Dataset for Multiscale Remote Sensing Image Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 2851–2866. [Google Scholar] [CrossRef]
  16. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  17. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  18. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  19. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the ICCV, Virtual, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
  21. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  22. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  23. Meng, X.; Yang, Y.; Wang, L.; Wang, T.; Li, R.; Zhang, C. Class-Guided Swin Transformer for Semantic Segmentation of Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6517505. [Google Scholar] [CrossRef]
  24. Sun, D.; Bao, Y.; Liu, J.; Cao, X. A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18727–18738. [Google Scholar] [CrossRef]
  25. Wang, Z.; Gao, F.; Dong, J.; Du, Q. Global and Local Attention-Based Transformer for Hyperspectral Image Change Detection. IEEE Geosci. Remote Sens. Lett. 2024, 22, 5500405. [Google Scholar] [CrossRef]
  26. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the CVPR, New Orleans, LA, USA, 21–24 June 2022; pp. 11966–11976. [Google Scholar]
  27. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  28. Ye, Z.; Fu, Y.; Gan, M.; Deng, J.; Comber, A.; Wang, K. Building Extraction from Very High Resolution Aerial Imagery Using Joint Attention Deep Neural Network. Remote Sens. 2019, 11, 2970. [Google Scholar] [CrossRef]
  29. Li, R.; Liu, W.; Yang, L.; Sun, S.; Hu, W.; Zhang, F.; Li, W. DeepUNet: A Deep Fully Convolutional Network for Pixel-Level Sea-Land Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3954–3962. [Google Scholar] [CrossRef]
  30. Liu, B.; Li, B.; Sreeram, V.; Li, S. MBT-UNet: Multi-Branch Transform Combined with UNet for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2024, 16, 2776. [Google Scholar] [CrossRef]
  31. Lateef, F.; Ruichek, Y. Survey on Semantic Segmentation Using Deep Learning Techniques. Neurocomputing 2019, 338, 321–348. [Google Scholar] [CrossRef]
  32. Ye, Z.; Tan, X.; Dai, M.; Chen, X.; Zhong, Y.; Zhang, Y.; Ruan, Y.; Kong, D. A Hyperspectral Deep Learning Attention Model for Predicting Lettuce Chlorophyll Content. Plant Methods 2024, 20, 22. [Google Scholar] [CrossRef] [PubMed]
  33. Ma, F.; Sun, X.; Zhang, F.; Zhou, Y.; Li, H.-C. What Catch Your Attention in SAR Images: Saliency Detection Based on Soft-Superpixel Lacunarity Cue. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5200817. [Google Scholar] [CrossRef]
  34. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  35. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
  36. Liu, W.; Liu, J.; Luo, Z.; Zhang, H.; Gao, K.; Li, J. Weakly Supervised High Spatial Resolution Land Cover Mapping Based on Self-Training with Weighted Pseudo-Labels. Int. J. Appl. Earth Obs. Geoinform. 2022, 112, 102931. [Google Scholar] [CrossRef]
  37. Liu, S.; Cao, S.; Lu, X.; Peng, J.; Ping, L.; Fan, X.; Teng, F.; Liu, X. Lightweight Deep Learning Model, ConvNeXt-U: An Improved U-Net Network for Extracting Cropland in Complex Landscapes from Gaofen-2 Images. Sensors 2025, 25, 261. [Google Scholar] [CrossRef]
  38. Chen, D.; Ma, A.; Zhong, Y. Semi-Supervised Knowledge Distillation Framework for Global-Scale Urban Man-Made Object Remote Sensing Mapping. Int. J. Appl. Earth Obs. Geoinform. 2023, 122, 103439. [Google Scholar] [CrossRef]
  39. Dong, D.; Ming, D.; Weng, Q.; Yang, Y.; Fang, K.; Xu, L.; Du, T.; Zhang, Y.; Liu, R. Building Extraction from High Spatial Resolution Remote Sensing Images of Complex Scenes by Combining Region-Line Feature Fusion and OCNN. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 4423–44381. [Google Scholar] [CrossRef]
  40. Dai, L.; Zhang, G.; Zhang, R. RADANet: Road Augmented Deformable Attention Network for Road Extraction From Complex High-Resolution Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602213. [Google Scholar] [CrossRef]
  41. Lyu, Y.; Vosselman, G.; Xia, G.-S.; Yilmaz, A.; Yang, M.Y. UAVid: A Semantic Segmentation Dataset for UAV Imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119. [Google Scholar] [CrossRef]
  42. Feng, L.; Chen, S.; Zhang, C.; Zhang, Y.; He, Y. A Comprehensive Review on Recent Applications of Unmanned Aerial Vehicle Remote Sensing with Various Sensors for High-Throughput Plant Phenotyping. Comput. Electron. Agric. 2021, 182, 106033. [Google Scholar] [CrossRef]
  43. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. FarSeg++: Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13715–13729. [Google Scholar] [CrossRef] [PubMed]
  44. Zheng, Z.; Yu, S.; Jiang, S. A Domain Adaptation Method for Land Use Classification Based on Improved HR-Net. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4400911. [Google Scholar] [CrossRef]
  45. Ma, Y.; Deng, X.; Wei, J. Land Use Classification of High-Resolution Multispectral Satellite Images with Fine-Grained Multiscale Networks and Superpixel Post Processing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3264–3278. [Google Scholar] [CrossRef]
  46. Formichini, M.; Avizzano, C.A. A Comparative Analysis of Deep Learning-Based Segmentation Techniques for Terrain Classification in Aerial Imagery. AI 2025, 6, 145. [Google Scholar] [CrossRef]
  47. Zhang, C.; Harrison, P.A.; Pan, X.; Li, H.; Sargent, I.; Atkinson, P.M. Scale Sequence Joint Deep Learning (SS-JDL) for Land Use and Land Cover Classification. Remote Sens. Environ. 2020, 237, 111593. [Google Scholar] [CrossRef]
  48. Li, Z.; Zhang, H.; Lu, F.; Xue, R.; Yang, G.; Zhang, L. Breaking the Resolution Barrier: A Low-to-High Network for Large-Scale High-Resolution Land-Cover Mapping Using Low-Resolution Labels. ISPRS J. Photogramm. Remote Sens. 2022, 192, 244–267. [Google Scholar] [CrossRef]
  49. Li, Y.; Zhou, Z.; Qi, G.; Hu, G.; Zhu, Z.; Huang, X. Remote Sensing Micro-Object Detection under Global and Local Attention Mechanism. Remote Sens. 2024, 16, 644. [Google Scholar] [CrossRef]
  50. Zhao, W.; Peng, S.; Chen, J.; Peng, R. Contextual-Aware Land Cover Classification with U-Shaped Object Graph Neural Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6510705. [Google Scholar] [CrossRef]
  51. Dong, B.; Zheng, Q.; Lin, Y.; Chen, B.; Ye, Z.; Huang, C.; Tong, C.; Li, S.; Deng, J.; Wang, K. Integrating Physical Model-Based Features and Spatial Contextual Information to Estimate Building Height in Complex Urban Areas. Int. J. Appl. Earth Obs. Geoinf. 2024, 126, 103625. [Google Scholar] [CrossRef]
  52. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506105. [Google Scholar] [CrossRef]
  53. Ye, Z.; Lin, Y.; Dong, B.; Tan, X.; Dai, M.; Kong, D. An Object-Aware Network Embedding Deep Superpixel for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2024, 16, 3805. [Google Scholar] [CrossRef]
  54. Zhou, X.; Zhou, L.; Gong, S.; Zhong, S.; Yan, W.; Huang, Y. Swin Transformer Embedding Dual-Stream for Semantic Segmentation of Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 175–189. [Google Scholar] [CrossRef]
  55. Liang, S.; Hua, Z.; Li, J. Hybrid Transformer-CNN Networks Using Superpixel Segmentation for Remote Sensing Building Change Detection. Int. J. Remote Sens. 2023, 44, 2754–2780. [Google Scholar] [CrossRef]
  56. Fang, F.; Zheng, K.; Li, S.; Xu, R.; Hao, Q.; Feng, Y.; Zhou, S. Incorporating Superpixel Context for Extracting Building from High-Resolution Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1176–1190. [Google Scholar] [CrossRef]
  57. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified Perceptual Parsing for Scene Understanding. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11209, pp. 432–448. [Google Scholar]
  58. Yang, F.; Sun, Q.; Jin, H.; Zhou, Z. Superpixel Segmentation with Fully Convolutional Networks. In Proceedings of the CVPR, Seattle, WA, USA, 14–19 June 2020; pp. 13961–13970. [Google Scholar]
  59. Yuan, Y.; Chen, X.; Wang, J. Object-Contextual Representations for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 173–190. [Google Scholar]
  60. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual State Space Model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  61. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
  62. Ma, X.; Wu, Q.; Zhao, X.; Zhang, X.; Pun, M.-O.; Huang, B. SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5636916. [Google Scholar] [CrossRef]
  63. Ma, X.; Zhang, X.; Pun, M.-O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405. [Google Scholar] [CrossRef]
  64. Cai, Y.; Fan, L.; Fang, Y. SBSS: Stacking-Based Semantic Segmentation Framework for Very High-Resolution Remote Sensing Image. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5600514. [Google Scholar] [CrossRef]
Figure 1. Framework of the proposed model.
Figure 2. Example images and annotations of three datasets.
Figure 3. Visual comparison on the Vaihingen test set. (a) Image, (b) ground truth, (c) Deeplabv3+, (d) OCRNet, (e) SegFormer, and (f) the proposed model.
Figure 4. Visual comparison on the Potsdam test set. (a) Image, (b) ground truth, (c) Deeplabv3+, (d) OCRNet, (e) SegFormer, and (f) the proposed model.
Figure 5. Visual comparison on the UV6K test set. (a) Image, (b) ground truth, (c) Deeplabv3+, (d) OCRNet, (e) SegFormer, and (f) the proposed model.
Figure 6. Sensitivity analysis of the hyperparameter S of the Spixel module on the ISPRS Vaihingen dataset.
Table 1. Experimental results on the Vaihingen test set. Per-class columns report IoU (%); bold values are the best.

| Method | Backbone | Impervious Surface | Building | Low Vegetation | Tree | Car | mIoU (%) | mF1 (%) |
|---|---|---|---|---|---|---|---|---|
| DeepLabv3+ [12] | ResNet-101 | 73.76 | 84.48 | 63.96 | 77.46 | 61.29 | 72.19 | 83.28 |
| OCRNet [59] | HRNet-48 | 87.62 | 92.84 | 74.55 | 82.46 | 82.73 | 84.04 | 90.71 |
| SegFormer [34] | MiT-b5 | 87.46 | 92.72 | 72.81 | 82.13 | 78.13 | 82.65 | 90.00 |
| UNetformer [22] | ResNet-18 | 85.76 | 92.78 | 67.33 | 83.22 | 80.75 | 81.97 | 89.85 |
| RS3Mamba [63] | R18-MambaT | 86.62 | 93.83 | 67.84 | 83.66 | 81.97 | 82.78 | 90.34 |
| SAM-RS [62] | Swin-B | 88.20 | **94.53** | 69.16 | **84.53** | **83.64** | 84.01 | 91.08 |
| ESPNet [53] | ConvNeXt-T | 88.40 | 93.63 | 74.50 | 82.12 | 82.94 | 84.32 | 91.06 |
| The proposed model | ConvNeXt-T | **88.70** | 93.84 | **74.69** | 82.35 | 83.13 | **84.54** | **91.17** |
Table 2. Experimental results on the Potsdam test set. Per-class columns report IoU (%); bold values are the best.

| Method | Backbone | Impervious Surface | Building | Low Vegetation | Tree | Car | mIoU (%) | mF1 (%) |
|---|---|---|---|---|---|---|---|---|
| DeepLabv3+ [12] | ResNet-101 | 87.27 | 92.94 | 76.24 | 75.20 | 93.20 | 84.97 | 89.88 |
| GCNet [19] | ResNet-50 | 87.47 | 93.24 | 77.04 | 79.67 | 91.70 | 85.82 | / |
| OCRNet [59] | HRNet-48 | 91.53 | 96.04 | 80.70 | 80.75 | 95.86 | 88.98 | 92.69 |
| SegFormer [34] | MiT-b5 | 91.85 | 96.58 | 82.21 | 82.21 | 95.47 | 89.66 | 93.11 |
| SBSS-MS [64] | ConvNeXt-T | 88.95 | 94.88 | 79.42 | 81.82 | 93.34 | 87.68 | / |
| ESPNet [53] | ConvNeXt-T | 92.40 | 97.21 | 82.59 | 82.50 | 95.95 | 90.13 | 93.54 |
| The proposed model | ConvNeXt-T | **92.63** | **97.50** | **83.26** | **82.95** | **96.59** | **90.59** | **93.69** |
Table 3. Experimental results on the UV6K test set. Bold values are the best.

| Method | Backbone | IoU (%) | F1 (%) |
|---|---|---|---|
| DeepLabv3+ [12] | ResNet-101 | 53.63 | 69.82 |
| OCRNet [59] | HRNet-48 | 56.79 | 72.44 |
| SegFormer [34] | MiT-b5 | 63.68 | 77.81 |
| FarSeg++ [43] | ResNet-50 | 64.40 | 78.30 |
| ESPNet [53] | ConvNeXt-T | 63.28 | 77.51 |
| The proposed model | ConvNeXt-T | **64.46** | **78.39** |
Table 4. Ablation studies on the Vaihingen test set.

| Method | ConvNeXt-T | GC-Module | Spixel | mIoU (%) | mF1 (%) | Δparams (M) |
|---|---|---|---|---|---|---|
| Baseline | √ | | | 83.13 | 90.37 | 0 |
| w/ global context | √ | √ | | 83.62 | 90.58 | 1.190 |
| w/ deep superpixel | √ | | √ | 84.18 | 90.98 | 0.005 |
| The proposed model | √ | √ | √ | 84.54 | 91.17 | 1.195 |
Note: The symbol “√” indicates that the corresponding module is included in the model architecture for that experimental configuration. All configurations are built upon the baseline model; each row represents a variant where specific modules are either retained (√) or ablated (absence of “√”).
Table 5. Computational complexity analysis and parameter comparison. Results on two 256 × 256 images from the ISPRS Vaihingen dataset using a single NVIDIA GeForce RTX 3090 GPU. Bold values are the best.

| Model | Backbone | FLOPs (G) | Parameters (M) | Memory (MB) | FPS (image/s) | mIoU (%) |
|---|---|---|---|---|---|---|
| DeepLabv3+ [12] | ResNet-101 | 127.45 | 60.21 | 1913 | 13.58 | 72.19 |
| OCRNet [59] | HRNet-48 | 82.74 | 70.53 | 1080 | 9.17 | 84.04 |
| SegFormer [34] | MiT-b5 | **3.14** | **2.55** | 1185 | 6.46 | 82.65 |
| UNetformer [22] | ResNet-18 | 5.87 | 11.68 | **167** | 2.87 | 81.97 |
| RS3Mamba [63] | R18-MambaT | 28.25 | 43.32 | 2332 | / | 82.78 |
| SAM-RS [62] | Swin-B | 50.84 | 96.14 | 5176 | 3.50 | 84.01 |
| ConvNeXt-U [37] | ConvNeXt | 15.61 | 80.30 | 669 | **16.13** | 60.43 |
| The proposed model | ConvNeXt-T | 133.21 | 83.69 | 1087 | 2.99 | **84.54** |
