SDS-Former: A Transformer-Based Method for Semantic Segmentation of Arid Land Remote Sensing Imagery

Du, Yujie; Fan, Junfu; Li, Kuan; Li, Yongrui

doi:10.3390/a19050325

¹ School of Civil Engineering and Geomatics, Shandong University of Technology, Zibo 255000, China

² State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(5), 325; https://doi.org/10.3390/a19050325

Submission received: 13 February 2026 / Revised: 15 April 2026 / Accepted: 18 April 2026 / Published: 22 April 2026

(This article belongs to the Special Issue Artificial Intelligence, Image Processing and Spatial Analytics in Environmental Informatics)

Abstract

Semantic segmentation of land use and land cover (LULC) in arid regions remains challenging due to severe class imbalance, fragmented spatial distributions, and high spectral similarity among different land cover types. These characteristics often lead to an information bottleneck in deep segmentation networks and hinder the extraction of discriminative semantic representations. To address these issues, we propose SDS-Former, a lightweight semantic segmentation network specifically designed for remote sensing imagery in arid environments. SDS-Former incorporates an SSM-inspired Lightweight Semantic Enhancement (LSE) module to strengthen contextual modeling and alleviate the loss of discriminative information in deep features. To tackle scale variations, a Dynamic Selective Feature Fusion (DSFF) module is employed in the decoder to adaptively weight and fuse high-level semantics with low-level spatial details. Furthermore, a Feature Refinement Head (FRH) is introduced to enhance boundary localization and improve the recognition of small-scale and sparsely distributed land cover objects. Extensive ablation and comparative experiments demonstrate that SDS-Former consistently outperforms representative semantic segmentation methods across multiple evaluation metrics. On the Tarim Basin dataset, the proposed network achieves a mean Intersection over Union (mIoU) of 82.51% and an F1 score of 86.47%, indicating its superior effectiveness and robustness. Qualitative results further verify that SDS-Former exhibits clear advantages in distinguishing spectrally similar land cover types and preserving the spatial continuity of ground objects in complex arid-region scenes.

Keywords: semantic segmentation; remote sensing images; dynamic fusion; feature refinement; arid regions; land cover classification

1. Introduction

Semantic segmentation of remote sensing imagery has become a critical technique for land use and land cover (LULC) mapping, environmental monitoring, and ecological assessment [1,2,3]. With the rapid advancement of satellite and airborne sensors, large volumes of high-resolution remote sensing images are now available, providing rich spectral, spatial, and contextual information for detailed surface feature analysis [4,5,6]. Compared with conventional pixel-based or rule-based classification methods, deep learning-based semantic segmentation has been shown to outperform traditional approaches in capturing discriminative features from complex remote sensing scenes [7,8].

However, remote sensing images usually cover vast geographic areas with highly diverse land surface patterns and strong spatial heterogeneity, making semantic segmentation significantly more challenging than natural image segmentation [9,10]. This challenge is particularly pronounced in arid regions, where surface objects such as bare soil, desert, sparse vegetation, and saline land exhibit similar spectral characteristics and ambiguous boundaries [11,12,13]. Misclassification among spectrally similar land-cover types can introduce uncertainty in desertification monitoring and land degradation analysis, thereby affecting the accuracy and reliability of ecological assessments. The complexity and imbalance of land cover categories in arid regions often lead to degraded segmentation performance and unstable model generalization [13,14,15,16]. Moreover, the significant spectral variability within the same category and high spectral confusion among different land cover types further complicate the extraction of discriminative semantic representations [17].

Remote sensing image segmentation methods have evolved through several distinct stages, ranging from early pixel-based classification techniques to object-based image analysis (OBIA), which incorporates spatial and texture information beyond pixel-level statistics. With the advancement of pattern recognition, traditional machine learning algorithms such as Support Vector Machines (SVM) [18] and Random Forests (RF) [19] have become the mainstream for LULC tasks due to their robust performance in high-dimensional feature spaces. Current deep learning-based LULC segmentation methods mainly rely on convolutional neural networks (CNNs) and Transformer-based architectures. Representative models such as UNet [20], DeepLab series [21,22,23], and HRNet [24] have been widely applied in remote sensing image segmentation tasks due to their strong capability in multi-scale feature extraction and spatial information fusion [25,26]. More recently, Vision Transformer (ViT) and Swin Transformer have introduced self-attention mechanisms to model long-range dependencies and global contextual relationships, showing promising results in high-resolution remote sensing image analysis [27,28,29].

Despite these advances, both CNN-based and Transformer-based segmentation models inevitably suffer from an information bottleneck problem as the network depth increases. In CNN-based architectures, this issue mainly arises from repeated downsampling operations, which expand the receptive field but inevitably lead to the loss of fine spatial details and boundary information. In Transformer-based models, the bottleneck mainly occurs in the deeper stages (Stage 3–4) of the Mix Transformer (MiT) backbone, where hierarchical downsampling and token compression weaken spatial resolution and degrade boundary representations. As a result, small or sparsely distributed targets are more likely to be misclassified [30,31,32]. This issue is particularly critical in arid-region semantic segmentation, where subtle texture variations and micro-scale land objects must be accurately identified. Additionally, land cover categories in arid regions usually present a pronounced class imbalance. Dominant classes such as bare land and desert account for most samples, while categories such as water and vegetation are relatively scarce [33,34]. These challenges impose higher demands on segmentation models in terms of multi-scale feature fusion and fine-grained category discrimination.

To address the above challenges, we propose SDS-Former, a lightweight decoder designed for land cover semantic segmentation in arid regions. The proposed method adopts a Dynamic Selective Feature Fusion (DSFF) module to adaptively weight and fuse multi-level features, enabling a flexible balance between high-level semantic information and low-level spatial details. A Feature Refinement Head (FRH) is developed by integrating pixel-level spatial attention and channel attention mechanisms to enhance the discriminative capability of the fused features. This module effectively highlights key land cover regions while suppressing background noise, which is particularly beneficial for land cover types with fragmented distributions and blurred boundaries in arid environments. To address the loss of global contextual information caused by the information bottleneck, it is essential to enhance long-range dependency modeling in the decoding stage. Therefore, the SSM-inspired Lightweight Semantic Enhancement (LSE) module is introduced to strengthen contextual representation and semantic consistency while maintaining computational efficiency. In summary, SDS-Former effectively improves the ability to distinguish key land cover types under complex backgrounds in arid regions.

The main contributions of this article are as follows:

(1) We propose SDS-Former, a lightweight decoder specifically tailored for semantic segmentation in arid regions, effectively alleviating the information bottleneck and enhancing feature representation.
(2) We design a Dynamic Selective Feature Fusion (DSFF) module combined with a Lightweight Semantic Enhancement (LSE) module for adaptive multi-scale feature fusion and improved semantic consistency, along with a Feature Refinement Head (FRH) to enhance discriminative capability for key land cover features.
(3) We validate SDS-Former through ablation and comparative experiments, which show that it produces refined semantic features from remote sensing imagery and improves segmentation performance in arid regions.

2. Related Work

With the widespread availability of high-resolution remote sensing images and the rapid development of deep learning techniques, research on land cover classification has gradually shifted from traditional pixel-based and object-based methods to deep learning-based semantic segmentation approaches [35]. Although object-based image analysis (OBIA) methods incorporate spatial and contextual information by grouping pixels into homogeneous objects, they still rely heavily on accurate image segmentation. In arid regions, where land-cover types are highly fragmented and exhibit weak spectral contrast, OBIA methods often suffer from over-segmentation or under-segmentation, leading to unstable classification performance. Among current deep learning-based remote sensing image classification methods, semantic segmentation models have evolved from early stage-based frameworks that combine CNNs with shallow classifiers like Support Vector Machine (SVM) and Multilayer Perceptron (MLP), to end-to-end deep encoder–decoder architectures [36,37].

In recent years, Transformer architectures have been gradually introduced into the field of semantic segmentation due to their strong capability in global context modeling [38]. Vision Transformer (ViT) [29] first divides an image into a sequence of image patches for processing, successfully adapting the standard Transformer architecture to vision tasks and enabling improved transfer performance in downstream tasks. Building on the ViT framework, Swin Transformer [39] employs a shifted window mechanism and a hierarchical downsampling strategy to balance global contextual modeling and local feature extraction while significantly reducing computational cost. SegFormer [40] integrates a hierarchical Transformer encoder with a lightweight multilayer perceptron decoder. By avoiding complex architectural designs, it achieves high performance while significantly reducing model complexity and computational overhead, providing an efficient and powerful solution for semantic segmentation of high-resolution remote sensing images.

Benefiting from the ability of Transformers to model global context and long-range dependencies, numerous hybrid approaches that integrate Transformers with CNN-based methods have been proposed, marking a new stage in the development of segmentation models [41]. TransUNet [42] combines Transformers with UNet, encoding strong global context by treating image features as sequences while effectively utilizing underlying CNN features through its U-shaped hybrid architecture. UNetFormer [43] employs a CNN-based encoder and a Transformer-based decoder for efficient semantic segmentation of remote sensing urban scene images. More recently, STransU2Net [44] further enhances global–local feature interaction by introducing Transformer modules into multi-scale CNN frameworks, improving segmentation performance for high-resolution and complex scenes. TranSegNet [45] integrates convolutional feature extraction with Transformer-based contextual modeling to achieve better generalization and accuracy.

Although these hybrid methods have achieved promising results, the continuous deepening of feature extraction networks inevitably leads to the attenuation of shallow spatial features and fine-scale land-cover details. This phenomenon results in an information bottleneck that limits segmentation accuracy, especially for land cover categories with fragmented distributions and blurred boundaries in complex environments such as arid and semi-arid regions [46,47,48].

3. Methods

The overall framework of the proposed SDS-Former is illustrated in Figure 1. The model is built upon SegFormer and mainly consists of a hierarchical Transformer encoder and a lightweight decoder. The encoder uses a Mix Transformer (MiT) as the backbone to extract multi-scale features from the input remote sensing images, producing feature maps at four different stages. As the stages progress, the spatial resolution of the feature maps gradually decreases, while the level of semantic abstraction progressively increases. Unlike the traditional Vision Transformer, which produces single-scale feature representations, MiT adopts a hierarchical architecture to generate multi-scale features without relying on explicit positional encoding. This design enables the extraction of fine-grained spatial details at high-resolution stages and high-level semantic information at low-resolution stages, making it more suitable for semantic segmentation tasks. In the decoding stage, three complementary modules are introduced: a Dynamic Selective Feature Fusion (DSFF) module, a Lightweight Semantic Enhancement (LSE) module for global context enhancement, and a Feature Refinement Head (FRH). Each module is described in detail below.

3.1. Dynamic Selective Feature Fusion

Multiscale feature fusion is essential for semantic segmentation, especially in arid regions. In these areas, land cover types often exhibit fragmented distributions and blurred boundaries [49]. Shallow features provide fine spatial details, while deep features contain rich high-level semantic information. Traditional decoders usually fuse features from different encoder stages by simple linear addition. This fixed weighting cannot adapt to the varying scales of land cover and the unequal contribution of features in arid regions [50]. To address this limitation, we design a Dynamic Selective Feature Fusion (DSFF) module, as shown in Figure 2. It adaptively weights features across different stages, allowing the network to adjust the fusion strategy based on the input data.

Let the upsampled deep feature and the shallow feature be denoted as $F_{h} \in \mathbb{R}^{C \times H \times W}$ and $F_{l} \in \mathbb{R}^{C_{l} \times H_{l} \times W_{l}}$, respectively. We apply a 1 × 1 convolution to the shallow feature to match its channel dimension with that of the deep feature:

$$\tilde{F}_{l} = \varphi(F_{l}) \tag{1}$$

where $\varphi(\cdot)$ denotes the channel alignment operation.

To explicitly model the relative importance of features at different scales during fusion, a set of learnable scale weight parameters is introduced as:

$$W = [w_{1}, w_{2}] \tag{2}$$

where $w_{1}$ and $w_{2}$ correspond to the contribution weights of the shallow and deep features, respectively. The learnable scale weights are initialized using a constant initialization strategy with equal values, leading to uniform feature fusion at the early stage of training. This design ensures a stable starting point and allows the model to adaptively learn the relative importance of multi-scale features during training. To ensure non-negativity and numerical stability, the weights are first constrained by a ReLU function and then normalized, as follows:

$$\alpha_{i} = \frac{\mathrm{ReLU}(w_{i})}{\sum_{j=1}^{2} \mathrm{ReLU}(w_{j}) + \varepsilon}, \quad i \in \{1, 2\} \tag{3}$$

where $\varepsilon$ is a small constant added to avoid division by zero and ensure numerical stability. In this study, $\varepsilon$ is empirically set to $1 \times 10^{-6}$, which is sufficiently small to prevent numerical instability without affecting the weighting results.

Subsequently, the features are fused using weighted summation to achieve scale-aware feature fusion, denoted as:

$$F = \alpha_{1} \cdot \tilde{F}_{l} + \alpha_{2} \cdot F_{h} \tag{4}$$

Finally, the fused features are processed by a 3 × 3 convolution, Batch Normalization (BN), and ReLU to further enhance local details and stabilize the feature distribution as:

$$F_{out} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(F))) \tag{5}$$
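The weight normalization and weighted summation in Equations (3) and (4) can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation: flat lists stand in for feature maps that already share the same shape, the function names are hypothetical, and the 1 × 1 and 3 × 3 convolutions of Equations (1) and (5) are omitted.

```python
def normalized_fusion_weights(w, eps=1e-6):
    """Eq. (3): ReLU followed by sum-normalization of the learnable scale weights."""
    relu = [max(0.0, wi) for wi in w]
    total = sum(relu) + eps
    return [r / total for r in relu]


def fuse(shallow, deep, w, eps=1e-6):
    """Eq. (4): weighted sum of the channel-aligned shallow feature and the deep feature.

    `shallow` and `deep` are flat lists of equal length standing in for
    feature maps of identical shape (an assumption of this sketch).
    """
    a1, a2 = normalized_fusion_weights(w, eps)
    return [a1 * s + a2 * d for s, d in zip(shallow, deep)]


# Equal initial weights give a (near) uniform average of the two inputs.
alphas = normalized_fusion_weights([1.0, 1.0])
fused = fuse([2.0, 4.0], [0.0, 0.0], [1.0, 1.0])

# A negative raw weight is clipped by ReLU, so the other branch takes over.
clipped = normalized_fusion_weights([-0.5, 2.0])
```

Because ε is added only to the denominator, the normalized weights sum to slightly less than one; with ε = 1 × 10⁻⁶ the deviation is negligible.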

3.2. Lightweight Semantic Enhancement Module

In remote sensing images of arid regions, land cover types such as bare land, Gobi, and desert often exhibit sparse texture and weak spectral differences. Relying only on local convolution operations can easily lead to semantic inconsistency and class confusion [51]. Although Transformer encoders have strong capability in modeling long-range dependencies, the self-attention mechanism is simplified to MLP layers in the decoder. As a result, global information may be weakened during decoding. To address this issue, we introduce an SSM-inspired Lightweight Semantic Enhancement (LSE) module to enhance global semantics without introducing complex sequence modeling structures, as shown in Figure 3.

We denote the deep feature fed into the decoder as $F_{1} \in \mathbb{R}^{C \times H \times W}$, where C, H, and W represent the number of channels, height, and width of the feature map, respectively.

In LSE, depthwise separable convolution is used for local state updates to model spatial dependencies among neighboring pixels, as follows:

$$F_{local} = \mathrm{DWConv}_{3 \times 3}(F_{1}) \tag{6}$$

where $\mathrm{DWConv}_{3 \times 3}$ denotes a depthwise separable convolution with a kernel size of 3 × 3 and C groups. This operation enlarges the effective receptive field without introducing cross-channel computation, enabling efficient capture of local spatial context.

Subsequently, a global gating mechanism is applied to modulate the feature states, denoted as:

$$G = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{GAP}(F_{1}))) \tag{7}$$

where $\mathrm{GAP}(\cdot)$ denotes the global average pooling operation, and $\sigma(\cdot)$ represents the Sigmoid activation function. The gating path first aggregates global spatial information through GAP to obtain a feature vector $Z$. Then, a 1 × 1 convolution is applied to learn nonlinear interactions across channels. Finally, the Sigmoid function generates channel weights $G$ ranging from 0 to 1.

Finally, a residual connection is employed to ensure stable gradient flow and preserve the original input information, preventing the loss of critical features during optimization, as follows:

$$Y = F_{1} + F_{local} \odot G \tag{8}$$

where $\odot$ denotes element-wise multiplication along the channel dimension. By multiplying the local features $F_{local}$ with the global weights $G$, adaptive fusion of local details and global context is achieved. Important channel features are enhanced, while less relevant ones are suppressed.
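The data flow of Equations (6) to (8) can be traced with the following plain-Python sketch. It is illustrative only and rests on several assumptions: feature maps are nested lists indexed as `[channel][row][column]`, the local branch is a per-channel (depthwise) 3 × 3 convolution with zero padding, the 1 × 1 convolution on the pooled vector is written as a C × C matrix with a bias, and all function names are hypothetical.

```python
import math


def depthwise_conv3x3(feat, kernels):
    """Eq. (6): per-channel 3x3 convolution with zero padding (one kernel per channel)."""
    out = []
    for ch, k in zip(feat, kernels):
        h, w = len(ch), len(ch[0])
        res = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += k[dy + 1][dx + 1] * ch[yy][xx]
                res[y][x] = acc
        out.append(res)
    return out


def channel_gate(feat, weight, bias):
    """Eq. (7): GAP, then a 1x1 convolution on the pooled vector, then Sigmoid."""
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    logits = [sum(wij * zj for wij, zj in zip(row, z)) + b
              for row, b in zip(weight, bias)]
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


def lse(feat, kernels, weight, bias):
    """Eq. (8): Y = F_1 + F_local * G, with G broadcast over spatial positions."""
    local = depthwise_conv3x3(feat, kernels)
    gate = channel_gate(feat, weight, bias)
    return [
        [[f + l * g for f, l in zip(frow, lrow)] for frow, lrow in zip(fch, lch)]
        for fch, lch, g in zip(feat, local, gate)
    ]


# Toy check: identity kernels and a zero gate projection give G = 0.5 per channel,
# so the output is 1.5 times the input.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 2.0]]]
y = lse(feat, [identity, identity], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

The residual term keeps the input intact even when the gate saturates near zero, which is the stabilizing effect described above.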

3.3. Feature Refinement Head

In remote sensing semantic segmentation, shallow encoder layers provide high-resolution features with rich spatial details. However, they also suffer from semantic ambiguity and noise interference. Directly using these features for classification may lead to coarse boundaries and the loss of small objects. Therefore, we design a Feature Refinement Head (FRH) inspired by [52]. This module adopts both spatial and channel attention mechanisms to further enhance the fused features, thereby improving the representation of small land cover objects.

The feature refinement head mainly consists of three core components: a spatial attention branch, a channel attention branch, and a residual shortcut connection, as shown in Figure 4.

Let $F_{T} \in \mathbb{R}^{C \times H \times W}$ denote the feature obtained after multi-scale fusion in the decoder. The spatial attention branch processes the input feature using depthwise separable convolution, as follows:

$$A_{p} = \sigma(\mathrm{DWConv}_{3 \times 3}(F_{T})) \tag{9}$$

where $A_{p} \in \mathbb{R}^{C \times H \times W}$ represents adaptive weights for different spatial locations, and $\sigma$ denotes the Sigmoid activation function. This branch captures spatial contextual relationships through convolution operations, effectively enhancing boundary awareness.

The channel attention branch focuses on the relative importance of different semantic channels in the classification task. First, global average pooling is applied to extract statistical information from each channel as:

$$z = \mathrm{GAP}(F_{T}) \tag{10}$$

Subsequently, two 1 × 1 convolution layers are used to construct a lightweight channel mapping, as follows:

$$A_{c} = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{ReLU}(\mathrm{Conv}_{1 \times 1}(z)))) \tag{11}$$

where the intermediate channel dimension is reduced from C to C/8, decreasing computation and strengthening nonlinear channel interactions. To avoid potential feature degradation caused by attention modulation, a residual shortcut connection is introduced as:

$$X_{s} = \mathrm{Conv}_{1 \times 1}(F_{T}) \tag{12}$$

This branch performs channel alignment while preserving the original feature information, which helps stabilize the training process. The final output of the feature refinement head is given as:

$$X = X_{s} + A_{p} \odot F_{T} + A_{c} \odot F_{T} \tag{13}$$

where $\odot$ denotes element-wise multiplication.
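The channel attention branch (Equations (10) and (11)) and the final combination (Equation (13)) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' code: each channel is a flat list of H·W values, the two 1 × 1 convolutions are written as bias-free matrices, the spatial attention map of Equation (9) is supplied as a hypothetical mask instead of being computed, and the shortcut of Equation (12) is replaced by the identity.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def channel_attention(feat, w_reduce, w_expand):
    """Eqs. (10)-(11): GAP, then two 1x1 convolutions (C -> C/8 -> C).

    `feat` is a list of C channels, each a flat list of H*W values.
    Biases are omitted for brevity (an assumption of this sketch).
    """
    z = [sum(ch) / len(ch) for ch in feat]                 # Eq. (10)
    hidden = [max(0.0, h) for h in matvec(w_reduce, z)]    # ReLU on C/8 values
    return [sigmoid(a) for a in matvec(w_expand, hidden)]  # Eq. (11)


def refine(feat, shortcut, spatial_att, channel_att):
    """Eq. (13): X = X_s + A_p * F_T + A_c * F_T, element-wise.

    `spatial_att` has the same shape as `feat`; `channel_att` holds one
    weight per channel and is broadcast over all spatial positions.
    """
    return [
        [xs + ap * f + ac * f for xs, ap, f in zip(s_ch, p_ch, f_ch)]
        for s_ch, p_ch, f_ch, ac in zip(shortcut, spatial_att, feat, channel_att)
    ]


# Toy input: C = 8 channels and 2 spatial positions, so the bottleneck has C/8 = 1 unit.
C = 8
feat = [[float(c), float(c) + 2.0] for c in range(C)]
w_reduce = [[0.0] * C]                  # 1 x C projection, zero weights
w_expand = [[0.0] for _ in range(C)]    # C x 1 projection, zero weights
a_c = channel_attention(feat, w_reduce, w_expand)  # sigmoid(0) = 0.5 per channel
a_p = [[1.0, 0.0] for _ in range(C)]               # hypothetical spatial mask
out = refine(feat, feat, a_p, a_c)                 # identity shortcut for illustration
```

With these toy values, a position where the spatial mask is 1 receives 2.5 times its input, while a position where the mask is 0 still receives 1.5 times its input through the shortcut and channel branches.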

3.4. Loss Function

To address common challenges in land cover remote sensing images of arid regions, such as class imbalance, large background areas, and blurred object boundaries, we combine cross-entropy loss [53], Focal loss [54], and Dice loss [55] to construct a composite loss function for comprehensive model optimization. Cross-entropy loss is included as the standard pixel-wise classification objective, providing stable gradients for the dominant classes. Focal loss can effectively mitigate class imbalance by assigning higher weights to hard-to-classify samples, thereby improving model performance. Dice loss helps improve the overall consistency and integrity of segmentation results. The overall loss function is defined as follows:

$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c}) \tag{14}$$

$$L_{Focal} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \alpha_{c} (1 - p_{i,c})^{\gamma} y_{i,c} \log(p_{i,c}) \tag{15}$$

$$L_{Dice} = 1 - \frac{2}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \frac{p_{i,c} y_{i,c}}{p_{i,c} + y_{i,c} + \varepsilon} \tag{16}$$

$$L = W_{CE} L_{CE} + W_{Focal} L_{Focal} + W_{Dice} L_{Dice} \tag{17}$$

where $N$ denotes the total number of pixels in the image, $C$ is the number of classes, $y_{i,c}$ represents the ground truth label (1 if pixel $i$ belongs to class $c$ and 0 otherwise), $p_{i,c}$ is the predicted probability that pixel $i$ belongs to class $c$, $\gamma$ is the focusing parameter, $\alpha_{c}$ is the class balancing parameter, $\varepsilon$ prevents division by zero, and $W_{CE}$, $W_{Focal}$, and $W_{Dice}$ are the weighting coefficients for the respective losses.
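As a worked illustration of Equations (14) to (17), the sketch below evaluates the composite loss on a handful of pixels in plain Python. Several choices are assumptions, since the text does not specify them: the default γ = 2, unit loss weights, the placement of ε in the Dice denominator, the clamping of probabilities inside the logarithm, and the function name itself.

```python
import math


def composite_loss(probs, labels, alpha, gamma=2.0,
                   w_ce=1.0, w_focal=1.0, w_dice=1.0, eps=1e-6):
    """Eqs. (14)-(17) for N pixels and C classes.

    probs[i][c] : predicted probability of class c at pixel i
    labels[i]   : integer ground-truth class of pixel i (one-hot y is implied)
    alpha[c]    : class balancing factor of the Focal loss
    gamma and the three loss weights are placeholder defaults, not reported values.
    """
    n = len(probs)
    ce = focal = dice_sum = 0.0
    for p_row, target in zip(probs, labels):
        for c, p in enumerate(p_row):
            y = 1.0 if c == target else 0.0
            if y:
                log_p = math.log(max(p, eps))  # clamp to avoid log(0)
                ce -= log_p
                focal -= alpha[c] * (1.0 - p) ** gamma * log_p
            dice_sum += p * y / (p + y + eps)
    ce /= n
    focal /= n
    dice = 1.0 - 2.0 * dice_sum / n
    total = w_ce * ce + w_focal * focal + w_dice * dice
    return total, ce, focal, dice


# Two pixels of class 0: one uncertain (p = 0.5) and one confident (p = 0.9).
total, ce, focal, dice = composite_loss(
    [[0.5, 0.5], [0.9, 0.1]], [0, 0], alpha=[1.0, 1.0])

# A perfect prediction drives all three terms to (almost) zero.
perfect_total = composite_loss([[1.0, 0.0]], [0], alpha=[1.0, 1.0])[0]
```

In this example the Focal term is dominated by the uncertain pixel, because the factor (1 − p)^γ shrinks the contribution of the confident pixel from 1 to 0.01 of its cross-entropy value.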

4. Experiments and Results

4.1. Study Area and Datasets

The main study area is the region surrounding the Tarim Basin, located in the southern part of Xinjiang Uygur Autonomous Region, China, and enclosed by the Tianshan and Kunlun Mountains, as shown in Figure 5. The area is far from the ocean, and the surrounding high mountains block maritime airflow, resulting in a dry climate with scarce precipitation, making it one of China’s extremely arid regions [56,57,58]. The terrain of the Tarim Basin exhibits a ring-shaped distribution, with higher elevations in the west and lower elevations in the east. The central part is dominated by the well-known Taklamakan Desert, while oases, farmland, and towns are mainly distributed along the basin margins. Consequently, the land cover types in the region are diverse, including desert, Gobi, oases, cropland, urban areas, and water bodies.

We constructed a semantic segmentation dataset based on 0.59 m high-resolution remote sensing images provided by the Mapbox data source. The dataset consists of RGB images with three spectral bands (Red, Green, and Blue). The typical landform features of the Cele-Yutian region were precisely delineated through professional manual visual interpretation. Specifically, the annotation was conducted using ArcGIS 10.6 in a vector-based format, where typical land-cover categories were manually annotated. The labeled data were then converted into raster format through polygon-to-raster transformation and reclassification to obtain pixel-level semantic labels. Subsequently, the processed labels were exported using ArcGIS Pro for deep learning model training. The dataset includes seven classes: farmland, desert, vegetation, Gobi, water, bare area, and building, as shown in Figure 6. However, the class distribution is highly imbalanced, with desert and farmland occupying the majority of pixels, while vegetation and water bodies account for a small proportion, as shown in Figure 7. For model training and evaluation, the dataset was split into training, validation, and test sets at a ratio of 6:2:2, containing 10,292, 3430, and 3430 images, respectively, each with a resolution of 256 × 256 pixels. During training and validation, only the training and validation sets were used for model optimization, while the test set was reserved solely for performance evaluation.

4.2. Evaluation Metrics

We selected three metrics to quantitatively evaluate the effectiveness of the proposed model: pixel accuracy (PA), mean intersection over union (mIoU), and F1 score (F1), which are defined as follows:

$$\mathrm{PA} = \frac{TP + TN}{TP + TN + FP + FN} \tag{18}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{19}$$

$$\mathrm{mIoU} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{IoU}_{n} \tag{20}$$

$$F_{1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{21}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{22}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{23}$$

where TP denotes the number of pixels correctly predicted as a certain class by the model, FP denotes the number of pixels incorrectly predicted as that class but belonging to other classes, TN denotes the number of pixels correctly predicted as not belonging to the target class, FN denotes the number of pixels belonging to the target class but incorrectly predicted as other classes, and N in Equation (20) denotes the number of land-cover classes.
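The metrics in Equations (18) to (23) can be computed from a confusion matrix, as in the following plain-Python sketch. The matrix layout `conf[true][pred]`, the function names, and the macro-averaging over classes are assumptions made for illustration; the toy counts are not taken from the experiments.

```python
def per_class_metrics(conf):
    """Eqs. (18)-(23) from a confusion matrix conf[true][pred] of pixel counts."""
    k = len(conf)
    total = sum(map(sum, conf))
    rows = []
    for c in range(k):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(k)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({
            "PA": (tp + tn) / total,
            "IoU": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
            "F1": f1,
        })
    return rows


def mean_metric(rows, key):
    """Average a per-class metric over all classes, e.g. mIoU in Eq. (20)."""
    return sum(r[key] for r in rows) / len(rows)


# Toy two-class example with 20 pixels.
conf = [[8, 2],
        [1, 9]]
rows = per_class_metrics(conf)
miou = mean_metric(rows, "IoU")
```

For the first class of this toy matrix, TP = 8, FN = 2, FP = 1, and TN = 9, which gives PA = 17/20, IoU = 8/11, and F1 = 16/19.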

4.3. Experimental Details

We trained our model on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory, and all experiments were implemented using the PyTorch 2.7.1 deep learning framework. During training, data augmentation was applied, including random resizing, random horizontal flipping, and random cropping. The model was optimized using the AdamW optimizer with a learning rate of 0.00012, momentum of 0.9, and weight decay of 0.01. A cosine annealing schedule was employed for learning rate decay.
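For reference, a cosine annealing schedule of the kind mentioned above can be written as a short plain-Python function. Only the base learning rate of 0.00012 comes from the text; the total number of steps, the minimum learning rate of zero, and the absence of warm-up are assumptions of this sketch.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=0.00012, min_lr=0.0):
    """lr(t) = min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * t / T)).

    `total_steps` and `min_lr` are hypothetical settings, not reported values.
    """
    t = min(max(step, 0), total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / total_steps))


# Learning rate at the start, midpoint, and end of a hypothetical 1000-step run.
lr_start = cosine_annealing_lr(0, 1000)
lr_mid = cosine_annealing_lr(500, 1000)
lr_end = cosine_annealing_lr(1000, 1000)
```

The schedule starts at the base learning rate, passes through half of it at the midpoint, and decays smoothly to the minimum value at the final step.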

4.4. Ablation Study

The proposed SDS-Former model consists of three main modules: DSFF, LSE, and FRH. To evaluate their effectiveness, we conducted ablation experiments by adding each of these modules to the baseline. Based on the experimental results, the ablation studies are presented in the order shown in Table 1, and the corresponding visual results are illustrated in Figure 8.

Under the baseline configuration without any additional modules, the model achieved an mIoU of only 76.81%, with PA and F1 scores of 83.21% and 82.93%, as shown in Figure 8d. The baseline model exhibited blurred boundaries and local misclassifications for spectrally similar classes such as bare land, desert, and farmland in complex backgrounds, and it showed incomplete recognition for small-scale classes such as water bodies, buildings, and farmland. When only the DSFF was added, the mIoU increased to 78.24%, and the F1 score improved to 83.14%, as shown in Figure 8e, indicating that this module effectively balances shallow spatial details with deep semantic information. When only the LSE was added, the mIoU increased by only 0.29%, but the PA and F1 scores improved more noticeably, reaching 84.37% and 83.86%. As shown in Figure 8f, the recognition of linear water bodies and farmland remained limited. This demonstrates that LSE primarily enhances model performance by improving global semantic consistency. With the introduction of the FRH, the mIoU increased to 78.75%, highlighting the effectiveness of FRH in optimizing high-resolution features, as shown in Figure 8g.

To further analyze the synergistic effects among different modules, we evaluated three dual-module combinations. When DSFF and LSE are introduced, the model achieves significant improvements in mIoU, PA, and F1 score, reaching 80.63%, 86.25%, and 85.16%, respectively, demonstrating the complementary nature of multi-scale feature fusion and global semantic enhancement. However, as shown in Figure 8h, confusion still occurs in transition areas between desert and farmland, and the recognition of small-scale objects remains limited. The combination of DSFF and FRH achieves 81.31% mIoU, 86.92% PA, and 85.64% F1 score, indicating that the integration of adaptive feature fusion and spatial–channel refinement effectively enhances boundary delineation and small-object recognition. Similarly, the combination of LSE and FRH achieves 79.26% mIoU, 84.65% PA, and 84.38% F1 score, suggesting that the integration of global semantic enhancement and feature refinement provides complementary advantages for land-cover classification. Finally, the complete SDS-Former model, incorporating DSFF, LSE, and FRH, achieves the best performance, with an mIoU of 82.51%, PA of 87.54%, and F1 score of 86.47%. As shown in Figure 8c, the segmentation results of the full model are closer to the ground truth, demonstrating that the joint modeling of DSFF, LSE, and FRH effectively improves segmentation accuracy for typical land-cover types in arid regions.

4.5. Performance Comparison

To comprehensively evaluate the effectiveness of the proposed SDS-Former, we compare it with several state-of-the-art semantic segmentation models belonging to three different architectural categories: (1) CNN-based methods, including UNet [20] and DeepLabV3+ [23]; (2) Transformer-based methods, including SETR [59], Segmenter [60], and SegFormer [40]; and (3) hybrid CNN–Transformer methods, represented by TransUNet [42]. To ensure fair and comparable results, all models are trained using the same parameter settings as reported in their original papers.

As shown in Table 2, the proposed SDS-Former demonstrates notable improvements across multiple land-cover categories. In particular, SDS-Former shows clear advantages in the Gobi and Desert categories, achieving PA values of 84.36% and 93.53%, respectively. Compared with TransUNet, the proposed method improves performance by 4.33% and 3.16%. When compared with SegFormer, the improvements further increase to 8.84% and 3.35%, highlighting its superior capability in distinguishing spectrally similar land-cover types. However, SDS-Former performs slightly worse than Segmenter on the water class. This is mainly because water bodies in the Tarim Basin are characterized by fragmented distribution, small spatial scale, and irregular shapes. These results indicate that SDS-Former still has room for improvement in detecting extremely small targets.

In terms of overall performance, CNN-based models show limited capability in complex scenes, especially when land-cover types exhibit high spectral similarity and strong background interference, which leads to frequent misclassification. Transformer-based methods demonstrate advantages in large-scale semantic representation, but they still suffer from blurred boundaries when segmenting small-scale or structurally complex objects. The proposed SDS-Former achieves an mIoU of 82.51% and an F1 score of 86.47% on the Tarim Basin dataset. Compared with the baseline SegFormer, SDS-Former improves mIoU by 5.70 percentage points and F1 score by 3.54 percentage points. These quantitative results indicate that SDS-Former has strong potential for remote sensing semantic segmentation applications.

4.6. Visualization Results

We further evaluate the proposed model through qualitative visual comparisons to demonstrate its practical effectiveness on the Tarim Basin dataset. As shown in Figure 9, SDS-Former achieves superior segmentation performance compared with other methods. Vegetation in the Tarim Basin is typically distributed in narrow strips or fragmented patches and shows strong spatial discontinuity. In Figure 9a,b, UNet and DeepLabV3+ fail to capture the complete vegetation regions and only detect partial areas. Segmenter and TransUNet can identify most vegetation regions, but they still suffer from missed detections and false predictions. In contrast, SDS-Former is able to identify the spatial distribution of vegetation more comprehensively.

Buildings in the study area are small and highly clustered. In Figure 9a, UNet, DeepLabV3+, and SETR produce obvious misclassifications in building regions. Segmenter and TransUNet improve results locally but still fail to preserve the integrity of dense building regions. By comparison, SDS-Former delineates building boundaries more stably and accurately.

For water and farmland, all models perform well overall, especially where objects are large and regularly shaped. At river boundaries and in shadow-affected areas, however, every method, including SDS-Former, still shows some boundary ambiguity and local misclassification (Figure 9b). DeepLabV3+, SegFormer, SETR, and Segmenter also produce fragmented predictions or misclassify farmland patches as surrounding land-cover types in some areas. For farmland, SDS-Former preserves the integrity of large parcels while better maintaining the structure of small patches, which reduces the fragmentation and misclassification that other models show in scattered farmland regions. For water, SDS-Former identifies water bodies about as accurately as the other models and is more stable in shadow-affected regions.

For large-scale, continuously distributed land-cover types such as desert, Gobi, and bare land, segmentation results vary significantly across models. UNet relies mainly on local convolutional features and therefore tends to confuse these spectrally similar types; in some cases, it even misclassifies bare land as farmland. SegFormer and Segmenter show relatively weak performance in Gobi recognition and produce noticeable misclassifications in the transition zones between desert and Gobi. In contrast, SDS-Former delineates the boundaries among desert, Gobi, and bare land more clearly. It also maintains more stable segmentation results for large-scale and continuously distributed land-cover regions, as shown in Figure 9c,d.

As shown in Figure 9, the visualization results reveal a clear advantage of SDS-Former in complex scenes dominated by Gobi, desert, and bare land. SDS-Former achieves more accurate classification in these areas, whereas other methods tend to produce noticeable misclassifications. Benefiting from the global context modeling capability of LSE, SDS-Former also generates clearer boundaries and more regular geometric structures for farmland and vegetation. Overall, the experimental results demonstrate that SDS-Former consistently outperforms conventional segmentation methods in multi-scale remote sensing analysis, boundary delineation, and complex scene segmentation. This further validates the effectiveness of SDS-Former in semantic segmentation of land cover in arid regions.

4.7. Model Complexity Analysis

To further evaluate the efficiency of the proposed model, we compare SDS-Former with the baseline SegFormer in terms of floating point operations (FLOPs) and model parameters (Params). FLOPs measure computational cost, while the parameter count reflects the size of the network. Both models are evaluated at the same input resolution (256 × 256) to ensure a fair comparison.

As shown in Table 3, SDS-Former has a comparable number of parameters to SegFormer (26.39 M versus 27.35 M) while reducing the computational cost from 14.19 G to 7.70 G FLOPs, a reduction of roughly 46%. This indicates that the proposed model improves computational efficiency without increasing model size.
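
As a quick check of the efficiency claim, the relative change in each quantity can be computed from the Table 3 values. The snippet below is plain arithmetic; the helper name `relative_change` is illustrative.

```python
def relative_change(baseline: float, proposed: float) -> float:
    """Signed relative change of `proposed` with respect to `baseline`, in percent."""
    return (proposed - baseline) / baseline * 100.0


# Values copied from Table 3 (256 x 256 input).
flops_change = relative_change(14.19, 7.70)    # FLOPs in G
params_change = relative_change(27.35, 26.39)  # parameters in M

print(f"FLOPs:  {flops_change:+.1f}%")
print(f"Params: {params_change:+.1f}%")
```

With these numbers, FLOPs fall by about 45.7% while the parameter count falls by about 3.5%, so the saving comes almost entirely from computation rather than from a smaller model.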

5. Discussion

By integrating the Dynamic Selective Feature Fusion (DSFF) module, the SSM-inspired Lightweight Semantic Enhancement (LSE) module, and the Feature Refinement Head (FRH), we develop a semantic segmentation framework named SDS-Former for complex land-cover scenes in arid regions. Under a unified Transformer architecture, the proposed model effectively combines multi-scale semantic information with local spatial details, leading to stable and consistent performance gains across multi-class and multi-scale segmentation tasks. Unlike conventional Transformer models that rely solely on self-attention, SDS-Former uses LSE to enhance global semantic consistency for large-scale continuous land-cover types. Meanwhile, DSFF adaptively assigns weights to features at different scales, strengthening critical scale information while preserving overall structural integrity and fine-grained details.
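
To make the idea of adaptive scale weighting concrete, the sketch below fuses two feature maps with softmax gates computed from their globally pooled responses. It is a generic selective-fusion pattern written for illustration only and is not the DSFF module itself, whose design is given in Section 3.1. As simplifying assumptions, the gate has no learned parameters, both inputs already share the same shape, feature maps are small nested lists in channel x height x width order, and all names are hypothetical.

```python
import math
from typing import List

Feature = List[List[List[float]]]  # channels x height x width


def channel_means(feat: Feature) -> List[float]:
    """Global average pooling: one scalar descriptor per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]


def selective_fuse(high: Feature, low: Feature) -> Feature:
    """Fuse two same-shaped feature maps with per-channel softmax weights.

    For each channel, the pooled responses of the two branches act as logits;
    a softmax over the two branches yields weights that sum to 1.
    """
    fused = []
    for ch_h, ch_l, m_h, m_l in zip(high, low, channel_means(high), channel_means(low)):
        top = max(m_h, m_l)  # subtract the max for numerical stability
        e_h, e_l = math.exp(m_h - top), math.exp(m_l - top)
        w_h, w_l = e_h / (e_h + e_l), e_l / (e_h + e_l)
        fused.append([[w_h * a + w_l * b for a, b in zip(r_h, r_l)]
                      for r_h, r_l in zip(ch_h, ch_l)])
    return fused


# One channel, 2 x 2 maps with equal pooled responses, so each branch gets weight 0.5.
high = [[[1.0, 3.0], [3.0, 1.0]]]
low = [[[2.0, 2.0], [2.0, 2.0]]]
print(selective_fuse(high, low))
```

In a trained network the logits would normally come from learned layers rather than raw pooled means, and the coarser feature map would be upsampled before fusion; both steps are omitted here to keep the weighting mechanism visible.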

The proposed SDS-Former shows strong potential for large-scale remote sensing tasks, including land use and land cover mapping, desertification monitoring, and ecological assessment. Its improved accuracy and robustness enhance the reliability of geospatial analysis and provide valuable support for environmental management, sustainable land use planning, and ecological protection in arid and semi-arid regions.

Despite these contributions, several limitations remain and require further investigation. First, the current model is still built mainly on a Transformer backbone, and deeper integration between CNNs and state space models has not been fully explored. Second, the performance of SDS-Former on challenging cases, such as distinguishing water bodies from shadows and accurately detecting small objects like buildings, needs further improvement. Third, the experimental validation is limited to a single dataset, and more extensive evaluations across different geographic regions are necessary. Fourth, the training dataset relies on labels produced manually through visual interpretation by professional annotators. Although this approach ensures high-quality ground truth for complex remote sensing scenes, it may introduce a certain degree of subjectivity and limit the scalability of data preparation.

Future work will address these limitations in four aspects. (1) More advanced state space model variants, such as Mamba, will be introduced into diverse multimodal remote sensing tasks. For large-scale models, more efficient fine-tuning strategies will be explored, since foundation models usually require substantial memory resources. (2) Task-specific object refinement modules will be designed to further improve boundary delineation and recognition performance for difficult land-cover types. (3) The generalization ability of the proposed method will be examined on different remote sensing datasets, including multispectral, LiDAR, and SAR imagery. (4) Semi-automatic and active learning-based annotation strategies will be investigated to reduce human effort and improve scalability while maintaining annotation quality.

6. Conclusions

In this study, we designed SDS-Former for land use and land cover semantic segmentation in arid regions. The proposed method integrates a Dynamic Selective Feature Fusion module, an SSM-inspired Lightweight Semantic Enhancement module, and a Feature Refinement Head into the decoding stage, which enhances global semantic consistency and local detail representation without significantly increasing computational complexity. Experimental results on the Tarim Basin dataset demonstrate that SDS-Former outperforms representative CNN-based models, Transformer-based models, and hybrid segmentation approaches in terms of mIoU, F1 score, and PA for most categories, enabling more accurate delineation of land-cover boundaries and fine-scale structural details. Further qualitative and quantitative analyses show that SDS-Former recognizes small-scale and sparsely distributed land-cover types, such as vegetation and farmland, more completely, and exhibits stronger robustness in complex scenes involving large-scale and continuously distributed categories, including desert, Gobi, and bare land.

In addition, a high-resolution remote sensing image dataset with severe class imbalance was constructed to support complex land-cover segmentation tasks. This dataset is specifically designed for arid regions and provides a reliable basis for model training and performance evaluation.

Overall, SDS-Former achieves stable segmentation performance in complex scenes with severe class imbalance and spectrally similar land-cover types, providing a new approach for fine-scale identification and analysis of land-cover types in arid regions. Furthermore, the proposed method may provide valuable support for sustainable land use planning, environmental protection, and policy-making in arid and semi-arid regions.

Author Contributions

Conceptualization, Y.D. and J.F.; methodology, Y.D. and J.F.; software, Y.D., K.L. and Y.L.; validation, Y.D. and K.L.; formal analysis, Y.D. and Y.L.; investigation, K.L. and Y.L.; resources, J.F.; data curation, Y.D. and Y.L.; writing—original draft preparation, Y.D. and J.F.; writing—review and editing, Y.D. and J.F.; visualization, Y.D.; supervision, J.F.; project administration, J.F.; funding acquisition, J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 42171413) and a grant from the State Key Laboratory of Resources and Environmental Information Systems.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy restrictions.

Acknowledgments

We would like to thank the State Key Laboratory of Resources and Environmental Information Systems for providing remote sensing images of the Circum-Tarim area. We also thank the editors and reviewers for their constructive comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Twisa, S.; Buchroithner, M.F. Land-Use and Land-Cover (LULC) Change Detection in Wami River Basin, Tanzania. Land 2019, 8, 136.
2. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep Learning in Environmental Remote Sensing: Achievements and Challenges. Remote Sens. Environ. 2020, 241, 111716.
3. Chen, Z.; Chen, J.; Zhou, C.; Li, Y. An Ecological Assessment Process Based on Integrated Remote Sensing Model: A Case from Kaikukang-Walagan District, Greater Khingan Range, China. Ecol. Inform. 2022, 70, 101699.
4. Li, D.; Wang, M.; Jiang, J. China’s High-Resolution Optical Remote Sensing Satellites and Their Mapping Applications. Geo-Spat. Inf. Sci. 2021, 24, 85–94.
5. Xu, Y.; Gong, J.; Huang, X.; Hu, X.; Li, J.; Li, Q.; Peng, M. Luojia-HSSR: A High Spatial-Spectral Resolution Remote Sensing Dataset for Land-Cover Classification with a New 3D-HRNet. Geo-Spat. Inf. Sci. 2023, 26, 289–301.
6. Marzougui, M.; Sampedro, G.A.; Almadhor, A.; Alsubai, S.; Al Hejaili, A.; Abbas, S. Deep Learning-Based Spatial Pattern Modeling for Land Use and Land Cover Classification Using Satellite Imagery. Meteorol. Appl. 2025, 32, e70064.
7. Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep-Learning-Based Semantic Segmentation of Remote Sensing Images: A Survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8370–8396.
8. Neupane, B.; Horanont, T.; Aryal, J. Deep Learning-Based Semantic Segmentation of Urban Features in Satellite Images: A Review and Meta-Analysis. Remote Sens. 2021, 13, 808.
9. Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic Segmentation Network with Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 905–909.
10. Wang, Z.; Yi, J.; Chen, A.; Chen, L.; Lin, H.; Xu, K. Accurate Semantic Segmentation of Very High-Resolution Remote Sensing Images Considering Feature State Sequences: From Benchmark Datasets to Urban Applications. ISPRS J. Photogramm. Remote Sens. 2025, 220, 824–840.
11. Fan, J.; Shi, Z.; Du, Y.; Zhuang, C. HR-MM Segformer: Enhancing Land Use and Land Cover Semantic Segmentation through Transformer-Based Multisource Remote Sensing Feature Fusion. Environ. Model. Softw. 2026, 197, 106848.
12. Zhang, W.; Li, W.; Zhang, C.; Li, X. Incorporating Spectral Similarity into Markov Chain Geostatistical Cosimulation for Reducing Smoothing Effect in Land Cover Postclassification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 1082–1095.
13. Fan, J.; Shi, Z.; Ren, Z.; Zhou, Y.; Ji, M. DDPM-SegFormer: Highly Refined Feature Land Use and Land Cover Segmentation with a Fused Denoising Diffusion Probabilistic Model and Transformer. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104093.
14. Du, H.; Li, M.; Xu, Y.; Zhou, C. An Ensemble Learning Approach for Land Use/Land Cover Classification of Arid Regions for Climate Simulation: A Case Study of Xinjiang, Northwest China. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2413–2426.
15. Ali, K.; Johnson, B.A. Land-Use and Land-Cover Classification in Semi-Arid Areas from Medium-Resolution Remote-Sensing Imagery: A Deep Learning Approach. Sensors 2022, 22, 8750.
16. Gaur, M.K.; Squires, V.R. Geographic Extent and Characteristics of the World’s Arid Zones and Their Peoples. In Climate Variability Impacts on Land Use and Livelihoods in Drylands; Gaur, M.K., Squires, V.R., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–20. ISBN 978-3-319-56681-8.
17. Rodríguez-Galiano, V.F.; Abarca-Hernández, F.; Ghimire, B.; Chica-Olmo, M.; Atkinson, P.M.; Jeganathan, C. Incorporating Spatial Variability Measures in Land-Cover Classification Using Random Forest. Procedia Environ. Sci. 2011, 3, 44–49.
18. Ul Din, S.; Mak, H.W.L. Retrieval of Land-Use/Land Cover Change (LUCC) Maps and Urban Expansion Dynamics of Hyderabad, Pakistan via Landsat Datasets and Support Vector Machine Framework. Remote Sens. 2021, 13, 3337.
19. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
20. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
21. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2016, arXiv:1412.7062.
22. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
23. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
24. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
25. Zhang, P.; Ke, Y.; Zhang, Z.; Wang, M.; Li, P.; Zhang, S. Urban Land Use and Land Cover Classification Using Novel Deep Learning Models Based on High Spatial Resolution Satellite Imagery. Sensors 2018, 18, 3717.
26. Clark, A.; Phinn, S.; Scarth, P. Optimised U-Net for Land Use–Land Cover Classification Using Aerial Photography. PFG J. Photogram. Remote Sens. Geoinf. Sci. 2023, 91, 125–147.
27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
28. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229.
29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
30. Li, X.; Li, Y.; Ai, J.; Shu, Z.; Xia, J.; Xia, Y. Semantic Segmentation of UAV Remote Sensing Images Based on Edge Feature Fusing and Multi-Level Upsampling Integrated with Deeplabv3+. PLoS ONE 2023, 18, e0279097.
31. Pan, Z.; Zhuang, B.; Liu, J.; He, H.; Cai, J. Scalable Vision Transformers with Hierarchical Pooling. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 377–386.
32. Hsu, C.-C.; Lee, C.-M.; Chou, Y.-S. DRCT: Saving Image Super-Resolution Away from Information Bottleneck. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; IEEE: New York, NY, USA, 2024; pp. 6133–6142.
33. Zhu, Z.; Zhang, Z.; Zuo, L.; Pan, T.; Zhao, X.; Wang, X.; Sun, F.; Xu, J.; Liu, Z. Study on the Classification and Change Detection Methods of Drylands in Arid and Semi-Arid Regions. Remote Sens. 2022, 14, 1256.
34. Mellor, A.; Boukir, S.; Haywood, A.; Jones, S. Exploring Issues of Training Data Imbalance and Mislabelling on Random Forest Performance for Large Area Land Cover Classification Using the Ensemble Margin. ISPRS J. Photogramm. Remote Sens. 2015, 105, 155–168.
35. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-Cover Classification with High-Resolution Remote Sensing Images Using Transferable Deep Models. Remote Sens. Environ. 2020, 237, 111322.
36. Sun, X.; Liu, L.; Li, C.; Yin, J.; Zhao, J.; Si, W. Classification for Remote Sensing Data with Improved CNN-SVM Method. IEEE Access 2019, 7, 164507–164516.
37. Zhang, C.; Pan, X.; Li, H.; Gardiner, A.; Sargent, I.; Hare, J.; Atkinson, P.M. A Hybrid MLP-CNN Classifier for Very Fine Resolution Remotely Sensed Image Classification. ISPRS J. Photogramm. Remote Sens. 2018, 140, 133–144.
38. Khan, M.; Hanan, A.; Kenzhebay, M.; Gazzea, M.; Arghandeh, R. Transformer-Based Land Use and Land Cover Classification with Explainability Using Satellite Imagery. Sci. Rep. 2024, 14, 16744.
39. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
40. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 12077–12090.
41. Mahendra, H.N.; Pushpalatha, V.; Mallikarjunaswamy, S.; Rama Subramoniam, S. A Hybrid Deep Learning Model Based on Vision Transformer and Convolutional Neural Networks for Land Use and Land Cover Classification. Appl. Soft Comput. 2026, 192, 114775.
42. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
43. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
44. Liu, G.; Diao, K.; Zhu, J.; Wang, Q.; Li, M. STransU2Net: Transformer Based Hybrid Model for Building Segmentation in Detailed Satellite Imagery. PLoS ONE 2024, 19, e0299732.
45. Ingole, R.V.; Giradkar, A.M.; Deshmukh, A.A.; Agrawal, R.; Dhule, C.; Morris, N.C. TranSegNet: A Hybrid Transformer Model for Satellite Imagery Segmentation with Performance Benchmark Against U-Net Variants. Procedia Comput. Sci. 2025, 258, 775–784.
46. Chen, N.; Yang, R.; Zhao, Y.; Dai, Q.; Wang, L. Remote Sensing Image Segmentation Network That Integrates Global–Local Multi-Scale Information with Deep and Shallow Features. Remote Sens. 2025, 17, 1880.
47. Wang, J.; Chen, T.; Zheng, L.; Tie, J.; Zhang, Y.; Chen, P.; Luo, Z.; Song, Q. A Multi-Scale Remote Sensing Semantic Segmentation Model with Boundary Enhancement Based on UNetFormer. Sci. Rep. 2025, 15, 14737.
48. Chen, Y.; Yang, Z.; Zhang, L.; Cai, W. A Semi-Supervised Boundary Segmentation Network for Remote Sensing Images. Sci. Rep. 2025, 15, 2007.
49. Ma, X.; Zhang, X.; Pun, M.-O.; Liu, M. MSFNET: Multi-Stage Fusion Network for Semantic Segmentation of Fine-Resolution Remote Sensing Data. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 2833–2836.
50. Liu, R.; Mi, L.; Chen, Z. AFNet: Adaptive Fusion Network for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7871–7886.
51. Sotomayor, L.N.; Lucieer, A.; Turner, D.; Lewis, M.; Kattenborn, T. Mapping Fractional Vegetation Cover in UAS RGB and Multispectral Imagery in Semi-Arid Australian Ecosystems Using CNN-Based Semantic Segmentation. Landsc. Ecol. 2025, 40, 169.
52. Ma, X.; Zhang, X.; Pun, M.-O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405.
53. Zhang, Z.; Sabuncu, M. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
54. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
55. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
56. Yao, J.; Chen, Y.; Guan, X.; Zhao, Y.; Chen, J.; Mao, W. Recent Climate and Hydrological Changes in a Mountain–Basin System in Xinjiang, China. Earth-Sci. Rev. 2022, 226, 103957.
57. Zhou, Y.; Li, Y.; Li, W.; Li, F.; Xin, Q. Ecological Responses to Climate Change and Human Activities in the Arid and Semi-Arid Regions of Xinjiang in China. Remote Sens. 2022, 14, 3911.
58. Wei, R.; Fan, Y.; Wu, H.; Zheng, K.; Fan, J.; Liu, Z.; Xuan, J.; Zhou, J. The Value of Ecosystem Services in Arid and Semi-Arid Regions: A Multi-Scenario Analysis of Land Use Simulation in the Kashgar Region of Xinjiang. Ecol. Model. 2024, 488, 110579.
59. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
60. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272.

Figure 1. SDS-Former network structure: The network consists of an encoder–decoder architecture, where LSE represents the Lightweight Semantic Enhancement module, DSFF denotes the Dynamic Selective Feature Fusion module, and FRH indicates the Feature Refinement Head.

Figure 2. Dynamic selective feature fusion process.

Figure 3. Detailed architecture of the Lightweight Semantic Enhancement Module.

Figure 4. Detailed architecture of FRH: from left to right are the spatial attention branch, the channel attention branch, and the residual shortcut connection.

Figure 5. Location and spatial extent of the study area.

Figure 6. LULC dataset of the region surrounding the Tarim Basin.

Figure 7. Class distribution of the dataset.

Figure 8. Visual comparison of ablation experiments for each module: (a) Image; (b) Label; (c) SDS-Former (Baseline + DSFF + LSE + FRH); (d) Baseline; (e) Baseline + DSFF; (f) Baseline + LSE; (g) Baseline + FRH; (h) Baseline + DSFF + LSE; (i) Baseline + DSFF + FRH; (j) Baseline + LSE + FRH. Black dashed boxes highlight the differences.

Figure 9. Visual comparison of segmentation results on the Tarim Basin dataset; (a–d) show four different scenes. Image denotes the RGB input image, and Label denotes the ground-truth annotation. Red arrows indicate boundary regions of dense building clusters.

Table 1. Results of the ablation study on the dataset of the region surrounding the Tarim Basin (%). The best values are shown in bold.

DSFF	LSE	FRH	mIoU	PA	F1
×	×	×	76.81	83.21	82.93
√	×	×	78.24	83.42	83.14
×	√	×	77.10	84.37	83.86
×	×	√	78.75	84.19	83.34
√	√	×	80.63	86.25	85.16
√	×	√	81.31	86.92	85.64
×	√	√	79.26	84.65	84.38
√	√	√	82.51	87.54	86.47
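
One way to read Table 1 is to compare each configuration's mIoU gain over the baseline. The sketch below computes those gains from the table values; the tuple keys (DSFF, LSE, FRH switched on or off) and the variable names are illustrative.

```python
# mIoU (%) from Table 1, keyed by (DSFF, LSE, FRH) switches.
MIOU = {
    (0, 0, 0): 76.81,
    (1, 0, 0): 78.24,
    (0, 1, 0): 77.10,
    (0, 0, 1): 78.75,
    (1, 1, 0): 80.63,
    (1, 0, 1): 81.31,
    (0, 1, 1): 79.26,
    (1, 1, 1): 82.51,
}

baseline = MIOU[(0, 0, 0)]
gains = {cfg: round(value - baseline, 2) for cfg, value in MIOU.items()}

# Sum of the three single-module gains versus the gain of the full model.
single_sum = round(gains[(1, 0, 0)] + gains[(0, 1, 0)] + gains[(0, 0, 1)], 2)
full_gain = gains[(1, 1, 1)]

print(f"single-module gains sum to {single_sum:.2f} points")
print(f"full model gains {full_gain:.2f} points")
```

With these values the full model's gain (5.70 points) exceeds the sum of the three single-module gains (3.66 points), which is consistent with the modules being complementary, although an ablation table alone cannot establish why.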

Table 2. Comparative experimental results on the Tarim Basin dataset. We present the PA of each land-cover category and two overall performance metrics; all values are expressed as percentages (%). The best values are shown in bold.

Methods	Bare Area	Desert	Gobi	Vegetation	Farmland	Water	Building	Background	mIoU	F1
UNet	60.46	78.92	69.38	68.94	87.61	85.38	71.83	79.85	70.94	79.16
DeepLabV3+	74.18	89.76	78.81	73.95	85.43	92.14	73.74	80.92	74.83	80.13
SETR	85.42	88.74	81.05	74.03	87.61	87.15	73.88	83.26	77.84	80.96
Segmenter	78.34	90.91	78.46	78.79	86.73	92.98	75.26	81.34	73.56	81.17
TransUNet	85.89	90.37	80.03	76.89	88.41	88.45	75.58	79.12	77.70	82.04
SegFormer	86.74	90.18	75.52	79.97	86.94	90.46	76.51	79.31	76.81	82.93
SDS-Former (Ours)	89.12	93.53	84.36	83.94	89.79	92.57	81.43	85.26	82.51	86.47

Table 3. Comparison of computational complexity, measured with a 256 × 256 input on a single NVIDIA GeForce RTX 3090 GPU.

Model	FLOPs (G)	Params (M)
SegFormer	14.19	27.35
SDS-Former	7.70	26.39
