1. Introduction
Remote sensing imagery plays a critical role in large-scale terrain analysis, natural disaster monitoring, and environmental assessment [1]. As an essential task in the intelligent interpretation of remote sensing imagery, semantic segmentation aims to assign explicit semantic category labels to each pixel, thereby enabling accurate identification of surface objects and landform types as well as modeling their spatial distribution [2]. This technique has been widely applied to land use classification [3], crop yield estimation [4], urban expansion monitoring [5], and ecological environment assessment [1], becoming one of the core approaches driving intelligent geographic information analysis.
Among various application scenarios, the semantic segmentation of complex landforms (e.g., karst terrains, landslide areas, and post-earthquake regions) is increasingly becoming a research hotspot. In China, for instance, the exposed or near-surface soluble rock covers an area of more than 1.3 million km², and the widely distributed karst landforms play a vital role in geological surveys, water resource protection, and infrastructure site selection [6,7]. At the same time, landslides, as one of the most common and destructive natural hazards worldwide, also exhibit typical characteristics of complex landforms. Their irregular boundaries and intricate formation mechanisms impose greater challenges on semantic segmentation models [8,9]. Therefore, developing a high-precision semantic segmentation method tailored for complex landforms holds significant scientific value and promising application potential for advancing remote sensing-based geoscientific interpretation.
Compared with natural scenes such as forests and farmlands [1,4], or urban scenes such as roads [10] and buildings [3,5], complex landform remote sensing images pose significant challenges for semantic segmentation. First, complex landforms exhibit substantial intra-class variations in elevation, scale, spectral reflectance, and texture, which weakens a model's ability to maintain semantic consistency within the same category. Second, some landform classes share high similarity with the background in spectral characteristics or spatial distribution, making it difficult for traditional feature extraction methods to distinguish landform categories from the background. Finally, complex landform regions often lack clear boundaries, present irregular shapes over vast spatial extents, and may even evolve over time, further aggravating boundary recognition errors and inaccurate regional delineation (as illustrated in Figure 1).
Traditional remote sensing semantic segmentation methods, such as threshold-based approaches [11] and machine learning techniques [12], often rely on fixed parameter settings. Consequently, their performance is usually unsatisfactory in regions with varying illumination conditions and complex textures. In recent years, deep learning has achieved remarkable progress in remote sensing semantic segmentation, with Convolutional Neural Networks (CNNs) being widely applied in environmental remote sensing [13], geological remote sensing [14], and urban remote sensing [3,5]. For example, in tasks such as urban building and road extraction [3,5,10,15], small object detection in remote sensing imagery [16], and urban change detection [17], CNN-based models leverage multi-scale receptive fields and spatial pyramid structures to effectively capture spatial context, delivering high segmentation accuracy in structured scenes. However, their receptive fields are inherently constrained by local convolutional kernels, making it difficult to model long-range spatial dependencies. Moreover, CNNs are insufficiently sensitive to features with significant scale variations and blurred boundaries, which can lead to intra-class feature loss and background misclassification in boundary regions. To overcome these limitations, Transformer architectures have been introduced into remote sensing image segmentation owing to their superior global modeling capability [18]. Representative methods, such as Swin Transformer [19], Mix Vision Transformer [20], and DeBiFormer [21], employ self-attention mechanisms to enable cross-region information interaction and exhibit strong potential in multi-scale feature aggregation, achieving excellent performance when directly applied to remote sensing semantic segmentation. Nevertheless, for complex landform remote sensing imagery, although attention mechanisms can effectively capture global relationships among target regions, they still fail to address the critical issue of local detail feature loss.
To leverage the complementary advantages of CNNs and Transformers, several studies have explored hybrid architectures that integrate the two. Representative approaches include DC-Swin [22] and ST-Unet [23], which preserve the Transformer's strength in modeling long-range dependencies while retaining the CNN's ability to capture local details. However, these methods still pay insufficient attention to indistinguishable intra-class details, and when background and target features exhibit high similarity, they may lead to partial loss of intra-class feature information.
To improve the segmentation accuracy of complex landform remote sensing images while preserving the integrity of intra-class features, this paper proposes a novel semantic segmentation network, LENet (landforms Expert segmentation Net), based on axial semantic modeling and deformation-aware compensation, building on previous work that integrates CNNs and Transformers. Inspired by the cognitive process of human experts in recognizing complex geomorphic regions, LENet mimics the way experts apply domain knowledge: they first combine global and local feature distributions to filter out most of the background interference, then retain regions rich in geomorphological features, and finally pay special attention to indistinguishable, fuzzy intra-class regions. Accordingly, the key challenge in complex landform image segmentation is to quickly eliminate background interference while focusing on the complex, fuzzy regions within landform classes, which is crucial for improving segmentation performance. A visualization of how LENet imitates this expert workflow, together with the working principle of each component, is shown in Figure 2.
LENet is built upon a state-of-the-art (SOTA) encoder–decoder architecture, with its key innovation lying in the design of the decoder. As the crucial stage for feature reconstruction, the decoder determines how to effectively restore category information from feature maps rich in semantic representations, which remains a major challenge in complex landform segmentation tasks. To address the wide spatial extent of complex geomorphological regions, we design an expert-enhanced axial semantic modeling module in the decoder to capture both horizontal and vertical contextual information, thereby adapting to large-scale spatial feature distributions. To mitigate noise interference caused by the similarity between background and target features, a cross-sparse attention mechanism is introduced to filter out redundant background information. Furthermore, to handle the rich intra-class scale variations in complex landform categories, a feature expert compensator is proposed to emphasize critical intra-class regions, thereby enhancing semantic consistency within categories. The main contributions of this work can be summarized as follows:
- (1) We propose a novel remote sensing semantic segmentation network, LENet, which integrates axial feature modeling with expert-inspired intra-class information identification. By combining an attention-based encoder with strong global modeling capability and a hybrid decoder capable of capturing subtle local features and filtering key information, LENet addresses the challenge of insufficient intra-class recognition integrity in complex landform segmentation.
- (2) We design an expert-inspired multi-scale feature learning decoder that mimics the process of expert judgment in distinguishing complex geomorphological features. The decoder incorporates an Expert Feature Enhancement Block (EEBlock), a Feature Expert Compensator (FEC), and a Cross Sparse Attention (CSA) module to improve category integrity and intra-class feature accuracy.
- (3) We validate the effectiveness of LENet on a landslide dataset and a karst dataset. Experimental results demonstrate that LENet achieves superior performance across multiple evaluation metrics, confirming its robustness and advancement in complex landform segmentation tasks.
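To make the axial-modeling idea concrete, the following minimal numpy sketch (our illustration, not LENet's actual EEBlock) shows how row-wise and column-wise context can be aggregated and broadcast back to every position, so that each pixel sees the full horizontal and vertical extent of the feature map:

```python
import numpy as np

def axial_context(feat: np.ndarray) -> np.ndarray:
    """Add horizontal and vertical context to a (C, H, W) feature map.

    For each position, the row mean supplies horizontal context and the
    column mean supplies vertical context, so every pixel sees the full
    extent of its row and column -- useful for landforms whose spatial
    footprint spans large horizontal or vertical distances.
    """
    row_ctx = feat.mean(axis=2, keepdims=True)   # (C, H, 1): mean over width
    col_ctx = feat.mean(axis=1, keepdims=True)   # (C, 1, W): mean over height
    return feat + row_ctx + col_ctx              # broadcast back to (C, H, W)

# toy feature map
f = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
out = axial_context(f)
```

Replacing the simple means with learned one-dimensional attention along each axis gives the kind of axial semantic modeling the decoder builds on.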
2. Related Works
2.1. Remote Sensing Image Semantic Segmentation
Semantic segmentation can be regarded as a pixel-level extension of image classification, with the goal of assigning a semantic category label to each pixel in the image [3]. Traditional semantic segmentation methods typically rely on manually annotated pixel-level labels and employ deep neural networks to learn the mapping between pixel features and semantic categories. CNN-based approaches have achieved remarkable progress in remote sensing image analysis [22,23,24,25] by constructing multi-layer convolutional architectures to extract and classify local features on a per-pixel basis. With further research, many studies have attempted to enhance contextual modeling capability and enlarge the receptive field to improve semantic understanding among pixels. U-Net introduces skip connections to integrate multi-scale contextual information [24], becoming a classical model for semantic feature enhancement through multi-scale fusion. SegNeXt employs large-kernel strip convolutions to expand the receptive field [26]. BiSeNet proposes a dual-branch structure to balance spatial detail preservation and semantic understanding efficiency [27], while EncNet enhances semantic category-related feature representation through a context encoding module [28].
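The strip-convolution idea can be sketched in a few lines: a 1×k pass followed by a k×1 pass covers a k×k receptive field at a fraction of the cost of a full k×k kernel. The snippet below is a simplified numpy illustration with averaging kernels and zero padding, not SegNeXt's actual depthwise implementation:

```python
import numpy as np

def strip_conv(x: np.ndarray, k: int = 7) -> np.ndarray:
    """Approximate a k x k receptive field on a 2-D map with a 1 x k pass
    followed by a k x 1 pass (averaging kernels, 'same' zero padding)."""
    kern = np.ones(k) / k
    # horizontal strip: convolve each row with the 1 x k kernel
    h = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 1, x)
    # vertical strip: convolve each column of the result with the k x 1 kernel
    return np.apply_along_axis(lambda c: np.convolve(c, kern, mode="same"), 0, h)

x = np.random.default_rng(0).random((16, 16))
y = strip_conv(x, k=5)
```

Two k-length strip passes cost O(2k) multiplies per pixel instead of O(k²), which is why large-kernel strip designs scale to the wide receptive fields that remote sensing scenes demand.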
Since the introduction of Transformer architectures into visual tasks, their self-attention mechanisms have demonstrated remarkable advantages in modeling long-range dependencies [18,29,30], gradually becoming a pivotal direction in the design of semantic segmentation models. However, Transformers incur substantial computational costs when processing high-resolution images, which limits their practical deployment in remote sensing applications. To address this issue, researchers have proposed a series of lightweight optimization strategies. In addition to the classic shifted-window mechanism adopted by Swin Transformer [19], BiFormer utilizes a novel dynamic sparse attention mechanism realized through two-level routing, enabling more flexible and content-aware computation allocation [31]; RMT introduces a spatial decay matrix and designs a spatially constrained attention decomposition structure to enhance long-range modeling efficiency [32]; and Zhu et al. integrate CNNs with Transformers to extract high-frequency details and low-frequency semantic features separately, achieving a unified representation of global context and fine-grained structures [33].
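The common thread in these sparsification schemes is that each query attends to only a small, content-dependent subset of keys. Below is a heavily simplified numpy sketch of per-query top-k sparse attention (illustrative only; BiFormer's bi-level routing first selects regions and only then tokens within them):

```python
import numpy as np

def topk_sparse_attention(q, k_mat, v, top_k=4):
    """Single-head attention that keeps, per query, only its top_k
    highest-scoring keys and masks the rest before the softmax -- the
    basic idea behind dynamic sparse attention, greatly simplified."""
    d = q.shape[-1]
    scores = q @ k_mat.T / np.sqrt(d)                          # (Nq, Nk)
    # build an additive mask: 0 for each query's top_k keys, -inf elsewhere
    idx = np.argpartition(scores, -top_k, axis=1)[:, -top_k:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx, 0.0, axis=1)
    scores = scores + mask
    # numerically stable softmax; masked entries become exactly 0
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(32, 16))
v = rng.normal(size=(32, 16))
out, w = topk_sparse_attention(q, k, v, top_k=4)
```

Because only top_k attention weights per query are non-zero, the value aggregation can in principle be restricted to those keys, which is where the computational savings of sparse attention come from.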
Although the aforementioned methods achieve excellent performance in natural image semantic segmentation, they often struggle when directly applied to remote sensing imagery due to the unique challenges of such data, including high intra-class variance, significant scale variations, blurred object boundaries, and complex backgrounds. In addition, remote sensing images typically possess ultra-high resolution, which further increases model parameters and computational costs, thereby limiting the practicality and scalability of these approaches.
To address the unique challenges of remote sensing imagery, numerous scholars have proposed targeted design strategies. For instance, in urban remote sensing scenarios, common issues include high inter-class similarity, large intra-class variation, and dense small objects [5,22,34]. Wang et al. introduced a parallel window attention structure to enhance the spatial modeling capability of Swin Transformer [35], thereby improving urban remote sensing image segmentation performance. Rau et al. incorporated DEM auxiliary information to achieve hierarchical segmentation for multimodal landslide disasters [36]. Cui et al. developed a cross-modal transfer learning framework for semantic segmentation [37], introducing channel and spatial discretization losses to alleviate inter-modal feature conflicts and redundancy, thus enhancing modality-cooperative modeling. He et al. employed prompt learning mechanisms and explicit semantic grouping strategies to adapt frozen pre-trained models to multimodal downstream tasks [38]. Ma et al. combined the strengths of CNNs and Transformers to propose a hybrid architecture that improves multimodal fusion efficiency [15]. In addition, Zhang et al. utilized infrared remote sensing imagery for vehicle detection [39], Zheng et al. designed a foreground-aware relational modeling network to tackle scale variations and foreground class imbalance in remote sensing images [40], and Chen et al. introduced a prompt-driven mechanism based on the SAM architecture for instance segmentation in remote sensing imagery [41].
To enhance the spatial sampling capability of convolutional neural networks, Dai et al. proposed Deformable Convolutional Networks (DCN) [42], which learn sampling offsets from the target task without extra supervision to expand the spatial sampling positions of convolutional layers, demonstrating strong potential in remote sensing semantic segmentation. Yu et al. leveraged deformable convolutions to achieve global and deformation-aware feature extraction [10], improving road extraction performance in remote sensing imagery. Dong et al. introduced a multi-scale deformable attention module [43], utilizing a larger deformable receptive field to adapt to remote sensing targets of diverse shapes and sizes, thereby generating more precise attention maps. Hu et al. proposed a multi-scale deformable self-attention mechanism to reduce the partial loss of linearly intertwined road information in remote sensing images [44], effectively enhancing the saliency of road features relative to the surrounding environment. Although deformable convolutions have been widely applied in remote sensing semantic segmentation, most applications focus on complex object shapes or objects densely intertwined with the background, while their substantial potential for handling intra-class feature variations within target categories remains underexplored.
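The core operation of deformable convolution can be illustrated without a deep learning framework: the sampling positions of a regular 3×3 grid are shifted by fractional offsets and read out with bilinear interpolation. The following is a hypothetical numpy sketch in which the offsets are supplied explicitly; in DCN they are predicted by an additional convolutional layer:

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Bilinearly sample img (H, W) at fractional coords (y, x),
    clamping to the image border."""
    H, W = img.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

def deformable_sample(img, cy, cx, offsets):
    """Read a 3x3 neighbourhood around (cy, cx) whose grid positions are
    shifted by fractional (dy, dx) offsets -- the sampling step at the
    heart of deformable convolution (the weighted sum with kernel
    weights would follow)."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.array([bilinear_sample(img, cy + dy + oy, cx + dx + ox)
                     for (dy, dx), (oy, ox) in zip(grid, offsets)])

img = np.arange(25, dtype=float).reshape(5, 5)
zero_off = [(0.0, 0.0)] * 9          # zero offsets reduce to a plain 3x3 grid
vals = deformable_sample(img, 2, 2, zero_off)
```

With all offsets at zero the operation reduces to ordinary convolution sampling; non-zero offsets let the kernel follow irregular shapes, which is exactly the property the road-extraction works above exploit.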
In summary, although recent advances in CNNs and Transformer architectures have significantly promoted the development of remote sensing semantic segmentation, most existing studies focus on man-made or natural land cover types with regular boundaries and clear structures, such as urban buildings, roads, and agricultural areas. Effective modeling strategies remain lacking for complex landforms, which exhibit irregular textures, atypical features, blurred boundaries, and high intra-class variance. While deformable convolutions have demonstrated remarkable capabilities in road recognition tasks, their potential for learning critical intra-class regions has yet to be fully explored. Consequently, directly applying existing methods to complex landform remote sensing imagery often fails to achieve satisfactory segmentation performance.
2.2. Semantic Segmentation of Complex Landforms
Semantic segmentation of complex landform regions in remote sensing imagery is a highly challenging task, due to the spatial heterogeneity of multiple landform types, blurred boundaries, significant scale variations, and high intra-class variance. Areas exhibiting typical complex landform characteristics, such as landslide-prone zones, plateau karst regions, and desert hilly landscapes, are of particular interest, as accurate semantic understanding and fine-grained segmentation in these regions play a crucial role in disaster monitoring, resource surveying, and ecological assessment.
In geohazard-related complex landforms, remote sensing monitoring and semantic segmentation of landslide-prone areas have become a major research focus. Landslide occurrences are typically accompanied by abrupt changes in surface morphology and pose significant threats to human life and property, making their mapping and dynamic change detection highly practical. Zhang et al. proposed a prototype-guided, region-aware progressive learning method based on multi-scale target domain adaptation to address cross-domain modeling of landslide regions in large-scale remote sensing imagery [8]. Lu et al. designed a lightweight network, MS2LandsNet, specifically for landslide landform features [14]; by reducing the number of channels and optimizing the network architecture, combined with multi-scale feature fusion and channel attention mechanisms, the model's performance in landslide detection tasks was significantly improved. Şener et al. developed LandslideSegNet [45], which integrates an encoder–decoder residual structure with spatial–channel attention mechanisms to enhance early identification of potential landslide regions. Soares et al. evaluated the generalization performance of automated landslide mapping models across three landform types incorporating NDVI and DEM data, demonstrating that NDVI information helps mitigate overfitting and improves prediction balance [46].
Research on remote sensing segmentation of non-hazard-related complex landforms has also been expanding. Huang et al. addressed the challenge of multi-scale representation of landscape landform features by proposing the SwinClustering framework based on Swin Transformer [19], enhancing the model's ability to jointly capture spatial structure and semantic information [47]. Cheng et al. designed an improved semantic segmentation framework that employs statistical preprocessing to remove invalid image patches during training [48], thereby improving the model's capacity to capture local details of fracture-like features. Yu et al. introduced SegRoadv2 for road segmentation [10], incorporating deformable self-attention and grouped deformable convolution structures to enhance the perception of irregular road boundaries. Goswami et al. utilized a U-Net architecture to construct segmentation models for typical landform types, including deserts, forests, and mountainous regions [49]. Zhou et al. proposed a network based on a dense attention residual pyramid fusion structure for Lithological Unit Classification (LUC) [50], a task of significant relevance in geological resource exploration. While these methods have achieved remarkable results on various complex landform types, they differ in focus from this study, which aims to develop a generalized segmentation model capable of handling diverse complex landforms. Specifically, this work proposes a semantic segmentation method for complex landform remote sensing imagery that not only improves segmentation accuracy but also preserves intra-class feature integrity.