CETransUNet: An Intelligent Landslide Identification Method Based on Collaborative Optimization of Global Context and Dual Attention Mechanisms

Sun, Tianli; Yang, Chengsheng; Wu, Jifeng; Liu, Zewei; Wang, Ziqian; Cheng, Xiaoqiang

doi:10.3390/rs18121974


Tianli Sun ¹, Chengsheng Yang ¹,*, Jifeng Wu ², Zewei Liu ¹, Ziqian Wang ¹ and Xiaoqiang Cheng ¹

¹ School of Geological Engineering and Geomatics, Chang’an University, Xi’an 710054, China
² College of Geomatics, Xi’an University of Science and Technology, Xi’an 710054, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1974; https://doi.org/10.3390/rs18121974 (registering DOI)

Submission received: 13 April 2026 / Revised: 3 June 2026 / Accepted: 7 June 2026 / Published: 13 June 2026

(This article belongs to the Special Issue Advances in Geological Hazard Characterization and Assessment: Merging Remote Sensing with Direct Surveys)


Highlights

What are the main findings?

Based on post-earthquake remote sensing images from the 2017 Nyingchi earthquake, this paper systematically constructed a co-seismic landslide detection dataset for the alpine valley region of the Yarlung Zangbo River. This dataset comprehensively covers typical landslide morphologies under various slope, illumination, and vegetation coverage conditions. It particularly emphasizes pixel-level fine annotation of fragmented boundaries and small-scale shallow landslides, effectively addressing the lack of high-quality landslide detection datasets for this specific area.
This paper proposes CETransUNet, a novel landslide detection model that combines CNN and Transformer architecture. By integrating coordinate attention and edge-guided attention modules, the model effectively mitigates boundary ambiguity and geometric distortion in complex scenarios.

What are the implications of the main findings?

The co-seismic landslide detection dataset of the Yarlung Zangbo River alpine valley region constructed in this paper effectively addresses the lack of high-quality landslide detection datasets for this area, providing a critical data foundation for researching the spatial distribution patterns of landslides, accurate hazard risk assessment, and the development of disaster prevention and mitigation strategies in the region.
The CETransUNet model achieves a synchronous optimization of landslide boundary integrity and geometric precision, providing a reliable technical solution for large-scale intelligent landslide identification and disaster emergency decision-making.

Abstract

Accurate landslide identification is crucial for enhancing emergency response capabilities during destructive geological hazards. Although deep-learning-based semantic segmentation has demonstrated effectiveness, substantial variations in landslide scales and environmental similarities continue to challenge existing methods. This paper systematically constructs a new co-seismic landslide dataset for the Yarlung Zangbo River basin based on the 2017 Nyingchi earthquake, effectively filling a critical regional data gap. This paper proposes CETransUNet (coordinate attention and edge-guided attention transformer UNet), a novel landslide detection model that integrates ResNet and Transformer architectures. Specifically, a coordinate attention (CA) module is introduced within the skip connections between the encoder and decoder. This module encodes positional information along both horizontal and vertical spatial directions and dynamically re-weights the feature maps, thereby effectively suppressing background noise caused by semantic gaps and enhancing the model’s ability to localize landslide regions. Additionally, an edge-guided attention (EGA) module is incorporated into the decoder. This module extracts explicit edge priors from the input image using a Laplacian operator and imposes geometric constraints on the predictions via a boundary reverse attention mechanism, thereby significantly alleviating boundary ambiguity and morphological distortion of landslides. Evaluations across datasets from the Yarlung Zangbo River, Iburi-Tobu, and Bijie regions demonstrate that CETransUNet significantly outperforms state-of-the-art models—including TransUNet, SegFormer, and SwinUNet—in terms of IoU, MIoU, and F1-score. 
Overall, through the synergistic optimization of the coordinate attention and edge-guided attention modules, the CETransUNet model achieves synchronous enhancement of boundary integrity and geometric precision in complex scenarios, providing a reliable technical solution for large-scale intelligent landslide identification.

Keywords:

landslide detection; deep learning; CETransUNet; coordinate attention module; edge-guided attention module

1. Introduction

Landslides are one of the most widespread and destructive geological hazards worldwide. Their occurrence and development are often closely linked to factors such as earthquakes, heavy rainfall, groundwater fluctuations, extreme weather events and river erosion [1]. Sudden landslides not only cause dramatic alterations to the land surface but also potentially trigger secondary disasters, such as disruptions to transportation networks, destruction of buildings and infrastructure and loss of life, posing a dual threat to human society and natural ecosystems [2,3]. Therefore, landslide detection using remote-sensing imagery is crucial for understanding their spatial distribution and morphological characteristics, thereby facilitating disaster prevention and mitigation [4].

After a landslide occurs, the significant differences in morphology, texture and spectral features between the slide mass and surrounding stable terrain provide a critical basis for accurately defining landslide boundaries. However, landslide identification in complex environments remains a challenge [5]. Traditional methods primarily rely on field surveys and manual visual interpretation, which are not only time-consuming and labor-intensive but also inefficient in providing rapid responses after heavy rainfall or major earthquakes trigger large-scale, group-occurring landslides [6]. In recent years, optical remote-sensing satellites have provided unprecedented data support for landslide identification, as they can rapidly acquire high-resolution surface imagery over large areas [7]. With the continuous advancement of remote-sensing technology, the use of satellite data that are obtained with higher spatial resolution and revisit frequency has further enhanced the potential and accuracy of automated landslide identification. Consequently, automated landslide-boundary detection technologies based on remote-sensing images have become the focus of current research. Intelligent identification methods that incorporate deep learning are driving the advancement of this field from semi- to full automation [8]. These technologies enable the rapid acquisition of extensive surface information, which is crucial for emergency response, disaster assessment and long-term landslide inventory [9]. Particularly in inaccessible or hazardous areas where field investigations are significantly constrained, remote-sensing-based automated identification methods demonstrate irreplaceable application value [10,11].

In recent years, deep learning, particularly convolutional neural networks (CNNs), has revolutionized automated landslide detection from remote sensing imagery [12,13,14]. Architectures such as ResUNet have demonstrated strong performance in regional landslide mapping by combining residual connections with encoder–decoder structures and attention mechanisms [15,16,17]. However, a fundamental limitation of CNNs remains their local receptive field: stacking small convolutional kernels results in an effective receptive field much smaller than the theoretical size, making it difficult to model global context and long-range dependencies [18,19,20]. This leads to semantic noise, misclassification of spectrally similar features, and boundary distortion in complex scenarios [21,22]. To overcome these limitations, Transformer architectures have been introduced, leveraging self-attention to directly capture global correlations across image patches [23,24,25]. TransUNet, for instance, integrates a Transformer into the U-Net bottleneck, improving global context modeling [26]. Nevertheless, Transformers often struggle to preserve fine-grained local details and precise boundary delineation, especially for fragmented landslide morphologies [25,27]. Moreover, existing hybrid models typically stack CNN and Transformer modules without a synergistic design, leaving the semantic gap between encoder and decoder features unaddressed and failing to fully exploit edge priors for boundary refinement [28,29,30].

Beyond the limitations inherent in model architecture, deep learning-based landslide detection research also faces multiple challenges at the data level [31]. Firstly, the suddenness of landslide events makes acquiring high-quality, large-scale, pixel-level annotated datasets extremely costly, so publicly available resources are scarce and insufficient for adequately training data-driven models [32]. Secondly, existing datasets often lack diversity in terrain complexity, scene variety and representation of triggering mechanisms. Models trained on them therefore generalize poorly when confronted with real-world challenges, such as complex lighting and shadows, vegetation cover and fragmented landslide boundaries with varying morphologies [33]. Furthermore, the prevalent class imbalance in the data exacerbates prediction bias, predisposing models to classify most pixels as background and leading to severe under-detection of actual landslide bodies [34]. Collectively, these factors constrain the practical effectiveness and reliability of deep learning in landslide detection.

To address the aforementioned challenges and meet the demand for high-precision automated landslide identification in complex remote-sensing scenarios, this study focuses on two key research questions: (1) how to construct a high-quality landslide dataset for the alpine valley region of the Yarlung Zangbo River basin, and (2) how to develop an effective deep-learning framework that overcomes the limitations of existing methods in handling boundary blurring, geometric distortion, and semantic noise. To tackle these questions, we first systematically build a novel co-seismic landslide dataset based on post-earthquake remote-sensing images from the 2017 Nyingchi earthquake. This dataset captures typical alpine valley landforms with diverse landslide morphologies under varying slope gradients, extreme illumination, and complex vegetation coverages, with fine annotations of fragmented boundaries and cluster landslides. Second, we propose a deep-learning framework named CETransUNet (coordinate attention and edge-guided attention transformer UNet), which integrates ResNet and Transformer architectures. Specifically, a coordinate attention (CA) module is incorporated to enhance positional awareness and filter irrelevant semantic noise, thereby suppressing misclassification of spectrally similar features (e.g., bare rock and roads). Concurrently, an edge-guided attention (EGA) module is introduced to provide explicit boundary constraints via Laplacian edge priors, mitigating geometric distortion and edge blurring in fragmented landslide areas.

2. Datasets and Methods

2.1. Datasets

Figure 1a shows the geographical locations of the three study areas involved in this research: the lower Yarlung Zangbo River region, the Bijie area in China, and the Iburi-Tobu area in Japan.

As shown in Figure 1b, the lower Yarlung Zangbo River region is situated on the southeastern margin of the Tibetan Plateau, within the Eastern Syntaxis zone of the collision belt between the Eurasian and Indian Plates. This area experiences intense tectonic activity and exhibits significant topographic relief, with a large elevation span. Field investigation by our research team, combined with the topographic map of the Yarlung Zangbo River (Figure 1b,c), shows that the geomorphology of this area is characterized by high mountains and deep valleys, with highly fractured rock masses and generally poor slope stability. Influenced by moist air currents from the Indian Ocean, the region receives abundant precipitation. The average annual precipitation is approximately 500 mm, with a decreasing trend from east to west. Precipitation is mainly concentrated from June to September, which accounts for more than 78% of the annual total, with frequent heavy rainstorms and intense hydrological erosion. Annual precipitation in the lower reaches can reach 800–1400 mm, whereas in the upper reaches it is only about 300–500 mm, presenting a significant spatial gradient [35]. The coupled effect of seismic activity and rainfall makes the area highly prone to landslides. On 18 November 2017, the Milin Ms 6.9 earthquake struck this region, triggering numerous co-seismic landslides. This study used post-earthquake Ziyuan-3 (ZY-3) satellite imagery to conduct a detailed visual interpretation of the Yarlung Zangbo River Grand Canyon area, establishing a database of 824 landslide samples with an image size of 256 × 256 pixels. This dataset provides crucial data support for research on the distribution patterns, influencing factors and susceptibility assessment of earthquake-induced landslides.

As shown in Figure 1d, the Iburi-Tobu area is in Hokkaido, Japan, covering an area of approximately 733.06 km². Situated within the active Pacific Ring of Fire subduction zone, the region experiences significant tectonic activity. Its topography comprises mountains, plateaus and coastal plains, with elevations ranging from 0 to 624 m. Lithology is dominated by andesite, dacite, rhyolite and basalt, contributing to generally poor slope stability. Under a humid continental climate, the area has an average annual temperature of 6 to 10 °C and receives annual precipitation of 800 to 1200 mm. Significant hydrological erosion caused by seasonal snowmelt and heavy rainfall events makes this region highly susceptible to landslides. On 6 September 2018, an Mw 6.6 earthquake struck the Iburi-Tobu area, triggering widespread co-seismic landslides [36]. The Iburi-Tobu dataset was constructed using 3 m resolution post-earthquake satellite images provided by the Geospatial Information Authority of Japan, acquired in September and October 2018. It is a library of group-occurring landslide samples (512 × 512 pixels), of which a publicly available subset contains 1484 samples [37].

As shown in Figure 1e, the Bijie area is located in northwestern Guizhou Province, China, within the transition zone between the Tibetan Plateau and the eastern hills, with elevations ranging from 152 to 2885 m. This region lies in an active tectonic zone and is characterized by numerous steep slopes with poor stability. With an average annual rainfall of 849–1399 mm and highly concentrated precipitation during the rainy season, the area is highly susceptible to landslides. Furthermore, human engineering activities such as mining and road construction have increased susceptibility to geological hazards. The Bijie dataset comprises 770 landslide samples extracted from TripleSat satellite imagery acquired between May and August 2018. These include rockfalls, rockslides and a limited number of debris slides. Each sample contains a complete landslide body with an extended background area of 40 m beyond the bounding box [38].

2.2. CETransUNet

In recent years, deep learning-based semantic segmentation methods have demonstrated great potential in the field of landslide detection. Among them, convolutional neural networks (CNNs) have been widely adopted due to their powerful hierarchical feature extraction and end-to-end learning capabilities. Architectures such as ResUNet combine residual connections with encoder–decoder structures, achieving excellent performance in regional landslide mapping. However, a fundamental limitation of CNNs lies in their local receptive field, which relies on stacking small kernels to expand the context. As a result, the effective receptive field is often much smaller than the theoretical size, making it difficult to model global context and long-range dependencies. This leads to semantic noise, misclassification of spectrally similar features (e.g., bare rock versus landslide mass), and boundary distortion in complex scenarios. To overcome these limitations, Transformer architectures have been introduced, leveraging self-attention mechanisms to directly capture global correlations across image patches. In landslide detection, TransUNet integrates a Transformer into the bottleneck of U-Net, improving global context modeling. Nevertheless, Transformers have limited ability to preserve fine-grained local details and precise boundary delineation, especially for fragmented and irregular landslide morphologies. Moreover, existing hybrid models often simply stack CNN and Transformer modules without a synergistic design, leaving the semantic gap between encoder and decoder features unaddressed and failing to fully exploit edge priors for boundary refinement.

To synergistically combine the strengths of both paradigms while mitigating their respective weaknesses, this paper proposes CETransUNet, a novel framework that integrates CNN and Transformer architectures with CA and EGA modules. As shown in Figure 2, CETransUNet achieves a non-linear synergistic optimization of global semantic context and local structural integrity.

The core synergistic mechanism of the model operates through a coordinated information flow that we term “feature filtering-to-structural reconstruction.” In the feature fusion stage, the CA module is embedded within the skip connections to bridge the semantic gap—which frequently causes the misclassification of spectrally similar ground objects such as bare rock and roads. By encoding precise positional information along the horizontal and vertical spatial directions, the CA module dynamically reweights the feature maps, enhancing the representation of salient landslide regions while effectively suppressing environmental noise. This process functions as spatial feature purification, ensuring that only high-purity spatial information is transferred to the decoder. Crucially, this purified output provides a semantically refined foundation for the subsequent EGA module.

Then, the EGA module is integrated into the decoder to enforce explicit geometric constraints on the purified features delivered by the CA module. By fusing multi-scale semantic features from the encoder and preliminary decoder predictions with explicit edge priors derived via a Laplacian operator, the EGA module specifically addresses boundary blurring and morphological distortion. The module utilizes a boundary reverse attention mechanism to refine the delineation of fragmented landslide margins, ensuring high structural consistency in complex topographic scenarios. The two modules thus operate in a tightly coupled manner: the CA module optimizes spatial localization at a semantic level, while the EGA module leverages this optimized representation to execute precise boundary delineation at a structural level. Experimental evidence confirms that this synergistic coupling significantly enhances the model’s robustness and shape preservation capability, particularly in alpine valley regions characterized by high background complexity.

2.3. Coordinate Attention Module

The CA module employs a dual-path spatial attention mechanism. It independently computes the attention weights along the horizontal and vertical directions, then integrates them with the original feature information. This process enables the module to focus precisely on key features within the target region, thereby improving the model’s accuracy in target recognition and localization.

By leveraging this mechanism, the model effectively mitigates environmental interference and refines local feature representations. The architecture of the CA module is illustrated in Figure 3. Horizontal ($\mathrm{Avg}_x$) and vertical ($\mathrm{Avg}_y$) average pooling operations are first applied to an input feature $x$. The resulting features are then concatenated to form a feature map $F$ [39] as follows:

$$F = \mathrm{Cat}\left[\mathrm{Avg}_x(x),\ \mathrm{Avg}_y(x)\right] \tag{1}$$

The fused features are subsequently integrated and processed through convolution and normalization operations, followed by the Hard-Swish activation function $\delta$, to obtain $F_s$. The feature map $F_s$ is then split into $F_{sx}$ and $F_{sy}$, which are passed through convolutional layers and the nonlinear activation function $\sigma$ to generate the horizontal and vertical attention maps $F_x$ and $F_y$, respectively. Finally, the enhanced feature map $Y_{coord}$ is produced by applying $F_x$ and $F_y$ to the original input feature map [39]:

$$F_s = \delta\left(\mathrm{BN}\left[\mathrm{Conv}(F)\right]\right) \tag{2}$$

$$F_x = \sigma\left(\mathrm{Conv}(F_{sx})\right), \quad F_y = \sigma\left(\mathrm{Conv}(F_{sy})\right) \tag{3}$$

$$Y_{coord} = F_x \times F_y \times x \tag{4}$$

By following the above steps, feature reweighting is achieved, optimizing the input features and enhancing model performance.
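The reweighting in Equations (1)–(4) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the 1 × 1 convolutions are written as channel-mixing matrices with random weights, batch normalization is omitted, $\sigma$ is assumed to be the sigmoid function, and the names `coordinate_attention`, `w_shared`, `w_rows` and `w_cols` are hypothetical. The two attention maps are named by the axis they keep (rows or columns) rather than as "horizontal" and "vertical".

```python
import numpy as np

def hard_swish(z):
    # Hard-Swish activation: z * ReLU6(z + 3) / 6
    return z * np.clip(z + 3.0, 0.0, 6.0) / 6.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w_shared, w_rows, w_cols):
    """x: (C, H, W) feature map.
    w_shared: (C_mid, C) shared 1x1 convolution, written as a matrix.
    w_rows, w_cols: (C, C_mid) 1x1 convolutions for the two branches.
    Batch normalization is left out (an assumption of this sketch)."""
    c, h, w = x.shape
    pooled_rows = x.mean(axis=2)   # (C, H): average over the width
    pooled_cols = x.mean(axis=1)   # (C, W): average over the height
    f = np.concatenate([pooled_rows, pooled_cols], axis=1)  # Eq. (1): (C, H + W)
    f_s = hard_swish(w_shared @ f)                          # Eq. (2): (C_mid, H + W)
    f_s_rows, f_s_cols = f_s[:, :h], f_s[:, h:]             # split F_s
    att_rows = sigmoid(w_rows @ f_s_rows)                   # Eq. (3): (C, H)
    att_cols = sigmoid(w_cols @ f_s_cols)                   # Eq. (3): (C, W)
    # Eq. (4): broadcast both attention maps over the input feature map.
    return x * att_rows[:, :, None] * att_cols[:, None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 7))
y = coordinate_attention(
    x,
    rng.normal(size=(4, 8)),   # hypothetical reduction from 8 to 4 channels
    rng.normal(size=(8, 4)),
    rng.normal(size=(8, 4)),
)
print(y.shape)
```

Because both attention maps lie between 0 and 1, the output has the same shape as the input and every activation is scaled down by a position-dependent factor.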

2.4. Edge-Guided Attention Module

To address the weak boundary problem, Bui et al. proposed the EGA module, which enables a model to focus more effectively on edge-related information [40]. The structure of the EGA module is illustrated in Figure 4. This module requires three inputs: the feature from the encoder ($\hat{f}_{i}^{e}$), edge information obtained via the Laplacian function ($\hat{f}_{i}^{l}$) and a preliminary prediction feature generated by the decoder ($\hat{f}_{i+1}^{d}$). It is critical to clarify that the edge information $\hat{f}_{i}^{l}$ is derived directly from the high-frequency components of the input image via a Laplacian pyramid, rather than being obtained from the ground truth (GT), ensuring the model’s practical utility during inference.
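As an illustration of how a high-frequency edge prior can be taken from the input image alone, the sketch below computes one level of a Laplacian pyramid in NumPy. It makes simplifying assumptions: a 2 × 2 box average stands in for the usual Gaussian low-pass filter, upsampling is nearest-neighbour, and the function name `laplacian_level` is hypothetical. The paper's actual filter and number of pyramid levels are not stated here.

```python
import numpy as np

def laplacian_level(img):
    """One Laplacian-pyramid level of a 2-D image whose sides are even.
    Returns (high_freq, low_res), where high_freq has the input's shape."""
    h, w = img.shape
    # Low-pass and downsample: 2x2 box average (assumption; a Gaussian
    # kernel is the more common choice).
    low = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    # Upsample back to the input size by repeating each pixel.
    up = np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)
    # The residual keeps only what the low-resolution image cannot
    # represent, i.e. edges and fine texture.
    return img - up, low

# A vertical step edge between columns 2 and 3.
img = np.zeros((4, 8))
img[:, 3:] = 1.0
high, low = laplacian_level(img)
print(high[0])
```

The residual is zero in flat regions and non-zero only next to the step, which is the property that makes it usable as an edge prior without access to ground-truth masks.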

First, a reverse attention mechanism and Laplacian algorithm are applied to the preliminary prediction feature $\hat{f}_{i+1}^{d}$ to generate reverse attention information $\hat{f}_{i+1}^{r}$ and boundary attention information $\hat{f}_{i+1}^{b}$, which are then multiplied element-wise with the input features. The interacting information is subsequently aggregated and processed through a $3 \times 3$ convolutional operation to produce the combined feature $f_{i}^{c}$. To suppress the interference of edge background noise on critical regions, an attention mask $A_{i}$ is introduced at layer $i$. By filtering out redundant features, the model is guided to focus on the key areas along the landslide boundary. The attention features $f_{i}^{a}$ at layer $i$ can be defined as follows [40]:

$$f_{i}^{a} = \hat{f}_{i}^{e} + \left(f_{i}^{c} \otimes A_{i}\right) \tag{5}$$
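The two attention maps and the residual combination in Equation (5) can be sketched as below. This is a simplified, assumption-laden illustration rather than the module of Bui et al.: reverse attention is taken as one minus the sigmoid of the decoder logits, boundary attention as the absolute value of a 4-neighbour discrete Laplacian of the predicted probability, and the combined feature and attention mask are passed in as plain arrays. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplacian_4(p):
    """4-neighbour discrete Laplacian with edge-replicated padding."""
    q = np.pad(p, 1, mode="edge")
    return q[:-2, 1:-1] + q[2:, 1:-1] + q[1:-1, :-2] + q[1:-1, 2:] - 4.0 * p

def ega_attention_maps(pred_logits):
    """pred_logits: (H, W) preliminary decoder prediction."""
    prob = sigmoid(pred_logits)
    reverse = 1.0 - prob                 # large where background is predicted
    boundary = np.abs(laplacian_4(prob)) # large along the predicted outline
    return reverse, boundary

def attention_feature(f_enc, f_comb, mask):
    # Eq. (5): encoder feature plus the masked combined feature.
    return f_enc + f_comb * mask

# A confident 4x4 landslide prediction inside an 8x8 tile.
logits = np.full((8, 8), -20.0)
logits[2:6, 2:6] = 20.0
reverse, boundary = ega_attention_maps(logits)
print(np.round(boundary[2], 2))
```

In this toy case the boundary map is zero deep inside and far outside the predicted region and peaks along its outline, while the reverse map highlights everything the decoder has not yet claimed as landslide.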

The obtained attention feature map $f_{i}^{a}$ is then fed into the convolutional block attention module (CBAM) for recalibration, enabling the model to effectively explore the feature relationships between the boundary and background regions. CBAM comprises channel and spatial attention modules, which emphasize informative features along the channel and spatial dimensions, respectively, thereby focusing on relevant information while suppressing irrelevant noise [41]. In the channel attention branch, the attention feature map $f_{i}^{a}$ is processed using a $1 \times 1 \times N_{i}$ convolutional kernel to enhance the model’s sensitivity to important channel-wise features. In the spatial attention branch, an $H_{i} \times W_{i} \times 1$ spatial convolution kernel is applied to further refine the intermediate feature map, ultimately producing a calibrated output feature map [40].

$$f_{i}^{d} = \mathrm{CBAM}\left(f_{i}^{a}\right) \tag{6}$$

2.5. Evaluation Indicators

A confusion matrix is a core evaluation tool in the field of machine learning, used to quantify the performance of classification models. It is an $N \times N$ matrix structure that enables visual analysis of model effectiveness by systematically organizing the correspondence between the predicted results and true labels. Taking the binary classification task as the research paradigm, where the target and non-target categories are defined as positive and negative classes, respectively, the matrix can be precisely decomposed into the following four key evaluation units (Table 1): true positive (TP), which represents the number of positive class samples that the model accurately identifies; true negative (TN), which quantifies the ability of the model to accurately identify negative class samples; false positive (FP), which reflects the situation where the model incorrectly classifies a negative sample as positive; false negative (FN), which records the failure of the model to detect positive class samples (missed detections).

To evaluate the performance of the proposed model, five commonly used quantitative metrics were adopted: intersection over union (IoU), mean IoU (MIoU), precision, recall, and F1-score. These evaluation metrics are derived from the confusion matrix.

IoU is a widely used metric for evaluating the performance of object detection and image segmentation models. It measures the degree of overlap between the predicted and ground-truth regions. The formula for calculating the IoU is as follows:

$$\mathrm{IoU}(M_{t}, M_{p}) = \frac{M_{t} \cap M_{p}}{M_{t} \cup M_{p}} \tag{7}$$

where $M_{t}$ and $M_{p}$ represent the ground-truth and predicted segmentation masks, respectively; $M_{t} \cap M_{p}$ and $M_{t} \cup M_{p}$ represent the areas of intersection and union between both masks, respectively.
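For binary masks, Equation (7) reduces to counting pixels. The NumPy sketch below illustrates this on a small hypothetical example. The text does not define MIoU explicitly, so `mean_iou` assumes one common convention, the average of the landslide-class and background-class IoU; the function names and the empty-union convention are likewise assumptions of this sketch.

```python
import numpy as np

def iou(mask_true, mask_pred):
    """IoU of two binary masks (Eq. 7), counting pixels as area."""
    t = np.asarray(mask_true, dtype=bool)
    p = np.asarray(mask_pred, dtype=bool)
    union = np.logical_or(t, p).sum()
    if union == 0:
        return 1.0  # convention chosen here: two empty masks agree fully
    return float(np.logical_and(t, p).sum() / union)

def mean_iou(mask_true, mask_pred):
    """Average of landslide IoU and background IoU (assumed definition)."""
    t = np.asarray(mask_true, dtype=bool)
    p = np.asarray(mask_pred, dtype=bool)
    return 0.5 * (iou(t, p) + iou(~t, ~p))

# Hypothetical 2x4 masks: 2 pixels overlap, 6 pixels in the union.
truth = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0]])
pred = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0]])
print(iou(truth, pred), mean_iou(truth, pred))
```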

Precision quantifies the reliability of a model’s positive classifications by calculating the proportion of correctly identified positives among all samples predicted as positive. In contrast, Recall measures the model’s effectiveness in detecting actual positive instances, representing the proportion of true positives successfully identified out of all actual positives. To integrate these two aspects into a single metric, the F1-score is employed as the harmonic mean of Precision and Recall, providing a balanced evaluation of the model’s performance, particularly useful when class distribution is uneven. The corresponding formulas are presented as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$\mathrm{F1\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}$$
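Equations (8)–(10) can be checked with a few lines of plain Python. The pixel counts below are hypothetical and chosen only to make the arithmetic easy to follow; the function name `precision_recall_f1` and the zero-division convention are assumptions of this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 80 landslide pixels found, 20 false alarms, 10 missed.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

With these counts, precision is 80/100 = 0.8, recall is 80/90 ≈ 0.8889, and the F1-score is 160/190 ≈ 0.8421, which matches the equivalent form 2TP / (2TP + FP + FN).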

3. Results and Analysis

3.1. Data Preprocessing

Each of the Yarlung Zangbo River, Iburi-Tobu, and Bijie datasets (824, 1484, and 770 landslide images, respectively) was randomly split into training, validation, and test sets at a ratio of 80%:10%:10%. To evaluate the model’s predictive performance, a portion of the samples that were not used in training or validation was held out as new data. To prevent overfitting and enhance the model’s generalization capability, four data augmentation strategies were applied to all samples: Gaussian noise addition, multi-angle rotation, flipping, and darkening, as illustrated in Figure 5.
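The four augmentation strategies can be sketched in NumPy as follows. The noise level, the rotation angle, the flip axis and the darkening factor are hypothetical values, since the paper's settings are not given in this section, and `augment` is an illustrative name. For the geometric transforms (rotation and flipping), the label mask would need the same transform applied so that image and annotation stay aligned.

```python
import numpy as np

def augment(img, rng, noise_sigma=0.02, dark_factor=0.6):
    """img: float array in [0, 1] with shape (H, W, 3).
    Returns one augmented copy per strategy (parameter values are
    illustrative assumptions, not the paper's settings)."""
    noisy = np.clip(img + rng.normal(0.0, noise_sigma, img.shape), 0.0, 1.0)
    rotated = np.rot90(img, k=1, axes=(0, 1))  # 90 degree rotation
    flipped = img[:, ::-1, :]                  # horizontal flip
    darkened = img * dark_factor               # uniform darkening
    return {"noise": noisy, "rotate": rotated, "flip": flipped, "darken": darkened}

rng = np.random.default_rng(42)
tile = rng.random((256, 256, 3))
views = augment(tile, rng)
print({name: view.shape for name, view in views.items()})
```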

To better match the pretrained weights, the sizes of all the samples were uniformly adjusted to 256 × 256 pixels in the RGB three-channel format. The specific parameters are listed in Table 2.

3.2. Experimental Setup

To ensure experimental consistency, all investigations in this paper were performed under a Linux environment utilizing NVIDIA RTX A6000 GPUs. The Adam optimizer was adopted with an initial learning rate set at 1 × 10⁻⁴, which was dynamically adjusted during training via a cosine annealing schedule. A batch size of 16 was maintained across all experiments, with each model undergoing 60 training epochs.
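For reference, a cosine annealing schedule starting from the stated learning rate of 1 × 10⁻⁴ over 60 epochs can be written in closed form. The sketch below assumes a single annealing cycle without restarts and a minimum learning rate of 0, neither of which is specified in the text; `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(epoch, total_epochs=60, lr_init=1e-4, lr_min=0.0):
    """Learning rate at a given epoch under single-cycle cosine annealing.
    lr_min = 0 and the absence of warm restarts are assumptions."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term

for epoch in (0, 15, 30, 45, 60):
    print(epoch, f"{cosine_lr(epoch):.3e}")
```

Under these assumptions the rate starts at 1 × 10⁻⁴, passes through half that value at epoch 30, and decays smoothly towards the minimum at epoch 60.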

For the loss function, we employed a composite BCEDiceLoss that combines the binary cross-entropy (BCE) term with the Dice loss. The BCE component facilitates pixel-level probabilistic calibration, while the Dice term enhances overlap-based measurement to mitigate severe class imbalance between landslide targets and the image background. The joint loss is formulated as follows:

$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \cdot \log \sigma(x_{i}) + (1 - y_{i}) \cdot \log\left(1 - \sigma(x_{i})\right) \right] \tag{11}$$

$$L_{Dice} = 1 - \frac{1}{N} \sum_{j=1}^{N} \frac{2 \sum_{i} \sigma(x_{ji})\, y_{ji} + \epsilon}{\sum_{i} \sigma(x_{ji}) + \sum_{i} y_{ji} + \epsilon} \tag{12}$$

$$L_{Total} = \alpha \cdot L_{BCE} + (1 - \alpha) \cdot L_{Dice} \tag{13}$$

The BCE term optimizes probability calibration through a pixel-wise classification error, thereby enhancing the model’s sensitivity to fine details. The Dice term, on the other hand, measures the overlap between the predicted and ground-truth landslide regions, which improves regional consistency.
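Equations (11)–(13) can be sketched in NumPy as follows. This is an illustrative reading of the formulas rather than the training code used in the paper: the smoothing constant `eps`, the probability clipping that guards the logarithm, and the function name `bce_dice_loss` are assumptions, and BCE is averaged over all pixels in the batch while Dice is averaged over samples.

```python
import numpy as np

def bce_dice_loss(logits, targets, alpha=0.5, eps=1e-6):
    """logits, targets: arrays of shape (N, H, W); targets hold 0 or 1.
    alpha weights the BCE term and (1 - alpha) the Dice term (Eq. 13)."""
    n = logits.shape[0]
    x = logits.reshape(n, -1)
    y = targets.reshape(n, -1).astype(float)
    p = 1.0 / (1.0 + np.exp(-x))        # sigma(x)
    p = np.clip(p, 1e-7, 1.0 - 1e-7)    # numerical guard for log (assumption)
    # Eq. (11): mean binary cross-entropy over pixels.
    bce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # Eq. (12): one Dice coefficient per sample, then averaged.
    dice = (2.0 * (p * y).sum(axis=1) + eps) / (p.sum(axis=1) + y.sum(axis=1) + eps)
    dice_loss = 1.0 - dice.mean()
    return float(alpha * bce + (1.0 - alpha) * dice_loss)

# Two 4x4 tiles, each with a 2x2 landslide patch.
y = np.zeros((2, 4, 4))
y[:, 1:3, 1:3] = 1.0
confident_right = np.where(y == 1.0, 20.0, -20.0)
confident_wrong = -confident_right
print(bce_dice_loss(confident_right, y), bce_dice_loss(confident_wrong, y))
```

A confidently correct prediction drives both terms towards zero, whereas a confidently wrong one is penalized heavily by the BCE term and pushes the Dice loss towards 1.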

To determine the optimal weight combination of the BCE loss and the Dice loss for the landslide detection task, we conducted comparative experiments on three datasets: Yarlung Zangbo River, Iburi-Tobu, and Bijie. In the experiments, the weight α of the BCE loss was successively set to {1.0, 0.9, 0.8, …, 0.2, 0.1, 0}, with the weight of the Dice loss correspondingly set to 1 − α, while all other hyperparameters remained unchanged. The experimental results on the three datasets are shown in Figure 6.

As illustrated in Figure 6, on all three datasets with significantly different geological environments, equal weighting (α = 0.5) achieved the highest IoU and F1-score. This indicates that achieving a balance between pixel-level calibration (BCE) and region-level overlap (Dice) is optimal for the landslide detection task. Therefore, we adopted the configuration of α = 0.5 in the loss function for all experiments in this paper. The loss function is as follows:

$$L_{Total} = 0.5 \cdot L_{BCE} + 0.5 \cdot L_{Dice} \tag{14}$$

3.3. Results

3.3.1. Comparative Experiments

To comprehensively and rigorously evaluate the performance of the proposed model, this paper selected seven representative baseline models for comparative experiments: UNet, SegFormer, SegUNet, BiSeNetV2, AttUNet, SwinUNet and DCSAUNet. All models were trained and tested on the Yarlung Zangbo River, Iburi-Tobu and Bijie landslide datasets to validate their adaptability across different scenarios, encompassing both typical single-landslide morphologies and complex landslide geological features. The ground-truth labels used in the test sets are the known landslide inventories for the respective study areas.

As illustrated in Figure 7, CETransUNet demonstrates consistent superiority over established benchmarks across all three landslide datasets: Yarlung Zangbo River, Iburi-Tobu, and Bijie. The model achieves peak performance across all evaluation metrics, highlighting its robust generalization capacity and high-fidelity structural recognition across heterogeneous geomorphological environments.

In the karst geomorphic transition zones of the Bijie dataset, CETransUNet attained an Intersection over Union (IoU) of 79.20% and an F1-score of 88.24%, yielding significant improvements of 2.71% and 1.71% over the second-best model, SegFormer. The synchronized enhancement of Precision (2.19%) and Recall (1.19%) underscores the model’s ability to strike an optimal balance between identification accuracy and target coverage in complex transition landscapes.

Under the more taxing Iburi-Tobu scenario—defined by dense and highly fragmented landslide clusters—CETransUNet exhibited exceptional adaptability. It achieved a substantial IoU of 74.77%, outperforming SegFormer by a margin of 5.96%. Notably, the 4.08% surge in Recall reflects the model’s improved capacity to reconstruct intricate boundaries and localized fragments, thereby validating the efficacy of the EGA module’s structural refinement mechanism.

In the geologically extreme environment of the Yarlung Zangbo River dataset, characterized by dense, morphologically diverse landslides and steep valley shadows, CETransUNet demonstrated particularly salient advantages. It achieved an IoU of 80.26%, making it the only architecture to exceed the 80% IoU threshold, surpassing SegFormer by 5.46%. The peak F1-score of 89.01% (+3.49%) confirms that the synergistic integration of CA-based noise suppression and EGA-based edge guidance is pivotal for maintaining detection integrity amidst extreme topographic complexity. This multi-stage refinement ensures identification results with superior structural consistency and boundary clarity, offering a dependable technical framework for disaster response in high-risk geological zones.

As shown in Figure 8, Figure 9 and Figure 10, the comparative models (UNet, SegFormer, SegUNet, BiseNetV2, AttUNet, SwinUNet and DCSAUNet) identify the overall distribution areas of landslides on all three datasets but exhibit significant shortcomings in detail processing. Their prediction results commonly display blurred boundaries, serrated contours, internal holes and local mis-segmentation. These deficiencies are particularly prominent in the areas marked by red boxes in the figures.

In contrast, the proposed CETransUNet model demonstrates superior segmentation performance across the datasets. This model not only achieves more accurate identification of structurally complete landslide bodies but also exhibits stronger structural sensitivity and shape preservation capabilities when dealing with fragmented and dispersed landslide areas in complex scenes. Furthermore, the segmentation results generated by the CETransUNet model are characterized by clear boundaries and continuous structures, showing a high degree of overlap with the ground-truth annotations. This effectively mitigates the common problems of geometric distortion and edge blurring observed in traditional models.

By synthesizing the experimental results from the three datasets, the proposed model delivers stable performance not only in single-landslide scenarios but also possesses stronger robustness and structure-preserving ability for identifying complex landslide morphologies. This validates its effectiveness and advancement in practical landslide detection applications.

3.3.2. Ablation Experiment

To validate the effectiveness of the model improvements, this paper conducts ablation studies on three datasets—Yarlung Zangbo River, Iburi-Tobu, and Bijie—comparing the performance of the baseline TransUNet model with that of the improved model (Table 3, Table 4 and Table 5).

Theoretically, the introduction of the CA and EGA modules can effectively suppress the interference of distracting features on the segmentation results while enhancing the representation of target features, thereby improving the model’s segmentation performance. To rigorously validate the effectiveness and generalization capability of this synergistic mechanism, we compared the performance of the baseline model (TransUNet), the model with only CA introduced (No-EGA), the model with only EGA introduced (No-CA), and the complete model (CETransUNet) on three landslide datasets featuring different geological and geomorphological characteristics. As shown in Table 3, Table 4 and Table 5, the complete model improved IoU, MIoU, Recall, and F1-score over the baseline on all three datasets, and improved Precision on the Yarlung Zangbo River and Iburi-Tobu datasets. The only exception is Precision on the Bijie dataset, which decreased from 89.91% to 88.96% (discussed in Section 4.1). The performance improvement was also strongly scenario-dependent. On the Bijie dataset, characterized by relatively simple scenes and distinct landslide boundaries, the F1-score increased by only 1.96 percentage points. In contrast, on the Yarlung Zangbo River and Iburi-Tobu datasets, which are rich in group-occurring, morphologically fragmented landslides and complex backgrounds, the IoU increased by 4.65 and 5.43 percentage points, respectively, and the F1-score by 2.97 and 3.68 percentage points. This difference suggests that CETransUNet effectively mitigates feature confusion and boundary uncertainty in complex scenarios. For simpler datasets, such as Bijie, the baseline model can already learn features reasonably well, leaving limited room for further performance gains. However, in complex datasets like the Yarlung Zangbo River and the Iburi-Tobu, the CA module enhances positional awareness, effectively suppressing misclassification of spectrally similar features (e.g., bare rock vs. landslide mass). The EGA module, in turn, strengthens edge constraints, significantly improving the coherence and integrity of fragmented landslide boundaries.
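The gains quoted above can be re-derived from the table values. The following is a minimal sketch that copies the "Original" (baseline TransUNet) and CETransUNet rows from Tables 3, 4 and 5 and prints the differences in percentage points; the names `RESULTS` and `gains` are illustrative and not part of the authors' code.

```python
# Rows copied from Tables 3-5 (values in percent).
METRICS = ("IoU", "MIoU", "Precision", "Recall", "F1")
RESULTS = {
    "Yarlung Zangbo River": {
        "Original": (75.61, 86.66, 85.06, 87.08, 86.04),
        "CETransUNet": (80.26, 89.22, 88.80, 89.23, 89.01),
    },
    "Iburi-Tobu": {
        "Original": (69.34, 82.96, 82.56, 81.19, 81.85),
        "CETransUNet": (74.77, 86.00, 85.61, 85.48, 85.53),
    },
    "Bijie": {
        "Original": (76.17, 86.71, 89.91, 83.25, 86.28),
        "CETransUNet": (79.20, 88.37, 88.96, 87.74, 88.24),
    },
}


def gains(dataset):
    """Difference (percentage points) between CETransUNet and the baseline."""
    base = RESULTS[dataset]["Original"]
    full = RESULTS[dataset]["CETransUNet"]
    return {name: round(f - b, 2) for name, b, f in zip(METRICS, base, full)}


for dataset in RESULTS:
    print(dataset, gains(dataset))
```

Run on the table values, the only negative entry is Precision on the Bijie dataset, which is the case examined in Section 4.1.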

To provide an intuitive comparison of the performance differences among the various models in the landslide detection task, Figure 11 shows a visual comparison of the prediction results for different architectures. As shown in the figure, the baseline model exhibits significant boundary blurring and fragmentation when identifying single-landslide bodies, resulting in poor prediction coherence. The model incorporating only the CA module (No-EGA) partially suppresses the noise caused by the semantic gap compared to that of the baseline, leading to smoother boundaries; however, it still suffers from insufficient continuity and residual boundary ambiguity. In contrast, the model incorporating only the EGA module (No-CA) demonstrates significant improvements in boundary clarity and smoothness, albeit with a marginal reduction in identification accuracy. The proposed CETransUNet model, which simultaneously embeds both the CA and EGA modules, demonstrates significant advantages in terms of boundary integrity, structural continuity, and shape accuracy. When confronted with complex scenarios involving group-occurring and fragmented landslide morphologies, the baseline model, owing to its lack of an effective feature filtering mechanism, produces blurred responses to fragmented landslide boundaries and is prone to misclassifying spectrally similar features such as bare land and roads. The No-EGA variant, while suppressing some semantic gap noise, fails to adequately restore spatial structural details, leading to missed detections of densely distributed small landslides. The No-CA variant enhances local boundary clarity but introduces semantic noise due to inconsistent semantic hierarchies between the encoder and decoder, thereby preventing the effective fusion of deep and shallow features. CETransUNet, through the synergistic embedding of both the CA and EGA modules, significantly enhances adaptability to complex scenes: the CA module dynamically adjusts channel weights to suppress background interference and highlight salient landslide features, whereas the EGA module explicitly reinforces edge structures and spatial context, thereby improving the contour integrity and morphological preservation of fragmented landslides. Consequently, CETransUNet maintains a high identification accuracy even in highly fragmented areas, demonstrating stronger robustness and greater potential for practical applications.

3.4. Generalization Performance Experiment

To evaluate the generalization performance of the CETransUNet model in complex environments, this paper selected the Rizhaigou River basin in Jiuzhaigou County, Sichuan Province, as the experimental area. This region is characterized by complex geological conditions, abundant rainfall, and a high frequency of landslide disasters. A landslide inventory for the area was established through visual interpretation combined with unmanned aerial vehicle (UAV) imagery and Google Earth imagery. Given the significant differences in geological structure, triggering conditions, and landslide development characteristics between the Rizhaigou area and the previously utilized Yarlung Zangbo River, Iburi-Tobu, and Bijie regions, the widely used Maximum Mean Discrepancy (MMD) was adopted as a metric to effectively quantify the distribution differences among these regions. The fundamental principle of MMD is to measure the discrepancy between the mean embeddings of two distributions mapped into a Reproducing Kernel Hilbert Space (RKHS), thereby assessing their distribution consistency. Let sample sets

X = \{x_{1}, x_{2}, \dots, x_{n}\}

and

Y = \{y_{1}, y_{2}, \dots, y_{m}\}

be derived from two different distributions. The MMD calculation is expressed as follows [42]:

\mathrm{MMD}(X, Y) = \left\| \frac{1}{n} \sum_{i=1}^{n} f(x_{i}) - \frac{1}{m} \sum_{j=1}^{m} f(y_{j}) \right\|_{\mathcal{H}}

(15)

where \mathcal{H} denotes the RKHS and

f (\cdot)

represents the mapping function. In this paper, the Yarlung Zangbo River, Iburi-Tobu, and Bijie regions were selected as source domains, and the Jiuzhaigou area was used as the target domain. The MMD was calculated from the overall sample data of each region to evaluate the inter-domain distribution differences. The MMD value between Jiuzhaigou and the Yarlung Zangbo River is 1.3014, whereas the values between Jiuzhaigou and Iburi-Tobu and between Jiuzhaigou and Bijie are 2.1243 and 2.6862, respectively. The markedly lower MMD value for the Yarlung Zangbo River indicates that its distribution is the closest to that of Jiuzhaigou in the feature space, that is, the inter-domain difference is the smallest. Therefore, the model parameters trained on the Yarlung Zangbo River dataset were selected to conduct large-scale landslide identification experiments in the Jiuzhaigou watershed, comparing the generalization performance of the proposed CETransUNet model with that of the TransUNet model.
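To make Equation (15) concrete, the following is a minimal sketch of an empirical MMD estimate. It uses the kernel trick, so the mapping into the RKHS is never formed explicitly. The Gaussian (RBF) kernel, the bandwidth `gamma`, and the synthetic feature vectors produced by `sample` are all assumptions made for illustration; the article does not state which kernel or features were used, so the printed values are not comparable with the MMD values reported above.

```python
import math
import random


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel between two feature vectors (kernel choice is an assumption)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(a, b, gamma) for a in xs for b in ys)
    return total / (len(xs) * len(ys))


def mmd(xs, ys, gamma=0.125):
    """Biased empirical MMD: RKHS distance between the two mean embeddings."""
    squared = (mean_kernel(xs, xs, gamma)
               + mean_kernel(ys, ys, gamma)
               - 2.0 * mean_kernel(xs, ys, gamma))
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


def sample(rng, shift, count=40, dim=4):
    """Hypothetical stand-in for per-image feature vectors from one region."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
target = sample(rng, 0.0)       # plays the role of the target domain
near_source = sample(rng, 0.2)  # small distribution shift
far_source = sample(rng, 2.0)   # large distribution shift

print(f"near: {mmd(target, near_source):.4f}, far: {mmd(target, far_source):.4f}")
```

Under these assumptions, the source sample with the smaller shift yields the smaller MMD, which mirrors the selection rule used here: the source domain with the lowest MMD to the target is chosen for transfer.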

The experimental results, as illustrated in Figure 12, demonstrate that the CETransUNet model significantly outperforms the TransUNet model across all evaluation metrics. Even under complex interference conditions such as weak lighting, water reflection, and cloud or fog occlusion, CETransUNet effectively identified most of the landslide bodies in the region. It achieved an IoU of 75.41%, an MIoU of 86.70%, a Precision of 86.93%, a Recall of 85.05%, and an F1-score of 85.98%, indicating excellent landslide detection capability and scene adaptability. These results suggest that CETransUNet maintains high localization accuracy and identification completeness even when dealing with fragmented and densely distributed landslide targets. Further analysis of the recognition results (areas marked with red boxes in Figure 12) reveals a small number of misjudgments. The reasons for these errors may be twofold. First, subjective differences inherent in the manual landslide annotation process may introduce label noise. Second, partial cloud cover in the images distorts the spectral features of ground objects, and the low signal-to-noise ratio in cloud-obscured areas severely diminishes the textural information of landslides, consequently affecting the model’s judgment accuracy.

4. Discussion

4.1. Analysis of Results

In the systematic evaluation conducted in Section 3.3.2, the CETransUNet model demonstrated excellent overall segmentation performance across all test datasets. The high IoU and F1-score values collectively confirmed the effectiveness of the model architecture for feature extraction and semantic segmentation tasks. However, on the Bijie landslide dataset, a slight but noticeable decrease in Precision was observed compared to the TransUNet model. To investigate the underlying cause of this observation, a dedicated confusion matrix analysis was performed on the prediction results for this dataset.

As shown in Figure 13, although the number of FP increased marginally, the model achieved a significant improvement in identifying TP landslide pixels, which directly contributed to an enhancement in recall. A common characteristic of deep learning models is the inherent trade-off between precision and recall, where an improvement in one metric often leads to a corresponding reduction in the other [43]. This trade-off is generally modulated by the decision threshold set for landslide classification. When the model is tuned to reduce the risk of missed detections, it relaxes the classification criteria. This helps capture more true landslide pixels (increasing recall), but it also increases the likelihood of misclassifying non-landslide areas as landslides (raising FPs and consequently reducing precision). In critical applications such as emergency response, particularly for generating earthquake-induced landslide distribution maps, the primary objective is to detect as many potential landslide bodies as possible (i.e., achieve a high recall rate) to support practical disaster assessment and rescue operations. This priority outweighs the pursuit of extremely high precision, which might result from overly suppressing false positives. In such contexts, the consequence of missing actual landslide areas (a high FN rate) is far more critical than the additional workload involved in visually verifying potential false alarms (a relatively higher FP rate). Therefore, the performance characteristic of CETransUNet on the Bijie dataset, namely a slight reduction in precision (moderate increase in FP rate) in exchange for a substantial gain in recall (significant decrease in FN rate), should not be considered a model flaw. Instead, it highlights a practical and targeted advantage in real-world emergency scenarios. This performance characteristic may also be attributed to the unique topography, geomorphology, and textural features of landslides in the Bijie area. The model potentially learned more inclusive feature representations adapted to this specific region, leading to an optimization strategy that emphasizes the minimization of omissions, as reflected in the metrics.
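The threshold-driven trade-off described above can be illustrated with a small sketch. The per-pixel probabilities and labels below are hypothetical toy values, not outputs of CETransUNet, and the function name `precision_recall_at` is introduced only for this example. The shift observed on the Bijie dataset comes from the architecture, not from an explicit threshold change, so this only illustrates the general mechanism.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall of the landslide class at a given decision threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted landslide probabilities and ground-truth labels (1 = landslide).
probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]

for threshold in (0.75, 0.50, 0.35):
    precision, recall = precision_recall_at(probs, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this toy data, lowering the threshold from 0.75 to 0.35 raises recall from 0.5 to 1.0 while precision falls from 1.0 to 0.75, the same direction of change as reported for the Bijie dataset.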

4.2. Comparison of Model Complexity

As a landslide detection method designed to provide scientific decision-making support for geological disaster emergency response, the computational complexity and time overhead of the model are critical indicators for assessing its practicality. Since increased model complexity is often accompanied by significant hardware resource costs, a strict trade-off between detection accuracy and computational expense is essential in practical deployment. To quantitatively evaluate the computational efficiency of the proposed CETransUNet, this study adopts the number of parameters (Params), floating-point operations (FLOPs), inference speed (FPS), and training time as evaluation metrics. As shown in Table 6, introducing the CA and EGA modules into the baseline TransUNet architecture (Model B) to enhance detection capability increases the parameter count of the full model (Model C, the 12-layer CETransUNet) to 221.8 M and its FLOPs to 48.76 G, reduces its inference speed to 63.13 img/s, and significantly prolongs training time. This indicates that augmenting model representational capacity through attention mechanisms inevitably brings additional computational burden.

Given that the self-attention mechanism in the Transformer architecture has a computational complexity that grows quadratically with sequence length, making it the primary computational bottleneck, this paper effectively controls model complexity by reducing the number of standard Transformer layers. As shown in Table 6, Model A (the 8-layer CETransUNet) has a parameter count (188.2 M) that is very close to that of the baseline TransUNet (Model B, 192.4 M). Its FLOPs (43.12 G) are slightly higher than those of Model B (42.56 G), its FPS (81 img/s) is slightly lower than that of Model B (85 img/s), and its training time is also marginally increased. This phenomenon is attributed to the additional per-layer computational overhead introduced by the CA and EGA modules, together with the extra serialization processing burden imposed by the self-attention mechanism when handling two-dimensional feature maps.
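The quadratic growth mentioned above, and the effect of dropping from 12 to 8 Transformer layers, can be seen in a back-of-the-envelope count of multiply-accumulate operations (MACs). This is a rough sketch only: the token count of 256 and the embedding width of 768 are hypothetical values, and the count ignores the softmax, the MLP blocks, the CNN backbone, the decoder, and the CA and EGA modules. It is therefore not expected to reproduce the FLOPs in Table 6.

```python
def attention_macs(num_tokens, dim):
    """Approximate MACs of one self-attention layer, split into two parts."""
    projections = 4 * num_tokens * dim * dim      # Q, K, V and output projections (linear in tokens)
    pairwise = 2 * num_tokens * num_tokens * dim  # QK^T scores and weighted sum of V (quadratic in tokens)
    return projections, pairwise


def encoder_attention_macs(num_layers, num_tokens, dim):
    """Attention MACs summed over a stack of identical Transformer layers."""
    projections, pairwise = attention_macs(num_tokens, dim)
    return num_layers * (projections + pairwise)


DIM = 768     # hypothetical embedding width
TOKENS = 256  # hypothetical sequence length (e.g., a 16 x 16 token grid)

for tokens in (TOKENS, 2 * TOKENS):
    projections, pairwise = attention_macs(tokens, DIM)
    print(f"tokens={tokens}: projections={projections:,} pairwise={pairwise:,}")

for layers in (8, 12):
    print(f"layers={layers}: attention MACs={encoder_attention_macs(layers, TOKENS, DIM):,}")
```

Doubling the sequence length quadruples the pairwise term, whereas the layer count scales the total linearly, so an 8-layer stack needs two thirds of the attention cost of a 12-layer stack under these assumptions.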

Despite the comparable parameter counts and computational loads, Model A consistently achieves better detection performance than the baseline model (Model B) on all three datasets. As illustrated in Figure 14, the F1-score of Model A improves by 1.65%, 2.27%, and 1.05% over Model B on the Yarlung Zangbo River, Iburi-Tobu, and Bijie datasets, respectively, with simultaneous increases in IoU. This demonstrates that through architectural optimization (reducing the number of Transformer layers while embedding the CA and EGA modules), Model A achieves higher parameter efficiency and stronger feature representation capability. The performance gain is primarily attributable to the synergistic effect of the CA and EGA modules: the CA module enhances feature filtering via spatial location awareness, effectively suppressing semantic noise caused by complex backgrounds; the EGA module imposes explicit geometric constraints on landslide boundaries using edge priors, significantly improving segmentation precision. Notably, the extent of improvement varies across datasets—the gains are most pronounced in scenarios characterized by fragmented landslide morphology and substantial background noise, validating the adaptability of the proposed method to complex geological environments.

5. Conclusions

This study systematically established a post-earthquake landslide dataset for the alpine valley region of the Yarlung Zangbo River, effectively filling the void of high-quality, fine-grained landslide detection data in this region. To address the challenges of boundary blurring and geometric distortion in landslide identification, this paper proposes CETransUNet, a model that deeply integrates CNN and Transformer architectures. By innovatively incorporating CA and EGA modules, the model effectively bridges the semantic gap between the encoder and decoder, achieving a deep fusion of global semantic modeling and local boundary perception.

Extensive multi-scenario evaluations demonstrate that CETransUNet significantly outperforms mainstream models—including UNet, SegFormer, and SegUNet—in terms of IoU and F1-score across diverse geological environments, exhibiting superior robustness in geometric integrity recognition, especially in complex landscapes. Furthermore, by compressing the Transformer component to eight layers, the model significantly enhances computational efficiency while maintaining high precision, striking a balance between algorithmic performance and emergency response timeliness. Generalization experiments in the Jiuzhaigou region further validate the model’s potential for practical disaster prevention and mitigation. Although a trade-off exists between precision and recall, the model’s high-recall characteristic is better aligned with the practical requirements of disaster emergency identification. Future research will explore multi-source data fusion strategies to further suppress false positives in complex backgrounds.

Author Contributions

All the authors made significant contributions to this work. Conceptualization, C.Y. and J.W.; methodology, T.S.; software, T.S.; validation, T.S., Z.L. and Z.W.; formal analysis, T.S.; resources, C.Y.; writing—original draft preparation, T.S.; writing—review and editing, T.S. and X.C.; project administration, C.Y.; funding acquisition, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (Grants No. 42574037 and 42174032), the Shaanxi Provincial Education Department Local Service Special Program Project (No. 23JE001), and the Power Construction Corporation of China Science and Technology Project (No. DJ-ZDXM-2023-48).

Data Availability Statement

The data used in this paper is available at https://github.com/Dali202001020417/Dataset.git (accessed on 4 June 2026). If you would like to obtain the code used in the article, please contact the corresponding author.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their valuable time.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Qiu, P.; Pang, L.; Luo, Y.; Liu, Y.; Xing, H.; Liu, K.; Zhuang, G. Earthquake Event Knowledge Graph Construction and Reasoning. Geomat. Nat. Hazards Risk 2024, 15, 2383768.
2. Zhu, W.; Yang, L.; Cheng, Y.; Liu, X.; Zhang, R. Active Thickness Estimation and Failure Simulation of Translational Landslide Using Multi-Orbit InSAR Observations: A Case Study of the Xiongba Landslide. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103801.
3. Song, C.; Chen, B.; Li, Y.; Li, Z.; Du, J.; Yu, C.; Peng, J.; Liu, H.; Liu, Z.; Hu, X.; et al. Amplified Coseismic Loess Failure and Postseismic Landslide Acceleration Triggered by the 2023 Jishishan, China Earthquake. Eng. Geol. 2025, 352, 108074.
4. Lu, J.; He, Y.; Zhang, L.; Zhang, Q.; Gao, B.; Chen, H.; Fang, Y. Ensemble Learning Landslide Susceptibility Assessment with Optimized Non-Landslide Samples Selection. Geomat. Nat. Hazards Risk 2024, 15, 2378176.
5. Li, Y.; Zhu, W.; Wu, J.; Zhang, R.; Xu, X.; Zhou, Y. DBSANet: A Dual-Branch Semantic Aggregation Network Integrating CNNs and Transformers for Landslide Detection in Remote Sensing Images. Remote Sens. 2025, 17, 807.
6. Zhang, R.; Zhu, W.; Li, Z.; Zhang, B.; Chen, B. Re-Net: Multibranch Network With Structural Reparameterization for Landslide Detection in Optical Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 2828–2837.
7. Sul, A.; Patil, S. GIS and ML Integrated Techniques for Detection of Landslide-Prone Areas. In Proceedings of the 2024 2nd International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS), Erode, India, 23–25 October 2024; pp. 667–672.
8. Ghorbanzadeh, O.; Gholamnia, K.; Ghamisi, P. The Application of ResU-Net and OBIA for Landslide Detection from Multi-Temporal Sentinel-2 Images. Big Earth Data 2023, 7, 961–985.
9. Chamoli, V.; Bahuguna, R.; Gowri, R.; Prakash, R.; Vidyarthi, A.; Dubey, V.P. Landslide Detection in Uttarakhand Region Using Active Remote Sensing. In Proceedings of the 2024 2nd International Conference on Device Intelligence, Computing and Communication Technologies (DICCT), Dehradun, India, 15–16 March 2024; pp. 154–157.
10. Maxwell, A.E.; Warner, T.A.; Fang, F. Implementation of Machine-Learning Classification in Remote Sensing: An Applied Review. Int. J. Remote Sens. 2018, 39, 2784–2817.
11. Chen, B.; Li, Z.; Song, C.; Tomás, R.; Yu, C.; Zhu, W.; Peng, J. Unveiling the Long-Term Cascading Effects of the 2018 Baige Landslide and Subsequent Outburst Flood with Satellite Radar Observations. Remote Sens. Environ. 2026, 334, 115231.
12. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021, 8, 53.
13. Pritt, M.; Chern, G. Satellite Image Classification with Deep Learning. In Proceedings of the 2017 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 10–12 October 2017; pp. 1–7.
14. Wang, R.; Lei, T.; Cui, R.; Zhang, B.; Meng, H.; Nandi, A.K. Medical Image Segmentation Using Deep Learning: A Survey. IET Image Process. 2022, 16, 1243–1267.
15. Dao, D.V.; Jaafari, A.; Bayat, M.; Mafi-Gholami, D.; Qi, C.; Moayedi, H.; Phong, T.V.; Ly, H.-B.; Le, T.-T.; Trinh, P.T.; et al. A Spatially Explicit Deep Learning Neural Network Model for the Prediction of Landslide Susceptibility. CATENA 2020, 188, 104451.
16. Qi, W.; Wei, M.; Yang, W.; Xu, C.; Ma, C. Automatic Mapping of Landslides by the ResU-Net. Remote Sens. 2020, 12, 2487.
17. Prakash, N.; Manconi, A.; Loew, S. Mapping Landslides on EO Data: Performance of Deep Learning Models vs. Traditional Machine Learning Models. Remote Sens. 2020, 12, 346.
18. Yi, Y.; Zhang, W. A New Deep-Learning-Based Approach for Earthquake-Triggered Landslide Detection From Single-Temporal RapidEye Satellite Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 6166–6176.
19. Ghorbanzadeh, O.; Crivellari, A.; Ghamisi, P.; Shahabi, H.; Blaschke, T. A Comprehensive Transferability Evaluation of U-Net and ResU-Net for Landslide Detection from Sentinel-2 Data (Case Study Areas from Taiwan, China, and Japan). Sci. Rep. 2021, 11, 14629.
20. Cai, H.; Chen, T.; Niu, R.; Plaza, A. Landslide Detection Using Densely Connected Convolutional Networks and Environmental Conditions. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5235–5247.
21. Singh, A.; Dhiman, N.; Shukla, D.P. Transfer Learning in Landslide Susceptibility Mapping: Bridging Data-Rich and Data-Scarce Regions in the Northwestern Himalayas. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 3253–3256.
22. Liu, T.; Chen, T.; Niu, R.; Plaza, A. Landslide Detection Mapping Employing CNN, ResNet, and DenseNet in the Three Gorges Reservoir, China. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11417–11428.
23. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186.
24. Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning Deep Transformer Models for Machine Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Florence, Italy, 2019; pp. 1810–1822.
25. Cui, X.; Chen, X.; Zhou, J.; Lin, D. Transformer in Image Interpretation. In Proceedings of the International Conference on Computer Graphics, Artificial Intelligence, and Data Processing (ICCAID 2021); SPIE: Bellingham, WA, USA, 2022; Volume 12168, pp. 45–50.
26. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
27. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision Transformers for Remote Sensing Image Classification. Remote Sens. 2021, 13, 516.
28. Deng, P.; Xu, K.; Huang, H. When CNNs Meet Vision Transformer: A Joint Framework for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8020305.
29. Perera, M.V.; Bandara, W.G.C.; Valanarasu, J.M.J.; Patel, V.M. Transformer-Based SAR Image Despeckling. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 751–754.
30. Yang, Z.; Xu, C.; Li, L. Landslide Detection Based on ResU-Net with Transformer and CBAM Embedded: Two Examples with Geologically Different Environments. Remote Sens. 2022, 14, 2885.
31. Wei, Y.; Fu, X.; Zhang, B.; Qin, X.; Kou, P.; Wang, L.; Li, H. Efficient Multi-Source Deep Learning for Rapid Landslide Mapping in the Karst Mountains of Bijie, China. Geomat. Nat. Hazards Risk 2025, 17, 2608246.
32. Han, Z.; Fu, B.; Fang, Z.; Li, Y.; Li, J.; Jiang, N.; Chen, G. Dynahead-YOLO-Otsu: An Efficient DCNN-Based Landslide Semantic Segmentation Method Using Remote Sensing Images. Geomat. Nat. Hazards Risk 2024, 15, 2398103.
33. Chen, H.; Zhang, L.; Yan, S.; Li, X.; Wang, D. A Lightweight Rockfall Detection Method in Complex Environments Based on Receptive Field and Attention Mechanism: REGM-YOLO. Landslides 2025, 22, 4113–4131.
34. Liu, X.; Xu, L.; Zhang, J. Landslide Detection with Mask R-CNN Using Complex Background Enhancement Based on Multi-Scale Samples. Geomat. Nat. Hazards Risk 2024, 15, 2300823.
35. Du, J.; Gao, J.; Chen, T.; Tsewang; Pakgordolma. Spatiotemporal Variations of the Precipitation Concentration Index and Seasonal Precipitation Characteristics in the Yalung Zangbo River Basin from 1981 to 2024. Arid Zone Res. 2025, 42, 1159–1172.
36. Yamagishi, H.; Yamazaki, F. Landslides by the 2018 Hokkaido Iburi-Tobu Earthquake on September 6. Landslides 2018, 15, 2521–2524.
37. Xu, Y.; Ouyang, C.; Xu, Q.; Wang, D.; Zhao, B.; Luo, Y. CAS Landslide Dataset: A Large-Scale and Multisensor Dataset for Deep Learning-Based Landslide Detection. Sci. Data 2024, 11, 12.
38. Ji, S.; Yu, D.; Shen, C.; Li, W.; Xu, Q. Landslide Detection from an Open Satellite Imagery and Digital Elevation Model Dataset Using Attention Boosted Convolutional Neural Networks. Landslides 2020, 17, 1337–1352.
39. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 21–25 June 2021.
40. Bui, N.-T.; Hoang, D.-H.; Nguyen, Q.-T.; Tran, M.-T.; Le, N. MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2023.
41. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19.
42. Borgwardt, K.; Gretton, A.; Rasch, M.; Kröger, P.; Schölkopf, B.; Smola, A. Integrating Structured Biological Data by Kernel Maximum Mean Discrepancy. Bioinformatics 2006, 22, e49–e57.
43. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-End Semi-Supervised Object Detection with Soft Teacher. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3040–3049.

Figure 1. Geographical locations of the three study areas (Yarlung Zangbo River, Iburi-Tobu, and Bijie). (a) Geographical location of the study areas, (b) distribution of landslides along the Yarlung Zangbo River, (c) actual photograph of the Yarlung Zangbo River, (d) remote sensing image of the Iburi-Tobu dataset, and (e) topographic map of Bijie.

Figure 2. Structure of the CETransUNet model.

Figure 3. Coordinate attention module architecture.

Figure 4. Edge-guided attention module structure.

Figure 5. Data augmentation.

Figure 6. Performance comparison under different BCE loss weights on three landslide datasets.

Figure 7. Performance metrics of different models on the Yarlung Zangbo River, Iburi-Tobu, and Bijie landslide datasets. Subfigures (a–e), (f–j), and (k–o) represent the IoU, MIoU, Precision, Recall, and F1-score of the models on the Yarlung Zangbo River, Iburi-Tobu, and Bijie datasets, respectively.

Figure 8. Model test results on the Yarlung Zangbo River dataset.

Figure 9. Model test results on the Iburi-Tobu dataset.

Figure 10. Model test results on the Bijie dataset.

Figure 11. Experimental results of the ablation study on the Yarlung Zangbo River, Iburi-Tobu, and Bijie landslide datasets.

Figure 12. Application of CETransUNet in the Jiuzhaigou region.

Figure 13. Confusion matrices of the two models on the Bijie dataset. (a) Confusion matrix of the TransUNet model; (b) confusion matrix of the CETransUNet model; (c) difference in confusion matrices between the two models; (d) visualization of the confusion matrices.

Figure 14. The impact of the number of Transformer layers on model performance. (a) shows the model’s performance on the Yarlung Zangbo River dataset, (b) shows the model’s performance on the Iburi-Tobu dataset, and (c) shows the model’s performance on the Bijie dataset.

Table 1. Confusion matrix for a binary classification task.

	Prediction False	Prediction Truth
Ground False	TN	FP
Ground Truth	FN	TP
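
The entries of Table 1 are the inputs to the evaluation indicators reported in Tables 3 to 5. The following is a minimal sketch of how those indicators can be derived from the four counts, assuming the standard definitions; in particular, it assumes that MIoU is the mean of the landslide-class IoU and the background-class IoU. The function name `segmentation_metrics` and the zero-denominator convention are illustrative choices, not taken from the paper.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Compute binary segmentation metrics from confusion-matrix counts.

    Assumes the standard definitions; MIoU is taken here as the mean of
    the landslide (positive) IoU and the background (negative) IoU.
    """
    def safe_div(num, den):
        # Illustrative convention: return 0.0 when the denominator is zero.
        return num / den if den else 0.0

    iou = safe_div(tp, tp + fp + fn)             # landslide-class IoU
    iou_background = safe_div(tn, tn + fp + fn)  # background-class IoU
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)

    return {
        "iou": iou,
        "miou": (iou + iou_background) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical pixel counts, for illustration only.
    for name, value in segmentation_metrics(tp=80, fp=20, fn=20, tn=880).items():
        print(f"{name}: {100 * value:.2f}%")
```

Counts can be accumulated over all test pixels before calling the function, or the metrics can be averaged per image; the two conventions generally give slightly different values.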

Table 2. Dataset information.

Dataset	Size	Band	Train/Val/Test	Resolution
Yarlung Zangbo River	256 × 256	RGB	3296/412/412	3 m
Iburi-Tobu	256 × 256	RGB	5940/740/740	3 m
Bijie	256 × 256	RGB	3080/385/385	0.8 m

Table 3. Model performance on the Yarlung Zangbo River dataset.

Models	IoU (%)	MIoU (%)	Precision (%)	Recall (%)	F1-Score (%)
Original	75.61	86.66	85.06	87.08	86.04
No-EGA	78.36	88.17	86.80	88.91	87.82
No-CA	78.59	88.30	87.08	88.88	87.95
CETransUNet	80.26	89.22	88.80	89.23	89.01

Table 4. Model performance on the Iburi-Tobu dataset.

Models	IoU (%)	MIoU (%)	Precision (%)	Recall (%)	F1-Score (%)
Original	69.34	82.96	82.56	81.19	81.85
No-EGA	72.48	84.70	83.42	84.62	84.00
No-CA	71.60	84.22	83.27	83.58	83.41
CETransUNet	74.77	86.00	85.61	85.48	85.53

Table 5. Model performance on the Bijie dataset.

Models	IoU (%)	MIoU (%)	Precision (%)	Recall (%)	F1-Score (%)
Original	76.17	86.71	89.91	83.25	86.28
No-EGA	76.93	87.10	86.81	86.96	86.77
No-CA	77.54	87.47	88.70	85.97	87.19
CETransUNet	79.20	88.37	88.96	87.74	88.24

Table 6. Comparison of models in terms of computational time and number of parameters. A represents the 8-layer transformer version of CETransUNet, B represents the 12-layer transformer version of TransUNet, and C represents the 12-layer transformer version of CETransUNet.

Models	Total Number of Parameters	FLOPs (G)	FPS (img/s)	Training Time (Yarlung Zangbo River)	Training Time (Iburi-Tobu)	Training Time (Bijie)
A	188.2 M	43.12	81	80 min	83 min	45 min
B	192.4 M	42.56	85	75 min	79 min	41 min
C	221.8 M	48.76	63.13	123 min	130 min	70 min
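
To make the trade-offs in Table 6 easier to read, the sketch below expresses each configuration's parameter count, FLOPs, and FPS as a percentage change relative to the 12-layer TransUNet baseline (B). The numbers are copied from Table 6; the dictionary layout and the helper name `relative_change` are illustrative assumptions.

```python
# Values copied from Table 6 (parameters in millions, FLOPs in G, FPS in img/s).
TABLE6 = {
    "A": {"params_m": 188.2, "flops_g": 43.12, "fps": 81.0},   # 8-layer CETransUNet
    "B": {"params_m": 192.4, "flops_g": 42.56, "fps": 85.0},   # 12-layer TransUNet
    "C": {"params_m": 221.8, "flops_g": 48.76, "fps": 63.13},  # 12-layer CETransUNet
}


def relative_change(model, baseline="B"):
    """Return the percentage change of each column versus the baseline model."""
    base = TABLE6[baseline]
    return {
        key: 100.0 * (TABLE6[model][key] - base[key]) / base[key]
        for key in base
    }


if __name__ == "__main__":
    for model in ("A", "C"):
        changes = relative_change(model)
        summary = ", ".join(f"{key}: {value:+.2f}%" for key, value in changes.items())
        print(f"{model} vs B -> {summary}")
```

Read this way, the 8-layer configuration (A) stays within a few percent of the baseline on all three columns, while the 12-layer configuration (C) adds roughly 15% in parameters and FLOPs and loses about a quarter of the throughput.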

