1. Introduction
Landslides, characterized by the gravity-driven instability and movement of rock and soil masses on slopes, pose a significant threat to human life, property, infrastructure, and ecosystems worldwide [1]. In the context of climate change and the increasing frequency of extreme weather events, the rapid and precise identification of landslides is critical for disaster risk assessment and emergency response [2]. Traditional landslide monitoring relies heavily on field geological surveys. While offering high precision, these methods are often constrained by high time costs and limited spatial coverage, making them inadequate for monitoring large-scale or inaccessible hazardous terrains [3,4].
With rapid advancements in Earth observation technology, remote sensing has emerged as the primary approach for landslide detection, owing to its advantages in macroscopic and dynamic monitoring [5,6]. Over the past few decades, landslide detection based on remote sensing imagery has undergone a paradigm shift from manual visual interpretation to machine learning, and subsequently to deep learning [7,8]. Although manual interpretation offers acceptable accuracy, it fails to meet the demands of large-scale, rapid response. Meanwhile, traditional machine learning methods are limited by their shallow feature extraction capabilities, struggling to resolve high-dimensional abstract features in complex backgrounds [9,10].
In recent years, deep learning has witnessed rapid advancements in the field of landslide detection. Unlike traditional machine learning, deep learning models possess the capability to learn directly from image data, automatically extracting low-level features, such as textures and edges, as well as high-level features, including shapes and spatial context. This mechanism enables the capture of deeper and more abstract semantic information from the imagery [11,12]. In the deep learning domain, landslide detection is predominantly formulated as a semantic segmentation task, the objective of which is to classify every pixel in an image to precisely delineate the extent and boundaries of landslides. Consequently, a multitude of studies have employed semantic segmentation models to achieve automated landslide recognition [13]. For instance, Fu et al. proposed a lightweight network optimized for on-board landslide segmentation; this model utilizes CSPDarknet-tiny as an efficient encoder backbone to enhance accuracy and robustness while maintaining a low parameter count [14]. To simultaneously leverage global context and local deep features, Liu et al. designed a dual-branch encoder, in which a Transformer branch captures global dependencies while a Convolutional Neural Network (CNN) branch specializes in extracting abstract features. These are then integrated via a multi-scale feature fusion module to refine landslide boundary details [15]. Addressing the challenge of weak model generalization in novel regions, Zhang et al. [16] proposed a cross-domain landslide segmentation method based on Multi-Target Domain Adaptation (MTDA). This approach employs a progressive “near-to-far” learning strategy to align feature distributions across different regions, achieving outstanding performance on large-scale datasets comprising multiple heterogeneous domains [16]. Despite these significant strides, existing segmentation methods continue to face considerable challenges. Current approaches rely primarily on unimodal RGB optical imagery, which is prone to spectral confusion in complex geological environments. Specifically, the spectral characteristics of bare soil or rock masses resulting from landslides are strikingly similar to those of bare farmland, construction sites, and natural bedrock. Models relying solely on RGB are susceptible to confusing these distinct classes. These factors obscure the visual features of landslides, thereby severely compromising the generalization capability of the models [17,18].
To overcome the limitations of unimodal models and enhance their discriminative capability, incorporating multimodal data that provides complementary information has emerged as a pivotal research direction [19]. For instance, Liu et al. [20] proposed an integrated segmentation framework that includes a specialized multimodal branch for extracting elevation features from a Digital Elevation Model (DEM). Optimized via a terrain-guided loss function, the framework demonstrated the effectiveness of DEM features in landslide segmentation tasks [20]. Ghorbanzadeh et al. evaluated four advanced models on the L4S dataset, analyzing the impact of various spectral input combinations on model training. The study revealed that extending unimodal RGB inputs to multimodal data improved the performance of U-Net-based architectures, whereas the performance of Transformer-based architectures deteriorated [21]. Addressing the challenge of identifying visually indistinct old landslides, Chen et al. proposed FFS-Net, which fuses the texture features of optical imagery with the terrain features of DEMs at high semantic levels, significantly enhancing the model’s capability to detect old landslides [22]. These studies demonstrate that multi-source data fusion can construct feature representations that are far more comprehensive and robust than their unimodal counterparts, marking it as a critical approach for achieving robust and accurate landslide segmentation. However, existing methods predominantly employ static fusion mechanisms that fail to adaptively adjust the contribution of each modality according to the context, thereby constraining further improvements in model performance [23,24].
In the broader field of computer vision, more extensive research into multimodal fusion has led to the emergence of a series of advanced dynamic fusion strategies [25,26,27]. For instance, CMX leverages meticulously designed cross-modal feature rectification and fusion modules to facilitate granular interaction and correction of multimodal features at various stages of encoding and decoding. This effectively enhances complementary information between modalities while suppressing noise [28]. Similarly, CMNeXt introduces a highly efficient cross-modal attention module that significantly reduces computational complexity while improving fusion performance, thereby achieving a superior balance between efficiency and accuracy [29]. Addressing the challenge of varying multimodal data quality, EAEFNet employs a dual-branch architecture to differentially process multimodal information of unequal quality, achieving enhancement and compensation for features from each modality [30]. Although these multimodal models exhibit outstanding performance, they are primarily designed for datasets such as RGB-D (Depth), and their fusion methodologies are difficult to adapt directly to the unique demands of remote sensing landslide segmentation. First, these models typically treat the contributions of different modalities indiscriminately. However, in landslide segmentation tasks, high inter-class spectral similarity often causes spectral features to introduce substantial redundancy and noise, potentially leading to model overfitting. Second, there are often significant resolution disparities between modalities in landslide datasets. This necessitates precise feature alignment while bridging the semantic gap between modalities [31,32].
To address the aforementioned challenges, this paper proposes TriGEFNet, a Triple-Stream Guided Enhancement and Fusion Network designed to resolve the difficulties of multimodal feature alignment and fusion through an asymmetric guidance mechanism. Concurrently, to validate the model’s robustness in challenging environments, we constructed a benchmark dataset comprising multi-sensor heterogeneous data—the Zunyi Landslide Dataset.
The main contributions of this paper are summarized as follows:
This paper introduces TriGEFNet, a triple-stream multimodal fusion network featuring a novel guided enhancement and fusion strategy to tackle noise and redundancy. In the encoder, the Multimodal Guided Enhancement Module (MGEM) first mitigates inconsistent data quality by independently enhancing each stream’s features. Then, the Dominant-stream Guided Fusion Module (DGFM), led by the semantically rich RGB stream, selectively integrates Slope and VI features to achieve an efficient, asymmetric fusion. In the decoder, the Gated Skip Refinement Module (GSRM) adaptively filters skip connections, preventing redundant information flow while preserving crucial spatial details for accurate boundary delineation. Collectively, these components allow TriGEFNet to learn highly discriminative representations for landslide segmentation in complex environments.
We constructed the Zunyi Landslide Dataset, tailored for complex scenarios. This dataset integrates significant cross-modal resolution disparities with multi-source data heterogeneity. It provides a challenging benchmark for evaluating the generalization ability of multimodal fusion algorithms in actual geological environments.
We conducted comprehensive comparative experiments on the Zunyi, Bijie [33], and Landslide4Sense (L4S) [34] datasets. The proposed model was evaluated against a series of classic semantic segmentation models and advanced multimodal fusion models. Experimental results demonstrate that TriGEFNet achieves superior performance across multiple key evaluation metrics, including mean Intersection over Union (mIoU). This fully validates the model’s robust capability for high-performance landslide segmentation in complex environments and highlights its significant value for practical applications.
3. Methodology
In this paper, we propose TriGEFNet, a deep neural network for landslide segmentation from multimodal imagery that introduces a novel fusion paradigm: Independent Encoding, Interactive Enhancement, and Asymmetric Fusion. Illustrated in Figure 4, the network is built upon the classic U-Net [44] framework and employs ResNet34 [45] as its backbone. A key principle of the architecture lies in the multi-branch feature decoupling design of the encoder. We configure three parallel encoders with non-shared parameters for RGB imagery, VI, and Slope, respectively. This configuration allows the network to learn the modality-specific semantic distributions inherent to each data source. To achieve the efficient integration of heterogeneous features, TriGEFNet incorporates three key components: MGEM, DGFM, and GSRM. These modules are designed to facilitate the efficient interaction and fusion of multimodal features, thereby enhancing the final segmentation performance. This section provides a detailed analysis of the core modules constituting the network, followed by an introduction to the loss function and performance evaluation metrics utilized for model optimization.
3.1. Multimodal Guided Enhancement Module (MGEM)
To improve the recognition accuracy of landslide areas in remote sensing imagery under complex scenes, we introduce the VI as a spectral feature indicating surface vegetation disruption, and leverage Slope data as a geographic constraint reflecting the likelihood of landslide occurrence. However, naive feature concatenation or element-wise addition neglects the heterogeneity and spatial inconsistency of contributions from different modalities. For instance, in shadowed areas, RGB imagery often suffers from information loss and high noise due to poor lighting, whereas Slope data remains unaffected and reliable. Conventional fusion merges these modalities indiscriminately, causing the optical noise to contaminate the critical geometric features. Consequently, such methods fail to capture deep conditional dependencies. To fully exploit the synergistic potential among multimodal data, we designed the MGEM.
Figure 5 illustrates the detailed structure of the MGEM. For clarity, the diagram exclusively depicts the enhancement workflow for the RGB features; the processes for the VI and Slope branches are identical. The MGEM comprises a Guidance Feature Generation Network (Guidance Net) and three parallel Feature Enhancers. First, the module concatenates the multimodal feature maps ($F_{\mathrm{RGB}}$, $F_{\mathrm{VI}}$, and $F_{\mathrm{Slope}}$) from the same encoder level along the channel dimension. This constructs a unified feature representation that retains the original contextual information of each modality. The representation is then fed into the Guidance Net, which employs a stacked convolutional block (comprising 1 × 1 and 3 × 3 convolutions) to implicitly model inter-modal dependencies and aggregate local spatial context. This process generates a guidance feature, $G$, which integrates complementary information from all three sources. Subsequently, this guidance feature serves as a shared spatial context prior and is distributed to the three Feature Enhancers.
Within each enhancer, $G$ passes through a lightweight convolutional network with a Sigmoid activation to generate an adaptive spatial attention map. This map acts as a pixel-wise spatial gate. By performing element-wise multiplication with the attention map, the model spatially recalibrates the feature representation: regions within the original map that possess high discriminative value for landslide segmentation are enhanced, while the weights of redundant or noisy information are effectively attenuated. Finally, the optimized features are added to the original features via a residual connection to generate the enhanced output. This residual connection ensures that the unique characteristic information of each modality is preserved. The formulation of the module is as follows:

$$\hat{F}_{\mathrm{RGB}} = F_{\mathrm{RGB}} + \sigma\big(\mathrm{Conv}_{\mathrm{RGB}}(G)\big) \odot F_{\mathrm{RGB}},$$

where $\sigma$ denotes the Sigmoid activation function, $\mathrm{Conv}_{\mathrm{RGB}}(\cdot)$ denotes the lightweight convolutional network of the RGB enhancer, and $\odot$ denotes the Hadamard product. The same process is applied in parallel to the VI and Slope branches. By generating spatial attention maps via $\sigma(\mathrm{Conv}(G))$, MGEM enables the model to dynamically adjust the spatial weights of each unimodal feature map based on the fused multimodal context, facilitating the learning of cross-modal conditional dependencies. Consequently, MGEM realizes the interaction of gain information among multimodal features, allowing each modality to absorb complementary information from others while retaining its own characteristics, thereby enhancing feature robustness and discriminability.
3.2. Dominant-Stream Guided Fusion Module (DGFM)
Following the parallel enhancement of features from each modality by the MGEM, it is necessary to fuse them into a unified representation for subsequent processing by the decoder. To prevent the robust spatial context features extracted by the RGB encoder from being compromised by potential noise or redundant information within the auxiliary modalities, we designed the DGFM, the schematic of which is illustrated in Figure 6. In landslide segmentation tasks, RGB imagery provides the richest and most critical spatial context and spectral information. Consequently, we establish the RGB stream as the dominant modality, while treating the VI (providing supplementary spectral information) and Slope (providing terrain constraints) as auxiliary modalities. The design of the DGFM aims to leverage the dominant stream to guide and regulate the integration process of the auxiliary streams, ensuring that only beneficial information from the auxiliary features contributes to the fusion.
The specific implementation of the DGFM is as follows: First, the dominant feature $\hat{F}_{\mathrm{RGB}}$ is input into a gating generator composed of lightweight convolutions to generate a spatial attention gating map, $M$. This gating map functions as a dynamic spatial filter, with its weight distribution determined entirely by the feature information of the dominant stream. Subsequently, this gating map is simultaneously applied to $\hat{F}_{\mathrm{VI}}$ and $\hat{F}_{\mathrm{Slope}}$, filtering the features of the auxiliary modalities through element-wise multiplication. Finally, the dominant stream feature is concatenated along the channel dimension with the two filtered auxiliary stream features. This combined tensor is then processed by a convolutional block for final information integration and dimensionality reduction.
The entire fusion process can be formulated as follows:

$$M = \sigma\big(\mathrm{Conv}(\hat{F}_{\mathrm{RGB}})\big), \qquad F_{\mathrm{fused}} = \mathrm{ConvBlock}\big(\big[\hat{F}_{\mathrm{RGB}},\ M \odot \hat{F}_{\mathrm{VI}},\ M \odot \hat{F}_{\mathrm{Slope}}\big]\big),$$

where $[\cdot]$ denotes channel-wise concatenation, $\sigma$ the Sigmoid activation, and $\odot$ the Hadamard product.
Through its unique “guidance-gating” mechanism, the DGFM ensures that only high-relevance auxiliary information beneficial to the dominant modality participates in the final decision-making process. This not only maximizes the preservation of the core feature integrity but also achieves adaptive denoising and screening of auxiliary information, thereby accomplishing a prioritized, robust, and efficient feature fusion.
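The following sketch illustrates one possible realization of this guidance-gating scheme in PyTorch; the gating-generator depth, channel widths, and module name are assumptions for illustration only.

```python
# Sketch of DGFM: the gating map is produced solely from the dominant RGB stream
# and applied only to the auxiliary VI and Slope streams before fusion.
import torch
import torch.nn as nn

class DGFM(nn.Module):
    def __init__(self, ch_rgb: int, ch_vi: int, ch_slope: int, out_ch: int):
        super().__init__()
        # Gating generator: lightweight convs on the dominant features -> spatial gate in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(ch_rgb, ch_rgb // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch_rgb // 4, 1, 3, padding=1), nn.Sigmoid(),
        )
        # Final integration and channel reduction after concatenation.
        self.fuse = nn.Sequential(
            nn.Conv2d(ch_rgb + ch_vi + ch_slope, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, f_rgb, f_vi, f_slope):
        m = self.gate(f_rgb)                                        # dynamic spatial filter M
        fused = torch.cat([f_rgb, m * f_vi, m * f_slope], dim=1)    # gate only the auxiliary streams
        return self.fuse(fused)

# Example usage at one encoder level.
fusion = DGFM(128, 128, 128, out_ch=128)
y = fusion(torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32))
```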
3.3. Gated Skip Refinement Module (GSRM)
During the decoding stage, to effectively bridge the semantic gap between the high-resolution spatial details provided by the encoder and the high-level semantic information generated by the decoder, we designed the GSRM. The schematic of the GSRM is illustrated in Figure 7. First, the feature map $F_{d}$ from the decoder is processed by a gating controller composed of two 1 × 1 convolutions. This controller extracts rich contextual information to generate a spatial attention map, $A$. Subsequently, $A$ is employed to perform element-wise weighting on the encoder feature $F_{e}$. This operation directs focus toward target regions critical for the segmentation task while simultaneously suppressing redundant information and noise. Following this, the filtered encoder feature is concatenated with $F_{d}$ along the channel dimension. Finally, the concatenated features are fed into a Refinement Block (RB). The purpose of this block is to facilitate the deep alignment of these two heterogeneous features within the local spatial domain. The cascaded 3 × 3 convolution blocks are capable of learning and modeling complex local correlations within the concatenated features. They smoothly integrate semantic information with precise boundary details, ultimately generating a more robust and discriminative feature representation. The formulation of the GSRM is expressed as follows:

$$A = \sigma\big(\mathrm{Conv}_{1\times1}(\mathrm{Conv}_{1\times1}(F_{d}))\big), \qquad F_{\mathrm{out}} = \mathrm{RB}\big(\big[A \odot F_{e},\ F_{d}\big]\big),$$

where $\sigma$ denotes the Sigmoid activation, $\odot$ the Hadamard product, and $[\cdot]$ channel-wise concatenation.
In summary, the GSRM ensures that only highly relevant low-level features participate in the fusion process, effectively bridging the semantic gap while mitigating noise interference. The subsequent refinement process guarantees the deep and seamless integration of these two feature types, achieving a truly organic fusion.
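A minimal sketch of this gated skip connection in PyTorch follows; the channel arguments and internal layer sizes are illustrative assumptions, not the exact configuration used in TriGEFNet.

```python
# Sketch of GSRM: the decoder feature gates the encoder skip feature, and a
# refinement block (cascaded 3x3 convs) fuses the concatenated result.
import torch
import torch.nn as nn

class GSRM(nn.Module):
    def __init__(self, dec_ch: int, enc_ch: int, out_ch: int):
        super().__init__()
        # Gating controller: two 1x1 convolutions producing a spatial attention map A.
        self.gate = nn.Sequential(
            nn.Conv2d(dec_ch, dec_ch // 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dec_ch // 2, 1, 1), nn.Sigmoid(),
        )
        # Refinement Block (RB): cascaded 3x3 convolutions aligning the two feature types.
        self.rb = nn.Sequential(
            nn.Conv2d(dec_ch + enc_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, f_dec, f_enc):
        a = self.gate(f_dec)                                  # spatial attention from decoder semantics
        return self.rb(torch.cat([a * f_enc, f_dec], dim=1))  # gated skip feature + decoder feature

# Example: fuse an upsampled decoder feature with the matching encoder skip feature.
gsrm = GSRM(dec_ch=128, enc_ch=64, out_ch=64)
out = gsrm(torch.randn(2, 128, 64, 64), torch.randn(2, 64, 64, 64))
```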
3.4. Upsample, SCSE, and Segmentation Head (SH)
Within the decoder, at each stage, the resolution of deep feature maps is first upsampled via bilinear interpolation. Subsequently, the upsampled results are fed into the GSRM to implement the skip connection. The resulting fused features are then processed by the Spatial and Channel Squeeze and Excitation (SCSE) module. The SCSE module adaptively enhances feature information critical for landslide segmentation while suppressing redundancy by applying concurrent attention weighting across both channel and spatial dimensions [46].
After the decoder restores the feature maps to the same resolution as the original input image through a series of upsampling and fusion operations, the Segmentation Head (SH) serves as the final output layer of the model. It is responsible for transforming these semantically rich feature maps into the final pixel-level segmentation prediction. In the proposed model, the SH is designed with an efficient structure, primarily consisting of a 3 × 3 convolutional layer.
The formulation of the SH is as follows:

$$\hat{Y} = \mathrm{Conv}_{3\times3}(F_{\mathrm{dec}}),$$

where $F_{\mathrm{dec}}$ denotes the feature map output by the final decoder stage and $\hat{Y}$ the single-channel prediction map.
The primary function of this convolutional layer is to reduce the channel dimensionality of the high-dimensional feature maps from the final decoder layer to match the number of target classes. Specifically, for the binary classification task in this study, this layer reduces the channel count of the input feature maps to 1.
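For reference, the sketch below shows the standard SCSE formulation [46] together with a 3 × 3-convolution segmentation head; it is a simplified illustration under our own naming and reduction ratio, not the exact implementation used in TriGEFNet.

```python
# Sketch of the SCSE attention block and the segmentation head (SH).
import torch
import torch.nn as nn

class SCSE(nn.Module):
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        # Channel squeeze-and-excitation (cSE): global pooling -> bottleneck -> channel weights.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
        # Spatial squeeze-and-excitation (sSE): 1x1 conv -> per-pixel weights.
        self.sse = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)

class SegmentationHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # reduce channels to the class count

    def forward(self, x):
        return self.conv(x)  # logits; a Sigmoid is applied at inference for the binary task

x = torch.randn(2, 64, 256, 256)
logits = SegmentationHead(64, 1)(SCSE(64)(x))   # shape (2, 1, 256, 256)
```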
3.5. Loss Function
Landslide segmentation is a task typically characterized by severe class imbalance. In existing landslide datasets, the proportion of pixels representing landslide areas is usually far smaller than that of the non-landslide background. Standard Cross-Entropy Loss penalizes errors for every pixel with equal weight. In such scenarios, the loss generated by the overwhelming majority of background pixels dominates the gradient direction, biasing the model towards predicting all pixels as background and potentially leading to overfitting to the majority class. To effectively address this challenge, we employ a composite loss function that combines the strengths of Dice Loss and Focal Loss. This approach aims to optimize the model simultaneously from the perspectives of regional overlap and hard sample mining.
Dice Loss ($\mathcal{L}_{\mathrm{Dice}}$), derived from the Dice coefficient used to measure set similarity, directly optimizes the degree of overlap between the predicted region and the ground truth. Its primary advantage lies in its inherent insensitivity to class imbalance. It is defined as follows:

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} y_{i}\, p_{i} + \epsilon}{\sum_{i=1}^{N} y_{i} + \sum_{i=1}^{N} p_{i} + \epsilon},$$

where $N$ represents the total number of pixels, $y_{i}$ and $p_{i}$ denote the ground truth label and the model’s predicted probability for the positive class of the $i$-th pixel, respectively, and $\epsilon$ is a smoothing constant added to enhance numerical stability.
Focal Loss ($\mathcal{L}_{\mathrm{Focal}}$) represents a dynamically weighted improvement over standard Cross-Entropy Loss. By introducing a modulating factor, it automatically reduces the contribution of the vast number of easy samples during loss calculation. This mechanism forces the model to focus its learning on positive and negative samples that are difficult to distinguish. It is defined as follows:

$$\mathcal{L}_{\mathrm{Focal}} = -\frac{1}{N}\sum_{i=1}^{N} \alpha_{t}\,\big(1 - p_{t,i}\big)^{\gamma}\,\log\big(p_{t,i}\big),$$

where $N$ is the total number of pixels. For the $i$-th pixel, $p_{t,i}$ represents the model’s predicted probability for the correct class; $\alpha_{t}$ is the class balancing weight; and $\gamma$ is the focusing parameter. Finally, we sum the Dice Loss and Focal Loss to leverage their synergistic effects. The final composite loss function $\mathcal{L}_{\mathrm{total}}$ is defined as

$$\mathcal{L}_{\mathrm{total}} = \lambda_{1}\,\mathcal{L}_{\mathrm{Dice}} + \lambda_{2}\,\mathcal{L}_{\mathrm{Focal}},$$

where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters that balance the contribution of each loss component. Through experimental evaluation on the datasets, we determined the optimal weights for $\lambda_{1}$ and $\lambda_{2}$. This balanced configuration ensures that the model effectively addresses both the structural similarity of segmentation results at a regional level and the challenge of learning hard examples at the pixel level.
3.6. Evaluation Metrics
To comprehensively and quantitatively evaluate the segmentation performance of the TriGEFNet model from multiple dimensions, we employ six standard metrics: Accuracy, Precision, Recall, F1-Score, Intersection over Union for landslides (IoU_landslide), and mIoU. Given the extreme class imbalance in landslide scenes, we prioritize Recall and mIoU to better evaluate hazard detection sensitivity, as Accuracy is often dominated by background pixels. The calculation of these metrics is based on four fundamental statistical quantities derived from the comparison between the model’s pixel-wise prediction results and the ground truth labels: True Positive ($TP$), False Positive ($FP$), True Negative ($TN$), and False Negative ($FN$). Based on these definitions, the calculation formulas for each evaluation metric are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{IoU}_{\mathrm{landslide}} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{mIoU} = \frac{1}{2}\big(\mathrm{IoU}_{\mathrm{landslide}} + \mathrm{IoU}_{\mathrm{background}}\big).$$
4. Results
4.1. Data Processing
To comprehensively evaluate the performance of the proposed TriGEFNet model, we conducted experiments using three remote sensing landslide datasets: the self-constructed Zunyi dataset and the publicly available L4S and Bijie datasets. Given the severe class imbalance between background and landslide classes inherent in landslide segmentation tasks, this study exclusively selected samples containing positive landslide instances for training and evaluation. The total number of samples ultimately utilized for the Zunyi, L4S, and Bijie datasets was 2231, 770, and 881, respectively. These datasets were partitioned into training and validation sets at a ratio of 8:2.
During the data preprocessing stage, all input images were uniformly resized to 256 × 256 pixels and normalized. Regarding data augmentation, we applied dynamic augmentation strategies exclusively to the training set. These strategies included geometric transformations such as random horizontal/vertical flipping and random rotation. Furthermore, to enhance the model’s robustness to illumination variations, color jittering and Gaussian noise were applied specifically to the RGB imagery.
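A possible realization of this augmentation policy is sketched below using plain tensor operations, so that geometric transforms remain identical across modalities and the mask while photometric perturbations affect only the RGB image; the function name, jitter ranges, and noise magnitude are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Sketch of the training-time augmentation: joint geometric transforms, RGB-only photometric jitter.
import torch

def augment(rgb: torch.Tensor, vi: torch.Tensor, slope: torch.Tensor, mask: torch.Tensor):
    # Joint geometric transforms: identical flips/rotations for all modalities and the mask.
    if torch.rand(1) < 0.5:
        rgb, vi, slope, mask = (t.flip((-1,)) for t in (rgb, vi, slope, mask))   # horizontal flip
    if torch.rand(1) < 0.5:
        rgb, vi, slope, mask = (t.flip((-2,)) for t in (rgb, vi, slope, mask))   # vertical flip
    k = int(torch.randint(0, 4, (1,)))                                           # random 90-degree rotation
    rgb, vi, slope, mask = (torch.rot90(t, k, dims=(1, 2)) for t in (rgb, vi, slope, mask))

    # RGB-only photometric transforms: simple brightness/contrast jitter and Gaussian noise.
    brightness = 1.0 + 0.2 * (torch.rand(1) - 0.5)
    contrast = 1.0 + 0.2 * (torch.rand(1) - 0.5)
    rgb = (rgb - rgb.mean()) * contrast + rgb.mean() * brightness
    rgb = rgb + 0.02 * torch.randn_like(rgb)
    return rgb, vi, slope, mask

rgb, vi, slope = torch.rand(3, 256, 256), torch.rand(1, 256, 256), torch.rand(1, 256, 256)
mask = torch.randint(0, 2, (1, 256, 256)).float()
rgb, vi, slope, mask = augment(rgb, vi, slope, mask)
```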
4.2. Implementation Details
All experiments in this study were implemented using the PyTorch (v2.3.0) deep learning framework. Training and evaluation were conducted on a Linux server equipped with an Intel(R) Xeon(R) Platinum 8358P CPU, 48 GB of system RAM, and an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM). For model initialization, we adopted distinct strategies for the encoders: the RGB branch was initialized with weights pre-trained on the ImageNet dataset, whereas the two auxiliary branches were randomly initialized and trained from scratch. The models were trained for a total of 100 epochs, with the batch size set to 32. We utilized the AdamW optimizer to update model parameters, with the weight decay coefficient set to 1 × 10⁻⁴. The initial learning rate was set to 1 × 10⁻⁴, and a Cosine Annealing strategy was employed to dynamically adjust the learning rate during the training process.
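This optimization setup maps directly onto a few lines of PyTorch, as sketched below with a trivial placeholder network and dummy data standing in for TriGEFNet and the landslide loaders.

```python
# Sketch of the training configuration: AdamW (lr 1e-4, weight decay 1e-4),
# cosine-annealed learning rate over 100 epochs.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(5, 1, 3, padding=1)   # placeholder network for illustration only
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # In practice, iterate over the training loader (batch size 32) and use the
    # composite Dice + Focal loss; a single dummy step is shown here.
    x, y = torch.randn(2, 5, 64, 64), torch.rand(2, 1, 64, 64)
    loss = F.mse_loss(model(x), y)      # stand-in for the composite loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                    # decay the learning rate once per epoch
```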
4.3. Comparative Experiments
To validate the superiority of TriGEFNet, we conducted comparative experiments against a series of representative advanced models. First, to investigate the impact of incorporating VI and Slope data, we established four groups of baseline experiments based on U-Net, ranging from unimodal RGB input to trimodal input. Second, we compared our method with classic semantic segmentation models and advanced multimodal fusion models. For fair comparison, all models were trained and evaluated under the same experimental settings as the proposed model. The quantitative results and visual samples for the three datasets are presented in Table 1, Table 2 and Table 3 and Figure 8, Figure 9 and Figure 10, respectively.
We initially analyzed the impact of adding VI and Slope using U-Net with an early fusion strategy. The experimental results exhibited significant variations across different datasets. On the L4S dataset, multimodal data improved the IoU from 0.5681 to 0.5767, demonstrating the potential of multimodal features to provide complementary information. However, this gain was not universal. On the more heterogeneous Zunyi and Bijie datasets, the inclusion of auxiliary modalities conversely led to a decline in model performance. Specifically, on the Bijie dataset, the IoU achieved by trimodal fusion (0.7631) was lower than that of the unimodal RGB baseline (0.7820). These results indicate that multimodal feature fusion is not a simple linear gain process but a complex problem highly dependent on data characteristics, inter-modal correlations, and the fusion strategy. In landslide segmentation tasks, simply concatenating multimodal data at the input stage is a suboptimal approach as it ignores critical inter-modal heterogeneity and differing noise distributions, often introducing disruptive noise rather than enhancing performance.
To verify the universality of the aforementioned findings, we applied multimodal inputs to four classic segmentation frameworks: DeepLabV3+ [47], U-Net++ [48], SegFormer [49], and Mask2Former [50]. The results show that stronger network architectures yield better model performance. On the Zunyi dataset, Mask2Former achieved an IoU of 0.852, outperforming the 0.830 achieved by U-Net. This suggests that advanced model architecture is a key determinant of segmentation performance. Nevertheless, this simplistic early fusion approach limits the model’s ability to learn complex relationships between different modalities, failing to fully exploit the potential of multimodal features. As seen in the third sample of the Zunyi dataset in Figure 8, these models produced high false negative rates, indicating that the gain from supplementary information in auxiliary modalities could not offset the negative impact of the introduced noise. These experiments reveal the fundamental flaw of early fusion strategies: “blindly” mixing heterogeneous data at the pixel level not only imposes an optimization burden on the network but also introduces noise due to the lack of a guidance mechanism, thereby interfering with the learning of core features. Consequently, the model is unable to fully exploit the complementary information within the auxiliary modalities.
To overcome the limitations of early fusion, we designed TriGEFNet, which features a hierarchical guided enhancement and fusion mechanism as its core. We comprehensively compared it with four advanced multimodal segmentation models: SGNet [51], CMX [28], CMNeXt [29], and EAEFNet [30]. These models employ various sophisticated fusion strategies, including cross-modal attention and collaborative learning, representing the current frontier of multimodal semantic segmentation. Experimental results demonstrate that TriGEFNet achieved optimal performance across most core metrics on all three datasets, surpassing the comparative models. On the Zunyi dataset, TriGEFNet achieved a landslide IoU of 0.7454 and an mIoU of 0.8627, exceeding the second-best model, EAEFNet, by 0.38 and 0.19 percentage points, respectively. This confirms its quantitative superiority and demonstrates its stability under complex data source conditions. In the first sample of the Zunyi dataset, CMX, CMNeXt, and EAEFNet all over-relied on the VI signal, resulting in false positive predictions. In contrast, TriGEFNet produced the most precise boundary delineation and the best control over false positives and false negatives.
The superior performance of TriGEFNet is primarily attributed to its systematic resolution of the core issues in multimodal fusion. Independent encoders construct clear semantic pathways for each heterogeneous data source. During the encoding phase, the MGEM and DGFM work synergistically to implement intelligent guidance and feature screening fusion across multiple semantic levels, effectively avoiding the feature conflicts typical of early fusion. Subsequently, in the decoding phase, the GSRM screens shallow spatial details, ensuring the refined reconstruction of landslide boundaries. This cohesive design, centered on a core principle of Independent Encoding, Interactive Enhancement, and Asymmetric Fusion, enables high-performance landslide segmentation driven by multimodal information.
4.4. Ablation Experiments
To deeply analyze the internal mechanisms of the proposed TriGEFNet and quantitatively evaluate the individual contributions and combined efficacy of its three core innovative components—DGFM, MGEM, and GSRM—we designed a series of comprehensive ablation studies.
The baseline model employs a standard U-Net architecture with a ResNet34 backbone, identical to that used in the main experiments. It utilizes three independent encoders to process the three modalities, respectively. Both the multimodal feature fusion and the skip connections are implemented via simple concatenation. Subsequently, we independently validated the effectiveness of each module and incrementally integrated them until the final complete model was constructed. All ablation studies were conducted on the L4S dataset, with detailed results presented in Table 4.
The baseline model, adopting a triple-stream input with concatenation-based multimodal feature fusion, improved the F1-score from 0.7291 to 0.7381 compared to the single-stream early fusion method in Table 2. This demonstrates that providing independent encoders for each feature type facilitates the extraction of critical information unique to each modality. Building upon this baseline, we evaluated the utility of the three core modules by progressively incorporating them. As shown in Table 4, the individual introduction of any single module yields significant performance improvements. Notably, the DGFM makes the most substantial independent contribution, increasing the IoU from 0.5871 to 0.6026, highlighting its critical role in suppressing heterogeneous noise. Ultimately, the complete TriGEFNet model, which integrates all three proposed modules, achieved the best performance among all configurations.
This result provides evidence that the three proposed components are not merely a simple accumulation of functions but constitute a complementary and organic whole. The DGFM and MGEM synergize at the encoder stage to perform comprehensive feature enhancement and fusion, while the GSRM ensures at the decoder stage that this high-quality information is precisely utilized for boundary reconstruction, ultimately realizing accurate landslide segmentation.
4.4.1. DGFM
To validate the superiority of the proposed DGFM, we compared it against two baseline methods: Element-wise Addition (Add) and Channel Concatenation (Concat). As shown in Table 5, the DGFM achieved the best performance across all evaluation metrics, significantly outperforming the other two methods. This quantitatively demonstrates the effectiveness of our guided fusion strategy.
We further conducted a visualization analysis of this module, as illustrated in Figure 11. It can be clearly observed from the visualized feature maps that the fusion results of the Add and Concat methods contain substantial diffuse noise and erroneous activation regions, leading to severe confusion between the target and the background. Although the Concat method offers slight improvements, the issue of background interference remains pronounced in its feature maps.
In contrast, the feature maps generated by the proposed DGFM exhibit activation regions that are highly focused on the landslide areas indicated by the Ground Truth. Simultaneously, the module effectively suppresses background noise, achieving precise feature representation. These experimental results demonstrate that the guided fusion mechanism of the DGFM effectively resolves the issue of multimodal information conflict. By establishing the RGB stream as the dominant modality and utilizing it to dynamically screen auxiliary modal information, this module successfully avoids the noise interference often introduced by simplistic fusion strategies. This prioritized design ensures the efficiency and robustness of the fusion process. Consequently, the generated features are not only enriched in semantic information but also possess significantly enhanced discriminability.
4.4.2. MGEM
To further elucidate the internal working mechanism of the MGEM, we visualized the multimodal feature maps before and after enhancement, as shown in Figure 12. The visualization reveals a pattern of efficient feature synergy and functional specialization. The feature maps of the NDVI exhibited strong activation responses to bare landslide surfaces. Serving as a spatial attention signal, this response significantly enhanced the feature intensity of the corresponding regions in the RGB feature maps via the MGEM. This process assigned higher feature weights to landslide areas that were originally ambiguous in the RGB features, thereby effectively boosting feature discriminability. In contrast, the enhancement applied to the Slope features was more moderate, ensuring that critical topographic patterns were preserved without being overwhelmed by strong signals from other modalities.
In summary, experimental results demonstrate that the MGEM does not apply a homogenized enhancement across all modalities. Instead, by leveraging a shared guidance signal, it successfully establishes cross-modal conditional dependencies. Based on this foundation, it performs differentiated and asymmetric feature refinement and enhancement tailored to the specific strengths of each modality. Ultimately, this achieves efficient synergy and complementary enhancement among multimodal features, generating feature representations that are significantly more robust and discriminative than the input features.
4.4.3. GSRM
To validate the superiority of the proposed GSRM, we compared it with three classic skip connection strategies: Add, Concat, and Attention Gate (Attention) [52]. Table 6 shows that the GSRM yields the most significant performance improvement for the model, comprehensively outperforming the other strategies across all evaluation metrics.
The visualization of the module’s feature maps is presented in Figure 13. The Add and Concat methods, representing indiscriminate fusion strategies, inevitably introduce original noise and redundant background information from the encoder into the decoding path. This results in final feature maps exhibiting marked semantic ambiguity and background noise. While the classic Attention Gate is capable of filtering some irrelevant features, it erroneously triggers misguided attention toward background regions while attempting to suppress noise.
In contrast, the feature maps generated by the GSRM exhibit activation regions that are highly consistent with the landslide morphology, characterized by sharp and clear boundaries. This demonstrates its superior capability in feature refinement. The GSRM performs precise, adaptive screening and enhancement on the shallow detail features provided by the encoder. Consequently, it preserves high-frequency details critical for segmentation while effectively suppressing noise. By intelligently refining and fusing cross-level features, the GSRM generates feature representations that possess both high-level semantic discriminability and shallow spatial precision. This effectively bridges the semantic gap between deep abstract features and shallow geometric features, ensuring the integrity of the segmentation results.
5. Discussion
5.1. Comparative Analysis with Different Vegetation Indices
To investigate the impact of NDVI and NGRDI on model performance, we computed the corresponding NGRDI data using RGB imagery from the L4S dataset and utilized it as the input for the VI stream to train the model. Experimental results (Table 7) indicate that NDVI achieved superior performance across the board, owing to its use of the near-infrared (NIR) band. Compared to NGRDI, NDVI improved the IoU and F1-score by approximately 3.2% and 2.5%, respectively. This confirms that NDVI provides clearer features indicative of vegetation disruption along landslide boundaries, thereby enhancing the model’s segmentation capability.
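For clarity, both indices reduce to simple band ratios, as sketched below; the band ordering and tensor shapes are assumptions for illustration only.

```python
# Sketch of the two vegetation indices compared in this section:
# NDVI = (NIR - Red) / (NIR + Red), NGRDI = (Green - Red) / (Green + Red).
import torch

def ndvi(nir: torch.Tensor, red: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return (nir - red) / (nir + red + eps)

def ngrdi(green: torch.Tensor, red: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RGB-only alternative: Normalized Green-Red Difference Index.
    return (green - red) / (green + red + eps)

rgb = torch.rand(3, 256, 256)           # channels assumed ordered as (R, G, B)
nir = torch.rand(1, 256, 256)
vi_from_multispectral = ndvi(nir[0], rgb[0])
vi_from_rgb_only = ngrdi(rgb[1], rgb[0])
```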
However, the performance gap between the model using NGRDI and the one using NDVI is relatively narrow, validating the effectiveness of NGRDI as a viable alternative data source. The experiments demonstrate that our TriGEFNet retains robust landslide segmentation capabilities even in data-constrained scenarios where NDVI is unavailable. This significantly extends the model’s robustness and generalization ability under varying data conditions.
In emergency response scenarios following landslide disasters, the data most immediately available is often acquired by Unmanned Aerial Vehicles (UAVs) equipped with standard RGB cameras, whereas acquiring multispectral data may entail longer response times or higher logistical costs. Our experiments confirm that NGRDI can effectively serve as a substitute for NDVI as the VI input. In summary, the proposed TriGEFNet is not contingent upon specific data types; rather, it exhibits high robustness and adaptability to input data, thereby significantly expanding its potential for application in complex data environments.
5.2. Analysis of the Impact of Backbone Networks on Model Performance
To determine the optimal backbone network, we evaluated four classic architectures: ResNet18, ResNet34, ResNet50, and ResNet101. The experimental results (Table 8) indicate that model performance peaks with ResNet34, rather than monotonically increasing with network depth. Transitioning from ResNet18 to ResNet34 yielded a significant performance gain, with the IoU improving from 0.6012 to 0.6251. However, further increasing the depth to ResNet50 and ResNet101 resulted in performance saturation or even a slight decline; neither their IoU nor F1-scores surpassed those of ResNet34. We attribute this phenomenon to the trade-off between model complexity and task specificity. The feature extraction capability of ResNet34 proves sufficient for the landslide segmentation task. Although deeper networks possess stronger theoretical representational capacity, they heighten the risk of overfitting on limited datasets. Furthermore, they may extract excessively fragmented features at the expense of contextual information, a hypothesis supported by the observed decline in Recall rates.
Consequently, ResNet34 strikes the optimal balance between performance and efficiency. Based on these comparisons, ResNet34 was selected as the final backbone. This decision underscores the importance of refining model selection tailored to specific tasks, rather than merely increasing network depth.
5.3. Limitations and Future Work
Despite the significant advancements achieved by the proposed TriGEFNet in landslide segmentation accuracy compared to existing models, certain limitations persist. First, constrained by the prohibitive costs of acquiring remote sensing landslide samples, existing public datasets are relatively small in scale and geographically concentrated. Although we constructed the multi-source, multi-temporal Zunyi dataset, its coverage remains limited. To further enhance model robustness in complex scenarios and realize truly intelligent disaster management, the field requires a profound commitment to the continuous expansion and iteration of landslide datasets. By collecting cross-regional and cross-temporal remote sensing imagery to increase the spatiotemporal diversity of training data, we can fundamentally mitigate the risk of overfitting caused by small sample sizes.
Beyond expanding data diversity, enhancing the adaptability of the method across different scales is crucial. While currently validated at a regional level, TriGEFNet holds significant potential for detailed monitoring scenarios, such as mining safety and engineering geology. In these contexts, its multimodal fusion offers robustness against anthropogenic noise. Future research will leverage transfer learning by fine-tuning models pre-trained on satellite data with high-resolution UAV imagery. This approach aims to bridge the gap between regional surveys and the high precision required for specific engineering sites.
Furthermore, a paradigm shift is imperative: moving from the current purely data-driven semantic segmentation toward physics-informed disaster perception intelligence. This transition requires models to transcend mere pattern recognition to achieve mechanistic understanding. Given that landslide occurrence results from the complex coupling of geological, geomorphological, and hydrological factors, future research must focus on deeply exploring and fusing auxiliary modality data more closely related to landslide mechanisms. To this end, our future work will aim to construct a comprehensive sensing framework capable of capturing the environment of landslide development and dynamic triggering factors. Incorporating key causative factors—such as geological lithology, soil moisture, and InSAR surface deformation—will be a crucial step. These factors should no longer be viewed as mere supplementary inputs but as pivotal cues for understanding and modeling the physical processes of landslides, thereby enabling the model to learn the coupling laws governing these multi-factor interactions.
Finally, to support the physics-aware models, future research must transition from relying on small-sample, static image datasets to establishing dynamic monitoring benchmarks covering full regions and temporal sequences. This will significantly enhance the model’s generalization ability and segmentation accuracy in complex surface environments. More importantly, it will provide the possibility of capturing the complete dynamic chain of landslides, which spans from incubation and development to occurrence. The ultimate goal is to develop intelligent systems equipped with rudimentary physical perception capabilities, laying a solid foundation for a fundamental shift from post-disaster response to pre-disaster warning.
6. Conclusions
In this paper, we proposed TriGEFNet, a network specifically designed for landslide segmentation from multimodal remote sensing imagery. The model employs a triple-stream encoder architecture aimed at fully exploiting the complementary characteristics of RGB imagery, VI, and Slope. To address the challenges of semantic gaps and noise interference inherent in fusing multi-source heterogeneous data, we innovatively designed three core modules. The MGEM achieves the interaction and synergistic enhancement of cross-modal information by constructing shared guidance features, allowing the model to dynamically absorb context from other modalities while retaining unique modal information. Subsequently, the DGFM establishes an asymmetric fusion mechanism led by RGB and supplemented by auxiliary modalities, utilizing gating strategies to effectively filter redundant noise and ensure the quality of feature fusion. Finally, the GSRM utilizes high-level semantic features generated by the decoder to spatially screen shallow detail features from the encoder, effectively bridging the semantic discrepancy between the encoder and decoder and improving the detailed recovery of landslide boundaries.
Collectively, these three modules construct a comprehensive, refined feature processing framework spanning feature enhancement, multimodal fusion, and cross-level optimization. The core principle of this framework is to abandon simple, static information stacking in favor of a consistent, context-driven dynamic gating and guidance strategy. This systematic design ensures that information is processed optimally at every critical node of the model, ultimately constructing a feature representation that is both robust and refined for landslide segmentation tasks. Extensive comparative experiments were conducted on the self-constructed Zunyi dataset and the public Bijie and L4S datasets. The results demonstrate that TriGEFNet exhibits exceptional segmentation generalization and accuracy. The model achieved landslide IoU of 74.54%, 81.30%, and 62.51% on these three datasets, respectively, comprehensively outperforming classic semantic segmentation networks and advanced multimodal fusion models.
This study not only confirms the effectiveness of multimodal fusion in landslide segmentation but also reveals a critical methodological insight: introducing physical environmental priors into deep learning frameworks is an effective strategy to overcome the spectral confusion inherent in unimodal approaches. By integrating VI and slope as auxiliary modalities—coupled with a series of interaction and fusion modules—we effectively resolved the difficulty of spectral interference in complex scenes.
In conclusion, TriGEFNet provides not only a novel and efficient paradigm for the semantic segmentation of multimodal remote sensing imagery but also robust technical support for constructing large-scale and physics-aware automated landslide monitoring systems in practical scenarios.