1. Introduction
Landslides, characterized by the gravity-driven instability and movement of rock and soil masses on slopes, pose a significant threat to human life, property, infrastructure, and ecosystems worldwide [1]. In the context of climate change and the increasing frequency of extreme weather events, the rapid and precise identification of landslides is critical for disaster risk assessment and emergency response [2]. Traditional landslide monitoring relies heavily on field geological surveys. While offering high precision, these methods are often constrained by high time costs and limited spatial coverage, making them inadequate for monitoring large-scale or inaccessible hazardous terrains [3,4].
With rapid advancements in Earth observation technology, remote sensing has emerged as the primary approach for landslide detection, owing to its advantages in macroscopic and dynamic monitoring [5,6]. Over the past few decades, landslide detection based on remote sensing imagery has undergone a paradigm shift from manual visual interpretation to machine learning, and subsequently to deep learning [7,8]. Although manual interpretation offers acceptable accuracy, it fails to meet the demands of large-scale, rapid response. Meanwhile, traditional machine learning methods are limited by their shallow feature extraction capabilities, struggling to resolve high-dimensional abstract features in complex backgrounds [9,10].
In recent years, deep learning has witnessed rapid advancements in the field of landslide detection. Unlike traditional machine learning, deep learning models possess the capability to learn directly from image data, automatically extracting low-level features, such as textures and edges, as well as high-level features, including shapes and spatial context. This mechanism enables the capture of deeper and more abstract semantic information from the imagery [11,12]. In the deep learning domain, landslide detection is predominantly formulated as a semantic segmentation task, the objective of which is to classify every pixel in an image to precisely delineate the extent and boundaries of landslides. Consequently, a multitude of studies have employed semantic segmentation models to achieve automated landslide recognition [13]. For instance, Fu et al. proposed a lightweight network optimized for on-board landslide segmentation; this model utilizes CSPDarknet-tiny as an efficient encoder backbone to enhance accuracy and robustness while maintaining a low parameter count [14]. To simultaneously leverage global context and local deep features, Liu et al. designed a dual-branch encoder, in which a Transformer branch captures global dependencies while a Convolutional Neural Network (CNN) branch specializes in extracting abstract features. These are then integrated via a multi-scale feature fusion module to refine landslide boundary details [15]. Addressing the challenge of weak model generalization in novel regions, Zhang et al. [16] proposed a cross-domain landslide segmentation method based on Multi-Target Domain Adaptation (MTDA). This approach employs a progressive “near-to-far” learning strategy to align feature distributions across different regions, achieving outstanding performance on large-scale datasets comprising multiple heterogeneous domains [16]. Despite these significant strides, existing segmentation methods continue to face considerable challenges. Current approaches rely primarily on unimodal RGB optical imagery, which is prone to spectral confusion in complex geological environments. Specifically, the spectral characteristics of bare soil or rock masses resulting from landslides are strikingly similar to those of bare farmland, construction sites, and natural bedrock. Models relying solely on RGB are susceptible to confusing these distinct classes. These factors obscure the visual features of landslides, thereby severely compromising the generalization capability of the models [17,18].
To overcome the limitations of unimodal models and enhance their discriminative capability, incorporating multimodal data that provides complementary information has emerged as a pivotal research direction [19]. For instance, Liu et al. [20] proposed an integrated segmentation framework that includes a specialized multimodal branch for extracting elevation features from a Digital Elevation Model (DEM). Optimized via a terrain-guided loss function, the framework demonstrated the effectiveness of DEM features in landslide segmentation tasks [20]. Ghorbanzadeh et al. evaluated four advanced models on the L4S dataset, analyzing the impact of various spectral input combinations on model training. The study revealed that extending unimodal RGB inputs to multimodal data improved the performance of U-Net-based architectures, whereas the performance of Transformer-based architectures deteriorated [21]. Addressing the challenge of identifying visually indistinct old landslides, Chen et al. proposed FFS-Net, which fuses the texture features of optical imagery with the terrain features of DEMs at high semantic levels, significantly enhancing the model’s capability to detect old landslides [22]. These studies demonstrate that multi-source data fusion can construct feature representations that are far more comprehensive and robust than their unimodal counterparts, marking it as a critical approach for achieving robust and accurate landslide segmentation. However, existing methods predominantly employ static fusion mechanisms that fail to adaptively adjust the contribution of each modality according to the context, thereby constraining further improvements in model performance [23,24].
In the broader field of computer vision, more extensive research into multimodal fusion has led to the emergence of a series of advanced dynamic fusion strategies [25,26,27]. For instance, CMX leverages meticulously designed cross-modal feature rectification and fusion modules to facilitate granular interaction and correction of multimodal features at various stages of encoding and decoding. This effectively enhances complementary information between modalities while suppressing noise [28]. Similarly, CMNeXt introduces a highly efficient cross-modal attention module that significantly reduces computational complexity while improving fusion performance, thereby achieving a superior balance between efficiency and accuracy [29]. Addressing the challenge of varying multimodal data quality, EAEFNet employs a dual-branch architecture to differentially process multimodal information of unequal quality, achieving enhancement and compensation for features from each modality [30]. Although these multimodal models exhibit outstanding performance, they are primarily designed for datasets such as RGB-D (Depth), and their fusion methodologies are difficult to adapt directly to the unique demands of remote sensing landslide segmentation. First, these models typically treat the contributions of different modalities indiscriminately. However, in landslide segmentation tasks, high inter-class spectral similarity often causes spectral features to introduce substantial redundancy and noise, potentially leading to model overfitting. Second, there are often significant resolution disparities between modalities in landslide datasets. This necessitates precise feature alignment while bridging the semantic gap between modalities [31,32].
To address the aforementioned challenges, this paper proposes TriGEFNet, a Triple-Stream Guided Enhancement and Fusion Network designed to resolve the difficulties of multimodal feature alignment and fusion through an asymmetric guidance mechanism. Concurrently, to validate the model’s robustness in challenging environments, we constructed a benchmark dataset comprising multi-sensor heterogeneous data—the Zunyi Landslide Dataset.
The main contributions of this paper are summarized as follows:
This paper introduces TriGEFNet, a triple-stream multimodal fusion network featuring a novel guided enhancement and fusion strategy to tackle noise and redundancy. In the encoder, the Multimodal Guided Enhancement Module (MGEM) first mitigates inconsistent data quality by independently enhancing each stream’s features. Then, the Dominant-stream Guided Fusion Module (DGFM), led by the semantically rich RGB stream, selectively integrates Slope and VI features to achieve an efficient, asymmetric fusion. In the decoder, the Gated Skip Refinement Module (GSRM) adaptively filters skip connections, preventing redundant information flow while preserving crucial spatial details for accurate boundary delineation. Collectively, these components allow TriGEFNet to learn highly discriminative representations for landslide segmentation in complex environments.
We constructed the Zunyi Landslide Dataset, tailored for complex scenarios. This dataset integrates significant cross-modal resolution disparities with multi-source data heterogeneity. It provides a challenging benchmark for evaluating the generalization ability of multimodal fusion algorithms in actual geological environments.
We conducted comprehensive comparative experiments on the Zunyi, Bijie [33], and Landslide4Sense (L4S) [34] datasets. The proposed model was evaluated against a series of classic semantic segmentation models and advanced multimodal fusion models. Experimental results demonstrate that TriGEFNet achieves superior performance across multiple key evaluation metrics, including mean Intersection over Union (mIoU). This fully validates the model’s robust capability for high-performance landslide segmentation in complex environments and highlights its significant value for practical applications.
3. Methodology
In this paper, we propose TriGEFNet, a deep neural network for landslide segmentation from multimodal imagery that introduces a novel fusion paradigm: Independent Encoding, Interactive Enhancement, and Asymmetric Fusion. Illustrated in Figure 4, the network is built upon the classic U-Net [44] framework and employs ResNet34 [45] as its backbone. A key principle of the architecture lies in the multi-branch feature decoupling design of the encoder. We configure three parallel encoders with non-shared parameters for RGB imagery, VI, and Slope, respectively. This configuration allows the network to learn the modality-specific semantic distributions inherent to each data source. To achieve the efficient integration of heterogeneous features, TriGEFNet incorporates three key components: MGEM, DGFM, and GSRM. These modules are designed to facilitate the efficient interaction and fusion of multimodal features, thereby enhancing the final segmentation performance. This section provides a detailed analysis of the core modules constituting the network, followed by an introduction to the loss function and performance evaluation metrics utilized for model optimization.
3.1. Multimodal Guided Enhancement Module (MGEM)
To improve the recognition accuracy of landslide areas in remote sensing imagery under complex scenes, we introduce the VI as a spectral feature indicating surface vegetation disruption, and leverage Slope data as a geographic constraint reflecting the likelihood of landslide occurrence. However, naive feature concatenation or element-wise addition neglects the heterogeneity and spatial inconsistency of contributions from different modalities. For instance, in shadowed areas, RGB imagery often suffers from information loss and high noise due to poor lighting, whereas Slope data remains unaffected and reliable. Conventional fusion merges these modalities indiscriminately, causing the optical noise to contaminate the critical geometric features. Consequently, such methods fail to capture deep conditional dependencies. To fully exploit the synergistic potential among multimodal data, we designed the MGEM.
Figure 5 illustrates the detailed structure of the MGEM. For clarity, the diagram exclusively depicts the enhancement workflow for the RGB features; the processes for the VI and Slope branches are identical. The MGEM comprises a Guidance Feature Generation Network (Guidance Net) and three parallel Feature Enhancers. First, the module concatenates the multimodal feature maps ($F_{\mathrm{RGB}}$, $F_{\mathrm{VI}}$, and $F_{\mathrm{Slope}}$) from the same encoder level along the channel dimension. This constructs a unified feature representation that retains the original contextual information of each modality. The representation is then fed into the Guidance Net, which employs a stacked convolutional block (comprising 1 × 1 and 3 × 3 convolutions) to implicitly model inter-modal dependencies and aggregate local spatial context. This process generates a guidance feature, $G$, which integrates complementary information from all three sources. Subsequently, this guidance feature serves as a shared spatial context prior and is distributed to the three Feature Enhancers.
Within each enhancer, $G$ passes through a lightweight convolutional network with a Sigmoid activation to generate an adaptive spatial attention map. This map acts as a pixel-wise spatial gate. By performing element-wise multiplication with the attention map, the model spatially recalibrates the feature representation: regions within the original map that possess high discriminative value for landslide segmentation are enhanced, while the weights of redundant or noisy information are effectively attenuated. Finally, the optimized features are added to the original features via a residual connection to generate the enhanced output. This residual connection ensures that the unique characteristic information of each modality is preserved. The formulation of the module is as follows:

$$\hat{F}_{\mathrm{RGB}} = F_{\mathrm{RGB}} + \sigma\big(\mathrm{Conv}_{\mathrm{RGB}}(G)\big) \odot F_{\mathrm{RGB}},$$

where $\sigma$ denotes the Sigmoid activation function, $\mathrm{Conv}_{\mathrm{RGB}}(\cdot)$ denotes the lightweight convolutional network of the RGB enhancer, and $\odot$ denotes the Hadamard product. The same process is applied in parallel to the VI and Slope branches. By generating spatial attention maps via $\sigma(\mathrm{Conv}(G))$, MGEM enables the model to dynamically adjust the spatial weights of each unimodal feature map based on the fused multimodal context, facilitating the learning of cross-modal conditional dependencies. Consequently, MGEM realizes the interaction of gain information among multimodal features, allowing each modality to absorb complementary information from others while retaining its own characteristics, thereby enhancing feature robustness and discriminability.
3.2. Dominant-Stream Guided Fusion Module (DGFM)
Following the parallel enhancement of features from each modality by the MGEM, it is necessary to fuse them into a unified representation for subsequent processing by the decoder. To prevent the robust spatial context features extracted by the RGB encoder from being compromised by potential noise or redundant information within the auxiliary modalities, we designed the DGFM, the schematic of which is illustrated in Figure 6. In landslide segmentation tasks, RGB imagery provides the richest and most critical spatial context and spectral information. Consequently, we establish the RGB stream as the dominant modality, while treating the VI (providing supplementary spectral information) and Slope (providing terrain constraints) as auxiliary modalities. The design of the DGFM aims to leverage the dominant stream to guide and regulate the integration process of the auxiliary streams, ensuring that only beneficial information from the auxiliary features contributes to the fusion.
The specific implementation of the DGFM is as follows: First, the dominant feature $\hat{F}_{\mathrm{RGB}}$ is input into a gating generator composed of lightweight convolutions to generate a spatial attention gating map, $M$. This gating map functions as a dynamic spatial filter, with its weight distribution determined entirely by the feature information of the dominant stream. Subsequently, this gating map is simultaneously applied to $\hat{F}_{\mathrm{VI}}$ and $\hat{F}_{\mathrm{Slope}}$, filtering the features of the auxiliary modalities through element-wise multiplication. Finally, the dominant stream feature is concatenated along the channel dimension with the two filtered auxiliary stream features. This combined tensor is then processed by a convolutional block for final information integration and dimensionality reduction.
The entire fusion process can be formulated as follows:

$$M = \sigma\big(\mathrm{Conv}(\hat{F}_{\mathrm{RGB}})\big), \qquad F_{\mathrm{fused}} = \mathrm{ConvBlock}\big(\big[\hat{F}_{\mathrm{RGB}},\ M \odot \hat{F}_{\mathrm{VI}},\ M \odot \hat{F}_{\mathrm{Slope}}\big]\big),$$

where $[\cdot]$ denotes channel-wise concatenation, $\sigma$ the Sigmoid activation, and $\odot$ the Hadamard product.
Through its unique “guidance-gating” mechanism, the DGFM ensures that only high-relevance auxiliary information beneficial to the dominant modality participates in the final decision-making process. This not only maximizes the preservation of the core feature integrity but also achieves adaptive denoising and screening of auxiliary information, thereby accomplishing a prioritized, robust, and efficient feature fusion.
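The following sketch illustrates one possible realization of this guidance-gating scheme in PyTorch; the gating-generator depth, channel widths, and module name are assumptions for illustration only.

```python
# Sketch of DGFM: the gating map is produced solely from the dominant RGB stream
# and applied only to the auxiliary VI and Slope streams before fusion.
import torch
import torch.nn as nn

class DGFM(nn.Module):
    def __init__(self, ch_rgb: int, ch_vi: int, ch_slope: int, out_ch: int):
        super().__init__()
        # Gating generator: lightweight convs on the dominant features -> spatial gate in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(ch_rgb, ch_rgb // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch_rgb // 4, 1, 3, padding=1), nn.Sigmoid(),
        )
        # Final integration and channel reduction after concatenation.
        self.fuse = nn.Sequential(
            nn.Conv2d(ch_rgb + ch_vi + ch_slope, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, f_rgb, f_vi, f_slope):
        m = self.gate(f_rgb)                                        # dynamic spatial filter M
        fused = torch.cat([f_rgb, m * f_vi, m * f_slope], dim=1)    # gate only the auxiliary streams
        return self.fuse(fused)

# Example usage at one encoder level.
fusion = DGFM(128, 128, 128, out_ch=128)
y = fusion(torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32))
```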
3.3. Gated Skip Refinement Module (GSRM)
During the decoding stage, to effectively bridge the semantic gap between the high-resolution spatial details provided by the encoder and the high-level semantic information generated by the decoder, we designed the GSRM. The schematic of the GSRM is illustrated in Figure 7. First, the feature map $F_{d}$ from the decoder is processed by a gating controller composed of two 1 × 1 convolutions. This controller extracts rich contextual information to generate a spatial attention map, $A$. Subsequently, $A$ is employed to perform element-wise weighting on the encoder feature $F_{e}$. This operation directs focus toward target regions critical for the segmentation task while simultaneously suppressing redundant information and noise. Following this, the filtered encoder feature is concatenated with $F_{d}$ along the channel dimension. Finally, the concatenated features are fed into a Refinement Block (RB). The purpose of this block is to facilitate the deep alignment of these two heterogeneous features within the local spatial domain. The cascaded 3 × 3 convolution blocks are capable of learning and modeling complex local correlations within the concatenated features. They smoothly integrate semantic information with precise boundary details, ultimately generating a more robust and discriminative feature representation. The formulation of the GSRM is expressed as follows:

$$A = \sigma\big(\mathrm{Conv}_{1\times1}(\mathrm{Conv}_{1\times1}(F_{d}))\big), \qquad F_{\mathrm{out}} = \mathrm{RB}\big(\big[A \odot F_{e},\ F_{d}\big]\big),$$

where $\sigma$ denotes the Sigmoid activation, $\odot$ the Hadamard product, and $[\cdot]$ channel-wise concatenation.
In summary, the GSRM ensures that only highly relevant low-level features participate in the fusion process, effectively bridging the semantic gap while mitigating noise interference. The subsequent refinement process guarantees the deep and seamless integration of these two feature types, achieving a truly organic fusion.
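A minimal sketch of this gated skip connection in PyTorch follows; the channel arguments and internal layer sizes are illustrative assumptions, not the exact configuration used in TriGEFNet.

```python
# Sketch of GSRM: the decoder feature gates the encoder skip feature, and a
# refinement block (cascaded 3x3 convs) fuses the concatenated result.
import torch
import torch.nn as nn

class GSRM(nn.Module):
    def __init__(self, dec_ch: int, enc_ch: int, out_ch: int):
        super().__init__()
        # Gating controller: two 1x1 convolutions producing a spatial attention map A.
        self.gate = nn.Sequential(
            nn.Conv2d(dec_ch, dec_ch // 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dec_ch // 2, 1, 1), nn.Sigmoid(),
        )
        # Refinement Block (RB): cascaded 3x3 convolutions aligning the two feature types.
        self.rb = nn.Sequential(
            nn.Conv2d(dec_ch + enc_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, f_dec, f_enc):
        a = self.gate(f_dec)                                  # spatial attention from decoder semantics
        return self.rb(torch.cat([a * f_enc, f_dec], dim=1))  # gated skip feature + decoder feature

# Example: fuse an upsampled decoder feature with the matching encoder skip feature.
gsrm = GSRM(dec_ch=128, enc_ch=64, out_ch=64)
out = gsrm(torch.randn(2, 128, 64, 64), torch.randn(2, 64, 64, 64))
```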
3.4. Upsample, SCSE, and Segmentation Head (SH)
Within the decoder, at each stage, the resolution of deep feature maps is first upsampled via bilinear interpolation. Subsequently, the upsampled results are fed into the GSRM to implement the skip connection. The resulting fused features are then processed by the Spatial and Channel Squeeze and Excitation (SCSE) module. The SCSE module adaptively enhances feature information critical for landslide segmentation while suppressing redundancy by applying concurrent attention weighting across both channel and spatial dimensions [46].
After the decoder restores the feature maps to the same resolution as the original input image through a series of upsampling and fusion operations, the Segmentation Head (SH) serves as the final output layer of the model. It is responsible for transforming these semantically rich feature maps into the final pixel-level segmentation prediction. In the proposed model, the SH is designed with an efficient structure, primarily consisting of a 3 × 3 convolutional layer.
The formulation of the SH is as follows:

$$\hat{Y} = \mathrm{Conv}_{3\times3}(F_{\mathrm{dec}}),$$

where $F_{\mathrm{dec}}$ denotes the feature map output by the final decoder stage and $\hat{Y}$ the single-channel prediction map.
The primary function of this convolutional layer is to reduce the channel dimensionality of the high-dimensional feature maps from the final decoder layer to match the number of target classes. Specifically, for the binary classification task in this study, this layer reduces the channel count of the input feature maps to 1.
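For reference, the sketch below shows the standard SCSE formulation [46] together with a 3 × 3-convolution segmentation head; it is a simplified illustration under our own naming and reduction ratio, not the exact implementation used in TriGEFNet.

```python
# Sketch of the SCSE attention block and the segmentation head (SH).
import torch
import torch.nn as nn

class SCSE(nn.Module):
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        # Channel squeeze-and-excitation (cSE): global pooling -> bottleneck -> channel weights.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
        # Spatial squeeze-and-excitation (sSE): 1x1 conv -> per-pixel weights.
        self.sse = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)

class SegmentationHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # reduce channels to the class count

    def forward(self, x):
        return self.conv(x)  # logits; a Sigmoid is applied at inference for the binary task

x = torch.randn(2, 64, 256, 256)
logits = SegmentationHead(64, 1)(SCSE(64)(x))   # shape (2, 1, 256, 256)
```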
3.5. Loss Function
Landslide segmentation is a task typically characterized by severe class imbalance. In existing landslide datasets, the proportion of pixels representing landslide areas is usually far smaller than that of the non-landslide background. Standard Cross-Entropy Loss penalizes errors for every pixel with equal weight. In such scenarios, the loss generated by the overwhelming majority of background pixels dominates the gradient direction, biasing the model towards predicting all pixels as background and potentially leading to overfitting to the majority class. To effectively address this challenge, we employ a composite loss function that combines the strengths of Dice Loss and Focal Loss. This approach aims to optimize the model simultaneously from the perspectives of regional overlap and hard sample mining.
Dice Loss ($\mathcal{L}_{\mathrm{Dice}}$), derived from the Dice coefficient used to measure set similarity, directly optimizes the degree of overlap between the predicted region and the ground truth. Its primary advantage lies in its inherent insensitivity to class imbalance. It is defined as follows:

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} y_{i}\, p_{i} + \epsilon}{\sum_{i=1}^{N} y_{i} + \sum_{i=1}^{N} p_{i} + \epsilon},$$

where $N$ represents the total number of pixels, $y_{i}$ and $p_{i}$ denote the ground truth label and the model’s predicted probability for the positive class of the $i$-th pixel, respectively, and $\epsilon$ is a smoothing constant added to enhance numerical stability.
Focal Loss ($\mathcal{L}_{\mathrm{Focal}}$) represents a dynamically weighted improvement over standard Cross-Entropy Loss. By introducing a modulating factor, it automatically reduces the contribution of the vast number of easy samples during loss calculation. This mechanism forces the model to focus its learning on positive and negative samples that are difficult to distinguish. It is defined as follows:

$$\mathcal{L}_{\mathrm{Focal}} = -\frac{1}{N}\sum_{i=1}^{N} \alpha_{t}\,\big(1 - p_{t,i}\big)^{\gamma}\,\log\big(p_{t,i}\big),$$

where $N$ is the total number of pixels. For the $i$-th pixel, $p_{t,i}$ represents the model’s predicted probability for the correct class; $\alpha_{t}$ is the class balancing weight; and $\gamma$ is the focusing parameter. Finally, we sum the Dice Loss and Focal Loss to leverage their synergistic effects. The final composite loss function $\mathcal{L}_{\mathrm{total}}$ is defined as

$$\mathcal{L}_{\mathrm{total}} = \lambda_{1}\,\mathcal{L}_{\mathrm{Dice}} + \lambda_{2}\,\mathcal{L}_{\mathrm{Focal}},$$

where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters that balance the contribution of each loss component. Through experimental evaluation on the datasets, we determined the optimal weights for $\lambda_{1}$ and $\lambda_{2}$. This balanced configuration ensures that the model effectively addresses both the structural similarity of segmentation results at a regional level and the challenge of learning hard examples at the pixel level.
3.6. Evaluation Metrics
To comprehensively and quantitatively evaluate the segmentation performance of the TriGEFNet model from multiple dimensions, we employ six standard metrics: Accuracy, Precision, Recall, F1-Score, Intersection over Union for landslides (IoU_landslide), and mIoU. Given the extreme class imbalance in landslide scenes, we prioritize Recall and mIoU to better evaluate hazard detection sensitivity, as Accuracy is often dominated by background pixels. The calculation of these metrics is based on four fundamental statistical quantities derived from the comparison between the model’s pixel-wise prediction results and the ground truth labels: True Positive ($TP$), False Positive ($FP$), True Negative ($TN$), and False Negative ($FN$). Based on these definitions, the calculation formulas for each evaluation metric are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{IoU}_{\mathrm{landslide}} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{mIoU} = \frac{1}{2}\big(\mathrm{IoU}_{\mathrm{landslide}} + \mathrm{IoU}_{\mathrm{background}}\big).$$
4. Results
4.1. Data Processing
To comprehensively evaluate the performance of the proposed TriGEFNet model, we conducted experiments using three remote sensing landslide datasets: the self-constructed Zunyi dataset and the publicly available L4S and Bijie datasets. Given the severe class imbalance between background and landslide classes inherent in landslide segmentation tasks, this study exclusively selected samples containing positive landslide instances for training and evaluation. The total number of samples ultimately utilized for the Zunyi, L4S, and Bijie datasets was 2231, 770, and 881, respectively. These datasets were partitioned into training and validation sets at a ratio of 8:2.
During the data preprocessing stage, all input images were uniformly resized to 256 × 256 pixels and normalized. Regarding data augmentation, we applied dynamic augmentation strategies exclusively to the training set. These strategies included geometric transformations such as random horizontal/vertical flipping and random rotation. Furthermore, to enhance the model’s robustness to illumination variations, color jittering and Gaussian noise were applied specifically to the RGB imagery.
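A possible realization of this augmentation policy is sketched below using plain tensor operations, so that geometric transforms remain identical across modalities and the mask while photometric perturbations affect only the RGB image; the function name, jitter ranges, and noise magnitude are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Sketch of the training-time augmentation: joint geometric transforms, RGB-only photometric jitter.
import torch

def augment(rgb: torch.Tensor, vi: torch.Tensor, slope: torch.Tensor, mask: torch.Tensor):
    # Joint geometric transforms: identical flips/rotations for all modalities and the mask.
    if torch.rand(1) < 0.5:
        rgb, vi, slope, mask = (t.flip((-1,)) for t in (rgb, vi, slope, mask))   # horizontal flip
    if torch.rand(1) < 0.5:
        rgb, vi, slope, mask = (t.flip((-2,)) for t in (rgb, vi, slope, mask))   # vertical flip
    k = int(torch.randint(0, 4, (1,)))                                           # random 90-degree rotation
    rgb, vi, slope, mask = (torch.rot90(t, k, dims=(1, 2)) for t in (rgb, vi, slope, mask))

    # RGB-only photometric transforms: simple brightness/contrast jitter and Gaussian noise.
    brightness = 1.0 + 0.2 * (torch.rand(1) - 0.5)
    contrast = 1.0 + 0.2 * (torch.rand(1) - 0.5)
    rgb = (rgb - rgb.mean()) * contrast + rgb.mean() * brightness
    rgb = rgb + 0.02 * torch.randn_like(rgb)
    return rgb, vi, slope, mask

rgb, vi, slope = torch.rand(3, 256, 256), torch.rand(1, 256, 256), torch.rand(1, 256, 256)
mask = torch.randint(0, 2, (1, 256, 256)).float()
rgb, vi, slope, mask = augment(rgb, vi, slope, mask)
```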
4.2. Implementation Details
All experiments in this study were implemented using the PyTorch (v2.3.0) deep learning framework. Training and evaluation were conducted on a Linux server equipped with an Intel(R) Xeon(R) Platinum 8358P CPU, 48 GB of system RAM, and an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM). For model initialization, we adopted distinct strategies for the encoders: the RGB branch was initialized with weights pre-trained on the ImageNet dataset, whereas the two auxiliary branches were randomly initialized and trained from scratch. The models were trained for a total of 100 epochs, with the batch size set to 32. We utilized the AdamW optimizer to update model parameters, with the weight decay coefficient set to 1 × 10⁻⁴. The initial learning rate was set to 1 × 10⁻⁴, and a Cosine Annealing strategy was employed to dynamically adjust the learning rate during the training process.
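This optimization setup maps directly onto a few lines of PyTorch, as sketched below with a trivial placeholder network and dummy data standing in for TriGEFNet and the landslide loaders.

```python
# Sketch of the training configuration: AdamW (lr 1e-4, weight decay 1e-4),
# cosine-annealed learning rate over 100 epochs.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(5, 1, 3, padding=1)   # placeholder network for illustration only
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # In practice, iterate over the training loader (batch size 32) and use the
    # composite Dice + Focal loss; a single dummy step is shown here.
    x, y = torch.randn(2, 5, 64, 64), torch.rand(2, 1, 64, 64)
    loss = F.mse_loss(model(x), y)      # stand-in for the composite loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                    # decay the learning rate once per epoch
```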
4.3. Comparative Experiments
To validate the superiority of TriGEFNet, we conducted comparative experiments against a series of representative advanced models. First, to investigate the impact of incorporating VI and Slope data, we established four groups of baseline experiments based on U-Net, ranging from unimodal RGB input to trimodal input. Second, we compared our method with classic semantic segmentation models and advanced multimodal fusion models. For fair comparison, all models were trained and evaluated under the same experimental settings as the proposed model. The quantitative results and visual samples for the three datasets are presented in Table 1, Table 2 and Table 3 and Figure 8, Figure 9 and Figure 10, respectively.
We initially analyzed the impact of adding VI and Slope using U-Net with an early fusion strategy. The experimental results exhibited significant variations across different datasets. On the L4S dataset, multimodal data improved the IoU from 0.5681 to 0.5767, demonstrating the potential of multimodal features to provide complementary information. However, this gain was not universal. On the more heterogeneous Zunyi and Bijie datasets, the inclusion of auxiliary modalities conversely led to a decline in model performance. Specifically, on the Bijie dataset, the IoU achieved by trimodal fusion (0.7631) was lower than that of the unimodal RGB baseline (0.7820). These results indicate that multimodal feature fusion is not a simple linear gain process but a complex problem highly dependent on data characteristics, inter-modal correlations, and the fusion strategy. In landslide segmentation tasks, simply concatenating multimodal data at the input stage is a suboptimal approach as it ignores critical inter-modal heterogeneity and differing noise distributions, often introducing disruptive noise rather than enhancing performance.
To verify the universality of the aforementioned findings, we applied multimodal inputs to four classic segmentation frameworks: DeepLabV3+ [47], U-Net++ [48], SegFormer [49], and Mask2Former [50]. The results show that stronger network architectures yield better model performance. On the Zunyi dataset, Mask2Former achieved an IoU of 0.852, outperforming the 0.830 achieved by U-Net. This suggests that advanced model architecture is a key determinant of segmentation performance. Nevertheless, this simplistic early fusion approach limits the model’s ability to learn complex relationships between different modalities, failing to fully exploit the potential of multimodal features. As seen in the third sample of the Zunyi dataset in Figure 8, these models produced high false negative rates, indicating that the gain from supplementary information in auxiliary modalities could not offset the negative impact of the introduced noise. These experiments reveal the fundamental flaw of early fusion strategies: “blindly” mixing heterogeneous data at the pixel level not only imposes an optimization burden on the network but also introduces noise due to the lack of a guidance mechanism, thereby interfering with the learning of core features. Consequently, the model is unable to fully exploit the complementary information within the auxiliary modalities.
To overcome the limitations of early fusion, we designed TriGEFNet, which features a hierarchical guided enhancement and fusion mechanism as its core. We comprehensively compared it with four advanced multimodal segmentation models: SGNet [51], CMX [28], CMNeXt [29], and EAEFNet [30]. These models employ various sophisticated fusion strategies, including cross-modal attention and collaborative learning, representing the current frontier of multimodal semantic segmentation. Experimental results demonstrate that TriGEFNet achieved optimal performance across most core metrics on all three datasets, surpassing the comparative models. On the Zunyi dataset, TriGEFNet achieved a landslide IoU of 0.7454 and an mIoU of 0.8627, exceeding the second-best model, EAEFNet, by 0.38 and 0.19 percentage points, respectively. This confirms its quantitative superiority and demonstrates its stability under complex data source conditions. In the first sample of the Zunyi dataset, CMX, CMNeXt, and EAEFNet all over-relied on the VI signal, resulting in false positive predictions. In contrast, TriGEFNet produced the most precise boundary delineation and the best control over false positives and false negatives.
The superior performance of TriGEFNet is primarily attributed to its systematic resolution of the core issues in multimodal fusion. Independent encoders construct clear semantic pathways for each heterogeneous data source. During the encoding phase, the MGEM and DGFM work synergistically to implement intelligent guidance and feature screening fusion across multiple semantic levels, effectively avoiding the feature conflicts typical of early fusion. Subsequently, in the decoding phase, the GSRM screens shallow spatial details, ensuring the refined reconstruction of landslide boundaries. This cohesive design, centered on a core principle of Independent Encoding, Interactive Enhancement, and Asymmetric Fusion, enables high-performance landslide segmentation driven by multimodal information.
4.4. Ablation Experiments
To deeply analyze the internal mechanisms of the proposed TriGEFNet and quantitatively evaluate the individual contributions and combined efficacy of its three core innovative components—DGFM, MGEM, and GSRM—we designed a series of comprehensive ablation studies.
The baseline model employs a standard U-Net architecture with a ResNet34 backbone, identical to that used in the main experiments. It utilizes three independent encoders to process the three modalities, respectively. Both the multimodal feature fusion and the skip connections are implemented via simple concatenation. Subsequently, we independently validated the effectiveness of each module and incrementally integrated them until the final complete model was constructed. All ablation studies were conducted on the L4S dataset, with detailed results presented in Table 4.
The baseline model, adopting a triple-stream input with concatenation-based multimodal feature fusion, improved the F1-score from 0.7291 to 0.7381 compared to the single-stream early fusion method in Table 2. This demonstrates that providing independent encoders for each feature type facilitates the extraction of critical information unique to each modality. Building upon this baseline, we evaluated the utility of the three core modules by progressively incorporating them. As shown in Table 4, the individual introduction of any single module yields significant performance improvements. Notably, the DGFM makes the most substantial independent contribution, increasing the IoU from 0.5871 to 0.6026, highlighting its critical role in suppressing heterogeneous noise. Ultimately, the complete TriGEFNet model, which integrates all three proposed modules, achieved the best performance among all configurations.
This result provides evidence that the three proposed components are not merely a simple accumulation of functions but constitute a complementary and organic whole. The DGFM and MGEM synergize at the encoder stage to perform comprehensive feature enhancement and fusion, while the GSRM ensures at the decoder stage that this high-quality information is precisely utilized for boundary reconstruction, ultimately realizing accurate landslide segmentation.
4.4.1. DGFM
To validate the superiority of the proposed DGFM, we compared it against two baseline methods: Element-wise Addition (Add) and Channel Concatenation (Concat). As shown in Table 5, the DGFM achieved the best performance across all evaluation metrics, significantly outperforming the other two methods. This quantitatively demonstrates the effectiveness of our guided fusion strategy.
We further conducted a visualization analysis of this module, as illustrated in Figure 11. It can be clearly observed from the visualized feature maps that the fusion results of the Add and Concat methods contain substantial diffuse noise and erroneous activation regions, leading to severe confusion between the target and the background. Although the Concat method offers slight improvements, the issue of background interference remains pronounced in its feature maps.
In contrast, the feature maps generated by the proposed DGFM exhibit activation regions that are highly focused on the landslide areas indicated by the Ground Truth. Simultaneously, the module effectively suppresses background noise, achieving precise feature representation. These experimental results demonstrate that the guided fusion mechanism of the DGFM effectively resolves the issue of multimodal information conflict. By establishing the RGB stream as the dominant modality and utilizing it to dynamically screen auxiliary modal information, this module successfully avoids the noise interference often introduced by simplistic fusion strategies. This prioritized design ensures the efficiency and robustness of the fusion process. Consequently, the generated features are not only enriched in semantic information but also possess significantly enhanced discriminability.
4.4.2. MGEM
To further elucidate the internal working mechanism of the MGEM, we visualized the multimodal feature maps before and after enhancement, as shown in Figure 12. The visualization reveals a pattern of efficient feature synergy and functional specialization. The feature maps of the NDVI exhibited strong activation responses to bare landslide surfaces. Serving as a spatial attention signal, this response significantly enhanced the feature intensity of the corresponding regions in the RGB feature maps via the MGEM. This process assigned higher feature weights to landslide areas that were originally ambiguous in the RGB features, thereby effectively boosting feature discriminability. In contrast, the enhancement applied to the Slope features was more moderate, ensuring that critical topographic patterns were preserved without being overwhelmed by strong signals from other modalities.
In summary, experimental results demonstrate that the MGEM does not apply a homogenized enhancement across all modalities. Instead, by leveraging a shared guidance signal, it successfully establishes cross-modal conditional dependencies. Based on this foundation, it performs differentiated and asymmetric feature refinement and enhancement tailored to the specific strengths of each modality. Ultimately, this achieves efficient synergy and complementary enhancement among multimodal features, generating feature representations that are significantly more robust and discriminative than the input features.
4.4.3. GSRM
To validate the superiority of the proposed GSRM, we compared it with three classic skip connection strategies: Add, Concat, and Attention Gate (Attention) [52]. Table 6 shows that the GSRM yields the most significant performance improvement for the model, comprehensively outperforming the other strategies across all evaluation metrics.
The visualization of the module’s feature maps is presented in Figure 13. The Add and Concat methods, representing indiscriminate fusion strategies, inevitably introduce original noise and redundant background information from the encoder into the decoding path. This results in final feature maps exhibiting marked semantic ambiguity and background noise. While the classic Attention Gate is capable of filtering some irrelevant features, it erroneously triggers misguided attention toward background regions while attempting to suppress noise.
In contrast, the feature maps generated by the GSRM exhibit activation regions that are highly consistent with the landslide morphology, characterized by sharp and clear boundaries. This demonstrates its superior capability in feature refinement. The GSRM performs precise, adaptive screening and enhancement on the shallow detail features provided by the encoder. Consequently, it preserves high-frequency details critical for segmentation while effectively suppressing noise. By intelligently refining and fusing cross-level features, the GSRM generates feature representations that possess both high-level semantic discriminability and shallow spatial precision. This effectively bridges the semantic gap between deep abstract features and shallow geometric features, ensuring the integrity of the segmentation results.
5. Discussion
5.1. Comparative Analysis with Different Vegetation Indices
To investigate the impact of NDVI and NGRDI on model performance, we computed the corresponding NGRDI data using RGB imagery from the L4S dataset and utilized it as the input for the VI stream to train the model. Experimental results (Table 7) indicate that NDVI achieved superior performance across the board, owing to its use of the near-infrared (NIR) band. Compared to NGRDI, NDVI improved the IoU and F1-score by approximately 3.2% and 2.5%, respectively. This confirms that NDVI provides clearer features indicative of vegetation disruption along landslide boundaries, thereby enhancing the model’s segmentation capability.
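For clarity, both indices reduce to simple band ratios, as sketched below; the band ordering and tensor shapes are assumptions for illustration only.

```python
# Sketch of the two vegetation indices compared in this section:
# NDVI = (NIR - Red) / (NIR + Red), NGRDI = (Green - Red) / (Green + Red).
import torch

def ndvi(nir: torch.Tensor, red: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return (nir - red) / (nir + red + eps)

def ngrdi(green: torch.Tensor, red: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RGB-only alternative: Normalized Green-Red Difference Index.
    return (green - red) / (green + red + eps)

rgb = torch.rand(3, 256, 256)           # channels assumed ordered as (R, G, B)
nir = torch.rand(1, 256, 256)
vi_from_multispectral = ndvi(nir[0], rgb[0])
vi_from_rgb_only = ngrdi(rgb[1], rgb[0])
```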
However, the performance gap between the model using NGRDI and the one using NDVI is relatively narrow, validating the effectiveness of NGRDI as a viable alternative data source. The experiments demonstrate that our TriGEFNet retains robust landslide segmentation capabilities even in data-constrained scenarios where NDVI is unavailable. This significantly extends the model’s robustness and generalization ability under varying data conditions.
In emergency response scenarios following landslide disasters, the data most immediately available is often acquired by Unmanned Aerial Vehicles (UAVs) equipped with standard RGB cameras, whereas acquiring multispectral data may entail longer response times or higher logistical costs. Our experiments confirm that NGRDI can effectively serve as a substitute for NDVI as the VI input. In summary, the proposed TriGEFNet is not contingent upon specific data types; rather, it exhibits high robustness and adaptability to input data, thereby significantly expanding its potential for application in complex data environments.
5.2. Analysis of the Impact of Backbone Networks on Model Performance
To determine the optimal backbone network, we evaluated four classic architectures: ResNet18, ResNet34, ResNet50, and ResNet101. The experimental results (Table 8) indicate that model performance peaks with ResNet34, rather than monotonically increasing with network depth. Transitioning from ResNet18 to ResNet34 yielded a significant performance gain, with the IoU improving from 0.6012 to 0.6251. However, further increasing the depth to ResNet50 and ResNet101 resulted in performance saturation or even a slight decline; neither their IoU nor F1-scores surpassed those of ResNet34. We attribute this phenomenon to the trade-off between model complexity and task specificity. The feature extraction capability of ResNet34 proves sufficient for the landslide segmentation task. Although deeper networks possess stronger theoretical representational capacity, they heighten the risk of overfitting on limited datasets. Furthermore, they may extract excessively fragmented features at the expense of contextual information, a hypothesis supported by the observed decline in Recall rates.
Consequently, ResNet34 strikes the optimal balance between performance and efficiency. Based on these comparisons, ResNet34 was selected as the final backbone. This decision underscores the importance of refining model selection tailored to specific tasks, rather than merely increasing network depth.
5.3. Limitations and Future Work
Despite the significant advancements achieved by the proposed TriGEFNet in landslide segmentation accuracy compared to existing models, certain limitations persist. First, constrained by the prohibitive costs of acquiring remote sensing landslide samples, existing public datasets are relatively small in scale and geographically concentrated. Although we constructed the multi-source, multi-temporal Zunyi dataset, its coverage remains limited. To further enhance model robustness in complex scenarios and realize truly intelligent disaster management, the field requires a profound commitment to the continuous expansion and iteration of landslide datasets. By collecting cross-regional and cross-temporal remote sensing imagery to increase the spatiotemporal diversity of training data, we can fundamentally mitigate the risk of overfitting caused by small sample sizes.
Beyond expanding data diversity, enhancing the adaptability of the method across different scales is crucial. While currently validated at a regional level, TriGEFNet holds significant potential for detailed monitoring scenarios, such as mining safety and engineering geology. In these contexts, its multimodal fusion offers robustness against anthropogenic noise. Future research will leverage transfer learning by fine-tuning models pre-trained on satellite data with high-resolution UAV imagery. This approach aims to bridge the gap between regional surveys and the high precision required for specific engineering sites.
Furthermore, a paradigm shift is imperative: moving from the current purely data-driven semantic segmentation toward physics-informed disaster perception intelligence. This transition requires models to transcend mere pattern recognition to achieve mechanistic understanding. Given that landslide occurrence results from the complex coupling of geological, geomorphological, and hydrological factors, future research must focus on deeply exploring and fusing auxiliary modality data more closely related to landslide mechanisms. To this end, our future work will aim to construct a comprehensive sensing framework capable of capturing the environment of landslide development and dynamic triggering factors. Incorporating key causative factors—such as geological lithology, soil moisture, and InSAR surface deformation—will be a crucial step. These factors should no longer be viewed as mere supplementary inputs but as pivotal cues for understanding and modeling the physical processes of landslides, thereby enabling the model to learn the coupling laws governing these multi-factor interactions.
Finally, to support the physics-aware models, future research must transition from relying on small-sample, static image datasets to establishing dynamic monitoring benchmarks covering full regions and temporal sequences. This will significantly enhance the model’s generalization ability and segmentation accuracy in complex surface environments. More importantly, it will provide the possibility of capturing the complete dynamic chain of landslides, which spans from incubation and development to occurrence. The ultimate goal is to develop intelligent systems equipped with rudimentary physical perception capabilities, laying a solid foundation for a fundamental shift from post-disaster response to pre-disaster warning.
6. Conclusions
In this paper, we proposed TriGEFNet, a network specifically designed for landslide segmentation from multimodal remote sensing imagery. The model employs a triple-stream encoder architecture aimed at fully exploiting the complementary characteristics of RGB imagery, VI, and Slope. To address the challenges of semantic gaps and noise interference inherent in fusing multi-source heterogeneous data, we innovatively designed three core modules. The MGEM achieves the interaction and synergistic enhancement of cross-modal information by constructing shared guidance features, allowing the model to dynamically absorb context from other modalities while retaining unique modal information. Subsequently, the DGFM establishes an asymmetric fusion mechanism led by RGB and supplemented by auxiliary modalities, utilizing gating strategies to effectively filter redundant noise and ensure the quality of feature fusion. Finally, the GSRM utilizes high-level semantic features generated by the decoder to spatially screen shallow detail features from the encoder, effectively bridging the semantic discrepancy between the encoder and decoder and improving the detailed recovery of landslide boundaries.
Collectively, these three modules construct a comprehensive, refined feature processing framework spanning feature enhancement, multimodal fusion, and cross-level optimization. The core principle of this framework is to abandon simple, static information stacking in favor of a consistent, context-driven dynamic gating and guidance strategy. This systematic design ensures that information is processed optimally at every critical node of the model, ultimately constructing a feature representation that is both robust and refined for landslide segmentation tasks. Extensive comparative experiments were conducted on the self-constructed Zunyi dataset and the public Bijie and L4S datasets. The results demonstrate that TriGEFNet exhibits exceptional segmentation generalization and accuracy. The model achieved landslide IoU of 74.54%, 81.30%, and 62.51% on these three datasets, respectively, comprehensively outperforming classic semantic segmentation networks and advanced multimodal fusion models.
This study not only confirms the effectiveness of multimodal fusion in landslide segmentation but also reveals a critical methodological insight: introducing physical environmental priors into deep learning frameworks is an effective strategy to overcome the spectral confusion inherent in unimodal approaches. By integrating VI and slope as auxiliary modalities—coupled with a series of interaction and fusion modules—we effectively resolved the difficulty of spectral interference in complex scenes.
In conclusion, TriGEFNet provides not only a novel and efficient paradigm for the semantic segmentation of multimodal remote sensing imagery but also robust technical support for constructing large-scale and physics-aware automated landslide monitoring systems in practical scenarios.