Article

A Multi-Scale Feature Fusion Dual-Branch Mamba-CNN Network for Landslide Extraction

School of Environment and Spatial Informatics, China University of Mining and Technology, Xuzhou 221116, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 10063; https://doi.org/10.3390/app151810063
Submission received: 23 August 2025 / Revised: 7 September 2025 / Accepted: 12 September 2025 / Published: 15 September 2025
(This article belongs to the Section Environmental Sciences)


Featured Application

The proposed method enables the high-precision extraction of landslides from remote sensing imagery, providing a fundamental basis for landslide mapping and supporting geo-hazard prevention and management.

Abstract

Automatically extracting landslide regions from remote sensing images plays a vital role in landslide inventory compilation. However, this task remains challenging due to the considerable diversity of landslides in terms of morphology, triggering mechanisms, and internal structure. Thanks to its efficient long-sequence modeling, Mamba has emerged as a promising candidate for semantic segmentation tasks. This study adopts Mamba for landslide extraction to improve the recognition of complex geomorphic features. While Mamba demonstrates strong performance, it still faces challenges in capturing spatial dependencies and preserving fine-grained local information. To address these challenges, we propose a multi-scale spatial context-guided network (MSCG-Net). MSCG-Net features a dual-branch architecture, comprising a convolutional neural network (CNN) branch that captures detailed spatial features and an omnidirectional multi-scale Mamba (OMM) branch that models long-range contextual dependencies. We introduce an adaptive feature enhancement module (AFEM) that integrates global context with local details, improving both multiscale feature richness and boundary clarity. Additionally, we develop an omnidirectional multiscale scanning (OMSS) mechanism that combines omnidirectional attention with multi-scale feature extraction to improve contextual modeling while preserving computational efficiency. Comprehensive evaluations on two benchmark datasets demonstrate that MSCG-Net outperforms existing approaches, achieving IoU scores of 78.04% on the Bijie dataset and 81.13% on the GVLM dataset. Furthermore, it exceeds the second-best methods by 2.28% and 4.25% in Boundary IoU, respectively.

1. Introduction

Landslides, as a common geological disaster, not only pose a serious threat to human life and safety but also cause considerable social and economic losses [1]. They often result in casualties, structural collapses, road damage, and disruptions to transportation networks while also inflicting extensive damage to critical infrastructure such as water supply systems, power grids, and communication networks. These consequences are especially severe in regions experiencing heavy rainfall, featuring steep terrain, or prone to seismic activity. The complexity and severity of landslide events not only hinder emergency response efforts but also prolong recovery, potentially leading to ecological degradation and long-term challenges to socio-economic reconstruction. With climate change and human activities continuing to intensify, both the frequency and severity of landslides have been on the rise [2]. Given the need for rapid and accurate identification of affected areas in landslide disaster management, remote sensing technology has become a vital tool due to its broad spatial coverage and affordability [3]. Relevant landslide datasets have been widely developed and utilized in recent studies. In this work, we selected two representative regions: the Bijie area in northwestern Guizhou Province, China, and a globally distributed very-high-resolution landslide mapping dataset (GVLM). Bijie features complex geology, steep slopes, and heavy rainfall, along with frequent human engineering activities, making it one of China’s most landslide-prone regions and a typical setting for developing and validating landslide detection methods. In contrast, the GVLM dataset covers landslide events at 17 sites worldwide, triggered by diverse mechanisms such as rainfall, earthquakes, snow and ice melt, tropical cyclones, slope failures, and typhoons, and exhibits substantial heterogeneity in scale, morphology, timing, spatial distribution, phenology, and land cover. Combining these datasets allows for a systematic evaluation of the proposed method across both local disaster-prone environments and diverse global scenarios, verifying its generalization and robustness under varying triggers, geomorphic conditions, and land cover, and providing a solid foundation for high-precision landslide detection in complex landscapes.
Accordingly, convolutional neural networks (CNNs) have been widely applied to landslide extraction due to their superior feature extraction capabilities [4,5]. This adoption has facilitated a shift from traditional methods based on handcrafted features to approaches that enable automatic feature learning [6]. Approaches based on U-Net have shown robust performance in practical landslide extraction tasks [5,7]. For instance, Qi et al. [8] reported that ResU-Net achieved a Recall of 0.83 in distinguishing landslides on exposed floodplains along river valleys and on uncultivated terraces within their study area. Similarly, Ghorbanzadeh et al. [9] conducted a comprehensive evaluation of the transferability of both U-Net and ResU-Net for landslide detection. Notably, the ResU-Net model trained on a Japanese landslide dataset attained an F1-score of 63.21 when transferred to the target region, demonstrating its strong cross-regional generalization capability. However, due to the limited receptive field inherent in CNNs, these models struggle to capture long-range dependencies, especially when addressing small-scale landslides, shadowed areas, and spectral confusion between landslides and neighboring areas. The discriminability of object characteristics is inherently scale-dependent, as certain features are only observable and distinguishable at specific levels of granularity [10]. To tackle this issue, researchers have enhanced multiscale modeling and contextual understanding by integrating architectural components such as pyramid pooling modules [11], residual connections [12], and attention mechanisms [13]. While these strategies refine feature representation to some extent, the limited receptive field of CNNs still restricts their ability to effectively model global context and capture long-range dependencies. To overcome these limitations, recent Transformer-based models have shown powerful global modeling capabilities in remote sensing geoscientific analysis, achieving significant performance gains [14,15]. For instance, in landslide detection tasks in Nepal, ShapeFormer [14] achieved approximately 10% higher Recall compared to baseline methods, substantially mitigating the risk of missed detections. Moreover, CTLGNet [15] delivered strong performance in landslide susceptibility mapping within the Three Gorges Reservoir area and Jiuzhaigou, achieving AUC values of 0.9817 and 0.9693, respectively. These outcomes underscore the importance of global contextual modeling mechanisms for robust feature recognition in complex geographical environments. Nevertheless, the high computational cost of self-attention, together with its inefficiency in capturing fine-grained local details, constrains its applicability to high-resolution imagery and heterogeneous terrains.
The recently proposed Mamba state-space model [16] offers a promising solution to the above shortcomings. By introducing the S6 module, Mamba leverages a state-space sequence modeling approach that enables efficient long-range dependency capture, surpassing CNNs in contextual reasoning and scaling better than Transformers in large-scale scenarios. To adapt Mamba to vision tasks, researchers have proposed several variants, including Vim [17] and VMamba [18], and have further improved them using novel scanning strategies [19,20,21] and multiscale feature fusion techniques [22]. Zhao et al. [23] applied Mamba-based segmentation models, namely, RS3Mamba [24] and Ultralight VM-Unet [25], to landslide datasets and demonstrated their effectiveness in remote sensing landslide mapping, especially in comparison with Transformer-based models. However, despite their strength in global modeling, these models remain limited in capturing fine spatial details required for precise landslide delineation [26].
The aforementioned methods often suffer from limited generalizability in highly heterogeneous terrains, resulting in misdetections or omissions [27]. Combining the local feature extraction strengths of CNN with Mamba’s global context modeling ability provides an effective means to balance fine-grained details and global context. VM-Unet [28] combines global context modeling of Mamba with local feature extraction capabilities of U-Net to enhance recognition of complex textures and structures. CM-Unet [29] integrates a CNN encoder with a Mamba-based decoder to better capture long-range dependencies and improve semantic segmentation performance. CVMH-Unet [30] employs a dual-branch module that fuses CNN-derived local features and Mamba-modeled global context to enhance both local and global representations. Swin-UMamba [31] improves performance and efficiency by replacing components of the Swin Transformer [32] extraction module with VMamba blocks initialized with pre-trained weights. Yang et al. [33] combine a Transformer encoder with a Mamba-based feature fusion module to fully leverage multiscale and high-frequency feature encoding, achieving outstanding results in disaster segmentation. SCGC-Net [34] introduces the HMIE structure, which integrates enhanced CNN and Mamba features to improve landslide recognition accuracy and generalization through contextual modulation and progressive calibration. A promising direction is to combine Mamba’s global modeling capability with CNN’s strong local feature extraction to better address complex landslide boundaries and terrain heterogeneity, thereby enhancing generalization across regions and multi-sensor imagery.
This study presents a novel network called the Multiscale Spatial Context-Guided Network (MSCG-Net). MSCG-Net employs a dual-branch architecture, with a straightforward CNN-based branch to capture spatial details and an omnidirectional multiscale Mamba (OMM)-based branch to model global dependencies for landslide boundary extraction. This work introduces an omnidirectional multiscale scanning (OMSS) mechanism to better capture contextual information. Additionally, an Adaptive Feature Enhancement Module (AFEM) is designed to effectively fuse global dependencies and local details from both branches. The main contributions of this work are summarized as follows:
  • MSCG-Net is a novel landslide detection network combining omnidirectional multiscale Mamba modules and convolutional layers in a dual-branch architecture. This enables effective fusion of global and local features, improving boundary preservation and segmentation quality.
  • The OMSS module is proposed by employing a multidirectional and multiscale scanning strategy to enhance boundary perception in complex and heterogeneous landslide regions while improving computational efficiency.
  • The Adaptive Feature Enhancement Module (AFEM) is introduced to effectively enhance the spatial details and global dependencies extracted from both branches.

2. Materials and Methods

2.1. Overall Framework of the Model

Landslide detection using high-resolution remote sensing imagery faces not only challenges in multi-scale feature extraction but also the inherent complexity of landslides. This complexity, which arises from the high intra-class variability in size, shape, spectral signature, and surrounding context of landslides, necessitates a robust detection algorithm capable of effectively adapting to diverse terrain features and landslide morphologies. To this end, this study proposes MSCG-Net, a dual-branch fusion network that integrates Mamba and CNN modules. The overall architecture of the network is shown in Figure 1, and its main components are described as follows:
  • A dual-branch architecture is designed to extract both contextual dependencies and boundary details.
  • The OMSS module introduces diagonal semantic information and integrates multiscale features to enhance the model’s perception of features across different spatial scales.
  • The AFEM module filters noise and enhances features from both branches to accurately refine contextual representations and edge information.
The method proposed in this paper comprises the following main stages (see Figure 1). First, two parallel branches process the input image independently: the Mamba branch extracts global contextual dependencies, while the CNN branch captures local boundary contours. Next, the Structural Feature Attention (SFA) module refines the features from the Mamba branch by extracting and enhancing structural information. Simultaneously, edge details in the CNN branch are enhanced using the Self-Adaptive High-Pass Filter (SAHF). Then, feature maps from both branches are aligned and fused through upsampling to unify their spatial resolutions. Finally, the fused multiscale features are processed by a feature fusion module to generate the final semantic segmentation output.

2.2. The Local Detail Extraction Branch

Accurate boundary delineation plays a vital role in landslide inventory mapping, as clear and well-defined edges help capture landslide morphology and reduce false detections. This requires the algorithm to have high boundary sensitivity and the ability to precisely capture contours. A CNN is well suited as the local branch due to its strength in capturing fine textures and details. In this branch, the input image (C × H × W) is processed by a series of convolutional blocks (Figure 2a) and progressively downsampled to H/2 × W/2, H/4 × W/4, H/8 × W/8, and H/16 × W/16 using max pooling, with increasing channel dimensions. This multilevel design enhances boundary perception while preserving spatial details.
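To make the branch design concrete, the following PyTorch sketch illustrates one way such a local detail branch could be implemented; the two-convolution block composition and the channel widths (64 to 512) are illustrative assumptions rather than the exact configuration of Figure 2a.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 conv + BN + ReLU layers (an assumed realization of Figure 2a)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class LocalDetailBranch(nn.Module):
    """CNN branch: each stage is a conv block followed by 2x max pooling,
    yielding features at H/2, H/4, H/8, and H/16 with growing channel width."""
    def __init__(self, in_ch=3, widths=(64, 128, 256, 512)):
        super().__init__()
        chans = (in_ch,) + widths
        self.stages = nn.ModuleList(ConvBlock(c_in, c_out)
                                    for c_in, c_out in zip(chans[:-1], chans[1:]))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = self.pool(stage(x))   # halve resolution, keep boundary detail
            feats.append(x)
        return feats                  # multi-level features for later fusion
```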

2.3. The Contextual Feature Extraction Branch

Landslide images from different geographical regions display significant morphological variations, which makes preserving the complete structure of landslide areas during segmentation crucial. Inspired by VMamba, this study designs a contextual feature extraction branch that leverages an enhanced state-space attention mechanism. This branch consists of one PE block, three PM blocks, and four OMM blocks (see Figure 1). To address the high computational and memory demands of Mamba at medium and high resolutions, the input image is first downsampled to one-quarter of its original resolution, effectively reducing the computational load while expanding the receptive field and preserving essential spatial structures. The PE block (see Figure 2b) performs layer-wise convolutional downsampling followed by normalization and activation to enhance feature representation. Convolution operations help retain local structures, improve texture edge detection, and extract fine boundary details. The PM block employs a 3 × 3 convolution with stride 2 for efficient spatial compression.
The OMM block comprises an Omnidirectional Multiscale Visual State-Space Block (OMVSS) and a Convolutional Feedforward Network (ConvFFN) (see Figure 3). The Visual State-Space (VSS) module serves as the core of the Mamba architecture. Previous studies have demonstrated that diagonal scanning can enhance remote sensing image recognition [20,21,30,35], while multiscale scanning helps mitigate the degradation of long-range dependencies [22]. To better capture the complex structure of landslide boundaries, this study proposes the OMSS strategy (see Figure 4). The input is processed by two depthwise separable convolutions: one at the original resolution and the other at 1/2 resolution. The feature maps at the original resolution are expanded along the horizontal and vertical directions, whereas the downsampled feature maps are expanded along the diagonal and anti-diagonal directions. These directional sequences are fed into the S6 module. The original and downsampled feature maps are denoted as $Z_1 \in \mathbb{R}^{H \times W \times D}$ and $Z_2 \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D}$, respectively.
$$Y_1, Y_2 = S6\big([\,S_1(Z_1),\ S_2(Z_1)\,]\big), \qquad Y_3, Y_4 = S6\big([\,S_3(Z_2),\ S_4(Z_2)\,]\big),$$
where $S6$ denotes the selective scanning mechanism proposed in Mamba. The transformations $S_1$–$S_4$ convert 2D feature maps into 1D sequences along the four scanning directions, and the resulting outputs are represented by $Y_1$–$Y_4$. The resulting sequences are then reconstructed into 2D feature maps, and the downsampled outputs are interpolated back to the original resolution prior to fusion.
$$Z_i' = S^{-1}(Y_i), \quad i \in \{1, 2, 3, 4\},$$
$$Z = Z_1' + Z_2' + \mathrm{Interpolate}(Z_3') + \mathrm{Interpolate}(Z_4'),$$
where $S^{-1}$ is the inverse transformation of $S$, which reconstructs the 1D sequence back into the original 2D feature map.
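The scanning layout can be made concrete with a short PyTorch sketch. Here `s6` is a stand-in callable for the selective-scan (S6) sequence model, and the gather-based diagonal traversal is one plausible realization of the strategy in Figure 4, not the authors’ exact implementation.

```python
import torch
import torch.nn.functional as F

def diagonal_indices(h, w, anti=False):
    """Flattened indices that visit an h x w grid along (anti-)diagonals."""
    idx = []
    for s in range(h + w - 1):
        for i in range(max(0, s - w + 1), min(h, s + 1)):
            j = s - i
            idx.append(i * w + (w - 1 - j if anti else j))
    return torch.tensor(idx)

def unscan(seq, idx, h, w):
    """Inverse of a gather-based scan: scatter the 1D sequence back onto the grid."""
    out = torch.empty_like(seq)
    out[..., idx] = seq
    return out.view(seq.size(0), seq.size(1), h, w)

def omss(z1, z2, s6):
    """z1: (B, D, H, W) full-resolution features; z2: (B, D, H/2, W/2) downsampled
    features; s6: placeholder for the selective-scan (S6) sequence model."""
    b, d, h, w = z1.shape
    y1 = s6(z1.flatten(2))                            # horizontal (row-major) scan
    y2 = s6(z1.transpose(2, 3).flatten(2))            # vertical (column-major) scan
    di = diagonal_indices(h // 2, w // 2).to(z1.device)
    ai = diagonal_indices(h // 2, w // 2, anti=True).to(z1.device)
    y3 = s6(z2.flatten(2)[..., di])                   # diagonal scan
    y4 = s6(z2.flatten(2)[..., ai])                   # anti-diagonal scan
    z1p = y1.view(b, d, h, w)
    z2p = y2.view(b, d, w, h).transpose(2, 3)
    z3p = unscan(y3, di, h // 2, w // 2)
    z4p = unscan(y4, ai, h // 2, w // 2)
    up = lambda t: F.interpolate(t, size=(h, w), mode="bilinear", align_corners=False)
    return z1p + z2p + up(z3p) + up(z4p)              # fusion formula above
```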
Although OMVSS can model long-range dependencies, its nonlinear representation capability remains limited. To overcome this limitation, the ConvFFN module (see Figure 3) is introduced, which enhances overall feature representation via non-linear transformations and cross-channel information interaction.
$$\mathrm{ConvFFN}(x) = x + \mathrm{Conv}_2\big(\mathrm{DWConv}(\mathrm{Conv}_1(\mathrm{LN}(x)))\big) + \mathrm{Conv}_2\big(\mathrm{Conv}_1(\mathrm{LN}(x))\big),$$
where $x$ denotes the feature extracted by OMVSS, $\mathrm{Conv}_1$ expands the channels to enhance representation, $\mathrm{DWConv}$ captures multi-scale local information, and $\mathrm{Conv}_2$ performs weighted feature fusion. This design improves multi-directional long-range dependency modeling, enhances global context understanding, and balances feature diversity with consistency.
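A minimal sketch of a ConvFFN consistent with the formula above is shown below; the channel expansion ratio, GELU activation, and depthwise kernel size are assumptions, and GroupNorm with one group stands in for layer normalization on NCHW feature maps.

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """Convolutional feedforward network following the formula above;
    expansion ratio and activation choice are assumptions."""
    def __init__(self, dim, expand=4, dw_kernel=3):
        super().__init__()
        hidden = dim * expand
        self.norm = nn.GroupNorm(1, dim)        # stands in for LN on NCHW maps
        self.conv1 = nn.Conv2d(dim, hidden, 1)  # channel expansion
        self.dwconv = nn.Conv2d(hidden, hidden, dw_kernel,
                                padding=dw_kernel // 2, groups=hidden)
        self.conv2 = nn.Conv2d(hidden, dim, 1)  # fusion back to input width
        self.act = nn.GELU()

    def forward(self, x):
        h = self.act(self.conv1(self.norm(x)))
        # x + Conv2(DWConv(Conv1(LN(x)))) + Conv2(Conv1(LN(x)))
        return x + self.conv2(self.dwconv(h)) + self.conv2(h)
```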

2.4. Adaptive Feature Enhancement Module

The complexity of terrain and boundaries in landslide areas necessitates more effective multiscale feature alignment and enhancement to ensure accurate segmentation. Although the SSM module in the context branch captures global dependencies, its reliance on sequential computation may impair spatial structural integrity. Meanwhile, the convolutional branch is susceptible to noise, which may degrade the quality of boundary representations. The Adaptive Feature Enhancement Module (AFEM) is designed to jointly refine the global features from the Mamba branch and the boundary-sensitive features from the CNN branch. As shown in Figure 5, the AFEM consists of two complementary components: the Structural Feature Attention (SFA) module applied to the Mamba branch and the Self-Adaptive High-Pass Filter (SAHF) module applied to the CNN branch.
Specifically, SFA enhances the context branch by mitigating the information loss caused by downsampling and sequence modeling, reinforcing structural cues through a gated enhancement pathway. In parallel, SAHF improves the convolutional branch by suppressing low-frequency noise and adaptively enhancing edge details with a high-pass filter guided by the Kaiser window and Carafe operator. The feature integration process of AFEM is formulated as follows: The refined features from both branches are obtained through residual enhancement and then fused:
$$F_1' = \mathrm{Interpolate}\big(F_1 + \mathrm{SFA}(F_1)\big), \qquad F_2' = F_2 + \mathrm{SAHF}(F_2),$$
where $F_1$ and $F_2$ represent the input features from the Mamba and CNN branches, respectively. The $\mathrm{Interpolate}$ operation denotes 2× bilinear upsampling, which aligns the spatial resolution of the enhanced Mamba-branch features $F_1'$ with that of the CNN-branch features $F_2'$. The final fused output $F$ is then obtained by summing the two enhanced feature maps. This integration strategy ensures a complementary fusion: the upsampled, structurally enhanced global context from the Mamba branch is directly combined with the noise-suppressed, edge-sharpened local features from the CNN branch. The result is a fused representation that combines rich global semantics with precise boundary details, thereby significantly improving MSCG-Net’s ability to accurately perceive, recognize, and delineate the complex morphology of landslides.
SFA (see Figure 5a) addresses the challenges of downsampling and sequence conversion in the context branch, which may cause noise and lead to a context loss. It enhances low-frequency information through a smoothing compensation branch and reinforces structural features via a gated enhancement branch, thereby improving overall recognition accuracy. The context branch itself captures global and semantic information to aid in locating landslide areas.
$$Z = \mathrm{DWConv}(F_C) \odot \mathrm{SE}\big(\mathrm{DWConv}(F_C)\big) + \mathrm{Avg}(F_C),$$
where $F_C$ represents the context branch features, $\mathrm{Avg}$ denotes average pooling, and $\mathrm{SE}$ is a channel attention mechanism involving pooling, nonlinear transformation, and reweighting. The symbol $\odot$ indicates elementwise multiplication. This design strengthens contextual feature representation, suppresses noise, and facilitates the accurate capture of landslide details, thereby improving the overall accuracy of the model.
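The SFA computation above can be sketched as follows; the depthwise kernel size, the SE reduction ratio, and the average-pooling window of the smoothing compensation branch are assumptions.

```python
import torch.nn as nn

class SE(nn.Module):
    """Channel attention: global pooling, bottleneck MLP, sigmoid reweighting."""
    def __init__(self, dim, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // r, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return self.fc(x)

class SFA(nn.Module):
    """Gated structural enhancement plus smoothing compensation (formula above)."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise conv
        self.se = SE(dim)
        self.avg = nn.AvgPool2d(3, stride=1, padding=1)          # low-frequency branch

    def forward(self, fc):
        gate = self.dw(fc)
        # Z = DWConv(Fc) ⊙ SE(DWConv(Fc)) + Avg(Fc)
        return gate * self.se(gate) + self.avg(fc)
```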
SAHF enhances edge features in the convolutional branch (see Figure 5b). Convolutions with limited receptive fields may overlook fine details in complex regions. To further enhance boundary features, the SAHF module integrates a high-pass filter to highlight edges and suppress low-frequency components. Additionally, a Kaiser window is applied to mitigate spectral leakage and reduce artifacts, ensuring cleaner and more accurate feature extraction. On top of this, Carafe adaptively generates content-aware filter kernels to further improve detail perception and enhance the model’s ability to recognize fine boundaries. For features $X_l \in \mathbb{R}^{B \times C \times H \times W}$, initial filter kernel weights are predicted by a convolutional layer.
$$\Phi_l = \mathrm{Conv}_{k_e}(X_l), \qquad \Phi_l \in \mathbb{R}^{B \times G k^2 \times H \times W},$$
where $k$ is the high-pass filter kernel size, $k_e$ is the convolution kernel size, and $\Phi_l$ represents the high-pass filter kernel parameter tensor. The kernel is then normalized and modulated by the window function to produce the final effective filter.
$$\tilde{\Phi}_l = \mathrm{Reshape}\big(\Phi_l, [B, G, k^2, H, W]\big),$$
$$\hat{\Phi}_l^{\,i,j} = \mathrm{Softmax}\big(\tilde{\Phi}_l[:,:,i,j] \odot K_\beta\big), \qquad K_\beta(p,q) = \frac{I_0\!\left(\beta \sqrt{1 - \frac{4\,(p_u^2 + q_v^2)}{k^2}}\right)}{I_0(\beta)},$$
where $\hat{\Phi}_l$ denotes the final spatially adaptive filtering kernel obtained after normalization and modulation with the Kaiser window, $(p, q)$ represent local spatial coordinates, $(u, v)$ are normalized frequency components, $K_\beta$ is the parameterized 2D Kaiser window, $I_0$ is the zero-order modified Bessel function, and $\beta$ controls the energy concentration of the main lobe. Adaptive low-pass filtering is then applied using the Carafe operator.
$$X_l^{\mathrm{low}} = \mathrm{Carafe}\big(X_l, \hat{\Phi}_l\big) = \sum_{(p,q) \in N} \hat{\Phi}_l^{\,i,j}(p,q) \cdot X_l\!\left(i + p - \left\lfloor \tfrac{k}{2} \right\rfloor,\ j + q - \left\lfloor \tfrac{k}{2} \right\rfloor\right),$$
Finally, the high-frequency component is extracted through signal decomposition.
$$X_l^{\mathrm{high}} = X_l - X_l^{\mathrm{low}}.$$
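The following sketch assembles the four SAHF steps above into one module, using an unfold-based, CARAFE-style adaptive filtering step; the group count G, the kernel sizes, β, and the separable form of the 2D Kaiser window are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kaiser_window_2d(k, beta):
    """Separable k x k Kaiser window (an assumed layout of the 2D window)."""
    w = torch.kaiser_window(k, periodic=False, beta=beta)
    return torch.outer(w, w).reshape(1, 1, k * k, 1, 1)

class SAHF(nn.Module):
    """Predict per-pixel kernels, normalize them under a Kaiser window, apply
    CARAFE-style low-pass filtering, and return the high-frequency residual."""
    def __init__(self, dim, k=5, groups=1, beta=8.0, ke=3):
        super().__init__()
        self.k, self.g = k, groups
        self.pred = nn.Conv2d(dim, groups * k * k, ke, padding=ke // 2)  # kernel prediction
        self.register_buffer("win", kaiser_window_2d(k, beta))

    def forward(self, x):
        b, c, h, w = x.shape
        phi = self.pred(x).view(b, self.g, self.k * self.k, h, w)
        phi = F.softmax(phi * self.win, dim=2)              # window-modulated softmax
        patches = F.unfold(x, self.k, padding=self.k // 2)  # (B, C*k*k, H*W)
        patches = patches.view(b, self.g, c // self.g, self.k * self.k, h, w)
        x_low = (patches * phi.unsqueeze(2)).sum(3).view(b, c, h, w)  # adaptive low-pass
        return x - x_low                                    # high-frequency component
```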

2.5. Feature Fusion Module

Feature fusion techniques play a critical role in landslide extraction tasks. An effective fusion strategy should be capable of handling multiscale features, mitigating spatial misalignment, and adapting to diverse geographic environments and imaging conditions, thereby enhancing the accuracy and robustness of landslide detection in complex scenes. In our dual-branch architecture, features extracted from the Mamba branch and the convolution branch are aligned and fused in the feature enhancement stage, resulting in a new representation that preserves both semantic and detailed information. To further integrate features across different levels, this study introduces a feature fusion module (see Figure 6), which performs hierarchical decoding and semantic refinement of the fused multi-scale features.
Multiscale features vary in their contribution to landslide region extraction, and directly fusing them may lead to information loss and degrade the segmentation performance of landslide features [36]. In this section, spatial information is reconstructed under semantic guidance through a hierarchical decoding pathway. The module begins with high-level semantic features and progressively restores spatial resolution by cascading convolutional and upsampling layers. Then, the shallow features are channel-aligned and adaptively fused with the upsampled deep features through skip connections in order to enhance boundary details and reduce spatial misalignment. Finally, by progressively restoring spatial resolution and channel dimensions, the network generates high-resolution feature maps with well-defined structures, supporting accurate segmentation of landslide regions.
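A minimal PyTorch sketch of such a hierarchical decoding pathway is given below; the channel widths, the 1 × 1 channel-alignment convolutions, and the additive skip fusion are illustrative assumptions.

```python
import torch.nn as nn

class FusionDecoder(nn.Module):
    """Hierarchical decoding: deep fused features are upsampled stage by stage
    and combined with channel-aligned shallow features via skip connections."""
    def __init__(self, widths=(512, 256, 128, 64), num_classes=1):
        super().__init__()
        self.up = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))
            for c_in, c_out in zip(widths[:-1], widths[1:]))
        self.align = nn.ModuleList(nn.Conv2d(c, c, 1) for c in widths[1:])
        self.head = nn.Conv2d(widths[-1], num_classes, 1)

    def forward(self, feats):
        # feats: fused multi-scale maps, deepest first (e.g. H/16, H/8, H/4, H/2)
        x = feats[0]
        for up, align, skip in zip(self.up, self.align, feats[1:]):
            x = up(x) + align(skip)  # restore resolution, add aligned shallow detail
        return self.head(x)          # high-resolution segmentation logits
```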

2.6. Performance Metrics

To evaluate the performance of MSCG-Net and other comparative methods in landslide detection, the study selects five widely used quantitative metrics: overall accuracy (OA), precision (P), recall (R), F1-score (F1), and IoU. Additionally, to better assess boundary recognition performance, this study adopts Boundary IoU (BIoU) [37], a metric commonly used in medical image segmentation to measure boundary accuracy.
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FN + FP},$$
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R},$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} = \frac{|G \cap P|}{|G \cup P|},$$
$$\mathrm{BIoU} = \frac{\big|(G_d \cap G) \cap (P_d \cap P)\big|}{\big|(G_d \cap G) \cup (P_d \cap P)\big|},$$
TP is a landslide pixel correctly predicted by the model. FP is a non-landslide pixel incorrectly classified as a landslide. TN is a correctly identified non-landslide pixel. FN is a landslide pixel missed by the model. G represents the ground truth binary mask, and P is the predicted binary mask. Gd refers to the banded region formed by dilating the contour of G by d pixels, while Pd refers to the banded region formed by dilating the contour of P by d pixels. In landslide detection tasks, evaluation typically emphasizes Recall and Intersection over Union (IoU), reflecting the importance of detecting all landslide pixels and measuring overall segmentation accuracy. In our experiments, we additionally incorporate Boundary IoU (BIoU) to specifically assess the precision of predicted boundaries. Recall measures the proportion of true landslide pixels correctly detected, IoU quantifies the overlap between predicted and ground-truth regions relative to their union, and BIoU evaluates boundary accuracy within a narrow band around edges. While other metrics are also reported, these three provide the most informative assessment of MSCG-Net’s effectiveness in landslide detection and boundary delineation.
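For reference, IoU and Boundary IoU can be computed from binary masks as in the following NumPy/SciPy sketch, where the d-pixel boundary band is obtained as the mask minus its d-pixel erosion; the band width d = 2 is an assumption, not the value used in the paper.

```python
import numpy as np
from scipy import ndimage

def iou(a, b):
    """IoU of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def boundary_band(mask, d):
    """Pixels of `mask` within d pixels of its contour (mask minus its erosion)."""
    return np.logical_and(mask, ~ndimage.binary_erosion(mask, iterations=d))

def boundary_iou(gt, pred, d=2):
    """Boundary IoU: plain IoU restricted to the d-pixel boundary bands."""
    return iou(boundary_band(gt, d), boundary_band(pred, d))
```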

2.7. Datasets and Study Areas

Deep learning-based landslide detection requires large amounts of annotated imagery. For this study, we used two publicly available datasets. The first is from the Bijie region in northwestern Guizhou Province, China, a transitional slope zone between the Qinghai–Tibet Plateau and the eastern hills. This area, marked by complex geology, steep terrain, heavy rainfall, and intensive human activities, is among the most landslide-prone areas in China and serves as a representative test site. The Bijie dataset [38] contains 770 landslide and 2003 non-landslide samples, cropped from 0.8 m TripleSat imagery. While it includes diverse negative samples (e.g., mountains, villages, roads, rivers, forests, and farmland), each positive sample depicts only a single landslide, limiting its use for multi-landslide detection. The second dataset, GVLM [39], comprises 17 pairs of bi-temporal high-resolution images (0.59 m) from landslide-affected regions worldwide. It covers events triggered by diverse mechanisms, including rainfall, earthquakes, snowmelt, tropical cyclones, slope failures, and typhoons, capturing wide variation in scale, morphology, timing, distribution, phenology, and land cover. Its cropped images feature multiple landslides, making it well suited for evaluating model performance under diverse and complex conditions. Together, these datasets differ markedly in geography, landslide types, and imaging characteristics, providing a robust basis for comprehensive model assessment. Figure 7 and Figure 8 show the spatial distribution of landslide points in the Bijie dataset and the GVLM dataset, respectively.
These datasets highlight the key challenges in landslide detection, including differences in spatial resolution, diverse geographic environments, and various sources of interference. A considerable number of images exhibit occlusions caused by clouds and shadows, which significantly reduce the visibility of landslide areas. In addition, features such as cultivated land and bare soil often share similar color and spectral characteristics with landslides, which may cause misclassification by the model. Representative examples from both the Bijie and GVLM datasets, illustrating their varied complexities and typical landslide appearances, are provided in Figure 9. To evaluate the performance of MSCG-Net, 770 positive landslide samples from the Bijie dataset are used. For the GVLM dataset, 2429 image patches of size 256 × 256 pixels are cropped from both the image data and the corresponding landslide masks. Both datasets are split into training and testing sets at a ratio of 70:30. Cross-dataset testing allows for assessing the effectiveness and adaptability of the proposed method. Furthermore, to address the issue of limited data, random augmentation strategies, including random cropping, flipping, rotation, and color-space transformations, are applied during training to better optimize the landslide detection model.
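An augmentation pipeline of this kind could be expressed, for example, with Albumentations, which applies geometric transforms identically to the image and its mask; the specific operations and parameters below are assumptions, as the exact values are not listed in the paper.

```python
import albumentations as A

# Assumed augmentation pipeline: crop, flips, rotation, and color jitter.
train_aug = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),
])

def augment(image, mask):
    """image: (H, W, 3) array; mask: (H, W) binary array. Geometric transforms
    are applied identically to both; color jitter touches only the image."""
    out = train_aug(image=image, mask=mask)
    return out["image"], out["mask"]
```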

3. Results

3.1. Experiment Settings

All experiments were conducted on an NVIDIA QUADRO RTX6000 GPU using the PyTorch (version 2.1.1) framework. To ensure consistency, all input images were resized to 256 × 256 pixels with a batch size of 4. Data augmentation included random scaling, horizontal and vertical flipping, and normalization. The model was trained using the AdamW optimizer with an initial learning rate of 2 × 10−4 and weight decay of 0.01. Binary cross-entropy was used as the loss function, and the training was run for 100 epochs.
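The training configuration translates directly into a standard PyTorch loop; `MSCGNet` and `train_loader` are placeholders for the proposed model and the data pipeline described above.

```python
import torch
import torch.nn as nn

model = MSCGNet().cuda()                 # placeholder for the proposed network
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
criterion = nn.BCEWithLogitsLoss()       # binary cross-entropy on raw logits

for epoch in range(100):
    model.train()
    for images, masks in train_loader:   # 256x256 patches, batch size 4
        images = images.cuda()
        masks = masks.float().cuda()     # (B, 1, 256, 256) binary landslide masks
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
```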

3.2. Comparative Experiments

To comprehensively evaluate the performance of MSCG-Net, this study compared it with a variety of deep learning methods on two datasets, covering both classical CNN architectures and emerging Transformer- and Mamba-based models. DeepLabV3+ [40], as a representative semantic segmentation network, demonstrates stable performance in multi-scale context modeling, enabling effective capture of the overall morphology of landslide areas. DANet [41] incorporates channel and spatial attention mechanisms, which enhance the representation of global features in complex terrains. AttUnet [42] introduces attention mechanisms into skip connections, allowing it to effectively identify irregular and sparsely distributed landslide boundaries. Nested Attention U-Net [43] further enhances the modeling of fine boundaries and complex contexts in remote sensing images through its nested attention structure. TransUnet [44] leverages the global self-attention mechanism of Transformers to capture long-range dependencies while balancing global semantic information with local details. SegFormer [45], with its lightweight decoder, achieves high-precision segmentation and maintains accuracy across landslides of varying scales. VMUnet [28] integrates the Mamba module, enhancing multi-scale omnidirectional feature extraction to better perceive contextual structures in complex landslide scenes.
As shown in Table 1 and Table 2, MSCG-Net outperforms the other models on most evaluation metrics but exhibits a slight decrease in Precision. This difference mainly stems from the model’s design, which places greater emphasis on comprehensive landslide coverage and accurate boundary delineation to reduce missed detections. In the following, we provide a detailed discussion of its performance on the two datasets.
As shown in Table 1, the experimental results on the Bijie dataset demonstrate the overall superiority of MSCG-Net over other models. Specifically, the proposed method achieves the highest Overall Accuracy (OA) of 97.44% and significantly outperforms other approaches in terms of Recall (87.60%), F1-score (87.67%), IoU (78.04%), and Boundary IoU (69.09%), indicating its strong capability in comprehensively extracting landslide regions and accurately delineating their boundaries. It is noteworthy that although MSCG-Net achieves a slightly lower Precision (P, 87.74%) compared to VMUnet, this result is consistent with the model’s intended design objective: MSCG-Net prioritizes comprehensive coverage of landslide areas and accurate boundary delineation. This approach enhances the network’s sensitivity to potential landslide pixels, thereby reducing missed detections. As a result, a small number of spectrally similar non-landslide objects (such as bare soil or farmland) may occasionally be misclassified, leading to a marginal decrease in Precision. From an operational standpoint, minimizing missed detections is more crucial than completely eliminating false positives in landslide mapping, as omissions directly compromise the completeness and reliability of hazard identification. Meanwhile, a limited number of false alarms can be readily addressed during post-processing, for instance through the integration of topographic constraints or manual verification. Therefore, the marginally lower Precision does not undermine the applicability of MSCG-Net; on the contrary, its superior performance in Recall and boundary delineation significantly increases its practical utility in real-world scenarios.
Table 2 further illustrates the superior performance of MSCG-Net on the GVLM dataset. It excelled particularly in Recall and IoU, suggesting strong sensitivity and comprehensive detection capability for small and inconspicuous landslide areas. Meanwhile, the model’s Precision is not optimal due to a conservative strategy that prioritizes high Recall. To minimize missed detections, MSCG-Net maintains high sensitivity to potential landslide areas, which can occasionally lead to the misclassification of spectrally or texturally similar non-landslide features, such as bare soil and farmland, resulting in a limited number of false positives. Nonetheless, the model achieves high Boundary IoU, demonstrating its ability to accurately delineate landslide boundaries. This focus on comprehensive detection carries important practical implications. In disaster emergency response, providing a comprehensive landslide inventory, even if it includes some false positives for expert review, is far more critical than producing an overly conservative map that risks omitting affected areas, as the latter could severely underestimate hazards. Overall, these findings indicate that MSCG-Net delivers reliable, accurate, and boundary-aware landslide detection, and its slightly lower Precision can be mitigated through post-processing. This does not reduce the model’s practical utility.
To comprehensively and intuitively compare the performance of different methods, representative examples from each dataset were selected, and the detection results of various state-of-the-art models are shown in Figure 10 and Figure 11. MSCG-Net consistently delivered more complete and accurate predictions, highlighting its advantages and robustness when faced with diverse and complex scenarios.
On the Bijie landslide dataset, MSCG-Net exhibits strong performance in resolving adjacent structural features. As illustrated in the second row of Figure 10, the landslide in this example is visually similar to the surrounding bare soil and cultivated land. CNN-based methods, which have a limited capacity to capture spatial contextual structures, often struggle to fully interpret the internal composition of landslides. While Transformer-based approaches are able to identify landslides to some extent, they frequently fail to maintain structural continuity, leading to fragmented prediction results. Compared to other methods, MSCG-Net, featuring a dual-branch feature extractor, not only accurately delineates landslide boundaries but also effectively maintains the internal consistency of landslide bodies, ensuring continuity and completeness of the detected objects, thereby enhancing both the quality and practical usability of the predictions. Moreover, MSCG-Net shows notable robustness against interference. As shown in the fourth row of Figure 10, for the dark bare-soil areas created by landslides, the omnidirectional multi-scale scanning module enhances the model’s ability to capture complex terrain structures. This enables MSCG-Net to effectively extract the texture features of landslides, allowing accurate identification even in regions with highly similar visual characteristics, without misclassifying them as mountain shadows. These advantages highlight the superior performance of MSCG-Net in complex scenarios with limited training samples.
On the GVLM dataset, MSCG-Net also demonstrates a strong capability for high-resolution boundary extraction in large-scale landslide detection. As shown in the last row of Figure 11, the strip-shaped landslide lies in an environment with insufficient semantic information, making accurate boundary delineation a highly challenging task. While other methods tend to produce noticeable misclassifications and omissions, MSCG-Net successfully distinguishes the landslide from the surrounding bare land. Even under conditions where spectral features are highly similar, the model effectively reduces confusion with neighboring objects, enabling the landslide area to be clearly identified and precisely delineated. This joint enhancement of global and local features significantly improves the model’s adaptability and robustness in complex terrains, leading to higher accuracy and more comprehensive detection of large-scale landslides.

3.3. Computational Efficiency

In large-scale landslide detection tasks, computational efficiency is critical for practical deployment. To evaluate this aspect, this study compares the number of parameters and computational complexity of DeepLabV3+, DANet, AttUnet, TransUnet, Nested Attention U-Net, SegFormer, VMUnet, and MSCG-Net. For consistency, all models are tested using input images with a resolution of 256 × 256. Table 3 summarizes the performance of each model at the specified resolution. Overall, MSCG-Net achieves a favorable trade-off between computational efficiency and detection accuracy. Nevertheless, when scaling to large-area and high-resolution remote sensing imagery, the computational and memory requirements may present additional challenges for practical deployment.

4. Discussion

4.1. Ablation Study

To evaluate the effectiveness of each component in MSCG-Net and verify the rationality of its network design, this study conducted comprehensive ablation experiments on the GVLM dataset, focusing on the impact of different scanning modes and the contribution of each module.
For the scanning mode comparison, this study referred to related studies and selected five existing scanning methods for comparison (see Figure 12). In the figure, forward and reverse scanning directions are denoted by green and blue arrows. Method M1 corresponds to the scanning strategy used in VMamba [18]. Method M2 corresponds to the RS-Mamba [19] scanning strategy. Method M3 uses only diagonal and anti-diagonal scanning. Method M4 adopts the scanning strategy proposed in CVMH-Unet [30]. Method M5, proposed in PlainMamba [21], uses a serpentine scanning approach that maintains a degree of spatial and temporal coherence. Method M6 is the scanning strategy proposed in this study. All experiments were implemented based on the MSCG-Net framework. Table 4 presents the performance metrics of these methods on the GVLM dataset, offering a comprehensive comparison of different scanning strategies for landslide detection. Each scanning method (M1 to M6) corresponds to a specific experiment (A1 to A6), respectively, allowing us to systematically evaluate the impact of each scanning strategy on detection performance.
Compared to experiment A1, experiment A2 demonstrates significant performance improvements with higher F1 and Boundary IoU scores after introducing diagonal and anti-diagonal scans, confirming the effectiveness of semantic information along diagonal directions for contextual understanding in complex landslides. However, experiment A3, which relies solely on diagonal and anti-diagonal scans, performs poorly in IoU, highlighting the continued importance of horizontal and vertical scans for capturing landslide contextual features. Experiment A4 refines the full scanning strategy used in A2 by eliminating the reverse scanning direction, offering a more efficient approach that balances detection accuracy with reduced computational cost. The serpentine scanning in experiment A5 further improves F1 and IoU but at the cost of reduced boundary accuracy. The method proposed in this paper achieves the best overall balance between performance and inference speed.
In the module ablation experiments, this study sequentially introduced the CNN branch, SFA, and SAHF modules to quantify their individual contributions to the overall model performance. Table 5 presents the corresponding performance changes with the inclusion of each module.
The baseline model B1 relies solely on the context branch, resulting in the lowest F1 and IoU despite its relatively small number of parameters and FLOPs, which indicates that relying only on contextual information is insufficient for capturing fine-grained and boundary features. In B2, the introduction of the convolutional branch improves the F1 score from 86.6% to 88.8% and also raises the IoU, while the increase in parameters and computation is relatively modest, suggesting that the convolutional branch effectively enhances texture details and local boundary representations at a low cost. Building upon this, B3 incorporates the SFA module, which not only further improves F1 and IoU but also reduces FLOPs from 47.53 G to 35.26 G, demonstrating higher efficiency in optimizing spatial feature representation and dependency modeling, thereby achieving a better balance between complexity and performance. In B4, the SAHF module is introduced into the convolutional branch to refine high-resolution boundary features. Although this configuration brings a clear increase in parameters and computational load, it achieves the highest Boundary IoU among the single-module variants, showing its unique advantage in scenarios requiring precise boundary delineation. Finally, the full configuration B5 integrates all modules and achieves the best overall performance, while its parameter size and FLOPs remain almost the same as B2. This indicates that the overall architecture has been carefully optimized to achieve significant performance gains without imposing additional computational burdens. The CNN, SFA, and SAHF modules thus provide complementary advantages that together strike an effective balance between accuracy and efficiency, making MSCG-Net both powerful in detection and suitable for deployment in large-scale or resource-constrained applications.

4.2. CAM Visualization

To further validate the effectiveness of the convolutional branch in preserving boundary features, this study generated a heatmap using Gradient-weighted Class Activation Mapping, as shown in Figure 13. This visualization highlights the distinct focus areas of the convolutional and Mamba branches in extracting landslide features. Specifically, the Mamba branch highlights the overall structure and long-range connections within the landslide area, whereas the convolutional branch concentrates on boundary contours and fine-grained details. By organically integrating the strengths of both branches, the proposed model effectively compensates for the Mamba branch’s limitations in local feature representation while retaining its strengths in capturing large-scale semantics and global context. This results in a robust feature extraction framework with both global and local perceptual capabilities.
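A minimal Grad-CAM sketch of the kind used for Figure 13 is shown below; the choice of target layer and the use of the summed landslide logit as the backpropagated score are assumptions.

```python
import torch

def grad_cam(model, layer, image):
    """Minimal Gradient-weighted Class Activation Mapping: hook one layer,
    backpropagate the aggregated landslide logit, and weight the activations
    by their channel-averaged gradients."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.zero_grad()
    score = model(image.unsqueeze(0)).sum()   # aggregate landslide logit as target
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP over spatial gradients
    cam = torch.relu((weights * acts["a"]).sum(dim=1))   # weighted activation map
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
```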

4.3. Limitations and Future Work

While MSCG-Net demonstrates strong performance in landslide extraction on benchmark datasets, several limitations remain, particularly in complex or edge-case scenarios. The model is still affected by the spectral ambiguity of optical imagery, where landslides often resemble bare soil or agricultural land, leading to potential false positives in areas with gentle slopes or less distinct morphological features. Its generalizability across diverse terrains and sensor types also requires further improvement, as the current design has been mainly optimized for optical imagery and lacks validation in drastically different environments such as deserts or tropical regions, as well as the integration of complementary modalities like SAR or LiDAR for all-weather detection under vegetation cover. In addition, challenges persist in edge cases such as merging of closely spaced landslides, missing detections in heavily shadowed regions, and difficulties in delineating fine details when the scale of features approaches the resolution of input imagery. Beyond these technical aspects, the scalability of MSCG-Net to large-area, high-resolution imagery poses practical challenges, since computational and memory demands during inference may hinder deployment in real-world, resource-constrained settings.
To address these issues, future work will explore the incorporation of topographic information (e.g., slope, curvature) to better distinguish landslides from spectrally similar backgrounds, along with multi-source fusion frameworks that adaptively integrate optical, SAR, and LiDAR data to enhance robustness across varying terrains. We also plan to investigate SAR coherence analysis and super-resolution methods to alleviate shadow, cloud, and resolution-related limitations. Finally, lightweight model designs, compression strategies, and tiling-based inference will be pursued to improve computational efficiency and scalability, facilitating practical deployment in large-scale applications.

5. Conclusions

This study proposes MSCG-Net, a novel network for landslide detection from multi-source remote sensing imagery, aiming to address several challenges in automatic landslide area extraction. Specifically, the challenges addressed include misclassification caused by the similarity between landslide textures and other surface features, the omission of small-scale landslides, blurred boundaries, and voids within complex landslide areas. MSCG-Net employs a dual-branch feature extraction architecture to simultaneously capture spatial details and global dependencies. The extracted features are subsequently aligned and fused through the AFEM module, which ensures the complementary integration of the contextual information branch and the local detail branch throughout the process. This allows MSCG-Net to precisely delineate landslide boundaries, effectively detect small-scale events, and maintain the integrity of large-scale landslide structures. The proposed OMSS module uses multi-directional and multi-scale scanning to enrich attention diversity and enhance complex feature extraction. As a result, MSCG-Net demonstrates strong adaptability to diverse terrains and imaging platforms, delivering accurate and robust landslide detection performance. This study evaluates MSCG-Net on two datasets, and the results demonstrate its superiority over existing models in visual quality and most quantitative metrics. Although the model achieves slightly lower precision than some counterparts, this design choice effectively reduces the risk of missed detections, better aligning with the critical safety demands of geological hazard applications. The findings comprehensively validate the effectiveness and practical value of MSCG-Net for landslide detection in remote sensing imagery. Future work will focus on extending the deployment and application of the model to broader geographic regions.

Author Contributions

Writing—original draft preparation, Z.Y.; writing—review and editing, Z.Y.; supervision, H.Z.; funding acquisition, N.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, U22A20569.

Data Availability Statement

The Bijie dataset is available at http://gpcv.whu.edu.cn/data/Bijie_pages.html (accessed on 1 December 2024). The GVLM dataset is available at https://pan.baidu.com/share/init?surl=GYlY16k1zIEf07puGl8l_w&pwd=wsss (accessed on 6 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
MSCG-Net: Multi-Scale Spatial Context-Guided Network
CNN: Convolutional Neural Network
AFEM: Adaptive Feature Enhancement Module
SFA: Structural Feature Attention
SAHF: Self-Adaptive High-Pass Filter
OMSS: Omnidirectional Multi-Scale Scanning
OMM: Omnidirectional Multi-Scale Mamba

References

  1. Klose, M.; Highland, L.; Damm, B.; Terhorst, B. Estimation of direct landslide costs in industrialized countries: Challenges, concepts, and case study. In Proceedings of the Landslide Science for a Safer Geoenvironment: Volume 2: Methods of Landslide Studies; Springer: Cham, Switzerland, 2014; pp. 661–667. [Google Scholar]
  2. Ozturk, U.; Bozzolan, E.; Holcombe, E.A.; Shukla, R.; Pianosi, F.; Wagener, T. How climate change and unplanned urban sprawl bring more landslides. Nature 2022, 608, 262–265. [Google Scholar] [CrossRef]
  3. Stumpf, A.; Kerle, N. Object-oriented mapping of landslides using Random Forests. Remote Sens. Environ. 2011, 115, 2564–2577. [Google Scholar] [CrossRef]
  4. Prakash, N.; Manconi, A.; Loew, S. A new strategy to map landslides with a generalized convolutional neural network. Sci. Rep. 2021, 11, 9722. [Google Scholar] [CrossRef]
  5. Niu, C.; Gao, O.; Lu, W.; Liu, W.; Lai, T. Reg-SA–UNet++: A lightweight landslide detection network based on single-temporal images captured postlandslide. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 9746–9759. [Google Scholar] [CrossRef]
  6. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  7. Chen, X.; Zhao, C.; Lu, Z.; Xi, J. Landslide Inventory Mapping Based on Independent Component Analysis and UNet3+: A Case of Jiuzhaigou, China. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 2213–2223. [Google Scholar] [CrossRef]
  8. Qi, W.; Wei, M.; Yang, W.; Xu, C.; Ma, C. Automatic Mapping of Landslides by the ResU-Net. Remote Sens. 2020, 12, 2487. [Google Scholar] [CrossRef]
  9. Ghorbanzadeh, O.; Crivellari, A.; Ghamisi, P.; Shahabi, H.; Blaschke, T. A comprehensive transferability evaluation of U-Net and ResU-Net for landslide detection from Sentinel-2 data (case study areas from Taiwan, China, and Japan). Sci. Rep. 2021, 11, 14629. [Google Scholar] [CrossRef]
  10. Peta, K.; Stemp, W.J.; Stocking, T.; Chen, R.; Love, G.; Gleason, M.A.; Houk, B.A.; Brown, C.A. Multiscale Geometric Characterization and Discrimination of Dermatoglyphs (Fingerprints) on Hardened Clay—A Novel Archaeological Application of the GelSight Max. Materials 2025, 18, 2939. [Google Scholar] [CrossRef]
  11. Lei, T.; Zhang, Y.; Lv, Z.; Li, S.; Liu, S.; Nandi, A.K. Landslide inventory mapping from bitemporal images using deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2019, 16, 982–986. [Google Scholar] [CrossRef]
  12. Yi, Y.; Zhang, W. A new deep-learning-based approach for earthquake-triggered landslide detection from single-temporal RapidEye satellite imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 6166–6176. [Google Scholar] [CrossRef]
  13. Lu, W.; Hu, Y.; Shao, W.; Wang, H.; Zhang, Z.; Wang, M. A multiscale feature fusion enhanced CNN with the multiscale channel attention mechanism for efficient landslide detection (MS2LandsNet) using medium-resolution remote sensing data. Int. J. Digit. Earth 2024, 17, 2300731. [Google Scholar] [CrossRef]
  14. Lv, P.; Ma, L.; Li, Q.; Du, F. ShapeFormer: A shape-enhanced vision transformer model for optical remote sensing image landslide detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2681–2689. [Google Scholar] [CrossRef]
  15. Zhao, Z.; Chen, T.; Dou, J.; Liu, G.; Plaza, A. Landslide susceptibility mapping considering landslide local-global features based on CNN and transformer. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7475–7489. [Google Scholar] [CrossRef]
  16. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  17. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  18. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  19. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. Rs-mamba for large remote sensing image dense prediction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  20. Zhu, Q.; Fang, Y.; Cai, Y.; Chen, C.; Fan, L. Rethinking scanning strategies with vision mamba in semantic segmentation of remote sensing imagery: An experimental study. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18223–18234. [Google Scholar] [CrossRef]
  21. Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv 2024, arXiv:2403.17695. [Google Scholar]
  22. Shi, Y.; Dong, M.; Xu, C. Multi-scale vmamba: Hierarchy in hierarchy visual state space model. arXiv 2024, arXiv:2405.14174. [Google Scholar] [CrossRef]
  23. Zhao, C.; Chen, S.; Wu, F.; Li, H. Feasibility analysis of the Mamba-based landslide mapping from remote sensing images. In Proceedings of the 5th International Conference on Artificial Intelligence and Computer Engineering, Wuhu, China, 8–10 November 2024; pp. 501–505. [Google Scholar]
  24. Ma, X.; Zhang, X.; Pun, M.-O. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  25. Wu, R.; Liu, Y.; Liang, P.; Chang, Q. Ultralight vm-unet: Parallel vision mamba significantly reduces parameters for skin lesion segmentation. arXiv 2024, arXiv:2403.20035. [Google Scholar] [CrossRef]
  26. Yu, W.; Wang, X. Mambaout: Do we really need mamba for vision? In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 4484–4496. [Google Scholar]
  27. Xu, Y.; Ouyang, C.; Xu, Q.; Wang, D.; Zhao, B.; Luo, Y. CAS landslide dataset: A large-scale and multisensor dataset for deep learning-based landslide detection. Sci. Data 2024, 11, 12. [Google Scholar] [CrossRef]
  28. Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
  29. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for remote sensing image semantic segmentation. arXiv 2024, arXiv:2405.10530. [Google Scholar]
  30. Cao, Y.; Liu, C.; Wu, Z.; Zhang, L.; Yang, L. Remote Sensing Image Segmentation Using Vision Mamba and Multi-Scale Multi-Frequency Feature Fusion. Remote Sens. 2025, 17, 1390. [Google Scholar] [CrossRef]
  31. Liu, J.; Yang, H.; Zhou, H.-Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S. Swin-umamba: Mamba-based unet with imagenet-based pretraining. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; pp. 615–625. [Google Scholar]
  32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10022. [Google Scholar]
  33. Yang, J.; Cai, W.; Chen, G.; Yan, J. A State Space Model-Driven Multiscale Attention Method for Geological Hazard Segmentation. IEEE Trans. Instrum. Meas. 2025, 74, 1–12. [Google Scholar] [CrossRef]
  34. Fan, Y.; Ma, P.; Hu, Q.; Liu, G.; Guo, Z.; Tang, Y.; Wu, F.; Zhang, H. SCGC-Net: Spatial Context Guide Calibration Network for multi-source RSI Landslides detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–17. [Google Scholar] [CrossRef]
  35. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. Localmamba: Visual state space model with windowed selective scan. In Proceedings of the Computer Vision—ECCV 2024 Workshops, Milan, Italy, 29 September–4 October 2024; pp. 12–22. [Google Scholar]
  36. Huang, Z.; Wei, Y.; Wang, X.; Liu, W.; Huang, T.S.; Shi, H. Alignseg: Feature-aligned segmentation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 550–557. [Google Scholar] [CrossRef] [PubMed]
  37. Cheng, B.; Girshick, R.; Dollár, P.; Berg, A.C.; Kirillov, A. Boundary IoU: Improving object-centric image segmentation evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15334–15342. [Google Scholar]
  38. Ji, S.; Yu, D.; Shen, C.; Li, W.; Xu, Q. Landslide detection from an open satellite imagery and digital elevation model dataset using attention boosted convolutional neural networks. Landslides 2020, 17, 1337–1352. [Google Scholar] [CrossRef]
  39. Zhang, X.; Yu, W.; Pun, M.-O.; Shi, W. Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning. ISPRS J. Photogramm. Remote Sens. 2023, 197, 1–17. [Google Scholar] [CrossRef]
  40. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  41. Xue, H.; Liu, C.; Wan, F.; Jiao, J.; Ji, X.; Ye, Q. Danet: Divergent activation for weakly supervised object localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6589–6598. [Google Scholar]
  42. Lian, S.; Luo, Z.; Zhong, Z.; Lin, X.; Su, S.; Li, S. Attention guided U-Net for accurate iris segmentation. J. Vis. Commun. Image Represent. 2018, 56, 296–304. [Google Scholar] [CrossRef]
  43. Li, C.; Tan, Y.; Chen, W.; Luo, X.; Gao, Y.; Jia, X.; Wang, Z. Attention unet++: A nested attention-aware u-net for liver ct image segmentation. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 345–349. [Google Scholar]
  44. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  45. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed model architecture.
Figure 2. Specifications of basic blocks. (a) Conv Block (CB): A simple and efficient convolutional structure for extracting boundary information. (b) Patch Embedding (PE): A module for downsampling the input image into patch-level representations.
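To make the two basic blocks concrete, a minimal PyTorch sketch is given below. The kernel sizes, patch size, and normalization choices here are illustrative assumptions, not the verified configuration of MSCG-Net.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv Block (CB) sketch: 3x3 conv -> BN -> ReLU for local/boundary features.
    Depth and kernel size are assumptions for illustration."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class PatchEmbedding(nn.Module):
    """Patch Embedding (PE) sketch: downsample the image into non-overlapping
    patch tokens with a strided convolution (patch size 4 assumed here)."""
    def __init__(self, in_ch=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                 # x: (B, C, H, W)
        x = self.proj(x)                  # (B, D, H/4, W/4)
        x = x.permute(0, 2, 3, 1)         # channel-last for LayerNorm
        return self.norm(x)

x = torch.randn(1, 3, 256, 256)
print(ConvBlock(3, 64)(x).shape)          # torch.Size([1, 64, 256, 256])
print(PatchEmbedding()(x).shape)          # torch.Size([1, 64, 64, 96])
```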
Figure 3. Architecture of the OMM Block, composed of the Omnidirectional Multiscale Visual State-Space Block (OMVSS), which improves the VSS scanning mechanism for multiscale and omnidirectional feature modeling, and the Convolutional Feedforward Network Block (ConvFFN), which complements VSS by enhancing spatial feature extraction.
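The ConvFFN half of the block can be sketched as follows. This is a generic convolutional feed-forward design (pointwise expansion, depthwise 3x3, pointwise projection) assumed for illustration; the expansion ratio and residual placement are our choices, not the authors' verified layer list.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Convolutional feed-forward sketch: 1x1 expand -> depthwise 3x3 -> GELU -> 1x1 project.
    The depthwise conv adds the local spatial mixing that a token-wise MLP lacks."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):                 # x: (B, C, H, W)
        # residual connection assumed, as is common in transformer-style blocks
        return x + self.fc2(self.act(self.dwconv(self.fc1(x))))

feat = torch.randn(1, 96, 64, 64)
print(ConvFFN(96)(feat).shape)            # torch.Size([1, 96, 64, 64])
```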
Figure 4. Illustration of OMSS. It performs horizontal and vertical scanning before downsampling and applies diagonal scanning on downsampled features to enhance oblique semantic perception.
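The scanning orders can be made explicit with the sketch below: it flattens a feature map row-wise and column-wise at full resolution, then along anti-diagonals after 2x downsampling. How the resulting sequences are fed to the state-space model and re-fused is simplified away; the pooling operator and diagonal enumeration are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def omss_sequences(x):
    """Illustrative OMSS-style scans: horizontal and vertical at full resolution,
    diagonal on a 2x-downsampled map. Returns three token sequences of shape (B, C, L)."""
    B, C, H, W = x.shape
    horizontal = x.flatten(2)                      # row-major scan, (B, C, H*W)
    vertical = x.transpose(2, 3).flatten(2)        # column-major scan
    xd = F.avg_pool2d(x, kernel_size=2)            # downsample before the diagonal scan
    Hd, Wd = xd.shape[2], xd.shape[3]
    # enumerate anti-diagonals (i + j = const) to build a diagonal scan order
    order = [i * Wd + j for s in range(Hd + Wd - 1)
             for i in range(Hd) for j in range(Wd) if i + j == s]
    diagonal = xd.flatten(2)[:, :, order]          # (B, C, Hd*Wd)
    return horizontal, vertical, diagonal

h, v, d = omss_sequences(torch.randn(1, 96, 8, 8))
print(h.shape, v.shape, d.shape)   # (1, 96, 64) (1, 96, 64) (1, 96, 16)
```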
Figure 5. Architecture of the AFEM. (a) SFA is designed to compensate for context information loss caused by noise during downsampling. (b) SAHF serves to suppress low-frequency interference in the convolutional branch and enhances edge details.
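A minimal sketch of the high-frequency idea behind SAHF: treat a smoothed (average-pooled) map as the low-frequency component, take the residual as edge content, and re-weight that residual with a spatial attention mask. The specific operators (pooling low-pass, 7x7 attention conv, additive gating) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqEnhance(nn.Module):
    """SAHF-style sketch: low-pass the feature map, treat the residual as
    high-frequency (edge) content, and gate it with a learned spatial attention map."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(dim, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        low = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)  # low-frequency estimate
        high = x - low                                             # high-frequency residual
        return x + self.attn(x) * high                             # attention-gated edge boost

print(HighFreqEnhance(96)(torch.randn(1, 96, 64, 64)).shape)       # (1, 96, 64, 64)
```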
Figure 6. Architecture of feature fusion module, which combines high- and low-level feature fusion with skip connections to better preserve boundary information.
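The fusion step can be summarized as below: upsample the coarse high-level (context) map to the skip feature's resolution, concatenate, and project with a 3x3 convolution. Channel counts and the bilinear upsampling mode are illustrative assumptions rather than the exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """Sketch of a high/low-level fusion step with a skip connection:
    bilinear-upsample the coarse map, concatenate the fine skip feature, fuse with 3x3 conv."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(high_ch + low_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, high, low):
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([high, low], dim=1))

out = FuseBlock(192, 96, 96)(torch.randn(1, 192, 32, 32), torch.randn(1, 96, 64, 64))
print(out.shape)   # torch.Size([1, 96, 64, 64])
```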
Figure 7. Spatial distribution of the Bijie dataset.
Figure 8. Spatial distribution of the 17 landslide sites in the GVLM dataset.
Figure 9. Landslide datasets. (a) Example from the Bijie dataset. (b) Example from the GVLM dataset.
Figure 10. Landslide detection results compared with seven state-of-the-art approaches on the Bijie dataset. (a) Optical RSIs; (b) GT; (c) DeepLabV3+; (d) DANet; (e) AttUNet; (f) TransUNet; (g) Nested Attention U-Net; (h) SegFormer; (i) VMUNet; (j) MSCG-Net (ours).
Figure 11. Landslide detection results compared with seven state-of-the-art approaches on the GVLM dataset. (a) Optical RSIs; (b) GT; (c) DeepLabV3+; (d) DANet; (e) AttUNet; (f) TransUNet; (g) Nested Attention U-Net; (h) SegFormer; (i) VMUNet; (j) MSCG-Net (ours).
Figure 12. Scanning methods for comparison. Green arrows indicate the forward scanning order, and blue arrows indicate the reverse scanning order.
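Forward and reverse orders amount to running the same sequence model over the token sequence and its flip, then merging the two outputs. A minimal sketch is below; the `mixer` argument is a placeholder standing in for the actual SSM layer, and the additive merge is an assumption.

```python
import torch

def bidirectional_scan(tokens, mixer):
    """Run a sequence layer over forward and reversed token orders and merge.
    `mixer` is any (B, L, C) -> (B, L, C) callable; here it stands in for the SSM."""
    fwd = mixer(tokens)                                # forward scanning order
    bwd = mixer(tokens.flip(dims=[1])).flip(dims=[1])  # reverse order, flipped back
    return fwd + bwd                                   # simple additive merge (assumed)

tokens = torch.randn(1, 64, 96)                        # (batch, sequence length, channels)
identity = lambda t: t                                 # placeholder mixer for the demo
print(bidirectional_scan(tokens, identity).shape)      # torch.Size([1, 64, 96])
```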
Figure 13. Feature map visualization from the ablation experiment. (a) Optical RSIs; (b) feature map from CNN stage; (c) feature map from Mamba stage.
Table 1. Comparison Results of Various Methods on Bijie Dataset.

| Method | OA (%) | P (%) | R (%) | F1 (%) | IoU (%) | BIoU (%) |
|---|---|---|---|---|---|---|
| DeepLabV3Plus | 96.85 | 85.41 | 84.12 | 84.76 | 73.55 | 62.81 |
| DANet | 97.09 | 87.29 | 84.25 | 85.74 | 75.05 | 65.93 |
| AttUnet | 96.59 | 83.62 | 83.56 | 83.59 | 71.81 | 61.22 |
| TransUnet | 94.52 | 73.40 | 74.25 | 73.82 | 58.51 | 48.00 |
| Nested Attention U-Net | **98.87** | 83.67 | 86.87 | 85.24 | 74.28 | 63.86 |
| SegFormer | 96.81 | 84.63 | 84.67 | 84.65 | 73.38 | 62.62 |
| VMUnet | 97.33 | **88.82** | 85.07 | 84.91 | 76.84 | 66.81 |
| Ours | 97.44 | 87.74 | **87.60** | **87.67** | **78.04** | **69.09** |

Note: Bold values represent the best performance among the compared models.
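The BIoU column follows the erosion-based Boundary IoU of Cheng et al.: IoU is computed only over pixels within a narrow band around each mask's boundary, so it penalizes ragged edges that area-level IoU hides. A minimal NumPy/SciPy sketch is given below; the band width `d` and the 3x3 structuring element are free parameters here, not the exact evaluation settings used in the paper.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def boundary_iou(gt, pred, d=2):
    """Erosion-based Boundary IoU sketch: a mask's boundary band is the mask minus
    its d-pixel erosion; BIoU is the IoU of the two boundary bands."""
    structure = np.ones((3, 3), dtype=bool)
    gt_band = gt & ~binary_erosion(gt, structure, iterations=d)
    pred_band = pred & ~binary_erosion(pred, structure, iterations=d)
    inter = np.logical_and(gt_band, pred_band).sum()
    union = np.logical_or(gt_band, pred_band).sum()
    return inter / union if union > 0 else 1.0

# toy example: a prediction shifted by two pixels relative to the ground truth
gt = np.zeros((64, 64), dtype=bool); gt[16:48, 16:48] = True
pred = np.zeros((64, 64), dtype=bool); pred[18:50, 18:50] = True
print(f"BIoU = {boundary_iou(gt, pred):.3f}")
```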
Table 2. Comparison Results of Various Methods on GVLM Dataset.

| Method | OA (%) | P (%) | R (%) | F1 (%) | IoU (%) | BIoU (%) |
|---|---|---|---|---|---|---|
| DeepLabV3Plus | 94.14 | 90.66 | 79.28 | 84.59 | 73.29 | 48.49 |
| DANet | 94.22 | 86.14 | 85.22 | 85.68 | 74.95 | 53.74 |
| AttUnet | 94.88 | **90.83** | 83.13 | 86.81 | 76.69 | 49.07 |
| TransUnet | 94.13 | 82.44 | 90.29 | 86.19 | 75.73 | 53.07 |
| Nested Attention U-Net | 95.50 | 88.98 | 88.80 | 88.89 | 80.00 | 59.00 |
| SegFormer | 94.75 | 86.23 | 88.19 | 87.20 | 77.31 | 51.03 |
| VMUnet | 95.36 | 88.65 | 88.46 | 88.56 | 79.46 | 57.85 |
| Ours | **95.66** | 87.31 | **91.98** | **89.58** | **81.13** | **62.10** |

Note: Bold values represent the best performance among the compared models.
Table 3. Computational efficiency of different models.

| Model | FLOPs (G) | Params (M) |
|---|---|---|
| DeepLabV3Plus | 20.854 | 54.708 |
| DANet | 144.637 | 65.182 |
| AttUnet | 66.632 | 34.879 |
| TransUnet | 32.633 | 66.815 |
| Nested Attention U-Net | 34.903 | 9.163 |
| SegFormer | 3.273 | 7.713 |
| VMUnet | 7.531 | 60.273 |
| Ours | 47.531 | 39.442 |
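Figures of this kind are straightforward to reproduce: parameter counts directly in PyTorch, and FLOPs with a third-party profiler such as `thop` (which counts multiply-add operations). The model below is a stand-in for the demonstration, not MSCG-Net.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                     # placeholder network, not MSCG-Net
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),
)

# parameter count in millions, directly from the module's tensors
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Params: {params_m:.3f} M")

# FLOPs via a profiler (pip install thop), for a single 3 x 256 x 256 input
from thop import profile
flops, _ = profile(model, inputs=(torch.randn(1, 3, 256, 256),))
print(f"FLOPs: {flops / 1e9:.3f} G")
```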
Table 4. Quantitative results of scanning method comparison experiments.

| Exp ID | FPS | FLOPs (G) | F1 (%) | IoU (%) | BIoU (%) |
|---|---|---|---|---|---|
| A1 | 29.84 | 47.478 | 88.64 | 79.60 | 59.37 |
| A2 | 17.23 | 47.478 | 89.26 | 80.60 | 59.96 |
| A3 | 17.90 | 47.478 | 88.90 | 79.77 | 60.02 |
| A4 | 21.63 | 47.478 | 89.09 | 80.49 | 60.06 |
| A5 | 23.48 | 35.210 | 89.11 | 80.35 | 59.16 |
| A6 | 22.34 | 47.531 | 89.58 | 81.13 | 62.10 |
Table 5. Quantitative results of ablation experiments.

| Exp ID | CNN | SFA | SAHF | Params (M) | FLOPs (G) | F1 (%) | IoU (%) | BIoU (%) |
|---|---|---|---|---|---|---|---|---|
| B1 |  |  |  | 19.62 | 43.31 | 86.6 | 79.42 | 58.05 |
| B2 |  |  |  | 39.44 | 47.53 | 88.8 | 79.92 | 58.77 |
| B3 |  |  |  | 39.19 | 35.26 | 89.3 | 80.67 | 59.99 |
| B4 |  |  |  | 44.42 | 72.02 | 89.1 | 80.28 | 60.73 |
| B5 | ✓ | ✓ | ✓ | 39.44 | 47.53 | 89.4 | 81.13 | 62.10 |

Note: The check mark indicates that the corresponding component is included in the experiment.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
