Article

Research on Landslide Hazard Detection in Ya’an Region Based on an Improved YOLO Model

1 School of Computer Science and Engineering, University of Emergency Management, Langfang 065201, China
2 Hebei Province University Smart Emergency Application Technology Research and Development Center, Langfang 065201, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(6), 957; https://doi.org/10.3390/rs18060957
Submission received: 22 February 2026 / Revised: 12 March 2026 / Accepted: 19 March 2026 / Published: 23 March 2026

Highlights

What are the main findings?
  • MSDE-YOLO, a landslide detector based on YOLOv11, effectively addresses blurred boundaries and weakened texture features of landslides in remote sensing imagery of complex terrain, achieving high detection performance with 90.2% precision, 84.8% recall and 92.7% mAP on the self-constructed Ya’an landslide dataset.
  • A Multi-Scale Detail Enhancement (MSDE) module was designed and incorporated into the neck network of the model. By efficiently aggregating shallow details and deep semantic information, it enhances the model’s ability to represent fuzzy boundaries. A lightweight SimAM attention mechanism was additionally introduced, significantly improving feature discrimination in slope areas under complex imaging conditions and enhancing the consistency of slope boundary extraction.
What are the implications of the main findings?
  • The results demonstrate that the framework is applicable to landslide hazard detection in topographically complex regions like Ya’an, and is suitable for efficient screening and dynamic monitoring of geological hazards in real-world scenarios.
  • This work provides a practical technical framework for intelligent landslide detection based on remote sensing imagery, highlighting the potential of improved deep learning object detection models in geohazard remote sensing applications.

Abstract

Landslide hazards occur frequently in the Ya’an region; therefore, accurately identifying and delineating potential landslide areas is crucial for disaster prevention and mitigation. Although deep learning-based detection methods using optical remote sensing imagery are widely adopted, the complex terrain and diverse land cover in this area often result in blurred boundaries and weakened textural features, making it difficult to precisely define spatial extents. To overcome these challenges, this study proposes an improved YOLOv11 model for landslide detection. Building on the YOLOv11 baseline, we designed a novel Multi-Scale Detail Enhancement module and integrated it into the neck network to effectively aggregate shallow-level details with deep-level semantic information, thereby enhancing the model’s ability to represent ambiguous boundaries. Additionally, we incorporated the lightweight SimAM attention mechanism into the backbone network. This mechanism dynamically suppresses background noise based on an energy minimization principle, improving feature discriminability within landslide regions and enabling precise bounding-box localization. We conducted validation experiments in the Ya’an region using a custom dataset constructed from high-resolution UAV orthoimagery, comparing our method against mainstream models such as YOLOv8 and YOLOv10. The results show that the proposed improved YOLOv11 model achieves a precision of 90.2%, a recall of 84.8%, and an mAP of 92.7%. This enhanced performance demonstrates the model’s effectiveness in detecting landslides under complex terrain conditions, providing a practical technical reference for efficient hazard screening and dynamic monitoring.

1. Introduction

Situated within the seismically active Longmenshan Fault Zone and characterized by frequent rainfall, Ya’an is one of the most landslide-susceptible regions in Sichuan Province [1]. The region has witnessed severe geological disasters in recent years: on 1 June 2022, an Ms6.1 earthquake in Lushan County triggered landslides in nearby Baoxing County, causing fatalities and structural damage. More recently, on 20 July 2024, prolonged intense rainfall in Hanyuan County induced large-scale debris flows that activated dozens of landslides, resulting in significant casualties and property loss. Given that the accurate localization of potential landslide hazards is fundamental to geological risk assessment and dynamic monitoring, it serves as a prerequisite for effective early warning and disaster mitigation. Therefore, enhancing the accuracy and efficiency of landslide detection methods is both a technical imperative and a practical necessity.
Recent advances in artificial intelligence and computational infrastructure have catalyzed the widespread adoption of deep learning for landslide detection using optical remote sensing imagery [2,3,4]. Modern object detectors generally fall into two paradigms: two-stage and one-stage architectures, distinguished by the presence of an explicit region proposal phase [5]. Representative two-stage models, such as R-CNN [6], Fast R-CNN [7], and Mask R-CNN [8], first generate candidate regions via a Region Proposal Network (RPN) or selective search before refining and classifying them. In contrast, one-stage detectors—including SSD, EfficientDet, and the YOLO series [9]—predict bounding boxes and class labels directly from a dense grid of predefined anchors. By eliminating the region proposal step, one-stage methods achieve significantly higher inference speeds. This efficiency renders them particularly suitable for time-critical applications, such as large-scale landslide screening in remote sensing.
Among one-stage frameworks, the YOLO (You Only Look Once) series has emerged as a leading choice due to its favorable trade-off between detection accuracy and computational efficiency [10]. YOLOv3 introduced Darknet-53 as its backbone, leveraging deeper hierarchical features to improve multi-scale detection performance [11]. YOLOv5, re-implemented in PyTorch by Ultralytics, replaced the original Darknet framework with a more modular and community-friendly codebase while incorporating stride-based convolutions and a fast spatial pyramid pooling (SPPF) module to accelerate feature aggregation and inference [12]. YOLOv8 further extended the architecture’s capabilities to support multi-task learning, including instance segmentation, pose estimation, and oriented object detection [13]. Most recently, YOLOv11 preserved this multi-task flexibility while enhancing efficiency through the C3k2 block and introducing the C2PSA module—a spatial attention mechanism that improves detection of small and densely clustered objects [14].
Despite these advancements, deploying YOLO-based models for landslide detection remains challenging. In optical imagery, landslide regions often exhibit indistinct boundaries, low textural contrast, and high intra-class variability, exacerbated by complex terrain, heterogeneous land cover, and varying illumination. These factors degrade feature discriminability and impede precise boundary localization—a critical limitation in safety-sensitive geohazard applications.
To mitigate these issues, recent studies have adapted YOLO architectures for landslide-specific contexts. For instance, Meng et al. [15] integrated a C3-Swin Transformer module into YOLOv5, synergizing the global contextual modeling of Swin Transformers with the channel-and-spatial attention of CBAM to enhance feature representation in salient regions. Similarly, Wang et al. [16] introduced a Receptive Field Attention (RFA) mechanism into a modified YOLOv11, inspired by biological visual receptive fields, to dynamically highlight discriminative features across spatial and channel dimensions. However, while these approaches advance multi-scale fusion and attention modeling, they often struggle to balance fine-grained boundary recovery with efficient global context integration.
To directly address the specific challenges posed by Ya’an’s complex topography and diverse surface covers—namely, blurred landslide boundaries that hinder accurate spatial delineation and weakened texture features that reduce discriminability—we propose an enhanced YOLOv11 framework incorporating two targeted innovations:
(1)
To tackle the issue of blurred boundaries and ambiguous spatial ranges, we design a novel Multi-Scale Detail Enhancement (MSDE) module within the neck network. By aggregating features from multiple receptive fields via parallel convolutional pathways, the MSDE module simultaneously preserves high-resolution spatial details and semantic-rich contextual cues. This design specifically recovers fine-grained boundary information while maintaining global context consistency, thereby enabling precise localization even in terrain with indistinct edges.
(2)
To overcome the problem of weakened texture features in heterogeneous backgrounds, we integrate the parameter-free SimAM (Simple Attention Module) mechanism into the backbone. Leveraging an energy minimization principle, SimAM adaptively amplifies informative neuron responses corresponding to weak landslide indicators while suppressing complex background noise. This enhances feature selectivity and discriminability without increasing model complexity.
Collectively, these improvements establish a logical mapping between the identified environmental bottlenecks and our methodological solutions, enabling high-precision landslide detection under complex conditions.

2. Materials and Methods

2.1. Study Area and Dataset

2.1.1. Study Area

Ya’an City, located in central-western Sichuan Province (28°51′N–30°56′N, 101°56′E–103°23′E), serves as the study area (Figure 1). Situated at the transition between the Tibetan Plateau and the Sichuan Basin, the region features complex topography characterized by steep elevation gradients and a highly heterogeneous landscape. Geologically, Ya’an lies within the tectonic influence zone of the India–Eurasia collision, at the intersection of three major fault systems: the Longmenshan, Xianshuihe, and Anninghe faults. This tectonic convergence has generated dense fault networks and significant lithological heterogeneity. Climatically, the area experiences a subtropical humid monsoon climate with pronounced seasonal precipitation. Intense monsoon rainfall frequently induces rapid surface runoff, triggering slope instability, collapses, and landslides that often result in severe casualties and substantial economic losses.
From a computer vision perspective, accurately delineating the spatial extent of landslides—particularly in their incipient stages—remains a formidable challenge due to inherent technical limitations. First, feature extraction and semantic segmentation are impeded by the subtlety of early surface deformations. These initial signs often exhibit low contrast against the background terrain, rendering them indistinguishable to conventional algorithms that rely on distinct color, texture, or morphological cues. Consequently, this leads to frequent missed detections or erroneous segmentations in visually homogeneous regions. Second, the high heterogeneity of landslide manifestations poses a significant barrier to model generalization. For instance, shallow soil slides and deep-seated rockfalls display markedly different spectral and structural signatures in remote sensing imagery, making it difficult for a single, unified architecture to effectively capture such diverse patterns.
Consequently, existing computer vision approaches encounter critical bottlenecks when applied to complex terrains like Ya’an. Key limitations include incomplete perceptual coverage, poor discriminability of weak precursory indicators, and insufficient integration of multi-scale and multi-modal features. These deficiencies compromise the timeliness and accuracy of hazard assessments during the critical early phase of disaster response, thereby hindering effective emergency decision-making. This underscores the urgent need for targeted methodological advancements to bridge these gaps.

2.1.2. Dataset Construction

This study presents a dedicated dataset for intelligent landslide detection, constructed from high-resolution optical imagery acquired by Unmanned Aerial Vehicles (UAVs). Covering 23 towns and townships in Ya’an City, Sichuan Province, the dataset spans approximately 15,000 km2 and captures the region’s heterogeneous geomorphology, ranging from alpine gorges to hilly basins. Comprising roughly 700 GB of raw data with a uniform Ground Sampling Distance (GSD) of 0.5 m, the imagery provides rich textural and morphological details essential for resolving surface cover, topographic relief, and subtle surficial signatures of landslide activity.
Landslide annotations were curated by integrating multi-source authoritative data, including detailed regional geological hazard surveys, historical disaster inventories, and official records from local government agencies. Through rigorous cross-validation and systematic field verification, ambiguous or duplicate entries were removed, and landslide boundaries and centroids were digitally refined. This process yielded a high-quality set of 1643 confirmed historical landslide events, serving as robust positive samples.
To accommodate the input requirements, the original large-format orthoimages were partitioned into 512 × 512 pixel tiles. We employed a centroid-centered sampling strategy: for each annotated landslide, a tile centered on its centroid was extracted to ensure the complete inclusion of the landslide body and its immediate surrounding context. This approach preserves the spatial integrity of landslide features within their environmental setting, enabling models to learn discriminative associations between landslides and contextual factors such as terrain morphology, lithology, and vegetation cover.
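The centroid-centered tiling strategy described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name `extract_tile` and the clamping behavior at orthoimage borders are our own assumptions.

```python
import numpy as np

def extract_tile(image, centroid_rc, tile=512):
    """Extract a tile centered on a landslide centroid, clamping the
    window so it stays fully inside the orthoimage (the paper's exact
    edge handling is not specified; clamping is assumed here)."""
    h, w = image.shape[:2]
    r, c = centroid_rc
    half = tile // 2
    # Clamp the top-left corner so the full tile x tile window fits.
    r0 = min(max(r - half, 0), h - tile)
    c0 = min(max(c - half, 0), w - tile)
    return image[r0:r0 + tile, c0:c0 + tile]
```

For a centroid near an image edge, the window simply slides inward, so the landslide body and its surrounding context remain inside a fixed 512 × 512 tile.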
To ensure reproducibility and rigorous evaluation, we explicitly define our annotation and data splitting protocols. First, the expert annotations were directly provided as axis-aligned bounding boxes, which precisely enclose the visible extent of each landslide instance. Second, negative samples were rigorously defined as image tiles randomly cropped from large, expert-verified hazard-free regions. These regions encompass diverse background types (e.g., dense vegetation, bare rock, agricultural fields, and urban areas) to ensure the model learns to distinguish true landslides from visually similar non-hazardous features. Finally, to prevent spatial data leakage where adjacent patches of the same geological event might appear in both training and testing sets, we adopted a location-based splitting strategy. Instead of random shuffling, all tiles were grouped by their geographical source, and these groups were then partitioned into training, validation, and testing sets with a ratio of 7:1:2. This ensures that the test set comprises completely unseen geological environments, providing a robust assessment of the model’s generalization capability.
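The location-based 7:1:2 split can be sketched as below; the group keys and function signature are illustrative assumptions, but the key property—whole geographic groups are assigned to exactly one split—matches the protocol above.

```python
import random
from collections import defaultdict

def location_split(tiles, ratios=(0.7, 0.1, 0.2), seed=42):
    """Group tiles by geographic source, then partition whole groups
    into train/val/test so that no location spans two splits (a sketch
    of the location-based 7:1:2 protocol; 'tiles' is a list of
    (tile_id, location) pairs, a hypothetical representation)."""
    groups = defaultdict(list)
    for tile_id, location in tiles:
        groups[location].append(tile_id)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n = len(keys)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    train = [t for k in keys[:n_train] for t in groups[k]]
    val = [t for k in keys[n_train:n_train + n_val] for t in groups[k]]
    test = [t for k in keys[n_train + n_val:] for t in groups[k]]
    return train, val, test
```

Because splitting happens at the group level rather than the tile level, adjacent patches of the same geological event can never leak across the train/test boundary.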
The resulting dataset is designed to serve as a large-scale, high-precision benchmark for computer vision-based research on automated landslide detection, supporting both algorithm development and performance evaluation in complex real-world scenarios.

2.1.3. Data Augmentation

To enhance the model’s capacity to extract discriminative features and improve generalization, a systematic data augmentation strategy was applied to the preprocessed samples. This strategy enriches sample diversity and bolsters robustness through two primary categories: geometric transformations and photometric adjustments.
Geometric augmentation encompasses random horizontal and vertical translations, rotations, isotropic scaling, cropping, and flipping [17,18,19,20,21]. These operations simulate variations in landslide position, orientation, scale, and viewpoint while preserving structural integrity, thereby significantly increasing spatial diversity. Photometric augmentation introduces stochastic perturbations to hue and saturation, mimicking spectral and contrast variations induced by diverse weather conditions, illumination angles, and seasonal cycles. This enhances the model’s invariance to surface reflectance discrepancies, effectively mitigating misclassifications caused by imaging variability or differences in vegetation cover.
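A minimal sketch of one such augmentation pass is given below, assuming images normalized to [0, 1]. It covers only axis flips, multiple-of-90° rotations, and a per-channel gain jitter as a stand-in for hue/saturation perturbation; the paper's full pipeline (translations, arbitrary rotations, scaling, cropping) is richer.

```python
import numpy as np

def augment(img, rng):
    """One random geometric + photometric augmentation pass (a minimal
    sketch: flips, 90-degree rotations, and channel-gain jitter; bounding
    boxes would need the matching geometric transform, omitted here)."""
    # Geometric: random flips and a random multiple-of-90 rotation.
    if rng.random() < 0.5:
        img = img[:, ::-1]            # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]            # vertical flip
    img = np.rot90(img, k=int(rng.integers(0, 4)))
    # Photometric: per-channel gain jitter approximating HSV shifts.
    gains = rng.uniform(0.9, 1.1, size=(1, 1, img.shape[2]))
    return np.clip(img * gains, 0.0, 1.0)
```

Since the tiles are square (512 × 512), the 90° rotations preserve tile shape, and clipping keeps the photometric jitter inside the valid intensity range.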
Following this pipeline, the dataset was expanded to a final total of 4929 images (Figure 2). The collection comprises 1643 positive samples, each containing at least one annotated landslide, and 3286 negative samples representing non-landslide surfaces with visually similar characteristics (e.g., bare soil, rock outcrops, built-up areas, and agricultural fields). The resulting positive-to-negative ratio of approximately 1:2 facilitates effective discrimination between true landslides and confounding background patterns. All samples underwent rigorous spatial consistency checks and manual visual inspection to ensure precise bounding boxes and label reliability, guaranteeing high data quality at the source.
To visually demonstrate the complexity and diversity inherent in our dataset, Figure 3 presents representative examples of both challenging positive and hard negative samples. The first row displays positive samples that encapsulate the primary detection challenges in the Ya’an region, specifically characterized by complex morphologies, blurred boundaries that blend with surrounding vegetation, and weak texture features with low spectral contrast. Conversely, the second row illustrates hard negative samples representing non-landslide surface types that are visually similar to landslides, such as bare soil, dry riverbeds, and construction sites. These look-alike features often trigger false positives in standard detectors. By explicitly including these diverse and challenging scenarios, our dataset ensures that the model learns to distinguish true geological hazards from confounding background patterns, thereby enhancing robustness in heterogeneous terrains.

2.2. Proposed Method

2.2.1. YOLOv11 Model

YOLOv11 represents a significant advancement in the YOLO series, offering a unified framework capable of handling five core computer vision tasks: object detection, instance segmentation, pose estimation, image classification, and oriented bounding box detection (Figure 4). Its architecture features a highly optimized backbone integrated with the C3k2 module, which employs parallel convolutional branches to process input feature maps through dual pathways: a shallow path for preserving fine-grained details and a deep path for extracting high-level semantic information. For feature fusion, YOLOv11 introduces the C2PSA (Cross-stage Partial Pyramid Spatial Attention) module. This component synergizes pyramid spatial attention with cross-stage partial connections, utilizing multi-scale convolutional kernels to aggregate spatial context at varying granularities. Furthermore, it incorporates a Squeeze-and-Excitation (SE) mechanism to apply channel-wise weighting, thereby amplifying responses from salient regions while suppressing background noise [22,23,24,25]. Consequently, the rate of false positives in complex scenes is significantly reduced. Given its exceptional multi-task versatility, lightweight design, and heightened sensitivity to small objects, YOLOv11 serves as an ideal baseline for the landslide detection model proposed in this study.

2.2.2. Model Architecture Design

Landslide bodies in optical imagery are often characterized by ambiguous boundaries and degraded textural features, posing significant challenges for the accurate determination of their spatial extent. To address these limitations, this study proposes an enhanced YOLOv11 architecture incorporating a novel Multi-Scale Detail Enhancement (MSDE) module, as illustrated in Figure 5. Specifically, the custom-designed MSDE module is strategically integrated into the neck network of the baseline model to reinforce feature representation.
The MSDE module initiates with a feature pyramid structure that aggregates features from multiple backbone levels via bilinear upsampling and channel-wise concatenation. This process synthesizes a multi-scale contextual representation, enhancing semantic complementarity across scales. By precisely aligning and efficiently fusing low-level textural details with high-level semantic cues, the module significantly bolsters the propagation and preservation of fine-grained texture information.
Subsequently, parallel depthwise separable convolution branches are employed to extract multi-scale spatial features, facilitating robust and high-precision edge detection. The outputs from these branches are fused through a 1 × 1 convolution and integrated via a residual connection to maintain gradient flow and signal integrity. Finally, the lightweight, parameter-free SimAM attention mechanism is incorporated to adaptively amplify responses in informative feature channels. This holistic design ensures the effective enhancement of critical structural details—particularly along blurred or weak-texture boundaries—while preserving original feature fidelity, thereby directly addressing the challenges of edge ambiguity and texture degradation in landslide bounding-box localization.

2.2.3. Self-Designed Multi-Scale Feature Enhancement Module

Despite significant advancements in feature fusion and attention mechanisms within YOLOv11, achieving simultaneous fine-grained boundary localization and efficient contextual modeling remains a challenge. In landslide remote sensing imagery, blurred boundaries and attenuated textures often render landslide regions visually indistinguishable from the surrounding background. Consequently, conventional single-scale feature extraction frequently fails to accurately recover true landslide contours under such conditions.
To address this limitation, we introduce the Multi-Scale Detail Enhancement (MSDE) module. Designed to aggregate shallow spatial details with deep semantic cues, the MSDE module enhances the representation of ambiguous boundaries. By establishing multi-scale perception within a unified feature space, it significantly improves boundary detection accuracy in complex terrains.
As illustrated in Figure 6, the MSDE module integrates two distinct feature streams from the YOLOv11 backbone:
(1)
Low-level features, extracted from early convolutional layers, retain high spatial resolution and precise pixel-wise localization. While effective at preserving fine-grained textural details (e.g., local surface patterns), their limited receptive fields restrict global semantic context, hindering the discrimination of landslides from visually similar backgrounds.
(2)
High-level features, generated through deep convolutions and pooling, encode rich semantic information via hierarchical integration. Although these features capture global characteristics—such as overall shape and spatial distribution—they suffer from reduced spatial resolution and a consequent loss of boundary fidelity.
To enable effective fusion of the two types of features, the mismatch in feature dimensions must first be addressed. Considering the weak semantic representation capability of low-level feature maps, a 1 × 1 standard convolution is applied to expand the channel dimension and transform the features. This operation adjusts the number of channels to match that of the high-level feature map while preserving spatial resolution, and simultaneously enhances the semantic expressiveness of the low-level features through learnable convolutional parameters. For the high-level feature map $F_H \in \mathbb{R}^{C \times H_h \times W_h}$, a 1 × 1 standard convolution is likewise employed for channel alignment, yielding an initially aligned feature $F_H' = \mathrm{Conv}_{1 \times 1}(F_H)$. Subsequently, a learnable re-weighting mechanism is introduced to strengthen critical semantic responses. The feature re-weighting can be formulated as:

$$F_{\mathrm{high}}^{\mathrm{rw}}(i,j,:) = \sigma\big(W_r \cdot F_H'(i,j,:)\big) \odot F_H'(i,j,:)$$

where $W_r \in \mathbb{R}^{C \times C}$ denotes a learnable weight matrix, $\sigma(\cdot)$ is the Sigmoid function, and $\odot$ represents channel-wise multiplication, thereby enabling adaptive enhancement of channels associated with the global morphology of landslides. Additionally, to account for the spatial resolution discrepancy between the two feature maps, the re-weighted high-level feature map $F_{\mathrm{high}}^{\mathrm{rw}}$ is upsampled via bilinear interpolation to match the spatial resolution of the low-level feature map $F_L \in \mathbb{R}^{C \times H \times W}$, i.e., from $H_h \times W_h$ to $H \times W$. Specifically, for a target position $(x, y) \in \mathbb{R}^2$ in the upsampled feature map, its value is computed by weighted interpolation using the four nearest neighboring grid points $(x_1, y_1)$, $(x_1, y_2)$, $(x_2, y_1)$, $(x_2, y_2)$.
$$F_{\mathrm{high}}^{\mathrm{up}}(x, y, :) = \sum_{m=1}^{2} \sum_{n=1}^{2} w_{mn}(x, y) \cdot F_{\mathrm{high}}^{\mathrm{rw}}(x_m, y_n, :)$$

where the interpolation weights are:

$$w_{11} = (x_2 - x)(y_2 - y), \quad w_{12} = (x_2 - x)(y - y_1), \quad w_{21} = (x - x_1)(y_2 - y), \quad w_{22} = (x - x_1)(y - y_1)$$

and satisfy $\sum_{m,n} w_{mn} = 1$. This weighted interpolation process preserves feature smoothness while increasing spatial resolution, avoiding the introduction of additional noise and establishing spatial consistency for subsequent cross-level feature fusion.
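As a concrete check of the interpolation above, a minimal NumPy sketch follows. An align_corners-style coordinate mapping is assumed (the text does not specify the exact variant), and the feature map is laid out as (C, H, W).

```python
import numpy as np

def bilinear_upsample(feat, out_h, out_w):
    """Bilinear interpolation of a (C, H, W) feature map: each output
    value is the weighted sum of its four nearest grid points, with
    weights w11..w22 as in the text (align_corners-style mapping)."""
    c, h, w = feat.shape
    # Map output coordinates onto the input grid.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y1 = np.clip(np.floor(ys).astype(int), 0, h - 2); y2 = y1 + 1
    x1 = np.clip(np.floor(xs).astype(int), 0, w - 2); x2 = x1 + 1
    wy2 = (ys - y1)[None, :, None]; wy1 = 1.0 - wy2   # (y - y1), (y2 - y)
    wx2 = (xs - x1)[None, None, :]; wx1 = 1.0 - wx2   # (x - x1), (x2 - x)
    # Weighted sum over the four neighbors; weights sum to 1 per pixel.
    return (feat[:, y1][:, :, x1] * wy1 * wx1 + feat[:, y1][:, :, x2] * wy1 * wx2
          + feat[:, y2][:, :, x1] * wy2 * wx1 + feat[:, y2][:, :, x2] * wy2 * wx2)
```

Because the weights sum to one, a constant map stays constant and a linear intensity ramp is reproduced exactly, which is the smoothness property the text relies on.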
After the aforementioned processing, the low-level feature map $F_L$ and the upsampled high-level feature map $F_{\mathrm{high}}^{\mathrm{up}}$ satisfy both dimensional consistency and spatial alignment. They are then concatenated along the channel dimension to form a unified input feature tensor $F_{\mathrm{in}} \in \mathbb{R}^{2C \times H \times W}$.
This step facilitates the deep integration of cross-level information. Specifically, fine-grained spatial details from low-level features serve as precise localization anchors for high-level semantic representations, while robust semantic cues from high-level features impose global contextual constraints on low-level details. This synergistic interaction establishes a “detail–semantic” dual-driven feature paradigm, which effectively mitigates boundary misclassification arising from single-level information insufficiency and lays a robust foundation for subsequent multi-scale feature fusion.
To comprehensively capture the multi-scale characteristics of landslide bodies, a multi-branch parallel depthwise separable convolution structure is designed [26,27,28,29,30,31,32]. By leveraging convolutional kernels of different sizes in a coordinated manner, this design mimics the human visual system’s parallel processing of multi-scale cues while balancing computational efficiency and feature extraction capability. The architecture consists of three parallel depthwise separable convolution paths employing $3 \times 3$, $5 \times 5$, and $7 \times 7$ kernels, respectively. Each branch independently processes the input feature tensor $F_{\mathrm{in}}$ to generate complementary multi-scale feature responses, as shown in Figure 7.
The three parallel branches are functionally specialized to construct a comprehensive multi-scale feature capture system, spanning from local details to global contexts:
(1)
3 × 3 Depthwise Separable Convolution Branch: Characterized by a small kernel and narrow receptive field, this branch is optimized for extracting micro-level features, such as pixel-wise edge transitions and fine-grained textures. In landslide detection, it precisely resolves subtle boundary variations, providing critical support for delineating ambiguous edges.
(2)
5 × 5 Depthwise Separable Convolution Branch: With a moderate kernel size and intermediate receptive field, this branch mediates between local detail preservation and regional contextual correlation. It effectively captures meso-scale characteristics—including local morphological structures and textural distribution patterns—thereby bridging the gap between fine-grained details and coarse-level contours.
(3)
7 × 7 Depthwise Separable Convolution Branch: Equipped with a large kernel and extensive receptive field, this branch suppresses local noise while perceiving spatial continuity over broader regions. It encapsulates global structural cues, such as the overall shape and spatial extent of landslide bodies, making it particularly effective for inferring outlines in areas with blurred boundaries or attenuated textures.
Let the input feature tensor be $F_{\mathrm{in}}$; the output features of each branch can be denoted as:

$$G_k = \mathrm{DWConv}_k(F_{\mathrm{in}}), \quad k \in \{3, 5, 7\}$$

where $\mathrm{DWConv}_k(F_{\mathrm{in}})$ represents a depthwise separable convolution operation with a kernel size of $k \times k$. The feature maps produced in parallel by the three branches, $G_3, G_5, G_7 \in \mathbb{R}^{2C \times H \times W}$, characterize landslide features at micro, meso, and macro scales, respectively. This results in a complementary multi-scale feature ensemble that collectively captures detailed landslide characteristics across varying spatial resolutions.
First, a feature aggregation operation is performed: the feature maps $G_3$, $G_5$, $G_7$ output by the three parallel branches are fused through element-wise addition to generate an aggregated feature tensor $F_{\mathrm{agg}}$. The calculation formula is as follows:

$$F_{\mathrm{agg}} = G_3 + G_5 + G_7 = \sum_{k \in \{3, 5, 7\}} \mathrm{DWConv}_k(F_{\mathrm{in}})$$
Element-wise addition is employed as the fusion strategy to preserve the response intensity of individual branches, thereby achieving synergistic enhancement of multi-scale features. This mechanism operates adaptively across different spatial contexts: in regions with clear boundaries, high-frequency responses from the 3 × 3 branch are amplified to ensure edge fidelity; in areas with blurred transitions, meso-scale features from the 5 × 5 branch supplement local details to maintain continuity, and for global structural regions, low-frequency cues from the 7 × 7 branch govern the representation to capture the overall landslide morphology. Consequently, the aggregated feature map comprehensively encapsulates multi-scale information, effectively characterizing landslide attributes across varying spatial resolutions.
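The parallel branches and their element-wise fusion can be sketched in NumPy as follows. This is a simplified illustration: the kernels are caller-supplied placeholders rather than trained weights, and the pointwise 1 × 1 stage of a full depthwise separable convolution is omitted for brevity.

```python
import numpy as np

def dwconv(feat, kernel):
    """Depthwise convolution on a (C, H, W) tensor: each channel is
    convolved with its own k x k kernel, with 'same' zero padding."""
    c, h, w = feat.shape
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(feat, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(feat)
    for i in range(k):
        for j in range(k):
            out += kernel[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out

def msde_branches(feat, kernels):
    """Parallel 3x3 / 5x5 / 7x7 branches fused by element-wise addition,
    i.e. F_agg = sum over k of DWConv_k(F_in) (kernels is a dict mapping
    kernel size k to a (C, k, k) array; a hypothetical interface)."""
    return sum(dwconv(feat, kernels[k]) for k in (3, 5, 7))
```

With identity (center-delta) kernels each branch passes the input through unchanged, so the fused output is exactly three times the input, which makes the additive fusion easy to verify.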
Subsequently, the aggregated feature tensor $F_{\mathrm{agg}}$ is processed by a $1 \times 1$ convolutional layer to achieve channel fusion and dimensionality reduction. This operation serves two critical functions:
(1)
Dimensionality Compression: It performs a linear transformation to reduce the channel count from 2C to C, aligning the feature map with the backbone network’s subsequent stages while mitigating computational overhead associated with channel redundancy.
(2)
Cross-Channel Integration: It synthesizes response variations across different scales, thereby enhancing the representation of discriminative features essential for landslide detection.
Formally, the output is defined as $\mathrm{Conv}_{1 \times 1}(F_{\mathrm{agg}})$, where $\mathrm{Conv}_{1 \times 1}$ denotes the standard $1 \times 1$ convolution operator.
To further enhance feature integrity and ensure training stability, the MSDE module incorporates a residual connection. Specifically, the original input tensor $F_{\mathrm{in}}$ is first projected to match the target channel dimension via a $1 \times 1$ convolution. This projected input is then added element-wise to the aggregated feature map (which has also undergone channel reduction), yielding the final output tensor $F_{\mathrm{out}}$. This process preserves the original identity information while integrating multi-scale enhancements. The computation is formally expressed as:

$$F_{\mathrm{out}} = \mathrm{Conv}_{1 \times 1}(F_{\mathrm{agg}}) + \mathrm{Conv}_{1 \times 1}(F_{\mathrm{in}})$$

where $\mathrm{Conv}_{1 \times 1}$ denotes a standard $1 \times 1$ convolution applied to $F_{\mathrm{in}}$ for channel dimension alignment. This residual design offers two key advantages:
(1)
Preservation of Feature Integrity: Fundamental information from the original input is directly propagated to the module output, mitigating detail loss caused by multiple layers of convolution and aggregation. This is particularly critical for retaining the subtle textural cues of landslides.
(2)
Optimized Gradient Flow: The shortcut connection provides a direct path for gradient backpropagation, effectively alleviating the vanishing gradient problem in deep networks. This facilitates faster convergence, thereby improving both training stability and model generalization.
Through a comprehensive strategy comprising parallel multi-scale extraction, cross-level feature aggregation, and residual-optimized fusion, the MSDE module establishes a multi-scale feature representation framework specifically tailored to landslide detection in remote sensing imagery.
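The pipeline described above (parallel multi-scale depthwise separable branches, element-wise aggregation, 1 × 1 channel fusion, and a 1 × 1-projected residual) can be sketched as follows. This is a minimal PyTorch illustration under our own naming, not the authors' released code, and it simplifies the cross-level aggregation of Figure 6 to a single input tensor:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MSDE(nn.Module):
    """Sketch of the MSDE block: parallel 3/5/7 depthwise separable branches,
    element-wise aggregation, 1 x 1 cross-channel fusion, projected residual."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList(
            [DepthwiseSeparableConv(in_ch, out_ch, k) for k in (3, 5, 7)]
        )
        self.fuse = nn.Conv2d(out_ch, out_ch, 1)   # channel fusion of F_agg
        self.proj = nn.Conv2d(in_ch, out_ch, 1)    # residual projection of F_in

    def forward(self, x):
        # Element-wise addition preserves each branch's response intensity.
        agg = sum(branch(x) for branch in self.branches)
        # F_out = Conv1x1(F_agg) + Conv1x1(F_in)
        return self.fuse(agg) + self.proj(x)
```

Because every branch uses depthwise separable convolutions, the parameter cost of the three parallel kernels remains modest compared with standard convolutions of the same sizes.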

2.2.4. SimAM Module

To enhance the discriminative power of deep semantic features before multi-scale fusion, we integrate a lightweight, parameter-free Simple Attention Module (SimAM) [33,34,35,36,37,38] at the end of the backbone network.
Integration Position: As illustrated in Figure 5, SimAM is embedded immediately after the C2PSA module and before the first MSDE module within the backbone pathway. This strategic placement allows SimAM to refine the high-level semantic features extracted by the backbone prior to their injection into the multi-scale feature enhancement pipeline. While the subsequent MSDE modules focus on cross-scale fusion and boundary detail preservation, SimAM specializes in suppressing homogeneous background noise and amplifying salient landslide patterns at the deepest feature level—where semantic information is most abstract yet critical for accurate detection.
SimAM infers 3D attention weights based on an energy function derived from neuroscience. For a feature map $X \in \mathbb{R}^{C \times H \times W}$, let $x_{c,i,j}$ denote the neuron at channel $c$ and spatial location $(i, j)$. The importance of each neuron is determined by its variance relative to the channel statistics. Following the closed-form solution derived in [33], the attention weight $w_{c,i,j}$ is computed as:
$$w_{c,i,j} = \sigma\!\left(\frac{\left(x_{c,i,j} - \mu_c\right)^2}{4\left(\hat{\sigma}_c^2 + \lambda\right)}\right)$$
where:
  • $\mu_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{c,i,j}$ is the mean of channel $c$,
  • $\hat{\sigma}_c^2 = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(x_{c,i,j} - \mu_c\right)^2$ is the variance of channel $c$,
  • $\sigma(\cdot)$ denotes the Sigmoid activation function,
  • $\lambda = 10^{-4}$ is a small constant that ensures numerical stability.
This formulation assigns higher weights to neurons whose values significantly deviate from the channel mean—typically corresponding to distinctive structures like landslide boundaries or texture anomalies—while suppressing neurons close to the mean, which often represent redundant background regions (e.g., uniform soil or vegetation).
The output feature map $X'$ is obtained via element-wise multiplication:
$$x'_{c,i,j} = w_{c,i,j} \cdot x_{c,i,j}$$
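The weighting scheme above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the formula exactly as stated here (per-channel statistics, sigmoid-gated energy, element-wise rescaling), not the authors' code; the function name `simam` is ours:

```python
import numpy as np

def simam(x, lam=1e-4):
    """Apply SimAM-style weighting to a feature map x of shape (C, H, W)."""
    mu = x.mean(axis=(1, 2), keepdims=True)                  # mu_c per channel
    var = ((x - mu) ** 2).mean(axis=(1, 2), keepdims=True)   # sigma_c^2 per channel
    energy = (x - mu) ** 2 / (4.0 * (var + lam))             # per-neuron importance
    w = 1.0 / (1.0 + np.exp(-energy))                        # sigmoid weights
    return w * x                                             # element-wise rescaling
```

Since the energy term is non-negative, every weight lies in [0.5, 1): neurons at the channel mean are damped toward half their value, while strong outliers (candidate boundaries and texture anomalies) are passed through almost unchanged.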
Key Advantages:
1.
Parameter-Free Efficiency: SimAM computes attention weights analytically using only intrinsic feature statistics ($\mu_c$, $\hat{\sigma}_c^2$), introducing zero additional learnable parameters. This preserves the real-time inference capability essential for emergency landslide monitoring.
2.
Deep Semantic Refinement: By operating at the deepest layer of the backbone (post-C2PSA), SimAM enhances the most abstract and semantically rich features before they are fused across scales. This ensures that noisy or ambiguous high-level representations are cleaned early, improving downstream detection accuracy.
3.
Adaptive Background Suppression: Unlike fixed-threshold methods, SimAM dynamically adjusts weights per image based on local feature distribution. This enables robust performance across varying terrains, lighting conditions, and occlusion levels commonly encountered in remote sensing imagery.
In summary, the sequential integration of SimAM followed by MSDE creates a powerful feature refinement cascade: SimAM purifies deep semantics, while MSDE enriches multi-scale context. Together, they enable the model to generate highly discriminative features that optimize both classification confidence and bounding box precision in complex environments.

3. Results

3.1. Environment

(1)
The experimental hardware configuration is detailed in Table 1.
(2)
The software environment and specific version details utilized in this experiment are summarized in Table 2.

3.2. Evaluation Indicators

To comprehensively evaluate the performance of the proposed MSDE-YOLO landslide detection model, this study employs three quantitative metrics: Precision, Recall, and mean Average Precision (mAP) [39,40,41,42]. These metrics are derived from the detection outcomes categorized as True Positives (TP), False Positives (FP), and False Negatives (FN). The mathematical formulations for the core metrics are defined as follows:
Precision quantifies the proportion of detected landslide instances that are correctly identified, thereby reflecting the reliability and validity of the model’s predictions:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall quantifies the proportion of actual landslide instances that are successfully detected by the model, thereby reflecting its completeness or sensitivity in identifying positive samples:
$$\text{Recall} = \frac{TP}{TP + FN}$$
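As a quick worked illustration of these two definitions (the counts below are made-up examples, not the paper's results):

```python
def precision(tp, fp):
    """Fraction of detections that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual landslides that are detected: TP / (TP + FN)."""
    return tp / (tp + fn)

# e.g. 45 correct detections, 5 false alarms, 8 missed landslides:
print(precision(45, 5))  # 0.9
print(recall(45, 8))     # ~0.849
```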
Furthermore, to ensure rigorous assessment aligned with state-of-the-art object detection benchmarks, we adopt the MS COCO evaluation protocol for calculating mean Average Precision (mAP). Unlike the traditional PASCAL VOC metric, which uses a single IoU threshold of 0.5, the COCO protocol evaluates localization accuracy across a range of thresholds. We report two primary variants:
  • mAP@0.5: The average precision computed at a single Intersection over Union (IoU) threshold of 0.5. This metric indicates the model’s ability to detect landslides with moderate localization overlap.
  • mAP@[0.5:0.95]: The primary metric for this study, defined as the mean of AP values computed over IoU thresholds ranging from 0.50 to 0.95 with a step size of 0.05 (i.e., $\{0.50, 0.55, \ldots, 0.95\}$). This stringent metric provides a comprehensive evaluation of both detection confidence and bounding box regression accuracy.
Mathematically, the Average Precision (AP) for a specific class $c$ at a given IoU threshold $t$ represents the area under the precision–recall curve, calculated by integrating precision over the recall range from 0 to 1:
$$AP_{c,t} = \int_{0}^{1} p(r)\,dr$$
where $p(r)$ denotes the precision achieved at recall level $r$. The final mAP is obtained by averaging the AP scores across all $C$ classes and, for mAP@[0.5:0.95], across all $|T| = 10$ IoU thresholds:
$$\text{mAP} = \frac{1}{C \times |T|} \sum_{c=1}^{C} \sum_{t \in T} AP_{c,t}$$
where $T = \{0.50, 0.55, \ldots, 0.95\}$ for the strict metric, or $T = \{0.50\}$ for mAP@0.5.
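The COCO-style averaging can be sketched as follows, assuming the per-class, per-threshold AP values have already been computed (the helper name `mean_ap` and the AP table passed to it are ours, for illustration only):

```python
import numpy as np

# IoU thresholds for the strict metric: {0.50, 0.55, ..., 0.95}, |T| = 10
iou_thresholds = np.linspace(0.50, 0.95, 10)

def mean_ap(ap_table):
    """ap_table: (C, |T|) array of AP_{c,t} values.
    Returns mAP = 1/(C*|T|) * sum over all classes and thresholds."""
    return float(np.mean(ap_table))
```

For mAP@0.5 the same function applies with a single-column table (T = {0.50}).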
In addition to detection accuracy, computational efficiency is a critical factor for real-time landslide monitoring systems. To quantify the inference speed, we employ Frames Per Second (FPS) as the primary metric. FPS represents the number of images the model can process within one second and is inversely proportional to the inference latency (L), which is the time required to process a single image. The relationship is defined as:
$$\text{FPS} = \frac{1}{L}$$
where $L$ denotes the average inference latency measured in seconds per image. In our experiments, $L$ is calculated by averaging the processing time over a large validation set on the target hardware (NVIDIA RTX 5090) with a batch size of 1. This metric directly reflects the model’s capability to meet the real-time requirements of emergency response scenarios, where rapid detection is as crucial as high accuracy.
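A minimal sketch of this measurement procedure, assuming a stand-in `infer` callable (the warm-up handling is our own convention and is not described in the text):

```python
import time

def measure_fps(infer, images, warmup=3):
    """Average the per-image latency L over `images`, then return FPS = 1 / L."""
    for img in images[:warmup]:          # warm-up runs, excluded from timing
        infer(img)
    t0 = time.perf_counter()
    for img in images:
        infer(img)
    latency = (time.perf_counter() - t0) / len(images)   # seconds per image
    return 1.0 / latency
```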

3.3. Comparison Study

To comprehensively evaluate the performance and scene adaptability of the proposed MSDE-YOLO model for landslide detection on the Ya’an dataset, we conducted comparative experiments against mainstream one-stage and two-stage object detection architectures. The baseline models include the one-stage detectors YOLOv8, YOLOv10, and SSD, as well as the two-stage detector Faster R-CNN. All models, including the baseline YOLOv11 and the proposed MSDE-YOLO, were evaluated on the self-constructed Ya’an landslide hazard dataset under strictly identical experimental conditions.
Specifically, all models adhered to a unified experimental protocol, utilizing the same data preprocessing pipeline and the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.937. The training hyperparameters were kept identical across all methods: an initial learning rate of 0.001, a batch size of 16, 300 epochs, and a weight decay of 0.0005. Evaluation was performed using standard metrics (Precision, Recall, and mAP). This rigorous control of variables ensures a fair and reliable comparison. The quantitative results are summarized in Table 3.
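For reference, the unified protocol can be collected into a single configuration dictionary (the values are taken from the text; the key names follow common YOLO-style training configs and are our assumption, not the authors' actual configuration files):

```python
# Shared training hyperparameters used for all compared models.
TRAIN_CFG = {
    "optimizer": "SGD",
    "momentum": 0.937,
    "lr0": 0.001,            # initial learning rate
    "batch": 16,
    "epochs": 300,
    "weight_decay": 0.0005,
}
```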
In these challenging scenarios characterized by blurred boundaries and heterogeneous backgrounds, traditional detectors like SSD and Faster R-CNN exhibit significant missed detections or inaccurate localization, as visually evidenced in Figure 8. While recent YOLO versions (v8, v10) show improved performance, they still struggle with small-scale landslides or those embedded in complex textures. In contrast, our proposed method (leftmost column in Figure 8) consistently delivers the most robust results, accurately delineating landslide boundaries and minimizing both false positives and false negatives.
In terms of Precision, as illustrated in Figure 9, MSDE-YOLO achieves 90.2%, representing improvements of 16.7, 12.6, 10.4, 7.0, and 4.5 percentage points over SSD, Faster R-CNN, YOLOv8, YOLOv10, and the original YOLOv11, respectively, significantly enhancing the accuracy of landslide region prediction. Regarding Recall—a core metric for landslide detection—the proposed model attains a detection rate of 84.8% (Figure 10), which corresponds to gains of 14.6, 11.0, 7.5, 4.2, and 2.5 percentage points compared to SSD, Faster R-CNN, YOLOv8, YOLOv10, and YOLOv11, respectively. On the primary evaluation metric mAP, MSDE-YOLO reaches 92.7% (Figure 11), outperforming SSD, Faster R-CNN, YOLOv8, YOLOv10, and the original YOLOv11 by 13.6, 10.3, 6.5, 4.8, and 1.2 percentage points, respectively. The consistent improvement in mAP validates the effectiveness and rationality of the proposed enhancements.

3.4. Ablation Study

To quantify the individual contributions of the aforementioned strategies, we conducted ablation studies on the self-constructed Ya’an dataset. The detailed experimental results are presented in Table 4.
To visually verify the effectiveness of the proposed modules, Figure 12 presents a comparative visualization of detection results from the ablation study. The top row displays negative samples (non-landslide areas), while the bottom row shows positive samples (actual landslides).
The experimental results show that embedding the MSDE module into the YOLOv11 backbone improves mAP by 0.6 percentage points. When the SimAM module is introduced alone, mAP increases by only 0.3 percentage points, indicating that the attention mechanism must work in conjunction with the feature fusion module to fully leverage their complementary effects of “feature enhancement + noise suppression.” When both components are combined, the model’s mAP further rises to 92.7%, representing a 1.2 percentage point improvement over the original YOLOv11, as visually evidenced in Figure 13. Overall, the ablation study demonstrates that the simultaneous integration of MSDE and SimAM effectively enhances model performance.

4. Discussion

This study validates the efficacy of the MSDE-YOLO model for landslide hazard detection in the Ya’an region through comprehensive comparative and ablation experiments. The results confirm that the proposed enhancements are both targeted and effective.
In comparative experiments, MSDE-YOLO consistently outperforms conventional architectures (e.g., SSD, Faster R-CNN) and mainstream YOLO variants (e.g., YOLOv8, YOLOv10). Ablation studies further reveal that integrating the MSDE module alone into the baseline YOLOv11 yields measurable gains in mAP and other metrics. This confirms that the multi-scale feature fusion mechanism effectively aggregates shallow-level spatial details with deep-level semantic information, thereby enhancing the representation of ambiguous boundaries and weak textures characteristic of landslides. Consequently, the model overcomes the limited feature discriminability of generic detectors in complex terrains.
Notably, introducing the SimAM attention mechanism in isolation results in marginal or fluctuating performance improvements. This suggests that SimAM operates most effectively in synergy with the feature fusion module, jointly suppressing background noise while preserving critical landslide-specific features.

5. Conclusions

This study presents MSDE-YOLO, a specialized landslide detection model designed to overcome the challenges of blurred boundaries and weak textures in the Ya’an region’s remote sensing imagery. Built on YOLOv11, the model leverages a dual-strategy enhancement: a custom Multi-Scale Detail Enhancement (MSDE) module for superior feature fusion and the parameter-free SimAM attention mechanism for noise suppression. Quantitative evaluations on our self-constructed dataset reveal state-of-the-art performance, with 90.2% Precision, 84.8% Recall, and 92.7% mAP. MSDE-YOLO surpasses conventional models (SSD, Faster R-CNN) and recent YOLO variants by effectively handling multi-scale targets and ambiguous boundaries, thus significantly reducing detection errors in complex environments.
Key innovations include:
(1)
Tailored Feature Fusion: A novel framework employing cross-level alignment and depthwise separable convolutions to simultaneously capture fine-grained boundaries and global context.
(2)
Efficient Attention Integration: The strategic use of SimAM to boost feature discriminability with zero parameter increase, ensuring high efficiency.
These advancements provide a significant reference for geological hazard monitoring in rugged terrains. Future work will aim to optimize the model for edge computing devices and incorporate multi-source data streams to facilitate real-time dynamic monitoring and early warning capabilities, bridging the gap between academic research and practical engineering applications.

Author Contributions

Conceptualization, K.C. and M.H.; Methodology, K.C., W.Z. and G.Y.; Software, K.C., W.Z., Y.H., Z.W., Z.Z. and C.C.; Validation, K.C., W.Z., Y.H., Z.W., Z.Z. and C.C.; Formal analysis, K.C., Y.H. and Z.W.; Investigation, K.C.; Resources, M.H. and G.Y.; Data curation, K.C., W.Z. and Y.H.; Writing—original draft, K.C.; Writing—review & editing, M.H.; Visualization, K.C. and W.Z.; Supervision, M.H. and G.Y.; Project administration, M.H.; Funding acquisition, M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Hebei Province University Smart Emergency Application Technology Research and Development Center and the Sichuan Ya’an Earthquake Disaster Emergency Technology Support Capacity Enhancement Project (Project Number: GS2024151).

Data Availability Statement

The data in this study are not publicly available due to confidentiality restrictions. Requests to access the data can be directed to the corresponding author, subject to the restrictions stipulated by the relevant institutions and regulations.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, J.; Wang, R.; Shi, W.; Yang, L.; Wei, J.; Liu, F.; Xiong, K. Landslide Susceptibility Assessment in Ya’an Based on Coupling of GWR and TabNet. Remote Sens. 2025, 17, 2678.
  2. Cheng, Z.; Gong, W.; Jaboyedoff, M.; Chen, J.; Derron, M.-H.; Zhao, F. Landslide Identification in UAV Images through Recognition of Landslide Boundaries and Ground Surface Cracks. Remote Sens. 2025, 17, 1900.
  3. Cheng, G.; Wang, Z.; Huang, C.; Yang, Y.; Hu, J.; Yan, X.; Tan, Y.; Liao, L.; Zhou, X.; Li, Y.; et al. Advances in Deep Learning Recognition of Landslides Based on Remote Sensing Images. Remote Sens. 2024, 16, 1787.
  4. Yuan, Z.; Gong, J.; Guo, B.; Wang, C.; Liao, N.; Song, J.; Wu, Q. A Novel Landslide Identification Method for Multi-Scale and Complex Background Region Based on Multi-Model Fusion: YOLO + U-Net. Remote Sens. 2024, 16, 4265.
  5. Zhang, W.; Liu, Z.; Zhou, S.; Qi, W.; Wu, X.; Zhang, T.; Han, L. LS-YOLO: A Novel Model for Detecting Multiscale Landslides with Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4952–4965.
  6. Qin, H.; Wang, J.; Mao, X.; Zhao, Z.; Gao, X.; Lu, W. An Improved Faster R-CNN Method for Landslide Detection in Remote Sensing Images. J. Geovisualizat. Spat. Anal. 2024, 8, 2.
  7. Dianqing, Y.; Yanping, M. Remote Sensing Landslide Target Detection Method Based on Improved Faster R-CNN. J. Appl. Remote Sens. 2022, 16, 044521.
  8. Yun, L.; Zhang, X.; Zheng, Y.; Wang, D.; Hua, L. Enhance the Accuracy of Landslide Detection in UAV Images Using an Improved Mask R-CNN Model: A Case Study of Sanming, China. Sensors 2023, 23, 4287.
  9. Wang, N.; Zhi, M. Deep Learning-Based Single-Stage General Object Detection Algorithms: A Review. Comput. Sci. Explor. 2025, 19, 1115–1140. (In Chinese)
  10. Zhang, Y.; Xing, J.; Chen, W.; Wang, H.; Shi, B.; Song, Y.; Huang, X.; Jiang, Z. A Novel YOLOv11-Driven Deep Learning Algorithm for UAV Multispectral Oil Spill Detection in Inland Lakes. J. King Saud Univ.—Comput. Inf. Sci. 2025, 37, 108.
  11. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  12. Ultralytics. YOLOv5: A State-of-the-Art Real-Time Object Detection System. Available online: https://docs.ultralytics.com (accessed on 13 February 2025).
  13. Sohan, M.; Ram, S.; Reddy, R.; Venkata, C. A Review on YOLOv8 and Its Advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 November 2024; pp. 529–545.
  14. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 13 September 2024).
  15. Meng, S.; Shi, Z.; Pirasteh, S.; Ullo, S.L.; Peng, M.; Zhou, C.; Gonçalves, W.N.; Zhang, L. TLSTMF-YOLO: Transfer Learning and Feature Fusion Network for Earthquake-Induced Landslide Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5610712.
  16. Wang, B.; Su, J.; Xi, J.; Chen, Y.; Cheng, H.; Li, H.; Chen, C.; Shang, H.; Yang, Y. Landslide Detection with MSTA-YOLO in Remote Sensing Images. Remote Sens. 2025, 17, 2795.
  17. Hao, X.; Liu, L.; Yang, R.; Yin, L.; Zhang, L.; Li, X. A Review of Data Augmentation Methods of Remote Sensing Image Target Recognition. Remote Sens. 2023, 15, 827.
  18. Lin, G.; Jiang, J.; Bai, J.; Su, Y.; Su, Z.; Liu, H. Frontiers and Developments of Data Augmentation for Image: From Unlearnable to Learnable. Inf. Fusion 2025, 114, 102660.
  19. Wan, D.; Lu, R.; Xu, T.; Shen, S.; Lang, X.; Ren, Z. Random Interpolation Resize: A Free Image Data Augmentation Method for Object Detection in Industry. Expert Syst. Appl. 2023, 228, 120355.
  20. Yan, Y.; Zhang, Y.; Su, N. A Novel Data Augmentation Method for Detection of Specific Aircraft in Remote Sensing RGB Images. IEEE Access 2019, 7, 56051–56061.
  21. Chen, N.; Xu, Z.; Liu, Z.; Chen, Y.; Miao, Y.; Li, Q.; Hou, Y.; Wang, L. Data Augmentation and Intelligent Recognition in Pavement Texture Using a Deep Learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25427–25436.
  22. Gan, Y.; Ren, X.; Liu, H.; Chen, Y.; Lin, P. A Novel Lightweight YOLO11-Based Framework for Precisely Locating Diverse Ship Targets in Complex Optical Remote Sensing Photographs. Meas. Sci. Technol. 2025, 36, 045409.
  23. Chen, X.; Jiang, N.; Yu, Z.; Qian, W.; Huang, T. Citrus Leaf Disease Detection Based on Improved YOLO11 with C3K2. In Proceedings of the International Conference on Computer Graphics, Artificial Intelligence, and Data Processing (ICCAID 2024), Nanchang, China, 6–8 December 2024; pp. 746–751.
  24. Zhou, S.; Yang, L.; Liu, H.; Zhou, C.; Liu, J.; Zhao, S.; Wang, K. A Lightweight Drone Detection Method Integrated into a Linear Attention Mechanism Based on Improved YOLOv11. Remote Sens. 2025, 17, 705.
  25. Feng, F.; Hu, Y.; Li, W.; Yang, F. Improved YOLOv8 Algorithms for Small Object Detection in Aerial Imagery. J. King Saud Univ.—Comput. Inf. Sci. 2024, 36, 102113.
  26. Kamal, K.C.; Yin, Z.; Wu, M.; Wu, Z. Depthwise Separable Convolution Architectures for Plant Disease Classification. Comput. Electron. Agric. 2019, 165, 104948.
  27. Bai, L.; Zhao, Y.; Huang, X. A CNN Accelerator on FPGA Using Depthwise Separable Convolution. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 1415–1419.
  28. Jang, J.-G.; Quan, C.; Lee, H.D.; Kang, U. Falcon: Lightweight and Accurate Convolution Based on Depthwise Separable Convolution. Knowl. Inf. Syst. 2023, 65, 2225–2249.
  29. Dai, Y.; Li, C.; Su, X.; Liu, H.; Li, J. Multi-Scale Depthwise Separable Convolution for Semantic Segmentation in Street–Road Scenes. Remote Sens. 2023, 15, 2649.
  30. Liu, B.; Zou, D.; Feng, L.; Feng, S.; Fu, P.; Li, J. An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution. Electronics 2019, 8, 281.
  31. Zhang, R.; Zhu, F.; Liu, J.; Liu, G. Depth-Wise Separable Convolutions and Multi-Level Pooling for an Efficient Spatial CNN-Based Steganalysis. IEEE Trans. Inf. Forensics Secur. 2019, 15, 1138–1150.
  32. Liu, R.; Jiang, D.; Zhang, L.; Zhang, Z. Deep Depthwise Separable Convolutional Network for Change Detection in Optical Aerial Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1109–1118.
  33. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11863–11874.
  34. Zhang, Y.; Sun, Z. An Advanced YOLOv5s Approach for Vehicle Detection Integrating Swin Transformer and SimAM in Dense Traffic Surveillance. J. Ind. Intell. 2024, 2, 31–41.
  35. Xu, Q.; Wei, Y.; Gao, J.; Yao, H.; Liu, Q. ICAPD Framework and SimAM-YOLOv8n for Student Cognitive Engagement Detection in Classroom. IEEE Access 2023, 11, 136063–136076.
  36. Mahaadevan, V.C.; Narayanamoorthi, R.; Gono, R.; Moldrik, P. Automatic Identifier of Socket for Electrical Vehicles Using SWIN-Transformer and SimAM Attention Mechanism-Based EVS YOLO. IEEE Access 2023, 11, 111238–111254.
  37. Chen, P.; Lin, B.; Chen, X. An Improved YOLOv7 Model with SimAM for Wind Turbine Blade Defects Detection. In Proceedings of the 2024 8th International Symposium on Computer Science and Intelligent Control (ISCSIC), Shenyang, China, 6–8 September 2024; pp. 421–426.
  38. Li, N.; Ye, T.; Zhou, Z.; Gao, C.; Zhang, P. Enhanced YOLOv8 with BiFPN-SimAM for Precise Defect Detection in Miniature Capacitors. Appl. Sci. 2024, 14, 429.
  39. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
  40. Park, I.; Kim, S. Performance Indicator Survey for Object Detection. In Proceedings of the 2020 20th International Conference on Control, Automation and Systems (ICCAS), Busan, Republic of Korea, 13–16 October 2020; pp. 284–288.
  41. Chen, W.; Luo, J.; Zhang, F.; Tian, Z. A Review of Object Detection: Datasets, Performance Evaluation, Architecture, Applications and Current Trends. Multimed. Tools Appl. 2024, 83, 65603–65661.
  42. Chen, J.; Wan, L.; Zhu, J.; Xu, G.; Deng, M. Multi-Scale Spatial and Channel-Wise Attention for Improving Object Detection in Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2019, 17, 681–685.
Figure 1. Spatial Distribution Map of Landslides in Ya’an City.
Figure 2. The self-built landslide dataset of Ya’an City.
Figure 3. Representative samples from the constructed Ya’an landslide dataset. (Top row): Positive samples exhibiting key challenges, including complex morphologies, blurred boundaries, and weak texture features, with landslides highlighted by red bounding boxes. (Bottom row): Hard negative samples representing non-landslide surfaces visually similar to landslides, such as bare soil, dry riverbeds, and construction sites. These examples highlight the specific difficulties addressed by our proposed method.
Figure 4. The original network structure diagram of YOLOv11.
Figure 5. Improved MSDE-YOLOv11 Network Structure.
Figure 6. Effective aggregation of high-level features and low-level features.
Figure 7. Multi-branch parallel depthwise separable convolution structure.
Figure 8. Visual comparison with state-of-the-art detection methods. From left to right: Ours, YOLOv11, YOLOv10, YOLOv8, Faster R-CNN, and SSD.
Figure 9. Results of precision.
Figure 10. Results of recall.
Figure 11. Results of mAP.
Figure 12. Visual comparison of ablation study results. (top row): Negative samples showing the ability to suppress false positives. (bottom row): Positive samples demonstrating detection completeness and boundary accuracy. Columns from left to right: Ours (MSDE-YOLO), Baseline YOLOv11, YOLOv11 + SimAM, and YOLOv11 + MSDE.
Figure 13. The results of the ablation experiment.
Table 1. Hardware environment configuration.

Hardware	Description
CPU	Intel Xeon Gold 6454S
GPU	NVIDIA RTX 5090, 24 GB
Memory	512 GB
Table 2. Software environment configuration.

Software	Description
Operating System	Ubuntu
Python	3.12.1
PyTorch	2.4.1
CUDA	12.5
Table 3. Comparison of detection performance and inference speed on the Ya’an landslide dataset. FPS denotes Frames Per Second measured on an NVIDIA RTX 5090 GPU with batch size 1. The best results are highlighted in bold.

Model	Precision (%)	Recall (%)	mAP (%)	FPS
SSD	73.5	70.2	79.1	35.2
Faster R-CNN	77.6	73.8	82.4	48.6
YOLOv8	79.8	77.3	86.2	68.5
YOLOv10	83.2	80.6	87.9	75.4
YOLOv11	85.7	82.3	91.5	82.1
Ours	90.2	84.8	92.7	79.5
Table 4. Ablation study results on the Ya’an landslide dataset.

Method	Precision (%)	Recall (%)	mAP (%)
YOLOv11	85.7	82.3	91.5
YOLOv11-SimAM	85.2	81.8	91.8
YOLOv11-MSDE	88.4	82.6	92.1
Ours (MSDE-YOLO)	90.2	84.8	92.7
Cui, K.; Huang, M.; Zhang, W.; Yang, G.; Huang, Y.; Wu, Z.; Zhai, Z.; Cheng, C. Research on Landslide Hazard Detection in Ya’an Region Based on an Improved YOLO Model. Remote Sens. 2026, 18, 957. https://doi.org/10.3390/rs18060957