1. Introduction
A landslide is one of the most common and destructive natural phenomena. It occurs when soil or rock on a slope becomes unstable—either entirely or in part—along weak planes or zones due to internal and external dynamic factors such as river erosion, groundwater activity, infiltration from heavy rainfall [
1], seismic activity, or human-induced slope cutting [
2,
3]. There are numerous methods for monitoring landslide disasters, which can be categorized into three types. The first type emerged before remote sensing imagery was developed. It involves traditional manual disaster surveys, in which researchers primarily rely on field visits to collect landslide data. While this method allows for on-site verification of information, the data is usually unavailable until long after the disaster has occurred. This approach is costly and time-consuming, making it difficult to meet the demands of large-scale monitoring. The second category of methods emerged alongside the rapid development and widespread application of satellite remote sensing imagery. Many scholars have used traditional image processing techniques to detect landslides in remote sensing imagery [
4]. The main methods are statistical and machine learning methods [
5,
6]. Since prior knowledge of disasters is primarily derived from remote sensing imagery and on-site investigations, which involve significant human interpretation, the resulting data lacks generalizability and transferability. The third category consists of landslide detection technologies based on deep learning [
7]. In deep learning, a series of closely interconnected convolutional layers forms a convolutional neural network (CNN) [
8]. Input image data is compressed into smaller feature maps through convolutional operations. These feature maps retain the input image’s feature information and are stored as high-dimensional signals with strong descriptive power.
In landslide susceptibility prediction, landslides are influenced by environmental factors in their vicinity and by the landslide clustering effect [
9]. Landslide features in remote sensing imagery can be categorized into three types: color features (tone, hue, and shading); morphological features (shape, texture, size, and pattern); and spatial location features (spatial coordinates and spatial distribution). The surveying process results in remote sensing images containing complex background information and numerous interfering factors. Factors such as rainfall, snowfall, vegetation cover, and shadows limit detection accuracy, posing significant challenges to detection performance and generalization capabilities [
10]. Therefore, improving feature extraction capabilities and removing redundant information from images to enhance detection rates while maintaining accuracy represents a current bottleneck in research on intelligent prediction of landslide-prone areas.
With the continued development and application of machine vision technology, deep learning (DL) has been applied to landslide image recognition [
11]. For example, Yongxin Li et al. [
10] proposed a multi-label classification and annotation network for landslide detection. By leveraging the ability of bidirectional long short-term memory (LSTM) networks to model label dependencies, this network significantly improved landslide detection accuracy in the study area. Furthermore, Dianqing Yang et al. [
12] proposed using Faster R-CNN with deformable convolutions for landslide detection in remote sensing images. After using batch normalization to mitigate the impact of batch size on the model, they optimized the extracted landslide features, thereby significantly improving the model’s accuracy. The LEB-YOLO model proposed by Yingjie Du et al. [
13] significantly improves the efficiency of landslide object detection by simplifying the network architecture and reducing model weight and computational load. However, accuracy, speed, and the number of parameters in landslide detection models are interdependent and influenced by various factors [
13]. For example, increasing the model depth can improve accuracy but reduces detection speed and increases parameter complexity. Reducing the number of parameters affects detection accuracy. The number of parameters is also constrained by storage space [
14]. Balancing accuracy, speed, and the number of parameters in landslide detection models to ensure high performance has become an urgent issue [
15].
In summary, existing research on deep learning models for landslide hazard monitoring primarily relies on large amounts of geological data to predict landslide-prone areas, thereby enhancing early warning and prevention capabilities for landslide disasters [
15]. However, in landslide susceptibility prediction, landslides are influenced by the surrounding environment and the clustering effect of landslides. This undoubtedly limits detection accuracy and poses significant challenges to detection performance and generalization ability. Furthermore, balancing accuracy, speed, and the number of parameters in landslide monitoring models to achieve higher performance remains a major research challenge in intelligent landslide susceptibility monitoring [
16].
In an attempt to address these challenges, we sought to enhance the YOLO model by integrating four pivotal modules: Coordinate Attention, Deformable Convolutional Networks, the C3 module/CSP architecture and SIoU loss. The outcome was a novel detection model, CDCS-YOLO, named after the initials of these modules, which is designed to detect landslide hazards in the Ili Kazakh Autonomous Prefecture of Xinjiang. First, this study compares the performance of YOLOv4, YOLOv5 and YOLOv7 in terms of detection accuracy, speed and memory consumption. Following a thorough evaluation, YOLOv5s was chosen as the base architecture. This architecture combines the DCN module with the GhostConv [
17,
18] module to enhance the model’s ability to extract features from remote sensing images. Furthermore, introducing the Coordinate Attention (CA) mechanism at the SPPF module [
19] and the SCYLLA-IoU (SIoU) loss function accelerates model training, improves the overlap between predicted and ground-truth bounding boxes, reduces computational complexity, and enhances object detection accuracy. Finally, given the significant differences in image features between soil and rock landslides, a differentiated landslide monitoring and management scheme based on the CDCS-YOLO model is proposed.
3. Experimental Evaluation
3.1. Dataset and Data Pre-Processing
A map of geological hazard distribution in Xinjiang was created in ArcMap 10.8 using satellite imagery of the region downloaded with the All-in-One Map Downloader version 3.0 (
Figure 7). Data points for landslide hazards were obtained from the Regional Highway Business Development Centre network. Landslides in Xinjiang are primarily concentrated in three major mountain ranges. Statistical data indicate that the Ili Kazakh Autonomous Prefecture alone accounts for over 2000 landslide sites. Due to geographical constraints, manual data collection is extremely challenging. Future research will focus on constructing a database based on existing high-resolution landslide remote sensing imagery to enable landslide detection. Due to the complexity of landslides in Xinjiang’s mountainous regions, subsequent practical testing of the model will use landslide data from this area to verify the model’s general applicability.
The performance of deep learning models largely depends on the volume of data they are exposed to during training. When processing large datasets, models can learn the mapping rules between data inputs and outputs. It is crucial to ensure that the model is exposed to a diverse and sufficient number of data instances, as this improves the adaptability and predictive accuracy of convolutional neural networks (CNNs) when handling different scenarios. In order to develop a model with comprehensive generalisability, a large number of representative training samples must be carefully collected and appropriate data preprocessing must be performed. According to Step 1, data preprocessing consists of three stages.
3.1.1. Data Organization
First, 933 available Sentinel-2 remote sensing images of landslides were collected from the Kaggle data platform and split into training and testing sets at a ratio of 8:2. Subsequently, the remote sensing images corresponding to road sections with high landslide risk were annotated for subsequent practical applications, based on the landslide disaster data system maintained by the relevant administrative units from 2018 to 2023.
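The 8:2 split described above can be sketched as follows. This is a minimal illustration rather than the authors’ procedure: the file names, the fixed random seed, and the function name `split_dataset` are assumptions introduced here.

```python
import random


def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Shuffle the image paths reproducibly and split them into train/test lists."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # fixed seed so the split can be repeated
    n_train = int(len(paths) * train_ratio)
    return paths[:n_train], paths[n_train:]


# Placeholder file names standing in for the 933 Sentinel-2 images.
images = [f"landslide_{i:04d}.png" for i in range(933)]
train_set, test_set = split_dataset(images)
print(len(train_set), len(test_set))
```

With 933 images and a ratio of 0.8, this yields 746 training images and 187 test images.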
3.1.2. Sample Augmentation
Due to the relatively small size of the dataset in this study, the learning capabilities of the parameters and the network may not be fully realised, which could lead to negative effects such as overfitting. To address this issue, a series of innovative data augmentation strategies were implemented in accordance with the experimental requirements, with the aim of enhancing the model’s ability to generalise and mitigating the potential risk of overfitting during training.
First, we introduced a multidimensional data augmentation method that, unlike simple image rotation, translation or scaling, includes more complex image transformations such as HSV saturation enhancement. These operations significantly increase the diversity of the dataset while preserving the essential features of the images. Secondly, in terms of the implementation strategy for data augmentation, both online and offline image augmentation methods are employed simultaneously. Compared to traditional offline augmentation, this approach first uses offline augmentation to expand the dataset further, and then uses online augmentation to avoid the need for large amounts of storage space, while also ensuring that the data is transformed randomly during each model training session. Specifically, the data augmentation techniques applied during the input phase of the model ensure that the images seen by the model in each iteration possess unique characteristics. Finally, label smoothing was applied to improve training performance, suppress model overfitting, and enhance the robustness of the landslide detection algorithm to noise (
Figure 8).
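To make two of the steps above concrete, the following sketch shows an online HSV saturation jitter and label smoothing in plain Python. It is illustrative only: the gain range (0.5 to 1.5), the smoothing factor (0.1), and all function names are assumptions, since the text does not report the values used.

```python
import colorsys
import random


def jitter_saturation(pixels, gain):
    """Scale the HSV saturation of a list of (R, G, B) pixels (0-255) by `gain`."""
    out = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        s = min(1.0, s * gain)  # clamp so saturation stays in [0, 1]
        r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
        out.append((round(r2 * 255), round(g2 * 255), round(b2 * 255)))
    return out


def random_gain(rng, low=0.5, high=1.5):
    """Draw a fresh gain for every training iteration (online augmentation)."""
    return rng.uniform(low, high)


def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: move `eps` of the probability mass to a uniform prior."""
    k = len(one_hot)
    return [y * (1.0 - eps) + eps / k for y in one_hot]


rng = random.Random(0)
augmented = jitter_saturation([(200, 100, 50)], random_gain(rng))
soft_target = smooth_labels([1.0, 0.0])
```

Because the gain is redrawn at each call, the same stored image is presented to the model with a different saturation in each epoch, which is the property the online strategy relies on.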
3.1.3. Labelling
In this study, the collected images were systematically annotated using the minimum bounding box method to delineate target objects precisely, with each target being assigned a corresponding class label. Annotations were performed separately for the entire landslide body and for each individual landslide within the cluster. This approach improved annotation accuracy and ensured that the generated labelled data met the requirements of object detection models at different scales. PyCharm Community Edition 2022.1.4 (JetBrains s.r.o., Prague, Czech Republic) was then used to process the annotation files and convert them into computer-readable text files for model training.
The structure of the dataset after labelling, the third preprocessing stage, is shown in Figure 9.
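The conversion of a bounding-box annotation into a text label can be sketched as below. The paper does not state the exact file format, so this assumes the normalised `class cx cy w h` line layout commonly used with YOLO-family models; the function name and the example coordinates are hypothetical.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space box to one normalised 'class cx cy w h' text line."""
    cx = (xmin + xmax) / 2 / img_w   # box centre, as a fraction of image width
    cy = (ymin + ymax) / 2 / img_h   # box centre, as a fraction of image height
    w = (xmax - xmin) / img_w        # box width, normalised
    h = (ymax - ymin) / img_h        # box height, normalised
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical landslide box in a 400 x 200 pixel image.
line = to_yolo_line(0, 100, 50, 300, 150, 400, 200)
print(line)
```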
3.2. Experimental Setup
Several experiments have been conducted using the dataset collected for this task. The model was trained on a GPU; the system configuration is detailed in
Table 3 below. The experimental parameters are as follows: the Adam optimizer was used in place of SGD for algorithm optimization, with 100 training epochs and a batch size of 16. The minimum learning rate was set to 0.0001, and the maximum learning rate to 0.01. The weight decay value was set to 5 × 10⁻⁴.
We use precision and recall as evaluation metrics for the proposed method, where P = TP / (TP + FP) and R = TP / (TP + FN).
Here, R denotes recall, P denotes precision, FA denotes the false alarm rate, TP denotes true positives (i.e., landslide samples correctly predicted as landslides), FP denotes false positives (i.e., non-landslide samples incorrectly predicted as landslides), and FN denotes false negatives (i.e., landslide samples incorrectly predicted as non-landslides).
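As a small worked example, precision and recall follow directly from the TP, FP, and FN counts under their standard definitions. The counts below are invented for illustration and are not results from this study.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of detections that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of real landslides that are found
    return precision, recall


# Illustrative counts: 90 correct detections, 10 false alarms, 30 missed landslides.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)
```

For these counts the precision is 0.9 and the recall is 0.75.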
Common models for object detection include the R-CNN and YOLO series mentioned earlier, along with their derivatives and improved versions. It is not feasible to compare every single model. However, according to the literature, YOLO series algorithms better meet the requirements for model accuracy, memory consumption, and detection speed. Therefore, we will use the YOLO series as a baseline and compare its optimised version with several representative models from recent years to evaluate the optimised model’s performance.
3.3. Model Training Parameters
In supervised learning, the performance of deep learning models is strongly influenced by hyperparameter tuning. Hyperparameter optimisation achieves an optimal balance between model convergence efficiency (training speed) and detection accuracy by systematically adjusting key parameters, such as the learning rate and batch size. This study uses the control-variable method to determine the optimal hyperparameter combination (see
Table 4), ensuring that all comparison models undergo benchmark testing under identical conditions. This guarantees the scientific validity and comparability of the experimental conclusions.
3.3.1. Batch Size
In this experiment, the batch size (BS), i.e., the number of samples drawn from the training set in a single training iteration, was set to 16. The results show that when BS = 16, GPU memory usage remains stable at 85–92%, and training efficiency improves by 41% compared with BS = 4 or BS = 8.
3.3.2. Learning Rate
As the core control parameter for weight updates, the learning rate must strike a balance between convergence speed and stability. If it is set too low (<1 × 10⁻³), local optima may be reached, leading to overfitting. Conversely, if it is set too high (>1 × 10⁻²), gradient explosion will occur. This experiment adopts a staged learning rate: it is set to 0.01 during the freezing phase to accelerate feature extraction and reduced to 0.001 during the fine-tuning phase to allow precise parameter adjustment. A smooth transition is achieved through Adaptive Moment Estimation (Adam).
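The staged schedule and a single Adam update can be sketched for one scalar weight as follows. The two learning rates come from the text; the epoch at which the freezing phase ends (`freeze_epochs=50`) and the Adam constants (0.9, 0.999, 1e-8, the usual defaults) are assumptions, as the paper does not report them.

```python
import math


def staged_learning_rate(epoch, freeze_epochs=50, freeze_lr=0.01, finetune_lr=0.001):
    """Return 0.01 during the (assumed) freezing phase and 0.001 afterwards."""
    return freeze_lr if epoch < freeze_epochs else finetune_lr


def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1,
                        lr=staged_learning_rate(epoch=0))
```

Because Adam normalises the step by the running gradient magnitude, the first update moves the weight by roughly the learning rate itself, which is why the change from 0.01 to 0.001 directly scales the step size between the two phases.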
3.3.3. Anchor Box
The standard k-means clustering method used to generate anchor boxes in the YOLO framework shows signs of bias when applied to the small-scale roadside slope dataset examined in this study.
Due to the concentration of target sizes, the anchor boxes generated by clustering lack diversity, leading to a significant decline in the model’s performance in detecting landslide targets of non-mainstream sizes. Comparative experiments have shown that using YOLO’s default multi-scale bounding box configuration effectively mitigates this issue, as its preset aspect ratio combinations are better suited to the diverse morphological characteristics of landslides and demonstrate greater robustness in cross-scale detection tasks.
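For reference, anchor generation by k-means clustering on box widths and heights can be sketched as below, using IoU as the similarity measure, as is customary for YOLO anchors. This is a simplified illustration: the deterministic initialisation, the function names, and the toy box sizes are assumptions, not the configuration used in this study.

```python
def wh_iou(a, b):
    """IoU of two boxes given only as (width, height), aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50):
    """Cluster (width, height) pairs into k anchors, assigning by highest IoU."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    # Deterministic start: evenly spaced picks from the area-sorted boxes.
    centers = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: wh_iou(b, centers[i]))
            groups[best].append(b)
        updated = []
        for i, g in enumerate(groups):
            if g:  # new centre = mean width and height of the assigned boxes
                updated.append((sum(b[0] for b in g) / len(g),
                                sum(b[1] for b in g) / len(g)))
            else:  # keep the old centre if no box was assigned to it
                updated.append(centers[i])
        if updated == centers:
            break
        centers = updated
    return sorted(centers, key=lambda c: c[0] * c[1])


# Toy box sizes in pixels: three small targets and three large ones.
boxes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (90, 45)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

When nearly all boxes fall in one size range, the resulting centres crowd into that range, which is the loss of anchor diversity described above.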
4. Results
In accordance with established landslide classification principles, the landslide hazards identified in this study were assessed based on the following criteria: material type (soil landslides, rock landslides), movement type (sliding, falling, flowing), boundary clarity, surface texture, color contrast, degree of vegetation disturbance, and slope characteristics. In existing optical imagery, soil landslides typically exhibit fan-shaped or tongue-shaped forms, relatively smooth surfaces, strong color contrast, and obvious vegetation clearance. Rock landslides usually present as blocky or wedge-shaped features with rough textures and exposed rock surfaces, and are often characterised by shadows and angular debris deposits. These classification criteria are used to evaluate model performance and guide the development of monitoring strategies. Inspection of the landslide imagery revealed that, for both soil and rock landslides, the triggering factors change markedly before and after the slide, and these changes appear as distinct features in the images. Therefore, combining deep learning with traditional measurement methods could further accelerate the early identification of landslide disasters.
Based on the above data, we will present and discuss the results of the proposed landslide detection method, applying it to landslide-prone areas of Xinjiang. First, we compared different versions of the YOLO family of base models, pretraining them on the dataset we created, as detailed in Step 2 of the experimental procedure. The results are shown in
Table 5.
As shown in
Table 5, the baseline models YOLOv5 and YOLOv4 achieve higher frame rates than YOLOv7, with only minor differences in accuracy. Meanwhile, YOLOv5n, YOLOv5s, and YOLOv7-tiny use less memory. The table above shows that YOLOv5n, YOLOv5s, and YOLOv7-tiny offer significant advantages in terms of accuracy, memory usage, and detection speed. Their performance more closely aligns with the accuracy, memory, and speed requirements of landslide object detection tasks.
Based on the above conclusions, we expanded the YOLOv5s model framework by incorporating recent mainstream attention mechanism modules into the algorithm. We then proceeded to Experiment 3, in which we optimised the model further using a more practical loss function. The results of the ablation experiments are shown in
Figure 10,
Figure 11 and
Figure 12.
As shown in the scatter plot, the SIoU loss function achieves the highest training accuracy at epoch 90. In contrast, CIoU and EIoU do not achieve their highest training accuracy until epoch 110.
According to the radar chart, integrating the SIoU loss function with the CA attention mechanism improves the model’s performance, yielding an mAP@0.5 of 0.956, a recall of 0.908, and a precision of 0.937.
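For readers unfamiliar with the SIoU loss compared above, the sketch below follows its commonly cited formulation, which adds an angle-aware distance cost and a shape cost to the IoU term. It handles a single pair of boxes in (cx, cy, w, h) form; the shape exponent `theta=4` is an assumed value, and this is not the authors’ implementation.

```python
import math


def siou_loss(pred, target, theta=4.0, eps=1e-9):
    """SIoU loss for one predicted and one ground-truth box, each (cx, cy, w, h)."""
    px, py, pw, ph = pred
    tx, ty, tw, t_h = target

    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    t_x1, t_y1, t_x2, t_y2 = tx - tw / 2, ty - t_h / 2, tx + tw / 2, ty + t_h / 2

    # Plain IoU.
    inter_w = max(0.0, min(p_x2, t_x2) - max(p_x1, t_x1))
    inter_h = max(0.0, min(p_y2, t_y2) - max(p_y1, t_y1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + tw * t_h - inter + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(p_x2, t_x2) - min(p_x1, t_x1) + eps
    ch = max(p_y2, t_y2) - min(p_y1, t_y1) + eps

    # Angle cost: 0 when the centres are axis-aligned, 1 at 45 degrees.
    dx, dy = tx - px, ty - py
    sigma = math.hypot(dx, dy) + eps
    sin_alpha = abs(dy) / sigma
    angle_cost = 1.0 - 2.0 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost, modulated by the angle cost.
    gamma = 2.0 - angle_cost
    rho_x, rho_y = (dx / cw) ** 2, (dy / ch) ** 2
    distance_cost = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: relative mismatch in width and height.
    omega_w = abs(pw - tw) / max(pw, tw)
    omega_h = abs(ph - t_h) / max(ph, t_h)
    shape_cost = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1.0 - iou + (distance_cost + shape_cost) / 2.0


loss_near = siou_loss((0, 0, 2, 2), (0.5, 0, 2, 2))
loss_far = siou_loss((0, 0, 2, 2), (1.5, 0, 2, 2))
```

The loss is close to zero for identical boxes and grows as the predicted box drifts away from the ground truth, so `loss_near` is smaller than `loss_far`.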
The model’s validity was verified using landslide remote sensing images from the validation set. We applied the loss-function-based optimal attention mechanism obtained from the experiments, along with the DCN backbone network and the Ghost model, to Model IV. The detection results are shown in
Table 6.
The experimental results show that the recognition accuracy of the optimised model has significantly improved compared to the initial model, rising from 67.3% to 96.1%. As shown in
Table 6, the proposed model achieves a 1.3% increase in accuracy, an 8.2 FPS (4.4%) increase in frame rate, and a 1.7 MB (10%) reduction in model size compared to the initial model. Taking these results into account, while the proposed model is only slightly inferior to YOLOv5m in terms of accuracy, its smaller model size gives it an overall performance advantage over YOLOv5n.
Spatial Pyramid Pooling (SPP) extracts features at different scales, which enhances the model’s ability to detect landslides of different sizes and improves recognition accuracy. The CA module applies average pooling separately along the width and height directions of each channel, strengthening the model’s ability to encode coordinate information and improving landslide recognition across different scenarios. The SIoU loss function, which accounts for three aspects of box regression (angle, distance, and shape), is used for landslide localisation, thereby improving both training speed and inference accuracy. This analysis shows that using a backbone network that combines the GhostConv module, the deformable convolution module, the CA mechanism module, SPPF, and the SIoU loss function significantly improves landslide detection performance, which supports the design choices behind the model. The test results are shown in
Figure 13.
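The direction-wise pooling at the heart of the CA module can be illustrated on a single channel as follows. This sketch covers only the pooling step; the subsequent convolution and sigmoid gating of Coordinate Attention are omitted, and the function name and toy feature map are assumptions.

```python
def directional_pools(feature_map):
    """Average-pool one H x W channel along each spatial direction separately.

    Returns (pooled_h, pooled_w): one value per row and one value per column.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    # Averaging over the width keeps one descriptor per row (vertical position).
    pooled_h = [sum(row) / width for row in feature_map]
    # Averaging over the height keeps one descriptor per column (horizontal position).
    pooled_w = [sum(feature_map[i][j] for i in range(height)) / height
                for j in range(width)]
    return pooled_h, pooled_w


channel = [[1, 2, 3],
           [4, 5, 6]]
pooled_h, pooled_w = directional_pools(channel)
print(pooled_h, pooled_w)
```

Unlike a single global average, the two pooled vectors keep the row and column positions of strong responses, which is what allows the attention weights to encode location.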
Figure 13 illustrates four typical landslide hazard detection scenarios. From left to right, the scenarios depict areas with dense clusters of multiple landslides; narrow, elongated landslide bodies; mixed landslide clusters of varying scales; and low-light/shadow-obstructed scenarios. Of these four scenarios, the improved YOLOv5s, which was used as the base model for CDCS-YOLO, achieved the highest detection rate for small objects. It demonstrated optimal localisation accuracy and bounding box fit with no false positives, high classification reliability and robust environmental adaptability across challenging scenarios such as complex lighting, shadow occlusion and multi-scale mixed environments.
Figure 7 shows the distribution of geological hazards in Xinjiang and indicates that landslides are primarily concentrated in the Ili Kazakh Autonomous Prefecture. We then moved on to the fourth step of our experiment, applying the trained model (CDCS-YOLO) to detect landslides in the mountainous areas of this region (see
Figure 14). As shown in the figure, the landslide detection accuracy generally met expectations, further validating the model’s generalisability.
5. Discussion
5.1. Differentiated Landslide Prevention and Mitigation Measures
Analysis of the experimental results in this study reveals that UAV-based detection is significantly more effective for soil landslides than for rock landslides. Soil landslides exhibit more distinct colour and texture features in images, making them easier for deep learning models to capture. In contrast, rock landslides present challenges due to factors such as the fine texture of rock and significant effects from shadows and lighting, which result in insufficient accuracy in the detection model’s localisation and classification. Relying solely on UAVs to inspect rock landslides can therefore lead to missed detections or difficulties in identifying early-stage cracks under certain terrain conditions, so this approach must be combined with other methods to achieve more comprehensive early identification and warning of landslides.
In order to develop more effective early warning strategies, the interpretation of detection results must take into account the causative factors and triggering conditions of landslides. In this study area, landslides occur in close relation to topography, rock type, geological faults, freeze–thaw cycles, rainfall infiltration, groundwater activity, seismic disturbances, river erosion and human activity. Soil landslides are generally more sensitive to rainfall, soil moisture, pore water pressure and changes in groundwater levels. They often exhibit distinct changes in colour, texture and vegetation cover in optical imagery. In contrast, rock mass landslides are more strongly influenced by joints, bedding planes, crack propagation, in situ stress, weathering and sudden external triggers. Their early deformation phenomena are often difficult to capture accurately using only UAV or satellite optical imagery. Therefore, relying solely on UAVs for rockslide inspection may lead to missed detections or difficulties in effectively identifying early-stage cracks under certain terrain conditions. Consequently, it is necessary to combine UAVs with other methods (such as ground-based LiDAR [
34] and manual inspections [
41]) to achieve more comprehensive early identification and early warning of landslides. Tailored slope monitoring methods for different types of soil and rock landslides have been developed based on a summary of their characteristics, as shown in
Table 7 and
Table 8.
Therefore, the proposed CDCS-YOLO model should be integrated into a differentiated monitoring and response framework. For high-risk earthen slopes, UAV imagery should be combined with on-site surveys, soil moisture monitoring, rainfall records, drainage system inspections and crack monitoring. For high-risk rock slopes, however, UAV inspections should be supplemented with ground-based LiDAR or total station monitoring, crack measurement device inspections, rockfall observations and emergency inspections following heavy rainfall or earthquakes. Mitigation measures may include surface and subsurface drainage, toe protection, retaining structures, anchor bolts, protective netting, crack sealing, vegetation restoration, traffic control and temporary road closures when necessary. Combining image recognition with engineering monitoring reduces the rate of missed defects, making this approach more practical for highway maintenance and emergency management.
5.2. Methodological Framework and Limitations
As demonstrated by the experimental results in this paper, the CDCS-YOLO model significantly improves the early detection and identification of landslides on mountain roads in Xinjiang. Unlike existing studies that improve a single aspect of the YOLO model (such as replacing only the backbone or adding attention mechanisms alone) [
42,
43,
44], this study jointly optimises efficiency, accuracy and robustness. With a model size of 14.2 MB, an accuracy of 96.1% and a speed of 142.6 FPS, it strikes a favourable balance between industrial deployability and academic performance. In addition, the experiments in this study further demonstrate that the early identification and monitoring of landslides using the CDCS-YOLO deep learning model relies on data from various key influencing factors, based on the differentiated theoretical strategy for soil and rock proposed by K. He et al. [
25]. For example, when it comes to soil landslides, the key parameters to monitor include soil moisture content, pore water pressure, groundwater level, surface displacement, crack development, precipitation and meteorological data, as well as slope deformation rates [
41]. In contrast, for rockfall disasters, the focus should be on joint displacement and deep-seated deformation, changes in in situ stress, groundwater dynamics, surface inclination and topographic deformation characteristics [
45]. Combining multi-source monitoring data with image recognition technology can significantly improve landslide monitoring efficiency while ensuring accurate detection. Therefore, as proposed in
Section 5.1 regarding differentiated landslide prevention and control measures, a truly comprehensive landslide monitoring system of the future should utilise CDCS-YOLO image recognition technology and be supplemented by field surveys, UAV observations, LiDAR, rainfall monitoring and deformation monitoring. This would develop a landslide monitoring and early warning system led by CDCS-YOLO.
Nevertheless, the model has limitations. In particular, its adaptability and generalisation capabilities need to be further enhanced when dealing with landslide data under varying geological conditions. Future improvements can be pursued in two areas. First, incorporating a broader range of landslide data would improve the model’s accuracy and reliability across diverse geological environments. Second, the model may struggle with more complex backgrounds and noisy data; integrating advanced noise-suppression and background-modelling techniques would further improve detection accuracy.
6. Conclusions
In this study, we propose a novel landslide detection model called CDCS-YOLO. To address the complexity of landslide backgrounds, we use a Deformable Convolutional Network (DCN) module in conjunction with a GhostConv module. We also conduct ablation experiments, which show that the CA mechanism and the SIoU loss function are the most suitable for landslide monitoring. This approach enhances the extraction of landslide features and spatial localisation capabilities, thereby improving the accuracy of detecting landslides of varying angles, sizes, and shapes. Experimental results demonstrate that CDCS-YOLO outperforms the traditional YOLOv4, YOLOv5, and YOLOv7 models in both performance and accuracy. The model achieves an mAP of 96.6%, a precision of 96.1%, and a frame rate of 142.6 FPS with only a slight increase in the number of parameters. This demonstrates that the algorithm offers a certain degree of efficiency improvement. Results from applying the model to the Ili Kazakh Autonomous Prefecture in Xinjiang show that landslide detection accuracy fluctuates minimally, further validating the model’s effectiveness.
The CDCS-YOLO model is not intended to replace geotechnical monitoring; rather, it is designed to provide a rapid, image-based screening tool to be used alongside other methods, such as UAV inspections, ground-based LiDAR, on-site surveys, rainfall data, groundwater monitoring and deformation monitoring. Future research will expand the dataset to include a wider range of topographic and climatic conditions in order to more accurately distinguish between soil and rock slides. It will also integrate rainfall thresholds, snowmelt indices and climate change scenarios in order to support the prediction of future landslides and the management of highway risk.
Although this method has achieved good results in landslide object detection, limitations in data scale and variations in the topography and terrain of landslide-prone areas mean that further research is needed. This research should focus on adaptability to different geographical features, changes in lighting conditions, and phenomena such as occlusion. At the same time, the ability to process large volumes of data must be considered to improve real-time detection while maintaining high accuracy. This study provides an important reference for the automatic identification of landslide geological hazards and paves the way for future research.