1. Introduction
Geological disaster evolution is a long-term and dynamic process controlled by multiple factors. Variations in external triggering conditions at different evolutionary stages can influence the likelihood of geological disaster occurrence. Landslide susceptibility assessment quantifies the likelihood of geological disasters, such as landslides, occurring within a specific region over a given period. Susceptibility assessment algorithms typically draw on historical information from the study area and employ complex models to forecast future land conditions [
1]. Over the past few decades, landslide susceptibility algorithms have exhibited high reliability in assessing potential landslides and ground subsidence, making them increasingly robust and widely applicable [
2].
Geological disasters are significantly influenced by extreme weather events and intensive anthropogenic activities. As a result, they exhibit broad distribution, frequent occurrences, and severe impacts [
3]. Moreover, landslides commonly develop in a concealed manner and occur suddenly, posing risks to human life and property. Due to the varying geological environments across different regions, landslide susceptibility assessment faces significant challenges [
4]. Consequently, research on landslide susceptibility assessment is essential. With the rapid advancement of computer science, remote sensing technologies, and geographic information systems (GISs), numerous models have been developed. Through expansion into diverse scenarios, these models have progressively matured. In general, these methods are categorized into two types: knowledge-driven methods and data-driven methods [
4]. Knowledge-driven methods rely on hierarchical expert analysis, but often exhibit limited practical applicability. For the latter, numerous studies have incorporated geological, meteorological, and other multi-source datasets to assess landslide susceptibility. These studies highlight the critical role of multi-source data and advanced methods in enhancing predictive accuracy. Rapid advances in computing technologies have driven the evolution of assessment algorithms. Initially, basic statistical methods, such as frequency ratio analysis [
5] and weight-of-evidence models [
6], were commonly used. Subsequently, advanced methods, including logistic regression [
7], support vector machines (SVMs) [
8], and decision trees [
9], have been increasingly adopted. These methods have exhibited robust performance across various scenarios. Notably, artificial intelligence-based algorithms have attracted considerable attention due to their superior accuracy, computational efficiency, and cost-effectiveness. Machine learning algorithms are a branch of artificial intelligence. They facilitate thorough and nuanced assessment of geological disasters by analyzing high-dimensional datasets. Xie et al. [
10] employed logistic regression and SVMs for landslide susceptibility assessment in 2017. Chen et al. [
11] employed the random forest (RF) model for landslide risk assessment in Taibai County in 2018. Hakim et al. [
2] employed four models to assess geological hazard susceptibility in Jakarta, Indonesia, in 2020. To improve predictive accuracy, numerous enhanced machine learning methods have been proposed [
12]. In 2023, Liu et al. [
8] refined the SVM by substituting the Euclidean distance with the Mahalanobis distance. In 2024, Ouyang et al. [
13] optimized the loss function by introducing the PU-PullBaggingDT model. In addition, GIS-based models that integrate multiple meta-ensembles have been developed in recent years. Numerous experiments have demonstrated that machine learning methods outperform conventional methods in landslide susceptibility assessment [
14,
15].
The reliability of landslide susceptibility assessment is closely related to the selection of models and the quality of input samples [
16]. On the one hand, different models exhibit distinct strengths and limitations, which makes it difficult for any single model to adapt to dynamic geological environments. For example, decision trees are well-suited for discrete data, whereas SVMs excel with high-dimensional data [
17]. At this stage, a single model presents limitations for diverse scenarios [
18]. Over the past few decades, meta-ensemble algorithms have attracted widespread attention as novel methods. By integrating multiple weak classifiers, the meta-ensemble algorithm leverages the strengths of baseline models to improve generalization [
19]. Meta-ensemble algorithms can be categorized into homogeneous and heterogeneous types. In homogeneous ensemble algorithms, such as bagging and boosting [
20], the same classifier is used as the baseline model. In contrast, heterogeneous ensemble algorithms (e.g., stacking) employ diverse classifiers as baseline models. By relying on diverse classifiers to extract features from multiple perspectives, heterogeneous ensemble algorithms realize complementary advantages [
21,
22]. On the other hand, the reliability of the model depends on the quality of the input samples. Bias introduced by sampling can lead to significant differences in predictions [
23].
The threshold-based sampling strategy randomly selects negative samples from features that fall below a specified threshold [
18]. The similarity-based sampling strategy selects negative samples based on similarity to the stable area [
24]. For instance, the geographical similarity-based sampling strategy is innovative but computationally intensive due to point-by-point calculations [
25]. Statistical analysis was used to identify non-stable samples from low-susceptibility areas. However, the method is highly affected by regional variability [
26]. Hong et al. improved the quality of training samples through reliability scoring [
27]. Huang et al. introduced the self-organizing map, an unsupervised model that does not require labeled data. The method focuses on spatial coverage and feature representativeness, conducting clustering or pattern mapping but not yielding landslide probabilities or classifications [
28]. Random sampling strategies are easier to implement than other methods. The random sampling strategy selects negative samples from areas free of geological disasters or from buffer zones surrounding a known hazard position [
29]. A typical example is the double-buffer-based sampling method, which extracts samples from non-stable locations or buffer zones around known hazard sites [
30]. However, such methods may include potential geological hazard zones in the samples [
31]. Non-stable samples are closely related to the occurrence of geological disasters in the corresponding ground surface. Surface deformation information serves as an effective indicator for locating non-stable samples associated with geological disasters [
32]. In this context, synthetic aperture radar interferometry (InSAR) offers high-precision, large-scale, and spatially continuous deformation information, thereby aiding in geological disaster identification, impact assessment, risk early warning, and disaster prevention. Combined with automatic detection methods, such as hotspot analysis (HSA), surface deformation derived from InSAR can be used to identify non-stable samples and thereby improve sample quality. The effectiveness of HSA has been validated across various scenarios. In 2012, Lu et al. [
33] applied HSA to detect slow-moving landslides with low displacement rates. In 2021, Zhu et al. [
34] employed HSA to identify clusters of anomalous deformation along the Minjiang River in Mao County. In addition, sample deficiencies may affect the model accuracy [
35]. High collinearity among conditioning factors increases sample complexity, degrades model performance, and may lead to erroneous predictions. Redundancy among conditioning factors is therefore a key consideration in model construction. In dynamic monitoring environments, developing a robust model together with an effective sampling strategy is thus crucial for studying landslide susceptibility.
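As an illustration of how hotspot analysis can flag clusters of anomalous deformation, the following minimal sketch computes the Getis-Ord Gi* statistic on a toy grid. It is not the implementation used in this study: the binary distance-band weights, the 6 x 6 grid, the deformation values, and the 1.96 cut-off (95% confidence) are all assumptions made for the example.

```python
import math

def getis_ord_gi_star(values, coords, radius):
    """Gi* z-score per point, using binary distance-band weights.

    w_ij = 1 when point j lies within `radius` of point i (j == i included),
    otherwise 0. The weighting scheme is an assumption of this sketch.
    """
    n = len(values)
    mean = sum(values) / n
    # Global standard deviation S from the Gi* definition (population form).
    s = math.sqrt(max(0.0, sum(v * v for v in values) / n - mean * mean))
    scores = []
    for xi, yi in coords:
        weights = [1.0 if math.hypot(xi - xj, yi - yj) <= radius else 0.0
                   for xj, yj in coords]
        sum_w = sum(weights)
        sum_w2 = sum(w * w for w in weights)
        local_sum = sum(w * v for w, v in zip(weights, values))
        denom = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append((local_sum - mean * sum_w) / denom if denom > 0 else 0.0)
    return scores

# Toy 6 x 6 grid of absolute deformation rates (hypothetical values, mm/yr):
# a 2 x 2 block of fast-moving pixels in one corner, slow pixels elsewhere.
coords = [(i, j) for i in range(6) for j in range(6)]
rates = [20.0 if i < 2 and j < 2 else 1.0 for i, j in coords]

z_scores = getis_ord_gi_star(rates, coords, radius=1.0)
# Pixels with z > 1.96 are treated as hot spots (assumed cut-off).
hot_spots = {c for c, z in zip(coords, z_scores) if z > 1.96}
print(sorted(hot_spots))
```

In this toy grid only the four fast-moving pixels exceed the cut-off. In a sampling workflow, such pixels would be candidates for removal from the pool of stable (negative) samples.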
The Baihetan Hydropower Station reservoir area, situated along the Jinsha River’s main channel, features steep slopes, intersecting gullies, and unevenly distributed hazardous rock masses on both banks. Such terrain is prone to landslides, debris flows, and other geological hazards [
36]. Disasters with complex underlying mechanisms pose serious risks to human life, property, and infrastructure. Given the intricate geological conditions and the presence of multiple triggering factors, the Baihetan Reservoir area serves as a suitable site for validating the proposed method.
Building on previous research and the existing challenges, this paper proposes an optimized heterogeneous ensemble learning algorithm combined with an adaptive sampling strategy based on hotspot analysis to assess landslide susceptibility. The Baihetan reservoir area is selected as the study area. To validate the effectiveness of the proposed method, a comparative analysis is conducted with eight models using five metrics.
4. Experimental Results and Analysis
A total of 26 conditioning factors were selected, including elevation, aspect, slope, contour, mean curvature, plan curvature, profile curvature, distance to road, distance to building, population density, land cover, NDVI, lithology, distance to fault, vegetation coverage, soil erosion, topographic factor, VPD, AVP, distance to Jinsha River, distance to river network, land surface temperature, visibility, coherence, deformation rate, and cumulative deformation.
4.1. Results and Analysis of Conditioning Factors
4.1.1. Collinearity Detection
High collinearity among conditioning factors increases sample complexity, degrades model performance, and may lead to erroneous predictions. Therefore, collinearity analysis should be conducted before modeling. The independence of the conditioning factors is evaluated using the Pearson correlation coefficient (PCC), tolerance (TOL), and variance inflation factor (VIF). The TOL and VIF values of the 26 conditioning factors are listed in
Table 3.
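The relationship between TOL and VIF can be made concrete with a short sketch. It relies on the standard identity that the VIF of factor j equals the j-th diagonal element of the inverse correlation matrix, with TOL = 1/VIF. The factor values below are hypothetical, and the rule of thumb mentioned in the comments (VIF > 10 or TOL < 0.1) is a common convention, not a threshold reported in this study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfectly collinear factors")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif_and_tol(factors):
    """Return {name: (VIF, TOL)}; VIF_j is the j-th diagonal of inv(corr)."""
    names = list(factors)
    corr = [[pearson(factors[a], factors[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: (inv[i][i], 1.0 / inv[i][i]) for i, name in enumerate(names)}

# Hypothetical samples: temperature falls almost linearly with elevation.
factors = {
    "elevation": [500, 800, 1200, 1500, 1900, 2300, 2600, 3000],
    "land_surface_temperature": [30.1, 28.0, 26.2, 23.9, 22.3, 19.8, 18.5, 16.0],
    "aspect": [10, 200, 90, 300, 45, 270, 150, 330],
}
result = vif_and_tol(factors)
for name, (vif, tol) in result.items():
    # A common rule of thumb flags VIF > 10 (TOL < 0.1) as problematic.
    print(f"{name}: VIF={vif:.2f}, TOL={tol:.4f}")
```

Because elevation and land surface temperature are nearly collinear in this toy data, both receive a very large VIF, which mirrors the kind of redundancy the collinearity analysis is meant to detect.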
Unlike TOL and VIF, which summarize the overall collinearity of each factor, PCC values quantify the pairwise correlation between conditioning factors, as illustrated in
Figure 8. Among the 26 conditioning factors, cumulative deformation exhibits the strongest correlation with deformation rate, with a PCC value of 0.88. Because the deformation rate alone can accurately describe the surface deformation, cumulative deformation was excluded. Land surface temperature shows strong correlations with four variables, namely elevation, VPD, AVP, and distance to the river network, with PCC values of −0.76, 0.71, 0.70, and −0.56, respectively, and was excluded. AVP is highly correlated with both elevation and VPD, with PCC values of −0.74 and 0.79, respectively, and was excluded. In addition, the correlation between slope and the topographic factor reaches 0.79. The sums of the absolute correlation coefficients between each of these two variables and the other factors are 5.47 and 5.22, respectively. Slope exhibits a higher degree of correlation with the other variables and was excluded.
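The tie-breaking rule applied to slope and the topographic factor (exclude the member of a highly correlated pair whose summed absolute PCC with the other factors is larger) can be sketched as follows. Only the value 0.79 is taken from the text. The remaining coefficients, the reduced set of four factors, and the function names are hypothetical.

```python
def symmetric_corr(pairs):
    """Build corr[a][b] from {(a, b): r}, where each pair is given once."""
    corr = {}
    for (a, b), r in pairs.items():
        corr.setdefault(a, {})[b] = r
        corr.setdefault(b, {})[a] = r
    return corr

def redundancy_score(corr, name):
    """Sum of |PCC| between `name` and every other factor.

    The partner of a correlated pair contributes the same term to both
    scores, so including it does not change which factor is dropped.
    """
    return sum(abs(r) for r in corr[name].values())

def factor_to_drop(corr, a, b):
    """Of a highly correlated pair, drop the one more correlated overall."""
    return a if redundancy_score(corr, a) >= redundancy_score(corr, b) else b

# Illustrative coefficients; only 0.79 comes from the text above.
corr = symmetric_corr({
    ("slope", "topographic_factor"): 0.79,
    ("slope", "elevation"): 0.30,
    ("slope", "ndvi"): -0.25,
    ("topographic_factor", "elevation"): 0.20,
    ("topographic_factor", "ndvi"): -0.10,
    ("elevation", "ndvi"): 0.15,
})
dropped = factor_to_drop(corr, "slope", "topographic_factor")
print(dropped, redundancy_score(corr, "slope"),
      redundancy_score(corr, "topographic_factor"))
```

With these illustrative numbers, slope has the larger score and is the factor selected for exclusion, which follows the same logic as the 5.47 versus 5.22 comparison reported above.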
4.1.2. Frequency Ratio Analysis
The relationship between the occurrence rate of geological disasters and conditioning factors is illustrated in
Figure 9. In each subplot, the red histogram represents the proportion of geological disasters within the classification intervals of the conditioning factor. The black histogram represents the proportion of the conditioning factor within each interval. The blue line represents the ratio of the disaster proportion to the factor proportion in each interval. A ratio greater than 1 indicates a stronger correlation between the category of the conditioning factor and the occurrence of geological disasters [
1]. Geological disasters are prevalent in low- to mid-elevation areas and occur less often at higher elevations. For the three curvature-related factors, disasters occur most frequently in the middle intervals. Because only a limited number of pixels carry contour values, it is difficult to identify frequency-based patterns for the contour factor. For distance to road, the frequency of disasters gradually decreases as the distance increases. Disasters occur frequently in sparsely populated areas, with occurrences reaching 70%, while densely populated areas show no disasters. The occurrence of disasters also decreases with increasing distance from faults: active fault zones destabilize nearby rock masses and promote the development of fractures, which facilitates disaster formation. Higher vegetation coverage contributes to soil and water stabilization, reducing disaster risk. Specifically, disaster occurrence is low in the 80.21∼99.48 interval, whereas it doubles in the 70.46∼80.21 interval. For VPD, the proportion of geological disasters gradually increases with the index. In terms of distance to the Jinsha River, disasters decrease as distance from the river increases, indicating that rivers also promote disaster development. Finally, for the InSAR-derived features of cumulative deformation and deformation rate, areas with low deformation rates show little disaster occurrence. Regions with significant subsidence are typically located in mining areas, which are not included in the disaster inventory.
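The frequency ratio described above is the share of disasters in a class divided by the share of pixels in that class. A minimal sketch follows, using hypothetical counts for four distance-to-fault classes. It covers only the basic ratio and does not reproduce the Monte Carlo procedure used in this study.

```python
def frequency_ratio(disaster_counts, pixel_counts):
    """Frequency ratio per class: disaster share divided by area share."""
    total_disasters = sum(disaster_counts)
    total_pixels = sum(pixel_counts)
    ratios = []
    for d, p in zip(disaster_counts, pixel_counts):
        disaster_share = d / total_disasters
        area_share = p / total_pixels
        # A class that covers no pixels has an undefined ratio.
        ratios.append(disaster_share / area_share if area_share > 0
                      else float("nan"))
    return ratios

# Hypothetical counts for four distance-to-fault classes, nearest first.
disasters = [40, 30, 20, 10]
pixels = [1000, 2000, 3000, 4000]
fr = frequency_ratio(disasters, pixels)
for label, value in zip(["class 1", "class 2", "class 3", "class 4"], fr):
    # Values above 1 mark classes where disasters are over-represented.
    print(f"{label}: FR={value:.2f}")
```

In this example the ratio falls from about 4 in the nearest class to 0.25 in the farthest, the same decreasing pattern described for distance to fault.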
4.2. Prediction of Landslide Susceptibility
After performing collinearity analysis, importance evaluation, and frequency ratio analysis, 21 factors are ultimately selected as model inputs. Two of them are derived from InSAR: coherence and deformation rate. The remaining 19 factors are elevation, aspect, slope, mean curvature, plan curvature, profile curvature, distance to road, distance to building, population density, land cover, NDVI, lithology, distance to fault, vegetation coverage, the atmospheric humidity index VPD, distance to Jinsha River, distance to river network, soil erosion, and topographic factor. Cumulative deformation is excluded because of its high correlation (PCC = 0.88) with the deformation rate, and contour is excluded because of its minimal contribution to the assessment model. The landslide susceptibility map generated by the IME-InSAR model is illustrated in
Figure 10.
Using the natural breaks method based on Monte Carlo, the degree of risk is classified into five categories: very high, high, moderate, low, and very low. The color scale ranges from red to green: the greener the pixel, the less susceptible the area is to landslides, and the redder the pixel, the more susceptible it is. Model performance is assessed by determining whether the recorded disasters fall within very-high- or high-risk zones. The prediction results obtained from the IME-InSAR model are further analyzed, with Areas S1 and S2 of the study area selected as representative examples.
The landslide susceptibility map obtained by the IME-InSAR model and the optical imagery of Area S1 are illustrated in
Figure 11. Area S1 includes Jin County in Liangshan Yi Autonomous Prefecture, Sichuan Province, and Zhaoyang District in Zhaotong City, Yunnan Province, along with other surrounding areas. These areas are separated by the Jinsha River, which is characterized by steep terrain, high mountains, and narrow valleys. On the western bank of the Jinsha River, the terrain in Jin County slopes from northwest to southeast, with the highest elevation of 4076 m at Shizi Mountain in the north, and the lowest elevation of 430 m in Ludian County. Most areas lie above 2100 m and comprise eight types of landforms, such as flat plains, low hills, and mountain plateaus. Area S1 is characterized by widespread fault zones and a complex geological structure. Rainfall is the main conditioning factor. In addition, human activities have disturbed surface vegetation, leading to the occurrence of geohazards, particularly in the rainy season. On the eastern side, Zhaotong City is characterized by a higher western region and a lower eastern region, with a plateau landform, numerous ravines, and frequent seismic activity. Due to the abundant rainfall during the rainy season, Area S1 has become one of the regions severely affected by geological hazards in the upper Yangtze River Basin. Disasters in Area S1 are mainly distributed along the flow direction of the Jinsha River. A total of 153 geological disasters are recorded, of which 140 (about 91.5%) are located in very-high-risk zones. Most unidentified geological disasters are distributed far from the Jinsha River, and some lie in flat terrain that receives little rainfall and has a high population density; such conditions are beyond the capacity of the model.
Figure 12 presents the landslide susceptibility map for Area S2, obtained by the IME-InSAR model, along with its optical imagery. Area S2 is located in the southwest of the study area, within Ningnan County, Liangshan Yi Autonomous Prefecture, Sichuan Province. The area is rich in water resources, including 15 rivers, such as the Heishui River and Jinsha River. It boasts abundant mineral resources, including iron, lead, copper, and limestone. Precipitation is mainly concentrated between June and October.
Figure 12 illustrates the distribution of five fault zones across the area, including the Muhe Fault Zone and the Ningnan–Huili Fault Zone. Due to intense tectonic activity and frequent seismic events, the rocks of the area are loose. Furthermore, activities such as mining and the development of the Baihetan Hydroelectric Station have made the region susceptible to disasters. A total of 61 geological disasters are recorded in Area S2, all of which are situated in very-high-risk areas, corresponding to an accuracy of 100%. Among these, the landslide at Luojia Slope, located along the fault zone in the northwest of the study area, was successfully predicted. Disasters in the central area are concentrated, forming a high-susceptibility geohazard-prone zone that requires widespread attention.
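The zone-based check used for Areas S1 and S2 amounts to counting the recorded disasters that fall in the target susceptibility classes. The sketch below reproduces the two reported proportions. The function name is hypothetical, and because the classes of the 13 S1 disasters outside the very-high-risk zone are not reported, they carry a placeholder label.

```python
from collections import Counter

def zone_hit_rate(zone_labels, target_zones=("very high",)):
    """Share of recorded disasters whose pixel falls in the target zones."""
    counts = Counter(zone_labels)
    hits = sum(counts[zone] for zone in target_zones)
    return hits / len(zone_labels)

# Area S1: 140 of 153 disasters lie in very-high-risk zones.
# "other" is a placeholder; the actual classes are not reported.
area_s1 = ["very high"] * 140 + ["other"] * 13
# Area S2: all 61 disasters lie in very-high-risk zones.
area_s2 = ["very high"] * 61

rate_s1 = zone_hit_rate(area_s1)
rate_s2 = zone_hit_rate(area_s2)
print(f"S1: {rate_s1:.1%}, S2: {rate_s2:.1%}")
```

Passing `target_zones=("very high", "high")` would give the combined very-high- and high-risk proportion once the per-disaster classes are available.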
6. Conclusions
Landslide susceptibility assessment has demonstrated effectiveness across various scenarios. The generated predictions can support geological hazard identification and the analysis of evolution mechanisms. Continued research on landslide susceptibility assessment is therefore important.
In this study, we propose an optimized heterogeneous ensemble learning framework for landslide susceptibility assessment, combined with an adaptive sampling strategy. Considering the unreliability of traditional sampling methods, the proposed sampling strategy incorporates InSAR-derived deformation information and performs hotspot analysis using the Getis–Ord statistic to refine the stable samples. The stacking method employs a two-layer structure that uses RF, SVM, and XGBoost as base learners, with the optimal parameters obtained through a grid search strategy. InSAR-derived information, including deformation rate, cumulative deformation, coherence, and visibility, is incorporated into the model to enhance its robustness.
The middle and lower reaches of the Jinsha River were selected as the study area. First, a total of 26 conditioning factors were selected. To reduce factor complexity, collinearity analysis and importance evaluation based on the Gini index were complemented by a Monte Carlo-based frequency ratio analysis, which reveals the relationships between conditioning factors and geological hazards. The proposed sampling strategy was then employed to generate high-quality samples. Finally, by coupling InSAR-derived information, an improved meta-ensemble (IME) stacking-based heterogeneous framework integrating SVM, RF, and XGBoost was used to assess landslide susceptibility. Eight models, namely RF, RF-InSAR, SVM, SVM-InSAR, XGBoost, XGBoost-InSAR, IME, and IME-InSAR, were employed for comparative analysis. Model performance was evaluated using accuracy, precision, recall, F1-score, and AUC. Among all models, the proposed IME model achieved the best performance, with a maximum accuracy of 0.976. All models achieved AUC values greater than 0.95, among which the IME model attained the highest value of 0.97772. Although the IME metrics were nominally the highest, the inclusion of InSAR information did not lead to a statistically significant improvement in the predictions compared with the high-quality basic sampling strategy. Overall, the proposed algorithm provides support for landslide susceptibility assessment in dynamic environments. The main findings of this study can be summarized as follows:
- (1) The adaptive sampling strategy proposed in this study combines InSAR-derived deformation information with hotspot analysis to automatically identify potentially non-stable samples. Compared with traditional methods, this strategy improved both the quantity and quality of the training samples. After optimization, the variance of the stable samples, computed on absolute values, declined from 5.11 to 4.15. Furthermore, the robustness of the proposed sampling strategy is demonstrated by the fact that all models achieved accuracies exceeding 0.932.
- (2) The contribution of InSAR-derived deformation information to the predictive performance of landslide susceptibility models was examined. To assess its impact, two experimental settings were designed, with and without InSAR-derived conditioning factors. The experimental results indicated that InSAR-derived information had only a limited effect on the IME models. From another perspective, this highlights the capability of the hotspot analysis method to identify non-stable samples. In addition, the introduction of the adaptive sampling strategy reduced the complexity of conditioning factors and enhanced the robustness of the predictive models.
- (3) Improved heterogeneous ensemble algorithms, which combine diverse classifiers to extract features, demonstrated superior performance. Landslide susceptibility maps generated by the eight models were compared to evaluate their performance. The prediction results were further validated against optical imagery and field surveys, with Areas S1 and S2 selected as representative examples. In Area S2, all 61 landslide-prone locations were identified within very-high-risk zones, and the concentrated disasters there have formed a high-susceptibility geohazard zone that warrants significant attention.