Insights from Optimized Non-Landslide Sampling and SHAP Explainability for Landslide Susceptibility Prediction

Mengyuan Li; Hongling Tian

doi:10.3390/app15031163

¹ State Key Laboratory of Mountain Hazards and Engineering Resilience, Chengdu 610299, China

² Institute of Mountain Hazards and Environment, Chinese Academy of Sciences, Chengdu 610299, China

³ University of Chinese Academy of Sciences, Beijing 101408, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(3), 1163; https://doi.org/10.3390/app15031163

This article belongs to the Section Earth Sciences


Abstract

The quality of sampling data critically influences landslide susceptibility prediction accuracy. Current studies commonly use a 1:1 ratio of landslide to non-landslide samples, failing to reflect natural geographical variability. This study develops a region-specific framework by integrating SHAP (SHapley Additive exPlanation) analysis with twelve landslide conditioning factors (LCFs) and three progressive sampling strategies, aiming to create adaptive non-landslide point selection criteria tailored to unique environmental and geological characteristics. The strategies include (1) multi-ratio random sampling (1:1 to 1:200), (2) susceptibility-based sampling adjustments derived from pre-susceptibility analysis, and (3) LCF-based correction using the NDVI threshold identified through SHAP analysis. Results show that LCF-based correction achieved the highest performance, while a 1:5 ratio proved optimal in random sampling, aligning with regional characteristics. This framework demonstrates the importance of region-specific sampling strategies in improving landslide susceptibility prediction.

Keywords: landslide susceptibility mapping; landslide susceptibility prediction; machine learning; non-landslide sampling strategy; SHAP

1. Introduction

As highly destructive natural disasters, landslides often cause severe damage to vegetation, infrastructure, and human life and property [1]. Because landslides occur suddenly and timely warnings are difficult to issue, they are considered to be among the most threatening natural disasters in mountainous areas [2]. A landslide is the downward movement of a large mass of rock, debris, or soil along a slope under the influence of gravity [3]. Landslide susceptibility prediction (LSP) is a core tool for assessing landslide risk and formulating disaster prevention and mitigation measures [4]. LSP evaluates the likelihood of landslides in a specific area based on terrain, environmental features, and other conditions [5]. By integrating various conditioning factors such as topography, geology, and hydrology, LSP studies can effectively identify high-susceptibility areas, providing scientific guidance for land use planning and disaster prevention [6].

Neuland was one of the first to apply statistical methods to explore the relationships among morphometric measurements, geomechanics, rock characteristics, structural features, and the stability of 250 slopes in Western Germany [7]. Several methods and approaches have been proposed to assess landslide susceptibility, including but not limited to geomorphological mapping, landslide inventory analysis, heuristic terrain zoning, physically based numerical modeling, and statistical classification techniques [8]. Geomorphological mapping relies on expert evaluation to identify and map slope instability conditions, with the quality depending on the investigator’s expertise and the study area’s complexity [9]. Landslide inventory analysis predicts future spatial occurrences based on past and present landslide distributions, often using landslide density maps. Heuristic approaches rank and weight instability factors based on assumed importance; however, their quality relies on the investigator’s understanding of the causes [10]. Physically based landslide susceptibility assessment methods rely on the numerical modeling of slope failure processes [6]. They are often built on the infinite slope model, which assumes a planar rupture surface parallel to the topographic surface, usually corresponding to the bedrock/hillslope deposit discontinuity [11]. Statistical methods explore the relationship between instability factors and landslide distribution, often employing regression or probabilistic models [12]. These traditional methods have been enhanced by advancements in GIS [13], remote sensing, and computational power, with machine learning now offering a powerful tool for more accurate and scalable landslide susceptibility modeling.

With advancements in remote sensing and computational power, more researchers are adopting machine learning methods for LSP [14]. A literature review indicates that around 15.8% of LSP studies utilize machine learning as the primary modeling approach [15]. Widely used algorithms include decision trees, support vector machines (SVMs), neural networks, and Random Forest. Each of these machine learning methods has unique advantages in LSP studies. Decision tree (DT) models are the most used in LSP studies. However, DT models struggle with complex non-linear relationships and are prone to overfitting. Moreover, slight variations in input data can significantly affect the structure of the DT, making it sensitive to data fluctuations [16,17]. Support vector machines (SVMs) offer slightly higher accuracy in shallow landslide assessments than other algorithms [18]. However, SVM models have high computational complexity and require careful parameter tuning. Neural networks (NNs) and deep learning (DL) excel at automatic feature extraction and are particularly effective in processing large datasets. However, both methods often encounter generalization issues, with underfitting and overfitting negatively impacting model performance [19]. Random Forest (RF) enhances model stability and predictive accuracy by aggregating multiple decision trees. RF has proven highly reliable for LSP studies [20].

Despite the strong performance of machine learning models in LSP studies, data preprocessing and quality control account for approximately 80% of the modeling effort [19]. Data acquisition and processing accuracy directly affect model performance, emphasizing the importance of high-quality input data [21]. Most machine learning models used in LSP studies are classification algorithms, which require both positive and negative samples for training. Positive samples represent areas where landslides have occurred, while negative samples represent non-landslide-prone areas. In LSP studies, landslide (positive) samples are typically acquired from various data sources, including historical landslide records, remote sensing image interpretation, and field investigations. The quality of these data varies significantly: some sources accurately reflect landslide occurrences and yield high-quality landslide inventories, while others do not. Even so, the selection of positive samples is relatively standardized and reasonably reliable.

On the other hand, there is no unified standard for selecting non-landslide samples (negative samples) [22,23]. Current methods include random sampling in landslide-free areas, low-slope sampling based on topographic features, or the selection of negative samples outside the buffer zones of landslide boundaries. Generally, these methods often rely on subjective judgments or specific geographical factors and cannot entirely avoid the risk of mistakenly selecting potential landslide areas as negative samples. One of the challenges remaining is ensuring that the input data sufficiently represent the geographical characteristics of both landslide and non-landslide areas [19]. This uncertainty in sample selection may affect the accuracy of the final output. As a result, determining how to scientifically and reasonably select non-landslide samples has become a key issue in current research.

Traditional studies often adopted a 1:1 ratio of landslide to non-landslide samples. However, this balanced sampling strategy does not fully reflect actual field conditions [14], leading to poor model performance in real cases. The sampling ratio between landslide and non-landslide samples is critical to prediction accuracy during training and testing [24]. To address this issue, some researchers proposed various strategies for dealing with imbalanced data in recent years, including oversampling and adaptive sampling methods. Although these methods somewhat alleviate the data imbalance problem, they still face the risks of information loss or overfitting. Therefore, optimizing the ratio of positive to negative samples, combined with a reasonable method for selecting non-landslide samples, remains a challenge in improving model performance.

This study proposes three strategies for generating non-landslide samples to improve the appropriateness of sample selection under various topographic conditions. After comparing these strategies, a more reasonable sample selection approach is identified. This method offers a systematic way to select non-landslide samples by considering each study area’s unique environmental and geological characteristics.

2. Study Area and Data Sources

2.1. Study Area

The study area, located in southern Sichuan Province, China, includes four counties (Ebian Yi Autonomous County, Emeishan City, Ganluo County, and Jinkouhe District), covering approximately 1667 km² (Figure 1). The area lies in a transition zone between the Tibetan Plateau to the west, the Yunnan Plateau to the south, and the Sichuan Basin to the northeast. The Dadu River from the Tibetan Plateau crosses this area and flows to the northwest, separating the border minorities from the inner majorities. Mount Emei, in the northeast, is a famous national scenic spot. The geologic structures are complex, and the elevation in this region ranges from 363 m in the northeastern basin to 4242 m in the southern mountains.

Figure 1. Location map and landslide inventory of study area.

The region experiences a subtropical humid monsoon climate, with rainfall concentrated in the summer months, particularly in July and August. Due to the significant topographic variation, the annual precipitation varies greatly, ranging from approximately 470 mm in the valleys to 1922 mm at higher elevations, such as Mount Emei. The mountains block humid air flows and force condensation, which results in heavy rainfall events.

A total of 589 landslides are recorded in the official inventory. Many large-scale landslides have occurred in this area and caused significant losses and casualties. Among them, the Liziyida gully debris flow is well known. On 9 July 1981, it killed over 240 people and disrupted the central railway from Chengdu to Kunming, one of the biggest catastrophes in Chinese railway history [25].

We have conducted multiple studies in the research area with abundant landslide data. A total of 12 conditioning factors for LSP were selected in this study, including geological, topographical, and vegetation coverage data.

2.2. Landslide Inventory

The quality of landslide inventories significantly affects the results of LSP [26]. These inventories record historical landslide events [27], providing details that include landslide type, occurrence date, size, and slope movement features [28]. Field investigations, as the primary source of historical inventories, offer significant advantages in reliability, as direct on-site observations minimize misinterpretation. Furthermore, this approach allows for a focused analysis of landslide susceptibility in areas with vulnerable elements, such as human populations and infrastructure, enhancing the efficiency of disaster risk assessments. This landslide inventory was obtained from the China Geological Survey of the Ministry of Natural Resources. We set up a dataset that was updated in 2021 and included 589 landslides within the research area. Figure 1 illustrates the spatial distribution of landslides. These landslides mainly occur in high-altitude mountainous areas with distinct spatial clustering, particularly with significant topographic variation in the northwest and northeast valley area. Most landslides are small- to medium-sized, with slopes between 10° and 40°.

2.3. Landslide Conditioning Factors

Because landslide conditioning factors (LCFs) vary across regions, there is no consensus on which factors are most essential. Hence, we followed these principles in the selection of LCFs.

1. Inclusion of Multiple Factors

We included a wide range of LCFs, covering topographic (altitude, slope, aspect, etc.), hydrological (distance from streams and TWI), geological (lithology and fault density), land cover (NDVI), and human activity factors (distance from roads). Prior research has supported these factors. All data sources for the selected factors are listed in Table 1.

Table 1. Information on landslide conditioning factors.

2. Data Accessibility

Although rainfall is a crucial triggering factor for landslides, it was excluded from this study, as it represents a temporal trigger rather than a permanent conditioning factor. This decision focuses on stable LCFs that influence landslide susceptibility, ensuring the model’s reliability and consistency.

3. Multicollinearity Assessment

The variance inflation factor (VIF) was used to assess multicollinearity. The VIF for a factor j is calculated by using the following formula:

$$\mathrm{VIF}_{j} = \frac{1}{1 - R_{j}^{2}} \qquad (1)$$

where $R_{j}^{2}$ is the coefficient of determination of factor j regressed on the other factors. A VIF value greater than 10 indicates potential multicollinearity among the factors [29]. Factors with a VIF greater than 10 were excluded to ensure model stability. The final selection of the 12 LCFs is shown in Figure 2. These factors were rasterized and resampled into uniform 30 m × 30 m layers. After that, the LCF values for both landslide and non-landslide samples were obtained via the “Extract Multi Values to Points” tool in ESRI ArcGIS. For categorical factors, such as lithology, soil types, distance from roads, and distance from rivers, categorical encoding was applied. Other continuous factors were retained in their original form.
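The VIF screening in Equation (1) can be illustrated with a minimal sketch. It uses only the Python standard library and fits each regression by ordinary least squares through the normal equations. The function names, the dictionary layout, and the toy values are assumptions made for illustration; they are not the authors' implementation or data.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + predictors  # columns of the design matrix
    xtx = [[sum(ci[t] * cj[t] for t in range(n)) for cj in cols] for ci in cols]
    xty = [sum(ci[t] * y[t] for t in range(n)) for ci in cols]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * c[t] for b, c in zip(beta, cols)) for t in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y)  # assumes y is not constant
    return 1.0 - ss_res / ss_tot


def vif(factors):
    """Map each factor name to VIF_j = 1 / (1 - R_j^2), as in Equation (1)."""
    result = {}
    for name, values in factors.items():
        others = [v for k, v in factors.items() if k != name]
        r2 = r_squared(values, others)
        result[name] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return result


# Toy values only (hypothetical, not taken from the study area).
independent = vif({"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, -1.0, -1.0, 1.0]})
collinear = vif({"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [2.1, 3.9, 6.2, 7.8, 10.1]})
print(independent, collinear)
```

With the threshold used in this study, a factor would be dropped whenever its value in the returned dictionary exceeds 10.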

Figure 2. Landslide conditioning factors: (a) DEM, (b) slope degree, (c) slope aspect, (d) plan curvature, (e) TWI, (f) profile curvature, (g) NDVI, (h) soil types, (i) distance from rivers, (j) distance from roads, (k) lithology, and (l) fault density.

2.3.1. Topographic Factors

Topographic factors include altitude, slope degree, slope aspect, Topographic Wetness Index (TWI), plan curvature, and profile curvature. These factors are all calculated based on digital elevation model (DEM) data with a 30 m resolution. The DEM data used in this study are SRTM data, obtained from NASA’s Earthdata Search platform (https://search.earthdata.nasa.gov, accessed in June 2024). The spatial analysis and extraction of topographic factors from the DEM data were performed by using ArcGIS.

The gravitational force on a slope generates a downward pressure on slope materials, making it a critical triggering factor influencing landslide susceptibility [30,31]. As shown in Figure 2b, the slope degree in the study area ranges from 0° to 79.54°. The influence of the slope aspect on landslides has shown contrasting results, as it indirectly affects vital attributes such as moisture retention, vegetation cover type, and vegetation distribution [32] (Figure 2c).

The Topographic Wetness Index (TWI) quantitatively describes the distribution of soil moisture. A higher TWI value means more soil moisture, making the soil more likely to become saturated and produce runoff, which increases the possibility of landslides [33]. The TWI ranges from 1.66 to 30.10 (Figure 2e) and is calculated by using the following formula:

$$\mathrm{TWI} = \ln\left(\frac{A}{\tan(\mathrm{Slope})}\right) \qquad (2)$$

where A represents the catchment area and the slope is measured in radians [34]. Plan curvature and profile curvature are necessary for describing terrain shape and modeling flow patterns and velocity, both critical in building landslide susceptibility mapping. Plan curvature (Figure 2d) refers to the curvature of contour lines, indicating whether the terrain is concave or convex, influencing water’s convergence or dispersion. Profile curvature (Figure 2f) measures the rate of slope variation, indicating whether the terrain is steep or gentle, affecting water flow velocity and erosion dynamics [35].
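As a small worked illustration of Equation (2), the sketch below evaluates the TWI for a single cell. The function name, the parameter names, and the `min_tan` guard for flat cells (where tan(Slope) approaches zero) are assumptions added for illustration; the study itself derived the TWI in ArcGIS.

```python
import math


def topographic_wetness_index(catchment_area, slope_rad, min_tan=1e-6):
    """TWI = ln(A / tan(slope)), with the slope given in radians.

    min_tan is a hypothetical guard that avoids division by zero on flat
    cells; it is not part of Equation (2).
    """
    return math.log(catchment_area / max(math.tan(slope_rad), min_tan))


# Hypothetical cell: catchment area of 100 units on a 45-degree slope.
example_twi = topographic_wetness_index(100.0, math.radians(45.0))
# The same catchment area on a gentler 5-degree slope gives a larger TWI.
gentle_twi = topographic_wetness_index(100.0, math.radians(5.0))
print(example_twi, gentle_twi)
```

For a fixed catchment area, gentler slopes produce higher TWI values, which matches the interpretation that such cells saturate more readily.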

2.3.2. Environmental Factors

The Normalized Difference Vegetation Index (NDVI), shown in Figure 2g, represents vegetation cover density closely related to landslide susceptibility [36]. Higher NDVI values usually indicate better vegetation cover, where plant roots help stabilize the soil and reduce surface erosion, lowering the likelihood of shallow landslides [37]. Soil types (Figure 2h) also significantly influence landslide occurrence, as different soils have distinct properties, such as particle size, porosity, permeability, and cohesion. Soil types further affect vegetation growth, indirectly impacting slope stability.

Distance from rivers (Figure 2i) and distance from roads (Figure 2j) are also critical factors. Slopes closer to rivers are more vulnerable to failure due to river erosion, which removes slope material and increases the frequency of landslides [38]. Similarly, areas closer to roads often exhibit higher susceptibility to landslides, although this relationship can vary depending on local terrain and geological conditions [39].

2.3.3. Geological Factors

Lithology is a key factor influencing landslide occurrence. Lithological units such as clayey and marly sediments exhibit strong positive contrast values, indicating a higher likelihood of landslides, while hard rocks like sandstones and metamorphic rocks show negative values, suggesting that they contribute to slope stability [40]. Different lithologies have distinct physical and mechanical properties, such as strength, weathering rate, permeability, and shear strength. These characteristics directly influence slope stability [41]. Lithology (Figure 2k) also affects groundwater flow and storage, increasing pore water pressure and reducing effective stress—both critical factors triggering landslides.

Fault density impacts surface structure and permeability, influencing landslide susceptibility. It is an essential indicator of geological complexity and plays a significant role in the likelihood of landslides [42]. Fault density can be calculated by using the line density analysis tool in ArcGIS (Figure 2l).

3. Methodology

This study is divided into four stages (Figure 3). First, historical landslide data and conditioning factors are collected. Next, VIF analysis, Random Forest, SHAP, and frequency ratio are used to build an initial LSP model. In the third stage, three strategies are applied to refine non-landslide sample selection. Finally, the AUC-ROC and landslide susceptibility index (LSI) metrics are used to evaluate the model’s performance.

Figure 3. Research workflow.

3.1. Three Non-Landslide Sampling Methods

This study investigates the influence of three distinct strategies for generating non-landslide samples on the performance of LSP. The three scenarios are as follows.

Scenario 1: Multi-ratio random sampling produced six datasets with varying ratios of landslide to non-landslide samples, ranging from 1:1 to 1:200.

Scenario 2: The susceptibility-based ratio adjustment method was applied, with the area proportions of low-susceptibility and high-susceptibility zones derived from the initial LSP results. Based on these initial results, the ratio of non-landslide to landslide samples was adjusted to reflect the varying susceptibility levels within the study area.

Scenario 3: The LCF-based correction method filtered out non-landslide samples based on thresholds selected according to the specific characteristics of the study area. In our study, non-landslide samples with NDVI values below 0.8 were filtered out and replaced with samples from areas where the NDVI exceeded 0.8. The NDVI was chosen because it had a dominant influence in this study area. SHAP analysis identified NDVI = 0.8 as a critical threshold, and FR analysis supported this choice, showing landslide frequency dropping to 2% in the NDVI range of 0.86–0.9.
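Scenarios 1 and 3 can be sketched together as one sampling routine. This is a minimal illustration, not the authors' code: the function name, the dictionary layout of the candidate cells, and the synthetic NDVI values are all hypothetical. It also simplifies Scenario 3 by drawing directly from candidates with an NDVI above the threshold, rather than drawing first and then replacing rejected samples.

```python
import random


def sample_non_landslides(candidates, n_landslides, ratio, min_ndvi=None, seed=0):
    """Draw n_landslides * ratio non-landslide cells at random.

    candidates: list of dicts describing landslide-free cells (hypothetical layout).
    ratio:      non-landslide samples per landslide sample (5 means 1:5).
    min_ndvi:   if given, only cells with NDVI above this value are eligible,
                which mimics the Scenario 3 correction.
    """
    if min_ndvi is None:
        pool = candidates
    else:
        pool = [c for c in candidates if c["ndvi"] > min_ndvi]
    k = n_landslides * ratio
    if k > len(pool):
        raise ValueError("not enough candidate cells for the requested ratio")
    return random.Random(seed).sample(pool, k)


# Synthetic candidate cells with evenly spread NDVI values (illustration only).
cells = [{"id": i, "ndvi": i / 1000} for i in range(1000)]

scenario1 = sample_non_landslides(cells, n_landslides=10, ratio=5)
scenario3 = sample_non_landslides(cells, n_landslides=10, ratio=5, min_ndvi=0.8)
print(len(scenario1), len(scenario3))
```

Fixing the seed keeps the draws reproducible, which helps when comparing datasets built at different ratios.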

3.2. Initial LSP Modeling—Random Forest

Random Forest (RF) was selected due to its proven effectiveness in LSP studies [18,20,43]. As the most widely used method in LSP research (approximately 49% [18]), RF serves as an ideal baseline model for investigating the fundamental issue of non-landslide sampling strategies. Using this commonly accepted model allows us to focus on the sampling effects while addressing a universal challenge in LSP modeling.

3.3. SHAP (SHapley Additive exPlanation)

SHAP (SHapley Additive exPlanation) is a model-agnostic explanation method based on cooperative game theory. It attributes the output of a machine learning model to individual features by using Shapley values, ensuring both consistency and local accuracy [44]. SHAP provides both global and local explanations for model interpretation in LSP. Global explanations offer consistent and accurate attribution values for each feature (e.g., environmental or geological factors), illustrating how they influence the model’s overall predictions. Local explanations demonstrate how specific features contribute to the prediction for an individual landslide event by analyzing specific input data [45,46,47].

Generally, the AUC is the most commonly used metric for evaluating model performance [48], but relying solely on the AUC may not sufficiently capture the spatial characteristics of the model’s predictions [49]. The AUC is based on the relationship between observed and predicted values, but high prediction accuracy does not necessarily imply good interpretability [50]. For instance, Elith demonstrated that models with similar or identical AUC values can produce markedly different spatial prediction patterns [51]. To address these limitations, this study incorporates the SHAP method, which provides two main advantages [44]:

Improved interpretability: SHAP offers clear insights into feature contributions, helping to understand the decision-making process within “black-box” models. By quantifying the contribution of each feature to the model’s predictions, SHAP enhances both interpretability and trust in the model.

Enhanced transparency: By calculating the Shapley values associated with each input feature, SHAP explains how individual features influence LSP results, improving the model’s performance and interpretability.

SHAP quantifies the contribution of each feature to the model’s output, representing the model prediction as the sum of the Shapley values for each input feature:

$$g(x') = \phi_{0} + \sum_{j=1}^{M} \phi_{j} \qquad (3)$$

where $g(x')$ denotes the model output, $\phi_{0}$ is a constant explaining the model (the base value), $M$ is the number of input features, and $\phi_{j}$ denotes the Shapley value assigned to feature $j$.
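The additive form in Equation (3) can be checked on a toy example by computing exact Shapley values through brute-force enumeration of feature coalitions. This sketch is an illustration under stated assumptions: the toy model, its coefficients, and the single all-zero baseline are hypothetical, and "missing" features are simply replaced by their baseline value. SHAP implementations for tree models use more efficient algorithms and a background dataset rather than one baseline point.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Return (phi_0, [phi_1, ..., phi_M]) for one input x.

    Features outside a coalition take their baseline value (a simplification).
    """
    m = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(m)]
        return model(z)

    phis = []
    for j in range(m):
        others = [i for i in range(m) if i != j]
        total = 0.0
        for size in range(m):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {j}) - value(s))
        phis.append(total)
    return value(set()), phis


def toy_model(z):
    """Hypothetical score with one interaction term (not the paper's RF model)."""
    return 0.5 * z[0] - 0.8 * z[1] + 0.3 * z[0] * z[2]


x = [1.0, 0.9, 2.0]
phi0, phi = shapley_values(toy_model, x, baseline=[0.0, 0.0, 0.0])
# Additivity: the base value plus all Shapley values recovers the prediction.
reconstructed = phi0 + sum(phi)
print(phi0, phi, reconstructed, toy_model(x))
```

The enumeration grows exponentially with the number of features, so it is only practical for small examples like this one.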

3.4. Model Evaluation

3.4.1. Frequency Ratio Analysis

The frequency ratio (FR) method was applied to evaluate the relationship between LCFs and historical landslides. Accurate landslide susceptibility mapping (LSM) depends critically on selecting appropriate LCFs [15,52]. The FR method was selected for its effectiveness in quantifying the correlation between LCFs and landslide occurrence. A higher FR value suggests a more pronounced positive correlation between the factor category and landslide occurrence, whereas a lower FR value implies a weaker correlation [33]. It provides a foundational analysis for subsequent machine learning models [53]. The FR analysis results served as preparatory work for Scenario 3. The expression for the frequency ratio is as follows:

$$\text{Frequency ratio} = \frac{L_{j} / L}{S_{j} / S} \qquad (4)$$

Here, $L_{j}$ represents the number of landslides occurring within a specific interval, while $L$ denotes the total number of landslides across the entire study area. Similarly, $S_{j}$ corresponds to the number of grid cells within the interval, and $S$ refers to the total grid cells in the study area.
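Equation (4) reduces to a ratio of two proportions, as the short sketch below shows. The function name and the counts in the example are hypothetical; only the total of 589 landslides comes from the inventory described earlier.

```python
def frequency_ratio(landslides_in_class, total_landslides, cells_in_class, total_cells):
    """FR = (L_j / L) / (S_j / S), following Equation (4)."""
    landslide_share = landslides_in_class / total_landslides
    area_share = cells_in_class / total_cells
    return landslide_share / area_share


# Hypothetical class holding 60 of 589 landslides on 5% of the grid cells.
fr_example = frequency_ratio(60, 589, 50_000, 1_000_000)
# A class whose landslide share equals its area share has FR = 1.
fr_neutral = frequency_ratio(10, 100, 1_000, 10_000)
print(fr_example, fr_neutral)
```

Values above 1 mean that a class holds a larger share of landslides than of area, and values below 1 mean the opposite.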

3.4.2. Receiver Operating Characteristic Curve

The Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) was used to measure the models’ performance [54]. The AUC is widely recognized as an effective measure for evaluating classification performance, as it quantifies how well the model distinguishes between positive and negative classes [55]. AUC values range from 0 to 1, with values greater than 0.8 generally indicating strong model performance [56].
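One way to see what the AUC measures is its rank-based interpretation: it equals the probability that a randomly chosen landslide sample receives a higher predicted score than a randomly chosen non-landslide sample, with ties counted as one half. The sketch below computes it by comparing all pairs. The function name and the toy labels and scores are assumptions for illustration; the study does not state which implementation it used.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = landslide, 0 = non-landslide (hypothetical scores).
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(toy_auc)
```

Because this definition depends only on how positives rank against negatives, the AUC can be compared across datasets with different landslide-to-non-landslide ratios.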

4. Analysis

4.1. Relationship Between LCFs and Historical Landslides

According to the FR analysis (Table 2), the factors influence the occurrence of landslides to varying degrees. Vegetation coverage (NDVI) exhibited the highest FR value (3.70) within the range of 0.32 to 0.64. Altitude also showed a significant impact: the 2196 to 2820 m range had the maximum FR value of 2.37. Among the topographic factors, the slope angle (31.9° to 42.23°) had an FR value of 1.64, while the FR value for areas within 0 to 500 m from roads reached as high as 2.34. Other factors, such as the Topographic Wetness Index (TWI), Stream Power Index (SPI), and curvature, also displayed varying degrees of influence.

Table 2. Frequency ratios for each class of landslide conditioning factors.

4.2. Initial LSP Results

4.2.1. Initial Landslide Susceptibility Results

The RF model was used for initial LSP modeling in the study area, with the data distribution shown in Table 3. The Natural Breaks method categorized the results into five levels: very high, high, medium, low, and very low. A landslide susceptibility map was generated, as shown in Figure 4. The proportions of different susceptibility levels are very low (38.28%), low (23.19%), medium (15.43%), high (14.04%), and very high (9.06%). Among these, the low-susceptibility areas (very low and low levels) account for most of the study area, comprising over 60%.

Table 3. Distribution of landslide and non-landslide samples in initial LSP by RF model.

Figure 4. Initial LSP by RF.

Table 4 shows that the very high susceptibility level has an FR value of 8.37, much higher than the other levels, suggesting that the LSP results from the initial RF model closely match the actual landslide occurrence patterns. The model achieved an AUC of 0.897, which meets the LSP accuracy requirements but could be further improved by adjusting non-landslide sampling ratios.

Table 4. Frequency ratio classification of initial LSP by RF model.

4.2.2. SHAP Value Analysis

In Figure 5, SHAP values are displayed along the horizontal axis, where positive values indicate increased landslide probability and negative values suggest decreased probability. The color gradient represents the feature values, with red indicating high values and blue indicating low values. Figure 5 reveals that the NDVI, altitude, distance from roads, and slope degree are the most influential factors, consistent with the ranking in Table 2, while profile curvature, fault density, lithology, soil types, and plan curvature show relatively smaller impacts.

Figure 5. SHAP values.

Among the high-impact factors, the NDVI demonstrates the strongest influence. Higher NDVI values (shown in red, indicating dense vegetation) consistently correspond to negative SHAP values, indicating that dense vegetation significantly reduces landslide probability. Altitude shows complex effects, with lower elevations (blue in feature values) being generally associated with increased landslide likelihood, while higher elevations (red in feature values) correlate with decreased probability. For distance from roads, areas closer to roads (blue in feature values) show positive SHAP values, suggesting higher landslide susceptibility, potentially due to slope destabilization from construction activities. Slope degree exhibits a clear pattern where steeper slopes (red in feature values) strongly correlate with positive SHAP values, confirming that increased slope angles enhance landslide probability.

The lower-impact factors show more subtle patterns. Profile curvature and plan curvature display moderate effects, with certain curvature patterns corresponding to slightly increased landslide probability. Fault density shows minimal variation in its impact, while specific soil types and lithological characteristics demonstrate minor but detectable influences on landslide susceptibility.

Further analysis of the SHAP results for the NDVI and altitude (Figure 6a,b) reveals complex non-linear relationships with landslide susceptibility. For the NDVI (Figure 6a), a critical threshold of 0.8 was identified, where SHAP values exhibit a distinct transition pattern. SHAP values remain predominantly positive when the NDVI is below 0.8, indicating higher landslide susceptibility in areas with lower vegetation coverage. When the NDVI exceeds 0.8, the SHAP values rapidly shift to negative and stabilize, demonstrating significantly reduced landslide susceptibility. This threshold aligns with the FR analysis (Table 2), where areas with an NDVI > 0.8 account for only 13% of the relative landslide frequency, indicating a significantly lower landslide occurrence in well-vegetated areas. This NDVI threshold serves as a criterion for non-landslide sample selection in Scenario 3.

Figure 6. SHAP scatter plots for (a) NDVI and (b) altitude.

The SHAP analysis plot for altitude (Figure 6b) displays a pronounced U-shaped pattern. Areas at lower altitude (0–500 m) show high positive SHAP values (1.0–1.8), possibly attributable to intensive human activities and road construction, as evidenced by the high importance of distance from roads in the SHAP analysis. This influence decreases sharply with the increase in altitude, transitioning to negative values around 700 m. Mid-altitude zones (1000–1500 m) maintain negative SHAP values between −0.5 and 0, while higher-altitude areas (>1500 m) exhibit consistent but moderate negative effects (−0.3 to −0.2). This complex pattern suggests a combined influence of anthropogenic activities, vegetation coverage, and topographic characteristics across altitude gradients.

4.3. Scenario 1

4.3.1. The Landslide Susceptibility Results for Scenario 1

We generated multiple datasets with different ratios of landslide to non-landslide samples (1:1, 1:5, 1:50, 1:100, 1:150, and 1:200), as shown in Table 5. The LSP results were then classified into five susceptibility levels (very high, high, medium, low, and very low) to generate LSMs (Figure 7). The results demonstrate that sample ratio variations significantly impact landslide susceptibility zoning. As the proportion of non-landslide samples increases, high and very high susceptibility zones gradually decrease, while low- and very low-susceptibility zones increase accordingly. This trend is most noticeable between the ratios of 1:1 and 1:50, after which it stabilizes. At a 1:1 ratio (Figure 7a), high- and very high-susceptibility areas are more widely distributed, especially in the northern and northwestern regions. As the ratio increases (Figure 7b–f), high- and very high-susceptibility areas decrease significantly, reducing spatial differences in susceptibility levels, and the landslide susceptibility distribution becomes more uniform, concentrating mainly in low- and very low-susceptibility zones. Moreover, even at high ratios (e.g., 1:200), some medium- and high-susceptibility areas remain in the northern and eastern edges of the study area, which generally aligns with the observed landslide sample distribution.

Table 5. Distribution of landslide and non-landslide samples in Scenario 1.

Figure 7. Landslide susceptibility maps with different imbalanced ratios in Scenario 1.
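The multi-ratio sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the size of the candidate pool, and the simple shuffled 70/30 split are assumptions, and only the 589 landslide slope units and the six ratios come from Table 5.

```python
import random


def build_ratio_dataset(landslide_ids, candidate_ids, ratio, seed=0):
    """Combine all landslide units with `ratio` times as many random non-landslide units."""
    rng = random.Random(seed)
    n_non = len(landslide_ids) * ratio
    if n_non > len(candidate_ids):
        raise ValueError("not enough non-landslide candidates for this ratio")
    non_landslide = rng.sample(candidate_ids, n_non)
    # Label 1 = landslide, 0 = non-landslide.
    samples = [(uid, 1) for uid in landslide_ids] + [(uid, 0) for uid in non_landslide]
    rng.shuffle(samples)
    return samples


def train_test_split(samples, train_fraction=0.7):
    """Split already-shuffled samples into training and test subsets."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


landslide_ids = list(range(589))                    # 589 landslide slope units (Table 5)
candidate_ids = list(range(1000, 1000 + 150_000))   # hypothetical pool of non-landslide units

for ratio in (1, 5, 50, 100, 150, 200):
    data = build_ratio_dataset(landslide_ids, candidate_ids, ratio)
    train, test = train_test_split(data)
    print(f"1:{ratio}  total={len(data)}  train={len(train)}  test={len(test)}")
```

The totals produced this way (589 landslide units plus 589 times the ratio) correspond to the "Total Number" column of Table 5; the exact train/test counts depend on how the split is rounded.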

As shown in Figure 8, increasing the proportion of non-landslide samples leads to a gradual reduction in the mean and standard deviation of the LSIs. At a 1:1 ratio (Figure 8a), the mean LSI is 0.301, with a standard deviation of 0.256. While low-susceptibility areas dominate, there is still a notable presence of moderate-to-high-susceptibility areas. At ratios of 1:5 (Figure 8b) and 1:50 (Figure 8c), the mean decreases to 0.132 and 0.028, respectively, the standard deviation falls correspondingly, and the high-susceptibility range contracts markedly. As the ratio further increases to 1:100, 1:150, and 1:200 (Figure 8d–f), the mean drops to 0.016, 0.012, and 0.009, respectively, and the standard deviation continues to decrease. The largest changes therefore occur between 1:1 and 1:50; from 1:100 onward, both statistics stabilize. Overall, a growing proportion of non-landslide samples reduces the spatial heterogeneity of the LSP results and concentrates the LSIs in very low- and low-susceptibility zones.

Figure 8. LSIs under imbalanced ratios. The red vertical lines indicate the thresholds separating different landslide susceptibility levels (very low, low, moderate, high, and very high).
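The summary statistics and five-level classification discussed above could be computed along these lines. This is only a sketch: the equal-interval break values (0.2, 0.4, 0.6, 0.8) are placeholders, since the thresholds marked in Figure 8 are not given in the text, and the function names are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev

LEVELS = ["very low", "low", "moderate", "high", "very high"]


def classify_lsi(lsi, breaks=(0.2, 0.4, 0.6, 0.8)):
    """Map one LSI value in [0, 1] to a susceptibility level (breaks are assumed)."""
    return LEVELS[bisect_right(breaks, lsi)]


def summarize(lsi_values, breaks=(0.2, 0.4, 0.6, 0.8)):
    """Return the mean, standard deviation, and share of units in each level."""
    counts = {level: 0 for level in LEVELS}
    for value in lsi_values:
        counts[classify_lsi(value, breaks)] += 1
    n = len(lsi_values)
    return {
        "mean": mean(lsi_values),
        "std": pstdev(lsi_values),
        "share": {level: count / n for level, count in counts.items()},
    }


# Hypothetical LSI values for a handful of slope units.
print(summarize([0.02, 0.05, 0.10, 0.35, 0.90]))
```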

4.3.2. Model Validation for Scenario 1

As shown in Figure 9, the model’s performance varies with the landslide-to-non-landslide ratio. At a 1:1 ratio, the AUC value is 0.897. Increasing the ratio to 1:5 raises the AUC to its peak of 0.914. The AUC then decreases to 0.900 at a 1:50 ratio and drops further to 0.894 at a 1:100 ratio, the lowest value among all ratios. When the ratio increases to 1:150 and 1:200, the AUC rebounds to 0.905 and 0.907, respectively. Apart from the 1:1 and 1:100 ratios, the AUC values are 0.900 or higher.

Figure 9. ROC curves for different ratios.
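For reference, the AUC can be computed directly from labels and predicted scores as the probability that a randomly chosen landslide unit receives a higher score than a randomly chosen non-landslide unit. The sketch below uses this pairwise definition (simple but quadratic in the number of samples); the function name is an assumption, and the dictionary only restates the AUC values reported above.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# AUC values reported for Scenario 1 (Figure 9).
reported_auc = {"1:1": 0.897, "1:5": 0.914, "1:50": 0.900,
                "1:100": 0.894, "1:150": 0.905, "1:200": 0.907}
best = max(reported_auc, key=reported_auc.get)
worst = min(reported_auc, key=reported_auc.get)
print(f"highest AUC at {best}, lowest AUC at {worst}")
```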

4.4. Scenario 2

In Scenario 2, the area proportions of the very low- and very high-susceptibility zones were determined from the results of the initial LSP analysis (Figure 4; Table 4). These area proportions were then used to define the ratio of landslide to non-landslide samples. Specifically, very high-susceptibility zones covered 77,189 units, while very low-susceptibility zones covered 326,157 units, resulting in an area ratio of approximately 1:4.23 (landslides to non-landslides). As shown in Figure 10, high-susceptibility zones are predominantly found in steep mountainous and hilly regions, particularly in the northern and eastern parts, where the likelihood of landslides is elevated. By contrast, low-susceptibility zones are concentrated in the relatively flat central and western areas, indicating a lower landslide risk. High-susceptibility zones are closely aligned with areas experiencing significant terrain relief, sparse vegetation, and frequent human activities.

Figure 10. Landslide susceptibility mapping for Scenario 2.
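The 1:4.23 ratio follows from the area counts of the very high and very low classes of the initial LSP listed in Table 4. The short calculation below reproduces it; the resulting number of non-landslide samples for 589 landslide units is an illustrative consequence of that ratio and is not a figure stated in the text.

```python
very_high_units = 77_189    # area count of the "Very high" class (Table 4)
very_low_units = 326_157    # area count of the "Very Low" class (Table 4)

# Non-landslide samples per landslide sample implied by the area proportions.
ratio = very_low_units / very_high_units
print(f"landslide : non-landslide = 1:{ratio:.2f}")

n_landslides = 589                                  # landslide slope units (Table 3)
n_non_landslides = round(n_landslides * ratio)      # illustrative sample size only
print(n_non_landslides)
```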

4.5. Scenario 3

As shown in our SHAP and FR analyses (Section 4.2.2), an NDVI value of 0.8 was identified as a key threshold. Based on this, in Scenario 3, non-landslide samples with NDVI values below 0.8 were first identified and removed. Then, an equal number of non-landslide samples were added from areas where the NDVI value exceeded 0.8. This threshold-based adjustment is intended to optimize the selection of non-landslide samples and thereby improve the model’s predictive accuracy and reliability. In Scenario 3 (Figure 11), the landslide susceptibility maps show spatial distribution patterns similar to those in Scenario 2. This similarity can be attributed to the strong correlation between landslide susceptibility and the NDVI in both scenarios. High-susceptibility areas are still concentrated in regions with low NDVI values, especially in the northern and eastern mountainous areas. In contrast, low-susceptibility areas are mainly found in flatter regions with higher NDVI values. Scenario 3 adopts a more conservative prediction approach, capturing more high-susceptibility areas.

Figure 11. Landslide susceptibility mapping of Scenario 3.
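The NDVI-based correction can be illustrated with the following sketch, which drops non-landslide samples with NDVI below 0.8 and replaces them with an equal number drawn from candidates with NDVI above 0.8. The data layout (lists of `(unit_id, ndvi)` tuples), the function name, and the random replacement are assumptions made for illustration; how the replacement units were actually chosen is not described in the text.

```python
import random

NDVI_THRESHOLD = 0.8  # threshold identified by the SHAP and FR analyses


def ndvi_correct(non_landslide, candidates, threshold=NDVI_THRESHOLD, seed=0):
    """Replace low-NDVI non-landslide samples with high-NDVI candidates, keeping the count."""
    rng = random.Random(seed)
    kept = [sample for sample in non_landslide if sample[1] >= threshold]
    n_removed = len(non_landslide) - len(kept)
    used_ids = {uid for uid, _ in kept}
    # Only candidates above the threshold that are not already in the sample set.
    pool = [c for c in candidates if c[1] > threshold and c[0] not in used_ids]
    if n_removed > len(pool):
        raise ValueError("not enough high-NDVI candidates to replace removed samples")
    return kept + rng.sample(pool, n_removed)


# Hypothetical (unit_id, ndvi) pairs.
non_landslide = [(1, 0.55), (2, 0.85), (3, 0.79), (4, 0.88)]
candidates = [(10, 0.90), (11, 0.70), (12, 0.83), (13, 0.86)]
print(ndvi_correct(non_landslide, candidates))
```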

According to the results shown in Figure 12, Scenario 2 achieves an AUC value of 0.979, while Scenario 3, which applies the LCF-based correction method, reaches the highest AUC value of 0.986.

Figure 12. ROC curves for Scenario 2 and Scenario 3.

5. Discussion

This study combines SHAP analysis and three sampling strategies to develop a region-specific landslide susceptibility analysis framework. The goal is to create adaptive non-landslide point selection criteria tailored to the study area’s unique environmental and geological characteristics. The results show that the proportion and selection of non-landslide points significantly affect LSP outcomes. The framework incorporating regional characteristics provides a more flexible and scalable approach than traditional sampling methods. Additionally, SHAP analysis enhances the model’s interpretability by revealing the contribution of each factor, improving the transparency of the machine learning model for LSP.

5.1. Frequency Ratio Analysis Results

Through FR analysis, the NDVI emerged as the most influential factor for LSP in this study area. It was also the primary criterion for selecting non-landslide points in Scenario 3. When the NDVI values ranged between 0.32 and 0.64 in this region, the FR value peaked at 3.7. While moderate vegetation cover enhances soil stability, landslides may still occur due to water infiltration. Shallow landslides, typically involving soil masses less than 2 m thick [57], dominate the dataset in this study, which may explain why the NDVI emerged as the most influential factor for LSP. Shallow landslides are sensitive to vegetation cover (expressed as both below-ground and above-ground vegetation) [58,59]. Therefore, the results of the FR analysis should be interpreted considering the local geological context, particularly the prevalence of shallow landslides, and should not be generalized to other regions without careful consideration.
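As a worked check of the frequency ratio, FR = F/E, where E is the share of the study area in a class and F is the share of landslides in that class, the NDVI rows of Table 2 can be recomputed as sketched below. The function name is an assumption; the counts are taken from Table 2.

```python
def frequency_ratio(area_counts, landslide_counts):
    """FR per class: (share of landslides in class) / (share of area in class)."""
    total_area = sum(area_counts)
    total_landslides = sum(landslide_counts)
    return [
        (slides / total_landslides) / (area / total_area)
        for area, slides in zip(area_counts, landslide_counts)
    ]


# NDVI classes, area counts, and landslide counts from Table 2.
ndvi_classes = ["0.32-0.64", "0.64-0.75", "0.75-0.81", "0.81-0.86", "0.86-0.9"]
area_counts = [10_567, 37_688, 89_102, 215_742, 500_027]
landslide_counts = [27, 84, 221, 190, 67]

ndvi_fr = frequency_ratio(area_counts, landslide_counts)
for cls, fr in zip(ndvi_classes, ndvi_fr):
    print(f"NDVI {cls}: FR = {fr:.2f}")
```

The three classes below an NDVI of about 0.81 all have FR values above 3, whereas the two classes above it fall to roughly 1.3 and 0.2, which is the contrast behind the 0.8 threshold used in Scenario 3.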

5.2. SHAP Model Interpretation

Compared with physical models, machine learning models are often regarded as “black boxes” because their predictions are based on data-driven training rather than a direct connection to the physical processes underlying landslide occurrences. This lack of interpretability can make building trust in the model’s predictions challenging, especially when such predictions are used for critical decision making [60]. SHAP analysis enhances the interpretability of machine learning applied to LSP by highlighting the influence of predisposing factors. In this study, the NDVI and altitude were the most impactful factors, with the NDVI showing the strongest effect. SHAP analysis identified a critical NDVI threshold of 0.8, consistent with FR analysis.
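To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature scoring function. Everything in it is hypothetical: the scoring rule, the feature values, and the single baseline point are invented for illustration and are unrelated to the random forest model used in this study. Practical SHAP implementations also average over a background dataset and use faster algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`, relative to a single baseline point."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                # Features in the coalition take their real value, the rest the baseline.
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_score(features):
    """Hypothetical susceptibility score from (NDVI, altitude in m, road distance in m)."""
    ndvi, altitude, road_dist = features
    low_ndvi = ndvi < 0.8
    return 0.6 * low_ndvi + 0.3 * (altitude < 500) + 0.1 * (low_ndvi and road_dist < 500)


x = [0.70, 400.0, 300.0]          # unit being explained (hypothetical)
baseline = [0.85, 1200.0, 2500.0]  # reference unit (hypothetical)
phi = shapley_values(toy_score, x, baseline)
print(dict(zip(["NDVI", "altitude", "road distance"], phi)))
```

The contributions sum to the difference between the score at `x` and the score at the baseline, which is the additivity property that allows per-factor SHAP values to be read as a decomposition of each prediction.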

Moreover, SHAP analysis revealed complex non-linear relationships between the LCFs and the LSP results. The SHAP values for altitude showed a distinct U-shaped pattern, indicating that the contribution of altitude to landslide susceptibility varies markedly across altitude ranges within the study area. Both low- and high-altitude regions exhibited high susceptibility. This could be due to a higher proportion of shallow landslides in low-elevation areas, where human activities and engineering interference with the environment affect slope stability.

Another possibility is that the study area’s historical landslide inventory may be biased, as landslides are more likely to be investigated in accessible areas, often near human settlements and lower areas. While this could introduce some bias, it also has the advantage of focusing the LSP results on areas with high exposure to human impact. However, this pattern also highlights certain limitations, suggesting that the results should be viewed as complementary to more traditional geological analyses.

5.3. Insights into the Three Sampling Strategies

This study employed three different sampling strategies, corresponding to Scenario 1, Scenario 2, and Scenario 3. Scenario 1 utilized different sampling ratios of landslide to non-landslide points, while Scenarios 2 and 3 were based on different approaches to refining the dataset.

In Scenario 1, the 1:5 ratio yielded the best results, with an AUC value of 0.914. This ratio also avoided overemphasizing high-susceptibility areas, a limitation seen with the 1:1 ratio. The success of the 1:5 ratio can be attributed to the fact that it is close to the area ratio of high- to low-susceptibility zones in the study area (1:4.23) and is therefore consistent with the regional geological conditions. Increasing the number of non-landslide samples beyond a 1:50 ratio makes it harder to capture high-susceptibility areas. Based on this, avoiding excessively high negative sample ratios and instead adopting sampling strategies that mirror the actual proportions in the study area, such as the 1:5 ratio, can effectively improve model performance in this region.

5.4. Model Limitations and Future Perspectives

The results indicate that the AUC values for Scenario 2 (0.979) and Scenario 3 (0.986) were substantially higher than the best value for Scenario 1 (0.914), with Scenario 3 achieving the highest AUC. Typically, AUC values ranging from 0.85 to 1 are considered to reflect strong model performance [19]. However, such high values can raise concerns about potential overfitting, a common issue in predictive modeling, as highlighted in previous studies [14,35].

The manually refined datasets in Scenarios 2 and 3 tend to be more conservative, predicting more high-susceptibility areas. This conservative approach may still be appropriate in specific scenarios, such as route planning for critical infrastructure projects, as it better ensures that high-susceptibility areas are not overlooked. To address the possibility of overfitting, we plan to validate the model further by using cross-validation techniques and external validation datasets in future research. If the model’s performance remains consistent across different datasets, the high AUC values could be considered reasonable and indicative of strong predictive power. On the other hand, if the model significantly underperforms on new data, further refinement would be needed, such as simplifying the model or incorporating regularization techniques to improve its generalizability.
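One way the planned cross-validation could be organized is a stratified k-fold split, which keeps the landslide to non-landslide proportion roughly constant in every fold. The sketch below is an assumption about how such a check might be set up, not a procedure used in this study; the function name and the choice of five folds are illustrative.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs with class proportions preserved per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Example with the 1:5 sample counts from Table 5 (589 landslide, 2945 non-landslide units).
labels = [1] * 589 + [0] * 2945
for fold, (train, test) in enumerate(stratified_kfold(labels)):
    n_pos = sum(labels[i] for i in test)
    print(f"fold {fold}: train={len(train)}, test={len(test)}, landslides in test={n_pos}")
```

A model whose AUC stays close to the reported value across all folds, and on an external dataset, would be less likely to be overfitting than one whose AUC varies widely between folds.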

6. Conclusions

This study integrates SHAP analysis with non-landslide sampling optimization to enhance model interpretability and performance in landslide susceptibility prediction. While conventional studies typically use a 1:1 ratio of landslide to non-landslide samples, this approach fails to reflect actual geographical distributions, limiting practical applicability. Three progressive sampling strategies were developed and evaluated: multi-ratio random sampling, susceptibility-based adjustment, and LCF-based correction guided by SHAP analysis. SHAP analysis proved valuable in identifying the NDVI as the dominant factor and establishing its critical threshold, forming the basis for this study’s sampling strategy. The LCF-based correction method (Scenario 3) achieved the highest performance (AUC = 0.986).

This study establishes a landslide susceptibility analysis framework that adapts to regional characteristics. Future research should explore the influence of additional environmental factors on sampling strategies and develop more robust validation methods to address potential overfitting concerns while maintaining model performance.

Author Contributions

Conceptualization, M.L. and H.T.; methodology, M.L. and H.T.; software, M.L.; validation, M.L. and H.T.; formal analysis, M.L.; resources, H.T.; data curation, H.T.; writing—original draft preparation, M.L.; writing—review and editing, H.T.; visualization, M.L.; supervision, H.T.; project administration, H.T.; funding acquisition, H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Research Program of the Institute of Mountain Hazards and Environment, Chinese Academy of Sciences, grant number IMHE-CXTD-04.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available at this time because the construction of the dataset is part of the innovation of this paper and the dataset is still being used in ongoing experimental research; it will be made public, along with the related code, at a later stage.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Guzzetti, F.; Mondini, A.C.; Cardinali, M.; Fiorucci, F.; Santangelo, M.; Chang, K.-T. Landslide Inventory Maps: New Tools for an Old Problem. Earth-Sci. Rev. 2012, 112, 42–66.
2. Froude, M.J.; Petley, D.N. Global Fatal Landslide Occurrence from 2004 to 2016. Nat. Hazards Earth Syst. Sci. 2018, 18, 2161–2181.
3. Varnes, D.J. Landslide Types and Processes. Landslides Eng. Pract. 1958, 24, 20–47.
4. Ado, M.; Amitab, K.; Maji, A.K.; Jasińska, E.; Gono, R.; Leonowicz, Z.; Jasiński, M. Landslide Susceptibility Mapping Using Machine Learning: A Literature Survey. Remote Sens. 2022, 14, 3029.
5. Brabb, E.E. The San Mateo County California GIS Project for Predicting the Consequences of Hazardous Geologic Processes. In Geographical Information Systems in Assessing Natural Hazards; Carrara, A., Guzzetti, F., Eds.; Advances in Natural and Technological Hazards Research; Springer: Dordrecht, The Netherlands, 1995; Volume 5, pp. 299–334. ISBN 978-90-481-4561-4.
6. Corominas, J.; van Westen, C.; Frattini, P.; Cascini, L.; Malet, J.-P.; Fotopoulou, S.; Catani, F.; Van Den Eeckhaut, M.; Mavrouli, O.; Agliardi, F.; et al. Recommendations for the Quantitative Analysis of Landslide Risk. Bull. Eng. Geol. Environ. 2014, 73, 209–263.
7. Neuland, H. A Prediction Model of Landslips. CATENA 1976, 3, 215–230.
8. Aleotti, P.; Chowdhury, R. Landslide Hazard Assessment: Summary Review and New Perspectives. Bull. Eng. Geol. Environ. 1999, 58, 21–44.
9. Sim, A.; Ong, D.; Bachat, J. Geomorphological Approach for Assessment of Slope Stability and Landslide Hazard Mapping. In Proceedings of the 19th Southeast Asian Geotechnical Conference & 2nd AGSSEA Conference (19SEAGC & 2AGSSEA), Kuala Lumpur, Malaysia, 30 May 2016.
10. Thapa, S.; Karna, A.K.; Dahal, B.K. Evaluation of Different Landslide Susceptibility Analysis Methods: A Case Study of Bagmati Rural Municipality. J. Eng. Technol. Plan. 2022, 3, 44–59.
11. Skempton, A.W.; Delory, F.A. Stability of Natural Slopes in London Clay. In Selected Papers on Soil Mechanics; Thomas Telford Publishing: London, UK, 1984; pp. 70–73. ISBN 978-0-7277-3982-7.
12. Hao, L.; Rajaneesh, A.; van Westen, C.; Sajinkumar, K.S.; Martha, T.R.; Jaiswal, P.; McAdoo, B. Constructing a Complete Landslide Inventory Dataset for the 2018 Monsoon Disaster in Kerala, India, for Land Use Change Analysis. Earth Syst. Sci. Data 2020, 12, 2899–2918.
13. Carrara, A.; Cardinali, M.; Detti, R.; Guzzetti, F.; Pasqui, V.; Reichenbach, P. GIS Techniques and Statistical Models in Evaluating Landslide Hazard. Earth Surf. Process. Landf. 1991, 16, 427–445.
14. Huang, F.; Xiong, H.; Jiang, S.-H.; Yao, C.; Fan, X.; Catani, F.; Chang, Z.; Zhou, X.; Huang, J.; Liu, K. Modelling Landslide Susceptibility Prediction: A Review and Construction of Semi-Supervised Imbalanced Theory. Earth-Sci. Rev. 2024, 250, 104700.
15. Reichenbach, P.; Rossi, M.; Malamud, B.D.; Mihir, M.; Guzzetti, F. A Review of Statistically-Based Landslide Susceptibility Models. Earth-Sci. Rev. 2018, 180, 60–91.
16. Mitrofanov, S.A.; Semenkin, E.S. Tree Retraining in the Decision Tree Learning Algorithm. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1047, 012082.
17. Arabameri, A.; Pal, S.C.; Rezaie, F.; Chakrabortty, R.; Saha, A.; Blaschke, T.; Di Napoli, M.; Ghorbanzadeh, O.; Ngo, P.T.T. Decision Tree Based Ensemble Machine Learning Approaches for Landslide Susceptibility Mapping. Geocarto Int. 2022, 37, 4594–4627.
18. Merghadi, A.; Yunus, A.P.; Dou, J.; Whiteley, J.; ThaiPham, B.; Bui, D.T.; Avtar, R.; Abderrahmane, B. Machine Learning Methods for Landslide Susceptibility Studies: A Comparative Overview of Algorithm Performance. Earth-Sci. Rev. 2020, 207, 103225.
19. Ma, Z.; Mei, G.; Piccialli, F. Machine Learning for Landslides Prevention: A Survey. Neural Comput. Appl. 2021, 33, 10881–10907.
20. Pourghasemi, H.R.; Rahmati, O. Prediction of the Landslide Susceptibility: Which Algorithm, Which Precision? CATENA 2018, 162, 177–192.
21. Pourghasemi, H.R.; Kornejady, A.; Kerle, N.; Shabani, F. Investigating the Effects of Different Landslide Positioning Techniques, Landslide Partitioning Approaches, and Presence-Absence Balances on Landslide Susceptibility Mapping. CATENA 2020, 187, 104364.
22. Zhu, A.-X.; Miao, Y.; Yang, L.; Bai, S.; Liu, J.; Hong, H. Comparison of the Presence-Only Method and Presence-Absence Method in Landslide Susceptibility Mapping. CATENA 2018, 171, 222–233.
23. Jiang, Y.; Wang, W.; Zou, L.; Cao, Y. Regional Landslide Susceptibility Assessment Based on Improved Semi-Supervised Clustering and Deep Learning. Acta Geotech. 2024, 19, 509–529.
24. Gao, H.; Fam, P.S.; Tay, L.T.; Low, H.C. Three Oversampling Methods Applied in a Comparative Landslide Spatial Research in Penang Island, Malaysia. SN Appl. Sci. 2020, 2, 1512.
25. Zhang, S. A Comprehensive Approach to the Observation and Prevention of Debris Flows in China. Nat. Hazards 1993, 7, 1–23.
26. Schicker, R.; Moon, V. Comparison of Bivariate and Multivariate Statistical Approaches in Landslide Susceptibility Mapping at a Regional Scale. Geomorphology 2012, 161–162, 40–57.
27. Rosser, B.; Dellow, S.; Haubrock, S.; Glassey, P. New Zealand’s National Landslide Database. Landslides 2017, 14, 1949–1959.
28. Huang, Y.; Zhao, L. Review on Landslide Susceptibility Mapping Using Support Vector Machines. CATENA 2018, 165, 520–529.
29. Haitovsky, Y. Multicollinearity in Regression Analysis: Comment. Rev. Econ. Stat. 1969, 51, 486–489.
30. Youssef, A.M.; Pourghasemi, H.R. Landslide Susceptibility Mapping Using Machine Learning Algorithms and Comparison of Their Performance at Abha Basin, Asir Region, Saudi Arabia. Geosci. Front. 2021, 12, 639–655.
31. Çellek, S. Effect of the Slope Angle and Its Classification on Landslide. Nat. Hazards Earth Syst. Sci. 2020, preprint.
32. Dai, F.C.; Lee, C.F.; Li, J.; Xu, Z.W. Assessment of Landslide Susceptibility on the Natural Terrain of Lantau Island, Hong Kong. Environ. Geol. 2001, 40, 381–391.
33. Yilmaz, I. Landslide Susceptibility Mapping Using Frequency Ratio, Logistic Regression, Artificial Neural Networks and Their Comparison: A Case Study from Kat Landslides (Tokat–Turkey). Comput. Geosci. 2009, 35, 1125–1138.
34. Sørensen, R.; Zinko, U.; Seibert, J. On the Calculation of the Topographic Wetness Index: Evaluation of Different Methods Based on Field Observations. Hydrol. Earth Syst. Sci. 2006, 10, 101–112.
35. Agboola, G.; Beni, L.H.; Elbayoumi, T.; Thompson, G. Optimizing Landslide Susceptibility Mapping Using Machine Learning and Geospatial Techniques. Ecol. Inform. 2024, 81, 102583.
36. Chen, W.; Li, Y. GIS-Based Evaluation of Landslide Susceptibility Using Hybrid Computational Intelligence Models. CATENA 2020, 195, 104777.
37. Hall, F.G.; Townshend, J.R.; Engman, E.T. Status of Remote Sensing Algorithms for Estimation of Land Surface State Parameters. Remote Sens. Environ. 1995, 51, 138–156.
38. Yan, G.; Liang, S.; Gui, X.; Xie, Y.; Zhao, H. Optimizing Landslide Susceptibility Mapping in the Kongtong District, NW China: Comparing the Subdivision Criteria of Factors. Geocarto Int. 2019, 34, 1408–1426.
39. Brenning, A.; Schwinn, M.; Ruiz-Páez, A.P.; Muenchow, J. Landslide Susceptibility near Highways Is Increased by 1 Order of Magnitude in the Andes of Southern Ecuador, Loja Province. Nat. Hazards Earth Syst. Sci. 2015, 15, 45–57.
40. Ilinca, V.; Şandric, I.; Jurchescu, M.; Chiţu, Z. Identifying the Role of Structural and Lithological Control of Landslides Using TOBIA and Weight of Evidence: Case Studies from Romania. Landslides 2022, 19, 2117–2134.
41. Hungr, O.; Leroueil, S.; Picarelli, L. The Varnes Classification of Landslide Types, an Update. Landslides 2014, 11, 167–194.
42. Aditian, A.; Kubota, T.; Shinohara, Y. Comparison of GIS-Based Landslide Susceptibility Models Using Frequency Ratio, Logistic Regression, and Artificial Neural Network in a Tertiary Region of Ambon, Indonesia. Geomorphology 2018, 318, 101–111.
43. Pourghasemi, H.R.; Teimoori Yansari, Z.; Panagos, P.; Pradhan, B. Analysis and Evaluation of Landslide Susceptibility: A Review on Articles Published during 2005–2016 (Periods of 2005–2012 and 2013–2016). Arab. J. Geosci. 2018, 11, 193.
44. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
45. Zhang, J.; Ma, X.; Zhang, J.; Sun, D.; Zhou, X.; Mi, C.; Wen, H. Insights into Geospatial Heterogeneity of Landslide Susceptibility Based on the SHAP-XGBoost Model. J. Environ. Manag. 2023, 332, 117357.
46. Meng, Y.; Yang, N.; Qian, Z.; Zhang, G. What Makes an Online Review More Helpful: An Interpretation Framework Using XGBoost and SHAP Values. J. Theor. Appl. Electron. Commer. Res. 2021, 16, 466–490.
47. Yang, B.; Lu, H.; Ran, Y. Advancing Non-Alcoholic Fatty Liver Disease Prediction: A Comprehensive Machine Learning Approach Integrating SHAP Interpretability and Multi-Cohort Validation. Front. Endocrinol. 2024, 15, 1450317.
48. Pearce, J.; Ferrier, S. Evaluating the Predictive Performance of Habitat Models Developed Using Logistic Regression. Ecol. Model. 2000, 133, 225–245.
49. Aguirre-Gutiérrez, J.; Carvalheiro, L.G.; Polce, C.; van Loon, E.E.; Raes, N.; Reemer, M.; Biesmeijer, J.C. Fit-for-Purpose: Species Distribution Model Performance Depends on Evaluation Criteria – Dutch Hoverflies as a Case Study. PLoS ONE 2013, 8, e63708.
50. Austin, M. Species Distribution Models and Ecological Theory: A Critical Assessment and Some Possible New Approaches. Ecol. Model. 2007, 200, 1–19.
51. Elith, J.H.; Graham, C.P.H.; Anderson, R.P.; Dudík, M.; Ferrier, S.; Guisan, A.; Hijmans, R.J.; Huettmann, F.; Leathwick, J.R.; Lehmann, A.; et al. Novel Methods Improve Prediction of Species’ Distributions from Occurrence Data. Ecography 2006, 29, 129–151.
52. Guzzetti, F.; Carrara, A.; Cardinali, M.; Reichenbach, P. Landslide Hazard Evaluation: A Review of Current Techniques and Their Application in a Multi-Scale Study, Central Italy. Geomorphology 1999, 31, 181–216.
53. Lee, S.; Sambath, T. Landslide Susceptibility Mapping in the Damrei Romel Area, Cambodia Using Frequency Ratio and Logistic Regression Models. Environ. Geol. 2006, 50, 847–855.
54. Gómez-Ramírez, J.; Ávila-Villanueva, M.; Fernández-Blázquez, M.Á. Selecting the Most Important Self-Assessed Features for Predicting Conversion to Mild Cognitive Impairment with Random Forest and Permutation-Based Methods. Sci. Rep. 2020, 10, 20630.
55. Bradley, A.P. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognit. 1997, 30, 1145–1159.
56. Wang, Y.; Song, Q.; Du, Y.; Wang, J.; Zhou, J.; Du, Z.; Li, T. A Random Forest Model to Predict Heatstroke Occurrence for Heatwave in China. Sci. Total Environ. 2019, 650, 3048–3053.
57. Murgia, I.; Giadrossich, F.; Mao, Z.; Cohen, D.; Capra, G.F.; Schwarz, M. Modeling Shallow Landslides and Root Reinforcement: A Review. Ecol. Eng. 2022, 181, 106671.
58. Murgia, I.; Giadrossich, F.; Niccolini, M.; Preti, F.; Giambastiani, Y.; Capra, G.F.; Cohen, D. Using SlideforMAP and SOSlope to Identify Susceptible Areas to Shallow Landslides in the Foreste Casentinesi National Park (Tuscany, Italy). In Proceedings of the EGU General Assembly 2021, Online, 19–30 April 2021; EGU21-14454.
59. Marzini, L.; D’Addario, E.; Papasidero, M.P.; Chianucci, F.; Disperati, L. Influence of Root Reinforcement on Shallow Landslide Distribution: A Case Study in Garfagnana (Northern Tuscany, Italy). Geosciences 2023, 13, 326.
60. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1135–1144.

Figure 1. Location map and landslide inventory of study area.

Figure 2. Landslide conditioning factors: (a) DEM, (b) slope degree, (c) slope aspect, (d) plan curvature, (e) TWI, (f) profile curvature, (g) NDVI, (h) soil types, (i) distance from rivers, (j) distance from roads, (k) lithology, and (l) fault density.

Figure 3. Research workflow.

Figure 4. Initial LSP by RF.

Figure 5. SHAP values.

Figure 6. SHAP scatter plots for (a) NDVI and (b) altitude.

Table 1. Information on landslide conditioning factors.

Category	LCFs	Type	Scale/Resolution	Source	Min.	Max.
Topographic factors	Altitude	Raster	30 m	DEM	363	4242
	Slope degree	Raster	30 m	DEM	0	79.54
	Slope aspect	Raster	30 m	DEM	—	—
	Plan Curvature	Raster	30 m	DEM	−20.45	26.18
	Profile curvature	Raster	30 m	DEM	−22.21	20.09
	TWI	Raster	30 m	DEM	1.66	30.10
Environmental factors	NDVI	Raster	30 m	Landsat 8 satellite images	0.32	0.90
	Soil types	Vector	1:1,000,000	National Cryosphere Desert Data Center (http://www.ncdc.ac.cn, accessed on June 2024)	—	—
	Distance from rivers	Vector	1:250,000	National Earth System Science Data Center, National Science & Technology Infrastructure of China (http://www.geodata.cn, accessed on June 2024)	0	—
	Distance from roads	Vector	1:250,000	National Earth System Science Data Center, National Science & Technology Infrastructure of China (http://www.geodata.cn, accessed on June 2024)	0	—
Geological factors	Lithology	Vector	1:1,000,000	ISRIC (https://data.isric.org/, accessed on June 2024)	—	—
	Fault density	Vector	1:2,500,000	Data Sharing Infrastructure of Seismic Active Fault Survey Data Center (https://www.activefault-datacenter.cn, accessed on June 2024)	0	0.40

Table 2. Frequency ratios for each class of landslide conditioning factors.

LCFs	Class	Area Count	Landslides	Percent of Area (E)	Percent of Landslide (F)	FR = F/E	RF (%)	Max
Altitude	2820–4242	126,099	154	15%	26%	1.79	37%	2.37
	2196–2820	199,557	322	23%	55%	2.37	49%
	1638–2196	235,332	105	27%	18%	0.66	13%
	1050–1638	210,888	8	24%	1%	0.06	1%
	363–1050	93,578	0	11%	0%	0.00	0%
Slope degree	42.23–79.5	147,764	142	17%	24%	1.41	30%	1.64
	31.9–42.23	210,399	236	24%	40%	1.64	35%
	22.5–31.9	223,029	143	26%	24%	0.94	20%
	12.7–22.5	193,543	45	22%	8%	0.34	7%
	0–12.7	88,984	23	10%	4%	0.38	8%
Slope aspect	N	97,782	49	11%	8%	0.73	9%	1.29
	NE	113,375	58	13%	10%	0.75	9%
	E	120,747	83	14%	14%	1.00	13%
	SE	107,297	83	12%	14%	1.13	14%
	S	94,668	78	11%	13%	1.20	15%
	SW	100,338	89	12%	15%	1.29	16%
	W	115,858	88	13%	15%	1.11	14%
	NW	108,574	60	13%	10%	0.81	10%
Plan curvature	−20.4–−2.05	29,969	11	3%	2%	0.54	14%	1.18
	−2.05–−0.78	132,209	66	15%	11%	0.73	19%
	−0.78–0.32	402,999	325	47%	55%	1.18	30%
	0.32–1.77	255,768	174	30%	30%	1.00	26%
	1.77–26.18	44,509	13	5%	2%	0.43	11%
Profile curvature	1.75–20.09	26,664	3	3%	1%	0.17	5%	1.29
	0.26–1.75	131,915	65	15%	11%	0.72	21%
	−0.89–0.25	371,154	325	43%	55%	1.29	38%
	−2.55–−0.89	273,141	174	32%	30%	0.94	27%
	−22.21–−2.55	62,580	13	7%	2%	0.31	9%
TWI	1.66–4.66	377,624	212	44%	36%	0.82	14%	1.52
	4.66–6.54	297,552	216	34%	37%	1.06	18%
	6.54–9.33	130,604	103	15%	17%	1.16	20%
	9.33–14.44	40,507	42	5%	7%	1.52	26%
	14.44–30.10	17,432	16	2%	3%	1.35	23%
NDVI	0.32–0.64	10,567	27	1%	5%	3.70	31%	3.70
	0.64–0.75	37,688	84	4%	14%	3.23	27%
	0.75–0.81	89,102	221	10%	38%	3.59	30%
	0.81–0.86	215,742	190	25%	32%	1.28	11%
	0.86–0.9	500,027	67	59%	11%	0.19	2%
Soil types	Lixisols	95,685.81	145	11%	25%	2.24	21%	2.24
	Regosols	104,384.52	133	12%	23%	1.88	18%
	Anthrosols	43,493.55	56	5%	8%	1.56	15%
	Luvisols	434,935.5	35	50%	6%	0.12	1%
	Cambisols	69,589.68	35	8%	6%	0.75	7%
Distance from river	0–500	169,088	208	19%	35%	1.82	33%	1.82
	500–1000	147,156	107	17%	18%	1.07	20%
	1000–1500	134,586	94	15%	16%	1.03	19%
	1500–2000	114,475	80	13%	14%	1.03	19%
	>2000	304,566	100	35%	17%	0.48	9%
Distance from road	0–500	133,635	212	15%	36%	2.34	35%	2.34
	500–1000	100,139	103	12%	17%	1.52	23%
	1000–1500	87,300	64	10%	11%	1.08	16%
	1500–2000	76,050	63	9%	11%	1.22	18%
	>2000	472,747	147	54%	25%	0.46	7%
Fault density	0–0.037	443,076	343	51%	58%	1.14	26%	1.14
	0.037–0.105	129,035	84	15%	14%	0.96	22%
	0.105–0.170	201,592	115	23%	20%	0.84	19%
	0.170–0.25	68,716	37	8%	6%	0.79	18%
	0.25–0.401	23,009	10	3%	2%	0.64	15%

Table 3. Distribution of landslide and non-landslide samples in initial LSP by RF model.

Landslide/Non-Landslide	Landslide Slope Units	Non-Landslide Slope Units	Total Number of Landslide and Non-Landslide Slope Units	Training Set (70%)	Test Set (30%)
1:1	589	589	1178	824	353

Table 4. Frequency ratio classification of initial LSP by RF model.

Susceptibility Level	Area Count	Landslides	% of Area (E)	% of Landslide (F)	FR = F/E
Very low	326,157	3	38%	1%	0.01
Low	197,596	11	23%	2%	0.08
Moderate	131,483	38	15%	6%	0.42
High	119,645	90	14%	15%	1.09
Very high	77,189	445	9%	76%	8.37

Table 5. Distribution of landslide and non-landslide samples in Scenario 1.

Landslide/Non-Landslide	Landslide Slope Units	Non-Landslide Slope Units	Total Number of Landslide and Non-Landslide Slope Units	Training Set (70%)	Test Set (30%)
1:1	589	589	1178	824	353
1:5	589	2945	3534	2474	1061
1:50	589	29,450	30,039	21,027	9012
1:100	589	58,900	59,489	41,642	17,847
1:150	589	88,350	88,939	62,257	26,682
1:200	589	117,800	118,389	82,872	35,517
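
The non-landslide and total counts in Table 5 follow from multiplying the 589 landslide slope units by each ratio. The sketch below is a hypothetical illustration, not the study's code: it derives those counts and draws a random non-landslide sample from a made-up pool of candidate slope-unit IDs. The function names, the pool size, and the seed are all assumptions.

```python
import random

N_LANDSLIDE = 589                   # landslide slope units (Table 5)
RATIOS = [1, 5, 50, 100, 150, 200]  # non-landslide units per landslide unit


def sample_sizes(n_landslide, ratio):
    """Return (non-landslide count, total count) for a 1:ratio design."""
    n_non = n_landslide * ratio
    return n_non, n_landslide + n_non


def draw_non_landslide(candidate_ids, n_landslide, ratio, seed=0):
    """Randomly pick n_landslide * ratio IDs without replacement."""
    rng = random.Random(seed)
    return rng.sample(candidate_ids, n_landslide * ratio)


sizes = {f"1:{r}": sample_sizes(N_LANDSLIDE, r) for r in RATIOS}

# Hypothetical pool of non-landslide slope-unit IDs; the size is arbitrary.
pool = list(range(200_000))
picked = draw_non_landslide(pool, N_LANDSLIDE, ratio=5)
print(sizes["1:5"], len(picked))
```

Fixing the seed keeps the draw repeatable, which helps when the same non-landslide set has to be reused across model runs.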

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
