Article

Geohazard Susceptibility Assessment in Karst Terrain: A Novel Coupling Model Integrating Information Value and XGBoost Machine Learning in Guizhou Province, China

School of Guizhou Emergency Management, Guizhou Normal University, Guiyang 550025, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 10077; https://doi.org/10.3390/app151810077
Submission received: 19 July 2025 / Revised: 8 September 2025 / Accepted: 11 September 2025 / Published: 15 September 2025

Abstract

This study takes geological disasters in Guizhou Province as the research object and conducts a systematic susceptibility evaluation in response to their frequent occurrence in the province. Existing research relies primarily on single models, which are often weak in factor interpretation; it has not yet systematically integrated the advantages of the traditional information model with multiple machine learning algorithms, nor introduced interpretable methods to analyze the disaster mechanism in depth. Here, the information value (IV) model is combined with five machine learning algorithms, namely logistic regression (LR), decision tree (DT), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost), to construct coupling models for evaluating susceptibility to geological disasters, with the models built in combination with the Bayesian optimization algorithm. The confusion matrix and the receiver operating characteristic (ROC) curve are used to evaluate model accuracy. The Shapley Additive exPlanations (SHAP) method is used to quantify the contribution of each influencing factor, thereby improving the transparency and credibility of the model. The results show that the coupling models perform well and that the IV-XGB model performs best (AUC = 0.9448). This model identifies the northern Wujiang River Basin and the central karst core area as high-risk areas and clarifies the coupled “terrain–hydrology–human activities” disaster-causing mechanism. The SHAP analysis further identifies NDVI, land use type, and elevation as the predominant controlling factors. This study presents a high-precision and interpretable modeling method for assessing susceptibility to geological disasters, providing a scientific basis for disaster prevention and control in Guizhou Province and in regions with similar geological conditions.

1. Introduction

Affected by global climate change, the frequency and intensity of extreme disaster events worldwide have shown a significant increasing trend, triggering a series of meteorological, flood, drought, and geological disasters [1]. China has a vast territory and diverse, complex geological structures, making it one of the countries with the most severe geological disasters in the world [2]. As the population expands and urbanization accelerates, land resources undergo increasingly frequent development and utilization, which heightens geological disaster risks that, in turn, pose grave threats to lives, property, and regional sustainability [3]. In Guizhou Province, as of the end of 2022, a total of 28,172 potential geological disaster points and high-risk slopes had been identified, threatening more than 2 million people. Among them, 10,475 potential hazards exhibited obvious deformation, including 5317 landslides, 3087 collapses, 119 debris flows, 1365 unstable slopes, 507 ground collapses, and 80 ground fissures. Additionally, 17,697 risk slopes had not shown deformation but still had conditions that could lead to disasters [4]. The total area of high- and medium-susceptibility areas in the province was 154,800 square kilometers, accounting for 87.9% of the land area [5]. Given the unique karst geological environment and complex geological conditions in Guizhou Province [6], it is crucial to select appropriate influencing factors and training models for evaluating the susceptibility of geological disasters.
The susceptibility assessment of geological disasters is the basis of risk assessment [7]. It is crucial for predicting and evaluating the occurrence and development of sudden geological disasters, identifying high-risk areas, and providing a scientific basis for disaster reduction planning [8]. Traditional assessment methods primarily rely on expert experience and statistical models, which are highly subjective and struggle to address spatial heterogeneity and data imbalance effectively [9]. The information quantity model evaluates the contribution of geological factors to geological disaster events by quantitatively calculating their weights. Its advantage lies in strong interpretability, but it relies on subjective experience and cannot capture complex, non-linear relationships [10]. In recent years, machine learning (ML) models, such as logistic regression (LR) [11], decision tree (DT) [12], support vector machine (SVM) [13], and random forest (RF) [14], have shown significant advantages in the susceptibility assessment of geological disasters. Liu Jian et al. [15] employed random forest and logistic regression to assess the susceptibility to geological disasters in Xietan Township, Shazhenxi Town, within the Three Gorges Reservoir Area, and found that the optimized random forest model exhibited higher prediction accuracy. Chen Xinyu et al. [16] introduced the CF model into the SVM model to predict landslide points in Lueyang County and, by comparison with the single SVM model, showed that the coverage rate of disaster points was significantly improved. Xu et al. [17] integrated the entropy index into LR to evaluate the susceptibility of landslides in Shaanxi Province; the AUC of the coupled model was 0.89, significantly better than that of the traditional LR. Ma et al. [18] combined LR with the Bayesian probability method and verified the applicability of the coupling model through the risk assessment of the Jiuzhaigou earthquake landslide. The study by Li et al. [19] in the central mountainous area of Hainan Island also demonstrated that the accuracy of the information–logistic regression coupling model surpasses that of the single information model. Wang et al. [20] introduced the XGBoost algorithm for predicting blood glucose concentration based on Raman spectroscopy and found that, compared with decision tree, random forest, and support vector regression models, it not only had better prediction performance but also significantly improved computational efficiency.
The advantages of different models can complement each other, thereby optimizing evaluation results and improving prediction accuracy, making them applicable to a variety of geological conditions [21]. It is an important direction to improve the accuracy and robustness of susceptibility assessment by combining the feature quantification advantage of the traditional information value model (IV) with the high-precision prediction ability of an advanced machine learning model to construct the information value–machine learning coupling model (IV-ML) [22]. Wei Wenhao et al. [23] coupled the amount of information with support vector machine modeling, and the results show that the combined accuracy is better than that of a single method. Kong Jiaxu et al. [24] further coupled a random forest and a convolutional neural network with information quantity, respectively. After verification, it was found that the information quantity convolutional neural network had stronger adaptability in identifying loess geological disasters. However, its systematic comparative verification and optimization in complex karst geological areas, such as Guizhou Province, still require further development. Studies have shown that factors such as land use type, slope, normalized difference vegetation index (NDVI), road density, rainfall, and curvature have a significant control effect on the development of geological disasters. For example, Wang et al. [25] used GIS weight analysis to demonstrate that the interaction between lithology and slope is the primary controlling factor of landslides in Guizhou. At the same time, rainfall and human activities indirectly exacerbate disaster risk by altering the hydrological cycle and stress distribution [26]. In assessing geological disaster susceptibility, tools such as the Pearson correlation coefficient and geographic detector are typically used to analyze the importance of factors [27]. 
However, the application of ex post interpretable algorithms is still relatively limited, which makes it difficult to present the internal relationship between factors and geological disasters clearly and lacks the scientific support needed for accurate disaster prevention. Saeed Chehreh Chelgani et al. [28] constructed a framework for energy consumption index modeling and feature analysis of cement ball mills combining CatBoost and SHAP. By introducing SHAP to enhance interpretability, the key factors influencing energy consumption in the mill are identified, and operators are assisted in making informed decisions for sustainable production and energy consumption optimization. Chen Xiaokun et al. [29] developed a framework for coronary heart disease risk assessment and feature analysis that combines XGBoost and SHAP. By introducing SHAP, it could enhance interpretability, identify key factors influencing the disease, and support clinicians in making informed diagnostic decisions.
In summary, machine learning has been widely applied in evaluating geological disaster susceptibility because of its high prediction accuracy and ease of operation. However, its training process is complicated, and its outputs often come from a “black box” that lacks interpretability, which restricts further adoption. To this end, this study takes geological disasters in Guizhou Province, including landslides, collapses, debris flows, and ground collapses, as the research object and comprehensively considers influencing factors covering topography, hydrology, climate, ecology, and human activities. It targets the problems of traditional research on geological disaster susceptibility, such as strong subjectivity, difficulty in capturing complex non-linear relationships, and uncertainty of results. A multialgorithm coupling evaluation system is constructed using the data-driven information value–machine learning coupling models (IV-DT, IV-LR, IV-SVM, IV-RF, and IV-XGB). The SHAP method is introduced to quantify the marginal contribution of each influencing factor and reveal the disaster-causing mechanism, thereby enhancing the credibility and transparency of the model and supporting a fine-grained spatial expression of the probability of disaster occurrence. The research results will provide a scientific basis for disaster risk control in Guizhou Province and may serve as a reference for disaster reduction practices in karst landform areas worldwide.

2. Materials and Methods

2.1. Study Area

Guizhou Province is located in the western part of the Yangtze Plate, adjacent to the Tethys orogenic belt in the west, and is situated in a crustal uplifting and deforming area between the Mesozoic orogenic belt in East Asia and the Alpine-Tethys Cenozoic orogenic belt. It lies between 24°37′ N and 29°13′ N latitude and 103°36′ E and 109°35′ E longitude, with a total land area of 176,200 square kilometers [30]. The terrain is high in the west and low in the east, with an average altitude of about 1100 m. The landform types are complex and diverse, consisting primarily of plateaus and mountains. Karst landforms are widely distributed, accounting for about 73% of the province’s total land area [31]. The region features a complex geological structure, richly varied stratigraphic lithology, a fragile geological environment, and numerous potential geological hazards, all of which pose significant challenges [32]. The region falls within the subtropical monsoon climate zone, characterized by a mild and humid climate with abundant rainfall [33]. Precipitation is mainly concentrated in certain months, and geological disasters triggered by heavy rain, such as debris flows, occur frequently. The general situation of the study area is shown in Figure 1.

2.2. Data Sources

The specific data sources of the publicly available datasets used in this study are shown in Table 1.

3. Research Methods

3.1. Bayesian Optimization

Bayesian optimization belongs to a class of algorithms that search for hyperparameters using prior knowledge [34]. As a global optimization method, its core advantage is that it can approach the global optimum, that is, the best hyperparameter combination, within a limited number of iterations. In this study, Bayesian optimization was employed to tune the hyperparameters of the five coupling models: IV-LR, IV-DT, IV-SVM, IV-RF, and IV-XGB. Each algorithm is used as its own base model, with settings chosen according to its characteristics, and the AUC value on the test set is used as the objective function for evaluating each hyperparameter combination. During optimization, a probabilistic model of the objective function is built from previously sampled points, and the next set of hyperparameters to evaluate is selected using the expected improvement (EI) acquisition function. The procedure iterates until convergence or until the maximum number of iterations is reached, which determines the optimal hyperparameter configuration for each model. The experimental model parameters are shown in Table 2.
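As an illustration of the selection step described above, the following minimal sketch computes the expected improvement for a maximization objective such as AUC. It assumes the probabilistic model returns a Gaussian posterior mean `mu` and standard deviation `sigma` at each candidate; the function names, the exploration margin `xi`, and the `max_depth` candidates are hypothetical and are not taken from the study.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement over `best` for a maximization problem.

    mu, sigma: assumed Gaussian posterior mean and standard deviation
    of the objective (e.g. AUC) at one candidate point.
    xi: optional exploration margin (hypothetical default of 0).
    """
    gain = mu - best - xi
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + sigma * pdf


def pick_next(candidates, best):
    """candidates: list of (params, mu, sigma); return the params with the largest EI."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]


# Hypothetical candidates: the second has a lower mean but more uncertainty.
candidates = [({"max_depth": 3}, 0.85, 0.01), ({"max_depth": 6}, 0.84, 0.05)]
next_params = pick_next(candidates, best=0.86)
```

In this toy case the more uncertain candidate is chosen, which shows how EI trades off exploiting high predicted values against exploring poorly sampled regions.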

3.2. Information Quantity Model

The information quantity model is a statistical model that calculates the amount of information each influencing factor supplies about the research object, based on a statistical analysis of where deformation has occurred or may occur in a specific geological environment [35]. The occurrence of geological disasters is affected by multiple factors, and the magnitudes and effects of these factors may vary in different geological environments; therefore, an optimal combination of factors exists [36]. Lower information quantity values indicate a lower probability of geohazard occurrence; conversely, higher values indicate a higher probability [10]. Specifically, the value is obtained by comparing the occurrence frequency of geological disasters under a particular state of a factor in a specific evaluation unit with the occurrence frequency of geological disasters in the surveyed area [37]. The information quantity of the evaluation factors for the susceptibility to geological disasters in this study is shown in Figure 2. For a particular factor in a specific state, the calculation formula for the corresponding geological disaster information quantity is as follows:
$$I = \ln\frac{N_i/N}{S_i/S} \tag{1}$$
In the formula, $N_i$ represents the number of geological disasters in a given class of an impact factor; $N$ represents the total number of geological disasters in the study area; $S_i$ denotes the number of grids in that class of the impact factor; $S$ denotes the total number of grids in the study area. When $I > 0$, the condition is conducive to the occurrence of geological disasters; when $I < 0$, the condition is not conducive to the occurrence of geological disasters. Geological disasters are affected by many factors. The total amount of information on geological disasters under the combination of the states of the various factors can be determined by the following formula:
$$I = \sum_{i=1}^{n} \ln\frac{N_i/N}{S_i/S} \tag{2}$$
where: $I$ is the total information quantity of geological disasters in the corresponding unit, which indicates the probability of geological disasters occurring and can be used as the susceptibility index; $n$ is the number of factors; $N_i$ is the area or number of geological disaster points under the state (or interval) of the $i$-th factor present in the unit; $S_i$ is the distribution area of that state (or interval) of the $i$-th factor; $N$ is the total area or total number of geological disaster points in the surveyed area; $S$ is the total surveyed area.
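The two formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the counts in the example are made up, and the handling of a class with no recorded disasters (where the logarithm is undefined) is an assumption, since the text does not say how such classes are treated.

```python
import math


def information_value(n_i, n_total, s_i, s_total):
    """I = ln((N_i / N) / (S_i / S)) for one class of one factor."""
    if n_i <= 0 or s_i <= 0:
        # Assumption: classes with no disasters or no grids are rejected here;
        # the source does not specify how they are handled.
        raise ValueError("information value is undefined for empty classes")
    return math.log((n_i / n_total) / (s_i / s_total))


def total_information(per_factor_values):
    """Sum the information values of the factor states present in one unit."""
    return sum(per_factor_values)


# Made-up example: a class holding 50% of the disasters on 25% of the grids.
favourable = information_value(50, 100, 25, 100)    # ln 2 > 0, conducive
unfavourable = information_value(10, 100, 40, 100)  # ln 0.25 < 0, not conducive
unit_index = total_information([favourable, unfavourable])
```

A positive value marks a class where disasters are over-represented relative to its share of the study area, which matches the sign rule stated in the text.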

3.3. Machine Learning Model

3.3.1. Logistic Regression (LR)

Logistic regression (LR) is a regression analysis method derived from the generalized linear model, commonly used to explain the relationship between a binary dependent variable and a set of predictor variables [38]. The model connects the response variable and the independent variables through a link function. Unlike in the general linear regression model, the variables in logistic regression can be either continuous or discrete. LR solves the binary classification problem (“0” and “1”) by predicting the probability of an event occurring [39]. The expression of logistic regression is as follows:
$$P = \frac{e^{Y}}{1 + e^{Y}} \tag{3}$$
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n \tag{4}$$
where: $P$ represents the probability of a geological hazard occurrence, with its value falling within the interval [0, 1]; $\alpha$ is a constant; $\beta_1, \beta_2, \ldots, \beta_n$ are the logistic regression coefficients, which indicate the weights of the evaluation factors; and $X_1, X_2, \ldots, X_n$ are the independent explanatory variables that influence the occurrence of geological hazard events.
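A minimal sketch of the logistic expression above, written in a numerically stable form so that very large or very small linear predictors do not overflow. The function name and the example coefficients are hypothetical; in a coupled model the inputs might be the information values of each factor, which is an assumption here.

```python
import math


def hazard_probability(alpha, betas, xs):
    """P = e^Y / (1 + e^Y) with Y = alpha + sum(beta_k * x_k)."""
    y = alpha + sum(b * x for b, x in zip(betas, xs))
    if y >= 0:
        # Equivalent form that avoids overflow of exp(y) for large y.
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


# Hypothetical coefficients and factor values.
p = hazard_probability(alpha=-0.5, betas=[1.2, 0.8], xs=[0.7, -0.3])
```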

3.3.2. Decision Tree Algorithm (DT)

DT is a tree-structured model that constructs decision rules [40] by recursively splitting data features. It is a simple and easy-to-use non-parametric classifier. The C4.5 algorithm is adopted in this paper. This algorithm selects the splitting features by means of the information gain ratio, and its mathematical expression is as follows:
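$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{InfoGain}(D, A)}{IV(A)} \tag{5}$$
$$IV(A) = -\sum_{k=1}^{K} \frac{|D_k|}{|D|} \log_2 \frac{|D_k|}{|D|} \tag{6}$$
where attribute $A$ takes $K$ possible values $\{a_1, a_2, \ldots, a_K\}$. If $A$ is used to partition the sample set $D$, then $K$ branch nodes are formed. Among them, the $k$-th node contains all samples in $D$ whose value on attribute $A$ is $a_k$, denoted as $D_k$. Here $IV(A)$ is the intrinsic value of attribute $A$ and should not be confused with the information value (IV) model.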
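The gain ratio can be sketched as follows, using the standard definition of information gain as the reduction in label entropy after partitioning by the attribute (the text does not spell this definition out). The function names and the toy slope classes are hypothetical, and returning 0 when the attribute has a single value is an assumption made to avoid division by zero.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """GainRatio(D, A) = InfoGain(D, A) / IV(A) for one attribute A."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Entropy remaining after splitting D into the subsets D_k.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Intrinsic value IV(A) of the split itself.
    intrinsic = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    # Assumption: an attribute with a single value gets a ratio of 0.
    return info_gain / intrinsic if intrinsic > 0 else 0.0


# Toy data: the slope class separates the labels perfectly.
ratio = gain_ratio(["steep", "steep", "gentle", "gentle"], [1, 1, 0, 0])
```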

3.3.3. Support Vector Machine (SVM)

Support vector machine (SVM) is a supervised learning method [41] developed based on statistical theory and the principle of minimizing structural risk. As a binary classification model, SVM possesses global optimality and excellent generalization characteristics, making it suitable for high-dimensional datasets that are linearly inseparable. It has been widely and effectively used in various regression and classification problems. Its principle is based on finding a separating hyperplane [42,43,44,45] that maximizes the classification margin in a high-dimensional space. First, assume a set of data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, and fit and determine $\omega$ and $b$ through the linear regression function $f(x) = \omega \cdot x + b$. The tolerance $\varepsilon$ is used to control the classification error, and the corresponding linear function fitting is Equation (7) [21]:
$$\begin{cases} y_i - f(x_i) \le \varepsilon + \xi_i \\ f(x_i) - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} \ge 0 \end{cases} \qquad i = 1, 2, \ldots, n \tag{7}$$
where: $\xi_i$ and $\xi_i^{*}$ are the classification error factors (slack variables). When $\xi_i$ and $\xi_i^{*}$ are greater than 0, there are classification errors. The task is then transformed into the problem of minimizing the function shown in Formula (8), where the constant $C > 0$ penalizes errors that exceed the tolerance $\varepsilon$. The linear fitting function obtained after substituting it into the Lagrangian function is shown in Formula (9).
$$R(\omega, \xi_i, \xi_i^{*}) = \frac{1}{2}\,\omega \cdot \omega + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^{*}\right) \tag{8}$$
$$f(x) = \omega \cdot x + b = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^{*}\right)\left(x_i \cdot x\right) + b \tag{9}$$
where: $\omega$ is the weight vector that determines the direction of the hyperplane; $b$ is the bias; $C$ is the penalty factor; $\alpha_i$ and $\alpha_i^{*}$ are the SVM coefficients (the Lagrange multipliers).

3.3.4. Random Forest (RF)

Random forest (RF) is an ensemble learning method that enhances prediction robustness by constructing multiple decision trees and employing a majority voting mechanism [46]. In geological hazard assessment, RF can effectively handle high-dimensional non-linear environmental data. Compared to traditional geological hazard assessment methods, it randomly samples both the training samples and the features during training, which significantly reduces the risk of overfitting [47]. Random forest adopts the bagging algorithm. For each tree, $m$ feature attributes are randomly selected from all $M$ feature attributes ($m < M$) to construct a weak decision tree. Repeating this $n$ times yields $n$ weak decision trees $y_1(X), y_2(X), \ldots, y_n(X)$, which together form the random forest model. Its expression is as follows (Formula (10)) [48]:
$$Y(X) = \arg\max_{Z} \sum_{i=1}^{n} I\left(y_i(X) = Z\right) \tag{10}$$
where $I(\cdot)$ is the indicator function and $Z$ is a class label.
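The voting rule in Formula (10) amounts to counting, for each class label, how many trees predict it and returning the most frequent label. A minimal sketch follows; the function name is hypothetical, and breaking ties in favour of the smallest label is an assumption, since the text does not describe tie handling.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label Z that maximizes the number of trees with y_i(X) == Z."""
    counts = Counter(tree_predictions)
    # Assumption: ties are broken by taking the smallest label.
    return max(sorted(counts), key=counts.get)


# Five hypothetical trees voting on one grid cell (1 = hazard, 0 = no hazard).
label = majority_vote([1, 0, 1, 1, 0])
```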

3.3.5. Extreme Gradient Boosting (XGBoost)

XGBoost, as an emerging algorithm, introduces a regularization term when solving for the extreme value of the loss function. Among ensemble algorithms based on gradient boosting, XGBoost is faster than the traditional gradient boosting tree. Optimizing the loss function through a second-order Taylor expansion not only suppresses overfitting but also improves model accuracy. It is considered an advanced estimator with very high performance in classification and regression [49]. XGBoost’s most significant advantage is its speed; on average, its prediction performance is better than that of the random forest and almost as good as that of a deep neural network [50]. Its expression is as follows (Equation (11)) [51]:
$$Obj = \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \sum_{k} \Omega(f_k) \tag{11}$$
where: $\hat{y}_i^{(t-1)}$ is the prediction of the model after the first $t-1$ rounds; $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is the training error of the sample $x_i$; $g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)$ and $h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)$ are the first- and second-order gradient statistics of the loss function; $f_t(x)$ is the regression tree added in round $t$; $\Omega(f_k)$ is the regularization term of the $k$-th tree.
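To make the gradient statistics concrete, the sketch below computes $g_i$ and $h_i$ for the binary logistic loss and the leaf weight that minimizes the second-order objective. It assumes the loss is the log loss on a raw (pre-sigmoid) score and that the regularization term contains an L2 penalty with strength `lam` on leaf weights, as in the standard XGBoost formulation; neither choice is stated in the text, and the function names are hypothetical.

```python
import math


def logloss_grad_hess(y, raw_score):
    """First- and second-order derivatives of the log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y, p * (1.0 - p)


def leaf_weight(grads, hessians, lam=1.0):
    """Weight minimizing sum(g*w + 0.5*h*w^2) + 0.5*lam*w^2 over one leaf."""
    return -sum(grads) / (sum(hessians) + lam)


# Two positive samples currently scored at 0 (probability 0.5).
stats = [logloss_grad_hess(1, 0.0) for _ in range(2)]
w = leaf_weight([g for g, _ in stats], [h for _, h in stats], lam=1.0)
```

The positive leaf weight pushes the scores of these under-predicted positive samples upward, and a larger `lam` shrinks the step, which is how the regularization term damps overfitting.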

3.4. Confusion Matrix and Receiver Operating Characteristic (ROC) Curve

In this paper, the confusion matrix, ROC curve, and area under the curve (AUC) value are used to evaluate the prediction accuracy of the model [52]. This study refers to the relevant literature [53], where the threshold is set to 0.5. The prediction results are classified into two categories: if the predicted value is greater than 0.5, it is considered a geological disaster, and the corresponding sample is marked as positive; otherwise, it is regarded as negative. According to the actual and predicted categories of the test set sample, the four basic units of the confusion matrix can be defined: true positive (TP): the number of samples that are actually positive and correctly predicted as such. False positive (FP): the number of samples that are actually negative but are incorrectly predicted as positive. True negative (TN): the number of samples that are actually negative and correctly predicted. False negative (FN): the number of samples that are actually positive but wrongly predicted as negative. Based on the results obtained from the confusion matrix, the accuracy rate, precision rate, and recall rate are calculated to evaluate the model’s performance comprehensively.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{12}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{13}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{14}$$
According to the confusion matrix, the ROC curve can be drawn, in which the vertical axis represents the true positive rate (TPR) and the horizontal axis represents the false positive rate (FPR) [54]. The greater the area under the ROC curve (AUC), the higher the model’s accuracy. The closer the AUC value is to 1, the stronger the model’s ability to distinguish between “prone areas” and “less prone areas”. Typically, when the AUC is between 0.7 and 0.8, it indicates that the model has a better discriminant ability and can classify relatively accurately. If the AUC is higher than 0.8, the model is considered to have excellent performance, achieving not only high classification accuracy and a low misjudgment rate but also providing users with more stable and reliable prediction results [24].
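The metrics above can be sketched directly from predicted scores. The 0.5 threshold follows the text; the AUC is computed here through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half), which is equivalent to the area under the ROC curve. The function names and the four-sample example are hypothetical.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (TP, FP, TN, FN) using 'score > threshold' as the positive rule."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def basic_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall; empty denominators yield 0.0."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
counts = confusion_counts(y_true, scores)
```

The pairwise AUC is quadratic in the number of samples, so it suits small illustrations; a sorted-rank implementation would be preferable for a test set of several thousand grids.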

3.5. SHAP Algorithm

Most advanced machine learning algorithms are essentially black boxes, and their credibility and transparency need to be analyzed and enhanced through post hoc explanation tools. Among the many explanation frameworks, the SHAP algorithm is particularly prominent and can provide both global and local perspectives [55,56]. This method borrows the Shapley value from game theory to measure the marginal contribution of each feature to the output, thus clearly presenting the weight and role of the feature in the prediction. The global perspective reveals the overall importance of the features across the entire dataset. In contrast, the local perspective focuses on a single sample and analyzes, item by item, the impact of each feature on the prediction result for that sample. With this dual ability, SHAP can adapt to various application requirements and demonstrates broad application value. For a feature $i$ in the full feature set $N$, the Shapley value is calculated as follows:
$$\Phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\left(|N| - |S| - 1\right)!}{|N|!} \left[ v\left(S \cup \{i\}\right) - v(S) \right] \tag{15}$$
In the formula, $N$ represents the set of all features; $S$ is any feature subset that does not contain the feature $i$; $|S|$ is the number of features in the set $S$; $v(S)$ is the contribution of the feature set $S$ to the model prediction output; $v(S \cup \{i\})$ is the contribution of the feature set $S \cup \{i\}$, which contains the feature $i$, to the model prediction output.
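For a small number of features, the Shapley formula can be evaluated exactly by enumerating every subset, as in the sketch below. This brute-force version is for illustration only and is not the algorithm the SHAP library uses for tree models; the function names, the additive value function, and the feature weights are all made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values; `value` maps a frozenset of features to a number."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|N| - |S| - 1)! / |N|! shared by all subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


# Made-up additive value function: each feature adds a fixed amount.
weights = {"NDVI": 0.3, "elevation": 0.2, "slope": 0.1}


def toy_value(subset):
    return sum(weights[f] for f in subset)


phi = shapley_values(list(weights), toy_value)
```

For an additive value function each feature's Shapley value equals its own weight, and in general the values sum to the difference between the full-set and empty-set outputs, which is the property that makes SHAP attributions add up to the prediction.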

3.6. Geological Hazard Susceptibility Evaluation Process

To address the deficiencies of classical statistical methods in analyzing non-linear associations, as well as the overfitting and prediction volatility problems that easily occur in machine learning models, this study relies on the development laws of geological hazards in Guizhou Province and existing research results. It constructs the geological hazard susceptibility evaluation process in Guizhou Province based on the IV-LR, IV-DT, IV-SVM, IV-RF, and IV-XGB algorithms, respectively (Figure 3).

3.7. Coupling Model and Processing Flow

The technical workflow of the geological disaster susceptibility evaluation based on the information quantity–machine learning coupling model is as follows.
Step ①: Construct the evaluation system of geological disaster influencing factors. Geological disaster information and regional data for the study area are collected to establish a regional geological disaster database, from which a multifactor evaluation system is formed covering topographic features, climate conditions, ecological environment, geological and hydrological conditions, basic data, etc.
Step ②: Process the evaluation factors. To mitigate the influence of multicollinearity among factors, a Pearson correlation analysis was conducted on all candidate factors. Ultimately, nine factors with relatively low mutual correlation were selected for inclusion in the evaluation system.
Step ③: Produce the sample data. The same 30 m × 30 m grid used for the evaluation factor data serves as the basic evaluation unit. Positive samples (label 1) are all grids where potential geological disaster points are located (N = 9160). Negative samples (label 0) are obtained by establishing a buffer zone with a radius of 1000 m around the disaster points and randomly selecting an equal number of grids (N = 9160) in the non-affected area outside the buffer zone. The 1000 m radius is based on a study of the potential influence range of typical geological disasters (landslides, collapses) in Guizhou Province [57]; the buffer isolates the core disaster area and associated unstable zones so that pseudo-negative samples are not drawn by mistake (spatial dependence interference).
Step ④: Randomly divide the total sample set (18,320 positive and negative samples in total) into a training set (70%) and an independent test set (30%). All reported model performance indicators are calculated on this independent test set.
Step ⑤: Based on the sample data and evaluation factors, run the RF, DT, SVM, LR, and XGBoost models to obtain their respective results, and conduct zonal mapping of the geological disaster susceptibility evaluation results obtained from the IV-(RF, DT, SVM, LR, XGB) models.
Step ⑥: Use the SHapley Additive exPlanations (SHAP) method to quantify and rank the contribution of each geological disaster impact factor in the five coupling models, including IV-XGB and IV-RF, and generate a feature importance chart. Based on the prediction results on the test set, draw the ROC curve of each model, calculate performance indicators such as AUC value, accuracy, and precision, and comprehensively evaluate the prediction accuracy and reliability of the different coupling models.
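The sample construction in Steps ③ and ④ can be sketched as follows. This is a simplified illustration under stated assumptions: grid cells and hazard points are represented by projected coordinates in metres, distances are plain Euclidean distances, and the function names, seeds, and example coordinates are hypothetical. A real workflow over a province-wide 30 m grid would use a GIS buffer or a spatial index rather than this brute-force distance check.

```python
import math
import random


def sample_negatives(hazard_points, candidate_cells, n, buffer_m=1000.0, seed=0):
    """Randomly pick n cells lying farther than buffer_m from every hazard point."""
    rng = random.Random(seed)
    outside = [c for c in candidate_cells
               if all(math.dist(c, h) > buffer_m for h in hazard_points)]
    return rng.sample(outside, n)


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Made-up coordinates in metres: one hazard point and four candidate cells.
hazards = [(0.0, 0.0)]
cells = [(500.0, 0.0), (1500.0, 0.0), (0.0, 2000.0), (3000.0, 3000.0)]
negatives = sample_negatives(hazards, cells, n=2)
train, test = train_test_split(list(range(10)))
```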

4. Results and Analysis of Geological Disaster Susceptibility Evaluation

4.1. Selection of Evaluation Factors

The occurrence of geological disasters is a complex, non-linear process formed by the combined action of various internal and external factors. Drawing a geological disaster susceptibility map is the core method for identifying the spatial distribution of potential geological disasters. Currently, there is no established standard for selecting evaluation factors for geological disaster susceptibility. The main principles include that the evaluation factors should be measurable, operable, relevant to the disaster occurrence mechanism, and have low redundancy among them [58]. Existing studies have shown that topography is one of the main factors leading to the occurrence of geological disasters [59]. Liu Shuai et al. [60] used factors such as slope, aspect, elevation, curvature, and terrain relief as evaluation indicators. Combining the regional disaster-forming characteristics, this study primarily focuses on the dynamic factors (topography, ecology, hydrology, and human activities) that can be remotely sensed at the surface, as well as the basic geological and hydrological conditions (drainage density). Although the literature [61,62] indicates that lithology is crucial, due to the limitation of obtaining high-resolution regional lithology maps, this factor is not included in this evaluation system. This study finally selected nine influencing factors as evaluation indicators (Table 3, Figure 4).

4.2. Correlation of Evaluation Factors

The mechanism of geological disasters is relatively complex and is often the result of the combined action of multiple environmental impact factors. Because high correlation among impact factors can lower the accuracy of model evaluation, this study uses correlation analysis to test the independence of the factors. To avoid excessive mutual interference between factors, the correlation of each pair of factors is examined and redundant impact factors are removed. This paper employs Pearson correlation coefficients (PCCs) to conduct the correlation and collinearity analysis on the selected factors. As a bivariate correlation analysis method, the PCC is used to compare the degree of correlation between factors: the greater the absolute value, the stronger the correlation [57]. In general, 0 < |PCC| < 0.3 indicates no correlation, 0.3 < |PCC| < 0.6 a low correlation, 0.6 < |PCC| < 0.8 a moderate correlation, and 0.8 < |PCC| < 1 a high correlation [21]. As shown in Figure 5, the absolute values of the correlation coefficients among the nine evaluation factors are all less than 0.60 (range: −0.31 to 0.45), well below the threshold for high correlation (|PCC| > 0.8). This indicates that there is only a weak correlation between factors, which meets the requirement of low collinearity in modeling and effectively avoids the interference of multicollinearity with the prediction model.
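The independence test can be illustrated with a direct implementation of the Pearson coefficient and the grading bands quoted above. The function names and the sample values are hypothetical, and assigning the exact boundary values (0.3, 0.6, 0.8) to the higher band is an assumption, because the text gives open intervals only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_level(r):
    """Grade |r| with the bands from the text (boundary handling is assumed)."""
    a = abs(r)
    if a < 0.3:
        return "none"
    if a < 0.6:
        return "low"
    if a < 0.8:
        return "moderate"
    return "high"


# Made-up factor values for three grid cells.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Applied to the reported extremes, both −0.31 and 0.45 fall in the low band, consistent with the conclusion that no factor pair approaches the high-correlation threshold.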

4.3. Evaluation of the Accuracy of the Coupling Model

Accurate evaluation of model performance is a crucial part of this study. Metrics derived from the confusion matrix and the ROC curve, namely accuracy, precision, recall, F1 score, and the AUC value, serve as the quantitative basis for evaluation [52]. In this paper, the receiver operating characteristic (ROC) curve and the area under the curve (AUC), together with accuracy, precision, recall, and F1 score, are used to evaluate the accuracy of the disaster susceptibility results. Accuracy is the proportion of correctly predicted samples; precision is the proportion of predicted positive cases that are actually positive; recall is the proportion of actual positive cases that are successfully predicted; and the F1 score is the harmonic mean of precision and recall, which summarizes the overall performance of the model. As shown in Figure 6 and Table 4, the susceptibility assessment model built on the XGBoost algorithm outperforms the SVM, RF, IV, LR, and DT prediction models. The AUC values for the IV, XGBoost, and IV-XGB models are 0.691, 0.727, and 0.9448, respectively. The IV model gives moderate predictions, the XGBoost model performs better, and the coupled model outperforms both single models. Figure 7 shows that the ROC curve of the IV-XGB model lies closer to the top-left corner than those of the IV-RF and IV-SVM models. The AUC values for the test set are 0.9448, 0.9301, and 0.9018, respectively, indicating higher classification accuracy for the IV-XGB model.
As shown in Table 4 and Figure 8, the five coupled models achieve an accuracy of at least 0.774, a recall of at least 0.814, a precision of at least 0.854, an F1 score of at least 0.843, and an AUC value of at least 0.8656. This demonstrates that these five coupled models possess good fitting accuracy and prediction performance, validating the effectiveness of the geological disaster susceptibility assessment in Guizhou Province.
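For readers who want to reproduce metrics of this kind, the sketch below computes accuracy, precision, recall, and F1 from a confusion matrix, and AUC as the probability that a randomly chosen disaster sample scores higher than a randomly chosen non-disaster sample (ties count one half). The labels, scores, the 0.5 decision threshold, and all function names are illustrative assumptions, not values or code from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 1 (disaster) and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 as defined in the text."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and predicted susceptibility scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(classification_metrics(y_true, y_pred))
print("AUC:", auc_by_ranking(y_true, scores))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration; a production workflow would normally rely on a library routine instead.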

4.4. Geological Hazard Susceptibility Assessment Results

With Guizhou Province as the study area, the susceptibility index was calculated and then divided into five susceptibility levels (extremely low, low, medium, high, and extremely high) using the Jenks natural breaks optimization method. The distribution of disaster points across the susceptibility zones of each evaluation model is summarized in Table 5. In terms of spatial pattern, all coupled models show that geological hazard susceptibility in Guizhou Province is “high in the north and middle, low in the south and east”: the northern and central regions are mainly extremely high and high susceptibility areas, while the southern and eastern regions are mainly extremely low and low susceptibility areas (Figure 9 and Figure 10, Table 5).
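The natural breaks step can be sketched as a small dynamic program that chooses class boundaries minimizing the total within-class sum of squared deviations, which is the criterion behind Jenks natural breaks. This is a minimal illustration, not the GIS implementation used in the study; the index values and function names are hypothetical, and the O(k·n²) loop is only suitable for small samples.

```python
from bisect import bisect_left

LEVELS = ["extremely low", "low", "medium", "high", "extremely high"]


def jenks_breaks(values, n_classes):
    """Upper bound of each class under a minimum within-class variance split."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")
    # Prefix sums give the squared deviation of any slice in O(1).
    ps, ps2 = [0.0], [0.0]
    for v in data:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations of data[i:j] from its own mean."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[k][j]: best total SSD when the first j values form k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j], split[k][j] = c, i
    # Walk back through the stored split points to recover the class bounds.
    bounds, j = [], n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]


def classify(value, upper_bounds):
    """Index of the class whose upper bound first reaches the value."""
    return min(bisect_left(upper_bounds, value), len(upper_bounds) - 1)


# Hypothetical susceptibility index values for a handful of grid cells.
index_values = [0.05, 0.07, 0.10, 0.30, 0.32, 0.50, 0.52, 0.55, 0.80, 0.95, 0.97]
bounds = jenks_breaks(index_values, len(LEVELS))
for v in (0.06, 0.51, 0.96):
    print(v, "->", LEVELS[classify(v, bounds)])
```

Values above the largest bound are clipped into the top class, which is a convenience of this sketch rather than part of the method.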

4.5. Global Feature Explanation

As shown in Figure 11, the SHAP feature summary diagram gives an intuitive view of how strongly, and in which direction, each influencing factor contributes to the geological disaster susceptibility prediction model. Each point represents one sample, and its color indicates the value of the corresponding factor at that sample: red for a high value and blue for a low value. The horizontal axis is the SHAP value, which measures the contribution of the factor to the model output; a positive value means that the factor increases the predicted probability of geological disaster occurrence. The results show that the normalized difference vegetation index (NDVI) has the largest impact on the model output. The lower its value, the higher the SHAP value, indicating that geological disasters are more likely to occur where vegetation coverage is low. Land use type (LU) and elevation (DEM) also have significant impacts; the lower the elevation, the larger the SHAP value, so the predicted probability of disaster occurrence increases accordingly. Mean annual precipitation (prep), road density (road-density), terrain undulation degree (QFD), and river density (river-density) all show significant influence. The greater the road density, river density, and degree of terrain undulation, the higher the SHAP value, indicating an increase in disaster risk. Slope, aspect, and surface curvature contribute to the model prediction, but their overall impact is relatively weak.
Averaging the absolute Shapley values of each factor over all samples gives the ranking of feature importance (Figure 12). The top six factors are the normalized difference vegetation index (NDVI), land use type (LU), elevation (DEM), precipitation (prep), road density (road-density), and terrain undulation degree (QFD). River density (river-density), slope (slope), aspect (aspect), and curvature (curvature) have relatively low importance for susceptibility in the study area.
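The ranking step can be sketched as follows, assuming the per-sample SHAP values have already been produced by an explainer. The matrix below is invented for illustration (three samples, four factors), and the function name is hypothetical; only the averaging of absolute values follows the description in the text.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value over all samples.

    shap_rows: one list per sample, holding one SHAP value per feature.
    Returns (feature, importance) pairs sorted from most to least important.
    """
    n_samples = len(shap_rows)
    importance = {
        name: sum(abs(row[k]) for row in shap_rows) / n_samples
        for k, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values, NOT taken from the study.
feature_names = ["NDVI", "LU", "DEM", "slope"]
shap_rows = [
    [0.40, -0.10, 0.20, 0.05],
    [-0.60, 0.20, -0.10, 0.00],
    [0.50, -0.30, 0.15, -0.10],
]

for name, score in rank_by_mean_abs_shap(shap_rows, feature_names):
    print(f"{name:>6}: {score:.3f}")
```

Because absolute values are averaged, the ranking reflects the size of a factor's influence and says nothing about its direction, which is why the summary diagram and the importance ranking are read together.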

5. Discussion and Outlook

5.1. Superiority of the Coupled Model

The accuracy and reliability of a geological disaster susceptibility model directly determine how well prevention and control measures can be targeted. This study compared the evaluation results of five single models (decision tree (DT), logistic regression (LR), random forest (RF), support vector machine (SVM), and XGBoost (XGB)) and their coupling models with the information value (IV) model. The area distribution of each model and the statistical characteristics of disaster points reveal significant differences.
Among the single models, the IV, DT, and LR models have clear limitations. The medium and low prone areas of the IV model account for 76.96%, but its extremely low prone areas still contain 1.63% of the disaster points, indicating that its ability to identify non-prone areas is limited. The results of the DT model are strongly polarized, dominated by extremely low (49.09%) and extremely high (42.17%) prone areas. The intermediate transition grade accounts for only 1.33% of the area, and as many as 46.33% of the disaster points fall in the extremely low prone area, so the reliability of this model is low. The results of the LR model are better than those of DT, and the spatial continuity of the gradient distribution of disaster points is generally maintained. In contrast, the RF and XGB models perform well, and the area distribution of each prone level is more balanced. The percentage of disaster points increases consistently from the extremely low to the extremely high level, which shows that these models distinguish the risk levels accurately and give the most reliable predictions among the single models.
Coupling IV with the machine learning models significantly increased risk concentration, and the proportion of extremely high-risk areas exceeded 38% in all coupling models. Among them, the risk gradient of the IV-DT and IV-LR coupling models is broken: the proportion of medium and high prone areas (IV-DT) or low and very low prone areas (IV-LR) is unbalanced, which produces abrupt jumps in the zoning results that are inconsistent with the continuity of actual geological risk. The IV-SVM coupling model performs better, with a continuous transition from extremely high (44.94%) to extremely low (12.98%) values. The IV-RF and IV-XGB coupling models give the best results. Both concentrate about 49% of disaster points in extremely high prone areas (IV-RF: 49.69%, IV-XGB: 48.25%) and form a smooth and continuous risk gradient (for example, IV-XGB: 38.80% → 20.13% → 15.61% → 14.35% → 11.11%), indicating that the model successfully integrates the interpretability of geological factors with the prediction accuracy of machine learning algorithms, and that its evaluation results are the most practical.
From the perspective of spatial distribution (Figure 9 and Figure 10), the susceptibility zoning results of all models show that susceptibility to geological disasters in Guizhou Province is “high in the north and middle, low in the south and east”. High and extremely high prone areas dominate the northern and central regions, where the terrain is undulating, the geological structure is complex, and human engineering activities are frequent. The combination of these disaster-causing factors forms a high-risk background for geological disasters. The southern and eastern regions are characterized by low and very low prone areas, with relatively flat terrain and a more stable geological environment. This pattern is highly consistent with the regional geological background, which supports the applicability and rationality of the models in this study.
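The gradient argument can be made concrete with a small check on the share assigned to each susceptibility level, ordered from extremely high to extremely low. The IV-XGB shares below are the figures quoted in the text; the polarized sequence is a hypothetical example of a broken gradient, and the function names are illustrative only.

```python
LEVELS = ["extremely high", "high", "medium", "low", "extremely low"]


def is_monotonic_decreasing(shares):
    """True when every level has a strictly smaller share than the one before."""
    return all(a > b for a, b in zip(shares, shares[1:]))


def largest_jump(shares):
    """Largest drop between adjacent levels, in percentage points."""
    return max(a - b for a, b in zip(shares, shares[1:]))


# Shares quoted in the text for the IV-XGB coupling model (percent).
iv_xgb_shares = [38.80, 20.13, 15.61, 14.35, 11.11]
# Hypothetical polarized result with almost no intermediate levels.
polarized_shares = [45.0, 2.0, 1.5, 3.5, 48.0]

for label, shares in (("IV-XGB", iv_xgb_shares), ("polarized", polarized_shares)):
    print(
        f"{label}: total = {sum(shares):.2f}%, "
        f"continuous gradient = {is_monotonic_decreasing(shares)}, "
        f"largest jump = {largest_jump(shares):.2f} points"
    )
```

A strictly decreasing sequence is only one possible formalization of a continuous gradient; the study itself judges continuity from the zoning statistics in Table 5 and the maps.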

5.2. Applicability of Interpretability Methods

NDVI ranks first, indicating that vegetation cover is the most critical factor affecting geological disaster susceptibility, especially in karst bare rock areas [33], and suggesting that vegetation loss may significantly increase disaster risk. The dominant contribution of land use type is consistent with the research of Wang et al. [25] in Guizhou, which emphasizes that human activities, such as the development of sloping farmland, exacerbate disaster risk by changing the hydrological cycle. Road density often acts together with the slope factor, because road cutting directly disturbs rock and soil masses and damages slope stability. Meanwhile, heavy rainfall weakens rock and soil strength by increasing pore water pressure. In the northern Wujiang River Basin, the superposition of high road density areas and strong rainfall belts (Figure 8) further amplifies disaster probability, supporting the disaster-causing mechanism of the ternary coupling of “human activities–rainfall–topography”. In terms of the direction of influence shown in the SHAP global explanation diagram (Figure 11), several factors show a clear monotonic trend: for example, the lower the DEM, river density, and NDVI values, the higher the SHAP values, while for factors such as road density and terrain undulation, higher values correspond to larger SHAP values and therefore to increased disaster risk. The SHAP value is relatively high within a specific range of slope aspect, indicating some directionality in disaster occurrence. These results are consistent with the formation mechanism of geological disasters and indicate that the SHAP method can effectively reveal the complex, non-linear relationships between model factors and their contributions to prediction and decision making.

5.3. Limitations

Based on the information value–machine learning coupling model, this study has completed the susceptibility evaluation of geological disasters in Guizhou Province; however, it can be developed further at two levels: data basis and model application. In terms of data basis, there are two limitations. First, this study is limited by the availability of high-resolution regional lithology data, so lithology is not included in the evaluation system. Lithology is the basic geological condition that controls the development of geological disasters. Especially in karst landform areas such as Guizhou Province, the engineering geological properties of karst and non-karst strata differ significantly, and the absence of this factor may reduce the discriminant accuracy of the model in specific areas. Second, the spatial distribution of the historical disaster point data used in this study is controlled not only by the natural disaster-inducing environment but also by human engineering activities (such as road slope cutting and urban construction) and by differences in survey intensity (such as more detailed surveys in densely populated areas). This may lead to an excessive concentration of disaster records in areas with strong human activity and therefore to potential bias in model training. In view of these limitations, future research can be deepened in the following directions:
(1) Data integration and optimization: priority should be given to obtaining and integrating high-precision lithology data to describe more comprehensively how geological background conditions control disasters. In addition, the density of disaster points per unit area (for example, points per km²) could be used to normalize the susceptibility evaluation results, weakening the distribution bias caused by human activities and survey differences and quantifying the relative risk of each region more objectively.
(2) Model application deepening: this study focuses on susceptibility assessment (that is, “which areas are prone to disasters”). To achieve real risk management and control, follow-up work should integrate spatial data on disaster-affected bodies, such as population distribution, GDP, and key infrastructure, and carry out vulnerability and risk assessment (that is, “how much damage may be caused by disasters”) on the basis of susceptibility. This would provide a more targeted and operational decision-making basis for land and space planning, site selection for major projects, and the optimal allocation of disaster prevention and mitigation resources.
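The density normalization proposed in item (1) might look like the sketch below, which converts raw disaster point counts into points per km² and then into a ratio against the overall density of the study area. The zone names, counts, and areas are invented, and the ratio is one possible normalization chosen for illustration; the study does not specify a formula.

```python
def point_density(zones):
    """Disaster points per km2 for each zone.

    zones: mapping of zone name -> (number of points, area in km2).
    """
    return {name: n / area for name, (n, area) in zones.items()}


def relative_density(zones):
    """Zone density divided by the overall density of all zones combined.

    A value above 1 means the zone holds more points per km2 than average.
    """
    total_points = sum(n for n, _ in zones.values())
    total_area = sum(area for _, area in zones.values())
    overall = total_points / total_area
    return {name: (n / area) / overall for name, (n, area) in zones.items()}


# Hypothetical zones, NOT the study's statistics.
zones = {
    "densely surveyed valley": (80, 100.0),
    "remote upland": (20, 300.0),
}

for name, value in relative_density(zones).items():
    print(f"{name}: {value:.2f} times the overall density")
```

Comparing zones by relative density rather than by raw counts reduces the effect of zone size, although it does not by itself correct for uneven survey intensity.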

6. Conclusions

(1) Based on the matching degree and gradient continuity between susceptibility zoning and disaster point distribution, the IV-RF and IV-XGB coupling models demonstrated the best performance in the geohazard susceptibility assessment of Guizhou Province (with AUC values of 0.9301 and 0.9448, respectively). The evaluation results of both models showed continuous and smooth transition zones in risk. The extremely high and high susceptibility zones concentrated 68.16% (IV-RF: 18.47% + 49.69%) and 69.29% (IV-XGB: 21.04% + 48.25%) of the historical disaster points in the entire region, respectively. These models clearly identified the two high-risk areas—the northern Wujiang River Basin and the central karst core area—in space, accurately capturing the disaster-causing mechanism of the “terrain–hydrology–human activity” coupling. The reliability of these models is the highest.
(2) Based on the factor contribution analysis using SHAP, the main controlling factor system affecting the development of geological disasters in Guizhou Province was revealed. NDVI, LU, and DEM were the most important core indicators dominating the spatial differentiation of susceptibility, with their contributions significantly higher than those of factors such as precipitation (prep), road density (road-density), and topographic factors like slope, aspect, and curvature.
(3) The study revealed significant differences in the performance of the coupled models. Among them, the IV-RF, IV-SVM, and IV-XGB coupling models can serve as core decision-making bases due to their balanced zoning, continuous gradients, and high prediction accuracy (AUC > 0.90). In contrast, the IV-DT and IV-LR models exhibited broken risk gradients and should be applied with caution. This study validates the effectiveness of the coupling framework between the information value model and machine learning algorithms in complex geological environments, providing multiple optimized modeling options for disaster prediction in Guizhou Province and similar regions.

Author Contributions

J.C.: Writing—Original Draft, Investigation, Methodology, Resources. F.W.: Writing—Editing and Review, Investigation, Funding. H.H.: Investigation, Methodology. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Guizhou Provincial Basic Research Program (Natural Science) (Qiankehe Foundation MS[2025]250, ZK[2023]264).

Data Availability Statement

The data will be made available upon request.

Acknowledgments

The authors would like to extend their sincere appreciation to the reviewers and editors for their valuable contributions in enhancing the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest regarding this manuscript, and it has not been previously published.

References

  1. Liu, C.; Shen, W.; Huang, S. Strategic Thinking on the Prevention and Response of Geological Disasters in China. J. Catastrophol. 2022, 37, 1–4+11.
  2. Ge, Q.; Sun, Q.; Zhang, N.; Hu, J. Evaluation of Geological Hazard Susceptibility of Baiyin City Based on Multi-temporal InSAR Deformation Measurements. Geomat. Inf. Sci. Wuhan Univ. 2024, 49, 1434–1443.
  3. Wang, J.; Kang, Y.; Feng, B. Disaster resilience in the geohazard-prone mountainous areas: Evidence from the Hengduan Mountain, southwest China. Int. J. Disaster Risk Reduct. 2025, 119, 105331.
  4. Guizhou Provincial Department of Natural Resources. 2023 Geological Disaster Prevention and Control Plan of Guizhou Province. [EB/OL]. 16 March 2023. Available online: https://zrzy.guizhou.gov.cn/wzgb/zwgk/zdlyxxgk/dzkcgl/dzzhfz/202309/t20230921_82465226.html (accessed on 16 March 2023).
  5. Fan, H.; Li, J.; Wang, J.H.; Sun, W. Exploration of Experience in Prevention, Response and Disposal of Major and Above-Grade Geological Disasters-Taking Guizhou Province as an Example. China Emerg. Rescue 2024, 2, 74–79.
  6. Zhang, B.; Yang, Z.; Hu, Y.; Chen, J. Analysis of Slope Deformation Characteristics and Stability under Underground Mining in Guizhou Mountainous Areas. Resour. Inf. Eng. 2025, 40, 63–67+71.
  7. Huang, L.; Sun, Q.; Hu, J. Landslide Susceptibility Evaluation and Error Correction Based on InSAR and Random Forest. Bull. Surv. Mapp. 2022, 10, 13–20.
  8. Lan, Y.; Guo, C.; Zhu, Y. Review of Geological Hazard Susceptibility Evaluation Methods. Geol. Resour. 2024, 33, 65–73.
  9. Zhang, R.; Zhang, D.; Shu, B.; Chen, Y. Predicting the Spatial Distribution of Geological Hazards in Southern Sichuan, China, Using Machine Learning and ArcGIS. Land 2025, 14, 577.
  10. Wang, H.; Xu, J.; Tan, S.; Zhou, J. Landslide Susceptibility Evaluation Based on a Coupled Informative—Logistic Regression Model—Shuangbai County as an Example. Sustainability 2023, 15, 12449.
  11. Xu, K.; Guo, Q.; Li, Z.W.; Xiao, J.; Qin, Y.S.; Chen, D.; Kong, C.F. Landslide susceptibility evaluation based on BPNN and GIS: A case of Guojiaba in the Three Gorges Reservoir Area. Int. J. Geogr. Inf. Sci. 2015, 29, 1111–1124.
  12. Yuan, X.; Liu, C.; Nie, R.; Yang, Z.; Li, W.; Dai, X.; Cheng, J.; Zhang, J.; Ma, L.; Fu, X.; et al. A Comparative Analysis of Certainty Factor-Based Machine Learning Methods for Collapse and Landslide Susceptibility Mapping in Wenchuan County, China. Remote Sens. 2022, 14, 3259.
  13. Zhou, X.; Wen, H.; Zhang, Y.; Xu, J.; Zhang, W. Landslide susceptibility mapping using hybrid random forest with GeoDetector and RFE for factor optimization. Geosci. Front. 2021, 12, 101211.
  14. Lee, S.; Lee, M.J.; Jung, H.S.; Lee, S. Landslide susceptibility mapping using Naïve Bayes and Bayesian network models in Umyeonsan, Korea. Geocarto Int. 2019, 35, 1665–1679.
  15. Liu, J.; Li, S.; Chen, T. Landslide Susceptibility Evaluation Based on Optimized Random Forest Model. Geomat. Inf. Sci. Wuhan Univ. 2018, 43, 1085–1091.
  16. Chen, X.; Shi, Y.; Zhao, K.; Wen, Y. Landslide Susceptibility Evaluation Based on CF-Integrated SSA Optimizing SVM and RF Models. J. Xi’an Univ. Technol. 2024, 40, 121–131+142.
  17. Xu, S.; Liu, J.; Wang, X.; Zhang, Y.; Lin, R.; Zhang, M.; Liu, M.; Jiang, T. Landslide Disaster Susceptibility Evaluation Method with Entropy Index Integrated into Support Vector Machine—Taking Shaanxi Province as an Example. Geomat. Inf. Sci. Wuhan Univ. 2020, 45, 1214–1222.
  18. Ma, S.; Xu, C.; Tian, Y.; Xu, X. Risk Assessment of Earthquake-Induced Landslides in Jiuzhaigou Based on Logistic Regression Model. Seismol. Geol. 2019, 41, 162–177.
  19. Li, X.; Xue, G.; Xia, N.; Liu, C.; Yang, Y.; Ma, B. Study on the Geological Disaster Susceptibility of National Tropical Rainforest Parks Based on CF, CF-LR and CF-AHP Models: A Case Study of Baoting, Hainan. Geoscience 2023, 37, 1033–1043.
  20. Wang, M.; Wang, Q.; Pian, F.; Shan, P.; Li, Z.; Ma, Z. Quantitative analysis method of diabetes blood Raman spectroscopy based on XGBoost. Spectrosc. Spectr. Anal. 2022, 42, 1721–1727.
  21. He, W.; Chen, G.; Zhao, J.; Lin, Y.; Qin, B.; Yao, W.; Cao, Q. Landslide Susceptibility Evaluation of Machine Learning Based on Information Volume and Frequency Ratio: A Case Study of Weixin County, China. Sensors 2023, 23, 2549.
  22. Huang, C.; Yan, X.; Mei, H.; Zhou, C.; Huang, G. Geological hazard susceptibility assessment based on a random forest weighted information model: A case study of Shidian County, Yunnan Province. Chin. J. Geol. Hazard Control 2025, 36, 151–159.
  23. Wei, W.; Jia, Y.; Sheng, Y.; Xu, G.; Yang, Y.; Zhang, D. Research on Landslide Disaster Susceptibility Evaluation Models Based on I, SVM, and I-SVM. Saf. Environ. Eng. 2023, 30, 136–144.
  24. Kong, J.; Zhuang, J.; Peng, J.; Zhan, J.; Ma, P.; Mou, J. Landslide Susceptibility Evaluation in the Loess Plateau Based on Information Quantity and Convolutional Neural Network. Earth Sci. 2023, 48, 1711–1729.
  25. Wang, W.D.; Xie, C.M.; Du, X.G. Landslides susceptibility mapping based on geographical information system, GuiZhou, south-west China. Environ. Geol. 2008, 58, 33–43.
  26. Wang, W.-D.; Guo, J.; Fang, L.-G.; Chang, X.-S. A subjective and objective integrated weighting method for landslide susceptibility mapping based on GIS. Environ. Earth Sci. 2011, 65, 1705–1714.
  27. Liu, T.; Tan, J.; Guo, F.; Pan, Y.; Wang, L. Research on the weight correction method in the evaluation of landslide susceptibility of artificial cut slopes—Taking Shadi Town, Ganzhou City as an example. J. Nat. Disasters 2021, 30, 217–225.
  28. Chelgani, S.C.; Fatahi, R.; Pournazari, A.; Nasiri, H. Modeling energy consumption indexes of an industrial cement ball mill for sustainable production. Sci. Rep. 2025, 15, 18514.
  29. Chen, X.; Zuo, H.; Liao, B.; Sun, R. Coronary Heart Disease Prediction and Its Feature Analysis Model Combining XGBoost and SHAP. Appl. Res. Comput. 2022, 39, 1796–1804.
  30. Yu, H.; Wei, L.; Jing, C.; Fangqing, D. Temporal and Spatial Distribution Characteristics and Trends of Geological Hazards in Guizhou. J. Geol. 2022, 46, 291–299.
  31. Yang, Z. Population Growth and Ecological Environment in Guizhou. Environ. Prot. Sci. Technol. 1989, 2, 7–10.
  32. Dang, P. Developmental Characteristics and Genesis Analysis of Collapse (Dangerous Rock Mass) Geological Hazards in Wudang District, Guiyang City. Min. Equip. 2024, 8, 103–105.
  33. Zhangli, J.; Gaopeng, L.; Mingtao, Z.; Wennian, X. Altitude Characteristics of Grassland Community Diversity and Soil Physical and Chemical Property Characteristics in Karst Mountains. J. Ecol. Environ. 2019, 28, 661–668.
  34. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Curran Associates Inc.: Red Hook, NY, USA, 2012; pp. 2951–2959.
  35. Chen, B. Research on Fine Risk Assessment Method of Landslide Disasters Based on the Coupling of Information Quantity Model and Machine Learning. Master’s thesis, East China University of Technology, Nanchang, China, 2023.
  36. Wang, J.J.; Yin, K.L.; Xiao, L.L. Evaluation of landslide hazard susceptibility based on GIS and information volume—Taking Wanzhou district in the Three Gorges reservoir area as an example. Chin. J. Rock Mech. Eng. 2014, 33, 797–808.
  37. Shao, C. Risk Assessment of Shallow Soil Landslide Geological Disasters. Master’s thesis, Guizhou University, Guiyang, China, 2021.
  38. Zhao, Z.; Zhang, F.; Zheng, J. Evaluation of landslide susceptibility by multiple adaptive regression spline method. Geomat. Inf. Sci. Wuhan Univ. 2021, 46, 442–450.
  39. Zhu, A.; Miao, Y.; Wang, R.; Zhu, T.; Deng, Y.; Liu, J.; Yang, L.; Qin, C.; Hong, H. A comparative study of an expert knowledge-based model and two data-driven models for landslide susceptibility mapping. CATENA 2018, 166, 317–327.
  40. Pan, W.; Gao, W. Geological Hazard Susceptibility Prediction Based on Decision Tree Algorithm. In Proceedings of the 2024 IEEE 6th Advanced Information Management, Communications, Electronic, and Automation Control Conference (IMCEC), Chongqing, China, 24–26 May 2024; pp. 811–814.
  41. Zhao, Z.; Liu, Z.Y.; Xu, C. Slope Unit-Based Landslide Susceptibility Mapping Using Certainty Factor, Support Vector Machine, Random Forest, CF-SVM and CF-RF Models. Front. Earth Sci. 2021, 9, 589630.
  42. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  43. Huang, F.; Cao, Z.; Guo, J.; Jiang, S.-H.; Li, S.; Guo, Z. Comparisons of heuristic, general statistical and machine learning models for landslide susceptibility prediction and mapping. CATENA 2020, 191, 104580.
  44. Orhan, O.; Bilgilioglu, S.S.; Kaya, Z.; Ozcan, A.K.; Bilgilioglu, H. Assessing and mapping landslide susceptibility using different machine learning methods. Geocarto Int. 2020, 37, 2795–2820.
  45. Huang, F.; Hu, S.; Yan, X.; Li, M.; Wang, J.; Li, W.; Guo, Z.; Fan, W. Prediction Modeling of Landslide Susceptibility Based on Machine Learning and Identification of Its Main Controlling Factors. Bull. Geol. Sci. Technol. 2022, 41, 79–90.
  46. Pyakurel, A.; Dahal, B.K.; Gautam, D. Does machine learning adequately predict earthquake-induced landslides? Soil Dyn. Earthq. Eng. 2023, 171, 107994.
  47. Abdelkader, M.M.; Csámer, Á. Comparative assessment of machine learning models for landslide susceptibility mapping: A focus on validation and accuracy. Nat. Hazards 2025, 121, 10299–10321.
  48. Zhang, A.; Zhao, X.; Zhao, X.; Zheng, X.; Zeng, M.; Huang, X.; Wu, P.; Jiang, T.; Wang, S.; He, J.; et al. Comparative study of different machine learning models in landslide susceptibility assessment: A case study of Conghua District, Guangzhou, China. China Geol. 2024, 7, 104–115.
  49. Sheridan, R.P.; Wang, W.M.; Liaw, A.; Ma, J.; Gifford, E.M. Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2016, 56, 2353–2360.
  50. Wu, H.; Zhou, C.; Liang, X.; Yuan, P.; Yu, L. Assessment of landslide susceptibility mapping based on XGBoost model: A case study of Yanshan Township. Chin. J. Geol. Hazard Control 2023, 34, 141–152.
  51. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
  52. Zhang, W.G.; He, Y.W.; Wang, L.Q.; Liu, S.L.; Chen, B.L. A machine learning analysis method for landslide susceptibility based on river basin division: A case study of Fengjie County, Chongqing Municipality. Earth Sci. 2023, 48, 2024–2038.
  53. Lihao, D.; Yanhui, L.; Junbao, H.; Haining, L. A regional landslide disaster early warning model for Fujian Province based on a convolutional neural network. Hydrogeol. Eng. Geol. 2024; in press.
  54. Hu, Q.; Wang, Y. Landslide Disaster Susceptibility Evaluation in the Geomorphic Transition Zone of Western Sichuan Based on GIS. J. Chengdu Univ. Technol. (Sci. Technol. Ed.) 2018, 45, 746–753.
  55. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4765–4774.
  56. Kannangara, K.K.P.M.; Zhou, W.H.; Ding, Z.; Hong, Z. Investigation of feature contribution to shield tunneling-induced settlement using Shapley Additive Explanations method. J. Rock Mech. Geotech. Eng. 2022, 14, 1052–1063.
  57. Niandong, D.; Yuxin, L.; Yangyang, C.; Hui, S.; Yalei, G. Landslide susceptibility assessment based on a hybrid machine learning model. Sci. Technol. Eng. 2022, 22, 5539–5547.
  58. Ayalew, L.; Yamagishi, H. The application of GIS-based logistic regression for landslide susceptibility mapping in the Kakuda-Yahiko Mountains, Central Japan. Geomorphology 2005, 65, 15–31.
  59. Wang, R.; Wang, X.; Liu, H.; Sun, J.; Wang, X.; Zhang, S. Identification of collapse and landslide disasters and analysis of main controlling factors based on fine DEM—Taking the Jiacha—Langxian section of the Yarlung Zangbo River suture zone as an example. J. Eng. Geol. 2019, 27, 1146–1152.
  60. Liu, S.; Zhu, J.; Yang, D.; Ma, B. Hazard assessment of collapse and landslide geological disasters under different rainfall conditions. Bull. Geol. Sci. Technol. 2024, 43, 253–267.
  61. Liu, R.; Shi, S.; Sun, D.; Xu, J. Landslide susceptibility zoning in Wushan County based on GIS and random forest. J. Chongqing Norm. Univ. (Nat. Sci. Ed.) 2020, 37, 86–96.
  62. Ahmed, N.; Firoze, A.; Rahman, R.M. Machine learning for predicting landslide risk of Rohingya refugee camp infrastructure. J. Inf. Telecommun. 2020, 4, 175–198.
Figure 1. General situation of the study area and distribution of potential geological hazard points.
Figure 1. General situation of the study area and distribution of potential geological hazard points.
Applsci 15 10077 g001
Figure 2. Information Quantity of Evaluation Factors for the Susceptibility to Geological Disasters.
Figure 2. Information Quantity of Evaluation Factors for the Susceptibility to Geological Disasters.
Applsci 15 10077 g002
Figure 3. Flowchart of geological hazard susceptibility evaluation based on the information quantity–machine learning coupling model.
Figure 3. Flowchart of geological hazard susceptibility evaluation based on the information quantity–machine learning coupling model.
Applsci 15 10077 g003
Figure 4. Classification of susceptibility evaluation factors.
Figure 4. Classification of susceptibility evaluation factors.
Applsci 15 10077 g004
Figure 5. Correlation analysis of evaluation factors.
Figure 6. ROC curves and AUC values of the IV model and different machine learning models.
Figure 7. ROC curves and AUC values of different coupling models.
Figure 8. Radar chart comparing the performance metrics of different IV-ML models.
Figure 9. Geological Disaster Susceptibility Zonation Map Based on the Information Value (IV) Model and Machine Learning Models.
Figure 10. Geological Disaster Susceptibility Zoning Map.
Figure 11. SHAP global explanation: feature summary plot.
Figure 12. Feature Importance Based on SHAP.
Table 1. Data and their sources.
| Data Name | Data Source | Data Content and Processing |
| --- | --- | --- |
| Geological Hazard Hidden Danger Points | Chinese Academy of Sciences Resource and Environmental Science Data Platform (https://www.resdc.cn/data.aspx?DATAID=290) | Obtain the geological hazard hidden danger point data of Guizhou Province, containing longitude and latitude information of disaster points as of 2019 |
| Topographic Data | Geospatial Data Cloud (https://www.gscloud.cn/) | Based on the provided ASTER GDEM V3 digital elevation model with a 30 m resolution, extract topographic factors, including elevation, slope, aspect, and curvature |
| Land Use | U.S. Geological Survey (USGS) (https://www.usgs.gov/) | Obtain GlobalLand30 2020 global land cover data through the Geosciences and Environmental Change Science Center and reclassify it into six land use types |
| Rainfall | PANGAEA Data Publisher for Earth & Environmental Science (https://www.pangaea.de/) | Obtain multiyear average rainfall data from 1981 to 2020 |
| Human Activities | OpenStreetMap (https://www.openstreetmap.org/) | Based on the road network and water system vector data in 2023, calculate the road density and water system density within 1 km² grids |
| Vegetation Index | Google Earth Engine (GEE) platform (https://earthengine.google.com/) | Calculate the annual average normalized difference vegetation index (NDVI) based on Landsat 8 OLI images from 2020 to 2022 |

Note: The processing code used in this study can be obtained from the corresponding author upon reasonable request.
Table 2. Experimental model parameter settings.
| Model | Parameter Ranges | Best Parameters |
| --- | --- | --- |
| RF | n_estimators: [50, 500]; max_depth: [3, 50]; min_samples_split: [2, 20]; min_samples_leaf: [1, 10] | n_estimators: 189; max_depth: 37; min_samples_split: 5; min_samples_leaf: 7; max_features: None; bootstrap: true; class_weight: "balanced" |
| SVM | C: [0.001, 1000]; degree: [2, 5] | C: 21.08; kernel: "linear"; gamma: "scale"; degree: 4; class_weight: none |
| DT | max_depth: [3, 50]; min_samples_split: [2, 20]; min_samples_leaf: [1, 10] | max_depth: 50; min_samples_split: 10; min_samples_leaf: 6; max_features: "sqrt"; class_weight: "balanced" |
| XGBoost | n_estimators: [50, 500]; max_depth: [3, 15]; learning_rate: [0.001, 0.3]; subsample: [0.5, 1.0]; colsample_bytree: [0.5, 1.0]; gamma: [0, 10]; reg_alpha: [0, 10]; reg_lambda: [1, 10] | n_estimators: 456; max_depth: 5; learning_rate: 0.109; subsample: 0.595; colsample_bytree: 0.948; gamma: 1.231; reg_alpha: 3.049; reg_lambda: 7.304 |
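One quick way to read Table 2 is to compare each tuned value with its search range. The sketch below is illustrative only and is not the authors' code: the dictionary layout and the helper names `params_out_of_range` and `params_on_bound` are assumptions, and only parameters that have both a range and a best value in Table 2 are included.

```python
# Search ranges and tuned values transcribed from Table 2 (only parameters
# that list both a range and a best value). The dictionary layout and the
# helper names are illustrative assumptions, not the authors' code.
SEARCH_SPACE = {
    "RF": {"n_estimators": (50, 500), "max_depth": (3, 50),
           "min_samples_split": (2, 20), "min_samples_leaf": (1, 10)},
    "SVM": {"C": (0.001, 1000), "degree": (2, 5)},
    "DT": {"max_depth": (3, 50), "min_samples_split": (2, 20),
           "min_samples_leaf": (1, 10)},
    "XGBoost": {"n_estimators": (50, 500), "max_depth": (3, 15),
                "learning_rate": (0.001, 0.3), "subsample": (0.5, 1.0),
                "colsample_bytree": (0.5, 1.0), "gamma": (0, 10),
                "reg_alpha": (0, 10), "reg_lambda": (1, 10)},
}
BEST = {
    "RF": {"n_estimators": 189, "max_depth": 37,
           "min_samples_split": 5, "min_samples_leaf": 7},
    "SVM": {"C": 21.08, "degree": 4},
    "DT": {"max_depth": 50, "min_samples_split": 10, "min_samples_leaf": 6},
    "XGBoost": {"n_estimators": 456, "max_depth": 5, "learning_rate": 0.109,
                "subsample": 0.595, "colsample_bytree": 0.948,
                "gamma": 1.231, "reg_alpha": 3.049, "reg_lambda": 7.304},
}


def params_out_of_range(model):
    """Return tuned parameters that fall outside their search range."""
    return [name for name, (low, high) in SEARCH_SPACE[model].items()
            if not low <= BEST[model][name] <= high]


def params_on_bound(model):
    """Return tuned parameters that sit exactly on a search bound."""
    return [name for name, (low, high) in SEARCH_SPACE[model].items()
            if BEST[model][name] in (low, high)]


for model in SEARCH_SPACE:
    print(model, "out of range:", params_out_of_range(model),
          "on bound:", params_on_bound(model))
```

With the values as transcribed, every tuned parameter lies inside its range, and only the DT `max_depth` (50) sits on a bound, which may indicate that the upper limit constrained that search. If the SVM settings are scikit-learn `SVC` parameters, as their names suggest, `degree` has no effect with a linear kernel.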
Table 3. Evaluation indicators for geological disaster susceptibility in Guizhou Province.
| Goal Layer | Criterion Layer | Alternative Layer |
| --- | --- | --- |
| Geohazard susceptibility assessment | Topography and geomorphology | Slope; aspect; elevation (DEM); curvature |
| | Climatic conditions | Precipitation (prep) |
| | Eco-environment | Normalized difference vegetation index (NDVI); land use type (LU) |
| | Geology and hydrology | River density (river-density) |
| | Basic data | Road density (road-density) |
Table 4. Performance evaluation results of coupling models.
| Model | Accuracy | Recall | Precision | F1 | AUC |
| --- | --- | --- | --- | --- | --- |
| IV | 0.640 | 0.638 | 0.644 | 0.641 | 0.6912 |
| DT | 0.588 | 0.595 | 0.59 | 0.593 | 0.5899 |
| LR | 0.611 | 0.600 | 0.618 | 0.609 | 0.6619 |
| RF | 0.653 | 0.674 | 0.651 | 0.662 | 0.7059 |
| SVM | 0.655 | 0.671 | 0.654 | 0.663 | 0.708 |
| XGB | 0.670 | 0.676 | 0.671 | 0.674 | 0.7267 |
| IV-DT | 0.774 | 0.832 | 0.854 | 0.843 | 0.881 |
| IV-LR | 0.796 | 0.825 | 0.904 | 0.863 | 0.8656 |
| IV-RF | 0.798 | 0.836 | 0.890 | 0.862 | 0.9301 |
| IV-SVM | 0.802 | 0.840 | 0.890 | 0.865 | 0.9018 |
| IV-XGB | 0.796 | 0.814 | 0.924 | 0.865 | 0.9448 |
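The F1 column in Table 4 can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The sketch below recomputes it from the rounded table values; the names `TABLE_4`, `f1_score`, and `max_f1_gap` are illustrative assumptions, and small differences from the reported figures are expected because the published precision and recall are rounded to three decimals.

```python
# (precision, recall, reported F1) transcribed from Table 4.
# Variable and function names are illustrative assumptions.
TABLE_4 = {
    "IV": (0.644, 0.638, 0.641),
    "DT": (0.59, 0.595, 0.593),
    "LR": (0.618, 0.600, 0.609),
    "RF": (0.651, 0.674, 0.662),
    "SVM": (0.654, 0.671, 0.663),
    "XGB": (0.671, 0.676, 0.674),
    "IV-DT": (0.854, 0.832, 0.843),
    "IV-LR": (0.904, 0.825, 0.863),
    "IV-RF": (0.890, 0.836, 0.862),
    "IV-SVM": (0.890, 0.840, 0.865),
    "IV-XGB": (0.924, 0.814, 0.865),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def max_f1_gap(rows):
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())


for name, (p, r, f1) in TABLE_4.items():
    print(f"{name:7s} recomputed F1 = {f1_score(p, r):.4f}  reported = {f1}")
print(f"largest gap: {max_f1_gap(TABLE_4):.4f}")
```

On the transcribed values, every recomputed F1 agrees with the reported one to within 0.001, which is consistent with rounding alone.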
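Table 5. Statistical Information on Geological Disaster Susceptibility.
| Model | Susceptibility Class | Area (sq km) | Pixel Percentage (%) | Disaster Count | Disaster Percentage (%) |
| --- | --- | --- | --- | --- | --- |
| IV | Extremely Low | 5706.71 | 3.46 | 149 | 1.63 |
| | Low | 59,458.11 | 36.01 | 2200 | 24.02 |
| | Medium | 67,613.06 | 40.95 | 3782 | 41.29 |
| | High | 39,823.49 | 24.12 | 2684 | 29.3 |
| | Extremely High | 3293.95 | 1.99 | 345 | 3.77 |
| DT | Extremely Low | 93,233.44 | 49.09 | 4244 | 46.33 |
| | Low | 188.46 | 0.1 | 10 | 0.11 |
| | Medium | 2294.18 | 1.21 | 86 | 0.94 |
| | High | 31.62 | 0.02 | 4 | 0.04 |
| | Extremely High | 80,147.62 | 42.17 | 4816 | 52.58 |
| LR | Extremely Low | 459.18 | 0.25 | 21 | 0.23 |
| | Low | 56,501.98 | 30.33 | 2125 | 23.2 |
| | Medium | 84,483.85 | 45.35 | 4608 | 50.31 |
| | High | 32,285.41 | 17.33 | 2202 | 24.04 |
| | Extremely High | 2164.91 | 1.16 | 204 | 2.23 |
| RF | Extremely Low | 27,178.68 | 15.49 | 795 | 8.68 |
| | Low | 43,823.41 | 24.97 | 1837 | 20.05 |
| | Medium | 50,404.46 | 28.72 | 2657 | 29.01 |
| | High | 40,159.38 | 22.88 | 2650 | 28.93 |
| | Extremely High | 14,329.39 | 8.16 | 1221 | 13.33 |
| SVM | Extremely Low | 1157.45 | 0.7 | 17 | 0.19 |
| | Low | 76,648.76 | 46.41 | 2756 | 30.09 |
| | Medium | 44,320.62 | 26.83 | 2498 | 27.27 |
| | High | 53,234.1 | 32.23 | 3850 | 42.03 |
| | Extremely High | 534.39 | 0.32 | 39 | 0.43 |
| XGB | Extremely Low | 25,655 | 14.61 | 717 | 7.83 |
| | Low | 49,258.33 | 28.06 | 1950 | 21.29 |
| | Medium | 49,072.22 | 27.95 | 2559 | 27.94 |
| | High | 36,222.72 | 20.63 | 2449 | 26.73 |
| | Extremely High | 15,687.05 | 8.93 | 1485 | 16.21 |
| IV-DT | Extremely Low | 53,524.25 | 25.92 | 1572 | 17.16 |
| | Low | 38,048 | 18.43 | 1717 | 18.74 |
| | Medium | 11,722.25 | 5.68 | 712 | 7.77 |
| | High | 15,034.5 | 7.28 | 870 | 9.5 |
| | Extremely High | 88,140.5 | 42.69 | 4289 | 46.82 |
| IV-LR | Extremely Low | 20,065.25 | 9.72 | 599 | 6.54 |
| | Low | 26,693.5 | 12.93 | 896 | 9.78 |
| | Medium | 32,304.25 | 15.65 | 1229 | 13.42 |
| | High | 43,027.25 | 20.84 | 1916 | 20.92 |
| | Extremely High | 84,379.25 | 40.87 | 4520 | 49.34 |
| IV-RF | Extremely Low | 23,410.25 | 11.34 | 516 | 5.63 |
| | Low | 29,718.5 | 14.4 | 937 | 10.23 |
| | Medium | 31,444.75 | 15.23 | 1463 | 15.97 |
| | High | 33,216.75 | 16.09 | 1692 | 18.47 |
| | Extremely High | 88,679.25 | 42.95 | 4552 | 49.69 |
| IV-SVM | Extremely Low | 26,805.5 | 12.98 | 611 | 6.67 |
| | Low | 24,255.25 | 11.75 | 806 | 8.8 |
| | Medium | 27,389 | 13.27 | 1084 | 11.83 |
| | High | 35,227.25 | 17.06 | 1780 | 19.43 |
| | Extremely High | 92,792.5 | 44.94 | 4879 | 53.26 |
| IV-XGB | Extremely Low | 22,936.25 | 11.11 | 510 | 5.57 |
| | Low | 29,633.75 | 14.35 | 899 | 9.81 |
| | Medium | 32,233.25 | 15.61 | 1404 | 15.33 |
| | High | 41,553 | 20.13 | 1927 | 21.04 |
| | Extremely High | 80,113.25 | 38.8 | 4420 | 48.25 |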
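The disaster percentages in Table 5 follow directly from the disaster counts, so they can be recomputed as a consistency check. The sketch below does this for the IV and IV-XGB rows and also sums the share of hazard points that fall in the High and Extremely High classes. The names `COUNTS`, `class_shares`, and `high_zone_share` are illustrative assumptions, not part of the authors' workflow.

```python
# Disaster counts per susceptibility class, transcribed from Table 5.
# Variable and function names are illustrative assumptions.
COUNTS = {
    "IV": {"Extremely Low": 149, "Low": 2200, "Medium": 3782,
           "High": 2684, "Extremely High": 345},
    "IV-XGB": {"Extremely Low": 510, "Low": 899, "Medium": 1404,
               "High": 1927, "Extremely High": 4420},
}


def class_shares(counts):
    """Percentage of all hazard points in each class, to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


def high_zone_share(counts):
    """Percentage of hazard points in the High and Extremely High classes."""
    total = sum(counts.values())
    in_high = counts["High"] + counts["Extremely High"]
    return round(100 * in_high / total, 2)


for model, counts in COUNTS.items():
    print(model, "total points:", sum(counts.values()))
    print("  shares:", class_shares(counts))
    print("  High + Extremely High:", high_zone_share(counts), "%")
```

Both models account for the same 9160 hazard points. On these counts the two highest classes hold about 33.07% of the points under the IV model and about 69.29% under IV-XGB, matching the sums of the corresponding Disaster Percentage entries in Table 5.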

Chen, J.; Wu, F.; Hu, H. Geohazard Susceptibility Assessment in Karst Terrain: A Novel Coupling Model Integrating Information Value and XGBoost Machine Learning in Guizhou Province, China. Appl. Sci. 2025, 15, 10077. https://doi.org/10.3390/app151810077