Exploring Complementary Models Consisting of Machine Learning Algorithms for Landslide Susceptibility Mapping

Hu, Han; Wang, Changming; Liang, Zhu; Gao, Ruiyuan; Li, Bailong

doi:10.3390/ijgi10100639


by Han Hu, Changming Wang *, Zhu Liang, Ruiyuan Gao and Bailong Li

College of Construction Engineering, Jilin University, Changchun 130012, China

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2021, 10(10), 639; https://doi.org/10.3390/ijgi10100639

Submission received: 3 August 2021 / Revised: 18 September 2021 / Accepted: 21 September 2021 / Published: 24 September 2021


Abstract

Landslides occur frequently because of natural or human factors and cause heavy losses to the economy and to human life around the globe every year. Landslide susceptibility prediction (LSP) plays a key role in landslide prevention and has been under investigation for years. Although new machine learning algorithms have achieved excellent prediction accuracy, they require a sufficient quantity of training samples. However, it is hard to obtain enough landslide samples in most areas, especially at the county level. The present study aims to explore an optimized model that combines conventional unsupervised and supervised learning methods and performs well with respect to both prediction accuracy and comprehensibility. Logistic regression (LR), fuzzy c-means clustering (FCM) and factor analysis (FA) were combined to establish four models, which were applied in a specific area: the LR model, FCM coupled with LR, FA coupled with LR, and FCM and FA coupled with LR. Firstly, an inventory of 114 landslides and 10 conditioning factors was prepared for modeling. Subsequently, the four models were applied to LSP. Finally, their performance was evaluated and compared by k-fold cross-validation based on statistical measures. The results showed that the model coupling FCM, FA and LR achieved the best performance, with an AUC (area under the curve) value of 0.827, accuracy of 85.25%, sensitivity of 74.96% and specificity of 86.21%, while the LR model performed the worst, with an AUC value of 0.736, accuracy of 77%, sensitivity of 62.52% and specificity of 72.55%. It was concluded that both dimension reduction and sample size should be considered in modeling, and that performance can be enhanced by combining complementary methods. The combination of models should be flexible and purposeful. This work provides a reference for related research and better guidance for engineering activities, decision-making by local administrations and land use planning.

Keywords:

complementary model; unsupervised and supervised learning; landslide susceptibility prediction; GIS

1. Introduction

Landslides are a common and unavoidable type of disaster, especially in mountainous areas where rainfall, earthquakes or engineering activities occur frequently. The damage caused by landslides to both people and the economy is enormous [1,2]. The frequency and scale of landslides in China are greater than those in other countries [3]. Therefore, landslides have attracted increasing attention and a significant amount of research has been carried out, especially on landslide susceptibility prediction (LSP).

In some cases, damage can be avoided or reduced to some extent by recognizing the likely locations of future disasters [4]. In view of the severity and frequency of landslides, the establishment of prediction models is indispensable, and various methods have proved effective. Physically based approaches and heuristic methods are applicable only to limited samples and are time-consuming [5,6]. Traditional machine learning methods (TMLM), such as the logistic regression (LR) model, clustering analysis and principal component analysis, have been well verified in LSP [7,8,9]. New machine learning methods (NMLM), such as support vector machines, artificial neural networks and deep learning, have gained attention with the development of computer technology [10,11,12,13,14,15,16]. Ensemble learning offers the possibility of further improving accuracy and of capturing the nonlinear relationships between landslides and conditioning factors [17]. Bagging, boosting and stacking are three commonly applied techniques [18,19,20,21,22]. However, few discussions have focused on the integration of TMLM.

The application of NMLM involves several hyper-parameters that need to be tuned and optimized within a “black-box” training process, which leads to some limitations in practice [23]. In addition, NMLM require high-quality samples in terms of both purity and quantity, which are hard to obtain, especially at the county level. Therefore, exploring a new ensemble method to ease the above problems is of great significance. The current study aims to explore a hybrid model consisting of TMLM to improve accuracy and comprehensibility.

Several researchers have compared the LR model with other models [24]. However, few attempts have been made to combine the LR model with other methods to compensate for its limitations [25]. The LR model, which uses the logistic transformation to calculate probability ratios and to predict the probabilities of events associated with multiple variables, is a supervised learning method. The coefficients in the regression equation are affected if the independent variables are multicollinear, which may lead to unsatisfactory prediction results. In LSP, the data can hardly satisfy mutual independence between variables. In addition, different catchments are dominated by different landslide-related factors, whereas a single LR model fails to recognize these differences and lacks pertinence. Hence, the accuracy and reliability of the LR model need to be further improved.

Fuzzy c-means (FCM) clustering is an unsupervised learning method used for sorting catchments with similar characteristics into clusters [26]. Factor analysis (FA) is another unsupervised learning method; it is a common tool for dealing with the “curse of dimensionality” and for exploring the main conditioning factors of different clusters [27]. Therefore, this study proposes hybrid models combining LR, FCM and FA. Luoying Town, Pinggu District, Beijing, is selected as the study area. Four models (the LR model, FCM coupled with LR, FA coupled with LR, and FCM and FA coupled with LR) are established to compare and analyze their performance from different angles. The ArcGIS platform is used to map and extract the related data, while Statistical Product and Service Solutions (SPSS) software is used for modeling.

2. Study Area and Materials

2.1. Study Area and Landslide Inventory

The town of Luoying, in the Pinggu District of Beijing, has suffered from landslides for years (Figure 1). It covers an area of 80.9 km² and has 3200 households and a total population of 11,000. The elevation ranges from about 181 to 1236 m above mean sea level. The area has a northern temperate continental climate with a large annual temperature range. The annual precipitation in the area is 642 mm, mainly concentrated in late July and early August, which account for 76% of the total rainfall. Figure 2 shows the average monthly rainfall (1959–2017) in the area.

The Yanshanian and Indosinian periods were characterized by strong tectonic activity, which formed a series of large folds and faults. Three common lithologies were found during our field investigation: gneiss from Middle Archean (Ar₂wgn), quartz sandstone from Mesozoic (Chc³) and dolomites from Mesozoic (Cht). Magmatic rocks are not developed in the region, and many kinds of loose solid material accumulations, such as landslide and collapse deposits, were found in the catchments.

Tectonic activity in the study area is intense, and natural disasters such as earthquakes, floods and landslides have occurred many times in history (Table 1), causing great damage to local villagers. LSP is therefore indispensable. A total of 114 shallow landslides were identified from historical reports (1970–2015), field investigation (2017–2019) (Figure 3) and remote sensing image interpretation. To reduce the spatial autocorrelation between observations, improve the rendering effect and avoid the uncertainty of drawing landslide boundaries, each landslide is represented by a single point placed at the centroid of its area (Figure 1). LSP can be considered a binary classification problem, which needs both positive and negative samples. The landslide samples were regarded as positive samples and labeled “1”, while non-landslide samples were regarded as negative samples and labeled “0”. To reduce bias, the same number of non-landslide samples was selected randomly from the landslide-free area. The maximum landslide area is 8.9 × 10³ m², the minimum is 200 m² and the average is 2 × 10³ m². All shallow landslides triggered by rainfall or earthquakes were considered in the study. For example, on the night of 15 July 1958, the rainfall intensity in the study area reached 144.6 mm, which induced a large number of landslides. Shallow landslides in the area occur mainly in July.

2.2. Mapping Unit

The mapping unit needs to be selected first, as data extraction is based on spatial primitives. Several methods have been applied to the division of terrain, yielding four kinds of mapping units: grid units, slope units, unique-condition units and watershed units [28,29,30,31]. The grid unit is the most popular unit in LSP because of its simplicity [32]. However, grid units are sensitive to all the uncertainties in geomorphological mapping. Slope units, on the other hand, can reflect the topographic and geological conditions of a landslide. Landslides occur primarily on slopes, and slope units, which are hydrological terrain units bounded by drainage and divide lines, are well suited to LSP. More information on the differences between units can be found in other works [33]. Therefore, slope units were selected as the mapping unit in this study. The study area was divided into 503 slope units with the Hydrologic analysis tool in ArcGIS, and the results were imported into Google Earth for repeated correction.

2.3. Conditioning Factors

Geological disasters are induced by the interaction of internal geological forces and external meteorological and hydrological conditions. The conditioning factors usually involve topographical, geological and meteorological conditions, which affect the distribution and frequency of landslides [34]. In this study, ten conditioning factors (F1–F10) were chosen with reference to previous literature [35]. A brief description is given below:

2.3.1. Triggering Factors

Rainfall has long been the main triggering factor for landslides. The occurrence of a landslide is affected by both the intensity and the duration of rainfall. This study selects the maximum 7-day rainfall (F1) as the predisposing factor, the values of which range from 311.956 to 355.045 mm. The thematic map was generated by kriging interpolation in ArcGIS using data from 12 precipitation stations near the study area (Figure 4a). Road construction is key to the development of mountainous areas, and unreasonable excavation often leads to landslides. Distance to the road (F2) was obtained using the Euclidean distance approach and ranges from 36.38 to 2844.85 m (Figure 4b). Road networks were extracted from Landsat 8 OLI images. Earthquakes are another important triggering factor, but the peak ground acceleration is the same across the study area and therefore cannot discriminate between slope units. Consequently, rainfall is regarded as the main triggering factor in this study, and distance to the road as another triggering factor reflecting human activities.

2.3.2. The Topographical Factors

Topography-related factors, namely elevation (F3), slope angle (F4), plan curvature (F5), profile curvature (F6) and topographic wetness index (F7), were derived from the Shuttle Radar Topography Mission digital elevation model (SRTM DEM) dataset with a 30 m × 30 m resolution. The related thematic maps are shown in Figure 4c–g.

2.3.3. Geological Factors

Distance to a fault (F8) is closely linked to the occurrence of landslides [36]. The Euclidean distance algorithm was applied to calculate the distance from the slope units to the faults, the values of which vary from 39.31 to 4748.39 m (Figure 4h). The map of faults was obtained from a geological map at a scale of 1:50,000.

Rivers affect the supporting force of slopes and alter their pore-water pressure. Therefore, the distance to stream (F9) is also considered in the study, with values ranging from 68.82 to 5230.63 m (Figure 4i) [37].

The plan-view shape varies with the stage of development of a landslide. Roundness (F10) is a morphologic factor, which can be calculated with the following equation:

R_{c} = \frac{S}{A} = \frac{4 π S}{P^{2}}

(1)

where S and P represent the area and perimeter of a slope unit, respectively, and A = P²/(4π) is the area of a circle with the same perimeter as the slope unit.
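As a minimal sketch of Equation (1), the roundness of a slope unit can be computed directly from its area and perimeter. The function name `roundness` and its arguments are illustrative and not taken from the study; the expression uses A = P²/(4π), the area of a circle with perimeter P.

```python
import math

def roundness(area, perimeter):
    """Roundness R_c = S / A, with A = P**2 / (4 * pi) the area of the
    circle that has the same perimeter P as the slope unit."""
    if area <= 0 or perimeter <= 0:
        raise ValueError("area and perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2

# Hypothetical shapes: a circle gives 1, a unit square gives pi / 4.
circle = roundness(math.pi * 3.0 ** 2, 2.0 * math.pi * 3.0)
square = roundness(1.0, 4.0)
```

Values close to 1 indicate a nearly circular unit, and elongated units give smaller values.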

The mean value of each conditioning factor within a slope unit is taken as the attribute of that unit, and the thematic maps of the factors are shown in Figure 4.

3. Methodology

3.1. LR Model

The LR model is commonly applied to describe the relationship between several independent variables, which can be nominal or continuous, and a dependent variable, which can be binary or categorical [38,39]. The model calculates the probability of presence or absence of an event from the predictor variables, which makes it suitable for LSP. In the present study, the conditioning factors are taken as the independent variables and the occurrence of landslides as the dependent variable. The general form of the LR model is as follows [40]:

p = \frac{1}{1 + e^{- y}}

(2)

where p refers to the probability of a landslide occurrence and y is a linear combination of the variables, as shown in Equation (3):

y = b_{0} + b_{1} x_{1} + b_{2} x_{2} + b_{3} x_{3} + \dots + b_{n} x_{n}

(3)

where b₀ is a constant and b₁, b₂, …, b_n are the regression coefficients of the variables x₁, x₂, …, x_n.
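As a short worked step (a standard rearrangement, assuming the usual logistic function p = 1/(1 + e^{−y})), the probability form can be inverted to give the logit form in which the fitted models are reported in Section 4:

1 - p = \frac{e^{- y}}{1 + e^{- y}} \quad \Rightarrow \quad \frac{p}{1 - p} = e^{y} \quad \Rightarrow \quad \ln (\frac{p}{1 - p}) = y = b_{0} + b_{1} x_{1} + \dots + b_{n} x_{n}

Each coefficient b_k can therefore be read as the change in the log-odds of landslide occurrence per unit change in x_k.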

3.2. FCM Clustering

FCM is a soft clustering method that calculates the degree of membership of each object in each cluster [41,42]. The core idea is to map data points in a multi-dimensional space to different clusters in the form of membership degrees, thereby determining C cluster centers such that the similarity between clusters is minimized and the similarity within clusters is maximized. The cluster centers, the objective function and the membership update of FCM are defined as follows:

V_{i} = \frac{\sum_{j = 1}^{N} u_{i j}^{m} X_{j}}{\sum_{j = 1}^{N} u_{i j}^{m}}

(4)

J = \sum_{j = 1}^{N} \sum_{i = 1}^{C} u_{i j}^{m} d^{2} (X_{j}, V_{i})

(5)

u_{i j} = \frac{1}{\sum_{k = 1}^{C} {(\frac{d_{i j}}{d_{k j}})}^{2 / (m - 1)}}

(6)

where m refers to the degree of fuzziness, u_ij is the degree of membership of the jth object X_j in the ith cluster, V_i is the center of the ith cluster, d_ij = d(X_j, V_i) is the Euclidean distance between the ith cluster center and the jth sample [43], J is the objective function, C is the number of clusters and N is the number of objects in the database.

Two parameters, m and the number of cluster centers C, need to be determined in advance. This study applies the cluster validity function [44] to determine the number of cluster centers, and m is set to 2, as in most applications.
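The FCM iteration described above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the SPSS procedure used in the study: the function names, the random initialization, the stopping tolerance and the toy points are all hypothetical. It alternates the center update (membership-weighted means) and the membership update (inverse-distance weights) until the memberships stop changing.

```python
import math
import random

def fuzzy_c_means(samples, n_clusters, m=2.0, max_iter=200, tol=1e-6, seed=0):
    """Return (centers, u), where u[i][j] is the membership of sample j in cluster i."""
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])
    # Random initial memberships, normalised so that each sample's memberships sum to 1.
    u = [[rng.random() for _ in range(n)] for _ in range(n_clusters)]
    for j in range(n):
        total = sum(u[i][j] for i in range(n_clusters))
        for i in range(n_clusters):
            u[i][j] /= total
    centers = []
    for _ in range(max_iter):
        # Center update: membership-weighted mean of the samples.
        centers = []
        for i in range(n_clusters):
            w = [u[i][j] ** m for j in range(n)]
            centers.append([sum(w[j] * samples[j][k] for j in range(n)) / sum(w)
                            for k in range(dim)])
        # Membership update: weights proportional to d_ij ** (-2 / (m - 1)).
        shift = 0.0
        for j in range(n):
            d = [max(math.dist(samples[j], c), 1e-12) for c in centers]
            inv = [dist ** (-2.0 / (m - 1.0)) for dist in d]
            s = sum(inv)
            for i in range(n_clusters):
                new = inv[i] / s
                shift = max(shift, abs(new - u[i][j]))
                u[i][j] = new
        if shift < tol:
            break
    return centers, u

def fcm_objective(samples, centers, u, m=2.0):
    """Objective J: membership-weighted sum of squared distances to the centers."""
    return sum(u[i][j] ** m * math.dist(samples[j], centers[i]) ** 2
               for i in range(len(centers)) for j in range(len(samples)))

# Hypothetical toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, memberships = fuzzy_c_means(points, n_clusters=2)
```

Assigning each sample to the cluster with its largest membership gives a hard partition, which is how clustered subsets could then be passed to separate regression models.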

3.3. FA Model

The FA model extracts comprehensive factors by exploring the internal dependency of variables. The core idea of FA is to describe the complex relationships between the original variables as the sum of a linear function of the smallest possible number of common factors and a special factor. Specifically, the observation data are decomposed into two parts, one expressed by the common factors and the other by the special factors. The model can be written as shown in Equation (7):

X_{i j} = a_{j 1} f_{1 i} + a_{j 2} f_{2 i} + \dots + a_{j m} f_{m i} + e_{i j}

(7)

where X refers to a measured variable; a represents the factor loading; f is the factor score; e is the special factor; i is the sample number, j is the variable number and m is the number of factors.

The main steps are as follows:

a. Construction of comprehensive factors

The first m rotated factors whose cumulative explained variance is not less than 85% are extracted as the comprehensive factors.

b. Calculation of factor scores

The Thomson regression method is applied to calculate the scores of each factor, which can be represented as:

F = A^{'} R^{- 1} X

(8)

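where A^{'} R^{- 1} is the factor score coefficient matrix, A is the factor loading matrix after rotation, R is the correlation matrix of the original variables and X is the vector of standardized original variables.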

FA is adopted to deal with the “curse of dimensionality”: it retains most of the original information while avoiding the influence of multicollinearity on the prediction results.
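Step a can be illustrated with a small helper that counts how many factors are needed to reach the 85% cumulative-variance criterion. This is a sketch only: the function name and the eigenvalues in the example are hypothetical and are not taken from the study's Table 7.

```python
def n_factors_for_variance(eigenvalues, threshold=0.85):
    """Smallest number of leading factors whose cumulative share of the
    total variance is at least `threshold` (a fraction between 0 and 1)."""
    if not eigenvalues or not 0.0 < threshold <= 1.0:
        raise ValueError("need eigenvalues and a threshold in (0, 1]")
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues of a correlation matrix (illustration only).
example_eigenvalues = [5.0, 2.0, 1.0, 1.0, 0.5, 0.5]
n_kept = n_factors_for_variance(example_eigenvalues)
```

With these made-up eigenvalues the cumulative shares are 50%, 70%, 80% and 90%, so four factors are retained.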

3.4. Comparison of the Methods

The machine learning methods used in LSP rest on the important assumption that locations with conditions similar to those of historical landslides are more likely to fail in the future [45]. Among the methods applied in this study, the LR model and FCM are based on the Bernoulli distribution and fuzzy theory, respectively. The LR model is suitable for both classification and regression and has been successfully applied in medicine, economics and biology. The application of FCM is similarly extensive. As for limitations, the LR model is sensitive to collinearity between independent variables, which is one of the reasons for its unsatisfactory results compared with NMLM. For FCM, the location and number of the initial cluster centers need to be determined first. FCM and FA are unsupervised learning methods, which model the data without prior conditions and usually do not perform as well as supervised learning methods. The LR model is built from labeled samples and is a supervised learning method.

As representatives of TMLM, the three methods have their own merits and limitations based on their respective assumptions. Therefore, it is recommended to combine two or even three methods to compensate for the inherent limitations of a single model.

3.5. Model Performance

A scientific model needs to be properly verified before it is generalized, so the dataset should be divided into training and testing sets. In earlier studies, a single random division was common, which splits the data into training and validation sets in a certain proportion [46]. However, a more robust verification method is recommended, and five-fold cross-validation is applied in this study [47]. Three indexes, namely sensitivity, specificity and accuracy, were used to evaluate the performance of the models. Finally, standard errors were used to quantify the errors associated with the susceptibility predictions of each model. The equations involved are as follows:

Accuracy = \frac{T P + T N}{T P + T N + F P + F N}

\begin{array}{l} Sensitivity = \frac{T P}{T P + F N} \\ Specificity = \frac{T N}{F P + T N} \end{array}

(9)

where True Positive (TP) and True Negative (TN) are the numbers of correctly predicted landslide and non-landslide samples, respectively; False Positive (FP) is the number of non-landslide samples incorrectly predicted as landslides, and False Negative (FN) is the number of landslide samples incorrectly predicted as non-landslides.
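The three indexes in Equation (9) can be computed from the four confusion-matrix counts as in the following sketch. The function name and the example counts are hypothetical; landslides are treated as the positive class, as in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0 or tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        "accuracy": (tp + tn) / total,     # share of all samples classified correctly
        "sensitivity": tp / (tp + fn),     # share of landslides detected
        "specificity": tn / (fp + tn),     # share of non-landslides detected
    }

# Hypothetical counts for one validation fold (illustration only).
fold_metrics = classification_metrics(tp=8, tn=7, fp=3, fn=2)
```

In a k-fold setting these values would be computed on each held-out fold and then averaged.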

The Receiver Operating Characteristic (ROC) curve is constructed by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity). The area under the curve (AUC) is also applied to evaluate model performance. For a classifier with very poor performance, the ROC curve appears as a straight 45° diagonal line running to the upper right corner, while for a classifier with better performance, the ROC curve lies closer to the upper left corner of the plot. That is, the area under the curve becomes larger, and a higher AUC value indicates better model performance. The correlation between predictive power and AUC can be quantified as: excellent (0.9–1), very good (0.8–0.9), good (0.7–0.8), average (0.6–0.7), poor (0.5–0.6). If the AUC value is 0.5, the model has no application value. The flow chart of the study is shown in Figure 5.
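One way to compute the AUC without drawing the curve is the rank-based (Mann–Whitney) formulation: the AUC equals the probability that a randomly chosen landslide sample receives a higher predicted probability than a randomly chosen non-landslide sample, with ties counted as one half. The sketch below assumes labels coded 1 and 0 as in Section 2.1; the function name and the example scores are hypothetical, and this is not necessarily how SPSS computes the value.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0      # pair ranked correctly
            elif p == q:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions for four samples (illustration only).
example_auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

In this toy case three of the four landslide/non-landslide pairs are ranked correctly, which gives an AUC of 0.75.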

4. Results and Validation

The data were standardized with the Z-score method, which is provided by SPSS, to eliminate the impact of different dimensions (units).
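Z-score standardization subtracts the mean of a factor and divides by its standard deviation, so that all factors are expressed on a comparable, unitless scale. The sketch below assumes the sample standard deviation (n − 1 denominator); whether this matches the exact SPSS setting used in the study is an assumption, and the function name and input values are illustrative.

```python
import statistics

def z_scores(values):
    """Standardize values to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)   # sample standard deviation, n - 1 denominator
    if std == 0:
        raise ValueError("values must not all be equal")
    return [(v - mean) / std for v in values]

# Hypothetical slope angles in degrees for three slope units (illustration only).
standardized = z_scores([10.0, 20.0, 30.0])
```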

4.1. LR Model

The LR model was fitted with the forward stepwise method in SPSS to screen the conditioning factors. The significance values of all variables retained in the last step of the analysis were less than 0.05, so no further variables were added. After several iterations, the model retained slope angle and roundness as the factors with a significant influence on the fit.

The coefficient of slope angle was the largest (0.619), which illustrated that it had the greatest influence on the occurrence of a landslide. In addition, the regression coefficients of both factors were greater than zero, indicating a positive correlation between the occurrence of landslides and these factors. The obtained LR equation is as follows:

\operatorname{Logit} (p) = \ln (\frac{p}{1 - p}) = - 1.322 + 0.619 \times F 4 + 0.488 \times F 10

(10)

where p is the probability of a landslide.
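Equation (10) can be turned into a probability for a single slope unit by inverting the logit. The sketch below assumes that F4 and F10 are supplied as Z-score standardized values, as described at the start of this section; the function and argument names are illustrative.

```python
import math

def landslide_probability(slope_angle_z, roundness_z):
    """Probability of a landslide from the fitted LR model of Equation (10).

    Both inputs are assumed to be standardized (Z-score) values of F4 and F10.
    """
    logit = -1.322 + 0.619 * slope_angle_z + 0.488 * roundness_z
    return 1.0 / (1.0 + math.exp(-logit))

# A unit with average slope angle and roundness (both standardized values 0)
# versus a steeper, rounder unit (hypothetical standardized values).
p_average = landslide_probability(0.0, 0.0)
p_steeper = landslide_probability(1.5, 1.0)
```

Because both coefficients are positive, the predicted probability increases with either factor.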

Table 2 shows that the Wald chi-square (Wald) value of slope angle (22.55) was higher than that of roundness (14.89), which indicates that slope angle is more important for explaining the occurrence of landslides [41]. SPSS also provides statistical indexes that reflect the goodness of fit of the model: the Cox and Snell R² (CSR²) and the Nagelkerke R² (NR²). The CSR² and NR² values of 66.2% and 77.6%, respectively, indicate the extent to which the independent variables explain the dependent variable. Table 3 shows that the prediction accuracy was 69.15% for landslides (sensitivity) and 76.26% for non-landslides (specificity). The overall prediction accuracy was 77.16% and the AUC value was 0.745.

4.2. FCM Coupled with LR Model

The validity index (V_cs) is the ratio of the degree of compactness to the degree of dispersion, and a greater V_cs value indicates a better clustering effect. Accordingly, the optimal number of clusters was 4 (Figure 6), and the clustering results are shown in Table 4.

Different samples required different factors and coefficients for fitting the LR model (Table 5), which further identified the major factors of each cluster. For the first model, elevation and topographic wetness index were selected as main factors with positive coefficients, whereas the coefficient of distance to stream was negative (−1.757). For the second model, maximum 7-day rainfall and plan curvature were determined as the major factors, with coefficients of 0.852 and 0.588, respectively. In the third model, plan curvature and distance to stream were selected again, but their coefficients had opposite signs to those in the first two models (−0.477 and 0.322, respectively). In the fourth model, roundness was chosen as the only fitting factor, with a coefficient of 0.354.

The hybrid model showed better performance, with sensitivity, specificity and accuracy values of 73.81%, 85.77% and 84.89%, respectively (Table 6), all higher than those of the LR model (Table 3). In addition, the four models established after clustering all performed well, with sensitivity values ranging from 71.28% to 74.88% and specificity values from 82.35% to 88.77%.

4.3. FA Coupled with LR Model

The principal components were generated by FA to solve the problems of multicollinearity and information loss. The first seven factors (C1–C7) were extracted and explained 90.231% of the total variance (Table 7). Among them, the contribution rates of the first three factors were relatively large: 45.726%, 15.794% and 10.979%, respectively. According to the correlation coefficients of each common factor (Table 7), the first common factor (C1) mainly highlighted the information of profile curvature and elevation, which reflect the topographic conditions for landslide development. Similarly, the second and third common factors mainly highlighted the information of distance to road and distance to stream, respectively. The scoring functions were determined (Equation (11)) and the LR model was established using the scores of five common factors (C1–C5) as independent variables. Accordingly, Equation (13) restored the relationship between the 10 conditioning factors and the occurrence of landslides. The coefficients of F1, F4, F5 and F7 were negative, indicating that these factors were not conducive to the occurrence of landslides. The coefficients of F6 and F10 were 0.702 and 0.706, respectively, both greater than 0.7.

The introduction of FA condenses the information carried by the original 10 conditioning factors into five common factors, achieving effective dimensionality reduction while retaining most of the information.

\begin{array}{l}
C1 = - 0.469 F1 + 0.783 F2 + 0.814 F3 - 0.200 F4 - 0.128 F5 + 0.915 F6 - 0.846 F7 - 0.190 F8 - 0.555 F9 + 0.120 F10 \\
C2 = - 0.657 F1 + 0.16 F2 + 0.475 F3 - 0.29 F4 - 0.03 F5 + 0.19 F6 - 0.076 F7 + 0.921 F8 + 0.65 F9 + 0.29 F10 \\
C3 = 0.155 F1 + 0.330 F2 + 0.181 F3 + 0.125 F4 - 0.949 F5 + 0.38 F6 + 0.67 F7 + 0.28 F8 + 0.117 F9 + 0.05 F10 \\
C4 = - 0.066 F1 - 0.210 F2 - 0.101 F3 + 0.974 F4 - 0.139 F5 + 0.106 F6 - 0.060 F7 - 0.180 F8 - 0.116 F9 - 0.004 F10 \\
C5 = 0.059 F1 + 0.063 F2 + 0.042 F3 - 0.004 F4 - 0.049 F5 + 0.116 F6 - 0.044 F7 + 0.036 F8 + 0.083 F9 + 0.988 F10
\end{array}

(11)

where C1 to C5 are the first to fifth principal components.

C1 and C5 were selected into the regression equation (Table 8) and the equation was:

\ln (\frac{p}{1 - p}) = - 1.371 + 0.692 \times C 1 + 0.61 \times C 5

(12)

Combining Equations (11) and (12), a multivariable LR equation can be obtained as:

\begin{array}{l}
\ln (\frac{p}{1 - p}) = - 1.372 - 0.289 \times F1 + 0.582 \times F2 + 0.588 \times F3 - 0.016 \times F4 - 0.119 \times F5 \\
+ 0.702 \times F6 - 0.616 \times F7 + 0.009 \times F8 + 0.428 \times F9 + 0.716 \times F10
\end{array}

(13)
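The back-substitution step, in which the common factors of the scoring functions are replaced by their expressions in F1–F10 and the terms are collected, can be sketched as follows. The sketch uses the printed C1 and C5 weights and the two regression coefficients 0.692 and 0.61; the variable names are illustrative, and the code only demonstrates the mechanics of the substitution.

```python
# Printed weights of the ten conditioning factors in the C1 and C5 scoring functions.
C1 = {"F1": -0.469, "F2": 0.783, "F3": 0.814, "F4": -0.200, "F5": -0.128,
      "F6": 0.915, "F7": -0.846, "F8": -0.190, "F9": -0.555, "F10": 0.120}
C5 = {"F1": 0.059, "F2": 0.063, "F3": 0.042, "F4": -0.004, "F5": -0.049,
      "F6": 0.116, "F7": -0.044, "F8": 0.036, "F9": 0.083, "F10": 0.988}

# Regression coefficients of C1 and C5 in the fitted logit equation.
B_C1, B_C5 = 0.692, 0.61

# Substituting C1 and C5 into the logit and collecting terms gives one
# coefficient per original factor: b_F = B_C1 * w_C1(F) + B_C5 * w_C5(F).
restored = {name: B_C1 * C1[name] + B_C5 * C5[name] for name in C1}
```

With the printed values, the coefficients obtained for F1, F2 and F3 agree with Equation (13) to within about 0.005. Other terms may differ, for instance through rounding or if the published coefficients were derived from quantities not printed here, so the sketch should not be read as a reproduction of the full equation.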

The established model had a high degree of fit (Table 9), although slightly lower than that of the LR model (Table 2). The accuracy reached 83.29%, with a sensitivity of 73.11% and a specificity of 84.79% (Table 9), a significant improvement over the LR model (Table 3). In addition, the Kaiser–Meyer–Olkin (KMO) value was 0.789, which indicated that the variables were clearly correlated and suitable for FA.

4.4. FCM, FA Coupled with LR Model

Finally, a model combining the three statistical methods FCM, FA and LR was established. First, the cluster analysis was carried out; then FA was performed on the samples of each cluster; finally, an LR model was established for each cluster.

Five common factors were involved in the first and second models and four in the third and fourth models (Table 10). LR equations were then established for each cluster with the regression coefficients of the common factors (Table 11).

The coefficients and signs corresponding to each factor varied between models (Equation (14)), which not only highlighted the differences in major factors among samples but also retained most of the information of the original data. For the first model, the coefficients of F2, F3 and F9 were relatively large (0.396, 0.386 and 0.208, respectively), while the coefficients of F1 and F7 were negative (−0.526 and −0.061, respectively). In the second model, the coefficients of F4 and F6 were 0.411 and 0.125, respectively, and the coefficients of F2, F3, F5, F7, F8, F9 and F10 were negative. In the third model, the coefficients of F3, F8 and F9 were 0.477, 0.559 and 0.543, respectively, and the coefficients of F1, F4 and F7 were negative. For the fourth model, the coefficient of F10 was relatively large (0.343). All the models passed the KMO test, but the values were lower than that of the FA coupled with LR model (Table 9). The accuracy reached 88.01%, while the sensitivity and specificity were 79.85% and 89.58%, respectively (Table 12), exceeding the values of the other three models.

\begin{array}{l}
\ln (\frac{p}{1 - p}) = - 1.212 - 0.526 \times F1 + 0.396 \times F2 + 0.386 \times F3 + 0.034 \times F4 - 0.007 \times F5 + 0.186 \times F6 - 0.061 \times F7 + 0.095 \times F8 + 0.208 \times F9 - 0.004 \times F10 \\
\ln (\frac{p}{1 - p}) = - 1.825 + 0.012 \times F1 - 0.037 \times F2 - 0.037 \times F3 + 0.411 \times F4 - 0.062 \times F5 + 0.125 \times F6 - 0.029 \times F7 - 0.016 \times F8 - 0.041 \times F9 - 0.078 \times F10 \\
\ln (\frac{p}{1 - p}) = - 1.347 - 0.534 \times F1 + 0.036 \times F2 + 0.477 \times F3 - 0.018 \times F4 + 0.029 \times F5 + 0.224 \times F6 - 0.197 \times F7 + 0.559 \times F8 + 0.543 \times F9 + 0.004 \times F10 \\
\ln (\frac{p}{1 - p}) = - 1.248 - 0.008 \times F1 + 0.044 \times F2 + 0.023 \times F3 - 0.012 \times F4 - 0.008 \times F5 + 0.021 \times F6 + 0.001 \times F7 + 0.063 \times F8 - 0.027 \times F9 + 0.343 \times F10
\end{array}

(14)

4.5. Validation and Comparison

The generalization performance of the proposed models should be validated with the test data. Table 13 shows that the accuracy of the three hybrid models was better than that of the LR model. In addition, the model combining FCM, FA and LR performed the best, with the highest accuracy, sensitivity and specificity of 85.25%, 74.96% and 86.21%, respectively.

The AUC values of the three hybrid models were also higher than that of the single LR model, which indicated that establishing a hybrid model was beneficial for predicting landslides in the study area (Figure 7). The model combining FCM, FA and LR achieved the highest AUC value of 0.827, followed by the model coupling FA and LR with 0.782 (Table 14). This indicated that dimensionality reduction and clustering improve the performance of the LR model. The performance of the models in validation declined in comparison with the results on the training data. The accuracy of the model combining FCM and LR dropped noticeably, which indicated over-fitting and limited generalization. Cluster analysis reduces the sample size available for modeling, while FA addresses the “curse of dimensionality”. The sample size varies greatly with the size of the study area. The dimension also varies, but less obviously, because the conditioning factors applied for modeling in LSP are relatively stable. The AUC values also decreased slightly compared with the training dataset. The errors related to probability estimation were negligible, as the standard errors were all less than 0.05 (Table 15).

The trained and validated models were applied to the whole study area. The probability of landslide (p), also known as the landslide susceptibility index (LSI), was calculated for each unit, and the results were imported into the ArcGIS 10.2.1 platform. The landslide susceptibility map was reclassified into several classes based on the LSI; equal-interval and natural-break classifications are the most common. To compare the maps more conveniently, the study area was classified into five landslide susceptibility levels based on the equal-interval principle: very low (0~0.2), low (0.2~0.4), moderate (0.4~0.6), high (0.6~0.8) and very high (0.8~1.0). The results are shown in Figure 8. The maps should follow the important principle that observed landslides are more likely to appear in high-susceptibility areas. Landslide points should therefore fall in the dark (orange or red) areas and non-landslide points in the light (green) areas. By this criterion, the map predicted by the FCM and FA coupled with LR model was the most reasonable. In this map, the moderate class accounts for the smallest proportion (10.64%), while the very low (45.62%) and high (8.67%) classes account for the largest proportions compared with the other models (Figure 9).
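The equal-interval reclassification of the LSI can be sketched as a simple lookup. The class limits follow the five levels listed above; how a value lying exactly on a boundary is assigned (here to the higher class) is an assumption, and the names used are illustrative.

```python
# Upper limits of the first four classes; values of 0.8 and above are "very high".
CLASS_LIMITS = [(0.2, "very low"), (0.4, "low"), (0.6, "moderate"), (0.8, "high")]

def susceptibility_class(lsi):
    """Map a landslide susceptibility index in [0, 1] to one of five classes."""
    if not 0.0 <= lsi <= 1.0:
        raise ValueError("LSI must lie between 0 and 1")
    for upper, name in CLASS_LIMITS:
        if lsi < upper:          # boundary values fall into the higher class
            return name
    return "very high"

# Hypothetical LSI values for three slope units (illustration only).
classes = [susceptibility_class(p) for p in (0.05, 0.47, 0.93)]
```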

5. Discussion

5.1. Comparison of TMLM and NMLM

TMLM have been applied in LSP for many years and have shown their own advantages and drawbacks [48]. Recently, NMLM, which evolved from statistical methods, have become popular owing to their data-processing ability and satisfactory results [49,50,51,52,53,54,55]. NMLM such as random forest and support vector machine have been applied in LSP with accuracies reaching 90% and AUC values exceeding 0.9 [56,57]. The quantity of samples is critical for NMLM, because their application emphasizes optimization and usually involves several hyper-parameters that need to be tuned. In practice, samples are limited, especially for county-level study areas, and it is hard for NMLM to avoid over-fitting without enough data. In contrast, TMLM focus on inference and on mathematical equations based on certain assumptions, which makes them easier to implement and understand. Logistic regression fits a function that defines the nonlinear relationship between landslide occurrence (landslide or non-landslide) and a set of conditioning factors, and it has almost no hyper-parameters that need to be tuned. Therefore, the models applied in the current study are suitable for LSP, especially for county-level study areas. The explicit final equation of the logistic regression model, combined with cluster analysis and factor analysis, makes the results easier to understand. The hybrid model discussed in the current study performed well in terms of accuracy and, importantly, is more comprehensible and operational.
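
To illustrate why the final LR equation is easy to read, the sketch below evaluates the logistic function with the coefficients reported in Table 2. The function name and the input values are hypothetical, and the sketch assumes that the factors enter the equation on the same (possibly standardized) scale used when the coefficients were fitted, which this section does not specify.

```python
import math

# Coefficients reported in Table 2 for the single LR model.
INTERCEPT = -1.322
COEFFICIENTS = {"slope_angle": 0.619, "roundness": 0.488}


def landslide_probability(factors):
    """Logistic function p = 1 / (1 + exp(-z)), with z the linear predictor."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * factors[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical factor values for two mapping units.
baseline = landslide_probability({"slope_angle": 0.0, "roundness": 0.0})
steeper = landslide_probability({"slope_angle": 1.5, "roundness": 1.0})
print(round(baseline, 3), round(steeper, 3))
```

Because both coefficients are positive, increasing either factor raises the predicted probability, which is the kind of direct reading that a tuned black-box model does not offer.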

5.2. The Necessity of Model Integration

Although previous researchers have applied the LR model, FCM and FA many times in the susceptibility prediction of geological disasters [45], few studies have combined them into an improved model. A model established by a single method, whether NMLM or TMLM, has certain limitations. Random forest is one of the best-known ensemble learning algorithms [58]. Common ensemble approaches such as Bagging and Boosting aim to construct a strong classifier by combining several weak classifiers [59,60,61]. In the current study, the hybrid models aim to compensate for the deficiencies of the individual TMLM. The integration of different methods is flexible but purposeful. The LR model lacks pertinence because the main conditioning factors differ between different types of samples; even when the factors are the same, their relative importance varies greatly. Introducing FCM addresses this problem and is conducive to further research and disaster prevention. The accuracy of the three hybrid models was higher than that of the single LR model, and the AUC results also showed that establishing the hybrid models was effective.
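
The way FCM splits samples into types before a separate LR model is fitted to each type can be sketched with the standard fuzzy c-means membership rule. The cluster centers, the sample point and the fuzzifier m = 2 below are hypothetical choices made for illustration; they are not values from the study.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Membership of one point in each cluster under the standard FCM rule.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), with d_i the distance
    from the point to center i. The memberships sum to 1.
    """
    distances = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** power for d_j in distances)
        for d_i in distances
    ]


def hard_cluster(point, centers, m=2.0):
    """Index of the cluster with the largest membership."""
    u = fcm_memberships(point, centers, m)
    return max(range(len(u)), key=u.__getitem__)


# Four hypothetical centers in a two-factor space, echoing the four categories.
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
sample = (1.0, 0.5)
print(fcm_memberships(sample, centers))
print(hard_cluster(sample, centers))
```

In a hybrid workflow of this kind, the hard assignment would decide which cluster-specific LR equation (Models I to IV in Table 5) is applied to the unit.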

For many years there has been no clear agreement on the selection of the best model, as performance varies between study areas. It is common to apply multiple methods to the same study area and compare them against unified standards such as the AUC to identify the best model. However, new methods are constantly emerging, and the effort towards a universal model matters as much as the comparison itself. Therefore, it is necessary to explore a widely accepted model that brings together the strengths of several methods. Besides, producing accurate LSP results is important but should not be the only consideration.

In the past, the forward stepwise method was applied to filter the variables, retaining at the last step those with significance values less than 0.05 [57]. Usually, few variables are retained in this way, and information that has an important impact on the result may be lost. Consequently, the LR model cannot reliably identify the predisposing factors responsible for landslide occurrence and sometimes even yields conclusions that contradict past experience. However, if all variables are forcibly retained, collinearity between variables cannot be avoided, resulting in poor model fitting and low prediction accuracy. The LR model based on FA avoided the multicollinearity problem and improved the reliability of the regression coefficients while retaining as much of the original data information as possible. Moreover, retaining most of the original information makes the equation more interpretable and persuasive. Previous studies that applied FCM and FA alone in LSP obtained results that were not as good as those of the LR model, mainly because FCM and FA are unsupervised learning methods and cannot fully utilize prior information [61,62,63,64]. Therefore, the hybrid model established in this study is reasonable and purposeful, combining the advantages of supervised and unsupervised learning.
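
The collinearity problem mentioned above can be made concrete with a simple diagnostic. The sketch below computes the Pearson correlation between two predictors and the resulting variance inflation factor (VIF) for a model that contains exactly those two predictors. The study does not report such a diagnostic; the function names and the toy values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x, y):
    """VIF of either predictor when the model holds exactly these two.

    With two predictors, the R^2 of regressing one on the other equals r^2,
    so VIF = 1 / (1 - r^2).
    """
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical values of three conditioning factors over five mapping units.
factor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
factor_b = [2.1, 3.9, 6.2, 7.8, 10.1]  # nearly a multiple of factor_a
factor_c = [2.0, 1.0, 3.0, 1.0, 2.0]   # unrelated to factor_a

print(pearson_r(factor_a, factor_b), vif_two_predictors(factor_a, factor_b))
print(pearson_r(factor_a, factor_c), vif_two_predictors(factor_a, factor_c))
```

A very large VIF signals that the two coefficients cannot be estimated reliably together, which is the situation that replacing the original variables with common factors is meant to avoid.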

6. Conclusions

In this study, complementary models based on FCM, FA and LR were explored and discussed to find a suitable model for LSP in a county-level study area, one that offers better accuracy, convenience and interpretability. The following conclusions can be drawn from the present study.

First, the introduction of FCM and FA compensates for the deficiencies of the LR model to some extent, and the hybrid models performed better in terms of accuracy and generalization ability. The performance of TMLM can be improved by combining them with specific methods, bringing it closer to that of NMLM. Second, the hybrid model is able to retain the majority of the information and identify the main conditioning factors of different types of data, so the credibility and communicability of the model are enhanced. Third, the combination of models should be more flexible and purposeful, and the evaluation standards should be expanded rather than focused solely on accuracy. Both dimension reduction and sample size should be considered in modeling. However, the following issues need to be explored in future work:

The models should be compared with new machine learning algorithms;
More diverse methods and combinations need to be discussed further.

Author Contributions

Han Hu: writing, methodology; Changming Wang: reviewing and editing; Zhu Liang: software and validation; Ruiyuan Gao: methodology; Bailong Li: investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (Grant No. 41972267, 41977221, and 41572257).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

LSP	Landslide susceptibility prediction
LR	Logistic regression
FCM	Fuzzy c-means clustering
FA	Factor analysis
ROC	Receiver Operating Characteristic
AUC	Area under the curve
SPSS	Statistical Product and Service Solutions
DEM	Digital elevation model
SRTM DEM	Shuttle radar topography mission digital elevation model
TMLM	Traditional machine learning methods
NMLM	New machine learning methods
TP	True positives
CSR²	Cox and Snell R square
TN	True negatives
NR²	Nagelkerke R square
FP	False positives
Wals	Wald statistic
FN	False negatives
S.E	Standard error of estimation
KMO	Kaiser–Meyer–Olkin
LSI	Landslide susceptibility index

References

1. Huang, Y.; Zhao, L. Review on landslide susceptibility mapping using support vector machines. CATENA 2018, 165, 520–529.
2. Fan, X.; Scaringi, G.; Korup, O.; West, A.J.; van Westen, C.J.; Tanyas, H.; Hovius, N.; Hales, T.C.; Jibson, R.W.; Allstadt, K.E.; et al. Earthquake-induced chains of geologic hazards: Patterns, mechanisms, and impacts. Rev. Geophys. 2019, 57, 421–503.
3. Ni, H.; Zheng, W.; Li, Z.; Ba, R. Recent catastrophic debris flows in Luding county, SW China: Geological hazards, rainfall analysis and dynamic characteristics. Nat. Hazards 2010, 55, 523–542.
4. Wang, Z.; Liu, Q.; Liu, Y. Mapping Landslide Susceptibility Using Machine Learning Algorithms and GIS: A Case Study in Shexian County, Anhui Province, China. Symmetry 2020, 12, 1954.
5. Blais-Stevens, A.; Behnia, P. Debris flow susceptibility mapping using a qualitative heuristic method and Flow-R along the Yukon Alaska Highway Corridor, Canada. Nat. Hazards Earth Syst. Sci. 2016, 16, 449–462.
6. Schilirò, L.; Cevasco, A.; Esposito, C.; Mugnozza, G.S. Shallow landslide initiation on terraced slopes: Inferences from a physically based approach. Geomat. Nat. Hazards Risk 2018, 9, 295–324.
7. Aditian, A.; Kubota, T.; Shinohara, Y. Comparison of GIS-based landslide susceptibility models using frequency ratio, logistic regression, and artificial neural network in a tertiary region of Ambon, Indonesia. Geomorphology 2018, 318, 101–111.
8. Shi, M.; Chen, J.; Song, Y.; Zhang, W.; Song, S.; Zhang, X. Assessing debris flow susceptibility in Heshigten Banner, Inner Mongolia, China, using principal component analysis and an improved fuzzy C-means algorithm. Bull. Int. Assoc. Eng. Geol. 2015, 75, 909–922.
9. Liang, Z.; Wang, C.; Han, S.; Khan, K.U.J.; Liu, Y. Classification and susceptibility assessment of debris flow based on a semi-quantitative method combination of the fuzzy C-means algorithm, factor analysis and efficacy coefficient. Nat. Hazards Earth Syst. Sci. 2020, 20, 1287–1304.
10. Chen, W.; Peng, J.; Hong, H.; Shahabi, H.; Pradhan, B.; Liu, J.; Zhu, A.-X.; Pei, X.; Duan, Z. Landslide susceptibility modelling using GIS-based machine learning techniques for Chongren County, Jiangxi Province, China. Sci. Total Environ. 2018, 626, 1121–1135.
11. Hong, H.; Liu, J.; Zhu, A.-X. Modeling landslide susceptibility using LogitBoost alternating decision trees and forest by penalizing attributes with the bagging ensemble. Sci. Total Environ. 2020, 718, 137231.
12. Gao, R.; Wang, C.; Liang, Z.; Han, S.; Li, B. A Research on Susceptibility Mapping of Multiple Geological Hazards in Yanzi River Basin, China. ISPRS Int. J. Geo-Inf. 2021, 10, 218.
13. Liang, Z.; Wang, C.; Duan, Z.; Liu, H.; Liu, X.; Khan, K.U.J. A Hybrid Model Consisting of Supervised and Unsupervised Learning for Landslide Susceptibility Mapping. Remote Sens. 2021, 13, 1464.
14. Lin, Q.; Lima, P.; Steger, S.; Glade, T.; Jiang, T.; Zhang, J.; Liu, T.; Wang, Y. National-scale data-driven rainfall induced landslide susceptibility mapping for China by accounting for incomplete landslide data. Geosci. Front. 2021, 12, 101248.
15. Armaș, I.; Gheorghe, M.; Silvaș, G. Shallow Landslides Physically Based Susceptibility Assessment Improvement Using InSAR. Case Study: Carpathian and Subcarpathian Prahova Valley, Romania. Remote Sens. 2021, 13, 2385.
16. Sujatha, E.R. An integrated landslide susceptibility model to assess landslides along linear infrastructure for environmental management. Environ. Earth Sci. 2021, 80, 447.
17. Liang, Z.; Wang, C.-M.; Zhang, Z.-M.; Khan, K.-U. A comparison of statistical and machine learning methods for debris flow susceptibility mapping. Stoch. Environ. Res. Risk Assess. 2020, 34, 1887–1907.
18. Hu, X.; Huang, C.; Mei, H.; Zhang, H. Landslide susceptibility mapping using an ensemble model of Bagging scheme and random subspace–based naïve Bayes tree in Zigui County of the Three Gorges Reservoir Area, China. Bull. Int. Assoc. Eng. Geol. 2021, 80, 5315–5329.
19. Wu, Y.; Ke, Y.; Chen, Z.; Liang, S.; Zhao, H.; Hong, H. Application of alternating decision tree with AdaBoost and bagging ensembles for landslide susceptibility mapping. CATENA 2019, 187, 104396.
20. Li, W.; Fang, Z.; Wang, Y. Stacking ensemble of deep learning methods for landslide susceptibility mapping in the Three Gorges Reservoir area, China. Stoch. Environ. Res. Risk Assess. 2021.
21. Hu, X.; Zhang, H.; Mei, H.; Xiao, D.; Li, Y.; Li, M. Landslide Susceptibility Mapping Using the Stacking Ensemble Machine Learning Method in Lushui, Southwest China. Appl. Sci. 2020, 10, 4016.
22. Dou, J.; Yunus, A.P.; Bui, D.T.; Merghadi, A.; Sahana, M.; Zhu, Z.; Chen, C.-W.; Han, Z.; Pham, B.T. Improved landslide assessment using support vector machine with bagging, boosting, and stacking ensemble machine learning framework in a mountainous watershed, Japan. Landslides 2019, 17, 641–658.
23. Merghadi, A.; Yunus, A.P.; Dou, J.; Whiteley, J.; ThaiPham, B.; Bui, D.T.; Avtar, R.; Abderrahmane, B. Machine learning methods for landslide susceptibility studies: A comparative overview of algorithm performance. Earth-Sci. Rev. 2020, 207, 103225.
24. Liang, Z.; Wang, C.; Khan, K.U.J. Application and comparison of different ensemble learning machines combining with a novel sampling strategy for shallow landslide susceptibility mapping. Stoch. Environ. Res. Risk Assess. 2020, 35, 1243–1256.
25. Li, X.; Chen, P.; Han, W.; Shi, H.; Yu, H. Application of factor analysis to debris flow risk assessment. Chin. J. Geol. Hazard Control. 2016, 27, 55–61.
26. Tian, C.; Liu, X.; Wang, J. Geohazard susceptibility assessment based on CF model and Logistic regression models in Guangdong. Hydroge Eng. 2016, 43, 154–161.
27. Sezer, E.; Nefeslioglu, H.; Gokceoglu, C. An assessment on producing synthetic samples by fuzzy C-means for limited number of data in prediction models. Appl. Soft Comput. 2014, 24, 126–134.
28. Verde, R.; Irpino, A. Multiple factor analysis of distributional data. arXiv 2018, arXiv:1804.07192.
29. Hussin, H.Y.; Zumpano, V.; Reichenbach, P.; Sterlacchini, S.; Micu, M.; van Westen, C.; Bălteanu, D. Different landslide sampling strategies in a grid-based bi-variate statistical susceptibility model. Geomorphology 2015, 253, 508–523.
30. Carrara, A.; Cardinali, M.; Detti, R.; Guzzetti, F.; Pasqui, V.; Reichenbach, P. GIS techniques and statistical models in evaluating landslide hazard. Earth Surf. Process. Landf. 1991, 16, 427–445.
31. Carrara, A.; Guzzetti, F.; Cardinali, M.; Reichenbach, P. Use of GIS Technology in the Prediction and Monitoring of Landslide Hazard. Nat. Hazards 1999, 20, 117–135.
32. Palamakumbure, D.; Flentje, P.; Stirling, D. Consideration of optimal pixel resolution in deriving landslide susceptibility zoning within the Sydney Basin, New South Wales, Australia. Comput. Geosci. 2015, 82, 13–22.
33. Reichenbach, P.; Rossi, M.; Malamud, B.D.; Mihir, M.; Guzzetti, F. A Review of Statistically-Based Landslide Susceptibility Models. Earth-Sci. Rev. 2018, 180, 60–91.
34. Lombardo, L.; Tanyas, H.; Nicu, I.C. Spatial modeling of multi-hazard threat to cultural heritage sites. Eng. Geol. 2020, 277, 105776.
35. DI, B.; Chen, N.; Cui, P.; Li, Z.; He, Y.; Gao, Y. GIS-based risk analysis of debris flow: An application in Sichuan, southwest China. Int. J. Sediment Res. 2008, 23, 138–148.
36. Dou, J.; Yunus, A.P.; Xu, Y.; Zhu, Z.; Chen, C.-W.; Sahana, M.; Khosravi, K.; Yang, Y.; Pham, B.T. Torrential rainfall-triggered shallow landslide characteristics and susceptibility assessment using ensemble data-driven models in the Dongjiang Reservoir Watershed, China. Nat. Hazards 2019, 97, 579–609.
37. Kornejady, A.; Ownegh, M.; Bahremand, A. Landslide susceptibility assessment using maximum entropy model with two different data sampling methods. CATENA 2017, 152, 144–162.
38. Huang, F.; Cao, Z.; Jiang, S.-H.; Zhou, C.; Huang, J.; Guo, Z. Landslide susceptibility prediction based on a semi-supervised multiple-layer perceptron model. Landslides 2020, 17, 2919–2930.
39. Hosmer, D.W.; Lemeshow, S. Applied Logistic Regression, 2nd ed.; John Wiley and Sons, Inc.: New York, NY, USA, 2000; 375p. Available online: http://www.nesug.org/proceedings/nesug06/an/da26.pdf (accessed on 21 September 2021).
40. Umar, Z.; Pradhan, B.; Ahmad, A.; Jebur, M.N.; Tehrany, M.S. Earthquake induced landslide susceptibility mapping using an integrated ensemble frequency ratio and logistic regression models in West Sumatera Province, Indonesia. CATENA 2014, 118, 124–135.
41. Pradhan, B.; Jebur, M.N. Spatial prediction of landslide-prone areas through K-nearest neighbor algorithm and logistic regression model using high resolution airborne laser scanning data. In Laser Scanning Applications in Landslide Assessment; Springer: Cham, Switzerland, 2017; pp. 151–165.
42. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; IEEE Electrical Insulation Magazine; Plenum Press: New York, NY, USA, 1981.
43. Sun, X.-L.; Zhao, Y.-G.; Wang, H.-L.; Yang, L.; Qin, C.-Z.; Zhu, A.-X.; Zhang, G.-L.; Pei, T.; Li, B.-L. Sensitivity of digital soil maps based on FCM to the fuzzy exponent and the number of clusters. Geoderma 2011, 171–172, 24–34.
44. Wang, J.; Chen, J.; Yang, J. Application of distance discriminant analysis method in classification of surrounding rock mass in highway tunnel. J. Jilin Univ. Earth Sci. Ed. 2008, 38, 999–1004.
45. Varnes, D.J. Landslide Hazard Zonation: A Review of Principles and Practice; Commission on Landslides of the IAEG, UNESCO Natural Hazards No. 3; UNESCO: Paris, France, 1984; p. 61.
46. Chung, C.-J.F.; Fabbri, A.G. Validation of Spatial Prediction Models for Landslide Hazard Mapping. Nat. Hazards 2003, 30, 451–472.
47. Guzzetti, F.; Reichenbach, P.; Ardizzone, F.; Cardinali, M.; Galli, M. Estimating the quality of landslide susceptibility models. Geomorphology 2006, 81, 166–184.
48. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, NY, USA, 2013; p. 441.
49. Chen, J.; Pi, D. A Cluster Validity Index for Fuzzy Clustering Based on Non-distance. In Proceedings of the 2013 International Conference on Computational and Information Sciences, Shiyang, China, 21–23 June 2013; pp. 880–883.
50. Ozdemir, A.; Altural, T. A comparative study of frequency ratio, weights of evidence and logistic regression methods for landslide susceptibility mapping: Sultan Mountains, SW Turkey. J. Asian Earth Sci. 2013, 64, 180–197.
51. Dou, J.; Yunus, A.P.; Merghadi, A.; Shirzadi, A.; Nguyen, H.; Hussain, Y.; Avtar, R.; Chen, Y.; Pham, B.T.; Yamagishi, H. Different sampling strategies for predicting landslide susceptibilities are deemed less consequential with deep learning. Sci. Total Environ. 2020, 720, 137320.
52. Bui, D.T.; Shirzadi, A.; Shahabi, H.; Geertsema, M.; Omidvar, E.; Clague, J.J.; Pham, B.T.; Dou, J.; Asl, D.T.; Bin Ahmad, B.; et al. New Ensemble Models for Shallow Landslide Susceptibility Modeling in a Semi-Arid Watershed. Forests 2019, 10, 743.
53. Lian, C.; Zeng, Z.; Yao, W.; Tang, H. Extreme learning machine for the displacement prediction of landslide under rainfall and reservoir level. Stoch. Environ. Res. Risk Assess. 2014, 28, 1957–1972.
54. Zhang, Y.-X.; Lan, H.-X.; Li, L.-P.; Wu, Y.-M.; Chen, J.-H.; Tian, N.-M. Optimizing the frequency ratio method for landslide susceptibility assessment: A case study of the Caiyuan Basin in the southeast mountainous area of China. J. Mt. Sci. 2020, 17, 340–357.
55. Park, S.; Hamm, S.-Y.; Kim, J. Performance Evaluation of the GIS-Based Data-Mining Techniques Decision Tree, Random Forest, and Rotation Forest for Landslide Susceptibility Modeling. Sustainability 2019, 11, 5659.
56. Wang, Y.; Fang, Z.; Hong, H. Comparison of convolutional neural networks for landslide susceptibility mapping in Yanshan County, China. Sci. Total Environ. 2019, 666, 975–993.
57. Wang, L.-J.; Sawada, K.; Moriguchi, S. Landslide susceptibility analysis with logistic regression model based on FCM sampling strategy. Comput. Geosci. 2013, 57, 81–92.
58. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
59. Chen, T.; Zhu, L.; Niu, R.-Q.; Trinder, C.J.; Peng, L.; Lei, T. Mapping landslide susceptibility at the Three Gorges Reservoir, China, using gradient boosting decision tree, random forest and information value models. J. Mt. Sci. 2020, 17, 670–685.
60. Arabameri, A.; Pradhan, B.; Rezaei, K.; Sohrabi, M.; Kalantari, Z. GIS-based landslide susceptibility mapping using numerical risk factor bivariate model and its ensemble with linear multivariate regression and boosted regression tree algorithms. J. Mt. Sci. 2019, 16, 595–618.
61. Song, Y.; Niu, R.; Xu, S.; Ye, R.; Peng, L.; Guo, T.; Li, S.; Chen, T. Landslide Susceptibility Mapping Based on Weighted Gradient Boosting Decision Tree in Wanzhou Section of the Three Gorges Reservoir Area (China). ISPRS Int. J. Geo-Inf. 2019, 8, 4.
62. Tehrany, M.S.; Kumar, L.; Jebur, M.N.; Shabani, F. Evaluating the application of the statistical index method in flood susceptibility mapping and its comparison with frequency ratio and logistic regression methods. Geomat. Nat. Hazards Risk 2018, 10, 79–101.
63. Chang, Z.; Du, Z.; Zhang, F.; Huang, F.; Chen, J.; Li, W.; Guo, Z. Landslide Susceptibility Prediction Based on Remote Sensing Images and GIS: Comparisons of Supervised and Unsupervised Machine Learning Models. Remote Sens. 2020, 12, 502.
64. Zhu, Q.; Chen, L.; Hu, H.; Pirasteh, S.; Li, H.; Xie, X. Unsupervised Feature Learning to Improve Transferability of Landslide Susceptibility Representations. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3917–3930.

Figure 1. Geographical position of the study area showing landslide locations.

Figure 2. Average monthly rainfall data (from 1959 to 2017) for the Pinggu district.

Figure 3. Field investigation photos. (a) Shallow landslide in Hetaowa village; (b) shallow landslide in Guanshang village; (c) shallow landslide in Taoyuan village; (d) shallow landslide in Qingshui village.

Figure 4. Study area thematic maps: (a) rainfall; (b) distance to road; (c) elevation; (d) slope angle; (e) plan curvature; (f) profile curvature; (g) topographic wetness index; (h) distance to fault; (i) distance to stream; (j) roundness.

Figure 5. Flow chart of the methodology applied in the study.

Figure 6. Clustering validity function V_cs.

Figure 7. ROC curves for the four models. (a) Success rate curve for training data; (b) prediction rate curve for validation data.

Figure 8. Landslide susceptibility maps: (a) LR model; (b) FCM coupled with LR model; (c) FA coupled with LR model; (d) FCM, FA coupled with LR model.

Figure 9. The percentage of the different susceptibility classes.

Table 1. Statistics of geological disasters in the Pinggu district.

Type	Earthquake	Collapse	Landslide	Statistics Time
Quantity	90	168	114	As of October 2016

Table 2. Regression coefficients of the LR model.

Factor	B	S.E	Wals
Slope angle (F4)	0.619	0.124	22.55
Roundness (F10)	0.488	0.112	14.891
Constant	−1.322	0.126	13.594

Note: B = logistic coefficient; S.E = standard error of estimation; Wals = Wald statistic.

Table 3. Statistics indexes and AUC value of the LR model.

CSR²	NR²	Accuracy (%)	Sensitivity (%)	Specificity (%)	AUC
0.662	0.776	77.16	69.15	76.26	0.745

Table 4. Clustering results of 503 catchments.

Category	0	1	Total
I	88	31	119
II	132	24	156
III	55	21	76
IV	114	38	152
Total	389	114	503

Table 5. Regression coefficients of each LR model.

Parameters/Coefficients	Model I	Model II	Model III	Model IV
Maximum 7 days rainfall (F1)	0	0.852	0	0
Distance to road (F2)	0	0	0	0
Elevation (F3)	2.931	0	0	0
Slope angle (F4)	0	0	0	0
Plan curvature (F5)	0	0.588	−0.477	0
Profile curvature (F6)	0	0	0	0
Topographic wetness index (F7)	0.966	0	0	0
Distance to fault (F8)	0	0	0	0
Distance to stream (F9)	−1.757	0	0.322	0
Roundness (F10)	0	0	0	0.354
Constant	−2.121	−1.867	−1.171	−1.233

Table 6. Statistical indexes and AUC value of the hybrid model.

	Model I	Model II	Model III	Model IV	Total (%)
Sensitivity (%)	74.88	71.28	74.18	72.30	73.81
Specificity (%)	86.66	84.84	82.35	88.66	85.77
Accuracy (%)	85.24	82.33	84.23	86.11	84.89

Table 7. The correlation coefficients between principal components and original variables.

Factor	C1	C2	C3	C4	C5
Maximum 7 days rainfall (F1)	−0.469	−0.657	0.155	−0.066	0.059
Distance to road (F2)	−0.019	0.921	0.028	−0.018	0.036
Elevation (F3)	0.814	0.475	0.181	−0.101	0.042
Slope angle (F4)	−0.020	−0.029	0.125	0.974	−0.004
Plan curvature (F5)	−0.128	−0.003	−0.949	−0.139	−0.049
Profile curvature (F6)	0.915	0.190	0.038	0.106	0.116
Topographic wetness index (F7)	−0.846	−0.076	0.067	−0.060	−0.044
Distance to fault (F8)	0.783	0.016	0.330	−0.210	0.063
Distance to stream (F9)	0.555	0.650	0.817	−0.116	0.083
Roundness (F10)	0.120	0.029	0.050	−0.004	0.988
Contribution rate (%)	45.726	15.794	10.979	9.449	8.233
Accumulative contribution (%)	45.726	61.52	72.499	81.998	90.231

Table 8. The regression coefficients of common factors in LR model.

Common Factors	B	S.E	Wals
C1	0.692	0.366	9.016
C5	0.61	0.323	8.241
Constant	−1.371	0.311	8.715

Table 9. Statistical indexes and AUC value of the hybrid model.

−2LL	CSR²	NR²	KMO	Accuracy (%)	Sensitivity (%)	Specificity (%)
69.791	0.345	0.698	0.789	83.29	73.11	84.79

Table 10. Contribution rates of common factors in four models.

Factors	Model I	Model II	Model III	Model IV
C1	39.244	46.216	47.293	48.432
C2	14.147	15.274	18.529	14.424
C3	11.889	10.665	11.911	10.941
C4	10.02	8.808	9.476	9.836
C5	7.245	7.388	0	0
Accumulative contribution (%)	85.544	88.352	87.209	86.633

Table 11. The regression coefficients of common factors in LR model.

Factors	Model I	Model II	Model III	Model IV
C1	0	0	0.587	0
C2	0.59	0	0	0
C3	0	0	0	0
C4	0	0.421	0	0.349
C5	0	0	0	0
Constant	−1.212	−1.825	−1.347	−1.248

Table 12. Statistical indexes and AUC value of the hybrid model.

Statistical Indexes	Model I	Model II	Model III	Model IV
−2LL	42.595	40.453	28.913	38.923
CSR²	0.709	0.645	0.682	0.693
NR²	0.617	0.723	0.754	0.776
KMO	0.600	0.762	0.711	0.742
Sensitivity (%)	76.74	88.07	78.62	78.26
Specificity (%)	87.91	91.92	90.24	88.24
Accuracy (%)	85.37	91.59	87.76	86.53

Table 13. Model performance using validation data.

Model	Training Accuracy (%)	Training Sensitivity (%)	Training Specificity (%)	Validation Accuracy (%)	Validation Sensitivity (%)	Validation Specificity (%)
LR model	77.66	67.45	78.2	77	65.52	72.55
FCM coupled with LR model	83.63	72.61	85.15	77.63	67.6	78.4
FA coupled with LR model	83.29	73.11	84.79	80.72	71.79	81.47
FCM, FA coupled with LR model	88.01	79.85	89.58	85.25	74.96	86.21

Table 14. AUC values of models using validation data.

Model	AUC for Training	AUC for Validation
LR model	0.755	0.736
FCM coupled with LR model	0.788	0.744
FA coupled with LR model	0.818	0.782
FCM, FA coupled with LR model	0.862	0.827
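
The drop from training to validation can be read directly from Tables 13 and 14. The short sketch below computes the gap in accuracy and in AUC for each model. The values are copied from the two tables, while the shortened model labels and the helper name are hypothetical.

```python
# (training, validation) pairs copied from Table 13 (accuracy, %) and Table 14 (AUC).
ACCURACY = {
    "LR": (77.66, 77.00),
    "FCM + LR": (83.63, 77.63),
    "FA + LR": (83.29, 80.72),
    "FCM + FA + LR": (88.01, 85.25),
}
AUC = {
    "LR": (0.755, 0.736),
    "FCM + LR": (0.788, 0.744),
    "FA + LR": (0.818, 0.782),
    "FCM + FA + LR": (0.862, 0.827),
}


def train_validation_gap(table):
    """Training value minus validation value for each model, rounded to 3 decimals."""
    return {model: round(train - valid, 3) for model, (train, valid) in table.items()}


accuracy_gap = train_validation_gap(ACCURACY)
auc_gap = train_validation_gap(AUC)
print(accuracy_gap)
print(auc_gap)
print(max(accuracy_gap, key=accuracy_gap.get), max(auc_gap, key=auc_gap.get))
```

A larger gap points to weaker generalization, so a model with a high training score but a large gap is not necessarily the better choice.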

Table 15. Standard errors of each model.

Model	Standard Errors
LR model	0.033
FCM coupled with LR model	0.031
FA coupled with LR model	0.021
FCM, FA coupled with LR model	0.031

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

