Spatial Prediction of Landslides Using Hybrid Integration of Artificial Intelligence Algorithms with Frequency Ratio and Index of Entropy in Nanzheng County, China

Abstract: The main objective of this study is to introduce hybrid integration approaches that combine a state-of-the-art artificial intelligence algorithm (SysFor) with two bivariate models, namely the frequency ratio (FR) and index of entropy (IoE), for landslide spatial prediction. Hybrid integrations of these two bivariate models with logistic regression (LR) were used as benchmark models. Nanzheng County was selected as the study area. First, a landslide distribution map was produced from news reports, satellite image interpretation, and a regional survey. A total of 202 landslides were identified and marked. Based on previous studies and local geological environment conditions, 16 landslide conditioning factors were chosen for landslide spatial prediction: elevation, profile curvature, plan curvature, slope angle, slope aspect, stream power index (SPI), topographic wetness index (TWI), sediment transport index (STI), distance to roads, distance to rivers, distance to faults, lithology, rainfall, soil, normalized difference vegetation index (NDVI), and land use. Then, the 202 landslides were randomly split into two parts with a ratio of 70:30: 141 landslides (70%) were used as the training dataset and the remaining 61 landslides were used as the validating dataset. Next, the evaluation models were built using the training dataset and compared using the receiver operating characteristics (ROC) curve. The results showed that all models performed well; the FR_SysFor model exhibited the best prediction ability (AUC = 0.831), followed by the IoE_SysFor model (0.819), IoE_LR model (0.702), FR_LR model (0.696), IoE model (0.691), and FR model (0.681). Overall, these six models are practical tools for landslide spatial prediction, and the results can provide a reference for landslide prevention and control in the study area.


Introduction
Landslides, as one of the most frequent geological disasters, have caused casualties, property damage and a series of geological environment problems [1,2]. According to the statistics report of the China Institute of Geo-environmental Monitoring [3], a total of 2966 geological disasters occurred in 2018, including 1631 landslides, resulting in 112 deaths and a direct financial loss of CNY 1.47 billion.

Frequency Ratio
The frequency ratio (FR) is defined as the ratio between the percentage of landslides and the percentage of pixels within one class [20,55]. A frequency ratio larger than 1 indicates a stronger relationship between landslides and the factor class [56,57]. The frequency ratio is calculated by Equation (1):

$$FR = \frac{N_{LSpix} / \sum N_{LSpix}}{N_{Cpix} / \sum N_{Cpix}} \qquad (1)$$

where $N_{LSpix}$ is the number of landslides in a class, $N_{Cpix}$ is the number of pixels of that class, and the sums run over all classes of the factor.
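As a minimal sketch of the ratio defined above, the following Python function computes the FR of every class of one conditioning factor. The function name and the counts are hypothetical and serve only as an illustration; they are not values from this study.

```python
def frequency_ratio(landslide_counts, class_pixels):
    """Return the frequency ratio of each class of one conditioning factor.

    landslide_counts[i]: landslides (or landslide pixels) in class i
    class_pixels[i]:     total number of pixels belonging to class i
    Every class is assumed to contain at least one pixel.
    """
    total_landslides = sum(landslide_counts)
    total_pixels = sum(class_pixels)
    ratios = []
    for n_ls, n_c in zip(landslide_counts, class_pixels):
        landslide_share = n_ls / total_landslides  # share of landslides in the class
        area_share = n_c / total_pixels            # share of the area in the class
        ratios.append(landslide_share / area_share)
    return ratios


# Hypothetical factor with three classes (illustrative numbers only).
fr = frequency_ratio([10, 30, 60], [5000, 3000, 2000])
```

In this toy case the third class holds 20% of the pixels but 60% of the landslides, so its FR is about 3, well above 1.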

Index of Entropy
The second model is the index of entropy (IoE), which is based on the bivariate analysis principle [58,59]. The model calculates a weight for each input variable, and the weight shows which variables are most relevant to the occurrence of landslides in the natural environment [20]. The weight obtained for each variable is taken as its entropy index [60]. The equations used to calculate the weight of each variable are shown as Equations (2)-(8).
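Appl. Sci. 2020, 10, 29

$$P_{ij} = \frac{b}{a} \qquad (2)$$

$$(P_{ij}) = \frac{P_{ij}}{\sum_{i=1}^{S_j} P_{ij}} \qquad (3)$$

$$H_j = -\sum_{i=1}^{S_j} (P_{ij}) \log_2 (P_{ij}), \quad j = 1, 2, 3, \ldots, n \qquad (4)$$

$$H_{j\max} = \log_2 S_j \qquad (5)$$

$$I_j = \frac{H_{j\max} - H_j}{H_{j\max}} \qquad (6)$$

$$W_j = I_j \times P_j \qquad (8)$$

where $a$ is the percentage of the defined domain; $b$ is the landslide percentage; $P_{ij}$ is the probability density; $H_{j\max}$ and $H_j$ are both entropy values; $S_j$ is the number of classes; $I_j$ is the information coefficient; and $W_j$ is the weight for the variable as a whole.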
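The entropy weighting can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure of this study: the probability density is taken as the ratio b/a, the information coefficient as the relative entropy deficit (H_jmax minus H_j, divided by H_jmax), and P_j as the mean probability density over the classes of the factor. The last of these in particular is an assumption, because the definition of P_j is not given in the text above. The function name and inputs are hypothetical.

```python
import math


def ioe_weight(domain_pct, landslide_pct):
    """Index-of-entropy weight W_j of one conditioning factor (sketch).

    domain_pct[i]:    percentage of the study area in class i (a), must be > 0
    landslide_pct[i]: percentage of landslides in class i (b)
    The factor must have at least two classes.
    """
    s_j = len(domain_pct)                                   # number of classes
    density = [b / a for a, b in zip(domain_pct, landslide_pct)]
    total = sum(density)
    normalised = [d / total for d in density]               # Equation (3)
    # Entropy of the factor; a class with zero density contributes nothing.
    h_j = -sum(p * math.log2(p) for p in normalised if p > 0)
    h_max = math.log2(s_j)                                  # Equation (5)
    info_coeff = (h_max - h_j) / h_max                      # ASSUMED form of I_j
    p_j = total / s_j                                       # ASSUMPTION: mean density
    return info_coeff * p_j                                 # Equation (8)


# Landslides spread evenly over the classes carry no information: weight 0.
w_uniform = ioe_weight([25, 25, 25, 25], [25, 25, 25, 25])
# All landslides concentrated in one of two equal-area classes: weight 1.
w_concentrated = ioe_weight([50, 50], [100, 0])
```

The two toy inputs show the intended behaviour: the weight grows as landslides concentrate in fewer classes of a factor.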

SysFor
The systematically developed forest of multiple trees (SysFor) is a data mining algorithm based on the concept of the gain ratio, proposed by Islam and Giggins in 2011 [61,62]. Compared with commonly used techniques, SysFor can be applied to both high-dimensional and low-dimensional data sets, and it also shows a higher prediction accuracy [63]. Generally, SysFor is built in the following four steps:
(1) According to the user-defined gain ratio and separation values, a set of good attributes and their split points is identified.
(2) If the number of good attributes is smaller than the user-defined number of trees, each good attribute is selected as the root attribute of one tree, so the number of trees built is equal to the number of good attributes.
(3) If the number of trees generated in step (2) is less than the user-defined number of trees, more trees are built.
(4) All trees built in steps (2) and (3) are returned as the SysFor.
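Step (1) can be illustrated with a simplified gain-ratio screening. The sketch below is not the SysFor implementation: it ignores the separation parameter, builds no trees, and assumes that an attribute counts as "good" when its best binary split reaches a user-defined gain ratio. All function names, the threshold, and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, split):
    """Gain ratio of the binary split `value <= split` for one attribute."""
    left = [y for x, y in zip(values, labels) if x <= split]
    right = [y for x, y in zip(values, labels) if x > split]
    if not left or not right:
        return 0.0
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    gain = entropy(labels) - w_left * entropy(left) - w_right * entropy(right)
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return gain / split_info


def good_attributes(table, labels, min_gain_ratio):
    """Return (attribute, split point, gain ratio) for attributes above the threshold.

    table maps an attribute name to its list of numeric values.
    """
    found = []
    for name, values in table.items():
        candidates = sorted(set(values))[:-1]  # the largest value cannot split
        best = max(((gain_ratio(values, labels, s), s) for s in candidates),
                   default=None)
        if best is not None and best[0] >= min_gain_ratio:
            found.append((name, best[1], best[0]))
    return sorted(found, key=lambda item: -item[2])


# Toy data: "slope" separates the two classes perfectly, "noise" does not.
labels = [0, 0, 1, 1]
table = {"slope": [1, 2, 3, 4], "noise": [1, 2, 1, 2]}
selected = good_attributes(table, labels, min_gain_ratio=0.3)
```

In the full algorithm, each attribute that survives such a screening would become the root of one tree, as described in step (2).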

Logistic Regression
Logistic regression is a multivariable method [64][65][66]. The main goal of a logistic regression model is to find the best-fitting model for determining the presence or absence of landslides from a set of variables [47,67]. The relationship between landslide occurrence and the variables is shown in Equation (9):

$$P = \frac{1}{1 + e^{-Z}} \qquad (9)$$

where P is the probability of landslide occurrence, and Z is the linear sum obtained from the products of the independent variables and their coefficients. The calculation of Z is shown in Equation (10):

$$Z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \qquad (10)$$

where α is a constant; βi (i = 1, 2, 3, …, n) are the coefficients; and xi (i = 1, 2, 3, …, n) are the independent variables.
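A minimal Python sketch of the logistic model, assuming Equation (9) is the standard logistic function of Z and Equation (10) is the linear sum of a constant and the weighted variables. The function name, coefficients, and inputs are hypothetical.

```python
import math


def landslide_probability(alpha, betas, xs):
    """Probability of landslide occurrence from a logistic model (sketch).

    alpha: constant term
    betas: coefficients, one per independent variable
    xs:    values of the independent variables for one pixel
    """
    z = alpha + sum(b * x for b, x in zip(betas, xs))  # linear sum Z
    # Two algebraically equal forms keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative coefficients only; here Z = 0, so the probability is 0.5.
p = landslide_probability(alpha=-1.0, betas=[0.5, 0.25], xs=[1.0, 2.0])
```

The output always lies between 0 and 1, which is why Z can be mapped directly to an occurrence probability.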

Study Area and Data Used
The study area (Nanzheng County) lies in Shaanxi Province, China (Figure 1). It is located between longitudes 106°30′ and 107°22′ E and latitudes 32°24′ and 33°07′ N and covers an area of 2823 square kilometers. The altitude of the study area is between 442 and 2410 m. The study area has a subtropical monsoon climate, with an annual average temperature of 14.2 °C. Rainfall is mainly concentrated from June to September, and the mean annual rainfall is 909.8 mm.
A landslide inventory can give insight into landslide locations, dates, types, and damage caused [68][69][70]. In this study, the landslide inventory map was prepared on the basis of historical landslide records and satellite images (Google Earth and ZY03 images). A total of 202 landslides were identified, including 190 slides and 12 rock falls [71]. The largest landslide was more than 1,000,000 m³, and the smallest landslide was nearly 160 m³ [70]. Finally, the landslides were randomly divided into a training dataset (141 landslides) and a validating dataset (61 landslides) with a ratio of 70:30 (Figure 1).
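The random 70:30 split of the inventory can be sketched as follows. The function name and the seed are hypothetical; the study does not state which random procedure or software was used.

```python
import random


def split_inventory(landslide_ids, train_fraction=0.7, seed=42):
    """Randomly split landslide identifiers into training and validating sets."""
    rng = random.Random(seed)       # fixed seed only to make the sketch repeatable
    shuffled = list(landslide_ids)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# 202 inventoried landslides give 141 training and 61 validating samples.
train_ids, valid_ids = split_inventory(range(202))
```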
There is no clear agreement about the precise causes of landslides because of their complex nature and development. Based on recent research and local geological environment characteristics [72][73][74][75][76], 16 landslide conditioning factors were compiled: elevation, profile curvature, plan curvature, slope angle, slope aspect, stream power index (SPI), topographic wetness index (TWI), sediment transport index (STI), distance to roads, distance to rivers, distance to faults, lithology, rainfall, soil, normalized difference vegetation index (NDVI), and land use. All 16 conditioning factors were converted to a resolution of 30 × 30 m.
Elevation influences earth surface and topographic attributes, which account for the spatial variability of precipitation, soil thickness, erosion, and vegetation [38]. Profile curvature and plan curvature have significant effects on surface runoff and infiltration [77]. Slope angle is an important factor that controls the velocity of the slopes, and the slope aspect has an important effect on rainfall, wind and sunlight exposure [78]. SPI, TWI and STI are related to soil water content, water accumulation, and the progress of erosion and sedimentation in a watershed, which influence slope stability [77,78]. In the present study, the elevation map was acquired from a 30 ×
Distance to roads, distance to rivers and distance to faults are three commonly used landslide conditioning factors; they are related to the infiltration and strength of slopes [79,80]. These three factors were divided into five categories by the equal distance method, as shown in Figure 2i-k, respectively. Lithology is one of the most important restrictive factors in landslide susceptibility evaluation; areas with highly resistant rocks or highly permeable subsoil material have low drainage density [81], which are natural factors essential to determine landslide occurrence [82,83]. The lithology map of this study was extracted from a geological map with a 1:1,000,000 scale. The study area has 12 different lithological units, which are shown in Figure 2l. Rainfall is a widely recognized landslide-inducing factor, which not only increases the weight of the slope but also reduces soil strength [84,85]. From Figure 2m, it can be seen that the rainfall in the middle of the study area is significantly higher than that in the north and south. The rainfall map was reclassified with an interval of 100 mm/yr. Soil is the material composition of the slope, and different soils have different physical and mechanical properties.
Soil type was extracted from the geological map with a 1:1,000,000 scale, as shown in Figure 2n, and there are nine kinds of soils in the study area. NDVI is a measure of surface reflectance and gives a quantitative estimate of biomass and vegetation growth [38,86-88]. The NDVI values in this paper were reclassified into five categories: −0.21 to 0.21, 0.21 to 0.36, 0.36 to 0.44, 0.44 to 0.52, and 0.52 to 0.65 (Figure 2o). Land use is considered the direct manifestation of the impact of human activities on landslide probability. Land use types control the amount of infiltration and surface runoff generation [77]. In general, landslides are concentrated in areas of human activity [89,90]. In this study, a land use map was extracted from regional land use maps with a 1:100,000 scale. Land use was divided into farmland, forestland, grassland, water, residential areas, and bare land (Figure 2p).

Results
As the LR method was employed to produce the landslide susceptibility map, it is necessary to analyze the multicollinearity of the landslide conditioning factors [91]. Currently, the tolerance (TOL) and variance inflation factor (VIF) are the most widely used measures in multicollinearity analysis [92,93]. Generally, a TOL < 0.1 or a VIF > 10 indicates multicollinearity [72]. In this study, the results indicate that there is no multicollinearity among the 16 landslide conditioning factors (Table 1).
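The two diagnostics can be sketched in plain Python, assuming their usual definitions: TOL is 1 minus the R² obtained when one factor is regressed on all the others, and VIF is the reciprocal of TOL. The function names and the toy data are hypothetical, and the study does not state which software was used for Table 1.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def tolerance_and_vif(columns):
    """Return a (TOL, VIF) pair per factor; columns holds one value list per factor.

    Perfectly collinear factors give TOL = 0 and are not handled here.
    """
    n = len(columns[0])
    results = []
    for j, y in enumerate(columns):
        # Predictors: an intercept plus every other factor.
        preds = [[1.0] * n] + [c for i, c in enumerate(columns) if i != j]
        # Least squares through the normal equations (A^T A) beta = A^T y.
        ata = [[sum(p[t] * q[t] for t in range(n)) for q in preds] for p in preds]
        aty = [sum(p[t] * y[t] for t in range(n)) for p in preds]
        beta = solve(ata, aty)
        fitted = [sum(b * p[t] for b, p in zip(beta, preds)) for t in range(n)]
        mean_y = sum(y) / n
        ss_res = sum((y[t] - fitted[t]) ** 2 for t in range(n))
        ss_tot = sum((v - mean_y) ** 2 for v in y)
        tol = ss_res / ss_tot            # TOL = 1 - R^2
        results.append((tol, 1.0 / tol))
    return results


# Three mutually uncorrelated toy factors: TOL = 1 and VIF = 1 for each.
independent = tolerance_and_vif([[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]])
```

Two nearly identical factors would instead give a VIF far above 10, which is the situation the thresholds above are meant to flag.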

Application of FR Model
The frequency ratio (FR) is a commonly used bivariate statistical method in landslide susceptibility assessment [21,25]. A frequency ratio of 1 represents the average value; a value greater than 1 indicates a strong correlation between the factor class and the occurrence of landslides; and a value less than 1 indicates a weak correlation. The landslide susceptibility index (LSI) calculated by the FR model can be expressed as Equation (11), as shown in Table 2.
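As an illustration, the sketch below assumes that the LSI of a pixel is the sum of the FR values of the classes the pixel falls into, which is the usual form of an FR-based index; Equation (11) itself is not reproduced here, so this form is an assumption. The three FR values are those reported in this paper for the 600-800 m elevation class, the <300 m distance-to-roads class, and the 1100-1200 mm/yr rainfall class. The function and dictionary names are hypothetical, and only three of the 16 factors are shown.

```python
def lsi_fr(pixel_classes, fr_table):
    """Sum the FR values of the classes one pixel belongs to (assumed LSI form).

    pixel_classes: maps a factor name to the class label of the pixel
    fr_table:      maps a factor name to a {class label: FR} dictionary
    """
    return sum(fr_table[factor][label] for factor, label in pixel_classes.items())


fr_table = {
    "elevation": {"600-800 m": 2.650},
    "distance_to_roads": {"<300 m": 2.126},
    "rainfall": {"1100-1200 mm/yr": 2.354},
}
pixel = {
    "elevation": "600-800 m",
    "distance_to_roads": "<300 m",
    "rainfall": "1100-1200 mm/yr",
}
lsi = lsi_fr(pixel, fr_table)  # 2.650 + 2.126 + 2.354
```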

Application of Hybrid Models
Compared with a single model, a hybrid model has higher predictive ability when dealing with high-dimensional problems [94][95][96]. In this study, the SysFor algorithm was combined with FR and IoE models. During the modeling process, the following parameters were used for hybrid FR_SysFor and IoE_SysFor models: separation, 0.3; confidence, 0.25; goodness, 0.3; number of trees built in the forest, 500.
The hybrid integration of LR with FR and IoE was also applied as a benchmark model to build landslide susceptibility maps. A forward stepwise LR was adopted, and the analysis results are given in Table 3. The landslide occurrence probability P of FR_LR and IoE_LR models can be expressed using Equations (13)
Finally, all the landslide susceptibility maps were reclassified into five categories using the equal area classification method [92,93]: very high (5%), high (10%), moderate (15%), low (20%), and very low (50%) (Figure 3).
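The equal area reclassification can be sketched as a rank-based split of the susceptibility values, using the area shares listed above. The function name and the input values are hypothetical, and tied values are assigned in sort order, which is a simplification.

```python
def equal_area_classes(lsi_values, shares=(50, 20, 15, 10, 5)):
    """Assign each pixel to one of five classes by its rank (sketch).

    shares are the area percentages from lowest to highest susceptibility:
    very low 50%, low 20%, moderate 15%, high 10%, very high 5%.
    """
    labels = ("very low", "low", "moderate", "high", "very high")
    n = len(lsi_values)
    order = sorted(range(n), key=lambda i: lsi_values[i])  # indices, low to high
    classes = [None] * n
    start, cumulative = 0, 0
    for label, share in zip(labels, shares):
        cumulative += share
        end = n if cumulative >= 100 else round(n * cumulative / 100)
        for i in order[start:end]:
            classes[i] = label
        start = end
    return classes


# 100 hypothetical pixels with susceptibility values 0..99.
classes = equal_area_classes(list(range(100)))
```

With 100 pixels, the five classes receive 50, 20, 15, 10, and 5 pixels, and the five highest values fall into the very high class.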

Validation of Landslide Susceptibility Maps
Validating the landslide susceptibility map (LSM) and determining its accuracy is an essential part of landslide susceptibility research. Without validation, the rigor of the study is greatly weakened, and the results of the study have no scientific significance [97][98][99]. Therefore, three commonly used statistical parameters were used to verify the landslide susceptibility maps: the receiver operating characteristics (ROC) curve [100][101][102] with its area under the curve (AUC) [103][104][105], the standard error (SE), and the 95% confidence interval (CI). For the training data, the FR_SysFor model acquired the highest AUC value (0.940) of all hybrid models, followed by the IoE_SysFor model (0.926), IoE_LR model (0.783) and FR_LR model (0.779) (Figure 4). In addition to the AUC value, the other statistical parameters in Tables 4 and 5 show similar results; the FR_SysFor model obtained the smallest SE (0.0132) and the narrowest 95% CI (0.906, 0.965), followed by the IoE_SysFor, IoE_LR and FR_LR models. The validating data showed similar results to the training data: the FR_SysFor model achieved the highest AUC value (0.831), the smallest SE (0.0388), and the narrowest 95% CI (0.753, 0.893).
In addition, the success rate curves and prediction rate curves of the landslide susceptibility maps were also assessed (Figures 5 and 6). On the success rate curve, the highly susceptible area recognized by the FR_SysFor model contains more than 80% of the landslides. Similarly, on the prediction rate curve, the highly susceptible area recognized by the FR_SysFor model includes more than 70% of the landslides, which illustrates that the precision of the FR_SysFor model is the highest and that the FR_SysFor model is the best model in this study.
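The three statistics can be sketched as follows. The AUC is computed as the share of landslide and non-landslide pairs that the model ranks correctly. The SE uses the Hanley and McNeil approximation and the CI a normal approximation; both are assumptions, because the text does not state how SE and CI were obtained, and the intervals reported above are not symmetric about the AUC, so this sketch is not expected to reproduce them exactly. The function name and the scores are hypothetical.

```python
import math


def auc_with_ci(scores_landslide, scores_non_landslide, z=1.96):
    """Return (AUC, SE, (lower, upper)) for two groups of susceptibility scores."""
    n_pos, n_neg = len(scores_landslide), len(scores_non_landslide)
    wins = 0.0
    for p in scores_landslide:
        for q in scores_non_landslide:
            if p > q:
                wins += 1.0      # landslide pixel ranked above non-landslide pixel
            elif p == q:
                wins += 0.5      # ties count as half
    auc = wins / (n_pos * n_neg)
    # Hanley-McNeil standard error (assumed method, see the note above).
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    se = math.sqrt(max(variance, 0.0))
    return auc, se, (max(0.0, auc - z * se), min(1.0, auc + z * se))


# Hypothetical scores: 8 of the 9 pairs are ranked correctly.
auc, se, ci = auc_with_ci([0.9, 0.8, 0.7], [0.6, 0.75, 0.1])
```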

Discussion
Currently, some novel ensemble techniques have been proposed in landslide susceptibility mapping, and the excellent performance of ensemble techniques has been proven [92,106-109]. Furthermore, the hybrid integration of machine learning algorithms with bivariate statistical models can weaken the hypotheses of the conventional bivariate models and retain the merits of bivariate statistical models and machine learning models [110].
Compared with SysFor, FR, IoE and LR are three common evaluation models which have the merits of model stability, higher accuracy and simple calculation. Razavizadeh et al. proposed GIS-based landslide susceptibility mapping with frequency ratio, statistical index, and weights of evidence models for a part of Mazandaran Province, Iran. Both the success rate curve and the prediction rate curve demonstrated that the frequency ratio (FR) is a reliable model with the highest accuracy. Moreover, the landslide susceptibility map generated by the FR model is trustworthy for hazard mitigation strategies [25]. Pourghasemi et al. introduced the index of entropy (IoE) and conditional probability models to landslide susceptibility research in Safarood Basin, Iran. The results indicated that both models have good predictive capacity, while the IoE performed slightly better than the conditional probability model in landslide susceptibility mapping [111]. Abedini et al. applied logistic regression (LR) and AHP models to landslide susceptibility assessment. The results indicated that LR is a suitable model to classify and estimate the probability of landslide occurrence in the process of project research planning and implementation [112].
In the present study, the classes of 600-800 m of elevation (ratio value = 2.650), distance to roads <300 m (ratio value = 2.126) and 1100-1200 mm/yr of rainfall (ratio value = 2.354) facilitated the occurrence of landslides. The results based on the IoE model showed that soil (0.326), land use (0.311), elevation (0.223), and slope angle (0.143) are the most important factors and are more closely related to the occurrence and spatial distribution of landslides than the other factors. In contrast, distance to faults, plan curvature, SPI, and distance to rivers achieved the four lowest weight values, which were 0.022, 0.014, 0.003, and 0.014, respectively. However, the classifications of landslide conditioning factors were based on previous studies and might not be suitable for the present study. Therefore, further studies should be conducted to find an objective classification method for landslide conditioning factors [113,114].
In order to obtain more reliable landslide susceptibility maps, the differences among the six landslide susceptibility methods were calculated and compared. According to the relevant research [58,115], the most common and effective methods include the receiver operating characteristics (ROC) curve, standard error (SE), 95% confidence interval (CI), and Wilcoxon signed-rank tests. Hence, this paper used three well-known statistical parameters (the ROC curve, SE and 95% CI) to calculate and compare the model performance. For the training data, the results of the ROC curve, SE and 95% CI can be seen in Table 4. The FR_SysFor model performed best and acquired the highest AUC value (0.940), followed by IoE_SysFor, IoE_LR, FR_LR, FR, and IoE models with AUC values of 0.926, 0.783, 0.779, 0.757, and 0.746, respectively. Similarly, the two other statistical parameters showed the same results; the FR_SysFor model obtained the smallest SE and 95% CI.
For the validating data, the parameters of the ROC curve, SE and 95% CI are shown in Table 5.
The validating data showed that the FR_SysFor model had higher accuracy (AUC = 0.831) than the remaining models and performed best in the research. Meanwhile, the smallest SE and 95% CI also belonged to the FR_SysFor model.
Overall, based on the FR and IoE models, combined with SysFor and LR models, the landslide susceptibility of Nanzheng County was studied. As described above, the FR_SysFor model performed the best in this study as compared to other models. Finally, the ensembles of FR and IoE with the proposed algorithm (SysFor) can provide a reference for landslide susceptibility research in other areas.
In practice, landslide hazard managers can employ the FR_SysFor model to determine regions with high and very high susceptibility in the study area. An early warning system for landslide occurrence can transmit useful awareness and warning information for residents living in these areas [107]. Furthermore, the landslide susceptibility map of the present study can help to construct retaining wall systems and anchor systems to enhance slope stability [107].

Conclusions
In this case study, six landslide susceptibility evaluation methods, namely, the FR, IoE, FR_SysFor, IoE_SysFor, FR_LR, and IoE_LR models, were systematically analyzed and compared as part of landslide susceptibility research in Nanzheng County (China). Based on previous research and the geological environmental characteristics of the study area, 16 conditioning factors were selected for the research: elevation, profile curvature, plan curvature, slope angle, slope aspect, SPI, TWI, STI, distance to roads, distance to rivers, distance to faults, lithology, rainfall, soil, NDVI, and land use. These models were applied to the calculation of landslide occurrence probability for the first time. Finally, the model performances were compared using the statistical parameters of ROC curves, AUC values, SE and 95% CI. The results show that all the models perform well. Compared with the other models, the prediction capability of the FR_SysFor model is the highest, with a success rate of 0.940 and a prediction rate of 0.831. Hence, the FR_SysFor model is considered the most promising technique in this study. The results can provide a reference for land-use planning and decision-making in the study area.
Author Contributions: W.C., L.F., C.L., and B.T.P. contributed equally to the work. W.C. collected field data and conducted the landslide susceptibility mapping and analysis. W.C. wrote and revised the manuscript. L.F., C.L., and B.T.P. edited the manuscript. All the authors discussed the results and edited the manuscript. All authors have read and agreed to the published version of the manuscript.