Spatial Prediction of Landslide Susceptibility Based on GIS and Discriminant Functions

: The areas where landslides occur frequently pose severe threats to the local population, which necessitates conducting regional landslide susceptibility mapping (LSM). In this study, four models including weight-of-evidence (WoE) and three WoE-based models, which were linear discriminant analys is (LDA), Fisher’s linear discriminant analysis (FLDA), and quadratic discriminant analysis (QDA), were used to obtain the LSM in the Nanchuan region of Chongqing, China. Firstly, a dataset was prepared from sixteen landslide causative factors, including eight topographic factors, three distance-related factors, and five environmental factors. A landslide inventory map including 298 landslide locations was also constructed and randomly divided with a ratio of 70:30 as training and validation data. Subsequently, the WoE method was used to estimate the relationship between landslides and the landslide causative factors, which assign a weight value to each class of causative factors. Finally, four models were applied using the training dataset, and the predictive performance of each model was compared using the validation datasets. The results showed that FLDA had a higher performance than the other three models according to the success rate curve (SRC) and prediction rate curve (PRC), illustrating that it could be considered a promising approach for landslide susceptibility mapping in the study area.


Introduction
Slope can be perceived as a critical landform configuration; however, it exposes many natural and man-made disasters such as landslides, collapse events, and debris flows, which result indifferent human and asset loss levels globally, especially in developing countries. Geological environments are increasingly affected by human engineering activities [1][2][3]. Landslide hazard assessments represent a conditional probability based on time and spatial occurrence, and landslide magnitude for a given geo-environmental setting [4]. Gradually, landslide risk assessment has received considerable attention, from analyzing landslide type and magnitude to risk elements, to the final step analyzing landslide spatial occurrence. As a factor included in landslide risk assessment, landslide spatial probability (landslide susceptibility) implies the spatial likelihood of landslide occurrence under indigenous topographic conditions [5]. However, this factor cannot imply the magnitude and temporal probability of slope failure, although it can indicate which area is more likely to have slope failure based on connections to past landslide conditions [6].
To prevent landslides before they cause grave physical and property damage, it is necessary to study the characteristics and causative factors of landslide occurrence at a large scale, to manage landslides with the ultimate goal of providing early warning of landslide hazards. Therefore, a simple expert knowledge method is first used for landslide susceptibility estimation. The combination of landslide inventory and computer science makes digital methods popular, leading to advanced methods and topics for research and application [7]. Nevertheless, landslide susceptibility estimation is not easy to conduct because databases need information covering different areas. Nevertheless, these approaches are useful, with objective and repeatable characteristics [7].
All methods can be grouped into five classes [7]: direct landform mapping, analysis of landslide inventories, process-based conceptual models, heuristic methods, and statistical methods. The first method constructs landslide susceptibility maps using remote sensing (RS) and geographic information system (GIS) techniques. The second method comprises landslide density maps. The third method studies mechanics related to stability analysis [8]. The heuristic approach is a qualitative analysis based on expert knowledge and previous work experience [8,9].
The application of statistical methods based on spatial relation between a series of thematic layers and distribution of past landslides. Statistical approaches are quantitative methods which can use different functional relationship. They can be subdivided into: (1) physically-based methods [10,11]; and (2) traditional statistical methods, such as the frequency ratio [12,13], evidential belief function [14,15], weight of evidence [16,17], discriminant analysis [18][19][20][21], and logistic regression [22,23], (3) 42], and groundwater potential mapping [43,44]. Recently, scientists have done a lot of research on hybrid machine learning methods for better assessment and prediction of landslides [45]. Wang et al. proposed a base classifier of MultiBoosting for integrating different machine learning techniques [46]. Althuwaynee et al. applied ensemble decision tree-based chi-squared automatic interaction detector (CHAID) and multivariate logistic regression models to assess landslide spatial occurrence for the first time [47].
This study investigated landslide susceptibility assessment in the Nanchuan area, China, using linear discriminant analysis (LDA), Fisher's linear discriminant analysis (FLDA), and quadratic discriminant analysis (QDA), as well as weight-of-evidence (WoE) statistical models. These models were selected and applied in this research because of their successful performance in other fields. As can be seen from numerous articles which apply data mining technology, FLDA, LDA, and QDA have been successfully used in many other classification problems, such as groundwater spring potential mapping [48], fault prediction [49], and soil studies [50,51]. WoE model have been used to determine the importance of the independent variables in different research areas [52,53]. In the present study, in order to eliminate the problem of class imbalance in the dataset, the WoE model was used to determine the importance of the classes in each landslide causative factor, and then the weighted classes were used as input to build three discriminant models.

Geological and Geomorphological Setting
The research was conducted in the Nanchuan district (approximatively 2600 km 2 ), which is located in northern Chongqing City ( Figure 1). Geologically, based on the China Geological Survey, the study area outcropped a great variety lithologies ( Figure 2, Table 1) (http://www.cgs.gov.cn/). The study area falls in the adjoining zones of the southeast edge of the Sichuan Basin and the northern Yunnan-Guizhou Plateau. The topography and geomorphology of the Nanchuan district have both basin and plateau characteristics. Faults in the study area generally strike in NNE, NE, and NNW directions, of which the main type is compressional fault. The Nanchuan urban district is situated on the border of many middle-low mountains. The Dalou Mountains, in the east of Nanchuan urban area, range from 1,000 to 2,000 m in altitude. Whereas red-beds landforms, in the west of Nanchuan urban area, are relatively smaller, with altitudes range from 500 to 1000 m. In the study area, slope belts are extensively covered by Quaternary deposits (http://www.cgs.gov.cn/).

Materials and Methods
In the present study, landslide susceptibility analysis was conducted using four models. The modeling steps are presented in Figure 3.

Preparation of Training and Validation Datasets
Landslide inventory is essential to successfully perform landslide susceptibility mapping in a giving area. Firstly, landslide inventory map was extracted from previous landslide records, aerial photographs, and field research. A total of 298 landslides were identified and randomly split into training (70% of landslides; 209) and validation datasets (30% of landslide; 89). An analysis showed that the size of the largest landslide was 840,000 m 2 and the smallest was 70 m 2 . According to regional geographical and geological conditions, we selected sixteen representative causative factors (Supplementary, Figure S1) to apply the assessment of landslide susceptibility. Depending on the factor attribute, the selected causative factors included three different types: (1) topographic factors, which are aspect, slope, elevation, power index stream power index (SPI), sediment transport index (STI), topographic wetness index (TWI), plan curvature, and profile curvature; (2) distance-related factors, which are distance to roads, distance to rivers and distance to faults; and (3) environmental factors which are normalized difference vegetation index (NDVI), lithology, land use, soil, and rainfall [54][55][56]. The digital elevation model (DEM) is derived from the ASTER GDEM data with the resolution of 30mx30m. Then the aspect, angle, elevation, profile curvature, plan curvature, STI, SPI, and TWI factor maps can be extracted from DEM data.
Slope angle and aspect are two important landslide configuration factors. Slope angle is directly related to landslide formation [20,57]. Slope aspect can indirectly affect landslide occurrence via wind, lineaments, and sunshine effects [20,58,59]. The elevation is a significant factor, which shows geologic and geomorphologic characteristics [60,61].
The power index stream power index (SPI) and sediment transport index (STI) measure the erosive power of water flow and overland flow, respectively [62]. They were calculated on the basis of specific catchment areas (AS) and local slope gradient in degrees (β) [63,64]. Equation (1), gives the net erosion and depositional area in a curved profile, while Equation (2) reflects sediment transport controlled by the slope angle [63,64].
The topographic wetness index (TWI) is another contributing factor affecting topography of runoff-saturated source areas [64]. The TWI is defined by Equation (3).
where α is the area drained per unit contour length at a point, and β is the slope [14,64]. Plan curvature illustrated the curvature degree of a contour line that forms the intersection of the horizontal plane and the ground [65]. Profile curvature expresses the rate of change of a slope angle in the direction of maximum slope [66].
Distances to roads, rivers and faults were compiled from the topographic map (1:50,000), and geological map (1:200,000), respectively. Three distance-related factors were built using a buffer for rivers, faults, and roads. Because faults can induce sharp ground displacement caused by ground stress. Building roads destroys the original terrain due to artificial slope cutting activities. Rivers weaken slope stability by different river density affecting groundwater potential [67].
The value of NDVI represents the degree of vegetation coverage [20]. The NDVI value can be acquired by the NDVI map obtained from Landsat 8 Operational Land Imager (OLI) images. The value can be calculated according to Equation (4).
where IR is the infrared band, and R is the red band. Lithology is a decisive factor because it is closely related to landslide occurrence. Land use groups identified by using a Landsat Enhanced dataset. Different soil types lead to different susceptibility degrees, because it influences strength and permeability of rocks and soils. As the last causative factor, excessive rainfall can induce landslides.

Weight of Evidence
The application of weight-of-evidence modeling was demonstrated in this study for large-scale landslide susceptibility mapping. Based on the Bayesian probability model, it was originally used in medicine [68] and then developed for identification of mineral potential using GIS in many countries [68][69][70]. Some authors have employed the weight-of-evidence method to landslide susceptibility mapping [71][72][73][74][75][76][77][78]. This statistical method has many advantages, including its ease of use and speed because it is a quantitative data-driven method [79]. In using the WoE method, each relevant factor's weight was measured by analysis of the spatial relationship between the landslide location and each class of the causative factors.
The methods can be described as Equation (5) and Equation (6). Additionally, the essential parameters can be calculated through each causative factor (B) based on the presence or absence of the landslide within a certain group of causative factors (N) [80].
P{B|N} is a conditional probability of B occurring given the presence of N. B and B in numerator and denominator signify the presence and absence of a causative factor while N and N signify the presence and absence of a disaster [81]. Positive weight (w+) and negative weight (w-) indicate the presence and absence of a predictable variable and the magnitude indicates the level of positive and negative correlation between a predictable variable and the landslides. The weight contrast, C=W-W was used to measure spatial association. When the contrast value is equal to zero, a certain class of causal variable does not have a clear spatial association. Using Equation (7) and Equation (8), the weight variances (S 2 ) can be calculated as follows [82][83][84]: Based on S 2 , the standard deviation of the contrast is given by , and the studentized contrast, C/S(C), gives a measure of confidence [85].

Fisher's Linear Discriminant Function
FLDA is referred to as a statistical learning method, which can acquire optimal decision boundaries for a given data set. Discriminant analysis is used for this method to build a model [84]. Because classification must be performed in one-dimension, the FLDA algorithm projects training classes onto a line. This maximizes the distance between the means of two classes and minimizes their variance. For landslide prediction, landslide influence factors use the xi (i=1,2…n) representation, and two landslide classes use the Yj (j=1,2) expression. Parameter vector n is (Xi,Yj), ascertains the classification of Fisher's linear discriminant [86]: where Nb is the definition of a between-class scatter matrix, and Nw is a within-class scatter matrix defined as where i l refers to the sample mean of the classes, and T values are training data. Using the limited optimization problem Equation (9), the invariant G(u)can be maximized.

Quadric Discriminant Analysis
As a soft computing learning algorithm, quadratic discriminant analysis (QDA) can separate the class of a case by the fit of its quadratic surface. The resulting value in each class expresses a normal distribution. In the current study, this univariate statistical method was used in landslide susceptibility mapping to build a model based on groups that contain landslide disaster statistical information [87,88]. QDA is based on finding a category membership comprising a square n × n matrix, where n represents a number of variables. These variables combine linearly following the form Q xTBx eTx a    , where B is an n × n coefficient matrix, e indicates the linear combination coefficient, and c is constant [89]. Some advantages of QDA over LDA have been reported, such as the fact that covariance of each class need not be identical and different covariance values in classes can be properly handled.

Linear Discriminant Analysis
In 1936, Fisher pioneered the use of linear discriminant analysis. LDA is a popular machine learning method for high accuracy task classification [90]. In addition, parameter prediction is relatively stable [91]. For LDA objects, there are mutually-exclusive groups for independent input variables [92,93]. LDA first finds the linear combination of some classes of causative factors such as K=αV+m (m is constant). Determined by a linear combination, the estimated (k) values with proper α coefficients can best distinguish a case set [89]. This step is required for data preprocessing to reduce dimensionality in machine-learning models.

Correlation Analysis of Influencing Factors
In this study, sixteen different input variables were selected from numerous possible variables for landslide susceptibility mapping and have been applied to construct a model using four different methods. The spatial relationship between each causative factor and landslides were analyzed using the WoE model (Supplementary Table S1). Using this method, W+ and W-are acquired first. The weight contrast (C) can be calculated by subtraction between W+ and W-and the variances in W+ and W-are represented by S2W+ and S2W-. Then, the standard deviation can be calculated by S2W+, and S2W-, which is shown as S(C) (Supplementary Table S1). C/S(C) reflects the final weight, in which positive C/S(C) in each causative factor class indicates a positive correlation with landslide occurrence, while the negative value in each class indicates a negative correlation in this class.
There was no significant difference in weight value based on the comparative analysis of two values C and C/S(C) for the aspect factor. For slope, the C/S(C) value in the 10-20 class had an apparent maximum weight over other classes and nearly half of the landslides occur in this class. Elevation classes (312-700) and (900-1100) were positively correlated with landslide occurrence. For STI, SPI, and TWI, the maximum index interval had a maximum weight representing maximum susceptibility for landslides. The flat curvature had high landslide susceptibility, and the C/S(C) value decreased with the plan curvature increase or decrease. The C/S(C) values for different distance ranges from the roads, river, and faults. The largest WoE value occurred in the 0-200 m distance to roads class with WoE value of 4.5, then the percentage of landslides and WoE values were lower in the next class. For the distance to river and distance to fault classes, the result values did not show significant characteristics. In general, the presence of vegetation may reduce the landslide hazard. However, there are some studies that offer a different point of view [94,95]. From the analysis, lower values of NDVI (<0.26) had positive WoE values, whereas higher NDVI values (> 0.26) had almost negative WoE values. The correlation between lithology and WoE values of landslides shows that Ordovician deposits (group-5) mainly include charcoal shale and the siliceous base is the most landslide-prone deposit. The second-most landslide-prone lithologies are (group-2) and (group-4), with relatively weak strength.

Constructing Landslide Susceptibilitymaps
The landslide susceptibility map was produced using layers and training points of the sixteen selected factors, and the map accuracy was firmly controlled by factor selection and combination [96].
The probability values (LSI value) were calculated by extracting the spatial relationship between landslides and each selected factor. Through the application of FLDA, LDA, QDA, and WoE, the LSI values of four landslide susceptibility maps for each pixel were obtained. After normalization, the acquired LSI values were between 0 and 1. After reclassification, the landslide sensitivity map was generated, as shown in Figure 4. Among some classification methods [97], natural breaks were used depending on the histogram of landslide susceptibility indexes. A forward stepwise process reclassified the LSI value into five different class breaks of very high susceptibility (0.70-1), high susceptibility (0.56-0.7), moderate susceptibility (0.45-0.56), low susceptibility (0.32-0.45), and very low susceptibility (0.00-0.32) for the area under investigation. The FLDA, LDA, QDA, and WoE landslide susceptibility maps were constructed following these divisions. The model results are shown below, and the area percentage of each class illustrated in Figure 5. Then, the next step was to explore comparative analysis results of our four models.

Models Validation and Comparison
The receiver operating characteristic (ROC) curve, as shown in Figure 6, assessed and validated the models, as it is a common overall performance prediction and accuracy evaluation quantitative index-based technique [98][99][100][101]. Two particular ROC curves, the name of success rate curve (SRC) and prediction rate curve (PRC), have been shown using training and validation datasets, respectively [102][103][104]. The datasets ROC curve used divide the landslide susceptibility index into two groups (landslide and non-landslide) by the size of the value. The x-axis of the curve shows 100specificity,which expresses the ratio of correctly predicted landslide points, while the y-axis of the curve shows sensitivity, which expresses the ratio of correctly predicted non-landslide points [103,105]. A quantitative index (AUC) represents accuracy and goodness-of-fit for the performance of the four models and represents the area under the ROC curve [40, 106,107]. According to the meaning of the AUC value, when the value is 1, the sensitivity and specificity are 100%, which indicates all landslide pixels are correctly classified. Further, the lower AUC value has lower accuracy, AUC > 0.9 indicates an excellent fit, 0.8-0.9 is very good, 0.7-0.8 is good, 0.6-0.7 is moderate, and<0.6 indicates poor accuracy [108].
Performance analysis based on four landslide models results of the SRC (using training dataset) is demonstrated by Figure 6a. Similarly, the PRC result (using validation datasets), which used the evaluated predictive ability, is demonstrated by Figure 6b. The FLDA model was the best for prediction capability and had an AUC value of 0.821. Correspondingly, the FLDA model also has the highest prediction rate for AUC=0.719. In the training dataset, the other two models, LDA and WoE, had slightly lower AUC values 0.831 and 0.821. Therefore, in the SRC curve, the success rates of LDA and WoE models were relatively high. In the validation dataset, the prediction rate performed well, and the AUC value was close to the maximum (AUC=0.719 and AUC=0.718). This was followed by reducing AUC values belonging to the QDA model, in which prediction accuracy is good and predictive ability poor. The AUC values for this model were 0.781 and 0.513 on the SRC curve and the PRC curve, respectively. Hence, from the above analysis results, all four models are considered to exhibit a reasonable SRC, in which the FLDA model shows the highest performance in predicting landslide susceptibility, followed by the LDA and WoE models. On the PRC curve, prediction value was somewhat lower and QDA much lower.
In addition to the AUC quantitative index method, Friedman [109] and Wilcoxon signed-rank [110] statistical tests also can assess model prediction ability. To make the comparison, we first conducted Friedman tests. This test is based on the null hypothesis that at the significance level of α = 0.05 the landslide model group would have no significant difference. Based on this assumption, the p-value was used, and if p <0.05, the previous assumptions were accepted; otherwise, the null hypothesis was rejected [111]. Tables 2 and 3 show the result of this test, which display that the mean ranks for the FLDA, LDA, WoE, and QDA models were 2.77, 2.47, 2.44, and 2.32, respectively. The chi-square and P-value are 27.019 and 0.000 in this test. Thus, it implies significant differences among the five models.
Systematic pairwise comparison explored the significant differences among the four models. The analysis was conducted in six groups assembled from any two pairs of landslide models, which is the Wilcoxon signed-rank test. Using the Wilcoxon signed-rank tests at the 5% significance level, the p-value and z-value were used to test whether to accept each group's hypothesis. The condition for rejecting the null hypothesis assumed that there was no significance between the models if the pvalue <0.05and the z-value was N (−1.96 and+1.96). According to Table 4

Discussions
Four quantitative statistical models were utilized in landslide susceptibility mapping for the study area. Additionally, the four approaches explored the linear and non-linear spatial relationships among the landslide-prone areas and sixteen causative factors, which are independent of the statistical distribution. QDA and LDA provided a fairly dependent data distribution, and WoE was characterized by operating intuitively and was not dimensionally sensitive [112,113]. QDA and LDA had the same performance indicators, although QDA was like a black box approach, which is an disadvantage for interpretation. However, in the current study, the LDA model could be interpreted straightforwardly [114].
Before modeling, to identify highly susceptible landslide factor classes in any modeling approach, WoE can be used for preprocessing. This method evaluates the weight value of all causative factor classes, which can measure the possibility of a landslide. WoE is similarly used in building landslide models in this research, and it has shown reasonable results.
Within the WoE constructed by W+ and W-, the C/S(C) calculated by the ratio of C to standard deviation S(C) can guide the significance of the spatial connection for each causative factor class. If the C/S(C) value was positive, the factor was favorable for spatial association with landslides; if it was negative, the spatial association was negative. The analysis revealed that a slope with the range of 10-20 had the maximum weight, representing maximum landslide susceptibility. This result was consistent with previous research [85]. It also can be seen that elevation of 900-1100, SPI of >20, STI of >20, TWI of >2.5, rainfall of 333.62-1221.86, road distance of 0-200m, and lithological group 5 had the highest impacts on landslide occurrence, as evidenced by relatively high values in these classes. In addition to the above factors, some factors (aspect) had poor sensitivity because the values oscillated around zero. This is the same as has been found in other analyses [115].
This study also investigated the performance of three data-mining models of FLDA, LDA, QDA integrated with WoE. The results showed that FLDA and LDA models exhibited excellent performance; where FLDA had the best performance, followed by LDA. They all outperformed the QDA model for spatial prediction accuracy. The areas under the ROC curve using the training dataset for the three models ranged from 82.1% to 83.3%, and they had extremely high AUC values. The analysis of the PRC using the validation dataset indicated that the goodness-of-fit for the three susceptibility models was good, and showed two high prediction powers with AUC values of 0.718 and 0.719, because FLDA and LDA models had the same maximum value. Moreover, the results of the ROC using the training and validation dataset showed that the QDA exhibited the worst performance, especially for the validation dataset analysis (AUC=0.513). However, a comparison using the Friedman test of the model group revealed that the group models performed differently with statistical significance. It could also be concluded that, although the QDA model did not need identical covariance in each class and the complexity exceeded that of LDA, the prediction accuracy was not enhanced. Instead, the learning performance reduced significantly, and this model could not interpret the parameter.

Conclusions
This large-scale landslide susceptibility prediction can be utilized to devise strategies for infrastructural and detailed land use planning. The current study used WoE to depict the relationship between all causative factors and landslide locations. The results suggest that most positive impact classes exist at low slope angles and high SPI, STI, and TWI values. The present study also showed that the FLDA outperformed the other three models. In conclusion, landslide susceptibility maps could assist local and government authorities in establishing appropriate land use plans and landslide mitigation strategies.
Author Contributions: Guirong Wang, Xi Chen and Wei Chen contributed equally to the work. Xi Chen and Guirong Wang collected field data and conducted the landslide susceptibility mapping and analysis. Xi Chen and Wei Chen wrote and revised the manuscript. Guirong Wang provided critical comments in planning this paper and edited the manuscript. All the authors discussed the results and edited the manuscript. Great thanks are given to Xingguang Chen and Zhengqian Wu for their kind help. All authors have read and agreed to the published version of the manuscript.