Uncertainties Analysis of Collapse Susceptibility Prediction Based on Remote Sensing and GIS: Influences of Different Data-Based Models and Connections between Collapses and Environmental Factors

Abstract: To study the uncertainties of a collapse susceptibility prediction (CSP) under the coupled conditions of different data-based models and different connection methods between collapses and environmental factors, models falls somewhere in between. It is concluded that WOE-RF is a more appropriate coupled condition for CSP than the other models.

mapped with 30 m resolution grid units. Examples of collapses in the An'yuan County are shown in Figure 2.
In this study, the environmental factors used for CSP are classified as follows: (1) topographic and geomorphic factors, including the digital elevation model (DEM), slope, aspect, profile curvature, plan curvature and topographic relief; (2) land cover factors, including the normalized difference vegetation index (NDVI) and normalized difference built-up index (NDBI); (3) hydrological factors, including distance to rivers and the modified normalized difference water index (MNDWI); (4) geological factors, represented by lithology. Lithology is the material basis of collapse development and affects the permeability of rock and soil and the shear strength of the slope. The lithology of the study area comprises magmatic rock, metamorphic rock, clastic rock and carbonate rock. The collapse-related environmental factors were acquired through RS and the GIS platform. The RS data include the DEM, a Landsat 8 TM image and high-resolution images, and the GIS spatial analysis was performed in ArcGIS 10.2.

Acquisition of Topographic and Hydrological Factors
The topographic and geomorphic factors for CSP were calculated and mapped with the three-dimensional analysis and data management tools in ArcGIS 10.2 [40,60]. The DEM is an important environmental factor for collapse evolution and the data source for the other topographic factors. Elevation in the study area ranges from 180 to 1151 m and was divided into 8 classes. To measure the relative changes of elevation within smaller regions, topographic relief was introduced; it was calculated in ArcGIS 10.2 with neighborhood statistics as the maximum height difference within a given area. Slope has a direct relationship with the occurrence of collapse [61]. Collapse can occur only where rock and soil mass rests on a slope, and the probability of collapse varies with the slope angle. In this study, the slope angle values were reclassified into 8 categories: 0°-4°, 4°-8°, 8°-12°, 12°-16°, 16°-20°, 20°-24°, 24°-30° and 30°-90°. Aspect determines the scale of collapse formation under external factors such as rainfall, solar radiation and vegetation cover. Hence, the effects of aspect on collapse occurrence should not be neglected. Nine groups of aspect were identified in the study area. Profile curvature and plan curvature were also extracted from the DEM data. They affect the rate of collapse and the weathering degree of the rock mass on the slope. In this paper, profile curvature and plan curvature are each divided into eight categories.
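As a minimal sketch of the maximum height difference method, the relief of a cell can be taken as the difference between the highest and lowest elevation in a square window around it. The window size (3 × 3 here), the clipping of windows at the raster edge and the function name `topographic_relief` are illustrative assumptions, not the ArcGIS settings used in the study.

```python
def topographic_relief(dem, radius=1):
    """Return max-min elevation within a (2*radius+1)^2 window for each cell.

    `dem` is a list of equal-length rows of elevations in metres.
    Windows are clipped at the raster edge (an assumption of this sketch).
    """
    rows, cols = len(dem), len(dem[0])
    relief = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                dem[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            relief[r][c] = max(window) - min(window)
    return relief


# Hypothetical 3 x 3 elevation grid (metres).
dem = [
    [180, 190, 200],
    [185, 210, 230],
    [190, 220, 260],
]
relief = topographic_relief(dem)
```

A larger `radius` smooths the relief map; the appropriate window depends on the scale of the landforms of interest.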
In addition, the river networks of An'yuan County were extracted with the hydrological analysis toolbox, which consists of the fill, flow direction and flow accumulation tools, to reflect the effects of hydrological factors on landslide occurrences [62-64]. First, the depressions in the 30 m resolution DEM were filled with the fill tool. Second, the flow direction tool was applied to determine the water flow direction of each cell of the filled DEM. Third, the flow accumulation tool was applied to the flow direction raster to compute the flow accumulation. Finally, the river networks of the study area were mapped by selecting all grid units whose flow accumulation exceeded a certain threshold.
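The flow direction, flow accumulation and thresholding steps can be sketched in plain Python as follows. This is an illustration of the general D8 idea (each cell drains to its steepest downslope neighbour), assuming an already filled DEM; it is not the ArcGIS implementation, and the grid, the threshold and the function names are hypothetical.

```python
# Offsets to the eight neighbours used by the D8 flow-direction scheme.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def d8_flow_accumulation(dem):
    """Count, for each cell, how many upstream cells drain through it.

    Assumes `dem` is already depression-filled; cells without a lower
    neighbour are treated as outlets.
    """
    rows, cols = len(dem), len(dem[0])
    receiver = {}
    for r in range(rows):
        for c in range(cols):
            best, best_drop = None, 0.0
            for dr, dc in NEIGHBOURS:
                i, j = r + dr, c + dc
                if 0 <= i < rows and 0 <= j < cols:
                    distance = (dr * dr + dc * dc) ** 0.5
                    drop = (dem[r][c] - dem[i][j]) / distance
                    if drop > best_drop:
                        best, best_drop = (i, j), drop
            receiver[(r, c)] = best  # None means an outlet cell
    accumulation = [[0] * cols for _ in range(rows)]
    # Visit cells from highest to lowest so upstream totals are final
    # before they are passed downstream.
    order = sorted(receiver, key=lambda rc: dem[rc[0]][rc[1]], reverse=True)
    for r, c in order:
        target = receiver[(r, c)]
        if target is not None:
            accumulation[target[0]][target[1]] += accumulation[r][c] + 1
    return accumulation


def river_cells(accumulation, threshold):
    """Mark cells whose flow accumulation reaches the threshold."""
    return [[value >= threshold for value in row] for row in accumulation]


# Hypothetical filled DEM with a valley draining towards the bottom centre.
dem = [
    [9, 8, 9],
    [8, 5, 8],
    [9, 3, 9],
]
acc = d8_flow_accumulation(dem)
rivers = river_cells(acc, threshold=5)
```

Lowering the threshold yields a denser drainage network, so the chosen value controls how many minor channels are mapped as rivers.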

Acquisition of NDVI, NDBI and MNDWI Factors
The NDVI, NDBI and MNDWI factors have important influences on the probability of landslide occurrence through affecting the shear strength of slope soils and controlling the surface and underground water migration of the slope body [10,65,66]. These three remote sensing indexes were extracted from the Landsat 8 TM image described above. The NDVI reflects regional vegetation growth and coverage (Equation (1)). The NDBI indicates the proportion of buildings on the ground surface of the study area (Equation (2)). The MNDWI reflects surface hydrology and soil moisture (Equation (3)). In these equations, P(Red), P(NIR), P(Green) and P(MIR) are the measurements of the visible red, near infrared, green and middle infrared bands of the Landsat 8 TM image, respectively.
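Equations (1)-(3) are not reproduced in this text. In their commonly used forms, and with the band symbols defined above, the three indexes can be written as follows. This is a reconstruction from the standard definitions of the indexes rather than a copy of the original equations.

```latex
\begin{align}
\mathrm{NDVI}  &= \frac{P(\mathrm{NIR}) - P(\mathrm{Red})}{P(\mathrm{NIR}) + P(\mathrm{Red})} \tag{1}\\
\mathrm{NDBI}  &= \frac{P(\mathrm{MIR}) - P(\mathrm{NIR})}{P(\mathrm{MIR}) + P(\mathrm{NIR})} \tag{2}\\
\mathrm{MNDWI} &= \frac{P(\mathrm{Green}) - P(\mathrm{MIR})}{P(\mathrm{Green}) + P(\mathrm{MIR})} \tag{3}
\end{align}
```

Each index is a normalized band difference and therefore lies between −1 and 1.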

Uncertainties of CSP: Connection Methods and Data-Based Models
The CSP accuracy depends strongly on the quality of the input variables; therefore, it is important to select appropriate connection methods between the collapse inventory and the environmental factors to obtain those inputs. In addition, coupling the connection methods with different data-based models introduces further uncertainties. Analyzing how, and how strongly, these two kinds of uncertainty factors affect the predicted collapse susceptibility indexes (CSIs) helps to reduce their influence. For example, some researchers have recently used WOE or FR for collapse susceptibility modeling without any proper explanation [2,56]. In this study, the five nonlinear connection methods of PS, FR, IV, IOE and WOE were combined with the AHP, MLR, C5.0 and RF models to establish 20 different coupled models for the CSP. The specific research steps are as follows (Figure 3): (1) The data sources of the collapse inventory and related environmental factors in the study area were obtained to construct the spatial datasets for CSP modeling; (2) A total of 20 different modeling conditions were proposed for CSP on the basis of the five connection methods and four data-based models; (3) Under each coupled condition, the CSP model was built, the collapse susceptibility map (CSM) was drawn and the uncertainty of the CSIs was analyzed; (4) The area under the ROC curve (AUC) [67] was used to evaluate the accuracy of the CSP results; (5) At the significance level of 0.05, the Friedman two-way analysis of variance by ranks was used to test whether the CSI distributions under the coupled conditions differed significantly; (6) The numerical distribution characteristics of the CSIs predicted with the five connection methods and four data-based models were analyzed in terms of mean value and standard deviation; (7) The optimal coupled condition of connection method and data-based model was obtained through comparison, so as to provide theoretical guidance for the CSP.
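The 20 coupled conditions are simply all pairings of the five connection methods with the four data-based models. A short sketch of that enumeration, using the "WOE-RF" naming style that appears in the text (the variable names are illustrative):

```python
from itertools import product

CONNECTION_METHODS = ["PS", "FR", "IV", "IOE", "WOE"]
DATA_BASED_MODELS = ["AHP", "MLR", "C5.0", "RF"]

# Each coupled condition pairs one connection method with one model.
coupled_conditions = [
    f"{method}-{model}"
    for method, model in product(CONNECTION_METHODS, DATA_BASED_MODELS)
]
```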

Probability Statistics
In general, the PS method can be defined as the ratio of the area where collapses have occurred within a given attribute interval of an environmental factor to the total collapse area [18]. A greater value of PS_ij means a higher correlation between collapse and the related factor. The formula for calculating the collapse area ratio of each second-order factor is shown in Equation (4), where S^Z_ij is the historical collapse area in the jth state of the ith environmental factor and λ_i is the number of states of the ith factor.
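Equation (4) is not reproduced in this text. Based on the verbal definition and the symbols given above, it can plausibly be written as below; the summation index k is introduced here for clarity and is not part of the original notation.

```latex
\begin{equation}
PS_{ij} = \frac{S^{Z}_{ij}}{\sum_{k=1}^{\lambda_i} S^{Z}_{ik}} \tag{4}
\end{equation}
```

Under this form, the PS values of all states of one factor sum to 1.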

Frequency Ratio
The FR method can be defined as the ratio of the proportion of collapse grid units that fall within a given attribute interval of an environmental factor to the proportion of the study area occupied by that interval, as shown in Equation (5). N_j is the number of collapse grid units that occurred in the attribute interval of an environmental factor; N is the total number of collapse grid units in the study area; S_j is the number of grid units of the attribute interval of the environmental factor; S is the total number of grid units in the study area. FR reveals the relative influence of each attribute interval of an environmental factor on collapse occurrence [68]. An FR value greater than 1 indicates a higher correlation between collapse and the environmental factor; a value below 1 indicates a lower correlation.
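With the symbols defined above, FR is conventionally computed as (N_j / N) / (S_j / S). The following sketch assumes that form, since Equation (5) itself is not reproduced here; the counts in the example are hypothetical.

```python
def frequency_ratio(n_j, n, s_j, s):
    """FR = (N_j / N) / (S_j / S) for one attribute interval."""
    if n <= 0 or s <= 0 or s_j <= 0:
        raise ValueError("N, S and S_j must be positive")
    return (n_j / n) / (s_j / s)


# Hypothetical counts: 30 of 100 collapse cells fall in an interval
# that covers 2,000 of 10,000 grid cells, giving an FR of about 1.5.
fr = frequency_ratio(30, 100, 2000, 10000)
```

An interval that holds collapses in exact proportion to its area gives FR = 1, which is why 1 serves as the dividing value.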

Information Value
Collapse disasters are affected by multiple factors. Under different geological environments, the degree and nature of the environmental factors that contribute to collapse are different. The IV method was used to express the optimal combination of environmental factors under a certain geological environment, including the number and basic state of the environmental factors [31]. For a specific grid unit, the IV considers the quantity and quality of all collapse-related information acquired in a given area. In the specific calculation process, the total probability was usually estimated from the sample frequency for convenience. Hence, the formula of IV can be converted into Equation (6),
where IV is the information value of collapse occurrence when the environmental factor is in state j, N_j is the number of collapse grid units in the attribute interval of the environmental factor, N is the total number of collapse grid units in the study area, S_j is the number of grid units in the attribute interval of the environmental factor and S is the total number of grid units in the study area. When the value of IV is positive, the environmental factor in state j favors collapse occurrence, and the greater the value of IV, the higher the probability of collapse occurrence. When the value is negative, the opposite is true. In addition, when the value of IV is 0 or close to 0, the environmental factor contributes almost nothing to collapse occurrence and can be removed from the CSP modeling process.
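Equation (6) is not reproduced in this text. A common form of the information value is the logarithm of the frequency ratio, IV = log((N_j / N) / (S_j / S)). The sketch below assumes that form. The base of the logarithm is also an assumption: the NDBI values quoted later in the text (FR = 1.4969, IV = 0.1752) are consistent with a base-10 logarithm, so base 10 is used as the default.

```python
import math


def information_value(n_j, n, s_j, s, base=10):
    """IV = log_base((N_j / N) / (S_j / S)).

    base=10 is an assumption inferred from the FR and IV values quoted
    for NDBI in the text, not a confirmed property of Equation (6).
    IV is undefined for an interval without collapse cells (N_j = 0).
    """
    if min(n_j, n, s_j, s) <= 0:
        raise ValueError("all counts must be positive")
    return math.log((n_j / n) / (s_j / s), base)


# Consistency check against the NDBI interval reported in the text:
# FR = 1.4969 gives an IV close to the reported 0.1752 under base 10.
iv_ndbi = math.log10(1.4969)

# Hypothetical counts with FR = 1 give IV = 0 (no information).
iv_neutral = information_value(10, 100, 1000, 10000)
```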

Index of Entropy
The IOE method was used to represent the degree of uncertainty of an environmental factor [69]. In the prediction of collapse susceptibility, IOE expresses the degree of influence of different environmental factors on the evolution of collapse disasters. Firstly, the probability density P_ij was calculated from the frequency ratio analysis, as shown in Equation (7), using the FR value of each class of each environmental factor; S_j denotes the number of classes, and i and j represent the serial number and the class of the environmental factor, respectively.
Secondly, the probability density P_ij was substituted into Equation (8) to obtain the entropy value H_j of each parameter, and the information coefficient I_j was calculated with Equation (9).
Finally, by coupling the information coefficient I_j with the collapse occurrence probability, the final weight value W_j of the parameter was calculated.
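The three steps can be illustrated with the form of the index of entropy that is commonly used in susceptibility studies: FR values are normalized to a probability density, the Shannon entropy is compared with its maximum, and the resulting information coefficient scales the FR values. Equations (7)-(9) are not reproduced here, so the exact expressions, and in particular the final weighting step, are assumptions of this sketch; the FR values are hypothetical.

```python
import math


def index_of_entropy(fr_values):
    """Return (H_j, I_j, class_weights) for one environmental factor.

    fr_values holds the FR of each class of the factor. class_weights
    multiplies I_j by each class FR, which is one common reading of the
    final weighting step and an assumption of this sketch.
    """
    total = sum(fr_values)
    if len(fr_values) < 2 or total <= 0:
        raise ValueError("need at least two classes with a positive FR sum")
    densities = [fr / total for fr in fr_values]  # normalized FR values
    entropy = -sum(p * math.log2(p) for p in densities if p > 0)
    max_entropy = math.log2(len(fr_values))  # entropy when all FR are equal
    info_coefficient = (max_entropy - entropy) / max_entropy
    class_weights = [info_coefficient * fr for fr in fr_values]
    return entropy, info_coefficient, class_weights


# Equal FR in every class carries no information, so I_j is 0.
_, uniform_info, _ = index_of_entropy([1.0, 1.0, 1.0, 1.0])
# Hypothetical FR values concentrated in one class give a larger I_j.
_, skewed_info, _ = index_of_entropy([0.2, 0.3, 0.5, 3.0])
```

A factor whose collapses are spread evenly over its classes therefore receives a weight near 0, while a factor with collapses concentrated in a few classes receives a larger weight.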

Weight of Evidence
The WOE is a quantitative method to predict the probability of an event based on Bayes' theorem. For collapse prediction, the spatial correlation between collapse and environmental factors was analyzed to obtain the distribution of the various environmental factors at the collapse locations. A pair of weights, W+ and W−, was calculated for each environmental factor, where W+ and W− are the weight values of the regions where the environmental factor class is present and absent, respectively; B and B̄ are the numbers of collapse grid units in the regions where the class is present and absent, respectively; D and D̄ are the numbers of noncollapse grid units in the regions where the class is present and absent, respectively. The difference between these weights (W+ − W−), known as the relative coefficient, C, is a useful measure of the correlation between the evidence layer and the collapse events. For a positive correlation, the value of C is positive, whereas for a negative correlation the value is negative; a value of 0 indicates no correlation. When data were missing, the weight was also set to 0.
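The weight equations are not reproduced in this text. In the standard weight-of-evidence formulation, and with the four counts defined above, they would read as follows. This is a reconstruction under that assumption, so the equation numbers are omitted.

```latex
\begin{align}
W^{+} &= \ln\frac{B / (B + \bar{B})}{D / (D + \bar{D})}\\
W^{-} &= \ln\frac{\bar{B} / (B + \bar{B})}{\bar{D} / (D + \bar{D})}\\
C &= W^{+} - W^{-}
\end{align}
```

Here B / (B + B̄) is the share of collapse units that lie inside the class and D / (D + D̄) is the share of noncollapse units that lie inside it, so W+ is positive when collapses are over-represented in the class.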

Analytic Hierarchy Process
The AHP, a decision-making method combining qualitative and quantitative analyses, was mainly used to quantify and model the selected environmental factors [70]. The AHP was established based on the internal dominance relationships among the environmental factors. The weight values (ranging between 1 and 9) of the environmental factors were determined by pairwise comparison. The consistency ratio (CR) was defined by Equation (13) to check the consistency of the comparison matrix A composed of these weight values,
where CI denotes the consistency index obtained through Equation (14), RI is the random index for the comparison matrix A, n is the order of matrix A, ω denotes the eigenvector corresponding to the maximum eigenvalue λ_max of matrix A and F_i denotes the ith environmental factor. When the value of CR is less than 0.1, the comparison matrix A is considered satisfactorily consistent. Then, the CSIs were calculated with Equation (15), where ω_i indicates the weight of the environmental factor F_i:
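Equations (13)-(15) are not reproduced in this text. The sketch below assumes their usual forms, CR = CI / RI, CI = (λ_max − n) / (n − 1) and CSI = Σ ω_i F_i, and estimates the principal eigenvector by power iteration. The random index values are those commonly tabulated for the AHP; the comparison matrix is hypothetical.

```python
# Commonly tabulated random index values for matrix orders 2 to 9.
RANDOM_INDEX = {2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix, iterations=100):
    """Estimate the principal eigenvector and eigenvalue by power iteration."""
    n = len(matrix)
    weights = [1.0 / n] * n
    lambda_max = float(n)
    for _ in range(iterations):
        product = [sum(matrix[i][k] * weights[k] for k in range(n)) for i in range(n)]
        lambda_max = sum(product)  # weights sum to 1, so this estimates lambda_max
        weights = [value / lambda_max for value in product]
    return weights, lambda_max


def consistency_ratio(matrix):
    """Return (weights, CR) with CI = (lambda_max - n) / (n - 1) and CR = CI / RI."""
    n = len(matrix)
    weights, lambda_max = ahp_weights(matrix)
    ci = (lambda_max - n) / (n - 1)
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return weights, cr


def susceptibility_index(weights, factor_values):
    """CSI as the weighted sum of the factor values of one grid unit."""
    return sum(w * f for w, f in zip(weights, factor_values))


# Hypothetical, perfectly consistent 3 x 3 comparison matrix (weights 4:2:1).
A = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights, cr = consistency_ratio(A)
csi = susceptibility_index(weights, [1.0, 0.0, 0.0])
```

For a perfectly consistent matrix λ_max equals n, so CI and CR are 0; judgments that contradict each other raise λ_max and push CR towards the 0.1 limit.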

Multiple Linear Regression
The MLR is often used to explore the relationship between a dependent variable and multiple independent variables. The value y_i of the dependent variable is defined in Equation (16), where x_1i, x_2i, ..., x_ki denote the independent variables, b_0, b_1, ..., b_k denote the regression coefficients and ε_i represents the error. The parameters were estimated by the least squares method; then, the necessary statistical tests were carried out and the goodness of fit R² of the model was evaluated. The higher the value of R², the better the fit.
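Equation (16) is not reproduced in this text. From the symbols defined above, the model has the usual linear form shown first below. The second expression is the standard definition of the coefficient of determination, added here as an assumption about how R² was computed; the fitted value and the sample mean are written with a hat and a bar.

```latex
\begin{align}
y_i &= b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki} + \varepsilon_i \tag{16}\\
R^2 &= 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \notag
\end{align}
```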

C5.0 Decision Tree
C5.0 uses the boosting method to improve the implementation efficiency and classification accuracy of the decision tree algorithm [71,72]. The C5.0 model can be constructed in four main steps [73]: (i) selecting the root split from the training dataset as the attribute and threshold with the highest gain ratio; (ii) finding the child nodes from the two branch nodes produced by the tree structure; (iii) creating additional tree nodes that grow further according to certain mathematical criteria, while child nodes that do not contribute to the model are eliminated; (iv) repeating this process until all instances in the training dataset are assigned to leaf nodes or no remaining variables can be divided. After the initial decision tree was established, the model was verified with the testing dataset.
In summary, C5.0 construction consists of tree splitting, growth, pruning of child nodes, growth promotion and model closure. Compared with artificial neural networks, this model is easier to understand because it can clearly explain the processes of tree growth and pruning [74].
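The gain ratio used in step (i) can be illustrated for a categorical attribute as the information gain of the split divided by the entropy of the split itself. This sketch shows only that criterion, not the C5.0 algorithm or its boosting and pruning; the sample data and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """Information gain of a categorical split divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical samples: slope class versus collapse (1) / noncollapse (0).
slope_class = ["steep", "steep", "steep", "gentle", "gentle", "gentle"]
collapse = [1, 1, 1, 0, 0, 0]
ratio = gain_ratio(slope_class, collapse)

# An attribute unrelated to the labels has a gain ratio of 0.
unrelated = gain_ratio(["a", "b", "a", "b"], [1, 1, 0, 0])
```

Dividing by the split information penalizes attributes with many small classes, which would otherwise obtain a high information gain simply by fragmenting the data.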

Random Forest
The RF model is an ensemble model composed of multiple classification and regression trees. The bagging technique (bootstrap aggregation) was used to randomly draw samples from the training dataset to build each classification and regression tree. At each split, the optimal classification result was selected from a random subset of the environmental factors. The error of the model was evaluated using the out-of-bag samples. The random forest integrates the results of all classification and regression trees, which avoids the discontinuity of the predicted values of a single decision tree and its sensitivity to the training dataset, so as to make the predicted values smoother, to prevent overfitting of the model and to increase its stability [75].
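The bagging step can be sketched as sampling training indices with replacement; the samples that are never drawn for a given tree form its out-of-bag set, which serves as a built-in validation set for that tree. The code below illustrates only this sampling step, not a random forest; the function name and the seed are arbitrary.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample of indices and return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_split(1000, rng)
oob_fraction = len(out_of_bag) / 1000
```

For large samples the expected out-of-bag fraction approaches 1/e, or roughly 37%, so each tree is tested on about a third of the training data it never saw.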

ROC Curves and AUC Analysis
The ROC was used to evaluate the overall performance of the prediction model based on quantitative indicators [76,77]. The ROC curve was calculated as follows: First, the CSI values were calculated, and the samples in the testing dataset were sorted by CSI. Then, different cut-off points were selected in this order. Next, each sample was classified as positive or negative at each cut-off point. Finally, the "true positive rate" and "false positive rate" of the classifier at each cut-off point were calculated as the vertical and horizontal coordinates of the ROC curve [78]. To further quantify the classification performance of different models, the AUC (area under the ROC curve) was used as the specific evaluation index.
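The procedure described above can be sketched directly: sort the test samples by CSI, move the cut-off down through the sorted list, record the false and true positive rates at each step and integrate with the trapezoidal rule. The CSI values and labels below are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the cut-off through the sorted scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if index + 1 == len(ranked) or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical CSIs for three collapse (1) and three noncollapse (0) test cells.
csi = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
truth = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(csi, truth))
```

In this example 8 of the 9 collapse/noncollapse pairs are ranked correctly, so the AUC is 8/9; an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.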

Statistical Law Analysis of CSI
The mean value and standard deviation (SD) were used to reflect the average level and the dispersion of the CSI distribution, respectively, and thereby to reveal the classification effects of the different data-based models. By analyzing these numerical distribution characteristics of the CSIs as a whole, the predictive performance of collapse susceptibility modeling under each coupled condition of connection method and data-based model was assessed, and the optimal coupled condition was identified by comparison. The mean value and SD are objective measures and provide theoretical guidance for the study of CSP.
The Friedman two-way analysis of variance by ranks was used to test for significant differences among the CSP models as a set; the null hypothesis states that the median values of the groups are equal. If the p-value was below the significance level of α = 0.05 (5%), the null hypothesis was rejected; otherwise, it was retained [79]. To assess the significant differences between two models, the signed-rank test was used. Based on this test, the performances of the models were ranked: the higher the average rank, the better the model performance. The significance level and average rank were used to further analyze the uncertainties of the connection methods and data-based models, in order to obtain a CSP model with high reliability and accuracy.
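As a sketch of the Friedman test, the CSIs of each grid unit are ranked across the models, the ranks are summed per model and the chi-square statistic is computed from the rank sums. The version below omits the tie correction and compares the statistic with a tabulated critical value instead of computing a p-value; the data are hypothetical and far smaller than a real raster.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square for rows = grid units and columns = models (no tie correction)."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    statistic = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    mean_ranks = [r / n for r in rank_sums]
    return statistic, mean_ranks


# Hypothetical CSIs of four grid units (rows) under three models (columns).
csi_table = [
    [0.20, 0.35, 0.60],
    [0.10, 0.30, 0.55],
    [0.25, 0.40, 0.70],
    [0.15, 0.45, 0.65],
]
statistic, mean_ranks = friedman_statistic(csi_table)
# 5.991 is the chi-square critical value for k - 1 = 2 degrees of freedom at alpha = 0.05.
reject_null = statistic > 5.991
```

With only four rows the chi-square approximation is rough; it becomes reliable for the large number of grid units in a real study area.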

Collapse-Related Environmental Factor and Connection Results
The collapse inventory and environmental factors in the study area were obtained through in-depth analysis of the various factors affecting the evolution of collapses, as shown in Table 2. The continuous environmental factors were divided into eight attribute levels using the natural breaks method [80,81]; flat ground was treated as a separate aspect class and set to −1, while discrete factors such as lithology and distance to rivers were classified into four classes.
(1) Relationships between collapse and topographic and geomorphic factors. The study area is located in a mountain boundary zone, mainly composed of low mountains and hills, with large topographic relief [82,83]. The elevation, slope, aspect, plan curvature, profile curvature and topographic relief were extracted from the DEM as topographic and geomorphic factors, as shown in Figure 4. Taking slope as an example, the slopes in the study area were divided into eight attribute intervals, as shown in Figure 4b. The PS values for collapse occurrence were approximately normally distributed over slopes from 0° to 58.3°, with a peak at 16°. The FR values exceeded 1 where the slope was greater than 16°, which is related to the spatial frequency of the slope classes and shows the strong spatial correlation between the occurrence of collapses and the slope (Figure 5).
More specifically, according to the statistics in Table 2, within the slope range of 16°-20°, the values of PS and FR are 0.3 and 1.7, respectively; the IV and WOE show strong positive correlations with collapse occurrence; the IOE shows that the weight of slope is 0.1458, second only to lithology. The results of these connection methods suggest that the slope plays a very important role in collapse occurrence, and further suggest that all the connection methods can, on the whole, reflect the effect of environmental factors on collapse susceptibility [84].
(2) Relationships between collapse and hydrological factors. Collapse is largely affected by the distance to rivers and streams. Due to river erosion, the stability of the slope rock and soil mass deteriorates as the soil moisture content increases [85,86]. According to the statistical calculation, the area within 300 m of the river system has the highest concentration of collapses (35%). The MNDWI is commonly used to reflect surface water information; the value of MNDWI in the region ranges from 0 to 1, and most of the collapses occur between 0.392 and 0.498, where the FR reaches its maximum (1.214) (Table 2). The distance to the river and the MNDWI (Figure 4g,h) were used to characterize the influence of the hydrological environment on collapse evolution [87,88].
(3) Relationships between collapse and land cover factors. The NDBI and NDVI were selected as land cover factors to reflect the influence of building distribution and natural vegetation on collapse evolution (Figure 6a,b). As shown in Table 2, when the NDBI values range between 0.49 and 0.6, the results of the PS, FR, IV and WOE connection methods are all at their maximum values, which are 0.2515, 1.4969, 0.1752 and 0.7543, respectively. The NDVI was used to quantitatively estimate vegetation growth and coverage. When the NDVI is smaller than 0.39, the area is prone to collapse occurrence.
(4) Relationships between collapse and lithology. The lithology of An'yuan County is reflected by the types of rock and soil in this study. The types of rock and soil represent the material basis of collapse and greatly affect collapse evolution. The values of PS and FR for metamorphic rock are as high as 0.4388 and 1.3636, respectively, and those for clastic rock are 0.3356 and 1.2971, respectively. In addition, for both metamorphic and clastic rock types, the IV and WOE connection methods show positive correlations with collapse occurrence, and the IOE suggests that lithology has the highest weight value of 0.2058 (Table 2). The other types of rock and soil are less widely distributed in this region. In short, the occurrence of collapse is relatively high in the areas with metamorphic and clastic rock types and relatively low in the areas with magmatic rock. In addition, very few carbonate rocks are distributed in this region, and as a result, the pattern of collapse occurrence on carbonate rocks is not clear.

Preparation of Spatial Dataset
The whole study area was divided into 2,655,972 grid units at a grid resolution of 30 m × 30 m. All 11 environmental factors were reassigned with the results of the five connection methods, and these reassigned environmental factors were then used as input variables of the CSP models. At the same time, a total of 108 recorded collapse polygons were divided into 1463 collapse grid units, which were assigned a value of 1, while the same number of randomly selected noncollapse grid units were assigned a value of 0. These collapse and noncollapse grid units were randomly divided into model training and testing sets in a proportion of 70%/30%. Finally, all the grid units with connection values in the study area were put into each of the four models to calculate the CSIs, which were divided into five levels: very high (10%), high (20%), moderate (20%), low (20%) and very low (30%).
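The sampling and classification steps can be sketched as follows. The 70%/30% split and the five percentage shares come from the text; reading the percentages as shares of grid units ranked by CSI, the random seed and the function names are assumptions of this sketch.

```python
import random


def train_test_split(samples, train_fraction, rng):
    """Shuffle a copy of the samples and cut it at the training fraction."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Shares from the text, ordered from the highest CSI downwards.
LEVEL_SHARES = [("very high", 0.10), ("high", 0.20), ("moderate", 0.20),
                ("low", 0.20), ("very low", 0.30)]


def classify_levels(csi_values):
    """Assign each CSI a level so the levels cover the stated shares of grid units."""
    order = sorted(range(len(csi_values)), key=lambda i: csi_values[i], reverse=True)
    levels = [None] * len(csi_values)
    start, cumulative = 0, 0.0
    for name, share in LEVEL_SHARES:
        cumulative += share
        end = round(cumulative * len(csi_values))
        for i in order[start:end]:
            levels[i] = name
        start = end
    return levels


rng = random.Random(42)  # arbitrary seed for repeatability
# 1463 collapse units labelled 1 and the same number of noncollapse units labelled 0.
samples = [(i, 1) for i in range(1463)] + [(i, 0) for i in range(1463, 2926)]
train, test = train_test_split(samples, 0.7, rng)

# Hypothetical CSIs for 100 grid units, used only to show the level shares.
levels = classify_levels([i / 100 for i in range(100)])
```

Using equal numbers of collapse and noncollapse units keeps the training set balanced, so the models are not biased towards the far more numerous noncollapse cells.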

CSP Using Conventional Mathematical Statistics Model: Multiple Linear Regression
The connection values of the 11 environmental factors calculated by the five connection methods were normalized and then taken as inputs to the MLR model. The collinearity diagnosis and significance test in the MLR were carried out to determine the suitable inputs. The results show that the variance inflation factor (VIF) values of the 11 selected environmental factors were all less than 3, indicating weak collinearity, and that the significance values were all below 0.05, so all the environmental factors were statistically significant [89]. MLR modeling was carried out for the collapse and noncollapse samples, and the regression coefficient of each environmental factor and the goodness of fit of the MLR model were calculated under the five connection methods. The larger the regression coefficient, the higher the contribution of the corresponding environmental factor to collapse development. The greater the goodness of fit, the better the fitting effect. The highest goodness of fit is 0.606, which is significantly higher than that of the other connection methods. The regression coefficients and VIF values of the MLR models under the different connection methods are shown in Table 3. The CSIs of the whole study area can be predicted by importing the connection values of each grid cell into the trained MLR model.
The C5.0 decision tree model was built with the C5.0 software package in RStudio. The parameters of the C5.0 model were obtained through cross-validation: the minimum sample size of leaf nodes is 2, and the maximum number of iterations of convergence is 100. Pruning was performed using a bottom-up method and the severity of the pruning was 75%. The boosting iteration number in the C5.0 model was set to 10 and the confidence factor to 25. Other parameters were left at their defaults. Similarly, the RF model was also built in RStudio. The random forest function was used to calculate the out-of-bag errors of different random forests. Generally speaking, the smaller the out-of-bag error, the higher the model prediction accuracy. The optimal number of random features is 4, and the number of decision trees in the random forest is 500. Finally, the C5.0 DT and RF models were trained and tested on the collapse and noncollapse samples with the model input variables calculated by the five connection methods. Then, the CSIs of the whole study area were predicted by the trained C5.0 DT and RF models, respectively. The R software used in this paper was obtained from CRAN.

Creating Collapse Susceptibility Maps
The CSP was carried out in two steps under the 20 coupled conditions. Firstly, the CSIs predicted under each coupled condition were imported into ArcGIS 10.3 software. Then, the CSMs of the study area were all divided into five levels: very high (10%), high (20%), moderate (20%), low (20%) and very low (30%). The CSMs under several typical coupled conditions are shown: the CSMs of the WOE-based models in Figure 7, and the CSMs under the coupled conditions of the five connection methods and the RF model in Figure 8. As shown in Figure 7, most areas of An'yuan County are in the low and very low levels, and the proportions of high and very high levels predicted by the AHP and MLR models are higher than those of low and very low levels. The results of the AHP and MLR models show that slope and lithology are the two most important environmental factors. Most of the collapses were located in the mountainous and hilly areas with relatively steep slopes and moderate elevation, which is consistent with the field survey results. As shown in Figure 8, under the same data-based model, the collapse susceptibility levels (CSLs) obtained by the five connection methods were significantly different. Meanwhile, the areas of low and very low levels obtained by the five connection methods were also very different.

Accuracy Analysis of ROC
The ROC statistical method was used to evaluate the prediction accuracies for the samples in the testing set. To further quantify the CSP performance of the different models, the AUC value was used as a specific evaluation index. The larger the AUC value, the better the CSP performance of the model. The ROC curves under all the coupled conditions are shown in Figure 9. The WOE-RF model has the highest prediction accuracy, with an AUC value of 0.959. Furthermore, within the same data-based model, WOE shows the best prediction accuracy among the connection methods, the results of IOE, IV and FR are relatively consistent, and the PS-based model follows, as shown in Table 4. Meanwhile, the results show that the RF model has better prediction accuracy than the other data-based models under all the connection methods. Furthermore, compared with the traditional MLR and heuristic models, the AUC of the machine learning models is improved by about 0.1 (Figure 9).
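As an illustration of the AUC index used above, the following minimal sketch computes the AUC through its rank interpretation: the probability that a randomly chosen collapse sample receives a higher CSI than a randomly chosen noncollapse sample, with ties counted as one half. The labels and scores are hypothetical, and this is not necessarily the routine used by the authors.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both collapse and noncollapse samples are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # collapse cell ranked above noncollapse cell
            elif p == q:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical testing set: 1 = collapse, 0 = noncollapse, with predicted CSIs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(round(auc(labels, scores), 3))
```

The pairwise loop is quadratic in the number of samples; for large testing sets a sort-based rank computation gives the same value more efficiently.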

Distribution Rule of Collapse Susceptibility Index
The mean value and SD were used to reflect the average level and dispersion degree of CSI distribution, respectively, and then the uncertainties of CSIs under coupled conditions were analyzed.
(1) The distribution rules of the CSIs calculated by the WOE-based models are shown in Table 5 and Figure 10; the distribution rules of the CSIs under the other coupled models are similar to those of the WOE-based models. The CSIs of the WOE-based models are ranked by mean value as follows: Mean (WOE-AHP) > Mean (WOE-MLR) > Mean (WOE-C5.0) > Mean (WOE-RF). Among them, the CSIs of the WOE-AHP and WOE-MLR models are normally distributed and mostly concentrated in the moderate level, indicating that the CSIs predicted by the WOE-AHP and WOE-MLR models are generally large. Combined with the AUC values of these coupled models, it can be seen that the abilities of the AHP and MLR models to identify collapse susceptibility are low. Moreover, the CSIs of the RF and C5.0 models have similar distribution rules: they are concentrated in the very low and low levels and gradually decrease in the other levels. In addition, the order of the dispersion degree of these four models is exactly opposite to that of their mean values: SD (WOE-RF) > SD (WOE-C5.0) > SD (WOE-MLR) > SD (WOE-AHP). The results show that the RF and C5.0 models differentiate the CSIs of the region well and reflect the differences in the CSIs among grid units. Moreover, fewer high CSIs are needed to capture as many known collapses as possible, which indirectly indicates that advanced machine learning models can predict collapse susceptibility more effectively.
(2) Taking the RF model as an example, the distribution rules of the CSIs predicted by the different connection methods were analyzed, as shown in Table 5 and Figure 11.

Difference Significance Analysis of the CSP Results
The significance level of the differences and the mean rank were used to further analyze the uncertainties of the CSP models coupling the connection methods with the data-based models. Specifically, the Friedman two-way analysis of variance by ranks was used to test the significance of the differences between the CSIs predicted under any two groups of different connection methods and data-based models. If the significance of the test result is less than 0.05, the null hypothesis (that there is no difference between the CSIs of the groups) is rejected and the CSIs of the two groups are significantly different. In the significance tests of the paired factors, the probability values (p-values) were all found to be less than 0.05, indicating significant differences. Therefore, it was necessary to cross-verify the connection methods and the data-based models.
At the same time, this test was also used to calculate the mean ranks of the CSIs predicted by the models coupling a connection method with a data-based model, and to rank the performance of the coupled CSP models. The smaller the mean rank, the better the model performance. The comparison results of any pair of models in the group are shown in Table 6. The WOE-RF model has the lowest mean rank (4.82) and therefore ranks first, followed by the WOE-C5.0 model (5.30) and the other WOE-based models. The CSP performances of the FR-based, IV-based and IOE-based machine learning models are consistent, while the PS-AHP model ranks the worst. The significance level of the differences and the mean rank indicate the uncertainty features of the coupled connection methods and data-based models. Avoiding these uncertainties is important for obtaining reliable and stable CSP results.
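The mean ranks and the Friedman statistic discussed above can be sketched as follows. The sketch assumes that each grid cell is a block, each coupled model is a treatment, and CSIs are ranked in ascending order within each cell, so that a lower CSI receives a lower rank; the text does not spell out these details. The statistic is the textbook form without tie correction, and the CSI values are hypothetical.

```python
def average_ranks(values):
    """Ranks 1..k in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def friedman(rows):
    """Return (mean rank per model, Friedman chi-square statistic).

    rows[c][m] is the CSI predicted for grid cell c by model m.
    """
    n, k = len(rows), len(rows[0])
    rank_rows = [average_ranks(r) for r in rows]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    q = 12 * n / (k * (k + 1)) * sum((m - (k + 1) / 2) ** 2 for m in mean_ranks)
    return mean_ranks, q

# Hypothetical CSIs for four grid cells (rows) under three coupled models (columns).
rows = [
    [0.10, 0.30, 0.50],
    [0.05, 0.20, 0.60],
    [0.15, 0.40, 0.35],
    [0.02, 0.25, 0.45],
]
mean_ranks, q = friedman(rows)
print(mean_ranks, q)
```

For a significance decision, the statistic would be compared against a chi-square distribution with k − 1 degrees of freedom, which is reasonable only when the number of cells is large.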

6. Discussion

CSP Modeling under Different Connection Methods
The degree of impact of each attribute interval of the environmental factors on collapse susceptibility was quantitatively calculated by the connection methods, and the results were used as the input variables of the data-based models to predict the spatial probability of collapse occurrence. In classifying the attribute intervals of the environmental factors under the different connection methods, WOE reflects the effects of spatial information on collapse more effectively than the other four connection methods and has a better prediction accuracy. Compared with the IV and IOE methods, FR is more intuitive: it maintains the prediction accuracy while avoiding overly complicated statistical calculations. The PS method reflects the contribution rate of collapse to the attribute interval, but fails to fully reflect the spatial correlations between collapses and the attribute intervals of environmental factors. The more fully a method expresses the spatial correlation between environmental factors and collapses, the greater the degree of differentiation of the CSIs and the better the effect of CSP modeling. Furthermore, across the five nonlinear connection methods in the order PS, FR, IV, IOE and WOE, the mean values of the CSIs calculated by the coupled data-based models decrease gradually, while the corresponding SD values increase gradually. The mean ranks of the CSIs calculated by the coupled data-based models follow the same trend as the mean values, so the modeling performance of the five connection methods improves progressively from PS through FR, IV and IOE to WOE.
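To make the idea of a connection value concrete, the sketch below computes the frequency ratio (FR) and the weight-of-evidence contrast for one attribute interval, using the common textbook definitions. These may differ in detail from the equations used earlier in this paper, and all counts are hypothetical.

```python
import math

def frequency_ratio(collapse_in, cells_in, collapse_total, cells_total):
    """Share of collapse cells in the interval divided by the interval's share of all cells."""
    return (collapse_in / collapse_total) / (cells_in / cells_total)

def weight_of_evidence(collapse_in, cells_in, collapse_total, cells_total):
    """WOE contrast C = W+ - W- for one attribute interval (textbook form)."""
    n1 = collapse_in                          # collapse cells inside the interval
    n2 = collapse_total - collapse_in         # collapse cells outside the interval
    n3 = cells_in - collapse_in               # noncollapse cells inside the interval
    n4 = (cells_total - cells_in) - n2        # noncollapse cells outside the interval
    w_plus = math.log((n1 / (n1 + n2)) / (n3 / (n3 + n4)))
    w_minus = math.log((n2 / (n1 + n2)) / (n4 / (n3 + n4)))
    return w_plus - w_minus

# Hypothetical counts: one slope interval covering 200 of 1000 cells
# and containing 40 of the 100 collapse cells.
fr = frequency_ratio(40, 200, 100, 1000)
woe = weight_of_evidence(40, 200, 100, 1000)
print(round(fr, 3), round(woe, 3))
```

An FR above 1 or a positive contrast indicates that collapses are over-represented in the interval. Intervals with zero collapse or zero noncollapse cells make the logarithm undefined and would need special handling.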

CSP Modeling under Different Data-Based Models
Under the coupled conditions of the same connection method and different data-based models, the prediction accuracies of all coupled models show a consistent rule: $\mathrm{AUC}_{\mathrm{RF}} > \mathrm{AUC}_{\mathrm{C5.0}} > \mathrm{AUC}_{\mathrm{MLR}} > \mathrm{AUC}_{\mathrm{AHP}}$, which shows that the prediction accuracies of the machine learning models are higher than those of the conventional regression model and the heuristic model. In terms of the characteristics of the CSIs, the CSIs predicted by the RF model are exponentially distributed (Figure 11), with mean values in the transition zone between the very low and low levels; the CSIs predicted by the C5.0 model are relatively discrete, with mean values higher only than those of the RF model; the CSIs predicted by the MLR and AHP models tend to be normally distributed, with large mean values in the moderate level, as shown in Figure 10. Compared with the conventional heuristic model and linear regression model, the CSIs predicted by the machine learning models are more concentrated in the very low and low levels. Meanwhile, the machine learning models are more accurate in predicting the very high and high levels, and most historical collapse events fall in these levels. In addition, the SD values of the CSIs predicted by the MLR and AHP models are smaller than those predicted by the machine learning models, which indicates that the CSIs obtained by the MLR and AHP models are not differentiated enough and that their prediction accuracy is poor. Overall, judging from the characteristics of the CSIs and the prediction accuracy, the prediction performance improves in the order AHP, MLR, C5.0 and RF, as shown in Section 6.4.

CSP Modeling under Coupled Conditions of Connection Methods and Data-Based Models
From the perspective of the coupled models, the CSP accuracy of the WOE-RF model is the best, while that of the PS-AHP model is the worst. In addition, the PS-RF model can also achieve good CSP accuracy (AUC = 0.923). Compared with heuristic models and linear statistical models, machine learning models are more robust in noisy environments, can fully and efficiently mine incomplete information, and achieve excellent training and testing results in CSP modeling [90]. Furthermore, the RF model is more stable and has more advantages than the C5.0 DT model. Heuristic and statistical models rely on the connection methods: the more obvious the statistical rule of the connection method, the better the prediction accuracy [6].
RF is a supervised ensemble learning algorithm based on decision trees, which is more accurate than individual algorithms such as C5.0 and MLR. Thanks to the out-of-bag (OOB) data, an unbiased estimate of the true error is obtained during model generation without any loss of training data. With the introduction of sample and feature randomness, RF has a certain resistance to noise and overfitting in the testing process. As a combination of multiple classification trees, RF can process nonlinear and high-dimensional data without feature selection. Meanwhile, RF can process both discrete and continuous data and adapts well to different datasets, so it is suitable for use as a nonlinear classification model.
The Friedman test was used to verify the differences in CSP performance among the coupled models. The CSIs predicted by the WOE-based models are significantly different from those predicted by the models coupling the other connection methods with the data-based models. Compared with the other data-based coupled models, the CSP performance of the RF model coupled with the five connection methods is significantly different. Moreover, the WOE-RF model exhibits the best CSP performance, with the lowest mean rank and the best prediction accuracy. The results of this study can guide the selection of the best combination of connection methods and data-based models in other research areas. Although these results were obtained in areas with mountainous and hilly terrain, they can be generalized to other regions as a guideline or alternative method for selecting the best combination.
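The out-of-bag mechanism mentioned above can be illustrated with a small sketch of bootstrap sampling alone, not a full random forest. Each tree is trained on a sample drawn with replacement, and the samples never drawn for that tree form its out-of-bag set, which can then be used for error estimation. The sample count of 1000 is hypothetical; the 500 repetitions mirror the number of trees reported in this paper.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(0)        # fixed seed so the run is reproducible
n = 1000                      # hypothetical number of training samples
fractions = []
for _ in range(500):          # one bootstrap sample per tree
    in_bag, oob = bootstrap_split(n, rng)
    fractions.append(len(oob) / n)

mean_oob = sum(fractions) / len(fractions)
print(round(mean_oob, 3))     # close to (1 - 1/n)**n, i.e. about 0.368
```

Roughly one third of the samples are therefore left out of each tree, which is why the OOB error can serve as a validation estimate without setting aside part of the training data.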

CSP Modeling under Single Data-Based Models with No Connection Methods
This paper also conducts CSP modeling based on the original continuous environmental factor data without a connection method. That is to say, the original continuous environmental factor data were directly used as the input variables of the four types of data-based models. These models were then trained and tested using appropriate parameters similar to those of the models above. The prediction accuracies of these single data-based models with no connection method were slightly lower than those of the coupled models using a connection method (Figures 12 and 13). In addition, the distribution rules of the collapse susceptibility maps created by the coupled models and the single models were similar as a whole. To improve the modeling efficiency of CSP, a single machine learning model can be used directly. However, to better reflect the spatial correlations between the collapse distribution and the basic environmental factors, or to analyze the influence of each subinterval of the environmental factors on the evolution of collapses, the coupled models that include a connection method need to be adopted.

Conclusions
Some uncertainty issues in CSP modeling, such as how to couple nonlinear connection methods with data-based models to obtain the optimal coupled conditions, are very important for producing accurate and reliable CSMs. This paper discusses these uncertainties in depth and comes to the following conclusions:
(1) Compared with the other four connection methods, WOE better reflects the nonlinear correlation between collapses and the related environmental factors and has a better ability to discriminate the spatial information of environmental factors. Compared with CSP modeling based on FR, IV and IOE, the CSP accuracies of the WOE-based models are the highest, with the lowest mean values and mean ranks and larger SDs. Meanwhile, the CSP accuracies of the FR, IV and IOE connection methods tend to be consistent, and their CSP performances are not as good as those of the WOE-based models. In addition, the prediction results of the PS-based models are poor.
(2) Compared with the other kinds of data-based models, the RF model has the highest CSP accuracy, with the lowest mean value and mean rank of the CSIs and a larger SD, followed by the C5.0, MLR and AHP models. It can be seen that advanced machine learning models can effectively improve the CSP accuracy and have a strong ability to identify collapse susceptibility.
(3) Under the coupled conditions of different connection methods and data-based models, the CSP accuracy of the WOE-RF model is the highest, with the lowest mean value and mean rank. The CSIs predicted by the WOE-RF model are more in line with the actual characteristics of the collapse probability distribution than those of the other coupled models. In contrast, the PS-AHP model has the lowest prediction accuracy, with a larger mean value and mean rank and a smaller SD value.
(4) In general, the CSP performance of the single data-based models without connection methods was slightly worse than that of the connection method-based models. The comparison results further demonstrate the importance of spatial correlation analysis of environmental factors for CSP modeling.
(5) Although this study mainly analyzes the uncertainty rules of CSP modeling under the conditions of different data-based models and connections between collapses and environmental factors, its conclusions also have some reference value for the susceptibility prediction of other kinds of geological disasters (landslide, debris flow, etc.). This is because the evolution processes of these geological disasters are closely related to various environmental factors from a spatial perspective.