Integration of Information Theory, K-Means Cluster Analysis and the Logistic Regression Model for Landslide Susceptibility Mapping in the Three Gorges Area, China

In this work, an effective framework for landslide susceptibility mapping (LSM) is presented by integrating information theory, K-means cluster analysis and statistical models. In general, landslides are triggered by many causative factors at a local scale, and the impact of these factors is closely related to geographic locations and spatial neighborhoods. Based on these facts, the main idea of this research is to group a study area into several clusters to ensure that landslides in each cluster are affected by the same set of selected causative factors. Based on this idea, the proposed predictive method is constructed for accurate LSM at a regional scale by applying a statistical model to each cluster of the study area. Specifically, each causative factor is first classified by the natural breaks method with the optimal number of classes, which is determined by adopting Shannon’s entropy index. Then, a certainty factor (CF) for each class of factors is estimated. The selection of the causative factors for each cluster is determined based on the CF values of each factor. Furthermore, the logistic regression model is used as an example of statistical models in each cluster using the selected causative factors for landslide prediction. Finally, a global landslide susceptibility map is obtained by combining the regional maps. Experimental results based on both qualitative and quantitative analysis indicated that the proposed framework can achieve more accurate landslide susceptibility maps when compared to some existing methods, e.g., the proposed framework can achieve an overall prediction accuracy of 91.76%, which is 7.63–11.5% higher than those existing methods. Therefore, the local scale LSM technique is very promising for further improvement of landslide prediction.


Introduction
As the water level in the reservoir fluctuates periodically, the famous Three Gorges Reservoir is characterized by plentiful active and reactivated landslides of different scales, which seriously threaten the local people's lives and property. Up to 2009, more than 3800 landslides had been recorded along this reservoir [1]. Therefore, it is important to perform landslide susceptibility mapping (LSM) to dynamically monitor the unstable areas.
Compared to traditional field geological surveying, the spatial forecasting of landslides is more efficient and economical when geographical information systems (GIS) are integrated with statistical analysis, and it can provide an effective solution for landslide mitigation and management [2]. This approach has been widely documented in recent literature, which shows that remote sensing (RS) can be used for landslide investigation. According to the three most comprehensive reviews [3][4][5], landslide susceptibility and hazard assessment is one of the three hot topics in landslide investigation using RS. Over the last three decades, many effective methods have been developed to investigate the role of RS and GIS for producing landslide hazard zoning maps. These techniques are mainly divided into two categories with different theoretical bases, i.e., qualitative and quantitative [6]. The qualitative techniques are characterized by subjective assessments that describe the probability of landslide occurrences based on expert experience and knowledge of landslide formation mechanism(s) [7], including the analytical hierarchy process (AHP) [8][9][10], fuzzy mathematics [11], multi-criteria evaluation [12,13], weighted linear combination (WLC) [14,15] and ordered weighted average [16,17]. The quantitative techniques represent landslide occurrences by exploiting mathematical models to perform LSM on a continuous scale [14,18]. Landslides are typically complex processes triggered by various causative factors, which have geomorphological, geological, hydrological, terrestrial, meteorological or geotechnical properties. The quantitative methods are commonly divided into two types, bivariate and multivariate. To estimate the weights for each variable in the bivariate methods, each causative factor map is combined with a landslide distribution map. Various techniques can be used with the bivariate methods, such as favorability functions [19][20][21], information value [22,23], weights of evidence [24][25][26][27][28][29], the frequency ratio [30][31][32] and the Dempster-Shafer method [2,33]. However, the failure to consider the correlation between the causative factors is the main shortcoming of such methods [34]. The multivariate methods assess the relationships between the landslide distribution and a series of causative factors [35]. Specifically, all of the causative factors are resampled for each terrain mapping unit (TMU), and landslide events are estimated from the resulting matrix, which can be analyzed with logistic regression (LR) [18,36-38], multiple regression [39][40][41][42][43], discriminant analysis [44] or principal component analysis (PCA) techniques [45,46]. Apart from these statistical methods, data mining and machine learning techniques have drawn much attention for LSM, including decision tree (DT) [47,48], random forests [49][50][51], neural networks [52][53][54][55][56], support vector machine (SVM) [57,58] and Bayesian network (BN) approaches [59]. Nevertheless, adopting all of the causative factors for LSM without considering data dimensionality is ill-advised, because it tends to cause overfitting and poor model generalization [60]. Therefore, screening the factors through feature selection using filtering methods [11,61] or wrapper methods [62][63][64] is a common step for producing more accurate landslide susceptibility maps. However, these dimensionality reduction techniques are burdened with a high computational cost. Furthermore, the same set of selected factors is exploited throughout the entire study area, without taking the spatial dependence between TMUs into account.
Landslides are triggered by many causative factors at a local scale, and the impact of these factors is closely related to geographic locations and the nearest neighborhood. Recently, Das et al. [65] proposed to obtain landslide susceptibility maps using homogeneous susceptibility units (HSUs), which is an effective local-scale analysis method. In this work, we develop an alternative framework to address the previously mentioned issues by integrating information theory, K-means cluster analysis and statistical models. The proposed framework consists of the following steps. First, each causative factor used in the study area is classified by the natural breaks method with the corresponding optimal number of classes, which is determined by using Shannon's entropy index. Then, a certainty factor (CF) for each class of each factor is estimated. In practice, the impact of each causative factor may be confined to a local scale within a study area [66]. To account for this, each TMU in the study area is assigned an appropriate combination of the causative factors, represented by a binary encoding, according to expert experience and knowledge of the CFs. By performing K-means cluster analysis on the encoded TMUs, the spatial dependence between these units is considered, so that the TMUs where landslides are affected by a similar set of causative factors are aggregated together. The final binary-encoded centroid of each cluster is then used to choose the optimal combination of causative factors shared by all of the TMUs in that cluster. Next, the weights of the selected factors for each cluster are computed, and the proposed predictive method is constructed for accurate LSM at a regional scale by applying an LR model to each cluster of the study area. On this basis, a global landslide susceptibility map is created by integrating the regional maps. The proposed framework was validated in the Zigui-Badong section of the Three Gorges area by using the LR method implemented in SPSS Clementine 12.0, which can effectively integrate remote sensing datasets with field surveying data. IBM SPSS Statistics 19.0 was used for the computation of the proposed framework, and ESRI ArcGIS 10.0 was used for producing the resultant maps.

General Characteristics and Geological Setting
The study area is located in the Zigui-Badong section of the middle and lower reaches of the Yangtze River and covers 446.32 km² in the southwest of Hubei Province. It lies between latitudes 30°54′59″N and 31°03′32″N and longitudes 110°18′44″E and 110°52′30″E, and its highest point reaches 2000 m above sea level, as shown in Figure 1. The study area belongs to the subtropical monsoon climate zone and is characterized by abundant rainfall and humidity. The average annual precipitation for the period from 2001 to 2010 in Badong County and Zigui County in Hubei Province is 1069.2 mm and 944.5 mm, respectively, while the highest annual precipitation reached 1148.7 mm in 2008. Furthermore, most of the rainfall in this area is concentrated from May to September of each year, accounting for 70% of the annual precipitation. All forms of rock, including igneous, sedimentary and metamorphic, can be observed in this area. The strata from the Middle Triassic to Jurassic are mostly composed of sandstone, shale, mudstone and marlstone, which are the main rock types involved in natural hazards. The geological map of the study area is shown in Figure 2. Several main faults and lineaments can be identified in this figure, including the Xiannvshan Fault, the Jiuwanxi Fault, the Niukou Fault and the Xiangluping Fault, which may promote or mitigate landslides.

Slope Failures and Causative Factors
The geological conditions and human activities in the study area, such as urbanization, deforestation and construction of the reservoir, have caused a widespread distribution of landslides, which pose a serious threat to the lives and property of the local residents. The landslide inventory map of the study area was constructed by using Google Earth 7.1 along with extensive field surveys and historical and bibliographical landslide data. In total, 202 landslide polygons were identified and mapped with a total area of 23.40 km², covering 5.24% of the study area, as shown in Figure 1c. It can also be observed from Figure 1c that the area of these landslides varies widely, e.g., the largest, the Fanjiaping landslide, has an area of 1.51 km², while the smallest, the Kuihua street landslide, covers only 2068.8 m².
In this work, the Yangtze River was excluded from the study area because the values of the Advanced Spaceborne Thermal Emission and Reflection Radiometer Global Digital Elevation Model (ASTER GDEM) data always change dramatically at the junction between this river and its banks [67]. Landslide occurrence is closely related to causative factors. Considering the landslide distribution and the characteristics of the study area, a total of 17 causative factors were selected for the LSM, falling into four main categories: geological, geomorphological, hydrological and land cover factors. To extract these causative factors for landslide prediction, the following ancillary data were used:
√ A Landsat-8 OLI image obtained on 14 April 2015, with the path/row number of 125/38. To perform feature extraction, a series of operations was applied to this multispectral image: radiometric correction to avoid radiometric errors or distortions over the whole image, geometric correction to remove geometric distortion due to Earth's rotation and other imaging conditions, and atmospheric correction to remove the effects of the atmosphere on the reflectance values of the image. Bands 4 and 5 of the image are used for computing the normalized difference vegetation index (NDVI), whereas Bands 3 and 6 are used for computing the normalized difference water index (NDWI).
√ The 1:50,000-scale geological maps provided by the Hubei Geological Bureau, for the extraction of geological factors, including lithology and distance to fault.
√ ASTER GDEM Version 2 (V2) data, representing the surface in raster format, for the extraction of geomorphological and hydrological factors, including elevation, distance to rivers, the terrain roughness index (TRI), the terrain position index (TPI), slope gradient, catchment area, catchment slope, terrain curvature, the topographic wetness index (TWI), terrain surface convexity, terrain surface texture, slope aspect and slope form.
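The two band-ratio indices mentioned above can be sketched as follows. The text names only the bands, so the formulas are assumptions: NDVI is taken in its usual form (NIR − Red)/(NIR + Red) with Bands 5 and 4, and NDWI is assumed to be (Band 3 − Band 6)/(Band 3 + Band 6). The reflectance values are invented for illustration.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b) per pixel; None where the denominator is zero."""
    out = []
    for a, b in zip(band_a, band_b):
        total = a + b
        out.append(None if total == 0 else (a - b) / total)
    return out


# Hypothetical surface reflectance values for three pixels (not from the paper).
band3_green = [0.08, 0.10, 0.06]
band4_red = [0.05, 0.12, 0.04]
band5_nir = [0.40, 0.15, 0.02]
band6_swir1 = [0.20, 0.18, 0.01]

# NDVI: (NIR - Red) / (NIR + Red), using Bands 5 and 4.
ndvi = normalized_difference(band5_nir, band4_red)
# NDWI: assumed here to be (Band 3 - Band 6) / (Band 3 + Band 6).
ndwi = normalized_difference(band3_green, band6_swir1)
```

Both indices lie in [−1, 1]; pixels where the two bands sum to zero are returned as None rather than raising an error.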
The selection of the TMU is very significant for LSM. In this work, grid cell terrain units were exploited to model the landslide susceptibility of the study area, and a value was assigned to each grid cell unit per causative factor. The landslide map and other factor layers were extracted with grid cells having a spatial resolution of 28.5 × 28.5 m, to match the remote sensing data considered here.

The Proposed Framework
The flowchart of the proposed framework is shown in Figure 3. In the following subsections, we present the foundations of our framework.

Information Coefficient Based on Shannon's Entropy Index
Shannon's entropy model has been commonly used to measure the amount of information in a signal or event [68], and landslide occurrence can be estimated using this approach. Here, it was used to estimate the density of landslides within each class of each factor. With respect to the i-th class of the j-th factor, let $p_{ij}$ denote the probability density, $A_{ij}$ and $B_{ij}$ the area percentage and the landslide percentage, respectively, $C_j$ the total number of classes of the j-th factor, and $M_j$ and $M_{j\max}$ the entropy value of the j-th factor and its maximum, respectively. The information coefficient $I_j$ is calculated using the following series of formulas [69,70]:

$$p_{ij} = \frac{B_{ij}}{A_{ij}} \tag{1}$$

$$(p_{ij}) = \frac{p_{ij}}{\sum_{i=1}^{C_j} p_{ij}} \tag{2}$$

$$M_j = -\sum_{i=1}^{C_j} (p_{ij}) \log_2 (p_{ij}) \tag{3}$$

$$M_{j\max} = \log_2 C_j \tag{4}$$

$$I_j = \frac{M_{j\max} - M_j}{M_{j\max}} \tag{5}$$

Since the entropy value is always greater than or equal to 0 and cannot exceed its maximum, the information coefficient is restricted to the range [0, 1]. The larger the information coefficient, the greater the amount of extracted information.
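The calculation can be illustrated with a minimal sketch. It assumes the form commonly used in the cited literature: the class density is the landslide percentage divided by the area percentage, the densities are normalized to sum to one, and the information coefficient is the relative gap between the maximum entropy and the observed entropy. The function names and input percentages are hypothetical.

```python
import math


def information_coefficient(area_pct, landslide_pct):
    """Information coefficient of one factor from per-class percentages.

    area_pct[i] and landslide_pct[i] are the area and landslide percentages
    of class i. Returns None ("NC") when a class contains no landslide
    cells, because log2(0) is undefined.
    """
    if len(area_pct) < 2:
        raise ValueError("at least two classes are required")
    density = [b / a for a, b in zip(area_pct, landslide_pct)]
    if any(d == 0 for d in density):
        return None
    total = sum(density)
    p = [d / total for d in density]              # normalized densities
    entropy = -sum(pi * math.log2(pi) for pi in p)
    entropy_max = math.log2(len(p))               # entropy of a uniform split
    return (entropy_max - entropy) / entropy_max


def best_class_count(coefficients):
    """Pick the class count with the largest coefficient, skipping NC entries."""
    valid = {k: v for k, v in coefficients.items() if v is not None}
    return max(valid, key=valid.get)
```

A factor whose classes all have the same landslide density gives a coefficient of 0, and a factor that concentrates nearly all landslides in one class gives a value close to 1. `best_class_count` mirrors the selection of the number of classes by maximum information coefficient.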

Certainty Factor
The CF method has been commonly used for landslide prediction because it can handle the combination of different vector layers as well as the heterogeneity and uncertainty of the input data. The certainty factor can be expressed as follows [19,71]:

$$CF = \begin{cases} \dfrac{PP_a - PP_s}{PP_a (1 - PP_s)}, & PP_a \geq PP_s \\[2ex] \dfrac{PP_a - PP_s}{PP_s (1 - PP_a)}, & PP_a < PP_s \end{cases} \tag{6}$$

where $PP_a$ is the conditional probability of a landslide event occurring in a certain class a, while $PP_s$ is the prior probability of a landslide event occurring throughout the study area. From Equation (6), the CF value lies in the range [−1, 1]. If the CF value is larger than zero, the certainty of a landslide occurrence is high, while if this value is smaller than zero, the certainty of a landslide occurrence is low. In particular, a CF value of 0 means that no indication is available about the contribution of a certain class of a causative factor.
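As a hedged illustration, the piecewise definition of the certainty factor in its standard form can be written as a small function. The handling of a zero denominator is an assumption added here to keep the sketch total. The prior of 0.0524 reuses the landslide share of the study area reported earlier, and the conditional probabilities are invented.

```python
def certainty_factor(pp_a, pp_s):
    """Certainty factor of a class.

    pp_a: conditional probability of a landslide in the class.
    pp_s: prior probability of a landslide in the whole study area.
    """
    if pp_a >= pp_s:
        denom = pp_a * (1.0 - pp_s)
    else:
        denom = pp_s * (1.0 - pp_a)
    # Assumption: return 0 (no indication) when the denominator vanishes.
    return 0.0 if denom == 0 else (pp_a - pp_s) / denom


prior = 0.0524  # landslides cover 5.24% of the study area
cf_high = certainty_factor(0.10, prior)   # class with above-average density
cf_low = certainty_factor(0.01, prior)    # class with below-average density
```

Here `cf_high` is positive and `cf_low` is negative, and a class with no landslides at all yields −1.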

K-means Clustering Analysis
K-means is widely used for solving clustering problems due to its efficiency and simple implementation [72]. The main idea of this algorithm is to group a given dataset into K clusters, in which each data point is assigned to the cluster with the nearest mean, which serves as the centroid of the cluster [73]. The assignment of the data points and the update of the centroids are iterated until the centroids no longer change. This algorithm minimizes the following objective function [74]:

$$J = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \left\| x_j^{i} - c_i \right\|^2 \tag{7}$$

where $x_j^{i}$ and $c_i$ represent the j-th data point of the i-th cluster and the i-th cluster center, respectively, $n_i$ is the number of data points in the i-th cluster and $\| x_j^{i} - c_i \|$ is the $L_2$ norm of $x_j^{i} - c_i$.
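The iteration described above can be sketched in plain Python. This is a minimal version of the standard (Lloyd) procedure with user-supplied initial centroids; it is not the SPSS implementation used in this work, and the choice to keep an empty cluster's centroid unchanged is an assumption.

```python
def kmeans(points, centroids, max_iter=100):
    """Cluster points around the given initial centroids.

    points and centroids are lists of equal-length numeric tuples.
    Returns (labels, centroids) once the centroids stop changing.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda i: sq_dist(p, centroids[i]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: keep empty clusters
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def objective(points, labels, centroids):
    """Sum of squared distances between points and their cluster centroid."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
               for p, lab in zip(points, labels))
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` separates the two lower-left points from the two upper-right ones.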

Multicollinearity Analysis
Multicollinearity analysis is widely used to estimate the correlation between causative factors. Multicollinearity refers to a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated [75]. When those variables are highly correlated, it is difficult to obtain their respective coefficients accurately. To detect multicollinearity, two diagnostic indices are commonly used, tolerance (TOL) and the variance inflation factor (VIF) [76]. Let $X = \{X_1, X_2, \ldots, X_N\}$ define a given independent variable set and $R_j^2$ denote the coefficient of determination when the j-th independent variable $X_j$ is regressed on all other predictor variables in the model. The VIF value is computed as follows:

$$VIF_j = \frac{1}{1 - R_j^2} \tag{8}$$

The TOL measure is the reciprocal of the VIF value and represents the degree of linear correlation between independent variables. From Equation (8), if $R_j^2 = 0$, then VIF = TOL = 1, meaning that $X_j$ is not linearly related to the others; if $R_j^2$ is close to 1, then VIF → ∞ and TOL → 0, indicating that $X_j$ is highly related to the others. If the VIF value is above the threshold value of 5 or 10 [75], the corresponding variable is considered collinear with the others and should be removed from the predictive model.
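The two indices can be sketched as below, assuming the usual definitions VIF = 1/(1 − R²) and TOL = 1 − R². To keep the example free of external libraries, the helper computes R² only for the two-variable case, where it equals the squared Pearson correlation; the general case needs a multiple regression, which is not shown. All names and data are hypothetical.

```python
def vif_and_tol(r_squared):
    """Return (VIF, TOL) for a predictor whose auxiliary regression has this R^2."""
    tol = 1.0 - r_squared
    vif = float("inf") if tol == 0 else 1.0 / tol
    return vif, tol


def r_squared_pair(x, y):
    """R^2 of regressing x on a single other predictor y (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Two perfectly collinear (hypothetical) predictors give an infinite VIF.
vif, tol = vif_and_tol(r_squared_pair([1, 2, 3, 4], [2, 4, 6, 8]))
```

With R² = 0.6, the sketch returns a TOL of 0.4 and a VIF of 2.5, well inside the commonly used limits.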

Logistic Regression
Logistic regression is a multivariate statistical method used to establish the relationship between a dependent variable and several independent variables [6,35,38,77-79]. In recent years, the logistic regression model has been commonly used for LSM due to its simplicity and effectiveness [18,58,80-82]. The main idea of this method is to perform maximum likelihood estimation to obtain the probability of landslide occurrence after transforming the dependent variable into a logit variable. The simplified logistic regression model can be expressed as follows:

$$p = \frac{1}{1 + e^{-z}} \tag{9}$$

where p denotes the probability of landslide occurrence and ranges between 0 and 1, and z is the linear combination:

$$z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_N X_N \tag{10}$$

where $\beta_0$ is the intercept of the model and $(\beta_1, \beta_2, \ldots, \beta_N)$ are the regression coefficients representing the impact of $X = \{X_1, X_2, \ldots, X_N\}$ mentioned previously on the logit z.
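Once the coefficients have been estimated, the probability for one grid cell follows directly from the logistic function. The sketch below assumes the standard form p = 1/(1 + e^(−z)) with z a linear combination of the factor values. The coefficient values are invented, and fitting by maximum likelihood is left to a statistics package.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def landslide_probability(intercept, coefficients, values):
    """Probability of landslide occurrence for one grid cell."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return sigmoid(z)


# Hypothetical model with two factors: z = -1 + 2 * 1.0 + 0.5 * (-2.0) = 0.
p = landslide_probability(-1.0, [2.0, 0.5], [1.0, -2.0])
```

A logit of zero corresponds to a probability of 0.5; positive coefficients push the probability towards 1 as the factor value grows.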

Objective Evaluation Measures
To objectively assess the predictive methods, two measures were utilized. The first one is the overall prediction accuracy, which evaluates prediction correctness and is defined as follows:

$$\text{Accuracy} = \frac{a + b}{S} \tag{11}$$

where a and b are the numbers of correctly predicted landslide and non-landslide TMUs in the final susceptibility maps, respectively, and S is the total number of grid cells in the study area.
According to Equation (11), this measure can be directly applied to the LSM of the entire study area to evaluate the global LR model. To assess the proposed framework, the correctly predicted cells are counted in each of the K clusters and then combined:

$$\text{Accuracy} = \frac{\sum_{i=1}^{K} (a_i + b_i)}{S} \tag{12}$$

where $a_i$ and $b_i$ are the numbers of correctly predicted landslide and non-landslide grid cells in the i-th cluster, respectively. The second measure is the commonly used receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Since a test with perfect discrimination produces a curve passing through the upper left corner of the plot, the closer the ROC curve is to the upper left corner, the more accurate the landslide predictions are [83,84]. For a useful model, the AUC value ranges from 0.5 (no better than chance) to 1, and a value close to 1 indicates that the model predicts very well [85].
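Both measures can be sketched in a few lines. The accuracy function assumes that the per-cluster counts of correctly predicted cells are summed and divided by the total number of grid cells. The AUC is computed with the pairwise-ranking formulation, which is equivalent to the area under the ROC curve. All counts and scores are invented.

```python
def overall_accuracy(cluster_counts, total_cells):
    """cluster_counts: list of (a_i, b_i) pairs, one per cluster.

    a_i / b_i are the correctly predicted landslide / non-landslide cells.
    A single pair gives the accuracy of a global model.
    """
    return sum(a + b for a, b in cluster_counts) / total_cells


def auc(landslide_scores, non_landslide_scores):
    """Fraction of (landslide, non-landslide) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in landslide_scores:
        for n in non_landslide_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(landslide_scores) * len(non_landslide_scores))


acc = overall_accuracy([(5, 40), (5, 40)], 100)   # two clusters, 100 cells
score = auc([0.9, 0.3], [0.5, 0.1])               # 3 of 4 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of cells, so for a full raster a rank-based computation would be preferable; the sketch only shows the definition.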

Choosing the Number of Classes for Each Causative Factor
This step determines the number of classes for each factor by maximizing its information coefficient. In this work, the 14 continuous factors were classified into 2-6 classes using the natural breaks method, whereas the three categorical factors of slope aspect, lithology and slope form were not reclassified. Based on the approach in [86], the information coefficients of each causative factor considered here were computed and are listed in Table 1. It can be concluded from this table that six causative factors have the greatest information coefficients when divided into two classes, i.e., elevation, distance to river, NDVI, NDWI, catchment area and terrain surface texture. The three causative factors of slope gradient, catchment slope and TWI have the highest information coefficients of 0.1229, 0.1350 and 0.1798, respectively, when divided into three classes. The causative factors of TRI and terrain surface convexity maximized their information coefficients at 0.2472 and 0.0933, respectively, when divided into four classes. Only the causative factor of distance to fault reached its highest information coefficient when divided into five classes. Finally, the two causative factors of TPI and terrain curvature obtain the maximum information coefficients of 0.1089 and 0.0933, respectively, when divided into six classes. In this table, the term NC represents non-calculable, which means that there is no landslide grid cell in a class of the factor. Specifically, $p_{ij}$ in Equations (1) and (2) would be zero when there is no landslide grid cell in the i-th class of the j-th factor. In this case, $\log_2 p_{ij}$ cannot be calculated. There may be some specific operations to avoid this problem, but we did not compute the corresponding information coefficient in this work and assigned it as "NC". Furthermore, the two most influential factors are elevation and distance to river, with information coefficients above 0.8, which means that these two causative factors are crucial for the LSM of the study area. In contrast, the two least influential causative factors are distance to fault and catchment area, with information coefficients below 0.06. The classification results of the 14 continuous factors with the optimal number of classes are shown in Figure 4a-n, while the classification maps of the three categorical factors of lithology, slope aspect and slope form are illustrated in Figure 4o-q, respectively. In this work, the categorical factor of slope form is classified as concave/concave (V/V), elongated/concave (GE/V), convex/concave (X/V), concave/even (V/GR), elongated/even (GE/GR), convex/even (X/GR), concave/convex (V/X), elongated/convex (GE/X).
Remote Sens. 2017, 9, 938
From Equation (6), for a landslide grid cell, any causative factor with a negative CF value has a negative impact on the landslide prediction of this cell; likewise, for a non-landslide grid cell, any causative factor with a positive CF value has a negative impact on the prediction. In both situations, the corresponding factor should not be considered for landslide prediction of this grid cell. In this work, after computing a CF value for each class of all of the causative factors considered here, we were able to obtain an optimal combination of the causative factors for each grid cell, and the selected factors were used as the independent variables of the LR model.
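The per-cell selection rule can be sketched as a small encoding function that emits one binary digit per causative factor. Treating a CF value of exactly zero as "keep" is an assumption, since the text does not cover that case. The function name and CF values are hypothetical.

```python
def encode_cell(is_landslide, cf_values):
    """Return a binary string with one digit per causative factor.

    cf_values holds, for each factor, the CF of the class this cell falls into.
    A factor is dropped ("0") when its CF contradicts the cell's label:
    negative CF in a landslide cell, or positive CF in a non-landslide cell.
    """
    bits = []
    for cf in cf_values:
        contradicts = (cf < 0) if is_landslide else (cf > 0)
        bits.append("0" if contradicts else "1")
    return "".join(bits)


# Hypothetical CF values of three factors at one location.
cfs = [0.30, -0.69, 0.0]
landslide_code = encode_cell(True, cfs)       # keeps factors 1 and 3
non_landslide_code = encode_cell(False, cfs)  # keeps factors 2 and 3
```

With 17 factors the same function yields a 17-digit code per grid cell, which is the form of input expected by the clustering step.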
The classes and the corresponding CF values of each causative factor are listed in Table 3 along with the percentages of landslide and class. Elevation is a key factor in landslide occurrences. In Table 3, the CF value of elevation is positive in the range of 80~700 m, which means that landslides are concentrated in this range of elevation. In this class, the area of landslides accounts for 99% of the total landslide area. Conversely, the CF value is negative and close to −1 in the range of >700~2000 m, indicating that the probability of landslide occurrences is very low. Distance to river is a commonly used factor in the evaluation of landslide susceptibility, reflecting the impact of the reservoir water on landslides. As shown in Table 3, the impact of the reservoir water on the landslides becomes weaker as the distance from the water system increases. From Figure 4a,b, we can also observe that the elevation of a position tends to be higher as its distance to river increases. In such terrain, the loose stacking layers are very thin, which is adverse to the development of soil landslides. As slope gradient may affect slope stability by modulating the surrounding engineering geological conditions, it is another key factor of LSM. Table 3 shows that slope gradient has the highest and lowest CF values of 0.2967 and −0.692, respectively. Moreover, the landslide occurrence in the study area decreases as the slope gradient increases, and very few landslides occur when the slope gradient is higher than 35°, as shown in Table 3. In general, the probability of landslide occurrence would be expected to increase with the slope gradient. However, in the study area, the sites with a high slope gradient are mostly distributed in high-elevation areas, where, as mentioned above, slope failures are unlikely. Slopes with different orientations usually receive different intensities of solar radiation, which affects the distribution of pore water pressure and the physical and mechanical characteristics of rocks and soil. Table 3 shows that the CF value of slope aspect is in the range of [0.2, 1] in the two directions of north and northeast, indicating that the north-facing slopes are more susceptible to landslide occurrences. Apart from these factors, stratigraphic lithology is an important intrinsic factor and the foundation for the development of landslides, and it can determine the type and scale of landslide occurrences. Table 3 shows that the CF value of lithology is positive in the area with soft and hard sandstone or limestone with thin bedrocks, which is prone to landslide occurrences. Conversely, the value is negative for mudstone, shale, Quaternary deposits, hard limestone or thick sandstone, which means that areas with such lithological characteristics are not conducive to landslide occurrences. Since "mudstone, shale and Quaternary deposits" is a single lithology class, mudstone may not be the only component contributing to the stability of an area when the CF value of this class is negative. For instance, Quaternary deposits have been reported to be unfavorable for slope failure [10].

Clustering Grid Cells into Different Groups
Although each grid cell in the study area has an optimal combination of causative factors based on the analysis in Section 3.1.2, it is impractical to perform LSM for each grid cell individually with the predictive model. In this work, the K-means clustering algorithm in IBM SPSS Statistics 19 was adopted to group all of the grid cells into clusters according to the nearest neighbor principle. To this end, each grid cell in the study area was first assigned a 17-digit binary encoding, in which "1" denotes that the corresponding factor was selected for this grid cell for landslide prediction and "0" denotes that the factor was excluded. All binary-encoded grid cells were then used as input variables for the K-means algorithm. Consequently, the study area was divided into K clusters such that the grid cells within the same cluster have the greatest similarity. The optimal combination of causative factors for a cluster is represented by its final centroid and is shared by all of the cells in that cluster. In our experiments, we used the K-means algorithm to divide all of the grid cells into three clusters (K = 3). The optimal combinations of the causative factors for each cluster are shown in Table 4, where the abbreviations SE and RC denote "selected" and "regression coefficient", respectively, and the symbol "√" indicates that the causative factor is selected for the corresponding cluster. When the study area is not partitioned by the K-means algorithm, all of the causative factors are used as independent variables in the LR model.
After the K-means cluster analysis, a multicollinearity analysis of the selected causative factors was performed using IBM SPSS Statistics 19.0. The VIF and TOL values of the causative factors for each cluster with K = 3 are listed in Table 5. According to this table, there was no serious multicollinearity between the causative factors in each cluster: all of the TOL values are higher than 0.4, which is well above the commonly-used critical value of 0.1, and the greatest VIF value is less than 2.5, which indicates no serious collinearity among the selected causative factors.
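The clustering step above can be illustrated with a minimal sketch. It is not the SPSS implementation used in this work: it is a plain Lloyd-style K-means on 0/1 vectors, and the function names, the toy encodings and the 0.5 threshold used to read a factor selection off a centroid are all assumptions made for illustration.

```python
import random

def kmeans_binary(cells, k, n_iter=50, seed=0):
    """Lloyd-style K-means on equal-length 0/1 vectors (squared Euclidean distance)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct encodings (assumes at least k distinct patterns).
    distinct = sorted(set(map(tuple, cells)))
    centroids = [list(map(float, c)) for c in rng.sample(distinct, k)]
    labels = [0] * len(cells)
    for _ in range(n_iter):
        # Assignment step: each cell joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(cell, centroids[j])))
            for cell in cells
        ]
        # Update step: centroid = mean of its members (an empty cluster keeps its centroid).
        for j in range(k):
            members = [cell for cell, lab in zip(cells, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids

def selected_factors(centroid, threshold=0.5):
    """Indices of factors treated as selected for a cluster (threshold is an assumption)."""
    return [i for i, v in enumerate(centroid) if v >= threshold]

# Hypothetical 17-digit encodings: three groups of cells sharing a factor pattern.
p0 = [1] * 6 + [0] * 11
p1 = [0] * 6 + [1] * 6 + [0] * 5
p2 = [0] * 12 + [1] * 5
cells = [p0] * 4 + [p1] * 3 + [p2] * 5

labels, centroids = kmeans_binary(cells, k=3)
for j, centroid in enumerate(centroids):
    print(j, labels.count(j), selected_factors(centroid))
```

Each centroid component is the fraction of cells in the cluster for which that factor was selected, which is why some rule (here a simple threshold) is needed to turn a centroid back into a factor set.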

Validation and Comparison
In this step, the proposed method was compared with several commonly-used methods: (1) the LR method, which is representative of statistical models; (2) the SVM model, which is representative of machine learning methods; and (3) the DT method based on the C5.0 algorithm, which is representative of data mining techniques. These methods can be used with both remote sensing images and field surveys and were performed using SPSS Clementine 12.0. To apply these methods to the LSM of the study area, 70% of the landslide grid cells were randomly selected for training the LR, SVM and DT methods, and the remaining landslide grid cells were used for validation. For the proposed framework, the same proportion of training and validation samples was randomly selected in each cluster. As mentioned in Section 3.1.3, the study area can be clustered by the K-means algorithm to obtain an optimal combination of the causative factors for each cluster. The regression coefficients of each causative factor in each cluster were computed using SPSS Clementine 12.0, and the regional LR model with K = 3 (LR_K3) was constructed for comparison. Thus, the LR, SVM and DT methods were applied to the entire study area, whereas the LR_K3 method was applied separately to the different clusters of the study area for accurate LSM at a regional scale. To make the resultant maps more readable, we divided the probability values into four susceptibility zones (low, medium, high and very high) using the natural breaks method in ESRI ArcGIS 10.0. The landslide susceptibility maps of all of the methods are illustrated in Figure 5, which shows that most of the previously investigated landslides are distributed in the high or very high susceptibility zones in the maps of all of the predictive methods. However, the LR, SVM and DT methods assign many grid cells to the high or very high susceptibility classes, which is unreliable because landslides occur infrequently in those areas. In contrast, the map generated by the LR_K3 method is consistent with the actual distribution of landslides, as shown in Figure 1.
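The per-cluster sampling described above (the same 70/30 proportion drawn randomly within each cluster) can be sketched as follows. This is an illustrative sketch only; the function name, the seed and the toy cluster assignment are assumptions and do not reproduce the SPSS Clementine workflow.

```python
import random
from collections import defaultdict

def split_by_cluster(cell_ids, cluster_of, train_frac=0.7, seed=42):
    """Randomly split landslide cells into training/validation sets within each cluster."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for cid in cell_ids:
        groups[cluster_of[cid]].append(cid)
    train, valid = [], []
    for cluster in sorted(groups):
        members = groups[cluster][:]
        rng.shuffle(members)
        n_train = round(train_frac * len(members))  # same proportion in every cluster
        train.extend(members[:n_train])
        valid.extend(members[n_train:])
    return train, valid

# Hypothetical landslide cells: 10 in cluster 0, 20 in cluster 1, 30 in cluster 2.
cluster_of = {i: (0 if i < 10 else 1 if i < 30 else 2) for i in range(60)}
train, valid = split_by_cluster(list(cluster_of), cluster_of)
print(len(train), len(valid))
```

Splitting inside each cluster, rather than over the whole study area, keeps every regional model supplied with its own training and validation cells even when the clusters differ greatly in size.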
The overall accuracies of landslide prediction by all of the methods, which were measured using Equations (11) and (12), are listed in Table 6. The LR_K3 method achieved the best overall accuracy of 85.32%, compared with 80.26%, 83.74% and 84.13% for the LR, SVM and DT methods, respectively. The success and prediction rate curves achieved by the different methods are shown in Figure 6. The success power of the predictive methods was evaluated using the training samples, and the conclusion is similar to that for the overall accuracy: the LR_K3 method achieved the best success power with an AUC value of 96.8%, compared with 90.4%, 92.3% and 93.4% for the LR, SVM and DT methods, respectively. The predictive performance of the methods was evaluated using the validation samples, and the observation was consistent with that for the success power: the LR_K3 method has superior prediction ability with an AUC value of 96.1%, compared with 90%, 91.5% and 92% for the LR, SVM and DT methods, respectively.
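The two kinds of measure used above can be sketched in a few lines. Equations (11) and (12) are not reproduced in this section, so the sketch assumes the usual definition of overall accuracy (the fraction of correctly classified cells) and computes the AUC as the probability that a landslide cell receives a higher score than a non-landslide cell; the labels, scores and the 0.5 cut-off are made-up illustration values.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of cells whose predicted class matches the observed class (assumed definition)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical cells: 1 = landslide, 0 = non-landslide, with predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative cut-off

print(overall_accuracy(y_true, y_pred))
print(auc(y_true, scores))
```

Evaluating `auc` on the training cells corresponds to the success rate curve, and evaluating it on the held-out validation cells corresponds to the prediction rate curve.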

Discussion
From the above analyses, we can observe that the number of clusters used to divide the study area has a significant effect on landslide prediction. In this section, the impact of K on the predictive performance of the proposed framework is discussed first. Then, to better describe the aim of this work, we provide a qualitative and quantitative analysis of the correlation between landslide susceptibility and urban planning.

Impact of K
Our experimental results showed that at least one cluster contains no grid cells when the study area is divided into K clusters with K ≥ 5. Therefore, we ran the K-means algorithm with K = {2, 3, 4}. The optimal combinations of the causative factors for each cluster with K = {2, 4} are shown in Table 7. Tables 4 and 7 show that the two factors catchment area and distance to fault were not selected regardless of the number of clusters, indicating that these two factors are not critical for LSM. This is consistent with the finding that these two factors have the smallest information coefficients, as mentioned in Section 3.1.1. The VIF and TOL values of the causative factors for each cluster with K = {2, 4} are listed in Table 8. According to Tables 5 and 8, there was no serious multicollinearity between the causative factors in each cluster for any value of K, because all of the TOL values are higher than 0.4 and the highest VIF value is less than 2.5. Therefore, the proposed regional LR framework can obtain more accurate landslide susceptibility maps using the selected causative factors.
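For reference, the two multicollinearity diagnostics quoted above are related as follows under their standard definitions, where $R_j^2$ denotes the coefficient of determination obtained by regressing the $j$-th causative factor on the remaining selected factors:

```latex
\mathrm{TOL}_j = 1 - R_j^2, \qquad
\mathrm{VIF}_j = \frac{1}{\mathrm{TOL}_j} = \frac{1}{1 - R_j^2}
```

Under these definitions, TOL > 0.4 is equivalent to VIF < 2.5, so the two bounds reported above express the same condition, and the critical value TOL = 0.1 corresponds to VIF = 10.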
The statistics of all of the regional LR models are given in Table 9, which shows that both the −2 ln likelihood and the goodness-of-fit measures of the proposed framework are smaller than those of the traditional LR model using all of the causative factors, indicating a better fit of the proposed methods. Furthermore, the smallest pseudo R² measure is 0.252, obtained with the proposed framework with K = 3, which is appropriate for LSM of the study area. To validate the proposed framework, we constructed a regional LR model for each value of K (LR_K2, LR_K3 and LR_K4; see Figure 7).
The overall accuracies of landslide prediction by the three constructed methods are listed in Table 10. More accurate landslide susceptibility maps are obtained as K is increased from 2 to 4. For instance, the proposed LR_K4 method obtains the best prediction accuracy of 91.76%, which is 11.5, 9.91 and 6.44 percentage points higher than that of the global LR, LR_K2 and LR_K3 methods, respectively. Furthermore, the success and prediction rate curves achieved by the three methods are shown in Figure 8. The ROC analysis in this figure indicates that all of the curves achieved by the proposed methods are better than that of the LR method. In addition, Figure 8 shows that the AUC value achieved using the proposed framework improves as K is increased from 2 to 4. It should also be noted that the AUC values of the LR_K4 method can be higher than 0.98.

The Suitability for Urban Development
During the development of an urban environment, landslide susceptibility maps and other physical factors of the study area should be considered by decision makers and planners, since the geology and geomorphology of an area are very significant for urban sustainability [87,88]. In this subsection, AHP and GIS are integrated to support the evaluation and selection of suitable areas for urban development in the study area. To assess the suitability for different land uses, the geomorphological, geological and geographical causative factors were considered along with the landslide hazards. To obtain the potential suitability map for urban development, the following factors were used: elevation, distance to river, distance to main towns, the landslide susceptibility map, slope gradient and slope aspect. The landslide susceptibility map was obtained by the proposed framework with K = 4. The rating of the classes of each causative factor was based on a five-grade scale ranging from 0 to 4, which has been widely used by other researchers [89][90][91]. A grade of zero indicates the most favorable conditions for slope failure, while a grade of four describes the most stable conditions for urban development. The selected factors, their classes, ratings and weighting coefficients are listed in Table 11. The suitability map for urban development of the study area is shown in Figure 9; this map was classified into four categories using the natural breaks method: low, medium, high and very high suitability. Regarding the spatial distribution of the four categories, the areas of very high suitability for urban development are located mostly around the main towns in the study area. Specifically, such areas for each town are as follows:
• near the county of Badong, southwest, south and southeast of this county.
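The combination of class ratings and weighting coefficients described in this subsection can be sketched as a weighted sum per grid cell. This is a minimal sketch under stated assumptions: the weights below are hypothetical placeholders rather than the AHP-derived coefficients of Table 11, and the function and key names are invented for illustration.

```python
# Hypothetical weights for the six factors named in the text (NOT the Table 11 values).
WEIGHTS = {
    "landslide_susceptibility": 0.30,
    "slope_gradient": 0.25,
    "elevation": 0.15,
    "distance_to_main_towns": 0.15,
    "distance_to_river": 0.10,
    "slope_aspect": 0.05,
}

def suitability_score(ratings, weights=WEIGHTS):
    """Weighted mean of the 0-4 class ratings of one grid cell."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same factors")
    if any(not 0 <= r <= 4 for r in ratings.values()):
        raise ValueError("ratings must lie on the 0-4 scale")
    total = sum(weights.values())
    return sum(weights[f] * ratings[f] for f in weights) / total

# A hypothetical cell: gentle slope and low landslide susceptibility, but far from towns.
cell = {
    "landslide_susceptibility": 4,
    "slope_gradient": 3,
    "elevation": 2,
    "distance_to_main_towns": 1,
    "distance_to_river": 2,
    "slope_aspect": 3,
}
print(suitability_score(cell))
```

The resulting continuous score per cell would then be classified into the four suitability categories, for example with the natural breaks method mentioned above.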

Conclusions
Landslides are the leading natural hazards in the Three Gorges area, where the water level in the reservoir fluctuates periodically, and they pose a serious threat to life and property. To avoid risks and mitigate the damage caused by landslides, accurate susceptibility maps are critical for land management and land use planning. To better perform LSM, we presented an effective framework that integrates information theory, K-means cluster analysis and an LR model. In this work, a total of 17 causative factors were used to construct the LR model, and the impacts of these factors are closely related to geographic location and spatial neighborhood. The major contribution of this work is the grouping of the study area into several clusters to ensure that landslides in each cluster are affected by the same set of selected causative factors. Based on this idea, the proposed predictive method was constructed for accurate LSM at a regional scale by applying a suitable LR model to each cluster of the study area. In each cluster, 70% of the landslide grid cells were randomly selected for training the LR model, and the remaining cells were used for validation. The experimental results indicated that the proposed framework achieves superior prediction performance compared with the traditional LR, SVM and DT methods. The predictive methods used in this work were assessed in terms of their overall prediction accuracy and using ROC analysis. These objective measures showed that the proposed framework can produce more accurate landslide susceptibility maps, with an overall prediction accuracy above 90%. Additionally, this framework is capable of achieving more reliable success rate and prediction rate curves, with AUC values above 98%. Further, to better describe the correlation between landslide susceptibility and urban planning, a potential suitability map for urban development was obtained using the landslide susceptibility map and the geological and geomorphological causative factors. In the future, other statistical models or machine learning methods can be embedded into the proposed framework for better prediction performance and a more comprehensive comparison.

Figure 1. Location of the study area. Site maps of (a) China, (b) Hubei province and (c) the study area (a true color composite image using Bands 4, 3 and 2 of Landsat-8 OLI data).

Figure 2. A geological map of the study area.

Figure 5. The landslide susceptibility maps of the study area produced by the four methods. (a) Logistic regression; (b) support vector machine; (c) decision tree; (d) the proposed framework (K = 3).

Figure 6. ROC curves based on the randomly-selected training-validation samples produced by the four methods. (a) The success rate curve; (b) the prediction rate curve.

Figure 7. The landslide susceptibility maps of our study area produced by the proposed framework with different values of K. (a) K = 2; (b) K = 3; (c) K = 4.

Figure 8. The impact of K on the ROC curves of the proposed framework using randomly selected training-validation samples. (a) The success rate curve; (b) the prediction rate curve.
• near the county of Xietan, west, northwest, north, northeast and east of this county.
• near the county of Shazhenxi, northwest, north, northeast and south of this county.
• near the county of Guizhou, north, northeast, east and southeast of this county.
• near the county of Guojiaba, south and southeast of this county.
• near the county of Xiangxi, north, northeast and east of this county.
• near the county of Quyuan, northwest, north, northeast, east and southeast of this county.

Figure 9. The potential suitability for urban development.

Table 1. Information coefficients of each causative factor. The highest ones in each factor are indicated in bold and underlined. TRI, terrain roughness index; TPI, terrain position index; TWI, topographic wetness index.

Table 2. Categories of the certainty factor (CF) value in terms of slope stability.

Table 3. The classes and their CF values of each causative factor.

Table 4. The optimal combinations of the causative factors for each cluster and their regression coefficients with the proposed framework when K = 3. SE, selected; RC, regression coefficient.

Table 5. The multicollinearity analysis of the causative factors for each cluster corresponding to Table 4. TOL, tolerance; VIF, variance inflation factor.

Table 6. Overall accuracies of all of the predictive methods.

Table 9. Summary statistics of the logistic regression models.

Table 11. The selected factors, their classes, ratings and weighting coefficients.