GIS-Based Random Forest Weight for Rainfall-Induced Landslide Susceptibility Assessment at a Humid Region in Southern China

Landslide susceptibility assessment is presently considered an effective tool for landslide warning and forecasting. Within the assessment procedure, a credible index weight can greatly increase the rationality of the assessment result. Using the Beijiang River Basin, China, as a case study, this paper proposes a new weight-determining method based on random forest (RF) and uses the weighted linear combination (WLC) to evaluate landslide susceptibility. The RF weight and eight indices were used to construct the assessment model. For comparison, the entropy weight (EW) and the weight determined by the analytic hierarchy process (AHP) were also used to demonstrate the rationality of the proposed weight-determining method. The results show that: (1) the average error rates of training and testing based on RF are 18.12% and 15.83%, respectively, suggesting that the RF model can be considered rational and credible; (2) RF ranks the indices elevation (EL), slope (SL), maximum one-day precipitation (M1DP) and distance to fault (DF) as the Top 4 most important of the eight indices, accounting for 73.24% of the total, while the indices runoff coefficient (RC), normalized difference vegetation index (NDVI), shear resistance capacity (SRC) and available water capacity (AWC) are less consequential, with an index importance degree of only 26.76% of the total; and (3) the verification of landslide susceptibility indicates that the accuracy rate based on the RF weight reaches 75.41%, whereas it is only 59.02% and 72.13% for the other two weights (EW and AHP), respectively. This paper offers a new weight-determining method for landslide susceptibility assessment. The evaluation results are expected to provide a reference for landslide management, prevention and reduction in the studied basin.


Introduction
The frequency of natural disasters has increased in recent decades against the background of global warming [1–9]. Rainfall-induced landslides are considered one of the most common natural disasters, resulting in significant economic damage and devastating loss of life [10,11]. Large-scale landslides are estimated to have caused at least 60,000 deaths and losses of more than US $9.7 billion worldwide from 1900 to 2016 [12]. Changing climatic patterns and increased anthropogenic activities (e.g., deforestation, land reclamation, slope excavation and reservoir construction) in mountainous regions have contributed to a global increase in the occurrence of landslide events [13–16]. Defining optimum preventive and palliative measures for appropriate landslide defense and management is essential within this context, as landslide-induced losses may be reduced by nearly 90% at an estimated cost of 10.3% of the potential losses [17].
The occurrence of a landslide is regarded as the combined result of many determinants such as precipitation, topography, morphology, lithology and land-use type [18]. The exact location of such a geological disaster carries information on all the hazard-inducing environmental factors [19]. Therefore, landslides are not mutually unrelated events; correlations exist between hazard-inducing environmental factors and landslide location in a given region. The occurrence probability can be estimated if such correlations are properly revealed. Landslide susceptibility assessment, one of the most important measures for analyzing these correlations, has become a vital component of landslide early warning systems and a necessary part of natural and urban planning in government policies worldwide [20–26]. Benefitting from the development of computer techniques and the convenience and compatibility of geographical information systems (GIS), numerous assessment methods have been applied to evaluate landslide susceptibility. These methods can be generally categorized into two groups. The first is a deterministic or engineering approach based on mathematical models of the physical mechanisms that control slope failure, e.g., TRIGRS [27,28]. The significant limitation of this kind of method is its requirement for material data (mechanical properties, water saturation, etc.) that are difficult to obtain over large areas [29]. The second general approach is statistical and thus does not posit mechanisms that control slope failure, but rather assumes that occurrences of past landslides can be related empirically to measurable characteristics of the landscape [30–32]. In turn, these characteristics can be used to predict future landslide occurrence, and many common algorithms have been applied, including weighted linear combination (WLC), multiple regression models [33–35], artificial neural network models [36,37], and support vector machines [38,39]. All these statistical methods can properly present the spatial probability distribution and perform well in practice. Among these methods, WLC, first introduced by Voogd [40], has been intensively applied, benefitting from high precision, easy comprehension, simple use and convenience when combined with GIS [41–43].

However, the determination of a suitable index weight is a significant step when applying the WLC method, because a group of suitable weights helps to assess the susceptibility level better and more sensitively. Generally, subjective weight (SW) and objective weight (OW) are the two main weight-determining methods used in evaluation systems [44]. SW is typically determined by the decision maker's intentions and is strongly affected by expert knowledge and biases, resulting in high subjectivity [45,46]. For example, the analytic hierarchy process, a method of quantitative and qualitative analysis, can determine a comprehensive weight from expert scores; however, such a weight may not be appropriate if the experts lack experience or neglect some implicit information. Because a landslide event exists objectively, a suitable index weight in landslide susceptibility assessment should objectively reflect each index's real contribution or importance and should not be affected by the decision maker's intentions. In this case, OW is regarded as a more suitable weight than SW.

Deficiencies nevertheless remain in the currently common OW methods, which include entropy theory [47,48], the technique for order preference by similarity to ideal solution (TOPSIS) [49,50], gray relational analysis (GRA) [51] and the criteria importance through intercriteria correlation method (CRITIC) [52,53]. These traditional methods can extract objective information from sample data using self-contained mathematical theory and analysis; however, they depend excessively on the sample data and are easily disturbed by data fluctuation, resulting in deficiencies including complicated calculations, poor relevance and even neglect of practical situations [54].
Water 2018, 10, 1019

Random forest (RF) is a machine-learning algorithm proposed by Breiman in 2001 that provides estimates of the hierarchy of variables in classification and evaluation and is capable of estimating the importance of each index to the total susceptibility level [55]. The method has been applied to fields including genomic ranking [56], neuroscience prediction [57], T-cell epitope classification [58], soil parent material mapping [59], vegetable oil analysis [60], and flood hazard risk assessment [61]. Theoretical and empirical studies have demonstrated that RF can perform classification effectively and give quantitative, objective estimates of which variables are important in the classification. The quantitative estimate of variable importance is consistent with the idea of an index weight, implying that the OW could, in theory, be computed from the variable importance predicted by RF. However, no study has focused on determining OW with RF in the field of landslide susceptibility assessment. Therefore, this study aims to apply this novel OW (i.e., the weight determined by RF) in the field of landslide susceptibility assessment.
The primary objectives of this study were to: (1) adopt the Beijiang River Basin, located in a humid region of Southern China, as a case study and construct a landslide susceptibility assessment model utilizing the WLC; and (2) demonstrate that RF can estimate an objective and suitable index weight at basin scale. The study was intended to provide a scientific reference for index weight calculation, landslide prediction, warning, and management, as well as for soil and water conservation planning in the studied basin.

Methodology
Taking the Beijiang River Basin as a case study, we first selected 11 indices closely related to landslides and identified 181 rainfall-induced landslide spots. We then divided these spots into a training dataset and a validation dataset. The RF algorithm was executed to compute the index weights, and the results were checked by five-fold cross validation. Afterwards, the landslide susceptibility was assessed by combining the RF weight with the weighted linear combination method. For comparison, the entropy weight (EW) and the weight determined by the analytic hierarchy process (AHP) were also used to further demonstrate the rationality of the RF weight.

Weighted Linear Combination
Weighted linear combination (WLC), the best known and most commonly used multi-criteria GIS method [40], was applied to calculate landslide susceptibility in this study. The WLC method is a simple but effective method in which the susceptibility indices affecting a landslide are combined by applying weights [62]. Assuming there are m indices and weights in the assessment system, the calculation formula of WLC is as follows:

$$y = \sum_{j=1}^{m} w_j x_j \tag{1}$$

where $y$ is the comprehensive landslide susceptibility value; $w_j$ is the weight of the jth index, with a range of 0 to 1 and meeting the condition $\sum_{j=1}^{m} w_j = 1$ (j = 1, 2, ..., m); and $x_j$ is the normalized value of the jth susceptibility index, which may be calculated by the following formulas:

$$x_j = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad \text{or} \quad x_j = \frac{x_{\max} - x}{x_{\max} - x_{\min}} \tag{2}$$

where $x$ is the raw value of the susceptibility index, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values, respectively. The former formula applies to positive indices, for which a larger value means a greater occurrence probability of a landslide. The latter formula applies to negative indices, for which a larger value means a smaller occurrence probability of a landslide.

Random Forest Weight

Suitable weights greatly improve the accuracy and quality of landslide susceptibility assessment. This study utilized a random forest (RF) to determine the index weight. An RF is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k), k = 1, \ldots\}$, where $\{\Theta_k\}$ are independent, identically distributed random vectors, and each decision tree (DT) casts a unit vote for the most popular class at input $x$ [55]. Multiple samples are drawn in an RF using the bootstrap resampling method, and a classification and regression tree (CART) is built for each bootstrap sample (Figure 1).
Classification and regression trees (CARTs) (Figure 2) consist of a root node ($t_1$), internal nodes ($t_i$, i = 1, 2, 3 and 4) and leaf nodes ($N_T$). The root node is split into two internal nodes according to a certain split standard when the tree begins to grow. Each internal node then acts as a root node and is split again, and the splitting process repeats until the terminal leaf nodes are generated. If there are M input variables (i.e., susceptibility indices in this study), a number m << M is specified so that, at each node, m variables are selected at random out of M, and the best split on these m variables is applied to split the node. The value of m remains constant during the forest's growth. The minimum Gini value is the split standard of the node, with the corresponding variable considered the optimal variable. The Gini value is calculated as follows:

$$Gini(t) = 1 - \sum_{j} \left[ p(j|t) \right]^2 \tag{3}$$

where $p(j|t)$ is the probability of class j at node t. Each time a node is split on variable i, the Gini impurity criterion for the two descendent nodes is less than that of the parent node, which provides the Mean Gini Decrease (MGD) after each split. Combining the MGD for each individual variable over all the trees in the forest rapidly provides an importance parameter named Gini importance, which is typically consistent with the permutation importance measure [60,61]. Thus, this study proposes the random forest weight (RFW) as:

$$w_i = \frac{D_i}{\sum_{k} D_k} \tag{4}$$

where $w_i$ and $D_i$ are the weight and MGD value of the ith variable, respectively, and the sum runs over all variables. The RFW equation is therefore based on MGD without involving subjective factors. It measures the importance of the variables and can provide reasonable weights for the WLC.
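The Gini value of Equation (3) and the normalization of Equation (4) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the node class counts and the MGD values below are hypothetical and are not taken from the study, and the size-weighted impurity decrease used in `gini_decrease` is the usual CART convention, assumed here rather than stated in the text.

```python
def gini_impurity(class_counts):
    """Gini value of a node, Equation (3): 1 - sum_j p(j|t)^2."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def gini_decrease(parent, left, right):
    """Impurity decrease of one split (children weighted by their sample share)."""
    n = sum(parent)
    return (gini_impurity(parent)
            - sum(left) / n * gini_impurity(left)
            - sum(right) / n * gini_impurity(right))


def rf_weights(mgd):
    """Equation (4): divide each MGD value by the sum over all variables."""
    total = sum(mgd.values())
    return {name: d / total for name, d in mgd.items()}


# Hypothetical node: 120 landslide and 120 non-landslide samples,
# split into two child nodes of 120 samples each.
parent, left, right = [120, 120], [90, 30], [30, 90]
decrease = gini_decrease(parent, left, right)

# Made-up MGD values for four indices (illustration only).
weights = rf_weights({"EL": 40.0, "SL": 20.0, "M1DP": 15.0, "DF": 10.0})
print(decrease, weights)
```

Because the weights are a simple normalization of the MGD values, they always sum to 1 and preserve the importance ranking produced by the forest.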

Entropy Weight
Entropy weight, a commonly applied OW method, was used for comparison with the RFW. The concept of entropy, a parameter measuring the degree of disorder or randomness, originates from thermodynamics, where it represents heat energy that cannot be used to generate work [63]. Shannon first applied entropy to information theory in 1948, where it became a measure of the degree of order of a system [64]. Entropy weight (EW) is based on information entropy theory and reflects the useful information content offered by each variable [44,47].
A judgement matrix Y with m evaluation objects and n variables is constructed for the calculation of EW as:

$$Y = (y_{ij})_{m \times n}$$

The influence of variable dimension and numerical range is eliminated when Y is normalized to a standard matrix $X = (x_{ij})_{m \times n}$ (i = 1, 2, ..., n; j = 1, 2, ..., m) by Equation (2). According to information theory, the entropy value $H_i$ of the ith variable is calculated as:

$$H_i = -\frac{1}{\ln m} \sum_{j=1}^{m} f_{ij} \ln f_{ij}$$

where $f_{ij}$ is the share of the jth evaluation object in the ith variable, computed from the normalized values $x_{ij}$, and $0 \le H_i \le 1$. The EW may then be computed as:

$$w_i = \frac{1 - H_i}{n - \sum_{i=1}^{n} H_i}$$

where the EW meets the condition $\sum_{i=1}^{n} w_i = 1$. A smaller entropy value corresponds to a larger EW, indicating that the variable is more crucial.
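The entropy weight calculation can be illustrated with a short Python sketch. It assumes the common form $f_{ij} = x_{ij} / \sum_j x_{ij}$ for the share term and the convention $0 \cdot \ln 0 = 0$; both are assumptions of this sketch, as are the function name and the example matrix, and the exact form used in the study may differ.

```python
import math


def entropy_weights(X):
    """Entropy weights for X[i][j]: normalized value of variable i at object j.

    Assumes f_ij = x_ij / sum_j x_ij and the convention 0 * ln(0) = 0.
    Rows must not sum to zero, and not every variable may be perfectly uniform.
    """
    m = len(X[0])                      # number of evaluation objects
    entropies = []
    for row in X:
        total = sum(row)
        h = 0.0
        for x in row:
            f = x / total
            if f > 0:                  # skip zero shares (0 * ln 0 = 0)
                h -= f * math.log(f)
        entropies.append(h / math.log(m))
    denom = len(X) - sum(entropies)    # n - sum_i H_i
    return [(1.0 - h) / denom for h in entropies]


# Hypothetical normalized data: two variables observed at three objects.
X = [
    [0.1, 0.2, 0.7],   # uneven values: low entropy, high weight
    [0.3, 0.3, 0.4],   # nearly uniform values: high entropy, low weight
]
w = entropy_weights(X)
print(w)
```

The example shows the behaviour described above: the variable whose values differ more strongly between objects has the smaller entropy and therefore receives the larger weight.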

Analytic Hierarchy Process
The weight determined by the analytic hierarchy process (AHP) was also used for comparison. AHP is regarded as an ideal SW method, featuring an efficient and flexible framework based on psychology and mathematics. Its multi-criteria decision-making technique provides a systematic approach for assessing and integrating the effects of various factors, involving several levels of dependent or independent qualitative and quantitative information [65,66].
By analyzing the relations among indices, this method builds a hierarchical organization, including goal, criterion and sub-criterion levels, to form a multi-level analysis model. The goal level is the objective of the problem, and the criterion level includes the factors that influence the decision on that objective. The sub-criterion level contains the indices subordinate to those of the criterion level. Judgment matrices are established, and a weight vector is determined from these matrices.
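How a weight vector can be derived from a judgment matrix is sketched below in Python. The text does not state which derivation was used, so the row geometric mean approximation here is an assumption of this sketch, and the 3 × 3 matrix is a hypothetical example rather than an expert judgment from the study.

```python
import math


def ahp_weights(judgment):
    """Approximate AHP weights by the row geometric mean method (an assumption)."""
    n = len(judgment)
    geo_means = [math.prod(row) ** (1.0 / n) for row in judgment]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical reciprocal judgment matrix: entry [a][b] states how much more
# important index a is than index b on a 1-9 scale.
A = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
ahp_w = ahp_weights(A)
print(ahp_w)
```

In practice a consistency check of the judgment matrix would normally accompany this step; it is omitted here to keep the sketch short.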

Study Area
The Beijiang River is the second largest river in the Pearl River system, located in Guangdong Province, China [67], and is approximately 582 km long with a drainage area of 46,649 km² (Figure 3). The Beijiang River Basin predominantly comprises two cities and sixteen counties and features a subtropical monsoon climate with a multi-year average precipitation of 1800 mm. The flood season of the basin lasts from April to September, and the dry season from October to March of the following year [68]. Approximately 70-80% of the annual rainfall is concentrated in the flood season, with the rainfall peak from May to July [69]. The main soil type of the basin is red soil, typical of the hilly areas of South China, which softens once infiltrated by rain. The geological structure of the basin includes 68 lithology types and a substantial number of small faults in the middle and upper reaches, composing a complicated and adverse geological environment with potential for geological instability [70]. The combination of high-intensity rainfall, poor soil type, complicated landform, and a complicated and adverse geological environment can lead to a high probability of landslide hazard in the basin. Examples in the area include a serious landslide in Qingxin County resulting from continuous heavy rain in March 2012, which caused 7 deaths and 1 injury; a rainstorm-triggered landslide in Nanxiong County in May 2013, which buried four people; and a landslide in Huaiji County caused by continuous heavy rain in May 2014, which killed 2 villagers and injured 3 children. These accidents suggest that the Beijiang River Basin faces great challenges in the prevention and reduction of landslides. Taken together, the studied area is considered a typical case for landslide susceptibility assessment.

Data and Pre-Processing
The index selection varies among study areas according to the specific characteristics of each location [71]. One index can have significant impacts on landslide susceptibility in a specific area but only a limited influence in another. First, we selected 11 indices representing the conditions of rainfall, topography, geology, and human activity; afterwards, we evaluated these indices with the RF method and discarded three indices (i.e., slope aspect, topographic wetness index and distance from stream) with a small share of Gini importance (less than 2%) for the purpose of convenient calculation and redundancy elimination [72,73]. The remaining eight indices are as follows:
Maximum one-day precipitation (M1DP): Intensive rainfall acted as the trigger factor for most landslide events in the study basin [74–76]. Short-duration precipitation exerts a greater influence on landslide formation and development in the studied area than average yearly or monthly rainfall [77,78]. The maximum one-day precipitation (M1DP) was finally selected from among the maximum 6 h, 12 h, one-day and three-day precipitation because most of the historical landslides in the study area occurred after a consecutive one-day rainstorm [79]. Precipitation data were provided by 48 rainfall observation stations scattered across the Beijiang River Basin and were accessed from the Hydrology Bureau of Guangdong Province (http://www.gdsw.gov.cn/wcm/gdsw/index.html). Kriging interpolation was then employed to generate the layer from the rainfall observation stations.
Elevation (EL, m): Most landslides occur in mountainous areas with a large drop, and elevation reflects the characteristics of such discontinuous terrain [80–82]. A digital elevation model (DEM, 30 m) was used to represent the elevation index. The DEM ranges from 48 to 1871 m, with an average elevation of 365.25 m in the study basin. Mountainous areas are typically located in the northern basin, whereas the southern basin features lower elevations. The DEM dataset was provided by the Geospatial Data Cloud site, Computer Network Information Center, Chinese Academy of Sciences (http://www.gscloud.cn).
Slope angle (SL, degree): Slope angle is frequently applied in landslide susceptibility studies as an index reflecting the degree of topographic change, as landslides are directly related to slope angle [78–82]. SL was generated from the DEM using the "Slope" tool of ArcGIS 9.3, where the slope in degrees, θ, satisfies tan θ = rise/run. Areas with steep slopes feature high occurrence probabilities for landslides. SL in the Beijiang River Basin ranges from 0° to 71.5°, with an average slope of 11.8°, and steep slopes are mainly located in the central basin.
Normalized difference vegetation index (NDVI): This index represents the condition of vegetation cover. A large NDVI value indicates that the area is covered by luxuriant vegetation, whose well-developed root system maintains and stabilizes soils. Areas with high vegetation cover are therefore generally safer than bare areas. The average NDVI value in the Beijiang River Basin is 0.50, suggesting that vegetation cover is at a moderate level. This index was calculated for each Landsat 5 TM image. The Landsat data for 2005, acquired from the USGS Global Visualization Viewer, were terrain-, radiometrically- and geographically-corrected, and formatted as 8-bit numbers (ranging from 0 to 255). NDVI is expressed as NDVI = (band 4 − band 3)/(band 4 + band 3), where band 4 and band 3 represent the near-infrared band and the red band, respectively, with a spatial resolution of 30 m × 30 m. The Raster Calculator tool in the Spatial Analyst toolbar was used to perform the calculation.
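The NDVI expression can be illustrated with a small Python sketch that mimics the per-pixel raster calculation. The function names and the 2 × 2 band values are hypothetical, and returning 0 for pixels where both bands are zero is an assumption of this sketch, not a rule stated in the text.

```python
def ndvi(nir, red):
    """NDVI of one pixel from near-infrared (band 4) and red (band 3) values."""
    denom = nir + red
    if denom == 0:
        return 0.0  # assumption: empty pixels are set to 0 instead of dividing by zero
    return (nir - red) / denom


def ndvi_grid(nir_band, red_band):
    """Apply the NDVI formula cell by cell to two equally sized grids."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]


# Hypothetical 8-bit band values (0-255) for a 2 x 2 window.
band4 = [[120, 200], [60, 0]]
band3 = [[40, 50], [60, 0]]
grid = ndvi_grid(band4, band3)
print(grid)
```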
Distance to fault (DF, m): Geological fault areas are highly susceptible to landslides because the strength of the surrounding rock decreases due to tectonic breaks [83]. DF is used in this study to reflect the degree of landslide susceptibility: the closer a location is to a fault, the more dangerous it is [78]. Fault data (1:250,000) were obtained from the National Geological Archives of China (http://www.ngac.org.cn).
Shear resistance capacity (SRC, MPa): Lithology is an important index for the susceptibility assessment [80]. Lithological variations often result in strength and permeability differences in rocks and soils, significantly affecting the occurrence of landslides. Thus, this research used SRC to quantify the lithology. A large SRC value indicates that the lithology can withstand a large collapsing force. A total of 68 lithology types exist in the study basin, and each type was assigned a comprehensive SRC value according to the Design code for engineered slopes in water resources and hydropower projects of China (SL 386-2007). The design code is a national normative criterion based on a significant number of tests and experiments in different areas of China; it therefore recommends an SRC value that directly corresponds to each lithology type. Lithology data (1:250,000) were obtained from the National Geological Archives of China (http://www.ngac.org.cn).
Available water storage capacity (AWC): Topsoil plays a key role in the formation of landslides [84]. The soil-type data used in this study include information related to AWC, an index reflecting the maximum amount of water that is held per unit of earth column. A classification value can be consulted directly from the Harmonized World Soil Database (2009) [85], which provides a standard relating classification value to AWC value (Table 1); thus, AWC was used to represent and quantify the soil type. A large AWC value indicates that the soil absorbs more water, and this absorption is likely to weaken and break the soil structure, increasing the probability of landslides. Seven AWC measurement values were assigned to the soil types according to the Harmonized World Soil Database (2009) (Table 1). Soil-type data (1 km × 1 km) were obtained from the Food and Agriculture Organization of the United Nations (http://www.fao.org/home/en/).
Runoff coefficient (RC): Land-cover types (LCT) that are often affected by human activities, including bare land, open forest land and rural residential areas, present high landslide potential [86,87]. A runoff coefficient (RC), measuring the share of rainfall that is converted into runoff [61], was applied to quantify the LCT. A large RC value indicates that more rainwater is converted into surface runoff and less water infiltrates underground, significantly reducing the probability of soil structure breakdown. Twenty-four land-cover types exist in the study basin and were assigned corresponding RC values (Table 2) according to the Code for Design of Building Water Supply and Drainage of China (GB 50015-2003) and the Code for Design of Outdoor Wastewater Engineering of China (GB 50014-2006). Like SL 386-2007, the two design codes are national normative criteria; thus, the recommended values could be applied directly. Land-cover type data for 2005 were provided by the Resources and Environment Science Data Center of the Chinese Academy of Sciences (http://www.resdc.cn/Default.aspx).

Landslide Susceptibility Assessment Model
Training and validation datasets must be created prior to employing RF. Historical landslide spots were used as the dataset because they accurately reflect the characteristics and spatial distribution of landslides. A historical landslide inventory (1995-2005) was available from comprehensive field surveys, including field evaluation, air photo/satellite image interpretation, the China Geological Environment Information Network landslide database (http://www.hbgec.org/), and news report records. Only rainfall-induced landslide spots that occurred after extreme rainfall were considered; spots caused by artificial actions, including slope excavation, mine excavation and reservoir construction, were not considered in this study. Altogether, 181 landslide spots (Figure 5) distributed over the basin were finally used for the dataset. The five-fold cross validation and the final validation accuracy of the susceptibility map are the two important criteria for dividing the sample into training and validation data. Among the 181 landslide spots, a random sample of two-thirds (120) was used to create a training dataset, with the remaining 61 employed as validation data for the final susceptibility map. The 120 spots were classified as the first category and marked with "1", while 120 non-landslide spots were classified as the second category and marked with "0". The non-landslide spots, equal in number to the landslide spots, were drawn randomly and uniformly from areas with intense human activities and no recorded landslides to complete the training dataset. Samples were then created by extracting the normalized values of the eight indices at the 240 spots using the "Sample" tool of ArcGIS 9.3. The 240 samples, each including eight normalized values (EL, SL, M1DP, DF, RC, NDVI, SRC and AWC) and a category value (0 or 1), constitute the complete training dataset.

The 240 samples were input into the RF package of the software R to train the model. The number of classification trees and the number of variables tried at each split were set to 2500 and 3, respectively, following multiple attempts. The effect of chance on the calculation was then reduced by five-fold cross validation, a common model-checking algorithm [59]. This validation technique checks whether the performance of the model is stable and reliable. After training, the Gini decrease value of each index can be obtained, and the RFW can be calculated by Equation (4).

The normalized grid layers and the weights were then input into Equation (1) using the raster calculator of GIS to calculate the landslide susceptibility value and generate a susceptibility map. The landslide susceptibility map was classified into five susceptibility levels (very high, high, moderate, low and very low) by the quantile method, in which each level contains an equal number of features. The flow chart of the assessment is shown in Figure 6.
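The map-generation step, normalization by Equation (2), combination by Equation (1) and quantile classification, can be sketched in plain Python as follows. All cell values and weights below are made up for illustration, only three of the eight indices are shown, and the rank-based `quantile_classes` function is a simple stand-in for the GIS quantile classification, not the tool used in the study.

```python
def normalize(x, x_min, x_max, positive=True):
    """Equation (2): min-max normalization for positive or negative indices."""
    span = x_max - x_min
    return (x - x_min) / span if positive else (x_max - x) / span


def wlc(values, weights):
    """Equation (1): weighted sum of the normalized index values of one cell."""
    return sum(weights[name] * values[name] for name in weights)


def quantile_classes(scores, n_classes=5):
    """Rank the cells and assign classes 1..n_classes with equal cell counts."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    classes = [0] * len(scores)
    for rank, i in enumerate(order):
        classes[i] = rank * n_classes // len(scores) + 1
    return classes


# Hypothetical raw values for ten grid cells and three indices.
raw = {
    "EL": [50, 120, 300, 450, 600, 800, 950, 1200, 1500, 1800],
    "SL": [2, 5, 8, 12, 15, 22, 30, 35, 48, 60],
    "DF": [9000, 7000, 8000, 5000, 4000, 3000, 2500, 1500, 800, 100],
}
positive = {"EL": True, "SL": True, "DF": False}  # DF: closer to a fault is worse
weights = {"EL": 0.5, "SL": 0.3, "DF": 0.2}       # made-up weights summing to 1

scores = []
for cell in range(10):
    values = {
        name: normalize(col[cell], min(col), max(col), positive[name])
        for name, col in raw.items()
    }
    scores.append(wlc(values, weights))

levels = quantile_classes(scores)  # 1 = very low ... 5 = very high
print(scores, levels)
```

Because the classification is based on ranks, each of the five levels receives the same number of cells regardless of how the susceptibility values are distributed.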

Five-Fold Cross Validation
The 240 samples, the training data for the five-fold cross validation in this case, were randomly divided into five sub-samples. A single sub-sample was retained as the model validation data, whereas the other four sub-samples were used to train the model. Each sub-sample was used for validation exactly once during the process, yielding five sets of results [61].
Table 3 shows that the error rates of training and testing range over 14.06-20.83% and 10.42-20.83%, respectively. The average error rates are 18.12% and 15.83%, respectively, corresponding to average accuracies of 81.88% and 84.17%. Overall, the verification accuracies of both training and testing are stable and reliable, suggesting that the model can be considered rational and credible [88] and that the weight calculated by RF can be used in the next step.
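The splitting scheme can be sketched as follows. This is a minimal sketch that assumes a simple shuffled partition; the function names and the seed are hypothetical, and the classifier itself is omitted.

```python
import random

def five_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    folds = five_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted category differs from the true one."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

splits = list(cv_splits(240))
print([(len(train), len(test)) for train, test in splits])
```

With 240 samples, each round trains on 192 samples and tests on 48, so a test error of 10.42%, for example, would correspond to 5 misclassified samples out of 48.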

Random Forest Weight Analysis
Five sets of Gini decrease values were obtained from the five-fold cross validation, and an average RFW was calculated. Table 4 shows that EL, SL, M1DP and DF are the four most important of the eight indices, accounting for 73.24% of the weight, which suggests that these indices contribute most to total landslide susceptibility. EL is the most important index, comprising approximately 35.00% of the total. Because the Beijiang River Basin is naturally characterized by hilly terrain, high elevation indicates a mountainous location, which significantly increases the probability of a landslide event. Figure 5 shows that most landslide spots are located in the mountainous regions of the central and northern basin, supporting the high importance that RF assigns to EL. SL is similar to EL and is ranked second by RF. A large drop provides significant potential energy for earth-body sliding. Figures 4c and 5 also show that most landslide spots are located in areas with a large SL value (substantial drop), confirming that SL plays a vital role in landslide susceptibility. M1DP is ranked third, with a weight of 12.09%. Figure 4a shows that M1DP in the southern basin, especially in the southeast, is greater than in the north; this notable spatial variation is the primary explanation for the third-place ranking. Certain landslide spots are in locations with relatively slight rainfall, yet the rainfall amount may still be sufficient (the minimum value reaches 85 mm) to trigger a landslide. Many faults exist in the central basin, where most landslide spots are concentrated; thus, the RF model ranks DF fourth. The indices RC, NDVI, SRC and AWC are less consequential, together accounting for only 26.76% of the total weight.
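The averaging and normalization step might look like the sketch below. Equation (4) is not reproduced in this section, so the form used here (each index's mean Gini decrease divided by the sum over all indices) is an assumption, as is the order of operations (average first, then normalize). The Gini values and the two-fold, three-index setup are made up purely for illustration.

```python
def average_gini(fold_gini):
    """Average the Gini decrease of each index over all cross-validation folds."""
    names = fold_gini[0].keys()
    return {n: sum(f[n] for f in fold_gini) / len(fold_gini) for n in names}

def normalize_to_weights(gini):
    """Scale Gini decrease values so that the resulting weights sum to 1."""
    total = sum(gini.values())
    return {n: g / total for n, g in gini.items()}

# Hypothetical Gini decrease values for two folds and three indices.
folds = [
    {"EL": 30.0, "SL": 20.0, "M1DP": 10.0},
    {"EL": 34.0, "SL": 16.0, "M1DP": 10.0},
]

rf_weights = normalize_to_weights(average_gini(folds))
print(rf_weights)
```

Normalizing to a unit sum keeps the weights directly usable in a weighted linear combination of normalized indices.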

Spatial Distribution of Landslide Susceptibility
Finally, the landslide susceptibility map based on the RF weight was generated. Figure 7 shows that the high- and very-high-susceptibility areas are principally located in the mountainous terrain of the central and northern basin; the low- and very-low-susceptibility areas are distributed in the flat areas of the southern and northeastern basin; and the moderate-susceptibility zones are typically located in transition areas between the high- and low-susceptibility zones. The zone proportions of each class, from very low to very high, are 19.75%, 20.19%, 20.62%, 20.22% and 19.22%, respectively. Dangerous zones, comprising the high- and very-high-susceptibility zones, occupy approximately 39.44%.
Sixty-one historical landslide spots, approximately one-third of the 181 landslide spots, were used to validate the reliability of the assessment results. Table 5 shows that 46 historical landslide spots (75.41%) lie in the dangerous zones, 9 spots (14.75%) in the moderate-susceptibility areas and only 6 spots (9.84%) in the low- and very-low-susceptibility zones. Fifteen spots (24.59%) thus fall in the non-dangerous zones (the moderate-, low- and very-low-susceptibility zones). Errors in the data, both in the historical landslide spot data and in the index data, offer a potential explanation. Some flaws may exist in the dataset of historical landslide spots.

To further verify the rationality of the susceptibility map based on RFW, we also collected 16 other landslide samples that occurred after May 2005. These samples come from news reports and were verified against open remote sensing images (Sentinel-2 (ESA) and Baidu Map). As shown in Table 6, 12 landslides (75%) are located in the dangerous zones and only 4 in the non-dangerous zones, which indicates that the map retains high applicability even though the indices are mainly based on 2005 data (e.g., NDVI and RC).

Discussion
To further verify the rationality of RFW, another objective weight (OW), i.e., the entropy weight (EW), and a subjective weight (SW) determined by AHP were applied as comparisons. Using the normalized values of the eight indices, the 120 spots classified as the first category and marked with "1" were used to calculate the EW, while the 120 non-landslide spots classified as the second category and marked with "0" were left out of the calculation because the entropy method cannot differentiate landslide from non-landslide spot data when they are mixed together. Table 4 shows that the EW considers AWC and SL the most and least critical indices, with values of 0.3315 and 0.0346, respectively. For the AHP weight, a two-level analysis model with one criterion level (eight indices) and one goal level (weight) was constructed, and ten experienced experts were invited to score. The average weight of the ten experts was taken as the final weight (Table 4), with SL (0.2344) and SRC (0.0541) as the most and least important indices, respectively. Overall, the weights based on the three methods vary considerably.
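For readers unfamiliar with the entropy weight, the sketch below follows the commonly used formulation: column proportions, an entropy normalized by ln(n), and weights proportional to one minus the entropy. Whether the study used exactly this variant is not stated in this section, so treat it as an assumption. The input rows are hypothetical, and non-negative normalized values with at least one varying column are assumed.

```python
import math

def entropy_weights(rows):
    """Entropy weights for the columns of a samples-by-indices table."""
    n = len(rows)
    m = len(rows[0])
    k = 1.0 / math.log(n)  # scales the entropy of each column into [0, 1]
    divergences = []
    for j in range(m):
        col = [row[j] for row in rows]
        total = sum(col)
        entropy = 0.0
        for x in col:
            if x > 0:  # the term p * ln(p) is taken as 0 when p = 0
                p = x / total
                entropy -= k * p * math.log(p)
        divergences.append(1.0 - entropy)
    s = sum(divergences)
    return [d / s for d in divergences]

# Hypothetical normalized values: the first index is constant, the second varies.
rows = [[0.5, 0.1], [0.5, 0.9], [0.5, 0.5]]
w = entropy_weights(rows)
print(w)
```

An index whose values barely vary across the landslide spots receives almost no weight under this scheme, whatever its physical relevance. This is one way an entropy-based ranking can diverge from a ranking learned from both landslide and non-landslide samples.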
The landslide susceptibility maps based on EW and the AHP weight were generated and classified into five categories by the quantile method. Except for the very-high-susceptibility areas, the spatial distributions of the other four categories differ visibly among the three maps (Figures 7 and 8). Approximately 53.92% of the area has the same susceptibility level under the AHP weight and RFW, but only 29.97% under EW and RFW. The 61 historical landslide spots were also employed to validate the susceptibility maps of the two weights. Forty-four (AHP, 72.13%) and 36 (EW, 59.02%) historical landslide spots, respectively, are located in the dangerous zones, fewer than with RFW. Different index weights thus produce large differences among the three landslide susceptibility maps.
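The two comparison measures used here reduce to simple ratios, as the short sketch below shows. The spot counts are those reported in the text; the function names and the four-cell example maps are hypothetical.

```python
def hit_rate(n_in_dangerous, n_total):
    """Percentage of validation spots falling in high or very-high zones."""
    return 100.0 * n_in_dangerous / n_total

def map_agreement(classes_a, classes_b):
    """Percentage of cells assigned the same susceptibility class by two maps."""
    same = sum(1 for a, b in zip(classes_a, classes_b) if a == b)
    return 100.0 * same / len(classes_a)

# Validation spots located in the dangerous zones, out of 61, as reported in the text.
hits = {"RFW": 46, "AHP": 44, "EW": 36}
for name, n in hits.items():
    print(name, round(hit_rate(n, 61), 2))  # RFW 75.41, AHP 72.13, EW 59.02

# Toy example of cell-by-cell agreement between two classified maps.
print(map_agreement([1, 2, 3, 4], [1, 2, 0, 4]))  # 75.0
```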
The EW is an objective weight, and its disadvantages are obvious: it reflects only the data pattern of the landslide spots and fails to reflect the information of the non-landslide spots. As a result, the EW has difficulty capturing the internal relationship between landslide and non-landslide spots and thus performs relatively poorly. The AHP performs well in this case, with 44 verification spots (72.13%) located in the dangerous zones, implying that the experts consulted were well experienced and grasped the key points. However, this weight is experience-dependent and strongly determined by the decision maker's intentions: the more experience and information the experts have, the more reasonable the resulting weight. Conversely, an unreasonable weight may be obtained if the experts' experience and level of understanding do not meet the requirements.
The random forest method is an efficient and straightforward classifier, applied in this study to calculate index weights and thereby provide a new reference for weighting methods. Unlike the common objective and subjective weights, the RF method can deal with multi-category samples, identify the internal relationship between landslide and non-landslide spots, and provide the contribution rate of each index to susceptibility. This makes it possible to determine a comprehensive weight that reflects the real contribution of each index. Additionally, the RF method can be conveniently implemented in the open source software R. Unlike other machine learning algorithms, only two key parameters need to be tuned during the assessment: the number of classification trees and the number of variables tried at each split [61]. Most importantly, the application effect is satisfactory because of the high validation precision. This method is a novel approach for landslide susceptibility assessment; however, certain issues remain. First, the weight calculation requires many historical landslide sites; data limitations restricted the spots to only 181, and increasing the number of spots would likely improve the accuracy of the results. Second, some of the indices may not provide sufficiently accurate data; for example, the map scale of index DF is only 1:250,000, and the accuracy may be improved by using a more detailed DF. Third, although the quantile method was used to classify the map into five susceptibility levels, whether a better method could reduce the error rate (24.59% in this study) is still worthy of research. Finally, we evaluated only a basin with a drainage area of 46,649 km²; whether the method would be more effective for study areas of larger or smaller spatial scale requires further discussion.
The application of RF for weight determination demonstrates significant potential in this study, despite a few drawbacks. RFW is therefore recommended for use in landslide susceptibility assessment and in other fields dealing with hazard assessment.

Conclusions
Landslide susceptibility assessment is an appropriate approach for predicting and analyzing the spatial distribution of susceptibility. Determining a suitable index weight is a key step for the accuracy of the assessment results; thus, a new weight-determining method based on random forest (RF) was proposed in this study. Eight indices were used to construct the susceptibility index system, with the Beijiang River Basin as a case study. In total, 240 training samples, comprising 120 landslide spots and 120 non-landslide spots, were used to calculate the weight based on RF. Landslide susceptibility was calculated by the weighted linear combination (WLC) method, employing the RF weight as the index weight. EW and the weight determined by AHP were also applied as comparisons to demonstrate the reasonability and feasibility of the RFW. Results indicate that:
(1) The average training and testing error rates of the 240 samples are 18.12% and 15.83%, respectively, suggesting that the RF model can be considered rational and credible.
(2) The RF model ranks EL, SL, M1DP and DF as the four most critical of the eight indices, accounting for 73.24% of the total weight, while the indices RC, NDVI, SRC and AWC are less consequential, with an index importance degree of only 26.76% of the total.
(3) The landslide susceptibility map based on RFW differs markedly from the maps based on EW and the AHP weight; 46 of the 61 validation spots are located in dangerous areas based on the RF weight, an accuracy rate of 75.41%, whereas only 59.02% and 72.13% of the spots are in the dangerous areas based on the other two weights (EW and AHP), respectively.
Sixteen other landslide samples that occurred after May 2005 further verify the rationality of the susceptibility map based on RFW. The proposed weighting method can be conveniently implemented with few parameters while producing a satisfactory practical application. Applying the RF-based weight to landslide susceptibility assessment provides a scientific reference for weight definition and reveals significant potential.

Figure 2. Structure chart of classification and regression trees.

Water 2018, 10, x FOR PEER REVIEW 6 of 20

and complicated and adverse geological environment in the basin can lead to a high probability of landslide hazard. Examples of occurrences in the area include a serious landslide in Qingxin County resulting from continuous heavy rain in March 2012, which caused 7 deaths and 1 injury; a rainstorm-triggered landslide that buried four people in Nanxiong County in May 2013; and a landslide in Huaiji County caused by continuous heavy rain that killed 2 villagers and injured 3 children in May 2014. These accidents suggest that the Beijiang River Basin faces great challenges in the prevention and reduction of landslides. Taken together, the studied area is considered a typical case for landslide susceptibility assessment.

Figure 3. Location and topographical condition map of the study area.

Figure 4 presents the spatial distribution characteristics of the indices. All indices were converted into grid format with a cell size of 30 m × 30 m using the GIS technique, and the Beijiang River Basin consists of approximately 52 million grid cells. Data-processing tools included the open source software R, ArcGIS 9.3 and MS Excel.

Figure 4. Spatial distribution characteristics of the eight susceptibility indices. Note: M1DP, maximum one-day precipitation; EL, elevation; SL, slope angle; NDVI, normalized difference vegetation index; DF, distance to fault; SRC, shear resistance capacity; AWC, available water storage capacity; RC, runoff coefficient.

Figure 5. Location of landslide spots and training samples.

Figure 6. The flow chart of the landslide susceptibility assessment in the Beijiang River Basin.

Figure 8. Landslide susceptibility maps based on entropy weight and weight determined by AHP.

Table 1. Available water capacity (AWC) value and classification.

Table 2. Land-cover type and the corresponding runoff coefficient (RC).

Table 4. Index weights determined by RF, entropy and analytic hierarchy process.

Table 6. Verification samples that occurred in the Beijiang River Basin after May 2005.