AI-Based Susceptibility Analysis of Shallow Landslides Induced by Heavy Rainfall in Tianshui, China

Groups of landslides induced by heavy rainfall are widely distributed globally, and they usually result in major loss of life and economic damage. However, compared with landslides induced by earthquakes, inventories of landslides induced by heavy rainfall are much less common. In this study we used high-precision remote sensing images taken before and after continuous heavy rainfall in southern Tianshui, China, from 20 June to 25 July 2013, to produce an inventory of 14,397 shallow landslides. Based on the landslide inventory, we used machine learning and a geographic information system (GIS) to map landslide susceptibility in this area and evaluated the relative weight of the various factors affecting landslide development. First, 18 variables related to geomorphic conditions, slope material, geological conditions, and human activities were selected through collinearity analysis; second, 21 selected machine learning models were trained and optimized in the Python environment to evaluate landslide susceptibility. The results showed that the ExtraTrees model was the most effective for landslide susceptibility assessment, with an accuracy of 0.91. This predictive ability means that our landslide susceptibility results can be used in the implementation of landslide prevention and mitigation measures in the region. Analysis of the importance of the factors showed that the contribution of slope aspect (SA) was significantly higher than that of the other factors, followed by planar curvature (PLC), distance to river (DR), distance to fault (DTF), normalized difference vegetation index (NDVI), distance to road (DTR), and other factors. We conclude that factors related to geomorphic conditions are principally responsible for controlling landslide susceptibility in the study area.


Introduction
Extreme rainfall events and earthquakes are the two main factors inducing regional landslides [1][2][3][4][5][6]. Compared with the spatial distribution of earthquakes, which are concentrated in plate margins and intracontinental orogenic belts [1,7], landslides induced by heavy rainfall are more widely distributed on a global basis [8][9][10][11][12]. Because landslides induced by heavy rainfall are characterized by wide distribution, high density, and long travel distance, such events often cause many casualties, major property losses, and ecological damage [13,14]. Landslide inventories are an essential basis for studying the formation, distribution, landscape evolution, susceptibility, and risk assessment of regional landslides [5,15,16,17]. An event inventory shows landslides caused by a single trigger, such as an earthquake, rainfall event, or snowmelt event [18]. Compared with the widespread attention given to landslide disasters caused by earthquakes, inventories of landslides induced by heavy rainfall are far fewer in number [19]. Nevertheless, in recent years, the rapid development of space radar, satellite remote sensing, small unmanned aerial vehicles (UAVs), and other technologies has provided high-precision images for the interpretation of landslides induced by heavy rainfall events, which greatly facilitates the production of the corresponding landslide inventories [16,20,21,22].
Landslides are one of the most important natural hazards in China as they are widespread and cause substantial damage and fatalities every year [23]. Consequently, the development of methods to reduce the threat of landslides has long been an important component of landslide research in China and elsewhere. As a method of assessing areas with high landslide susceptibility, landslide susceptibility analysis is important for disaster prevention and mitigation [24][25][26][27][28]. Regional evaluations of landslide susceptibility based on physical models and GIS technology are widely used, and they are important for assessing the future risks of landslides and debris flows, and they can make a major contribution to disaster prevention and control planning [24,29,30]. However, these two traditional methods are limited by problems of efficiency and cost and by their limited ability to obtain useful information from complex datasets, together with their high dependence on human subjectivity [31,32]. In recent years, with the rapid development of Artificial Intelligence (AI) technology, machine learning provides possibilities for improving the accuracy and efficiency of geological hazard susceptibility evaluation [31,33,34]. Machine learning has achieved outstanding results in landslide and debris flow hazard analysis in several regions [35][36][37]; however, as a new technology, the effect of machine learning in different environments needs to be further examined.
From 20 June to 25 July 2013, continuous heavy rainfall induced a large number of shallow landslides and debris flows in southern Tianshui, China. They resulted in 24 deaths and one missing person; in addition, 2386 houses collapsed and 6666 were damaged. The direct economic losses were USD 1.24 billion, in addition to the losses of life and property and the trauma caused to the local inhabitants [38]. The study had three main components: (i) high-precision image data from before and after the rainfall event were compared to interpret and catalog the landslides induced by heavy rainfall; (ii) the performance of machine learning in the susceptibility evaluation of shallow landslides induced by heavy rainfall in an area of high vegetation cover was examined; and (iii) the relative contributions of the various factors affecting landslide formation in an area of high vegetation coverage were evaluated.

Study Area
The study area is located in southern Tianshui, China (Figure 1). The geological structure of the region is complex, influenced mainly by the Qinling Mountains latitudinal structural belt, the Qilv-Holland Arc structural belt, the West Qinling Mountains northeast structural belt, and the Longxi spiral structural belt. The main lithological unit in the area is the Devonian Shujiaba Formation, mainly composed of marl and slate with thin layers of limestone and metasandstone. Yanshanian biotite granite and medium-coarse grained granite are exposed in several areas in the north and east. Neogene strata are dominant in the west and they have an unconformable relationship with the other strata. The lithology is mainly gray-white and gray-green clay and red mudstone with a sandy conglomerate. The thickness of the formation exceeds 1000 m. In addition, Carboniferous glutenite and Late Devonian slate and sandstone are sporadically exposed in the northern and southern parts of the study area.
The geomorphology of the study area is dominated by intermediate- and low-elevation mountains, with altitudes ranging between 1239 m and 2249 m. Because it is located on the southern edge of the Chinese Loess Plateau, the area has a cover of Quaternary loess, forming a dual-stratum structure of bedrock and overlying loess. The development of a large pore space and vertical joints in the loess makes it highly permeable, whereas the bedrock has a low permeability; therefore, the excess pore water pressure caused by heavy rainfall, combined with the seepage force at the stratum interface, is likely the main reason for the extensive occurrence of shallow landslides in the area [39][40][41].
The study area is located in the transitional region between a semi-humid and a semiarid climate. The climate type is a warm temperate continental climate. The annual average temperature is ~6-11 °C, with the highest temperatures generally occurring in July, and the relative humidity is 66%. The average annual rainfall is 800-900 mm, and the seasonal distribution of rainfall is very uneven; most of the rainfall occurs between July and September, which accounts for 68% of the annual total. The vegetation coverage in the area is high (generally >70%) and species-rich.

Landslide Inventory and Mapping
Guzzetti et al. [18] comprehensively summarized landslide inventory methods and divided them into two categories: traditional methods based mainly on geomorphological field mapping and visual interpretation of aerial photos; and new methods based on high-precision satellite image interpretation, analysis of surface morphology using airborne LiDAR (Light Detection and Ranging), and the automated and semi-automated recognition of landslides. The former are expensive in terms of time and cost, and for these reasons traditional methods are gradually being superseded by high-precision optical image interpretation and by semi-automated and automated recognition. However, because automated and semi-automated methods are limited in recognition accuracy, they cannot provide a truly comprehensive landslide inventory [42][43][44]. Therefore, high-precision image interpretation has become the most commonly used method for landslide inventory development in the case of recent earthquakes, rainfall, and other events [5,45]. In areas with high vegetation coverage, landslide scars induced by heavy rainfall are clearly resolved in optical images. Therefore, in this study, we downloaded Google Earth images with a resolution of 2 m × 2 m from before and after the interval of continuous rainfall and compared them (Figure 2) in order to identify rainfall-induced landslides in detail. Landslide interpretation, inventorying, and mapping were conducted using ArcGIS 10.2 software (Esri, Redlands, CA, USA).

Landslide Susceptibility Evaluation Based on Machine Learning
The process of modeling with machine learning includes the selection and preparation of parameters, data acquisition and processing, and model selection, fitting, and evaluation. A flow chart of the process is shown in Figure 3. The selection of a suitable terrain mapping unit is the basis of landslide susceptibility analysis [25]. At present, the grid cell is still the most commonly used terrain unit in the literature [25,26]. In this study, to balance the information captured per grid cell against data volume and calculation efficiency, a 100 m × 100 m grid was selected as the evaluation unit for landslide susceptibility analysis. A total of 65,472 grid cells were defined in the study area, of which 13,859 corresponded to landslides. Geomorphic factors were extracted from 12.5 m × 12.5 m Digital Elevation Model (DEM) data from the ALOS satellite (Figure 4A), and lithology and faults were derived from 1:50,000 geological mapping data (Figure 4B; source: China Geological Survey).

Selection of Factors Influencing Landslides
As a complex process of material transport and energy transfer on the earth's surface, the formation and distribution of landslides are determined by the effects of climate, hydrology, geology, landforms, human activity, and other factors [46]. To a large extent, the formation and distribution of landslides are related to specific local factors; for example, in orogenic belts, slope, lithology, and structure are important factors [47,48], while in mountains with low and intermediate altitudes, rainfall, soil properties, and engineering activity are important [49,50]. Therefore, there is no consensus regarding which factors should be used in landslide susceptibility evaluation. In this study, based on an evaluation of previous studies [24][25][26][27][28] and with regard to the specific environment of the Tianshui area, we selected geomorphic factors, landslide material and geological conditions, and human activity as the three categories of factors for landslide susceptibility analysis (Table 1). Continuous heavy rainfall was the primary cause of the groups of shallow landslides. Shallow landslides induced by heavy rainfall are widely distributed all over the world, and the critical rainfall that induces them is an important factor in the study of landslide triggering thresholds. The records of six rainfall stations in the study area show that the accumulated rainfall from 20 June to 25 July 2013 was more than 230 mm, which reaches the threshold for shallow landslides reported in many studies [51,52]. In other words, during the event, the rainfall met the critical threshold over the whole area. In addition, because the study area is small, the error and resolution of the existing rainfall data cannot meet the requirements for a factor in machine learning susceptibility evaluation. Therefore, we assigned a uniform value to the rainfall conditions in this evaluation and focused on the influence of geomorphic conditions, geological structure, and material composition on the susceptibility to shallow landslides. The specific parameters adopted and the rationale for their use are described below.

Parameters Related to Geomorphological Conditions
Average slope (AS) (Figure 4C). Slope is one of the most important factors influencing stability. Different slope angles can affect the magnitude of normal stress and shear stress on the potential failure surface.
Slope aspect (SA) (Figure 4D). Slope aspect strongly affects hydrological processes via evapotranspiration and weathering processes in a given microclimatic environment [53].
Local relief (LR) ( Figure 4E). The potential energy of the slope is determined by its relief. Statistical analysis of landslides shows that topographic relief is an important factor affecting the spatial distribution of a landslide [22,54].
Profile curvature (PRC) and planar curvature (PLC). Profile and planar curvature are important parameters reflecting the morphological characteristics of slopes. Curvature is defined as the rate of change of slope gradient or aspect, usually in a specific direction [55].
Slope unit area (SUA) ( Figure 4F). The slope unit is the fundamental spatial domain used in quantitative geomorphological analyses. A slope unit can be used for terrain zonation, using methods such as sensitivity modeling and hydrological and erosion modeling; and those based on the geographical environment, including ecology, agriculture, forestry, land use, and other aspects [56]. The slope unit used in this study was extracted by the ArcSWAT module of ArcGIS software. Specifically, it was obtained from the ridgeline and river network extracted under the condition of a 100-hectare flow accumulation.
Elevation (E). Temperature, rainfall, vegetation type, and microorganisms are dependent on elevation. These factors can affect soil layer thickness: the lower the altitude, the thicker the soil layer, while high mountain areas are mainly bare hard rock. In some cases, precipitation and the incidence of landslides increase with increasing altitude [57].
Topographic wetness index (TWI) ( Figure 4G). TWI reflects the distribution of soil moisture, and soil moisture content in turn strongly affects the cohesion and internal friction angle of slope materials.
Watershed area (WA) ( Figure 4H). Záruba and Mencl [58] observed a relationship between the occurrence of landslides and watershed area. The larger the watershed area, the greater the amount of water seeping into the ground, which increases slope instability [59,60].
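As an illustration of how one of the geomorphic parameters above is commonly derived, the following minimal sketch computes TWI from the widely used definition TWI = ln(a / tan β), where a is the specific catchment area and β is the local slope. This definition is taken from the general literature and is not stated in this paper; the function name, the 12.5 m default cell size (chosen to match the DEM resolution), and the minimum-slope floor are assumptions made for illustration only.

```python
import numpy as np

def topographic_wetness_index(flow_acc_cells, slope_deg, cell_size=12.5, min_slope_deg=0.1):
    """Return TWI = ln(a / tan(beta)) for each cell (illustrative sketch).

    flow_acc_cells: number of upslope cells draining into each cell
    slope_deg:      local slope in degrees
    """
    flow_acc_cells = np.asarray(flow_acc_cells, dtype=float)
    slope_deg = np.asarray(slope_deg, dtype=float)
    # Specific catchment area: contributing area divided by the contour width
    # (one cell width), counting the cell itself.
    a = (flow_acc_cells + 1.0) * cell_size
    # Floor the slope so that flat cells do not cause division by zero.
    beta = np.radians(np.maximum(slope_deg, min_slope_deg))
    return np.log(a / np.tan(beta))

# Three hypothetical cells: a steep ridge cell, a mid-slope cell, a gentle valley cell.
twi = topographic_wetness_index([0, 10, 500], [30.0, 15.0, 2.0])
print(twi)
```

Cells with a large contributing area and a gentle slope receive the highest values, which is why TWI is often used as a proxy for soil moisture.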

Parameters Related to Landslide Materials and Geological Conditions
Normalized Difference Vegetation Index (NDVI) ( Figure 4I). The NDVI effectively reflects vegetation coverage, which has important effects on slope stability by reducing the rainfall infiltration rate. The vertical and horizontal growth of plant roots also increases slope stability [61]. NDVI was derived from Landsat-8 images (June 2016) with a resolution of 30-m (Landsat-8 image courtesy of the US Geological Survey).
Formation lithological index (FLI) ( Figure 4B). Lithology affects the spatial distribution of landslides. The structural characteristics of the bedrock promote landslide initiation in several ways: (1) by producing weak surfaces that are prone to sliding; (2) by facilitating the introduction of groundwater into the overlying soil mantle; and (3) by destabilizing the regolith because of weathering [46].
Distance to fault (DTF). The two principal effects of faults on landslides are (1) a fault plane can act as the dominant structural plane in the formation of a sliding surface, and (2) rock mass damage caused by fault activity may lead to slope instability.
Soil type (ST). Like lithology, soil is also the material basis of landslide formation. There are substantial differences in soil microstructure, water permeability, and vegetation growth between soil types.
Contents of sand (SC), gravel (SG), silt (SIC), and clay (CC). The grain-size composition of the soil determines the cohesion, shear strength, and hydraulic conductivity of the slope, and thus its stability.
Distance to river (DR). In many areas, landslides are clustered along rivers, and landslide density decreases with increasing distance from rivers [17,62]. Fluvial incision provides the potential energy for the development of a landslide, while the lateral erosion of a river can destroy the slope toe, causing slope instability.
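For reference, the NDVI listed among the parameters above is conventionally computed as (NIR − Red) / (NIR + Red); for Landsat-8 OLI imagery the near-infrared and red bands are band 5 and band 4, respectively. The sketch below shows this calculation under stated assumptions: the function name and the sample reflectance values are hypothetical, and the paper does not describe its own processing chain.

```python
import numpy as np

def ndvi(nir, red):
    """Compute NDVI = (NIR - Red) / (NIR + Red); cells with NIR + Red == 0 become NaN."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    # Divide only where the denominator is non-zero; other cells stay NaN.
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

# Hypothetical reflectances: dense vegetation, sparse vegetation, no-data cell.
values = ndvi([0.45, 0.30, 0.0], [0.05, 0.25, 0.0])
print(values)
```

Values close to 1 indicate dense green vegetation, and values near 0 indicate bare ground.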

Parameter Preprocessing
In order to eliminate collinearity among the selected parameters, a heat map of the parameter correlation matrix was produced using the Seaborn Python visualization package (https://seaborn.pydata.org/generated/seaborn.heatmap.html#seaborn.heatmap, accessed date: 10 November 2020) (Figure 5). Strongly correlated parameters are partly redundant and they also affect the stability of the model. The heat map showed that several of the selected parameters are strongly correlated; for example, the following correlation coefficients were obtained: SB vs. SC (0.86); SIC vs. SC (0.93); SB vs. CC (0.97). In other words, these parameters have an almost identical influence on landslide development. Therefore, SB and SC were excluded from this study.
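The collinearity screening described above can be sketched as follows. This is a simplified illustration rather than the code used in the study: the 0.85 threshold, the function name, and the synthetic factors f1 to f3 are assumptions, and the paper does not state a numerical cut-off. The same correlation matrix could be passed to seaborn.heatmap for display.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.85):
    """Return (name_i, name_j, r) for every factor pair with |r| >= threshold."""
    # Pearson correlation between columns (rowvar=False: columns are variables).
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base,                               # hypothetical factor "f1"
    base + 0.1 * rng.normal(size=500),  # "f2": nearly a copy of f1
    rng.normal(size=500),               # "f3": independent of the others
])
pairs = highly_correlated_pairs(X, ["f1", "f2", "f3"])
print(pairs)
```

For each flagged pair, one of the two factors would then be dropped before model training.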

Resampling
In a machine learning algorithm, if the ratio of non-landslides (NLSs) to landslides (LSs) departs strongly from 1:1, the classifier may favor the majority class at the expense of the minority class. In the present study, the ratio of NLSs to LSs is close to 4:1 (Figure 6). In order to balance the two types of samples, SMOTE (Synthetic Minority Oversampling Technique) was used to increase the number of LS samples [63]. For a minority-class sample A (here, an LS sample), this method randomly selects one of its nearest neighbors B from the same class and then randomly selects a point C on the line segment between A and B as a new minority sample. After resampling, the ratio of NLS sample size to LS sample size is 1:1, which balances the data samples.
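The interpolation step at the core of SMOTE can be sketched as below. This is a simplified illustration written for this text, not the implementation used in the study; libraries such as imbalanced-learn provide complete implementations. The function name, the choice of k = 5 neighbors, and the synthetic data are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for n in range(n_new):
        a = rng.integers(len(X_min))       # sample A
        b = neighbors[a, rng.integers(k)]  # one of A's k nearest neighbors, B
        gap = rng.random()                 # random position along the segment A-B
        synthetic[n] = X_min[a] + gap * (X_min[b] - X_min[a])  # new sample C
    return synthetic

X_ls = np.random.default_rng(1).normal(size=(20, 3))  # hypothetical LS samples
X_new = smote_sketch(X_ls, n_new=60)                  # 20 + 60 would match 80 NLS samples
print(X_new.shape)
```

Resampling of this kind is normally applied to the training data only, so that the synthetic samples do not leak into the data used for evaluation.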

Data Standardization
The data were standardized in order to improve the accuracy of the model algorithm and to speed up the convergence of the model. In addition, several machine learning algorithms are very sensitive to feature scales. Therefore, we used the StandardScaler algorithm (from Scikit-learn, https://scikit-learn.org, accessed date: 10 November 2020) to standardize the factors by removing the mean and scaling to unit variance. Scikit-learn is a Python library that provides a standard interface for implementing machine learning algorithms [64].
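The standardization described above amounts to z = (x − mean) / standard deviation, computed per factor. The following minimal sketch reproduces that arithmetic with NumPy; Scikit-learn's StandardScaler performs the same transformation using the population standard deviation. The function name and the sample values (elevation in m, slope in degrees) are hypothetical.

```python
import numpy as np

def standardize(X):
    """Return the standardized columns of X together with the column means and stds."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0      # leave constant columns unscaled
    return (X - mean) / std, mean, std

# Hypothetical factor table: elevation (m) and slope (degrees) for three grid cells.
X_raw = np.array([[1239.0, 5.0], [1744.0, 25.0], [2249.0, 45.0]])
X_std, mean, std = standardize(X_raw)
print(X_std)
```

The means and standard deviations should be estimated on the training data and then reused unchanged when transforming test data.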

Candidate Machine Selection
We chose 21 types of model algorithms that are widely used in machine learning [35]. Via inspection and testing, we chose the most suitable model algorithm for landslide susceptibility evaluation in an area of dense vegetation.

Ensemble Methods
The principle of the ensemble method is to combine several classifiers (or different parameters of an algorithm) to improve on the performance of any single classifier. Ensemble classifiers can be divided into averaging methods and boosting methods. AdaBoost, Gradient Tree Boosting (GBDT), Bagging, Random Forest, and Extra Trees were selected in this study.

Generalized Linear Models (GLMs)
The generalized linear model is an extension of the linear model. The relationship between the mathematical expectation of the response variable and the linear combination of the predictor variables is established by a link function. Logistic Regression (LR), Passive Aggressive, Ridge, Stochastic Gradient Descent (SGD), and Perceptron were used in this study.

Naïve Bayes (NB)
Naive Bayes classification is based on Bayesian probability. Assuming that the attributes are independent of each other, the posterior probability of each class is computed and the class with the largest probability is taken as the prediction result. Gaussian Naive Bayes and Bernoulli Naive Bayes were selected.

Nearest Neighbors
The principle of the nearest neighbor method is to find a specified number of nearest sample points and then use them to predict new points.

Support Vector Machines (SVM)
The principle of SVM is to find the separating hyperplane that correctly divides the training dataset with the largest geometric margin. Support Vector Classification (SVC), Linear SVC, and Nu-SVC were selected.

Trees
The tree classifier is a tree structure in which each internal node represents a judgment of an attribute, and each branch represents an output of the judgment result. Finally, each leaf node represents a classification result. Decision Tree and Extra Tree were selected.

Discriminant Analysis
Discriminant analysis is a method of multivariate statistical analysis that classifies the studied objects according to several observed indexes. Linear discriminant and quadratic discriminant analyses were selected.

eXtreme Gradient Boosting (XGBoost)
XGBoost is a boosting algorithm and a type of boosted tree model. It implements the GBDT algorithm efficiently and makes many improvements to it, integrating numerous tree models to produce a strong classifier.

Model Fitting and Tuning
Each initial model was trained on the training data of a cross-validation dataset, and the models were then ranked according to their average accuracy score (ACC) on the test data of the cross-validation dataset. ACC is the proportion of all samples involved in the modeling that are classified correctly. Figure 7 shows that the ensemble models generally fit better than the other models; the highest score was achieved by ExtraTrees, followed by RandomForest, Bagging, and KNeighbors. ACC is calculated as follows:

ACC = (TP + TN)/(TP + FN + FP + TN)
The terms are listed in Table 2 and are defined as follows. True positive (TP): the predicted class is positive, and the prediction agrees with the actual class. False positive (FP): the predicted class is positive, and the prediction disagrees with the actual class. True negative (TN): the predicted class is negative, and the prediction agrees with the actual class. False negative (FN): the predicted class is negative, and the prediction disagrees with the actual class.
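The ranking of candidate models by mean cross-validated ACC can be sketched as follows. This is an illustrative example under stated assumptions, not the study's code: the data are synthetic (make_classification), only four of the 21 candidates are shown, and the five folds and the reduced number of trees are arbitrary choices made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Synthetic stand-in for the factor table (X) and the landslide labels (y).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)

candidates = {
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNeighbors": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over the held-out folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name}: {acc:.3f}")

print(accuracy(tp=40, tn=50, fp=5, fn=5))
```

The ordering obtained on synthetic data has no bearing on the ranking reported in Figure 7; the sketch only shows the mechanics of the comparison.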
We selected the four top-ranked models for optimization (Figure 8). Each model was fitted over a parameter grid using grid-search cross-validation, and the best hyperparameters were identified using the AUC (area under the receiver operating characteristic curve) score. Using the optimal hyperparameters of each model given in Table 3, the training set was cross-validated 10 times, and the models were re-ranked according to the average accuracy score on the test data. After optimization, the performance of all four models improved. ExtraTrees remained the optimal model, with a test-data ACC of 0.91 and an average AUC of 0.97 after 10 rounds of cross-validation (Figure 9). AUC summarizes the trade-off between sensitivity and specificity across classification thresholds. The accuracy of the Bagging model in particular improved significantly after optimization.
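A minimal sketch of the tuning step, using Scikit-learn's GridSearchCV with AUC as the selection score, is given below. The parameter grid, the synthetic data, and the interpretation of "cross-validated 10 times" as 10-fold cross-validation are assumptions made for illustration; the grids actually searched are not listed in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the resampled, standardized training data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)

# Hypothetical grid; the values searched in the study are not given here.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}

search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # select hyperparameters by AUC
    cv=5,
)
search.fit(X, y)
best_model = search.best_estimator_

# Re-score the tuned model with 10-fold cross-validation.
acc_scores = cross_val_score(best_model, X, y, cv=10, scoring="accuracy")
auc_scores = cross_val_score(best_model, X, y, cv=10, scoring="roc_auc")
print(search.best_params_)
print(f"ACC = {acc_scores.mean():.3f}, AUC = {auc_scores.mean():.3f}")
```

Scores obtained on the same data used for tuning tend to be optimistic, so a held-out test set is preferable for the final comparison.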

Landslide Inventory and Mapping
Comparison of the remote sensing images before and after the rainfall event enabled us to identify 14,397 landslides in an area of 655 km². The interpretation results are shown in Figure 1. The landslide density reached 22/km², with the largest landslide area being 39,637 m². The average landslide area is 907 m². The total area of all landslides in the study region is 13.06 km², accounting for 2% of the study area. In the landslide inventory, the largest 10 landslides account for 1.8% of the total landslide area, while the largest 10% of landslides account for 10.9%. Compared with the results of landslide inventories in other areas, the proportion of the total landslide area contributed by large-scale landslides is relatively small [53,65].
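The summary figures quoted above are mutually consistent, as the following short check shows. It uses only the numbers stated in the text; the variable names are illustrative.

```python
n_landslides = 14_397
study_area_km2 = 655.0
mean_area_m2 = 907.0

density_per_km2 = n_landslides / study_area_km2     # landslides per km^2
total_area_km2 = n_landslides * mean_area_m2 / 1e6  # m^2 converted to km^2
area_fraction = total_area_km2 / study_area_km2     # share of the study area

print(f"density: {density_per_km2:.1f} per km^2")
print(f"total landslide area: {total_area_km2:.2f} km^2")
print(f"share of study area: {100 * area_fraction:.1f}%")
```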
Kernel density spatial analysis, with a default radius of 1 km, was carried out both without weighting and with area weighting using the ArcGIS 10.2 toolbox (Figure 10). The results show that the spatial distribution of landslides induced by heavy rainfall is not completely uniform. In the non-weighted distribution, the landslides are clustered in the north and south, and the density and area of the cluster in the north are higher than those of the cluster in the south (Figure 10A). The area-weighted distribution (Figure 10B) shows that large landslides are mainly concentrated in the northern region; compared with the non-weighted distribution, the distribution range of large landslides in the northern region is larger and more dispersed.
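The difference between the unweighted and area-weighted density surfaces can be illustrated with the following sketch. It assumes a quartic kernel with a 1 km search radius; the kernel form, the function name, and the sample coordinates and areas are assumptions made for this example and may differ in detail from the ArcGIS implementation.

```python
import numpy as np

def kernel_density(points_xy, grid_xy, radius=1000.0, weights=None):
    """Quartic-kernel density at each grid location (units: weight per m^2)."""
    points_xy = np.asarray(points_xy, dtype=float)
    grid_xy = np.asarray(grid_xy, dtype=float)
    w = np.ones(len(points_xy)) if weights is None else np.asarray(weights, dtype=float)
    # Distance from every grid location to every landslide point.
    d = np.linalg.norm(grid_xy[:, None, :] - points_xy[None, :, :], axis=2)
    # Quartic kernel: non-zero only within the search radius.
    k = np.where(d < radius, (1.0 - (d / radius) ** 2) ** 2, 0.0)
    return (3.0 / (np.pi * radius ** 2)) * (k * w).sum(axis=1)

# Hypothetical landslide centroids (m) and their areas (m^2).
points = [[0.0, 0.0], [100.0, 0.0], [5000.0, 5000.0]]
areas = [900.0, 40000.0, 500.0]
grid = [[0.0, 0.0], [5000.0, 5000.0]]

unweighted = kernel_density(points, grid)
weighted = kernel_density(points, grid, weights=areas)
print(unweighted, weighted)
```

In the unweighted case every landslide counts equally, whereas with area weighting a few large landslides can dominate the surface, which is the contrast drawn between Figure 10A and Figure 10B.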

Landslide Susceptibility Mapping
Although the ExtraTrees model achieved the highest score after optimization (Figure 8), it can only output the classification result (i.e., 0/1) and cannot generate a probability value. However, probability values are needed to produce a landslide susceptibility map. Therefore, the RandomForest model (hyperparameters: 'criterion' = 'entropy', 'max_depth' = 54, 'n_estimators' = 300, 'oob_score' = True) was used as the final model to produce a landslide susceptibility map for the study area. The methods provided by Scikit-learn were used to construct the probability set. In the binary case, the probabilities are calibrated using Platt scaling (Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods"): a logistic regression is fitted to the RandomForest scores using an additional cross-validation on the training data [66].
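Platt scaling can be sketched in a few lines: a two-parameter sigmoid is fitted to held-out classifier scores by minimizing the log loss. The sketch below uses plain gradient descent and hypothetical scores and labels; it is not the authors' implementation. In scikit-learn, sigmoid calibration of this kind is available through `CalibratedClassifierCV(method="sigmoid")`, although the text does not state which class was used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean
    log loss. `scores` should come from data not used to train the
    classifier (e.g. cross-validation folds)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw classifier score to a calibrated probability."""
    return sigmoid(a * score + b)


# Hypothetical held-out scores (e.g. fraction of trees voting "landslide")
# and the corresponding true labels.
held_out_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90, 0.95]
held_out_labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

a, b = fit_platt(held_out_scores, held_out_labels)
probs = [calibrate(s, a, b) for s in held_out_scores]
for s, p in zip(held_out_scores, probs):
    print(f"raw score {s:.2f} -> calibrated probability {p:.3f}")
```

Because the mapping is monotonic, calibration does not change the ranking of map cells; it only rescales the scores so that they can be read as probabilities.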
The natural breaks method was used to divide the probability values of landslide susceptibility into five categories [67]: very low, low, moderate, high, and very high (Figure 11), and the corresponding area proportions were 41.2%, 24.5%, 13.1%, 6.2%, and 15.0%, respectively. The area proportion does not decrease monotonically with increasing susceptibility: it falls from the very low class to the high class and then rises again in the very high class. The proportion of the very low susceptibility area is the highest, followed by the low susceptibility area, and the proportion of the high susceptibility area is the lowest. The spatial distribution of the very high susceptibility area is not uniform: it is denser and more extensive in the northern region than in the central and southern regions, which is consistent with the actual distribution of landslides. The periphery of the study area mainly has very low susceptibility, which may be related to rainfall intensity.
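The classification step can be sketched as follows, assuming that the natural breaks method corresponds to a Fisher-Jenks style optimization that minimizes the within-class sum of squared deviations. That reading is an assumption, and the probability values below are hypothetical. The final lines check that the five reported proportions sum to 100%.

```python
import bisect


def natural_breaks(values, k):
    """Return the lower bounds of classes 2..k for the partition of the
    sorted values into k contiguous classes with minimal within-class
    sum of squared deviations (dynamic programming, O(k * n^2))."""
    xs = sorted(values)
    n = len(xs)
    ps, ps2 = [0.0], [0.0]  # prefix sums of x and x^2
    for v in xs:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)

    def sse(i, j):
        """Sum of squared deviations of xs[i:j] about its mean."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + sse(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    split[c][j] = i

    # Backtrack the start index of each class.
    starts, j = [], n
    for c in range(k, 0, -1):
        j = split[c][j]
        starts.append(j)
    return [xs[s] for s in sorted(starts)[1:]]  # drop the first class (index 0)


def class_shares(values, breaks):
    """Fraction of values falling in each class defined by `breaks`."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        counts[bisect.bisect_right(breaks, v)] += 1
    return [c / len(values) for c in counts]


# Hypothetical susceptibility probabilities for 14 map cells.
probs = [0.02, 0.03, 0.05, 0.21, 0.24, 0.26, 0.48,
         0.50, 0.52, 0.71, 0.74, 0.93, 0.95, 0.97]
breaks = natural_breaks(probs, 5)
shares = class_shares(probs, breaks)
print("class lower bounds:", breaks)
print("class shares:", [round(s, 3) for s in shares])

# Proportions reported in the text for the five classes.
reported = [41.2, 24.5, 13.1, 6.2, 15.0]
print("sum of reported proportions:", round(sum(reported), 1))
```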

Figure 11. Landslide susceptibility map of the Tianshui area calculated using the RandomForest model.


Discussion
The interpretability of the model helps to determine the potential relationship between different influencing factors and landslide susceptibility. This in turn enables the landslide susceptibility assessment results to be applied outside the study area and to increase our understanding of the contribution of the various factors influencing landslide formation under similar environmental conditions.
The calculated weight of each factor is shown in Figure 12. Although all of the factors contribute to landslide development, the sizes of their contributions differ. The contribution of SA (14%) is significantly higher than those of the other factors, followed by PLC (8%). DR, DTF, NDVI, and DTR each contribute 7%, and the other factors each contribute less than 6%. Geomorphic factors are therefore the most important controls on landslide susceptibility, while factors related to the landslide material and geological conditions play a secondary role, and the impact of engineering activity is relatively small. In the present study, four soil-related factors (SIC, ST, SG, and CC) made only small contributions to landslide development, which may be related to inaccuracies in the soil data. Specifically, from the distribution of SA (Figure 13A), LSs accounted for the highest proportion when 94° < SA < 246°, indicating that sunlit slopes are more prone to landslides than shaded slopes. The influence of slope aspect on the spatial distribution of landslides has long been a subject of research. Many studies show that slope aspect influences landslide formation mainly in three ways: (i) The microclimate of slopes with different orientations shows regular differences. Compared with the shaded slope, the sunlit slope has higher temperature and precipitation; the physical and chemical weathering rates are therefore faster, forming a thicker soil layer that provides the source material for landslides [51,68]. (ii) Compared with the shaded slope, the vegetation coverage of the sunlit slope is low, and most of it consists of shrubs and herbs [69]. The stabilizing effect of these shallow roots on the sunlit slope is significantly weaker than that of the vertical roots of trees on the shaded slope. The influence of vegetation on slope stability is bidirectional: it has adverse effects with respect to deep landslides, whereas it can restrain the development of shallow landslides [70]. (iii) The continuous alternation of wet and dry conditions on a sunlit slope readily forms a macropore system in the unsaturated zone of the slope, which is conducive to the rapid infiltration of precipitation and is thus unfavorable to slope stability [71,72].
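The comparison of landslide samples (LSs) and non-landslide samples (NLSs) across slope aspect can be sketched by binning the aspect values and computing the LS share in each bin. The sample values below are hypothetical and the 90° bin width is chosen only for illustration; the actual distributions are those shown in Figure 13.

```python
def ls_share_by_bin(aspects_deg, is_landslide, bin_width=90):
    """Share of landslide samples in each aspect bin.

    Returns a list of (lower_deg, upper_deg, share) tuples; share is None
    for bins that contain no samples.
    """
    n_bins = 360 // bin_width
    ls_count = [0] * n_bins
    total = [0] * n_bins
    for aspect, label in zip(aspects_deg, is_landslide):
        b = int((aspect % 360) // bin_width)
        total[b] += 1
        ls_count[b] += label
    return [
        (i * bin_width, (i + 1) * bin_width,
         ls_count[i] / total[i] if total[i] else None)
        for i in range(n_bins)
    ]


# Hypothetical samples: aspect in degrees, 1 = LS, 0 = NLS.
aspects = [10, 30, 100, 120, 150, 180, 200, 230, 280, 300, 330, 350]
labels = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0]

shares = ls_share_by_bin(aspects, labels)
for lower, upper, share in shares:
    print(f"{lower:3d}-{upper:3d} deg: LS share = {share:.2f}")
```

The same binning can be applied to the continuous factors (for example, distance to river or NDVI) to identify the value ranges in which LSs dominate.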
For the ranges of 5 < PLC < 48, 84 < DR < 760, 0 < DTF < 2677, 0.03 < NDVI < 0.56, and 150 < DTR < 1270 (Figure 13), the proportion of LSs is higher, which indicates that landslides are more likely to occur within these ranges. For this landslide event, the shape of the slope was the second most important factor after slope aspect, because ponding capacity and the degree of surface differentiation differ significantly among different types of slopes, such as concave, convex, and flat slopes [73]. These differences may be amplified under heavy rainfall, thus strongly influencing the development of landslides. DR, DTF, and DTR have similar distribution characteristics. The intensity of fluvial erosion, the damage caused by faults to rock mass strength, and the level of engineering activity all decrease with increasing distance from these features [17,74]. Therefore, the smaller the distance from these features, the higher the proportion of LSs; as the distance increases, NLSs gradually occupy a higher proportion. The influence of vegetation on landslide development is reflected by the fact that areas of low vegetation coverage are more prone to landslides [61]. This is mainly because vegetation delays rainfall infiltration and increases evaporation, and because well-developed root systems significantly increase slope stability. Consistent with the results of some other studies, slope aspect is the most important factor in the susceptibility evaluation [73]. However, owing to differences in geographical location, climate, topography, and vegetation type, the role of slope aspect may differ greatly among regions. For example, the weights of landslide susceptibility factors in an orogenic belt and in a mountainous hilly region may be completely different. The location and climatic conditions of the study area mean that the microclimate of different slope aspects is very important to the development of landslides: the microclimate significantly affects the vegetation type and coverage, water system development, humidity index, soil thickness, and other factors, and thus profoundly affects the spatial distribution of landslides [53].
Landslide susceptibility assessments based on machine learning are relatively flexible and practical, and they are readily applicable in disaster prevention and land management. The research results can be used as a reference for decision-makers and planners in the study area. With the increasing population pressure in western China, there is a trend towards increased settlement on steep hillsides. Therefore, in order to protect human life and property, landslide susceptibility maps can be used as a basic tool for land management and planning in future construction projects in such areas.

Conclusions
We compared high-precision remote sensing images of southern Tianshui taken before and after an interval of continuous heavy rainfall (from 20 June to 25 July 2013), with the aim of identifying rainfall-induced landslides. Based on the landslide inventory map, various machine learning methods were applied to landslide susceptibility evaluation, and we selected the optimal model for landslide susceptibility evaluation in areas of low and medium elevation mountains with a high vegetation coverage and produced a landslide susceptibility map. Finally, in order to better understand the factors controlling landslide susceptibility, we analyzed the role and weight of each influencing factor in the training process. The main conclusions are as follows:

(1) The 21 initial models were trained with the training data in the cross-validation dataset, and the models were then sorted according to the average accuracy score (ACC). The results showed that the overall fitting effect of the comprehensive (ensemble) models was better than that of the other models. The ExtraTrees model had the highest score, with an average test data accuracy of 0.91, and the average AUC after 10-time cross-validation was 0.97. This model can be effectively used for the susceptibility evaluation of shallow landslides.

(2) Among all of the selected evaluation factors, slope aspect made a larger contribution to landslide development than the other factors, followed by PLC, DR, DTF, NDVI, and DTR. For 94° < SA < 246°, LSs accounted for the highest proportion, which indicates that sunlit slopes are significantly more prone to landslides than shaded slopes. Geomorphic conditions are the most important factors controlling landslides induced by heavy rainfall, followed by fluvial erosion and fault distribution, while human activities have only a small influence.

(3) In the evaluation of landslide susceptibility based on machine learning, the prediction performance of the various models differs significantly. Extensive comparative prediction in different environments, close linkage of model evaluation with the goals of the study, and a better understanding of the abilities and limitations of each model are the keys to model selection in the future. These steps will strengthen the application of artificial intelligence technology in the field of geological disaster prevention and improve prediction accuracy and efficiency.