Comparison of Tree-Structured Parzen Estimator Optimization in Three Typical Neural Network Models for Landslide Susceptibility Assessment

Landslides pose a constant threat to the lives and property of mountain people and may also cause geomorphological damage such as soil and water loss, vegetation destruction, and land cover change. Landslide susceptibility assessment (LSA) is a key component of landslide risk evaluation. There are many related studies, but few analyze and compare the optimization of the models involved. This paper introduces the Tree-structured Parzen Estimator (TPE) algorithm for hyperparameter optimization of three typical neural network models for LSA, taking Shuicheng County, China, as an example, and compares the differences in predictive ability among the models in order to achieve higher application performance. First, 17 landslide influencing factors from multiple data sources were selected for spatial prediction; hybrid ensemble oversampling and undersampling techniques were used to address the imbalanced- and small-sample problems, and the samples were randomly divided into a training set and a validation set. Second, deep neural network (DNN), recurrent neural network (RNN), and convolutional neural network (CNN) models were adopted to predict the regional landslide susceptibility, and the TPE algorithm was used to optimize the hyperparameters of each model to improve the assessment capacity. Finally, to compare the differences and optimization effects of these models, several objective measures were applied for validation. The results show that the high-susceptibility regions are mostly distributed in bands along fault zones, where the lithology is mostly claystone, sandstone, and basalt. The DNN, RNN, and CNN models all perform well in LSA, especially the RNN model. TPE optimization significantly improves the accuracy of the DNN and CNN models (by 3.92% and 1.52%, respectively) but does not improve the performance of the RNN.
In summary, our proposed RNN model and TPE-optimized DNN and CNN models have robust predictive capability for landslide susceptibility in the study area and can also be applied to other areas with similar geological conditions.


Introduction
Landslides are the most common natural hazards in mountainous areas; once they occur, they may cause the destruction of roads and houses, change of large

Study Area
Shuicheng County belongs to the central region of the Yunnan-Guizhou Plateau in China, with an approximate area totaling 3605 km² and a resident population of about 754,900. Its elevation ranges from 633 to 2863 m, and about 32.5% of the area has a slope above 20° (Figure 1). For landslides, extreme precipitation is the main triggering and inducing factor. Sudden extreme precipitation in mountainous areas induces regional or basin-scale clustered landslides such as topples, slides, and debris flows, which can seriously affect social production, daily life, and the ecological environment. Shuicheng County has a subtropical monsoon climate with abundant and frequent precipitation, often accompanied by heavy rainfall. Furthermore, it lies in a karst landscape, where surface water seeps easily and the moisture content of the soil is high. Shuicheng County is a concentrated, high-incidence area of landslides and is one of the most landslide-prone and seriously affected counties in Guizhou Province. Hence, it is particularly important to carry out risk assessment of extreme precipitation-induced landslides in Shuicheng County, of which susceptibility assessment is the most essential component.
On 23 July 2019, a mega-landslide that killed 52 people occurred in Jichang Town, Shuicheng County [46,47]. Figure 2 shows Google Earth satellite images taken prior to and following the landslide; it can be seen that more than one year after the landslide there is still a very significant impact on natural features such as vegetation and geomorphology, as well as on roads, houses, and other building sites.


Landslide Historical Inventory
In this paper, we collected the landslide historical inventories recorded by the China Geological Survey [48]. The three most frequent and serious landslide types (topples, slides, and debris flows) were extracted and, integrated with remote sensing images and field surveys, the centroids of the landslide scarps of 240 historical landslide points were finally identified and stored in the database; sampling at the scarp centroid has been shown to be the best landslide sampling strategy [49].

Landslides Susceptibility Influencing Factors
Susceptibility is characterized as the spatial probability of occurrence of landslides, and the selection of influencing factors is significantly important to LSA. Combining relevant studies and the accessibility of data, we finally identified 17 factors. These influencing factors were input into a uniform-format database; following the pixel size of the Digital Elevation Model (DEM) map, all factor layers were resampled to 30 × 30 m, regardless of their initial data format.
Lithology is fundamental to the development of landslides and affects the shear strength and water leakage of slopes [50]. The lithology in Shuicheng County can be classified into five categories: basalt, claystone, dolomite, limestone, and sandstone. Geological age can reflect the degree of lithological development, while the distance to faults can reflect the activity of geological structures. This vector data was digitized from geological maps provided by the China Geological Survey. Topographic factors are another major predisposing factor for landslides [2]. Elevation is a measure of the absolute height of the regional terrain; slope is an important factor for landslide development, especially for rockfalls and landslides. Slope and geotechnical stability are not simply linearly related; slope always works in conjunction with slope height, geotechnical combination, slope structure, and other factors. Aspect generally combines with other factors to form the slope structure, which in turn affects the development of landslides; for landslides induced by extreme rainfall, the aspect affects insolation and rainfall. Slope types mainly include concave, convex, and linear, etc.; the degree of rainfall infiltration into the slope body varies with the slope type, which can be characterized by plan curvature and profile curvature. Land cover affects soil erosion, especially the degree of vegetation cover, which can be expressed by the normalized difference vegetation index (NDVI) [51,52]. The land cover was downloaded from the Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC) dataset, and the NDVI was calculated from the near-infrared and red band values of Landsat 8 OLI satellite remote sensing images acquired in April 2018. Roads can reflect the influence of human activities on geological formations, so the distance to roads was also selected as an influencing factor.
Terrestrial hydrology mainly refers to the development and distribution of river valleys. Research shows that the development of landslides is strongly linked to the river system configuration, because the river system integrally reflects the development of the free surface, the density of gullies, and some characteristics of slopes. Rivers also have an erosion effect, mainly downward cutting erosion, lateral hollowing erosion, and wave action; erosion carries away stones and clods at the slope toe, forming the free plane and providing favorable topographic conditions for landslides. Meanwhile, the rise and fall of the river level affects the fluctuation of the groundwater level, which in turn affects the stability of the slope. Based on the above analysis, the distance to rivers, average annual precipitation (AAP), and four hydrological indices were selected for landslide studies [31]. Among them, the distance to rivers was obtained by creating multi-ring buffers, and the AAP was obtained by collecting the AAP of seven meteorological stations around the study area from 1981 to 2018 and then interpolating by inverse distance weighting. The four hydrological indices, consisting of the Stream Power Index (SPI), Sediment Transport Index (STI), Topographic Relief Index (TRI), and Topographic Wetness Index (TWI) [53,54], were calculated as follows:

SPI = A_S × tan β
STI = (A_S / 22.13)^0.6 × (sin β / 0.0896)^1.3
TRI = DEM_MAX − DEM_MIN
TWI = ln(A_S / tan β)

where A_S represents the catchment area (m²/m), β is the slope [55], and DEM_MAX and DEM_MIN are the maximum and minimum DEM values surrounding every pixel, respectively. The classification of each influencing factor value is shown in Table 1. Maps of the spatial distribution of the landslide influencing factors are shown in Figure 3.
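As a concrete illustration, the four indices can be computed per pixel from the slope and catchment area; the sketch below uses the standard SPI/STI/TRI/TWI definitions given above (the function name and sample values are hypothetical, not taken from the study data).

```python
import math

def hydro_indices(a_s, beta_deg, dem_window):
    """Compute the four hydrological indices for one pixel.

    a_s        -- specific catchment area (m^2/m)
    beta_deg   -- slope in degrees
    dem_window -- DEM values in the neighbourhood of the pixel
    """
    beta = math.radians(beta_deg)
    spi = a_s * math.tan(beta)                                      # Stream Power Index
    sti = (a_s / 22.13) ** 0.6 * (math.sin(beta) / 0.0896) ** 1.3   # Sediment Transport Index
    tri = max(dem_window) - min(dem_window)                         # Topographic Relief Index
    twi = math.log(a_s / math.tan(beta))                            # Topographic Wetness Index
    return spi, sti, tri, twi

# Example pixel: catchment area 100 m^2/m, slope 20 degrees, 3 DEM neighbours
spi, sti, tri, twi = hydro_indices(100.0, 20.0, [1450.0, 1480.0, 1462.0])
```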

Methods
Our research on LSA can be separated into the following five stages, shown in Figure 4. Stage A is data collection; we collected data on the landslide historical inventory and the landslide susceptibility influencing factors from multiple sources. Stage B is data processing; we divided the study area into grids, unified the data to the same pixel size, and prepared the samples. Stage C is model construction; we used the DNN, RNN, and CNN models to train and generate LSMs. Stage D is TPE optimization; we used the TPE-optimized DNN (DNN_TPE), RNN (RNN_TPE), and CNN (CNN_TPE) to train the models and generate LSMs as well. Stage E is model validation and comparison; we validated and compared the performance and the TPE optimization effect of the different neural network models by multiple methods.


Geodatabase Construction
First, the factors were classified into five categories, where continuous variables were classified using the Natural Breaks Method (NBM) and discrete variables were ranked by calculating, for each category, the ratio R of the share of historical landslide points to the share of area:

R = (X_ij / X_A) / (S_ij / S_A)

where S_ij and S_A represent the area of category j of factor i and of the study area, respectively, and X_ij and X_A are the numbers of historical landslide points in S_ij and S_A, respectively. R actually represents the amount of information in each category; the higher the R value, the higher the category rank.
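The ranking ratio reduces to a one-line helper; this is a hypothetical sketch of the equation above (the function name is ours).

```python
def category_rank_ratio(s_ij, s_a, x_ij, x_a):
    """R = (X_ij / X_A) / (S_ij / S_A): the landslide-point share of a
    factor category relative to its area share; higher R -> higher rank."""
    return (x_ij / x_a) / (s_ij / s_a)

# A category covering 10% of the area but holding 50% of the points gets R = 5
r = category_rank_ratio(s_ij=100.0, s_a=1000.0, x_ij=30.0, x_a=60.0)
```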

Sample Selection
In the neural network modeling process, the numbers of positive samples (landslide points) and negative samples (non-landslide points) should not be unbalanced by orders of magnitude: when the data are extremely unbalanced, samples from the majority class are easier to predict, and prediction performance for the minority class is poorer. Meanwhile, only 240 historical landslide points were identified in Shuicheng County in this study. Too few samples may lead to poor model prediction and cannot correctly reflect landslide susceptibility in the study area, while too many may cause overfitting of the model. After several tests, doubling the landslide points and then selecting an equal number of non-landslide points as samples maintained a certain accuracy without overfitting.
Considering the above, this paper used hybrid ensemble oversampling and undersampling techniques for sample selection. The specific steps are as follows: (1) 240 non-landslide points were selected by random undersampling and this was repeated twice, giving 480 negative samples; (2) these 480 non-landslide points and the 240 landslide points were used as input data, and the Borderline Synthetic Minority Over-sampling Technique (Borderline-SMOTE), an enhanced variant of SMOTE [56], was used to oversample the positive samples. SMOTE works by selecting a minority-class sample A, choosing a sample B from its nearest neighbors, and generating a new minority-class sample at a random point on the line between the two; Borderline-SMOTE builds on this but synthesizes only from safe samples (A and B of the same class). In this way, 240 new landslide points were generated; (3) 70% of the positive and negative samples were chosen at random for training, and the remaining 30% were used for validation.
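The core interpolation step shared by SMOTE and Borderline-SMOTE can be sketched in a few lines; this minimal example only shows the synthesis of one point between two minority samples and omits the nearest-neighbor search and the borderline/safe filtering of the full algorithm (names and values are illustrative).

```python
import random

def smote_interpolate(a, b, rng):
    """Synthesize one minority sample at a random point on the segment
    between minority samples a and b (the core SMOTE step)."""
    lam = rng.random()  # lambda in [0, 1)
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(0)
a, b = [1.0, 5.0], [3.0, 9.0]
new_point = smote_interpolate(a, b, rng)  # lies between a and b componentwise
```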

Multi-Collinearity Analysis
Multi-collinearity analysis is a prerequisite to test whether multi-dimensional factors can be used simultaneously. In this paper, the variance inflation factor (VIF) was selected as the determination condition:

VIF_j = 1 / (1 − R_j²)

where R_j² is the coefficient of determination of the auxiliary regression of the explanatory factor X_j on the others. The closer the VIF is to 1, the weaker the multi-collinearity. Experience shows that VIF ≥ 10 indicates severe multi-collinearity between a variable and the remaining variables, and this multi-collinearity may overly affect the least-squares estimates. Tolerance is the inverse of the VIF, which means that severe multi-collinearity exists when Tolerance ≤ 0.1.
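A minimal sketch of the VIF criterion, assuming the auxiliary R_j² has already been obtained from regressing X_j on the remaining factors (function names are ours):

```python
def vif(r_squared):
    """Variance inflation factor from the R^2 of the auxiliary regression
    of factor X_j on the remaining factors."""
    return 1.0 / (1.0 - r_squared)

def is_collinear(r_squared, threshold=10.0):
    """VIF >= 10 (equivalently Tolerance <= 0.1) flags severe multi-collinearity."""
    return vif(r_squared) >= threshold
```

For example, R² = 0.9 gives VIF = 10 exactly, the conventional cut-off; R² = 0 gives VIF = 1, i.e., no collinearity.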

DNN Model
A DNN can be understood as a neural network with many hidden layers [57]. The DNN extends the simple perceptron by: (1) adding multiple hidden layers to enhance the expressiveness of the model; (2) allowing more than one neuron in the output layer, so that the model can be flexibly applied to classification, regression, dimensionality reduction, clustering, etc.; (3) extending the activation function. The activation function of the perceptron is sign(z), which is simple but has limited processing power, whereas neural networks generally use Sigmoid, tanh, ReLU, softplus, softmax, etc., to add nonlinearity, which improves the expressiveness of the model. The structure of the DNN constructed in this paper is shown in Figure 5.
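To make the layered structure concrete, a toy forward pass through a two-unit hidden layer can be written in plain Python; this is an illustrative sketch with arbitrary weights, not the network configuration used in the paper.

```python
import math

def relu(z):
    return [max(0.0, v) for v in z]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def dense(x, weights, bias, act):
    """One fully connected layer: act(W x + b), lists standing in for matrices."""
    z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return act(z)

# Toy DNN: 3 inputs -> 2 ReLU hidden units -> 1 sigmoid output
x = [0.5, -1.0, 2.0]
h = dense(x, [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]], [0.0, 0.1], relu)
y = dense(h, [[1.0, -1.0]], [0.0], sigmoid)  # probability-like output in (0, 1)
```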



RNN Model
An RNN is a special neural network structure that not only considers the input at the previous moment but also creates a "memory" of previous content; in other words, the present output is also correlated with the preceding outputs [58]. Specifically, the network remembers prior information and applies it in the computation of the present output, which means that the nodes between the hidden layers are no longer unconnected, and the input of a hidden layer also contains the output of the hidden layer at the preceding moment. Based on this property, RNNs are commonly used in speech recognition research [59]. Figure 6 shows the layer unfolding of the hidden layers of the RNN model.
x_t denotes the input sample at time t, S_t denotes the memory of the sample at time t, and o_t denotes the output at time t. W denotes the weight of the memory from the previous moment, U denotes the weight of the input sample at this moment, and V denotes the weight of the output. When t = 1, the input S_0 is generally initialized to 0, W, U, and V are randomly initialized, and the following equations are computed:

S_t = f(W × S_(t−1) + U × x_t)
o_t = g(V × S_t)

where f(x) and g(x) are both activation functions; f(x) can be tanh, ReLU, Sigmoid, or other activation functions, while g(x) is usually softmax. Iterating in this way, the final output value can be obtained. There are various variants of RNN models that may surpass the performance of the basic RNN model [40,60,61]. However, since this paper aims to compare the applications of different typical neural network models, the basic RNN model was selected and the number of neurons in the hidden layer was set to 50 by default; the structure of the RNN model is shown in Figure 7.
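The recurrence above can be illustrated with a scalar toy example, using tanh as f and, for simplicity, the identity instead of softmax for g; all weights and inputs are hypothetical.

```python
import math

def rnn_step(x_t, s_prev, W, U, V):
    """One RNN step: S_t = tanh(W*S_{t-1} + U*x_t), o_t = V*S_t (scalar toy case)."""
    s_t = math.tanh(W * s_prev + U * x_t)
    o_t = V * s_t
    return s_t, o_t

# Unroll over a short input sequence with S_0 = 0
W, U, V = 0.5, 1.0, 2.0
s = 0.0
outputs = []
for x_t in [1.0, 0.5, -0.2]:
    s, o = rnn_step(x_t, s, W, U, V)  # each step feeds its memory forward
    outputs.append(o)
```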


CNN Model
The CNN is essentially a multilayer perceptron, proposed by Yann LeCun of New York University in 1998 [62]. The CNN is characterized by local connectivity and shared weights, which decreases the number of weights, making the network easier to optimize while reducing model complexity and thus the risk of overfitting. The special feature of CNN construction is its unique convolutional layer and pooling layer. The convolutional layer extracts features: in the convolution operation, a matrix of size F × F (F × 1 in one dimension), called the filter or convolution kernel, is set, and the matrix size is the receptive field. The interior of the convolutional layer contains multiple convolutional kernels, and each element of a convolutional kernel is associated with a weight and a bias. Every neuron within a convolutional layer is connected to several neurons near its position in the preceding layer, and the size of this region is determined by the receptive field. The convolution kernel works by regularly sweeping through the input features, multiplying and summing the matrix elements within the kernel, and superimposing the bias. Then, the exported feature map is delivered to the pooling layer for feature selection and information filtering. The pooling layer applies pre-defined pooling functions, with max pooling and average pooling being the most common; it replaces the value at a point within a feature map with a statistic of the adjacent region. The pooling layer selects the pooling region in the same manner as the convolutional kernel scans the feature map, determined by the padding, pooling size, and step. The pooling layer is equivalent to converting a higher-resolution image into a lower-resolution one; it also reduces the node count of the final fully connected layer, thus reducing the parameters of the entire neural network and hence the overfitting risk.
The fully connected layer is the final component of the CNN hidden layers and corresponds to the hidden layer in a classical feedforward ANN model. The fully connected layer is responsible for transmitting information to the output layer, where the feature maps lose their spatial topology, are flattened into vectors, and are passed through the activation function. As one of the most successful and popular neural network models of the last decade, CNNs have been extensively adopted in various fields, especially image recognition [63,64].
For LSA, based on the influencing-factor rank of each pixel, each sample can be arranged as a 17 × 1 array for the input layer; therefore, in this paper, a one-dimensional CNN, which is often applied to sequence data, was used to construct the model. The specific CNN structure is shown in Figure 8. This structure follows the one-dimensional CNN presented by Wang et al. [41]. The input layer has the dimension n of the factors, i.e., 17 × 1; the initial convolution kernel size m is set to 3, yielding N feature vectors of length (n − m + 1), with N set to 20. The size of the max-pooling layer is a × 1, with a initially set to 2; the result consists of N vectors of length (n − m + 1)/a. Then, a fully connected layer with 50 neurons is applied to the extracted features. Finally, we set 2 neurons in the output layer for the binary classification problem.
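The convolution and pooling arithmetic of this structure can be checked with a minimal one-dimensional sketch; the filter weights and factor values below are arbitrary, and only a single convolution kernel is shown instead of the N = 20 used in the paper.

```python
def conv1d(x, kernel, bias=0.0):
    """Valid 1-D convolution: output length is len(x) - len(kernel) + 1."""
    m = len(kernel)
    return [sum(k * v for k, v in zip(kernel, x[i:i + m])) + bias
            for i in range(len(x) - m + 1)]

def max_pool(x, size):
    """Non-overlapping max pooling with window `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

# A 17-length factor vector with kernel size m = 3 and pooling size a = 2
# gives 17 - 3 + 1 = 15 features, then 7 after pooling.
factors = [float(i % 5) for i in range(17)]
feat = conv1d(factors, [0.2, 0.5, 0.3])
pooled = max_pool(feat, 2)
```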

TPE Optimization
Hyperparameter optimization is extremely important for machine learning models, especially for neural network models, which are typically black-box models [65]. Since one cannot intervene during model training, tuning hyperparameters before the model runs formally becomes an important means of improving model precision. Early approaches, from manual tuning to the later grid and random search, were very time-consuming and inefficient [54]. With both accuracy and efficiency in mind, many methods for automatic parameter tuning were later developed. Bayesian optimization is a function minimization method that uses a model to find the value minimizing the objective function [66]. It is highly performant and very time-efficient since it refers to previous evaluation results when trying the next set of hyperparameters.

TPE is a Bayesian optimization algorithm proposed by [67] that learns hyperparameter models using kernel density estimators. First, the concept of conditional probability from Bayes' theorem is introduced: p(x|y) is the conditional probability that the hyperparameter is x when the model loss is y. In the first step, a threshold y* for the loss is selected based on the available data, e.g., as the median. Two probability densities, ℓ(x) and g(x), are then learned for data below and above the threshold, respectively:

p(x|y) = ℓ(x) if y < y*, and p(x|y) = g(x) if y ≥ y*

where ℓ(x) is the density formed from the observations x^(i) whose corresponding loss f(x^(i)) was less than y*, and g(x) is the density formed from the remaining observations.
The parametrization of p(x, y) as p(y)p(x|y) in the TPE algorithm was chosen to facilitate the optimization of the Expected Improvement (EI).
By construction, γ = p(y < y*) and p(x) = γℓ(x) + (1 − γ)g(x). The final EI_y*(x) can be expressed as:

EI_y*(x) ∝ (γ + (g(x)/ℓ(x))(1 − γ))^(−1)

Therefore, we can minimize g(x)/ℓ(x) to get a new x*, put x* back into the function, iterate again to fit ℓ(x) and g(x), and keep minimizing g(x)/ℓ(x) until the predetermined number of iterations is reached, finally completing the optimization of the hyperparameters.
In this paper, we used the Hyperopt library in a Python 3.7 environment to perform the TPE optimization. There are four main components of TPE optimization: (1) the objective function, for which we select the loss of the model, using a given set of hyperparameters, on the validation set; (2) the domain space, i.e., the search range of the hyperparameters; (3) the optimization algorithm, i.e., the TPE algorithm; and (4) the result record. Of these four items, the objective function and optimization algorithm are already determined, so the domain space is the other important part of TPE optimization. The basic principle for choosing the search space is to be scientific and efficient: the range should be neither too large nor too small. If it is too small, the optimization effect is not obvious; if it is too large, it easily causes overfitting, long computing times, and other problems. The domain space selected in this paper is based on the above principles and related research.
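The candidate-selection idea behind TPE can be sketched in a few lines; this toy version uses a simple Gaussian kernel density estimate in place of Hyperopt's adaptive Parzen estimator, so it illustrates only the ℓ(x)/g(x) ratio (all names and values are ours, not Hyperopt's API).

```python
import math

def kde(points, x, bw=0.5):
    """Simple Gaussian kernel density estimate at x."""
    norm = len(points) * bw * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bw) ** 2) for p in points) / norm

def tpe_suggest(trials, candidates, gamma=0.5):
    """Pick the candidate maximizing l(x)/g(x), where l and g are densities
    of hyperparameter values whose losses fall below/above the gamma quantile."""
    trials = sorted(trials, key=lambda t: t[1])        # (x, loss) pairs
    cut = max(1, int(gamma * len(trials)))
    good = [x for x, _ in trials[:cut]]                # low-loss observations -> l(x)
    bad = [x for x, _ in trials[cut:]] or good         # remaining observations -> g(x)
    return max(candidates, key=lambda x: kde(good, x) / (kde(bad, x) + 1e-12))

# Toy objective loss = (x - 3)^2 observed at five points; the suggestion
# should move toward the low-loss region around x = 3.
trials = [(0.0, 9.0), (1.0, 4.0), (2.0, 1.0), (5.0, 4.0), (6.0, 9.0)]
x_next = tpe_suggest(trials, candidates=[0.5, 1.5, 2.5, 4.5, 5.5])
```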
For the selection of hyperparameters, although neural network models have many parameters, many of them are nested within the chosen activation functions; their default values can be kept, as they have little impact on model optimization and are generally not considered in Bayesian optimization. Only hyperparameters that directly affect the structure and operation of the network, such as the number of neurons, dropout rate, convolutional kernel size, and class of activation function, are selected for tuning.
The hyperparameters of the DNN and RNN are the same, except for "units", which refers to the neuron count of the two fully connected layers in the DNN and of the hidden layer in the RNN. The dropout rate is an important way to reduce overfitting in neural network models, especially when the training samples are few [68]. "Batch size" is the number of samples selected for one training pass and is a means of batch processing in neural networks; processing samples in batches can greatly improve learning efficiency, especially for large-scale samples. As the "batch size" increases, the gradient becomes more accurate, but beyond a certain point further increases are useless. "Epoch" refers to the number of times all samples in the neural network model are trained; more epochs mean more adequate training, but too many epochs also tend to cause overfitting. Therefore, we set the initial values of "batch size" and "epoch" of the DNN and RNN to 50, with a domain space of 10-100. The essence of machine learning training lies in minimizing the loss; after the loss function is defined, an optimizer is needed to perform the gradient optimization of the parameters of the network model. In this paper, the most common optimizer algorithm, Adaptive Moment Estimation (Adam), was chosen as the initial value, and Adamax, Stochastic Gradient Descent (SGD), Root Mean Square Prop (RMSProp), the Adaptive Gradient algorithm (Adagrad), Adadelta, and Nesterov-accelerated Adaptive Moment Estimation (Nadam) were introduced as the domain space of the optimization algorithm during TPE optimization. Each optimization algorithm has its own advantages and disadvantages, so TPE optimization is needed to search for the optimal one.
The hyperparameters of the CNN model are more complicated than those of the DNN and RNN. First, the number of convolutional kernels, "Filters", is also the number of feature maps of the convolutional layer; its initial value is set to 20, with a domain space of 10-100. The "Kernel size" is the size of the convolution kernel, with a search range from 1 to 9, and the domain space of the "Pooling size" is 2 to 5. In addition, "Units", "Dropout rate", "Epoch", and "Optimizers" take the same ranges of values as in the DNN.
The initial values of hyperparameters and their domain spaces for each model to perform TPE optimization are shown in Table 2.

Model Validation Methods
The goodness of the model needs to be judged by evaluation. In this paper, for the binary classification problem of whether a point is a landslide or not, various validation methods were used from different perspectives. First, the most basic method is the Accuracy, along with the Precision and Recall values for validating positive and negative samples, respectively [69]. Second, there are validation methods for binary classification problems such as the F-value, MCC, and Kappa coefficient [49,70]. Third is the most common and most visual method, the ROC curve, which evaluates model merit by measuring the area under the curve (AUC) [71,72]. Finally, the SCAI was used to analyze the area percentage of each class and the proportion of historical landslide points in each class [73]. The formulae for the above evaluation measures are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-value = 2 × Precision × Recall / (Precision + Recall)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Kappa = (p_o − p_e) / (1 − p_e)

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively, and p_o and p_e are the observed and expected agreement, respectively. The value domains of Accuracy, Precision, Recall, and F-value range from 0 to 1, and greater values indicate stronger performance. For the MCC and Kappa coefficient, the results lie between −1 and 1, and again, larger values are better.
The ROC curve is drawn according to the "Sensitivity" and "1 − Specificity" [74]:

Sensitivity = TP/(TP + FN), Specificity = TN/(TN + FP)

The value domain of the AUC ranges from 0.5 to 1, with better performance reflected by higher values [75].
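The AUC can equivalently be computed through its rank interpretation: the probability that a randomly chosen positive sample receives a higher susceptibility score than a randomly chosen negative one. A sketch of this view (illustrative names, not the paper's code):

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via the rank interpretation:
    the fraction of positive/negative score pairs the model orders
    correctly, with ties counting one half."""
    pairs = len(pos_scores) * len(neg_scores)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / pairs
```

A model that scores every positive sample above every negative one reaches AUC = 1.0, while random guessing gives 0.5, matching the value domain stated above.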
The SCAI is calculated by the following equation:

SCAI_i = P_A(i)/P_HGP(i)

where i is the landslide susceptibility class, P_A is the area percentage of each class, and P_HGP is the proportion of historical landslide points in that class to the total points. This method reflects the density of historical landslide points in each class: higher susceptibility classes should have lower SCAI values.
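A per-class computation of this ratio is straightforward; the class names and percentages below are illustrative, not the study's results:

```python
def scai(p_a, p_hgp):
    """SCAI_i = P_A(i) / P_HGP(i) for each susceptibility class i.
    p_a and p_hgp map class names to percentages (each summing to 100)."""
    return {cls: p_a[cls] / p_hgp[cls] for cls in p_a}

# Hypothetical example: a well-behaved model concentrates landslide
# points in the "very high" class, driving its SCAI below 1.
p_a = {"very low": 30.0, "very high": 20.0}     # area share per class
p_hgp = {"very low": 10.0, "very high": 40.0}   # landslide-point share per class
```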

Factors Multi-Collinearity Analysis and Importance
The collinearity analysis between the influencing factors at the 95% confidence level is illustrated in Table 3. The results indicate that no multicollinearity exists among the factors (all VIF values were below 7), which means the selected 17 factors can be adopted simultaneously in these neural network models. Because neural networks are black-box models, they cannot themselves rank factor importance; therefore, we input the training set into the RF model for the factor importance analysis. Figure 9 shows the radar plot of factor importance. The importance value of elevation is much higher than that of the other factors, meaning elevation has the highest priority for the development of landslides. As the most important triggering factor for landslides in the study area, AAP has an importance of more than 0.7, second only to elevation. Plan curvature and NDVI form the third tier (importance value > 0.6), while the distance to rivers and SPI have little effect on landslides (importance value < 0.4).
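As a sketch of the collinearity check, the VIF of each factor can be obtained by regressing it on the remaining factors, VIF_j = 1/(1 − R_j²); this is an illustrative implementation (not the paper's code), and the threshold of 7 used above applies to the returned values:

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples, n_factors)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # design matrix with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return out
```

A factor that is nearly a linear combination of the others has R² close to 1 and hence a large VIF, which is why low VIF values justify feeding all 17 factors into the models together.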

The Optimization Process of TPE
The principle of TPE optimization is to continuously loop through the search for the minimum objective-function value; therefore, this paper iterates with negative accuracy as the return value of TPE optimization for 500 iterations, and the optimization process is shown in Figure 10. The minimum return value of the DNN is −0.7535, generated after 367 iterations; the initial return value of the RNN is the largest and converges to −0.7604 after 258 iterations; and the return value of the CNN is the smallest, reaching −0.7708 after 358 iterations. Overall, the CNN has the best TPE optimization effect, while the RNN and DNN have similar effects; combined with the analysis of the optimized initial values, the RNN is slightly better than the DNN. Table 4 shows the hyperparameter results of TPE optimization for the three neural network models. The optimization results of the DNN and RNN are very similar, including the neurons (units) in the hidden layer, dropout rate, and batch size; only the epoch values differ. Moreover, the Adam algorithm is the best optimizer for both the DNN and RNN, which proves its universality to some extent. For the CNN, the kernel size is increased to 6 and the RMSProp algorithm is chosen as the optimizer, which ensures less oscillation of the return value during optimization and also makes the network converge faster.
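The search loop described above can be sketched as a minimal one-dimensional TPE under simplifying assumptions: a fixed kernel bandwidth, a toy objective standing in for the negative model accuracy, and Gaussian Parzen estimators for the "good" and "bad" trial densities whose ratio selects the next trial. This is a conceptual illustration, not the optimizer actually used in the study (which operates over the full hyperparameter space of Table 2).

```python
import random, math

def objective(x):
    # toy loss standing in for the negative validation accuracy
    return (x - 3.0) ** 2

def parzen_pdf(x, centers, bw):
    """Mixture of Gaussians, one kernel per observed point."""
    return sum(math.exp(-0.5 * ((x - c) / bw) ** 2) / (bw * math.sqrt(2 * math.pi))
               for c in centers) / len(centers)

def tpe_minimize(f, lo, hi, n_init=10, n_iter=50, gamma=0.25, seed=0):
    rng = random.Random(seed)
    obs = [(x, f(x)) for x in (rng.uniform(lo, hi) for _ in range(n_init))]
    bw = (hi - lo) * 0.1                       # fixed bandwidth (simplification)
    for _ in range(n_iter):
        obs.sort(key=lambda p: p[1])           # best (lowest loss) first
        n_good = max(1, int(gamma * len(obs)))
        good = [x for x, _ in obs[:n_good]]
        bad = [x for x, _ in obs[n_good:]]
        # draw candidates from l(x), the density of good trials ...
        cands = [min(hi, max(lo, rng.gauss(rng.choice(good), bw))) for _ in range(20)]
        # ... and evaluate the one maximizing l(x)/g(x), a proxy for expected improvement
        x_new = max(cands, key=lambda x: parzen_pdf(x, good, bw) /
                                         (parzen_pdf(x, bad, bw) + 1e-12))
        obs.append((x_new, f(x_new)))
    return min(obs, key=lambda p: p[1])

best_x, best_loss = tpe_minimize(objective, 0.0, 10.0)
```

Because each new trial is drawn near previously good points, the loop concentrates evaluations around the minimum, which is the behavior visible in the decreasing return-value curves of Figure 10.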

Figure 10. The TPE optimization loss in the iterative process of the three typical neural network models.

Landslide Susceptibility Assessment and Mapping
In this study, the three typical neural network models were constructed using the Keras library (Version 2.3.1) in a Python 3.7 environment. Keras is a high-level API for TensorFlow 2 that facilitates machine learning, especially deep learning, through easy operation [76]. The models constructed from the training set each predicted the probability of landslide occurrence (i.e., susceptibility) for every pixel in the study area. Coupled with MATLAB 2018b and ArcGIS 10.6 software, we plotted the LSMs, thus visualizing the spatial pattern of susceptibility. For better visualization, the LSMs were reclassified via NBM into 5 levels (Figure 11).
The distributions of the high-susceptibility areas of the different models are similar: all of them lie near faults, the lithology is mostly claystone, sandstone, and basalt, and, in terms of topography, the greater the slope, the higher the probability of landslide occurrence. In terms of elevation, high-susceptibility areas are mostly concentrated at mid-high elevations rather than at very high elevations. The main reason is that human activities in this elevation range may change the surrounding geological environment and thus promote the occurrence of landslides, which matches the actual situation and supports the reliability of the results. Regarding lithology, limestone and dolomite are dense, hard, blocky rocks that are brittle, have great shear strength, and can withstand large shear forces without deformation, whereas claystone, mudstone, and basalt contain much clay or gravelly soil, are plastic, have relatively low shear strength, and deform easily, so landslides occur more readily. As for slope, the greater the slope, the greater the potential energy of a landslide and the higher its sliding speed, so landslides are more easily induced by external forces. These results provide an important reference for the zoning of landslide susceptibility and the setting of disaster prevention policies in Shuicheng County.
To facilitate comparison between models, we calculated P_A and P_HGP in each LSM, the most important step required to calculate the SCAI values, as shown in Figure 12. For the DNN model, the area share of each class is very similar, but the distribution of landslide points is clearly concentrated in the higher classes. TPE optimization did not change the structure of the class distribution in the DNN model, but the number of landslide points in the "very high" class increased, which reflects the optimization effect of TPE. For the RNN, the class distribution is U-shaped, with "very low" and "very high" accounting for 32.79% and 21.83%, respectively; after TPE optimization, the percentage of "very low" reached 52.43%. Meanwhile, the "very high" class of the RNN has the highest P_HGP (47.08%), which decreases after TPE optimization, although the "high" class then has a significantly higher P_HGP. The distribution of P_A in the CNN model decreases from "very low" to "very high", with "very low" accounting for nearly half (46.55%), and the CNN has the worst performance in terms of P_HGP. Through TPE optimization, the percentage of the extreme classes is significantly increased, while the P_HGP of the high classes is increased and those of the low and medium classes are decreased, which indicates that the TPE algorithm has a good optimization effect on the CNN model.

Model Validation and Comparison
The properties and applicability of these models need to be validated and evaluated from different perspectives and with different methods. Table 5 displays the validation results of the different models. The Accuracy values of the models in descending order are DNN_TPE, CNN_TPE, CNN, RNN, DNN, RNN_TPE, which indicates that the TPE algorithm significantly improves the accuracies of the DNN and CNN models but has no effect on the RNN. Separating the analysis into Precision and Recall, TPE optimization significantly improves the performance of all three models on the correct rate of positive-sample prediction, as measured by Precision. For the ability to capture positive samples, represented by the Recall values, the RNN has an obvious advantage over the other models, but TPE optimization has a significant negative effect on the RNN, while the other two models are improved. Moreover, F1 focuses on the balance of Precision and Recall; from it we can see that the RNN and the TPE-optimized DNN models are better for positive samples, while the CNN model is the worst. The MCC and Kappa coefficient are both indicators of the comprehensive performance of a model, and their results are closer to the judgment given by Accuracy. Figure 13 illustrates the six ROC curves of these models. The AUC value of the RNN is much higher than those of the DNN and CNN. After TPE optimization, the AUC of RNN_TPE in turn decreases slightly, while the performance of both DNN_TPE and CNN_TPE improves, especially that of the DNN, by 4.6%. This again shows the optimization effect of TPE on the DNN and CNN models, especially the DNN, but not the desired effect on the RNN model.

Figure 13. The ROC curves of the neural network models.

Table 6 presents the SCAI values of the three typical neural network models before and after TPE optimization. Overall, the different models show a trend of lower values for higher classes, with the DNN and RNN models performing better for the lower classes, while for the higher classes each model performs well, especially the CNN model. The advantage of TPE optimization is not obvious in the SCAI results, which may be caused by differences in the ranking of the different prediction results.


Discussion
Up to now, many scholars have conducted studies on LSA using different methods and have compared the strengths and weaknesses of models [77][78][79]. This paper aims to provide an introduction to the application and comparison of three typical neural network models (DNN, RNN, CNN) in LSA, and to optimize their hyperparameters using the TPE algorithm in order to obtain better prediction accuracy and performance. Before training these models, because of the insufficient number of positive samples, this study used hybrid ensemble oversampling and undersampling techniques to prepare the sample set, doubling the positive samples and matching an equal number of negative samples to meet the need for sample balance. Then, multicollinearity analysis was performed on the influencing factors, which proved that the 17 factors are independent of each other and can be input into the models simultaneously. In addition, the RF model was introduced to compare factor importance. From the LSMs generated by the various models, it is obvious that the high-susceptibility regions are mostly distributed in bands along fault zones, and the influence of elevation on landslides is much higher than that of the other factors. As the main predisposing factor for landslides in the study area, the importance of AAP is second only to elevation, which means that the hydrological conditions of geotechnical bodies cannot be ignored in the occurrence of landslides. Combined with the LSM analysis, the high-susceptibility areas are mainly concentrated near the faults, structures where the strata or rock body has been significantly displaced along a rupture surface and where slopes are also steep; these results are consistent with the actual landslide patterns. More high-susceptibility areas lie in claystone, sandstone, and basalt, which is also consistent with our statistics on the lithologies actually more prone to geological hazards.
The local authorities should propose appropriate policies based on the LSM results, focusing on the protection of high-susceptibility areas and restricting development works and activities there. The discussion of factor importance can also provide help and reference for landslide forecasting and early warning by the relevant departments.
Because of the black-box property of neural network models, it is impossible to intervene in the model operations, so tuning hyperparameters in the preparation phase is an important tool for improving model performance. In this paper, the TPE algorithm was used to optimize the hyperparameters, combining the optimization results (Table 4) with the validation results of the models (Table 5 and Figure 13). The DNN passes the input-layer data directly to the hidden layers and finally outputs the results, while in the CNN the data pass through convolution and pooling before the fully connected layer; although the accuracy of the CNN is similar to that of the DNN, its results show polarization, i.e., the model has higher confidence in its predictions. As for the RNN model, the complex recurrent structure of its hidden layer can make full use of the data information, and its performance is also the best. In TPE optimization, the number of samples input to the model in each iteration is controlled by increasing the batch size, but the epochs do not need to increase with it, which may still be influenced by the small sample size. The increase in neurons improves model accuracy, while a higher dropout rate is set to reduce overfitting; as for the choice of optimizer, the robustness of the Adam algorithm is confirmed. In fact, the TPE optimization result of the RNN is very similar to that of the DNN, but the optimization effect on the RNN is not improved as we expected, and the RNN is also the earliest to converge in the optimization process, which may indicate that it cannot be effectively optimized (Figure 10), probably because TPE optimization is not suited to the complex recurrent hidden-layer structure of the RNN model.
For the CNN, among its unique hyperparameters the kernel size increases from 3 to 6; as the convolutional kernel size increases, the receptive field grows and better features are obtained, and because the overall amount of computation is small, this does not lead to excessive runtime. For the fully connected layer, the increase in neurons is small, the overall number of epochs increases while the risk of overfitting is reduced by a higher dropout rate, and the optimizer is replaced with RMSProp, which allows less oscillation of the return value during optimization. The assessment performance of the CNN model is thus greatly enhanced by tuning the hyperparameters.
The LSMs produced by the models were combined and the statistical results of each class compared (Figures 11 and 12). Overall, the spatial distributions of the classes of the three typical neural network models are similar, but the proportion of each class differs significantly. The DNN model divides the area almost into quintiles, and its P_A values decrease from higher to lower classes; the class distribution of the RNN shows slight polarization, while its distribution of P_A is better; for the CNN, the very low susceptibility class accounts for nearly 50%. Verified by multiple methods, the accuracies of the DNN and CNN are similar, while the RNN is the best (AUC = 0.793), which indicates that the recurrent structure of the RNN can fully utilize the sample information. After TPE optimization, the class structures of the DNN and CNN did not change much, but the distribution of P_A became more reasonable. In addition, TPE optimization significantly improves the accuracy of the DNN and CNN (by 3.92% and 1.52%, respectively), and the AUC values of these two models improve by 4.62% and 1.99%, respectively; performance is significantly improved for both, especially the DNN, demonstrating the optimization effect of TPE on the DNN and CNN models. For the RNN, TPE optimization makes the polarization of the susceptibility values more pronounced, but the distribution of P_A becomes worse, so the overall capability is not improved.
Compared with LSMs constructed by other methods, the accuracy of the DNN_TPE and RNN models is higher than frequency ratio (AUC = 0.75), weight of evidence (AUC = 0.76) [16], and LR (accuracy = 0.742, AUC = 0.79) [31], similar to the CNN model proposed by [41] (accuracy = 0.742, AUC = 0.80), but lower than the RNN model of [40] (accuracy = 0.762, AUC = 0.843). The reason for the difference in accuracy between models with similar architectures is that the data input to each model differ, including the sample size and the selection of factors, so they cannot be compared directly. In comparison with our previous related studies in Shuicheng County [36,46], the AUC value of the DNN_TPE and RNN models is higher than that of the Bayesian network (AUC = 0.785), close to the gradient boosting decision tree (AUC = 0.796), but lower than the RF model (AUC = 0.845). This result is similar to the findings of [27]: although the neural network models are more advanced, tree-structured models perform better for one-dimensional data processing and classification. In addition, the neural network models may require too many parameters to tune, which limits their accuracy. Integrating the above discussion, we trained and validated all three models both before and after optimization in order to reflect the effect and significance of TPE optimization. The assessment framework proposed in this paper satisfies the need for accuracy, can provide guidance for disaster prevention and control, and also provides new methods and optimization strategies for LSA research, which has practical and theoretical significance. Figure 14 plots the variation curves of the accuracy and loss-function values of the training and validation sets over the epochs of the six model runs.
Since the purpose of model fitting is to continuously search for the minimum of the training-set loss function, the training loss and accuracy of each model are monotonically decreasing and increasing, respectively, with the iterative process, so we mainly need to observe the changes on the validation set. The validation loss of DNN_TPE decreases with a steeper slope in the initial stage than that of DNN, indicating that TPE optimization has a very intuitive effect on the simple fully connected structure. For the RNN, TPE optimization has little effect on the validation accuracy, but the loss value becomes more volatile and rises as the epochs increase, which introduces a risk of overfitting, so TPE optimization does not produce the expected effect. In fact, CNN_TPE also carries a risk of overfitting, which may be caused by the small number of samples, but it can be clearly seen that the validation accuracy improves significantly, and the LSA generated by CNN_TPE meets the accuracy requirement from the perspective of the actual results and the distribution of historical landslide points. Reducing overfitting remains a problem to be solved in future research; means such as reducing the epochs or increasing the dropout rate can be considered.

Conclusions
Landslides pose a constant threat to the lives and property of mountain people and may also cause geomorphological destruction such as soil and water loss, vegetation destruction, and land cover change, so the assessment of landslide susceptibility is particularly important. The main purpose of this paper was to introduce the TPE algorithm for hyperparameter optimization of three typical neural network models for landslide susceptibility assessment, taking Shuicheng County, China, as an example, and to compare the differences in predictive ability among the models in order to achieve higher application performance; the susceptibility assessment was carried out by extracting LSMs. First, 17 influencing factors of landslides from multiple data sources were selected for spatial prediction. For the problems of imbalanced and small samples, hybrid ensemble oversampling and undersampling approaches were used to double the sample size, which was then randomly split into training and validation sets. Multicollinearity analysis was carried out on the influencing factors, and the RF model was used to rank factor importance. Second, the DNN, RNN, and CNN models were adopted to predict the regional landslide susceptibility, and the TPE algorithm was used to optimize their hyperparameters, respectively, to improve the assessment capacity. Finally, to compare and validate the predictive performance of the models, several objective measures, namely Accuracy, Precision, Recall, F-value, MCC, Kappa value, the ROC curve, and SCAI, were used for evaluation. The results show that the high-susceptibility regions are mostly distributed in bands along fault zones, where the lithology is mostly claystone, sandstone, and basalt. The selected 17 factors have no collinearity problems, and elevation has the strongest influence on landslides, followed by precipitation. The DNN, RNN, and CNN models all perform well in LSA, especially the RNN model, which has an AUC value of 0.793.
TPE optimization significantly improves the accuracy of the CNN and DNN but does not improve the performance of the RNN. In summary, our proposed RNN model and the TPE-optimized DNN and CNN models have robust predictive capability for landslide susceptibility in the study area and can also be applied to other areas with similar geological conditions. In future research, the application of TPE optimization to different neural network models and their variants can be further improved, and the evaluation performance of different machine learning models can be compared and analyzed to a greater extent.
Author Contributions: Conceptualization, G.R.; data curation, G.R. and Y.Z.; formal analysis, G.R. and K.L.; funding acquisition, J.Z. and T.L.; methodology, G.R. and Y.S.; writing-original draft, G.R.; writing-review and editing, Z.T. and X.L. All authors have read and agreed to the published version of the manuscript.