Landslide Susceptibility Mapping at Two Adjacent Catchments Using Advanced Machine Learning Algorithms

Landslides impact on human activities and socio-economic development especially in mountainous areas. This study focuses on the comparison of the prediction capability of advanced machine learning techniques for rainfall-induced shallow landslide susceptibility of Deokjeokri catchment and Karisanri catchment in South Korea. The influencing factors for landslides i.e. topographic, hydrologic, soil, forest, and geologic factors are prepared from various sources based on availability and a multicollinearity test is also performed to select relevant causative factors. The landslide inventory maps of both catchments are obtained from historical information, aerial photographs and performing field survey. In this study, Deokjeokri catchment is considered as a training area and Karisanri catchment as a testing area. The landslide inventories content 748 landslide points in training and 219 points in testing areas. Three landslide susceptibility maps using machine learning models i.e. Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Deep Neural Network (DNN) are prepared and compared. The outcomes of the analyses are validated using the landslide inventory data. A receiver operating characteristic curve (ROC) method is used to verify the results of the models. The results of this study show that the training accuracy of RF is 0.757 and the testing accuracy is 0.74. Similarly, training accuracy of XGBoost is 0.756 and testing accuracy is 0.703. The prediction of DNN revealed acceptable agreement between susceptibility map and the existing landslides with training and testing accuracy of 0.855 and 0.802, respectively. The results showed that, the DNN model achieved lower prediction error and higher accuracy results than other models for shallow landslide modeling in the study area


Introduction
Landslide and debris flow are one of a major natural hazard worldwide due to unfavorable geological causes, weathering patterns, shallow soil deposit and heavy rainfall. Landslides created dangerous menaces to the environment. The Korean peninsula is currently experiencing climate change effects, including annual temperature, precipitation and rate of typhoon occurrence [1]. Shallow seated landslides have often resulted from heavy downpour accompanying typhoons and torrential rains during the summer season (Kim and Lee, 2013).
Landslide prediction is a crucial task because the inherent causes of landslide are complex due to mechanical and chemical weathering, nonhomogeneous soil deposit, orientation of geological lineament, and erosion of a slope. The external causes including rainfall, earthquakes, and the excavation of a slope, and the loading of a slope or its crest are the most powerful agents to trigger  Figure). Location of study area in inset.

Landslide inventory
A landslide inventory map records the geographic location, size, date, and type of landslide [33][34][35][36][37]. The precise detection of landslide emplacements is crucial for the prediction and assessment of the landslide susceptibility models. A new generation of satellite images like Word view, Geo-eye aided a quick reproduction of landslide inventory maps [38]. Recently, the application of interferometry techniques to radar images is extensively used in inventory mapping [39]. However, due to the limitation of resources, in this study, we: (1) examined aerial photographs (provided in the web portal site: www.map.daum.net, www.map.naver.com) to prepare a landslide inventory map. Both web portals give an excellent bird's eye view with a 25 cm resolution. And (2) performed fieldwork. The field observations showed that the predominant mass movements are shallow slides that sometimes ensued as debris flows during periods of heavy rainfalls and occurred on natural and man-made slopes with highly weathered bedrock.

Figure 2
Landslide inventory map of Deokjeok-ri (Training data) and Karisan-ri Catchment (Testing data). Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 4 August 2020 doi:10.20944/preprints202008.0089.v1 As shown in Figure 2, a total of 748 landslides were identified and afterward digitized in ArcGIS 10.3 environment for further analysis in Deokjeokri catchment (training area) and 219 landslides were identified at Karisanri catchment (testing area). Some glimpses of landslide observed during field visits are presented in Figure 3. In this study, a single pixel from the center of each landslide initiation area was extracted. A Gaussian Kernal density function of sampling strategy (an unbiased sample technique) was used [40] for non-landslide pixels. A double number of non-landslide pixels (i.e. 1496 from Deokjeokri and 438 from Karisanri) were sampled in each catchment.

Landslide influencing factor (IF)
The main process for susceptibility modeling is a gathering and designing of a spatial database in a GIS environment [41]. The landslide influencing factors (IFs) considered in this study are the intrinsic factors such as topographic, hydrologic, forest, soil and geological factors, which are collected from available resources and fieldworks. Table 1 represents the spatial database obtained from different sources. In this study, based on reviewing previous researches [3,22,23,42] and availability of data, 14 IFs were utilized to identify the hillslope features of landslide occurrence. Among 14 IFs, topographic (except aspect map) and hydrologic factors are continuous but forest, soil, and geology are categorical factors. The selection of an appropriate pixel is necessary to achieve high precision in landslide susceptibility mapping [43,44]. Tarolli and Tarboton (2006) found that 10 × 10 m resolution is adequate to get better landslide susceptibility prediction performance so that 10 × 10 m resolution digital elevation model (DEM) was created from 5 m interval contours. Similarly, all collected IFs (thematic layers) were rasterized in 10×10 m pixel size. The aspect does not show a direct relation with landslide occurrence, however, the slope aspect of the terrain is related to parameters such as sunlight exposure and precipitation [45,46]. The aspect was divided into north, northeast, east, southeast, south, southwest, west, northwest and flat, as shown in Figure 4a. The elevation is another important element in landslide studies. Study shows that higher altitude tends to cause landslides [45]. The elevation of study area ranges from 195.1 to 1513.67 m asl as presented in Figure 4b. The slope is a very powerful driver of the landslide. In fact, the slope affects the surface and subsurface flow and thus the soil moisture content, soil formation and the likelihood of soil erosion (Cooke and Doornkamp, 1974). In the study area, the slope ranges for 0˚ to 62.3˚ as shown in Figure 4c. Internal relief indicates the topographic breakage, potential energy available for mass movements [48]. Internal relief was prepared by calculating the local height difference within nine pixels (unit area) and in the study area, it ranges from 0.15 to 72.1 m as shown in Figure 4d. Terrain curvature represents the morphology of hillslope ( Figure 4e) and it controls the slope erosion process by convergence or divergence of water flow [49].

Hydrologic factors
Hydrological factors are vital factors in rainfall-induced slope instability. Two types of drainage proximity, i.e. horizontal and vertical IFs were used in this study. The drainage proximity (h) ( Figure  5a) was obtained from the Euclidean function in the GIS environment. A new topo-hydrological factor i.e. drainage proximity (v) was used. It gives the height above the nearest drainage network (Figure 5b). This index considers the height-differences along flow paths. Drainage density is the length of the drainage within a pixel (Figure 5c). The presence of higher drainage density means less percolation but faster surface flow [50].
The SPI indicates the erosive processes, as caused by runoff. As the specific catchment area and slope gradient increase, the measure of water contributed by upslope and the surface runoff also increases [51]. Eq. (1) defines the SPI where As is the specific catchment area, and β is the slope gradient. The spatial distribution of SPI is presented in Figure 5d. The STI reflects the erosive power of the overflow drain and the process of erosion and deposition. To calculate STI, two components of the slope which are responsible for soil loss i.e. length (L) and steepness (S) were considered, as suggested by Moore and Burch (1986), given in Eq. (2) where As is the specific catchment area, and β is the slope gradient. The spatial distribution of STI is shown in Figure 5e. TWI is related to soil conditions and surface runoff [53]. TWI is the tendency of water to aggregate in the catchment and the propensity of gravitational powers to move downslope [54]. TWI is defined as where α is the cumulative upslope area, and β is the slope angle. The distribution of TWI in the study area is shown in Figure 5f. Rickli et al. (2002) addressed the effect of forest in hillslope reinforcement. The crucial role of trees and forests is to forestall mass movements by fortifying and drying soils. The catchments mainly consist of oak, Japanese larch, Japanese red pine, Japanese pine, and Korean pine forest as shown in Figure 6a. The non-forest area consists of agricultural land.

Forest and soil factors
Soil properties affect degrees of drainage and erosion, which influence the landslide occurrence. The particle size and pore distribution affect water movement and the holding of infiltrated water (Kitutu et.al., 2009). Finer soils can hold higher volumes of water than coarse-textured soils [57]. In the study area, predominant soil types are sandy loam soils ( Figure 6b). However, downstream areas were covered by silty clay loam and clay loam. The northern and southern elevated areas of the catchments are rocky and covered by a very thin layer of soil. Landslides are ascertained greatly by the lithological characteristics of the hillslope because the individual lithological unit relates to different weathering degree [58,59]. As described earlier, the lithology of this area includes banded gneisses and granites ( Figure 1). Prominent lineaments and dense joint spacing with weathering and erosions were observed in both catchments, which create weak zones. Drainage system follows the lineaments. The bulky granite peaks is one of the typical characteristics of Inje area.

Modeling approaches
As illustrated in Figure 7, the fundamental working phases of adopted methodology in this study comprises four main steps: 1) preparation of landslide inventory and related database, 2) preparation of training and testing dataset, in this research Deokjeokri catchment was selected as a training area and Karisanri catchment a testing area, 3) landslide Ifs were tested and selected using multicollinearity test, and 4) preparation and comparison of performance of landslide susceptibility maps using RF, XGBoost and DNN models. The quality of the predictions was examined using receiver operating characteristic curve (ROC) technique. The machine learning methods that applied in the present study are briefly described below: Random Forest (RF) introduced by Breiman in 2001, has demonstrated to be an incredible method in classification and regression. It has the ability to deal with high-dimension, continuous and categorical data. RF does not require the assumption about the statistical distribution of the data and it can be considered robust with respect to changes in the composition of the dataset [60]. These properties are very useful in applications using variables that have mutual nonlinear interactions. In this work, we used the "ModelMap" package [61] in R-statistical environment. The "ModelMap" package allows an interface between several existing R-packages to automate and construct the maps.
The training algorithm for RF applies the general technique of bootstrap aggregating or bagging to tree learners. The training set comprises a simple and fast way of learning a function that dependent variable (target) X0 and independent Ifs data X1, X2,…..,Xn to output Y, where Xi can be a mix of categorical and numerical variables. X0 can be binary for classification, or numeric for regression. After training, a prediction for a query points x, each tree independently predicts as given in Eq. (4) and (5) And the forest averages the predictions of each tree

Random Forest (RF)
The XGBoost is based on the Gradient Boosting concept. It uses a more regularized model formalization to control over-fitting, which gives it better performance. It was developed by Chen and Guestrin (2016). In XGBoost, training is done using an additive strategy. Given samples I with independent Ifs data X1, X2,…..,Xn, a tree ensemble model uses K additive function to predict the outputs Y.
where F is the set of all possible trees. The k f is a function at each of the k steps of the descriptor values in i x to a certain output. The differentiable loss function that measures the difference between the predictions and the target is simply a mean-square error. After each step of boosting, the algorithm scales the newly added weights. In this study, an open source code "xgboost" for Renvironment was utilized [63].

Deep Neural Network (DNN)
DNN is inspired by information processing and communication pattern in the biological nervous system [64,65]. In DNN, layers representations are learned via models called a neural network. From a technical point of view, DNN algorithms are multilayered neural networks. The main body of DNN consists an input layer that receives the data, an output layer that gives the prediction and core part is the hidden layer that processed the data. With the rapid development of DNN, a great number of libraries and packages have been developed to set a DNN with minimal effort. In this work, we used the "keras" [66] and "H2O" (LeDell E et.al 2018) packages designed for the R-statistical environment and runs seamlessly on both 'CPU' and 'GPU' devices.

Selection of Ifs
The number of Ifs can be minimized via multicollinearity testing, reducing high-dimensional data. For the susceptibility analysis, the Ifs were examined through a multicollinearity test which is useful to select applicable Ifs. The variance inflation (VIF) and tolerance (TOL) are two indicators of the level of multicollinearity [67,68]. A VIF esteem more prominent than or equivalent of 10 and a TOL esteem under 0.5 demonstrates a genuine multicollinearity issue [69,70]: where the coefficient of determination (R 2 ) is the proportion of the variance in the target variable [71][72][73]. The result is presented in Table 2, which shows that IR has a multicollinearity problem. So, IR was eliminated for further analysis.

Application of RF in landslide susceptibility mapping
Three training parameters are needed to be defined in the RF model; ntree, the number of bootstrap samples for the original data (the default value is 500); mtry, the number of different predictors tested at each node i.e. as many as the IFs; node size, the minimum size of the terminal nodes of the trees [74]. The elements not included in ntree bootstrap sample are referred to as out-ofbag data (OOB) for that bootstrap sample. OOB is used to estimate an error rate called OOB error rate (Figure 8a). The result suggests that when the training of the data was applied to build the model, the error rates stay constant after 100 trees.
The error rate found to be 12% therefore 88% of the model building result was accurate, which makes this a reasonably good model. Figure 8b depicts the variable importance by means of the mean decrease accuracy in percentage. The result shows that drainage proximity (h) has the highest importance followed by slope and forest.

Application of XGBost in landslide susceptibility mapping
Basic training using XGBoost is the most critical part of the modeling. It contains several hyperparameters to be set [75]. We hold the trees constant at the default of 1000 rounds. The learning rate was selected as 0.1 to avoid overfitting the model. And the maximum depth of the tree was chosen as 3. Figure 9a shows the log loss during the modeling process. After tuning the model, it was found that the log loss of training process is 0.48 and log loss for testing data is 0.56 in 90 iterations, respectively. The model has the capability to check the important variables using bias of all IFs. Figure  9b illustrates that drainage proximity (h) is the most important IF followed by slope and curvature during the XGBoost modeling.
(a) (b) Figure 9. a) Loss vs iteration of XGBoost model and b) variable importance.

Application of DNN in landslide susceptibility mapping
There is no universal rule for the determination of an optimum neural net structure. In this study, 3 hidden layers and 13 neurons in each hidden layer were used. The input IFs were normalized and then a "relu" activation function was used to run the model. The loss function is an important output of the model, which is used to measure the inconsistency between the dependent variable and the predicted outcome. Figure 10a presents the loss function during the modelling process. It has been found that as the number of epoch increases, the loss value for the training and test data decrease. After 10 epochs, the curves were stable. The relative importance of the model is presented in Figure 10b. Drainage proximity (h) is the most important IF followed by slope and forest.

Evaluation measures
Three landslide susceptibility maps obtained from RF, XGBoost, and DNN were classified into five classes namely very low, low, moderate, high and very high using Jenks natural breaks classification method as presented in Figure 11 a, b and c, respectively. susceptibility of a landslide occurrence. The high susceptible area is more confined near to the drainage and steep areas. The flat and gentle slopes represent the low susceptible areas. One of the popular methods for model assessing accuracy is a receiver operating characteristic curve (ROC) in landslide susceptibility assessment. The ROC is a useful method to describe the quality of probabilistic prediction systems [76]. The area under the curve can be utilized as a measurement to evaluate the general execution of the model [77] with the goal that the bigger the area is, the better the presentation of the model becomes. A rough guide for grouping the exactness is the customary scholastic point framework. The following rankings were considered for accuracy test [78] The comparison of the three methods shows that the prediction capabilities of these models are different. Among the three models DNN shows the highest prediction capability, followed by XGBoost and then RF model.
In order to compare the reliability of the obtained landslide susceptibility maps, a difference in the three landslide susceptibility methods was evaluated in a 2×2 contingency table. The number of correctly predicted landslides and non-landslides are indicated by TP (True positives) and TN (True negatives), and the number of incorrectly predicted landslides and non-landslides are FN (False negatives) and FP (False positives), respectively. The aim of two-class prediction methods is to separate true cases from false ones. Based on the four descriptors, several further measures can be calculated. Sensitivity also called true positive rate (TPR), and specificity (true negative rate, TNR) shows the ratio of the landslide and non-landslide cases correctly identified by the landslide susceptibility models. Positive predictive value (PPV) (also called precision) and negative predictive value (NPV) is the conditional probability that a landslide or non-landslide variant is predicted as landslide or non-landslide, and accuracy (ACC) is the correctly classified index, respectively. The mathematical basis of these parameters [79] are given as in equations (9) The results during the training and testing stages for implemented models are presented in Table  3, which clearly shows that the execution of all models produced very good outcomes. During training and testing phases, the correctly captured presence of landslides (TP) and absence of landslides (TN) are higher in DNN model than RF and XGBoost models. The ACC of DNN shows 74.01% (training phase) and 83.71% (testing phase), on other hand, 68.52% (training) and 74.73% (testing) for XGBoost and 67.28% (training) and 68.19% (testing) for RF model respectively. Performance results show the prediction of probable occurrence of landslide by DNN model surpasses the remaining models by significant tolerance followed by XGBoost and RF. TP=True positive, FP=False positive, FN=False negative, TN=True Negative, PPV=Positive predictive value, NPV=Negative predictive value, ACC=accuracy

Discussion
The most effective mode to reduce casualties and losses ensuing from landslides are landslide risk management; therefore, high-quality landslide susceptibility maps are an important tool [80] to predict a probable occurrence of the landslide. Landslide susceptibility mapping has been studied more on qualitative and quantitative methods. Previous studies focused on improving the performance of landslide prediction using various machine learning methods. The sample size used for modelling was not sufficient for these methods. Therefore, we used the Gaussian Kernal density function to select a double number of pixels without landslides. The purpose of the study was to apply and compare different machine learning models to obtain accurate landslide susceptibility maps. Therefore, in the present study, three classification algorithms (RF, XGBoost and DNN) were utilized and compared for landslide susceptibility mapping at Inje area. Basically, the RF model is one of the popular and widely used models to make landslide susceptibility [60,81,82] and on other hand, XGBoost and DNN had been explored in limited studies for spatial prediction of occurrence of the landslide, which showed promising results in the study area. Regard to important IFs, three models revealed that horizontal drainage proximity (h) is the most important IF among 13 IFs. This finding justifies the statement given by Nobre et al. (2011) that the terrain nearer to drainage has a strong correlation with soil-water saturation regime, and it is a significant controlling factors for landslide occurrences, especially in the soil slopes. Gökceoglu and Aksoy (1996) also state that a stream adversely affects the hillslope because most of the landslides occur frequently proximity to the stream. Similarly, slope is second most important factor in these implemented models as the slope is the most powerful agent of landslide [85]. On the other hand, SPI or TWI are less significant factors compared to other IFs. Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 4 August 2020 doi:10.20944/preprints202008.0089.v1 Although the applied machine learning models are "black box model", applying machine learning is a field with a growing interest in recent years. The prescient outcomes do not distinguish the reasons for the landslide, however show the relation between landslides and terrain features. Nevertheless, such results may yield understanding into landslide events and which IF affect landslide in the study area. To get more definitive results, more experiments should be done to validate the model with time series landslide inventories.
Previous studies illustrated that RF method can adequately depict probable occurrence of landslide and spatial data [81,82,86,87]. There are several pieces of research on landslide susceptibility using a gradient boosting algorithm [88][89][90]. It was found that the Boosted model has better performance than the RF model. Recently, DNN is becoming popular among researchers [91,92]. Wang et al. (2017) has stated that DNN framework achieved higher or comparable prediction accuracy and this study also supports the previous findings [94][95][96]. Thus, DNN can give better prediction and can improve the understanding of landslide susceptible areas.

Conclusions
Landslide susceptibility mapping is an important tool to predict the probability of occurrences of landslide areas in hilly terrain. Therefore, high-quality landslide prediction models are of paramount importance. Two adjacent catchments were selected for application and comparison of three machine learning models i.e. RF, XGBoost, and DNN. Deokjeokri catchment was selected as training and Karisanri catchment as testing areas. A landslide inventory consisting of 748 landslides at Deokjeokri catchment and 219 landslides at Karisanri catchment were prepared by using GIS. 14 thematic layers were collected as landslide influencing factors. Among these, IR was failed in a multicollinearity test. From the comparison in this study, three models have different accuracy levels. The result of DNN (training AUC=0.855, testing AUC=0.802) is significantly better than XGBoost (training AUC=0.756, testing AUC=0.703) and RF (training AUC=0.757, testing AUC=0.74) models. Three models revealed drainage proximity (h) is the most important IFs. The DNN model is a very promising alternative for shallow landslide modeling in the study area. It is obvious that a highquality landslide susceptibility map can decrease the costs and time. The produced maps can assist in the landuse planning for the planners and governments to provide a better way of keeping safe in the susceptible regions to mitigate further damage.
Author Contributions: Ananta Man Singh Pradhan performed the research, modified the codes, analyzed the data and wrote the manuscript. Yun-Tae Kim designed the research and extensively updated the manuscript.