Landslide Susceptibility Assessment Using an Optimized Group Method of Data Handling Model

: Landslides can cause considerable loss of life and damage to property, and are among the most frequent natural hazards worldwide. One of the most fundamental and simple approaches to reduce damage is to prepare a landslide hazard map. Accurate prediction of areas highly prone to future landslides is important for decision-making. In the present study, for the ﬁrst time, the group method of data handling (GMDH) was used to generate landslide susceptibility map for a speciﬁc region in Uzbekistan. First, 210 landslide locations were identiﬁed by ﬁeld survey and then divided randomly into model training and model validation datasets (70% and 30%, respectively). Data on nine conditioning factors, i.e., altitude, slope, aspect, topographic wetness index (TWI), length of slope (LS), valley depth, distance from roads, distance from rivers, and geology, were collected. Finally, the maps were validated using the testing dataset and receiver operating characteristic (ROC) curve analysis. The ﬁndings showed that the “optimized” GMDH model (i.e., using the gray wolf optimizer [GWO]) performed better than the standalone GMDH model, during both the training and testing phase. The accuracy of the GMDH–GWO model in the training and testing phases was 94% and 90%, compared to 85% and 82%, respectively, for the standard GMDH model. According to the GMDH–GWO model, the study area included very low, low, moderate, high, and very high landslide susceptibility areas, with proportions of 14.89%, 10.57%, 15.00%, 35.12%, and 24.43%, respectively.


Introduction
Landslides are one of the most widespread and common natural hazards in hilly and mountainous areas [1], where they can cause loss of life, property damage, and severe damage to the environment and foundations of roads and highways. Models predicting landslide occurrence have attracted considerable attention, as landslides have increased in many parts of the world due to both climate change and anthropogenic activities. Although more high-intensity and short-duration rainfall (heavy rainfall) is occurring, thus triggering more landslides, the pressure exerted by the urbanization of mountain areas, the cutting of trees to construct homes, and poor watershed management are the most important reasons for the increase in landslides [2]. The Centre for Research on the Epidemiology of Disasters (CRED) declared that 17% of the damage caused by natural disasters is due to landslides. Moreover, many researchers believe that this proportion will increase due to additional pressure on the environment from population growth, and to the effects of climate change [3].
Their RF model had high predictive power in landslide hazard-prone areas. Chen at al. (2018) applied kernel logistic regression for landslide modeling in Shangnan County, China; their model showed reasonable accuracy. Pham et al. [30] compared the predictive accuracy of RF, logistic model tree (LMT), best first decision tree (BFDT), and classification and regression tree (CART) landslide models in India. They found that the RF model outperformed the other models. Although these methods have been applied successfully for landslide modeling, they have lower performance than traditional approaches, such as the ANFIS and SVM, etc. Pham et al. [31] compared the performance of naïve Bayes tree (NBT), Bayes network (BN), naïve Bayes (NB), and decision table naïve Bayes (DTNB) data-mining algorithms with the SVM for landslide susceptibility mapping in India, and concluded that the SVM outperformed the algorithms. Honge et al. [32] used hybrid algorithm of AdaBoost, Bagging and Rotation Forest with decision tree algorithm for landslide susceptibility mapping at China. They declared that hybrid algorithm of Rotation Forest had a higher prediction power than AdaBoost, and Bagging models. Bui et al. [33] used four hybrid models of Adaboost, bagging, rotation forest, and random subspace with SVM model and finally stated that random subspace model outperform of other models. Napoli et al. [34] applied hybrid algorithm of artificial neural network, generalized boosting model and maximum entropy for landslide susceptibility mapping at Italy and finally stated that hybridization improved the prediction power of the standalone models. Table 1 summarizes the literature review. Ngo et al. [17] applied deep learning algorithm of convolutional neural networks (CNN), and recurrent neural networks (RNN), to future landslide occurrences prediction at national scale of Iran. They finally stated that both algorithms have the high accuracy, while RNN model is more accurate than CNN.  [36] kernel logistic regression Pham et al. [30] RF, LMT, BFDT, CART Pham et al. [31] NBT, BN and NB Honge et al. [32] AdaBoost, Bagging and RF Bui et al. [33] SVM Napoli et al. [34] ANN and Ngo et al. [17] Deep learning (RNN and CNN) The group method of data handling (GMDH) is a neuron model that has been applied successfully for time-series prediction, but not for natural hazard susceptibility assessment (especially landslide modeling). Ebtehaj et al. [37] applied the GMDH for discharge coefficient (Cd) prediction of side weirs; the GMDH showed very good predictive performance. Shaghaghi et al. [38] used the GMDH for stable channel geometry analysis and stated that the GMDH model outperformed other empirical equations.
The main objective of the current study was to investigate the effectiveness of the GMDH for landslide susceptibility mapping. The fact that the GMDH has not yet been applied for natural hazard susceptibility assessment was the motivation for this study. The GMDH is a powerful neuron model; similar to other neuron models, it suffers from the determination of weights of membership function and other operators. Thus, in the present study, the predictive performance of a hybrid GMDH/gray wolf optimizer (GWO) was also examined.

Study Area
The Bostanlyk District is located in the Tashkent region of Uzbekistan between long 66 45 E to 66 12 E and lat 38 45 N to 38 60 N, which is characterized by relatively uniform relief (mainly hills, mountains, and other areas of elevation). Within the Bostanlyk District, the Tekstilsik landslide is a large, recent landslide area located on the left bank of the Chirchik river basin, within the northern part of the Alizar anticline [which is characterized by Cretaceous clays and siltstones at the intersection of the "Alizar uplift" (with northwest strike) and the Karankul fault (with latitudinal strike; Figure 1 In the study area, from top to bottom, there stands out an undulating watershed of mountain rivers, which formation was completed via tectonic-erosion cycles. The slopes of the terrace surfaces are not exceeded 8-10 • , dividing their slopes have a steepness of 12-15 • . In the central part, the surface of these terraces and slopes are complicated by large cirques formed by landslide occurrences. Within the landslide area, the upper generation, the lower and the landslide toe are distinguished. In turn, the upper generation is divided into an upper stage and a lower one which a landslide depression separating them. The fault wall of the upper generation (elevation 1018-1108 m) is confined to the terrace ledge, the landslide depression and the lower step to the terrace surface (1020-900 m). The toe of the landslide falls at 790-850 m. Modern landslide displacements can be traced down the slope 1.4-1.5 km, 500 m wide in the upper and 250-300 m in the lower zones. The peculiarity of the structure is its division into two parts with the formation of a graben along the line, due to the uplift. The influence of natural factors on the dynamics of a landslide (geological, morphological, physical, artificial) is associated with the confinement to the uplift zone, the rise in the level of pressurized underground waters of the complex aquiferous of Cretaceous beddings. The reasons that play a trigger role are the simultaneous waterlogging of rocks from melted snow, precipitation and groundwater.
In fact, the Tekstilsik landslide is divided into two parts by graben formation due to the Alizar uplift. The landslide area is composed of 20-30-m-thick loess rock in the middle/upper part, but the thickness decreases to 5 m in the lower part.
Clay loams are separated by interbedded Cretaceous clays, siltstones, sandstones, and gravelites. The thickness of the layers varies from 2-4 m, which are at least 100 m deep. The water-bearing horizon is in the range of 25-45 m. Groundwater from springs at the base of scarps flows down the graben and slope, and forms small lakes and swamps. The groundwater emanates from loamy Quaternary sediments at a depth of 0.4-26.0 m, as springs with flow rates of 0.1-0.5 l/s in both the upper and lower areas. Pressurized interlayer groundwater emanates from Cretaceous sediments underlying the Quaternary sediments.
It is worth mentioning that the Tekstilsik landslide is located in the northern part of Uzbekistan on the left bank of the Chirchik river basin; the second zone is located in the southern part of the Aksu-Tankazdarya river basin, and occupies the middle and upper parts of the river valley. The area of the zone is about 160-170 km 2 . Paleozoic rocks are predominant in the upper reaches, while Meso-Cenozoic sandy-clayey rocks (mostly Cretaceous) are well-represented in the middle part of the valley. These deposits are overlaid by 5-20-m-thick loams. The study area is favorable for the formation of block landslides because of the monoclinal bedding of Meso-Cenozoic rocks, which also contain several independent aquifers. Interstratal karst provides favorable conditions for water saturation of the overlying loams and underlying clays, leading to the formation of large landslides. Landslides are most active on the right bank of the Akdarya and the left bank of the Tamshushdarya in the villages of Miraki, Nondek, and Tamshush, where Meso-Cenozoic rocks are bedded in a manner consistent with the slope. Within these villages, block landslides may occur; loams and underlying sandy-clayey deposits extending to a depth of 20-30 m are involved in the landslides, the volumes of which can reach tens of millions of cubic meters (e.g., the Nondek landslide on 04/24/92: volume = 22 million m 3 ). The landslides move relatively slowly as elastic flows.
The Tankazdarya landslide zone occupies the middle and upper parts of the river valley. Geologically, the zone is confined to the front part of the Lyangar thrust fault. The landslides are most frequent on the limbs of anticlinal structures composed of Cretaceous-Neogene rocks. As the area is characterized by cuesta relief, the landslides are usually displaced along several sliding surfaces and are large in scale (Table 2 and Figure 2). Table 2. Hazardous geological processes (HGP) documented in the Tekstilsik area.

No
Year of HGP Type of HGP Characteristics of HGP

Historical Landslide Data
The first step in natural hazard modeling, such as landslide susceptibility mapping, is collecting and compiling a dataset of historical events. The aim is to identify relationships between landslide historical data and a geo-environmental dataset, and then to use these relationships to formulate and predict future landslide occurrences. Thus, the accuracy and quality of the historical dataset used has a significant impact on the prediction results [39]. In the present study, 210 polygons were initially identified on aerial photographs as the triggering area of the landslide and subsequently confirmed through extensive field surveys ( Figure 2). It means that their location was detected much easier in the field surveys. Their locations were recorded using GPS handhold. Afterwards, the center of gravity of each polygon was considered as the landslide location. As landslide susceptibility assessment uses a binary modeling approach, except for 210-landslide location, the same number of locations from those areas in which landslides had not occurred were generated randomly in ArcGIS 10.2 (Esri, Redlands, CA, USA). Shirzadi et al. [9] stated that the dataset should be divided into training and model validation datasets; the data used in this study were divided thusly (70% and 30%, respectively). Both the landslide and non-landslide locations were divided into two parts (70% and 30% of the data). The 70% portions of the landslide and non-landslide locations were combined and used as the model training dataset; the 30% portions were integrated to form the testing dataset for model validation.
First, a digital elevation model (DEM) was devised with a pixel size of 30 × 30 m, which was downloaded from USGS Earth Explorer website, and altitude, slope degree, slope aspect, and LS data were derived directly from the DEM using ArcGIS 10.2 (Esri, Redlands, CA, USA). Altitude has an indirect effect on landslide likelihood through its major impact on rainfall, vegetation cover, soil thickness, and tectonics, all of which are important factors in landslide modeling [42][43][44]. These factors were also modeled using ArcGIS 10.2 (Figure 3a). Slope degree affects the probability of landslide occurrence via shear stress and gravitational acceleration, which tend to move mass down a hillslope if the soil layer is sufficiently thick: steeper slopes increase the probability of landslide occurrence. A slope map of the study area is shown in Figure 3b. Soil wetness varies by slope aspect, attributable to the effect of solar radiation, as do hydrological processes such as runoff, infiltration, evaporation, etc., which also affect the probability of landslide occurrence. According to the literature, the south aspect of the study area is wetter such that the landslide probability is greater. A slope aspect map is presented in Figure 3c. Valley depth (difference in elevation between the valley and upstream ridge) affects slope stability and thus the probability of landslide occurrence [45]. We calculated valley depth based on the method of Menking et al. [46], as shown in Figure 3d. The TWI is a major factor which indicates the soil moisture on slopes and, by extension, on infiltration, runoff generation and ultimately landslide probability [30]. Our DEM was used to derive the TWI via SAGA-GIS software (http://saga-gis.org) (Figure 3e).
The LS also has a major effect on landslide probability: longer LS promotes landslide occurrence, the longer LS, the more gravitational acceleration can affect detach part of the hillslope [41]. This factor was calculated from our DEM using SAGA-GIS software (Figure 3f). According to the literature, most of the landslides occurred near the road network system [40]. Road construction can cause a discontinuity in the soil on a hillslope, making it more prone to landslides. The distance between landslides and roads was calculated based on the road network layer using ArcGIS 10.2 (Figure 3g). When a landslide site is closer to a river, soil moisture is higher. Therefore, the hillslope is heavier, which makes it more prone to landslides. Pham et al. [30] reported that about 65% of the landslides in Uttarakhand, India, were close to a river. The river layer in our study was based on the river network layer in ArcGIS 10.2 and is shown in Figure 3h. Lithological formations differ in structure and strength, and greatly affect the likelihood of soil discontinuity and landslide occurrence [47]. Nguyen et al. [41] stated that the geology and slope degree have a major effect on landslide probability. The lithological units in the study are shown in Figure 3i. All conditioning factors prepared in proper classes shown in Figure 3 and Table 3.  Finally, the training data were overlaid with all of the conditioning factors and used for modeling. The data were converted into an Excel file (Microsoft Corp., Redmond, WA, USA) for use in MATLAB (MathWorks, Natick, MA, USA).

Group Method of Data Handling (GMDH)
The GMDH, which was developed by Ivakhnenko [48], can be employed to solve highly nonlinear problems, and for multiple-input to single-output systems because of its heuristic self-organizing characteristics [49]. The GMDH can select input variables, the number of neurons, and the number of layers to optimize the model. The GMDH involves different pairs of neurons, which are connected through a quadratic polynomial function to generate new neurons in the next layer. For a given input vector x = {x i1 , x i2 , x i3 , . . . , x in }, the mathematical relationship between the observation input data and output variables can be computed by the Volterra-Kolmogorov-Gabor function, as follows: whereŷ represents the forecasted value and A(a 1 , a 2 , . . . , a n ) denotes the undetermined weighting coefficients, which can be calculated via regression by minimizing the square of the difference between the forecasted values and actual output [50]. Additional explanation of the GMDH and its equations is provided elsewhere [51][52][53][54].
To determine the optimum number of hidden layers, which each contain a set of neurons, the GMDH can be combined with metaheuristic algorithms, such as the GWO.

Gray Wolf Optimizer (GWO)
The meta-heuristic GWO was proposed by [55]. The GWO is based on the leadership hierarchy and hunting strategy of gray wolves. Generally, wolves in a pack are divided into four types with specific roles: alpha (α), beta (β), delta (δ), and omega (ω). The alpha wolf (most dominant) is the leader of the pack. The omega wolves (least dominant), which play a significant role in preventing infighting, are at the bottom of the social pyramid [56,57]. The hunting behavior of gray wolves encompasses encircling, hunting, and attacking the prey [58].
When mathematically modeling the social hierarchy of wolves, the three best solutions in terms of the fitness function value are called α, β, and δ, respectively; all other solutions are considered as ω. The encircling process can be described using the following equations: where → X p and → X represent the position vectors of the prey and a gray wolf, → A and → C denote coefficient vectors, and t indicates the current iteration.
→ r 1 and → r 2 are random vectors that take values between 0 and 1, and → a changes linearly from 2 to 0 during each iteration [59]. After encircling the prey, the hunting process is conducted by wolves α, β, and δ. Given their superior knowledge of the likely position of the prey, other wolves modify their positions accordingly, as described by the following formulae [36]: Finally, the prey will be attacked when |A| ≺ 1, and algorithm exploitation will be improved. When, |A| 1 exploration is emphasized such that the wolves find better prey [60,61].

Model Validation
Model validation is one of the most important steps in the modeling process to ensure reliability [62]. The present study used receiver operating characteristic (ROC) curve analysis, which is the most popular method for model validation [4,8]. ROC curve analysis quantifies model performance according to the area under the ROC curve (AUC); this is the main advantage of this approach [63]. The AUC varies between 0.5 and 1; and ideal model has an AUC of 1 [4]. Figure 4 represents an overview of employing the proposed models to generate the landslide susceptibility maps.

Impact of Conditioning Factor Subclasses on Landslide Probability
In this section, the FR was used to determine the impact of the conditioning factor subclasses on landslide probability in the study area ( Table 3). The results showed that the probability of landslide occurrence increased with altitude until 1736 m; it decreased thereafter. The probability of landslide occurrence was highest at an altitude of 1275-1420 m. Among the slope aspects, landslide probability was highest with southwest and south aspects. Landslide probability was also greater with a longer LS (3.12-18.94 m). Distances from the landslide area to the river below 590 m also promoted landslide occurrence (FR > 1) relative to distance larger than 590 m. Greater distance of the landslide area from a road, i.e., more than 2365 m, was associated with lower landslide likelihood compared to distances below 2365 m. Regarding slope degree, the landslide occurrence would be higher for moderate slopes. Thus, only 6-13 • slopes in the study area promoted landslide occurrence. Finally, landslide occurrence probability was higher when valley depth was lower.

Landslide Susceptibility Maps
After training each model successfully, the algorithms were used to compute the landslide probability occurrence index (LPOI). First, the data for the entire study area were converted into raster format, with the same pixel size as the DEM. Next, the trained models were used to calculate the LPOI for each pixel of the study area. Then, the LPOI values were converted for compatibility with ArcGIS 10.2 to generate the landslide susceptibility maps. Finally, the LPOIs were divided into very low, low, moderate, high, and very high landslide susceptibility quantiles (Figure 5a,b). The use of quantiles is one of the most popular approaches for natural hazard probability classification. Ayalew et al. [64] and Akgun et al. [65] found that the quantile approach was superior when the LPOIs have positive or negative skewness. The landslide susceptibility maps attained in this study were validated using the testing dataset and ROC curve analysis. The findings showed that the GMDH model using the GWO (GMDH-GWO model) had better predictive power than the standard GMDH model in both the training and testing phases. The standard GMDH and GMDH-GWO models had AUC values of 0.853 and 0.944 for the training dataset and 0.827 and 0.901 for the testing dataset, respectively, equating to accuracies of 82% and 90%, respectively ( Figure 6). The GMDH-GWO model was superior to the standard GMDH model in the training and testing phases, by 9.6% and 8.5%, respectively. Moderate-susceptibility areas in the GMDH model were classified as highly landslide prone by the GMDH-GWO model. Thus, the standard GMDH model was less accurate in calculating the LPOI than the GMDH-GWO model. The standard GMDH and GMDH-GWO models considered most of the study area to be of moderate-and high susceptibility, respectively. As shown in Figure 7, the standard GMDH model predicted that approximately 19.73%, 11.74%, 26.20%, 21.90%, and 20.43% of the study area was very low, low, moderate, high, and very high landslide susceptibility area, whereas the respective values for the GMDH-GWO model were 14.89%, 10.57%, 15.00%, 35.12%, and 24.43%, respectively. The proportion of very high landslide susceptibility area was 4.00% lower according to the standard GMDH model relative to the GMDH-GWO model.

Discussion
The result indicates that probably of landslide occurrences increased up to a given altitude and then again reduced. This is due to the thinner soil at very high altitudes, and more rocks. Landslide probability was highest with southwest and south because of their higher soil moisture (in turn due to lower solar radiation) in these aspects compare with other ones. Landslide probability was also greater with longer LS. In fact, the longer LS, the more gravitational acceleration can affect detach part of the hillslope [47]. Distance from river more than 590 m has the lowest effect on the landslide occurrence in the study area and shows that river can be a potential source of the landslide occurrence. Regarding to the distance from the road, the more distance the lower landslide probability happens which shows that cutting the hillslopes make them prone to landslide occurrences. Regarding slope degree, for shallow slopes, the weak gravitational force is insufficient to cause downward movement of soil, while the soil on very steep slopes is relatively thin. Thus, only 6-13 • (i.e., moderate) slopes in the study area promoted landslide occurrences. Finally, the landslide occurrence probability was higher when valley depth was lower. These findings are in accordance with previous results [4,5].
In addition, percentage of landslide locations within each class of landslide susceptibility map is shown in Figure 8. It shows that the most numbers of landslides fall into very high classes and this shows the high performance of the developed models. As shown in Figure 5, most of the study areas are prone to landslides. Through visual comparison of the landslide susceptibility and landslide conditioning factor maps, it was obvious that high/very high susceptibility to landslides was strongly associated with altitude, LS, valley depth, and slope degree. According to the literature, the predictive power of a given model depends not only on the data quality, but also on the algorithms and assumptions used during model development [16]. More flexible models are better for modeling complex processes, such as natural hazards. Our GMDH-GWO model benefitted from both algorithms; inclusion of a second algorithm can overcome weaknesses in the base model. This is in accordance with [26], who also stated that ineffective conditioning factors reduce model performance. The high predictive power of the GMDH-GWO model in the current study (90%) indicated that all of the conditioning factors were effective in increasing model performance. The susceptibility map in this study could facilitate landslide hazard mitigation, by justifying prevention of any construction and urban development in high-susceptibility areas.

Conclusions
Although landslide risk reduction is important, landslide occurrence is a complicated process that cannot be predicted accurately using simple models; thus, in the present study, standard GMDH and GMDH-GWO models were applied to generate a landslide susceptibility map for a specific region in Uzbekistan. The main findings were as follows:

1.
Landslides were most likely in the study area at altitudes of 1420-1736 m, and with southwest slope aspects, an LS of 3.12-18.94 m, a distance from the river of 65-318 m, a distance from the road of 0-1440 m, a 6-9.7 • slope, a TWI of −24 to −145, and a valley depth of 77-144 m.

2.
The GWO enhanced the predictive power of the standard GMDH model. 3.
The performance of the GMDH-GWO model was superior to that of the standard GMDH model in the training and testing phases, by 9.6% and 8.5%, respectively. 4.
Author Contributions: Azam Kadirhodjaev; Writing-Original draft preparation, Fatemeh Rezaie; collected data, statistical analysis and writing methods, Saro Lee and Moung-Jin Lee; provided critical insights and contributed to the final version of the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.