GIS-Based Machine Learning Algorithms for Gully Erosion Susceptibility Mapping in a Semi-Arid Region of Iran

In the present study, gully erosion susceptibility was evaluated for the area of the Robat Turk Watershed in Iran. The assessment of gully erosion susceptibility was performed using four state-of-the-art data mining techniques: random forest (RF), credal decision trees (CDTree), kernel logistic regression (KLR), and best-first decision tree (BFTree). To the best of our knowledge, the KLR and CDTree algorithms have been rarely applied to gully erosion modeling. In the first step, from the 242 gully erosion locations that were identified, 70% (170 gullies) were selected as the training dataset, and the other 30% (72 gullies) were considered for the result validation process. In the next step, twelve gully erosion conditioning factors, including topographic, geomorphological, environmental, and hydrologic factors, were selected to estimate gully erosion susceptibility. The area under the ROC curve (AUC) was used to estimate the performance of the models. The results revealed that the RF model had the best performance (AUC = 0.893), followed by the KLR (AUC = 0.825), the CDTree (AUC = 0.808), and the BFTree (AUC = 0.789) models. Overall, the RF model performed significantly better than the others, which may support the application of this method to a transferable susceptibility model in other areas. Therefore, we suggest using the RF, KLR, and CDT models for gully erosion susceptibility mapping in other prone areas to assess their reproducibility.


Introduction
Given the harmful effects of gully erosion, strategies for managing and reducing the damage caused by this phenomenon are essential to achieve sustainable development [1]. One of the strategies to achieve this goal is the use of gully erosion susceptibility mapping (GESM) [2]. This is well-known as an essential technique to address the mechanisms of gully erosion. To establish GESM, inventory maps of gully erosion and the methods for measuring the conditioning factors of gully erosion are needed [3]. A gully erosion susceptibility map can be constructed in the GIS environment by considering a set of variables that influence this phenomenon. A detailed analysis of the spatial correlation between these factors and the presence of gully erosion phenomena could also lead to a better estimation of gully erosion susceptibility [4]. Therefore, factors such as climate conditions, topography, geology, soil characteristics, and vegetation have a significant impact on the occurrence of gully erosion [5].
During the last few years, there has been a significant development in the application of machine learning algorithms in natural hazard studies, including floods [6][7][8], wildfire [9], sinkholes [10], drought [11,12], earthquakes [13,14], land subsidence [15,16], groundwater [17][18][19], landslides [20][21][22][23][24][25][26], and gullies [27][28][29][30][31]. Artificial intelligence is considered an advanced technique for predicting gully erosion, as well as managing and reducing the damage caused by this phenomenon. Indeed, Conoscenti et al. [32] assessed the gully erosion susceptibility in Sicily (Italy) using the logistic regression technique. An excellent discriminating ability was confirmed for gully erosion susceptibility with an area under the curve (AUC) greater than 0.8. In another study, Pourghasemi et al. [33] developed an ensemble model of an artificial neural network (ANN) and support vector machine (SVM) for gully erosion susceptibility mapping with promising results (i.e., AUCtrain = 0.897 and AUCtest = 0.879). Finally, the spatial occurrence pattern of gully erosion was properly addressed based on the developed models. Using different machine learning models, Rahmati et al. [34] also predicted and mapped the susceptibility of gully erosion with high performance. The random forest (RF) and RBF-SVM (radial basis function-SVM) models ultimately provided the highest accuracy and stability over other models. A similar study was also implemented for the same purposes based on the Naïve-Bayes tree (NBTree), Logistic Model Tree (LMT), and Alternating Decision Tree (ADTree) models. The findings indicated that the LMT model provided better performance than that of the ADTree and NBTree models. Conoscenti et al. [35] also applied the multivariate adaptive regression splines model to assess gully erosion susceptibility in Italy. Hydrological connectivity was investigated in their study to forecast gully erosion. Similar studies were likewise conducted and presented in [36][37][38][39][40][41][42].
Although gully erosion susceptibility has been studied and predicted by different machine learning algorithms and their accuracy has also been confirmed, these algorithms have not been applied in all areas. Meanwhile, the effects and levels of gully erosion in each region are different [43]. In the present study, gully erosion in the Markazi province (Iran), which has an arid and semi-arid climate, was investigated and predicted. Gullies often occur in this area and cause very extensive damage to the surrounding environment, as well as agriculture productivity [30,31]. Gullies are thus considered a major environmental problem that needs to be controlled and accurately predicted. The efforts of this study aim to establish a gully erosion map and evaluate the gully erosion susceptibility in the Robat Turk watershed in the west of Iran. The study area has an arid and semi-arid climate where various erosions have occurred in recent years, especially gully erosion; therefore, it is very important to study and evaluate this type of erosion in this area.
In the present study, we used four state-of-the-art machine learning models for predicting the gully erosion susceptibility in the Robat Turk Watershed of Iran: kernel logistic regression (KLR), best-first decision tree (BFTree), credal decision tree (CDTree), and random forest (RF). The novelty of this study is its use of the two algorithms, KLR and CDT, to evaluate gully erosion susceptibility in a semi-arid 115 land areas, and the geological materials of these areas are mainly composed of young alluvium and 116 river deposits, these areas are highly sensitive to soil erosion; despite having the least rainfall, the   In terms of land use, most of the study area is surrounded by bare land areas (76%); however, the area has a poor range (18%) and agriculture (6%) units. As most of the area is covered by bare land areas, and the geological materials of these areas are mainly composed of young alluvium and river deposits, these areas are highly sensitive to soil erosion; despite having the least rainfall, the area's water flow is concentrated and can induce rill erosion. With the development of the erodibility processes, these rill erosions become gully erosions in the study area, especially around the waterways where water infiltration is greater.

143
There are many conditioning factors related to gully erosion, such as the rainfall erosion rate 144 and the soil erosion rate. Therefore, a series of geographical, geological, and environmental 145 characteristics should be understood when studying the process of gully erosion [50]. In the first step, 146 a spatial database of gully erosion was established. The recorded gully erosion locations were 147 randomly divided into a training dataset (70%) and a validation dataset (30%) [4,51,52]. Previous 148 studies [31,53], using reliable data, identified the factors that affect the occurrence of gully erosion.

149
Twelve factors were selected, including altitude, aspect, slope, plan curvature, profile curvature, 150 normalized difference vegetation index (NDVI), distance from the river, drainage density, distance  [35] that occurred in the watershed case study.

Gully Erosion Conditioning Factors
There are many conditioning factors related to gully erosion, such as the rainfall erosion rate and the soil erosion rate. Therefore, a series of geographical, geological, and environmental characteristics should be understood when studying the process of gully erosion [50]. In the first step, a spatial database of gully erosion was established. The recorded gully erosion locations were randomly divided into a training dataset (70%) and a validation dataset (30%) [4,51,52]. Previous studies [31,53], using reliable data, identified the factors that affect the occurrence of gully erosion. Twelve factors were selected, including altitude, aspect, slope, plan curvature, profile curvature, normalized difference vegetation index (NDVI), distance from the river, drainage density, distance from the road, lithology, land use, and annual mean rainfall. We explain these conditioning factors and their effect on gully erosion as follows.
Altitude affects the types of vegetation and climate in a region. Therefore, many researchers believe that altitude plays a crucial role in the study of gully erosion [54]. The spatial resolution of the ALOS (advanced land observing satellite) PALSAR-DEM (the Phased-Array L-Band Synthetic Aperture Radar-Digital Elevation Model) is 12.5 m × 12.5 m (downloaded from https://vertex.daac. asf.alaska.edu). Previous studies have also shown that gully erosion is more likely to occur in low altitude areas [55,56] (Figure 3a).
Remote Sens. 2020, 11, x FOR PEER REVIEW 6 of 24 Rainfall acts as a triggering factor and penetrates the cracks in the soil, causing an expansion of 208 the gully in different directions [51]. In this study, the rainfall layer was obtained using three rainfall 209 stations inside and outside the study watershed (from the Markazi County Meteorological Bureau).

210
Different interpolation methods were used to detect the target layer; then, the inverse distance 211 weighting (IDW) interpolation method was used to draw the total annual rainfall map, thereby 212 improving the accuracy [80][81][82].    In the investigation of environmental hazards and the development of susceptibility maps, aspect has a great influence [57]. Aspect, viewed through weathering mechanisms and the geomorphologic process, has an impact on gully erosion [58,59]. The aspect map for this study was extracted from the ALOS PALSAR-DEM and classified into nine classes, including flat, north, northeast, east, southeast, south, southwest, west, and northwest ( Figure 3b).
The slope factor is strongly correlated with the amount of runoff and the soil erosion rate in an area [60]. Areas with low slopes are exposed to gully erosion via the accumulation of water flow [1,61]. This map was prepared in ArcGIS 10.2 using ALOS PALSAR-DEM and classified into five classes, including 0-5, 5-12, 12-20, 20-30, and >30 degrees ( Figure 3c).
Water movement on the hillslopes is affected by the shape of the slope and causes changes in the amount of erosion [62,63]. The ALOS PALSAR-DEM was also used to provide the plan and profile curvatures, which were divided into three categories in ArcGIS 10.2 (Figure 3d,e).
The NDVI is very good for displaying vegetation biomass, leaf area index, crop yield, and vegetation separation and is also used for vegetation related issues [64]. This layer can also be effective in gully erosion modeling [33,65]. The NDVI map was provided using Landsat 8 data from 23/06/2017 (Figure 3f). The value of this factor can be obtained through the following formula: where Red represents the spectral reflectance value of red, and NIR represents the spectral reflectance value of the near-infrared band. The NDVI ranges from 1 to −1 [66]. The gully erosion locations are associated with a drainage network that facilitates the discharge of erosion from the upper reaches of the area [67]. Distance from the river layer was used to understand the effect of the drainage network on gully erosion [60]. The distance from the river was obtained using the Euclidean distance tool in ArcGIS 10.2 ( Figure 3g). Drainage density has a direct relation to the amount of runoff in a catchment area [68]. This factor has a certain influence on the drainage pattern of an area, which depends on different factors, including geological formation, infiltration, soil characteristics, land use conditions, and slope [69,70]. The ArcGIS 10.2 software and line density tools were used to prepare the drainage density layer (Figure 3h).
Some phenomena, such as man-made roads and canals, have a profound effect on the occurrence of gully erosion [67,71,72]. Improper roads in bare lands could cause severe gully erosion [29]. A map of the distance from the road using a road network was constructed by the National Cartographic Center of Iran (INCC) at a scale of 1:25,000 in the ArcGIS 10.2 software (Figure 3i).
Another factor affecting the gully erosion analysis is lithology [73]. Lithological features are related to geomorphologic features and land surface characteristics [74]. Lithology unit maps were generated using a 1:100,000 scale geological map (GSI, 1997) from the Geological Survey and Mineral Exploration of Iran. The lithological map of the Robat Turk area was divided into eight classes (Figure 3j).
In recent decades, geological environments have been increasingly affected by human engineering activities, including oil and gas development [75] and transportation facility construction [76][77][78][79]. The interaction between land-use change and soil erosion has had an important impact on soil erosion and sediment production and has become a major environmental issue [56]. The landsat8 OLI image was downloaded from a web page (https://earthexplorer.usgs.gov/site) to produce the land use map. Then, pre-processing of the images, including geometric and radiometric corrections on the images, was performed in the ENVI 5.3 software to accurately observe the land use in the study area. Then, a false-color image was created for each image to better identify the land use in the area. To survey the land-use changes with a supervised method, land use maps obtained from the Department of Natural Resources of the Markazi Province along with field surveys and Google Earth were used to record numerous training samples for each land use. The maximum likelihood classifier was used to classify the training samples. The identified land uses in the region include bare land, rangeland, and agriculture areas ( Figure 3k).
Rainfall acts as a triggering factor and penetrates the cracks in the soil, causing an expansion of the gully in different directions [51]. In this study, the rainfall layer was obtained using three rainfall stations inside and outside the study watershed (from the Markazi County Meteorological Bureau). Different interpolation methods were used to detect the target layer; then, the inverse distance weighting (IDW) interpolation method was used to draw the total annual rainfall map, thereby improving the accuracy [80][81][82].

Methods
The methodological workflow to derive the final gully erosion susceptibility map is shown in Figure 4.

Multicollinearity Analysis
In natural event studies, multicollinearity points to the lack of independence of the independent variables and their strong associations, which can occur in a dataset due to their high correlation and thus confuse an analysis of their incidence [83]. Variance inflation factors (VIFs) and tolerance (TOL) [84] can determine the relationship between factors. In the present study, the TOL and VIF are used to study the multicollinearity of independent variables in gully erosion modeling [33,85]. If the VIF is higher than 10 and the TOL value is less than 0.1, then the multi-collinearity among the variables is error-prone [84].

Kernel Logistic Regression (KLR)
KLR is a powerful classification technique compared to other traditional classification methods [66]. This model has been successfully used in many of the classified problems [86]. Although KLR can transfer indivisible linear problems, it uses the core to transfer input features to the next space of higher dimensions, but this is not possible for the LR model [87]. The advantages of the KLR learning algorithm include its ability to predict an event according to its probability and its capacity to be extended to multi-class classification problems [88,89]. KLR is known to be a powerful classifier [90]. However, KLR is not sparse and requires all training instances in its model [91]. KLR can intrinsically provide probabilities and straightforwardly develop multi-class classification problems that only require solving an unconstrained quadratic program [45]. Specifically, when the optimization algorithm is suitable, as the algorithm does not need to solve the quadratic equation, it can perform analysis more quickly than other algorithms such as SVM [45,92]. To apply this model, the statistical software R (version 3.5.2) was used.

Credal Decision Trees (CDTree)
The main feature of decision trees is their internal inconsistency; even with a set of different training datasets related to a particular issue, various decision trees will be created. This feature is an essential feature that makes decision trees an appropriate classifier in ensemble models, such as bagging, boosting, and random forest. In this method, like in the classic decision tree technique, every node represents a variable property, and every branch illustrates one of the conditions of this factor [93].
A leaf node is formed when it arrives at a factor and does not provide more information about the factor class; the information is thus based on uncertainty through measurement (the measurement of maximum entropy), and more factors are not entered. The leaf node, which is the predicted value based on the information in the dataset class variable, defines the model's training. When the decision tree is created, a new instance of the X test set is used in the decision tree. For example, after the cases of the X-property variables in the decision tree, the root node is set to a leaf node. Finally, the value of the leaf node is obtained by classifying the X class through a credal decision tree [93]. The advantage of the CDT algorithm is that it offers good experimental results [94,95] and it is especially suitable when noisy data are classified [96]. The disadvantage of CDT is that it only defines discrete variables, does not work with missing values, and does not engage in a posterior pruning process [97].

Random Forest (RF)
RF is one of the most widely used algorithms due to its simplicity and capabilities [98][99][100]. This algorithm generates a forest randomly; this "forest" is a group of decision trees. The construction of a forest using trees is often done via the "bagging" method [101]. The main concept of the bagging method is that combining multiple models will produce better results than a single model. Simply put, a random forest is composed of a set of decision trees; this combination is not only used to make the prediction more accurate but also to make it more sustainable. The advantage of the random forest method is that it can reduce both oscillations and the evaluation of variance [102]. After fully building the trees, the test data are introduced into the tree, and the tree number is obtained for the input vector of an output. Using the average of these outputs, the final output of the model and the empirical distribution of the outputs are calculated by the percentage value and the range of uncertainty. The random regression tree method is a proven method, especially when the number of observations relative to the predictor is small [103].
The node size variable (which represents the number of leaves per branch) was determined in this study by trial and error. The random forest adds random branches to the model as the trees grow. Instead of looking for the most important feature when dividing a node, this model searches for the most important feature among a set of random features. This leads to too many variations and, ultimately, a better model [62,101,104].

Best-First Decision Tree (BFTree)
Ensemble-learning models, or data mining techniques, can specifically consider multiple classifications in the best-first decision tree and determine the importance of each classifier [105]. Prenatal options are available before and after finding the best number of extensions for application through cross-validation in the training data. Although these trees can be fully developed in the same manner regardless of the first algorithm or the first depth, the BFTree uses different pruning modes generated by the first depth method [106]. Another BFTree-based classification algorithm involves the creation of a functional tree with diagonal splits and linear functions in the leaf [107].

Statistical Measures
Although various studies used different methods to evaluate data mining models, we employed the most important measures that have been used by the majority of environmental researchers. In this study, the performance of these models will be evaluated through the following statistical measures: sensitivity, specificity, and accuracy. The following relationships are used to obtain statistical measures [108,109]: where TP (true positive) and TN (true negative) are correctly classified pixels, and FP (false positive) and FN (false negative) are the incorrectly classified pixels.

ROC Curve and AUC
The results were validated with the help of the receiver operating characteristic (ROC) curves [18,73,110]. The ROC curve is plotted based on a 1-specificity (x-axis) against sensitivity (y-axis) [111,112]. The area under the ROC curve (AUC) represents the ability to model whether the predetermined event will occur [113,114]. The closer the AUC value is to 1, the higher the accuracy of the results will be [115]. The AUC can be calculated by this formula [108]: where P and N represent the total number of pixels with and without gully erosion, respectively.

Assessing the Affecting Factors Using Multicollinearity Analysis
In this study, the frequency ratio method was used to study the relationship between each factor category and gully erosion point, and each factor category was given a specific weight [4,53]. The multicollinearity test was used to study the interrelationship among the impact factors, while mainly considering the VIF and TOL ( Figure 5). The experimental results of the multicollinearity test assign the highest VIF values (5.329) to altitude, followed by land use (3.525) and distance from the river (2.875). However, the smallest TOLs were assigned to altitude and land use with values of 0.188 and 0.284, respectively. Overall, since the VIF values did not exceed the critical value of 10, and the TOL was lower than 0.1, there no multicollinearity problem was found among the gully erosion conditioning factors.

308
In this study, the frequency ratio method was used to study the relationship between each factor 309 category and gully erosion point, and each factor category was given a specific weight [4,53]. The

Configuration and Training of the Data Mining Models
Since there is no multicollinearity problem among the gully erosion conditioning factors, all 12 factors were used to train the four machine learning models. Reconfiguration of the model's architecture achieved the goal of optimizing model performance. For the RF model, 100 iterations and 10-fold cross-validation offered the greatest prediction performance. For the CDTree model, the minimum total weight of the instances in a leaf was set as 2, and 10-fold cross-validation was used for pruning. For the KLR model, RBFKernel, a lambda value of 0.01, and a gamma value of 0.01 were used. For the BFTree model, the minimum number of instances at the terminal nodes was set to 2, the number of folds for internal cross-validation was set to 5, and post-pruning was adopted. After the configuration of the model was completed, training and verification were carried out, and the model was ultimately obtained.

Variable Importance
The importance of each gully erosion conditioning factor is another mandatory output used to compute gully erosion susceptibility. The importance values for the 12 conditioning factors were thus considered for gully erosion susceptibility mapping. These values were automatically produced during the model training procedure of the different algorithms ( Figure 6). For the KLR model, rainfall was the most important (0.242), followed by drainage density (0.148), distance from the river (0.112), lithology (0.095), and land use (0.095). For CDTree, the most important factor was also rainfall (0.242), followed by drainage density (0.147), distance from the river (0.127), land use (0.086), and lithology (0.086). The results provided by the BFTree algorithm showed that rainfall was again the most important factor (0.242), followed by drainage density (0.148), distance from the river (0.13), land use (0.087), and lithology (0.081). The RF model also indicated that rainfall was the most important gully erosion factor (0.234), followed by drainage density (0.14), distance from river (0.115), land use (0.084), and lithology (0.074). The results showed that the rainfall layer was the most important among the applied layers. Drainage density and distance from the river layers were also subsequently determined.

Model Performace Evaluation
The performance of the models was evaluated using the statistical measures for both the training and validating datasets. In terms of sensitivity, after using the training datasets, the best model was shown to be the RF (0.853) model, followed by the CDTree (0.847), BFTree (0.841), and KLR (0.794) models ( Table 1). The specificity result showed that the best model was again RF (0.800), but this time the BFTree (0.706) ranked second, followed by the KLR (0.694) and CDTree (0.688) models. In terms of classification accuracy, the best model highlighted by the training dataset was the RF (0.826) model, followed by the BFTree (0.774), CDTree (0.768), and KLR (0.744) models. However, the sensitivity calculated using the validation dataset indicated that RF and CDTree (0.847) were more reliable models than the KLR (0.778) and BFTree (0.667) models. However, the BFTree model had the highest specificity (0.750), followed by the RF, CDTree, and KLR models (0.722). In terms of accuracy, the RF and CDTree models had higher values (0.785) compared to the KLR (0.750) and BFTree (0.708) models. Overall, for all the statistical measures, the values were higher than 0.7. Therefore, all the models possess a strong ability to provide gully erosion susceptibility maps. However, the RF model has a more balanced performance with the training and validation datasets.
The ROC curve [116,117] was used to evaluate the accuracy of the GESM results, and the AUC was used to accurately quantify these models [66,118,119]. Figure 7 shows that the RF (AUC = 0.893) model has the best success rate curve based on the training dataset. Similarly, the RF model ranks first in its prediction rate (AUC = 0.872). According to the literature, the optimization levels of algorithms can be assessed based on the accuracy of their validation rates [120,121]. After comprehensively considering the AUC values of the four models shown in Figures 7 and 8, the RF model was determined to be the most robust and effective.
As shown in Figure 8, the criterion of the standard error (SE) was also used to determine the accuracy rates of the validation dataset; these rates were the best for all four models and were 0.035, 0.041, 0.040, and 0.030 for the KLR, BFTree, CDTree, and RF models, respectively. The results of the confidence interval (CI) also confirmed that the RF model had the narrowest 95% confidence interval (0.806 to 0.922). Further, in conjunction with the results of the AUC, the results of the two other indicators, SE and CI, confirmed that the RF model was more accurate in predicting gully erosion susceptibility than the KLR, CDTree, and BFTree models. was used to accurately quantify these models [66,118,119]. Figure 7 shows that the RF (AUC = 0.893) 368 model has the best success rate curve based on the training dataset. Similarly, the RF model ranks 369 first in its prediction rate (AUC = 0.872). According to the literature, the optimization levels of 370 algorithms can be assessed based on the accuracy of their validation rates [120,121]. After 371 comprehensively considering the AUC values of the four models shown in Figures 7 and 8, the RF 372 model was determined to be the most robust and effective.

Creating Susceptibility Maps Using the KLR, BFTree, CDTree, and RF models
To generate susceptibility maps of gully erosion, in the first stage, each raster representing the gully erosion predictors was multiplied with the related importance value achieved following the training process of the four data mining models. Then, using the map algebra function, the products resulting from the multiplication were summed to derive the final GESM across the study area.
Some classification methods are included in GIS, including natural breaks, quantile, geometric interval, equal interval, and standard deviation. We tested all these models, and the best one was found to be a natural break. The natural break method was thus used to divide the final mapping generated by these models into four categories [23,122]. This method was then used to classify the values of the gully erosion maps because, in this method, classes and classification are determined based on the inherent natural groupings in each group. A break in the class or the threshold of each class indicates that the effects of this group are most similar. On the other hand, these classes will have the greatest differences from each other. Indeed, the classes and tools that are most different from each other are separated and classified in a given situation [123,124]. The application of the KLR model revealed that high and very high susceptibility values were present throughout approximately 37.61% of the Robat Turk watershed (Figures 9a and 10). In total, 61.98% of the research territory was included by the CDTree model in areas with a low gully susceptibility value. In this case, moderate values comprised around 6.20% of the study area, while high and very high susceptibility were spread throughout approximately 31.82% of the study area (Figures 9b and 10). The BFTree model showed that areas with low gully erosion susceptibility were located in 53.31% of the Robat Turk watershed, while those with moderate values spanned over 21.9% (Figures 9c and 10). The results of the RF model revealed that more than 47% of the study area belonged to an area with low susceptibility to gully erosion, while areas with moderate values occupied 19.01% (Figures 9d and 10). Together, areas with high and very high gully erosion susceptibility covered more than 33% of the Robat Turk watershed. By analyzing the results provided by the susceptibility maps of gully erosion derived from the four algorithms, the low susceptibility class was found to occupy the largest area of Robat Turk.
Remote Sens. 2020, 11, x FOR PEER REVIEW 13 of 24 As shown in Figure 8, the criterion of the standard error (SE) was also used to determine the 380 accuracy rates of the validation dataset; these rates were the best for all four models and were 0.035,  To generate susceptibility maps of gully erosion, in the first stage, each raster representing the 392 gully erosion predictors was multiplied with the related importance value achieved following the 393 training process of the four data mining models. Then, using the map algebra function, the products 394 resulting from the multiplication were summed to derive the final GESM across the study area.

395
Some classification methods are included in GIS, including natural breaks, quantile, geometric 396 interval, equal interval, and standard deviation. We tested all these models, and the best one was 397 found to be a natural break. The natural break method was thus used to divide the final mapping 398 generated by these models into four categories [23,122]. This method was then used to classify the 399 values of the gully erosion maps because, in this method, classes and classification are determined 400 based on the inherent natural groupings in each group. A break in the class or the threshold of each

Discussion
In the present study, the important variables affecting gully erosion in the Robat Turk watershed derived using data mining models showed that rainfall, altitude, distance from the river, and land use are more important than the other variables. This fact is largely in agreement with the results achieved by Rahmati et al. [4] and Tien Bui et al. [31], according to which the distance from rivers and land use are more important than other geographical variables. These results confirm that many more locations relevant to gully erosion exist in regions with more rainfall and low altitude values. Further, in areas with bare land, the concentration of gully erosion locations is greater. In other words, while most gullies occur on bare lands, lowland areas contain more gully erosion locations. This shows that bare lands cover the majority of the study area, which is characterized by a low altitude. This also indicates a significantly positive relationship between altitude and land use. Recently, evaluating the factors controlling gully erosion has been widely considered and discussed in the literature [1,34,85,125,126]. In general, altitude and land use are the most commonly reported influential factors [127,128]. In terms of factor importance, distance from the rivers ranked third in this study. Conoscenti et al. [32] noted that most of these gullies are connected to river networks in the area, which increases erosion from the upland area. The influence of land use on gully activity was also reported in Vandekerckhove et al. [127]; however, there are many doubts about the features that induce subsurface gully development. It was reported in Hosseinalizadeh et al. [125] that changes in land use and mismanagement practices play the main roles in establishing gully head cut landforms. The impact of land use on gully development was also reported by Vandekerckhove et al. [127], although there remains uncertainty about the variables inducing gully activity and their interactions with other subsurface processes. It has also been proven that gully rates are highly dependent on the size of the runoff-contributing region above the gully erosion [129].
Additionally, gully erosion is widespread and mitigates other types of soil erosion, such as wind erosion, in different ecosystems. Thus, it is necessary to predict and map gully susceptibility. Data mining methods are reliable tools for mitigating and controlling the influence of soil erosion in different regions all over the world. In the present study, this issue was addressed by comparing and analyzing four data mining models-KLR, BFTree, CDTree, and RF-by applying the twelve affecting factors. All four models showed their most susceptible regions to be located in the northern part of the study area, whereas approximately 51% of the area was specified in the low susceptibility class. The RF model had the greatest AUC of 0.872 for the validation dataset, as well as the smallest standard error and confidence interval. This is because the RF model manages both categorical and continuous data and does not prioritize any model dependencies [130]. Additionally, the RF model handles input data without data elimination, leading to high prediction accuracy [131]. This model has been used in various studies and is mostly reported to be an accurate model [132,133].
Conversely, the BFTree model had the smallest validation rate (AUC: 73.9%) among the three models tested in this area and thus cannot be suggested for use as an advanced technique for the statistical analysis of gully erosion. Based on these results, we conclude that the RF model has the greatest accuracy, while the lowest accuracy was obtained by the BFTree. A comparison between the applied methods and other ensemble models for the spatial distribution of gully erosion could be a primary aim in future studies.
Pham and Prakash [45] compared the efficiency of the KLR and classification and regression tree (CART) algorithms. The authors concluded that the KLR model outperformed and outclassed the CART model in shallow landslide susceptibility mapping at the tri-junction of the Rudrapryag, Tehri Garhwal, and Pauri Garhwal districts (Uttarakhand, Himalayas, India). Nguyen et al. [48] used the CDTree algorithm as a base classifier to construct five hybrid models: Bagging-CDTree, Dagging-CDTree, Decorate-CDTree, MultiBoost-CDTree, and Random Subspace-CDTree. These models were developed for and applied to the groundwater potential mapping of DakLak province in Vietnam. The authors revealed that the ensemble models can significantly improve the performance of the CDTree algorithm. In another study, Chen et al. [18] compared the KLR, RF, and ADT algorithms for groundwater potential mapping of the Ningtiaota region in the northern territory of Shaanxi Province, China. They noted that the RF model provided the highest AUC value (0.811) followed by the KLR (0.797) and ADTree (0.773) models. In a comparison between BFTree, RF, and naïve Bayes tree (NBTree), Chen et al. [134] evaluated their performance and revealed that the RF algorithm outperforms the BFTree and NBTree algorithms in landslide susceptibility mapping.
Recently, Bernatek-Jakiel and Wrońska-Wałach [135] determined that gullies initiate and develop in the regions that are most susceptible to piping erosion. Other scientists have observed that gullies indicate geomorphologic changes throughout the world [136,137]. These observations indicate that gullies are deepened and developed via subsurface processes, mostly at low altitudes [135]. Also, as reported recently by Hosseinalizadeh et al. [125], the positive interactions between collapsed pipes and gully head cuts are due to soil degradation processes. Based on the archived results, gullies are recognized as the main cause of the soil erosion and degradation processes in the study area, as well as the main initiator of the sediment movement delivered to the rivers. Additionally, because of the bare land covering the majority of the study area, the land without vegetation has a negative effect on the infrastructure of the area to be damaged. Thus, soil-erodibility factors can significantly increase due to a high rate of soil loss.

Conclusions
The development of gullies leads to wasteful amounts of soil. Therefore, gully erosion is an important cause of land and environmental degradation. This paper mainly studied the impact of different data mining models on compiling a susceptibility map of a gully. To do this, 12 important and influential factors on gully erosion and 242 gully erosion locations were used. For modeling, the CDTree, KLR, RF, and BFTree machine learning algorithms were used. The results for the effectiveness of the AUC-based models in mapping gully erosion susceptibility showed that the RF model offers the highest efficiency, while the BFTree model has lower performance than the other two models. Reducing the destructive impact of this type of erosion and preventing its deterioration is a problem that needs to be resolved urgently, as it is necessary to identify this type of erosion. Since most gullies are located in the central part of the study area near the village of Robat Turk, the protective practices in these areas must be increased, and the extension of agriculture and residential areas to areas of deterioration should be prevented. Using data mining methods in other study areas by identifying the geological conditions and geo-environmental factors affecting gully erosion could be used to save time and costs when mapping the susceptibility of erosion.
Studies also showed that susceptibility maps can effectively help select various mitigation options [138]. Land use planning or urban construction could be carried out in low gully-prone areas, which could effectively reduce the losses caused by environmental hazards [139]. Susceptibility maps are of great significance; most notably, the rational use of these maps in the planning stage can yield large benefits. As a result, the proposed methods and corresponding susceptibility maps could aid local governments and related organizations in pursuing new and existing spatial planning projects, thereby taking effective measures to achieve disaster reduction and loss reduction [140]. Therefore, we suggest using the RF, KLR, and the CDT models for gully erosion susceptibility mapping in other prone areas to check their reproducibility. The results of this study provide beneficial insights for sustainable development strategies and minimizing destructive hazards to the surrounding environment via gully erosion susceptibility mapping.

Conflicts of Interest:
The authors declare no conflict of interest.