Spatial Modelling of Gully Erosion Using GIS and R Programing : A Comparison among Three Data Mining Algorithms

Gully erosion triggers land degradation and restricts the use of land. This study assesses the spatial relationship between gully erosion (GE) and geo-environmental variables (GEVs) using Weights-of-Evidence (WoE) Bayes theory, and then applies three data mining methods—Random Forest (RF), boosted regression tree (BRT), and multivariate adaptive regression spline (MARS)—for gully erosion susceptibility mapping (GESM) in the Shahroud watershed, Iran. Gully locations were identified by extensive field surveys, and a total of 172 GE locations were mapped. Twelve gully-related GEVs: Elevation, slope degree, slope aspect, plan curvature, convergence index, topographic wetness index (TWI), lithology, land use/land cover (LU/LC), distance from rivers, distance from roads, drainage density, and NDVI were selected to model GE. The results of variables importance by RF and BRT models indicated that distance from road, elevation, and lithology had the highest effect on GE occurrence. The area under the curve (AUC) and seed cell area index (SCAI) methods were used to validate the three GE maps. The results showed that AUC for the three models varies from 0.911 to 0.927, whereas the RF model had a prediction accuracy of 0.927 as per SCAI values, when compared to the other models. The findings will be of help for planning and developing the studied region.


Introduction
Today, reducing natural resources, especially soil and water, is one of the major problems and major threats to human life and is one of the most important environmental problems worldwide that has intensified in recent years, with increasing population and the alternation of human activities [1].According to the data from United Nations research, the world's population is growing at a rate of 1.8% per year and it is expected to rise from 8 billion in 2025 to 9.4 billion in 2050 [2].This increase in world population would demand the need for food, water, forage, and others, which consequently would add huge pressure on land exploitation, non-standard exploitation, and eventually lead to an increase in erosion rates [1,3].Soil erosion is one of the factors that endangers water and soil [1].Soil erosion by water, such as GE, is considered as a major cause of land degradation around the world [4,5].It leads to a range of problems, such as desertification, flooding and sediment deposition in reservoirs [6,7], the destructive effects on the ecosystem reducing soil fertility, and imposes huge economic costs [8].GE is typically defined as a deep channel that has been eroded by concentrated water flow, removing surface soils and materials [9,10].The amount of moisture and its changes as a result of the dry and wet seasons is a main parameter in creating cracks and grooves in fine-grained clay formations containing clay and silt, and ultimately developing rilled erosion and gullies [10].The alternation of warm and dry seasons makes it possible to create cracks, in the formation of fine grains, in warm seasons with the drying of the land and the wilting of the vegetation, and these cracks at the time of the first sudden rainfall concentrate the runoff and therefore cause rill and GE to emerge [11].GE occurs when the erosion of the water flow or the erodibility of the sediments or the formation of the area is higher than the geomorphological threshold of the area [11].Mapping gully erosion systems is essential for implementing soil conservation measures [6].GEVs that influence gully occurrence are rainfall, topography-derived factors such as elevation, slope degree, slope aspect, and plan curvature, lithology [12], soil properties [13], and LU/LC [14].The distribution of precipitation affects the hydraulic flow and moisture content of the soil, and the erosion strength of the flow and soil resistance to erosion is different before and after erosion [11].Generally, the amount and volume of flow are controlled by the topographic features of the area including slope, aspect, and drainage area of the area.Depth and morphology of the cross section of the gullies are controlled by soil erodibility features of the geological layers of the area.The characteristics of the region's soil affect the subsurface flow and the phenomenon of piping erosion, and the pipes cause a gully when their ceiling collapses [10].
Susceptibility maps of GE are essential for conservation of natural resources, and for evaluating the relationship among gully occurrence and relevant GEVs [12].Several models have been applied to assess soil erosion and GE rate in a quantitative and qualitative way, such as the Universal Soil Loss Equation (USLE) [1,15], Erosion Potential Method, Modified Pacific Southwest Interagency Committee Model (MPSIAC) [16], Water Erosion Prediction Project (WEEP) [17], European Soil Erosion Model (EUROSEM) [18], Ephemeral Gully Erosion Model (EGEM) [19], and Chemicals, Runoff, and Erosion from Agricultural Management Systems (CREAMS) [20].
A comprehensive literature review shows that there are still dimensions that require further research, and that a large number of potentially useful methods have not yet been fully implemented to provide GE susceptibility maps.The main objectives of this study are: (i) To determine the relationship between gully occurrence and conditioning factors using Weights-of-Evidence Bayes theory, (ii) assessing the capability of RF, MARS, and BRT data mining/machine learning models to predict GE susceptibility; and (iii) validation of models using the AUC curve and SCAI methods.Study Appl.Sci.2018, 8, 1369 3 of 21 of the research background showed that using MARS, BRT, and RF data mining models in GE zonation is very new.It will help managers in future planning to prevent human intervention in sensitive areas.

Study Area
The Shahroud watershed, with an area of about 848 km 2 and elevation range from 1084 to 2131 m a.s.l., is located in the northeastern part of Semnan Province, Iran (Figure 1).The study area receives an average rainfall of less than 250 mm has an arid and semi-arid climate [32].Various types of lithological formations cover this watershed, and the landforms are mainly low level pediment fans and valley terrace deposits.The dominant land use is rangelands, but irrigation farming and bare lands are also present.

Study Area
The Shahroud watershed, with an area of about 848 km 2 and elevation range from 1084 to 2131 m a.s.l., is located in the northeastern part of Semnan Province, Iran (Figure 1).The study area receives an average rainfall of less than 250 mm has an arid and semi-arid climate [32].Various types of lithological formations cover this watershed, and the landforms are mainly low level pediment fans and valley terrace deposits.The dominant land use is rangelands, but irrigation farming and bare lands are also present.

Data and Method
Figure 2 shows the methodological approach applied to map GE susceptibility in the Shahroud watershed using BRT, MARS, and RF models.For preparing an accurate and reliable gully inventory map, extensive field surveys with a DGPS device were performed in the study area to determine the location of the Gullies [27,28].Then, among 172 detected gully locations, randomly (70/30 ratio), 121 gully locations (70%) and 51 gully locations (30%) in the polygon format were used for training the testing models [28].The locations of training and testing gullies are shown in Figure 1.

Data and Method
Figure 2 shows the methodological approach applied to map GE susceptibility in the Shahroud watershed using BRT, MARS, and RF models.For preparing an accurate and reliable gully inventory map, extensive field surveys with a DGPS device were performed in the study area to determine the location of the Gullies [27,28].Then, among 172 detected gully locations, randomly (70/30 ratio), 121 gully locations (70%) and 51 gully locations (30%) in the polygon format were used for training the testing models [28].The locations of training and testing gullies are shown in Figure 1.Interventionary studies involving animals or humans, and other studies require ethical approval must list the authority that provided approval and the corresponding ethical approval code.
Interventionary studies involving animals or humans, and other studies require ethical approval must list the authority that provided approval and the corresponding ethical approval code.
The aspect map was classified into nine classes (Figure 3c).Positive and negative values of plan curvatures define convexity and concavity of slope curvature, whereas zero is flat surface.The plan curvature map was divided into 3 categories: Concave, Flat, and Convex.The TWI indicator is important for identifying prone areas to GE [36].TWI is calculated by Equation (1): TWI map of study area is divided into four classes [24,26,37] including <5, 5-<7.5, 7.5-11, and >11 (Figure 3e).The convergence index (CI) gives a measure of how flow in a cell diverges (convergence index in negative and positive values) [38].The CI map was prepared in SAGA-GIS 2.1.1 and divided into 3 classes: <0, 0-10, and >10 (Figure 3f).In this research, for the computation of the effect of drainage network and infrastructures on GE, the distance from rivers and roads was considered [14] and divided into four classes: <170 m, 170-<370 m, 370-650 m, and >650 m for rivers (Figure 3g) and <500 m, 500-<1500 m, 1500-3000 m, and >3000 m for roads (Figure 3h).The line density tool in ArcGIS 10.5 was used for calculating drainage density and then its map was divided into four categories: <1.4,1.4-<2.4,2.4-3.7, and >3.7 km/km 2 (Figure 3i).A geological map at a 1:100,000 scale was used to prepare the lithological unit layer.The lithological units were classified into ten categories based on their sensitivity to gully occurrence using expert knowledge method (Figure 3j and Table 1).The advantage of this method is it is easy to use, however this method has certain disadvantages, such as the possibility of a mistake by the expert.The LU/LC map was obtained using Landsat 8 images [39][40][41].The main LU/LC types identified in the study area were range, irrigation farming, and bare lands (Figure 3k).The NDVI map was also produced using Landsat 8 images and classified into 3 categories: <0.11, 0.11-0.25,and >0.25 (Figure 3l).
For multi-collinearity checking, the tolerance (TOL) and variance inflation factor (VIF) were used.If during modeling there is collinearity among the variables, the accuracy of the model's prediction decreases.Values of TOL and VIF were ≤0.1 and ≥10, respectively, indicating that multi-collinearity among parameters [28].
Appl.Sci.2018, 8, x FOR PEER REVIEW 5 of 21 (Figure 3g) and <500 m, 500-<1500 m, 1500-3000 m, and >3000 m for roads (Figure 3h).The line density tool in ArcGIS 10.5 was used for calculating drainage density and then its map was divided into four categories: <1.4,1.4-<2.4,2.4-3.7, and >3.7 km/km 2 (Figure 3i).A geological map at a 1:100,000 scale was used to prepare the lithological unit layer.The lithological units were classified into ten categories based on their sensitivity to gully occurrence using expert knowledge method (Figure 3j and Table 1).The advantage of this method is it is easy to use, however this method has certain disadvantages, such as the possibility of a mistake by the expert.The LU/LC map was obtained using Landsat 8 images [39][40][41].The main LU/LC types identified in the study area were range, irrigation farming, and bare lands (Figure 3k).The NDVI map was also produced using Landsat 8 images and classified into 3 categories: <0.11, 0.11-0.25,and >0.25 (Figure 3l).
For multi-collinearity checking, the tolerance (TOL) and variance inflation factor (VIF) were used.If during modeling there is collinearity among the variables, the accuracy of the model's prediction decreases.Values of TOL and VIF were ≤0.1 and ≥10, respectively, indicating that multicollinearity among parameters [28].

WoE Model
WoE is according to the Bayesian probability framework, to predict the significance of effective factor classes through a statistical approach [42][43][44][45][46][47][48][49][50][51].In this method, the spatial relationship between GE areas and GEVs are identified.The WoE model is based on the calculation of positive (W + ) and negative (W − ) weights.This model computes the weight of each GEVs according to the existence or absence of the gully inventory [52] as follows: where ln is the natural log function and P is the probability, B and B indicate the presence and absence of the gully geo-environmental factor, respectively, L is the presence of gully, and L is the absence of a gully.W and W are positive and negative weights, with W indicating that a geoenvironmental factor is present in the gully inventory.S (W ) is the variance of the W and S (W ) is the variance of W . C indicates the overall association between GEVs and gully occurrence.S (C) is the standard deviation of the contrast and W is final weight of each class factor.

RF Model
RF is a controlled learning method that uses multiple trees in the classification [21].The RF algorithm, by replacing and continuously changing the factors that affect the target, leads to the creation of a large number of decision trees, then all trees are combined to make decisions [21].The RF consists of 3 user-defined parameters, which include: (1) The number of variables used in the

WoE Model
WoE is according to the Bayesian probability framework, to predict the significance of effective factor classes through a statistical approach [42][43][44][45][46][47][48][49][50][51].In this method, the spatial relationship between GE areas and GEVs are identified.The WoE model is based on the calculation of positive (W + ) and negative (W − ) weights.This model computes the weight of each GEVs according to the existence or absence of the gully inventory [52] as follows: where ln is the natural log function and P is the probability, B and B indicate the presence and absence of the gully geo-environmental factor, respectively, L is the presence of gully, and L is the absence of a gully.W + and W − are positive and negative weights, with W + indicating that a geo-environmental factor is present in the gully inventory.S 2 W + is the variance of the W + and S 2 W − is the variance of W − .C indicates the overall association between GEVs and gully occurrence.S (C) is the standard deviation of the contrast and W is final weight of each class factor.

RF Model
RF is a controlled learning method that uses multiple trees in the classification [21].The RF algorithm, by replacing and continuously changing the factors that affect the target, leads to the creation of a large number of decision trees, then all trees are combined to make decisions [21].The RF consists of 3 user-defined parameters, which include: (1) The number of variables used in the construction of each tree, which expresses the power of each independent tree; (2) number of trees in RF; and (3) minimum number of nodes [43].RF prediction power increases with the increasing strength of independent trees and reducing the correlation between them [44].This algorithm uses 66% of the data to grow a tree called Bootstrap, and then a predictor variable is introduced randomly during the growing process to split a node in the tree construction.The remaining 33% of the data is also used to evaluate the fitted tree [45].This process is repeated several times and the average of all predicted values is used as the final prediction of the algorithm.In this model, two factors, including the mean decrease accuracy and mean decrease Gini, are used to prioritize of each of the effective factors.The use of the mean decrease accuracy in comparison to mean decrease Gini index is more effective in determining the priority of effective factors, especially in the context of the relationship between environmental factors [46].The RF analyses were carried out in R 3.3.1,using the "Randomforest" package [21].

BRT Model
BRT is one of several techniques that can help improve the performance of a single model by combining multiple models [47].BRT uses two algorithms for modeling: Boosting and regression [48].Boosting is a way to increase the accuracy of the model, and based on this, the construction, combination, and averaging of a large number of models are better and more accurate than an individual model on its own [49].BRT overcomes the greatest weakness of the single decision tree, which is relatively weak in data processing.In BRT, only the first tree of all the training data is constructed, the next trees are grown on the remaining data from the tree before it; trees are not built on all data and only use a number of data [50].The main idea in this method is to combine a set of weak predictor models (high predictive error) to arrive at strong prediction (low predictive error) [51].Thus, in this study, BRT was used for GE spatial modeling using GMB (Generalized Boosted Models) and dismo (Species Distribution Modeling) packages in R 3.3.1.3.

MARS Model
The MARS model is a form of regression algorithm that was introduced by Friedman in 1991 to predict continuous numerical outputs [52].This technique generates flexible regression models for predicting the target variable by means of dividing the problem space into intervals of input variables and processing a basic function in each interval.
The base function represents information in relation to one or more independent variables.A base function is defined in a given interval, in which the primary and end points are called knote.The knote is the key concept in this method and represents the point at which the behavior of the function changes at that point.The base function expresses the relationship between the input variables and the target variable and is in the form of Max (0, X − c) or Max (0, c − X), in which c is threshold value and X is the impute variable.The general form of the MARS model is as follows: where x = input, f (x) = output, P = predictor variables, and B = basis function.Max (0, x − H) and Max (0, H − x) are basis function and do not have to be present if their coefficients are 0. β 0 is constant, β jb is the coefficient of the jth base function (BF), and the H values are called knots.The MARS model includes three main steps: (1) A forward stepwise algorithm to select certain spline basis functions, (2) a backward stepwise algorithm to delete base functions (BFs) until the best set is found, and (3) a smoothing method which gives the final MARS approximation a certain degree of continuity [52].First, the MARS model estimates the value of the target function with a constant value, and then generates the best processing in the forward direction by searching among the variables.
The search process continues as long as all possible (BFs) are added to the model.At this stage, a very complex model with a large number of knotes is obtained.In the next step, through the process of pruning backward, BFs that are less important are identified and deleted by using the generalized cross-validation (GCV) criterion [27].GCV is a criterion for data fitting and eliminates a large number of BFs and reduces the probability of overfitting.This indicator is obtained by using Equation ( 10): where N is the number of data and C (B) is a complexity penalty that increases with the number of BF in the model and which is defined as: where d is a penalty for each BF included into the model.This process continues until a complete review of all the basic functions, and at the end of the optimal model is obtained by applying base functions [52].MARS model is an adaptive approach, since the selection of BFs and node locations is based on the data and type of purpose.After determining the optimal MARS model, the analysis of variance (ANOVA) method can be used to estimate the participation rate of each of the input variables and BFs.A detailed description of the MARS model can be found in [45].MARS was run with R 3.3.1 and the "Earth" package [53].

Validation of GESMs Using Three Data Mining Models
A single criterion is not enough to select the best model among a large number of models, and judging about choosing a superior model by one criteria.It is not a suitable approach and it raises the chance of mistake in choosing the suitable model [27,37,54].In this study, to compare the performance between data mining models and select the appropriate model, AUC and SCAI were used [28,36,55].For calculating AUC, different thresholds were considered from 0 to 1, and for each threshold, the number of cells detected by the model as gully erosion was compared with observed gully erosion cells and positive and negative ratio indicators was calculated.After calculating these two indicators, we arranged them in ascending order, then they were plotted to calculate AUC.The AUC values range from 0.5 to 1.If a model cannot estimate the occurrence of an event better than a probable or random viewpoint, its AUC is 0.5 and therefore it will have the least accuracy, while if the AUC is equal to one, the model will have the highest accuracy [56,57].The quantitative-qualitative relationship between AUC value and prediction accuracy can be classified as follows: 0.5-0.6,poor; 0.6-0.7,average; 0.7-0.8,good; 0.8-0.9,very good; and 0.9-1, excellent.SCAI is the ratio of the percentage area of each of the zoning classes to the percentage of gullies occurring on each class.Based on the SCAI indicator, the values of SCAI in very high sensitivity class are lower than very low sensitivity class.

Multi-Collinearity Analysis
Multicollinearity is a condition of very high inter-correlations or inter-associations among the independent variables.Therefore, it is a type of disturbance in the data, and if present in the data, the statistical conclusions of the data may not be reliable [27].A TOL value less than 0.1 or a VIF value larger than 10 indicates a high multicollinearity [56].The outcomes of the coherent analysis among the 12 GEVs are shown in Table 2.The outcomes showed that the TOL and VIF of all GEVs were ≥0.1 and ≤5, respectively.As a result, no multi-collinearity is seen among the GEVs.

Spatial Relationship Using WoE Model
The outcomes of WoE model are shown in Table 3.In elevation, the results of WoE indicate that there is a direct correlation between classes of elevation and GE, and with an increase in elevation, GE also increases.Therefore, the class of >1600 m with WoE 47.95 had the greatest impact on gully occurrence.The result of slope degree indicate that classes 5-<10 with WoE 34.96 had a strong relation with GEIM.For slope aspect, NE-facing slopes with a value of 19.46 show high probability of gully occurrence.In the case of plan curvature, among the three classes of concave, flat, and convex, the concave class had the highest value (78.04), and thus a positive correlation with GE.This result is in line with [11,50].In TWI, the class of >11 has the strongest relationship with GE with the highest value (78.04).In the case of the convergence index, the class of 0-10 with values of 13.18 has a positive relation with gully occurrence.With respect to distance from river, class of >650 with value of 25.86 and regarding distance from road the class of >3000 m with values of 16.25 had the greatest effect on gully occurrence.For the drainage density factor, the class of <1.4 km/km 2 showed the highest value (14.23) and thus high correlation with gully occurrence.According to the lithology factor, Gypsiferous marl with greatest value (51.23) is more prone to GE than other lithology units.Concerning LU/LC, most gullies are located in the range land use type and this class with the highest value (21.02) has the strongest relationship with gully occurrence.In NDVI, results indicated that all gullies are located in the class of <0.11, showing that very low vegetation density renders slopes susceptible to GE.

Applying RF Model
The outcomes of the confusion matrix for RF model are shown in Table 4.The result shows that the model predicted 2487 non-gully pixels as non-gullies and 256 non-gullies as gully.On the other hand, the RF model predicted 2677 gullies as gullies and 66 gullies as non-gullies.Moreover, the out-of-bag error (OOB) for RF was 5.82%.This means that the model has a precision of 94.18%, which expresses the excellent accuracy of the model in predicting gully erosion.Prioritization results of RF are shown in Table 5 and Figure 4.The results show that the distance from roads (381.67,22%), elevation (335.06,19%), and lithology (234.21,14%) had the highest values, followed by slope degree, drainage density, distance from river, NDVI, convergence index, slope aspect, TWI, plan curvature, and LU/LC.Finally, the GESM by the RF model was prepared in ArcGIS 10.5 and divided into five classes from very low to very high (Figure 5a), using a natural break classification [8].According to the results, of the entire study area (847.87 km 2 ), 525.97 km 2 (62.03%) are located in the very low susceptibility class, 148.28 km 2 (17.49%) in the low susceptibility, 79.42 km 2 (9.37%) in the moderate class, 56.34 km 2 (6.64%) in the high class, and 37.88 km 2 (4.47%) are located in the very high susceptibility class.Of the total area of GE (0.729 km 2 ) in the study area, 0.86% (0.01 km 2 ) are located in the very low susceptibility class, 5.67% (0.04 km 2 ) in the low susceptibility, 14.80% (0.11 km 2 ) in the moderate susceptibility, 21.95% (0.16 km 2 ) in the high susceptibility, and 56.72% (0.41 km 2 ) in the very high susceptibility classes.

Applying BRT Model
The BRT model was used to reveal the spatial correlation between the existing GE and the GEVs in the study area.The results of the model are shown in Figure 6.They indicate that the factors distance from roads (31.1%), elevation (27.2%), and lithology (11%) had the highest importance on GE, mirroring the outcomes of the RF model, followed by slope degree (7%), drainage density (6.7%), distance from river (5.1%), slope aspect (3.8%), convergence index (2.4%),NDVI (2.2%), plan curvature (1.6%), TWI (1.6%), and LU/LC (0.3%).The gully susceptibility map by the BRT model was also prepared in ArcGIS 10.

Applying MARS Model
The optimal MARS model included 28 terms, and the GCV was 0.157.MARS model provides the optimal model only by selecting the necessary parameters.In this research, nine GEVs including lithology, distance from road, distance from river, drainage density, elevation, aspect, convergence index, slope, and NDVI were used to construct the optimal model from the 12 GEVs.The GESM by the MARS model was implemented in ArcGIS 10.5 using Equation (12).According to Equation (12), distance from roads, elevation, and lithology were the most important variables.Values of GESM by MARS model varies from −9.8 to 7.3.At first, GESM classified using quantile, equal interval, natural break, and geometrical interval classification techniques, then, by comparatively analyses of the distribution of training and validation gullies in high and very high classes, the natural break classification technique was most accurate.As a result, GESM by MARS were classified into very low (−9.86-−6.24),low (−6.24-−2.31),moderate (−2.3-0.04),high (0.04-0.38), and very high (0.38-7.32) gully erosion susceptibility zones by natural break classification technique (Figure 5c).The results indicate that 0.02 km 2 (2.10%) of GE in the study area are located in the very low susceptibility class, with 339.01 km 2 (39.98% of total study area) and 0.58 km 2 (79.16%) located in the very high susceptibility class with 105.50 km 2 (0.58%) (Table 6).In general, the results indicate that for all three models with increasing susceptibility (from very low to very high), the area of the respective classes decreased, while in contrast the areas of GE increased.These results is in line with Youssef et al. (2015).

Validation of Models
The results of the validation of the models using the AUC curve and SCAI indicator are shown in Figure 7, and in Tables 6 and 7.The results show that the values of the AUC for the three models vary from 0.911 to 0.927, indicating very good prediction accuracy for all models, with RF resulting in the highest value.In addition, the SCAI values for the three models, RF (61.08-0.00),MARS (10.45-0.03),BRT (12.59-0.01),show that the RF model has higher SCAI values compared to the other models in the very low, low, and very high susceptibility classes (Figure 7).In spite of the high efficiency and accuracy of the RF model for GE sensitivity mapping, so far this model has not been used by the research community.
are not considered in site selection and construction of roads as anthropogenic structures in nature, they can act as a causative factor in environmental hazards such as landslide and gully erosion.The construction of roads in bare lands with erosion-sensitive formations has led to the expansion of gully erosion in the study area, so that the construction of a road without proper culverts causes disrupted of natural drainage and runoff concentrations, thus eroding the bare lands and resulting in the formation of a gully.The results of the validation of data mining models showed that the RF model more accurately predicted areas that are sensitive to gully erosion.These results are consistent with the results of [36,43,46,59], which introduced the RF model as a strong and high-performance model.One of the most widely used data mining methods is the RF model.The advantages of the RF method over other models is that this model can apply several input factors without eliminating any factors, and return a very small set of categories that support high prediction accuracy [6].
The classification accuracy of this model is affected by many factors such as the number, scale, type, and precision of input data.Thus, in the processing, the use of all suitable factors causes the accuracy of the model to increase.Compared with other models, RF has higher sufficiency to apply a very high number of datasets [6].The RF model has the potential as a tool of spatial model for assessing environmental issues and environmental hazards.The RF model combines several tree algorithms to generate a repeated prediction of each phenomenon.This method can learn complicated patterns and consider the nonlinear relationship between explanatory variables and dependent variables.It can also incorporate and combine different types of data in the analysis, due to the lack of distribution of assumptions about the data used.This model can use and apply thousands of input variables without deleting one of them.This method is less sensitive to artificial neural networks, in case of noise data, and can better estimate the parameters [60].The greatest advantages of RF model are high predictive accuracy, the ability to learn nonlinear relationships, the ability to determine the important variables in prediction, its nonparametric nature, and in dealing with distorted data, it works better than other algorithms for categorization.The main disadvantages of this algorithm include high memory occupation, hard and time-consuming in implementation for large datasets, high cost of pruning, high number of end nodes in case of overlap, and the accumulation of layers of errors in the case of the tree growing.[15,61] stated that the CART, BRT, and RF models showed better accuracy compared to bivariate and multivariate methods.Pourghasemi et al. concluded that the RF and maximum entropy (ME), models have high performance and precision in modeling [31].Mojaddadi et al. showed that BRT, CART, and RF methods are suitable for modelling [55].Chen et al. indicates that the MARS and RF models are good estimators for mapping [36].Lai et al. indicated that the RF model has significant potential for weight determination on landslide modelling [62].Kuhnert et al stated that RF with AUC = 97.0 is suitable for landslide susceptibility [27].Lee et al stated that the prediction accuracy of RF model is high (90.8)and that this model had a high capability for landslide prediction [43].They applied RF and boosted-tree models for spatial prediction of flood susceptibility in Seoul metropolitan city, Korea [43].They stated that the RF model has better performance compared to boosted-tree.As a scientific achievement, the methodology framework used in this research has shown that the proper selection of effective variables in gully erosion, along with the use of modern data mining models and Geography Information System (GIS) technique, are able to successfully identify areas susceptible to gully erosion.The susceptibility map prepared using this methodology is a suitable tool for sustainable planning to protect the land against gully erosion processes.Therefore, this methodology can be used to assess gully erosion in other similar areas, especially in arid and semi-arid regions.

Conclusions
GE is one of the main processes causing soil degradation and there is a need to improve methods to predict susceptible areas and responsible environmental factors, to allow early intervention to prevent, limit, or reverse gully formation.The utility of three data mining models, RF, BRT, and MARS, to predict GE in the Shahroud watershed, Iran, was assessed.For this purpose, twelve causative

Figure 1 .
Figure 1.(a) Location of the Semnan provinces in Iran, (b) location of study area, and (c) gully erosion locations with the digital elevation model map of the Shahroud watershed.

Figure 1 .
Figure 1.(a) Location of the Semnan provinces in Iran, (b) location of study area, and (c) gully erosion locations with the digital elevation model map of the Shahroud watershed.

Figure 4 .
Figure 4. Relative influence of effective conditioning factors in the random forest (RF) model.

Figure 4 .
Figure 4. Relative influence of effective conditioning factors in the random forest (RF) model.

Figure 6 .
Figure 6.Relative influence of effective conditioning factors in boosted regression tree (BRT) model.
5 and divided into five classes of very low to very high (Figure 6c).The results of the GE susceptibility class by the BRT model covered 847.87 km 2 of the study area an area distribution in the very low, low, moderate, high, and very high susceptibility classes are 605.37 km 2 , 88.38 km 2 , 52.01 km 2 , 34.13 km 2 , and 67.98 km 2 , and percentage distribution in the susceptibility classes of are 71.40,10.42, 6.13, 4.03, and 8.02, respectively.Of the actual GE area of 0.729 km 2 , 0.04 (5.55%), 0.03 (4.56%), 0.06 (8.26%), 0.08 (11.34%), and 0.51 km 2 (70.28%) are located in the very low to very high susceptibility classes, respectively.

Figure 6 .
Figure 6.Relative influence of effective conditioning factors in boosted regression tree (BRT) model.

Figure 6 .
Figure 6.Relative influence of effective conditioning factors in boosted regression tree (BRT) model.

Table 1 .
Lithology of the study area.

Table 1 .
Lithology of the study area.

Table 3 .
Relationship between conditioning factors and gully erosion using weights-of-evidence (WoE) model.

Table 5 .
Relative influence of effective conditioning factors in the RF model.Of the total area of GE (0.729 km 2 ) in the study area, 0.86% (0.01 km 2 ) are located in the very low susceptibility class, 5.67% (0.04 km 2 ) in the low susceptibility, 14.80% (0.11 km 2 ) in the moderate susceptibility, 21.95 % (0.16 km 2 ) in the high susceptibility, and 56.72% (0.41 km 2 ) in the very high susceptibility classes.

Table 5 .
Relative influence of effective conditioning factors in the RF model.

Table 6 .
Area under the curve (AUC) values of RF, MARS, and BRT data mining models.