Novel Machine Learning Approaches for Modelling the Gully Erosion Susceptibility

The extreme form of land degradation caused by the formation of gullies is a major challenge for the sustainability of land resources. This problem is more vulnerable in the arid and semi-arid environment and associated damage to agriculture and allied economic activities. Appropriate modeling of such erosion is therefore needed with optimum accuracy for estimating vulnerable regions and taking appropriate initiatives. The Golestan Dam has faced an acute problem of gully erosion over the last decade and has adversely affected society. Here, the artificial neural network (ANN), general linear model (GLM), maximum entropy (MaxEnt), and support vector machine (SVM) machine learning algorithm with 90/10, 80/20, 70/30, 60/40, and 50/50 random partitioning of training and validation samples was selected purposively for estimating the gully erosion susceptibility. The main objective of this work was to predict the susceptible zone with the maximum possible accuracy. For this purpose, random partitioning approaches were implemented. For this purpose, 20 gully erosion conditioning factors were considered for predicting the susceptible areas by considering the multi-collinearity test. The variance inflation factor (VIF) and tolerance (TOL) limit were considered for multi-collinearity assessment for reducing the error of the models and increase the efficiency of the outcome. The ANN with 50/50 random partitioning of the sample is the most optimal model in this analysis. The area under curve (AUC) values of receiver operating characteristics (ROC) in ANN (50/50) for the training and validation data are 0.918 and 0.868, respectively. The importance of the causative factors was estimated with the help of the Jackknife test, which reveals that the most important factor is the topography position index (TPI). Apart from this, the prioritization of all predicted models was estimated taking into account the training and Remote Sens. 2020, 12, 2833; doi:10.3390/rs12172833 www.mdpi.com/journal/remotesensing Remote Sens. 2020, 12, 2833 2 of 32 validation data set, which should help future researchers to select models from this perspective. This type of outcome should help planners and local stakeholders to implement appropriate land and water conservation measures.


Introduction
In the last few decades, modern societies have witnessed various types of degradation of natural resources; above all, soil and water have become more prominent [1]. Land degradation through water-induced soil erosion is the most critical threat to human life and a number of environmental problems are occurring, particularly in the arid and semi-arid region of Iran [2]. Soil erosion is not only an extreme form of land degradation; it is also responsible for a gradual decline in agricultural productivity [3][4][5]. The formation of soil is a natural process, although the net loss of soil is much higher than the formation of regolith due to the gradual degradation of this resource and the influence of anthropogenic activities [6,7]. It has been estimated that almost 6 million hectares of fertile land are lost annually on a global scale due to soil erosion [8]. In the case of Iran, it is approximately 2 to 2.5 billion tons per year and ranks second in the world in terms of soil erosion [9]. In Iran, therefore, the rate of soil erosion is occurring at an alarming rate, making it a national threat [10]. This enormous amount of soil erosion is mainly due to arid and semi-arid climatic conditions and more than 75% of the area is exposed to water-induced soil erosion, i.e., erosion in the form of gullies [10]. More specifically, there is a long dry season with a short wet season, which influences extreme rainfall and causes maximum surface runoff over infiltration [10]. It has also been reported that Iran is facing several intimidating gully incisions around the world [11,12]. Various types of environmental problems, such as desertification, sedimentation in rivers, as well as reservoirs, floods, and soil fertility losses have occurred due to the severe impact of gully erosion [13,14]. In recent times, due to its large impact on environmental degradation and national economic losses, the threat of gully erosion has been taken into account in an appropriate manner. Therefore, in order to understand the mechanism of water-induced gully erosion and to overcome this problem in an optimal way, gully erosion susceptibility mapping (GESM) is a key strategy and must be considered as an initial task. The GESM is derived from the relationship between different geo-environmental conditioning factors and occurrences of gullies [15].
Gully erosion is water-induced soil erosion and is one of the most destructive forms of soil erosion in the world [16,17]. A gully can be defined as a permanent vertical deep channel with a temporary flow of water; the depth varies from 30 cm to several meters and sometimes it is several hundred meters long [18][19][20]. Apart from this, the existence of one type of gully is limited during the wet season, which is called the ephemeral gully. Gully erosion is a very complicated process and it correlates with many factors, such as topography, soil characteristics, lithology, rainfall, land use, and the nature of vegetation [21][22][23]. Running surface water is responsible for the initiation and development of gullies by removing soil particles and ultimately transporting them in the downslope direction [24]. Primarily, two types of approaches have been recognized to evaluate the occurrences of gully erosion [25][26][27]. First, using a regression analysis, we explored the relationship between the occurrence of gully erosion and the topographical condition. Second, gully erosion response curves were prepared using machine learning techniques [28]. The predictive regression approach is considered to estimate and identify the conditioning relation among variables. Regression solutions estimated using various principles (e.g., optimization, minimal square, etc.) are not inherently similar [29]. Machine learning may analyze vast quantities of data and identify complex changes and nature that would not be obvious to individuals. This algorithm is well suited for resolving multi-dimensional and multi-variety information, and can do so in complex or unpredictable situations. As this algorithm learns its skills, it continues to develop in terms of accuracy and performance. Machine learning needs the training of large volumes of data, which should have been inclusive/true and of an excellent standard. There may also be occasions when they wait to produce additional knowledge. ML requires ample time to let algorithms learn and improve sufficiently to perform their tasks with reasonable precision and relevance. It also requires huge sources for it to operate. This might mean extra computing power needs for analysis. Another major obstacle is the ability to analyze the algorithm-generated output precisely.
Over the last few decades, with the arrival of remote sensing (RS), geographic information system (GIS), and various statistical approaches, susceptible areas have been identified in various fields, such as gully erosion [30], landslide [31], groundwater potential zone [32], etc. Basically, there are three types of susceptibility mapping based on multi-criteria decision analysis (MCDA), statistical analysis, and more recently the widely used machine learning algorithm. Extensive literature surveys show that different types of models have been used throughout the world over time to map the susceptibility of gully erosion. It is indeed necessary to remember that the MCDA models presume that the improvement of the attribute does not influence the preferences of the different criteria. This feature, recognized by the technical resource of reciprocal preferred autonomy, is related to the particular question, and the nature of the research [33]. Experts must always determine if autonomy is an expectation for fields and, if not, recommend quite complicated MCDA (multi-criteria decision analysis) systems that integrate relationships [34]. Most importantly, knowledge-based MCDA [11] and statistical analysis based on continuous, binary, and categorical data, such as the information value [23,35], conditional probability [36], certainty factor [37], frequency ratio [38], evidential belief function [2], index of entropy (IoE) [39], weights of evidence (WoE) [40], and logistic regression [2,41], has been widely used by several researchers. In the case of the machine learning algorithm, the most successful models for GESM are the multi-layer perception approach (MLPC) [42], multivariate adaptive regression spline (MARS) [39], artificial neural network (ANN) [43], classification and regression trees (CART) [23], maximum entropy (ME) [44], decision tree (DT) [45], boosted regression tree [15], stochastic gradient treeboost (SGT) [46], random forest (RF) [47], bagging best-first decision tree [48], general linear model (GLM) [49], maximum entropy [50], etc. In general, GESM with machine learning models is more proficient at predicting susceptible areas than statistical analysis. Simulation focuses, in a number of ways, on the related mechanism used, including machine learning or deep learning. Computational products are formulated to predict missing parameters, objects, or events, as their name implies. They sometimes rely on grouping, convergence, and object image processing algorithms [51].
In the current study, semi-humid, semi-arid, and Mediterranean climate types in the province of Golestan, Iran were chosen to map the susceptibility of gully erosion due to its serious damage and the major environmental problems caused by gully erosion. Therefore, the causes of extensive gully erosion, their development, and susceptibility mapping for management and planning purposes must be identified. Therefore, the objectives of our current research work were to identify the most reliable maps of gully erosion susceptibility and to recognize the main conditioning factors responsible for their development. Therefore, in order to meet our objectives, a maximum entropy (MaxEnt), artificial neural network (ANN), support vector machine (SVM), and general linear model (GLM) machine learning algorithm with 90/10, 80/20, 70/30, 60/40, and 50/50 random partitioning of training and validation samples were selected for the purpose of estimating the susceptible part of gully erosion. The selection of the machine learning models for gully erosion susceptibility mapping was based on the previous literature in this region as well as the same climatic conditions [39,43,[52][53][54]. The optimal models suggested by the various researchers were considered for the estimation of gully erosion susceptibility. In previous studies, researchers have attempted to increase the accuracy of the model by considering ensemble approaches rather than a single model. In this perspective, this kind of approach was avoided in order to avoid the problem of over-fitting. Apart from this, the study considered the modification of existing models by considering the random partition of samples. The final result of different GESM models was validated through the area under receiver operating characteristic curve (AUROC). In addition, the jackknife test was used to give different GECFs importance regarding the susceptibility to gullies. This type of machine learning algorithm is not only capable of estimating susceptibility with adequate accuracy but is also capable of handling the large amount of data of the predicted models. Supervised ML approaches usually involve partitioning information into multiple parts for clustering algorithm training, validation, and final testing. Training and validation are usually performed on a test set to discover the perfect variables for a classification model. It was accompanied by the implementation of a classification algorithm for a different test sample with optimal factors for estimating the generalization efficiency of the classification model. For minimal data, this differentiation of the test dataset results in a difficult trade-off among several predictive significances in the classifier output estimation against improved simulation analysis and significant optimal fitting. Apart from the models, the optimum separation capacity of the samples in the appropriate manure may increase the efficiency of the models without compromising the importance of the classifier. In terms of uniqueness, this study is capable of estimating the importance of sample partitioning approaches to improve model performance and to reduce predictive bias. This approach is used for the first time in gully erosion susceptibility modeling, taking into account the optimal capacity of the model. Finally, GES maps will provide appropriate strategies for restoration in the various sectors, i.e., agriculture, land use, and watershed management planners, in a sustainable manner.

Study Area
The study area occupies an area of 790 km 2 and lies between 37 • 30 00" to 37 • 50 00" N, and 55 • 31 40" to 56 • 2 10" E in the northeast part of Golestan province. Elevation ranges between 160 and 1490 mt. mean sea level (MSL) (Figure 1). More than half of the basin has mountainous morphology with gentle slopes and is a part of the Alborz Mountains. The slope angle at steep slopes reaches up to 118%. The average annual rainfall varies from 346 to 610 mm, with the maximum rainfall conducted in southern parts. The minimum and maximum temperatures are 8 and 16 • C. Three main climatic characteristics of semi-humid, semi-arid, and Mediterranean are evident in the study area. Agriculture, as the predominant land cover, is conducted in most of the study area (i.e., 45.35%), followed by rangelands (38.07%), forests (15.95%), and residential areas (0.64%) ( Table 1). Geologically, the largest portion of the region corresponds to the grey to black shale and thin layers of siltstone and sandstone (73.94%), followed by Ammonite bearing shale with the interaction of limestone (12.48%), grey thick-bedded limestone, and dolomite (6.26%), and the remaining area is dominated by other formations described in detail in Table 2.

Methodology
The present study followed several steps as follows: (i) The gully erosion inventory map and gully erosion causality factors preparations: In the current study, a total of 1115 gully head cut locations were identified using the high-resolution images, field investigation, global positioning system (GPS), and a number of gullies were received from Natural Resources and Watershed Management Organization of the Golestan Province. The 20 environmental factors were considered for the modeling purpose. (ii) Multi-collinearity analysis among the gully erosion factors using the variance inflation factor (VIF) and tolerance limit was done using SPSS software. (iii) The significance and effectiveness of factors was carried out using the MaxEnt model (Jackknife test).
(iv) GES maps were prepared using the MaxEnt, ANN, SVM, and GLM models.
(v) The GESM model's performance was validated through the area under receiver operating characteristic curve (AUROC).
The detailed methodology for estimating the susceptibility of gully erosion in the light of these novel approaches is shown in Figure 2.

Gully Inventory Map
The gully erosion inventory map (GEIM) is the basic tool for creating the GESMs. GEIM shows the spatial distribution of gullies and geometry. Through the historical and present gully distributions, one can predict the future risk of the gully erosion of the area. In the present study, the RS and GIS techniques were applied to generate the GEIM. The geographical location of gullies was partially acquired from the archived data of Natural Resources and Watershed Management Organization of the Golestan Province (NRWMOGP). Further, extensive field surveys were conducted supported by geoinformatics (Google Earth images and a handheld GPS device), through which the previous gully inventory map was amended, and an all-inclusive map was generated in ArcGIS 10.3. The produced inventory map was then randomly split into two sets of training and validation data in five replicates, each of which possessed a different training:validation balance. The balance values commenced in favor of the training dataset as it held 90% of the gullies while the remaining was cast-off for further validation, and perpetually more samples were transmitted to the validation dataset. As such, five training:validation samples of 90:10%, 80:20%, 70:30%, 60:40%, and 50:50% were considered to investigate how the dataset's transient distribution can influence models' performances. Some major erosion-prone areas are shown in Figure 3, due to the nature and vulnerability of the erosion that we can easily understand.

Data Preparation
Different geo-environmental factors, such as topographic, hydrological, geological, soil, and environmental factors, are important parameters for GESM (gully erosion susceptibility mapping) ( Table 3). It is also an important step in the selection of the various appropriate geo-environmental factors for the preparation of GESM using different machine learning models [15]. In this study, based on a previous literature review [10,23,55], the availability of data, extensive field survey, and multi-collinearity analysis, we selected 20 GECFs, namely the topography position index (TPI) [56], plan curvature [57], elevation, aspect [58], slope [58], height above nearest drainage (HAND) [59], drainage density [60], distance from stream [61], terrain ruggedness index (TRI) [42], distance from road [62], bulk density [63], mineral soil, clay content, sand content, relative slope position (RSP) [64], silt content, valley depth, land use, soil texture, and lithology (Figure 4a-t).    All these factors were derived from different sources. The Advanced Land Observing Satellite (ALOS) digital elevation model (DEM) 12.5 m resolution data were downloaded from the Alaska Satellite Facility (ASF) for the extraction of topographic and hydrological factors, such as the topography position index (TPI), plan curvature, elevation, aspect, slope, drainage density, distance from stream, terrain ruggedness index (TRI), and relative slope position (RSP). The geological map was collected from Geological Society of Iran (GSI) (http://www.gsi.ir/) at a scale of 1:100,000 to generate the lithology map. The topographic map was acquired from National Geographic Organization of Iran (www.ngo-org.ir) at a scale of 1:1:50,000 along with Google Earth images and Landsat 8 satellite images, which were also used to produce land use and roads network maps.

Topography position index (TPI):
More specifically, TPI is used to measure topographic slope positions. TPI is the measure of differences between the elevation at the central point and the average elevation around it [65].
The following equations were used to estimate the TPI. The TPI map is shown in Figure 4a and the value ranges from −38.8 to 54.8: where E Pixel is the elevation at the central point and E Surrounding is the average elevation of the neighboring areas. Plan curvature: Plan curvature signifies the overland flow of water in terms of its diverging and converging, and plays an important factor in gully erosion studies [35]. The value of plan curvature ranges from −6.1 to 9.1 (Figure 4b).
Elevation: Elevation influences the rainfall and related runoff process, which is largely employed in geo-hazard modelling like GESM [66]. The elevation of this region ranges from 160 to 1490 m ( Figure 4c).
Slope aspect: Solar radiation, vegetation covers, and evapo-transpiration largely depend on the slope aspect [67], which is considered to be one of the major parameters for geo-hazard susceptibility mapping. The aspect map of this study area is shown in Figure 4d.

Slope:
Slope angle largely affects the surface runoff, infiltration, pattern of drainage density, and soil erosion [35,68]. Therefore, slope angle has always been used as one of the major factors for mapping GESM. In this region, the angle of slope varies from 0% to 118% (Figure 4e).

Height above nearest drainage (HAND):
The HAND model emphasizes the relative heights beside the drainage network and influences the soil gravitational potential [59]. HAND is calculated by using the DEM and DEM flow field in a GIS environment [69]. The value of the HAND map ranges from 0 to 494 (Figure 4e).
Drainage density: Drainage density has a major influence on erosion in the form of the initiation and development of rills gullies, etc. a higher drainage density has a minimum infiltration rate and higher runoff capacity, and vice versa [66]. Drainage density was calculated by using the following equation [70]. The value of drainage density ranges between 0 and 3.32 km/km 2 ( Figure 4f): where n i=1 S i indicates the total length of all drainages in km and 'a' is the total area of the drainage basin in km 2 .
Distance from stream: Gullies are primarily associated with the drainage system, and there is a significant positive relationship between the distance from the stream and the occurrence of gullies [71]. The distance from the stream map is shown in Figure 4g and the range varies from 0 to 1959 m.

Terrain ruggedness index (TRI):
The concavity and convexity of an area is indicated by TRI, which also influences gully erosion occurrences [72]. Apart from this, different pedo-geomorphic processes can directly influence the amount of TRI in a specific geomorphic region. The following equation was used to calculate TRI. The value of TRI ranges between 0 and 37 ( Figure 4h): where X represents the altitude of every neighbor cell to a definite cell, and max and min are the highest and smallest altitude among different neighboring cells.

Distance from road:
The distance from the road is another important parameter for the gully erosion and the preparation of the GESM. Due to the construction of the road, the stress and strain of the slope can be increased and as a result, there were disturbances and failures of the slope [62]. The distance from the road map is shown in Figure 4i and the value ranges from 0 to 4532 m.
Bulk density: Bulk density is defined as the mass per unit volume of the loose powder bed [63]. Bulk density was estimated by using the following equation. The range of the bulk density in this region varies from 1.4 to 1.6 g/mm in the study area ( Figure 4j): where M represents the mass in grams and V o indicates the untapped apparent volume in milliliters. Relative slope position (RSP): RSP helps to understand the various topographical characteristics, such as flat surface, valley, ridge-top, foot-slope, mid-slope, and upper slope [73]. The value of the RSP map for the current study area varies from o to 1 (Figure 4n).

Silt content and valley depth:
Silt content and valley depth are also important factors for GESM. In the present study, the silt content and valley depth varies from 31 to 55 and 0 to 391, respectively (Figure 4o,p).
Land use: The formation of gullies and associated land degradation depends to a large extent on land use. The land use map of the area was prepared using the maximum likelihood algorithm of the supervised classification technique [74]. Table 2 shows different land use types and their geographical areas, i.e., forest, agricultural land, range, and residential areas (Figure 4q).
Soil texture and lithology: Soil texture in this study area has been categorized into four types namely clay loam, salty clay loam, loam, and silty loam (Figure 4r). The land surface process of the area is highly influenced by lithological characteristics and one of the most significant factors for large-scale erosion, such as the creation and development of gullies [35,75]. In this study, six types of lithological units were found ( Figure 4s) and their description is given in Table 2.

Multi-Collinearity Assessment
Multi-collinearity analysis can be defined as the relationship between two or more variables in the data set and the linear relationship among variables [2]. Generally, various geo-environmental conditioning factors have been used to prepare GESM. Thus, multi-collinearity analysis was therefore used to identify the perfect relationship between the variables. Multi-collinearity occurs when there is a very high correlation between variables and the accuracy of the result is reduced [31]. Therefore, high multi-collinearity factors need to be removed from the entire analysis in order to achieve better results [76]. Various researchers throughout the world have been used in multi-collinearity analysis to get better output by using machine learning models, i.e., in the field of GESM [28], landslide susceptibility mapping [77], etc. Generally, the variance inflation factor (VIF) and tolerance (TOL) are widely used to understand the multi-collinearity of a dataset. TOL and VIF were calculated by using the following equations: where R 2 j represent the regression value of j on other different variables in a dataset. Thus, in a general way, the multi-collinearity problem occurs when the tolerance value is <0.10 or 0.20 and VIF value is >5 or 10. ANN is a type of machine learning model in which human minds can work in a precise way and have always been the inspiration for it [78,79]. In general, it is a non-linear statistical data analysis model. ANN has various algorithms to analyze and predict the statistical dataset, including multilayer perception (MLP), which is the most up-to-date algorithm for this machine learning model [80]. The ANN model is more advanced than conventional statistical methods and involves some basic knowledge of the structure of input data and the nature of the relationship between variables, i.e., linear or non-linear [81]. In the MLP algorithm of the ANN model, there are three layers, namely the input layer, hidden layer, and output layer [81]. The information of a data structure is measured by nodes of hidden layers if the input layers are not sufficiently involved to do so [52]. In this case, the input layers, such as the various GECFs and the gully erosion training points, are connected to the output layer. After that, the input and hidden layer systematically predict the model structure of the input nodes and evaluates the result in a dynamic function [82]. In the ANN model, there is a structured code that determines the input and output nodes. In each pixel, the output nodes are equivalent to the Boolean value, i.e., 1 or 0, where 1 indicates the possibility of gully erosion, and 0 indicates no possibility of gully erosion. Hidden layers are used to determine the trial and error of the model [83]. The back propagation algorithm for ANN was discussed in the following equations [84]: The net input of the jth neuron of layer l and I iteration: The δ factor for the neuron jth in the output layer ith: The δ factor for the neuron jth in the hidden layer ith: where α represents the momentum rate and n represents the learning rate.

General Linear Model (GLM)
GLM is a statistical probability method with a logit function and it is extensively used for different natural hazards' modeling [55,85]. The GLM (logistic regression) is the modified version of the classic general linear regression model [82,86]. The GLM was first introduced by Nelder and Wedderburn in 1972 [87]. The function of GLM is much simpler; therefore, it is widely used in the broad sense of statistical analysis [88]. The link function (i.e., identity and logistic) between the dependent variable and various independent variables are assumed by this statistical-based machine learning model through a linear relationship [89]. Depending on the existence or non-existence dataset, GLM can produce a binary data model using a logistic regression model [90]. The logit link function in GLM is used for modelling a fractional response to handle the dataset of the binary value, i.e., 0 and 1 [54]. The function for GLM can be expressed as follows [91]: where Y (logit) represents the probability of an event happening and it varies from 0 to 1; X 1 . . . X n indicates the values of different controlling factors; and C 1 . . . C n indicates their coefficient.

Maximum Entropy (MaxEnt)
MaxEnt is a predictive model and is developed on the basis of the principle of entropy maximization [92]. The principle of entropy maximization is based on the statistical and information theory associated with this principle; it also provides an appropriate estimate of the uncertain probability distribution [93]. It is also said that, from all probabilistic constraints, the MaxEnt model chooses the one with the highest entropy [92]. MaxEnt is a widely used machine learning model based on the presence-only features [94]. The presence-only feature has significance for the machine learning model because it is far more trustworthy for inaccessible areas [95]. MaxEnt generally found for an unidentified target allocation and true distribution (π) over all the pixels in the area's location of X comprised by individual pixels x [96]. In this study of GESM modelling, the MaxEnt model was expected to identify the gully occurrence probability distribution at the area's location of X. A brief statistical explanation of the MaxEnt model can be found in [94,97,98], with the following equation: where P(y = 1 x) represents the probability of the gully being present at the location of x, where P(x y = 1) represents being at the site of given x, P(y = 1) is the overall prevalence, and P(x) is the probability of picking the location x. The above equation can also be rewritten as follows: The calculation of P(x) can also be done by the probability distribution of marginalizing, such as: P(x) = y P(x, y) = P(x y = 1)P(y = 1) + P(x y = 0)P(y = 0).
The generative model basically deals with P(x, y) and P(y). The simplest equation for the equal probability (P (y = 0) = P (y = 1) = 0.5) of MaxEnt is as follows:

Support Vector Machine (SVM)
SVM was introduced by Vapnik and Chervonenk is in the year of 1963. It is a supervised machine learning method based on the principle of statistical learning and structural risk minimization [99]. In both fields, i.e., classification and regression, SVM can be used to resolve statistical data [100]. Basically, it was used for a variety of classification functions along with error analysis and generalization of the overall function [101]. SVM will generally find the hyperplane to distinguish between the two classes; in this case, gully and non-gully datasets [102]. The optimal hyper plane and training dataset are closer to each other and called the support vectors [103]. Two concepts are employed in SVM modelling on statistically induced problems. The very first is to separate statistical data patterns by using a linear hyperplane separation. The second is to convert non-linear data patterns to linearly separable data patterns using kernel functions [104].
Two SVM modelling classes were described in the following section [105,106]. Regard as a set of linear separate training vectors x i (i = 1, 2, . . . n). Training vectors have two classes, i.e., y i = ±1. The primary aim of the SVM is to look for an n-dimensional hyperplane, which differentiates two classes by using the maximum gap. This can be written as: The following constraints of the subject are: where ||W|| represents the hyperplane, b represents the scalar base, and (.) represents the scalar product. The cost function of SVM can be defined by using the Lagrangian multiplier, such as: where λi represents the Lagrangian multiplier. In the case of the non-separable function, the constraints can be modified by introducing slack variables: Finally, the equation becomes as follows: where v (0, 1) is generated in order to account for misclassification [107]. In addition to this, the kernel function K x i , x j was introduced by Vapnik in the year of 1995 as an explanation for the non-linear decision boundary.

Measuring the Importance of GECFs by the Jackknife Test
In this study, the jackknife test [108] was employed to evaluate which GECFs have the strongest consequences on the GESM predictive outcome. In general, the jackknife test was used to better understand the pattern of gully erosion. In particular, the AUC-based statistical coefficient is reliable on the jackknife test, which accepts practical problems in a broader sense [109]. This test identified the most important conditioning factors in a particular model and calibrated all parameters [110]. Therefore, the jackknife test finds the major conditioning factors of gully erosion patterns by AUC. The percentage of the relative decrease (PRD) of the AUC was used for the analysis of the contributing factors. The equation of PRD is as follows [111]: where AUC all represents the AUC value calculated from the prediction by every factor, AUC i is the individual factor value, and PRD i is the relative decrease of AUC in the percentage when the ith factor has been removed from the whole prediction analysis.

Validation and Accuracy Assessment
The validation and evaluation of the accuracy assessment of GESM is very much important; otherwise, the final output result has less significance. Thus, it is necessary to validate all machine learning models; in this case, ANN, SVM, MaxEnt, and GLM have been validated to get better results and analysis. The area under the receiver operating characteristic (AUROC) curve is a standard tool that is widely used to establish the accuracy of the model [112]. The AUROC method has been widely used to evaluate the accuracy of several natural hazard susceptibility mappings [11,113]. The ROC curve is based on two terms, i.e., events and non-event phenomena; therefore, this curve is two-dimensional [114]. The ROC curve plotted on the X-axis known as the sensitivity based on the false positive rate and the Y-axis known as the 1-speficity based on the true positive rate. Generally, the sensitivity detects gullies and the specificity detects non-gullies accurately and, in both cases, the optimum value is 1 [115]. The AUC value ranges between 0.5 (represents poor performance) and 1.0 (represents good performance). The accuracy of AUC values were classified into four levels, i.e., poor, fair, good, and excellent, and their ranges are 0.6 to 0.7, 0.7 to 0.8, 0.8 to 0.9, and 0.9 to 1.0, respectively [116]. In this study, ROC curves were plotted on the basis of both datasets, i.e., training and validation points. Here, 50:50, 60:40, 70:30, 80:20, and 90:10% split were used for GESM. The following equations were used to complete the ROC curve: where TP represents the true positive, FN represents the false negative, TN represents thee true negative, FP represents the false positive, P indicates the number of total gullies, and N indicates the number of total non-gullies.

Multi-Collinearity Assessment
Here, a multi-collinearity assessment was conducted in order to select the appropriate factors for gully erosion susceptibility modelling. In order to maintain the accuracy of the predicted models and free them from bias, it was estimated that the VIF and TOL values would select the appropriate parameters without any problems with multi-collinearity. The ranges of TOL and VIF are 0.231 to 0.923 and 1.079 to 4.749, respectively. The ranges of VIF and TOL are far from the permissible threshold, so there is no problem with multi-collinearity in this analysis. The details of the multi-collinearity of all selected parameters are shown in Table 4.

Gully Erosion Susceptibility Modelling
Gully erosion susceptibility was estimated by considering the ANN, GLM, MaxEnt, and SVM machine learning algorithms for this region. The overall data were randomly divided into different ratios (90/10, 80/20, 70/30, 60/40, and 50/50) as training and validation data to estimate the outcome of all predicted models with optimum accuracy. The all output raster of the susceptibility map was reclassified into different qualitative classes (very high, high, moderate, low, and very low) considering Jenks' natural break classifier technique in the GIS environment.

Gully Erosion Susceptibility Modelling Using the General Linear Model (GLM)
The GESMs was prepared by using the GLM method in different sample sizes (random partitioning of the samples) as training and validation data. In GLM, the overall data was randomly classified as training and validation data in different ratios (90/10, 80/20, 70/30, 60/40, and 50/50). The areal percentage in the GLM (90/10 ratio) model for very low, low, moderate, high, and very high gully erosion susceptible areas are 23.93%, 23.10%, 21.19%, 17.78%, and 13.92%, respectively (Figures 6b and 7a).

Assessing the Importance of the Factors
The importance of the variables for estimating the GESMs was estimated with the help of the jackknife AUC values. The maximum importance variables for gully erosion susceptibility are the topography position index (TPI), relative slope position (RSP), valley depth, height above nearest drainage (HAND), land use, drainage density, distance from river, plan curvature, and distance from road respectively. The lowest importance variables for gully erosion susceptibility are lithology, sand content, soil texture, slope, and elevation ( Figure 10). The maximum and minimum importance variables for gully erosion susceptibility are the topography position index and lithology, with AUC values of 0.67 and 0.52, respectively. This type of assessment is helpful to estimate the importance of the variables and the influences of it in a dynamic way.

Discussion
Gully erosion is one of the common environmental issues caused by the natural environment, but the mechanism for the formation and development of gullies can accelerate by anthropogenic activities [23,52]. In the arid and semi-arid environment, the formation and development of gullies is the most problematic issue with global concerns that are related with ecological imbalances of the particular environment. The loss of fertile soil due to severe erosion not only reduces the amount of soil but also reduces soil fertility and associated agricultural productivity [42]. The climatic characteristics of this region relate to the semi-arid, semi-humid, and Mediterranean nature. The impact of extreme climatic conditions is therefore significant and has an impact on the large-scale erosion in the form of gullies. The formation and development of gullies is caused by different environmental conditions and their importance should be analyzed for appropriate modelling and management purposes [117].
The unpredictability of each outcome is mainly due to mechanisms beyond the researcher's influence. Predictive accuracy is mostly based on the instability of both the quality of the data and the selection of the model [118]. Multiple factors can also be attributed to the unpredictability of gully erosion vulnerability models: (i) Insufficient experience of the physical environment and its associated mechanism, which must be analyzed; (ii) the distance over which the analysis can indeed be performed in each of these time or space; (iii) the randomization of the estimation method for the development of a model where gullies exist or are lacking; and (iv) an approximation for some computational method for the physical phenomenon [119]. In this work, our main objective was to highlight the susceptible areas with lees or marginal uncertainty within the predicted models.
In today's research, the application of different machine learning algorithms is one of the reliable predicting tools for predicting the susceptibility of various natural hazards and disasters. For this purpose, different machine learning algorithms have been developed by different decision science researchers. In this regard, the spatial perspective of decision-making was considered to be the most reliable component in the various disciplines. Various machine learning models (e.g., ANN, GLM, MaxEnt, and SVM) are used in this study to estimate areas susceptible to gully erosion. In order to estimate results for better accuracy, the training and validation data were randomly divided into different quantities (e.g., 90/10, 80/20, 70/30, 60/40, and 50/50). ANN 50/50 is the best training and validation dataset model, although all models are associated with higher accuracy. According to the training datasets, apart from the ANN 50/50, other optimal models are ANN 60/40 (0.917), ANN 80/20 (0.910), and ANN 90/10 (0.885). According to the validation datasets, the most optimal model is ANN 50/50 (0.868) and other optimal models are ANN 90/10 (0.867), SVM 90/10 (0.864), and ANN 70/30 (0.837). The importance of all conditioning factors was estimated with the help of the jackknife test from the MaxEnt model. Jackknife checks the individual gully erosion conditional factor's significance in the creation of the predicted models relative to all conditioning factors (red bars) for each predictor variable alone (blue bars), and the decrease in the training benefit when the variable is excluded from the overall model (navy green bars). The topography position index (TPI), relative slope position (RSP), valley depth, and height above nearest drainage (HAND) were recorded consistently as the key determinants of gully erosion, as also was the case in similar research [50,[120][121][122]. The AUC values in the jackknife test of TPI, RSP, valley depth, and HAND are 0.67, 0.665, 0.65, and 0.64, respectively. Apart from this, the lower importance is associated with geological components like lithology (0.52) in gully erosion susceptibility modelling. The variable importance of other condition factors, i.e., aspect, bulk density, clay content, elevation, drainage density, distance from stream, land use, mineral soil, plan curvature, distance from road, sand, silt, slope, soil texture, and TRI, are 0.58, 0.56, 0.615, 0.62, 0.635, 0.63, 0.635, 0.585, 0.625, 0.62, 0.54, 0.575, 0.565, 0.552, and 0.575, respectively. From an evaluative point of view, the topographic indices themselves are not attributes that could be integrated with erosion at the same time. Strategic planners are therefore unable to start estimating soil erosion susceptibility on the basis of topographic indices. As a result, not only is the spatial dimension of erosion shown in the form of gullies, but it is also capable of giving us a theoretical framework and its associated cause-effect relationship between variables and associated erosion. Apart from this, the topographic variables may influence other factors of the condition that may have an influential role in the process of erosion. In this analysis, the maximum values of the TPI are most favorable for the development of large-and medium-sized gullies in the Golestan Dam Watershed as a whole. The relative slope position is one of the dominant factors in the control of pedogeomorphic processes and associated erosion. Higher valley depth is an important factor that directly accelerates the rate of large-scale erosion. In the wet season, the higher depth of the valley is recommended to severe erosion in the form of gullies, where rainfall and its associated runoff can play an essential role in this respect. Rainfall with high kinetic energy is associated with extensive erosion in most of the arid and semi-arid environments. It is also responsible for chemical weathering, which has an indirect effect on the rate of formation and development of the gully. This process confirms the transformation of the various minerals into secondary minerals. In this region, the process of water-induced erosion is accelerating to a height above the nearest drainage point. It is capable of controlling and determining river activity and associated erosion where primary direction and orientation play a vital role in the erosion process. The presence of vegetation in this region is very marginal in nature, and only a small part of the area is associated with agricultural activity, indicating the presence of bare soil and mountain topography with higher slopes. This association is most favorable to large-scale erosion in various forms of erosion, e.g., the creation and development of rills and gullies, etc. The absence of vegetation cover helps to create the maximum amount of runoff, which is directly and indirectly linked with erosion and its associated sedimentation. Higher amounts of rainfall and runoff indicated the greeter probability of erosion in any region. The maximum amount of the drainage network and the existence of gullies are positively linked with each other. The effects of rainfall and drainage during the wet season are not only favorable to the development of new gullies but are also responsible for the expansion of existing gullies. The formation of ephemeral gullies during the wet season is one of the main causes of serious erosion and loss of topsoil. The creation of ephemeral gullies and the associated loss of soil is a major problem in any arid and semi-arid environment. Furthermore, the impact of the road network is the result of the anthropogenic destruction of natural hydrological processes, establishing impervious soil surfaces accumulates runoff and results in large-scale soil erosion [41].
The erosion-prone gulling areas of this region were successfully assessed with an appropriate algorithm and random partitioning of the samples in order to maintain optimum accuracy. Prediction model accuracy was disturbed by a non-linear relationship, which is complex in nature [123,124]. Due to its role in the representation and disclosure of hidden properties and interactions, ANNs, implemented in an appropriate manner, can provide a robust replacement [112,125,126]. In contrast to any conventional simulation, ANN also has no limitations on the source and residual proportions [127,128]. The fault and lack of information can be resolved by ANN, and the presence of this type of inadequacy is capable of predicting the scenario with higher accuracy [129].

Models Prioritization
The final task was to select the optimal models according to their performance and related accuracy level. For this purpose, the prioritization of all predicted models for both training and validation databases was carried out with regard to the performance and robustness of accuracy. The prioritization method is generally used as a sub-catchment priority in morphometric studies with the consideration of different elements [130]. The same method was considered for the selection of the optimal model and the categorization of the models by performance. Based on training datasets that consider AUC values, the most optimal model is ANN 50/50 (0.918) followed by ANN Table 5).

Conclusions
This region is severely confronted with the extreme problem of land degradation in different forms of erosion like the formation of rills, gullies etc. That is why not only the economy of this region is affected but the natural environment and its associated ecosystem are also affected a number of times. Apart from large-scale erosion, construction, such as roads, rail, and bridges, is also associated with large-scale erosion. Various machine learning approaches with random sample partitioning have been made to estimate the most accurate vulnerable regions with maximum possible accuracy. The main objective of this research was to determine the optimal model of gully erosion susceptibility in this region and the development of conceptual backgrounds for the orientation and partitioning of the data for prediction with maximum accuracy. Apart from this, with the random partitioning of the training and validation datasets, we are able to know the data handling model nature and its associated optimal capacity. In this research, the ANN (50/50) is the most optimal model for both training and validation data sets. The second and third optimal model considering the validation datasets were ANN 90/10, and SVM 90/10, respectively. Though, the ANN model in different random partitioning was not capable of estimation with maximum accuracy. This approach should be applicable in any part of the world with different climatic conditions. The role of the gully erosion conditioning factors is very much optimistic for the creation and development of gullies. In this region, the importance of the topographic parameters is the maximum for susceptibility to gully erosion compared to other parameters. Apart from secondary sources, an extensive field visit was carried out to validate the entire models' outcome in a more precise way. The nature of the erosion and its impact on society was also identified at the time of the field visit. The impact of erosion in agricultural resources and the role of the stakeholders are the most conflicting issues in terms of sustainable land management practices. According to the ANN 50/50 model, 3.6.87% of the area of this watershed is associated with very high to moderate gully erosion susceptible zones. Most of the erosion-prone areas of this region are located near the drainage network. So, special watershed management strategies have to be incorporated in the vulnerable regions to escape this type of situation. It may be helpful to develop a conceptual background based on a theoretical perspective on the erosion of the gully, which may be applicable in different regions. Obviously, this type of outcome should be useful and applicable to decision-makers and local stakeholders in order to avoid this kind of serious problem by considering appropriate measures. The contribution and task of future research is to develop a model with the appropriate modification of the algorithm and partitioning of the samples and to link it to the socio-political environmental dilemma.