Novel Ensemble of Multivariate Adaptive Regression Spline with Spatial Logistic Regression and Boosted Regression Tree for Gully Erosion Susceptibility

The extreme form of land degradation through different forms of erosion is one of the major problems in sub-tropical monsoon dominated region. The formation and development of gullies is the dominant form or active process of erosion in this region. So, identification of erosion prone regions is necessary for escaping this type of situation and maintaining the correspondence between different spheres of the environment. The major goal of this study is to evaluate the gully erosion susceptibility in the rugged topography of the Hinglo River Basin of eastern India, which ultimately contributes to sustainable land management practices. Due to the nature of data instability, the weakness of the classifier andthe ability to handle data, the accuracy of a single method is not very high. Thus, in this study, a novel resampling algorithm was considered to increase the robustness of the classifier and its accuracy. Gully erosion susceptibility maps have been prepared using boosted regression trees (BRT), multivariate adaptive regression spline (MARS) and spatial logistic regression (SLR) with proposed resampling techniques. The re-sampling algorithm was able to increase the efficiency of all predicted models by improving the nature of the classifier. Each variable in the gully inventory map was randomly allocated with 5-fold cross validation, 10-fold cross validation, bootstrap and optimism bootstrap, while each consisted of 30% of the database. The ensemble model was tested using 70% and validated with the other 30% using the K-fold cross validation (CV) method to evaluate the influence of the random selection of training and validation database. Here, all resampling methods are associated with higher accuracy, but SLR bootstrap optimism is more optimal than any other methods according to its robust nature. The AUC values of BRT optimism bootstrap, MARS optimism bootstrap and SLR optimism bootstrap are 87.40%, 90.40% and 90.60%, respectively. According to the SLR optimism bootstrap, the 107,771 km2 (27.51%) area of this region is associated with a very high to high susceptible to gully erosion. This potential developmental area of the gully was found primarily in the Hinglo River Basin, where lateral exposure was mainly observed with scarce vegetation. The outcome of this work can help policy-makers to implement remedial measures to minimize the damage caused by erosion of the gully.


Introduction
Soil erosion is a threat to the environment through various forms of erosion, such as sheet, rill, gully, ravine and which finally becomes part of the badland's topography [1][2][3]. Gullies (0.5 m deep and more than 100 m wide) are formed after the complete process of rill erosion and before the starting stage of a ravine [4,5]. The world suffers from severe soil-related issues, such as loss of fertility and soil quality, and this erosion does in fact have limitations on land use. On-site and off-site environmental problems sometimes cause social or economic losses [2,3,6]. Water moves eroded soil towards rivers, where it reduces the depth of the river and reservoir, causes sediment loss and reduces agricultural productivity. In gully erosion, the top soil of the land is transported away by water and becomes a major source of sediment in rivers. Walling (1998) and May et al. (2005) argued that when soil particles retain less strength than wind, water, and glacier power, soil erosion (rill formation) occurs on the Earth's surface [7,8]. Soil erosion is considered as a disaster because of its association with several issues such as land degradation, less agricultural productivity and surface quality and quantity, all of which are inversely related to human and environmental health. This type of phenomenon causes desertification, groundwater pollution, flooding and a decline in soil fertility. In order to achieve the objective of sustainable development in modern times, land-use planners, engineers, farmers and the general public are likely to prevent soil loss.
Asia has the highest soil erosion with an annual soil erosion rate of 29.95 tons per hectare due to land degradation. Das (1985) stated that in India, 38% of the total geographical area is subjected to serious erosion. The Ministry of Agriculture, Government of India, published in 1980 that 53% of India's geographical area has been affected by land degradation in general and soil erosion in particular. According to the information in World Soil Erosion and Conservation (1994), in India, more than 70% of the population is dependent on the agricultural system and 3.7 million hectares of agricultural land suffer from fertility losses [9,10]. The loss of nutrients in the soil causes a number of losses in cereal, oilseed and pulse production, equivalent to USD 2.51 billion [11].
Soil erosion and land degradation are common in the eastern part of India because of prevalent flash floods and the high intensity of rainfall during the monsoon season.
The literature shows that over the last 40 years, one third of the world's agricultural land has been severely affected by soil erosion, and this process continues [12]. Various studies have provided several suggestions for mitigation strategies of gully erosion. Among them, many researchers argued that the presence of vegetation could reduce gully head cuts and the development of the gully [13][14][15][16]. Land use and land cover change are strongly related to gully influences and controls the topographic threshold condition [13,17]. On the other hand, the root characteristics are capable of increasing soil resistance, and the presence of vegetation in the gully area has a direct effect on soil [18]. Increased root density, size and species may increase head cuts and soil tensile strength in the river bank as indicated by different researchers whoinformed that root related parameters have also been used to determine the gully erosion model [19][20][21]. Although gully erosion is a common type of water erosion, many related factors have made it complex, such as slope, elevation, aspect, plan curvature, profile curvature, Remote Sens. 2020, 12, 3284 3 of 35 stream power index (SPI), distance to river, topographical wetness index (TWI), rainfall, drainage density, surface runoff, organic matter, soil texture, soil moisture, lithology, lineament, distance to road, land use and land cover (LULC), and normalized difference vegetation index (NDVI) [22,23].
In recent times, machine learning (ML) algorithms have been widely used for the spatial prediction of gully erosion [24][25][26]. Based on different causes, land use assessment not only helps to evaluate the resource, but also helps to categorize the risk zone from high to low [27]. Although topographic, geomorphological, geological, hydrological and soil characteristics play an important role in identifying vulnerable areas, they sometimes become insufficient. Chorley (1969) stated that the hydro-geomorphological parameters have a good predictable capacity to establish basin or catchment gully zones. A large number of geo-morphometric parameters were used with statistical techniques and resulted in a logical and systemic output. The Geographic Information System (GIS) environment has made research work more valuable by handling large databases.The GIS considers all factors to determine the occurrence of gully erosion and may be used in the data processing system. Various machine learning, artificial intelligence based, statistical and other models based on the GIS have been used to assess gully susceptibility in a number of literature reviews, such as logistic regression [28][29][30], random forest [25], multivariate adaptive regression spline (MARS) [31], frequency ratio [32], artificial neural network [33] and support vector machine [26,34]. Various physical-based methods have also been used to estimate the amount of erosion from gullies such as the condensed real-time model(CREAMS), which is conducted by runoff and erosion from agricultural management; the ephemeral gully erosion model (EGEM); and the water erosion prediction project (WEPP). In the present study, boosted regression tree (BRT), machine learning model (MARS) and statistical approach (SLR) were used to predict gully areas.
Current research focuses on assessing the impact of different geo-environmental factors on gully erosion susceptibility and identifying the vulnerable regions. All the data were processed and validated by using machine learning and statistical model based approaches. There are two types of approaches to the occurrence of gully erosion which have been identified considering the existing literature. The relationship with gully erosion and physical conditions is based on a training dataset that will finally be validated through different model and field based information. Various factors, such as topographical, hydrological, climatic, anthropogenic and morphometric factors, have influenced soil erosion. The overall objectives are: 1. Identify the major geo-environmental causes of gully erosion in the study region. 2. Evaluate changes in the land area and their environmental impact 3. Establish and locate gully prone areas in a variety of models (BRT, MARS, and SLR) 4. Finally, produce a map of the susceptibility of gully erosion with management of erosion and control planning. The main contributing aspect of this paper is the development of the most optimal model considering different appropriate resampling algorithms.

Study Area
We have focused on the Hinglo river basin, a tributary and fifth order stream of Ajoy River in eastern India. Gully erosion is getting acute due to soil erosion by water through poor drainage systems and over unprotected land surfaces in this river basin. The basin area covers 444.308 km 2 and the length of this river is 65 km. The Hinglo River basin is located between 23 • 42 7" N to 24 • 0 56" N and 86 • 59'32" E to 87 • 23 31" E ( Figure 1). Granitic rock of the Pleistocene age was found in the upper catchment area, while Barakar formations and ironstone shale formed in lower catchment area [35]. The groundwater potentiality increased towards the east, where older alluvium deposition also increased over time. Mukherjee et al. (2007) stated groundwater table variation depends on seasonal fluctuations [36]. In this region, ferruginous processes made the soil characteristics different and produced a reddish type soil which is called laterite. The north-western part is characterized by physical and chemical weathering, mass wasting, laterization, the red weathered zone and formation of duricrust (a hard layer of minerals developed at or close the soil surface) (Figure 2). The south-eastern part of this region has physiographic features with an alluvial plain. The direction of the slope of this river basin is north to Remote Sens. 2020, 12, 3284 4 of 35 east. The Hinglo River is divided between two states of India (Jharkhand and West Bengal). Originating from the spring of the Jamtara district of Jharkhand, it enters into West Bengal and covers the Birbhum district and finally merges with the Ajoy River near Palashdanga. The climatic characteristics of the area are dominated by the north-eastern monsoon and the south-western monsoon. It is under subtropical climatic categories where the summer is hot and dry, and the winter is dry. The rainfall (80%) is due to the south-western monsoon and the forest land is quite low. The annual rainfall is 1316 mm and the elevation ranges from 0 to 289 mt. The National Atlas and Thematic Mapping Organization (NATMO) (2001) recognized the soil textural class of the area as sand, clay, loam, sandy loam, loam and fine loam. By completing a geological survey, we identified different geological segments such as quartzite, alluvium, shale and granite gneiss. During the field survey, the erosion of the gully was mainly found in the upper and middle regions where the slope is steeper to gentle. The Hinglo River covered the Chhotanagpur Plateau in the West Bengal Plain and, therefore, the erosion distribution of the gully is heterogeneous. In the upper and middle parts, people are converting their own land into agricultural land, building construction profiles and clearing the forest, which in turn affects the soil and forms gullies. Preparing the gully profile, we found the 'V' shaped gully in the upper segment and the U shaped gully in the middle section due to the deposition process.
Remote Sens. 2020, 12, 3284 5 of 35 regions where the slope is steeper to gentle. The Hinglo River covered the Chhotanagpur Plateau in the West Bengal Plain and, therefore, the erosion distribution of the gully is heterogeneous. In the upper and middle parts, people are converting their own land into agricultural land, building construction profiles and clearing the forest, which in turn affects the soil and forms gullies. Preparing the gully profile, we found the 'V' shaped gully in the upper segment and the U shaped gully in the middle section due to the deposition process.

Database
Various databases were considered in this study. Detailed information about the databases is shown in Table 1.

Database
Various databases were considered in this study. Detailed information about the databases is shown in Table 1.

Data Source and Framework of Methodology
In this research, different geo-environmental data were used, such as topographical parameters, hydrological, soils, geological and environmental factors. The whole procedures of current research work have been presented infour steps. These are: 1. Create an inventory map of the gully using satellite data and information from field observation. 2. Based on geo-environmental causative factors, prepare the gully erosion susceptibility map of the Hinglo River basin by applying different approaches. 3. Use ArcGIS 10.3 to prepare the thematic layers and create the gully erosion susceptibility map for the study area.

Gully Inventory Map (GIM)
The locations of the gullies were identified during the field survey using a hand-held GPS, Google Earth images, aerial photos (1:10,000) and a topographical map (1:50,000). We considered the training to be an input process and this map validates the models as a result. Among the 100 gully inventory points, 70% were used as inputs and 30% were used as validations, which were randomly divided. Not only the gully area, but also the non-gully area is shown in this inventory map.

Conditioning Factors
A literature review shows us different geo-environmental factors that cause gully erosion [38,39]. Therefore, each and every factor must be very prominent. These geo-environmental factors were classified into such categories:

Topographical
Topography plays a major role in developing gullies. Gómez-Gutiérrez et al. (2015) stated topographic features control the drainage network, erosive power and water velocity [40]. Six topographic parameters were counted to develop the gully erosion susceptibility map, such as slope, elevation, aspect, plan curvature, profile curvature and slope length and steepness (LS) factor ( Figure 3). These were extracted from ALOS (Advanced land observing satellite) PALSAR (Phased  (2) where Pow is the Power on the ArcGIS platform, Flow Accumulation is the grid layer of the accumulation of the flow as expressed the number of grid cells, and the cell size is the length of the grid cells. The L and S factors have been estimated from the ALOSPALSAR DEM.

Hydrological Factors
Drainage density, drainage proximity, rainfall, rainfall and runoff erosivity (R) factor, topographical wetness index (TWI) and stream power index (SPI) were considered as hydrological parameters (Figure 4). The drainage density of this region has been estimated from ALOSPALSAR DEM, taking into account 1000 threshold values in the GIS environment. The rationale for selecting this threshold value was based on the previous literature in this region [41]. Apart from this, the estimated drainage network is similar to Google Earth, satellite images and topographic maps (Survey of India). Annual rainfall data were collected from the India Meteorological Department(IMD). The drainage density and stream power index were extracted from the DEM. By

Hydrological Factors
Drainage density, drainage proximity, rainfall, rainfall and runoff erosivity (R) factor, topographical wetness index (TWI) and stream power index (SPI) were considered as hydrological parameters ( Figure 4). The drainage density of this region has been estimated from ALOSPALSAR DEM, taking into account 1000 threshold values in the GIS environment. The rationale for selecting this threshold value was based on the previous literature in this region [41]. Apart from this, the estimated drainage network is similar to Google Earth, satellite images and topographic maps (Survey of India). Annual rainfall data were collected from the India Meteorological Department(IMD). The drainage density and stream power index were extracted from the DEM. By forming and developing gullies, the river can extend its own length. The SPI is calculated as follows [42]: where As represent total catchment area, σ indicates the slope gradient [43].
forming and developing gullies, the river can extend its own length. The SPI is calculated as follows [42]: where As represent total catchment area, σ indicates the slope gradient [43]. TWI has been computed by using the following equation: where ∝ is the Flow Accumulation β is the Slope and C is the Constant (0.01). The R factor of this region has been calculated with the help of the following equation [44,45] : where R is the rainfall and runoff erosivity factor and it express in MJ/ha/year. TWI has been computed by using the following equation: where ∝ is the Flow Accumulation β is the Slope and C is the Constant (0.01).
The R factor of this region has been calculated with the help of the following equation [44,45]: where R is the rainfall and runoff erosivity factor and it express in MJ/ha/year.

Soil Characteristics
Gully erosion is a type of soil erosion by water, where soil texture, soil depth and soil properties control the erosive capacity of the water ( Figure 5), where organic matter, soil moisture and soil texture are considered asthe primary role for creation and development of gullies. TheNational Bureau of Soil Survey and Land Use Planning (NBSS&LUP), State Agriculture Management and Extension Training Institute(SAMETI) and collected soil samples helped to provide the soil erodibility of this region: where K is the soil erodibility (ton ha-1 unit of R); M is the (% silt +% very fine sand) (100-% clay); OM is the percentage of organic matter and P is the permeability class; and S = structure class [46].

Soil Characteristics
Gully erosion is a type of soil erosion by water, where soil texture, soil depth and soil properties control the erosive capacity of the water ( Figure 5), where organic matter, soil moisture and soil texture are considered asthe primary role for creation and development of gullies. TheNational Bureau of Soil Survey and Land Use Planning (NBSS&LUP), State Agriculture Management and Extension Training Institute(SAMETI) and collected soil samples helped to provide the soil erodibility of this region: where K is the soil erodibility (ton ha-1 unit of R); M is the (% silt +% very fine sand) (100 − % clay); OM is the percentage of organic matter and P is the permeability class; and S = structure class [46]. .

Geological Factors
The thickness of the soil ( Figure 6) is dependent on the soil parent material and indirectly controls the formation of the gully. Further information on the geology of the study area was provided by the Geological Survey of India.

Geological Factors
The thickness of the soil ( Figure 6) is dependent on the soil parent material and indirectly controls the formation of the gully. Further information on the geology of the study area was provided by the Geological Survey of India.

Environmental Factors
Distance from the river, LULC, NDVI and distance from the road affects the surface runoff and is considered as environmental factors ( Figure 7). These features were estimated from the topographical map at 1:50,000 and Sentinel 2a satellite image at 10 mt spatial resolution (March 28, 2020). The NDVI was estimated using the following equation [47]:

Environmental Factors
Distance from the river, LULC, NDVI and distance from the road affects the surface runoff and is considered as environmental factors ( Figure 7). These features were estimated from the topographical map at 1:50,000 and Sentinel 2a satellite image at 10 mt spatial resolution (28 March 2020). The NDVI was estimated using the following equation [47]: Remote Sens. 2020, 12, x FOR PEER REVIEW 12 of 35

Methodology Flow Chart for Gully Erosion Susceptibility
The flowchart shows the methodology used in this work. Four parts of this methodology have been included: 1) preparation of data; 2) multi-collinearity assessment using Variance inflation factor (VIF) and tolerance; 3) gully erosion susceptibility modeling by BRT (machine learning), MARS (machine learning ensembles) and SLR (statistical approach) with different resamples (5-fold CV, 10-fold CV, bootstrap and optimism bootstrap) techniques; and 4) process of validation by means of receiver operating characteristics (ROC) curve and different statistical indices (Figure 8).

Methodology Flow Chart for Gully Erosion Susceptibility
The flowchart shows the methodology used in this work. Four parts of this methodology have been included: (1) preparation of data; (2) multi-collinearity assessment using Variance inflation factor (VIF) and tolerance; (3)

Multi-collinearity Test
Multi-collinearity shows the linear relationship between different variables. This assessment provides the conformity of the application and clarifies the result when statistical approaches are used [22]. It shows the linear dependence of the different independent variables. In each method used, no covariance is passed on to the least absolute shrinkage and selection operator [48]. Cama et al.(2017) stated that the use of VIF with TOL tends to predict co-relationships [49]. VIF connected with TOL, based on the calculation of 1-R 2. , where R 2 is the regressing variable in multivariate regression. The result of VIF associated with TOL varies from less than 0.1 to more than 10 and is strongly multi-collinear.

Boosted Regression Tree
Boosted regression tree (BRT) is a combination of regression tree and boosting [50]. Many decision trees have repeatedly been fitted to the BRT, such as the random forest model, in order to improve the accuracy of the model. There was a difference between the two methods used to construct selected trees in the data. In both techniques, all data were taken randomly for the construction of each new tree. However, the random forest model used the baggage method, which indicated the same probability of subsequent samples being selected for each occurrence. BRT was used as a boosting method, where input data were weighted in the trees. Applying this model, the weights were poorly modeled so that the previous tree was chosen as the new tree. This indicates that the first tree to be fitted to the model will take account of the error and that the tree will become a new tree. In this way, taking the previous tree against a new tree, the model improved its accuracy and became a powerful model. The BRT model considered two parameters to be discussed:

Multi-Collinearity Test
Multi-collinearity shows the linear relationship between different variables. This assessment provides the conformity of the application and clarifies the result when statistical approaches are used [22]. It shows the linear dependence of the different independent variables. In each method used, no covariance is passed on to the least absolute shrinkage and selection operator [48]. Cama et al. (2017) stated that the use of VIF with TOL tends to predict co-relationships [49]. VIF connected with TOL, based on the calculation of 1-R 2. , where R 2 is the regressing variable in multivariate regression. The result of VIF associated with TOL varies from less than 0.1 to more than 10 and is strongly multi-collinear.

Boosted Regression Tree
Boosted regression tree (BRT) is a combination of regression tree and boosting [50]. Many decision trees have repeatedly been fitted to the BRT, such as the random forest model, in order to improve the accuracy of the model. There was a difference between the two methods used to construct selected trees in the data. In both techniques, all data were taken randomly for the construction of each new tree. However, the random forest model used the baggage method, which indicated the same probability of subsequent samples being selected for each occurrence. BRT was used as a boosting method, where input data were weighted in the trees. Applying this model, the weights were poorly modeled so that the previous tree was chosen as the new tree. This indicates that the first tree to be fitted to the model will take account of the error and that the tree will become a new tree. In this way, taking the previous tree against a new tree, the model improved its accuracy and became a powerful model. The BRT model considered two parameters to be discussed:

Tree complexity (TC):
The number of splits in each tree was controlled by this procedure. The value of 1 showed 1 tree split, which means that the model did not consider interactions between environmental variables. Values greater than 1 or 2 showed two divisions of very high interactions.
Learning rates (LR): This parameter was used to determine the contribution of each tree used to make the model. It was necessary to construct small values of LR in the trees.
The large number of environmental variables compared to observations makes it very useful to strengthen the missing values and the outliers. This model is correlated with sets of independent variables and can accommodate any type of variables such as continuous, categorical, missing and non-independent variables. Important predictor variables enable functions, and their interactions have been identified by BRT and the model has been formed without the assumptions of these interactions or functions [51]. It has been used to solve the problem of regression and classification. It determines the input variables. This is a boosting and optimization technique that boosts the tree's decision weekly, then prepares the residual, determines the loss of function to calculate it, and shows the difference between the tree's output and the target values. To minimize the loss of function, boosting the algorithm has added a new tree to each step [51].
where MCV is the majority vote classification, a m : log which r m : is associated to compute the (weighted) misclassification rate, fit classifier c m to the weighted data. In this equation, recalculate weights w i = w exp(mI(y i c m ) that initialize weights equal to w i = 1 n for m = 1 to the next category c m [52].

Multivariate Adaptive Regression Spline
Friedman stated MARS is an extension of the generalized linear model and is one type of non-parametric regression technique [53]. MARS uses a set of independent variables to predict the value of the output variable and to solve the problem of the regression type. In order to establish a linear or non-linear relationship between dependent and independent variables, this model is partitioned using binary recursive splines. By assessing cut points similar to step functions, the non-linearity aspect of the polynomial regression was captured by the convenient approach provided by multivariate adaptive regression spline (MARS). Predictors have been developed by the procedure, and its individual data points have been developedas a knot and a linear regression model with candidate characteristics. The whole process continues until several knots have been found and a high non-linear pattern has finally been created. These many knots will help tie a good relationship to training data without generalizing new unseen data. In addition, creating a full set of knots makes it easy to remove those knots which do not measure the accuracy significantly. This new removing process, named "Pruning", uses cross validation with the previous model to find the optimal number of knots. To predict the probabilities in MARS incorporated with logistic regression, we can look at the example of when generalized linear models are incorporated with MARS through the link function. On the other hand, when the underlying form of function is known in the MARS model, non-linear regression and regression have been used to estimate function parameters. Not only that, MARS also estimates functions that have serious limitations on the nature of the functions. As a generalization of recursive partitioning, MARS can better handle non-categorical data. The equation of MARS may be as follows by Quirós et al. (2009) [54]: where k is the connections and findings one of the dependent components and x is an independent component. Thus, the MARS model is explained as follows: where y is the function-predicted dependent variable, β is a constant, M is the number of terms and x is the categorical variables. H m is the baseline constant and α m , coefficients calculated by reducing the total squared errors. Craven and Wahba (1978) has stated that to determine the optimal model for MARS, the generalized cross validation method is used [55]. (11) where N is the size of samples and C(H) is the dependent variable which rises with both the amount of base functions (BF) in the model and is determined on the basis of following equation: where d is the justice for each origin utility is considered in the model and H is the number of essential purposes. The MARS theory does not require a generalization study of the effect between the response variable and the causative parameters. The MARS model is applicable as a machine learning technique by applying specific adaptive developments.

Spatial Logistic Regression
Logistic regression or binary logistic regression is a multivariate regression model where a categorical dependent variable related to many independent variables [56,57]. The predictor variables for regression values predicted the presence and absence of dependent variables [58,59]. The algorithm of logistic regression converted the dependent variables to logit variables by applying maximum likelihood estimation [60]. The merit of LR is that any type of independent variable can be used for the logistic regression algorithm, which means that variables may be normal or categorical [61]. In order to predict the ratio of each independent variable to the dependent variables, the LR coefficient was used for the multivariate model [62]. The factors and the dependent variables are numerical, and the dependent variables must have the nominal data for the LR algorithm. The occurrence of gully erosion depends on a number of pedo-gromorphic and hydrological factors. The forward, backward stepwise and enter logistic regression method have been used in many studies, but the entire method gets the all coefficients of regression. Now, all coefficients of regression were needed for further raster calculation and developed a gully erosion susceptibility map by spatial logistic regression. Statistically, the LR algorithm with the relationship of independent (gully erosion causative factors) and dependent (occurrences of gullies) variables is expressed below as Equations (13) and (14) [61,63].
In Equation (13), P is the predicted probability of gully erosion, where the probability value varies from 0 to 1 on anS-shaped curve, and z is the linear combination expressed in the following equation. In Equation (14), b 0 is the constant of the model, b i (i = 1, 2, . . . , n) is the slope or coefficients of the LR model, and x i (i = 1, 2, . . . , n) is the independent variable. Binary LR is a non-spatial model and there is a spatial auto-correlation between data. As a result, the spatial structure is included in the logistic regression model by changing some expressions. Now, the following modified equation may be used for spatial logistic regression (SLR). In this spatial autocorrelation, three things are important.These are spatial weight matrix (W), spatial autocorrelation parameter (p) and error term obeying a Gaussian distribution (ε). The spatial structure of gully erosion is calculated through Equations (15) to (18) below [64]: . . .
where ρWY is the spatial structure effect caused by spatial autocorrelation, W is the weight matrix, f d ij is the inverse weighting function expressed as Equation (15),ε is the error term obeying a Gaussian distribution, and L is the integrated nested Laplacian approximation to reduced time during estimation [65]. The statistical software was used to calculate the correlation between the location and erosion of the gully, which affects the coefficient of the factors. The rest of the work on the GIS platform was carried out by the raster calculator, following Equations (15) and (18) to produce an SLR map of gully erosion susceptibility.

Resampling Methods
Resampling methods are used to draw samples from a dataset repeatedly and to re-assemble a model for each sample in order to learn more about the fitted model. Resampling methods can be costly, as they require repeated output from different N subsets of data using the same statistical method. In order to obtain further information on the model fitted, this method adjusts the model of interest to the samples generated by the training dataset. They provide the estimates of test-set forecast error, the confidence interval, and the bias of our estimates for the parameters. The error of the experiment is the average error resulting from the use of a statistical learning system to evaluate the effects on a sample not used to train the process. On the other hand, the error in training can be conveniently assessed by finding a statistical learning method for the measurements used in training. Various resampling algorithms are available, including: cross validation (CV), leave-one-out cross validation (LOOCV), K-fold cross validation, and bootstrap.

K-Fold Cross Validation
K-fold cross-validation is performed by arbitrarily dividing the result collection into K classes or folds of approximately the same scale. Related to a single-out cross-validation, one of the K-folds is used as a collection of validation, while the other K-folds are used as a collection of checks to calculate the check error for K. The forecast check error for K-fold cross-validation is the sum of all results.
Typical values for K are 5 or 10, since they require fewer calculations than when K is equal to n. Cross-validation can be used both to determine how well a specific statistical learning process performs on recent data and to determine the lowest point in the calculated square mean error curve, which may be helpful in contrasting statistical learning methods or in contrasting different degrees of versatility for a single statistical learning model.

Bootstrap
The bootstrap is a universally applied method used to quantify the uncertainty of a given estimation method or statistical learning process, along with those for which it is difficult to measure the variability. By repeatedly collecting data from the source data set, bootstrap generates a separate dataset. These datasets can be used instead of sampling independent datasets from the target population as a whole. Bootstrap sampling includes anarbitrary selection of substitution inferences, which means that certain findings can be identified numerous times while other observations have not been included [66].
This method is reproduced B times to produce B data sets, Z, Z 1 , Z 2 , ..., Z B , which can be used to measure only certain quantities like standard error.
Sensitivity of the predicted models has been estimated with considering following equation: where, TP is the true positive and FN is the false negative. Specificity of this estimation has been worked out with considering the following equation: where, TN is the true negative, FP is the false positive and TN is the true negative. PPV is the pixel probability value is classified as a gullying area. The calculation is given bellow: where, PPV is the positive predictive value, TP is the true positive, FP is the false positive of this estimation. Negative predictive value has been measured with the help of the following equation: False positive rate has been estimated with the help of the following equation: False discovery rate of this estimation has been worked out with the help of the following equation: where, FDR is the false discovery rate, FP is the false positive and TP is the true positive. False negative rate of this analysis has been estimated with considering the following equation: where, FNR is the false negative rate, FN is the false negative and TP is the true positive. Accuracy of the all predicted models was estimated with the help of the following equation: where, ACC is the accuracy, TP is the true positive, TN is the true negative, P is the positive and N is the negative values of this estimation. F 1 Score has been estimated with the help of the following equation: where, F 1 is the harmonic mean of the precision, TP is the true positive, FP is the false positive and FN is the false negative. Alternatively, the Matthews correlation coefficient (MCC) is a much robust statistical metric that only generates a high scores if the forecast yielded good outcomes in all basic uncertainty matrix groups (TP, FN, TN, and FP), both in comparison to the amount of positive sides and the size of negative aspects in the database: The AUC of ROC has been estimated with the help of the following method: where, S AUC is the area under curve, X k is the 1-Specificity and S k is the Sensitivity of the ROC. A common instrument for calculating model output is the receiver operating characteristic (ROC). ROC is plotted on the x-and y-axis, depending on the sensitivity and 1-specificity. ROC's area under curve (AUC) is projected on model efficiency. The statistical principle and the equation of this method are presented in detail in previous studies [80]. Sensitivity (i.e., likelihood of detection) raises the issue of which portion of the gully erosion observed is correctly labeled, and its optimum value is 1. The precision (i.e., negative predictive rate) raises the issue of which part of the non-gully erosion is accurately defined, and its optimum value is 1 [81]. AUC values below 0.6, 0.6-0.7, 07-0.8, 0.8-0.9 and above 0.9 suggest the model's poor, fair, average, outstanding and very high consistency, respectively. The ROC, which trains data collection, generates model performance and finally tests the suitability of a model. The ROC from the data collection of the evaluation shows the predictive importance of the model and how strong or poor the predictive model is.

Multi-Collinearity Analysis
Analysis of multicollinearity is another factor selection methodology for assessing the "non-independence" of gully erosion conditioning parameters. Due to higher relationships betweenvariables, the projected outcomes can be inaccurate and inconsistent [82]. It is considered as a linear relationship between different independent variables. The TOL and VIF values are ≥0.1 and ≤10, representing optimal multi-collinearity [83]. In the present study, multi-collinearity has been considered when taking into account the VIF and tolerance values for better accuracy of the predicted models and minimizes the error in this perspective. The highest and lowest VIF and TOL values were 1.099 to 3.688 and 0.271 to 0.909, respectively, taking into account 20 gully erosion conditioning factors (Table 2). It has been shown that there is no problem of multi-collinearity, therefore the factors responsible for the erosion of the gully have been considered in this research. From Table 3, it can be seen that every selected parameter for gully erosion is independent in nature and the values of VIF and TOL are associated with the permissible limit (values of TOL and VIF are ≥0.1 and ≤10, respectively). Predictions of the coefficient will vary significantly, depending on which independent variables are associated with this model. Coefficients throughout the method differ from significantly optimistic to minor alterations. Multi-collinearity affects the accuracy of the coefficients that also undermine the outcomesof optimal capacity.

Validation of the Models
The results of the training data reinforce the gully erosion susceptibility models, while the results of the validation data reflect the optimal evaluation of the final models to fit with the training samples. In this analysis, we present estimation of the statistical index and ROC curve for calculation of the validity of various gully erosion models. Table 3 shows the performance of the various models-BRT, MARS and SLR-with multiple resampling techniques with the help of different statistical outcomes. All statistical findings show that the performance of spatial logistic regression is much better than that of other models. In addition, the findings of gully erosion susceptibility models are validated by performing AUC of ROC on the basis of a database of known gully points. In ROC, the susceptibility models were compared with the gully and non-gully erosion points in the region. The ROC curves have been graphically represented depending on the different threshold reductions using optimistic-class ratios, i.e., true positive rate and false positive rate, to evaluatethe binary classifiers' dynamic response and associated accuracy. The ROCs for BRT, BRT 5-fold CV, BRT  Of them, the SLR model is most optimal according to its predictive capacity (Figure 9).

Gully Erosion Susceptibility Modeling
For a fair evaluation of the estimated susceptibility map, it is essential to find the optimal variables for simulation, as chosen factors in the training dataset depend on each other to create uncertainty in the experiments. In addition, it is important to quantify the predictive efficiency and multi-collinearity of the twenty conditioning parameters chosen when predicting the susceptibility to gully erosion. Thus, calculating the important values of the conditioning parameters is necessary.
Machine learning algorithms such as BRT, MARS and SLR with different resampling approaches (5-fold CV, 10-fold CV, bootstrap and optimism bootstrap) have been used to assess the gully erosion susceptibility ( Figure 10). The optimism bootstrap of BRT, MARS and SLR isassociated with better prediction capability than other resampling methods. The optimal capacity of the resampling algorithm has been observed in this research, and the main feature of the resampling algorithm is that it improves the accuracy of the classifier. In the BRT optimism bootstrap model, most of the regions are associated with very low (20.03%), low (28.66%) and medium (25.60%) gully erosion susceptibility where the remaining part of the region isassociated with high (17.55%) and very high (8.16%) susceptible zones.

Gully Erosion Susceptibility Modeling
For a fair evaluation of the estimated susceptibility map, it is essential to find the optimal variables for simulation, as chosen factors in the training dataset depend on each other to create uncertainty in the experiments. In addition, it is important to quantify the predictive efficiency and multi-collinearity of the twenty conditioning parameters chosen when predicting the susceptibility to gully erosion. Thus, calculating the important values of the conditioning parameters is necessary.
Machine learning algorithms such as BRT, MARS and SLR with different resampling approaches (5-fold CV, 10-fold CV, bootstrap and optimism bootstrap) have been used to assess the gully erosion susceptibility (Figure 10). The optimism bootstrap of BRT, MARS and SLR isassociated with better prediction capability than other resampling methods. The optimal capacity of the resampling algorithm has been observed in this research, and the main feature of the resampling algorithm is that it improves the accuracy of the classifier. In the BRT optimism bootstrap model, most of the regions are associated with very low (20.03%), low (28.66%) and medium (25.60%) gully erosion susceptibility where the remaining part of the region isassociated with high (17.55%) and very high (8.16%) susceptible zones.
In the MARS optimism bootstrap model, most of the regions are associated with very low (25.28%), low (21.50%) and medium (25.81%) gully erosion susceptibility where the reaming part of the region isassociated with high (18.79%) and very high (8.62%) susceptible zones (Table 4).
In the SLR optimism bootstrap model, most of the regions are associated with very low (25.39%), low (21.38%) and medium (25.72%) gully erosion susceptibility where the reaming part of the region isassociated with high (18.85%) and very high (8.66%) susceptible zones.
Apart from this, the quality of the predicted models (BRT, MARS and SLR with different resampling techniques) is shown from direct observation with the help of the Taylor diagram ( Figure 11). It is capable of drawing graphical findings where a pattern or sets of patterns are closely linked with observations. The relation between different patterns has been identified byconsidering  In the MARS optimism bootstrap model, most of the regions are associated with very low (25.28%), low (21.50%) and medium (25.81%) gully erosion susceptibility where the reaming part of the region isassociated with high (18.79%) and very high (8.62%) susceptible zones (Table 4). In the SLR optimism bootstrap model, most of the regions are associated with very low (25.39%), low (21.38%) and medium (25.72%) gully erosion susceptibility where the reaming part of the region isassociated with high (18.85%) and very high (8.66%) susceptible zones.
Apart from this, the quality of the predicted models (BRT, MARS and SLR with different resampling techniques) is shown from direct observation with the help of the Taylor diagram ( Figure 11). It is capable of drawing graphical findings where a pattern or sets of patterns are closely linked with observations. The relation between different patterns has been identified byconsidering the nature of the correlation, root mean square error (RMSE) and degree of variations. The degree of variations has been estimated byconsidering the standard deviations [84]. The location of an individualpointappearing on the plot evaluates the estimatedpattern of gully erosion susceptibility modelwhich coincides with the observed information. The RMSE-based variation amongthe trends being estimated and observed is proportional to the differencesto the point defined as "observed" on the x-axis. Here, the gray contour indicates the RMSE in the distribution of patterns. The efficiency of optimism bootstrap (SLR optimism bootstrap, MARS optimism bootstrap and BRT optimism bootstrap) is more optimal than other models.

Importance Value
The selection of the appropriate parameters for modeling is important, as the selected factors depend on each other in the training dataset. Therefore, it is necessary to calculate the predictive capacity and multi-collinearity of the 20 selected conditioning parameters for modeling gully erosion susceptibility, and therefore to calculate the importance values for each conditioning parameter. The findings suggest that the elevation (18.48), drainage density (18.33), LS factor (16.56), R factor (16.06), soil texture (13.52) and rainfall (11.72) are of greater importance where gully erosion occurs, while the other variables, such as aspect (1.60), TWI (1.19), geology (1.46), geomorphology (2.18), distance from lineament (2.33) and others are of lesser importance (Table 5). From a partial plot of important variables, it is easy to understand the importance of significant factors and their relationship to the formation of gully erosion ( Figure 12). The use of a partial plot is very important because it is capable of evaluating the effect (in the context of the dependent's prediction accuracy index) of the independent factori th in the context of other existing factor-independent features [85]. To determine the nonlinearity, if there are any i th parameters, the appropriate transformation must be picked precisely, such that the normal residual plot indicateslinearity differences, whereas the partial residual plot indicatesboth the magnitude of the linearity variance and the linearity magnitude and position. In first plot, the importance of elevation in correspondence to the drainage density shows the maximum importance. Here, the importance of topographical and hydrological factors was relatively higher than the other conditioning factors used in this study.

Importance Value
The selection of the appropriate parameters for modeling is important, as the selected factors depend on each other in the training dataset. Therefore, it is necessary to calculate the predictive capacity and multi-collinearity of the 20 selected conditioning parameters for modeling gully erosion susceptibility, and therefore to calculate the importance values for each conditioning parameter. The findings suggest that the elevation (18.48), drainage density (18.33), LS factor (16.56), R factor (16.06), soil texture (13.52) and rainfall (11.72) are of greater importance where gully erosion occurs, while the other variables, such as aspect (1.60), TWI(1.19), geology (1.46), geomorphology (2.18), distance from lineament (2.33) and others are of lesser importance (Table 5). From a partial plot of important variables, it is easy to understand the importance of significant factors and their relationship to the formation of gully erosion ( Figure 12). The use of a partial plot is very important because it is capable of evaluating the effect (in the context of the dependent's prediction accuracy index) of the independent factori th in the context of other existing factor-independent features [85].
To determine the nonlinearity, if there are any i th parameters, the appropriate transformation must be picked precisely, such that the normal residual plot indicateslinearity differences, whereas the partial residual plot indicatesboth the magnitude of the linearity variance and the linearity magnitude and position. In first plot, the importance of elevation in correspondence to the drainage density shows the maximum importance. Here, the importance of topographical and hydrological factors was relatively higher than the other conditioning factors used in this study.

Discussion
In this research, three machine learning models (BRT, MARS and SLR) and four resampling methods (5-fold CV, 10-fold CV, bootstrap and optimism bootstrap) were used for the creation of gully erosion susceptibility. Resampling methods have been used as a reliable process that can improve the predictive efficiency of all models. While, unlike resampling, empirical sampling distributions do not solve all inferential problems, they will help to generate new statistics and bring robustness to other conventional ones [86]. Before the modeling of gully erosion susceptibility, it is important to quantify the predictive potential and multi-collinearity of the selected parameters; therefore, the values of each conditioning variable have been determined [24,33]. In contrast, few experiments have mainly focused on gully erosion in order to establish and forecast a relation between gullies and its absence, keeping in mind a number of variables in the models [87][88][89]. We

Discussion
In this research, three machine learning models (BRT, MARS and SLR) and four resampling methods (5-fold CV, 10-fold CV, bootstrap and optimism bootstrap) were used for the creation of gully erosion susceptibility. Resampling methods have been used as a reliable process that can improve the predictive efficiency of all models. While, unlike resampling, empirical sampling distributions do not solve all inferential problems, they will help to generate new statistics and bring robustness to other conventional ones [86]. Before the modeling of gully erosion susceptibility, it is important to quantify the predictive potential and multi-collinearity of the selected parameters; therefore, the values of each conditioning variable have been determined [24,33]. In contrast, few experiments have mainly focused on gully erosion in order to establish and forecast a relation between gullies and its absence, keeping in mind a number of variables in the models [87][88][89]. We therefore have measured the importance value of every causal parameter behind the presence of gully locations in the catchment area of the Hinglo river basin. Additional comprehensive work on environmental elements is needed to obtain an optimum understanding of the factors that influence the erosion of the gully itself and the impact on the expansion of this occurrence at multiple locations in different climatic conditions. The results are correlated with the assumption that the rate of gully erosion depends on the volume of the runoff area above the gully and on many other variables, such as elevation, drainage density, LS factor, R factor, soil texture and rainfall. Here, elevation is the most important determinant element of erosion susceptibility. The spatial distribution of gullies depends to a large extent on the characteristics and nature of the topographic parameters, such as elevation, amount and direction of slope, ruggedness of the topography, etc. [90]. There is a very high dependency of the occurrences of gullies and existing drainage networks of this region. In any region, the gullies can continuously form according to the drainage network and can be differentiated by the drainage network. So, a positive relationship has been observed between drainage density and the occurrence of gullies [91]. In other words, the high drainage density is very much optimistic for generatingthe maximum runoff which is responsible for large scale erosion in the form of gullies [28]. In this study, topographical factors, like LS factor, are one of the important determining elements for gully erosion susceptibility. Slope length and steepness are direct influences on other topographical parameters which are responsible for creating the complex hydro-geomorphic regime. So, these scenarios of hydro-geomorphic characteristics are favorable for large scale erosion. In the monsoon-dominated climate region, the extreme soil erosion and associated land degradation are strongly influenced by rainfall. The impact of raindrops in the form of rainfall accelerates the amount of erosion in various forms of erosion. The existence of a long dry season and a limited wet season are the key features of the monsoon climatic region. The rate of erosion and its associated vulnerability to land resources is therefore increasing at an alarming rate. Apart from rainfall, the R factor is capable of estimating the increased impact of rainfall on erosion in a storm rainfall event. So, the importance of the R factor for modeling the gully erosion susceptibility is greater than any other factors.
Numerous ensemble models are applicable, but new and updated techniques and approaches for spatial modeling of gully erosion susceptibility occurrence are necessary. Therefore, three machine learning algorithms (BRT, MARS and SLR) and four resampling approaches (5-fold CV, 10-fold CV, bootstrap and optimism bootstrap) were used to assess the efficiency of machine learning models and we selected from the most optimal one. Modelling efficiency evaluation reveals that the ideal approach is a set of models (SLR, MARS and BRT optimism bootstrap) of machine learning approaches with excellent precisions of 90.6%, 90.4% and 87.4%, respectively, in the corresponding susceptibility categories of the gully positions relative to the majority of the ensemble and single machine learning methods. The ensemble of SLR optimism bootstrap, MARS optimism bootstrap, and BRT optimism bootstrap actually improves the generalization of base predictors for gully positions to locate the combination of the SLR optimism bootstrap ensemble method with greater precision.
It is obvious from the field photos that there are more numbers of gully development sites in and around the studyregionof the Chota Nagpur Plateau. Unscientific management strategies and land use modifications have been observed to play a crucial role in regional scale gully development, as subsurface pipping and gully head-cut experiences have led to the creation and development of gullies. Our results are comparable with other researchers who identified that gully erosion is largely triggered by precipitation, land use, drainage density, LS factor, soil texture, rainfall and runoff erosivity factor and elevation [92][93][94]. In addition, the resampling-based ensemble machine learning models (SLR, MARS and BRT optimism bootstrap) established these gully erosion sites more accurately on a regional scale than other evaluated ensemble machine learning models and individual models.
The upper portion of this region is very much prone to gully-high erosion because the topography of this portion is very rugged in nature and this region is associated with a diverse geo-hydrological setup. This outcome is similar in terms of the regional occurrences of gullies which were done previously by different researchers in this region [95,96].

Conclusions
Over recent decades, this region has been facing the problem of extreme soil erosion. Comparatively, in the western and middle portion, the rate of gully formation and development is very high. This leads to extreme land degradation in this region and adversely affects agricultural production. Our main objective of this paper to propose an optimal model for gully erosion susceptibility using novel resample techniques from the existing regression models. Machine learning algorithms, including BRT, MARS, SLR, bootstrap, optimism bootstrap and 5-fold cross validation and 10-fold cross validation, rendered it possible to analyze the results of factors impacting the frequency of these gully erosion characteristics in the catchment area of the Hinglo river basin. The multi-collinearity study was used to identify 20 gully erosion causal parameters and its function in gully formation and development. In addition, the importance of these causal parameters was also evaluated where six variables, including elevation, drainage density, LS factor, R factor, soil texture and rainfall, had the strongest impact on gully erosion in the study region. The influences of topographical and hydrological parameters are more prominent than other parameters. For modeling of gully erosion, 70% of the data was used for testing and the remaining 30%was considered for model validation. Validation of the models was made using the area under curve (AUC) from receiver operating characteristic (ROC) curve and other statistical indices.
Validation of the results mainly demonstrated that the SLR, MARS and BRT optimism bootstrap models with ROC values of 90.6, 90.4 and 87.4, respectively, had excellent accuracy levels based on selected relevant parameters. Our proposed models have fulfilled our objective of assessing gully erosion susceptibility with adequate accuracy. The use of various resampling techniques has increased the performance of the models by developing the nature of the classifier. The susceptibility map of gully erosion obtained in this study region can be used to manage land and water conversations, land use planning and eventually, sustainable development throughout the region. The main task of future researchers is to propose more optimal models for this subtropical region with the development of the base-classifier.