Susceptibility Mapping of Soil Water Erosion Using Machine Learning Models

: Soil erosion is a serious threat to sustainable agriculture, food production, and environmental security. The advancement of accurate models for soil erosion susceptibility and hazard assessment is of utmost importance for enhancing mitigation policies and laws. This paper proposes novel machine learning (ML) models for the susceptibility mapping of the water erosion of soil. The weighted subspace random forest (WSRF), Gaussian process with a radial basis function kernel (Gaussprradial), and naive Bayes (NB) ML methods were used in the prediction of the soil erosion susceptibility. Data included 227 samples of erosion and non-erosion locations through ﬁeld surveys to advance models of the spatial distribution using predictive factors. In this study, 19 e ﬀ ective factors of soil erosion were considered. The critical factors were selected using simulated annealing feature selection (SAFS). The critical factors included aspect, curvature, slope length, ﬂow accumulation, rainfall erosivity factor, distance from the stream, drainage density, fault density, normalized di ﬀ erence vegetation index (NDVI), hydrologic soil group, soil texture, and lithology. The dataset cells of samples (70% for training and 30% for testing) were randomly prepared to assess the robustness of the di ﬀ erent models. The functional relevance between soil erosion and e ﬀ ective factors was computed using the ML models. The ML models were evaluated using di ﬀ erent metrics, including accuracy, the kappa coe ﬃ cient, and the probability of detection (POD). The accuracies of the WSRF, Gaussprradial, and NB methods were 0.91, 0.88, and 0.85, respectively, for the testing data; 0.82, 0.76, and 0.71, respectively, for the kappa coe ﬃ cient; and 0.94, 0.94, and 0.94, respectively, for POD. However, the ML models, especially the WSRF, had an acceptable performance regarding producing soil erosion susceptibility maps. Maps produced with the most robust models can be a useful tool for sustainable management, watershed conservation, and the reduction of soil and water loss. Conglomerate, sandstone and shale with plantremains and coal seams; K2l2: Thick-bedded to massive limestone; Eksh: Greenish-black shale, partly tu ﬀ aceous with intercalations of tu ﬀ .


Study Area
The Nur-Rood watershed is located in the southwest of the Haraz watershed, in the north of Iran. The watershed lies within 51°26′-52°19′ E and 36°01′-36°16′ N ( Figure 1). The elevation of the watershed ranges from 732 to 4333 m. There are six rain gauge stations in the region provided the long-term mean annual data from 1976 to 2016 that were were used in this study. The study area is about 1297 km 2 and is located upstream of the Haraz dam. The main application of this dam is to provide drinking and agriculture water for five cities (i.e., Amol, Babol, Babolsar, Nur, and Mahmoodabad) in the Mazandaran province. According to the literature, the watershed generates water with a high sediment load such that it causes a reduction in the dam's capacity [8,33]. Therefore, identifying the hazardous areas can help to control the upstream erosion and aid with providing sustainable watershed management in the Nur-Rood watershed.

Methodology
The methodology consisted of several fundamental building blocks to ensure the accuracy of the susceptibility prediction. Figure 2 presents the schematic of the methodology workflow from data sampling to the susceptibility prediction. The method consisted of five sections: (i) preparation and collection of the relevant factors for soil erosion modeling; (ii) extraction of the erosion and non-erosion locations by the field observations; (iii) selection of the essential factors using the simulated annealing feature selection (SAFS) algorithm; (iv) water erosion modeling using the Gaussprradial, NB, and WSRF models in the Nur-Rood watershed; and (v) evaluating the models' performance.

Methodology
The methodology consisted of several fundamental building blocks to ensure the accuracy of the susceptibility prediction. Figure 2 presents the schematic of the methodology workflow from data sampling to the susceptibility prediction. The method consisted of five sections: (i) preparation and collection of the relevant factors for soil erosion modeling; (ii) extraction of the erosion and non-erosion locations by the field observations; (iii) selection of the essential factors using the simulated annealing feature selection (SAFS) algorithm; (iv) water erosion modeling using the Gaussprradial, NB, and WSRF models in the Nur-Rood watershed; and (v) evaluating the models' performance.

Field Data
It was necessary to know the locations of eroded and non-eroded areas for susceptibility mapping of the Nur-Rood watershed. Therefore, the locations (i.e., x and y coordinates) of 227 area (116 erosion locations and 111 non-erosion locations) were sampled through field surveys to model the water erosion susceptibility based on a binary scale (occurrence/non-occurrence). According to Sajedi-Hosseini et al. [8], the recorded soil erosion areas include different kinds of water erosions (such as sheet, rill, gully, and mass movements).

Predictive Variables
In this study, according to the literature review, 19 relevant factors regarding soil erosion were collected and prepared, including the topographic, hydro-climatic, geological, and land cover factors. The attributes of the factors are presented in Table 1. A brief description of each of the predictive factors is presented afterward.

Field Data
It was necessary to know the locations of eroded and non-eroded areas for susceptibility mapping of the Nur-Rood watershed. Therefore, the locations (i.e., x and y coordinates) of 227 area (116 erosion locations and 111 non-erosion locations) were sampled through field surveys to model the water erosion susceptibility based on a binary scale (occurrence/non-occurrence). According to Sajedi-Hosseini et al. [8], the recorded soil erosion areas include different kinds of water erosions (such as sheet, rill, gully, and mass movements).

Predictive Variables
In this study, according to the literature review, 19 relevant factors regarding soil erosion were collected and prepared, including the topographic, hydro-climatic, geological, and land cover factors. The attributes of the factors are presented in Table 1. A brief description of each of the predictive factors is presented afterward.

Topographic Parameters
The topographic parameters included the elevation, slope, aspect, slope length (SL), and curvature ( Figure 3). These factors are influential regarding soil erosion velocity [34]. The different elevations (Figure 3a

Hydro-Climate Factors
The hydro-climate factors included the drainage density (DD), distance from the stream (DFS), topographic wetness index (TWI), stream power index (SPI), flow accumulation (FA), precipitation (PCP), rainfall erosivity factor (R), and hydrologic soil group (HSG) (Figure 4). The DD (Figure 4a) is calculated from the sum of the length of all streams in the watershed area. The DD values depend on the permeability and resistance of the surface and deeper soil layers that affect water erosion [8].
Regarding the DFS (Figure 4b), the regions near streams are more susceptible to soil erosion [35]. The DD and DFS layers were created using line density and Euclidian distance tools, respectively, in geographic information system (GIS). TWI (Figure 4c) shows the soil moisture and water-saturated area of the watershed. SPI (Figure 4d) indicates the potential for erosion due to the water flow, in which higher values indicate a higher potential. TWI and SPI were produced using the SAGA GIS 2.0.7 software (SAGA User Group Association, Hamburg, Germany). The flow accumulation (FA) function ( Figure 4e) computes the sum of the weight of all accumulated pixels upstream [36], which is most important for showing the water-accumulated pixels that affect the water erosion. The PCP ( Figure 4f) and R (Figure 4g) were the climate factors considered to affect soil erosion. Their effects depend on soil attributes such as the soil texture, soil organic matter, and soil structure. The PCP map is produced by the mean annual precipitation of the gauge stations in the study area. The R factor is directly related to the soil erodibility. The best method for calculating it is a direct measurement of soil erosion in plots [37]. However, in this study, according to Takal et al. [38], an empirical equation was used to calculate this factor, as follows.
where R is the precipitation erosivity index (MJ·mm·ha −1 ·hr −1 ) and P is the mean annual precipitation (mm). The HSG (Figure 4h) indicates the infiltration and runoff generation rates that affect soil erosion. This layer is extracted from the digital soil map of the world [39] and it includes three groups: B, C, and D. Group B has moderately low runoff potential when completely humid. Soils in this group have 50 to 90% sand, 10 to 20% clay, and have sandy loam or loamy sand textures. Water transition across the soil is unrestricted. Group C soils have moderately high runoff potential when completely humid. They have less than 50% sand, 20 to 40% clay, and include sandy clay loam, silty clay loam, loam, silt loam, and clay loam textures. Group D soils have high runoff potential and the infiltration across the soil is very limited [40].

Geological Factors
Geological factors include the fault density (FD), lithology, and soil texture ( Figure 5). The FD affects infiltration and runoff, which can affect soil erosion. Furthermore, the existence of a fault can accelerate the mass movements [41]. The layer of FD (Figure 5a) was produced in the ArcGIS environment by using the line density tool on the fault layer. The lithology has the greatest effect on erosion control. Erosion depends on the exposed material weathering attributes or the lithology [42,43]. The lithology map ( Figure 5b) was taken from a geological survey done by the Iranian department of environment and had a scale of 1:100,000. The other important factor is soil texture (Figure 5c). Porosity and soil texture, along with the soil profile and surface, are the dominant soil attributes that influence soil erosion. An increase in the clay value of the soil causes a decrease in soil erosion [44]. The soil textures of the study area were clay, clay loam, loam, loamy sand, sandy clay loam, and sandy loam ( Figure 5c).

Land Cover Factors
The land cover factors considered were the normalized difference vegetation index (NDVI), land-use, and distance from road (DFR) ( Figure 6). The NDVI (Figure 6a) was extracted from Landsat satellite images for June 2018. The NDVI values range from −1 to 1 [45]. A watershed with a higher NDVI provides higher resistance against soil erosion [9,46]. The land uses of the study area included rangeland, residential, forest, agriculture, and rock ( Figure 6b). The land-use map was received from the Iranian Water Resources Management Company (IWRMC). Roads are one of the man-made features that increase the availability of materials for transformation and increase the sediment yield in the watershed. Moreover, roads increase the runoff speed through collecting and concentrating the surface runoff in the given areas (such as near bridges); therefore, faster flows increase the erosion. The DFR layer (Figure 6c) was calculated using the line density tool within the ArcGIS environment. Water 2020, 12, x FOR PEER REVIEW 7 of 17

Feature Selection
To select the most important factors in the water erosion of soil based on parsimonious objectives from the large number of factors considered, the simulated annealing feature selection (SAFS) model was used. The SAFS method is based on the minimum energy configuration theory, whereby a solid is gradually cooled such that its structure is frozen [47]. Many studies have used this method for feature selection in environmental fields, such as flash-flood hazard assessment [48], dust and air quality evaluation [49], and earth fissure hazard prediction [50]; see Bertsimas and Tsitsiklis [47] for more details of the SAFS method.
In the current research, the SAFS was conducted using the k-fold (k = 10) cross-validation methodology and it was implemented in the Caret package [51] of the R software (4.0.2, R Core Team, Vienna, Austria).

Weighted Subspace Random Forest (WSRF)
Xu et al. [52] suggested a new random forest, namely, the WSRF model, which involves weighting the input variables and afterward opting for the variables that ensure each subspace always includes informative attributes. The WSRF model is implemented as multi-thread processes. This algorithm categorizes very high-dimensional data and sparse data with random forests made using small subspaces. A new variable weighting manner is applied for the variable subspace choice rather than the traditional random variable sampling in the random forest model [53]. More details of the WSRF

Feature Selection
To select the most important factors in the water erosion of soil based on parsimonious objectives from the large number of factors considered, the simulated annealing feature selection (SAFS) model was used. The SAFS method is based on the minimum energy configuration theory, whereby a solid is gradually cooled such that its structure is frozen [47]. Many studies have used this method for feature selection in environmental fields, such as flash-flood hazard assessment [48], dust and air quality evaluation [49], and earth fissure hazard prediction [50]; see Bertsimas and Tsitsiklis [47] for more details of the SAFS method.
In the current research, the SAFS was conducted using the k-fold (k = 10) cross-validation methodology and it was implemented in the Caret package [51] of the R software (4.0.2, R Core Team, Vienna, Austria).

Weighted Subspace Random Forest (WSRF)
Xu et al. [52] suggested a new random forest, namely, the WSRF model, which involves weighting the input variables and afterward opting for the variables that ensure each subspace always includes informative attributes. The WSRF model is implemented as multi-thread processes. This algorithm categorizes very high-dimensional data and sparse data with random forests made using small subspaces. A new variable weighting manner is applied for the variable subspace choice rather than the traditional random variable sampling in the random forest model [53]. More details of the WSRF

Feature Selection
To select the most important factors in the water erosion of soil based on parsimonious objectives from the large number of factors considered, the simulated annealing feature selection (SAFS) model was used. The SAFS method is based on the minimum energy configuration theory, whereby a solid is gradually cooled such that its structure is frozen [47]. Many studies have used this method for feature selection in environmental fields, such as flash-flood hazard assessment [48], dust and air quality evaluation [49], and earth fissure hazard prediction [50]; see Bertsimas and Tsitsiklis [47] for more details of the SAFS method.
In the current research, the SAFS was conducted using the k-fold (k = 10) cross-validation methodology and it was implemented in the Caret package [51] of the R software (4.0.2, R Core Team, Vienna, Austria).

Weighted Subspace Random Forest (WSRF)
Xu et al. [52] suggested a new random forest, namely, the WSRF model, which involves weighting the input variables and afterward opting for the variables that ensure each subspace always includes informative attributes. The WSRF model is implemented as multi-thread processes. This algorithm categorizes very high-dimensional data and sparse data with random forests made using small subspaces. A new variable weighting manner is applied for the variable subspace choice rather than the traditional random variable sampling in the random forest model [53]. More details of the WSRF model are presented in Xu et al. [52] and Zhao et al. [53]. The WSRF model was implemented using the "wsrf" package [53] in the R software using the k-fold (k = 10) cross-validation procedure. The NB classifiers are a set of assortment algorithms that use Bayes' Theorem. This is a family of algorithms, where every pair of features being categorized is independent of each other; on the other hand, all of them share a common principle. The dataset is categorized into two sections: • A response vector, which includes the value of the class variable.

•
The feature matrix, which includes all the rows of the dataset and each row contains all the dependent features.
According to the primary naive Bayes hypothesis, each element must be independent and equal [54,55]; see Webb et al. [56] for more details of the NB model. The NB model was done using the k-fold (k = 10) cross-validation method in the "klar" package [57,58] within the R software.

The Gaussian Process with a Radial Basis Function Kernel (Gaussprradial)
Gaussian process regression is a vigorous, non-parametric Bayesian method used for solving regression problems and modeling unknown functions [59,60]. It can capture the different relationships between inputs and output variables by applying a hypothetically infinite number of parameters and allowing the dataset to determine the level of complexity via Bayesian inference [61]. The Gaussian process is parametrized using a kernel. One of the benefits of Gaussian process regression is the flexibility in choosing the kernel; furthermore, the different kernels can be combined to perform the regression [59]. In this study, the radial basis function network (RBF) was used to perform the Gaussian process. The Gaussprradial was performed in the R software using the "kernlab" package [62] using the k-fold (k = 10) cross-validation approach.

Model Calibration and Validation
The database, including the predictand and predictors, was randomly divided into the training (70%) and testing (30%) sets. A k-fold (k = 10) cross-validation methodology was used to calibrate the models. The models were assessed using testing datasets after the calibration using the features selected by the SAFS. Here, for the assessment of the models' performances, three classification evaluation metrics were used: accuracy, kappa, and the probability of detection (POD). The models' performances were represented as accuracy percentages. Kappa indicates the probability of agreement by chance using the likelihood of the model classification [63]. The metrics are computed as follows: where H (the number of hits), FA (the number of false alarms), M (the number of misses), and CN (the number of correct negatives) were computed from a contingency table.
where Pe is the expected probability of chance agreement [64] that is computed using Equation (4): The POD is a metric used to quantify the possibility of finding a specific detect. The POD is significantly linked to the subject of risk evaluation and probabilistic analyses of the components' integrity. The POD is the ratio of the correct predicted data to the total number observed occurrences. It ranges from 0 to 1, where 1 indicates a perfect score [49,50]. The metric is calculated using Equation (5):

Feature Selection Results
A relatively large number of factors, such as elevation, slope, aspect, SL, curvature, DD, DFS, TWI, SPI, FA, PCP, R, HSG, FD, lithology, soil texture, NDVI, land use, and DFR, were used in the current study to predict water erosion. The results of the feature selection using the SAFS algorithm are shown in Table 2. As can be seen, the minimum and maximum selected features were 8 and 14 variables, respectively, in the folds number of 8 (accuracy = 0.84, Kappa = 0.67) and 3 (accuracy = 0.92, Kappa = 0.83). The fold number 6 provided the worst performance (accuracy = 0.74, Kappa = 0.48), whereas the fold number 10 provided the best performance (accuracy = 0.93, Kappa = 0.87). According to the 10-fold results, the selected factors should be between the minimum and maximum selected features and should be mostly equal to the mean selected factors across all folds. However, the percentage of selected factors in all folds can be a good criterion for selecting the final variables [48][49][50]. Figure 7 shows the percentage of selected factors in all folds. Twelve variables had a frequency of at least 50% across all folds. As can be seen, the NDVI with a 100% frequency (F) was selected in all folds. However, the important role of the vegetation and NDVI is obvious and shown in previous studies [8,9,46]. Followed the NDVI (F = 100%), the variables of lithology (F = 80%), R (F = 80%), SL (F = 80%), FD (F = 70%), FA (F = 70%), soil texture (F = 60%), DFS (F = 60%), aspect (F = 50%), curvature (F = 50%), HSG (F = 50%), and DD (F = 50%) were selected.

Results of Water Erosion Modeling
The calibration of models was conducted using the "tunelength" function in the Caret R package [51]. The performance results of the three models (WSRF, Gaussprradial, and NB) were evaluated using the three statistics of accuracy, kappa, and the probability of detection (POD), which are presented in Table 3.
As can be seen from Table 3, the evaluation of the models' performance indicated that the WSRF model had a higher accuracy (accuracy = 0.91), followed by the Gaussprradial (accuracy = 0.88) and NB (accuracy = 0.85) models. According to Monserud and Leemans [65], the kappa values indicated that all three models were in the "very good" degree of agreement (i.e., 0.70 < Kappa < 0.85) ( Table  3). However, like the accuracy, the kappa statistic for the WSRF model (Kappa = 0.82) was more than the Gaussprradial (Kappa = 0.76) and NB (Kappa = 0.71) models. Regarding the POD, the WSRF, Gaussprradial, and NB models showed an equal performance (POD = 0.94) (Table 3).
Generally, the evaluation of the applied machine learning (ML) models in this study indicated an acceptable performance for all the ML models. However, regarding the accuracy and kappa values, the models' performances were ranked as follows: WSRF > Gaussprradial > NB. A direct comparison between the results of this study and previous ones is not possible because the application of the WSRF and Gaussprradial models was undertaken for the water erosion of soil for the first time. However, two novel ML models (WSRF and Gaussprradial) applied in this study indicated a better performance than the NB model that has previously been used in this field. Previous studies have indicated the accurate performance of the NB model in the assessment of soil erosion, such as Weihua et al. [66] and Nhu et al. [67].

Results of Water Erosion Modeling
The calibration of models was conducted using the "tunelength" function in the Caret R package [51]. The performance results of the three models (WSRF, Gaussprradial, and NB) were evaluated using the three statistics of accuracy, kappa, and the probability of detection (POD), which are presented in Table 3. As can be seen from Table 3, the evaluation of the models' performance indicated that the WSRF model had a higher accuracy (accuracy = 0.91), followed by the Gaussprradial (accuracy = 0.88) and NB (accuracy = 0.85) models. According to Monserud and Leemans [65], the kappa values indicated that all three models were in the "very good" degree of agreement (i.e., 0.70 < Kappa < 0.85) ( Table 3). However, like the accuracy, the kappa statistic for the WSRF model (Kappa = 0.82) was more than the Gaussprradial (Kappa = 0.76) and NB (Kappa = 0.71) models. Regarding the POD, the WSRF, Gaussprradial, and NB models showed an equal performance (POD = 0.94) ( Table 3).
Generally, the evaluation of the applied machine learning (ML) models in this study indicated an acceptable performance for all the ML models. However, regarding the accuracy and kappa values, the models' performances were ranked as follows: WSRF > Gaussprradial > NB. A direct comparison between the results of this study and previous ones is not possible because the application of the WSRF and Gaussprradial models was undertaken for the water erosion of soil for the first time. However, two novel ML models (WSRF and Gaussprradial) applied in this study indicated a better performance than the NB model that has previously been used in this field. Previous studies have indicated the accurate performance of the NB model in the assessment of soil erosion, such as Weihua et al. [66] and Nhu et al. [67].

Spatial Prediction of Water Erosion Susceptibility
After the calibration and validation of the models, the maps of the soil erosion probability were predicted using the values of the pixels throughout the study area. Then, the probability maps were classified into five susceptibility classes of very low, low, medium, high, and very high based on the classification method of natural breaks through the ArcGIS software ( Figure 8).

Conclusions
This study focused on the probability of water erosion occurring in the Nur-Rood watershed. Using the SAFS model, the most important factors were selected among nineteen parameters, namely, NDVI, lithology, R, SL, FD, FA, soil texture, DFS, aspect, curvature, HSG, and DD. Based on the performance analysis of the machine learning (ML) models, the two novel applied ML models of WSRF (accuracy = 0.91, Kappa = 0.82) and Gaussprradial (accuracy = 0.88, Kappa = 0.76) displayed better performances than the NB (accuracy = 0.85, Kappa= 0.71) model that has previously been used in this field. The predicted maps created using the ML models indicated the different areas for each susceptibility class but it was obvious that the susceptibility maps were approximately matched with the NDVI and lithology maps (which were identified as the most important variables). One of the main limitations in this study that also occurs in other spatial modeling studies is that different scales are used for the input variables all over the world. Although all of the input variables were resampled The area of the susceptibility classes found using each model is presented in Table 4. As can be seen, the Gaussprradial model predicted most of the area in the moderate class (about 450 km 2 , 34.69% of the study area). The sum of the areas for low and very low classes was less than 7% (about 85 km 2 ) ( Figure 8A). According to the NB model, more than 65% of the study area (about 850 km 2 ) was located in very high susceptibility zones ( Figure 8B). Results of the WSRF model indicated that the classes of the very low, low, moderate, high, and very high susceptibilities covered 11.91% (154.52 km 2 ), 9.76% (126.6 km 2 ), 20.25% (262.64 km 2 ), 28.66% (371.87 km 2 ), and 29.42% (381.7 km 2 ) of the study area, respectively ( Figure 8C). Although the predicted models indicated different areas for each class, there was something in common for all predicted maps. By comparing the predicated maps (Figure 8) with the NDVI map (Figure 5a), it was clear that the susceptibility maps were approximately matched with the NDVI and lithology maps. For example, the green areas in the east of the region on the NDVI map ( Figure 6a) had higher values of NDVI, which correspond to lower values on the water erosion susceptibility maps ( Figure 7). Furthermore, the higher susceptibly values (Figure 8) corresponded to the TRJ lithology ( Figure 5b). TRJs include dark grey shale, claystone, siltstone, and sandstone of the Shemshak formation. In this formation, various kinds of water erosion, such as rill, riverbank, gully, and badland erosions can be seen. This agrees with the SAFS results, which indicated that the NDVI and lithology were the most important variables during the feature selection.

Conclusions
This study focused on the probability of water erosion occurring in the Nur-Rood watershed. Using the SAFS model, the most important factors were selected among nineteen parameters, namely, NDVI, lithology, R, SL, FD, FA, soil texture, DFS, aspect, curvature, HSG, and DD. Based on the performance analysis of the machine learning (ML) models, the two novel applied ML models of WSRF (accuracy = 0.91, Kappa = 0.82) and Gaussprradial (accuracy = 0.88, Kappa = 0.76) displayed better performances than the NB (accuracy = 0.85, Kappa= 0.71) model that has previously been used in this field. The predicted maps created using the ML models indicated the different areas for each susceptibility class but it was obvious that the susceptibility maps were approximately matched with the NDVI and lithology maps (which were identified as the most important variables). One of the main limitations in this study that also occurs in other spatial modeling studies is that different scales are used for the input variables all over the world. Although all of the input variables were resampled into the same spatial resolution, the data collection and sampling of them were not on the same scale; this is an inevitable limitation for the time being. It may be the case that the data availability of the NDVI (30 m resolution) helped this variable to be the most important variable during the soil erosion modeling compared with the other variables (such as the soil dataset and lithology with scales of 1:100,000 or more). Despite these limitations, producing the water erosion susceptibility maps in developing countries can be a useful tool for sustainable management, the conservation of watersheds, the reduction of soil degradation, and alleviating water quality decline.

Conflicts of Interest:
The authors decorate no conflict of interest.