Comparison of Ensemble Machine Learning Methods for Soil Erosion Pin Measurements

: Although machine learning has been extensively used in various ﬁelds, it has only recently been applied to soil erosion pin modeling. To improve upon previous methods of quantifying soil erosion based on erosion pin measurements, this study explored the possible application of ensemble machine learning algorithms to the Shihmen Reservoir watershed in northern Taiwan. Three categories of ensemble methods were considered in this study: (a) Bagging, (b) boosting, and (c) stacking. The bagging method in this study refers to bagged multivariate adaptive regression splines (bagged MARS) and random forest (RF), and the boosting method includes Cubist and gradient boosting machine (GBM). Finally, the stacking method is an ensemble method that uses a meta-model to combine the predictions of base models. This study used RF and GBM as the meta-models, decision tree, linear regression, artiﬁcial neural network, and support vector machine as the base models. The dataset used in this study was sampled using stratiﬁed random sampling to achieve a 70/30 split for the training and test data, and the process was repeated three times. The performance of six ensemble methods in three categories was analyzed based on the average of three attempts. It was found that GBM performed the best among the ensemble models with the lowest root-mean-square error (RMSE = 1.72 mm/year), the highest Nash-Sutcliffe efﬁciency (NSE = 0.54), and the highest index of agreement (d = 0.81). This result was conﬁrmed by the spatial comparison of the absolute differences (errors) between model predictions and observations using GBM and RF in the study area. In summary, the results show that as a group, the bagging method and the boosting method performed equally well, and the stacking method was third for the erosion pin dataset considered in this study.


Introduction
Soil erosion is a severe global issue affecting farming and the environment in tropical and subtropical areas. In particular, soil erosion leads to environmental damage such as soil nutrient loss, pollution by sedimentation, and the increased possibility of flooding. The rate of soil erosion depends on soil characteristics, climate, slope steepness [1], land use, and protective vegetation [2]. In addition, eroded soils lose 75% to 80% of carbon content [3], which results in a deficit of terrestrial carbon budget. Anthropogenic activity is a major cause of soil erosion [4]. Soil degradation rapidly has intensified with the rising population in the 20th century [5] and beyond.
Many theoretical/empirical models can be used to study soil erosion. According to Borrelli et al. [6], 435 distinct models and model variants were used to analyze soil erosion from 1994 to 2017 in 1697 scientific articles in the Scopus database. The top five the same watershed to study the effect of the digital elevation model (DEM) on soil erosion. Finally, Liu et al. [12] followed up by applying slope units to soil erosion modeling in the same watershed.
In contrast to the traditional soil erosion modeling approach, in which a physically or empirically based model is needed to make a prediction and field data are collected to verify the model's correctness, a machine learning (ML) based approach does not require an a priori model. The field measurements (such as those of erosion pins) are used directly to formulate rules and make generalizations from the data (i.e., predictions). Although ML-based approaches have been extensively used in relevant fields such as landslides susceptibility mapping [13,14], soil thickness prediction [15], digital soil mapping [16], and biomass retrieval [17], it has only recently been applied to soil erosion pin study [18,19]. To improve upon previous methods of quantifying soil erosion, the study explored the possible application of ensemble machine learning algorithms to the Shihmen Reservoir watershed in northern Taiwan. Three categories of ensemble methods were considered in this study: (a) Bagging, (b) boosting, and (c) stacking. The results were compared among the ensemble methods in terms of three statistical indices: (1) Root-mean-square error (RMSE), (2) Nash-Sutcliffe efficiency (NSE), and (3) index of agreement (d).

Methods
An erosion pin is a wooden or metal rod inserted into the ground for measuring the change of the ground surface. The pin referred to in this study is made of metal with a diameter of 15.9 mm and a length of 300 mm. About 270 mm of the pin is embedded in the ground with the exposed part painted red. The erosion pin is one of the simplest and most effective methods for monitoring ground surface variation due to soil erosion and sediment deposition [20]. It has been used to monitor sheet erosion, gully erosion, landslides, and stream bank erosion [21][22][23]. Lin et al. [24] has documented the procedures for installing erosion pins in Taiwan. Measurements of erosion pins have been collected from various watersheds. The Shihmen Reservoir watershed data show that the average soil erosion was 90.6 t per hectare per year [12].
Ensemble machine learning is a technique that combines several base ML models (either homogeneous or heterogeneous) to make better predictions. Note that the word "model" has a different meaning in machine learning than for soil erosion. As explained earlier, a machine learning (ML) based approach does not require an a priori (soil erosion) model to work. In addition, ensemble methods gain performance over individual ML models [25]. Ensemble methods have been applied to many diverse fields such as banking [26], big data security [27], and breast cancer diagnosis [28]. In the environment and related fields, Pham et al. [29] applied several ensemble methods (AdaBoost, Bagging, Dagging, MultiBoost, Rotation Forest, and Random Subspace) to evaluate the landslide susceptibility in the Himalayan India. The results showed that the area under the curve (AUC) of the receiver operating characteristic (ROC) curve were all higher than 0.876. Similarly, Tehrany et al. [30] used an ensemble support vector machine (SVM) and the weights-of-evidence method to conduct the flood susceptibility mapping in Terengganu of Malaysia. Their results improved flood modeling by 29%.
An ensemble method is viewed as a compound model. The purpose of such a model is to achieve better predictive performance by reducing the noise or error between observed and predicted data. Ensemble methods are usually grouped into bootstrap aggregating (bagging), boosting, and stacking methods. All three categories attempt to tune their predictions to the observations by decreasing model variance, bias, or both simultaneously. The main difference is that bagging and boosting usually work with homogeneous models, whereas stacking excels in combining heterogeneous models. Their respective approaches are described in the following sub-sections. In this study, we will select two methods from each category for analysis and comparison.

Bagging
Bagging is a technique that builds multiple homogeneous models from different subsamples of the same training dataset to obtain more accurate predictions than its individual models. It is an application of the bootstrap procedure to high-variance machine learning problems. For example, random forest (RF) is the bagging of decision trees (DT). Using CART (classification and regression trees) as an example, RF randomly samples the training dataset multiple times (with replacement) to obtain many subsamples. Then, a decision tree is built for each subsample using CART. Finally, RF issues its prediction by combing the results of all decision trees either by voting (classification) or averaging (regression). RF is a very effective machine learning tool, and it has been applied to various research problems, including soil erosion [18]. Therefore, we chose RF and bagged MARS (multivariate adaptive regression splines) in this study to compare with other ensemble algorithms.
MARS was first introduced by Friedman and Roosen [31]. It explores the relationship between the dependent and independent variables in a way very similar to least-squares regression [32]. The advantages of MARS include its computational efficiency, its ability to yield easy to interpret models, and its function to quantify the contribution of predictor variables. However, its lack of accurate prediction is one of the most significant drawbacks [32]. To address this issue, bagging was introduced to MARS to become bagged MARS to improve the classification accuracy. The bagged MARS model was implemented by the "earth" package in R, and the RF model was implemented by the "randomForest" package.

Boosting
Boosting refers to a group of algorithms that utilize weighted averages to make weak learning algorithms stronger learning algorithms. Unlike bagging that relies on each model running independently and then aggregated at the end, boosting runs sequentially by using later models to fix the prediction errors of the predecessor models in the sequence. For this study, we have selected Cubist and gradient boosting machine (GBM) for comparison with other ensemble methods.
Cubist is a prediction-oriented regression model proposed by Quinlan [33,34]. The general idea of the Cubist regression model is briefly described here. During the tree growth stage, many branches and leaves are grown. Linear regression models are added to the leaves of the tree. The Cubist method creates a series of "if-then" rules. Each rule has an associated multivariate linear model [35]. The corresponding model is used to calculate the predicted value if a set of variables satisfies the conditions of the rule. Rules are eliminated via pruning and/or combined for simplification. The main advantage of the Cubist method is the addition of multiple training committees to balance case weights. In this study, the Cubist model was implemented by the "caret" and "Cubist" packages in R.
GBM was proposed by Friedman [36,37] as a simple and highly adaptive method for machine learning [38]. It is an improved boosting algorithm for regression and classification problems. The basic theory of GBM is to produce a prediction model constructed by a group of weak learning algorithms, typically decision trees. Each tree is grown sequentially by using the information from previously grown trees [39]. In this study, the GBM model was implemented by the "gbm" package in R.

Stacking
Stacking, sometimes called stacked generalization, is an ensemble machine learning method that combines multiple heterogeneous base or component models via a metamodel. The base model is trained on the complete training data, and then the meta-model is trained on the predictions of the base models. The advantage of stacking is the ability to explore the solution space with different models in the same problem. In this study, two stacking ensembles were chosen. They are (1) RF + DT + LM + Artificial Neural Networks (ANN) + SVM, and (2) GBM + DT + LM + ANN + SVM. For implementation, the "caretEnsemble" package in R was used.

Model Assessment
In order to evaluate the performance of ensemble models, three statistical indices were used as the evaluation criteria: (1) Root-mean-square error (RMSE), (2) Nash-Sutcliffe efficiency (NSE), and (3) index of agreement (d). These statistical indices have been frequently used in many studies [40,41].
RMSE is a good indicator for evaluating the model performance for continuous variables. In this study, RMSE represents the differences between erosion pin measurements and ensemble model predictions. It can be written as follows: where P and O are the predicted and observed values, respectively. NSE defines the relative magnitude between the "noise" and "information" [42]. Its value ranges from −∞ to 1. The closer the value of NSE to 1, the more efficient the model is. In the case that NSE is negative, the model is considered poor because the observed mean serves as a better prediction than the model. NSE is defined as follows: where O is the mean observed value. Finally, the index of agreement (d) is often used to represent the model performance [43]. Its value ranges from 0 to 1, with higher values indicating better agreement between the predictions and observations. The d index is defined as follows: Note that we do not use R 2 (coefficient of determination) as a statistical index in the model evaluation. As pointed out by Nguyen et al. [18,19], R 2 evaluates the fit to the regression line only. High R 2 does not mean small differences between predictions and observations. This point is illustrated in Figure 1, in which a poor model has a perfect R 2 but only mediocre (even poor) values of other statistical indices. Good predictions should fall on the 45 • line instead. gression line only. High R 2 does not mean small differences between predictions and observations. This point is illustrated in Figure 1, in which a poor model has a perfect R 2 but only mediocre (even poor) values of other statistical indices. Good predictions should fall on the 45 0 line instead.

Study Area
In this study, the Shihmen Reservoir watershed in northern Taiwan was selected as the research area ( Figure 2). Shihmen Reservoir watershed covers 76,340 ha of land area with a maximum elevation of 3527 m. The rainy season of the watershed coincides with the typhoon months. Therefore, heavy rainfalls are common [44]. In northern Taiwan, Shihmen Reservoir plays an essential role in providing drinking water for domestic use, irrigation for agriculture, and flood control for typhoon-related disasters [45]. Figure 2 also shows a photo of a metal erosion pin with the exposed part painted red and a picture of measurement being taken by a micrometer.

Study Area
In this study, the Shihmen Reservoir watershed in northern Taiwan was selected as the research area ( Figure 2). Shihmen Reservoir watershed covers 76,340 ha of land area with a maximum elevation of 3527 m. The rainy season of the watershed coincides with the typhoon months. Therefore, heavy rainfalls are common [44]. In northern Taiwan, Shihmen Reservoir plays an essential role in providing drinking water for domestic use, irrigation for agriculture, and flood control for typhoon-related disasters [45]. Figure 2 also shows a photo of a metal erosion pin with the exposed part painted red and a picture of measurement being taken by a micrometer. gression line only. High R 2 does not mean small differences between predictions and observations. This point is illustrated in Figure 1, in which a poor model has a perfect R 2 but only mediocre (even poor) values of other statistical indices. Good predictions should fall on the 45 0 line instead.

Study Area
In this study, the Shihmen Reservoir watershed in northern Taiwan was selected as the research area ( Figure 2). Shihmen Reservoir watershed covers 76,340 ha of land area with a maximum elevation of 3527 m. The rainy season of the watershed coincides with the typhoon months. Therefore, heavy rainfalls are common [44]. In northern Taiwan, Shihmen Reservoir plays an essential role in providing drinking water for domestic use, irrigation for agriculture, and flood control for typhoon-related disasters [45]. Figure 2 also shows a photo of a metal erosion pin with the exposed part painted red and a picture of measurement being taken by a micrometer. The workflow of this research is shown in Figure 3, and includes three parts: (1) Data collection, (2) data preparation, and (3) soil erosion pin analysis. During the first stage, the target variable and predictive variables were compiled into a dataset. Then, the dataset was split into training and test data using a stratified random sampling method. Finally, the training data were fed to the ensemble methods to create prediction models. The models were tested using the test data, and the statistical indices were computed.

Results and Discussion
The results of three types of ensemble learning (that is, bagging, boosting, and stacking) and their comparisons are presented below.

Bagging
In the bagging category, we selected bagged MARS and random forest (also used by [18]) as the representative ensemble learning methods. Three repeated samplings (groupings) of the same dataset were made to create three different 70/30 splits using the stratified random sampling method. After fitting the model, the resulting statistical metrics, including the RMSE, NSE, and d values were calculated and these are listed in Table 2. Fourteen attributes in four categories were used as the predictive variables, as shown in Figure 3. Five of the attributes were point data that were not available watershed-wide but only available at the erosion pins' locations. The dataset was separated into two groups using a 70/30 split, which is the common ratio used by many studies [46][47][48]. The entire process was repeated three times to determine the average result.
The target variable is the erosion pin measurement. A total of 550 pins were installed on 55 slopes (10 pins per slope) in the Shihmen Reservoir watershed ( Figure 2). Lin et al. [24] documented the installation procedures. The erosion depth measurements were collected from 8 September 2008 to 10 October 2011. The measurements were averaged by slopes.
To derive topography-related attributes, such as elevation, sub-watershed, slope class, slope direction, distance to river, and distance to road, we used the Central Geological Survey (CGS) DEM created from an airborne LiDAR (light detection and ranging) survey, which has a spatial resolution of 10 m ( Table 1). The DEM data were created in 2013 [11].
The average annual rainfall of the study area was calculated from 22 rainfall stations from 2003 to 2015. The slopes were classified into seven classes: (1) <5%, (2) 5-15%, (3) 15-30%, (4) 30-40%, (5) 40-55%, (6) 55-100%, and (7) >100%, based on the classification system of the Soil and Water Conservation Bureau. Slope direction is the aspect of a slope, which can be flat or facing north, northeast, east, southeast, south, southwest, west, or northwest. The distance to river was calculated based on the river network map. Similarly, the distance to road was found by the road network map with a scale of 1:5000. Both distances were calculated as the shortest distance between the individual erosion pin and the river or road network using ArcGIS 10.2 software.
Finally, lithology and epoch were from the geological maps of the Central Geological Survey. The scales were both 1:50,000. The soil contents were the percent of sand, silt, organic, and clay. The data were provided by Lin et al. [49].

Results and Discussion
The results of three types of ensemble learning (that is, bagging, boosting, and stacking) and their comparisons are presented below.

Bagging
In the bagging category, we selected bagged MARS and random forest (also used by [18]) as the representative ensemble learning methods. Three repeated samplings (groupings) of the same dataset were made to create three different 70/30 splits using the stratified random sampling method. After fitting the model, the resulting statistical metrics, including the RMSE, NSE, and d values were calculated and these are listed in Table 2. Among the three indices of bagged MARS, RMSE ranges from 0.92 to 1.83 mm/year for the training data and from 1.70 to 2.18 mm/year for the test data. In addition, NSE (−∞ to 1) varies from 0.38 to 0.83 for the training data and from 0.19 to 0.60 for the test data. Finally, d (0 to 1) goes from 0.64 to 0.94 for the training data and from 0.60 to 0.85 for the test data. As expected, all three indicators are better for the training data than for the test data with no exception. The same observation can be made on the RF metrics as well.
To evaluate the relative appropriateness of bagged MARS and RF as tools for ensemble learning, we compared the RMSE values of the two ensemble methods. The results were mixed, as shown in Figure 4. RF out-performed bagged MARS two out of three times in both the training data and the test data, as shown in Figure 4a,c and was superior overall. The same conclusion was also confirmed by the lower average RMSE values of RF than bagged MARS.
If we plot the results on a Taylor diagram, we can observe further differences between RF and bagged MARS. As can be seen from Figure 4b, all three sampling (grouping) results of the RF cluster together with very similar RMSE, correlation, and standard deviation. This indicates consistent training results across different samples. By contrast, only two of the sampling results of bagged MARS cluster together. The third is very far away with a substantially smaller correlation and standard deviation and much larger RMSE than the other two. This shows that bagged MARS did not work equally well with different training data. In this case, it had difficulties fitting a model to grouping (sampling) #3. Hence, it lost to RF and fell short in the statistical comparison. Figure 4d shows a Taylor diagram comparison between the test datasets. A larger spread is observed in the data for both bagged MARS and RF. However, notice the closeness of data points of the same color. This shows that bagged MARS and RF generated similar predictions on the same test dataset. For instance, although the green triangle and green circle are very far apart in Figure 4b, they are relatively close to each other in Figure 4d.

Boosting
Two boosting ensemble methods were explored in this study. They were the GBM and Cubist models. Table 3 shows the performance metrics (RMSE, NSE, and d) under three different samplings (groupings) of the erosion pin data. For groupings #1 and #2, GBM won against Cubist in every category (lower RMSE, higher NSE, and higher d values). However, for grouping #3, the result is mixed. On the one hand, GBM lost to Cubist in all three indices concerning the training data. On the other hand, GBM beat Cubist in

Boosting
Two boosting ensemble methods were explored in this study. They were the GBM and Cubist models. Table 3 shows the performance metrics (RMSE, NSE, and d) under three different samplings (groupings) of the erosion pin data. For groupings #1 and #2, GBM won against Cubist in every category (lower RMSE, higher NSE, and higher d values). However, for grouping #3, the result is mixed. On the one hand, GBM lost to Cubist in all three indices concerning the training data. On the other hand, GBM beat Cubist in the test data by winning two of the three indices. The RMSE of GBM is lower than that of Cubist (1.71 vs. 1.98). Similarly, the NSE of GBM is higher than that of Cubist (0.50 vs. 0.33). However, GBM's d value is lower than Cubist's (0.75 vs. 0.78). This is one of the rare instances where NSE and d do not agree with each other. Putting everything together, we can conclude that GBM is still superior to Cubist.

Stacking
Stacking is the third and final category of ensemble methods examined in this study. The basic idea of stacking is to combine several weak models together to use their predictions as attributes in an overall meta-model. The meta-model is trained to yield better predictions than the base models (component models). Using the same six ML models studied by Nguyen et al. [18,19], we created Table 4 and classified the ML models into four categories: Tree model, neural network model, hyperplane model, and linear regression model. The average RMSE values of these models are also shown in Table 4. Based

Stacking
Stacking is the third and final category of ensemble methods examined in this study. The basic idea of stacking is to combine several weak models together to use their predictions as attributes in an overall meta-model. The meta-model is trained to yield better predictions than the base models (component models). Using the same six ML models studied by Nguyen et al. [18,19], we created Table 4 and classified the ML models into four categories: Tree model, neural network model, hyperplane model, and linear regression model. The average RMSE values of these models are also shown in Table 4. Based on Table 4, we picked the four weakest models (one from each category) and used them as the base models in the stacking ensemble learning. They are DT, ANN, SVM, and LM. Furthermore, RF and GBM, one from bagging and the other from boosting, were chosen as the meta-models to form two sets of stacking models: (1) RF + DT + LM + ANN + SVM and (2) GBM + DT + LM + ANN + SVM. All of the statistical indices were computed in R in this study. The results of RF-stacking (RF + DT + LM + ANN + SVM) and GBM-stacking (GBM + DT + LM + ANN + SVM) are shown in Table 5. For the training datasets, GBM-stacking outperformed RF-stacking in groupings #1 and #3 in all three statistical indices (RMSE, NSE, and d). In grouping #2, GBM-stacking also won against RF-stacking in terms of RMSE (1.45 vs. 1.46), but lost to RF-stacking in terms of d values (0.79 vs. 0.83). This is a rare case that an inconsistency between RMSE and d is observed. As for the test datasets, the best-performing model is reversed. In all three different groupings (samplings), RF-stacking out-performed GBM-stacking in all statistical indices with no exception. In summary, although GBM-stacking performed best with the training data, it lost the test data. Since the predictive performance of a model is based on unseen test data, we conclude that RF-stacking is the better stacking model of the two. The visual comparison between RF-stacking and GBM-stacking is shown in Figure 6, where all three indices were plotted (RMSE, NSE, and d). The numbers in the figure represent the average values of three different groupings (partitioning).
1 Figure 6. The comparison of RMSE (mm/year), NSE, and d between RF-stacking and GBM-stacking.

Comparison of Ensemble Models
So far, we have compared six ensemble methods in three categories and determined the best-performing model for each category. The next question is how they compare with one another in a six-way comparison. The results were compiled in Figure 7 using the RMSE values. Among the six ensemble models examined in this study, if we only consider the training data, the overall best-performing model is GBM (0.61 mm/year), followed by Cubist (0.84 mm/year). Both of them belong to the boosting category. Their average d values are as high as 0.96 and 0.95, respectively. However, if we only focus on the test data, although the overall winning model is still GBM (1.72 mm/year), RF (1.75 mm/year) will replace Cubist (1.95 mm/year) as the second best-performing model. If NSE and d are considered instead of RMSE, the conclusion does not change. In those cases, GBM and RF remain the two best ensemble models. Based on Figure 7, we rank the models from the best to the worst as follows:
Test: GBM > RF > bagged MARS > Cubist > Stacking (RF) > Stacking (GBM) For machine learning, it is more appropriate to judge ML models' performance by the unseen test data. Therefore, for the test data, GBM (boosting) is the overall winner, followed by RF (bagging). The next best-performing models are bagged MARS (bagging) and Cubist (boosting). Clearly, the top four places were split evenly between the bagging method and the boosting method. Hence, these two types of ensemble models, bagging and boosting, rival each other in performance and are equally good in predicting soil erosion depths measured by erosion pins. By contrast, the stacking ensemble methods (RF + DT + LM + ANN + SVM and GBM + DT + LM + ANN + SVM) seem to lag behind both the bagging and boosting methods. We found this result very intriguing. Despite GBM and RF being the top-two performing models, they do not perform as well when they are used in the stacking approach to combine other weaker models. Perhaps this is because GBM and RF worked with 14 attributes directly when used individually as single models. When GBM and RF were used as meta-models in the stacking approach, they were only trained on the predictions of the base models (DT, LM, ANN, and SVM). Not being able to work with the underlying 14 attributes directly seems to have undermined the ability of GBM and RF to make better predictions.

Model Predictions and Factor Importance
Because GBM and RF are the two best ensemble models in a class of their own (RMSE = 1.72 and 1.75 mm/year, respectively), we only display their respective ensemble learning results in Figures 8 and 9. As shown in the figures, the size and color of the circles represent the absolute differences (errors) between model predictions and observations (|Obs − Pre|). Proportional symbols were used. Therefore, the larger the circle, the bigger the difference. Similarly, the redder the dot color, the more significant the error. The contrasting results produced by GBM and RF are evident. For GBM, it is clear from Figure 8 that most points have low error except for those in the eastern part of the study area. By contrast, Figure 9 shows that RF has large errors in both the eastern and southern parts of the study area. To better visualize the error distribution, we further mapped the spatially interpolated values (absolute errors) throughout the study area. As shown in Figure 8, the resulting watershed is mostly green (associated with low error) for GBM, except in the eastern part, where large errors due to the steep topography and undesirable slope directions bring a return to red colors. However, according to Figure 9, the watershed is only in green in the northern region for RF. The rest of the watershed is colored red or yellow.
Furthermore, we created two vertical profiles on the map, one in an east-west direction and the other in a north-south direction, to compare the model predictions and observations for both GBM and RF. As shown in Figure 8, the black line is the observation, and the blue line is the GBM prediction. Both lines show a similar trend and move in the same direction: If the observation increases so does the prediction, and if the observation decreases so does the prediction. However, it seems evident that GBM tends to under-estimate the high values of observation and over-estimate the low values of observation in both the northsouth and the east-west profiles. Furthermore, if we plot the same profiles in Figure 9 and use a red line to represent the RF prediction, we can see a similar result. The RF prediction also moves in the same direction as the observation. The RF model also underestimates the high values of observation and overestimates the low values of observation in the north-south and the east-west profiles. However, the difference is that RF errors more in comparison with GBM.
In summary, both GBM and RF perform better in the watershed's northern region, where the reservoir is located. Since the watershed slopes from south to north (as shown previously in Figure 2), it implies that the models more adequately captured the erosion behavior (measured by erosion pins) at the lower elevations than at the higher elevations. The eastern part is problematic. Neither of the two models works well here. In general, GBM outperformed RF because GBM matches the available measurements better than RF in the west and south. This makes GBM the overall best model, which is consistent with the RMSE-based results.  Both GBM and RF generate a rank of importance for the 14 attributes used in this study. The six simulations (three of GBM and three of RF) are combined in a boxplot to illustrate the range of ranks (factor importance) as shown in Figure 10. Several points in this figure merit a closer look. First, the grey box in the figure shows the range of ranks between the first and third quartiles of the results. It is evident that each attribute has a variable range of ranks. Second, the black line in the box shows the median rank (1 being the most important and 14 being the least important). The median rank and the gray box can be used to compare the relative importance of attributes. Therefore, it can be seen from Figure 10 that A (slope direction) and B (type of slope) are the two overall most important attributes in the GBM and RF models. They consistently rank higher than the other 12 attributes. Attribute D (elevation) is interesting. It has the fourth-lowest median rank in the comparison. At the same time, it also has the longest grey box and the broadest range than any of the other attributes. This means that the elevation has changeable importance depending on the model and dataset. In some instances, the elevation is very important, whereas in others, it is not. Both GBM and RF generate a rank of importance for the 14 attributes used in this study. The six simulations (three of GBM and three of RF) are combined in a boxplot to illustrate the range of ranks (factor importance) as shown in Figure 10. Several points in this figure merit a closer look. First, the grey box in the figure shows the range of ranks between the first and third quartiles of the results. It is evident that each attribute has a variable range of ranks. Second, the black line in the box shows the median rank (1 being the most important and 14 being the least important). The median rank and the gray box can be used to compare the relative importance of attributes. Therefore, it can be seen from Figure 10 that A (slope direction) and B (type of slope) are the two overall most important attributes in the GBM and RF models. They consistently rank higher than the other 12 attributes. Attribute D (elevation) is interesting. It has the fourth-lowest median rank in the comparison. At the same time, it also has the longest grey box and the broadest range than any of the other attributes. This means that the elevation has changeable importance depending on the model and dataset. In some instances, the elevation is very important, whereas in others, it is not.

Conclusions
Soil erosion is a significant threat to the environment and the livelihood of the region, and should be taken seriously to mitigate the disastrous consequences. Hence, a precise spatial prediction of soil erosion is a critical need. In this study, we applied six ensemble learning methods (bagged MARS, RF, GBM, Cubist, RF-stacking, and GBM-stacking) of three categories (bagging, boosting, and stacking) to model the measurements of erosion pins in the Shihmen Reservoir watershed of Taiwan. The purpose was to improve the modeling accuracy and forecasting capability related to soil erosion. In the process of assessing the performance of the three types of ensemble approaches, we have learned how effective the ensemble methods are with unseen test data.
The study results show that the ensemble methods improve the prediction accuracy as measured by three statistical indices-RMSE, NSE, and d. Among the three categories

Conclusions
Soil erosion is a significant threat to the environment and the livelihood of the region, and should be taken seriously to mitigate the disastrous consequences. Hence, a precise spatial prediction of soil erosion is a critical need. In this study, we applied six ensemble learning methods (bagged MARS, RF, GBM, Cubist, RF-stacking, and GBM-stacking) of three categories (bagging, boosting, and stacking) to model the measurements of erosion pins in the Shihmen Reservoir watershed of Taiwan. The purpose was to improve the modeling accuracy and forecasting capability related to soil erosion. In the process of assessing the performance of the three types of ensemble approaches, we have learned how effective the ensemble methods are with unseen test data.
The study results show that the ensemble methods improve the prediction accuracy as measured by three statistical indices-RMSE, NSE, and d. Among the three categories of ensemble methods, bagging and boosting work equally well on the unseen test data. Stacking is the least favorable approach, with its RMSE trailing behind other types of ensemble algorithms. Individually speaking, we found GBM to be the best fitting model. Its RMSE, NSE, and d values are 1.72 mm/year, 0.54, and 0.81, respectively. The second-best model is RF. We used both GBM and RF to map the absolute differences (errors) between model predictions and observations (|Obs − Pre|) in the study area. The results show that both models perform well in the watershed's northern region (where the reservoir is located) and perform relatively poorly in the watershed's eastern part due to the steep topography and undesirable slope directions. What makes GBM superior to RF is that GBM also works well in the western and southern parts of the study area while RF does not. This conclusion is consistent with the RMSE-based results.
Finally, as an additional discovery in the study, we noticed two cases of inconsistent statistical indices during the model comparison. One of them happened when we compared GBM with Cubist (test data of grouping #3). The other occurred when we examined the performance of the two stacking models (training data of grouping #2). In both of these cases, RMSE and NSE favored one model, but d preferred the other. Therefore, potentially contradictory conclusions could be made if we only rely on one single index. This danger exemplifies the need to present multiple indices in such studies.
Author Contributions: Conceptualization, Walter Chen; data curation, Bor-Shiun Lin; formal analysis, Kieu Anh Nguyen and Walter Chen; funding acquisition, Walter Chen and Uma Seeboonruang; investigation, Kieu Anh Nguyen and Walter Chen; methodology, Walter Chen; project administration, Walter Chen and Uma Seeboonruang; resources, Walter Chen and Bor-Shiun Lin; software, Kieu Anh Nguyen; supervision, Walter Chen and Uma Seeboonruang; visualization, Kieu Anh Nguyen; writing-original draft, Kieu Anh Nguyen and Walter Chen; writing-review and editing, Walter Chen, Bor-Shiun Lin, and Uma Seeboonruang. All authors have read and agreed to the published version of the manuscript.