Simple Optimal Sampling Algorithm to Strengthen Digital Soil Mapping Using the Spatial Distribution of Machine Learning Predictive Uncertainty: A Case Study for Field Capacity Prediction

Abstract: Machine learning models are now capable of delivering coveted digital soil mapping (DSM) benefits (e.g.


Introduction
Field capacity (FC) is one of the main factors representing the hydrologic characteristics of soil. It refers to the water content remaining once gravity-induced vertical drainage through the soil has substantially slowed [1]. Because the term is frequently used to describe the actual quantity of water in the soil that plants can use, it is actively leveraged as an input variable in hydrologic models [1,2]. Therefore, a proper understanding of the spatial FC distribution is essential for sustainable water management, especially in forested areas.
The digital soil mapping (DSM) technique is widely used to predict soil properties of interest (e.g., FC and hydraulic conductivity) using geographically referenced information or remote sensing (big) data [3,4]. DSM is a geospatial technique that can handle the numerous variables describing soil properties as data accumulate over time, and it can predict the soil properties of large areas. To map the distribution of soil properties, information on soil properties is first obtained from a soil investigation. Then, environmental covariate information corresponding to the points where the soil samples were collected is extracted from the available geographically referenced data. After this, a predictive model for estimating soil properties is developed from the relationship between the soil properties and the extracted environmental covariates. When the raster-based information of the environmental covariates is fed into the developed model, a continuous map of the soil properties can be produced.

Collected Soil Samples and Their Properties
This study was conducted using data from forest topsoil samples collected over time from 953 sites by workers at the National Institute of Forest Science (Figure 1). Samples were collected from the top 10 cm (mineral A horizon), and six soil properties, including FC, were used in this study, following Yang et al. [11], who first described the detailed analysis process. All 953 soil samples were collected randomly, since there was no prior information about the forest soils and their relationship with other environmental variables. In addition, soil sampling was conducted seven days after rainfall to eliminate climatological effects. Moreover, all soil properties were analyzed with the same method, ensuring consistency in the soil property data. Descriptive statistics for the six forest soil properties are shown in Table 1.

Soil size distribution (i.e., sand, silt, and clay fractions) was analyzed using the hydrometer method. The soil texture class (STC) was classified using the US Department of Agriculture (USDA) texture classes; the textural classes and FC distributions are shown in Figure 2. Given that the sample collection was random, loam is currently the most widely distributed soil texture class in the forested areas of South Korea.

Forest Environmental Covariates
South Korea has two essential big-data forest datasets describing forest types and site characteristics: the Forest Site and Soil Map (FSSM) and the Forest Type Map (FTM). In this study, the Geologic Map (GM; 1:50,000), FSSM (1:25,000), FTM (1:25,000), and Digital Elevation Map (DEM; 10 m resolution) datasets were used to extract 14 forest environmental covariates; detailed information about each variable is listed in Table S1. Some environmental covariates derived from the DEM (e.g., topographic position index (TPI), aspect, profile curvature, plan curvature, topographic wetness index, and upper catchment area) were calculated using a geospatial information system analyst tool.

Machine Learning Algorithms
Many studies on soil hydraulic properties have used machine learning algorithms for their high predictive accuracy and their ability to handle high-dimensional data [12,13]. This study tested and compared four widely used models: gradient boosting, random forest, k-nearest neighbors, and multilayer perceptron. To train the four models, the optimal hyperparameters determining the structure of each model were found using the grid search method. Grid searching was performed 20,000 times per model to find the optimal hyperparameter sets. We used Python v.3.7.4 and SciKit-Learn v.1.1.2 to run these machine learning models.
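As an illustrative sketch of this tuning step, a grid search for one of the models might look like the following. The parameter grids and the synthetic data here are assumptions for demonstration, not the study's actual 20,000-combination search space:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 14-covariate FC dataset.
X, y = make_regression(n_samples=200, n_features=14, noise=10.0, random_state=0)

param_grid = {                       # illustrative values, not the study's grid
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
best = search.best_params_           # optimal hyperparameter set on this data
```

The same pattern, with a different estimator and grid, would apply to each of the four models.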

Gradient Boosting (GB)
GB is an ensemble method that enhances prediction accuracy by combining the results of multiple individual base learners [12,14]. Whereas other machine learning models (e.g., decision trees and RF) focus on improving overall performance, GB uses the gradient to target the residual errors remaining from the learners trained up to the present cycle [15]. This approach reduces the influence of noisy data using decision trees and successive runs. Eventually, GB lowers the mean-squared error (MSE) by repeatedly training to correct the errors caused by earlier iterations [16]. In this study, three hyperparameters were tuned for GB optimization: n_estimators, learning_rate, and max_depth.

Random Forest (RF)
RF was developed by Breiman [17] as a decision tree algorithm that combines a bagging method with random variable selection [18]. Multiple training sets are created from a single dataset by bootstrapping, a decision tree is grown on each, and their predictions are reintegrated, which increases the overall predictability [15]. Because it is indifferent to the range of the input values, RF is resistant to overfitting [19]. Three hyperparameters were tuned for RF optimization: n_estimators, max_depth, and max_features.

K-Nearest Neighbors (KNN)
KNN is a basic nonparametric technique that classifies unknown instances using known instances as the baseline [19]. The closest k neighbors are identified by calculating the maximum summed densities, and classification is based on them. Because KNN can handle complicated and/or nonlinear problems, it has been used to analyze large datasets [20], including for soil mapping [21]. Three hyperparameters were tuned for KNN optimization: leaf_size, p, and n_neighbors.

Multilayer Perceptron (MLP)
The MLP neural network is one of the most common algorithms owing to its minimal training needs and simple utility [19,22]. MLP creates many basic units (i.e., neurons) capable of making simple decisions based on learning from training samples [12]. A neuron uses interconnected layers to make final predictions by feeding those decisions to other neurons, depending on the activation function. Generally, three layer types comprise an MLP network: input, hidden, and output. In this study, the optimal number of neurons in the hidden layer was determined by trial and error and was the only hyperparameter tuned.

Variance Inflation Factor (VIF)
Variable importance measurement methods produce inaccurate results when the variables are interrelated [17]. The VIF is the most commonly used method for quantifying variable correlation [23]. The VIF value for a single variable is derived as follows:

VIF_1 = \frac{1}{1 - R_1^2}

where R_1^2 is the coefficient of determination of the multiple linear regression in which variable X_1 is the response variable and the others (X_2, X_3, ..., X_N) are explanatory variables. Generally, when the VIF is greater than five, a correlation between the variables exists, and the corresponding variable must be eliminated before the variable importance can be accurately calculated [24].

Feature Importance
Feature importance is based on the mean decrease in impurity [17]. It is related to the decision tree building process, in which a decision tree splits to obtain the greatest reduction in impurity. The more the impurity can be reduced, the higher the variable importance. Using this principle, feature importance is one of the most commonly used methods for estimating variable importance, because it is provided by most RF tools and is easily applicable [17,25]. Feature importance can only be derived using the RF model. After an optimal hyperparameter set is selected, the RF model is developed, and the feature importance is averaged over 100 bagging cycles, as each round gives slightly different results. We used SciKit-Learn v.1.1.2 for this purpose.
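A minimal sketch of this averaging procedure, using synthetic data and fewer rounds than the study's 100, might look like:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the soil dataset.
X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=1)

# Each bagging round (different bootstrap seed) gives slightly different
# feature_importances_, so the values are averaged over repeated fits.
importances = np.mean(
    [RandomForestRegressor(n_estimators=50, random_state=s)
     .fit(X, y).feature_importances_
     for s in range(10)],            # the study averaged 100 rounds
    axis=0,
)
```

The averaged importances sum to one, so they can be read as relative shares of impurity reduction.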

Permutation Importance
Permutation importance is another method of estimating variable importance; it is based on the mean decrease in accuracy. The general method permutes one variable of the test set after model development while the other variables are kept constant. The decrease caused by the permutation is then reflected in the model accuracy (e.g., via the coefficient of determination [26]). If a variable is important, the model accuracy significantly decreases when it is permuted [17]. In this study, 50 permutations were conducted per variable. Unlike feature importance, permutation importance applies to all model types. Additionally, the results of permutation importance estimation are directly affected by how the data are split into training and testing sets. Thus, data splitting was randomly conducted 100 times, and permutation importance was averaged over the 100 results.
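The procedure can be sketched with scikit-learn's permutation_importance; the synthetic data and the number of random splits here are illustrative, not the study's setup:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

scores = []
for seed in range(5):                # the study used 100 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    # 50 permutations per variable, scored by the coefficient of determination.
    result = permutation_importance(model, X_te, y_te, n_repeats=50,
                                    random_state=0, scoring="r2")
    scores.append(result.importances_mean)

mean_importance = np.mean(scores, axis=0)   # averaged over the random splits
```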

Predictor Set Hierarchy
In most cases, a proper predictor set is selected to simplify the model as much as possible while maintaining the prediction performance [27]. A stepwise statistical method is used, and several predictor sets may be constructed for performance comparisons. In addition, this study had the further purpose of comparing the performance of different predictor set hierarchies.
Currently, in South Korea, continuous forest surveys are being conducted, and geographically referenced information (e.g., FSSM, GM, and FTM) and in situ forest soil measurements are continuously updated. This means that data coverage varies by location. For the most accurate FC prediction, it is crucial to use, at each location, the hierarchy model with the maximum accuracy. Considering the above criteria, six predictor set hierarchies were constructed in this study.
If the variables are correlated, a multicollinearity issue may occur, and the accuracy of the model prediction may decrease. To prevent this, the VIF analysis described in Section 2.5.1 was performed, and each predictor set was configured so that no correlated variables would be included. For example, the soil texture class and sand have VIF values of 5 or more, so one of the two variables had to be removed from E14-S2 and E14-S4, where these two factors overlapped. To minimize the decrease in the predictive models' performance, we removed the variable that explained FC less well; feature importance was used to determine the more important variable for prediction. In this case, sand had the higher feature importance, and the soil texture class was eliminated.
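The VIF screening can be reproduced with a short helper; this is a generic implementation of the formula in Section 2.5.1, not the study's code:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    values = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Design matrix: intercept plus every column except j.
        Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid.var() / y.var()
        values.append(1.0 / (1.0 - r2))
    return np.array(values)
```

Any column whose VIF exceeds five would then be a candidate for removal, with feature importance deciding which member of a correlated pair to keep.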

Performance Evaluation
Model performance was evaluated using a hold-out test set that was not used for training: 80% of the total data was used for training, and 20% was used for testing. Five-fold cross-validation was used for performance evaluation. To assess the model performance in each evaluation process, the Nash-Sutcliffe model efficiency (NSE) coefficient, which is very effective in assessing predictive performance, and the root mean-squared error (RMSE) were used. These evaluation methods are defined as follows:

NSE = 1 - \frac{\sum_{i=1}^{N} (FC_i - \widehat{FC}_i)^2}{\sum_{i=1}^{N} (FC_i - \overline{FC})^2}

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (FC_i - \widehat{FC}_i)^2}

where N is the number of soil samples, FC_i is the measured FC, \widehat{FC}_i is the modeled FC, and \overline{FC} is the average value of the measured FCs.
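The two metrics can be implemented directly from their definitions; this is a generic sketch, not the study's code:

```python
import numpy as np

def nse(observed, predicted):
    """Nash-Sutcliffe model efficiency: 1 is a perfect fit,
    0 means no better than predicting the observed mean."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 1.0 - np.sum((observed - predicted) ** 2) \
               / np.sum((observed - observed.mean()) ** 2)

def rmse(observed, predicted):
    """Root mean-squared error between measured and modeled values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))
```

As a sanity check, predicting the observed mean everywhere yields NSE = 0, and a perfect prediction yields NSE = 1 and RMSE = 0.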

Statistical Analysis
Several statistical analytical methods were used in this study. Performance evaluation and descriptive statistics (e.g., average value, standard deviation, skewness, and kurtosis) were computed with NumPy v.1.17.3 and Python v.3.7.4.

Optimal Machine Learning Algorithm for FC Prediction
The model performance of GB, RF, KNN, and MLP was assessed to find the optimal machine learning algorithm for FC prediction. The E14-S4 predictor set hierarchy was used to develop the performance comparison model (see Table 2), and NSE and RMSE with five-fold cross-validation were applied.  Figure 3 shows that GB and RF performed better than KNN and MLP. GB and RF showed a higher NSE, which reflects good accuracy (the closer to one, the higher the predictive power). Moreover, GB and RF showed a smaller RMSE, which reflects low errors between the predicted and true values. Both the GB and RF models are based on decision trees, with splitting paths but no converging ones. Decision rules are used for splitting; hence, two leaves are created each time. The tree grows more significantly as the amount of information from the input data and its complexity increase [28]. Owing to these characteristics, the GB and RF models can be trained without being affected by the scale of the input data and variables (continuous or discrete).
For example, there are three categories of bedrock used in this study. Because all input data of a machine learning model must consist of numeric variables, igneous rock was converted to 1, sedimentary rock to 2, and metamorphic rock to 3. However, this does not mean that the scale of characteristics of metamorphic rock is three times that of igneous rock. In this case, the decision tree can effectively categorize these three characteristics.
KNN, on the other hand, measures the distances between two data values to evaluate similarities [29]. The Euclidean distance is d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}, in which (x_1, y_1) and (x_2, y_2) are coordinates. In other words, effective analysis becomes difficult when discrete data are converted into numeric variables. As MLP is a fully connected feed-forward artificial neural network [12], the activation functions and node weights are included in each calculation process; hence, the results can be negatively affected by the scale of categorical data converted into continuous values [30]. Therefore, although GB and RF can effectively process these datasets, KNN and MLP cannot. Several methods have been proposed to overcome this problem, such as one-hot encoding [31]. However, these methods will not be discussed further here because they are beyond the aim and scope of this paper.
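The bedrock example can illustrate the two encodings; this is a minimal sketch using the integer codes given above:

```python
import numpy as np

bedrock = ["igneous", "sedimentary", "metamorphic", "igneous"]

# Integer coding, as used for the tree-based models: the order is arbitrary,
# so the numeric "scale" (1, 2, 3) carries no physical meaning.
codes = {"igneous": 1, "sedimentary": 2, "metamorphic": 3}
integer_coded = np.array([codes[v] for v in bedrock])

# One-hot encoding removes the artificial ordering, making the categories
# safer for distance- and weight-based models such as KNN and MLP.
categories = ["igneous", "sedimentary", "metamorphic"]
onehot = np.array([[1.0 if v == c else 0.0 for c in categories]
                   for v in bedrock])
```

Under one-hot encoding, every pair of distinct bedrock types is equidistant, whereas the integer coding places "metamorphic" three times farther from zero than "igneous" for no physical reason.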

Relationship between Predictors and FC
The GB and RF models were used to analyze the predictor importance for estimating FC, noting that tree-based models are the most suitable. Three methods were used to verify the importance of variables: permutation importance of GB and RF and feature importance of RF. There were slight differences among these methods, but the variables significantly influencing the prediction of FC showed similar trends (Figure 4). Among the soil properties, sand and organic matter were important in predicting FC, and elevation and TPI were important among the environmental covariates. With GB, the influence of soil properties was more important than environmental covariates. However, in the RF model, the influence of environmental covariates was quite similar to that of soil properties. Therefore, it was found that using the RF model was more advantageous for the E4-S0, E12-S0, and E14-S0 predictor set hierarchies, which mainly use environmental covariates as predictors. Alternatively, the GB model was more advantageous for E0-S4, E14-S2, and E14-S4, which mainly use soil properties as predictors.

Figure 4. Variable importance estimated using the permutation importance of the GB and RF models and the feature importance of the RF model.
A 2D heatmap matrix was drawn, as shown in Figure S1, to visualize the effect of each variable on FC and their correlations, using the top four variables selected through the importance analysis. The E14-S0 predictor set hierarchy-based RF model was trained to verify the effect of the environmental covariates on FC. To consider only the effect of the two variables of interest, one soil type used in the training set was selected. The FC was predicted by modifying only the two variables, whereas the others remained fixed. This process of selecting a soil and predicting FC was performed 500 times. After calculating the average of the results, it was standardized to a value between zero and one so that the trend could be checked more directly. The equation for standardization is as follows:

X_{std} = \frac{X - X_{min}}{X_{max} - X_{min}}

The effect of soil properties on FC was examined using the E0-S4 predictor set hierarchy-based GB model. It was found that the FC value decreased as the elevation decreased, the TPI approached zero, and the aspect was closer to the south. Regarding bedrock, the FC of forest soil with igneous rock as bedrock was higher than that of the others. The results showed that as the sand fraction decreased and the bulk density, organic matter, and clay increased, the FC increased, and the correlation between each variable was easily confirmed (Figure S2).
Sand, OM, elevation, and TPI were the most important variables for predicting FC. The interrelationship between soil properties and topographical features was analyzed by Yang et al. [11] using the same soil samples in South Korea. Here, however, we briefly discuss the relationship between FC and the important variables. FC is highly correlated with the capillary force, which is affected by the soil texture that determines the micropores [32]. The silt and clay fractions in soils have a great influence on the capillary force, and as the sand fraction increases, the silt and clay content decreases, meaning that FC tends to decrease [11]. Organic matter particles attract water and can cause water to adhere to the soil surface [33]. This is the main reason that OM has high variable importance. Hudson [34] also stated that OM has a significant impact on the available water capacity. In Figure S1, elevation and TPI show a positive relationship with FC. Elevation was highly correlated with OM and sand in South Korea. One reason is that strictly protected areas are mostly distributed in Baekdudaegan, a higher-altitude region [11]. TPI represents the slope position, taking positive values at ridges and negative values at valleys [35]. Since most of the ridges, which show higher TPI values, are distributed in Baekdudaegan, TPI is an important variable and has a positive relationship with FC.
To compare the prediction performance of each predictor set hierarchy, the measured and predicted FCs were compared ( Figure 5). A scatterplot of five-fold cross-validation was built by arbitrarily selecting one of the five validations. The E4-S0 predictor set hierarchy showed the lowest predictive performance, whereas there was little difference among E12-S0, E14-S0, and E0-S4. The predictive accuracy increased as the number of soil property types increased, and the E14-S4 model showed the highest performance.
Notably, the performance of the E14-S0- and E0-S4-based models was almost identical. The E14-S0 model can represent the DSM, and the E0-S4 model can represent the pedotransfer function (PTF). The PTF model predicts another soil property of interest based on the soil's physical and chemical properties [36]. Therefore, the predictive power of PTF, which predicts soil characteristics directly, is generally higher than that of DSM, which predicts soil characteristics from indirect environmental variables [37]. Although PTF has the advantage of relatively high predictive power, its disadvantage is that it is difficult to predict a large area because it can predict only the soil characteristics of sites where the soil was directly investigated [38]. By contrast, DSM has lower predictive power than PTF but can predict soil properties over a broader range [11]. However, the data collected in this study indicate that the explanatory power of the environmental covariates is already very similar to that of the soil properties. In other words, the geographically referenced datasets collected so far in Korea are abundant enough to match the explanatory power of in situ measurements. Furthermore, the characteristics of the environmental covariates and soil properties evidently do not overlap, because the performance of E14-S2 (Figure 5e,f), which combines these two datasets, increased further.
The FC of South Korea was predicted using all six hierarchical sets (Figure 6), and the averaged FC across the nation was found to be 24.70%. Notably, higher FCs are found at higher altitudes, distributed mainly along the Baekdudaegan watershed crest line. Additionally, the FCs in the northeast and southwest regions, with relatively higher altitudes, were higher than those in the northwest and southeast regions.

Relationship between Predictive Uncertainty and Distribution of Training Datasets
The E14-S0 and E0-S4 predictor set hierarchy-based RF models were used to estimate the predictive uncertainty of the models. Here, 2D heatmaps were drawn with the two most important variables among the environmental covariates and soil properties. We also compared the predicted FC and the distribution of the training data alongside the predictive uncertainty (Figure 7). One soil sample from the training data was selected to obtain the predicted FC. FC was predicted by modifying only the two variables of interest, while the others remained fixed. The soil samples were selected 500 times, and the prediction was repeated to capture the general trend. The results were then averaged and standardized to between zero and one to check the relative differences (Figure 7a,d). Predictive uncertainty was derived using the bagging algorithm of the RF model. After selecting the optimal hyperparameters of the model via grid searching, the FC was predicted while continuously changing the randomness of the bootstrapping method. In the "RandomForestRegressor" class of SciKit-Learn, the parameter "random_state" controls this randomness. Note that the soil samples used to obtain the predictive uncertainty were the same as those used to obtain the standardized predicted FC. The coefficient of variation (CV; standard deviation/average) values from the 500 results were extracted and standardized to between zero and one to check the relative differences (Figure 7b,e).
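The seed-varying bagging procedure can be sketched as follows; synthetic data and only 10 seeds (versus the study's 500) are used for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_query = X[:10]                     # points at which uncertainty is estimated

# Refit with different bootstrap seeds ("random_state") and collect predictions.
preds = np.array([RandomForestRegressor(n_estimators=50, random_state=s)
                  .fit(X, y).predict(X_query) for s in range(10)])

cv = preds.std(axis=0) / np.abs(preds.mean(axis=0))   # coefficient of variation
cv_std = (cv - cv.min()) / (cv.max() - cv.min())      # rescaled to [0, 1]
```

The rescaled CV values play the role of the relative predictive uncertainty shown in Figure 7b,e.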
A strong correlation between predictive uncertainty and the distribution of the training dataset was observed, whereas the predicted FC and predictive uncertainty were not found to be significantly related (Figure 7). The part with relatively low predictive uncertainty (blue area in Figure 7b,e) coincides with an extensive training set distribution. Moreover, the part with relatively high predictive uncertainty (red area in Figure 7b,e) corresponds to the part where the training set was barely distributed.
We added training data and simulated how the predictive uncertainties changed to verify the relationship between training data scarcity and predictive uncertainty (Figure 8). We used the E14-S0 predictor set hierarchy-based RF model. After selecting a soil sample, only the elevation and TPI of the soil sample were changed to create 2D imaginary soils, a simplified version of the target site. The higher predictive uncertainties were distributed in the upper-left, upper-right, and lower-right corners, where the training data were rarely distributed. The soil with the highest uncertainty, observed in the upper left, was added to the training dataset, assuming that this soil had been collected.
Here, the FC value of the added data was calculated based on the predicted value of the E14-S0 predictor set hierarchy-based model. Note that a specific error was added to the predicted value because the FC value of the actual soil differs from the model's predicted value. This error follows the distribution of the difference between the predicted and measured values of the training data using the E14-S0 predictor set hierarchy-based model (see Appendix A). The simulation, in which a new soil sample was added to the training data and the changes in predictive uncertainties were observed, showed that uncertainty was high where the training data were rarely distributed and that the predictive uncertainty decreased around the added soil sample. The graph in Figure 8 shows that the uncertainty decreased rapidly, even when only one soil sample was added. This is because the 2D imaginary soils used to simulate the predictive uncertainty changes have precisely the same characteristics, except for elevation and TPI. Nevertheless, we confirmed that high predictive uncertainties were distributed in the data scarcity area and that they can be reduced through additional investigation.
Figure 8. Examples of adding training data and changes in predictive uncertainties. Predictive uncertainties decreased when additional training data were used. Note that the 2D imaginary soils used in this simulation had the same characteristics, apart from the two variables of interest (elevation and TPI), which makes the predictive uncertainty reduction appear radical.
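The 2D imaginary soils used in this simulation can be constructed as in the sketch below, which copies one sample's covariate vector and varies only elevation and TPI across a grid. The helper name and the column indices for elevation and TPI are our assumptions, not the study's code:

```python
import itertools
import numpy as np

def imaginary_soils(base_covariates, elev_range, tpi_range, n=20):
    """Copy one real sample's covariate vector n*n times and vary only
    elevation and TPI (assumed to sit in columns 0 and 1) over a grid."""
    elev = np.linspace(*elev_range, n)
    tpi = np.linspace(*tpi_range, n)
    grid = np.tile(base_covariates, (n * n, 1))
    combos = np.array(list(itertools.product(elev, tpi)))
    grid[:, 0] = combos[:, 0]   # elevation axis
    grid[:, 1] = combos[:, 1]   # TPI axis
    return grid

base = np.array([300.0, 0.5, 12.0, 4.2])   # toy 4-covariate sample
soils = imaginary_soils(base, (0.0, 1500.0), (-10.0, 10.0))
```

Feeding such a grid to the fitted model yields an uncertainty value per (elevation, TPI) combination, which can be plotted as the 2D panels of Figure 8.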
The bagging (bootstrap aggregation) algorithm is a machine learning ensemble technique that trains models on randomly selected subsets of the data. It has many advantages, such as preventing overfitting and increasing model stability and accuracy [39]. However, when a model must predict outside the distribution of the training dataset, extrapolation is inevitable, which can increase the uncertainty of the prediction. Zhang et al. [27] also mentioned a correlation between data distribution and uncertainty when they developed a PTF model that predicted soil hydraulic properties on a global scale; they found that a lack of relevant calibration samples might cause high uncertainty values. In other words, when the bagging technique is used, the places where the relevant training dataset is insufficient can be effectively identified through uncertainty testing. Predictive uncertainty can be calculated with just a few lines of code in scikit-learn.
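Those few lines might look as follows: a fitted scikit-learn forest exposes its individual bagged trees via `estimators_`, and the spread of their per-tree predictions serves as the predictive uncertainty. The synthetic data and variable names are ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 3))
y = 15 + 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Each bagged tree saw a different bootstrap subset, so the spread of
# per-tree predictions at a point reflects the predictive uncertainty there.
X_new = rng.uniform(size=(10, 3))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
uncertainty = per_tree.std(axis=0) / per_tree.mean(axis=0)   # CV per point
```

Points far from the training distribution force each tree to extrapolate from different bootstrap samples, so their predictions disagree more and the CV rises.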

Our Simple Optimal Sampling Algorithm
A simple optimal sampling algorithm was developed based on the relationship between predictive uncertainty and the training dataset. First, a machine learning model was developed by searching for an optimal hyperparameter set based on the collected data. The threshold predictive uncertainty (TPU), the criterion for determining the sample size, was then set from the predictive uncertainty distribution of the training data used to develop the RF model. Subsequently, the predictive uncertainties of the target site were calculated using environmental covariate data. At this point, the coordinate corresponding to the soil with the highest predictive uncertainty becomes an additional sample site. Afterward, the RF model predicts the FC value of the additional sample site, and an error term is added to the predicted value (see Appendix A). An RF model is then trained on the new training data, in which the new hypothetical soil sample is included. The above process is repeated, and additional sample sites are selected one by one. The optimal sample size is then determined based on the predictable rate, the ratio of the number of grid cells with uncertainty smaller than the TPU to the total number of grid cells.
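The loop above can be sketched as follows, under stated assumptions: a scikit-learn RF, the CV of per-tree predictions as the uncertainty measure, the 95th-percentile TPU, and σ = 1.46 from Appendix A. The function names, the iteration cap, and the toy data are ours, not the study's code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def grid_uncertainty(model, X):
    """CV of the per-tree predictions, used as predictive uncertainty."""
    per_tree = np.stack([t.predict(X) for t in model.estimators_])
    return per_tree.std(axis=0) / per_tree.mean(axis=0)

def select_sites(X_train, y_train, X_grid, sigma=1.46,
                 target_rate=0.8, max_sites=30, seed=0):
    """Repeatedly add the most uncertain grid cell as a hypothetical
    sample until the predictable rate (share of cells below the TPU)
    reaches target_rate, or until max_sites sites have been added."""
    rng = np.random.default_rng(seed)
    fit = lambda X, y: RandomForestRegressor(
        n_estimators=50, random_state=0).fit(X, y)
    model = fit(X_train, y_train)
    # TPU: 95th percentile of the training-set predictive uncertainties
    tpu = np.percentile(grid_uncertainty(model, X_train), 95)
    sites = []
    while True:
        u = grid_uncertainty(model, X_grid)
        rate = float((u < tpu).mean())           # predictable rate
        if rate >= target_rate or len(sites) >= max_sites:
            return sites, rate
        i = int(u.argmax())                      # most uncertain cell
        sites.append(i)
        # hypothetical FC: prediction plus an error term (Appendix A)
        y_new = model.predict(X_grid[i:i + 1])[0] + rng.normal(0.0, sigma)
        X_train = np.vstack([X_train, X_grid[i]])
        y_train = np.append(y_train, y_new)
        model = fit(X_train, y_train)

# toy run: training data cover [0, 1]^2, the target grid extends beyond it
rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 2))
y = 25 + 10 * X[:, 0] + rng.normal(size=100)
sites, rate = select_sites(X, y, rng.uniform(0.0, 2.0, size=(200, 2)))
```

In practice, each returned index would be mapped back to the grid cell's coordinates to obtain the location of the next survey site.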
Using this new algorithm, we selected additional sample sites to enhance the existing DSM of South Korea. The E14-S0 predictor set hierarchy-based RF model was trained using the 953 collected soil samples. Afterward, we calculated the predictive uncertainty distribution using the same 953 soil samples (Figure 9a). The TPU was determined using the 95th-percentile value of the predictive uncertainties. Note that 95% is a subjectively selected ratio and may vary depending on the research or business purpose. Then, the environmental covariate information from a 100 m × 100 m grid of forests across South Korea was used as the target data. The predictive uncertainty of each grid cell was calculated based on the developed model, and the cell with the highest uncertainty was selected as an additional study site. Afterward, the process of updating the model and selecting additional survey sites was repeated while the additional data were included in the training dataset. The predictive uncertainty gradually decreased as additional survey sites were selected (Figure 9a). If the predictive uncertainty of a grid cell is less than the TPU, it can be assumed that the training dataset already contains the information of that grid cell. Thus, the ratio of the number of cells with uncertainty smaller than the TPU (TPU95) to the total number of grid cells in the target site (Ntot) was calculated to determine the predictable rate (TPU95/Ntot). It was found that this rate increased with the number of additional sites (Figure 9b). Furthermore, it was confirmed that 43 additional samplings were needed if the research goal was to explain more than 80% of the target area. Additionally, the predictive uncertainty of forests nationwide decreased as optimal sampling sites were added through the sampling algorithm (Figure 9c).
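The predictable rate itself is a simple ratio; the sketch below illustrates it with invented numbers (the function name is ours):

```python
import numpy as np

def predictable_rate(uncertainty_grid, tpu):
    """Fraction of target grid cells whose predictive uncertainty
    falls below the threshold predictive uncertainty (TPU)."""
    u = np.asarray(uncertainty_grid)
    return float((u < tpu).sum()) / u.size

# TPU taken as the 95th-percentile training-set uncertainty
train_u = np.array([0.10, 0.12, 0.15, 0.18, 0.20,
                    0.22, 0.25, 0.28, 0.30, 0.50])
tpu = np.percentile(train_u, 95)         # 0.41 by linear interpolation
grid_u = np.array([0.10, 0.60, 0.20, 0.40, 0.30])
rate = predictable_rate(grid_u, tpu)     # 4 of 5 cells fall below the TPU
```

Tracking this ratio after each added site produces the rising curve of Figure 9b, and the sampling can stop once the rate crosses the chosen target (80% in this study).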
We used only the training dataset's predictive uncertainty to select the TPU. If we develop a model to estimate FC from the training dataset and then estimate FC for that same training dataset, the predictive uncertainty calculated at that point stems from potential errors introduced during soil collection or laboratory work, or from incorrect geographical information. These factors pose irreducible uncertainty in the current situation. In other words, this is natural variation that occurs equally, even if more soil samples are collected. Thus, the TPU was calculated based on the training dataset.
Predictive uncertainty can be broadly divided into aleatoric and epistemic uncertainty. Aleatoric (statistical) uncertainty refers to the variability of the results caused by naturally occurring random effects [40,41]. Epistemic (systematic) uncertainty pertains to the analyst's lack of knowledge [42]. Thus, aleatoric uncertainty is the non-reducible type, and epistemic uncertainty is reducible [43]. Hence, the TPU, used in this study as the standard for the predictable rate, has the characteristic of irreducible aleatoric uncertainty. In contrast, predicting the FC of the target site data involves not only the epistemic uncertainty related to out-of-distribution data that were not used when the model was trained but also aleatoric uncertainty. Therefore, if new sample sites that did not exist before are added to the training dataset over time, the model learns the corresponding soil information. As a result, the epistemic uncertainty steadily decreases.
This sampling algorithm is useful for collecting soils with characteristics that have not been captured before, as it suggests the locations and number of sample sites in a simple way using machine learning. Therefore, it can make our explanation of FC more concrete. However, this requires future work. South Korea has started an R&D project to collect additional forest soil samples. Thus, it is necessary to verify, against the ground truth, whether the predictive uncertainty is reduced as the model's prediction performance increases throughout this project. Notably, collecting new samples is not the only method for enhancing DSM. A more refined model for predicting the FC of forest soil could be developed if additional predictors that better explain FC are found in the future (e.g., using datasets from satellites or drones). Further, new machine learning structures are being developed that could describe FC characteristics more effectively.

Conclusions
Nine hundred and fifty-three forest soil samples and four geographical big datasets were used to develop FC predictive models. The predictive performance of four machine learning algorithms was compared, and the tree-based models (RF and GB) showed a higher NSE and a smaller RMSE; that is, they performed better than the other machine learning models (KNN and MLP). Therefore, RF and GB models are suitable for handling the geographically referenced big datasets of South Korea. Variable importance analysis confirmed that sand and OM among the soil properties, and elevation and TPI among the environmental covariates, were important variables for predicting FC. We established six different predictor sets since data coverage varies by location across the nation. Further, considering the variable importance of soil properties and environmental covariates, machine learning models were paired with their optimal predictor set hierarchies (RF: E4-S0, E12-S0, and E14-S0; GB: E0-S4, E14-S2, and E14-S4). The FCs of South Korea were predicted using the developed models and were primarily distributed along the Baekdudaegan watershed crest line.
There was a strong relationship between model predictive uncertainties (the coefficient of variation of predicted FCs) and the training data distribution. Further, we confirmed that higher uncertainties were distributed in the data scarcity area. Changes in predictive uncertainties were simulated to verify the relationship between training data scarcity and predictive uncertainty. Predictive uncertainties decreased when additional sample sites were added to the training dataset. To determine the optimal sample sites to strengthen DSM in South Korea, the environmental covariate information of each grid cell (a raster with a cell size of 100 m) was used. The grid cell with the highest predictive uncertainty was selected as a new sample site. The information of the new sample site was added to the training dataset, and the site selection process was iterated. Epistemic uncertainty decreased when new sample sites were added to the training dataset. The sample size was determined using the predictable rate, calculated from the total number of grid cells in the target site and a subjectively established TPU. When the TPU was set at the 95th-percentile value of the predictive uncertainties, 80% of the target area could be successfully explained after 43 optimal sample sites were added to the existing datasets. This intuitive sampling design can be generalized to coordinate the sampling sites and determine the sampling size for strengthening DSM.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/land11112098/s1, Table S1: Forest environmental covariates from the geographically referenced database used in this study; Figure S1: 2D heatmap matrix of predicted field capacity (FC) for four environmental covariates. A random soil sample was selected, and two variables of interest were changed while the other variables were fixed. This process was iterated 500 times, and all predicted FCs were averaged and standardized to show the effect of only the two variables of interest on FC. Note that Ign. is igneous rock, Sed. is sedimentary rock, and Meta. is metamorphic rock. Regarding aspect, 0° is the northern aspect, and 180° is the southern aspect; Figure S2: 2D heatmap matrix of predicted FC for four soil properties. A random soil sample was selected, and two variables of interest were changed while the other variables were fixed. This process was iterated 500 times, and all predicted FCs were averaged and standardized to show the effect of only the two variables of interest on FC. Data Availability Statement: Soil datasets used in this study are available from the corresponding author upon reasonable request.

Appendix A
When the FC value of added data is assumed, it can be calculated based on the output value of the model. Here, the model's predicted value was not added to the training dataset as it was; instead, an error term was added to the predicted FC to simulate a collected soil. The error used here has the following distribution:

error ~ normal(µ, σ),  (A1)

where the symbol "~" indicates that the error is arbitrarily selected from the distribution shown on the right side of Equation (A1), and "normal" corresponds to the normal distribution with two parameters: mean (µ) and standard deviation (σ). σ was calculated as the standard deviation of the differences between the model's predicted FC values, FC_E14-S0(EC_i), and the measured values, FC_i. Here, EC_i is an input vector composed of the i-th soil sample's 14 environmental covariates, FC_E14-S0 is the random-forest-based E14-S0 model that predicts FC, and FC_i is the i-th measured FC. FC_E14-S0 was developed with the 953 soil samples, and both EC_i and FC_i are the predictors and FC values of those same soil samples. This study used 1.46 for σ.
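A minimal sketch of Equation (A1) in code, using the reported σ = 1.46 and assuming a zero mean for the error term (the appendix does not state µ explicitly; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(7)
SIGMA = 1.46   # standard deviation of the residuals, per Appendix A

def hypothetical_fc(predicted_fc):
    """Simulate a 'collected' FC value per Equation (A1): the model's
    prediction plus an error drawn from normal(mu, sigma), here mu = 0."""
    return float(predicted_fc + rng.normal(0.0, SIGMA))

# drawing many errors around one predicted value recovers sigma
errors = np.array([hypothetical_fc(32.5) - 32.5 for _ in range(20000)])
```

Each call yields a slightly different hypothetical FC, which is what keeps the simulated "collected" soils from exactly matching the model's own predictions.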