Comparing Machine Learning Models and Hybrid Geostatistical Methods Using Environmental and Soil Covariates for Soil pH Prediction

In the current paper, we assess different machine learning (ML) models and hybrid geostatistical methods for the prediction of soil pH using digital elevation model derivatives (environmental covariates) and co-located soil parameters (soil covariates). The study was located in the area of Grevena, Greece, where 266 disturbed soil samples were collected from randomly selected locations and analyzed in the laboratory of the Soil and Water Resources Institute. The models that were assessed were random forests (RF), random forests kriging (RFK), gradient boosting (GB), gradient boosting kriging (GBK), neural networks (NN), and neural networks kriging (NNK), along with multiple linear regression (MLR), ordinary kriging (OK), and regression kriging (RK), which, although not ML models, were included for comparison. The GB and RF models presented the best results in the study, with NN a close second. Applying OK to the ML models' residuals did not have a major impact. Classical and hybrid geostatistical methods without ML (OK, MLR, and RK) exhibited worse prediction accuracy than the models that included ML. Furthermore, different implementations (methods and packages) of the same ML models were also assessed. For RF and GB, the different implementations that were applied (ranger-ranger, randomForest-rf, xgboost-xgbTree, xgboost-xgbDART) led to similar results, whereas in NN, the differences between the implementations used (nnet-nnet and nnet-avNNet) were more distinct. Finally, ML models tuned through a random search optimization method were compared with the same ML models using their default values. The results showed that the optimization process improved the predictions only where the ML algorithm had a large number of hyperparameters to tune and the optimized values differed substantially from the defaults, as in the case of GB and NN but not RF.
In general, the current study concluded that although RF and GB presented approximately the same prediction accuracy, RF had more consistent results, regardless of different packages, different hyperparameter selection methods, or even the application of OK to the ML models' residuals.


Introduction
Environmental sciences have always been interested in accurately predicting the spatial distributions of different phenomena regarding soil, water, air, etc. [1][2][3][4]. Currently, the increasing volume of digital data (Internet of Things, high-accuracy digital elevation models (DEM), satellite images) presents a great opportunity for improved prediction results.
Initially, the prediction of spatial phenomena was achieved with the use of spatial prediction methods which mainly fell into the following two categories: deterministic methods, like inverse distance weighting or nearest neighbors, and stochastic ones, like regression models and kriging variations (e.g., ordinary kriging, universal kriging, etc.). Later on, hybrid methods were introduced [5][6][7] that were partially deterministic, partially stochastic, like regression kriging (RK) or kriging with external drift (KED). These methods tried to combine the advantages of both worlds, deterministic and stochastic, achieving improved results. Currently, more innovative implementations of the abovementioned hybrid methods are increasingly used; they introduce machine learning (ML) as the deterministic part, along with kriging of the ML residuals as the stochastic part [8][9][10][11][12]. These methods are widely used in environmental sciences mainly for two reasons: their improved prediction accuracy and the elimination of most of the restrictions (e.g., statistical assumptions) that regression, kriging, and their variations (RK, KED, etc.) demand [9,13].
Usually, scientists, in their environmental studies, assess multiple ML models so as to find the one that maximizes the prediction accuracy for a specific phenomenon [14][15][16][17]. They use specific ML implementations (packages, methods) and try to estimate the best hyperparameters for their models that produce the most accurate results. However, it is not obvious whether the different ML implementations of the same ML model and the different hyperparameter selection methods significantly affect the results.
Soil pH is a crucial soil parameter that affects both the soil properties and the plants themselves. It greatly influences the behavior of chemical elements and, along with organic matter (OM), is the most important parameter determining trace metal partitioning and aqueous speciation in soils [18]. It is considered of special importance in connection with nutrients [19] and micronutrients [20], and it is the key variable that affects plant growth [21]. It controls soil fertility, regulates soil biogeochemical processes, and influences the structure and functioning of terrestrial ecosystems [22]. Regarding its spatial distribution, soil pH seems to be affected by terrain elevation [23] and exhibits spatial cross correlation with other elements, like soil Fe [24]. Based on the above characteristics, soil pH was considered the ideal soil parameter for the current study.
The aim of the present study is threefold. Firstly, we compare the predictive capabilities of the different ML and non-ML models, with and without kriging of their residuals, in soil pH prediction. The ML models that are assessed are random forests (RF), random forests kriging (RFK), gradient boosting (GB), gradient boosting kriging (GBK), neural networks (NN), and neural networks kriging (NNK), along with the non-ML models of multiple linear regression (MLR), ordinary kriging (OK), and regression kriging (RK) as traditional/hybrid geostatistical methods. Secondly, the effects of the different implementations (methods and packages in R) of the same ML models are also evaluated. For the RF model, the ranger and rf methods are used; for GB, xgbTree and xgbDART; and for NN, nnet and avNNet. Finally, the last goal is to investigate whether the selection methods of the hyperparameters (default vs. optimized through random search) affect the results of the different ML models.

Soil Data and Environmental Covariates Collection
The study area (Figure 1) is located in the regional unit of Grevena in northern Greece, extending from latitude 40°01′50.76″ N to 40°… N. A soil survey was conducted over three years (2015, 2017, and 2018), mainly in autumn and early winter of each year. Specifically, 266 disturbed soil samples were obtained from randomly selected locations using a simple random sampling design (Soil Survey Division Staff, 2017) with a soil auger from a depth of 0-30 cm of the surface soil. At each location, two to three representative samples were taken close to each other and pooled together to make a composite sample. Global Positioning System (GPS) receivers were utilized to record the sampling positions. The minimum distance between two sampling points is approximately 50 m, and the average distance between points is 300 m.
In this study, soil covariates and environmental covariates were used (Table 1). Regarding soil covariates, the 266 soil samples that were collected from the Grevena area were analyzed in the laboratory of the Soil and Water Resources Institute for clay (C), silt (Si), sand (S), electrical conductivity (EC), organic matter (OM), nitrogen (N), phosphorus (P), potassium (K), magnesium (Mg), iron (Fe), zinc (Zn), manganese (Mn), copper (Cu), and boron (B). Moreover, pH analysis for the same locations was conducted to calibrate the models and to assess the prediction results.
The environmental covariates were derived from the second version of the ASTER Global Digital Elevation Model (GDEM2) using SAGA-GIS software modules. ASTER consists of 1° × 1° tiles (30 m resolution) in the World Geodetic System 1984 (WGS84), reprojected to the Greek Geodetic Reference System 1987 (GGRS87) for this study (Figure 2).

Software
The statistical analysis was implemented using statistical software R (version 3.5.3) and the caret package [25]. In the current study, different packages and methods were utilized in R through caret: a) the xgbTree and xgbDART methods from the xgboost package for GB, b) the randomForest and ranger packages for RF, and c) the avNNet and nnet methods from the nnet package for NN. The geostatistics in the current paper were implemented using the gstat package. Finally, SAGA-GIS software (http://www.saga-gis.org/en/index.html) was used to produce the different terrain attributes.

Regression Kriging
Regression kriging is a hybrid geostatistical method that combines multiple linear regression between the target soil variable and secondary parameters with geostatistical methods (e.g., ordinary kriging or simple kriging) on the residuals of the regression. Its goal is to optimize the prediction of soil properties in unsampled locations [26] based on the assumption that the deterministic component of the target soil variable is accounted for by the regression model, while the model residuals represent the spatially varying but dependent component [7].
Specifically, in the case of multiple linear regression and ordinary kriging, the prediction Ẑ_RK(s_i) at locations s_i is the sum of the regression prediction Ẑ_R(s_i) at the same locations and the prediction of the kriged regression residuals ε̂_OK(s_i), as seen in the following equation:

Ẑ_RK(s_i) = Ẑ_R(s_i) + ε̂_OK(s_i)    (1)
In this equation there are two distinct parts. The first one, Ẑ_R(s_i), is the deterministic component, and the second one, ε̂_OK(s_i), is the stochastic component. These two distinct parts allow for separate interpretation of the two components and the application of different regression techniques [6].
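The additive decomposition of Equation (1) can be sketched in code. The study used MLR and OK via R's gstat; the sketch below is a hedged Python stand-in on synthetic data, with an assumed exponential covariance model (sill from the residual variance, a 300 m range, no nugget) rather than a fitted variogram.

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1000, (30, 2))   # sampling locations (m), synthetic
X = rng.uniform(0, 1, (30, 2))           # two covariates at those locations
z = 6.5 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 30)  # toy "pH"

# Deterministic part: multiple linear regression Z_R(s)
A = np.column_stack([np.ones(len(z)), X])
beta, *_ = np.linalg.lstsq(A, z, rcond=None)
resid = z - A @ beta                     # regression residuals

# Stochastic part: ordinary kriging of the residuals with an assumed
# exponential covariance C(h) = sill * exp(-h / range), no nugget
sill, range_m = resid.var(), 300.0

def cov(h):
    return sill * np.exp(-h / range_m)

n = len(z)
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
K = np.zeros((n + 1, n + 1))
K[:n, :n] = cov(d)
K[n, :n] = K[:n, n] = 1.0                # unbiasedness constraint row/column

def predict_rk(x_new, s_new):
    """Z_RK(s) = Z_R(s) + kriged regression residual at s (Equation (1))."""
    trend = np.concatenate(([1.0], x_new)) @ beta
    rhs = np.concatenate((cov(np.linalg.norm(coords - s_new, axis=1)), [1.0]))
    w = np.linalg.solve(K, rhs)[:n]      # ordinary kriging weights
    return trend + w @ resid
```

With no nugget, ordinary kriging is an exact interpolator, so the RK prediction at a sampled location reproduces the observed value there.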

Random Forests Kriging
Random forest [27,28] is a machine learning model based on classification and regression trees [29] that provides increased prediction accuracy. In RF, a large collection of de-correlated, noisy, approximately unbiased trees are built and averaged so as to reduce model variance and cope with instability [30]. This is achieved by growing multiple trees combining twofold randomness: random selection of examples in the training dataset and random features used from the same dataset. As in every machine learning model, there are parameters that can be optimized; in the case of the ranger/randomForest packages, there are mainly two: mtry and ntree (Table 2). Table 2. Random forests hyperparameters.

Hyperparameter | Packages | Description
mtry | ranger, randomForest | The number of random features used in each tree.
ntree | ranger, randomForest | The number of grown trees.
Even though there might be more hyperparameters depending on the RF packages that are utilized, these two presented here are the most commonly used. The packages utilized in the current study were ranger [31] and randomForest [32].
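The two hyperparameters above have direct analogues in other RF libraries. As a hedged illustration (the study used R's ranger and randomForest; here scikit-learn's RandomForestRegressor is a stand-in on synthetic data), max_features plays the role of mtry and n_estimators the role of ntree:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 23))          # 23 covariates, as in the study
y = 6.5 + X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.1, 200)  # synthetic "pH"

rf = RandomForestRegressor(
    n_estimators=500,    # analogue of ntree (the packages' default of 500)
    max_features=8,      # analogue of mtry: ~ one-third of the 23 variables
    random_state=0,
).fit(X, y)
preds = rf.predict(X[:5])
```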
Finally, OK was applied to the RF residuals, and the kriged residuals ε̂_OK(s_i) were added to the RF prediction results Ẑ_RF(s_i) at the s_i locations so as to estimate RFK according to Equation (2):

Ẑ_RFK(s_i) = Ẑ_RF(s_i) + ε̂_OK(s_i)    (2)

This resembles Equation (1); however, the deterministic component Ẑ_RF(s_i) uses random forests instead of regression.

Gradient Boosting Kriging
In gradient boosting [33], multiple decision trees are grown sequentially using information from previously grown trees. Even though each tree is small, with few terminal nodes, the ensemble boosts overall performance by addressing problematic observations with large errors. That way, the model is improved in areas where it does not perform well, resulting in a more accurate prediction model. Specifically, in the case of gradient boosting, each tree is fitted to the residuals of the previous model using a gradient descent algorithm that minimizes a loss function associated with the whole ensemble.
There are multiple packages that can perform GB (CatBoost, LightGBM, XGBoost, etc.), with different implementations and different results. In the current study, "eXtreme Gradient Boosting" from the XGBoost [34] package was utilized through caret by implementing two different methods: xgbDART and xgbTree. eXtreme Gradient Boosting is considered an efficient and scalable implementation of the gradient boosting framework in which a more regularized model formalization is used to control overfitting and achieve better performance. Some of its advantages are the following:
• A regularization technique that helps reduce overfitting;
• Support for user-defined objective functions and evaluation metrics;
• An improved, efficient tree-pruning mechanism;
• Multiple technical improvements, like parallel processing, built-in cross-validation, and better handling of missing values.
There are multiple parameters that need to be tuned in this model (Table 3), and they can be classified into a) the general parameters that control the booster type, b) the booster parameters that control the performance of the selected booster, and c) the learning task parameters that specify the learning task and the corresponding learning objective. The introduction of kriging as the stochastic part (GBK) is calculated based on the following equation:

Ẑ_GBK(s_i) = Ẑ_GB(s_i) + ε̂_OK(s_i)    (3)

The OK method is performed on the residuals of the GB, and the kriged residuals ε̂_OK(s_i) are added to the GB predictions Ẑ_GB(s_i).
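The "each tree fitted to the residuals of the previous model" mechanism can be written out directly. The minimal loop below is an illustrative sketch (not the XGBoost implementation the study used) for squared-error loss, where the residuals are the negative gradient; scikit-learn's DecisionTreeRegressor supplies the small trees, and the data are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, 300)

learning_rate, trees = 0.1, []
pred = np.full_like(y, y.mean())         # start from the mean prediction
for _ in range(100):
    resid = y - pred                     # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, resid)  # small tree
    trees.append(tree)
    pred += learning_rate * tree.predict(X)   # boosted update

def boost_predict(X_new):
    """Sum the shrunken contributions of all sequentially grown trees."""
    out = np.full(len(X_new), y.mean())
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out
```

Each iteration nudges the ensemble toward the observations it currently predicts worst, which is exactly the behavior described above.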

Neural Networks Kriging
Neural networks (or artificial neural networks) are powerful tools, modeled loosely after the human brain, that use a machine learning approach to quantify and model complex behavior and patterns. They perform tasks by considering examples, generally without being programmed with task-specific rules. A NN consists of a set of interconnected units called neurons, which estimate the nonlinear correlations between each variable. The input neurons, which represent predictor variables, are connected to a single or multiple layer(s) of hidden neurons, which are then linked to the output neurons that represent the target soil variable [35].
For the NN in the current study, the nnet package was used with two different methods: nnet and avNNet. The nnet package [36,37] is software for feed-forward neural networks with a single hidden layer, and for multinomial log-linear models. In a feed-forward NN, the information moves in only one direction, forward: from the input nodes, through the hidden nodes, to the output nodes. The nnet method implements exactly this model to predict the results.
The avNNet method of the nnet package aggregates several feed-forward neural network models fitted with different random-number seeds. In the case of regression, like that used in the current study, all the resulting models were used for prediction by averaging the results coming from each network.
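The avNNet idea can be sketched as follows. This is a hedged Python analogue, not the R implementation: scikit-learn's MLPRegressor (a single-hidden-layer feed-forward network) stands in for nnet, several fits with different seeds are averaged, and the data are synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X = rng.uniform(size=(150, 4))
y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(0, 0.05, 150)

# Fit the same single-hidden-layer network several times, varying only
# the random seed that controls the initial weights
nets = [
    MLPRegressor(hidden_layer_sizes=(5,), max_iter=3000,
                 random_state=seed).fit(X, y)
    for seed in range(5)
]

# For regression, the aggregated prediction is the average over networks
avg_pred = np.mean([net.predict(X) for net in nets], axis=0)
```

Averaging over seeds reduces the variance that comes from random weight initialization, which is the motivation for avNNet.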
The exact hyperparameters of the NN that were tuned in the current study, along with their descriptions, are presented in Table 4. Most of them were used in both methods, apart from bag, which was used only in avNNet. Finally, neural networks kriging (NNK) is based on the following equation:

Ẑ_NNK(s_i) = Ẑ_NN(s_i) + ε̂_OK(s_i)    (4)

OK is applied to the NN residuals, and the kriged residuals ε̂_OK(s_i) are added to the NN predictions Ẑ_NN(s_i).

Optimization of the Hyperparameters
Every ML model needs to have its hyperparameters defined before training. This can be achieved by using the following:
• Default values based on the literature, library authors' recommendations, previous experience, etc.;
• Optimization techniques, through a process called tuning.
Tuning can be implemented in multiple ways, like grid search, random search, Sobol sequence, manual search, and others. Grid search and random search are the most commonly applied. In grid search, the model is evaluated for every combination from a preset list of hyperparameter values. In the current study, however, random search was performed, in which random combinations of parameters are drawn from a range of values. Random search was preferred here due to the improved results that it provides according to the literature [38]. The model with the set of parameters which gave the highest accuracy was considered to be the best and was used for prediction.
Random search was implemented through the caret package. It is important to state here that in general, the hyperparameters that can be optimized through caret are usually fewer than the actual parameters that the package can support by itself. However, they are usually the most important ones that significantly affect the results of each model. Along with random search, the same ML models were trained with default values, and the results were compared.
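The tuning setup can be sketched as follows. This is a hedged Python analogue of the caret workflow, not the study's R code: scikit-learn's RandomizedSearchCV draws random hyperparameter combinations from ranges (here for a gradient boosting regressor), scored by 10-fold cross-validation on synthetic data.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(3)
X = rng.uniform(size=(120, 6))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 120)

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={                 # ranges to sample from, not a grid
        "n_estimators": randint(50, 300),
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(1, 5),
    },
    n_iter=10,                            # 10 random combinations
    cv=10,                                # 10-fold cross-validation
    scoring="neg_root_mean_squared_error",
    random_state=0,
).fit(X, y)
best_params = search.best_params_         # combination with best CV score
```

The combination with the best cross-validated score is then refitted on the full training set and used for prediction, mirroring the procedure described above.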

Error Assessment
The 266 samples of the current study were randomly split into two datasets: the training dataset (80% of the data) that was used for estimating the models, and the testing dataset (20% of the data) that was used for assessing the different models. The optimization of the models' parameters was implemented using 10-fold cross-validation techniques in the training dataset.
Different metrics (Table 5) were employed to estimate model performance based on the difference between the observations and the predictions on the testing data set. The root-mean-square error (RMSE) and the mean absolute error (MAE) were estimated based on the measured value Z(s i ) and its predictionẐ(s i ) for the locations s i of the samples. Lower values of RMSE and MAE are associated with better predictive results. Also, the coefficient of determination (R 2 ), which represents the amount of variation explained by the model, was estimated. The terms SSE and SSTO represent the error sum of squares and the total sum of squares, respectively. The coefficient of determination ranges from 0 to 1, where for 0 (zero), no variation is explained by the model, and for 1 (one), all variation is explained by the model. Table 5. Measurements to assess model performance.
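The three metrics from Table 5 can be written out explicitly; the short observation/prediction vectors below are purely illustrative, not values from the study:

```python
import numpy as np

obs = np.array([6.2, 7.1, 5.8, 6.9, 7.4])    # measured pH Z(s_i), illustrative
pred = np.array([6.0, 7.0, 6.1, 6.8, 7.5])   # predicted pH, illustrative

rmse = np.sqrt(np.mean((obs - pred) ** 2))   # root-mean-square error
mae = np.mean(np.abs(obs - pred))            # mean absolute error
sse = np.sum((obs - pred) ** 2)              # error sum of squares (SSE)
ssto = np.sum((obs - obs.mean()) ** 2)       # total sum of squares (SSTO)
r2 = 1 - sse / ssto                          # coefficient of determination
```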

Exploratory Data Analysis
The soil co-variables that were measured in the laboratory were clay (C), silt (Si), sand (S), electrical conductivity (EC), organic matter (OM), nitrogen (N), phosphorus (P), potassium (K), magnesium (Mg), iron (Fe), zinc (Zn), manganese (Mn), copper (Cu), boron (B), and pH from the 266 locations in the Grevena area (Table 6). Some of them, those that significantly deviated from normality and strongly affected the residuals of MLR, were log transformed (EC, N, Fe, Mn, Cu, B, and Zn). The rest remained untransformed because they did not significantly affect the normality assumption of the MLR residuals. Based on the Pearson correlation analysis for pH (Figure 3), there was a statistically significant, strong correlation with logFe (p ≤ 0.01, r = −0.8) and logMn (p ≤ 0.01, r = −0.7) and a moderate correlation with Mg and Si. In general, the soil covariates exhibited stronger correlation with the pH than did the environmental ones.
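This exploratory step (log-transforming a skewed covariate, then checking its Pearson correlation with pH) can be sketched as below. The data are synthetic, generated so that a right-skewed "Fe" covariate depends on pH; the correlation values are not the study's:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
pH = rng.normal(6.8, 0.5, 266)                                  # 266 samples
fe = np.exp(4.0 - 0.9 * (pH - 6.8) + rng.normal(0, 0.2, 266))   # skewed "Fe"

r_raw, p_raw = pearsonr(pH, fe)          # correlation with the raw covariate
r_log, p_log = pearsonr(pH, np.log(fe))  # correlation with logFe
```

On skewed, multiplicatively generated data like this, the log transform linearizes the relationship, so the correlation with logFe is stronger (and here negative, as the study reports for its Fe data).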

Modelling and Parameter Estimation
Regression kriging was conducted on the training data set by combining multiple linear regression between the target soil variable and the secondary parameters with ordinary kriging on the residuals of the regression (Equation (1)).
A stepwise procedure was utilized to select the best regression predictors, which in our study were C, OM, P, K, Mg, Devmean, Altitude, Aspect, logEC, logN, logFe, logMn, and logZn; the final regression equation used these predictors. The residuals of the regression presented a distribution close to normal, as seen in the residual frequency distribution diagram (Figure 4A) and the normal Q-Q plot (Figure 4B, bottom), with approximately constant standard deviation of the errors (homoscedasticity) (Figure 4B, top). Also, the Shapiro-Wilk statistic (W) for the residuals was computed (W = 0.98588, p-value = 0.05638). Based on the results, at the significance level of 0.05, we cannot reject the hypothesis that the residuals come from a normally distributed population.

For random forests, a 10-fold cross-validation method on the training data set was applied to pick the optimal hyperparameter values via the random search tuning process (Table 7). The default mtry value is close to one-third of the total number of variables used in RF [30], which in this case was 8.
The number of grown trees (num.trees in ranger and ntree in randomForest) could not be tuned through caret, so the packages' default value (500) was used for both. Hyphens (empty cells) in Table 7, Table 8, and Table 9 signify the lack of a hyperparameter for the specific method. For the estimation of RFK for each package, the residuals of RF were used for OK interpolation, and the results were added to the RF predictions according to Equation (2).

Gradient boosting has a plethora of hyperparameters that need to be defined (Table 8). For the default values, the libraries' default parameters were chosen. In the case of optimized values, the random search tuning process was used with 10-fold cross-validation of the training data. At the end, the residuals of the GB were used for OK and the results were added (Equation (3)) in order to estimate the GBK for each method.
Finally, regarding NN, the hyperparameters that were used are presented in Table 9. The default values were the ones that the library proposed. For the optimized hyperparameters, "size" and "decay" were tuned through a random search tuning process with 10-fold cross-validation.
Similar to the previous ML models, the residuals of NN were used for OK, and the results were added (Equation (4)) in order to estimate the NNK.

Performance Assessment
The models in the current study were assessed based on the difference between the pH observations and their predictions. According to the prediction results, the machine learning algorithms (RF, GB, NN) exhibited better accuracy in every metric compared to MLR, RK, or OK (Table 10). For example, RK (the best of the non-ML models) presented a higher RMSE than almost all the ML algorithms, apart from NNnnD (NN with its default values). In more detail, RK's RMSE (0.336) was 14% higher than the mean of all the ML RMSE values (0.289), and RK's R 2 (0.626) was 15% lower than the mean R 2 of all ML models (0.723). Compared with RK, the best implementation of each ML model showed a 23% lower RMSE and 25% higher R 2 in the case of RFrgO (RMSE of 0.259, R 2 of 0.781), a 22% lower RMSE and 24% higher R 2 in the case of GBKxgbT (RMSE of 0.262, R 2 of 0.778), and a 20% lower RMSE and 21% higher R 2 in the case of NNnnO (RMSE of 0.268, R 2 of 0.760). The results were even worse for OK. This was expected, though, since OK does not incorporate the information from multiple covariates that the other methods do. Among the ML models, RF exhibited high prediction accuracy, with the RFrfO model being the best one, with a small RMSE (0.259), high R 2 (0.784), and low MAE (0.180). GB was very close, with GBKxgbT (RMSE: 0.262, R 2 : 0.778, MAE: 0.177) presenting the best results among the GB models. NN performed slightly worse, with NNnnO being the best NN model, scoring 0.268 in RMSE and 0.760 in R 2 ; this was not far behind the other ML models, as can also be seen in Figure 5. Moreover, there were only minor differences in the models' results between the default and optimized hyperparameters.
Kriging interpolation of the residuals did not significantly affect the results: it slightly improved GB, RK, and the avNNet implementation of NN, whereas for RF the results slightly worsened. In more detail, in the case of GB, kriging of the residuals, as seen in GBKxgbT, improved the RMSE by approximately 6% and R 2 by 3.8% compared to GBxgbTO, while GBKxgbD's RMSE was improved by 4% and R 2 by 2.4% compared to GBxgbDO. Regarding RF, kriging of the residuals led to a minor increase in RFrgO's RMSE by 1.7% and a decrease in R 2 by 0.2%, as seen in RFKrg; similarly, there was a slight increase in RFrfO's RMSE by 1.7% and a decrease in R 2 by 0.4% (RFKrf). For NN, the introduction of OK to the residuals decreased NNavO's RMSE by 8% and increased its R 2 by 7.5%, as seen in NNKav, while for the nnet method (NNnnO) it led to a small increase in RMSE (1.3%) and a decrease in R 2 (0.7%), as seen in NNKnn. For RK, the introduction of OK to the regression residuals also brought an improvement of 2.5% in RMSE and 3% in R 2 .
The use of different implementations (packages/methods) of the same ML models did not lead to major differences in the prediction results for RF and GB. In the case of RF, the results of RFrgO and RFrfO were similar, while in the case of GB (GBxgbTO and GBxgbDO), there were minor differences, with GBxgbTO being marginally better (RMSE of 0.279, R 2 of 0.750, MAE of 0.190). In NN, the differences between the two methods (NNnnO and NNavO) were slightly larger.
It is worth mentioning that there were differences in the variability of the results between the different ML models. Specifically, RF results had the smallest variability (RMSE SD = 0.002, R 2 SD = 0.002) compared to GB (RMSE SD = 0.32, R 2 SD = 0.017) and NN (RMSE SD = 0.189, R 2 SD = 0.078), regardless of the different packages that were utilized, the selection method of the hyperparameters, or the inclusion (or not) of OK on the residuals.
Regarding R 2 in general, the values were high (Figure 5), over 0.6, especially for the ML models, apart from NNnnD. Therefore, the models explained most of the total variation of the pH, and the covariates used seem to strongly affect the soil pH. The MAE consistently had very low values in the RF models, ranging from 0.178 up to 0.184. The GB results were close, but with higher variability (0.177 up to 0.232), and NN scored worse, with higher values ranging from 0.182 up to 0.312.
The covariate importance assessments (Figure 6) revealed that iron (Fe), manganese (Mn), and magnesium (Mg) were the most influential variables for both RF and GB. For iron (Fe) specifically, a high influence on soil pH was expected according to the relevant literature [18,24]. Among the environmental variables, altitude was the one with the highest score for both ML models.

Discussion and Conclusions
From the results of the current study, it is obvious that the ML models outperformed the other methods (MLR and the kriging-based RK and OK) in predicting soil pH. Additionally, the ML models were easy to use, without the statistical assumptions and requisites that linear regression and kriging interpolation require. However, ML models demand significant computing power, along with knowledge of and attention to the tuning process of the models. Overall, based on the results, the authors of the current study tend to agree with the claim that ". . . kriging as a spatial prediction technique might be redundant, but solid knowledge of geostatistics and statistics in general is important more than ever" [13].
The assessment of the prediction results of the different ML models shows that GB and RF have the best performance, with NN being slightly less accurate. NN's weaker results could be due to the fact that NNs demand large amounts of data in order to produce good predictions, e.g., 10 to 100 times the number of features or 10 times the number of inputs [39,40]. The 266 different locations and the 23 variables might not be enough for NN to show its predictive capabilities. Also, different implementations of the same ML model did not lead to major differences for RF and GB, whereas in NN, the differences were slightly bigger. Different implementations of the same ML model mainly introduce minor variations that may produce better results, but only in some cases.
Optimization of the hyperparameters improved the prediction results in the GB and NN models due to the increased number of parameters and the significant difference between the default values and the optimized ones. It is obvious that careful and thorough fine-tuning is a crucial step for achieving the best results. In the case of RF, the hyperparameters were few and the default values were close to the optimized ones, leading to a minor difference in the prediction results.
The introduction of kriging interpolation of the ML models' residuals did not really improve the results. This is in accordance with the findings of Hengl et al. [41], stating that ". . . as a rule of thumb where once a machine learning model explains over 60% of variation in data, chances are that kriging is not worth the computational effort".
The differences in the variability of the ML prediction results were expected based on the above-mentioned findings. In general, RF exhibited more consistent results with smaller variability, regardless of different packages or methods, different hyperparameter selection methods, or even the application of kriging to the ML model residuals. This could be considered an advantage in cases where consistency is a requirement. The increased variability of the prediction results of the other ML models, however, suggests that they have the potential to further improve their prediction accuracy, for example, through a different tuning process or the use of different packages and implementations.