A Comparative Assessment of Geostatistical, Machine Learning, and Hybrid Approaches for Mapping Topsoil Organic Carbon Content

Accurate digital soil mapping (DSM) of soil organic carbon (SOC) is still a challenging subject because of its spatial variability and dependency. This study is aimed at comparing six typical methods in three types of DSM techniques for SOC mapping in an area surrounding Changchun in Northeast China. The methods include ordinary kriging (OK) and geographically weighted regression (GWR) from geostatistics, support vector machines for regression (SVR) and artificial neural networks (ANN) from machine learning, and geographically weighted regression kriging (GWRK) and artificial neural networks kriging (ANNK) from hybrid approaches. The hybrid approaches, in particular, integrated the GWR from geostatistics and ANN from machine learning with the estimation of residuals by ordinary kriging, respectively. Environmental variables, including soil properties, climatic, topographic, and remote sensing data, were used for modeling. The mapping results of SOC content from different models were validated by independent testing data based on values of the mean error, root mean squared error and coefficient of determination. The prediction maps depicted spatial variation and patterns of SOC content of the study area. The results showed the accuracy ranking of the compared methods in decreasing order was ANNK, SVR, ANN, GWRK, OK, and GWR. Two-step hybrid approaches performed better than the corresponding individual models, and non-linear models performed better than the linear models. When considering the uncertainty and efficiency, ML and two-step approach are more suitable than geostatistics in regional landscapes with the high heterogeneity. The study concludes that ANNK is a promising approach for mapping SOC content at a local scale.


Introduction
Soil organic carbon (SOC) is the organic carbon component of soil, consisting of living soil biota and dead biotic material derived from biomass.SOC tends to be concentrated in the topsoil with the primary source from vegetation [1].Globally, the soil biotic C pool, second only to the oceanic, is of importance for carbon sequestration [2].Small changes in the SOC could result in significant impacts on the atmospheric carbon concentration [3].This makes SOC an important ecological indicator of the greenhouse effect, as well as a major global change driver, due to its high sensitivity to human disturbance [4,5].Soils thus should be managed to sequester more organic carbon and discharge fewer greenhouse gases by maintaining sustainable land use [6,7].As a complex mixture, SOC is also involved in soil quality and ecosystem services like food production.It plays a significant role in supplying nutrients and the formation of improving soil structure [8].Accurate SOC content mapping is critically important for monitoring the baseline of the carbon pool, as well as its role in climate change and food security [9,10].
Geographic information systems (GIS) and remote sensing (RS) have been practiced acquiring spatial distribution and pattern of SOC content at different scales.Digital soil mapping (DSM) is to develop a geographically referenced database by integration of field observations with environmental variables through quantitative relationships [11].Nonetheless, SOC content shows strong spatial heterogeneity and dependence because of influences of natural factors and human disturbances vary by locations and scales [12,13].Climate and relief affect surface runoff and transport of soil along the surface, modifying the spatial distribution of SOC [14,15].Land use, fertilizer application and agricultural production also play important roles in influencing the SOC dynamics [16,17].Conventionally, DSM is conducted using environmental variables, including soil properties, climatic, relief, and spectral indices of remote sensing data.Due to the above-mentioned complexity and uncertainty, mapping SOC is still a challenging subject [18].
Generally, the main types of DSM techniques include traditional statistical, geostatistical, machine learning (ML) and hybrid approaches.Traditional statistical methods are used to determine the correlation between environmental variables and SOC content [19].Commonly used non-spatial models include multiple linear regression [20], partial least square regression [21], generalized linear models [22] and linear mixed models [23].Although easily applicable, their prerequisites of independent and identical distribution with large sample demands for SOC content observations are among challenges.Those methods are also known as lack of spatial information, making them less stable and unsuitable for delineating local changes [24].
In geostatistical techniques, such as ordinary kriging (OK) and geographically weighted regression (GWR), spatial autocorrelation or mutual correlation is considered [25,26].OK, as a kriging-based on the assumption of unknown mean, is one of the commonly used methods for spatial interpolation [27][28][29].However, this method overlooks the influence of environmental variables [30], and is not able to achieve the desired accuracy, due to the stationarity assumption [31].GWR is based on the local smooth idea and relationships among different environmental variables within a local space.It is capable of embedding the spatial location of samples into a regression through locally weighted least square method [25,32].Scull (2010) found that GWR performed better than multiple linear regression for prediction of SOC with precipitation or temperature as an environmental variable [33].Using simulated data sets, Harris et al. (2010) compared spatial prediction performances of GWR and OK methods [34].The comparison concluded that GWR provided extra information on the spatial processes, but the spatial dependence of residuals still existed.Kriging methods are typically used to interpolate geographical characteristics with significant spatial autocorrelation, such as climatic factors, natural soil properties, and geological elements [12].Whereas GWR is proposed to predict highly random factors with a significant degree of spatial heterogeneity, such as socio-economic indices.
ML methods can accommodate non-linearity and multicollinearity, and they can overcome overfitting with limited soil observations and auxiliary environmental information [35,36].They are typified by support vector machines and artificial neural networks (ANN), which are widely applied in DSM [37][38][39].Support vector machines can construct an optimal hyperplane by projecting the data onto a new hyperspace by the means of kernel functions.The hyerplane separates classes and creates the widest margin in the classification.It also can fit data (support vector machines for regression, SVR) with a modeling function that minimizes empirical risk and complexity when representing non-linear patterns [40].ANN simulate the human learning processes.The linkages between the input and output data in ANN are established and reinforced by non-linear and interconnected nodes [41].Li et al. (2017) used SVR and ANN for the prediction of L. chinensis carbon, nitrogen and phosphorus contents [42].Both SVR and ANN demonstrated high prediction accuracy.Xu et al. (2018) reported that SVR outperformed ANN for mapping soil organic matter with a visible and near infrared spectral dataset [43].The ML methods do not need explicit assumptions about data distributions, and they allows for modelling of complex relationships [44].However, their major shortcoming is that SOC content at a particular grid is estimated only from environmental variables of that node, without considering its spatial autocorrelation at the node with surrounding data [45].
Recently, hybrid approaches have drawn attentions.Two hybrid methods are geographically weighted regression kriging (GWRK) and artificial neural networks kriging (ANNK).GWRK and ANNK can incorporate the spatial autocorrelation of measured variables to achieve better predictions and lower errors [46,47].Both GWRK and ANNK can account for the spatial parametric non-stationarity and the relationship between SOC and other environmental variables.Additionally, residuals can be interpolated through kriging and they are added to the estimated trend as a spatially stochastic variable.Guo et al. (2017) reported that GWRK outperformed partial least squares regression kriging for predicting soil organic matter with visible and near-infrared spectra [48].Mirzaee et al. (2016) evaluated the ability for geostatistical methods and ANNK to predict soil organic matter [47].Despite a variety of DSM methods have been used in mapping SOC [49], there is a lack of systematic comparison among geostatistical, machine learning and hybrid approaches for mapping SOC.
With the above, the aim of this study is to perform such a methodological comparison based on soil samples and environmental variables derived from remote sensing data.In particular, this study compared methods of two geostatistical (OK and GWR), two machine learning (SVR and ANN), and two hybrid approaches (GWRK and ANNK) for their performance in SOC mapping.The comparison of intra and extra classes of three DSM techniques and their suitable situations for the application were also discussed.

Study Area
This study was conducted in the hinterland of the Northeast Plain of northeastern China (43 1).The total area was about 3592 km 2 .The study area, including Changchun the capital city of Jilin Province, has a total urban population of approximately 5.20 million in 2016.As a main grain-production area in China, the fertile lands are prevalent and are vital for the country's food security.This area has a temperate continental monsoon climate.The mean annual temperature is 4.8 • C, with mean temperatures of −15.1 • C in January and a mean temperature of 23.1 • C in July, respectively.The annual precipitation ranges from 522 to 615 mm, and more than 60% of which occurs from June to August.The dominant soil types are black soil, dark-brown soil and chernozem.

Soil Observations
The field campaign was conducted from September to October 2016.The distribution of sampling points was generated with water bodies and impervious surfaces being masked out.The 30 × 30 m plot was applied at each sampling site.The location, and land use and land cover classification of the sampling sites were recorded.Among the 395 sampling sites, 156 were located in human-managed farmland, 232 in forest and 7 in grassland (Figure 1).The sampling sites were randomly divided into training (n = 263) for modeling and validation sets (n = 132) (Figure 1) to evaluate performances of models in comparison [50][51][52].At each sampling site, SOC content and bulk density (BD, g•cm −3 ) were collected in topsoil throughout a depth range of 0-20 cm in a radial sampling scheme using an auger [53].A core ring sampler (5 cm in diameter, 5 cm in height) was used to collect samples at the center of each plot for BD determination [54].The soil samples for measuring SOC content (g•kg −1 ) were air-dried, grounded, and passed through a 2 mm mesh with the Walkley-Black wet oxidation method [55].

Environmental Variables
In DSM, the variability of SOC content is essentially explained by its relationships with soilforming factors, including soil properties, climate, organisms, relief, parent material, time and space.Based on data accessibility, 27 environmental variables were retrieved from soil properties, climate and remote sensing datasets (Table 1).

Soil Observations
The field campaign was conducted from September to October 2016.The distribution of sampling points was generated with water bodies and impervious surfaces being masked out.The 30 × 30 m plot was applied at each sampling site.The location, and land use and land cover classification of the sampling sites were recorded.Among the 395 sampling sites, 156 were located in human-managed farmland, 232 in forest and 7 in grassland (Figure 1).The sampling sites were randomly divided into training (n = 263) for modeling and validation sets (n = 132) (Figure 1) to evaluate performances of models in comparison [50][51][52].At each sampling site, SOC content and bulk density (BD, g•cm −3 ) were collected in topsoil throughout a depth range of 0-20 cm in a radial sampling scheme using an auger [53].A core ring sampler (5 cm in diameter, 5 cm in height) was used to collect samples at the center of each plot for BD determination [54].The soil samples for measuring SOC content (g•kg −1 ) were air-dried, grounded, and passed through a 2 mm mesh with the Walkley-Black wet oxidation method [55].

Environmental Variables
In DSM, the variability of SOC content is essentially explained by its relationships with soil-forming factors, including soil properties, climate, organisms, relief, parent material, time and space.Based on data accessibility, 27 environmental variables were retrieved from soil properties, climate and remote sensing datasets (Table 1).BD represented a parameter of soil properties.Climate data of monthly average values of September and October 2016, i.e., mean relative humidity (W, %), average temperature (T, • C), and precipitation (P, mm) were recorded at 60 weather stations of Jilin Province (http://data.cma.cn/).The inverse distance weighted method, a useful and common approach to acquiring spatially continuous soil and climate factors [56][57][58], with the power value of two in ArcGIS was applied to data of BD, W, T and P.
The above environmental predictors were transformed to the Albers equal-area conic WGS84 coordinate system.The BD and climatic grids were resampled to 30 m resolution using the nearest neighbor method, to match with the raster grids of other variables.Eventually, the attribute values of all grids were extracted for all the sampling points, and they were used as inputs for spatial modeling.

Statistical Analysis
The descriptive statistical values of BD and SOC were estimated for the soil samples.The pairwise Pearson's product-moment correlation analysis was conducted to determine the collinearity between explanatory variables.Predictor variables that were highly correlated to each other (r ≥ 0.8), and had high variance inflation factors (VIFs ≥ 10) in regression analysis were excluded [63,64].The analysis steps were performed using SPSS (version 21.0, IBM, Armonk, NY, USA).All of the explanatory variables were transformed to Z-score to eliminate the effects of the index dimension and the quantity of data [65].The procedures were illustrated in Figure 2.

Statistical Analysis
The descriptive statistical values of BD and SOC were estimated for the soil samples.The pairwise Pearson's product-moment correlation analysis was conducted to determine the collinearity between explanatory variables.Predictor variables that were highly correlated to each other (r ≥ 0.8), and had high variance inflation factors (VIFs ≥ 10) in regression analysis were excluded [63,64].The analysis steps were performed using SPSS (version 21.0, IBM, Armonk, NY, USA).All of the explanatory variables were transformed to Z-score to eliminate the effects of the index dimension and the quantity of data [65].The procedures were illustrated in Figure 2.

Geostatistical Models
OK is a widely used geostatistical technique that generates an optimal unbiased estimated surface by means of a semivariogram based on regionalized variables [66].The semi-variance is defined using Equation (1): where γ(h) is the semi-variance of data values for points of h distance apart, N(h) is the number of point pairs with h distance apart, Z i is the data value at point i, (i,j)|d ij = h represent a pair of points (i.e., i,j) which are separated by h distance.
The parameters of OK were listed in Table 2.The semivariogram can be modeled in GS+ (version 9.0, Leland Stanford Junior University, Stanford, CA, USA) by three typical functions: Spherical, Exponential and Gaussian.All of these functions are defined by three parameters: Nugget, range, and sill.Nugget represents the spatial variance of measurement errors at the infinite small distance.Range delineates the effective distance of the spatial autocorrelation.Sill is the maximum value of the semivariogram when the spatial distance between two locations reaches the value of the range [17].The partial sill is defined as sill-nugget, and a stronger spatial autocorrelation is denoted by higher values of partial sill/sill.Meanwhile, the spatial variation is characterized by the basal effect defined as nugget/sill.In other words, a larger value of nugget/sill shows that the spatial variation among samples is more strongly caused by stochastic factors [62].Based on the semivariogram modeling, an error variance model is defined as the objective function for which a set of weights (w i ) are selected to minimize the error [66].After that, an interpolation via OK in ArcGIS follows.The interpolation is expressed by Equation ( 2): where Ẑ0 is the estimated SOC content at point i, z i is the measured SOC content data, w i is the weight associated with the measured SOC, which is estimated by the stationary OK system, and n is the number of measured values within a neighborhood.GWR is another commonly applied geostatistical approach for DSM, and it is based on the local smoothing idea.It takes the spatial locations of samples into consideration, and uses the locally weighted least square method to model the observations of soil parameters [32].It can be written as Equation (3): where (u i ,v i ) are the coordinates of point i, is the coefficient of different explanatory variables, x* ik is the value of explanatory variable k at point i, p is the total number of explanatory variables, ε i is the error term that is generally assumed to be explanatory and normally distributed with zero mean and constant variance, and the values of the above parameters vary with the location.The parameters can be estimated using a weighting function as in Equation ( 4): where X is the matrix formed by x* ik , Y is the vector formed by values of SOC content, W(u i ,v i ) is a weight matrix to ensure that those observations near the point i have more influence on the results than those that are farther away.The parameters of the GWR were listed in Table 2. GWR was conducted using GWR (version 4.0, Ritsumeikan University, Kyoto, Japan) by which the weight function (a geographic kernel type) and the minimum value of the corrected Akaike Information Criterion (AICc, small sample bias corrected AIC) are determined to find the optimal bandwidth [67].

Machine Learning Methods
SVR, a non-linear machine learning method, derives a model hyperplane to describe the empirical data as correctly as possible and to minimize the distances from the hyperplane to the training data [68].The parameters of SVR were listed in Table 2.In this study, SVR was conducted in MATLAB software (version 2017a, MathWorks, Natick, MA, USA).The SMO (sequential minimal optimization) algorithm was used to solve the quadratic programming optimization problem step-by-step.It updated the SVR function, as shown in Equation ( 5), to reflect the new values until the Lagrange multipliers converged [69].
where x is a vector of the input predictors (environmental variables), f (x) is an optimal function developed by SVR, b is a constant threshold, K(x k , x j ) is the Gaussian radial basis kernel function with the best bandwidth parameter σ, and α k and α * k are the weights (Lagrange multipliers) with the constraints given in Equation (6): where C is the regularization parameter for balancing between the training error and model complexity.ANN attempt to mimic how the human brain processes and stores information.Among numerous proposed ANN algorithms, the multi-layer perceptron neural network (MLPNN) with back-propagation algorithm was used in this study.The MLPNN was constructed in MATLAB.The architecture of the MLPNN consists of an input layer containing the predictor variables, one or more hidden layers, and an output layer containing the response variable, along with interconnection weights characterizing the connection strength between these layers (Figure 3).By developing the network with training samples, the interconnection weights of the network are adjusted to minimize the prediction error.Meta-parameters to be specified are the number of hidden nodes, the learning algorithm, learning rate, as well as the number of training iterations in those processes [70,71], as Table 2 list.

Two-Step Hybrid Approaches
GWRK and ANNK are the extension of GWR and ANN, respectively.They fully consider spatial parametric non-stationarity and the effects of environmental variables derived from the benefits of GWR and ANN.They also add spatial dependence of the residuals interpolated though OK to the estimated trend, as part of the spatial autocorrelation.Their implementation includes two steps (Figure 3).First, the GWR and ANN models are built based on Equation (3) and 2.2.3 to model the relationship between SOC content and environmental variables.Second, the residuals resulting from GWR and ANN are estimated using the OK approach (Equations ( 1) and ( 2

 
where i y ˆ is the estimated value of each model, i y is the measured value, and n is 132 in this study.The mean error and R 2 should be close to zero and one, respectively, while RMSE should be as small as possible.

Two-Step Hybrid Approaches
GWRK and ANNK are the extension of GWR and ANN, respectively.They fully consider spatial parametric non-stationarity and the effects of environmental variables derived from the benefits of GWR and ANN.They also add spatial dependence of the residuals interpolated though OK to the estimated trend, as part of the spatial autocorrelation.Their implementation includes two steps (Figure 3).First, the GWR and ANN models are built based on Equation (3) and 2.2.3 to model the relationship between SOC content and environmental variables.Second, the residuals resulting from GWR and ANN are estimated using the OK approach (Equations ( 1) and ( 2)).
The estimates obtained by the two-step hybrid approaches GWRK and ANNK are the sum of the estimates ∧ y GWR/ANN (u i , v i ) and OK estimates of the residuals ∧ ε OK (u i , v i ), as shown in Equation ( 7):
where ŷi is the estimated value of each model, y i is the measured value, and n is 132 in this study.
The mean error and R 2 should be close to zero and one, respectively, while RMSE should be as small as possible.

Statistics Analysis
The descriptive statistics values of SOC and BD were presented in Table 3. SOC content and BD ranged from 2.54 to 68.94 g•kg −1 and from 0.81 to 1.89 g•cm −3 , respectively.The coefficient of variation (CV) indicated a strong variability of SOC content (CV > 0.35) and a moderate variability of BD (0.15 < CV < 0.35), which can be attributed to random factors like environmental factors and measurement errors [53].

Models Training of GWR, SVR and ANN
The weight function for GWR was an adaptive spatial kernel type to avoid border effects [32,73].The optimal bandwidth was 0.54, with the minimum value of AICc as 1886.96,and the adjusted R 2 was 0.31.For a given environmental variable, its coefficient from GWR varied across the study area.The absolute mean values of coefficients for BD (3.49), b5 (1.99), α (1.50), b2 (1.35), and SOA (1.01) showed the stronger capacity of explaining the relationship with SOC than other variables, while b3 (0.16) and H (0.20) had a relative lack of explanatory ability.The CV values showed that coefficients of T and b3 had a strong variability, and coefficients of W, C, and TWI had moderate variabilities.
In the SVR model, the best parameters C and σ obtained using the training data were 1000 and 0.0001, respectively, with the minimum RMSE being 8.61.As for the MLPNN model, the accuracy for various numbers of nodes in the hidden layer is shown in Figure 4.The results revealed that the optimized MLPNN architecture with an RMSE value of 7.30 was in 21-3-1 structure, i.e., indicating 21 input nodes in the input layer, three in the hidden layer with the unipolar sigmoid as the transfer function, and one in the output layer.Using the Levenberg-Marquardt learning algorithm, the best learning rate and training time (iterations) obtained were determined to be 0.01 and 13, respectively.ISPRS Int.J. Geo-Inf.2019, 8, 174 10 of 18

Statistics Analysis
The descriptive statistics values of SOC and BD were presented in Table 3. SOC content and BD ranged from 2.54 to 68.94 g•kg −1 and from 0.81 to 1.89 g•cm −3 , respectively.The coefficient of variation (CV) indicated a strong variability of SOC content (CV > 0.35) and a moderate variability of BD (0.15 < CV < 0.35), which can be attributed to random factors like environmental factors and measurement errors [53].

Models Training of GWR, SVR and ANN
The weight function for GWR was an adaptive spatial kernel type to avoid border effects [32,73].The optimal bandwidth was 0.54, with the minimum value of AICc as 1886.96,and the adjusted R 2 was 0.31.For a given environmental variable, its coefficient from GWR varied across the study area.The absolute mean values of coefficients for BD (3.49), b5 (1.99), α (1.50), b2 (1.35), and SOA (1.01) showed the stronger capacity of explaining the relationship with SOC than other variables, while b3 (0.16) and H (0.20) had a relative lack of explanatory ability.The CV values showed that coefficients of T and b3 had a strong variability, and coefficients of W, C, and TWI had moderate variabilities.
In the SVR model, the best parameters C and σ obtained using the training data were 1000 and 0.0001, respectively, with the minimum RMSE being 8.61.As for the MLPNN model, the accuracy for various numbers of nodes in the hidden layer is shown in Figure 4.The results revealed that the optimized MLPNN architecture with an RMSE value of 7.30 was in 21-3-1 structure, i.e., indicating 21 input nodes in the input layer, three in the hidden layer with the unipolar sigmoid as the transfer function, and one in the output layer.Using the Levenberg-Marquardt learning algorithm, the best learning rate and training time (iterations) obtained were determined to be 0.01 and 13, respectively.

OK, GWRK and ANNK Training and Six Models Validation
The results from Kolmogorov-Smirnov test (K-S) (p < 0.05) showed that the probability distributions for SOC of the training samples (skewness = 1.57, kurtosis = 3.56), and the residuals from GWR (skewness = 0.84, kurtosis = 2.63) and ANN (skewness = 0.90, kurtosis = 1.28) possessed normal distribution.The training set and residuals were directly used to calculate experimental semivariograms for OK interpolation.Table 4 showed the basic information on the training samples, GWR residuals (R GWR ), and ANN residuals (R ANN ) in the OK.The equivalent strong performances of semivariogram models for training samples, R GWR and R ANN were evident, based on the low residual sum of squares (RSS) and high R 2 , and the Exponential model for R ANN performed the best among the three.A greater spatial field fluctuation of training samples than R ANN and R GWR was explained by its highest range.The values of the nugget indicated that the stationarity assumption was more valid for R GWR , than for R ANN and training samples.The highest nugget value of training samples suggested that OK was more sensitive to the sampling design for SOC content at a smaller distance.While environment variables were considered as GWR and ANN, the randomness or stochastic showed the reduction according to lower values of the nugget.The smallest nugget/sill for R ANN displayed its stronger basal effect than R GWR and training samples.It also revealed that R ANN obtained the higher spatial autocorrelation and more suitability for OK than the other two.Due to the strong spatial autocorrelation and good modeling performance of R GWR and R ANN , GWRK and ANNK, which additionally predicted R GWR and R ANN by OK, were more reasonable than GWR and ANN.Table 5 presented the accuracy of different models for estimating the SOC of the validation set.The OK model resulted in a mean error of −0.71 g•kg −1 had the highest tendency for overestimation.The ANNK model with a mean error of −0.38 g•kg −1 showed the lowest tendency for overestimation.The GWR model with a mean error of 1.63 g•kg −1 was the only one for underestimation.ANNK gave rise to the lowest RMSE and closest-to-zero mean error values, as well as the highest R 2 value.Hence, it was the best predictive method for estimating the SOC.Besides, the accuracy ranking of six methods from high to low was ANNK, SVR, ANN, GWRK, OK and GWR, which was consistent with the ranking of the consistency between the predicted and measured ranges.

Spatial Prediction and Mapping of SOC Content
The spatial distribution of SOC produced by OK, GWR, SVR, ANN, GWRK and ANNK are illustrated in Figure 5.All the six maps showed that a high SOC region was on the northwest section of the study area with content values ranging from 29.29 to 70.93 g•kg −1 , where urban parks and green space occupied.While low SOC regions were located near the outskirt of Shuangyang District with values ranging from 4.18 to 9.55 g•kg −1 .Relatively high SOC contents were present in the northern side of Xinlicheng Reservoir, and the areas between Jingyuetan and the northwestern side of Shitoukoumen Reservoir.The map resulting from GWRK was characterized by more explicit spatial variation than those from GWR and OK.Comparing the range of the estimated SOC content and that of measured values (2.54-68.94g•kg −1 ) resulted in the following performance ranking for the six algorithms from strong to weak: ANNK, SVR, ANN, GWRK, OK and GWR.

Comparison between Intra-Classes of Three DSM Techniques
The study suggested that, for geostatistical techniques, OK performed better than GWR.A weaker performance of GWR compared to OK in this research can be attributed to the spatial autocorrelation and local multicollinearity of environmental variables [74], as VIFs can only detect the global collinearity.Specifically, the study area has fairly flat terrain, so that environmental variables, such as BD and relief grids used in this study were measured at a rather coarser resolution than the requirement for explicitly reflecting the variability of SOC content.Besides, GWR is an extension of the linear regression, so that the number of variables included could have disproportionately negatively impacted the results when the number of variables is excessive.Liu et al. (2015) implemented GWR and OK for predicting SOC stocks with terrain factors, distance factors, land cover type, and spectral indices, and they obtained more accurate estimates, as RMSEGWR was 43.4% of the mean value of SOC and RMSEOK was 34.4%, than those reported in this study [50].Taking a town as the study area, they considered man-made disturbances in addition to natural factors with fairly good resolution, leading to the better performance of GWR on SOC mapping.However, Zhang et al. (2011) applied OK and GWR onto the environmental factors of rainfall, land cover and soil type to map soil organic carbon, and the accuracy was lower than that which was obtained by this study [73].Theoretically speaking, OK considers the spatial autocorrelation of the soil samplings, whereas GWR considers the spatial heterogeneity of environmental variables.In other words, it was found by this study that the validation data were closer to the measuring data in the soil sampling locations, due to a better performance of OK than GWR.Whereas, the spatial heterogeneity of environmental variables and its relationship with soil sampling were more significant and matched at the study scale of Zhang et al. (2011) based on better performance of GWR than OK.Overall, due to the different

Comparison between Intra-Classes of Three DSM Techniques
The study suggested that, for geostatistical techniques, OK performed better than GWR.A weaker performance of GWR compared to OK in this research can be attributed to the spatial autocorrelation and local multicollinearity of environmental variables [74], as VIFs can only detect the global collinearity.Specifically, the study area has fairly flat terrain, so that environmental variables, such as BD and relief grids used in this study were measured at a rather coarser resolution than the requirement for explicitly reflecting the variability of SOC content.Besides, GWR is an extension of the linear regression, so that the number of variables included could have disproportionately negatively impacted the results when the number of variables is excessive.Liu et al. (2015) implemented GWR and OK for predicting SOC stocks with terrain factors, distance factors, land cover type, and spectral indices, and they obtained more accurate estimates, as RMSE GWR was 43.4% of the mean value of SOC and RMSE OK was 34.4%, than those reported in this study [50].Taking a town as the study area, they considered man-made disturbances in addition to natural factors with fairly good resolution, leading to the better performance of GWR on SOC mapping.However, Zhang et al. (2011) applied OK and GWR onto the environmental factors of rainfall, land cover and soil type to map soil organic carbon, and the accuracy was lower than that which was obtained by this study [73].Theoretically speaking, OK considers the spatial autocorrelation of the soil samplings, whereas GWR considers the spatial heterogeneity of environmental variables.In other words, it was found by this study that the validation data were closer to the measuring data in the soil sampling locations, due to a better performance of OK than GWR.Whereas, the spatial heterogeneity of environmental variables and its relationship with soil sampling were more significant and matched at the study scale of Zhang et al. (2011) based on better performance of GWR than OK.Overall, due to the different mechanism of GWR and OK, it is suggested that both models should be calibrated, and then the best result is applied for the spatial prediction of target soil attributes.
The results for the two ML models showed that SVR performed slightly better than ANN, which was in agreement with the conclusions by Wijewardane et [43,71,75].It was reported by prior studies that SVR can eliminate the local minimum issue of the error function of ANN [76,77].Nevertheless, these evaluations were different from those that were obtained by Taghizadeh-Mehrjardi et al. (2016,2017), who showed better performances of ANN than SVR for soil attributes prediction [78,79].Were et al. (2015) input nine variables of soil properties into SVR and ANN, one climate variable, land cover data, four relief factors, as well as two spectral indices to map SOC stocks, and obtained higher values of R 2 , but lower values of RMSE than did in this study [64].This may be because of different extents of the study areas, sampling sizes, the number and quality of environmental variables used.ANN is capable of learning complex patterns in the noisy environment, but it requires representative training samples and appropriate values of network parameters [80,81].SVR is able to generalize the model well with limited training samples, but it suffers from parameter assignment issues [82].Since each model has its pros and cons, studies all strongly suggested the application of both ML models, ANN and SVR, due to their close performances.
As for the hybrid approaches, ANNK outperformed GWRK for the prediction of SOC content in this study.The better performance of ANNK compared to GWRK is related to the non-linear relationship between environmental variables and SOC content.It was overcome by the ability of ANN to solve non-linear problems [83].However, GWRK is much simpler to operate, and more mature for DSM applications than ANNK.Because these two approaches have not been compared in any previous research, this finding can provide a reference for mapping SOC in future.
Except that the two ML techniques had close performance, the two-step hybrid approaches performed differently, as well as the geostatistical techniques.It also indicated that choosing a suitable model at the first step is crucial for estimating the SOC via hybrid approaches.

Comparison among Geostatistical, ML and Hybrid Techniques
Comparison among three DSM techniques (Figure 5 and Table 5), showed that the geostatistical models had the lowest accuracy.The poor performance of OK was found by Mirzaee et al. (2016) when compared with ANNK [47], by Emamgholizadeh et al. (2017) who showed a performance order of ANN, GWR and OK [84], and by Ye et al. (2017) when compared with GWRK [85].ANNK showed the highest accuracy, but GWRK performed worse than the two ML models.It showed that the spatial cross correlation between SOC content and environmental variables in this study was depicted definitely by non-linear regression like SVR and ANN, rather than GWR as a local linear regression.ML in this study, i.e., ANN and SVR have built in methods for dimension reduction, so that the number of variables and local multicollinearity of predictors influences less on ML than GWR.However, additionally considering the spatial autocorrelation of SOC residuals, ANNK apparently outperformed SVR, in spite of a close performance of ANN and SVR.Obviously, two-step hybrid approaches performed better than the corresponding one-step models, which was also revealed by comparisons made in Kumar (2015), Zeng et al. (2016) and Guo et al. (2017) about GWRK and GWR [48,86,87], and in Dai et al. (2014) and Song et al. (2017) about ANNK and ANN [72,88].Dai et al. (2014) used ANNK and ANN for mapping soil organic matter content with auxiliary variables, including climate data, relief data and spectral indices of the remote sensing data, and obtained a higher accuracy than that of this study [88].Due to the high variability of environmental variables and the fairly simple relationship between variables and the SOM content in the Tibetan Plateau, ANNK and ANN showed stronger performances than reported by this study.The relative improvement of the prediction accuracy of ANNK compared to ANN was also more obvious.
The uncertainty is a key issue for SOC mapping, based on these three DSM techniques.The processing of soil samples collection and inputs are the same for six models.While, the distribution and number of soil samples and predictor inputs have a greater influence on geostatistics than ML methods.It also results in the higher uncertainty of GWRK than ANNK.As for different modeling methods, two-step hybrid approaches would increase uncertainty than corresponding one-step models.However, two-step hybrid approaches greatly improved accuracy than geostatistics and ML according to reported studies and our findings as abovementioned.For a regional area like this study, the six methods cost a similar amount of time when considering a trade-off between accuracy and speed.Thus, the two-step hybrid approach resulted in better predictions and lower errors than corresponding one-step models, which were recommended for a regional SOC mapping.

Conclusions
To map the spatial distribution of a vital ecological indicator of soil quality, SOC content, six DSM methods were developed with soil properties, climatic data, relief factors and spectral indices being environmental variables.These methods were geostatistical (OK and GWR), ML (SVR and ANN) and two-step hybrid approaches (GWRK and ANNK).The performance ranking of six methods from high to low was ANNK, SVR, ANN, GWRK, OK and GWR based on the prediction accuracy measured by RMSE, ME, R 2 and the consistency with measured values.Additionally, the two-step hybrid approaches, ANNK and GWRK, gave rise to low RMSE, close-to-zero mean error and high R 2 for the validation set as compared with GWR and ANN.Because of the non-linear relationship between environmental variables and SOC content, two ML methods performed better than linear models.The strong performances of the two-step hybrid approaches were attributed to their capability of explicitly addressing the spatial dependency and non-stationarity of SOC content.It led to apparent outperformances of ANNK compared to SVR, in spite of the close performance of ANN and SVR.
When considering uncertainty and efficiency, the suitable situations of these six methods on SOC mapping were also discussed.For a regional study area with high heterogeneity, such as urban and mountainous landscapes, ML and two-step approach are more suitable than geostatistics.Selection of a suitable model at the first step is crucial for two-step hybrid approaches.Limited by the number of samples, SVR performs better than ANN.Influenced by predictor inputs, GWR performs worse and has a larger uncertainty than OK.Therefore, ANNK combining ANN with OK is considered to be the most promising approach to estimating regional SOC.

Figure 1 .
Figure 1.Location of study area in China and soil sampling sites.

Figure 1 .
Figure 1.Location of study area in China and soil sampling sites.

Figure 2 .
Figure 2. The flow chart of the comparative assessment of geostatistical, machine learning, and hybrid approaches for mapping topsoil organic carbon content.

Figure 2 .
Figure 2. The flow chart of the comparative assessment of geostatistical, machine learning, and hybrid approaches for mapping topsoil organic carbon content.

Figure 3 .
Figure 3. Flow diagram of soil organic carbon (SOC) content mapping using two-step hybrid approaches of geographically weighted regression kriging (GWRK) and artificial neural networks kriging (ANNK).
)).The estimates obtained by the two-step hybrid approaches GWRK and ANNK are the sum of

Figure 3 .
Figure 3. Flow diagram of soil organic carbon (SOC) content mapping using two-step hybrid approaches of geographically weighted regression kriging (GWRK) and artificial neural networks kriging (ANNK).

Figure 4 .
Figure 4.The root mean squared error (RMSE) of a multi-layer perceptron neural network (MLPNN) under different node number in the hidden layer.

Figure 4 .
Figure 4.The root mean squared error (RMSE) of a multi-layer perceptron neural network (MLPNN) under different node number in the hidden layer.

Table 1 .
Environmental variables for soil organic carbon (SOC) mapping.
h Horizontal curvature SOS Slope of the slope, the curvature of the surface SOA Slope of aspect, the curvature of the contour line RLD Relief of land surface, Hmax − Hmin M Surface roughness, 1/cosβ TWI Topographic wetness index, Ln[Ac/tanβ], Ac is the catchment area directed to the vertical flow SPI Stream power index, Ln[Ac × tanβ × 100]

Table 2 .
Algorithms used in the study.

Table 3 .
Descriptive statistics values of SOC and bulk density (BD) at the sampling sites.
1 SD is standard deviation; 2 CV represents coefficient of variation.

Table 3 .
Descriptive statistics values of SOC and bulk density (BD) at the sampling sites.

Table 4 .
Semivariogram parameters for SOC observations and the residuals of geographically weighted regression (GWR) and artificial neural network (ANN).

Table 5 .
Performance evaluation of six models.