Predicting and Mapping of Soil Organic Carbon Using Machine Learning Algorithms in Northern Iran

Estimation of the soil organic carbon content is of utmost importance in understanding the chemical, physical, and biological functions of the soil. This study proposes machine learning algorithms of support vector machines, artificial neural networks, regression tree, random forest, extreme gradient boosting, and conventional deep neural network for advancing prediction models of SOC. Models are trained with 1879 composite surface soil samples, and 105 auxiliary data as predictors. The genetic algorithm is used as a feature selection approach to identify effective variables. The results indicate that precipitation is the most important predictor driving 15 percent of SOC spatial variability followed by the normalized difference vegetation index, day temperature index of moderate resolution imaging spectroradiometer, multiresolution valley bottom flatness and land use, respectively. Based on 10 fold cross validation, the DNN model reported as a superior algorithm with the lowest prediction error and uncertainty. In terms of accuracy, DNN yielded a mean absolute error of 59 percent, a root mean squared error of 75 percent, a coefficient of determination of 0.65, and Lins concordance correlation coefficient of 0.83. The SOC content was the highest in udic soil moisture regime class with mean values of 4 percent, followed by the aquic and xeric classes, respectively. Soils in dense forestlands had the highest SOC contents, whereas soils of younger geological age and alluvial fans had lower SOC. The proposed DNN is a promising algorithm for handling large numbers of auxiliary data at a province scale, and due to its flexible structure and the ability to extract more information from the auxiliary data surrounding the sampled observations, it had high accuracy for the prediction of the SOC baseline map and minimal uncertainty.


Introduction
Soil organic carbon (SOC) is central to soil health as it plays a significant role in soil aggregation, water holding capacity, cation/anion exchangeability, and nutrient availability, which promotes plant growth.SOC can potentially affect both soil ecosystems and crop productivity due to its several critical roles in soil functioning.Globally, the amount of carbon in the upper one meter of soil is about three and two times higher than the amount of carbon found in the biosphere and atmosphere, respectively [1][2][3].Therefore, the contribution of SOC to the global C cycle by sequestering terrestrial C is of great importance.Changes in SOC pools induced by soil management and land cover changes affect global warming and, in turn, can significantly influence soil physical, chemical, and biological properties [4][5][6].As SOC is a good indicator of environmental quality [7], high-quality maps of the spatial distribution of SOC can provide base-line data for SOC turnover and sequestration for C management strategies at the province-scale.The spatial variability of SOC at the field to the regional scale is highly related to the soil forming-factors including the climate (precipitation and temperature); organisms (vegetation and human), relief (terrain attributes), parent materials, and time [8].
Due to the global importance of SOC, digital soil mapping (DSM) approaches have become more focused on SOC mapping in the last decade [4,[9][10][11][12].DSM describes the spatial variation of SOC by taking the relations between SOC and environmental auxiliary variables into account [13][14][15].The auxiliary variables correlated with SOC are often obtained from digital elevation models [11,16], remotely sensed data [16,17] and climatic data [18,19].By using remotely sensed imagery and easy accessibility of climatic and digital elevation model (DEM) data, the application of machine learning (ML) techniques for predicting SOC is significantly increased [11,17,20].
Soils, especially SOC contents, which are the result of the actions and interactions of many different processes and factors, vary from place to place with high complexity [35].Thus, to predict the behaviors and properties of such a complex environment, classical ML may encounter problems [10,12,34,36].A new approach that has received considerable attention as a sophisticated learning algorithm with substantial learning capability and high performance is deep learning (DL) [37].Recently, deep neural networks (DNNs) based on DL approaches have been proposed for overcoming the shortages arising from the traditional ANNs [37,38], by adding more complexity (deep) into the conventional models.This provides better learning capabilities to reveal the complexity underlying the data, and thus results in a higher accuracy of the trained model [39].The hierarchical structure and high learning capacity make DNN models quite flexible and adaptable for a wide variety of highly complex problems such as SOC prediction [11,39,40].The DNNs have recently been used for the prediction of soil properties [40][41][42] and particularly for SOC prediction [43,44].Xu et al. [43], for instance, indicated that the DNN method had a high performance for the prediction of SOC with the effective abstraction of complex covariates for learning by using visible and near-infrared soil spectra.DNNs are more complex and need further parametrization but severely depend on the size of the training dataset.
To eliminate the multicollinearity of variables and exclude unimportant and redundant auxiliary variables, numerous feature selection techniques have been developed for DSM including.Particle swarm optimization [45], the genetic algorithm (GA) [25,32,46], hybrid GA-artificial neural network [47], parallel GA [48], and the artificial bee colony algorithm [36] are among the notable feature selection techniques.Such variable selection techniques can simplify modeling by lowering the number of input variables and potentially improving the accuracy of soil predictions.There is no universal feature selection method to reduce the number of covariates in the pool presented to an ML algorithm.For instance, Behrens et al. [49] compared the two most common approaches for the selection of covariates, namely supervised and unsupervised, and found that the supervised feature selection approach was superior because the soil classes were predicted more accurately.Taghizadeh-mehrjardi et al. [50] explored the effect of the reduction in dimension of feature space with ant colony optimization (ACO) and correlation-based feature selection (CFS) on the accuracy of prediction of spatial models for each particle size fraction.In this study, we decided to implement the GA, one of the most advanced algorithms for feature selection [46].GA can manage the datasets with many features and do not need specific knowledge about the problem parallelizing easily in computer clusters [46,51,52].
Although many ML algorithms have been developed for the prediction of soil properties, the development of site-specific techniques is necessary for enhancing the quality of thematic soil maps [53].However, there is no best worldwide predictive algorithm for SOC mapping given that the accuracy level of SOC predictions is highly related to the local geographic attributes of the study area [54], the sampling size [9,55] and the selected auxiliary variates [14,19,56].
Mazandaran province, northern Iran, is located on the southern coast of the Caspian Sea.There is a descending precipitation gradient from the west to east across the region, leading to a diversity of soil moisture regime (SMR) and soil temperature regime (STR) classes [57].Due to the changes in SOC contents in northern Iran caused by the human activities and natural attributes (landslide, flooding, depression) [58,59], the existence of a high-quality SOC prediction map with known uncertainty in the Mazandaran province is crucial.This provides a base-line map for further temporal monitoring of SOC at the province-scale.Despite the known advantages of feature selection, there have been no insights into the important variables for SOC prediction in northern Iran given the different predominant climatic and soil-forming conditions.Therefore, due to the lack of an SOC base-line distribution map in Mazandaran province, the objectives of this research were (1) to determine the important auxiliary variables driving the SOC contents in the province using GA as a popular automatic method for feature selection, (2) to test the performance of six ML algorithms fed with GA-selected auxiliary variables and (3) to predict the spatial distribution of SOC for mapping with associated uncertainty and (4) to compare SOC contents in different geological units, soil classes and land uses in Mazandaran province.

Study Area
This research was conducted in Mazandaran province, northern Iran.The region is located at a longitude of 50°31′21′′ E to 53°56′52′′ E and latitude of 36°38′06′′ N to 36°54′59′′ N and covers an area of 2,388,179 ha.It borders the Caspian Sea in the north and the Alborz Mountain range in the south (Figure 1).Most of the province is covered by dense, moderate, and low-density forest with each forest type covering 39%, 4%, and 2% of the total area, respectively.There are several kinds of cultivated lands in the study area.Paddy fields are the most common agricultural land use with about 210,000 ha, and orchards cover about 90,000 ha.Based on the De Martonne climate classification, the western, central, eastern, and mountainous parts of the province have very humid, humid, Mediterranean, and semihumid climates, respectively.The mean annual temperature ranges from 18 °C on the coastal plain to below 8 °C in the highlands.There is a gradient of the decreasing precipitation from the west (around 1400 mm) to the east direction (around 450 mm) leading to the diversity of soil moisture regime (SMR) and soil temperature regime (STR) classes across the province.The xeric SMR class covers the largest area in the province, followed by the udic and aquic classes, while thermic (66%) is the most abundant STR followed by mesic (33%) and cryic (1%) [57].The variation of elevation ranges from the Caspian Sea coastal areas with elevations <−5 m to more than 3000 m above sea level in the highlands of the Alborz Mountain range.Five USDS soil taxonomy orders including Mollisols, Entisols Inceptisols, Alfisols, and Ultisols were in Mazandaran with a total of 12 suborders.Mollisols are the most dominant soils forming on different landforms with mollic epipedons.Mollisols have four main suborders including the aquolls, rendolls, udolls, and xerolls mostly distinguishing based on soil moisture regime classes except for rendolls that have an epipedon with less than 50 cm thickness overlies on a highly calcareous horizon.Alfisols are characterized by a clay enriched endopedon with two main suborders, i.e., aqualfs and udalfs.Ultisols with a single suborder, i.e., Humults occurring in a very limited area in Mazandaran province with leached soils under native forest vegetation.Entisols with three main suborders, i.e., Orthents, Fluvents, and Aquents are mostly used for paddy cultivation in the study area.Inceptisols in Mazandaran province have a weakly developed B horizon with two main suborders, i.e., Aquept and Xerepts.

Soil Data
The total dataset for SOC mapping is 1879 composite surface soil samples from two main sources (Figure 1).More than half of the data (1055 samples) were derived from five Master of Science (M.Sc.) research projects in the soil science department at Sari Agricultural Sciences and Natural Resources University (SANRU) [60][61][62][63][64].These samples were collected using a simple random sampling scheme mostly in uncultivated areas.The rest of the dataset comes from soil surveys performed by the Agricultural Research Education and Extension Organization (AREEO) and the Ministry of Jahad-e-Keshvarzi in Sari, Northern Iran.These samples were mostly collected in cultivated areas spread across the province using a grid sampling scheme with a 2000 m grid interval.Each composite soil sample is collected within a 20 m radius surrounding each of the sampling points with at least 10 subsamples (cores).Samples were collected from a depth of 0-20 cm, and their geographical coordinates were recorded with a global positioning system (GPS) device.After air-drying and passing through a 2 mm sieve, the content of SOC was measured by the wet oxidation procedure outlined by the Walkley and Black [65]. Figure 1 shows the spatial distribution of the sampling sites across Mazandaran province.

Auxiliary Variables
The full set of 105 predictor variables initially considered is given in Table A1 in the Appendix A. The auxiliary data included variables derived from remotely sensed imagery (60 variables from Landsat 8 and MODerate-resolution Imaging Spectroradiometer, MODIS), terrain attributes (30 variables), climatic data (10 variables), and five categorical data (e.g., soil map and land use map).
The high contribution of SOC to soil color can be detected by the spectral signature of remotely sensed data.The 60 environmental auxiliary variables derived from satellite imagery were developed based on the median values of 8 satellite images of the Landsat 8 Operational Land Imager taken from 2012 to 2016 to coincide with the dates of soil sampling.Following radiometric, geometric, and atmospheric corrections digital numbers for the blue (B1), green (B2), red (B3), near-infrared (B4), and shortwave IR-2 bands (B6) were extracted.Several indices were then calculated: the normalized difference vegetation index (NDVI), enhanced vegetation index (EVI), combined spectral response index (COSRI), land surface water index (LSWI), brightness index (BI), and other indices with a spatial resolution of 30 m.The variables derived from MODIS imagery had a spatial resolution of 250 m.These included median values of two spectral reflectance bands: red (645 nm) and near-infrared (858 nm) and the EVI, NDVI, and other indices.The daytime and nighttime land surface temperature with a 1 km resolution were also derived from MODIS data.Overall, 60 auxiliary variables were derived from Landsat 8 and MODIS data.
Thirty terrain attributes were derived from a DEM [66].The DEM was obtained from shuttle radar topography mission terrain (SRTM) data with 30 m grid cells.Terrain attributes, namely slope, aspect, elevation, length and steepness (LS) factor, valley depth, openness, catchments area, catchment slope, plane curvature, topographic wetness index (TWI), channel networks base level (CNBL), distance to channel networks, the multiresolution valley bottom flatness index (MrVBF) and other indices [67], were calculated using SAGA GIS.
Climatic factors have high potential to explain large parts of the variation of soil properties in the northern part of Iran, due to the high degree of spatial variability in Mazandaran province.In this study, 10 climatic variables were obtained using WorldClim.WorldClim version 2 [68] contains reliable temperature and precipitation data at a spatial resolution of 1000 m.Categorical predictor variables were derived from five choropleth maps, which were compiled at different cartographic scales, e.g., soil map and land use map [69].
Figure 2 shows the spatial distribution of some auxiliary variables related to SOC including precipitation, NDVI, MrVBF, and land use.The precipitation decreases from the north-west to the north-east in the province, especially in shoreline areas (Figure 2).The southern parts of the province have lower precipitation compared with the northern region where the Caspian Sea shoreline is located.The NDVI values range from −0.5 to more than 0.8, indicating a high diversity of vegetation cover spread across the province that has a potentially significant effect on SOC content due to large differences in the number of falling leaves and plant residues.The MrVBF shows that flat valley bottoms where sediments and outflows accumulate leading to higher clay and SOC contents [34,67].All environmental variables which did not conform with SOC grid resolution of 30 × 30 m were resampled to a 30 m spatial resolution using either the nearest neighbor or bilinear resampling methods.

Selection of Auxiliary Variables Using Genetic Algorithms (GA)
Instead of taking all 105 environmental auxiliary variables into consideration for a predictive ML algorithm, the feature selection method reduces the number and collinearity of the auxiliary variables.The most informative auxiliary variables should be inserted into the algorithms with the aim of high accuracy of the ML algorithms for SOC prediction [9,16,25].The selection of the significant environment auxiliary variables is a preprocessing step for ML algorithms to remove redundant and irrelevant variables.For this study, one of the most advanced algorithms for feature selection, namely the genetic algorithm (GA), was used to select the most appropriate auxiliary data to be fed as inputs to the ML algorithms [16].GA is able to select those auxiliary data that are not only essential but improve performance as well.Moreover, GA can manage the nonlinear relationships between SOC and auxiliary data [70].
By mimicking natural biological evolution, the GA which is a heuristic search algorithm provides the best value for a function [51].A GA feature selection process starts with an initial random population consisting of individuals.The individuals, representing subsets of auxiliary data, are encoded as binary in which 1 represents if the feature is selected and 0 otherwise [71].Then three primary operations including selection, crossover, mutation repeat until a stopping criterion is reached.The selection operations were for selecting the two fittest individuals for reproduction (i.e., the solutions providing the lowest root mean squared error, RMSE).The crossover recombines two individuals to create new ones which may be better.The mutation operator introduces alteration in a small number of individuals.The process of selection, crossover, and mutation continues until a termination condition is satisfied [48,52].Importantly, for each generation, it is necessary to assign a fitness value to each individual in the population so that the RMSE values are calculated by fitting the random forest model [46,48,52].
In this study, the GA procedure was performed with 10-fold cross-validation and 100 iterations to select the smallest number of auxiliary variables important for SOC modeling using the caret package in R [72].The population size, crossover, and mutation rates used were 50, 0.6, and 0.001, respectively, as outlined by Welikala et al. [52].

Machine Learning Techniques
In this study, six ML algorithms including support vector machines (SVM), artificial neural networks (ANNs), regression tree (Cubist), random forest (RF), extreme gradient boosting (XGBoost), and deep neural networks (DNN) were chosen.Each algorithm can discover complex relationships between SOC content and auxiliary covariables.Table 1 summarizes the hyperparameters of the six ML algorithms used in this study.A brief description of the ML techniques used in this study follows.Initially, SVMs were developed as a methodology for resolving problems of classification into two attributes using a threshold value.Connected with the earlier development of SVMs as a classification method, the regressive type of support vector machines was proposed.This caused the spread of the philosophy of support vectors machines being used to solve regression problems.The algorithm was formulated as a linear method and then it was generalized to (1) the presence of noise in the data using slack variables following the soft-margin philosophy, and (2) a nonlinear model through the conversion of the input space into a larger dimension, as done for classification [73].Hence, SVM is used for classification and regression processes with a set of connected supervised learning algorithms and they have an excellent ability to be universal predictors of any multivariate function to any specified degree of accuracy [20].In this study, the SVM algorithm was employed by improving the range of its components (C: 0.01-100; σ: 0.01-100) based on the input data (Table 1).

Artificial Neural Networks
Artificial neural networks (ANNs) stand out among the different types of models because they are calculative techniques with mathematical models simulated from the human's brain neural function [74].ANNs as vigorous data-modeling tools attain knowledge by way of experience.They are able to detect patterns and draw results, therefore they can be used for data prediction with correlation, such as soil properties.The ability for handling and modeling multiple outputs simultaneously is a primary benefit of ANN techniques [75].The development of an ANN model mainly consists of three main stages: the generation of data for the training/testing of the model, the selection of the optimal configuration, and the validation of the model on an independent data set.Additionally, ANNs are interconnected by structures called perceptrons and consist of input, output, and hidden layers that transform the input into something that the output layer can utilize [76].ANN models allow one to attribute lower weight to samples that deviate from a standard, since ANNs can identify patterns in data distribution, which is not observed in linear and nonlinear regression [77].As a result, ANN models can lead to a higher accuracy than linear and nonlinear regression [78].The present neural networks of this study were made based on a learning rate of 0.001-0.05and the number of hidden layer neurons was 2-10 (Table 1).The sigmoid function is the activation function in the nnet package [79] for the MLP with one hidden layer which we used in this study.

Regression Tree (Cubist)
Cubist is an ML estimating tool that is similar in approach to regression trees [80].However, unlike classification and regression tree models, linear models are often well structured rather than having end values [81].Additionally, the Cubist and other regression tree algorithms have a clear difference, that is the linear regression model is fitted to the leaf nodes of the trees in the Cubist [82].Overall, the Cubist approach makes multivariate models that are made of sets of rules, and the prediction model will be chosen based on the rules [83].In this research, the ptoposed regression tree models are based on Cubist regression models [84].Cubist models were improved by defining the number of model trees, and nearest neighbors using the data set shown in Table 1.

Random Forest (RF)
The random forest (RF) consists of a series of binary rule-based decisions that define relationships between input and its dependent variables.It comprises a large number of individual tree algorithms trained from bootstrap samples of the data [85].The single prediction will be made by accumulating the results of all trees.One of the main benefits of random forests is that they can precisely explain the compound connections between the independent variables and the dependent variables.So when composite environmental systems and ecological supplementary variables are introduced, RF can be helpful [86].Two important parameters in RF algorithms are the number of trees (Ntree) and the number of variables (Mtry) which are available for selection in each split [87].These two main parameters (Mtry and Ntree) were adjusted for the best result.The range of values used is shown in Table 1.

Extreme Gradient Boosting (XGBoost)
The algorithm for extreme gradient boosting (XGBoost) was proposed by Chen and Guestrin [88].It is an algorithm for improving the performance for gradient boosting machines and especially for regression trees and K classification methods [89].By the supplemental training strategies, the "boosting" as a basic idea of this method extends a "strong" learner from a set of "weak" learners.The XGBoost technique is supposed to improve calculation but also reduce over-estimation events.The XGBoost simplifies the objective functions and improves the calculation speed to an optimum by allowing the combination of estimative and adjustment terms.In addition, in the XGBoost approach, during the training step, simultaneous computations will be done automatically for the functions [89].More information can be obtained about the XGBoost algorithm from the work of [88].Table 1 shows the XGBoost algorithm parameters used to do this research including the type of algorithm, the depth of trees, the minimum sum of weights of all observations, the number of variables provided to a tree, the number of samples provided to a tree, and the learning rate.

Deep Neural Networks (DNN)
The performance of conventional DNN as an estimation algorithm for remote sensing applications has been extensively explored during the past few years [90].DNN has been reported as a reliable and efficient approximation function for delivering insight into the relationship (whether a linear or a nonlinear relationship) between input and output variables [91].DNN has shown promising results in a wide and diverse range of applications from digital signal processing and control systems to hazard susceptibility mapping [92,93].Figure 3 illustrates the architecture of a conventional DNN.The networks are configured by passing several layers for learning the probability of the outputs.In this study, conventional DNN is a feedforward learning network where there is no looping back from the output layer to input.In this case, the DNN produces a map of virtual neurons and random weights.The inputs and weights will be multiplied and would deliver outputs within the range of 0 and 1.The algorithm would adjust the weights to accurately identify a particular learning pattern to fully process the data.The DNN includes L hidden layers, the input layer (vector X), and the output layer (vector Y).As recently formulated by Wang et al., [94], the estimation of Y can be presented as follows.
where b  and   are the bias and the activation function of the ith layer.Here,   represents the weights.In Equation ( 2), the  + 1 represents the output layer.Therefore, Y can be presented as follows.
= (; ) Eventually, through calculating the mean square error (MSE) of output and input values, loss function L can be estimated as follows.
where N represents the sequence of data.Here, the GA is used to minimize the () function for the training.A trained DNN is further used for the estimation of the new variables.The predictive ability of neural networks is possible by learning large amounts of data.Generally, input data create the training datasets, and similar output data will be entered into a neural network algorithm.This algorithm can detect the basic rules in the data entered and compose an interior model that is suitable to estimate the new input data using several training repetitions during the process.The model can be computed by the interactions and connections between neurons, whereas any physical or clear mathematical relationships cannot be supplied [90].The neural network structure can affect the precision of the predictive models.Each latent layer of the DNN algorithm consists of some calculative neurons that are interconnected to the next calculative neurons in the adjoining latent layers.To finalize the DNN model, the neurons of each latent layer measure the calculative neuron outputs of the prior layer, and after the computation procedure of the activation function, the outputs are generated for the subsequent layer [42].
Table 1 shows the specifications used for DNN, which are hidden layers, size, network weight initialization, learning rate, dropout regularization.In this study, for the DNN method, the H2O package [92] with the rectifier function as a nonlinear transformation was used for DNNs in this study [95].It is worth mentioning that adhering to a balanced ratio of training and testing is of utmost importance in modeling with machine learning [96].Several methods in a wide range of applications are introduced to identify the correct balance for testing [97].Nevertheless, the evaluation metrics have been shown to be reliable measures to maintain a sufficient number of elements for a training dataset in soil research [37][38][39][40].It is often observed that by decreasing the amount of training data, the error increases, which accurately indicates the worth of data for models.The amount of training data, in this study, is optimally tuned to ensure the lowest errors.The total dataset is divided into 10 datasets that are sequentially used for training and testing.The DNN is calibrated 10 times to assure each data point was used as validation at least once.

Evaluation of Algorithm Performance
Ten-fold cross-validation was implemented for testing the performances of six ML prediction algorithms for estimating the SOC contents in Mazandaran province.In this regard, the total dataset was split into 10 datasets that were sequentially used as training and testing datasets for a given prediction algorithm.Each prediction model is calibrated 10 times, guaranteeing each data point was used as validation at least once.Then, the 10 prediction errors can be obtained for each prediction algorithm.The four evaluation criteria used in this study are the coefficient of determination (R 2 ) [98], Lin's concordance correlation coefficient (CCC) [99], mean absolute error (MAE) [100] and root mean squared error (RMSE) [100] with following formulas: 2 ) (5) where, n is the number of samples, Oi and Pi are observed and predicted SOC contents, respectively.O' and P' are the means for the observed and predicted SOC contents, respectively.Furthermore, σo and σp are the variances of observed and predicted values.Four criteria of the validation datasets in 10-fold validation in each prediction algorithm were averaged and used for selecting the best performing prediction algorithms.The prediction algorithm with the lowest MAE and RMSE, and highest R 2 and CCC values are determined as the best for SOC prediction.

Uncertainty Assessment
The spatially explicit quantification of the uncertainty of SOC prediction is analyzed in this study.The SOC maps generated by each model were used to calculate the mean and standard deviation (SD) of the SOC for each pixel in 10-fold realization [101].It was assumed that six ML models follow the normal distribution for each raster cell.The confidence interval (CI) was calculated with the mean as ±1.64 SD for a given 90% CI.The upper and lower limit of the 90% CI were mapped.The mean of the SOC contents in each pixel and the 90% CI was calculated by retrieving the 5th and 95th percentiles of prediction.Finally, three maps of the SOC were produced for the best performing model: the mean prediction, lower CI (5%), and higher CI (95%).

Summary Statistics
Table 2 shows the summary statistics for topsoil SOC for the 1879 sampling sites.SOC contents ranged from 0.02% to 11.48% with a mean and standard deviation of 2.19% and 1.27%, respectively.The lower SOC contents correspond to highly degraded lands where the surface soil is eroded and the maximum SOC contents were observed in dense forestlands.The coefficient of variation of 58.23% demonstrates the high variability of SOC contents within the study area.The values of skewness (2.33) and kurtosis (8.2) indicate that the SOC data is highly skewed and in turn, violates assumptions of normality.SOC data were anchored at 1.00 and then, transformed by the natural logarithm to make the distribution less skewed.The skewness and kurtosis values of the log-transformed SOC values were 0.59 and 1.50, respectively, and the Kolmogorov-Smirnov test showed that the distribution of these log-transformed values was not significantly different from normal.Further analysis was performed on the log-transformed data; and the predicted SOC values were back-transformed to the original scale.

Selected Auxiliary Data
In this study, the GA procedure with 10-fold cross-validation and 100 iterations (i.e., GA was executed 1000 times) was used to select the minimum number of important auxiliary variables for SOC modeling.The results of GA for 1000 generations are presented in Figure 4.The average of the internal out-of-bag RMSE estimates as well as the average of the external performance estimates calculated.Based on these results, the generation associated with the best external RMSE estimate was 0.86.The GA selected 35 predictors out of 105 environmental variables, as the most relevant driving factors for SOC mapping in Mazandaran province (Table 3).These variables comprised 13 terrain attributes, 18 remotely sensed variables, two climatic variables, and two categorical data layers.The resolution and origins of the 35 predictors are given in Table 3.Getting more information about the contribution of each variable to SOC variability is of great importance.Therefore, the significance of each environmental auxiliary variable was analyzed using a sensitivity analysis and was represented as an attribute percentage.Figure 5 indicates the order of the relative importance of the selected predictors on SOC spatial variability using the procedure outlined by [45].As can be seen in Figure 5, precipitation is the most crucial feature (14.9%) driving the spatial variability of SOC contents in Mazandaran province followed by the NDVI (12.5%),MODIS day temperature (10.6%),MrVBF (8.7%), land use (8.2%), valley depth (7.2%), and MODIS night temperature, respectively.In Mazandaran province, as previously thought, the precipitation significantly affects SOC contents by enhancing vegetation coverage and the rate of organic matter inputs.Together precipitation and temperature (MAT) explain 18.9% of the variation in SOC contents demonstrating the high dependencies of SOC contents in the province-scale to the climatic variables.Lamichhane et al. [14] reviewed several studies and pointed out that climate is the most influential factor for the variation of SOC at large extents.The high precipitation is mostly combined with lower temperatures and slower SOC decomposition rates at higher altitudes [102,103].Falahatkar et al. [104] reported that the most important auxiliary predictors for SOC stocks for surface soil in Guilan province, northern Iran are land use, NDWI, silt, clay, and elevation.Along with our findings, the surface temperature data derived from remotely sensed data (Landsat) was found to be influential for improving SOC prediction [66].
The NDVI is the second most important feature explaining SOC variability, indicating that the SOC contents were highly influenced by vegetation variation.SOC content was highly dependent on the natural vegetation cover intensity, and the plant residue left after plant harvesting [4,36,105].Due to the dependency of SOC on vegetation cover, NDVI has frequently been used as a predictor for mapping SOC in several studies [16,25,33,106].Additionally, NDVI has more importance to SOC contents compared to other remotely sensed vegetation indices like EVI.Although the EVI performs better than the NDVI in many applications, our results indicated that NDVI is more important for explaining SOC contents compared to EVI.It might be related to the topographic conditions as Matsushita et al. [107] reported that EVI is more sensitive to topographic conditions than is the NDVI.Meanwhile, the green band of Landsat-8 (B2) showed a higher contribution to SOC than B3 and B4.
The high contribution of the MrVBF (8.7%) to SOC variability in this study could be attributed to the deposition of fine organic-enriched particles and sediment [58] from the highlands in the lower valleys with flat and low-lying areas.Land use data is considered as the fifth important variable for SOC variation.Land use effects on SOC variation is related to the land-use conversion in the last two decades in Mazandaran province that have led to the exposure of soils and the rapid decomposition of SOC [58].
The TWI only contributed 3% SOC contents showing that SOC tends to accumulate in wetter, low-lying areas in Mazandaran province.Taghizadeh-Mehrjardi et al. [25] demonstrated that the wetness index is the most important terrain variable for the prediction of SOC in subsoils (>30 cm depth).Plan curvature (Plan.Curv), soil map, slope, aspect, channel networks base level (CHNL.BASE), LS factor and other variables ranked in Figure 4 made only a small contribution to SOC spatial variability at the province-scale used in this study but still need to be considered as input variables for SOC modeling by ML techniques.3 for a definition of auxiliary data).

Machine Learning Performances
The average MAE, RMSE, R 2 , and CCC for SOC prediction by 10-fold cross-validation are shown in Table 4.The proposed ML models showed different abilities to predict SOC contents at unsampled locations at the province-scale.This could be related to the various mathematical functions of each algorithm [2].The mean R 2 values indicate that the SVM, ANN, Cubist, RF, and XGB models deliver 53%, 55%, 57%, 58%, and 57% of SOC variability, respectively.However, the DNN model outperforms other models by delivering 65% of the SOC variability.In all ML models, the RMSE values are reported more significant than the MAE, indicating that there is a contribution of the errors in SOC predictions [108].The DNN algorithm showed the lowest mean MAE value (0.59%) of the six studied ML algorithms.The SVM algorithm had the highest error with mean RMSE values of 0.87% compared with other ML models, meanwhile, the DNN outperformed with the lowest mean RMSE value (0.75%).ANN, Cubist, RF, and XGB showed a similar ability to predict SOC in Mazandaran province.Based on the performance criteria used, SVM was always a weaker ML algorithm than the other algorithms, while DNN was the most consistently robust algorithm.
One of the main advantages of DNN is that the step of feature extraction was performed by the DNN model itself [10].Our results confirm previous research on the performance of DNN in soil modeling.For instance, a recent research study introduces the DNN as an effective and robust modeling method to capture the complex nonlinearity between auxiliary variables and soil moisture [40] and SOC prediction [11,42].DNN models by using the multiple hidden layers of the neural network improve the SOC prediction.Padarian et al. [11] reported that using deep learning models for digital soil mapping offers a simple and effective framework for future soil mapping.The DNN algorithm needs a large number of parameters to be fitted so that it performs well with a large dataset like the one used in this study.Importantly, the sample size is a critical issue for training in the DNN [41].underestimated.The CCC statistic quantified the level of agreement between predicted and measured SOC values according to the 1:1 line.It is based on CCC values (Table 4).The The lower performance of other ML algorithms except for the DNN could be related to taking a large number of auxiliary variables into account and the original data having multiple-scales of variation, as well as different sources and sampling times, all of which increase the uncertainty.
The 1:1 scatterplots of actual vs. predicted SOC using the six ML algorithms are shown in Figure 6.It is now much easier to understand the prediction efficiency of the DNN algorithm as most predictions follow the 1:1 line with the exception of large observed SOC contents, which were slightly DNN algorithms with a 0.83 value were superior and the SVM (CCC = 0.76) was inferior.As the deep learning method is sensitive to the size of the training dataset, DNN apparently yielded the best result in this study due to the large data for training.Using a deep convolutional neural network trained with a smaller dataset was not effective for the prediction of soil properties by spectral data [41].
The DNN algorithm has a more flexible structure and is explicitly able to extract more information from the environmental auxiliary variables and the SOC content and this is consistent with the results of [11,42].Therefore, we could recommend the DNN algorithm as the best ML algorithm for the prediction of SOC content with reliable uncertainty in Mazandaran province.The DNN model in this study was successfully trained with a large number of covariates due to its more flexible neural structure and in turn was able to effectively combine with multiscale properties.This algorithm uses the data imputation for taking the missing values into account thereby the better model performance could be achievable especially for subsurface DSM [11].
The observed mean value of R 2 and RMSE in the DNN algorithm can be compared to the other studies at the regional scale [109][110][111].Wang et al. [16] only achieved an R 2 mean value of 48% of the total spatial SOC variability using the RF algorithm in semiarid pastures of eastern Australia.

Spatial Prediction of SOC with Uncertainty Estimates
The numbers and percentages of SOC contents that fall within the 90% CI are shown in Table 5.The uncertainty analysis also showed to some extent the same trend to the ability of the ML algorithms to predict SOC.The DNN had the maximum percentage of observations (~88%) that fell within the defined CI.The spatial prediction of mean SOC content with a 5% lower confidence limit and 95% upper confidence limit values in Mazandaran province produced by the DNN algorithm are shown in Figure 7.It is clear that the combined influence of selected auxiliary variables controls the SOC contents.It is evident that the SOC contents tend to be higher in a strip from the west (more than 3%) to the east (lower than 1%) in the middle of the Mazandaran province.The precipitation gradient and NDVI index, which were the most highly correlated variables, were greatly responsible for SOC variations.The SOC content coincided in a systematic way with increasing the precipitation gradient, NDVI, and MrVBF indices (Figure 2).It was clear from the predicted SOC map that the amounts were higher in the area with high NDVI values ranging from 0.71 to 0.81 (the central part of the study area).The higher rainfall favors higher net primary production of plant residues and explains the higher SOC contents in the middle portion of the province.There is some uncertainty in the predicted map that may be related to the high variability in SOC data, low precision of predictors, inherently poor relationships between SOC and auxiliary variables, and errors in modeling [112].Considering the multiple scales of auxiliary variables and the good resolution of soil erosion/deposition data, such data can potentially [32] reduce the spatial prediction uncertainty in future studies.
The SOC contents change over time, thus the predicted map can be used as a base-line to indicate temporal changes.Together with the estimation of uncertainty, the prepared maps are more reliable and could be useful for future SOC inventories and province-scale accounting and carbon balance studies.
Table 5.The numbers and percentages (%) of SOC that fall within the 90% prediction intervals predicted by machine learning models for 10-fold cross-validation.

SOC Contents in Soil Classes and Geological Eras
The mean comparison of the SOC contents within different soil orders and suborders is shown in Table 6.The Ultisols and Mollisols with mean SOC contents of 4.04% and 3.20%, respectively, had higher surface SOC compared with other soils.Most Ultisols and Mollisols were found in the dense forest in Mazandaran province showing the higher C inputs into the soils.The high precipitation with a relatively low MAT in the center of the Mazandaran province leads to higher SOC accumulation at the soil surface and in turn higher clay content in a Bt horizon [6,113] and the deeper Bk horizon [6,113,114].Entisols had the highest SOC variability (CV = 40.66%)followed by the Inceptisols (33.94%),Alfisols (CV = 33.64%),and Mollisols (30.55%).The SOC under Mumults had the highest SOC with the lowest SOC variability (CV = 15.63)whereas Fluvents had significantly the lowest SOC due to the higher SOC decomposition caused by the exposure (tillage) and loss of SOC by erosion.The mean SOC contents spread across the Mazandaran province differed in soils under different soil SMR and STR classes as shown in Figure 8.The SOC mean value was the highest in the udic SMR class with mean values of 3.85% followed by the aquic (2.45%) and xeric (2.10%), respectively.The high precipitation for soils with the udic SMR class [57] led to the high aboveground biomass production inputs.The greater SOC contents in soils having the aquic SMR class compared to the xeric SMR could be related to anaerobic (reducing) conditions decreasing the rates of organic matter decomposition [115].
The soils having mesic STR classes have high SOC content with a mean value of 2.75%, which was significantly higher than the thermic (2.20%) and cryic (1.25%) STR classes, respectively (Figure 8).Soils in the thermic STR class with higher MAT in low-lying areas had lower SOC contents compared with mesic STR classes reflecting a negative effect of MAT on SOC contents in Mazandaran province due to the high SOC decomposition rate.The small area of the province with a cryic STR class has low vegetation cover in high-altitude lands accompanied by unsuitable temperature conditions for plant growth leading to low inputs of plant residues and biomass.Overall, SOC has been increased with precipitation and decreased with temperature associated with a given altitude in the study area.Moreover, soils formed on younger geological formation have lower SOC contents.The mean SOC contents in soils under the Cenozoic geological era (2.35%) had significantly lower SOC contents compared with soils under Mesozoic (3.12%), Paleozoic (3.35%) and Proterozoic (3.29%) eras, respectively.The higher observed SOC developed on the older geological formations could be attributed to the increased time for SOC to develop and aboveground carbon inputs by the dense vegetation cover inducing the SOC accumulation.[57] according to the Newhall model.

SOC Contents in Landform Units and Land Uses
The soils formed on mountainous landforms had the highest SOC (3.11%) values with forest land use, while there was little difference in SOC contents for the other landforms except for the soils developed on alluvial fans that had significantly the lowest SOC contents (1.57%) with high coarse fractions (soil particles greater than 2 mm) (Figure 9).The alluvial fans with unstable landforms, have a high susceptibility to erosion and have little water holding capacity providing the lowest aboveground biomass production in the study area.The high degrees of stability in mountain landforms especially in the summit areas [6] showed more developed soils including the Ultisols, Mollisols, and Alfisols.These are closer to the steady-state conditions relative to the younger landforms (Fluvents) leading to greater humification and SOC accumulation on mountain landforms that are currently covered by dense forest.On more geomorphically dynamic/unstable landforms, organic layers can be removed from the developing surface inducing the SOC losses through erosion [115].The disturbed soils in croplands and orchards had significantly lower SOC compared with forests and rangelands except for poor rangelands.Soils in residential, dry farming, poor rangelands, and seashore areas had SOC mean values of 2.08%, 1.78%, 1.71%, and 0.75%, respectively.Unsurprisingly, the dense forestlands have significantly the highest SOC content with mean values of 3.77% (Figure 9) followed by the semidense forestlands (2.90%), low dense forestlands (2.50%), good rangelands (2.57%), and moderate rangelands (2.03%), respectively.Emadi et al. [58] reported that the cultivation of virgin forest and pasturelands in Mazandaran province led to about 35 and 30% reduction of SOC content, respectively.The conversion of forest and rangelands into the croplands induces SOC oxidation whereby the topsoil SOC decreases.

Conclusions
The objective of this study was to determine a reliable algorithm for predicting the SOC contents in Mazandaran province through consideration of six different ML algorithms and using 105 environmental auxiliary variables derived from terrain attributes, remote sensing, and climatic data.Thirty-five auxiliary predictors were selected by the GA method.Precipitation, NDVI, MODIS day temperature, MrVBF, and land use were the most important predictors.The results show that the DNN algorithm outperformed other ML algorithms in terms of the power of the prediction uncertainty at the province scale demonstrating that DNN is suitable for use as a robust estimator for SOC mapping in Mazandaran province.The SOC was lower in soils under late geological age (Cenozoic era), while it is accumulated in more developed Ultisols and Mollisols with virgin forest and rangelands in udic SMR classes spread across the middle strip of Mazandaran province.The mesic STR class has significantly higher SOC with high vegetation cover and biomass and probably with a lower C decomposition rate.The predicted SOC map could be used as a base-line for further studies and projects related to the C sequestration development both locally in soils of the Mazandaran province and globally at the worldwide scale.Although the DNN algorithm was found to be the best algorithm to map SOC contents more accurately than other studied ML algorithms, the search for optimized spatial interpolation algorithms is still in its early stages in this province.Moreover, further investigation should be conducted to test the potential of other combination algorithms in this province and test the reliability of DNN reliability for other regions in Iran with different climate and agro-ecological structures.

Figure 1 .
Figure 1.(A) Geographical position of Iran on the world map, (B) geographical position of the study area in Iran, and (C) spatial distribution of soil samples.The background is a false-color composite image derived from Landsat.

Figure 2 .
Figure 2. Four examples of auxiliary variates used for modeling soil organic carbon: (A) multiresolution valley bottom flatness index (MrVBF), (B) normalized difference vegetation index (NDVI), (C) precipitation, and (D) land use maps.The Crop.Or, Dense.Forest, Dry.Farm, Good.Range, Low.Forest, Mod.Range, Poor.Range, Resident.L, Seashore.L, and Semi.Forest symbols in land use map correspond to the land use type of croplands and orchards, dense forestlands, dry farming lands, good rangelands, low dense forestlands, moderate rangelands, poor rangelands, residential lands, seashore lands, and semidense forestlands, respectively.

Figure 3 .
Figure 3. Illustration of the architecture of conventional deep neural network (DNN).

Figure 4 .
Figure 4. Internal and external validations of GA.

Figure 5 .
Figure 5. Relative importance of auxiliary data using genetic algorithms.(Refer to Table3for a definition of auxiliary data).

Figure 7 .
Figure 7. Spatial prediction of (A) upper, (B) mean, and (C) lower confidence limits of soil organic carbon (SOC) using a deep neural network model.

Figure 8 .
Figure 8. Mean comparison of soil organic carbon values within (A) soil moisture regime, soil moisture regime (SMR), classes, (B) the soil temperature regime, soil temperature regime (STR), classes, and geological eras in the Mazandaran province.Values with different letters in each column indicate significant differences (p < 0.05).The maps of SMR and STR classes were produced by Emadi et al. [57] according to the Newhall model.

Figure 9 .
Figure 9. Mean comparison of soil organic carbon values within (A) physiographic units and (B) land uses in the Mazandaran province.Values with different letters in each column indicate significant differences (p < 0.05).I indicates the error bar (SD) in all columns.

Table 1 .
Hyperparameters of machine learning algorithms used in this study.
SOC: soil organic carbon; Min: minimum; Max: maximum; SD: standard deviation; CV: coefficient of variation; K-S p-value: significance level of Kolmogorov-Smirnov test.

Table 3 .
The selected auxiliary data using genetic algorithms for predicting SOC.

Table 4 .
Comparisons of the accuracy of six machine learning models for validation dataset by 10fold cross-validation (means ± standard deviation).

Table 6 .
The SOC changes in different soil orders and suborders in Mazandaran province.

Table A1 .
Environmental auxiliary data initially considered for predicting the distribution of SOC.