Article

Randomness in Data Partitioning and Its Impact on Digital Soil Mapping Accuracy: A Comparison of Cross-Validation and Split-Sample Approaches

Faculty of Agrobiotechnical Sciences Osijek, Josip Juraj Strossmayer University of Osijek, Vladimira Preloga 1, 31000 Osijek, Croatia
*
Author to whom correspondence should be addressed.
Agronomy 2025, 15(11), 2495; https://doi.org/10.3390/agronomy15112495
Submission received: 10 October 2025 / Revised: 23 October 2025 / Accepted: 27 October 2025 / Published: 28 October 2025
(This article belongs to the Special Issue Soil Health and Properties in a Changing Environment—2nd Edition)

Abstract

Digital soil mapping has become increasingly important for large-scale soil organic carbon (SOC) assessments, yet the choice of accuracy assessment method significantly influences the interpretation of model performance. This study investigates the impact of cross-validation fold numbers on accuracy metrics and compares cross-validation with split-sample validation approaches in national-scale SOC mapping. Five machine learning algorithms (Random Forest, Cubist, Support Vector Regression, Bayesian Regularized Neural Networks, and ensemble modeling) were evaluated to predict SOC content across France (539,661 km2) and Czechia (78,873 km2) using 2731 and 445 soil samples, respectively. Environmental covariates included satellite imagery (Sentinel-1, Sentinel-2, and MODIS), climate data (CHELSA), and topographic variables. Four cross-validation approaches (k = 2, 4, 5, 10) were applied with 100 repetitions each, and the results were compared with the existing literature covering both cross-validation and split-sample methods. Ensemble models consistently produced the highest prediction accuracy and the lowest per-fold variance across all validation approaches. Higher fold numbers (k = 10) produced higher accuracy estimates than lower fold numbers (k = 2) but also the widest ranges of accuracy assessment metrics. This is consistent with observations from previous studies, in which split-sample validation reported higher R2 values (0.10–0.90) than cross-validation studies (0.03–0.68), suggesting a strong effect of randomness in the training and test data split of the split-sample approach. These results suggest that k-fold cross-validation should preferably be used for reporting prediction accuracy in similar studies, as the split-sample approach is strongly affected by the value properties of the particular training and test subsets used for validation.

1. Introduction

Digital soil mapping has revolutionized the ability to quantify and map soil properties at various spatial scales, from local to global extents, fundamentally transforming traditional soil survey approaches that relied on limited point observations and expert knowledge [1]. Soil organic carbon (SOC), as one of the most critical soil properties influencing fertility, water retention, and nutrient cycling, has received particular attention in digital soil mapping applications due to its fundamental role in ecosystem functioning and global carbon cycling [2]. The modern digital soil mapping framework utilizes extensive environmental covariates derived from remote sensing imagery, climate datasets, topographic analyses, and geological information to establish quantitative relationships between soil properties and environmental factors [3,4]. Satellite-based remote sensing data, including multispectral imagery from platforms such as Landsat, Sentinel-2, and MODIS, provide crucial information about vegetation patterns, land use dynamics, and surface reflectance characteristics that correlate with soil properties [5,6,7]. Climate variables, including temperature, precipitation, and derived indices such as aridity and seasonality metrics, capture the environmental drivers of pedogenesis and organic matter accumulation [8,9]. Topographic derivatives from digital elevation models, such as slope, aspect, curvature, and compound topographic indices, quantify the influence of terrain on soil formation processes, water flow patterns, and erosion–deposition cycles [10]. The integration of machine learning algorithms with these extensive environmental datasets has enabled moderately high prediction accuracy in SOC mapping across large spatial domains, especially for national- and regional-scale applications [11].
Machine learning algorithms, and more recently, deep learning, have demonstrated superior performance compared to traditional geostatistical methods by capturing complex non-linear relationships between environmental covariates and soil properties [12,13]. Among these methods, Random Forest (RF), Cubist (CUB), Support Vector Regression (SVR), and Bayesian Regularized Neural Networks (BRNNs) were frequently used for digital SOC mapping, achieving high prediction accuracy due to their ability to model non-linear relationships between SOC and environmental covariates [14,15]. Ensemble methods that combine multiple algorithms have shown particular promise in improving prediction robustness and reducing uncertainty in digital soil mapping applications [16], particularly when these models are constructed from methods with complementary properties.
Despite these technological advances, the reliability and comparability of digital soil mapping studies fundamentally depend on the accuracy assessment method used, which varies considerably in the literature (Table 1). The selection of the validation strategy, specifically cross-validation (CV) or split-sample (SS) methods, and the choice of cross-validation parameters such as the number of folds can strongly affect the interpretation of model performance. CV has become the dominant method of assessing accuracy in digital soil mapping due to its efficiency in using all available data for both training and validation [17]. The k-fold cross-validation approach, where data are partitioned into k subsets with each subset serving as validation data while the remaining k − 1 subsets are used for training, was frequently used in previous studies [18,19,20,21]. However, the selection of k values varies considerably across studies, ranging from 2-fold to 10-fold or even leave-one-out cross-validation, with limited systematic investigation of how this choice affects accuracy metrics. SS validation, which involves randomly dividing the dataset into independent training and testing sets (typically 70:30 or 80:20 ratios), represents a simpler approach that may provide more conservative accuracy estimates. The fundamental difference between CV and SS lies in data utilization: CV uses all data points for both training and validation during the iterative split into k folds, while SS reserves a portion of data exclusively for validation [22,23]. A biased split may produce misleading accuracy estimates if some soil types are overrepresented in the training dataset and underrepresented in the testing set. Such an approach can produce very high variation in accuracy assessment metrics per split depending on the properties of the input data [24,25].
As a result, a model may perform very well on the training data but poorly on unseen data, producing an exaggerated estimate of accuracy. For this reason, a single split does not necessarily reflect the overall effectiveness of the model across various scenarios, hence the need for numerous iterations to obtain a more reliable estimate.
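This randomness effect is easy to demonstrate with a small simulation. The sketch below uses synthetic data (not the study's SOC samples) fitted with ordinary least squares rather than the machine learning models used here, and shows how the held-out R2 of a single 80:20 split varies with nothing but the random seed of the split:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a soil dataset: one noisy covariate (hypothetical).
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 4, n)
X = np.column_stack([np.ones(n), x])

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def split_sample_r2(seed, test_frac=0.2):
    """Fit OLS on a random train split; return R^2 on the held-out split."""
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return r2(y[test], X[test] @ beta)

# The same data and model, 100 different random splits.
scores = np.array([split_sample_r2(s) for s in range(100)])
print(f"split-sample R2 range across 100 seeds: "
      f"{scores.min():.2f}-{scores.max():.2f}")
```

Even with a well-specified model, the reported R2 depends on which points happen to land in the test set, which is why repeating the partition many times gives a more reliable estimate.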
Furthermore, there is much variation in the choice of accuracy metrics across the studies reviewed, with different emphasis placed on the coefficient of determination (R2), root mean square error (RMSE), mean absolute error (MAE), and correlation coefficients. Some studies report multiple measures and others single indicators, making direct comparisons difficult; for this reason, a normalized RMSE (NRMSE) was calculated for each of the papers listed in Table 1, enabling a cross-study analysis of achieved absolute accuracy levels. The spatial scale of analysis also affects the assessment of accuracy: local-scale studies have been noted to be more accurate due to lower environmental variability, while regional-scale and global applications face increasing heterogeneity that limits model performance. The implications of the choice of validation method go beyond individual studies to the overall understanding of the capabilities of digital soil mapping. Systematic biases introduced by particular validation approaches can result in overly optimistic or pessimistic judgments of model performance and have an impact on decision making in agricultural management and environmental policy.
This methodological variation has made it considerably difficult to compare the results of previous studies and has left a general lack of suitable benchmarks for establishing realistic levels of digital soil mapping accuracy. Therefore, the research hypothesis is that randomness in data partitioning strongly influences national-scale SOC estimates, limiting the reproducibility of digital soil mapping studies. To evaluate this hypothesis, the main aims of this research were to (1) systematically analyze the influence of the number of cross-validation folds k (2, 4, 5, 10) on digital soil mapping accuracy metrics using identical datasets and modeling approaches, relative to the split-sample approach; and (2) assess the robustness of different machine learning algorithms under varying validation scenarios and evaluate the consistency of validation effects across regions.

2. Materials and Methods

The workflow of the study consisted of four main steps: (1) preprocessing of SOC samples used for digital soil mapping; (2) modeling and preprocessing of environmental covariates relevant for SOC prediction; (3) machine learning prediction of SOC levels; and (4) an accuracy assessment based on the k-fold cross-validation approach in multiple iterations, to determine the influence of the number of folds k (2, 4, 5, 10) on digital soil mapping accuracy metrics and the randomness effect of individual split-sample components.

2.1. Study Area and SOC Sampling Data

The study area included two European countries representing diverse environmental conditions, mainland France and Czechia, covering 539,661 km2 and 78,873 km2, respectively (Figure 1). Mainland France has a diverse topography that comprises wide lowland plains in the north and the west, the central Massif Central highland regions, and the high mountain ranges in the south and east. Conversely, the Czech Republic is a landlocked country located in Central Europe, whose topography is made up of upland plateaus and basins with mountain ranges. According to the Köppen–Geiger climate classification [47], mainland France possesses a wide range of climates, with temperate oceanic (Cfb) as the major climate class, Mediterranean (Csa) and humid subtropical (Cfa) climates prevalent in its southern region, and several minor climate classes present in its highlands, including humid continental (Dfb) and alpine (Dfc, ET) climates. By contrast, the Czech Republic is more climatically homogeneous, being dominated by the humid continental (Dfb) class. SOC is directly linked to soil genesis, and this relationship can be expressed through the formula "S = f(c, o, p)t", where "S" represents soil as a function of the factors "c" (climate), "o" (organisms), and "p" (parent material) that interact over time (t) to determine its formation [48].
Soil organic carbon data for France consisted of 2731 samples, resulting in a sampling density of 197.6 km2 per sample, while the Czechia dataset included 445 samples with a density of 177.2 km2 per sample. The 2018 LUCAS (Land Use/Cover Area Frame Statistical survey) SOC dataset [49], created by the Joint Research Centre of the European Commission, was used as the data source for soil samples in the study. Soil samples taken at a depth of 0–20 cm and expressed in g kg–1 were selected to match the geographic extent of the study area. Preprocessing of the input soil samples included the removal of outliers using the interquartile range (IQR) approach. All soil samples with SOC values more than 1.5 IQR below the first quartile or more than 1.5 IQR above the third quartile were identified as outliers using the boxplot approach, following the conventional rule.
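As a sketch, the conventional boxplot rule described above can be implemented as follows; the SOC values below are hypothetical (g kg–1), with one deliberate outlier:

```python
import numpy as np

def remove_iqr_outliers(values, k=1.5):
    """Drop samples outside [Q1 - k*IQR, Q3 + k*IQR] (conventional boxplot rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values >= lower) & (values <= upper)
    return values[mask], mask

# Hypothetical SOC samples in g/kg; the last value is an obvious outlier.
soc = np.array([15.0, 18.5, 21.5, 19.0, 23.4, 17.8, 20.1, 150.0])
cleaned, kept = remove_iqr_outliers(soc)
```

The returned mask can then be applied to the sample coordinates as well, so that outlier removal stays aligned with the spatial dataset.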

2.2. Environmental Covariates Used for Digital SOC Mapping

Environmental covariates were modeled from three main data sources representing abiotic conditions which correlated with SOC levels (Table 2), including climate, topographic, and remote sensing covariates [50]. All environmental covariates were processed to a 1000 m spatial resolution using bilinear resampling and projected to the European Terrestrial Reference System (EPSG:3035) for spatial consistency. Covariate values were extracted at soil sampling locations using point extraction methods to create the modeling dataset. Multicollinearity analysis was performed using variance inflation factor (VIF) assessment, with variables exhibiting VIF > 10 removed to prevent numerical instability in machine learning algorithms. Recursive feature elimination (RFE) was conducted using Random Forest importance measures to identify optimal variable subsets for each study area.
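The VIF screening step can be sketched as follows, computing each covariate's VIF from an auxiliary regression on the remaining covariates and applying the VIF > 10 threshold used in the study; the covariates here are synthetic, and the study's actual implementation is not specified:

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1/(1 - R^2) of each covariate
    regressed on all the other covariates (intercept included)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out[j] = 1.0 / max(1 - r2, 1e-12)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.05, size=300)   # nearly collinear with a
c = rng.normal(size=300)                   # independent covariate
X = np.column_stack([a, b, c])
vifs = vif(X)
keep = vifs <= 10  # threshold used in the study
```

In practice the column with the highest VIF is usually dropped first and the VIFs recomputed, since removing one collinear covariate can bring its partner back below the threshold.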
Climate covariates were obtained from the climatologies at high resolutions for the Earth’s land surface areas (CHELSA) dataset, providing 1981–2010 climatological data at approximately 1 km spatial resolution [51]. These included 19 bioclimatic variables which quantified annual trends of air temperature and precipitation, their seasonality and extreme values relative to monthly and quarterly averages. Topographic covariates were derived from the Shuttle Radar Topography Mission (SRTM) 90 m digital elevation model (DEM) [52], which consisted of elevation data and derived properties, including slope, aspect, flow direction, terrain ruggedness index, topographic position index, and surface roughness calculated using neighborhood analysis with 8-connected pixels [53]. From remote sensing covariates, Sentinel-1 Synthetic Aperture Radar (SAR) data included VV and VH polarization backscatter coefficients, which represented surface roughness, moisture content, and vegetation structure independent of cloud cover. Twelve bands from multispectral Sentinel-2 satellite missions included reflectance from visible, near-infrared, and short-wave bands, based on scenes with below 20% cloud coverage. Nine spectral indices were additionally calculated from derived annual band medians, including vegetation, soil, and moisture indices.
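As an illustration of deriving one of the topographic covariates, the sketch below computes slope from a DEM grid using central differences; the 90 m cell size matches SRTM, but the tilted-plane DEM is a toy example and the study's actual terrain analysis toolchain is not specified here:

```python
import numpy as np

def slope_degrees(dem, cellsize=90.0):
    """Slope (degrees) from a DEM grid via central differences (np.gradient)."""
    dzdy, dzdx = np.gradient(dem, cellsize)
    return np.degrees(np.arctan(np.hypot(dzdx, dzdy)))

# Toy DEM: a plane rising 1 m per cell in the x direction,
# so the slope is atan(1/90) everywhere.
dem = np.tile(np.arange(5, dtype=float), (5, 1))
s = slope_degrees(dem, cellsize=90.0)
```

Aspect, curvature, and roughness follow the same pattern of neighborhood operations on the elevation grid.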
Table 2. Spectral indices used as environmental covariates for SOC prediction in the study.
Spectral Index | Equation | Reference
Normalized Difference Vegetation Index (NDVI) | NDVI = (B8 − B4) / (B8 + B4) | [54]
Green Normalized Difference Vegetation Index (GNDVI) | GNDVI = (B8 − B3) / (B8 + B3) | [55]
Enhanced Vegetation Index (EVI) | EVI = 2.5 · (B8 − B4) / (B8 + 6 · B4 − 7.5 · B2 + 1) | [56]
Soil Adjusted Vegetation Index (SAVI) | SAVI = (B8 − B4) / (B8 + B4 + 0.5) · 1.5 | [57]
Normalized Difference Moisture Index (NDMI) | NDMI = (B8 − B11) / (B8 + B11) | [58]
Moisture Stress Index (MSI) | MSI = B11 / B8 | [59]
Green Chlorophyll Index (GCI) | GCI = B9 / B3 − 1 | [60]
Bare Soil Index (BSI) | BSI = ((B11 + B4) − (B8 + B2)) / ((B11 + B4) + (B8 + B2)) | [61]
Normalized Difference Water Index (NDWI) | NDWI = (B3 − B8) / (B3 + B8) | [62]
B3: Reflectance in green band of Sentinel-2 (560 nm), B4: reflectance in red band of Sentinel-2 (665 nm), B8: reflectance in near-infrared band of Sentinel-2 (842 nm), B9: reflectance in near-infrared band of Sentinel-2 (945 nm), B11: reflectance in shortwave infrared band of Sentinel-2 (1610 nm).
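The indices in Table 2 reduce to simple band arithmetic; a sketch for three of them (NDVI, SAVI, and BSI), using hypothetical reflectance values scaled to 0–1:

```python
import numpy as np

def ndvi(b8, b4):
    """Normalized Difference Vegetation Index from Sentinel-2 NIR (B8) and red (B4)."""
    return (b8 - b4) / (b8 + b4)

def savi(b8, b4, L=0.5):
    """Soil Adjusted Vegetation Index with canopy background adjustment L = 0.5."""
    return (b8 - b4) / (b8 + b4 + L) * (1 + L)

def bsi(b11, b4, b8, b2):
    """Bare Soil Index from SWIR (B11), red (B4), NIR (B8), and blue (B2)."""
    return ((b11 + b4) - (b8 + b2)) / ((b11 + b4) + (b8 + b2))

# Hypothetical reflectance values for a vegetated pixel (0-1 scale).
b2, b4, b8, b11 = 0.05, 0.06, 0.40, 0.20
v_ndvi = ndvi(b8, b4)
v_savi = savi(b8, b4)
v_bsi = bsi(b11, b4, b8, b2)
```

Applied to annual band medians as in the study, the same functions work unchanged on full numpy raster arrays.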
Additionally, MODIS satellite products provided derived spectral information [63], including surface reflectance data (MOD09A1, MODOCGA) covering 16 spectral bands, land surface temperature data (MOD11A1) for day and night observations, land cover classification (MCD12Q1), leaf area index and fraction of photosynthetically active radiation (MOD15A2H), evapotranspiration components (MOD16A2/105), gross primary productivity (MOD17A2H), and phenological metrics (MCD12Q2).

2.3. Ensemble Machine Learning Prediction of SOC Levels

Ensemble machine learning involved the combination of regression algorithms with different prediction principles, including RF, CUB, SVR, and BRNN, whose predictions were aggregated using a generalized linear model. The resulting ensemble (ENS) provided a theoretical foundation for better generalization ability, lower prediction variance, and increased robustness to variations in data distribution than any single constituent algorithm [64]. Moreover, previous studies suggested that this approach is especially useful in complex regression tasks in which no single model provides the highest prediction accuracy in all parts of the feature space [65,66]. The ensemble models were constructed using the stacking approach, with the four base learners trained independently. Their predictions were used as input features of a generalized linear model meta-learner, which minimized the out-of-fold prediction error by assigning higher weights to the better-performing models, reducing the bias from each individual model. Hyperparameter optimization was performed using a random grid search with 10 instances per model within each cross-validation fold, to prevent data leakage and ensure a more robust comparison across validation approaches. A variable importance assessment was conducted for each algorithm to identify key predictors based on model-agnostic permutation feature importance.
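A minimal sketch of the stacking idea under simplifying assumptions: two toy base learners (ordinary least squares and a one-dimensional k-nearest-neighbour regressor, standing in for RF, CUB, SVR, and BRNN) produce out-of-fold predictions, which a least-squares meta-learner then combines; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120
x = rng.uniform(0, 10, n)
y = np.sin(x) + 0.3 * x + rng.normal(0, 0.2, n)  # non-linear toy target

def fit_ols(xt, yt):
    """Linear base learner; returns a predictor function."""
    A = np.column_stack([np.ones_like(xt), xt])
    beta, *_ = np.linalg.lstsq(A, yt, rcond=None)
    return lambda xs: np.column_stack([np.ones_like(xs), xs]) @ beta

def fit_knn(xt, yt, k=5):
    """Local-averaging base learner; returns a predictor function."""
    def predict(xs):
        d = np.abs(xs[:, None] - xt[None, :])
        nn = np.argsort(d, axis=1)[:, :k]
        return yt[nn].mean(axis=1)
    return predict

base_fitters = [fit_ols, fit_knn]

# Out-of-fold predictions from each base learner (5 folds).
folds = np.array_split(rng.permutation(n), 5)
oof = np.zeros((n, len(base_fitters)))
for va in folds:
    tr = np.setdiff1d(np.arange(n), va)
    for j, fit in enumerate(base_fitters):
        oof[va, j] = fit(x[tr], y[tr])(x[va])

# Linear (least-squares) meta-learner stacks the base predictions.
A = np.column_stack([np.ones(n), oof])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
ens = A @ w

def r2(yt, yp):
    return 1 - np.sum((yt - yp) ** 2) / np.sum((yt - yt.mean()) ** 2)
```

Because the meta-learner is fitted by least squares on the out-of-fold predictions, the stacked model can never fit those predictions worse than any single base learner taken alone, which mirrors the robustness of ENS reported in the Results.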
RF regression used an ensemble of decision trees to make robust predictions based on the concepts of bootstrap aggregating and random feature selection [67]. The algorithm built a fixed number of regression trees, each trained on a bootstrap sample of the original dataset, randomizing both samples and features in the process [68]. During tree construction, at each internal node, a random subset of the available predictors, whose size is controlled by the "mtry" hyperparameter, is considered and the best split is selected only among these candidate variables [69]. The overall prediction was achieved by averaging the results of all constituent trees, effectively decreasing the variance of predictions. The random feature selection controlled by the "mtry" hyperparameter allowed the forest to retain both predictive power and generalization ability even when highly correlated or irrelevant features exist, making it especially useful for high-dimensional datasets where traditional regression models can encounter multicollinearity or overfitting.
CUB regression is a rule-based algorithm that uses model trees with instance-based corrections to produce predictions [70]. The algorithm built a sequence of model trees, with the leaf node of each tree holding a linear regression model instead of a constant, allowing the identification of local linear relationships in different parts of the feature space. The "committees" hyperparameter defined how many model trees were produced by boosting, with each successive committee fitted to the residual errors of the preceding ensemble, which improves accuracy and limits overfitting through adaptive reweighting of training instances [71]. The resulting prediction was then refined by the "neighbors" hyperparameter, which defined how many of the closest training cases were used for rule-based prediction to make local adjustments [71]. This instance-based correction found the k most similar cases in the training set and averaged their weighted residuals, combining the global patterns of the rule-based models with local neighborhood information.
SVR was based on the principles of support vector machines applied to regression tasks, projecting input features into a high-dimensional feature space in which the relations could be represented in linear form [72]. The model was built by determining a small set of training examples, known as support vectors, which defined the optimal hyperplane that best approximated the target function. The radial basis function (RBF) kernel, optimized through the hyperparameter "σ", regulated the radius of influence of individual training samples, with larger values generating smoother, more generalized predictions [73]. The regularization hyperparameter "C" controlled the trade-off between model complexity and tolerance to prediction errors, with larger C values resulting in more complex models with poorer generalization properties [73]. Prediction was based on evaluating the kernel function between the test instance and each of the support vectors, weighted by their associated Lagrange multipliers, to generate a continuous output [74].
The BRNN combined the flexibility of the multilayer perceptron with Bayesian inference to mitigate overfitting and provide a measure of uncertainty in predictions [75]. The algorithm used a three-layer architecture, with an input layer, a hidden layer with a specified number of neurons, and a final output layer, with the network weights treated as random variables that have prior probability distributions instead of fixed parameters. The number of hidden units was defined by the "neurons" hyperparameter to learn complex non-linear relationships in the data, with a trade-off between model complexity and the ability to generalize [76]. The Bayesian framework also regularized the network automatically during training by incorporating prior assumptions about weight distributions and updating them using Bayes' theorem, which can prevent overfitting without requiring explicit validation sets for early stopping. Prediction was performed by marginalizing over the posterior distribution of network weights, which generates not only point predictions but also uncertainty estimates.

2.4. Accuracy Assessment of Predicted SOC Levels per Data Fold

To evaluate the impact of fold numbers on the accuracy assessment of predicted SOC levels, four k-fold cross-validation approaches were implemented with varying levels of data partitioning. Based on the evaluated k-fold numbers (k = 2, 4, 5, 10), the input dataset was split into k equal parts, with each part serving as validation data in a single fold and as a part of the training data in the remaining folds. To ensure robust statistical estimates and account for the variability inherent in random data splitting procedures, each cross-validation approach was repeated 100 times using different random partitions, capturing the randomness effect in the creation of training and validation datasets [77]. Accuracy assessment metrics from individual folds were saved for each repetition, simulating training and validation data obtained by a single split-sample procedure. The selected k-fold numbers thus covered frequently used ratios for training and validation dataset creation in the split-sample approach, with k = 2, 4, 5, 10 representing 50:50, 75:25, 80:20, and 90:10 ratios, respectively [78].
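The repeated-partition scheme can be sketched as a generator of train/validation index pairs; this is a simplified stand-in for the study's actual implementation, but it reproduces the structure of 100 repetitions of k-fold splitting:

```python
import numpy as np

def repeated_kfold_indices(n, k, repeats, seed=0):
    """Yield (train, validation) index pairs for `repeats` random k-fold partitions.

    Each repeat reshuffles the samples, so every repeat is a fresh partition
    and every sample appears exactly once as validation data per repeat.
    """
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)
        for va in np.array_split(perm, k):
            tr = np.setdiff1d(np.arange(n), va)
            yield tr, va

# 100 repetitions of 5-fold CV over a toy dataset of 50 samples
# gives 500 train/validation pairs, and per-fold metrics can be
# collected from each pair.
n = 50
pairs = list(repeated_kfold_indices(n, k=5, repeats=100))
```

Saving one row of accuracy metrics per pair yields the per-fold distributions analyzed in Section 3.3.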
Model performance evaluation was conducted using four complementary accuracy metrics, including the coefficient of determination (R2), root mean square error (RMSE), normalized RMSE (NRMSE), and mean absolute error (MAE). R2 quantified the proportion of variance explained by the model, serving as a relative indicator of overall predictive capability and model fit quality. RMSE measured the average magnitude of prediction errors expressed in the original units of measurement, with NRMSE enabling scale-independent comparison across different datasets or variables. Additionally, MAE was computed as the average absolute prediction error, offering a metric less sensitive to outliers compared to RMSE, thereby providing a more robust assessment of typical prediction accuracy [79]. These metrics were calculated according to Equations (1)–(4):
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},  (1)
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},  (2)
NRMSE = \frac{RMSE}{\bar{y}},  (3)
MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,  (4)
where y_i denotes the measured SOC levels from the LUCAS 2018 dataset; \hat{y}_i the predicted SOC levels; \bar{y} the mean of the measured SOC levels from the LUCAS 2018 dataset; and n the sample count. Higher R^2 and lower RMSE, NRMSE, and MAE indicated higher prediction accuracy.
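Equations (1)–(4) translate directly into code; the measured and predicted values below are toy illustrations only:

```python
import numpy as np

def metrics(y, y_hat):
    """R2, RMSE, NRMSE, and MAE as defined in Equations (1)-(4)."""
    resid = y - y_hat
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    rmse = np.sqrt(np.mean(resid ** 2))
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": rmse,
        "NRMSE": rmse / y.mean(),  # normalized by the mean of measured values
        "MAE": np.mean(np.abs(resid)),
    }

# Toy measured vs. predicted values.
y = np.array([10.0, 20.0, 30.0, 40.0])
m = metrics(y, np.array([12.0, 18.0, 33.0, 37.0]))
```

Computing this dictionary once per validation fold produces the per-fold metric distributions reported in the Results.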

3. Results

3.1. Descriptive Statistics of Used SOC Samples

The median SOC levels in the French and Czech soils were similar (21.5 g·kg–1 and 19.0 g·kg–1, respectively), with both study areas producing high coefficients of variation of 1.695 and 2.266, respectively (Figure 2). This observation contrasts with the variability of soil types in the two countries, since French soils were much more heterogeneous in this regard. According to the Soil Geographical Database of Europe [80] and the World Reference Base Soil Classification of the Food and Agriculture Organization of the United Nations [81], Dystric Cambisol (17.7%), Orthic Rendzina (13.7%), and Orthic Luvisol (11.1%) were the most representative soil classes in terms of LUCAS 2018 soil sample count. Notably higher mean SOC values than the overall average were observed for samples collected on Dystric Cambisol (36.5 g·kg–1) and Orthic Rendzina (31.9 g·kg–1), while Orthic Luvisol soils produced below-average SOC levels (18.2 g·kg–1). Unlike France, which had 55 soil types with at least one soil sample, Czech soils were more homogeneous, with Dystric Cambisol soils being dominant, accounting for 51.2% of all used soil samples. While the same soil class had the highest representation in both France and Czechia, its mean SOC value was lower in Czechia (29.7 g·kg–1), with the same applying to the second largest class of Orthic Luvisol soils, with a mean SOC of 17.6 g·kg–1.

3.2. Accuracy Assessment of Predicted SOC Levels Based on Aggregated Metrics from k-Fold Cross-Validation

The accuracy assessment of evaluated machine learning models produced a notable difference between the prediction accuracy results in France and Czechia (Table 3). The optimal hyperparameters of all evaluated models are presented in Table 4. The ensemble approach proved to be the most accurate for SOC prediction in France with k = 10 cross-validation (R2 = 0.412, RMSE = 11.56 g·kg–1, NRMSE = 0.452). All evaluated k-fold numbers agreed on the optimal model, with a minor difference in obtained accuracy assessment metrics, as 10-fold cross-validation produced the most accurate results and 2-fold cross-validation suggested lower accuracy across all four metrics and both study areas. Among the evaluated machine learning methods, ENS produced superior accuracy to individual models according to RMSE and NRMSE, while RF produced the highest R2 for k = 4, 5, and 10. For individual machine learning models, RF produced the highest overall prediction accuracy, followed by SVR in terms of relative prediction accuracy quantified by R2 and BRNNs in terms of absolute prediction accuracy according to RMSE and NRMSE.

3.3. Stability of Accuracy Assessment Metrics According to Randomness in Training and Validation Dataset Creation per Fold

The cross-validation results per fold indicate a strong relationship between the k-fold number and the variability of all evaluated statistical metrics (Figure 3). In both study areas, 10-fold cross-validation produced the largest variability in all accuracy measures, indicated by the largest interquartile ranges and the broadest outlier distributions, due to the largest discrepancy between the quantity of training and validation data. This variability tended to diminish as the k-fold number decreased to two, with median accuracy levels also being very slightly reduced [19,21,43,82,83]. In contrast, the final model fit accuracy assessment metrics were all very stable regardless of variations in k values (Figure 4).

3.4. Stability of Variable Importances According to Number of Folds in Cross-Validation [84,85]

The selection of k-fold values had a negligible effect on the variable importances of the most accurate evaluated machine learning model for SOC prediction (Figure 5). The DEM was consistently the most important variable in France, with changes in relief, slope, and topographic position being key drivers of SOC distribution. Following the DEM, other significant variables included land cover data, MODIS satellite data (B10, B11), and climatic data (CHELSA). In the case of the Czech Republic, the most important predictors were vegetation indices, namely the GCI and BSI, followed by the Leaf Area Index (LAI) and MODIS satellite data (B4). Interestingly, the climatic variables from the CHELSA dataset, which were more important in France, had a lower relative importance here.

4. Discussion

4.1. Evaluation of Machine Learning Prediction Accuracy According to Input Data Properties

Variability in the input dataset tends to produce less robust prediction accuracy depending on the value properties of the training data, as well as to hinder generalization in the initial stages of machine learning [86,87]. Digital soil mapping on a national scale is often subject to highly heterogeneous soil sampling values [88], which suggests that the randomness in the training and validation data split during the accuracy assessment tends to have a larger impact than in less variable landscapes, such as in-field predictions.
The lower prediction accuracy in Czechia compared to France according to R2 is consistent with the higher coefficient of variation in the Czech SOC data, indicating that the strong spatial heterogeneity of Czech soils prevented the evaluated machine learning models from sufficiently explaining the proportion of variance in SOC values based on the used environmental covariates. However, the absolute accuracy assessment metrics disagreed with R2, producing NRMSE values down to 0.387, since Czechia had a smaller range of extreme SOC values in the input LUCAS 2018 soil samples, leading to smaller absolute errors despite less of the variance in SOC values being explained. This disagreement suggests that multiple statistical metrics are needed to evaluate the prediction accuracy of machine learning models across regions with different data characteristics, confirming the observations of previous studies [89,90]. The likely reason for the reduced accuracy of 2-fold cross-validation is the lower quantity of training data relative to other k-fold numbers and an exaggerated sensitivity to randomness in data splitting [91]. Among the evaluated methods, SVR consistently produced the lowest MAE across all evaluated models and k-fold variations, suggesting that it produced a larger proportion of small residuals during prediction [79] and thus provides complementary prediction properties to the other methods in the ensemble.
In most previous studies, prediction accuracy tended to decrease with the spatial extent of the study because of higher landscape heterogeneity and limitations related to sampling density [92], with R2 values generally lower than 0.5, while national- and regional-scale studies frequently produced values lower than 0.4 (Table 1). Therefore, the SOC prediction accuracy obtained in France (R2 = 0.412) is similar to or slightly higher than that of previous large-scale studies, with similar large-scale mapping studies in mainland France producing R2 values in the range between 0.10 and 0.47 [93,94]. On the other hand, the reduced accuracy achieved in Czechia is indicative of the challenges of digital SOC mapping in complex topography, which correlates with problematic prediction accuracy in mountainous areas such as Switzerland, where R2 values ranged from 0.19 to 0.32 across different depth intervals [43]. In terms of the effectiveness of machine learning algorithms used in previous studies with cross-validation, tree-based models, especially RF, frequently outperformed other evaluated methods, including CUB and SVR [95]. Ensemble learning models were less represented in previous studies and are more difficult to compare across studies due to their varying constructions, but a hybrid model based on RF and kriging produced higher prediction accuracy in comparison to individual methods [96]. Moreover, other complex methods also outperformed frequently used individual machine learning methods, among which deep neural networks were superior on a local scale [82], as well as on regional and continental scales [13,34].

4.2. The Impact of Selected Accuracy Assessment Approaches and Statistical Metrics on Prediction Accuracy

Accuracy metrics based on SS overestimated predictive performance in both study areas relative to CV, indicating that SS can be misleading by conveying an inflated impression of model reliability depending on the randomness of the data partition. This implies that the SOC prediction accuracies reported in previous studies can vary not only because of model or data quality, but also because of the selected accuracy assessment approach. Although 10-fold cross-validation is widely used in digital SOC mapping [19,21,43,82], moderate fold numbers (k = 4, 5) can be slightly more robust when using heterogeneous national or regional data. Metric variability increased with the k-fold number in both study areas because larger k values produced more disproportional training-validation ratios, thereby increasing the randomness effect of sample selection, which aligns with the theoretical assumptions of k-fold cross-validation [83].
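The dependence of the training-validation ratio on k can be made explicit with a short pure-Python sketch of repeated k-fold partitioning (the actual pipeline used R and 100 repetitions per k; the sample count of 445 is taken from the Czech subset for illustration):

```python
import random

def repeated_kfold_indices(n_samples, k, repetitions, seed=42):
    """Yield (train_idx, valid_idx) pairs for repeated k-fold cross-validation.

    Each repetition reshuffles the samples; with k folds, every model is
    trained on ~(k - 1)/k of the data and validated on ~1/k, so k = 2
    halves the training set while k = 10 retains 90% of it.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(repetitions):
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for held_out in range(k):
            valid = folds[held_out]
            train = [i for f in range(k) if f != held_out for i in folds[f]]
            yield train, valid

# 445 samples (the size of the Czech subset) with k = 10:
splits = list(repeated_kfold_indices(445, k=10, repetitions=1))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 folds, 400 train, 45 validation
```

With k = 10 each validation fold holds only ~45 samples, so a few atypical SOC values in one fold shift the metric noticeably, which is consistent with the larger metric ranges observed at high k.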
A greater sensitivity to the selected k values was observed for R2 than for RMSE or MAE, suggesting that the absolute accuracy assessment metrics were a more reliable measure of the robustness of the evaluated machine learning models. Despite this observation, a previous study noted the comparative advantages of R2 over RMSE and MAE in terms of comparing prediction accuracy in regression tasks across studies and sensitivity to the value distribution of ground truth data [97]. Therefore, the use of multiple complementary statistical metrics for accuracy assessment, as performed in this study, is justified and should be replicated in future regression-based studies. RF was the most sensitive to data partitioning, resulting in numerous outliers and broad distribution ranges in all k-fold settings. This might be explained by the combination of the bootstrap resampling inherent in the bagging algorithm within RF and the extrinsic randomness due to data partitioning, causing RF to be highly sensitive to noise instead of real signal when the training or validation subsets are small [98]. CUB and BRNNs were more consistent in prediction accuracy but produced lower median accuracy than RF. Compared to the individual machine learning methods, ENS was by far the most resistant to randomness in the training and validation data split while producing the highest prediction accuracy, indicating that model averaging is an effective way of eliminating the effect of randomness caused by particular cross-validation splits. This implies that model-specific biases can be canceled by ensemble averaging, which minimizes variance by combining the capacities of the constituent algorithms, as RF's ability to model complex non-linear structures is offset by the smoother and more conservative predictions of the neural network-based models [64].
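The variance-canceling effect of simple model averaging can be sketched as follows. The per-model predictions are hypothetical and merely mimic base learners whose errors point in different directions; the actual ENS construction in the study may differ:

```python
def ensemble_average(predictions_per_model):
    """Average per-sample predictions across base models (simple unweighted ensemble)."""
    n_models = len(predictions_per_model)
    n_samples = len(predictions_per_model[0])
    return [sum(p[i] for p in predictions_per_model) / n_models for i in range(n_samples)]

def mean_abs_error(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

# Hypothetical predictions from three base learners whose biases differ in
# direction, so averaging partially cancels the model-specific errors.
truth = [20.0, 25.0, 30.0]
rf   = [22.0, 23.0, 33.0]   # overshoots at the extremes
svr  = [19.0, 26.0, 28.0]   # conservative, small residuals
brnn = [21.0, 24.0, 29.0]   # smooth, slightly low
ens = ensemble_average([rf, svr, brnn])

for name, pred in [("RF", rf), ("SVR", svr), ("ENS", ens)]:
    print(name, round(mean_abs_error(pred, truth), 3))
```

In this toy case the ensemble's MAE is lower than that of any individual model, since the opposing residual patterns of the constituents average toward the truth, which is the mechanism proposed for the robustness of ENS across splits.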
While cross-validation is superior to the split-sample approach in terms of resistance to randomness in the training and validation data split, it might be susceptible to data leakage, as no hold-out validation dataset is used [84]. Therefore, the results of this study addressed one component of robustness in the accuracy assessment of digital SOC mapping, and future studies should explore additional approaches, such as nested cross-validation, despite its computational expense [99]. In contrast to standard cross-validation, which uses the same data partitions for both hyperparameter tuning and performance estimation, nested cross-validation has a two-loop design that strictly separates model selection from performance estimation. This hierarchical structure is very useful in avoiding data leakage, a widespread problem in which information from validation sets accidentally permeates model training or hyperparameter optimization, resulting in potentially over-optimistic performance estimates [85].
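The two-loop design can be summarized in a schematic skeleton. The `fit` and `score` functions below are toy stand-ins (a shrunken-mean "model" scored by negative mean squared error), not any of the evaluated learners; only the loop structure is the point:

```python
import random

def fit(values, train_idx, shrink):
    """Toy stand-in for a learner: the shrunken mean of the training values."""
    mean = sum(values[i] for i in train_idx) / len(train_idx)
    return shrink * mean

def score(model, values, idx):
    """Negative mean squared error of the constant prediction `model`."""
    return -sum((values[i] - model) ** 2 for i in idx) / len(idx)

def nested_cv(values, k_outer=5, k_inner=4, param_grid=(0.5, 0.9, 1.0), seed=0):
    """Two-loop nested cross-validation: the inner loop tunes the hyperparameter
    on the outer training portion only, so the outer test folds never influence
    model selection and cannot leak into the performance estimate."""
    rng = random.Random(seed)
    idx = list(range(len(values)))
    rng.shuffle(idx)
    outer_folds = [idx[i::k_outer] for i in range(k_outer)]
    outer_scores = []
    for o in range(k_outer):
        test_idx = outer_folds[o]
        dev_idx = [i for f in range(k_outer) if f != o for i in outer_folds[f]]
        inner_folds = [dev_idx[i::k_inner] for i in range(k_inner)]
        best_param, best_score = None, float("-inf")
        for param in param_grid:  # inner loop: model selection only
            inner_scores = []
            for v in range(k_inner):
                train = [i for f in range(k_inner) if f != v for i in inner_folds[f]]
                inner_scores.append(score(fit(values, train, param), values, inner_folds[v]))
            mean_score = sum(inner_scores) / k_inner
            if mean_score > best_score:
                best_param, best_score = param, mean_score
        # Refit on the full development set with the selected parameter and
        # evaluate exactly once on the untouched outer test fold.
        outer_scores.append(score(fit(values, dev_idx, best_param), values, test_idx))
    return sum(outer_scores) / k_outer

estimate = nested_cv([float(i % 20) for i in range(100)])
```

The computational expense noted above is visible in the structure: the number of model fits grows roughly as k_outer × k_inner × |param_grid| rather than k × |param_grid| for standard cross-validation.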

4.3. The Impact of Selected Accuracy Assessment Approaches on Relative Importance of Environmental Covariates

In contrast to France, SOC changes in the Czech Republic are more strongly associated with surface soil characteristics and vegetation cover, including patterns of agricultural production, vegetation cycles, tillage types and other soil surface dynamics [100]. This is consistent with the fact that the Czech Republic is more climatically homogeneous, making the variability driven by agronomic practices more pronounced [51]. Carbon accumulation is also significantly affected by the type of economic land use [101]. Agriculture occupies approximately 54% of the nation's land area, and Mišurec et al. detected that SOC levels declined over five years due to variations in agricultural practices [102]. Agricultural approaches relying only on mineral fertilizers (without the use of organic manures) often lead to degradation of the soil environment, such as a decrease in soil pH, a reduction in SOC and soil organic matter content, or reduced availability of nutrients [103]. Mineral N fertilizers significantly promote decomposition processes and may cause the depletion of SOC over time [103]. There is a fundamental difference in the factors controlling SOC at the national level. The higher average SOC in France (36.5 g·kg−1) likely results from French soils often encompassing areas with topography favorable for organic matter accumulation, including forested, mountainous areas with low erosion and sufficient amounts of exchangeable Ca2+, as well as from the parent material from which the soil is formed [104]. This consequently resulted in an underestimation of the effect of some controlling factors driving high SOC values, such as forest or grassland land uses and high elevations [105]. Soils located in plots with a mountainous climate had higher C content [106]. Erosion also influences the higher SOC content in France: organic matter transported by erosion accumulates at mountain foothills, where soils form in valleys, leading to an increase in SOC levels [107].
It is important to note that when estimating SOC content, the effectiveness of predicting soil surface texture varies depending on the acquisition date and the presence of crop residues on the surface [108].

4.4. Study Limitations and Future Considerations

Although the research has shown that cross-validation is effective in reducing the element of randomness in data partitioning, the present method has a few limitations that should be addressed in future research. Annual median values of the covariates calculated from satellite image bands were used to simplify the aspect of temporal variability and may not capture the intra-annual dynamics of vegetation and the soil surface state, especially in agricultural landscapes with numerous cropping cycles or rapid phenological changes [88]. This temporal aggregation may conceal short-term variation that is important for soil organic carbon variability. Future research efforts should thus consider temporally explicit modeling schemes based on seasonal or phenological composites, or time-series characteristics derived from remote sensing data, to better represent within-year variability. Moreover, data-related uncertainty can also affect prediction reliability, as the environmental covariates were modeled using multiple datasets with varying native spatial resolutions. Although point-based extraction of these values for vector soil samples reduced technical resampling errors, subpixel variations in the input datasets can still cause scale-related uncertainty, especially in discontinuous or heterogeneous landscapes [109]. While outlier removal was implemented to avoid numerical instability and the impact of extreme values, it may also have created a spatial bias. LUCAS data are not spatially homogeneous, and removing outliers without accounting for their geographic distribution may reduce the capacity of the model to capture the actual natural heterogeneity. Future research should examine the spatial distribution of the removed samples and determine whether they may cause model or regional bias, for example through declustering or geographically weighted resampling methods.

5. Conclusions

This study aimed to systematically analyze the influence of the number of folds in k-fold cross-validation on digital SOC mapping accuracy relative to the split-sample approach, and to assess the robustness of different machine learning algorithms under varying validation scenarios. The main conclusions, based on the comprehensive evaluation of the k-fold cross-validation and split-sample approaches with respect to randomness in data partitioning for the creation of training and validation datasets in two national-scale study areas, are as follows:
  • The ensemble machine learning approach proved to be the most accurate for SOC prediction in France with 10-fold cross-validation, producing an R2 of 0.412, which is on par with the prediction accuracy achieved in similar previous studies.
  • All evaluated k-fold numbers agreed on the optimal model, with 10-fold cross-validation producing the most accurate results and 2-fold cross-validation suggesting lower accuracy across all four metrics and both study areas, likely due to the lower quantity of training data relative to other k-fold numbers and an exaggerated sensitivity to randomness in data splitting.
  • In both study areas, 10-fold cross-validation produced the largest variability in all accuracy measures due to the largest discrepancy between the quantities of training and validation data. These results suggest that moderate fold numbers (k = 4, 5) can be slightly more robust when using heterogeneous national or regional data.
  • The selection of the k-fold number did not have a notable impact on relative variable importance values of the most accurate evaluated machine learning model. DEM had dominant importance for SOC prediction in France, while two spectral indices were the most important for the prediction in Czechia.
  • While cross-validation is superior to the split-sample approach in terms of resistance to randomness in the training and validation data split, it might be susceptible to data leakage, as no hold-out validation dataset is used. Therefore, the results of this study addressed one component of robustness in the accuracy assessment of digital SOC mapping, and future studies should explore additional approaches, such as nested cross-validation.

Author Contributions

Conceptualization, D.R.; methodology, D.R.; software, D.R.; validation, D.R.; formal analysis, D.R. and L.G.; investigation, D.R. and L.G.; resources, D.R.; data curation, D.R.; writing—original draft preparation, D.R. and L.G.; writing—review and editing, D.R., M.J., I.P. and L.G.; visualization, D.R.; supervision, M.J.; project administration, I.P.; funding acquisition, D.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

This research was supported by the scientific project “Prediction of maize yield potential using machine learning models based on vegetation indices and phenological metrics from Sentinel-2 multispectral satellite images (AgroVeFe)”.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mitran, T.; Suresh, J.; Sujatha, G.; Sreenivas, K.; Karak, S.; Kumar, R.; Chauhan, P.; Meena, R.S. Digital Soil Mapping: A Tool for Sustainable Soil Management. In Climate Change and Soil-Water-Plant Nexus: Agriculture and Environment; Springer: Singapore, 2024; pp. 51–95. [Google Scholar] [CrossRef]
  2. Nair, P.K.R.; Kumar, B.M.; Nair, V.D. Soil Organic Matter (SOM) and Nutrient Cycling. In An Introduction to Agroforestry: Four Decades of Scientific Developments; Springer: Cham, Switzerland, 2021; pp. 383–411. [Google Scholar] [CrossRef]
  3. Anthony, T.; Nkwunonwo, U.; Emmanuel, A.; Ganiyu, O. Environmental and Geostatistical Modelling of Soil Properties toward Precision Agriculture. Discov. Soil 2025, 2, 59. [Google Scholar] [CrossRef]
  4. Radočaj, D.; Jurišić, M. A Phenology-Based Evaluation of the Optimal Proxy for Cropland Suitability Based on Crop Yield Correlations from Sentinel-2 Image Time-Series. Agriculture 2025, 15, 859. [Google Scholar] [CrossRef]
  5. Mzid, N.; Castaldi, F.; Tolomio, M.; Pascucci, S.; Casa, R.; Pignatti, S. Evaluation of Agricultural Bare Soil Properties Retrieval from Landsat 8, Sentinel-2 and PRISMA Satellite Data. Remote Sens. 2022, 14, 714. [Google Scholar] [CrossRef]
  6. Song, X.P.; Huang, W.; Hansen, M.C.; Potapov, P. An Evaluation of Landsat, Sentinel-2, Sentinel-1 and MODIS Data for Crop Type Mapping. Sci. Remote Sens. 2021, 3, 100018. [Google Scholar] [CrossRef]
  7. Rapčan, I.; Radočaj, D.; Jurišić, M. A Length-of-Season Analysis for Maize Cultivation from the Land- Surface Phenology Metrics Using the Sentinel-2 Images. Poljoprivreda 2025, 31, 92–98. [Google Scholar] [CrossRef]
  8. Wen, H.; Sun, Z.; Yang, F.; Zhang, G. Aridity Regulates the Vital Drivers of Soil Organic Carbon Content in the Northeast China. Catena 2025, 257, 109192. [Google Scholar] [CrossRef]
  9. Subašić, D.G.; Rapčan, I.; Jurišić, M.; Petrović, D.; Radočaj, D. The Effect of Irrigation on the Yield and Soybean (Glycine Max L. Merr.) Seed Germination in the Three Climatically Varying Years. Poljoprivreda 2024, 30, 17–24. [Google Scholar] [CrossRef]
  10. Kumar, S.; David Raj, A.; Justin George, K.; Chatterjee, U. Digital Terrain Analysis for Characterization of Terrain Variables Governing Soil Erosion and Watershed Hydrology. In Surface, Sub-Surface Hydrology and Management; Springer Geography; Springer: Cham, Switzerland, 2025; Part F207; pp. 469–490. [Google Scholar] [CrossRef]
  11. Li, T.; Cui, L.; Kuhnert, M.; McLaren, T.I.; Pandey, R.; Liu, H.; Wang, W.; Xu, Z.; Xia, A.; Dalal, R.C.; et al. A Comprehensive Review of Soil Organic Carbon Estimates: Integrating Remote Sensing and Machine Learning Technologies. J. Soils Sediments 2024, 24, 3556–3571. [Google Scholar] [CrossRef]
  12. De Caires, S.A.; Martin, C.S.; Atwell, M.A.; Kaya, F.; Wuddivira, G.A.; Wuddivira, M.N. Advancing Soil Mapping and Management Using Geostatistics and Integrated Machine Learning and Remote Sensing Techniques: A Synoptic Review. Discov. Soil 2025, 2, 53. [Google Scholar] [CrossRef]
  13. Radočaj, D.; Gašparović, M.; Radočaj, P.; Jurišić, M. Geospatial Prediction of Total Soil Carbon in European Agricultural Land Based on Deep Learning. Sci. Total Environ. 2024, 912, 169647. [Google Scholar] [CrossRef]
  14. Zhu, C.; Wei, Y.; Zhu, F.; Lu, W.; Fang, Z.; Li, Z.; Pan, J. Digital Mapping of Soil Organic Carbon Based on Machine Learning and Regression Kriging. Sensors 2022, 22, 8997. [Google Scholar] [CrossRef] [PubMed]
  15. Emadi, M.; Taghizadeh-Mehrjardi, R.; Cherati, A.; Danesh, M.; Mosavi, A.; Scholten, T. Predicting and Mapping of Soil Organic Carbon Using Machine Learning Algorithms in Northern Iran. Remote Sens. 2020, 12, 2234. [Google Scholar] [CrossRef]
  16. Brungard, C.; Nauman, T.; Duniway, M.; Veblen, K.; Nehring, K.; White, D.; Salley, S.; Anchang, J. Regional Ensemble Modeling Reduces Uncertainty for Digital Soil Mapping. Geoderma 2021, 397, 114998. [Google Scholar] [CrossRef]
  17. Piikki, K.; Wetterlind, J.; Söderström, M.; Stenberg, B. Perspectives on Validation in Digital Soil Mapping of Continuous Attributes—A Review. Soil Use Manag. 2021, 37, 7–21. [Google Scholar] [CrossRef]
  18. Radočaj, D.; Jug, D.; Jug, I.; Jurišić, M. A Comprehensive Evaluation of Machine Learning Algorithms for Digital Soil Organic Carbon Mapping on a National Scale. Appl. Sci. 2024, 14, 9990. [Google Scholar] [CrossRef]
  19. Broeg, T.; Blaschek, M.; Seitz, S.; Taghizadeh-Mehrjardi, R.; Zepp, S.; Scholten, T. Transferability of Covariates to Predict Soil Organic Carbon in Cropland Soils. Remote Sens. 2023, 15, 876. [Google Scholar] [CrossRef]
  20. Sakhaee, A.; Gebauer, A.; Ließ, M.; Don, A. Spatial Prediction of Organic Carbon in German Agricultural Topsoil Using Machine Learning Algorithms. Soil 2022, 8, 587–604. [Google Scholar] [CrossRef]
  21. Guo, Z.; Li, Y.; Wang, X.; Gong, X.; Chen, Y.; Cao, W. Remote Sensing of Soil Organic Carbon at Regional Scale Based on Deep Learning: A Case Study of Agro-Pastoral Ecotone in Northern China. Remote Sens. 2023, 15, 3846. [Google Scholar] [CrossRef]
  22. Allgaier, J.; Pryss, R. Cross-Validation Visualized: A Narrative Guide to Advanced Methods. Mach. Learn. Knowl. Extr. 2024, 6, 1378–1388. [Google Scholar] [CrossRef]
  23. Seraj, A.; Mohammadi-Khanaposhtani, M.; Daneshfar, R.; Naseri, M.; Esmaeili, M.; Baghban, A.; Habibzadeh, S.; Eslamian, S. Cross-Validation. In Handbook of HydroInformatics: Volume I: Classic Soft-Computing Techniques; Elsevier: Amsterdam, The Netherlands, 2023; pp. 89–105. [Google Scholar] [CrossRef]
  24. Lyons, M.B.; Keith, D.A.; Phinn, S.R.; Mason, T.J.; Elith, J. A Comparison of Resampling Methods for Remote Sensing Classification and Accuracy Assessment. Remote Sens. Environ. 2018, 208, 145–153. [Google Scholar] [CrossRef]
  25. Radočaj, D.; Jurišić, M. Comparative Evaluation of Ensemble Machine Learning Models for Methane Production from Anaerobic Digestion. Fermentation 2025, 11, 130. [Google Scholar] [CrossRef]
  26. Peng, Y.; Zhou, W.; Xiao, J.; Liu, H.; Wang, T.; Wang, K. Comparison of Soil Organic Carbon Prediction Accuracy Under Different Habitat Patches Division Methods on the Tibetan Plateau. Land Degrad. Dev. 2025, 1–14. [Google Scholar] [CrossRef]
  27. Adhikari, K.; Mishra, U.; Owens, P.R.; Libohova, Z.; Wills, S.A.; Riley, W.J.; Hoffman, F.M.; Smith, D.R. Importance and Strength of Environmental Controllers of Soil Organic Carbon Changes with Scale. Geoderma 2020, 375, 114472. [Google Scholar] [CrossRef]
  28. Song, X.D.; Wu, H.Y.; Ju, B.; Liu, F.; Yang, F.; Li, D.C.; Zhao, Y.G.; Yang, J.L.; Zhang, G.L. Pedoclimatic Zone-Based Three-Dimensional Soil Organic Carbon Mapping in China. Geoderma 2020, 363, 114145. [Google Scholar] [CrossRef]
  29. Nauman, T.W.; Duniway, M.C. Relative Prediction Intervals Reveal Larger Uncertainty in 3D Approaches to Predictive Digital Soil Mapping of Soil Properties with Legacy Data. Geoderma 2019, 347, 170–184. [Google Scholar] [CrossRef]
  30. Li, X.; Ding, J.; Liu, J.; Ge, X.; Zhang, J. Digital Mapping of Soil Organic Carbon Using Sentinel Series Data: A Case Study of the Ebinur Lake Watershed in Xinjiang. Remote Sens. 2021, 13, 769. [Google Scholar] [CrossRef]
  31. Chen, Z.; Chen, L.; Lu, R.; Lou, Z.; Zhou, F.; Jin, Y.; Xue, J.; Guo, H.; Wang, Z.; Wang, Y.; et al. A National Soil Organic Carbon Density Dataset (2010–2024) in China. Sci. Data 2025, 12, 1480. [Google Scholar] [CrossRef] [PubMed]
  32. Rainford, S.K.; Martín-López, J.M.; Da Silva, M. Approximating Soil Organic Carbon Stock in the Eastern Plains of Colombia. Front. Environ. Sci. 2021, 9, 685819. [Google Scholar] [CrossRef]
  33. Zhou, T.; Geng, Y.; Ji, C.; Xu, X.; Wang, H.; Pan, J.; Bumberger, J.; Haase, D.; Lausch, A. Prediction of Soil Organic Carbon and the C:N Ratio on a National Scale Using Machine Learning and Satellite Data: A Comparison between Sentinel-2, Sentinel-3 and Landsat-8 Images. Sci. Total Environ. 2021, 755, 142661. [Google Scholar] [CrossRef]
  34. Yang, L.; Cai, Y.; Zhang, L.; Guo, M.; Li, A.; Zhou, C. A Deep Learning Method to Predict Soil Organic Carbon Content at a Regional Scale Using Satellite-Based Phenology Variables. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102428. [Google Scholar] [CrossRef]
  35. Duarte, E.; Zagal, E.; Barrera, J.A.; Dube, F.; Casco, F.; Hernández, A.J. Digital Mapping of Soil Organic Carbon Stocks in the Forest Lands of Dominican Republic. Eur. J. Remote Sens. 2022, 55, 213–231. [Google Scholar] [CrossRef]
  36. de Arruda, D.L.; Ker, J.C.; Veloso, G.V.; Henriques, R.J.; Fernandes-Filho, E.I.; Camêlo, D.d.L.; Gomes, L.d.C.; Schaefer, C.E.G.R. Soil Carbon Prediction in Marajó Island Wetlands. Rev. Bras. Cienc. Solo 2024, 48, e0230162. [Google Scholar] [CrossRef]
  37. Veronesi, F.; Schillaci, C. Comparison between Geostatistical and Machine Learning Models as Predictors of Topsoil Organic Carbon with a Focus on Local Uncertainty Estimation. Ecol. Indic. 2019, 101, 1032–1044. [Google Scholar] [CrossRef]
  38. Fathizad, H.; Taghizadeh-Mehrjardi, R.; Hakimzadeh Ardakani, M.A.; Zeraatpisheh, M.; Heung, B.; Scholten, T. Spatiotemporal Assessment of Soil Organic Carbon Change Using Machine-Learning in Arid Regions. Agronomy 2022, 12, 628. [Google Scholar] [CrossRef]
  39. Zhang, L.; Cai, Y.; Huang, H.; Li, A.; Yang, L.; Zhou, C.A.; Arrouays, D.; Vaudour, E.; Zhang, L.; Cai, Y.; et al. A CNN-LSTM Model for Soil Organic Carbon Content Prediction with Long Time Series of MODIS-Based Phenological Variables. Remote Sens. 2022, 14, 4441. [Google Scholar] [CrossRef]
  40. Tan, Q.; Geng, J.; Fang, H.; Li, Y.; Guo, Y. Exploring the Impacts of Data Source, Model Types and Spatial Scales on the Soil Organic Carbon Prediction: A Case Study in the Red Soil Hilly Region of Southern China. Remote Sens. 2022, 14, 5151. [Google Scholar] [CrossRef]
  41. Mousavi, A.; Karimi, A.; Maleki, S.; Safari, T.; Taghizadeh-Mehrjardi, R. Digital Mapping of Selected Soil Properties Using Machine Learning and Geostatistical Techniques in Mashhad Plain, Northeastern Iran. Environ. Earth Sci. 2023, 82, 234. [Google Scholar] [CrossRef]
  42. Wang, L.J.; Cheng, H.; Yang, L.C.; Zhao, Y.G. Soil Organic Carbon Mapping in Cultivated Land Using Model Ensemble Methods. Arch. Agron. Soil Sci. 2022, 68, 1711–1725. [Google Scholar] [CrossRef]
  43. Baltensweiler, A.; Walthert, L.; Hanewinkel, M.; Zimmermann, S.; Nussbaum, M. Machine Learning Based Soil Maps for a Wide Range of Soil Properties for the Forested Area of Switzerland. Geoderma Reg. 2021, 27, e00437. [Google Scholar] [CrossRef]
  44. Guo, L.; Fu, P.; Shi, T.; Chen, Y.; Zeng, C.; Zhang, H.; Wang, S. Exploring Influence Factors in Mapping Soil Organic Carbon on Low-Relief Agricultural Lands Using Time Series of Remote Sensing Data. Soil Tillage Res. 2021, 210, 104982. [Google Scholar] [CrossRef]
  45. Farooq, I.; Bangroo, S.A.; Bashir, O.; Shah, T.I.; Malik, A.A.; Iqbal, A.M.; Mahdi, S.S.; Wani, O.A.; Nazir, N.; Biswas, A. Comparison of Random Forest and Kriging Models for Soil Organic Carbon Mapping in the Himalayan Region of Kashmir. Land 2022, 11, 2180. [Google Scholar] [CrossRef]
  46. Oukhattar, M.; Gadal, S.; Robert, Y.; Saby, N.; Houmma, I.H.; Keller, C. Variability Analysis of Soil Organic Carbon Content across Land Use Types and Its Digital Mapping Using Machine Learning and Deep Learning Algorithms. Envron. Monit. Assess. 2025, 197, 535. [Google Scholar] [CrossRef]
  47. Beck, H.E.; McVicar, T.R.; Vergopolan, N.; Berg, A.; Lutsko, N.J.; Dufour, A.; Zeng, Z.; Jiang, X.; van Dijk, A.I.J.M.; Miralles, D.G. High-Resolution (1 km) Köppen-Geiger Maps for 1901–2099 Based on Constrained CMIP6 Projections. Sci. Data 2023, 10, 724. [Google Scholar] [CrossRef]
  48. Rodríguez-Rastrero, M.; Ortega-Martos, A.; Cicuéndez, V. Soil and Land Cover Interrelationships: An Analysis Based on the Jenny’s Equation. Soil Syst. 2023, 7, 31. [Google Scholar] [CrossRef]
  49. Orgiazzi, A.; Ballabio, C.; Panagos, P.; Jones, A.; Fernández-Ugalde, O. LUCAS Soil, the Largest Expandable Soil Dataset for Europe: A Review. Eur. J. Soil. Sci. 2018, 69, 140–153. [Google Scholar] [CrossRef]
  50. Garosi, Y.; Ayoubi, S.; Nussbaum, M.; Sheklabadi, M. Effects of Different Sources and Spatial Resolutions of Environmental Covariates on Predicting Soil Organic Carbon Using Machine Learning in a Semi-Arid Region of Iran. Geoderma Reg. 2022, 29, e00513. [Google Scholar] [CrossRef]
  51. Karger, D.N.; Conrad, O.; Böhner, J.; Kawohl, T.; Kreft, H.; Soria-Auza, R.W.; Zimmermann, N.E.; Linder, H.P.; Kessler, M. Climatologies at High Resolution for the Earth’s Land Surface Areas. Sci. Data 2017, 4, 170122. [Google Scholar] [CrossRef]
  52. SRTM CGIAR-CSI SRTM—SRTM 90 m DEM Digital Elevation Database. Available online: https://srtm.csi.cgiar.org/ (accessed on 19 September 2025).
  53. Hijmans, R.J. Spatial Data Analysis [R Package Terra Version 1.8-60]. CRAN: Contributed Packages 2025. Available online: https://CRAN.R-project.org/package=terra (accessed on 2 September 2025).
  54. Rouse, J.W., Jr.; Haas, R.H.; Schell, J.A.; Deering, D.W. Monitoring Vegetation Systems in the Great Plains with ERTS. In Third Earth Resources Technology Satellite-1 Symposium. Volume 1: Technical Presentations, Section A; Goddard Space Flight Center, NASA: Greenbelt, MD, USA, 1974. [Google Scholar]
  55. Gitelson, A.A.; Kaufman, Y.J.; Merzlyak, M.N. Use of a Green Channel in Remote Sensing of Global Vegetation from EOS-MODIS. Remote Sens. Environ. 1996, 58, 289–298. [Google Scholar] [CrossRef]
  56. Huete, A.R.; Liu, H.Q.; Batchily, K.; Van Leeuwen, W. A Comparison of Vegetation Indices over a Global Set of TM Images for EOS-MODIS. Remote Sens. Environ. 1997, 59, 440–451. [Google Scholar] [CrossRef]
  57. Huete, A.R. A Soil-Adjusted Vegetation Index (SAVI). Remote Sens. Environ. 1988, 25, 295–309. [Google Scholar] [CrossRef]
  58. Wilson, E.H.; Sader, S.A. Detection of Forest Harvest Type Using Multiple Dates of Landsat TM Imagery. Remote Sens. Environ. 2002, 80, 385–396. [Google Scholar] [CrossRef]
  59. Hunt, E.R.; Rock, B.N. Detection of Changes in Leaf Water Content Using Near- and Middle-Infrared Reflectances. Remote Sens. Environ. 1989, 30, 43–54. [Google Scholar] [CrossRef]
  60. Gitelson, A.A.; Gritz, Y.; Merzlyak, M.N. Relationships between Leaf Chlorophyll Content and Spectral Reflectance and Algorithms for Non-Destructive Chlorophyll Assessment in Higher Plant Leaves. J. Plant Physiol. 2003, 160, 271–282. [Google Scholar] [CrossRef]
  61. Nguyen, C.T.; Chidthaisong, A.; Diem, P.K.; Huo, L.Z. A Modified Bare Soil Index to Identify Bare Land Features during Agricultural Fallow-Period in Southeast Asia Using Landsat 8. Land 2021, 10, 231. [Google Scholar] [CrossRef]
  62. McFeeters, S.K. The Use of the Normalized Difference Water Index (NDWI) in the Delineation of Open Water Features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  63. Justice, C.O.; Townshend, J.R.G.; Vermote, E.F.; Masuoka, E.; Wolfe, R.E.; Saleous, N.; Roy, D.P.; Morisette, J.T. An Overview of MODIS Land Data Processing and Product Status. Remote Sens. Environ. 2002, 83, 3–15. [Google Scholar] [CrossRef]
  64. Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms. In Ensemble Methods; Chapman and Hall/CRC: Boca Raton, FL, USA, 2025. [Google Scholar] [CrossRef]
  65. Shahhosseini, M.; Hu, G.; Pham, H. Optimizing Ensemble Weights and Hyperparameters of Machine Learning Models for Regression Problems. Mach. Learn. Appl. 2022, 7, 100251. [Google Scholar] [CrossRef]
  66. Wu, H.; Levinson, D. The Ensemble Approach to Forecasting: A Review and Synthesis. Transp. Res. Part C Emerg. Technol. 2021, 132, 103357. [Google Scholar] [CrossRef]
  67. Genuer, R.; Poggi, J.-M. Random Forests. In Random Forests with R; Springer: Cham, Switzerland, 2020; pp. 33–55. [Google Scholar] [CrossRef]
  68. Syam, N.; Kaul, R. Random Forest, Bagging, and Boosting of Decision Trees. In Machine Learning and Artificial Intelligence in Marketing and Sales; Emerald Publishing Limited: Leeds, UK, 2021; pp. 139–182. [Google Scholar] [CrossRef]
  69. Breiman, L.; Cutler, A.; Liaw, A.; Wiener, M. RandomForest: Breiman and Cutlers Random Forests for Classification and Regression. CRAN: Contributed Packages 2002. Available online: https://CRAN.R-project.org/package=randomForest (accessed on 19 September 2025).
  70. John, K.; Kebonye, N.M.; Agyeman, P.C.; Ahado, S.K. Comparison of Cubist Models for Soil Organic Carbon Prediction via Portable XRF Measured Data. Environ. Monit. Assess. 2021, 193, 197. [Google Scholar] [CrossRef]
  71. Kuhn, M.; Quinlan, R. Rule- and Instance-Based Regression Modeling [R Package Cubist Version 0.5.0]. CRAN: Contributed Packages 2025. Available online: https://CRAN.R-project.org/package=Cubist (accessed on 19 September 2025).
  72. Montesinos López, O.A.; Montesinos López, A.; Crossa, J. Support Vector Machines and Support Vector Regression. In Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer: Cham, Switzerland, 2022; pp. 337–378. [Google Scholar] [CrossRef]
  73. Karatzoglou, A.; Smola, A.; Hornik, K. Kernel-Based Machine Learning Lab [R Package Kernlab Version 0.9-33]. CRAN: Contributed Packages 2024. Available online: https://CRAN.R-project.org/package=kernlab (accessed on 19 September 2025).
  74. Zhang, F.; O’Donnell, L.J. Support Vector Regression. In Machine Learning: Methods and Applications to Brain Disorders; Academic Press: Cambridge, MA, USA, 2020; pp. 123–140. [Google Scholar] [CrossRef]
  75. Mullachery, V.; Khera, A.; Husain, A. Bayesian Neural Networks. arXiv 2018, arXiv:1801.07710. [Google Scholar]
  76. Perez Rodriguez, P.; Gianola, D. Bayesian Regularization for Feed-Forward Neural Networks [R Package Brnn Version 0.9.4]. CRAN: Contributed Packages 2025. Available online: https://CRAN.R-project.org/package=brnn (accessed on 19 September 2025).
  77. Xu, Y.; Goodacre, R. On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. J. Anal. Test. 2018, 2, 249–262. [Google Scholar] [CrossRef]
  78. Nguyen, Q.H.; Ly, H.B.; Ho, L.S.; Al-Ansari, N.; Van Le, H.; Tran, V.Q.; Prakash, I.; Pham, B.T. Influence of Data Splitting on Performance of Machine Learning Models in Prediction of Shear Strength of Soil. Math. Probl. Eng. 2021, 2021, 4832864. [Google Scholar] [CrossRef]
  79. Hodson, T.O. Root-Mean-Square Error (RMSE) or Mean Absolute Error (MAE): When to Use Them or Not. Geosci. Model. Dev. 2022, 15, 5481–5487. [Google Scholar] [CrossRef]
  80. Daroussin, J.; King, D.; Bas, C.L.; Vrščaj, B.; Dobos, E.; Montanarella, L. Chapter 4 The Soil Geographical Database of Eurasia at Scale 1:1,000,000: History and Perspective in Digital Soil Mapping. Dev. Soil Sci. 2006, 31, 55–602. [Google Scholar] [CrossRef]
  81. Schad, P. World Reference Base for Soil Resources—Its Fourth Edition and Its History. J. Plant Nutr. Soil Sci. 2023, 186, 151–163. [Google Scholar] [CrossRef]
  82. Taghizadeh-Mehrjardi, R.; Schmidt, K.; Amirian-Chakan, A.; Rentschler, T.; Zeraatpisheh, M.; Sarmadian, F.; Valavi, R.; Davatgar, N.; Behrens, T.; Scholten, T. Improving the Spatial Prediction of Soil Organic Carbon Content in Two Contrasting Climatic Regions by Stacking Machine Learning Models and Rescanning Covariate Space. Remote Sens. 2020, 12, 1095. [Google Scholar] [CrossRef]
  83. Bhagat, M.; Bakariya, B. Implementation of Logistic Regression on Diabetic Dataset Using Train-Test-Split, K-Fold and Stratified K-Fold Approach. Natl. Acad. Sci. Lett. 2022, 45, 401–404. [Google Scholar] [CrossRef]
  84. Lewis, M.J.; Spiliopoulou, A.; Goldmann, K.; Pitzalis, C.; McKeigue, P.; Barnes, M.R. Nestedcv: An R Package for Fast Implementation of Nested Cross-Validation with Embedded Feature Selection Designed for Transcriptomics and High-Dimensional Data. Bioinform. Adv. 2023, 3, vbad048. [Google Scholar] [CrossRef]
  85. Zhong, Y.; Chalise, P.; He, J. Nested Cross-Validation with Ensemble Feature Selection and Classification Model for High-Dimensional Biological Data. Commun. Stat. Simul. Comput. 2023, 52, 110–125. [Google Scholar] [CrossRef]
  86. Nduati, E.; Sofue, Y.; Matniyaz, A.; Park, J.G.; Yang, W.; Kondoh, A. Cropland Mapping Using Fusion of Multi-Sensor Data in a Complex Urban/Peri-Urban Area. Remote Sens. 2019, 11, 207. [Google Scholar] [CrossRef]
  87. Raviv, L.; Lupyan, G.; Green, S.C. How Variability Shapes Learning and Generalization. Trends Cogn. Sci. 2022, 26, 462–483. [Google Scholar] [CrossRef]
  88. Reddy, N.N.; Chakraborty, P.; Roy, S.; Singh, K.; Minasny, B.; McBratney, A.B.; Biswas, A.; Das, B.S. Legacy Data-Based National-Scale Digital Mapping of Key Soil Properties in India. Geoderma 2021, 381, 114684. [Google Scholar] [CrossRef]
  89. Zhou, J.; Gandomi, A.H.; Chen, F.; Holzinger, A. Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics. Electronics 2021, 10, 593. [Google Scholar] [CrossRef]
  90. Rainio, O.; Teuho, J.; Klén, R. Evaluation Metrics and Statistical Tests for Machine Learning. Sci. Rep. 2024, 14, 6086. [Google Scholar] [CrossRef]
  91. Iyengar, G.; Lam, H.; Wang, T. Is Cross-Validation the Gold Standard to Evaluate Model Performance? arXiv 2024, arXiv:2407.02754. [Google Scholar] [CrossRef]
  92. Wiesmeier, M.; Urbanski, L.; Hobley, E.; Lang, B.; von Lützow, M.; Marin-Spiotta, E.; van Wesemael, B.; Rabot, E.; Ließ, M.; Garcia-Franco, N.; et al. Soil Organic Carbon Storage as a Key Function of Soils—A Review of Drivers and Indicators at Various Scales. Geoderma 2019, 333, 149–162. [Google Scholar] [CrossRef]
  93. Chen, S.; Martin, M.P.; Saby, N.P.A.; Walter, C.; Angers, D.A.; Arrouays, D. Fine Resolution Map of Top- and Subsoil Carbon Sequestration Potential in France. Sci. Total Environ. 2018, 630, 389–400. [Google Scholar] [CrossRef] [PubMed]
  94. Mulder, V.L.; Lacoste, M.; Richer-de-Forges, A.C.; Arrouays, D. GlobalSoilMap France: High-Resolution Spatial Modelling the Soils of France up to Two Meter Depth. Sci. Total Environ. 2016, 573, 1352–1369. [Google Scholar] [CrossRef]
  95. Zhang, X.; Xue, J.; Chen, S.; Wang, N.; Shi, Z.; Huang, Y.; Zhuo, Z. Digital Mapping of Soil Organic Carbon with Machine Learning in Dryland of Northeast and North Plain China. Remote Sens. 2022, 14, 2504. [Google Scholar] [CrossRef]
  96. Zhang, W.; Wan, H.; Zhou, M.; Wu, W.; Liu, H. Soil Total and Organic Carbon Mapping and Uncertainty Analysis Using Machine Learning Techniques. Ecol. Indic. 2022, 143, 109420. [Google Scholar] [CrossRef]
  97. Chicco, D.; Warrens, M.J.; Jurman, G. The Coefficient of Determination R-Squared Is More Informative than SMAPE, MAE, MAPE, MSE and RMSE in Regression Analysis Evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
  98. Arlot, S.; Genuer, R. Analysis of Purely Random Forests Bias. arXiv 2014, arXiv:1407.3939. [Google Scholar] [CrossRef]
  99. Wainer, J.; Cawley, G. Nested Cross-Validation When Selecting Classifiers Is Overzealous for Most Practical Applications. Expert Syst. Appl. 2021, 182, 115222. [Google Scholar] [CrossRef]
  100. Zádorová, T.; Penížek, V.; Žížala, D.; Matějovský, J.; Vaněk, A. Influence of Former Lynchets on Soil Cover Structure and Soil Organic Carbon Storage in Agricultural Land, Central Czechia. Soil Use Manag. 2018, 34, 60–71. [Google Scholar] [CrossRef]
  101. Cukor, J.; Vacek, Z.; Linda, R.; Bílek, L. Carbon Sequestration in Soil Following Afforestation of Former Agricultural Land in the Czech Republic. Cent. Eur. For. J. 2017, 63, 97–104. [Google Scholar] [CrossRef]
  102. Mišurec, J.; Lukeš, P.; Tomíček, J.; Koňata, P.; Klem, K. Multi-Decadal Satellite Monitoring of Soil Carbon and Its Role in Farm Carbon Footprint: A Case Study for the Czech Republic. Eur. J. Remote Sens. 2025, 58, 2562069. [Google Scholar] [CrossRef]
  103. Voltr, V.; Menšík, L.; Hlisnikovský, L.; Hruška, M.; Pokorný, E.; Pospíšilová, L. The Soil Organic Matter in Connection with Soil Properties and Soil Inputs. Agronomy 2021, 11, 779. [Google Scholar] [CrossRef]
  104. Goidts, E.; van Wesemael, B. Regional Assessment of Soil Organic Carbon Changes under Agriculture in Southern Belgium (1955–2005). Geoderma 2007, 141, 341–354. [Google Scholar] [CrossRef]
  105. Chen, S.; Mulder, V.L.; Heuvelink, G.B.M.; Poggio, L.; Caubet, M.; Román Dobarco, M.; Walter, C.; Arrouays, D. Model Averaging for Mapping Topsoil Organic Carbon in France. Geoderma 2020, 366, 114237. [Google Scholar] [CrossRef]
  106. Soucémarianadin, L.N.; Cécillon, L.; Guenet, B.; Chenu, C.; Baudin, F.; Nicolas, M.; Girardin, C.; Barré, P. Environmental Factors Controlling Soil Organic Carbon Stability in French Forest Soils. Plant Soil 2018, 426, 267–286. [Google Scholar] [CrossRef]
  107. Le Bissonnais, Y.; Montier, C.; Jamagne, M.; Daroussin, J.; King, D. Mapping Erosion Risk for Cultivated Soil in France. Catena 2002, 46, 207–220. [Google Scholar] [CrossRef]
  108. Richer-de-Forges, A.C.; Chen, Q.; Baghdadi, N.; Chen, S.; Gomez, C.; Jacquemoud, S.; Martelet, G.; Mulder, V.L.; Urbina-Salazar, D.; Vaudour, E.; et al. Remote Sensing Data for Digital Soil Mapping in French Research—A Review. Remote Sens. 2023, 15, 3070. [Google Scholar] [CrossRef]
  109. Markham, K.; Frazier, A.E.; Singh, K.K.; Madden, M. A Review of Methods for Scaling Remotely Sensed Data for Spatial Pattern Analysis. Landsc. Ecol. 2023, 38, 619–635. [Google Scholar] [CrossRef]
Figure 1. Study area including mainland France and Czechia with soil samples from the LUCAS 2018 dataset used in the study.
Figure 2. Violin plots of SOC values from all input soil samples from the LUCAS 2018 database in the study area.
Figure 3. Boxplots of statistical metrics representing prediction accuracy per fold in k-fold cross-validation.
Figure 4. Scatterplots of observed and predicted SOC based on the most accurate evaluated machine learning approach according to k-fold variations.
Figure 5. Relative variable importances based on the most accurate evaluated machine learning approach according to k-fold variations.
Table 1. Accuracy assessment approaches and statistical metrics from recent digital SOC mapping studies.
| Country | Soil Samples | Sampling Depth (cm) | Sampling Density (km2 per Sample) | Mean SOC | Accuracy Assessment Method | R2 | RMSE | Lowest NRMSE | Reference |
|---|---|---|---|---|---|---|---|---|---|
| China | 313 | 0–20 | 7987.22 | 4.85 kg C m−2 | CV | 0.33–0.42 | 2.64–2.84 | 0.54 | [26] |
| USA | 6213 | 0–30 | 1582.81 | 94.90 mg·ha−1 | SS | 0.38–0.53 | 0.51–0.54 | 0.01 | [27] |
| China | 8021 | 0–5 | 1196.48 | 19.09 g·kg−1 | CV | 0.06–0.42 | 0.47–1.11 | 0.02 | [28] |
| | | 5–15 | | 17.22 g·kg−1 | | 0.03–0.46 | 0.45–0.99 | 0.03 | |
| | | 15–30 | | 12.84 g·kg−1 | | 0.03–0.40 | 0.51–1.16 | 0.04 | |
| China | 644 | 0–30 | 1009.32 | 3.86 kg·m−2 | CV | 0.28–0.40 | 5.76–11.76 | 1.49 | [21] |
| USA | 673 | 0–15 | 641.90 | 1.16% | CV | 0.51 | 1.69 | 1.46 | [29] |
| | 630 | 15–30 | 685.71 | 0.85% | | 0.39 | 1.48 | 1.74 | |
| China | 105 | 0–10 | 480.00 | 13.42 g·kg−1 | CV | 0.30–0.43 | 5.28–8.51 | 0.39 | [30] |
| China | 23,103 | 0–80 | 415.39 | 3.85 kg·m−2 | CV | 0.83 | 1.93 | 0.50 | [31] |
| Colombia | 653 | 0–30 | 398.16 | 15.00 g·kg−1 | CV | 0.50 | 0.46 | 0.03 | [32] |
| Switzerland | 150 | 0–20 | 273.33 | 43.93 g·kg−1 | CV | 0.12–0.47 | 0.44–0.56 | 0.01 | [33] |
| China | 733 | 0–20 | 191.00 | 13.11 g·kg−1 | CV | 0.10–0.60 | 4.50–6.00 | 0.34 | [34] |
| Dominican Republic | 268 | 0–15 | 179.84 | 110.35 mg·ha−1 | SS | 0.77–0.83 | 35.00–38.60 | 0.32 | [35] |
| Brazil | 81 | 0–20 | 178.56 | 17.6 g·kg−1 | CV | 0.04–0.51 | 1.75–3.02 | 0.10 | [36] |
| Germany | 475 | 0–30 | 149.47 | 2.63% | CV | 0.42–0.68 | 1.42–1.60 | 0.54 | [19] |
| | | | 75.79 | 1.74% | | 0.30–0.48 | 1.37–1.44 | 0.79 | |
| Germany | 3104 | 0–30 | 115.21 | 28.00 g·kg−1 | CV | / | 21.00–34.00 | 0.75 | [20] |
| Italy | 414 | 0–30 | 60.39 | 1.51% | SS | / | 0.70 | 0.46 | [37] |
| Iran | 201 | 0–20 | 24.02 | 0.32% | CV | 0.41–0.54 | 0.08–0.18 | 0.25 | [38] |
| China | 308 | 0–20 | 18.08 | 12.62 g·kg−1 | CV | 0.23–0.35 | 5.00–5.50 | 0.40 | [39] |
| China | 186 | 0–20 | 14.34 | 23.78 g·kg−1 | SS | 0.11–0.49 | 3.90–5.28 | 0.16 | [40] |
| China | 396 | 0–10 | 9.92 | 12.56 g·kg−1 | SS | 0.46–0.58 | 3.49–3.83 | 0.28 | [39] |
| | | 10–20 | | 10.11 g·kg−1 | | 0.63–0.71 | 3.49–3.60 | 0.35 | |
| | | 20–30 | | 7.58 g·kg−1 | | 0.67–0.73 | 2.95–3.03 | 0.39 | |
| Iran | 180 | 0–10 | 8.33 | 0.86% | SS | 0.86 | 0.24 | 0.28 | [41] |
| China | 395 | 0–20 | 6.64 | 11.60 mg·kg−1 | SS | 0.32–0.42 | 1.67–1.90 | 0.14 | [42] |
| Switzerland | 2071 | 0–5 | 6.28 | 6.05% | SS | 0.10–0.22 | 4.85–5.16 | 0.80 | [43] |
| | | 5–15 | | 3.66% | | 0.21–0.29 | 3.78–4.00 | 1.03 | |
| | | 15–30 | | 2.20% | | 0.23–0.32 | 2.83–3.03 | 1.29 | |
| China | 181 | 0–15 | 4.08 | 1.70% | SS | 0.20–0.56 | 0.20–0.26 | 0.12 | [44] |
| | | | | 1.03% | | 0.19–0.53 | 0.25–0.33 | 0.24 | |
| India | 83 | 0–30 | 3.73 | 26.48 mg·ha−1 | SS | 0.90 | 8.21 | 0.31 | [45] |
| France | 162 | 0–30 | 2.06 | 8.9 g·kg−1 | CV | 0.36–0.73 | 24.8 | 2.7 | [46] |
CV: cross-validation, SS: split-sample, R2: coefficient of determination, RMSE: root mean square error, NRMSE: normalized RMSE.
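The four metrics abbreviated above can be computed directly from paired observed and predicted values. The sketch below (Python/NumPy, illustrative) assumes NRMSE is RMSE normalized by the mean of the observations, which is consistent with the "Lowest NRMSE" column in Table 1 (e.g., 2.64/4.85 ≈ 0.54 for [26]); normalization by the observed range or standard deviation are other conventions found in the literature.

```python
import numpy as np

def regression_metrics(obs, pred):
    """Compute R2, RMSE, mean-normalized NRMSE, and MAE for paired values."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    resid = obs - pred
    ss_res = np.sum(resid ** 2)                    # residual sum of squares
    ss_tot = np.sum((obs - obs.mean()) ** 2)       # total sum of squares
    rmse = np.sqrt(np.mean(resid ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": rmse,
        "NRMSE": rmse / obs.mean(),  # mean-normalized; other conventions exist
        "MAE": np.mean(np.abs(resid)),
    }
```

For example, `regression_metrics([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])` gives R2 = 0.98 and MAE = 0.15.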
Table 3. Cross-validation accuracy assessment results for SOC prediction in France and Czechia based on k-fold variation.
| Cross-Validation Approach | Machine Learning Method | France R2 | France RMSE | France NRMSE | France MAE | Czechia R2 | Czechia RMSE | Czechia NRMSE | Czechia MAE |
|---|---|---|---|---|---|---|---|---|---|
| k = 10 | RF | 0.409 | 11.60 | 0.454 | 8.64 | 0.243 | 8.21 | 0.387 | 6.25 |
| | CUB | 0.373 | 12.01 | 0.470 | 8.76 | 0.197 | 8.56 | 0.403 | 6.39 |
| | SVR | 0.392 | 11.94 | 0.467 | 8.42 | 0.228 | 8.44 | 0.398 | 6.09 |
| | BRNN | 0.382 | 11.87 | 0.464 | 8.83 | 0.236 | 8.27 | 0.390 | 6.27 |
| | ENS | 0.412 | 11.56 | 0.452 | 8.49 | 0.229 | 8.21 | 0.387 | 6.15 |
| k = 5 | RF | 0.406 | 11.63 | 0.455 | 8.67 | 0.227 | 8.27 | 0.390 | 6.27 |
| | CUB | 0.371 | 12.04 | 0.471 | 8.79 | 0.180 | 8.65 | 0.407 | 6.43 |
| | SVR | 0.387 | 11.98 | 0.469 | 8.45 | 0.218 | 8.47 | 0.399 | 6.07 |
| | BRNN | 0.378 | 11.90 | 0.466 | 8.84 | 0.220 | 8.34 | 0.393 | 6.28 |
| | ENS | 0.409 | 11.58 | 0.453 | 8.52 | 0.225 | 8.24 | 0.388 | 6.16 |
| k = 4 | RF | 0.404 | 11.65 | 0.456 | 8.69 | 0.224 | 8.28 | 0.390 | 6.26 |
| | CUB | 0.369 | 12.05 | 0.472 | 8.81 | 0.186 | 8.61 | 0.406 | 6.40 |
| | SVR | 0.385 | 12.01 | 0.470 | 8.47 | 0.213 | 8.49 | 0.400 | 6.07 |
| | BRNN | 0.378 | 11.90 | 0.466 | 8.85 | 0.211 | 8.39 | 0.395 | 6.31 |
| | ENS | 0.408 | 11.60 | 0.454 | 8.53 | 0.223 | 8.25 | 0.389 | 6.16 |
| k = 2 | RF | 0.392 | 11.77 | 0.461 | 8.80 | 0.206 | 8.38 | 0.395 | 6.32 |
| | CUB | 0.372 | 12.15 | 0.476 | 8.67 | 0.174 | 8.72 | 0.411 | 6.30 |
| | SVR | 0.374 | 12.11 | 0.474 | 8.56 | 0.206 | 8.53 | 0.402 | 6.08 |
| | BRNN | 0.369 | 11.99 | 0.469 | 8.91 | 0.192 | 8.52 | 0.402 | 6.41 |
| | ENS | 0.398 | 11.70 | 0.458 | 8.64 | 0.211 | 8.31 | 0.392 | 6.21 |
Statistical metrics indicating the highest prediction accuracy per k-fold number and study area are bolded.
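The per-fold accuracy distributions summarized in Table 3 and Figure 3 come from repeated k-fold cross-validation, with 100 repetitions per fold number in the study. A minimal sketch of the procedure using scikit-learn, with synthetic stand-in data and a reduced repetition count for brevity rather than the study's actual covariates and settings:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RepeatedKFold

# Synthetic stand-in for the SOC samples and environmental covariates
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)

# k = 10 with 2 repetitions here; the study used 100 repetitions per k
rkf = RepeatedKFold(n_splits=10, n_repeats=2, random_state=42)
fold_r2 = []
for train_idx, test_idx in rkf.split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_r2.append(r2_score(y[test_idx], model.predict(X[test_idx])))

# The spread of per-fold scores is what boxplots like Figure 3 summarize
print(f"R2 per fold: mean={np.mean(fold_r2):.3f}, sd={np.std(fold_r2):.3f}")
```

Averaging over repeated random fold assignments is what damps the partitioning randomness that a single split-sample evaluation is exposed to.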
Table 4. Optimal hyperparameters for all evaluated machine learning models for SOC prediction.
| Cross-Validation Approach | Machine Learning Method | Optimal Hyperparameters (France) | Optimal Hyperparameters (Czechia) |
|---|---|---|---|
| k = 10 | RF | mtry = 10 | mtry = 5 |
| | CUB | committees = 20, neighbors = 9 | committees = 20, neighbors = 9 |
| | SVR | σ = 0.020, C = 1 | σ = 0.019, C = 0.5 |
| | BRNN | neurons = 2 | neurons = 1 |
| k = 5 | RF | mtry = 10 | mtry = 5 |
| | CUB | committees = 20, neighbors = 9 | committees = 20, neighbors = 9 |
| | SVR | σ = 0.022, C = 1 | σ = 0.019, C = 0.5 |
| | BRNN | neurons = 2 | neurons = 1 |
| k = 4 | RF | mtry = 6 | mtry = 5 |
| | CUB | committees = 20, neighbors = 9 | committees = 20, neighbors = 9 |
| | SVR | σ = 0.021, C = 1 | σ = 0.019, C = 0.5 |
| | BRNN | neurons = 2 | neurons = 1 |
| k = 2 | RF | mtry = 6 | mtry = 2 |
| | CUB | committees = 20, neighbors = 0 | committees = 20, neighbors = 0 |
| | SVR | σ = 0.022, C = 1 | σ = 0.017, C = 0.5 |
| | BRNN | neurons = 2 | neurons = 1 |
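The hyperparameter names in Table 4 follow R conventions (mtry is the number of covariates considered at each Random Forest split; committees and neighbors belong to Cubist; σ and C parameterize the RBF-kernel SVR). Such values are typically selected by a cross-validated grid search. A rough scikit-learn analogue for tuning the Random Forest mtry (called max_features there), on synthetic stand-in data rather than the study's covariates:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (not the study's covariates)
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=1)

# mtry in R maps to max_features in scikit-learn
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=1),
    param_grid={"max_features": [2, 5, 6, 10]},  # candidate mtry values from Table 4
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="r2",
)
search.fit(X, y)
print("selected mtry:", search.best_params_["max_features"])
```

Note that the selected mtry varies with the fold number and fold assignment, which is one more way partitioning randomness propagates into the final model.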
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Radočaj, D.; Jurišić, M.; Plaščak, I.; Galić, L. Randomness in Data Partitioning and Its Impact on Digital Soil Mapping Accuracy: A Comparison of Cross-Validation and Split-Sample Approaches. Agronomy 2025, 15, 2495. https://doi.org/10.3390/agronomy15112495
