Randomness in Data Partitioning and Its Impact on Digital Soil Mapping Accuracy: A Comparison of Cross-Validation and Split-Sample Approaches
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Area and SOC Sampling Data
2.2. Environmental Covariates Used for Digital SOC Mapping
| Spectral Index | Equation | Reference |
|---|---|---|
| Normalized Difference Vegetation Index (NDVI) | | [54] |
| Green Normalized Difference Vegetation Index (GNDVI) | | [55] |
| Enhanced Vegetation Index (EVI) | | [56] |
| Soil Adjusted Vegetation Index (SAVI) | | [57] |
| Normalized Difference Moisture Index (NDMI) | | [58] |
| Moisture Stress Index (MSI) | | [59] |
| Green Chlorophyll Index (GCI) | | [60] |
| Bare Soil Index (BSI) | | [61] |
| Normalized Difference Water Index (NDWI) | | [62] |
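
Most of the indices listed above are simple band ratios, so a minimal sketch can make them concrete. The functions below use the commonly published forms of seven of the indices. The band arguments (`nir`, `red`, `green`, `swir1`), the default `soil_factor` of 0.5, and the example reflectances are assumptions for illustration; the exact band choices and coefficients used in the study may differ. EVI and BSI are left out because their published variants differ in coefficients and band combinations.

```python
def normalized_difference(a: float, b: float) -> float:
    """Generic normalized difference (a - b) / (a + b)."""
    return (a - b) / (a + b)


def ndvi(nir: float, red: float) -> float:
    return normalized_difference(nir, red)


def gndvi(nir: float, green: float) -> float:
    return normalized_difference(nir, green)


def ndmi(nir: float, swir1: float) -> float:
    return normalized_difference(nir, swir1)


def ndwi(green: float, nir: float) -> float:
    # Open-water form: green minus near-infrared.
    return normalized_difference(green, nir)


def savi(nir: float, red: float, soil_factor: float = 0.5) -> float:
    # soil_factor is the soil brightness correction L; 0.5 is an assumed default.
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)


def msi(swir1: float, nir: float) -> float:
    return swir1 / nir


def gci(nir: float, green: float) -> float:
    return nir / green - 1.0


# Hypothetical surface reflectances for one vegetated pixel.
pixel = {"green": 0.08, "red": 0.05, "nir": 0.40, "swir1": 0.20}
print("NDVI:", round(ndvi(pixel["nir"], pixel["red"]), 3))
print("SAVI:", round(savi(pixel["nir"], pixel["red"]), 3))
print("NDWI:", round(ndwi(pixel["green"], pixel["nir"]), 3))
```

In practice the same expressions would be applied to whole raster bands rather than to single pixel values.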
2.3. Ensemble Machine Learning Prediction of SOC Levels
2.4. Accuracy Assessment of Predicted SOC Levels per Data Fold
3. Results
3.1. Descriptive Statistics of Used SOC Samples
3.2. Accuracy Assessment of Predicted SOC Levels Based on Aggregated Metrics from k-Fold Cross-Validation
3.3. Stability of Accuracy Assessment Metrics According to Randomness in Training and Validation Dataset Creation per Fold
3.4. Stability of Variable Importances According to Number of Folds in Cross-Validation [84,85]
4. Discussion
4.1. Evaluation of Machine Learning Prediction Accuracy According to Input Data Properties
4.2. The Impact of Selected Accuracy Assessment Approaches and Statistical Metrics on Prediction Accuracy
4.3. The Impact of Selected Accuracy Assessment Approaches on Relative Importance of Environmental Covariates
4.4. Study Limitations and Future Considerations
5. Conclusions
- The ensemble machine learning approach was the most accurate for SOC prediction in France under 10-fold cross-validation, producing an R2 of 0.412, which is on par with the prediction accuracy achieved in similar previous studies.
- All evaluated k-fold numbers agreed on the optimal model. The most accurate results came from 10-fold cross-validation, whereas 2-fold cross-validation indicated lower accuracy across all four metrics and both study areas, likely because it leaves less training data than the other k-fold numbers and is more sensitive to randomness in data splitting.
- In both study areas, 10-fold cross-validation produced the largest variability in all accuracy measures, owing to the largest discrepancy between the quantities of training and validation data. These results suggest that moderate numbers of folds (k = 4 or 5) can be slightly more robust when using heterogeneous national or regional data.
- The choice of k-fold number had no notable impact on the relative variable importance values of the most accurate machine learning model evaluated. The DEM had dominant importance for SOC prediction in France, while two spectral indices were the most important for prediction in Czechia.
- While cross-validation is superior to the split-sample approach in its resistance to randomness in the training and validation data split, it may be susceptible to data leakage because no hold-out validation dataset is used. The results of this study therefore address one component of robustness in the accuracy assessment of digital SOC mapping, and future studies should explore additional approaches, such as nested cross-validation.
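
The claim that cross-validation resists partitioning randomness better than a split-sample assessment can be illustrated with a small simulation. The sketch below is not the study's pipeline. It assumes synthetic data, a one-variable least-squares model, 30 random seeds, an 80/20 split, and a k-fold RMSE pooled over all held-out predictions; every function name is hypothetical. It compares how much the reported RMSE changes when only the random partition changes.

```python
import math
import random
import statistics


def make_synthetic(n=200, seed=0):
    """Synthetic (x, y) pairs: a linear signal plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 3.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def squared_errors(train, valid, xs, ys):
    """Fit on the training indices, return squared errors on the validation indices."""
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in valid]


def kfold_rmse(xs, ys, k, seed):
    """RMSE pooled over the held-out predictions of all k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    pooled = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        pooled.extend(squared_errors(train, fold, xs, ys))
    return math.sqrt(statistics.fmean(pooled))


def split_sample_rmse(xs, ys, train_fraction, seed):
    """RMSE on a single random hold-out set."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    return math.sqrt(statistics.fmean(squared_errors(idx[:cut], idx[cut:], xs, ys)))


xs, ys = make_synthetic()
seeds = list(range(30))
kfold_sd = statistics.stdev([kfold_rmse(xs, ys, 5, s) for s in seeds])
split_sd = statistics.stdev([split_sample_rmse(xs, ys, 0.8, s) for s in seeds])
print(f"SD of RMSE over 30 seeds: 5-fold CV = {kfold_sd:.3f}, 80/20 split = {split_sd:.3f}")
```

Because every sample is held out exactly once in k-fold cross-validation, the pooled error depends on the seed only through the fitted models, whereas the split-sample error also depends on which samples happen to land in the validation set.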
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Mitran, T.; Suresh, J.; Sujatha, G.; Sreenivas, K.; Karak, S.; Kumar, R.; Chauhan, P.; Meena, R.S. Digital Soil Mapping: A Tool for Sustainable Soil Management. In Climate Change and Soil-Water-Plant Nexus: Agriculture and Environment; Springer: Singapore, 2024; pp. 51–95.
2. Nair, P.K.R.; Kumar, B.M.; Nair, V.D. Soil Organic Matter (SOM) and Nutrient Cycling. In An Introduction to Agroforestry: Four Decades of Scientific Developments; Springer: Cham, Switzerland, 2021; pp. 383–411.
3. Anthony, T.; Nkwunonwo, U.; Emmanuel, A.; Ganiyu, O. Environmental and Geostatistical Modelling of Soil Properties toward Precision Agriculture. Discov. Soil 2025, 2, 59.
4. Radočaj, D.; Jurišić, M. A Phenology-Based Evaluation of the Optimal Proxy for Cropland Suitability Based on Crop Yield Correlations from Sentinel-2 Image Time-Series. Agriculture 2025, 15, 859.
5. Mzid, N.; Castaldi, F.; Tolomio, M.; Pascucci, S.; Casa, R.; Pignatti, S. Evaluation of Agricultural Bare Soil Properties Retrieval from Landsat 8, Sentinel-2 and PRISMA Satellite Data. Remote Sens. 2022, 14, 714.
6. Song, X.P.; Huang, W.; Hansen, M.C.; Potapov, P. An Evaluation of Landsat, Sentinel-2, Sentinel-1 and MODIS Data for Crop Type Mapping. Sci. Remote Sens. 2021, 3, 100018.
7. Rapčan, I.; Radočaj, D.; Jurišić, M. A Length-of-Season Analysis for Maize Cultivation from the Land-Surface Phenology Metrics Using the Sentinel-2 Images. Poljoprivreda 2025, 31, 92–98.
8. Wen, H.; Sun, Z.; Yang, F.; Zhang, G. Aridity Regulates the Vital Drivers of Soil Organic Carbon Content in the Northeast China. Catena 2025, 257, 109192.
9. Subašić, D.G.; Rapčan, I.; Jurišić, M.; Petrović, D.; Radočaj, D. The Effect of Irrigation on the Yield and Soybean (Glycine Max L. Merr.) Seed Germination in the Three Climatically Varying Years. Poljoprivreda 2024, 30, 17–24.
10. Kumar, S.; David Raj, A.; Justin George, K.; Chatterjee, U. Digital Terrain Analysis for Characterization of Terrain Variables Governing Soil Erosion and Watershed Hydrology. In Surface, Sub-Surface Hydrology and Management; Springer Geography; Springer: Cham, Switzerland, 2025; Part F207; pp. 469–490.
11. Li, T.; Cui, L.; Kuhnert, M.; McLaren, T.I.; Pandey, R.; Liu, H.; Wang, W.; Xu, Z.; Xia, A.; Dalal, R.C.; et al. A Comprehensive Review of Soil Organic Carbon Estimates: Integrating Remote Sensing and Machine Learning Technologies. J. Soils Sediments 2024, 24, 3556–3571.
12. De Caires, S.A.; Martin, C.S.; Atwell, M.A.; Kaya, F.; Wuddivira, G.A.; Wuddivira, M.N. Advancing Soil Mapping and Management Using Geostatistics and Integrated Machine Learning and Remote Sensing Techniques: A Synoptic Review. Discov. Soil 2025, 2, 53.
13. Radočaj, D.; Gašparović, M.; Radočaj, P.; Jurišić, M. Geospatial Prediction of Total Soil Carbon in European Agricultural Land Based on Deep Learning. Sci. Total Environ. 2024, 912, 169647.
14. Zhu, C.; Wei, Y.; Zhu, F.; Lu, W.; Fang, Z.; Li, Z.; Pan, J. Digital Mapping of Soil Organic Carbon Based on Machine Learning and Regression Kriging. Sensors 2022, 22, 8997.
15. Emadi, M.; Taghizadeh-Mehrjardi, R.; Cherati, A.; Danesh, M.; Mosavi, A.; Scholten, T. Predicting and Mapping of Soil Organic Carbon Using Machine Learning Algorithms in Northern Iran. Remote Sens. 2020, 12, 2234.
16. Brungard, C.; Nauman, T.; Duniway, M.; Veblen, K.; Nehring, K.; White, D.; Salley, S.; Anchang, J. Regional Ensemble Modeling Reduces Uncertainty for Digital Soil Mapping. Geoderma 2021, 397, 114998.
17. Piikki, K.; Wetterlind, J.; Söderström, M.; Stenberg, B. Perspectives on Validation in Digital Soil Mapping of Continuous Attributes—A Review. Soil Use Manag. 2021, 37, 7–21.
18. Radočaj, D.; Jug, D.; Jug, I.; Jurišić, M. A Comprehensive Evaluation of Machine Learning Algorithms for Digital Soil Organic Carbon Mapping on a National Scale. Appl. Sci. 2024, 14, 9990.
19. Broeg, T.; Blaschek, M.; Seitz, S.; Taghizadeh-Mehrjardi, R.; Zepp, S.; Scholten, T. Transferability of Covariates to Predict Soil Organic Carbon in Cropland Soils. Remote Sens. 2023, 15, 876.
20. Sakhaee, A.; Gebauer, A.; Ließ, M.; Don, A. Spatial Prediction of Organic Carbon in German Agricultural Topsoil Using Machine Learning Algorithms. Soil 2022, 8, 587–604.
21. Guo, Z.; Li, Y.; Wang, X.; Gong, X.; Chen, Y.; Cao, W. Remote Sensing of Soil Organic Carbon at Regional Scale Based on Deep Learning: A Case Study of Agro-Pastoral Ecotone in Northern China. Remote Sens. 2023, 15, 3846.
22. Allgaier, J.; Pryss, R. Cross-Validation Visualized: A Narrative Guide to Advanced Methods. Mach. Learn. Knowl. Extr. 2024, 6, 1378–1388.
23. Seraj, A.; Mohammadi-Khanaposhtani, M.; Daneshfar, R.; Naseri, M.; Esmaeili, M.; Baghban, A.; Habibzadeh, S.; Eslamian, S. Cross-Validation. In Handbook of HydroInformatics: Volume I: Classic Soft-Computing Techniques; Elsevier: Amsterdam, The Netherlands, 2023; pp. 89–105.
24. Lyons, M.B.; Keith, D.A.; Phinn, S.R.; Mason, T.J.; Elith, J. A Comparison of Resampling Methods for Remote Sensing Classification and Accuracy Assessment. Remote Sens. Environ. 2018, 208, 145–153.
25. Radočaj, D.; Jurišić, M. Comparative Evaluation of Ensemble Machine Learning Models for Methane Production from Anaerobic Digestion. Fermentation 2025, 11, 130.
26. Peng, Y.; Zhou, W.; Xiao, J.; Liu, H.; Wang, T.; Wang, K. Comparison of Soil Organic Carbon Prediction Accuracy Under Different Habitat Patches Division Methods on the Tibetan Plateau. Land Degrad. Dev. 2025, 1–14.
27. Adhikari, K.; Mishra, U.; Owens, P.R.; Libohova, Z.; Wills, S.A.; Riley, W.J.; Hoffman, F.M.; Smith, D.R. Importance and Strength of Environmental Controllers of Soil Organic Carbon Changes with Scale. Geoderma 2020, 375, 114472.
28. Song, X.D.; Wu, H.Y.; Ju, B.; Liu, F.; Yang, F.; Li, D.C.; Zhao, Y.G.; Yang, J.L.; Zhang, G.L. Pedoclimatic Zone-Based Three-Dimensional Soil Organic Carbon Mapping in China. Geoderma 2020, 363, 114145.
29. Nauman, T.W.; Duniway, M.C. Relative Prediction Intervals Reveal Larger Uncertainty in 3D Approaches to Predictive Digital Soil Mapping of Soil Properties with Legacy Data. Geoderma 2019, 347, 170–184.
30. Li, X.; Ding, J.; Liu, J.; Ge, X.; Zhang, J. Digital Mapping of Soil Organic Carbon Using Sentinel Series Data: A Case Study of the Ebinur Lake Watershed in Xinjiang. Remote Sens. 2021, 13, 769.
31. Chen, Z.; Chen, L.; Lu, R.; Lou, Z.; Zhou, F.; Jin, Y.; Xue, J.; Guo, H.; Wang, Z.; Wang, Y.; et al. A National Soil Organic Carbon Density Dataset (2010–2024) in China. Sci. Data 2025, 12, 1480.
32. Rainford, S.K.; Martín-López, J.M.; Da Silva, M. Approximating Soil Organic Carbon Stock in the Eastern Plains of Colombia. Front. Environ. Sci. 2021, 9, 685819.
33. Zhou, T.; Geng, Y.; Ji, C.; Xu, X.; Wang, H.; Pan, J.; Bumberger, J.; Haase, D.; Lausch, A. Prediction of Soil Organic Carbon and the C:N Ratio on a National Scale Using Machine Learning and Satellite Data: A Comparison between Sentinel-2, Sentinel-3 and Landsat-8 Images. Sci. Total Environ. 2021, 755, 142661.
34. Yang, L.; Cai, Y.; Zhang, L.; Guo, M.; Li, A.; Zhou, C. A Deep Learning Method to Predict Soil Organic Carbon Content at a Regional Scale Using Satellite-Based Phenology Variables. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102428.
35. Duarte, E.; Zagal, E.; Barrera, J.A.; Dube, F.; Casco, F.; Hernández, A.J. Digital Mapping of Soil Organic Carbon Stocks in the Forest Lands of Dominican Republic. Eur. J. Remote Sens. 2022, 55, 213–231.
36. de Arruda, D.L.; Ker, J.C.; Veloso, G.V.; Henriques, R.J.; Fernandes-Filho, E.I.; Camêlo, D.d.L.; Gomes, L.d.C.; Schaefer, C.E.G.R. Soil Carbon Prediction in Marajó Island Wetlands. Rev. Bras. Cienc. Solo 2024, 48, e0230162.
37. Veronesi, F.; Schillaci, C. Comparison between Geostatistical and Machine Learning Models as Predictors of Topsoil Organic Carbon with a Focus on Local Uncertainty Estimation. Ecol. Indic. 2019, 101, 1032–1044.
38. Fathizad, H.; Taghizadeh-Mehrjardi, R.; Hakimzadeh Ardakani, M.A.; Zeraatpisheh, M.; Heung, B.; Scholten, T. Spatiotemporal Assessment of Soil Organic Carbon Change Using Machine-Learning in Arid Regions. Agronomy 2022, 12, 628.
39. Zhang, L.; Cai, Y.; Huang, H.; Li, A.; Yang, L.; Zhou, C.A.; Arrouays, D.; Vaudour, E.; Zhang, L.; Cai, Y.; et al. A CNN-LSTM Model for Soil Organic Carbon Content Prediction with Long Time Series of MODIS-Based Phenological Variables. Remote Sens. 2022, 14, 4441.
40. Tan, Q.; Geng, J.; Fang, H.; Li, Y.; Guo, Y. Exploring the Impacts of Data Source, Model Types and Spatial Scales on the Soil Organic Carbon Prediction: A Case Study in the Red Soil Hilly Region of Southern China. Remote Sens. 2022, 14, 5151.
41. Mousavi, A.; Karimi, A.; Maleki, S.; Safari, T.; Taghizadeh-Mehrjardi, R. Digital Mapping of Selected Soil Properties Using Machine Learning and Geostatistical Techniques in Mashhad Plain, Northeastern Iran. Environ. Earth Sci. 2023, 82, 234.
42. Wang, L.J.; Cheng, H.; Yang, L.C.; Zhao, Y.G. Soil Organic Carbon Mapping in Cultivated Land Using Model Ensemble Methods. Arch. Agron. Soil Sci. 2022, 68, 1711–1725.
43. Baltensweiler, A.; Walthert, L.; Hanewinkel, M.; Zimmermann, S.; Nussbaum, M. Machine Learning Based Soil Maps for a Wide Range of Soil Properties for the Forested Area of Switzerland. Geoderma Reg. 2021, 27, e00437.
44. Guo, L.; Fu, P.; Shi, T.; Chen, Y.; Zeng, C.; Zhang, H.; Wang, S. Exploring Influence Factors in Mapping Soil Organic Carbon on Low-Relief Agricultural Lands Using Time Series of Remote Sensing Data. Soil Tillage Res. 2021, 210, 104982.
45. Farooq, I.; Bangroo, S.A.; Bashir, O.; Shah, T.I.; Malik, A.A.; Iqbal, A.M.; Mahdi, S.S.; Wani, O.A.; Nazir, N.; Biswas, A. Comparison of Random Forest and Kriging Models for Soil Organic Carbon Mapping in the Himalayan Region of Kashmir. Land 2022, 11, 2180.
46. Oukhattar, M.; Gadal, S.; Robert, Y.; Saby, N.; Houmma, I.H.; Keller, C. Variability Analysis of Soil Organic Carbon Content across Land Use Types and Its Digital Mapping Using Machine Learning and Deep Learning Algorithms. Environ. Monit. Assess. 2025, 197, 535.
47. Beck, H.E.; McVicar, T.R.; Vergopolan, N.; Berg, A.; Lutsko, N.J.; Dufour, A.; Zeng, Z.; Jiang, X.; van Dijk, A.I.J.M.; Miralles, D.G. High-Resolution (1 km) Köppen-Geiger Maps for 1901–2099 Based on Constrained CMIP6 Projections. Sci. Data 2023, 10, 724.
48. Rodríguez-Rastrero, M.; Ortega-Martos, A.; Cicuéndez, V. Soil and Land Cover Interrelationships: An Analysis Based on the Jenny’s Equation. Soil Syst. 2023, 7, 31.
49. Orgiazzi, A.; Ballabio, C.; Panagos, P.; Jones, A.; Fernández-Ugalde, O. LUCAS Soil, the Largest Expandable Soil Dataset for Europe: A Review. Eur. J. Soil Sci. 2018, 69, 140–153.
50. Garosi, Y.; Ayoubi, S.; Nussbaum, M.; Sheklabadi, M. Effects of Different Sources and Spatial Resolutions of Environmental Covariates on Predicting Soil Organic Carbon Using Machine Learning in a Semi-Arid Region of Iran. Geoderma Reg. 2022, 29, e00513.
51. Karger, D.N.; Conrad, O.; Böhner, J.; Kawohl, T.; Kreft, H.; Soria-Auza, R.W.; Zimmermann, N.E.; Linder, H.P.; Kessler, M. Climatologies at High Resolution for the Earth’s Land Surface Areas. Sci. Data 2017, 4, 170122.
52. SRTM CGIAR-CSI SRTM—SRTM 90 m DEM Digital Elevation Database. Available online: https://srtm.csi.cgiar.org/ (accessed on 19 September 2025).
53. Hijmans, R.J. Spatial Data Analysis [R Package Terra Version 1.8-60]. CRAN: Contributed Packages 2025. Available online: https://CRAN.R-project.org/package=terra (accessed on 2 September 2025).
54. Rouse, J.W., Jr.; Haas, R.H.; Schell, J.A.; Deering, D.W. Monitoring Vegetation Systems in the Great Plains with ERTS. In Third Earth Resources Technology Satellite-1 Symposium. Volume 1: Technical Presentations, Section A; Goddard Space Flight Center, NASA: Greenbelt, MD, USA, 1974.
55. Gitelson, A.A.; Kaufman, Y.J.; Merzlyak, M.N. Use of a Green Channel in Remote Sensing of Global Vegetation from EOS-MODIS. Remote Sens. Environ. 1996, 58, 289–298.
56. Huete, A.R.; Liu, H.Q.; Batchily, K.; Van Leeuwen, W. A Comparison of Vegetation Indices over a Global Set of TM Images for EOS-MODIS. Remote Sens. Environ. 1997, 59, 440–451.
57. Huete, A.R. A Soil-Adjusted Vegetation Index (SAVI). Remote Sens. Environ. 1988, 25, 295–309.
58. Wilson, E.H.; Sader, S.A. Detection of Forest Harvest Type Using Multiple Dates of Landsat TM Imagery. Remote Sens. Environ. 2002, 80, 385–396.
59. Hunt, E.R.; Rock, B.N. Detection of Changes in Leaf Water Content Using Near- and Middle-Infrared Reflectances. Remote Sens. Environ. 1989, 30, 43–54.
60. Gitelson, A.A.; Gritz, Y.; Merzlyak, M.N. Relationships between Leaf Chlorophyll Content and Spectral Reflectance and Algorithms for Non-Destructive Chlorophyll Assessment in Higher Plant Leaves. J. Plant Physiol. 2003, 160, 271–282.
61. Nguyen, C.T.; Chidthaisong, A.; Diem, P.K.; Huo, L.Z. A Modified Bare Soil Index to Identify Bare Land Features during Agricultural Fallow-Period in Southeast Asia Using Landsat 8. Land 2021, 10, 231.
62. McFeeters, S.K. The Use of the Normalized Difference Water Index (NDWI) in the Delineation of Open Water Features. Int. J. Remote Sens. 1996, 17, 1425–1432.
63. Justice, C.O.; Townshend, J.R.G.; Vermote, E.F.; Masuoka, E.; Wolfe, R.E.; Saleous, N.; Roy, D.P.; Morisette, J.T. An Overview of MODIS Land Data Processing and Product Status. Remote Sens. Environ. 2002, 83, 3–15.
64. Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms. In Ensemble Methods; Chapman and Hall/CRC: Boca Raton, FL, USA, 2025.
65. Shahhosseini, M.; Hu, G.; Pham, H. Optimizing Ensemble Weights and Hyperparameters of Machine Learning Models for Regression Problems. Mach. Learn. Appl. 2022, 7, 100251.
66. Wu, H.; Levinson, D. The Ensemble Approach to Forecasting: A Review and Synthesis. Transp. Res. Part C Emerg. Technol. 2021, 132, 103357.
67. Genuer, R.; Poggi, J.-M. Random Forests. In Random Forests with R; Springer: Cham, Switzerland, 2020; pp. 33–55.
68. Syam, N.; Kaul, R. Random Forest, Bagging, and Boosting of Decision Trees. In Machine Learning and Artificial Intelligence in Marketing and Sales; Emerald Publishing Limited: Leeds, UK, 2021; pp. 139–182.
69. Breiman, L.; Cutler, A.; Liaw, A.; Wiener, M. RandomForest: Breiman and Cutlers Random Forests for Classification and Regression. CRAN: Contributed Packages 2002. Available online: https://CRAN.R-project.org/package=randomForest (accessed on 19 September 2025).
70. John, K.; Kebonye, N.M.; Agyeman, P.C.; Ahado, S.K. Comparison of Cubist Models for Soil Organic Carbon Prediction via Portable XRF Measured Data. Environ. Monit. Assess. 2021, 193, 197.
71. Kuhn, M.; Quinlan, R. Rule- and Instance-Based Regression Modeling [R Package Cubist Version 0.5.0]. CRAN: Contributed Packages 2025. Available online: https://CRAN.R-project.org/package=Cubist (accessed on 19 September 2025).
72. Montesinos López, O.A.; Montesinos López, A.; Crossa, J. Support Vector Machines and Support Vector Regression. In Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer: Cham, Switzerland, 2022; pp. 337–378.
73. Karatzoglou, A.; Smola, A.; Hornik, K. Kernel-Based Machine Learning Lab [R Package Kernlab Version 0.9-33]. CRAN: Contributed Packages 2024. Available online: https://CRAN.R-project.org/package=kernlab (accessed on 19 September 2025).
74. Zhang, F.; O’Donnell, L.J. Support Vector Regression. In Machine Learning: Methods and Applications to Brain Disorders; Academic Press: Cambridge, MA, USA, 2020; pp. 123–140.
75. Mullachery, V.; Khera, A.; Husain, A. Bayesian Neural Networks. arXiv 2018, arXiv:1801.07710.
76. Perez Rodriguez, P.; Gianola, D. Bayesian Regularization for Feed-Forward Neural Networks [R Package Brnn Version 0.9.4]. CRAN: Contributed Packages 2025. Available online: https://CRAN.R-project.org/package=brnn (accessed on 19 September 2025).
77. Xu, Y.; Goodacre, R. On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. J. Anal. Test. 2018, 2, 249–262.
78. Nguyen, Q.H.; Ly, H.B.; Ho, L.S.; Al-Ansari, N.; Van Le, H.; Tran, V.Q.; Prakash, I.; Pham, B.T. Influence of Data Splitting on Performance of Machine Learning Models in Prediction of Shear Strength of Soil. Math. Probl. Eng. 2021, 2021, 4832864.
79. Hodson, T.O. Root-Mean-Square Error (RMSE) or Mean Absolute Error (MAE): When to Use Them or Not. Geosci. Model Dev. 2022, 15, 5481–5487.
80. Daroussin, J.; King, D.; Bas, C.L.; Vrščaj, B.; Dobos, E.; Montanarella, L. Chapter 4 The Soil Geographical Database of Eurasia at Scale 1:1,000,000: History and Perspective in Digital Soil Mapping. Dev. Soil Sci. 2006, 31, 55–602.
81. Schad, P. World Reference Base for Soil Resources—Its Fourth Edition and Its History. J. Plant Nutr. Soil Sci. 2023, 186, 151–163.
82. Taghizadeh-Mehrjardi, R.; Schmidt, K.; Amirian-Chakan, A.; Rentschler, T.; Zeraatpisheh, M.; Sarmadian, F.; Valavi, R.; Davatgar, N.; Behrens, T.; Scholten, T. Improving the Spatial Prediction of Soil Organic Carbon Content in Two Contrasting Climatic Regions by Stacking Machine Learning Models and Rescanning Covariate Space. Remote Sens. 2020, 12, 1095.
83. Bhagat, M.; Bakariya, B. Implementation of Logistic Regression on Diabetic Dataset Using Train-Test-Split, K-Fold and Stratified K-Fold Approach. Natl. Acad. Sci. Lett. 2022, 45, 401–404.
84. Lewis, M.J.; Spiliopoulou, A.; Goldmann, K.; Pitzalis, C.; McKeigue, P.; Barnes, M.R. Nestedcv: An R Package for Fast Implementation of Nested Cross-Validation with Embedded Feature Selection Designed for Transcriptomics and High-Dimensional Data. Bioinform. Adv. 2023, 3, vbad048.
85. Zhong, Y.; Chalise, P.; He, J. Nested Cross-Validation with Ensemble Feature Selection and Classification Model for High-Dimensional Biological Data. Commun. Stat. Simul. Comput. 2023, 52, 110–125.
86. Nduati, E.; Sofue, Y.; Matniyaz, A.; Park, J.G.; Yang, W.; Kondoh, A. Cropland Mapping Using Fusion of Multi-Sensor Data in a Complex Urban/Peri-Urban Area. Remote Sens. 2019, 11, 207.
87. Raviv, L.; Lupyan, G.; Green, S.C. How Variability Shapes Learning and Generalization. Trends Cogn. Sci. 2022, 26, 462–483.
88. Reddy, N.N.; Chakraborty, P.; Roy, S.; Singh, K.; Minasny, B.; McBratney, A.B.; Biswas, A.; Das, B.S. Legacy Data-Based National-Scale Digital Mapping of Key Soil Properties in India. Geoderma 2021, 381, 114684.
89. Zhou, J.; Gandomi, A.H.; Chen, F.; Holzinger, A. Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics. Electronics 2021, 10, 593.
90. Rainio, O.; Teuho, J.; Klén, R. Evaluation Metrics and Statistical Tests for Machine Learning. Sci. Rep. 2024, 14, 6086.
91. Iyengar, G.; Lam, H.; Wang, T. Is Cross-Validation the Gold Standard to Evaluate Model Performance? arXiv 2024, arXiv:2407.02754.
92. Wiesmeier, M.; Urbanski, L.; Hobley, E.; Lang, B.; von Lützow, M.; Marin-Spiotta, E.; van Wesemael, B.; Rabot, E.; Ließ, M.; Garcia-Franco, N.; et al. Soil Organic Carbon Storage as a Key Function of Soils—A Review of Drivers and Indicators at Various Scales. Geoderma 2019, 333, 149–162.
93. Chen, S.; Martin, M.P.; Saby, N.P.A.; Walter, C.; Angers, D.A.; Arrouays, D. Fine Resolution Map of Top- and Subsoil Carbon Sequestration Potential in France. Sci. Total Environ. 2018, 630, 389–400.
94. Mulder, V.L.; Lacoste, M.; Richer-de-Forges, A.C.; Arrouays, D. GlobalSoilMap France: High-Resolution Spatial Modelling the Soils of France up to Two Meter Depth. Sci. Total Environ. 2016, 573, 1352–1369.
95. Zhang, X.; Xue, J.; Chen, S.; Wang, N.; Shi, Z.; Huang, Y.; Zhuo, Z. Digital Mapping of Soil Organic Carbon with Machine Learning in Dryland of Northeast and North Plain China. Remote Sens. 2022, 14, 2504.
96. Zhang, W.; Wan, H.; Zhou, M.; Wu, W.; Liu, H. Soil Total and Organic Carbon Mapping and Uncertainty Analysis Using Machine Learning Techniques. Ecol. Indic. 2022, 143, 109420.
97. Chicco, D.; Warrens, M.J.; Jurman, G. The Coefficient of Determination R-Squared Is More Informative than SMAPE, MAE, MAPE, MSE and RMSE in Regression Analysis Evaluation. PeerJ Comput. Sci. 2021, 7, e623.
98. Arlot, S.; Genuer, R. Analysis of Purely Random Forests Bias. arXiv 2014, arXiv:1407.3939.
99. Wainer, J.; Cawley, G. Nested Cross-Validation When Selecting Classifiers Is Overzealous for Most Practical Applications. Expert Syst. Appl. 2021, 182, 115222.
100. Zádorová, T.; Penížek, V.; Žížala, D.; Matějovský, J.; Vaněk, A. Influence of Former Lynchets on Soil Cover Structure and Soil Organic Carbon Storage in Agricultural Land, Central Czechia. Soil Use Manag. 2018, 34, 60–71.
101. Cukor, J.; Vacek, Z.; Linda, R.; Bílek, L. Carbon Sequestration in Soil Following Afforestation of Former Agricultural Land in the Czech Republic. Cent. Eur. For. J. 2017, 63, 97–104.
102. Mišurec, J.; Lukeš, P.; Tomíček, J.; Koňata, P.; Klem, K. Multi-Decadal Satellite Monitoring of Soil Carbon and Its Role in Farm Carbon Footprint: A Case Study for the Czech Republic. Eur. J. Remote Sens. 2025, 58, 2562069.
103. Voltr, V.; Menšík, L.; Hlisnikovský, L.; Hruška, M.; Pokorný, E.; Pospíšilová, L. The Soil Organic Matter in Connection with Soil Properties and Soil Inputs. Agronomy 2021, 11, 779.
104. Goidts, E.; van Wesemael, B. Regional Assessment of Soil Organic Carbon Changes under Agriculture in Southern Belgium (1955–2005). Geoderma 2007, 141, 341–354.
105. Chen, S.; Mulder, V.L.; Heuvelink, G.B.M.; Poggio, L.; Caubet, M.; Román Dobarco, M.; Walter, C.; Arrouays, D. Model Averaging for Mapping Topsoil Organic Carbon in France. Geoderma 2020, 366, 114237.
106. Soucémarianadin, L.N.; Cécillon, L.; Guenet, B.; Chenu, C.; Baudin, F.; Nicolas, M.; Girardin, C.; Barré, P. Environmental Factors Controlling Soil Organic Carbon Stability in French Forest Soils. Plant Soil 2018, 426, 267–286.
107. Le Bissonnais, Y.; Montier, C.; Jamagne, M.; Daroussin, J.; King, D. Mapping Erosion Risk for Cultivated Soil in France. Catena 2002, 46, 207–220.
108. Richer-de-Forges, A.C.; Chen, Q.; Baghdadi, N.; Chen, S.; Gomez, C.; Jacquemoud, S.; Martelet, G.; Mulder, V.L.; Urbina-Salazar, D.; Vaudour, E.; et al. Remote Sensing Data for Digital Soil Mapping in French Research—A Review. Remote Sens. 2023, 15, 3070.
109. Markham, K.; Frazier, A.E.; Singh, K.K.; Madden, M. A Review of Methods for Scaling Remotely Sensed Data for Spatial Pattern Analysis. Landsc. Ecol. 2023, 38, 619–635.





| Country | Soil Samples | Sampling Depth (cm) | Sampling Density (km2 per Sample) | Mean SOC | Accuracy Assessment Method | R2 | RMSE | Lowest NRMSE | Reference |
|---|---|---|---|---|---|---|---|---|---|
| China | 313 | 0–20 | 7987.22 | 4.85 kg C m−2 | CV | 0.33–0.42 | 2.64–2.84 | 0.54 | [26] |
| USA | 6213 | 0–30 | 1582.81 | 94.90 mg·ha−1 | SS | 0.38–0.53 | 0.51–0.54 | 0.01 | [27] |
| China | 8021 | 0–5 | 1196.48 | 19.09 g·kg−1 | CV | 0.06–0.42 | 0.47–1.11 | 0.02 | [28] |
| | | 5–15 | | 17.22 g·kg−1 | | 0.03–0.46 | 0.45–0.99 | 0.03 | |
| | | 15–30 | | 12.84 g·kg−1 | | 0.03–0.40 | 0.51–1.16 | 0.04 | |
| China | 644 | 0–30 | 1009.32 | 3.86 kg·m−2 | CV | 0.28–0.40 | 5.76–11.76 | 1.49 | [21] |
| USA | 673 | 0–15 | 641.90 | 1.16% | CV | 0.51 | 1.69 | 1.46 | [29] |
| | 630 | 15–30 | 685.71 | 0.85% | | 0.39 | 1.48 | 1.74 | |
| China | 105 | 0–10 | 480.00 | 13.42 g·kg−1 | CV | 0.30–0.43 | 5.28–8.51 | 0.39 | [30] |
| China | 23,103 | 0–800 | 415.39 | 3.85 kg·m−2 | CV | 0.83 | 1.93 | 0.50 | [31] |
| Colombia | 653 | 0–30 | 398.16 | 15.00 g·kg−1 | CV | 0.50 | 0.46 | 0.03 | [32] |
| Switzerland | 150 | 0–20 | 273.33 | 43.93 g·kg−1 | CV | 0.12–0.47 | 0.44–0.56 | 0.01 | [33] |
| China | 733 | 0–20 | 191.00 | 13.11 g·kg−1 | CV | 0.10–0.60 | 4.50–6.00 | 0.34 | [34] |
| Dominican Republic | 268 | 0–15 | 179.84 | 110.35 mg·ha−1 | SS | 0.77–0.83 | 35.00–38.60 | 0.32 | [35] |
| Brazil | 81 | 0–20 | 178.56 | 17.6 g·kg−1 | CV | 0.04–0.51 | 1.75–3.02 | 0.10 | [36] |
| Germany | 475 | 0–30 | 149.47 | 2.63% | CV | 0.42–0.68 | 1.42–1.60 | 0.54 | [19] |
| | | | 75.79 | 1.74% | | 0.30–0.48 | 1.37–1.44 | 0.79 | |
| Germany | 3104 | 0–30 | 115.21 | 28.00 g·kg−1 | CV | / | 21.00–34.00 | 0.75 | [20] |
| Italy | 414 | 0–30 | 60.39 | 1.51% | SS | / | 0.70 | 0.46 | [37] |
| Iran | 201 | 0–20 | 24.02 | 0.32% | CV | 0.41–0.54 | 0.08–0.18 | 0.25 | [38] |
| China | 308 | 0–20 | 18.08 | 12.62 g·kg−1 | CV | 0.23–0.35 | 5.00–5.50 | 0.40 | [39] |
| China | 186 | 0–20 | 14.34 | 23.78 g·kg−1 | SS | 0.11–0.49 | 3.90–5.28 | 0.16 | [40] |
| China | 396 | 0–10 | 9.92 | 12.56 g·kg−1 | SS | 0.46–0.58 | 3.49–3.83 | 0.28 | [39] |
| | | 10–20 | | 10.11 g·kg−1 | | 0.63–0.71 | 3.49–3.60 | 0.35 | |
| | | 20–30 | | 7.58 g·kg−1 | | 0.67–0.73 | 2.95–3.03 | 0.39 | |
| Iran | 180 | 0–10 | 8.33 | 0.86% | SS | 0.86 | 0.24 | 0.28 | [41] |
| China | 395 | 0–20 | 6.64 | 11.60 mg·kg−1 | SS | 0.32–0.42 | 1.67–1.90 | 0.14 | [42] |
| Switzerland | 2071 | 0–5 | 6.28 | 6.05% | SS | 0.10–0.22 | 4.85–5.16 | 0.80 | [43] |
| | | 5–15 | | 3.66% | | 0.21–0.29 | 3.78–4.00 | 1.03 | |
| | | 15–30 | | 2.20% | | 0.23–0.32 | 2.83–3.03 | 1.29 | |
| China | 181 | 0–15 | 4.08 | 1.70% | SS | 0.20–0.56 | 0.20–0.26 | 0.12 | [44] |
| | | | | 1.03% | | 0.19–0.53 | 0.25–0.33 | 0.24 | |
| India | 83 | 0–30 | 3.73 | 26.48 mg·ha−1 | SS | 0.90 | 8.21 | 0.31 | [45] |
| France | 162 | 0–30 | 2.06 | 8.9 g·kg−1 | CV | 0.36–0.73 | 24.8 | 2.7 | [46] |
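
The "Lowest NRMSE" column can be related to the other columns with a short arithmetic check. The sketch below assumes that NRMSE is the lowest reported RMSE divided by the mean SOC. This is an inference from the table values, not a definition stated in the table, and it is verified here only for the five rows copied into `rows`. The helper name `nrmse_from_mean` is hypothetical.

```python
# (label, lowest reported RMSE, mean SOC, reported lowest NRMSE), copied from the table.
rows = [
    ("China [26]", 2.64, 4.85, 0.54),
    ("USA [29], 0-15 cm", 1.69, 1.16, 1.46),
    ("USA [29], 15-30 cm", 1.48, 0.85, 1.74),
    ("Iran [41]", 0.24, 0.86, 0.28),
    ("India [45]", 8.21, 26.48, 0.31),
]


def nrmse_from_mean(rmse_value: float, mean_value: float) -> float:
    """Assumed normalization: RMSE divided by the mean of the observed values."""
    return rmse_value / mean_value


checks = []
for label, rmse_value, mean_soc, reported in rows:
    computed = nrmse_from_mean(rmse_value, mean_soc)
    checks.append(abs(computed - reported) < 0.006)
    print(f"{label}: computed {computed:.2f}, reported {reported:.2f}")
```

Normalizing by the mean makes errors comparable across studies that report SOC in different units, which is presumably why the column is included alongside the raw RMSE.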
| Cross-Validation Approach | Machine Learning Method | France R2 | France RMSE | France NRMSE | France MAE | Czechia R2 | Czechia RMSE | Czechia NRMSE | Czechia MAE |
|---|---|---|---|---|---|---|---|---|---|
| k = 10 | RF | 0.409 | 11.60 | 0.454 | 8.64 | 0.243 | 8.21 | 0.387 | 6.25 |
| | CUB | 0.373 | 12.01 | 0.470 | 8.76 | 0.197 | 8.56 | 0.403 | 6.39 |
| | SVR | 0.392 | 11.94 | 0.467 | 8.42 | 0.228 | 8.44 | 0.398 | 6.09 |
| | BRNN | 0.382 | 11.87 | 0.464 | 8.83 | 0.236 | 8.27 | 0.390 | 6.27 |
| | ENS | 0.412 | 11.56 | 0.452 | 8.49 | 0.229 | 8.21 | 0.387 | 6.15 |
| k = 5 | RF | 0.406 | 11.63 | 0.455 | 8.67 | 0.227 | 8.27 | 0.390 | 6.27 |
| | CUB | 0.371 | 12.04 | 0.471 | 8.79 | 0.180 | 8.65 | 0.407 | 6.43 |
| | SVR | 0.387 | 11.98 | 0.469 | 8.45 | 0.218 | 8.47 | 0.399 | 6.07 |
| | BRNN | 0.378 | 11.90 | 0.466 | 8.84 | 0.220 | 8.34 | 0.393 | 6.28 |
| | ENS | 0.409 | 11.58 | 0.453 | 8.52 | 0.225 | 8.24 | 0.388 | 6.16 |
| k = 4 | RF | 0.404 | 11.65 | 0.456 | 8.69 | 0.224 | 8.28 | 0.390 | 6.26 |
| | CUB | 0.369 | 12.05 | 0.472 | 8.81 | 0.186 | 8.61 | 0.406 | 6.40 |
| | SVR | 0.385 | 12.01 | 0.470 | 8.47 | 0.213 | 8.49 | 0.400 | 6.07 |
| | BRNN | 0.378 | 11.90 | 0.466 | 8.85 | 0.211 | 8.39 | 0.395 | 6.31 |
| | ENS | 0.408 | 11.60 | 0.454 | 8.53 | 0.223 | 8.25 | 0.389 | 6.16 |
| k = 2 | RF | 0.392 | 11.77 | 0.461 | 8.80 | 0.206 | 8.38 | 0.395 | 6.32 |
| | CUB | 0.372 | 12.15 | 0.476 | 8.67 | 0.174 | 8.72 | 0.411 | 6.30 |
| | SVR | 0.374 | 12.11 | 0.474 | 8.56 | 0.206 | 8.53 | 0.402 | 6.08 |
| | BRNN | 0.369 | 11.99 | 0.469 | 8.91 | 0.192 | 8.52 | 0.402 | 6.41 |
| | ENS | 0.398 | 11.70 | 0.458 | 8.64 | 0.211 | 8.31 | 0.392 | 6.21 |
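
The four metrics reported above can be written out explicitly. The sketch below assumes common definitions: R2 as one minus the ratio of residual to total sum of squares, and NRMSE as RMSE divided by the mean of the observed values. The second assumption is at least consistent with the table, where RMSE / NRMSE is nearly constant within each country (about 25.5 for France and 21.2 for Czechia). The study's exact formulas may differ, for example by using the squared correlation for R2. The observed and predicted values are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def nrmse(observed, predicted):
    """RMSE normalized by the mean observed value (assumed normalization)."""
    return rmse(observed, predicted) / (sum(observed) / len(observed))


def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up SOC values for four validation samples.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(f"R2={r2(observed, predicted):.3f}  RMSE={rmse(observed, predicted):.3f}  "
      f"NRMSE={nrmse(observed, predicted):.3f}  MAE={mae(observed, predicted):.3f}")
```

RMSE penalizes large errors more heavily than MAE, so reporting both, as the table does, shows whether a model's error is dominated by a few poorly predicted samples.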
| Cross-Validation Approach | Machine Learning Method | Optimal Hyperparameters (France) | Optimal Hyperparameters (Czechia) |
|---|---|---|---|
| k = 10 | RF | mtry = 10 | mtry = 5 |
| | CUB | committees = 20, neighbors = 9 | committees = 20, neighbors = 9 |
| | SVR | σ = 0.020, C = 1 | σ = 0.019, C = 0.5 |
| | BRNN | neurons = 2 | neurons = 1 |
| k = 5 | RF | mtry = 10 | mtry = 5 |
| | CUB | committees = 20, neighbors = 9 | committees = 20, neighbors = 9 |
| | SVR | σ = 0.022, C = 1 | σ = 0.019, C = 0.5 |
| | BRNN | neurons = 2 | neurons = 1 |
| k = 4 | RF | mtry = 6 | mtry = 5 |
| | CUB | committees = 20, neighbors = 9 | committees = 20, neighbors = 9 |
| | SVR | σ = 0.021, C = 1 | σ = 0.019, C = 0.5 |
| | BRNN | neurons = 2 | neurons = 1 |
| k = 2 | RF | mtry = 6 | mtry = 2 |
| | CUB | committees = 20, neighbors = 0 | committees = 20, neighbors = 0 |
| | SVR | σ = 0.022, C = 1 | σ = 0.017, C = 0.5 |
| | BRNN | neurons = 2 | neurons = 1 |
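
The table above shows that the tuned hyperparameters depend on the cross-validation setting. Nested cross-validation, which the conclusions name as a direction for future work, keeps tuning and accuracy assessment on separate data. The sketch below is a generic illustration and not the study's workflow. It assumes synthetic one-dimensional data, a nearest-neighbour regressor standing in for the actual models, and a hypothetical tuning grid for `n_neighbors`.

```python
import math
import random


def knn_predict(train_x, train_y, x, n_neighbors):
    """Mean target of the n_neighbors training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)


def rmse_on(train, valid, xs, ys, n_neighbors):
    """Fit on the train indices and return RMSE on the valid indices."""
    tx, ty = [xs[i] for i in train], [ys[i] for i in train]
    sq = [(ys[i] - knn_predict(tx, ty, xs[i], n_neighbors)) ** 2 for i in valid]
    return math.sqrt(sum(sq) / len(sq))


def make_folds(indices, k, rng):
    """Shuffle a copy of the indices and deal them into k folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(xs, ys, grid, outer_k=5, inner_k=4, seed=0):
    rng = random.Random(seed)
    all_idx = list(range(len(xs)))
    outer_scores, chosen = [], []
    for outer_fold in make_folds(all_idx, outer_k, rng):
        held_out = set(outer_fold)
        outer_train = [i for i in all_idx if i not in held_out]
        inner_folds = make_folds(outer_train, inner_k, rng)

        def inner_score(n_neighbors):
            # Mean inner-fold RMSE; the outer fold is never touched here.
            scores = []
            for inner_fold in inner_folds:
                inner_valid = set(inner_fold)
                inner_train = [i for i in outer_train if i not in inner_valid]
                scores.append(rmse_on(inner_train, inner_fold, xs, ys, n_neighbors))
            return sum(scores) / len(scores)

        best = min(grid, key=inner_score)  # tuning uses inner folds only
        chosen.append(best)
        outer_scores.append(rmse_on(outer_train, outer_fold, xs, ys, best))
    return outer_scores, chosen


data_rng = random.Random(42)
xs = [data_rng.uniform(0.0, 10.0) for _ in range(120)]
ys = [math.sin(x) * 5.0 + data_rng.gauss(0.0, 1.0) for x in xs]
outer_scores, chosen = nested_cv(xs, ys, grid=[1, 5, 15])
print("outer-fold RMSE:", [round(s, 2) for s in outer_scores])
print("n_neighbors chosen per outer fold:", chosen)
```

Because the outer validation fold never influences the choice of `n_neighbors`, the outer-fold RMSE is not biased by tuning, at the cost of fitting the model outer_k × inner_k × len(grid) times.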
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Radočaj, D.; Jurišić, M.; Plaščak, I.; Galić, L. Randomness in Data Partitioning and Its Impact on Digital Soil Mapping Accuracy: A Comparison of Cross-Validation and Split-Sample Approaches. Agronomy 2025, 15, 2495. https://doi.org/10.3390/agronomy15112495
