Bridging Pedology and Data Science: Machine Learning Applications for Soil Organic Matter and Carbon Analysis

Dolatabadian, Aria; Kariman, Khalil

doi:10.3390/app16115412

Open Access Review

by Aria Dolatabadian ¹,* and Khalil Kariman ²

¹ UWA School of Biological Sciences, The University of Western Australia, Perth, WA 6009, Australia

² UWA School of Agriculture and Environment, The University of Western Australia, Perth, WA 6009, Australia

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5412; https://doi.org/10.3390/app16115412

Submission received: 6 March 2026 / Revised: 21 May 2026 / Accepted: 27 May 2026 / Published: 29 May 2026

(This article belongs to the Special Issue Soil Organic Matter and Carbon Content Analysis Using Machine Learning and Classical Approaches)

Abstract

Accurate quantification of soil organic matter (SOM) and carbon content is critical for understanding climate change, evaluating soil health, supporting agricultural sustainability, and implementing carbon sequestration policies. For decades, classical analytical and statistical approaches have underpinned soil carbon assessment, but the emergence of machine learning (ML) techniques offers new opportunities to improve prediction accuracy, scalability, and efficiency. This review summarises the current knowledge on classical and ML-based approaches for analysing SOM and carbon content. We examine the strengths, limitations, and practical applications of conventional methods, including wet chemistry, dry combustion analysis, and geostatistical techniques, alongside modern ML approaches such as random forests (RFs), gradient boosting machines, neural networks, deep learning, and hybrid ML-geostatistical frameworks. Special emphasis is placed on comparative analysis across dimensions, including prediction accuracy, computational requirements, data availability needs, interpretability, uncertainty quantification, and scalability. Soil carbon stocks and dynamics are tightly regulated by indigenous soil microbial communities and their management-driven alterations, creating substantial biologically driven variation that remains difficult to capture with current modelling approaches. We therefore explore hybrid approaches that integrate classical pedological knowledge with ML capabilities. Finally, we discuss emerging challenges, future research directions, and the complementary role these approaches play in advancing soil carbon science. This review concludes that neither classical nor ML approaches alone are sufficient for accurate carbon assessment across diverse scales and environments. Instead, their strategic integration, combining classical mechanistic grounding with the scalability of machine learning, represents the most promising path toward realistic soil carbon evaluation for climate change mitigation and agricultural sustainability.

Keywords: soil organic matter; soil organic carbon; machine learning; classical methods; pedology; remote sensing

1. Introduction

1.1. Background and Significance

Soil organic matter (SOM) and its associated organic carbon content represent one of the most critical components of terrestrial ecosystems [1]. Soils contain approximately 1500 gigatonnes of carbon in the top metre alone, more than the amount held in the atmosphere and vegetation combined [2]. The dynamics of soil carbon directly influence climate regulation, nutrient cycling, water retention, soil structure development, and overall ecosystem productivity [3]. Understanding and accurately quantifying SOM is therefore fundamental to addressing some of humanity’s most pressing environmental challenges.

The relationship between soil carbon and climate change is particularly significant. Soils serve as both a carbon sink and a source; their management can either sequester atmospheric CO₂ or release it, thereby substantially affecting global carbon budgets [4,5]. Agricultural practices (including soil tillage and the use of synthetic fertilisers and fungicides), land-use change, and climate variability all influence soil carbon stocks and rates of change [6]. Consequently, robust methodologies for monitoring SOM across spatial and temporal scales are essential for climate policy implementation, carbon accounting frameworks, and verification of carbon sequestration efforts [7].

Beyond climate considerations, SOM affects soil physical, chemical, and biological properties that are critical to agricultural productivity and environmental quality [8]. Adequate SOM improves soil structure, enhances water-holding capacity, increases cation exchange capacity, promotes beneficial microbial communities, and reduces vulnerability to erosion and compaction [1,9]. In agricultural contexts, maintaining or increasing SOM is recognised as central to sustainable intensification, reduced-input farming systems, and climate-smart agriculture [10].

1.2. Traditional Approaches and Their Evolution

For more than a century, soil scientists have employed classical methods to assess organic matter (OM) and carbon content. These include laboratory-based combustion techniques, wet-chemical oxidation, loss-on-ignition protocols [11], and various spectroscopic methods [12]. Complementing laboratory analysis, field sampling strategies based on statistical design and geostatistical interpolation techniques (particularly kriging) have enabled spatial characterisation of SOM variation across landscapes [13]. While these approaches have proven reliable and generated valuable datasets, they are constrained by labour intensity, cost, time requirements, and limited spatial resolution [14].

Soil organic carbon (SOC) presents distinct modelling challenges compared to other soil properties, such as pH, which typically exhibit near-normal distributions and lower spatial variability. SOC is characterised by marked right skewness, high heterogeneity, and extreme values, which lead to asymmetric prediction errors and substantially lower prediction accuracy than is typically achieved for such properties.
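The right skewness described above can be illustrated with a minimal sketch. The SOC values are invented for illustration, and the log transform shown is one commonly used remedy for skewed targets rather than a procedure prescribed in this review.

```python
import math

def sample_skewness(values):
    """Moment coefficient of skewness, g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical SOC concentrations (g kg^-1): mostly low values plus a few extremes.
soc = [8.0, 9.5, 10.2, 11.0, 12.4, 13.1, 15.0, 18.5, 42.0, 95.0]
log_soc = [math.log(v) for v in soc]

raw_skew = sample_skewness(soc)
log_skew = sample_skewness(log_soc)
print(f"skewness raw: {raw_skew:.2f}, log-transformed: {log_skew:.2f}")
```

In this toy sample the log transform reduces the skewness but does not remove it, which is consistent with extreme values remaining influential even after transformation.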

1.3. The Machine Learning Revolution

In the 2010s, machine learning and data-driven modelling began to supplement classical approaches in soil science [15]. Advances in remote sensing technology, the increasing availability of open-access satellite imagery, the development of free software platforms, and the growth of public soil databases have created unprecedented opportunities for ML applications [16,17]. Techniques including random forests (RFs) [18], support vector machines (SVMs) [19], artificial neural networks (ANNs) [20], and, more recently, deep learning architectures [21] have demonstrated the ability to predict SOM from multispectral and hyperspectral remote-sensing data, terrain attributes, and ancillary environmental variables. Some studies report that ML models achieve spatial prediction accuracies that rival or exceed classical geostatistical interpolation approaches, with substantially lower per-unit costs at landscape-to-regional scales [22,23]. However, these reported accuracies frequently rely on standard cross-validation procedures that are vulnerable to spatial data leakage, and direct comparisons are often confounded by differences in training data sizes, availability of environmental covariates, and validation protocols. When proper spatial cross-validation is employed, performance differences between ML and classical approaches typically narrow considerably. Additionally, model reproducibility remains problematic, with many published studies lacking sufficient methodological transparency or public code/data for independent verification.
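One way to reduce the spatial leakage mentioned above is to keep nearby samples in the same cross-validation fold. The sketch below assigns folds by square grid block; the function name, the blocking scheme, and the coordinates are illustrative assumptions, and other spatial cross-validation designs exist.

```python
import math
from collections import defaultdict

def spatial_block_folds(coords, block_size, n_folds):
    """Assign each sample a fold index so that all samples in one grid block share a fold.

    coords: list of (x, y) positions in map units (hypothetical inputs).
    block_size: edge length of a square block, in the same units.
    """
    blocks = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        key = (math.floor(x / block_size), math.floor(y / block_size))
        blocks[key].append(idx)
    folds = [None] * len(coords)
    # Deal whole blocks out to folds in sorted order so the result is reproducible.
    for i, key in enumerate(sorted(blocks)):
        for idx in blocks[key]:
            folds[idx] = i % n_folds
    return folds

# Three hypothetical sampling clusters, two samples each (coordinates in metres).
coords = [(10, 12), (15, 18), (480, 510), (495, 530), (1100, 20), (1120, 35)]
folds = spatial_block_folds(coords, block_size=250, n_folds=3)
print(folds)
```

Because neighbouring samples never end up on opposite sides of a train/test split, the held-out error is closer to what a model would achieve at genuinely unsampled locations. Samples near a block boundary can still be close to samples in another fold, so block size should be chosen with the expected range of spatial autocorrelation in mind.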

1.4. Objectives

This review examines the complementary and competing roles of classical and machine-learning approaches in analysing SOM and carbon. We review recent literature to explore the foundational principles, practical procedures, advantages, and limitations of classical methods, and to evaluate the performance of various ML techniques for predicting soil carbon. The review further considers comparisons between classical and machine-learning approaches across key dimensions such as accuracy, cost, scalability, and interpretability, and highlights how classical knowledge can inform the development of hybrid models. Finally, we identify remaining knowledge gaps and outline priorities for future research. Emphasis is placed on understanding when, where, and how each approach is most appropriate, and on how their integration can advance soil carbon science.

1.5. Novel Contributions and Review Scope

This review is distinguished by four key innovations. First, it integrates classical pedological knowledge with machine learning, arguing that neither approach alone is sufficient; classical methods provide a mechanistic understanding but lack scalability, while ML offers scalability but overlooks soil microbial legacies and management-driven biological variation. Second, it uniquely emphasises how soil microbial communities (fundamentally altered by management practices such as fertilisation and tillage) create variation that standard ML models trained solely on environmental covariates poorly capture, making model generalisation across regions with different management histories inherently problematic. Third, it provides the first comprehensive multi-dimensional comparison of classical and ML approaches across accuracy, cost, interpretability, uncertainty, transferability, and temporal dynamics, revealing that method selection should be context-dependent. Finally, it articulates a coherent hybrid framework combining strategic field sampling with ML-based interpolation, positioning classical and ML methods as complementary rather than competing tools.

1.6. Review Type and Methodology

This is a conceptual synthesis and narrative review that integrates classical pedological theory, machine-learning applications, and hybrid frameworks from previously disconnected bodies of literature. A comprehensive search across Web of Science, Scopus, and Google Scholar yielded over 3000 results using combined keywords including ‘SOC,’ ‘machine learning,’ ‘random forest,’ ‘soil prediction,’ and ‘digital soil mapping.’ We included peer-reviewed articles with quantitative performance metrics, published from 2010 onwards, that address classical or ML soil carbon methods, comparative analyses, or hybrid frameworks, and excluded non-soil contexts and studies without quantitative performance metrics. After applying these criteria, we retained 160 studies for detailed analysis, supplemented by citation tracking through 2025; most of the reviewed literature therefore dates from 2010 onwards. Using expert judgment rather than rigid systematic protocols, we selected representative studies that illuminate key mechanistic connections and comparative strengths. While acknowledging the limitations inherent in narrative approaches (author bias, incomplete coverage), this methodology is well-suited to bridging disciplinary divides and articulating novel integrative frameworks.

2. Soil Organic Matter

2.1. Definition and Composition

Soil organic matter encompasses the totality of organic compounds in soil, including living biomass (plant roots, microorganisms, and fauna) and non-living components (decomposing plant and animal residues, stabilised humus, and highly recalcitrant organic substances). In surface horizons, SOM typically comprises 2–10% of soil mass, though this varies substantially across climates, vegetation, land use, and management histories [24]. Soil organic carbon (SOC) is the carbon fraction within SOM, typically representing up to 58% of total SOM mass, with the remainder consisting of oxygen, hydrogen, nitrogen, and other elements [25].
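The 58% figure implies a simple conversion between SOM and SOC. The sketch below applies it as a fixed factor; this is a convention rather than a constant, since the actual carbon fraction varies between soils, and the function names are illustrative.

```python
# Conventional carbon fraction of SOM, taken from the 58% figure in the text.
# Assumption: treated here as a fixed factor, although real soils vary.
CARBON_FRACTION = 0.58

def som_to_soc(som_percent, carbon_fraction=CARBON_FRACTION):
    """Convert soil organic matter (%) to soil organic carbon (%)."""
    return som_percent * carbon_fraction

def soc_to_som(soc_percent, carbon_fraction=CARBON_FRACTION):
    """Convert soil organic carbon (%) to soil organic matter (%)."""
    return soc_percent / carbon_fraction

soc = som_to_soc(5.0)   # 5% SOM corresponds to 2.9% SOC
som = soc_to_som(2.0)   # 2% SOC corresponds to about 3.45% SOM (1 / 0.58 is about 1.724)
print(f"SOC = {soc:.2f}%, SOM = {som:.2f}%")
```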

2.2. Forms and Stability of Soil Organic Matter: Implications for Predictive Modelling

Understanding different OM forms is not merely descriptive but directly relevant to machine learning applications because: (1) different environmental covariates preferentially predict different OM fractions, affecting model interpretation and transferability; (2) regions with contrasting management histories develop distinct microbial legacies that alter OM composition, limiting cross-regional model generalisability; and (3) contemporary soil carbon accounting increasingly targets specific OM fractions (e.g., labile vs. stabilised carbon). Therefore, a mechanistic understanding of OM fractionation is essential for bridging classical pedology with data-driven ML approaches. A central concept in SOM science is the distinction among OM forms based on chemical recalcitrance and turnover time [26]. Labile OM includes sugars, amino acids, and readily decomposable plant residues, with turnover times of days to months [27]. Intermediate fractions include partially decomposed organic compounds and microbial metabolites, with turnover times of months to decades [28]. The most resistant fraction, often termed humus or stabilised OM, may persist for centuries to millennia, protected through physical occlusion in soil aggregates, chemical bonding to minerals, and inherent chemical resistance to biological degradation [28].

Understanding these fractions is crucial because they respond differently to management, climate change, and environmental disturbance. Prediction and monitoring of soil carbon must account for this heterogeneity, as different fractions have distinctly different ecological significance.

2.3. Factors Affecting Soil Organic Matter Content

Soil organic matter distribution and dynamics are controlled by complex interactions among climate [13], parent material [29], topography [30], biological factors [31], and management [32]. Climate, particularly temperature and precipitation, strongly influences the balance between carbon input (productivity) and carbon loss (decomposition) [33]. Colder, wetter climates generally accumulate higher OM stocks. Quantitatively, a 1 °C increase in mean annual temperature is associated with SOC losses of approximately 0.2–0.5 Mg C ha⁻¹ across temperate ecosystems, with considerably greater losses in permafrost-affected regions. Conversely, global patterns show precipitation strongly controls SOC stocks in semi-arid systems, with values increasing from less than 10 Mg C ha⁻¹ at mean annual precipitation < 250 mm to greater than 30 Mg C ha⁻¹ at >800 mm [3], establishing a quantifiable climate–SOC relationship across biogeographic gradients. However, the relationship is non-linear and mediated by vegetation type and soil properties [34].

Vegetation type determines the quantity and quality of OM input. Forests typically accumulate more carbon than grasslands under comparable climatic conditions, particularly in mineral soils [35]. Quantitatively, forest ecosystems accumulate 1.5–3 times more carbon per unit area than grasslands, with typical carbon stocks of 50–200 Mg C ha⁻¹ in temperate forests versus 20–80 Mg C ha⁻¹ in grasslands [2]. Peatlands represent extreme carbon accumulation systems [35], frequently exceeding 500 Mg C ha⁻¹ in boreal regions, highlighting the profound influence of vegetation type on long-term carbon storage.

Soil mineralogy influences stabilisation; clay-rich soils typically contain more organic carbon per unit volume than sandy soils, owing to interactions between OM and mineral surfaces [29,36]. This mineralogical control on carbon stabilisation operates through clay–OM interactions that physically occlude carbon within mineral aggregates and chemically bond organic compounds to mineral surfaces, slowing decomposition rates substantially.

Topography affects water availability, erosion rates, and patterns of carbon accumulation. Land management, including tillage, fertilisation, drainage, cropping systems, and land-use history, has profound effects on soil carbon, often reducing stocks in converted or intensively managed soils compared to natural ecosystems [37]. Long-term global meta-analyses demonstrate that conversion from native vegetation to agricultural land reduces SOC stocks over time [6]. Intensive tillage further accelerates carbon losses by an additional 15–25% relative to no-till systems, as mechanical disturbance stimulates microbial decomposition and disrupts physical protection mechanisms within soil aggregates. These quantified effects underscore management’s dominant control over soil carbon dynamics at decadal timescales.

2.4. Challenges in Organic Matter Assessment of Soil

Several factors complicate accurate SOM assessment. First, spatial variability is extreme at fine scales; short-range variation due to microtopography, vegetation patchiness, and variation in soil parent material can exceed variation across broader landscapes [38]. Second, temporal dynamics require understanding the period represented by a measurement; surface soil samples integrate processes from recent years, whereas deeper carbon may reflect centennial- to millennial-scale dynamics [39]. Third, standardisation of sampling protocols, analytical procedures, and reporting metrics remains inconsistent across studies and regions, complicating global synthesis and comparison [40]. Fourth, accessibility to remote areas, particularly in developing regions and under forest cover, constrains spatial coverage [41]. Fifth, the heterogeneity of SOM composition means that different analytical methods may yield different estimates for the same sample, depending on each method’s efficiency in quantifying different SOM fractions [42].

3. Classical Approaches to Soil Organic Matter and Carbon Analysis

3.1. Laboratory Analytical Methods

3.1.1. Wet Chemical Oxidation

The Walkley-Black method, developed in 1934, remains one of the most widely used classical approaches globally. This method involves treating soil samples with potassium dichromate in sulfuric acid, which oxidises organic carbon to CO₂. The amount of dichromate consumed (measured by back-titration with ferrous sulphate or colourimetric determination) is proportional to soil carbon content [43]. The method is relatively rapid (requiring hours per batch), requires minimal equipment, and is inexpensive. However, it typically oxidises only 70–80% of total organic carbon (not oxidising the most resistant fractions), and recoveries vary with soil mineralogy and the chemical recalcitrance of specific SOM fractions, necessitating site-specific correction factors [44].
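The incomplete recovery described above is usually handled by dividing the measured value by an assumed recovery fraction. The sketch below shows that arithmetic; the default recovery of 0.77 is an assumption chosen from within the 70–80% range quoted in the text, and the measured value is invented.

```python
def correct_walkley_black(measured_oc_percent, recovery=0.77):
    """Scale Walkley-Black oxidisable carbon to an estimate of total organic carbon.

    recovery: assumed fraction of organic carbon oxidised by the method.
    The text quotes 70-80%; 0.77 is an illustrative default, not a standard.
    """
    if not 0 < recovery <= 1:
        raise ValueError("recovery must be in the interval (0, 1]")
    return measured_oc_percent / recovery

measured = 1.54  # hypothetical oxidisable carbon, % by mass
estimates = {r: correct_walkley_black(measured, r) for r in (0.70, 0.77, 0.80)}
for r, value in estimates.items():
    print(f"recovery {r:.2f}: estimated total organic carbon = {value:.2f}%")
```

The spread between the three estimates shows why a single generic correction factor can bias results and why site-specific factors are recommended.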

3.1.2. Dry Combustion (Elemental Analysis)

Dry combustion involves heating soil samples to very high temperatures (900–1000 °C) in an oxygen atmosphere, completely oxidising OM and other combustible elements. The CO₂ and H₂O produced are quantified using thermal conductivity detectors, infrared detectors, or mass spectrometry [45]. Elemental analysers provide simultaneous determination of carbon, hydrogen, nitrogen, and sulphur content with high precision. This method is considered the gold standard for total organic carbon determination because it achieves complete oxidation regardless of carbon recalcitrance. Its main limitations are higher capital costs, the requirement for skilled operation and maintenance, the need for electricity and gas, and the cost of analytical-grade reagents [46]. Nevertheless, elemental analysis is preferred for research-grade analysis, international comparisons, and situations requiring maximum accuracy.

3.1.3. Loss-on-Ignition

This simple method involves heating air-dried soil to 360–550 °C (or sometimes higher) and measuring mass loss. Mass loss results from the combined effects of OM combustion, water released from hydrous minerals, and the volatilisation of labile mineral components [47]. The loss on ignition (LOI) method does not specifically measure carbon content but instead provides an estimate of total volatile matter. While rapid and inexpensive, LOI yields only a crude estimate of SOM content and cannot distinguish organic from inorganic sources of mass loss [48]. Its applicability varies greatly by soil type; LOI is reasonably reliable for many mineral soils but problematic for soils with substantial amounts of hydrous minerals or carbonates [49].
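The LOI calculation itself is a single ratio of masses. A minimal sketch follows, assuming the sample is weighed immediately before and after ignition; the masses are invented, and any prior drying step is outside the scope of the example.

```python
def loss_on_ignition(mass_before_g, mass_after_g):
    """Return LOI as a percentage of the pre-ignition mass."""
    if mass_before_g <= 0:
        raise ValueError("pre-ignition mass must be positive")
    if mass_after_g > mass_before_g:
        raise ValueError("post-ignition mass cannot exceed pre-ignition mass")
    return 100.0 * (mass_before_g - mass_after_g) / mass_before_g

# Hypothetical sample: 10.00 g before ignition, 9.42 g after.
loi = loss_on_ignition(10.00, 9.42)
print(f"LOI = {loi:.1f}%")
```

The result is total mass loss, so it includes structural water and other non-organic losses; converting it to SOM or SOC requires a soil-specific calibration.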

3.1.4. Spectroscopic Methods

Spectroscopic approaches, including visible and near-infrared (VNIR), mid-infrared (MIR), and Fourier transform infrared (FTIR) spectroscopy, analyse light absorption or reflectance in specific wavelength regions [50,51,52]. These methods do not directly measure carbon content; rather, they detect the chemical bonds and functional groups characteristic of OM [53]. Empirical relationships between spectral properties and chemically determined carbon content are established using calibration datasets, enabling prediction of carbon in new samples from their spectra alone [54].

Spectroscopic approaches offer the advantages of speed, minimal sample preparation, and potential for high-throughput analysis. MIR spectroscopy has shown strong correlations with measured OM and has been adopted by soil survey programs and carbon accounting initiatives [55]. However, spectroscopic methods require expensive instrumentation, need careful standardisation of sample preparation, may not capture all OM components equally, and always require calibration against direct analytical measurements.

3.2. Field Sampling and Geostatistical Analysis

3.2.1. Sampling Strategies

Classical soil surveys and carbon assessment rely on carefully designed field sampling to capture spatial variability while minimising cost and effort. Multiple sampling schemes exist, each suited to different landscape characteristics and research objectives.

Grid sampling collects samples at fixed, regular intervals across the study area. It provides systematic coverage but may miss important variation if grid spacing does not align with landscape features or environmental gradients [56,57]. Stratified random sampling divides the study area into relatively homogeneous strata based on environmental covariates (soil type, geology, vegetation, elevation), then randomly collects samples within each stratum. This improves sampling efficiency and statistical reliability, particularly across environmental gradients, ensuring representation of major landscape units [58]. Targeted or judgmental sampling uses expert knowledge to select specific locations that represent particular soil types or landscape positions (e.g., hilltops, toeslopes, riparian zones). This approach captures pedologically meaningful variation and is cost-effective but introduces subjectivity and limits statistical generalisation [59].

Systematic random sampling combines regularity with randomness by selecting a random starting point and then collecting samples at regular intervals, balancing the efficiency of grid sampling with statistical robustness [60]. Clustered or hierarchical sampling collects multiple samples within spatially grouped clusters distributed across the study area. This scheme is efficient when variation occurs at multiple scales and is particularly valuable for assessing soil carbon at nested spatial extents (plot, field, and landscape scales), which are commonly used in national monitoring programs and carbon audits [61].

The optimal sample density is determined through power analysis, cost–benefit considerations, and prior knowledge of expected spatial variability. In practice, many surveys employ hybrid sampling designs that combine multiple schemes—for example, stratification by major landscape units with grid and targeted sampling within strata—to balance systematic coverage with cost-efficiency.
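Of the schemes described in this subsection, stratified random sampling is straightforward to sketch in code. The example below draws an equal number of sites from each stratum; the stratum labels, site identifiers, and equal allocation are illustrative assumptions, and real designs often allocate samples in proportion to stratum area or expected variability.

```python
import random
from collections import defaultdict

def stratified_random_sample(units, n_per_stratum, seed=0):
    """Draw up to n_per_stratum units at random from each stratum.

    units: list of (unit_id, stratum_label) pairs (hypothetical candidate sites).
    Returns a dict mapping stratum label to a sorted list of selected unit ids.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_stratum = defaultdict(list)
    for unit_id, stratum in units:
        by_stratum[stratum].append(unit_id)
    sample = {}
    for stratum, ids in sorted(by_stratum.items()):
        k = min(n_per_stratum, len(ids))
        sample[stratum] = sorted(rng.sample(ids, k))
    return sample

# Hypothetical candidate sites labelled by landscape position.
units = (
    [(i, "hilltop") for i in range(0, 20)]
    + [(i, "toeslope") for i in range(20, 50)]
    + [(i, "riparian") for i in range(50, 55)]
)
sample = stratified_random_sample(units, n_per_stratum=4)
print(sample)
```

Every stratum is represented regardless of its size, which is the property that distinguishes this scheme from simple random sampling, where small units such as the riparian zone could be missed entirely.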

3.2.2. Geostatistical Methods

Geostatistics analyses spatial data and predicts values at unsampled locations using the semivariogram, which quantifies how the dissimilarity between pairs of observations increases with their separation distance [62,63]. Key parameters include the range (distance beyond which observations are effectively uncorrelated), the sill (semivariance reached at large distances), and the nugget (semivariance at near-zero distance, reflecting measurement error and variation at scales finer than the sampling interval). Kriging uses the fitted semivariogram to weight nearby observations for prediction: ordinary kriging assumes an unknown but locally constant mean, universal kriging accounts for non-stationary mean effects (trends), and co-kriging incorporates secondary variables to improve accuracy [63]. Geostatistics has been widely applied to mapping soil properties such as carbon, yielding both predictions and uncertainty estimates [63].
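The empirical semivariogram is commonly estimated as half the mean squared difference between all pairs of observations separated by approximately the same distance. The sketch below implements that classical estimator for a toy transect; the binning rule, function name, and data are illustrative assumptions, and fitting a model to obtain range, sill, and nugget is a separate step not shown here.

```python
import math
from itertools import combinations

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Return a list of (lag_distance, semivariance, n_pairs) tuples.

    Pairs are assigned to the nearest multiple of lag_width; semivariance is
    sum((z_i - z_j)**2) / (2 * n_pairs), or None if a lag has no pairs.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i, j in combinations(range(len(values)), 2):
        distance = math.dist(coords[i], coords[j])
        k = round(distance / lag_width)  # nearest lag class, 1-based
        if 1 <= k <= n_lags:
            sums[k - 1] += (values[i] - values[j]) ** 2
            counts[k - 1] += 1
    return [
        ((k + 1) * lag_width, sums[k] / (2 * counts[k]) if counts[k] else None, counts[k])
        for k in range(n_lags)
    ]

# Hypothetical transect: SOC (%) sampled every 10 m along a line.
coords = [(0, 0), (10, 0), (20, 0), (30, 0), (40, 0)]
values = [1.0, 1.2, 1.5, 1.9, 2.4]
gamma = empirical_semivariogram(coords, values, lag_width=10, n_lags=3)
for lag, semivariance, n_pairs in gamma:
    print(f"lag {lag} m: semivariance = {semivariance:.4f} from {n_pairs} pairs")
```

In this toy transect the semivariance keeps rising with distance and never levels off, which indicates a trend in the data rather than a stationary process with a clear sill.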

3.3. Remote Sensing in Classical Frameworks

Classical soil science increasingly incorporates remote sensing data, particularly multispectral satellite imagery, which provides reflectance measurements in visible, near-infrared, and shortwave-infrared bands [64]. Spectral indices such as the normalised difference vegetation index (NDVI) and the normalised difference soil index (NDSI) are calculated from these bands and used as covariates in statistical models to predict SOM [65]. Multiple regression [66], polynomial regressions [67], and other classical statistical techniques, including partial least squares (PLS) regression [68], relate these indices to field-measured carbon values.
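As a concrete example of a spectral index, NDVI is the normalised difference between near-infrared and red reflectance. The sketch below computes it for two invented pixels; the reflectance values are illustrative only, and NDSI is not shown because its band definition varies between studies.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index: (NIR - Red) / (NIR + Red)."""
    denominator = nir + red
    if denominator == 0:
        return float("nan")  # undefined when both reflectances are zero
    return (nir - red) / denominator

# Hypothetical surface reflectances (unitless, 0-1).
vegetated = ndvi(nir=0.45, red=0.05)   # dense canopy, about 0.8
bare_soil = ndvi(nir=0.30, red=0.25)   # sparse cover, about 0.09
print(f"vegetated: {vegetated:.2f}, bare soil: {bare_soil:.2f}")
```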

This hybrid classical-remote-sensing approach provides improved spatial coverage relative to field sampling alone and has enabled regional- to continental-scale carbon assessments [64,69]. However, the statistical relationships between spectral indices and soil carbon are often moderate (R² of 0.3–0.6), suggesting that spectral data alone contain limited information about subsurface soil properties. Vegetation cover further complicates the signal in forested regions [70].

3.4. Strengths and Limitations of Classical Approaches

Classical methods offer several important advantages. They are grounded in well-established chemistry and physics and are understood at the mechanistic level [71]. A long history of applications has produced extensive reference datasets and comparative baselines. Laboratory-based methods also provide direct measurement of carbon rather than estimation [72]. In addition, the intellectual transparency of statistical methods enables a clear understanding of assumptions and sources of uncertainty.

However, several limitations constrain their effectiveness. These include high per-sample cost and labour requirements, which limit the spatial density and extent of surveys [73]; time delays between sampling and results; difficulty capturing fine-scale spatial heterogeneity due to typically sparse sampling [40]; inability to characterise historical trends without time-series samples or carbon dating; and inconsistency in protocols and standards across regions and time periods, which complicates synthesis and comparison.

4. Machine Learning Approaches to Soil Organic Matter Analysis

4.1. Overview of Machine Learning Techniques

Machine learning has emerged as a transformative computational discipline with applications across numerous scientific fields, enabling sophisticated pattern recognition and predictive modelling from complex, high-dimensional datasets. These algorithms encompass diverse methodologies, from traditional linear regression techniques to advanced ensemble and deep learning approaches, capable of uncovering non-linear relationships that are often missed by conventional statistical methods. As computational capabilities and data availability continue to expand, ML applications in environmental science represent a critical frontier for resource management, sustainable agriculture, and climate mitigation [74].

In soil science, accurate characterisation of SOM and SOC remains a fundamental challenge with profound implications. Soil organic matter content influences soil fertility, crop productivity, carbon cycling, and ecosystem resilience [75]. Traditional laboratory methods for measuring these properties are time-consuming, destructive, and costly, making large-scale soil assessment impractical for precision agriculture and environmental monitoring. Consequently, researchers have increasingly turned to proximal and remote sensing technologies integrated with ML algorithms to achieve rapid, non-destructive soil characterisation.

Spectroscopy-based approaches have proven particularly promising for SOM estimation. Visible-near-infrared (vis-NIR) spectroscopy, when combined with multivariate calibration and machine learning, enables effective predictions of soil properties at a fraction of the cost of conventional methods. A comparative analysis of ML algorithms for paddy soils in China (N = 523 soil samples), including partial least squares regression (PLSR), least squares support vector machines (LS-SVM), extreme learning machines (ELM), and Cubist models, demonstrated that algorithm selection significantly affects accuracy [76]. The study found that ELM, coupled with a genetic algorithm and a reduced spectral band set, achieved superior performance for SOM prediction (R² = 0.81, RMSE = 5.17 g kg⁻¹) while substantially lowering computational requirements by cutting the number of wavelengths from 201 to 44, illustrating the value of feature selection alongside algorithmic optimisation.
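The two metrics quoted above can be computed directly from observed and predicted values. The sketch below uses the coefficient-of-determination form of R² (one minus the ratio of residual to total sum of squares); some studies instead report the squared correlation, so this definition is an assumption, and the data are invented.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, in the units of the observations."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total

# Hypothetical SOM values (g kg^-1) for four validation samples.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
model_rmse = rmse(observed, predicted)
model_r2 = r_squared(observed, predicted)
print(f"RMSE = {model_rmse:.2f} g/kg, R2 = {model_r2:.3f}")
```

RMSE is scale-dependent, so values from studies with different SOM ranges are not directly comparable, whereas R² depends on the variance of the validation set.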

Beyond spectroscopy, ML can integrate diverse environmental covariates, topographic variables, remote-sensing indices, and climate data to predict landscape-scale SOM and SOC. Recent work incorporating satellite-derived features and soil health data has shown that ensemble methods, particularly stacking approaches, effectively address the overfitting challenges commonly encountered in soil property modelling [77]. Notably, gradient-boosting algorithms achieved superior training performance (R² = 0.95), whereas stacking methods demonstrated greater generalisation to test datasets, underscoring the tension between training accuracy and model robustness.

Model interpretability has emerged as a critical consideration alongside predictive accuracy. Across Germany, ML models trained on soil carbon measurements from the LUCAS dataset (N = 1686 samples) and environmental features derived from Google Earth Engine revealed that different algorithm classes rely on distinct environmental drivers [78]. Decision tree-based models emphasised topographic features, whereas neural networks and linear models prioritised soil chemical properties such as pH [78]. This divergence demonstrates that understanding the processes underlying ML predictions requires careful model interpretation guided by domain expertise.

Advances in calibration transfer techniques have expanded the applicability of spectroscopic models from laboratory to field conditions. Coupling vis-NIR and mid-infrared (VNIR-MIR) spectroscopy with ML methods, including Cubist, SVMs, and ensemble approaches, significantly outperformed standard PLSR calibration in a study of 175 soil samples covering a wide range of soil physical, chemical, and microbial properties [79]. Critically, the external parameter orthogonalisation (EPO) algorithm reduced prediction error by approximately 40% when transferring from laboratory to field-conditioned spectra. Field-collected spectra nevertheless remained less accurate than laboratory-optimised models, revealing persistent challenges in translating controlled calibrations to heterogeneous field environments.

Recent advances in optimisation techniques promise further improvements in prediction accuracy and computational efficiency. Integrating the Ninja Optimisation Algorithm (NiOA) for simultaneous feature selection and hyperparameter tuning achieved a 99.98% reduction in mean squared error for SOC prediction compared to an untuned baseline support vector regression [80]. This result points to the potential of metaheuristic optimisation in data-driven soil modelling, although such a dramatic improvement likely reflects the weakness of the baseline rather than the absolute superiority of the optimisation approach. Similarly, Bayesian-enhanced RF approaches for SOM spatial mapping substantially improved accuracy compared to traditional methods, with key environmental covariates (rainfall, distance to coast, distance to water bodies, and altitude) explaining over 74% of SOM content variability [81].

These developments collectively underscore the substantial potential of ML for assessing soil properties at scales ranging from field to continental. When coupled with spectroscopy, remote sensing, and optimised environmental covariates, these approaches offer scalable, interpretable solutions that can support precision agriculture, climate adaptation strategies, and sustainable soil management.

To meet these objectives, several ML approaches, including deep neural network models, have been applied. These techniques, reviewed below, support more efficient resource allocation, improve climate resilience, and enhance early disease detection in agricultural systems.

4.2. Regression Methods

4.2.1. Linear and Polynomial Regression

Linear regression assumes a linear relationship between predictors and the target variable. Although a classical statistical method, linear regression can be regularised using techniques such as LASSO (L1 penalty), ridge regression (L2 penalty), or the elastic net (a combination of L1 and L2 penalties), which penalise model complexity to prevent overfitting. LASSO is particularly useful in high-dimensional settings because it performs automatic feature selection by shrinking some coefficients to exactly zero. These regularised approaches are often classified within ML frameworks, as they regulate model complexity through data-driven mechanisms rather than prespecified structural assumptions.
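The contrast between the L1 and L2 penalties can be illustrated with a minimal sketch. It assumes an orthonormal design matrix, for which the LASSO and ridge solutions have simple closed forms in terms of the ordinary least-squares coefficients. The function names, the coefficient values, and the penalty strength `lam` are hypothetical and chosen only for illustration.

```python
def lasso_shrink(beta_ols, lam):
    """Soft-thresholding: LASSO solution for an orthonormal design.

    Minimises 0.5 * ||y - X b||^2 + lam * ||b||_1 when X^T X = I.
    """
    if abs(beta_ols) <= lam:
        return 0.0  # small coefficients are set exactly to zero
    return beta_ols - lam if beta_ols > 0 else beta_ols + lam


def ridge_shrink(beta_ols, lam):
    """Ridge solution for an orthonormal design.

    Minimises 0.5 * ||y - X b||^2 + 0.5 * lam * ||b||^2 when X^T X = I.
    """
    return beta_ols / (1.0 + lam)


# Hypothetical least-squares coefficients for four covariates.
ols_coefficients = [2.0, -0.3, 0.1, -1.5]
lam = 0.5

lasso_coefficients = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge_coefficients = [ridge_shrink(b, lam) for b in ols_coefficients]

print("LASSO:", lasso_coefficients)
print("Ridge:", ridge_coefficients)
```

Under these assumptions, the two weakest coefficients are removed entirely by the L1 penalty, whereas the L2 penalty shrinks every coefficient proportionally and leaves all of them non-zero.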

Polynomial regression extends linear regression by including polynomial terms and interaction effects. Polynomial models can capture non-linearity but are prone to overfitting in high-dimensional spaces. Modern implementations frequently incorporate cross-validation and regularisation to mitigate overfitting. For example, linear and polynomial regression models were used to analyse relationships between PC1 from principal component analysis and SOM functional pools [67], and to model the dynamics of soil solution nutrients and ions over time, with model selection based on cross-validation. The results showed that polynomial regression provided a better fit than linear models for most variables, capturing complex non-linear changes in ion bioavailability, whereas linear relationships were less effective at describing temporal nutrient dynamics [82]. In another study, a SOM mapping method was developed using ML and regression techniques applied to long-term remote-sensing data. Neural networks were used to detect bare soil surfaces, while a second-degree polynomial regression model was applied to estimate the SOM distribution. The regression achieved strong predictive performance (R² = 0.8), demonstrating that combining ML-based data filtering with polynomial regression can effectively model and map SOM spatial variability [83].

4.2.2. Support Vector Regression

Support vector regression (SVR) applies the support vector machine framework to regression problems [84]. SVR uses an epsilon-insensitive loss function that ignores errors below a specified tolerance (ε) while fitting a function that is as flat as possible. The kernel trick is central to SVR’s effectiveness: by implicitly mapping input data into a high-dimensional feature space via a kernel function (e.g., a radial basis function or a polynomial), SVR can model complex non-linear relationships without explicitly computing the transformation [85]. SVR has shown promise for predicting soil carbon, particularly when feature selection is carefully performed. For example, SVR was evaluated alongside ANNs and RFs for predicting and mapping SOC stocks using soil, climatic, topographic, and remote-sensing data. SVR achieved the best performance, with the lowest prediction error (RMSE = 14.9 Mg C ha⁻¹), lowest bias, and highest explanatory power (R² = 0.6), making it the most accurate model for SOC estimation in that study [86]. SVR can be effective in moderately high-dimensional feature spaces owing to its regularisation properties, but its computational cost scales poorly with the number of training samples (approximately O(n²) to O(n³)), which can be prohibitive for large soil datasets. SVR is generally less prone to overfitting than unregularised polynomial regression, but its performance depends substantially on hyperparameter tuning; optimal hyperparameters are data-dependent and must be selected by cross-validation [87].
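The two ingredients described above, the epsilon-insensitive loss and a radial basis function kernel, can be written out as a minimal sketch. This is not a full SVR solver: it only evaluates the loss and the kernel. The function names, the tolerance, the kernel width `gamma`, and all numerical values are hypothetical.

```python
import math


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return 0 inside the epsilon tube, otherwise the excess absolute error."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Hypothetical SOC stocks (Mg C ha^-1) and a tolerance of 2.0.
loss_inside_tube = epsilon_insensitive_loss(40.0, 41.5, epsilon=2.0)
loss_outside_tube = epsilon_insensitive_loss(40.0, 45.0, epsilon=2.0)

# Hypothetical standardised covariate vectors.
similarity_identical = rbf_kernel([0.2, 0.7], [0.2, 0.7], gamma=1.0)
similarity_distant = rbf_kernel([0.2, 0.7], [3.0, 4.0], gamma=1.0)

print(loss_inside_tube, loss_outside_tube)
print(similarity_identical, similarity_distant)
```

An error of 1.5 falls inside the tube and contributes no loss, while an error of 5.0 is penalised only for the 3.0 that exceeds the tolerance. The kernel returns 1 for identical inputs and decays towards 0 as the inputs move apart.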

4.2.3. Random Forests

Random forests have substantially improved the use of ML in soil science. The RF algorithm is widely adopted for soil mapping because of its robustness to noise and overfitting, reliable performance with limited soil data, and ability to generate accurate and consistent spatial predictions [88,89]. An RF is an ensemble of individual decision trees, each trained on a bootstrap sample of the observations, with a random subset of features considered at each split [90]. Random forests provide several advantages for predicting soil carbon. They effectively capture non-linear relationships and interactions among environmental variables [91] and provide measures of feature importance that help identify the key drivers of soil carbon variation. They are robust for detecting outliers [92]. They accommodate missing data in some implementations through surrogate splits or proximity-based imputation [93], although many popular software libraries (e.g., scikit-learn) require complete data or explicit preprocessing. They are also generally less sensitive to hyperparameter settings than methods such as SVR or gradient boosting, though tuning key parameters (e.g., tree depth, number of features per split, and minimum leaf size) can still yield meaningful improvements, particularly on smaller datasets [94]. These strengths have enabled consistent and reliable performance across diverse soil environments. Random forests have consistently demonstrated high accuracy in predicting SOC and SOM across diverse landscapes and soil types. In agricultural lowland soils, RF was among the best-performing models, and accuracy improved further when combined with residual kriging, revealing generally low SOC and highlighting the strong influence of hydrological factors on SOC distribution [95]. Across Australia, RF achieved the highest accuracy among all models for predicting SOC stocks (R² > 0.8), with forest soils being more predictable than non-forest soils. 
Spatial maps consistently indicated higher SOC in the southeastern and southwestern regions, although estimates varied across modelling approaches, underscoring the uncertainty inherent in continental-scale SOC prediction [96]. In forest ecosystems, RF demonstrated strong predictive capability for soil organic carbon density (SOCD). Its accuracy was further enhanced through regression kriging, which accounted for the spatial autocorrelation of residuals. The hybrid approach demonstrated that combining environmental covariates with spatial interpolation substantially improves SOCD mapping [91]. Similarly, in two experimental fields in East Hungary, RF provided the most accurate spatial predictions for SOC content, SOC stock, and bulk density (R² ≈ 0.80), with terrain attributes and satellite-derived vegetation indices identified as key drivers of SOC variability [97]. While RF was slightly outperformed by XGBoost following the dual feature selection and hyperparameter optimisation, it remained a reliable method for SOC estimation, particularly when integrating satellite and environmental data to improve spatial mapping across agricultural systems [23]. In other ML frameworks, such as those employing the Ninja Optimisation Algorithm, RF-based SOC predictions benefited from feature selection and tuning, resulting in dramatically reduced prediction errors and demonstrating the potential of carefully optimised ML models for high-precision SOC estimation in sustainable land management [80]. In Tieling County, China, RF achieved the highest accuracy in predicting SOM (R² = 0.77, RMSE = 2.85 g kg⁻¹), with NDVI and elevation identified as key predictors. Spatial mapping showed higher SOM in forested areas and in the eastern and western regions, whereas SOM was lower in cultivated land and central areas [98].
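The two sources of randomness that define an RF, bootstrap sampling of observations and random selection of candidate features, can be sketched with the standard library alone. This is an illustration of the sampling scheme, not a tree-growing implementation. It assumes the common formulation in which candidate features are redrawn at each split. The covariate names and sample size are hypothetical.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag_indices(n_samples, in_bag):
    """Rows never drawn into the bootstrap sample."""
    drawn = set(in_bag)
    return [i for i in range(n_samples) if i not in drawn]


def candidate_features(feature_names, n_candidates, rng):
    """Random subset of features considered at a single split."""
    return rng.sample(feature_names, n_candidates)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["elevation", "slope", "ndvi", "rainfall", "clay", "ph"]  # hypothetical
n_samples = 20

in_bag = bootstrap_indices(n_samples, rng)
oob = out_of_bag_indices(n_samples, in_bag)
split_features = candidate_features(features, 2, rng)

print("in-bag rows:", sorted(in_bag))
print("out-of-bag rows:", oob)
print("features tried at this split:", split_features)
```

Rows left out of a given bootstrap sample (the out-of-bag rows) are often used to estimate prediction error for that tree without a separate hold-out set.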

4.2.4. Gradient Boosting Machines

Gradient boosting performs functional gradient descent by sequentially training weak learners, each of which is fit to the negative gradient of the loss function (which, for squared-error loss, equals the residuals of the current ensemble). Key hyperparameters include the learning rate (shrinkage), number of boosting rounds, and tree depth, which together control the bias-variance tradeoff [99]. Implementations such as gradient-boosted regression trees (GBRT), XGBoost [100] (which uses a regularised objective and second-order gradient information), LightGBM (which employs leaf-wise tree growth and histogram-based splitting for faster training on large datasets), and CatBoost (which provides native handling of categorical features and ordered boosting to reduce overfitting) have become popular in applied machine learning. Gradient boosting is generally more sensitive to hyperparameter tuning than random forests. Gradient boosting models have nonetheless proven effective for predicting SOC across diverse landscapes. In the western Urmia Lake region in Iran, GBM explained 43.5% of SOC variation (R²) with an RMSE of 0.23%, with EVI and sand content as the most influential factors and interactions among EVI, wetness, and sand shaping SOC patterns [101]. In the Indian Himalayas, XGB, alongside RF and SVR, enabled high-resolution (30 m) SOC mapping. Using 421 sampled sites and multiple environmental covariates, XGB captured fine-scale spatial variability and supported uncertainty assessment for sustainable soil management [102]. In Peixian County, North China Plain, GBDT explained 68% of the variation in SOC, with precipitation, minimum temperature, distance to settlements, and distance to the lake as key drivers. Non-linear and interaction effects, particularly involving lake proximity, revealed SOC responses that plateau beyond certain thresholds [103].
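The sequential fitting described at the start of this subsection can be sketched for the squared-error case, where the negative gradient is simply the current residual. The sketch uses one-split decision stumps on a single covariate as weak learners. It is a teaching illustration rather than any of the named libraries, and the data, learning rate, and number of rounds are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds, learning_rate):
    """Squared-error boosting; returns the training MSE after each round."""
    predictions = [sum(y) / len(y)] * len(y)  # start from the mean
    mse_history = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        predictions = [
            pi + learning_rate * predict_stump(stump, xi)
            for xi, pi in zip(x, predictions)
        ]
        mse_history.append(
            sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
        )
    return mse_history


# Hypothetical covariate (e.g., a vegetation index) and SOC (%) values.
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
y = [0.5, 0.6, 0.9, 1.4, 1.5, 1.9, 2.6, 2.8]
mse_history = boost(x, y, n_rounds=20, learning_rate=0.3)

print("MSE after round 1:", mse_history[0])
print("MSE after round 20:", mse_history[-1])
```

The training error falls with each round, which is also why the learning rate and the number of rounds need to be tuned against held-out data: training error alone would always favour more rounds.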

4.3. Neural Networks and Deep Learning

4.3.1. Artificial Neural Networks

Artificial neural networks (ANNs) are information-processing systems inspired by the biological brain, composed of interconnected units called neurons arranged in hierarchical layers. These networks map complex data relationships by multiplying inputs by learnable weights, adding a bias term, and applying a non-linear activation function (e.g., ReLU, sigmoid) to produce an output at each neuron. Through backpropagation, ANNs iteratively adjust weights to minimise the discrepancy between predicted and observed values, as quantified by a loss function [104].
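The computation performed at a single neuron, as described above, can be written as a minimal sketch. The inputs, weights, and biases are hand-picked hypothetical values, not learned parameters, and backpropagation is not shown.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical standardised covariates and hand-picked (not learned) parameters.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]

relu_out = neuron_output(inputs, weights, bias=0.5, activation=relu)
relu_clipped = neuron_output(inputs, weights, bias=-1.0, activation=relu)
sigmoid_out = neuron_output(inputs, weights, bias=0.0, activation=sigmoid)

print(relu_out, relu_clipped, sigmoid_out)
```

Here the weighted sum of the inputs is zero, so the output is determined by the bias and the activation: ReLU passes a positive pre-activation unchanged and clips a negative one to zero, while the sigmoid maps zero to 0.5.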

Since the 1990s, ANNs have been applied to predict soil properties, including SOC, with varying success [105,106]. While ANNs can capture non-linear patterns in SOC-environment relationships, their performance is generally comparable to ensemble tree methods rather than consistently superior. For example, RFs outperformed ANNs in visible–near-infrared spectroscopy studies of soil nitrogen and carbon [107] and in SOC prediction in the Peruvian highlands, where ANNs were optimal only for specific SOC fractions [108]. Similarly, although soil science-informed neural networks offered improved interpretability with comparable accuracy [109], and deep neural networks showed advantages for SOC estimation in northern Iran [110], comprehensive reviews indicate that no single algorithm consistently dominates, with performance strongly dependent on dataset characteristics [111].

The limited superiority of ANNs likely reflects the moderate complexity of SOC–environment relationships, which may not justify the architectural overhead of deep networks. Ensemble methods, by efficiently partitioning feature space, often perform equally well on modestly sized soil datasets. Integrating diverse data sources, such as remote-sensing spectral indices, topographic, and climatic variables, enhances predictive performance across algorithms [112,113]. Thus, optimal SOC prediction typically relies on strategic algorithm integration, expanded datasets, and hybrid frameworks that incorporate biogeochemical understanding rather than sole dependence on ANNs.

Empirical studies illustrate these points. Using a large dataset of 8556 soil samples, multilayer perceptron ANNs successfully predicted SOM from routine chemical attributes, achieving high calibration accuracy (R² = 0.92) but notably lower validation performance (R² = 0.76), suggesting some degree of overfitting. This calibration-validation gap underscores the importance of reporting both metrics and using regularisation or early stopping. Interestingly, increasing network complexity did not yield substantial improvements, as simpler architectures performed comparably [114]. In hilly terrain, ANNs developed to predict SOM from topographic variables, including topographic wetness index, relative position index, and slope length, outperformed multiple linear regression, achieving higher accuracy (R² = 0.87 vs. 0.82) and lower prediction error. Here, the ANN effectively captured non-linear relationships, with topographic wetness index identified as the most influential predictor [115].

Recent work using 50 soil samples demonstrated that ANNs could predict SOM from RGB colour values, achieving R² values up to 0.91 during training in a comparison with Gaussian Process Regression, SVMs, and ensemble tree models. However, conclusions about generalisation drawn from only 50 samples are inherently limited: the training R² measures fit rather than generalisation, and results from such small datasets are highly sensitive to the specific train/test split [116]. Taken together, these findings suggest that ANNs can capture complex non-linear relationships across diverse soil datasets and may offer rapid, cost-efficient alternatives to conventional laboratory analyses, provided that performance is confirmed on independent validation data.

4.3.2. Convolutional Neural Networks

Convolutional neural networks (CNNs) are architectures that apply learnable local filters (convolutions) to extract hierarchical feature representations from structured data. Although originally developed for 2D image data, 1D CNNs are widely used for spectral data in soil science, where convolutions operate along the wavelength dimension to capture local spectral patterns [117]. Image analysis and remote sensing have been revolutionised by CNNs [118]. Deep learning approaches, particularly CNNs, have emerged as powerful tools for predicting SOC content from spectral and Earth observation data. Recent studies demonstrate that CNNs can improve upon traditional ML methods such as local weighted regression and RFs [74,119]. A multi-channel CNN framework utilising complementary information from different spectral pre-treatment techniques achieved superior results, with validation performance of R² = 0.64 and RMSE = 12.03 g kg⁻¹ [119]. Hybrid architectures that integrate multiple deep learning techniques further enhance predictive accuracy. LSTM-CNN models that combine long short-term memory networks with convolutional layers achieved remarkable performance (R² = 0.96, RMSE = 1.66 g kg⁻¹) in predicting SOM from spectral libraries [120]. Similarly, an efficient channel attention-enhanced CNN-LSTM model (CNN-LSTM-ECA) achieved an R² of 0.92 and demonstrated generalisation across diverse datasets [120]. Multi-source data fusion substantially improves SOC prediction by integrating spectral, texture, and colour information. A three-branch CNN model combining spectral bands (selected via correlation-based methods) with soil image texture features achieved R² = 0.87, representing a 23% improvement over single-input models [121]. These advances underscore the potential of CNNs for rapid, non-destructive assessment of soil carbon at field and landscape scales, with immediate applications in precision agriculture and carbon monitoring [121,122].
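The operation of a 1D convolution along the wavelength dimension can be illustrated with a minimal sketch. The reflectance values and the filter weights are hypothetical; in a trained CNN the filter weights would be learned from data rather than chosen by hand.

```python
def conv1d_valid(spectrum, kernel):
    """Slide a filter along the wavelength axis (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    width = len(kernel)
    return [
        sum(k * s for k, s in zip(kernel, spectrum[i : i + width]))
        for i in range(len(spectrum) - width + 1)
    ]


# Hypothetical reflectance values with a dip (absorption feature) at index 3.
spectrum = [0.5, 0.5, 0.5, 0.25, 0.5, 0.5, 0.5]
# Hand-picked second-difference filter; a CNN would learn such weights.
second_difference = [1.0, -2.0, 1.0]

feature_map = conv1d_valid(spectrum, second_difference)
print(feature_map)
```

The filter responds most strongly where the window is centred on the dip and returns zero over the flat parts of the spectrum, which is the sense in which convolutional filters capture local spectral patterns regardless of where they occur.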

4.3.3. Recurrent Neural Networks

Recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent units (GRUs), are designed for sequential data and can, in principle, model the temporal dynamics of soil carbon. Few studies have applied these approaches to soil carbon, possibly because its temporal dynamics are relatively slow (annual to decadal timescales) and long-term datasets are limited. For most current applications, the added complexity of RNNs does not appear justified relative to simpler time-series approaches. Nevertheless, some recent studies report improved prediction of soil properties, particularly SOM, by leveraging temporal and spatial patterns in environmental data. Wang et al. (2023) [123] developed a model combining bidirectional gated recurrent units (BiGRU) with an attention mechanism to estimate soil nutrient contents, including OM, nitrogen, phosphorus, and potassium. The attention mechanism enabled the model to focus on key features in the input data, while the BiGRU-RNN architecture captured both forward and backward dependencies, allowing effective modelling of long-term interactions in soil properties. The proposed Att-BiGRU-RNN model achieved strong predictive performance for soil OM, with an R² of 0.959, outperforming conventional RNNs and related architectures, highlighting its robustness and accuracy for soil nutrient estimation [123]. Similarly, Zhang et al. (2022) [124] demonstrated that integrating long-term vegetation phenology from MODIS time-series data into a hybrid CNN-LSTM framework improved SOC prediction at the regional scale. Here, convolutional neural networks extracted spatial features from static environmental variables, while LSTM captured temporal dynamics in vegetation phenology over a decade. 
Incorporating these dynamic phenological variables improved model accuracy relative to traditional approaches, such as RFs, demonstrating the importance of temporal environmental covariates for predicting SOC [124]. These studies indicate that RNN-based architectures, particularly when combined with attention mechanisms or hybrid CNN-LSTM models, are potentially effective for modelling SOM and related nutrients when temporal data are available, though the evidence base remains limited and whether the added architectural complexity is justified over simpler approaches for most soil carbon applications remains an open question. By capturing both temporal dependencies and spatial features, these approaches offer robust, data-driven tools for precision agriculture, soil fertility management, and environmental monitoring.

The machine learning methods reviewed above represent the most widely adopted approaches in the soil science literature and form the foundation of contemporary soil carbon prediction models. This review does not claim comprehensive coverage: numerous other ML methods exist and are employed in soil science contexts. The methods covered in this section were selected for their prominence in the soil carbon assessment literature and their demonstrated effectiveness for this specific application.

5. Challenges and Limitations of ML Approaches

Machine learning offers promising tools for soil carbon assessment but faces substantial challenges that limit practical application and reliability. These challenges include data scarcity, transferability issues, lack of interpretability, limitations in distinguishing correlation from causation, spatial autocorrelation, and non-stationarity in relationships. Understanding these constraints is critical for realistic deployment.

5.1. Data Distribution Challenges Specific to SOC

Unlike pH and other soil properties with more symmetric distributions, SOC is highly right-skewed globally, with a preponderance of low values and occasional extremes in peatlands and organic-rich ecosystems. Because models fitted by minimising average error are pulled towards the bulk of the data, this skewness tends to cause ML models to overpredict in low-carbon areas and underpredict in carbon-rich areas.

5.2. Data Scarcity

ML models require large, labelled datasets to perform reliably, yet soil carbon datasets are often limited, particularly in developing regions or remote areas. For instance, a deep learning study on SOC in Tuscany, Italy, relied on only a few hundred samples, necessitating simplified network architectures and regularisation to avoid overfitting, yielding modest prediction accuracy (R² = 0.26) [125,126,127]. Similarly, in Greece, a dataset of just 36 samples required shallow neural networks and intensive cross-validation [112]. Small datasets constrain deep learning performance, creating a cycle in which costly sampling is required to generate sufficient training data, yet resources are often insufficient.

5.3. Interpretability and the Black Box Problem

Complex models such as deep neural networks and ensemble methods offer high accuracy but limited interpretability, thereby hindering scientific understanding and policy applications. Soil respiration studies emphasise that black-box models restrict usability [128]. Explainable artificial intelligence (AI) methods (e.g., SHAP) are increasingly mature and widely adopted, and they reveal that different models rely on distinct features: trees favour topography, whereas neural networks favour soil chemistry [78]. Challenges remain, however, in faithfully representing complex model reasoning and in the computational cost of some methods when applied to large datasets. Approximation errors and non-linearities further complicate full interpretability [78].

5.4. Causality Versus Correlation

Standard supervised ML optimises predictive accuracy by exploiting correlations in training data, without inherently distinguishing causal from spurious associations. Spectral indices, for example, may correlate with SOC in one region for reasons not present elsewhere. Semi-arid soil studies report weak correlations between spectral indices and soil properties due to non-stationary relationships, limiting the reliability of ML [129]. The growing field of causal machine learning (including causal forests, invariant causal prediction, and directed acyclic graphs for variable selection) offers tools for moving beyond pure correlation. Hybrid approaches that integrate mechanistic biogeochemical models can also impose causal constraints, though they increase computational and expertise demands [125].

5.5. Spatial Autocorrelation and Data Leakage

Soil properties commonly exhibit strong spatial autocorrelation, which violates the independence assumptions underlying many standard machine learning (ML) approaches. This spatial structure can lead to biased model evaluation, particularly when conventional random cross-validation (CV) is applied. Random splitting of data may place spatially proximate observations in both training and test sets, resulting in data leakage and artificially inflated performance metrics.

Empirical evidence highlights the magnitude of this issue. For instance, in Estonia, incorporating spatial covariates into Random Forest (RF) models led to only modest improvements in SOC prediction accuracy (R² increase of +0.02) [130]. However, when spatially explicit cross-validation strategies were used, model performance dropped substantially, yielding more realistic estimates (R² ≈ 0.45 compared to 0.66 obtained with random CV) [131]. Similarly, models reporting high predictive accuracy (e.g., R² = 0.80) under random CV may exhibit significantly lower performance (R² ≈ 0.45) when evaluated using spatially stratified approaches [130].

Despite increasing awareness of this issue, spatial cross-validation remains underutilised in soil ML applications. This raises concerns regarding the true generalisability of many reported high-performance models. To address this limitation, future studies must prioritise validation frameworks that explicitly account for spatial dependence, ensuring that model evaluation more accurately reflects real-world predictive performance [130,131].
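One simple form of spatially explicit validation, block cross-validation, can be sketched as follows. Samples are grouped by grid cell and whole cells are assigned to folds, so spatially proximate observations cannot be split between training and test sets. This is only one of several possible designs and is not taken from the cited studies. The coordinates, block size, and number of folds are hypothetical, and no buffer zone between neighbouring blocks is applied.

```python
import math


def block_id(x, y, block_size):
    """Grid cell containing the point (x, y); block_size is in coordinate units."""
    return (math.floor(x / block_size), math.floor(y / block_size))


def spatial_block_folds(coords, block_size, n_folds):
    """Assign each sample to a fold so that whole blocks stay together."""
    blocks = sorted({block_id(x, y, block_size) for x, y in coords})
    fold_of_block = {block: i % n_folds for i, block in enumerate(blocks)}
    return [fold_of_block[block_id(x, y, block_size)] for x, y in coords]


# Hypothetical sample coordinates (km): two tight clusters and two isolated points.
coords = [
    (0.1, 0.2), (0.3, 0.4), (0.2, 0.9),  # cluster A
    (5.1, 5.2), (5.4, 5.3),              # cluster B
    (9.7, 0.5), (0.6, 9.8),              # isolated samples
]
folds = spatial_block_folds(coords, block_size=1.0, n_folds=2)
print(folds)
```

With random splitting, the three samples of cluster A could easily be divided between training and test data. Here they always share a fold, so a model is never tested on a location whose immediate neighbours it has already seen.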

5.6. Non-Stationarity

Predictor–SOC relationships vary across space and time. Standard stationary models perform poorly in heterogeneous or mountainous terrain, as observed in SOC mapping studies [132]. Two-point ML (TPML) approaches combine global and local models to capture location-specific heterogeneity [132]. Temporal non-stationarity, such as SOC dynamics across Argentina (1982–2017), further increases prediction uncertainty due to limited and unevenly distributed data [133].

5.7. Reproducibility and Model Transparency

Reproducibility represents a critical challenge inadequately addressed in ML soil science literature. Many published models lack publicly available code, data, or sufficient methodological detail for independent reproduction, omitting hyperparameters, random seeds, preprocessing steps, and validation procedures. Additionally, many studies report training performance rather than true validation performance, and standard protocols for reporting methodology remain underdeveloped. Machine learning models, unless carefully designed, often learn region-specific correlations rather than generalisable relationships, and transfer learning approaches requiring retraining with local data [127] highlight the reproducibility burden. Until standards improve through mandatory code sharing, transparent validation reporting, and rigorous baseline comparisons, claims of ML superiority over classical methods remain partially unsubstantiated. Furthermore, temporal dynamics and practical implementation challenges limit model deployment, with limited success in detecting actual soil carbon change from remote sensing data [133], suggesting fundamental limitations in current ML approaches that go beyond reproducibility issues alone.

6. Comparative Analysis: Classical Versus Machine Learning Approaches

6.1. Prediction Accuracy

Tree-based machine learning methods (RFs, gradient-boosting) frequently outperform classical geostatistical approaches (ordinary kriging) and multiple regression for spatial prediction of soil carbon across heterogeneous environmental contexts [134,135]. Machine learning models capture non-linear relationships between soil carbon and environmental covariates more effectively than parametric methods [136]. However, hybrid approaches that combine machine learning with residual kriging to account for remaining spatial autocorrelation in prediction residuals improve prediction accuracy only marginally, although they yield more realistic spatial patterns than either approach alone [95].

Deep neural network and CNN models rarely yield substantial accuracy gains over RFs or gradient boosting for soil carbon prediction and often underperform these simpler approaches when training data are limited [78,134]. This suggests that while soil carbon responds to complex environmental interactions, the problem may not require the modelling capacity of deep architectures, particularly in data-limited contexts.

Laboratory methods (dry combustion, elemental analysis) remain the gold standard for direct carbon quantification at measured points but lack inherent spatial predictive capability when applied independently.

6.2. Cost and Efficiency

Machine learning and classical methods differ fundamentally in their cost structure across scales. Classical field sampling and laboratory analysis require recurrent expenditure per location but minimal setup investment. Machine learning models, by contrast, require substantial initial investment in data assembly, preprocessing, and model development, but prediction at additional gridded locations incurs negligible marginal cost once trained [122].

For site-specific assessments, classical approaches are typically more cost-effective. For regional-to-national scales, machine learning demonstrates clear cost advantages due to its sublinear scaling with spatial extent: expanding predictions to larger areas requires modest increases in computational resources rather than proportional increases in field effort. Hybrid approaches that combine targeted field sampling with machine learning interpolation often optimise the cost–benefit trade-off [136]. Qu et al. (2024) [137] showed that geostatistical models achieve their best performance only under high-density field sampling, which is costly, whereas machine learning approaches become more cost-effective and comparatively advantageous at lower sampling densities, linking model choice closely to budget and sampling intensity.

6.3. Data Requirements and Availability

Classical approaches require accessible soil databases that vary widely in quality, spatial density, and metadata consistency across regions. Machine learning approaches require both measured soil carbon observations (typically 100–500 samples for regional models) and environmental covariate data (satellite imagery, climate data, terrain attributes). Global soil databases, including the Harmonised World Soil Database, USDA NRCS SSURGO, and ISRIC SoilGrids, now provide hundreds to thousands of measurements for model training. Satellite imagery is increasingly available at low or no cost (Landsat, Sentinel-2, MODIS). However, the geographic distribution of available training data is heavily skewed toward developed regions, with significant gaps in parts of Africa, South America, and South Asia, potentially limiting model transferability [134,135].

6.4. Interpretability and Scientific Understanding

Classical methods, particularly multiple regression and geostatistics, are fully transparent. Regression coefficients directly quantify the direction and magnitude of relationships between predictors and soil carbon. Statistical significance testing aligns with conventional scientific inference. Machine learning interpretability varies substantially. Linear models and generalised additive models are fully transparent. Tree-based methods provide feature-importance rankings but limited mechanistic insight into how relationships function. Neural networks typically produce opaque predictions that reflect learned patterns difficult to articulate [78]. Recent advances in explainable artificial intelligence, particularly SHAP (SHapley Additive exPlanations) values, decompose complex model predictions into per-feature contributions, substantially improving interpretability [138,139]. However, post hoc explanations do not constitute genuine causal understanding and may be misleading if the underlying features correlate with confounders [78]. Different model classes rely on distinct environmental features: decision tree-based models emphasise topographic variables, whereas neural networks often depend more heavily on soil chemical properties such as pH [78].
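The idea of decomposing a prediction into per-feature contributions can be illustrated by computing exact Shapley values for a small model. This sketch is not the SHAP library, which relies on faster approximations and model-specific algorithms. It enumerates every feature ordering, which is feasible only for a handful of features. The toy model, its coefficients, and the input and baseline values are hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.

    Features absent from a coalition are set to their baseline value.
    Cost grows factorially with the number of features.
    """
    n = len(x)
    orders = list(permutations(range(n)))
    contributions = [0.0] * n
    for order in orders:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            current[feature] = x[feature]  # add this feature to the coalition
            output = model(current)
            contributions[feature] += output - previous_output
            previous_output = output
    return [c / len(orders) for c in contributions]


def toy_model(features):
    """Hypothetical SOC model with an interaction between two covariates."""
    elevation, wetness, ph = features
    return 1.0 + 0.5 * elevation + 2.0 * wetness * elevation - 0.25 * ph


x = [2.0, 1.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

print("contributions:", phi)
print("sum:", sum(phi), "vs", toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline output, and the interaction term is shared equally between the two interacting covariates. The decomposition describes the model, not the soil: it says nothing about whether the modelled relationships are causal.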

6.5. Uncertainty Quantification

Classical statistical methods naturally yield uncertainty estimates [140]. Kriging provides both point predictions and variance estimates, enabling the construction of confidence intervals [141]. Regression models yield prediction intervals accounting for model error and residual variance. While standard ML implementations often produce only point predictions, many methods have natural extensions for uncertainty estimation. Random forests can leverage variance across individual tree predictions, or quantile regression forests, to provide prediction intervals. Standard neural networks typically require additional techniques such as ensembling, Monte Carlo dropout, or Bayesian formulations for uncertainty estimation. Methods for quantifying uncertainty in machine learning include ensemble approaches and deep ensembles (where variation across models reflects epistemic uncertainty), quantile regression (predicting quantiles rather than means), conformal prediction (which provides distribution-free coverage guarantees without retraining), Bayesian neural networks, and Monte Carlo dropout [142]. However, these remain less standard than classical confidence intervals, and uncertainty quantification is frequently omitted from machine-learning-based soil carbon maps, thereby limiting their utility for policy decisions that require explicit uncertainty characterisation [134].
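Of the methods listed above, split conformal prediction is among the simplest to sketch, because it wraps any point predictor without retraining. The sketch below assumes that residuals on a held-out calibration set are already available and that calibration and new data are exchangeable. The residuals, the 75% coverage level, and the point prediction are hypothetical.

```python
import math


def conformal_half_width(calibration_residuals, alpha):
    """Half-width of a split-conformal interval with target coverage 1 - alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[rank - 1]


# Hypothetical residuals (observed minus predicted SOC, g/kg) on calibration data.
calibration_residuals = [-0.5, 0.25, 1.5, -2.0, 0.75, -1.0, 3.0, -0.25, 1.25]
half_width = conformal_half_width(calibration_residuals, alpha=0.25)

point_prediction = 12.0  # hypothetical model output at a new location
interval = (point_prediction - half_width, point_prediction + half_width)
print(interval)
```

The interval has the same width everywhere, which is a known limitation of this basic variant. In a spatial setting the exchangeability assumption is also questionable when calibration samples are clustered, so the nominal coverage should be checked rather than assumed.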

6.6. Transferability, Domain Shift, and Generalisation

A fundamental challenge in soil science is the limited transferability of models across regions. Models trained in one geographic context often perform poorly when applied elsewhere due to differences in soil types, climate, vegetation, land use, and underlying biological processes. These factors alter the relationships between environmental covariates and soil properties, violating the assumption of universal, stable relationships. Classical approaches typically require regional parameterisation, while machine learning (ML) models often capture region-specific correlations rather than generalisable patterns [143].

Empirical studies highlight the extent of this issue. For example, SOC models transferred between Bavaria and Baden-Württemberg (Germany) exhibited reduced accuracy and systematically overpredicted high SOC values in areas with differing soil conditions [126]. Similarly, mixed-region models improved transferability only for certain covariates, indicating that not all predictors generalise equally well across spatial domains. In practice, most models require retraining or recalibration with local data to achieve acceptable predictive performance in new regions.

Domain adaptation and transfer learning approaches offer promising strategies to mitigate these limitations. For instance, fine-tuning a global model trained on 106,167 samples improved regional prediction accuracy, reducing mean absolute error (MAE) by approximately 11% [127]. However, such approaches depend on the availability of large, high-quality datasets and substantial computational resources. Emerging methods, including multi-task learning and foundation models pretrained on satellite imagery, may further enhance generalisation by learning feature representations that are transferable across regions, although these approaches remain under active development.

Ensemble methods and hybrid geostatistical–ML frameworks may provide partial robustness to domain shift, but their effectiveness is still context-dependent [134]. Importantly, soil properties such as carbon content are strongly influenced by biological factors, including indigenous microbial communities and vegetation dynamics. For example, soils supporting abundant biotrophic organisms such as arbuscular mycorrhizal fungi can exhibit markedly different plant-derived carbon inputs compared to otherwise similar soils where such symbioses are suppressed by management practices (e.g., high phosphorus fertilisation or fungicide application). These biologically mediated differences further complicate model transferability, even between environmentally similar regions with contrasting management histories.

6.7. Temporal Dynamics and Change Detection

Classical point-in-time sampling reveals current conditions but requires expensive repeated campaigns to document temporal change over years to decades. Machine learning applied to time-series satellite imagery can, in principle, characterise temporal trends in vegetation and reflectance properties [144]. However, directly detecting changes in subsurface soil carbon from satellite data remains challenging because multiple surface factors, such as vegetation dynamics, moisture, and roughness, shape reflectance. Attributing spectral change to underlying soil carbon change rather than other surface factors is difficult. Some studies have applied time-series machine learning to vegetation productivity as a proxy for soil carbon trends [65,133], but direct detection of soil carbon change remains limited.

The comparative patterns outlined above are exemplified in Table 1, which summarises representative studies from the reviewed literature across different methods, sample sizes, geographical contexts, and performance metrics, providing empirical grounding for the assertions regarding method performance, data requirements, and transferability. A multi-dimensional comparison of classical, ML, and hybrid approaches for soil carbon prediction is shown in Figure 1.

6.8. Hybrid Frameworks: Architecture and Implementation

Three distinct architectures effectively couple classical and machine learning approaches. Sequential residual correction combines random forest predictions with kriging applied to residuals: RF captures non-linear environment–carbon relationships while kriging corrects the spatially autocorrelated error that RF leaves unexplained, achieving R² = 0.80–0.87 in forest ecosystems [91]. Parallel multi-temporal deep learning uses CNN-LSTM frameworks, in which convolutional networks extract spatial features from environmental covariates and LSTM networks capture temporal dynamics of vegetation phenology; the assumption that vegetation phenology reflects soil carbon is embedded in the model architecture [124]. Multi-source data fusion uses separate convolutional branches for spectral bands, soil image texture, and colour information, merging their feature maps into a unified representation that leverages multiple observable soil dimensions and achieving R² = 0.87, a 23% improvement over single-input models [121].

6.8.1. Fusion Architecture Descriptions

Three fusion architectures are prominent in the hybrid ML-classical literature for soil carbon prediction, each with distinct data flow and integration mechanics.

Sequential residual correction (RF-Kriging): In this architecture, an RF model first processes a feature matrix of environmental covariates (remote-sensing indices, topographic variables, climate layers) and generates spatial predictions of SOC. The prediction residuals, the spatially structured errors that the RF cannot explain, are then passed to a kriging model. The kriging layer fits a semivariogram to the residual field and interpolates corrections at unsampled locations. The final output is the sum of the RF spatial prediction layer and the kriged residual layer. This architecture is two-stage and sequential: the first stage is a non-parametric ML layer; the second is a geostatistical correction layer. The critical architectural feature is that the RF handles global, non-linear feature–SOC relationships, while kriging handles local, spatially autocorrelated residual structure that ML cannot capture.
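The data flow described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the RF stage is replaced by a fixed stand-in value (`trend_at_target`), the semivariogram is assumed to be exponential with hypothetical parameters rather than fitted to data, and the coordinates and residuals are invented. It solves the ordinary kriging system for one target location and adds the kriged residual to the trend prediction.

```python
import math

def exp_variogram(h, nugget, sill, rng):
    """Exponential semivariogram with practical range rng; gamma(0) = 0."""
    if h == 0.0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-3.0 * h / rng))

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def krige_residual(coords, residuals, target, nugget, sill, rng):
    """Ordinary kriging of residuals at one location.
    Returns (kriged residual, kriging variance)."""
    n = len(coords)
    # Semivariogram matrix bordered by the unbiasedness constraint (weights sum to 1)
    a = [[exp_variogram(math.dist(coords[i], coords[j]), nugget, sill, rng)
          for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [exp_variogram(math.dist(c, target), nugget, sill, rng) for c in coords] + [1.0]
    sol = solve(a, b)
    weights, lagrange = sol[:n], sol[n]
    estimate = sum(w * r for w, r in zip(weights, residuals))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + lagrange
    return estimate, variance

# Hypothetical sample locations (km) and residuals (observed minus trend, g/kg)
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
residuals = [1.2, 0.8, -0.5, -0.9]
trend_at_target = 20.0  # stand-in for an RF prediction at the target location

res_hat, res_var = krige_residual(coords, residuals, (0.5, 0.5),
                                  nugget=0.1, sill=1.0, rng=3.0)
final_soc = trend_at_target + res_hat  # stage 1 prediction + stage 2 correction
```

Because the target sits at the centre of the four samples, each receives an equal weight and the kriged residual is simply their mean. The kriging variance returned alongside it is the quantity that the geostatistical layer contributes to uncertainty maps.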

Parallel multi-branch CNN (multi-source data fusion): This architecture processes multiple data modalities in parallel through separate convolutional branches before merging. Branch 1 receives soil spectral data (vis-NIR bands) and applies 1D convolutions along the wavelength dimension to extract spectral features. Branch 2 receives soil image texture features extracted via Grey Level Co-occurrence Matrix (GLCM) analysis. Branch 3 processes soil colour information in RGB or Lab colour space. Each branch outputs a feature vector; these are concatenated into a unified representation and passed through fully connected layers to produce SOC predictions. Parallel architecture is essential: it allows each data modality to be processed by convolutions tuned to its specific structure before feature-level fusion.

CNN-LSTM spatio-temporal hybrid: This architecture couples spatial and temporal processing in sequence. A CNN module (typically 2D or 1D convolutional layers) extracts spatial feature representations from environmental covariates at each time step. These spatial feature vectors are passed sequentially to an LSTM module that models temporal dependencies across multiple years of vegetation phenology data (e.g., 10-year MODIS time series). The LSTM output is passed through fully connected layers to produce SOC predictions. The critical design principle is the separation of spatial feature extraction (CNN) from temporal dependency modelling (LSTM), allowing the architecture to exploit both the spatial covariate structure and the year-to-year dynamics of vegetation-soil coupling. The fusion architectures are illustrated in Figure 2.

6.8.2. Structured Implementation Workflow

Implementing a hybrid ML-classical framework for soil carbon prediction follows a structured workflow that integrates domain knowledge at each stage:

Step 1: Feature selection. Candidate covariates are assembled from multiple sources: satellite-derived spectral indices (NDVI, EVI, NDSI), terrain attributes (slope, aspect, topographic wetness index, curvature), climate variables (mean annual temperature, precipitation), and soil survey ancillary data. Feature importance algorithms (RF variable importance, SHAP values) are applied alongside expert pedological knowledge to identify covariates that most strongly drive SOC variation in the target region. Covariates with high collinearity are removed using variance inflation factor screening. The output is a curated feature matrix for model training.

Step 2: Training data design. Sampling locations are designed to cover the covariate space identified in Step 1. Stratified random sampling ensures representation across major environmental gradients (e.g., elevation bands, land-use classes, soil types). ML-identified areas of high covariate gradient or prediction uncertainty may receive targeted additional sampling to maximise information gain. Collected samples undergo laboratory analysis (dry combustion or Walkley-Black) to generate ground-truth SOC values.

Step 3: Model training. The curated feature matrix and ground-truth SOC labels are used to train the primary ML model (typically RF or gradient boosting). Hyperparameter optimisation is performed via k-fold cross-validation, tuning parameters such as the number of trees, maximum tree depth, minimum leaf size, and (for gradient boosting) the learning rate. Training is repeated across multiple random seeds to assess model stability.

Step 4: Residual analysis. Prediction residuals (observed minus predicted) are extracted from the training predictions. The spatial distribution of residuals is examined using semivariogram analysis to determine whether residuals exhibit significant spatial autocorrelation. If the semivariogram shows clear spatial structure (a sill well above the nugget and an identifiable range), kriging of residuals is warranted.

Step 5: Residual kriging. Ordinary or universal kriging is fitted to the spatial residual field using the semivariogram parameters estimated in Step 4. Kriged residual surfaces are generated across the spatial domain and added to the ML spatial prediction layer to produce the final corrected SOC map.

Step 6: Uncertainty quantification. Prediction uncertainty is estimated by combining RF prediction variance (quantile regression forests or variance across individual trees) with kriging variance from the residual layer. Uncertainty maps are generated alongside the SOC prediction maps. These bounds are critical for policy applications, where the credibility of carbon accounting depends on explicit uncertainty characterisation.

Step 7: Spatial validation. Model performance is evaluated using spatial cross-validation (blocked or buffered k-fold CV) rather than standard random splits, to prevent data leakage from spatial autocorrelation. Performance metrics (R², RMSE, MAE) from spatial CV provide realistic estimates of predictive accuracy in unsampled areas.

Step 8: Map production and interpretation. Final SOC maps are produced with associated uncertainty layers. Pedological interpretation of spatial patterns is performed by overlaying predictions with soil type maps, land-use data, and terrain layers to verify that predicted distributions align with known soil-forming processes. Anomalies are flagged for targeted field verification. The eight-step structured implementation workflow for hybrid ML-classical soil carbon prediction is depicted in Figure 3.
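The residual analysis in Step 4 rests on the empirical semivariogram. A minimal sketch of the classical (Matheron) estimator is given below, which averages half the squared difference between residual pairs within each distance bin. The transect coordinates, residual values, and binning parameters are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math
from itertools import combinations

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Matheron estimator: mean of 0.5 * squared differences per distance bin.
    Returns (mean distance, semivariance, pair count) for each non-empty bin."""
    sums = [0.0] * n_lags
    dists = [0.0] * n_lags
    counts = [0] * n_lags
    for i, j in combinations(range(len(values)), 2):
        h = math.dist(coords[i], coords[j])
        k = int(h // lag_width)          # index of the distance bin
        if k < n_lags:                   # pairs beyond the last bin are ignored
            sums[k] += 0.5 * (values[i] - values[j]) ** 2
            dists[k] += h
            counts[k] += 1
    return [(dists[k] / counts[k], sums[k] / counts[k], counts[k])
            for k in range(n_lags) if counts[k] > 0]

# Hypothetical residuals along a transect with 1 km spacing
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1.0, 2.0, 4.0, 3.0]
variogram = empirical_semivariogram(coords, values, lag_width=1.0, n_lags=3)
```

Semivariance that rises with distance before levelling off indicates spatially structured residuals, in which case the kriging in Step 5 is worthwhile. A flat semivariogram suggests that the ML model has already absorbed the spatial signal. Real datasets need far more pairs per bin than this toy transect provides.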

Implementation requires integrating domain knowledge with data-driven approaches at each stage. Feature selection combines expert identification of candidate covariates with feature-importance algorithms that identify predictors that most strongly control soil carbon [97]. Training data design couples ML guidance (identifying covariate gradients and areas of prediction uncertainty) with classical field sampling to maximise information gain while maintaining rigour [91,97]. Model coupling can be sequential (ML first, then classical correction of residuals) or parallel (embedding pedological constraints in neural network loss functions or architecture). Validation must employ spatial cross-validation to prevent data leakage [130], while classical statistical methods quantify uncertainty. Transfer learning approaches fine-tune global models on local data, reducing error by ~11% through domain adaptation [127].
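One simple way to implement the spatial cross-validation mentioned above is to assign whole grid blocks, rather than individual samples, to folds. The sketch below is a minimal illustration: the block size, the sample coordinates, and the round-robin assignment of blocks to folds are assumptions made here for brevity. In practice the block size would be chosen with reference to the residual semivariogram range, and blocks might be assigned at random or separated by buffers.

```python
def block_id(x, y, block_size):
    """Map a coordinate to the square grid block that contains it."""
    return (int(x // block_size), int(y // block_size))

def blocked_folds(coords, block_size, k):
    """Assign whole blocks to folds so that samples from one block
    never appear in both the training and the test set.
    Returns a list of k (train_indices, test_indices) pairs."""
    blocks = sorted({block_id(x, y, block_size) for x, y in coords})
    fold_of_block = {b: i % k for i, b in enumerate(blocks)}  # round-robin
    fold_of_sample = [fold_of_block[block_id(x, y, block_size)] for x, y in coords]
    folds = []
    for f in range(k):
        test = [i for i, s in enumerate(fold_of_sample) if s == f]
        train = [i for i, s in enumerate(fold_of_sample) if s != f]
        folds.append((train, test))
    return folds

# Hypothetical sample coordinates (km) and 5 km blocks
coords = [(0.5, 0.5), (0.7, 0.2), (5.1, 0.3), (5.5, 0.9), (0.2, 5.5), (5.9, 5.1)]
folds = blocked_folds(coords, block_size=5.0, k=2)
```

Each pair of index lists can then be used to fit and score the model in the usual way. Compared with random splits, this typically yields lower but more realistic accuracy estimates, because test samples are no longer surrounded by near-identical training neighbours.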

Case analysis demonstrates the coupling mechanics. Regression kriging in forests follows a fixed workflow: assemble environmental covariates, train an RF to identify feature importance and generate SOCD predictions, calculate residuals, apply kriging to the residuals, and add the kriged residuals to the RF predictions [91]. Feature-guided sampling in East Hungary shows how ML optimises classical fieldwork: RF identifies terrain attributes and satellite indices as key predictors; these guide the design of sampling campaigns targeting high-gradient areas; retraining with expanded data, plus residual kriging, yields R² = 0.80 with improved realism while reducing sampling effort [97]. CNN-LSTM in regional SOC prediction separates spatial and temporal processes: CNNs extract environmental features, while LSTMs capture 10-year MODIS phenology trends, subject to the constraint that vegetation dynamics reflect soil productivity [124]. Multi-source fusion recognises that soil properties manifest across spectral, textural, and chromatic dimensions, using three CNN branches to extract modality-specific features before merging [121].

6.8.3. Synthesised Case Studies

Analysis across the hybrid studies reviewed reveals three cross-cutting lessons about when and why hybrid architectures succeed.

Lesson 1: Spatial autocorrelation correction delivers consistent gains over ML used alone. Across three geographically distinct studies (forest soils in central Vietnam [91], agricultural lowlands in Lombardy, Italy [95], and cropland fields in East Hungary [97]), the addition of residual kriging to RF predictions improved R² by 0.05–0.10 relative to RF alone and produced spatially smoother, pedologically more realistic maps. The consistent improvement across very different soil and climate contexts suggests that spatial autocorrelation correction via kriging is a robust architectural feature worth considering in any hybrid deployment. The mechanism is the same in each case: RF captures the non-linear feature–SOC relationship globally but leaves spatially structured residuals that kriging efficiently corrects. Notably, the gain from adding kriging was more pronounced in landscapes with moderate spatial autocorrelation in residuals (range 1–10 km); in highly heterogeneous landscapes, RF alone performed comparably [132].

Lesson 2: Multi-source data fusion outperforms single-input models, but only when modalities are complementary. The three-branch CNN architecture fusing spectral, texture, and colour information achieved a 23% improvement over single-input models [121], and the coupled VNIR-MIR approach reduced prediction error by approximately 40% compared to VNIR alone [79]. However, fusion of redundant data streams (e.g., two spectral indices measuring similar phenomena) does not improve accuracy and may introduce noise. The practical implication is that fusion architectures should be designed based on the complementarity of information sources: spectral data captures chemical composition, texture captures physical structure, and colour captures surface organic matter expression. Fusion of genuinely complementary streams provides non-redundant information and justifies the architectural complexity.

Lesson 3: Transfer learning mitigates domain shift but requires sufficient regional anchor data. Studies examining model transferability consistently report degraded performance when ML models trained in one region are applied to another without adaptation [126,127]. Fine-tuning a global model pre-trained on 106,167 samples reduced mean absolute error by approximately 11% in target regions [127], but the benefit was contingent on having roughly 100 or more quality-controlled regional samples for fine-tuning. In data-poor contexts (fewer than 50 samples), fine-tuned global models offered minimal advantage over locally trained shallow models. This finding argues for a tiered hybrid strategy: use global pre-trained models as starting points, but invest in targeted regional field sampling (Step 2 of the workflow above) to enable effective domain adaptation. The biological drivers of regional SOC variation, particularly management-driven differences in microbial communities, are among the hardest to capture through covariate-only transfer, reinforcing the need for regional ground-truth data that reflect local management history.

Critical to success is embedding pedological knowledge as explicit constraints. Spatial autocorrelation is corrected by kriging the residuals and accounted for during evaluation through spatial cross-validation. Temporal coherence is achieved using an LSTM architecture and smoothness penalties. Vegetation-soil coupling constrains predictions to align with observed phenology. Management legacies (fertilisation, tillage, chemical effects on microbial communities) are encoded as model features. Soil microbial processes are represented by loss-function constraints or biology-relevant spectral features. These mechanistic embeddings help to prevent spurious correlations and improve transferability across regions with different environmental conditions or management histories.

7. Challenges and Opportunities

Assessing SOM and SOC faces several interconnected challenges, beginning with the uneven global distribution of data [145]. Extensive soil carbon measurements exist in North America, Europe, and parts of Asia, supporting detailed ML models, whereas sub-Saharan Africa, Central and South America, and some Southeast Asian regions remain largely unmeasured [146,147]. This spatial bias limits model performance in underrepresented areas and discourages investment in new data collection. Initiatives such as GlobalSoilMap [148] and the Africa Soil Information Service (AfSIS) [149] are addressing these gaps, but funding and local capacity remain insufficient.

Conceptual and methodological issues further complicate assessments. Definitions of soil carbon vary among total organic carbon, specific fractions, and SOM, making cross-study comparisons difficult. Soil samples reflect current conditions integrated over an indeterminate temporal scale, potentially creating mismatches between predictions of labile and stable carbon. Most studies focus on surface soils (0–30 cm), although deeper layers contain significant carbon whose assessment requires either destructive sampling or extrapolation [150]. Additionally, many ML models prioritise predictive accuracy over mechanistic understanding, thereby limiting their usefulness in elucidating how factors influence soil carbon.

Computational and infrastructure constraints also pose challenges, as ML development can require GPUs and cloud computing for deep learning approaches, though the tree-based methods (RFs, gradient boosting) identified in this review as most effective for soil carbon prediction have modest computational requirements and can be trained on standard hardware. The more pressing barriers in developing regions may be access to curated training data, reliable satellite imagery, and local ML expertise.

Validation and standardisation remain limited; many regional soil carbon maps lack independent validation datasets, and standard protocols for reporting uncertainty and methodology are underdeveloped [151]. Operational deployment of research models is hindered by the need for accessible software, integration with existing systems, practitioner training, legal frameworks, and long-term maintenance, with few models progressing to policy or management applications [152].

Despite these challenges, soil carbon assessment has practical applications. Agricultural carbon accounting relies on baseline field sampling combined with ML predictions to estimate carbon sequestration under various management practices, as demonstrated in Argentina, Australia, and the United States [153]. National mapping initiatives in countries like Australia and China integrate classical soil surveys with ML or geostatistical methods to extrapolate carbon estimates across unmapped regions [154,155]. Forest carbon stock assessments combine above-ground biomass predicted from satellite data with soil carbon, which requires ground-based sampling and ML interpolation [156]. Wetland and peatland systems store large carbon stocks relative to their area [157]; ML can detect wet areas, but accurate carbon estimation still depends on field sampling and proxy variables.

Emerging technologies offer potential to improve soil carbon assessment. Hyperspectral, radar, and LiDAR remote sensing, combined with drone and proximal-sensor data, enable high-resolution, multi-scale measurements [158,159]. Explainable AI techniques can enhance the interpretability of complex ML models, while federated learning allows global model training without centralising sensitive data. Integration of ML with process-based carbon models combines data-driven accuracy with mechanistic understanding, and continuous monitoring systems with networked soil sensors could transform the temporal resolution of carbon knowledge. Together, these advances promise to address current data gaps, improve prediction accuracy, and support operational and policy applications of soil carbon science.

8. Conclusions

Our review shows that classical and ML-based methods for analysing SOM and SOC are complementary rather than competing, each thriving in different operational contexts and scales. Classical laboratory techniques provide the ground truth for carbon quantification by direct measurement and underpin all regional and global modelling efforts. These approaches provide mechanistic insights, clear methodologies, explicit uncertainty measures and rigorous validation datasets required for the development of soil carbon science. Nevertheless, their high per-sample costs, labour-intensive requirements, and limited spatial coverage inevitably hinder their application at regional and global scales. In contrast, ML models utilise a wide range of environmental and remote-sensing covariates to generate large-scale predictions at minimal marginal cost. These approaches successfully capture the complexity of the non-linear relationships between soils and the environment, can leverage freely available environmental covariates to extend the spatial reach of limited field observations, and are scalable for carbon monitoring and policy.

However, the reliance of ML models on extensive training datasets, susceptibility to poor generalisation across regions, lack of interpretability in more complex models, and propensity to model spurious correlations diminish the utility of these methods under data-sparse conditions and erode confidence in mechanistic understanding. Soil carbon stocks and dynamics are tightly regulated by soil microbial communities, which can be fundamentally altered by anthropogenic and management activities (e.g., the application of synthetic fertilisers, herbicides, or fungicides in farming systems). Consequently, even under otherwise similar soil and climatic conditions, biologically driven variation in soil carbon can be substantial yet difficult to detect by modelling approaches. This represents a critical bottleneck for ML-based soil carbon estimation models, which typically lack explicit representation of belowground biotic processes and may therefore struggle to generalise or transfer reliably across regions with contrasting management histories and microbial legacies.

The ideal way forward integrates classical pedological methods with machine learning through hybrid approaches that deploy each method where it excels. Concrete examples illustrate this effectiveness. In forest ecosystems, random forests enhanced through regression kriging improved SOC density mapping by accounting for spatial autocorrelation that ML alone cannot capture [91]. In East Hungary, the same hybrid approach achieved R² ≈ 0.80 with improved spatial realism [97]. More sophisticated hybrids, such as CNN-LSTM frameworks that combine spatial feature extraction with temporal dynamics from satellite vegetation data, improved regional SOC prediction [124], while multi-source data fusion combining spectral, texture, and colour information achieved an R² of 0.87—a 23% improvement over single inputs [121].

These examples reveal key principles: deploy classical methods where spatial structure and mechanistic understanding matter, and use ML for pattern recognition and scalability. Target field sampling at ML-identified important locations to ground predictions in pedological reality. Apply kriging to ML residuals to correct spatial autocorrelation. Use classical statistical methods to quantify prediction uncertainty. Embed pedological knowledge as explicit constraints rather than relying on purely data-driven approaches. Recent advances in explainable AI, transfer learning, and process-model integration further expand hybrid possibilities.

Neither classical nor ML techniques alone suffice for accurate soil carbon assessment across diverse scales. The strategic integration pathway illustrated in Figure 4 demonstrates how field sampling, feature selection, ML training, residual kriging, and uncertainty quantification combine to achieve superior outcomes. Combining the mechanistic grounding of classical methods with the scalability of ML represents the most realistic path forward, enabling quantification, monitoring, and management of soil carbon as a critical component of global climate change mitigation and land-use policy.

The urgent twin challenges of climate change mitigation and agricultural sustainability call for accurate yet low-cost soil carbon assessments across diverse environmental contexts and scales. Integrating classical and ML techniques, with each deployed where it excels, is the most promising way to meet that need.

To advance soil carbon assessment and better integrate classical and machine learning approaches, five key research priorities emerge: improving and standardising spatial cross-validation protocols to ensure robust and comparable SOC model evaluation across studies; evaluating the integration of soil microbiome-derived variables (e.g., sequencing-based metrics and functional indices) as covariates to enhance model generalisability across differing management and environmental contexts; developing hybrid process-based and machine learning frameworks that embed mechanistic soil carbon constraints to improve interpretability and reduce spurious correlations; advancing temporal transfer learning approaches using multi-year remote sensing data to enable reliable detection of soil carbon dynamics over time; and establishing decision-theoretic frameworks that optimise the trade-off between sampling intensity and model complexity to minimise cost while maximising predictive accuracy across spatial scales.

Author Contributions

Conceptualisation, A.D.; writing—original draft preparation, A.D. and K.K.; writing—review and editing, A.D. and K.K.; visualisation, A.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No datasets were created for this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SOM	Soil Organic Matter
ML	Machine Learning
RF	Random Forest
OM	Organic Matter
SOC	Soil Organic Carbon
LOI	Loss On Ignition
VNIR	Visible and Near-Infrared
FTIR	Fourier Transform Infrared
MIR	Mid-Infrared
NDVI	Normalised Difference Vegetation Index
vis-NIR	Visible-Near-Infrared
PLSR	Partial Least Squares Regression
LS-SVM	Least Squares Support Vector Machines
ELM	Extreme Learning Machines
EPO	External Parameter Orthogonalisation
VNIR-MIR	Vis-NIR and Mid-Infrared
NiOA	Ninja Optimisation Algorithm
SVR	Support Vector Regression
SOCD	Soil Organic Carbon Density
GBRT	Gradient-Boosted Regression Trees
ANNs	Artificial Neural Networks
CNNs	Convolutional Neural Networks
RNNs	Recurrent Neural Networks
LSTM	Long Short-Term Memory
GRUs	Gated Recurrent Units
BiGRU	Bidirectional Gated Recurrent Units
TPML	Two-Point ML
AfSIS	Africa Soil Information Service

References

Georgiou, K.; Jackson, R.B.; Vindušková, O.; Abramoff, R.Z.; Ahlström, A.; Feng, W.; Harden, J.W.; Pellegrini, A.F.A.; Polley, H.W.; Soong, J.L.; et al. Global stocks and capacity of mineral-associated soil organic carbon. Nat. Commun. 2022, 13, 3797.
Lal, R. Soil Carbon Sequestration Impacts on Global Climate Change and Food Security. Science 2004, 304, 1623–1627.
Smith, P.; House, J.I.; Bustamante, M.; Sobocká, J.; Harper, R.; Pan, G.; West, P.C.; Clark, J.M.; Adhya, T.; Rumpel, C.; et al. Global change pressures on soils from land use and management. Glob. Change Biol. 2015, 22, 1008–1028.
Batjes, N.H. Total carbon and nitrogen in the soils of the world. Eur. J. Soil Sci. 1996, 47, 151–163.
Post, W.M.; Kwon, K.C. Soil carbon sequestration and land-use change: Processes and potential. Glob. Change Biol. 2000, 6, 317–327.
Beillouin, D.; Corbeels, M.; Demenois, J.; Berre, D.; Boyer, A.; Fallot, A.; Feder, F.; Cardinael, R. A global meta-analysis of soil organic carbon in the Anthropocene. Nat. Commun. 2023, 14, 3700.
Bokati, L.; Somenahally, A.; Kumar, S.; Robatjazi, J.; Talchabhadel, R.; Sarkar, R.; Perepi, R. Temporal adjustment approach for high-resolution continental scale modeling of soil organic carbon. Sci. Rep. 2025, 15, 6483.
Celestina, C.; Hunt, J.R.; Sale, P.W.G.; Franks, A.E. Attribution of crop yield responses to application of organic amendments: A critical review. Soil Tillage Res. 2019, 186, 135–145.
Wilpiszeski, R.L.; Aufrecht, J.A.; Retterer, S.T.; Sullivan, M.B.; Graham, D.E.; Pierce, E.M.; Zablocki, O.D.; Palumbo, A.V.; Elias, D.A. Soil Aggregate Microbial Communities: Towards Understanding Microbiome Interactions at Biologically Relevant Scales. Appl. Environ. Microbiol. 2019, 85, e00324-19.
Țopa, D.C.; Căpșună, S.; Calistru, A.E.; Ailincăi, C. Sustainable Practices for Enhancing Soil Health and Crop Quality in Modern Agriculture: A Review. Agriculture 2025, 15, 998.
Apesteguia, M.; Plante, A.F.; Virto, I. Methods assessment for organic and inorganic carbon quantification in calcareous soils of the Mediterranean region. Geoderma Reg. 2018, 12, 39–48.
Gomez, C.; Chevallier, T.; Moulin, P.; Bouferra, I.; Hmaidi, K.; Arrouays, D.; Jolivet, C.; Barthès, B.G. Prediction of soil organic and inorganic carbon concentrations in Tunisian samples by mid-infrared reflectance spectroscopy using a French national library. Geoderma 2020, 375, 114469.
Huang, B.; Yang, G.; Lei, J.; Wang, X. A partitioned conditioned Latin hypercube sampling method considering spatial heterogeneity in digital soil mapping. Sci. Rep. 2025, 15, 12851.
Paul, S.S.; Coops, N.C.; Johnson, M.S.; Krzic, M.; Smukler, S.M. Evaluating sampling efforts of standard laboratory analysis and mid-infrared spectroscopy for cost effective digital soil mapping at field scale. Geoderma 2019, 356, 113925.
Wadoux, A.M.J.C. Artificial intelligence in soil science. Eur. J. Soil Sci. 2025, 76, e70080.
Bouslihim, Y.; Rochdi, A.; Aboutayeb, R.; El Amrani-Paaza, N.; Miftah, A.; Hssaini, L. Soil Aggregate Stability Mapping Using Remote Sensing and GIS-Based Machine Learning Technique. Front. Earth Sci. 2021, 9, 748859.
Wu, B.; Zhang, M.; Zeng, H.; Tian, F.; Potgieter, A.B.; Qin, X.; Yan, N.; Chang, S.; Zhao, Z.; Dong, Q. Challenges and opportunities in remote sensing-based crop monitoring: A review. Natl. Sci. Rev. 2022, 10, nwac290.
Xiao, X.; He, Q.; Ma, S.; Liu, J.; Sun, W.; Lin, Y.; Yi, R. Environmental variables improve the accuracy of remote sensing estimation of soil organic carbon content. Sci. Rep. 2024, 14, 18964.
Chen, Q.; Wang, Y.; Zhu, X. Soil organic carbon estimation using remote sensing data-driven machine learning. PeerJ 2024, 12, e17836.
John, K.; Abraham Isong, I.; Michael Kebonye, N.; Okon Ayito, E.; Chapman Agyeman, P.; Marcus Afu, S. Using Machine Learning Algorithms to Estimate Soil Organic Carbon Variability with Environmental Variables and Soil Nutrient Indicators in an Alluvial Soil. Land 2020, 9, 487.
Miao, T.; Ji, W.; Li, B.; Zhu, X.; Yin, J.; Yang, J.; Huang, Y.; Cao, Y.; Yao, D.; Kong, X. Advanced Soil Organic Matter Prediction with a Regional Soil NIR Spectral Library Using Long Short-Term Memory–Convolutional Neural Networks: A Case Study. Remote Sens. 2024, 16, 1256.
Liu, L.; Zhou, W.; Guan, K.; Peng, B.; Xu, S.; Tang, J.; Zhu, Q.; Till, J.; Jia, X.; Jiang, C.; et al. Knowledge-guided machine learning can improve carbon cycle quantification in agroecosystems. Nat. Commun. 2024, 15, 357.
Sunantha, O.; Shao, Z.; Pattama, P.; Potchara, A.; Huang, X.; Zeeshan, A. Machine learning-based estimation of soil organic carbon in Thailand’s cash crops using multispectral and SAR data fusion combined with environmental variables. Geo-Spat. Inf. Sci. 2025, 28, 2721–2743.
Abrar, M.M.; Waqas, M.A.; Mehmood, K.; Fan, R.; Memon, M.S.; Khan, M.A.; Siddique, N.; Xu, M.; Du, J. Organic carbon sequestration in global croplands: Evidenced through a bibliometric approach. Front. Environ. Sci. 2025, 13, 1495991. [Google Scholar] [CrossRef]
Fortuna, A.M.; Starks, P.J.; Moriasi, D.N. Estimation of soil organic carbon as a function of soil pretreatment and spectral features of radiometers within the visible and near-infrared spectra. J. Soil Water Conserv. 2025, 80, 476–490. [Google Scholar] [CrossRef]
Conant, R.T.; Ryan, M.G.; Ågren, G.I.; Birge, H.E.; Davidson, E.A.; Eliasson, P.E.; Evans, S.E.; Frey, S.D.; Giardina, C.P.; Hopkins, F.M.; et al. Temperature and soil organic matter decomposition rates—Synthesis of current knowledge and a way forward. Glob. Change Biol. 2011, 17, 3392–3404. [Google Scholar] [CrossRef]
König, A.; Wiesenbauer, J.; Gorka, S.; Marchand, L.; Kitzler, B.; Inselsbacher, E.; Kaiser, C. Reverse microdialysis: A window into root exudation hotspots. Soil Biol. Biochem. 2022, 174, 108829. [Google Scholar] [CrossRef]
Zhou, Z.; Ren, C.; Wang, C.; Delgado-Baquerizo, M.; Luo, Y.; Luo, Z.; Du, Z.; Zhu, B.; Yang, Y.; Jiao, S.; et al. Global turnover of soil mineral-associated and particulate organic carbon. Nat. Commun. 2024, 15, 5329. [Google Scholar] [CrossRef]
Wagai, R.; Mayer, L.M.; Kitayama, K.; Knicker, H. Climate and parent material controls on organic matter storage in surface soils: A three-pool, density-separation approach. Geoderma 2008, 147, 23–33. [Google Scholar] [CrossRef]
Patton, N.R.; Lohse, K.A.; Seyfried, M.S.; Godsey, S.; Parsons, S.B. Topographic controls of soil organic carbon on soil-mantled landscapes. Sci. Rep. 2019, 9, 6390. [Google Scholar] [CrossRef]
Wang, X.; Chi, Y.; Song, S. Important soil microbiota’s effects on plants and soils: A comprehensive 30-year systematic literature review. Front. Microbiol. 2024, 15, 1347745. [Google Scholar] [CrossRef]
Engell, I.; Gerigk, J.; Linsler, D.; Joergensen, R.G.; Potthoff, M. Tillage and land use management effects on soil organic matter and soil microbial biomass in a field network of practical farms in France, Romania, and Sweden. Appl. Soil Ecol. 2024, 202, 105584. [Google Scholar] [CrossRef]
Wan, Q.; Zhu, G.; Guo, H.; Zhang, Y.; Pan, H.; Yong, L.; Ma, H. Influence of Vegetation Coverage and Climate Environment on Soil Organic Carbon in the Qilian Mountains. Sci. Rep. 2019, 9, 17623. [Google Scholar] [CrossRef]
Hartley, I.P.; Hill, T.C.; Chadburn, S.E.; Hugelius, G. Temperature effects on carbon storage are controlled by soil stabilisation capacities. Nat. Commun. 2021, 12, 6713. [Google Scholar] [CrossRef]
Huang, Y.; Wei, F. Climate controls the global distribution of soil organic and inorganic carbon. Ecol. Indic. 2025, 175, 113514. [Google Scholar] [CrossRef]
Schapel, A.; Marschner, P.; Churchman, J. Clay amount and distribution influence organic carbon content in sand with subsoil clay addition. Soil Tillage Res. 2018, 184, 253–260. [Google Scholar] [CrossRef]
Dalzell, B.J.; Fissore, C.; Nater, E.A. Topography and land use impact erosion and soil organic carbon burial over decadal timescales. CATENA 2022, 218, 106578. [Google Scholar] [CrossRef]
Zhang, P.; Shao, M. Spatial Variability and Stocks of Soil Organic Carbon in the Gobi Desert of Northwestern China. PLoS ONE 2014, 9, e93584. [Google Scholar] [CrossRef] [PubMed]
Piñeiro-Juncal, N.; Mateo, M.Á.; Leiva-Dueñas, C.; Serrano, E.; Inostroza, K.; Soler, M.; Apostolaki, E.T.; Lavery, P.; Duarte, C.M.; Lafratta, A.; et al. Soil organic carbon depth profiles and centennial and millennial decay rates in tidal marsh, mangrove and seagrass blue carbon ecosystems. Commun. Earth Environ. 2025, 6, 504. [Google Scholar] [CrossRef]
Davis, M.; Alves, B.; Karlen, D.; Kline, K.; Galdos, M.; Abulebdeh, D. Review of Soil Organic Carbon Measurement Protocols: A US and Brazil Comparison and Recommendation. Sustainability 2018, 10, 53. [Google Scholar] [CrossRef]
Bravo-García, J.; Camarillo-Naranjo, J.M.; Blanco-Velázquez, F.J.; Anaya-Romero, M. Soil Organic Carbon Mapping Through Remote Sensing and In Situ Data with Random Forest by Using Google Earth Engine: A Case Study in Southern Africa. Land 2025, 14, 1436. [Google Scholar] [CrossRef]
Lv, J.; Huang, Z.; Luo, L.; Zhang, S.; Wang, Y. Advances in Molecular and Microscale Characterization of Soil Organic Matter: Current Limitations and Future Prospects. Environ. Sci. Technol. 2022, 56, 12793–12810. [Google Scholar] [CrossRef]
Walkley, A.; Black, I.A. An Examination of The Degtjareff Method for Determining Soil Organic Matter, and A Proposed Modification of The Chromic Acid Titration Method. Soil Sci. 1934, 37, 29–38. [Google Scholar] [CrossRef]
Matus, F.J.; Escudey, M.; Förster, J.E.; Gutiérrez, M.; Chang, A.C. Is the Walkley–Black Method Suitable for Organic Carbon Determination in Chilean Volcanic Soils? Commun. Soil Sci. Plant Anal. 2009, 40, 1862–1872. [Google Scholar] [CrossRef]
Burgos Hernández, T.D.; Slater, B.K.; Shaffer, J.M.; Basta, N. Comparison of methods for determining organic carbon content of urban soils in Central Ohio. Geoderma Reg. 2023, 34, e00680. [Google Scholar] [CrossRef]
Pallasser, R.; Minasny, B.; McBratney, A.B. Soil carbon determination by thermogravimetrics. PeerJ 2013, 1, e6. [Google Scholar] [CrossRef] [PubMed]
Hoogsteen, M.J.J.; Lantinga, E.A.; Bakker, E.J.; Groot, J.C.J.; Tittonell, P.A. Estimating soil organic carbon through loss on ignition: Effects of ignition conditions and structural water loss. Eur. J. Soil Sci. 2015, 66, 320–328. [Google Scholar] [CrossRef]
Salehi, M.H.; Beni, O.H.; Harchegani, H.B.; Borujeni, I.E.; Motaghian, H.R. Refining Soil Organic Matter Determination by Loss-on-Ignition. Pedosphere 2011, 21, 473–482. [Google Scholar] [CrossRef]
Schulte, E.E.; Hopkins, B.G. Estimation of Soil Organic Matter by Weight Loss-On-Ignition. In Soil Organic Matter: Analysis and Interpretation; Soil Science Society of America, Inc.: Madison, WI, USA, 2015; pp. 21–31. [Google Scholar]
Viscarra Rossel, R.A.; Behrens, T.; Ben-Dor, E.; Brown, D.J.; Demattê, J.A.M.; Shepherd, K.D.; Shi, Z.; Stenberg, B.; Stevens, A.; Adamchuk, V.; et al. A global spectral library to characterize the world’s soil. Earth-Sci. Rev. 2016, 155, 198–230. [Google Scholar] [CrossRef]
Ng, W.; Minasny, B.; Jeon, S.H.; McBratney, A. Mid-infrared spectroscopy for accurate measurement of an extensive set of soil properties for assessing soil functions. Soil Secur. 2022, 6, 100043. [Google Scholar] [CrossRef]
Margenot, A.J.; Parikh, S.J.; Calderón, F.J. Fourier-transform infrared spectroscopy for soil organic matter analysis. Soil Sci. Soc. Am. J. 2023, 87, 1503–1528. [Google Scholar] [CrossRef]
Walden, L.; Sepanta, F.; Viscarra Rossel, R.A. FT-MIR Spectroscopic Analysis of the Organic Carbon Fractions in Australian Mineral Soils. Eur. J. Soil Sci. 2025, 76, e70084. [Google Scholar] [CrossRef]
Shepherd, K.D.; Walsh, M.G. Development of Reflectance Spectral Libraries for Characterization of Soil Properties. Soil Sci. Soc. Am. J. 2002, 66, 988–998. [Google Scholar] [CrossRef]
Carter, T.L.; Schaecher, C.; Monteith, S.; Ferguson, R. Using combustion analysis to simultaneously measure soil organic and inorganic carbon. Geoderma 2024, 451, 117066. [Google Scholar] [CrossRef]
Virk, S.; Tucker, M.; Harris, G.; Smith, A.; Levi, M.; Lessl, J. Efficacy and Economics of Different Soil Sampling Grid Sizes for Site-Specific Nutrient Management in Southeastern USA. Agronomy 2025, 15, 903. [Google Scholar] [CrossRef]
Marcaida, M.; Workman, K.; Czymmek, K.J.; Ketterings, Q.M. Grid-based soil sampling for Northeast Region phosphorus index assessment. Soil Sci. Soc. Am. J. 2025, 89, e70156. [Google Scholar] [CrossRef]
Brus, D.J.; Kempen, B.; Heuvelink, G.B.M. Sampling for validation of digital soil maps. Eur. J. Soil Sci. 2011, 62, 394–407. [Google Scholar] [CrossRef]
Adamchuk, V.I.; Viscarra Rossel, R.A.; Marx, D.B.; Samal, A.K. Using targeted sampling to process multivariate soil sensing data. Geoderma 2011, 163, 63–73. [Google Scholar] [CrossRef]
Long, J.; Liu, Y.; Xing, S.; Qiu, L.; Huang, Q.; Zhou, B.; Shen, J.; Zhang, L. Effects of sampling density on interpolation accuracy for farmland soil organic matter concentration in a large region of complex topography. Ecol. Indic. 2018, 93, 562–571. [Google Scholar] [CrossRef]
Radočaj, D.; Jug, I.; Vukadinović, V.; Jurišić, M.; Gašparović, M. The Effect of Soil Sampling Density and Spatial Autocorrelation on Interpolation Accuracy of Chemical Soil Properties in Arable Cropland. Agronomy 2021, 11, 2430. [Google Scholar] [CrossRef]
Brenning, A. Spatial prediction models for landslide hazards: Review, comparison and evaluation. Nat. Hazards Earth Syst. Sci. 2005, 5, 853–862. [Google Scholar] [CrossRef]
Goovaerts, P. Geostatistics in soil science: State-of-the-art and perspectives. Geoderma 1999, 89, 1–45. [Google Scholar] [CrossRef]
Fongaro, C.T.; Demattê, J.A.M.; Rizzo, R.; Lucas Safanelli, J.; Mendes, W.D.S.; Dotto, A.C.; Vicente, L.E.; Franceschini, M.H.D.; Ustin, S.L. Improvement of Clay and Sand Quantification Based on a Novel Approach with a Focus on Multispectral Satellite Images. Remote Sens. 2018, 10, 1555. [Google Scholar] [CrossRef]
Yan, K.; Wang, D.; Feng, Y.; Hou, S.; Zhang, Y.; Yang, H. Digital mapping of soil organic carbon in a plain area based on time-series features. Ecol. Indic. 2025, 171, 113215. [Google Scholar] [CrossRef]
Dhawale, N.M.; Adamchuk, V.I.; Prasher, S.O.; Viscarra Rossel, R.A. Evaluating the Precision and Accuracy of Proximal Soil vis–NIR Sensors for Estimating Soil Organic Matter and Texture. Soil Syst. 2021, 5, 48. [Google Scholar] [CrossRef]
Whalen, E.D.; Grandy, A.S.; Geyer, K.M.; Morrison, E.W.; Frey, S.D. Microbial trait multifunctionality drives soil organic matter formation potential. Nat. Commun. 2024, 15, 10209. [Google Scholar] [CrossRef] [PubMed]
Wold, S.; Sjöström, M.; Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130. [Google Scholar] [CrossRef]
Sothe, C.; Gonsamo, A.; Arabian, J.; Snider, J. Large scale mapping of soil organic carbon concentration with 3D machine learning and satellite observations. Geoderma 2022, 405, 115402. [Google Scholar] [CrossRef]
van Wesemael, B.; Chabrillat, S.; Dias, A.S.; Berger, M.; Szantoi, Z. Remote Sensing for Soil Organic Carbon Mapping and Monitoring. Remote Sens. 2023, 15, 3464. [Google Scholar] [CrossRef]
Stockmann, U.; Adams, M.A.; Crawford, J.W.; Field, D.J.; Henakaarchchi, N.; Jenkins, M.; Minasny, B.; McBratney, A.B.; Courcelles, V.R.; Singh, K.; et al. The knowns, known unknowns and unknowns of sequestration of soil organic carbon. Agric. Ecosyst. Environ. 2013, 164, 80–99. [Google Scholar] [CrossRef]
Roper, W.R.; Robarge, W.P.; Osmond, D.L.; Heitman, J.L. Comparing Four Methods of Measuring Soil Organic Matter in North Carolina Soils. Soil Sci. Soc. Am. J. 2019, 83, 466. [Google Scholar] [CrossRef]
Burgess, T.M.; Webster, R. Optimal interpolation and isarithmic mapping of soil properties: I The semi-variogram and punctual kriging. Eur. J. Soil Sci. 2019, 70, 11–19. [Google Scholar] [CrossRef]
Wang, W.; Li, Q. Smart farming revolution: Leveraging machine learning for sustainable agriculture. J. Clean. Prod. 2025, 527, 146434. [Google Scholar] [CrossRef]
Xing, Y.; Xie, Y.; Wang, X. Enhancing soil health through balanced fertilization: A pathway to sustainable agriculture and food security. Front. Microbiol. 2025, 16, 1536524. [Google Scholar]
Yang, M.; Xu, D.; Chen, S.; Li, H.; Shi, Z. Evaluation of Machine Learning Approaches to Predict Soil Organic Matter and pH Using vis-NIR Spectra. Sensors 2019, 19, 263. [Google Scholar] [CrossRef]
Mundada, S.; Jain, P. Predicting soil organic carbon with ensemble learning techniques by using satellite images for precision farming. Sci. Rep. 2025, 15, 28760. [Google Scholar] [CrossRef]
Kakhani, N.; Taghizadeh-Mehrjardi, R.; Omarzadeh, D.; Ryo, M.; Heiden, U.; Scholten, T. Towards Explainable AI: Interpreting Soil Organic Carbon Prediction Models Using a Learning-Based Explanation Method. Eur. J. Soil Sci. 2025, 76, e70071. [Google Scholar] [CrossRef]
Hutengs, C.; Eisenhauer, N.; Schädler, M.; Cesarz, S.; Lochner, A.; Seidel, M.; Vohland, M. Enhanced VNIR and MIR proximal sensing of soil organic matter and PLFA-derived soil microbial properties through machine learning ensembles and external parameter orthogonalization. Geoderma 2024, 450, 117037. [Google Scholar] [CrossRef]
Ben Ghorbal, A.; Grine, A.; Eid, M.M.; El-kenawy, E.S.M. Sustainable soil organic carbon prediction using machine learning and the ninja optimization algorithm. Front. Environ. Sci. 2025, 13, 1630762. [Google Scholar] [CrossRef]
Ngu, N.H.; Trung, N.H.; Shinjo, H.; Chotpantarat, S.; Thanh, N.N. Improving spatial prediction of soil organic matter in central Vietnam using Bayesian-enhanced machine learning and environmental covariates. Arch. Agron. Soil Sci. 2025, 71, 1–17. [Google Scholar] [CrossRef]
Narváez-Ortiz, W.A.; Reyes-Valdés, M.H.; Cabrera-De la Fuente, M.; Benavides-Mendoza, A. Multiple Linear and Polynomial Models for Studying the Dynamics of the Soil Solution. Soil Syst. 2022, 6, 42. [Google Scholar] [CrossRef]
Rukhovich, D.; Koroleva, P.; Rukhovich, A.; Komissarov, M. A detailed mapping of soil organic matter content in arable land based on the multitemporal soil line coefficients and neural network filtering of big remote sensing data. Geoderma 2024, 447, 116941. [Google Scholar] [CrossRef]
Law, T.; Shawe-Taylor, J. Practical Bayesian support vector regression for financial time series prediction and market condition change detection. Quant. Financ. 2017, 17, 1403–1416. [Google Scholar] [CrossRef]
Elhallaoui Oueldkaddour, F.Z.; Wariaghli, F.; Brirhet, H.; Yahyaoui, A.; Jaziri, H. Comparison of Machine Learning Models for Real-Time Flow Forecasting in the Semi-Arid Bouregreg Basin. Limnol. Rev. 2025, 25, 6. [Google Scholar] [CrossRef]
Were, K.; Bui, D.T.; Dick, Ø.B.; Singh, B.R. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol. Indic. 2015, 52, 394–403. [Google Scholar] [CrossRef]
Laref, R.; Losson, E.; Sava, A.; Siadat, M. On the optimization of the support vector machine regression hyperparameters setting for gas sensors array applications. Chemom. Intell. Lab. Syst. 2019, 184, 22–27. [Google Scholar] [CrossRef]
Greve, M.H.; Kheir, R.B.; Greve, M.B.; Bøcher, P.K. Quantifying the ability of environmental parameters to predict soil texture fractions using regression-tree model with GIS and LIDAR data: The case study of Denmark. Ecol. Indic. 2012, 18, 1–10. [Google Scholar] [CrossRef]
Siqueira, R.G.; Moquedace, C.M.; Fernandes-Filho, E.I.; Schaefer, C.E.G.R.; Francelino, M.R.; Sacramento, I.F.; Michel, R.F.M. Modelling and prediction of major soil chemical properties with Random Forest: Machine learning as tool to understand soil-environment relationships in Antarctica. CATENA 2024, 235, 107677. [Google Scholar] [CrossRef]
Valavi, R.; Elith, J.; Lahoz-Monfort, J.J.; Guillera-Arroita, G. Modelling species presence-only data with random forests. Ecography 2021, 44, 1731–1742. [Google Scholar] [CrossRef]
Ho, V.H.; Morita, H.; Bachofer, F.; Ho, T.H. Random forest regression kriging modeling for soil organic carbon density estimation using multi-source environmental data in central Vietnamese forests. Model. Earth Syst. Environ. 2024, 10, 7137–7158. [Google Scholar] [CrossRef]
Gao, D.; Zhang, Y.X.; Zhao, Y.H. Random forest algorithm for classification of multiwavelength data. Res. Astron. Astrophys. 2009, 9, 220–226. [Google Scholar] [CrossRef]
Tang, F.; Ishwaran, H. Random forest missing data algorithms. Stat. Anal. Data Min. ASA Data Sci. J. 2017, 10, 363–377. [Google Scholar] [CrossRef] [PubMed]
Probst, P.; Wright, M.N.; Boulesteix, A. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1301. [Google Scholar] [CrossRef]
Adeniyi, O.D.; Brenning, A.; Maerker, M. Spatial prediction of soil organic carbon: Combining machine learning with residual kriging in an agricultural lowland area (Lombardy region, Italy). Geoderma 2024, 448, 116953. [Google Scholar] [CrossRef]
Wang, L.; Abramowitz, G.; Wang, Y.P.; Pitman, A.; Viscarra Rossel, R.A. An ensemble estimate of Australian soil organic carbon using machine learning and process-based modelling. SOIL 2024, 10, 619–636. [Google Scholar] [CrossRef]
Hateffard, F.; Szatmári, G.; Novák, T.J. Applicability of machine learning models for predicting soil organic carbon content and bulk density under different soil conditions. Soil Sci. Annu. 2023, 74, 165879. [Google Scholar] [CrossRef]
Li, Y.; Yao, G.; Li, S.; Dong, X. Predicting and Mapping of Soil Organic Matter with Machine Learning in the Black Soil Region of the Southern Northeast Plain of China. Agronomy 2025, 15, 533. [Google Scholar] [CrossRef]
Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’16; Association for Computing Machinery: New York, NY, USA, 2016; Volume 1, pp. 785–794. [Google Scholar]
Hamzehpour, N.; Shafizadeh-Moghadam, H.; Valavi, R. Exploring the driving forces and digital mapping of soil organic carbon using remote sensing and soil texture. CATENA 2019, 182, 104141. [Google Scholar] [CrossRef]
Kalambukattu, J.G.; Kumar, S.; Das, B.; Roy, T. Digital mapping of soil organic carbon in the hilly and mountainous landscape of Indian Himalayan region employing machine-learning techniques. Discov. Soil 2025, 2, 35. [Google Scholar] [CrossRef]
Zou, X.; Wu, Z.; Fan, D.; Wu, Z.; Zhu, Y.; Ou, J. Exploring the Main Control Factors of Soil Organic Carbon in Riparian Farmland by Using Gradient Boosting Decision Tree. Eurasian Soil Sci. 2025, 58, 93. [Google Scholar]
López, O.A.M.; López, A.M.; Crossa, J. Fundamentals of Artificial Neural Networks and Deep Learning. In Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer: Cham, Switzerland, 2022; pp. 379–425. [Google Scholar]
Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257. [Google Scholar] [CrossRef]
Pacci, S.; Dengiz, O.; Alaboz, P.; Saygın, F. Artificial neural networks in soil quality prediction: Significance for sustainable tea cultivation. Sci. Total Environ. 2024, 947, 174447. [Google Scholar] [CrossRef] [PubMed]
Nawar, S.; Mouazen, A. Comparison between Random Forests, Artificial Neural Networks and Gradient Boosted Machines Methods of On-Line Vis-NIR Spectroscopy Measurements of Soil Total Nitrogen and Total Carbon. Sensors 2017, 17, 2428. [Google Scholar] [CrossRef] [PubMed]
Carbajal, M.; Ramírez, D.A.; Turin, C.; Schaeffer, S.M.; Konkel, J.; Ninanya, J.; Rinza, J.; Mendiburu, F.D.; Zorogastua, P.; Villaorduña, L.; et al. From Rangelands to Cropland, Land-Use Change and Its Impact on Soil Organic Carbon Variables in a Peruvian Andean Highlands: A Machine Learning Modeling Approach. Ecosystems 2024, 27, 899–917. [Google Scholar] [CrossRef]
Tian, X.; Ahrens, B.; Rossdeutscher, L.; Alonso, L.; Parente, L. Soil science-informed neural networks for soil organic carbon density modelling under scarce bulk density data. EGUsphere 2026. [Google Scholar] [CrossRef]
Emadi, M.; Taghizadeh-Mehrjardi, R.; Cherati, A.; Danesh, M.; Mosavi, A.; Scholten, T. Predicting and Mapping of Soil Organic Carbon Using Machine Learning Algorithms in Northern Iran. Remote Sens. 2020, 12, 2234. [Google Scholar] [CrossRef]
Ding, Z.; Liu, K.; Grunwald, S.; Smith, P.; Ciais, P.; Wang, B.; Wadoux, A.M.J.C.; Ferreira, C.; Karunaratne, S.; Shurpali, N.; et al. Advancing Soil Organic Carbon Prediction: A Comprehensive Review of Technologies, AI, Process-Based and Hybrid Modelling Approaches. Adv. Sci. 2025, 12, e04152. [Google Scholar] [CrossRef]
Triantakonstantis, D.; Karakostas, A. Soil Organic Carbon Monitoring and Modelling via Machine Learning Methods Using Soil and Remote Sensing Data. Agriculture 2025, 15, 910. [Google Scholar] [CrossRef]
Sarkar, R.; Ray, R.L. Application of artificial neural network algorithms to estimate spatial soil organic carbon stock in Prairie lands from remote sensing and soil data pool. Geocarto Int. 2025, 40, 2597423. [Google Scholar] [CrossRef]
Honorato, M.; Coelho, A.P.; Fernandes, C.; Matheus, S.; Claudia, M. Estimation of soil organic matter content by modeling with artificial neural networks. Geoderma 2019, 350, 46–51. [Google Scholar] [CrossRef]
Guo, P.T.; Wu, W.; Sheng, Q.; Li, M.F.; Liu, H.; Wang, Z.Y. Prediction of soil organic matter using artificial neural network and topographic indicators in hilly areas. Nutr. Cycl. Agroecosystems 2013, 95, 333–344. [Google Scholar] [CrossRef]
Mansur, N.; Abbod, M. Machine learning-based estimation of soil organic matter using RGB values. DYSONA-Appl. Sci. 2026, 7, 73–81. [Google Scholar] [CrossRef]
Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [Google Scholar] [CrossRef]
Alhatami, E.; Huang, M.; Bhatti, U.A.; Bhatti, U.A. Chapter 14—Remote sensing image fusion based on deep learning and convolutional neural network technique. In Deep Learning for Earth Observation and Climate Monitoring; Elsevier: Amsterdam, The Netherlands, 2025; pp. 265–277. Available online: https://www.sciencedirect.com/science/article/pii/B9780443247125000166 (accessed on 21 May 2026).
Tziolas, N.; Tsakiridis, N.; Heiden, U.; van Wesemael, B. Soil organic carbon mapping utilizing convolutional neural networks and Earth observation data, a case study in Bavaria state Germany. Geoderma 2024, 444, 116867. [Google Scholar] [CrossRef]
Wang, H.; Sun, Q.; Niu, X.; Liu, K.; Zhang, J.; Hao, Z.; Xu, D. Soil Organic Carbon Prediction Using an Efficient Channel Attention-Enhanced CNN-LSTM Model with LUCAS Spectral Library. Eur. J. Soil Sci. 2025, 76, e70202. [Google Scholar] [CrossRef]
Guo, L.; Gao, Q.; Zhang, M.; Cheng, P.; He, P.; Li, L.; Ding, D.; Liu, C.; Muga, F.C.; Kamal, M.; et al. Soil Organic Matter Content Prediction Using Multi-Input Convolutional Neural Network Based on Multi-Source Information Fusion. Agriculture 2025, 15, 1313. [Google Scholar] [CrossRef]
Li, T.; Xia, A.; McLaren, T.I.; Pandey, R.; Xu, Z.; Liu, H.; Manning, S.; Madgett, O.; Duncan, S.; Rasmussen, P.; et al. Preliminary Results in Innovative Solutions for Soil Carbon Estimation: Integrating Remote Sensing, Machine Learning, and Proximal Sensing Spectroscopy. Remote Sens. 2023, 15, 5571. [Google Scholar]
Wang, H.; Zhang, L.; Zhao, J. Application of a Fusion Attention Mechanism-Based Model Combining Bidirectional Gated Recurrent Units and Recurrent Neural Networks in Soil Nutrient Content Estimation. Agronomy 2023, 13, 2724. [Google Scholar] [CrossRef]
Zhang, L.; Cai, Y.; Huang, H.; Li, A.; Yang, L.; Zhou, C. A CNN-LSTM Model for Soil Organic Carbon Content Prediction with Long Time Series of MODIS-Based Phenological Variables. Remote Sens. 2022, 14, 4441. [Google Scholar]
Pavlovic, M.; Ilic, S.; Ralevic, N.; Antonic, N.; Raffa, D.W.; Bandecchi, M.; Culibrk, D. A Deep Learning Approach to Estimate Soil Organic Carbon from Remote Sensing. Remote Sens. 2024, 16, 655. [Google Scholar] [CrossRef]
Broeg, T.; Blaschek, M.; Seitz, S.; Taghizadeh-Mehrjardi, R.; Zepp, S.; Scholten, T. Transferability of Covariates to Predict Soil Organic Carbon in Cropland Soils. Remote Sens. 2023, 15, 876. [Google Scholar] [CrossRef]
Zhang, L.; Yang, L.; Ma, Y.; Zhu, A.-X.; Wei, R.; Liu, J.; Greve, M.H.; Zhou, C. Regional-scale soil carbon predictions can be enhanced by transferring global-scale soil–environment relationships. Geoderma 2025, 461, 117466. [Google Scholar] [CrossRef]
Novielli, P.; Magarelli, M.; Romano, D.; Di Bitonto, P.; Stellacci, A.M.; Monaco, A.; Amoroso, N.; Bellotti, R.; Tangaro, S. Leveraging explainable AI to predict soil respiration sensitivity and its drivers for climate change mitigation. Sci. Rep. 2025, 15, 12527. [Google Scholar] [CrossRef] [PubMed]
Suleymanov, A.; Komissarov, M.; Suleymanov, R.; Gabbasova, I. The Basic Soil Structure Parameters and Their Spatial Prediction Using Machine Learning and Remote Sensing Data in Semi-Arid Trans-Ural Steppe Zone, Russia. Soil Syst. 2026, 10, 11. [Google Scholar] [CrossRef]
Kmoch, A.; Harrison, C.T.; Choi, J.; Uuemaa, E. Spatial autocorrelation in machine learning for modelling soil organic carbon. Ecol. Inform. 2025, 86, 103057. [Google Scholar] [CrossRef]
Chinilin, A.; Savin, I.Y. Combining machine learning and environmental covariates for mapping of organic carbon in soils of Russia. Egypt. J. Remote Sens. Space Sci. 2023, 26, 666–675. [Google Scholar] [CrossRef]
Yin, Y.; Gao, B.; Xu, H.; Wang, Y.; Xie, D.; Liu, Y.; Wang, C. Soil organic matter mapping in complex terrains considering spatial heterogeneity. Environ. Model. Softw. 2025, 192, 106569. [Google Scholar] [CrossRef]
Heuvelink, G.B.M.; Angelini, M.E.; Poggio, L.; Bai, Z.; Batjes, N.H.; van den Bosch, R.; Bossio, D.; Estella, S.; Lehmann, J.; Olmedo, G.F.; et al. Machine learning in space and time for modelling soil organic carbon change. Eur. J. Soil Sci. 2020, 72, 1607–1623. [Google Scholar] [CrossRef]
Hengl, T.; Mendes de Jesus, J.; Heuvelink, G.B.M.; Ruiperez Gonzalez, M.; Kilibarda, M.; Blagotić, A.; Shangguan, W.; Wright, M.N.; Geng, X.; Bauer-Marschallinger, B.; et al. SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 2017, 12, e0169748. [Google Scholar] [CrossRef]
Poggio, L.; de Sousa, L.M.; Batjes, N.H.; Heuvelink, G.B.M.; Kempen, B.; Ribeiro, E.; Rossiter, D. SoilGrids 2.0: Producing soil information for the globe with quantified spatial uncertainty. SOIL 2021, 7, 217–240. [Google Scholar] [CrossRef]
McBratney, A.B.; Mendonça Santos, M.L.; Minasny, B. On digital soil mapping. Geoderma 2003, 117, 3–52. [Google Scholar] [CrossRef]
Qu, L.; Lu, H.; Tian, Z.; Schoorl, J.M.; Huang, B.; Liang, Y.; Qiu, D.; Liang, Y. Spatial prediction of soil sand content at various sampling density based on geostatistical and machine learning algorithms in plain areas. CATENA 2024, 234, 107572. [Google Scholar] [CrossRef]
Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. arXiv 2017. [Google Scholar] [CrossRef]
Padarian, J.; Minasny, B.; McBratney, A.B. Machine learning and soil sciences: A review aided by machine learning tools. SOIL 2020, 6, 35–52. [Google Scholar] [CrossRef]
Paciorek, C.J.; Stone, D.A.; Wehner, M.F. Quantifying statistical uncertainty in the attribution of human influence on severe weather. Weather Clim. Extrem. 2018, 20, 69–80. [Google Scholar] [CrossRef]
Lark, R.M.; Lapworth, D.J. Quality measures for soil surveys by lognormal kriging. Geoderma 2012, 173–174, 231–240. [Google Scholar] [CrossRef]
Shi, Y.; Wei, P.; Feng, K.; Feng, D.C.; Beer, M. A survey on machine learning approaches for uncertainty quantification of engineering systems. Mach. Learn. Comput. Sci. Eng. 2025, 1, 11. [Google Scholar] [CrossRef]
Luo, L.; Chen, B.; Zeng, S.; Li, Y.; Chen, X.; Zhang, J.; Guo, X.; Li, S.; Ruan, L.; Zhu, S.; et al. Machine learning integrates region-specific microbial signatures to distinguish geographically adjacent populations within a province. Front. Microbiol. 2025, 16, 1586195. [Google Scholar] [CrossRef] [PubMed]
Amoli, M.G.; Hasanlou, M.; Samadzadegan, F.; Taghizadeh-Mehrjardi, R.; Dadrass Javan, F. Estimating soil organic carbon using time series Band 11 (SWIR) of multispectral Sentinel-2 satellite images and machine learning algorithms. Remote Sens. Appl. Soc. Environ. 2025, 40, 101736.
Luo, D.; Xie, Y.; Tang, J.; Xu, J.; Zhang, M.; Cheng, H.; Luo, H.; Ouyang, W. Improving the prediction accuracy of soil organic matter: Addressing the challenge of soil moisture variability. Ecol. Indic. 2025, 179, 114249.
Scharlemann, J.P.; Tanner, E.V.; Hiederer, R.; Kapos, V. Global soil carbon: Understanding and managing the largest terrestrial carbon pool. Carbon Manag. 2014, 5, 81–91.
Lin, Z.; Dai, Y.; Mishra, U.; Wang, G.; Shangguan, W.; Zhang, W.; Qin, Z. Global and regional soil organic carbon estimates: Magnitude and uncertainties. Pedosphere 2023, 34, 685–698.
Chen, S.; Arrouays, D.; Leatitia Mulder, V.; Poggio, L.; Minasny, B.; Roudier, P.; Libohova, Z.; Lagacherie, P.; Shi, Z.; Hannam, J.; et al. Digital mapping of GlobalSoilMap soil properties at a broad scale: A review. Geoderma 2022, 409, 115567.
Kebonye, N.M.; John, K.; Delgado-Baquerizo, M.; Zhou, Y.; Agyeman, P.C.; Seletlo, Z.; Heung, B.; Scholten, T. Major overlap in plant and soil organic carbon hotspots across Africa. Sci. Total Environ. 2024, 951, 175476.
Garrett, L.G.; Byers, A.K.; Chen, C.; Lan, Z.; Bahadori, M.; Wakelin, S.A. The hidden depths of forest soil organic carbon chemistry in a pumice soil. Geoderma Reg. 2024, 36, e00760.
Feeney, C.; Cosby, B.J.; Robinson, D.A.; Thomas, A.; Emmett, B.; Henrys, P. Multiple soil map comparison highlights challenges for predicting topsoil organic carbon concentration at national scale. Sci. Rep. 2022, 12, 1379.
Nair, M.; Svedberg, P.; Larsson, I.; Nygren, J.M. A comprehensive overview of barriers and strategies for AI implementation in healthcare: Mixed-method design. PLoS ONE 2024, 19, e0305949.
Habib, S.; Tahir, F.; Hussain, F.; Macauley, N.; Al-Ghamdi, S.G. Current and emerging technologies for carbon accounting in urban landscapes: Advantages and limitations. Ecol. Indic. 2023, 154, 110603.
Bui, E.N.; Searle, R.D.; Wilson, P.R.; Philip, S.R.; Thomas, M.; Brough, D.; Harms, B.; Hill, J.V.; Holmes, K.; Henry, J.; et al. Soil surveyor knowledge in digital soil mapping and assessment in Australia. Geoderma Reg. 2020, 22, e00299.
Sun, Z.; Liu, F.; Wu, H.; Zhang, G.L. Developing a national black soil map of China through machine learning classification. CATENA 2024, 240, 107993.
Fu, H.; Zhao, H.; Liu, G.; Zhang, Y.; Huangfu, X.; Jiang, J. Forest aboveground carbon storage estimation and uncertainty analysis by coupled multi-source remote sensing data in Liaoning Province. Ecol. Indic. 2025, 176, 113729.
Goyette, J.O.; Loiselle, A.; Mendes, P.; Cimon-Morin, J.; Pellerin, S.; Poulin, M.; Dupras, J. Above and belowground carbon stocks among organic soil wetland types, accounting for peat bathymetry. Sci. Total Environ. 2024, 946, 174177.
Jiang, R.; Sui, Y.; Zhang, X.; Lin, N.; Zheng, X.; Li, B.; Zhang, L.; Li, X.; Yu, H. Estimation of soil organic carbon by combining hyperspectral and radar remote sensing to reduce coupling effects of soil surface moisture and roughness. Geoderma 2024, 444, 116874.
Chen, Z.; Chen, Y.; Shi, T.; Chen, X.; Pan, X.; Lei, J.; Wu, T.; Li, Y.; Liu, Q.; Liu, X.; et al. Estimation of soil organic carbon in tropical rainforest regions by combining UAV hyperspectral and LiDAR data. CATENA 2025, 258, 109195.

Figure 1. Multi-dimensional comparison of classical, ML, and hybrid approaches for soil carbon prediction. Scores reflect representative performance across reviewed studies; see Table 1 for detailed study-level metrics.

Figure 2. Data-flow architectures for three hybrid ML-classical fusion frameworks. (A) Sequential RF-kriging: environmental covariates feed an RF prediction layer; residuals are spatially corrected by kriging. (B) Parallel multi-branch CNN: three independent convolutional branches process spectral, texture, and colour inputs before feature-level concatenation. (C) CNN-LSTM spatio-temporal hybrid: a CNN encodes static spatial features per time step, which are passed to an LSTM capturing 10-year vegetation phenology dynamics.
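The sequential data flow in panel (A) can be illustrated with a minimal sketch, under several loud assumptions: an ordinary least-squares line on a single covariate stands in for the RF prediction layer (any fitted predictor could take its place), the exponential variogram parameters are assumed rather than fitted to data, and the sample coordinates, elevations, and SOC values are invented for illustration. All function names (`fit_line`, `ok_weights`, `hybrid_predict`) are hypothetical and do not come from the reviewed studies.

```python
import math

def exp_variogram(h, sill=1.0, range_=50.0):
    """Exponential semivariogram, zero nugget. Parameters are assumed, not fitted."""
    return sill * (1.0 - math.exp(-3.0 * h / range_))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ok_weights(coords, target, variogram=exp_variogram):
    """Ordinary kriging weights for one target location (weights sum to 1)."""
    n = len(coords)
    a = [[variogram(math.dist(coords[i], coords[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])  # unbiasedness constraint row
    b = [variogram(math.dist(c, target)) for c in coords] + [1.0]
    return solve(a, b)[:n]  # drop the Lagrange multiplier

def fit_line(xs, ys):
    """Least-squares line; a simple stand-in for the RF trend model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def hybrid_predict(coords, covariate, soc, target_xy, target_cov):
    """Trend prediction plus kriged residual at the target location."""
    trend = fit_line(covariate, soc)
    residuals = [y - trend(x) for x, y in zip(covariate, soc)]
    w = ok_weights(coords, target_xy)
    return trend(target_cov) + sum(wi * ri for wi, ri in zip(w, residuals))

# Invented sample data: plot coordinates (m), elevation (m), SOC (g/kg).
coords = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 5.0)]
elev = [100.0, 120.0, 110.0, 140.0, 125.0]
soc = [12.0, 10.5, 11.2, 8.9, 10.0]

prediction = hybrid_predict(coords, elev, soc, (3.0, 4.0), 112.0)
```

With a zero nugget, the residual correction makes the combined prediction reproduce the measured value at each sampled location, which is the sense in which kriging "spatially corrects" the trend model. In practice the variogram would be fitted to the residuals of the trained model rather than assumed.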

Figure 3. Eight-step structured implementation workflow for hybrid ML-classical soil carbon prediction. Steps 2, 4, 5, 6, and 8 rely on classical pedological methods; Steps 1, 3, and 7 are ML-driven. Classical and ML responsibilities are shown in the responsibility bands below.

Figure 4. Hybrid integration framework for SOC assessment: comparative analysis of classical, machine learning, and integrated approaches with an eight-step implementation workflow. Classical methods (green) provide ground-truth measurements through laboratory analysis and geostatistical kriging but face limitations in coverage and cost. Machine learning approaches (purple) enable scalable predictions across large areas but suffer from poor transferability and spatial autocorrelation bias. Hybrid approaches (red) address the limitations of both by sequentially coupling random forest predictions with kriging on residuals (R² = 0.80–0.87), integrating a CNN-LSTM for temporal dynamics, or fusing multiple data sources through parallel CNN architectures (R² = 0.87, a +23% improvement). The eight-step implementation workflow integrates field sampling, feature selection, ML training, residual analysis, spatial kriging, uncertainty quantification, map production, and validation. Key outcomes demonstrate improved accuracy, balanced cost-efficiency, optimised spatial coverage, and rigorous uncertainty bounds. Detailed sub-panels illustrate field sampling design, soil processing, ML model training, residual kriging mechanics, and final uncertainty-quantified map production suitable for climate and land-use policy.

Table 1. Representative studies comparing classical, machine learning, and hybrid approaches for soil organic carbon and soil organic matter prediction across diverse geographic regions, methods, and validation protocols.

Geographic Focus	Soil Type/Region	Method(s) Tested	Key Covariates	Outputs/Target	Performance (R²/RMSE/MAE)	Reference
Paddy soils, China	Agricultural paddy	ELM, PLSR, LS-SVM, Cubist	Vis-NIR spectra, feature selection	SOM prediction	R² = 0.81, RMSE = 5.17 g kg⁻¹ (ELM)	[76]
Germany	Croplands (LUCAS dataset)	RF, Neural Networks, Linear regression	Soil properties, environmental data	SOC	Different algorithms prioritised different features (trees: topography; NN: pH)	[78]
Germany	Diverse soils	Stacking ensemble vs. gradient boosting	Satellite imagery, soil chemical properties	SOC	Training R² = 0.95 (GBM); Test R² higher for stacking	[77]
Field conditions, Europe	Various	VNIR-MIR ML (Cubist, SVMs) vs. PLSR	Vis-NIR and mid-IR spectra; EPO calibration transfer	SOC	R² improved 40% with the EPO algorithm for field-collected spectra	[79]
Global optimization	Various	SVR with Ninja Optimisation Algorithm	Spectral bands, soil properties	SOC prediction	RMSE reduced 99.98% vs. untuned SVR	[80]
Tropical forests, Australia	Forest soils, diverse	Bayesian-enhanced RF	Rainfall, distance to coast/water, altitude	SOM spatial mapping	R² > 0.74 for key environmental covariates	[81]
Agricultural lowland, Italy (Lombardy)	Agricultural soils	RF vs. kriging vs. regression kriging	Soil, climatic, topographic, and RS data	SOC stocks	R² = 0.6 (SVR best); RMSE = 14.9 Mg C ha⁻¹	[86]
Across Australia	Continental scale	RF, multiple algorithms	Environmental & remote sensing variables	SOC stocks	R² > 0.8 (RF); southeastern/southwestern regions highest	[96]
Central Vietnamese forests	Mixed forest ecosystems	RF + regression kriging	Topographic wetness index, relative position, slope	SOCD	High accuracy with spatial autocorrelation accounted for	[91]
East Hungary (two fields)	Agricultural soils	RF, XGBoost, other ML	Terrain attributes, satellite vegetation indices	SOC content/stock	R² ≈ 0.80 (RF); R² slightly higher for XGBoost + optimization	[97]
Tieling County, China	Agricultural/forest mixed	RF with spatial mapping	NDVI, elevation, land use	SOM	R² = 0.77, RMSE = 2.85 g kg⁻¹	[98]
Urmia Lake region, Iran	Semi-arid mixed	Gradient boosting machine	EVI, sand content, wetness indices	SOC	R² = 0.435, RMSE = 0.23%	[101]
Indian Himalayas	Mountainous terrain	XGBoost, RF, SVR	Climatic, topographic, soil, and satellite covariates	SOC at 30 m resolution	R² ~0.60+ (context-dependent)	[102]
Peixian County, North China Plain	Riparian agricultural	Gradient boosting decision trees	Precipitation, temperature, distance to settlement/lake	SOC	R² = 0.68	[103]
Hilly terrain	Diverse upland soils	ANN vs. multiple linear regression	Topographic wetness, relative position, slope length	SOM	ANN R² = 0.87 vs. MLR R² = 0.82	[115]
Diverse soils (large dataset)	Various mineral soils	Multilayer perceptron ANN	Routine chemical soil attributes	SOM	Calibration R² = 0.92; Validation R² = 0.76	[114]
Bavarian soils, Germany	Diverse cropland	CNN framework	Spectral pre-treatment variants	SOC	R² = 0.64, RMSE = 12.03 g kg⁻¹	[119]
Soil spectral libraries	Laboratory & field	LSTM-CNN hybrid	NIR spectral data	SOM	R² = 0.96, RMSE = 1.66 g kg⁻¹	[120]
Soil spectral libraries (global)	Multi-source fusion	CNN-LSTM-ECA (channel attention)	Spectra, texture, colour information	SOC	R² = 0.92	[120]
Soil images	Diverse soils	Three-branch CNN (spectra + texture + colour)	Spectral bands, image texture, colour features	SOM	R² = 0.87 (23% improvement over single-input)	[121]
Multiple regions	Various	Att-BiGRU-RNN (attention mechanism)	Vegetation phenology, environmental data	Soil nutrients (OM, N, P, K)	R² = 0.959 (OM)	[123]
MODIS time-series	Regional scale	CNN-LSTM (vegetation phenology)	10-year MODIS phenology, environmental variables	SOC regional mapping	R² improved vs. traditional RF	[124]
Tuscany, Italy	Mixed soils	Deep neural networks	Spectral and environmental data	SOC	R² = 0.26	[125]
Greece	Diverse	Shallow neural networks	Environmental covariates	SOC	Modest; intensive CV required	[112]
Transfer learning (Bavaria ↔ Baden-Württemberg, Germany)	Cropland soils	RF transfer models	Environmental covariates	SOC	Reduced accuracy in transferred model; overprediction at high values	[126]
Global scale transfer	Multiple regions	Domain adaptation pre-training + fine-tuning	Diverse environmental data	SOC regional	MAE improved ~11% in target region	[127]
Estonia	Diverse soils	RF with spatial covariates	Environmental variables + spatial covariates	SOC	R² improved +0.02 with spatial variables; Spatial CV R² ~0.45 vs. random CV R² ~0.66	[130]
Complex terrain, Argentina	Mountainous	Two-point ML (global + local models)	Environmental covariates	SOC	Performance varies with local heterogeneity	[132]
Argentina (1982–2017)	Agricultural	Temporal ML (time-series)	NDVI, climate data, temporal records	SOC change detection	Uncertainty high due to uneven temporal distribution	[133]
Spectral data analysis	Paddy soils, China (extended)	PLSR, SVM ensemble, Cubist optimisation	Visible-near-infrared spectroscopy	SOM from spectra	Ensemble methods outperformed single algorithms	[76]
Spectral indices integration	Various European regions	Multiple regression, PLS regression	Satellite NDVI, NDSI indices + field data	SOM/SOC	R² typically in the range 0.3–0.6	[65,66,67,68]
Classical kriging comparison	Multiple regions	Ordinary kriging, co-kriging	Spatial semivariograms, secondary variables	SOC mapping	Baseline for ML comparison; provides uncertainty estimates	[62,63]
Wet chemistry method (Walkley-Black)	Diverse global soils	Laboratory oxidation method	Chemical soil oxidation	SOC quantification	Recovers 70–80% of total organic carbon; site-specific correction needed	[43,44]
Dry combustion (Elemental analysis)	Research-grade soils	Combustion analysis	Complete oxidation at 900–1000 °C	Total SOC	Gold standard; R² = 1.0 (by definition) for measured samples	[45,46]
Loss-on-Ignition (LOI)	Various mineral soils	Heating to 360–550 °C, mass loss	Volatile matter measurement	Crude SOM estimate	Unreliable for soils with hydrous minerals or carbonates	[47,48,49]
Spectroscopic methods (VNIR, MIR, FTIR)	Global soil libraries	Spectroscopic calibration vs. wet chemistry	Infrared absorption/reflectance bonds	SOM from spectra	Requires calibration; MIR shows strong correlation with measured OM	[50,51,52,53,54,55]
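The Estonia row in Table 1 reports a lower R² under spatial cross-validation (~0.45) than under random cross-validation (~0.66). One common way to set up such a comparison is to assign whole spatial blocks, rather than individual samples, to folds, so that neighbouring samples never straddle the training and test sets. The sketch below shows only the fold assignment, not model fitting. It is an assumption about how spatial folds might be built, not the protocol used in the cited study; the function names, the grid of sample locations, and the 20 m block size are all hypothetical.

```python
import random
from collections import defaultdict

def random_folds(n_samples, k, seed=0):
    """Assign each sample to one of k folds at random (ignores location)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [0] * n_samples
    for pos, i in enumerate(idx):
        folds[i] = pos % k
    return folds

def block_of(xy, block_size):
    """Grid-block identifier for a coordinate pair."""
    x, y = xy
    return (int(x // block_size), int(y // block_size))

def spatial_block_folds(coords, k, block_size, seed=0):
    """Assign whole grid blocks to folds so nearby samples share a fold."""
    blocks = defaultdict(list)
    for i, xy in enumerate(coords):
        blocks[block_of(xy, block_size)].append(i)
    block_ids = sorted(blocks)
    random.Random(seed).shuffle(block_ids)
    folds = [0] * len(coords)
    for pos, b in enumerate(block_ids):
        for i in blocks[b]:
            folds[i] = pos % k
    return folds

# Hypothetical sampling grid: 36 points at 10 m spacing, grouped into 20 m blocks.
grid = [(x * 10.0, y * 10.0) for x in range(6) for y in range(6)]
rand_assignment = random_folds(len(grid), k=3)
spatial_assignment = spatial_block_folds(grid, k=3, block_size=20.0)
```

The block size is the key design choice: blocks much smaller than the range of spatial autocorrelation leave correlated samples on both sides of the split, so the accuracy estimate stays optimistic.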

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

