1. Introduction
Soil organic carbon (SOC) is a fundamental component of soils, playing a pivotal role in nutrient cycling, soil structure, water retention, and as a significant reservoir in the global carbon cycle. The accurate monitoring and modelling of SOC are essential for sustainable land management, climate change mitigation, and enhancing agricultural productivity. Traditional methods of SOC assessment, primarily based on direct soil sampling and chemical analysis, are time- and labor-intensive and cannot cover large areas. The advent of machine learning (ML) techniques, combined with soil data and remotely sensed information, offers a promising avenue for efficient and comprehensive SOC estimation.
SOC constitutes a major portion of the Earth’s terrestrial carbon pool, with estimates suggesting that soils store approximately 1550 petagrams (Pg) of carbon globally, surpassing the carbon content of the atmosphere and vegetation combined [1]. This substantial carbon reservoir underscores the critical role of SOC in regulating atmospheric CO₂ levels and, consequently, global climate patterns. Moreover, SOC enhances soil health by increasing soil–water–microbial dynamics, all of which are vital for plant growth and agricultural productivity [2].
Conventional SOC assessment methods involve extensive field sampling and laboratory analyses to determine carbon content. As already mentioned, these methods provide accurate point-based measurements, but they are resource-intensive and often impractical for large-scale monitoring due to logistical constraints and high costs. Additionally, the spatial heterogeneity of soils necessitates a dense sampling network to capture variability accurately, further escalating the effort and expense involved [
3].
Remote sensing technologies have revolutionized soil monitoring by providing extensive spatial and temporal data coverage. In the context of SOC estimation, remote sensing offers indirect measurements through spectral data that can be correlated with soil properties. Multispectral and hyperspectral sensors capture reflectance data across various wavelengths, which, when analyzed, can indicate organic matter content and other soil characteristics [
4,
5]. For instance, the Normalized Difference Vegetation Index (NDVI) has been utilized to infer SOC levels based on vegetation cover and health, which is often correlated with soil fertility [
6].
Recent studies have demonstrated the efficacy of integrating remote sensing data with machine learning models to predict SOC. For example, the authors of [
7] utilized hyperspectral remote sensing images with machine learning algorithms to develop models predicting surface soil organic matter content. Similarly, the authors of [
8] used remote sensing data including radar with ML models to calculate SOC, highlighting the potential of multi-sensor data fusion in enhancing prediction accuracy.
Machine learning is important in SOC modelling because it can deal with complex, non-linear relationships between predictors and soil properties [9,10]. Methods such as random forests (RF), support vector machines (SVM), and neural networks (NN) have been employed to predict SOC by leveraging large datasets comprising soil characteristics, topographic variables, climate data, and remotely sensed imagery [11,12,13]. For instance, the authors of [14] applied various ML models, including RF and Extreme Gradient Boosting, to predict SOC in Northern Iran, demonstrating the superior performance of these models over traditional statistical methods.
The integration of ML with remote sensing data has further enhanced SOC prediction capabilities. The authors of [
15] explored the combination of Sentinel-2 time-series data with laboratory spectral measurements using machine learning algorithms, finding that models incorporating both data sources achieved higher prediction accuracy compared to those using single data sources. This approach underscores the value of multi-source data integration in capturing the spatial and temporal dynamics of SOC.
Despite the advancements, several challenges persist in SOC monitoring and modelling using ML and remote sensing data. One significant challenge is the variability in soil properties across different regions, affecting the generalizability of ML models. Developing models that can adapt to diverse soil types and environmental conditions remains an area of active research.
Another challenge is the impact of factors, such as soil moisture, surface roughness, and vegetation cover, on remote sensing signals, which can introduce noise and affect the accuracy of SOC predictions. Advanced preprocessing techniques and the development of robust models that can account for these factors are essential to improve prediction reliability [
8].
Moreover, the integration of emerging technologies, such as unmanned aerial vehicles (UAVs), for SOC estimation is challenging. Additionally, the application of deep learning techniques, which can model complex patterns in large datasets, holds promise for further enhancing SOC prediction accuracy. Collaborative efforts that combine expertise in soil science, remote sensing, and machine learning are crucial to develop comprehensive frameworks for SOC monitoring and modelling.
The integration of machine learning methods with soil and remotely sensed data represents a promising approach for the efficient and accurate monitoring and modelling of soil organic carbon. This synergy enables the handling of complex datasets and captures the spatial and temporal variability of SOC across agricultural landscapes. Continued advancements in this field are essential for informing sustainable land management practices, enhancing agricultural productivity, and contributing to global climate change mitigation efforts.
Given the pivotal role of SOC in sustaining agricultural productivity, regulating the global carbon cycle, and enhancing ecosystem resilience, its accurate monitoring and prediction are not only of local or regional concern but of global importance. SOC dynamics are directly linked to pressing challenges such as land degradation, food security, and climate change mitigation. In this context, improved SOC assessment using advanced machine learning techniques supports international efforts like the United Nations Sustainable Development Goals (SDGs)—particularly SDG 13 (Climate Action) and SDG 15 (Life on Land)—as well as global initiatives such as the “4 per 1000” movement, which advocates for increasing SOC stocks to offset greenhouse gas emissions. This study contributes to these global objectives by developing and evaluating robust predictive models that integrate field data with remote sensing, providing scalable tools for sustainable soil management.
The main objective of this research is to apply ML models for the accurate monitoring and prediction of SOC by integrating soil data with remotely sensed information. Specifically, this study aims to (i) identify key soil properties, spectral indices, and environmental variables that influence SOC dynamics; (ii) assess the ability of various ML algorithms in calculating SOC across different landscapes; (iii) enhance model interpretability through feature selection and explainability techniques; and (iv) establish a scalable and cost-effective framework for SOC assessment to assist farmers in applying sustainable soil management practices.
  2. Materials and Methods
  2.1. Study Area
The study area comprises the largest part of the lowland sections of the farmlands of the communities of Mavrothalassa and Tragilos. It is located at the southeastern edge of the Serres plain and borders the Strymon River to the northeast (
Figure 1). The elevation ranges from 5 to 16 m, with slopes around 0.5%, making the area generally flat.
Of the soils in the study area, 85.4% formed from river deposits (alluvium), primarily due to the floodplain activity of the Ezovitis River and, to a lesser extent, the Strymon River. The remaining 14.6% resulted from the transport of material towards the slopes of the hills following the weathering of rocks (granites, gneisses, and marbles) from the hills surrounding the settlements of Tragilos and Mavrothalassa. The soils are predominantly deep, with a light to medium texture (moderately coarse-grained to moderately fine-grained). They are well-drained and do not present irrigation issues [16]. Land use in the study area is agricultural.
  2.2. Data Acquisition and Preprocessing
The foundation of our research lies in the acquisition and meticulous preprocessing of diverse datasets, including soil, terrain, climate and remote sensing data. Soil data were obtained from a combination of existing soil surveys and targeted field sampling campaigns. These campaigns involved collecting soil samples, which were then analyzed in the chemical laboratory.
Terrain data, including elevation and slope, were produced using Digital Elevation Models (DEMs), derived from SRTM data, ensuring accurate representation of the topographic features of the study area. Remote sensing data, specifically Sentinel-2, were downloaded from the Copernicus Open Access Hub. Sentinel-2 provides multispectral imagery at a high spatial resolution, making it ideal for capturing vegetation indices and other land surface characteristics relevant to SOC estimation. Climate data, including precipitation and temperature, were obtained from local meteorological stations.
Preprocessing consisted of removing outliers and erroneous values, handling missing data through imputation, and integrating the various datasets into a unified database. Outliers were identified and removed prior to model training based on a z-score threshold of ±3, which corresponds to values falling more than three standard deviations from the mean. This threshold was chosen so that extreme values, which could disproportionately influence model performance, were excluded while the majority of the data was retained.
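The z-score rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `zscore_filter`, the use of the population standard deviation, and the SOC values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def zscore_filter(values, threshold=3.0):
    """Keep only the values whose absolute z-score is <= threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD (assumption; the sample SD is an equally valid choice)
    if sigma == 0:
        return list(values)  # all values identical: nothing to remove
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


# Hypothetical SOC values (%) with one extreme entry
soc = [1.1, 1.3, 0.9, 1.2, 1.0, 1.4, 1.1, 0.8, 1.2, 1.3, 1.0, 1.1, 9.5]
cleaned = zscore_filter(soc)
print(len(soc), "->", len(cleaned))
```

With very small samples a single extreme value also inflates the standard deviation, so it may be worth checking whether the rule can flag anything at all before relying on it.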
In terms of missing data, we observed incomplete values in a small subset of predictor variables, particularly among remotely sensed indices (e.g., NDVI) and topographic features (e.g., slope and aspect) due to occasional cloud cover or DEM anomalies. The amount of missing data was minimal, accounting for approximately 4.8% of the dataset. Missing values were handled using mean imputation, which replaces each missing entry with the mean value of the corresponding feature across the training data. This method was selected for its simplicity and to preserve data size, which was important given the limited number of soil samples (n = 36).
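A minimal sketch of mean imputation, assuming missing entries are encoded as `None` and that the feature means are computed on the training rows only and then reused for the validation rows. The helper names `fit_means` and `impute` and the numbers are hypothetical.

```python
def fit_means(rows):
    """Column means over the non-missing (not None) entries of the training rows."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return means


def impute(rows, means):
    """Replace each None with the training mean of its column."""
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


# Hypothetical rows: [NDVI, slope]
train = [[0.20, 0.5], [None, 0.7], [0.30, None]]
valid = [[None, 0.6]]

means = fit_means(train)             # learned from the training data only
train_filled = impute(train, means)
valid_filled = impute(valid, means)  # validation rows reuse the training means
```

Computing the means on the training rows alone keeps information from the validation set out of the fitted preprocessing step.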
Soil data were spatially referenced using GPS coordinates, allowing for seamless integration with terrain and remote sensing data. All datasets were projected to a common coordinate system (Greek Grid) to ensure spatial consistency. This rigorous data acquisition and preprocessing workflow ensures the quality and reliability of the input data used for subsequent machine learning modelling.
  2.3. Field Sampling and Laboratory Analysis
A total of 36 soil samples were acquired at irregular intervals across the study area during the last week of November 2024. Soil sampling points were selected using a targeted purposive sampling strategy rather than a systematic grid or purely random design. The aim was to capture the full range of spatial heterogeneity in the study area, including variations in topography, land use, vegetation cover, and soil type. High-resolution satellite imagery and GIS layers (e.g., land cover maps, NDVI, slope, and elevation) were used to stratify the landscape into zones of differing environmental characteristics. Sampling points were then distributed across these strata to ensure that each distinct land unit was represented. Although this resulted in sampling at irregular intervals, it was intentionally designed to maximize representativeness and variability of soil organic carbon (SOC) conditions within the limited number of available samples (n = 36). This method also considered logistical accessibility and landowner permissions, which influenced the final positioning of some samples. While a regular grid-based sampling design could improve spatial interpolation, the chosen approach prioritized environmental diversity and efficient coverage of relevant site conditions for machine learning model development. Moreover, this method is particularly useful in cases where visible soil variations, such as differences in texture or color, are not immediately apparent. To accurately record the sampling locations, a portable Global Positioning System (GPS) device was employed.
All 36 soil samples were collected from agricultural fields practicing conventional arable farming, predominantly involving similar cropping systems (e.g., cereals, maize, and legumes), fertilization practices, and tillage operations. The selection of sampling locations was conducted to ensure relative uniformity in agricultural management, minimizing the impact of differing practices on SOC variability. This consistency supports the reliability of comparisons made across the sampled sites in the context of machine learning modeling.
Topsoil samples (0–30 cm) were collected with a hand auger, ensuring consistency in sample collection. After retrieval, the soil samples underwent preliminary processing, which included air-drying and sieving using a 2 mm mesh to remove larger particles and ensure uniformity for laboratory analysis. The determination of SOC was conducted using the Walkley–Black wet oxidation method, a widely accepted procedure for assessing SOC content [17]. Moreover, other soil properties, such as texture, pH, electrical conductivity (EC), CaCO₃, macro-nutrients and micro-nutrients, were also determined under the ISO/IEC 17025:2017 standard.
  2.4. Feature Extraction
Feature extraction derives the relevant predictors used in the ML models. From Sentinel-2 imagery, we derived a suite of spectral indices, i.e., NDVI, SAVI (Soil-Adjusted Vegetation Index), NDWI (Normalized Difference Water Index), BI (Bare Soil Index), and CI (Chlorophyll Index). These indices describe vegetation cover, biomass, soil color and soil moisture, all of which are closely related to SOC content. We calculated these indices for a Sentinel-2 scene acquired during the same period as the sampling campaign to capture the actual vegetation conditions at the time of sampling.
While vegetation indices, such as NDVI, SAVI, NDWI, BI, and CI, are indeed influenced by seasonal and phenological changes, their inclusion in SOC prediction models can still offer valuable information. The specific timing of data acquisition—late November 2024—was selected to minimize interference from peak vegetation growth stages and to represent post-harvest conditions in the study area. This period typically reflects residual vegetation cover and soil surface conditions that can be correlated with long-term soil properties such as SOC. For instance, low NDVI and SAVI values at this stage may indicate reduced biomass input and potentially lower SOC accumulation over time. Conversely, sites with higher values could suggest areas with better soil fertility or ground cover history that influence SOC stocks. To mitigate the stochastic nature of these indices, we carefully selected indices that are sensitive not only to greenness (e.g., NDVI, SAVI) but also to soil brightness and moisture (e.g., BI, NDWI), which are more directly related to surface soil conditions and, thus, indirectly to SOC content.
Climate data were used to calculate various climatic variables, such as mean annual temperature and total annual precipitation. These variables reflect the influence of climate on SOC accumulation and decomposition processes. Topographic features, including elevation, slope, aspect, and topographic wetness index (TWI), were derived from the DEMs. These features capture the influence of topography on soil drainage, erosion, and redistribution of organic matter.
Moreover, interaction terms were created between different variables to capture non-linear relationships. For example, we created interaction terms between NDVI and precipitation to reflect the combined influence of vegetation and moisture on SOC. The feature extraction process resulted in a comprehensive set of predictors that capture the complex interplay of factors influencing SOC content. These predictors were used as inputs to evaluate the ML models.
  2.5. Machine Learning Models
ML is a subfield of artificial intelligence in which models learn to make predictions from data. This study uses four supervised learning algorithms: neural networks (NN), random forests (RF), support vector machines (SVM), and decision trees (DT).
  2.5.1. Neural Networks (NNs)
NNs are powerful non-linear models able to deal with large datasets and the complexity in relationships between predictors and SOC [
18,
19,
20]. They are computationally expensive to train and prone to overfitting. NNs are machine learning models that simulate the way the human brain processes information. They consist of multiple layers of neurons, where each neuron applies an activation function to transform input data.
A feedforward neural network follows this mathematical relationship:
$$y = f(Wx + b)$$
where:
- x = input vector (e.g., soil properties, spectral indices, environmental variables), 
- W = weight matrix connecting the neurons, 
- b = bias term, 
- f = activation function (e.g., ReLU, sigmoid, tanh), 
- y = predicted SOC value. 
The network is trained using backpropagation, which minimizes the Mean Squared Error (MSE). Moreover, the hyperparameters used are as follows:
- Number of hidden layers = 2: Controls model complexity by allowing multiple levels of feature transformation. 
- Neurons per layer = [32, 16]: The first hidden layer has 32 neurons, while the second has 16. More neurons improve learning capacity but increase computational cost. 
- Activation function = ReLU (Rectified Linear Unit): Ensures non-linearity in learning. 
- Optimizer = Adam (Adaptive Moment Estimation): A gradient-based optimizer that adjusts learning rates dynamically for faster convergence. 
- Learning rate = 0.001: Controls how much weights update during training. A small value ensures stable learning, preventing large fluctuations. 
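As an illustration of the architecture listed above (two hidden layers with 32 and 16 neurons and ReLU activations), the forward pass can be sketched in plain Python. This is a sketch under stated assumptions rather than the study's implementation: the number of input features (8), the random weight initialization, and all function names are hypothetical, the output layer is assumed to be linear, and training with Adam and backpropagation is omitted.

```python
import random


def relu(v):
    """Element-wise ReLU: max(0, a)."""
    return [max(0.0, a) for a in v]


def dense(x, W, b):
    """Affine layer: computes Wx + b for one input vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b


def forward(x, layers):
    """ReLU on the hidden layers, linear output for regression (assumption)."""
    h = x
    for W, b in layers[:-1]:
        h = relu(dense(h, W, b))
    W, b = layers[-1]
    return dense(h, W, b)[0]


rng = random.Random(0)
n_features = 8                     # hypothetical number of predictors
sizes = [n_features, 32, 16, 1]    # input -> 32 -> 16 -> SOC
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

x = [rng.random() for _ in range(n_features)]
y_hat = forward(x, layers)         # untrained prediction, for shape checking only
```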
Although Neural Networks are typically better suited to large datasets due to their high capacity and data-hungry nature, their inclusion in this study was justified as a comparative benchmark against other machine learning algorithms under small-sample conditions. The model architecture was deliberately kept shallow (i.e., few hidden layers and neurons), and regularization techniques were used to reduce the risk of overfitting. Given that the acquisition of georeferenced soil data is often constrained by field logistics and analytical costs, especially in the context of SOC modeling, this approach reflects real-world limitations faced in many environmental and agricultural studies. Additionally, cross-validation techniques were applied to ensure model stability and to obtain reliable performance metrics, even with a limited number of samples.
  2.5.2. Random Forests (RFs)
RFs, an ensemble learning method, are robust to outliers and can handle high-dimensional datasets [18]. They are easy to use and provide a measure of feature importance. They are less prone to overfitting than single decision trees and can deal with large datasets. RFs construct multiple decision trees and combine their outputs to improve accuracy and reduce overfitting. RF predicts SOC using the average output of multiple decision trees:
$$\hat{y} = \frac{1}{N} \sum_{i=1}^{N} T_i(x)$$
where $T_i(x)$ is the prediction from the i-th tree, and N is the total number of trees.
The split at each node is determined by minimizing Mean Squared Error (MSE). The hyperparameters used are:
- Number of trees = 100: More trees improve stability but increase computation. 
- Maximum tree depth = 10: Controls overfitting by limiting tree complexity. 
- Minimum samples per leaf = 2: Ensures each leaf node has at least 2 samples, reducing model variance. 
  2.5.3. Support Vector Machines (SVMs)
SVMs can effectively model datasets with both linear and non-linear relationships by finding the optimal hyperplane that separates data points into different classes. More specifically, SVMs use kernel functions to map the input features into a higher-dimensional space, where optimal separation between classes or an optimal regression function can be achieved [21]. For regression (Support Vector Regression), the goal is to find a function f(x) that keeps prediction errors within a margin ϵ:
$$f(x) = w^{\top} x + b$$
where w is the weight vector and b is the bias term.
The objective is to minimize $\frac{1}{2}\lVert w \rVert^2$, subject to $|y_i - f(x_i)| \le \epsilon$.
For non-linear relationships, an RBF (Radial Basis Function) kernel is applied to transform data into a higher-dimensional space:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$
The hyperparameters used are:
- Kernel = RBF: Maps non-linear data into a higher-dimensional space to improve separation. 
- Regularization parameter (C) = 1.0: Controls trade-off between model complexity and error tolerance. Higher values reduce bias but may lead to overfitting. 
- Gamma (γ) = 0.1: Determines the influence of each training sample. Smaller values generalize better, while higher values make the model more sensitive to individual points. 
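To make the role of γ and ϵ concrete, the sketch below evaluates the standard RBF kernel, exp(−γ‖x − z‖²), and the ϵ-insensitive error used in Support Vector Regression. The γ default follows the value listed above; the ϵ value of 0.1 and the function names are assumptions, since the text does not report ϵ.

```python
import math


def rbf_kernel(x, z, gamma=0.1):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def eps_insensitive(y_true, y_pred, eps=0.1):
    """Errors inside the eps-tube cost nothing; outside it the cost grows linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)


# Identical points have similarity 1; similarity decays with distance
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])   # squared distance 25 -> exp(-2.5)
```

Because the kernel depends on Euclidean distances, features on very different scales can dominate it, which is one reason feature scaling is usually applied before fitting an SVM.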
  2.5.4. Decision Trees (DTs)
DTs recursively partition the data based on feature values to create a tree-structured model. DTs are simple, interpretable, and easy to visualize, and they can handle both classification and regression tasks. However, they are prone to overfitting and may not perform as well as more complex algorithms on complex datasets [22,23]. At each node, the algorithm selects the feature that minimizes impurity, using Mean Squared Error (MSE) for regression.
A feature $x_j$ is chosen if it results in the lowest MSE. The hyperparameters are as follows:
- Maximum tree depth = 5: Limits the depth to prevent overfitting. 
- Minimum samples per split = 4: Ensures that splits occur only if at least 4 samples are available, reducing unnecessary complexity. 
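The split rule can be illustrated for a single feature: every midpoint between consecutive sorted values is tried as a threshold, and the one with the lowest weighted MSE of the two child nodes is kept. This is a simplified sketch, not the study's implementation; the function names and the NDVI/SOC values are hypothetical, and a full tree would repeat this search over all features and recurse up to the maximum depth.

```python
def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys, min_samples_split=4):
    """Return (threshold, weighted_mse) for one feature, or None if the node is too small."""
    n = len(xs)
    if n < min_samples_split:
        return None
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (sse(left) + sse(right)) / n      # weighted MSE of the two children
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (thr, score)
    return best


# Hypothetical data: SOC jumps between NDVI 0.15 and 0.40
ndvi_values = [0.10, 0.12, 0.15, 0.40, 0.45, 0.50]
soc_values = [0.8, 0.9, 1.0, 2.0, 2.1, 2.2]
threshold, score = best_split(ndvi_values, soc_values)
```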
Each of the above models has its pros and cons, making them suitable for different types of datasets. The rationale for selecting these four models was based on their proven ability to address the complexity and non-linearity within datasets and provide robust estimations. Each model offers unique strengths, allowing us to explore different aspects of the data and obtain a robust estimation of SOC content. In general, NNs are suitable for complex tasks with large datasets, RFs are a good starting point for many problems, SVMs are optimal when the separation is evident, and DTs are useful for simple, interpretable models.
  2.6. Model Association with Data and Performance Metrics
Here, 36 soil samples were collected from various locations, covering diverse soil properties. Remote sensing data (e.g., NDVI, SAVI, NDWI, BI and CI) were extracted for each sample location. The final dataset contained SOC values as the dependent variable and spectral/terrain/soil attributes as independent variables.
To rigorously assess the accuracy and reliability of the SOC estimation models, we employed a comprehensive set of performance metrics. The primary metrics used were R-squared (coefficient of determination) and Root Mean Squared Error (RMSE). In addition to R-squared and RMSE, we also calculated other metrics, such as Mean Absolute Error (MAE), Mean Absolute Percent Error (MAPE), and Mean Squared Error (MSE).
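For reference, the five metrics can be computed as in the sketch below, using their standard definitions. The function name and the example values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, MAPE (%) and R2 using their standard definitions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an observed value is exactly zero
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape, "R2": r2}


# Hypothetical observed and predicted SOC values
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
m = regression_metrics(y_true, y_pred)
```

Because MAPE divides each error by the observed value, observations close to zero can inflate it considerably, so it is best read alongside MAE and RMSE.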
The performance metrics were calculated for both the training (70% of the data) and validation (30% of the data) sets to assess model overfitting and generalization ability. A model with high accuracy on the training set but poor on the validation set is likely overfitting the data, while a model with consistent performance on both sets is considered more reliable.
The dataset, consisting of 36 soil samples, was randomly split into a training set (70%, n = 25) and a validation set (30%, n = 11). This 70/30 split ratio is a commonly adopted approach in machine learning, especially for small datasets, as it provides a balanced trade-off between model training and generalization testing. The split was performed using a stratified sampling strategy, ensuring that the distribution of SOC values remained representative in both subsets to avoid bias during model evaluation.
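One way to stratify a split on a continuous target is to bin the samples by SOC rank and draw the validation samples from every bin. The sketch below does this under several assumptions that the text does not specify: three equal-sized bins, the quota rounding scheme, the random seed, and the synthetic SOC values are all hypothetical.

```python
import random


def stratified_split(y, valid_frac=0.3, n_bins=3, seed=0):
    """Split indices so that every rank-based bin of y contributes to the validation set."""
    rng = random.Random(seed)
    n = len(y)
    n_valid = round(n * valid_frac)
    order = sorted(range(n), key=lambda i: y[i])               # indices ranked by target
    bounds = [round(n * k / n_bins) for k in range(n_bins + 1)]
    valid, assigned = [], 0
    for lo, hi in zip(bounds, bounds[1:]):
        quota = round(n_valid * hi / n) - assigned             # cumulative rounding
        valid.extend(rng.sample(order[lo:hi], quota))
        assigned += quota
    valid_set = set(valid)
    train = [i for i in range(n) if i not in valid_set]
    return train, sorted(valid)


# Synthetic SOC values standing in for the 36 samples
rng = random.Random(42)
soc = [round(rng.uniform(0.5, 3.0), 2) for _ in range(36)]
train_idx, valid_idx = stratified_split(soc)
print(len(train_idx), len(valid_idx))
```

With 36 samples and a 30% validation fraction this yields 25 training and 11 validation indices, matching the sizes reported above.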
  2.7. Visual and Spectral Evaluation of Prediction Quality
To visually interpret model performance, Sentinel-2 RGB image clips (10 m resolution) were extracted for four representative sample locations: two with minimal prediction error and two with higher residuals. RGB composites were generated using bands 4 (red), 3 (green), and 2 (blue). Additionally, spectral reflectance values for Sentinel-2 bands 1–12 were plotted to create multispectral profiles of each location, allowing for the assessment of differences in surface properties related to prediction accuracy.
  3. Results
The remote sensing-derived indices used in this research study, as well as their corresponding histograms, are as follows (Figure 2 and Figure 3):
- SAVI (Soil-Adjusted Vegetation Index), which is a modified version of NDVI that minimizes soil brightness effects, especially in areas with sparse vegetation. The formula is SAVI = (NIR − RED)/(NIR + RED + L) × (1 + L), where L is a soil brightness correction factor, typically set to 0.5 for moderate vegetation cover [24]; 
- NDWI (Normalized Difference Water Index), which is used to detect water bodies and assess vegetation water content. Its formula is: NDWI = (NIR − SWIR)/(NIR + SWIR). Higher values indicate more water presence, which can influence SOC through vegetation productivity and microbial activity [25]; 
- NDVI (Normalized Difference Vegetation Index), which measures vegetation health and biomass. Its formula is: NDVI = (NIR − RED)/(NIR + RED). The NDVI values range from −1 to +1, where higher values indicate healthy vegetation. NDVI is among the most frequently used indices in SOC-related remote sensing studies [26]; 
- BI (Bare Soil Index), which identifies bare soil areas by highlighting differences in visible and infrared bands. Its formula is: BI =  - . Higher values indicate barer soil. This index has been used in SOC modeling to infer surface exposure and erosion risk [27]; 
- CI (Chlorophyll Index), which estimates chlorophyll content in vegetation, useful for monitoring plant health. The formula is: CI = (NIR/Green) − 1. Higher values indicate greater chlorophyll concentration, which is indirectly linked to biomass productivity and organic carbon input to the soil [28]. 
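The index formulas can be written as small functions operating on surface reflectance values. This is an illustrative sketch: the reflectance values are hypothetical, and the usual Sentinel-2 band assignment (NIR = B8, red = B4, green = B3, SWIR = B11) is an assumption, since the text does not state which bands were used. BI is left out of the sketch.

```python
def ndvi(nir, red):
    """NDVI = (NIR - RED) / (NIR + RED)."""
    return (nir - red) / (nir + red)


def savi(nir, red, L=0.5):
    """SAVI = (NIR - RED) / (NIR + RED + L) * (1 + L), with L = 0.5 by default."""
    return (nir - red) / (nir + red + L) * (1 + L)


def ndwi(nir, swir):
    """NDWI = (NIR - SWIR) / (NIR + SWIR)."""
    return (nir - swir) / (nir + swir)


def ci_green(nir, green):
    """CI = NIR / Green - 1."""
    return nir / green - 1


# Hypothetical reflectances for a single pixel
nir, red, green, swir = 0.30, 0.20, 0.15, 0.35
indices = {
    "NDVI": ndvi(nir, red),
    "SAVI": savi(nir, red),
    "NDWI": ndwi(nir, swir),
    "CI": ci_green(nir, green),
}
```

For the same pixel, SAVI comes out lower than NDVI because the L term enlarges the denominator more than the (1 + L) factor compensates at low reflectance.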
Figure 2. Remote sensing-derived indices.
Figure 3. Histograms of remote sensing-derived indices.
The field sampling was conducted during late November 2024, a period corresponding to a transitional phenological phase for most agricultural areas in the region. Specifically, some fields were in the late vegetative to early reproductive stage, while others had undergone recent harvesting or were under bare soil or early growth due to different planting schedules or crop types. This phenological variability is reflected in the distribution of vegetation indices (NDVI and SAVI), where two distinct peaks were observed in the histograms. The first peak corresponds to low vegetation cover (e.g., harvested fields or bare soil), while the second peak reflects areas with active, dense vegetation. This heterogeneity is typical of Mediterranean agricultural systems with mixed cropping practices and non-synchronous crop calendars.
The descriptive statistics of the remote sensing indices (BI, CI, NDVI, NDWI, SAVI), with no missing values across the 36 observations, are given in Table 1. Each index provides insight into vegetation, soil, and moisture conditions. NDVI has a mean of 0.205, suggesting moderate vegetation cover. The CI mean (0.179) is also relatively high, indicating healthy chlorophyll levels. The SAVI mean (0.117) is lower than that of NDVI, as expected, since SAVI adjusts for soil brightness. The BI mean (0.105) is low, confirming that the area is not dominated by bare soil, and its low standard deviation (0.016) indicates consistent bare soil exposure. The NDWI mean (−0.100) is negative, indicating predominantly dry conditions or low water content, while its standard deviation (0.076) shows high variability, meaning that moisture conditions differ considerably across the study area.
Figure 4 shows the spectral signatures of soil samples categorized by SOC levels (low, medium, high) using Sentinel-2 data. Each line represents the average reflectance across Sentinel-2 bands for samples within an SOC range.
The following observations can be made:
- Reflectance is generally lower in high-SOC soils, especially in the visible (B2–B4) and near-infrared (B8) bands. This aligns with known spectral behavior, as organic-rich soils tend to be darker and absorb more incoming radiation. 
- Medium-SOC soils show intermediate reflectance, while low-SOC soils exhibit the highest reflectance values, particularly in the red and NIR regions. 
- The Red Edge bands (B5–B7) and SWIR bands (B11, B12) show variation across SOC categories, indicating potential for SOC-sensitive spectral indicators. 
These patterns support the use of Sentinel-2 multispectral features in machine learning models for SOC prediction and help justify the selection of specific bands and vegetation indices (e.g., NDVI, BI, etc.) as input variables.
The dataset suggests a landscape with moderate vegetation cover, relatively dry conditions, and minimal bare soil exposure. Furthermore, the negative NDWI values might be worth further investigation, especially to determine if dry areas align with expected soil moisture conditions.
The results of the evaluation metrics on the ML models are presented in 
Table 2 and are described below:
  3.1. Decision Trees (Best Overall Performance)
This model gave the lowest MSE (0.428) and RMSE (0.654), indicating that Decision Trees have the smallest squared-error-based prediction errors among the models. Moreover, the lowest MAE (0.514) shows that its absolute prediction errors are also the smallest. With R2 = 0.344, the highest among all models, Decision Trees explain about 34.4% of the variance in the target variable. MAPE = 139.46% is still high: it is lower than that of Random Forests (339.92%) but higher than those of Neural Networks (104.21%) and SVMs (107.87%).
  3.2. Random Forests (Moderate Performance)
The MSE (0.581) and RMSE (0.762) show higher errors than Decision Trees but still better than Neural Networks and SVM. The MAE (0.681) is slightly worse than Decision Trees and R2 = 0.063, which is very low, meaning the model explains only 6.3% of the variance in the data. MAPE = 339.92%, which is extremely high, suggesting poor generalization. Therefore, Random Forests do not generalize well in this case, possibly due to overfitting or data distribution issues.
  3.3. Neural Networks (Underperforming)
The values of MSE = 1.133 and RMSE = 1.064 show high error values, worse than Decision Trees and Random Forests. MAE = 0.713 is higher than Decision Trees and Random Forests, while R2 = 0.013, which is very low, meaning the model explains only 1.3% of the variance in the data. MAPE = 104.21% is the lowest among all models but still high. Thus, Neural Networks do not perform well on this dataset, likely due to insufficient training data, improper hyperparameter tuning, or unsuitable architecture.
  3.4. Support Vector Machines (SVMs) (Worst Performance)
The MSE (1.403) and RMSE (1.184) are the highest among all models, as is the MAE (0.785). The R² of 0.034 means that only 3.4% of the variance is explained, and the MAPE of 107.87% likewise indicates poor predictive accuracy. SVMs have the largest errors in this case, possibly due to poor kernel selection, lack of feature scaling, or sensitivity to data distribution.
The best model is Decision Trees, as they have the lowest MSE, RMSE, and MAE and the highest R², although not the lowest MAPE. Random Forests surprisingly perform worse than Decision Trees, likely due to overfitting or hyperparameter issues. Neural Networks and SVMs perform poorly, suggesting they are not well suited for this dataset.
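As a point of reference for the metrics reported above, the following is a minimal sketch of how MSE, RMSE, MAE, R² and MAPE are commonly computed from paired observed and predicted values. The function name `regression_metrics` and the toy numbers are illustrative assumptions; they are not the study's code or data, and the study may have used a different definition of R².

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, R^2 and MAPE (in percent) for paired lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # RMSE is always the square root of MSE
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # coefficient of determination
    # MAPE divides by the observed value, so it is undefined for zeros
    # and inflates quickly when observed values are close to zero.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2, "mape": mape}

# Hypothetical toy values, chosen only to illustrate the calculation.
observed = [0.2, 1.0, 2.0, 3.0]
predicted = [0.7, 1.2, 1.8, 2.6]
metrics = regression_metrics(observed, predicted)
```

In this toy example the single small observed value (0.2) accounts for most of the MAPE, which is one reason MAPE can exceed 100% even when MAE is modest. Under this definition of R², models evaluated on the same test samples rank identically by MSE and by R²; other definitions, such as the squared Pearson correlation, do not share this property.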
  3.5. Visual and Spectral Analysis of Model Residuals
The spectral signature patterns observed in remote sensing data are closely linked to land cover or soil surface conditions, which, in turn, have a significant impact on the accuracy of predictive models. For instance, Pin 1 and Pin 2 exhibit strong near-infrared (NIR) peaks, which suggest the presence of vegetation residues or organic-rich soil (Figure 5). These conditions could introduce higher prediction errors in models that were not trained using data from vegetated pixels, as the model may struggle to account for the complexity of these surfaces. On the other hand, Pin 3 and Pin 4 display flatter spectral curves, characteristic of bare soil surfaces (Figure 5). These conditions are generally easier to model, especially if the training set focused on clean soil signatures, as the model can more easily differentiate and predict based on the simpler reflectance characteristics of bare soil. Thus, the quality of model predictions can be significantly influenced by the composition of the training dataset and how well it represents the range of land cover types and soil conditions in the study area.
  3.6. Limitations and Future Research
While this study provides a comprehensive methodology for enhancing SOC estimation, it is important to mention some limitations. The accuracy of the SOC predictions is affected by the spatial coverage of the data. In areas with poor soil data availability, the prediction accuracy may be lower. The ML models used in this study are based on statistical relationships between predictors and SOC and may not fully capture the complex biogeochemical processes governing SOC dynamics [29]. The models assume that the relationships learned from the training data hold true across the entire study area, which may not be the case in areas with significant spatial heterogeneity.
While the machine learning models applied in this study provided insights into the prediction of soil organic carbon (SOC), the reported errors—particularly in terms of RMSE and MAPE—highlight limitations in prediction reliability. These limitations may be attributed to the relatively small sample size (n = 36), the complexity of SOC spatial variability, and the non-linear interactions among soil, topographic, and remotely sensed variables.
To address these challenges, future research should consider the use of deep learning approaches, such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), which have shown great promise in capturing high-level feature representations and complex spatial–temporal relationships [30]. Deep learning models, particularly when integrated with high-resolution remote sensing data and larger datasets, may significantly enhance the accuracy and robustness of SOC mapping.
Integrating land cover classification into the modelling framework could improve predictive performance by accounting for vegetation type, land use patterns, and surface characteristics that influence organic carbon dynamics [31]. Land cover can act as an important covariate or stratification layer, helping to tailor models to distinct ecological and management contexts. These enhancements could pave the way for more reliable, scalable, and interpretable SOC estimation workflows in the future.
Moreover, further research is needed to validate the SOC predictions in different environmental settings and assess the transferability of the models to other regions. The development of user-friendly tools and interfaces would facilitate the application of this methodology for SOC monitoring and management. Addressing uncertainties and suggesting avenues for further improvement are crucial for advancing SOC estimation techniques and informing sustainable soil management practices.
  4. Discussion
The results of this study demonstrate notable differences in the predictive performance of the applied machine learning models for estimating SOC. Among the four models evaluated, the Decision Tree outperformed the others, with the lowest error metrics (MSE = 0.428, RMSE = 0.654, MAE = 0.514) and the highest coefficient of determination (R² = 0.344). In contrast, Neural Networks and Support Vector Machines (SVMs) yielded much lower R² values (0.013 and 0.034, respectively), indicating poor generalization under the current data constraints. The Random Forest model was intermediate: its MSE, RMSE, and MAE were lower than those of Neural Networks and SVMs, but its R² (0.063) remained low and it did not surpass the Decision Tree.
These discrepancies can largely be attributed to the small dataset used in this study, which consisted of only 36 soil samples. Neural Networks and SVMs typically require larger datasets to effectively learn complex non-linear relationships. Their performance may have been compromised by overfitting, as they are more sensitive to noise and less robust to sparse data. Conversely, Decision Trees and Random Forests, particularly the former, are well suited to small datasets and can efficiently partition the feature space based on available information, even with a limited number of samples.
A detailed sample-level analysis based on residual plots (observed vs. predicted SOC values) revealed that larger prediction errors occurred in locations with high topographic variation and heterogeneous vegetation cover, as indicated by fluctuations in NDVI and slope. These patterns suggest that terrain-induced microclimatic variability and spectral mixing in satellite data may have contributed to SOC estimation errors. Additionally, inconsistencies in surface reflectance due to soil moisture or crop residue could have affected the quality of remotely sensed predictors.
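One simple way to quantify such a pattern is to correlate the absolute residuals with a terrain covariate. The sketch below does this with a hand-written Pearson correlation; the helper name `pearson`, the variable names, and all numbers are hypothetical and serve only to illustrate the idea, not to reproduce the analysis performed in this study.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical toy values: observed and predicted SOC, and slope in degrees.
observed = [1.2, 0.8, 2.1, 1.5, 0.9]
predicted = [1.1, 0.9, 1.4, 1.0, 0.9]
slope_deg = [2.0, 3.0, 15.0, 11.0, 1.0]

# Absolute residual per sample, then its association with slope.
abs_residuals = [abs(o - p) for o, p in zip(observed, predicted)]
r_slope = pearson(abs_residuals, slope_deg)
```

A clearly positive `r_slope` would be consistent with larger errors on steeper terrain, although with very few samples such a coefficient is only indicative.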
An analysis of feature importance, derived from the Random Forest and Decision Tree models, revealed that NDVI, slope, and clay content were the most influential variables in SOC prediction. This highlights the importance of integrating remote sensing indicators and soil texture information to capture both vegetative productivity and edaphic control over carbon accumulation. Despite the relatively simple structure of Decision Trees, their performance illustrates that under constrained data conditions, model interpretability and adaptability may outweigh complexity.
This study is not without limitations. The relatively low sample size introduces a degree of uncertainty to model validation, especially regarding generalization to other spatial contexts. The lack of temporal resolution in satellite data may also have hindered the model’s ability to capture seasonal dynamics in vegetation that influence SOC formation. Furthermore, although the spatial co-registration between soil samples and satellite-derived indices was carefully implemented, residual spatial mismatches may have introduced errors.
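With so few samples, one common way to make fuller use of the data for validation is leave-one-out cross-validation, in which each sample is predicted by a model fitted on all the others. The sketch below illustrates the idea with a one-split regression tree on a single predictor. It is an assumption-laden illustration, not the validation scheme used in this study: the helpers `fit_stump` and `leave_one_out_predictions`, the single NDVI-like predictor, and the toy values are all hypothetical.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature by least squares."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def leave_one_out_predictions(xs, ys, fit):
    """Predict each sample with a model fitted on all the other samples."""
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(model(xs[i]))
    return preds

# Hypothetical toy values (not the study's data).
ndvi = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
soc = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
loo_preds = leave_one_out_predictions(ndvi, soc, fit_stump)
```

Every sample then contributes exactly one out-of-sample prediction, so error metrics are computed over all n samples rather than over a small hold-out subset.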
For future research, it is recommended to increase the number of soil sampling points, particularly in areas with high landscape variability. The incorporation of time-series satellite data could also enhance model sensitivity to seasonal variations in SOC-related processes. Lastly, the exploration of hybrid or ensemble modeling frameworks, which combine physical process-based understanding with data-driven methods, may yield further improvements in SOC prediction accuracy.
  5. Conclusions
This study assessed the performance of various ML models (NN, RF, SVM and DT) in predicting SOC. The varying performance of the models highlights the importance of model selection and indicates that more complex algorithms like NN and SVM may not always provide a significant advantage over simpler models like DT and RF in certain contexts. More specifically, the findings highlight that simpler, interpretable models such as Decision Trees can outperform more complex approaches in SOC prediction, particularly when working with heterogeneous environmental datasets. DT emerged as the most reliable ML model, demonstrating the best performance overall. This suggests that, within the context of this dataset, the structure of DT, which allows for clear decision-making paths, is particularly suited for capturing the underlying patterns. Random Forests ranked second on MSE, RMSE, and MAE, but their low R² (0.063) and high MAPE (339.92%) indicate limited predictive power on this dataset. NN and SVM produced the largest errors, indicating that their complexity might not be well suited to the characteristics of this dataset.
Finally, there is increasing interest in the ethical and societal issues raised by applying machine learning. It is important to ensure that ML models are fair, transparent, and do not perpetuate existing biases. Research could focus on the development of techniques for detecting and mitigating bias in machine learning models, as well as on the development of ethical protocols for ML applications in various domains.