Article

Application of Vis–NIR Spectroscopy and Machine Learning for Assessing Soil Organic Carbon in the Sierra Nevada de Santa Marta, Colombia

by Marlon Jose Yacomelo Hernández 1,*, William Ipanaqué Alama 2, Andrea C. Montenegro 1,*, Oscar de Jesús Córdoba 3, Darío Castañeda Sanchez 3, Cesar Vargas García 1, Elias Flórez Cordero 1, Jim Castillo Quezada 2, Carlos Pacherres Herrera 2, Luis Fernando Prado-Castillo 4 and Oscar Casas Leuro 4

1 AGROSAVIA (Corporación Colombiana de Investigación Agropecuaria), Vía Mosquera Km 14, Mosquera 250047, Colombia
2 Faculty of Engineering, University of Piura, Av. Ramón Mugica 131, Urb. San Eduardo, Piura 20009, Peru
3 Department of Agronomic Sciences, Faculty of Agricultural Sciences, Universidad Nacional de Colombia, Sede Medellín, Carrera 65 No. 59A-110, Medellín 050034, Colombia
4 ICPET (Instituto Colombiano del Petróleo y Energías de la Transición), Vía Piedecuesta Km 7, Piedecuesta 681018, Colombia
* Authors to whom correspondence should be addressed.
Sustainability 2026, 18(1), 513; https://doi.org/10.3390/su18010513
Submission received: 16 October 2025 / Revised: 23 December 2025 / Accepted: 24 December 2025 / Published: 4 January 2026

Abstract

Soil organic carbon (SOC) is a key indicator of soil fertility, health, and carbon sequestration capacity. Its proper management improves soil structure, productivity, and resilience to climate change, making rapid and reliable SOC assessment essential for sustainable agriculture. Visible and near-infrared (Vis–NIR) spectroscopy offers a non-destructive and cost-effective alternative to conventional laboratory analyses, allowing for the simultaneous estimation of multiple soil properties from a single spectrum. This study aimed to predict SOC content using machine learning techniques applied to Vis–NIR spectra of 860 soil samples collected in the Sierra Nevada de Santa Marta, Colombia. The spectra (400–2500 nm) were acquired using a NIR spectrophotometer, and SOC content was quantified using a wet oxidation method that employs dichromate in an acidic medium. A hybrid modeling framework combining Random Forest (RF) with support vector regression (SVR) and XGBoost was implemented. Spectral pretreatments (Savitzky–Golay first derivative, MSC, and SNV) were compared, and spectral bands were selected every 10 nm. The 30 most relevant wavelengths were identified using RF importance analysis. Data were divided into training (80%) and test (20%) subsets using stratified random sampling, and five-fold cross-validation was applied for parameter optimization and overfitting control. The RF–XGBoost (R2 = 0.86) and RF–SVR (R2 = 0.85) models outperformed the individual RF and SVR models (R2 < 0.7). The proposed hybrid approach, optimized through feature selection and advanced spectral preprocessing, provides a robust and scalable framework for rapid SOC prediction and sustainable soil monitoring.

1. Introduction

Soil organic carbon (SOC) is crucial to soil fertility and the success of crop growth. It is the main component of organic matter (OM) and serves as a key indicator of soil health and productivity [1]. SOC plays a vital role in improving soil structure, enhancing its ability to retain water and nutrients, and promoting better aeration and drainage, which are essential for healthy root growth. It also supports microbial activity, enhancing nutrient cycling and making critical nutrients, such as nitrogen and phosphorus, available to plants [1]. Additionally, SOC contributes to soil aggregation, which reduces erosion and improves soil tilth, making it easier to work with during planting and cultivation [2]. Moreover, SOC contributes to soil pH stabilization and acts as a buffer against environmental stresses [3]. SOC and other fertilization parameters are analyzed in laboratories to provide accurate and reliable data on soil health and fertility [4].
Traditional soil property research methods involve collecting soil samples and performing laboratory analysis, which is expensive and time-consuming [5,6,7,8]. For this reason, different alternative techniques that have proven efficient in predicting different soil indicators are currently being explored. These alternatives include spectroscopic methods such as visible and near-infrared (Vis–NIR) spectroscopy, laser-induced breakdown spectroscopy (LIBS), near-infrared (NIR), and mid-infrared (MIR) spectroscopy; the latter encompasses Fourier transform infrared spectroscopy (FTIR) and Raman spectroscopy [9].
The Vis–NIR technique has great potential for analytical applications due to its rapid, non-destructive, and cost-effective nature [9]. This technique is based on interactions between the studied object and optical waves of different wavelengths in the visible–near-infrared (Vis–NIR) spectrum (350–2500 nm), which can occur through vibrational reflectance, transmittance, or absorbance [10]. Different soils, minerals, and plant species have been shown to reflect, absorb, and emit electromagnetic waves from various spectral regions [11]. Consequently, each soil has its unique “fingerprint” depending on its composition and state [11]. Compared to conventional chemical analysis, Vis–NIR spectroscopy enables rapid, nondestructive, and reproducible prediction of soil properties [12]. It can infer multiple components from a single spectrum [13].
The application of machine learning (ML) in soil science has increased over the past decade [14], which has also influenced the use of infrared spectral data to infer soil properties [15]. One of the most widely used multivariate models for estimating soil chemical properties from Vis–NIR is partial least squares regression (PLSR) [16,17]. However, as a linear method, it does not capture the nonlinear relationships between the spectrum and soil chemical properties [18]. Other alternatives include stepwise multiple regression (SMLR) [19,20], principal component regression (PCR) [21,22], backpropagation neural network (BPNN) [23,24], support vector machine regression (SVMR) [18,25,26], and random forest regression (RFR) [27].
Recent studies utilized 140 [28], 261 [29], and 523 [30] soil samples, adopting ML approaches, including support vector machines, neural networks, random forest, and Cubist, to estimate, among other indicators, SOC and OM.
Previous studies have focused on using regression models to estimate soil properties from spectral data [30]. However, the potential of classification approaches in this context is underexplored. In Colombia, Delgadillo–Durán et al. [13] categorized soil properties into three classes (low, medium, or high) based on agronomic criteria and standards. However, natural soil variability leads to unbalanced datasets that complicate modeling. To address this, merging similar classes (e.g., low and medium OM or Ca) improves prediction accuracy and supports more practical soil and crop management decisions.
In this work, we investigate the use of machine learning techniques to predict SOC from Vis–NIR spectra. We evaluated the performance of several regression models: RF, SVR, RF–SVR, and RF–XGBoost. Additionally, we applied the Savitzky–Golay first derivative as a preprocessing technique to improve the quality and representativeness of the absorbance data. Based on this approach, we hypothesize that machine learning models calibrated with Vis–NIR spectral information can accurately predict soil organic carbon despite the high spatial and edaphic variability of the Sierra Nevada de Santa Marta.

2. Materials and Methods

2.1. Data

2.1.1. Study Area and Sample Collection Procedure

A selective sampling was carried out on farms with agroforestry–cacao systems (SAFC) in the Sierra Nevada de Santa Marta, Colombia. A total of 860 soil samples were collected across four municipalities in the region: Santa Marta, Ciénaga, Fundación, and Aracataca. Samples were taken from elevations ranging between 10 m above sea level (at coordinates N 11.249237, W 73.778825) and 2000 m above sea level (at coordinates N 10.975503, W 74.059458), across diverse SAFC systems and soils derived from various parent materials, including igneous rocks (granodiorites, quartz diorites, quartz monzonites, and granite), metamorphic rocks (ignimbrites, gneiss, schists), and colluvial–alluvial deposits. The area features a variety of landforms (ridges and beams, morainic fields, hills, knolls, and small valleys). The terrain ranges from gently sloping to steeply rugged, with slopes varying from 3% to over 75%. The soils are well-drained, shallow, and limited by bedrock, which often outcrops at the surface. Soils exhibit pH values ranging from slightly acidic to strongly acidic, with low SOC content and low fertility [31].
The sampling stage was carried out between 2023 and 2024, using a 10 cm deep ring. This depth was selected because it corresponds to the topsoil horizon (A), where the highest proportion of OM and biological activity is concentrated. Several studies have reported that between 70% and 90% of the total SOC and most of the soil macrofauna biomass are found within the first 10 cm of the soil profile, due to the accumulation of litter, fine roots, and easily decomposable organic residues [32,33,34].
However, the number of samples per municipality was not uniform, as it depended on the number and spatial extent of farms with active SAFC and on accessibility and environmental heterogeneity. The sampled farms corresponded to properties affiliated with four producer organizations in the region, and the sampling included 100% of the available farms with active agroforestry systems, meaning that the total sample represented the entire target population.
Although this resulted in unequal sampling among municipalities, the approach was methodologically justified to capture the true spatial and ecological variability of the agroforestry landscape across the Sierra Nevada. Furthermore, potential spatial bias was minimized by applying a stratified block sampling design, grouping sampling units according to slope and vegetation cover type, thus ensuring comparability among sites and maintaining the robustness of ecological and statistical analyses.

2.1.2. Chemical Measurements

SOC was quantified using the Walkley and Black method [35] with modifications. Between 20 and 100 mg of sample was weighed, 2 mL of 1 N K2Cr2O7 was added, and the mixture was stirred for 5 s. Then, 1 mL of concentrated sulfuric acid (H2SO4) and 7 mL of Milli-Q water were added, the reaction was left overnight, and the following day the mixture was centrifuged and quantified by UV–Vis at 585 nm. The method involves the wet oxidation of the soil sample using potassium dichromate in an acidic medium. The heat released when the sulfuric acid is added facilitates the partial oxidation of C, and the dichromate is reduced in proportion to the amount of C oxidized. This method only estimates Readily Oxidizable Carbon (ROC), so a correction factor, reflecting a recovery of 60% to 86% depending on soil type and horizon, is applied to estimate Total Organic Carbon (TOC) [36,37]. Numerous studies have indicated that the method typically achieves an average recovery of 76% to 77%, which corresponds to a correction factor of 1.3 [38,39,40]. Consequently, this study adopted this correction factor to estimate TOC, thus ensuring the uniformity, comparability, and standardization of the results. It is important to note that this factor is a standardized approximation used in soil organic carbon analysis; the actual recovery efficiency may vary depending on soil characteristics, environmental conditions, and the nature of the OM.
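The link between recovery rate and correction factor can be illustrated with a short calculation. This is only a sketch: the function names are hypothetical, and it assumes the factor is simply the reciprocal of the fractional recovery.

```python
def correction_factor(recovery_fraction: float) -> float:
    """Multiplier that scales readily oxidizable carbon up to total organic carbon.

    Assumes the factor is the reciprocal of the fractional recovery.
    """
    if not 0.0 < recovery_fraction <= 1.0:
        raise ValueError("recovery_fraction must be in (0, 1]")
    return 1.0 / recovery_fraction


def toc_from_roc(roc: float, factor: float = 1.3) -> float:
    """Estimate TOC (same units as roc) from the oxidizable carbon value."""
    return roc * factor


if __name__ == "__main__":
    # Recoveries of 76% and 77% both give a factor close to 1.3.
    for recovery in (0.76, 0.77):
        print(recovery, round(correction_factor(recovery), 2))
    print(toc_from_roc(2.0))
```

With a recovery of 77%, the reciprocal is about 1.30, which matches the factor adopted in the text.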

2.1.3. Spectroscopy Vis–NIR

The samples underwent a moisture homogenization process: they were dried at 40 °C for 48–96 h, depending on the soil type. Drying followed protocol GA-R-104 [41] and ISO 11464:2006 and was carried out in a forced-air oven at 40 °C until a constant weight was reached (variation < 0.1%), resulting in a soil of uniform color and free of wet aggregates. Drying time ranged from 24 to 96 h, depending on the texture and OM content, while a temperature of 40 °C preserved the organic carbon and prevented thermal alteration. Samples were placed in a 50 mm diameter annular cup and scanned from 400 to 2500 nm using a NIR spectrophotometer (FOSS DS2500) according to the methodology proposed by Delgadillo–Durán et al. [12].

2.2. Methods

Several machine learning techniques were explored and evaluated for predicting SOC. The study considered the following models: RF, SVR, RF–SVR, and RF–XGBoost. These methods are well suited to regression and classification tasks in which data complexity and collinearity between variables are critical. The predictions obtained from the Vis–NIR spectra were compared with reference values obtained in the laboratory using the Walkley and Black method [35] with modifications.
RF uses decision trees and is trained on a random subset of features (wavelengths) and a random set of observations (sample carbon level and its absorbance levels); the decision trees grow until they reach a predetermined number of nodes [42]. It is an ensemble machine learning approach that merges thousands of decision trees, each built by bootstrapping on calibration data [43]. Moreover, the random subspace method—where the number of features considered at each node split is determined by the parameter mtry—is applied at every split in the tree [43]. The final prediction is the average of the predicted values from all the trees. RF generally has a better generalization ability, which is used for regression and classification [43].
SVR is a supervised learning algorithm derived from Support Vector Machines (SVM), explicitly designed for regression tasks. SVR seeks to find a function that approximates the relationship between input variables and a continuous output, within a specified margin of tolerance. It constructs this function by identifying a subset of training data points, support vectors, that are most relevant to defining the position of the hyperplane in a high-dimensional feature space [44].
SVM-based methods perform well when there is a clear margin of separation between classes [28]. Additionally, SVR can model nonlinear relationships. Its robustness to overfitting and ability to handle high-dimensional data make it a valuable tool in various regression applications, including time series forecasting, financial modeling, and environmental data analysis.
Extreme Gradient Boosting (XGBoost) is a very popular ensemble learning technique for structured data, as it can address a wide range of prediction problems [45]. It iteratively builds an ensemble of trees, with each new tree fitted to the residuals of the previous ones so that the model learns from earlier errors. Among gradient boosting variants, XGBoost stands out as one of the most powerful and popular algorithms in machine learning, offering strong performance, scalability, and regularization techniques to avoid overfitting [46]. It provides an efficient implementation of the gradient boosting framework that prioritizes scalability and parallelization. It also includes regularization terms that limit tree complexity, which helps prevent overfitting compared to a single decision tree or a non-optimized boosting model.

2.2.1. Preprocessing

The first step in spectral data processing involved data cleaning to reduce the noise.
Then, the Savitzky–Golay first derivative (SG1.der) was applied. This spectral preprocessing technique removes additive baseline effects present in the raw spectral data and highlights the most informative features of the spectrum. The derivative was computed with a polynomial smoothing filter (in this study, a second-order filter with a seven-point window for the RF–SVR models and an 11-point window for the RF, SVR, and RF–XGBoost models). It aims to improve data quality by reducing noise and emphasizing changes in spectral slope, which facilitates the identification of absorption bands relevant to soil property prediction. The number of points in each window was determined experimentally after testing multiple configurations, so that no more than one inflection is included in the absorbance data within any convolution interval, the smoothing reduces noise, and the second-order approximation polynomial captures the dynamics of the data [47].
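The pretreatment described above can be sketched with the Savitzky–Golay filter from scipy.signal, which the study lists among its libraries. The function name, the 2 nm wavelength grid, and the synthetic spectra are assumptions made for illustration; only the window sizes (7 or 11 points) and the polynomial order (2) come from the text.

```python
import numpy as np
from scipy.signal import savgol_filter


def sg_first_derivative(spectra, window_length=11, polyorder=2, delta=1.0):
    """Savitzky-Golay first derivative along the wavelength axis.

    spectra: array of shape (n_samples, n_wavelengths) with absorbance values.
    delta:   wavelength spacing in nm (assumed uniform).
    """
    spectra = np.asarray(spectra, dtype=float)
    return savgol_filter(spectra, window_length=window_length,
                         polyorder=polyorder, deriv=1, delta=delta, axis=1)


# Synthetic check: two spectra with the same slope but different additive offsets.
wavelengths = np.arange(400, 2500, 2.0)      # hypothetical 2 nm grid
offsets = np.array([[0.1], [0.5]])           # additive baseline shifts
spectra = offsets + 0.001 * wavelengths      # slope of 0.001 absorbance units per nm
d1 = sg_first_derivative(spectra, window_length=11, polyorder=2, delta=2.0)
```

Both derivative spectra come out identical because the constant offsets vanish under differentiation, which is the additive-baseline removal the paragraph refers to.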

2.2.2. Training Models

Python (version 3.10) was used for data analysis. To generate the machine learning models, the following libraries were used: numpy (mathematical calculations), tensorflow (model generation), matplotlib.pyplot (graphics), pandas (reading Excel files), sklearn.model_selection (data splitting), os (reading folders), seaborn (statistical graphics), sklearn.ensemble (random forest regressor), sklearn.svm (support vector regression model), sklearn.metrics (metrics), and scipy.signal (Savitzky–Golay filter).
For RF, the first step was to select the number of trees using a loop, which in this case tested a total of twenty models. The next step was to choose the optimal number of features (i.e., the number of wavelengths considered at each split) and a suitable stopping criterion that determines when each tree stops growing. The stopping criterion used was the maximum number of leaf nodes (max leaf nodes), that is, the maximum number of nodes per decision tree at which a prediction is obtained. Once the optimal parameters were obtained, the models were trained with the selected features, from which the critical and sensitive absorbance bands for predicting SOC were determined.
SVR has three main hyperparameters: the regularization parameter (C), which controls the balance between low training error and good generalization ability; epsilon (ε), which defines a tolerance range around the prediction function within which errors are not penalized; and the kernel, which is the function that transforms the input data into a feature space where regression is easier to perform. To find the best value for each hyperparameter, cross-validation was performed using the GridSearchCV function from the scikit-learn library.
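The tree-count search can be sketched as a loop over candidate forest sizes scored by out-of-bag (OOB) error. This is a minimal sketch rather than the study's script: the function name and the synthetic data are hypothetical, the OOB error is taken here as one minus the OOB R2, and the defaults of 12 features and 25 leaf nodes are borrowed from the values reported later in the article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def select_n_trees_by_oob(X, y, tree_counts, max_features=12,
                          max_leaf_nodes=25, random_state=42):
    """Return (best_n_trees, oob_errors), with OOB error defined as 1 - OOB R^2."""
    oob_errors = {}
    for n_trees in tree_counts:
        model = RandomForestRegressor(
            n_estimators=n_trees, max_features=max_features,
            max_leaf_nodes=max_leaf_nodes, oob_score=True,
            bootstrap=True, random_state=random_state)
        model.fit(X, y)
        oob_errors[n_trees] = 1.0 - model.oob_score_
    best = min(oob_errors, key=oob_errors.get)
    return best, oob_errors


# Synthetic stand-in for the spectral matrix (200 samples, 40 bands).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = 2.0 * X[:, 3] + X[:, 7] + rng.normal(scale=0.1, size=200)

tree_counts = range(20, 401, 20)   # 20 candidate models, from 20 to 400 trees
best_n, errors = select_n_trees_by_oob(X, y, tree_counts)
```

The same loop could be repeated over max_features and max_leaf_nodes once the number of trees is fixed.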
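A grid search over the three SVR hyperparameters might look like the following sketch. The grid values and the synthetic data are hypothetical, since the article does not list the candidate values that were searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Hypothetical search grid; the values actually tested are not given in the text.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [1, 10, 100],
    "epsilon": [0.01, 0.1, 0.5],
}

# Synthetic regression data standing in for selected spectral bands.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.05, size=120)

# Five-fold cross-validation, scored by R^2.
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="r2")
search.fit(X, y)
best_params = search.best_params_
```

After fitting, search.best_estimator_ holds an SVR refitted on all the supplied data with the best combination found.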
For the RF–XGBoost model, spectral bands were selected every 10 nm to reduce dimensionality and noise. A Random Forest Regressor model was then run to calculate the importance of each spectral band, and the 30 most relevant bands were selected as input for the final model. Finally, the modeling and evaluation stages were executed. The data were divided into training (80%) and testing (20%) subsets. This 80/20 partition was chosen to ensure a sufficient amount of data for model training while maintaining an independent portion for unbiased evaluation of predictive performance. This ratio is widely adopted in machine learning applications involving spectral data, as it provides an appropriate balance between model learning capacity and validation reliability [48].
Additionally, a five-fold cross-validation procedure was applied within the training set to optimize hyperparameters and minimize overfitting, ensuring the robustness and generalization of the final RF–XGBoost model.
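The hybrid workflow can be sketched end to end on synthetic data. Several choices here are assumptions rather than the study's implementation: all helper names are hypothetical, SOC is stratified by quartile bins, band ranking is fitted on the training subset only to avoid leakage, and an SVR is used as the second-stage model (the RF–SVR variant) so that the sketch depends only on scikit-learn. An XGBoost regressor could be substituted at the same point, and the five-fold hyperparameter search is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def subsample_bands(X, wavelengths, step_nm=10):
    """Keep one band every step_nm nanometres (integer wavelength grid assumed)."""
    keep = (wavelengths - wavelengths[0]) % step_nm == 0
    return X[:, keep], wavelengths[keep]


def stratified_split(X, y, test_size=0.2, n_bins=4, random_state=42):
    """80/20 split stratified on quantile bins of the target."""
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    strata = np.digitize(y, edges)
    return train_test_split(X, y, test_size=test_size,
                            stratify=strata, random_state=random_state)


def top_k_bands(X_train, y_train, k=30, random_state=42):
    """Column indices of the k bands with the highest RF importance."""
    rf = RandomForestRegressor(n_estimators=220, random_state=random_state)
    rf.fit(X_train, y_train)
    order = np.argsort(rf.feature_importances_)[::-1]
    return np.sort(order[:k])


# Synthetic spectra: only the bands at 600, 900 and 1900 nm carry signal.
rng = np.random.default_rng(42)
wavelengths = np.arange(400, 2500, 2)
X = rng.normal(size=(300, wavelengths.size))
idx = {nm: int(np.where(wavelengths == nm)[0][0]) for nm in (600, 900, 1900)}
y = (1.5 * X[:, idx[600]] - 1.0 * X[:, idx[900]]
     + 0.8 * X[:, idx[1900]] + rng.normal(scale=0.1, size=300))

X10, wl10 = subsample_bands(X, wavelengths, step_nm=10)
X_train, X_test, y_train, y_test = stratified_split(X10, y)
selected = top_k_bands(X_train, y_train, k=30)
selected_nm = wl10[selected]

# Second-stage model trained only on the selected bands (hypothetical settings).
model = SVR(kernel="linear", C=10.0, epsilon=0.1)
model.fit(X_train[:, selected], y_train)
test_r2 = r2_score(y_test, model.predict(X_test[:, selected]))
```

Ranking the bands after the split keeps the test subset untouched by feature selection, which is one way to preserve the unbiased evaluation the 80/20 partition is meant to provide.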

3. Results and Discussion

3.1. Descriptive Statistics of Selected Soil Properties

Table 1 summarizes the descriptive statistics (mean, error, standard deviation, coefficient of variation, minimum, and maximum) of the 860 soil samples analyzed in the Sierra Nevada de Santa Marta. SOC exhibited high variability, with a minimum of 0.11 and a maximum of 5.35 g 100 g−1. Notably, 57% of the samples fell within the range of 1 to 3 g 100 g−1, distributed across the four sampled municipalities (Figure 1). This indicates the high variability (CV: 41.61%) of the SOC associated with agroforestry systems featuring cacao in the Sierra Nevada de Santa Marta. This is probably due, firstly, to the edaphoclimatic conditions of the sampled areas, which are located from sea level to 2000 masl; secondly, to the differences between the forest and fruit species that make up the agroforestry arrangement on the farms; and finally, to the density or number of plants in the forest arrangement. High variability in soil organic carbon was also present within farms, with the highest variability occurring on the No Hay Como Tu farms (47.23%) and Los Acacios farms (46.43%). In comparison, the lowest variability was found on the Billa Brisa farms (3.62%) and Los Naranjos farms (5.35%).
The variability in SOC at the farm level is attributed to farm characteristics and agronomic management practices. Regarding farm characteristics, a higher soil organic carbon content is found towards the lower part of the mountain, resulting from particle transport by rainfall and from management practices associated with uneven agroforestry arrangements. In most cases, the uneven distribution of forest trees on the farm leads to greater litter biomass accumulation and greater soil macrofauna activity in some areas of the farm (Figure 1).

3.2. Data Processing

Figure 2 and Figure 3 show the absorbance data after applying the first derivative pretreatment. This pretreatment technique eliminated additive or multiplicative effects between spectra [48], which had a positive effect on the spectral characteristics compared to the spectra obtained in absorbance mode [log (1/R)] and on the performance of the RF–SVR and RF–XGBoost models for predicting SOC. This technique also offered the advantage of showing the variation in sample reflectance in relation to the variation in wavelength, revealing where the most marked changes between neighboring wavelengths occurred and exposing noise that could otherwise be interpreted as signal [49]. This principle is based on the change in reflectance (ρλ) as a function of wavelength λ at a given point i. The derivative is numerically approximated using symmetric or central differences, as expressed in the equation presented by Rudorff et al. [50].
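The equation itself is not reproduced in this section. A standard central-difference approximation, written here with assumed notation rather than quoted from [50], takes the form:

```latex
\left.\frac{d\rho}{d\lambda}\right|_{i} \approx \frac{\rho(\lambda_{i+1}) - \rho(\lambda_{i-1})}{\lambda_{i+1} - \lambda_{i-1}}
```

Here ρ(λ_i) is the reflectance at the i-th wavelength, and the derivative at point i is estimated from its two neighbors.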
Figure 4 illustrates that spectral absorbance varied significantly with SOC content. The difference in absorption becomes more pronounced at higher SOC contents in the Vis region compared to the NIR region [51]. Stenberg et al. [52] also observed a decrease in the absorption rates of organic soils in the NIR region. The first derivative of the absorption spectrum further revealed a strong correlation between SOC content and absorption in the Vis region, highlighting subtle spectral variations. The absorption peaks around the wavelengths of 600, 900, and 1900 nm (Window = 7, Order = 2) in the RF–SVR model and the wavelengths of 720, 736, and 1920 nm (Window = 11, Order = 2) in the RF–XGBoost appear to be associated with SOC content (Figure 2 and Figure 3). The 7-point window used in the RF–SVR model is more sensitive to rapid variations but can introduce more variability into the data, affecting the model’s accuracy, while the 11-point window is more suitable for the RF–XGBoost model because it smooths noise better without losing relevant signals.
In addition to the Savitzky–Golay first-derivative smoothing, other commonly used scatter-correction techniques, such as Multiplicative Scatter Correction (MSC) and Standard Normal Variate (SNV), were also tested for comparison.
Although MSC and SNV effectively reduce multiplicative and additive scattering effects among spectra, the Savitzky–Golay first derivative showed a greater ability to enhance subtle spectral variations and reveal absorption features associated with SOC.
Therefore, the first-derivative Savitzky–Golay preprocessing was selected as the main spectral pretreatment method, as it improved the resolution of narrow absorption peaks around 600, 900, and 1900 nm, which are strongly correlated with SOC content, resulting in superior predictive performance of the RF–SVR and RF–XGBoost models.
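For reference, the two scatter corrections that were compared can be sketched in a few lines of NumPy. These are common textbook formulations and are assumed here; the article does not state which implementation was used, and the function names are hypothetical.

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum by its own mean and SD."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row ~ intercept + slope * ref, then undo the offset and the scaling.
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected


# Synthetic check: one underlying spectrum with different gains and offsets.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
spectra = np.vstack([1.0 * base, 1.5 * base + 0.2, 0.8 * base - 0.1])
snv_out = snv(spectra)
msc_out = msc(spectra)
```

In this synthetic case both corrections collapse the three spectra onto a single curve, because the spectra differ only by gain and offset.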
Moura-Bueno et al. [53] compared the predictive performance of six preprocessing techniques and four calibration models on the SOC content of 841 samples from subtropical soil. The best preprocessing technique was the first derivative SG smoothing, and the best calibration model was PLSR.
Other preprocessing techniques have also proven effective for predicting soil indicators, including moving averages, clustering, Savitzky–Golay smoothing, normalization, continuum removal, and gap derivatives [52,54,55]. For example, Savitzky–Golay is a smoothing function that reduces noise by using a weighted sum of neighboring values [52,56]. Spectral data have also been clustered using unsupervised learning (UL) techniques, namely principal component analysis (PCA) and the Fuzzy C-means (FCM) method. FCM clustering improved the accuracy of the PLSR model, as measured by the prediction interval (PI), but had little impact on the accuracy of SOC predictions, as assessed using the residual prediction deviation (RPD).
The transformed Vis–NIR spectra of the soil samples, recorded in absorbance mode as [log (1/R)], have a typical shape similar to that presented in other soils. SOC and moisture in the sample are the primary constituents that determine the shape of the spectra. In the visible region of the spectrum (400 to 780 nm), absorption is attributed to soil chromophores such as iron oxides and the dark color caused by OM. In the near-infrared region (780 to 2500 nm), absorption is attributed to water molecules, hydroxyl groups, and clay minerals [57].

3.3. Selecting the Number of Trees and Characteristics per Node in a Random Forest Model

In the loop, 20 models were tested, increasing the number of trees in steps of 20 up to a total of 400. This allowed model performance to be evaluated over a range of forest sizes, and the model with the lowest “out of bag” error was selected, which in this case had 220 trees. The “out of bag” error is estimated from the training samples left out of each bootstrap; the lower its value, the more stable the corresponding random forest model. It also shows that RF is robust to overfitting [58]. Different combinations of the number of regression trees and the number of features per split were then evaluated. In this step, the number of features per node and the maximum number of leaf nodes were selected based on the “out of bag” error, which for this test gave 12 features per node and a maximum of 25 leaf nodes per tree (Figure 5).
The RF algorithm enables predictions from multiple decision trees to be compiled into a final forecast, demonstrating an improved ability to navigate nonlinear interactions between variables and mitigate overfitting. Consequently, it has been widely used in hyperspectral soil research [59].
This has been widely documented; for example, Meng et al. [60] employed a stratification strategy and the RF algorithm to enhance the predictive accuracy of SOC content, reducing RMSE by 4.88 g kg−1 and increasing R2 by 0.32. However, the RF algorithm exhibits a “step-like” behavior in predicting continuous variables, which can reduce its effectiveness in analyzing hyperspectral data. For the presented case, the RF implementation improved the predictive accuracy compared to other evaluated alternatives, such as the LSTM-trained model, which achieved an R2 of 0.52, and the CNN-trained model, with an R2 of 0.72. This corresponds to increases in R2 of 0.33 and 0.13, respectively.

3.4. Model Selection

Once the optimal parameters of the model were obtained, the model was trained with the aforementioned features. In the importance plots, wavelength is shown on the x-axis and importance on the y-axis. For the data split obtained with a random state of 42, absorbance has a greater influence on SOC around 600 nm, 900 nm, and 1900 nm in the RF–SVR model and at 720, 736, and 1920 nm in the RF–XGBoost model (Figure 4 and Figure 6). Bands with greater importance are given greater weight in the decisions of the model trees. This enables the identification of key spectral regions associated with carbon content, which can be used to design more efficient sensors, conduct more focused spectral analyses, or interpret chemical or biological processes related to those wavelengths.
It is worth noting that the most important bands in the RF algorithm are identified based on the Mean Decrease in Impurity (MDI), also known as Feature Importance. This criterion makes it possible to determine which spectral bands contribute most significantly to improving prediction accuracy by being used as splitting points in the decision trees. Specifically, the cumulative reduction in variance was calculated each time a band was used to split a node (in a regression context). Bands with the highest cumulative reduction were considered the most relevant.
Once these key bands were identified, the SVR and XGBoost models were trained using only the selected bands. This approach enhanced computational efficiency and helped to reduce overfitting, while maintaining high predictive performance.
This approach extracts the portion of the spectrum that carries useful information from the complete spectral data, which eliminates noise, enhances computational efficiency, and yields a concise and stable prediction model. Similar findings have been documented over the last decade in other research; for example, Stenberg [61] reported that SOC can affect spectral data ranging from 700 nm to 2450 nm. Tahmasbian et al. [62] also reported that the wavelengths important for total carbon (TC) prediction were primarily observed in the 740–800 and 900–1000 nm spectral regions. Also, Delgadillo–Durán et al. [13] found that the key components for organic matter were around 400 nm and 1930 nm. In addition, there are several specific absorption features around the 950 nm wavelength and between 2300 and 2400 nm. According to Milos and Bensa [63], the most important wavelengths for clustering were in the following ranges: 1980–2025 nm, 1900, 735–775, 490–530, 1200, 2100, 2200, 2130, 2400, 485, and 440 nm.
A data cleaning process was conducted to eliminate outliers, aiming to enhance the correlation between the features and the predicted variable. To achieve this, the box-and-whisker (boxplot) technique was employed, which reduced the dataset from 860 to 842 samples, retaining only those observations that represented the overall behavior. The proposed RF model consisted of 220 trees, 12 features per node, a maximum of 25 leaf nodes, and a random state of 42, which was set to ensure the reproducibility of the results and to fix the data distribution during model training. Subsequently, through cross-validation using GridSearchCV, the optimal combination of hyperparameters for the SVR model was identified. The best performance was obtained using a linear kernel, with a C value of 1000 and an epsilon value of 0.151, a configuration that improved the model's generalization capability. The best performance of the XGBoost model for predicting SOC was achieved using a specific combination of hyperparameters that effectively captured the nonlinear relationships between spectral bands and the target variable. A total of 200 decision trees (n_estimators = 200) were used, providing a balance between model complexity and stability, while a maximum tree depth of 5 (max_depth = 5) helped prevent overfitting and ensured good generalization. The learning rate (learning_rate = 0.1) allowed the model to make moderate optimization steps, avoiding oscillations or premature convergence. Additionally, a random seed (random_state = 42) was set to ensure result reproducibility. As a result, the RF–SVR model, based on the sample absorbance information, achieved a test R2 of 0.85, while the RF–XGBoost model achieved an R2 of 0.866 and an RMSE of 0.090 (Figure 7 and Figure 8). The figures illustrate the relationship between the predicted and actual values of SOC content obtained using the hybrid models (RF–SVR and RF–XGBoost).
Each point represents a sample, and the proximity of the points to the 1:1 reference line (dashed red) indicates the accuracy of the model’s predictions. The model demonstrates a strong predictive ability, suggesting a high level of agreement between observed and estimated values. This result confirms the model’s reliability for estimating SOC based on selected spectral features, preprocessed using first derivative transformations and optimized through feature selection techniques. In contrast, the standalone models, RF and SVR, individually showed lower predictive performance, with R2 values of 0.69 and 0.66, respectively (Table 2). This highlights the importance of the combined modeling approach and the selection of relevant spectral bands to improve prediction accuracy for estimating SOC.
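The boxplot-based outlier screening mentioned in this section can be illustrated with the conventional Tukey rule, which flags values beyond 1.5 times the interquartile range from the quartiles. The sketch below rests on assumptions: the text does not state the whisker multiplier, the quantile method, or the variable the rule was applied to, so `k=1.5`, the "inclusive" quantile method, and the toy SOC values are illustrative choices only.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Boxplot whisker limits: (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the whisker limits."""
    low, high = tukey_fences(values, k)
    return [v for v in values if low <= v <= high]


if __name__ == "__main__":
    # Toy SOC values in g 100 g-1 (assumption, not study data).
    soc = [1.2, 1.5, 1.6, 1.7, 1.9, 2.0, 2.1, 9.0]
    print(drop_outliers(soc))
```

In a spectral workflow, the indices of the retained samples would also be used to filter the matching spectra so that features and targets stay aligned.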
The RF–SVR and RF–XGBoost models offer an alternative for SOC estimation. Although their uncertainty is greater than that of traditional laboratory analyses, they have advantages in cost, time, and waste generation [24]. The results confirm the predictive potential of Vis–NIR spectra for SOC [64], which has a direct spectral response, consistent with the good indices obtained for Vis–NIR estimation [65].
This study compared and analyzed the performance of the RF–SVR and RF–XGBoost models in predicting SOC. The results indicate that both hybrid models outperform other models applied to extensive soil spectral databases for SOC estimation. The RF–SVR model achieved an R2 of 0.85 and an RMSE of 0.30 g/kg, and the RF–XGBoost model achieved an R2 of 0.86 and an RMSE of 0.09. Compared with other regression models, RF–SVR and RF–XGBoost offer wider applicability and superior performance, enabling SOC estimation across important soil spectral libraries [66]. This improvement suggests that gradient-boosting techniques benefit substantially from reduced input dimensionality and the elimination of noisy or redundant variables. Recent studies have shown that applying feature-selection algorithms before model training enhances predictive performance by reducing multicollinearity and retaining only the most informative spectral features [67,68]. Similarly, the superior performance of gradient-boosting models such as XGBoost in SOC estimation has been consistently reported, with higher accuracy and robustness than traditional machine learning approaches [69]. The increase in RPD values for the hybrid models also indicates greater robustness and practical applicability for SOC assessment. Overall, these findings show that model stacking and feature optimization not only outperform traditional single-model approaches but also provide a more reliable framework for handling complex spectral datasets in SOC estimation.
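For readers who want to reproduce the three evaluation metrics discussed here, the following Python sketch computes R2, RMSE, and the residual prediction deviation (RPD) from paired observed and predicted values. It assumes the common definitions (RPD taken as the sample standard deviation of the observed values divided by the RMSE); the exact formulas used in the study are not stated, and the function name and toy values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, and RPD for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    # Sample standard deviation of the observed values (n - 1 denominator).
    sd_obs = math.sqrt(ss_tot / (n - 1))
    return {"r2": 1 - ss_res / ss_tot, "rmse": rmse, "rpd": sd_obs / rmse}


if __name__ == "__main__":
    # Toy values (assumption, not study data).
    obs = [1.0, 2.0, 3.0, 4.0]
    pred = [1.1, 1.9, 3.2, 3.8]
    print(regression_metrics(obs, pred))
```

Because RMSE carries the units of the target variable, it is only comparable across studies that report SOC in the same units, whereas R2 and RPD are unitless.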
Applying Savitzky–Golay first-derivative preprocessing markedly improved the performance metrics of all evaluated models. The RF–SVR and RF–XGBoost models also predicted carbon levels from absorbance data better than the Ridge Regression model proposed in “Estimation of SOC in arid agricultural fields based on hyperspectral satellite images” [70], which used hyperspectral satellite images and obtained an R2 of 0.6658 and an RMSE of 0.1206, and better than the RF model of “Enhancing SOC estimation accuracy: Integrating spatial vegetation dynamics and temporal analysis with Sentinel 2 imagery”, which obtained an R2 of 0.8 and an RMSE of 0.33.
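As an illustration of the Savitzky–Golay first derivative referred to above, the sketch below applies the closed-form weights for a symmetric window with a quadratic fit, where the weight of the point at offset i is i divided by the sum of squared offsets. The default half-window of 3 corresponds to the 7-point window and order 2 given in the Figure 2 caption; the function name, the handling of edge points (left as `None`), and the toy signal are assumptions, not the authors' implementation.

```python
def savgol_first_derivative(y, half_window=3, step=1.0):
    """First derivative from a Savitzky-Golay filter with a quadratic fit.

    Weights are i / sum(i^2) for offsets i = -m..m, scaled by the sampling
    step. Points without a full window on both sides are returned as None.
    """
    m = half_window
    norm = sum(i * i for i in range(-m, m + 1)) * step
    out = [None] * len(y)
    for c in range(m, len(y) - m):
        out[c] = sum(i * y[c + i] for i in range(-m, m + 1)) / norm
    return out


if __name__ == "__main__":
    # Toy signal y = x^2 sampled at unit steps; the exact derivative is 2x.
    signal = [x * x for x in range(10)]
    print(savgol_first_derivative(signal))
```

For spectra sampled every 10 nm, passing `step=10` expresses the derivative per nanometre; a half-window of 5 would give the 11-point window listed in the Figure 3 caption.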

4. Conclusions

The integration of spectral preprocessing, variable selection, and hybrid machine learning models (RF–SVR and RF–XGBoost) enabled highly accurate predictions of soil organic carbon (SOC). In particular, the application of the Savitzky–Golay first derivative proved critical for enhancing spectral features and improving model performance, highlighting the importance of appropriate preprocessing in Vis–NIR-based SOC estimation.
The sequential RF–SVR and RF–XGBoost models outperformed individual models, indicating that carbon prediction based on absorbance benefits from tools capable of handling nonlinear relationships and reducing noise. Spectral band selection with RF increased efficiency by reducing dimensionality and retaining only the most relevant variables.
The results confirm that Vis–NIR spectral information is a reliable predictor of SOC and provides a fast, economical, and non-destructive alternative to traditional laboratory analyses, although it still does not match their precision. Therefore, spectroscopy should be considered a complementary technique to support soil assessment.
Field limitations include extreme climatic variability and fertilizer use, which may alter absorbance. We recommend exploring additional deep learning models, new variable combinations, and more complex architectures. The selected spectral bands may be used to design more efficient sensors and to enhance soil monitoring programs and carbon sequestration studies.

Author Contributions

Conceptualization, methodology, formal analysis, writing/original draft preparation, writing/reviewing and final editing, visualization, M.J.Y.H. and A.C.M.; investigation, review and editing E.F.C.; resources and supervision L.F.P.-C. and O.C.L.; methodology, writing—review and editing, W.I.A., O.d.J.C., D.C.S., C.V.G., J.C.Q. and C.P.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Colombian Agricultural Research Corporation (AGROSAVIA), the Ministry of Sciences, Ecopetrol, and the Colombian Ministry of Mines and Environment, under Agreement No. 2147 (Project: Sustainable Landscape Approach to the Production of Premium “Sierra Nevada” Cocoa in PDET Municipalities in the Departments of Magdalena and La Guajira, ID 1002115).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from the first and corresponding authors upon request.

Acknowledgments

We would like to thank the producers of the Sierra Nevada de Santa Marta, the Colombian Agricultural Research Corporation (AGROSAVIA), the Ministry of Sciences, Ecopetrol, and the Colombian Ministry of Mines and Environment for their support in developing this research.

Conflicts of Interest

Authors Marlon Jose Yacomelo Hernández, Andrea Constanza Montenegro, Cesar Vargas García and Elias David Flórez Cordero were employed by the company Colombian Corporation for Agricultural Research—AGROSAVIA. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FCM: Fuzzy C-means
FTIR: Fourier Transform Infrared Spectroscopy
LIBS: Laser-Induced Breakdown Spectroscopy
MIR: Mid-Infrared
ML: Machine Learning
MSC: Multiplicative Scatter Correction
NIR: Near-Infrared
OM: Organic Matter
PCA: Principal Component Analysis
PCR: Principal Component Regression
PI: Prediction Interval
PLSR: Partial Least Squares Regression
RF: Random Forest
RF–SVR: Random Forest–Support Vector Regression
RF–XGBoost: Random Forest–Extreme Gradient Boosting
RMSE: Root Mean Square Error
ROC: Readily Oxidizable Carbon
RPD: Residual Prediction Deviation
R2: Coefficient of Determination
SAFC: Agroforestry–Cacao Systems
SG: Savitzky–Golay
SOC: Soil Organic Carbon
SVM: Support Vector Machine
SVR: Support Vector Regression
TOC: Total Organic Carbon
UL: Unsupervised Learning
Vis–NIR: Visible–Near-Infrared Spectroscopy
XGBoost: Extreme Gradient Boosting

References

  1. Gerke, J. The central role of soil organic matter in soil fertility and carbon storage. Soil Syst. 2022, 6, 33.
  2. Zhao, J.; He, Y.; Cheng, H.; Liang, Z. Organic carbon accumulation and aggregate formation in intensively managed soils. Sci. Rep. 2023, 13, 4175.
  3. Xiao, H.; Li, Z.; Chang, X.; Huang, B.; Nie, X.; Liu, C.; Liu, L.; Wang, D.; Jiang, J. The mineralization and sequestration of organic carbon in relation to agricultural soil erosion. Geoderma 2018, 329, 73–81.
  4. Houba, V.J.G.; Novozamsky, I.; van der Lee, J.J. Quality aspects in laboratories for soil and plant analysis. Commun. Soil Sci. Plant Anal. 1996, 27, 327–348.
  5. Godwin, R.J.; Miller, P.C.H. A review of the technologies for mapping within-field variability. Biosyst. Eng. 2003, 84, 393–407.
  6. Doetterl, S.; Six, J.; Van Wesemael, B.; Van Oost, K. Carbon cycling in eroding landscapes: Geomorphic controls on soil organic C pool composition and stability. Glob. Change Biol. 2013, 18, 2218–2232.
  7. Eslamifar, M.; Tavakoli, H.; Thiessen, E.; Kock, R.; Correa, J.; Hartung, E. Effective spectral pre-processing methods enhance accuracy of soil property prediction by NIR spectroscopy. Discov. Appl. Sci. 2025, 7, 896.
  8. Ziyi, K.E.; Shilin, R.E.; Liang, Y.I. Advancing soil property prediction with encoder-decoder structures integrating traditional deep learning methods in Vis-NIR spectroscopy. Geoderma 2024, 449, 117006.
  9. Wangeci, A.; Adén, D.; Nikolajsen, T.; Greve, M.H.; Knadel, M. Combining laser-induced breakdown spectroscopy and visible near-infrared spectroscopy for predicting soil organic carbon and texture: A Danish national-scale study. Sensors 2024, 24, 4464.
  10. Oliveira, R.; Pereira, S.; Araújo de França, C.; dos Santos, D.; Feliciano, R.; Menezes, R.; Simas, E.; Pereira, R. On the feasibility of Vis–NIR spectroscopy and machine learning for real-time SARS-CoV-2 detection. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2024, 308, 123735.
  11. Chinilin, A.V.; Vindeker, G.V.; Savin, I.Y. Vis-NIR spectroscopy for soil organic carbon assessment: A meta-analysis. Eurasian Soil Sci. 2023, 56, 1605–1617.
  12. Silvero, N.E.; Demattê, J.A.; Minasny, B.; Rosin, N.A.; Nascimento, J.G.; Albarracín, H.S.R.; Gómez, A.M. Sensing technologies for characterizing and monitoring soil functions: A review. Adv. Agron. 2023, 177, 125–168.
  13. Delgadillo-Duran, D.; Vargas-García, C.; Varón-Ramírez, V.; Calderón, F.; Montenegro, A.; Reyes-Herrera, P. Vis-NIR spectroscopy and machine learning methods to diagnose chemical properties in Colombian sugarcane soils. Geoderma Reg. 2022, 31, e00588.
  14. Padarian, J.; Minasny, B.; McBratney, A.B. Using deep learning to predict soil properties from regional spectral data. Geoderma Reg. 2019, 16, e00198.
  15. Adab, H.; Morbidelli, R.; Saltalippi, C.; Moradian, M.; Ghalhari, G.A.F. Machine Learning to Estimate Surface Soil Moisture from Remote Sensing Data. Water 2020, 12, 3223.
  16. Shin, S.K.; Lee, S.J.; Park, J.H. Prediction of soil properties using vis-nir spectroscopy combined with machine learning: A review. Sensors 2025, 25, 5045.
  17. Vohland, M.; Ludwig, B. Using variable selection and wavelets to exploit the full potential of visible–near infrared spectra for predicting soil properties. J. Near Infrared Spectrosc. 2016, 24, 255–269.
  18. Yang, M.; Chen, S.; Li, H.; Zhao, X.; Shi, Z. Effectiveness of different approaches for in situ measurements of organic carbon using visible and near-infrared spectrometry in the Poyang Lake Basin area. Land Degrad. Dev. 2021, 32, 1301–1311.
  19. Dhawale, A.K.; Wolff, S.B.E.; Ko, R.; Ölveczky, B.P. The basal ganglia control the detailed kinematics of learned motor skills. Nat. Neurosci. 2021, 24, 1256–1269.
  20. Li, S.; Shen, X.; Shen, X.; Cheng, J.; Xu, D.; Makar, R.S.; Guo, Y.; Hu, B.; Chen, S.; Hong, Y.; et al. Improving the Accuracy of Soil Classification by Using Vis–NIR, MIR, and Their Spectra Fusion. Remote Sens. 2025, 17, 1524.
  21. Mahajan, G.R.; Das, B.; Gaikwad, B.; Murgaonkar, D.; Desai, A.; Morajkar, S.; Patel, K.; Kulkarni, R. Monitoring properties of the salt-affected soils by multivariate analysis of the visible and near-infrared hyperspectral data. Catena 2021, 198, 105041.
  22. Ribeiro, S.G.; Teixeira, A.D.S.; de Oliveira, M.R.R.; Costa, M.C.G.; Araújo, I.C.D.S.; Moreira, L.C.J.; Lopes, F.B. Soil organic carbon content prediction using soil-reflected spectra: A comparison of two regression methods. Remote Sens. 2021, 13, 4752.
  23. Reda, R.; Saffaj, T.; Ilham, B.; Saidi, O.; Issam, K.; Brahim, L.; El Hadrami, E.M. A comparative study between a new method and other machine learning algorithms for soil organic carbon and total nitrogen prediction using near-infrared spectroscopy. Chemom. Intell. Lab. Syst. 2019, 195, 103873.
  24. Yu, B.; Yan, C.; Yuan, J.; Ding, N.; Chen, Z. Prediction of soil properties based on characteristic wavelengths with optimal spectral resolution by using Vis-NIR spectroscopy. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2023, 293, 122452.
  25. De Santana, F.; de Souza, A.; Poppi, R. Green methodology for soil organic matter analysis using a national near infrared spectral library in tandem with learning machine. Sci. Total Environ. 2019, 658, 895–900.
  26. Gholizadeh, A.; Saberioon, M.; Borůvka, L.; Vašát, R. Soil organic carbon and texture prediction using visible and near-infrared spectroscopy and support vector machine regression: A case study from Czech agricultural soils. Remote Sens. 2021, 13, 424.
  27. Nawar, S.; Mouazen, A.M. On-line vis-NIR spectroscopy prediction of soil organic carbon using machine learning. Soil Tillage Res. 2019, 190, 120–127.
  28. Morellos, A.; Pantazi, X.-E.; Moshou, D.; Alexandridis, T.; Whetton, R.; Tziotzios, G.; Wiebensohn, J.; Bill, R.; Mouazen, A.M. Machine learning-based prediction of soil total nitrogen, organic carbon and moisture content using VIS-NIR spectroscopy. Biosyst. Eng. 2016, 152, 104–116.
  29. Nawar, S.; Mouazen, A.M. Comparison between Random Forests, Artificial Neural Networks and Gradient Boosted Machines Methods of On-Line Vis-NIR Spectroscopy Measurements of Soil Total Nitrogen and Total Carbon. Sensors 2017, 17, 2428.
  30. Yang, M.; Xu, D.; Chen, S.; Li, H.; Shi, Z. Evaluation of machine learning approaches to predict soil organic matter and pH using vis-NIR spectra. Sensors 2019, 19, 263.
  31. Instituto Geográfico Agustín Codazzi (IGAC). Estudio General de Suelos y Zonificación de Tierras: Departamento de Magdalena, Escala 1:100000; Imprenta Nacional de Colombia: Bogotá, Colombia, 2009. Available online: https://catalogo.fedepalma.org/cgi-bin/koha/opac-detail.pl?biblionumber=27933 (accessed on 5 October 2025).
  32. Six, J.; Conant, R.T.; Paul, E.A.; Paustian, K. Stabilization mechanisms of soil organic matter: Implications for soil C-saturation. Plant Soil 2002, 241, 155–176.
  33. Don, A.; Schumacher, J.; Freibauer, A. Impact of tropical land use change on soil organic carbon stocks—A meta-analysis. Glob. Change Biol. 2011, 17, 1658–1670.
  34. Lal, R.; Roose, E.; Feller, C. Soil Erosion and Carbon Dynamics; Roose, E.E.J., Meybeck, M., Lal, R., Feller, C., Eds.; CRC Press: Boca Raton, FL, USA, 2002.
  35. Walkley, A.; Black, I.A. An examination of the Degtjareff method for determining soil organic matter, and a proposed modification of the chromic acid titration method. Soil Sci. 1934, 37, 29–38.
  36. Walkley, A. A critical examination of a rapid method for determining organic carbon in soils: Effect of variations in digestion conditions and of inorganic soil constituents. Soil Sci. 1947, 63, 251–263.
  37. Vlček, V.; Juřička, D.; Valtera, M.; Dvořáčková, H.; Štulc, V.; Bednaříková, M.; Šimečková, J.; Váczi, P.; Pohanka, M.; Kapler, P.; et al. Soil organic matter interactions along the elevation gradient of the Monte Baldo (Eastern Italian Alps). Soil 2024, 10, 813.
  38. Nelson, D.W.; Sommers, L.E. Total carbon, organic carbon, and organic matter. In Methods of Soil Analysis. Part 3—Chemical Methods; Sparks, D.L., Ed.; Soil Science Society of America: Madison, WI, USA, 1996; pp. 961–1010.
  39. Schumacher, B.A. Methods for the Determination of Total Organic Carbon (TOC) in Soils and Sediments; EPA/600/R-02/069; U.S. Environmental Protection Agency: Washington, DC, USA, 2002.
  40. FAO. Walkley–Black Method for Soil Organic Carbon Determination: Training Material; Global Soil Laboratory Network (GLOSOLAN): Rome, Italy, 2020.
  41. AGROSAVIA. Corporate Agenda Management—Sample Pretreatment for Soil Analysis (Code: GA-R-104); AGROSAVIA: Tolima, Colombia, 2024.
  42. Canero, F.M.; Rodríguez Galiano, V.; Aragón, D. Machine learning and feature selection for soil spectroscopy: An evaluation of Random Forest wrappers to predict soil organic matter, clay, and carbonates. Heliyon 2024, 10, e30228.
  43. Ho, V.H.; Morita, H.; Bachofer, F.; Ho, T.H. Random forest regression kriging modeling for soil organic carbon density estimation using multi-source environmental data in central Vietnamese forests. Model. Earth Syst. Environ. 2024, 10, 7137–7158.
  44. Zhao, M.; Arshad, M.; Wang, J.; Triantafilis, J. Soil exchangeable cations estimation using Vis-NIR spectroscopy in different depths: Effects of multiple calibration models and spiking. Comput. Electron. Agric. 2021, 182, 105990.
  45. Vilash, P.; Hameed, M.A. Improvement of Extreme Gradient Boosting (XGBoost) algorithm using neural decision trees on miscarriage dataset. In Proceedings of the International Conference on Innovative Computing & Communication (ICICC 2024), New Delhi, India, 16–17 February 2025.
  46. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  47. Savitzky, A.; Golay, M.J.E. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 1964, 36, 1627–1639.
  48. Stevens, A.; Ramirez-Lopez, L. An Introduction to the Prospectr Package. 2015. Available online: https://mran.microsoft.com/snapshot/2017-08-06/web/packages/prospectr/vignettes/prospectr-intro.pdf (accessed on 10 September 2025).
  49. O’Haver, T.C. An introduction to signal processing in chemical measurement. J. Chem. Educ. 1991, 68, A147.
  50. Rudorff, C.M.; Novo, E.M.L.M.; Galvão, L.S.; Pereira Filho, W. Análise derivativa de dados hiperespectrais medidos em nível de campo e orbital para caracterizar a composição de águas opticamente complexas na Amazônia. Acta Amaz. 2007, 37, 269–280.
  51. Nocita, M.; Stevens, A.; Toth, G. Prediction of soil organic carbon content by diffuse reflectance spectroscopy using a local partial least square regression approach. Soil Biol. Biochem. 2014, 68, 337–347.
  52. Stenberg, B. Effects of soil sample pretreatments and standardised rewetting as interacted with sand classes on Vis-NIR predictions of clay and soil organic carbon. Geoderma 2010, 158, 15–22.
  53. Moura-Bueno, J.M.; Dalmolin, R.S.D.; ten Caten, A.; Oliveira, J.P.; Barros, F.M. Stratification of a local VIS-NIR-SWIR spectral library by homogeneity criteria yields more accurate soil organic carbon predictions. Geoderma 2019, 337, 565–581.
  54. Barra, I.; Haefele, S.M.; Sakrabani, R.; Kebede, F. Soil spectroscopy with the use of chemometrics, machine learning and pre-processing techniques in soil diagnosis: Recent advances—A review. TrAC Trends Anal. Chem. 2020, 135, 116166.
  55. Dotto, A.C.; Dalmolin, R.S.D.; ten Caten, A.; Grunwald, S. A systematic study on the application of scatter-corrective and spectral-derivative preprocessing for multivariate prediction of soil organic carbon by Vis-NIR spectra. Geoderma 2018, 314, 262–274.
  56. Milos, M.; Bensa, A. Organic carbon estimation in a regional soil Vis-NIR database supported by unsupervised learning and chemometrics techniques. Soil Adv. 2024, 2, 100013.
  57. FAO (Food and Agriculture Organization of the United Nations). Soil Spectroscopy: A Practical Guide for Measuring Soil Properties Using Reflectance Spectroscopy; FAO: Rome, Italy, 2022.
  58. Abbasizadeh, H.; Maca, P.; Hanel, M.; Troldborg, M.; AghaKouchak, A. Can causal discovery lead to a more robust prediction model for runoff signatures? Hydrol. Earth Syst. Sci. 2025, 29, 4761–4790.
  59. Hong, Y.; Chen, S.; Chen, Y.; Linderman, M.; Mouazen, A.M.; Liu, Y.; Guo, L.; Yu, L.; Liu, Y.; Cheng, H.; et al. Comparing laboratory and airborne hyperspectral data for the estimation and mapping of topsoil organic carbon: Feature selection coupled with random forest. Soil Tillage Res. 2020, 199, 104589.
  60. Meng, X.; Bao, Y.; Zhang, X.; Zhao, Z. Prediction of soil organic matter using different soil classification hierarchical level stratification strategies and spectral characteristic parameters. Geoderma 2022, 411, 115696.
  61. Stenberg, B.; Viscarra Rossel, R.A. Visible and near-infrared spectroscopy in soil science. In Advances in Agronomy; Donald, L., Ed.; Elsevier: Amsterdam, The Netherlands, 2010.
  62. Tahmasbian, I.; Xu, Z.; Boyd, S.; Zhou, J.; Esmaeilani, R.; Che, R.; Hosseini, B.S. Laboratory-based hyperspectral image analysis for predicting soil carbon, nitrogen and their isotope compositions. Geoderma 2018, 330, 254–263.
  63. Miloš, B.; Bensa, A. Prediction of soil organic carbon using VIS-NIR spectroscopy: Application to Red Mediterranean soils from Croatia. Eurasian J. Soil Sci. 2017, 6, 365–373.
  64. Fernandes, M.M.H.; Coelho, A.P.; Fernandes, C.; da Silva, M.F.; Dela Marta, C.C. Estimation of soil organic matter content by modeling with artificial neural networks. Geoderma 2019, 350, 46–51.
  65. Ortiz, D.; de Dios Herrero, J.M.; Kloster, N. Uso de la espectroscopia visible e infrarrojo cercano para estimar propiedades de suelo en argentina. Cienc. Suelo 2024, 42, 1–13.
  66. Liu, B.; Guo, B.; Zhuo, R.; Dai, F. Estimation of soil organic carbon in LUCAS soil database using Vis-NIR spectroscopy based on hybrid kernel Gaussian process regression. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2024, 310, 124687.
  67. Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front. Bioinform. 2022, 2, 927312.
  68. Yang, J.; Li, X.; Ma, X. Improving the Accuracy of Soil Organic Carbon Estimation: CWT-Random Frog-XGBoost as a prerequisite technique for in situ hyperspectral analysis. Remote Sens. 2023, 15, 5294.
  69. Qu, K.; Nie, L.; Cui, L.; Li, H.; Xiong, M.; Zhai, X.; Zhao, X.; Wang, J.; Lei, Y.; Li, W. Vis–NIR Spectroscopy Characteristics of Wetland Soils with Different Water Contents and Machine Learning Models for Carbon and Nitrogen Content. Ecologies 2025, 6, 75.
  70. Alsaleh, A.R.S.; Alcibahy, M.; Gafoor, F.A.; Al Hashemi, H.; Athamneh, B.; Al Hammadi, A.A.; Seneviratne, L.; Al Shehhi, M.R. Estimation of soil organic carbon in arid agricultural fields based on hyperspectral satellite images. Geoderma 2024, 453, 117151.
Figure 1. Characterization of soil organic carbon in the municipalities of the Sierra Nevada de Santa Marta.
Figure 2. Absorbance data for 860 soil samples after implementing first-derivative pretreatment. Savitzky–Golay (Window = 7, Order = 2), model RF-SVR.
Figure 3. Absorbance data for 860 soil samples after implementing first-derivative pretreatment. Savitzky–Golay (Window = 11, Order = 2), model RF–XGBoost.
Figure 4. Identification of sensitive wavelengths for predicting soil organic carbon using the RF–SVR model.
Figure 5. Random forest model selection: (a) selecting the number of trees based on the lowest out-of-bag error, (b) selecting the number of features per node based on the out-of-bag error, and (c) restricting the number of leaf nodes.
Figure 6. Identification of sensitive wavelengths for predicting soil organic carbon using the RF–XGBoost model.
Figure 7. Predictive model (RF–SVR) for soil organic carbon, based on sample absorbance.
Figure 8. Predictive model (RF–XGBoost) for organic carbon in soil, based on sample absorbance.
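Table 1. Descriptive statistics for 860 soil samples taken in the Sierra Nevada de Santa Marta, Colombia.
| Farms | Average (g 100 g−1) | Error | SD (g 100 g−1) | CV (%) | Min (g 100 g−1) | Max (g 100 g−1) |
|---|---|---|---|---|---|---|
| Altamira | 1.57 | 0.09 | 0.13 | 8.11 | 1.48 | 1.66 |
| Brisas de Córdoba | 1.51 | 0.19 | 0.32 | 21.32 | 1.25 | 1.87 |
| Chukwin Chukwa | 2.33 | 0.20 | 0.28 | 11.86 | 2.13 | 2.52 |
| El Amparo | 1.86 | 0.09 | 0.53 | 28.66 | 0.85 | 3.75 |
| El Congo | 1.55 | 0.14 | 0.24 | 15.76 | 1.38 | 1.83 |
| El Descanso | 1.67 | 0.14 | 0.24 | 14.12 | 1.41 | 1.87 |
| El Diamante | 2.07 | 0.18 | 0.31 | 14.99 | 1.86 | 2.43 |
| El Edén | 1.34 | 0.14 | 0.24 | 18.16 | 1.06 | 1.50 |
| El Esfuerzo | 1.22 | 0.15 | 0.26 | 20.91 | 0.97 | 1.48 |
| El fraile | 1.32 | 0.11 | 0.18 | 13.94 | 1.12 | 1.48 |
| El Jardín | 1.61 | 0.12 | 0.21 | 12.89 | 1.46 | 1.85 |
| El Limón | 1.69 | 0.33 | 0.58 | 34.07 | 1.03 | 2.09 |
| El Manantial | 3.09 | 0.31 | 1.21 | 39.33 | 0.39 | 4.64 |
| El Paraíso | 2.44 | 0.13 | 0.95 | 39.15 | 1.03 | 5.28 |
| El Progreso | 1.62 | 0.12 | 0.21 | 13.13 | 1.47 | 1.86 |
| El Recuerdo | 2.65 | 0.14 | 0.55 | 20.87 | 1.76 | 3.67 |
| El Triunfo | 1.66 | 0.11 | 0.18 | 11.03 | 1.52 | 1.87 |
| Emanuel | 1.24 | 0.07 | 0.13 | 10.18 | 1.10 | 1.34 |
| La Arcadia | 2.25 | 0.08 | 0.32 | 14.15 | 1.73 | 2.66 |
| La Aurora | 1.64 | 0.20 | 0.35 | 21.10 | 1.31 | 2.00 |
| La Cabaña | 1.90 | 0.15 | 0.60 | 31.50 | 0.96 | 3.24 |
| La Carmelita | 1.68 | 0.21 | 0.36 | 21.15 | 1.27 | 1.90 |
| La Cascada | 1.53 | 0.06 | 0.10 | 6.44 | 1.45 | 1.64 |
| La Chavela | 1.84 | 0.10 | 0.61 | 33.00 | 0.69 | 2.97 |
| La Conquista | 1.36 | 0.13 | 0.50 | 36.52 | 0.78 | 2.26 |
| La Esperancita | 3.67 | 0.44 | 0.75 | 20.57 | 2.87 | 4.37 |
| La Esperanza | 2.78 | 0.12 | 0.65 | 23.54 | 1.95 | 4.61 |
| La Fortuna | 1.22 | 0.10 | 0.25 | 20.92 | 0.86 | 1.52 |
| La Granja | 1.50 | 0.10 | 0.17 | 11.37 | 1.33 | 1.67 |
| La Perla | 1.40 | 0.12 | 0.20 | 14.49 | 1.27 | 1.63 |
| La Vega | 0.89 | 0.20 | 0.35 | 25.00 | 1.08 | 1.76 |
| La Victoria | 2.23 | 0.11 | 0.19 | 8.66 | 2.06 | 2.44 |
| Las Tres Palmas | 1.06 | 0.13 | 0.23 | 21.45 | 0.86 | 1.31 |
| La Esperanza | 1.39 | 0.21 | 0.37 | 26.33 | 1.05 | 1.78 |
| Las Gaviotas | 1.36 | 0.11 | 0.19 | 14.03 | 1.16 | 1.54 |
| Las Murallas | 1.89 | 0.09 | 0.55 | 29.14 | 0.84 | 2.92 |
| Las Palmas 3 | 1.76 | 0.07 | 0.42 | 23.88 | 0.74 | 2.80 |
| Las Piedritas | 1.71 | 0.08 | 0.49 | 28.44 | 0.83 | 3.04 |
| Los Acacios | 0.91 | 0.07 | 0.42 | 46.43 | 0.11 | 1.77 |
| Los Angeles | 1.96 | 0.09 | 0.23 | 11.55 | 1.57 | 2.21 |
| Los Cacaos | 2.03 | 0.11 | 0.70 | 34.31 | 0.29 | 3.70 |
| Los Jazmines | 1.93 | 0.07 | 0.41 | 21.48 | 1.15 | 2.65 |
| Los Mandarinos | 2.24 | 0.10 | 0.60 | 26.93 | 0.95 | 3.53 |
| Los Mangos | 2.36 | 0.18 | 0.70 | 29.84 | 1.24 | 3.37 |
| Los Naranjos | 1.74 | 0.05 | 0.09 | 5.35 | 1.63 | 1.80 |
| Los Potreritos | 2.11 | 0.08 | 0.49 | 23.17 | 1.02 | 3.13 |
| Los Recuerdos | 1.71 | 0.08 | 0.14 | 16.83 | 0.69 | 0.97 |
| María Bonita | 1.48 | 0.13 | 0.22 | 14.78 | 1.24 | 1.67 |
| Monte Carmelo | 1.53 | 0.18 | 0.32 | 20.60 | 1.22 | 1.85 |
| Niguakoa | 2.46 | 0.17 | 0.72 | 29.17 | 1.66 | 4.40 |
| No hay como tu | 1.23 | 0.11 | 0.58 | 47.23 | 0.57 | 3.13 |
| Nuevo Oriente | 0.87 | 0.10 | 0.37 | 42.75 | 0.37 | 1.71 |
| Santa Bárbara | 1.86 | 0.22 | 0.86 | 46.23 | 0.90 | 4.50 |
| Sinaí | 2.70 | 0.15 | 0.95 | 35.06 | 0.86 | 5.35 |
| Tagbi | 1.24 | 0.14 | 0.24 | 19.33 | 1.01 | 1.49 |
| Villa Brisa | 1.05 | 0.02 | 0.04 | 3.62 | 1.02 | 1.09 |
| Villa Sofia | 2.05 | 0.10 | 0.40 | 19.66 | 1.29 | 2.70 |
| Villa Vista | 1.78 | 0.16 | 0.27 | 15.11 | 1.61 | 2.09 |
| Total | 1.90 | 0.03 | 0.79 | 41.61 | 0.11 | 5.35 |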
Table 2. Predictive models for SOC.
| Model | R2 | RMSE | RPD |
|---|---|---|---|
| Random Forest | 0.692 | 0.441 | 1.8 |
| SVR | 0.663 | 0.461 | 1.56 |
| RF–SVR | 0.853 | 0.380 | 2.6 |
| RF-Optimized XGBoost | 0.866 | 0.090 | 1.9 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
