Article

Estimation of Soil Organic Carbon Content of Grassland in West Songnen Plain Using Machine Learning Algorithms and Sentinel-1/2 Data

College of Earth Sciences, Jilin University, Changchun 130061, China
*
Author to whom correspondence should be addressed.
Agriculture 2025, 15(15), 1640; https://doi.org/10.3390/agriculture15151640
Submission received: 13 May 2025 / Revised: 10 July 2025 / Accepted: 27 July 2025 / Published: 29 July 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Based on multi-source data, including synthetic aperture radar (Sentinel-1, S1) and optical satellite images (Sentinel-2, S2), topographic data, and climate data, this study explored the performance and feasibility of different variable combinations in predicting SOC using three machine learning models. We designed the three models based on 244 samples from the study area, using 70% of the samples for the training set and 30% for the testing set. Nine experiments were conducted under three variable scenarios to select the optimal model. We used this optimal model to achieve high-precision predictions of SOC content. Our results indicated that both S1 and S2 data are significant for SOC prediction, and the use of multi-sensor data yielded more accurate results than single-sensor data. The RF model based on the integration of S1, S2, topography, and climate data achieved the highest prediction accuracy. In terms of variable importance, the S2 data exhibited the highest contribution to SOC prediction (31.03%). The SOC contents within the study region varied between 4.16 g/kg and 29.19 g/kg, showing a clear spatial trend of higher concentrations in the east than in the west. Overall, the proposed model showed strong performance in estimating grassland SOC and offered valuable scientific guidance for grassland conservation in the western Songnen Plain.

1. Introduction

Global climate change is a major challenge facing the world and human society today. In recent years, countries around the globe have taken measures to address the greenhouse effect and reduce the rate of greenhouse gas emissions. Soil organic carbon (SOC) represents the predominant carbon reservoir within terrestrial ecosystems, comprising approximately 60% of the total carbon stocks in terrestrial ecosystems [1]. Grasslands, the largest and most widely distributed terrestrial ecosystems globally, cover approximately 30% of the Earth’s land area [2], accounting for around 20% of global carbon reserves [3]. Grassland degradation not only affects climate regulation and ecological functions but also impacts the livelihoods and well-being of the people living there [4]. Grassland organic carbon comprises vegetation organic carbon and SOC, accounting for 5–20% and 80–95% of the total grassland organic carbon, respectively [5]. Due to its strong associations with soil fertility, vegetation growth, aggregate stability, and nutrient cycling, SOC is regarded as a critical indicator for evaluating grassland degradation [6]. Thus, understanding the spatial patterns of SOC and its influencing variables offers valuable insights to inform sustainable land management and grassland utilization. It contributes to the sustainable development of grassland resources and helps accurately assess regional and global carbon reserves. However, due to the diversity in grassland growth environments and management methods, obtaining a high-precision SOC distribution in grassland at a large scale is challenging. In this context, the existing research lacks satisfactory accuracy, which limits its relevance to urgent practical needs.
Conventional methods for SOC quantification include field surveys, soil sampling, and laboratory analyses [7]. While conventional methods offer high accuracy, they are costly, time-consuming, and inefficient, severely limiting their practical applicability in large-scale soil property monitoring. Therefore, the development of economical and reliable SOC prediction techniques is essential. To overcome these challenges, researchers have focused on exploring robust and cost-effective methods for SOC prediction. Remote sensing (RS) technology, offering advantages of data accessibility and large-scale monitoring capabilities, has seen extensive use in estimating SOC quantitatively [8]. Characterized by vast spatial extents and diverse ecosystem heterogeneity, grassland ecosystems pose significant challenges for conventional methods to achieve regional-scale predictions of SOC. Therefore, the integration of conventional soil survey approaches with state-of-the-art RS techniques is imperative to accurately characterize the spatial heterogeneity of SOC across grassland landscapes. RS has shown great potential in describing soil properties. Previously, soil properties were detected using optical and multispectral imagery, spanning wavelengths from the visible-near infrared (VIS-NIR) to the shortwave infrared (SWIR) spectra [9,10]. These sensors are mounted on satellites, aircraft, and unmanned aerial vehicles [11,12,13]. Among various optical RS datasets, Sentinel-2 (S2) imagery has garnered widespread recognition due to its 13 spectral bands encompassing the full range from VIS to SWIR wavelengths, offering a comprehensive data resource. Its spectral indices have been successfully applied to predict various soil properties, including soil moisture [14], aboveground biomass [15], soil texture [16], effective cation exchange capacity [17], and total nitrogen [18]. 
In particular, the visible bands centered at 490, 560, and 665 nm have been demonstrated to serve as key indicators in the RS-based estimation of SOC [19]. Unlike VIS and NIR bands, SWIR bands can partially penetrate clouds and thin fog. This characteristic enables SWIR to provide effective surface observation data under complex atmospheric conditions, making it an essential factor in the quantitative prediction of SOC. Previous research has demonstrated that the use of S2 data leads to enhanced SOC content prediction accuracy, making it an ideal data source for SOC quantification. Despite their advantages, optical sensors are still challenged by cloud cover and precipitation events, which hinder the accuracy of SOC quantification efforts when relying solely on optical data.
In comparison, synthetic aperture radar (SAR) sensors are capable of continuously monitoring the Earth’s surface around the clock, unaffected by rainy or cloudy weather. SAR sensors are capable of depicting the interplay between soil properties and vegetation dynamics, thereby supporting the accurate prediction and monitoring of soil properties. The existing research has consistently validated the utility of SAR data in modeling and predicting SOC dynamics [20,21]. Sentinel-1 (S1) has shown good performance in predicting grassland soil properties. The texture features derived from S1 have been applied to estimate soil moisture in both mineral soils [22] and organic soils [23]. Furthermore, previous research has shown that integrating data from multiple sensors enhances the prediction accuracy of SOC contents compared to using single-sensor data [24,25].
SOC prediction typically involves constructing models that describe the interactions between environmental variables and SOC. RS technologies, combined with machine learning (ML) models, provide robust tools for SOC prediction. The combination of optical and SAR sensors increases the availability of environmental variables for SOC modeling. The development of ML models has also expanded the range of approaches available for SOC prediction. Research has revealed that geographically weighted regression [26], support vector regression, gradient boosting regression trees [27], and random forest (RF) algorithms are commonly applied in SOC prediction. In southern Hangzhou, Zhejiang Province, the support vector regression (SVR) model has been used to improve the prediction accuracy of soil organic carbon content [28]. Both support vector machine (SVM) and RF models have demonstrated excellent performance in predicting SOC in the Ourika watershed in Morocco [29], and the RF model has also shown outstanding prediction accuracy for soil carbon stocks in the Srou watershed in Morocco [30]. Numerous studies have highlighted that RF [31,32] and extreme gradient boosting (XGBoost) [33] models exhibit outstanding prediction capabilities for SOC, substantially enhancing the precision and stability of SOC estimation models.
Since climate and terrain factors are important environmental factors affecting soil organic matter accumulation, they are often used as key variables in SOC prediction models [34,35]. Generally, temperature regulates the decomposition rate of microorganisms, while precipitation supports the productivity of vegetation and organic matter inputs. High-altitude regions generally experience lower temperatures and slower microbial activity, which reduces the decomposition rate of organic matter, resulting in higher SOC reserves [36]. Climate factors such as mean annual precipitation (MAP) and mean annual temperature (MAT) play a dominant role in the temporal SOC changes. Climate changes, such as alterations in precipitation patterns or rising temperatures, might significantly affect the balance between SOC accumulation and decomposition [37]. Topographic features are also key determinants of the spatial heterogeneity of SOC [38,39]. Slopes, for instance, regulate the spatial distribution of SOC by influencing soil erosion and water retention capacity. Steeper slopes tend to exacerbate soil and nutrient loss, thereby reducing SOC accumulation. In contrast, gentler slopes help retain soil moisture and promote vegetation growth, leading to increased SOC storage [40]. Elevation affects SOC storage by altering temperature and precipitation gradients [41]. These results highlight the critical influence of climatic and topographic factors in shaping the spatiotemporal dynamics of SOC in grasslands.
The grasslands of western Songnen Plain are located in Northeast China at the easternmost edge of the Eurasian steppe. This region serves as a crucial green ecological barrier in the Black Soil Zone of Northeast China and is one of the key forage production areas in northern China. Although numerous studies have applied RS and ML methods to predict SOC in cropland and forest ecosystems, relevant research remains limited in grasslands, especially in natural or semi-natural grassland ecosystems. In particular, in the western Songnen Plain, in-depth studies focusing on SOC in grasslands are still lacking. This region represents a typical meadow steppe ecosystem, characterized by a temperate semi-arid climate, unique vegetation structure, and significant human intervention, exhibiting distinct regional environmental features. In addition, there is a lack of systematic comparisons among the SOC prediction capabilities of multi-source RS data (such as S1, S2, and environmental variables) across different ML models, especially in the context of typical grassland ecosystems in Northeast China. By combining S1 (radar imagery), S2 (optical RS data), and multi-source environmental data, the current study performed a comparative analysis of different ML methods to achieve high-precision SOC predictions in the grassland of western Songnen Plain. The primary goals of this work included (1) investigating the potential of integrating S1/S2 imagery to enhance the SOC prediction accuracy and advance inversion mapping of grasslands, (2) comparing the accuracy of SOC prediction using different ML models under different variable combination scenarios, and (3) evaluating the contribution levels of various predictive variables in SOC estimation. Our results provide methodological insights for improving soil property prediction through the integration of multi-source imagery. Our findings contribute to the protection and sustainable utilization of the grasslands in western Songnen Plain.

2. Materials and Methods

2.1. Study Area

The grasslands of western Songnen Plain are located in central Northeast China (N 44°00′~48°35′, E 121°36′~126°36′), encompassing 23 counties and districts in the western regions of Heilongjiang and Jilin Provinces, with a total land area of 10,151.2 km2 (Figure 1). The plain is primarily formed by perennial alluvial deposits from the Songhua River, the Nenjiang River, and their tributaries, with surrounding rolling hills and depressions contributing to its plain topography. The Songnen Plain lies in the transitional zone between China’s humid monsoon region and the inland arid region, characterized by a semi-arid to semi-humid climate and recognized as a climate-sensitive area. The MAT in the region ranges from 4 °C to 6 °C, and the MAP ranges from 300 mm to 600 mm, with 80% of its rainfall concentrated between May and September. However, evaporation exceeds precipitation in this area.
Grasslands in the study region occupy around 32.5% of the overall land surface, making it one of China’s eight main pastoral areas. However, with the socioeconomic development of Northeast China, grassland resources have suffered severe degradation. The grassland area in western Songnen Plain decreased from 9017.03 km2 in 2000 to 8182.05 km2 in 2021. Large areas of the grassland region were reclaimed, and some were abandoned after cultivation due to unsuitability, resulting in severe damage to the grassland vegetation. Overgrazing and excessive use of grasslands have resulted in serious “three problems” (desertification, salinization, and degradation). In recent years, the government and related agencies have implemented various measures, including the Grain for Green program, seasonal grazing bans, prohibition and suspension of grazing, and ecological compensation, to address the ecological challenges of the western Songnen Plain grasslands and support the restoration and long-term sustainability of the grassland ecosystem. The western grasslands of the Songnen Plain hold significant ecological, economic, and social value, and their conservation and restoration are crucial for maintaining regional ecological balance and sustainable development.

2.2. Data Acquisition and Preprocessing

2.2.1. Soil Sampling and Laboratory Analysis

In August 2022, a field survey and sampling were conducted in the study area. A total of 244 surface soil samples (0–20 cm) were collected from the western grasslands of the Songnen Plain (Figure 1), including 160 samples from the western part of Jilin Province and 84 samples from Heilongjiang Province. The sampling points were designed to cover different types of grasslands, soil textures, and management practices as comprehensively as possible within the constraints of practical conditions to ensure representativeness. SOC in the prepared soil samples was quantified using the potassium dichromate oxidation method under external heating conditions.

2.2.2. Geospatial Data Sources and Preprocessing

This study used R 4.3.3 (run through RStudio) to select the predictive variables needed to estimate SOC contents in the grasslands of the western Songnen Plain from the S1/S2 imagery and environmental data. The predictive variables from the various sources were reprojected to WGS_1984_Albers and resampled to a 10 m grid. Because the variables differ in dimension and magnitude, all predictors were normalized before being incorporated into the ML models. To improve prediction accuracy and reduce interference from redundant variables, the Boruta algorithm was applied for feature selection prior to model construction. The Boruta algorithm identifies variables that are significantly important to the response variable by generating shadow features and comparing their importance scores with those of the original variables [30]. In this study, SOC was set as the response variable, while RS bands and environmental factors were used as input variables. Feature selection was performed with the “Boruta” package in R, and only features classified as “confirmed” by the algorithm were retained for subsequent ML modeling. This approach helps enhance a model’s generalizability, reduce the risk of overfitting, and improve the identification of key driving factors influencing SOC.
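The shadow-feature idea behind Boruta can be illustrated with a short sketch. This is not the R “Boruta” package and not the authors' code: it swaps the random-forest importance score for a simple absolute Pearson correlation, skips the statistical test that Boruta applies to the hit counts, and uses synthetic data. The function names `pearson_abs` and `shadow_feature_screen` are hypothetical.

```python
import random


def pearson_abs(x, y):
    """Absolute Pearson correlation, a stand-in for a real importance score."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / (sxx * syy) ** 0.5)


def shadow_feature_screen(features, target, n_iter=50, seed=0):
    """Count how often each feature scores above the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features: shuffled copies whose link to the target is broken.
        shadow_scores = []
        for values in features.values():
            shuffled = values[:]
            rng.shuffle(shuffled)
            shadow_scores.append(pearson_abs(shuffled, target))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if pearson_abs(values, target) > threshold:
                hits[name] += 1
    return hits


# Synthetic demonstration: one informative predictor and one pure-noise predictor.
demo_rng = random.Random(42)
signal = [demo_rng.gauss(0, 1) for _ in range(200)]
noise = [demo_rng.gauss(0, 1) for _ in range(200)]
soc = [2.0 * s + demo_rng.gauss(0, 0.5) for s in signal]
hits = shadow_feature_screen({"signal": signal, "noise": noise}, soc)
print(hits)
```

A feature that beats the shadows in nearly every iteration would be a candidate for “confirmed” status; the real algorithm decides this with a statistical test rather than a raw count.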
1. S2 image: S2 images were downloaded from the European Space Agency’s data sharing platform (https://dataspace.copernicus.eu/) (accessed on 22 October 2023). Level-2A data, comprising atmospherically corrected bottom-of-atmosphere reflectance data, were used in this study. The selected images were acquired in August 2022, closely matching the soil sampling time. Four MSI bands commonly used in soil property assessments were selected from S2 bands: blue band (B2), green band (B3), red band (B4), and near-infrared band (B8). Additionally, eight vegetation indices (Table 1) were selected to predict SOC contents in the western grasslands of the Songnen Plain.
2. S1 image: The S1 RS imagery used in this study was acquired in interferometric wide (IW) mode and included ground range detected (GRD) data with a resolution of 10 m, which were downloaded from the ASF Data Search platform (https://search.asf.alaska.edu/) (accessed on 25 October 2023). Dual-polarization data, including VV and VH, were selected for the analysis. Based on the S1 images, five predictive variables were derived, including two dual-polarization bands (VH and VV) and three transformed bands (VH/VV, VH-VV, and (VH + VV)/2).
3. Topography and position data: The digital elevation model (DEM) data were obtained from the Google Earth Engine (GEE) database. The DEM was reprojected into WGS_1984_Albers and resampled to a spatial resolution of 10 m. Two topography variables, slope and elevation, were calculated using ArcGIS 10.6. Additionally, longitude (X) and latitude (Y) geographic coordinate data were input as auxiliary geographic information.
4. Climate data: Temperature and precipitation data were obtained from the National Earth System Science Data Center (http://www.geodata.cn/) (accessed on 27 October 2023) with a resolution of 1 km. These data were stored as INT16 type in NC (NetCDF) files. MAT and MAP values at the sampling points were calculated and extracted through batch processing in Python 3.10.
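The derived predictors listed above are simple band arithmetic, which the following sketch illustrates for a single pixel. The three S1 transforms follow the text directly. The vegetation-index formulas are the common textbook forms and are an assumption, since Table 1 is not reproduced here; the function names, the `soil_factor` default of 0.5, and the input values are hypothetical.

```python
def s1_features(vv, vh):
    """Five S1 predictors named in the text, from VV and VH backscatter values."""
    return {
        "VV": vv,
        "VH": vh,
        "VH/VV": vh / vv,
        "VH-VV": vh - vv,
        "(VH+VV)/2": (vh + vv) / 2,
    }


def s2_indices(green, red, nir, soil_factor=0.5):
    """Textbook forms of several vegetation indices (assumed, not taken from Table 1)."""
    return {
        "NDVI": (nir - red) / (nir + red),
        "DVI": nir - red,
        "GNDVI": (nir - green) / (nir + green),
        "SAVI": (1 + soil_factor) * (nir - red) / (nir + red + soil_factor),
        "IPVI": nir / (nir + red),
    }


# Hypothetical pixel values: backscatter for S1, surface reflectance for S2.
radar = s1_features(vv=-10.0, vh=-16.0)
optical = s2_indices(green=0.08, red=0.06, nir=0.30)
print(radar)
print(optical)
```

Whether the S1 ratio and difference are taken on decibel or linear backscatter is not stated in the text, so the sketch leaves the unit choice to the caller.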

2.3. Research Methods

2.3.1. Soil Organic Carbon Prediction

The study employed three ML models—RF, SVM, and XGBoost—to predict SOC. ArcGIS 10.6 was used to extract attribute values of predictive variables at sampling locations. Then, 70% of the data were used for model training and 30% of the data were used for model testing. Models were constructed using R 4.3.3 and RStudio software, with parameter optimization performed for all three models. The research methodology adopted in this study is illustrated in Figure 2.
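A 70/30 split of the 244 samples can be sketched as follows. The text does not say how the split was drawn (for example, whether it was stratified or which seed was used), so the random shuffle, the seed, and the function name `train_test_split_indices` are assumptions made for illustration.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into training and testing subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(244)
print(len(train_idx), len(test_idx))
```

With simple rounding, 244 samples divide into 171 training and 73 testing samples; the exact counts used in the study may differ by one depending on the rounding rule.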
The RF model is an improvement of the bagging algorithm, which constructs multiple decision trees by randomly selecting input variables [42]. RF begins by drawing multiple bootstrap training datasets from the original data and fitting a large number of classification and regression trees to them. During the construction of each tree, a random subset of input variables is considered at every node, and the optimal split is the one that minimizes node impurity (typically the residual sum of squares for regression trees, with the Gini coefficient serving this role in classification). Finally, during prediction, the results from all regression trees are aggregated [43].
RF has a low computational cost and has been successfully applied to high-dimensional data analyses [44]. Compared to traditional regression methods, RF demonstrates greater robustness and accuracy and provides feature importance scores for multiple variables. RF includes two distinct measures of feature importance—one inspired by statistical permutation tests and the other derived from RF training. Studies have found a reasonable correlation between these two measures [45].
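The node-splitting step inside each regression tree can be illustrated with a one-feature sketch. It assumes the squared-error criterion that is common for regression trees; the text does not specify the implementation details of the RF used in the study, and the function name `best_split` and the data are hypothetical.

```python
def best_split(x, y):
    """Return (threshold, sse) for the single-feature split with the lowest squared error."""

    def sse(values):
        # Sum of squared deviations from the node mean (0 for an empty node).
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical predictor values and SOC targets with an obvious break between 3 and 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
threshold, error = best_split(x, y)
print(threshold, round(error, 4))
```

A random forest repeats this search over a random subset of predictors at every node of every tree, then averages the predictions of all trees.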
SVM is a supervised learning method extensively utilized for both classification and regression tasks [46]. It functions by identifying an optimal hyperplane in a high-dimensional feature space to maximize the margin between different classes, thereby achieving effective separation [47]. Due to its strong generalization capability, robustness with small sample sizes, and resistance to overfitting, SVM has been widely adopted in RS image analysis, soil property estimation, and environmental monitoring applications [48,49,50].
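For a continuous target such as SOC, the regression variant (SVR) is normally used. The text does not state the exact formulation, kernel, or hyperparameters, so the standard ε-insensitive form is given here only as a reference sketch, with $\phi$ the kernel feature map, $C$ the penalty parameter, and $\xi_i, \xi_i^*$ the slack variables:

$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right)$$

$$\text{subject to} \quad y_i - \left(w \cdot \phi(x_i) + b\right) \le \varepsilon + \xi_i, \quad \left(w \cdot \phi(x_i) + b\right) - y_i \le \varepsilon + \xi_i^*, \quad \xi_i,\ \xi_i^* \ge 0$$

Residuals smaller than $\varepsilon$ incur no penalty, which is one reason the method can remain stable with small sample sizes.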
XGBoost is an efficient and scalable implementation of the gradient boosting decision tree algorithm [51], specifically designed to enhance the performance and computational efficiency of a model. The objective function of XGBoost incorporates both a loss function and a regularization term, enabling the model to maintain strong fitting ability while effectively mitigating overfitting, thereby improving the generalization capacity [52]. Moreover, XGBoost supports parallel processing, features built-in mechanisms for handling missing values, and introduces specific optimizations for sparse data, making it particularly advantageous for modeling high-dimensional datasets with complex features [53].
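The regularized objective described above is usually written as follows, with $l$ the loss function, $f_j$ the $j$-th of $M$ trees, $T$ the number of leaves in a tree, and $w$ its vector of leaf weights. The symbols $M$, $\gamma$, and $\lambda$ follow the common XGBoost notation and are not taken from the text:

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{j=1}^{M} \Omega\left(f_j\right), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

The first sum rewards fit to the training data, while $\Omega$ penalizes trees with many leaves or large leaf weights.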

2.3.2. Model Accuracy Validation

We introduced three commonly used model evaluation metrics—the coefficient of determination (R2), mean absolute error (MAE), and root mean squared error (RMSE)—to assess the prediction accuracy of SOC contents across different models. R2, ranging from 0 to 1, indicates how well the regression model accounts for the variability in the observed data, reflecting the strength of the model’s fit [54]. A value of R2 close to 1 indicates a better model fit, while a value close to 0 suggests poor model performance. MAE represents the arithmetic mean of the absolute differences between the measured and predicted SOC values. RMSE quantifies the average magnitude of prediction errors as the square root of the mean squared differences between the predicted and measured SOC values. Both MAE and RMSE are indicators of model prediction accuracy and robustness. Reduced MAE and RMSE values signify higher prediction accuracy and minimized errors in the model outputs [55]. The calculation formulas are as follows:
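$$R^2 = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert P_i - O_i \rvert$$
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2$$
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2}$$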
where n denotes the number of samples, Oi represents the observed value, Pi indicates the predicted value, and Ō is the mean of the observed values.
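The four quantities defined above translate directly into code. The following is a minimal sketch using hypothetical SOC values in g/kg; the function names are illustrative and are not taken from the study's R scripts.

```python
import math


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)


def mse(observed, predicted):
    """Mean squared error."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(observed, predicted))


# Hypothetical observed and predicted SOC values (g/kg).
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
print(r2_score(observed, predicted), mae(observed, predicted), rmse(observed, predicted))
```

For these values the residual sum of squares is 3 and the total sum of squares is 20, giving R2 = 0.85, MAE = 0.75 g/kg, and RMSE ≈ 0.87 g/kg.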
To further evaluate the robustness of the model predictions, we introduced the ratio of performance to interquartile range (RPIQ) and the corrected Akaike information criterion (AICc) for small sample sizes. The calculation formulas are as follows:
$$\mathrm{RPIQ} = \frac{IQ}{MSE}$$
$$\mathrm{AICc} = m \times \ln\left(\frac{SS}{m}\right) + 2(K + 1) + \frac{2(K + 1)(K + 2)}{m - K - 1}$$
Here, m represents the number of samples. IQ is the interquartile range of the true values (IQ = Q3 − Q1); that is, the third quartile minus the first quartile. Furthermore, SS denotes the sum of squared residuals. K represents the number of input variables.
The RPIQ, through comparison with the interquartile range of the true values, effectively evaluates the model’s performance under different data distributions. It is particularly robust in the scenarios when data contains outliers or is unevenly distributed, making it more reliable than solely using MSE. Generally, a higher RPIQ value reflects stronger model performance. The AICc aims to balance the goodness of fit and complexity of the model, preventing overfitting. A smaller AICc value implies that the model achieves a better balance when explaining the data.
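Both robustness metrics can be sketched in a few lines. The functions below transcribe the expressions as given in this section (RPIQ with MSE in the denominator, and the AICc correction term with m − K − 1); they are not the authors' code. Many references define RPIQ with RMSE in the denominator instead, so a hypothetical `use_rmse` flag is included, and the quartile convention ("inclusive" linear interpolation) is an assumption because the text does not specify one.

```python
import math
import statistics


def interquartile_range(values):
    """IQ = Q3 - Q1, using inclusive linear interpolation (assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def rpiq(observed, predicted, use_rmse=False):
    """Interquartile range of the observations divided by MSE (or RMSE if requested)."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    denominator = math.sqrt(mse) if use_rmse else mse
    return interquartile_range(observed) / denominator


def aicc(observed, predicted, n_inputs):
    """AICc as written in the text, with K = number of input variables."""
    m = len(observed)
    ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    k = n_inputs
    return m * math.log(ss / m) + 2 * (k + 1) + 2 * (k + 1) * (k + 2) / (m - k - 1)


# Hypothetical observed and predicted SOC values (g/kg).
observed = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [5.0, 6.0, 9.0, 9.0, 12.0, 15.0, 15.0, 18.0]
print(rpiq(observed, predicted), rpiq(observed, predicted, use_rmse=True))
print(aicc(observed, predicted, n_inputs=2))
```

Because AICc values depend on the sample size and the scale of the residuals, they are only comparable between models fitted and evaluated on the same data.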

3. Results

3.1. Correlation Between Predictive Variables and SOC

Figure 3 presents the results of the Spearman correlation analysis between the measured SOC content and the 23 selected predictive variables. MAT exhibited the strongest correlation with SOC content among all variables, with a negative correlation coefficient of −0.61. Among the twelve S2 variables, B2, B3, B4, and B8 showed notably strong positive associations with SOC. In addition, the difference vegetation index (DVI), soil-adjusted vegetation index (SAVI), green normalized difference vegetation index (GNDVI), and normalized difference water index (NDWI) demonstrated stronger correlations with SOC content than the other vegetation indices. Five feature variables derived from S1 positively correlated with SOC content, with stronger correlations observed for VH and VH-VV and weaker correlations observed for VH, VH/VV, and (VH + VV)/2. Regarding position factors, the correlation coefficients for latitude and longitude were 0.55 and 0.25, respectively, both showing a significant positive correlation with SOC. Among the topography variables, elevation exhibited the higher positive correlation coefficient, 0.36.
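Spearman's coefficient is the Pearson correlation of the ranks, which the following sketch computes with average ranks for ties. The MAT and SOC values are hypothetical and chosen only to show a strong negative monotonic relationship; they are not the study data.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Hypothetical MAT (degrees C) and SOC (g/kg) values at five sampling points.
mat = [4.0, 4.5, 5.0, 5.5, 6.0]
soc = [20.0, 18.0, 19.0, 9.0, 7.0]
rho = spearman(mat, soc)
print(round(rho, 2))
```

Because only ranks are used, the coefficient is insensitive to the units of the predictors and to monotonic transformations such as normalization.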

3.2. Evaluation and Comparison of the Accuracy Rates of Different Models

To assess how various variable combinations influence the prediction accuracy of SOC, three scenarios (Table 2) were constructed based on the existing research. Scenario 1 was constructed using S1-derived predictors and topography, position, and climate factors. Scenario 2 was constructed using S2-derived predictors and position, topography, and climate factors. Scenario 3 was built using all the variables and predictors.
The prediction accuracy of SOC contents by the RF, SVM, and XGBoost models under the three scenarios is shown in Table 3. The R2, RMSE, MAE, corrected Akaike information criterion (AICc), and ratio of performance to interquartile range (RPIQ) were calculated using the test dataset. The evaluation results revealed that model selection, variable types, and variable quantity all affected model performance in SOC prediction.
Figure 4 illustrates the relationship between the observed and predicted SOC values across the testing datasets for the RF, SVM, and XGBoost models. For the RF, R2 values ranged from 0.52 to 0.58 across the three scenarios, with scenario 3 achieving the highest prediction accuracy. Additionally, the RF regression model in scenario 3 demonstrated the most stable performance. These findings suggested that incorporating multi-source data markedly enhances the predictive performance of the RF model for SOC estimation.
In comparison, the SVM model performed slightly worse overall but still demonstrated reasonable predictive capability. Its R2, RMSE, MAE, and RPIQ values ranged within 0.54–0.56, 4.45–4.89 g/kg, 3.36–3.64 g/kg, and 5.23–5.81, respectively. Where both models reached R2 = 0.55, the SVM exhibited lower RMSE, MAE, and AICc values and a higher RPIQ than the RF, indicating that the SVM was the better of the two in that case. Among the SVM runs, scenario 3 achieved the highest R2 (R2 = 0.56, RMSE = 4.89 g/kg, and MAE = 3.64 g/kg).
Among the evaluated models, XGBoost exhibited markedly lower predictive performance than the RF and SVM. Its performance was particularly poor and unstable in scenario 1, where the R2 value was only 0.25. Although its prediction accuracy improved in scenarios 2 and 3, it remained significantly lower than that of the RF and SVM in all three scenarios, demonstrating its limited predictive capability in this context.
Among the three scenarios, all models demonstrated that integrating S1 and S2 imagery significantly enhanced the SOC prediction accuracy compared to relying on individual image sources. For instance, in the RF model, combining both imagery types increased the R2 values by 16% and 5.45% compared to using S1 and S2 alone, respectively. In addition, the RF model using combined multi-source data exhibited reduced RMSE, MAE, and AICc and higher RPIQ values, indicating that multi-source data can significantly enhance the SOC prediction accuracy. This improvement in prediction accuracy might be attributed to the complementary characteristics of radar and optical data. Radar data (S1) provide valuable information on surface roughness and moisture even under cloudy conditions, while optical data (S2) capture spectral variations related to vegetation cover and soil reflectance. By combining both, the RF model can exploit a wider range of biophysical indicators, thereby capturing SOC dynamics more comprehensively. For the SVM model, with increasing prediction accuracy from scenario 1 to scenario 3, the RMSE, MAE, and AICc increased and RPIQ values decreased, suggesting that the SVM did not perform as well as the RF under multi-source data fusion and might be prone to overfitting. This finding indicated that the SVM might not fully leverage the complementary information from multi-source data without careful tuning. Moreover, the prediction accuracy of XGBoost in scenario 1 was the lowest and most unstable among all models. However, its accuracy improved in scenarios 2 and 3, with an increase in R2. This result indicated that XGBoost was not sufficiently robust for SOC prediction, and even with the addition of more variables, its performance improvements remained limited.
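The percentage gains quoted above read most naturally as relative increases in R2. Under that reading, which is an assumption since the text does not give the formula, the gain from adding the second sensor is

$$\Delta R^2\,(\%) = \frac{R^2_{\mathrm{S1+S2}} - R^2_{\mathrm{single}}}{R^2_{\mathrm{single}}} \times 100$$

As a worked check, taking the reported scenario 3 value of 0.58 and assuming an S2-only RF value of 0.55 (to be confirmed against Table 3) gives

$$\frac{0.58 - 0.55}{0.55} \times 100 \approx 5.45$$

which matches the 5.45% figure quoted for S2 alone.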
The comprehensive analysis showed that the RF model exhibited the best SOC prediction performance, followed by the SVM and XGBoost. The RF model, constructed with a combination of S1 imagery, S2 imagery, and environmental factors, achieved the highest prediction accuracy and demonstrated the best stability. Therefore, this study selected the RF model to predict SOC contents, assess the relative contributions of predictive factors, and visualize the spatial distribution of SOC across the study region.

3.3. Evaluation of the Relative Importance of Predictive Variables

Figure 5 depicts the relative importance of variables used as inputs in the RF model. Our results indicated that the variables derived from the S2 optical data contributed the most to SOC prediction (31.03%), with B2 (5.75%), B3 (5.35%), and B4 (5%) showing fairly high importance. In addition, B8 (3.42%), DVI (2.25%), GNDVI (1.92%), and IPVI (1.75%) exhibited good contributions, with relative importance values exceeding 1.5%. In contrast, NDVI (0.79%) and SAVI (0.69%) showed lower importance. The position factors exhibited a relatively high contribution among all environmental variables (28.97%), with latitude (24.48%) being more important than longitude (4.48%). Climate factors exhibited a 22.88% contribution, with MAT (19.79%) depicting higher importance than MAP (3.08%).
The relative importance of variables derived from the S1 radar data was 8.44%, with VH (3.38%) exhibiting the highest importance, followed by VV (1.44%), VH/VV (1.44%), (VH + VV)/2 (1.1%), and VH-VV (1.08%).
Topographic factors showed an 8.69% contribution, with elevation (7.38%) showing significantly higher importance than slope (1.31%).
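The group-level contributions are the sums of the per-variable importances, which can be checked directly against the values quoted above. The sketch below uses only numbers reported in the text; the S2 group is left out because three of its vegetation indices are not listed individually. The function name `group_totals` is illustrative.

```python
# Per-variable relative importance (%) as reported in the text.
importance = {
    "S1": {"VH": 3.38, "VV": 1.44, "VH/VV": 1.44, "(VH+VV)/2": 1.10, "VH-VV": 1.08},
    "Topography": {"elevation": 7.38, "slope": 1.31},
    "Position": {"latitude": 24.48, "longitude": 4.48},
    "Climate": {"MAT": 19.79, "MAP": 3.08},
}

# Group totals (%) as reported in the text.
reported_totals = {"S1": 8.44, "Topography": 8.69, "Position": 28.97, "Climate": 22.88}


def group_totals(groups):
    """Sum the per-variable importances within each group, rounded to 2 decimals."""
    return {name: round(sum(values.values()), 2) for name, values in groups.items()}


totals = group_totals(importance)
for name, total in totals.items():
    print(f"{name}: summed {total:.2f}% vs reported {reported_totals[name]:.2f}%")
```

The summed and reported totals agree to within 0.01 percentage points, a difference that is consistent with rounding of the individual values.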

3.4. Spatial Distribution of SOC in the Grassland of Western Songnen Plain

Among all the models, the RF model in scenario 3 outperformed the other models, achieving the highest accuracy in SOC prediction. Therefore, this model was used to predict and map the SOC contents in the grasslands of the western Songnen Plain (Figure 6). Based on the prediction outcomes, the SOC contents in the grassland varied between 4.16 g/kg and 29.19 g/kg, with an average value of 11.33 g/kg and a standard deviation of 3.21 g/kg.
Notably, the SOC contents in the grassland of western Songnen Plain exhibited a decreasing trend from north to south, with higher SOC levels observed in the eastern regions compared to the western regions. Furthermore, the mean SOC content observed in the grasslands of western Heilongjiang Province (13.59 g/kg) was notably greater than that in the grasslands of western Jilin Province (8.32 g/kg). This finding might be attributed to specific climatic conditions and differentiated grassland utilization and management practices in these regions. In addition, the cold and humid climate in northern high-latitude areas is more favorable for SOC storage.

4. Discussion

4.1. Performance of the Prediction Models

Previous studies have shown that combining radar with optical data substantially improves the predictive performance of SOC estimation models; however, the accuracy of SOC prediction is closely related to the choice of models and the combination of predictive variables [56]. In the current study, three models (RF, SVM, and XGBoost) and three scenarios were evaluated to achieve high-precision SOC prediction in the grasslands of Northeast China. Because radar data can compensate for the limitations of optical data in Northeast China, for example by providing ground information under cloudy conditions, an earlier study showed that prediction accuracy can be substantially enhanced when radar data are incorporated into the model [57]. The results of the present study indicated that integrating optical and radar data yielded better SOC predictions than using either S2 or S1 data alone. Overall, the RF model outperformed the SVM and XGBoost models in terms of prediction accuracy and stability. In addition to improving the SOC prediction accuracy, it was more robust and simpler to apply. Furthermore, the RF model performed stably across all three scenarios and effectively mitigated overfitting, consistent with the findings of previous studies [58,59,60]. In contrast, the XGBoost regression model performed unstably when using S1 data, and its R² was lower than that of the other two models in all three scenarios (Table 3). This finding was inconsistent with previous studies that reported superior performance of XGBoost; the discrepancy might be attributed to several factors, such as the sample size, properties, and selection of predictive variables [61]. Compared to the RF model, the prediction accuracy of the SVM was slightly higher in scenario 1, comparable in scenario 2, and lower in scenario 3. This result suggested that the RF model performs better when modeling complex combinations of predictive variables. In addition, although the SVM is capable of handling nonlinear relationships, the RF model generally demonstrates superior interpretability and generalization performance [62].
Studies have demonstrated that SAR can retrieve information related to soil and vegetation, thereby enhancing the ability to predict soil characteristics; its effectiveness depends on the ability to extract vegetation characteristics indicative of soil property variations from the RS images [63]. In the current study, all three models showed that, when combined with environmental variables, the S2 data predicted SOC better than the S1 imagery. This finding might be attributed to the environmental conditions of the study area. Heavy rainfall prior to fieldwork caused severe waterlogging in some parts of the study area during sampling. The presence of surface water likely reduced the scattering and penetration capabilities of radar signals, potentially introducing additional variability into the SAR data [64,65]. As a result, the S1 data performed worse than the S2 data in predicting SOC contents. However, owing to the inclusion of the S1 data, the R² of scenario 3 was higher than that of scenario 2 in all three models, indicating that the use of S1 imagery improved the SOC prediction.
Our results indicated that the RF model exhibited strong predictive capability, particularly achieving favorable application outcomes in high-latitude regions, offering novel insights into its application in similar areas. This finding indicated that the RF model can more efficiently capture nonlinear relationships among the variables, making it well-suited for predicting soil properties in similar scenarios. Furthermore, the comprehensive evaluation of the models’ accuracy and reliability using multiple metrics to select the optimal model and data integration scheme provided valuable references for assessing SOC prediction models.
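To make the multi-metric evaluation concrete, the sketch below computes R², RMSE, MAE, and RPIQ from paired observed and predicted values. It rests on several assumptions: R² is taken as the coefficient of determination (1 − SS_res/SS_tot), RPIQ as the interquartile range of the observed values divided by the RMSE, and the quartiles follow the default method of Python's `statistics.quantiles`. The study's exact definitions may differ. AICc is omitted because it also requires the number of model parameters. The sample values are hypothetical and are not data from this study.

```python
import math
import statistics


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, and RPIQ for paired observations and predictions."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n

    ss_res = sum(r * r for r in residuals)                 # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)    # total sum of squares

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n

    # Quartiles of the observed values (assumption: default 'exclusive' method).
    q1, _, q3 = statistics.quantiles(observed, n=4)

    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": rmse,
        "MAE": mae,
        "RPIQ": (q3 - q1) / rmse,
    }


# Hypothetical SOC values in g/kg, for illustration only.
observed = [8.0, 10.0, 12.0, 14.0, 16.0]
predicted = [9.0, 10.0, 11.0, 15.0, 15.0]
metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Higher R² and RPIQ and lower RMSE and MAE indicate better agreement, so the metrics can rank candidate models differently when their differences are small.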

4.2. Variable Importance

As shown in Figure 5, substantial disparities were observed among different variables with respect to their contributions to SOC prediction.
Among the position factors, latitude was more important than longitude, which might be attributed to its strong correlation with environmental variables, such as temperature, precipitation, and vegetation types, influencing the soil carbon cycle process [66].
Among the climate factors, MAT contributed more than MAP. Thus, MAT was the most critical climate factor affecting SOC formation and accumulation, indicating that temperature plays a dominant role in SOC dynamics. Generally, lower temperatures reduce organic matter decomposition and promote SOC accumulation, while higher temperatures accelerate SOC mineralization and losses [67]. Furthermore, latitude and temperature are highly correlated. In high-latitude regions of Northeast China, temperature is one of the most important variables influencing SOC, further explaining why latitude more prominently affected SOC.
Among the S2 variables, B2, B3, and B4 were the primary contributors to SOC prediction, highlighting the significant explanatory power of visible bands in SOC modeling [68]. The visible bands (B2–B4) are sensitive to chlorophyll content and leaf area, indirectly indicating the impact of biomass and organic matter on the soil. In grasslands, these bands help capture information on photosynthetic activity, which is closely linked to SOC dynamics through litter input and root biomass. Numerous studies have shown that NDVI has a high explanatory capacity in predicting SOC, bulk density, and soil texture, particularly in regions with high vegetation coverage, where it significantly enhances the prediction accuracy [69,70]. In the present study, however, all vegetation indices except SAVI showed higher importance than NDVI, and DVI was the most important among them. This finding might be attributed to the variation in vegetation cover under different management practices in the study area: certain grazing areas had sparse vegetation, and NDVI, which is best suited to medium to high vegetation coverage, was less effective there. In contrast, DVI and EVI, which are relatively more sensitive to soil, performed better. Our study also demonstrated that DVI can serve as an effective auxiliary indicator in a multi-source data fusion prediction model.
The S1 radar data accounted for 8.44% of the importance in SOC prediction, with VH polarization (3.38%) being the most important S1 variable. VH is more sensitive to vegetation, whereas VV is more sensitive to soil moisture. VH polarization can capture variations in grassland height, density, and biomass, and the high vegetation variability in the study area enhanced its contribution. Furthermore, as previously mentioned, some areas in the study region were in a water storage period, which weakened VV-polarized ground and canopy scattering, caused water surface scattering to dominate, and made VV less effective in reflecting SOC contents [71]. However, as observed in Figure 3, the S1 data performed poorly in the Spearman correlation analysis. This finding indicated that selecting predictive variables through a correlation analysis is a relatively basic method and might not be suitable for multivariate regression or ML models, in which interaction effects might exist among variables. With sufficient data, such low-correlation variables might provide additional information to the model, thereby enhancing its predictive capacity [72]. Notably, although the S1-derived features generally showed weak correlations with SOC in the present study, their inclusion improved the SOC prediction accuracy.
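The point that a weak Spearman correlation does not rule out predictive value can be illustrated with a toy example. The sketch below implements Spearman's rank correlation with average ranks for ties and applies it to a hypothetical predictor whose relationship with the target is strong but non-monotonic. The function names and the data are illustrative assumptions and are unrelated to the study's S1 variables.

```python
import math


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical predictor with a symmetric, non-monotonic effect on the target.
x = [-3, -2, -1, 0, 1, 2, 3]
y_nonmonotonic = [v ** 2 for v in x]   # fully determined by x, yet not monotonic
y_monotonic = [v ** 3 for v in x]      # monotonic in x

rho_nonmonotonic = spearman(x, y_nonmonotonic)
rho_monotonic = spearman(x, y_monotonic)
print(rho_nonmonotonic, rho_monotonic)
```

In the first case the target is completely determined by the predictor, yet the rank correlation gives no indication of it. A tree-based model can recover such a relationship through splits on the predictor, which is one reason correlation screening alone can discard useful variables.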
Among the topographic factors, elevation (7.38%) was the most important in SOC prediction. Its high importance suggested that topography influences SOC accumulation: high-altitude regions, with lower temperatures and reduced microbial activity, tend to retain more SOC [73,74].
Overall, position factors and optical RS variables dominated SOC predictions, while climate factors exhibited a significant contribution and topographic and SAR variables showed a relatively lower contribution. However, the inclusion of SAR variables enhanced the model prediction accuracy to some extent, especially for the VH polarization mode.

4.3. Research Limitations and Future Research

In this study, we demonstrated the predictive capability of the RF model in the study area and confirmed the potential of integrating multi-source RS data for SOC prediction. However, the study still had certain limitations and uncertainties. First, the SOC sampling depth range in this study was 0–20 cm, which is a commonly adopted range in similar studies; however, SOC typically exhibits significant vertical variation within the soil profile. In deeper soil layers (20–30 cm), SOC is more dependent on soil microorganisms and mineral levels [75]. Therefore, whether the selected features in this study can accurately predict SOC in deeper layers requires further investigation. Moreover, studies have shown that the prediction accuracy of the same model can vary for SOC at different depths; hence, it must be acknowledged that the RF model cannot be universally considered as the best-performing model under all environmental conditions [30]. Second, in order to minimize the impact of vegetation cover on spectral and radar signals, we selected sampling sites and imagery acquired in August. During this period, due to grazing and grass harvesting in the western grasslands of the Songnen Plain, the vegetation cover in some areas was relatively sparse, which helped improve the prediction accuracy. However, soil moisture and vegetation conditions are highly seasonal factors, which might affect the accuracy of SOC predictions at different time periods [76]. In future research, multi-temporal RS imagery could be used to further improve the prediction accuracy of SOC in the study area.

5. Conclusions

In this study, we developed and evaluated an approach for predicting SOC by integrating SAR data (S1), optical imagery (S2), and other environmental factors with three ML models. This approach was employed to predict SOC contents in the grasslands of the western Songnen Plain. Overall, the RF model demonstrated superior prediction accuracy and robustness compared to the SVM and XGBoost models. Multi-sensor data provided better predictive performance than data from a single sensor, and the RF model achieved the highest prediction accuracy with the combination of S1 and S2 data and other environmental factors. In addition, the combination of S2 data and environmental factors outperformed the combination of S1 data and environmental factors. In terms of variable importance, environmental factors and S2 data were identified as the principal factors influencing SOC predictions, followed by S1-derived predictors. This study provides a methodological reference for accurately monitoring grassland SOC using multi-source RS and ML techniques. Our findings provide scientific evidence that could inform policy guidance for the protection and sustainable use of grasslands in the western Songnen Plain. The framework developed in this study can be used for high-precision quantitative predictions of grassland SOC; however, its stability needs to be validated across broader geographic regions, and future research should further refine it across different seasons and management practices.

Author Contributions

Conceptualization, X.L.; methodology, H.L.; software, Y.B. and H.L.; resources, J.X. and Y.Y.; writing—original draft, H.L.; writing—review and editing, X.L.; visualization, J.X. and Y.Y.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 42171328.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

1. Raich, J.; Potter, C.; Bhagawati, D. Interannual variability in global soil respiration, 1980–94. Glob. Change Biol. 2002, 8, 800–812.
2. Bengtsson, J.; Bullock, J.M.; Egoh, B.; Everson, C.; Everson, T.; O’Connor, T.; Lindborg, R. Grasslands—More Important for Ecosystem Services than You Might Think. Ecosphere 2019, 10, e02582.
3. Lal, R. Digging Deeper: A Holistic Perspective of Factors Affecting Soil Organic Carbon Sequestration in Agroecosystems. Glob. Change Biol. 2018, 24, 3285–3301.
4. Wang, G.; Li, Y.; Fan, L.; Ma, X.; Mao, J. The Response of Soil Organic Carbon Content of Grasslands in Northern Xinjiang to Future Climate Change. Phys. Chem. Earth 2024, 134, 103576.
5. Schuman, G.E.; Janzen, H.H.; Herrick, J.E. Soil Carbon Dynamics and Potential Carbon Sequestration by Rangelands. Environ. Pollut. 2002, 116, 391–396.
6. Ma, L.; Wang, Q.; Shen, S.T. Response of Soil Aggregate Stability and Distribution of Organic Carbon to Alpine Grassland Degradation in Northwest Sichuan. Geoderma Reg. 2020, 22, e00309.
7. Zhang, H.; Wan, L.; Li, Y. Prediction of Soil Organic Carbon Content Using Sentinel-1/2 and Machine Learning Algorithms in Swamp Wetlands in Northeast China. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5219–5230.
8. Castaldi, F.; Chabrillat, S.; Chartin, C.; Genot, V.; Jones, A.; van Wesemael, B. Estimation of Soil Organic Carbon in Arable Soil in Belgium and Luxembourg with the LUCAS Topsoil Database. Eur. J. Soil Sci. 2018, 69, 592–603.
9. Soriano-Disla, J.M.; Janik, L.J.; Viscarra Rossel, R.A.; Macdonald, L.M.; McLaughlin, M.J. The Performance of Visible, Near-, and Mid-Infrared Reflectance Spectroscopy for Prediction of Soil Physical, Chemical, and Biological Properties. Appl. Spectrosc. Rev. 2014, 49, 139–186.
10. Tziolas, N.; Tsakiridis, N.; Ben-Dor, E.; Theocharis, J.; Zalidis, G. A Memory-Based Learning Approach Utilizing Combined Spectral Sources and Geographical Proximity for Improved VIS-NIR-SWIR Soil Properties Estimation. Geoderma 2019, 340, 11–24.
11. Liu, L.; Ji, M.; Buchroithner, M. Combining Partial Least Squares and the Gradient-Boosting Method for Soil Property Retrieval Using Visible Near-Infrared Shortwave Infrared Spectra. Remote Sens. 2017, 9, 1299.
12. Angelopoulou, T.; Tziolas, N.; Balafoutis, A.; Zalidis, G.; Bochtis, D. Remote Sensing Techniques for Soil Organic Carbon Estimation: A Review. Remote Sens. 2019, 11, 676.
13. Tsakiridis, N.; Tziolas, N.; Theocharis, J.; Zalidis, G. A Genetic Algorithm-Based Stacking Algorithm for Predicting Soil Organic Matter from VIS-NIR Spectral Data. Eur. J. Soil Sci. 2019, 70, 578–590.
14. Sadeghi, M.; Babaeian, E.; Tuller, M.; Jones, S.B. The Optical Trapezoid Model: A Novel Approach to Remote Sensing of Soil Moisture Applied to Sentinel-2 and Landsat-8 Observations. Remote Sens. Environ. 2017, 198, 52–68.
15. Muro, J.; Linstädter, A.; Magdon, P.; Wöllauer, S.; Männer, F.A.; Schwarz, L.-M.; Ghazaryan, G.; Schultz, J.; Malenovský, Z.; Dubovyk, O. Predicting plant biomass and species richness in temperate grasslands across regions, time, and land management with remote sensing and deep learning. Remote Sens. Environ. 2022, 282, 113262.
16. Bousbih, S.; Zribi, M.; Pelletier, C.; Gorrab, A.; Lili-Chabaane, Z.; Baghdadi, N.; Ben Aissa, N.; Mougenot, B. Soil Texture Estimation Using Radar and Optical Data from Sentinel-1 and Sentinel-2. Remote Sens. 2019, 11, 1520.
17. Solly, E.F.; Weber, V.; Zimmermann, S.; Walthert, L.; Hagedorn, F.; Schmidt, M.W.I. A Critical Evaluation of the Relationship Between the Effective Cation Exchange Capacity and Soil Organic Carbon Content in Swiss Forest Soils. Front. For. Glob. Change 2020, 3, 98.
18. Zhang, Y.; Sui, B.; Shen, H.; Ouyang, L. Mapping Stocks of Soil Total Nitrogen Using Remote Sensing Data: A Comparison of Random Forest Models with Different Predictors. Comput. Electron. Agric. 2019, 160, 23–30.
19. Nocita, M.; Stevens, A.; van Wesemael, B.; Aitkenhead, M.; Bachmann, M.; Barthès, B.G.; Ben-Dor, E.; Brown, D.J.; Clairotte, M.; Demattê, J.A.M.; et al. Soil Spectroscopy: An Alternative to Wet Chemistry for Soil Monitoring. Adv. Agron. 2015, 132, 139–159.
20. Yang, R.; Guo, W. Using Time-Series Sentinel-1 Data for Soil Prediction on Invaded Coastal Wetlands. Environ. Monit. Assess. 2019, 191, 446.
21. Wang, J.; Zhang, Y.; Guo, W.; Liu, L.; Wang, Y.; Zhang, X.; Yang, R.; Li, Y. Estimating Leaf Area Index and Aboveground Biomass of Grazing Pastures Using Sentinel-1, Sentinel-2 and Landsat Images. ISPRS J. Photogramm. Remote Sens. 2019, 154, 189–201.
22. Gao, Q.; Zribi, M.; Escorihuela, M.J.; Baghdadi, N. Synergetic Use of Sentinel-1 and Sentinel-2 Data for Soil Moisture Mapping at 100 m Resolution. Sensors 2017, 17, 1966.
23. Dabrowska-Zielinska, K.; Musial, J.; Malinska, A.; Budzynska, M.; Gurdak, R.; Kiryla, W.; Bartold, M.; Grzybowski, P. Soil Moisture in the Biebrza Wetlands Retrieved from Sentinel-1 Imagery. Remote Sens. 2018, 10, 1979.
24. Zhou, T.; Geng, Y.; Chen, J.; Pan, J.; Haase, D.; Lausch, A. High-Resolution Digital Mapping of Soil Organic Carbon and Soil Total Nitrogen Using DEM Derivatives, Sentinel-1 and Sentinel-2 Data Based on Machine Learning Algorithms. Sci. Total Environ. 2020, 729, 138244.
25. Zhou, T.; Geng, Y.; Chen, J.; Pan, J.; Haase, D.; Lausch, A. Prediction of Soil Organic Carbon and the C:N Ratio on a National Scale Using Machine Learning and Satellite Data: A Comparison between Sentinel-2, Sentinel-3 and Landsat-8 Images. Sci. Total Environ. 2021, 755, 142661.
26. Costa, E.; Tassinari, W.; Pinheiro, H.; Beutler, S.; dos Anjos, L. Mapping Soil Organic Carbon and Organic Matter Fractions by Geographically Weighted Regression. J. Environ. Qual. 2018, 47, 718–725.
27. Zhou, T.; Geng, Y.; Chen, J.; Liu, M.; Haase, D.; Lausch, A. Mapping Soil Organic Carbon Content Using Multi-Source Remote Sensing Variables in the Heihe River Basin in China. Ecol. Indic. 2020, 114, 106288.
28. Xu, M.; Chu, X.; Fu, Y.; Wang, C.; Wu, S. Improving the Accuracy of Soil Organic Carbon Content Prediction Based on Visible and Near-Infrared Spectroscopy and Machine Learning. Environ. Earth Sci. 2021, 80, 326.
29. Meliho, M.; Boulmane, M.; Khattabi, A.; Dansou, C.E.; Orlando, C.A.; Mhammdi, N.; Noumonvi, K.D. Spatial Prediction of Soil Organic Carbon Stock in the Moroccan High Atlas Using Machine Learning. Remote Sens. 2023, 15, 2494.
30. Mosaid, H.; Barakat, A.; John, K.; Faouzi, E.; Bustillo, V.; El Garnaoui, M.; Heung, B. Improved Soil Carbon Stock Spatial Prediction in a Mediterranean Soil Erosion Site Through Robust Machine Learning Techniques. Environ. Monit. Assess. 2024, 196, 130.
31. Keskin, H.; Grunwald, S.; Harris, W.G. Digital Mapping of Soil Carbon Fractions with Machine Learning. Geoderma 2019, 339, 40–58.
32. Wang, B.; Waters, C.; Orgill, S.; Gray, J.; Cowie, A.; Clark, A.; Liu, D.L. High Resolution Mapping of Soil Organic Carbon Stocks Using Remote Sensing Variables in the Semi-Arid Rangelands of Eastern Australia. Sci. Total Environ. 2018, 630, 367–378.
33. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: San Francisco, CA, USA, 2016; pp. 785–794.
34. Chen, H.; Ju, P.; Zhu, Q.; Xu, X.; Wu, N.; Gao, Y.; Feng, X.; Tian, J.; Niu, S.; Zhang, Y.; et al. Carbon and Nitrogen Cycling on the Qinghai-Tibetan Plateau. Nat. Rev. Earth Environ. 2022, 3, 701–716.
35. Wang, Y.; Lv, W.; Xue, K.; Wang, S.; Zhang, L.; Hu, R.; Zeng, H.; Xu, X.; Li, Y.; Jiang, L.; et al. Grassland Changes and Adaptive Management on the Qinghai-Tibetan Plateau. Nat. Rev. Earth Environ. 2022, 3, 668–683.
36. Lal, R. Soil Carbon Sequestration Impacts on Global Climate Change and Food Security. Science 2004, 304, 1623–1627.
37. Jobbágy, E.; Jackson, R. The Vertical Distribution of Soil Organic Carbon and Its Relation to Climate and Vegetation. Ecol. Appl. 2000, 10, 423–436.
38. Obu, J.; Lantuit, H.; Myers-Smith, I.; Heim, B.; Wolter, J.; Fritz, M. Effect of Terrain Characteristics on Soil Organic Carbon and Total Nitrogen Stocks in Soils of Herschel Island, Western Canadian Arctic. Permafr. Periglac. Process. 2017, 28, 92–107.
39. Wang, S.; Zhuang, Q.; Wang, Q.; Jin, X.; Han, C. Mapping Stocks of Soil Organic Carbon and Soil Total Nitrogen in Liaoning Province of China. Geoderma 2017, 305, 250–263.
40. Tsui, C.-C.; Chen, Z.-S.; Hsieh, C.-F. Relationships Between Soil Properties and Slope Position in a Lowland Rain Forest of Southern Taiwan. Geoderma 2004, 123, 131–142.
41. Griffiths, R.P.; Madritch, M.D.; Swanson, A.K. The Effects of Topography on Forest Soil Characteristics in the Oregon Cascade Mountains (USA): Implications for the Effects of Climate Change on Soil Properties. For. Ecol. Manag. 2009, 257, 1–7.
42. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
43. Wei, Y.; Zhang, X.; Hou, N.; Zhang, W.; Jia, K.; Yao, Y. Estimation of Surface Downward Shortwave Radiation over China from AVHRR Data Based on Four Machine Learning Methods. Sol. Energy 2019, 177, 32–46.
44. Menze, B.; Kelm, B.M.; Masuch, R.; Himmelreich, U.; Bachert, P.; Petrich, W.; Hamprecht, F.A. A Comparison of Random Forest and Its Gini Importance with Standard Chemometric Methods for the Feature Selection and Classification of Spectral Data. BMC Bioinform. 2009, 10, 213.
45. Strobl, C.; Boulesteix, A.L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. 2008, 9, 307.
46. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
47. Were, K.; Bui, D.; Dick, O.; Singh, B. A Comparative Assessment of Support Vector Regression, Artificial Neural Networks, and Random Forests for Predicting and Mapping Soil Organic Carbon Stocks across an Afromontane Landscape. Ecol. Indic. 2015, 52, 394–403.
48. Taghizadeh-Mehrjardi, R.; Neupane, R.; Sood, K.; Kumar, S. Artificial Bee Colony Feature Selection Algorithm Combined with Machine Learning Algorithms to Predict Vertical and Lateral Distribution of Soil Organic Matter in South Dakota, USA. Carbon Manag. 2017, 8, 277–291.
49. Ahmad, S.; Kalra, A.; Stephen, H. Estimating Soil Moisture Using Remote Sensing Data: A Machine Learning Approach. Adv. Water Resour. 2010, 33, 69–80.
50. Jebur, M.; Pradhan, B.; Tehrany, M. Manifestation of LiDAR-Derived Parameters in the Spatial Prediction of Landslides Using Novel Ensemble Evidential Belief Functions and Support Vector Machine Models in GIS. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 674–690.
51. Pham, T.; Yokoya, N.; Xia, J.; Ha, N.T.; Le, N.N.; Nguyen, T.T.T.; Dao, T.H.; Vu, T.T.P.; Pham, T.D.; Takeuchi, W. Comparison of Machine Learning Methods for Estimating Mangrove Above-Ground Biomass Using Multiple Source Remote Sensing Data in the Red River Delta Biosphere Reserve, Vietnam. Remote Sens. 2020, 12, 1334.
52. Osman, A.I.A.; Ahmed, A.N.; Chow, M.F.; Huang, Y.F.; El-Shafie, A. Extreme gradient boosting (XGBoost) model to predict the groundwater levels in Selangor Malaysia. Ain Shams Eng. J. 2021, 12, 1545–1556.
53. Pham, T.; Le, N.N.; Ha, N.T.; Nguyen, L.V.; Xia, J.; Yokoya, N.; To, T.T.; Trinh, H.X.; Kieu, L.Q.; Takeuchi, W. Estimating mangrove above-ground biomass using extreme gradient boosting decision trees algorithm with fused Sentinel-2 and ALOS-2 PALSAR-2 data in Can Gio Biosphere Reserve, Vietnam. Remote Sens. 2020, 12, 777.
54. Taghizadeh-Mehrjardi, R.; Schmidt, K.; Amirian-Chakan, A.; Rentschler, T.; Zeraatpisheh, M.; Sarmadian, F.; Valavi, R.; Davatgar, N.; Behrens, T.; Scholten, T. Improving the spatial prediction of soil organic carbon content in two contrasting climatic regions by stacking machine learning models and rescanning covariate space. Remote Sens. 2020, 12, 1095.
55. Wei, L.; Yuan, Z.; Wang, Z.; Zhao, L.; Zhang, Y.; Lu, X.; Cao, L. Hyperspectral inversion of soil organic matter content based on a combined spectral index model. Sensors 2020, 20, 2777.
56. Le, N.; Yokoya, N.; Ha, N.T.; Nguyen, T.T.T.; Tran, T.D.T.; Pham, T.D. Learning from multimodal and multisensor earth observation dataset for improving estimates of mangrove soil organic carbon in Vietnam. Int. J. Remote Sens. 2021, 42, 6866–6890.
57. Yang, R.-M.; Guo, W.-W. Modelling of soil organic carbon and bulk density in invaded coastal wetlands using Sentinel-1 imagery. Int. J. Appl. Earth Obs. Geoinf. 2019, 82, 101906.
58. Siewert, M. High-resolution digital mapping of soil organic carbon in permafrost terrain using machine learning: A case study in a sub-Arctic peatland environment. Biogeosciences 2018, 15, 1663–1682.
59. Khaledian, Y.; Miller, B. Selecting appropriate machine learning methods for digital soil mapping. Appl. Math. Model. 2020, 81, 401–418.
60. Padarian, J.; Minasny, B.; McBratney, A. Machine learning and soil sciences: A review aided by machine learning tools. Soil 2020, 6, 35–52.
61. Nguyen, T.; Pham, T.D.; Nguyen, C.T.; Delfos, J.; Archibald, R.; Dang, K.B.; Hoang, N.B.; Guo, W.; Ngo, H.H. A novel intelligence approach based active and ensemble learning for agricultural soil organic carbon prediction using multispectral and SAR data fusion. Sci. Total Environ. 2022, 804, 150187.
62. Zayani, H.; Fouad, Y.; Michot, D.; Kassouk, Z.; Baghdadi, N.; Vaudour, E.; Lili-Chabaane, Z.; Walter, C. Using Machine-Learning Algorithms to Predict Soil Organic Carbon Content from Combined Remote Sensing Imagery and Laboratory Vis-NIR Spectral Datasets. Remote Sens. 2023, 15, 4264.
63. Stevens, A.; Nocita, M.; Tóth, G.; Montanarella, L.; van Wesemael, B. Prediction of soil organic carbon at the European scale by visible and near infrared reflectance spectroscopy. PLoS ONE 2013, 8, e66409.
64. Wang, H.; Zhang, X.; Wu, W.; Liu, H. Prediction of soil organic carbon under different land use types using Sentinel-1/-2 data in a small watershed. Remote Sens. 2021, 13, 1229.
65. Asmuss, T.; Bechtold, M.; Tiemeyer, B. On the potential of Sentinel-1 for high resolution monitoring of water table dynamics in grasslands on organic soils. Remote Sens. 2019, 11, 1659.
66. Xu, L.; Wang, C.; Zhu, J.; Gao, Y.; Li, M.; Lv, Y.; Yu, G.; He, N. Latitudinal patterns and influencing factors of soil humic carbon fractions from tropical to temperate forests. J. Geogr. Sci. 2018, 28, 15–30.
67. Zhou, Z.; Wang, C.; Luo, Y. Meta-analysis of the impacts of global change factors on soil microbial diversity and functionality. Nat. Commun. 2020, 11, 3072.
68. Gholizadeh, A.; Žižala, D.; Saberioon, M.; Borůvka, L. Soil organic carbon and texture retrieving and mapping using proximal, airborne and Sentinel-2 spectral imaging. Remote Sens. Environ. 2018, 218, 89–103.
69. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.; Gao, X.; Ferreira, L. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 2002, 83, 195–213.
70. Meng, B.; Ge, J.; Liang, T.; Yang, S.; Gao, J.; Feng, Q.; Cui, X.; Huang, X.; Xie, H. Evaluation of remote sensing inversion error for the above-ground biomass of alpine meadow grassland based on multi-source satellite data. Remote Sens. 2017, 9, 372.
71. Gao, Q.; Zribi, M.; Escorihuela, M.; Baghdadi, N.; Segui, P. Irrigation mapping using Sentinel-1 time series at field scale. Remote Sens. 2018, 10, 1495.
72. Gregorutti, B.; Michel, B.; Saint-Pierre, P. Correlation and variable importance in random forests. Stat. Comput. 2017, 27, 659–678.
73. Li, Q.; Yue, T.-X.; Wang, C.-Q.; Zhang, W.-J.; Yu, Y.; Li, B.; Yang, J.; Bai, G.-C. Spatially distributed modeling of soil organic matter across China: An application of artificial neural network approach. Catena 2013, 104, 210–218.
74. Kokulan, V.; Akinremi, O.; Moulin, A.; Kumaragamage, D. Importance of terrain attributes in relation to the spatial distribution of soil properties at the micro scale: A case study. Can. J. Soil Sci. 2018, 98, 292–305.
75. Wei, B.; Wei, Y.; Zhang, H.; Guo, T.; Zhang, R.; Zhang, Y.; Liu, N. Mowing in Place of Conventional Grazing Increased Soil Organic Carbon Stability and Altered Depth-Dependent Protection Mechanisms. Catena 2025, 248, 108629.
76. Geng, J.; Tan, Q.; Lv, J.; Fang, H. Assessing Spatial Variations in Soil Organic Carbon and C:N Ratio in Northeast China’s Black Soil Region: Insights from Landsat-9 Satellite and Crop Growth Information. Soil Tillage Res. 2024, 235, 105897.
Figure 1. Distribution of the study area and sampling points.
Figure 1. Distribution of the study area and sampling points.
Agriculture 15 01640 g001
Figure 2. The flowchart for SOC prediction.
Figure 2. The flowchart for SOC prediction.
Agriculture 15 01640 g002
Figure 3. Correlation between SOC and predictive variables.
Figure 3. Correlation between SOC and predictive variables.
Agriculture 15 01640 g003
Figure 4. Scatter plots of observed versus predicted SOC.
Figure 4. Scatter plots of observed versus predicted SOC.
Agriculture 15 01640 g004
Figure 5. Relative importance of predictive variables in SOC prediction.
Figure 5. Relative importance of predictive variables in SOC prediction.
Agriculture 15 01640 g005
Figure 6. SOC contents in the grassland based on the RF model.
Table 1. Vegetation and soil indices.
| Index | Formula |
| --- | --- |
| Difference Vegetation Index (DVI) | B8 − B4 |
| Infrared Percentage Vegetation Index (IPVI) | B8/(B8 + B4) |
| Enhanced Vegetation Index (EVI) | 2.5 × (B8 − B4)/(B8 + 6 × B4 − 7.5 × B2 + 1) |
| Green Normalized Difference Vegetation Index (GNDVI) | (B8 − B3)/(B8 + B3) |
| Normalized Difference Vegetation Index (NDVI) | (B8 − B4)/(B8 + B4) |
| Normalized Difference Water Index (NDWI) | (B3 − B8)/(B3 + B8) |
| Ratio Vegetation Index (RVI) | B8/B4 |
| Soil-Adjusted Vegetation Index (SAVI) | (1 + L) × (B8 − B4)/(B8 + B4 + L) |

B2, B3, B4, and B8 are the Sentinel-2 blue, green, red, and near-infrared bands, respectively; L is the soil adjustment factor.
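The index formulas in Table 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' processing chain: the function name `spectral_indices` is hypothetical, the bands are assumed to be surface reflectance scaled to 0–1, the soil adjustment factor defaults to the commonly used L = 0.5 (an assumption, since no value is given here), and EVI is written in its standard form, which also uses the blue band B2.

```python
import numpy as np


def spectral_indices(b2, b3, b4, b8, L=0.5):
    """Compute the Table 1 indices from Sentinel-2 reflectance (assumed 0-1 scale).

    b2, b3, b4, b8: blue, green, red, and near-infrared bands (scalars or arrays).
    L: SAVI soil adjustment factor; 0.5 is an assumed default, not a value from the paper.
    """
    b2, b3, b4, b8 = (np.asarray(b, dtype=float) for b in (b2, b3, b4, b8))
    # Suppress warnings for zero denominators; such pixels become inf or nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        return {
            "DVI": b8 - b4,
            "IPVI": b8 / (b8 + b4),
            # Standard EVI form (assumption): gain 2.5, coefficients 6 and 7.5, offset 1.
            "EVI": 2.5 * (b8 - b4) / (b8 + 6.0 * b4 - 7.5 * b2 + 1.0),
            "GNDVI": (b8 - b3) / (b8 + b3),
            "NDVI": (b8 - b4) / (b8 + b4),
            "NDWI": (b3 - b8) / (b3 + b8),
            "RVI": b8 / b4,
            "SAVI": (1.0 + L) * (b8 - b4) / (b8 + b4 + L),
        }


# Illustrative reflectance values for a vegetated pixel (made up, not study data).
idx = spectral_indices(b2=0.04, b3=0.08, b4=0.06, b8=0.40)
```

Because every operation is element-wise, the same function works on whole raster bands loaded as 2-D arrays. Note that IPVI equals (NDVI + 1)/2 and NDWI is the negative of GNDVI, so these pairs carry the same information for a linear model.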
Table 2. Scenarios with different variable combinations.
| Scenario | Data Sources |
| --- | --- |
| 1 | S1, topography, position, and climate |
| 2 | S2, topography, position, and climate |
| 3 | All variables |
Table 3. SOC prediction accuracy of the three models under different scenarios.
| Machine Learning Algorithm | Scenario | R2 | RMSE (g/kg) | MAE (g/kg) | AICc | RPIQ |
| --- | --- | --- | --- | --- | --- | --- |
| RF | 1 | 0.52 | 4.13 | 3.27 | −475.99 | 5.78 |
| RF | 2 | 0.55 | 4.09 | 3.21 | −461.77 | 5.85 |
| RF | 3 | 0.58 | 4.09 | 3.23 | −464.89 | 5.88 |
| SVM | 1 | 0.54 | 4.45 | 3.36 | −382.62 | 5.78 |
| SVM | 2 | 0.55 | 4.55 | 3.47 | −386.97 | 5.81 |
| SVM | 3 | 0.56 | 4.89 | 3.64 | −370.28 | 5.23 |
| XGBoost | 1 | 0.25 | 5.03 | 3.86 | −40.77 | 4.67 |
| XGBoost | 2 | 0.46 | 4.25 | 3.23 | −392.25 | 5.75 |
| XGBoost | 3 | 0.49 | 4.26 | 3.35 | −389.91 | 5.72 |
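As a rough guide to how most of the accuracy measures in Table 3 are usually computed, the sketch below evaluates R2, RMSE, MAE, and RPIQ from paired observed and predicted values. It rests on assumptions that may differ from the authors' definitions: R2 is taken as the coefficient of determination (1 − SS_res/SS_tot), and RPIQ as the interquartile range of the observations divided by RMSE. AICc is left out because it depends on a parameter count that is not given here. The function name `regression_metrics` and the sample values are hypothetical.

```python
import numpy as np


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, and RPIQ for paired observed/predicted values."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    resid = obs - pred

    rmse = float(np.sqrt(np.mean(resid ** 2)))
    mae = float(np.mean(np.abs(resid)))

    # Coefficient of determination (assumed definition of R2).
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((obs - obs.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    # RPIQ: interquartile range of the observations relative to RMSE.
    q1, q3 = np.percentile(obs, [25, 75])
    rpiq = float((q3 - q1) / rmse)

    return {"R2": r2, "RMSE": rmse, "MAE": mae, "RPIQ": rpiq}


# Made-up SOC values (g/kg) for illustration only; not data from the study.
obs = [10.0, 12.0, 14.0, 16.0, 18.0]
pred = [11.0, 11.0, 15.0, 15.0, 19.0]
m = regression_metrics(obs, pred)
```

A higher RPIQ means the prediction error is small relative to the spread of the observed values, which is why it is reported alongside RMSE when comparing models.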
MDPI and ACS Style

Li, H.; Xia, J.; Yang, Y.; Bo, Y.; Li, X. Estimation of Soil Organic Carbon Content of Grassland in West Songnen Plain Using Machine Learning Algorithms and Sentinel-1/2 Data. Agriculture 2025, 15, 1640. https://doi.org/10.3390/agriculture15151640
