Detecting Drivers and Predicting Spatial Distribution of Soil Organic Carbon in an Arid Region Using Machine Learning

Chen, Guiren; Ge, Xianghe; Zhang, Zipeng; Han, Lijing

doi:10.3390/rs18040535

¹ Kunming General Survey of Natural Resources Center, China Geological Survey, Kunming 650100, China

² Technology Innovation Center for Natural Ecosystem Carbon Sink, Ministry of Natural Resources, Kunming 650100, China

³ Yunnan Technology Innovation Center of Natural Resources Carbon Sink Investigation and Carbon Asset Assessment, Kunming 650100, China

⁴ College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266590, China

⁵ College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi 800017, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(4), 535; https://doi.org/10.3390/rs18040535

Submission received: 23 October 2025 / Revised: 19 December 2025 / Accepted: 29 December 2025 / Published: 7 February 2026

(This article belongs to the Topic Advances in Multi-Scale Geographic Environmental Monitoring: Ecosystem Differences and Multi-Scale Comparisons)

Highlights

What are the main findings?

In the arid Akesai region, the Gradient Boosting model (R² = 0.675, RMSE = 1.304 g kg⁻¹) outperformed other machine learning algorithms in predicting SOC, with vegetation-type factors (NDVI_PC1) and clay content identified as the most influential positive drivers of SOC accumulation.
The spatial distribution of SOC exhibits clear heterogeneity, with higher concentrations in mountainous and valley areas, which is primarily governed by the synergistic interplay of vegetation dynamics, soil texture, and topographic features.

What are the implications of the main findings?

Methodologically, this study demonstrates that combining ensemble machine learning with interpretable SHAP analysis provides a robust and transparent framework for quantifying the contribution of multiple environmental factors to SOC variability in data-scarce arid regions.
Practically speaking, the identified key drivers and high-resolution spatial patterns provide a scientific basis for targeted land management, ecological restoration, and more accurate carbon stock assessment in arid ecosystems.

Abstract

Soil organic carbon (SOC) plays a critical role in the terrestrial carbon cycle, yet its spatial patterns and drivers in arid regions remain poorly understood. This study aims to clarify SOC distribution mechanisms in the Akesai region, where limited water–heat conditions and land use create high environmental heterogeneity. Four machine learning models were applied to predict SOC content and produce high-resolution spatial maps, and SHAP analysis was used to quantify the contributions of key environmental variables. The Gradient Boosting model had the best performance (R² = 0.675; RMSE = 1.304 g kg⁻¹), followed by XGBoost, LightGBM, and Random Forest. The results indicated that the main factors controlling SOC variation were NDVI, DEM, sand, clay, mean temperature, and ERVI. NDVI and clay content were positively associated with SOC accumulation, while sand content showed a negative effect. Spatially, higher SOC values were found in mountainous zones and vegetated valleys, while low SOC values were observed in flat, arid plains. These findings demonstrate that incorporating vegetation-type indicators substantially improves large-scale SOC estimation and enhances our understanding of SOC spatial dynamics and the driving mechanisms in arid environments. This provides a scientific basis for carbon-stock assessment and sustainable land management.

Keywords: soil organic carbon; remote sensing; machine learning; arid region; Akesai

1. Introduction

Soil organic carbon (SOC) is a critical determinant of soil fertility; it improves water retention and nutrient availability, thereby promoting crop growth. As a major carbon reservoir, SOC also plays a key role in mitigating climate change [1]. Its importance extends to ensuring global food security and addressing climate warming [2,3]. The United Nations Convention to Combat Desertification (UNCCD), which serves as the custodian agency for the Sustainable Development Goal (SDG) indicator 15.3.1, has identified SOC stock mapping as an essential metric for assessing progress toward achieving land degradation neutrality by 2030 [4,5]. This indicator is particularly significant in arid and semi-arid regions, where water scarcity is a limiting factor and insufficient SOC increases soil susceptibility to wind erosion [6,7]. Therefore, continuous monitoring of SOC content is vital for sustainable land management and climate resilience.

For SOC monitoring, traditional soil sampling and laboratory analysis are both costly and time-consuming, challenges that become particularly evident in large-scale studies [8,9]. Consequently, the use of remote sensing imagery and environmental variables for rapid SOC monitoring has emerged as an effective alternative. Geostatistical methods have long served as primary tools for spatial SOC estimation, with kriging—recognized as the best linear unbiased estimator—being widely applied to SOC prediction worldwide following various methodological advancements [10,11,12,13,14]. When sufficient SOC observations are available, these methods can yield highly accurate predictions [15,16]. Nevertheless, the high financial and temporal demands of field sampling and laboratory analysis continue to constrain their practical application [17,18,19]. Moreover, SOC exhibits strong spatial heterogeneity associated with landscapes, and neglecting these relationships can markedly reduce predictive accuracy [20,21]. In recent years, numerous environmental auxiliary variables have been incorporated into SOC mapping, including topography, land use, soil structure and texture, available water capacity, parent material, cation exchange capacity, and soil classification [22,23,24,25]. Most of these variables can be efficiently derived from remote sensing data, making remote sensing an economical and effective means for obtaining SOC-related information [26].

In recent years, machine-learning approaches for predicting the spatial variability of SOC have advanced rapidly, driving a shift in digital soil mapping from traditional, experience-based statistical models to data-driven frameworks capable of capturing complex nonlinear relationships. Previous studies have demonstrated that algorithms such as Random Forest (RF), Gradient Boosting Regression Trees (GBRT), eXtreme Gradient Boosting (XGBoost), and Support Vector Regression (SVR) can effectively characterize the pronounced spatial heterogeneity of SOC under the influence of multi-scale environmental drivers [27,28,29,30,31,32]. For instance, Wang et al. demonstrated in Northeast China that integrating soil environmental factors with spectral data in a random forest model improved soil type classification accuracy to 81.5%, significantly outperforming models based solely on spectral data [33]. Similarly, a study employing a boosted regression tree (BRT) model incorporating 300 years of cultivation history identified cultivation duration as a key predictor of SOC distribution in Northeast agricultural fields. Long-term cultivation was shown to reduce SOC by approximately 45%, underscoring its importance in future carbon stock assessments [34]. Furthermore, Li et al. integrated multi-source remote sensing data from Sentinel-1, Sentinel-2, and Sentinel-3 with topographic and climatic variables to generate digital SOC maps of the Aibi Lake Basin in Xinjiang. Their results further validated the effectiveness of combining remote sensing and machine learning for SOC estimation [30].

RF models have demonstrated strong performance in constructing spatiotemporal models linking SOC with environmental factors, providing a robust tool for assessing historical and future changes in SOC stocks across China [35]. Subsequent studies combining random forest with process-based models such as RothC have further enhanced the explanatory and predictive power of SOC dynamics under warming scenarios [36]. Additionally, algorithms such as LightGBM have shown promising performance in nationwide analyses of extreme climate impacts on SOC [37]. On the other hand, partial least squares–support vector machines, which integrate the strengths of both linear and nonlinear modelling, have shown strong performance in estimating soil organic matter from visible–near-infrared spectroscopy, particularly in complex environments such as saline or sodic soils [38,39]. These machine learning-based digital soil mapping approaches offer crucial technical support for monitoring soil carbon dynamics and understanding climate change responses from regional to national scales.

Despite these advances, the mechanisms by which different covariates shape the spatial variability of SOC remain insufficiently quantified, particularly across regions with heterogeneous vegetation cover. The integration of multi-source remote sensing data and multidimensional environmental factors further highlights this gap, as the nonlinear interactions among covariates are still poorly characterized. Addressing these limitations requires a more systematic covariate framework and comparative model evaluation to identify key drivers and improve SOC prediction accuracy. To address this gap, the present study introduces an integrated approach that combines multi-source remote sensing data, including Sentinel satellite imagery, to systematically derive environmental covariates such as soil properties, vegetation indices, topographic features, and climatic variables. A key innovation of this research is the incorporation of the vegetation-type factors derived from the time-series characteristics of vegetation indices, together with the development of four machine learning models for comparative SOC prediction analysis. This study aims to develop an integrated multivariate framework incorporating soil properties, topography, climate, and time-series vegetation characteristics to systematically examine the influence of these factors on the spatial variability of SOC. By comparing multiple machine learning models, it focuses on evaluating the role of temporal vegetation features in improving the accuracy of SOC predictions. The work ultimately seeks to identify the optimal approach for high-resolution SOC mapping in the Akesai region.

2. Materials and Methods

2.1. Study Area

Aksai Kazakh Autonomous County is located in the northwestern part of Gansu Province (Figure 1), south of Jiuquan City (38°10′–39°42′N, 92°14′–96°68′E). It borders the Su Bei Mongolian Autonomous County to the east, Dunhuang City to the north, the Xinjiang Uygur Autonomous Region across the Gobi Desert to the west, and Qinghai Province across the Seshengteng Mountains to the south. The county covers an area of approximately 31,400 km², extending about 425 km from east to west and 125 km from north to south. It is situated on the western edge of the Hexi Corridor and the northern margin of the Tibetan Plateau. The region features a complex, elongated terrain, with elevations ranging from 1289 m to 5805 m and an average altitude of around 3200 m. The landscape is dominated by the Altyn Tagh, Danghe Nan, and Seshengteng mountain ranges, interspersed with extensive areas of the Gobi Desert. In contrast, the plateau basins and the Sugan Lake area exhibit relatively flat topography at elevations between 2800 m and 3000 m.

Aksai lies within the high-cold zone of the Tibetan Plateau, deep in the interior of the Eurasian Continent. Its climate is primarily influenced by Mongolian high-pressure continental air masses and exhibits typical high-cold, semi-arid continental characteristics. The study region has distinctive soil and ecosystem characteristics that are closely linked to the spatial variability of SOC. Soils fall into three main groups: aeolian sandy soils, desert steppe soils, and alpine meadow soils, reflecting pronounced climatic and topographic gradients. Sandy and loamy sandy soils dominate the Gobi and alluvial plains. In contrast, finer-textured loam and silty loam are more prevalent in mountainous and high-altitude meadow areas, where organic matter inputs are comparatively higher. Land use is primarily limited to natural grasslands, sparse desert vegetation, and alpine meadows, with oasis agriculture occurring only in river valleys and lake basins. Vegetation cover varies markedly across the landscape: desert and semi-desert shrubs prevail in low-elevation Gobi zones; temperate and alpine grasslands dominate the montane foothills; and short-growing-season, high-cold alpine meadows characterize the southern high-elevation mountains. These variations in soil type, texture, land use, and vegetation significantly affect the distribution of SOC across the Akesai region.

2.2. Soil Sampling and RS Data Acquisition and Preprocessing

On 17 June 2024, a total of 207 surface soil samples (0–20 cm) were collected across Aksai Kazakh Autonomous County. Sampling sites were selected based on accessibility as well as their representativeness of local land-use types and soil characteristics. The study area includes mountainous and difficult-to-access terrain, which restricted sampling in certain locations. All samples were air-dried, gently ground, and sieved to ≤2 mm to remove coarse fragments prior to laboratory analysis. SOC content was quantified using the Walkley–Black dichromate oxidation method, ensuring consistency with widely adopted soil carbon assessment protocols [31].

Environmental covariates were derived using the Google Earth Engine (GEE) platform, which provides access to extensive Earth observation datasets and cloud-based computational resources. The study incorporated 10 m Sentinel-2 multispectral imagery, the Shuttle Radar Topography Mission (SRTM) digital elevation model (DEM), ISRIC SoilGrids soil property data, and CHIPR climate datasets. Soil texture attributes were extracted from SoilGrids, topographic variables were derived from SRTM DEM, and climate indicators from CHIPR covered the 2004–2024 period.

Sentinel-2 scenes with <20% cloud cover during the study period were first selected. Cloud-contaminated pixels were removed using GEE’s built-in cloud-masking algorithms, after which a minimum-cloud composite image was produced using a mean-compositing procedure [17]. From this composite, multiple vegetation and spectral indices were calculated. Following the approach of Guo and Liu et al. [40,41], a vegetation-type factor was further derived from the Sentinel-2 time-series products. To ensure spatial consistency across all environmental covariates, all rasters were standardized and resampled to 10 m spatial resolution using GEE’s nearest-neighbor algorithm [9]. The complete list of environmental covariates used in this study is provided in Table 1.
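As a rough illustration of how a time-series vegetation factor could be obtained, the sketch below computes NDVI per acquisition date and projects each pixel's NDVI series onto its first principal component. The text above does not spell out the derivation, so reading NDVI_PC1 as the first principal component of an NDVI time series is an assumption, as are the reflectance values, the pixel labels, and the function names.

```python
import math

def ndvi(nir, red):
    """Normalized difference vegetation index for a single observation."""
    total = nir + red
    return (nir - red) / total if total else 0.0

def first_principal_component(rows, n_iter=200):
    """Return (loadings, scores) of PC1 for a pixels-by-dates matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance between acquisition dates (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the leading eigenvector of cov.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Fix the arbitrary sign so that greener pixels get higher scores.
    if sum(v) < 0:
        v = [-x for x in v]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical (NIR, red) reflectance pairs on four dates for three pixels.
reflectance = {
    "meadow": [(0.30, 0.10), (0.45, 0.08), (0.50, 0.07), (0.35, 0.09)],
    "steppe": [(0.25, 0.15), (0.30, 0.14), (0.32, 0.13), (0.26, 0.15)],
    "gobi":   [(0.22, 0.20), (0.23, 0.20), (0.23, 0.21), (0.22, 0.20)],
}
names = list(reflectance)
series = [[ndvi(nir, red) for nir, red in reflectance[name]] for name in names]
loadings, scores = first_principal_component(series)
pc1 = dict(zip(names, scores))
print(pc1)
```

In this toy case the PC1 score orders the pixels from sparse to dense vegetation, which is the kind of gradient a vegetation-type factor is meant to summarize.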

2.3. Variable Selection

To obtain a parsimonious and efficient model, we applied recursive feature elimination with cross-validation (RFECV) for variable selection [42]. RFECV iteratively evaluates the contribution of each predictor based on model-derived importance scores and removes the least informative features [43]. This pruning continues until cross-validation performance reaches its optimum, thereby identifying the optimal number and combination of features [44].

We applied RFECV to all datasets for feature selection and subsequently used the top-ranked features for model training and prediction. The results of feature selection using RFECV are shown in Figure 2. The features Sand, Clay, DEM, Temp_2004_2024, NDVI_PC1, and ERVI ranked highest; therefore, these six features were used for predictive mapping in this study.
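The elimination loop described in this section can be sketched in a few lines. This is a simplified illustration, not the study's configuration: it assumes an ordinary least-squares model in place of a tree-based estimator, uses absolute coefficients on unit-variance synthetic features as importance scores, and ranks features on the full data rather than within each fold. All data and function names are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(p)]
    return solve(A, b)

def predict(coef, X):
    return [coef[0] + sum(c * x for c, x in zip(coef[1:], row)) for row in X]

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def take(X, rows, cols):
    """Sub-matrix of X restricted to the given rows and columns."""
    return [[X[i][j] for j in cols] for i in rows]

def rfe_cv(X, y, k=5):
    """Return (best CV RMSE, feature subset) found by backward elimination."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    active = list(range(len(X[0])))
    history = []
    while active:
        # Score the current subset by k-fold cross-validation.
        errors = []
        for test_rows in folds:
            held_out = set(test_rows)
            train_rows = [i for i in range(n) if i not in held_out]
            coef = fit_ols(take(X, train_rows, active), [y[i] for i in train_rows])
            errors.append(rmse([y[i] for i in test_rows],
                               predict(coef, take(X, test_rows, active))))
        history.append((sum(errors) / k, list(active)))
        # Drop the feature with the smallest absolute coefficient.
        coef = fit_ols(take(X, range(n), active), y)
        weakest = min(range(len(active)), key=lambda a: abs(coef[a + 1]))
        active.pop(weakest)
    return min(history, key=lambda h: h[0])

# Synthetic data: only features 0 and 2 carry signal.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(120)]
y = [3.0 * row[0] - 2.0 * row[2] + rng.gauss(0, 0.5) for row in X]
best_rmse, selected = rfe_cv(X, y)
print(best_rmse, selected)
```

The subset with the lowest cross-validated error is kept, which mirrors the stopping rule described above: pruning is judged by held-out performance rather than by training fit.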

2.4. Modeling Framework

2.4.1. Gradient Boosting Trees (GBT)

GBT is an ensemble learning method based on gradient boosting that incrementally fits weak learners to the residuals of preceding models [45]. In this study, GBT was used to model the nonlinear relationships between SOC and environmental covariates, including vegetation indices, topography, soil texture, and climate variables. We used GBT to assess its ability to capture multi-scale environmental gradients and its comparative advantage in SOC prediction accuracy over other models [46].
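To make the residual-fitting idea concrete, here is a minimal sketch of gradient boosting for squared error with one-split regression trees (stumps) as weak learners. It is a teaching example under stated assumptions, not the implementation used in the study: the two predictors, the SOC values, the learning rate, and the number of rounds are all hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single split (feature, threshold) that minimizes squared error."""
    mean = sum(residuals) / len(residuals)
    # Fallback when no split exists: predict the mean everywhere.
    best = (float("inf"), 0, float("inf"), mean, mean)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]

def stump_predict(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value

def fit_gbt(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * stump_predict(stump, row)
                for p, row in zip(pred, X)]
    return base, learning_rate, stumps

def gbt_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)

def mse(model, X, y):
    return sum((gbt_predict(model, r) - t) ** 2 for r, t in zip(X, y)) / len(y)

# Hypothetical samples: [vegetation factor, sand fraction] -> SOC (g/kg).
X = [[0.1, 0.80], [0.2, 0.70], [0.3, 0.75], [0.4, 0.60],
     [0.5, 0.50], [0.6, 0.55], [0.7, 0.40], [0.8, 0.30]]
y = [2.5, 3.0, 3.2, 4.1, 5.0, 5.2, 6.4, 7.5]

baseline = sum((t - sum(y) / len(y)) ** 2 for t in y) / len(y)
mse_1 = mse(fit_gbt(X, y, n_rounds=1), X, y)
mse_50 = mse(fit_gbt(X, y, n_rounds=50), X, y)
print(baseline, mse_1, mse_50)
```

Each round shrinks the training error a little, and the learning rate controls how much of each stump's correction is applied. Production libraries add deeper trees, subsampling, and early stopping on top of this loop.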

2.4.2. Random Forest (RF)

RF constructs multiple decision trees using bootstrapped samples and random feature selection during node splitting [47]. In this study, RF was applied to evaluate the capacity of traditional ensemble tree methods to predict SOC using multi-source environmental covariates. Its inherent resistance to overfitting and ability to handle heterogeneous inputs enabled us to test how well a classical bagging-based approach performs relative to boosting-based models when mapping SOC across the environmentally diverse landscapes of Aksai. RF also provided an alternative framework for assessing feature contributions, allowing comparison of covariate importance patterns across different model families [48].

2.4.3. eXtreme Gradient Boosting (XGBoost)

XGBoost is an optimized gradient boosting algorithm incorporating regularization and parallel computation to improve predictive efficiency [49]. In this study, XGBoost was applied to explore how advanced boosting techniques handle the heterogeneous environmental conditions of the Aksai region [50]. Its ability to manage sparse, irregular, and multi-source remote sensing predictors allowed us to evaluate whether additional algorithmic optimizations resulted in better SOC estimation than conventional GBT.

2.4.4. Light Gradient Boosting Machine (LGBM)

LGBM implements a histogram-based tree-learning approach and a leaf-wise growth strategy to accelerate training while preserving accuracy [51]. We used LGBM to assess model scalability and computational efficiency when integrating high-resolution covariates. Its fast training speed allowed for extensive hyperparameter tuning and cross-validation, enabling us to test whether lightweight boosting architectures could achieve comparable SOC prediction performance with reduced computational cost [52].

2.5. Model Interpretability Using SHAP Analysis

To assess the contribution of each feature to model predictions, this study employs SHapley Additive exPlanations (SHAP) analysis alongside feature importance evaluation. SHAP interprets machine learning model outputs based on Shapley values from game theory, assigning a contribution value to each feature and thereby clarifying the model's predictive mechanisms [53,54]. A positive SHAP value indicates that a feature raises the model's prediction for a given sample relative to the average prediction, whereas a negative value indicates that it lowers the prediction [55].
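The Shapley attribution underlying SHAP can be computed exactly for a small model, which helps show what a SHAP value measures. The sketch below enumerates all feature coalitions and fills absent features from a background sample. The toy model, its coefficients, and the background rows are hypothetical. Tree-based SHAP explainers use faster algorithms suited to tree ensembles, so this brute-force version is for intuition only.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of model at x, plus the baseline prediction."""
    p = len(x)

    def value(subset):
        # Mean prediction when only `subset` features take the values in x
        # and the remaining features come from each background row.
        total = 0.0
        for ref in background:
            mixed = [x[j] if j in subset else ref[j] for j in range(p)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        contribution = 0.0
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {j}) - value(s))
        phi.append(contribution)
    return phi, value(set())

def toy_soc(features):
    """Hypothetical SOC model with one interaction term (illustration only)."""
    veg, sand, clay = features
    return 2.0 + 5.0 * veg - 3.0 * sand + 2.0 * veg * clay

background = [[0.2, 0.7, 0.10], [0.4, 0.6, 0.15],
              [0.5, 0.5, 0.20], [0.7, 0.4, 0.30]]
x = [0.6, 0.3, 0.25]

phi, base = shapley_values(toy_soc, x, background)
for name, contribution in zip(["veg", "sand", "clay"], phi):
    print(f"{name}: {contribution:+.3f}")
print("baseline:", round(base, 3), "prediction:", round(toy_soc(x), 3))
```

The contributions add up to the difference between the prediction for `x` and the baseline prediction, which is why SHAP values can be read as a signed decomposition of each individual prediction.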

2.6. Model Evaluation

In this study, the root mean square error (RMSE), mean absolute error (MAE), coefficient of determination (R²), and the ratio of performance to interquartile distance (RPIQ) were used to assess the accuracy of model prediction and validation. The corresponding equations are presented in Equations (1)–(4).

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}{n}} \tag{1}$$

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |\hat{y}_{i} - y_{i}|}{n} \tag{2}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}} \tag{3}$$

$$\mathrm{RPIQ} = \frac{\mathrm{IQR}}{\mathrm{RMSE}} \tag{4}$$

where $\hat{y}_{i}$ and $y_{i}$ represent the model estimates and the actual values, respectively; $\bar{y}$ denotes the mean of the true values; and $n$ is the number of data points. The interquartile range (IQR), defined as the difference between the third and first quartiles (Q3–Q1), represents the spread of the central 50% of the data.
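For readers who want to reproduce the four evaluation metrics, a minimal sketch follows. The quartile convention is an assumption (the "inclusive" method of Python's `statistics.quantiles`), since the text does not state which one was used, and the sample values are made up.

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R2 and RPIQ for paired observations and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # IQR of the observed values; the quartile method is an assumption.
    q1, _, q3 = statistics.quantiles(y_true, n=4, method="inclusive")
    rpiq = (q3 - q1) / rmse
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "RPIQ": rpiq}

# Made-up observations and predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.0, 6.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Because RPIQ divides the interquartile range by RMSE, it stays comparable across datasets with skewed target distributions, where a standard-deviation-based ratio would be inflated by a few large values.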

The overall workflow of data preprocessing, feature extraction, model construction, and SOC prediction is summarized in Figure 3.

3. Results

3.1. Descriptive Statistics of Soil Organic Carbon

Descriptive statistics for SOC across the 207 samples are summarized in Table 2. SOC had a mean of 4.728 g kg⁻¹, a standard deviation of 2.208 g kg⁻¹, and a coefficient of variation of 0.467, indicating moderate variability [56,57]. The distribution was mildly right-skewed (skewness = 0.784) with elevated kurtosis (kurtosis = 1.021), suggesting that most observations clustered at lower values, while a few samples attained relatively high values. Together, these statistics reflect notable spatial heterogeneity in SOC across the region.
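The summary statistics above are simple to compute, and the reported coefficient of variation can be checked directly from the reported mean and standard deviation. In the sketch below, the moment-based skewness and excess-kurtosis formulas are one common convention and may differ from the software used in the study; the sample values are hypothetical.

```python
import math

def describe(values):
    """Mean, sample SD, CV, and moment-based skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    sd = math.sqrt(sum(d * d for d in dev) / (n - 1))
    m2 = sum(d ** 2 for d in dev) / n
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    return {
        "mean": mean,
        "sd": sd,
        "cv": sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

# Consistency check on the reported values: CV = SD / mean.
reported_cv = round(2.208 / 4.728, 3)
print(reported_cv)

# Hypothetical right-skewed SOC sample (g/kg).
sample = [2.1, 3.0, 3.4, 4.2, 4.8, 5.5, 9.6]
stats = describe(sample)
print(stats)
```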

Predictor selection was conducted using recursive feature elimination with cross-validation (Figure 2). Pearson correlation analysis was then performed among Sand, Clay, DEM, Temp_2004_2024, NDVI_PC1, ERVI, and SOC (Figure 4). Histograms along the diagonal illustrate the marginal distributions of each variable. Scatterplots with fitted regression lines in the lower triangle provide a visual assessment of pairwise relationships. The upper triangle reports Pearson’s correlation coefficients (r) and significance levels, with color indicating the sign and strength of the correlations (purple for negative, green for positive, and deeper hues for stronger associations). Sand content was strongly and significantly negatively correlated with terrain elevation (r = −0.83, p < 0.001); temperature and DEM height were very strongly negatively correlated (r = −0.97, p < 0.001); and clay content displayed a weak but significant positive correlation with SOC (r = 0.27, p < 0.001).
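As a reminder of what the coefficients in the correlation matrix measure, Pearson's r can be computed as below. The elevation and temperature values are invented to mimic a strong negative relationship and are not taken from the study data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented values: temperature falls as elevation rises.
elevation_m = [1500, 2000, 2600, 3100, 3700, 4200]
temperature_c = [8.1, 5.6, 2.0, -0.9, -4.2, -7.5]
r = pearson_r(elevation_m, temperature_c)
print(round(r, 3))
```

A coefficient this close to −1 between two predictors signals that they carry largely overlapping information, which is worth keeping in mind when interpreting their separate importance scores.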

3.2. Comparison of Model Accuracy

We used four machine learning algorithms to predict SOC and evaluated their performance using R², RMSE, MAE, and RPIQ (Table 3). On the training set, all models showed strong goodness-of-fit. GBT performed best, reaching an R² of 0.790 and an RPIQ of 3.558, accompanied by low RMSE (1.040) and MAE (0.809). XGBoost followed closely, with an R² of 0.761 and an RMSE of 1.111. RF, by contrast, showed a markedly lower accuracy, achieving an R² of only 0.614.

On the test set, GBT again yielded the most accurate predictions, with R² of 0.675 and RPIQ of 2.837, together with reduced error levels (RMSE = 1.304; MAE = 0.975). XGBoost exhibited similar generalization performance, reaching R² of 0.658 and RPIQ of 2.765. LGBM achieved intermediate accuracy, with an R² of 0.588 and RPIQ of 2.519, whereas RF remained the least accurate model, showing the lowest R² (0.516) and the highest errors (RMSE = 1.591; MAE = 1.233). Figure 5 displays test-set scatterplots, where point color denotes absolute prediction error. Errors were generally small across low-to-moderate SOC values but increased at higher SOC levels, suggesting greater model uncertainty in regions with elevated SOC.

Overall, GBT and XGBoost demonstrated the strongest robustness and generalization capacity in SOC modeling, with GBT achieving the best performance and RF the weakest. These results highlight the ability of gradient boosting frameworks to capture the spatial variability of SOC and its nonlinear relationships with environmental covariates.

3.3. Feature Importance

SHAP analyses for all models are presented in Figure 6 and Figure A1. Overall, the relative contributions of predictors to SOC varied, with the time-series-derived vegetation-type factor NDVI_PC1 consistently emerging as the dominant driver. NDVI_PC1 exhibited the highest mean absolute SHAP values across all four models, indicating that vegetation-type factors serve as the primary control on SOC spatial distribution. DEM and sand content ranked next, emphasizing the importance of topography and soil texture in SOC accumulation and distribution. Temp_2004_2024 and ERVI made smaller but consistent contributions across models, reflecting the stable influences of climatic background and vegetation condition. Clay showed generally lower SHAP values but retained explanatory power for a subset of samples.

The SHAP value distributions revealed that NDVI_PC1 exerted strong positive effects on SOC, with greater vegetation coverage and higher spectral indices associated with increased SOC. Sand content exhibited a predominantly negative effect, as higher sand fractions corresponded to lower SOC levels. DEM and temperature demonstrated nonlinear relationships, with intermediate elevations and moderate thermal conditions favoring SOC accumulation. Although clay contributed less overall, it displayed positive effects at certain sites.

Collectively, these SHAP results elucidate the relative importance and directionality of environmental controls on SOC, identifying NDVI_PC1 as the most influential variable across all four models.

3.4. Soil Organic Carbon Mapping

Figure 7 illustrates the spatial predictions of SOC in the Akesai region derived from four machine learning models. Overall, all models effectively captured the spatial distribution of SOC, with predicted values ranging from 2.292 to 13.847 g kg⁻¹, revealing pronounced spatial heterogeneity.

High SOC contents were primarily concentrated in mountainous and river-valley areas with higher elevations and dense vegetation cover, whereas low SOC values were mainly distributed across arid plains with gentle terrain and coarser soil textures. Across the predictive models, GBT produced spatial patterns that aligned most closely with known environmental gradients in the region. This assessment is based on the clearer correspondence between its predicted SOC transitions and observed variations in elevation, vegetation cover, and moisture conditions, as well as the reduced patchiness and smoother spatial continuity visible in its prediction maps. XGBoost generated a broadly similar spatial structure, though the extent of high-SOC zones appeared slightly narrower. LGBM yielded comparable results but showed localized overestimation—particularly in the low-lying western part of the study area and in small depressions or flat intermountain basins—where sparse vegetation and coarse-textured soils increase model sensitivity to input variability.

In contrast, the RF model generated generally lower SOC predictions, with noticeably reduced high-SOC zones. Its overly smoothed spatial pattern suggests limited capacity to represent complex nonlinear relationships. A distinct high-SOC zone appeared in the central-western part of the study area, where SOC values exceeded those of surrounding regions. Although this area lies at relatively low elevation, it may experience higher soil moisture due to topographic convergence or riverine influence, which can enhance vegetation growth and organic matter input, thereby increasing SOC content.

Overall, high SOC values were mainly observed in alpine grasslands and forested areas, as well as in low-lying wetlands and riverine zones, where this study may underestimate SOC levels. In summary, models based on gradient boosting frameworks (GBT and XGBoost) demonstrated superior robustness and spatial resolution in SOC mapping, providing a more accurate representation of SOC spatial variability in the Akesai region. These findings further corroborate the superior predictive performance of the GBT model indicated by the quantitative analyses.

4. Discussion

This study compares four machine learning approaches for predicting SOC in the Aksai region through a systematic evaluation along two dimensions—predictive accuracy and feature contributions—to produce SOC predictions and spatial distribution maps. We first conducted correlation analyses between SOC measurements and selected covariates to elucidate potential relationships. Subsequently, four predictive models were trained to estimate SOC, and SHAP values were applied to quantify the importance of individual features, thereby interpreting their contributions and effects on SOC variation. A detailed comparison of feature importances was then performed to identify the most informative predictors, followed by a discussion of the sources of uncertainty inherent in the analysis.

4.1. Comparing the Effects of Different Characteristics on Soil Organic Carbon Mapping

This study employed four machine learning models to map SOC spatial patterns in the Aksai region. Among the models, GBT performed best, followed closely by XGBoost, with test-set R² values of 0.675 and 0.658, respectively. LGBM (0.588) and RF (0.516) were less accurate. The two best models, with test-set R² between 0.65 and 0.68, outperformed several SOC remote sensing studies conducted in arid regions [30]. These results indicate that integrating vegetation, topography, climate, and soil physicochemical attributes can substantially enhance model interpretability and generalization in dryland environments. The performance differences among models largely stem from their ability to capture nonlinear structures and feature interactions, with GBT and XGBoost being particularly effective, consistent with previous findings on the advantages of boosting approaches in digital soil mapping [58].

A SHAP-based interpretation of the models shows that vegetation-related information, particularly the dominant gradient captured by NDVI_PC1, plays a central role in shaping SOC predictions. Within the study area, variations in vegetation productivity and cover therefore exert a stronger influence on SOC spatial variability than topographic, textural, or climatic factors. SHAP dependence plots further revealed a strong positive relationship between NDVI_PC1 and SOC, with areas characterized by denser and greener vegetation exhibiting higher SOC levels. These results are consistent with previous studies showing that vegetation-type factors are among the most informative predictors of SOC [59,60,61]. For example, He et al. demonstrated that incorporating phenological parameters from Sentinel-2 time-series data significantly improved the prediction of SOC in croplands. Their model identified key metrics, such as seasonal maximum EVI and the rate of green-up, which reflect crop growth dynamics linked to SOC variation [61]. This supports the utility of vegetation-related proxies, like the NDVI_PC1 used in our study, for capturing agricultural influences on SOC.

SHAP-based interpretability further revealed that NDVI_PC1 is the dominant predictor across all models, with the highest mean absolute SHAP value, substantially exceeding ERVI, DEM, soil texture, and long-term temperature metrics. This finding aligns with existing evidence that vegetation indices and their principal components are often the most influential variables in SOC prediction for arid and semi-arid ecosystems [62,63]. In addition, vegetation cover reduces wind erosion and surface exposure—effects that are particularly critical in regions like Aksai—thereby preventing carbon loss from topsoil [64,65]. Thus, the high importance of NDVI_PC1 reflects not only statistical associations but also the central role of vegetation–soil carbon feedbacks in dryland carbon cycling.

Beyond vegetation effects, DEM and long-term temperature also exhibited significant but nonlinear influences on SOC, aligning with regional studies showing that mid-elevation and moderate-temperature conditions tend to favor SOC accumulation [66,67]. These patterns are likely mediated by variations in decomposition rates, microclimatic stability, and erosion sensitivity. Regarding soil texture, sand content showed the strongest negative effect, whereas clay, though globally less influential, exerted positive local effects. These results agree with established evidence that sandy soils reduce water and nutrient retention, while clay particles enhance SOC stabilization through adsorption and physical protection [68], reflecting the scale-dependent nature of texture effects.

Overall, this study underscores the central role of NDVI_PC1 in shaping SOC distribution in arid ecosystems and highlights the dominance of vegetation-related variables over other environmental factors. Together with previous findings, our results suggest that vegetation-focused management strategies, such as grazing control, vegetation restoration, and ecological engineering, are important for enhancing SOC storage, mitigating wind erosion, and improving ecosystem resilience in the Aksai region. These insights offer guidance for SOC conservation and sustainable land management in arid and semi-arid environments.

4.2. Uncertainty Analysis of the Current Study

Although this study achieved satisfactory results in predicting and mapping SOC in the Aksai region, several sources of uncertainty remain. First, model performance differs across algorithms: GBT and XGBoost capture spatial heterogeneity more effectively, whereas RF tends to oversmooth predictions, indicating that the mapped SOC patterns are sensitive to model choice. Second, SOC sampling points are mainly distributed in accessible areas, so remote mountainous zones and certain land-use types are under-represented, which may introduce higher uncertainty in poorly sampled regions. Third, although GBT performs well overall, the test-set R² values of the four models (0.516–0.675) leave considerable variance unexplained, suggesting that the models do not fully capture the complex environmental controls on SOC. Finally, predictions near the study-area boundaries or in data-sparse areas fluctuate more, suggesting potential edge effects.
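
One way to gauge how clustered sampling affects accuracy estimates is spatially blocked cross-validation, in which nearby samples are always held out together. The sketch below shows only the fold-assignment step. The coordinates, the 10 km block size, and the three folds are hypothetical values chosen for illustration, and this procedure was not applied in the present study.

```python
from collections import defaultdict


def spatial_block_folds(coords, block_size, n_folds):
    """Assign samples to folds so that all samples in one grid block share a fold."""
    blocks = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        key = (int(x // block_size), int(y // block_size))
        blocks[key].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place the largest blocks first, each into the currently smallest fold,
    # so that fold sizes stay roughly balanced.
    for key in sorted(blocks, key=lambda k: (-len(blocks[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(blocks[key])
    return folds


# Hypothetical sample coordinates in km (easting, northing)
coords = [
    (1.0, 1.5), (2.0, 3.0), (1.2, 2.2), (12.0, 3.0), (14.5, 1.0),
    (25.0, 22.0), (26.0, 21.0), (3.0, 27.0), (18.0, 18.0), (11.0, 12.0),
]
folds = spatial_block_folds(coords, block_size=10.0, n_folds=3)
for number, members in enumerate(folds):
    print(f"fold {number}: samples {sorted(members)}")
```

Each fold would then serve once as the test set while the remaining folds are used for training, so that test samples are never immediate neighbours of training samples within the same block.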

In principle, uncertainties could be quantified through approaches such as bootstrapped prediction intervals, spatial cross-validation, Monte Carlo simulations, or model-ensemble variance [69,70]. However, these techniques were not implemented in the present study for two main reasons. First, the integration of multi-source, multi-temporal variables combined with computationally intensive boosting-based models substantially increased algorithmic complexity, making full probabilistic uncertainty quantification computationally prohibitive at 10 m resolution across the entire study area [71]. Second, while this study employs a comprehensive multivariate framework that integrates diverse environmental covariates, its primary analytical objective is to assess the specific contribution of vegetation phenological features through a comparative multi-model analysis, not to generate probabilistic SOC maps. As such, we focused on model comparison and variable interpretability rather than implementing full quantification of predictive uncertainty.
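
To make the bootstrap option concrete, the sketch below builds a percentile interval around a single prediction by refitting a model on resampled data. A one-variable least-squares line stands in for the boosting models purely to keep the example short, and the clay and SOC values are hypothetical. Note that this interval reflects only the variability of the fitted model under resampling; a full prediction interval would also have to account for residual noise.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def bootstrap_interval(xs, ys, x_new, n_boot=2000, alpha=0.10, seed=42):
    """Percentile interval of the refitted prediction at x_new."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # skip degenerate resamples with a single distinct x
        by = [ys[i] for i in idx]
        intercept, slope = fit_line(bx, by)
        preds.append(intercept + slope * x_new)
    preds.sort()
    lower = preds[int(alpha / 2 * (len(preds) - 1))]
    upper = preds[int((1 - alpha / 2) * (len(preds) - 1))]
    return lower, upper


# Hypothetical data: clay content (%) and SOC (g/kg)
clay = [3.0, 5.0, 6.0, 7.0, 8.0, 10.0, 12.0, 15.0]
soc = [2.1, 3.0, 3.4, 4.6, 4.1, 5.9, 6.2, 8.0]

intercept, slope = fit_line(clay, soc)
point = intercept + slope * 9.0
lower, upper = bootstrap_interval(clay, soc, x_new=9.0)
print(f"prediction at 9% clay: {point:.2f} g/kg, 90% interval [{lower:.2f}, {upper:.2f}]")
```

Repeating this refit for every 10 m pixel and every bootstrap replicate is what makes the approach expensive at the scale described above.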

Future work should add a formal uncertainty analysis and increase the density and representativeness of SOC samples. In addition, enhancing the resolution of environmental covariates and applying ensemble or Bayesian methods could provide spatially explicit uncertainty estimates. These improvements would enhance the robustness of SOC mapping and support more reliable assessments of carbon stocks.
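
A simple starting point for a spatially explicit estimate is the spread of predictions across models at each pixel. The sketch below computes the ensemble mean and standard deviation for four pixels. The prediction values are hypothetical, and the between-model spread captures only disagreement among model structures, not the total predictive uncertainty.

```python
from statistics import mean, stdev

# Hypothetical SOC predictions (g/kg) for the same four pixels from each model
predictions = {
    "GBT": [3.2, 5.1, 7.8, 2.0],
    "RF": [3.6, 4.9, 6.1, 2.0],
    "XGBoost": [3.1, 5.3, 7.5, 2.0],
    "LightGBM": [3.4, 5.0, 6.9, 2.0],
}


def ensemble_summary(pred_by_model):
    """Return (mean, standard deviation) across models for every pixel."""
    per_pixel = zip(*pred_by_model.values())
    return [(mean(values), stdev(values)) for values in per_pixel]


summary = ensemble_summary(predictions)
for pixel, (avg, spread) in enumerate(summary):
    print(f"pixel {pixel}: mean = {avg:.2f} g/kg, between-model SD = {spread:.2f} g/kg")
```

Pixels where the models disagree strongly would then be flagged as candidates for additional sampling.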

5. Conclusions

This study demonstrates that integrating multi-source environmental variables with ensemble machine learning models can effectively capture the spatial dynamics of SOC in the arid landscapes of the Aksai region. Among the evaluated algorithms, GBT provided the most reliable predictions, indicating that boosting-based approaches are particularly suited to modeling SOC under conditions of strong environmental heterogeneity. The dominant influence of vegetation-related factors, especially phenological and vegetation-type information, highlights the central role of vegetation productivity in governing carbon accumulation in arid ecosystems, where biological inputs are limited and spatially constrained. The observed SOC gradients across mountains, piedmonts, and plains further emphasize the tight coupling between topography, hydrothermal conditions, and vegetation patterns.

Collectively, these findings underscore a broader mechanistic insight: in arid regions, SOC formation is controlled primarily by vegetation–soil feedbacks rather than by edaphic or climatic factors alone. By applying SHAP analysis, this study provides a transparent, data-driven framework for evaluating machine-learning predictions and for understanding how multi-source environmental information influences SOC variability. These results offer scientific evidence to support ecological restoration, carbon-sink enhancement, and regional carbon-stock assessments in arid zones.

Despite these advances, uncertainties remain due to sampling limitations, remote sensing noise, and model-structure variability. Future research should incorporate spatially explicit uncertainty quantification, expand field observations into remote mountain areas, integrate dynamic vegetation indices that capture interannual variability, and couple machine learning approaches with process-based models. Such developments will help establish more accurate, interpretable, and scalable SOC assessments and contribute to the long-term management of carbon resources under accelerating climate change.

Author Contributions

Conceptualization, G.C., X.G., Z.Z. and L.H.; data curation, G.C., X.G. and L.H.; funding acquisition, G.C. and L.H.; investigation, G.C.; methodology, G.C., X.G. and L.H.; visualization, G.C. and X.G.; writing—original draft preparation, G.C. and X.G.; writing—review and editing, X.G., Z.Z. and L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Comprehensive Research on the Investigation of Natural Resources Carbon Sinks in Typical Areas across China (No. DD20230111).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. SHAP summary plots for (a) GBT, (b) RF, (c) XGBoost, and (d) LightGBM.

References

1. Wang, X.; Li, L.; Liu, H.; Song, K.; Wang, L.; Meng, X. Prediction of soil organic matter using VNIR spectral parameters extracted from shape characteristics. Soil Tillage Res. 2022, 216, 105241.
2. Bradford, M.A.; Wieder, W.R.; Bonan, G.B.; Fierer, N.; Raymond, P.A.; Crowther, T.W. Managing uncertainty in soil carbon feedbacks to climate change. Nat. Clim. Change 2016, 6, 751–758.
3. Wang, X.; Wang, L.; Li, S.; Wang, Z.; Zheng, M.; Song, K. Remote estimates of soil organic carbon using multi-temporal synthetic images and the probability hybrid model. Geoderma 2022, 425, 116066.
4. Iucn, D.J.; Gudka, M.; Laban, P.; Metternicht, G.; Alexander, S.; Hannam, I. Land Degradation Neutrality: Implications and Opportunities for Conservation; Technical Brief 2nd Edition; IUCN: Nairobi, Kenya, 2015; 19p.
5. Lamichhane, S.; Kumar, L.; Wilson, B. Digital soil mapping algorithms and covariates for soil organic carbon mapping and their implications: A review. Geoderma 2019, 352, 395–413.
6. Bruun, T.B.; Elberling, B.; de Neergaard, A.; Magid, J. Organic Carbon Dynamics in Different Soil Types After Conversion of Forest to Agriculture. Land Degrad. Dev. 2013, 26, 272–283.
7. Saia, S.; Benítez, E.; García-Garrido, J.M.; Settanni, L.; Amato, G.; Giambalvo, D. The effect of arbuscular mycorrhizal fungi on total plant nitrogen uptake and nitrogen recovery from soil organic material. J. Agric. Sci. 2013, 152, 370–378.
8. Cui, Z.A.; Zhang, R.; Wang, W.; Peng, Z.; Wu, Y.; Zhao, Z.; Li, M.; Cong, Y.; Zhang, S.; Li, Z.; et al. Unveiling critical drivers of soil salinity prediction accuracy in remote sensing: A global meta-analysis. Plant Soil 2025, 516, 33–65.
9. Zhang, J.; Ge, X.; Hou, X.; Han, L.; Zhang, Z.; Feng, W.; Zhou, Z.; Luo, X. Strategies for Soil Salinity Mapping Using Remote Sensing and Machine Learning in the Yellow River Delta. Remote Sens. 2025, 17, 2619.
10. Dai, F.; Zhou, Q.; Lv, Z.; Wang, X.; Liu, G. Spatial prediction of soil organic matter content integrating artificial neural network and ordinary kriging in Tibetan Plateau. Ecol. Indic. 2014, 45, 184–194.
11. Liu, S.; An, N.; Yang, J.; Dong, S.; Wang, C.; Yin, Y. Prediction of soil organic matter variability associated with different land use types in mountainous landscape in southwestern Yunnan province, China. Catena 2015, 133, 137–144.
12. Mishra, U.; Drewniak, B.; Jastrow, J.D.; Matamala, R.M.; Vitharana, U.W.A. Spatial representation of organic carbon and active-layer thickness of high latitude soils in CMIP5 earth system models. Geoderma 2017, 300, 55–63.
13. Zeng, C.; Yang, L.; Zhu, A.X.; Rossiter, D.G.; Liu, J.; Liu, J.; Qin, C.; Wang, D. Mapping soil organic matter concentration at different scales using a mixed geographically weighted regression method. Geoderma 2016, 281, 69–82.
14. Mirzaee, S.; Ghorbani-Dashtaki, S.; Mohammadi, J.; Asadi, H.; Asadzadeh, F. Spatial variability of soil organic matter using remote sensing data. Catena 2016, 145, 118–127.
15. Hoffmann, U.; Hoffmann, T.; Jurasinski, G.; Glatzel, S.; Kuhn, N.J. Assessing the spatial variability of soil organic carbon stocks in an alpine setting (Grindelwald, Swiss Alps). Geoderma 2014, 232–234, 270–283.
16. Piccini, C.; Marchetti, A.; Francaviglia, R. Estimation of soil organic matter by geostatistical methods: Use of auxiliary information in agricultural and environmental assessment. Ecol. Indic. 2014, 36, 301–314.
17. Ge, X.; Ding, J.; Teng, D.; Wang, J.; Huo, T.; Jin, X.; Wang, J.; He, B.; Han, L. Updated soil salinity with fine spatial resolution and high accuracy: The synergy of Sentinel-2 MSI, environmental covariates and hybrid machine learning approaches. Catena 2022, 212, 106054.
18. Han, L.; Liu, D.; Cheng, G.; Zhang, G.; Wang, L. Spatial distribution and genesis of salt on the saline playa at Qehan Lake, Inner Mongolia, China. Catena 2019, 177, 22–30.
19. Han, L.; Ding, J.; Zhang, J.; Chen, P.; Wang, J.; Wang, Y.; Wang, J.; Ge, X.; Zhang, Z. Precipitation events determine the spatiotemporal distribution of playa surface salinity in arid regions: Evidence from satellite data fused via the enhanced spatial and temporal adaptive reflectance fusion model. Catena 2021, 206, 105546.
20. Jandl, R.; Rodeghiero, M.; Martinez, C.; Cotrufo, M.F.; Bampa, F.; van Wesemael, B.; Harrison, R.B.; Guerrini, I.A.; Richter, D.D., Jr.; Rustad, L.; et al. Current status, uncertainty and future needs in soil organic carbon monitoring. Sci. Total Environ. 2014, 468–469, 376–383.
21. Viaud, V.; Angers, D.A.; Walter, C. Toward Landscape-Scale Modeling of Soil Organic Matter Dynamics in Agroecosystems. Soil Sci. Soc. Am. J. 2010, 74, 1847–1860.
22. Aksoy, E.; Yigini, Y.; Montanarella, L. Combining Soil Databases for Topsoil Organic Carbon Mapping in Europe. PLoS ONE 2016, 11, e0152098.
23. Rial, M.; Martinez Cortizas, A.; Rodriguez-Lado, L. Understanding the spatial distribution of factors controlling topsoil organic carbon content in European soils. Sci. Total Environ. 2017, 609, 1411–1422.
24. Schillaci, C.; Acutis, M.; Lombardo, L.; Lipani, A.; Fantappie, M.; Marker, M.; Saia, S. Spatio-temporal topsoil organic carbon mapping of a semi-arid Mediterranean region: The role of land use, soil texture, topographic indices and the influence of remote sensing data to modelling. Sci. Total Environ. 2017, 601–602, 821–832.
25. Yigini, Y.; Panagos, P. Assessment of soil organic carbon stocks under future climate and land cover changes in Europe. Sci. Total Environ. 2016, 557–558, 838–850.
26. Li, Z.; Zhao, Y.; Taylor, J.; Gaulton, R.; Jin, X.; Song, X.; Li, Z.; Meng, Y.; Chen, P.; Feng, H.; et al. Comparison and transferability of thermal, temporal and phenological-based in-season predictions of above-ground biomass in wheat crops from proximal crop reflectance data. Remote Sens. Environ. 2022, 273, 112967.
27. Li, T.; Cui, L.; Kuhnert, M.; McLaren, T.I.; Pandey, R.; Liu, H.; Wang, W.; Xu, Z.; Xia, A.; Dalal, R.C.; et al. A comprehensive review of soil organic carbon estimates: Integrating remote sensing and machine learning technologies. J. Soils Sediments 2024, 24, 3556–3571.
28. Rukhovich, D.; Koroleva, P.; Rukhovich, A.; Komissarov, M. A detailed mapping of soil organic matter content in arable land based on the multitemporal soil line coefficients and neural network filtering of big remote sensing data. Geoderma 2024, 447, 116941.
29. Dong, C.; Meng, X.; Ruan, W.; Cui, J.; Zhang, X.; Liu, H. An innoval hyperspectral prediction model for soil organic matter in croplands of the Northeast China Mollisols Region. Soil Tillage Res. 2025, 253, 106666.
30. Li, X.; Ding, J.; Liu, J.; Ge, X.; Zhang, J. Digital Mapping of Soil Organic Carbon Using Sentinel Series Data: A Case Study of the Ebinur Lake Watershed in Xinjiang. Remote Sens. 2021, 13, 769.
31. Wang, X.; Li, S.; Wang, L.; Zheng, M.; Wang, Z.; Song, K. Effects of cropland reclamation on soil organic carbon in China’s black soil region over the past 35 years. Glob. Chang. Biol. 2023, 29, 5460–5477.
32. Wang, X.; Song, K.; Wang, Z.; Li, S.; Shang, Y.; Liu, G. Effects of land conversion to cropland on soil organic carbon in montane soils of Northeast China from 1985 to 2020. Catena 2024, 235, 107691.
33. Wang, X.; Song, K.; Wang, Z.; Li, S.; Zheng, M.; Wen, Z.; Liu, G. Are topsoil spectra or soil-environmental factors better indicators for discrimination of soil classes? Catena 2022, 218, 106580.
34. Wang, Y.; Wang, S.; Adhikari, K.; Wang, Q.; Sui, Y.; Xin, G. Effect of cultivation history on soil organic carbon status of arable land in northeastern China. Geoderma 2019, 342, 55–64.
35. Zhang, Z.; Ding, J.; Zhu, C.; Wang, J.; Ge, X.; Li, X.; Han, L.; Chen, X.; Wang, J. Historical and future variation of soil organic carbon in China. Geoderma 2023, 436, 116557.
36. Zhou, Z.; Zhang, Z.; Wang, M.; Wang, K.; Ai, J.; Gill, A.; Temirbayeva, K.; Zhu, C. Impact of future climate warming on soil organic carbon in China based on process-based models. Clim. Smart Agric. 2025, 2, 100086.
37. Zhang, Z.; Ding, J.; Li, L.; Cao, J.; Wang, K.; Zhu, C.; Ge, X.; Wang, J.; Yang, C.; Li, F.; et al. The impact of extreme climate on soil organic carbon in China. Geogr. Sustain. 2025, 6, 100356.
38. Zhang, Z.; Ding, J.; Wang, J.; Ge, X. Prediction of soil organic matter in northwestern China using fractional-order derivative spectroscopy and modified normalized difference indices. Catena 2020, 185, 104257.
39. Zhang, Z.; Ding, J.; Zhu, C.; Wang, J.; Ma, G.; Ge, X.; Li, Z.; Han, L. Strategies for the efficient estimation of soil organic matter in salt-affected soils through Vis-NIR spectroscopy: Optimal band combination algorithm and spectral degradation. Geoderma 2021, 382, 114729.
40. Guo, B.; Yang, X.; Yang, M.; Sun, D.; Zhu, W.; Zhu, D.; Wang, J. Mapping soil salinity using a combination of vegetation index time series and single-temporal remote sensing images in the Yellow River Delta, China. Catena 2023, 231, 107313.
41. Liu, H.; Guo, B.; Yang, X.; Zhao, J.; Li, M.; Huo, Y.; Wang, J. High spatiotemporal resolution vegetation index time series can facilitate enhanced remote sensing monitoring of soil salinization. Plant Soil 2024, 510, 305–327.
42. Xing, H.; Niu, J.; Feng, Y.; Hou, D.; Wang, Y.; Wang, Z. A coastal wetlands mapping approach of Yellow River Delta with a hierarchical classification and optimal feature selection framework. Catena 2023, 223, 106897.
43. Zhao, J.; Wang, Z.; Zhang, Q.; Niu, Y.; Lu, Z.; Zhao, Z. A novel feature selection criterion for wetland mapping using GF-3 and Sentinel-2 Data. Ecol. Indic. 2025, 171, 113146.
44. Miao, J.; Niu, L. A Survey on Feature Selection. Procedia Comput. Sci. 2016, 91, 919–926.
45. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2020, 54, 1937–1967.
46. Li, X.; Jia, H.; Wang, L. Remote Sensing Monitoring of Drought in Southwest China Using Random Forest and eXtreme Gradient Boosting Methods. Remote Sens. 2023, 15, 4840.
47. Cutler, A.; Cutler, D.R.; Stevens, J.R. Random Forests. In Ensemble Machine Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 157–175.
48. Belgiu, M.; Drăguţ, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31.
49. Chen, T.; Guestrin, C. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
50. Liu, Z.; Jiang, P.; De Bock, K.W.; Wang, J.; Zhang, L.; Niu, X. Extreme gradient boosting trees with efficient Bayesian optimization for profit-driven customer churn prediction. Technol. Forecast. Soc. Change 2024, 198, 122945.
51. Yan, J.; Xu, Y.; Cheng, Q.; Jiang, S.; Wang, Q.; Xiao, Y.; Ma, C.; Yan, J.; Wang, X. LightGBM: Accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 2021, 22, 271.
52. Fan, J.; Ma, X.; Wu, L.; Zhang, F.; Yu, X.; Zeng, W. Light Gradient Boosting Machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data. Agric. Water Manag. 2019, 225, 105758.
53. Li, X.; Wang, H.; Qin, S.; Lin, L.; Wang, X.; Cornelis, W. Evaluating ensemble learning in developing pedotransfer functions to predict soil hydraulic properties. J. Hydrol. 2024, 640, 131658.
54. Antwarg, L.; Miller, R.M.; Shapira, B.; Rokach, L. Explaining anomalies detected by autoencoders using Shapley Additive Explanations. Expert Syst. Appl. 2021, 186, 115736.
55. Al-Najjar, H.A.H.; Pradhan, B.; Beydoun, G.; Sarkar, R.; Park, H.-J.; Alamri, A. A novel method using explainable artificial intelligence (XAI)-based Shapley Additive Explanations for spatial landslide prediction using Time-Series SAR dataset. Gondwana Res. 2023, 123, 107–124.
56. Wang, T.; Kang, F.; Cheng, X.; Han, H.; Bai, Y.; Ma, J. Spatial variability of organic carbon and total nitrogen in the soils of a subalpine forested catchment at Mt. Taiyue, China. Catena 2017, 155, 41–52.
57. Swetha, R.K.; Dasgupta, S.; Chakraborty, S.; Li, B.; Weindorf, D.C.; Mancini, M.; Silva, S.H.G.; Ribeiro, B.T.; Curi, N.; Ray, D.P. Using Nix color sensor and Munsell soil color variables to classify contrasting soil types and predict soil organic carbon in Eastern India. Comput. Electron. Agric. 2022, 199, 107192.
58. Yang, R.-M.; Zhang, G.-L.; Liu, F.; Lu, Y.-Y.; Yang, F.; Yang, F.; Yang, M.; Zhao, Y.-G.; Li, D.-C. Comparison of boosted regression tree and random forest models for mapping topsoil organic carbon concentration in an alpine ecosystem. Ecol. Indic. 2016, 60, 870–878.
59. Wan, Q.; Zhu, G.; Guo, H.; Zhang, Y.; Pan, H.; Yong, L.; Ma, H. Influence of Vegetation Coverage and Climate Environment on Soil Organic Carbon in the Qilian Mountains. Sci. Rep. 2019, 9, 17623.
60. Hounkpatin, K.O.L.; Stendahl, J.; Lundblad, M.; Karltun, E. Predicting the spatial distribution of soil organic carbon stock in Swedish forests using a group of covariates and site-specific data. Soil 2021, 7, 377–398.
61. He, X.; Yang, L.; Li, A.; Zhang, L.; Shen, F.; Cai, Y.; Zhou, C. Soil organic carbon prediction using phenological parameters and remote sensing variables generated from Sentinel-2 images. Catena 2021, 205, 105442.
62. Basile-Doelsch, I.; Balesdent, J.; Pellerin, S. Reviews and syntheses: The mechanisms underlying carbon storage in soil. Biogeosciences 2020, 17, 5223–5242.
63. Sun, D.; Qiu, X.; Feng, J.; Ru, J.; Song, J.; Wan, S. Forest types control the contribution of litter and roots to labile and persistent soil organic carbon. Biogeochemistry 2024, 167, 1609–1617.
64. Sastre, B.; Antón-Iruela, O.; Moreno-Delafuente, A.; Navas, M.J.; Marques, M.J.; González-Canales, J.; Martín-Sanz, J.P.; Ramos, R.; García-Díaz, A.; Bienes, R. Groundcovers Improve Soil Properties in Woody Crops Under Semiarid Climate. Agriculture 2024, 14, 2288.
65. Plaza-Bonilla, D.; Arrúe, J.L.; Cantero-Martínez, C.; Fanlo, R.; Iglesias, A.; Álvaro-Fuentes, J. Carbon management in dryland agricultural systems. A review. Agron. Sustain. Dev. 2015, 35, 1319–1334.
66. Li, Y.; Zheng, S.; Meng, X.; Wang, L.; Yu, Y.; Zhang, Y.; Zhang, G.; Zhang, S.; Dai, X.; Ruan, W.; et al. Climatic and topographic controls on soil organic carbon distribution across continents. Catena 2025, 260, 109435.
67. Zhu, G.; Zhou, L.; He, X.; Wei, P.; Lin, D.; Qian, S.; Zhao, L.; Luo, M.; Yin, X.; Zeng, L.; et al. Effects of Elevation Gradient on Soil Carbon and Nitrogen in a Typical Karst Region of Chongqing, Southwest China. J. Geophys. Res. Biogeosci. 2022, 127, e2021JG006742.
68. Schapel, A.; Marschner, P.; Churchman, J. Clay amount and distribution influence organic carbon content in sand with subsoil clay addition. Soil Tillage Res. 2018, 184, 253–260.
69. Nguyen-Sy, T. Optimized hybrid XGBoost-CatBoost model for enhanced prediction of concrete strength and reliability analysis using Monte Carlo simulations. Appl. Soft Comput. 2024, 167, 112490.
70. Roberts, D.R.; Bahn, V.; Ciuti, S.; Boyce, M.S.; Elith, J.; Guillera-Arroita, G.; Hauenstein, S.; Lahoz-Monfort, J.J.; Schröder, B.; Thuiller, W.; et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 2017, 40, 913–929.
71. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.

Figure 1. Soil sampling sites and location of the Aksai region: (a) location of Gansu Province in China, (b) location of Aksai in Gansu Province, and (c) soil sampling sites and elevation map of the Aksai region.

Figure 2. The most important features retained after variable selection.

Figure 3. Schematic of the research workflow.

Figure 4. Heatmap and statistical analysis of variable correlations. Asterisks indicate significance levels: ** and *** mark correlations significant at p ≤ 0.01 and p ≤ 0.001, respectively.

Figure 5. Predictive performance of different models on the test set. (a) Gradient Boosting; (b) Random Forest; (c) XGBoost; and (d) LightGBM. Scatter plots show measured versus predicted soil organic carbon (SOC, g kg⁻¹). The dashed black line represents the 1:1 line, and the solid red line indicates the linear fit. Colours indicate the magnitude of predicted SOC values, ranging from low (purple) to high (yellow).

Figure 6. SHAP contribution plot for the GBT model. Colors represent the magnitude of the feature value (yellow, high; purple, low).

Figure 7. Spatially resolved soil organic carbon content retrieved from different models. (a) GBT; (b) LightGBM; (c) RF; and (d) XGBoost. All panels show the spatial distribution of topsoil organic carbon (SOC, g kg⁻¹) across the study area. Colours indicate SOC concentration from low (light green) to high (dark blue). Model-specific minimum and maximum SOC values are provided in the legends. Scale bars denote distance (km), and north arrows indicate orientation.

Table 1. Candidate predictor datasets considered for mapping soil organic carbon.

Category	Predictors	Abbreviation
Vegetation spectral indices	Difference Vegetation Index	DVI
	Enhanced Normalized Difference Vegetation Index	ENDVI
	Enhanced Ratio Vegetation Index	ERVI
	Enhanced Vegetation Index	EVI
	Green Difference Vegetation Index	GDVI
	Green Ratio Vegetation Index	GRVI
	Modified Soil Adjusted Vegetation Index	MSAVI
	Normalized Difference Vegetation Index	NDVI
	Soil Adjusted Vegetation Index	SAVI
	Normalized Difference Water Index	NDWI
	Simple Ratio Index	SR
Salinity indices	Salinity index I–V	SI1, SI2, SI3, SI4, SI5
Soil properties	–	Clay, Sand
Spectral band data	Sentinel-2 bands	B1, B2, B3, B4, B6, B7, B8, B8A, B9, B11, B12
Climate data	Mean precipitation and temperature from 2004 to 2024	Precip_2004_2024, TempC_2004_2024
Topographic data	Elevation	DEM
Vegetation-type factors	Principal components of NDVI time-series; maximum, minimum, and mean of NDVI time-series	NDVI_PC1, NDVI_PC2, NDVI_max, NDVI_mean, NDVI_min
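
As a brief illustration of how the NDVI-based entries in Table 1 relate to the Sentinel-2 bands, the sketch below computes NDVI from near-infrared and red reflectance and then summarizes a short time series. It assumes the common Sentinel-2 convention of B8 as the near-infrared band and B4 as the red band; the reflectance values are hypothetical and the compositing actually used in the study may differ.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


# Hypothetical surface reflectance for one pixel on five acquisition dates
b8_nir = [0.22, 0.30, 0.41, 0.35, 0.25]
b4_red = [0.15, 0.12, 0.08, 0.10, 0.14]

series = [ndvi(n, r) for n, r in zip(b8_nir, b4_red)]
summary = {
    "NDVI_max": max(series),
    "NDVI_mean": sum(series) / len(series),
    "NDVI_min": min(series),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The principal components (NDVI_PC1, NDVI_PC2) would be derived from the same per-pixel time series across all pixels, which is beyond the scope of this short example.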

Table 2. Descriptive statistics for soil organic carbon and covariate datasets (n = 207).

Variable	Mean	SD	Skewness	Kurtosis	CV	Min	Median	Max
SOC (g kg⁻¹)	4.728	2.208	0.784	1.021	0.467	1.000	4.500	14.100
Sand (%)	69.816	7.984	−0.676	0.093	0.114	49.000	71.000	85.000
Clay (%)	7.609	3.068	0.923	0.705	0.403	1.000	7.000	18.000
NDVI_PC1	−0.028	1.325	6.009	46.115	−47.741	−1.977	−0.290	11.518
DEM (m)	3114.908	708.531	−0.207	−1.221	0.227	1647.000	2930.000	4509.000
ERVI	2.553	0.294	−0.879	0.591	0.115	1.594	2.614	3.152
TempC_2004_2024 (°C)	0.035	6.243	−0.038	−1.634	180.175	−8.501	2.782	9.739
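
The CV column can be cross-checked against the Mean and SD columns. The sketch below assumes that CV is defined as SD divided by the mean, a definition the table itself does not state, and recomputes it from the tabulated values.

```python
# (mean, SD, reported CV) copied from Table 2
table2 = {
    "SOC": (4.728, 2.208, 0.467),
    "Sand": (69.816, 7.984, 0.114),
    "Clay": (7.609, 3.068, 0.403),
    "NDVI_PC1": (-0.028, 1.325, -47.741),
    "DEM": (3114.908, 708.531, 0.227),
    "ERVI": (2.553, 0.294, 0.115),
    "TempC_2004_2024": (0.035, 6.243, 180.175),
}

# Assumed definition: CV = SD / mean
recomputed = {name: sd / mean for name, (mean, sd, _) in table2.items()}

for name, (mean, sd, reported) in table2.items():
    print(f"{name:16s} reported = {reported:9.3f}   recomputed = {recomputed[name]:9.3f}")
```

Under this assumption, the values for SOC, Sand, Clay, DEM, and ERVI agree with the table to three decimals. The two variables with means close to zero (NDVI_PC1 and TempC_2004_2024) differ somewhat, plausibly because the tabulated means are rounded, and their CV values are in any case difficult to interpret because the ratio becomes unstable as the mean approaches zero.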

Table 3. Performance of the four models on the training and test sets.

Model	Train R²	Train RMSE (g kg⁻¹)	Train MAE (g kg⁻¹)	Train RPIQ	Test R²	Test RMSE (g kg⁻¹)	Test MAE (g kg⁻¹)	Test RPIQ
Gradient Boosting	0.790	1.040	0.809	3.558	0.675	1.304	0.975	2.837
Random Forest	0.614	1.410	1.120	2.624	0.516	1.591	1.233	2.326
XGBoost	0.761	1.111	0.847	3.331	0.658	1.338	1.005	2.765
LightGBM	0.684	1.276	0.972	2.900	0.588	1.469	1.081	2.519
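
For readers who wish to reproduce these metrics, the sketch below computes R², RMSE, MAE, and RPIQ from paired observed and predicted values. It assumes that R² is the coefficient of determination (1 − SS_res/SS_tot), that RPIQ is the interquartile range of the observations divided by the RMSE, and that quartiles are taken with the inclusive method; the study may have used different conventions. The data are hypothetical.

```python
from math import sqrt
from statistics import mean, quantiles


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, and RPIQ for paired observations and predictions."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    obs_mean = mean(observed)
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    rmse = sqrt(ss_res / len(observed))
    mae = mean([abs(r) for r in residuals])
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": rmse,
        "MAE": mae,
        "RPIQ": (q3 - q1) / rmse,  # interquartile range relative to RMSE
    }


# Hypothetical measured and predicted SOC (g/kg)
observed = [2.0, 3.5, 4.5, 6.0, 8.0]
predicted = [2.5, 3.0, 5.0, 5.5, 7.0]
metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this definition, RPIQ multiplied by RMSE returns the interquartile range. In Table 3 that product is about 3.70 g kg⁻¹ in every row, which would be consistent with a single interquartile range having been used for all models and both data splits.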

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

