Unveiling the “Sparse Carbon Pool”: High-Resolution Mapping and Storage Estimation of Topsoil Organic Carbon in Arid Xinjiang, China

Li, Yunhao; Shi, Mingjie; Wang, Shanshan; Liu, Wenhui; Wang, Pengfei; Wang, Xiangge; Guo, Jia; Wu, Hongqi

doi:10.3390/rs18050728


by Yunhao Li ¹, Mingjie Shi ¹, Shanshan Wang ¹, Wenhui Liu ², Pengfei Wang ¹, Xiangge Wang ¹, Jia Guo ³ and Hongqi Wu ¹,*

¹ Xinjiang Key Laboratory of Soil and Plant Ecological Processes, Xinjiang Agricultural University, Urumqi 830052, China
² Rural Energy Station of Urumqi County, Urumqi 830000, China
³ College of Ecology and Environment, Xinjiang University, Urumqi 830046, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(5), 728; https://doi.org/10.3390/rs18050728

Submission received: 30 December 2025 / Revised: 19 February 2026 / Accepted: 27 February 2026 / Published: 28 February 2026

(This article belongs to the Special Issue Ecosystem Protection in Arid and Semi-Arid Regions Supported by Multi-Source Remote Sensing Data)


Highlights

What are the main findings?

Factors characterizing hydrothermal exchange processes exhibited significant superiority in explaining the spatial variability of SOC, outperforming traditional vegetation indices.
The desert ecosystem was confirmed to be a massive “Sparse Carbon Pool” contributing 44.33% of the total regional carbon storage despite its low carbon density.

What are the implications of the main findings?

Physical parameters reflecting hydrothermal exchange (e.g., ET and VPD) are more effective than traditional vegetation indices in elucidating the mechanisms driving SOC variation in arid regions.
Carbon sink management strategies in arid zones should not solely focus on high-density grasslands but must also account for the cumulative carbon sequestration effects of vast desert ecosystems.

Abstract

High-resolution mapping of soil organic carbon (SOC) in arid regions remains challenging. Using Xinjiang as a case study, this research constructed a prediction framework integrating Boruta feature selection with the Random Forest (RF) algorithm to achieve refined mapping of topsoil SOC. Results indicated that: (1) Among the tested machine learning models, the Boruta–RF framework achieved the highest predictive performance (R² = 0.48, with the lowest RMSE); (2) Evapotranspiration (ET) and Vapor Pressure Deficit (VPD) were dominant drivers, with the stepwise increase in ET and negative inhibition of VPD confirming the decisive role of hydrothermal fluxes in regulating carbon input; (3) The total SOC storage was estimated at approximately 3.20 Pg C. Despite low carbon density, the desert ecosystem contributed 44.33% of the total storage, constituting a massive Sparse Carbon Pool. This study confirms the necessity of incorporating hydrothermal parameters and highlights that neglecting desert ecosystems leads to a significant underestimation of regional carbon storage.

Keywords:

soil organic carbon; machine learning; Boruta; hydrothermal driving mechanism; Sparse Carbon Pool; arid regions

1. Introduction

As key ecosystems covering over 40% of the global land surface, arid and semi-arid regions have recently been confirmed as the dominant force driving the interannual variability of the global land carbon sink [1,2]. Although the soil organic carbon (SOC) content per unit area in these regions is lower than that of tropical rainforests or peatlands, due to their vast spatial extent, even minor perturbations in the arid soil carbon pool are sufficient to exert a significant impact on the global carbon budget [3,4]. As a typical temperate arid and semi-arid region in the hinterland of the Eurasian continent, Xinjiang is characterized by its vast territory and complex geomorphology. Accurately quantifying the spatial distribution patterns and storage of SOC in this region is of great scientific significance for assessing the carbon budget balance of China and even Central Asia [5]. However, due to high surface heterogeneity and extreme climate constraints, high-precision spatial mapping of SOC in arid regions still faces multiple challenges, limiting our understanding of the feedback mechanisms of the terrestrial ecosystem carbon cycle [6].

With the rapid advancement of Earth observation technology and computational science, digital soil mapping (DSM), integrating environmental covariates with machine learning algorithms, has become a core approach for large-scale soil property prediction [7,8]. Although DSM techniques have made significant progress at regional scales, limitations remain regarding the selection of environmental covariates and their mechanistic interpretations within the unique geographical environment of arid regions [9]. Existing studies predominantly rely on vegetation indices derived from optical remote sensing, such as NDVI and EVI, as the primary proxy variables for biological factors [10]. However, in arid desert regions with sparse or even absent vegetation cover, optical signals are susceptible to strong interference from soil background spectra, resulting in severe spectral confusion that hinders the accurate reflection of actual surface biogeochemical processes [9]. In fact, water availability serves as the primary threshold factor limiting ecosystem productivity and soil carbon turnover in arid regions. Physical parameters reflecting surface hydrothermal exchange processes can directly characterize the material cycling and energy balance status within the continuum of soil, vegetation, and atmosphere; however, their potential in SOC spatial mapping has not yet been fully explored [11,12].

Meanwhile, in terms of algorithmic evolution, DSM prediction models have advanced from traditional Multiple Linear Regression (MLR) and Ordinary Kriging (OK) to a new stage dominated by Ensemble Learning and Deep Neural Networks [7]. Compared with parameterized traditional statistical methods, machine learning algorithms represented by Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and Support Vector Machine (SVM), owing to their powerful non-linear mapping capabilities and fault tolerance mechanisms, can more effectively interpret the highly heterogeneous relationships between soil properties and complex environmental covariates [13,14]. Particularly for the non-stationary characteristics of surface processes in arid regions, these data-driven models demonstrate predictive potential and robustness significantly superior to traditional methods. However, the increasing complexity of model architectures is accompanied by high sensitivity to the quality of input variables; simply applying advanced algorithms without considering their compatibility with data characteristics often makes it difficult to obtain the optimal solution [15].

A further challenge is the tension between the high-dimensional redundancy of multi-source data and the generalization ability of models [16]. With the rapid growth of remote sensing data, researchers often feed large numbers of topographic, climatic, and spectral variables into models; however, this tends to trigger the “Hughes Phenomenon”, in which predictive accuracy degrades as the number of features grows relative to the number of training samples, resulting in model overfitting and low computational efficiency [17]. Therefore, eliminating noise from the variable pool, selecting a subset of core features with clear physical significance, and identifying the model best suited to the non-linear characteristics of arid regions are key to improving mapping accuracy [18]. Nevertheless, the synergy between specific feature selection strategies and different machine learning architectures remains poorly understood, especially regarding which combination yields the highest stability and accuracy for arid soil mapping.

Moreover, pursuing predictive accuracy alone often comes at the expense of physical interpretability [19]. Although ensemble learning and deep learning models excel at capturing complex non-linear relationships, their “black-box” nature hinders understanding of their internal decision-making processes, making it difficult to quantify the specific pathways through which environmental factors contribute to SOC variation [7]. To overcome this limitation, Interpretable Machine Learning (IML) techniques have emerged. Among them, the Partial Dependence Plot (PDP) visualizes the marginal effect of a specific environmental covariate on the target variable, averaged over the full sample, and thereby reveals the positive or negative responses and non-linear patterns of SOC across different threshold intervals of that covariate [20]. Combining feature selection with PDP analysis can not only verify whether the model results conform to geoscientific knowledge but also uncover previously unrecognized environmental thresholds for carbon sequestration in arid regions from a data-driven perspective. Nevertheless, identifying the dominant environmental drivers among complex variables, together with their specific non-linear threshold effects, remains a critical gap in understanding the carbon cycle mechanisms in Xinjiang.

Desert ecosystems, functioning as potential Sparse Carbon Pools, are frequently underestimated [21]. Previous studies have predominantly focused on high-productivity regions such as forests and grasslands, whereas deserts and bare lands, which occupy the vast majority of arid regions, are often regarded as “carbon-barren zones” and consequently overlooked or roughly estimated. Although the carbon density per unit area is low, given their immense spatial extent, the cumulative carbon storage of desert ecosystems may play a non-negligible role in the global carbon cycle [6]. Neglecting this Sparse Carbon Pool may lead to a significant underestimation of regional carbon storage. Therefore, a comprehensive quantification of SOC storage that explicitly accounts for the contributions of diverse land use types, particularly these ‘sparse’ pools, is indispensable for a precise regional carbon budget.

Against this background, this study takes Xinjiang, China, as the study area, draws on 2372 field-measured sampling points, and integrates multi-source remote sensing data to construct a high-dimensional system of environmental covariates. It addresses the following three scientific questions: (1) Which machine learning algorithm, when combined with Boruta feature selection, provides the best predictive performance for SOC mapping among the tested models (RF, GBDT, SVR, and MLP)? (2) How can SOC storage and the relative contributions of different land use types be quantified under a cross-validated modeling framework using established accuracy metrics (R², RMSE, MAE, and RPD)? (3) What are the dominant environmental factors driving SOC spatial differentiation in arid regions, and what are their non-linear response patterns? The study aims to provide a scientific basis for elucidating soil carbon cycle mechanisms and for refined decision-making in regional carbon sink management in arid regions.

2. Materials and Methods

2.1. Overview of the Study Area

Xinjiang is located in the hinterland of the Eurasian continent (73°40′–96°18′E, 34°25′–49°10′N), covers approximately 1.66 million km², and has a typical temperate continental arid and semi-arid climate (Figure 1a) [22]. The region has distinct geomorphological features: the Altai, Tianshan, and Kunlun Mountains alternate with the Junggar and Tarim Basins, forming a characteristic pattern of alternating mountains and basins. Because high mountain ranges block moisture and the region lies far inland, precipitation is scarce and extremely uneven in space and time. The mean annual precipitation is generally less than 150 mm but can exceed 800 mm in high-altitude mountain areas [23]. The land cover of the study area is dominated by vast deserts, accounting for about 60% of the total area, followed by grasslands mainly distributed in the mountains and on the edges of basins (about 30%), while cropland, forestland, and water bodies together account for only a small proportion (about 10%) (Figure 1c) [24]. As shown in Figure 1d, the sampling points of this survey covered latitudes of 36–48°N and elevations of 0–4000 m, illustrating how SOC content varies with geographic location and elevation. This pronounced topographic relief and the associated spatial redistribution of hydrothermal conditions give SOC in the study area strong spatial heterogeneity.

2.2. SOC Sampling and Laboratory Analysis

Field soil sampling for this study was conducted from July to October 2023. Following the principle of combining spatial representativeness with randomness, a total of 2372 topsoil (0–30 cm) samples were collected within the study area using both typical and random sampling methods (Figure 1b). The 0–30 cm soil layer was selected in accordance with IPCC guidelines for greenhouse gas inventory reporting and because it generally corresponds to the cultivated (plow) layer in agricultural areas of Xinjiang, where anthropogenic disturbance and carbon inputs are most active. Moreover, in arid ecosystems, SOC is predominantly concentrated in the surface soil layer, which is more responsive to vegetation and climatic drivers. The deployment of sampling points comprehensively considered the integrity of environmental gradients, spatially covering the major soil types, geomorphic units, and land use types within the study area, thereby ensuring sufficient representativeness of the samples under complex geographical environmental backgrounds [25]. During the sampling process, the longitude, latitude, and elevation of each sampling point were recorded. The collected soil samples were transported back to the laboratory and air-dried naturally after the removal of visible plant roots and debris. The air-dried samples were first ground and passed through a 2 mm standard sieve to remove gravel; subsequently, a portion of the sieved soil was obtained using the quartering method. Finally, the SOC content was determined using the potassium dichromate oxidation method with external heating [26].

2.3. Acquisition and Preprocessing of Environmental Variables

To capture the complex biogeochemical processes driving the spatial variation in SOC in arid regions, this study constructed an indicator system of 37 variables in six categories: vegetation indices, environmental indices, meteorological factors, drought indices, topographic factors, and soil properties (Table 1).

2.3.1. Remote Sensing Spectral Indices

Two categories of key indices were extracted from Landsat 8 OLI images (30 m) of the 2023 growing season (May–October), after radiometric calibration and atmospheric correction, to characterize surface biological and physical states. The multi-temporal composite (May–October) was used to align with the field sampling period and to fill data gaps or missing pixels in individual satellite scenes, thereby providing a robust environmental signal that captures the integrated biological influence on SOC accumulation rather than transient daily fluctuations (calculation formulas are detailed in Table 2).

Vegetation Indices: Six of the most representative and commonly used vegetation indices were selected, including the Normalized Difference Vegetation Index (NDVI), Kernel Normalized Difference Vegetation Index (kNDVI), Enhanced Vegetation Index (EVI), Generalized Difference Vegetation Index (GDVI), Optimized Soil Adjusted Vegetation Index (OSAVI), and Soil Adjusted Vegetation Index (SAVI). Although these indices are standard benchmarks in digital soil mapping, their effectiveness in the arid Xinjiang hinterland is often compromised by sparse vegetation cover and strong interference from soil background spectra. By including these diverse indices, we provided a comprehensive candidate pool for the Boruta algorithm to objectively evaluate their suitability for this specific environment.

Environmental Indices: Considering the strong environmental heterogeneity in arid regions, this study introduced six indices to quantify salinization, moisture, and bare soil characteristics, including: Salinity Index (SI), Kernel Normalized Difference Moisture Index (kNDMI), Bare Soil Index (BSI), Canopy Response Salinity Index (CRSI), Normalized Difference Water Index (NDWI), and Normalized Difference Salinity Index (NDSI). These variables are used to characterize the limiting effects of salinization stress, wind erosion risk, and surface moisture conditions on SOC sequestration and mineralization.

2.3.2. Climate and Aridity Characteristics

Given that water is the primary limiting factor for ecosystems in arid regions, this study integrated multi-source high-resolution meteorological data, focusing on constructing a set of climate driving factors from two dimensions: basic hydrothermal conditions and long-term drought stress.

Meteorological Factors: Three basic meteorological variables were integrated, namely Temperature (TEM) from ECMWF ERA5-Land, Precipitation (PRE) from UCSB-CHIRPS, and Evapotranspiration (ET) from MODIS products. These variables constitute the foundational framework of the regional climatic background, among which ET directly characterizes the intensity of hydrothermal exchange between the land surface and the atmosphere.

Drought Indices: To capture the impacts of extreme climate on soil processes, six drought indices from the CHM_Drought dataset were specifically introduced, including: Evaporative Demand Drought Index (EDDI), Self-calibrating Palmer Drought Severity Index (SC_PDSI), Palmer Drought Severity Index (PDSI), Standardized Precipitation Evapotranspiration Index (SPEI), Standardized Precipitation Index (SPI), and Vapor Pressure Deficit (VPD). These indices can reflect, at multiple scales, the fine control mechanisms of atmospheric moisture deficit on vegetation stomatal conductance and soil respiration.

2.3.3. Topography and Soil Properties

Topography determines the local redistribution of hydrothermal resources, while soil physicochemical properties directly affect the physical protection mechanisms of organic carbon. This study extracted key variables from two dimensions: topography/geomorphology and soil matrix.

Topographic Factors: Based on the 30 m resolution Shuttle Radar Topography Mission Digital Elevation Model (SRTM DEM), four variables were extracted, including: Elevation (DEM), Slope, Aspect, and Topographic Relief (TR). These factors characterize geomorphic features at the micro-topographic scale, directly affecting precipitation runoff pathways and differences in solar radiation reception.

Soil Properties: To provide key parent material and hydrological background information, this study integrated 12 soil-related variables. Among them, 10 basic physicochemical properties were obtained from the National Tibetan Plateau Data Center (TPDC), including: Total Nitrogen (TN), Total Phosphorus (TP), Total Potassium (TK), Bulk Density (BD), pH, Cation Exchange Capacity (CEC), Clay content (Clay), Silt content (Silt), Sand content (Sand), and Porosity (Por). Furthermore, two dynamic variables from NASA GLDAS were introduced: Soil Temperature (ST) and Soil Moisture (SM), to explain the potential control mechanisms of soil texture and physical structure on the stability of the SOC pool.

To ensure spatial consistency for pixel-by-pixel modeling, all 37 environmental covariates were first reprojected into a custom Lambert Conformal Conic (LCC) projection system based on the WGS 1984 geographic coordinate system. This step resolved the unit mismatch between angular (degrees) and linear (meters) scales, providing a consistent metric-based grid for the study area. Subsequently, the datasets were geometrically harmonized to a 30 m spatial resolution. Bilinear Interpolation was employed as a disaggregation technique for continuous variables to match the high-resolution Landsat and DEM grids, while Nearest Neighbor resampling was used for categorical data (e.g., LULC types) to maintain class integrity. While this process aligns grid geometry, we recognize that the inherent information content remains constrained by the native resolution of each dataset (Table 1), a limitation further addressed in Section 4.4 [27].

Table 2. Sources and formulas of vegetation indices and environmental indices.

Factor Types	Variable Name	Formulas	References
Vegetation index	NDVI	$NDVI = \frac{NIR - Red}{NIR + Red}$	[28]
	kNDVI	$kNDVI = \tanh ({(\frac{NIR - Red}{NIR + Red})}^{2})$	[29]
	EVI	$EVI = 2.5 \times \frac{NIR - Red}{NIR + 6 \times Red - 7.5 \times Blue + 1}$	[30]
	GDVI	$GDVI = \frac{{NIR}^{2} - {Red}^{2}}{{NIR}^{2} + {Red}^{2}}$	[31]
	OSAVI	$OSAVI = \frac{NIR - Red}{NIR + Red + 0.16}$	[32]
	SAVI	$SAVI = \frac{NIR - Red}{NIR + Red + 0.5} \times 1.5$	[33]
Environmental Indices	SI	$SI = \frac{Blue}{Red}$	[34]
	kNDMI	$kNDMI = \exp (\frac{NIR - SWIR 1}{NIR + SWIR 1}) - 1$	[35]
	BSI	$BSI = \frac{Red + SWIR 1 - (NIR + Blue)}{Red + SWIR 1 + NIR + Blue}$	[36]
	CRSI	$CRSI = \sqrt{\frac{NIR}{SWIR 1 + Red}}$	[37]
	NDWI	$NDWI = \frac{Green - NIR}{Green + NIR}$	[38]
	NDSI	$NDSI = \frac{Green - SWIR 1}{Green + SWIR 1}$	[39]

2.4. Feature Selection Strategy

Given the potential multicollinearity and data redundancy among multi-source environmental covariates, direct modeling may lead to overfitting and low computational efficiency. This study therefore adopted the Boruta algorithm for feature selection [40]. Boruta is an all-relevant wrapper method built on Random Forest; its core mechanism is to add randomly shuffled copies of the original features (shadow features) as a reference benchmark for randomness. By calculating feature importance scores and iteratively testing whether the score of each original feature is significantly higher than the maximum Z-score among the shadow features, the algorithm confirms variables with statistically significant importance. Unlike traditional methods that seek a minimal optimal feature subset, Boruta aims to identify all features that have a significant dependency relationship with the target variable, thereby preserving as much environmental explanatory information as possible [41]. This process was implemented with the BorutaPy library in the Python (3.1.2) environment, using the Random Forest Regressor from scikit-learn as the base estimator, with the maximum number of iterations set to 500 and the significance level set to 0.01.
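The shadow-feature comparison at the core of Boruta can be illustrated with a simplified, standard-library-only sketch. It is not the BorutaPy implementation used in the study: as a stated simplification, the absolute Pearson correlation stands in for Random Forest importance, the final significance test on the hit counts is omitted, and the synthetic data and all function names are hypothetical.

```python
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_hits(columns, y, n_iter=20, seed=0):
    """Count how often each real feature beats the best shuffled (shadow) copy."""
    rng = random.Random(seed)
    # With a deterministic score the real importances do not change between
    # iterations; in Boruta proper the forest is refit every iteration.
    real_scores = [abs_pearson(col, y) for col in columns]
    hits = [0] * len(columns)
    for _ in range(n_iter):
        shadow_scores = []
        for col in columns:
            shadow = col[:]      # copy the feature ...
            rng.shuffle(shadow)  # ... and destroy its link to the target
            shadow_scores.append(abs_pearson(shadow, y))
        threshold = max(shadow_scores)
        for j, score in enumerate(real_scores):
            if score > threshold:
                hits[j] += 1
    return hits


# Synthetic data: one informative feature and one pure-noise feature.
data_rng = random.Random(42)
n = 200
x_signal = [data_rng.gauss(0, 1) for _ in range(n)]
x_noise = [data_rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * s + data_rng.gauss(0, 0.1) for s in x_signal]

hits = shadow_hits([x_signal, x_noise], y)
print(hits)  # hit counts for [informative, noise] out of 20 iterations
```

A feature that beats the best shadow in nearly every iteration would be a candidate for "Confirmed", while one that wins only occasionally behaves like random noise and would be a candidate for "Rejected".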

2.5. Machine Learning Modeling

To systematically evaluate the applicability and stability of algorithms with different mechanisms for predicting soil organic carbon (SOC) in arid regions, this study built four representative machine learning models on the selected feature subset using the scikit-learn library in Python: Random Forest (RF), Gradient Boosting Decision Tree (GBDT), Support Vector Regression (SVR), and Multi-layer Perceptron (MLP). This multi-model comparison strategy has been proven effective in identifying the optimal predictor under different data characteristics [13]. As an ensemble algorithm based on the Bagging strategy, RF reduces model variance by aggregating prediction results and is robust against high-dimensional noise [7]; in contrast, GBDT adopts a Boosting strategy to iteratively minimize the loss function and can usually achieve superior bias correction [42]. SVR uses the kernel trick to map non-linear relationships into a high-dimensional space [43], while MLP, as a typical feedforward neural network, learns adaptively through the error backpropagation algorithm [44]. To ensure optimal model performance, Grid Search (GS) combined with a 10-fold cross-validation (CV) strategy was adopted to optimize hyperparameters [44]. Furthermore, Learning Curves (LC) and Feature Inclusion Curves (FIC) were calculated to track how the Mean Squared Error (MSE) varied with training sample size and input feature dimension, which allowed the model convergence state to be identified and the most cost-effective feature subset size to be determined.
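The 10-fold partitioning that underlies this tuning step can be sketched with the standard library alone. The study used scikit-learn for this; the generator below is only a hypothetical illustration of how 2372 samples could be split so that every sample is held out exactly once. A grid search would then loop over candidate hyperparameter values and average the validation error across these folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives k folds whose sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(2372, k=10))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_no, len(train_idx), len(test_idx))
```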

2.6. Model Evaluation and Trend Analysis

This study adopted the 10-fold cross-validation strategy to test the basic predictive accuracy of the models [7]. To quantify the consistency between observed and predicted values, the Coefficient of Determination (R²), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Ratio of Performance to Deviation (RPD) were selected as core evaluation metrics. The calculation formulas for each metric are as follows:

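$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}} \tag{1}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}} \tag{2}$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat{y}_{i}| \tag{3}$$

$$RPD = \frac{SD}{RMSE} \tag{4}$$

where $y_{i}$ is the observed value, $\hat{y}_{i}$ is the predicted value, $\bar{y}$ is the mean of the observed values, $n$ is the number of samples, and $SD$ is the standard deviation of the observed values.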

The RPD is used to evaluate prediction accuracy, interpreted as: excellent (RPD > 2.5), very good (2.0 < RPD < 2.5), good (1.8 < RPD < 2.0), fair (1.4 < RPD < 1.8), and poor (RPD < 1.4).
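Equations (1)–(4) and the RPD classes can be written out as a short sketch. Two choices below are assumptions not stated in the text: SD is taken as the sample standard deviation (n − 1 denominator), and values falling exactly on a class boundary are assigned to the lower class. The observed and predicted values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE and RPD following Equations (1)-(4)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    sd = math.sqrt(ss_tot / (n - 1))  # assumption: sample SD of observations
    return {"R2": 1 - ss_res / ss_tot, "RMSE": rmse, "MAE": mae, "RPD": sd / rmse}


def rpd_class(rpd):
    """Map an RPD value to the qualitative classes listed in the text."""
    # Assumption: a value exactly on a boundary falls into the lower class.
    if rpd > 2.5:
        return "excellent"
    if rpd > 2.0:
        return "very good"
    if rpd > 1.8:
        return "good"
    if rpd > 1.4:
        return "fair"
    return "poor"


# Hypothetical observed and predicted SOC values (g/kg), illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
metrics = regression_metrics(observed, predicted)
print(metrics, rpd_class(metrics["RPD"]))
```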

To overcome the “black-box” limitations of machine learning algorithms and elucidate the environmental driving mechanisms of SOC spatial heterogeneity, this study introduced Partial Dependence Plots (PDP) [7]. PDPs were generated based on the Python environment; by calculating the marginal effects of specific variables in model predictions, they intuitively displayed the non-linear response trends of SOC to key environmental factors, thereby verifying whether the patterns captured by the model conform to regional geographic differentiation characteristics.
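The marginal effect that a PDP displays can be computed by hand: fix one covariate at a grid value for every sample, predict, and average. The sketch below assumes a hypothetical linear toy model in place of the fitted Random Forest; its coefficients, the covariate values, and the function names are invented purely for illustration.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over all rows with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted model: SOC rises with ET, falls with VPD."""
    et, vpd = row
    return 10.0 + 0.02 * et - 3.0 * vpd


# Hypothetical (ET, VPD) samples.
rows = [(100.0, 1.0), (300.0, 2.0), (500.0, 3.0)]
pd_et = partial_dependence(toy_model, rows, feature_index=0, grid=[0.0, 200.0, 400.0])
print(pd_et)
```

With a non-linear model such as RF the resulting curve is generally not a straight line, which is what makes threshold-like responses visible.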

Furthermore, considering the spatial instability of prediction results, this study used the Google Earth Engine (GEE) platform for spatial mapping and uncertainty analysis. A Bootstrap resampling procedure was constructed within GEE to conduct 50 independent simulations; in each simulation, 80% of samples were randomly drawn from the training set to retrain the Random Forest classifier (ee.Classifier.smileRandomForest) and generate a prediction layer. The standard deviation (SD) and coefficient of variation (CV) of the SOC predictions were then calculated pixel by pixel across the simulation results. SD characterizes the absolute uncertainty of the predictions, whereas CV, as the ratio of SD to the mean, removes the influence of measurement units and magnitude, providing a basis for identifying potential risk areas during spatial extrapolation [45].
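The pixel-by-pixel statistics can be sketched outside GEE as follows. This is a minimal illustration rather than the authors' GEE script: each prediction layer is represented as a flat list of pixel values, the sample standard deviation is assumed, and the three tiny layers stand in for the 50 simulated layers.

```python
import statistics


def pixelwise_uncertainty(layers):
    """Return per-pixel SD and CV (%) across a stack of prediction layers."""
    n_pixels = len(layers[0])
    sd, cv = [], []
    for p in range(n_pixels):
        values = [layer[p] for layer in layers]
        mean = statistics.fmean(values)
        s = statistics.stdev(values)  # assumption: sample standard deviation
        sd.append(s)
        # CV is undefined where the mean prediction is zero.
        cv.append(100.0 * s / mean if mean else float("nan"))
    return sd, cv


# Hypothetical stack: 3 simulated layers, 2 pixels each (SOC in g/kg).
layers = [[10.0, 2.0], [12.0, 2.0], [14.0, 2.0]]
sd, cv = pixelwise_uncertainty(layers)
print(sd, cv)
```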

2.7. Carbon Storage Estimation

To evaluate the ecosystem service function of the soil carbon pool in the study area, based on the SOC content prediction layer generated by the optimal model and combined with soil bulk density and gravel content data, this study estimated the organic carbon density (SOCD, kg m⁻²) and total stock (SOC Stock, Tg) of the 0–30 cm soil layer [46]. The calculation formulas are as follows:

$$SOCD = SOC \times BD \times D \times (1 - SC) \times 0.01 \tag{5}$$

$$SOCS = \frac{SOCD \times A}{1000} \tag{6}$$

where SOCD is the soil organic carbon density (kg m⁻²), SOCS is the total organic carbon stock of the study area (Tg), SOC is the soil organic carbon content (g kg⁻¹), BD is the soil bulk density (g cm⁻³), D is the soil layer thickness (cm), SC is the gravel content (%, which must be expressed as a fraction for the term (1 − SC) in Equation (5)), and A is the raster area (km²). The constants 0.01 and 1000 are unit conversion factors.
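A worked example makes the unit conversions in Equations (5) and (6) concrete. The sketch below assumes the gravel content is supplied as a fraction; the bulk density and gravel values are hypothetical, while the SOC value reuses the regional mean reported in Section 3.1 and the pixel size matches the 30 m grid.

```python
def soc_density(soc_g_per_kg, bd_g_per_cm3, depth_cm, gravel_fraction):
    """Equation (5): SOC density in kg per square metre."""
    return soc_g_per_kg * bd_g_per_cm3 * depth_cm * (1.0 - gravel_fraction) * 0.01


def soc_stock_tg(socd_kg_per_m2, area_km2):
    """Equation (6): SOC stock in Tg for one raster cell of the given area."""
    return socd_kg_per_m2 * area_km2 / 1000.0


# SOC = 13.10 g/kg (regional mean); BD and gravel fraction are hypothetical.
socd = soc_density(13.10, 1.40, 30.0, 0.10)
pixel_area_km2 = 30 * 30 / 1e6          # one 30 m x 30 m pixel in km2
stock = soc_stock_tg(socd, pixel_area_km2)
print(f"SOCD = {socd:.4f} kg/m2, stock per pixel = {stock:.3e} Tg")
```

The factor 0.01 follows from 1 g cm⁻³ = 1000 kg m⁻³, 1 cm = 0.01 m, and 1 g = 0.001 kg; dividing by 1000 converts kg m⁻² × km² (10⁶ kg) into Tg (10⁹ kg).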

3. Results

3.1. Descriptive Statistics of SOC

The SOC content of 2372 surface (0–30 cm) soil sampling points across Xinjiang exhibits significant spatial heterogeneity. Overall, the mean SOC in the study area is 13.10 g kg⁻¹, with a wide range (0.65–94.50 g kg⁻¹), and the overall coefficient of variation is 62.01%, indicating that the soil carbon pool in the study area is strongly influenced by environmental heterogeneity. The general descriptive statistical results and the distribution differences among different land use types are shown in Figure 2.

Across land use types (Table 3), Cropland has the highest SOC content, with a mean of 14.96 g kg⁻¹ and a range of 0.65 to 94.50 g kg⁻¹, followed by Forestland (mean 12.20 g kg⁻¹) and Grassland (mean 8.91 g kg⁻¹), while the widely distributed Bare Land has the lowest SOC content, at only 5.07 g kg⁻¹. Notably, cropland samples show the highest kurtosis (12.80) and skewness (2.23), indicating that although the mean value of this land use type is high, its distribution has a pronounced long tail; some high-value points (maximum 94.50 g kg⁻¹) may be related to long-term fertilization and organic matter input in oasis agricultural areas.

In terms of variability, the coefficients of variation for SOC of the four land use types all fall between 40% and 80%, which indicates moderate variation. Among them, grassland has the highest coefficient of variation (CV = 75.18%), reflecting the large spatial differences in the carbon sequestration capacity of grassland ecosystems under the dual influence of natural precipitation and topographic gradients. Although bare land has the lowest mean content, its CV (60.86%) remains high, indicating that even in desert areas, local micro-landforms or the “fertile island effect” under shrubs can lead to significant fluctuations in SOC. In contrast, forestland has the lowest coefficient of variation (CV = 44.22%), indicating that its internal habitat is relatively homogeneous.

In addition, SOC data for all LULC types show a positive skew distribution (Skewness > 1), which is consistent with the probability density curves shown in the raincloud plot in Figure 2b, that is, the SOC content of most sampling points is concentrated in the lower value range, presenting typical soil carbon distribution characteristics in arid regions.

3.2. SOC Spatial Pattern Simulation and Storage Assessment

3.2.1. Variable Importance Selection

To identify the key drivers governing the spatial variability of SOC from high-dimensional multi-source data, this study employed the Boruta algorithm to conduct an all-relevant feature selection on 37 environmental covariates. The results indicated (Figure 3) that a total of six variables were identified as “Confirmed” features with significant importance, while the remaining variables were eliminated (“Rejected”) due to their importance scores being lower than that of random noise.

Among the selected variables, ET exhibited the highest explanatory power, with an importance score well above that of the other factors. It was followed, in descending order, by VPD, SPI, DEM, PRE, and Clay. In terms of variable categories, climatic and drought characteristics were dominant. These results indicate that in arid regions, the spatial distribution of SOC is strongly correlated with hydrothermal conditions, elevation gradients, and soil texture, whereas vegetation greenness or micro-topographic features alone explain relatively little of the variation in SOC.

3.2.2. Model Parameter Optimization and Feature Dimensionality Determination

In preliminary model screening experiments, the RF model exhibited superior potential and robustness compared to other algorithms. Consequently, to further exploit the predictive performance of the RF model and identify the optimal configuration for spatial mapping, this study conducted an in-depth optimization analysis on its key parameters and feature dimensionality.

To balance prediction accuracy with computational efficiency, this study quantitatively evaluated and optimized the key hyperparameter (number of decision trees) and input feature dimensionality of the RF model (Figure 4). Figure 4a illustrates the Learning Curve of the RF model, reflecting the variation trend of MSE with an increasing number of decision trees (ntree). Results indicated that in the initial stage (ntree < 50), the MSE of both the training and testing sets exhibited a sharp decline as the number of trees increased, suggesting that the model was rapidly capturing underlying patterns in the data. When the number of decision trees exceeded 200, the test set error curve gradually stabilized and reached a state of convergence. Furthermore, the gap between training and testing errors remained stable without a distinct divergence trend, indicating that the model possessed good generalization ability at ntree = 200. At this point, further increasing the number of trees made a marginal contribution to accuracy improvement while increasing the computational burden; therefore, this study set the ntree of the RF model to 200.
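One way to formalize the convergence criterion described above is to pick the smallest ntree beyond which no larger forest lowers the test MSE by more than a small relative tolerance. The sketch below assumes a 1% tolerance and uses hypothetical learning-curve values, not the data behind Figure 4a.

```python
def smallest_converged_ntree(ntrees, test_mse, rel_tol=0.01):
    """Return the first ntree after which no later point improves MSE by more than rel_tol."""
    for i, n in enumerate(ntrees):
        best_later = min(test_mse[i:])
        if test_mse[i] - best_later <= rel_tol * test_mse[i]:
            return n
    return ntrees[-1]


# Hypothetical learning-curve values (illustrative only).
ntrees = [10, 25, 50, 100, 200, 300, 500]
test_mse = [9.8, 8.4, 7.6, 7.2, 6.95, 6.93, 6.92]
chosen = smallest_converged_ntree(ntrees, test_mse)
print(chosen)
```

With these illustrative values, the rule selects 200 trees, because the remaining improvement from 200 to 500 trees is below 1% of the error.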

The Feature Inclusion Curve further revealed the response patterns of model accuracy to the increase in input feature quantity (Figure 4b). This process incorporated features sequentially into the model for training based on the variable importance ranking identified by the Boruta algorithm. It was observed that with the introduction of the top-ranking high-importance features, the cross-validation MSE decreased significantly, indicating that these core factors contained the primary explanatory information. When the number of input features reached six, the MSE reached a distinct inflection point (local minimum); thereafter, continuing to add features did not result in a significant decrease in MSE and even caused oscillations. This suggests that the subsequently introduced variables contained information redundancy or noise, which not only failed to improve accuracy but potentially increased model uncertainty. In summary, this study ultimately established a streamlined and efficient RF model comprising 200 decision trees and six core features, serving as the benchmark for subsequent model comparison and spatial mapping.
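The feature inclusion procedure can be sketched as a loop over nested subsets of the importance-ranked covariates. In this minimal example the `evaluate` callback is a hypothetical stand-in for a cross-validated RF fit, the MSE values are invented for illustration, and `X7` and `X8` are placeholder names for lower-ranked covariates.

```python
def forward_inclusion_curve(ranked_features, evaluate):
    """Score nested subsets (top-1, top-2, ...) in importance order."""
    return [evaluate(ranked_features[:k]) for k in range(1, len(ranked_features) + 1)]


def best_feature_count(curve):
    """1-based feature count with the lowest error (first minimum on ties)."""
    return curve.index(min(curve)) + 1


# Covariates in the order reported by the Boruta ranking, plus two placeholders.
ranked = ["ET", "VPD", "SPI", "DEM", "PRE", "Clay", "X7", "X8"]

# Hypothetical cross-validated MSE per subset size (illustrative only).
toy_mse = {1: 9.0, 2: 8.1, 3: 7.6, 4: 7.3, 5: 7.1, 6: 6.9, 7: 7.0, 8: 6.95}


def evaluate(subset):
    # In practice this would train and cross-validate a model on `subset`.
    return toy_mse[len(subset)]


curve = forward_inclusion_curve(ranked, evaluate)
k_best = best_feature_count(curve)
print(k_best, ranked[:k_best])
```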

3.2.3. Model Simulation and Accuracy Validation Based on Selected Features

Based on the top six core variables identified by the Boruta algorithm, this study constructed four parameter-optimized machine learning models: RF, GBDT, SVR, and MLP. A 10-fold cross-validation was employed to systematically evaluate their predictive performance. This optimal feature subset encompassed key hydrothermal, topographic, and soil properties, effectively reducing data redundancy while preserving environmental explanatory power.

The validation results (Figure 5) indicated that models with different mechanisms exhibited significant disparities in accuracy under identical feature inputs. Overall, the predictive capabilities of ensemble learning models (RF and GBDT) outperformed those of single-structure models (SVR and MLP). Specifically, the RF model achieved the best performance, with a coefficient of determination (R²) of 0.48, a Root Mean Square Error (RMSE) of 2.63 g kg⁻¹, a Mean Absolute Error (MAE) of 1.96 g kg⁻¹, and an RPD value of 1.39. This suggests that the RF model, through its internal random feature subspace and ensemble averaging mechanisms, could more effectively mine the non-linear mapping relationships between these six features and SOC. The GBDT model followed closely (R² = 0.45, RMSE = 2.69 g kg⁻¹), whereas SVR and MLP exhibited weaker fits (R² of only 0.42 and 0.34, respectively), indicating that the ability of these two models to capture complex spatial variations was limited under restricted feature dimensionality. Considering the tested algorithms and the defined feature subset, the RF model demonstrated the highest predictive performance and stability. Therefore, within the scope of this comparative framework, RF was selected as the most suitable model for subsequent spatial mapping.
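For reference, the four accuracy metrics can be computed as in the sketch below. The definitions are assumptions, since the formal definitions are not part of this section: R² is taken as 1 − SSE/SST and RPD as the population standard deviation of the observations divided by RMSE. Under these definitions RPD = 1/√(1 − R²), and the reported R² of 0.48 corresponds to an RPD of about 1.39. The observed and predicted values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """R2, RMSE, MAE and RPD under the definitions assumed in the lead-in."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(sse / n)
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    sd = math.sqrt(sst / n)  # population standard deviation of the observations
    return {"R2": 1 - sse / sst, "RMSE": rmse, "MAE": mae, "RPD": sd / rmse}


# Hypothetical observed and predicted SOC contents (g/kg).
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 8.0])
print(metrics)

# Consistency check between the reported R2 and RPD.
rpd_from_r2 = 1 / math.sqrt(1 - 0.48)
print(round(rpd_from_r2, 2))
```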

3.2.4. SOC Spatial Distribution Patterns and Uncertainty

Based on the optimized RF model, the spatial distribution map of topsoil (0–30 cm) SOC in Xinjiang for 2023 was constructed on the GEE platform (Figure 6). Overall, SOC in the study area exhibited strong spatial heterogeneity, characterized by a distinct geographical differentiation pattern of high in the north and low in the south, as well as high in mountains and low in basins, which corresponds highly with the geomorphological structure of Xinjiang. The vast majority of pixels were concentrated in the low-value range (<10 g kg⁻¹), indicating that soils with low carbon content constitute the extensive spatial matrix of the region, whereas high-value pixels were distributed only in specific areas in a sporadic or strip-like manner.

High-value SOC zones (>20 g kg⁻¹) were primarily aggregated on the southern slopes of the Altai Mountains, the Tianshan Mountains (particularly the Ili River Valley), and the high-altitude zones of the northern slopes of the Kunlun Mountains. Geographically, these regions mainly correspond to high-altitude humid and semi-humid zones with relatively superior hydrothermal conditions. In contrast, low-value zones (<5 g kg⁻¹) were widely distributed across the Tarim Basin (especially the Taklamakan Desert hinterland) and the center of the Junggar Basin, corresponding primarily to the hyper-arid desert hinterlands. Notably, a significant carbon concentration gradient exists between the mountains and the basins. As elevation increases, SOC content exhibits a gradual increasing trend. Furthermore, on the northern slopes of the Tianshan Mountains and along the Tarim River, influenced by oasis agriculture and artificial irrigation, SOC presented distinct discontinuous medium to high value patches, breaking the homogeneous background of low desert values.

The uncertainty analysis based on Bootstrap revealed that the reliability of SOC spatial mapping exhibits a significant non-uniform distribution. Absolute uncertainty (SD) was primarily controlled by SOC background values, characterized by heteroscedasticity that highly overlapped with high-value zones such as the Tianshan and Altai Mountains (Figure 7a). Conversely, the high-value centers of relative uncertainty (CV) did not merely cover the desert hinterlands but were concentrated in the oasis–desert ecotones and fragmented mountainous terrains where environmental gradients change drastically, which is mainly attributed to the mixed-pixel effect and the complexity of local microclimates (Figure 7b). In comparison, both sampling-dense regions like the Ili River Valley and landscape-homogeneous desert hinterlands exhibited lower coefficients of variation, confirming the robustness and applicability boundaries of the model across different habitats.
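The distinction between absolute and relative uncertainty can be made concrete with a small sketch. It covers only the summarization step, assuming that predictions from several bootstrap-trained models are already available for each pixel; the three pixels and their values are hypothetical.

```python
import statistics


def pixel_uncertainty(bootstrap_predictions):
    """Per-pixel mean, SD and CV (%) across bootstrap model runs.

    bootstrap_predictions: list of runs, each a list with one prediction per pixel.
    """
    n_pixels = len(bootstrap_predictions[0])
    result = []
    for j in range(n_pixels):
        values = [run[j] for run in bootstrap_predictions]
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        cv = sd / mean * 100 if mean > 0 else float("nan")
        result.append({"mean": mean, "sd": sd, "cv_percent": cv})
    return result


# Hypothetical predictions (g/kg) from three bootstrap runs for three pixels:
# index 0 = mountain pixel, 1 = oasis-desert ecotone pixel, 2 = desert pixel.
runs = [
    [24.0, 4.0, 1.9],
    [27.0, 6.0, 2.0],
    [30.0, 8.0, 2.1],
]
summary = pixel_uncertainty(runs)
for row in summary:
    print(row)
```

In this toy case the mountain pixel has the largest SD, while the ecotone pixel has the largest CV, mirroring the contrast between Figure 7a and Figure 7b.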

3.2.5. Estimation of SOC Storage

Following the validation of the RF model under a 10-fold cross-validation framework (R² = 0.48, RMSE = 2.63 g kg⁻¹), the optimized model was applied to generate the spatial distribution map of SOC across Xinjiang. This prediction layer was subsequently combined with soil bulk density and gravel content data to estimate the topsoil (0–30 cm) SOC storage (Table 4). Results indicated that the total topsoil SOC storage in the study area was approximately 3.20 Pg. In terms of spatial distribution, the pattern of carbon storage was highly consistent with that of SOC content; high-storage zones were primarily located in vegetation-dense ecological areas such as the Altai Mountains, Tianshan Mountains, and Ili River Valley, whereas the hinterlands of the Tarim and Junggar Basins constituted regional low-storage centers.
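The conversion from SOC content to storage can be sketched with the commonly used density formula SOCD = SOC × BD × D × (1 − G)/100, where SOC is in g kg⁻¹, BD in g cm⁻³, D in cm, G is the gravel fraction, and SOCD is in kg m⁻². This form is an assumption here, as the study's own equation is not reproduced in this section; the pixel values are hypothetical, and the 900 m² pixel area follows from the 30 m resolution.

```python
def soc_density_kg_m2(soc_g_kg, bulk_density_g_cm3, depth_cm, gravel_fraction):
    """SOC density (kg C per m2) from content, bulk density, depth and gravel fraction."""
    return soc_g_kg * bulk_density_g_cm3 * depth_cm * (1 - gravel_fraction) / 100


def total_storage_pg(pixels, pixel_area_m2=900.0):
    """Sum per-pixel storage and convert kg to Pg (1 Pg = 1e12 kg)."""
    total_kg = sum(soc_density_kg_m2(*p) * pixel_area_m2 for p in pixels)
    return total_kg / 1e12


# Hypothetical pixel: SOC 10 g/kg, BD 1.4 g/cm3, 0-30 cm layer, 10% gravel.
density = soc_density_kg_m2(10.0, 1.4, 30.0, 0.10)
print(f"{density:.2f} kg C per m2")

# Two hypothetical 30 m pixels (mountain grassland and desert).
pixels = [(25.0, 1.2, 30.0, 0.05), (3.0, 1.5, 30.0, 0.20)]
print(total_storage_pg(pixels))
```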

Regarding the contributions of different land use types, Bare land and Grassland constituted the two core carbon pools of the region. Of particular note is Bare land; despite its low carbon content per unit of soil, its cumulative carbon storage ranked first, accounting for approximately 44.33% (about 1.42 Pg) of the regional total, owing to its vast spatial extent (extensive deserts and Gobi). This finding suggests that in arid and semi-arid regions, although desert ecosystems possess low biological productivity, their cumulative effect as massive carbon pools in the global carbon cycle should not be overlooked. Grassland served as the second largest carbon pool, contributing 35.22% (about 1.13 Pg) of the total storage by virtue of its relatively extensive distribution area and higher organic matter accumulation levels. In contrast, although Cropland and Forestland were mainly distributed in high-productivity regions, their contributions to the total regional carbon storage were relatively limited, at 14.19% and 6.26%, respectively, constrained by their smaller total areas.
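The reported shares and storage values can be cross-checked with simple arithmetic. The sketch below multiplies each percentage by the rounded regional total of 3.20 Pg; the results are back-of-envelope values derived from rounded inputs, so Table 4 remains the authoritative source for the per-class figures.

```python
total_pg = 3.20  # reported total topsoil SOC storage (Pg)
shares_percent = {
    "Bare land": 44.33,
    "Grassland": 35.22,
    "Cropland": 14.19,
    "Forestland": 6.26,
}

# Storage implied by each share, rounded to two decimals.
storage_pg = {name: round(total_pg * share / 100, 2) for name, share in shares_percent.items()}

print(storage_pg)
print(round(sum(shares_percent.values()), 2))  # the four shares should sum to 100
```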

3.3. Non-Linear Response of SOC to Environmental Covariates

To elucidate the intrinsic driving mechanisms underlying SOC spatial variability, this study employed Partial Dependence Plots (PDP) to analyze the marginal effects of the top six core variables within the RF model on SOC (Figure 8). The results indicated that the environmental factors exhibited strong non-linear relationships with SOC, characterized by distinct response patterns.

First, hydrothermal conditions exhibited the strongest controlling influence, with the response curves of ET and VPD displaying the most dramatic variations. ET exerted a typical and steep increasing trend on SOC (Figure 8a): within the low ET range, SOC remained at a relatively low level; however, once a specific environmental threshold was crossed, SOC content exhibited a rapid increase, eventually approaching saturation in the high-value range. Conversely, VPD displayed a strong negative inhibitory effect (Figure 8b): SOC peaked in relatively humid environments (low VPD), but as atmospheric aridity increased, SOC content experienced a precipitous decline, subsequently stabilizing at a lower level. Complementarily, PRE exhibited a significant “S-shaped” positive promotional effect (Figure 8e): before precipitation reached intermediate levels, its enhancement of SOC was limited; however, with the continuous increase in water input, SOC exhibited significant accumulation, gradually leveling off in the humid zone.

Secondly, the impacts of topography and soil properties on SOC were relatively moderate and complex. Unlike the sharp variations observed in hydrothermal factors, DEM and SPI did not exhibit a single linear trend but instead presented characteristics of non-linear fluctuation (Figure 8c,d). The PDP curve for DEM was relatively flat, indicating that after accounting for the influence of other variables, the direct marginal effect of the elevation gradient alone was weak. Clay, however, displayed a gradual linear increasing trend (Figure 8f), implying that as soil texture transitioned from coarse to fine (with a relative increase in Clay content), SOC content demonstrated a steady upward trajectory.
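The marginal effect shown by a partial dependence curve can be computed by fixing one covariate at each grid value for all samples and averaging the model predictions. The sketch below illustrates this with a hypothetical two-covariate stand-in model (`toy_model`) containing an ET threshold and a linear Clay term; it is not the fitted RF model.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over all rows with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical model: threshold response to ET, linear response to Clay."""
    et, clay = row
    return (10.0 if et > 300 else 3.0) + 0.05 * clay


# Hypothetical samples: (ET in mm, Clay in %).
rows = [(100.0, 10.0), (250.0, 20.0), (400.0, 30.0)]
et_curve = partial_dependence(toy_model, rows, feature_index=0, grid=[200.0, 350.0])
print(et_curve)
```

The jump between the two grid values reproduces a threshold-type response, while the Clay contribution is averaged out over the samples.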

4. Discussion

4.1. Factors Influencing Model Prediction Accuracy

This study demonstrated that, within the tested modeling framework and selected environmental covariates, the integration of Boruta feature selection with the Random Forest (RF) algorithm provided improved predictive performance for SOC mapping in arid regions. Under identical training and validation conditions, the RF model showed higher R² and lower RMSE values compared with Gradient Boosting Decision Tree (GBDT), Support Vector Regression (SVR), and Multi-layer Perceptron (MLP), indicating stronger predictive stability. The predictive accuracy achieved by our RF model (R² = 0.48) is consistent with findings from other regional DSM studies in arid environments, where R² values typically fluctuate between 0.30 and 0.55 due to extreme spatial heterogeneity and sparse vegetation cover [47,48]. This performance underscores the relative superiority of ensemble learning algorithms (RF and GBDT) over single-structure models like SVR and MLP in capturing complex, non-linear relationships in noisy remote sensing datasets. The advantage of RF in particular is primarily attributable to its bagging-based ensemble learning mechanism. Given the complexity of the surface environment in arid regions, highly non-linear and non-stationary mapping relationships often exist between soil properties and environmental covariates. Single-structure learners or models that are highly sensitive to parameters are susceptible to local noise interference or prone to overfitting [13]. In contrast, RF reduced prediction variance by constructing a large number of decorrelated decision trees on random feature subspaces and averaging their predictions. This mechanism gave the model strong noise resistance and robustness, making it well suited to high-dimensional, high-noise, and uncertain multi-source remote sensing data [49]. The finding that the RF algorithm often exhibits superior robustness compared to single models when dealing with high-dimensional environmental noise has also been validated in other studies targeting complex environments [7,13].

Secondly, the all-relevant variable selection strategy based on Boruta played a key role in improving model generalization performance. Multi-source remote sensing datasets usually contain a large number of collinear or irrelevant variables. Directly inputting the full variable set not only increases computational load but is also highly susceptible to triggering the “Hughes phenomenon,” where model performance decreases as feature dimensionality increases [50]. This study utilized the Boruta algorithm to successfully reduce the 37 initial variables to 6 core features, effectively removing redundant noise while maximally retaining key environmental explanatory information. This process greatly simplified the model structure and ensured that the model’s decision rules were built upon dominant factors with clear physical meaning, thereby enhancing the interpretability and ecological significance of the mapping results. Removing redundant features is a necessary step to improve the generalization ability of machine learning algorithms in digital soil mapping, a conclusion that has also been verified in studies targeting different regions [41].

The fusion of multi-source environmental data, particularly the introduction of hydrothermal physical parameters, was another core factor in breaking through mapping bottlenecks in arid regions and achieving high-resolution prediction. Traditional digital soil mapping mostly relies on spectral reflectance or vegetation indices; however, in arid desert areas with sparse vegetation cover, optical remote sensing signals are prone to strong interference from high-brightness soil backgrounds, leading to severe spectral confusion [51]. This study innovatively introduced physical parameters reflecting surface hydrothermal exchange processes (such as ET and VPD) as well as precipitation characteristics. These variables characterized soil moisture dynamics and surface energy balance states from a physical mechanism perspective, effectively supplementing the information deficit of single vegetation indices in bare soil areas. In particular, the leading position of ET and VPD in feature importance ranking confirmed that in water-limited arid ecosystems, quantifying the moisture flux at the atmosphere–soil interface is crucial for the accurate prediction of SOC. Water availability is the primary environmental driving factor limiting biogeochemical processes and carbon sequestration potential in arid regions, a conclusion consistent with research results targeting arid ecosystems [52].

It should be noted that although other advanced algorithms such as XGBoost or deep ensemble frameworks may potentially improve predictive performance, this study focuses on evaluating representative algorithms with distinct learning mechanisms to ensure interpretability and methodological comparability within the scope of arid-region DSM.

4.2. Spatial Differentiation of SOC Storage and Carbon Pool Composition

The estimation results indicate that topsoil SOC storage in Xinjiang exhibits a characteristic coexistence of localized high-value enrichment and widespread low-value accumulation. First, the grassland ecosystem was confirmed as the most important Active Carbon Pool in the region, accounting for 35.22% (approximately 1.13 Pg C) of the total storage. This estimate is consistent in magnitude with related studies on grassland carbon pools in the arid regions of Northwest China [53], primarily attributed to the combination of high carbon density per unit area and extensive spatial distribution. On one hand, mountain and alpine grasslands in Xinjiang are limited by low temperatures and precipitation, resulting in lower soil respiration rates which favor the long-term accumulation of root biomass; this aligns with the mechanism of low-temperature limited decomposition [54]. On the other hand, the vast distribution area of grasslands means their cumulative effect far exceeds that of the spatially limited forestlands. This suggests that protecting grassland ecosystems on the northern slopes of the Tianshan, Altai, and Kunlun Mountains is decisive for maintaining carbon sequestration services. Any grassland degradation caused by overgrazing or climate warming could trigger significant risks of soil carbon loss [55].

Secondly, an often overlooked but crucial finding is that desert ecosystems make the largest single contribution to the regional total carbon storage. Despite the extremely low carbon density per unit area, the desert constitutes a massive Sparse Carbon Pool by virtue of occupying the majority of the study area, with storage reaching 1.42 Pg C, or approximately 44.33% of the total regional storage. This finding corrects the underestimation of desert carbon storage in previous low-resolution mapping studies [56], which often ignored this contribution due to an inability to capture micro-geomorphology. The high-resolution results of this study suggest that in arid region carbon budget accounting, the role of deserts must not be ignored due to their “barrenness”. However, this massive “Sparse Carbon Pool” may exhibit significant vulnerability under future climate change; for instance, the projected “warming and wetting” trends in arid regions could potentially accelerate soil organic matter mineralization or trigger shrub encroachment, leading to substantial shifts in the regional carbon balance [52,57]. Although the carbon sequestration rate of deserts is slow, their baseline is so large that even minor perturbations, such as changes in precipitation patterns, could generate observable carbon flux changes on a global scale.

Finally, compared to previous studies based on traditional interpolation methods such as Kriging, the high-resolution mapping based on machine learning in this study demonstrated significant advantages in capturing spatial details of carbon storage. Traditional geostatistical methods often suffer from the Smoothing Effect, leading to underestimation of high values and overestimation of low values, making it difficult to characterize spatial variability under complex terrain [58]. In contrast, the SOC storage map generated in this study clearly delineated the carbon gradient changes in the oasis–desert ecotones and the distribution of carbon patches in fragmented terrains. This refined spatial information provides a more precise scientific basis for formulating region-specific soil carbon management policies.

4.3. Response Patterns and Driving Mechanisms of SOC to Environmental Factors in Arid Regions

The non-linear response analysis based on PDP in this study revealed the complex water, heat, and carbon interaction mechanisms in arid ecosystems. Results indicate that water availability and its forms are the primary limiting factors driving SOC spatial differentiation. The threshold-type increase in SOC with ET (Figure 8a) reflects the tight coupling between vegetation and water in arid regions. In the low ET range, soil evaporation that does not support plant growth dominates, and carbon input is extremely low; only when moisture conditions cross a critical threshold sufficient to support effective transpiration by vegetation can photosynthetic products be converted into SOC in large quantities [59].

More critically, the strong negative effect of VPD (Figure 8b) revealed the restrictive regulatory role of atmospheric aridity on the carbon cycle. VPD represents the intensity of atmospheric demand for moisture. When VPD exceeds a specific threshold, plants induce Stomatal Closure to prevent excessive water loss, leading to a precipitous decline in photosynthetic carbon assimilation rates [60]. This physiological stress mechanism driven by atmospheric dryness explains why SOC content remains difficult to accumulate in certain regions where soil moisture is acceptable but the atmosphere is extremely dry.
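To make the notion of atmospheric moisture demand concrete, VPD can be approximated from air temperature and relative humidity as the difference between saturation and actual vapor pressure. The sketch below uses a Tetens-type formula for saturation vapor pressure; this is an assumption for illustration only, since the VPD covariate in this study comes from a gridded product whose derivation is not described in this section.

```python
import math


def saturation_vapor_pressure_kpa(temp_c):
    """Tetens-type approximation of saturation vapor pressure (kPa) at temp_c (deg C)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vpd_kpa(temp_c, relative_humidity_percent):
    """Vapor pressure deficit (kPa): saturation minus actual vapor pressure."""
    es = saturation_vapor_pressure_kpa(temp_c)
    return es * (1 - relative_humidity_percent / 100)


# Same temperature, two hypothetical humidity levels: drier air gives a higher VPD.
humid = vpd_kpa(30.0, 60.0)
dry = vpd_kpa(30.0, 20.0)
print(f"VPD at 60% RH: {humid:.2f} kPa, at 20% RH: {dry:.2f} kPa")
```

At a fixed temperature, halving the relative humidity raises VPD substantially, which is the condition under which stomatal closure limits carbon assimilation.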

Secondly, topographic factors indirectly shaped the vertical zonal pattern of SOC through the redistribution of hydrothermal resources. Although the spatial distribution of SOC (Figure 6) showed that high-altitude mountain areas were significantly higher than basins, PDP analysis (Figure 8d) showed that the direct marginal effect of DEM was weak after accounting for climate variables. This indicates that the driving role of elevation on SOC is mainly realized by altering local microclimates, specifically the unique “low temperature-high humidity” environmental advantage of high-altitude mountain areas [61]. On one hand, as elevation increases, the decrease in temperature significantly inhibits soil microbial respiration rates and extracellular enzyme activities, slowing down the mineralization and decomposition of organic matter. On the other hand, the precipitation effect formed by mountainous terrain uplift promotes the increase in vegetation Net Primary Productivity (NPP), increasing exogenous carbon input [62]. This dual positive effect of “increased input–decreased output” makes the Tianshan and Altai Mountains key carbon sink centers in the arid region.

Finally, soil physical properties serve as micro-scale environmental filters, playing a key regulatory role in SOC stability. The linear positive correlation between Clay content and SOC (Figure 8f) confirmed the formation mechanism of Mineral-Associated Organic Carbon (MAOC). Fine-grained clay minerals possess immense specific surface area and charge density, capable of forming stable organo-mineral complexes with organic molecules through adsorption, thereby physically isolating the contact between microorganisms and substrates [63]. In arid desert zones with sparse vegetation and scarce carbon input, this physical protection mechanism appears particularly important; it is the key to maintaining the long-term stability of the Recalcitrant Carbon Pool in the soil.

4.4. Limitations and Future Perspectives

Although this study achieved high-resolution estimation of SOC storage in arid regions based on multi-source remote sensing data and machine learning approaches, several limitations warrant further consideration. First, optical remote sensing primarily captures surface spectral characteristics, and the spectral response of SOC is often influenced by soil moisture, salinity, and surface roughness. Such mixed effects may constrain the model’s ability to detect subtle SOC variations in complex surface environments, such as highly salinized or humid areas. Second, a scale mismatch remains between coarse-resolution meteorological covariates and the 30 m target resolution. Although spatial resampling was applied to harmonize grid geometry, this process does not enhance the intrinsic spatial information content of the original datasets. The resulting spatial smoothing effect may obscure micro-scale hydrothermal gradients and introduce uncertainty in SOC predictions, particularly in topographically fragmented landscapes. In addition, the comparative analysis was conducted within a limited set of machine learning algorithms (RF, GBDT, SVR, and MLP). While the 37 environmental covariates selected cover major physical and biological drivers, the potential role of additional factors, such as specific parent material types or high-resolution soil moisture dynamics, remains to be further explored. More advanced approaches, such as gradient boosting variants (e.g., XGBoost) or deep learning frameworks, were not evaluated in this study and may further improve predictive performance. Furthermore, due to the high sampling costs associated with the vast territory of arid regions, the representativeness of samples in certain marginal habitats (e.g., high-altitude mountains and desert hinterlands) could be improved. The relatively sparse sampling coverage in desert interiors may introduce uncertainty in regional SOC storage estimates, particularly for the “Sparse Carbon Pool” assessment. Crucially, the spatial identification of high-uncertainty zones (Figure 7b) provides a strategic framework for optimizing future sampling designs. The elevated uncertainty observed in oasis–desert ecotones and topographically fragmented terrains, where environmental gradients shift drastically, suggests that these regions should be prioritized for targeted sampling in subsequent surveys. Implementing an adaptive sampling strategy in these priority areas would allow for a more precise capture of local hydrothermal variations, effectively reducing model residuals and enhancing the overall reliability of the “Sparse Carbon Pool” assessment across arid regions. Finally, this study focused exclusively on the 0–30 cm soil layer as per IPCC guidelines; however, the lack of deeper soil carbon data limits our understanding of full-profile carbon sequestration capacity. Future research could incorporate semi-supervised or transfer learning strategies to reduce dependence on extensive ground sampling. Additionally, given the sensitivity of arid ecosystems to climate variability, long-term dynamic monitoring is needed to better understand the evolution of regional carbon source–sink patterns under global change.

5. Conclusions

By comparing multiple machine learning algorithms under a unified modeling framework, this study found that the RF model exhibited the best predictive performance for SOC mapping in arid Xinjiang. Compared with the other evaluated models, RF achieved higher R² values and lower RMSE, indicating greater predictive stability within the scope of the tested algorithms. The results further suggest that incorporating ET and VPD, which characterize surface hydrothermal exchange processes, together with traditional spectral indices, improves the explanation of SOC spatial variability. These variables appear to influence SOC distribution through non-linear threshold responses and were identified as important environmental drivers regulating biological carbon inputs in arid ecosystems. Based on the cross-validated RF model, the estimated total topsoil (0–30 cm) SOC storage in Xinjiang in 2023 was approximately 3.20 Pg C. Spatially, carbon storage exhibited clear altitudinal zonation patterns, with higher values concentrated in the Tianshan and Altai Mountains and lower values distributed across the basin hinterlands. In terms of ecosystem contributions, although desert areas possess relatively low SOC density, their extensive spatial coverage (~60% of the region) resulted in a substantial cumulative storage (44.33%, approximately 1.42 Pg C), highlighting the importance of the so-called “Sparse Carbon Pool.” Grasslands contributed 35.22% (approximately 1.13 Pg C) due to comparatively higher carbon density and broad distribution, representing a major active carbon reservoir in the region. Overall, this study provides a high-resolution SOC mapping framework applicable to arid environments and emphasizes the cumulative contribution of low-density but large-area ecosystems to the regional carbon budget. These findings offer a quantitative basis for improving carbon management strategies in arid regions under climate change.

Author Contributions

Conceptualization, Y.L. and H.W.; methodology, Y.L.; software, Y.L.; validation, M.S., S.W., W.L., P.W., X.W. and J.G.; formal analysis, Y.L. and M.S.; investigation, Y.L., M.S., S.W., W.L., P.W. and X.W.; resources, H.W.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, H.W.; visualization, Y.L.; supervision, H.W.; project administration, H.W.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region. It was also financially supported by the National Key Research and Development Program of China (No. 2023YFD1901503-2), the Major Science and Technology Special Projects in the Xinjiang Uygur Autonomous Region, China (No. 2023A02002), and the Xinjiang Talent System Construction—Cotton Industry System.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Acknowledgments

The authors would like to thank the associate editor and the anonymous reviewers for their constructive comments that helped to improve the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Poulter, B.; Frank, D.; Ciais, P.; Myneni, R.B.; Andela, N.; Bi, J.; Broquet, G.; Canadell, J.G.; Chevallier, F.; Liu, Y.Y. Contribution of semi-arid ecosystems to interannual variability of the global carbon cycle. Nature 2014, 509, 600–603.
Ahlström, A.; Raupach, M.R.; Schurgers, G.; Smith, B.; Arneth, A.; Jung, M.; Reichstein, M.; Canadell, J.G.; Friedlingstein, P.; Jain, A.K. The dominant role of semi-arid ecosystems in the trend and variability of the land CO₂ sink. Science 2015, 348, 895–899.
Friedlingstein, P.; O’sullivan, M.; Jones, M.W.; Andrew, R.M.; Hauck, J.; Landschützer, P.; Le Quéré, C.; Li, H.; Luijkx, I.T.; Olsen, A. Global carbon budget 2024. Earth Syst. Sci. Data Discuss. 2024, 2024, 1–133.
Piao, S.; Wang, X.; Park, T.; Chen, C.; Lian, X.; He, Y.; Bjerke, J.W.; Chen, A.; Ciais, P.; Tømmervik, H. Characteristics, drivers and feedbacks of global greening. Nat. Rev. Earth Environ. 2020, 1, 14–27.
Li, X.; Ding, J.; Liu, J.; Ge, X.; Zhang, J. Digital mapping of soil organic carbon using sentinel series data: A case study of the Ebinur lake watershed in Xinjiang. Remote Sens. 2021, 13, 769.
Plaza, C.; Zaccone, C.; Sawicka, K.; Méndez, A.M.; Tarquis, A.; Gascó, G.; Heuvelink, G.B.; Schuur, E.A.; Maestre, F.T. Soil resources and element stocks in drylands to face global issues. Sci. Rep. 2018, 8, 13788.
Wadoux, A.M.-C.; Minasny, B.; McBratney, A.B. Machine learning for digital soil mapping: Applications, challenges and suggested solutions. Earth-Sci. Rev. 2020, 210, 103359.
Arrouays, D.; Poggio, L.; Guerrero, O.A.S.; Mulder, V.L. Digital soil mapping and GlobalSoilMap. Main advances and ways forward. Geoderma Reg. 2020, 21, e00265.
Fathololoumi, S.; Vaezi, A.R.; Alavipanah, S.K.; Ghorbani, A.; Saurette, D.; Biswas, A. Improved digital soil mapping with multitemporal remotely sensed satellite data fusion: A case study in Iran. Sci. Total Environ. 2020, 721, 137703.
Lamichhane, S.; Kumar, L.; Wilson, B. Digital soil mapping algorithms and covariates for soil organic carbon mapping and their implications: A review. Geoderma 2019, 352, 395–413.
Green, J.K.; Seneviratne, S.I.; Berg, A.M.; Findell, K.L.; Hagemann, S.; Lawrence, D.M.; Gentine, P. Large influence of soil moisture on long-term terrestrial carbon uptake. Nature 2019, 565, 476–479.
Yuan, W.; Zheng, Y.; Piao, S.; Ciais, P.; Lombardozzi, D.; Wang, Y.; Ryu, Y.; Chen, G.; Dong, W.; Hu, Z. Increased atmospheric vapor pressure deficit reduces global vegetation growth. Sci. Adv. 2019, 5, eaax1396.
Khaledian, Y.; Miller, B.A. Selecting appropriate machine learning methods for digital soil mapping. Appl. Math. Model. 2020, 81, 401–418.
Zeng, P.; Song, X.; Yang, H.; Wei, N.; Du, L. Digital soil mapping of soil organic matter with deep learning algorithms. ISPRS Int. J. Geo-Inf. 2022, 11, 299.
Padarian, J.; Minasny, B.; McBratney, A.B. Using deep learning for digital soil mapping. Soil 2019, 5, 79–89.
Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More diverse means better: Multimodal deep learning meets remote-sensing imagery classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4340–4354.
Canero, F.M.; Rodriguez-Galiano, V.; Aragones, D. Machine Learning and Feature Selection for soil spectroscopy. An evaluation of Random Forest wrappers to predict soil organic matter, clay, and carbonates. Heliyon 2024, 10, e30228.
Keskin, H.; Grunwald, S.; Harris, W.G. Digital mapping of soil carbon fractions with machine learning. Geoderma 2019, 339, 40–58.
van der Westhuizen, S.; Heuvelink, G.B.; Gardner-Lubbe, S.; Clarke, C.E. Biplots for understanding machine learning predictions in digital soil mapping. Ecol. Inform. 2024, 84, 102892.
Padarian, J.; McBratney, A.B.; Minasny, B. Game theory interpretation of digital soil mapping convolutional neural networks. Soil Discuss. 2020, 2020, 389–397.
Gouda, M.; Abu-hashim, M.; Nassrallah, A.; Khalil, M.N.; Hendawy, E.; Benhasher, F.F.; Shokr, M.S.; Elshewy, M.A.; Mohamed, E.s. Integration of remote sensing and artificial neural networks for prediction of soil organic carbon in arid zones. Front. Environ. Sci. 2024, 12, 1448601. [Google Scholar] [CrossRef]
Wang, S.; Shi, M.; Fan, Y.; Jiang, P.; Chen, S.; Li, Y.; Huang, L.; Zhao, J. Assessing the impacts of climate changes and human activities on cotton distribution in Xinjiang. Front. Sustain. Food Syst. 2025, 9, 1534544. [Google Scholar] [CrossRef]
Zhang, C.; Chen, X.; Shao, H.; Chen, S.; Liu, T.; Chen, C.; Ding, Q.; Du, H. Evaluation and intercomparison of high-resolution satellite precipitation estimates—GPM, TRMM, and CMORPH in the Tianshan Mountain Area. Remote Sens. 2018, 10, 1543. [Google Scholar]
Du, H.; Li, M.; Xu, Y.; Zhou, C. An ensemble learning approach for land use/land cover classification of arid regions for climate simulation: A case study of Xinjiang, northwest China. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2413–2426. [Google Scholar] [CrossRef]
Guo, J.; Fan, Y.; Li, Y.; Bi, Y.; Wang, S.; Hu, Y.; Zhang, L.; Song, W. Topography dominates the spatial and temporal variability of soil bulk density in typical arid zones. Sustainability 2024, 16, 9670. [Google Scholar] [CrossRef]
Wang, X.; Wang, J.; Zhang, J. Comparisons of three methods for organic and inorganic carbon in calcareous soils of northwestern China. PLoS ONE 2012, 7, e44334. [Google Scholar] [CrossRef]
Hengl, T. Finding the right pixel size. Comput. Geosci. 2006, 32, 1283–1298. [Google Scholar] [CrossRef]
Tucker, C.J. Red and photographic infrared linear combinations for monitoring vegetation. Remote Sens. Environ. 1979, 8, 127–150. [Google Scholar]
Camps-Valls, G.; Campos-Taberner, M.; Moreno-Martínez, Á.; Walther, S.; Duveiller, G.; Cescatti, A.; Mahecha, M.D.; Muñoz-Marí, J.; García-Haro, F.J.; Guanter, L. A unified vegetation index for quantifying the terrestrial biosphere. Sci. Adv. 2021, 7, eabc7447. [Google Scholar] [CrossRef]
Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 2002, 83, 195–213. [Google Scholar]
Wu, W.; Al-Shafie, W.M.; Mhaimeed, A.S.; Ziadat, F.; Nangia, V.; Payne, W.B. Soil salinity mapping by multiscale remote sensing in Mesopotamia, Iraq. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4442–4452. [Google Scholar] [CrossRef]
Fern, R.R.; Foxley, E.A.; Bruno, A.; Morrison, M.L. Suitability of NDVI and OSAVI as estimators of green biomass and coverage in a semi-arid rangeland. Ecol. Indic. 2018, 94, 16–21. [Google Scholar] [CrossRef]
Li, S.; Li, X.; Ge, X. Prediction and mapping of soil organic carbon in the Bosten Lake oasis based on Sentinel-2 data and environmental variables. Int. Soil Water Conserv. Res. 2025, 13, 436–446. [Google Scholar] [CrossRef]
Nicolas, H.; Walter, C. Detecting salinity hazards within a semiarid context by means of combining soil and remote-sensing data. Geoderma 2006, 134, 217–230. [Google Scholar] [CrossRef]
Xu, H.; Sun, H.; Xu, Z.; Wang, Y.; Zhang, T.; Wu, D.; Gao, J. kNDMI: A kernel normalized difference moisture index for remote sensing of soil and vegetation moisture. Remote Sens. Environ. 2025, 319, 114621. [Google Scholar] [CrossRef]
Sajjad, H.; Kumar, P.; Srivastava, P.K.; Pathak, S.O.; Ahmed, M.; Kumar, V.; Dobriyal, M.; Kumari, P.; Pandey, P.C. Assessing soil organic carbon and its relation with biophysical and ecological parameters in tropical forest ecosystem India. Geocarto Int. 2025, 40, 2441388. [Google Scholar] [CrossRef]
Li, J.; Zhang, T.; Shao, Y.; Ju, Z. Comparing machine learning algorithms for soil salinity mapping using topographic factors and sentinel-1/2 data: A case study in the yellow river delta of China. Remote Sens. 2023, 15, 2332. [Google Scholar] [CrossRef]
Rakhymberdina, M.; Daumova, G.; Apshikur, B.; Shults, R.; Toguzova, M.; Assylkhanova, Z.; Kolpakova, V.; Kapasov, A. Integrated Chemical-Geoecological Monitoring and Engineering Approaches for Pollution Reduction in the Yertis River. Eng. Sci. 2024, 32, 1328. [Google Scholar] [CrossRef]
Belenok, V.; Hebryn-Baidy, L.; Bielousova, N.; Zavarika, H.; Kryachok, S.; Liashenko, D.; Malik, T. Application of remote sensing methods for statistical estimation of organic matter in soils. Earth Sci. Res. J. 2023, 27, 299–313. [Google Scholar] [CrossRef]
Kursa, M.B.; Rudnicki, W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010, 36, 1–13. [Google Scholar] [CrossRef]
Gomes, L.C.; Faria, R.M.; de Souza, E.; Veloso, G.V.; Schaefer, C.E.G.; Fernandes Filho, E.I. Modelling and mapping soil organic carbon stocks in Brazil. Geoderma 2019, 340, 337–350. [Google Scholar] [CrossRef]
Wang, Q.; Bian, J.; Ma, E.; Zhang, J. Predicting sorption of organic pollutants on soils with interpretable machine learning. Environ. Pollut. 2025, 382, 126665. [Google Scholar] [CrossRef] [PubMed]
Song, J.; Gao, J.; Zhang, Y.; Li, F.; Man, W.; Liu, M.; Wang, J.; Li, M.; Zheng, H.; Yang, X. Estimation of soil organic carbon content in coastal wetlands with measured VIS-NIR spectroscopy using optimized support vector machines and random forests. Remote Sens. 2022, 14, 4372. [Google Scholar]
Emadi, M.; Taghizadeh-Mehrjardi, R.; Cherati, A.; Danesh, M.; Mosavi, A.; Scholten, T. Predicting and mapping of soil organic carbon using machine learning algorithms in Northern Iran. Remote Sens. 2020, 12, 2234. [Google Scholar]
Viscarra Rossel, R.A.; Webster, R.; Bui, E.N.; Baldock, J.A. Baseline map of organic carbon in Australian soil to support national carbon accounting and monitoring under climate change. Glob. Change Biol. 2014, 20, 2953–2970. [Google Scholar] [CrossRef]
Li, Y.; Wang, X.; Niu, Y.; Lian, J.; Luo, Y.; Chen, Y.; Gong, X.; Yang, H.; Yu, P. Spatial distribution of soil organic carbon in the ecologically fragile Horqin Grassland of northeastern China. Geoderma 2018, 325, 102–109. [Google Scholar] [CrossRef]
Kumar, A.; Moharana, P.C.; Jena, R.K.; Malyan, S.K.; Sharma, G.K.; Fagodiya, R.K.; Shabnam, A.A.; Jigyasu, D.K.; Kumari, K.M.V.; Doss, S.G. Digital mapping of soil organic carbon using machine learning algorithms in the upper Brahmaputra valley of northeastern India. Land 2023, 12, 1841. [Google Scholar] [CrossRef]
Odebiri, O.; Mutanga, O.; Odindi, J.; Slotow, R.; Mafongoya, P.; Lottering, R.; Naicker, R.; Matongera, T.N.; Mngadi, M. Remote sensing of depth-induced variations in soil organic carbon stocks distribution within different vegetated landscapes. Catena 2024, 243, 108216. [Google Scholar] [CrossRef]
Shafizadeh-Moghadam, H.; Minaei, F.; Talebi-khiavi, H.; Xu, T.; Homaee, M. Synergetic use of multi-temporal Sentinel-1, Sentinel-2, NDVI, and topographic factors for estimating soil organic carbon. Catena 2022, 212, 106077. [Google Scholar] [CrossRef]
Chen, C.; Yuan, X.; Gan, S.; Kang, X.; Luo, W.; Li, R.; Bi, R.; Gao, S. A new strategy based on multi-source remote sensing data for improving the accuracy of land use/cover change classification. Sci. Rep. 2024, 14, 26855. [Google Scholar]
Angelopoulou, T.; Tziolas, N.; Balafoutis, A.; Zalidis, G.; Bochtis, D. Remote sensing techniques for soil organic carbon estimation: A review. Remote Sens. 2019, 11, 676. [Google Scholar] [CrossRef]
He, M.; Tang, L.; Li, C.; Ren, J.; Zhang, L.; Li, X. Dynamics of soil organic carbon and nitrogen and their relations to hydrothermal variability in dryland. J. Environ. Manag. 2022, 319, 115751. [Google Scholar] [CrossRef] [PubMed]
Tang, X.; Zhao, X.; Bai, Y.; Tang, Z.; Wang, W.; Zhao, Y.; Wan, H.; Xie, Z.; Shi, X.; Wu, B. Carbon pools in China’s terrestrial ecosystems: New estimates based on an intensive field survey. Proc. Natl. Acad. Sci. USA 2018, 115, 4021–4026. [Google Scholar] [CrossRef] [PubMed]
Wang, G.; Mao, J.; Fan, L.; Ma, X.; Li, Y. Effects of climate and grazing on the soil organic carbon dynamics of the grasslands in Northern Xinjiang during the past twenty years. Glob. Ecol. Conserv. 2022, 34, e02039. [Google Scholar] [CrossRef]
Zhou, G.; Zhou, X.; He, Y.; Shao, J.; Hu, Z.; Liu, R.; Zhou, H.; Hosseinibai, S. Grazing intensity significantly affects belowground carbon and nitrogen cycling in grassland ecosystems: A meta-analysis. Glob. Change Biol. 2017, 23, 1167–1179. [Google Scholar] [CrossRef]
Xie, B.; Ding, J.; Ge, X.; Li, X.; Han, L.; Wang, Z. Estimation of soil organic carbon content in the Ebinur Lake wetland, Xinjiang, China, based on multisource remote sensing data and ensemble learning algorithms. Sensors 2022, 22, 2685. [Google Scholar] [CrossRef]
Yan, X.; Zhang, Q.; Ren, X.; Wang, X.; Yan, X.; Li, X.; Wang, L.; Bao, L. Climatic change characteristics towards the “Warming–Wetting” trend in the Pan-Central-Asia arid region. Atmosphere 2022, 13, 467. [Google Scholar] [CrossRef]
Hengl, T.; Mendes de Jesus, J.; Heuvelink, G.B.; Ruiperez Gonzalez, M.; Kilibarda, M.; Blagotić, A.; Shangguan, W.; Wright, M.N.; Geng, X.; Bauer-Marschallinger, B. SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 2017, 12, e0169748. [Google Scholar] [CrossRef]
Yi, S.; Wang, H.; Wang, C.; Huang, X. Threshold effects and synergistic trade-offs in ecosystem services: A spatio-temporal study of Kashgar’s arid region. Agriculture 2025, 15, 1742. [Google Scholar] [CrossRef]
Grossiord, C.; Buckley, T.N.; Cernusak, L.A.; Novick, K.A.; Poulter, B.; Siegwolf, R.T.; Sperry, J.S.; McDowell, N.G. Plant responses to rising vapor pressure deficit. New Phytol. 2020, 226, 1550–1566. [Google Scholar] [CrossRef]
Sun, W.; Zhu, H.; Guo, S. Soil organic carbon as a function of land use and topography on the Loess Plateau of China. Ecol. Eng. 2015, 83, 249–257. [Google Scholar] [CrossRef]
De la Cruz-Amo, L.; Bañares-de-Dios, G.; Cala, V.; Granzow-de la Cerda, Í.; Espinosa, C.I.; Ledo, A.; Salinas, N.; Macía, M.J.; Cayuela, L. Trade-offs among aboveground, belowground, and soil organic carbon stocks along altitudinal gradients in Andean tropical montane forests. Front. Plant Sci. 2020, 11, 106. [Google Scholar] [CrossRef]
Georgiou, K.; Jackson, R.B.; Vindušková, O.; Abramoff, R.Z.; Ahlström, A.; Feng, W.; Harden, J.W.; Pellegrini, A.F.; Polley, H.W.; Soong, J.L. Global stocks and capacity of mineral-associated soil organic carbon. Nat. Commun. 2022, 13, 3797. [Google Scholar] [CrossRef]

Figure 1. Overview of the study area. (a) Arid climate zoning map of China; (b) Spatial distribution of field-measured sampling points; (c) Area proportion of land use types in the study area; (d) Scatter bubble plot of soil organic carbon (SOC) content varying with elevation (X-axis) and latitude (Y-axis).

Figure 2. Statistical distribution characteristics of soil organic carbon (SOC) content under different land use types. (a) Frequency histograms and kernel density curves of four land use types; (b) Statistical distribution of SOC content under different land use types.

Figure 3. Importance ranking of environmental covariates and feature selection results based on the Boruta algorithm.

Figure 4. Parameter optimization and feature dimensionality determination of the RF model. (a) Learning Curve; (b) Feature Inclusion Curve.

Figure 5. Comparison of prediction accuracy among four machine learning models. (a) Random Forest (RF); (b) Gradient Boosting Decision Tree (GBDT); (c) Support Vector Regression (SVR); (d) Multi-layer Perceptron (MLP).

Figure 6. Spatial distribution pattern of topsoil (0–30 cm) SOC in Xinjiang in 2023 based on the optimized RF model.

Figure 7. Spatial distribution of SOC prediction uncertainty. (a) Standard deviation of prediction results; (b) Coefficient of variation in prediction results.

Figure 8. Partial Dependence Plots (PDP) of the top six core environmental covariates for SOC based on the RF model. (a) Evapotranspiration (ET); (b) Vapor Pressure Deficit (VPD); (c) Standardized Precipitation Index (SPI); (d) Elevation (DEM); (e) Precipitation (PRE); (f) Clay content (Clay). The X-axis represents the standardized variable values, and the Y-axis represents the marginal effect of the variable on SOC prediction results.
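As a reading aid for Figure 8, the sketch below shows how a one-dimensional partial dependence curve can be computed: one covariate is fixed at each grid value in turn, every other covariate keeps its observed value, and the model predictions are averaged. The function `toy_model`, its coefficients, and the sample rows are hypothetical stand-ins chosen for illustration; they are not the fitted RF model or the data used in this study.

```python
from statistics import mean

def toy_model(row):
    """Hypothetical regressor on two standardized covariates (ET, VPD).

    This is an illustrative stand-in, not the paper's Random Forest.
    """
    et, vpd = row
    return 5.0 + 2.0 * et - 1.5 * vpd + 0.5 * et * vpd

def partial_dependence(model, rows, feature_index, grid):
    """Average prediction with one feature forced to each grid value."""
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite only the target feature
            preds.append(model(modified))
        curve.append(mean(preds))  # marginal effect at this grid value
    return curve

# Hypothetical standardized samples: (ET, VPD)
rows = [(-1.0, 0.5), (0.0, -0.5), (1.0, 1.0), (0.5, -1.0)]
grid = [-1.0, 0.0, 1.0]

pd_et = partial_dependence(toy_model, rows, feature_index=0, grid=grid)
for x, y in zip(grid, pd_et):
    print(f"ET = {x:+.1f} -> mean predicted SOC = {y:.2f}")
```

The grid values correspond to the standardized X-axis described in the caption, and the averaged predictions correspond to the marginal effect on the Y-axis.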

Table 1. Sources and resolutions of environmental covariates.

Factor Type	Variable Names	Data Source	Spatial Resolution
Vegetation Indices	NDVI, kNDVI, EVI, GDVI, OSAVI, SAVI	Landsat 8	30 m
Topographic Factors	DEM, Slope, Aspect, TR	SRTM	30 m
Soil Properties	TN, TP, TK, BD, pH, CEC, Clay, Silt, Sand, Por	TPDC	1000 m
Soil Properties	ST, SM	NASA GLDAS	0.25°
Meteorological Factors	TEM	ECMWF ERA5-Land	0.1°
Meteorological Factors	PRE	UCSB CHIRPS	0.05°
Meteorological Factors	ET	MOD16A2GF	500 m
Drought Indices	EDDI, SC_PDSI, PDSI, SPEI, SPI, VPD	CHM_Drought	0.1°
Environmental Indices	SI, kNDMI, BSI, CRSI, NDWI, NDSI	Landsat 8	30 m

Table 3. Descriptive statistical characteristics of soil organic carbon content in the surface layer (0–30 cm) under different land uses.

Land Use Type	N	Mean ± SD (g kg⁻¹)	Range (Min–Max, g kg⁻¹)	CV (%)	Skewness	Kurtosis
Cropland	1764	14.96 ± 7.94	0.65–94.50	53.11	2.23	12.80
Forestland	39	12.20 ± 5.40	4.21–30.90	44.23	1.22	2.52
Grassland	347	8.91 ± 6.70	1.19–42.30	75.18	1.86	5.10
Bare land	222	5.07 ± 3.09	1.34–16.50	60.86	1.14	1.18
Overall	2372	13.10 ± 8.13	0.65–94.50	62.01	1.90	9.93
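The CV column in Table 3 follows the usual definition, CV = SD / mean × 100. The short check below recomputes it from the rounded mean and SD values printed in the table; the variable and function names are illustrative. Because the inputs are already rounded to two decimals, the recomputed values can differ from the reported ones in the second decimal place.

```python
# (mean, SD, reported CV) copied from Table 3; mean and SD in g kg^-1, CV in %
table3 = {
    "Cropland":   (14.96, 7.94, 53.11),
    "Forestland": (12.20, 5.40, 44.23),
    "Grassland":  (8.91, 6.70, 75.18),
    "Bare land":  (5.07, 3.09, 60.86),
    "Overall":    (13.10, 8.13, 62.01),
}

def coefficient_of_variation(mean_value, sd_value):
    """Coefficient of variation in percent: SD / mean * 100."""
    return sd_value / mean_value * 100.0

# Recompute CV from the rounded table entries
recomputed = {
    name: round(coefficient_of_variation(m, s), 2)
    for name, (m, s, _) in table3.items()
}

for name, (_, _, reported) in table3.items():
    print(f"{name:<10} reported {reported:6.2f}%  recomputed {recomputed[name]:6.2f}%")
```

All classes have a CV well above 40%, which reflects the strong variability of SOC content within each land use type; grassland is the most variable.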

Table 4. Topsoil (0–30 cm) SOC storage and its proportion under different land use types in the study area.

LULC Type	SOC Storage (Pg)	SOC Storage (Tg)	Percentage (%)
Cropland	0.45	454.33	14.19
Forestland	0.20	200.57	6.26
Grassland	1.13	1127.64	35.22
Bare land	1.42	1419.36	44.33
Overall	3.20	3201.90	100.00
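The columns of Table 4 are linked by simple arithmetic: 1 Pg = 1000 Tg, and each percentage is the class storage divided by the regional total. The sketch below reproduces those columns from the Tg values in the table; the variable names are illustrative.

```python
# SOC storage per land use type in Tg, copied from Table 4
storage_tg = {
    "Cropland": 454.33,
    "Forestland": 200.57,
    "Grassland": 1127.64,
    "Bare land": 1419.36,
}

total_tg = sum(storage_tg.values())

# Convert to Pg (1 Pg = 1000 Tg) and compute each class's share of the total
storage_pg = {name: round(tg / 1000.0, 2) for name, tg in storage_tg.items()}
share_pct = {name: round(tg / total_tg * 100.0, 2) for name, tg in storage_tg.items()}

for name in storage_tg:
    print(f"{name:<10} {storage_pg[name]:5.2f} Pg  {share_pct[name]:6.2f}%")
print(f"{'Overall':<10} {total_tg / 1000.0:5.2f} Pg  {total_tg:8.2f} Tg")
```

The bare land share of 44.33% is the figure cited in the Highlights for the desert "Sparse Carbon Pool", even though bare land has the lowest mean SOC content in Table 3.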
