1. Introduction
Global warming has become a major challenge threatening human survival and sustainable development. Increasing carbon sinks in terrestrial ecosystems is a crucial measure for mitigating climate warming. In response to the challenges of global climate change, the Chinese government announced at the 75th United Nations General Assembly in September 2020 its goals of peaking carbon emissions by 2030 and achieving carbon neutrality by 2060 [
1]. This means that over the next few decades, China will significantly strengthen its efforts in monitoring and managing carbon emissions [
2]. Soil, the main component of the terrestrial carbon reservoir, stores more carbon than vegetation and the atmosphere combined [
3]. Variations in soil carbon can lead to significant shifts in CO2 levels, thereby affecting the global climate [
4]. Forests, as a major part of the terrestrial biosphere, store approximately 73% of the world’s soil carbon, playing a critical role in maintaining the global terrestrial ecosystem carbon balance [
5,
6]. Thus, precisely quantifying the carbon storage of forest soils and providing high-quality spatial distribution data on the soil organic carbon (SOC) content are crucial for improving forest management and promoting ecological restoration.
The traditional SOC mapping process, which includes data collection, field surveys, indoor analysis, field verification, and boundary delineation, is relatively inefficient [
7]. Digital soil mapping (DSM), on the other hand, relies primarily on soil–landscape models as its theoretical foundation, employing Geographic Information Systems (GIS) and mathematical models to infer the spatial distribution of soil properties. In contrast to traditional soil mapping techniques, DSM offers significant advantages [
8,
9,
10], including (1) modeling continuous soil variability through fuzzy classification and geostatistical methods; (2) integrating multi-source data (e.g., remote sensing and topographic variables); (3) applying hybrid spatial prediction techniques (e.g., regression kriging) to address the deterministic and stochastic components of soil variation; (4) adapting to various spatial resolutions; (5) handling nonlinear relationships via machine learning algorithms; (6) reducing data acquisition costs through the use of ancillary data; (7) quantifying prediction uncertainty via geostatistical models; and (8) managing data efficiently within a GIS. The methods commonly employed in DSM include linear regression [
11], geostatistics [
12] and Geographically Weighted Regression (GWR) [
13]. However, in natural settings, the relationship between soil properties and environmental variables is nearly always nonlinear [
9]. Traditional statistical approaches to DSM struggle to adequately capture these nonlinear relationships. Recently, with the rise of artificial intelligence, the application of machine learning algorithms in DSM has emerged as an effective solution to this challenge. Machine learning techniques do not rely on specific assumptions about the data distribution, allowing them to effectively capture the correlations between soil properties and large-scale environmental covariates and to generate pixel-by-pixel predictions at broad scales [
14,
15]. Popular algorithms for this purpose include Random Forest (RF), Extremely Randomized Trees (ERT), and Categorical Boosting (CatBoost) [
16,
17]. For instance, Dharumarajan et al. utilized an RF algorithm to predict the spatial distribution of SOC by integrating environmental variables and soil type data, elucidating the primary environmental elements influencing soil property variations in the semi-arid tropical regions of South India [
18]. Mahmoudzadeh et al. employed 865 soil samples, 101 auxiliary variables, and five machine learning algorithms to generate the spatial distribution of surface SOC in Kurdistan Province at a 90 m resolution [
19]. Zhou and Li systematically evaluated the potential of the RF and CatBoost models for soil salinity mapping in the Yellow River Delta [
10]. However, despite significant advancements in DSM through machine learning, critical research gaps remain. First, the interpretability of machine learning models in SOC mapping remains underexplored. While algorithms like CatBoost demonstrate high predictive performance, quantifying the contributions of individual environmental variables and their interactions through explainable AI methods (e.g., SHAP) is rarely integrated into DSM workflows [
20]. Second, the application of high-resolution SOC mapping in ecologically fragile mining areas—where soil recovery dynamics and anthropogenic disturbances create unique spatial heterogeneity—has not been sufficiently addressed. These gaps hinder the development of tailored strategies for carbon sink management in such regions.
Although the numerous environmental factors influencing soil properties provide rich inputs for DSM, the more feature variables used in modeling, the greater the computational load on predictive models. Moreover, redundant variables can interfere with the model's learning process and thereby reduce prediction performance, a phenomenon commonly referred to as the "curse of dimensionality" [
21]. Therefore, feature selection to optimize the set of feature variables is crucial for shortening model runtime and enhancing model accuracy. Feature selection methods can be categorized into two types according to when the selection takes place. The first is traditional pre-model feature selection, which screens the feature variables before the model performs its actual task and evaluates the effectiveness of the selected features through the model's final outcomes. The Boruta algorithm is currently widely favored for this approach [
20]. The other type is post-model feature selection based on SHAP, an interpretability method that operates by back-calculating the contribution of each distinct feature variable to the model output from the results generated by the model [
22,
23]. At present, there is limited research comparing post-model feature selection based on SHAP with traditional pre-model feature selection methods, leaving the relative performance of these two approaches in DSM unclear. Therefore, it is essential to explore the relative strengths and applicable scenarios of both types of feature selection methods to improve the performance of SOC mapping.
Mining areas, as regions where resource and environmental conflicts are concentrated, have been a focal point for scholars concerning their sustainable development. According to the National Mineral Resources Plan (2016–2020) [
24], Shanxi Province hosts 18 state-planned mining areas, among which the Huodong mining district has the most extensive forest land distribution and is home to the largest state-owned forest farm in Shanxi Province (Taiyue Mountain National Forest Farm). This endows the Huodong mining district with significant ecological protection responsibilities alongside resource exploitation. In recent years, coal mining activities in the region have led to severe vegetation degradation, a reduced soil nutrient supply, and delayed natural soil recovery processes, posing long-term threats to the regional ecosystem and forest soil carbon reservoirs. Consequently, there is an immediate need for high-precision data on the SOC distribution in the area to improve forest management and inform soil carbon sequestration policies. However, despite advancements in SOC mapping, critical gaps remain in high-resolution assessments of forested mining areas. Existing studies predominantly focus on agricultural, grassland, and wetland ecosystems, often relying on outdated sampling data from the 1980s to 2000s. Furthermore, the comparative efficacy of feature selection methods such as Boruta (pre-model) and SHAP (post-model) in optimizing the SOC prediction accuracy remains underexplored. This study addresses these gaps by (1) developing a fine-resolution SOC mapping framework tailored to the Huodong mining district, a region with significant ecological and resource conflicts; (2) evaluating the performance of six machine learning algorithms combined with Boruta and SHAP for feature selection; and (3) determining the spatial distribution of SOC and assessing the contributions and impacts of its primary controlling factors.
2. Materials and Methods
To visually illustrate the processes of data preparation, model execution, and spatial prediction, we have constructed a workflow (
Figure 1).
2.1. Study Area
The Huodong mining district (35°58′~36°54′N, 111°47′~112°24′E) is located in the eastern part of the Jinzhong Coal Base in Shanxi Province, China, and is one of the province's 18 state-planned mining areas. It lies in the southwestern part of the Qinshui Coalfield, with a maximum east–west span of approximately 51 km and a maximum north–south extent of about 102 km, covering a total area of approximately 2802.81 km² (
Figure 2). The Huodong mining district has a temperate continental monsoon climate, with distinct seasons and large diurnal temperature differences. Located on the eastern slope of Mount Huo, the study area is dominated by mountainous and hilly terrain, with a general trend of higher elevations in the north and lower elevations in the south and a typical relative elevation difference of 200 to 500 m. The Huodong mining district is rich in coal resources, with reserves of 30 billion tons, high-quality coal, and a long history of coal mining. It also has the most extensive forest coverage among the 18 state-planned mining areas in Shanxi Province and supports rich biodiversity. Based on relevant data and field survey results, Chinese pine (Pinus tabulaeformis) is the dominant tree species within the mining area, followed by other prominent species such as Mongolian oak, locust, mountain poplar, and northeast China larch.
The soil types in the Huodong mining area fall into two main categories: native soils and mining-disturbed soils. The native soils are predominantly cinnamon soils, featuring a moderate surface organic matter content (1–2%), a neutral to slightly alkaline pH (7.0–8.5), and textures of mainly silt loam or clay loam, with parent material derived from loess deposits. Surface soils in the mined areas have been severely degraded by open-pit mining and coal gangue accumulation and are dominated by anthropogenically disturbed soils mixed with coal gangue. These disturbed soils exhibit a loose structure, organic matter deficiency (<1%), poor water retention capacity, and potential contamination from sulfides and heavy metals released by coal gangue.
2.2. Soil Sampling and Laboratory Analysis
Sampling points were selected using a stratified random sampling approach to ensure coverage across key environmental gradients. The study area was stratified based on elevation segments. Within each stratum, sampling locations were randomly allocated to capture spatial variability. A total of 142 samples were collected, with the sample size determined by balancing spatial representativeness and logistical constraints. The minimum sample size for spatial studies was estimated using the following formula:
$$n = \left(\frac{Z_{\alpha/2}\,\sigma}{E}\right)^{2}$$
where $Z_{\alpha/2}$ = 1.96 (95% confidence level), $\sigma$ = 19.96 g/kg (standard deviation from preliminary data), and E = 3.3 g/kg (acceptable margin of error). This yielded n ≈ 136, aligning with the 142 samples collected. Field accessibility and budget limitations were also considered.
Each sampling point was centered on a Global Positioning System (GPS) location, around which a circular plot with a radius of 5 m was established. Five to six surface soil subsamples (0–20 cm depth) were randomly collected within each plot and mixed into a composite sample, yielding a total of 142 soil samples. The samples were air-dried naturally in a well-ventilated area of the laboratory, and impurities such as stones and plant roots were removed. The dried soil samples were sieved through a 2 mm mesh, and the SOC content was determined using the potassium dichromate heating method [
12]. The coordinate projection of the sampling points was converted using ArcGIS 10.7 to unify the coordinate systems of the study area and environmental variables. The distribution of the sampling points is shown in
Figure 2.
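As a minimal illustration of this coordinate conversion step (assuming the GeoPandas library and a hypothetical shapefile name, rather than the ArcGIS 10.7 workflow actually used), the sampling points could be reprojected as follows:

```python
# Hedged sketch: reproject GPS sampling points to WGS 84 / UTM zone 49N.
# "soc_sampling_points.shp" is a placeholder file name, not the study's data file.
import geopandas as gpd

points = gpd.read_file("soc_sampling_points.shp")  # GPS points in WGS 84 (EPSG:4326)
points_utm = points.to_crs(epsg=32649)             # WGS_1984_UTM_Zone_49N
points_utm.to_file("soc_sampling_points_utm.shp")
```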
2.3. Environmental Covariates
The SCORPAN model integrates seven key factors influencing soil formation: Soil, Climate, Organisms, Relief (topography), Parent material, Age, and spatial position (N). This framework systematically identifies environmental covariates that drive variations in soil properties, including SOC. In this study, we focused on the factors most relevant to SOC dynamics in the Huodong mining district. For instance, soil properties such as pH, nitrogen content, and texture (sand, silt, and clay) directly affect SOC stabilization. Topographic variables such as the digital elevation model (DEM), slope, and aspect influence microclimate and erosion processes, which in turn shape the SOC distribution. Vegetation indices such as the normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI) reflect biomass and litter inputs, both of which are critical for SOC accumulation. Although climate was not explicitly included due to limited local resolution, proxies such as net primary productivity (NPP) and the vegetation indices indirectly captured climatic effects. Because of limitations in data availability, parent material, age, and spatial variables were not directly included, but soil texture (clay/sand fractions) partially reflected parent material characteristics. Our final selection of 32 covariates was also guided by data availability and regional characteristics, ensuring alignment with the SCORPAN framework while addressing practical constraints.
2.3.1. Multispectral Remote Sensing Variables
The multispectral remote sensing variables consisted of net primary production (NPP) products, original spectral information, and vegetation indices. NPP is a significant factor for assessing plant productivity in ecosystems. The NPP data used in this study were obtained from the MOD17A3HGF dataset provided by NASA (
https://lpdaac.usgs.gov). Original spectral information refers to the band values obtained directly from remote sensing imagery, with each band designed for a different monitoring purpose during the satellite design phase. For this study, we considered both the number of bands and the sensitivity of each band to vegetation growth conditions across various satellite sensors and selected six bands from Landsat 8 imagery, as shown in
Table 1. Vegetation indices are derived from combinations of image bands and are primarily used to describe surface vegetation distribution and vegetation change processes and to quantitatively estimate vegetation parameters. Calculating these indices provides a preliminary characterization of the vegetation growth status in the study area. Referring to the vegetation indices used in similar studies and considering the specifics of our study area, we selected ten vegetation indices as initial model inputs (
Table 2).
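To illustrate how such indices are derived from band combinations, the sketch below computes two widely used examples, NDVI and EVI, from Landsat 8 reflectance bands using their standard formulations; the array names are placeholders, and the full set of ten indices used in this study is listed in Table 2.

```python
# Illustrative computation of two common vegetation indices from Landsat 8 bands
# (B2 = blue, B4 = red, B5 = NIR); standard formulations, placeholder arrays.
import numpy as np

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-10)

def evi(nir, red, blue, G=2.5, C1=6.0, C2=7.5, L=1.0):
    """EVI = G * (NIR - Red) / (NIR + C1*Red - C2*Blue + L)."""
    return G * (nir - red) / (nir + C1 * red - C2 * blue + L + 1e-10)

# b2, b4, b5 would be reflectance arrays read from the Landsat 8 image:
# ndvi_map = ndvi(b5, b4)
# evi_map = evi(b5, b4, b2)
```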
2.3.2. Topographic Variables
In this study, a total of eight topographic variables were adopted. The DEM was downloaded from the Geospatial Data Cloud of the Computer Network Information Center, Chinese Academy of Sciences (
https://www.gscloud.cn/). Other topographic attributes in
Table 3 were calculated using SAGA GIS software (version 8.4.0) [
25].
2.3.3. Soil Variables
The soil-related data were downloaded from SoilGrids250m 2.0 (
https://soilgrids.org/), including the sand, clay, and silt contents, the volumetric fraction of coarse fragments (CFVO), the cation exchange capacity (CEC), pH, and the nitrogen content (
Table 4).
2.3.4. Covariate Harmonization
The environmental covariates in this study were obtained from various data sources. To ensure consistency in spatial resolution, all environmental covariates were resampled to a uniform resolution of 30 m using bilinear interpolation. Furthermore, all environmental covariates were standardized to the GCS_WGS_1984 geographic coordinate system and projected to the WGS_1984_UTM_Zone_49N projected coordinate system.
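The following is a minimal sketch of this harmonization step for a single covariate, assuming the rasterio library and placeholder file names; it illustrates bilinear resampling to 30 m in WGS_1984_UTM_Zone_49N (EPSG:32649) and does not reproduce the authors' exact processing chain.

```python
# Hedged sketch: resample a covariate raster to 30 m and reproject it to
# WGS 84 / UTM zone 49N (EPSG:32649) using bilinear interpolation.
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

SRC = "covariate_raw.tif"      # placeholder input covariate
DST = "covariate_30m_utm.tif"  # harmonized output
DST_CRS = "EPSG:32649"         # WGS_1984_UTM_Zone_49N

with rasterio.open(SRC) as src:
    # Compute the output grid at a fixed 30 m cell size
    transform, width, height = calculate_default_transform(
        src.crs, DST_CRS, src.width, src.height, *src.bounds, resolution=30
    )
    profile = src.profile.copy()
    profile.update(crs=DST_CRS, transform=transform, width=width, height=height)

    with rasterio.open(DST, "w", **profile) as dst:
        for band in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, band),
                destination=rasterio.band(dst, band),
                src_transform=src.transform,
                src_crs=src.crs,
                dst_transform=transform,
                dst_crs=DST_CRS,
                resampling=Resampling.bilinear,  # bilinear interpolation as in the text
            )
```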
2.4. Machine Learning Algorithms
The selection of machine learning algorithms in this study was based on their proven effectiveness in handling nonlinear relationships, robustness to high-dimensional data, and successful applications in digital soil mapping (DSM) and environmental variable prediction. Six algorithms—Random Forest (RF), Extremely Randomized Trees (ERT), Gradient Boosting Decision Tree (GBDT), XGBoost, LightGBM, and CatBoost—were chosen for their complementary strengths. RF and ERT, as ensemble methods, reduce overfitting by aggregating multiple decision trees, making them suitable for capturing complex interactions between environmental covariates and SOC [
26,
27]. GBDT and its derivatives (XGBoost, LightGBM, and CatBoost) employ gradient boosting to sequentially optimize residuals, excelling in scenarios with noisy data and heterogeneous feature spaces [
28,
29,
30]. Specifically, LightGBM and CatBoost enhance computational efficiency through gradient-based sampling and automated handling of categorical features, respectively, which are critical for large-scale spatial datasets [
31]. These algorithms have been widely validated in SOC mapping studies, demonstrating superior performance in accuracy and interpretability compared to traditional statistical models [
10,
18,
32].
2.4.1. RF
RF is a classification and regression model comprising multiple decision trees, in which the final output is the mode of the individual tree outputs for classification or their average for regression [
26]. It includes a large number of decision trees with high predictive accuracy that are weakly correlated or even uncorrelated, forming an ensemble prediction model [
33]. The combined ensemble of prediction models collaboratively participates in predicting the values of output variables for new observations, thereby achieving higher accuracy [
34].
2.4.2. ERT
ERT was proposed in 2006 by Pierre Geurts and colleagues after extensive experimental research [
27]. Its principle is largely similar to that of RF, aggregating multiple decision trees and making predictions based on the average of the individual tree predictions. In ERT, however, every decision tree is trained on the complete dataset rather than a bootstrap sample, ensuring full use of the training examples and helping to reduce the final prediction bias [
35].
2.4.3. GBDT
GBDT, proposed by Friedman in 2001, is a Boosting algorithm in the realm of ensemble learning [
28]. Its training process is sequential: weak learners are trained in a specific order, with each subsequent weak learner building upon the previous ones. GBDT commonly employs decision trees as its weak learners, and the key concept is that each new tree is fitted to the negative gradient of the loss function, i.e., to the residuals of the current ensemble under squared-error loss.
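This residual-fitting idea can be made concrete with a short sketch; it is a conceptual illustration using squared-error loss (for which the negative gradient equals the residuals), not the GBDT implementation used in this study.

```python
# Minimal illustrative sketch of gradient boosting with squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit_predict(X, y, X_new, n_trees=100, learning_rate=0.1, max_depth=3):
    """Fit a simple gradient-boosted tree ensemble and predict for X_new."""
    pred = np.full(len(y), y.mean())          # initialize with the mean prediction
    pred_new = np.full(len(X_new), y.mean())
    for _ in range(n_trees):
        residuals = y - pred                  # negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                # each new tree is fitted to the residuals
        pred += learning_rate * tree.predict(X)
        pred_new += learning_rate * tree.predict(X_new)
    return pred_new
```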
2.4.4. XGBoost
XGBoost, which is based on gradient-boosted decision trees, handles classification and regression tasks by sequentially combining weak learners during training. It was introduced by Chen and Guestrin in 2016 [
36]. The XGBoost model is designed to avoid overfitting and reduce computational costs by employing simplification and regularization techniques, allowing it to predict sample scores based on the features of the samples [
37].
2.4.5. LightGBM
To address the issue of time-consuming training with large datasets, LightGBM employs two key methods: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) [
29]. GOSS retains all samples with large gradients and randomly samples from those with small gradients when estimating information gain, yielding more accurate results than uniform random sampling. The EFB method reduces dimensionality by bundling mutually exclusive features. Additionally, LightGBM uses a histogram-based algorithm with histogram subtraction, which reduces memory consumption and enhances computational efficiency [
31].
2.4.6. CatBoost
CatBoost is a novel GBDT algorithm that utilizes symmetric decision trees as base learners. It improves upon the gradient estimation in traditional GBDT algorithms by adopting a permutation-based boosting approach, which reduces the impact of gradient estimation bias and enhances the model’s generalization capability. Additionally, CatBoost can automatically handle discrete feature data. Compared to other Boosting algorithms, it exhibits superior robustness and scalability when addressing regression problems that involve multiple input features and noisy sample data [
30].
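For reference, the six regressors described above could be instantiated in Python as in the sketch below; the hyperparameter values shown are illustrative placeholders rather than the tuned values reported in Table 5.

```python
# Hedged sketch: instantiating the six regression models with placeholder settings.
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor)
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

models = {
    "RF": RandomForestRegressor(n_estimators=500, random_state=42),
    "ERT": ExtraTreesRegressor(n_estimators=500, bootstrap=False, random_state=42),
    "GBDT": GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                      random_state=42),
    "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05,
                            reg_lambda=1.0, random_state=42),
    "LightGBM": LGBMRegressor(n_estimators=500, learning_rate=0.05,
                              random_state=42),
    "CatBoost": CatBoostRegressor(iterations=500, learning_rate=0.05,
                                  random_seed=42, verbose=0),
}
```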
2.5. Parameter Setting
The six nonlinear algorithms, including RF, ERT, and GBDT, were implemented on the Python 3.9.12 platform. To test the performance of the different machine learning models, the train_test_split function from the scikit-learn (sklearn) module was used to randomly split the data into a training set (80%) and a validation set (20%), with the split fixed using the random_state parameter. A grid search with 5-fold cross-validation was employed to tune the parameters of the machine learning methods, yielding the optimal parameter values for each model, as shown in
Table 5.
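A minimal sketch of this splitting and tuning procedure is shown below, using CatBoost as an example; the parameter grid is hypothetical and not the grid actually searched in this study.

```python
# Hedged sketch of the 80/20 split and 5-fold grid search described above.
from sklearn.model_selection import train_test_split, GridSearchCV
from catboost import CatBoostRegressor

# X: array of environmental covariates; y: measured SOC values (g/kg), both assumed given
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42  # fixed random_state for reproducibility
)

param_grid = {                      # illustrative grid, not the authors' grid
    "depth": [4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1],
    "iterations": [300, 500, 1000],
}

search = GridSearchCV(
    CatBoostRegressor(verbose=0, random_seed=42),
    param_grid,
    cv=5,                           # 5-fold cross-validation, as in the text
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```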
2.6. Feature Selection
2.6.1. Boruta
In SOC mapping, not all environmental covariates serve as critical variables in modeling. An excess of variables leads to information redundancy and model complexity [
38]. Boruta, a feature selection technique grounded in the RF algorithm, efficiently assesses the significance of each feature relative to the target variable within the dataset. Its basic principle involves replicating the original feature set and constructing a set of randomized shadow features by randomly shuffling the values of each feature [
20]. In the Boruta algorithm, the importance score is determined using the out-of-bag (OOB) error from the RF model, and it is calculated according to the following formula:
$$\mathrm{OOB_{error}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i^{\mathrm{OOB}}\right)^2$$
where $\mathrm{OOB_{error}}$ represents the OOB error of the Random Forest, $y_i$ represents the sample value, and $\hat{y}_i^{\mathrm{OOB}}$ represents the predicted value of the OOB sample for sample $i$. The importance Z-score is then computed as
$$Z = \frac{\mu_{\mathrm{OOB}}}{\sigma_{\mathrm{OOB}}}$$
where $Z$ is the Z-score, $\mu_{\mathrm{OOB}}$ is the mean of the OOB errors, and $\sigma_{\mathrm{OOB}}$ is the standard deviation of the OOB errors.
The final selection criterion relies on the maximum Z-score of the shadow feature’s significance (ShadowMax). If a feature’s Z-score surpasses ShadowMax, it is deemed crucial; conversely, it is considered insignificant and excluded from the modeling process.
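A minimal sketch of this procedure with the BorutaPy package is given below; the estimator settings and variable names (X, y, covariate_names) are placeholders rather than the configuration used in this study.

```python
# Hedged sketch: Boruta feature selection against shadow features using BorutaPy.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy

# X: (n_samples, n_covariates) array of covariates; y: SOC values;
# covariate_names: list of covariate names (all assumed given)
rf = RandomForestRegressor(n_estimators=500, max_depth=5, n_jobs=-1, random_state=42)
boruta = BorutaPy(rf, n_estimators="auto", random_state=42)
boruta.fit(np.asarray(X), np.asarray(y))   # compares each feature against its shadow copy

selected = [name for name, keep in zip(covariate_names, boruta.support_) if keep]
print("Confirmed covariates:", selected)
```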
2.6.2. SHAP
SHAP (SHapley Additive exPlanations) is a method for explaining the output of machine learning models and is categorized as an additive feature attribution method that ensures consistency with the expected representation and local accuracy [
39]. SHAP assigns a specific prediction importance value, known as the SHAP value, to each feature variable, serving as a unified measure of feature importance. By connecting optimal resource allocation with local explanations, SHAP enables the analysis of each feature’s contribution to the prediction outcomes in environments where multiple feature variables interact. The calculation of SHAP values can be expressed by the following formula [
23]:
$$g(z') = \phi_0 + \sum_{i=1}^{M}\phi_i z'_i$$
where $g$ represents the explanation model, $z'_i \in \{0,1\}$ indicates the presence or absence of the corresponding feature variable, $M$ is the total number of input feature variables, $\phi_i$ represents the marginal contribution of the i-th variable, and $\phi_0$ is the mean prediction over all training instances, serving as the constant of the explanation model. The specific formula for calculating SHAP values is as follows:
$$y_n = y_{\mathrm{base}} + f(x_{n,1}) + f(x_{n,2}) + \cdots + f(x_{n,k})$$
where $y_n$ represents the model prediction for the n-th sample, which is decomposed into the SHAP values to be calculated; $y_{\mathrm{base}}$ is the mean prediction across all samples; $f(x_{n,1})$ is the contribution value of the first feature variable in the n-th sample, and so on for the remaining features; and $k$ is the number of feature variables.
In this study, the SHAP mechanism was introduced to interpret the model, ranking the feature variables input into the model by their importance to identify the primary feature variables influencing SOC. The mean SHAP values were used to highlight important features, and the SHAP summary plot was also utilized to evaluate the interactions between SOC and environmental variables.
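The following sketch, assuming a fitted tree-based model and placeholder variable names (model, X_train, covariate_names), shows how the mean absolute SHAP values and the summary plot described above can be obtained with the shap package.

```python
# Hedged sketch: global SHAP importance ranking and summary plot.
import numpy as np
import shap

explainer = shap.TreeExplainer(model)          # tree-based model, e.g., the fitted CatBoost
shap_values = explainer.shap_values(X_train)   # (n_samples, n_features) array

# Rank covariates by mean absolute SHAP value (global importance)
mean_abs_shap = np.abs(shap_values).mean(axis=0)
ranking = sorted(zip(covariate_names, mean_abs_shap), key=lambda t: -t[1])
print(ranking[:10])

# Summary plot of feature effects across samples
shap.summary_plot(shap_values, X_train, feature_names=covariate_names)
```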
2.7. Evaluation Criteria
The entire dataset (142 samples) was randomly split into a training set (114 samples) and a validation set (28 samples) at a ratio of 80% to 20%, and this process was repeated 50 times. To validate the models' performance, the coefficient of determination (R2), root mean square error (RMSE), and ratio of performance to deviation (RPD) were used to assess model accuracy on the validation set. An R2 value closer to 1 indicates greater model accuracy, while a lower RMSE and a higher RPD indicate superior model performance [40].
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{RPD} = \frac{1}{\mathrm{RMSE}}\sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}{n-1}}$$
where n is the number of sampling points; $y_i$ is the actual value of soil sample i; $\hat{y}_i$ is the estimated value of soil sample i; $\bar{y}$ is the mean of the actual values of the soil samples; and K is the number of feature variables.
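For completeness, a minimal sketch of these three metrics, assuming NumPy arrays of observed and predicted SOC values for the validation set, is as follows:

```python
# Hedged sketch: validation metrics R2, RMSE, and RPD.
import numpy as np

def evaluate(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                         # coefficient of determination
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean square error
    sd = np.std(y_true, ddof=1)                      # standard deviation of observations
    rpd = sd / rmse                                  # ratio of performance to deviation
    return r2, rmse, rpd
```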
5. Conclusions
This study achieved fine-resolution predictions of the regional SOC content by constructing a streamlined yet information-rich optimal database of environmental covariates and combining it with advanced machine learning models, providing a valuable reference for soil mapping in other regions.
Among the six machine learning models compared, CatBoost achieved the highest prediction accuracy for SOC (R2 = 0.70, RMSE = 10.86 g/kg, RPD = 1.84). Both feature selection methods, Boruta and SHAP, increased the accuracy of SOC predictions, but the model obtained after Boruta feature selection showed the highest predictive accuracy. Compared to the model without feature selection, the introduction of Boruta increased the R2 by 8.57% (0.76 vs. 0.70) and reduced the RMSE by 9.02% (9.88 vs. 10.86 g/kg). The SOC prediction map generated using the optimal machine learning model (CatBoost) and feature selection method (Boruta) revealed that SOC levels gradually decrease from the northern and southern parts of the study area toward the central region, indicating strong spatial heterogeneity. In exploring the factors explaining the SOC content, we found that pH, nitrogen content, sand, DEM, and the B3 band play crucial roles in predicting the forest SOC content.