Fine Resolution Mapping of Forest Soil Organic Carbon Based on Feature Selection and Machine Learning Algorithm

Li, Yanan; Li, Jing; Tan, Jun; Ma, Tianyue; Yan, Xingguang; Chen, Zongyang; Li, Kunheng

doi:10.3390/rs17122000

Open AccessArticle

Fine Resolution Mapping of Forest Soil Organic Carbon Based on Feature Selection and Machine Learning Algorithm

by

Yanan Li

¹,

Jing Li

^1,*

,

Jun Tan

¹,

Tianyue Ma

^1,2,

Xingguang Yan

¹

,

Zongyang Chen

¹ and

Kunheng Li

¹

College of Geoscience and Surveying Engineering, China University of Mining and Technology (Beijing), Beijing 100083, China

²

School of Natural Sciences, Bangor University, Bangor LL57 2DG, UK

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(12), 2000; https://doi.org/10.3390/rs17122000

Submission received: 7 April 2025 / Revised: 4 June 2025 / Accepted: 4 June 2025 / Published: 10 June 2025

Download

Browse Figures

Versions Notes

Abstract

An accurate forest soil organic carbon (SOC) assessment aids in the ecological restoration of forest mining areas, enabling dynamic monitoring of carbon sink accounting and informed land reclamation decisions. Digital soil mapping (DSM) has enhanced soil monitoring, with machine learning and environmental covariates becoming the keys to improving accuracy. This study utilized 32 environmental variables from multispectral, topographic, and soil data, along with 142 soil samples and six machine learning methods to construct a forest SOC model for the Huodong mining district. The performance of Boruta and SHAP (SHapley Additive exPlanations) in optimizing feature selection was evaluated. Ultimately, the optimal machine learning model and feature selection method were applied to map the SOC distribution, with variable contributions quantified using SHAP. The results showed that CatBoost performed best among the six algorithms in predicting the SOC content (R² = 0.70). Both Boruta and SHAP improved the prediction accuracy, with Boruta achieving the highest precision. Introducing the Boruta model increased R² by 8.57% (from 0.70 to 0.76) compared to models without feature selection. The spatial distribution mapping revealed higher SOC concentrations in the southern and northern regions and lower levels in the central area, indicating strong spatial heterogeneity. Key factors influencing the SOC distribution included pH, the nitrogen content, sand content, DEM, and B3 band.

Keywords:

digital soil mapping; soil organic carbon; machine learning algorithms; feature selection; model comparison

1. Introduction

Global warming has become a major challenge threatening human survival and sustainable development. Increasing carbon sinks in terrestrial ecosystems are crucial measures to mitigate climate warming. In response to the challenges of global climate change, the Chinese government announced at the 75th United Nations General Assembly in September 2020 its goal to peak carbon emissions by 2030 and achieve carbon neutrality by 2060 [1]. This means that over the next few decades, China will significantly strengthen its efforts in monitoring and managing carbon emissions [2]. Soil, as the main component of terrestrial carbon reservoirs, surpasses the combined amounts found in vegetation and the atmosphere [3]. Variations in soil carbon can lead to significant shifts in CO₂ levels, thereby affecting the global climate [4]. Forests, as a major part of the terrestrial biosphere, store approximately 73% of the world’s soil carbon, playing a critical role in maintaining the global terrestrial ecosystem carbon balance [5,6]. Thus, precisely quantifying the carbon storage of forest soils and providing high-quality spatial distribution data for the SOC content is crucial for improving forest management and promoting ecological restoration.

The traditional SOC mapping process, which includes data collection, field surveys, indoor analysis, field verification, and boundary delineation, is relatively inefficient [7] Digital soil mapping (DSM), on the other hand, primarily relies on soil–landscape models as its theoretical foundation, employing Geographic Information System (GIS) and mathematical models to infer the spatial distribution of soil properties. In contrast to traditional soil mapping techniques, DSM offers significant advantages [8,9,10], including (1) modeling continuous soil variability through fuzzy classification and geostatistical methods; (2) integrating multi-source data (e.g., remote sensing and topographic variables); (3) hybrid spatial prediction techniques (e.g., regression kriging) to address deterministic and stochastic components of soil variation; (4) adaptability to various spatial resolutions; (5) handling nonlinear relationships via machine learning algorithms; (6) reducing data acquisition costs through ancillary data utilization; (7) quantifying prediction uncertainty via geostatistical models; and (8) efficient GIS-based data management. The commonly employed methods in DSM include linear regression [11], geostatistics [12] and Geographically Weighted Regression (GWR) [13]. However, in natural settings, the relationship between soil properties and environmental variables is nearly always nonlinear [9]. Traditional statistical approaches for DSM struggle to adequately address these nonlinear relationships. Recently, with the rise of artificial intelligence, the application of machine learning algorithms in DSM has emerged as an effective solution to this challenge. Machine learning techniques do not rely on specific assumptions about the data distribution, allowing them to effectively capture the correlations between soil properties and large-scale environmental covariates, and predict each pixel at a broad scale [14,15]. Several popular algorithms used for this purpose include RF, ERT and Category Boosting (CatBoost) [16,17]. For instance, Dharumarajan et al. utilized an RF algorithm to predict the spatial distribution of SOC by integrating environmental variables and soil type data, elucidating the primary environmental elements influencing soil property variations in the semi-arid tropical regions of South India [18]. Mahmoudzadeh et al. employed 865 soil samples, 101 auxiliary variables, and five machine learning algorithms to generate the spatial distribution of surface SOC in Kurdistan Province at a 90 m resolution [19]. Zhou and Li systematically evaluated the potential of the RF and CatBoost models for soil salinity mapping in the Yellow River Delta [10]. However, despite significant advancements in DSM through machine learning, critical research gaps remain. First, the interpretability of machine learning models in SOC mapping remains underexplored. While algorithms like CatBoost demonstrate high predictive performance, quantifying the contributions of individual environmental variables and their interactions through explainable AI methods (e.g., SHAP) is rarely integrated into DSM workflows [20]. Second, the application of high-resolution SOC mapping in ecologically fragile mining areas—where soil recovery dynamics and anthropogenic disturbances create unique spatial heterogeneity—has not been sufficiently addressed. These gaps hinder the development of tailored strategies for carbon sink management in such regions.

Despite the numerous environmental factors influencing soil properties, which bring considerable convenience to DSM, the more feature variables available for modeling, the greater the computational load on predictive models becomes. Moreover, redundant variables can interfere with the model’s proper learning process, thereby reducing the prediction performance, a phenomenon commonly referred to as the “dimensional disaster” [21]. Therefore, conducting feature selection to optimize the feature variables is crucial for shortening model runtime and enhancing model accuracy. Depending on the timing of feature selection, it can be categorized into two types. One is the traditional pre-model feature selection method, which screens feature variables before the model completes its actual task, evaluating the effectiveness of the selected features through the final outcomes of the model. Currently, the Boruta algorithm is widely favored for this approach [20]. The other type is post-model feature selection based on SHAP, an interpretability method that operates by back-calculating the contribution of each distinct feature variable to the model output from the results generated by the model [22,23]. At present, there is limited research comparing the differences between post-model feature selection based on SHAP and traditional pre-model feature selection methods, leaving the comparative performance of these two feature selection methods in DSM unclear. Therefore, it is essential to explore the relative strengths and applicable scenarios of both types of feature selection methods to improve the performance of SOC mapping in the aspect of feature selection.

Mining areas, as regions where resource and environmental conflicts are concentrated, have been a focal point for scholars concerning their sustainable development. According to the National Mineral Resources Plan (2016–2020) [24], Shanxi Province hosts 18 state-planned mining areas, among which the Huodong mining district has the most extensive forest land distribution and is home to the largest state-owned forest farm in Shanxi Province (Taiyue Mountain National Forest Farm). This endows the Huodong mining district with significant ecological protection responsibilities alongside resource exploitation. In recent years, coal mining activities in the region have led to severe vegetation degradation, a reduced soil nutrient supply, and delayed natural soil recovery processes, posing long-term threats to the regional ecosystem and forest soil carbon reservoirs. Consequently, there is an immediate need for high-precision data on the SOC distribution in the area to improve forest management and inform soil carbon sequestration policies. However, despite advancements in SOC mapping, critical gaps remain in high-resolution assessments of forested mining areas. Existing studies predominantly focus on agricultural, grassland, and wetland ecosystems, often relying on outdated sampling data from the 1980s to 2000s. Furthermore, the comparative efficacy of feature selection methods such as Boruta (pre-model) and SHAP (post-model) in optimizing the SOC prediction accuracy remains underexplored. This study addresses these gaps by (1) developing a fine-resolution SOC mapping framework tailored to the Huodong mining district, a region with significant ecological and resource conflicts; (2) evaluating the performance of six machine learning algorithms combined with Boruta and SHAP for feature selection; and (3) determining the spatial distribution of SOC and assessing the contributions and impacts of its primary controlling factors.

2. Materials and Methods

To visually illustrate the processes of data preparation, model execution, and spatial prediction, we have constructed a workflow (Figure 1).

2.1. Study Area

The Huodong mining district (35°58′~36°54′N, 111°47′~112°24′E) is located in the eastern part of the Jinzhong Coal Base in Shanxi Province, China, and is one of the 18 state-planned mining areas in Shanxi Province. It is situated in the southwestern part of the Qishui Coalfield, with the widest east–west span of approximately 51 km and the longest north–south extent of about 102 km, covering a total area of approximately 2802.81 km² (Figure 2). The climate of the Huodong mining district is characterized by a temperate continental monsoon climate, with distinct seasons and significant diurnal temperature differences. Located on the eastern slope of Mount Huo, the predominant terrain of the study area consists of mountains and hills, with a general trend of higher elevations in the north and lower in the south, and a typical relative elevation difference ranging from 200 to 500 m. The Huodong mining district is rich in coal resources, with reserves amounting to 30 billion tons, characterized by high-quality coal and a long history of coal mining. Additionally, it boasts the most extensive forest coverage among the 18 state-planned mining areas in Shanxi Province and supports rich biodiversity. Based on relevant data and field survey results, Chinese pine (Pinus tabulaeformis) is the dominant tree species within the mining area, followed by other prominent species such as Mongolian oak, locust, mountain poplar, and northeast China larch.

The soil types in the Huodong mining area are primarily divided into two categories: native soils and mining-disturbed soils. The native soils are predominantly cinnamon soils, featuring a moderate surface organic matter content (1–2%), neutral to slightly alkaline pH (7.0–8.5), and textures mainly composed of silt loam or clay loam, with the parent material derived from loess deposits. Surface soils in the mining area have been severely degraded due to open-pit mining and coal gangue accumulation, where mining-disturbed soils dominated by anthropogenically disturbed soils and coal gangue mixtures prevail. These disturbed soils exhibit a loose structure, organic matter deficiency (<1%), poor water retention capacity, and potential contamination from sulfides and heavy metals released by coal gangue.

2.2. Soil Sampling and Laboratory Analysis

Sampling points were selected using a stratified random sampling approach to ensure coverage across key environmental gradients. The study area was stratified based on elevation segments. Within each stratum, sampling locations were randomly allocated to capture spatial variability. A total of 142 samples were collected, with the sample size determined by balancing spatial representativeness and logistical constraints. The formula for estimating minimum sample size in spatial studies was referenced as follows:

n = \frac{{(Z_{α / 2} \cdot σ)}^{2}}{E^{2}}

(1)

where

Z_{α / 2}

= 1.96 (95% confidence level),

σ

= 19.96 g/kg (standard deviation from preliminary data), and E = 3.3 g/kg (acceptable margin of error). This yielded n ≈ 136, aligning with the 142 samples collected. Field accessibility and budget limitations were also considered.

Each sampling point was centered on a Global Positioning System (GPS) location, and a sample plot with a radius of 5 m was selected for sampling. Five to six surface soil samples were randomly collected at a depth of 0–20 cm and mixed, resulting in a total of 142 soil samples. The soil samples were then placed in a well-ventilated area of the laboratory to air-dry naturally, and impurities such as stones and plant roots were cleared. The dried soil samples were sieved through a 2 mm mesh, and the SOC content was determined using the potassium dichromate heating method [12]. The coordinate projection of the sampling points was converted using ArcGIS 10.7 to unify the coordinate systems of the study area and environmental variables. The distribution of the sampling points is shown in Figure 2.

2.3. Environmental Covariates

The SCORPAN model integrates seven key factors influencing soil formation: Soil, Climate, Organisms, Relief (topography), Parent material, Age, and space (N). This framework systematically identifies environmental covariates that drive soil property variations, including SOC. In this study, we focused on factors most relevant to SOC dynamics in the Huodong mining district. For instance, soil properties such as pH, the nitrogen content, and texture (sand, silt, and clay) directly affect SOC stabilization. Topographic variables like DEM, slope, and aspect influence microclimate and erosion processes, which in turn impact the SOC distribution. Vegetation indices such as NDVI and EVI reflect biomass and litter inputs, both of which are critical for SOC accumulation. Although climate was not explicitly included due to limited local resolution, proxies like net primary productivity (NPP) and vegetation indices indirectly captured climatic effects. Due to limitations in data availability, parent material, age, and spatial variables were not directly included, but soil texture (clay/sand particles) partially reflected parent material characteristics. Our final selection of 32 covariates was also guided by data availability and regional characteristics, ensuring alignment with the SCORPAN framework while addressing practical constraints.

2.3.1. Multispectral Remote Sensing Variables

The multispectral remote sensing variables consisted of net primary production (NPP) products, original spectral information, and vegetation indices. NPP is a significant factor for assessing plant productivity in ecosystems. The NPP data used in this study were obtained from the MOD17A3HGF dataset provided by NASA (https://lpdaac.usgs.gov). Original spectral information refers to the variables directly obtained through a remote sensing image analysis, with these bands designed for different functions during the satellite design phase. For this study, we comprehensively considered the number of bands and the sensitivity of each band to monitoring vegetation growth conditions from various satellite sensors, selecting six bands from Landsat 8 imagery, as shown in Table 1. Vegetation indices are derived from combinations of image bands and are primarily used to describe the characteristics of surface vegetation distribution, vegetation change processes, and quantitative estimation of vegetation parameters. Through the calculation of these indices, a preliminary characterization of the vegetation growth status in the study area can be achieved. Referring to the vegetation indices used in similar studies and considering the specifics of our study area, ten vegetation indices were selected for initial model inputs (Table 2).

2.3.2. Topographic Variables

In this study, a total of eight topographic variables were adopted. DEM was downloaded from the Computer Geographic Spatial Data Cloud site of the Chinese Academy of Sciences (https://www.gscloud.cn/). Other topographic attributes in Table 3 were calculated using SAGAGIS software (version 8.4.0) [25].

2.3.3. Soil Variables

The soil-related data were downloaded from Soilgrids250m2.0 (https://soilgrids.org/), including sand, clay, silt, coarse fraction volume occupancy (CFVO), cation exchange capacity (CEC), pH, and nitrogen content (Table 4).

2.3.4. Covariate Harmonization

The environmental covariates in this study were obtained from various data sources. To ensure consistency in spatial resolution, all environmental covariates were processed to a uniform resolution of 30 m using the bilinear interpolation method. Furthermore, all environmental covariates were standardized to the GCS_WGS_1984 geographic coordinate system and projected into the WGS_1984_UTM_Zone_49N projection coordinate system.

2.4. Machine Learning Algorithms

The selection of machine learning algorithms in this study was based on their proven effectiveness in handling nonlinear relationships, robustness to high-dimensional data, and successful applications in digital soil mapping (DSM) and environmental variable prediction. Six algorithms—Random Forest (RF), Extremely Randomized Trees (ERT), Gradient Boosting Decision Tree (GBDT), XGBoost, LightGBM, and CatBoost—were chosen for their complementary strengths. RF and ERT, as ensemble methods, reduce overfitting by aggregating multiple decision trees, making them suitable for capturing complex interactions between environmental covariates and SOC [26,27]. GBDT and its derivatives (XGBoost, LightGBM, and CatBoost) employ gradient boosting to sequentially optimize residuals, excelling in scenarios with noisy data and heterogeneous feature spaces [28,29,30]. Specifically, LightGBM and CatBoost enhance computational efficiency through gradient-based sampling and automated handling of categorical features, respectively, which are critical for large-scale spatial datasets [31]. These algorithms have been widely validated in SOC mapping studies, demonstrating superior performance in accuracy and interpretability compared to traditional statistical models [10,18,32].

2.4.1. RF

RF is a classification and regression model that comprises multiple decision trees, where the output class is determined by the mode of the individual tree outputs [26]. It includes a large number of decision trees with high predictive accuracy that are weakly correlated or even uncorrelated, forming an ensemble prediction model [33]. The combined ensemble of prediction models collaboratively participates in predicting the values of output variables for new observations, thereby achieving higher accuracy [34].

2.4.2. ERT

ERT was proposed in 2006 by Pierre Geurts and colleagues after extensive experimental research [27]. Its principle is largely similar to that of RF, involving the aggregation of multiple decision trees to make predictions based on the average of the individual tree predictions. For ERT, every decision tree is instructed using the complete dataset, guaranteeing the exploitation of training examples and partially aiding in diminishing the ultimate prediction bias [35].

2.4.3. GBDT

GBDT, proposed by Friedman in 2001, is a Boosting algorithm in the realm of ensemble learning [28]. Its training process is sequential, meaning that weak learners are trained in a specific order, with each subsequent weak learner building upon the previous one. GBDT commonly employs decision trees as its fundamental weak learners, where the key concept is that each tree is built to minimize the residuals in the direction of the gradient of the loss function.

2.4.4. XGBoost

XGBoost, based on gradient-boosted decision trees, is capable of handling classification and regression tasks in machine learning with weak supervision through its training process. It was introduced by Chen and Guestrin in 2016 [36]. The XGBoost model is crafted to avoid overfitting and maintain optimal computational efficiency by employing simplification and regularization techniques, thereby also lowering computational costs. By doing so, the model can predict sample scores based on the features of the samples [37].

2.4.5. LightGBM

To address the issue of time-consuming training with large datasets, LightGBM employs two key methods: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) [29]. GOSS retains samples with larger gradients while discarding those with smaller gradients based on the information gain of the samples, yielding more accurate results compared to random sampling. The EFB method achieves dimensionality reduction by bundling features. Additionally, LightGBM optimizes the histogram difference for the histogram method, which reduces memory consumption and enhances computational efficiency [31].

2.4.6. CatBoost

CatBoost is a novel GBDT algorithm that utilizes symmetric decision trees as base learners. It improves upon the gradient estimation in traditional GBDT algorithms by adopting a permutation-based boosting approach, which reduces the impact of gradient estimation bias and enhances the model’s generalization capability. Additionally, CatBoost can automatically handle discrete feature data. Compared to other Boosting algorithms, it exhibits superior robustness and scalability when addressing regression problems that involve multiple input features and noisy sample data [30].

2.5. Parameter Setting

The implementation of six nonlinear algorithms, including RF, ERT, and GBDT, was completed on the Python 3.9.12 platform. To test the performance of different machine learning models, the train_test_split function from the Sklearn module in the Python 3.9.12 programming language was used to randomly split the data into 80% for the training set and 20% for the validation set, with the selected datasets fixed using the random_state function. A grid search with 5-fold cross-validation was employed to tune the parameters of the machine learning methods, yielding the optimal parameter values for each model, as shown in Table 5.

2.6. Feature Selection

2.6.1. Boruta

In SOC mapping, not all environmental covariates serve as critical variables in modeling. An excess of variables leads to information redundancy and model complexity [38]. Boruta, a feature selection technique grounded in the RF algorithm, efficiently assesses the significance of each feature relative to the target variable within the dataset. Its basic principle involves replicating the original feature set and constructing a set of randomized shadow features by randomly shuffling the values of each feature [20]. In the Boruta algorithm, the importance score is determined using the out-of-bag (OOB) error from the RF model, and it is calculated according to the following formula:

M S E_{O O B} = \frac{1}{N} {(y_{i} - {\hat{y}}_{i O O B})}^{2} / N

(2)

where

M S E_{O O B}

represents the OOB error of the Random Forest,

y_{i}

represents the sample value, and

{\hat{y}}_{i O O B}

represents the predicted value of the OOB sample for sample

y_{i}

.

Z_{s c o r e} = \bar{M S E_{O O B}} / S D M S E_{O O B}

(3)

where

Z_{s c o r e}

is the Z-score,

\bar{M S E_{O O B}}

is the mean of the OOB errors, and

S D M S E_{O O B}

is the standard deviation of the OOB errors.

The final selection criterion relies on the maximum Z-score of the shadow feature’s significance (ShadowMax). If a feature’s Z-score surpasses ShadowMax, it is deemed crucial; conversely, it is considered insignificant and excluded from the modeling process.

2.6.2. SHAP

SHAP (SHapley Additive exPlanations) is a method for explaining the output of machine learning models and is categorized as an additive feature attribution method that ensures consistency with the expected representation and local accuracy [39]. SHAP assigns a specific prediction importance value, known as the SHAP value, to each feature variable, serving as a unified measure of feature importance. By connecting optimal resource allocation with local explanations, SHAP enables the analysis of each feature’s contribution to the prediction outcomes in environments where multiple feature variables interact. The calculation of SHAP values can be expressed by the following formula [23]:

g (z^{'}) = ϕ_{0} + \sum_{i = 1}^{M} ϕ_{i} {z^{'}}_{i}

(4)

where

g

represents the explanation model,

z^{'} \in {0, 1}^{M}

indicates the presence or absence of the corresponding feature variable,

M

is the total number of input feature variables,

ϕ_{i} \in R

represents the marginal contribution of the i-th variable, and

ϕ_{0}

is the mean prediction over all training instances, serving as the constant of the explanation model. The specific formula for calculating SHAP values is as follows:

Y_{n} = y_{b} + f (x_{n}, 1) + f (x_{n}, 2) + \dots + f (x_{n}, P)

(5)

where

Y_{n}

represents the SHAP value to be calculated;

y_{b}

is the mean value of the feature variable across all samples; and

f (x_{n}, 1)

is the contribution value of the first feature variable in the n-th sample. This pattern continues accordingly;

P

is the number of feature variables.

In this study, the SHAP mechanism was introduced to interpret the model, ranking the feature variables input into the model by their importance to identify the primary feature variables influencing SOC. The mean SHAP values were used to highlight important features, and the SHAP summary plot was also utilized to evaluate the interactions between SOC and environmental variables.

2.7. Evaluation Criteria

The entire dataset (142 samples) was randomly split into a training set (114 samples) and a validation set (28 samples) at a ratio of 80% and 20%, respectively, with this process repeated 50 times. To validate the model’s performance, R², RMSE, and RPD were used to assess the model accuracy on the validation set. A higher R² value, closer to 1, indicates greater model accuracy; a lower RMSE and a higher RPD suggest superior model performance [40].

R^{2} = 1 - \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2} / \sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}

(6)

R M S E = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}

(7)

R P D = \frac{\sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}}{R M S E}

(8)

\begin{matrix} A d j u s t e d & R^{2} \end{matrix} = 1 - (1 - R^{2}) \frac{n - 1}{n - k - 1}

(9)

where n is the number of sampling points;

y_{i}

is the actual value of soil sample i;

{\hat{y}}_{i}

is the estimated value of soil sample i;

\bar{y}

is the mean of the actual values of the soil samples; and K is the number of feature variables.

3. Results

3.1. Descriptive Statistics for SOC

Table 6 shows the statistical summary of the SOC content in the total sample, training set and test set. For the total sample, the SOC content varied from 2.09 to 114.27 g kg⁻¹, with a wide range. The median and mean of the SOC content were 21.15 and 26.52 g kg⁻¹. The coefficient of variation was 75.28 g kg⁻¹, indicating that the SOC content exhibited high heterogeneity. The skewness of 1.83 g kg⁻¹ and the kurtosis of 4.35 g kg⁻¹ indicated that the data distribution had an obvious right-skewed and leptokurtic distribution. The training set and test set had similar SOC distributions to total sample, with the means and coefficients of variation of the total samples between the training set and test set, showing that the sample partitioning was reasonable.

3.2. Evaluation of Modeling

Using only non-remote sensing variables, only remote sensing variables, remote sensing + topography variables, remote sensing + soil variables, and remote sensing + topography + soil variables as input variables, with SOC as the target variable, six models—RF, ERT, GBDT, XGBoost, LightGBM, and CatBoost—were established for a comparative analysis to verify the predictive capabilities of the models.

As shown in Figure 3, when using only non-remote sensing data as the input, the six machine learning models demonstrated moderate performance (R² ranges from 0.48 to 0.62, adjusted R² from 0.46 to 0.61, RMSE from 12.36 to 14.45 g/kg, and RPD from 1.38 to 1.62). Among these, the CatBoost model (R² = 0.62, adjusted R² = 0.61, RMSE = 12.36 g/kg, rpd = 1.62) achieved the best evaluation results. When using only remote sensing data as the input, the overall performance was comparable but poorer (R² ranges from 0.28 to 0.45, adjusted R² ranges from 0.26 to 0.43, RMSE ranges from 14.86 to 16.90 g/kg, and RPD ranges from 1.18 to 1.34). The random forest model (R² = 0.45, adjusted R² = 0.43, RMSE = 14.86 g/kg, RPD =1.34) yielded slightly better evaluation metrics than other models in this scenario. As the number of features in the data combinations increased, the prediction accuracy of different models improved to varying degrees, indicating that increasing the number of features can effectively enhance the fusion of multi-data input features for various regression models. When the data combinations included remote sensing + topography and remote sensing + soil, the evaluation results of several regression methods improved, but the extent of improvement in evaluation metrics varied. Notably, the precision of all regression methods was higher when remote sensing + soil was used as the data source compared to remote sensing + topography. When the data combination included remote sensing + topography + soil, most models exhibited their best performance (R² ranges from 0.63 to 0.70, adjusted R² from 1.64 to 1.84, RMSE from 10.86 to 12.18 g/kg, RPD from 1.64 to 1.84), indicating that models built using the full feature combination had superior predictive effects compared to other single or dual-feature combinations. Due to its notably superior performance across all evaluation criteria (R² = 0.70, adjusted R² = 0.69, RMSE = 10.86 g/kg, RPD = 1.84), CatBoost was selected for subsequent research and analysis.

3.3. Analysis of Feature Selection Results

3.3.1. Feature Selection Based on Boruta

In the Boruta algorithm, during the feature selection process, variables retained were those with a Z-score greater than ShadowMax (the maximum Z-score). Figure 4 illustrates the feature selection results using boxplots to represent the minimum Z-score (ShadowMin), average Z-score (ShadowMean), and maximum Z-score (ShadowMax) of the feature variables. Among these, the Z-scores of sand, DEM, clay, CFVO, nitrogen content, pH, silt, B3, B4, and B2 were significantly higher than ShadowMax; thus, they were identified as important feature variables. Features such as LAI and NDVI had Z-scores notably lower than ShadowMax and were therefore considered unimportant feature variables. The Z-scores of B5, aspect, RVI, EVI, and GVI were close to ShadowMax, classifying them as candidate features in the Boruta algorithm, especially B5 and aspect, which were particularly close to ShadowMax.

To verify whether the modeling accuracy improved after feature selection using the Boruta algorithm and to further screen the candidate features B5 and aspect, this study used five-fold cross-validation to input the important feature variables, important feature variables plus B5, important feature variables plus aspect, and important feature variables plus both B5 and aspect into six machine learning models for four separate experiments. The feature variables that achieved the highest accuracy when input into the CatBoost model were selected as the final feature variables for modeling. As shown in Table 7, the R², adjusted R², and RPD values of the model using important feature variables were 0.64, 0.62, and 1.66, respectively. These metrics were all lower than those of the other three models, and its RMSE was also higher than the results of the other feature-based models. The modeling accuracy was highest when the candidate features B5 and aspect were added (R² = 0.76, adjusted R² = 0.75, RMSE = 9.88 g/kg, RPD = 2.02). The results indicate that using the Boruta algorithm for feature selection and applying it to the construction of the CatBoost model is effective.

3.3.2. Feature Selection Based on SHAP

SHAP was employed to conduct an interpretability analysis of the feature variables input into the CatBoost model. Bar charts ranked the importance of environmental covariates based on their average SHAP values, which quantify their contributions to the SOC content prediction. As shown in Figure 5, the SOC content was primarily influenced by pH and the nitrogen content, while features such as B3, DEM, and sand had weaker effects. The SHAP summary plot in Figure 6 displays the SHAP values of environmental variables ranked by their importance. The positive or negative signs of SHAP values indicate whether the covariates have a positive or negative influence on the predicted target. Each point represents a sample, with the point’s color reflecting the value of a covariate: red signifies higher values, while blue indicates lower values. Consequently, pH, B3, and sand exerted a certain degree of negative effect on the SOC content; the greater the value of these features, the smaller the SHAP value, leading to a stronger inhibition of SOC accumulation in the corresponding areas and resulting in lower SOC content. In contrast, the nitrogen content and DEM positively affected the SOC content; the higher the values of these features, the larger the SHAP values, leading to higher SOC contents in the corresponding areas, as abundant total nitrogen and higher elevations promote SOC accumulation.

Based on the interpretation results from SHAP, the importance of feature variables was ranked (Table 8). Among them, the cumulative contributions of the top four feature variables (X₁~X₄) exceeded 50%, the top six (X₁~X₆) exceeded 60%, the top eight (X₁~X₈) exceeded 70%, the top twelve (X₁~X₁₂) exceeded 80%, and the top seventeen (X₁~X₁₇) exceeded 90%. The cumulative contributions of the top 14 feature variables was 86.25%, approaching 90%, indicating that X₁~X₁₄ play predominant roles in the model.

To further refine the feature selection process, the top 14 feature variables, based on their importance rankings, were used as inputs for the model, and variables with lower importance were sequentially eliminated from the combination, starting with X14. The prediction performance of the CatBoost model under different variable combinations (Table 9) shows that as the feature variables were progressively removed, R², adjusted R² and RPD generally exhibited a fluctuating downward trend, while RMSE showed a fluctuating upward trend. Additionally, when the feature combination was reduced to “X₁ + X₂” and “X₁”, the model lacked sufficient explanatory power, making the predictions meaningless. Therefore, when the variable combination consisted of X₁ to X₁₄, R², adjusted R² and RPD reached their peak values (0.73, 0.72 and 1.93, respectively), and RMSE achieved its lowest value (10.35 g/kg), indicating that the model prediction performance was optimal.

In summary, whether using the Boruta or SHAP algorithm, the predictive performance of the models after feature selection was superior to that without feature selection. A further comparison of the feature selection effectiveness between Boruta and SHAP algorithms revealed that the Boruta–CatBoost model slightly outperformed the SHAP–CatBoost model (0.76 vs. 0.73). Considering the overall model performance, the Boruta–CatBoost model was determined to be the optimal choice.

3.4. Sensitivity Analysis

Due to the differences in resolution among covariates (e.g., 250 m for SoilGrids vs. 30 m for Landsat), we harmonized all data to 30 m using bilinear interpolation. To assess the impact of resolution on model performance, a sensitivity analysis was conducted: SoilGrids data were resampled to 30 m, 60 m, 90 m, and 250 m (original), while other covariates remained unchanged. The CatBoost model was retrained and evaluated under each scenario.

As shown in Table 10, the sensitivity analysis revealed that reduced spatial resolution leads to progressively declining model performance (R² decreasing from 0.76 to 0.68), validating the importance of high-resolution data for improving the prediction accuracy. The low-resolution SoilGrids data (250 m) increased prediction errors. Although the bilinear interpolation method partially mitigates this issue, future research should explore higher-resolution soil property data sources (such as localized field measurements) or advanced spatial downscaling techniques to minimize uncertainties from resolution discrepancies.

3.5. Spatial Distribution Pattern of SOC

Using the optimal variable combination (sand, DEM, clay, CFVO, nitrogen content, pH, silt, B3, B4, B2, B5, and aspect) obtained through feature selection based on the Boruta algorithm and the best-performing machine learning model (CatBoost), a spatial distribution map of the forest SOC content in the Huodong mining area was generated (Figure 7). The SOC content within the study area varied from 1.03 to 94.53 g/kg. The map reveals that the SOC distribution is higher in the southern and northern parts of the study area, with relatively lower values in the central region, indicating strong spatial heterogeneity. The areas with a high organic carbon content are predominantly located in the northwestern and southwestern parts of the study area, which are rich in forest resources. The northwestern part of the study area is part of the Taiyue Mountains and is also a local Chinese pine (Pinus tabulaeformis) cultivation base. The southwestern part belongs to the Fushan Mountains. These regions have higher elevations and more rugged terrain, with less human activity, allowing for better preservation and restoration of natural vegetation. The high vegetation cover, including forests, forms a relatively closed and stable ecosystem, contributing to the significantly higher SOC content in these areas. In contrast, regions with a lower SOC content are predominantly found in the central part of the study area, where the terrain is flatter compared to the northern and southern regions. With towns and mining enterprises concentrated here, the soil in this region is subject to human activity disturbances, resulting in reduced surface SOC storage.

4. Discussion

4.1. Effects of Different Covariates on the SOC Content

Both the feature sets selected by the Boruta and SHAP algorithms included environmental covariates from remote sensing, topography, and soil, indicating that the integration of multi-source data has a positive impact on the SOC prediction, consistent with the findings of Zhou et al. [41]. The results from both feature selection methods revealed that pH, the nitrogen content, sand, DEM and B3 band are common key factors influencing the distribution of SOC.

After feature selection using two different algorithms, soil property-related variables were identified as the most significant factors (Figure 4 and Figure 5). This result aligns with earlier studies. Schillaci argued that soil texture has a more substantial impact on SOC variations than other factors, including land use, precipitation, and topography [42]. Stevens also noted that soil texture can explain more than half of the total variation in the SOC content [43]. The SHAP summary plot in Figure 6 indicates that the sand content inhibits SOC accumulation, which aligns with the conclusions of Adhikari [44]. Fine-textured soils, with their dense mineral particles, provide a natural barrier that effectively protects SOC from microbial degradation. Sandy soils exhibit a lower SOC content not only due to sparse vegetation but also because their coarse particles fail to stabilize organic matter. The lack of protective mineral surfaces and reduced water retention accelerate SOC mineralization, as evidenced by the negative SHAP values for the sand content. Additionally, sandy soils are more susceptible to wind and water erosion, further contributing to the loss of organic carbon. pH is another crucial factor influencing SOC [45]. Under high pH conditions, dissolved organic carbon in the soil is more readily released, and the turnover rate of SOC is faster compared to acidic conditions. This aligns with the negative correlation between pH and the SOC content observed in this study [46]. The total nitrogen content is closely linked to the SOC content; an increase in total nitrogen content enhances soil fertility, thereby promoting plant growth and development and enhancing the accumulation of SOC. Our study also confirmed a positive relationship between the total nitrogen and SOC contents (Figure 6).

As one of the five soil-forming factors [47], topographic variables are effective predictors in regions with a complex and variable terrain, and their applicability in DSM practices has been widely recognized [48]. Elevation was identified as the most significant topographic variable in this study, consistent with previous research [49]. As shown in Figure 6, elevation has a positive impact on the SOC content; higher elevations correspond to higher SOC levels. This is because in the Huodong mining district, high-elevation areas (e.g., Taiyue Mountain National Forest Farm) are strictly protected as nature reserves and Chinese pine (Pinus tabulaeformis) cultivation bases, with minimal human disturbance. The vegetation coverage is high, resulting in a significantly higher soil organic carbon content in the area compared to other regions. Wang also noted that elevation influences vegetation, local climate, and microbial activity, which have long-term effects on SOC dynamics [50]. Furthermore, due to the topographic covariates being derived from the DEM, high multicollinearity among these variables may lead to the reductions in other topographic variables during feature selection.

In forest ecosystems, vegetation-related variables are closely associated with the spatial distribution of SOC populations [51]. Spectral reflectance and vegetation indices derived from remote sensing data effectively represent vegetation coverage, providing better predictions of the surface SOC content compared to direct field measurements when combined with machine learning models. Among all remote sensing variables, the B3 band, which represents vegetation productivity, is the most critical environmental variable for predicting the SOC content [52]. Our results show that after feature selection using either the Boruta or SHAP algorithms, the B3 band consistently ranks as the most important remote sensing variable. This is attributed to its excellent ability to distinguish vegetation types and assess vegetation cover, which aligns with the findings of Yang [53]. The B3 band exhibits a negative correlation with the SOC content (Figure 6). This relationship is because pigments and other compounds in organic matter absorb specific wavelengths of light, particularly in the green portion of the visible spectrum. Therefore, soils with a higher organic carbon content have lower reflectance in the B3 band [6].

Although non-remote sensing variables (such as pH, the nitrogen content, and DEM) dominate the SOC prediction (Table 8), remote sensing variables (e.g., B3) compensate for the spatial limitations of traditional soil attribute data by capturing vegetation coverage and surface spectral characteristics. Particularly in forested mining areas, the strong correlation between B3 and vegetation productivity provides critical ecological process information for the model, thereby optimizing the expression of SOC spatial heterogeneity in complex terrains and enhancing the model’s spatial interpretability. As shown in Figure 3, taking the CatBoost model as an example, the model using only non-remote sensing variables achieved R² and adjusted R² values of 0.62 and 0.61, respectively. In contrast, the combined input of remote sensing + topographic + soil variables elevated these metrics to 0.70 and 0.69, demonstrating the significant auxiliary role of remote sensing variables in improving model accuracy. The integration of both datasets enables more precise SOC mapping.

4.2. The Application of Different Feature Selection Methods in the Prediction of the SOC Content

In practical applications, removing irrelevant or redundant feature variables can reduce noise that does not contribute to predictions, accelerate training and prediction speeds, and thereby enhance model performance. Boruta and SHAP represent pre-model feature selection and post-model feature selection methods, respectively, each with unique strengths. Boruta improves the robustness and reliability of feature selection by introducing shadow features [32]. Meanwhile, SHAP provides detailed model explanations, aiding in understanding the model’s working mechanism [23]. Although models such as RF and XGBoost can output feature importance, their evaluation criteria (e.g., Gini impurity) may exhibit biases due to differences in the model structure. For instance, continuous variables are prone to overestimation in tree-based models. Boruta and SHAP model performance from pre-modeling and post-interpretation perspectives, respectively, addressing potential biases in traditional feature importance metrics (e.g., those from RF and XGBoost). Additionally, SHAP’s local interpretability reveals nonlinear relationships between features and the SOC content, a capability absent in conventional importance metrics.

The model accuracy obtained after Boruta selection is higher than that after SHAP selection (0.76 vs. 0.73). However, SHAP offers rich visualization tools, such as the SHAP summary plot (Figure 6), which enable users to intuitively grasp the importance of features and their interactions. Various machine learning models developed in recent years exhibit strong nonlinear mining capabilities but are often deemed “black-box models” due to the difficulty in representing these nonlinear relationships with specific mathematical formulas [54]. With the emergence and development of interpretability methods such as SHAP and LIME (Local Interpretable Model-agnostic Explanation), the results of machine learning models have become clearer and more explicit.

4.3. Comparison to Other SOC Maps

The global soil database SoilGrids2.0 (250 m resolution) released by Poggio et al. holds irreplaceable value due to its unparalleled ability to fill information gaps in remote and inaccessible regions such as the Sahara Desert, Tibetan Plateau, and Siberia [55]. We conducted a comparative analysis between our SOC product and SoilGrids 2.0, visually illustrating the spatial discrepancies in the SOC content between the two datasets through discrepancy maps (Figure 8). As shown in Figure 8c, SoilGrids 2.0 predicted higher SOC values in the northern and central parts of the study area compared to our SOC product, while it predicted lower values in the southern part. In terms of the range of predicted values, SoilGrids 2.0 provided a broader range (0–108.40 g kg⁻¹) for the SOC content in the Huodong mining area’s forestland (Figure 8a), whereas our SOC product had a range of 1.03–94.53 g kg⁻¹ (Figure 8b). After assessing the prediction accuracy using the same validation dataset and performance metrics, we found that SoilGrids 2.0 had an R² of 0.289, RMSE of 16.846, and RPD of 1.158 in our study area. This indicates that our SOC predictions (R² = 0.76, RMSE = 9.88, RPD = 2.02) outperformed those of SoilGrids 2.0. This finding is consistent with previous local validation studies of SoilGrids [56,57,58], which have shown that the performance of SoilGrids at the local level is significantly lower than its global performance. Therefore, caution should be exercised when using global-scale SoilGrids 2.0 data for smaller, more localized areas.

Additionally, we have provided a comparison of the performance of our developed model with that of other similar studies in Table 11. The results indicate that our model generally achieves higher R² values and exhibits superior modeling performance compared to most existing studies. The higher accuracy of our SOC product can be attributed to several factors, including more intensive field soil sampling, the use of environmental covariates with finer spatial resolution, and effective feature selection methods that remove redundant information. Moreover, the spatial resolution of SOC maps in the majority of earlier research generally ranged from 90 m to 1000 m (Table 11), whereas the SOC map in our study possesses a higher spatial resolution of 30 m. This advantage enables a more detailed capture and representation of subtle changes and regional characteristics of the SOC content, providing stronger support for guiding forestry management and promoting ecological restoration.

4.4. Limitations and Perspectives

Although we have made some progress, this study still has some limitations. First, increasing the number of sampling points helps to fully capture and describe the spatial variability of the SOC content in the study area [68]. In the future, more detailed soil data can be obtained by increasing the sampling density, which is particularly beneficial for reducing sampling bias, especially in areas with significant environmental and soil property variations [61]. Second, to build more accurate models, there is a need to incorporate and apply environmental covariates with higher spatial resolution. Although climate is one of the factors influencing the SOC content at the regional scale, the current distribution density of meteorological stations is insufficient. Mainstream climate data products, such as WorldClim2 and CMIP6, have relatively low spatial resolutions. Given the relatively small size of our study area (2802.81 km²), a coarse spatial resolution can lead to the loss of detailed information, thereby hindering the model’s ability to accurately capture the relationships between the target variable and environmental covariates and potentially resulting in systematic prediction biases. Therefore, this study does not consider climate variables. Future research should employ spatial downscaling methods to develop new climate products, aiming to minimize prediction errors and uncertainties caused by covariates. Additionally, the formation of SOC is a highly complex and long-term process involving interactions between the parent material and environmental factors [69]. Future studies should further quantify the impacts of soil-forming factors and human activities to enhance the accuracy and stability of the predictions.

5. Conclusions

This study achieved fine resolution predictions of the regional SOC content by constructing a streamlined and information-rich optimal database of environmental covariates and combining it with advanced machine learning models. This contributes a significant reference case for soil mapping in other regions.

Among the six machine learning models compared, CatBoost achieved the highest prediction accuracy for SOC (R² = 0.70, RMSE = 10.86 g/kg, RPD = 1.84). Both feature selection methods, Boruta and SHAP, increased the accuracy of SOC predictions, but the model obtained after Boruta feature selection showed the highest predictive accuracy. Compared to the model without feature selection, the introduction of Boruta increased the R² by 8.57% (0.76 vs. 0.70) and reduced the RMSE by 9.02% (9.88 vs. 10.86 g/kg). The SOC prediction map generated using the optimal machine learning model (CatBoost) and feature selection method (Boruta) revealed that SOC levels gradually decrease from the northern and southern parts of the study area toward the central region, indicating strong spatial heterogeneity. In exploring the factors explaining the SOC content, we found that pH, nitrogen content, sand, DEM, and the B3 band play crucial roles in predicting the forest SOC content.

Author Contributions

Conceptualization, Y.L. and J.L.; methodology, Y.L.; software, Y.L.; validation, Y.L. and J.T.; data curation, Y.L. and T.M.; writing—original draft preparation, Y.L.; writing—review and editing, Z.C. and X.Y.; visualization, Y.L. and K.L.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Intergovernmental and international cooperation in science, technology and innovation), grant number 2022YFE0127700.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Acknowledgments

The authors are grateful to the editor and reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SOC	Soil Organic Carbon
DSM	Digital Soil Mapping
SHAP	SHapley Additive exPlanations
GIS	Geographic Information System
GWR	Geographically Weighted Regression
CatBoost	Category Boosting
GPS	Global Positioning System
NPP	Net Primary Productivity
B2	Band 2
B3	Band 3
B4	Band 4
B5	Band 5
B6	Band 6
B7	Band 7
CFVO	Coarse Fraction Volume Occupancy
CEC	Cation Exchange Capacity
RF	Random Forest
ERT	Extremely Randomized Trees
GBDT	Gradient Boosting Decision Tree
GOSS	Gradient-Based One-Side Sampling
EFB	Exclusive Feature Bundling
OOB	Out-Of-Bag
DEM	Digital Elevation Model
NDVI	Normalized Difference Vegetation Index
DVI	Difference Vegetation Index
EVI	Enhanced Vegetation Index
ARVI	Atmospherically Resistant Vegetation Index
RVI	Relative Vegetation Index
LAI	Leaf Area Index
GVI	Greenness Vegetation Index
GRVI	Greenness Ratio Vegetation Index
GNDVI	Green Normalized Difference Vegetation Index
SAVI	Soil Adjusted Vegetation Index
LS	Slope Length and Steepness Factor
PC	Profile Curvature
TWI	Topographic Wetness Index
VD	Valley Depth
TCA	Total Catch Area
LIME	Local Interpretable Model-agnostic Explanation
QRF	Quantile Regression Forest
BRT	Boosted Regression Trees
RK	Regression-Kriging
HASM	High Accuracy Surface Modeling
MLR	Multiple Linear Regression

References

He, Z.; Qian, J.; Li, J.; Hong, M.; Man, Y. Data-driven soft sensors of papermaking process and its application to cleaner production with multi-objective optimization. J. Clean. Prod. 2022, 372, 133803. [Google Scholar] [CrossRef]
He, T.; Xiao, W.; Zhao, Y.; Deng, X.; Hu, Z. Identification of waterlogging in Eastern China induced by mining subsidence: A case study of Google Earth Engine time-series analysis applied to the Huainan coal field. Remote Sens. Environ. 2020, 242, 111742. [Google Scholar] [CrossRef]
Yousaf, B.; Liu, G.; Wang, R.; Abbas, Q.; Imtiaz, M.; Liu, R. Investigating the biochar effects on C-mineralization and sequestration of carbon in soil compared with conventional amendments using the stable isotope (δ¹³C) approach. GCB Bioenergy 2017, 9, 1085–1099. [Google Scholar] [CrossRef]
Grinand, C.; Le Maire, G.; Vieilledent, G.; Razakarnanarivo, H.; Razafimbelo, T.; Bernoux, M. Estimating temporal changes in soil carbon stocks at ecoregional scale in Madagascar using remote-sensing. Int. J. Appl. Earth Obs. Geoinf. 2017, 54, 1–14. [Google Scholar] [CrossRef]
Fujisaki, K.; Perrin, A.-S.; Desjardins, T.; Bernoux, M.; Balbino, L.C.; Brossard, M. From forest to cropland and pasture systems: A critical review of soil organic carbon stocks changes in Amazonia. Glob. Change Biol. 2015, 21, 2773–2786. [Google Scholar] [CrossRef] [PubMed]
Wang, S.; Zhuang, Q.; Jin, X.; Yang, Z.; Liu, H. Predicting Soil Organic Carbon and Soil Nitrogen Stocks in Topsoil of Forest Ecosystems in Northeastern China Using Remote Sensing Data. Remote Sens. 2020, 12, 1115. [Google Scholar] [CrossRef]
Sun, F.; Lei, Q.; Liu, Y.; Li, H. The Progress and Prospect of Digital Soil Mapping Research. Chin. J. Soil Sci. 2011, 42, 1502–1507. [Google Scholar] [CrossRef]
McBratney, A.B.; Odeh, I.O.A.; Bishop, T.F.A.; Dunbar, M.S.; Shatar, T.M. An overview of pedometric techniques for use in soil survey. Geoderma 2000, 97, 293–327. [Google Scholar] [CrossRef]
McBratney, A.B.; Santos, M.L.M.; Minasny, B. On digital soil mapping. Geoderma 2003, 117, 3–52. [Google Scholar] [CrossRef]
Zhou, M.; Li, Y. Digital Mapping and Scenario Prediction of Soil Salinity in Coastal Lands Based on Multi-Source Data Combined with Machine Learning Algorithms. Remote Sens. 2024, 16, 2681. [Google Scholar] [CrossRef]
Nowkandeh, S.M.; Noroozi, A.A.; Homaee, M. Estimating soil organic matter content from Hyperion reflectance images using PLSR, PCR, MinR and SWR models in semi-arid regions of Iran. Environ. Dev. 2018, 25, 23–32. [Google Scholar] [CrossRef]
Hu, K.; Li, H.; Li, B.; Huang, Y. Spatial and temporal patterns of soil organic matter in the urban-rural transition zone of Beijing. Geoderma 2007, 141, 302–310. [Google Scholar] [CrossRef]
Kumar, S.; Lal, R.; Liu, D. A geographically weighted regression kriging approach for mapping soil organic carbon stock. Geoderma 2012, 189, 627–634. [Google Scholar] [CrossRef]
Omuto, C.T.; Vargas, R.R.; Elmobarak, A.A.; Mapeshoane, B.E.; Koetlisi, K.A.; Ahmadzai, H.; Mohamed, N.A. Digital soil assessment in support of a soil information system for monitoring salinization and sodification in agricultural areas. Land Degrad. Dev. 2022, 33, 1204–1218. [Google Scholar] [CrossRef]
Siqueira, R.G.; Moquedace, C.M.; Fernandes-Filho, E.I.; Schaefer, C.E.G.R.; Francelino, M.R.; Sacramento, I.F.; Michel, R.F.M. Modelling and prediction of major soil chemical properties with Random Forest: Machine learning as tool to understand soil-environment relationships in Antarctica. Catena 2024, 235, 107677. [Google Scholar] [CrossRef]
Cheng, F.; Ou, G.; Wang, M.; Liu, C. Remote Sensing Estimation of Forest Carbon Stock Based on Machine Learning Algorithms. Forests 2024, 15, 681. [Google Scholar] [CrossRef]
Lu, Q.; Tian, S.; Wei, L. Digital mapping of soil pH and carbonates at the European scale using environmental variables and machine learning. Sci. Total Environ. 2023, 856, 159171. [Google Scholar] [CrossRef]
Dharumarajan, S.; Hegde, R.; Singh, S.K. Spatial prediction of major soil properties using Random Forest techniques A case study in semi-arid tropics of South India. Geoderma Reg. 2017, 10, 154–162. [Google Scholar] [CrossRef]
Mahmoudzadeh, H.; Matinfar, H.R.; Taghizadeh-Mehrjardi, R.; Kerry, R. Spatial prediction of soil organic carbon using machine learning techniques in western Iran. Geoderma Reg. 2020, 21, e00260. [Google Scholar] [CrossRef]
Chen, Y.; Ma, L.; Yu, D.; Zhang, H.; Feng, K.; Wang, X.; Song, J. Comparison of feature selection methods for mapping soil organic matter in subtropical restored forests. Ecol. Indic. 2022, 135, 108545. [Google Scholar] [CrossRef]
Xiong, X.; Grunwald, S.; Myers, D.B.; Kim, J.; Harris, W.G.; Comerford, N.B. Holistic environmental soil-landscape modeling of soil organic carbon. Environ. Model. Softw. 2014, 57, 202–215. [Google Scholar] [CrossRef]
Zhang, J.; Ma, X.; Zhang, J.; Sun, D.; Zhou, X.; Mi, C.; Wen, H. Insights into geospatial heterogeneity of landslide susceptibility based on the SHAP-XGBoost model. J. Environ. Manag. 2023, 332, 117357. [Google Scholar] [CrossRef]
Zhu, Z.; Ge, Y.; Zang, J.; Li, Q. Research on Wheat Yield Prediction Based on UAV Imagery and SHAP Feature Selection. J. Triticeae Crops 2024, 45, 264–274. [Google Scholar]
Yan, X.; Li, J.; Smith, A.R.; Yang, D.; Ma, T.; Su, Y.; Shao, J. Evaluation of machine learning methods and multi-source remote sensing data combinations to construct forest above-ground biomass models. Int. J. Digit. Earth 2023, 16, 4471–4491. [Google Scholar] [CrossRef]
Conrad, O.; Bechtel, B.; Bock, M.; Dietrich, H.; Fischer, E.; Gerlitz, L.; Wehberg, J.; Wichmann, V.; Boehner, J. System for Automated Geoscientific Analyses (SAGA) v. 2.1.4. Geosci. Model Dev. 2015, 8, 1991–2007. [Google Scholar] [CrossRef]
Li, X.Y.; Wang, Z.X.; Chen, Y.; Wang, Z.W.; Kuang, D. Exploring the impact of land use on bird diversity in high-density urban areas using explainable machine learning models. J. Environ. Manag. 2025, 374, 124080. [Google Scholar] [CrossRef]
Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
Song, J.; Liu, G.; Jiang, J.; Zhang, P.; Liang, Y. Prediction of Protein-ATP Binding Residues Based on Ensemble of Deep Convolutional Neural Networks and LightGBM Algorithm. Int. J. Mol. Sci. 2021, 22, 939. [Google Scholar] [CrossRef]
Lee, S.; Vo, T.P.; Thai, H.T.; Lee, J.; Patel, V. Strength prediction of concrete-filled steel tubular columns using Categorical Gradient Boosting algorithm. Eng. Struct. 2021, 238, 112109. [Google Scholar] [CrossRef]
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Luo, C.; Zhang, X.; Wang, Y.; Men, Z.; Liu, H. Regional soil organic matter mapping models based on the optimal time window, feature selection algorithm and Google Earth Engine. Soil Tillage Res. 2022, 219, 105325. [Google Scholar] [CrossRef]
Ge, X.; Wang, J.; Ding, J.; Cao, X.; Zhang, Z.; Liu, J.; Li, X. Combining UAV-based hyperspectral imagery and machine learning algorithms for soil moisture content monitoring. Peerj 2019, 7, e6926. [Google Scholar] [CrossRef]
Pan, Z.H.; Xiang, H.X.; Shi, X.Y.; Wang, M.; Song, K.S.; Mao, D.H.; Huang, C.L. Detecting and Mapping Peatlands in the Tibetan Plateau Region Using the Random Forest Algorithm and Sentinel Imagery. Remote Sens. 2025, 17, 292. [Google Scholar] [CrossRef]
Zhao, Y.X.; Dong, H.; Huang, W.B.; He, S.C.; Zhang, C.F. Seamless terrestrial evapotranspiration estimation by machine learning models across the Contiguous United States. Ecol. Indic. 2024, 165, 112203. [Google Scholar] [CrossRef]
Ye, M.; Zhu, L.; Li, X.; Ke, Y.; Huang, Y.; Chen, B.; Yu, H.; Li, H.; Feng, H. Estimation of the soil arsenic concentration using a geographically weighted XGBoost model based on hyperspectral data. Sci. Total Environ. 2023, 858, 159798. [Google Scholar] [CrossRef]
Chen, R.; Huang, W.; Sun, L.; Yang, J.; Ma, T.; Shi, R. Distribution, transport and ecological risk prediction of organophosphate esters in China seas based on machine learning. Sci. Total Environ. 2024, 957, 177559. [Google Scholar] [CrossRef]
Zhang, X.; Chen, S.; Xue, J.; Wang, N.; Xiao, Y.; Chen, Q.; Hong, Y.; Zhou, Y.; Teng, H.; Hu, B.; et al. Improving model parsimony and accuracy by modified greedy feature selection in digital soil mapping. Geoderma 2023, 432, 116383. [Google Scholar] [CrossRef]
Yuan, Y.; Guo, W.; Tang, S.; Zhang, J. Effects of patterns of urban green-blue landscape on carbon sequestration using XGBoost-SHAP model. J. Clean. Prod. 2024, 476, 143640. [Google Scholar] [CrossRef]
Rossel, R.A.V.; McGlynn, R.N.; McBratney, A.B. Determing the composition of mineral-organic mixes using UV-vis-NIR diffuse reflectance spectroscopy. Geoderma 2006, 137, 70–82. [Google Scholar] [CrossRef]
Zhou, T.; Geng, Y.; Chen, J.; Liu, M.; Haase, D.; Lausch, A. Mapping soil organic carbon content using multi-source remote sensing variables in the Heihe River Basin in China. Ecol. Indic. 2020, 114, 106288. [Google Scholar] [CrossRef]
Schillaci, C.; Acutis, M.; Lombardo, L.; Lipani, A.; Fantappie, M.; Maerker, M.; Saia, S. Spatio-temporal topsoil organic carbon mapping of a semi-arid Mediterranean region: The role of land use, soil texture, topographic indices and the influence of remote sensing data to modelling. Sci. Total Environ. 2017, 601, 821–832. [Google Scholar] [CrossRef]
Stevens, F.; Bogaert, P.; van Wesemael, B. Detecting and quantifying field-related spatial variation of soil organic carbon using mixed-effect models and airborne imagery. Geoderma 2015, 259, 93–103. [Google Scholar] [CrossRef]
Adhikari, K.; Hartemink, A.E. Soil organic carbon increases under intensive agriculture in the Central Sands, Wisconsin, USA. Geoderma Reg. 2017, 10, 115–125. [Google Scholar] [CrossRef]
Zhang, H.; Wu, P.; Yin, A.; Yang, X.; Zhang, M.; Gao, C. Prediction of soil organic carbon in an intensively managed reclamation zone of eastern China: A comparison of multiple linear regressions and the random forest model. Sci. Total Environ. 2017, 592, 704–713. [Google Scholar] [CrossRef]
Malik, A.A.; Puissant, J.; Buckeridge, K.M.; Goodall, T.; Jehmlich, N.; Chowdhury, S.; Gweon, H.S.; Peyton, J.M.; Mason, K.E.; van Agtmaal, M.; et al. Land use driven change in soil pH affects microbial carbon cycling processes. Nat. Commun. 2018, 9, 3591. [Google Scholar] [CrossRef]
Pendleton, R.L. FACTORS OF SOIL FORMATION: A System of Quantitative Pedology. Geogr. Rev. 1945, 35, 336–337. [Google Scholar] [CrossRef]
Yang, R.-M.; Zhang, G.-L.; Liu, F.; Lu, Y.-Y.; Yang, F.; Yang, F.; Yang, M.; Zhao, Y.-G.; Li, D.-C. Comparison of boosted regression tree and random forest models for mapping topsoil organic carbon concentration in an alpine ecosystem. Ecol. Indic. 2016, 60, 870–878. [Google Scholar] [CrossRef]
Romano, N.; Palladino, M. Prediction of soil water retention using soil physical data and terrain attributes. J. Hydrol. 2002, 265, 56–75. [Google Scholar] [CrossRef]
Wang, S.; Wang, Q.; Adhikari, K.; Jia, S.; Jin, X.; Liu, H. Spatial-Temporal Changes of Soil Organic Carbon Content in Wafangdian, China. Sustainability 2016, 8, 1154. [Google Scholar] [CrossRef]
Yimer, F.; Ledin, S.; Abdelkadir, A. Soil property variations in relation to topographic aspect and vegetation. community in the south-eastern highlands of Ethiopia. For. Ecol. Manag. 2006, 232, 90–99. [Google Scholar] [CrossRef]
Gomes, L.C.; Faria, R.M.; de Souza, E.; Veloso, G.V.; Schaefer, C.E.G.R.; Fernandes Filho, E.I. Modelling and mapping soil organic carbon stocks in Brazil. Geoderma 2019, 340, 337–350. [Google Scholar] [CrossRef]
Yang, J.-L.; Zhang, G.-L.; Yang, F.; Yang, R.-M.; Yi, C.; Li, D.-C.; Zhao, Y.-G.; Liu, F. Controlling effects of surface crusts on water infiltration in an arid desert area of Northwest China. J. Soils Sediments 2016, 16, 2408–2418. [Google Scholar] [CrossRef]
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
Poggio, L.; de Sousa, L.M.; Batjes, N.H.; Heuvelink, G.B.M.; Kempen, B.; Ribeiro, E.; Rossiter, D. SoilGrids 2.0: Producing soil information for the globe with quantified spatial uncertainty. Soil 2021, 7, 217–240. [Google Scholar] [CrossRef]
Dharumarajan, S.; Kalaiselvi, B.; Suputhra, A.; Lalitha, M.; Vasundhara, R.; Kumar, K.S.A.; Nair, K.M.; Hegde, R.; Singh, S.K.; Lagacherie, P. Digital soil mapping of soil organic carbon stocks in Western Ghats, South India. Geoderma Reg. 2021, 25, e00387. [Google Scholar] [CrossRef]
Song, X.-D.; Wu, H.-Y.; Ju, B.; Liu, F.; Yang, F.; Li, D.-C.; Zhao, Y.-G.; Yang, J.-L.; Zhang, G.-L. Pedoclimatic zone-based three-dimensional soil organic carbon mapping in China. Geoderma 2020, 363, 114145. [Google Scholar] [CrossRef]
Tifafi, M.; Guenet, B.; Hatte, C. Large Differences in Global and Regional Total Soil Carbon Stock Estimates Based on SoilGrids, HWSD, and NCSCD: Intercomparison and Evaluation Based on Field Data From USA, England, Wales, and France. Glob. Biogeochem. Cycles 2018, 32, 42–56. [Google Scholar] [CrossRef]
Hengl, T.; de Jesus, J.M.; Heuvelink, G.B.M.; Gonzalez, M.R.; Kilibarda, M.; Blagotic, A.; Shangguan, W.; Wright, M.N.; Geng, X.; Bauer-Marschallinger, B.; et al. SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 2017, 12, e0169748. [Google Scholar] [CrossRef]
Bahri, H.; Raclot, D.; Barbouchi, M.; Lagacherie, P.; Annabi, M. Mapping soil organic carbon stocks in Tunisian topsoils. Geoderma Reg. 2022, 30, e00561. [Google Scholar] [CrossRef]
Liang, Z.; Chen, S.; Yang, Y.; Zhao, R.; Shi, Z.; Rossel, R.A.V. National digital soil map of organic matter in topsoil and its associated uncertainty in 1980’s China. Geoderma 2019, 335, 47–56. [Google Scholar] [CrossRef]
Encina-Rojas, A.; Rios-Velazquez, D.; Sevilla-Linares, V.; Villarreal, S.; Moriya, M.A.K.; Olivera, C.; Vargas, R.; Olmedo, F.; Barreras, A.; Guevara, M. First soil organic carbon report of Paraguay. Geoderma Reg. 2023, 32, e00611. [Google Scholar] [CrossRef]
Owusu, S.; Yigini, Y.; Olmedo, G.F.; Omuto, C.T. Spatial prediction of soil organic carbon stocks in Ghana using legacy data. Geoderma 2020, 360, 114008. [Google Scholar] [CrossRef]
Wang, S.; Roland, B.; Adhikari, K.; Zhuang, Q.; Jin, X.; Qian, F. Spatial-temporal variations and driving factors of soil organic carbon in forest ecosystems of Northeast China. For. Ecosyst. 2023, 10, 100101. [Google Scholar] [CrossRef]
Hu, B.; Xie, M.; Zhou, Y.; Chen, S.; Zhou, Y.; Ni, H.; Peng, J.; Ji, W.; Hong, Y.; Li, H.; et al. A high-resolution map of soil organic carbon in cropland of Southern China. Catena 2024, 237, 107813. [Google Scholar] [CrossRef]
Wang, Z.; Du, Z.; Li, X.; Bao, Z.; Zhao, N.; Yue, T. Incorporation of high accuracy surface modeling into machine learning to improve soil organic matter mapping. Ecol. Indic. 2021, 129, 107975. [Google Scholar] [CrossRef]
Keskin, H.; Grunwald, S.; Harris, W.G. Digital mapping of soil carbon fractions with machine learning. Geoderma 2019, 339, 40–58. [Google Scholar] [CrossRef]
Chen, S.; Mulder, V.L.; Heuvelink, G.B.M.; Poggio, L.; Caubet, M.; Dobarco, M.R.; Walter, C.; Arrouays, D. Model averaging for mapping topsoil organic carbon in France. Geoderma 2020, 366, 114237. [Google Scholar] [CrossRef]
Ma, Y.; Minasny, B.; Welivitiya, W.D.D.P.; Malone, B.P.; Willgoose, G.R.; McBratney, A.B. The feasibility of predicting the spatial pattern of soil particle-size distribution using a pedogenesis model. Geoderma 2019, 341, 195–205. [Google Scholar] [CrossRef]

Figure 1. Framework of forest soil organic carbon mapping in the Huodong mining area based on feature selection and machine learning algorithms.

Figure 2. Overview of the study area.

Figure 3. Performance comparison of variable combinations for estimating the forest SOC content using different ML algorithms.

Figure 4. Boruta feature selection results.

Figure 5. Ranking of the importance of environmental covariates in predicting the content of SOC.

Figure 6. SHAP summary plot.

Figure 7. Map predicting the spatial distribution of SOC in the Huodong mining area.

Figure 8. Comparison of SoilGrids 2.0 and our map of the SOC content.

Table 1. Selected band parameters of Landsat imagery.

Band	Abbreviation	Wavelength (μm)	Resolution (m)
Blue (Band 2)	B2	0.450–0.515	30
Green (Band 3)	B3	0.525–0.600	30
Red (Band 4)	B4	0.630–0.680	30
NIR (Band 5)	B5	0.845–0.885	30
SWIR 1 (Band 6)	B6	1.550–1.750	30
SWIR 2 (Band 7)	B7	2.090–2.350	30

Table 2. Calculation of vegetation indices by type.

Vegetation Index	Formula
NDVI (Normalized Difference Vegetation Index)	$NDVI = \frac{NIR - Red}{NIR + Red}$
DVI (Difference Vegetation Index)	$DVI = NIR - Red$
EVI (Enhanced Vegetation Index)	$EVI = 2.5 \times \frac{NIR - Red}{NIR + 6 Red - 7.5 Blue + 1}$
ARVI (Atmospherically Resistant Vegetation Index)	$ARVI = \frac{NIR - (2 Red - Blue)}{NIR + (2 Red - Blue)}$
RVI (Relative Vegetation Index)	$RVI = \frac{NIR}{Red}$
LAI (Leaf Area Index)	$LAI = 3.618 \times EVI - 0.118$
GVI (Greenness Vegetation Index)	$GVI = - 0.2848 \times Blue - 0.2435 \times Green - 0.5436 \times Red$ $+ 0.7243 \times NIR + 0.084 \times {SWIR}_{1} - 0.18 \times {SWIR}_{2}$
GRVI (Greenness Ratio Vegetation Index)	$GRVI = \frac{NIR}{Green}$
GNDVI (Green Normalized Difference Vegetation Index)	$GNDVI = \frac{NIR - Green}{NIR + Green}$
SAVI (Soil Adjusted Vegetation Index)	$SAVI = \frac{1.5 \times (NIR - Red)}{NIR + Red + 0.5}$

Table 3. Terrain variables used for predicting SOC.

Variable Category	Variable	Abbreviation	Resolution (m)	Source
Terrain	Elevation	DEM	30	https://www.gscloud.cn/
	Slope	-	30	Calculated from the elevation
	Aspect	-	30	Calculated from the elevation
	Slope length and steepness factor	LS	30	Calculated from the elevation
	Profile curvature	PC	30	Calculated from the elevation
	Topographic wetness index	TWI	30	Calculated from the elevation
	Valley depth	VD	30	Calculated from the elevation
	Total catch area	TCA	30	Calculated from the elevation

Table 4. Soil variables used for predicting SOC.

Variable Category	Variable	Abbreviation	Resolution(m)	Source
Soil properties	Sand	-	250	SoilGrids250m 2.0
	Clay	-	250	SoilGrids250m 2.0
	Silt	-	250	SoilGrids250m 2.0
	Coarse fraction volume occupancy	CFVO	250	SoilGrids250m 2.0
	Cation exchange capacity	CEC	250	SoilGrids250m 2.0
	pH	-	250	SoilGrids250m 2.0
	Nitrogen content	-	250	SoilGrids250m 2.0

Table 5. Parameter optimization values for machine learning models requiring parameterization based on a grid search with 5-fold cross-validation.

Model	Parameters for Predicting the SOC
RF	n_estimators = 60, min_samples_split = 7, min_samples_leaf = 1, max_depth = 14, random_state = 18
ERT	n_estimators = 135, min_samples_split = 5, min_samples_leaf = 8, max_depth = 16, random_state = 18
GBDT	learning_rate = 0.1, n_estimators = 39, min_samples_split = 8, min_samples_leaf = 3, max_depth = 3, random_state = 18
XGBoost	learning_rate = 0.09, n_estimators = 48, max_depth = 3, gamma = 0.3, reg_lambda = 4, reg_alpha = 1, colsample_bytree = 0.8, subsample = 0.7, random_state = 18
LightGBM	learning_rate = 0.12, n_estimators = 41, max_depth = 5, num_leaves = 6, min_child_samples = 8, colsample_bytree = 0.9, subsample = 0.7, reg_lambda = 6, reg_alpha = 4, random_state = 18
CatBoost	learning_rate = 0.12, iterations = 556, depth = 4, l2_leaf_reg = 2, random_state = 18

Table 6. Descriptive statistics of the SOC content (g kg⁻¹).

Sample Type	n	Min	Max	Median	Mean	SD	CV (%)	Skewness	Kurtosis
Total sample	142	2.09	114.27	21.15	26.52	19.96	75.28	1.83	4.35
Training set	114	2.09	114.27	21.45	26.77	19.95	74.50	2.00	5.37
Testing set	28	5.77	75.70	19.61	25.46	20.35	79.91	1.25	0.65

Table 7. The final results of feature selection.

Different Combinations of Feature Models	R²	RMSE (g/kg)	RPD	Adjusted R²
Important feature variables	0.64	12.06	1.66	0.62
Important feature variables plus B5	0.70	10.99	1.82	0.69
Important feature variables plus aspect	0.67	11.44	1.75	0.66
Important feature variables plus both B5 and aspect	0.76	9.88	2.02	0.75

Table 8. The contributions and cumulative contributions of each feature variable.

Symbol	Feature Variable	SHAP Absolute Value	Contribution/%	Cumulative Contribution/%
X₁	pH	0.16	20.00	20.00
X₂	Nitrogen content	0.14	17.50	37.50
X₃	B3	0.06	7.50	45.00
X₄	DEM	0.05	6.25	51.25
X₅	Sand	0.05	6.25	57.50
X₆	B6	0.04	5.00	62.50
X₇	Aspect	0.03	3.75	66.25
X₈	Silt	0.03	3.75	70.00
X₉	NDVI	0.03	3.75	73.75
X₁₀	VD	0.02	2.50	76.25
X₁₁	B4	0.02	2.50	78.75
X₁₂	B7	0.02	2.50	81.25
X₁₃	CEC	0.02	2.50	83.75
X₁₄	RVI	0.02	2.50	86.25
X₁₅	Slope	0.01	1.25	87.50
X₁₆	ARVI	0.01	1.25	88.75
X₁₇	B5	0.01	1.25	90.00
X₁₈	CFVO	0.01	1.25	91.25
X₁₉	B2	0.01	1.25	92.50
X₂₀	Clay	0.01	1.25	93.75

Note: Contribution refers to the proportion of the SHAP absolute value of a variable in the sum of the SHAP absolute values of all variables.

Table 9. Analysis of the predictive effects of different SHAP feature combinations.

Different Feature Combinations	R²	RMSE (g/kg)	RPD	Adjusted R²
X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇ + X₈ + X₉ + X₁₀ + X₁₁ + X₁₂ + X₁₃ + X₁₄	0.73	10.35	1.93	0.72
X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇ + X₈ + X₉ + X₁₀ + X₁₁ + X₁₂ + X₁₃	0.68	11.35	1.76	0.66
X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇ + X₈ + X₉ + X₁₀ + X₁₁ + X₁₂	0.71	10.85	1.84	0.69
X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇ + X₈ + X₉ + X₁₀ + X₁₁	0.65	11.82	1.69	0.64
X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇ + X₈ + X₉ + X₁₀	0.68	11.24	1.78	0.67
X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇ + X₈ + X₉	0.51	14.05	1.42	0.49
X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇ + X₈	0.51	13.97	1.43	0.49
X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇	0.42	15.26	1.31	0.39
X₁ + X₂ + X₃ + X₄ + X₅ + X₆	0.56	13.23	1.51	0.54
X₁ + X₂ + X₃ + X₄ + X₅	0.42	15.18	1.32	0.40
X₁ + X₂ + X₃ + X₄	0.47	14.51	1.38	0.45
X₁ + X₂ + X₃	0.17	18.16	1.10	0.14
X₁ + X₂	/	/	/	/
X₁	/	/	/	/

Table 10. Model performance at different SoilGrids resolutions.

Resolution (m)	R²	RMSE (g/kg)	RPD	Adjusted R²
30	0.76	9.88	2.02	0.74
60	0.74	10.12	1.98	0.73
90	0.71	10.89	1.84	0.69
250	0.68	11.35	1.76	0.67

Table 11. Comparison with previous similar studies.

Study Case	Study Area	Study Scale	Spatial Resolution	Method	Soil Sampling Sites	R²
This study	Huodong Mining District, China	Regional	30 m	CatBoost	142	0.75
Case 1	Global	Global	250 m	RF	150,000	0.69
Case 2	Tunisian	National	100 m	QRF	1540	0.44
Case 3	China	National	90 m	Cubist	5982	0.42
Case 4	Paraguay	National	1000 m	RK	954	0.54
Case 5	Ghana	National	1000 m	RK	744	0.34
Case 6	Northeast China	Regional	90 m	BRT	157	0.67
Case 7	Jiangxi Province, China	Regional	30 m	RF	13,424	0.73
Case 8	Dongzhi Loess Tableland, China	Regional	90 m	RF_HASM	145	0.57
Case 9	North of Songnen Plain, China	Regional	30 m	RF	182	0.70
Case 10	NSW State, Australia	Regional	100 m	MLR/RF	2153	0.53(MLR)/0.69(RF)

Note: Case 1 [59]; Case 2 [60]; Case 3 [61]; Case 4 [62]; Case 5 [63]; Case 6 [64]; Case 7 [65]; Case 8 [66]; Case 9 [32]; Case 10 [67]; QRF means quantile regression forest; BRT means boosted regression trees; RK means Regression-Kriging; HASM means high accuracy surface modeling; MLR means multiple linear regression.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, Y.; Li, J.; Tan, J.; Ma, T.; Yan, X.; Chen, Z.; Li, K. Fine Resolution Mapping of Forest Soil Organic Carbon Based on Feature Selection and Machine Learning Algorithm. Remote Sens. 2025, 17, 2000. https://doi.org/10.3390/rs17122000

AMA Style

Li Y, Li J, Tan J, Ma T, Yan X, Chen Z, Li K. Fine Resolution Mapping of Forest Soil Organic Carbon Based on Feature Selection and Machine Learning Algorithm. Remote Sensing. 2025; 17(12):2000. https://doi.org/10.3390/rs17122000

Chicago/Turabian Style

Li, Yanan, Jing Li, Jun Tan, Tianyue Ma, Xingguang Yan, Zongyang Chen, and Kunheng Li. 2025. "Fine Resolution Mapping of Forest Soil Organic Carbon Based on Feature Selection and Machine Learning Algorithm" Remote Sensing 17, no. 12: 2000. https://doi.org/10.3390/rs17122000

APA Style

Li, Y., Li, J., Tan, J., Ma, T., Yan, X., Chen, Z., & Li, K. (2025). Fine Resolution Mapping of Forest Soil Organic Carbon Based on Feature Selection and Machine Learning Algorithm. Remote Sensing, 17(12), 2000. https://doi.org/10.3390/rs17122000

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Fine Resolution Mapping of Forest Soil Organic Carbon Based on Feature Selection and Machine Learning Algorithm

Abstract

1. Introduction

2. Materials and Methods

2.1. Study Area

2.2. Soil Sampling and Laboratory Analysis

2.3. Environmental Covariates

2.3.1. Multispectral Remote Sensing Variables

2.3.2. Topographic Variables

2.3.3. Soil Variables

2.3.4. Covariate Harmonization

2.4. Machine Learning Algorithms

2.4.1. RF

2.4.2. ERT

2.4.3. GBDT

2.4.4. XGBoost

2.4.5. LightGBM

2.4.6. CatBoost

2.5. Parameter Setting

2.6. Feature Selection

2.6.1. Boruta

2.6.2. SHAP

2.7. Evaluation Criteria

3. Results

3.1. Descriptive Statistics for SOC

3.2. Evaluation of Modeling

3.3. Analysis of Feature Selection Results

3.3.1. Feature Selection Based on Boruta

3.3.2. Feature Selection Based on SHAP

3.4. Sensitivity Analysis

3.5. Spatial Distribution Pattern of SOC

4. Discussion

4.1. Effects of Different Covariates on the SOC Content

4.2. The Application of Different Feature Selection Methods in the Prediction of the SOC Content

4.3. Comparison to Other SOC Maps

4.4. Limitations and Perspectives

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI