The Spatial Distribution and Driving Mechanism of Soil Organic Matter in Hilly Basin Areas Based on Genetic Algorithm Variable Combination Optimization and Shapley Additive Explanations Interpretation

Huang, He; Liu, Yaolin; Liu, Yanfang; Tong, Zhaomin; Ren, Zhouqiao; Xie, Yifan

doi:10.3390/rs17071186


by He Huang ¹, Yaolin Liu ¹,*, Yanfang Liu ¹, Zhaomin Tong ¹, Zhouqiao Ren ² and Yifan Xie ¹

¹ School of Resource and Environmental Science, Wuhan University, Wuhan 430079, China

² Institute of Digital Agriculture, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(7), 1186; https://doi.org/10.3390/rs17071186

Submission received: 10 February 2025 / Revised: 9 March 2025 / Accepted: 26 March 2025 / Published: 27 March 2025


Abstract

Studying the spatial variation patterns and influencing factors of soil organic matter (SOM) in hilly and basin areas is of great significance for guiding agricultural production practices. This study takes Lanxi City as an example and comprehensively considers soil formation factors such as climate, vegetation, and terrain. A genetic algorithm is used to combine and optimize 47 environmental variables, and two models are constructed: a random forest (RF) model and an improved version, a random forest model based on genetic algorithm variable combination optimization (RF-GA). The SHAP interpretation method is then used to quantitatively analyze the spatial distribution characteristics of the SOM content and to identify the main driving factors. Compared with the ordinary Kriging (OK) and RF methods, the RF-GA model demonstrates a significantly improved prediction accuracy (R² = 0.49; RMSE = 3.49 g·kg⁻¹; MAE = 3.019; LCCC = 0.67). Relative to the other two models, the R² of the RF-GA model is higher by 87.84% and 56.29%. The model prediction results indicate that the SOM content in the study area ranges from 12.11 to 31.38 g·kg⁻¹, with a higher content in mountainous areas and a lower content in plains. A further SHAP analysis shows that terrain, climate, and biological factors are the key environmental factors affecting the spatial differentiation of the SOM. The channel network base level (CNBL), which contributes 20.68% to the model, and DEM, which has a contribution rate of 5.57%, play particularly significant roles: by regulating moisture, erosion and deposition, vegetation distribution, and microclimate conditions, they significantly affect the spatial distribution of the SOM. In summary, the RF-GA and its interpretable prediction model constructed in this study not only effectively reveal the spatial distribution and driving mechanisms of SOM in hilly and basin areas but also provide a solid theoretical basis and practical guidance for accurate mapping, the formulation of sustainable utilization strategies for soil resources, and ensuring national food security.

Keywords:

soil organic matter; genetic algorithm; random forest; SHAP

1. Introduction

Soil organic matter (SOM) is an active and critical component of the soil carbon pool, and its spatial distribution characteristics are of great significance for revealing regional soil quality and global carbon cycling processes [1]. However, due to the combined effects of structural and stochastic factors, the spatial distribution of SOM exhibits significant variability and non-stationarity, causing significant uncertainty in modeling and quantitatively describing its spatial variation process [2]. Therefore, although it is necessary to accurately obtain spatial distribution information on regional SOM, many challenges remain in practical operation.

In hilly basin areas, the topographic conditions are highly variable. There are distinct differences in lighting, heat, and moisture conditions among different topographic positions, endowing these regions with a high degree of spatial heterogeneity. The influencing factors are intricate and complex, rendering it particularly challenging to quantitatively describe the soil morphology, properties, their process variations, and spatial correlations [3]. Therefore, digital soil mapping (DSM) has been widely used in recent years as an important technology for quickly and accurately determining the spatial distribution of regional soil attributes [3]. However, due to the combined influence of natural soil-forming factors and human activities, the SOM in farmland often exhibits significant spatial non-stationarity, which further increases the difficulty of SOM spatial prediction [4]. Identifying the key influencing factors of the SOM spatial distribution and introducing them into prediction models can greatly improve prediction accuracy.

Traditional soil attribute mapping methods include Kriging interpolation, inverse distance weighting, spline interpolation, and other spatial interpolation methods [5]. Ordinary Kriging (OK) is the most widely adopted Kriging method, valued for its simple model and its capacity to generate unbiased optimal predictions together with uncertainty quantification [6].

However, the commonly used Kriging and regression analysis methods are linear estimation methods, which have difficulty capturing the complex nonlinear relationship between SOM and environmental variables [6]. Therefore, in recent years, an increasing number of scholars have begun to introduce machine learning algorithms, such as support vector machines (SVMs), random forests (RFs), artificial neural networks (ANNs), and regression trees, aiming to more accurately establish the nonlinear relationship between SOM and environmental variables [7,8]. These methods typically rely on sample data and environmental covariates for fitting, with the commonly used environmental variables including soil type, climate factors, land use type, vegetation index, terrain factors, and soil parent material [9,10]. Terrain factors have a significant impact on SOM content by regulating surface runoff, solar radiation, soil erosion, moisture content, and temperature, making them particularly important in hilly and mountainous areas [11].

The random forest (RF) model, emerging in recent years, is a tree-structured model that adopts an ensemble learning strategy. It can be used for both classification and the prediction of continuous variables. It constructs a series of tree models and allows them to be trained and make predictions independently. The final prediction result is determined by voting among the prediction results of all trees (for categorical prediction) or by taking the average (for numerical prediction). Therefore, it exhibits excellent robustness [12]. In addition, RF can display the relative importance of each environmental variable in the modeling process (using the % IncMSE index). Owing to its outstanding prediction accuracy and good interpretability, RF has become one of the most commonly used machine learning algorithms in recent years.

In digital soil mapping, the selection of environmental variables is a crucial step that directly influences the predicted spatial distribution of soil properties and the accuracy of prediction models. Therefore, numerous scholars have dedicated themselves to researching how to screen out the factors that have the greatest impact on soil properties from a multitude of environmental variables. To give the model a high prediction accuracy, an efficient screening process must be applied to a large number of feature variables. This involves selecting the optimal subset of features and avoiding multicollinearity, which both enhances the model’s prediction accuracy and reduces its complexity. The genetic algorithm (GA) is a global optimization algorithm that simulates the natural evolution process, continuously optimizing variable combinations through operations such as selection, crossover, and mutation in order to select feature sets that maximize model performance [13]. In complex terrain and multivariate environments, the GA can effectively avoid becoming stuck in local optima, thereby improving the robustness and accuracy of model predictions [14]. However, the random forest model based on GA-filtered features (RF-GA) has not been fully applied to SOM estimation in complex areas, and its advantages in SOM prediction over the RF model using full-variable prediction still need to be verified. Therefore, this study proposes a random forest model based on the genetic algorithm for variable combination optimization, aiming to improve the prediction accuracy of the SOM spatial distribution in complex regions and to provide new perspectives and methods for DSM research.

Although machine learning methods typically outperform traditional statistical methods in terms of prediction accuracy, their “black box” nature—i.e., their lack of sufficient interpretability—has always limited their practical applications. To address this issue, the SHAP (Shapley Additive Explanations) method based on game theory and local interpretation theory was introduced to quantitatively estimate the contribution of each feature variable to the model’s prediction results [15]. In the field of soil property simulation, SHAP has not only successfully identified key driving factors but has also effectively analyzed the interactions between different climate and terrain variables, making it widely used to interpret the prediction results of complex models [9,16].

Lanxi City is the largest Yangmei-producing area in the region, with a typical hilly and basin landform. Identifying the main controlling factors of the SOM in farmland in Lanxi City and obtaining a high-precision SOM spatial distribution map will help not only formulate scientific and reasonable farmland planting and management strategies, optimize land use layouts, increase soil carbon sequestration capacities, and alleviate the greenhouse effect but also enhance soil fertility and achieve increased grain production.

The main objectives of this study are to (1) explore the potential application of the RF-GA model based on variable combination optimization in DSM in complex regions; (2) evaluate the performance differences between this model, the ordinary Kriging method (OK), and the RF model based on full-variable prediction in terms of predicting the SOM spatial distribution; and (3) use the SHAP method to analyze the relationship between the environmental variables involved in SOM formation and the SOM content.

2. Materials and Methods

2.1. Overview of the Study Area

Lanxi City is located in the central and western part of Zhejiang Province, with the geographical coordinates of 29°1′20″–29°27′30″ north latitude and 119°13′30″–119°53′50″ east longitude; it has a total area of 1313 square kilometers. The climate belongs to the subtropical monsoon region of East Asia, with abundant annual precipitation. The landform is a hilly basin in central Zhejiang, surrounded by mountains in the northeast, winding low hills in the southwest, and a flat plain in the central part. The main soil types in the research area are red soil, yellow soil, lithological soil, tidal soil, and paddy soil, with agriculture being the main land use. Lanxi has diverse vegetation. In the northern mountainous area, evergreen broad-leaved forests contribute significantly to soil organic matter. Their annual litter production per hectare is considerable, with a large amount of organic carbon being input into the soil, resulting in high SOM content. In contrast, in the southern hilly area, due to over-grazing and poor management, the vegetation is sparse, limiting the input of SOM. Regarding land use, paddy fields along the Lanjiang River slow down the decomposition of SOM due to waterlogging, maintaining a stable SOM content. However, in drylands, frequent intensive tillage and excessive use of fertilizers accelerate the decomposition of SOM. The expansion of urban and industrial areas disrupts the natural soil-forming process and reduces SOM content.

2.2. Data Sources and Processing

2.2.1. Soil Sample Data

During fieldwork from 8 to 15 June 2022, a total of 1566 surface soil samples (0–30 cm) were collected from farmlands across the study area. Prior to field investigations, sampling points were systematically planned based on field surveys to ensure an even distribution, effectively capturing the spatial characteristics of soil properties in agricultural land. The selection of sampling locations was guided by key environmental factors, including topography, land use type, and soil type. A random or stratified random sampling strategy was employed to ensure that the sampling points represented diverse environmental gradients. To establish sampling points systematically, a grid sampling method was used. First, a 2 × 2 km regular grid was generated across the study area, with sampling points placed at the center of each grid to achieve uniform distribution. Next, to refine the sampling design, grid points falling outside agricultural land were removed using actual land use data. Given the complexity of the agricultural landscape, the remaining points were overlaid with high-resolution Google Earth imagery for visual inspection, ensuring accurate classification and selection of sampling locations.
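The grid design described above can be sketched in a few lines. This is only an illustration under stated assumptions: coordinates are taken to be projected and in metres, the bounding box is made up, and `is_farmland` is a hypothetical stand-in for the overlay with the actual land use data and the visual check against imagery.

```python
def grid_centers(xmin, ymin, xmax, ymax, cell=2000.0):
    """Return the centre points of a regular grid (cell size in metres)."""
    centers = []
    y = ymin + cell / 2
    while y < ymax:
        x = xmin + cell / 2
        while x < xmax:
            centers.append((x, y))
            x += cell
        y += cell
    return centers


def keep_farmland(points, is_farmland):
    """Drop candidate points that do not fall on agricultural land."""
    return [p for p in points if is_farmland(p)]


# Hypothetical 10 km x 6 km extent; not the real extent of the study area.
centers = grid_centers(0.0, 0.0, 10000.0, 6000.0)
# Hypothetical land use rule, used here only to show the filtering step.
candidates = keep_farmland(centers, lambda p: p[0] < 5000.0)
```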

For soil sampling, the upward drilling method was adopted to minimize the impact of small-scale soil heterogeneity. A composite sampling approach was applied, where ten subsamples were randomly or evenly collected within each 10 m × 10 m plot and thoroughly mixed to form a single composite sample. A variance analysis was conducted on a subset of samples to assess soil organic matter (SOM) variability within these plots. The results indicated relatively low variability, confirming that the composite sampling method effectively represented soil properties within small areas. To maintain consistency, all samples were collected from a depth of 0–20 cm. At each sampling point, ten soil cores were randomly taken using a 5 cm diameter spiral soil drill and then combined into one composite sample. This approach helps to reduce local variations and provides a more representative measurement of SOM content. Throughout the process, standardized sampling protocols were strictly followed.

After field collection, soil samples were air-dried, crushed, and sieved through a 1.0 mm sieve before being stored in sealed glass jars for further analysis. SOM content was determined using the potassium dichromate volumetric method. To minimize the influence of outliers in the dataset, the SOM values of the 1566 samples from Lanxi City were carefully examined, and extreme outliers were removed using Excel software, resulting in a final dataset of 1560 valid samples. Their spatial distribution is shown in Figure 1. In ArcGIS 10.2, 80% of the samples (1249) were randomly and uniformly selected as the training set, while the remaining 20% (311) were assigned as the test set for model validation.
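The random partition into training and test sets was done in ArcGIS; a rough equivalent is sketched below using the sample counts reported above. The integer sample IDs and the seed are assumptions made for illustration.

```python
import random


def split_samples(sample_ids, n_train, seed=0):
    """Randomly partition sample IDs into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# 1560 valid samples, of which 1249 are used for training (counts from the text).
train_ids, test_ids = split_samples(range(1560), n_train=1249)
```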

2.2.2. Obtaining Environmental Covariates

Based on the soil landscape SCORPAN function model [17], and following the principles of correlation and availability, soil texture, terrain factors, remote sensing biological indices, climate factors, soil types, and land use were selected as environmental variables to predict the soil properties in the study area, as shown in Table 1. Because the study area is a hilly basin, its intricate topography and diverse vegetation make topographic and biological factors highly significant for explaining the spatial heterogeneity of soil organic matter. Consequently, a relatively large number of indicators were selected from the topographic and biological factors. According to McBratney et al. [18], 80% of digital soil mapping studies have used terrain elements, 25% have used biological elements, another 25% have used parent rock elements, and 5% have used climate elements.

(1): Topographical factors

The terrain series of soil is mainly controlled by surface morphology characteristics and parent rocks, which are relatively uniform in a small area. Therefore, terrain is the most important influencing factor in the formation of local soil. Terrain factors directly affect the energy cycle of surface materials and the occurrence and evolution of soil, and they are commonly used environmental variables in soil mapping. This study used 12.5 m digital elevation model (DEM) data as terrain data; these data were sourced from the NASA Earth Science Data website (https://www.earthdata.nasa.gov/, accessed on 15 June 2022). The DEM for the study area dates from May 2022. Based on these DEM data, the analytical hillshading (AH), aspect (ASP), closed depressions (CDs), convergence index (CI), channel network base level (CNBL), channel network distance (CND), coefficient of variation of elevation (ECV), LS factor (LS), mass balance index (MBI), multiscale ridge top flatness (MRRTF), etc., were extracted using SAGA-GIS 7.6.2 software. Among them, MRRTF and the multiresolution index of valley bottom flatness (MRVBF) are indices that identify flat and low terrain or high flat areas at multiple resolutions by progressively smoothing and coarsening the DEM while reducing slope thresholds to identify valleys or ridges. These terrain factors affect the movement of surface materials and energy from different aspects, thereby influencing the soil formation process. All topographic factors were calculated in SAGA-GIS 7.6.2 with the default parameter settings.

(2): Climate factors

The annual average temperature, the annual average precipitation, and other climate factors were sourced from the National Qinghai Tibet Plateau Data Center in China (http://data.tpdc.ac.cn, accessed on 14 May 2022). The dataset was generated for China by downscaling the gridded time series climate dataset released by the Climatic Research Unit (CRU) at the University of East Anglia in the UK, together with the WorldClim global high-resolution climate dataset.

(3): Biological factors

Biological factors indirectly reflect the surface conditions and vegetation landscape characteristics formed by soil properties through the characteristic bands and different combinations of remote sensing images. Biological factors mainly include plants, animals, and microorganisms. Vegetation growing on different soils may vary in type or growth status. Therefore, the soil type or properties can be inferred from the vegetation type or its condition. Information about soil animals and microorganisms is difficult to obtain, but they often have a correlation with the status of surface vegetation. Thus, in practical mapping, the vegetation status is used as a substitute. Vegetation information can be mainly divided into two categories. One is the qualitative spatial distribution information of types, such as the vegetation type. The other is the quantitative spatial distribution information of attributes, which is mainly obtained by calculating remote sensing image data to acquire vegetation indices and vegetation biophysical parameters [19,20]. Remote sensing image data were obtained from Sentinel-2, which involves high-resolution multispectral imaging satellites carrying a multispectral imager (MSI) for land monitoring. Sentinel-2 can provide images of vegetation, soil and water cover, inland waterways, and coastal areas and involves two satellites: 2A and 2B. This study used Sentinel-2A satellite data, with a spatial resolution of 10 m, downloaded from the GEE (Google Earth Engine) public data platform. The image time was consistent with the sampling time, and the cloud cover was 0. Subsequently, the obtained image data underwent preprocessing such as format conversion, projection transformation, and resampling.

(4): Soil texture

Soil texture is one of the physical properties of soil, referring to the combination of mineral particles of different sizes and diameters in the soil. Soil texture is closely related to soil aeration, fertilizer retention, the water retention status, and the difficulty of cultivation, and its condition is an important basis for formulating soil utilization, management, and improvement measures. Fertile soil requires not only a good texture of the plow layer but also a good texture profile. Although soil texture is mainly determined by the type of parent material and is relatively stable, the texture of the cultivated layer can still be adjusted through activities such as tillage and fertilization. The spatial distribution data of soil texture were compiled based on soil type maps and soil profile data obtained from soil surveys, and they were divided into three categories, namely, sand, silt, and clay, each of which reflects the content of particles with different textures through percentages. The dataset was provided by the Geographic Remote Sensing Ecological Network Platform (www.gisrs.cn, accessed on 11 June 2022), and it has a spatial resolution of 900 m.

(5): Soil type and land use data

The soil type and land use data were sourced from the measured data collected in this experiment. Categorical variables such as land use and soil type were converted to numeric values by arithmetic mean transformation, which links each level of a categorical independent variable to the quantitative dependent variable. Specifically, each land use type and soil type was replaced by the arithmetic mean (area percentage) of the quantitative dependent variable under that type.
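A minimal sketch of such a mean transformation is given below. It assumes that each category is replaced by the mean SOM of the samples falling in that category; the category names and SOM values are hypothetical and serve only to show the mechanics.

```python
from collections import defaultdict


def mean_encode(categories, targets):
    """Replace each category label by the mean target value of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, value in zip(categories, targets):
        sums[cat] += value
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories], means


# Hypothetical land use labels and SOM values (g/kg), for illustration only.
land_use = ["paddy", "dryland", "paddy", "forest"]
som = [24.0, 16.0, 26.0, 30.0]
encoded, category_means = mean_encode(land_use, som)
```

In practice the category means would normally be computed from the training samples only and then applied to the test set and the prediction grid, so that test values do not leak into the covariate.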

To facilitate subsequent modeling, the spatial resolution of the aforementioned selected environmental variables was resampled to 10 m using the ArcGIS 10.2 software. Concurrently, their spatial extents and coordinate systems were standardized to ensure consistency and compatibility within the analytical framework.

2.3. Research Method

2.3.1. Ordinary Kriging

Ordinary Kriging (OK) is an exact local spatial interpolation method based on variogram theory [21]. In OK, a theoretical semi-variogram model of the regionalized variable is first fitted to the observed values. The value z_{OK}^{*}(x_{0}) at the prediction point x_{0} is then obtained as a linear weighted sum of the observed values within a certain range around it, with the weights λ_{i} determined under the unbiasedness and optimality criteria. The calculation formula for OK is as follows:

z_{OK}^{*}(x_{0}) = \sum_{i=1}^{n} λ_{i} z(x_{i})

(1)

Here, z_{OK}^{*}(x_{0}) is the OK estimate at x_{0}, z(x_{i}) is the observation at x_{i}, and λ_{i} is the weight value. The OK method determines the optimal weights on the premise of unbiasedness (the expected value of the estimate equals that of the true value) and optimality (minimum estimation variance), thus satisfying the following conditions:

Unbiased condition:

E[z_{OK}^{*}(x_{0}) - z(x_{0})] = 0

(2)

Optimal condition:

Var[z_{OK}^{*}(x_{0}) - z(x_{0})] = \min

(3)
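To make conditions (2) and (3) concrete, the sketch below solves the standard ordinary Kriging system for the weights λ_{i} with a Lagrange multiplier, then applies Equation (1). It is an illustration under stated assumptions, not the workflow used in the study: the spherical semi-variogram and its parameters (nugget, sill, range) are hypothetical, as are the sample coordinates and SOM values.

```python
import math


def spherical_variogram(h, nugget=0.0, sill=1.0, rng=10.0):
    """Spherical semi-variogram; parameter values are illustrative assumptions."""
    if h == 0.0:
        return 0.0
    if h >= rng:
        return nugget + sill
    r = h / rng
    return nugget + sill * (1.5 * r - 0.5 * r ** 3)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ok_weights(points, target, gamma=spherical_variogram):
    """Weights of Eq. (1); the extra row and column enforce sum(weights) = 1."""
    n = len(points)
    a = [[gamma(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(math.dist(p, target)) for p in points] + [1.0]
    return solve(a, b)[:n]  # drop the Lagrange multiplier


def ok_predict(points, values, target):
    """Eq. (1): weighted sum of the observed values."""
    weights = ok_weights(points, target)
    return sum(w * z for w, z in zip(weights, values))


# Hypothetical sample locations (km) and SOM values (g/kg).
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
values = [18.0, 22.0, 20.0, 25.0]
weights = ok_weights(points, (2.0, 2.0))
estimate = ok_predict(points, values, (2.0, 2.0))
```

The unbiasedness condition appears as the constraint that the weights sum to one, and with a zero nugget the estimate at a sampled location reproduces the observed value.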

2.3.2. Random Forest

Random forest (RF) is a tree structure model that adopts an ensemble learning strategy, which can be used for both the classification and prediction of continuous variables [12]. In recent years, the random forest (RF) algorithm, as an excellent machine learning algorithm, has been widely used in digital soil mapping research based on multi-source environmental variables. RF-based models are non-parametric models and can handle the complex nonlinear relationship between soil properties and environmental covariates [22]. Moreover, RF has low sensitivity to the noise present in training samples; thus, it can better handle the problem of reduced accuracy caused by data loss and identify the importance of predictive variables [23]. Numerous studies have shown that RF has a higher prediction accuracy than other machine learning algorithms and traditional statistical regression methods [24].

Its advantages are that it does not require the dependent variable to be normally distributed, and it does not require testing for multicollinearity between independent variables. More importantly, it can explore the nonlinear relationship between independent and dependent variables. The RF model uses the bootstrap method to perform random sampling with replacement from the original training set, forming m new training sets and independently constructing a CART decision tree from each new training set. The samples not drawn in a given bootstrap round are called out-of-bag (OOB) data. At each tree node, n independent variables are randomly selected as candidates for determining the split. The final prediction result is determined by voting on the prediction results of all trees (when the dependent variable is a categorical variable) or by taking the average (when the dependent variable is a continuous variable). For each variable, RF calculates the percentage increase in the mean square error (MSE) of the OOB predictions when the values of that variable are randomly permuted, %IncMSE, and it determines the relative importance of each variable on this basis: the higher the %IncMSE, the more important the variable [23].
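The bootstrap and out-of-bag mechanics described above can be illustrated as follows. The sketch only draws the index sets and averages tree outputs; it does not build trees. The training-set size matches the 1249 training samples of this study, while the number of repetitions and the seed are arbitrary choices.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; undrawn indices are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob


def forest_predict(tree_predictions):
    """For a continuous target, the forest output is the mean over all trees."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(42)
n_samples = 1249
oob_fractions = []
for _ in range(200):
    in_bag, oob = bootstrap_indices(n_samples, rng)
    oob_fractions.append(len(oob) / n_samples)
mean_oob_fraction = sum(oob_fractions) / len(oob_fractions)
```

Each sample is left out of a given bootstrap draw with probability (1 - 1/n)^n, which approaches e^{-1} ≈ 0.368, so roughly a third of the training data is available as OOB data for every tree.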

Evaluating feature importance with the random forest algorithm involves quantifying the contribution of each feature to the classification performance of the K decision trees constructed. The contribution is commonly assessed by using the out-of-bag (OOB) error rate as the evaluation index. Here, the contribution is denoted by the feature importance measure (FIM).

We define the indicator function as follows:

I(x, y) = \begin{cases} 1, & x = y \\ 0, & x \neq y \end{cases}

(4)

The value of FIM_{km}^{(OOB)} for the m-th feature F_{m} in the k-th decision tree is as follows:

FIM_{km}^{(OOB)} = \frac{\sum_{p=1}^{n_{o}^{k}} I(Y_{p}, Y_{p}^{k})}{n_{o}^{k}} - \frac{\sum_{p=1}^{n_{o}^{k}} I(Y_{p}, Y_{p,π_{m}}^{k})}{n_{o}^{k}}

(5)

In the formula, n_{o}^{k} represents the number of OOB observation samples of the k-th decision tree; Y_{p} is the true classification label corresponding to the p-th sample; Y_{p}^{k} is the classification result predicted by the k-th decision tree for the p-th observation of the OOB data before the random permutation of feature F_{m}; and Y_{p,π_{m}}^{k} is the classification result of the k-th decision tree for the p-th sample after the random permutation of feature F_{m}, where the k-th decision tree needs to be retrained after the random permutation of F_{m}. When the feature F_{m} does not appear in the k-th decision tree, FIM_{km}^{(OOB)} = 0.

The importance measure score of feature F_{m} in the entire random forest is defined as:

FIM_{m}^{(OOB)} = \frac{\sum_{k=1}^{K} FIM_{km}^{(OOB)}}{K σ}

(6)

In the formula, K represents the number of decision trees in the random forest, and σ represents the standard deviation of FIM_{km}^{(OOB)} over the K trees. The importance measure score FIM_{m}^{(OOB)} of feature F_{m} characterizes the contribution of F_{m} to the classification accuracy rate. The score is jointly determined by the mean change in the OOB accuracy across trees and its standard deviation.
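Equations (5) and (6) can be sketched as below. Two simplifications are assumed and should be read as such: the already fitted tree is reused after the permutation (the common permutation-importance shortcut) instead of being retrained, and σ is taken as the population standard deviation. The "tree" is a hypothetical one-rule classifier and the OOB data are synthetic.

```python
import random
import statistics


def accuracy(predict, rows, labels):
    """Share of rows whose predicted label equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def fim_single_tree(predict, oob_rows, oob_labels, m, rng):
    """Eq. (5): OOB accuracy before minus after permuting feature m."""
    base = accuracy(predict, oob_rows, oob_labels)
    column = [r[m] for r in oob_rows]
    rng.shuffle(column)
    permuted = [r[:m] + [v] + r[m + 1:] for r, v in zip(oob_rows, column)]
    return base - accuracy(predict, permuted, oob_labels)


def fim_forest(per_tree_fims):
    """Eq. (6): mean of the per-tree scores divided by their standard deviation."""
    sigma = statistics.pstdev(per_tree_fims)
    mean = sum(per_tree_fims) / len(per_tree_fims)
    return mean / sigma if sigma > 0 else 0.0


# Synthetic OOB data: the label depends on feature 0 only.
rng = random.Random(0)
oob_rows = [[rng.random(), rng.random()] for _ in range(200)]
oob_labels = [int(r[0] > 0.5) for r in oob_rows]


def tree(row):
    """Hypothetical stand-in for one fitted decision tree."""
    return int(row[0] > 0.5)


fim_used = fim_single_tree(tree, oob_rows, oob_labels, 0, rng)
fim_unused = fim_single_tree(tree, oob_rows, oob_labels, 1, rng)
```

Permuting the feature the tree relies on lowers the OOB accuracy, whereas permuting a feature that never appears in the tree leaves the score at zero, as stated for Equation (5).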

The RF model has two key parameters: the number of trees (ntree) and the number of variables randomly tried at each split (mtry). When the computational load allows, a larger ntree is better; changes in mtry affect the goodness of fit of the model, so multiple values must be tried (ranging from 1 to the number of independent variables). The random forest (RF) model was implemented in RStudio with R 4.2.0 using the “randomForest” package. Specifically, the number of trees in the forest, ntree, was set to 1000, and the number of features tried at each split, mtry, was varied from 1 to 20, with a separate model constructed for each value. The input variables for the random forest model encompassed environmental variables such as topographic factors, biological factors, and climatic factors. With the random forest model, we aimed to accurately predict the content of soil organic matter from the input environmental variables, thereby elucidating how environmental factors influence the distribution and variation of soil organic matter.

2.3.3. Genetic Algorithm

The genetic algorithm (GA) is a stochastic search optimization algorithm based on natural selection and genetic mechanisms, inspired by the theory of biological evolution. It simulates genetic operations (selection, crossover, mutation, etc.) to iterate from an initial population toward the optimal solution [25]. In variable combination optimization problems, the GA encodes variable combinations as chromosomes (for example, with binary encoding, where each gene corresponds to a variable) to achieve feature selection or optimization [26]. The algorithm starts from a randomly generated initial population; evaluates the quality of each chromosome through a fitness function, such as prediction accuracy or the AIC/BIC; and then uses selection, crossover, and mutation operations to generate new populations in each iteration, continuously improving the quality of the solution. The optimization objectives of the GA typically include maximizing model performance (such as maximum accuracy or minimum error), minimizing the number of variables to simplify the model, and ensuring the robustness of the results. The process outputs the optimal variable combination once a predetermined termination condition is met, such as a maximum number of iterations or the convergence of fitness [27,28]. The GA is widely applied in feature selection and variable combination optimization owing to its powerful global search capability and adaptability to complex high-dimensional nonlinear problems. The principle of the GA is shown in Figure 2, which was drawn by the authors based on the principles of the genetic algorithm.
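The loop described above (binary chromosomes, fitness evaluation, selection, crossover, mutation, termination after a fixed number of generations) can be sketched as follows. Everything specific here is an assumption for illustration: the tournament selection, one-point crossover, elitism, rates, and population size are generic choices, and `toy_fitness` is a hypothetical stand-in for a fitness based on the accuracy of an RF model trained on the selected variables.

```python
import random


def tournament(pop, fitness, rng):
    """Binary tournament: pick two chromosomes at random, keep the fitter one."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b


def run_ga(fitness, n_vars, pop_size=30, generations=40,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Search for a 0/1 mask over n_vars variables that maximizes fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        children = [best[:]]  # elitism: the best chromosome always survives
        while len(children) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_vars)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # bit-flip mutation, applied gene by gene
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history


# Hypothetical fitness: reward five "informative" variables, penalize mask size.
INFORMATIVE = {0, 3, 7, 12, 20}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.05 * sum(mask)


best_mask, history = run_ga(toy_fitness, n_vars=47)
selected = [i for i, gene in enumerate(best_mask) if gene]
```

The size penalty in the toy fitness plays the role of the "minimize the number of variables" objective; in a real application each fitness evaluation would require fitting and validating a model, which dominates the run time.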

2.3.4. SHAP Driving Force Analysis

SHAP is a game theory-based method proposed by Lundberg and Lee to explain the predictions of machine learning models; it uses Shapley values to estimate the contribution of each feature [29]. In game-theoretic terms, each feature variable in a dataset can be seen as a member of a team, and the prediction obtained by training a model on that dataset can be seen as the benefit of all members working together to complete a project. The Shapley value provides a fair distribution of the benefits of cooperation by considering the contribution of each member. Because it uses Shapley values from game theory as explanatory measures, SHAP attribution analysis offers strong global and local interpretability of variables, a fair distribution of variable contributions, and excellent visualization, which compensate for the poor interpretability of black box models. Therefore, SHAP is introduced to explain and analyze the nonlinear relationship between each single variable and the dependent variable through the Shapley value and to evaluate the contributions of the various environmental variables.

Assume that a feature set $F$ (with $n$ features) is used to predict the output of the RF model. In SHAP, the model output $f(x)$ is allocated among the features according to their marginal contributions. The Shapley value is determined by using the following formula:

$$\phi_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}} (x_{S \cup \{i\}}) - f_{S} (x_{S}) \right]$$

(7)

In the formula, $\phi_{i}$ is the Shapley value of feature $i$; $F$ is the set of all features; $S$ ranges over all feature subsets of $F$ that exclude feature $i$; $\frac{|S|! \, (|F| - |S| - 1)!}{|F|!}$ is the probability weight of $S$ derived from the permutations and combinations of the features; and $f_{S \cup \{i\}}$ and $f_{S}$ denote the models trained on the features in $S$ with and without feature $i$, respectively. The predictions of the two models are compared on the current input, $f_{S \cup \{i\}} (x_{S \cup \{i\}}) - f_{S} (x_{S})$, where $x_{S}$ represents the values of the input features in set $S$.
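To make the weighting in Equation (7) concrete, the following sketch computes exact Shapley values for a single input by enumerating every subset $S$. It is an illustration only and rests on a loudly stated assumption: instead of retraining a model for each subset, $f_{S}(x_{S})$ is approximated by evaluating one model with the features outside $S$ replaced by baseline values. The names `shapley_values`, `linear_model`, `weights`, and `baseline` are hypothetical. Enumeration costs grow as $2^{n}$, so this is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input x by enumerating all feature subsets.

    Assumption: f_S(x_S) is approximated by calling `predict` with the
    features outside S set to `baseline`, rather than retraining per subset.
    """
    n = len(x)
    features = range(n)

    def value(subset):
        z = [x[j] if j in subset else baseline[j] for j in features]
        return predict(z)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):  # |S| runs from 0 to n - 1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                with_i = value(set(S) | {i})
                without_i = value(set(S))
                total += weight * (with_i - without_i)  # weighted marginal contribution
        phi.append(total)
    return phi

# Hypothetical linear model with three covariates.
weights = [2.0, -1.0, 0.5]

def linear_model(z):
    return 10.0 + sum(w * v for w, v in zip(weights, z))

x = [3.0, 4.0, 2.0]
baseline = [1.0, 1.0, 1.0]
phi = shapley_values(linear_model, x, baseline)
print(phi)
```

For a linear model the result reduces to $w_{i}(x_{i} - b_{i})$ for each feature, and the values sum to the difference between the prediction at `x` and the prediction at `baseline`, which is the additivity property that gives SHAP its name.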

2.3.5. Model Evaluation Indicators

Four indicators were selected to evaluate the predictive performance of the model: the mean absolute error (MAE), the root mean square error (RMSE), the coefficient of determination (R²) of the linear regression equation between the predicted and observed values, and Lin’s concordance correlation coefficient (LCCC). Their calculation formulas are as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} | O_{i} - P_{i} |$$

(8)

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_{i} - P_{i})^{2}}$$

(9)

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (O_{i} - P_{i})^{2}}{\sum_{i=1}^{n} (O_{i} - \bar{O})^{2}}$$

(10)

$$\mathrm{LCCC} = \frac{2 r S_{O} S_{P}}{S_{O}^{2} + S_{P}^{2} + (\bar{O} - \bar{P})^{2}}$$

(11)

Here, $n$ is the number of sample points in the test set, $O_{i}$ is the observed value at sample point $i$, $P_{i}$ is the predicted value at sample point $i$, $\bar{O}$ is the average of the observed values, $\bar{P}$ is the average of the predicted values, $r$ is the Pearson correlation coefficient between the observed and predicted values, $S_{O}$ is the standard deviation of the observed values, and $S_{P}$ is the standard deviation of the predicted values. The MAE and RMSE measure the numerical error on the prediction set, with smaller values indicating a higher model prediction accuracy. Moreover, $R^{2}$ mainly reflects whether the predicted trend is correct; the larger the value, the more accurate the model’s predicted trend. In addition to measuring correlation (Pearson correlation coefficient), the LCCC also considers prediction bias; that is, it comprehensively considers the prediction accuracy and trend of the model [30].

Therefore, its results are more reliable. The LCCC ranges from −1 to 1. The closer the value is to 1, the closer the predicted and observed point pairs are to the line of perfect concordance (the 45° diagonal) in the scatter plot. An LCCC of 1 indicates perfect concordance, an LCCC of −1 indicates perfect discordance, and an LCCC of 0 indicates no correlation. Overall, a good predictive model has lower MAE and RMSE values and higher R² and LCCC values.
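The four indicators in Equations (8)–(11) can be computed together as in the sketch below. It is a minimal illustration, not the code used in this study; the function name `evaluate` and the sample values are hypothetical. One stated assumption: the variances and covariance use the population form (division by $n$), under which $2 r S_{O} S_{P}$ equals twice the covariance.

```python
from math import sqrt

def evaluate(observed, predicted):
    """Return MAE, RMSE, R2 and LCCC for paired observed/predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot
    # Population (1/n) variances and covariance (assumption stated above).
    var_o = ss_tot / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted)) / n
    # 2 * r * S_O * S_P equals 2 * cov, so r is not computed explicitly.
    lccc = 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "LCCC": lccc}

# Hypothetical SOM values: every prediction is 2 units too high.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 22.0, 32.0, 42.0]
metrics = evaluate(observed, predicted)
print(metrics)
```

In this example the Pearson correlation is exactly 1, yet the LCCC is 250/254 (about 0.984) because the constant bias moves the points off the 45° line, which illustrates why the LCCC is stricter than correlation alone.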

3. Experimental Results and Analysis

3.1. Basic Statistics of Soil Organic Matter Content

The distribution characteristics and variability of data have an impact on the reliability of spatial interpolation results. In Kriging interpolation, if the data follow a normal distribution, the optimal prediction results can be obtained [31]. Therefore, normality testing and transformation of the data were performed to obtain more reliable prediction results.

This study first computed descriptive statistics for the soil organic matter content of the training and test sets and performed Kolmogorov–Smirnov (K-S) tests on the experimental data in SPSS 26. The results (Table 2) show that the maximum value (Max), minimum value (Min), average value (AVE), and standard deviation (SD) of the training and test sets were relatively consistent. The magnitude of the coefficient of variation (CV) indicates the spatial variability of soil properties: a CV of less than 10% suggests weak variability, a CV greater than 100% suggests strong variability, and a value between the two suggests moderate variability. According to Table 2, the soil organic matter content at the sampling points ranges from 3.91 to 66.20 g·kg⁻¹, with a relatively large standard deviation. This suggests that the soil organic matter content fluctuates to varying degrees among local areas within the study region. The coefficients of variation for the training set and the test set are 37.77% and 38.15%, respectively, indicating that the soil organic matter in the study area is moderately variable.
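The CV rule quoted above is simple enough to state as code. The sketch below is illustrative only; the function names are hypothetical, and the use of the sample standard deviation (division by $n - 1$) is an assumption, since the text does not say which form was used.

```python
from math import sqrt

def coefficient_of_variation(values):
    """CV in percent: standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample SD (assumption)
    return 100.0 * sd / mean

def variability_class(cv_percent):
    """Classify variability using the 10% and 100% thresholds from the text."""
    if cv_percent < 10.0:
        return "weak"
    if cv_percent > 100.0:
        return "strong"
    return "moderate"

# The two CV values reported for the training and test sets.
print(variability_class(37.77), variability_class(38.15))
```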

Based on the skewness and kurtosis values, as well as the K-S test results, it could be concluded that neither the training set nor the test set is normally distributed. Although Kriging interpolation does not strictly require data to be normally distributed, when the data deviate too far from the normal distribution, the interpolation effect may not be ideal. After the Box–Cox transformation was applied to the training and test sets, the skewness and kurtosis values were close to 0, and the K-S test significance values were greater than 0.05, so the transformed data conformed to the normal distribution.
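For reference, the Box–Cox transformation and its inverse (needed later to map predictions back to the original scale) can be written as below. This is a generic sketch of the standard definition; the value of `lam` is hypothetical because the fitted parameter is not reported in this section, and the function names are introduced here for illustration.

```python
from math import exp, log

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    return exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

lam = 0.3  # hypothetical value, for illustration only
som_values = [3.91, 66.20]  # the observed SOM range reported in the text
transformed = [box_cox(v, lam) for v in som_values]
recovered = [inverse_box_cox(z, lam) for z in transformed]
print(transformed, recovered)
```

In practice the parameter is usually estimated by maximum likelihood on the training set only and then reused unchanged for the test set and for the inverse transformation.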

3.2. Assessment of the Importance of Environmental Variables in RF Models

The optimal parameters of the random forest model were determined by applying the grid search method [32]. Grid search exhaustively explores all possible combinations within a predefined hyperparameter range to identify the optimal hyperparameter configuration. Although this method reliably identifies a favorable hyperparameter setting within that range, it usually requires substantial computational resources and time. In the preliminary experimental stage, the hyperparameter ranges for the grid search were determined based on previous literature and empirical tests. For both the RF and RF-GA models, the range of mtry (the number of variables randomly sampled as candidates at each split) was set from 1 to 20, because different values of mtry can significantly affect the model’s generalization ability. The range of ntree (the number of trees in the forest) was set from 100 to 1000, because increasing the number of trees improves the model’s performance only up to a certain point, after which the marginal improvement becomes negligible while the computational cost continues to increase. To select the optimal hyperparameters, we used the root mean square error (RMSE) calculated on an independent validation set as the evaluation metric. After the grid search, the optimal parameters for the RF model in this study were determined to be mtry = 19 and ntree = 500; for the RF-GA model, they were mtry = 4 and ntree = 500.
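The exhaustive search over mtry and ntree can be sketched as follows. This is an outline under stated assumptions rather than the procedure actually run: `evaluate_rmse` stands for "fit a random forest with these settings and return the validation RMSE", the ntree step of 100 is assumed (the text gives only the range), and `toy_rmse` is an arbitrary bowl-shaped stand-in that is not fitted to any data.

```python
from itertools import product

def grid_search(evaluate_rmse, mtry_values, ntree_values):
    """Return (rmse, mtry, ntree) for the combination with the lowest RMSE."""
    best = None
    for mtry, ntree in product(mtry_values, ntree_values):
        rmse = evaluate_rmse(mtry, ntree)
        if best is None or rmse < best[0]:
            best = (rmse, mtry, ntree)
    return best

mtry_grid = range(1, 21)            # 1 to 20, as stated in the text
ntree_grid = range(100, 1001, 100)  # 100 to 1000; the step of 100 is an assumption

def toy_rmse(mtry, ntree):
    # Hypothetical error surface used only to make the example runnable.
    return 3.5 + 0.01 * (mtry - 6) ** 2 + 1e-6 * (ntree - 300) ** 2

best_rmse, best_mtry, best_ntree = grid_search(toy_rmse, mtry_grid, ntree_grid)
print(best_rmse, best_mtry, best_ntree)
```

With these grids the search evaluates 20 × 10 = 200 combinations, which shows why the cost grows quickly once each evaluation involves training a forest.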

Based on the RF model, all environmental variables involved in modeling were ranked by importance, and the variables were found to differ in how strongly they affect the prediction results. According to the RF importance evaluation (%IncMSE) for the SOM content, the order of influence on the SOM from high to low was as follows (Figure 3): the CNBL, DEM, T_m, LST_m, H_m, MSAVI, WEI, E_m, SCD, BSI, etc.

Therefore, the two topographic factors that had the greatest impact on the SOM in the RF results were the CNBL and DEM. The core distinction between CNBL and DEM lies in the fact that the former serves as a dynamic geomorphic evolution reference, whereas the latter represents static topographic data. By controlling erosion, sedimentation, and hydrological processes, CNBL indirectly yet profoundly influences the spatial distribution and stability of soil organic matter, making it a critical parameter for understanding watershed-scale carbon cycling.

The three climate factors that had the greatest impact on the SOM in the RF results were T_m, LST_m, and H_m. T_m determines decomposition rates across climatic zones (e.g., rapid turnover in the tropics vs. slow accumulation in cold regions), drives freeze–thaw cycles that release stored organic carbon, and shapes microbial adaptation. LST_m directly drives near-surface SOM mineralization (Q10 effect), with elevated temperatures increasing decomposition rates but drought potentially limiting microbial activity; it also influences vegetation distribution. H_m regulates SOM through microbial activity, plant productivity, and leaching: high humidity accelerates aerobic decomposition but slows anaerobic decay, while optimal moisture enhances plant carbon inputs.

3.3. Analysis of the Accuracy of OK, RF, and RF-GA Models

In this study, soil organic matter sample point data were utilized. Kriging, random forest, and random forest with variable screening were employed to predict the spatial distribution of soil organic matter and evaluate the accuracy. First, the dataset was partitioned into a training set and a validation set at an 8:2 ratio. The former was used for model construction, while the latter was used to assess the generalization ability of the model. Kriging constructed the model by calculating the variogram and evaluated the accuracy with the validation set. The random forest method sampled subsets from the training set to construct multiple decision trees, and after training, the validation set was used for evaluation. For the random forest with variable screening, variables were screened first, and then the model was constructed, predicted, and evaluated in the same way. This was carried out to compare the advantages and disadvantages of each method.

After each prediction model produces spatial predictions of the Box–Cox-transformed SOM, an inverse transformation can be used to obtain the SOM spatial distribution on the original scale, as for the results based on Kriging interpolation. The RF model uses all 47 variables to predict soil organic matter across the entire domain.

The selection of environmental variables for RF-GA relies on the genetic algorithm. Inspired by evolution, the GA encodes variable combinations as chromosomes. Starting from a random population, it assesses each chromosome via a fitness function and, through selection, crossover, and mutation, searches for the variable combination that gives the best model performance. After the termination criteria are met, it outputs the best variable combination. The optimal variable combination selected by RF-GA is P_m, E_m, VARI, NDWI, NPP, MNDWI, GNDVI, BSI, AH, ASP, CI, CNBL, CND, DEM, LS, MRRTF, RSP, TCA, LU, ST, and SCD, so the RF-GA model predicts the SOM across the entire region based on 21 environmental variables.

The prediction results of each model are externally validated using the MAE, RMSE, R², and LCCC, as shown in Table 3. Among the three prediction models, the OK model has the highest MAE (6.31) and RMSE (8.33) values, while its R² (0.06) and LCCC (0.16) are very low, indicating that using only the Kriging method results in poor prediction accuracy and a poorly reproduced trend. The RF-GA model exhibits a relatively high R² (0.49) and LCCC (0.67), along with a low MAE (3.02) and RMSE (3.49). According to the LCCC results, the order of the models from best to worst is RF-GA (0.67) > RF (0.38) > OK (0.16). Compared with the OK model and the RF model, the R² of the RF-GA model increases by 0.43 and 0.28, respectively. These results indicate that the RF-GA model, which accounts for nonlinear relationships, has the smallest spatial interpolation error, that OK has the largest, and that both RF-GA and RF improve on the interpolation accuracy of OK because they use auxiliary variables. The RF-GA model is the optimal SOM prediction model in this study.

3.4. SHAP Overlay Explanation

Figure 4 shows the distribution of the SHAP values for each environmental variable, with positive values indicating a positive impact on the SOM content and negative values indicating a negative impact on the SOM content. In Figure 4, the overall importance of each variable is shown, with the x-axis representing the ranking of environmental variable importance and the y-axis representing the average SHAP value of each influencing factor.

As shown in Figure 5, the importance of SR, VARI, GNDVI, and NDVI is relatively low, and their SHAP values are concentrated around 0. In contrast, CNBL, which contributes 20.68% to the model, and DEM, with a contribution rate of 5.57%, are of relatively high importance. The beeswarm plot in Figure 5 shows the overall importance and direction of influence of each variable and indicates that the CNBL and DEM have a significant impact on the SOM content. In Figure 5, the feature ranking (x-axis) represents the importance of the environmental variables, the SHAP value (y-axis) represents the unified index of the influence of a given factor in the model, and the red (blue) dots represent high (low) values of the environmental variable. SHAP > 0 represents a positive contribution; as the SHAP value increases, the positive effect of the factor on the SOM content becomes stronger. SHAP < 0 represents a negative contribution; as the SHAP value decreases, the negative effect of the factor on the SOM content becomes stronger.
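Contribution rates such as those quoted above are commonly obtained by averaging the absolute SHAP value of each variable over all samples and normalizing the averages to sum to 100%. The text does not state how its percentages were derived, so the sketch below shows that common convention as an assumption; the function name `contribution_rates` and the small SHAP matrix are made up for illustration and are not results from this study.

```python
def contribution_rates(shap_matrix, names):
    """Percent share of each variable in the total mean absolute SHAP value."""
    n_rows = len(shap_matrix)
    mean_abs = [sum(abs(row[j]) for row in shap_matrix) / n_rows
                for j in range(len(names))]
    total = sum(mean_abs)
    rates = {name: 100.0 * m / total for name, m in zip(names, mean_abs)}
    # Sort from the most to the least influential variable.
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))

# Made-up SHAP values: rows are samples, columns are variables.
names = ["CNBL", "DEM", "NDVI"]
shap_matrix = [
    [2.0, -0.5, 0.1],
    [-1.0, 0.5, -0.1],
    [3.0, -0.5, 0.1],
]
rates = contribution_rates(shap_matrix, names)
print(rates)
```

Taking absolute values before averaging matters: a variable whose positive and negative SHAP values cancel out would otherwise appear unimportant even though it shifts individual predictions strongly.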

According to the results of the environmental variable driving force analysis of the soil organic matter content in the study area, terrain factors, climate factors, and biological factors are important environmental variables that affect the spatial distribution of the SOM in the study area, which is consistent with the conclusion of RF. Among them, terrain factors reflect not only the regional environment but also the influence of hydrogeological features on the distribution of soil properties. Climate factors not only directly affect the decomposition rate of soil organic matter but also indirectly affect soil organic matter content by influencing soil moisture content and vegetation type. Biological factors affect the distribution of organic matter through vegetation cover and growth conditions.

3.5. Spatial Distribution of Soil Organic Matter

The prediction accuracy of the optimization model based on the combination of RF and GA variable selection is relatively high, achieving an R² of 0.49, an MAE of 3.02 g·kg⁻¹, an RMSE of 3.49 g·kg⁻¹, and an LCCC of 0.67. The agreement with the measured values indicates that the model can effectively predict the SOM content. To allow for a visual comparison of the SOM prediction results of different models, we display the prediction results of all models within the same range (Figure 6, Figure 7 and Figure 8). The predicted SOM content exhibits significant spatial variability. The prediction results indicate that, in the study area, the SOM content is higher in the northern and eastern mountainous areas, while it is lower in the central area with a flat terrain, and a few high values are also distributed in southern cities and mixed forest areas. SOM content is generally higher in mountainous regions and lower in plains. However, this spatial pattern does not necessarily indicate that terrain undulation is the primary driver of soil organic matter heterogeneity, because, while terrain undulation can influence soil erosion and sedimentation processes, these effects are often indirect and localized. Although steeper topographies may exacerbate erosion, SOM distribution depends not only on erosion intensity but also on the stability of depositional environments. In the northern and eastern mountainous areas, the main land cover type is forest, with dense vegetation and a complex terrain; less human intervention allows vegetation to continuously input organic matter into the soil, and the mountainous terrain may slow down soil erosion, resulting in a higher accumulation rate of organic matter in the soil. The flat areas in the central region are mainly farmland, where agricultural activities such as long-term cultivation and fertilization may accelerate the decomposition of organic matter. In addition, areas with a flat terrain are more susceptible to rainfall and wind erosion, further reducing the SOM content.

The prediction results of the three models are shown in Figures 6–8. It can be seen that the SOM spatial distributions predicted by RF and RF-GA are very similar. In contrast, the OK prediction results are very smooth, whereas the RF and RF-GA models highlight the spatial details and changes in the SOM, providing richer information on SOM spatial variation. The OK estimates differ significantly from the original data. The areas with high SOM are mainly distributed where the terrain fluctuates significantly, which is conducive to the accumulation of SOM.

4. Discussion

4.1. Advantages of RF-GA Model

Compared to traditional Ordinary Kriging (OK) and random forest (RF) methods that rely on full-variable predictions, the RF-GA model utilized in this study offers several distinct advantages. Notably, it effectively addresses spatial heterogeneity in complex regions. Unlike traditional methods that often require the classification of land use types, the RF-GA model enhances the model’s ability to discern data features, leading to improved fitting accuracy. Specifically, in the hilly basin of Lanxi City, where the study is situated, the complex topography induces significant spatial variability. The RF-GA model excels in capturing the nonlinear relationships between soil organic matter (SOM) and environmental variables across different topographic units. For example, in hilly areas, slopes and valleys exhibit distinct hydrological and micro-climatic conditions, which the RF-GA model can accommodate, whereas traditional methods may overlook these variations.

When comparing the RF-GA model with the OK and RF methods in predicting SOM in such complex regions, the RF-GA model demonstrates clear superiority. In particular, the RF-GA model is capable of effectively selecting the key environmental covariates that substantially contribute to the model’s performance. In regions with complex topography, such as our study area, variables like slope gradient and aspect play a crucial role in SOM distribution. The genetic algorithm (GA) embedded within the RF-GA model can identify these important topographic factors and eliminate low-contribution variables that could otherwise interfere with model accuracy, which significantly enhances the accuracy of SOM predictions. The genetic algorithm optimizes the RF model by mimicking natural selection processes. By encoding the candidate variables as genes and applying operations such as selection, crossover, and mutation, the GA helps the model search for the optimal combination of variables within a large solution space, thereby improving its predictive performance.

4.2. Explanation of Environmental Variables

The influencing factors of SOM exhibit considerable spatial variation due to both natural and anthropogenic disturbances [33]. In the hilly basin area of Lanxi City, topographic, climatic, and biological factors are identified as key determinants of SOM, and they exhibit certain threshold or peak effects.

Topographic factors have a profound impact on SOM distribution. In this hilly region, elevation, slope gradient, and aspect are significant variables. Elevation influences temperature and precipitation patterns, which, in turn, affect the decomposition and accumulation of SOM. Higher elevations typically experience lower temperatures, which slow the decomposition rate, resulting in higher SOM content. The slope gradient influences soil erosion and deposition processes. Steeper slopes are more prone to erosion, leading to the loss of SOM-rich topsoil, while gentle slopes are more favorable for SOM accumulation. Aspect affects the amount of solar radiation received, which influences micro-climates and vegetation growth, both of which are closely related to SOM distribution. Among topographic variables, the CNBL and DEM exhibit the most substantial impact on SOM distribution, consistent with findings from related studies [34,35,36]. Terrain factors regulate SOM distribution by affecting soil water content, erosion–deposition processes, vegetation distribution, and micro-climates [8].

Climate factors also influence SOM dynamics, affecting accumulation and decomposition through temperature, precipitation, vegetation, and microbial activity. In warm and humid climates, high temperatures and abundant precipitation accelerate SOM decomposition by promoting microbial activity. Conversely, in cold and arid regions, low temperatures and limited precipitation slow down decomposition, favoring SOM accumulation.

Biological factors impact SOM distribution through vegetation growth, litter input, and biological activity [37]. Dense vegetation cover increases litter deposition, providing a major source of SOM. Moreover, different vegetation types exhibit varying root exudation patterns, which influence soil microbial communities, thus affecting the decomposition and transformation of SOM.

4.3. Limitations and Potential Improvements

4.3.1. Insufficient Data Scale and Representativeness

This study is based on data collected from 1560 sampling points across Lanxi City. While this sample size provides valuable insights, its limited spatial distribution may affect the generalizability and robustness of the model’s predictions, especially when applied to other regions with differing environmental conditions. It is important to note that the study area is situated in a hilly basin region, where the spatial variability of soil organic matter (SOM) is inherently high due to complex topography. As a result, the prediction accuracy of SOM models in such areas is generally lower compared to flatter regions. This phenomenon is well-documented in previous studies and reflects the challenges of modeling SOM in regions with such high spatial heterogeneity. Furthermore, environmental variables, particularly climate-related factors, were considered without accounting for temporal dynamics such as seasonal or long-term fluctuations, which could influence SOM distribution. In future research, incorporating time-series data and increasing the sampling points could help mitigate these limitations, improving both the spatial and temporal representation of the data.

4.3.2. Directions for Model Optimization

Although genetic algorithms (GAs) have proven effective in enhancing the performance of the model by optimizing variable selection, they come with high computational complexity and longer optimization times. Additionally, both the random forest (RF) and RF-GA models tend to underperform in areas with sparse extreme values, which are often associated with special geographical conditions or environmental factors. This challenge is inherent in SOM predictions for complex areas, where extreme values, typically linked to specific local factors, are underrepresented in the sample. As a result, the model may pull its predictions toward the global mean, overestimating low values, underestimating high values, and reducing prediction variability. To address this issue, future research should focus on optimizing both the model and the data, incorporating techniques such as oversampling or enhanced sampling strategies for extreme values. Further, model calibration techniques, such as post-prediction adjustments or refinement of algorithmic parameters, can help correct for these biases, improving prediction accuracy in regions with sparse extreme values.

4.3.3. Applicability and Interaction Analysis of Explanatory Methods

The SHAP (Shapley Additive Explanations) interpretation method provides valuable insights into the contributions of environmental variables to SOM, enhancing the model’s explainability. However, it comes with significant computational complexity and resource demands, particularly as the number of environmental variables increases. Moreover, this study did not explore the interactions between different environmental factors, which may be crucial for a more comprehensive understanding of their combined effects on SOM distribution. In the future, we suggest incorporating a more efficient explanatory model that combines SHAP interaction values to investigate how multiple environmental factors interact and influence SOM distribution. This would allow for a more detailed, nuanced understanding of the environmental processes at play.

5. Conclusions and Prospects

This study, based on soil surveys and measured data, utilized ordinary Kriging (OK), the random forest (RF) model, the random forest model optimized with genetic algorithm variable combination (RF-GA), and the SHAP interpretation method to analyze the spatial differentiation characteristics and key influencing factors of soil organic matter (SOM) in Lanxi City, as well as their impacts. The following key conclusions were drawn:

The spatial distribution of SOM in the study area is influenced by factors such as terrain, climate, and biological factors, exhibiting clear spatial differentiation patterns. Specifically, SOM content is higher in the northern and eastern mountainous regions, while lower in the central, flat areas. Additionally, some high SOM values are observed in the southern cities and mixed forest areas. This distribution pattern indicates that SOM spatial variability is not only influenced by topographic changes but also closely related to local climate and vegetation factors.

The RF-GA model, optimized by the genetic algorithm-based variable combination, demonstrates excellent performance in extracting environmental variables. Compared to the traditional RF model, which uses full-variable prediction, the RF-GA model significantly improves the accuracy of SOM predictions. Particularly in complex regions, this model can better identify and optimize key variables, excluding the interference of low-contribution variables, thereby enhancing prediction accuracy. As such, the RF-GA model presents a reliable tool for SOM prediction in complex areas.

Further analysis using the RF-GA-SHAP model reveals that the primary factors influencing the spatial distribution of surface SOM in the hilly basin area of Lanxi City include CNBL, DEM, P_m, NDWI, CI, T_m, SCD, and BSI. These factors not only reveal the patterns of SOM spatial variation but also provide scientific evidence for soil management practices and sustainable agricultural development. Notably, the use of the SHAP method helps to quantitatively explain the specific effects of these environmental variables on SOM distribution, offering intuitive and actionable support for land use decision making.

The innovative aspect of this study lies in the combination of the RF-GA model and the SHAP method, proposing a novel SOM prediction model. In complex topographic regions, the optimization process of the genetic algorithm effectively selects important environmental variables, while the SHAP method provides quantitative explanations for their impacts on SOM distribution. This integrated approach not only improves prediction accuracy but also enhances model interpretability, offering new insights for future SOM research and environmental management.

Author Contributions

Conceptualization, H.H. and Z.T.; methodology, H.H. and Y.X.; software, Z.T.; validation, Y.L. (Yaolin Liu) and Z.R.; formal analysis, Z.T.; investigation, H.H.; resources, Z.R. and Y.L. (Yanfang Liu); data curation, Z.R. and Y.L. (Yanfang Liu); writing – original draft, H.H.; writing – review and editing, Y.X. and Z.T.; visualization, Y.X. and Y.L. (Yaolin Liu); supervision, Y.L. (Yaolin Liu); funding acquisition, Y.L. (Yaolin Liu). All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the support of the project “the Key Program of the National Natural Science Foundation of China” (42230107); “the National Natural Science Foundation of China” (42471454); and the “Strategic Science and Technology Talent Cultivation Special Project of Hubei Province” (2024DJA012).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wiesmeier, M.; Barthold, F.; Blank, B.; Kögel-Knabner, I. Digital mapping of soil organic matter stocks using Random Forest modeling in a semi-arid steppe ecosystem. Plant Soil 2011, 340, 7–24. [Google Scholar] [CrossRef]
Kempen, B.; Brus, D.J.; Stoorvogel, J.J.; Heuvelink, G.B.; de Vries, F. Efficiency Comparison of Conventional and Digital Soil Mapping for Updating Soil Maps. Soil Sci. Soc. Am. J. 2012, 76, 2097–2115. [Google Scholar] [CrossRef]
Zhao, M.-S.; Rossiter, D.G.; Li, D.-C.; Zhao, Y.-G.; Liu, F.; Zhang, G.-L. Mapping soil organic matter in low-relief areas based on land surface diurnal temperature difference and a vegetation index. Ecol. Indic. 2014, 39, 120–133. [Google Scholar] [CrossRef]
Xie, H.; Li, W.; Duan, L.; Yuan, H.; Zhou, Q.; Luo, Z.; Du, H. Digital mapping of cultivated land soil organic matter in hill-mountain and plain regions. J. Soil Sediments 2024, 24, 349–360. [Google Scholar] [CrossRef]
Zhang, W.-C.; Wan, H.-S.; Zhou, M.-H.; Wu, W.; Liu, H.-B. Soil total and organic carbon mapping and uncertainty analysis using machine learning techniques. Ecol. Indic. 2022, 143, 109420. [Google Scholar] [CrossRef]
Sun, Y.; Ma, J.; Zhao, W.; Qu, Y.; Gou, Z.; Chen, H.; Tian, Y.; Wu, F. Digital mapping of soil organic carbon density in China using an ensemble model. Environ. Res. 2023, 231, 116131. [Google Scholar] [CrossRef]
Mousavi, S.R.; Sarmadian, F.; Omid, M.; Bogaert, P. Three-dimensional mapping of soil organic carbon using soil and environmental covariates in an arid and semi-arid region of Iran. Measurement 2022, 201, 111706. [Google Scholar] [CrossRef]
Zeraatpisheh, M.; Ayoubi, S.; Jafari, A.; Tajik, S.; Finke, P. Digital mapping of soil properties using multiple machine learning in a semi-arid region, central Iran. Geoderma 2019, 338, 445–452.
Agyeman, P.C.; Ahado, S.K.; Borůvka, L.; Biney, J.K.M.; Sarkodie, V.Y.O.; Kebonye, N.M.; Kingsley, J. Trend analysis of global usage of digital soil mapping models in the prediction of potentially toxic elements in soil/sediments: A bibliometric review. Environ. Geochem. Health 2021, 42, 1715–1739.
Hendriks, C.M.J.; Stoorvogel, J.J.; Álvarez-Martínez, J.M.; Claessens, L.; Pérez-Silos, I.; Barquín, J. Introducing a mechanistic model in digital soil mapping to predict soil organic matter stocks in the Cantabrian region (Spain). Eur. J. Soil Sci. 2021, 72, 704–719.
Sun, X.-L.; Wang, H.-L.; Zhao, Y.-G.; Zhang, C.; Zhang, G.-L. Digital soil mapping based on wavelet decomposed components of environmental covariates. Geoderma 2017, 303, 118–132.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Min, H.; Ko, H.J.; Ko, C.S. A genetic algorithm approach to developing the multi-echelon reverse logistics network for product returns. Omega 2006, 34, 56–69.
Pasdarpour, M.; Ghazavi, M.; Teshnehlab, M.; Sadrnejad, S.A. Optimal design of soil dynamic compaction using genetic algorithm and fuzzy system. Soil Dyn. Earthq. Eng. 2009, 29, 1103–1112.
Shapchenkova, O.A.; Krasnoshchekov, Y.N.; Loskutov, S.R. Application of the methods of thermal analysis for the assessment of organic matter in postpyrogenic soils. Eurasian Soil Sci. 2011, 44, 677–685.
Minasny, B.; McBratney, A.B.; Malone, B.P.; Wheeler, I. Digital mapping of soil carbon. Adv. Agron. 2013, 118, 1–47.
McBratney, A.B.; Santos, M.L.M.; Minasny, B. On digital soil mapping. Geoderma 2003, 117, 3–52.
Peng, S.; Ding, Y.; Liu, W.; Li, Z. 1 km monthly temperature and precipitation dataset for China from 1901 to 2017. Earth Syst. Sci. Data 2019, 11, 1931–1946.
Boettinger, J.L. Environmental Covariates for Digital Soil Mapping in the Western USA. In Digital Soil Mapping; Springer: Berlin/Heidelberg, Germany, 2010.
Song, X.; Liu, F.; Ju, B.; Zhi, J.; Li, D.; Zhao, Y.; Zhang, G. Mapping soil organic carbon stocks of northeastern China using expert knowledge and GIS-based methods. Chin. Geogr. Sci. 2017, 27, 516–528.
Webster, R.; Oliver, M.A. Geostatistics for Environmental Scientists; John Wiley & Sons: Hoboken, NJ, USA, 2001.
Forkuor, G.; Hounkpatin, O.K.L.; Welp, G.; Thiel, M. High Resolution Mapping of Soil Properties Using Remote Sensing Variables in South-Western Burkina Faso: A Comparison of Machine Learning and Multiple Linear Regression Models. PLoS ONE 2017, 12, e0170478.
Were, K.; Bui, D.T.; Dick, Ø.B.; Singh, B.R. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol. Indic. 2015, 52, 394–403.
Pittman, R.; Hu, B.; Webster, K. Improvement of soil property mapping in the Great Clay Belt of northern Ontario using multi-source remotely sensed data. Geoderma 2021, 381, 114761.
Bouktif, S.; Fiaz, A.; Ouni, A.; Serhani, M.A. Optimal Deep Learning LSTM Model for Electric Load Forecasting using Feature Selection and Genetic Algorithm: Comparison with Machine Learning Approaches. Energies 2018, 11, 1636.
Xue, B.; Zhang, M.; Browne, W.N. A Comprehensive Comparison on Evolutionary Feature Selection Approaches to Classification. Int. J. Comput. Intell. Appl. 2015, 14.
Huang, J.; Cai, Y.; Xu, X. A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognit. Lett. 2007, 28, 1825–1844.
Lundberg, S.; Lee, S. A Unified Approach to Interpreting Model Predictions. arXiv 2017.
Lin, L.I. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989, 45, 255–268.
McBride, G.B. A Proposal for Strength-of-Agreement Criteria for Lin’s Concordance Correlation Coefficient; National Institute of Water & Atmospheric Research Ltd.: Hamilton, New Zealand, 2005.
Song, Y.-Q.; Yang, L.-A.; Li, B.; Hu, Y.-M.; Wang, A.-L.; Zhou, W.; Cui, X.-S.; Liu, Y.-L. Spatial Prediction of Soil Organic Matter Using a Hybrid Geostatistical Model of an Extreme Learning Machine and Ordinary Kriging. Sustainability 2017, 9, 754.
Guo, L.; Fan, G. Support Vector Machines for Surface Soil Density Prediction based on Grid Search and Cross Validation. Chin. J. Soil Sci. 2018, 49, 512–518.
Chen, S.; Arrouays, D.; Mulder, V.L.; Poggio, L.; Minasny, B.; Roudier, P.; Libohova, Z.; Lagacherie, P.; Shi, Z.; Hannam, J.; et al. Digital mapping of GlobalSoilMap soil properties at a broad scale: A review. Geoderma 2022, 409, 115567.
Hamzehpour, N.; Shafizadeh-Moghadam, H.; Valavi, R. Exploring the driving forces and digital mapping of soil organic carbon using remote sensing and soil texture. CATENA 2019, 182, 104141.
Zhao, C.; Li, P.; Yan, Z.; Zhang, C.; Meng, Y.; Zhang, G. Effects of landscape pattern on water quality at multi-spatial scales in Wuding River Basin, China. Environ. Sci. Pollut. Res. 2024, 31, 19699–19714.
Zhou, Y.; Zhao, X.; Guo, X.; Li, Y. Mapping of soil organic carbon using machine learning models: Combination of optical and radar remote sensing data. Soil Sci. Soc. Am. J. 2022, 86, 293–310.
Guo, P.T.; Li, M.F.; Luo, W.; Tang, Q.F.; Liu, Z.W.; Lin, Z.M. Digital mapping of soil organic matter for rubber plantation at regional scale: An application of random forest plus residuals kriging approach. Geoderma 2015, 237, 49–59.

Figure 1. Location of the research area and distribution of sampling points.

Figure 2. Schematic diagram of the principle of the genetic algorithm.

Figure 3. Ranking of %IncMSE values of the influencing factors in the random forest model.

Figure 4. Shapley values between soil organic matter content and environmental variables.

Figure 5. Beeswarm plot of Shapley values between soil organic matter content and environmental variables.

Figure 6. Spatial distribution map of prediction accuracy of the OK model for SOM.

Figure 7. Spatial distribution map of prediction accuracy of the RF model for SOM.

Figure 8. Spatial distribution map of prediction accuracy of the RF-GA model for SOM.

Table 1. Input variables used in this study.

Soil-Forming Factors	Input Variables	Spatial Resolution
Topographic factors	Analytical hillshading (AH), aspect (ASP), closed depressions (CDs), convergence index (CI), channel network base level (CNBL), channel network distance (CND), elevation (DEM), coefficient of variation of elevation (ECV), LS factor (LS), mass balance index (MBI), multi-resolution ridge top flatness (MRRTF), multi-resolution valley bottom flatness (MRVBF), plan curvature (PLC), profile curvature (PRC), relative slope position (RSP), surface cutting depth (SCD), slope (SLP), total catchment area (TCA), topographic position index (TPI), terrain ruggedness index (TRI), topographic wetness index (TWI), terrain undulation (TU), valley depth (VD), wind exposition index (WEI)	12.5 m
Biological factors	Bare soil index (BSI), enhanced vegetation index (EVI), global environment monitoring index (GEMI), green normalized difference vegetation index (GNDVI), modified normalized difference water index (MNDWI), modified soil-adjusted vegetation index (MSAVI), normalized difference moisture index (NDMI), normalized difference vegetation index (NDVI), normalized difference water index (NDWI), net primary production (NPP), soil-adjusted vegetation index (SAVI), simple ratio (SR), visible atmospherically resistant index (VARI)	10 m
Soil texture	Sand content (sand), silt content (silt), clay content (clay)	900 m
Climate factors	Evaporation (E_m), humidity mean (H_m), land surface temperature mean (LST_m), precipitation mean (P_m), temperature mean (T_m)	1000 m
Land use (LU)		Vector data
Soil type (ST)		Vector data

Table 2. Descriptive statistics of soil organic matter content at sampling points in the study area.

Type	Data	Samples	Max (g·kg⁻¹)	Min (g·kg⁻¹)	Mean (g·kg⁻¹)	SD (g·kg⁻¹)	CV (%)	Skewness	Kurtosis	K-S
Training set	Raw data	1249	66.20	3.91	22.25	8.40	37.77	0.85	1.89	0.000
Training set	Box–Cox	1249	10.87	1.81	6.01	1.31	21.84	−0.01	0.44	0.081
Test set	Raw data	311	58.60	5.21	22.50	8.58	38.15	0.86	1.40	0.006
Test set	Box–Cox	311	10.24	2.34	6.05	1.30	21.54	0.12	0.29	0.200
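
The Box–Cox rows in Table 2 can be illustrated with a short sketch. The source does not state the transformation parameter; the value λ = 0.4 used below is an assumption, chosen only because it reproduces the tabulated maxima and minima to two decimals. The function names are hypothetical, and the skewness and kurtosis use plain moment estimators, which may differ from the estimators in the software the authors used.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def inverse_box_cox(y, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)

def describe(values):
    """Mean, sample SD, CV (%), moment skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2 * n / (n - 1))  # sample standard deviation
    return {
        "mean": mean,
        "sd": sd,
        "cv_percent": 100.0 * sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis
    }

# Assumption: lambda is NOT given in the source; 0.4 is inferred from Table 2.
ASSUMED_LAMBDA = 0.4

# Raw maxima and minima of the training and test sets from Table 2 (g/kg).
raw_extremes = [66.20, 3.91, 58.60, 5.21]
transformed = [round(box_cox(v, ASSUMED_LAMBDA), 2) for v in raw_extremes]
print(transformed)
```

Under this assumed λ, the four transformed extremes agree with the Box–Cox maxima and minima listed in Table 2. Predictions made on the transformed scale would need `inverse_box_cox` to return to g·kg⁻¹.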

Table 3. Cross-validation results of different interpolation methods.

Method	MAE (g·kg⁻¹)	RMSE (g·kg⁻¹)	R²	LCCC
OK	6.31	8.33	0.06	0.16
RF	4.60	5.86	0.21	0.38
RF-GA	3.02	3.49	0.49	0.67
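
The four indicators in Table 3 can be computed from paired observed and predicted values as sketched below. This is a minimal illustration, not the authors' code: the function name is hypothetical, R² is taken as 1 − SSE/SST (the paper may instead use the squared Pearson correlation), and Lin's concordance correlation coefficient (LCCC) uses population (1/n) variances and covariance.

```python
import math

def regression_metrics(observed, predicted):
    """Return MAE, RMSE, R2 and Lin's concordance correlation coefficient."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Assumption: R2 defined as 1 - SSE/SST; it can be negative for poor fits.
    sst = sum((o - mean_o) ** 2 for o in observed)
    r2 = 1.0 - sum(e * e for e in errors) / sst

    # Lin's CCC penalises both scatter and systematic bias (mean shift).
    var_o = sst / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted)) / n
    lccc = 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)

    return {"MAE": mae, "RMSE": rmse, "R2": r2, "LCCC": lccc}

# Toy values only; these are not data from the study.
toy_observed = [1.0, 2.0, 3.0]
toy_predicted = [2.0, 3.0, 4.0]
toy_metrics = regression_metrics(toy_observed, toy_predicted)
print(toy_metrics)
```

In the toy case the predictions are perfectly correlated with the observations but shifted by a constant, so the LCCC falls well below 1 while the correlation alone would not reveal the bias. That sensitivity to bias is why LCCC is reported alongside R².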

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

