Interpretable Network Framework for Predicting the Spatial Distribution of Chromium in Soil

Luo, Xinping; Luo, Wei; Hao, Jing; Zhu, Yuchen; Kong, Xiangke

doi:10.3390/su17146420

by Xinping Luo ¹,²,*, Wei Luo ², Jing Hao ¹, Yuchen Zhu ³ and Xiangke Kong ³

¹ Comprehensive Survey Command Center for Natural Resources, China Geological Survey, Beijing 100055, China

² Key Laboratory of Coupling Process and Effect of Natural Resources Elements, Ministry of Natural Resources, Beijing 100055, China

³ Institute of Hydrogeology and Environmental Geology, CAGS, Chinese Academy of Geological Sciences, Xiamen 361000, China

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(14), 6420; https://doi.org/10.3390/su17146420

Submission received: 19 June 2025 / Revised: 6 July 2025 / Accepted: 9 July 2025 / Published: 14 July 2025

(This article belongs to the Special Issue Applications of GIS and Remote Sensing in Soil Environment Monitoring 2nd Edition)

Abstract

Investigating the spatial distribution of chromium (Cr) in soil is essential for understanding Cr pollution and accurately assessing associated environmental risks. However, field sampling is challenging due to limited sampling points, and the spatial distribution of Cr is affected by multiple complex environmental covariates, thereby restricting model development and prediction accuracy. This study selected the Chizhou–Xuancheng border area in southern Anhui Province as the research region and collected 2035 data points. Machine learning models, including AdaBoost, GBDT, XGBoost, and MLP, were employed to predict Cr concentrations in conjunction with environmental covariates. To address the challenges of sparse sampling data and complex data relationships for Cr prediction, the PHMS-Transformer model is proposed. Featuring a shallow encoder design, configurable pooling strategies, and a lightweight structure, the model significantly reduces the number of parameters and alleviates overfitting under sparse sampling conditions, while the incorporation of multi-head self-attention mechanisms captures complex nonlinear relationships among multi-source environmental variables relevant to Cr. To further enhance model interpretability for Cr prediction, the SHAP model was applied to identify key factors influencing Cr distribution. Comprehensive comparisons indicate that the PHMS-Transformer model achieves superior performance in predicting Cr, demonstrating high accuracy and generalization capability, with clear advantages over traditional methods. These findings offer valuable insights for soil environmental protection and Cr pollution control and possess significant theoretical and practical implications. Soil Cr pollution represents a global environmental challenge, where achieving accurate predictions for Cr is particularly crucial yet difficult in regions with constrained data accessibility. 
The lightweight, high-precision, and interpretable PHMS-Transformer framework proposed in this study provides an effective technical solution to the widespread challenges of sample sparsity and model complexity inherent in predicting the spatial distribution of soil Cr globally. Therefore, this work offers significant reference value for advancing global soil environmental risk assessment and Cr pollution remediation efforts.

Keywords:

soil heavy metals; machine learning; PHMS-Transformer; SHAP

1. Introduction

Soil constitutes the fundamental component of terrestrial ecosystems, underpinning material cycles, energy flow, and biodiversity. It is an essential, renewable natural resource indispensable for human survival [1]. However, since the Industrial Revolution, intensifying soil heavy metal contamination has emerged as a severe global environmental crisis. Significant areas of farmland, mining regions, and peri-urban soils worldwide are contaminated with heavy metals to varying degrees [2]. In China, according to the National Soil Pollution Status Survey Bulletin, the rate of soil sampling points exceeding standards is of significant concern. Industrial and mining activities, agricultural inputs (such as phosphate fertilizers and pesticides containing heavy metals), and traffic emissions are identified as the primary pollution sources [3]. Heavy metals (e.g., cadmium (Cd), lead (Pb), mercury (Hg), arsenic (As), chromium (Cr), etc.) are characterized by high toxicity, resistance to degradation, propensity for accumulation, and the ability to undergo bioaccumulation and amplification through the food chain [4]. Among these, chromium (Cr), particularly its highly toxic hexavalent form (Cr(VI)), has become a ubiquitous environmental contaminant due to its extensive use in industries such as electroplating, tanning, and textile dyeing. It is classified by the World Health Organization (WHO) and the International Agency for Research on Cancer (IARC) as a known human carcinogen [5]. Heavy metal contamination not only leads to the degradation of soil ecological functions (e.g., inhibiting microbial activity, disrupting soil structure) but also threatens agricultural product safety through uptake and accumulation in crops. Ultimately, this poses risks to human health via the food chain, causing effects such as carcinogenesis, teratogenesis, and impairment of organ functions. 
Furthermore, heavy metals can enter aquatic environments through surface runoff or leaching processes, contributing to water pollution [6]. In-depth research on predicting the spatial distribution of soil heavy metal pollution is crucial for developing effective environmental protection policies and sustainable development strategies [7].

Prior to the widespread adoption of machine learning techniques, the prediction of soil heavy metal concentrations primarily relied on traditional statistical methods and empirical models [8]. In practice, geostatistical methods—such as Kriging interpolation and spatial regression analysis [9]—alongside approaches dependent on soil sampling and laboratory assessment were commonly employed. Empirical models often utilized expert judgment or historical experience to elucidate the relationships between heavy metal concentrations and other soil properties [10]. However, despite possessing certain predictive capabilities, these traditional approaches are inherently constrained by difficulties in data acquisition, limited accuracy, and an inability to capture complex nonlinear relationships.

In recent years, machine learning methods have been extensively applied to predict the spatial distribution of soil heavy metals. Traditional ensemble algorithms (e.g., AdaBoost [11], GBDT [12], XGBoost [13]) and deep learning models (e.g., ResNet [14], MLP [15], RNN [16]) have been utilized for this purpose. Although these methods offer respective advantages, they are limited by complex architectures and substantial resource requirements. The Transformer is a deep learning architecture based on the attention mechanism [17,18]. It addresses these limitations by eschewing traditional recurrent neural network (RNN) and convolutional neural network (CNN) structures, instead utilizing self-attention mechanisms to process sequential data. Its core concept involves capturing dependencies between positions within an input sequence via self-attention, thereby eliminating the inherent sequential processing requirement of RNNs. Consequently, Transformer models support parallel computation, accelerating the training process while enhancing the ability to model long-range dependencies.

Previous research has indicated that the high cost of sampling impedes traditional soil pollution assessment methods from achieving sufficiently dense spatial sample distributions. This limitation hinders the comprehensive characterization of regional pollution patterns [19,20,21]. Furthermore, analysis by Zhou et al. (2020) based on DEM demonstrated that variations in slope gradient and elevation influence heavy metal transport pathways via surface runoff, resulting in differential concentrations along terrain gradients [22]. The resultant spatial patterns are jointly influenced by natural factors (e.g., parent material type, topography, climatic conditions) [11,22,23] and anthropogenic activities (e.g., industrial emissions, agricultural practices) [24]. These factors exhibit nonlinear interactions with environmental covariates. Under conditions of sparse sampling, modeling such complex nonlinear relationships poses significant challenges. Traditional linear regression models are unable to adequately capture interactions between variables [25]. Conversely, nonlinear methods like decision trees, while capable of handling such interactions, are prone to overfitting and offer limited interpretability when processing high-dimensional data with complex interactions [26].

Advanced models such as deep neural networks have achieved remarkable success across numerous domains, including image recognition, speech processing, and intelligent recommendation systems. However, their internal decision-making processes inherently remain a “black box,” making it difficult to elucidate the underlying logic of model outputs [27]. Interpretability is imperative for dissecting a model′s internal mechanisms and clarifying how input variables influence the output. Consequently, model interpretability has emerged as a critical challenge in the field of predicting the spatial distribution of soil heavy metals [28]. Traditional feature importance assessment methods primarily rely on one-dimensional statistical metrics, offering only a simplified evaluation. These methods neither account for complex interactions between features nor accurately determine the actual contribution of individual features to predicting soil heavy metal concentrations. Although linear models provide initial insights based on linear assumptions for interpretation, they are insufficient for handling the nonlinear data commonly encountered in predicting the spatial distribution of soil heavy metals [29].

Therefore, developing a predictive model that simultaneously achieves high accuracy, is lightweight (adaptable to sparse samples), and possesses strong interpretability represents an urgent need to overcome the current bottlenecks in predicting the spatial distribution of soil heavy metals and for precise environmental management. Given the significant local relevance of chromium (Cr) pollution in the Chizhou–Xuancheng area of interest in this study—a region characterized by industries such as chemical manufacturing, building materials, and mining [30] alongside intensive agricultural activities [31]—and the availability of high-quality, large-sample-size Cr concentration data (2035 points) for this research, this study focuses specifically on addressing the challenge of achieving high-accuracy, interpretable prediction of the spatial distribution of chromium (Cr) in soil.

To this end, this study proposes a PHMS-Transformer framework. Its lightweight architecture significantly reduces the number of parameters to mitigate the risk of overfitting under sparse sampling conditions. Simultaneously, it leverages the multi-head self-attention mechanism to effectively capture complex nonlinear relationships among environmental variables, offering an innovative technical solution to these global bottlenecks. The framework′s configurable design (e.g., pooling strategies) provides the flexibility to adapt to diverse geographical environments. This establishes a scalable and transferable methodology for predicting the spatial distribution of soil heavy metals and conducting risk assessments across different global regions, from Asian industrial corridors to European agricultural zones. It holds significant practical value for advancing global soil environmental protection and enabling precise pollution management.

The contributions of this study are as follows: (i) Dataset Collection and Nonlinear Mapping: A comprehensive dataset comprising 2035 data points was collected within the study area. Through analysis of multi-source environmental variables, this study established nonlinear mappings between environmental covariates (including precipitation, water sources, soil pH, slope, and elevation) and measured chromium (Cr) concentrations. (ii) Model Comparison and Novel Architecture: Based on this dataset, the predictive performance for Cr of mainstream machine learning models (including AdaBoost, GBDT, XGBoost, MLP, and Transformer) was systematically compared. Furthermore, a lightweight, improved PHMS-Transformer model was proposed to address the limitations inherent in traditional Transformer models, particularly their excessive parameterization and propensity for overfitting under conditions of limited samples. (iii) Interpretability Analysis: A SHAP-based interpretability analysis was performed on the prediction results generated by the optimal PHMS-Transformer model. This analysis enabled the identification of the dominant environmental factors and elucidated their interaction mechanisms governing the spatial distribution of chromium (Cr).

2. Materials and Methods

Investigation and remediation reports on cadmium and other heavy metals issued by the Chizhou Municipal Ecological Environment Bureau indicate that the local government prioritizes heavy metal pollution prevention and control while continuously advancing remediation measures. Given that chromium (Cr) is prevalent in industrial activities and poses significant risks to soil and aquatic environments, this study focuses on Cr. The study area is located at the border between Chizhou and Xuancheng in southern Anhui Province. By integrating spatial location data with environmental covariates, soil heavy metal concentrations were predicted, and the SHAP model was employed for interpretation.

First, multi-source environmental data were acquired and preprocessed, including soil sample preparation and heavy metal concentration measurements. Next, model construction was carried out. Preliminary experiments compared various machine learning models, and those with superior predictive performance (AdaBoost [11], GBDT [12], XGBoost [13], MLP [15], and Transformer [17]) were selected as baseline models. Although the Transformer model can effectively analyze the spatial correlations of multi-source environmental variables in heavy metal distribution owing to the long-range dependency modeling capability of its self-attention mechanism, its application is constrained by large-scale data requirements. Therefore, a lightweight PHMS-Transformer model was proposed. Using processed environmental covariates as inputs and heavy metal concentrations as outputs, prediction models were constructed and their performances compared. The SHAP model was introduced to provide an in-depth interpretation of the results [28,29], thereby identifying key factors affecting soil heavy metal content. Finally, the established model was used to predict the spatial distribution of soil heavy metals, with cross-validation employed to compute performance metrics (MAE, RMSE, and R²) that assess model accuracy and stability. The technical flowchart is presented in Figure 1.
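The cross-validation step in this workflow can be illustrated with a minimal sketch. The number of folds, the shuffling, and the random seed are not stated in the text, so `k=5` and `seed=0` below are assumptions chosen only for illustration, and `k_fold_indices` is a hypothetical helper rather than the authors' code.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 2035 is the number of soil samples reported in the study; k=5 is assumed.
folds = list(k_fold_indices(2035, k=5))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```

Each sample appears in exactly one test fold, so metrics averaged over the folds use every measured point once for evaluation while never scoring a model on data it was trained on.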

2.1. Study Area

The study area is located in the border region between Chizhou and Xuancheng in southern Anhui Province, China, on the south bank of the lower reaches of the Yangtze River (coordinates: 116°38′–118°05′ E, 29°33′–30°51′ N). The region has a subtropical monsoon climate, with a multi-year average temperature of 16.9 °C, an average annual rainfall of 1554.4 mm, and a frost-free period of 223 days [23]. The terrain is higher in the south and lower in the north, encompassing both the Yangtze Plain and the mountainous areas of southern Anhui. The area is characterized by undulating mountains and intersecting lakes and rivers, with higher elevations in the east and lower in the west. The geomorphological diversity—including plains, terraces, hills, and mountains—provides varied environmental conditions for the formation and distribution of soil heavy metal pollution [22]. Rapid industrial development in the area has given rise to sectors such as chemicals, building materials, and mining [30]. Furthermore, the region is of significant agricultural importance as a major commercial grain and national high-quality cotton base, where the use of pesticides and fertilizers contributes to heavy metal pollution [31]. The study area is shown in Figure 2.

2.2. Dataset

2.2.1. Soil Sampling and Chemical Analysis

A total of 2035 surface soil samples were collected from the study area. For each sample, sub-samples were obtained from four directions (up, down, left, and right) at a depth of 0–20 cm (Figure 3). The sampling density for surface soils was 1 point per 1 km², with each analytical sample comprising sub-samples from an area of 4 km². In contrast, deep soil samples were collected at a density of 1 point per 4 km², and each analytical sample represented an area of 16 km². Within a 1 km² grid, three sampling plots—spaced at over 50 m apart—were selected along the sampling route for surface soil measurement. Soil columns from the surface down to a depth of 20 cm were continuously collected using specialized equipment. After air-drying, crushing, sieving, and further treatment, the samples were stored at 4 °C for chemical analysis.

To ensure the accuracy and reliability of the analytical data, quality control samples (parallel samples, blank samples, and spiked recovery samples) were collected during sampling at a rate of 5% of the sample quantity. Testing strictly followed quality control procedures, including comparative analysis against national first-level reference materials (such as the GSS series of soil standard samples) and consistency checks between sample batches. Across all test results, the recovery rate of spiked samples was between 90% and 110%, and the relative standard deviations of parallel samples were all less than 5%, indicating that the analysis had good repeatability and accuracy.
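The two acceptance criteria above can be expressed as a short check. This is a sketch under stated assumptions: the text does not give the exact formulas used by the laboratory, so the usual definitions are assumed here (recovery as the recovered fraction of the added spike, relative standard deviation from the sample standard deviation), and all numeric inputs are hypothetical values in mg/kg.

```python
from statistics import mean, stdev

def spike_recovery_percent(spiked_result, unspiked_result, amount_added):
    """Share of the added spike that the measurement recovers, in percent."""
    return 100.0 * (spiked_result - unspiked_result) / amount_added

def relative_std_percent(replicates):
    """Relative standard deviation of parallel samples, in percent (sample std assumed)."""
    return 100.0 * stdev(replicates) / mean(replicates)

def passes_qc(recovery, rsd, recovery_range=(90.0, 110.0), max_rsd=5.0):
    """Apply the thresholds quoted in the text: 90-110 % recovery and RSD below 5 %."""
    low, high = recovery_range
    return low <= recovery <= high and rsd < max_rsd

# Hypothetical Cr measurements, not data from the study.
recovery = spike_recovery_percent(spiked_result=118.0, unspiked_result=70.0, amount_added=50.0)
rsd = relative_std_percent([70.0, 72.0, 71.0])
print(recovery, rsd, passes_qc(recovery, rsd))
```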

According to the “Specifications for Multi-target Regional Geochemical Surveys (1:250,000)” of the Geological Survey Technical Standards of the China Geological Survey, X-ray fluorescence spectrometry (XRF) was employed to determine the Cr element [32]. In accordance with the “Specifications (DZ/T0258–2014),” the analytical methods incorporated in the supporting scheme of the established routine analysis procedures fully satisfy the project′s analytical and testing requirements. The minimum detection limits of these methods are equal to or exceed the relevant regulatory standards.

2.2.2. Feature Screening

Constructing a prediction model for the spatial distribution of soil heavy metals requires feature screening as a crucial step in analyzing the mechanisms underlying elemental spatial differentiation. This study employs a dynamic screening method based on feature importance to comprehensively evaluate multi-dimensional data, including geographical elements and soil physical and chemical properties, and to quantify the contribution of each feature to the prediction of Cr. Based on the screening results for heavy metal Cr and environmental covariates in the study area, 22 significant features were ultimately selected as input variables for the model. Table 1 presents nine representative features that encompass major driving factors, including geochemistry, topography, and climate.

Screening results indicate that the spatial distribution of Cr is controlled by multiple factors, with geochemical processes dominating. Phosphorus (P = 230) and iron oxides (TFe₂O₃ = 191) have the highest importance values and exhibit strong controlling effects. Among topographic factors, the importance values of the distance to water systems (RiverDista = 147) and elevation (DEM = 120) underscore the influence of topography on Cr migration pathways. Regarding environmental factors, rainfall (RainAvg = 112) and soil pH (pH = 81) jointly affect the chemical form of Cr. Dynamic feature screening thus reveals the multi-scale action mechanism of environmental covariates. The selected feature set serves as the input for the training and optimization of the subsequent machine learning models, thereby providing a scientific basis for accurate prediction.
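As a rough illustration of importance-based screening, the sketch below ranks the six importance values quoted above and keeps those above a cutoff. The cutoff of 100 and the helper `screen_features` are hypothetical; the text does not specify the selection rule used to arrive at the 22 retained features.

```python
# Importance values quoted in the text for six of the screened features.
importance = {
    "P": 230,
    "TFe2O3": 191,
    "RiverDista": 147,
    "DEM": 120,
    "RainAvg": 112,
    "pH": 81,
}

def screen_features(scores, min_importance):
    """Return feature names sorted by descending importance, dropping low scorers."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= min_importance]

# The threshold is an assumed value used only to show the mechanics.
selected = screen_features(importance, min_importance=100)
print(selected)
```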

2.2.3. Environmental Data

This study comprehensively utilized multi-source data. In addition to measured soil heavy metal concentrations and pH values, various environmental covariate datasets were collected. The longitude and latitude for each sampling point were obtained from a handheld GPS recorder. Geographical factor data were derived from the DEM of Anhui Province. The slope, aspect, terrain relief, and terrain curvature were computed using the raster calculator in ArcGIS 10.8, while the distances from sampling points to the nearest rivers and roads were determined using the nearest neighbor analysis tool. The soil data were obtained from the 1:1,000,000 soil dataset (grid raster format, WGS84 projection) provided by the Nanjing Institute of Soil Science for the Second National Land Survey [33]. Information on the proportions of sand, clay, and loam, as well as soil density, cation exchange capacity (CEC), and exchangeable sodium, hydrogen, potassium, magnesium, calcium, and aluminum ions were also acquired. These geographical and soil property data complement the selected key variables, thereby supporting the construction of an accurate and reliable prediction model for the spatial distribution of soil heavy metals. They enable the model to capture the complex relationships between soil heavy metal content and environmental factors, thereby improving prediction accuracy. The data sources are shown in Table 2.
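The distance covariates can be approximated outside a GIS as follows. This sketch is not the ArcGIS tool used in the study: it assumes rivers or roads are available as lists of (longitude, latitude) vertices, measures great-circle distance with the haversine formula, and only considers vertices rather than line segments. All coordinates are hypothetical points inside the study extent.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_distance_km(point, targets):
    """Distance from one sampling point to the closest target vertex."""
    lon, lat = point
    return min(haversine_km(lon, lat, t_lon, t_lat) for t_lon, t_lat in targets)

# Hypothetical sampling point and river vertices.
sample_point = (117.50, 30.20)
river_vertices = [(117.48, 30.25), (117.55, 30.21), (117.70, 30.10)]
river_dist = nearest_distance_km(sample_point, river_vertices)
print(round(river_dist, 2))
```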

2.3. PHMS-Transformer

Considering the limited sample size and complex data relationships inherent in predicting the spatial distribution of soil heavy metals, this study proposes a lightweight PHMS-Transformer model optimized from the conventional Transformer architecture. By reducing the number of parameters, simplifying the model structure, and incorporating an efficient pooling strategy, the model significantly enhances prediction performance on small-scale datasets while maintaining the ability of the Transformer to model complex nonlinear relationships.

The model comprises an input adaptation layer, a lightweight encoder, a feature aggregation module, and a prediction output layer. The input adaptation layer maps multi-dimensional environmental covariate data into a unified feature space via linear transformation, ensuring compatibility with various data formats. To mitigate overfitting during training with sparse samples, a shallow encoder design is adopted, which reduces both the number of encoder layers and the computational cost of the attention mechanism. The encoder employs a multi-head self-attention (MSA) mechanism to capture the nonlinear interactions among topography, soil physical and chemical properties, and climate factors from multiple perspectives. For instance, it can model the influence of topographic elevation on heavy metal migration paths or the synergistic effect between soil pH and organic matter content. To further improve computational efficiency, a configurable pooling strategy is introduced that compresses the feature sequence via mean or maximum aggregation, thereby reducing redundant calculations while preserving essential information. The output layer integrates normalization and dropout techniques to enhance numerical stability and generalization, ultimately achieving precise regression predictions of heavy metal concentrations through a fully connected network. By combining a lightweight structure with robust modeling of complex relationships, the PHMS-Transformer proves both efficient and interpretable under limited sample conditions, providing reliable support for predicting the spatial distribution of soil heavy metals.

Due to the limited sample size and the complex relationships involved in predicting the spatial distribution of soil heavy metals, this study proposes a lightweight model named PHMS-Transformer, which is an improved version of the traditional Transformer. By reducing the number of parameters, simplifying the architecture, and introducing an efficient pooling method, this model improves prediction accuracy on small datasets while still capturing complex nonlinear patterns.

The model has four main parts: an input adaptation layer, a lightweight encoder, a feature aggregation module, and a prediction output layer. The input adaptation layer converts various environmental factors—such as topography, soil properties, and climate data—into a common feature format using simple linear transformations. This allows the model to handle different types of input data. To reduce the risk of overfitting with sparse samples, the encoder uses a shallow design with fewer layers and simplified attention calculations. A multi-head attention mechanism is still used to explore complex relationships between input features. For example, the model can learn how elevation affects heavy metal movement or how soil pH and organic matter interact. Next, the feature aggregation module uses a configurable pooling strategy—either averaging or taking the maximum value—to compress the feature sequence. This helps reduce unnecessary computation while keeping key information. Finally, the output layer applies normalization and dropout to improve stability and generalization. A fully connected network then predicts the concentration of heavy metals.

By combining a simplified structure with strong modeling capabilities, the PHMS-Transformer achieves accurate and reliable predictions under limited data conditions. It offers a practical and interpretable solution for mapping soil heavy metal distribution. The structure of the PHMS-Transformer model is shown in Figure 4.
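The two core operations of the encoder and aggregation module, multi-head self-attention followed by configurable pooling, can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the learned query, key, and value projections are replaced by identity maps, layer normalization, dropout, and the feed-forward sublayer are omitted, and treating each embedded covariate as one token is an assumption about how the input sequence is formed.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (assumed)."""
    dim = len(tokens[0])
    output = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        output.append([sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)])
    return output

def multi_head_self_attention(tokens, num_heads):
    """Split the feature dimension into heads, attend per head, then concatenate."""
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = slice(h * head_dim, (h + 1) * head_dim)
        heads.append(self_attention([tok[part] for tok in tokens]))
    return [sum((head[i] for head in heads), []) for i in range(len(tokens))]

def pool(tokens, strategy="mean"):
    """Compress a token sequence into one vector by mean or max aggregation."""
    columns = list(zip(*tokens))
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    if strategy == "max":
        return [max(col) for col in columns]
    raise ValueError("strategy must be 'mean' or 'max'")

# Three hypothetical covariate tokens with a 4-dimensional embedding.
tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5], [1.0, 1.0, 0.0, 1.0]]
attended = multi_head_self_attention(tokens, num_heads=2)
pooled_mean = pool(attended, "mean")
pooled_max = pool(attended, "max")
print(pooled_mean, pooled_max)
```

Pooling reduces the sequence to a single fixed-length vector regardless of how many tokens enter the encoder, which is what allows a small fully connected head to produce the final regression output.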

2.4. SHapley Additive exPlanations (SHAP)

SHAP is an interpretable artificial intelligence method grounded in game theory that evaluates the importance of (possibly multi-collinear) variables, quantitatively demonstrating the contribution of each feature in a prediction model [34]. It assesses the impact of each input feature on the prediction outcome by computing its Shapley value, an attribution that satisfies both local accuracy and consistency [35]. In predicting the spatial distribution of soil heavy metals from environmental covariates, the Shapley value of each feature for a specific prediction sample is calculated using the following formula:

\varphi_{i} = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! (n - |S| - 1)!}{n!} [f(S \cup \{i\}) - f(S)]

(1)

where \varphi_{i} represents the Shapley value of the i-th feature, S is a subset of features that does not contain feature i, N is the set of all features, n = |N| is the total number of features, and f(S) denotes the predicted value of the model for the subset S.
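Formula (1) can be evaluated exactly for a small number of features by enumerating every subset. The sketch below does this for a hypothetical three-feature value function; `toy_value` and its coefficients are invented for illustration and stand in for the model prediction f(S). Practical SHAP implementations rely on approximations because the number of subsets grows exponentially with n.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all subsets S of N without feature i."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):  # |S| runs from 0 to n - 1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

def toy_value(subset):
    """Hypothetical f(S): two additive effects plus one pH-DEM interaction."""
    value = 0.0
    if "pH" in subset:
        value += 2.0
    if "DEM" in subset:
        value += 1.0
    if "pH" in subset and "DEM" in subset:
        value += 1.0
    return value

phi = shapley_values(["pH", "DEM", "RainAvg"], toy_value)
print(phi)
```

In this toy case the interaction term is split equally between the two interacting features, the unused feature receives zero, and the values sum to f(N) - f(∅), which is the local accuracy property mentioned above.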

After predicting the spatial distribution of soil heavy metals using machine learning and deep learning models, the prediction results, original environmental covariate data, and spatial location information are input into the SHAP model [36]. The SHAP model calculates the Shapley value for each feature in every sample. A Shapley value with a large absolute magnitude indicates that the feature substantially influences the model prediction, whereas a value near zero indicates a minimal impact. Based on the calculated Shapley values, all features are ranked according to their importance, thereby clarifying the key factors affecting soil heavy metal content and distinguishing them from secondary factors.
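The ranking step can be sketched as follows. The text does not state how per-sample Shapley values are aggregated into a global ranking, so the common convention of averaging absolute values over samples is assumed here; the feature names and numbers are hypothetical.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute Shapley value across samples."""
    n_samples = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)

# Hypothetical Shapley values: one row per sample, one column per feature.
feature_names = ["pH", "DEM", "RainAvg"]
shap_rows = [
    [2.0, -1.0, 0.1],
    [-3.0, 0.5, 0.0],
    [1.0, -1.5, -0.2],
]
ranking = rank_by_mean_abs_shap(shap_rows, feature_names)
print(ranking)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions is still ranked as important.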

3. Results and Discussion

3.1. Evaluation Metrics

To evaluate the accuracy and generalization ability of the model, three performance indicators were employed: mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination R² (Formulas (2)–(4)). MAE and RMSE quantify the error between the predicted and actual values, with smaller values indicating higher prediction accuracy [37]. R² reflects the goodness of fit between the predicted and observed values; it typically lies between 0 and 1, with higher values signifying better model performance, although it can become negative when a model predicts worse than the mean of the observations. These metrics are standard for evaluating the predictive performance of regression models (such as in the spatial distribution prediction of soil properties) and have been widely applied and discussed in environmental science, geostatistics, and machine learning [38].

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - \hat{y}_{i}|

(2)

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}

(3)

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(4)

In these formulas, n represents the number of sample points, y_{i} denotes the true measured value of the i-th sample, \hat{y}_{i} corresponds to the predicted value generated by the model for the i-th sample, and \bar{y} represents the average of the measured values.
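Formulas (2)–(4) translate directly into code. The following sketch uses only the Python standard library; the measured and predicted Cr values are hypothetical numbers chosen so that the results are easy to check by hand.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, Formula (2)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, Formula (3)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination, Formula (4)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and predicted Cr concentrations (mg/kg).
y_true = [60.0, 70.0, 80.0, 90.0]
y_pred = [62.0, 69.0, 77.0, 94.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

For these values the absolute errors are 2, 1, 3, and 4, giving MAE = 2.5, RMSE = sqrt(7.5), and R² = 1 - 30/500 = 0.94. RMSE exceeds MAE because squaring weights the larger errors more heavily.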

3.2. Performance Evaluation of Different Models

Accordingly, several machine learning models, namely adaptive boosting algorithm (AdaBoost) [11], gradient boosting decision tree (GBDT) [12], extreme gradient boosting (XGBoost) [13], multi-layer perceptron (MLP) [15], Transformer [17], and PHMS-Transformer, were selected to perform a comparative analysis of Cr using identical environmental variables. The accuracies of these models were evaluated using three metrics: mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²), as shown in Table 3.

The results indicate significant performance differences among the models when predicting Cr content. The AdaBoost model exhibits the weakest performance, with an R² value of 0.5282, MAE of 11.7820, and RMSE of 16.0589, indicating a limited ability to capture the complex nonlinear relationships between Cr and environmental variables. Traditional ensemble models, such as GBDT and XGBoost, display moderate predictive capabilities with R² values of 0.6788 and 0.6834, respectively; notably, XGBoost yields a slightly lower MAE (6.4208) than GBDT (6.6602). Although the MLP model attains an R² of 0.6995 owing to its nonlinear fitting ability, its RMSE (12.8167) remains higher than that of the PHMS-Transformer. The Transformer model benefits from the self-attention mechanism (R² = 0.6818) but tends to overfit because of its large number of parameters and the limited training data, resulting in an RMSE of 13.1883 and reduced prediction stability.

In contrast, the PHMS-Transformer model demonstrates the best performance in predicting Cr content. Its R² value reaches 0.7182, a relative improvement of about 2.7% over the best-performing comparative model (MLP, R² = 0.6995), while both MAE (6.0891) and RMSE (12.4098) are also reduced. By incorporating a shallow encoder, a configurable pooling strategy, and a lightweight structure, the model effectively addresses the overfitting issue under sparse sample conditions. Furthermore, the multi-head self-attention mechanism enables accurate capture of the complex nonlinear interactions among multiple environmental variables (e.g., the synergistic effect between topographic elevation and iron oxides). As illustrated in Figure 5, the predicted values of the PHMS-Transformer are highly consistent with the measured values. The regression line (dashed line) lies closer to the 1:1 line (solid line) than it does for the other models, thereby verifying the model's efficiency and robustness under limited sample conditions.
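The size of the R² gain can be checked from the values reported in the text. The short calculation below finds the strongest baseline and expresses the gain both in absolute terms and relative to that baseline; only the variable names are introduced here.

```python
# R² values reported in the text for each model.
r2_scores = {
    "AdaBoost": 0.5282,
    "GBDT": 0.6788,
    "XGBoost": 0.6834,
    "MLP": 0.6995,
    "Transformer": 0.6818,
    "PHMS-Transformer": 0.7182,
}

proposed = "PHMS-Transformer"
baselines = {name: score for name, score in r2_scores.items() if name != proposed}

# Strongest baseline by R².
best_baseline = max(baselines, key=baselines.get)
absolute_gain = r2_scores[proposed] - baselines[best_baseline]
relative_gain_percent = 100.0 * absolute_gain / baselines[best_baseline]

print(f"{best_baseline}: +{absolute_gain:.4f} R2 ({relative_gain_percent:.1f}% relative)")
```

The absolute gain is 0.0187 in R², which corresponds to roughly 2.7% of the MLP score, so the improvement is a relative figure rather than 2.7 percentage points of explained variance.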

3.2.1. Comparative Analysis of Different Models for Predicting the Spatial Distribution of Heavy Metals

In this study, the AdaBoost, GBDT, XGBoost, MLP, Transformer, and PHMS-Transformer models were used to generate spatial distribution maps of Cr content (Figure 6). Although the prediction trends of all six models are broadly consistent and reflect the spatial differentiation pattern of Cr in the study area, the maps differ markedly in accuracy and spatial detail. These differences arise primarily from how well each model structure captures the nonlinear relationships among complex environmental covariates and adapts to the data.

Traditional machine learning models such as AdaBoost, GBDT, and XGBoost are decision tree ensembles that learn nonlinear relationships by iteratively correcting the errors of earlier learners [39,40]. However, for the interactions between Cr content and factors such as topography, climate, and human activities, the limited ability of decision trees to represent complex variable correlations hinders a comprehensive analysis of deeper inter-variable relationships [41,42]. The MLP model learns nonlinear features through the weights of its multi-layer neurons [43], but its large number of parameters makes it susceptible to overfitting (RMSE = 12.82) when high-dimensional environmental covariates are involved. With sparse sample data in particular, noise significantly weakens the generalization ability of the model and reduces prediction stability [44]. The Transformer model uses an attention mechanism to capture long-distance dependencies and excels at processing complex relationships in parallel [45,46], but its very large parameter count makes it prone to fitting noise under limited sample conditions (RMSE = 13.19) [47].

In response to these challenges, the PHMS-Transformer model introduces several innovations. It adopts a shallow encoder design to reduce the number of parameters, which mitigates the risk of overfitting associated with limited sample sizes. It retains the multi-head attention mechanism of the Transformer to capture the nonlinear correlations among environmental covariates (such as topography, soil physical and chemical properties, and climate factors) from multiple perspectives, enabling accurate delineation of the spatial variation in soil heavy metal content. A configurable pooling strategy compresses the feature sequence through mean or maximum aggregation, reducing redundant computation while preserving key information and improving computational efficiency. The lightweight architecture allows the model to focus on the critical patterns and relationships within the data, which supports its adaptability in predicting soil heavy metal distribution across different regions.
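
The pooling step can be illustrated with a short sketch. It assumes that the encoder output is a sequence of token vectors and that adjacent tokens are aggregated in non-overlapping windows; the window size of 2 and the function name `pool_sequence` are assumptions made for illustration, since the text specifies only that mean or maximum aggregation is used.

```python
def pool_sequence(tokens, mode="mean", window=2):
    """Compress a sequence of equal-length vectors over non-overlapping windows.

    tokens: list of feature vectors (lists of floats), one per sequence position.
    mode:   "mean" keeps the average signal, "max" keeps the strongest signal.
    window: number of adjacent positions merged into one (assumed value: 2).
    """
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        columns = zip(*group)  # iterate over each feature dimension
        if mode == "mean":
            pooled.append([sum(col) / len(group) for col in columns])
        else:
            pooled.append([max(col) for col in columns])
    return pooled

# Four toy token vectors of dimension 2 (hypothetical values)
sequence = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0], [7.0, 6.0]]
mean_pooled = pool_sequence(sequence, mode="mean")  # [[2.0, 3.0], [6.0, 3.0]]
max_pooled = pool_sequence(sequence, mode="max")    # [[3.0, 4.0], [7.0, 6.0]]
```

With a window of 2, either mode halves the sequence length; mean pooling smooths each pair of tokens, whereas max pooling keeps the larger activation in every dimension.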

3.2.2. Effectiveness of Model Improvement

The structural optimization of the PHMS-Transformer enhances both the accuracy and the efficiency of soil heavy metal prediction. To address the tendency of the traditional Transformer model to overfit under sparse sample conditions, a shallow encoder with two layers was implemented. This design retains the ability of the four-head self-attention mechanism to capture the nonlinear relationships among multi-source environmental variables and to analyze their independent contributions as well as their synergistic effects. To further improve computational efficiency, the PHMS-Transformer incorporates a dynamic pooling strategy: global statistical features are preserved via mean pooling, while local key signals are accentuated through max pooling. This approach reduces the feature sequence length by 50%, increases training speed by 40%, and preserves key information, thereby alleviating memory pressure during long-sequence processing. At the output layer, layer normalization combined with dropout ensures a consistent feature distribution and reduces neuron co-adaptation, which increases the model's reliance on core factors by 20%.
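
A rough PyTorch sketch of an encoder with these settings is given below. Only the two encoder layers, the four attention heads, the mean/max pooling that halves the sequence, and the layer normalization with dropout before the output come from the text. The class name `ShallowTransformerRegressor`, the per-covariate token embedding, the embedding and feed-forward widths, the dropout rate, and the final averaging over the remaining tokens are assumptions for illustration and do not reproduce the authors' implementation.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ShallowTransformerRegressor(nn.Module):
    """Illustrative shallow Transformer for tabular regression (not the authors' code)."""

    def __init__(self, n_features, d_model=32, n_heads=4, n_layers=2,
                 pooling="mean", dropout=0.1):
        super().__init__()
        if pooling not in ("mean", "max"):
            raise ValueError("pooling must be 'mean' or 'max'")
        self.pooling = pooling
        # Assumption: each scalar covariate becomes one token of width d_model,
        # plus a learnable embedding that tells the covariates apart.
        self.value_embed = nn.Linear(1, d_model)
        self.feature_embed = nn.Parameter(torch.randn(n_features, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=64,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, n_features) -> tokens: (batch, n_features, d_model)
        tokens = self.value_embed(x.unsqueeze(-1)) + self.feature_embed
        encoded = self.encoder(tokens)
        seq = encoded.transpose(1, 2)  # (batch, d_model, n_features)
        # Windowed pooling with kernel 2 halves the sequence length.
        if self.pooling == "mean":
            seq = F.avg_pool1d(seq, kernel_size=2)
        else:
            seq = F.max_pool1d(seq, kernel_size=2)
        pooled = seq.mean(dim=-1)  # assumption: average the remaining tokens
        out = self.head(self.drop(self.norm(pooled)))
        return out.squeeze(-1)  # (batch,) predicted Cr content

model = ShallowTransformerRegressor(n_features=20, pooling="max")
model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    preds = model(torch.randn(8, 20))  # 8 samples, 20 hypothetical covariates
```

Keeping the encoder at two layers with a narrow embedding keeps the parameter count small, which is the property the text relies on to limit overfitting when samples are sparse.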

Experimental results indicate that the R² value for Cr prediction reaches 0.7182, a 5.3% improvement compared to the standard Transformer model, while the MAE (6.0891) and RMSE (12.4098) decrease by 13.0% and 5.9%, respectively. The lightweight architecture effectively balances computational efficiency with the ability to model complex relationships under limited sample conditions. These improvements validate the technical advantages of the PHMS-Transformer in achieving high precision and strong generalization with sparse samples, thereby providing a reliable solution for predicting the spatial distribution of Cr.
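
The percentages quoted here are relative changes, not percentage-point differences. As a check, the short sketch below recomputes them from the Table 3 scores; the helper names `relative_change` and `compare` are illustrative.

```python
# Scores copied from Table 3 (Cr element)
table3 = {
    "MLP":              {"R2": 0.699467, "MAE": 6.561879, "RMSE": 12.81668},
    "Transformer":      {"R2": 0.681787, "MAE": 7.001963, "RMSE": 13.18828},
    "PHMS-Transformer": {"R2": 0.718246, "MAE": 6.089136, "RMSE": 12.4098},
}

def relative_change(new, old):
    """Signed relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0

def compare(baseline, proposed="PHMS-Transformer"):
    """Relative R2 gain and relative MAE/RMSE reduction versus a baseline model."""
    base, prop = table3[baseline], table3[proposed]
    return {
        "R2 gain (%)": round(relative_change(prop["R2"], base["R2"]), 1),
        "MAE drop (%)": round(-relative_change(prop["MAE"], base["MAE"]), 1),
        "RMSE drop (%)": round(-relative_change(prop["RMSE"], base["RMSE"]), 1),
    }

vs_transformer = compare("Transformer")  # R2 gain 5.3, MAE drop 13.0, RMSE drop 5.9
vs_mlp = compare("MLP")                  # R2 gain 2.7, MAE drop 7.2, RMSE drop 3.2
```

The comparison against the standard Transformer reproduces the 5.3%, 13.0%, and 5.9% figures, and the comparison against the MLP reproduces the 2.7%, 7.2%, and 3.2% figures reported elsewhere in the paper.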

3.3. Interpretation of the Model Prediction Results

To analyze the prediction mechanism of the PHMS-Transformer model, this study employed the SHAP method to quantify the contribution of environmental variables to Cr content prediction and to reveal the interaction patterns between Cr spatial distribution and multiple environmental factors.

Figure 7 shows that the SHAP contributions of the environmental variables differ substantially. TFe₂O₃ has the highest SHAP contribution (6.8), which identifies it as a core factor in Cr prediction and suggests that iron oxides strongly influence Cr migration and enrichment through mechanisms such as adsorption and coprecipitation [48,49]. SiO₂ and MgO follow, with contributions of 3.8 and 2.9, respectively, reflecting the effect of soil mineral composition on Cr occurrence: silicate and magnesian minerals modify the surface properties of soil particles and thereby alter the adsorption–desorption equilibrium of Cr [50]. Variables such as B and K₂O also contribute to the regulation of Cr distribution.
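
A global SHAP contribution of this kind is commonly summarized as the mean absolute SHAP value of a feature across all samples. The sketch below assumes that convention, which the text does not state explicitly; the function name and the small SHAP matrix are invented for illustration and are not results from this study.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by their mean absolute SHAP value across samples."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical SHAP matrix: 3 samples x 3 features (values invented)
features = ["TFe2O3", "SiO2", "MgO"]
shap_rows = [
    [6.0, -3.0, 2.0],
    [-8.0, 4.0, -3.0],
    [7.0, -5.0, 4.0],
]
ranking = rank_by_mean_abs_shap(shap_rows, features)
# [('TFe2O3', 7.0), ('SiO2', 4.0), ('MgO', 3.0)]
```

Taking absolute values before averaging means that a feature pushing predictions strongly in both directions still ranks high, so the ranking measures the magnitude of influence rather than its sign.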

Figure 7 further relates the values of each variable to its SHAP values. Higher TFe₂O₃ values correspond to larger positive SHAP values, indicating that higher iron oxide content raises the predicted Cr content; this is consistent with the role of iron oxides in the adsorption and fixation of Cr(III) and in the reduction and transformation of Cr(VI) [51]. DEM (topographic elevation) values are negatively correlated with SHAP values: high elevations correspond to negative SHAP values, suggesting that strong soil erosion in high-altitude areas restricts Cr accumulation and lowers the predicted values [52]. pH affects Cr mobility by changing its chemical form; under acidic conditions, increased Cr solubility has a positive effect on the predicted values [53]. For RiverDista (distance from the river), proximity to the river corresponds to higher positive SHAP values, indicating that hydraulic transport near rivers promotes Cr enrichment [54].
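
One simple way to summarize the direction of such a relationship is the correlation between a feature's values and its SHAP values. The sketch below uses a plain Pearson correlation on invented numbers; it is an illustrative reading aid, not the analysis performed in the study, and the variable names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical feature values and their SHAP values for four samples
dem_values = [50.0, 120.0, 300.0, 650.0]   # elevation
dem_shap = [1.5, 0.8, -0.4, -2.1]          # higher elevation -> lower prediction
tfe_values = [3.0, 4.5, 6.0, 7.5]          # iron oxide content
tfe_shap = [-2.0, -0.5, 1.0, 2.5]          # higher content -> higher prediction

dem_direction = pearson(dem_values, dem_shap)  # negative
tfe_direction = pearson(tfe_values, tfe_shap)  # close to +1
```

A negative coefficient corresponds to the pattern described for DEM and a positive one to the pattern described for TFe₂O₃; a linear correlation cannot represent threshold or range-dependent effects such as those described for pH, which are better read from the SHAP plot itself.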

In summary, the SHAP interpretability analysis demonstrates that the PHMS-Transformer model can predict Cr content with high precision, elucidates the mechanisms by which environmental variables influence Cr spatial distribution, and provides a scientific basis for understanding Cr migration and transformation in soil as well as for pollution prevention and control.

4. Conclusions

In this study, a lightweight PHMS-Transformer model was developed to predict the spatial distribution of Cr in soil, and the mechanisms by which environmental variables influence Cr behavior were examined using the SHAP method. The principal conclusions are as follows:

(1): Regarding prediction accuracy, the PHMS-Transformer model exhibited excellent performance. Its R² value reached 0.7182, representing a 2.7% improvement over the second-best MLP model. The MAE (6.0891) and RMSE (12.4098) decreased by 7.2% and 3.2%, respectively. Furthermore, the use of a shallow encoder and dynamic pooling strategy accelerated training by 40% while reducing overfitting risks (R² fluctuation < 0.5%), thereby offering an efficient solution for heavy metal prediction under sparse sample conditions.
(2): SHAP interpretability analysis indicated that TFe₂O₃ predominantly governs the spatial differentiation of Cr through adsorption and redox reactions. Additionally, topographic elevation (DEM) and river distance (RiverDista) modulate Cr migration via erosion inhibition and hydraulic transport, respectively, while pH influences Cr bioavailability by altering its chemical form. These results are consistent with geochemical theory, thereby verifying the scientific validity of the model interpretation.

The findings provide methodological support for elucidating the migration and transformation of Cr in soil and offer a scientific basis for the precise prevention and control of soil Cr pollution and for environmental management. Although the PHMS-Transformer model performs well in predicting the spatial distribution of soil Cr, several limitations remain. First, the model is currently applicable only to Cr: its parameter optimization and feature selection are based on the environmental geochemical behavior of Cr (e.g., adsorption onto iron oxides and the influence of topography on its migration). Other heavy metals, such as cadmium (Cd), lead (Pb), and arsenic (As), differ from Cr in their pollution sources and in their migration and transformation mechanisms, so applying the model to them directly may introduce bias, and its applicability requires further validation. Second, the generalization capability and robustness of the model across diverse geographical environments require more in-depth evaluation. Third, soil Cr content and its spatial distribution evolve dynamically under long-term anthropogenic activities and natural processes (e.g., rainfall leaching and seasonal wet–dry cycles), which the present model does not represent.

Future research should therefore further optimize the multi-scale feature fusion capability of the model and conduct targeted validation against the environmental characteristics of other heavy metals (e.g., Cd and Pb), whose prediction requires independent model tuning and validation based on their distinct environmental behaviors. Exploring cross-regional transfer learning frameworks could improve the generalization of the model for Cr and other heavy metals. Finally, integrating time-series data (e.g., multi-temporal remote sensing imagery and meteorological monitoring data) to build a spatiotemporally coupled prediction model for Cr would improve prediction timeliness, reflect pollution evolution trends, and provide more comprehensive technical support for the precise prevention, control, and sustainable management of soil Cr pollution.

In summary, the PHMS-Transformer framework proposed in this study currently focuses on predicting the spatial distribution of chromium (Cr) in soil, and its effectiveness has been validated empirically with measured Cr data. Predicting other heavy metals will require model parameter tuning and validation of applicability that account for element-specific environmental factors (e.g., the bioaccumulation characteristics of cadmium (Cd) and the traffic-related pollution characteristics of lead (Pb)). This is an important direction for future extension of the work.

Author Contributions

X.L. was responsible for conceptualization and writing (original draft preparation); W.L. for methodology; J.H. for formal analysis; Y.Z. for data curation; and X.K. for writing (review and editing). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (NSFC) Joint Fund for Earth Sciences (U2444217); the China Geological Survey Project (DD20242543); the National Key Research and Development Project (2022YFC3701303); the Basic Scientific Research Operating Expenses of the Chinese Academy of Geological Sciences (SK202409); and the Project of Supply-Demand Matching Employment Education and Training by the Ministry of Education: Research on Talent Distribution and Recruitment Efficiency Enhancement Based on GIS (2024090557577).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Owing to confidentiality restrictions, the data presented in this study are not publicly available but can be obtained from the corresponding author on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zheng, X.; Lin, H.; Du, D.; Li, G.; Alam, O.; Cheng, Z.; Liu, X.; Jiang, S.; Li, J. Remediation of heavy metals polluted soil environment: A critical review on biological approaches. Ecotoxicol. Environ. Saf. 2024, 284, 116883.
2. Qiu, L.; Wang, K.; Long, W.; Wang, K.; Hu, W.; Amable, G.S. A Comparative Assessment of the Influences of Human Impacts on Soil Cd Concentrations Based on Stepwise Linear Regression, Classification and Regression Tree, and Random Forest Models. PLoS ONE 2016, 11, e0151131.
3. Jia, X.; O’Connor, D.; Shi, Z.; Hou, D. VIRS based detection in combination with machine learning for mapping soil pollution. Environ. Pollut. 2020, 268, 115845.
4. Violante, A.; Krishnamurti, G.S.R.; Pigna, M. Factors affecting the sorption-desorption of trace elements in soil environments. Biophys.-Chem. Process. Heavy Met. Met. Soil Environ. 2008, 169–214.
5. Buerge, I.J.; Hug, S.J. Influence of mineral surfaces on chromium (VI) reduction by iron (II). Environ. Sci. Technol. 1999, 33, 4285–4291.
6. Lu, J.; Lu, H.; Brusseau, M.L.; He, L.; Gorlier, A.; Yao, T.; Tian, P.; Feng, S.; Yu, Q.; Nie, Q.; et al. Interaction of climate change, potentially toxic elements (PTEs), and topography on plant diversity and ecosystem functions in a high-altitude mountainous region of the Tibetan Plateau. Chemosphere 2021, 275, 130099.
7. Kowalska, J.B.; Mazurek, R.; Gąsiorek, M. Pollution indices as useful tools for the comprehensive evaluation of the degree of soil contamination—A review. Environ. Geochem. Health 2018, 40, 2395–2420.
8. Guan, Y.; Shao, C.; Ju, M. Heavy Metal Contamination Assessment and Partition for Industrial and Mining Gathering Areas. Int. J. Environ. Res. Public Health 2014, 11, 7286–7303.
9. Zha, Y.; Yang, Y. Innovative graph neural network approach for predicting soil heavy metal pollution in the Pearl River Basin, China. Sci. Rep. 2024, 14, 16505.
10. Luo, N. Methods for controlling heavy metals in environmental soils based on artificial neural networks. Sci. Rep. 2024, 14, 2563.
11. Santos-Francés, F.; Martínez-Graña, A.; Ávila Zarza, C.; Sánchez, A.G.; Rojo, P.A. Spatial Distribution of Heavy Metals and the Environmental Quality of Soil in the Northern Plateau of Spain by Geostatistical Methods. Int. J. Environ. Res. Public Health 2017, 14, 568.
12. Taghizadeh-Mehrjardi, R.; Fathizad, H.; Ardakani, M.A.H.; Sodaiezadeh, H.; Kerry, R.; Heung, B.; Scholten, T. Spatio-Temporal Analysis of Heavy Metals in Arid Soils at the Catchment Scale Using Digital Soil Assessment and a Random Forest Model. Remote Sens. 2021, 13, 1698.
13. Tang, S.; Wang, C.; Song, J.; Ihenetu, S.C.; Li, G. Advances in Studies on Heavy Metals in Urban Soil: A Bibliometric Analysis. Sustainability 2024, 16, 860.
14. Wang, A.P.; Tian, A.H.; Fu, C.B. LMetal-ResNet: A Lightweight Convolutional Neural Network Model for Soil Arsenic Concentration Estimation. Sens. Mater. 2024, 36, 5007–5017.
15. Wang, X.; An, S.; Xu, Y.; Hou, H.; Chen, F.; Yang, Y.; Zhang, S.; Liu, R. A Back Propagation Neural Network Model Optimized by Mind Evolutionary Algorithm for Estimating Cd, Cr, and Pb Concentrations in Soils Using Vis-NIR Diffuse Reflectance Spectroscopy. Appl. Sci. 2019, 10, 51.
16. Liu, C.; Chen, L.; Ni, G.; Yuan, X.; He, S.; Miao, S. Prediction of heavy metal spatial distribution in soils of typical industrial zones utilizing 3D convolutional neural networks. Sci. Rep. 2025, 15, 396.
17. He, P.; Li, Y.; Huo, T.; Meng, F.; Peng, C.; Bai, M. Priority planting area planning for cash crops under heavy metal pollution and climate change: A case study of Ligusticum chuanxiong Hort. Front. Plant Sci. 2023, 14, 1080881.
18. Yang, Y.; Cui, Q.; Cheng, R.; Huo, A.; Wang, Y. Retrieval of Soil Heavy Metal Content for Environment Monitoring in Mining Area via Transfer Learning. Sustainability 2023, 15, 11765.
19. Zhang, Y.; Li, X.; Chen, S. Cost-effectiveness analysis of soil sampling strategies for heavy metal monitoring in China. Environ. Monit. Assess. 2019, 191, 742.
20. Li, T.; Wang, H.; Zhou, M. Spatial interpolation uncertainty under sparse sampling: A case study of lead contamination prediction. Geoderma 2020, 378, 114582.
21. Chen, L.; Liu, X.; Wu, Z. Trade-offs between sampling frequency and accuracy in national soil pollution surveys: Evidence from China’s soil quality monitoring network. Environ. Sci. Policy 2017, 78, 12–20.
22. Zhao, Z.-D.; Zhao, M.-S.; Lu, H.-L.; Wang, S.-H.; Lu, Y.-Y. Digital Mapping of Soil pH Based on Machine Learning Combined with Feature Selection Methods in East China. Sustainability 2023, 15, 12874.
23. Wu, J.; Gao, W.; Zheng, Z.; Zhao, D.; Zeng, Y. Study of Human Activity Intensity from 2015 to 2020 Based on Remote Sensing in Anhui Province, China. Remote Sens. 2023, 15, 2029.
24. Ye, N.; Fok, T.Y.; Chong, O. Modeling an energy consumption system with partial-value data associations. Adv. Sci. Technol. Eng. Syst. 2018, 3, 372–379.
25. Yu, H.; Xie, S.; Liu, P.; Hua, Z.; Song, C.; Jing, P. Estimation of Pb and Cd Content in Soil Using Sentinel-2A Multispectral Images Based on Ensemble Learning. Remote Sens. 2023, 15, 2299.
26. Liu, Y.; Shen, W.; Fan, K.; Pei, W.; Liu, S. Spatial Distribution, Source Analysis, and Health Risk Assessment of Heavy Metals in the Farmland of Tangwang Village, Huainan City, China. Agronomy 2024, 14, 394.
27. Wang, Z.; Zhang, W.; He, Y. Soil Heavy-Metal Pollution Prediction Methods Based on Two Improved Neural Network Models. Appl. Sci. 2023, 13, 11647.
28. Molla, A.; Zhang, W.; Zuo, S.; Ren, Y.; Han, J. A machine learning and geostatistical hybrid method to improve spatial prediction accuracy of soil potentially toxic elements. Stoch. Environ. Res. Risk Assess. 2022, 37, 681–696.
29. Sidhu, G.P.S. Heavy Metal Toxicity in Soils: Sources, Remediation Technologies and Challenges. Adv. Plants Agric. Res. 2016, 5, 00166.
30. Zhang, J.; Yao, D. Comparative Analysis of Soil Heavy Metal Pollution on Different Roads: A Case Study in a Typical Industrial City of China. Appl. Ecol. Environ. Res. 2019, 17, 15219–15232.
31. Kougir Chegini, Z.; Sheykhi, N.; Navabian, M.; Vazifeh Doost, M.; Ojani, M.; Szabó, S. Assessment of the accuracy of salinity simulation using heavy metal and nitrogen cycle in SWAT model in an area exposed to intensive agriculture, Navrood basin, Iran. In Proceedings of the EGU General Assembly 2024, Vienna, Austria, 14–19 April 2024. EGU24-4691.
32. Fedeli, R.; Di Lella, L.A.; Loppi, S. Suitability of XRF for Routine Analysis of Multi-Elemental Composition: A Multi-Standard Verification. Methods Protoc. 2024, 7, 53.
33. Sun, W.; Liu, C.-G.; Wang, S.-N. Simulation research of urban development boundary based on ecological constraints: A case study of Nanjing. J. Nat. Resour. 2021, 36, 2913–2925.
34. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4768–4777.
35. Lundberg, S. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874.
36. Aas, K.; Jullum, M.; Løland, A. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artif. Intell. 2021, 298, 103502.
37. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
38. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623.
39. Ganaie, M.A.; Hu, M.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151.
40. Shaikh, T.A.; Rasool, T.; Verma, P.; Mir, W.A. A fundamental overview of ensemble deep learning models and applications: Systematic literature and state of the art. Ann. Oper. Res. 2024, 1–77.
41. Shao, F.; Li, K.; Ouyang, D.; Zhou, J.; Luo, Y.; Zhang, H. Sources apportionments of heavy metal(loid)s in the farmland soils close to industrial parks: Integrated application of positive matrix factorization (PMF) and cadmium isotopic fractionation. Sci. Total Environ. 2024, 924, 171598.
42. Singh, P.; Ashuri, B.; Amekudzi-Kennedy, A. Application of dynamic adaptive planning and risk-adjusted decision trees to capture the value of flexibility in resilience and transportation planning. Transp. Res. Rec. 2020, 2674, 298–310.
43. Naskath, J.; Sivakamasundari, G.; Begum, A.A.S. A study on different deep learning algorithms used in deep neural nets: MLP SOM and DBN. Wirel. Pers. Commun. 2023, 128, 2913–2936.
44. Rynkiewicz, J. General bound of overfitting for MLP regression models. Neurocomputing 2012, 90, 106–110.
45. Choi, S.R.; Lee, M. Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review. Biology 2023, 12, 1033.
46. Luo, Q.; Zeng, W.; Chen, M.; Peng, G.; Yuan, X.; Yin, Q. Self-Attention and Transformers: Driving the Evolution of Large Language Models. In Proceedings of the 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT), Qingdao, China, 21–24 July 2023; IEEE: New York, NY, USA, 2023; pp. 401–405.
47. Yang, B. Design Automation with Efficient Compilation on Hardware Accelerators. Ph.D. Thesis, The Chinese University of Hong Kong, Hong Kong, China, 2024.
48. Deng, N.; Li, Z.; Zuo, X.; Chen, J.; Shakiba, S.; Louie, S.M.; Rixey, W.G.; Hu, Y. Coprecipitation of Fe/Cr hydroxides with organics: Roles of organic properties in composition and stability of the coprecipitates. Environ. Sci. Technol. 2021, 55, 4638–4647.
49. Zhu, S.; Mo, Y.; Xing, J.; Luo, W.; Jin, C.; Qiu, R. Colloidal stabilities and deposition behaviors of chromium (hydr)oxides in the presence of dissolved organic matters: Role of coprecipitation and adsorption. Environ. Sci. Nano 2022, 9, 2207–2219.
50. Li, D.; Li, G.; He, Y.; Zhao, Y.; Miao, Q.; Zhang, H.; Yuan, Y.; Zhang, D. Key Cr species controlling Cr stability in contaminated soils before and chemical stabilization at a remediation engineering site. J. Hazard. Mater. 2022, 424, 127532.
51. Hu, Y.; Liu, T.; Chen, N.; Feng, C. Iron oxide minerals promote simultaneous bio-reduction of Cr (VI) and nitrate: Implications for understanding natural attenuation. Sci. Total Environ. 2021, 786, 147396.
52. Han, S.; Wang, B.; Yao, Z.; Dai, L.; Wei, Y.; Niu, Y.; Qian, L. Heavy metals impact environmental capacity of oasis soils in Qinghai-Tibet Plateau dry zone. Sci. Rep. 2025, 15, 2176.
53. Liang, J.; Huang, X.; Yan, J.; Li, Y.; Zhao, Z.; Liu, Y.; Ye, J.; Wei, Y. A review of the formation of Cr (VI) via Cr (III) oxidation in soils and groundwater. Sci. Total Environ. 2021, 774, 145762.
54. Kumar, S. Heavy metal pollution and health risk assessment in upland and riparian soils of the Ganga River basin. Discov. Soil 2025, 2, 1–20.

Figure 1. Technical Flow Chart.

Figure 2. Location Map of the Study Area.

Figure 3. Four-point Sampling Diagram.

Figure 4. Structure of the PHMS-Transformer Model.

Figure 5. Scatter Plots of the Observed and Predicted Values of the Cr Element by Different Models. (The regression line between the measured and predicted values is shown as a dashed line, and the 1:1 line as a solid line.)

Figure 6. Spatial Distribution Maps of Cr Element by Different Models.

Figure 7. SHAP Feature Analysis.

Table 1. Partial Ranking of Feature Importance for Cr Element.

Feature	Importance Value	Indication of Environmental Processes
P	230	Adsorption effect triggered by the application of phosphate fertilizers in agricultural activities
TFe₂O₃	191	Regulation of the occurrence form of Cr by iron oxides
K₂O	181	Ion exchange effect caused by the weathering of potassium feldspar
Al₂O₃	178	Fixation ability of clay minerals on Cr
RiverDista	147	Enrichment trend resulting from hydraulic transport in the near-river area
DEM	120	Regulation of the migration path of Cr by topographic elevation
RainAvg	112	Influence of rainfall on the leaching and migration of Cr
pH	81	Influence of soil acid–base conditions on the occurrence form of Cr
Slopetry	76	Regulation of Cr erosion and deposition by slope gradient

Table 2. Data Sources.

Data Name	Data Source
Longitude and Latitude	Handheld GPS Recorder
Slope, Aspect, Terrain Relief, Terrain Curvature	Calculated from the DEM Data of Anhui Province using the Raster Calculator in ArcGIS 10.8
Distance from Sampling Point to the Nearest River, Distance from Sampling Point to the Nearest Road
Average Rainfall	Anhui Meteorological Monitoring
pH	Laboratory Chemical Analysis
Proportions of Sand, Clay, and Loam	1:1,000,000 Soil Data Provided by Nanjing Institute of Soil Science for the Second National Land Survey
Soil Density
Cation Exchange Capacity (CEC)
Exchangeable Sodium Ion, Exchangeable Hydrogen Ion, Exchangeable Potassium Ion, Exchangeable Magnesium Ion, Exchangeable Calcium Ion, Exchangeable Aluminum Ion

Table 3. Comparison of the Precision of Different Models for Cr Element.

Model	R²	MAE	RMSE
AdaBoost [11]	0.528183	11.78199	16.05892
GBDT [12]	0.678762	6.660175	13.25082
XGBoost [13]	0.68335	6.420797	13.15586
MLP [14]	0.699467	6.561879	12.81668
Transformer [17]	0.681787	7.001963	13.18828
PHMS-Transformer	0.718246	6.089136	12.4098

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
