An Adaptive Graph Convolutional Network with Spatial Autocorrelation for Enhancing 3D Soil Pollutant Mapping Precision from Sparse Borehole Data

Tao, Huan; Li, Ziyang; Nie, Shengdong; Li, Hengkai; Zhao, Dan

doi:10.3390/land14071348


by Huan Tao ¹, Ziyang Li ²,*, Shengdong Nie ³, Hengkai Li ³ and Dan Zhao ⁴,⁵,*

¹ Key Laboratory of Land Surface Pattern and Simulation, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (CAS), Beijing 100101, China
² School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
³ Civil and Surveying & Mapping Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
⁴ Key Laboratory of Environmental Damage Identification and Restoration, Ministry of Ecology and Environment, Beijing 100041, China
⁵ Center for Environmental Risk and Damage Assessment, Chinese Academy of Environmental Planning, Beijing 100012, China
* Authors to whom correspondence should be addressed.

Land 2025, 14(7), 1348; https://doi.org/10.3390/land14071348

Submission received: 26 May 2025 / Revised: 23 June 2025 / Accepted: 23 June 2025 / Published: 25 June 2025


Abstract

Sparse borehole sampling at contaminated sites yields limited and unevenly distributed data on soil pollutants. Traditional interpolation methods may obscure local variations in soil contamination when applied to such sparse data, thus reducing the interpolation accuracy. We propose an adaptive graph convolutional network with spatial autocorrelation (ASI-GCN) model to overcome this challenge. The ASI-GCN model effectively constrains pollutant concentration transfer while capturing subtle spatial variations, improving soil pollution characterization accuracy. We tested our model at a coking plant using 215 soil samples from 15 boreholes, evaluating its robustness with three pollutants of varying volatility: arsenic (As, non-volatile), benzo(a)pyrene (BaP, semi-volatile), and benzene (Ben, volatile). Leave-one-out cross-validation demonstrates that the ASI-GCN_RC_G model (ASI-GCN with residual correction) achieves the highest prediction accuracy. Specifically, the R values for As, BaP, and Ben are 0.728, 0.825, and 0.781, respectively, outperforming traditional models by 58.8% (vs. IDW), 45.82% (vs. OK), and 53.78% (vs. IDW). Meanwhile, the corresponding RMSE values drop by 36.56% (vs. Bayesian_K), 38.02% (vs. Bayesian_K), and 35.96% (vs. IDW), further confirming the model’s superior precision. Beyond accuracy, Monte Carlo uncertainty analysis reveals that most predicted areas exhibit low uncertainty, with only a few high-pollution hotspots exhibiting relatively high uncertainty. Further analysis revealed the significant influence of pollutant volatility on vertical migration patterns. Non-volatile As was primarily distributed in the fill and silty sand layers, and semi-volatile BaP was concentrated in the silty sand layer, whereas volatile Ben was predominantly found in the clay and fine sand layers. By integrating spatial autocorrelation with deep graph representation, ASI-GCN redefines sparse data 3D mapping, offering a transformative tool for precise environmental governance and human health assessment.

Keywords:

soil pollution; graph neural network; sparse samples; 3D spatial interpolation; contaminated site

1. Introduction

Rapid industrialization and urbanization have driven extensive land conversion to industrial purposes, where improper disposal of toxic byproducts from energy-intensive industries induces persistent soil contamination. The escalating global demand for land remediation is evidenced by annual expenditures surpassing USD 30 billion, according to the 2021 Global Assessment of Soil Pollution [1]. Consequently, accurate 3D mapping of subsurface pollutants has become pivotal for reducing the time and cost of risk assessment and selecting optimal remediation strategies. These advancements directly support data-driven environmental governance, an essential goal for sustainable development in the 21st century.

While machine learning (ML) has revolutionized spatial prediction in geosciences, its application to 3D soil pollution mapping underground faces two fundamental constraints: firstly, sparsity of high-quality training data due to the prohibitive costs of systematic soil sampling, particularly in deep subsurface investigations [2,3]; secondly, the inherent limitations of conventional ML architectures in capturing complex 3D spatial dependencies from sparse borehole data with heterogeneous quality [4,5]. These constraints collectively result in systematic underestimation of contaminant transport dynamics and unreliable 3D distribution models that inadequately represent subsurface heterogeneity.

To address these challenges, recent advances in graph convolutional networks (GCNs) show promise for modeling non-Euclidean spatial relationships within sparse soil borehole data at contaminated sites [6,7,8]. Unlike conventional geostatistical methods, GCNs can theoretically learn latent spatial correlations through adaptive graph structures. However, existing GCN implementations for soil science remain constrained by two critical shortcomings [9,10]: (1) Dependence on predefined adjacency matrices that cannot adapt to pollutant migration patterns; (2) Requirement for exhaustive environmental covariates rarely available in practical remediation projects. Recent methodological innovations propose dynamic correlation learning [11,12] and residual correction mechanisms [13,14]. However, their efficacy in 3D soil systems remains unproven.

Herein, we present an adaptive graph convolutional network with spatial autocorrelation (ASI-GCN) integrating three methodological innovations: (1) Autocorrelation-informed graph construction to encode 3D spatial dependencies without prior adjacency assumptions; (2) Volatility-adaptive feature embedding accommodating diverse pollutant transport behaviors; (3) Residual-enhanced concentration prediction mitigating interpolation errors in sparse data regimes. We validate the framework using soil borehole data from a decommissioned Beijing coking plant, employing stratified soil samples to characterize the 3D spatial distributions of arsenic (As, non-volatile), benzo(a)pyrene (BaP, semi-volatile), and benzene (Ben, volatile). The objectives of the study are to: (i) capture 3D local variability, (ii) resolve vertical stratification patterns, and (iii) quantify volatility-mediated transport effects. The developed framework advances precision in contaminated site characterization while providing computational tools for adaptive remediation planning.

2. Materials and Methods

2.1. Study Area and Soil Sampling

The study area comprises a decommissioned coking plant in Southeastern Beijing (Figure 1a), operational from 1958 to 2008 as one of China’s largest coal chemical complexes. Historical production of coke, coal gas, and 40+ chemicals (including benzene, benzopyrene, and arsenic compounds) generated three-dimensional contamination across 50 m × 40 m × 14 m (28,000 m³), with 2008 Olympic Games monitoring revealing heavy metals (As), PAHs (BaP), and VOCs (Ben) at concentrations exceeding carcinogenic risk thresholds. These pollutants originate from coal tar distillation residues and unlined wastewater ponds, creating persistent contamination plumes in shallow aquifers. Based on the soil screening values of the Beijing Municipal Environmental Protection Bureau [15], Ben (Figure 1d) had the highest exceedance ratio at 96.55%, followed by BaP (Figure 1c) at 63.28% and As (Figure 1b) at 43.25%. As, BaP, and Ben concentrations ranged from 0.01 to 38.84 mg/kg, 0.001 to 16.71 mg/kg, and 0.01 to 1000 mg/kg, respectively. For more information on drilling and pollutants, please refer to Table S1.

Based on historical contamination maps, a stratified sampling strategy was implemented at 15 drilling locations. Soil samples were collected according to the “Technical Specifications for Soil Environmental Monitoring” and preserved and transported following the “Soil Sampling Quality Assurance User’s Guide” [16,17]. The soil concentrations of As, BaP, and Ben were measured using three analytical techniques: inductively coupled plasma mass spectrometry (ICP-MS, Elan DRC-e, PerkinElmer, Waltham, MA, USA), gas chromatography–mass spectrometry (GC–MS, ASI Scientific, Durham, NC, USA), and purge-and-trap gas chromatography/mass spectrometry (GC–MS, ASI Scientific, USA), respectively. For detailed information, please refer to the attachment.

2.2. The Construction of the ASI-GCN Model

2.2.1. The Principle of the GCN Model for 3D Spatial Interpolation

A GCN is a scalable semi-supervised machine learning algorithm based on graph-structured data in a non-Euclidean space [18]. Unlike traditional Euclidean-based models, GCNs capture spatial correlation and pollutant concentration information by extracting local neighborhood characteristics of sparse drilling samples through a space-based approach. The layer-wise propagation rule transfers graph structure information between layers, where node features (pollutant concentration and 3D coordinates) were weighted to predict labels at unsampled sites. The baseline GCN equations are defined as follows:

$$\hat{A} = A + I \tag{1}$$

$$L = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} \tag{2}$$

$$H^{(l+1)} = \sigma\left(L H^{(l)} W^{(l)}\right) \tag{3}$$

$$\mathrm{GCN}(X, A) = \mathrm{softmax}\left(L \, \sigma\left(L X W^{(l)}\right) W^{(l+1)}\right) \tag{4}$$

where $\hat{A}$ is the adjacency matrix of the borehole locations with self-connections, $A$ is the adjacency matrix without self-connections, and $I$ is the identity matrix. Equation (2) aggregates and normalizes the concentration features from adjacent sites: $L$ is the normalized aggregation matrix, and $\hat{D}$ is the degree matrix of $\hat{A}$, which is used to normalize $\hat{A}$. In Equation (3), $H^{(l)}$ is the output of the $l$-th layer, with $H^{(0)} = X$; $\sigma(\cdot)$ denotes the activation function; and $W^{(l)}$ is the trainable weight matrix of the $l$-th layer. Equation (4) gives the prediction results, with the normalized matrix $L$ from Equation (2) applied in both layers.
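The propagation rule in Equations (1) to (3) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the three-node chain graph, the feature sizes, and the `tanh` activation are all assumptions chosen for the example.

```python
import numpy as np

def normalized_adjacency(A):
    """Eqs. (1)-(2): add self-connections, then normalize symmetrically."""
    A_hat = A + np.eye(A.shape[0])               # Eq. (1)
    deg = A_hat.sum(axis=1)                      # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Eq. (2): D^{-1/2} A_hat D^{-1/2}, written with broadcasting
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def gcn_layer(L, H, W, activation=np.tanh):
    """Eq. (3): one propagation step, sigma(L H W). tanh is an assumed activation."""
    return activation(L @ H @ W)

# Hypothetical toy graph: three samples connected in a chain 0 - 1 - 2.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = normalized_adjacency(A)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))     # 4 assumed node features (e.g. x, y, z, concentration)
W0 = rng.normal(size=(4, 8))    # assumed hidden width of 8
H1 = gcn_layer(L, X, W0)
```

Because each row of `L` mixes a node's own features with those of its neighbors, stacking several such layers progressively averages values over the graph, which is the smoothing behavior the following subsections try to constrain.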

However, two critical limitations hinder the direct application of GCNs to 3D soil pollutant mapping: (1) Dependency on labeled training nodes, and (2) Inability to adaptively learn dynamic spatial relationships from sparse data.

2.2.2. ASI-GCN: Adaptive Enhancements for 3D Soil Mapping

To address the above limitations, we propose an adaptive graph convolutional network with spatial autocorrelation (ASI-GCN) model with three key innovations: constrained message passing, dynamic graph structure learning, and residual correction. Figure 2 illustrates the framework. Please refer to the attachment for specific details.

(1): Constrained message passing mechanism

Traditional GCNs suffer from over-smoothing when handling sparse samples [12]. We introduce a validation-aware adjacency matrix to optimize information flow. Starting from the initial adjacency matrix, two derived matrices are constructed:

W_adj: Initial adjacency matrix without self-loops.
W_adj_1: Validation-masked matrix that blocks message passing from validation nodes, so that only true (observed) information is transmitted.
W_adj_2: Self-loop-enhanced matrix that retains each node’s own aggregated message.

This dual-matrix design ensures robust feature aggregation while mitigating overfitting to sparse labels.
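One way to picture the masking is the NumPy sketch below. It rests on two assumptions that the text does not spell out: rows aggregate from columns (so zeroing column j stops node j from sending messages), and W_adj_2 is built by adding self-loops to the masked matrix. The function name `constrained_adjacency` and the toy graph are hypothetical.

```python
import numpy as np

def constrained_adjacency(W_adj, val_idx):
    """Derive W_adj_1 and W_adj_2 from an adjacency matrix without self-loops.

    Assumed convention: entry [i, j] weights the message sent from node j to node i.
    """
    W_adj_1 = W_adj.copy()
    W_adj_1[:, val_idx] = 0.0                       # validation nodes send nothing
    # Assumption: self-loops are added on top of the masked matrix.
    W_adj_2 = W_adj_1 + np.eye(W_adj.shape[0])
    return W_adj_1, W_adj_2

# Hypothetical toy graph: four fully connected samples, node 2 held out for validation.
W_adj = np.ones((4, 4)) - np.eye(4)
W_adj_1, W_adj_2 = constrained_adjacency(W_adj, val_idx=[2])
```

In this sketch the held-out node can still receive messages from its neighbors (row 2 is untouched), which is what allows a prediction to be made at that location without its own label leaking into the rest of the graph.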

(2): Dynamic graph structure learning

Spatial heterogeneity in soil pollutants necessitates adaptive graph construction. We integrate a multi-head dynamic mask [19] and full-parameter learning [20] to refine edge weights. In the first layer, K parallel masks generate diverse spatial relationships:

$$\tilde{M}^{(i)} = \frac{\sigma_{p}\left(M^{(i)}\right) + \sigma_{p}\left(M^{(i)}\right)^{\top}}{2} \tag{5}$$

$$\hat{W}_{adj_i} = \tilde{D}_{i}^{-1} \odot W_{adj_i} \odot \tilde{M}^{(i)} \tag{6}$$

$$H^{(1)} = \Vert_{k=1}^{K} \, \sigma\left(\tilde{D}_{1,k}^{-1} W_{adj_1,k} \odot \tilde{M}^{(1,k)} H^{(0)} W^{(1,k)}\right) \tag{7}$$

where $\tilde{M}^{(i)}$ is the symmetrized and activated form of $M^{(i)}$; $\tilde{D}_{i}$ is the degree matrix of $\tilde{W}_{adj_i}$; $\tilde{M}^{(1,k)}$ is the $k$-th learnable mask in the first layer; $W^{(1,k)}$ is the trainable weight matrix of the $k$-th mask in the first layer; $\Vert$ denotes the concatenation operation; $\sigma$ is the ELU activation; $(\cdot)^{\top}$ denotes the transpose of a matrix; and $\odot$ denotes the element-wise (Hadamard) product of two matrices. The output of the first layer, $H^{(1)}$, consists of $K$ aggregation branches for each node.

The model captures multi-scale spatial dependencies by concatenating K masked outputs, adapting to uneven sample distributions.
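The multi-head masking of Equations (5) and (7) can be sketched as follows. Several points are assumptions made for illustration: the mask activation is taken to be a sigmoid, the degree matrix is taken to be the row sums of the masked adjacency matrix, and the mask is applied to the adjacency matrix before normalization. The helper names and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x):
    # ELU with alpha = 1; the minimum avoids overflow in the unused branch
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def symmetric_mask(M):
    """Eq. (5), assuming the mask activation is a sigmoid."""
    S = sigmoid(M)
    return 0.5 * (S + S.T)

def masked_heads(W_adj, masks, H0, weights):
    """Eq. (7): one masked, row-normalized aggregation per head, then concatenation."""
    outputs = []
    for M, W in zip(masks, weights):
        W_masked = W_adj * symmetric_mask(M)           # element-wise edge reweighting
        deg = W_masked.sum(axis=1, keepdims=True)      # assumed degree: row sums
        W_norm = W_masked / np.maximum(deg, 1e-12)     # inverse-degree normalization
        outputs.append(elu(W_norm @ H0 @ W))
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
n, K, f_in, f_out = 5, 3, 4, 2                 # hypothetical sizes
W_adj = np.ones((n, n)) - np.eye(n)            # toy fully connected graph, no self-loops
masks = [rng.normal(size=(n, n)) for _ in range(K)]
weights = [rng.normal(size=(f_in, f_out)) for _ in range(K)]
H0 = rng.normal(size=(n, f_in))
H1 = masked_heads(W_adj, masks, H0, weights)   # shape (n, K * f_out)
```

In a trained model the entries of each mask would be learned by gradient descent; here they are random, so the sketch only demonstrates shapes and the order of operations.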

(3): Residual correction mechanism

Outliers in pollutant concentrations degrade interpolation accuracy. We propose a two-stage residual correction to further reduce the smoothing effect, as demonstrated in Algorithm 1.

Algorithm 1 Residual correction mechanism

Primary prediction: Generate initial estimates $\hat{Y}_{p} = H^{(2)}$ using ASI-GCN.
Residual learning: Compute residuals $\Delta Y = Y_{obs} - \hat{Y}_{p}$ at sampled nodes, then predict $\Delta Y_{unsampled}$ via a secondary ASI-GCN or Kriging model.
Final output: Corrected predictions are obtained as $Y_{final} = \hat{Y}_{p} + \Delta Y_{unsampled}$.
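The three steps of Algorithm 1 can be illustrated with a small, self-contained sketch. To keep it short, inverse distance weighting is swapped in for the secondary ASI-GCN or Kriging model that the algorithm actually uses for the residuals; the function names, coordinates, and concentrations are all hypothetical.

```python
import math

def idw(points, values, query, power=2.0):
    """Inverse-distance-weighted estimate at `query` (stand-in residual interpolator)."""
    num = den = 0.0
    for p, v in zip(points, values):
        d = math.dist(p, query)
        if d == 0.0:
            return v                      # query coincides with a sample
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

def residual_correct(sampled_xyz, y_obs, y_pred_sampled, query_xyz, y_pred_query):
    """Add interpolated residuals to the primary predictions."""
    # Residuals at sampled nodes: observed minus primary prediction
    residuals = [o - p for o, p in zip(y_obs, y_pred_sampled)]
    # Final output: primary prediction plus interpolated residual
    return [yp + idw(sampled_xyz, residuals, q)
            for q, yp in zip(query_xyz, y_pred_query)]

# Hypothetical values: two sampled nodes and two query locations.
sampled = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
corrected = residual_correct(
    sampled_xyz=sampled,
    y_obs=[10.0, 20.0],
    y_pred_sampled=[8.0, 21.0],
    query_xyz=[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    y_pred_query=[15.0, 8.0],
)
# corrected == [15.5, 10.0]
```

The midpoint query receives the average of the two residuals (+2 and -1), while the query that coincides with a sampled node is pulled back exactly onto the observed value.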

2.3. Performance and Uncertainty Evaluation of the ASI-GCN Model

The correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE) are used to evaluate model performance. R measures the strength of the linear relationship between the observed and predicted values; the closer R is to 1, the better the fit. Smaller MAE and RMSE values indicate smaller prediction errors.

$$R = \frac{\sum_{i=1}^{n} (y_{i} - \bar{y})(\hat{y}_{i} - \hat{\bar{y}})}{\sqrt{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2} \sum_{i=1}^{n} (\hat{y}_{i} - \hat{\bar{y}})^{2}}} \tag{8}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_{i} - \hat{y}_{i} \right| \tag{9}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}} \tag{10}$$

where $n$ is the number of sampling points in the testing set; $y_{i}$ and $\hat{y}_{i}$ are the actual and predicted pollutant concentrations at unobserved point $i$ in the testing set; and $\bar{y}$ and $\hat{\bar{y}}$ are the mean values of $y_{i}$ and $\hat{y}_{i}$, respectively.
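As a quick check on how the three metrics behave, the sketch below computes them with the Python standard library. MAE and RMSE are computed from the prediction errors (observed minus predicted), which is the standard definition; the function name `evaluate` and the sample values are hypothetical.

```python
import math

def evaluate(y_true, y_pred):
    """Return (R, MAE, RMSE) for paired observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)                                   # Pearson R
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n            # mean |error|
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return r, mae, rmse

# Hypothetical concentrations in mg/kg.
r, mae, rmse = evaluate([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 2.5, 4.5])
# mae == 0.5 and rmse == 0.5, because every error has magnitude 0.5
```

Note that R is insensitive to a constant bias in the predictions, whereas MAE and RMSE are not, which is why the three are reported together.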

Monte Carlo Dropout leverages the stochasticity of Dropout to perform multiple forward passes during the inference phase (each time deactivating a different random subset of neurons), thereby generating an output distribution for each input to achieve uncertainty quantification [21]. Specifically, when Dropout is kept active for $T$ stochastic forward passes, each Dropout mask produces a different prediction. The final predictions are aggregated by computing the mean ($\hat{y}$) as the model output, with the standard deviation ($\sigma$) serving as the uncertainty estimate.

$$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} y^{(t)} \tag{11}$$

$$\sigma = \sqrt{\frac{\sum_{t=1}^{T} \left(y^{(t)} - \hat{y}\right)^{2}}{T}} \tag{12}$$
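The aggregation in Equations (11) and (12) can be demonstrated on a toy model. The sketch below uses a hypothetical linear predictor with inverted dropout on its inputs rather than the ASI-GCN network; only the "T stochastic passes, then mean and standard deviation" logic is meant to carry over. The dropout rate, weights, and T are assumptions.

```python
import random
import statistics

def dropout_forward(x, weights, p_drop, rng):
    """One stochastic pass of a toy linear model with inverted dropout on the inputs."""
    keep = 1.0 - p_drop
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:          # unit kept with probability 1 - p_drop
            total += xi * wi / keep      # rescale so the expected output is unchanged
    return total

def mc_dropout(x, weights, p_drop=0.2, T=200, seed=0):
    """Eqs. (11)-(12): mean and population standard deviation over T passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, p_drop, rng) for _ in range(T)]
    return statistics.fmean(samples), statistics.pstdev(samples)

x = [1.0, 2.0, 3.0]            # hypothetical input features
w = [0.5, -1.0, 2.0]           # hypothetical weights
mean_pred, uncertainty = mc_dropout(x, w)
```

`statistics.pstdev` is used because Equation (12) divides by T rather than T - 1. With `p_drop=0.0` every pass is identical, so the standard deviation collapses to zero and the mean equals the deterministic output.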

3. Results

3.1. Three-Dimensional Spatial Distribution of Pollutants

Figure 3 compares interpolation results across models. ASI-GCN_RC_G captures fine-scale spatial heterogeneity (e.g., As in surface/middle layers, BaP in southeast, Ben in northwest). The RMSE (panels d, h, l) further validates ASI-GCN_RC_G’s robustness. For As, the ASI-GCN_RC_G model demonstrated an RMSE ranging from 2.88 (11th layer) to 7.43 (1st layer) mg/kg. Across the 14 depth layers, this model achieved over 79% probability of yielding lower RMSE values compared to other models. In the case of BaP, the RMSE of ASI-GCN_RC_G varied between 0.36 (10th layer) and 2.56 (1st layer) mg/kg. Among the 14 depth layers, the model exhibited a greater than 57% probability of outperforming competing models in terms of RMSE accuracy. For Ben, the ASI-GCN_RC_G model recorded RMSE values spanning from 64.70 (1st layer) to 290.70 (14th layer) mg/kg. Notably, it showed a probability exceeding 71% of achieving superior RMSE performance relative to other models across all 14 depth layers. Traditional methods (OK, IDW, Bayesian_K) produce oversmoothed outputs (Figure S2), failing to resolve local hotspots (e.g., Ben’s high-concentration pockets). ASI-GCN_RC_G uniquely preserves vertical stratification (e.g., Ben’s bottom-layer dominance), whereas OK and Bayesian_K misallocate pollution to upper layers. For improved visual representation of the contamination depth and stratigraphy, the vertical dimension in Figure 3 has been exaggerated threefold.

3.2. Performance Assessment of the ASI-GCN Model

Table 1 and Figure 4 demonstrate ASI-GCN_RC_G’s superiority across metrics (R, RMSE, MAE). We find that dynamic graph learning enhanced BaP predictions, likely due to its high spatial variability. The R values for As, BaP, and Ben are 0.728, 0.825, and 0.781, respectively, representing improvements of 58.8% (compared to IDW), 45.82% (compared to OK), and 53.78% (compared to IDW) over traditional models. The RMSEs for As, BaP, and Ben are 4.914 mg/kg, 1.656 mg/kg, and 159.040 mg/kg, respectively, indicating reductions of 36.56% (compared to Bayesian_K), 38.02% (compared to Bayesian_K), and 35.96% (compared to IDW) over conventional models. In addition, the R² and MAE metrics indicate that ASI-GCN_RC_G possesses strong predictive capability with minor prediction errors. The regression line of the ASI-GCN_RC_G model aligns more closely with the diagonal, indicating that its predictions are more accurate. Regarding confidence intervals, the ASI-GCN_RC_G model (Figure 4d,g) exhibits wider confidence intervals in high-pollution regions than the OK model, which at first glance might suggest lower reliability in estimating high-pollution values. However, when combined with the prediction results in Figure 3, a more plausible explanation is that the OK model tends to underestimate high values and overestimate low values.
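The percentage improvements in R can be reproduced from the baseline values quoted in Section 4.2 (IDW: R = 0.3 for As and 0.361 for Ben; OK: R = 0.447 for BaP). The short check below suggests that the reported figures correspond to the gain divided by the ASI-GCN_RC_G value rather than by the baseline value; this normalization is inferred from the numbers and is not stated explicitly in the text.

```python
# R values quoted in the text: (ASI-GCN_RC_G, named baseline).
r_values = {
    "As (vs. IDW)":  (0.728, 0.300),
    "BaP (vs. OK)":  (0.825, 0.447),
    "Ben (vs. IDW)": (0.781, 0.361),
}

def relative_gain(model, baseline):
    """Gain expressed as a percentage of the model's own value (inferred convention)."""
    return 100.0 * (model - baseline) / model

gains = {name: round(relative_gain(m, b), 2) for name, (m, b) in r_values.items()}
# gains: As 58.79, BaP 45.82, Ben 53.78 (reported: 58.8, 45.82, 53.78)
```

Dividing by the baseline instead would give much larger figures (for example, roughly 143% for As), so readers comparing these numbers with other studies should check which convention is in use.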

3.3. Uncertainty Assessment of the ASI-GCN Model

The spatial distribution of uncertainties for different pollutants is shown in Figure 5. High uncertainties are typically observed around highly polluted sites, indicating that the ASI-GCN model may have an incomplete understanding of high-pollution features, resulting in lower prediction reliability. Low uncertainties are generally distributed in areas with low-pollution characteristics. This pattern suggests the ASI-GCN model has strong predictive capabilities in these regions. Additionally, in the concentration prediction results and uncertainty analysis for BaP (Figure 3 and Figure S2), although highly polluted sites exhibit high uncertainties, their observed values remain consistent, further demonstrating the robust predictive performance of the ASI-GCN model.

3.4. Pollution Volume Estimation Accuracy

Table 2 reveals ASI-GCN_RC_G’s alignment with ground-truth contamination rates. For soil As, the predicted pollution volume (43.7%) matched the sampled exceedance ratio (43.25%), outperforming ASI-GCN_RC_K by 1.2%. For BaP, ASI-GCN_RC_G predicted a polluted volume of 86.7% vs. Bayesian_K’s 75.0%, minimizing false negatives. For Ben, all models converged at 95% due to extreme contamination, but ASI-GCN_RC_G still reduced MAE by 30.9% vs. OK.
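A polluted-volume percentage of this kind can be obtained by counting the prediction cells that exceed a screening value. The sketch below assumes equal-volume grid cells, which the text does not state; the function name, the concentrations, and the threshold of 20 mg/kg are hypothetical and are not the Beijing screening values.

```python
def polluted_volume_ratio(predicted, screening_value):
    """Percentage of equal-volume grid cells whose predicted concentration exceeds the threshold."""
    exceed = sum(1 for c in predicted if c > screening_value)
    return 100.0 * exceed / len(predicted)

# Hypothetical predicted concentrations (mg/kg) for four grid cells.
ratio = polluted_volume_ratio([5.0, 25.0, 30.0, 10.0], screening_value=20.0)
# ratio == 50.0
```

Comparing this ratio with the exceedance ratio of the samples themselves, as done above, checks whether a model inflates or shrinks the contaminated volume relative to what the boreholes show.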

3.5. Sensitivity to Sparse Sampling

We removed points from the sampling dataset, eliminating 15 points each time, but ensuring that each depth layer had at least three points to guarantee the training and testing of the model (Figure 6 and Figure S4). The ASI-GCN_RC_G model demonstrates remarkable stability, consistently achieving low RMSE values of 32.92% (As), 33.72% (BaP), and 23.79% (Ben), significantly outperforming traditional approaches like OK, IDW, and Bayesian_K. Notably, it delivers substantial improvements over Bayesian_K, reducing RMSE by up to 40.56% for As (185 sample points) and 46.92% for BaP (102 sample points). Meanwhile, IDW struggles in comparison: with 102 and 99 sample points, its RMSE is 55.75% higher for BaP and 40.01% higher for Ben than ASI-GCN_RC_G (Figure S4). Furthermore, ASI-GCN_RC_K suffered in data-scarce regions, highlighting the necessity of graph-based residual correction.
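The thinning procedure described above can be sketched as follows. Random selection of the removed points is an assumption (the text does not say how points were chosen), and the function name, layer labels, and sample counts in the example are hypothetical.

```python
import random
from collections import Counter

def thin_samples(layers, n_remove=15, min_per_layer=3, seed=0):
    """Randomly drop up to `n_remove` samples while each depth layer keeps `min_per_layer`.

    `layers[i]` is the depth-layer label of sample i; returns the indices that remain.
    """
    rng = random.Random(seed)
    counts = Counter(layers)
    order = list(range(len(layers)))
    rng.shuffle(order)                         # random removal order (assumption)
    removed = set()
    for i in order:
        if len(removed) == n_remove:
            break
        if counts[layers[i]] > min_per_layer:  # never go below the per-layer floor
            counts[layers[i]] -= 1
            removed.add(i)
    return [i for i in range(len(layers)) if i not in removed]

# Hypothetical dataset: 14 depth layers with 15 samples each.
layers = [depth for depth in range(14) for _ in range(15)]
kept = thin_samples(layers)                    # one thinning round
```

Calling the function repeatedly on the surviving samples gives the sequence of progressively sparser datasets on which each model would be retrained and scored.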

4. Discussion

4.1. Pollutant Characteristics and Soil Texture Drive Vertical Heterogeneity

As shown in Figure 3 and Table S1, the vertical distribution of pollutants is influenced by the interplay between their volatility and soil characteristics such as texture and adsorption capacity. Three key patterns emerged: soil As concentrated in fill/silty layers (mean = 19.66 mg/kg) due to strong particle adsorption [22,23], validated by ASI-GCN_RC_G’s accurate surface/middle-layer predictions (r = 0.728). BaP peaked in silty sand (CV = 1.65%), aligning with its hydrophobicity and clay affinity [24,25]. The prediction results are similar to those of Hou [26] and Meng [27]. With 45.82% (OK) and 38.02% (Bayesian_K) improvements in R and RMSE, respectively (Table 1), ASI-GCN_RC_G proves that dynamic graph learning excels at capturing the complex behavior of spatially variable pollutants. Ben dominated deep layers (95% exceedance) due to volatility and fine sand adsorption [28,29,30]. Even under heavily polluted conditions, ASI-GCN_RC_G demonstrates superior accuracy, achieving a 35.96% reduction in RMSE, a 36.1% reduction in MAE (compared to IDW), and a 53.78% increase in R (Table 1), highlighting the effectiveness of its residual correction framework. This synergy between pollutant behavior and soil lithology underscores the necessity of adaptive 3D mapping tools for remediation planning.

4.2. ASI-GCN’s Success: Dual-Module Learning and Residual Correction

While all six prediction models exhibit similar spatial distribution patterns, their performance and predictive accuracy vary significantly. Compared with OK (BaP R = 0.447) and IDW (As R = 0.3, Ben R = 0.361), ASI-GCN, ASI-GCN_RC_G, and ASI-GCN_RC_K demonstrate robust and stable predictive capabilities. The superior performance of ASI-GCN_RC_G (R = 0.825 for BaP, Table 1) stems from two key innovations: First, its dual-module adaptive learning system dynamically constructs graph structures to capture complex spatial variability of pollutants accurately (Equations (5)–(7)), particularly crucial for highly variable BaP (CV = 1.65%). The constrained message passing mechanism effectively prevents over-smoothing [31,32], successfully preserving local hotspots like those observed in southeastern regions (Figure 3e). This results in ASI-GCN predictions with maximum uncertainty of just 0.3 mg/kg, with most areas approaching zero (Figure 5). Second, the residual correction mechanism not only mitigates the outlier effect [14], reducing the RMSE and MAE of As by 5.3% and 6.04%, respectively, and improving the r by 6.8% compared to ASI-GCN_RC_K (Table 1), but also achieves an impressive pollution volume accuracy of 86.7% for BaP (Table 2). Furthermore, predictive performance improves with increasing sample size (Figure S4), as accurate predictions require samples that adequately represent soil contaminants across all depth layers [33,34,35,36]. These findings robustly validate our initial hypothesis that graph-based neighborhood aggregation substantially enhances model adaptability to sparse data conditions.

5. Conclusions

This study demonstrates the efficacy of the adaptive graph convolutional network with spatial autocorrelation (ASI-GCN) model in accurately delineating the three-dimensional distribution of multi-volatility soil contaminants in a coking-contaminated site with sparse borehole data. By integrating spatial autocorrelation with dynamic graph representation learning, the ASI-GCN model presents a novel framework that significantly outperforms traditional interpolation methods (OK, IDW, and Bayesian_K) in both predictive accuracy and structural adaptability. The residual correction variant (ASI-GCN_RC_G) achieves lower prediction errors and higher R values, while maintaining low uncertainty across most areas, with only localized increases near high-pollution zones. In addition, the model revealed distinct vertical migration patterns for contaminants with different volatilities, providing crucial insights for targeted remediation strategies. This superior performance can be attributed to the model’s dual-module structure, combining adaptive learning of sample feature mechanisms with dynamic learning of sample structures. The ASI-GCN model represents a significant advancement in coking-contaminated site investigations. By integrating spatial autocorrelation with deep graph representation, ASI-GCN redefines sparse data 3D mapping, offering a transformative tool for sustainable land remediation and precision environmental governance.

Despite the promising results, this study also has certain limitations and areas for future improvement. First, the relatively small spatial extent of the study area may constrain the generalizability and robustness of the model. Future research could benefit from expanding the geographical scope, particularly by incorporating samples from a wider range of regions with diverse geological and environmental characteristics. Second, comparative evaluations between ASI-GCN and other state-of-the-art deep learning models remain to be explored, which could further validate the model’s performance advantages. Additionally, while this study focuses on maximizing the use of existing borehole samples for accurate contamination prediction, it does not address the issue of sampling design optimization. Future work could investigate how optimized sampling strategies may complement and enhance the model’s applicability under conditions of extreme data sparsity.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/land14071348/s1, Figure S1: Sample information transfer flow; Table S1: Summary of statistical descriptions of As, BaP, and Ben content; Figure S2: Three-dimensional distribution of the pollutants As, BaP, and Ben, with pollutant concentration increasing progressively from blue to red; Figure S3: The predictive performance of different models; Figure S4: The RMSE values for various pollutants calculated by the ASI-GCN, ASI-GCN_RC_K, and IDW models under different sparse sample sizes.

Author Contributions

H.T.: Conceptualization, Methodology, Writing—Original Draft. Z.L.: Conceptualization, Methodology, Writing. S.N.: Writing—Review and Editing. H.L.: Writing—Review and Editing. D.Z.: Writing—Review and Editing, Funding Acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 42301060, 42307583, and 42130713.

Data Availability Statement

Data sharing is not applicable. No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

FAO; UNEP. Global Assessment of Soil Pollution: Summary for Policymakers; FAO: Rome, Italy, 2021. [Google Scholar]
Qiao, P.; Dong, N.; Lei, M.; Yang, S.; Gou, Y. An effective method for determining the optimal sampling scale based on the purposes of soil pollution investigations and the factors influencing the pollutants. J. Hazard. Mater. 2021, 418, 126296. [Google Scholar] [CrossRef]
Zheng, S.; Wang, J.; Zhuo, Y.; Yang, D.; Liu, R. Spatial distribution model of DEHP contamination categories in soil based on Bi-LSTM and sparse sampling. Ecotoxicol. Environ. Saf. 2022, 229, 113092. [Google Scholar] [CrossRef]
Yin, G.; Chen, X.; Zhu, H.; Chen, Z.; Su, C.; He, Z.; Qiu, J.; Wang, T. A novel interpolation method to predict soil heavy metals based on a genetic algorithm and neural network model. Sci. Total Environ. 2022, 825, 153948. [Google Scholar] [CrossRef]
Zhu, D.; Huang, Z.; Shi, L.; Wu, L.; Liu, Y. Inferring spatial interaction patterns from sequential snapshots of spatial distributions. Int. J. Geogr. Inf. Sci. 2018, 32, 783–805. [Google Scholar] [CrossRef]
Qiu, Z.; Yue, L.; Liu, X. Void Filling of Digital Elevation Models with a Terrain Texture Learning Model Based on Generative Adversarial Networks. Remote Sens. 2019, 11, 2829. [Google Scholar] [CrossRef]
Yan, L.; Tang, X.; Zhang, Y. High accuracy interpolation of DEM using generative adversarial network. Remote Sens. 2021, 13, 676. [Google Scholar] [CrossRef]
Zhu, D.; Zhang, F.; Wang, S.; Wang, Y.; Cheng, X.; Huang, Z.; Liu, Y. Understanding Place Characteristics in Geographic Contexts through Graph Convolutional Neural Networks. Ann. Am. Assoc. Geogr. 2020, 110, 408–420. [Google Scholar] [CrossRef]
Appleby, G.; Liu, L.; Liu, L. Kriging convolutional networks. Proc. AAAI Conf. Artif. Intell. 2020, 34, 3187–3194. [Google Scholar] [CrossRef]
Wu, Y.; Zhuang, D.; Labbe, A.; Sun, L. Inductive Graph neural networks for spatiotemporal kriging. Proc. AAAI Conf. Artif. Intell. 2021, 35, 4478–4485. [Google Scholar] [CrossRef]
Sun, Y.; Lei, S.; Zhao, Y.; Wei, C.; Yang, X.; Han, X.; Li, Y.; Xia, J.; Cai, Z. Spatial distribution prediction of soil heavy metals based on sparse sampling and multi-source environmental data. J. Hazard. Mater. 2024, 465, 133114. [Google Scholar] [CrossRef]
Zhang, R.J.; Ji, X.H.; Xie, Y.H.; Xue, T.; Liu, S.H.; Tian, F.X.; Pan, S.F. A novel graph convolutional neural network model for predicting soil Cd and As pollution: Identification of influencing factors and interpretability. Ecotoxicol. Environ. Saf. 2025, 292, 117926. [Google Scholar] [CrossRef] [PubMed]
Li, J.; Shen, Y.; Chen, L.; Ng, C.W.W. Rainfall Spatial Interpolation with Graph Neural Networks. Lect. Notes Comput. Sci. 2023, 13946, 175–191. [Google Scholar]
Li, Z.; Tao, H.; Zhao, D.; Li, H. Three-dimensional empirical Bayesian kriging for soil PAHs interpolation considering the vertical soil lithology. Catena 2022, 212, 106098. [Google Scholar] [CrossRef]
Beijing Municipal Environmental Protection Bureau. Screening Levels for Soil Environmental Risk Assessment of Sites; Beijing Municipal Administration of Quality and Technical Supervision: Beijing, China, 2011. [Google Scholar]
EPA-600/4-84-049; Soil Sampling Quality Assurance User’s Guide. U.S. Environmental Protection Agency: Las Vegas, NV, USA, 1989.
State Environmental Protection Administration of China. Technical Specification for Soil Environmental Monitoring; China Environment Press: Beijing, China, 2004. Available online: https://english.mee.gov.cn/Resources/standards/Soil/Method_Standard4/200710/t20071024_111895.shtml (accessed on 7 August 2024).
Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
Ying, R.; Bourgeois, D.; You, J.; Zitnik, M.; Leskovec, J. GNNExplainer: Generating Explanations for Graph Neural Networks. arXiv 2019, arXiv:1903.03894.
Jin, W.; Ma, Y.; Liu, X.; Tang, X.; Wang, S.; Tang, J. Graph Structure Learning for Robust Graph Neural Networks. arXiv 2020, arXiv:2005.10203.
Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1050–1059.
Nguyen, K.T.; Ahmed, M.B.; Mojiri, A.; Huang, Y.; Zhou, J.L.; Li, D. Advances in As contamination and adsorption in soil for effective management. J. Environ. Manag. 2021, 296, 113274.
Xie, Z.; Wang, J.; Wei, X.; Li, F.; Chen, M.; Wang, J.; Gao, B. Interactions between arsenic adsorption/desorption and indigenous bacterial activity in shallow high arsenic aquifer sediments from the Jianghan Plain, Central China. Sci. Total Environ. 2018, 644, 382–388.
Li, J.; Wan, H.; Shang, S. Comparison of interpolation methods for mapping layered soil particle-size fractions and texture in an arid oasis. Catena 2020, 190, 10451.
Yang, Y.; Zhang, N.; Xue, M.; Tao, S. Impact of soil organic matter on the distribution of polycyclic aromatic hydrocarbons (PAHs) in soils. Environ. Pollut. 2010, 158, 2170–2174.
Hou, Y.; Li, Y.; Tao, H.; Cao, H.; Liao, X.; Liu, X. Three-dimensional distribution characteristics of multiple pollutants in the soil at a steelworks mega-site based on multi-source information. J. Hazard. Mater. 2023, 448, 130934.
Meng, X.; Chen, H.; Wu, M. Pollution characteristics of polycyclic aromatic hydrocarbons in unsaturated zone of the different workshops at a large iron and steel industrial site of Beijing, China. Pol. J. Environ. Stud. 2020, 30, 781–792.
Chang, W.; Um, Y.; Hoffman, B.; Holoman, T.R.P. Molecular characterization of polycyclic aromatic hydrocarbon (PAH)-degrading methanogenic communities. Biotechnol. Progr. 2005, 21, 682–688.
Dou, J.; Liu, X.; Hu, Z.; Deng, D. Anaerobic BTEX biodegradation linked to nitrate and sulfate reduction. J. Hazard. Mater. 2008, 151, 720–729.
Wang, Q.; Bian, J.; Ruan, D.; Zhang, C. Adsorption of benzene on soils under different influential factors: An experimental investigation, importance order and prediction using artificial neural network. J. Environ. Manag. 2022, 306, 114467.
Goovaerts, P. Geostatistics for Natural Resources Evaluation; Oxford University Press: New York, NY, USA, 1997; p. 496.
Yang, Y.; Jia, M. 3D spatial interpolation of soil heavy metals by combining kriging with depth function trend model. J. Hazard. Mater. 2024, 461, 132571.
Bradley, V.C.; Kuriwaki, S.; Isakov, M.; Sejdinovic, D.; Meng, X.L.; Flaxman, S. Unrepresentative big surveys significantly overestimated US vaccine uptake. Nature 2021, 600, 695–700.
Wang, J.; Gao, B.; Stein, A. The spatial statistic trinity: A generic framework for spatial sampling and inference. Environ. Modell. Softw. 2020, 134, 104835.
Shen, F.; Xu, C.; Wang, J.; Hu, M.; Guo, G.; Fang, T.; Zhu, X.; Cao, H.; Tao, H.; Hou, Y. A new method for spatial three-dimensional prediction of soil heavy metals contamination. Catena 2024, 235, 107658.
Meng, X.L. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. Ann. Appl. Stat. 2018, 12, 685–726.

Figure 1. Spatial layout of the coking-contaminated site and the soil borehole design. (a) Distribution of soil boreholes superimposed on the site layout; (b–d) pollution status classification of the borehole soil samples for As, BaP, and Ben, respectively. For visual clarity, the vertical direction is exaggerated by a factor of three.

Figure 2. Framework of sparse drilling spatial interpolation and evaluation based on spatial autocorrelation and the GCN model.

Figure 3. Three-dimensional distribution of the pollutants As, BaP, and Ben, with concentration increasing from blue to red. Panels (a,e,i) show the interpolation results for arsenic (As), benzo(a)pyrene (BaP), and benzene (Ben) from the ASI-GCN_RC_G model; panels (b,f,j) show the corresponding results from the OK model; and panels (c,g,k) show those from the Bayesian_K model. Panels (d,h,l) show the variation of RMSE with depth for As, BaP, and Ben, respectively, under the different models: purple represents the ASI-GCN_RC_G model, green the OK model, and orange the Bayesian_K model.

Figure 4. Predictive performance of the different models. Scatter points are pairs of observed and predicted values. The black diagonal line in each subfigure represents the ideal case in which the predicted value equals the observed value. The solid blue line is the fitted regression line between predicted and observed values, and the blue shaded area is its 95% confidence interval; the narrower this band, the more reliable the model’s predictions. R² is the coefficient of determination. Subfigures (a–c) show the results of the ASI-GCN_RC_G, OK, and Bayesian_K models for As; (d–f) show the same three models for BaP; and (g–i) show them for Ben.
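
The regression line and R² described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit of predicted on observed values and takes R² as the squared Pearson correlation of that fit; the caption does not say how the authors computed either quantity, and the function names and toy data are illustrative only.

```python
def fit_line(observed, predicted):
    """Ordinary least-squares line: predicted ~ slope * observed + intercept."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(observed, predicted):
    """Squared Pearson correlation, equal to R^2 of the simple linear fit."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    return (sxy * sxy) / (sxx * syy)


# Toy data (hypothetical, not from the study).
toy_observed = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [2.0, 1.0, 4.0, 3.0]
toy_slope, toy_intercept = fit_line(toy_observed, toy_predicted)
toy_r2 = r_squared(toy_observed, toy_predicted)
```

A fitted line close to the black 1:1 diagonal (slope near 1, intercept near 0) together with a high R² indicates predictions that are both unbiased and tightly correlated with the observations; a high R² alone does not rule out systematic over- or under-prediction.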

Figure 5. Spatial uncertainty maps of ASI-GCN model predictions for various pollutants.

Figure 6. The RMSE values for various pollutants under different sparse sample sizes. Subfigures (a–c) display the RMSE values for As, BaP, and Ben, respectively.

Table 1. Summary of performance metrics between ASI-GCN models and classical interpolation methods.

Pollutants	Count	Model	R	RMSE (mg/kg)	MAE (mg/kg)
As	215	ASI-GCN_RC_G	0.728	4.914	3.916
		ASI-GCN_RC_K	0.678	5.190	4.168
		ASI-GCN	0.721	5.041	4.066
		OK	0.374	6.534	5.125
		IDW	0.300	7.320	5.773
		Bayesian_K	0.316	7.746	5.698
BaP	207	ASI-GCN_RC_G	0.825	1.656	1.096
		ASI-GCN_RC_K	0.800	1.755	1.187
		ASI-GCN	0.819	1.699	1.214
		OK	0.447	2.592	1.585
		IDW	0.529	2.491	1.454
		Bayesian_K	0.458	2.672	1.342
Ben	174	ASI-GCN_RC_G	0.781	159.040	112.315
		ASI-GCN_RC_K	0.700	178.316	128.787
		ASI-GCN	0.755	167.710	124.389
		OK	0.520	214.492	162.454
		IDW	0.361	248.342	175.867
		Bayesian_K	0.557	214.732	145.066
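
The relative gains quoted in the abstract can be recomputed from Table 1. The sketch below is a minimal check, assuming that the gain in R is expressed relative to the ASI-GCN_RC_G value and the drop in RMSE relative to the baseline value; these conventions are inferred from the quoted percentages rather than stated in the text, and the function and variable names are illustrative.

```python
# (R, RMSE in mg/kg) for ASI-GCN_RC_G, copied from Table 1.
PROPOSED = {"As": (0.728, 4.914), "BaP": (0.825, 1.656), "Ben": (0.781, 159.040)}
# Baselines named in the abstract for each comparison, with their Table 1 values.
R_BASELINE = {"As": ("IDW", 0.300), "BaP": ("OK", 0.447), "Ben": ("IDW", 0.361)}
RMSE_BASELINE = {
    "As": ("Bayesian_K", 7.746),
    "BaP": ("Bayesian_K", 2.672),
    "Ben": ("IDW", 248.342),
}


def r_gain_pct(r_model: float, r_baseline: float) -> float:
    """Gain in R as a percentage of the proposed model's R (assumed convention)."""
    return 100.0 * (r_model - r_baseline) / r_model


def rmse_drop_pct(rmse_model: float, rmse_baseline: float) -> float:
    """Reduction in RMSE as a percentage of the baseline RMSE (assumed convention)."""
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline


def summarize() -> dict:
    """Return {pollutant: (R gain %, RMSE drop %)}, rounded to two decimals."""
    summary = {}
    for pollutant, (r_model, rmse_model) in PROPOSED.items():
        r_base = R_BASELINE[pollutant][1]
        rmse_base = RMSE_BASELINE[pollutant][1]
        summary[pollutant] = (
            round(r_gain_pct(r_model, r_base), 2),
            round(rmse_drop_pct(rmse_model, rmse_base), 2),
        )
    return summary


if __name__ == "__main__":
    for pollutant, (gain, drop) in summarize().items():
        print(pollutant, gain, drop)
```

Under these conventions the sketch reproduces the abstract's figures to rounding. The choice of denominator matters when comparing with other studies: expressed relative to the baseline R instead, the As gain would be (0.728 − 0.300)/0.300 ≈ 143%.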

Table 2. The pollution volume of different soil pollutants using the ASI-GCN models and conventional models.

Soil Pollutants	Spatial Interpolation Models	Pollution Volumes (m³)	Percentage of Pollution Volumes (%)
As	ASI-GCN_RC_G	12,244	43.7
	ASI-GCN_RC_K	12,084	43.2
	ASI-GCN	12,988	46.4
	OK	13,712	49.0
	IDW	12,220	43.6
	Bayesian_K	9900	35.4
BaP	ASI-GCN_RC_G	24,284	86.7
	ASI-GCN_RC_K	26,464	94.5
	ASI-GCN	26,600	95.0
	OK	25,972	92.8
	IDW	26,476	94.6
	Bayesian_K	21,004	75.0
Ben	ASI-GCN_RC_G	26,600	95.0
	ASI-GCN_RC_K	26,600	95.0
	ASI-GCN	26,600	95.0
	OK	26,600	95.0
	IDW	26,600	95.0
	Bayesian_K	26,504	94.7
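
The percentages in Table 2 are consistent with a single modelled soil volume of about 28,000 m³. That total is not stated in this section; it is inferred here from the table itself (for example, 26,600 m³ corresponds to 95.0%). A minimal sketch of the conversion, with the total treated as an assumption and illustrative names:

```python
# Assumption: total modelled soil volume inferred from Table 2, not given in the text.
TOTAL_VOLUME_M3 = 28_000

# Pollution volumes (m³) for As, copied from Table 2.
AS_VOLUMES_M3 = {
    "ASI-GCN_RC_G": 12_244,
    "ASI-GCN_RC_K": 12_084,
    "ASI-GCN": 12_988,
    "OK": 13_712,
    "IDW": 12_220,
    "Bayesian_K": 9_900,
}


def polluted_share_pct(volume_m3: float, total_m3: float = TOTAL_VOLUME_M3) -> float:
    """Polluted volume as a percentage of the total volume, to one decimal place."""
    return round(100.0 * volume_m3 / total_m3, 1)


AS_SHARES_PCT = {model: polluted_share_pct(v) for model, v in AS_VOLUMES_M3.items()}

if __name__ == "__main__":
    for model, share in AS_SHARES_PCT.items():
        print(f"{model}: {share}%")
```

With the assumed total, the computed shares match the As rows of the table, and the same function gives the tabulated values for the BaP and Ben rows.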

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

