Article

Transfer Learning-Based Interpretable Soil Lead Prediction in the Gejiu Mining Area, Yunnan

1 Faculty of Land Resources Engineering, Kunming University of Science and Technology, Kunming 650093, China
2 School of Fine Art and Design, Kunming University, Kunming 650214, China
3 International Research Center of Big Data for Sustainable Development Goals, Beijing 100094, China
4 School of Earth and Environmental Sciences, Yunnan Land and Resources Vocational College, Kunming 652501, China
5 Engineering Center of Yunnan Education Department for Health Geological Survey & Evaluation, Kunming 650218, China
6 Key Laboratory of Digital Earth Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2025, 25(13), 4209; https://doi.org/10.3390/s25134209
Submission received: 23 May 2025 / Revised: 2 July 2025 / Accepted: 3 July 2025 / Published: 5 July 2025
(This article belongs to the Section Smart Agriculture)

Abstract

Accurate prediction of soil lead (Pb) content in small sample scenarios is often limited by data scarcity and variability in soil properties, with traditional spectral modeling methods yielding suboptimal precision. To address this, we propose a transfer learning-based framework integrated with SHAP analysis for predicting soil Pb content in the Gejiu mining area, Yunnan. Using pH data from the European LUCAS soil database as the source domain, spectral features were extracted via a 1D-ResNet model and transferred to the target domain (130 soil samples from Gejiu) for Pb prediction. SHAP analysis was applied to clarify the role of spectral characteristics in cross-component transfer learning, uncovering shared and adaptive features between pH and Pb predictions. The transfer learning model (ResNet-pH-Pb) significantly outperformed direct modeling methods (PLS-Pb, SVM-Pb, and ResNet-Pb), with an R2 of 0.77, demonstrating superior accuracy. SHAP analysis showed that the model retained key pH-related wavelengths (550–750 nm and 1600–1700 nm) while optimizing Pb-related wavelengths (e.g., 919 nm and 959 nm). This study offers a novel approach for soil heavy metal prediction under small sample constraints and provides a theoretical basis for understanding spectral prediction mechanisms through interpretability analysis.

1. Introduction

Lead (Pb), a profoundly toxic, bioaccumulative, and environmentally persistent heavy metal, presents substantial risks to ecosystems and human well-being [1,2]. Due to prolonged mining and smelting activities, Pb content in mining area soils is typically high, and the pollution exhibits significant spatial heterogeneity [3,4]. The Gejiu mining area in Yunnan, one of China’s major non-ferrous metal mining regions, suffers from particularly severe Pb pollution due to historical mining activities [5]. Therefore, accurately and rapidly assessing Pb content in the soils of mining areas and identifying the key factors influencing it is of great importance for pollution assessment and environmental management.
Visible–near infrared (Vis-NIR) spectroscopy offers a non-destructive, rapid, and cost-effective means of estimating soil Pb levels [6]. Researchers typically enhance the Pb spectral signal by selecting specific bands and combine it with machine learning methods such as partial least squares (PLS) and support vector machines (SVM) to build predictive models [7,8]. However, Pb lacks direct spectral absorption features, making its signal susceptible to soil matrix interference, which limits model accuracy [9]. Deep learning methods, such as 1D residual neural networks (1D-ResNet), show great potential by extracting complex spectral features through cross-layer connections [10,11]. Yet their reliance on large sample sizes conflicts with the limited data available in mining areas [12].
Transfer learning presents a novel approach by harnessing knowledge from a related task to boost prediction accuracy despite limited target domain data [13]. Existing studies have shown that models trained on the LUCAS dataset for predicting organic carbon and pH have achieved high accuracy in cross-region transfers [14,15,16]. Notably, current research has mainly focused on transfer within the same attribute (e.g., pH → pH), while how to use soil physicochemical properties closely related to Pb to assist Pb prediction remains an unexplored issue. Soil pH is a key factor influencing the distribution and mobility of Pb [17,18]. Variations in pH substantially influence the solubility and adsorption capacity of Pb within soil [19]. Compared to Pb, pH is more easily predicted accurately from spectral data, as it strongly correlates with spectrally active soil components such as clay minerals, organic matter, and iron oxides, which show Vis-NIR absorption features [20,21]. These components also influence Pb behavior, allowing pH to indirectly inform Pb modeling [19]. This makes pH an ideal intermediary variable, as large-scale pH data from the LUCAS dataset can be used to train a source domain model and transfer this knowledge to the small sample Pb prediction task, improving the accuracy and stability of Pb predictions.
Although transfer learning can enhance prediction performance, the lack of model interpretability limits its practical application. SHAP (SHapley Additive exPlanations) values help address the “black box” problem by quantifying the contribution of each feature to a prediction [22]. While SHAP has shown interpretive potential in fields such as spectral analysis and environmental monitoring [23,24,25], its application to transfer learning models, particularly in analyzing the reuse of features across components, has not been fully explored.
Therefore, this study innovatively proposes a soil Pb prediction framework that integrates transfer learning with SHAP analysis. Its core contributions include the following: (1) Building a model with the extensive LUCAS dataset for pH and using transfer learning to improve Pb predictions in the small-sample Gejiu mining area. (2) Using SHAP values for the first time to explain how spectral features contribute to Pb prediction in this transfer learning framework, offering a theoretical foundation for heavy metal spectral prediction.

2. Data and Methods

2.1. Data Sources

This study utilizes two datasets—source domain and target domain data—to predict soil lead (Pb) content via cross-component transfer learning.
The source domain data is derived from the European LUCAS Soil Database, collected by the European Commission, comprising 19,036 surface soil samples (0–20 cm) from multiple European countries [26]. Physicochemical and spectral properties were measured using standardized protocols to ensure data consistency [27]. Spectral measurements were conducted using a FOSS XDS Rapid Content Analyzer (FOSS NIRSystems Inc., Hilleroed, Denmark; 400–2500 nm, 0.5 nm resolution) [28], downsampled to 400–2499 nm (1 nm interval, 2100 wavelength points) to align with the target domain. Given the potential correlation between pH and soil Pb chemical behaviors (such as adsorption and desorption), pH was selected as the source domain task to provide a knowledge foundation for cross-component transfer learning.
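The text does not say how the 0.5 nm LUCAS spectra were reduced to a 1 nm grid. As a minimal sketch, assuming the simplest option of keeping only readings at whole-nanometre wavelengths (averaging adjacent points would be an equally plausible choice), the reduction to 2100 points can be illustrated on a synthetic grid. The function name and the flat reflectance values are illustrative, not taken from the study.

```python
def downsample_to_1nm(wavelengths, reflectance, start=400.0, stop=2499.0):
    """Keep only readings on whole-nanometre wavelengths within [start, stop]."""
    kept_wl, kept_r = [], []
    for wl, r in zip(wavelengths, reflectance):
        if start <= wl <= stop and float(wl).is_integer():
            kept_wl.append(int(wl))
            kept_r.append(r)
    return kept_wl, kept_r


# Synthetic 0.5 nm grid from 400 to 2499.5 nm (4200 points) with flat reflectance
grid = [400 + 0.5 * i for i in range(4200)]
refl = [0.3] * len(grid)

wl_1nm, refl_1nm = downsample_to_1nm(grid, refl)
print(len(wl_1nm), wl_1nm[0], wl_1nm[-1])  # 2100 400 2499
```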
The target domain data was collected by the research team through field sampling in the Gejiu mining area, Yunnan, China, a region characterized by complex terrain and rich mineral resources, particularly tin mines [29]. Prolonged mining operations have resulted in substantial Pb pollution [30]. From March to April 2024, in the dry season when exposed soil was common, 130 surface soil samples (0–20 cm) were gathered. Sampling was carried out using a grid method: in the northern smelting area, a 1000 m × 1000 m grid was used, with denser sampling (500 m × 500 m) around the smelting plants. In the southern mining region, samples were gathered at 1000 m intervals along the road. The overall distribution of sampling points across the study area is illustrated in Figure 1. Soil samples were air-dried, cleared of impurities, ground, and sieved through a 100-mesh screen, then split into two portions: one for spectral measurement and one for Pb content analysis. Spectral measurements were obtained with an ASD FieldSpec 3 spectrometer (Analytical Spectral Devices Inc., Boulder, CO, USA; 350–2500 nm, 1 nm resolution), with five measurements averaged per sample. To reduce noise and align with the source domain, the 350–399 nm range was excluded, retaining 400–2499 nm. Pb concentrations were determined using inductively coupled plasma optical emission spectrometry (ICP-OES) after mixed-acid digestion with HNO3, HCl, HF, and HClO4 in a PTFE crucible. The residues were dissolved in 1:1 HCl and diluted to 10 mL, following established soil analysis protocols. The Pb concentration data used in this study are publicly available at Zenodo: https://zenodo.org/records/15742450 (accessed on 25 June 2025).
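The two spectral preparation steps described for the Gejiu samples (averaging five scans per sample, then restricting the range to 400–2499 nm) can be sketched as follows. This is an illustration on synthetic scans with hypothetical function names, not the authors' processing code.

```python
def average_scans(scans):
    """Element-wise mean of repeated scans of the same sample."""
    n = len(scans)
    return [sum(values) / n for values in zip(*scans)]


def trim_range(wavelengths, spectrum, low=400, high=2499):
    """Drop readings outside [low, high] nm so both domains share one grid."""
    pairs = [(wl, v) for wl, v in zip(wavelengths, spectrum) if low <= wl <= high]
    return [wl for wl, _ in pairs], [v for _, v in pairs]


wavelengths = list(range(350, 2501))  # 1 nm grid of the field spectrometer
# Five synthetic scans of one sample, each a flat spectrum at a slightly different level
scans = [[0.20 + 0.01 * k] * len(wavelengths) for k in range(5)]

mean_spectrum = average_scans(scans)
kept_wl, kept_spectrum = trim_range(wavelengths, mean_spectrum)
print(len(kept_wl), kept_wl[0], kept_wl[-1])  # 2100 400 2499
```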

2.2. Methods

2.2.1. ResNet Model Architecture

ResNet optimizes deep network performance through a unique residual learning mechanism, effectively addressing the degradation problem that arises with increasing network depth in traditional architectures [10]. The core design relies on residual blocks, where skip connections merge input and convolutional features, enhancing both training efficiency and model expressiveness.
This research developed a 1D-ResNet model utilizing spectral data characteristics. The model’s input consists of log-transformed (logR) spectral reflectance data, with a dimensionality of 2100. To improve computational efficiency and extract key features, the first layer applies average pooling with a window size of 10, reducing the dimension to 210. The data then passes through a 1D convolutional layer (48 filters, kernel size 3, stride 1, Leaky ReLU activation function with α = 0.01) for feature extraction, followed by batch normalization to optimize data distribution. The data subsequently undergoes max pooling with a window size of 2, followed by two residual blocks (with ReLU activation and skip connections). Afterward, it advances through a 1D convolutional layer (32 filters), a flattening layer, and two dense layers (16 and 10 nodes, using Leaky ReLU activation). The output layer consists of a single-node dense layer with ReLU activation. All convolutional and fully connected layers use L2 regularization (λ = 0.0004). The specific network architecture is shown in Figure 2 and Table 1.
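To make the dimensions in this description easier to follow, the sketch below traces the tensor shape through the listed layers and shows what average pooling with a window of 10 does to a 2100-point input. Padding is not stated in the text, so "same" padding for the convolutions and shape-preserving residual blocks are assumptions here; the authoritative layer details are in Table 1 of the article.

```python
def average_pool(signal, window):
    """Non-overlapping average pooling over a 1D sequence."""
    return [sum(signal[i:i + window]) / window
            for i in range(0, len(signal) - window + 1, window)]


def trace_shapes(n_input=2100):
    """Follow the (length, channels) shape through the layers named in the text.

    Assumes 'same' padding for convolutions and residual blocks that keep
    both length and channel count unchanged.
    """
    shapes = [("input", (n_input, 1))]
    length = n_input // 10
    shapes.append(("avg_pool(10)", (length, 1)))
    shapes.append(("conv1d(48, k=3, s=1)", (length, 48)))
    shapes.append(("batch_norm", (length, 48)))
    length //= 2
    shapes.append(("max_pool(2)", (length, 48)))
    shapes.append(("res_block_1", (length, 48)))
    shapes.append(("res_block_2", (length, 48)))
    shapes.append(("conv1d(32)", (length, 32)))
    shapes.append(("flatten", (length * 32,)))
    for units in (16, 10, 1):
        shapes.append((f"dense({units})", (units,)))
    return shapes


pooled = average_pool(list(range(2100)), 10)
print(len(pooled))  # 210

for name, shape in trace_shapes():
    print(f"{name:24s} {shape}")
```

Under these assumptions the flattening layer receives 105 × 32 = 3360 values before the dense layers.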
The model was implemented in Python 3.9 with TensorFlow 2.0 and trained with the Adam optimizer (learning rate = 0.001), using mean squared error (MSE) as the loss function. Training runs for up to 2000 epochs with early stopping (patience = 120): training halts if no improvement is observed for 120 consecutive epochs.
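The stopping rule can be illustrated with a small framework-independent sketch. The text does not state which quantity is monitored, so a per-epoch validation loss is assumed here, and the loss sequence is synthetic.

```python
def early_stopping_epoch(losses, max_epochs=2000, patience=120):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch losses.

    Training stops once `patience` epochs have passed without a new minimum,
    or when `max_epochs` is reached, whichever comes first.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return min(len(losses), max_epochs), best_epoch


# Synthetic curve: the loss improves for 50 epochs, then plateaus at a worse value
losses = [1.0 / e for e in range(1, 51)] + [0.05] * 500
print(early_stopping_epoch(losses))  # (170, 50)
```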

2.2.2. Transfer Learning Process

Transfer learning enhances target domain prediction by using rich source domain data, making it ideal for cross-component predictions such as from pH to Pb. This study employs a 1D-ResNet model with a fine-tuning strategy to transfer spectral features from the LUCAS dataset (source domain, 19,036 samples) to the Gejiu mining area (target domain, 130 samples) for improved soil Pb content prediction.
First, the source domain model, ResNet-pH, is trained on the LUCAS dataset to predict pH values, using 14,277 samples (75%) for training and 4759 samples (25%) for testing. Next, a baseline model, ResNet-Pb, is independently trained on the Gejiu dataset to directly predict Pb content, using 98 samples (75%) for training and 32 samples (25%) for testing, without relying on source domain information. For transfer learning, pre-trained weights from ResNet-pH are loaded into the target domain model. Convolutional layers and residual blocks are frozen to retain general spectral features learned from the source task (pH prediction), which often reflect fundamental soil properties such as absorption patterns of iron oxides near 550 nm. In contrast, the max-pooling layer and fully connected layers are fine-tuned using the target domain (Pb) data to capture component-specific variations. This selective updating strategy allows the model to preserve transferable low-level spectral representations while adapting to the distinct characteristics of Pb, thereby improving prediction accuracy in data-scarce conditions. The resulting transfer learning model is denoted as ResNet-pH-Pb. The model configurations are detailed in Table 2.
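The freezing policy described above can be written down as a simple rule over layer types. The sketch below is framework-independent and uses hypothetical layer names; how layers not mentioned in the text (batch normalization, average pooling, flattening) are treated is an assumption, and they are marked frozen here.

```python
# Layer kinds updated on the Pb data, per the text
TUNED_KINDS = {"max_pool", "dense"}

# Hypothetical layer names following the order given in Section 2.2.1
LAYERS = [
    ("avg_pool", "avg_pool"),
    ("conv1", "conv"),
    ("batch_norm", "batch_norm"),
    ("max_pool", "max_pool"),
    ("res_block_1", "residual"),
    ("res_block_2", "residual"),
    ("conv2", "conv"),
    ("flatten", "flatten"),
    ("dense_16", "dense"),
    ("dense_10", "dense"),
    ("output", "dense"),
]


def trainable_flags(layers):
    """Map each layer name to True (fine-tuned) or False (frozen)."""
    flags = {}
    for name, kind in layers:
        # Convolutional and residual layers are frozen per the text; kinds the
        # text does not mention are also frozen here, which is an assumption.
        flags[name] = kind in TUNED_KINDS
    return flags


flags = trainable_flags(LAYERS)
for name, trainable in flags.items():
    print(f"{name:12s} {'fine-tuned' if trainable else 'frozen'}")
```

In common deep learning frameworks a max-pooling layer has no weights, so under this policy the parameters that actually change during fine-tuning are those of the dense layers.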
Both datasets are randomly split into 75% training and 25% testing sets, and this process is repeated over 10 rounds to ensure robust evaluation. The final evaluation of the model is based on the average of the evaluation metrics from the 10 test rounds. Model performance is assessed using metrics detailed in Section 2.2.4.
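A minimal sketch of this evaluation protocol is given below. The rounding rule that yields 98 training samples from 130 is not stated in the text, so rounding up is assumed; the seed and the placeholder `evaluate` callback are illustrative and stand in for training and scoring a real model.

```python
import math
import random
from statistics import mean


def repeated_splits(n_samples, train_fraction=0.75, rounds=10, seed=0):
    """Return `rounds` independent random (train_idx, test_idx) splits."""
    rng = random.Random(seed)
    n_train = math.ceil(n_samples * train_fraction)  # assumption: round up
    splits = []
    for _ in range(rounds):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        splits.append((idx[:n_train], idx[n_train:]))
    return splits


def average_over_rounds(splits, evaluate):
    """Average a per-round metric returned by evaluate(train_idx, test_idx)."""
    return mean(evaluate(train, test) for train, test in splits)


gejiu_splits = repeated_splits(130)
print(len(gejiu_splits), len(gejiu_splits[0][0]), len(gejiu_splits[0][1]))  # 10 98 32

# Placeholder metric: a real run would fit a model on `train` and score it on `test`
dummy_score = average_over_rounds(gejiu_splits, lambda train, test: len(test) / 130)
```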

2.2.3. Interpretability Analysis

SHAP values, based on game theory, quantify feature impacts on predictions, enhancing model interpretability [31]. This research utilizes the GradientExplainer module from the SHAP library to calculate SHAP values and assess feature importance for ResNet-based models.
For the source domain model, ResNet-pH, the LUCAS dataset’s large size (19,036 samples) leads to significant computational demands for SHAP analysis due to its need to evaluate feature interactions. We selected 1000 samples from the training set as background data to provide a baseline for estimating the model’s expected output, enabling SHAP to quantify each feature’s contribution relative to this baseline. This subset size ensures computational efficiency while capturing sufficient spectral information. From the test set, 1000 samples are selected to compute SHAP values, with their mean absolute values used to assess each wavelength’s contribution to pH prediction.
For the transfer learning model, ResNet-pH-Pb, all 130 samples from the Gejiu mining area dataset are used as background data to compute SHAP values for the test set, revealing the impact of transfer learning on Pb content prediction.
By comparing the SHAP values of ResNet-pH and ResNet-pH-Pb, we analyze differences in wavelength contributions between pH and Pb prediction tasks, elucidating how transfer learning adjusts shared and task-specific wavelengths to enhance model predictions.
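The aggregation step, taking the mean absolute SHAP value per wavelength and ranking wavelengths by it, can be sketched without the SHAP library. The small matrix below is made up for illustration; in the study the matrix would hold the values returned by the explainer, with one row per test sample and one column per wavelength.

```python
def mean_abs_contribution(shap_matrix):
    """Mean absolute SHAP value per feature (column) across samples (rows)."""
    n_samples = len(shap_matrix)
    n_features = len(shap_matrix[0])
    return [sum(abs(row[j]) for row in shap_matrix) / n_samples
            for j in range(n_features)]


def top_wavelengths(importance, wavelengths, k=3):
    """Return the k wavelengths with the largest mean absolute contribution."""
    ranked = sorted(zip(wavelengths, importance), key=lambda p: p[1], reverse=True)
    return [wl for wl, _ in ranked[:k]]


# Toy example: 2 samples x 4 wavelengths (values are illustrative only)
wavelengths = [400, 401, 402, 403]
shap_matrix = [
    [0.1, -0.4, 0.0, 0.2],
    [-0.1, 0.6, 0.0, -0.2],
]

importance = mean_abs_contribution(shap_matrix)
print(top_wavelengths(importance, wavelengths, k=2))  # [401, 403]
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling, so a wavelength that pushes predictions strongly in both directions still ranks as important.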

2.2.4. Comparison Experiments and Evaluation Metrics

To validate the effectiveness of cross-component transfer learning for soil Pb prediction in the target domain, this study trains PLS and SVM models on the target domain dataset (labeled PLS-Pb and SVM-Pb; implemented in Python 3.9 with scikit-learn) and compares their performance with that of the transfer learning model ResNet-pH-Pb.
PLS extracts latent variables to model relationships between predictor and response variables [32]. It handles linear relationships well at low computational cost and is suitable for small datasets, but may fail to capture nonlinear patterns in complex data [7,33,34]. For example, Chen et al. (2022) reported PLS-based soil Pb prediction with R2 = 0.59, and Arif et al. (2022) achieved R2 = 0.66 in urban greenbelt zones [7,34]. In our study, the PLS-Pb model was optimized using 5-fold cross-validation with GridSearchCV, searching over 1 to 50 components; 11 components gave the best performance.
SVM maps data into a higher-dimensional space using a kernel function to model complex relationships [35]. It effectively captures nonlinear relationships with flexible kernel choices, but requires careful parameter tuning and can be computationally demanding for larger datasets [1]. Chen et al. (2022) reported SVM-based Pb prediction with R2 = 0.55 [7]. In our study, the SVM-Pb model employed a radial basis function kernel (kernel = ‘rbf’), with Bayesian optimization used to determine optimal hyperparameters, yielding C = 0.26 (regularization) and gamma = ‘scale’ (scikit-learn default setting).
Model performance is assessed using the coefficient of determination (R2), root mean square error (RMSE), and residual prediction deviation (RPD). R2 indicates the proportion of variance in the data accounted for by the model, with values nearer to 1 signifying a better fit. RMSE quantifies the average discrepancy between predicted and actual values, with lower values reflecting greater prediction precision. RPD is the ratio of the standard deviation of the observed values to the RMSE and gauges prediction reliability, with RPD < 1.4 indicating inadequate prediction, 1.4 ≤ RPD < 2.0 suggesting acceptable performance, and RPD ≥ 2.0 representing superior prediction [36].
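The three metrics and the RPD thresholds can be written compactly as below. This is a sketch using common definitions; in particular, RPD is computed here with the sample standard deviation of the observed values, which is an assumption because conventions differ and the text does not specify one. The example values are made up.

```python
from math import sqrt
from statistics import mean, stdev


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def rpd(y_true, y_pred):
    """Standard deviation of the observed values divided by the RMSE."""
    return stdev(y_true) / rmse(y_true, y_pred)


def rpd_class(value):
    """Interpretation thresholds quoted in the text."""
    if value < 1.4:
        return "inadequate"
    if value < 2.0:
        return "acceptable"
    return "superior"


# Illustrative values only
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
print(r2_score(y_true, y_pred), rmse(y_true, y_pred), rpd_class(rpd(y_true, y_pred)))
```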

3. Results

3.1. Statistical Characterization of Soil Pb Concentrations

This study analyzed soil Pb concentrations from the Gejiu mining region and pH levels from the LUCAS dataset, with summary statistics presented in Table 3.
For LUCAS, pH values ranged from 3.21 to 10.08, with a mean of 6.02 and a median of 6.21. The skewness of −0.07 and kurtosis of −1.24 suggest a near-normal distribution with minimal asymmetry. The SD of 1.35 and CV of 0.22 indicate low variability, reflecting relatively uniform soil pH.
For Gejiu, Pb levels varied from 34.6 mg/kg to 9720 mg/kg, highlighting considerable differences in Pb concentrations across sampling locations. The average Pb level was 974.06 mg/kg, with a median of 232 mg/kg. The skewness of 3.12 and kurtosis of 9.1 suggest a markedly positively skewed distribution, indicating that most locations exhibited relatively low Pb levels, while a few showed exceptionally high concentrations. Furthermore, the standard deviation (SD) was 1969.17 mg/kg, and the coefficient of variation (CV) of 2.02 indicates substantial variability, pointing to significant spatial heterogeneity. This may be attributed to localized contamination hotspots resulting from mining activities. To mitigate the effects of skewness and variability, a logarithmic transformation was applied to the Pb levels.
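The reported coefficient of variation follows directly from the standard deviation and the mean, and the effect of a logarithmic transformation on skewness can be seen on a small example. In the sketch below, the natural logarithm is an assumption (the text does not state the base), and the five-value sample is illustrative: only its minimum, median, and maximum are taken from the reported statistics.

```python
from math import log
from statistics import mean, pstdev


def coefficient_of_variation(sd, mean_value):
    """CV = standard deviation / mean."""
    return sd / mean_value


def skewness(values):
    """Population skewness: third central moment divided by sd cubed."""
    m = mean(values)
    s = pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


print(round(coefficient_of_variation(1969.17, 974.06), 2))  # 2.02

# Illustrative right-skewed sample in mg/kg (not the study data)
pb = [34.6, 120.0, 232.0, 800.0, 9720.0]
log_pb = [log(v) for v in pb]
print(skewness(pb) > skewness(log_pb))  # True
```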

3.2. Source Domain Modeling

The scatter plot depicting the results of 10 rounds of random sampling tests for the source domain model (ResNet-pH) is presented in Figure 3. The model’s average R2 is 0.91, RMSE is 0.42, and RPD is 3.24, indicating strong predictive capability. This high-precision source domain model effectively captured the spectral features of pH values, providing a reliable foundation for subsequent target domain Pb prediction and wavelength contribution analysis.

3.3. Direct Modeling in the Target Domain

For the target domain Gejiu dataset, ResNet, PLS, and SVM models were directly trained to predict Pb levels, with 10 rounds of random sampling evaluations performed. The findings are displayed in Figure 4. SVM-Pb achieved the best prediction results (R2 = 0.47, RMSE = 0.89, RPD = 1.40), followed by PLS-Pb (R2 = 0.43, RMSE = 0.90, RPD = 1.37), and ResNet-Pb had the worst performance (R2 = 0.30, RMSE = 1.05, RPD = 1.25). The R2 of ResNet-Pb was 0.17 and 0.13 lower than those of SVM-Pb and PLS-Pb, respectively. This indicates that, in this small sample target domain, the deep learning model did not perform as well as the traditional methods. Moreover, the R2 of all models is below 0.5, suggesting that direct modeling has limited applicability in small sample target domains.

3.4. Performance of the Transfer Learning Model

The results of 10 rounds of random sampling tests for the transfer learning model ResNet-pH-Pb are shown in Figure 5, yielding R2 = 0.77, RMSE = 0.59, and RPD = 2.12, significantly outperforming the direct modeling methods. The R2 box plot in Figure 6 shows that the average R2 of ResNet-pH-Pb is 0.47, 0.34, and 0.30 higher than those of ResNet-Pb, PLS-Pb, and SVM-Pb, respectively. This indicates that transfer learning not only improves Pb content prediction accuracy but also significantly enhances model stability and generalization.

3.5. Wavelength Contribution in ResNet Modeling

To explore the contribution of different wavelengths in ResNet modeling, SHAP values were used to interpret the source domain model ResNet-pH and the transfer learning model ResNet-pH-Pb, as shown in Figure 7.
In the ResNet-pH model (Figure 7a), the wavelengths with significant contributions to pH prediction are mainly concentrated at 629 nm, 689 nm, 729 nm, 1560 nm, 1650 nm, 1680 nm, 1750 nm, 1789 nm, and other positions, suggesting that these wavelengths may be closely related to the spectral features of soil pH.
In contrast, in the ResNet-pH-Pb model (Figure 7b), the wavelength contributions have shifted. New key wavelengths have emerged, including 610 nm, 659 nm, 717 nm, and 759 nm in the visible range, and 919 nm, 959 nm, 1639 nm, and 1670 nm in the near-infrared range. Meanwhile, some key wavelengths from the source domain model, like 689 nm and 1610 nm, are still retained. This indicates that in the Pb prediction task, the model not only inherited part of the spectral information from the pH modeling but also focused on new wavelength regions related to Pb, especially in the near-infrared range (e.g., 919 nm and 959 nm).
Overall, the transfer learning model ResNet-pH-Pb has undergone significant adjustments in its wavelength contributions. This indicates that while the spectral behavior of soil Pb is partially related to pH, they are not identical. With the introduction of transfer learning, the model is able to more effectively focus on the important wavelengths needed for Pb prediction, enhancing the reliability of Pb content prediction.

4. Discussion

4.1. Improvement in Prediction Performance for Small Sample Target Domains Using Transfer Learning

In soil spectroscopy, the accuracy of Pb prediction is often limited by small sample sizes and high heterogeneity. In this study, direct modeling was conducted on a target domain with 130 samples from the Gejiu region. The findings indicate that the R2 values for SVM-Pb, PLS-Pb, and ResNet-Pb are 0.47, 0.43, and 0.30, respectively. This indicates that both traditional modeling methods and deep learning approaches struggle to achieve satisfactory Pb prediction results under small sample conditions. These findings are consistent with existing research. Specifically, Tan et al. (2021) applied CARS in feature selection and PLS in modeling to predict soil Pb content, achieving a maximum R2 of 0.60 in the validation set [37]. Arif et al. (2022) selected feature wavelengths and applied PLS for modeling, achieving an optimal R2 of 0.66 [34]. Chen et al. (2022) used fractional-order derivatives for feature selection and combined PLS and SVM regression to predict Pb content, with R2 ranging between 0.54 and 0.59, which is insufficient for high-accuracy predictions [7].
In contrast, the transfer learning model ResNet-pH-Pb in this study significantly improved Pb prediction performance in the target domain (R2 = 0.77). The R2 box plot (Figure 6) shows that the R2 distribution of ResNet-pH-Pb is more stable, with improvements of 0.30, 0.34, and 0.47 over SVM-Pb, PLS-Pb, and ResNet-Pb, respectively. Although there remains room for improvement, an R2 of 0.77 is notable given the small sample size and high heterogeneity of the Gejiu dataset, surpassing typical accuracies of 0.54–0.66 reported in previous studies [7,34,37]. Transfer learning has been explored in soil spectroscopy to some extent. For instance, Kok et al. (2024) applied transfer learning to improve pH prediction, achieving an R2 of 0.66 [15]. However, most of these studies focus on predicting single soil properties, with fewer studies addressing cross-property transfer learning applications. In this study, we innovatively implemented cross-component transfer from pH to Pb, leveraging the high accuracy of the source domain model ResNet-pH (R2 = 0.91). By sharing spectral features (e.g., both pH and Pb are governed by soil organic matter and iron oxides), the model reduces overfitting in the small sample target domain and improves its generalization ability.

4.2. Feature Analysis of Wavelength Contribution

Wavelength contribution analysis further reveals the predictive mechanism of transfer learning. Here, SHAP values were employed to evaluate wavelength contributions in the source domain model ResNet-pH (Figure 7a) and the transfer learning model ResNet-pH-Pb (Figure 7b). It was found that both models showed high contributions in the 550–750 nm and 1600–1700 nm bands. These bands are associated with the absorption characteristics of soil organic matter, iron oxides (550–750 nm), and water and organic matter (1600–1700 nm) [38], while organic matter and iron oxides are the main adsorbents of Pb [33]. This finding suggests that, although pH and Pb are different components, the spectral features learned by the source domain model can still be effectively transferred to the target domain. This feature sharing originates from the potential correlation between pH and Pb in soil. In the tin mining area, the soil pH is typically acidic, primarily driven by sulfide mineral oxidation generating sulfuric acid [3,39,40]. Low pH values lead to the dissolution of iron oxides, releasing adsorbed Pb, which increases its mobility [41,42]. This acidic environment provides the chemical basis for cross-component transfer, and the spectral features learned by the source domain model ResNet-pH can effectively transfer to the target domain, alleviating overfitting in small sample scenarios and significantly improving Pb prediction accuracy.
Although pH and Pb predictions share some key wavelengths, transfer learning also optimizes the model’s adaptability to the target domain. The contribution of ResNet-pH-Pb significantly increased in the 919 nm, 959 nm, 1639 nm, and 1670 nm bands. The 919 nm and 959 nm bands align with absorption features of clay minerals (O-H stretching) and carbonates (C-O stretching), as documented in soil spectroscopy studies [43,44,45]. These components are critical for Pb prediction, as Pb often forms lead carbonates or is adsorbed onto clay surfaces in contaminated soils [46]. The 1639 nm and 1670 nm bands correspond to water (H-O-H bending) and organic matter (C-H stretching) absorptions, consistent with Pb hydrolysis and organic matter complexation in soils [43,47]. Furthermore, the SHAP contribution to Pb prediction slightly increased in the visible light region (600–700 nm), potentially linked to enhanced spectral interactions of Pb with soil iron oxides and organic matter [33]. Existing studies have also shown that key wavelengths for Pb prediction typically include 600–800 nm, 1390–1460 nm and 1870–1960 nm [48,49]. The 550–750 nm and 1600–1700 nm bands identified here align closely with prior research, confirming their association with Pb adsorption mechanisms. Compared with the SHAP values analyzed for single-component predictions by Zhong et al. (2024), this study innovatively reveals the feature sharing and adaptation mechanisms of cross-component transfer through SHAP value analysis [22].
To explore alternative preprocessing methods for making the Gejiu Pb dataset (skewness = 3.12, kurtosis = 9.1) more normally distributed, we applied the Box–Cox transformation and compared its performance with the logarithmic transformation used in Section 3.4. The Box–Cox-transformed model (ResNet-pH-Pb) yielded an average R2 of 0.65, RMSE of 0.11, and RPD of 2.10, as shown in the scatter plot of predicted versus observed Pb values in Figure 8. Compared to the logarithmic transformation (R2 = 0.77, RMSE = 0.59, RPD = 2.12), the Box–Cox transformation yielded lower R2 and RPD values, indicating reduced explained variance and prediction reliability. Although the RMSE was significantly lower, it is not directly comparable due to differences in the value ranges after transformation. These findings indicate that the logarithmic transformation is more suitable for this dataset.
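The reason the two RMSE values cannot be compared directly is that the transformations map Pb concentrations onto differently scaled axes. The sketch below uses the standard Box–Cox definition, of which the logarithm is the λ = 0 case. The λ fitted in the study is not reported, so the value of −0.2 is purely hypothetical, chosen only to show how much narrower the transformed range can become.

```python
from math import log


def box_cox(y, lam):
    """Box-Cox transform of a positive value: (y**lam - 1) / lam, or ln(y) if lam == 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return log(y)
    return (y ** lam - 1) / lam


pb_min, pb_max = 34.6, 9720.0  # reported Pb range in mg/kg
lam = -0.2                     # hypothetical lambda, not taken from the study

log_span = log(pb_max) - log(pb_min)
bc_span = box_cox(pb_max, lam) - box_cox(pb_min, lam)
print(round(log_span, 2), round(bc_span, 2))
```

With the targets compressed into a narrower interval, the same relative prediction quality produces a numerically smaller RMSE, which is why scale-free measures such as R2 and RPD are the appropriate basis for comparing the two transformations.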

4.3. Limitations and Future Directions

This study demonstrated the effectiveness of the transfer learning approach in achieving relatively high accuracy in Pb prediction. However, the dataset is limited to 130 samples from a single mining area, which may restrict the model’s ability to generalize to other regions with different soil types and contamination patterns. Future work should incorporate multi-site datasets to improve robustness across diverse soil types. Integrating additional environmental variables, such as soil moisture and proximity to pollution sources, could further enhance prediction accuracy. Additionally, extending the model to predict other heavy metals (e.g., Cd, Zn) and incorporating soil physicochemical properties (e.g., organic matter content) could broaden its utility for environmental monitoring.

5. Conclusions

This study successfully achieved cross-component transfer prediction from pH to Pb and, for the first time, utilized SHAP values to analyze wavelength contributions in transfer learning models, innovatively broadening the use of soil spectroscopy for heavy metal assessment. The source domain model ResNet-pH demonstrated high accuracy on the LUCAS dataset (R2 = 0.91). In the target domain of the Gejiu mining area, traditional direct modeling methods (SVM-Pb, PLS-Pb, ResNet-Pb) showed low prediction performance (R2 < 0.5 for all). The transfer learning model ResNet-pH-Pb significantly improved prediction accuracy (R2 = 0.77), with R2 values 0.30, 0.34, and 0.47 higher than SVM-Pb, PLS-Pb, and ResNet-Pb, respectively, confirming the advantages of transfer learning in small sample target domains. Wavelength contribution analysis revealed high contributions in the 550–750 nm and 1600–1700 nm bands, further elucidating the feature-sharing mechanism between pH and Pb in highly heterogeneous soils (CV = 2.02, skewness = 3.12). This study demonstrates that transfer learning methods not only markedly enhance the prediction accuracy of small sample target domains but also provide a robust and efficient approach for evaluating soil heavy metal contamination, offering a rapid, non-destructive tool with significant potential for contamination monitoring and environmental management in mining regions.

Author Contributions

Conceptualization, P.H., X.C., X.W. and Y.C.; Methodology, P.H., X.W. and Y.C.; Software, P.H., X.C., Y.Y. and Z.C.; Validation, P.H., X.C., Y.Y. and Z.C.; Formal analysis, P.H., X.C., X.W., Y.Y. and Y.C.; Investigation, P.H., X.C., Y.Y. and Z.C.; Resources, P.H. and X.C.; Data curation, P.H., X.C. and Z.C.; Writing—original draft, P.H., X.C., X.W. and Y.C.; Writing—review & editing, P.H., X.C., X.W. and Y.C.; Visualization, P.H. and Z.C.; Supervision, X.W. and Y.C.; Project administration, X.C.; Funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the Key Project of Sustainable Development International Cooperation Program by NSFC (Grant No. 42361144883), the National Natural Science Foundation of China (42271422), the Engineering Center of Yunnan Education Department for Health Geological Survey & Evaluation (9135009009), and the Science and Technology Innovation Team for Highland Ecological Agriculture Geological Survey and Evaluation of Yunnan Education Department.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT to assist with language polishing in order to improve clarity and readability. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gholizadeh, A.; Borůvka, L.; Saberioon, M.M.; Kozák, J.; Vašát, R.; Němeček, K. Comparing different data preprocessing methods for monitoring soil heavy metals based on soil spectral features. Soil Water Res. 2015, 10, 218–227.
  2. Bradl, H.B. Adsorption of heavy metal ions on soils and soils constituents. J. Colloid Interface Sci. 2004, 277, 1–18.
  3. Li, Z.; Ma, Z.; van der Kuijp, T.J.; Yuan, Z.; Huang, L. A review of soil heavy metal pollution from mines in China: Pollution and health risk assessment. Sci. Total Environ. 2014, 468–469, 843–853.
  4. Luo, X.; Wu, C.; Lin, Y.; Li, W.; Deng, M.; Tan, J.; Xue, S. Soil heavy metal pollution from Pb/Zn smelting regions in China and the remediation potential of biomineralization. J. Environ. Sci. 2023, 125, 662–677.
  5. Guo, G.; Zhang, D.; Wang, Y. Probabilistic Human Health Risk Assessment of Heavy Metal Intake via Vegetable Consumption around Pb/Zn Smelters in Southwest China. Int. J. Environ. Res. Public Health 2019, 16, 3267.
  6. Shi, T.; Chen, Y.; Liu, Y.; Wu, G. Visible and near-infrared reflectance spectroscopy—An alternative for monitoring soil contamination by heavy metals. J. Hazard. Mater. 2014, 265, 166–176.
  7. Chen, L.; Lai, J.; Tan, K.; Wang, X.; Chen, Y.; Ding, J. Development of a soil heavy metal estimation method based on a spectral index: Combining fractional-order derivative pretreatment and the absorption mechanism. Sci. Total Environ. 2022, 813, 151882.
  8. He, P.; Cheng, X.; Wen, X.; Cao, Y.; Chen, Y. Improving Soil Heavy Metal Lead Inversion Through Combined Band Selection Methods: A Case Study in Gejiu City, China. Sensors 2025, 25, 684.
  9. Wang, J.; Cui, L.; Gao, W.; Shi, T.; Chen, Y.; Gao, Y. Prediction of low heavy metal concentrations in agricultural soils using visible and near-infrared reflectance spectroscopy. Geoderma 2014, 216, 1–9.
  10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  11. Zeng, P.; Song, X.; Yang, H.; Wei, N.; Du, L. Digital Soil Mapping of Soil Organic Matter with Deep Learning Algorithms. ISPRS Int. J. Geo-Inf. 2022, 11, 299.
  12. Ng, W.; Minasny, B.; Mendes, W.d.S.; Demattê, J.A.M. The influence of training sample size on the accuracy of deep learning models for the prediction of soil properties with near-infrared spectroscopy data. Soil 2020, 6, 565–578.
  13. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359.
  14. Yang, J.; Wang, X.; Wang, R.; Wang, H. Combination of Convolutional Neural Networks and Recurrent Neural Networks for predicting soil properties using Vis–NIR spectroscopy. Geoderma 2020, 380, 114616.
  15. Kok, M.; Sarjant, S.; Verweij, S.; Vaessen, S.F.C.; Ros, G.H. On-site soil analysis: A novel approach combining NIR spectroscopy, remote sensing and deep learning. Geoderma 2024, 446, 116903.
  16. Padarian, J.; Minasny, B.; McBratney, A.B. Transfer learning to localise a continental soil vis-NIR calibration model. Geoderma 2019, 340, 279–288.
  17. Zhong, X.; Chen, Z.; Li, Y.; Ding, K.; Liu, W.; Liu, Y.; Yuan, Y.; Zhang, M.; Baker, A.J.M.; Yang, W.; et al. Factors influencing heavy metal availability and risk assessment of soils at typical metal mines in Eastern China. J. Hazard. Mater. 2020, 400, 123289.
  18. Luo, X.-s.; Yu, S.; Li, X.-d. The mobility, bioavailability, and human bioaccessibility of trace metals in urban soils of Hong Kong. Appl. Geochem. 2012, 27, 995–1004.
  19. Caporale, A.G.; Violante, A. Chemical Processes Affecting the Mobility of Heavy Metals and Metalloids in Soil Environments. Curr. Pollut. Rep. 2016, 2, 15–27.
  20. Ke, Z.; Ren, S.; Yin, L. Advancing soil property prediction with encoder-decoder structures integrating traditional deep learning methods in Vis-NIR spectroscopy. Geoderma 2024, 449, 117006.
  21. Haghi, R.K.; Pérez-Fernández, E.; Robertson, A.H.J. Prediction of various soil properties for a national spatial dataset of Scottish soils based on four different chemometric approaches: A comparison of near infrared and mid-infrared spectroscopy. Geoderma 2021, 396, 115071.
  22. Zhong, L.; Guo, X.; Ding, M.; Ye, Y.; Jiang, Y.; Zhu, Q.; Li, J. SHAP values accurately explain the difference in modeling accuracy of convolution neural network between soil full-spectrum and feature-spectrum. Comput. Electron. Agric. 2024, 217, 108627.
  23. Li, C.; Song, L.; Zheng, L.; Ji, R. DSCformer: Lightweight model for predicting soil nitrogen content using VNIR-SWIR spectroscopy. Comput. Electron. Agric. 2025, 230, 109761.
  24. Mkhatshwa, J.; Kavu, T.; Daramola, O. Analysing the Performance and Interpretability of CNN-Based Architectures for Plant Nutrient Deficiency Identification. Computation 2024, 12, 113.
  25. Albinet, F.; Peng, Y.; Eguchi, T.; Smolders, E.; Dercon, G. Prediction of exchangeable potassium in soil through mid-infrared spectroscopy and deep learning: From prediction to explainability. Artif. Intell. Agric. 2022, 6, 230–241.
  26. Orgiazzi, A.; Ballabio, C.; Panagos, P.; Jones, A.; Fernández-Ugalde, O. LUCAS Soil, the largest expandable soil dataset for Europe: A review. Eur. J. Soil Sci. 2017, 69, 140–153.
  27. Panagos, P.; Meusburger, K.; Ballabio, C.; Borrelli, P.; Alewell, C. Soil erodibility in Europe: A high-resolution dataset based on LUCAS. Sci. Total Environ. 2014, 479–480, 189–200.
  28. Liu, B.; Guo, B.; Zhuo, R.; Dai, F. Estimation of soil organic carbon in LUCAS soil database using Vis-NIR spectroscopy based on hybrid kernel Gaussian process regression. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2024, 321, 124687.
  29. Li, H.; Yao, J.; Min, N.; Duran, R. Comprehensive assessment of environmental and health risks of metal(loid)s pollution from non-ferrous metal mining and smelting activities. J. Clean. Prod. 2022, 375, 134049.
  30. Cheng, X.; Chen, Z.; Zhou, X.; Huang, Q.; Shen, J.; Chen, Y.; Hou, M.; Xiong, J. Evaluation of Contamination and Ecological and Health Risk in Surface Soil and Crops Contaminated with Metalloids and Heavy Metals in Datun, China, “The World Tin Capital”. Soil. Sediment. Contam. Int. J. 2024.
  31. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 4765–4774.
  32. Wu, Y.; Chen, J.; Wu, X.; Tian, Q.; Ji, J.; Qin, Z. Possibilities of reflectance spectroscopy for the assessment of contaminant elements in suburban soils. Appl. Geochem. 2005, 20, 1051–1059.
  33. Zhou, M.; Zou, B.; Tu, Y.; Feng, H.; He, C.; Ma, X.; Ning, J. Spectral response feature bands extracted from near standard soil samples for estimating soil Pb in a mining area. Geocarto Int. 2022, 37, 13248–13267.
  34. Arif, M.; Qi, Y.; Dong, Z.; Wei, H. Rapid retrieval of cadmium and lead content from urban greenbelt zones using hyperspectral characteristic bands. J. Clean. Prod. 2022, 374, 133922.
  35. Khosravi, V.; Doulati Ardejani, F.; Yousefi, S.; Aryafar, A. Monitoring soil lead and zinc contents via combination of spectroscopy with extreme learning machine and other data mining methods. Geoderma 2018, 318, 29–41.
  36. Viscarra Rossel, R.A.; Walvoort, D.J.J.; McBratney, A.B.; Janik, L.J.; Skjemstad, J.O. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma 2006, 131, 59–75.
  37. Tan, K.; Ma, W.; Chen, L.; Wang, H.; Du, Q.; Du, P.; Yan, B.; Liu, R.; Li, H. Estimating the distribution trend of soil heavy metals in mining area from HyMap airborne hyperspectral imagery based on ensemble learning. J. Hazard. Mater. 2021, 401, 123288.
  38. Rathod, P.H.; Rossiter, D.G.; Noomen, M.F.; van der Meer, F.D. Proximal Spectral Sensing to Monitor Phytoremediation of Metal-Contaminated Soils. Int. J. Phytoremediation 2013, 15, 405–426.
  39. Liu, J.; Li, X.; Zhang, P.; Zhu, Q.; Lu, W.; Yang, Y.; Li, Y.; Zhou, J.; Wu, L.; Zhang, N.; et al. Contamination levels of and potential risks from metal(loid)s in soil-crop systems in high geological background areas. Sci. Total Environ. 2023, 881, 163405.
  40. Ashraf, M.A.; Maah, M.J.; Yusoff, I. Heavy metals accumulation in plants growing in ex tin mining catchment. Int. J. Environ. Sci. Technol. 2011, 8, 401–416.
  41. Vega, F.A.; Covelo, E.F.; Andrade, M.L. Competitive sorption and desorption of heavy metals in mine soils: Influence of mine soil characteristics. J. Colloid Interface Sci. 2006, 298, 582–592.
  42. Jiang, H.; Li, T.; Han, X.; Yang, X.; He, Z. Effects of pH and low molecular weight organic acids on competitive adsorption and desorption of cadmium and lead in paddy soils. Environ. Monit. Assess. 2012, 184, 6325–6335.
  43. Lu, Q.; Wang, S.; Bai, X.; Liu, F.; Wang, M.; Wang, J.; Tian, S. Rapid inversion of heavy metal concentration in karst grain producing areas based on hyperspectral bands associated with soil components. Microchem. J. 2019, 148, 404–411.
  44. Viscarra Rossel, R.A.; Behrens, T. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma 2010, 158, 46–54.
  45. Viscarra Rossel, R.A.; Behrens, T.; Ben-Dor, E.; Brown, D.J.; Demattê, J.A.M.; Shepherd, K.D.; Shi, Z.; Stenberg, B.; Stevens, A.; Adamchuk, V.; et al. A global spectral library to characterize the world’s soil. Earth-Sci. Rev. 2016, 155, 198–230.
  46. Wang, Y.; Zou, B.; Chai, L.; Lin, Z.; Feng, H.; Tang, Y.; Tian, R.; Tu, Y.; Zhang, B.; Zou, H. Monitoring of soil heavy metals based on hyperspectral remote sensing: A review. Earth-Sci. Rev. 2024, 254, 104814.
  47. Stenberg, B.; Viscarra Rossel, R.A.; Mouazen, A.M.; Wetterlind, J. Chapter Five—Visible and Near Infrared Spectroscopy in Soil Science. In Advances in Agronomy; Sparks, D.L., Ed.; Academic Press: Cambridge, MA, USA, 2010; Volume 107, pp. 163–215.
  48. Wang, Y.; Zhang, X.; Sun, W.; Wang, J.; Ding, S.; Liu, S. Effects of hyperspectral data with different spectral resolutions on the estimation of soil heavy metal content: From ground-based and airborne data to satellite-simulated data. Sci. Total Environ. 2022, 838, 156129.
  49. Hong, Y.; Shen, R.; Cheng, H.; Chen, Y.; Zhang, Y.; Liu, Y.; Zhou, M.; Yu, L.; Liu, Y.; Liu, Y. Estimating lead and zinc concentrations in peri-urban agricultural soils through reflectance spectroscopy: Effects of fractional-order derivative and random forest. Sci. Total Environ. 2019, 651, 1969–1982.
Figure 1. Distribution of 130 soil samples in the target domain (Gejiu mining area, Yunnan, China).
Figure 2. Architecture of the 1D-ResNet model used for predicting soil pH on the LUCAS dataset and applying transfer learning for Pb estimation in the Gejiu dataset.
Figure 3. Predicted vs. observed pH for ResNet-pH model based on 10 random sampling tests. The average evaluation results of 10 test rounds are displayed in the upper-left section of the figure.
Figure 4. Predicted vs. observed Pb values for ResNet, PLS, and SVM models based on 10 random sampling tests in the target domain. The average evaluation metrics are shown in the upper-left corner.
Figure 5. Predicted versus observed Pb values from the transfer learning model (ResNet-pH-Pb) based on 10 random sampling tests. The average evaluation metrics across tests are shown in the upper-left corner.
Figure 6. Boxplot comparison of R² values for Pb prediction among the ResNet-pH-Pb, PLS-Pb, SVM-Pb, and ResNet-Pb models.
Figure 7. Contribution of different wavelengths to model predictions, highlighting key spectral features extracted by ResNet-pH and ResNet-pH-Pb for pH and Pb estimation.
Figure 8. Predicted versus observed Pb values from the transfer learning model (ResNet-pH-Pb) with Box–Cox transformation based on 10 random sampling tests. The average evaluation metrics across tests are shown in the upper-left corner.
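Table 1. ResNet model architecture parameters.

| Layer | Type | Filters | Kernel Size | Stride | Width | Number of Parameters | Activation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Input | - | - | - | 2100 | 0 | - |
| 2 | AvgPool | - | 10 × 1 | | 210 | 0 | - |
| 3 | Convolutional | 48 | 3 × 1 | 1 | 210 | 192 | Leaky ReLU (alpha = 0.01) |
| 4 | Max pooling | - | 2 × 1 | - | 105 | 0 | - |
| 5 | Residual Block | 48 | 3 × 1 | 1 | 105 | 6960 | ReLU |
| | | 48 | 3 × 1 | 1 | 105 | 6960 | - |
| 6 | Residual Block | 48 | 3 × 1 | 1 | 105 | 6960 | ReLU |
| | | 48 | 3 × 1 | 1 | 105 | 6960 | |
| 7 | Convolutional | 32 | 3 × 1 | 1 | 105 | 4640 | Leaky ReLU (alpha = 0.01) |
| 8 | Max pooling | - | 2 × 1 | - | 52 | 0 | - |
| 9 | Flatten | - | - | - | 1664 | 0 | - |
| 10 | Dense (Fully connected) | - | - | - | 16 | 16,640 | Leaky ReLU (alpha = 0.01) |
| 11 | Dense (Fully connected) | - | - | - | 10 | 170 | Leaky ReLU (alpha = 0.01) |
| 12 | Output | - | - | - | 1 | 11 | ReLU |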
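The widths and parameter counts in Table 1 can be cross-checked with the standard layer formulas. The sketch below is a minimal check under several assumptions that the table does not state explicitly: every layer has a bias term, convolutions use "same" padding so they preserve width, and pooling is non-overlapping with stride equal to pool size. The helper names are illustrative and not taken from the authors' code.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a 1D convolution."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_features, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return in_features * units + units


def pooled_width(width, pool_size):
    """Output width of non-overlapping pooling (stride assumed = pool size)."""
    return width // pool_size


# Widths through the network, starting from the 2100-point input.
w_avg = pooled_width(2100, 10)    # layer 2
w_max1 = pooled_width(w_avg, 2)   # layer 4
w_max2 = pooled_width(w_max1, 2)  # layer 8
flat = w_max2 * 32                # layer 9: width x filters of layer 7

counts = {
    "conv_layer_3": conv1d_params(3, 1, 48),
    "residual_conv": conv1d_params(3, 48, 48),  # each conv inside layers 5 and 6
    "conv_layer_7": conv1d_params(3, 48, 32),
    "dense_layer_10": dense_params(flat, 16),
    "dense_layer_11": dense_params(16, 10),
    "output_layer_12": dense_params(10, 1),
}

if __name__ == "__main__":
    print("widths:", w_avg, w_max1, w_max2, flat)
    for name, n in counts.items():
        print(f"{name}: {n}")
```

Under these assumptions the widths (210, 105, 52, 1664) and the convolutional, second dense, and output counts (192, 6960, 4640, 170, 11) agree with Table 1. For the 16-unit dense layer the same formula gives 26,640 rather than the listed 16,640, so that row may follow a different convention or configuration.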
Table 2. Training and layer freezing details of ResNet models.

| Model | Dataset | Training Samples | Testing Samples | Trained Layers | Frozen Layers |
| --- | --- | --- | --- | --- | --- |
| ResNet-pH | LUCAS | 14,277 | 4759 | All | None |
| ResNet-Pb | Gejiu | 98 | 32 | All | None |
| ResNet-pH-Pb | Gejiu | 98 | 32 | Max-pooling, Dense | Convolutional, Residual |
Table 3. Summary Statistics of Soil Pb Concentrations (Gejiu) and Soil pH Values (LUCAS).

| Dataset | Sample Size | Property | Min | Max | Mean | SD | CV | Skew | Kurt | Median |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LUCAS | 19,036 | pH | 3.21 | 10.08 | 6.02 | 1.35 | 0.22 | −0.07 | −1.24 | 6.21 |
| Gejiu | 130 | Pb | 34.6 | 9720 | 974.06 | 1969.17 | 2.02 | 3.12 | 9.1 | 232 |
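The CV column in Table 3 follows directly from SD divided by mean, which the short sketch below reproduces. It also shows one common skewness estimator, the biased Fisher-Pearson form m3 / m2^1.5. This is an assumption, because the estimator behind the tabulated skewness is not specified, and the `toy` list is hypothetical data rather than the Gejiu measurements.

```python
def coefficient_of_variation(sd, mean):
    """CV as the ratio of standard deviation to mean."""
    return sd / mean


def sample_skewness(values):
    """Biased Fisher-Pearson skewness: third central moment / variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# SD and mean taken from Table 3.
cv_pb = coefficient_of_variation(1969.17, 974.06)  # Gejiu Pb
cv_ph = coefficient_of_variation(1.35, 6.02)       # LUCAS pH

# Hypothetical right-skewed values, used only to exercise the function.
toy = [40, 60, 120, 230, 400, 900, 9000]

if __name__ == "__main__":
    print(f"CV Pb = {cv_pb:.2f}, CV pH = {cv_ph:.2f}")
    print(f"toy skewness = {sample_skewness(toy):.2f}")
```

A few very large values dominate both the CV and the skewness in data like this, which is the pattern the Pb row of Table 3 shows.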
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

He, P.; Cheng, X.; Wen, X.; Yi, Y.; Chen, Z.; Chen, Y. Transfer Learning-Based Interpretable Soil Lead Prediction in the Gejiu Mining Area, Yunnan. Sensors 2025, 25, 4209. https://doi.org/10.3390/s25134209
