Article

Using Machine Learning Algorithms to Clarify Relationships Between Soil Properties and Lead Stomach Bioaccessibility

1
Department of Biological Sciences, Michigan Technological University, 1400 Townsend Drive, Houghton, MI 49931, USA
2
Department of Civil, Environmental and Ocean Engineering, Stevens Institute of Technology, 1 Castle Point on Hudson, Hoboken, NJ 07030, USA
3
Department of Biomedical Engineering, Michigan Technological University, 1400 Townsend Drive, Houghton, MI 49931, USA
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3504; https://doi.org/10.3390/app16073504
Submission received: 18 February 2026 / Revised: 18 March 2026 / Accepted: 28 March 2026 / Published: 3 April 2026

Featured Application

This paper has important applications in environmental health risk assessment and urban soil remediation planning. By demonstrating how machine learning can predict lead bioaccessibility in lead paint-contaminated soil, the study provides a scalable, cost-effective alternative to laboratory-based extraction methods. Such models could support rapid screening of contaminated sites, helping prioritize high-risk areas for intervention and allocate remediation resources more efficiently. This model is expected to advance data-driven decision-making for managing lead-contaminated soils and protect vulnerable urban populations from exposure.

Abstract

Lead contamination in urban soils, primarily from deteriorating lead-based paint, poses a significant health risk in the United States. These soils often serve as major sources of exposure, making them critical targets for remediation efforts. To guide such strategies, preliminary risk assessments are necessary to evaluate lead bioaccessibility in the soil and identify key soil properties influencing lead speciation. In this study, a novel machine learning approach was co-developed with an artificial intelligence assistant, Claude Sonnet, developed by Anthropic, to design a predictive model that overcomes the difficulties of conducting experimental bioaccessibility models. Data was compiled from published sources (n = 640), as well as from an internal analysis of soils sampled across three large cities in the United States (n = 30), with the latter used as a validation set. While our final model’s prediction accuracy was good (R2 = 0.95), it initially did not perform as expected on our internal dataset, indicating a fundamental domain shift. Further analysis revealed complications with outliers, data availability, and data consistency that resulted in poor performance. When optimization was applied to the validation model, our final prediction accuracy improved (R2 = 0.84). We highlight the importance of data availability and consistency in heavy-metal soil bioaccessibility studies for building a generalizable predictive model.

Graphical Abstract

1. Introduction

Lead is a pervasive environmental toxin affecting thousands of children across the United States [1]. It is relatively stable under most environmental conditions, meaning that lead in soils can remain a persistent source of exposure long after its initial release. This contributes to ongoing exposure in many areas today, even after its elimination from products like paint, gasoline, and food cans [2]. Although the use of lead-based paint was banned in the U.S. in 1971, many older homes still pose a significant risk of lead exposure, even after abatement efforts [3]. The maximum allowable soil lead concentration is 400 mg/kg [4]. Lead-contaminated soil, especially bare soil, is a major source of exposure, as studies have shown that soil lead levels are strongly correlated with elevated blood lead levels in children [5]. Childhood lead poisoning remains a widespread issue, with the Centers for Disease Control and Prevention (CDC) lowering the blood lead reference value from 5.0 µg/dL to 3.5 µg/dL in 2021 [6]. The primary route of environmental lead exposure is through ingestion of chipped lead-based paint and dust in soil or on surfaces in older homes [7]. The severity of health outcomes is linked to elevated blood lead levels (EBLLs), which in turn correlate strongly with the amount of bioaccessible lead in the soil. Several soil properties, such as pH, electrical conductivity, organic-matter content, and texture, influence the bioaccessibility of lead. Identifying soil characteristics that influence bioaccessibility is a critical first step in assessing risk and requires further investigation [8]. While these properties have been studied in relation to bioavailability [9], their effect on in vitro stomach bioaccessibility has not been fully explored. Understanding how these factors impact bioaccessibility in the stomach is important because in vitro stomach bioaccessibility correlates well with in vivo studies using juvenile swine models. Therefore, it is crucial to explore strategies that reduce lead bioaccessibility in soil, which would help mitigate human exposure risk [10].
Soil pH plays a crucial role in lead bioavailability, much like gastric pH does in the body. Under acidic conditions, lead species are more stabilized and soluble, increasing their bioaccessibility, whereas in alkaline conditions, they become less bioavailable due to precipitation or complexation [11]. This relationship between pH and bioaccessibility is well-documented, with soil pH showing a negative correlation with lead bioaccessibility [12].
Soil texture, defined by the proportions of clay, sand, and silt, affects the soil’s cation-exchange capacity (CEC) and is another critical factor for lead retention. Clay minerals, particularly phyllosilicates, facilitate the adsorption and structural incorporation of heavy metals through oxyhydroxide reactions [13]. Sand and silt, primarily composed of quartz and other minerals, also contribute to cation exchange, although less effectively than clay. Lead is known to create surface complexes with iron (oxy)oxides, such as hematite and goethite, and is retained over time in crystalline iron minerals [14]. Studies consistently show a negative correlation between clay content and lead bioaccessibility, attributed to the high adsorption capacity of clay minerals [9,15,16]. A linear model moderately predicted bioaccessible lead (R2 = 0.35), but the correlation improved significantly when the source of lead was factored in (R2 = 0.86), emphasizing the importance of considering soil particle size and lead source in predictive models.
Soil organic matter (OM), composed of decayed biological material and microbial byproducts, interacts with lead through complexation with organic acids. The stability of these complexes depends on factors like organic acid concentration and soil [17]. Lead has a high affinity for negatively charged functional groups, especially those containing oxygen [18]. Lead sorption by organic ligands is most effective when organic acid concentrations are relatively low, and pH is favorable for complexation [19]. A higher organic-matter content is typically associated with reduced lead bioaccessibility due to the strong binding of lead to organic compounds [20].
Electrical conductivity (EC) measures the concentration of dissolved salts, including non-hydrolyzing cations like Ca2+, Mg2+, Na+, and K+, along with anions such as Cl−, SO42−, HCO3−, and NO3−. EC is directly related to soil salt content and has been found to negatively correlate with lead bioaccessibility [21]. Higher salt concentrations in soils promote the formation of insoluble lead compounds, such as lead carbonate and Fe-Mn oxides, thus reducing bioavailability [20].
Soil’s heterogeneous nature complicates the analysis of lead bioaccessibility. Machine learning (ML) regression techniques offer powerful tools for uncovering correlations between soil properties and bioaccessibility. ML is a fast-growing and popular application of AI in soil science [22]. Random forest and neural networks have been utilized for soil moisture and soil–water retention, respectively [22]. Multiple linear regression (MLR) can only identify linear relationships between variables; non-linear relationships can be captured only after the data have been transformed to linearize them. ML techniques such as tree-based or gradient-based models can offer greater predictive accuracy on non-linear, non-normal datasets, which are common in nature. Model inspection techniques, like feature importance, help identify the variables that contribute most to the model’s performance and are therefore vital to estimating the target value. When applied to the analysis of lead bioaccessibility across a variety of soil properties, ML methods are expected to be more robust predictive tools than MLR.
These techniques have been applied in previous publications to examine the role of environmental spatial [23] or experimental factors [24], in addition to soil chemistry, that affect heavy-metal bioaccessibility. A recent publication found that soil properties contributed to 88.1% of their bioaccessible lead predictions, while aging factors and experimental factors contributed the remainder [24]. Their model was built from n = 157 datapoints, compared with over 300 unique lead datapoints in the only other similar published model [23], and achieved a high performance (R2 = 0.95) using a robust meta-learner on bootstrapped cross-validated datasets. While these models are highly accurate, they have yet to be validated against experimental data on soil samples using almost exclusively soil properties. To have any utility as a predictive model for real-world applications, a model must be tested on unique datapoints that it has not seen beforehand. The objective of this study was to co-develop an ML regression model to create a generalizable predictive tool for lead stomach bioaccessibility. This study focused on identifying key soil characteristics that influence bioaccessible lead in urban soils and developing a predictive model to assess these factors. Large language models (LLMs) were employed to help write the codebase, while also contributing suggestions for the model and the statistical tests. Claude Sonnet was selected for this aim due to its high performance in solving scientific machine learning problems [25], as well as its strong domain-specific performance on soil science, compared to other known LLMs [26]. Once model training was optimized with the aid of Claude Sonnet, the best-performing model was assessed for its predictive accuracy using an unseen, isolated dataset. Our report highlights the exploration of machine learning techniques and our attempt to build a comprehensive model for practical application as a predictive tool.

2. Materials and Methods

Soil samples were collected from 30 residential sites across three cities: Detroit, MI; San Antonio, TX; and Baltimore, MD. Sampling was conducted near exterior walls where deteriorated paint was present, at a depth of 0–15 cm. The surface soil from these locations was dried and sieved in preparation for detailed soil characterization. Sampling sites were selected based on preliminary data from a portable X-ray Fluorescence (XRF) device (pXRF, Niton XL3t, Thermo Fisher Scientific, Bergenfield, NJ, USA), which was used to identify areas with potentially high lead concentrations near the houses. Once collected, soil samples from each site were composited for analysis. The soil characterization process involved measuring soil pH, electrical conductivity (EC), and texture properties following the standardized protocols described in the Soil Science Society of America Handbook for Chemical and Mineralogical Analysis [27]. The percentage of organic matter in the soils was determined using the weight-loss-on-ignition method [28]. Total lead was determined after acid digestion with HNO3 (trace-metal grade, Fisher Scientific, Fair Lawn, NJ, USA) and H2O2 (ACS reagent grade, Sigma-Aldrich, St. Louis, MO, USA) following the USEPA Method 3050B [29]. Lead concentrations were measured using inductively coupled plasma optical emission spectroscopy (ICP-OES, Agilent Technologies 5100, Santa Clara, CA, USA). All extractions and analyses were done in triplicate.
Lead bioaccessibility was assessed using a modified version of the Unified BARGE Method (UBM). This method simulates the digestive process using synthetic fluids representative of saliva, gastric acid, duodenal fluid, and bile, all formulated from salts and organic compounds to mimic their natural composition. The pH levels of these fluids were adjusted accordingly to simulate the digestive environment. A key modification to the protocol involved the incorporation of ferric-oxide strips, which were used as an ion sink for soluble lead, based on a technique previously applied to bioavailable arsenic assays [30]. After the gastric phase of digestion, these ferric-oxide strips were added to the intestinal-phase solution to absorb bioavailable lead. Once the intestinal-phase incubation was complete, the strips were removed and placed in a beaker containing 50 mL of 1.6 M nitric acid to desorb the lead. Samples were collected at three stages: from the stomach phase, from the absorbed intestinal phase, and from the desorbed ferric-oxide strip solution. All samples were analyzed for soluble lead content using ICP-OES.
A machine learning model was designed as a predictive model to examine the relationship between lead bioaccessibility and soil properties (n = 18). The source of lead pollution and the in vitro assay used to estimate lead bioaccessibility were also considered, resulting in a wide range of explanatory variables to build out a comprehensive model. This initial dataset was expanded to 670 unique datapoints by including relevant sources. The extractable literature was found through an advanced search for the following words on Google Scholar: “soil properties”, “Pb bioaccessibility”, and “in vitro digestion”. Some specified parameters were also included in the search, such as pH, OM, EC, and CEC. A total of 12 sources with extractable data were found, including a different machine learning dataset used in a publication for predicting heavy-metal bioaccessibility based on soil properties and additional environmental factors [23]. The dataset was compiled and loaded into Python version 3.12 for model evaluation.
The main pipeline for hyperparameter tuning was constructed initially without the aid of AI, but later using Claude Sonnet 3.7–4.6. The initial model consisted of three regression models: random forest regressor (RFR), histogram gradient-boosting model (HistGBM), and eXtreme gradient-boosting model (XGBoost). Tree-based models were chosen to handle complex non-normal data, and gradient-boosting models were added to reduce the risk of overfitting. However, it failed to achieve desirable prediction levels, since the RFR model would overfit, and the gradient-boosting models failed to capture feature patterns. The main pipeline was then reconstructed, and files were uploaded into Claude to address the initial model’s shortcomings. A major challenge with our dataset was adapting a relatively small dataset with a large number of features. In collaboration with Claude and in testing, selected models were tuned to combat overfitting and better handle the complexities of the dataset. In addition, Claude allowed the development of our pipeline to be streamlined, allowing time spent on writing the source code to be utilized to optimize specific parameters. The pipeline consisted of four main steps: preprocessing, feature selection, model training, and final model selection.
After extraction, the data was compiled into a single Excel file. The dataset was standardized by converting units into the most commonly reported unit for each feature. Interchangeable features such as organic carbon (%) and organic matter (%) were converted into one feature, following organic matter (%) = total organic carbon (%) × 2 [31]. The following techniques were implemented with the aid of Claude to boost the quality of the dataset used in the model. Features were separated into numerical data (n = 17) and categorical data (n = 3), then transformed. Numerical features were transformed using the Yeo–Johnson method to bring them closer to a normal distribution. Categorical data was converted into binary indicator columns using one-hot encoding. Outliers within each feature were removed if they fell more than 1.5× the interquartile range below the first quartile or above the third quartile. Mostly empty features (>65% NaN) were identified and removed. Numerical features with 65% or fewer missing datapoints (NaN) were then imputed via simple or iterative methods. Simple imputation filled missing values with the median value of that feature. A robust iterative imputation method estimated missing datapoints using a random forest regressor and was optimized to conserve variance. To optimize estimations, converging iterations with the lowest resulting variance, stability across multiple tested iterations, and low risk of overfitting were selected. Finally, after imputation was completed, the numerical features were recombined with the categorical features and augmented with synthetically produced datapoints. Synthetic datapoints made up one-third of the total dataset and were generated conservatively using K-nearest neighbors. This was intended to assist model learning without losing fidelity to real-world data.
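The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using pandas and scikit-learn, not the authors' pipeline: the function names, the toy column names, the choice to mask outliers as missing values rather than drop rows, and the order of imputation and transformation are all assumptions, and only the simple (median) imputation route is shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.65) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds max_missing."""
    return df.loc[:, df.isna().mean() <= max_missing]


def mask_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Set values outside [Q1 - k*IQR, Q3 + k*IQR] to NaN (assumed handling)."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    inside = (df >= q1 - k * iqr) & (df <= q3 + k * iqr)
    return df.where(inside)


def preprocess(df: pd.DataFrame, numeric_cols, categorical_cols) -> pd.DataFrame:
    """Median-impute and Yeo-Johnson transform numerics; one-hot encode categoricals."""
    numeric = mask_iqr_outliers(drop_sparse_columns(df[numeric_cols]))
    imputed = SimpleImputer(strategy="median").fit_transform(numeric)
    transformed = PowerTransformer(method="yeo-johnson").fit_transform(imputed)
    numeric_out = pd.DataFrame(transformed, columns=numeric.columns, index=df.index)
    categorical_out = pd.get_dummies(df[categorical_cols], dtype=float)
    return pd.concat([numeric_out, categorical_out], axis=1)


# Hypothetical toy data; the values are random and carry no scientific meaning.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "pH": rng.normal(6.5, 0.8, 40),
    "OM": rng.lognormal(1.0, 0.5, 40),
    "FeOs": np.r_[np.full(30, np.nan), np.ones(10)],  # 75% missing, so dropped
    "source": rng.choice(["paint", "mining"], 40),
})
toy.loc[0, "pH"] = 14.0  # an extreme value that the IQR rule masks
features = preprocess(toy, ["pH", "OM", "FeOs"], ["source"])
```

The iterative random-forest imputation and the K-nearest-neighbors synthetic expansion described in the text are omitted here because their settings are not reported.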
Recursive feature elimination with cross-validation (RFECV) and consensus techniques selected the top 10 features based on their importance to the estimator model. While a random forest estimator was used alone during RFECV in our first model, it often led to the selection of underperforming features. With Claude, this process was further optimized by introducing consensus techniques to choose features that several methods ranked highly. Variance threshold and mutual information selection techniques were included, and final features were selected based on a voting consensus. Claude also helped design a method to select an optimal number of impactful features instead of manually tuning this number.
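One way to picture the voting consensus is the sketch below, which combines three scikit-learn selectors and keeps features that receive at least two votes. The vote threshold, the `top_k` value, the variance cutoff, and the toy data are assumptions made for illustration; the article does not report how the votes were tallied.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV, VarianceThreshold, mutual_info_regression


def consensus_features(X, y, top_k=5, min_votes=2, seed=0):
    """Return indices of features chosen by at least min_votes of three selectors."""
    votes = np.zeros(X.shape[1], dtype=int)

    # Vote 1: recursive feature elimination with cross-validation.
    rfecv = RFECV(
        RandomForestRegressor(n_estimators=30, random_state=seed),
        cv=3,
        min_features_to_select=top_k,
    )
    votes += rfecv.fit(X, y).support_.astype(int)

    # Vote 2: the top_k features ranked by mutual information with the target.
    mi = mutual_info_regression(X, y, random_state=seed)
    votes[np.argsort(mi)[-top_k:]] += 1

    # Vote 3: features whose variance exceeds a small threshold.
    votes += VarianceThreshold(threshold=1e-3).fit(X).get_support().astype(int)

    return np.flatnonzero(votes >= min_votes), votes


# Hypothetical data: columns 0 and 1 drive the target, column 8 is near-constant.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 9))
X_demo[:, 8] = rng.normal(scale=1e-3, size=120)
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)
selected, votes = consensus_features(X_demo, y_demo)
```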
The number of models also increased under the recommendation of Claude to include model diversity. With this “shotgun” approach, multiple models, including ensemble models, could be selected for validation. While performance was our top priority to be selected, we also examined and hand-picked the best model based on model balance and computational resource demand. Seven base regression models were trained: random forest regressor (RFR), eXtreme gradient-boosting (XGBoost) model, light gradient-boosting model (LightGBM), histogram gradient-boosting model (HistGBM), multi-layer perceptron regressor (MLP), support vector regression (SVR), and K-nearest neighbors (KNN). Multiple ensemble methods, a voting regressor, were also created, which were built from the 7 other base models. The chosen base models were selected to diversify the techniques used to estimate each individual point, with the intention that it would help improve our ensemble methods. The ensemble and diversity-based method utilized a meta-estimator model, a voting regressor, which fits multiple base models to then estimate target values from the average. Adaptive weighted and Bayesian averaging ensemble models use different statistical approaches to apply dynamic weights to the models’ truthfulness. The culmination of all these optimizations designed into the model’s architecture built the best possible predictor for lead bioaccessibility with a multitude of features to examine. All models were trained on a multitude of hyperparameters optimized to decrease the risk of overfitting. This was particularly impactful for the random forest regressor. Constructed “forests” were designed to be simple and generalizable, meaning that correlating features had to be statistically significant to generate a split “new tree”. The final selected model was retrained with a subset of the data (n = 620) and then used to predict target values using a holdout dataset. 
The validation set was drawn from an internal dataset (n = 30) to simulate a real-life application of the predictive model. Randomly distributed datapoints from the external dataset (n = 20) were also included to help generalize conclusions during post-validation analysis and balance the range of predicted points. Prediction inspection techniques were employed at the end to explain poor predictions, domain shift, and feature reliance. Once diagnosed, more advanced techniques were applied to address domain shift, explore the implications of the model’s “blind” performance, and investigate and remove outliers in the model. The first domain shift was addressed by introducing a resampling technique that moves the majority of the validation dataset into training. Then the model was retrained and predicted the remaining dataset. The indices moved were randomly determined in each run, and this retraining occurred over many instances to stabilize projected values. All randomness introduced during random splits, imputation, and synthetic expansion was fixed to a seed for reproducibility. Afterwards, outlier analysis inspected the resulting predictions and located values that were consistently outside the model’s uncertainty. These points were removed and isolated for further inspection to understand the model’s current inadequacies.

3. Results

3.1. Final Model Selection

Model selection on our boosted dataset found that our dynamic ensemble models, “adaptive weighted” (AW) and “Bayesian averaging” (BA), performed the best. Both models performed within a small margin of each other, with the BA model with simple imputation slightly outperforming the top model with iterative imputation (Figure 1). Three metrics were used to select the most competitive model: prediction accuracy (50% weight), model performance (35% weight), and root mean square error (RMSE) (15% weight). Details about the top competing model metrics with differing imputations can be found in Supplementary Materials, Figures S4 and S5. Of the two, the AW model was selected as the final model since its weighted ensemble was far more balanced than that of the BA model. Upon closer inspection, the BA model had weighted one base model, RFR, at 99%, and was effectively another RFR model. The RFR model with simple imputation was later tested during validation in place of BA, since it would significantly cut down computation time.
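The 50/35/15 weighting can be written as a small scoring function. The sketch below assumes that the accuracy and performance terms are already scaled to [0, 1] and that RMSE is converted to a "higher is better" term by dividing by a reference maximum; the article does not state how the three metrics were normalized, so both choices are hypothetical.

```python
def weighted_score(accuracy: float, performance: float, rmse: float, rmse_max: float) -> float:
    """Combine three metrics with the 50% / 35% / 15% weights reported in the text.

    accuracy and performance are assumed to lie in [0, 1]; rmse is rescaled so
    that 0 error maps to 1 and rmse >= rmse_max maps to 0 (assumed normalization).
    """
    rmse_term = 1.0 - min(rmse / rmse_max, 1.0)
    return 0.50 * accuracy + 0.35 * performance + 0.15 * rmse_term


# Hypothetical inputs, not values from the study.
best_case = weighted_score(accuracy=1.0, performance=1.0, rmse=0.0, rmse_max=1.0)
worst_case = weighted_score(accuracy=0.0, performance=0.0, rmse=2.0, rmse_max=1.0)
```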
In Figure 2B,C, both the SHAP analysis and feature importance indicated strong reliance on soil Pb or “TotalPb” for model predictions, which was found to be highly correlated to bioaccessible lead. The model’s potential overreliance on TotalPb was later explored and can be found in the Supplementary Materials, Figures S1–S3. Feature correlation (Figure 2D) found a strong negative correlation between “sand” and “silt” (−0.85) and between “sand” and “clay” (0.48). Interestingly, features expected to correlate together did not, such as CEC, ferric oxides (FeOs), and clay. CEC, instead, weakly correlated with pH (0.37) and residential sources, possibly indicating high CEC in soils found at residential sites (0.40). Organic matter and “TotalPb” did not correlate well with any features, but were the two topmost important features to the model. According to the SHAP plot, most selected features had no impact outside of [−0.5, 0.5], while TotalPb had a low-to-high impact across the entire range. High feature value (Figure 2B) was primarily observed close to the target value mean (0) for almost every feature. The prediction accuracy chart shows well the predicted values between −1 and 1 in the normalized chart (Figure 2A). Model error begins to compound at the extremes past −1 and 1.5. This strongly correlates with the SHAP plots, where feature values have almost no impact the farther they are away from zero, apart from “TotalPb”.

3.2. Validation Model

The dataset was split differently in a 0.8/0.15/~0.075 configuration for validation. After the model was trained and tested again following the same preprocessing pipeline, it made predictions on our designated validation dataset, consisting of 30 samples obtained from in-house data not yet published, and 20 samples from our compiled dataset. The first iteration of the validation model underperformed our expected targets (R2 = 0.11). A two-sample Kolmogorov–Smirnov (KS) test revealed that the underperformance was due to a domain shift, particularly in the features TotalPb, silt, FeOs, and EC (Figure 3). TotalPb, a high-impact feature, displayed a high KS statistic (0.398) but did not shift as strongly as FeOs and EC (0.557 and 0.415, respectively).
Other impactful features, such as OM, clay, and CEC, were not found to exhibit a significant shift (KS = 0.182, 0.138, and 0.140, p-value > 0.05). A principal component analysis (PCA) projection was generated after validation to compare shifts between validation and training feature spaces (Figure 4). While the projection only accounts for 44.4% of the variance, most of the validation datapoints fall within the 95% confidence interval. A new iteration of the validation model to address domain shift was conceptualized and then implemented with Claude.
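A per-feature two-sample KS check of this kind can be run with `scipy.stats.ks_2samp`. The sketch below is illustrative only: the helper name, the 0.05 threshold applied without multiple-comparison correction, and the synthetic data are assumptions, and the feature names are borrowed from the text purely as labels.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def domain_shift_report(train: pd.DataFrame, valid: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Run a two-sample KS test per shared feature and sort by the KS statistic."""
    rows = []
    for col in train.columns.intersection(valid.columns):
        result = ks_2samp(train[col].dropna(), valid[col].dropna())
        rows.append({
            "feature": col,
            "ks": result.statistic,
            "p_value": result.pvalue,
            "shifted": result.pvalue < alpha,
        })
    return pd.DataFrame(rows).sort_values("ks", ascending=False).reset_index(drop=True)


# Hypothetical data: "FeOs" is shifted in the validation set, "OM" is not.
rng = np.random.default_rng(0)
train_demo = pd.DataFrame({"OM": rng.normal(0, 1, 600), "FeOs": rng.normal(0, 1, 600)})
valid_demo = pd.DataFrame({"OM": rng.normal(0, 1, 50), "FeOs": rng.normal(2, 1, 50)})
report = domain_shift_report(train_demo, valid_demo)
```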
Since the training dataset needed to reflect the domain exhibited in the validation set, we implemented a synthetic resampling method, which moved 90% of the validation data to training and synthetically expanded the training set using a hybrid approach. Both K-nearest neighbors and principal component analysis were used to expand the training dataset by 225 datapoints, or 5 times the number of moved validation samples (n = 45). After expansion was complete and the new training set was tested, predictions were made on the remaining 10%. This process was repeated for 200 iterations, randomly assigning samples to be moved to training until every datapoint had been used in validation at least 20 times. These point-by-point predictions were averaged and then plotted. The resampling method substantially improved prediction accuracy, raising it to R2 = 0.67. Most values fall within the 95% confidence interval (Figure 5). Out of our internal dataset, 16/30 were predicted within 10% of the target range, and eight of these were predicted within 5% of the target range. A total of 10 predicted values from the internal dataset were not predicted well, but still fell within the model’s uncertainty range. However, all values outside the uncertainty range belonged to the internal dataset. These samples were isolated and examined further to determine their effect on model performance.
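The core of the resampling loop can be sketched as below. This simplified version omits the K-nearest-neighbors and PCA synthetic expansion, uses a ridge regressor as a stand-in model, and loops until every validation point has been held out a minimum number of times rather than running a fixed 200 iterations; all of these are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.linear_model import Ridge


def resampled_validation(X_train, y_train, X_valid, y_valid,
                         move_frac=0.9, min_repeats=20, seed=0,
                         make_model=lambda: Ridge(alpha=1.0)):
    """Average held-out predictions over repeated random validation-to-training moves."""
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    n_valid = len(y_valid)
    n_move = int(round(move_frac * n_valid))
    pred_sum = np.zeros(n_valid)
    counts = np.zeros(n_valid, dtype=int)

    while counts.min() < min_repeats:
        order = rng.permutation(n_valid)
        moved, held_out = order[:n_move], order[n_move:]
        # Retrain on the original training data plus the moved validation samples.
        model = make_model().fit(
            np.vstack([X_train, X_valid[moved]]),
            np.concatenate([y_train, y_valid[moved]]),
        )
        pred_sum[held_out] += model.predict(X_valid[held_out])
        counts[held_out] += 1

    return pred_sum / counts, counts


# Hypothetical data with a shifted validation domain.
rng = np.random.default_rng(1)
coef = np.array([1.0, -2.0, 0.5])
X_tr = rng.normal(size=(200, 3))
y_tr = X_tr @ coef + rng.normal(scale=0.1, size=200)
X_va = rng.normal(loc=1.0, size=(50, 3))
y_va = X_va @ coef + 0.5 + rng.normal(scale=0.1, size=50)
mean_pred, counts = resampled_validation(X_tr, y_tr, X_va, y_va, min_repeats=5)
```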
A three-criterion outlier analysis method identified statistical outliers and thereby justified their removal from the model. The validation set was evaluated using KNN distances to neighbors, Mahalanobis, and Bonferroni tests. If a point was found to be significant by two of the three tests, it was deemed to be a genuine outlier. Only indices 7 and 19 were identified with these methods; however, according to the scatterplot and iterative synthetic imputation data, more outliers remained. Using the “leave-one-out” strategy, additional outliers were identified based on the significant impact on prediction accuracy (ΔR ≥ 0.025) when removed (Figure 6). The remaining outliers, 4, 6, 26, and 28, were confirmed to be poorly predicted, yet their impact on model performance was not yet clear.
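The leave-one-out step can be illustrated with a single pass that measures how much R2 improves when each point is removed. The sketch assumes that the reported threshold of 0.025 refers to a change in R2 and does not reproduce the greedy sequential removal or the three-criterion test; the data are constructed so that one point is clearly mispredicted.

```python
import numpy as np
from sklearn.metrics import r2_score


def leave_one_out_influence(y_true, y_pred, threshold=0.025):
    """Flag points whose removal raises R2 by at least `threshold`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    base = r2_score(y_true, y_pred)
    gains = np.array([
        r2_score(np.delete(y_true, i), np.delete(y_pred, i)) - base
        for i in range(len(y_true))
    ])
    return np.flatnonzero(gains >= threshold), gains


# Hypothetical predictions: perfect everywhere except index 7.
y_true_demo = np.linspace(0.0, 1.0, 20)
y_pred_demo = y_true_demo.copy()
y_pred_demo[7] += 0.8
flagged, gains = leave_one_out_influence(y_true_demo, y_pred_demo)
```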
One commonality between all these points was their low stomach lead bioaccessibility. We were able to attribute low bioaccessibility due to the alkaline conditions in the soil itself in indices 4, 7, 3, and 6. This information was not reflected well within our feature space, apart from high pH values. For indices 19, 26, and 28, they start to diverge apart from their more acidic pH, with high TotalPb. These points were input into Claude to ask for both its soil-science knowledge base and to be able to quickly identify patterns within the feature space between these three points. While point 19 was identified as an outlier purely because its features were found to be outside the rest of the validation group, concerning indices 26 and 28, Claude identified a discrepancy between acidity, FeOs, and CEC absorption and bioaccessibility. To quote the AI directly, “Indices 26 and 28 have all three pathways compromised [32,33,34]—yet diverge in BAF outcome because FeO absolute concentration still partially moderates Pb surface precipitation at circumneutral pH [33,35,36] (index 26), while extreme TotalPb at lower pH overwhelms even that residual capacity [37,38,39] (index 28). This multi-pathway saturation interaction cannot be captured by any single feature ratio [40,41], which explains their persistent influence in the greedy LOO sequence.” While this interpretation warrants further investigation, Claude provided citations for its conclusions (Table S1). Despite these shortcomings with model performance, overall, after outlier removal, prediction accuracy improved to R2 = 0.84 (Figure 7).

4. Discussion

This study demonstrates the utility of machine learning for estimating bioaccessibility from soil physicochemical properties, while also revealing key challenges to model generalizability. Surprisingly, the final model was not the one that marginally outperformed the other competing models; rather, it was the most balanced of the ensemble models. While the weighted scores of the Bayesian averaging model and the adaptive weighted model were within 0.002 of each other, the adaptive weighted model (iterative synthetic resample R2 = 0.67) performed better than the Bayesian averaging model (iterative synthetic resample R2 = 0.62). We conclude that the most generalizable ensemble model must be well-balanced across its base models. Despite strong performance during model selection, our initial validation performed poorly. This reduction reflects a domain shift between heterogeneous literature-derived datasets and internally consistent in-house measurements, underscoring a common limitation in environmental machine learning applications.
Domain-shift issues persisted throughout experimentation and were exacerbated by our small validation set. Our iterative synthetic resampling approach adjusted for the domain shift. Closer analysis of the plotted points identified well-predicted samples (n = 16) that made up the majority of our internal experimental datapoints (53%). Half of those points (n = 8) were predicted close to their target values. Model performance, however, was still heavily affected by outliers. Some datapoints were deemed unpredictable: after careful examination using our three-criterion outlier examination and greedy “leave-out-one” simulations, we found seven points that fell outside of our model’s predictive range. These points exhibited alkaline conditions, high cation-exchange capacity, and low ferric-oxide absorption, but diverging bioaccessibility fractions that the model could not capture. For instance, for datapoint 26, a low ratio between Pb and FeOs (0.13) was expected to correspond to low bioaccessibility, due to high surface complexation. This was well reflected in the low bioaccessibility of sample 26’s soil (5.9%). In contrast, for point 28, the ratio between Pb and FeOs (0.35) was high, and lead cations possibly overwhelmed surface complexes and became more bioaccessible (87%). However, the model failed to establish this relationship, since the Pb/FeOs ratio was not specified as a feature and there was not enough real-world data for the model to interpret emerging patterns. Alkaline conditions, which led to low gastric lead concentrations, were likewise not interpreted by the model. High pH promotes complexation of Pb, which is not bioaccessible. The model’s failure to interpret this caused poor predictions for points 3, 4, 6, and 7. Conversely, upon similar examination of near-perfect predictions (within 5% of the target value), Claude was able to determine a common pattern across all indices: high pH and high organic matter. The model was able to notice this consistent pattern and predict these points accurately.
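The greedy “leave-out-one” idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors’ script: it rescores a fixed set of predictions rather than retraining the model, and the function names and the example values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination for paired observations and predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


def greedy_leave_one_out(y_true, y_pred, n_remove):
    """Repeatedly drop the single point whose removal raises R2 the most.

    Returns a list of (original_index, r2_after_removal) in removal order.
    """
    remaining = list(range(len(y_true)))
    removed = []
    for _ in range(n_remove):
        best_index, best_r2 = None, float("-inf")
        for candidate in remaining:
            kept = [i for i in remaining if i != candidate]
            score = r_squared([y_true[i] for i in kept],
                              [y_pred[i] for i in kept])
            if score > best_r2:
                best_index, best_r2 = candidate, score
        remaining.remove(best_index)
        removed.append((best_index, best_r2))
    return removed


# Hypothetical bioaccessibility values (%); index 4 is deliberately mispredicted.
observed = [10, 20, 30, 40, 50, 60]
predicted = [12, 19, 31, 38, 90, 61]

baseline_r2 = r_squared(observed, predicted)
steps = greedy_leave_one_out(observed, predicted, n_remove=1)
print(f"baseline R2 = {baseline_r2:.3f}")
for index, r2 in steps:
    print(f"removing point {index} raises R2 to {r2:.3f}")
```

Plotting the R2 after each removal gives the cumulative-gain view, and the difference between successive values gives the point-by-point gain. Because the procedure is greedy, it flags influential points but does not by itself establish that they are true outliers.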
Another issue to consider was the loss of generalizability due to overfitting. According to both the prediction accuracy plot and the SHAP analysis performed during final model selection, datapoints with target values above 1.5 or below −1 offer the least information for generating predictions. In our internal dataset, however, the observed outliers were found close to the middle, between −0.5 and 0.5. One possible explanation is overfitting, since the impact values of most features are densely concentrated in that range. Instead of adding value, these features further confused the model, and distinct connections between target values and predictors were not established. Feature engineering of vital geochemical relationships not yet exploited by the model could help adjust for overfitting. Rather than adding another standalone soil property, a relationship between two features could provide context and value, thus improving predictive accuracy. This has yet to be conducted and should be explored in future work.
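As an illustration of such a relationship feature, the Pb/FeOs ratio discussed for points 26 and 28 could be derived before training. The sketch below is hypothetical: the helper `add_ratio_feature`, the key names, and the concentrations are assumptions made for this example (the values were invented only so that the ratios equal the 0.13 and 0.35 quoted earlier), and the feature has not been tested in this study.

```python
def add_ratio_feature(sample, numerator, denominator, name):
    """Return a copy of `sample` with numerator/denominator stored under `name`.

    The ratio is None when either input is missing or the denominator is zero,
    leaving the decision on how to fill it to a later imputation step.
    """
    enriched = dict(sample)
    top = sample.get(numerator)
    bottom = sample.get(denominator)
    if top is None or bottom is None or bottom == 0:
        enriched[name] = None
    else:
        enriched[name] = top / bottom
    return enriched


# Hypothetical samples; units are assumed to match between the two columns.
samples = [
    {"id": 26, "TotalPb": 130.0, "FeOs": 1000.0},
    {"id": 28, "TotalPb": 350.0, "FeOs": 1000.0},
    {"id": 99, "TotalPb": 410.0, "FeOs": None},  # missing FeOs measurement
]

enriched = [add_ratio_feature(s, "TotalPb", "FeOs", "Pb_FeOs_ratio")
            for s in samples]
for row in enriched:
    print(row["id"], row["Pb_FeOs_ratio"])
```

Supplying the ratio explicitly gives a tree-based model a single feature to split on, instead of requiring it to approximate the interaction from the two raw columns with limited data.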

5. Conclusions

The bioaccessibility of heavy-metal contaminants remains a variable and difficult measurement to obtain without long and costly digestion models. Soil bioaccessibility is closely tied to the properties of the soil. By implementing a machine learning model, we showed that various soil properties, such as TotalPb, organic matter, electrical conductivity, cation-exchange capacity, and pH, significantly influence lead bioaccessibility. Compiling data from previously published work and our own experimental data yielded a complex predictive tool that has shown promising results. We intentionally analyzed a small subset of data to explore a practical application of this predictive tool. Initial model tests indicated that domain shift resulted in poor performance. Iterative resampling significantly improved model recovery, which highlights the model’s ability to operate under domain shift. Ideally, sufficient data would be fed to the model in the future, such that domain shifts are less frequent and predictions can be made on a point-by-point basis. With more real-world data, geochemical relationships between features can be better assessed, allowing feature engineering to capture factors that are not clear from the raw dataset alone. While our validation set contained the least missing feature data, the features the model was trained on were not sufficient to estimate our outlier group. Instead, including features that account for alkaline conditions would improve predictions. Deep learning could be applied to re-evaluate the model based on isolated poor predictors.
One major constraint limiting this model is the paucity of publicly available data. Future studies should seek to improve data quality and the standardization of soil properties across studies. Supplementing missing data with synthetic estimations proved adequate and could be made more meaningful with complete feature data. This would help expand the dataset with real-world samples and permit continued validation testing with other unknown samples. Large-language-model artificial intelligence tools, like Claude, hastened the development of the algorithm and permitted the integration of optimization techniques, as demonstrated in our results. Some limitations in Claude’s ability to debug and clean scripts stem from its inability to see the user’s virtual environment. We also noticed that Claude occasionally failed to understand user prompts in light of context established earlier in the conversation. Conversation size is limited without subscribing to one of Anthropic’s paid plans. Context lost between conversations can lead to bloated scripts, which makes it almost a requirement to purchase the “Pro” plan and organize work within a project folder. Individuals or research groups considering Claude should weigh the value of a subscription for access to Claude’s full features. Overall, Claude Sonnet with extended thinking proved to be an invaluable tool, especially for those interested in applying machine learning to their research. While there is still room to make our model more generalizable across diverse datasets, the model developed is an excellent platform on which to build a predictive tool.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16073504/s1, Figure S1: Results of the exploratory study into the model’s over-reliance on TotalPb. Figure S2: LightGBM trained on iterative imputed datasets with artificially stripped datapoints. Figure S3: Adaptive weighted ensemble model trained on simple imputed (median) datasets with artificially stripped datapoints. Figure S4: Performance of adaptive weighted ensemble model trained on iterative imputed datasets. Figure S5: Performance of Bayesian averaging ensemble model trained on simple imputed datasets. Table S1: Table of citations provided in Claude’s explanation of the geochemical properties of indices 26 and 28.

Author Contributions

Conceptualization, S.W., R.D. and D.S.; methodology, S.W., H.S. and K.M.; validation, S.W.; formal analysis, S.W. and K.M.; investigation, S.W., K.M. and H.S.; resources, R.D., S.R. and D.S.; data curation, S.W.; writing—original draft preparation, S.W.; writing—review and editing, R.D., S.R. and D.S.; visualization, S.W.; supervision, R.D., S.R. and D.S.; project administration, R.D., S.R. and D.S.; funding acquisition, R.D., S.R. and D.S. All authors have read and agreed to the published version of the manuscript.

Funding

The work that provided the basis for this publication was supported by funding under an award from the United States Department of Housing and Urban Development (grant# MILTS0023-21). The substance and findings of the work are dedicated to the public. The authors are solely responsible for the accuracy of the statements and interpretations contained in the publication. Such interpretations do not necessarily reflect the views of the Government.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The full dataset and modeling scripts are available on GitHub at https://github.com/swijesin/Prediction-Model-for-Lead-Bioaccessibility (accessed on 18 October 2025).

Acknowledgments

Anthropic’s GenAI, Claude Sonnet 3.7-4.6, was used to assist in building out the machine learning model and several key figures. We thank Zeeshan Jaffari for his valuable comments and suggestions on data analysis.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sowers, T.D.; Blackmon, M.D.; Wilkin, R.T.; Rovero, M.; Bone, S.E.; Jerden, M.L.; Nelson, C.M.; Bradham, K.D. Lead Speciation, Bioaccessibility, and Sources for a Contaminated Subset of House Dust and Soils Collected from Similar United States Residences. Environ. Sci. Technol. 2024, 58, 9339–9349. [Google Scholar] [CrossRef]
  2. Dietrich, M.; Filippelli, G.M. Positive outcomes from U.S. lead regulations, continued challenges, and lessons learned for regulating emerging contaminants. Environ. Sci. Pollut. Res. 2023, 30, 57178–57187. [Google Scholar] [CrossRef] [PubMed]
  3. Jacobs, D.E.; Clickner, R.P.; Zhou, J.Y.; Viet, S.M.; Marker, D.A.; Rogers, J.W.; Zeldin, D.C.; Broene, P.; Friedman, W. The Prevalence of Lead-Based Paint Hazards in U.S. Housing. Environ. Health Perspect. 2002, 110, A599. [Google Scholar] [CrossRef] [PubMed]
  4. Laidlaw, M.A.S.; Filippelli, G.M.; Brown, S.; Paz-Ferreiro, J.; Reichman, S.M.; Netherway, P.; Truskewycz, A.; Ball, A.S.; Mielke, H.W. Case studies and evidence-based approaches to addressing urban soil lead contamination. Appl. Geochem. 2017, 83, 14–30. [Google Scholar] [CrossRef]
  5. Wilson, J.; Dixon, S.L.; Wisinski, C.; Speidel, C.; Breysse, J.; Jacobson, M.; Crisci, S.; Jacobs, D.E. Pathways and sources of lead exposure: Michigan Children’s Lead Determination (the MI CHILD study). Environ. Res. 2022, 215, 114204. [Google Scholar] [CrossRef]
  6. Pereira, E.C.; Piai, K.d.A.; Salles, F.J.; da Silva, A.S.; Olympio, K.P.K. A comprehensive analysis of children’s blood lead levels in Latin America and the Caribbean over the last eight years: Progress and recommendations. Sci. Total Environ. 2024, 928, 172372. [Google Scholar] [CrossRef] [PubMed]
  7. Levin, R.; Vieira, C.L.Z.; Rosenbaum, M.H.; Bischoff, K.; Mordarski, D.C.; Brown, M.J. The urban lead (Pb) burden in humans, animals and the natural environment. Environ. Res. 2020, 193, 110377. [Google Scholar] [CrossRef]
  8. Cao, X.; Ma, L.Q.; Singh, S.P.; Zhou, Q. Phosphate-induced lead immobilization from different lead minerals in soils under varying pH conditions. Environ. Pollut. 2008, 152, 184–192. [Google Scholar] [CrossRef]
  9. Wijayawardena, M.A.; Naidu, R.; Megharaj, M.; Lamb, D.; Thavamani, P.; Kuchel, T. Using soil properties to predict in vivo bioavailability of lead in soils. Chemosphere 2015, 138, 422–428. [Google Scholar] [CrossRef]
  10. Kastury, F.; Smith, E.; Doelsch, E.; Lombi, E.; Donnelley, M.; Cmielewski, P.L.; Parsons, D.W.; Scheckel, K.G.; Paterson, D.; De Jonge, M.D.; et al. In Vitro, in Vivo, and Spectroscopic Assessment of Lead Exposure Reduction via Ingestion and Inhalation Pathways Using Phosphate and Iron Amendments. Environ. Sci. Technol. 2019, 53, 10329–10341. [Google Scholar] [CrossRef]
  11. Cui, J.; Li, H.; Shi, Y.; Zhang, F.; Hong, Z.; Fang, D.; Jiang, J.; Wang, Y.; Xu, R. Influence of soil pH and organic carbon content on the bioaccessibility of lead and copper in four spiked soils. Environ. Pollut. 2024, 360, 124686. [Google Scholar] [CrossRef]
  12. Billmann, M.; Hulot, C.; Pauget, B.; Badreddine, R.; Papin, A.; Pelfrêne, A. Oral bioaccessibility of PTEs in soils: A review of data, influencing factors and application in human health risk assessment. Sci. Total Environ. 2023, 896, 165263. [Google Scholar] [CrossRef] [PubMed]
  13. Violante, A.U.D.N.; Cozzolino, V.U.D.N.; Perelomov, L.P.S.U.; Caporale, A.G.; Pigna, M.U.D.N. Mobility and bioavailability of heavy metals and metalloids in soil environments. J. Soil Sci. Plant Nutr. 2010, 10, 268–292. [Google Scholar] [CrossRef]
  14. Shi, M.; Min, X.; Ke, Y.; Lin, Z.; Yang, Z.; Wang, S.; Peng, N.; Yan, X.; Luo, S.; Wu, J.; et al. Recent progress in understanding the mechanism of heavy metals retention by iron (oxyhydr) oxides. Sci. Total Environ. 2021, 752, 141930. [Google Scholar] [CrossRef]
  15. Bradham, K.D.; Dayton, E.A.; Basta, N.T.; Schroder, J.; Payton, M.; Lanno, R.P. Effect of soil properties on lead bioavailability and toxicity to earthworms. Environ. Toxicol. Chem. Int. J. 2006, 25, 769–775. [Google Scholar] [CrossRef] [PubMed]
  16. Sungur, A.; Soylak, M.; Ozcan, H. Investigation of heavy metal mobility and availability by the BCR sequential extraction procedure: Relationship between soil properties and heavy metals availability. Chem. Speciat. Bioavailab. 2014, 26, 219–230. [Google Scholar] [CrossRef]
  17. Paul, E.A. The nature and dynamics of soil organic matter: Plant inputs, microbial transformations, and organic matter stabilization. Soil Biol. Biochem. 2016, 98, 109–126. [Google Scholar] [CrossRef]
  18. Encinas-Vázquez, A.; Quezada-Renteria, J.A.; Cervantes, F.J.; Pérez-Rábago, C.A.; Molina-Freaner, F.E.; Pat-Espadas, A.M.; Estrada, C.A. Unraveling the mechanisms of lead adsorption and ageing process on high-temperature biochar. J. Chem. Technol. Biotechnol. 2021, 96, 775–784. [Google Scholar] [CrossRef]
  19. Yang, J.Y.; Yang, X.E.; He, Z.L.; Li, T.Q.; Shentu, J.L.; Stoffella, P.J. Effects of pH, organic acids, and inorganic ions on lead desorption from soils. Environ. Pollut. 2006, 143, 9–15. [Google Scholar] [CrossRef]
  20. Yan, K.; Dong, Z.; Wijayawardena, M.A.; Liu, Y.; Li, Y.; Naidu, R. The source of lead determines the relationship between soil properties and lead bioaccessibility. Environ. Pollut. 2019, 246, 53–59. [Google Scholar] [CrossRef]
  21. Saminathan, S.K.M.; Sarkar, D.; Andra, S.S.; Datta, R. Lead fractionation and bioaccessibility in contaminated soils with variable chemical properties. Chem. Speciat. Bioavailab. 2010, 22, 215–225. [Google Scholar] [CrossRef]
  22. Wadoux, A.M.C. Artificial intelligence in soil science. Eur. J. Soil Sci. 2025, 76, e70080. [Google Scholar] [CrossRef]
  23. Xie, K.; Ou, J.; He, M.; Peng, W.; Yuan, Y. Predicting the bioaccessibility of soil Cd, Pb, and As with advanced machine learning for continental-scale soil environmental criteria determination in China. Environ. Health 2024, 2, 631–641. [Google Scholar] [CrossRef] [PubMed]
  24. Mao, L.; Kang, K.; Kong, H.; Zhu, E.; Zhang, Z.; Li, Y.; Tao, H. Quantifying the contributions of factors to bioaccessible Cd and Pb in soil using machine learning. J. Hazard. Mater. 2025, 487, 137102. [Google Scholar] [CrossRef]
  25. Jiang, Q.; Gao, Z.; Karniadakis, G.E. DeepSeek vs. ChatGPT vs. Claude: A comparative study for scientific computing and scientific machine learning tasks. Theor. Appl. Mech. Lett. 2025, 15, 100583. [Google Scholar] [CrossRef]
  26. Khanifar, J. Evaluating AI-generated responses from different chatbots to soil science-related questions. Soil Adv. 2025, 3, 100034. [Google Scholar] [CrossRef]
  27. Sparks, D.L.; Page, A.L.; Helmke, P.A.; Loeppert, R.H. (Eds.) Methods of Soil Analysis, Part 3: Chemical Methods; John Wiley & Sons: Hoboken, NJ, USA, 2020. [Google Scholar]
  28. Schulte, E.E.; Hopkins, B.G. Estimation of soil organic matter by weight loss-on-ignition. Soil Org. Matter Anal. Interpret. 1996, 46, 21–31. [Google Scholar]
  29. U.S. EPA. Method 3050B: Acid Digestion of Sediments, Sludges, and Soils; Revision 2; U.S. EPA: Washington, DC, USA, 1996.
  30. Sarkar, D.; Datta, R. A modified in-vitro method to assess bioavailable arsenic in pesticide-applied soils. Environ. Pollut. 2003, 126, 363–366. [Google Scholar] [CrossRef]
  31. Pribyl, D.W. A critical review of the conventional SOC to SOM conversion factor. Geoderma 2010, 156, 75–83. [Google Scholar] [CrossRef]
  32. Stevenson, F.J. Humus Chemistry: Genesis, Composition, Reactions; John Wiley & Sons: Hoboken, NJ, USA, 1994. [Google Scholar]
  33. McBride, M.B. Environmental Chemistry of Soils; Oxford Press: New York, NY, USA, 1994. [Google Scholar]
  34. Basta, N.; Gradwohl, R. Estimation of Cd, Pb, and Zn bioavailability in smelter-contaminated soils by a sequential extraction procedure. J. Soil Contam. 2000, 9, 149–164. [Google Scholar] [CrossRef]
  35. Schwertmann, U.T.R.M.; Taylor, R.M. Iron oxides. Miner. Soil Environ. 1989, 1, 379–438. [Google Scholar]
  36. Trivedi, P.; Axe, L. Modeling Cd and Zn sorption to hydrous metal oxides. Environ. Sci. Technol. 2000, 34, 2215–2223. [Google Scholar] [CrossRef]
  37. Ruby, M.V.; Davis, A.; Schoof, R.; Eberle, S.; Sellstone, C.M. Estimation of lead and arsenic bioavailability using a physiologically based extraction test. Environ. Sci. Technol. 1996, 30, 422–430. [Google Scholar] [CrossRef]
  38. Drexler, J.W.; Brattin, W.J. An in vitro procedure for estimation of lead relative bioavailability: With validation. Hum. Ecol. Risk Assess. 2007, 13, 383–401. [Google Scholar] [CrossRef]
  39. Basta, N.T.; Ryan, J.A.; Chaney, R.L. Trace element chemistry in residual-treated soil: Key concepts and metal bioavailability. J. Environ. Qual. 2005, 34, 49–63. [Google Scholar] [CrossRef]
  40. Juhasz, A.L.; Scheckel, K.G.; Betts, A.R.; Smith, E. Predictive capabilities of in vitro assays for estimating Pb relative bioavailability in phosphate amended soils. Environ. Sci. Technol. 2016, 50, 13086–13094. [Google Scholar] [CrossRef]
  41. Poggio, L.; Vrščaj, B. A GIS-based human health risk assessment for urban green space planning—An example from Grugliasco (Italy). Sci. Total Environ. 2009, 407, 5961–5970. [Google Scholar] [CrossRef]
Figure 1. Weighted metrics of the two top-performing models with different imputation methods. (Top left): bar graph of individual scores (from left to right)—prediction accuracy, R2, and RMSE. (Top right): weighted criteria. (Bottom left): bar graph of final scores. (Bottom right): Table summarizing scores of the two top-performing models. Metrics assessed—prediction accuracy, R2, RMSE, weighted score. Best models for simple imputation and iterative imputation, respectively, are the adaptive weighted ensemble model and the Bayesian averaging ensemble model.
Figure 2. Model inspection of light gradient-boosting model prediction performance and feature impacts: (A) prediction accuracy; (B) SHAP analysis; (C) feature importance; (D) feature correlation heatmap.
Figure 3. Two-sample Kolmogorov–Smirnov test of four significantly shifted features (p < 0.05): TotalPb mg/kg (top left), sand % (top right), pH (bottom left), silt % (bottom right). The red histogram represents the distribution of feature values within the validation dataset; the blue histogram represents the training feature distribution. KS = Kolmogorov–Smirnov statistic. p-values listed with an adjacent star indicate statistically significant results.
Figure 4. Principal component analysis projection plots of the entire dataset’s feature space (training = blue circle, validation = red triangle). The blue circle represents the area occupied by the training dataset. Points that fall outside of the circles are composed of one or more features with extreme values.
Figure 5. Scatterplot of predicted points after resampling. The scatterplot assigns predicted points to three groups based on the model’s certainty within a specified confidence interval (95%). The confidence interval was calculated from the absolute residual values, i.e., how far the predicted value was from the true value. Red triangle = values outside the confidence interval. Blue square = values that were within the 95% uncertainty interval but were poor predictions. Green dots = values that are within 10% of the target value.
Figure 6. Greedy leave-out-one plots. The graphs demonstrate the R2 value gained when a certain point is removed from the model. The (left graph) illustrates cumulative gain from point removal, starting with high-impact point removal. The (right graph) shows point-by-point gain in R2.
Figure 7. Validation model performance across all three subsets of data (training, testing, validation), including improvements to predictions after advanced model retraining. Predictions were replotted after iterative synthetic resampling (iteration = 200) and removal of the 7 identified outliers.

Share and Cite

MDPI and ACS Style

Wijesinghe, S.; Sarkar, D.; Saleh, H.; Mustafa, K.; Rao, S.; Datta, R. Using Machine Learning Algorithms to Clarify Relationships Between Soil Properties and Lead Stomach Bioaccessibility. Appl. Sci. 2026, 16, 3504. https://doi.org/10.3390/app16073504
