1. Introduction
Lead is a pervasive environmental toxin affecting thousands of children across the United States [1]. Lead is relatively stable under most environmental conditions, so lead deposited in soil can remain a persistent source of exposure long after its initial release. This contributes to ongoing exposure in many areas today, even after lead's elimination from products such as paint, gasoline, and food cans [2]. Although residential use of lead-based paint was banned in the U.S. in 1978, many older homes still pose a significant risk of lead exposure, even after abatement efforts [3]. The U.S. EPA hazard standard for lead in bare soil in children's play areas is 400 mg/kg [4]. Lead-contaminated soil, especially bare soil, is a major source of exposure, as studies have shown that soil lead levels are strongly correlated with elevated blood lead levels in children [5]. Childhood lead poisoning remains a widespread issue, and in 2021 the Centers for Disease Control and Prevention (CDC) lowered the blood lead reference value from 5.0 µg/dL to 3.5 µg/dL [6]. The primary route of environmental lead exposure is ingestion of chipped lead-based paint and of dust in soil or on surfaces in older homes [7]. The severity of health outcomes is linked to elevated blood lead levels (EBLLs), which in turn correlate strongly with the amount of bioaccessible lead in the soil. Several soil properties, such as pH, electrical conductivity, organic-matter content, and texture, influence the bioaccessibility of lead. Identifying the soil characteristics that influence bioaccessibility is a critical first step in assessing risk and requires further investigation [8]. While these properties have been studied in relation to bioavailability [9], their effect on in vitro stomach bioaccessibility has not been fully explored. Understanding how these factors affect bioaccessibility in the stomach is important because in vitro gastric bioaccessibility correlates well with in vivo measurements from juvenile swine models. It is therefore crucial to explore strategies that reduce lead bioaccessibility in soil and thereby mitigate human exposure risk [10].
Soil pH plays a crucial role in lead bioavailability, much as gastric pH does in the body. Under acidic conditions, lead species are more soluble, increasing their bioaccessibility, whereas under alkaline conditions they become less bioavailable due to precipitation or complexation [11]. This relationship is well documented, with soil pH showing a negative correlation with lead bioaccessibility [12].
Soil texture, defined by the proportions of clay, sand, and silt, influences the soil's cation-exchange capacity (CEC) and is another critical factor in lead retention. Clay minerals, particularly phyllosilicates, facilitate the adsorption and structural incorporation of heavy metals through oxyhydroxide reactions [13]. Sand and silt, composed primarily of quartz and other minerals, also contribute to cation exchange, although less effectively than clay. Lead is known to form surface complexes with iron (oxyhydr)oxides, such as hematite and goethite, and is retained over time in crystalline iron minerals [14]. Studies consistently show a negative correlation between clay content and lead bioaccessibility, attributed to the high adsorption capacity of clay minerals [9,15,16]. A linear model only moderately predicted bioaccessible lead (R² = 0.35), but the correlation improved substantially when the source of lead was factored in (R² = 0.86), underscoring the importance of considering both soil particle size and lead source in predictive models.
Soil organic matter (OM), composed of decayed biological material and microbial byproducts, interacts with lead through complexation with organic acids. The stability of these complexes depends on factors such as organic acid concentration and soil pH [17]. Lead has a high affinity for negatively charged functional groups, especially those containing oxygen [18]. Lead sorption by organic ligands is most effective when organic acid concentrations are relatively low and the pH is favorable for complexation [19]. A higher organic-matter content is typically associated with reduced lead bioaccessibility because of the strong binding of lead to organic compounds [20].
Electrical conductivity (EC) measures the concentration of dissolved salts, including non-hydrolyzing cations such as Ca²⁺, Mg²⁺, Na⁺, and K⁺, along with anions such as Cl⁻, SO₄²⁻, HCO₃⁻, and NO₃⁻. EC is directly related to soil salt content and has been found to correlate negatively with lead bioaccessibility [21]. Higher salt concentrations in soils promote the formation of insoluble lead species, such as lead carbonate and lead associated with Fe–Mn oxides, thus reducing bioavailability [20].
Soil's heterogeneous nature complicates the analysis of lead bioaccessibility. Machine learning (ML) regression techniques offer powerful tools for uncovering correlations between soil properties and bioaccessibility, and ML is a fast-growing application of AI in soil science [22]. Random forests and neural networks have been applied to soil moisture and soil–water retention, respectively [22]. Multiple linear regression (MLR) can identify only linear relationships between variables; non-linear relationships can be captured only after the data have been transformed to linearize them. In contrast, tree-based and gradient-boosted ML models offer greater predictive accuracy on the non-linear, non-normal datasets often observed in nature. Model inspection techniques, such as feature importance, reveal the variables that contribute most to the model's performance, thereby identifying the factors vital to estimating the target value. Applied to the analysis of lead bioaccessibility from a variety of soil properties, ML methods are therefore far more robust predictive tools.
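As a hedged illustration of this kind of model inspection, the sketch below trains a random forest on synthetic soil-like data and ranks features by permutation importance; the feature names, value ranges, and target relationship are invented for demonstration and do not come from this study's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic soil-like data: names, ranges, and the target relationship
# are illustrative assumptions only.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "pH": rng.uniform(4.5, 8.5, 200),
    "OM": rng.uniform(1.0, 12.0, 200),    # organic matter, %
    "EC": rng.uniform(0.1, 2.0, 200),     # electrical conductivity, dS/m
    "clay": rng.uniform(5.0, 40.0, 200),  # clay content, %
})
# Non-linear synthetic target standing in for bioaccessible Pb (%)
y = 60 - 4 * X["pH"] - 1.5 * X["OM"] + 5 * np.sin(X["clay"] / 5) + rng.normal(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Permutation importance: how much does shuffling each feature degrade R²?
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name:>4s}: {score:.3f}")
```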
These techniques have been applied in previous publications to examine the role of spatial environmental factors [23] or experimental factors [24], in addition to soil chemistry, in heavy-metal bioaccessibility. A recent publication found that soil properties contributed 88.1% of its bioaccessible lead predictions, with aging and experimental factors contributing the remainder [24]. That model was built from n = 157 datapoints, compared with the only other similar published model, which used over 300 unique datapoints for lead [23], and achieved high performance (R² = 0.95) using a robust meta-learner on bootstrapped, cross-validated datasets. While these models are highly accurate, they have yet to be validated against new experimental data on soil samples using almost exclusively soil properties. To have utility as a predictive tool in real-world applications, a model must be tested on datapoints it has not seen before. The objective of this study was to co-develop an ML regression model as a generalizable predictive tool for lead stomach bioaccessibility, focusing on identifying the key soil characteristics that influence bioaccessible lead in urban soils. Large language models (LLMs) were employed to assist in writing the codebase and to suggest improvements to the model and statistical tests. Claude Sonnet was selected for this purpose because of its high performance on scientific machine learning tasks [25] and its strong domain-specific performance in soil science relative to other widely used LLMs [26]. Once model training had been optimized with the aid of Claude Sonnet, the best-performing model was assessed for predictive accuracy on an isolated, unseen dataset. Our report highlights the exploration of machine learning techniques and our effort to build a comprehensive model for practical application as a predictive tool.
2. Materials and Methods
Soil samples were collected from 30 residential sites across three cities: Detroit, MI; San Antonio, TX; and Baltimore, MD. Sampling was conducted near exterior walls where deteriorated paint was present, at a depth of 0–15 cm. The surface soil from these locations was dried and sieved in preparation for detailed characterization. Sampling sites were selected based on preliminary data from a portable X-ray fluorescence device (pXRF, Niton XL3t, Thermo Fisher Scientific, Bergenfield, NJ, USA), which was used to identify areas with potentially high lead concentrations near the houses. Once collected, soil samples from each site were composited for analysis. Soil characterization involved measuring soil pH, electrical conductivity (EC), and texture following the standardized protocols described in the Soil Science Society of America Handbook for Chemical and Mineralogical Analysis [27]. The percentage of organic matter in the soils was determined using the weight-loss-on-ignition method [28]. Total lead was determined after acid digestion with HNO₃ (trace-metal grade, Fisher Scientific, Fair Lawn, NJ, USA) and H₂O₂ (ACS reagent grade, Sigma-Aldrich, St. Louis, MO, USA) following USEPA Method 3050B [29]. Lead concentrations were measured using inductively coupled plasma optical emission spectroscopy (ICP-OES, Agilent Technologies 5100, Santa Clara, CA, USA). All extractions and analyses were performed in triplicate.
Lead bioaccessibility was assessed using a modified version of the Unified BARGE Method (UBM). This method simulates the digestive process using synthetic fluids representative of saliva, gastric acid, duodenal fluid, and bile, all formulated from salts and organic compounds to mimic their natural composition. The pH of each fluid was adjusted to simulate the corresponding digestive environment. A key modification to the protocol was the incorporation of ferric-oxide strips as an ion sink for soluble lead, based on a technique previously applied to bioavailable arsenic assays [30]. After the gastric phase of digestion, the ferric-oxide strips were added to the intestinal-phase solution to adsorb bioavailable lead. Once the intestinal-phase incubation was complete, the strips were removed and placed in a beaker containing 50 mL of 1.6 M nitric acid to desorb the lead. Samples were collected at three stages: the stomach phase, the intestinal phase, and the ferric-oxide strip desorption solution. All samples were analyzed for soluble lead content using ICP-OES.
A machine learning model was designed to examine the relationship between lead bioaccessibility and soil properties (n = 18). The source of lead pollution and the in vitro assay used to estimate lead bioaccessibility were also considered, yielding a wide range of explanatory variables for a comprehensive model. The initial dataset was expanded to 670 unique datapoints by incorporating data from relevant published sources. Extractable literature was identified through an advanced Google Scholar search for the terms "soil properties", "Pb bioaccessibility", and "in vitro digestion", together with specific parameters such as pH, OM, EC, and CEC. A total of 12 sources with extractable data were found, including a machine learning dataset used in a prior publication to predict heavy-metal bioaccessibility from soil properties and additional environmental factors [23]. The compiled dataset was loaded into Python version 3.12 for model evaluation.
The main pipeline for hyperparameter tuning was constructed initially without the aid of AI, and later with Claude Sonnet 3.7–4.6. The initial model consisted of three regression models: a random forest regressor (RFR), a histogram gradient-boosting model (HistGBM), and an eXtreme gradient-boosting model (XGBoost). Tree-based models were chosen to handle complex, non-normal data, and gradient-boosting models were added to reduce the risk of overfitting. However, this initial setup failed to achieve the desired prediction accuracy: the RFR overfit, and the gradient-boosting models failed to capture feature patterns. The pipeline was then reconstructed, and the files were uploaded into Claude to address the initial model's shortcomings. A major challenge was the relatively small size of the dataset compared to its large number of features. In collaboration with Claude and through testing, the selected models were tuned to combat overfitting and to better handle the complexities of the dataset. Claude also streamlined pipeline development, freeing time that would otherwise have been spent writing source code for optimizing specific parameters. The pipeline consisted of four main steps: preprocessing, feature selection, model training, and final model selection (sketched below).
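The skeleton below illustrates how these four stages can be wired together in scikit-learn; the estimators and the hyperparameter grid are placeholders rather than the tuned values used in this study.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    # Step 1: preprocessing (Yeo-Johnson normalization of numeric features)
    ("preprocess", PowerTransformer(method="yeo-johnson")),
    # Step 2: feature selection (recursive elimination with cross-validation)
    ("select", RFECV(RandomForestRegressor(random_state=0), cv=5)),
    # Step 3: model training (placeholder estimator)
    ("model", RandomForestRegressor(random_state=0)),
])

# Step 4: final model selection via cross-validated hyperparameter search
param_grid = {"model__n_estimators": [200, 500], "model__min_samples_leaf": [2, 5]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
# search.fit(X_train, y_train)  # X_train/y_train: the compiled soil dataset
```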
After extraction, the data were compiled into a single Excel file. The dataset was standardized by converting units into the most commonly reported unit for each feature. Interchangeable features such as organic carbon (%) and organic matter (%) were merged into a single feature using organic matter (%) = total organic carbon (%) × 2 [31]. The following techniques were implemented with the aid of Claude to improve the quality of the dataset used in the model. Features were separated into numerical (n = 17) and categorical (n = 3) data and then transformed. Numerical features were scaled using the Yeo–Johnson method to better normalize the resulting data, and categorical data were transformed into binary indicators using one-hot encoding. Values lying more than 1.5× the interquartile range outside the interquartile bounds were removed as outliers. Mostly empty features (>65% missing values, NaN) were identified and removed; numerical features with 65% or fewer missing values were imputed via simple or iterative methods. Simple imputation filled missing values with the median of the feature. A robust iterative imputation method estimated missing datapoints using a random forest regressor and was optimized to conserve variance: converging iterations with the lowest resulting variance, stability across multiple tested iterations, and low risk of overfitting were selected. Finally, after imputation, the numerical features were recombined with the categorical features and augmented with synthetically produced datapoints. Synthetic datapoints made up one-third of the total dataset and were generated conservatively using K-nearest neighbors, assisting model learning without losing fidelity to real-world data.
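A condensed sketch of this preprocessing stage is given below, assuming pandas and scikit-learn. The helper names (`preprocess`, `expand_synthetic`), the exact ordering of the steps, and the KNN interpolation scheme are illustrative assumptions, not the study's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PowerTransformer, OneHotEncoder
from sklearn.neighbors import NearestNeighbors

def preprocess(df: pd.DataFrame, categorical: list[str]) -> pd.DataFrame:
    num = df.drop(columns=categorical)
    # Drop mostly empty features (>65% missing)
    num = num.loc[:, num.isna().mean() <= 0.65]
    # Mask values beyond 1.5x the IQR as missing so imputation refills them
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    num = num.mask((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr))
    # Iterative imputation with a random forest estimator
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=0),
        max_iter=10, random_state=0)
    num = pd.DataFrame(imputer.fit_transform(num), columns=num.columns)
    # Yeo-Johnson transform to normalize skewed numeric features
    num = pd.DataFrame(PowerTransformer(method="yeo-johnson").fit_transform(num),
                       columns=num.columns)
    # One-hot encode categorical features
    enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    cats = pd.DataFrame(enc.fit_transform(df[categorical]),
                        columns=enc.get_feature_names_out(categorical))
    return pd.concat([num, cats], axis=1)

def expand_synthetic(X: pd.DataFrame, y: pd.Series, frac: float = 0.5,
                     k: int = 5, seed: int = 0):
    """Conservative KNN-based expansion (a SMOTE-like scheme for regression):
    each synthetic point is interpolated between a real point and one of its
    k nearest neighbors. frac=0.5 of the real data makes synthetic points
    one-third of the final dataset, as described above."""
    rng = np.random.default_rng(seed)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    new_X, new_y = [], []
    for i in rng.integers(0, len(X), int(len(X) * frac)):
        j = idx[i][rng.integers(1, k + 1)]  # skip idx[i][0], the point itself
        lam = rng.uniform()
        new_X.append(X.iloc[i].values + lam * (X.iloc[j].values - X.iloc[i].values))
        new_y.append(y.iloc[i] + lam * (y.iloc[j] - y.iloc[i]))
    X_out = pd.concat([X, pd.DataFrame(new_X, columns=X.columns)], ignore_index=True)
    y_out = pd.concat([y, pd.Series(new_y)], ignore_index=True)
    return X_out, y_out
```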
Recursive feature elimination with cross-validation (RFECV) combined with consensus techniques selected the top 10 features based on their importance to the estimator model. In our first model, a random forest estimator was used alone during RFECV, but this often led to the selection of underperforming features. With Claude, the process was further optimized by introducing consensus techniques that choose commonly high-performing features: variance-threshold and mutual-information selection were added, and the final features were chosen by a voting consensus. Claude also helped design a method to select the optimal number of impactful features automatically instead of tuning this number by hand. A sketch of the consensus step is shown below.
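This is a minimal sketch of the consensus step, assuming scikit-learn selectors and a two-of-three vote; the vote threshold, variance cutoff, and estimator settings are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV, VarianceThreshold, mutual_info_regression

def consensus_select(X: pd.DataFrame, y, k: int = 10) -> list[str]:
    votes = pd.Series(0, index=X.columns)

    # Voter 1: recursive feature elimination with cross-validation
    rfecv = RFECV(RandomForestRegressor(n_estimators=200, random_state=0), cv=5)
    rfecv.fit(X, y)
    votes[X.columns[rfecv.support_]] += 1

    # Voter 2: variance threshold (drops near-constant features)
    vt = VarianceThreshold(threshold=0.01).fit(X)
    votes[X.columns[vt.get_support()]] += 1

    # Voter 3: top-k features by mutual information with the target
    mi = mutual_info_regression(X, y, random_state=0)
    votes[X.columns[np.argsort(mi)[-k:]]] += 1

    # Keep features selected by at least two of the three voters,
    # ranked by vote count and capped at k
    ranked = votes[votes >= 2].sort_values(ascending=False)
    return list(ranked.index[:k])
```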
The number of models was also increased on Claude's recommendation to include model diversity. With this "shotgun" approach, multiple models, including ensembles, could be evaluated for validation. While performance was the top selection priority, we also hand-picked the best model with regard to model balance and computational demand. Seven base regression models were trained: a random forest regressor (RFR), an eXtreme gradient-boosting model (XGBoost), a light gradient-boosting model (LightGBM), a histogram gradient-boosting model (HistGBM), a multi-layer perceptron regressor (MLP), support vector regression (SVR), and K-nearest neighbors (KNN). Multiple ensemble methods built from these seven base models were also created. The base models were chosen to diversify the techniques used to estimate each individual point, with the intention of improving the ensembles. The diversity-based ensembles used a meta-estimator, a voting regressor, which fits the base models and estimates target values from their average; the adaptive weighted and Bayesian averaging ensembles apply different statistical approaches to dynamically weight each base model's reliability. All models were trained over a broad range of hyperparameters optimized to decrease the risk of overfitting. This was particularly impactful for the random forest regressor, whose "forests" were designed to be simple and generalizable, meaning that a feature correlation had to be statistically meaningful to generate a new split.

The final selected model was retrained on a subset of the data (n = 620) and then used to predict target values for a holdout dataset. The validation set was drawn from the internal dataset (n = 30) to simulate a real-life application of the predictive model; randomly distributed datapoints from the external dataset (n = 20) were also included to help generalize conclusions during post-validation analysis and to balance the range of predicted points. Prediction inspection techniques were then employed to explain poor predictions, domain shift, and feature reliance. Once these were diagnosed, more advanced techniques were applied to address domain shift, explore the implications of the model's "blind" performance, and investigate and remove outliers. Domain shift was addressed by a resampling technique that moves the majority of the validation dataset into training, retrains the model, and predicts the remaining datapoints; the indices moved were randomly chosen in each run, and retraining was repeated over many runs to stabilize the projected values. All randomness introduced during random splits, imputation, and synthetic expansion was fixed to a seed for reproducibility. Finally, outlier analysis inspected the resulting predictions and located values that fell consistently outside the model's uncertainty; these points were removed and set aside for further inspection to understand the model's current inadequacies. The ensemble construction and resampling check are sketched below.
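The sketch below illustrates the seven-model voting ensemble and the iterative resampling check, assuming the `xgboost` and `lightgbm` packages are installed and that the feature matrices are NumPy arrays. The hyperparameters are placeholders, and the adaptive-weighted and Bayesian-averaging variants are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, HistGradientBoostingRegressor,
                              VotingRegressor)
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# The seven base models; settings are illustrative, not the tuned values
base_models = [
    ("rfr", RandomForestRegressor(min_samples_leaf=5, random_state=0)),
    ("xgb", XGBRegressor(max_depth=4, random_state=0)),
    ("lgbm", LGBMRegressor(num_leaves=15, random_state=0)),
    ("hgbm", HistGradientBoostingRegressor(random_state=0)),
    ("mlp", MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)),
    ("svr", SVR(C=1.0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
]
ensemble = VotingRegressor(base_models)  # plain average of the 7 base predictions

def iterative_resample_r2(X_tr, y_tr, X_val, y_val, n_runs=25, keep=0.2, seed=0):
    """Repeatedly move most of the validation set into training (here 80%),
    retrain, and predict the remainder; averaging over many seeded runs
    stabilizes the projected values under domain shift."""
    rng = np.random.default_rng(seed)
    preds, targets = [], []
    for _ in range(n_runs):
        idx = rng.permutation(len(X_val))
        n_hold = max(1, int(len(idx) * keep))
        hold, move = idx[:n_hold], idx[n_hold:]   # hold out a minority each run
        ensemble.fit(np.vstack([X_tr, X_val[move]]),
                     np.concatenate([y_tr, y_val[move]]))
        preds.append(ensemble.predict(X_val[hold]))
        targets.append(y_val[hold])
    return r2_score(np.concatenate(targets), np.concatenate(preds))
```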
4. Discussion
This study demonstrates the utility of machine learning for estimating bioaccessibility from soil physicochemical properties, while also revealing key challenges to model generalizability. Surprisingly, the final model was not the one that marginally outperformed the other competing models, but rather the most balanced of the ensemble models. Although the Bayesian averaging and adaptive weighted models were within 0.002 of each other in weighted score, the adaptive weighted model (iterative synthetic resample R² = 0.67) performed substantially better than the Bayesian averaging model (iterative synthetic resample R² = 0.62). We conclude that the most generalizable ensemble model must be well balanced across its base models. Notably, our initial validation performed poorly; this reduction reflects a domain shift between heterogeneous literature-derived data and internally consistent in-house measurements, underscoring a common limitation of environmental machine learning applications.
Domain-shift issues remained prevalent throughout experimentation and were exaggerated by our small validation set. Domain shift was properly adjusted for with our synthetic iterative resampling approach. Close analysis of the plotted points identified well-predicted samples (n = 16) that made up the majority of our internal experimental datapoints (53%); half of those points (n = 8) were predicted within error of their target value. Model performance, however, was still heavily affected by outliers. After careful examination using our three-criterion outlier screen and greedy "leave-one-out" simulations (sketched below), seven points were found that fell outside the model's predictive range. These points exhibited alkaline conditions, high cation-exchange capacity, and low ferric-oxide adsorption, but diverging bioaccessibility fractions that the model could not pick up. For instance, datapoint 26's low Pb-to-FeOs ratio (0.13) was expected to indicate low bioaccessibility due to extensive surface complexation, and this was well reflected in the low bioaccessibility of sample 26's soil (5.9%). In point 28, by contrast, the Pb-to-FeOs ratio (0.35) was high, and lead cations may have overwhelmed the surface complexes and become more bioaccessible (87%). The model failed to establish this relationship, since the Pb/FeOs ratio was not specified as a feature and there were not enough real-world data to reveal the emerging pattern. Likewise, alkaline conditions led to low gastric lead concentrations, because high pH promotes complexation of Pb into non-bioaccessible forms; the model failed to capture this, causing poor predictions for points 3, 4, 6, and 7. Conversely, on similar examination of the near-perfect predictions (within 5% of the target value), Claude identified a pattern common to all of those indices: high pH together with high organic matter. The model noticed this consistent pattern and predicted it accurately.
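The three-criterion screen is not reproduced here, but a minimal sketch of the greedy leave-one-out idea, assuming NumPy arrays and a scikit-learn-compatible model, might look as follows; the error threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.base import clone

def leave_one_out_outliers(model, X, y, threshold=2.0):
    """Drop each point in turn, retrain, and flag points whose prediction
    error stays large regardless of what else the model sees."""
    flagged = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        fitted = clone(model).fit(X[keep], y[keep])
        err = abs(fitted.predict(X[i:i + 1])[0] - y[i])
        if err > threshold * np.std(y):  # consistently outside model uncertainty
            flagged.append(i)
    return flagged
```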
Another issue to consider was the loss of generalizability due to overfitting. According to both the prediction-accuracy plot and the SHAP analysis performed during final model selection, datapoints with target values above 1.5 or below −1 carry the least information for generating predictions. With our internal dataset, however, the observed outliers fell near the middle of the range, between −0.5 and 0.5. One explanation is overfitting, since most features' impact values are dense within that range: instead of adding value, these features further confused the model, and no distinct connections formed between target values and predictors. Feature engineering of vital geochemical relationships not yet exploited by the model could help counter this overfitting. Rather than adding another anomalous soil property, a relationship between two existing features could provide context and value, improving predictive accuracy. This has yet to be conducted and should be explored in future work.
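As a hedged illustration, the snippet below engineers two such relationship features. The column names ("TotalPb", "FeOs", "pH", "OM") are assumed stand-ins for this dataset's actual headers, and the engineered terms follow the patterns discussed above rather than any validated transformation.

```python
import numpy as np
import pandas as pd

def add_geochemical_ratios(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Pb/FeOs: high values suggest sorption sites may be overwhelmed,
    # leaving more lead bioaccessible (cf. points 26 and 28 above)
    out["Pb_FeOs_ratio"] = out["TotalPb"] / out["FeOs"].replace(0, np.nan)
    # pH x OM interaction: the pattern shared by the near-perfect predictions
    out["pH_x_OM"] = out["pH"] * out["OM"]
    return out
```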
5. Conclusions
The bioaccessibility of heavy-metal contamination remains a variable and difficult quantity to measure without long and costly digestion models, and it is tied significantly to the properties of the soil. By implementing a machine learning model, we showed that various soil properties, including TotalPb, organic matter, electrical conductivity, cation-exchange capacity, and pH, significantly influence lead bioaccessibility. Compiling data from previously published work together with our own experimental data yielded a complex predictive tool that has shown promising results. Our goal was to intentionally analyze a small subset of data to explore a practical application of this tool. Initial model tests indicated that domain shift resulted in poor performance; iterative resampling substantially improved model recovery, highlighting its ability to operate in the presence of domain shift. Ideally, sufficient data would be fed to the model in the future so that domain shifts are less frequent and predictions can be made on a point-by-point basis. Geochemical relationships between features can be better assessed with real-world data, allowing feature engineering to capture factors that are not apparent from the raw dataset alone. Although our validation set contained the least missing feature data, the features the model was trained on were not sufficient to estimate our outlier group; the inclusion of features that account for alkaline conditions, in particular, would improve predictions. Deep learning could also be applied to re-evaluate the model on the isolated poor predictions.
One major constraint limiting this model is the paucity of publicly available data. Future studies should seek to improve data quality and the standardization of soil properties across studies. Supplementing missing data with synthetic estimations proved adequate and can be made more meaningful with complete feature data; this would help expand the dataset with real-world samples and permit continued validation testing on other unknown samples. The aid of large-language-model artificial-intelligence tools, such as Claude, hastened the development of the algorithm and permitted the integration of optimization techniques, as demonstrated in our results. Some limitations in Claude's ability to debug and clean scripts stem from its inability to see the user's virtual environment, and we also noticed occasions when Claude failed to understand prompts based on context established earlier in the conversation. Conversation size is limited without signing up for one of Anthropic's paid plans, and context lost between conversations can lead to bloated scripts, which makes it almost a requirement to purchase the "Pro" plan and organize work within a project folder. Individuals or research groups considering Claude should weigh the value of a subscription to access its full features. Overall, Claude Sonnet with extended thinking proved an invaluable tool, especially for those interested in applying machine learning to their research. While there is still room to make our model more generalizable across diverse datasets, the model developed here is an excellent platform for creating a predictive tool.