Machine Learning-Driven Optimization for Predicting Biochar Adsorption Performance Toward Pb(II) and Cd(II)

Yu, Pengcheng; Huang, Zixi; Xie, Wuming

doi:10.3390/w18121416

by Pengcheng Yu, Zixi Huang and Wuming Xie *

Key Laboratory of Resources Comprehensive Utilization and Cleaner Production, Guangdong Education Department, School of Environmental Science and Engineering, Guangdong University of Technology, Guangzhou 510006, China

* Author to whom correspondence should be addressed.

Water 2026, 18(12), 1416; https://doi.org/10.3390/w18121416 (registering DOI)

Submission received: 15 April 2026 / Revised: 17 May 2026 / Accepted: 28 May 2026 / Published: 10 June 2026

(This article belongs to the Section Wastewater Treatment and Reuse)

Abstract

The increasing discharge of toxic heavy metals such as Pb(II) and Cd(II) poses serious threats to environmental safety and human health, necessitating efficient remediation technologies. Biochar has emerged as a promising eco-friendly adsorbent; however, its adsorption performance is constrained by interactions among material properties, environmental conditions, and ion specificity. Conventional machine learning (ML) models are typically built on single-metal-ion datasets, limiting their ability to leverage shared information across related adsorption scenarios. To address this limitation, this study proposes a descriptor-based ML framework for Pb(II)–Cd(II) adsorption prediction, in which ion-related physicochemical descriptors, such as electronegativity and hydrated ionic radius, are incorporated in place of discrete ion labels to enable ion-specific modeling. An Optuna-optimized CatBoost model achieved high predictive accuracy (R² = 0.952, RMSE = 9.80) and demonstrated improved performance on both Pb and Cd subsets compared with single-ion models. SHAP analysis indicated that the model's behavior is consistent with known adsorption-related factors. Uncertainty quantification was incorporated to constrain predictions and enhance robustness. Ultimately, this study provides a robust data-driven baseline for heavy metal adsorption modeling, offering mechanistic insights into biochar–metal interactions and demonstrating a physicochemical descriptor approach that supports future extensions to broader multi-ion systems.

Keywords: biochar adsorption; heavy metal remediation; ion-specific descriptor mapping; conformal prediction; machine learning

1. Introduction

The escalating discharge of heavy metals, driven by rapid industrial and urban expansion, has emerged as a critical global environmental concern [1]. The prevalence of these contaminants in wastewater is primarily attributed to their indispensable roles in the mining, battery production, and electroplating industries [2]. Due to their inherent resistance to biological degradation, these metallic ions exhibit prolonged environmental persistence, posing severe bioaccumulation risks across aquatic and terrestrial ecosystems [3]. Specifically, exposure to lead (Pb(II)) and cadmium (Cd(II)) is strongly associated with profound human health deterioration, encompassing neurological deficits, endocrine malfunction, and chronic conditions such as Itai-Itai disease [4]. Consequently, engineering highly efficient and universally deployable removal techniques has become an absolute environmental imperative [5].

Compared with conventional remediation technologies such as chemical precipitation and membrane filtration, adsorption is often preferred for heavy metal removal because of its economic feasibility and operational simplicity [6]. In this context, biochar (BC) has emerged as a highly effective and eco-compatible material for aquatic remediation, owing to its extensive pore network and dense surface functionalization [7]. The specific adsorption capacity is governed by a complex interplay of physical entrapment, electrostatic forces, and surface complexation, all of which are highly sensitive to feedstock origins, pyrolysis settings, and the prevailing solution chemistry [7]. Ultimately, the substantial fluctuations observed in BC performance can be attributed to the intricate dynamic between these experimental variables, the physicochemical traits of the biochar, and the intrinsic ionic profiles of the target metallic species [8].

Optimizing biochar application for specific heavy metals requires understanding the complex interplay among material properties and environmental conditions. Traditional experimental approaches, such as “One-Factor-At-a-Time” methods and orthogonal designs, are widely used to investigate individual parameters [5]. While these approaches provide essential empirical data, they remain limited in representing nonlinear interactions among multiple variables in complex adsorption systems [9]. Similarly, conventional empirical modeling approaches (e.g., classic isotherm and kinetic equations) are typically fitted under specific equilibrium or rate conditions and may be sensitive to simplifying assumptions and model-selection choices, whereas ML offers a way to learn nonlinear relationships across heterogeneous datasets [10]. Results derived from such isolated experimental designs and condition-specific models are often difficult to compare or integrate across different studies [11], which further hinders the systematic evaluation of how material structures and environmental conditions jointly influence adsorption performance [12]. As a result, the available datasets are frequently fragmented across different experimental conditions and target metal ions. Machine learning (ML) has been increasingly adopted to model high-dimensional and nonlinear relationships within these datasets, enabling the prediction of adsorption capacities, optimization of process parameters, and screening of adsorbent materials [12,13]. In many cases, ML models are still constructed for single-metal modeling, inheriting the underlying data fragmentation [14].

A common limitation in current machine learning applications for heavy metal adsorption is that models are usually constructed from single-metal-ion datasets, with separate models trained for different metal ions [15]. Such a modeling strategy confines each model to a specific dataset, limiting its ability to leverage complementary information across related metal systems [9]. Previous studies have shown that adsorption processes of Pb and Cd often involve similar fundamental mechanisms, such as electrostatic interaction, ion exchange, and surface complexation, suggesting the possibility of shared governing factors to some extent [16]. However, most existing studies remain focused on individual metal systems and rarely provide quantitative exploration of cross-ion relationships or an integrated modeling approach. In this context, integrating Pb and Cd within a descriptor-based parameterization provides a practical pathway to reduce dataset fragmentation and enable information sharing across related metal systems. Feature engineering can therefore adopt continuous physicochemical descriptors instead of discrete ion identifiers to achieve a descriptor-based representation of different ions [17]. Introducing such physical features as inputs removes the need to build separate models for each metal, thereby enabling more efficient use of existing data within a single predictive model. Although recent studies have attempted to encode metal ions using physicochemical descriptors and develop shared ion-aware representations, their ability to fully exploit cross-ion information, particularly for improving predictions within known metal systems, remains underexplored [18]. As a result, empirical evidence demonstrating whether such integration can effectively enhance predictive performance in practice is still limited, especially in environmental application scenarios. In addition, uncertainty quantification is often neglected, which further reduces the reliability of model predictions in practical applications [19].

To address these specific gaps, this study develops a descriptor-based machine-learning framework for predicting Pb(II) and Cd(II) adsorption behavior. The proposed framework is distinguished from existing ML-based adsorption studies in three key aspects: (i) it integrates Pb and Cd datasets by replacing discrete categorical variables with physically meaningful descriptors (e.g., electronegativity), serving as an extensible parameterization for descriptor-based modeling rather than treating different metal systems in isolation; (ii) it systematically evaluates the performance gain of this descriptor-based strategy against conventional single-metal models to provide empirical evidence of its practical benefit; and (iii) it moves beyond deterministic point predictions by integrating locally adaptive inductive conformal prediction (ICP) to generate statistically rigorous uncertainty intervals, supporting the reliable application of biochar adsorption predictions under real environmental conditions. Random forest was adopted to impute missing values in the dataset, and CatBoost was identified as the optimal model through model comparison. SHAP analysis was used to interpret the key variables governing adsorption behavior. Ultimately, this study provides a data-driven framework for descriptor-based modeling of adsorption systems, improving predictive accuracy through shared feature representations and supporting performance optimization and practical biochar application.

2. Materials and Methods

The comprehensive machine learning framework developed in this study consists of five main phases, namely data collection, feature engineering, model development, synergy verification, and mechanistic interpretation, as systematically depicted in Figure 1.

2.1. Data Acquisition and Pre-Processing

2.1.1. Data Collection and Feature Engineering

The development workflow of the machine learning models is illustrated in Figure 1. A comprehensive literature search was conducted across three major databases—Web of Science, ScienceDirect, and CNKI—focusing on the last decade, using keywords such as “biochar”, “lead”, “cadmium”, and “adsorption”. Studies were included if they (i) reported batch adsorption experiments of Pb(II) or Cd(II) using biochar-based materials, (ii) provided explicit experimental conditions and adsorption capacity data, and (iii) employed biomass-derived feedstocks under clearly defined pyrolysis conditions. Studies focusing on multi-metal competitive adsorption, non-biochar adsorbents, or lacking sufficient experimental detail were excluded. From the selected studies, data were extracted directly from text and tables; when unavailable, numerical values were digitized from graphical representations using GetData Graph Digitizer (v2.24) software. To construct a shared feature set for the Pb/Cd system, several engineering steps were performed. The unstructured adsorbent descriptions were systematically parsed into structured categorical features: Feedstock and Modification type. Notably, to characterize the intrinsic attributes of the adsorbates beyond simple discrete labels, ion-specific descriptors such as electronegativity, ionic/hydrated radius, and atomic weight were introduced to replace the nominal “pollutant type”. After rigorous curation, the final dataset comprised 781 data points (304 for Pb(II) and 477 for Cd(II)).
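The replacement of the nominal "pollutant type" with ion descriptors can be illustrated with a minimal sketch. The descriptor values below are commonly cited literature figures used here as assumptions, not the values adopted in this study, and the record fields and the helper `add_ion_descriptors` are hypothetical.

```python
# Illustrative ion descriptors (assumed values, not taken from the study):
# Pauling electronegativity, hydrated radius in angstroms, atomic weight in g/mol.
ION_DESCRIPTORS = {
    "Pb(II)": {"electronegativity": 2.33, "hydrated_radius_A": 4.01, "atomic_weight": 207.2},
    "Cd(II)": {"electronegativity": 1.69, "hydrated_radius_A": 4.26, "atomic_weight": 112.41},
}


def add_ion_descriptors(record):
    """Return a copy of `record` with the nominal ion label replaced by descriptors."""
    row = dict(record)
    ion = row.pop("pollutant")        # drop the discrete label
    row.update(ION_DESCRIPTORS[ion])  # add the continuous descriptors
    return row


# Hypothetical records with made-up experimental conditions.
records = [
    {"pollutant": "Pb(II)", "pH": 5.0, "initial_conc_mg_L": 100.0},
    {"pollutant": "Cd(II)", "pH": 6.0, "initial_conc_mg_L": 50.0},
]
encoded = [add_ion_descriptors(r) for r in records]
```

Because both ions now share the same numeric columns, a single regressor can be trained on the merged table without a categorical ion column.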

The extracted literature data were examined to identify key factors governing adsorption performance. Based on their mechanistic relevance, the screened input features were grouped into six categories: (1) biochar pore structure (BET surface area, pore volume, average particle size), representing the accessibility and availability of active adsorption sites; (2) solution physicochemical environment (pH, initial concentration, solid–liquid ratio), which controls metal speciation and surface charge interactions; (3) operational kinetics (adsorption temperature, contact time), reflecting adsorption rate and thermodynamic feasibility; (4) preparation parameters (pyrolysis temperature and time), which govern pore development and surface functional group formation; (5) chemical composition (C/H/O/N contents, atomic ratios, ash content, and 11 inorganic elements), providing information on surface functionality and mineral phases relevant to complexation and ion exchange; and (6) heavy-metal physicochemical properties (electronegativity, hydrated ionic radius, covalent radius, and hydrolysis constant), characterizing intrinsic differences in metal–surface interactions. To mitigate the extreme sparsity and heterogeneity of quantitative reports, the 11 inorganic elements within the chemical composition category were encoded as binary indicators (1 = presence, 0 = absence). Ultimately, the finalized dataset incorporates 11 binary elemental indicators and 23 physicochemical and environmental variables, which are comprehensively summarized in Table S1.
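The presence/absence encoding of the inorganic elements could look roughly like the following sketch. The four-element list and the sample values are hypothetical placeholders; the actual 11 elements are those listed in Table S1.

```python
# Hypothetical subset of elements; the study tracks 11 inorganic elements (Table S1).
ELEMENTS = ["K", "Ca", "Mg", "Fe"]


def presence_indicators(reported):
    """Encode each element as 1 if a positive amount was reported, else 0."""
    indicators = {}
    for element in ELEMENTS:
        value = reported.get(element)  # None when the source paper did not report it
        indicators[f"has_{element}"] = int(value is not None and value > 0)
    return indicators


# Made-up composition report (wt%) for one biochar sample.
sample = {"K": 1.8, "Ca": 0.0, "Fe": 0.35}
indicators = presence_indicators(sample)
```

Discarding the reported amounts loses information, but it sidesteps the sparsity and unit heterogeneity of quantitative element reports described above.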

2.1.2. Missing Value Imputation and Leakage Prevention

Upon completing data collection, a review of the input features revealed missingness in core structural parameters, such as pyrolysis temperature, BET surface area, and total pore volume. To address this and ensure a rigorous statistical reconstruction of the structural parameters, a refined imputation strategy tailored to the current data distribution was developed. The dataset encompasses a robust subset of 406 completely observed instances, while the remaining 375 instances predominantly exhibit minor missingness (fewer than five variables). To prevent data leakage, all preprocessing steps, including feature encoding and imputation model training, were strictly performed using only the training portion of the complete-observation subset. During imputation model construction, the final prediction target (i.e., heavy metal adsorption capacity) was explicitly excluded from the feature matrix. The incomplete samples were never used during parameter estimation. Instead, the imputation models were trained exclusively on fully observed instances to predict the missing values, effectively minimizing potential variance.

Accordingly, a modified Multivariate Imputation by Chained Equations (MICE) framework utilizing Random Forest (RF) as the base learner was adopted to handle missing feature values. During this MICE-RF implementation, the Optuna algorithm was incorporated to optimize hyperparameters. Crucially, cross-validation was conducted within the training set using K-fold and Stratified K-fold strategies, ensuring no information from test folds was used during model fitting or hyperparameter optimization. To preserve the discrete gradient design of typical experimental studies, specific processing steps were applied based on feature types. Sparse variables, such as pyrolysis time, were transformed into categorical labels and imputed using an Optuna-optimized RF classifier within the Stratified K-Fold framework. For discrete numeric features (e.g., adsorption time and solid–liquid ratio), a “Snap-to-Observed” post-processing step was applied. This step calculated the Euclidean distance between RF-predicted values and the existing observation set, mapping predictions to the nearest experimental levels. While this heuristic approach may slightly reduce local variance, it was employed to avoid non-physical intermediate values and ensure that the imputed data strictly aligned with the discrete gradient nature of the experimental designs.
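The "Snap-to-Observed" step can be sketched for a single feature as below. For one-dimensional values the Euclidean distance reduces to an absolute difference. The function name, the contact-time levels, and the predicted values are illustrative assumptions.

```python
def snap_to_observed(predicted, observed_levels):
    """Map each predicted value to the nearest level present in the observed data."""
    levels = sorted(set(observed_levels))
    # Ties resolve to the lower level because `min` keeps the first minimum it sees.
    return [min(levels, key=lambda level: abs(level - p)) for p in predicted]


# Hypothetical contact times (min) seen in the data and raw RF imputations.
observed_times = [30, 60, 120, 240, 1440]
rf_predictions = [47.2, 101.5, 900.0]
snapped = snap_to_observed(rf_predictions, observed_times)
```

In this toy case the imputations 47.2, 101.5, and 900.0 min are mapped to 60, 120, and 1440 min, so no imputed sample carries a contact time that no experiment actually used.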

2.2. Data Visualization and Pre-Processing

Focusing on the requirements of machine learning model development, the distributions of input and output variables were visualized to clarify data dispersion. Box plots were selected as the primary tool to concisely present distribution characteristics. By integrating these plots with descriptive statistics, an exploratory analysis was conducted to assess data quality (e.g., completeness, rationality) and diversity, ensuring the dataset was sufficiently robust for modeling.

To reveal associations between variables related to Cd(II) and Pb(II) adsorption, heat maps were employed, utilizing the Pearson Correlation Coefficient (PCC) to measure linear correlation strength. Leveraging the core advantages of PCC—specifically its utilization of variable covariance and scale invariance—the calculation covered all variables in the input–process–output system:

R_{xy} = \frac{\sum_{i = 1}^{N} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{N} (x_{i} - \bar{x})^{2}} \sqrt{\sum_{i = 1}^{N} (y_{i} - \bar{y})^{2}}}

(1)

The parameter R_xy represents the correlation coefficient between input variable x and target variable y. Through these calculations, linear interactions between variables were identified, providing data-driven insights into the system’s operational characteristics [20,21].
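Equation (1) can be checked numerically with a short sketch. The two toy series are made up for illustration, and `pearson_r` is a hypothetical helper rather than the routine used in the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient following Equation (1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance_sum = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariance_sum / (math.sqrt(ss_x) * math.sqrt(ss_y))


# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)                           # 8 / (sqrt(10) * sqrt(10)) = 0.8
r_scaled = pearson_r([10 * a for a in x], y)  # rescaling x leaves r unchanged
```

The second call illustrates the scale invariance mentioned above: multiplying one variable by a positive constant does not change the coefficient.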

2.3. Selection and Building of ML Models

Based on biochar pore structure, solution physicochemical environment, and operational kinetic parameters, a cross-ion synergistic prediction framework for Pb(II)/Cd(II) was constructed. The study addressed specific data characteristics, including the high heterogeneity of biochar physicochemical properties, a limited sample size, and the non-linear coupling of variable responses. Consequently, the Gradient Boosting Decision Tree (GBDT) family was identified as the optimal algorithmic approach, specifically its three core variants: CatBoost, LightGBM, and XGBoost. To ensure a consistent feature representation, discrete categorical variables were subjected to a standardized encoding process prior to training. By utilizing ensemble strategies, robust prediction of adsorption capacity was achieved.

All selected models are based on the Boosting ensemble framework, aiming to approximate the true function space by iteratively fitting residuals. (1) CatBoost utilizes an ordered boosting strategy to effectively resolve prediction shift problems, demonstrating exceptional robustness in mitigating overfitting on small-sample datasets [22]. (2) LightGBM is positioned as a highly efficient boosting framework; it employs Gradient-based One-Side Sampling and Exclusive Feature Bundling to reduce memory consumption and accelerate splitting [23]. (3) XGBoost is defined as a “scalable regularized tree boosting system,” which optimizes the objective function based on second-order Taylor expansion and introduces L1/L2 regularization terms to control model complexity [24].

Since tree model performance is highly dependent on hyperparameter combinations, inefficient methods such as grid search and random search were abandoned. Instead, an advanced strategy employing the Optuna Bayesian optimization framework, based on the Tree-structured Parzen Estimator (TPE) algorithm, was implemented [25]. This logic involves constructing a surrogate model to efficiently explore key parameter spaces with the objective of minimizing RMSE, thereby enhancing model convergence speed. Simultaneously, to mitigate overfitting risks from a single fixed training set, K-Fold Cross-Validation (K-Fold CV) was embedded within the optimization process to ensure reliable generalization.
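The idea of scoring each hyperparameter candidate by its cross-validated RMSE can be sketched without any external library. In this sketch a plain loop over three candidates stands in for the TPE sampler, and a toy one-dimensional nearest-neighbour regressor stands in for the boosted trees; both substitutions, the data, and all function names are assumptions made purely for illustration.

```python
import math
import random


def kfold_indices(n_samples, k, seed=0):
    """Shuffle indices once and deal them into k disjoint validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:n_neighbors]
    return sum(target for _, target in nearest) / len(nearest)


def cv_rmse(data, n_neighbors, k=5):
    """Mean validation RMSE over k folds; this is the quantity to minimize."""
    fold_scores = []
    for val_idx in kfold_indices(len(data), k):
        held_out = set(val_idx)
        train = [p for i, p in enumerate(data) if i not in held_out]
        squared = [(knn_predict(train, data[i][0], n_neighbors) - data[i][1]) ** 2
                   for i in val_idx]
        fold_scores.append(math.sqrt(sum(squared) / len(squared)))
    return sum(fold_scores) / len(fold_scores)


# Made-up smooth 1-D data set of (feature, target) pairs.
data = [(x / 10, math.sin(x / 10)) for x in range(60)]
scores = {n: cv_rmse(data, n) for n in (1, 3, 15)}
best = min(scores, key=scores.get)
```

A Bayesian optimizer differs only in how the next candidate is proposed; the objective it evaluates has the same shape as `cv_rmse`.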

2.4. Model Training and Evaluation

To facilitate machine learning modeling, the dataset was randomized and partitioned into a training set (70%) and a testing set (30%). Model fitting and hyperparameter optimization utilized the training set, reserving the test set exclusively for final performance evaluation. This partitioning provided a foundational data basis for subsequent assessments of fitting and prediction efficacy.
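A randomized 70/30 partition of the 781 samples can be sketched as follows. The seed, the rounding convention, and the helper name are assumptions; the exact split sizes in the study may differ by one sample depending on how the test fraction is rounded.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.30, seed=42):
    """Shuffle sample indices and reserve a fixed fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(781)
```

Under this convention the split yields 547 training and 234 test samples, and the test indices are never touched during fitting or tuning.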

A comprehensive set of metrics was selected to evaluate model performance [26]. The Coefficient of Determination (R²), defined within the 0–1 interval, was used to assess prediction accuracy and regression effect, with values closer to 1 indicating superior fitting.

R^{2} = 1 - \frac{\sum_{i = 1}^{N} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i = 1}^{N} (y_{i} - \bar{y})^{2}}

(2)

Root Mean Square Error (RMSE) was selected to measure the degree of deviation between predicted and true values, where smaller values indicate higher accuracy.

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} ({\hat{y}}_{i} - y_{i})^{2}}

(3)

Additionally, Mean Absolute Error (MAE), defined as the average absolute difference between predicted and observed values in the test sample, was used to quantify predictive capability, with smaller values representing stronger capability.

MAE = \frac{1}{N} \sum_{i = 1}^{N} | {\hat{y}}_{i} - y_{i} |

(4)

Relying solely on a single metric poses a risk of masking potential overfitting; thus, a combined judgment integrating multiple metrics and practical application contexts was employed to effectively avoid overfitting and ensure rigorous model validation.
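The three metrics in Equations (2)–(4) can be computed directly, as in the minimal sketch below. The observed and predicted values are made up, and the function names are illustrative.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot about the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error: the square is taken per sample, inside the sum."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - y) for y, p in zip(y_true, y_pred)) / len(y_true)


# Made-up adsorption capacities (mg/g).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
metrics = (r2_score(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

For this toy case the residual sum of squares is 18 and the total sum of squares is 500, giving R² = 0.964, RMSE = √4.5 ≈ 2.12, and MAE = 2.0.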

2.5. Ablation Study for Data Synergy Verification

To investigate the data fragmentation issue arising from independent modeling of single metal ions, this study constructed a hybrid CatBoost model based on descriptor-based feature representation. This process reconstructs the feature space by replacing conventional pollutant category labels with continuous physicochemical descriptors (e.g., electronegativity and ionic radius), thereby establishing a shared cross-ion feature representation. On this basis, an ablation study comprising three comparative models was designed: Model A (trained using only Pb data), Model B (trained using only Cd data), and Model C (trained using mixed Pb and Cd data). To ensure fairness and interpretability of the comparison, Model C was evaluated using Pb and Cd subsets separately during the testing phase, i.e., different metal subsets were independently assessed under the same data source conditions, thereby enabling cross-system validation under homologous data conditions. By comparing the R² and RMSE of each model on independent test sets, the improvement in predictive performance resulting from the incorporation of cross-metal data was quantitatively evaluated, thereby analyzing the potential for information synergy across different metal systems. This validation strategy aims to assess whether the consolidated feature representation can facilitate effective utilization of cross-system information, rather than treating each metal system as an isolated problem. Ultimately, Model C is intended to provide an integrated reference framework for evaluating single-metal adsorption behavior, enabling assessment of whether cross-system information can be effectively utilized under a shared feature representation and whether predictive performance can be improved.
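Evaluating a mixed-data model on each metal subset separately amounts to grouping test predictions by ion before computing the metric. The sketch below shows this for RMSE; the labels, values, and helper names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def rmse_by_ion(ions, y_true, y_pred):
    """Group test samples by ion label and compute RMSE within each group."""
    groups = {}
    for ion, y, p in zip(ions, y_true, y_pred):
        truths, preds = groups.setdefault(ion, ([], []))
        truths.append(y)
        preds.append(p)
    return {ion: rmse(truths, preds) for ion, (truths, preds) in groups.items()}


# Made-up test predictions from a single mixed Pb/Cd model.
ions = ["Pb", "Pb", "Cd", "Cd"]
y_true = [100.0, 80.0, 30.0, 50.0]
y_pred = [97.0, 84.0, 31.0, 49.0]
subset_rmse = rmse_by_ion(ions, y_true, y_pred)
```

The ion label is used here only to partition the test set; it is not a model input, which keeps the comparison with the single-ion models on identical samples.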

2.6. Uncertainty Quantification via Locally Adaptive Conformal Prediction

To rigorously quantify the epistemic uncertainty arising from the cross-ion mixed modeling, a Locally Adaptive Inductive Conformal Prediction (ICP) framework integrated with 5-fold cross-validation was employed. Unlike standard ICP which produces fixed-width intervals, this locally adaptive approach introduces a variance estimator to generate heteroscedastic intervals that dynamically expand or contract based on feature space difficulty [27,28].

The 5-fold cross-validation ensures that every data point serves as a calibration instance. For a given calibration set, the normalized nonconformity score s_i is computed by scaling the absolute residual with the predicted local uncertainty estimate σ̂(x_i), defined as

s_{i} = \frac{| y_{i} - {\hat{y}}_{i} |}{\hat{σ} (x_{i}) + β}

(5)

where ŷ_i is the point prediction, σ̂(x_i) is the estimated local variability (residual magnitude), and β is a small stability constant. The conformalization factor, denoted as Q_{1−α}, is determined as the (1 − α)-th percentile of the nonconformity scores {s_i} from the calibration folds:

Q_{1 - α} = \mathrm{Percentile} ((1 - α), \{s_{i}\})

(6)

Consequently, the prediction interval C(x_test) for a new test sample is constructed as

C (x_{test}) = [{\hat{y}}_{test} - Q_{1 - α} \cdot \hat{σ} (x_{test}), {\hat{y}}_{test} + Q_{1 - α} \cdot \hat{σ} (x_{test})]

(7)

This formulation ensures that the interval width 2 · Q_{1−α} · σ̂(x_test) varies adaptively: regions with high aleatory uncertainty or sparse data yield wider intervals, thereby providing a statistically valid coverage guarantee (e.g., 95%) while accurately reflecting local model confidence.
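Equations (5)–(7) can be traced with a minimal sketch. It uses a single calibration set rather than the 5-fold scheme, takes the local variability estimates as given numbers, and selects the quantile with the finite-sample rank ⌈(n + 1)(1 − α)⌉ that is common in conformal prediction; these choices, the data, and the function names are assumptions, not the study's implementation.

```python
import math


def nonconformity_scores(y_cal, y_hat_cal, sigma_cal, beta=1e-6):
    """Equation (5): absolute residual scaled by local variability plus beta."""
    return [abs(y - y_hat) / (sigma + beta)
            for y, y_hat, sigma in zip(y_cal, y_hat_cal, sigma_cal)]


def conformal_quantile(scores, alpha):
    """Equation (6), using the rank ceil((n + 1)(1 - alpha)) capped at n (assumed convention)."""
    ordered = sorted(scores)
    n = len(ordered)
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return ordered[rank - 1]


def prediction_interval(y_hat, sigma, q):
    """Equation (7) as written in the text: y_hat +/- q * sigma."""
    half_width = q * sigma
    return (y_hat - half_width, y_hat + half_width)


# Made-up calibration data: observations, point predictions, local variability.
y_cal = [50.0, 62.0, 71.0, 40.0, 95.0]
y_hat_cal = [48.0, 65.0, 70.0, 44.0, 90.0]
sigma_cal = [4.0, 3.0, 2.0, 2.0, 5.0]

scores = nonconformity_scores(y_cal, y_hat_cal, sigma_cal)
q = conformal_quantile(scores, alpha=0.2)
narrow = prediction_interval(60.0, 3.0, q)  # test point in a low-variability region
wide = prediction_interval(60.0, 6.0, q)    # same prediction, doubled local variability
```

With these numbers the scaled scores are approximately 0.5, 1, 0.5, 2, and 1, so q is close to 2, and doubling the local variability estimate doubles the interval width around the same point prediction.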

2.7. Feature Importance Analysis of ML Model

To clarify the ML model’s prediction mechanism and the underlying feature contributions, an analysis of the impact of input variables on the target variable was conducted. Among the available techniques, such as Monte Carlo methods, One-Factor-at-a-Time analysis, and partial derivatives [29], SHAP (SHapley Additive exPlanations) was selected [30]. SHAP calculates Shapley values based on game theory, satisfying the core axioms of efficiency, symmetry, dummy, and additivity. Furthermore, it possesses three key properties, namely local accuracy, consistency, and missingness (a feature absent from the input receives zero attribution), which make it well suited to interpreting the model. In addition, partial dependence plots (PDPs) were employed to evaluate how changes in individual feature values affect model outputs while holding other variables constant, thereby providing complementary insights into feature effects and model decision behavior [31]. Together, SHAP and PDP analyses serve as effective methods to evaluate feature importance, improving model transparency and supporting rational material design.
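The Shapley value underlying SHAP can be computed exactly for a small number of features by enumerating coalitions, as in the sketch below. The three-feature value function is a made-up toy, not the trained model, and SHAP libraries use much faster tree-specific algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: weighted average of marginal contributions over coalitions."""
    n = len(features)
    phi = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value_fn(coalition | {feature}) - value_fn(coalition))
        phi[feature] = total
    return phi


def toy_value(present):
    """Hypothetical model output when only the features in `present` are known."""
    value = 0.0
    if "pH" in present:
        value += 10.0
    if "BET" in present:
        value += 4.0
    if "pH" in present and "BET" in present:
        value += 6.0  # interaction term, shared equally between the two features
    return value


features = ["pH", "BET", "time"]
phi = shapley_values(toy_value, features)
```

Here the attributions are 13 for pH, 7 for BET, and 0 for time: they sum to the full-model output of 20 (efficiency), and the feature that never changes the output receives zero (dummy).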

3. Results

3.1. Data-Descriptive Analysis and Physicochemical Isomorphism

As shown in Figure 2a,b, the key physical parameters in the Pb(II) and Cd(II) datasets were visualized using box plots. Descriptive analysis was conducted to preliminarily explore the distribution of input variables. Box plots intuitively illustrate data dispersion and central tendency by displaying the median and interquartile range (IQR, between the 25% and 75% percentiles), facilitating a better understanding of the datasets. For biochar properties, variables such as specific surface area, pyrolysis temperature, and carbon content exhibited statistical overlap between the two datasets, suggesting a shared operational domain for diverse porous carbon materials. The Pb(II) dataset exhibited higher coverage in the high adsorption capacity region, with several observations extending beyond the upper IQR limit, reflecting high-response behavior under extreme process conditions. The Cd(II) dataset provided broader variations in nitrogen content and ash. These complementary distribution patterns extend the operational range of input variables, which is beneficial for developing machine learning models capable of generalizing complex adsorption mechanisms across varied scenarios.

Pearson correlation coefficients (PCC) were calculated to examine inter-feature relationships and their associations with adsorption capacity (Figure 3). Several physically meaningful correlations were observed, including a positive relationship between specific surface area and pore volume (r = 0.64), and a negative correlation between pyrolysis temperature and H content (r = −0.47), reflecting typical structural evolution during biochar formation. In addition, moderate associations between operational variables, such as m/V and adsorption time (r = 0.54), indicate consistent experimental design patterns. Despite these localized relationships, the majority of feature pairs exhibited weak to moderate linear correlations (|r| < 0.6). This overall weak linear structure suggests that adsorption performance cannot be adequately explained by simple pairwise associations. Instead, it likely arises from complex nonlinear interactions among physicochemical properties, operational conditions, and intrinsic ion characteristics. Such characteristics highlight the necessity of adopting machine learning approaches capable of capturing high-dimensional nonlinear dependencies.

3.2. Predictive Performance and Optimal Model Selection

This study utilized three tree-based gradient boosting algorithms (CatBoost, XGBoost, and LightGBM) to predict the adsorption capacities of Pb(II) and Cd(II) using a single consolidated model [32]. Tree-based ensemble algorithms are currently recognized as the state-of-the-art framework for tabular data modeling, inherently outperforming deep neural networks in processing heterogeneous tabular features while preserving explicit mechanistic interpretability [33]. All models were evaluated using 5-fold cross-validation to rigorously assess their predictive capabilities. In addition to cross-validation, an independent hold-out test set (30% of the full dataset), which was not involved in model training, hyperparameter optimization, or uncertainty calibration, was used for external validation to assess generalization on unseen data. To ensure optimal model performance, hyperparameters were systematically fine-tuned using the Tree-structured Parzen Estimator (TPE) algorithm within the Optuna framework. The optimization objectives were set to maximize the coefficient of determination (R²) and minimize the root mean square error (RMSE). Specifically, for CatBoost, the optimization focused on the learning rate, tree depth, L2 regularization coefficient, and number of iterations. The hyperparameter space for XGBoost encompassed the maximum depth, learning rate, subsample ratio, and regularization terms, whereas LightGBM tuning centered on the fine adjustment of the number of leaves, learning rate, and feature fraction.

Cross-validation critically assesses model generalization capability and ensures that high predictive performance is not merely an artifact of a favorable random data split. As summarized in Table 1, the CatBoost model demonstrated outstanding stability during the 5-fold cross-validation, achieving the highest mean R² (0.9209 ± 0.0359) and the lowest RMSE (10.9171 ± 1.7662). In contrast, XGBoost (R² = 0.9158 ± 0.0438, RMSE = 11.1662 ± 2.0617) and LightGBM (R² = 0.8869 ± 0.0297, RMSE = 13.2840 ± 1.3782) exhibited slightly lower accuracy. Although the minor variance (standard deviation) observed across the validation folds inherently reflects the intrinsic heterogeneity of biochar physicochemical properties, the consistently high average metrics confirm the robust generalizability of the CatBoost architecture.

Figure 4 presents a unified scatter plot illustrating the correspondence between the observed and predicted values. It should be noted that the visualization displays an optimal predictive model to explicitly map the ideal predictive boundaries and fitting residuals. The majority of data points cluster tightly around the 1:1 diagonal, indicating robust fitting capabilities across all three algorithms on the training dataset. Specifically, Figure 4a, Figure 4b, and Figure 4c correspond to XGBoost, LightGBM, and CatBoost, respectively. CatBoost achieved the highest training precision (R² = 0.9850, RMSE = 4.98), outperforming XGBoost (R² = 0.9770, RMSE = 6.09) and LightGBM (R² = 0.9647, RMSE = 7.61). Such high-precision fitting validates the capacity of the CatBoost algorithm to resolve the high-dimensional, non-linear mapping relationships introduced by the physicochemical descriptors, accurately reconstructing the underlying rules of the adsorption process [22].

The independent testing phase, which evaluates generalization on unseen data, revealed more pronounced performance differences among the algorithms. CatBoost maintained an advantage with a test R² of 0.9551, surpassing both XGBoost (0.9506) and LightGBM (0.9356). The prediction deviation for CatBoost was also the lowest, with an RMSE of 9.43 and an MAE of 6.00, compared to XGBoost (RMSE = 9.89) and LightGBM (RMSE = 11.29). LightGBM exhibited a substantial generalization gap between its training and testing performance, with RMSE rising from 7.61 on the training set to 11.29 on the test set, implying a relatively high risk of overfitting within this specific feature space [34]. Because it offered the best balance of robustness and precision in predicting the adsorption behaviors of Pb(II) and Cd(II), the CatBoost model was selected for downstream analysis and for the construction of prediction intervals that quantify the uncertainty of its predictions.

3.3. Quantifying Data Synergy and Reliability via Ablation Study and Conformal Prediction

3.3.1. Ablation Study on Cross-Ion Predictive Performance

To evaluate the impact of heterogeneous metal data integration on model generalizability, an ablation study with independent subset validation was conducted: models were retrained separately on each ion subset and evaluated on the corresponding 30% hold-out test data [25]. Figure 5a,b shows the predictive baselines of the solute-specific models: on the unseen test set, Model A (Pb) yielded an R² of 0.9252 (RMSE = 13.56), and Model B (Cd) yielded an R² of 0.9414 (RMSE = 9.51). The integrated CatBoost model (Model C) was then evaluated separately on the identical Pb and Cd test subsets (Figure 5c,d), ensuring a fair and interpretable comparison under the same data source conditions. Both systems exhibited systematic predictive improvements: Model C raised the Pb subset R² to 0.9535 (RMSE = 10.68) and the Cd subset R² to 0.9571 (RMSE = 8.13). Because these gains appear consistently across distinct ion systems, they indicate that the model enhancement stems not from mere data volume expansion or random fluctuations, but from the effective use of cross-system information within a consolidated feature representation.
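
As a worked check on the size of these gains, the reported hold-out RMSE values can be converted into relative reductions. The percentages below are derived here from the numbers in the text and are not stated by the authors; the dictionary layout and function name are hypothetical.

```python
# RMSE values reported in the text for the 30% hold-out subsets.
reported_rmse = {
    "Pb": {"single_ion": 13.56, "integrated": 10.68},  # Model A vs. Model C
    "Cd": {"single_ion": 9.51, "integrated": 8.13},    # Model B vs. Model C
}

def relative_reduction(before, after):
    """Fractional decrease when going from `before` to `after`."""
    return (before - after) / before

# Percentage RMSE reduction per ion subset, rounded to one decimal place.
rmse_reduction_pct = {
    ion: round(100 * relative_reduction(v["single_ion"], v["integrated"]), 1)
    for ion, v in reported_rmse.items()
}
```

Under this calculation the integrated model lowers RMSE by roughly 21% on the Pb subset and roughly 15% on the Cd subset.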

3.3.2. Uncertainty Quantification via Conformal Prediction

Standard metrics like R² capture overall goodness of fit but can mask underlying epistemic uncertainty. To address this, a locally adaptive ICP strategy using 5-fold cross-validation estimates the conditional variance σ(x) for each prediction. The integrated Model C shows marked interval shrinkage and reduced volatility (Figure 6a,b): the average prediction interval width for the Cd test set falls from 628.43 (single-ion Model B) to 124.36, and the standard deviation of the interval width drops from 1590.33 to 196.18. The maximum Cd interval width narrows from 16,660.22 to 1940.58 (Figure 6b), and the Pb dataset exhibits a similar compression effect (90.13%; Figure 6a). This substantial contraction highlights the high epistemic uncertainty inherent in single-metal-ion models constrained by limited sample diversity.
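
The general idea behind locally adaptive conformal intervals can be sketched as below. This is a generic normalized split-conformal recipe written under assumptions, not the authors' implementation: residuals on a calibration set are scaled by a per-sample difficulty estimate σ(x), a finite-sample-corrected quantile q of the scaled residuals is taken, and each new prediction receives the interval ŷ ± q·σ(x). All function names and data values are hypothetical, and the demo uses α = 0.2 only to keep the calibration set small.

```python
import math

def normalized_scores(y_true, y_pred, sigma):
    """Nonconformity scores |y - y_hat| / sigma(x) on the calibration set."""
    return [abs(t - p) / s for t, p, s in zip(y_true, y_pred, sigma)]

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def prediction_interval(y_hat, sigma_x, q):
    """Interval whose width scales with the local difficulty estimate."""
    half_width = q * sigma_x
    return y_hat - half_width, y_hat + half_width

# Hypothetical calibration data: targets, predictions, difficulty estimates.
y_cal = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
y_hat_cal = [12.0, 19.0, 33.0, 36.0, 55.0, 54.0, 77.0, 72.0, 99.0, 90.0]
sigma_cal = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0]

q = conformal_quantile(normalized_scores(y_cal, y_hat_cal, sigma_cal), alpha=0.2)
easy_interval = prediction_interval(50.0, 2.0, q)  # low estimated difficulty
hard_interval = prediction_interval(50.0, 4.0, q)  # high estimated difficulty
```

A 95% interval of the kind shown in Figure 6 would need at least 19 calibration points for the quantile to be finite. For scale, the Cd widths reported above correspond to a reduction of about 80.2% in mean width (628.43 to 124.36) and about 88.4% in maximum width (16,660.22 to 1940.58); these percentages are computed here, not quoted from the article.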

Mechanistically, this interval shrinkage arises from the complementary distribution of distinct ions across shared descriptor space. The expanded coverage of Pb(II) data in elevated response ranges and the broader variation in Cd(II) data regarding surface chemistry provide supplementary constraints within the consolidated feature representation; superimposing these datasets extends locally observable responses into a mapping structure. Expanding this constraint boundary enables stable prediction for unseen samples, superseding reliance on localized fitting within a single system.

When integrated through physicochemical descriptors, diverse data sources form synergistic constraints that enhance the overall characterization of complex adsorption behaviors. Without introducing explicit multi-task structures, this integrated modeling strategy facilitates cross-system information transfer and captures shared data-driven patterns [35]. Beyond empirical mathematical fitting, Model C establishes a robust statistical baseline for single-metal-ion systems. Because it represents an idealized interference-free state, it provides a constrained reference framework for quantifying competitive or synergistic deviations in future multi-solute adsorption studies.

3.4. Mechanistic Interpretation and Feature Attribution

3.4.1. Feature Importance and Directional Impacts

Feature importance was evaluated using the mean absolute SHAP values (Figure 7). Overall, macroscopic experimental conditions exerted the strongest influence on adsorption predictions for both Pb(II) and Cd(II), followed by biochar physicochemical properties and ion-related descriptors. Among the experimental variables, the initial metal concentration (Ci) was identified as the most influential factor. This observation is consistent with adsorption thermodynamics, as higher concentration gradients increase the driving force for mass transfer and promote ion diffusion toward available adsorption sites within the biochar matrix [31,36]. Adsorption time ranked as the second-most important variable, reflecting its critical role in adsorption kinetics and the progression toward equilibrium [37]. The solid-to-liquid ratio (m/V) also exhibited a substantial contribution, likely because it directly modulates the effective availability of adsorption sites relative to the dissolved metal ion load.
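
The ranking criterion, mean absolute SHAP value per feature, reduces to a simple column-wise average once the per-sample attributions are available. The sketch below assumes the attribution matrix has already been computed by a SHAP explainer; the function name, feature labels, and numbers are hypothetical and do not reproduce Figure 7.

```python
def mean_abs_importance(shap_matrix, feature_names):
    """Rank features by mean absolute attribution across samples."""
    n_samples = len(shap_matrix)
    totals = [0.0] * len(feature_names)
    for row in shap_matrix:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    scores = {name: total / n_samples for name, total in zip(feature_names, totals)}
    # Sort from most to least influential.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical attributions: rows are samples, columns are features.
features = ["Ci", "time", "m/V"]
shap_values = [
    [12.0, -4.0, 1.0],
    [-10.0, 6.0, -3.0],
    [8.0, -5.0, 2.0],
]
ranking = mean_abs_importance(shap_values, features)
```

Taking absolute values before averaging matters: a feature whose positive and negative contributions cancel on average can still be highly influential.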

Among biochar properties, hydrogen content and average pore size exhibit notable contributions to model predictions. While average pore size is conventionally regarded as a geometric indicator of steric constraints, its joint importance with hydrogen content more likely reflects statistical associations arising from the thermochemical evolution of the material. Specifically, pore structure and elemental composition tend to co-vary with activation intensity; for instance, lower hydrogen content is generally associated with a higher degree of carbonization and structural reorganization [38]. In this context, the model may utilize the combination of pore size and hydrogen content as a proxy descriptor for activation severity, rather than attributing predictive influence to a single geometric or chemical factor. Such coupled features are often associated with variations in surface functional group abundance and structural defects, which may influence metal ion adsorption through mechanisms such as surface complexation and electrostatic interactions [39].

With respect to ion-related descriptors, electronegativity and hydrated ionic radius emerged as influential variables in the SHAP analysis. From a mechanistic perspective, these descriptors are physically meaningful for differentiating Pb(II) and Cd(II) adsorption behavior. According to hard–soft acid–base (HSAB) theory, electronegativity differences can reflect variations in metal–ligand coordination tendencies and binding affinities [40], while ionic radius constrains ion–pore compatibility and steric effects [41]. However, it should be noted that, in the present dataset, these ion descriptors assume only two discrete values corresponding to Pb(II) and Cd(II). Consequently, their SHAP contributions primarily quantify the model’s differentiation between these two specific ion types, rather than evidencing a learned continuous physicochemical relationship across a broader ion spectrum. In this sense, the ion-related SHAP values should be interpreted as physically informed indicators of ion-type differentiation within a constrained two-ion system, rather than proof of a fully unified or continuously generalizable feature space.
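
The caveat about two discrete values can be checked numerically: with only two ions, any ion descriptor is an exact linear function of a binary ion indicator, so the two are perfectly correlated. The sketch below uses commonly tabulated descriptor values as illustrative assumptions (they are not taken from the article's dataset), and the helper name and sample sequence are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Illustrative descriptor values (assumed; the article's exact values may differ).
DESCRIPTORS = {
    "Pb": {"electronegativity": 2.33, "hydrated_radius": 4.01},
    "Cd": {"electronegativity": 1.69, "hydrated_radius": 4.26},
}

# Hypothetical sequence of samples in a mixed Pb/Cd dataset.
ions = ["Pb", "Cd", "Cd", "Pb", "Pb", "Cd"]
is_pb = [1.0 if ion == "Pb" else 0.0 for ion in ions]
electronegativity = [DESCRIPTORS[ion]["electronegativity"] for ion in ions]
hydrated_radius = [DESCRIPTORS[ion]["hydrated_radius"] for ion in ions]

r_en = pearson(is_pb, electronegativity)  # +1 up to rounding
r_hr = pearson(is_pb, hydrated_radius)    # -1 up to rounding
```

A tree model therefore sees the same split whether it is given the descriptor or the indicator; the descriptors only begin to carry continuous information once a third ion is added.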

Beyond ranking feature importance, the SHAP summary plot (Figure 7) provides insights into the directional effects of key variables on predicted adsorption capacity. The observed trends are broadly consistent with established adsorption mechanisms. Higher initial concentrations and longer contact times generally contribute positively to predicted uptake, while higher pH and lower hydrogen content are associated with enhanced adsorption, reflecting reduced proton competition and increased availability of active functional groups. These directional relationships are learned from the combined Pb–Cd dataset and remain physically interpretable at the global level. Nevertheless, given the limited ion diversity, SHAP analysis in the current framework primarily supports mechanistic consistency and ion-specific differentiation, rather than demonstrating continuous cross-ion generalizability. Extending this interpretative framework to additional metal ions or hypothetical ion descriptors represents a necessary direction for future work to rigorously validate continuous generalization across metal species.

3.4.2. Influence of the Important Features on the Target

Based on Partial Dependence Plots (PDPs), this study further analyzed the marginal effects of key features on the adsorption capacity, revealing the dependencies and trend patterns between individual variables and the target outcome. As identified by the SHAP analysis, macroscopic experimental variables (Ci and contact time) and intrinsic affinity descriptors (H content and ionic radius) dictate the fundamental pathways of the adsorption process.

As shown in Figure 8a,b, the macroscopic variables exhibit trends highly consistent with classical adsorption theories. The PDP for initial concentration (Ci) demonstrates a pronounced continuous increase from 0 to approximately 250 mg/L before leveling off. This reflects the typical isotherm behavior [42]: an elevated Ci provides a stronger chemical potential gradient to overcome mass transfer resistance, eventually reaching a saturation plateau as available active sites are progressively occupied [43]. Similarly, the contact time PDP (evaluated within the kinetically dominant 0–3000 min range) indicates a “rapid-slow-plateau” evolution [44]. An initial fast capacity increase corresponds to external film mass transfer and rapid surface site occupation, followed by a slower phase governed by intraparticle diffusion and eventual equilibrium. While PDPs represent conditional statistical expectations rather than explicit kinetic parameters, these robust macroscopic mappings confirm that the algorithm successfully internalized established physicochemical principles.
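
A one-dimensional partial dependence curve is obtained by fixing one feature at each grid value for every sample and averaging the model's predictions. The sketch below shows this with a toy saturating model standing in for the trained regressor; the function names, the Langmuir-like toy response, and all numbers are hypothetical.

```python
def partial_dependence(model, X, feature_index, grid):
    """Average prediction over the dataset with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature_index] = value  # overwrite only the feature of interest
            total += model(modified)
        curve.append(total / len(X))
    return curve

def toy_model(row):
    """Hypothetical stand-in: saturating in Ci (row[0]) plus an additive offset (row[1])."""
    ci, offset = row
    return 100.0 * ci / (50.0 + ci) + offset

# Hypothetical samples: [Ci in mg/L, offset contributed by another feature].
X = [[10.0, 0.0], [100.0, 2.0], [300.0, 4.0]]
grid = [0.0, 50.0, 150.0, 450.0]
pdp = partial_dependence(toy_model, X, 0, grid)
```

The resulting curve rises steeply at low Ci and flattens at high Ci, the same qualitative shape described for Figure 8a. Because the other features are averaged out, the curve is a marginal expectation and not a fitted isotherm.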

Displaying a clear non-monotonic dependence on the solid–liquid ratio (m/V) (Figure 8c), the adsorption capacity increases initially and then declines, revealing an optimal intermediate dosage window. Within the effective range of 0.40–5.00 g/L, this inverted-U pattern suggests a trade-off between the availability of adsorption sites and their utilization efficiency. At low m/V, increasing the adsorbent dosage improves qe by enhancing solid–liquid contact and exposing more active sites; at higher m/V, the marginal benefit gradually diminishes as site utilization becomes saturated. This physically interpretable response highlights the dominant role of dosage optimization in governing adsorption performance.

Elemental H content exhibits a distinct negative dependency (Figure 8d). Maximum PDP values (highest predicted adsorption capacity) cluster consistently at lower H contents (<1.5%), decline sharply towards 3.5%, and then plateau at minimal capacity levels. Consistent with the role of H content as a proxy for carbonization degree in the SHAP analysis, low H content typically indicates extensive biochar aromatization and deep dehydrogenation driven by severe pyrolysis or chemical activation [45]. Such extensive carbonization promotes highly conjugated π-electron systems, which can capture heavy metal cations via cation-π interactions [46], while concurrently generating abundant structural defects and oxygen-containing functional groups. Under-carbonized biomass with higher H content lacks sufficient active sites of this kind, which limits the effective complexation of Pb(II) and Cd(II) ions.

3.4.3. Interactions Between the Important Features

To further investigate the interactions among the key features influencing Pb(II) and Cd(II) adsorption capacity, this study employed two-way PDP analysis to generate interaction maps (Figure 9). Based on the importance ranking identified in Section 3.4.1, the top macroscopic and intrinsic variables were selected to compute pairwise 2D-PDPs. In these plots, the color gradient from light to dark purple represents increasing predicted values, indicating significant interactions between key features learned within the combined dataset. Notably, when the initial concentration exceeds approximately 400 mg/L alongside a relatively low solid-to-liquid ratio (around 0.40 g/L), the adsorption capacity of biochar increases markedly, reaching its global maximum (94.31 mg/g). In contrast, excessive biochar dosage under the same concentration substantially weakens this high-capacity effect (Figure 9a), confirming the dominant role of pollutant-to-site stoichiometry. A high initial concentration provides a strong chemical potential driving force, while a lower dosage concentrates the adsorbate onto limited active sites, maximizing unit adsorption capacity. Additionally, an elemental H content lower than 1.5% and a slightly alkaline pH (around 8.0) consistently correspond to higher adsorption capacity, regardless of other macroscopic variations (Figure 9b). This highlights the coupled roles of material activation and solution chemistry: low H content reflects deeper carbonization and increased structural defects, while appropriate pH promotes surface deprotonation and electrostatic attraction [47]. Taken together, these interaction patterns, learned from mixed Pb/Cd data, reveal stable interdependencies among key features and confirm the dominant influence of initial concentration, solid-to-liquid ratio, H content, and pH. The consistency of these macroscopic interactions across the dataset suggests that the model captures fundamental operational dependencies (e.g., concentration and dosage effects), independent of the mathematical differentiation between the two specific ion types.
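
The concentration and dosage interaction follows directly from the standard batch mass balance for adsorption capacity. The relation below is the textbook definition, written with the equilibrium concentration C_e, which this section does not define and is introduced here as an assumption; the numerical example uses hypothetical values chosen only to show the scaling.

```latex
% Batch mass balance: C_i, C_e in mg/L; V in L; m in g; q_e in mg/g
q_e = \frac{(C_i - C_e)\,V}{m} = \frac{C_i - C_e}{m/V}

% Hypothetical illustration: the same removal of 30 mg/L at two dosages
\frac{400 - 370}{0.40} = 75\ \text{mg/g}
\qquad\text{versus}\qquad
\frac{400 - 370}{2.0} = 15\ \text{mg/g}
```

For a given amount of metal removed from solution, capacity per unit mass scales inversely with m/V, which is why a low dosage combined with a high initial concentration yields the highest predicted unit capacity.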

4. Discussion

Compared to conventional optimization approaches (e.g., response surface methodology) that are typically constrained to low-dimensional, predefined experimental domains, the proposed ML framework is better suited to capturing nonlinear interactions across highly heterogeneous data. It complements traditional mechanistic models by providing scalable, data-driven predictive guidance for material screening. However, critical limitations regarding its broader applicability must be acknowledged. First, because the current dataset is restricted to Pb(II) and Cd(II), the continuous physicochemical descriptors (e.g., electronegativity and hydrated radius) assume only two distinct values. Consequently, these descriptors are mathematically collinear with binary ion-type indicators. The SHAP attributions therefore primarily reflect the model’s differentiation between the two specific ions rather than revealing a truly continuous physical relationship. The current “physical embedding” should be interpreted as an extensible reparameterization within a constrained two-ion system, rather than proof of a fully unified feature space. Second, inherent uncertainties derived from heterogeneous literature sources limit the model’s immediate transferability to entirely distinct experimental regimes. To validate the “continuous generalizability” of this framework, future research will need to evaluate its predictive performance on entirely unseen metal ions (e.g., Cu²⁺, Ni²⁺) and extend it to competitive multi-metal systems, coupled with independent experimental validation.

5. Conclusions

In this study, an integrated machine learning framework was developed to predict Pb(II) and Cd(II) adsorption by biochar, overcoming the data isolation inherent in conventional single-metal-ion models. The Optuna-optimized CatBoost model achieved the best performance (R² = 0.9551, RMSE = 9.43, MAE = 6.00), outperforming XGBoost and LightGBM. While the integration of continuous physical descriptors (e.g., electronegativity and hydrated radius) primarily serves to differentiate the two metal types in the current binary context, this feature embedding strategy facilitated cross-ion knowledge transfer. Consequently, the integrated model improved prediction performance on both the Pb (R²: 0.9252 → 0.9535) and Cd (R²: 0.9414 → 0.9571) subsets compared with baseline single-metal models. SHAP analysis indicated that adsorption capacity is dominantly governed by macroscopic operating conditions (e.g., initial concentration and solid–liquid ratio). Concurrently, the SHAP attributions for the ion descriptors captured the intrinsic affinity differences between Pb and Cd, acting as mathematical differentiators for ion types within the consolidated feature representation. Furthermore, the integration of uncertainty quantification reduced prediction interval widths, enhancing the reliability of model outputs. Overall, this study provides a robust data-driven baseline for single-metal-ion adsorption systems. Building upon the physical feature embedding strategy established in this study, a critical next step is to evaluate whether such descriptors can support meaningful generalization across a broader range of metal ions. This would require incorporating adsorption data for additional metal species (e.g., Cu²⁺, Ni²⁺) and systematically examining whether the model responses vary smoothly and consistently with respect to their physicochemical descriptors, rather than relying on ion-specific differentiation. Only after such continuous descriptor-based behavior is empirically validated for single-solute systems can the framework be robustly extended to more complex multi-metal competitive adsorption scenarios.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/w18121416/s1, Table S1 Original_data.

Author Contributions

Conceptualization, P.Y. and W.X.; methodology, P.Y.; software, P.Y.; validation, P.Y. and Z.H.; formal analysis, P.Y.; investigation, P.Y. and Z.H.; resources, W.X.; data curation, P.Y. and Z.H.; writing—original draft preparation, P.Y.; writing—review and editing, P.Y., Z.H. and W.X.; visualization, P.Y.; supervision, W.X.; project administration, W.X.; funding acquisition, W.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [No. 42277365].

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Saravanan, P.; Saravanan, V.; Rajeshkannan, R.; Arnica, G.; Rajasimman, M.; Baskar, G.; Pugazhendhi, A. Comprehensive review on toxic heavy metals in the aquatic system: Sources, identification, treatment strategies, and health risk assessment. Environ. Res. 2024, 258, 119440.
2. Monroy-Licht, A.; Martinez-Burgos, W.J.; de Carvalho, J.C.; Cavali, M.; Woiciechowski, A.L.; Karp, S.G.; Soccol, C.R.; De la Parra-Guerra, A.C.; Pozzan, R.; Acevedo-Barrios, R. Biological approaches to mitigate heavy metal pollution from battery production effluents: Advances, challenges, and perspectives. Environ. Sci. Pollut. Res. 2025, 32, 20844–20878.
3. Ali, H.; Khan, E.; Ilahi, I. Environmental chemistry and ecotoxicology of hazardous heavy metals: Environmental persistence, toxicity, and bioaccumulation. J. Chem. 2019, 2019, 6730305.
4. Balali-Mood, M.; Naseri, K.; Tahergorabi, Z.; Khazdair, M.R.; Sadeghi, M. Toxic Mechanisms of Five Heavy Metals: Mercury, Lead, Chromium, Cadmium, and Arsenic. Front. Pharmacol. 2021, 12, 643972.
5. Ghangale, S.S.; Saler, R.S.; Khandelwal, S.R.; Handore, D.V.; Handore, A.V. Harnessing heavy metal-tolerant bacteria and phytotoxicity assessment for ecofriendly treatment of industrial effluents. EQA-Int. J. Environ. Qual. 2024, 61, 16–23.
6. Ajiboye, T.O.; Oyewo, O.A.; Onwudiwe, D.C. Simultaneous removal of organics and heavy metals from industrial wastewater: A review. Chemosphere 2021, 262, 128379.
7. Mei, Y.; Zhuang, S.; Wang, J. Adsorption of heavy metals by biochar in aqueous solution: A review. Sci. Total Environ. 2025, 968, 178898.
8. Wang, W.; Chang, J.S.; Lee, D.J. Machine learning applications for biochar studies: A mini-review. Bioresour. Technol. 2024, 394, 130291.
9. Duan, Q.N.; Yan, P.W.; Feng, Y.C.; Wan, Q.R.; Zhu, X.L. Machine learning assisted adsorption performance evaluation of biochar on heavy metal. Front. Environ. Sci. Eng. 2024, 18, 55.
10. Liu, B.Y.; Xi, F.Y.; Zhang, H.J.; Peng, J.T.; Sun, L.P.; Zhu, X.Z. Coupling machine learning and theoretical models to compare key properties of biochar in adsorption kinetics rate and maximum adsorption capacity for emerging contaminants. Bioresour. Technol. 2024, 402, 130776.
11. Yang, H.R.; Huang, K.; Zhang, K.; Weng, Q.; Zhang, H.C.; Wang, F.E. Predicting Heavy Metal Adsorption on Soil with Machine Learning and Mapping Global Distribution of Soil Adsorption Capacities. Environ. Sci. Technol. 2021, 55, 14316–14328.
12. Wei, X.; Liu, Y.; Shen, L.; Lu, Z.; Ai, Y.; Wang, X. Machine learning insights in predicting heavy metals interaction with biochar. Biochar 2024, 6, 10.
13. Yuan, X.Z.; Li, J.; Lim, J.Y.; Zolfaghari, A.; Alessi, D.S.; Wang, Y.; Wang, X.N.; Ok, Y.S. Machine Learning for Heavy Metal Removal from Water: Recent Advances and Challenges. ACS ES T Water 2023, 4, 820–836.
14. Zhao, S.Y.; Guo, J.Y.; Tang, Y.; Zhou, Y.B. Applications of machine learning in heavy metal adsorption modeling: A review. Sep. Purif. Technol. 2025, 377, 134168.
15. Zhao, C.X.; Yue, W.J.; Jiang, Z.H.; Lu, X.Y.; Xia, Q.; Shen, Z.P.; Chen, A.H. Predictive modeling of heavy metal lead and cadmium adsorption on biochar based on machine learning. Int. J. Phytoremediat. 2025, 28, 748–755.
16. Chen, B.T.; Guan, H.B.; Zhang, Y.; Liu, S.X.; Zhao, B.F.; Zhong, C.Q.; Zhang, H.M.; Ding, W.R.; Song, A.A.; Zhu, D.; et al. Performance and mechanism of Pb2+ and Cd2+ ions' adsorption via modified antibiotic residue-based hydrochar. Heliyon 2023, 9, e14930.
17. Shen, T.; Peng, H.Y.; Yuan, X.Z.; Liang, Y.S.; Liu, S.Q.; Wu, Z.B.; Leng, L.J.; Qin, P.F. Feature engineering for improved machine-learning-aided studying heavy metal adsorption on biochar. J. Hazard. Mater. 2024, 466, 133442.
18. Hernández-Guerrero, B.A.; Martínez, L.; Peña-Rodríguez, G.; Trejo, F. Machine Learning-Enhanced Modeling of Heavy Metal Adsorption onto Coal Fly Ash-Derived Zeolite P. Water 2026, 18, 857.
19. Varivoda, D.; Dong, R.Z.; Omee, S.S.; Hu, J.J. Materials property prediction with uncertainty quantification: A benchmark study. Appl. Phys. Rev. 2023, 10, 021409.
20. Roh, J.; Park, H.; Kwon, H.; Joo, C.; Moon, I.; Cho, H.; Ro, I.; Kim, J. Interpretable machine learning framework for catalyst performance prediction and validation with dry reforming of methane. Appl. Catal. B-Environ. Energy 2024, 343, 123454.
21. Sakhiya, A.K.; Kaushal, P.; Vijay, V.K. Potential of rice straw derived activated biochar to remove arsenic and manganese from groundwater: A cleaner approach in the Indo-Gangetic Plain. Appl. Surf. Sci. Adv. 2023, 17, 100443.
22. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31, 6639–6649.
23. Qiu, S.; Zhao, H.K.; Jiang, N.; Wang, Z.L.; Liu, L.; An, Y.; Zhao, H.Y.; Miao, X.; Liu, R.C.; Fortino, G. Multi-sensor information fusion based on machine learning for real applications in human activity recognition: State-of-the-art and research challenges. Inf. Fusion 2022, 80, 241–265.
24. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
25. Khajavian, M.; Ismail, S.; Esmaeili, J. Structural design of cellulose derivative-modified chitosan adsorbents for arsenic removal: Machine learning modeling, Box-Behnken design, Optuna hyperparameter tuning, and molecular dynamics. Biochem. Eng. J. 2025, 221, 109800.
26. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623.
27. Angelopoulos, A.N.; Bates, S. Conformal Prediction: A Gentle Introduction. Found. Trends Mach. Learn. 2023, 16, 494–591.
28. Barber, R.F.; Candès, E.J.; Ramdas, A.; Tibshirani, R.J. Predictive Inference with the Jackknife. Ann. Stat. 2021, 49, 486–507.
29. Razavi, S.; Jakeman, A.; Saltelli, A.; Prieur, C.; Iooss, B.; Borgonovo, E.; Plischke, E.; Lo Piano, S.; Iwanaga, T.; Becker, W.; et al. The Future of Sensitivity Analysis: An essential discipline for systems modeling and policy support. Environ. Model. Softw. 2021, 137, 104954.
30. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115.
31. Chen, L.; Hu, J.; Wang, H.; He, Y.Y.; Deng, Q.Y.; Wu, F.F. Predicting Cd(II) adsorption capacity of biochar materials using typical machine learning models for effective remediation of aquatic environments. Sci. Total Environ. 2024, 944, 173955.
32. Fayaz, S.A.; Zaman, M.; Kaul, S.; Butt, M.A. Is Deep Learning on Tabular Data Enough? An Assessment. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 466–473.
33. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
34. Ke, G.L.; Meng, Q.; Finley, T.; Wang, T.F.; Chen, W.; Ma, W.D.; Ye, Q.W.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
35. Wang, L.; He, T.J.; Ouyang, B. Impact of Domain Knowledge on the Property Prediction of Specialized Machine Learning Models. ACS Mater. Lett. 2025, 7, 2708–2715.
36. Xiao, J.; Mo, G.H.; Zhou, S.K. Mitigating Cd(II) contamination form aqueous solution by phosphate-activated sludge biochar: Role of effect of activation sequence. J. Environ. Chem. Eng. 2024, 12, 114960.
37. Thi, H.V. Mechanistically validated selective adsorption of Pb(II), Cd(II), and Hg(II) on thiol-functionalized hierarchical magnetic biochar under competitive conditions. Sep. Purif. Technol. 2026, 388, 136860.
38. Wang, Z.W.; Nie, Q.; Lei, Z.F.; Zhang, Z.Y.; Shimizu, K.; Yuan, T. Enhanced Pb(II) removal from wastewater by co-pyrolysis biochar derived from sewage sludge and calcium sulfate: Performance evaluation and quantitative mechanism analysis. Sep. Purif. Technol. 2024, 329, 125124.
39. Ahmed, W.; Xu, T.W.; Mahmood, M.; Nunez-Delgado, A.; Ali, S.; Shakoor, A.; Qaswar, M.; Zhao, H.W.; Liu, W.J.; Li, W.D.; et al. Nano-hydroxyapatite modified biochar: Insights into the dynamic adsorption and performance of lead (II) removal from aqueous solution. Environ. Res. 2022, 214, 113827.
40. Cai, T.; Du, H.H.; Liu, X.L.; Tie, B.Q.; Zeng, Z.X. Insights into the removal of Cd and Pb from aqueous solutions by NaOH-EtOH-modified biochar. Environ. Technol. Innov. 2021, 24, 102031.
41. Chen, Q.Y.; Zhang, T.C.; Ouyang, L.K.; Yuan, S.J. Single-Step Hydrothermal Synthesis of Biochar from H3PO4-Activated Lettuce Waste for Efficient Adsorption of Cd(II) in Aqueous Solution. Molecules 2022, 27, 269.
42. Ge, S.Q.; Zhao, S.; Wang, L.; Zhao, Z.Y.; Wang, S.L.; Tian, C.Y. Exploring adsorption capacity and mechanisms involved in cadmium removal from aqueous solutions by biochar derived from euhalophyte. Sci. Rep. 2024, 14, 450.
43. Hou, Y.W.; Lin, S.N.; Fan, J.J.; Zhang, Y.C.; Jing, G.H.; Cai, C. Enhanced Adsorption of Cadmium by a Covalent Organic Framework-Modified Biochar in Aqueous Solution. Toxics 2024, 12, 717.
44. Islam, M.S.; Kwak, J.H.; Nzediegwu, C.; Wang, S.Y.; Palansuriya, K.; Kwon, E.E.; Naeth, M.A.; El-Din, M.G.; Ok, Y.S.; Chang, S.X. Biochar heavy metal removal in aqueous solution depends on feedstock type and pyrolysis purging gas. Environ. Pollut. 2021, 281, 117094.
45. Wijitkosum, S.; Sriburi, T. Aromaticity, polarity, and longevity of biochar derived from disposable bamboo chopsticks waste for environmental application. Heliyon 2023, 9, e19831.
46. Deng, H.; Zhang, J.Y.; Huang, R.; Wang, W.; Meng, M.W.; Hu, L.N.; Gan, W.X. Adsorption of Malachite Green and Pb2+ by KMnO4-Modified Biochar: Insights and Mechanisms. Sustainability 2022, 14, 2040.
47. Qiu, M.Q.; Liu, L.J.; Ling, Q.; Cai, Y.W.; Yu, S.J.; Wang, S.Q.; Fu, D.; Hu, B.W.; Wang, X.K. Biochar for the removal of contaminants from soil and water: A review. Biochar 2022, 4, 19.

Figure 1. Workflow for data-driven adsorption prediction.

Figure 2. Feature distributions of the input variables for the (a) Cd(II) and (b) Pb(II) datasets using Min–Max normalized box plots.

Figure 3. Pearson correlation coefficient (PCC) heatmaps of variables for the Cd(II) and Pb(II) datasets.

Figure 4. Scatter plots of actual vs. predicted adsorption capacities for the mixed dataset: (a) XGBoost; (b) LightGBM; (c) CatBoost.

Figure 5. Predictive performance comparison under independent subset validation: (a) Pb(II)-only model; (b) Cd(II)-only model; (c) Integrated model on Pb(II) test subset; (d) Integrated model on Cd(II) test subset.

Figure 6. Heteroscedastic 95% prediction intervals generated via Locally Adaptive ICP for the (a) Pb(II) and (b) Cd(II) datasets.

Figure 7. Global feature importance and local impact analysis based on SHAP for the integrated Pb(II) and Cd(II) adsorption model.

Figure 8. One-dimensional partial dependence plots for (a) initial concentration, (b) contact time, (c) solid–liquid ratio, and (d) H content.

Figure 9. Two-dimensional feature interaction partial dependence plots of the selected features: (a) interaction between initial concentration and solid-to-liquid ratio; (b) interaction between elemental H content and solution pH.

Table 1. Predictive performance of the three gradient boosting models during 5-fold cross-validation.

Model	Test R² (mean ± SD)	Test RMSE (mean ± SD)
XGBoost	0.9158 ± 0.0438	11.1662 ± 2.0617
LightGBM	0.8869 ± 0.0297	13.2840 ± 1.3782
CatBoost	0.9209 ± 0.0359	10.9171 ± 1.7662
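
The entries in Table 1 have the form mean ± spread over the five validation folds. As a minimal sketch of how such a summary can be computed, the Python snippet below evaluates R² and RMSE per fold and then aggregates them. The helper names (`r2_score`, `rmse`, `summarize`), the use of the sample standard deviation, and the toy fold data are all assumptions made for illustration; they are not the study's code or data, and only two folds are shown for brevity.

```python
import math
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def summarize(per_fold_values, digits=4):
    """Format per-fold scores as 'mean ± std'.

    Uses the sample standard deviation (ddof = 1); this is an assumption,
    since the source does not state which estimator was used.
    """
    mean = statistics.fmean(per_fold_values)
    std = statistics.stdev(per_fold_values)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Hypothetical (actual, predicted) pairs for two folds.
# Illustrative values only, not taken from the study.
folds = [
    ([10.0, 20.0, 30.0], [12.0, 18.0, 31.0]),
    ([15.0, 25.0, 35.0], [14.0, 27.0, 35.0]),
]

r2_per_fold = [r2_score(t, p) for t, p in folds]
rmse_per_fold = [rmse(t, p) for t, p in folds]

print("Test R2:  ", summarize(r2_per_fold))
print("Test RMSE:", summarize(rmse_per_fold))
```

Reporting the spread alongside the mean makes the comparison in Table 1 easier to read: the CatBoost and XGBoost intervals overlap considerably, so the ranking between them rests on a small difference in mean score.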
