Machine Learning-Based Prediction and Optimization of Heavy Metal Adsorption Performance of Biochar

Huang, Xin; Bai, Xiaopeng; Yang, Yifei; Li, Wenbin; Xu, Daochun

doi:10.3390/f17030326

by Xin Huang ¹,²,³, Xiaopeng Bai ¹,²,³,*, Yifei Yang ¹,²,³, Wenbin Li ¹,²,³ and Daochun Xu ¹,²,³

¹ School of Technology, Beijing Forestry University, Beijing 100083, China

² State Key Laboratory of Efficient Production of Forest Resources, Beijing 100083, China

³ Key Laboratory of National Forestry and Grassland Administration on Forestry Equipment and Automation, Beijing 100083, China

* Author to whom correspondence should be addressed.

Forests 2026, 17(3), 326; https://doi.org/10.3390/f17030326

Submission received: 3 February 2026 / Revised: 28 February 2026 / Accepted: 4 March 2026 / Published: 5 March 2026

(This article belongs to the Special Issue Energy Conversion and High-Value Utilization of Agroforestry Biomass)

Abstract

Biochar has been extensively employed in wastewater treatment owing to its effectiveness in removing heavy metal ions. However, the relationships among biomass feedstock composition, pyrolysis conditions, and adsorption performance are highly nonlinear and difficult to quantify systematically across heterogeneous experimental studies. Random Forest (RF), Gradient Boosting Regression (GBR), and XGBoost (XGB) algorithms were employed to predict and optimize biochar yield, adsorption-related physicochemical properties, and adsorption capacity (q_e). The models were developed using two literature-derived datasets comprising 431 samples for physicochemical property and yield prediction (Data 1) and 452 samples for adsorption capacity modeling (Data 2). Results indicated that XGB exhibits superior overall predictive performance across most tasks. Specifically, the single-target XGB models achieved coefficients of determination (R²) of 0.89–0.92 and root mean square errors (RMSE) of 0.02–0.07 on the test set, while the multi-target model attained an R² of 0.83 with an RMSE of 0.05. Further analysis reveals that pyrolysis conditions exert a dominant influence on biochar yield, with pyrolysis temperature identified as the most critical factor. In contrast, the physicochemical properties of biochar and its adsorption performance are primarily governed by feedstock composition, particularly carbon and ash contents. Higher ash content impairs surface functionality and reduces adsorption capacity (q_e), whereas increased carbon content and appropriately optimized pyrolysis conditions contribute to enhanced adsorption capacity. In addition to intrinsic biochar properties, the initial concentration of heavy metal ions in solution constitutes an important external factor influencing adsorption behavior. 
Feature importance analysis, SHAP analysis, and correlation analysis collectively elucidate the key factors affecting biochar characteristics and adsorption performance, as well as the interactions among these factors. These findings provide a data-driven basis for optimizing biochar production parameters and guiding experimental design for efficient Cu²⁺ and Pb²⁺ removal from wastewater.

Keywords: biochar; biomass pyrolysis; machine learning; heavy-metal adsorption; process optimization

1. Introduction

With the accelerating pace of industrialization, heavy metal contamination in industrial wastewater has become increasingly severe, posing significant threats to ecological systems and human health [1]. Consequently, the development of efficient, low-cost, and environmentally friendly technologies for heavy metal removal has emerged as a critical research focus. Among various adsorbent materials, biochar has attracted considerable attention due to its wide availability of feedstock, low production cost, environmental compatibility, and favorable adsorption performance, demonstrating strong potential for wastewater treatment, particularly in the removal of heavy metal ions [2,3]. Biochar is primarily produced from agricultural and forestry residues, which are abundant and readily accessible resources [4].

Biochar is a solid-phase product derived from the pyrolysis of biomass under oxygen-limited or low-oxygen conditions. Owing to its well-developed pore structure, large specific surface area (SSA), and abundant surface functional groups, biochar can effectively remove heavy metal ions from aqueous solutions through multiple mechanisms, including physical adsorption, ion exchange, and surface complexation [5,6]. Previous studies have demonstrated that rational regulation of pyrolysis parameters and feedstock composition can significantly alter the physicochemical properties of biochar, thereby enhancing its adsorption performance [7]. However, variations in feedstock proximate and elemental compositions can lead to substantial differences in the resulting biochar properties, such as ash content, SSA, and cation exchange capacity (CEC). In addition, pyrolysis parameters, including temperature and residence time, exert complex and nonlinear influences on surface pore structure and yield of biochar [8,9].

Traditional studies typically rely on systematic batch adsorption experiments combined with controlled pyrolysis trials to investigate the effects of feedstock characteristics and pyrolysis conditions on the adsorption performance of biochar. In these approaches, researchers usually vary one or a limited number of factors at a time (such as feedstock composition, pyrolysis temperature, or residence time) and subsequently evaluate adsorption behavior through isotherm and kinetic experiments, often supported by physicochemical characterization techniques (e.g., BET surface area analysis, elemental analysis, and functional group identification). Such trial-and-error strategies require multiple controlled pyrolysis experiments at different temperatures and residence times, followed by adsorption isotherm and kinetic tests under varying pH and concentration conditions. Each experimental cycle may take several days to weeks, and the combined cost of material preparation, characterization, and adsorption testing can be substantial. Moreover, because only one factor is typically varied at a time, these approaches have limited ability to capture nonlinear interactions among multiple variables [10]. In this context, there is a growing need for data-driven approaches capable of simultaneously analyzing multiple interacting variables and capturing nonlinear relationships without relying on one-factor-at-a-time experimental designs. Such approaches can leverage existing experimental data to extract underlying patterns and reduce the dependence on repetitive and resource-intensive laboratory trials. Therefore, machine learning techniques, with their strong capability to handle high-dimensional, nonlinear, and multivariate problems, have increasingly been adopted as powerful tools in biochar research [11].

Leng et al. [12] developed machine learning models, including Random Forest (RF) and Extreme Gradient Boosting (XGBoost), to predict the yield, nitrogen content, and specific surface area of biochar produced from biomass pyrolysis. Habib et al. [13] employed RF and Support Vector Regression models to predict the adsorption capacity of iron-modified biochar for selenium removal. Su and Wang [14] established relationships between biomass feedstock properties, pyrolysis conditions, and the composition of oxygenated compounds in bio-oil. Kanthasamy et al. [15] combined experimental data with ensemble learning models to predict biochar yield and surface physicochemical properties. While these studies demonstrate the growing applicability of machine learning in biochar-related prediction tasks, recent research has also highlighted important methodological considerations when dealing with heterogeneous environmental datasets. For instance, Palansooriya et al. [16] emphasized that biochar “performance” in heavy-metal systems should be interpreted as an outcome jointly shaped by intrinsic material properties and experimental or geochemical context. Similarly, Abouzari et al. [17] cautioned that exceptionally high predictive performance may be misleading if robustness checks are not conducted under mixed experimental conditions.

Despite the increasing application of machine learning in biochar research, several limitations remain. Most existing studies focus either on predicting isolated properties or on modeling adsorption performance under specific experimental settings. They rarely establish an integrated framework that simultaneously links biomass composition, pyrolysis conditions, biochar physicochemical properties, and adsorption capacity within a unified predictive and optimization scheme. Moreover, systematic harmonization of heterogeneous literature-derived datasets and explicit evaluation of model robustness under such conditions remain insufficiently addressed. These gaps limit the generalizability and practical applicability of current modeling efforts. In contrast to previous studies that focus on single-property prediction or case-specific adsorption modeling, the present study establishes an integrated data-driven framework that simultaneously models biochar yield, physicochemical properties, and equilibrium adsorption capacity under heterogeneous experimental conditions. Furthermore, this work systematically links feedstock composition and pyrolysis parameters to adsorption performance through intermediate physicochemical descriptors, providing a more structured interpretation pathway compared to purely empirical prediction models.

2. Materials and Methods

2.1. Dataset and Preprocessing

Experimental data on biochar production from biomass pyrolysis and its application in heavy metal adsorption over the past decade were collected from the Web of Science database. The data were divided into two subsets. Data 1 was used to predict the yield and properties of biochar, with inputs including the proximate analysis of feedstocks (volatile matter, ash, and fixed carbon), elemental composition (C, H, N, O), and process parameters (temperature T, residence time Rt, and heating rate HR). The outputs consisted of biochar yield and physicochemical properties (e.g., SSA, biochar pH, and CEC). Data 2 used the output variables of Data 1 as inputs, with the adsorption capacity (q_e) as the target variable. Data 1 contains 431 samples and Data 2 contains 452 adsorption records compiled from independent literature sources. In this study, the adsorption capacity represents the equilibrium adsorption capacity, defined as the amount of metal ions adsorbed per unit mass of biochar at adsorption equilibrium. Data 2 primarily includes copper (Cu²⁺) and lead (Pb²⁺) ions under single-solute conditions, which are commonly investigated contaminants in biochar adsorption studies. Only experimentally measured equilibrium adsorption capacities (q_e) were included in the dataset, while fitted parameters such as the Langmuir maximum adsorption capacity were excluded. Experimental conditions such as initial concentration (C₀), solution pH, and temperature were available and incorporated into the model. However, other experimental parameters, such as contact time and sorbent dosage, were not consistently reported across all literature sources and therefore could not be standardized or included. Duplicate records from the same experimental conditions were identified and removed during dataset construction.

All proximate and elemental analyses were standardized to mass fraction (wt.%), and yield was also expressed in wt.%. SSA was expressed in m²·g⁻¹, CEC in cmol·kg⁻¹, temperature in °C, time in minutes, heating rate in °C·min⁻¹, and adsorption capacity in mg·g⁻¹. Inconsistent units and differences in data scales may lead to model bias; hence, strict standardization was conducted prior to data integration [18].

In machine learning, standardization is performed using the standard normal variable Z:

Z = (X - μ) / σ

(1)

where X denotes the original data of the variable, μ represents the mean of the variable, and σ represents the standard deviation of the variable.
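As a minimal sketch of Equation (1), the snippet below standardizes a single feature in plain Python. The helper names (`fit_scaler`, `transform`) and the example temperatures are illustrative assumptions, not the authors' code. It uses the population standard deviation, and it estimates μ and σ from the training values only before reusing them on the test values.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Estimate mu and sigma of Eq. (1) from the training data only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)  # population standard deviation
    if sigma == 0:
        sigma = 1.0  # guard against constant features
    return mu, sigma

def transform(values, mu, sigma):
    """Apply Z = (X - mu) / sigma with previously fitted parameters."""
    return [(x - mu) / sigma for x in values]

# Hypothetical pyrolysis temperatures in degrees C (illustrative values only)
train_T = [300.0, 400.0, 500.0, 600.0, 700.0]
test_T = [450.0, 650.0]

mu, sigma = fit_scaler(train_T)
z_train = transform(train_T, mu, sigma)  # zero mean, unit variance
z_test = transform(test_T, mu, sigma)    # scaled with training statistics
```

Reusing the training statistics for the test values keeps the held-out data from influencing the scaling.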

Feature standardization was applied to ensure consistency across different model classes, particularly when comparing tree-based ensemble models with other algorithms that are sensitive to feature scale. While tree-based models are generally insensitive to scaling, standardization was performed to maintain uniform preprocessing procedures across models.

The quality of the dataset strongly affects the accuracy of predictive models; thus, rigorous preprocessing was performed before model construction and training. Due to the heterogeneous sources of data, some variables contained missing values. Missing values were not strictly randomly distributed. In literature-mined datasets, incomplete reporting may reflect study-specific measurement focus rather than random omission. The missing values were imputed using the K-nearest neighbors (KNN) imputation method with k = 5. The choice of k was determined based on cross-validated reconstruction stability within the training set [19]. Outliers were identified and removed by combining the interquartile range (IQR) method with domain-specific constraints (e.g., the sum of elemental and proximate analysis components ≤ 100%), effectively eliminating extreme samples [20].
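The idea behind KNN imputation can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the procedure used in the study: the function name `knn_impute`, the root-mean-square distance over jointly reported features, and the toy rows are all hypothetical, and features are assumed to be on comparable scales already.

```python
import math

def knn_impute(rows, k=5):
    """Fill None entries with the mean of the k nearest donor rows.

    Distance is the root-mean-square difference over the features that both
    rows report, so rows with different missing patterns stay comparable.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]  # copy so the input is not mutated
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors must report feature j; distances use the original rows
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical scaled feature rows; the last row misses its second feature
rows = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4], [5.0, 10.0], [0.1, None]]
filled = knn_impute(rows, k=3)
```

In this toy example the three closest donors supply the missing value, while the distant row `[5.0, 10.0]` does not contribute.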

To prevent data leakage, dataset splitting was conducted at the literature-source level. All samples originating from the same publication were assigned exclusively to either the training set or the testing set, ensuring that study-specific experimental patterns were not shared across sets. Furthermore, all preprocessing procedures, including missing value imputation using the K-nearest neighbors (KNN) method, outlier removal, and feature standardization, were performed using parameters derived exclusively from the training set and then applied to the testing set. This ensured that no information from the testing set was introduced during model training or preprocessing. In addition, Data 1 and Data 2 were used to develop independent predictive models. Data 1 was used to predict biochar physicochemical properties and yield based on biomass composition and pyrolysis conditions, while Data 2 was used to predict qe based on biochar properties and experimental conditions. The target variable in Data 2 was not used during the training of models developed using Data 1. Highly correlated features were identified using the Pearson correlation matrix, and feature selection was conducted to mitigate multicollinearity [21].
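A source-level split of the kind described above can be sketched as follows. The record layout (a `"source"` key per sample), the function name `split_by_source`, and the toy data are assumptions made for illustration. Note that holding out 20% of the publications only approximates a 4:1 split of the samples, because publications contribute different numbers of records.

```python
import random

def split_by_source(records, test_fraction=0.2, seed=0):
    """Assign whole publications to either the training or the testing set."""
    sources = sorted({r["source"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(sources)
    n_test = max(1, round(test_fraction * len(sources)))
    test_sources = set(sources[:n_test])
    train = [r for r in records if r["source"] not in test_sources]
    test = [r for r in records if r["source"] in test_sources]
    return train, test

# Hypothetical records: 10 publications with 3 samples each
records = [{"source": f"paper_{p}", "T": 300 + 50 * s}
           for p in range(10) for s in range(3)]
train, test = split_by_source(records)

# No publication may appear on both sides of the split
train_sources = {r["source"] for r in train}
test_sources = {r["source"] for r in test}
```

scikit-learn offers grouped splitters such as `GroupShuffleSplit` for the same purpose; whether the authors used one is not stated in the text.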

The correlation coefficient ρ is calculated as follows:

ρ = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2}} \sqrt{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}}

(2)

where ρ denotes the correlation coefficient, x_{i} and y_{i} represent the i-th paired observations of the two variables, and \bar{x} and \bar{y} denote the mean values of variables x and y, respectively. The numerator of the equation corresponds to the covariance between x and y, while the denominator is the product of their standard deviations, up to a common normalization factor that cancels.

A threshold of |ρ| > 0.90 was considered indicative of strong linear association. However, correlated features were not automatically removed, particularly for compositional variables constrained by mass balance. Correlation analysis was used as a preliminary screening tool to inform model interpretation rather than a strict elimination criterion.
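The screening step can be illustrated with a short plain-Python sketch that computes the Pearson coefficient and flags feature pairs with |ρ| > 0.90. The function names and the example columns (volatile matter, fixed carbon, and temperature) are hypothetical values chosen for illustration, not data from the study.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
                   * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return numerator / denominator

def flag_pairs(columns, threshold=0.90):
    """Return (name_a, name_b, rho) for pairs with |rho| above the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        rho = pearson(col_a, col_b)
        if abs(rho) > threshold:
            flagged.append((name_a, name_b, round(rho, 3)))
    return flagged

# Hypothetical columns: VM and FC are nearly complementary, T is unrelated
columns = {
    "VM": [70.0, 65.0, 80.0, 60.0, 75.0],
    "FC": [20.0, 24.0, 12.0, 27.0, 17.0],
    "T": [300.0, 700.0, 500.0, 400.0, 600.0],
}
flagged = flag_pairs(columns)
```

Flagged pairs are candidates for closer inspection rather than automatic removal, in line with the screening role described above.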

2.2. Model Development

Three representative ensemble learning algorithms, RF, GBR, and XGB, were implemented in Python 3.11 using the Scikit-learn framework to predict biochar yield, physicochemical properties, and adsorption performance.

RF is a bagging-based ensemble learning method that constructs multiple decision trees through bootstrap sampling of the original dataset. During model training, each decision tree is built using a randomly selected subset of samples and input features, which enhances the diversity among trees. The final prediction is obtained by aggregating the outputs of all individual decision trees. In the RF model, feature importance is commonly quantified by calculating the average decrease in impurity caused by each input variable across all trees, thereby reflecting the relative contribution of different features to the model predictions.

GBR is a boosting-based ensemble learning approach that improves predictive accuracy by sequentially constructing decision trees to fit the residuals of the previous model. In each iteration, the newly generated tree is used to correct the prediction errors of the existing model, gradually approximating the true target values. By iteratively minimizing the loss function, GBR effectively captures complex nonlinear relationships between input variables and target outputs. Compared with bagging-based methods, GBR places greater emphasis on correcting errors associated with difficult-to-predict samples.
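The residual-fitting loop described above can be made concrete with a toy implementation. The sketch below assumes a squared-error loss, one-split trees (stumps) on a single feature, and hypothetical temperature and yield values; it illustrates the principle only and is not the GBR configuration used in this work.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_trees=100, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(y) / len(y)  # initial prediction: the mean of the targets
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: yield (wt.%) decreasing nonlinearly with temperature
x = [300.0, 350.0, 400.0, 450.0, 500.0, 550.0, 600.0, 650.0, 700.0]
y = [45.0, 41.0, 38.0, 35.0, 33.0, 31.0, 30.0, 29.0, 28.5]

model = gradient_boost(x, y)
mse_baseline = mse(y, [sum(y) / len(y)] * len(y))
mse_boosted = mse(y, [model(xi) for xi in x])
```

With a squared-error loss the residuals equal the negative gradient, so each added stump lowers the training error relative to the mean-only baseline.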

XGB is an efficient extension of the gradient boosting framework, in which regularization terms are incorporated into the objective function to constrain model complexity and mitigate overfitting. The optimization objective of XGB consists of the prediction error on the training data and a regularization term, and the objective function is approximated using a second-order Taylor expansion to accelerate the training process.
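For reference, the regularized objective at boosting iteration t can be written in its standard second-order form. The notation below follows the usual XGBoost formulation and is not taken from this article: l is the loss, f_t the tree added at iteration t, g_i and h_i the first and second derivatives of the loss with respect to the previous prediction, J the number of leaves (written J here to avoid confusion with the pyrolysis temperature T), w_j the leaf weights, and γ and λ the regularization coefficients.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_{i} f_{t}(x_{i}) + \tfrac{1}{2} h_{i} f_{t}^{2}(x_{i}) \right] + \Omega(f_{t}),
\qquad
\Omega(f) = \gamma J + \tfrac{1}{2} \lambda \sum_{j=1}^{J} w_{j}^{2},

g_{i} = \frac{\partial\, l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)}{\partial\, \hat{y}_{i}^{(t-1)}},
\qquad
h_{i} = \frac{\partial^{2} l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)}{\partial \left(\hat{y}_{i}^{(t-1)}\right)^{2}},
\qquad
w_{j}^{*} = - \frac{\sum_{i \in I_{j}} g_{i}}{\sum_{i \in I_{j}} h_{i} + \lambda}
```

Here I_j denotes the set of samples falling into leaf j. For the squared-error loss l = ½(y − ŷ)², the derivatives reduce to g_i = ŷ_i − y_i and h_i = 1, so the optimal leaf weight is a shrunken mean of the residuals in that leaf.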

2.3. Model Training and Evaluation

The dataset was divided into training and testing sets at a ratio of 4:1. To prevent potential study-specific information leakage, the split was performed at the literature-source level. Specifically, all samples originating from the same publication were assigned exclusively to either the training set or the testing set.

For hyperparameter optimization, a combination of grid search and Bayesian optimization was employed to systematically explore key parameters, including the number of trees, maximum tree depth, minimum number of samples required for splitting and for leaf nodes, and the number of features considered at each split. Hyperparameter combinations were evaluated using five-fold cross-validation within the training set, and the configuration yielding the best cross-validated performance was selected as the final model setting. The detailed search ranges and Bayesian optimization settings are provided in the Supplementary Materials. Bayesian optimization was initialized with a set of exploratory iterations followed by sequential updates guided by the acquisition function to balance exploration and exploitation of the search space. Computational cost was assessed empirically based on the total number of optimization iterations and observed training time under the specified hardware environment. No formal theoretical complexity analysis was conducted.
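The grid-search part of this procedure can be sketched with a stand-in model. Everything below is an illustrative assumption: a toy one-dimensional nearest-neighbour regressor replaces the tree ensembles, the grid contains two hypothetical hyperparameters, and folds are formed by simple interleaving rather than by literature source. Bayesian optimization is not shown.

```python
import math
from itertools import product

def knn_predict(train, x, n_neighbors, weighted):
    """Toy 1-D k-nearest-neighbour regressor used as a stand-in model."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    weights = [1.0 / (abs(xi - x) + 1e-9) for xi, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def cv_rmse(data, params, n_folds=5):
    """Mean RMSE over n_folds held-out folds of the training data."""
    scores = []
    for fold in range(n_folds):
        valid = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        errors = [(y - knn_predict(train, x, **params)) ** 2 for x, y in valid]
        scores.append(math.sqrt(sum(errors) / len(errors)))
    return sum(scores) / len(scores)

def grid_search(data, grid):
    """Evaluate every combination in the grid and keep the lowest CV RMSE."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cv_rmse(data, params), params))
    return min(results, key=lambda r: r[0])

# Hypothetical training data: y = 0.01 * x**2 sampled on a regular grid
data = [(float(x), 0.01 * x * x) for x in range(40)]
grid = {"n_neighbors": [1, 3, 5], "weighted": [False, True]}
best_score, best_params = grid_search(data, grid)
```

Because model selection uses only folds drawn from the training data, the test set remains untouched until the final evaluation.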

Performance metrics are a crucial component in assessing the effectiveness of machine learning models. In this study, two commonly used evaluation metrics were employed: the coefficient of determination (R²) and the root mean square error (RMSE). R² measures the proportion of variance in the response variable that is explained by the model, while RMSE quantifies the typical deviation between the predicted and observed values. Both metrics represent the goodness of fit of a model but differ in interpretation and scale. R² typically ranges from 0 to 1, and the closer it is to 1, the better the model fit; on held-out data it can become negative when the model predicts worse than the mean of the observations. RMSE, on the other hand, is unbounded and depends on the scale of the variables; it reflects the magnitude of prediction errors, with smaller values indicating higher predictive accuracy.

In addition to R² and RMSE, Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) were also calculated to provide complementary absolute and relative error assessments. Compared with RMSE, MAE is less sensitive to extreme outliers and therefore offers complementary insight into overall model robustness. MAPE was introduced to evaluate relative prediction error by expressing deviations as a percentage of the observed values. This metric facilitates comparison across target variables with different magnitudes.

The formulas for the evaluation metrics are as follows:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(3)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

(4)

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

(5)

MAPE = \frac{100\%}{n} \sum_{i = 1}^{n} | \frac{y_{i} - {\hat{y}}_{i}}{y_{i}} |

(6)

where y_{i} is the observed value, {\hat{y}}_{i} is the predicted value, \bar{y} is the mean of observed values, and n is the total number of samples.
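The four metrics in Equations (3)–(6) translate directly into code. The sketch below is a plain-Python illustration with hypothetical observed and predicted values; the function name `regression_metrics` is an assumption. MAPE is undefined when an observed value is zero, so such samples would need separate handling.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R2, RMSE, MAE, and MAPE as defined in Eqs. (3)-(6)."""
    n = len(y_true)
    y_bar = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
        # Percentage error relative to each observed value (requires y != 0)
        "MAPE": 100.0 / n * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)),
    }

# Hypothetical observed and predicted adsorption capacities (mg/g)
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(y_true, y_pred)
```

For these illustrative values the residual sum of squares is 26 and the total sum of squares is 500, giving R² = 0.948, RMSE ≈ 2.55, MAE = 2.5, and MAPE = 11.875%.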

2.4. Model Interpretation

To facilitate a deeper understanding of the relationships between input variables and model outputs, feature importance analysis, SHapley Additive exPlanations (SHAP), and multivariate correlation analysis were employed to interpret the machine learning models exhibiting superior predictive performance. Feature importance analysis quantitatively evaluates the relative contribution of each input variable to the model predictions, where higher importance scores indicate a stronger influence of the corresponding feature on the target variable, and lower scores suggest a weaker impact. The SHAP method not only provides a ranking of feature importance but also characterizes the positive or negative influence of individual variables on the model predictions. By assigning a signed contribution value to each input feature, SHAP quantifies whether a given feature promotes or suppresses the predicted outcome, thereby simultaneously reflecting both the magnitude and direction of its effect. It should be noted that SHAP values describe feature influence within the trained model structure and do not imply causal relationships. Multivariate correlation analysis is employed to investigate the relationships among multiple input variables as well as their associations with the target output, enabling the identification of potential coupling effects and synergistic interactions among key factors.
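The signed contributions that SHAP assigns are Shapley values, which can be computed exactly for a small model by enumerating feature coalitions. The sketch below assumes an interventional value function (features outside the coalition are replaced by background rows) and uses a hypothetical three-feature stand-in model; it is a brute-force illustration of the principle, not the tree-specific algorithm used in practice.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of model(x) relative to a background sample."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the values of x; the others come from
        # each background row, and the predictions are averaged.
        total = 0.0
        for b in background:
            mixed = [x[i] if i in subset else b[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi

# Hypothetical stand-in model: output rises with carbon and C0, falls with ash
def toy_model(features):
    carbon, ash, c0 = features
    return 0.5 * carbon - 0.8 * ash + 0.1 * c0 + 0.01 * carbon * c0

background = [[50.0, 10.0, 100.0], [60.0, 5.0, 50.0], [40.0, 20.0, 150.0]]
x = [70.0, 8.0, 120.0]
phi = shapley_values(toy_model, x, background)
baseline = sum(toy_model(b) for b in background) / len(background)
```

The contributions sum to the difference between the prediction for `x` and the average background prediction, and the sign of each entry shows whether that feature raises or lowers the prediction for this sample.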

SHAP values were computed using the TreeExplainer method, which is specifically designed for tree-based ensemble models. SHAP analysis was performed on the testing set to interpret model generalization behavior. A subset of training samples was used as background data for SHAP value estimation. The response surface visualizations were generated using partial dependence plots (PDP). PDPs depict how the model’s predicted adsorption capacity changes as a selected input variable varies across its observed range, while averaging predictions over the remaining input variables. Multivariate correlation analysis in this study refers primarily to partial dependence plots (PDP) to explore learned associations. SHAP interaction values were not explicitly computed.
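A one-dimensional partial dependence curve can be computed with a few lines of plain Python, as sketched below. The stand-in model, its coefficients, and the sample rows are hypothetical and chosen only so that the averaging step is easy to follow.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average prediction as one feature is swept over a grid of values."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite the selected feature
            predictions.append(model(modified))
        curve.append(sum(predictions) / len(predictions))
    return curve

# Hypothetical stand-in model with an optimum temperature near 550 degrees C
def toy_model(features):
    temperature, ash = features
    return 100.0 - 0.0004 * (temperature - 550.0) ** 2 - 1.5 * ash

rows = [[400.0, 5.0], [500.0, 10.0], [600.0, 15.0], [700.0, 20.0]]
grid = [300.0, 400.0, 500.0, 550.0, 600.0, 700.0]
pdp = partial_dependence(toy_model, rows, feature_index=0, grid=grid)
best_temperature = grid[pdp.index(max(pdp))]
```

Each point on the curve averages over the observed values of the remaining features, which is why a PDP reflects associations learned by the model rather than causal effects.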

2.5. Model Application

Based on the machine learning models exhibiting superior predictive performance, an interactive graphical user interface (GUI) platform was developed for model integration, deployment, and validation. The platform was implemented using the Python Streamlit framework. As a lightweight web application development tool, Streamlit enables rapid integration of data preprocessing workflows, model inference interfaces, and visualization components. The machine learning models were modularly encapsulated together with data standardization, parameter transmission, and model invocation procedures.

3. Results and Discussion

3.1. Dataset Description and Statistical Analysis

Pearson correlation coefficients (PCCs) describing the relationships between input variables and target-related parameters are presented in Figure 1. The sign of the PCC indicates the direction of the linear relationship between two variables, and its absolute value reflects the degree of linear association, ranging from −1 (perfect negative) to +1 (perfect positive). According to commonly used interpretation thresholds in statistics and machine learning literature, absolute values of |r| < 0.3 are often regarded as weak correlations, 0.3 ≤ |r| < 0.5 as moderate, and |r| ≥ 0.5 as relatively stronger associations. These thresholds have been applied in studies that interpret effect sizes based on correlation magnitudes [22].
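The interpretation thresholds above can be encoded as a small helper. The function name and the returned labels are illustrative assumptions.

```python
def correlation_strength(r):
    """Label a Pearson coefficient using the thresholds 0.3 and 0.5."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a Pearson coefficient must lie in [-1, 1]")
    magnitude = abs(r)
    if magnitude < 0.3:
        label = "weak"
    elif magnitude < 0.5:
        label = "moderate"
    else:
        label = "relatively strong"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return label, direction

labels = {r: correlation_strength(r) for r in (0.48, -0.63, 0.51, -0.1)}
```

Applying one function to every reported coefficient keeps the verbal labels consistent with the stated cut-offs.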

As shown in Figure 1a for Data 1, most PCC values between feedstock composition and pyrolysis process variables fall below |r| ≈ 0.5, so relatively strong linear dependencies (|r| ≥ 0.5) are limited to a few pairs. Specifically, oxygen content and ash content exhibit a relatively strong negative correlation (r ≈ −0.60), as do volatile matter (VM) and ash content (r ≈ −0.63). Most other pairs, including those between process parameters (temperature T, residence time Rt) and feedstock properties, remain within the weak-to-moderate range (|r| < 0.50), indicating limited linear association strength.

Figure 1b depicts the correlation matrix for Data 2, where adsorption-related variables show more complex relationships. For example, CEC exhibits a moderate positive correlation with oxygen content (r ≈ 0.48), suggesting a tendency for oxygen-rich feedstocks to promote exchangeable surface functionality relevant to adsorption [23]. SSA is negatively correlated with nitrogen content (r ≈ −0.59) and, more moderately, with oxygen content (r ≈ −0.36), while it is positively correlated with carbon content (r ≈ 0.51). In addition, a comprehensive analysis of adsorption-related variables indicates that CEC and SSA are closely associated with adsorption behavior, while external experimental conditions, such as C₀, also exert a certain influence on adsorption performance.

Overall, the PCC analysis indicates that several variable pairs exhibit moderate-to-relatively strong linear associations (|r| ≈ 0.5–0.7), but no extremely strong linear redundancy (e.g., |r| > 0.9) was observed across the dataset. By standard statistical interpretation criteria [22], the dataset therefore does not exhibit pairwise collinearity that would imply direct linear redundancy among variables. However, the absence of extremely high Pearson correlation coefficients does not fully exclude the possibility of multivariate collinearity, particularly given the compositional and physically constrained nature of certain biomass-related variables. In addition, given the heterogeneous nature of the compiled dataset, some correlations may reflect study-specific conditions or shared experimental contexts rather than intrinsic mechanistic links. The correlation analysis in Figure 1 is therefore used as an exploratory, descriptive assessment of pairwise linear associations and for preliminary screening, rather than as definitive evidence of variable independence.

3.2. Model Training and Performance Evaluation

Through a systematic comparison of three machine learning models in predicting key target variables, including biochar yield, SSA, CEC, and q_e, the XGB model was found to exhibit the best overall performance. Specifically, the XGB model achieved R² values of 0.92, 0.91, 0.89, and 0.99 for yield, SSA, CEC, and q_e, respectively, with corresponding RMSE values of 0.07, 0.08, 0.03, and 0.02. These results indicate that the XGB model outperforms the RF and GBR models in terms of prediction accuracy and stability. Based on these findings, XGB demonstrates clear advantages in fitting capability, generalization performance, and modeling of complex nonlinear relationships, and was therefore selected as the core model for subsequent interpretability analysis and process optimization. The quantitative comparison of model performance is summarized in Table 1.

The relatively large standard deviation observed in the 5-fold cross-validation results for CEC suggests either intrinsic data variability or sensitivity to measurement heterogeneity. CEC values reported across literature may vary due to differences in extraction protocols and ionic conditions, which may amplify prediction variance and contribute to model instability.

The relatively lower R² observed in the multi-target model (0.82–0.83) compared with single-target models and certain literature benchmarks may be attributed to differences in dataset heterogeneity and validation strategy. In the present study, the multi-target model simultaneously predicts multiple physicochemical properties derived from literature-mined data compiled across diverse feedstocks, experimental protocols, and reporting standards. Such cross-study heterogeneity increases variance. Using more homogeneous experimental datasets or target-specific optimization strategies may favor higher reported R² values.

The prediction results are presented in Figure 2. In terms of elemental composition prediction, the predicted values of C, H, and O show a high degree of agreement with the measured data, with data points closely distributed around the 1:1 line, indicating strong stability and predictive accuracy. In contrast, N exhibits a certain degree of dispersion in the low-value region, although an overall clear linear trend is still maintained. The prediction of Ash content demonstrates high accuracy, with both training and test data tightly clustered around the 1:1 line, suggesting that the input features adequately capture the influence of inorganic components in the feedstock on the final Ash content. The prediction of biochar yield also shows good consistency with experimental values, although some dispersion is observed in the intermediate yield range.

For physicochemical property prediction, the XGB model exhibits strong predictive capability for biochar pH, SSA, and CEC. The predicted biochar pH values are generally consistent with the measured data, though increased dispersion is observed in the high-pH region. For SSA prediction, the predicted values generally follow the 1:1 line relative to the experimental values, although a certain degree of dispersion is observed in localized intervals. For CEC prediction, the predicted values exhibit a consistent trend with the experimental measurements across the evaluated range. Regarding the adsorption q_e, the predicted values are largely aligned with the experimental values along the 1:1 line over the entire range, with data points showing a relatively concentrated distribution. This result indicates that the XGB model can effectively integrate biochar characterization parameters and experimental conditions to achieve high-precision prediction of adsorption performance. However, such a high level of agreement also highlights the need for caution regarding potential information leakage during model construction and validation. Although literature-source–level splitting was strictly implemented to prevent study-specific leakage, additional robustness validation strategies (e.g., temporal splitting or leave-one-feedstock-out validation) were not conducted in this study. Therefore, while no direct leakage was identified during verification, the exceptionally high predictive performance for q_e should be interpreted within the current validation framework. Future work incorporating temporal and feedstock-stratified validation would further strengthen generalization assessment. Physicochemical properties such as SSA, CEC, and biochar pH are often measured within the same experimental systems as the adsorption tests, which may introduce structural dependence between predictors and response variables.

3.3. Feature Importance Analysis

Figure 3 presents the feature importance rankings obtained from the XGB model for different prediction targets. The distributions of feature importance vary markedly across biochar properties and adsorption capacity, indicating that each target variable is governed by a distinct set of key input features within the modeling framework.

For elemental composition prediction, the importance rankings separate the elements into two groups. For C and N, a direct correspondence is observed between the feedstock content and the biochar content: C_biomass and N_biomass dominate the predictions of C_char and N_char, accounting for approximately 33% and 38% of the total importance, respectively. This indicates that the C and N contents of biochar are largely constrained by the initial feedstock composition and that the XGB model captures this relationship. In contrast, the predictions of H_char and O_char are more strongly controlled by pyrolysis conditions, with T contributing approximately 48% and 47% of the importance, respectively, while H_biomass and O_biomass contribute only about 15% and 19%. This indicates that H and O retention in biochar is more sensitive to thermal conditions than to their initial contents in the feedstock [24].

In the prediction of ash content, Ash_biomass plays a dominant role, accounting for approximately 64% of the total importance, while the individual contributions of all other variables are below 10%. This result confirms that biochar Ash content is primarily inherited from the inorganic components of the biomass feedstock [25]. For biochar pH, T is the most influential factor, contributing about 58%, followed by Rt at approximately 18%. In comparison, the contributions of feedstock composition variables are generally low, indicating that biochar surface alkalinity is mainly regulated by pyrolysis process conditions rather than by initial biomass composition [26].

In the prediction of SSA, Ash_biomass is the most important feature, accounting for approximately 49% of the total importance. This result indicates a strong association between variations in specific surface area and ash content, and it is consistent with existing studies suggesting that higher ash contents may occupy pore space or hinder pore structure development [27]. For CEC, C_biomass exhibits the highest importance (approximately 34%), followed by Ash_biomass (approximately 21%) and Rt (approximately 15%). These results indicate that CEC is jointly influenced by feedstock composition and pyrolysis conditions [28].

In the prediction of biochar yield, T clearly dominates the feature importance ranking, accounting for approximately 63%, far exceeding all other variables. Rt provides a secondary contribution (approximately 14%), indicating that pyrolysis conditions are the key parameters controlling biochar yield. Under high T and prolonged Rt, enhanced volatile release and secondary cracking reactions lead to a reduction in solid-phase products [29].

For the prediction of q_e, Ash_biomass and the initial solution concentration (C₀) are identified as the most influential variables, together accounting for approximately 68% of the total importance: Ash_biomass contributes about 42% and C₀ approximately 26%. The individual contributions of other variables (such as CEC, biochar pH, and SSA) are relatively low, indicating that adsorption behavior is jointly governed by biochar properties and adsorption driving forces, with possible contributions from mineral-mediated pathways depending on experimental conditions [30].

3.4. Interpretability Analysis

To further interpret the decision-making behavior of the XGB model, SHAP was applied to quantify the contribution of each input variable to the model outputs. The SHAP summary plots for different target variables are shown in Figure 4.

It is important to distinguish between design variables and assay variables. Design variables, including feedstock composition, T, and Rt, represent controllable production parameters that determine the intrinsic physicochemical properties of biochar. In contrast, assay variables, including C₀, solution pH, and ambient temperature, represent experimental conditions under which adsorption performance is measured. While these assay variables strongly influence the q_e, they reflect the testing environment rather than intrinsic material properties.

In the prediction of biochar yield, T exhibits the highest SHAP values among all input variables. The SHAP distribution shows a clear overall negative trend with increasing T, indicating that a higher T is associated with lower predicted yields. This pattern is consistently observed across the majority of samples, further confirming T as the dominant factor governing yield variation. In comparison, the SHAP magnitude of VM is relatively smaller, but it shows predominantly positive contributions within lower T ranges, suggesting that feedstocks rich in VM are more favorable for solid residue retention under milder pyrolysis conditions [29]. The predicted yield reduction of more than 40% across the 300–800 °C temperature range is consistent with recent pyrolysis literature. Multiple studies have demonstrated that increasing temperature intensifies devolatilization and secondary cracking reactions, leading to substantial reduction of solid-phase residues, with yield losses reported within the 30%–60% range depending on feedstock type and heating severity [31]. Therefore, the magnitude of yield decline observed in this study falls within expected experimental variability rather than representing an anomalous prediction.

For elemental composition prediction, distinct controlling patterns are observed for different elements. In the cases of C_char and H_char, the SHAP results indicate that their predictions are primarily influenced by the corresponding elemental contents in the feedstock (C_biomass and H_biomass), together with T. Higher C_biomass and H_biomass contents generally contribute positively to C_char and H_char, whereas increasing T exerts a negative effect, which is particularly pronounced for H_char, indicating reduced H retention at elevated temperatures. For O_char, T and O_biomass dominate the SHAP rankings, with T consistently showing a negative contribution, reflecting the strong sensitivity of oxygen-containing structures to high-T conditions [29].

In the prediction of Ash_char, feedstock Ash_biomass exhibits the highest SHAP contribution among all input variables. Its contribution is consistently positive and substantially larger than those of other variables, indicating that biochar Ash content is primarily inherited from the inorganic fraction of the feedstock, while the influence of pyrolysis conditions is relatively limited [32]. For biochar pH, T and Rt emerge as the most influential variables, both exhibiting clear positive SHAP contributions. This distribution pattern suggests that higher T and longer Rt generally lead to increased surface alkalinity of biochar, whereas feedstock elemental composition plays a comparatively minor role in biochar pH regulation [33].

In the prediction of SSA, Ash_biomass shows the strongest negative SHAP contribution, with higher Ash_biomass contents corresponding to lower SSA values. In contrast, the SHAP distribution of temperature displays both positive and negative contributions, indicating that the effect of temperature on SSA is not strictly linear and may depend on interactions with other variables [34]. The bidirectional SHAP contributions of temperature indicate a non-monotonic relationship. Moderate pyrolysis temperatures (typically 400–600 °C) are reported to enhance pore development and promote aromatic condensation, thereby increasing specific surface area. However, at excessively high temperatures (>700 °C), partial pore collapse, sintering of mineral phases, and depletion of oxygen-containing functional groups may reduce effective surface reactivity despite increased carbonization. Such temperature-dependent structural evolution has been documented in recent biochar studies [35].

For CEC, feedstock N_biomass and C_biomass exhibit the highest SHAP importance. Their contributions alternate between positive and negative depending on their values, suggesting that CEC formation is jointly controlled by feedstock composition and pyrolysis conditions, rather than being dominated by a single factor. Ash content and Rt provide secondary contributions, further highlighting the role of multivariate interactions [36].

In the prediction of q_e, the SHAP results clearly identify feedstock Ash_biomass and initial C₀ as the two most influential variables. Higher C₀ values consistently show positive contributions, reflecting the enhancement of adsorption driving force by larger concentration gradients. Ash_biomass also exhibits strong positive SHAP contributions, whereas SSA and biochar pH display comparatively lower SHAP magnitudes, indicating that their effects on q_e are not dominant when multiple variables are considered simultaneously [37].

It should be noted that SHAP analysis reflects the contribution of input features to model predictions based on the learned relationships within the machine learning algorithm. These results provide insights into feature importance and model behavior; however, they do not establish causal relationships between variables and predicted outcomes. When predictors are correlated, SHAP-based attribution may be distributed among related variables, which can influence the interpretation of individual feature effects. Therefore, these interpretability tools should be understood as model-driven explanatory analyses rather than definitive evidence of causality.
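To make the attribution logic concrete, the following minimal sketch computes exact Shapley values for a small hypothetical surrogate function. It is not the tree-based SHAP computation applied to the XGB models: features outside a coalition are simply replaced by a single baseline value, and the function `toy_model`, its coefficients, and the input values are assumptions made for illustration. The sketch shows two properties relevant to the caveats above: the attributions sum to the difference between the prediction and the baseline prediction, and the credit for an interaction term is shared between the participating features rather than assigned to one of them.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features absent from a coalition take their baseline value. The cost
    grows as 2**n, so this is only practical for a handful of features.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(point)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


def toy_model(p):
    """Hypothetical surrogate with an ash x C0 interaction; SSA has no effect."""
    return 0.5 * p["ash"] + 0.2 * p["c0"] + 0.01 * p["ash"] * p["c0"]


x = {"ash": 20.0, "c0": 100.0, "ssa": 50.0}
baseline = {"ash": 10.0, "c0": 50.0, "ssa": 30.0}
phi = shapley_values(toy_model, x, baseline)
gap = toy_model(x) - toy_model(baseline)  # equals the sum of the attributions
```

Because the attribution depends on the chosen baseline and on how absent features are filled in, the resulting values describe the model's behavior and not a physical mechanism.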

3.5. Correlation Analysis

To further examine interactions among influential variables, multivariate response surface analyses were conducted based on the trained XGB model. For each target variable, the two most influential input features identified from the feature importance analysis were selected to construct three-dimensional response surfaces, as shown in Figure 5.
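A two-feature response surface of this kind can be sketched as a partial dependence computation: two features are forced to each pair of grid values in turn, the remaining features keep their observed values, and the model predictions are averaged over the dataset. The surrogate `toy_yield`, its coefficients, and the two data rows below are hypothetical stand-ins for the trained XGB model and the compiled dataset.

```python
def partial_dependence_2d(model, data, feat_a, grid_a, feat_b, grid_b):
    """Average prediction over `data` with two features forced to grid values."""
    surface = []
    for a in grid_a:
        row = []
        for b in grid_b:
            preds = [model({**rec, feat_a: a, feat_b: b}) for rec in data]
            row.append(sum(preds) / len(preds))
        surface.append(row)
    return surface


def toy_yield(p):
    """Hypothetical surrogate: yield falls with T and Rt, rises slightly with VM."""
    return 60.0 - 0.05 * (p["T"] - 300) - 0.02 * p["Rt"] + 0.1 * p["VM"]


# Hypothetical samples; only VM is kept at its observed value in each row
data = [
    {"T": 400, "Rt": 30, "VM": 70.0},
    {"T": 600, "Rt": 60, "VM": 80.0},
]
surface = partial_dependence_2d(
    toy_yield, data, "T", [300, 550, 800], "Rt", [30, 60]
)
# surface[i][j] is the averaged prediction at the i-th T and j-th Rt grid value
```

Every grid point is evaluated for every sample, including combinations that never occur together in the data, which is one reason such surfaces should be read as model-estimated trends.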

For biochar yield, the response surface exhibits a pronounced decreasing trend with increasing T. When the T increases from approximately 300 °C to 800 °C, the yield decreases by more than 40%. Higher yields are observed in the low-temperature range (<450 °C), whereas this effect rapidly weakens under high-temperature conditions, indicating that T plays a dominant role in controlling biochar yield [38].

The response surface of Ash shows that biochar ash is primarily governed by the Ash_biomass of the feedstock, with FC acting as a secondary modulating factor. As Ash_biomass increases from low to high levels, Ash_char increases by more than threefold. Under high-ash conditions, variations in FC exert only a minor influence on Ash_char (<10%), whereas under low-ash conditions, the Ash_char remains low even when FC is relatively high. These results further confirm that inorganic mineral inheritance is the dominant mechanism governing Ash formation [39]. For CEC, the response surface reveals a clear synergistic effect between feedstock C and N contents. When both C_biomass and N_biomass are at high levels, the predicted CEC increases by more than twofold compared to low C/N conditions, indicating that the C/N composition jointly determines the formation of surface functional groups. The response surface of biochar pH highlights the interaction between T and Rt. When T exceeds approximately 600 °C and the Rt is longer than 40 min, biochar pH rapidly increases to values above 10. In contrast, under low-temperature and short-residence conditions, biochar pH generally remains below 8. This trend reflects that biochar surfaces are more likely to exhibit alkalinity under high pyrolysis severity [27].

For C retention, the response surface reveals a coupled effect between feedstock C_biomass and Ash_biomass. Under low-ash conditions, increasing C_biomass leads to a substantial increase in C_char (up to approximately 25 wt.%), whereas under high-ash conditions, the regulatory effect of C_biomass on C_char is significantly weakened, indicating that excessive inorganic minerals limit organic carbon retention. H content exhibits strong temperature sensitivity: at T below approximately 450 °C, higher H_biomass increases H_char by about 0.8 wt.%, whereas when T exceeds 600 °C, H_char rapidly declines to below 0.5 wt.% [40]. The N response surface indicates that N retention is jointly controlled by feedstock N content and FC. Under low FC conditions, even high N_biomass results in only limited increases in N_char (<0.3 wt.%). In contrast, under high FC conditions, N_char increases markedly with N_biomass and can exceed 1.0 wt.%, suggesting that part of the N becomes stably embedded within aromatic carbon structures [41]. O content is mainly determined by T and O_biomass. At low temperatures (<400 °C), O_char typically exceeds 10 wt.%, whereas when T exceeds 600 °C, O_char rapidly decreases to below 5 wt.%, further confirming the dominant role of T in O retention.

For q_e, the response surface results demonstrate that feedstock Ash_biomass and C₀ jointly control q_e. As C₀ increases from low to high values, q_e increases by approximately 3–4 times across all ash levels. Meanwhile, biochars derived from high-ash feedstocks consistently exhibit higher q_e values under the same C₀, exceeding 120 mg g⁻¹ at high concentrations. These results indicate that q_e is governed by the synergistic effects of mineral composition and external concentration driving force rather than by a single parameter, and that high-ash biochars exhibit relatively better adsorption performance [42].

Ash-related variables should be interpreted as proxies for the mineral fraction and alkalinity of biochar rather than as a single-direction adsorption modifier. Recent studies have demonstrated that high-ash biochars, especially those derived from manure or mineral-rich feedstocks, can enhance apparent Pb(II) removal through mineral-induced precipitation pathways, including the formation of lead carbonate and lead phosphate phases [43,44]. These processes are often accompanied by an increase in equilibrium pH, which shifts metal speciation and promotes hydroxide precipitation under batch conditions [45]. Conversely, elevated ash content may also suppress carbon-surface–mediated adsorption by partially occluding pore structures or diluting oxygen-containing functional groups within the carbon matrix [46]. Therefore, the ash effect observed in the present feature importance, SHAP, and response surface analyses likely reflects the combined contributions of mineral-driven precipitation and carbon-surface complexation mechanisms rather than a unidirectional adsorption trend.

3.6. Interface Design

In this study, a prediction platform based on a graphical user interface (GUI) was developed to visually present the outputs of the established XGB models (detailed interface information is provided in the Supplementary Materials). The platform integrates the trained machine learning models into an interactive visualization environment, enabling users to intuitively obtain predictions of the physicochemical properties and adsorption performance of biochar.

To further evaluate the applicability and reliability of the platform, multiple sets of experimental data reported in the literature were selected as external validation samples (see Table 2). By inputting the feedstock compositions, pyrolysis conditions, and adsorption experimental parameters reported in the literature into the platform, corresponding prediction results were automatically generated and compared with the experimental measurements. The absolute deviations in the prediction of biochar elemental composition (C, H, N, and O) are generally within 0.3 wt.%, while the absolute deviation for ash content prediction is less than 0.2 wt.%. The predicted biochar yield shows a consistent trend with the experimental results, with absolute deviations of approximately 2–9 wt.% and relatively larger errors in the high-yield range. The results demonstrate good agreement between the predicted values and the experimental data, supporting the accuracy of the developed prediction platform and its feasibility for practical applications.
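The comparison described above reduces to computing an absolute deviation per target and checking it against a tolerance. The sketch below uses hypothetical predicted and measured values (they are not the entries of Table 2) together with the 0.3 wt.% tolerance quoted for elemental composition; the function names are illustrative.

```python
def absolute_deviations(predicted, measured):
    """Absolute deviation |predicted - measured| for each reported target."""
    return {name: abs(predicted[name] - measured[name]) for name in measured}


def exceeds_tolerance(deviations, tolerance):
    """Names of targets whose deviation is larger than the allowed tolerance."""
    return sorted(name for name, dev in deviations.items() if dev > tolerance)


# Hypothetical values in wt.%, for illustration only
measured = {"C": 68.4, "H": 2.9, "N": 1.1, "O": 14.2}
predicted = {"C": 68.6, "H": 3.0, "N": 1.0, "O": 14.7}

deviations = absolute_deviations(predicted, measured)
flagged = exceeds_tolerance(deviations, tolerance=0.3)  # targets needing a closer look
```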

Overall, this machine learning–based visual and interactive platform provides an efficient and convenient tool for evaluating biochar production and adsorption performance. It helps reduce experimental workload and offers data-driven decision support for the design and optimization of pyrolysis processes. It should be noted that the GUI reflects model predictions within the domain supported by the training dataset. Predictions made under sparsely represented or extrapolated conditions may exhibit higher uncertainty, and therefore the platform should be used as a decision-support tool rather than a definitive experimental substitute.
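One simple way to flag extrapolated inputs before a prediction is shown is a per-feature range check against the training data. This is a sketch under stated assumptions and is not described as part of the platform: the feature names and training rows are hypothetical, and a range check cannot detect sparsely populated regions that lie inside the observed ranges.

```python
def fit_ranges(training_rows):
    """Record the minimum and maximum of every feature seen in training."""
    ranges = {}
    for row in training_rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def out_of_domain(query, ranges):
    """Return the features of `query` that fall outside the training ranges."""
    flagged = []
    for name, value in query.items():
        if name not in ranges:
            continue  # unknown features are ignored in this sketch
        low, high = ranges[name]
        if not low <= value <= high:
            flagged.append(name)
    return flagged


# Hypothetical training rows (T in °C, Rt in min, Ash in wt.%)
training_rows = [
    {"T": 300, "Rt": 30, "Ash": 2.5},
    {"T": 550, "Rt": 60, "Ash": 12.0},
    {"T": 800, "Rt": 120, "Ash": 25.0},
]
ranges = fit_ranges(training_rows)
inside = out_of_domain({"T": 500, "Rt": 45, "Ash": 10.0}, ranges)
outside = out_of_domain({"T": 900, "Rt": 45, "Ash": 40.0}, ranges)
```

A query that returns a non-empty list would be a candidate for a warning in the interface or for experimental confirmation.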

4. Limitations and Future Perspectives

This study is subject to several limitations that should be considered when interpreting the results and applying the proposed framework.

First, the dataset was compiled from heterogeneous literature sources. Although duplicate records were removed, multiple samples originating from the same study may share similar feedstock characteristics, pyrolysis conditions, and experimental protocols. Such study-specific patterns may influence model learning and limit generalizability. Therefore, predictive performance should be interpreted within the experimental context represented in the compiled dataset.

Second, not all adsorption-related experimental parameters were consistently reported across literature sources. While initial solution concentration, pH, and temperature were included as input variables, parameters such as sorbent dosage and contact time were not uniformly available and therefore could not be incorporated into the model. Because adsorption capacity is sensitive to equilibrium conditions and protocol design, the absence of these variables may introduce hidden variability and affect mechanistic interpretation.
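The sensitivity to sorbent dosage follows from the standard batch mass balance, q_e = (C₀ − C_e)·V/m, where C_e is the equilibrium concentration, V the solution volume, and m the sorbent mass. This section does not restate how q_e was calculated in the source studies, so the use of this definition is an assumption, and the numbers below are hypothetical. The sketch shows that a test with a higher dosage can remove a larger share of the metal while reporting a lower q_e.

```python
def adsorption_capacity(c0_mg_l, ce_mg_l, volume_l, mass_g):
    """q_e in mg/g from the batch mass balance q_e = (C0 - Ce) * V / m."""
    if mass_g <= 0:
        raise ValueError("sorbent mass must be positive")
    return (c0_mg_l - ce_mg_l) * volume_l / mass_g


def removal_fraction(c0_mg_l, ce_mg_l):
    """Fraction of the initial metal concentration removed from solution."""
    return (c0_mg_l - ce_mg_l) / c0_mg_l


# Hypothetical batch tests: same C0 and volume, different sorbent dosage
low_dose_qe = adsorption_capacity(100.0, 40.0, volume_l=0.05, mass_g=0.05)
high_dose_qe = adsorption_capacity(100.0, 20.0, volume_l=0.05, mass_g=0.10)
low_dose_removal = removal_fraction(100.0, 40.0)
high_dose_removal = removal_fraction(100.0, 20.0)
```

Two records with identical C₀, pH, and temperature can therefore carry different q_e values purely because of dosage, which the model cannot resolve when dosage is absent from the inputs.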

In addition, although initial solution pH was included, final equilibrium pH was not consistently reported. Since biochar can alter solution alkalinity during batch adsorption experiments, pH drift may influence metal speciation and precipitation behavior. Consequently, the model captures condition-dependent removal performance but cannot disentangle surface adsorption from precipitation-driven contributions. Ash-related variables may partially encode such mineral-induced pH effects, and therefore should not be interpreted as purely structural adsorption modifiers.

Third, the dataset primarily consisted of Cu²⁺ and Pb²⁺ adsorption data under single-solute conditions, and metal identity was not included as an explicit model input variable. While this improves internal consistency, it may limit extrapolation to other metal species with distinct adsorption mechanisms.

From a methodological perspective, Pearson correlation analysis was employed as an exploratory screening tool to assess pairwise linear associations. However, Pearson correlation does not capture nonlinear dependencies, and more advanced multicollinearity diagnostics (e.g., variance inflation factor analysis) were not systematically conducted. Moreover, proximate and ultimate analysis variables are compositional in nature and constrained by summation relationships, which may introduce intrinsic collinearity beyond simple pairwise correlations. Therefore, feature importance rankings and SHAP-based interpretations should be understood within the structural constraints of the dataset.
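The collinearity concern can be illustrated with a minimal sketch that computes a Pearson correlation and the corresponding variance inflation factor (VIF). The proximate-analysis values are hypothetical, and the formula VIF = 1/(1 − r²) is the special case of two predictors; with more predictors, each one would be regressed on all the others. This diagnostic was not part of the study and is shown only to make the limitation concrete.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    if abs(r) >= 1.0:
        return float("inf")  # perfectly collinear predictors
    return 1.0 / (1.0 - r * r)


# Hypothetical proximate analysis (wt.%); each sample sums to 100
vm = [80.0, 75.0, 70.0, 65.0, 60.0]
fc = [15.0, 19.0, 24.0, 30.0, 33.0]
ash = [5.0, 6.0, 6.0, 5.0, 7.0]

closure = [v + f + a for v, f, a in zip(vm, fc, ash)]
r_vm_fc = pearson_r(vm, fc)
vif_vm_fc = vif_two_predictors(vm, fc)
```

Because VM, FC, and Ash are constrained to a constant sum, any one of them is determined by the other two, so strong negative correlations of this kind arise by construction and not only from the chemistry of the feedstock.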

Furthermore, response surface analyses based on partial dependence plots reflect model-estimated trends under the joint distribution of the training data. When input features covary due to experimental or material constraints, the apparent optimal regions represent feasible parameter ranges inferred from statistical associations rather than independently verified optimal conditions. These surfaces do not account for thermodynamic constraints, kinetic limitations, or process-scale engineering considerations.

Overall, the analyses presented in this study reveal statistical associations and model-driven explanatory patterns rather than definitive causal mechanisms. Distinguishing between correlation, model attribution, and physical causality remains an important direction for future research. Future work incorporating more comprehensive experimental descriptors, expanded metal species coverage, and more rigorous validation strategies would further enhance model robustness and mechanistic interpretability.

In practical applications, the proposed framework should therefore be regarded as a data-driven decision-support tool operating within the domain represented in the training dataset. Predictions made under sparsely represented or extrapolated conditions may involve increased uncertainty and should be validated experimentally.

5. Conclusions

This study systematically compared three ensemble learning algorithms (RF, GBR, and XGB) in predicting biochar yield, key physicochemical properties, and adsorption capacity. The results demonstrate that the XGB model outperforms the other models in terms of both prediction accuracy and generalization capability, achieving excellent performance in predicting yield (R² ≈ 0.92), CEC (R² ≈ 0.89), and q_e (R² ≈ 0.98). Correlation analysis, feature importance ranking, SHAP interpretation, and three-dimensional response surface analysis collectively revealed that the dominant controlling factors vary significantly across different target variables. T was identified as the most critical variable governing biochar yield, whereas Ash played a dominant role in determining SSA and q_e. In contrast, the formation of CEC was mainly controlled by the synergistic effects of feedstock C and N contents, while biochar pH was primarily determined by the interaction between pyrolysis T and Rt. These results indicate that the adsorption performance of biochar is not dictated by a single parameter, but rather arises from the synergistic interplay among feedstock composition, pyrolysis conditions, and the resulting physicochemical properties. Based on the XGB model, an interpretable “process–performance” mapping was established, and inverse optimization was further implemented to enable targeted regulation of biochar adsorption performance. Meanwhile, a Streamlit-based interactive prediction platform was developed, integrating model prediction, result visualization, and process optimization functionalities, thereby providing an effective guidance strategy for the optimization of pyrolysis process parameters.

From a practical perspective, the proposed data-driven framework provides a quantitative decision-support tool for researchers and process engineers to rapidly evaluate feedstock–process combinations prior to experimental implementation. By reducing trial-and-error experimentation and enabling pre-screening of parameter spaces, the model can improve experimental efficiency and resource utilization. Nevertheless, the platform is intended to complement rather than replace experimental validation, particularly when extrapolating beyond the domain represented in the training dataset.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/f17030326/s1, Table S1: Bayesian optimizer parameters; Figure S1: Schematic illustration of the developed biochar adsorption performance prediction platform. The platform is developed based on a graphical user interface (GUI) and mainly consists of two functional modules: Prediction and Optimization. In the Prediction module, users can input biomass feedstock composition (C, H, N, O, VM, Ash, and FC) and pyrolysis conditions (T and Rt) either manually or by uploading a CSV file. The platform then invokes the trained XGB model to generate real-time predictions of biochar adsorption performance. In the Optimization module, given the biomass feedstock composition, the platform automatically recommends optimal T and Rt by integrating model inversion and optimization algorithms, thereby providing intuitive decision support for process parameter design and optimization.

Author Contributions

Conceptualization, X.B. and X.H.; methodology, X.H.; Software, X.H.; validation, X.B., Y.Y. and X.H.; formal analysis, X.H.; investigation, W.L. and D.X.; resources, X.B.; data curation, X.H.; writing—original draft preparation, X.H.; writing—review and editing, X.B. and Y.Y.; supervision, X.B.; project administration, X.B.; funding acquisition, X.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (52206229).

Data Availability Statement

The data presented in this study are available upon request from the author.

Acknowledgments

The authors gratefully acknowledge the support from the Beijing Forestry University; the State Key Laboratory of Efficient Production of Forest Resources; and the Key Laboratory of National Forestry and Grassland Administration on Forestry Equipment and Automation.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

FC	Fixed carbon content
VM	Volatile matter content
Ash	Ash content
C	Carbon content
H	Hydrogen content
O	Oxygen content
N	Nitrogen content
Rt	Residence time
T	Pyrolysis temperature
Yield	Biochar yield
SSA	Specific Surface Area
CEC	Cation exchange capacity

References

Wan, J.; Liu, L.; Ayub, K.S.; Zhang, W.; Shen, G.; Hu, S.; Qian, X. Characterization and adsorption performance of biochars derived from three key biomass constituents. Fuel 2020, 269, 117142.
Zhang, P.; Zhang, X.; Yuan, X.; Xie, R.; Han, L. Characteristics, adsorption behaviors, Cu(II) adsorption mechanisms by cow manure biochar derived at various pyrolysis temperatures. Bioresour. Technol. 2021, 331, 125013.
Cho, S.-K.; Igliński, B.; Kumar, G. Biomass based biochar production approaches and its applications in wastewater treatment, machine learning and microbial sensors. Bioresour. Technol. 2024, 391, 129904.
Yang, H.; Liu, X.; Liu, Y.; Cui, J.; Xiao, Y. Revolutionizing biochar synthesis for enhanced heavy metal adsorption: Harnessing machine learning and Bayesian optimization. J. Environ. Chem. Eng. 2023, 11, 110593.
Wang, W.; Chang, J.-S.; Lee, D.-J. Machine learning applications for biochar studies: A mini-review. Bioresour. Technol. 2024, 394, 130291.
Jaffari, Z.H.; Abbas, A.; Kim, C.-M.; Shin, J.; Kwak, J.; Son, C.; Lee, Y.-G.; Kim, S.; Chon, K.; Cho, K.H. Transformer-based deep learning models for adsorption capacity prediction of heavy metal ions toward biochar-based adsorbents. J. Hazard. Mater. 2024, 462, 132773.
Su, G.; Jiang, P. Machine learning models for predicting biochar properties from lignocellulosic biomass torrefaction. Bioresour. Technol. 2024, 399, 130519.
Liu, B.; Xi, F.; Zhang, H.; Peng, J.; Sun, L.; Zhu, X. Coupling machine learning and theoretical models to compare key properties of biochar in adsorption kinetics rate and maximum adsorption capacity for emerging contaminants. Bioresour. Technol. 2024, 402, 130776.
Shen, T.; Peng, H.; Yuan, X.; Liang, Y.; Liu, S.; Wu, Z.; Leng, L.; Qin, P. Feature engineering for improved machine-learning-aided studying heavy metal adsorption on biochar. J. Hazard. Mater. 2024, 466, 133442.
Divine, D.C.; Hubert, S.; Epelle, E.I.; Ojo, A.U.; Adeleke, A.A.; Ogbaga, C.C.; Akande, O.; Okoye, P.U.; Giwa, A.; Okolie, J.A. Enhancing biomass Pyrolysis: Predictive insights from process simulation integrated with interpretable Machine learning models. Fuel 2024, 366, 131346.
Leng, L.; Zheng, H.; Shen, T.; Wu, Z.; Xiong, T.; Liu, S.; Cao, J.; Peng, H.; Zhan, H.; Li, H. Engineering biochar from biomass pyrolysis for effective adsorption of heavy metal: An innovative machine learning approach. Sep. Purif. Technol. 2024, 361, 131592.
Sudhakar, U.; Prabhu, P.; Naveen, K.; Singh, C.J.; Kumar, K.P.; Harinadh, V.; Hobicho, D.L.; Rupesh, G. Precision biochar yield forecasting employing random forest and XGBoost with Taylor diagram visualization. Sci. Rep. 2025, 15, 7105.
Habib, U.; Sangar, K.; Chen, B.; Asfandyar, S.; Luqman, R.; Wu, N. Machine learning approach to predict adsorption capacity of Fe-modified biochar for selenium. Carbon Res. 2023, 2, 29.
Su, S.; Wang, J. Machine learning prediction of contents of oxygenated components in bio-oil using extreme gradient boosting method under different pyrolysis conditions. Bioresour. Technol. 2023, 379, 129040.
Kanthasamy, R.; Almatrafi, E.; Ali, I.; Sait, H.H.; Zwawi, M.; Abnisa, F.; Peng, L.C.; Ayodele, B.V. Biochar production from valorization of agricultural Wastes: Data-Driven modelling using Machine learning algorithms. Fuel 2023, 351, 128948.
Palansooriya, K.N.; Li, J.; Dissanayake, P.D.; Suvarna, M.; Li, L.; Yuan, X.; Sarkar, B.; Tsang, D.C.W.; Rinklebe, J.; Wang, X.; et al. Prediction of Soil Heavy Metal Immobilization by Biochar Using Machine Learning. Environ. Sci. Technol. 2022, 56, 4187–4198.
Abouzari, M.; Pahlavani, P.; Izaditame, F.; Bigdeli, B. Estimating the chemical oxygen demand of petrochemical wastewater treatment plants using linear and nonlinear statistical models—A case study. Chemosphere 2021, 270, 129465.
Li, H.; Ai, Z.; Yang, L.; Zhang, W.; Yang, Z.; Peng, H.; Leng, L. Machine learning assisted predicting and engineering specific surface area and total pore volume of biochar. Bioresour. Technol. 2023, 369, 128417.
Li, X.; Zhang, X.; Zhang, J.; Gu, J.; Zhang, S.; Li, G.; Shao, J.; He, Y.; Yang, H.; Zhang, S.; et al. Applied machine learning to analyze and predict CO₂ adsorption behavior of metal-organic frameworks. Carbon Capture Sci. Technol. 2023, 9, 100146.
Călin, A.D.; Coroiu, A.M.; Mureşan, H.B. Analysis of Preprocessing Techniques for Missing Data in the Prediction of Sunflower Yield in Response to the Effects of Climate Change. Appl. Sci. 2023, 13, 7415.
Shubham, Y.; Priyanshu, R.; Paramasivan, B.; Liu, C.; Li, F.; Zhang, P. Machine learning-driven prediction of biochar adsorption capacity for effective removal of Congo red dye. Carbon Res. 2025, 4, 11.
Li, F.; Lu, Q.; He, Y.; Zheng, C.; Wang, X. Data-Driven Analysis of the Effectiveness of Water Control Measures in Offshore Horizontal Wells. Processes 2026, 14, 88.
Huff, M.D.; Marshall, S.; Saeed, H.A.; Lee, J.W. Surface oxygenation of biochar through ozonization for dramatically enhancing cation exchange capacity. Bioresour. Bioprocess. 2018, 5, 18.
Hassan, R.; Behtouei, Z.; Baghban, A. Advanced machine learning for precise prediction of biochar’s heavy metal sorption efficiency. J. Hazard. Mater. Adv. 2025, 18, 100739.
Patel, M.R.; Panwar, N.L. Biochar from agricultural crop residues: Environmental, production, and life cycle assessment overview. Resour. Conserv. Recycl. Adv. 2023, 19, 200173.
Yue, L.; Zhang, X.; Wang, Y.; Li, W.; Tang, Y.; Bai, Y. Cellulose nanocomposite modified conductive self-healing hydrogel with enhanced mechanical property. Eur. Polym. J. 2021, 146, 110258.
Song, Y.; Huang, Z.; Jin, M.; Liu, Z.; Wang, X.; Hou, C.; Zhang, X.; Shen, Z.; Zhang, Y. Machine learning prediction of biochar physicochemical properties based on biomass characteristics and pyrolysis conditions. J. Anal. Appl. Pyrolysis 2024, 181, 106596.
Guo, H.-n.; Wu, S.-b.; Tian, Y.-j.; Zhang, J.; Liu, H.-t. Application of machine learning methods for the prediction of organic solid waste treatment and recycling processes: A review. Bioresour. Technol. 2021, 319, 124114.
Leng, L.; Yang, L.; Lei, X.; Zhang, W.; Ai, Z.; Yang, Z.; Zhan, H.; Yang, J.; Yuan, X.; Peng, H.; et al. Machine learning predicting and engineering the yield, N content, and specific surface area of biochar derived from pyrolysis of biomass. Biochar 2022, 4, 63.
Meng, K.; Dong, Y.; Liu, J.; Xie, J.; Jin, Q.; Lu, Y.; Lin, H. Advances in selective heavy metal removal from water using biochar: A comprehensive review of mechanisms and modifications. J. Environ. Chem. Eng. 2025, 13, 116099.
Zhang, X.; Zhang, P.; Yuan, X.; Li, Y.; Han, L. Effect of pyrolysis temperature and correlation analysis on the yield and physicochemical properties of crop residue biochar. Bioresour. Technol. 2020, 296, 122318.
Ma, J.; Zhang, S.; Liu, X.; Wang, J. Machine learning prediction of biochar yield based on biomass characteristics. Bioresour. Technol. 2023, 389, 129820.
Najar, I.; Rasool, T. Optimization of multi stage Co-pyrolysis process using municipal solid waste and sawdust blends: A hybrid approach using iso-conversional modeling and machine learning. J. Indian Chem. Soc. 2025, 102, 101605.
Zhang, H.; Liu, M.; Yang, Y.; Chen, W.; Zhu, J.; Zhang, S.; Yang, H.; Chen, H.; Chen, Y. Mechanism study on the interaction between holocellulose and lignin during secondary pyrolysis of biomass: In terms of molecular model compounds. Fuel Process. Technol. 2023, 244, 107701.
Wang, J.; Wang, S. Preparation, modification and environmental application of biochar: A review. J. Clean. Prod. 2019, 227, 1002–1022. [Google Scholar] [CrossRef]
Ippolito, J.A.; Cui, L.; Kammann, C.; Wrage-Mönnig, N.; Estavillo, J.M.; Fuertes-Mendizabal, T.; Cayuela, M.L.; Sigua, G.; Novak, J.; Spokas, K.; et al. Feedstock choice, pyrolysis temperature and type influence biochar characteristics: A comprehensive meta-data analysis review. Biochar 2020, 2, 421–438. [Google Scholar] [CrossRef]
Zhu, X.; Wang, X.; Ok, Y.S. The application of machine learning methods for prediction of metal sorption onto biochars. J. Hazard. Mater. 2019, 378, 120727. [Google Scholar] [CrossRef]
Subramanian, P.; Pandian, K.; Pakkiyam, S.; veni Dhanuskodi, K.; Annamalai, S.; Chidambaram, P.P.; Mustaffa, M.R.A.F. Biochar for heavy metal cleanup in soil and water: A review. Biomass Convers. Biorefin. 2025, 15, 11421–11441. [Google Scholar] [CrossRef]
He, D.; Luo, Y.; Zhu, B. Feedstock and pyrolysis temperature influence biochar properties and its interactions with soil substances: Insights from a DFT calculation. Sci. Total Environ. 2024, 922, 171259. [Google Scholar] [CrossRef]
Zhang, X.; Zhao, B.; Liu, H.; Zhao, Y.; Li, L. Effects of pyrolysis temperature on biochar’s characteristics and speciation and environmental risks of heavy metals in sewage sludge biochars. Environ. Technol. Innov. 2022, 26, 102288. [Google Scholar] [CrossRef]
Fang, Y.; Yang, L.; Rao, F.; Zheng, Y.; Song, Z. Adsorption behavior and mechanism of MB, Pb(II) and Cu(II) on porous geopolymers. Ceram. Int. 2025, 51, 11455–11466. [Google Scholar] [CrossRef]
Yuan, J.-H.; Xu, R.-K.; Zhang, H. The forms of alkalis in the biochar produced from crop residues at different temperatures. Bioresour. Technol. 2011, 102, 3488–3497. [Google Scholar] [CrossRef] [PubMed]
Sun, B.; Pang, J.; Shi, X.; Zhao, Y.; Sun, A.; Guo, Y.; Cao, M.; Zheng, Y.; Gu, X. High-performance magnetic biochar prepared via acid and mg/Fe Co-modification for ultraefficient Pb (II) adsorption. iScience 2025, 28, 113266. [Google Scholar] [CrossRef] [PubMed]
Wu, J.; Wang, T.; Wang, J.; Zhang, Y.; Pan, W.-P. A novel modified method for the efficient removal of Pb and Cd from wastewater by biochar: Enhanced the ion exchange and precipitation capacity. Sci. Total Environ. 2021, 754, 142150. [Google Scholar] [CrossRef]
Choi, M.-Y.; Lee, C.-G.; Yoon, Y.-M.; Park, S.-J. Kenaf Biochar as an Eco-Friendly Adsorbent for Removal of Cu(II) and Pb(II): Optimal Temperature, Adsorption Models, and Efficiency Evaluation. Appl. Organomet. Chem. 2024, 38, 7672. [Google Scholar] [CrossRef]
Jiao, Z.; Zhang, S.; Wang, Y.; Dong, Z.; Lu, Z. Influence of fly ash content on pore structure regulation in alkali-activated slag under alkaline conditions. Constr. Build. Mater. 2025, 485, 141863. [Google Scholar] [CrossRef]

Figure 1. PCC analysis of inputs and outputs for (a) Data 1 and (b) Data 2.

Figure 2. Single-target and multi-target prediction results based on the XGB model: (a) C content in biochar; (b) H content in biochar; (c) N content in biochar; (d) O content in biochar; (e) ash content in biochar; (f) biochar pH; (g) specific surface area (SSA) of biochar; (h) cation exchange capacity (CEC) of biochar; (i) biochar yield; (j) biochar adsorption performance.

Figure 3. Feature importance analysis for different target variables: (a) C content in biochar; (b) H content in biochar; (c) N content in biochar; (d) O content in biochar; (e) ash content in biochar; (f) biochar pH; (g) specific surface area (SSA) of biochar; (h) cation exchange capacity (CEC) of biochar; (i) biochar yield; (j) biochar adsorption performance.

Figure 4. SHAP-based interpretability analysis for different target variables: (a) C content in biochar; (b) H content in biochar; (c) N content in biochar; (d) O content in biochar; (e) ash content in biochar; (f) biochar pH; (g) specific surface area (SSA) of biochar; (h) cation exchange capacity (CEC) of biochar; (i) biochar yield; (j) biochar adsorption performance.

Figure 5. Partial dependence plots (PDP) and multi-correlation analysis for different target variables: (a) C content in biochar; (b) H content in biochar; (c) N content in biochar; (d) O content in biochar; (e) ash content in biochar; (f) biochar pH; (g) cation exchange capacity (CEC) of biochar; (h) biochar yield; (i) biochar adsorption performance.

Table 1. Comparison of prediction performance among three models for different biochar characteristics.

Model	Target	Train R²	Test R²	Test RMSE	Test MAE	Test MAPE (%)	5-Fold CV
RF	Yield	0.98	0.89	0.08	0.06	18.55	0.81 ± 0.04
RF	SSA	0.96	0.84	0.11	0.07	79.40	0.69 ± 0.04
RF	CEC	0.93	0.83	0.04	0.02	59.45	0.50 ± 0.18
RF	q_e	0.98	0.96	0.04	0.02	47.48	0.88 ± 0.04
RF	Multi-target	0.97	0.82	0.06	0.04	42.50	not reported
GBR	Yield	0.95	0.92	0.07	0.05	17.06	0.86 ± 0.04
GBR	SSA	0.98	0.83	0.11	0.07	50.71	0.67 ± 0.06
GBR	CEC	0.99	0.84	0.04	0.03	43.40	0.57 ± 0.17
GBR	q_e	0.99	0.98	0.03	0.02	41.77	0.92 ± 0.05
GBR	Multi-target	0.94	0.82	0.06	0.04	50.07	not reported
XGB	Yield	0.95	0.92	0.07	0.05	13.63	0.86 ± 0.04
XGB	SSA	0.99	0.91	0.08	0.05	33.03	0.61 ± 0.11
XGB	CEC	0.99	0.89	0.03	0.02	36.60	0.61 ± 0.22
XGB	q_e	0.99	0.99	0.02	0.01	44.27	0.92 ± 0.04
XGB	Multi-target	0.99	0.83	0.06	0.39	35.51	not reported

Note: RMSE values have the same units as their corresponding predicted variables.
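
For readers who want to reproduce the kind of metrics listed in Table 1, the following is a minimal sketch of the standard definitions of R², RMSE, MAE, and MAPE. It is not the authors' code; the function name `regression_metrics` and the sample values are hypothetical and chosen only for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE and MAPE (%) for paired observations.

    Standard textbook definitions; not taken from the article's code.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "mape": 100.0 * sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
    }


if __name__ == "__main__":
    # Hypothetical normalized targets and predictions, for illustration only.
    demo_true = [0.20, 0.35, 0.50, 0.80]
    demo_pred = [0.22, 0.30, 0.55, 0.78]
    for name, value in regression_metrics(demo_true, demo_pred).items():
        print(f"{name}: {value:.4f}")
```

Because MAPE divides each error by the true value, targets with values close to zero can show a large MAPE even when RMSE and MAE are small.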

Table 2. Platform prediction results: actual and predicted values of biochar properties and yield.

Type	C (wt.%)	H (wt.%)	N (wt.%)	O (wt.%)	Ash (wt.%)	pH	Yield (wt.%)
Actual	52.57	5.42	0.95	31.55	3.13	6.9	69.87
Actual	68.24	2.99	0.85	17.13	5.06	9.1	31.42
Actual	72.72	2.39	0.53	12.7	6.35	11.1	22.6
Actual	74.2	1.30	0.41	12.32	6.63	11.4	20.78
Predicted	52.66	5.42	0.92	31.50	3.13	6.9	78.58
Predicted	68.08	2.98	1.17	17.26	5.22	9.1	34.16
Predicted	73.01	2.38	0.83	12.66	6.18	11.1	24.84
Predicted	74.10	1.30	0.74	12.29	6.69	11.4	24.56
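
The agreement between the actual and predicted values in Table 2 can be summarized per target. The sketch below computes the mean absolute error and the mean absolute percentage error for each column. It assumes that the i-th "Actual" row corresponds to the i-th "Predicted" row, which the table layout suggests but does not state. The helper name `per_target_errors` is hypothetical.

```python
# Values transcribed from Table 2. Pairing the i-th "Actual" row with the
# i-th "Predicted" row is an assumption about the table layout.
TARGETS = ["C", "H", "N", "O", "Ash", "pH", "Yield"]

ACTUAL = [
    [52.57, 5.42, 0.95, 31.55, 3.13, 6.9, 69.87],
    [68.24, 2.99, 0.85, 17.13, 5.06, 9.1, 31.42],
    [72.72, 2.39, 0.53, 12.7, 6.35, 11.1, 22.6],
    [74.2, 1.30, 0.41, 12.32, 6.63, 11.4, 20.78],
]

PREDICTED = [
    [52.66, 5.42, 0.92, 31.50, 3.13, 6.9, 78.58],
    [68.08, 2.98, 1.17, 17.26, 5.22, 9.1, 34.16],
    [73.01, 2.38, 0.83, 12.66, 6.18, 11.1, 24.84],
    [74.10, 1.30, 0.74, 12.29, 6.69, 11.4, 24.56],
]


def per_target_errors(actual, predicted, names):
    """Return {name: (mean absolute error, mean absolute percentage error)}."""
    summary = {}
    for col, name in enumerate(names):
        abs_errs = [abs(a[col] - p[col]) for a, p in zip(actual, predicted)]
        pct_errs = [
            100.0 * abs(a[col] - p[col]) / abs(a[col])
            for a, p in zip(actual, predicted)
        ]
        summary[name] = (
            sum(abs_errs) / len(abs_errs),
            sum(pct_errs) / len(pct_errs),
        )
    return summary


ERRORS = per_target_errors(ACTUAL, PREDICTED, TARGETS)

if __name__ == "__main__":
    for name, (mae, mape) in ERRORS.items():
        print(f"{name:>5}: MAE = {mae:.3f}, MAPE = {mape:.2f}%")
```

Under that pairing assumption, yield shows the largest absolute deviation of the seven columns and N content the largest relative deviation, while the pH values match exactly.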

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

