A Generalized Method for Modeling the Adsorption of Heavy Metals with Machine Learning Algorithms

Abstract: Applications of machine learning algorithms (MLAs) to modeling the adsorption efficiencies of different heavy metals have been limited by the adsorbate–adsorbent pair and the selection of specific MLAs. In the current study, adsorption efficiencies of fourteen heavy metal–adsorbent (HM-AD) pairs were modeled with a variety of ML models, namely support vector regression with polynomial and radial basis function kernels, random forest (RF), stochastic gradient boosting, and Bayesian additive regression tree (BART). The wet experiment-based actual measurements were supplemented with synthetic data samples. The first batch of dry experiments was performed to model the removal efficiency of an HM with a specific AD. The ML modeling was then implemented on the whole dataset to develop a generalized model. A ten-fold cross-validation method was used for model selection, while the comparative performance of the MLAs was evaluated with statistical metrics comprising Spearman's rank correlation coefficient, coefficient of determination (R²), mean absolute error, and root mean square error. The regression tree methods, BART and RF, demonstrated the most robust and optimum performance, with 0.96 ≤ R² ≤ 0.99. The current study provides a generalized methodology for implementing ML in modeling the efficiency of not only a specific adsorption process but also a group of comparable processes involving multiple HM-AD pairs.


Introduction
Heavy metals (HMs) are stable inorganic pollutants with a low level of biodegradability [1][2][3][4][5][6][7] and thus tend to accumulate in living organisms [8][9][10][11]. Unlike some other pollutants, HMs can cause severe complications even at low concentrations. The US Environmental Protection Agency (EPA) listed lead (Pb), arsenic (As), nickel (Ni), chromium (Cr), copper (Cu), zinc (Zn), cadmium (Cd), and mercury (Hg) among the most serious water pollutants [12]. The permissible limits of these HMs in industrial wastewater suggested by the US EPA are 0.1, 0.01, 0.2, 0.1, 0.25, 1, 0.01, and 0.05 mg/L, respectively. The existence of such toxic metals in wastewater produced from industrial and agricultural activities can result in severe health and environmental issues due to their toxicity and environmental persistence [13]. Researchers around the globe are working on developing feasible solutions to maintain the HM concentrations in natural water bodies and wastewater below the standard limits.
Various chemical and physical treatments have been evaluated to remove HMs from water. These methods include, but are not limited to, membrane separation, filtration, ion exchange, precipitation, coagulation, reverse osmosis, and adsorption [13]. The cost and efficiency of a technique should be evaluated from an engineering perspective before selecting it. The adsorption method is often preferable to other methods due to several advantages, including low cost, reusability of adsorbents (ADs), environmental friendliness, and ease of operation [14]. Various ADs, including clays [15,16], zeolites [17], activated carbons [18,19], carbon nanotubes (CNTs) [20], nano-composites [21][22][23], graphene [24], chemical composites [25], and bio-sorbents [26][27][28][29], have been used to remove HMs from contaminated aqueous solutions. Usually, the success of any AD is mainly attributed to its morphology (porous structure) and the functional groups or inorganic minerals it contains [30].
Extensive experimental works on removing HMs using different ADs have been reported in the literature. In general, the research scope of the previous studies was to find the maximum adsorption capacity for a single or multiple HM(s). Experimental conditions including pH, time, initial concentration, adsorbent dosage, and temperature were optimized initially. Then the adsorption process was modeled to describe its nature quantitatively. The measured values of the independent parameters were considered as the inputs (IPs) for the model, while the output (OP) was calculated based on the measurements of the initial and final concentrations of the respective HM. In most cases, the OP was the removal efficiency (%):

Removal efficiency (%) = 100 × (C_o − C_e)/C_o (1)

The traditional way of correlating the OP to the IPs is to identify the most suitable adsorption isotherm, which expresses the adsorption capacity (q_e, mg/g) as a function of the adsorbate concentration (C_e, mg/L) at equilibrium:

q_e = (C_o − C_e) × V/m (2)
In Equation (2), C_o (mg/L) is the initial concentration of adsorbate, V (L) is the total volume of the fluid, and m (g) is the mass of AD. A few examples of the isotherms used in the previous studies are as follows:

Langmuir Isotherm (LI): q_e = q_max k_L C_e / (1 + k_L C_e) (3)

Freundlich Isotherm (FI): log(q_e) = log(k_f) + (1/n) log(C_e) (4)

Temkin Isotherm (TI): q_e = β_T ln(K_T) + β_T ln(C_e) (5)

Dubinin–Radushkevich Isotherm (DRI): ln(q_e) = ln(q_max) − B_D·ε_D² (6)

In Equations (3)–(6), q_max (mg/g) is the maximum adsorption capacity; k_L (L/mg) is the Langmuir constant; k_f ((mg/g)/(mg/L)^n) is the Freundlich constant; n (-) represents the non-linearity of the correlation; K_T (L/mg) and β_T (mg/g) are the TI-specific constants; B_D (mol²/kJ²) is the activity coefficient; and ε_D (kJ/mol) is the Polanyi potential. The standard practice for identifying the best isotherm for an adsorption process is to estimate the appropriate values of the isotherm-specific constants through a trial-and-error procedure.

As analyzing the complex relative impacts of the IPs on the OP was found to be difficult with a traditional isotherm model, different statistical methodologies were also employed to model the adsorption processes. The most commonly used statistical tool was the response surface method (RSM). The data required to apply RSM were generated by conducting wet experiments. This kind of experiment can be considered a simple batch adsorption process. An AD was added to the sample containing the HM after adjusting all IPs. The concentration of the HM in the sample was measured before and after the experiment to appraise the OP. The IPs considered to significantly affect the OP for a specific HM-AD pair were varied, while the other IPs were held at fixed values. Usually, a quadratic correlation (Equation (7)) of the OP to the variable IPs was developed by minimizing the difference between the predicted OP and its actual values:

y = β_0 + Σ β_i x_i + Σ β_ii x_i² + Σ Σ β_ij x_i x_j + ε (7)
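The isotherm constants are typically estimated by non-linear least squares. As an illustration only (the study did not fit isotherms in code, and the equilibrium data below are entirely hypothetical), the Langmuir constants of Equation (3) could be fit with SciPy:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical equilibrium data: C_e (mg/L) vs. q_e (mg/g)
C_e = np.array([2.0, 5.0, 10.0, 20.0, 40.0, 80.0])
q_e = np.array([8.1, 15.9, 24.8, 33.5, 40.2, 44.1])

def langmuir(c, q_max, k_L):
    # Equation (3): q_e = q_max * k_L * C_e / (1 + k_L * C_e)
    return q_max * k_L * c / (1.0 + k_L * c)

# p0 is an initial guess for (q_max, k_L)
params, _ = curve_fit(langmuir, C_e, q_e, p0=[50.0, 0.1])
q_max_fit, k_L_fit = params
```

The quality of the resulting fit (e.g., via R²) is what the trial-and-error isotherm selection described above compares across Equations (3)–(6).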
In Equation (7), the β terms are the regression coefficients and ε is the error term. An automated trial-and-error procedure was followed to determine the optimum values of these constants. Even though the RSM yielded acceptable predictions in most cases, it could not address the non-linearity of the correlation appropriately.
At present, artificial intelligence (AI) has been identified as a promising technique for modeling an adsorption process [16,19,22,23,25,27,29,31–34]. Compared to the traditional isotherms and statistical models, it has the advantage of directly predicting the impact of the IPs and the AD–HM interaction on the adsorption process. Many AI-based machine learning algorithms (MLAs) have been employed to date [35]. The majority of these applications involved a specific algorithm, the artificial neural network (ANN), which correlates the IPs to the OP(s) using "neurons" or nodes arranged in hidden layers. As an example, a fully connected ANN architecture (6-4-1) with one input layer with six inputs, one hidden layer with four neurons, and one output layer with a single output is shown in Figure 1. Every node of each layer is connected with a weight to the nodes in the following layer. The arrangement is similar to the neurons in the animal brain. A non-linear activation function is applied to every neuron in the hidden layer to map the weighted inputs to the outputs of the neurons. The function used to predict the actual OP with an ANN can be expressed as follows:

ŷ = Σ_{i=1}^{N} [w_i φ_i(x) + b_i] (8)

In Equation (8), N is the number of neurons in the hidden layer, φ_i(x) is the non-linear activation function, w_i is the weighting coefficient, and b_i is the bias. Even though the non-linearity of a correlation can be addressed better by an ANN than by an RSM or an isotherm, its application usually suffers from several drawbacks [36]. It may experience the complication of over-fitting if sufficient data are not used to train the model. Most of the previous studies on modeling HM adsorption with an ANN involved comparatively small datasets. It should also be noted that this particular algorithm is usually applied using expensive commercial software, namely MATLAB.
Apart from ANN, other advanced MLAs, such as support vector regression (SVR), genetic algorithm (GA), genetic programming (GP), multiple linear regression (MLR), adaptive neuro-fuzzy inference system (ANFIS), random forest (RF), stochastic gradient boosting (SGB), and Bayesian additive regression tree (BART), have also been used to model various adsorption processes [35]. Instead of depending on specific commercial software, most of these algorithms can be applied using open-source statistical and data mining software, such as R. Earlier, Hafsa et al. [37] investigated the predictive performance of the non-ANN algorithms in modeling the adsorption efficiency of As in the oxidation state of As³⁺.
In the current study, the scope of the application is expanded further by investigating the regression performance of a set of similar models (SVR with polynomial and RBF kernels, RF, BART, and SGB) in predicting the adsorption efficiencies of five toxic metals (Pb, Hg, Cd, Cr, and As) in different oxidation states (Pb²⁺, Hg²⁺, Cd²⁺, Cr⁶⁺, and As³⁺). The data required for the investigation were extracted from the literature. In addition to developing HM-AD-specific individual models, attempts were made to develop a generalized model that can predict the adsorption efficiency of multiple HM-AD combinations based on a single learning framework.

Regression Algorithms
A diverse variety of regression algorithms, including parametric, non-parametric, and Bayesian models, were selected for the current study to model the removal efficiency of the toxic heavy metals. The list of the algorithms is as follows:

(i) support vector regression with radial basis function (SVR-RBF) kernel
(ii) support vector regression with polynomial (SVR-poly) kernel
(iii) random forest (RF) regression
(iv) stochastic gradient boosting (SGB) regression
(v) Bayesian additive regression tree (BART)

All of these ML models are briefly discussed below, underscoring the respective statistical formulations that correlate the response with the inputs [38–45].

SVR-RBF
The objective of SVR is to devise an as-flat-as-possible hypothesis function, f(x), that is insensitive to most of the ε-deviations calculated from the measured responses in the training data. For a training dataset, (X, Y) = (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), the predicted response (ŷ) or the output of f(x) can be expressed as follows:

ŷ = f(x) = ⟨w, φ(x)⟩ + b = Σ_{i=1}^{N} α_i k(x_i, x) + b (9)

where ⟨w, φ(x)⟩ is the dot product of the weight vector (w) and the feature vector mapped with a non-linear transformation function φ(x), the α_i coefficients define the support vectors, k(x_i, x) is a suitable kernel function for non-linear feature mapping, and b is a constant term. The RBF kernel, K(x, x′), for a pair of feature vectors (x, x′) can be presented with the following equation:

K(x, x′) = exp(−‖x − x′‖² / (2σ²)) (10)

where σ is the radius or width of the RBF kernel. It is a tuning parameter that indicates the influence of (x, x′) on the model. Equation (10) expresses the similarity between x and x′ as a decaying function of the squared distance between them; a larger value of the kernel indicates a higher similarity between the features. In addition to σ, there is another important tuning parameter, C, the regularization parameter of the objective loss function defined on the difference between the measured and predicted values. The ultimate goal of this ML model is to minimize the objective loss function for the training data.
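The study implemented these models in R; an equivalent minimal sketch in Python/scikit-learn, using entirely synthetic stand-in data for the six experimental inputs, might look like the following. Note that scikit-learn parameterizes the RBF kernel with gamma, which plays the role of 1/(2σ²) in Equation (10).

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-ins for the six experimental inputs (pH, dosage, etc.)
X = rng.uniform(0, 1, size=(200, 6))
# Hypothetical non-linear removal-efficiency response (%)
y = 100 * (0.5 * X[:, 0] + 0.3 * np.sin(3 * X[:, 1]) + 0.2 * X[:, 2] ** 2)

# C is the regularization parameter; gamma corresponds to the kernel width sigma
model = SVR(kernel="rbf", C=100.0, gamma=1.0, epsilon=0.1)
model.fit(X, y)
preds = model.predict(X)
```

In practice C and gamma would be tuned by the cross-validated grid search described later, not fixed by hand as in this sketch.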

SVR-Poly
A polynomial kernel function is used in the SVR-poly algorithm to learn non-linear feature interactions. It compares two column vectors under the objective function framework using a degree-d polynomial equation:

K(x, x′) = (γ ⟨x, x′⟩ + c)^d (11)

where x and x′ are two column vectors representing the feature vectors, γ is a scalar parameter, c is a constant, and d is the kernel degree. Combining Equation (11) with Equation (9), the following expression can be obtained for the response (ŷ):

ŷ = Σ_{i=1}^{N} α_i (γ ⟨x_i, x⟩ + c)^d + b (12)

The tuning parameters for SVR-poly are d, γ, and C.
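A corresponding Python/scikit-learn sketch (again with synthetic data; the original models were fit in R) shows how d, γ, and c map onto the degree, gamma, and coef0 parameters. The target here is deliberately a degree-2 polynomial of the inputs, so the degree-2 kernel can represent it exactly.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(150, 3))
# Hypothetical response with quadratic terms and a pairwise interaction
y = X[:, 0] ** 2 + 0.5 * X[:, 1] * X[:, 2]

# degree=d, gamma=γ, coef0=c in Equation (11)
model = SVR(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0, epsilon=0.01)
model.fit(X, y)
train_r2 = model.score(X, y)   # training-set R²; should be high for this target
```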

RF Regression
RF is an ensemble learning method that uses a multitude of regression trees during the training period, with the variable splits at each node selected at random. Using bootstrap sampling, the algorithm initiates the learning by dividing the training data into M subsets. Next, an individual regression tree (T_i) is fit to every subset by utilizing a randomly selected subset of features. This process results in a forest of M regression trees. After fitting the model to the entire training set, the response (ŷ) is predicted for a test data point (x′) by averaging the individual predictions as follows:

ŷ = (1/M) Σ_{i=1}^{M} T_i(x′) (13)

where M is the total number of regression trees and T_i(x′) is the output of an individual regression tree. The hyper-parameter used for RF optimization is mtry, which designates the number of predictor variables that are randomly selected as split candidates at each node of a tree.
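In scikit-learn (a Python stand-in for the R packages used in the study), the mtry hyper-parameter corresponds to max_features. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
# Synthetic stand-ins for six experimental inputs and a non-linear response
X = rng.uniform(0, 1, size=(300, 6))
y = 50 * X[:, 0] + 30 * X[:, 1] ** 2

# max_features is scikit-learn's analogue of mtry: 2 of the 6 predictors
# are considered as split candidates at each node
rf = RandomForestRegressor(n_estimators=200, max_features=2, random_state=0)
rf.fit(X, y)
pred = rf.predict(X[:5])   # averaged prediction over the 200 trees, Equation (13)
```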

SGB Regression
An additive regression model is developed with an SGB algorithm by successively fitting a simple parameterized function, as a weak base learner, to the gradient of the difference between the measured value and the model response. A random sample of the complete dataset is used at each iteration to reduce the disagreement stochastically.
In the gradient boosting algorithm, a simple regression tree can be used as a weak prediction model, and the final prediction model is then produced as an ensemble of such weak learners. The functional form of the gradient boosting-based approximation of the predicted response, ŷ, for each data point, x, can be described as follows:

ŷ = Σ_{i=1}^{m} β_i f(x; a_i) (14)

where the functions f(x; a_i) represent the weak learners, which are simple functions of x involving the weighting parameters a = {a_1, a_2, ..., a_m} and the expansion coefficients β = {β_1, β_2, ..., β_m}. Both a and β are jointly fit to the training data. These parameters also define the split points for the base regression tree [20]. In SGB, the tuning parameters are the number of regression trees (m) and the number of splits to be performed at each node, i.e., the maximum number of nodes for each tree.
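The stochastic element described above corresponds to the subsample parameter in scikit-learn's gradient boosting implementation (shown here as a Python sketch on synthetic data, not the study's R code): setting it below 1.0 makes each tree fit on a random fraction of the training set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(300, 6))
# Hypothetical smooth non-linear response
y = 40 * X[:, 0] + 20 * np.sin(4 * X[:, 1])

# subsample=0.5: each of the 300 shallow trees sees a random half of the data,
# which is what makes the gradient boosting "stochastic"
sgb = GradientBoostingRegressor(n_estimators=300, max_depth=3, subsample=0.5,
                                learning_rate=0.05, random_state=0)
sgb.fit(X, y)
train_r2 = sgb.score(X, y)
```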

BART
The BART is a Bayesian approach. It uses a sum-of-trees model to approximate the hypothesis function [30]. The BART algorithmic concept builds on enhancing the additive trees model by introducing a prior regularization term that attempts to fit the model by moderating the effect of the individual regression tree. Consequently, each regression tree in the BART becomes a weak prediction model, explaining only a smaller portion of the training data. BART conveniently uses the additive representation of multiple regression trees to produce the final prediction model instead of constructing a single dominant large tree. The predicted response for a set of feature variables (x_1, x_2, ..., x_n) associated with a single data point, x, could be formulated according to the sum-of-trees model as follows:

ŷ = Σ_{j=1}^{m} g(x; T_j) (15)

where T_j is a single binary regression tree and m is the total number of regression trees. Each tree T_j consists of a set of interior node decision rules and a set of prior-regularized terminal nodes. For BART modeling, the tuning parameter is m, the number of trees used in the sum-of-trees model.

Evaluation Metrics
The evaluation metrics used to appraise the performance of the regression models consisted of Spearman's rank correlation coefficient (SRCC), the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). These statistical parameters are briefly described as follows.

Spearman's Rank Correlation Coefficient (SRCC)
The SRCC summarizes the strength and the direction (positive or negative) of the relationship between the actual (y) and predicted (ŷ) response variables. It is calculated as follows:

SRCC = 1 − (6 Σ d_i²) / (n(n² − 1)) (16)

where d_i is the difference in the ranks of the i-th sample between the two variable sets and n is the number of samples. The SRCC values range between −1 and +1; the closer the SRCC is to +1 or −1, the stronger the correlation.
In the case of SRCC, a perfect positive correlation is +1 and a perfect negative correlation is −1.
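As a quick illustration (with hypothetical values; the study computed this metric in R), SciPy's spearmanr agrees with the rank-difference formula in Equation (16) when there are no ties:

```python
import numpy as np
from scipy.stats import spearmanr

y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0, 55.0])   # same rank order as y_true

rho, _ = spearmanr(y_true, y_pred)

def srcc(a, b):
    # Equation (16): SRCC = 1 - 6*sum(d_i^2) / (n*(n^2 - 1)), valid without ties
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    d = ra - rb
    n = len(a)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```

Because the predicted values preserve the rank order of the actual values exactly, both computations give a perfect positive correlation of +1.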

Coefficient of Determination (R 2 )
The R² is a statistical measure of how well the results of a regression model fit the actual measurements. It quantifies the fraction of the variation in the outputs explained by the model:

R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)² (17)

where ȳ is the mean of the actual responses. The maximum and minimum values of R² are 1 and 0, respectively. Generally, a higher value of R² indicates a better model.

Mean Absolute Error (MAE)
The MAE is defined as the average of the absolute differences between the model outputs and the actual measurements. The MAE can be calculated using the following formula:

MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i| (18)

where ŷ and y are the predicted and actual responses, respectively.

Root Mean Square Error (RMSE)

The RMSE is a statistical measure of the dispersion of the prediction errors. It is a popular parameter for presenting the overall error of a model, as it provides a relative perception of how concentrated the model outputs are around the best-fit curve. The formula for RMSE can be written as follows:

RMSE = sqrt((1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²) (19)

where y and ŷ are the actual and predicted responses, respectively, and N is the total number of samples.
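A quick numerical check of these metrics on hypothetical actual/predicted removal efficiencies, using scikit-learn and NumPy (the study used the corresponding R functions):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted removal efficiencies (%)
y_true = np.array([95.0, 80.0, 60.0, 40.0, 20.0])
y_pred = np.array([93.0, 82.0, 58.0, 43.0, 18.0])

r2 = r2_score(y_true, y_pred)                       # Equation (17)
mae = mean_absolute_error(y_true, y_pred)           # Equation (18)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # Equation (19)
```

For these values the absolute errors are 2, 2, 2, 3, and 2, so MAE = 2.2 and RMSE = √5 ≈ 2.24, while R² is close to 0.99.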

Dataset
The experimental data used for the current study were collected from eight independent sources, as presented in Tables 1 and 2. All of the experiments were conducted to test the efficiencies of different adsorbents in removing toxic heavy metals such as Cr, Pb, Hg, Cd, and As from polluted water. The ADs used for the experimental studies are listed in Tables 1 and 2. The IPs were measured, while the OP was calculated based on the measured values of the initial and final concentrations of the respective HM.

MLA Modeling
As mentioned earlier, the predictive power of the MLAs was assessed in the current study by training and testing five regression models to estimate the removal efficiency of five heavy metals using six experimental parameters. The MLA modeling involved interpolating the experimental parameters to produce synthetic data and developing models for the datasets individually as well as comprehensively. For this purpose, the following steps were executed in successive order:

(i) interpolation of the experimental data;
(ii) parameter optimization and model selection for individual heavy metals;
(iii) parameter optimization and model selection for the comprehensive dataset.

Data Interpolation
As can be observed from Table 1, the ten datasets used in the present study consist of relatively small numbers of actual measurements. The training set composed of actual measurements is therefore not necessarily large enough, as the MLAs are data-driven and demand a reasonable quantity of data for optimizing the parameters and training the models reliably. To resolve this issue, a data augmentation technique is necessary for increasing the number of data points in the training set. Earlier, Podder et al. [47] used a cubic spline function to generate interpolated data points for their ANN-based modeling of As adsorption efficiency. Cubic spline, or piecewise cubic, interpolation can be categorized as an exact point interpolation method in the family of spatial interpolation techniques. Unlike its global polynomial analogs, this piecewise polynomial method yields a continuous second derivative at all data points, minimizes the interpolation errors, and produces a smoother distribution of interpolated data points within a certain range [48,49]. The cubic spline interpolation also mitigates the distortion issues in boundary regions observed in least-squares interpolation [47]. Considering these advantages, a piecewise cubic interpolation method was adopted to interpolate the original data points in the current study. The process of interpolation was initiated by interpolating the output column based on the first predictor. It was then extended to the predictor columns using the previously interpolated output values. On average, 250 data points were interpolated for each dataset. All interpolations were performed using the 'spline' function of the 'stats' package in R [50].
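The study performed this step with R's spline function; a rough Python equivalent using SciPy (with hypothetical contact-time/removal measurements, and natural boundary conditions matching the natural cubic spline named later in the text) would be:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical measurements: contact time (min) vs. removal efficiency (%)
time = np.array([5.0, 15.0, 30.0, 60.0, 120.0])
removal = np.array([22.0, 45.0, 63.0, 78.0, 85.0])

# Natural cubic spline: second derivative forced to zero at the boundaries
cs = CubicSpline(time, removal, bc_type="natural")

# Densify to ~250 interpolated points, as done on average per dataset
time_dense = np.linspace(time.min(), time.max(), 250)
removal_dense = cs(time_dense)
```

Being an exact interpolator, the spline reproduces the original measurements at the knots while filling the gaps smoothly between them.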

Individual Metal
The ML modeling to predict the removal efficiency involved parameter optimization on a training set and final model selection based on validation using independent test data points. As a first step, the original experimental datasets extracted from the literature were interpolated using a natural cubic spline technique and merged. Next, the merged dataset was split into training (80%) and test (20%) subsets using random sampling. The training of the five ML models was carried out using the training data points (80%). The test dataset (20%), which can be considered an independent test set, was withheld for model verification. The optimized parameters for the ML models, as well as the best-performing model, were selected based on a repeated k-fold cross-validation (CV) technique integrated with a grid search [51]. The hyper-parameters of the ML models, as presented in Table 3, were optimized by minimizing the average prediction error on the training data. At every fold of the CV, (k−1) subsets of the data were used to train a model. The held-out subset was then used to validate the trained model by calculating the RMSE. This process was repeated k times, for the desired number of iterations, to identify the best-performing model with the optimal values of the associated parameters that minimized the average RMSE across the repeated CV. From a mathematical perspective, the following objective was minimized:

min (1/(N·k)) Σ_{i=1}^{N} Σ_{j=1}^{k} RMSE_{i,j} (20)

where N indicates the number of iterations, k is the number of folds in the CV, and RMSE_{i,j} is the root mean square deviation of the predicted response from the actual response for the j-th fold in the i-th iteration of the repeated CV. The algorithm used for model selection and parameter optimization is presented in Figure 2. Both the number of folds (k) used in the CV and the number of repetitions (N) were set to 10 in the present study.
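The repeated k-fold CV with grid search can be sketched in Python/scikit-learn as follows (synthetic data; the study used 10 folds repeated 10 times in R, whereas this sketch uses 2 repeats to stay fast, and tunes the mtry analogue max_features for an RF model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(120, 6))
y = 60 * X[:, 0] + 25 * X[:, 1] ** 2   # hypothetical response

# 10-fold CV repeated 2 times; the study repeated 10 times
cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=0)
grid = {"max_features": [2, 4, 6]}   # candidate mtry values

# Scoring by (negative) RMSE matches the objective in Equation (20)
search = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=0),
                      grid, cv=cv, scoring="neg_root_mean_squared_error")
search.fit(X, y)
best_mtry = search.best_params_["max_features"]
```

GridSearchCV selects the parameter value whose average held-out RMSE across all repeated folds is lowest, which is exactly the minimization in Equation (20).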

Comprehensive Dataset
We performed MLA modeling for a comprehensive dataset that includes the data related to all metals. The objective was to produce a generalized model for the five toxic metals (As, Cd, Cr, Hg, and Pb) and 10 adsorbents (see Tables 1 and 2) used in this study. The dataset contained all the merged (original and interpolated) data points associated with five toxic metals. The statistics for this comprehensive dataset are presented in Table 4. There were two modifications in the feature description in this combined dataset. As shown in Table 1, six experimental parameters were available for any dataset including the variable and fixed inputs such as operating temperature, pH, initial concentration, contact time, sorbent dosage, and agitator speed. In the combined dataset, all six of these experimental parameters were included in the feature description. In addition to that, the types of metal and the specific adsorbent used in the experiment were also incorporated as features. Therefore, a total of eight (8) feature variables, as opposed to six (6) parameters in individual modeling, were used for the MLA modeling of removal efficiency on a total of 2476 (80%) training data points in the combined dataset. All five MLAs previously used for individual metal removal efficiency modeling were utilized to develop these generalized models. A similar strategy of repeated 10-fold cross-validation was performed to optimize the different parameters of the MLA models on this combined dataset. Finally, the parameter optimization and model selection steps gave us five optimal models (from using five different ML algorithms) chosen based on the lowest average RMSE observed on the validation data points during the cross-validated training. These models were then used to predict the removal efficiency on 619 (20%) independent test data points from the combined dataset.
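Incorporating the metal and adsorbent identities as the seventh and eighth features requires encoding them numerically. A hedged Python/pandas sketch (the column and adsorbent names below are placeholders, not the study's actual labels) of one common approach, one-hot encoding:

```python
import pandas as pd

# Hypothetical rows of the combined dataset: six experimental inputs plus
# the metal and adsorbent identities as categorical features
df = pd.DataFrame({
    "temperature": [25, 30, 25],
    "pH": [5.0, 6.5, 4.0],
    "initial_conc": [50, 100, 25],
    "contact_time": [60, 90, 120],
    "dosage": [0.5, 1.0, 0.5],
    "agitation": [150, 150, 200],
    "metal": ["As", "Pb", "Cd"],
    "adsorbent": ["AD1", "AD2", "AD1"],   # placeholder adsorbent names
})

# One-hot encode the two categorical columns so tree- and kernel-based
# regressors can consume them alongside the six numeric inputs
X = pd.get_dummies(df, columns=["metal", "adsorbent"])
```

Tree-based models such as RF and BART can alternatively handle integer-coded categories, but one-hot encoding keeps the representation safe for the kernel-based SVR models as well.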


Computing Framework
A computer with an Intel Core i7 8th Gen, 2 GHz processor and 8 GB of RAM was used to perform the dry experiments and statistical analysis. The hardware was operated with a 64-bit Windows 10 operating system. The open-source statistical computing framework R (4.0.2) was used for conducting all ML experiments and performing the statistical analysis. We chose R because it can be used on virtually any platform. It also provides all of the necessary packages and library functions for ML model training, hyper-parameter tuning, data preprocessing and visualization, post-prediction statistical analyses, and plotting graphs and trends of the results. R is considered a standard industrial choice for exploratory data analysis. The time required for prediction on a typical data matrix of 200 × 6 dimensions was approximately 10 s.

ML Model Evaluation for Individual Dataset
The performances of the five MLAs in predicting the efficiencies of adsorbing five toxic metals (As, Cd, Cr, Hg, and Pb) by different adsorbents are shown in Tables 5-9. For each metal, the prediction results are reported for two separate adsorbents. The outcomes are evaluated with statistical metrics comprising MAE, RMSE, SRCC, and R².

ML Model Evaluation for Combined Dataset
The results of the performance evaluations of the five generalized ML models on the independent test dataset are described in Table 10. The performance of the best model (RF) is presented in Figure 3, with a graph depicting the predicted values of the removal efficiencies as the function of the measured values for different HMs. In addition, the residual percentile error plot for the RF model is depicted in Figure 4 for independent test data.


Discussion
The current study presents a comprehensive approach to modeling adsorption efficiency. A wide range of ML models was applied to model the experimental adsorption of five toxic heavy metals with ten different adsorbents. As the modeling of an adsorption process involves non-linear feature interactions, the utility of non-linear regression models, such as SVR with polynomial and RBF kernels, RF, and SGB, as well as a Bayesian regression approach called BART, was examined in the current study. RF and SGB were selected as the bagging and boosting algorithms, respectively. Both the RBF and polynomial kernels in the SVR algorithm map the input space to a higher-dimensional feature space in which the data points become linearly separable. Similarly, the three different variations of regression trees used in the current study are suitable for non-linear regression tasks. For each toxic metal, two datasets using two different adsorbents were considered, resulting in a total of 10 datasets for the ML experiments. Note that each of these datasets consists of both original and interpolated data points, which were split into an 80 to 20% ratio of training and test data, respectively. Tables 5-9 report the results of ML modeling of the selected regression models on the 20% independent test data points for each of these 10 datasets. Interestingly, no single learning algorithm stood out across all ten datasets when evaluated on the independent test points (see Tables 5-9). Overall, the BART algorithm showed the best performance compared to the other models across all data, with an average R² value of 96%. The other two regression tree approaches, SGB and RF, demonstrated the next best performances with average R² values of 94% and 93%, respectively. In the case of SVR, the models with the RBF kernel demonstrated slightly better performance (R² = 93%) than their polynomial counterpart (R² = 91%).
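The augmentation-and-split step described above can be sketched as follows. This is an illustrative Python version under stated assumptions (the study performed the interpolation and splitting in R; the function names and the choice of a uniform grid of interpolation points are ours):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def augment_with_spline(x, y, n_new=50):
    """Fit a cubic spline through the measured (x, y) points and
    sample it on a uniform grid to generate synthetic data points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)  # CubicSpline requires increasing x
    spline = CubicSpline(x[order], y[order])
    x_new = np.linspace(x.min(), x.max(), n_new)
    return x_new, spline(x_new)

def train_test_split_80_20(n_samples, seed=0):
    """Random 80/20 split into training and independent test indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    cut = int(0.8 * n_samples)
    return idx[:cut], idx[cut:]
```

Because the spline interpolates, it passes exactly through the original measurements, so the synthetic samples extend the data density without altering the measured points.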
However, an extensive comparative analysis (e.g., finding min, max, and standard deviation) of the performance of these 10 individual models may not be appropriate here, as the 10 datasets used were collected under different experimental setups using 12 different adsorbents and five different metals.
Since a generalized ML model applicable to different adsorption processes does not exist in the literature, we performed the modeling with a strategy that combines diverse datasets into a single learning framework to which different ML algorithms can be applied. This effort provided insights into the generalized predictive power of the ML algorithms for estimating adsorption efficiency irrespective of the HM-AD combination, and into the reliability of the predictions made by the generalized models for different toxic metals. It also made the comparative analysis of the ML algorithms more meaningful, as all variations in experimental setup, metal, and adsorbent type were brought under a single learning framework, and all five algorithms were trained on the same set of data points.
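One way to realize such a combined framework is to encode the metal and adsorbent identities as categorical predictors alongside the process conditions, so a single model learns across all HM-AD pairs. The sketch below uses Python and scikit-learn as an illustration (the study used R, and the column names, toy values, and hyperparameters here are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical combined dataset: process conditions plus the
# metal/adsorbent identities as categorical predictors.
df = pd.DataFrame({
    "pH":        [2.0, 4.0, 6.0, 2.0, 4.0, 6.0],
    "dose_g_L":  [0.5, 1.0, 1.5, 0.5, 1.0, 1.5],
    "metal":     ["Pb", "Pb", "Pb", "Cd", "Cd", "Cd"],
    "adsorbent": ["AC", "AC", "AC", "biochar", "biochar", "biochar"],
    "removal":   [0.40, 0.70, 0.90, 0.35, 0.60, 0.85],
})

# One-hot encode the categorical identities, pass numeric
# conditions through, and fit a random forest on everything.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(), ["metal", "adsorbent"])],
        remainder="passthrough")),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
])
model.fit(df.drop(columns="removal"), df["removal"])
pred = model.predict(df.drop(columns="removal"))
```

With this design, adding a new HM-AD pair only requires appending its measurements to the dataset and refitting; the encoder picks up the new category automatically.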
The evaluation of the generalized models, as presented in Table 10, shows that all of them demonstrated consistent and comparable performances for the training and test datasets. The SVR polynomial kernel performed almost identically to its RBF kernel counterpart. Among these methods, the RF model yielded the best scores in terms of all evaluation metrics (SPcorr = 0.989, R² = 0.988, MAE = 0.007, and RMSE = 0.033). It is important to observe that both the bagging-based (RF) and boosting-based (SGB) regression tree algorithms, with their stochastic components, performed well by choosing a random subset of predictors (RF) or observations (SGB) for splitting at each node of the regression tree, over several iterations of parameter optimization. Both regression tree models were able to capture the non-linearity of the data accurately in estimating the response variable. BART achieved one of the best correlations (SPcorr = 0.983 and R² = 0.969) by imposing regularization on each tree while fitting it to a small portion of the training data, leading to low-bias predictions when several trees were fitted to the complete set of training samples. The measured removal efficiencies for the test dataset are plotted against the values predicted by the best-performing RF model in Figure 3; the model is evidently accurate in predicting the removal efficiencies for all the different metals, irrespective of the adsorbent type used in the adsorption experiments. The residual error analysis of the RF model is presented in Figure 4, with the errors expressed at the percentile level. More than 98% of the test data lie within a ±10% error limit.
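The "fraction of test points within a ±10% error band" statistic quoted above is straightforward to compute. A minimal illustrative sketch in Python (the function name and the exact residual definition are our assumptions; the study reports this analysis graphically in Figure 4):

```python
import numpy as np

def within_error_band(y_true, y_pred, band=10.0):
    """Fraction of predictions whose residual percentage error,
    100 * (y_pred - y_true) / y_true, lies within +/- band percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    pct_error = 100.0 * (y_pred - y_true) / y_true
    return float(np.mean(np.abs(pct_error) <= band))
```

Applied to the generalized RF model's test predictions, this statistic would return a value above 0.98, matching the ±10% figure reported above.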
A methodology to implement the best-performing RF model is outlined in Figure 5 with a block diagram. The model in its current form is directly applicable to predicting the adsorption efficiency for a given set of process conditions. It requires only the design or operating input parameters (IPs) from the user, which are treated as the predictor variables to produce the output adsorption efficiency. When the current database is used, the predictions are limited to the twelve HM-AD pairs considered in this study. However, the database can be enriched further by adding new experimental measurements for different HM-AD pairs, which will extend the predictive scope of the current model. This AI-based automated methodology is expected to replace the traditional modeling approach, which requires numerous iterations to identify an appropriate model and optimize the values of its coefficients. It will be significantly beneficial for general users, including design and operating engineers, as well as management and research personnel.

Conclusions
State-of-the-art ML algorithms were applied to model the sorption efficiencies of different adsorbents in removing toxic heavy metals (As, Cd, Hg, Cr, and Pb). Specifically, the predictive power of non-ANN approaches was analyzed in the open-source statistical computing framework R. Probably the most significant contribution of the current study is a generalized ML model that can be used to predict the removal efficiency of five toxic metals using twelve different adsorbents. All ML models were developed using original and synthetic data, the latter produced by interpolating the original data with a cubic spline interpolation algorithm. Model assessments using standard evaluation metrics show excellent agreement between the actual and predicted removal efficiencies for both the individual (R² = 96%) and generalized (R² = 98.8%) predictive models. The present work provides important insights into the predictive power of non-ANN ML approaches, both for metal- and adsorbent-specific individual learning models and for models in which all data are combined into a single learning framework. Given the superior performance and beneficial attributes of the generalized models, the proposed system has high potential to be employed in industrial production systems. Although the current approach was successfully tested for a set of adsorption systems comprising five toxic heavy metals and twelve varieties of adsorbents, it should be applied to larger datasets to develop a truly universal model.