## 1. Introduction

Heavy metals (HMs) are stable inorganic pollutants with low biodegradability [1,2,3,4,5,6,7] and thus tend to accumulate in living organisms [8,9,10,11]. Unlike some other pollutants, HMs can cause severe complications even at low concentrations. The US Environmental Protection Agency (EPA) listed lead (Pb), arsenic (As), nickel (Ni), chromium (Cr), copper (Cu), zinc (Zn), cadmium (Cd), and mercury (Hg) among the most serious water pollutants [12]. The permissible limits of these HMs in industrial wastewater suggested by the US EPA were 0.1, 0.01, 0.2, 0.1, 0.25, 1, 0.01, and 0.05 mg/L, respectively. The presence of such toxic metals in wastewater from industrial and agricultural activities can cause severe health and environmental problems because of their toxicity and environmental persistence [13]. Researchers around the globe are working on feasible ways to keep the HM concentration in natural water bodies and wastewater below the standard limits.

Various chemical and physical treatments have been evaluated for removing HMs from water. These methods include, but are not limited to, membrane separation, filtration, ion exchange, precipitation, coagulation, reverse osmosis, and adsorption [13]. The cost and efficiency of a technique should be evaluated from an engineering perspective before it is selected. Adsorption is sometimes preferred over other methods because of its low cost, the reusability of adsorbents (ADs), its environmental friendliness, and its ease of operation [14]. Various ADs, including clays [15,16], zeolites [17], activated carbons [18,19], carbon nanotubes (CNTs) [20], nano-composites [21,22,23], graphene [24], chemical composites [25], and bio-sorbents [26,27,28,29], have been used to remove HMs from contaminated aqueous solutions. The success of an AD is usually attributed to its morphology (porous structure) and to the functional groups or inorganic minerals it contains [30].

Extensive experimental work on removing HMs with different ADs has been reported in the literature. In general, the previous studies aimed to find the maximum adsorption capacity for single or multiple HMs. Experimental conditions, including pH, time, initial concentration, adsorbent dosage, and temperature, were optimized first. The adsorption process was then modeled to describe its nature quantitatively. The measured values of the independent parameters were taken as the inputs (IPs) of the model, while the output (OP) was calculated from the measured initial and final concentrations of the respective HM. In most cases, the OP was the removal efficiency (%).
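Removal efficiency is conventionally computed from the initial and final concentrations. The form below is a sketch of that standard definition rather than a quotation of the original equation: the symbol $R$ is an assumed name for the OP, and $C_o$ and $C_e$ (mg/L) denote the initial and final (equilibrium) adsorbate concentrations.

```latex
R\,(\%) = \frac{C_o - C_e}{C_o} \times 100
```

For example, a solution that starts at 50 mg/L and ends at 5 mg/L corresponds to $R = 90\%$.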

The traditional way of correlating the OP to the IPs is to identify the most suitable adsorption isotherm, which expresses the adsorption capacity ($q_e$, mg/g) as a function of the adsorbate concentration ($C_e$) at equilibrium.
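The adsorption capacity follows from a mass balance on the batch. The usual form is sketched below, on the assumption that it matches Equation (2); $C_o$, $V$, and $m$ are the initial concentration, the liquid volume, and the adsorbent mass.

```latex
q_e = \frac{(C_o - C_e)\,V}{m}
```

With $C_o$ = 50 mg/L, $C_e$ = 5 mg/L, $V$ = 0.1 L, and $m$ = 0.5 g, this gives $q_e$ = 9 mg/g.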

In Equation (2), $C_o$ (mg/L) is the initial concentration of adsorbate, $V$ (L) is the total volume of the fluids, and $m$ (g) is the mass of AD. The isotherms used in the previous studies include the Langmuir, Freundlich, Temkin (TI), and Dubinin-Radushkevich models.
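The commonly cited forms of these four isotherms are sketched below in terms of the constants defined for Equations (3)–(6). They are textbook forms written to agree with the stated units. The exact parameterization used in the original equations is an assumption, in particular the exponent convention for $n$ and the use of $\varepsilon_D$ (given in kJ$^2$/mol$^2$) as the squared Polanyi potential.

```latex
\begin{aligned}
\text{Langmuir:} \quad & q_e = \frac{q_{max}\, k_L\, C_e}{1 + k_L\, C_e} \\
\text{Freundlich:} \quad & q_e = k_f\, C_e^{\,n} \\
\text{Temkin:} \quad & q_e = \beta_T \ln\left(K_T\, C_e\right) \\
\text{Dubinin-Radushkevich:} \quad & q_e = q_{max} \exp\left(-B_D\, \varepsilon_D\right)
\end{aligned}
```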

In Equations (3)–(6), $q_{max}$ (mg/g) is the maximum adsorption capacity; $k_L$ (L/mg) is the Langmuir constant; $k_f$ ((mg/g)/(mg/L)$^n$) is the Freundlich constant; $n$ (-) represents the non-linearity of the correlation; $K_T$ (L/mg) and $\beta_T$ (mg/g) are the TI-specific constants; $B_D$ (mol$^2$/kJ$^2$) is the activity coefficient; and $\varepsilon_D$ (kJ$^2$/mol$^2$) is the Polanyi potential. The standard practice for identifying the best isotherm for an adsorption process is to estimate the isotherm-specific constants by trial and error.

Because the complex relative impacts of the IPs on the OP were found to be difficult to analyze with a traditional isotherm model, different statistical methodologies were also employed to model adsorption processes. The most commonly used statistical tool was the response surface method (RSM). The data required to apply RSM were generated by wet experiments, each of which can be regarded as a simple batch adsorption process. An AD was added to the sample containing HM after all IPs had been adjusted. The concentration of HM in the sample was measured before and after the experiment to determine the OP. The IPs considered to significantly affect the OP for a specific HM-AD pair were varied, while the other IPs were held at fixed values. Usually, a quadratic correlation (Equation (7)) of the OP to the variable IPs was developed by minimizing the difference between the predicted and actual values of the OP.
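A second-order response surface is typically written as below. This is the generic quadratic form, offered as a sketch of what Equation (7) expresses; the symbol $y$ for the OP, $x_i$ for the $k$ varied IPs, and the subscripts on $\beta$ are assumed notation.

```latex
y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2
  + \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \beta_{ij} x_i x_j + \varepsilon
```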

In Equation (7), $\beta$ and $\varepsilon$ are constants. An automated trial-and-error procedure was followed to determine their optimum values. Even though the RSM yielded acceptable predictions in most cases, it could not address the non-linearity of the correlation appropriately.
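To make the fitting step concrete, the sketch below estimates the coefficients of a two-factor quadratic surface by ordinary least squares, one common way of minimizing the difference between predicted and actual OP. It is an illustration under stated assumptions, not the procedure of any cited study: the 3x3 design in coded levels, the coefficient values, and all function names are hypothetical, and the responses are synthetic.

```python
from itertools import product


def design_row(x1, x2):
    """Terms of a two-factor quadratic model: 1, x1, x2, x1^2, x2^2, x1*x2."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the caller's data is left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_quadratic_surface(points, responses):
    """Least-squares coefficients from the normal equations (X^T X) beta = X^T y."""
    rows = [design_row(x1, x2) for x1, x2 in points]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, responses)) for i in range(k)]
    return solve_linear(xtx, xty)


# Hypothetical 3x3 factorial design in coded levels (-1, 0, +1).
points = list(product([-1.0, 0.0, 1.0], repeat=2))
# Synthetic responses generated from known coefficients, for illustration only.
true_beta = [80.0, 5.0, -3.0, -4.0, -2.0, 1.5]
responses = [
    sum(b * t for b, t in zip(true_beta, design_row(x1, x2)))
    for x1, x2 in points
]

beta = fit_quadratic_surface(points, responses)
```

Because the synthetic responses contain no noise, `beta` reproduces `true_beta` to within floating-point error; with measured data the fit would instead return the coefficients that minimize the sum of squared residuals.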

At present, artificial intelligence (AI) has been identified as a promising technique for modeling adsorption processes [16,19,22,23,25,27,29,31,32,33,34]. Compared to the traditional isotherms and statistical models, it has the advantage of directly predicting the impact of the IPs and the AD-HM interaction on the adsorption process. Many AI-based machine learning algorithms (MLAs) have been employed to date [35]. Most of these applications involved one specific algorithm, the artificial neural network (ANN), which correlates the IPs to the OP(s) using “neurons” or nodes arranged in hidden layers. As an example, a fully connected ANN architecture (6-4-1), with an input layer of six inputs, one hidden layer of four neurons, and an output layer with a single output, is shown in Figure 1. Every node of each layer is connected by a weight to the nodes in the following layer, an arrangement similar to that of the neurons in an animal brain. A non-linear activation function is applied at every neuron in the hidden layer to map the weighted inputs to the neuron's output. The function used to predict the actual OP with an ANN is given in Equation (8).
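A single-hidden-layer network of this kind is usually written as a weighted sum of the hidden-neuron activations. The form below is a sketch consistent with the symbols $N$, $\varphi_i$, $w_i$, and $b_i$; the symbol $\hat{y}$ for the predicted OP and the placement of the bias term inside the sum are assumptions.

```latex
\hat{y}(x) = \sum_{i=1}^{N} \left[ w_i\, \varphi_i(x) + b_i \right]
```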

In Equation (8), $N$ is the number of neurons in the hidden layer, $\varphi_i(x)$ is the non-linear activation function, $w_i$ is the weighting coefficient, and $b_i$ is the bias. Even though an ANN can address the non-linearity of a correlation better than an RSM or an isotherm, its application usually suffers from several drawbacks [36]. It may over-fit if insufficient data are used to train the model, and most previous studies on modeling HM adsorption with ANNs involved comparatively small datasets. It should also be noted that this algorithm is usually applied using expensive commercial software, namely MATLAB.
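The 6-4-1 architecture described above can be sketched as a plain forward pass. This is a minimal illustration under stated assumptions, not the implementation used in any cited study: the tanh activation, the linear output neuron, the random untrained weights, and the six scaled input values are all hypothetical.

```python
import math
import random


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a fully connected network with one tanh hidden layer."""
    hidden = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        hidden.append(math.tanh(z))  # non-linear activation (assumed: tanh)
    # Linear output neuron: weighted sum of hidden activations plus a bias.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


rng = random.Random(0)  # fixed seed so the untrained weights are reproducible
n_inputs, n_hidden = 6, 4

# Randomly initialised (untrained) parameters, for illustration only.
hidden_weights = [
    [rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)
]
hidden_biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_bias = rng.uniform(-1, 1)

# Six hypothetical, already-scaled input parameters.
sample = [0.5, 0.2, 0.8, 0.1, 0.6, 0.3]
prediction = forward(
    sample, hidden_weights, hidden_biases, output_weights, output_bias
)

# Trainable parameters: hidden weights + hidden biases + output weights + output bias.
n_parameters = n_inputs * n_hidden + n_hidden + n_hidden + 1
```

Even this small network has 6 × 4 + 4 + 4 + 1 = 33 trainable parameters, a number worth comparing with the size of the training set when judging the risk of over-fitting.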

Apart from ANN, other advanced MLAs, such as support vector regression (SVR), genetic algorithm (GA), genetic programming (GP), multiple linear regression (MLR), adaptive neuro-fuzzy inference system (ANFIS), random forest (RF), stochastic gradient boosting (SGB), and Bayesian additive regression trees (BART), were also used to model various adsorption processes [35]. Instead of depending on specific commercial software, most of these algorithms can be applied using open-source statistical and data mining software, such as R. Earlier, Hafsa et al. [37] investigated the predictive performance of non-ANN algorithms in modeling the adsorption efficiency of As in the As$^{3+}$ oxidation state. In the current study, the scope of the application is expanded by investigating the regression performance of a set of similar models (SVR with polynomial and RBF kernels, RF, BART, and SGB) in predicting the adsorption efficiencies of five toxic metals (Pb, Hg, Cd, Cr, and As) in different oxidation states (Pb$^{2+}$, Hg$^{2+}$, Cd$^{2+}$, Cr$^{6+}$, and As$^{3+}$). The data required for the investigation were extracted from the literature. In addition to developing HM-AD-specific individual models, attempts were made to develop a generalized model that can predict the adsorption efficiency of multiple HM-AD combinations within a single learning framework.

## 4. Discussion

The current study presents a comprehensive approach to modeling adsorption efficiency. A wide range of ML models was applied to model the experimental adsorption of five toxic heavy metals with ten different adsorbents. Because modeling an adsorption process involves non-linear feature interactions, the utility of non-linear regression models, namely SVR with polynomial and RBF kernels, RF, SGB, and a Bayesian regression approach called BART, was examined in the current study. RF and SGB were selected as the bagging and boosting algorithms, respectively. Both the RBF and polynomial kernels in the SVR algorithm map the input space to a higher-dimensional feature space in which the data can be fitted with a linear function. Similarly, the three regression tree variants used in the current study are suitable for non-linear regression tasks. For each toxic metal, two datasets with two different adsorbents were considered, giving a total of 10 datasets for the ML experiments. Each of these datasets consists of both original and interpolated data points, which were split into training and test sets in an 80:20 ratio.
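The 80:20 partition can be sketched as follows. This is a minimal illustration that assumes a random shuffle with a fixed seed, which the text does not specify; the function name, the seed, and the 50 synthetic records are hypothetical.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled records form the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 50 (inputs, removal efficiency) records.
dataset = [
    ({"pH": 2 + i % 7, "dose_g_per_L": 0.5 + 0.1 * (i % 5)}, 50.0 + i)
    for i in range(50)
]
train, test = train_test_split(dataset)
```

Holding the test records out before any model fitting keeps them independent of training, which is what makes the reported test-set scores a fair estimate of predictive performance.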

Table 5, Table 6, Table 7, Table 8 and Table 9 report the results of the selected regression models on the 20% independent test data points for each of these 10 datasets. Interestingly, no single learning algorithm stood out for all ten datasets when evaluated on the independent test points. However, the BART algorithm showed the best overall performance, with an average $R^2$ value of 96%. The other two regression tree approaches, SGB and RF, demonstrated the next best performances, with average $R^2$ values of 94% and 93%, respectively. In the case of SVR, the models with the RBF kernel performed slightly better ($R^2$ = 93%) than their polynomial counterparts ($R^2$ = 91%). However, an extensive comparative analysis (e.g., finding the minimum, maximum, and standard deviation) of the performance of these 10 individual models may not be appropriate here, as the 10 datasets used were collected under different experimental setups using 12 different adsorbents and five different metals.

Since a generalized ML model applicable to different adsorption processes does not exist in the literature, we combined the diverse datasets into a single learning framework to which different ML algorithms can be applied. This effort provided insight into the generalized predictive power of the ML algorithms for estimating adsorption efficiency irrespective of the HM-AD combination, and into the reliability of the generalized models' predictions for different toxic metals. It also made the comparison of the ML algorithms more meaningful, because all variations in experimental setup, metal, and adsorbent type were brought under a single learning framework and all five algorithms were trained on the same set of data points.

The evaluation of the generalized models, presented in Table 10, shows that all of them perform consistently and comparably on the training and test datasets. The SVR with the polynomial kernel performs almost identically to its RBF counterpart. Among these methods, the RF model yielded the best scores on all evaluation metrics (SPCC = 0.989, $R^2$ = 0.988, MAE = 0.007, and RMSE = 0.033). It is important to observe that both the bagging-based (RF) and boosting-based (SGB) regression tree algorithms, which have stochastic components, performed better by choosing the best possible random set of predictors (RF) or observations (SGB) for splitting at each node of the regression tree over several iterations of parameter optimization. Both regression tree models captured the non-linearity of the data accurately in estimating the response variable. BART achieved one of the best correlations (SPCC = 0.983 and $R^2$ = 0.969) by imposing regularization on each tree while fitting it to a small portion of the training data, leading to a bias-free prediction when several trees were fitted to the complete set of training samples. The measured removal efficiencies for the test dataset are plotted against the values predicted by the best-performing RF model in Figure 3. Compared to the metal-specific predictions shown in Figure 4, the RF model is evidently accurate in predicting the removal efficiencies for all types of metals, irrespective of the adsorbent used in the adsorption experiments. The residual error analysis of the RF model is presented in Figure 4 with the range of errors in the percentile level. More than 98% of test data lie within a ±10% error limit.
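The error metrics quoted above can be computed from paired measured and predicted values as sketched below. The sketch covers $R^2$, MAE, RMSE, and the share of points inside an error band. It assumes that the ±10% limit refers to relative error and that efficiencies are expressed as fractions; the sample values are hypothetical and are not data from this study.

```python
import math


def regression_metrics(measured, predicted):
    """Return R^2, MAE, and RMSE for paired measured and predicted values."""
    n = len(measured)
    errors = [p - m for m, p in zip(measured, predicted)]
    mean_measured = sum(measured) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }


def share_within(measured, predicted, limit=0.10):
    """Fraction of points whose relative error is within +/- limit."""
    hits = sum(
        1 for m, p in zip(measured, predicted) if abs(p - m) <= limit * abs(m)
    )
    return hits / len(measured)


# Hypothetical removal efficiencies as fractions (0 to 1).
measured = [0.90, 0.80, 0.70, 0.60, 0.50]
predicted = [0.92, 0.79, 0.70, 0.63, 0.40]

metrics = regression_metrics(measured, predicted)
within_10 = share_within(measured, predicted)
```

In this toy example four of the five points fall inside the ±10% band, so `within_10` is 0.8; the single large miss is also what pulls $R^2$ down to about 0.89.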

A methodology for implementing the best-performing RF model is outlined as a block diagram in Figure 5. The model in its current form is directly applicable to predicting the adsorption efficiency for a given set of process conditions. It requires only the design or operating parameters (IPs) as inputs from the user; these are treated as the predictor variables from which the adsorption efficiency is output. With the current database, the predictions are limited to the twelve HM-AD pairs used in this study. However, the database can be enriched by adding new experimental measurements for different HM-AD pairs, which will extend the predictive scope of the current model. The AI-based automated methodology is expected to replace the traditional modeling approach, which requires an indefinite number of iterations to find the appropriate model and the optimized values of its coefficients. It should be significantly beneficial for general users, including design and operating engineers as well as management and research personnel.