Predicting the Toxicity of Ionic Liquids toward Acetylcholinesterase Enzymes Using Novel QSAR Models

Limited information on the potential toxicity of ionic liquids (ILs) becomes the bottleneck that creates a barrier in their large-scale application. In this work, two quantitative structure-activity relationships (QSAR) models were used to evaluate the toxicity of ILs toward the acetylcholinesterase enzyme using multiple linear regression (MLR) and extreme learning machine (ELM) algorithms. The structures of 57 cations and 21 anions were optimized using quantum chemistry calculations. The electrostatic potential surface area (SEP) and the screening charge density distribution area (Sσ) descriptors were calculated and used for prediction of IL toxicity. Performance and predictive aptitude between MLR and ELM models were analyzed. Highest squared correlation coefficient (R2), and also lowest average absolute relative deviation (AARD%) and root-mean-square error (RMSE) were observed for training set, test set, and total set for the ELM model. These findings validated the superior performance of ELM over the MLR toxicity prediction model.

State-of-the-art research shows that the previously accepted advantage of low toxicity for ILs has proved to be overestimated, meaning that to some extent ILs in reality pose hazard potentials to humans and environment [23]. Rogers [22] et al. challenged the notion of green ILs because of their hazardous properties, e.g., unknown toxicity and stability. Meanwhile, Jastorff [24]  toxicity and ecotoxicity of ILs using quantitative structure-activity relationships (QSAR), which could be used to aid the rational design of optimal solvents from a technical and environmental perspective. Docherty [25] et al. found that ILs with hexyl-and octyl-imidazolium and pyridinium bromides had significant antimicrobial activity to pure cultures of Saccharomyces cerevisiae, Escherichia coli, Bacillus subtilis, and others. Later, Pang [26] et al. reviewed the environmentally relevant issues of ILs, such as environmental application and toxicity. Karunanithi [27] et al. studied the life cycle of aquatic ecotoxicity impacts for five common ILs by integrating physical properties, toxicity data, and transport parameters in the model. For the first time, they reported the freshwater ecotoxicity characterization factors for ILs. They found that an average of 83% of ecotoxicity impact was due to chemicals released during the upstream synthesis steps, while the remaining 17% ecotoxicity was related to the life-cycle energy consumption. All these findings underscore the need to develop sustainable and nontoxic ILs in future research. Unlike other basic physicochemical properties, such as density, viscosity, thermal stability, etc., to the best of our knowledge, research studies on the toxicity of ILs are still insufficient, and therefore need to be addressed further. In this regard, key developments in toxicity of ILs will highlight the designable feasibility of ILs to be environmentally benign with huge potential benefits for sustainable and green chemistry [28]. On the other hand, it is worthy to emphasize that around 10 8 ILs are accessible due to the combination of enormously different cations and anions. Consequently, challenges have been encountered when investigating the toxicity of ILs [28,29]. To deal with these issues, it is highly necessary to develop efficient models to predict the toxicity of ILs with high accuracy.
Recently, QSAR studies have been reported in correlation with, and prediction of, properties of ILs [30][31][32][33][34][35][36]. QSAR is a powerful tool for accurately predicting the physical and chemical properties when applied to a set of molecular descriptors. In this regard, QSAR is supposed to be useful for estimating and screening chemical compounds for specific applications, having the desired characteristics for elucidating the underlying relation between micro-structures and macro-properties. Since the pioneering studies by Jastorff and co-authors [37], research attempts have been devoted to improving understanding and estimating the toxicity of ILs [30,[38][39][40][41][42]. In subsequent years, Pereira et al. presented their toxicological assessment of a group of environmentally-friendly ILs with benign cholinium cations and linear alkanoate anions, using filamentous fungi as model eukaryotic organisms [39]. They found that the toxicity of ILs was increased with elongation in the linear chains of the anion, while branching resulted in reduced toxicity due to depressed lipophilicity [39]. Similar conclusions were drawn by Lima and Coutinho [43]. In order to further these studies, Ranke and co-workers studied up to 74 ionic liquids with different cations and anions to show the influence of cation lipophilicity on the cytotoxicity in IPC-81 leukemia cells of rats. They found that substituents in the cation increased the toxicity of ILs [41]. Torrecilla et al. developed QSAR models based on MLR and neural network for prediction of toxicological effects of 96 ILs on IPC-81 [42]. Romero and co-workers studied EC 50 values of each compound in an aqueous solution using the microtox standard procedure. They found that the short length of the R 2 side chain on imidazolium cation was favorable to diminish the toxic effect, while anions had a minor effect on the toxicity of ILs [44]. Yan et al. investigated the toxicity of ILs in acetyl cholinesterase enzyme (AChE) by the QSAR method using topological indices based on the atom characters. Their results demonstrated that the MLR model was capable of predicting the logEC 50 (AChE) of ILs by using a 177 unit training set and a 44 unit testing set of topological indices generated from cations and anions [45]. Recently, Singh et al. investigated the chemical attributes of a wide variety of ILs towards their inhibitory potential of AChE using a supported vector machine (SVM) and a cascade correlation network (CCN) [46]. Their results showed that the proposed QSAR models had more statistical confidence, especially with respect to external validation, which has not been focused upon in other studies. This work addresses the success in predicting different toxicity classes and precise toxicity end-points of ILs using proposed QSAR. Like other researchers, we have previously focused on establishing the toxicity database of ILs, including over 4000 pieces of data and using QSAR models based on quantum chemistry descriptors and an artificial intelligence algorithm to study the toxicity toward the IPC-81 [40]. The results revealed that the nonlinear model developed by the SVM algorithm was more reliable in the prediction of toxicity of ILs. Until now, very limited information has been available on the toxicity of ILs towards AChE. Our main objective in this work was to enhance the understanding of IL's toxicity towards AChE by using novel extreme learning machine (ELM) algorithms with better accuracy in a bid to shed light on designing novel environmentally benign ILs for future applications.

Quantitative Prediction of Multiple Linear Regression (MLR) Model
To obtain the predictive model with an optimal number of descriptors, the stepwise regression approach was utilized to choose the effective input descriptors. As shown in Figure 1, the number of descriptors gradually grows as the squared correlation coefficient (R 2 ) and adjusted squared correlation coefficient (R 2 adjusted ) of the model grow, with simultaneous decrease in standard error (Std. Error). When the number of input parameters overtakes 11, the R 2 , R 2 adjusted , and standard error do not change evidently. Thus, the MLR model based on the optimal set of descriptors was determined. The final model is shown as Equation (1), where P 0 stands for the intercept of the prediction model, while C i and C j mean the ith and jth coefficients respectively; S EP-h is the electrostatic potential surface area for cations or anions at the electrostatic potential h while S σ-k means the screening charge density distribution area for cations or anions at the screening charge density k.
Details of the coefficients and the parameters, including the t values of the descriptors, can be found in Table 1. In total, eight S EP parameters and three S σ parameters were chosen for the model, where C is the cation and the numbers are the corresponding electrostatic potential and surface screening charge density, respectively. Since t values indicate the significance of the parameters, it can be seen that S σ-C0.013 is the most vital parameter for the prediction of the toxicity of ILs. The σ-profile range can be qualitatively divided into three main parts: the hydrogen-bond (HB) donor region (σ < −0.0082 e/Å 2 ) and the HB acceptor region (σ > 0.0082 e/ Å 2 ), as well as the non-polar region (−0.0082 e/Å 2 ≤ σ ≤ 0.0082 e/Å 2 ). Therefore, it can be concluded that the larger the S σ-C0.013 value of the ILs, the lower the toxicity of the ILs. The S EP-C36. 25 and S EP-C82.75 are the second and third crucial parameters, while their coefficients are 0.361 and −0.213, respectively. It should be mentioned that all the selected parameters are from cations, which illustrates that the cations of ILs have a vital influence on the toxicity of ILs, while the anions have little effect. This finding is consistent with the result that the cation makes the dominant contribution to toxicity for most commercial ILs [45].   Table S1) via the MLR model for both the training set as well as the test set are plotted in Figure 2, where it can be seen that the predicted values are close to the experimental results. Figure 3 compares the absolute relative deviation (ARD%) within the range of logEC50 (AChE) values from the MLR model. It is evident that the largest proportion of the predicted values falls in the range of 1-5%, which accounts for 48.13%. Similarly, the percentages of predicted data fall in the range of 5-10% and over 10%, accounting for 28.75% and 14.38%, respectively. Only 8.75% of the predicted ARD% values are in the range of 0-1%. These conclusions showed the good performance of the MLR model to predict the toxicity of ILs.   Table 1. Coefficients (C i ), parameters (S EP-h and S σ-k ) and the t values for Equation (1).
i  1 The electrostatic potential surface area (S EP ) at the electrostatic potential h; 2 The screening charge density distribution area (S σ ) at the screening charge density k; 3 t value stands for the importance of the parameter.
Experimental and predicted the logEC 50 values of ILs towards acetyl cholinesterase enzyme (AChE) (logEC 50 (AChE)) values of ILs (as shown in Supporting Information Table S1) via the MLR model for both the training set as well as the test set are plotted in Figure 2, where it can be seen that the predicted values are close to the experimental results. Figure 3 compares the absolute relative deviation (ARD%) within the range of logEC 50 (AChE) values from the MLR model. It is evident that the largest proportion of the predicted values falls in the range of 1-5%, which accounts for 48.13%. Similarly, the percentages of predicted data fall in the range of 5-10% and over 10%, accounting for 28.75% and 14.38%, respectively. Only 8.

Quantitative Prediction of Extreme Learning Machine (ELM) Model
Based upon the foundation of 11 descriptors selected via stepwise regression method and the data sets from the former model, a more effective nonlinear ELM model was established. In the ELM model, there are three layers, as shown in

Quantitative Prediction of Extreme Learning Machine (ELM) Model
Based upon the foundation of 11 descriptors selected via stepwise regression method and the data sets from the former model, a more effective nonlinear ELM model was established. In the ELM model, there are three layers, as shown in

Quantitative Prediction of Extreme Learning Machine (ELM) Model
Based upon the foundation of 11 descriptors selected via stepwise regression method and the data sets from the former model, a more effective nonlinear ELM model was established. In the ELM model, there are three layers, as shown in  Figure 5, in terms of the training set, the R 2 value slightly increases with the growth in the number of neurons, whereas the average absolute relative deviation (AARD%) considerably drops when the number of neurons is less than 45, beyond which it dramatically rises. The variation tendency of R 2 and AARD% for the test set are almost opposite to the growing number of neurons. Therefore, the best number of neurons (45) is acquired for the prediction the toxicity of ILs by the ELM model. to acquire the ideal number of neurons between the input variables and intermediate neurons. As shown in Figure 5, in terms of the training set, the R 2 value slightly increases with the growth in the number of neurons, whereas the average absolute relative deviation (AARD%) considerably drops when the number of neurons is less than 45, beyond which it dramatically rises. The variation tendency of R 2 and AARD% for the test set are almost opposite to the growing number of neurons. Therefore, the best number of neurons (45) is acquired for the prediction the toxicity of ILs by the ELM model.   to acquire the ideal number of neurons between the input variables and intermediate neurons. As shown in Figure 5, in terms of the training set, the R 2 value slightly increases with the growth in the number of neurons, whereas the average absolute relative deviation (AARD%) considerably drops when the number of neurons is less than 45, beyond which it dramatically rises. The variation tendency of R 2 and AARD% for the test set are almost opposite to the growing number of neurons. Therefore, the best number of neurons (45) is acquired for the prediction the toxicity of ILs by the ELM model. The extreme learning machine (ELM) comprising three parts-the input layer, hidden layer, and output layer-was established based on the same input descriptors. The parameters wij and bias from the input layer to hidden layer were generated randomly and the parameters wjk were coefficients from the hidden layer to output layer.   Table S1) in both the training and test sets with the experimental logEC 50 (AChE) values are shown in the Figure 6, which reveals that the predicted values match well with the experimental data. Figure 7 illustrates the proportion of logEC 50 (AChE) values in the different ARD% ranges of the ELM model. It is clear that the largest percentage of ARD% from the logEC 50 (AChE) values is 54.38% within the range of 1-5%, followed by the 22.5% in the range of 0-1%. The percentage of ARD% of the logEC 50 (AChE) values in the range of 5-10% and over 10% account for 18.75% and 4.38%, respectively. These results showcase the validity and reliability of the ELM model to predict the toxicity of ILs. The comparisons of the predicted logEC50 (AChE) values by ELM (as shown in Supporting Information Table S1) in both the training and test sets with the experimental logEC50 (AChE) values are shown in the Figure 6, which reveals that the predicted values match well with the experimental data. Figure 7 illustrates the proportion of logEC50 (AChE) values in the different ARD% ranges of the ELM model. It is clear that the largest percentage of ARD% from the logEC50 (AChE) values is 54.38% within the range of 1-5%, followed by the 22.5% in the range of 0-1%. The percentage of ARD% of the logEC50 (AChE) values in the range of 5-10% and over 10% account for 18.75% and 4.38%, respectively. These results showcase the validity and reliability of the ELM model to predict the toxicity of ILs.     The comparisons of the predicted logEC50 (AChE) values by ELM (as shown in Supporting Information Table S1) in both the training and test sets with the experimental logEC50 (AChE) values are shown in the Figure 6, which reveals that the predicted values match well with the experimental data. Figure 7 illustrates the proportion of logEC50 (AChE) values in the different ARD% ranges of the ELM model. It is clear that the largest percentage of ARD% from the logEC50 (AChE) values is 54.38% within the range of 1-5%, followed by the 22.5% in the range of 0-1%. The percentage of ARD% of the logEC50 (AChE) values in the range of 5-10% and over 10% account for 18.75% and 4.38%, respectively. These results showcase the validity and reliability of the ELM model to predict the toxicity of ILs.

Comparison between Two Quantitative Structure-activity Relationships (QSAR) Models
As discussed above, satisfactory results are obtained for the two models having high R 2 values. Detailed comparison of the statistical parameters of the training set, test set, and total data of different QSAR models are given in Table 2. It is evident from Table 2 that the R 2 values of the training set, test set, and total data using the ELM model are 0.969, 0.950, and 0.964, respectively, which are comparatively higher than of the R 2 values of the MLR model, i.e., 0.920 for the training set, 0.914 for the test set, and 0.917 for the total set. This trend is valid for both AARD% and RMSE, meaning that the non-linear ELM shows better predictive performance than MLR. For example, AARD% values for the training set by the MLR and ELM models are 5.18 and 2.86, while the RMSE values are 1.534 and 0.950, respectively. A similar situation can easily be observed for the test set and all data sets using two models. Undoubtedly, the ELM algorithm is a powerful prediction tool with better accuracy.
In addition, according to the effective criteria used in the literature [47][48][49][50], a QSAR model with acceptable predictive ability should satisfy the conditions in Equations (2)- (5). As can be seen from Table 3, all the coefficients for both the MLR and ELM models satisfy the above-mentioned criteria, illustrating the good predictive power of the models in this study. To further verify the reliability of the models established in this study, we also predicted the toxicity values of five new ILs. The specific prediction results can be found in the Supporting Information (Table S2). It can be seen that the predicted values agree well with the experimental values, indicating that the models we built are reliable, and the ELM model has better prediction performance than the MLR model. We compared our estimated results with the reported values, as summarized in Table 4. Our two QSAR models, MLR (R 2 = 0.917) and ELM (R 2 = 0.964), have higher R 2 values compared with those of other models, except the Multiple linear regression (MLP) model (R 2 = 0.973). Although MLP exhibits the highest value of R 2 , our models used less input parameters. Our models show AARD% values of 6.35 for MLR and 3.29 for ELM, which fall in the range of the reported values, i.e., 2.8 to 9.35 (Table 4). In addition, as can be seen from Table 4, the RMSE values of our models are also relatively low compared to the literature [45]. Taking these factors into consideration, our models are reliable and meaningful for accurate prediction of the toxicity of ILs towards AChE.

Framework
As we know, quantitative structure-activity relationships (QSAR) can be implemented in the form of linear and non-linear algorithms. The former are relatively easy and intuitively present the impact of each parameter upon the properties, while the latter are considered to be suitable for accurate prediction in real-world scenarios. In this work, the toxicity of ionic liquids (ILs) was predicted according to the scheme shown in Figure 4.

Dataset
Herein, toxicity data in acetyl cholinesterase enzyme (AChE) of 160 ILs was collected from the widely acknowledged ionic liquid (IL) database [40,51,52]. Concentration for 50% of maximal effect (EC 50 ) values (µM) for AChE, written as log EC 50 , were converted into the form of a logarithm of half maximal effective concentration. The whole data set was divided into two parts: a training set of 80% ILs to build the model and a set of 20% ionic liquids (ILs) to evaluate the model's predictability. Detailed information on the ILs studied in this paper, including the names, SMILES strings, log 10 (EC 50 ) values, etc., can be found in Supporting Information.

Calculation of Descriptors
As established, the related compounds are represented by theoretical molecular descriptors, which are of key significance for the predictive performance of the models [40]. The screening charge density distribution area (S σ ), obtained from the histogram function of the σ-profiles given by the conductor-like screening model for real solvent (COSMO-RS) computation, is an a priori quantum chemistry descriptor that could quantitatively represent the molecule's polar surface screen charge on the polarity scale [36]. The electrostatic potential surface area (S EP ) of molecules refers to the surface areas of the molecules within the interval of the different electrostatic potential, and it can be used for the prediction of the material's properties due to its rich information at the electron level. In this work, two descriptors S EP and S σ-profiles were employed to develop models and were calculated by different programs based on the optimal structures of cations and anions. First of all, the S EP files of the corresponding cations were calculated using Multiwfn software with an electrostatic potential range of 0~150 kcal/mol, and was kept the same for the anions in the range of −150~0 kcal/mol, with a step size of 0.5 kcal/mol [53]. For example, the S EP-C88.75 means the molecular surface areas in the electrostatic potential scale of 88.5-89 kcal/mol. Then, the conductor-like screening model (COSMO) files of cations and anions were obtained using Gaussian 03 software [54]. Finally, S σ-profiles files of corresponding cations and anions were calculated with a σ-profile ranging from −0.03 to 0.03 e/Å 2 and with a step size of 0.001 e/Å 2 , respectively. We chose cation (1-(cyanomethyl)-3-methylimidazolium) and anion (1-octylsulfate) to present the representative S EP and S σ-profiles , and the results were depicted in Figure 8; Figure 9, respectively. As shown in Figure 8, the darker shade of red for 1-(cyanomethyl)-3-methylimidazolium or blue for 1-octylsulfate indicates stronger polarity. Figure 9 presents a similar situation of electrostatic potential surfaces. used for the prediction of the material's properties due to its rich information at the electron level. In this work, two descriptors SEP and Sσ-profiles were employed to develop models and were calculated by different programs based on the optimal structures of cations and anions. First of all, the SEP files of the corresponding cations were calculated using Multiwfn software with an electrostatic potential range of 0~150 kcal/mol, and was kept the same for the anions in the range of −150~0 kcal/mol, with a step size of 0.5 kcal/mol [53]. For example, the SEP-C88.75 means the molecular surface areas in the electrostatic potential scale of 88.5-89 kcal/mol. Then, the conductor-like screening model (COSMO) files of cations and anions were obtained using Gaussian 03 software [54]. Finally, Sσ-profiles files of corresponding cations and anions were calculated with a σ-profile ranging from −0.03 to 0.03 e/Å 2 and with a step size of 0.001 e/Å 2 , respectively. We chose cation (1-(cyanomethyl)-3-methylimidazolium) and anion (1-octylsulfate) to present the representative SEP and Sσ-profiles, and the results were depicted in Figure 8; Figure 9, respectively. As shown in Figure  8, the darker shade of red for 1-(cyanomethyl)-3-methylimidazolium or blue for 1-octylsulfate indicates stronger polarity. Figure 9 presents a similar situation of electrostatic potential surfaces.

Multiple Linear Regression (MLR)
Multiple Linear Regression (MLR) is widely harnessed in quantitative structure-activity relationships (QSAR) to determinate the relationship between a dependent variable and various independent variables. The linear combination relationship of the independent variables is described by a set of coefficients in this model. The structural characteristics to the toxicity can be linked as follows: where b0 is the intercept and b1, b2, and bn are regression coefficients of the corresponding descriptors; n is the number of descriptors used in the equation in order to figure out the optimal regression model; x denotes the descriptor used to describe the chemical structure of the compound. MLR consists of stepwise selection of descriptors, which combines the forward and backward procedures.

Extreme Learning Machine (ELM)
Extreme learning machine (ELM) architecture was proposed for the first time by Huang et al., as a single hidden layer feed-forward neural network that has a very strong learning ability for application in basic classification and regression [55]. The most key characteristic of ELM is that it has faster learning speed compared with the traditional learning algorithms. In addition, ELM can be easily used and is a fast-learning and effective algorithm for the feed-forward neural network with single hidden layer interpolation capability and universal approximation capability, which are the two important characteristics of the feed-forward neural network. To some extent, ELM is similar to an artificial neural network (ANN), and its weights and biases in the first layer are randomly initialized and kept constant, while the weights of the second layer are selected by diminishing the least squared error. In practical applications, the training data set is mainly concerned with the specific issues [56]. The data sets include actual results and the related factors. Through iterations to finish the learning process, the impact factors and corresponding results were

Multiple Linear Regression (MLR)
Multiple Linear Regression (MLR) is widely harnessed in quantitative structure-activity relationships (QSAR) to determinate the relationship between a dependent variable and various independent variables. The linear combination relationship of the independent variables is described by a set of coefficients in this model. The structural characteristics to the toxicity can be linked as follows: where b 0 is the intercept and b 1 , b 2 , and b n are regression coefficients of the corresponding descriptors; n is the number of descriptors used in the equation in order to figure out the optimal regression model; x denotes the descriptor used to describe the chemical structure of the compound. MLR consists of stepwise selection of descriptors, which combines the forward and backward procedures.

Extreme Learning Machine (ELM)
Extreme learning machine (ELM) architecture was proposed for the first time by Huang et al., as a single hidden layer feed-forward neural network that has a very strong learning ability for application in basic classification and regression [55]. The most key characteristic of ELM is that it has faster learning speed compared with the traditional learning algorithms. In addition, ELM can be easily used and is a fast-learning and effective algorithm for the feed-forward neural network with single hidden layer interpolation capability and universal approximation capability, which are the two important characteristics of the feed-forward neural network. To some extent, ELM is similar to an artificial neural network (ANN), and its weights and biases in the first layer are randomly initialized and kept constant, while the weights of the second layer are selected by diminishing the least squared error. In practical applications, the training data set is mainly concerned with the specific issues [56]. The data sets include actual results and the related factors. Through iterations to finish the learning process, the impact factors and corresponding results were put into the ELM for training during the training process. Afterward, using the trained ELM, we only needed to input the training data set, similar to the influencing factors. It has been reported that ELM is feasible in many fields, such as protein sequence classification, regression problems, etc., to easily achieve good performance with fast speed [57]. Evidently, increasingly more research in this regard shows that excellent predictive performance can be achieved using the non-linear model developed by the ELM algorithm. More information on the theory and application of ELM can be found elsewhere [55][56][57].

Evaluation of Quantitative Structure-activity Relationships (QSAR)
In this work, several parameters were defined to evaluate the model performance, as measured by different metrics: adjusted squared correlation coefficient (R 2 adjusted ), squared correlation coefficient (R 2 ), average absolute relative deviation (AARD%), root-mean-square error (RMSE), absolute relative deviation (ARD%), the coefficients of calculated values versus experimental values (R 2 0 ), experimental values versus calculated values (R 2 0 ), slope k of the calculated versus experimental values, and slope k' of the experimental versus calculated values. The corresponding definition could be expressed in Equations (7)- (15).
where y means logEC 50 (AChE) value of ILs while the " cal " and " exp " superscripts stand for calculated and experimental values, respectively. N p denotes the number of data points for the corresponding dataset;ȳ m represents the average log EC 50 (AChE) value of ILs for all of the selected data.

Conclusions
Safe operation and benign utilization are crucial for ILs, therefore the scope of accurately predicting the toxicity of ILs is of vital interest for scientific and technological advancement. In this work, detailed data analyses of the toxicity of ILs were made based on the established database and novel QSAR models were developed for toxicity prediction of 160 ILs in AChE. Good correlation between the estimated and original data was observed for the training set, test set, and the total set. It was found that the newly used ELM model gave the highest R 2 values and the lowest values for AARD% and RMSE. The results validated the superior performance of ELM for estimation of toxicity of ILs. This study will be helpful in design and screening of environmentally friendly commercial ILs.