Machine Learning for Ionic Liquid Toxicity Prediction

Abstract: In addition to proper physicochemical properties, low toxicity is also desirable when seeking suitable ionic liquids (ILs) for specific applications. In this context, machine learning (ML) models were developed to predict IL toxicity towards the leukemia rat cell line IPC-81 based on an extended experimental dataset. Following a systematic procedure including framework construction, hyper-parameter optimization, model training, and evaluation, the feedforward neural network (FNN) and support vector machine (SVM) algorithms were adopted to predict the toxicity of ILs directly from their molecular structures. Based on the ML structures optimized by five-fold cross validation, two ML models were established and evaluated using IL structural descriptors as inputs. Both models exhibited high predictive accuracy, with the SVM model performing slightly better than the FNN model. For the SVM model, the determination coefficients were 0.9289 and 0.9202 for the training and test sets, respectively. The satisfactory predictive performance and generalization ability make our models useful for the computer-aided molecular design (CAMD) of environmentally friendly ILs.


Introduction
Ionic liquids (ILs) are molten salts consisting solely of organic cations and inorganic or organic anions. ILs present several unique and benign characteristics, such as a low melting point, low volatility, and high thermal stability. For these reasons, ILs have attracted extensive attention in both academia and industry as alternatives to traditional organic solvents. In fact, they have been used in many chemical processes [1][2][3][4] to meet specific requirements while improving process efficiency. In spite of their superior properties and large application prospects, many ILs present potential negative effects on the ecosystem, which could also impact human health. Toxicity is therefore a significant factor to be considered in IL selection, and the toxicological assessment of ILs is essential for the development of sustainable IL-based chemical processes.
In order to find ILs showing the desired low toxicity, experimental measurement is the most direct and effective way. However, the number of new IL structures is increasing rapidly due to the numerous feasible cation-anion combinations. This makes experimental measurement a time-consuming, resource-intensive, and even impractical method. In contrast, mathematical models based on existing IL structures and toxicity data are an efficient alternative. Researchers have built mathematical models for predicting various properties of ILs, including electrical conductivity [5], thermal decomposition temperature [6], critical properties [7], and water solubility [8]. These models are regressed from experimental property data and can provide satisfactory predictions of the properties of new ILs that are not used in model training. Such well-established property models have been widely used in the computer-aided molecular design (CAMD) of IL solvents [9][10][11][12][13][14], where optimal ILs possessing desirable properties are identified based on the property models. Unfortunately, up to now, mathematical models for IL toxicity have been explored far less than those for other IL properties.
In the literature, the mortality of the leukemia rat cell line IPC-81 is frequently used to quantitatively indicate the toxicity of ILs [15,16]. A few researchers have developed models to predict the toxicity of ILs against IPC-81 from IL structures. Based on a dataset of 173 experimental samples, Yan et al. [17] established a structure-activity relationship model to predict IL toxicity against IPC-81 using multiple linear regression (MLR) and topological descriptors, with a mean absolute error (MAE) of 0.226. Later, Sosnowska et al. [18] constructed a larger IPC-81 dataset containing 304 experimental samples and, on this basis, used structure-related descriptors to develop another MLR model with an MAE of 0.3779. Due to the utilization of a larger database, this model, in principle, shows improved applicability. Based on this updated dataset, Wu et al. [19] recently reported another MLR model using IL fingerprint descriptors, which resulted in an improved predictive accuracy (MAE = 0.34). Despite the effectiveness of MLR, there is still large potential for further improvement of model accuracy.
As proven by many studies [20][21][22][23][24][25], machine learning (ML) is an efficient and promising approach for building quantitative structure-property relationship (QSPR) models to predict various properties of chemical compounds. Until now, ML techniques have found wide application and great success in predicting IL properties, including melting point [26], CO2 solubility [27], viscosity [28], etc. Despite this popularity, there is still a lack of accurate ML models for predicting IL toxicity.
Considering the above aspects, we herein explore ML models to predict the toxicity of ILs against IPC-81 by applying a systematic procedure covering model framework construction, hyper-parameter optimization, training, and evaluation. Since ML is a data-driven method, an extended experimental dataset is always preferred for building a reliable ML model. In this work, we first collect experimental toxicity data for 355 ILs. Two different ML models are then developed, one using the feedforward neural network (FNN) algorithm and the other using the support vector machine (SVM) algorithm. The structures of the ML models, characterized by their hyper-parameters, are optimized with five-fold cross validation to improve the model robustness. The corresponding models are then established with the training dataset, followed by an evaluation with an external test set to measure their generalization abilities.

Experimental Data
A dataset of the toxicity of 355 ILs against IPC-81 was collected from the literature [18,29]. The toxicity was evaluated in terms of the logarithm of the half maximal effective concentration (logEC50), and the IL structures were depicted by SMILES (simplified molecular-input line-entry system) strings obtained from PubChem [30]. Therefore, the logEC50 dataset for the 355 ILs was used for predictive modeling. The detailed data are tabulated in Table S1 (Supporting Information) together with the corresponding full IL names and SMILES strings.
To translate the text-form IL structure (i.e., the SMILES string) into an ML-acceptable type (i.e., numerical values), IL descriptors were created by a feature extraction algorithm [31] to characterize the IL structure. Specifically, an IL was treated as a molecule in which the cation and anion are connected with each other, and its SMILES string was parsed by the RDKit cheminformatics tool [32] to obtain detailed structural and chemical information. A set of substructures (subgroups) were defined as descriptors, and their appearance frequencies in the IL molecule were used as model inputs to characterize the structure of ILs. Analyzing the structures of the 355 ILs yielded a total of 42 structural descriptors, including 9 cation descriptors, 9 anion descriptors, and 24 general descriptors. These structural descriptors and their occurrence frequencies in each IL are provided in Tables S2 and S3 (Supporting Information), respectively.
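The vectorization step described above can be sketched as follows. This is an illustrative simplification, not the authors' code: in the paper, the subgroup frequencies come from RDKit substructure matching on the parsed SMILES string, whereas the subgroup names and counts below are hypothetical placeholders.

```python
# Illustrative sketch: turning per-IL subgroup counts into a fixed-length
# descriptor vector usable as ML input. The subgroup labels here are
# hypothetical; the paper defines 42 descriptors (see Table S2).

SUBGROUPS = ["CH3", "CH2", "aromatic_N+", "F", "S(=O)2", "ring_C"]

def descriptor_vector(counts, subgroups=SUBGROUPS):
    """Map a {subgroup: frequency} dict onto a fixed-order frequency vector."""
    return [counts.get(g, 0) for g in subgroups]

# e.g. an imidazolium-like cation paired with a simple anion (toy counts)
il_counts = {"CH3": 2, "CH2": 3, "aromatic_N+": 1, "ring_C": 3}
print(descriptor_vector(il_counts))  # [2, 3, 1, 0, 0, 3]
```

Keeping a fixed descriptor order is what allows the same vector positions to feed the same input neurons (FNN) or kernel dimensions (SVM) for every IL.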
To develop the ML models for IL toxicity prediction, a systematic procedure covering framework construction (step A), hyper-parameter optimization (step B), model establishment (step C), and model evaluation (step D) was deployed (see Figure 1). The full dataset was randomly divided into two parts: a training set (80% of the dataset, 284 samples) to develop the ML model and a test set (20% of the dataset, 71 samples) to evaluate the predictive accuracy of the developed model on unseen data. To obtain an ML model with high robustness, cross validation should be employed to optimize the structural parameters (i.e., hyper-parameters). Herein, the five-fold cross validation strategy was adopted to determine the optimal hyper-parameters (step B in Figure 1). First, the training set was equally divided into five parts. Then, for each combination of hyper-parameters, the modeling process was performed five times on the selected four parts (marked green in Figure 1), and validation was carried out five times on the remaining part (marked yellow). The optimal hyper-parameters were determined from the variation of the averaged error of the validation results. Finally, with the hyper-parameters fixed, the ML models were developed using the entire training set (step C) and evaluated with the test set (step D in Figure 1). The ILs in the randomly divided training and test sets are provided in Table S1 (Supporting Information).
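The five-fold procedure of step B can be sketched in a few lines. This is a minimal illustration with numpy only; `mean_model` and the synthetic data are placeholders, whereas in the paper each fold trains a full FNN or SVM for one hyper-parameter setting.

```python
# Minimal sketch of five-fold cross validation: average validation RMSE
# over the five held-out parts for one hyper-parameter combination.
import numpy as np

def five_fold_rmse(X, y, fit_predict, k=5, seed=0):
    """Average validation RMSE over k folds for one hyper-parameter setting."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    rmses = []
    for i in range(k):
        val = folds[i]                                                   # held-out part
        train = np.concatenate([folds[j] for j in range(k) if j != i])   # remaining four parts
        y_pred = fit_predict(X[train], y[train], X[val])
        rmses.append(np.sqrt(np.mean((y[val] - y_pred) ** 2)))
    return float(np.mean(rmses))

# toy example: a "model" that always predicts the training-set mean
X = np.random.default_rng(1).normal(size=(40, 3))
y = np.random.default_rng(2).normal(size=40)
mean_model = lambda Xtr, ytr, Xval: np.full(len(Xval), ytr.mean())
print(round(five_fold_rmse(X, y, mean_model), 3))
```

Repeating this evaluation over a grid of hyper-parameter combinations and keeping the combination with the lowest averaged RMSE is exactly the selection logic behind Figures 3 and 5.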

FNN Modeling
Artificial neural networks are probably the most widespread ML technique in mathematical modeling, image recognition, process control, etc., owing to their flexible and configurable structures and connections. Among them, the feedforward neural network (FNN) is a common type, featuring a regular layer structure and one-directional transmission of information. Due to its simpler structure and fewer adjustable parameters, an FNN is computationally efficient in learning and making predictions compared to other neural networks, such as the convolutional neural network and the recurrent neural network. Moreover, it is more suitable for handling one-dimensional, time-independent input data, as is the case in this work.
In this section, a two-hidden-layer FNN (see Figure 2) is constructed to correlate the logEC50 values with the IL structures. The appearance frequencies of the 42 subgroups are loaded into the neurons of the input layer and, after being processed by the two hidden layers, the predicted logEC50 is given by the output layer. Two activation functions, "sigmoid" and "softplus", are assigned to the two hidden layers to achieve non-linear data transformations, enabling the FNN to model complex mathematical relationships.
The FNN framework was implemented with the PyTorch [33] toolkit in Python. The root mean square error (RMSE) was employed as the loss function to monitor the model performance. Moreover, the Adam optimizer [34] was used to adjust the learning rate and model parameters, coupled with the back-propagation algorithm [35], which computes the gradient of the loss function and guides the optimization process.
Since the number of hidden layers was pre-defined, the number of neurons in each layer is an important hyper-parameter of the FNN structure that determines the model performance. The number of neurons in the input layer depends on the dimension of the descriptors used (42 in this work), and the neuron number of the output layer equals one, corresponding to the scalar logEC50. Hence, only the numbers of neurons in the two hidden layers needed to be optimized by the five-fold cross validation. As can be seen in Figure 3, the lowest average RMSE (0.5696) of the five-round independent validations was obtained with 10 and 6 neurons in hidden layers 1 and 2, respectively, meaning that this FNN structure was the most robust and promising one for predicting the toxicity of ILs against IPC-81.
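The optimized 42-10-6-1 architecture with sigmoid and softplus activations can be written as a plain forward pass. The sketch below uses numpy with random placeholder weights purely to show the data flow; the trained weight and bias values are those reported in Table S4, and the PyTorch implementation used in the paper wraps the same computation in trainable layers.

```python
# Forward pass of the optimized FNN structure (42 -> 10 -> 6 -> 1) with
# "sigmoid" and "softplus" hidden-layer activations. Weights are random
# placeholders, not the trained values from Table S4.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softplus = lambda z: np.log1p(np.exp(z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(42, 10)), np.zeros(10)  # input layer -> hidden layer 1
W2, b2 = rng.normal(size=(10, 6)), np.zeros(6)    # hidden layer 1 -> hidden layer 2
W3, b3 = rng.normal(size=(6, 1)), np.zeros(1)     # hidden layer 2 -> output layer

def predict_logEC50(x):
    h1 = sigmoid(x @ W1 + b1)      # first non-linear transformation
    h2 = softplus(h1 @ W2 + b2)    # second non-linear transformation
    return (h2 @ W3 + b3)[0]       # scalar logEC50 prediction

x = np.zeros(42)
x[[0, 3, 7]] = [2, 1, 3]           # a hypothetical 42-dimensional descriptor vector
print(predict_logEC50(x))
```

Training then amounts to adjusting W1, b1, W2, b2, W3, and b3 so that the RMSE loss between predicted and experimental logEC50 is minimized.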
After the numbers of neurons in the two hidden layers were determined, the FNN model was established with all of the training samples and then evaluated with the test samples. Figure 4 plots the predicted logEC50 of the samples in the training and test sets against the corresponding experimental logEC50. In addition to the RMSE, the mean absolute error (MAE) and the coefficient of determination (R²) were also used to quantify the performance of the FNN model. The RMSE, MAE, and R² were 0.2906, 0.2111, and 0.9227 for the training set, and 0.3732, 0.3028, and 0.8917 for the test set, respectively. The satisfactory and close metrics obtained for the training and test sets indicate a generally good predictive performance of the model for the toxicity of ILs against IPC-81. For the established FNN model, prediction results are provided in Table S1, and the weight and bias values of each layer are summarized in Table S4 (Supporting Information).
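For completeness, the three evaluation metrics used throughout this work follow directly from their definitions; the y_true/y_pred values below are toy numbers, not results from the paper.

```python
# RMSE, MAE, and coefficient of determination (R^2), implemented from
# their standard definitions.
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
# MAE = 0.15 and R^2 = 0.98 for these toy values
```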

SVM Modeling
A support vector machine (SVM) is another popular ML algorithm in data analysis and predictive modeling. For regression tasks, the SVM algorithm fits the training data by creating hyperplanes in the high-dimensional descriptor space; an accurate SVM model keeps the samples as close as possible to these hyperplanes.
The SVM framework was implemented with the scikit-learn [36] toolkit in Python. The Gaussian radial basis function was adopted as the kernel function to map the input descriptors into a high-dimensional space for complex nonlinear modeling. The RMSE was again used as the loss function to measure the model's predictive capability. Two important hyper-parameters of the SVM algorithm, the regularization parameter C and the tolerance ε, were optimized by the five-fold cross validation. As shown in Figure 5, the lowest average RMSE (0.4591) of the five independent validations was achieved when C and ε were 30 and 0.11, respectively.

Based on the optimized hyper-parameters highlighted in Figure 5, the final SVM model was developed with all the training samples and then evaluated with the test samples. The comparison between the experimental and SVM-predicted logEC50 is shown in Figure 6. Except for a few outliers, most samples exhibited low deviations. The RMSE, MAE, and R² were 0.2787, 0.1762, and 0.9289 for the training set, and 0.3204, 0.2628, and 0.9202 for the test set, respectively. These statistical indicators demonstrate the good predictive capability of the SVM model. For the established SVM model, prediction results are also provided in Table S1, and model parameters are summarized in Tables S5 and S6 (Supporting Information).
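The setup described above maps directly onto scikit-learn's SVR class. The sketch below shows the model construction with the cross-validated hyper-parameters (C = 30, ε = 0.11); the data are synthetic stand-ins for the 42-dimensional IL descriptor vectors and logEC50 values, not the paper's dataset.

```python
# Sketch of the SVM regression setup: RBF kernel with the cross-validated
# hyper-parameters from Figure 5. Training data are synthetic placeholders.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 42))                   # toy descriptor matrix
y_train = X_train[:, 0] + 0.1 * rng.normal(size=50)   # toy logEC50 targets

model = SVR(kernel="rbf", C=30, epsilon=0.11)
model.fit(X_train, y_train)

X_new = rng.normal(size=(3, 42))
print(model.predict(X_new))                           # predicted logEC50 for new ILs
```

In SVR, C controls the trade-off between flatness of the fitted function and tolerance of training errors, while ε defines the tube within which deviations incur no penalty, which is why both required tuning.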

Model Comparison
Both the FNN and SVM models showed good predictive performance, as proven by the statistical results. Sorted in ascending order, the absolute errors between the ML-predicted and experimental logEC50 for the 71 ILs in the test dataset are shown in Figure 7. As indicated, the SVM model exhibited smaller absolute errors, revealing its higher predictive accuracy for IL toxicity.

The two ML models established in this work were also compared with the reported models developed by Sosnowska et al. [18] and Wu et al. [19]. As shown in Table 1, the FNN and SVM models presented much lower RMSE and MAE as well as a much higher R² than the two previous models, indicating their higher predictive accuracy. Moreover, they have improved applicability due to the extended dataset used in the model development. Comparing the two models developed in this work, the SVM model presented relatively lower RMSE and MAE as well as a higher R² than the FNN model. Therefore, the SVM model can generally provide more accurate predictions of the toxicity of ILs against IPC-81, which confirms the conclusion drawn from Figure 7.

Conclusions
In this work, a dataset of 355 experimental IL logEC50 values was constructed to quantify the toxicity of ILs against IPC-81. Structural descriptors were generated with a feature extraction algorithm based on the SMILES strings of the ILs. Two ML frameworks (FNN and SVM) were built, and their structural parameters were optimized with five-fold cross validation. After determining the best set of structural parameters, the ML models were regressed using the training dataset, and the established models were evaluated with the test data. It was observed that the SVM model performed slightly better than the FNN model; however, compared with previously reported models, both models presented better predictive performance and improved applicability. Considering their satisfactory predictions of IL toxicity, the established ML models can be incorporated into computer-aided molecular design (CAMD) frameworks in order to identify suitable ILs that show low toxicity while meeting other requirements defined by specific applications.
Supplementary Materials: The following are available online at https://www.mdpi.com/2227-9717/9/1/65/s1: Table S1: Details of the employed dataset and ML prediction results. Table S2: Structural descriptors of ILs for ML model development. Table S3: Occurrence frequencies of the descriptors in IL molecules. Table S4: Weight and bias values of the FNN model. Table S5: Support vectors of the SVM model. Table S6: Fitting parameters of the SVM model.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.