Estimation of the Toxicity of Different Substituted Aromatic Compounds to the Aquatic Ciliate Tetrahymena pyriformis by QSAR Approach

Nowadays, quantitative structure–activity relationship (QSAR) methods have been widely performed to predict the toxicity of compounds to organisms due to their simplicity, ease of implementation, and low hazards. In this study, to estimate the toxicities of substituted aromatic compounds to Tetrahymena pyriformis, the QSAR models were established by the multiple linear regression (MLR) and radial basis function neural network (RBFNN). Unlike other QSAR studies, according to the difference of functional groups (−NO2, −X), the whole dataset was divided into three groups and further modeled separately. The statistical characteristics for the models are obtained as the following: MLR: n = 36, R2 = 0.829, RMS (root mean square) = 0.192, RBFNN: n = 36, R2 = 0.843, RMS = 0.167 for Group 1; MLR: n = 60, R2 = 0.803, RMS = 0.222, RBFNN: n = 60, R2 = 0.821, RMS = 0.193 for Group 2; MLR: n = 31 R2 = 0.852, RMS = 0.192; RBFNN: n = 31, R2 = 0.885, RMS = 0.163 for Group 3, respectively. The results were within the acceptable range, and the models were found to be statistically robust with high external predictivity. Moreover, the models also gave some insight on those characteristics of the structures that most affect the toxicity.


Introduction
With the rapid development of science and technology, tens of thousands of new chemicals are synthesized and widely used in all walks of life every day. However, as we all know, if chemicals are used or handled incorrectly, they may enter the aquatic environment or bio-accumulate in the food chain, where they may adversely impact the people, ultimately. One of the current interests in medicinal chemistry, environmental sciences, and especially for toxicology, is to rank and establish the chemical substances with respect to their potential hazardous effects on humans, wildlife, and aquatic flora and fauna [1]. Among the vast organic matter, it is noteworthy that the substituted aromatic compounds [2][3][4][5][6][7][8] occupy important positions, since they are produced in large quantities and released into the environment as a result of their wide use in agriculture and industry, and are widely distributed in air, natural water, waste water, soil, sediment, and living organics [9,10]. In addition, recent studies have proved that the substituted aromatic compounds are also a kind of biotoxic environmental pollutant, and even have the effects of carcinogenesis and gene mutation on organisms [10,11]. Therefore, studies on the properties of substituted aromatics have important significance.
Up to now, both experimental [12][13][14][15] and theoretical methods [16,17] have been used to evaluate kinds of substituted aromatic compounds for their different toxicities. Also, it is well known that the theoretical predictions of properties or activities by quantitative structure-activity relationship (QSAR) studies have been widely adopted and applied since the 90s, because of their advantages, such as rapidness, easiness, sensitiveness, and cheapness [11]. The QSAR method has been widely applied in different fields, including physical chemistry, pharmaceutical chemistry, environmental chemistry, toxicology, and other research fields [18]. It has been proven that the use of QSAR modeling for toxicological predictions would help to determine the potential adverse effects of chemical entities in risk assessment.
For a long time, a lot of meaningful research focusing on the toxicity of substituted aromatic compounds by QSAR approach have been carried out. In 1982, Schultz et al. tried to perform the QSAR study between the cellular response to Tetrahymena pyriformis and molecular connectivity indexes for a series of 24 mono-and dinitrogen heterocyclic compounds. In this study, the authors established a better model than before, and pointed out that toxicity increases with an increase in the number of atoms and degree of methylation per compound, and that toxicity decreases with an increase in nitrogen substitution [1]. In 1998, Cronin et al. established several QSAR models focusing on a dataset of 42 alkyl-and halogen-substituted nitro-and dinitrobenzenes to Tetrahymena pyriformis [19]. They found that the nitrobenzenes were thought to elicit their toxic response through multiple (and mixed) mechanisms by one or two molecule descriptor models. In 2001, in order to compare the differences among kinds of QSAR model-building methods, Cronin and Schultz developed QSAR studies for the toxicity of 268 aromatic compounds in the Tetrahymena pyriformis growth inhibition assay [16]. In their study, they not only compared the influence of different descriptors on the models, but also the Bayesian regularized neural network (BRANN) and partial least-squares (PLS) analysis to build the models. In the following year, the same authors performed also the same study on a dataset of phenolic toxicity data to Tetrahymena pyriformis [17]. The above works gave us some guidelines or directions on how to build better models on the toxicities to this group of compounds. Netzeva et al. developed relative simple QSARs models (one or two descriptors) for the acute toxicity of a dataset of 77 aromatic aldehydes to the ciliate Tetrahymena pyriformis using mechanistically interpretable descriptors [20]. They revealed that the octanol/water partition coefficient (log K OW ) is the most important descriptor, and the models would be improved by using another electronic descriptor. Roy et al. performed a QSAR studies on the toxic potency to Tetrahymena pyriformis of a dataset of 174 aromatic compounds (phenols, nitrobenzenes, and benzonitriles) using electrophilicity index [21]. In this study, the compounds in the dataset were divided into the electron donor and acceptor group, and they stated that electrophilicity indices, along with the total Hartree-Fock energy, can be used to build the model perfectively. Later, the performances of the linear and nonlinear models were estimated by Devillers et al. using a structurally heterogeneous set of 200 phenol derivatives on Tetrahymena pyriformis. In this study, the authors pointed out the superiority of the nonlinear methods over the linear ones to find complex structure-toxicity relationships among large sets of structurally diverse chemicals [22]. Tetko et al. gave studies on the applicability domain and the influence of the overfitting in the QSAR model building process by the toxicity dataset against Tetrahymena pyriformis [23]. The hierarchical technology for QSAR was performed using 95 diverse nitroaromatic compounds against the ciliate Tetrahymena pyriformis [24]. Zarei et al. developed a model for the prediction of the toxicity of 268 substituted benzene compounds including phenols, monosubstituted nitrobenzenes, multiply substituted nitrobenzenes and benzonitriles to T. pyriformis using bee algorithm (BA) for selecting descriptors and adaptive neuro-fuzzy inference system (ANFIS) for building model [25]. A molecular structural characterization (MSC) method named molecular vertexes correlative index (MVCI) was successfully used to describe the structures of 30 substituted aromatic compounds, and the results suggested good stability and predictability of the QSAR models [26]. Comparative molecular field (CoMFA), molecular similarity index analysis (CoMSIA), and density functional theory (DFT) methods were used to establish QSAR models for analyzing and predicting the toxicities of 31 substituted thiophenols [27]. And later, Salahinejad et al. also used the CoMFA, CoMSIA, and VolSurf techniques to develop valid and predictive models able to estimate the toxicity of substituted benzenes toward T. pyriformis. In the paper, they confirmed that in addition to hydrophobic effects, electrostatic and H-bonding interactions also play important roles in the toxicity of substituted benzenes, as well as that the information obtained from CoMFA and CoMSIA 3-D contour maps could be useful to explain the toxicity mechanism of substituted benzenes [28]. The linear (MLR) and nonlinear statistical (RBFNN) methods were used by us to build a reliable, credible, and fast QSAR model for the prediction of mixture toxicity of non-polar narcotic chemicals, including 9 PFCAs, 12 alcohols, and 8 chlorobenzenes and bromobenzenes. The predictive values are in good agreement with the experimental ones [18]. In the same way, recursive neural networks (RNN) and multiple linear regression (MLR) methods were also employed to build models for prediction of the toxicity values of 69 benzene derivatives, both methods provided good results as compared to other studies available in the literature [29]. To build a reliable and predictive QSAR model, a genetic algorithm along with partial least square (GA-PLS) was employed to select the optimal subset of descriptors that significantly contribute to the toxicity of 45 nitrobenzene derivatives to Tetrahymena pyriformis [30].
The goal of present study was to develop reliable and predictive QSAR models using both MLR and RBFNN methods to identify and predict the acute toxicity (the 50% growth inhibitory concentration IGC 50 ) of substituted aromatic compounds to the aquatic ciliate Tetrahymena pyriformis. For this purpose, the whole dataset was divided into three groups with respect to the important function group of the substituted aromatic compounds such as −NO 2 , −X etc. They were Group 1: Compounds with NO 2 group, etc. (46 compounds); Group 2: Compounds with -X, etc. (75 compounds); Group 3: Compounds with both −NO 2 and −X, etc. (39 compounds). In so doing, different accurate models were built to evaluate the toxicities of these aromatic compounds.

Datasets
For the aromatic compounds, Wei et al. have mentioned that the order of the contribution of the special substituents to the toxicity of the aromatic compound is: −NO 2 > −Cl > −CH 3 > −NH 2 > −OH [31]. Based on the dataset given by Schultz et al. [32], we selected the typical compounds containing the most influential functional groups (−NO 2 and -X), and divided them into three subgroups. Group 1 includes 46 compounds whose chemical structures have the functional group −NO 2 without −X. Among them, 36 compounds were substituted by a −NO 2 and 10 compounds were substituted by two −NO 2 . Group 2 contains 75 compounds which have functional groups -X without −NO 2 . Among them, the 54, 16, and 5 compounds were replaced by one, two, or three functional groups −X, respectively. Group 3 contains 39 compounds, in which both the −NO 2 and −X functional groups are included, and the total number of substituents for −NO 2 and −X is not more than 3.
In this study, compounds in each group were randomly divided into two subsets. One called training set was used to build a model, and there were 36, 60, 31 compounds in the training set for Group 1, 2, 3, respectively. The remaining compounds were used to verify the robustness and feasibility of the model as a test set which includes 10, 15, and 8 for the corresponding groups, respectively. The CAS number, name, and toxicity (−log IGC 50 ) of the above compounds are all listed in Table 1.

Molecular Descriptors' Generation and Selection
To calculate the molecular descriptors of each compound, their structures were drawn using ISIS Draw 2.3 (MDL Information Systems, Inc., San Ramon, CA, USA) [33]. The MM + molecular mechanics forcefield in the HyperChem 6.0 program (Hypercube, Inc.: Waterloo, ON, Canada) was then used to carry out the preliminary molecular geometry optimization [34]. The further optimization of the compound structure was done by semi-empirical PM 3 method utilizing the Polak-Ribiere algorithm until the root mean square gradient was 0.01 kcal/mol [35]. Finally, a more precise optimization was achieved by MOPAC 6.0 software package (Indiana University: Bloomington, IN, USA) [36]. Afterwards, the final optimized structures were converted to the CODESSA 2.63 program (University of Florida, Gainesville, FL, USA) for calculating the five classes of descriptors, namely constitutional, topological, geometrical, electrostatic, and quantum-chemical descriptors [37]. It was necessary to explain that the logP descriptor, which cannot be calculated by the CODESSA 2.63, but can be obtained by Hyperchem, was then added to the descriptors pool [34]. Through doing these, 494, 597, and 611 descriptors were gained for each of the studied compounds in Group 1, 2, and 3, respectively.
Before establishing the QSAR models, it is necessary to remove the insignificant descriptors, and the constant and highly intercorrelated descriptors (the intercorrelation of the descriptors should be lower than 0.8). In this paper, the heuristic method (HM) was used to achieve a thorough search for the best multilinear correlations with the computed descriptors in the framework of the program CODESSA 2.63 [37].

Multiple Linear Regressions (MLR)
Multiple linear regressions (MLR) are often accepted as a classical method for solving linear problems when there are two or more than two independent variables in QSAR modeling. The purpose of MLR is to find a mathematical function which best depicts the desired activity Y (here, −log IGC 50 values) as a linear combination of the X-variables (the molecular descriptors), with the regression coefficients b n . The equation is as follows: Usually, the good fit alone does not guarantee that the model is useful for prediction purposes by the R 2 (coefficient of determination), LOOq 2 (leave-one-out correlation coefficient), RMS (root mean square error), F (Fisher's statistics), etc. [38]. Some statistical characteristics of the test set are also needed to be considered: R 2 (coefficient of determination), R 2 0 (the coefficients of determination, predicted vs observed activities, when the Y-intercept b 0 is set to zero), as well as by their corresponding slopes k and k . The following conditions need to be fulfilled to adequately estimate the predictive ability of a model [39]:

Radial Basis Function Neural Networks (RBFNN)
In general, RBFNN may have a better result than MLR, because it can take into account some nonlinear behavior between the molecular descriptors and the desired activities values (−log IGC 50 ). The detailed introduction of RBFNN has been stated in previous studies [40,41], so we only make a simple statement of the key parts here.
The RBFNN is a typical feed forward neural network which is composed of three layers, which are the input layer, the hidden layer, and the output layer. The first layer is linear, and distributes the input values, while the next layer is nonlinear, and uses radial basis function. The third layer linearly combines the outputs. Each neuron in each layer is adequately linked to the next layer. However, there is no connection between neurons in a given layer. Each hidden layer unit stands for a single radial basis function, which is characterized by a center and a width. In this layer, each neuron uses a radial basis function as nonlinear transfer function to handle the input information from the previous layer. The most common use of RBF is the Gauss function, characterized by the center (c j ) and width (r j ) [42]. It is used to measure the Euclidean distance between the input vector (x) and the radial basis function center (c j ), and gain the nonlinear transformation within the hidden layer, defined as where h j is the output of the jth RBF unit, while c j and r j are the center and width of such a unit, respectively. And the operation of the output layer is linear and is given by where y k is the kth output unit for the input vector X, w kj is the weight connection between the kth output unit and the jth hidden layer unit, and b k is the respective bias.
In the present study, we used the MATLAB package (MathWorks, Natick, MA, USA) (www.mathworks.com/products/matlab/) to accomplish all the RBFNN calculations. The total functions of the RBFNN model can be evaluated by the same statistical parameters as the MLR method together with its reliability and robustness.

Applicability Domain (AD) of the Model
It is necessary to give the application domain (AD) of the model. The applicability domain (AD) of a QSAR model refers to a theoretical region in the space defined by the compounds in the training set. It demonstrates the nature of the compound molecules that can be utilized in the built model. That is to say, AD restricts a theoretical region, also for unknown chemicals without experimental data, with the lowest number of bad predictions (Y-outliers) and chemicals far from the training structural domain [43]. In this study, a William's plot, i.e., a plot of standardized residuals (R) vs leverages was used [44]. Here, a simple measure of a chemical being too far from the applicability domain of the model is its leverage, h i [43], as follows: In the above equation, x i represents the descriptor row vector of the studied compound, while x represents the n × k − 1 matrix of k model descriptor values for the n training set compounds. The superscript "T" refers to the transpose of the matrix/vector. h i characterizes the leverage of a compound, and is one of the coordinates of the William's plot (standardized residuals versus leverage).

MLR Results
As mentioned above, based on the structural differences among the molecules which are caused by the influential functional groups (−NO 2 and −X), Group 1, 2, and 3 have 46, 75, 39 compounds, respectively. The models of each group were established by the training sets. Before doing this, the heuristic method (HM) was used to conduct the descriptor selection. After the preselection of the descriptors, 178, 203, and 160 descriptors were left for each group by removing of the descriptors not obeyed the thumb rules [45].
Multilinear regression models were then developed in a stepwise procedure, that is, the descriptors and correlations were sorted by the values of the F-test and the correlation coefficients. Beginning with the top descriptor from the list, two-parameter correlations were calculated. Later, the descriptors were added one by one, until the preselected number of descriptors in the model is fulfilled. Finally, three descriptors were used to describe the relationship between molecule structure and toxicity for each group of compounds. The selected descriptors and their chemical meaning, along with the statistical parameters, are listed in Tables 2-4.    The external test set was also used to further evaluate the three models. The statistical parameters obtained are as follows: N ext = 10, R 2 = 0.917, q 2 ext = 0.851, F = 13.820, RMS = 0.222 for group 1; N ext = 15, R 2 = 0.789, q 2 ext = 0.732, F = 13.720, RMS = 0.266 for group 2; N ext = 8, R 2 = 0.733, q 2 ext = 0.730, F = 260.404, RMS = 0.380 for Group 3. Figures 1, 2 and 3a show the predicted vs observed −log IGC 50 values for all the training and test set compounds. Thus, it can be seen that the model is reasonable in both statistical significance and predictive ability.

Model Applicability Domain Analysis and Improved MLR Model
It is also an important step to consider the possible outliers of the models. In order to visualize the AD, the plot of standardized cross-validated residuals versus leverage (the William's plot), which can provide an immediate and simple graphical detection, was used to find out the outliers from the models. In this plot, the horizontal and vertical straight lines represent the normal control values of Y-outliers and X-outliers, respectively. The limit of X-coordinate is 3m/n, where m is the number of model parameters, and n is the number of samples belonging to the training set. In the present study, the normal control value for Y-outliers (RES) was set as ±3σ. Figures 4-6 show the William's plot based on the MLR models for the whole dataset compounds of group 1, 2, 3, respectively.
As can be judged from Figure 4, in the model for Group 1, there is one X-outlier (for Group 1: compound 2), which is 2-nitroanisole. In its structure, there are two functional groups, −NO 2 and methoxy. The former is in all of the compounds belonging to this group as a strong electron-withdrawing group. However, the methoxy group has oxygen lone pair electrons which are a strong electron donor moiety, compared to other ones in the group. Therefore, care should be taken when using the compounds with methoxy, since they can activate the benzene ring and exert an unusual influence on the toxicity. And from Figure 6, it can also be seen that there is a X-outlier (for group 3: compound 15), that is, 2-chloromethyl-4-nitrophenol. This compound has three electron-withdrawing moieties, including-Cl, −OH, and −NO 2 , which has almost the strongest induction effect of the compounds in this group. Also, there seem to be another outlier (for Group 3: compound 36), which belongs to the test set. This may be due to variability in the measurement, or it may indicate experimental error.   If the handling of the outliers is unreasonable, the accuracy of the model will be affected. Thus, the quality and ability of the model prediction will be affected. Therefore, we removed the outliers from Group 1 and Group 3, set up the models anew, and the results were as follows: for the training To further assess the predictive powers of the model established by the MLR method, parameters such as , k, k , etc., were also calculated, and the results were shown in Table 5. From the table, we can see the statistical results were all within the acceptable ranges for the methods of MLR.

Validation Results of the Models
Further, a fivefold cross-validation algorithm was applied for validation of the stability of the three models. The members selected for each group (i.e., groups A, B, C, D, and T) were shown in Table 1. The R 2 , F, and RMS values for each validation along with their average values were shown in Table 6 for the MLR models. As can be seen, both models are stable, judging from the obtained values for the average training quality and for the average predicting quality.

RBFNN Results
In the field of QSAR research, RBFNN often shows better results than MLR because of its ability to consider some nonlinear relationships between the molecular structure and its activity. In order to confirm this view, RBFNN was utilized to build nonlinear predictive models using the same descriptors selected by the MLR models. The RBFNN can be traced as i-n k -1 net to indicate the number of units in the three layers, respectively. Meanwhile, the width (r) of RBF was computed by systemically changing its value in the training step from 0.1 to 4.0 with increments of 0.1. For the three groups of compounds belonging to training sets in this study, the RBFNN models were 3-10-1, 3-9-1, and 3-9-1, along with widths of 0.8, 2.0, and 1.7, respectively.
Their statistical results of the training and the test set are as follows. Group1: for training set,  Table 1, and the plot of the predicted and experimental values of both training and test set were displayed in Figures 1b, 2b and 3b. Different from the original literature, [35], we selected and classified the original compounds according to the structural characteristics and further modeled, analyzed, and predicted the corresponding toxicity values. The models, thus established, are also more targeted for the particular compounds, and the statistical results of (R 2 −R 2 0 ) R 2 , k, k , etc., as shown in Table 5 by RBFNN, also indicated the models to be statistically robust with high external predictivity.

Interpretation of Model Descriptors
In order to deepen the understanding of this study, more detailed explanations of the descriptors selected in each group were performed. For group 1, three descriptors were selected in the QSAR model, namely: G 2 , P AB , and Enn (C-H) . The positive sign of them indicated that the −log IGC 50 values increased with its increase, and vice versa. G 2 refers to gravitation indexes for all bonded pairs of atoms, and it is defined as [46], where m i and m j are the atomic weights of atoms i and j, r ij is the interatomic distance, N b is the number of bonds in the molecule. P o belongs to the valency-related descriptors, which relate to the strength of intermolecular bonding interactions and characterize the stability of the molecules, their conformational flexibility and other valency-related properties [47]. E nn(C-H) is Max n-n repulsion for a C-H bond, calculated as follows: E nn (CH) = Z C Z H

R CH
, where Z C and Z H are the nuclear (core) charges of atoms C and H, respectively, and R CH is the distance between them. This energy describes the nuclear repulsion driven processes in the molecule, and may be related to the conformational (rotational, inversional) changes or atomic reactivity in the molecule [46].
For Group 2, focusing on the compounds without the functional group −NO 2 , but with −X, three descriptors were chosen. That is, Log P, PNSA-2/TMSA, and P SIGMA. PNSA-2/TMSA is FNSA-2 fractional PNSA (PNSA-2/TMSA) [Zefirov's PC], which contributes to the calculation of atomic partial charges to the total molecular solvent-accessible surface area [46]. P SIGMA represents the maximum bond order for a given pair of atomic species in the molecule, its values for a given pair of atomic species in the molecule with the lower limit P SIGMA (min) > 0.1. LogP stands for the solvational characteristic (hydrophobicity of chemicals) because it is closely related to the change in the Gibbs energy of solvation of a solute between two solvents.
For Group 3, three descriptors were selected to build the model, that is Ic, Enn (C-C) , and RPCG. The chemical meaning of them can be seen in Table 4. I c is a geometrical descriptor which relates to the atomic masses, the distance of the atomic nucleus from the main rotational axes, which characterizes the mass distribution in the molecule. Enn(C − C) = Z C Z C /R C-C , where Z C and Z C are the nuclear (core) charges of atoms C, and R C-C is the distance between them. This energy describes the nuclear expulsion driven processes in the molecule, and may be related to the conformational (rotational, inversional) changes or atomic reactivity in the molecule [48]. RPCG, relative positive charge, belongs to electrostatic descriptors. From its coefficient, we can find that the relative positive charge of the molecule is negatively related to the endpoint values (−log IGC 50 ).
In summary, we found that the repulsion between the two bonds and the local charge on the surface of the molecule appeared in different models, indicating that these two factors have a greater influence on the structure of the compound and should be relatively valued.

Conclusions
In the present study, the QSAR models were performed on the study of the acute toxicity of substituted aromatic compounds to the aquatic ciliate Tetrahymena pyriformis using the MLR and RBFNN methods, and by dividing the whole dataset into three groups based on the most influential functional group (−NO 2 and −X). Acceptable statistical results for each model indicated their good stability and good predictability. We can also see from the results of the MLR and RBFNN models that the MLR method can establish reasonable models for evaluating the activity of compounds, and the RBFNN method can provide better statistical parameters. Also, the selected descriptors are effective and feasible for evaluating the toxicity of this group of compounds. Lastly, the results of this study provided useful insights on the characteristics of the structures that most affect the toxicity.