Prediction of the Toxicity of Binary Mixtures by QSAR Approach Using the Hypothetical Descriptors

Organic compounds are often exposed to the environment, and have an adverse effect on the environment and human health in the form of mixtures, rather than as single chemicals. In this paper, we try to establish reliable and developed classical quantitative structure–activity relationship (QSAR) models to evaluate the toxicity of 99 binary mixtures. The derived QSAR models were built by forward stepwise multiple linear regression (MLR) and nonlinear radial basis function neural networks (RBFNNs) using the hypothetical descriptors, respectively. The statistical parameters of the MLR model provided were N (number of compounds in training set) = 79, R2 (the correlation coefficient between the predicted and observed activities)= 0.869, LOOq2 (leave-one-out correlation coefficient) = 0.864, F (Fisher’s test) = 165.494, and RMS (root mean square) = 0.599 for the training set, and Next (number of compounds in external test set) = 20, R2 = 0.853, qext2 (leave-one-out correlation coefficient for test set)= 0.825, F = 30.861, and RMS = 0.691 for the external test set. The RBFNN model gave the statistical results, namely N = 79, R2 = 0.925, LOOq2 = 0.924, F = 950.686, RMS = 0.447 for the training set, and Next = 20, R2 = 0.896, qext2 = 0.890, F = 155.424, RMS = 0.547 for the external test set. Both of the MLR and RBFNN models were evaluated by some statistical parameters and methods. The results confirm that the built models are acceptable, and can be used to predict the toxicity of the binary mixtures.


Introduction
It has been widely accepted that the environmental pollutants usually exist and play a role in the form of mixtures, rather than as single chemicals. Thus, the mixture toxicity has attracted more attention of scientists in the past several decades [1][2][3]. Up to now, researchers have proposed two different approaches to evaluate the toxicity of mixtures, that is, the experimental toxicity tests to experimentally measure the toxicity of a whole mixture [4][5][6], and computational toxicology methods, such as quantitative structure-activity relationship (QSAR) studies, to predict the toxicity of the mixture [7][8][9][10]. For the experimental method, in the process of the determination of the toxicity of mixtures, individual compounds and combinations of arbitrary mixtures need to be tested one by one, and the combination of various compounds is infinite in the process of production. Furthermore, the last step of the experiment is usually to use animals as targets, which is difficult to achieve and ethically complicated. However, QSAR, as a computational method, has been used for almost six decades, and was widely applied in physical chemistry, pharmaceutical chemistry, environmental chemistry, toxicology, and other research fields [10]. It has been proven that it can be used to evaluate the properties, activities, and toxicities as effectively as the alternative methods to the experimental methods.
As for the recent QSAR methodology for the assessment of mixture toxicity, a QSAR model was formulated by using the similarity parameter (λ) to predict of the toxicity of equitoxic and nonequitoxic mixtures. From the results of this QSAR model, it can be seen that the joint effects of a mixture were determined by the concentration ratios of the individual components [11]. The protein receptor interaction energy (E binding ) was used to establish a model for the evaluation of the differences between chronic and acute mixture toxicity [12]. The CORAL software was used as a tool to model the toxicity of 50 binary mixtures of halogenated benzene for Photobacterium phosphoreum. The authors came to the conclusion that the half maximal effective concentration (pEC 50 ) increases when there are chlorine, bromine, and oxygen atoms in the molecule structure, however, the pEC 50 decreases if there is nitrogen atom [13]. In the study of mixture toxicity, the molecular docking study between nonpolar narcotic chemicals (halogenobenzene, alcohols, and perfluorinated carboxylic acids (PFCAs)) and lipid membranes was performed, to obtain the binding energy (E binding ). E binding was then used to build a QSAR model. The authors declared that the new calculated method of the octanol/water partition coefficient of chemical mixtures (K owmix ) by E binding was proposed [1]. Later, the liposome-water partition coefficient (K lipw ) was also used to build a QSAR model for 64 polar and nonpolar baseline toxicants measured by the bioluminescence inhibition in Vibrio fischeri. Then, the predicted values were used to evaluate the effect-based water quality trigger value (EBT-EC 50 ) of mixture chemicals [14]. Yao et al. developed a QSAR model for the prediction of mixture toxicities by two parameters, E binding and logK ow (mix) . In this study, the different mixtures (aldehydes and cyanogenic toxicants; triazines and urea herbicide; sulfonamides and trimethoprim toxicants) binding to different sites on different (or same) target proteins were considered [15]. We also performed the QSAR studies on a group of nonpolar narcotic compounds, including 9 PFCAs, 12 alcohols, and 8 chlorobenzenes and bromobenzenes by linear MLR and nonlinear RBFNN method. The predictive values are in good agreement with the experimental ones [10]. Based on the mixture toxicity mechanism, the formula between mixture toxicity and hormetic effect was built, and it can be used to predict the mixture toxicity from effect level at the low concentration. The authors also declared that the method can be further used for the prediction of mixture toxicity at any effect level from individual toxicity [16]. A general baseline toxicity QSAR model was performed also by using liposome-water distribution ratios as descriptors to assess the cytotoxicity of nonpolar, polar, and ionizable chemicals, and their mixtures, in the bioluminescence inhibition assay with Aliivibrio fischeri [3].
From the above research, the computational methods of mixture toxicity assessment have lasted and been developed over decades. The toxicity of mixtures has usually been evaluated by two methods, CA (concentration addition) and IA (independent action). It should be noted that, in these studies, the descriptors used to predict the toxicity are mostly the more effective two, E binding and the octanol/water partition coefficient (logk ow ). In the present work, we try to find other classical or defined hypothetical descriptors to relate the structure characteristic and the mixture toxicity, instead of the normally used E binding and logk ow . In addition, the possible relationships between the single chemicals or their mixtures, and the molecular descriptor, are considered.

MLR Results
According to the method of calculation of the molecular descriptors, a total of 614 descriptors which encode kind of significant features of molecules were obtained. Following the heuristic method (HM) of the descriptor selection, the 132 descriptors were left after removing the descriptors that did not obey the four rules. Finally, the descriptor dataset was refined to 106 for the MLR analysis by removing the highly relevant descriptors.
Through the forward regression method, a model of three descriptors (max partial charge for a C atom (Zefirov's PC (partial charge)) (Q C Max ), average complementary information content (order 2) (ACIC2), number of triple bonds (NTB) was finally obtained for the pure chemicals in this study. The equation is as follows: (1) N = 55, R 2 = 0.736, LOOq 2 = 0.720, F = 47.334, RMS = 1.042 (2) The above equation is for single pure compounds only. The predicted −log(EC 50 ) values are shown in Table 1 as well as the name, Chemical Abstracts Service (CAS) number, and the experimental −log(EC 50 ) values of the single chemicals. The statistical results showed an acceptable correlation between the three descriptors and the toxicity of these single compounds. However, in terms of the toxicity of the mixture, whether the above descriptors or their combinations have any correlation with the binary mixture toxicity needs to be studied. It has been shown and confirmed that for the docking effect of any type of theses mixtures, their descriptors are shown to be only simple addition [15]. Therefore, the total effect of the mixture is quantified with the descriptors previously selected, expressed as the formula below: where D i represents the selected descriptor of the single pure compounds, and x i represents the ratio of toxic unit. D, the hypothetical descriptor, is only a numerical operator that sums each value of the selected descriptor in the mixture, considering its contributions.
In this study, for each binary mixture, three hypothetical descriptors were formed by Q C Max , ACIC2, NTB and the fractional concentrations of the mixture. That is, the hypothetical descriptor is the sum of each descriptor of the pure compound, multiplied by the fraction of it in the mixture. Then, using the hypothetical descriptors, the 79 combinations belonging to the training set were used to build the model. The selected descriptors and their chemical meaning, along with the statistical parameters, were listed in Table 2. At the same time, the intercorrelation of the descriptors was evaluated when the model was built, as shown in Table 3. Pair correlations among the variables are far below 0.80 [17], indicating that each descriptor can be considered independent, which prevent grave over-fit of chance correlation effects happening. The corresponding variation inflation factors (VIF) and mean effect (MF) values of the three descriptors used in the model were also given in Table 2. From Table 2, we can see that the VIF value of each descriptor is less than 2, which is below the critical point 10 as an indicator of severe or serious multicollinearity [18]. The MF value of each descriptor is listed in this table. The value and the sign of them indicates the relative importance of a descriptor [19]. The MF values of descriptors Q C Max , ACIC2, and NTB decrease in turn, indicating that their contribution to the model is also reduced.  Lastly, the model was evaluated by the external test set, whereby the statistic values are N ext = 20, R 2 = 0.853, q 2 ext = 0.825, F = 30.861, and RMS = 0.691. Our model also meets more stringent requirements, as shown in Table 4. That is, giving very good results. R 2 0 is the coefficient of determination (predicted vs observed activities), R '2 0 is the coefficients of determination (observed vs. predicted activities through the origin), and k and k', are the corresponding slopes of the regression lines through the origin. The predicted versus experimental −log(EC 50 ) values of MLR model for the training and test sets are shown in Table 5, and plotted in Figure 1a. Moreover, Figure 2a shows the scatter plot of residuals of the whole dataset.   *Test set. Set* "T" means the corresponding compound belongs to the external test set. Set*s "A", "B", "C", and "D" mean the compound belongs to the subset of the training set.

Model Applicability Domain Analysis and Improved MLR Model
The Williams plot (Figure 3) was used to visualize the application domains of the built model. In this plot, the horizontal (standardized cross-validated residuals) and vertical (leverage) straight lines represent the normal control values of Y-outliers and X-outliers, respectively. The limit of X-coordinate is set as 3m/n, where m is the number of descriptors in the model, and n is the number of compounds belongs to training set. The normal control values for Y-outliers (standardized cross-validated residuals) was set as ±3σ. From Figure 3, there is a X-outlier (combination 62) and a Y-outlier (combination 65) in this model, and all of them belong to the training set.

RBFNN Results
In general, nonlinear studies considering the nonlinear relationship between descriptors and activity usually show better results than linear ones in QSAR studies. In this paper, the nonlinear RBFNN method was employed to further verify this. The inputs of RBFNN were the same hypothetical descriptor (D) used in the MLR model, and the outputs are the same target −log(EC 50 ) values. The same as with the MLR studies, the training set was used to establish the model, and the test set was used to evaluate the model. In the model building process, a three-layer net with 2-n k -1 was designed, where 2, n k , and 1 represent the number of the units in the input, the hidden, and the output layer, respectively. The width (r) of the RBF function was optimized by calculating its range, from 0.1 to 4, with the increment of 0.  Table 4.
In all, it can be seen from the results that both the linear MLR and nonlinear RBFNN model can predict the toxicity of the mixture satisfactorily. The linear model is simple, intuitive, and easy to be used by the researcher, and the nonlinear model can give more accurate prediction results. The important thing is that we must know the three descriptors of each pure compound. In doing so, we can give the prediction of the toxicity of this group of compounds within the application domains.

Validation Results of the Models
In the present study, a Y-randomization test was developed 10 times for both of the linear and nonlinear models, and the R 2 and LOOq 2 values of the two models are shown in Table 6. For the MLR and RBFNN models, the average values of R 2 were 0.0266, 0.0251, and the average values of LOOq 2 were 0.0141, 0.0124, respectively. The above low average values for R 2 and LOOq 2 values of the models, which imply that both models are robust, and have no chance correlation or structural redundancy. Further, a fivefold cross-validation algorithm (leave many out cross-validation (LMO)) was applied for validation of the stability of the two models. The members selected for each group (i.e., groups A, B, C, D and T) were shown in Table 5. In this table, for example, A+B+C+D means that the subsets A, B, C, D were selected as the training set, while the sunset T as test set. The R 2 , F, and RMS values for each validation, along with their average values, are shown in Table 7 (for the MLR model) and Table 8 (for the RBFNN

Interpretation of Model Descriptors
In the present study, three descriptors were selected for the model, namely Q C Max , ACIC2, and NTB According to the computed MF value (see Table 2), Q C Max was the biggest one, so it plays the most important role in the process of model buildings. The positive sign indicated that the −logEC 50 values increased with its increase, and vice versa. Q C Max is an electrostatic descriptor, which reflects how charge is distributed in the molecules partial surface area. Q C Max also affects the ability of the compound to be a H-bond acceptor. The stronger the electronegativity is, the easier it is to form hydrogen bonds, which leads to the enhancement of toxicity. Regarding the constitutional descriptor NTB (the number of triple bonds), which mainly includes -CC (compound 31 # ) and -CN (compound 14 # -24 # ) in this paper, it has a negative effect on the toxicity of the compounds. Average complementary information content (order 2) (ACIC2) belongs to topological descriptors. It describes the size, shape, and branching information of the molecules, and reflects the diversity of atomic and structural constitution of organic molecules. From the coefficient of the model, one can see that it has a positive effect on the toxicity of the compounds.

Datasets
In the present study, a group of 55 compounds, including 13 aldehydes (AHs) and 11 cyanogenic (CGs) as group 1, 10 triazines (TAs) and 10 urea (UE) herbicides as group 2, and 10 sulfonamides (SAs) and 1 trimethoprim (TMP) as group 3, were obtained from the literature by Yao et al. [15]. These compounds are widely present in the environment, and have an adverse effect on the environment and human health in the form of monomers or mixtures. The name, CAS number, and the experimental −log(EC 50 ) values of the single chemicals were listed in Table 1, and the unit of EC 50 is mol/L. The binary mixture toxicity of 99 combinations, along with the single chemicals in the mixtures, the ratio of toxic unit, as well as the experimental values of mixture (−log(EC 50mix )), were listed in Table 5. The single compounds in Table 1 were used to select the proper descriptors, which were highly related to the toxicity of pure compounds. In doing so, the selected ones were then used to generate new descriptors for the building of the QSAR model of the mixture in Table 5. The dataset of the combinations in Table 5 were also randomly divided into a training set (including 79 combinations) to build the models, and a test set (including 20 combinations, marked with *) to evaluate the predictive power of models.

Molecular Descriptors Generation and Selection
To obtain the descriptors of the compounds, each structure of the single compounds was drawn in ISIS Draw 2.3 (MDL Information Systems, Inc., San Ramon, CA, USA) [20]. The preliminary molecular geometry optimization was done in HyperChem software (Hypercube, Inc., Waterloo, ON, Canada) [21] by the molecular mechanics MM + force fields. Then, the further optimization of the molecular structures was carried out by semi-empirical PM3 method using the Polak-Ribière algorithm, until the root mean square gradient was 0.01 kcal/mol [22]. Finally, an optimization was also done in MOPAC software package (Indiana University, Bloomington, IN, USA) at the same root mean square gradient [23]. After that, the structure files exported from HyperChem and MOPAC were transferred into CODESSA software (University of Florida, Gainesville, FL, USA) [24] for the purpose of calculation of descriptors, which mainly include constitutional, topological, geometrical, electrostatic, and quantum-chemical descriptors. In addition, the logP descriptor, which cannot be calculated by CODESSA, was obtained by HyperChem and, then, was added to the descriptors pool [18].
After calculating the descriptors, the heuristic method (HM) was used to find the proper descriptors associated with the toxicity of single compounds; this was done by using the experimental −log(EC50) values of single chemicals as the end-point activity values. This method has been approved to be usefully used in the field of quantitative structure-activity relationship (QSAR) [10], quantitative structure-property relationship (QSPR) [25], and quantitative structure-toxicity relationship (QSTR) [26].
However, we all know that a considerable portion of these descriptors are constant and highly intercorrelated, requiring further screening. According to the CODESSA program, the choice of best descriptors by the heuristic method (HM) follows the following principles: descriptors must be obtained for each compounds; the magnitude between the descriptors should be large; the F-test's value of the descriptor must be larger than 1.0 in the one-parameter correlation; and descriptors whose t-values are larger than the user-specified value (0.1) are more preferred. By doing so, the descriptors can be arranged in order of the influence of the one-parameter correlations.
Following the data reduction, forward stepwise multiple linear regression (MLR) method was used to select the suitable descriptors. In this process, the descriptor was added, one by one, until there was no further significant improvement in the statistics of the model. It should be noted that an increase of the "R 2 " value less than 0.02 was chosen as the breakpoint criterion.

Multiple Linear Regressions (MLR)
In the QSAR studies, the multiple linear regressions (MLR) method is often considered as a simple but powerful approach to regression problems when there are two or more than two independent variables. The goal is to find a mathematical function using the selected training set compounds which best describe the desired activity Y (here −log(EC 50 ) values), as a linear combination of the X-variables (the molecular descriptors) with the regression coefficients b n . The equation is as follows: Some statistical parameters, including R 2 (the correlation coefficient between the predicted and observed activities), LOOq 2 (leave-one-out correlation coefficient), RMS (root mean square error), F (Fisher's statistics) [27], etc., are normally used to evaluate the stability of the model. For the purpose of estimation of the predictive ability of the model, some other statistical characteristics should also be considered: R 2 0 (the coefficient of determination, predicted vs observed activities), R 2 0 (the coefficients of determination, observed vs predicted activities through the origin), and k and k', are the corresponding slopes of the regression lines through the origin [28].
0.85 ≤ k ≤ 1.15 or 0.85 ≤ k ≤ 1.15 (8) Finally, Y-scrambling techniques should be used to rule out the possibility of chance correlation, and to inspect for reliability and robustness by permutation testing. Also, multicollinearity between the selected descriptors can be detected by calculating their variation inflation factors (VIF), as follows [29]: In this equation, r represents the correlation coefficient between each pair of the selected descriptors. In order to evaluate the relative importance and the contribution of each descriptor in the model, the value of the mean effect (MF) is calculated by the following equation [30]: where MF j is the mean effect for the considered descriptor j, β j is the coefficient of the descriptor j in the model, d ij is the value of the target descriptors for each molecule, and m and n is the number of descriptors and compounds being used to build the model. The MF value indicates the relative importance of a descriptor compared with the other descriptors in the model, and its sign exhibits the variation direction in the values of the activities resulting from an increase (or a reduction) of this descriptor value.

Radial Basis Function Neural Networks (RBFNNs)
For MLR method, all works were achieved by means of building the best multivariate linear model between molecular descriptors and the −log(EC 50 ) values. However, there are also many other approaches that were used in the analysis of nonlinear QSAR data, such as Radial basis function neural network (RBFNN), for considering the nonlinear behavior of the desired activity. The principles of it have been mentioned elsewhere [31,32]. The RBFNN is a standard feed forward neural network characterized by a set of inputs and a set of outputs. Between the inputs and outputs, there is a layer of processing n units called hidden units. There is no connection between the neurons in the given layer, but each neuron in each layer is well connected to the next layer. The input layer does not handle any input information, and its only role is in the distribution of input to the hidden layer. The output layer gives the results of the hidden layer with a linear transformation. It is important that each hidden layer unit plays the role by a nonlinear function, which can be used to deal with the input information from the previous layer.
The most common use of RBF is the Gaussian function characterized by the center (c j ) and width (r j ). The RBF function realizes the nonlinear transformation by measuring the Euclidean distance between the input vector (x) and the radial basis function center (c j ): where h j represents the output of the jth RBF unit, c j represents the radial basis function center, and r j is the width of the unit. The linear transformation of the output layer is defined as where y k represents the kth output unit for the input vector X, w kj represents the weight connection between the kth output unit and the jth hidden layer unit, and b k is the respective bias.
In the performance of the development of the RBFNN model, the center (c j ) and width (r j ) should be determined. In our study, a forward subset selection routine was used with a constant of Gaussian functions for all the units. The width (r j ) was adjusted by changing its range from 0.1 to 4, with the increment of 0.1. After doing that, the connection weight between the hidden and output layer was selected by a least-squares solution. In all the optimal process, root mean square (RMS) was used as the error function. This process is completed by MATLAB (Available online: https://www.mathworks.com/products/matlab/). Like the MLR method, RBFNN is also evaluated with parameters such as RMS, R 2 , and others.

Applicability Domain (AD) of the Model
For a useful model, the applicability domain (AD) of the model should be given. The applicability domain (AD) of a QSAR model is a theoretical region in the space defined by the compounds in the training set. It characterizes the nature of the chemicals that can be used in the built model. That is, AD defines a theoretical region which can be used to predict the new compounds, even if they have not been tested by the experimental method [33]. There are some methods to achieve this purpose. In this study, a Williams plot, i.e., a plot of standardized residuals (R) vs leverages was used [34]. The leverage, h i , can be calculated by the formula [33], as follows: where h i characterizes the leverage of a compound, x i represents the descriptor row vector of the studied compound, and x represents the whole matrix of the descriptor values of compounds in the training set. T is matrix or vector-transposed symbol.

Y-Randomization Test
Y-randomization (also called Y-scrambling or response randomization) is considered to be the most powerful validation procedure to protect against the risk of chance correlation. In this test, the dependent-variable vector, Y-vector, is randomly shuffled, and a new QSAR model is established using the initial independent-variable matrix [28]. The process is repeated many times, if in each case, the scrambled data gives much lower R 2 and LOOq 2 values than the original data that can be used to build the model, then, a robust model was obtained [35].

Leave Many Out Cross-Validation
In addition, we also use leave many out cross-validation (LMO) to further evaluate the robustness of the model. Unlike LOO, the internal test set for LMO is for a group of compounds instead of single one. It is generally believed that the obtained model is robust, if a QSAR model has a high average R 2 in the LMO cross-validation.

Conclusions
Prediction of the toxicity of a sequence of mixture was accomplished by the MLR (multiple linear regression) and RBFNN (radial basis function neural network) methods. The hypothetical descriptors (D), derived from the normal Q C Max , NTB and ACIC2 descriptors were successfully used to measure the contributions of the components of a mixture to the overall activity. The MLR is an acceptable and easy method and, with regard to the RBFNN model, it can predict the mixture toxicity more accurately. In addition, statistical results show that the developed descriptors are effective and feasible for evaluating the toxicity of mixture compounds. Thus, the present study shows the viability of applying QSAR tools to predict the binary toxicity. Furthermore, the results of this study provided useful insights on the characteristics of the structures that most affect the toxicity.