Support Vector Machine-Based Global Classification Model of the Toxicity of Organic Compounds to Vibrio fischeri

Vibrio fischeri is widely used as the model species in toxicity and risk assessment. For the first time, a global classification model was proposed in this paper for a two-class problem (Class − 1 with log1/IBC50 ≤ 4.2 and Class + 1 with log1/IBC50 > 4.2, the unit of IBC50: mol/L) by utilizing a large data set of 601 toxicity log1/IBC50 of organic compounds to Vibrio fischeri. Dragon software was used to calculate 4885 molecular descriptors for each compound. Stepwise multiple linear regression (MLR) analysis was used to select the descriptor subset for the models. The ten molecular descriptors used in the classification model reflect the structural information on the Michael-type addition of nucleophiles, molecular branching, molecular size, polarizability, hydrophobic, and so on. Furthermore, these descriptors were interpreted from the point of view of toxicity mechanisms. The optimal support vector machine (SVM) model (C = 253.8 and γ = 0.009) was obtained with the genetic algorithm. The SVM classification model produced a prediction accuracy of 89.1% for the training set (451 log1/IBC50), of 80.0% for the test set (150 log1/IBC50), and of 86.9% for the total data set (601 log1/IBC50), which are higher than that (80.5%, 76%, and 79.4%, respectively) from the binary logistic regression (BLR) model. The global SVM classification model is successful, although it deals with a large data set in relation to the toxicity of organics to Vibrio fischeri.


Introduction
With the extensive application of chemicals such as pesticides, insecticides, chemical fertilizers, medicines, detergents, anti-foaming agents, and flocculants, their potent toxicity and organic natures have attracted more and more attention [1,2]. Among the 84,000 chemicals on the market, only 1% of them have been measured for safety. Furthermore, 1000-2000 new chemicals enter the market every year [3,4]. In addition to the criticism over animal testing, to test the toxicity and risk of chemicals is non-trivial and is often very time consuming [5].
Quantitative structure-activity/toxicity relationship (QSAR/QSTR) models have been proposed to assess the risk of chemicals [5]. QSPR approaches have the advantages that they need only the information of the molecular structure and do not rely on any empirical data. The QSPR principle is based on the assumption that it is the molecular structure that is determining the physical and chemical properties of molecules, and that compounds possessing similar molecular structures have similar properties. The molecular structure (i.e., the three-dimensional structure or arrangement of atoms in a molecule) is governed by several factors, including covalent bonds and polarity, non-covalent interactions, the hydrophobic effect, shape and size, and so on. It can be quantitatively described by molecular descriptors related to the physical and chemical properties of molecules.
Vibrio fischeri is a bioluminescent and widely used as the model species since it is sensitive to a wide range of toxicants [5]. There are several QSTR models reported for the assessment of the aquatic toxicity of chemicals against Vibrio fischeri. For example, Cronin and Schultz [6] developed three single-descriptor (logKow) models of toxicity to Vibrio fischeri for six alkanones, six aldehydes, and six alkenals, and a two-descriptor model for 19 compounds via multiple linear regression (MLR) analysis. After that, they carried out similar work for 15 haloalcohols, 11 halonitriles, 11 haloesters, 10 diones, and 63 compounds in the combined data set. These models yielded coefficients of determination R 2 ≥ 0.846 [7].
The group of Sun and Wang utilized two molecular descriptors to correlate with the toxicity of 24 bromide-based ionic liquids (Br-ILs) against Vibrio fischeri. The model yielded a R 2 of 0.954 [11]. Wang et al. proposed QSTR models of 17 alkylated aromatic hydrocarbons towards Vibrio fischer [12]. The model has 10 molecular descriptors and produced a R 2 of 0.956. de Melo et al. predicted the toxicity of aromatic sulfones to Vibrio fischeri [13]. The QSTR models were based on two latent variables (or descriptors) and on the training sets of 35-41 aromatic sulfone chemicals, which yielded a R 2 of 0.704-0.934 [13].
The afore-mentioned QSPR models belong to local models, focus on a particular chemical species, and have some limitations in predicting aquatic toxicity to Vibrio fischeri. Especially, some models have lower ratios between the samples and descriptors used in the QSTR models, reducing the statistical quality and leading to overfitting, that is, the QSTR models may have superior performance on the training data but perform poorly on the test set.
Recently, the group of Su established mechanistically based QSTR models for a larger data set of chemicals against Vibrio fischeri [5]. The number of chemicals and the R 2 from the MLR models are, respectively, 172 and 0.787 for baseline chemicals, 133 and 0.737 for less inert chemicals, 57 and 0.801 for reactive chemicals, and 25 and 0.766 for acting chemicals. Although they are limited in application (i.e., the modes of action should be first determined for a chemical in relation to Vibrio fischeri before selecting these QSTR models), these QSTR models based on different modes of action have high prediction accuracy [5]. Support vector machine (SVM) has become an effective tool in nonlinear problems. The aim of this paper is to develop a global SVM classification model for the toxicity of a large data set including 601 chemicals towards Vibrio fischeri. Furthermore, it can be used for the prediction of the toxicity category of a chemical in relation to Vibrio fischeri without knowing its modes of action.

Results and Discussion
The binary logistic regression (BLR) analysis (Forward: Wald) in SPSS 19.0 was performed between 946 molecular descriptors and 601 acute toxicity data, log1/IBC 50 . Three BLR models were obtained, which, respectively, have the number of independent variables as one (MLOGP2), two (SM03_AEA (dm and MLOGP2), and three (SM03_AEA(dm), RDF145u and MLOGP2), and possess lower total prediction accuracies of 71.2%, 71.7%, and 71.5%. Stepwise MLR analysis in SPSS 19.0 was further carried out for these molecular descriptors and acute toxicity data in the total data set, resulting in a descriptor subset including ten molecular descriptors when the increasing coefficient of determination ∆R 2 > 0.01 was set as the criterion for adding new variables. Tables 1 and 2, respectively, show the characteristics and physical meaning of ten molecular descriptors selected in the MLR model.  Table 2. Physical meaning and classes of ten molecular descriptors selected.

Physical Meaning Class
SpMax4_Bh(m) means the largest eigenvalue no. 4 of Burden matrix weighted by mass and reflects molecular similarity/diversity on large databases.
Burden eigenvalues AVS_B(p) is the average vertex sum from Burden matrix weighted by polarizability and associated with nucleophilic aromatic substitution reaction in benzene rings.
2D matrix-based descriptors MLOGP2 denotes the squared Moriguchi octanol-water partition coefficient and describes molecular hydrophobic property.

Molecular properties
N-074 is the number of R#N/R=N-groups. Atom-centered fragments B01[C-C] reflects the presence/absence of C-C at topological distance 1.

2D Atom Pairs
QXXm means the quadrupole X-component value/weighted by mass and is related to molecular polar and volume.
Geometrical descriptors CATS2D_04_NL denotes the CATS 2D negative-lipophilic at lag 04 and is associated with the type and the number of pharmacophore points.

CATS 2D
C-016 is the number of =CHR groups. Atom-centered fragments DBI is the Dragon branching index and correlated with molecular size and branching.
Topological indices SM03_AEA(dm) is the spectral moment of order 3 from the augmented edge adjacency matrix weighted by dipole moment and reflects molecular fragment counts.

Edge adjacency indices
As can be seen from Table 1, the sig. value of each descriptor is less than 0.001 (the default threshold being 0.05). Therefore, they are significant descriptors. Moreover, the variance inflation factor (VIF) of each descriptor is less than 5 (the default threshold being 10), indicating these descriptors independently reflect molecular structure information related to the toxicity, log1/IBC 50 .
BLR is essentially the same as MLR analysis in modeling relationships between dependent and independent variables [3]. The difference between them is that BLR deals with dichotomous dependent variables and MLR deals with continuous variables. Thus, the ten molecular descriptors selected in the MLR model were used as the descriptor subset to develop QSTR classification models for the toxicity (log1/IBC 50 ) of chemicals to Vibrio fischeri. The results obtained from the BLR analysis are shown in Table 3. For the training set of the BLR classification model, the true positives (TP), false positives (FP), true negatives (TN), and false negative (FN) are, respectively, 176, 54, 187, and 34. The statistical parameters of specificity (=84.6%), sensitivity (=76.5%), and accuracy (=77.6% for Class − 1 and 83.8% for Class + 1) are above 76%. For the test set, the values of TP, FP, TN, and FN are 56, 22, 58, and 14, respectively. Its specificity (=80.6%), sensitivity (=71.8%), and accuracy (=72.5% for Class − 1 and 80.0% for Class + 1) are above 71%. The harmonic means of precision F 1score for the training set, test set, and total set are 80.0%, 75.7%, and 78.9%, respectively. Their Matthew's correlation coefficients (MCC) between the experimental and calculated log1/IBC 50 are 61.3%, 52.4%, and 59.1%, respectively. In addition, the total prediction accuracy is 80.5% for the training set (451 log1/IBC 50 ), 76.0% for the test set (150 log1/IBC 50 ), and 79.4% for the total data set (601 log1/IBC 50 ). The ten molecular descriptors selected in the MLR model were further used as input vectors to develop the SVM QSTR classification model from the 451 toxicity (log1/IBC 50 ) in the training set by applying the LibSVM package on the MATLAB R2014a platform. By using a genetic algorithm, the SVM parameters C and γ were optimized under the conditions, with C being [100,500], γ being [0,1], m-fold cross-validation (m = 451), the maximum generation (maxgen = 200), the maximum population size (sizepop = 20), and the ε-insensitive loss function (ε = 0.01). In the end, the optimal SVM with the parameters, C = 253.8 and γ = 0.009, was obtained. For the training set in Table S1 in Supplemental Materials, the values of the parameters TP, FP, TN, and FN are 191, 30, 211, and 19, respectively. The optimal SVM model has an accuracy of 87.6% for Class − 1 and 91.0% for Class + 1, specificity of 91.7%, sensitivity of 86.4%, and the harmonic mean of precision F 1score of 88.6%. The leave-one-out (LOO) cross-validation accuracy is 72.9%.
Subsequently, 150 toxicities (log1/IBC 50 ) in the test set were used to validate the optimal SVM model and the prediction results were shown in Table 3. The test set possesses the parameters TP = 57, FP = 17, TN = 63, and FN = 13, a prediction accuracy of 78.8% for Class − 1 and 81.4% for Class + 1, specificity of 82.9%, sensitivity of 77.0%, and a harmonic mean of precision F 1score of 79.2%. The total data set has TP = 248, FP = 47, TN = 274, and FN = 32, specificity = 89.5%, sensitivity = 84.1%, and F 1score = 86.3%. The MCC values of the training set, the test set, and the total data set are 78.3%, 60.1% and 73.8%, respectively, which are satisfactory, compared with the MCC results (61.3%, 52.4%, and 59.1%, respectively) from the BLR classification model. In addition, the prediction accuracy is 89.1% for the training set (451 log1/IBC 50 ), 80.0% for the test set (150 log1/IBC 50 ), and 86.9% for the total data set (601 log1/IBC 50 ), which are higher than the results (80.5%, 76.0%, and 79.4%, respectively) from the BLR model. In addition to the accuracy, the parameters (specificity, sensitivity and F 1score ) of the SVM model are higher than that from the BLR model (see Table 3). Therefore, there are nonlinear relationships between the toxicities (log1/IBC 50 ) and the ten molecular descriptors used, suggesting that it is reasonable to develop a SVM QSTR classification model for the toxicity (log1/IBC 50 ) of chemicals to Vibrio fischeri.
Recently, the group of Lučić and Batista proposed a significant parameter (∆Q 2 ) for validating the classification model quality [14,15]. The parameter ∆Q 2 denotes the difference between the two-state accuracy (Q 2 ) of the original model and the two-state random accuracy (Q 2,rnd  from the SVM model are higher than that from the BLR classification model, meaning that the SVM model contributes a larger amount of useful information over the maximal level of random accuracy [16]. Many factors including the absorption, distribution, bioaccumulation, and metabolism of organic compounds in Vibrio fischeri influence the test results of toxicity, leading to the difficulty in investigating the mechanism of the toxicity of organic compounds towards Vibrio fischeri. The descriptor SpMax4_Bh(m) means the largest eigenvalue no. 4 of the Burden matrix was weighted by mass. It is calculated from a hydrogen-filled molecular graph with the Burden matrix Bh(w) and its largest positive eigenvalue. SpMax4_Bh(m) can reflect the molecular similarity/diversity in large databases. Table S1 shows that the compounds, cypermethrin (No. 117) and PR Toxin (No. 434), have high SpMax4_Bh(m) (3.914 and 3.346, respectively) and high log1/IBC 50 (4.8 and 5.2, respectively) values. The reason is that they are typically reactive compounds: cypermethrin possesses benzylic activation with a good leaving group cyanide at an α-position, and PR Toxin has an epoxide. AVS_B(p) belongs to 2D matrix-based descriptors and denotes the average vertex sum from the Burden matrix weighted by polarizability. Similar to SpMax4_Bh(m), the descriptor AVS_B(p) is derived from the Burden matrix Bh(w) of a hydrogen-filled molecular graph. As can be seen from Table S1,  ≥4.45 greater than the average value (log1/IBC 50 = 4.02) of the total set. Due to the fact that the hydrogen atoms in benzene rings can be replaced by many different substituents, these compounds readily undergo nucleophilic aromatic substitution reaction, resulting in higher toxicity.
The descriptor MLOGP2 is derived from the squared Moriguchi octanol-water partition coefficient and widely used for QSTR models of compounds, especially for inert chemicals and less inert chemicals. A larger MLOGP2 expresses the compound with highly hydrophobic property, resulting in the accumulation of toxicant in Vibrio fischeri and causing toxicity [17,18]. Thus, the descriptor MLOGP2 is positively correlated with log1/IBC 50 of chemicals to Vibrio fischeri.
The descriptor N-074 means the number of specific atom type, R#N or R=N-, with # here representing a triple bond and R being any group linked through carbon. In Table S1, there are 42 chemicals with N-074 > 0, their average log1/IBC 50 value is 4.97, greater than the average log1/IBC 50 value of 4.02 of the total set. These chemicals have reactive structural entities, for example, the chemicals of No. 600 (thiocyanates) and No. 601 (isothiocyanates), which can cause high log1/IBC 50 values.
The descriptor B01[C-C] denotes the presence/absence of C-C at topological distance 1. In Table S1, there are only 12 chemicals without C-C at topological distance 1, that contain C-N, C-O, C-S, or only one C atom in the main chain, e.g., methanol (No. 1), formic acid The descriptor QXXm is the quadrupole X-component value/weighted by mass. It is associated with molecular polar and volume. For example, novobiocin (No. 393) has the highest value (QXXm = 336.668) and the highest molecular weight (=612.69). On the one hand, a larger QXXm indicates the corresponding chemical (or polar narcosis chemical) having strong polarity and leading to high toxicity compared with non-polar narcosis. On the other hand, a larger QXXm means the molecule possessing a larger volume, which hinders molecules penetrating the lipid bilayer of biological cell membrane and reduces the toxicity.
The descriptor CATS2D_04_NL denotes the number of pharmacophore point types of negative-lipophilic at a topological distance of four bonds. For CATS 2D descriptors, the atoms with a hydrogen bond donor (or acceptor), positively (or negatively) charged, or lipophilic pharmacophore points at topological distances not greater than nine bonds are taken into account. Table S1 shows that there are 41 chemicals with CATS2D_04_NL >0. Furthermore, all the chemicals contain acid groups. This descriptor is opposite to MLOGP2 mentioned above; a large CATS2D_04_NL indicates the chemical exhibiting strong hydrophilicity and resulting in lower log1/IBC 50 value.
The descriptor C-016 is the number of =CHR groups in a molecule (R represents any groups linked through carbon). As can be seen from Table S1, 1,4-benzoquinone (No. 150) has the highest C-016 (=4). Its toxicity is very high (log1/IBC 50 = 6.73), because this chemical enables a Michael-type addition of nucleophiles across the double bonds. In addition, some chemicals such as aldehydes and ketones (C-016 > 0) can have high toxicity when their α-position of the double bonds possesses a leaving group.
The descriptor DBI is the molecular branching index and reflects information on pairs of connected atoms in an H-depleted molecular graph. It is mainly related to molecular size and also sensitive to molecular branching. In Table S1, the chemical tetra-n-butylthiuramdisulfide (No. 510) has the highest DBI (=5.099) and lower toxicity log1/IBC 50 (=3.83). The chemical has two terminal N atoms, each linking to two butyl groups. As stated above, it has a larger volume, hindering molecules across the lipid bilayer of the biological cell membrane, and reduces the toxicity. Other amines (e.g., tetrapropylthioperoxydicarbonicdiamide, No. 43) and amides (e.g., metolachlor, No. 486) have similar phenomenon.
The descriptor SM03_AEA(dm) is a spectral moment of order 3 from the augmented edge adjacency matrix weighted by the dipole moment. It encodes information about connectivity between graph edges and is related to spectral moments and fragment counts. Although a larger molecule in molecular weight or size usually has higher toxicity, too large a size will hinder molecular penetration and transport and decrease the toxicity. It is easy to understand that the chemical novobiocin (No. 393), with the highest value (SM03_AEA(dm) = 2.167) and the highest molecular weight (=612.69), only has a medium toxicity (log1/IBC 50 = 4.38).

Experimental Data
There are 606 acute toxicity data, log1/IBC 50 (the logarithmic form of 50% inhibition concentration of bioluminescence, in the unit of mol/L), of organic chemicals to Vibrio fischeri at the 15 or 30 min endpoint in [5]. After deleting the mixture (4-pyridylacetonitrile hydrochloride and four sodium compounds), 601 acute toxicity data were remaining and listed in Table S1 in Supplementary Materials. The toxicity data, log1/IBC 50 , are in the range of −0.25 to 7.12. The higher the log1/IBC 50 value, the more toxic is the chemical to Vibrio fischeri. After sorting the log1/IBC 50 values, it can be found that the samples with a log1/IBC 50 value around 4.2 are sparse. Therefore, the threshold for Class + 1/− 1 was selected as 4.2, i.e., an organic compound is categorized as Class + 1 if its log1/IBC 50 value is greater than 4.20, otherwise as Class − 1. Here, Class − 1 denotes the compound being nontoxic to Vibrio fischeri and Class + 1 means the compound being toxic towards Vibrio fischeri. The 601 organic compounds in the total data set were divided into a training set including 451 samples (Nos. 1-451 in Table S1 in Supplementary Materials) and a test set including 150 samples (Nos. 452-601 in Table S1) at the approximate ratio of 3:1. The training set includes 241 Class − 1 and 210 Class + 1, and the test set includes 80 Class − 1 and 70 Class + 1. From the training set of 451 log1/IBC 50 data, a classification model was built, which was validated with the test set (150 log1/IBC 50 ) that is different from the training set.

Molecular Descriptor Calculation
According to the chemical names in Table S1 in Supplementary Materials, each molecular structure was constructed with ChemDraw 19.0, and optimized with the molecular mechanics MM2 in Chem3D 19.0 at the default convergence level. After that, 4885 molecule descriptors were calculated for each molecule with the Dragon 6.0 [19], followed by removing those variables that equal a constant or possess partial correlation coefficients above 0.9. In the end, 946 descriptors were obtained for each molecule.

Support Vector Machine Algorithm
Rooted in the principle of structural risk minimization, SVM algorithms are widely used for both classification and regression [20,21]. They have excellent generalization capability on yet-to-be-seen data, even if a training sample size is small. In dealing with machine learning problems, its kernel functions are used to map the data points into a high dimensional feature space, then linear tasks can be implemented. In the SVM two-class problem, Equation (1) represents the separating hyperplane.
The slack variables ξ i and ξ i * are introduced to guard against outliers. Then, the weight vector w and bias term b can be obtained by minimizing the objective function: subject to: The regularization constant (C > 0) is a tunable parameter. A larger C leads to more weight, decreasing the error. The ε-insensitive function is adopted in penalizing incorrect predictions: Then, Equation (1) becomes: The kernel function K(x i , x) is introduced to deal with the nonlinear problem. The minimizing function can be written as: The coefficients α i , α * i can be obtained by maximizing the quadratic programming problem: In this work, the nonlinear problems above were solved with the Gaussian radial basis function: where γ is the kernel parameter. Similar to the regularization constant C, SVM models can be over-fitting or under-fitting on the training data when γ values are too large or too small [22,23]. The genetic algorithm (GA) and particle swarm optimization (PSO) algorithms are usually used for C and γ optimization. Due to the former being superior to the latter in the optimization speed, the genetic algorithm was used to find the optimal SVM parameters C and γ in this paper. The parameters used to evaluate the performance of the classification models in this paper were defined as: Here, TP being the true positives, FP being false positives, TN being true negatives, and FN being false negatives. The parameter ∆Q 2 takes into account the level of two-state random accuracy and estimates the real model contribution to prediction accuracy above that level. A large ∆Q 2 suggests that the classification model provides a significant level of useful information over the maximal level of two-state random accuracy [15,16].

Conclusions
Although many factors influence the toxicity log1/IBC 50 of organic compounds to Vibrio fischeri, the ten molecular descriptors selected with MLR analysis were successfully used for the classification models by correlating with the structural information on the Michael-type addition of nucleophiles, molecular branching, molecular size, polarizability, hydrophobic, and so on. By the prediction of the optimal SVM model combined with the genetic algorithm, the training set (451 organics) has a prediction accuracy of 87.6% for Class − 1 (log1/IBC 50 ≤ 4.2) and 91.0% for Class + 1 (log1/IBC 50 > 4.2); the test set (150 organics) has the accuracy being 78.8% for Class − 1 and 81.4% for Class + 1, which are more accurate than those from the binary logistic regression model. The global SVM classification model is satisfactory, although it dealt with a larger data set (601) including the toxicity of organics to Vibrio fischeri.
Author Contributions: Conceptualization, F.W.; data curation, F.W. and X.Z.; methodology, Z.F. and X.Y.; software, F.W., Z.F. and X.Y.; writing-original draft preparation, Z.F. and X.Y.; write-review and editing, F.W., Z.F. and X.Y. All authors have read and agreed to the published version of the manuscript.