Predictive QSAR Models for the Toxicity of Disinfection Byproducts

Several hundred disinfection byproducts (DBPs) in drinking water have been identified, and are known to have potentially adverse health effects. There are toxicological data gaps for most DBPs, and the predictive method may provide an effective way to address this. The development of an in-silico model of toxicology endpoints of DBPs is rarely studied. The main aim of the present study is to develop predictive quantitative structure–activity relationship (QSAR) models for the reactive toxicities of 50 DBPs in the five bioassays of X-Microtox, GSH+, GSH−, DNA+ and DNA−. All-subset regression was used to select the optimal descriptors, and multiple linear-regression models were built. The developed QSAR models for five endpoints satisfied the internal and external validation criteria: coefficient of determination (R2) > 0.7, explained variance in leave-one-out prediction (Q2LOO) and in leave-many-out prediction (Q2LMO) > 0.6, variance explained in external prediction (Q2F1, Q2F2, and Q2F3) > 0.7, and concordance correlation coefficient (CCC) > 0.85. The application domains and the meaning of the selective descriptors for the QSAR models were discussed. The obtained QSAR models can be used in predicting the toxicities of the 50 DBPs.


Introduction
Disinfection byproducts (DBPs) have raised concerns since the first DBPs of trihalomethane (THM) compounds were identified in the 1970s [1]. The DBPs may result from reactions between disinfectants and dissolved organic matter present in source waters [2,3]. The THMs are the most common DBPs present in the typical chlorinated drinking water. Approximately 700 DBPs have been identified, as the increasingly employed disinfectants such as ozone, chlorine dioxide and chloramines in drinking water result in reactivity with organic compounds [2]. Research has demonstrated that several known DBPs are considered to be potent cytotoxins, genotoxins and carcinogens [4], which indicates that many DBPs may exert more toxicity to humans than THMs [3]. However, there are toxicological data gaps for most of the DBPs, and more in vitro bioassays or chronic in vivo bioassays need to be carried out. Compared with the experimental test for the toxicological data, the in-silico approach provides an effective way for predicting the toxicities of chemicals. Quantitative structure-activity relationship (QSAR) seems to be a useful method to study the toxicities of DBPs.
The QSAR method provides a promising, faster way of predicting the activity of chemicals using the structural information of the compound. A limited number of QSARs have been proposed for DBP studies. A QSAR for benchmark concentrations of cranial neural tube dysmorphogenesis was established for 10 halogenated derivatives (HAAs) [5]. The mutagenicity in the Salmonlla typhimurium strain TA100 of 42 HAAs was predicted by a QSAR model based on geometrical-, radial distribution function (RDF)-, weighted holistic invariant molecular (WHIM)-, eigenvalueand 2D-autocorrelation-based methods, as well as information descriptors [6]. The eight-variable model was internally validated by leave-one-out (LOO) cross-validation, the bootstrapping test and Y-scrambling. Tang and Wang [7] developed a QSAR model for finding the energy of the highest occupied molecular orbital (E HOMO ) of 36 DBPs, and the model was validated by LOO cross-validation and K-fold cross-validation. An empirical QSAR model based on liposome-water partition coefficients (logK lipw ) was proposed for the 50% effective concentration (EC 50 ) of five DBPs (1,1-dichloroethene, bromoethane, chloroform, dichloromethane and bromoform) on Vibrio fischeri [8]. Another empirical model was built for the first-order rate constants of photodegradation of six iodinated trihalomethanes (ITHMs) and three brominated THMs (BTHMs) [9]. Yang and Zhang [10] predicted the developmental toxicity of 19 DBPs to Platynereis dumerilii embryos using an oil-water partition coefficient (logP), protein kinase A (pK(a)), E HOMO and lowest unoccupied molecular orbital energy (E LUMO ). All the aforementioned QSAR models for DBPs lack both strictly internal and external validation, which may not guarantee the real predictive ability of the models. Only two studies [8,10] were related to the toxicities of DBPs with a limited number of compounds (no more than 19 samples). The main reasons for the lack of QSAR study on DBPs is that only a small fraction of the DBPs identified (out of a total of 700) have been tested for toxicity so far. It is implied that QSAR techniques remain underutilized by DBP researchers [2].
In the present study, we aim to develop QSAR models using multiple linear regression (MLR) to predict five toxicity endpoints of DBPs. The developed QSAR models were strictly internally validated by LOO and leave-many-out (LMO) cross-validation and Y-scrambling, and externally validated by several metrics, including variance explained in external prediction Q 2 F1 [11], Q 2 F2 [12], and Q 2 F3 [13], concordance correlation coefficient (CCC) [14,15], and r 2 m metrics based on the correlation of the observed and predicted response data with and without the intercept [16,17], and the criteria recommended by Golbraikh and Tropsha [18].

Molecular Structure Descriptors
The molecular structure descriptors of the chemicals were calculated by the Dragon software (version 6.0, Talete SRL, Milano, Italy). The original descriptors generated from the Dragon software were refined by the following principles [19,20]: (1) the descriptors with standard deviation less than 0.0001 were excluded; (2) the descriptors with at least one missing value were deleted; (3) the descriptors with (abs)pair correlation larger than or equal to 0.8 were excluded; and (4) the descriptors that Pearson correlation coefficients (|r|) between descriptors and toxicities of DBPs lower than 0.3 were deleted. The remaining descriptors were used for the further analysis.

Data Splits and Model Development
The whole dataset was randomly split into several training and test sets. It was recommended that analysis of the models should be obtained from various splits into the training set and test set [21]. For each toxicity endpoint of DBPs, we randomly split the whole dataset into five training sets and five test sets.
All-subset regression for the whole dataset was performed with the QSARINS software [22,23]. The four-variable multiple linear regression (MLR) models with the highest coefficients of determination (R 2 ) and explained variance in leave-one-out (Q 2 LOO ) prediction were selected for the whole dataset. The MLR models for the training sets based on the same descriptors derived from the whole dataset were developed, and the test sets were used to validate the external predictive abilities of the models.

Model Validation
The statistical parameters for modeling, internal and external validation metrics were adopted to evaluate the fit, stability and predicative power of the QSAR model. The quality parameters include R 2 , adjusted coefficient of determination (R 2 adj ), root mean square error in fitting (RMSE tr ) and F-value (F). The internal validations were performed by the LOO and LMO cross-validation (Q 2 LOO and Q 2 LMO ) and the Y-scrambling test (R 2 Yscr and Q 2 Yscr ). The external validation was evaluated by a test set. The parameters Q 2 F1 [11], Q 2 F2 [12], Q 2 F3 [13], CCC [14,15], average of r 2 m (r 2 m ) and the difference between r 2 m (∆r 2 m ) [16,17] were used as the measures of the predictive power of a QSAR model. The proposed parameters by Golbraikh and Tropsha were also applied for the external validation criteria [18]: slope of the regression line over external data (k and k'), coefficients of determination between predicted and observed activities (R 2 0 ) and coefficients of determination between observed and predicted activities (R 2 0 ). The validation criteria thresholds for the parameters mentioned above are: (1) R 2 > 0.7, Q 2 LOO and Q 2 LMO > 0.6, Q 2 F1 , Q 2 F2 and Q 2 F3 > 0.7, difference between R 2 and Q 2 LOO smaller than 0.1 [15]; (2) r 2 m > 0.65; (3) CCC > 0.85 [15]; and (4) criteria recommended by Golbraikh and Tropsha [18]: 15. The descriptors included in the whole dataset, training set and the test set should satisfy the following conditions [24]: (1) the Pearson correlation coefficients for the complete (r c ), training (r t ) and test (r e ) sets are equal to or greater than 0.3: |r c | and |r t | ≥ 0.3; (2) the normalized regression coefficient of the descriptor for the complete (β c ) and the training (β t ) sets are equal to or greater than 0.001: |β c | and |β t | ≥ 0.001; and (3) absence of the sign-change problem: sign(r c ) = sign(r t ) = sign(r e ); sign(r c ) = sign(β c ) = sign(β t ) Models that have acceptable validation criteria thresholds for all conditions were considered as the final models. These models are robust and able to make good internal and external predictions.

Applicability Domain
The application domain of the QSAR model was defined by the leverage approach from the hat matrix (h i in Equation (1)), which is calculated from the descriptors of chemicals [19,25], and by identification of chemicals with LOO cross-validated standardized residuals greater than 2.0 standard deviation units. An outlier in the QSAR model is defined as h i value larger than the warning leverage h* and LOO standardized residuals greater than 2.0, which is graphically depicted in the Williams plot. The warning leverage h * is fixed at (3k)/n, where k is the number of model parameters and n is the number of the objects used to calculate the model.
where x i is the descriptor row vector of the query compound; X is the n × k matrix of k model descriptor values for n training set compounds and the superscript T refers to the transpose of the matrix/vector.

Selected Descriptors
For each five endpoints of X-Microtox, GSH+, GSH−, DNA+ and DNA− of the selected drinking-water DBPs, four descriptors were selected by all-subset regression for the whole dataset, which was performed by the QSARINS software [22,23]. The selected descriptors for X-Microtox were the spectral diameter from Burden matrix weighted by mass (SpDiam_B(m)), average vertex sum from Burden matrix weighted by van der Waals volume (AVS_B(v)), eigenvalue no. 5 from augmented edge adjacency mat weighted by dipole moment (Eig05_AEA(dm)) and sum of ddsN E-states (SddsN). For the endpoints of GSH+ and GSH−, the four selected descriptors were percentage of C atoms (C%), SpDiam_B(m), P_VSA-like on LogP bin 8 (P_VSA_LogP_8) and sum of topological distances between N..Br (T(N..Br)). The selected descriptors for DNA+ were P_VSA-like on LogP, bin 7 (P_VSA_LogP_7), signal 04/weighted by I-state (Mor04s), T(N..Br) and sum of topological distances between N..I (T(N..I)). For the endpoints of DNA−, the four selected descriptors were sum of atomic van der Waals volumes (Sv), P_VSA_LogP_7, signal 03/weighted by I-state (Mor03s) and T(N..I). There was no high correlation between the selected descriptors, and these descriptors were used as inputs for the training set.

Models Development and Validation
The whole dataset for each endpoint was randomly split into training and test sets by five iterations (splits 1-5) for the same size of training and test sets. Of the chemicals in the dataset, 80% were selected for the training set and the remaining 20% were considered as the test set. Five QSAR models based on the same size of training sets were built for five endpoints of X-Microtox, GSH+, GSH−, DNA+ and DNA−. The statistical parameters of modeling, internal and external validations were calculated for each model. We have examined five splits into the training and test sets. The realistic reliability of the QSAR model was estimated by the result of the analysis of five splits into the training and test sets. The statistical characteristics of QSAR models of five splits for five endpoints are given in Supplementary data. It can be found that all QSAR models presented high predictive power, as those models satisfy the internal and external validation criteria: R 2 > 0.7, Q 2 LOO and Q 2 LMO > 0.6, Q 2 F1 , Q 2 F2 and Q 2 F3 > 0.7, CCC > 0.85, and r 2 m > 0.65. In order to validate whether the descriptors presented in the QSAR models were real or not before model validation and interpretation, we checked the sign-change-problem correlation coefficients and regression coefficients of a descriptor in the MLR model regressions, before and after the data split [24]. It was found that all descriptors in the five QSAR models satisfy the conditions [24]: |r c | and |r t | ≥ 0.3, |β c | and |β t | ≥ 0.001, sign(r c ) = sign(r t ) = sign(r e ), and sign(r c ) = sign(β c ) = sign(β t ). Thus, the selected descriptors are considered to be real variables.

Endpoint
Equation

Domain of Applicability
The criteria for an outlier are expressed as hi > h * and LOO standardized residuals greater than 2.0. For the first split (split 1), the Williams plot of the five QSAR models for toxicity bioassays of X-Microtox, GSH+, GSH−, DNA+ and DNA− are shown in Figures 1-3. All 50 DBPs in the training and test sets satisfy the outlier criteria, and the QSAR models lead to reliably predicted data. The outlier was also examined for the splits 2-5 (the statistical parameters are listed in Supplementary data). There were no outliers in the training and test sets.

Explanation of Descriptors
The five developed QSAR models allow for mechanical interpretation of the toxicities of DBPs to X-Microtox, GSH+, GSH−, DNA+ and DNA−. Four descriptors for five models selected by stepwise MLR helped to improve the understanding of DBPs. A total of 12 descriptors were included in five four-variable models. The For the QSAR model based on the toxicity of the X-Microtox bioassay, the standard regression coefficients of AVS_B(v) and SddsN were higher than the other two descriptors. AVS_B(v) and

Domain of Applicability
The criteria for an outlier are expressed as h i > h * and LOO standardized residuals greater than 2.0. For the first split (split 1), the Williams plot of the five QSAR models for toxicity bioassays of X-Microtox, GSH+, GSH−, DNA+ and DNA− are shown in Figures 1-3. All 50 DBPs in the training and test sets satisfy the outlier criteria, and the QSAR models lead to reliably predicted data. The outlier was also examined for the splits 2-5 (the statistical parameters are listed in Supplementary data). There were no outliers in the training and test sets.

Explanation of Descriptors
The five developed QSAR models allow for mechanical interpretation of the toxicities of DBPs to X-Microtox, GSH+, GSH−, DNA+ and DNA−. Four descriptors for five models selected by stepwise MLR helped to improve the understanding of DBPs. A total of 12 descriptors were included in five four-variable models. The For the QSAR model based on the toxicity of the X-Microtox bioassay, the standard regression coefficients of AVS_B(v) and SddsN were higher than the other two descriptors. AVS_B(v) and SddsN were the main factors affecting the toxicity of DBPs to X-Microtox. The descriptor SddsN indicated that −N(=)= (nitro) (where "=" represents a double bond and "−" represents a single bond) is one of the main factors that affected the toxicity of DBPs to X-Microtox. For the toxicities of DBPs toward GSH+ and GSH−, the same descriptors were selected in the QSAR models, which indicates the similar toxicity mechanism of the two endpoints. SpDiam_B(m) and T(N..Br) were the main positive contributors to the toxicity, as their standard regression coefficients were higher than C% and P_VSA_LogP_8. T(N..Br), the heteroatom between N and Br, was one of the main factors affecting the toxicity of DBPs toward GSH+ and GSH−. There were two descriptors (P_VSA_LogP_7 and T(N..I)) in the QSAR model for DNA+ and DNA−, where P_VSA_LogP_7 made a positive contribution to toxicity while T(N..I) made a negative contribution to toxicity. T(N..I), the heteroatom between N and I, was one of the main factors affecting the toxicity of DBPs toward DNA+ and DNA−.

Conclusions
All five considered QSAR models, resulting from the random split of the whole dataset intro training and test sets, satisfied the validation criteria. The application domain was clearly defined and the mechanism was interpreted. The reliability of the five QSAR models met the Organization for Economic Co-operation and Development (OECD) principles [26]: (1) a defined endpoint; (2) an unambiguous algorithm; (3) a defined domain of applicability; (4) appropriate measures of goodness-of-fit, robustness and predictivity; and (5) a mechanistic interpretation, if possible.
The QSAR method was used to develop several predictive models for the toxicities of DBPs toward X-Microtox, GSH+, GSH−, DNA+, and DNA−. Using the selected descriptors, which can be easily generated from the Dragon software, all the developed QSAR models with a good predictive performance were used for estimating toxicities of DBPs. It is expected that the proposed QSAR models could be used to predict the toxicities of DBPs.