Molecules 2017, 22(10), 1671; doi:10.3390/molecules22101671

Article
Predictive QSAR Models for the Toxicity of Disinfection Byproducts
Litang Qin 1,2,3, Xin Zhang 1, Yuhan Chen 1, Lingyun Mo 2,3,*, Honghu Zeng 1,2,3 and Yanpeng Liang 1,2,3
1 College of Environmental Science and Engineering, Guilin University of Technology, Guilin 541004, China
2 Guangxi Key Laboratory of Environmental Pollution Control Theory and Technology, Guilin University of Technology, Guilin 541004, China
3 Collaborative Innovation Center for Water Pollution Control and Water Safety in Karst Area, Guilin University of Technology, Guilin 541004, China
* Correspondence: Tel.: +86-773-589-5238
Received: 26 September 2017 / Accepted: 1 October 2017 / Published: 9 October 2017

Abstract

Several hundred disinfection byproducts (DBPs) have been identified in drinking water, and are known to have potentially adverse health effects. There are toxicological data gaps for most DBPs, and predictive methods may provide an effective way to address them. In silico models for the toxicological endpoints of DBPs have rarely been developed. The main aim of the present study is to develop predictive quantitative structure–activity relationship (QSAR) models for the reactive toxicities of 50 DBPs in the five bioassays of X-Microtox, GSH+, GSH−, DNA+ and DNA−. All-subset regression was used to select the optimal descriptors, and multiple linear-regression models were built. The developed QSAR models for the five endpoints satisfied the internal and external validation criteria: coefficient of determination (R2) > 0.7, explained variance in leave-one-out prediction (Q2LOO) and in leave-many-out prediction (Q2LMO) > 0.6, variance explained in external prediction (Q2F1, Q2F2, and Q2F3) > 0.7, and concordance correlation coefficient (CCC) > 0.85. The applicability domains and the meaning of the selected descriptors for the QSAR models are discussed. The obtained QSAR models can be used to predict the toxicities of the 50 DBPs.
Keywords:
disinfection byproduct; QSAR; validation; toxicity; drinking water

1. Introduction

Disinfection byproducts (DBPs) have raised concerns since the first DBPs, trihalomethane (THM) compounds, were identified in the 1970s [1]. DBPs may result from reactions between disinfectants and dissolved organic matter present in source waters [2,3]. The THMs are the most common DBPs in typical chlorinated drinking water. Approximately 700 DBPs have been identified, as disinfectants that are increasingly used in drinking water, such as ozone, chlorine dioxide and chloramines, also react with organic compounds [2]. Research has demonstrated that several known DBPs are considered to be potent cytotoxins, genotoxins and carcinogens [4], which indicates that many DBPs may exert more toxicity to humans than THMs [3]. However, there are toxicological data gaps for most of the DBPs, and more in vitro bioassays or chronic in vivo bioassays need to be carried out. Compared with experimental testing, in silico approaches provide an efficient way of predicting the toxicities of chemicals. Quantitative structure–activity relationship (QSAR) modeling is a potentially useful method for studying the toxicities of DBPs.
The QSAR method provides a promising, faster way of predicting the activity of chemicals using the structural information of the compound. A limited number of QSARs have been proposed for DBP studies. A QSAR for benchmark concentrations of cranial neural tube dysmorphogenesis was established for 10 haloacetic acids (HAAs) [5]. The mutagenicity of 42 HAAs in the Salmonella typhimurium strain TA100 was predicted by a QSAR model based on geometrical-, radial distribution function (RDF)-, weighted holistic invariant molecular (WHIM)-, eigenvalue- and 2D-autocorrelation-based methods, as well as information descriptors [6]. The eight-variable model was internally validated by leave-one-out (LOO) cross-validation, the bootstrapping test and Y-scrambling. Tang and Wang [7] developed a QSAR model for finding the energy of the highest occupied molecular orbital (EHOMO) of 36 DBPs, and the model was validated by LOO cross-validation and K-fold cross-validation. An empirical QSAR model based on liposome–water partition coefficients (logKlipw) was proposed for the 50% effective concentration (EC50) of five DBPs (1,1-dichloroethene, bromoethane, chloroform, dichloromethane and bromoform) on Vibrio fischeri [8]. Another empirical model was built for the first-order rate constants of photodegradation of six iodinated trihalomethanes (ITHMs) and three brominated THMs (BTHMs) [9]. Yang and Zhang [10] predicted the developmental toxicity of 19 DBPs to Platynereis dumerilii embryos using the oil–water partition coefficient (logP), the acid dissociation constant (pKa), EHOMO and the lowest unoccupied molecular orbital energy (ELUMO). All the aforementioned QSAR models for DBPs lack strict internal and external validation, so their real predictive ability is not guaranteed. Only two studies [8,10] were related to the toxicities of DBPs, and both covered a limited number of compounds (no more than 19 samples).
The main reason for the lack of QSAR studies on DBPs is that only a small fraction of the ~700 DBPs identified have been tested for toxicity so far. This implies that QSAR techniques remain underutilized by DBP researchers [2].
In the present study, we aim to develop QSAR models using multiple linear regression (MLR) to predict five toxicity endpoints of DBPs. The developed QSAR models were strictly internally validated by LOO and leave-many-out (LMO) cross-validation and Y-scrambling, and externally validated by several metrics, including variance explained in external prediction Q2F1 [11], Q2F2 [12], and Q2F3 [13], concordance correlation coefficient (CCC) [14,15], the $r_m^2$ metrics based on the correlation of the observed and predicted response data with and without the intercept [16,17], and the criteria recommended by Golbraikh and Tropsha [18].

2. Materials and Methods

2.1. Experimental Toxicity Data

The five endpoints for the 50 drinking water DBPs are described in Table 1. Experimental reactive toxicity data for the five endpoints (X-Microtox, GSH+, GSH−, DNA+ and DNA−) of the 50 drinking water DBPs were obtained from the literature (Table 2) [4]. The 50 DBPs cover a wide range of chemical groups: 47 commonly detected drinking water DBPs together with 1,1-dichloroethene (1,1-DCE), dichloromethane (DCM) and bromochloromethane. The negative logarithms of the observed effect concentrations (pEC50 for Microtox and pECIR1.5 for the other assays, with concentrations in M (mol/L)) are listed in Table 2.
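As a small illustration of the scale used in Table 2, the following sketch converts a molar effect concentration into its negative base-10 logarithm. The function name `to_p_scale` and the example concentration are hypothetical and are not taken from the original data.

```python
import math

def to_p_scale(effect_conc_molar: float) -> float:
    """Return -log10 of an effect concentration given in mol/L (e.g. EC50 -> pEC50)."""
    if effect_conc_molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(effect_conc_molar)

# Hypothetical example: an EC50 of 1 mM (1e-3 mol/L).
example_pec50 = to_p_scale(1e-3)
```

On this scale a larger value corresponds to a lower effect concentration, that is, to a more toxic compound.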

2.2. Molecular Structure Descriptors

The molecular structure descriptors of the chemicals were calculated with the Dragon software (version 6.0, Talete SRL, Milano, Italy). The original descriptors generated by Dragon were refined according to the following principles [19,20]: (1) descriptors with a standard deviation less than 0.0001 were excluded; (2) descriptors with at least one missing value were deleted; (3) descriptors with an absolute pairwise correlation larger than or equal to 0.8 were excluded; and (4) descriptors whose Pearson correlation coefficient with the toxicities of the DBPs (|r|) was lower than 0.3 were deleted. The remaining descriptors were used for further analysis.
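The four refinement rules can be sketched in plain Python as follows. This is only an illustrative sketch and not the Dragon or QSARINS implementation. The function names, the use of `None` for missing values, and the tie-breaking choice for rule (3) (keeping the descriptor of a correlated pair that correlates more strongly with the response) are assumptions, since the text does not state which member of a pair was dropped.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / math.sqrt(saa * sbb)

def std_dev(a):
    """Sample standard deviation."""
    m = sum(a) / len(a)
    return math.sqrt(sum((x - m) ** 2 for x in a) / (len(a) - 1))

def prefilter(descriptors, y, sd_min=1e-4, pair_max=0.8, resp_min=0.3):
    """descriptors: dict name -> list of values, with None marking a missing value."""
    kept = {}
    for name, col in descriptors.items():
        if any(v is None for v in col):       # rule 2: missing values
            continue
        if std_dev(col) < sd_min:             # rule 1: near-constant descriptor
            continue
        if abs(pearson(col, y)) < resp_min:   # rule 4: weak correlation with toxicity
            continue
        kept[name] = col
    # Rule 3 (assumed tie-break): visit descriptors from strongest to weakest
    # correlation with y and drop any that is highly correlated with one already kept.
    ranked = sorted(kept, key=lambda nm: abs(pearson(kept[nm], y)), reverse=True)
    selected = []
    for name in ranked:
        if all(abs(pearson(kept[name], kept[s])) < pair_max for s in selected):
            selected.append(name)
    return selected
```

The function returns descriptor names only, so the caller can keep the original descriptor table untouched.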

2.3. Data Splits and Model Development

Because it has been recommended that a model be analyzed over various splits into training and test sets [21], the whole dataset for each toxicity endpoint of the DBPs was randomly split five times, giving five training sets and five test sets.
All-subset regression for the whole dataset was performed with the QSARINS software [22,23]. The four-variable multiple linear regression (MLR) models with the highest coefficients of determination (R2) and explained variance in leave-one-out (Q2LOO) prediction were selected for the whole dataset. The MLR models for the training sets based on the same descriptors derived from the whole dataset were developed, and the test sets were used to validate the external predictive abilities of the models.
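The idea behind all-subset regression can be illustrated with the sketch below, which fits an ordinary least-squares model for every descriptor subset of a given size and keeps the one with the highest R2. It is a simplified stand-in, not the QSARINS algorithm: the study ranked models by both R2 and Q2LOO, whereas this sketch ranks by R2 only, and all function names are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve the square system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(cols, y):
    """Fit y = b0 + sum(b_j * x_j) via the normal equations; return (coefficients, R2)."""
    n = len(y)
    rows = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_mean = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

def best_subset(descriptors, y, size):
    """Try every descriptor subset of the given size; return (names, R2) with the highest R2."""
    best = None
    for names in combinations(sorted(descriptors), size):
        try:
            _, r2 = fit_ols([descriptors[nm] for nm in names], y)
        except ValueError:  # skip subsets whose descriptors are collinear
            continue
        if best is None or r2 > best[1]:
            best = (names, r2)
    return best
```

The number of subsets grows combinatorially with the size of the descriptor pool, which is one reason for reducing the pool before the search.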

2.4. Model Validation

The statistical parameters for modeling and the internal and external validation metrics were adopted to evaluate the fit, stability and predictive power of the QSAR model. The quality parameters include R2, the adjusted coefficient of determination ($R_{adj}^2$), the root mean square error in fitting (RMSEtr) and the F-value (F). The internal validations were performed by LOO and LMO cross-validation (Q2LOO and Q2LMO) and the Y-scrambling test (R2Yscr and Q2Yscr). The external validation was evaluated with a test set. The parameters Q2F1 [11], Q2F2 [12], Q2F3 [13], CCC [14,15], the average of $r_m^2$ ($\bar{r}_m^2$) and the difference between the $r_m^2$ values ($\Delta r_m^2$) [16,17] were used as measures of the predictive power of a QSAR model. The parameters proposed by Golbraikh and Tropsha were also applied as external validation criteria [18]: the slopes of the regression lines over the external data (k and k’), the coefficient of determination between predicted and observed activities ($R_0^2$) and the coefficient of determination between observed and predicted activities (${R'_0}^{2}$).
The validation criteria thresholds for the parameters mentioned above are: (1) R2 > 0.7, Q2LOO and Q2LMO > 0.6, Q2F1, Q2F2 and Q2F3 > 0.7, difference between R2 and Q2LOO smaller than 0.1 [15]; (2) $\bar{r}_m^2$ > 0.65; (3) CCC > 0.85 [15]; and (4) the criteria recommended by Golbraikh and Tropsha [18]: $(R^2 - R_0^2)/R^2 < 0.1$ or $(R^2 - {R'_0}^{2})/R^2 < 0.1$, and 0.85 ≤ k ≤ 1.15 or 0.85 ≤ k’ ≤ 1.15.
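For readers who want to reproduce the external validation step, the sketch below computes Q2F1, Q2F2, Q2F3 and CCC from observed and predicted test-set values, using the definitions commonly given in the cited works [11,12,13,14,15]. The function names are hypothetical, and the threshold check covers only these four metrics, not the full set of criteria listed above.

```python
def external_metrics(y_train, y_test, y_pred):
    """Return Q2F1, Q2F2, Q2F3 and CCC for test-set observations and predictions."""
    n_tr, n_ext = len(y_train), len(y_test)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_test) / n_ext
    mean_pred = sum(y_pred) / n_ext
    # Sum of squared prediction errors on the external (test) set.
    press = sum((o - p) ** 2 for o, p in zip(y_test, y_pred))
    ss_ext_vs_tr = sum((o - mean_tr) ** 2 for o in y_test)   # reference: training mean
    ss_ext = sum((o - mean_ext) ** 2 for o in y_test)        # reference: test mean
    ss_tr = sum((o - mean_tr) ** 2 for o in y_train)         # training-set variability
    q2f1 = 1 - press / ss_ext_vs_tr
    q2f2 = 1 - press / ss_ext
    q2f3 = 1 - (press / n_ext) / (ss_tr / n_tr)
    # Concordance correlation coefficient between observed and predicted values.
    cov = sum((o - mean_ext) * (p - mean_pred) for o, p in zip(y_test, y_pred))
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    ccc = 2 * cov / (ss_ext + ss_pred + n_ext * (mean_ext - mean_pred) ** 2)
    return {"Q2F1": q2f1, "Q2F2": q2f2, "Q2F3": q2f3, "CCC": ccc}

def passes_external_thresholds(metrics):
    """Check Q2F1, Q2F2, Q2F3 > 0.7 and CCC > 0.85."""
    q2_ok = all(metrics[k] > 0.7 for k in ("Q2F1", "Q2F2", "Q2F3"))
    return q2_ok and metrics["CCC"] > 0.85
```

The three Q2 variants differ only in the reference variance in the denominator, which is why they can disagree when the test-set mean or spread differs from that of the training set.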
The descriptors included in the models for the whole dataset, the training set and the test set should satisfy the following conditions [24], where rc, rt and re denote the Pearson correlation coefficients between a descriptor and the response for the complete, training and test sets, respectively: (1) the correlation coefficients for the complete and training sets are equal to or greater than 0.3 in absolute value: |rc| and |rt| ≥ 0.3; (2) the normalized regression coefficients of the descriptor for the complete (βc) and the training (βt) sets are equal to or greater than 0.001 in absolute value: |βc| and |βt| ≥ 0.001; and (3) absence of the sign-change problem: sign(rc) = sign(rt) = sign(re); sign(rc) = sign(βc) = sign(βt).
Models that have acceptable validation criteria thresholds for all conditions were considered as the final models. These models are robust and able to make good internal and external predictions.
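The three descriptor conditions above reduce to a simple check once the correlation and regression coefficients are known. The sketch below assumes these values have already been computed elsewhere; the function and parameter names are hypothetical.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def descriptor_is_consistent(r_c, r_t, r_e, beta_c, beta_t,
                             r_min=0.3, beta_min=0.001):
    """Apply the three descriptor conditions to one descriptor.

    r_c, r_t, r_e: correlation with the response in the complete, training and test sets.
    beta_c, beta_t: normalized regression coefficients for the complete and training sets.
    """
    strong_r = abs(r_c) >= r_min and abs(r_t) >= r_min              # condition 1
    strong_beta = abs(beta_c) >= beta_min and abs(beta_t) >= beta_min  # condition 2
    same_r_sign = sign(r_c) == sign(r_t) == sign(r_e)               # condition 3a
    same_beta_sign = sign(r_c) == sign(beta_c) == sign(beta_t)      # condition 3b
    return strong_r and strong_beta and same_r_sign and same_beta_sign
```

A descriptor that fails any one of the checks would be flagged for removal or closer inspection before the model is interpreted.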

2.5. Applicability Domain

The applicability domain of the QSAR model was defined by the leverage approach based on the hat matrix (hi in Equation (1)), which is calculated from the descriptors of the chemicals [19,25], and by identifying chemicals with LOO cross-validated standardized residuals greater than 2.0 standard deviation units. An outlier in the QSAR model is defined as a chemical with an hi value larger than the warning leverage h* and an LOO standardized residual greater than 2.0, as depicted graphically in the Williams plot. The warning leverage h* is fixed at (3k)/n, where k is the number of model parameters and n is the number of objects used to calculate the model.
$$h_i = x_i^{T} (X^{T} X)^{-1} x_i \quad (i = 1, \ldots, n) \qquad (1)$$
where xi is the descriptor row vector of the query compound; X is the n × k matrix of k model descriptor values for n training set compounds and the superscript T refers to the transpose of the matrix/vector.
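Equation (1) and the outlier rule can be sketched in plain Python as follows. The function names are hypothetical, the matrix inverse is a small Gauss-Jordan routine intended only for the few columns of an MLR model, and whether X carries a leading column of ones for the intercept is left to the caller as an assumption. The residual check uses the absolute value of the standardized residual, which is also an assumption, since the text only says "greater than 2.0".

```python
def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def leverages(x_train, x_query):
    """h_i = x_i^T (X^T X)^-1 x_i for each row x_i of x_query, with X = x_train."""
    k = len(x_train[0])
    xtx = [[sum(r[i] * r[j] for r in x_train) for j in range(k)] for i in range(k)]
    inv = invert(xtx)
    return [sum(x[i] * inv[i][j] * x[j] for i in range(k) for j in range(k))
            for x in x_query]

def warning_leverage(n, k):
    """h* = 3k / n, with k model parameters and n training objects."""
    return 3.0 * k / n

def is_outlier(h, std_residual, h_star, resid_limit=2.0):
    """Outlier as defined in the text: leverage above h* and a large standardized residual."""
    return h > h_star and abs(std_residual) > resid_limit
```

Passing the training matrix as both arguments gives the leverages of the training compounds, while passing test-set rows as `x_query` gives the leverages plotted for the test compounds in a Williams plot.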

3. Results and Discussion

3.1. Selected Descriptors

For each of the five endpoints (X-Microtox, GSH+, GSH−, DNA+ and DNA−) of the selected drinking-water DBPs, four descriptors were selected by all-subset regression for the whole dataset, which was performed with the QSARINS software [22,23]. The selected descriptors for X-Microtox were the spectral diameter from the Burden matrix weighted by mass (SpDiam_B(m)), the average vertex sum from the Burden matrix weighted by van der Waals volume (AVS_B(v)), eigenvalue no. 5 from the augmented edge adjacency matrix weighted by dipole moment (Eig05_AEA(dm)) and the sum of ddsN E-states (SddsN). For the endpoints of GSH+ and GSH−, the four selected descriptors were the percentage of C atoms (C%), SpDiam_B(m), P_VSA-like on LogP, bin 8 (P_VSA_LogP_8) and the sum of topological distances between N..Br (T(N..Br)). The selected descriptors for DNA+ were P_VSA-like on LogP, bin 7 (P_VSA_LogP_7), signal 04/weighted by I-state (Mor04s), T(N..Br) and the sum of topological distances between N..I (T(N..I)). For the endpoint of DNA−, the four selected descriptors were the sum of atomic van der Waals volumes (Sv), P_VSA_LogP_7, signal 03/weighted by I-state (Mor03s) and T(N..I). There was no high correlation between the selected descriptors, and these descriptors were used as inputs for the training set.

3.2. Models Development and Validation

The whole dataset for each endpoint was randomly split into training and test sets five times (splits 1–5), with the same training and test set sizes in each split. Of the chemicals in the dataset, 80% were selected for the training set and the remaining 20% formed the test set. For each of the five endpoints (X-Microtox, GSH+, GSH−, DNA+ and DNA−), five QSAR models (one per split) were built on training sets of the same size. The statistical parameters of modeling and of internal and external validation were calculated for each model. The realistic reliability of the QSAR models was estimated from the results of these five splits. The statistical characteristics of the QSAR models of the five splits for the five endpoints are given in the Supplementary data. All QSAR models showed high predictive power, as they satisfy the internal and external validation criteria: R2 > 0.7, Q2LOO and Q2LMO > 0.6, Q2F1, Q2F2 and Q2F3 > 0.7, CCC > 0.85, and $\bar{r}_m^2$ > 0.65.
To check whether the descriptors in the QSAR models were real before model validation and interpretation, we examined the correlation coefficients and regression coefficients of each descriptor in the MLR models for the sign-change problem, before and after the data split [24]. All descriptors in the five QSAR models satisfy the conditions [24]: |rc| and |rt| ≥ 0.3, |βc| and |βt| ≥ 0.001, sign(rc) = sign(rt) = sign(re), and sign(rc) = sign(βc) = sign(βt). Thus, the selected descriptors are considered to be real variables.
The four-variable QSAR models for the first split and their statistical parameters for the five toxicity endpoints are listed in Table 3. All five QSAR models for the toxicity bioassays of X-Microtox, GSH+, GSH−, DNA+ and DNA− are satisfactory according to all conditions: R2 > 0.7, Q2LOO and Q2LMO > 0.6, Q2F1, Q2F2 and Q2F3 > 0.7, CCC > 0.85 [15]; $\bar{r}_m^2$ > 0.65; $(R^2 - R_0^2)/R^2 < 0.1$ or $(R^2 - {R'_0}^{2})/R^2 < 0.1$, and 0.85 ≤ k ≤ 1.15 or 0.85 ≤ k’ ≤ 1.15 [18]. Figure 1, Figure 2 and Figure 3 present the correlations between experimental and calculated –logEC50 or –logECIR1.5 (pEC50 or pECIR1.5) values for the five models. The pEC50 or pECIR1.5 values calculated from the QSAR models are listed in Table 2.

3.3. Domain of Applicability

An outlier is defined by hi > h* together with an LOO standardized residual greater than 2.0. For the first split (split 1), the Williams plots of the five QSAR models for the toxicity bioassays of X-Microtox, GSH+, GSH−, DNA+ and DNA− are shown in Figure 1, Figure 2 and Figure 3. None of the DBPs in the training and test sets meets the outlier criteria, so the QSAR models lead to reliably predicted data. Outliers were also examined for splits 2–5 (the statistical parameters are listed in the Supplementary data). There were no outliers in those training and test sets either.

3.4. Explanation of Descriptors

The five developed QSAR models allow for a mechanistic interpretation of the toxicities of DBPs to X-Microtox, GSH+, GSH−, DNA+ and DNA−. The four descriptors selected for each of the five models by all-subset regression helped to improve the understanding of DBPs. A total of 12 descriptors were included in the five four-variable models. The selected descriptors are SpDiam_B(m), AVS_B(v), Eig05_AEA(dm), SddsN, Sv, C%, P_VSA_LogP_7, P_VSA_LogP_8, Mor03s, Mor04s, T(N..Br) and T(N..I). These structural features are related to DBP toxicity. Positive regression coefficients indicate increasing toxicity with increasing descriptor values, while negative regression coefficients indicate decreasing toxicity with increasing descriptor values. The 12 descriptors belong to different groups of descriptors: 2D matrix-based descriptors (SpDiam_B(m) and AVS_B(v)), edge adjacency indices (Eig05_AEA(dm)), atom-type E-state indices (SddsN), constitutional indices (Sv and C%), P_VSA-like descriptors (P_VSA_LogP_7 and P_VSA_LogP_8), 3D-MoRSE descriptors (Mor03s and Mor04s) and 2D atom pairs (T(N..Br) and T(N..I)).
For the QSAR model based on the toxicity in the X-Microtox bioassay, the standard regression coefficients of AVS_B(v) and SddsN were higher than those of the other two descriptors, so AVS_B(v) and SddsN were the main factors affecting the toxicity of DBPs to X-Microtox. The descriptor SddsN indicated that −N(=)= (nitro) (where “=” represents a double bond and “−” represents a single bond) is one of the main factors affecting the toxicity of DBPs to X-Microtox. For the toxicities of DBPs toward GSH+ and GSH−, the same descriptors were selected in the QSAR models, which indicates a similar toxicity mechanism for the two endpoints. SpDiam_B(m) and T(N..Br) were the main positive contributors to the toxicity, as their standard regression coefficients were higher than those of C% and P_VSA_LogP_8. T(N..Br), the sum of topological distances between N and Br atoms, was one of the main factors affecting the toxicity of DBPs toward GSH+ and GSH−. Two descriptors (P_VSA_LogP_7 and T(N..I)) were shared by the QSAR models for DNA+ and DNA−: P_VSA_LogP_7 made a positive contribution to toxicity, while T(N..I) made a negative contribution. T(N..I), the sum of topological distances between N and I atoms, was one of the main factors affecting the toxicity of DBPs toward DNA+ and DNA−.

4. Conclusions

All five considered QSAR models, resulting from the random split of the whole dataset into training and test sets, satisfied the validation criteria. The applicability domain was clearly defined and the mechanism was interpreted. The five QSAR models met the Organization for Economic Co-operation and Development (OECD) principles [26]: (1) a defined endpoint; (2) an unambiguous algorithm; (3) a defined domain of applicability; (4) appropriate measures of goodness-of-fit, robustness and predictivity; and (5) a mechanistic interpretation, if possible.
The QSAR method was used to develop several predictive models for the toxicities of DBPs toward X-Microtox, GSH+, GSH−, DNA+, and DNA−. Using the selected descriptors, which can be easily generated from the Dragon software, all the developed QSAR models with a good predictive performance were used for estimating toxicities of DBPs. It is expected that the proposed QSAR models could be used to predict the toxicities of DBPs.

Supplementary Materials

Supplementary Materials are available online.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (grant numbers 21407032, 51578171, and 21667013), Science Research and Technology Development Project of Guilin (2016012505), Science Research and Technology Development Project of Guangxi (Guikehe1599005-2-2), and the Project of High-level Innovation Team and Outstanding Scholar in Guangxi Colleges and Universities (002401013001).

Author Contributions

Litang Qin wrote the manuscript and provided the main research ideas. Xin Zhang collected and analyzed data and prepared the figures and tables. Yuhan Chen collected data. Lingyun Mo built the QSAR models. Honghu Zeng analyzed data. Yanpeng Liang calculated the molecular descriptors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rook, J.J. Formation of haloforms during chlorination of natural waters. Water Treat. Exam. 1974, 23, 234–243.
  2. Chen, B.; Zhang, T.; Bond, T.; Gan, Y. Development of quantitative structure activity relationship (QSAR) model for disinfection byproduct (DBP) research: A review of methods and resources. J. Hazard. Mater. 2015, 299, 260–279.
  3. Grellier, J.; Rushton, L.; Briggs, D.J.; Nieuwenhuijsen, M.J. Assessing the human health impacts of exposure to disinfection by-products - A critical review of concepts and methods. Environ. Int. 2015, 78, 61–81.
  4. Stalter, D.; O’Malley, E.; von Gunten, U.; Escher, B.I. Fingerprinting the reactive toxicity pathways of 50 drinking water disinfection by-products. Water Res. 2016, 91, 19–30.
  5. Hunter, E.S.; Rogers, E.; Blanton, M.; Richard, A.; Chernoff, N. Bromochloro-haloacetic acids: Effects on mouse embryos in vitro and QSAR considerations. Reprod. Toxicol. 2006, 21, 260–266.
  6. Perez-Garrido, A.; Gonzalez, M.P.; Escudero, A.G. Halogenated derivatives QSAR model using spectral moments to predict haloacetic acids (HAA) mutagenicity. Bioorg. Med. Chem. 2008, 16, 5720–5732.
  7. Tang, W.Z.; Wang, F. Quantitative structure activity relationship (QSAR) of chlorine effects on E-LUMO of disinfection by-product: Chlorinated alkanes. Chemosphere 2010, 78, 914–921.
  8. Stalter, D.; Dutt, M.; Escher, B.I. Headspace-Free setup of in vitro bioassays for the evaluation of volatile disinfection by-products. Chem. Res. Toxicol. 2013, 26, 1605–1614.
  9. Xiao, Y.J.; Fan, R.L.; Zhang, L.F.; Yue, J.Q.; Webster, R.D.; Lim, T.T. Photodegradation of iodinated trihalomethanes in aqueous solution by UV 254 irradiation. Water Res. 2014, 49, 275–285.
  10. Yang, M.T.; Zhang, X.R. Comparative developmental toxicity of new aromatic halogenated DBPs in a chlorinated saline sewage effluent to the marine polychaete Platynereis dumerilii. Environ. Sci. Technol. 2013, 47, 10868–10876.
  11. Shi, L.M.; Fang, H.; Tong, W.; Wu, J.; Perkins, R.; Blair, R.M.; Branham, W.S.; Dial, S.L.; Moland, C.L.; Sheehan, D.M. QSAR models using a large diverse set of estrogens. J. Chem. Inf. Comput. Sci. 2001, 41, 186–195.
  12. Schüürmann, G.; Ebert, R.U.; Chen, J.W.; Wang, B.; Kuhne, R. External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean. J. Chem. Inf. Model. 2008, 48, 2140–2145.
  13. Consonni, V.; Ballabio, D.; Todeschini, R. Evaluation of model predictive ability by external validation techniques. J. Chemom. 2010, 24, 194–201.
  14. Chirico, N.; Gramatica, P. Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection. J. Chem. Inf. Model. 2012, 52, 2044–2058.
  15. Chirico, N.; Gramatica, P. Real external predictivity of QSAR models: How to evaluate it? comparison of different validation criteria and proposal of using the concordance correlation coefficient. J. Chem. Inf. Model. 2011, 51, 2320–2335.
  16. Roy, K.; Mitra, I.; Kar, S.; Ojha, P.K.; Das, R.N.; Kabir, H. Comparative studies on some metrics for external validation of QSPR models. J. Chem. Inf. Model. 2012, 52, 396–408.
  17. Ojha, P.K.; Mitra, I.; Das, R.N.; Roy, K. Further exploring R2M metrics for validation of QSPR models. Chemom. Intell. Lab. Syst. 2011, 107, 194–205.
  18. Golbraikh, A.; Tropsha, A. Beware of q2! J. Mol. Graph. Model. 2002, 20, 269–276.
  19. Qin, L.T.; Liu, S.S.; Chen, F.; Wu, Q.S. Development of validated quantitative structure–retention relationship models for retention indices of plant essential oils. J. Sep. Sci. 2013, 36, 1553–1560.
  20. Qin, L.T.; Liu, S.S.; Chen, F.; Xiao, Q.F.; Wu, Q.S. Chemometric model for predicting retention indices of constituents of essential oils. Chemosphere 2013, 90, 300–305.
  21. Toropova, A.P.; Toropov, A.A.; Benfenati, E.; Leszczynska, D.; Leszczynski, J. QSAR model as a random event: A case of rat toxicity. Bioorg. Med. Chem. 2015, 23, 1223–1230.
  22. Gramatica, P.; Chirico, N.; Papa, E.; Cassani, S.; Kovarich, S. QSARINS: A new software for the development, analysis, and validation of QSAR MLR models. J. Comput. Chem. 2013, 34, 2121–2132.
  23. Gramatica, P.; Cassani, S.; Chirico, N. QSARINS-chem: Insubria datasets and new QSAR/QSPR models for environmental pollutants in QSARINS. J. Comput. Chem. 2014, 35, 1036–1044.
  24. Kiralj, R.; Ferreira, M.M.C. Is your QSAR/QSPR descriptor real or trash? J. Chemom. 2010, 24, 681–693.
  25. Tropsha, A.; Gramatica, P.; Gombar, V.K. The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 2003, 22, 69–77.
  26. Organisation for Economic Co-operation and Development. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models; ENV/JM/MONO: Paris, France, 2007; pp. 1–154.
  • Sample Availability: Not available.
Figure 1. Scatter plot of observed versus calculated pEC50 (A), and the Williams plot of the final model (B) for 50 disinfection byproducts to X-Microtox. “●”: training set, “○”: test set. h*: warning leverage.
Figure 2. Scatter plot of observed versus calculated pECIR1.5 ((A) for GSH+ and (C) for GSH−) and the Williams plot ((B) for GSH+ and (D) for GSH−) of the final model for 45 disinfection byproducts. “●”: training set, “○”: test set.
Figure 3. Scatter plot of observed versus calculated pECIR1.5 ((A) for DNA+ and (C) for DNA−) and the Williams plot ((B) for DNA+ and (D) for DNA−) of the final model for 45 disinfection byproducts. “●”: training set, “○”: test set.
Table 1. The five endpoints of the 50 drinking water disinfection byproducts (DBPs).
| Bioassay | Test Species (Strain/Cell Line) a | Endpoint | Detected Signal |
|---|---|---|---|
| Microtox | Aliivibrio fischeri | Cytotoxicity | Bioluminescence as indicator for cell viability |
| E. coli ± GSH | Escherichia coli MJF335 (GSH−) and MJF276 (GSH+) | Interaction with proteins/peptides | OD at 600 nm as indicator for cell density and descriptor of cell growth |
| E. coli ± DNA | Escherichia coli MV4108 (DNA−) and MV1161 (DNA+) | Interaction with DNA | OD at 600 nm as indicator for cell density and descriptor of cell growth |
a: GSH+: EC50 of E. coli strain MJF276, which is capable of producing glutathione (GSH); GSH−: EC50 of E. coli strain MJF335, which is not capable of producing GSH and hence more susceptible to compounds which react with proteins (i.e., soft electrophiles); DNA+: EC50 of E. coli strain MV1161, which is capable of repairing DNA damage; DNA−: EC50 of E. coli strain MV4108, which is not capable of repairing DNA damage and hence more susceptible to compounds which react with DNA (i.e., hard electrophiles).
Table 2. Observed and calculated effect concentrations (pEC50 (negative logarithm of 50% effective concentration) for X-Microtox and pECIR1.5 (negative logarithm of 1.5 induction ratio effective concentration) for the other assays, mol/L) of disinfection byproducts.
| No. | Name | X-Microtox Observed pEC50 | X-Microtox Calculated pEC50 | GSH+ Observed pECIR1.5 | GSH+ Calculated pECIR1.5 | GSH− Observed pECIR1.5 | GSH− Calculated pECIR1.5 | DNA+ Observed pECIR1.5 | DNA+ Calculated pECIR1.5 | DNA− Observed pECIR1.5 | DNA− Calculated pECIR1.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Halomethanes | | | | | | | | | | |
| 1 | 1,1-dichloroethene | 3.1549 | 3.3214 | 1.3516 | 2.4847 | 1.4145 | 2.4357 | 1.0706 * | 1.7985 * | 0.4318 | 1.6512 |
| 2 | dichloromethane | 2.2840 | 3.1963 | 1.0888 * | 0.7391 * | 0.8861 | 0.7818 | 0.8539 | 1.6828 | 0.7212 | 0.9997 |
| 3 | bromochloromethane | 5.0706 | 4.4813 | 1.2182 | 1.6474 | 0.7328 | 1.8741 | 0.6021 | 1.7808 | - | - |
| 4 | chloroform | 2.4318 | 2.4675 | 1.2366 | 0.9330 | 1.4089 * | 1.0149 * | 1.1675 * | 1.2621 * | 1.0809 | 0.6385 |
| 5 | bromodichloromethane | 5.0269 | 4.7723 | 1.2111 * | 1.7614 * | 1.4034 | 1.9032 | 1.3188 | 1.6196 | 1.5686 * | 1.2324 * |
| 6 | bromoform | 3.6383 * | 3.2549 * | 1.0066 | 1.7407 | 1.5850 | 1.9863 | 1.0862 | 1.9166 | 1.1675 | 1.9629 |
| 7 | dibromochloromethane | 3.0000 * | 3.0193 * | 2.0809 | 1.7617 | 1.9208 | 2.0116 | 1.6990 * | 1.8321 * | 1.5528 * | 1.6647 * |
| 8 | dichloroiodomethane | 3.4949 * | 3.5960 * | 2.7959 | 2.9585 | 3.2076 | 3.4508 | 2.1938 | 1.8353 | 2.3279 | 1.6163 |
| 9 | bromochloroiodomethane | 1.6021 | 2.2717 | 2.7959 | 2.8967 | 3.3279 * | 3.3766 * | 2.2291 | 2.9723 | 2.3188 | 1.9777 |
| 10 | dibromoiodomethane | 4.0506 * | 4.0353 * | 2.8697 * | 2.8209 * | 3.4559 | 3.2854 | 2.0000 * | 1.9755 * | 1.9586 | 2.2126 |
| 11 | chlorodiiodomethane | 4.6576 | 4.4046 | 3.0809 | 2.9161 | 3.6990 | 3.3999 | 2.4437 * | 2.0457 * | 2.3098 | 2.2495 |
| 12 | bromodiiodomethane | 5.6021 | 3.8512 | 3.0969 * | 2.8369 * | 3.5850 | 3.3046 | 3.0506 | 1.9836 | 2.9586 * | 2.4210 * |
| 13 | triiodomethane | 2.4202 | 2.6149 | 3.3615 | 2.8486 | 3.9337 | 3.3188 | 2.8861 | 1.9398 | 2.8861 | 2.5894 |
| | Halonitromethanes | | | | | | | | | | |
| 14 | trichloronitromethane | 4.3098 | 3.7132 | 4.6383 | 4.2152 | 5.3143 | 4.9812 | 4.2007 * | 4.2819 * | 4.0809 | 3.3428 |
| 15 | tribromonitromethane | 2.7447 | 2.7705 | 5.3820 | 5.1283 | 6.4949 | 6.0793 | 4.8861 | 5.0640 | 4.7447 | 4.7139 |
| | Haloacetonitriles | | | | | | | | | | |
| 16 | dichloroacetonitrile | 4.5086 | 3.7184 | 3.2757 | 2.5481 | 3.7632 * | 2.5119 * | 3.0362 | 2.5123 | 2.8239 | 2.9808 |
| 17 | trichloroacetonitrile | 4.8861 | 4.2672 | 3.8979 | 2.6617 | 3.7447 | 2.6486 | 3.7447 | 3.8560 | 3.4815 | 3.0753 |
| 18 | bromochloroacetonitrile | 4.0132 | 3.8159 | 4.3188 | 4.1067 | 4.2757 | 4.2982 | 3.8539 | 3.3736 | 3.7212 | 3.3159 |
| 19 | dibromoacetonitrile | 4.7696 | 3.9655 | 4.7100 | 4.7981 | 4.7825 * | 5.0416 * | 4.2291 | 4.1282 | 4.1938 | 3.5113 |
| | Haloketones | | | | | | | | | | |
| 20 | 1,1-dichloropropanone | 2.7212 | 4.0555 | 3.0506 * | 2.2187 * | 3.3188 | 2.2263 | 2.4318 * | 2.2303 * | 2.3565 | 2.7129 |
| 21 | 1,1,1-trichloropropanone | 3.6576 * | 3.4857 * | 2.2803 | 2.3311 | 2.7364 | 2.3615 | 2.3872 | 1.0225 | 3.0000 | 2.2423 |
| | Haloacetic acids | | | | | | | | | | |
| 22 | chloroacetic acid | 6.0088 | 5.6592 | 2.1367 | 1.5737 | 1.9851 | 1.6179 | 1.6990 | 1.0652 | 1.6576 | 2.3448 |
| 23 | bromoacetic acid | 1.8861 | 3.2782 | 3.8697 | 2.5921 | 4.2111 | 2.8428 | 4.0655 | 3.3127 | 4.0000 | 3.1346 |
| 24 | iodoacetic acid | 2.6778 | 3.8068 | 4.3768 * | 3.8079 * | 4.7212 * | 4.305 * | 3.7447 | 2.4730 | 3.6576 | 3.8483 |
| 25 | dichloroacetic acid | 3.2147 | 4.0622 | 1.2967 | 1.6975 | 1.5229 | 1.7669 | 0.9208 | 1.2745 | 0.6198 | 1.1912 |
| 26 | bromochloroacetic acid | 5.1612 | 4.4508 | 2.0783 | 2.6122 | 2.3565 | 2.8669 | 1.1938 * | 1.9983 * | 1.6990 * | 1.8777 * |
| 27 | dibromoacetic acid | 4.1487 | 4.1865 | 2.2403 | 2.6637 | 2.4318 * | 2.9289 * | 1.6021 | 2.1776 | 1.8861 * | 2.2108 * |
| 28 | chloroiodoacetic acid | 1.8239 | 2.9267 | 4.4034 | 3.8242 | 4.5302 | 4.3246 | 4.0362 | 4.7239 | 4.1024 * | 4.8634 * |
| 29 | bromoiodoacetic acid | 3.7959 * | 3.8762 * | 4.2403 * | 3.8168 * | 4.0200 | 4.3157 | 3.7212 * | 4.8968 * | 3.8861 | 5.1786 |
| 30 | trichloroacetic acid | 3.2924 | 3.6489 | 1.4034 | 1.8108 | 1.4034 | 2.0112 | 1.0555 | 1.5905 | 1.0506 | 1.6881 |
| 31 | bromodichloroacetic acid | 1.4318 | 2.4029 | 2.7100 | 2.6367 | 2.9031 * | 2.8964 * | 1.7959 | 2.0658 | 1.6576 | 0.6340 |
| 32 | dibromochloroacetic acid | 3.5229 | 2.6718 | 2.7959 | 2.6882 | 2.8539 | 2.9584 | 1.4815 | 2.1417 | 1.4559 | 2.7074 |
| 33 | tribromoacetic acid | 4.4202 | 3.2073 | 3.3372 | 2.7322 | 3.6882 | 3.0113 | 2.1805 | 2.3490 | 2.6021 * | 2.9859 * |
| | Haloacetaldehyde | | | | | | | | | | |
| 34 | chloral hydrate | 2.1675 | 2.2067 | 2.2636 | 2.1046 | 2.1707 | 2.1359 | 1.3098 | 1.7185 | 1.6778 | 1.9750 |
| | Haloacetamides | | | | | | | | | | |
| 35 | dichloracetamide | 6.5229 | 6.9769 | 1.1135 | 1.4161 | 1.2798 | 1.5222 | 0.5850 | 1.5056 | 1.0506 | 1.6881 |
| 36 | bromochloroacetamide | 2.5686 * | 2.9516 * | 1.8539 * | 2.9712 * | 2.3565 | 3.3043 | 1.4559 | 2.8215 | 1.8239 * | 2.4295 * |
| 37 | dibromoacetamide | 3.0706 | 3.2043 | 4.2218 * | 3.6626 * | 4.2596 | 4.0477 | 3.9586 | 3.6467 | 3.6198 | 2.7807 |
| 38 | chloroiodoacetamide | 2.6576 | 3.3142 | 3.7212 | 3.5437 | 4.1192 | 4.0810 | 2.7212 | 2.0333 | 2.5376 | 2.1670 |
| 39 | bromoiodoacetamide | 3.3768 | 4.4729 | 3.1163 | 4.1762 | 3.7959 * | 4.7536 * | 2.2291 | 1.9651 | 2.0706 | 2.4718 |
| 40 | diiodoacetamide | 1.4318 | 1.1768 | 2.7825 | 3.5724 | 3.0482 | 4.1155 | 2.2218 | 2.1941 | 2.1938 | 2.1785 |
| 41 | trichloroacetamide | 2.0000 | 1.8559 | 0.3565 | 1.5288 | 0.7825 * | 1.6577 * | 1.0706 | 1.4651 | 1.5850 | 2.1329 |
| 42 | bromodichloroacetamide | 4.3098 * | 4.099 * | 3.6198 | 2.9951 | 3.8239 | 3.3331 | 4.0315 | 2.6693 | 3.7959 | 2.7569 |
| 43 | dibromochloroacetamide | 4.3768 * | 4.2953 * | 3.9566 | 3.6865 | 4.3188 | 4.0764 | 3.9586 | 3.8000 | 3.6383 | 3.1107 |
| 44 | tribromoacetamide | 2.1308 | 1.5148 | 4.3233 | 4.3703 | 4.6676 | 4.8106 | 4.4437 | 4.7359 | 4.2147 * | 3.4421 * |
| | Nitrosamines | | | | | | | | | | |
| 45 | n-nitrosodimethylamine | 2.9208 | 3.0246 | - | - | - | - | - | - | - | - |
| 46 | n-nitrosodiethylamine | 7.4202 | 7.1686 | - | - | - | - | - | - | - | - |
| 47 | n-nitrosopiperidine | 3.8861 | 4.6292 | - | - | - | - | - | - | - | - |
| 48 | n-nitrosomorpholine | 3.8539 | 3.3096 | - | - | - | - | - | - | - | - |
| 49 | nitrosodi-n-butylamine | 3.5850 * | 3.4285 * | - | - | - | - | - | - | - | - |
| | Furanone | | | | | | | | | | |
| 50 | 3-chloro-4-(dichloromethyl)-5- | 4.7447 | 3.1323 | 5.2596 | 6.0139 | 5.6108 | 6.4454 | 4.8861 | 4.3949 | 4.9586 | 4.7578 |
* The chemical included in the test set.
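As a rough illustration of how the starred test-set rows feed into external validation, the sketch below computes the root mean square error and the concordance correlation coefficient for the X-Microtox test compounds listed above. It assumes that the first two numeric columns of each row are the observed and predicted pEC50 values, in that order; the names `TEST_PAIRS`, `rmse` and `ccc` are illustrative and not taken from the article. Only nine starred X-Microtox compounds appear among compounds 7 to 50, while Table 3 reports ntest = 10, so the results are not expected to match the published statistics exactly.

```python
import math

# Starred (test-set) X-Microtox rows transcribed from the table above:
# (compound number, first value, second value).
# Assumption: first value = observed pEC50, second value = predicted pEC50.
TEST_PAIRS = [
    (7, 3.0000, 3.0193),
    (8, 3.4949, 3.5960),
    (10, 4.0506, 4.0353),
    (21, 3.6576, 3.4857),
    (29, 3.7959, 3.8762),
    (36, 2.5686, 2.9516),
    (42, 4.3098, 4.0990),
    (43, 4.3768, 4.2953),
    (49, 3.5850, 3.4285),
]


def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def ccc(obs, pred):
    """Concordance correlation coefficient (Lin's CCC)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cross = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    ss_o = sum((o - mean_o) ** 2 for o in obs)
    ss_p = sum((p - mean_p) ** 2 for p in pred)
    return 2 * cross / (ss_o + ss_p + n * (mean_o - mean_p) ** 2)


observed = [o for _, o, _ in TEST_PAIRS]
predicted = [p for _, _, p in TEST_PAIRS]

print(f"n = {len(TEST_PAIRS)}")
print(f"RMSE = {rmse(observed, predicted):.4f}")
print(f"CCC  = {ccc(observed, predicted):.4f}")
```

Both statistics depend only on the test-set pairs, which is why they can be sketched from this table alone; the Q2F1 and Q2F3 criteria also require the training-set responses.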
Table 3. QSAR (quantitative structure–activity relationship) models and statistical parameters for the five endpoints of disinfection byproducts (training set = 80% of the whole dataset, test set = 20% of the whole dataset).
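| Endpoint | Equation a | Modeling b | Internal Validation c | External Validation d | Golbraikh & Tropsha e |
|---|---|---|---|---|---|
| X-Microtox | $\mathrm{pEC}_{50}$ = −11.8502 + 0.1230 SpDiam_B(m) + 4.9744 AVS_B(v) + 0.8805 Eig05_AEA(dm) − 3.3986 SddsN | $n_{\mathrm{tr}}$ = 40, $R^2$ = 0.7152, $R^2_{\mathrm{adj}}$ = 0.6826, $\mathrm{RMSE}_{\mathrm{tr}}$ = 0.7682, $F$ = 21.9717 | $Q^2_{\mathrm{LOO}}$ = 0.6374, $\mathrm{RMSE}_{\mathrm{cv}}$ = 0.8668, $Q^2_{\mathrm{LMO}}$ = 0.6216, $R^2_{Y\mathrm{scr}}$ = 0.1034, $Q^2_{Y\mathrm{scr}}$ = −0.2452 | $n_{\mathrm{test}}$ = 10, $\mathrm{RMSE}_{\mathrm{ext}}$ = 0.2040, $R^2_{\mathrm{ext}}$ = 0.8660, $Q^2_{F1}$ = 0.8508, $Q^2_{F2}$ = 0.8496, $Q^2_{F3}$ = 0.9799, CCC = 0.9115, $\bar{r}^2_m$ = 0.7185, $\Delta r^2_m$ = 0.1439 | $k$ = 1.0136, $k'$ = 0.9837, $R^2_0$ = 0.8018, $R'^2_0$ = 0.8584 |
| GSH+ | $\mathrm{pEC}_{\mathrm{IR1.5}}$ = −2.4744 + 0.1022 C% + 0.3184 SpDiam_B(m) + 0.0725 P_VSA_LogP_8 + 0.2132 T(N..Br) | $n_{\mathrm{tr}}$ = 36, $R^2$ = 0.7837, $R^2_{\mathrm{adj}}$ = 0.7558, $\mathrm{RMSE}_{\mathrm{tr}}$ = 0.5927, $F$ = 28.0843 | $Q^2_{\mathrm{LOO}}$ = 0.6956, $\mathrm{RMSE}_{\mathrm{cv}}$ = 0.7032, $Q^2_{\mathrm{LMO}}$ = 0.6644, $R^2_{Y\mathrm{scr}}$ = 0.1121, $Q^2_{Y\mathrm{scr}}$ = −0.2323 | $n_{\mathrm{test}}$ = 9, $\mathrm{RMSE}_{\mathrm{ext}}$ = 0.6010, $R^2_{\mathrm{ext}}$ = 0.7715, $Q^2_{F1}$ = 0.7502, $Q^2_{F2}$ = 0.7502, $Q^2_{F3}$ = 0.7776, CCC = 0.8500, $\bar{r}^2_m$ = 0.6558, $\Delta r^2_m$ = 0.1915 | $k$ = 1.0596, $k'$ = 0.9119, $R^2_0$ = 0.6964, $R'^2_0$ = 0.7709 |
| GSH− | $\mathrm{pEC}_{\mathrm{IR1.5}}$ = −2.4133 + 0.0894 C% + 0.3829 SpDiam_B(m) + 0.0835 P_VSA_LogP_8 + 0.2270 T(N..Br) | $n_{\mathrm{tr}}$ = 36, $R^2$ = 0.8166, $R^2_{\mathrm{adj}}$ = 0.7929, $\mathrm{RMSE}_{\mathrm{tr}}$ = 0.5936, $F$ = 34.5096 | $Q^2_{\mathrm{LOO}}$ = 0.7332, $\mathrm{RMSE}_{\mathrm{cv}}$ = 0.7160, $Q^2_{\mathrm{LMO}}$ = 0.6634, $R^2_{Y\mathrm{scr}}$ = 0.1140, $Q^2_{Y\mathrm{scr}}$ = −0.2349 | $n_{\mathrm{test}}$ = 9, $\mathrm{RMSE}_{\mathrm{ext}}$ = 0.6578, $R^2_{\mathrm{ext}}$ = 0.7593, $Q^2_{F1}$ = 0.7436, $Q^2_{F2}$ = 0.7430, $Q^2_{F3}$ = 0.7748, CCC = 0.8703, $\bar{r}^2_m$ = 0.6688, $\Delta r^2_m$ = 0.0426 | $k$ = 0.9659, $k'$ = 0.9969, $R^2_0$ = 0.7376, $R'^2_0$ = 0.7510 |
| DNA+ | $\mathrm{pEC}_{\mathrm{IR1.5}}$ = 1.8732 + 0.0493 P_VSA_LogP_7 − 0.2258 Mor04s + 0.2798 T(N..Br) − 0.8971 T(N..I) | $n_{\mathrm{tr}}$ = 36, $R^2$ = 0.7019, $R^2_{\mathrm{adj}}$ = 0.6635, $\mathrm{RMSE}_{\mathrm{tr}}$ = 0.7113, $F$ = 18.2520 | $Q^2_{\mathrm{LOO}}$ = 0.6287, $\mathrm{RMSE}_{\mathrm{cv}}$ = 0.7940, $Q^2_{\mathrm{LMO}}$ = 0.6338, $R^2_{Y\mathrm{scr}}$ = 0.1139, $Q^2_{Y\mathrm{scr}}$ = −0.2471 | $n_{\mathrm{test}}$ = 9, $\mathrm{RMSE}_{\mathrm{ext}}$ = 0.5570, $R^2_{\mathrm{ext}}$ = 0.8232, $Q^2_{F1}$ = 0.7482, $Q^2_{F2}$ = 0.7228, $Q^2_{F3}$ = 0.8173, CCC = 0.8781, $\bar{r}^2_m$ = 0.7541, $\Delta r^2_m$ = 0.0264 | $k$ = 0.8805, $k'$ = 1.0974, $R^2_0$ = 0.8132, $R'^2_0$ = 0.8186 |
| DNA− | $\mathrm{pEC}_{\mathrm{IR1.5}}$ = 0.9105 + 0.3091 Sv + 0.0493 P_VSA_LogP_7 + 0.2008 Mor03s − 1.0911 T(N..I) | $n_{\mathrm{tr}}$ = 36, $R^2$ = 0.7164, $R^2_{\mathrm{adj}}$ = 0.6786, $\mathrm{RMSE}_{\mathrm{tr}}$ = 0.6540, $F$ = 18.9496 | $Q^2_{\mathrm{LOO}}$ = 0.6221, $\mathrm{RMSE}_{\mathrm{cv}}$ = 0.7550, $Q^2_{\mathrm{LMO}}$ = 0.5291, $R^2_{Y\mathrm{scr}}$ = 0.1200, $Q^2_{Y\mathrm{scr}}$ = −0.2504 | $n_{\mathrm{test}}$ = 9, $\mathrm{RMSE}_{\mathrm{ext}}$ = 0.4991, $R^2_{\mathrm{ext}}$ = 0.7774, $Q^2_{F1}$ = 0.7505, $Q^2_{F2}$ = 0.7500, $Q^2_{F3}$ = 0.8348, CCC = 0.8787, $\bar{r}^2_m$ = 0.6920, $\Delta r^2_m$ = 0.0076 | $k$ = 0.9538, $k'$ = 1.0145, $R^2_0$ = 0.7643, $R'^2_0$ = 0.7664 |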
a SpDiam_B(m): spectral diameter from Burden matrix weighted by mass; AVS_B(v): average vertex sum from Burden matrix weighted by van der Waals volume; Eig05_AEA(dm): eigenvalue no. 5 from augmented edge adjacency matrix weighted by dipole moment; SddsN: sum of ddsN E-states; Sv: sum of atomic van der Waals volumes (scaled on carbon atom); C%: percentage of C atoms; P_VSA_LogP_7: P_VSA-like on LogP, bin 7; P_VSA_LogP_8: P_VSA-like on LogP, bin 8; T(N..Br): sum of topological distances between N..Br; T(N..I): sum of topological distances between N..I; Mor04s: signal 04/weighted by I-state; Mor03s: signal 03/weighted by I-state.
b $n_{\mathrm{tr}}$: the number of samples in the training set; $R^2$: coefficient of determination; $R^2_{\mathrm{adj}}$: adjusted $R^2$; $\mathrm{RMSE}_{\mathrm{tr}}$: root mean square error in fitting; $F$: F-value.
c $Q^2_{\mathrm{LOO}}$: explained variance in leave-one-out prediction; $\mathrm{RMSE}_{\mathrm{cv}}$: root mean square error in cross-validation prediction; $Q^2_{\mathrm{LMO}}$: explained variance in leave-many-out prediction; $R^2_{Y\mathrm{scr}}$ and $Q^2_{Y\mathrm{scr}}$: $R^2$ and $Q^2$ in Y-scrambling, respectively.
d $n_{\mathrm{test}}$: the number of samples in the test set; $\mathrm{RMSE}_{\mathrm{ext}}$: root mean square error in the test set; $R^2_{\mathrm{ext}}$: external determination coefficient; $Q^2_{F1}$, $Q^2_{F2}$ and $Q^2_{F3}$: variance explained in the test set; CCC: concordance correlation coefficient; $\bar{r}^2_m$ and $\Delta r^2_m$: average and delta $r^2_m$ values of the Roy criteria, respectively.
e $k$ and $k'$: slopes of the regression lines over the external data; $R^2_0$ and $R'^2_0$: $R^2$ values in the Golbraikh & Tropsha criteria.
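The equations in Table 3 are ordinary multiple linear regressions, so a prediction is the intercept plus a weighted sum of four descriptor values. The following is a minimal sketch of how the five models could be encoded and applied; the coefficients are transcribed from Table 3, while the names `MODELS` and `predict` and the descriptor values in the usage example are hypothetical and do not correspond to any compound in the dataset. Real descriptor values would have to come from the descriptor software used by the authors.

```python
# Intercept and descriptor coefficients transcribed from Table 3.
# Responses are pEC50 (X-Microtox) or pEC_IR1.5 (the other four endpoints).
MODELS = {
    "X-Microtox": (-11.8502, {"SpDiam_B(m)": 0.1230, "AVS_B(v)": 4.9744,
                              "Eig05_AEA(dm)": 0.8805, "SddsN": -3.3986}),
    "GSH+": (-2.4744, {"C%": 0.1022, "SpDiam_B(m)": 0.3184,
                       "P_VSA_LogP_8": 0.0725, "T(N..Br)": 0.2132}),
    "GSH-": (-2.4133, {"C%": 0.0894, "SpDiam_B(m)": 0.3829,
                       "P_VSA_LogP_8": 0.0835, "T(N..Br)": 0.2270}),
    "DNA+": (1.8732, {"P_VSA_LogP_7": 0.0493, "Mor04s": -0.2258,
                      "T(N..Br)": 0.2798, "T(N..I)": -0.8971}),
    "DNA-": (0.9105, {"Sv": 0.3091, "P_VSA_LogP_7": 0.0493,
                      "Mor03s": 0.2008, "T(N..I)": -1.0911}),
}


def predict(endpoint, descriptors):
    """Evaluate the linear model for one endpoint.

    `descriptors` maps descriptor names to values; a missing
    descriptor raises KeyError rather than being silently set to zero.
    """
    intercept, coefficients = MODELS[endpoint]
    return intercept + sum(
        coef * descriptors[name] for name, coef in coefficients.items()
    )


# Hypothetical descriptor values, for illustration only.
example = {"C%": 20.0, "SpDiam_B(m)": 8.0, "P_VSA_LogP_8": 10.0, "T(N..Br)": 0.0}
print(f"GSH+ prediction: {predict('GSH+', example):.4f}")
```

Keeping the coefficients in a single table-like dictionary makes it easy to check them against Table 3 line by line. A prediction is only meaningful for compounds inside the application domain discussed in the article.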
Molecules EISSN 1420-3049. Published by MDPI AG, Basel, Switzerland.