Descriptor Selection via Log-Sum Regularization for the Biological Activities of Chemical Structure
Abstract
1. Introduction
2. Methods
2.1. Coordinate Descent Algorithm for Different Thresholding Operators
2.2. Dataset
2.2.1. Simulated Data
2.2.2. Real Data
Algorithm: A coordinate descent algorithm for log-sum penalized multiple linear regression. |
Step 1: Initialize all ,set ; |
Step 2: Calculate the function (16) based on |
Step 3: Update each and cycle |
Step 3.1: |
and |
Step 3.2: Update |
Step 4: Let , |
Step 5: Repeat Steps 2 and 3 until converges |
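The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the symbols in the algorithm box did not survive, so the objective is assumed to be (1/(2n))·||y − Xβ||² + λ·Σ log(|β_j| + ε), the columns of X are assumed standardized so that each satisfies x_jᵀx_j / n = 1, and y is assumed centered. The names `log_sum_threshold`, `log_sum_coordinate_descent`, `lam`, `eps`, `max_iter`, and `tol` are hypothetical. The thresholding step minimizes the univariate problem 0.5·(β − ω)² + λ·log(|β| + ε) exactly by comparing the nonzero stationary point with zero, which may differ in form from the operator derived in the paper.

```python
import numpy as np


def log_sum_threshold(omega, lam, eps):
    """Minimize 0.5*(b - omega)**2 + lam*log(|b| + eps) over b (assumed form)."""
    a = abs(omega)
    # Stationary points for b > 0 solve b**2 + (eps - a)*b + (lam - a*eps) = 0.
    disc = (a + eps) ** 2 - 4.0 * lam
    if disc < 0.0:
        return 0.0  # objective is increasing on b > 0, so the minimum is at 0
    root = 0.5 * ((a - eps) + np.sqrt(disc))  # larger root = local minimum
    if root <= 0.0:
        return 0.0
    f_root = 0.5 * (root - a) ** 2 + lam * np.log(root + eps)
    f_zero = 0.5 * a ** 2 + lam * np.log(eps)
    # Keep the nonzero solution only if it beats the value at zero.
    return float(np.sign(omega) * root) if f_root < f_zero else 0.0


def log_sum_coordinate_descent(X, y, lam, eps, max_iter=200, tol=1e-8):
    """Cyclic coordinate descent for log-sum penalized linear regression.

    Assumes standardized columns of X (x_j' x_j / n == 1) and centered y.
    """
    n, p = X.shape
    beta = np.zeros(p)                          # Step 1: initialize all coefficients
    residual = np.asarray(y, dtype=float).copy()  # equals y - X @ beta at beta = 0
    for _ in range(max_iter):                   # Step 5: repeat until convergence
        max_change = 0.0
        for j in range(p):                      # Step 3: cycle over coordinates
            # Step 3.1: univariate fit of x_j to the partial residual
            omega = X[:, j] @ residual / n + beta[j]
            # Step 3.2: apply the thresholding operator
            new = log_sum_threshold(omega, lam, eps)
            if new != beta[j]:
                residual -= X[:, j] * (new - beta[j])
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return beta
```

Keeping a running residual avoids recomputing X @ beta for every coordinate, so one full sweep costs O(np). Because the log-sum penalty is non-convex, the result can depend on the starting point; the zero initialization here mirrors Step 1.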
3. Results
3.1. Analyses of Simulated Data
3.2. Analyses of Real Data
4. Conclusions
Acknowledgments
Author Contributions
Conflicts of Interest
Abbreviations
QSAR | Quantitative structure-activity relationship |
QSRR | Quantitative structure-(chromatographic) retention relationships |
QSPR | Quantitative structure-property relationship |
QSTR | Quantitative structure-toxicity relationship |
MLR | Multiple linear regression |
MCP | Maximum concave penalty |
SCAD | Smoothly clipped absolute deviation |
LASSO | Least absolute shrinkage and selection operator |
BTAZD | (Benzo-)Triazoles toxicity in Daphnia magna |
EDCER | EDC estrogen receptor binding |
GHLI | Global half-life index |
BCL2 | Apoptosis regulator Bcl-2 |
Appendix A. Proof
References
- Katritzky, A.R.; Kuanar, M.; Slavov, S.; Hall, C.D.; Karelson, M.; Kahn, I.; Dobchev, D.A. Quantitative correlation of physical and chemical properties with chemical structure: Utility for prediction. Chem. Rev. 2010, 110, 5714–5789.
- Shahlaei, M. Descriptor selection methods in quantitative structure-activity relationship studies: A review study. Chem. Rev. 2013, 113, 8093–8103.
- Liu, S.-S.; Liu, H.-L.; Yin, C.-S.; Wang, L.-S. VSMP: A novel variable selection and modeling method based on the prediction. J. Chem. Inf. Comput. Sci. 2003, 43, 964–969.
- Xu, L.; Zhang, W.-J. Comparison of different methods for variable selection. Anal. Chim. Acta 2001, 446, 475–481.
- Wegner, J.K.; Zell, A. Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method. J. Chem. Inf. Comput. Sci. 2003, 43, 1077–1084.
- Khajeh, A.; Modarress, H.; Zeinoddini-Meymand, H. Modified particle swarm optimization method for variable selection in QSAR/QSPR studies. Struct. Chem. 2013, 24, 1401–1409.
- Meissner, M.; Schmuker, M.; Schneider, G. Optimized particle swarm optimization (OPSO) and its application to artificial neural network training. BMC Bioinform. 2006, 7, 125.
- Ghosh, P.; Bagchi, M. QSAR modeling for quinoxaline derivatives using genetic algorithm and simulated annealing based feature selection. Curr. Med. Chem. 2009, 16, 4032–4048.
- Burden, F.; Winkler, D. Bayesian regularization of neural networks. Artif. Neural Netw. Methods Appl. 2009, 458, 23–42.
- Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39.
- Zheng, W.; Tropsha, A. Novel variable selection quantitative structure-property relationship approach based on the k-nearest-neighbor principle. J. Chem. Inf. Comput. Sci. 2000, 40, 185–194.
- Mercader, A.G.; Duchowicz, P.R.; Fernández, F.M.; Castro, E.A. Modified and enhanced replacement method for the selection of molecular descriptors in QSAR and QSPR theories. Chemom. Intell. Lab. Syst. 2008, 92, 138–144.
- Araújo, M.C.U.; Saldanha, T.C.B.; Galvao, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73.
- Put, R.; Daszykowski, M.; Baczek, T.; Heyden, Y.V. Retention prediction of peptides based on uninformative variable elimination by partial least squares. J. Proteome Res. 2006, 5, 1618–1625.
- Daghir-Wojtkowiak, E.; Wiczling, P.; Bocian, S.; Kubik, L.; Koslinski, P.; Buszewski, B.; Kaliszan, R.; Markuszewski, M.J. Least absolute shrinkage and selection operator and dimensionality reduction techniques in quantitative structure retention relationship modeling of retention in hydrophilic interaction liquid chromatography. J. Chromatogr. A 2015, 1403, 54–62.
- Goodarzi, M.; Chen, T.; Freitas, M.P. QSPR predictions of heat of fusion of organic compounds using Bayesian regularized artificial neural networks. Chemom. Intell. Lab. Syst. 2010, 104, 260–264.
- Aalizadeh, R.; Peter, C.; Thomaidis, N.S. Prediction of acute toxicity of emerging contaminants on the water flea Daphnia magna by Ant Colony Optimization-Support Vector Machine QSTR models. Environ. Sci. Process. Impacts 2017, 19, 438–448.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
- Algamal, Z.; Lee, M. A new adaptive l1-norm for optimal descriptor selection of high-dimensional QSAR classification model for anti-hepatitis C virus activity of thiourea derivatives. SAR QSAR Environ. Res. 2017, 28, 75–90.
- Xu, Z.; Chang, X.; Xu, F.; Zhang, H. l1/2 regularization: A thresholding representation theory and a fast solver. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1013–1027.
- Algamal, Z.; Lee, M.; Al-Fakih, A.; Aziz, M. High-dimensional QSAR modeling using penalized linear regression model with l1/2-norm. SAR QSAR Environ. Res. 2016, 27, 703–719.
- Liang, Y.; Liu, C.; Luan, X.-Z.; Leung, K.-S.; Chan, T.-M.; Xu, Z.B.; Zhang, H. Sparse logistic regression with a l1/2 penalty for gene selection in cancer classification. BMC Bioinform. 2013, 14, 198.
- Candes, E.J.; Wakin, M.B.; Boyd, S.P. Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Appl. 2008, 14, 877–905.
- Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320.
- Donoho, D.L.; Johnstone, I.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81, 425–455.
- Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
- Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
- Gramatica, P.; Papa, E. Screening and ranking of POPs for global half-life: QSAR approaches for prioritization based on molecular structure. Environ. Sci. Technol. 2007, 41, 2833–2839.
- Li, J.; Gramatica, P. The importance of molecular structures, endpoints values, and predictivity parameters in QSAR research: QSAR analysis of a series of estrogen receptor binders. Mol. Divers. 2010, 14, 687–696.
- Cassani, S.; Kovarich, S.; Papa, E.; Roy, P.P.; van der Wal, L.; Gramatica, P. Daphnia and fish toxicity of (benzo) triazoles: Validated QSAR models, and interspecies quantitative activity-activity modeling. J. Hazard. Mater. 2013, 258, 50–60.
- Zakharov, A.V.; Peach, M.L.; Sitzmann, M.; Nicklaus, M.C. QSAR modeling of imbalanced high-throughput screening data in PubChem. J. Chem. Inf. Model. 2014, 54, 705–712.
- Gramatica, P.; Cassani, S.; Chirico, N. QSARINS-Chem: Insubria Datasets and New QSAR/QSPR Models for Environmental Pollutants in QSARINS. J. Comput. Chem. Softw. News Updates 2014, 35, 1036–1044.
- Golbraikh, A.; Tropsha, A. Beware of q2. J. Mol. Graph. Model. 2002, 20, 269–276.
Dataset Name | No. of Samples | No. of Descriptors | No. of Samples (Training) | No. of Samples (Test) |
---|---|---|---|---|
BTAZD | 97 | 1083 | 78 | 19 |
EDCER | 129 | 1089 | 104 | 25 |
GHLI | 250 | 1120 | 200 | 50 |
BCL2 | 508 | 1562 | 407 | 101 |
Sample Size | SCAD | MCP | Log-Sum | ||||
---|---|---|---|---|---|---|---|
, | 381.60 | 92.92 | 19.09 | 23.36 | 19.13 | 19.00 | |
498.81 | 34.18 | 19.03 | 19.00 | 19.09 | 19.00 | ||
, | 382.24 | 93.26 | 27.74 | 25.79 | 21.77 | 21.54 | |
499.49 | 95.83 | 36.48 | 23.65 | 23.83 | 23.15 | ||
, | 378.96 | 93.98 | 19.26 | 24.67 | 19.98 | 19.11 | |
495.66 | 97.51 | 40.87 | 24.04 | 24.42 | 23.79 | ||
, | 379.35 | 93.46 | 29.22 | 26.08 | 22.48 | 22.04 | |
495.64 | 98.97 | 40.61 | 23.95 | 24.43 | 23.73 |
Sample Size | SCAD | MCP | Log-Sum | ||||
---|---|---|---|---|---|---|---|
, | 12.23 | 14.45 | 19.09 | 18.81 | 19.13 | 19.00 | |
16.22 | 20.00 | 19.03 | 19.00 | 19.09 | 19.00 | ||
, | 12.24 | 14.30 | 19.93 | 19.42 | 19.74 | 19.81 | |
16.26 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 | ||
, | 11.84 | 13.57 | 18.88 | 18.40 | 18.65 | 18.88 | |
15.79 | 19.99 | 19.97 | 19.93 | 19.96 | 19.93 | ||
, | 11.88 | 13.55 | 19.48 | 18.81 | 19.14 | 19.00 | |
15.80 | 19.99 | 19.98 | 19.93 | 19.97 | 19.95 |
Sample Size | SCAD | MCP | Log-Sum | ||||
---|---|---|---|---|---|---|---|
, | 3.20% | 15.55% | 100.00% | 80.52% | 100.00% | 100.00% | |
3.25% | 58.51% | 100.00% | 100.00% | 100.00% | 100.00% | ||
, | 3.12% | 14.44% | 98.03% | 74.58% | 93.34% | 98.80% | |
3.19% | 20.50% | 48.86% | 82.90% | 81.74% | 83.77% | ||
, | 3.20% | 15.33% | 71.85% | 75.30% | 90.68% | 91.97% | |
3.26% | 20.87% | 54.87% | 84.57% | 83.93% | 86.39% | ||
, | 3.19% | 20.50% | 48.86% | 82.90% | 81.74% | 83.77% | |
3.19% | 20.20% | 49.20% | 83.22% | 81.74% | 84.07% |
Datasets | Methods | ||||||
---|---|---|---|---|---|---|---|
GHLI | 0.87 | 0.65 | 0.74 | 0.68 | 0.74 | 0.90 | |
0.87 | 0.64 | 0.75 | 0.67 | 0.74 | 0.90 | ||
SCAD | 0.84 | 0.71 | 0.82 | 0.62 | 0.72 | 0.93 | |
MCP | 0.85 | 0.68 | 0.80 | 0.65 | 0.73 | 0.91 | |
0.82 | 0.75 | 0.81 | 0.62 | 0.72 | 0.92 | ||
log-sum | 0.85 | 0.69 | 0.84 | 0.57 | 0.75 | 0.88 | |
EDCER | 0.81 | 0.74 | 0.70 | 0.70 | 0.64 | 1.23 | |
0.82 | 0.73 | 0.73 | 0.68 | 0.63 | 1.25 | ||
SCAD | 0.86 | 0.63 | 0.74 | 0.69 | 0.70 | 1.12 | |
MCP | 0.83 | 0.70 | 0.74 | 0.69 | 0.65 | 1.21 | |
0.87 | 0.62 | 0.75 | 0.65 | 0.64 | 1.24 | ||
log-sum | 0.86 | 0.63 | 0.79 | 0.62 | 0.70 | 1.12 | |
BTAZD | 0.87 | 0.28 | 0.73 | 0.30 | 0.60 | 0.52 | |
0.88 | 0.28 | 0.74 | 0.30 | 0.60 | 0.52 | ||
SCAD | 0.86 | 0.30 | 0.77 | 0.30 | 0.62 | 0.51 | |
MCP | 0.88 | 0.27 | 0.83 | 0.29 | 0.64 | 0.50 | |
0.86 | 0.29 | 0.84 | 0.26 | 0.64 | 0.50 | ||
log-sum | 0.88 | 0.28 | 0.88 | 0.23 | 0.68 | 0.47 | |
BCL2 | 0.75 | 0.57 | 0.51 | 0.53 | 0.61 | 0.67 | |
0.74 | 0.58 | 0.58 | 0.51 | 0.61 | 0.67 | ||
SCAD | 0.72 | 0.59 | 0.73 | 0.45 | 0.59 | 0.69 | |
MCP | 0.74 | 0.57 | 0.73 | 0.46 | 0.58 | 0.70 | |
0.73 | 0.60 | 0.68 | 0.48 | 0.57 | 0.70 | ||
log-sum | 0.68 | 0.64 | 0.75 | 0.43 | 0.65 | 0.63 |
Rank | GHLI | |||||
---|---|---|---|---|---|---|
SCAD | MCP | Log-Sum | ||||
1 | JGI7 | JGI7 | Mp | JGI7 | minsCl | ATSC4c |
2 | ETA_Eta_B_RC | ETA_Eta_B_RC | MDEC-44 | ATSC4c | ATSC1e | GATS1e |
3 | BCUTc-1l | BCUTc-1l | GATS1e | GATS1e | minaaN | ATSC1p |
4 | Mv | Mv | ATSC1p | AATS0e | WPOL | MATS8m |
5 | ATSC4c | MDEN-23 | GGI9 | meanI | nHdsCH | maxwHBa |
6 | MDEN-23 | ATSC4c | maxHBa | nHdsCH | ALogP | maxHBa |
7 | GATS1e | GATS1e | maxwHBa | maxHBa | nFG12Ring | ATSC7s |
8 | ETA_Epsilon_3 | ETA_Epsilon_4 | MATS8m | ATSC7s | AATS6i | AATS0v |
9 | ETA_Epsilon_4 | minHCsatu | SIC1 | ATS4v | AATSC8m | ATS4p |
Rank | EDCER | |||||
---|---|---|---|---|---|---|
SCAD | MCP | Log-Sum | ||||
1 | JGI10 | JGI10 | JGI10 | JGI10 | JGI10 | JGI10 |
2 | VE2_Dt | VE2_Dt | MATS1i | JGI6 | GATS1c | MATS1c |
3 | JGI7 | JGI6 | AATSC2s | AATSC2s | GATS2s | hmax |
4 | AATSC8p | AATSC8p | hmax | AATSC8p | hmax | nssO |
5 | JGI6 | JGI7 | JGI6 | hmax | GATS5v | piPC6 |
6 | hmax | hmax | nBase | nHBint2 | nTG12Ring | nFG12HeteroRing |
7 | SpMin4_Bhm | SpMin4_Bhm | GATS8p | nHBd | nssO | maxaaCH |
8 | GATS5v | GATS5v | nFG12HeteroRing | maxaaCH | maxaaCH | SHBint2 |
9 | GATS2s | GATS2s | MATS5v | C3SP2 | ETA_Beta_ns_d | TIC1 |
10 | SpMin5_Bhs | nAcid | maxaaCH | SHBint8 | MDEC-24 | AATSC8m |
Rank | BTAZD | |||||
---|---|---|---|---|---|---|
SCAD | MCP | Log-Sum | ||||
1 | JGI4 | JGI4 | VE2_Dze | SpMax1_Bhi | SpMax1_Bhi | SpMax1_Bhi |
2 | VE2_Dze | VE2_Dze | JGI3 | MATS5m | GATS1p | GATS1v |
3 | MATS5v | ndS | ndS | GATS3s | ndS | GATS3s |
4 | SdS | MATS5v | CrippenLogP | C4SP3 | GATS3m | GATS8c |
5 | CrippenLogP | CrippenLogP | nHother | CrippenLogP | GATS3s | naaS |
6 | mindS | MDEO-22 | minddssS | ALogP | LipoaffinityIndex | AATSC4i |
7 | MDEO-22 | nF9Ring | GATS4m | nHother | nHsOH | LipoaffinityIndex |
8 | maxdS | ETA_Epsilon_4 | nF9Ring | ATSC8i | ATSC8i | SpDiam_Dzp |
Rank | BCL2 | |||||
---|---|---|---|---|---|---|
SCAD | MCP | Log-Sum | ||||
1 | JGI7 | AATSC8p | AATSC4s | JGI7 | MATS4s | AATSC8p |
2 | VE2_D | MATS4s | IC2 | MATS4s | IC2 | IC2 |
3 | AATSC8p | MATS5m | MDEN-13 | IC2 | E3m | GATS4s |
4 | MATS5m | IC2 | minHsNH2 | E3m | MDEN-13 | maxHBint2 |
5 | MATS4s | MDEN-13 | maxHBint2 | GATS8p | maxHBint2 | minsOH |
6 | IC2 | SpMax1_Bhi | nT8Ring | MDEN-13 | minsOH | SwHBa |
Descriptor Type | Class | Descriptor |
---|---|---|
Autocorrelation | 2D | AATS0v; AATSC4i; AATSC8m; ATS4p; ATSC1p; ATSC4c; ATSC7s; GATS1e; GATS1v; GATS3s; GATS8c; MATS1c; MATS8m; AATSC8p; GATS4s |
Atom-type electrotopological state | 2D | hmax; LipoaffinityIndex; maxaaCH; maxHBa; maxwHBa; naaS; nssO; SHBint2; maxHBint2; minsOH; SwHBa |
Barysz matrix | 2D | SpDiam_Dzp |
Burden modified eigenvalues | 2D | SpMax1_Bhi |
Information content | 2D | TIC1; IC2 |
Path counts | 2D | piPC6 |
Ring count | 2D | nFG12HeteroRing |
Topological charge | 2D | JGI10 |
Descriptor | Name |
---|---|
AATS0v | Average Broto–Moreau autocorrelation-lag 0/weighted by van der Waals volumes |
AATSC4i | Average centered Broto–Moreau autocorrelation-lag 4/weighted by first ionization potential |
AATSC8m | Average centered Broto–Moreau autocorrelation-lag 8/weighted by mass |
ATS4p | Broto–Moreau autocorrelation-lag 4/weighted by polarizabilities |
ATSC1p | Centered Broto–Moreau autocorrelation-lag 1/weighted by polarizabilities |
ATSC4c | Centered Broto–Moreau autocorrelation-lag 4/weighted by charges |
ATSC7s | Centered Broto–Moreau autocorrelation-lag 7/weighted by I-state |
GATS1e | Geary autocorrelation-lag 1/weighted by Sanderson electronegativities |
GATS1v | Geary autocorrelation-lag 1/weighted by van der Waals volumes |
GATS3s | Geary autocorrelation-lag 3/weighted by I-state |
GATS8c | Geary autocorrelation-lag 8/weighted by charges |
hmax | Maximum H E-state |
JGI10 | Mean topological charge index of order 10 |
LipoaffinityIndex | Lipoaffinity index |
MATS1c | Moran autocorrelation-lag 1/weighted by charges |
MATS8m | Moran autocorrelation-lag 8/weighted by mass |
maxaaCH | Maximum atom-type E-state: :CH: |
maxHBa | Maximum E-states for (strong) hydrogen bond acceptors |
maxwHBa | Maximum E-states for weak hydrogen bond acceptors |
naaS | Count of atom-type E-state: :S: |
nFG12HeteroRing | Number of >12-membered fused rings containing heteroatoms (N, O, P, S or halogens) |
nssO | Count of atom-type E-state: -O- |
piPC6 | Conventional bond order ID number of order 6 (ln(1 + x)) |
SHBint2 | Sum of E-state descriptors of strength for potential hydrogen bonds of path length 2 |
SpDiam_Dzp | Spectral diameter from Barysz matrix/weighted by polarizabilities |
SpMax1_Bhi | Largest absolute eigenvalue of Burden-modified matrix - n 1/weighted by the relative first ionization potential |
TIC1 | Total information content index (neighborhood symmetry of 1-order) |
SwHBa | Sum of E-states for weak hydrogen bond acceptors |
AATSC8p | Average centered Broto–Moreau autocorrelation-lag 8/weighted by polarizabilities |
IC2 | Information content index (neighborhood symmetry of 2-order) |
GATS4s | Geary autocorrelation-lag 4/weighted by I-state |
maxHBint2 | Maximum E-state descriptors of strength for potential hydrogen bonds of path length 2 |
minsOH | Minimum atom-type E-state: -OH |
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Xia, L.-Y.; Wang, Y.-W.; Meng, D.-Y.; Yao, X.-J.; Chai, H.; Liang, Y. Descriptor Selection via Log-Sum Regularization for the Biological Activities of Chemical Structure. Int. J. Mol. Sci. 2018, 19, 30. https://doi.org/10.3390/ijms19010030