
Descriptor Selection via Log-Sum Regularization for the Biological Activities of Chemical Structure

1 State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Macau 999078, China
2 Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2018, 19(1), 30; https://doi.org/10.3390/ijms19010030
Submission received: 17 November 2017 / Revised: 10 December 2017 / Accepted: 21 December 2017 / Published: 22 December 2017
(This article belongs to the Section Biochemistry)

Abstract

The quantitative structure-activity relationship (QSAR) model searches for a reliable relationship between the chemical structure and biological activities in the field of drug design and discovery. (1) Background: In the study of QSAR, the chemical structures of compounds are encoded by a substantial number of descriptors. Redundant, noisy and irrelevant descriptors degrade the QSAR model, and too many descriptors can cause overfitting or a low correlation between chemical structure and biological activity. (2) Methods: We use a novel log-sum regularization to select the few descriptors that are relevant to biological activities. In addition, a coordinate descent algorithm, which uses a novel univariate log-sum thresholding operator for updating the estimated coefficients, has been developed for the QSAR model. (3) Results: Experimental results on artificial data and four QSAR datasets demonstrate that our proposed log-sum method performs well compared with state-of-the-art methods. (4) Conclusions: Our proposed multiple linear regression with the log-sum penalty is an effective technique for both descriptor selection and prediction of biological activity.


1. Introduction

The quantitative structure-activity relationship (QSAR) model searches for a reliable relationship between the chemical structure and biological activities in the field of drug design and discovery [1]. In the study of QSAR, the chemical structure is encoded by a substantial number of descriptors, such as thermodynamic and shape descriptors. Generally, only a few descriptors that are relevant to biological activities are of interest to the QSAR model. Descriptor selection aims to eliminate redundant, noisy and irrelevant descriptors [2]. Figure 1 shows the flow of the QSAR modeling process.
Generally, descriptor selection techniques can be categorized into four groups in the study of QSAR: classical methods, artificial intelligence-based methods, miscellaneous methods and regularization methods.
Several classical methods have been proposed in the study of QSAR. For example, forward selection adds the most significant descriptors until none improves the model to a statistically-significant extent. Backward elimination starts with all candidate descriptors and subsequently deletes descriptors without any statistical significance. Stepwise regression builds a model by adding or removing predictor variables based on a series of F-tests or t-tests. The variable selection and modeling method based on prediction [3] uses the leave-one-out cross-validated $Q^2$ to select meaningful and important descriptors. Leaps-and-bounds regression [4] selects a subset of descriptors based on the residual sum of squares (RSS).
Recently, artificial intelligence-based methods have been designed for descriptor selection, such as the genetic algorithm [5], which uses coding, selection, exchange and mutation operations to select the important descriptors. Particle swarm optimization [6] starts from a set of initial random particles and then selects the descriptors by updating the particle velocities and positions. Artificial neural networks [7] are composed of many artificial neurons linked together according to a specific network architecture, and select input nodes (descriptors) to predict the output node (biological activity). Simulated annealing [8], which can be performed with the Metropolis algorithm based on Monte Carlo techniques, also performs descriptor selection. Burden and Winkler [9] used Bayesian regularized artificial neural networks with automatic relevance determination (ARD) in the study of QSAR. ARD allows the network to estimate the importance of each input, neglects irrelevant or highly correlated indices in the modeling and uses the most important variables for modeling the activity data. The ant colony system [10], inspired by real ants, searches for a path between the colony and a source of food, which corresponds to a subset of selected descriptors.
The miscellaneous methods used for descriptor selection in the development of QSAR include K nearest neighbors (KNN) [11], the replacement method (RM) [12], the successive projections algorithm (SPA) [13] and uninformative variable elimination-partial least squares (UVE-PLS) [14], to name a few. KNN uses a similarity measure (Euclidean distance) to select the descriptors and predict the biological activity. RM finds an optimal subset of the descriptors via the standard deviation. SPA uses simple projection operations to eliminate collinearity among the descriptors. UVE-PLS increases the predictive ability of the standard PLS method by eliminating variables that do not contribute to the model, comparing experimental variables with added noise variables in terms of their contribution to the model.
Regularization is an effective technique for descriptor selection and has been used in QSRR [15], QSPR [16] and QSTR [17] in the field of chemometrics; it has also attracted interest in the study of QSAR. For example, LASSO ($L_1$, the least absolute shrinkage and selection operator) [18] can perform descriptor selection. Algamal et al. proposed an adaptive $L_1$-norm to select the significant and meaningful descriptors for the anti-hepatitis C virus activity of thiourea derivatives in a QSAR classification model [19]. Xu et al. proposed $L_{1/2}$ regularization [20], which yields sparser solutions. Algamal et al. also proposed a penalized linear regression model with the $L_{1/2}$-norm to select the significant and meaningful descriptors [21]. Theoretically, $L_0$ regularization produces the sparsest solutions [22], but solving it is an NP-hard problem. Therefore, Candes et al. proposed the log-sum penalty [23], which approximates $L_0$ regularization much more closely.
In this paper, we utilize the log-sum penalty, which is non-convex (Figure 2). A coordinate descent algorithm, which uses a novel univariate log-sum thresholding operator for updating the estimated coefficients, has been developed for the QSAR model. Experimental results on artificial data and four QSAR datasets demonstrate that our proposed log-sum method performs well compared with state-of-the-art methods. The structure of this paper is organized as follows: Section 2 introduces the coordinate descent algorithm with the novel univariate log-sum thresholding operator and gives a detailed description of the datasets. In Section 3, we discuss the experimental results on the simulated data and the four QSAR datasets. Finally, we give some conclusions in Section 4.

2. Methods

In this paper, there is a predictor $X$ and a response $y$, which represent the chemical structure and the corresponding biological activities, respectively. Suppose we have $n$ samples, $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the $i$-th input pattern with dimensionality $p$, which means $X_i$ has $p$ descriptors, and $x_{ij}$ denotes the value of descriptor $j$ for the $i$-th sample. The multiple linear regression is expressed as:

$$y_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p + \beta_0 \qquad (1)$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ are the coefficients.
Given $X$ and $y$, $\beta_0, \beta_1, \ldots, \beta_p$ are estimated by minimizing an objective function. For linear regression, the objective function can be formulated as:

$$\min_{\beta}\left\{ \frac{1}{2n}\|y - X\beta\|^2 \right\} \qquad (2)$$

where $y = (y_1, \ldots, y_n)^T$ is the vector of $n$ response variables, $X = \{X_1, X_2, \ldots, X_n\}$ is the $n \times p$ matrix with rows $X_i = (x_{i1}, \ldots, x_{ip})$ and $\|\cdot\|$ denotes the $L_2$-norm. When the number of variables is larger than the number of samples ($p \gg n$), this can result in over-fitting. Hence, we introduce a penalty function into the objective function to estimate the coefficients, rewriting Equation (2) as:

$$\min_{\beta}\left\{ \frac{1}{2n}\|y - X\beta\|^2 + P_{\lambda}(\beta) \right\} \qquad (3)$$

where $P_{\lambda}(\cdot)$ is a penalty function indexed by the regularization parameter $\lambda > 0$.

2.1. Coordinate Decent Algorithm for Different Thresholding Operators

In this paper, we used the coordinate descent algorithm to implement the different penalized multiple linear regressions. The algorithm is a "one-at-a-time" method: it solves for $\beta_j$ while the other $\beta_{k \neq j}$ (the parameters remaining after the $j$-th element is removed) are kept fixed [22]. Equation (3) can be rewritten as:

$$R(\beta) = \arg\min_{\beta_j}\left\{ \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - \big(\sum_{k \neq j} x_{ik}\beta_k + x_{ij}\beta_j\big)\Big)^2 + \lambda\sum_{k \neq j} P(\beta_k) + \lambda P(\beta_j) \right\} \qquad (4)$$

where $k$ indexes the variables other than the $j$-th one.
Taking the derivative with respect to $\beta_j$:

$$\frac{\partial R}{\partial \beta_j} = -\sum_{i=1}^{n} x_{ij}\Big(y_i - \sum_{k \neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big) + \lambda P'(\beta_j) = 0 \qquad (5)$$
Denote $\tilde{y}_i^{(j)} = \sum_{k \neq j} x_{ik}\beta_k$, $\tilde{r}_i^{(j)} = y_i - \tilde{y}_i^{(j)}$ and $w_j = \sum_{i=1}^{n} x_{ij}\tilde{r}_i^{(j)}$, where $\tilde{r}_i^{(j)}$ represents the partial residual with respect to the $j$-th covariate. To take the correlation of descriptors into account, Zou and Hastie proposed the elastic net ($L_{EN}$) [24], which produces a grouping effect. The $L_{EN}$ penalty function is given as follows:

$$P(\beta) = (1 - a)\frac{1}{2}\|\beta\|_{L_2}^2 + a\|\beta\|_{L_1} \qquad (6)$$
The $L_{EN}$ penalty is a combination of the $L_1$ penalty ($a = 1$) and the ridge penalty ($a = 0$). Therefore, Equation (5) is rewritten as follows:

$$\frac{\partial R}{\partial \beta_j} = -\sum_{i=1}^{n} x_{ij}\Big(y_i - \sum_{k \neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big) + \lambda(1 - a)\beta_j + \lambda a\,\mathrm{sign}(\beta_j) = 0 \qquad (7)$$
Donoho et al. proposed the univariate solution [25]; for an $L_{EN}$-penalized regression coefficient it reads:

$$\beta_j = f_{L_{EN}}(w_j, \lambda, a) = \frac{S(w_j, \lambda a)}{1 + \lambda(1 - a)} \qquad (8)$$

where $S(w_j, \lambda a)$ is the soft thresholding operator for the $L_1$ penalty. When $a$ is equal to one, Formula (8) reduces to:

$$\beta_j = \mathrm{Soft}(w_j, \lambda) = \begin{cases} w_j + \lambda & \text{if } w_j < -\lambda \\ w_j - \lambda & \text{if } w_j > \lambda \\ 0 & \text{if } -\lambda \le w_j \le \lambda \end{cases} \qquad (9)$$
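For concreteness, here is a minimal NumPy sketch of the operators in Equations (8) and (9); the function names are ours, not from the authors' implementation.

```python
import numpy as np

def soft_threshold(w, lam):
    """Soft thresholding operator S(w, lam) of Equation (9)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def elastic_net_update(w, lam, a):
    """Univariate elastic-net update f_LEN(w, lam, a) of Equation (8)."""
    return soft_threshold(w, lam * a) / (1.0 + lam * (1.0 - a))
```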
Fan et al. proposed the smoothly clipped absolute deviation (SCAD) penalty [26], which produces a sparse set of solutions and approximately unbiased estimates for large coefficients. The penalty function is as follows:

$$p_{\lambda,a}(\beta) = \begin{cases} \lambda|\beta| & \text{if } |\beta| \le \lambda \\ \dfrac{a\lambda|\beta| - \frac{1}{2}(\beta^2 + \lambda^2)}{a - 1} & \text{if } \lambda < |\beta| \le a\lambda \\ \dfrac{\lambda^2(a^2 - 1)}{2(a - 1)} & \text{if } |\beta| > a\lambda \end{cases} \qquad (10)$$

Additionally, the SCAD thresholding operator is given as follows:

$$\beta_j = f_{\mathrm{SCAD}}(w_j, \lambda, a) = \begin{cases} S(w_j, \lambda) & \text{if } |w_j| \le 2\lambda \\ \dfrac{S(w_j, a\lambda/(a - 1))}{1 - 1/(a - 1)} & \text{if } 2\lambda < |w_j| \le a\lambda \\ w_j & \text{if } |w_j| > a\lambda \end{cases} \qquad (11)$$
Similar to the SCAD penalty, Zhang et al. proposed the minimax concave penalty (MCP) [27]. The penalty function is:

$$p_{\lambda,\gamma}(\beta) = \begin{cases} \lambda|\beta| - \dfrac{\beta^2}{2\gamma} & \text{if } |\beta| \le \gamma\lambda \\ \dfrac{1}{2}\gamma\lambda^2 & \text{if } |\beta| > \gamma\lambda \end{cases} \qquad (12)$$

Additionally, the MCP thresholding operator is given as follows:

$$\beta_j = f_{\mathrm{MCP}}(w_j, \lambda, \gamma) = \begin{cases} \dfrac{S(w_j, \lambda)}{1 - 1/\gamma} & \text{if } |w_j| \le \gamma\lambda \\ w_j & \text{if } |w_j| > \gamma\lambda \end{cases} \qquad (13)$$

where $\gamma$ is an empirically-chosen tuning parameter.
Xu et al. proposed $L_{1/2}$ regularization [20]. Formula (3) then becomes:

$$\min_{\beta}\left\{ \frac{1}{2n}\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p} |\beta_j|^{\frac{1}{2}} \right\} \qquad (14)$$

and the univariate half thresholding operator for an $L_{1/2}$-penalized linear regression coefficient is as follows:

$$\beta_j = \mathrm{Half}(w_j, \lambda) = \begin{cases} \dfrac{2}{3} w_j \left(1 + \cos\dfrac{2\big(\pi - \phi_{\lambda}(w_j)\big)}{3}\right) & \text{if } |w_j| > \dfrac{3}{4}\lambda^{\frac{2}{3}} \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

where $\phi_{\lambda}(w) = \arccos\left(\dfrac{\lambda}{8}\left(\dfrac{|w|}{3}\right)^{-\frac{3}{2}}\right)$.
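The SCAD, MCP and half thresholding operators of Equations (11), (13) and (15) can be sketched in the same style; the defaults a = 3.7 and γ = 3 are common choices from the literature, not values stated in this paper.

```python
import numpy as np

def soft(w, t):
    """Soft thresholding S(w, t)."""
    return np.sign(w) * max(abs(w) - t, 0.0)

def scad_threshold(w, lam, a=3.7):
    """SCAD thresholding operator of Equation (11)."""
    if abs(w) <= 2.0 * lam:
        return soft(w, lam)
    if abs(w) <= a * lam:
        return soft(w, a * lam / (a - 1.0)) / (1.0 - 1.0 / (a - 1.0))
    return w

def mcp_threshold(w, lam, gamma=3.0):
    """MCP thresholding operator of Equation (13)."""
    if abs(w) <= gamma * lam:
        return soft(w, lam) / (1.0 - 1.0 / gamma)
    return w

def half_threshold(w, lam):
    """L_{1/2} half thresholding operator of Equation (15)."""
    if abs(w) > 0.75 * lam ** (2.0 / 3.0):
        phi = np.arccos((lam / 8.0) * (abs(w) / 3.0) ** (-1.5))
        return (2.0 / 3.0) * w * (1.0 + np.cos(2.0 * (np.pi - phi) / 3.0))
    return 0.0
```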
In this paper, we apply the log-sum penalty to the linear regression model, rewriting Formula (3) as follows:

$$\min_{\beta}\left\{ \frac{1}{2n}\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p} \log(|\beta_j| + \varepsilon) \right\} \qquad (16)$$

where $\varepsilon > 0$ should be set arbitrarily small to make the log-sum penalty closely resemble the $L_0$-norm. Equation (16) has a local minimum, attained by the following thresholding operator (the proof is given in Appendix A):

$$\beta_j = f_{\mathrm{log\text{-}sum}}(w_j, \lambda, \varepsilon) = D(w_j, \lambda, \varepsilon) = \begin{cases} \mathrm{sign}(w_j)\dfrac{c_1 + \sqrt{c_2}}{2} & \text{if } c_2 > 0 \\ 0 & \text{if } c_2 \le 0 \end{cases} \qquad (17)$$

where $\lambda > 0$, $0 < \varepsilon < \sqrt{\lambda}$, $c_1 = |w_j| - \varepsilon$ and $c_2 = c_1^2 - 4(\lambda - |w_j|\varepsilon)$.
Based on these thresholding operators, three desirable properties of a coefficient estimator can be assessed: unbiasedness, sparsity and continuity (Figure 3).
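A minimal sketch of the log-sum thresholding operator $D(w_j, \lambda, \varepsilon)$ of Equation (17), following the closed form above:

```python
import numpy as np

def logsum_threshold(w, lam, eps):
    """Univariate log-sum thresholding D(w, lam, eps) of Equation (17).

    Returns the larger root of Equation (A3), which Appendix A shows
    to be a local minimum, or 0 when no real root exists.
    """
    c1 = abs(w) - eps
    c2 = c1 ** 2 - 4.0 * (lam - abs(w) * eps)
    if c2 > 0:
        return np.sign(w) * (c1 + np.sqrt(c2)) / 2.0
    return 0.0
```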

2.2. Dataset

2.2.1. Simulated Data

In this work, we constructed a simulation study as follows:
Step I: The simulated dataset was generated from the multiple linear regression model below, drawing $X$ from the normal distribution. The number of rows is the sample size $n$ and the number of columns is the number of variables $p$.

$$y = X\beta + \sigma\epsilon \;(\text{without an intercept term}), \quad \epsilon \sim N(0, 1) \qquad (18)$$

where $y = (y_1, \ldots, y_n)^T$ is the vector of $n$ response variables, $X = \{X_1, X_2, \ldots, X_n\}$ is the generated matrix with rows $X_i = (x_{i1}, \ldots, x_{ip})$, $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is the random error and $\sigma$ controls the signal-to-noise ratio.
Step II: Introduce correlation into the simulated data via the parameter $\rho$:

$$x_{ij} = \rho \times x_{i1} + (1 - \rho)\, x_{ij}, \quad i \in \{1, \ldots, n\}, \; j \in \{2, 3, 4, 5, 6\} \qquad (19)$$
Step III: In order to obtain a high-quality model and variable selection, the coefficients are set in advance, with positions 1-20 non-zero:

$$\beta = (\underbrace{2, 2, 1, 1.5, 3, 2.5, 3, 2, \ldots, 2}_{20}, \underbrace{0, 0, \ldots, 0}_{1980}) \in \mathbb{R}^{2000} \qquad (20)$$

where $\beta$ is the coefficient vector.
Step IV: Obtain $y$ from Equations (18)–(20).
In the simulation study, we first generated 100 groups of data with sample sizes $n = 100$ and $n = 200$. Second, the correlation coefficients $\rho = 0.2, 0.4$ and the noise control parameters $\sigma = 0.3, 0.9$ were considered in the model. Third, the coefficients (20) were set in advance. Fourth, multiple linear regression with the different penalties, including our proposed method, was used to select variables and build the models. Finally, since 100 groups of data were generated, the results obtained by the different methods were averaged.
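A minimal sketch of this data-generating process; we read Equation (20) as repeating the value 2 from position 8 through position 20, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, sigma = 200, 2000, 0.4, 0.9

# Step I: draw the design matrix from the standard normal distribution.
X = rng.standard_normal((n, p))

# Step II: correlate descriptors 2-6 with the first descriptor, Equation (19).
for j in range(1, 6):
    X[:, j] = rho * X[:, 0] + (1.0 - rho) * X[:, j]

# Step III: the first 20 coefficients are non-zero, the remaining 1980 are zero, Equation (20).
beta = np.zeros(p)
beta[:20] = [2, 2, 1, 1.5, 3, 2.5, 3] + [2] * 13

# Step IV: generate the response from Equation (18).
y = X @ beta + sigma * rng.standard_normal(n)
```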

2.2.2. Real Data

We used four public QSAR datasets: the global half-life index (GHLI) [28], endocrine disrupting chemical (EDC) estrogen receptor (ER) binding (EDCER) [29], (benzo-)triazole toxicity in Daphnia magna (BTAZD) [30] and the apoptosis regulator Bcl-2 (BCL2) [31]. A brief description of these datasets is given in Table 1. We used random sampling to divide each dataset into a training set and a test set (80% for training and 20% for testing [32]). Six parameters commonly used in regression problems are employed to evaluate model performance: the squared correlation coefficient of leave-one-out cross-validation ($Q^2_{LOO}$), the root mean squared error of cross-validation ($RMSE_{CV}$), the squared correlation coefficient of fitting for the training set ($R^2_{train}$), the root mean squared error for the training set ($RMSE_{train}$), the squared correlation coefficient for the test set ($R^2_{test}$) and the root mean squared error for the test set ($RMSE_{test}$). According to the existing literature [33], $Q^2_{LOO}$ is not the best measure for QSAR model evaluation; we therefore focus primarily on $R^2_{test}$ and $RMSE_{test}$.
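As a reference, here is a small sketch of how these evaluation measures might be computed; the helper names are ours, and the $Q^2_{LOO}$ helper assumes the leave-one-out predictions have already been collected.

```python
import numpy as np

def r2(y_true, y_pred):
    """Squared correlation coefficient between observed and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def q2_loo(y, y_loo_pred):
    """Leave-one-out Q^2 = 1 - PRESS / TSS, given LOO predictions."""
    press = np.sum((y - y_loo_pred) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - press / tss
```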
Algorithm: A coordinate descent algorithm for log-sum penalized multiple linear regression.
Step 1: Initialize all $\beta_j^{(m)} = 0$ ($j = 1, 2, \ldots, p$), set $\lambda$, $\varepsilon$ and $m = 0$;
Step 2: Calculate the objective function (16) based on $\beta^{(m)}$;
Step 3: Update each $\beta_j^{(m)}$, cycling over $j = 1, 2, \ldots, p$:
              Step 3.1: Compute $\tilde{r}_i^{(j)(m)} = y_i - \tilde{y}_i^{(j)(m)} = y_i - \sum_{k \neq j} x_{ik}\beta_k^{(m)}$
                            and $w_j^{(m)} = \sum_{i=1}^{n} x_{ij}\tilde{r}_i^{(j)(m)}$;
              Step 3.2: Update $\beta_j^{(m)} = D(w_j^{(m)}, \lambda, \varepsilon)$;
Step 4: Let $m \leftarrow m + 1$ and $\beta^{(m+1)} \leftarrow \beta^{(m)}$;
Step 5: Repeat Steps 2 and 3 until $\beta^{(m)}$ converges.
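The following is a minimal, self-contained sketch of the algorithm above. Our implementation choices, not prescribed by the paper: the descriptor columns are assumed standardized, the $1/n$ factor of Equation (16) is folded into $w_j$, and the defaults for eps, max_iter and tol are our own.

```python
import numpy as np

def logsum_threshold(w, lam, eps):
    """Univariate log-sum thresholding D(w, lam, eps) of Equation (17)."""
    c1 = abs(w) - eps
    c2 = c1 ** 2 - 4.0 * (lam - abs(w) * eps)
    return np.sign(w) * (c1 + np.sqrt(c2)) / 2.0 if c2 > 0 else 0.0

def logsum_coordinate_descent(X, y, lam, eps=1e-3, max_iter=100, tol=1e-6):
    """Coordinate descent for log-sum penalized multiple linear regression."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y - X @ beta
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            # Step 3.1: partial residual and w_j for the j-th descriptor.
            r_j = residual + X[:, j] * beta[j]
            w_j = X[:, j] @ r_j / n
            # Step 3.2: univariate log-sum update.
            b_new = logsum_threshold(w_j, lam, eps)
            residual = r_j - X[:, j] * b_new
            beta[j] = b_new
        # Step 5: stop once the coefficients converge.
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

Applied to the simulated data of Section 2.2.1 (with standardized columns), a call such as logsum_coordinate_descent(X, y, lam=0.1) returns a sparse coefficient vector; in practice λ and ε would be tuned, for example by cross-validation.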

3. Results

In this work, five methods are compared with our proposed method: multiple linear regression with the $L_{EN}$, $L_1$, SCAD, MCP and $L_{1/2}$ penalties, respectively.

3.1. Analyses of Simulated Data

Table 2 and Table 3 report the number of variables selected (non-zero coefficients) by the different methods out of all 2000 variables and out of the 20 pre-set variables, respectively. For example, when $n = 200$, $\rho = 0.4$ and $\sigma = 0.9$, the log-sum method selects 23.73 variables on average out of 2000 (Table 2), of which 19.95 on average are among the 20 pre-set variables (Table 3). From these we calculate the average accuracy ($19.95 \div 23.73 \times 100\% = 84.07\%$) for the simulated datasets obtained by log-sum in Table 4. Tables 2–4 show that, as the correlation parameter $\rho$ and the noise control parameter $\sigma$ decrease, the average accuracy of log-sum improves. When $n = 100$ and $\sigma = 0.9$, the average accuracy of log-sum rises from 83.77% to 98.80% as $\rho$ decreases from 0.4 to 0.2. When $n = 200$ and $\rho = 0.4$, log-sum attains 84.07% and 86.39% with the noise control parameter $\sigma = 0.9$ and $0.3$, respectively. In addition, the average accuracy obtained by our proposed log-sum method is better than that of the other methods; for example, when $n = 200$, $\rho = 0.4$ and $\sigma = 0.9$, the log-sum result of 84.07% is higher than the 3.19%, 20.20%, 49.20%, 83.22% and 81.74% of $L_{EN}$, $L_1$, SCAD, MCP and $L_{1/2}$. In other words, our proposed log-sum method achieves good performance on the simulated datasets.

3.2. Analyses of Real Data

As shown in Table 5 and Figures 4 and 5, the $R^2_{train}$ and $RMSE_{train}$ of $L_1$, $L_{1/2}$ and MCP (0.87, 0.87, 0.88 and 0.64, 0.62, 0.27) are better than the corresponding values of log-sum (0.85, 0.86, 0.88 and 0.69, 0.63, 0.28) on the GHLI, EDCER and BATZD datasets, respectively. However, our proposed log-sum method is the best in terms of $Q^2$ and $RMSE_{CV}$. On the BATZD dataset, the $RMSE_{CV}$ obtained by log-sum is 0.23, lower than the 0.30, 0.30, 0.30, 0.28 and 0.26 of the other methods. On the BCL2 dataset, the $Q^2$ obtained by log-sum is 0.75, higher than the 0.51, 0.57, 0.73, 0.73 and 0.67 of the other methods. Moreover, a smaller subset of descriptors was selected by our proposed method; for example, on the EDCER dataset, log-sum selects 10 descriptors, fewer than the 47, 36, 17, 11 and 12 of $L_{EN}$, $L_1$, SCAD, MCP and $L_{1/2}$. Furthermore, for $R^2_{test}$ and $RMSE_{test}$ on the GHLI dataset, the best method is log-sum (0.75 and 0.88); $L_{EN}$ and $L_1$ are second (0.74 and 0.90); MCP is third (0.73 and 0.91); $L_{1/2}$ is fourth (0.72 and 0.92); and SCAD is last (0.72 and 0.93). Therefore, our proposed method outperforms the other methods. In addition, the experimental and predicted values for the four datasets are reported.
First, in Tables 6–9, the numbers of top-ranked informative descriptors identified by $L_{EN}$, $L_1$, SCAD, MCP, $L_{1/2}$ and log-sum are 9, 10, 8 and 6 for the GHLI, EDCER, BATZD and BCL2 datasets, respectively, based on the values of the coefficients. Second, the common descriptors are emphasized in bold. Third, as shown in Table 10, all selected descriptors are from the 2D class, and the majority belong to the autocorrelation and atom-type electrotopological state descriptor types. Finally, the names of the descriptors obtained by the log-sum method are listed in Table 11.

4. Conclusions

In the field of drug design and discovery, only a few descriptors are of interest to the QSAR model. Therefore, descriptor selection plays an important role in the study of QSAR. In this paper, we proposed univariate log-sum thresholding for updating the estimated coefficients and developed a coordinate descent algorithm for log-sum penalized multiple linear regression.
Experimental results on both artificial data and four QSAR datasets demonstrate that our proposed multiple linear regression with the log-sum penalty outperforms $L_1$, $L_{EN}$, SCAD, MCP and $L_{1/2}$. Therefore, our proposed log-sum method is an effective technique for both descriptor selection and prediction of biological activity.
In this paper, we used random sampling, which is easy to apply, for QSAR data preprocessing. However, this method does not take additional knowledge into account. Therefore, in future work we plan to integrate a self-paced learning mechanism, which learns from easy samples first and then gradually takes more complex samples into consideration, making the model progressively more mature, into our proposed method.

Acknowledgments

This work was supported by the Macau Science and Technology Development Fund (Grant No. 003/2016/AFJ) from the Macau Special Administrative Region of the People’s Republic of China, the National Grand Fundamental Research 973 Program of China under Grant No. 2013CB329404 and the China NSFC projects under Contracts 61373114, 61661166011, 11690011 and 61721002.

Author Contributions

Liang-Yong Xia, Hua Chai and Yong Liang designed the simulations. Liang-Yong Xia and De-Yu Meng provided the mathematical proof. Liang-Yong Xia, Xiao-Jun Yao and Yu-Wei Wang contributed to collecting the datasets and analyzing the data. Liang-Yong Xia and Yong Liang designed and implemented the algorithm. Liang-Yong Xia, Yu-Wei Wang, De-Yu Meng, Xiao-Jun Yao and Yong Liang contributed to the interpretation of the results. Liang-Yong Xia took the lead in writing the manuscript. Yu-Wei Wang, De-Yu Meng, Xiao-Jun Yao, Hua Chai and Yong Liang revised the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

QSAR	Quantitative structure-activity relationship
QSRR	Quantitative structure-(chromatographic) retention relationship
QSPR	Quantitative structure-property relationship
QSTR	Quantitative structure-toxicity relationship
MLR	Multiple linear regression
MCP	Minimax concave penalty
SCAD	Smoothly clipped absolute deviation
L1	LASSO (least absolute shrinkage and selection operator)
BTAZD	(Benzo-)Triazoles toxicity in Daphnia magna
EDCER	EDC estrogen receptor binding
GHLI	Global half-life index
BCL2	Apoptosis regulator Bcl-2

Appendix A. Proof

We first consider the situation $\beta_j > 0$:

$$\frac{\partial R}{\partial \beta_j} = -\sum_{i=1}^{n} x_{ij}\Big(y_i - \sum_{k \neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big) + \lambda\frac{1}{\beta_j + \varepsilon} = 0 \qquad (A1)$$

Denoting $\tilde{y}_i^{(j)} = \sum_{k \neq j} x_{ik}\beta_k$, $\tilde{r}_i^{(j)} = y_i - \tilde{y}_i^{(j)}$ and $\omega_j = \sum_{i=1}^{n} x_{ij}\tilde{r}_i^{(j)}$, and noting that the descriptors are standardized so that $\sum_{i=1}^{n} x_{ij}^2 = 1$, the gradient of the log-sum regularized objective at $\beta_j$ can be expressed as:

$$\frac{\partial R}{\partial \beta_j} = \beta_j - \omega_j + \lambda\frac{1}{\beta_j + \varepsilon} = 0 \qquad (A2)$$

which is equivalent to:

$$\beta_j^2 - (\omega_j - \varepsilon)\beta_j + (\lambda - \omega_j\varepsilon) = 0 \qquad (A3)$$

$$\beta_j = \frac{(\omega_j - \varepsilon) \pm \sqrt{(\omega_j - \varepsilon)^2 - 4(\lambda - \omega_j\varepsilon)}}{2}$$

Let $c_1 = \omega_j - \varepsilon$ and $c_2 = c_1^2 - 4(\lambda - \omega_j\varepsilon)$. Thus, we have:
(1) If $c_2 < 0$, Equation (A3) has no real solution.
(2) If $c_2 = 0$, Equation (A3) has the single solution $\beta_j = \frac{c_1}{2}$.
(3) If $c_2 > 0$, Equation (A3) has the two solutions $\beta_{j1} = \frac{c_1 - \sqrt{c_2}}{2}$ and $\beta_{j2} = \frac{c_1 + \sqrt{c_2}}{2}$, and:

$$c_2 = (\omega_j - \varepsilon)^2 - 4(\lambda - \omega_j\varepsilon) = \omega_j^2 - 2\omega_j\varepsilon + \varepsilon^2 - 4\lambda + 4\omega_j\varepsilon = (\omega_j + \varepsilon)^2 - 4\lambda > 0$$
$$\Rightarrow \omega_j + \varepsilon > 2\sqrt{\lambda} \Rightarrow \omega_j - \varepsilon > 2\sqrt{\lambda} - 2\varepsilon > 0 \Rightarrow c_1 > 0$$

where the last step uses $0 < \varepsilon < \sqrt{\lambda}$.
Thus, $\beta_{j2} > \beta_{j1} > 0$, and it is then easy to verify that $\frac{\partial R}{\partial \beta_j} > 0$ when $0 < \beta_j < \beta_{j1}$ or $\beta_j > \beta_{j2}$, and $\frac{\partial R}{\partial \beta_j} < 0$ when $\beta_{j1} < \beta_j < \beta_{j2}$. Therefore, Equation (16) has a local minimum at $\beta_{j2}$. For $\beta_j < 0$, the proof proceeds in a similar way.
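As a quick numerical sanity check of this proof, the closed form of Equation (17) can be compared against a grid search over the univariate objective $f(\beta) = \frac{1}{2}(\beta - \omega)^2 + \lambda\log(|\beta| + \varepsilon)$; the parameter values below are arbitrary, chosen so that the local minimum at $\beta_{j2}$ is also the minimum over the searched interval.

```python
import numpy as np

def logsum_threshold(w, lam, eps):
    c1 = abs(w) - eps
    c2 = c1 ** 2 - 4.0 * (lam - abs(w) * eps)
    return np.sign(w) * (c1 + np.sqrt(c2)) / 2.0 if c2 > 0 else 0.0

w, lam, eps = 1.6, 0.2, 0.05

def f(b):
    """Univariate log-sum objective for a standardized descriptor."""
    return 0.5 * (b - w) ** 2 + lam * np.log(np.abs(b) + eps)

grid = np.linspace(0.01, 3.0, 200001)
b_grid = grid[np.argmin(f(grid))]
b_closed = logsum_threshold(w, lam, eps)
print(b_closed, b_grid)  # both are approximately 1.468
```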

References

  1. Katritzky, A.R.; Kuanar, M.; Slavov, S.; Hall, C.D.; Karelson, M.; Kahn, I.; Dobchev, D.A. Quantitative correlation of physical and chemical properties with chemical structure: Utility for prediction. Chem. Rev. 2010, 110, 5714–5789. [Google Scholar] [CrossRef] [PubMed]
  2. Shahlaei, M. Descriptor selection methods in quantitative structure-activity relation-ship studies: A review study. Chem. Rev. 2013, 113, 8093–8103. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, S.-S.; Liu, H.-L.; Yin, C.-S.; Wang, L.-S. Vsmp: A novel variable selection and modeling method based on the prediction. J. Chem. Inf. Comput. Sci. 2003, 43, 964–969. [Google Scholar] [CrossRef] [PubMed]
  4. Xu, L.; Zhang, W.-J. Comparison of different methods for variable selection. Anal. Chim. Acta 2001, 446, 475–481. [Google Scholar] [CrossRef]
  5. Wegner, J.K.; Zell, A. Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method. J. Chem. Inf. Comput. Sci. 2003, 43, 1077–1084. [Google Scholar] [CrossRef] [PubMed]
  6. Khajeh, A.; Modarress, H.; Zeinoddini-Meymand, H. Modified particle swarm optimization method for variable selection in qsar/qspr studies. Struct. Chem. 2013, 24, 1401–1409. [Google Scholar] [CrossRef]
  7. Meissner, M.; Schmuker, M.; Schneider, G. Optimized particle swarm optimization (OPSO) and its application to artificial neural network training. BMC Bioinform. 2006, 7, 125. [Google Scholar] [CrossRef] [PubMed]
  8. Ghosh, P.; Bagchi, M. QSAR modeling for quinoxaline derivatives using genetic algorithm and simulated annealing based feature selection. Curr. Med. Chem. 2009, 16, 4032–4048. [Google Scholar] [CrossRef] [PubMed]
  9. Burden, F.; Winkler, D. Bayesian regularization of neural networks. Artif. Neural Netw. Methods Appl. 2009, 458, 23–42. [Google Scholar]
  10. Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39. [Google Scholar] [CrossRef]
  11. Zheng, W.; Tropsha, A. Novel variable selection quantitative structure- property relationship approach based on the k-nearest-neighbor principle. J. Chem. Inf. Comput. Sci. 2000, 40, 185–194. [Google Scholar] [CrossRef] [PubMed]
  12. Mercader, A.G.; Duchowicz, P.R.; Fernández, F.M.; Castro, E.A. Modified and enhanced replacement method for the selection of molecular descriptors in qsar and qspr theories. Chemom. Intell. Lab. Syst. 2008, 92, 138–144. [Google Scholar] [CrossRef]
  13. Araújo, M.C.U.; Saldanha, T.C.B.; Galvão, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73. [Google Scholar] [CrossRef]
  14. Put, R.; Daszykowski, M.; Baczek, T.; Heyden, Y.V. Retention prediction of peptides based on uninformative variable elimination by partial least squares. J. Proteome Res. 2006, 5, 1618–1625. [Google Scholar] [CrossRef] [PubMed]
  15. Daghir-Wojtkowiak, E.; Wiczling, P.; Bocian, S.; Kubik, L.; Koslinski, P.; Buszewski, B.; Kaliszan, R.; Markuszewski, M.J. Least absolute shrinkage and selection operator and dimensionality reduction techniques in quantitative structure retention relationship modeling of retention in hydrophilic interaction liquid chromatography. J. Chromatogr. A 2015, 1403, 54–62. [Google Scholar] [CrossRef] [PubMed]
  16. Goodarzi, M.; Chen, T.; Freitas, M.P. QSPR predictions of heat of fusion of organic compounds using Bayesian regularized artificial neural networks. Chemom. Intell. Lab. Syst. 2010, 104, 260–264. [Google Scholar] [CrossRef]
  17. Aalizadeh, R.; Peter, C.; Thomaidis, N.S. Prediction of acute toxicity of emerging contaminants on the water flea Daphnia magna by Ant Colony Optimization-Support Vector Machine QSTR models. Environ. Sci. Process. Impacts 2017, 19, 438–448. [Google Scholar] [CrossRef] [PubMed]
  18. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar]
  19. Algamal, Z.; Lee, M. A new adaptive l1-norm for optimal descriptor selection of high-dimensional qsar classification model for anti-hepatitis c virus activity of thiourea derivatives. SAR QSAR Environ. Res. 2017, 28, 75–90. [Google Scholar] [CrossRef] [PubMed]
  20. Xu, Z.; Chang, X.; Xu, F.; Zhang, H. l1/2 regularization: A thresholding repre-sentation theory and a fast solver. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1013–1027. [Google Scholar] [PubMed]
  21. Algamal, Z.; Lee, M.; Al-Fakih, A.; Aziz, M. High-dimensional qsar modeling using penalized linear regression model with l1/2-norm. SAR QSAR Environ. Res. 2016, 27, 703–719. [Google Scholar] [CrossRef] [PubMed]
  22. Liang, Y.; Liu, C.; Luan, X.-Z.; Leung, K.-S.; Chan, T.-M.; Xu, Z.B.; Zhang, H. Sparse logistic regression with a l1/2 penalty for gene selection in cancer classification. BMC Bioinform. 2013, 14, 198. [Google Scholar] [CrossRef] [PubMed]
  23. Candes, E.J.; Wakin, M.B.; Boyd, S.P. Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Appl. 2008, 14, 877–905. [Google Scholar] [CrossRef]
  24. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320. [Google Scholar] [CrossRef]
  25. Donoho, D.L.; Johnstone, I.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81, 425–455. [Google Scholar] [CrossRef]
  26. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  27. Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
  28. Gramatica, P.; Papa, E. Screening and ranking of pops for global half-life: Qsar approaches for prioritization based on molecular structure. Environ. Sci. Technol. 2007, 41, 2833–2839. [Google Scholar] [CrossRef] [PubMed]
  29. Li, J.; Gramatica, P. The importance of molecular structures, endpoints values, and predictivity parameters in qsar research: Qsar analysis of a series of estrogen receptor binders. Mol. Divers. 2010, 14, 687–696. [Google Scholar] [CrossRef] [PubMed]
  30. Cassani, S.; Kovarich, S.; Papa, E.; Roy, P.P.; van der Wal, L.; Gramatica, P. Daphnia and fish toxicity of (benzo) triazoles: Validated qsar models, and interspecies quantitative activity-activity modeling. J. Hazard. Mater. 2013, 258, 50–60. [Google Scholar] [CrossRef] [PubMed]
  31. Zakharov, A.V.; Peach, M.L.; Sitzmann, M.; Nicklaus, M.C. Qsar modeling of imbalanced high-throughput screening data in pubchem. J. Chem. Inf. Model. 2014, 54, 705–712. [Google Scholar] [CrossRef] [PubMed]
  32. Gramatica, P.; Cassani, S.; Chirico, N. QSARINS-Chem: Insubria Datasets and New QSAR/QSPR Models for Environmental Pollutants in QSARINS. J. Comput. Chem. Softw. News Updates 2014, 35, 1036–1044. [Google Scholar] [CrossRef] [PubMed]
  33. Golbraikh, A.; Tropsha, A. Beware of q2. J. Mol. Graph. Model. 2002, 20, 269–276. [Google Scholar] [CrossRef]
Figure 1. The flow diagram shows the process of QSAR modeling. (1) Collecting molecular structures and their activities; (2) calculating molecular descriptors, which can produce thousands of parameters for each molecular structure; (3) removing redundant or irrelevant descriptors via descriptor selection; (4) building the model with the optimum descriptor subset; (5) predicting the biological activity of a new molecular structure using the established model. Different color blocks represent different values.
Figure 2. L 1 and L E N are convex, and SCAD, MCP, L 1 / 2 and log-sum are non-convex. The log-sum approximates to L 0 .
Figure 3. Plot of thresholding functions for: (a) L 1 ; (b) L E N ; (c) SCAD; (d) MCP; (e) L 1 / 2 ; and (f) log-sum.
Figure 4. The value of residual ( | y y p r e d | ) on different datasets.
Figure 5. The number of descriptors obtained by the multiple linear regression with the different penalties on different datasets (different colors represent different datasets).
Table 1. A brief description of the four public datasets used in the experiments.

Dataset Name | No. of Samples | No. of Descriptors | No. of Samples (Training) | No. of Samples (Test)
BTAZD | 97 | 1083 | 78 | 19
EDCER | 129 | 1089 | 104 | 25
GHLI | 250 | 1120 | 200 | 50
BCL2 | 508 | 1562 | 407 | 101
Table 2. The average number of variables selected in total by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum. In bold, the best performance is shown.

Setting | Sample Size | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
ρ = 0.2, σ = 0.3 | n = 100 | 381.60 | 92.92 | 19.09 | 23.36 | 19.13 | 19.00
ρ = 0.2, σ = 0.3 | n = 200 | 498.81 | 34.18 | 19.03 | 19.00 | 19.09 | 19.00
ρ = 0.2, σ = 0.9 | n = 100 | 382.24 | 93.26 | 27.74 | 25.79 | 21.77 | 21.54
ρ = 0.2, σ = 0.9 | n = 200 | 499.49 | 95.83 | 36.48 | 23.65 | 23.83 | 23.15
ρ = 0.4, σ = 0.3 | n = 100 | 378.96 | 93.98 | 19.26 | 24.67 | 19.98 | 19.11
ρ = 0.4, σ = 0.3 | n = 200 | 495.66 | 97.51 | 40.87 | 24.04 | 24.42 | 23.79
ρ = 0.4, σ = 0.9 | n = 100 | 379.35 | 93.46 | 29.22 | 26.08 | 22.48 | 22.04
ρ = 0.4, σ = 0.9 | n = 200 | 495.64 | 98.97 | 40.61 | 23.95 | 24.43 | 23.73
Table 3. The average number of variables selected with a pre-set value (20) obtained by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum.

Setting | Sample Size | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
ρ = 0.2, σ = 0.3 | n = 100 | 12.23 | 14.45 | 19.09 | 18.81 | 19.13 | 19.00
ρ = 0.2, σ = 0.3 | n = 200 | 16.22 | 20.00 | 19.03 | 19.00 | 19.09 | 19.00
ρ = 0.2, σ = 0.9 | n = 100 | 12.24 | 14.30 | 19.93 | 19.42 | 19.74 | 19.81
ρ = 0.2, σ = 0.9 | n = 200 | 16.26 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00
ρ = 0.4, σ = 0.3 | n = 100 | 11.84 | 13.57 | 18.88 | 18.40 | 18.65 | 18.88
ρ = 0.4, σ = 0.3 | n = 200 | 15.79 | 19.99 | 19.97 | 19.93 | 19.96 | 19.93
ρ = 0.4, σ = 0.9 | n = 100 | 11.88 | 13.55 | 19.48 | 18.81 | 19.14 | 19.00
ρ = 0.4, σ = 0.9 | n = 200 | 15.80 | 19.99 | 19.98 | 19.93 | 19.97 | 19.95
Table 4. The average accuracy (%) for the simulation datasets obtained by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum. In bold, the best performance is shown.

Setting | Sample Size | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
ρ = 0.2, σ = 0.3 | n = 100 | 3.20% | 15.55% | 100.00% | 80.52% | 100.00% | 100.00%
ρ = 0.2, σ = 0.3 | n = 200 | 3.25% | 58.51% | 100.00% | 100.00% | 100.00% | 100.00%
ρ = 0.2, σ = 0.9 | n = 100 | 3.12% | 14.44% | 98.03% | 74.58% | 93.34% | 98.80%
ρ = 0.2, σ = 0.9 | n = 200 | 3.19% | 20.50% | 48.86% | 82.90% | 81.74% | 83.77%
ρ = 0.4, σ = 0.3 | n = 100 | 3.20% | 15.33% | 71.85% | 75.30% | 90.68% | 91.97%
ρ = 0.4, σ = 0.3 | n = 200 | 3.26% | 20.87% | 54.87% | 84.57% | 83.93% | 86.39%
ρ = 0.4, σ = 0.9 | n = 100 | 3.19% | 20.50% | 48.86% | 82.90% | 81.74% | 83.77%
ρ = 0.4, σ = 0.9 | n = 200 | 3.19% | 20.20% | 49.20% | 83.22% | 81.74% | 84.07%
Table 5. Experimental results on the four datasets (the results of our proposed method are emphasized in bold and italic).

Dataset | Method | R²_train | RMSE_train | Q²_LOO | RMSE_CV | R²_test | RMSE_test
GHLI | L_EN | 0.87 | 0.65 | 0.74 | 0.68 | 0.74 | 0.90
GHLI | L_1 | 0.87 | 0.64 | 0.75 | 0.67 | 0.74 | 0.90
GHLI | SCAD | 0.84 | 0.71 | 0.82 | 0.62 | 0.72 | 0.93
GHLI | MCP | 0.85 | 0.68 | 0.80 | 0.65 | 0.73 | 0.91
GHLI | L_1/2 | 0.82 | 0.75 | 0.81 | 0.62 | 0.72 | 0.92
GHLI | log-sum | 0.85 | 0.69 | 0.84 | 0.57 | 0.75 | 0.88
EDCER | L_EN | 0.81 | 0.74 | 0.70 | 0.70 | 0.64 | 1.23
EDCER | L_1 | 0.82 | 0.73 | 0.73 | 0.68 | 0.63 | 1.25
EDCER | SCAD | 0.86 | 0.63 | 0.74 | 0.69 | 0.70 | 1.12
EDCER | MCP | 0.83 | 0.70 | 0.74 | 0.69 | 0.65 | 1.21
EDCER | L_1/2 | 0.87 | 0.62 | 0.75 | 0.65 | 0.64 | 1.24
EDCER | log-sum | 0.86 | 0.63 | 0.79 | 0.62 | 0.70 | 1.12
BATZD | L_EN | 0.87 | 0.28 | 0.73 | 0.30 | 0.60 | 0.52
BATZD | L_1 | 0.88 | 0.28 | 0.74 | 0.30 | 0.60 | 0.52
BATZD | SCAD | 0.86 | 0.30 | 0.77 | 0.30 | 0.62 | 0.51
BATZD | MCP | 0.88 | 0.27 | 0.83 | 0.29 | 0.64 | 0.50
BATZD | L_1/2 | 0.86 | 0.29 | 0.84 | 0.26 | 0.64 | 0.50
BATZD | log-sum | 0.88 | 0.28 | 0.88 | 0.23 | 0.68 | 0.47
BCL2 | L_EN | 0.75 | 0.57 | 0.51 | 0.53 | 0.61 | 0.67
BCL2 | L_1 | 0.74 | 0.58 | 0.58 | 0.51 | 0.61 | 0.67
BCL2 | SCAD | 0.72 | 0.59 | 0.73 | 0.45 | 0.59 | 0.69
BCL2 | MCP | 0.74 | 0.57 | 0.73 | 0.46 | 0.58 | 0.70
BCL2 | L_1/2 | 0.73 | 0.60 | 0.68 | 0.48 | 0.57 | 0.70
BCL2 | log-sum | 0.68 | 0.64 | 0.75 | 0.43 | 0.65 | 0.63
Table 6. The 9 top-ranked descriptors identified by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum from the GHLI dataset (the common descriptors are emphasized in bold).

Rank | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
1 | JGI7 | JGI7 | Mp | JGI7 | minsCl | ATSC4c
2 | ETA_Eta_B_RC | ETA_Eta_B_RC | MDEC-44 | ATSC4c | ATSC1e | GATS1e
3 | BCUTc-1l | BCUTc-1l | GATS1e | GATS1e | minaaN | ATSC1p
4 | Mv | Mv | ATSC1p | AATS0e | WPOL | MATS8m
5 | ATSC4c | MDEN-23 | GGI9 | meanI | nHdsCH | maxwHBa
6 | MDEN-23 | ATSC4c | maxHBa | nHdsCH | ALogP | maxHBa
7 | GATS1e | GATS1e | maxwHBa | maxHBa | nFG12Ring | ATSC7s
8 | ETA_Epsilon_3 | ETA_Epsilon_4 | MATS8m | ATSC7s | AATS6i | AATS0v
9 | ETA_Epsilon_4 | minHCsatu | SIC1 | ATS4v | AATSC8m | ATS4p
Table 7. The 10 top-ranked descriptors identified by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum from the EDCER dataset (the common descriptors are emphasized in bold).

Rank | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
1 | JGI10 | JGI10 | JGI10 | JGI10 | JGI10 | JGI10
2 | VE2_Dt | VE2_Dt | MATS1i | JGI6 | GATS1c | MATS1c
3 | JGI7 | JGI6 | AATSC2s | AATSC2s | GATS2s | hmax
4 | AATSC8p | AATSC8p | hmax | AATSC8p | hmax | nssO
5 | JGI6 | JGI7 | JGI6 | hmax | GATS5v | piPC6
6 | hmax | hmax | nBase | nHBint2 | nTG12Ring | nFG12HeteroRing
7 | SpMin4_Bhm | SpMin4_Bhm | GATS8p | nHBd | nssO | maxaaCH
8 | GATS5v | GATS5v | nFG12HeteroRing | maxaaCH | maxaaCH | SHBint2
9 | GATS2s | GATS2s | MATS5v | C3SP2 | ETA_Beta_ns_d | TIC1
10 | SpMin5_Bhs | nAcid | maxaaCH | SHBint8 | MDEC-24 | AATSC8m
Table 8. The 8 top-ranked descriptors identified by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum from the BATZD dataset (the common descriptors are emphasized in bold).

Rank | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
1 | JGI4 | JGI4 | VE2_Dze | SpMax1_Bhi | SpMax1_Bhi | SpMax1_Bhi
2 | VE2_Dze | VE2_Dze | JGI3 | MATS5m | GATS1p | GATS1v
3 | MATS5v | ndS | ndS | GATS3s | ndS | GATS3s
4 | SdS | MATS5v | CrippenLogP | C4SP3 | GATS3m | GATS8c
5 | CrippenLogP | CrippenLogP | nHother | CrippenLogP | GATS3s | naaS
6 | mindS | MDEO-22 | minddssS | ALogP | LipoaffinityIndex | AATSC4i
7 | MDEO-22 | nF9Ring | GATS4m | nHother | nHsOH | LipoaffinityIndex
8 | maxdS | ETA_Epsilon_4 | nF9Ring | ATSC8i | ATSC8i | SpDiam_Dzp
Table 9. The 6 top-ranked descriptors identified by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum from the BCL2 dataset (the common descriptors are emphasized in bold).

Rank | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
1 | JGI7 | AATSC8p | AATSC4s | JGI7 | MATS4s | AATSC8p
2 | VE2_D | MATS4s | IC2 | MATS4s | IC2 | IC2
3 | AATSC8p | MATS5m | MDEN-13 | IC2 | E3m | GATS4s
4 | MATS5m | IC2 | minHsNH2 | E3m | MDEN-13 | maxHBint2
5 | MATS4s | MDEN-13 | maxHBint2 | GATS8p | maxHBint2 | minsOH
6 | IC2 | SpMax1_Bhi | nT8Ring | MDEN-13 | minsOH | SwHBa
Table 10. The detailed information of the descriptors obtained by the log-sum method.

Descriptor Type | Class | Descriptors
Autocorrelation | 2D | AATS0v; AATSC4i; AATSC8m; ATS4p; ATSC1p; ATSC4c; ATSC7s; GATS1e; GATS1v; GATS3s; GATS8c; MATS1c; MATS8m; AATSC8p; GATS4s
Atom-type electrotopological state | 2D | hmax; LipoaffinityIndex; maxaaCH; maxHBa; maxwHBa; naaS; nssO; SHBint2; maxHBint2; minsOH; SwHBa
Barysz matrix | 2D | SpDiam_Dzp
Burden modified eigenvalues | 2D | SpMax1_Bhi
Information content | 2D | TIC1; IC2
Path counts | 2D | piPC6
Ring count | 2D | nFG12HeteroRing
Topological charge | 2D | JGI10
Table 11. The names of the descriptors obtained by the log-sum method.

Descriptor | Name
AATS0v | Average Broto–Moreau autocorrelation-lag 0/weighted by van der Waals volumes
AATSC4i | Average centered Broto–Moreau autocorrelation-lag 4/weighted by first ionization potential
AATSC8m | Average centered Broto–Moreau autocorrelation-lag 8/weighted by mass
ATS4p | Broto–Moreau autocorrelation-lag 4/weighted by polarizabilities
ATSC1p | Centered Broto–Moreau autocorrelation-lag 1/weighted by polarizabilities
ATSC4c | Centered Broto–Moreau autocorrelation-lag 4/weighted by charges
ATSC7s | Centered Broto–Moreau autocorrelation-lag 7/weighted by I-state
GATS1e | Geary autocorrelation-lag 1/weighted by Sanderson electronegativities
GATS1v | Geary autocorrelation-lag 1/weighted by van der Waals volumes
GATS3s | Geary autocorrelation-lag 3/weighted by I-state
GATS8c | Geary autocorrelation-lag 8/weighted by charges
hmax | Maximum H E-state
JGI10 | Mean topological charge index of order 10
LipoaffinityIndex | Lipoaffinity index
MATS1c | Moran autocorrelation-lag 1/weighted by charges
MATS8m | Moran autocorrelation-lag 8/weighted by mass
maxaaCH | Maximum atom-type E-state: :CH:
maxHBa | Maximum E-states for (strong) hydrogen bond acceptors
maxwHBa | Maximum E-states for weak hydrogen bond acceptors
naaS | Count of atom-type E-state: :S:-
nFG12HeteroRing | Number of >12-membered fused rings containing heteroatoms (N, O, P, S or halogens)
nssO | Count of atom-type E-state: -O-
piPC6 | Conventional bond order ID number of order 6 (ln(1 + x))
SHBint2 | Sum of E-state descriptors of strength for potential hydrogen bonds of path length 2
SpDiam_Dzp | Spectral diameter from Barysz matrix/weighted by polarizabilities
SpMax1_Bhi | Largest absolute eigenvalue of Burden modified matrix - n 1/weighted by relative first ionization potential
TIC1 | Total information content index (neighborhood symmetry of 1-order)
SwHBa | Sum of E-states for weak hydrogen bond acceptors
AATSC8p | Average centered Broto–Moreau autocorrelation-lag 8/weighted by polarizabilities
IC2 | Information content index (neighborhood symmetry of 2-order)
GATS4s | Geary autocorrelation-lag 4/weighted by I-state
maxHBint2 | Maximum E-state descriptors of strength for potential hydrogen bonds of path length 2
minsOH | Minimum atom-type E-state: -OH
