Descriptor Selection via Log-Sum Regularization for the Biological Activities of Chemical Structure

The quantitative structure-activity relationship (QSAR) model searches for a reliable relationship between the chemical structure and biological activities in the field of drug design and discovery. (1) Background: In the study of QSAR, the chemical structures of compounds are encoded by a substantial number of descriptors. Redundant, noisy and irrelevant descriptors have a side-effect on the QSAR model. Meanwhile, too many descriptors can result in overfitting or a low correlation between chemical structure and biological activity. (2) Methods: We use a novel log-sum regularization to select the few descriptors that are relevant to biological activities. In addition, a coordinate descent algorithm, which uses novel univariate log-sum thresholding for updating the estimated coefficients, has been developed for the QSAR model. (3) Results: Experimental results on an artificial dataset and four QSAR datasets demonstrate that our proposed log-sum method performs well among state-of-the-art methods. (4) Conclusions: Our proposed multiple linear regression with the log-sum penalty is an effective technique for both descriptor selection and prediction of biological activity.


Introduction
The quantitative structure-activity relationship (QSAR) model searches for a reliable relationship between the chemical structure and biological activities in the field of drug design and discovery [1]. In the study of QSAR, the chemical structure is encoded by a substantial number of descriptors, such as thermodynamic and shape descriptors. Generally, only a few descriptors that are relevant to biological activities are of interest to the QSAR model. Descriptor selection aims to eliminate redundant, noisy and irrelevant descriptors [2]. Figure 1 shows the flow diagram of the QSAR modeling process.
Generally, descriptor selection techniques can be categorized into four groups in the study of QSAR: classical methods, artificial intelligence-based methods, miscellaneous methods and regularization methods.
Classical methods have long been used in the study of QSAR. For example, forward selection adds the most significant descriptors until none improves the model to a statistically significant extent, while backward elimination starts with all candidate descriptors and subsequently deletes descriptors without statistical significance. Generally, stepwise regression builds a model by adding or removing predictor variables based on a series of F-tests or t-tests. The variable selection and modeling method based on prediction [3] uses the leave-one-out cross-validated Q^2 to select meaningful and important descriptors. Leaps-and-bounds regression [4] selects a subset of descriptors based on the residual sum of squares (RSS). Among regularization methods, the L1-norm has been used to select significant and meaningful descriptors for the anti-hepatitis C virus activity of thiourea derivatives in a QSAR classification model [19]. Xu et al. proposed L1/2 regularization [20], which yields sparser solutions, and Algamal et al. proposed a penalized linear regression model with the L1/2-norm to select significant and meaningful descriptors [21]. Theoretically, L0 regularization produces the sparsest solutions [22], but solving it is an NP-hard problem. Therefore, Candes et al. proposed the log-sum penalty [23], which approximates L0 regularization much more closely.
In this paper, we utilized the log-sum penalty, which is non-convex (Figure 2). A coordinate descent algorithm, which uses novel univariate log-sum thresholding for updating the estimated coefficients, has been developed for the QSAR model. Experimental results on an artificial dataset and four QSAR datasets demonstrate that our proposed log-sum method performs well among state-of-the-art methods. The rest of this paper is organized as follows: Section 2 introduces the coordinate descent algorithm with the novel univariate log-sum thresholding for updating the estimated coefficients and gives a detailed description of the datasets. In Section 3, we discuss the experimental results on the simulated data and the four QSAR datasets. Finally, we give some conclusions in Section 4.

Methods
In this paper, there is a predictor X and a response y, which represent the chemical structure and the corresponding biological activities, respectively. Suppose we have n samples, D = {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where X_i = (x_i1, x_i2, ..., x_ip) is the i-th input pattern with dimensionality p (that is, X_i has p descriptors), and x_ij denotes the value of descriptor j for the i-th sample. The multiple linear regression is expressed as:

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip,   (1)

where β = (β_0, β_1, ..., β_p) are the coefficients. Given X and y, β_0, β_1, ..., β_p are estimated by minimizing an objective function. The linear regression objective function can be formulated as:

min_β (1/2) ||y − Xβ||^2,   (2)

where y = (y_1, ..., y_n)^T is the vector of n response variables, X = {X_1, X_2, ..., X_n} is the n × p matrix with rows X_i = (x_i1, ..., x_ip), and ||·|| denotes the L2-norm. When the number of variables is larger than the number of samples (p ≫ n), this can result in over-fitting. Therefore, we introduce a penalty function into the objective function to estimate the coefficients, rewriting Equation (2) as:

min_β (1/2) ||y − Xβ||^2 + Σ_{j=1}^p P_λ(β_j),   (3)

where P_λ(·) is a penalty function indexed by the regularization parameter λ > 0.
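As an illustration, the penalized objective in Equation (3) can be evaluated directly. The sketch below (Python with NumPy; the intercept is omitted for brevity, and the penalty choices and toy data are our own illustrative assumptions, not values from this paper) shows the least-squares loss plus a separable penalty:

```python
import numpy as np

def penalized_objective(X, y, beta, penalty, lam):
    """Equation (3): least-squares loss plus a separable penalty on beta."""
    residual = y - X @ beta
    return 0.5 * np.sum(residual ** 2) + np.sum(penalty(beta, lam))

# Two example penalty functions (the L1-norm and the log-sum penalty):
l1 = lambda b, lam: lam * np.abs(b)
log_sum = lambda b, lam, eps=1e-2: lam * np.log(np.abs(b) + eps)

# Toy data: beta fits y exactly, so the objective reduces to the penalty term.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])
```

With a perfect fit, the residual term vanishes and only the penalty remains, which makes the effect of λ easy to see in isolation.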

Coordinate Descent Algorithm for Different Thresholding Operators
In this paper, we used the coordinate descent algorithm to implement the different penalized multiple linear regressions. The algorithm is a "one-at-a-time" method: it solves for β_j while the other coefficients β_{k≠j} (the parameters remaining after the j-th element is removed) are held fixed [22]. Equation (3) can then be rewritten as:

min_{β_j} (1/2) Σ_{i=1}^n (y_i − Σ_{k≠j} x_ik β_k − x_ij β_j)^2 + P_λ(β_j) + Σ_{k≠j} P_λ(β_k),   (4)

where k indexes the variables other than the j-th. Taking the derivative with respect to β_j yields a univariate problem in:

w_j = Σ_{i=1}^n x_ij (y_i − Σ_{k≠j} x_ik β_k),   (5)

the partial residual with respect to the j-th covariate. To take the correlation of descriptors into account, Zou et al. proposed the elastic net (L_EN) [24], which emphasizes a grouping effect. The L_EN penalty function is given as follows:

P_{λ,a}(β) = λ Σ_{j=1}^p ( a|β_j| + ((1 − a)/2) β_j^2 ),   (6)

a combination of the L1 penalty (a = 1) and the ridge penalty (a = 0). Therefore, Equation (5) leads to the univariate problem:

min_{β_j} (1/2)(β_j − w_j)^2 + λ( a|β_j| + ((1 − a)/2) β_j^2 ).   (7)

Donoho et al. proposed the univariate soft thresholding solution [25]; for an L_EN-penalized regression coefficient, it is:

β̂_j = S(w_j, λa) / (1 + λ(1 − a)),   (8)

where S(w_j, λa) is the soft thresholding operator, which gives the L1 solution when a is equal to one. Formula (8)'s operator can be written out as:

S(w, λ) = w − λ if w > λ;  w + λ if w < −λ;  0 otherwise.   (9)

Fan et al. proposed the smoothly clipped absolute deviation (SCAD) penalty [26], which produces a sparse set of solutions and approximately unbiased estimates of large coefficients:

P_λ(β) = λ|β| if |β| ≤ λ;  (2γλ|β| − β^2 − λ^2)/(2(γ − 1)) if λ < |β| ≤ γλ;  λ^2(γ + 1)/2 if |β| > γλ.   (10)

Additionally, the SCAD thresholding operator is given as follows:

β̂_j = S(w_j, λ) if |w_j| ≤ 2λ;  ((γ − 1)/(γ − 2)) S(w_j, γλ/(γ − 1)) if 2λ < |w_j| ≤ γλ;  w_j if |w_j| > γλ.   (11)

Similar to the SCAD penalty, Zhang et al. proposed the minimax concave penalty (MCP) [27]:

P_λ(β) = λ|β| − β^2/(2γ) if |β| ≤ γλ;  γλ^2/2 if |β| > γλ.   (12)

Additionally, the MCP thresholding operator is given as follows:

β̂_j = (γ/(γ − 1)) S(w_j, λ) if |w_j| ≤ γλ;  w_j if |w_j| > γλ,   (13)

where γ > 1 is an empirical parameter. Xu et al. proposed L1/2 regularization [20]; Formula (3) can be rewritten as:

min_β (1/2) ||y − Xβ||^2 + λ Σ_{j=1}^p |β_j|^{1/2},   (14)

and the univariate half thresholding operator for an L1/2-penalized linear regression coefficient is:

β̂_j = (2/3) w_j ( 1 + cos( (2π/3) − (2/3) φ_λ(w_j) ) ) if |w_j| > (54^{1/3}/4) λ^{2/3};  0 otherwise,  where φ_λ(w) = arccos( (λ/8) (|w|/3)^{−3/2} ).   (15)

In this paper, we applied the log-sum penalty to the linear regression model.
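Before turning to the log-sum operator, the univariate soft, elastic-net, SCAD and MCP updates discussed above can be sketched as follows. This is a minimal NumPy sketch using the standard closed forms from the coordinate descent literature; the function names are ours, and the γ defaults (3.0 for MCP, 3.7 for SCAD) are conventional choices, not values tuned in this paper:

```python
import numpy as np

def soft(w, lam):
    """Soft thresholding S(w, lam): the univariate L1 solution."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def enet_update(w, lam, a):
    """Univariate elastic-net update: soft thresholding, then ridge shrinkage."""
    return soft(w, lam * a) / (1.0 + lam * (1.0 - a))

def mcp_update(w, lam, gamma=3.0):
    """Univariate MCP update: rescaled soft thresholding, identity for large w."""
    if abs(w) <= gamma * lam:
        return soft(w, lam) / (1.0 - 1.0 / gamma)
    return w

def scad_update(w, lam, gamma=3.7):
    """Univariate SCAD update: three regimes, leaving large coefficients unbiased."""
    if abs(w) <= 2.0 * lam:
        return soft(w, lam)
    if abs(w) <= gamma * lam:
        return soft(w, gamma * lam / (gamma - 1.0)) / (1.0 - 1.0 / (gamma - 1.0))
    return w
```

Note how both SCAD and MCP return w unchanged for large |w|, which is exactly the "approximately unbiased for large coefficients" property, whereas soft thresholding shrinks every coefficient by λ.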
We could rewrite Formula (3) as follows:

min_β (1/2) ||y − Xβ||^2 + λ Σ_{j=1}^p log(|β_j| + ε),   (16)

where ε > 0 should be set arbitrarily small, to make the log-sum penalty closely resemble the L0-norm. Equation (16) has a local minimum; the proof is given in Appendix A. Coordinate-wise, the univariate log-sum thresholding operator is obtained by minimizing:

(1/2)(β_j − w_j)^2 + λ log(|β_j| + ε)   (17)

over β_j (assuming standardized descriptor columns), where w_j is the partial residual defined in Equation (5). According to the different thresholding operators, we can compare three desirable properties of a coefficient estimator, unbiasedness, sparsity and continuity, as illustrated in Figure 3.
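A minimal sketch of the coordinate descent scheme with a univariate log-sum update is given below. For illustration, the scalar subproblem is solved by comparing β_j = 0 with the stationary point of the smooth branch (the positive root of the quadratic obtained by setting the derivative to zero); this is an assumption made to keep the sketch self-contained, not the paper's exact closed-form operator:

```python
import numpy as np

def logsum_threshold(w, lam, eps=1e-2):
    """Univariate log-sum update: minimize 0.5*(b - w)**2 + lam*log(|b| + eps)
    by comparing b = 0 with the stationary point of the smooth branch."""
    disc = (abs(w) + eps) ** 2 - 4.0 * lam
    if disc <= 0.0:
        return 0.0                       # no real stationary point: shrink to zero
    b_mag = ((abs(w) - eps) + disc ** 0.5) / 2.0
    if b_mag <= 0.0:
        return 0.0
    b = float(np.sign(w)) * b_mag
    obj = lambda t: 0.5 * (t - w) ** 2 + lam * np.log(abs(t) + eps)
    return b if obj(b) < obj(0.0) else 0.0

def cd_logsum(X, y, lam, eps=1e-2, n_iter=50):
    """Coordinate descent for log-sum penalized least squares."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove every contribution except coordinate j
            r = y - X @ beta + X[:, j] * beta[j]
            col_sq = X[:, j] @ X[:, j]
            w = (X[:, j] @ r) / col_sq
            beta[j] = logsum_threshold(w, lam / col_sq, eps)
    return beta
```

On a toy identity design, a large partial residual survives almost unshrunk while small ones are set exactly to zero, illustrating both the sparsity and the near-unbiasedness of the log-sum operator.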

Simulated Data
In this work, we constructed the simulation as follows: Step I: The simulated dataset was generated from multiple linear regression, using the normal distribution to produce X. Here, the number of rows is the sample size n and the number of columns is the number of variables p.
Step II: Add a different correlation parameter ρ to the simulation data.
Step III: In order to obtain a high-quality model and reliable variable selection, the coefficients are set in advance by Equation (20), which assigns nonzero values to variables 1-20, where β is the coefficient vector.
In the simulation study, we firstly generated 100 groups of data with sample sizes n = 100 and n = 200. Secondly, the correlation coefficients ρ = 0.2, 0.4 and the noise control parameters σ = 0.3, 0.9 were considered in the model. Thirdly, the coefficients were set in advance by Equation (20).
Fourthly, multiple linear regression with the different penalties, including our proposed method, was used to select variables and build the model. Finally, since 100 groups of data were generated, the results obtained by each method were averaged.
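The generation steps above can be sketched as follows. The equicorrelated covariance structure and the unit values for the 20 nonzero coefficients are illustrative assumptions of ours, since Equation (20) is not reproduced here:

```python
import numpy as np

def make_simulation(n=100, p=50, rho=0.2, sigma=0.3, n_true=20, seed=0):
    """Generate one simulated group: correlated normal X, pre-set nonzero
    coefficients for variables 1-20 and Gaussian noise controlled by sigma."""
    rng = np.random.default_rng(seed)
    # Step I-II: normal X in which every pair of columns has correlation rho
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    # Step III: coefficients fixed in advance (illustrative unit values)
    beta = np.zeros(p)
    beta[:n_true] = 1.0
    # Response from the linear model plus noise scaled by sigma
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```

Calling this for each combination of n, ρ and σ, 100 times with different seeds, reproduces the averaging protocol described above.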

Real Data
We obtained four public QSAR datasets: the global half-life index [28], endocrine disruptor chemical (EDC) estrogen receptor (ER)-binding [29], (Benzo-)Triazoles toxicity in Daphnia magna [30] and the apoptosis regulator Bcl-2 [31]. A brief description of these datasets is given in Table 1. We used random sampling to divide each dataset into a training set and a test set (80% for training and 20% for testing [32]). Six parameters commonly used in regression problems are employed to evaluate model performance: the squared correlation coefficient of leave-one-out cross-validation (Q^2_LOO), the root mean squared error of cross-validation (RMSE_CV), the squared correlation coefficient of fitting for the training set (R^2_train), the root mean squared error for the training set (RMSE_train), the squared correlation coefficient of fitting for the test set (R^2_test) and the root mean squared error for the test set (RMSE_test). According to the existing literature [33], the value of Q^2_LOO is not the best measure for QSAR model evaluation. Therefore, we paid more attention to R^2_test and RMSE_test.

Algorithm: A coordinate descent algorithm for log-sum penalized multiple linear regression.
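The random 80%/20% split and the error measures can be sketched as follows; the function names are our own, and R^2 is computed here as the coefficient of determination, one common realization of the squared-correlation measure used in the text:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = float(np.sum((y - yhat) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return 1.0 - ss_res / ss_tot

def train_test_split(X, y, frac=0.8, seed=0):
    """Random sampling into training (frac) and test (1 - frac) partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(frac * len(y))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```

The leave-one-out Q^2 is then obtained by applying r2 to predictions in which each sample is predicted by a model fitted on the remaining n − 1 samples.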

Results
In this work, five methods are compared with our proposed method: multiple linear regression with the L_EN, L1, SCAD, MCP and L1/2 penalties. The average numbers of variables selected in total are reported in Table 2. Of the 20 pre-set variables in Equation (20), log-sum selected 19.95 true variables on average (Table 3). Therefore, we can calculate the average accuracy for the simulation datasets obtained by log-sum (19.95 ÷ 23.73 × 100% = 84.07%) in Table 4. From Tables 2-4, when the correlation parameter ρ and the noise control parameter σ decrease, the average accuracy of log-sum improves. When n = 100 and σ = 0.9, the average accuracy of log-sum rises from 83.77% to 98.7% as ρ decreases from 0.4 to 0.2. When n = 200 and ρ = 0.4, the results obtained by log-sum are 84.07% and 86.39% with noise control parameters σ = 0.9 and 0.3. In addition, the average accuracy obtained by our proposed log-sum method is better than that of the other methods; for example, when n = 200, ρ = 0.4 and σ = 0.9, the log-sum result is 84.07%, higher than the 3.19%, 20.20%, 49.20%, 83.22% and 81.74% of L_EN, L1, SCAD, MCP and L1/2. In other words, our proposed log-sum method achieves good performance on the simulation datasets.

Table 2. The average number of variables selected in total by L_EN, L1, SCAD, MCP, L1/2 and log-sum. In bold, the best performance is shown.

Analyses of Real Data
As shown in Table 5 and Figures 4 and 5, the R^2_train and RMSE_train of L1, L1/2 and MCP are 0.87, 0.87, 0.88 and 0.64, 0.62, 0.27, better than the values of 0.85, 0.86, 0.88 and 0.69, 0.63, 0.28 of the log-sum for the GHLI, EDCER and BATZD datasets, respectively. However, our proposed log-sum method is the best in terms of Q^2 and RMSE_CV. On the BATZD dataset, the RMSE_CV obtained by log-sum is 0.23, lower than the values of 0.30, 0.30, 0.30, 0.28 and 0.26 of the other methods. On the BCL2 dataset, the Q^2 obtained by log-sum is 0.75, higher than the 0.51, 0.57, 0.73, 0.73 and 0.67 of the other methods. Moreover, a small subset of descriptors was selected by our proposed method; for example, on the EDCER dataset, log-sum selected 10 descriptors, fewer than the 47, 36, 17, 11 and 12 of L_EN, L1, SCAD, MCP and L1/2. Furthermore, for R^2_test and RMSE_test on the GHLI dataset, the best method is log-sum (0.75 and 0.88); L_EN and L1 are second (0.74 and 0.90); MCP is third (0.73 and 0.91); L1/2 is fourth (0.72 and 0.92); and the last is SCAD (0.72 and 0.93). Therefore, our proposed method is better than the other methods. In addition, we give the experimental and predicted values for the four datasets. First of all, in Tables 6-9, the numbers of top-ranked informative descriptors identified by L_EN, L1, SCAD, MCP, L1/2 and log-sum are 9, 10, 8 and 6 for the four datasets, based on the values of the coefficients. Secondly, the common descriptors are emphasized in bold. Thirdly, as shown in Table 10, most of the descriptors are from the 2D class. Then, the majority of descriptors belong to the atom-type electrotopological state and autocorrelation descriptor types. Finally, the names of the descriptors obtained by the log-sum method are exhibited in Table 11.

Table 6. The 9 top-ranked descriptors identified by L_EN, L1, SCAD, MCP, L1/2 and log-sum from the GHLI dataset (the common descriptors are emphasized in bold).

GHLI: Path counts (2D) piPC6; Ring count (2D) nFG12HeteroRing; Topological charge (2D) JGI10; Information content (2D) IC2.

Table 11. The names of the descriptors obtained by the log-sum method.

AATS0v: Average Broto-Moreau autocorrelation - lag 0 / weighted by van der Waals volumes
AATSC4i: Average centered Broto-Moreau autocorrelation - lag 4 / weighted by first ionization potential
AATSC8m: Average centered Broto-Moreau autocorrelation - lag 8 / weighted by mass
ATS4p: Broto-Moreau autocorrelation - lag 4 / weighted by polarizabilities
ATSC1p: Centered Broto-Moreau autocorrelation - lag 1 / weighted by polarizabilities
ATSC4c: Centered Broto-Moreau autocorrelation - lag 4 / weighted by charges
ATSC7s: Centered Broto-Moreau autocorrelation - lag 7 / weighted by I-state
GATS1e: Geary autocorrelation - lag 1 / weighted by Sanderson electronegativities
GATS1v: Geary autocorrelation - lag 1 / weighted by van der Waals volumes
GATS3s: Geary autocorrelation - lag 3 / weighted by I-state
GATS8c: Geary autocorrelation - lag 8 / weighted by charges
hmax: Maximum H E-state
JGI10: Mean topological charge index of order 10
LipoaffinityIndex: Lipoaffinity index
MATS1c: Moran autocorrelation - lag 1 / weighted by charges
MATS8m: Moran autocorrelation - lag 8 / weighted by mass
piPC6: Conventional bond order ID number of order 6 (ln(1 + x))
SHBint2: Sum of E-state descriptors of strength for potential hydrogen bonds of path length 2
SpDiam_Dzp: Spectral diameter from Barysz matrix / weighted by polarizabilities
SpMax1_Bhi: Largest absolute eigenvalue of Burden-modified matrix - n 1 / weighted by the relative first ionization potential
TIC1: Total information content index (neighborhood symmetry of 1-order)
SwHBa: Sum of E-states for weak hydrogen bond acceptors
AATSC8p: Average centered Broto-Moreau autocorrelation - lag 8 / weighted by polarizabilities
IC2: Information content index (neighborhood symmetry of 2-order)
GATS4s: Geary autocorrelation - lag 4 / weighted by I-state
maxHBint2: Maximum E-state descriptors of strength for potential hydrogen bonds of path length 2
minsOH: Minimum atom-type E-state: -OH

Conclusions
In the field of drug design and discovery, only a few descriptors are of interest to the QSAR model. Therefore, descriptor selection plays an important role in the study of QSAR. In this paper, we proposed univariate log-sum thresholding for updating the estimated coefficients and developed a coordinate descent algorithm for log-sum penalized multiple linear regression.
Experimental results on both the artificial dataset and the four QSAR datasets demonstrate that our proposed multiple linear regression with the log-sum penalty outperforms the L1, L_EN, SCAD, MCP and L1/2 penalties. Therefore, our proposed log-sum method is an effective technique for both descriptor selection and prediction of biological activity.
In this paper, we used random sampling, which is easy to apply, for QSAR data preprocessing. However, this method does not take additional knowledge into account. Therefore, in future work we plan to integrate our proposed method with a self-paced learning mechanism, which learns easy samples first and then gradually takes complex samples into consideration, making the model progressively more mature.