
Descriptor Selection via Log-Sum Regularization for the Biological Activities of Chemical Structure

1 State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Macau 999078, China
2 Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2018, 19(1), 30; https://doi.org/10.3390/ijms19010030
Submission received: 17 November 2017 / Revised: 10 December 2017 / Accepted: 21 December 2017 / Published: 22 December 2017
(This article belongs to the Section Biochemistry)

Abstract

The quantitative structure-activity relationship (QSAR) model searches for a reliable relationship between the chemical structure and biological activities in the field of drug design and discovery. (1) Background: In the study of QSAR, the chemical structures of compounds are encoded by a substantial number of descriptors. Redundant, noisy and irrelevant descriptors degrade the QSAR model, and too many descriptors can cause overfitting or a low correlation between chemical structure and biological activity. (2) Methods: We use a novel log-sum regularization to select the few descriptors that are relevant to biological activities. In addition, a coordinate descent algorithm, which uses a novel univariate log-sum thresholding operator for updating the estimated coefficients, has been developed for the QSAR model. (3) Results: Experimental results on artificial data and four QSAR datasets demonstrate that our proposed log-sum method performs well compared with state-of-the-art methods. (4) Conclusions: Our proposed multiple linear regression with the log-sum penalty is an effective technique for both descriptor selection and prediction of biological activity.


1. Introduction

The quantitative structure-activity relationship (QSAR) model searches for a reliable relationship between the chemical structure and biological activities in the field of drug design and discovery [1]. In the study of QSAR, the chemical structure is encoded by a substantial number of descriptors, such as thermodynamic and shape descriptors. Generally, only a few descriptors that are relevant to biological activities are of interest to the QSAR model. Descriptor selection aims to eliminate redundant, noisy and irrelevant descriptors [2]. Figure 1 shows the flow of the QSAR modeling process.
Generally, descriptor selection techniques can be categorized into four groups in the study of QSAR: classical methods, artificial intelligence-based methods, miscellaneous methods and regularization methods.
Several classical methods have been proposed in the study of QSAR. For example, forward selection adds the most significant descriptors until none improves the model to a statistically-significant extent. Backward elimination starts with all candidate descriptors and subsequently deletes descriptors without any statistical significance. Stepwise regression builds a model by adding or removing predictor variables based on a series of F-tests or t-tests. The variable selection and modeling method based on prediction [3] uses the leave-one-out cross-validated $Q^2$ to select meaningful and important descriptors. Leaps-and-bounds regression [4] selects a subset of descriptors based on the residual sum of squares (RSS).
Recently, artificial intelligence-based methods have been designed for descriptor selection, such as the genetic algorithm [5], which uses coding, selection, exchange and mutation operations to select the important descriptors. Particle swarm optimization [6] starts from a set of initial random particles and then selects the descriptors by updating the particle velocities and positions. Artificial neural networks [7] are composed of many artificial neurons linked together according to a specific network architecture, and select input nodes (descriptors) to predict the output node (biological activity). Simulated annealing [8], which can be performed with the Metropolis algorithm based on Monte Carlo techniques, also performs descriptor selection. Burden and Winkler [9] used Bayesian regularized artificial neural networks with automatic relevance determination (ARD) in the study of QSAR. ARD allows the network to estimate the importance of each input, neglects irrelevant or highly correlated indices in the modeling and uses the most important variables for modeling the activity data. The ant colony system [10], inspired by real ants, searches for a path between the colony and a source of food, which corresponds to a subset of selected descriptors.
The miscellaneous methods used for descriptor selection in the development of QSAR include K nearest neighbors (KNN) [11], the replacement method (RM) [12], the successive projections algorithm (SPA) [13] and uninformative variable elimination-partial least squares (UVE-PLS) [14], to name a few. KNN uses a similarity measure (Euclidean distance) to select the descriptors and predict the biological activity. RM finds an optimal subset of the descriptors via the standard deviation. SPA uses simple projection operations to eliminate collinearity among the descriptors. UVE-PLS increases the predictive ability of the standard PLS method by eliminating variables that do not contribute to the model, comparing experimental variables with added noise variables in terms of their contribution to the model.
Regularization is an effective technique for descriptor selection and has been used in QSRR [15], QSPR [16] and QSTR [17] in the field of chemometrics; it has also attracted interest in the study of QSAR. For example, LASSO ($L_1$, the least absolute shrinkage and selection operator) [18] can perform descriptor selection. Algamal et al. proposed an adaptive $L_1$-norm to select the significant and meaningful descriptors for the anti-hepatitis C virus activity of thiourea derivatives in a QSAR classification model [19]. Xu et al. proposed $L_{1/2}$ regularization [20], which yields sparser solutions. Algamal et al. also proposed a penalized linear regression model with the $L_{1/2}$-norm to select the significant and meaningful descriptors [21]. Theoretically, $L_0$ regularization produces the sparsest solutions [22], but solving it is an NP-hard problem. Therefore, Candes et al. proposed the log-sum penalty [23], which approximates $L_0$ regularization much more closely.
In this paper, we utilize the log-sum penalty, which is non-convex (Figure 2). A coordinate descent algorithm, which uses a novel univariate log-sum thresholding operator for updating the estimated coefficients, has been developed for the QSAR model. Experimental results on artificial data and four QSAR datasets demonstrate that our proposed log-sum method performs well compared with state-of-the-art methods. The structure of this paper is organized as follows: Section 2 introduces the coordinate descent algorithm with the novel univariate log-sum thresholding operator and gives a detailed description of the datasets. In Section 3, we discuss the experimental results on the simulated data and the four QSAR datasets. Finally, we give some conclusions in Section 4.

2. Methods

In this paper, there is a predictor $X$ and a response $y$, which represent the chemical structure and the corresponding biological activities, respectively. Suppose we have $n$ samples, $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the $i$-th input pattern with dimensionality $p$, which means $X_i$ has $p$ descriptors, and $x_{ij}$ denotes the value of descriptor $j$ for the $i$-th sample. The multiple linear regression is expressed as:

$$y_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p + \beta_0 \qquad (1)$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ are the coefficients.
Given $X$ and $y$, $\beta_0, \beta_1, \ldots, \beta_p$ are estimated by minimizing an objective function. For linear regression, the objective function can be formulated as:

$$\min_{\beta}\left\{ \frac{1}{2n}\|y - X\beta\|^2 \right\} \qquad (2)$$

where $y = (y_1, \ldots, y_n)^T$ is the vector of $n$ response variables, $X = \{X_1, X_2, \ldots, X_n\}$ is the $n \times p$ matrix with rows $X_i = (x_{i1}, \ldots, x_{ip})$ and $\|\cdot\|$ denotes the $L_2$-norm. When the number of variables is larger than the number of samples ($p \gg n$), this can result in over-fitting. Hence, we introduce a penalty function into the objective function to estimate the coefficients, rewriting Equation (2) as:

$$\min_{\beta}\left\{ \frac{1}{2n}\|y - X\beta\|^2 + P_{\lambda}(\beta) \right\} \qquad (3)$$

where $P_{\lambda}(\cdot)$ is a penalty function indexed by the regularization parameter $\lambda > 0$.

2.1. Coordinate Decent Algorithm for Different Thresholding Operators

In this paper, we used the coordinate descent algorithm to implement the different penalized multiple linear regressions. The algorithm is a "one-at-a-time" method: it solves for $\beta_j$ while the other $\beta_{k \neq j}$ (the parameters remaining after the $j$-th element is removed) are kept fixed [22]. Equation (3) can be rewritten as:

$$R(\beta) = \arg\min_{\beta_j}\left\{ \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - \big(\sum_{k \neq j} x_{ik}\beta_k + x_{ij}\beta_j\big)\Big)^2 + \lambda\sum_{k \neq j} P(\beta_k) + \lambda P(\beta_j) \right\} \qquad (4)$$

where $k$ indexes the variables other than the $j$-th one.
Taking the derivative with respect to $\beta_j$:

$$\frac{\partial R}{\partial \beta_j} = -\sum_{i=1}^{n} x_{ij}\Big(y_i - \sum_{k \neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big) + \lambda P'(\beta_j) = 0 \qquad (5)$$
Denote $\tilde{y}_i^{(j)} = \sum_{k \neq j} x_{ik}\beta_k$, $\tilde{r}_i^{(j)} = y_i - \tilde{y}_i^{(j)}$ and $w_j = \sum_{i=1}^{n} x_{ij}\tilde{r}_i^{(j)}$, where $\tilde{r}_i^{(j)}$ represents the partial residual with respect to the $j$-th covariate. To take the correlation of descriptors into account, Zou and Hastie proposed the elastic net ($L_{EN}$) [24], which produces a grouping effect. The $L_{EN}$ penalty function is given as follows:

$$P(\beta) = (1 - a)\frac{1}{2}\|\beta\|_{L_2}^2 + a\|\beta\|_{L_1} \qquad (6)$$
The $L_{EN}$ penalty is a combination of the $L_1$ penalty ($a = 1$) and the ridge penalty ($a = 0$). Therefore, Equation (5) is rewritten as follows:

$$\frac{\partial R}{\partial \beta_j} = -\sum_{i=1}^{n} x_{ij}\Big(y_i - \sum_{k \neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big) + \lambda(1 - a)\beta_j + \lambda a\,\mathrm{sign}(\beta_j) = 0 \qquad (7)$$
Donoho et al. proposed the univariate solution [25]; for an $L_{EN}$-penalized regression coefficient it reads:

$$\beta_j = f_{L_{EN}}(w_j, \lambda, a) = \frac{S(w_j, \lambda a)}{1 + \lambda(1 - a)} \qquad (8)$$

where $S(w_j, \lambda a)$ is the soft thresholding operator for the $L_1$ penalty. When $a$ is equal to one, Formula (8) reduces to:

$$\beta_j = \mathrm{Soft}(w_j, \lambda) = \begin{cases} w_j + \lambda & \text{if } w_j < -\lambda \\ w_j - \lambda & \text{if } w_j > \lambda \\ 0 & \text{if } -\lambda \le w_j \le \lambda \end{cases} \qquad (9)$$
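For concreteness, here is a minimal NumPy sketch of the operators in Equations (8) and (9); the function names are ours, not from the authors' implementation.

```python
import numpy as np

def soft_threshold(w, lam):
    """Soft thresholding operator S(w, lam) of Equation (9)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def elastic_net_update(w, lam, a):
    """Univariate elastic-net update f_LEN(w, lam, a) of Equation (8)."""
    return soft_threshold(w, lam * a) / (1.0 + lam * (1.0 - a))
```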
Fan et al. proposed the smoothly clipped absolute deviation (SCAD) penalty [26], which produces a sparse set of solutions and approximately unbiased estimates for large coefficients. The penalty function is as follows:

$$p_{\lambda,a}(\beta) = \begin{cases} \lambda|\beta| & \text{if } |\beta| \le \lambda \\ \dfrac{a\lambda|\beta| - \frac{1}{2}(\beta^2 + \lambda^2)}{a - 1} & \text{if } \lambda < |\beta| \le a\lambda \\ \dfrac{\lambda^2(a^2 - 1)}{2(a - 1)} & \text{if } |\beta| > a\lambda \end{cases} \qquad (10)$$

Additionally, the SCAD thresholding operator is given as follows:

$$\beta_j = f_{\mathrm{SCAD}}(w_j, \lambda, a) = \begin{cases} S(w_j, \lambda) & \text{if } |w_j| \le 2\lambda \\ \dfrac{S(w_j, a\lambda/(a - 1))}{1 - 1/(a - 1)} & \text{if } 2\lambda < |w_j| \le a\lambda \\ w_j & \text{if } |w_j| > a\lambda \end{cases} \qquad (11)$$
Similar to the SCAD penalty, Zhang et al. proposed the minimax concave penalty (MCP) [27]. The penalty function is:

$$p_{\lambda,\gamma}(\beta) = \begin{cases} \lambda|\beta| - \dfrac{\beta^2}{2\gamma} & \text{if } |\beta| \le \gamma\lambda \\ \dfrac{1}{2}\gamma\lambda^2 & \text{if } |\beta| > \gamma\lambda \end{cases} \qquad (12)$$

Additionally, the MCP thresholding operator is given as follows:

$$\beta_j = f_{\mathrm{MCP}}(w_j, \lambda, \gamma) = \begin{cases} \dfrac{S(w_j, \lambda)}{1 - 1/\gamma} & \text{if } |w_j| \le \gamma\lambda \\ w_j & \text{if } |w_j| > \gamma\lambda \end{cases} \qquad (13)$$

where $\gamma$ is an empirically-chosen tuning parameter.
Xu et al. proposed $L_{1/2}$ regularization [20]. Formula (3) then becomes:

$$\min_{\beta}\left\{ \frac{1}{2n}\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p} |\beta_j|^{\frac{1}{2}} \right\} \qquad (14)$$

and the univariate half thresholding operator for an $L_{1/2}$-penalized linear regression coefficient is as follows:

$$\beta_j = \mathrm{Half}(w_j, \lambda) = \begin{cases} \dfrac{2}{3} w_j \left(1 + \cos\dfrac{2\big(\pi - \phi_{\lambda}(w_j)\big)}{3}\right) & \text{if } |w_j| > \dfrac{3}{4}\lambda^{\frac{2}{3}} \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

where $\phi_{\lambda}(w) = \arccos\left(\dfrac{\lambda}{8}\left(\dfrac{|w|}{3}\right)^{-\frac{3}{2}}\right)$.
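The SCAD, MCP and half thresholding operators of Equations (11), (13) and (15) can be sketched in the same style; the defaults a = 3.7 and γ = 3 are common choices from the literature, not values stated in this paper.

```python
import numpy as np

def soft(w, t):
    """Soft thresholding S(w, t)."""
    return np.sign(w) * max(abs(w) - t, 0.0)

def scad_threshold(w, lam, a=3.7):
    """SCAD thresholding operator of Equation (11)."""
    if abs(w) <= 2.0 * lam:
        return soft(w, lam)
    if abs(w) <= a * lam:
        return soft(w, a * lam / (a - 1.0)) / (1.0 - 1.0 / (a - 1.0))
    return w

def mcp_threshold(w, lam, gamma=3.0):
    """MCP thresholding operator of Equation (13)."""
    if abs(w) <= gamma * lam:
        return soft(w, lam) / (1.0 - 1.0 / gamma)
    return w

def half_threshold(w, lam):
    """L_{1/2} half thresholding operator of Equation (15)."""
    if abs(w) > 0.75 * lam ** (2.0 / 3.0):
        phi = np.arccos((lam / 8.0) * (abs(w) / 3.0) ** (-1.5))
        return (2.0 / 3.0) * w * (1.0 + np.cos(2.0 * (np.pi - phi) / 3.0))
    return 0.0
```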
In this paper, we apply the log-sum penalty to the linear regression model, rewriting Formula (3) as follows:

$$\min_{\beta}\left\{ \frac{1}{2n}\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p} \log(|\beta_j| + \varepsilon) \right\} \qquad (16)$$

where $\varepsilon > 0$ should be set arbitrarily small to make the log-sum penalty closely resemble the $L_0$-norm. Equation (16) has a local minimum, attained by the following thresholding operator (the proof is given in Appendix A):

$$\beta_j = f_{\mathrm{log\text{-}sum}}(w_j, \lambda, \varepsilon) = D(w_j, \lambda, \varepsilon) = \begin{cases} \mathrm{sign}(w_j)\dfrac{c_1 + \sqrt{c_2}}{2} & \text{if } c_2 > 0 \\ 0 & \text{if } c_2 \le 0 \end{cases} \qquad (17)$$

where $\lambda > 0$, $0 < \varepsilon < \sqrt{\lambda}$, $c_1 = |w_j| - \varepsilon$ and $c_2 = c_1^2 - 4(\lambda - |w_j|\varepsilon)$.
Based on these thresholding operators, three desirable properties of a coefficient estimator can be assessed: unbiasedness, sparsity and continuity (Figure 3).
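A minimal sketch of the log-sum thresholding operator $D(w_j, \lambda, \varepsilon)$ of Equation (17), following the closed form above:

```python
import numpy as np

def logsum_threshold(w, lam, eps):
    """Univariate log-sum thresholding D(w, lam, eps) of Equation (17).

    Returns the larger root of Equation (A3), which Appendix A shows
    to be a local minimum, or 0 when no real root exists.
    """
    c1 = abs(w) - eps
    c2 = c1 ** 2 - 4.0 * (lam - abs(w) * eps)
    if c2 > 0:
        return np.sign(w) * (c1 + np.sqrt(c2)) / 2.0
    return 0.0
```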

2.2. Dataset

2.2.1. Simulated Data

In this work, we constructed a simulation study as follows:
Step I: The simulated dataset was generated from the multiple linear regression model below, drawing $X$ from the normal distribution. The number of rows is the sample size $n$ and the number of columns is the number of variables $p$.

$$y = X\beta + \sigma\epsilon \;(\text{without an intercept term}), \quad \epsilon \sim N(0, 1) \qquad (18)$$

where $y = (y_1, \ldots, y_n)^T$ is the vector of $n$ response variables, $X = \{X_1, X_2, \ldots, X_n\}$ is the generated matrix with rows $X_i = (x_{i1}, \ldots, x_{ip})$, $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is the random error and $\sigma$ controls the signal-to-noise ratio.
Step II: Introduce correlation into the simulated data via the parameter $\rho$:

$$x_{ij} = \rho \times x_{i1} + (1 - \rho)\, x_{ij}, \quad i \in \{1, \ldots, n\}, \; j \in \{2, 3, 4, 5, 6\} \qquad (19)$$
Step III: In order to obtain a high-quality model and variable selection, the coefficients are set in advance, with positions 1-20 non-zero:

$$\beta = (\underbrace{2, 2, 1, 1.5, 3, 2.5, 3, 2, \ldots, 2}_{20}, \underbrace{0, 0, \ldots, 0}_{1980}) \in \mathbb{R}^{2000} \qquad (20)$$

where $\beta$ is the coefficient vector.
Step IV: Obtain $y$ from Equations (18)–(20).
In the simulation study, we first generated 100 groups of data with sample sizes $n = 100$ and $n = 200$. Second, the correlation coefficients $\rho = 0.2, 0.4$ and the noise control parameters $\sigma = 0.3, 0.9$ were considered in the model. Third, the coefficients (20) were set in advance. Fourth, multiple linear regression with the different penalties, including our proposed method, was used to select variables and build the models. Finally, since 100 groups of data were generated, the results obtained by the different methods were averaged.
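A minimal sketch of this data-generating process; we read Equation (20) as repeating the value 2 from position 8 through position 20, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, sigma = 200, 2000, 0.4, 0.9

# Step I: draw the design matrix from the standard normal distribution.
X = rng.standard_normal((n, p))

# Step II: correlate descriptors 2-6 with the first descriptor, Equation (19).
for j in range(1, 6):
    X[:, j] = rho * X[:, 0] + (1.0 - rho) * X[:, j]

# Step III: the first 20 coefficients are non-zero, the remaining 1980 are zero, Equation (20).
beta = np.zeros(p)
beta[:20] = [2, 2, 1, 1.5, 3, 2.5, 3] + [2] * 13

# Step IV: generate the response from Equation (18).
y = X @ beta + sigma * rng.standard_normal(n)
```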

2.2.2. Real Data

We used four public QSAR datasets: the global half-life index (GHLI) [28], endocrine disrupting chemical (EDC) estrogen receptor (ER) binding (EDCER) [29], (benzo-)triazole toxicity in Daphnia magna (BTAZD) [30] and the apoptosis regulator Bcl-2 (BCL2) [31]. A brief description of these datasets is given in Table 1. We used random sampling to divide each dataset into a training set and a test set (80% for training and 20% for testing [32]). Six parameters commonly used in regression problems are employed to evaluate model performance: the squared correlation coefficient of leave-one-out cross-validation ($Q^2_{LOO}$), the root mean squared error of cross-validation ($RMSE_{CV}$), the squared correlation coefficient of fitting for the training set ($R^2_{train}$), the root mean squared error for the training set ($RMSE_{train}$), the squared correlation coefficient for the test set ($R^2_{test}$) and the root mean squared error for the test set ($RMSE_{test}$). According to the existing literature [33], $Q^2_{LOO}$ is not the best measure for QSAR model evaluation; we therefore focus primarily on $R^2_{test}$ and $RMSE_{test}$.
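As a reference, here is a small sketch of how these evaluation measures might be computed; the helper names are ours, and the $Q^2_{LOO}$ helper assumes the leave-one-out predictions have already been collected.

```python
import numpy as np

def r2(y_true, y_pred):
    """Squared correlation coefficient between observed and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def q2_loo(y, y_loo_pred):
    """Leave-one-out Q^2 = 1 - PRESS / TSS, given LOO predictions."""
    press = np.sum((y - y_loo_pred) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - press / tss
```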
Algorithm: A coordinate descent algorithm for log-sum penalized multiple linear regression.
Step 1: Initialize all $\beta_j^{(m)} = 0$ ($j = 1, 2, \ldots, p$), set $\lambda$, $\varepsilon$ and $m = 0$;
Step 2: Calculate the objective function (16) based on $\beta^{(m)}$;
Step 3: Update each $\beta_j^{(m)}$, cycling over $j = 1, 2, \ldots, p$:
              Step 3.1: Compute $\tilde{r}_i^{(j)(m)} = y_i - \tilde{y}_i^{(j)(m)} = y_i - \sum_{k \neq j} x_{ik}\beta_k^{(m)}$
                            and $w_j^{(m)} = \sum_{i=1}^{n} x_{ij}\tilde{r}_i^{(j)(m)}$;
              Step 3.2: Update $\beta_j^{(m)} = D(w_j^{(m)}, \lambda, \varepsilon)$;
Step 4: Let $m \leftarrow m + 1$ and $\beta^{(m+1)} \leftarrow \beta^{(m)}$;
Step 5: Repeat Steps 2 and 3 until $\beta^{(m)}$ converges.
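The following is a minimal, self-contained sketch of the algorithm above. Our implementation choices, not prescribed by the paper: the descriptor columns are assumed standardized, the $1/n$ factor of Equation (16) is folded into $w_j$, and the defaults for eps, max_iter and tol are our own.

```python
import numpy as np

def logsum_threshold(w, lam, eps):
    """Univariate log-sum thresholding D(w, lam, eps) of Equation (17)."""
    c1 = abs(w) - eps
    c2 = c1 ** 2 - 4.0 * (lam - abs(w) * eps)
    return np.sign(w) * (c1 + np.sqrt(c2)) / 2.0 if c2 > 0 else 0.0

def logsum_coordinate_descent(X, y, lam, eps=1e-3, max_iter=100, tol=1e-6):
    """Coordinate descent for log-sum penalized multiple linear regression."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y - X @ beta
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            # Step 3.1: partial residual and w_j for the j-th descriptor.
            r_j = residual + X[:, j] * beta[j]
            w_j = X[:, j] @ r_j / n
            # Step 3.2: univariate log-sum update.
            b_new = logsum_threshold(w_j, lam, eps)
            residual = r_j - X[:, j] * b_new
            beta[j] = b_new
        # Step 5: stop once the coefficients converge.
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

Applied to the simulated data of Section 2.2.1 (with standardized columns), a call such as logsum_coordinate_descent(X, y, lam=0.1) returns a sparse coefficient vector; in practice λ and ε would be tuned, for example by cross-validation.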

3. Results

In this work, five methods are compared with our proposed method: multiple linear regression with the $L_{EN}$, $L_1$, SCAD, MCP and $L_{1/2}$ penalties, respectively.

3.1. Analyses of Simulated Data

Table 2 and Table 3 report the number of variables selected (non-zero coefficients) by the different methods out of all 2000 variables and out of the 20 pre-set variables, respectively. For example, when $n = 200$, $\rho = 0.4$ and $\sigma = 0.9$, the log-sum method selects 23.73 variables on average out of 2000 (Table 2), of which 19.95 on average are among the 20 pre-set variables (Table 3). From these we calculate the average accuracy ($19.95 \div 23.73 \times 100\% = 84.07\%$) for the simulated datasets obtained by log-sum in Table 4. Tables 2–4 show that, as the correlation parameter $\rho$ and the noise control parameter $\sigma$ decrease, the average accuracy of log-sum improves. When $n = 100$ and $\sigma = 0.9$, the average accuracy of log-sum rises from 83.77% to 98.80% as $\rho$ decreases from 0.4 to 0.2. When $n = 200$ and $\rho = 0.4$, log-sum attains 84.07% and 86.39% with the noise control parameter $\sigma = 0.9$ and $0.3$, respectively. In addition, the average accuracy obtained by our proposed log-sum method is better than that of the other methods; for example, when $n = 200$, $\rho = 0.4$ and $\sigma = 0.9$, the log-sum result of 84.07% is higher than the 3.19%, 20.20%, 49.20%, 83.22% and 81.74% of $L_{EN}$, $L_1$, SCAD, MCP and $L_{1/2}$. In other words, our proposed log-sum method achieves good performance on the simulated datasets.

3.2. Analyses of Real Data

As shown in Table 5 and Figures 4 and 5, the $R^2_{train}$ and $RMSE_{train}$ of $L_1$, $L_{1/2}$ and MCP (0.87, 0.87, 0.88 and 0.64, 0.62, 0.27) are better than the corresponding values of log-sum (0.85, 0.86, 0.88 and 0.69, 0.63, 0.28) on the GHLI, EDCER and BATZD datasets, respectively. However, our proposed log-sum method is the best in terms of $Q^2$ and $RMSE_{CV}$. On the BATZD dataset, the $RMSE_{CV}$ obtained by log-sum is 0.23, lower than the 0.30, 0.30, 0.30, 0.28 and 0.26 of the other methods. On the BCL2 dataset, the $Q^2$ obtained by log-sum is 0.75, higher than the 0.51, 0.57, 0.73, 0.73 and 0.67 of the other methods. Moreover, a smaller subset of descriptors was selected by our proposed method; for example, on the EDCER dataset, log-sum selects 10 descriptors, fewer than the 47, 36, 17, 11 and 12 of $L_{EN}$, $L_1$, SCAD, MCP and $L_{1/2}$. Furthermore, for $R^2_{test}$ and $RMSE_{test}$ on the GHLI dataset, the best method is log-sum (0.75 and 0.88); $L_{EN}$ and $L_1$ are second (0.74 and 0.90); MCP is third (0.73 and 0.91); $L_{1/2}$ is fourth (0.72 and 0.92); and SCAD is last (0.72 and 0.93). Therefore, our proposed method outperforms the other methods. In addition, the experimental and predicted values for the four datasets are reported.
First, in Tables 6–9, the numbers of top-ranked informative descriptors identified by $L_{EN}$, $L_1$, SCAD, MCP, $L_{1/2}$ and log-sum are 9, 10, 8 and 6 for the GHLI, EDCER, BATZD and BCL2 datasets, respectively, based on the values of the coefficients. Second, the common descriptors are emphasized in bold. Third, as shown in Table 10, all selected descriptors are from the 2D class, and the majority belong to the autocorrelation and atom-type electrotopological state descriptor types. Finally, the names of the descriptors obtained by the log-sum method are listed in Table 11.

4. Conclusions

In the field of drug design and discovery, only a few descriptors are of interest to the QSAR model. Therefore, descriptor selection plays an important role in the study of QSAR. In this paper, we proposed univariate log-sum thresholding for updating the estimated coefficients and developed a coordinate descent algorithm for log-sum penalized multiple linear regression.
Experimental results on both artificial data and four QSAR datasets demonstrate that our proposed multiple linear regression with the log-sum penalty outperforms $L_1$, $L_{EN}$, SCAD, MCP and $L_{1/2}$. Therefore, our proposed log-sum method is an effective technique for both descriptor selection and prediction of biological activity.
In this paper, we used random sampling, which is easy to apply, for QSAR data preprocessing. However, this method does not take additional knowledge into account. Therefore, in future work we plan to integrate a self-paced learning mechanism, which learns from easy samples first and then gradually takes more complex samples into consideration, making the model progressively more mature, into our proposed method.

Acknowledgments

This work was supported by the Macau Science and Technology Development Fund (Grant No. 003/2016/AFJ) from the Macau Special Administrative Region of the People’s Republic of China, the National Grand Fundamental Research 973 Program of China under Grant No. 2013CB329404 and the China NSFC projects under Contracts 61373114, 61661166011, 11690011 and 61721002.

Author Contributions

Liang-Yong Xia, Hua Chai and Yong Liang designed the simulations. Liang-Yong Xia and De-Yu Meng provided the mathematical proof. Liang-Yong Xia, Xiao-Jun Yao and Yu-Wei Wang contributed to collecting the datasets and analyzing the data. Liang-Yong Xia and Yong Liang designed and implemented the algorithm. Liang-Yong Xia, Yu-Wei Wang, De-Yu Meng, Xiao-Jun Yao and Yong Liang contributed to the interpretation of the results. Liang-Yong Xia took the lead in writing the manuscript. Yu-Wei Wang, De-Yu Meng, Xiao-Jun Yao, Hua Chai and Yong Liang revised the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

QSAR	Quantitative structure-activity relationship
QSRR	Quantitative structure-(chromatographic) retention relationship
QSPR	Quantitative structure-property relationship
QSTR	Quantitative structure-toxicity relationship
MLR	Multiple linear regression
MCP	Minimax concave penalty
SCAD	Smoothly clipped absolute deviation
L1	LASSO (least absolute shrinkage and selection operator)
BTAZD	(Benzo-)Triazoles toxicity in Daphnia magna
EDCER	EDC estrogen receptor binding
GHLI	Global half-life index
BCL2	Apoptosis regulator Bcl-2

Appendix A. Proof

We first consider the situation $\beta_j > 0$:

$$\frac{\partial R}{\partial \beta_j} = -\sum_{i=1}^{n} x_{ij}\Big(y_i - \sum_{k \neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big) + \lambda\frac{1}{\beta_j + \varepsilon} = 0 \qquad (A1)$$

Denoting $\tilde{y}_i^{(j)} = \sum_{k \neq j} x_{ik}\beta_k$, $\tilde{r}_i^{(j)} = y_i - \tilde{y}_i^{(j)}$ and $\omega_j = \sum_{i=1}^{n} x_{ij}\tilde{r}_i^{(j)}$, and noting that the descriptors are standardized so that $\sum_{i=1}^{n} x_{ij}^2 = 1$, the gradient of the log-sum regularized objective at $\beta_j$ can be expressed as:

$$\frac{\partial R}{\partial \beta_j} = \beta_j - \omega_j + \lambda\frac{1}{\beta_j + \varepsilon} = 0 \qquad (A2)$$

which is equivalent to:

$$\beta_j^2 - (\omega_j - \varepsilon)\beta_j + (\lambda - \omega_j\varepsilon) = 0 \qquad (A3)$$

$$\beta_j = \frac{(\omega_j - \varepsilon) \pm \sqrt{(\omega_j - \varepsilon)^2 - 4(\lambda - \omega_j\varepsilon)}}{2}$$

Let $c_1 = \omega_j - \varepsilon$ and $c_2 = c_1^2 - 4(\lambda - \omega_j\varepsilon)$. Thus, we have:
(1) If $c_2 < 0$, Equation (A3) has no real solution.
(2) If $c_2 = 0$, Equation (A3) has the single solution $\beta_j = \frac{c_1}{2}$.
(3) If $c_2 > 0$, Equation (A3) has the two solutions $\beta_{j1} = \frac{c_1 - \sqrt{c_2}}{2}$ and $\beta_{j2} = \frac{c_1 + \sqrt{c_2}}{2}$, and:

$$c_2 = (\omega_j - \varepsilon)^2 - 4(\lambda - \omega_j\varepsilon) = \omega_j^2 - 2\omega_j\varepsilon + \varepsilon^2 - 4\lambda + 4\omega_j\varepsilon = (\omega_j + \varepsilon)^2 - 4\lambda > 0$$
$$\Rightarrow \omega_j + \varepsilon > 2\sqrt{\lambda} \Rightarrow \omega_j - \varepsilon > 2\sqrt{\lambda} - 2\varepsilon > 0 \Rightarrow c_1 > 0$$

where the last step uses $0 < \varepsilon < \sqrt{\lambda}$.
Thus, $\beta_{j2} > \beta_{j1} > 0$, and it is then easy to verify that $\frac{\partial R}{\partial \beta_j} > 0$ when $0 < \beta_j < \beta_{j1}$ or $\beta_j > \beta_{j2}$, and $\frac{\partial R}{\partial \beta_j} < 0$ when $\beta_{j1} < \beta_j < \beta_{j2}$. Therefore, Equation (16) has a local minimum at $\beta_{j2}$. For $\beta_j < 0$, the proof proceeds in a similar way.
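As a quick numerical sanity check of this proof, the closed form of Equation (17) can be compared against a grid search over the univariate objective $f(\beta) = \frac{1}{2}(\beta - \omega)^2 + \lambda\log(|\beta| + \varepsilon)$; the parameter values below are arbitrary, chosen so that the local minimum at $\beta_{j2}$ is also the minimum over the searched interval.

```python
import numpy as np

def logsum_threshold(w, lam, eps):
    c1 = abs(w) - eps
    c2 = c1 ** 2 - 4.0 * (lam - abs(w) * eps)
    return np.sign(w) * (c1 + np.sqrt(c2)) / 2.0 if c2 > 0 else 0.0

w, lam, eps = 1.6, 0.2, 0.05

def f(b):
    """Univariate log-sum objective for a standardized descriptor."""
    return 0.5 * (b - w) ** 2 + lam * np.log(np.abs(b) + eps)

grid = np.linspace(0.01, 3.0, 200001)
b_grid = grid[np.argmin(f(grid))]
b_closed = logsum_threshold(w, lam, eps)
print(b_closed, b_grid)  # both are approximately 1.468
```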

References

  1. Katritzky, A.R.; Kuanar, M.; Slavov, S.; Hall, C.D.; Karelson, M.; Kahn, I.; Dobchev, D.A. Quantitative correlation of physical and chemical properties with chemical structure: Utility for prediction. Chem. Rev. 2010, 110, 5714–5789. [Google Scholar] [CrossRef] [PubMed]
  2. Shahlaei, M. Descriptor selection methods in quantitative structure-activity relation-ship studies: A review study. Chem. Rev. 2013, 113, 8093–8103. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, S.-S.; Liu, H.-L.; Yin, C.-S.; Wang, L.-S. Vsmp: A novel variable selection and modeling method based on the prediction. J. Chem. Inf. Comput. Sci. 2003, 43, 964–969. [Google Scholar] [CrossRef] [PubMed]
  4. Xu, L.; Zhang, W.-J. Comparison of different methods for variable selection. Anal. Chim. Acta 2001, 446, 475–481. [Google Scholar] [CrossRef]
  5. Wegner, J.K.; Zell, A. Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method. J. Chem. Inf. Comput. Sci. 2003, 43, 1077–1084. [Google Scholar] [CrossRef] [PubMed]
  6. Khajeh, A.; Modarress, H.; Zeinoddini-Meymand, H. Modified particle swarm optimization method for variable selection in qsar/qspr studies. Struct. Chem. 2013, 24, 1401–1409. [Google Scholar] [CrossRef]
  7. Meissner, M.; Schmuker, M.; Schneider, G. Optimized particle swarm optimization (OPSO) and its application to artificial neural network training. BMC Bioinform. 2006, 7, 125. [Google Scholar] [CrossRef] [PubMed]
  8. Ghosh, P.; Bagchi, M. QSAR modeling for quinoxaline derivatives using genetic algorithm and simulated annealing based feature selection. Curr. Med. Chem. 2009, 16, 4032–4048. [Google Scholar] [CrossRef] [PubMed]
  9. Burden, F.; Winkler, D. Bayesian regularization of neural networks. Artif. Neural Netw. Methods Appl. 2009, 458, 23–42. [Google Scholar]
  10. Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39. [Google Scholar] [CrossRef]
  11. Zheng, W.; Tropsha, A. Novel variable selection quantitative structure- property relationship approach based on the k-nearest-neighbor principle. J. Chem. Inf. Comput. Sci. 2000, 40, 185–194. [Google Scholar] [CrossRef] [PubMed]
  12. Mercader, A.G.; Duchowicz, P.R.; Fernández, F.M.; Castro, E.A. Modified and enhanced replacement method for the selection of molecular descriptors in qsar and qspr theories. Chemom. Intell. Lab. Syst. 2008, 92, 138–144. [Google Scholar] [CrossRef]
  13. Araújo, M.C.U.; Saldanha, T.C.B.; Galvão, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73. [Google Scholar] [CrossRef]
  14. Put, R.; Daszykowski, M.; Baczek, T.; Heyden, Y.V. Retention prediction of peptides based on uninformative variable elimination by partial least squares. J. Proteome Res. 2006, 5, 1618–1625. [Google Scholar] [CrossRef] [PubMed]
  15. Daghir-Wojtkowiak, E.; Wiczling, P.; Bocian, S.; Kubik, L.; Koslinski, P.; Buszewski, B.; Kaliszan, R.; Markuszewski, M.J. Least absolute shrinkage and selection operator and dimensionality reduction techniques in quantitative structure retention relationship modeling of retention in hydrophilic interaction liquid chromatography. J. Chromatogr. A 2015, 1403, 54–62. [Google Scholar] [CrossRef] [PubMed]
  16. Goodarzi, M.; Chen, T.; Freitas, M.P. QSPR predictions of heat of fusion of organic compounds using Bayesian regularized artificial neural networks. Chemom. Intell. Lab. Syst. 2010, 104, 260–264. [Google Scholar] [CrossRef]
  17. Aalizadeh, R.; Peter, C.; Thomaidis, N.S. Prediction of acute toxicity of emerging contaminants on the water flea Daphnia magna by Ant Colony Optimization-Support Vector Machine QSTR models. Environ. Sci. Process. Impacts 2017, 19, 438–448. [Google Scholar] [CrossRef] [PubMed]
  18. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar]
  19. Algamal, Z.; Lee, M. A new adaptive l1-norm for optimal descriptor selection of high-dimensional qsar classification model for anti-hepatitis c virus activity of thiourea derivatives. SAR QSAR Environ. Res. 2017, 28, 75–90. [Google Scholar] [CrossRef] [PubMed]
  20. Xu, Z.; Chang, X.; Xu, F.; Zhang, H. l1/2 regularization: A thresholding repre-sentation theory and a fast solver. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1013–1027. [Google Scholar] [PubMed]
  21. Algamal, Z.; Lee, M.; Al-Fakih, A.; Aziz, M. High-dimensional qsar modeling using penalized linear regression model with l1/2-norm. SAR QSAR Environ. Res. 2016, 27, 703–719. [Google Scholar] [CrossRef] [PubMed]
  22. Liang, Y.; Liu, C.; Luan, X.-Z.; Leung, K.-S.; Chan, T.-M.; Xu, Z.B.; Zhang, H. Sparse logistic regression with a l1/2 penalty for gene selection in cancer classification. BMC Bioinform. 2013, 14, 198. [Google Scholar] [CrossRef] [PubMed]
  23. Candes, E.J.; Wakin, M.B.; Boyd, S.P. Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Appl. 2008, 14, 877–905. [Google Scholar] [CrossRef]
  24. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320. [Google Scholar] [CrossRef]
  25. Donoho, D.L.; Johnstone, I.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81, 425–455. [Google Scholar] [CrossRef]
  26. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  27. Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
  28. Gramatica, P.; Papa, E. Screening and ranking of pops for global half-life: Qsar approaches for prioritization based on molecular structure. Environ. Sci. Technol. 2007, 41, 2833–2839. [Google Scholar] [CrossRef] [PubMed]
  29. Li, J.; Gramatica, P. The importance of molecular structures, endpoints values, and predictivity parameters in qsar research: Qsar analysis of a series of estrogen receptor binders. Mol. Divers. 2010, 14, 687–696. [Google Scholar] [CrossRef] [PubMed]
  30. Cassani, S.; Kovarich, S.; Papa, E.; Roy, P.P.; van der Wal, L.; Gramatica, P. Daphnia and fish toxicity of (benzo) triazoles: Validated qsar models, and interspecies quantitative activity-activity modeling. J. Hazard. Mater. 2013, 258, 50–60. [Google Scholar] [CrossRef] [PubMed]
  31. Zakharov, A.V.; Peach, M.L.; Sitzmann, M.; Nicklaus, M.C. Qsar modeling of imbalanced high-throughput screening data in pubchem. J. Chem. Inf. Model. 2014, 54, 705–712. [Google Scholar] [CrossRef] [PubMed]
  32. Gramatica, P.; Cassani, S.; Chirico, N. QSARINS-Chem: Insubria Datasets and New QSAR/QSPR Models for Environmental Pollutants in QSARINS. J. Comput. Chem. Softw. News Updates 2014, 35, 1036–1044. [Google Scholar] [CrossRef] [PubMed]
  33. Golbraikh, A.; Tropsha, A. Beware of q2. J. Mol. Graph. Model. 2002, 20, 269–276. [Google Scholar] [CrossRef]
Figure 1. The flow diagram shows the process of QSAR modeling. (1) Collecting molecular structures and their activities; (2) calculating molecular descriptors, which can produce thousands of parameters for each molecular structure; (3) removing redundant or irrelevant descriptors via descriptor selection; (4) building the model with the optimum descriptor subset; (5) predicting the biological activity of a new molecular structure using the established model. Different color blocks represent different values.
Figure 2. L 1 and L E N are convex, and SCAD, MCP, L 1 / 2 and log-sum are non-convex. The log-sum approximates to L 0 .
Figure 3. Plot of thresholding functions for: (a) L 1 ; (b) L E N ; (c) SCAD; (d) MCP; (e) L 1 / 2 ; and (f) log-sum.
Figure 4. The value of residual ( | y y p r e d | ) on different datasets.
Figure 5. The number of descriptors obtained by the multiple linear regression with the different penalties on different datasets (different colors represent different datasets).
Table 1. A brief description of the four public datasets used in the experiments.

Dataset Name | No. of Samples | No. of Descriptors | No. of Samples (Training) | No. of Samples (Test)
BTAZD | 97 | 1083 | 78 | 19
EDCER | 129 | 1089 | 104 | 25
GHLI | 250 | 1120 | 200 | 50
BCL2 | 508 | 1562 | 407 | 101
Table 2. The average number of variables selected in total by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum. In bold, the best performance is shown.

Setting | Sample Size | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
ρ = 0.2, σ = 0.3 | n = 100 | 381.60 | 92.92 | 19.09 | 23.36 | 19.13 | 19.00
ρ = 0.2, σ = 0.3 | n = 200 | 498.81 | 34.18 | 19.03 | 19.00 | 19.09 | 19.00
ρ = 0.2, σ = 0.9 | n = 100 | 382.24 | 93.26 | 27.74 | 25.79 | 21.77 | 21.54
ρ = 0.2, σ = 0.9 | n = 200 | 499.49 | 95.83 | 36.48 | 23.65 | 23.83 | 23.15
ρ = 0.4, σ = 0.3 | n = 100 | 378.96 | 93.98 | 19.26 | 24.67 | 19.98 | 19.11
ρ = 0.4, σ = 0.3 | n = 200 | 495.66 | 97.51 | 40.87 | 24.04 | 24.42 | 23.79
ρ = 0.4, σ = 0.9 | n = 100 | 379.35 | 93.46 | 29.22 | 26.08 | 22.48 | 22.04
ρ = 0.4, σ = 0.9 | n = 200 | 495.64 | 98.97 | 40.61 | 23.95 | 24.43 | 23.73
Table 3. The average number of variables selected with a pre-set value (20) obtained by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum.

Setting | Sample Size | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
ρ = 0.2, σ = 0.3 | n = 100 | 12.23 | 14.45 | 19.09 | 18.81 | 19.13 | 19.00
ρ = 0.2, σ = 0.3 | n = 200 | 16.22 | 20.00 | 19.03 | 19.00 | 19.09 | 19.00
ρ = 0.2, σ = 0.9 | n = 100 | 12.24 | 14.30 | 19.93 | 19.42 | 19.74 | 19.81
ρ = 0.2, σ = 0.9 | n = 200 | 16.26 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00
ρ = 0.4, σ = 0.3 | n = 100 | 11.84 | 13.57 | 18.88 | 18.40 | 18.65 | 18.88
ρ = 0.4, σ = 0.3 | n = 200 | 15.79 | 19.99 | 19.97 | 19.93 | 19.96 | 19.93
ρ = 0.4, σ = 0.9 | n = 100 | 11.88 | 13.55 | 19.48 | 18.81 | 19.14 | 19.00
ρ = 0.4, σ = 0.9 | n = 200 | 15.80 | 19.99 | 19.98 | 19.93 | 19.97 | 19.95
Table 4. The average accuracy (%) for the simulation datasets obtained by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum. In bold, the best performance is shown.

Setting | Sample Size | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
ρ = 0.2, σ = 0.3 | n = 100 | 3.20% | 15.55% | 100.00% | 80.52% | 100.00% | 100.00%
ρ = 0.2, σ = 0.3 | n = 200 | 3.25% | 58.51% | 100.00% | 100.00% | 100.00% | 100.00%
ρ = 0.2, σ = 0.9 | n = 100 | 3.12% | 14.44% | 98.03% | 74.58% | 93.34% | 98.80%
ρ = 0.2, σ = 0.9 | n = 200 | 3.19% | 20.50% | 48.86% | 82.90% | 81.74% | 83.77%
ρ = 0.4, σ = 0.3 | n = 100 | 3.20% | 15.33% | 71.85% | 75.30% | 90.68% | 91.97%
ρ = 0.4, σ = 0.3 | n = 200 | 3.26% | 20.87% | 54.87% | 84.57% | 83.93% | 86.39%
ρ = 0.4, σ = 0.9 | n = 100 | 3.19% | 20.50% | 48.86% | 82.90% | 81.74% | 83.77%
ρ = 0.4, σ = 0.9 | n = 200 | 3.19% | 20.20% | 49.20% | 83.22% | 81.74% | 84.07%
Table 5. Experimental results on the four datasets (the results of our proposed method are emphasized in bold and italic).

Dataset | Method | R²_train | RMSE_train | Q²_LOO | RMSE_CV | R²_test | RMSE_test
GHLI | L_EN | 0.87 | 0.65 | 0.74 | 0.68 | 0.74 | 0.90
GHLI | L_1 | 0.87 | 0.64 | 0.75 | 0.67 | 0.74 | 0.90
GHLI | SCAD | 0.84 | 0.71 | 0.82 | 0.62 | 0.72 | 0.93
GHLI | MCP | 0.85 | 0.68 | 0.80 | 0.65 | 0.73 | 0.91
GHLI | L_1/2 | 0.82 | 0.75 | 0.81 | 0.62 | 0.72 | 0.92
GHLI | log-sum | 0.85 | 0.69 | 0.84 | 0.57 | 0.75 | 0.88
EDCER | L_EN | 0.81 | 0.74 | 0.70 | 0.70 | 0.64 | 1.23
EDCER | L_1 | 0.82 | 0.73 | 0.73 | 0.68 | 0.63 | 1.25
EDCER | SCAD | 0.86 | 0.63 | 0.74 | 0.69 | 0.70 | 1.12
EDCER | MCP | 0.83 | 0.70 | 0.74 | 0.69 | 0.65 | 1.21
EDCER | L_1/2 | 0.87 | 0.62 | 0.75 | 0.65 | 0.64 | 1.24
EDCER | log-sum | 0.86 | 0.63 | 0.79 | 0.62 | 0.70 | 1.12
BATZD | L_EN | 0.87 | 0.28 | 0.73 | 0.30 | 0.60 | 0.52
BATZD | L_1 | 0.88 | 0.28 | 0.74 | 0.30 | 0.60 | 0.52
BATZD | SCAD | 0.86 | 0.30 | 0.77 | 0.30 | 0.62 | 0.51
BATZD | MCP | 0.88 | 0.27 | 0.83 | 0.29 | 0.64 | 0.50
BATZD | L_1/2 | 0.86 | 0.29 | 0.84 | 0.26 | 0.64 | 0.50
BATZD | log-sum | 0.88 | 0.28 | 0.88 | 0.23 | 0.68 | 0.47
BCL2 | L_EN | 0.75 | 0.57 | 0.51 | 0.53 | 0.61 | 0.67
BCL2 | L_1 | 0.74 | 0.58 | 0.58 | 0.51 | 0.61 | 0.67
BCL2 | SCAD | 0.72 | 0.59 | 0.73 | 0.45 | 0.59 | 0.69
BCL2 | MCP | 0.74 | 0.57 | 0.73 | 0.46 | 0.58 | 0.70
BCL2 | L_1/2 | 0.73 | 0.60 | 0.68 | 0.48 | 0.57 | 0.70
BCL2 | log-sum | 0.68 | 0.64 | 0.75 | 0.43 | 0.65 | 0.63
Table 6. The 9 top-ranked descriptors identified by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum from the GHLI dataset (the common descriptors are emphasized in bold).

Rank | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
1 | JGI7 | JGI7 | Mp | JGI7 | minsCl | ATSC4c
2 | ETA_Eta_B_RC | ETA_Eta_B_RC | MDEC-44 | ATSC4c | ATSC1e | GATS1e
3 | BCUTc-1l | BCUTc-1l | GATS1e | GATS1e | minaaN | ATSC1p
4 | Mv | Mv | ATSC1p | AATS0e | WPOL | MATS8m
5 | ATSC4c | MDEN-23 | GGI9 | meanI | nHdsCH | maxwHBa
6 | MDEN-23 | ATSC4c | maxHBa | nHdsCH | ALogP | maxHBa
7 | GATS1e | GATS1e | maxwHBa | maxHBa | nFG12Ring | ATSC7s
8 | ETA_Epsilon_3 | ETA_Epsilon_4 | MATS8m | ATSC7s | AATS6i | AATS0v
9 | ETA_Epsilon_4 | minHCsatu | SIC1 | ATS4v | AATSC8m | ATS4p
Table 7. The 10 top-ranked descriptors identified by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum from the EDCER dataset (the common descriptors are emphasized in bold).

Rank | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
1 | JGI10 | JGI10 | JGI10 | JGI10 | JGI10 | JGI10
2 | VE2_Dt | VE2_Dt | MATS1i | JGI6 | GATS1c | MATS1c
3 | JGI7 | JGI6 | AATSC2s | AATSC2s | GATS2s | hmax
4 | AATSC8p | AATSC8p | hmax | AATSC8p | hmax | nssO
5 | JGI6 | JGI7 | JGI6 | hmax | GATS5v | piPC6
6 | hmax | hmax | nBase | nHBint2 | nTG12Ring | nFG12HeteroRing
7 | SpMin4_Bhm | SpMin4_Bhm | GATS8p | nHBd | nssO | maxaaCH
8 | GATS5v | GATS5v | nFG12HeteroRing | maxaaCH | maxaaCH | SHBint2
9 | GATS2s | GATS2s | MATS5v | C3SP2 | ETA_Beta_ns_d | TIC1
10 | SpMin5_Bhs | nAcid | maxaaCH | SHBint8 | MDEC-24 | AATSC8m
Table 8. The 8 top-ranked descriptors identified by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum from the BATZD dataset (the common descriptors are emphasized in bold).

Rank | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
1 | JGI4 | JGI4 | VE2_Dze | SpMax1_Bhi | SpMax1_Bhi | SpMax1_Bhi
2 | VE2_Dze | VE2_Dze | JGI3 | MATS5m | GATS1p | GATS1v
3 | MATS5v | ndS | ndS | GATS3s | ndS | GATS3s
4 | SdS | MATS5v | CrippenLogP | C4SP3 | GATS3m | GATS8c
5 | CrippenLogP | CrippenLogP | nHother | CrippenLogP | GATS3s | naaS
6 | mindS | MDEO-22 | minddssS | ALogP | LipoaffinityIndex | AATSC4i
7 | MDEO-22 | nF9Ring | GATS4m | nHother | nHsOH | LipoaffinityIndex
8 | maxdS | ETA_Epsilon_4 | nF9Ring | ATSC8i | ATSC8i | SpDiam_Dzp
Table 9. The 6 top-ranked descriptors identified by L_EN, L_1, SCAD, MCP, L_1/2 and log-sum from the BCL2 dataset (the common descriptors are emphasized in bold).

Rank | L_EN | L_1 | SCAD | MCP | L_1/2 | Log-Sum
1 | JGI7 | AATSC8p | AATSC4s | JGI7 | MATS4s | AATSC8p
2 | VE2_D | MATS4s | IC2 | MATS4s | IC2 | IC2
3 | AATSC8p | MATS5m | MDEN-13 | IC2 | E3m | GATS4s
4 | MATS5m | IC2 | minHsNH2 | E3m | MDEN-13 | maxHBint2
5 | MATS4s | MDEN-13 | maxHBint2 | GATS8p | maxHBint2 | minsOH
6 | IC2 | SpMax1_Bhi | nT8Ring | MDEN-13 | minsOH | SwHBa
Table 10. The detailed information of the descriptors obtained by the log-sum method.

Descriptor Type | Class | Descriptors
Autocorrelation | 2D | AATS0v; AATSC4i; AATSC8m; ATS4p; ATSC1p; ATSC4c; ATSC7s; GATS1e; GATS1v; GATS3s; GATS8c; MATS1c; MATS8m; AATSC8p; GATS4s
Atom-type electrotopological state | 2D | hmax; LipoaffinityIndex; maxaaCH; maxHBa; maxwHBa; naaS; nssO; SHBint2; maxHBint2; minsOH; SwHBa
Barysz matrix | 2D | SpDiam_Dzp
Burden modified eigenvalues | 2D | SpMax1_Bhi
Information content | 2D | TIC1; IC2
Path counts | 2D | piPC6
Ring count | 2D | nFG12HeteroRing
Topological charge | 2D | JGI10
Table 11. The names of the descriptors obtained by the log-sum method.

Descriptor | Name
AATS0v | Average Broto–Moreau autocorrelation-lag 0/weighted by van der Waals volumes
AATSC4i | Average centered Broto–Moreau autocorrelation-lag 4/weighted by first ionization potential
AATSC8m | Average centered Broto–Moreau autocorrelation-lag 8/weighted by mass
ATS4p | Broto–Moreau autocorrelation-lag 4/weighted by polarizabilities
ATSC1p | Centered Broto–Moreau autocorrelation-lag 1/weighted by polarizabilities
ATSC4c | Centered Broto–Moreau autocorrelation-lag 4/weighted by charges
ATSC7s | Centered Broto–Moreau autocorrelation-lag 7/weighted by I-state
GATS1e | Geary autocorrelation-lag 1/weighted by Sanderson electronegativities
GATS1v | Geary autocorrelation-lag 1/weighted by van der Waals volumes
GATS3s | Geary autocorrelation-lag 3/weighted by I-state
GATS8c | Geary autocorrelation-lag 8/weighted by charges
hmax | Maximum H E-state
JGI10 | Mean topological charge index of order 10
LipoaffinityIndex | Lipoaffinity index
MATS1c | Moran autocorrelation-lag 1/weighted by charges
MATS8m | Moran autocorrelation-lag 8/weighted by mass
maxaaCH | Maximum atom-type E-state: :CH:
maxHBa | Maximum E-states for (strong) hydrogen bond acceptors
maxwHBa | Maximum E-states for weak hydrogen bond acceptors
naaS | Count of atom-type E-state: :S:-
nFG12HeteroRing | Number of >12-membered fused rings containing heteroatoms (N, O, P, S or halogens)
nssO | Count of atom-type E-state: -O-
piPC6 | Conventional bond order ID number of order 6 (ln(1 + x))
SHBint2 | Sum of E-state descriptors of strength for potential hydrogen bonds of path length 2
SpDiam_Dzp | Spectral diameter from Barysz matrix/weighted by polarizabilities
SpMax1_Bhi | Largest absolute eigenvalue of Burden modified matrix - n 1/weighted by relative first ionization potential
TIC1 | Total information content index (neighborhood symmetry of 1-order)
SwHBa | Sum of E-states for weak hydrogen bond acceptors
AATSC8p | Average centered Broto–Moreau autocorrelation-lag 8/weighted by polarizabilities
IC2 | Information content index (neighborhood symmetry of 2-order)
GATS4s | Geary autocorrelation-lag 4/weighted by I-state
maxHBint2 | Maximum E-state descriptors of strength for potential hydrogen bonds of path length 2
minsOH | Minimum atom-type E-state: -OH
