Gene Selection in Cancer Classification Using Sparse Logistic Regression with L1/2 Regularization

In recent years, gene selection for cancer classification based on the expression of a small number of gene biomarkers has been the subject of much research in genetics and molecular biology. The successful identification of gene biomarkers helps in the classification of different types of cancer and improves prediction accuracy. Recently, logistic regression with L1 regularization has been successfully applied to high-dimensional cancer classification, tackling both the estimation of gene coefficients and gene selection simultaneously. However, the L1 penalty yields biased gene selection and does not have the oracle property. To address these problems, we investigate L1/2 regularized logistic regression for gene selection in cancer classification. Experimental results on three DNA microarray datasets demonstrate that our proposed method outperforms other commonly used sparse methods (L1 and LEN) in terms of classification performance.


Introduction
With the development of DNA microarray technology, biological researchers can simultaneously study the expression levels of thousands of genes [1,2]. Cancer classification based on gene expression levels is one of the most active topics in genome research, since expression profiles differ between conditions (e.g., normal and tumor tissue) [3,4]. However, cancer classification using DNA microarray data is challenging because of the data's high dimension and small sample size [5]. Typically, the number of genes is in the thousands while there are only around a hundred or fewer tissue samples, and so gene selection has recently emerged as an important technique for cancer classification [6]. Gene selection is applied because only a small subset of genes is strongly indicative of a targeted disease. From the biological perspective, effective gene selection methods help to classify different types of cancer and improve the accuracy of prediction [7][8][9].
Many gene selection methods have been proposed to select a subset of meaningful and important genes that achieves high cancer classification performance. Recently, there has been growing interest in applying regularization techniques to gene selection. Regularization methods are an important embedded technique [10][11][12][13]. From the statistical perspective, regularization prevents over-fitting. Many statistical methods have been successfully applied to cancer classification. Among them, logistic regression [14][15][16][17] is a powerful discriminative method with a direct probabilistic interpretation, yielding classification probabilities in addition to class labels. However, plain logistic regression is not suitable for the high-dimensional, small-sample-size setting because the design matrix is singular, so the Newton-Raphson method cannot be applied. Regularized logistic regression has therefore been successfully applied to cancer classification with high-dimensional, small-sample-size data [7,8]. It improves classification accuracy by shrinking the regression coefficients and selecting a small subset of genes. Different regularization terms can be used. The most popular is the L1 penalty, which gives the least absolute shrinkage and selection operator (lasso) [18]. There are also several variants of L1, such as the smoothly clipped absolute deviation (SCAD) [19], the minimax concave penalty (MCP) [20], the group lasso [21], and so on. The L1 regularization can set some genes' coefficients exactly to zero, performing variable selection, and has therefore been widely applied to data with high dimension and small sample size.
Although the L1 penalty is well known, it has some limitations [22]. The L1 regularization does not have the oracle property [19]: with an oracle procedure, the probability of selecting the right set of genes (those with nonzero coefficients) converges to one, and the estimators of the nonzero coefficients are asymptotically normal with the same means and covariances as if the zero coefficients were known in advance. Besides, genes in DNA microarray data exhibit a grouping structure. To address the grouping property, Zou and Hastie proposed the elastic net penalty (LEN) [23], a linear combination of the L1 and L2 penalties. In addition, L1 regularization does not yield the sparsest solutions. To overcome this limitation, Xu et al. proposed the L1/2 penalty, which can be taken as a representative of the Lq (0 < q < 1) penalties in terms of both sparsity and computational efficiency, and which has many attractive properties, such as unbiasedness and the oracle property [24][25][26]. Therefore, we investigated L1/2 regularized logistic regression for gene selection in cancer classification. The approach is suitable for DNA microarray data with high dimension and small sample size. To evaluate its effectiveness, three public datasets were used for cancer classification, and we compared other commonly used sparse methods (L1 and LEN) with our method.
The contributions of our research are summarized as follows:

• The successful identification of gene biomarkers helps to classify different types of cancer and improves prediction accuracy.

• L1/2 penalized logistic regression is used as a gene selection method for cancer classification to overcome the over-fitting problem with high-dimensional, small-sample-size data.

• Experimental results on three GEO lung cancer datasets corroborate our ideas and demonstrate the correctness and effectiveness of L1/2 penalized logistic regression.

Regularized Logistic Regression
In this paper, we consider a general binary classification problem with a predictor vector X and a response variable y, consisting of genes and the corresponding tissue samples, respectively. Suppose we have n samples, D = {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where X_i = (x_i1, x_i2, ..., x_ip) is the ith input pattern with dimensionality p; that is, X_i has p genes and x_ij denotes the value of gene j for the ith sample. The response y_i takes a value of 0 or 1. Define the classifier f(x) = e^x/(1 + e^x); the logistic regression model is then

P(y_i = 1 | X_i) = f(X_i β) = exp(X_i β) / (1 + exp(X_i β)).    (1)

The log-likelihood can be expressed as

log L(β) = ∑_{i=1}^{n} [ y_i log f(X_i β) + (1 − y_i) log(1 − f(X_i β)) ].    (2)

The vector β can be estimated by maximizing Equation (2). However, solving Equation (2) directly leads to over-fitting on data of high dimension and small sample size. Therefore, in order to address this problem, we add a regularization term to Equation (2):

β̂ = argmin_β { l(β) + λ p(β) },    (3)

where l(β) = −log L(β) and p(β) are the loss function and the penalty function, respectively, and λ > 0 is a tuning parameter. Note that p(β) = ∑_j |β_j|^q. When q = 1, this is the L1 penalty; there are also several variants of L1, such as SCAD, MCP, the group lasso, and so on. Adding the L1 regularization to Equation (2) gives

β̂ = argmin_β { l(β) + λ ∑_{j=1}^{p} |β_j| }.    (4)

From a biologist's point of view, there is a grouping property among genes, which the L1 regularization does not capture. To overcome this limitation, Zou and Hastie proposed the elastic net (LEN) regularization for gene selection. LEN combines L1 with L2 in order to select highly correlated genes and perform gene selection simultaneously. The regularized logistic regression using LEN is

β̂ = argmin_β { l(β) + λ_1 ∑_{j=1}^{p} |β_j| + λ_2 ∑_{j=1}^{p} β_j² }.    (5)

As Equation (5) shows, λ_1 and λ_2 control the sparsity and the grouping effect, respectively, and the coefficients β depend on the two non-negative tuning parameters λ_1 and λ_2. In order to simplify Equation (5), let λ_1 + λ_2 = 1 and set α = λ_1. Thus, we can rewrite Equation (5) as

β̂ = argmin_β { l(β) + λ ∑_{j=1}^{p} [ α|β_j| + (1 − α) β_j² ] }.    (6)
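As a concrete illustration of the L1-penalized formulation in Equation (4), the following is a minimal sketch using scikit-learn on synthetic high-dimensional data (the data shapes, coefficient values, and the choice C = 0.1 are illustrative assumptions, not the paper's actual datasets or settings; scikit-learn's C is the inverse of the tuning parameter λ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy high-dimensional problem: 100 samples, 1000 "genes",
# only the first 5 genes carry signal (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
beta_true = np.zeros(1000)
beta_true[:5] = 2.0
p = 1 / (1 + np.exp(-(X @ beta_true)))   # f(X beta) from Equation (1)
y = (rng.random(100) < p).astype(int)

# L1-penalized logistic regression; C plays the role of 1/lambda.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

# Most coefficients are shrunk exactly to zero, i.e., most genes
# are discarded by the penalty.
n_selected = np.count_nonzero(model.coef_)
print(n_selected)
```

The sparsity of the fitted model (n_selected far below 1000) is exactly the gene selection behavior the regularization term provides.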

L 1/2 Regularized Logistic Regression
Despite the advantages of L1 and LEN, they have some limitations: both yield biased gene selection, and neither has the oracle property. Theoretically, the Lq-type regularization p(β) = ∑_j |β_j|^q with smaller q leads to sparser solutions, but convergence becomes difficult as q approaches zero. Therefore, Xu et al. proposed L1/2 regularization. For 1/2 < q < 1, the Lq penalty is less sparse than L1/2, while for 0 < q < 1/2 it brings no significant gain in sparsity over L1/2 but is harder to solve. Thus, the L1/2 regularization can be taken as a representative of the Lq (0 < q < 1) regularizations. The L1/2 regularized logistic regression is

β̂ = argmin_β { l(β) + λ ∑_{j=1}^{p} |β_j|^{1/2} },    (7)

and the value of β is obtained by solving Equation (7). In this paper, we apply the coordinate descent algorithm to Equation (7). It is a "one-at-a-time" algorithm: it solves for β_j while the other coordinates β_k (k ≠ j) are held fixed [7,8]. As before, the samples are D = {(X_1, y_1), ..., (X_n, y_n)}, where y_i = 0 indicates that the ith sample is in Class 1 and y_i = 1 indicates that it is in Class 2. Inspired by Friedman et al. [27], Xu et al. [26], and Xia et al. [28], the univariate half thresholding operator for an L1/2-penalized logistic regression coefficient is

β_j = Half(w_j, λ) = (2/3) w_j ( 1 + cos( (2/3)(π − φ_λ(w_j)) ) )  if |w_j| > (3/4) λ^{2/3},  and 0 otherwise,    (8)

where φ_λ(w) = arccos( (λ/8)(|w|/3)^{−3/2} ) and w_j is the univariate least-squares solution for the jth coordinate. Besides, the univariate thresholding operator of the coordinate descent algorithm for the LEN regularization can be defined as

β_j = S(w_j, λα) / (1 + λ(1 − α)),    (9)

where S(w_j, λα) is the soft thresholding operator, which reduces to the L1 case when α = 1:

S(w, λ) = sign(w)(|w| − λ)_+ = w − λ if w > λ,  w + λ if w < −λ,  and 0 otherwise.    (10)

Inspired by Reference [7], the loss in Equation (7) is linearized by a one-term Taylor series expansion around the current estimate β̃:

l(β) ≈ (1/2n) ∑_{i=1}^{n} ω_i (Z_i − X_i β)²,    (11)

where ω_i = f(X_i β̃)(1 − f(X_i β̃)) and Z_i = X_i β̃ + (y_i − f(X_i β̃))/ω_i. Redefine the partial residual for fitting β_j as Z_i − ∑_{k≠j} x_ik β̃_k. A pseudocode of the coordinate descent algorithm for L1/2 penalized logistic regression is described in Algorithm 1 [7].
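The two thresholding operators can be sketched directly in code. This is a reconstruction following the formulas above (the threshold 3/4·λ^(2/3) and the arccos form follow the coordinate-descent papers of Xu et al.; the sample values w = 0.1 and w = 5.0 are arbitrary illustrations):

```python
import math

def half_threshold(w, lam):
    """Univariate half thresholding operator for the L1/2 penalty.

    Returns 0 when |w| is below the threshold (3/4) * lam**(2/3);
    otherwise applies the cosine-form shrinkage of Xu et al.
    """
    if abs(w) <= (3.0 / 4.0) * lam ** (2.0 / 3.0):
        return 0.0
    phi = math.acos((lam / 8.0) * (abs(w) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * w * (1.0 + math.cos((2.0 / 3.0) * (math.pi - phi)))

def soft_threshold(w, lam):
    """Soft thresholding operator used for the L1 penalty."""
    return math.copysign(max(abs(w) - lam, 0.0), w)

# Small coefficients are set exactly to zero by both operators...
print(half_threshold(0.1, 1.0))  # 0.0
print(soft_threshold(0.1, 1.0))  # 0.0
# ...but large coefficients are shrunk far less by half thresholding,
# which is the source of its reduced estimation bias.
print(half_threshold(5.0, 1.0), soft_threshold(5.0, 1.0))
```

The last line illustrates the unbiasedness argument: soft thresholding subtracts the full λ from every surviving coefficient, whereas the half thresholding operator leaves large coefficients nearly untouched.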

Classification Evaluation Criteria
In order to evaluate the cancer classification performance of the proposed method, accuracy, sensitivity, and specificity were computed on the three public DNA microarray datasets. They are defined as follows [29]:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Sensitivity = TP / (TP + FN),
Specificity = TN / (TN + FP),

where TP refers to true positives, TN refers to true negatives, FP refers to false positives, and FN refers to false negatives.
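The three criteria above translate directly into code; a minimal sketch (the confusion-matrix counts are hypothetical numbers for illustration only):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, and specificity from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical confusion-matrix counts for illustration.
acc, sen, spe = classification_metrics(tp=45, tn=40, fp=5, fn=10)
print(round(acc, 2), round(sen, 2), round(spe, 2))  # 0.85 0.82 0.89
```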

Datasets
In this section, three public gene expression datasets were obtained online from the Gene Expression Omnibus (GEO): GSE10072 [30], GSE19804 [31], and GSE4115 [32]. A brief description of these datasets is given in Table 1.

GSE10072
The dataset is provided by the National Cancer Institute (NCI). There are 107 samples, of which 58 are lung tumor and the other 49 are normal lung. Each sample contains 22,283 genes.

GSE19804
We obtained this dataset online. After preprocessing, we used 120 samples as model input, consisting of 60 lung cancer and 60 normal lung samples, each with 54,675 genes.

GSE4115
This cancer dataset is from the Boston University Medical Center. After preprocessing, the number of lung cancer and normal lung samples was 97 and 90, respectively. Each sample contained 22,215 genes.

Results
In this section, two methods are compared with our proposed method: LEN and L1. To evaluate the prediction accuracy of the three logistic regression models, we first randomly partitioned the samples into training samples (70%) and testing samples (30%). Detailed information on the three publicly available datasets used in the experiments is shown in Table 2. Secondly, to obtain the tuning parameter λ, we applied 5-fold cross validation to the training set. Thirdly, the reported classification evaluation criteria are averages over 50 runs. Table 3 shows the results on the training and testing sets obtained by L1, LEN, and L1/2. The results obtained by L1/2 were better than those of L1 and LEN. For example, on the training set of dataset GSE10072, the sensitivity, specificity, and accuracy of L1/2 were the same as for L1, while the sensitivity and accuracy of LEN were 0.98 and 0.99, lower than those of L1/2. On the testing set of dataset GSE4115, L1/2 and LEN ranked first and second, respectively, with L1 last; for instance, the accuracy of L1/2 was 0.80, higher than the 0.77 and 0.78 of L1 and LEN, respectively. Moreover, L1/2 was sparser than L1 and LEN: as shown in Figure 1, in dataset GSE19804, the number of genes selected by L1/2 was 8, lower than the respective 33 and 82 of L1 and LEN. In summary, L1/2 was superior to L1 and LEN. Table 3. Mean results of empirical datasets. The results of our proposed method are given in bold.

(Table 3 layout: for each dataset, Sensitivity, Specificity, and Accuracy are reported for the training set (5-fold CV) and for the testing set.)

In order to search for the common gene signatures selected by the different methods, we used VENNY software (2.1.0, Centro Nacional de Biotecnología, Madrid, Spain, 2015) [33] to generate Venn diagrams. As shown in Figure 2, we considered the common gene signatures selected by the logistic regression model with the L1, LEN, and L1/2 regularization methods, which are the most relevant signatures of lung cancer. In total, 2, 3, and 2 common genes were found across these methods for the different datasets. Table 4 shows the genes selected by L1/2. At the beginning of the experiments, genes were identified by probe set ID; we transformed probe set IDs to gene symbols using the software DAVID 6.8 [34]. The data distributions for the selected genes are displayed in Figures 3-5. Inspecting the figures, we find that some genes facilitated the classification of lung tumor and normal lung, such as FAM107A, KDELR2, AASS, and SFRP1 for dataset GSE10072, and SOCS2 and EHD2 for dataset GSE19804. In addition, we found that EGFR [35,36] was a common gene across the three datasets under L1/2. However, because the data distributions differ across datasets, we cannot use the gene EGFR alone to classify different types of cancer and improve the prediction accuracy. Furthermore, the literature indicates that never-smokers with adenocarcinoma have the highest incidence of EGFR, HER2, ALK, RET, and ROS1 mutations [37]. Therefore, our proposed L1/2 is an effective technique for gene selection and classification.
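The evaluation protocol described at the start of this section (a random 70/30 partition, then 5-fold cross-validation on the training set to choose the tuning parameter) can be sketched as follows. This is a hedged illustration on synthetic data: the matrix shapes, random labels, and the grid of C values (C = 1/λ in scikit-learn's parameterization) are assumptions, and an L1 penalty stands in since the L1/2 penalty is not available in scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a gene expression matrix (the real GEO data
# are not reproduced here).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 500))
y = rng.integers(0, 2, size=120)

# Step 1: random 70/30 partition into training and testing samples.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 2: 5-fold cross-validation on the training set to choose the
# tuning parameter (here via the inverse-regularization grid for C).
grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X_tr, y_tr)

# Step 3: evaluate the selected model on the held-out testing samples.
print(grid.best_params_, grid.score(X_te, y_te))
```

In the paper's experiments this whole procedure is repeated 50 times and the evaluation criteria are averaged over the runs.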

Conclusions
In cancer classification with data of high dimension and small sample size, only a small number of genes strongly indicate specific diseases. Therefore, gene selection is widely used in cancer classification. In particular, regularization methods have the capacity to select a small subset of meaningful and important genes. In this study, we applied L1/2 regularization to a logistic regression model to perform gene selection. Additionally, during the updating of the estimated coefficients, the proposed method utilizes a novel univariate half thresholding operator.
Experimental results on three cancer datasets demonstrated that our proposed method outperformed the other commonly used sparse methods (L1 and LEN) in terms of classification performance, while fewer but more informative genes were selected, especially the gene EGFR. Therefore, L1/2 regularization is a promising tool for feature selection in classification problems.