LJELSR: A Strengthened Version of JELSR for Feature Selection and Clustering

Feature selection and sample clustering play an important role in bioinformatics. Traditional feature selection methods separate sparse regression and embedding learning. Later, to effectively identify the significant features of the genomic data, Joint Embedding Learning and Sparse Regression (JELSR) is proposed. However, since there are many redundancy and noise values in genomic data, the sparseness of this method is far from enough. In this paper, we propose a strengthened version of JELSR by adding the L1-norm constraint on the regularization term based on a previous model, and call it LJELSR, to further improve the sparseness of the method. Then, we provide a new iterative algorithm to obtain the convergence solution. The experimental results show that our method achieves a state-of-the-art level both in identifying differentially expressed genes and sample clustering on different genomic data compared to previous methods. Additionally, the selected differentially expressed genes may be of great value in medical research.


Introduction
With the emergence of deep sequencing technologies, considerable genomic data have become available. Since genomic data are usually high-dimension small-sample data, that is, the dimension of the gene is large, the dimension of the sample is small, and it is easy to cause interference when performing feature selection and difficult to understand the sample directly [1]. Additionally, a large number of superfluous and extraneous genes are possessed in these genomic data, which severely interfere with the biological processes. As the case stands, only a small minority of genes with biological sense contribute to disease research [2]. Accordingly, how to identify these key genes from the massive high-dimensional genomic data is a hotspot and nodus in research. Furthermore, studies have testified that these key genes are efficaciously extracted by embedding learning [3]. Furthermore, cluster analysis is based on the similarity of each data point to classify the samples or genes, which is helpful for accurate determination of the cancer subtype. Some studies have also demonstrated that embedding learning and sparse regression is good for cluster analysis and feature selection [4,5].
Feature selection is used to pick out k features from m dimensional data (m > k) to optimize a specific index of the system [6]. Initially, some information on characteristic genes is extracted to constitute the data after the reduction to achieve dimensionality reduction. Following this, some genes associated with disease are dredged up from the low-dimensional data on medical research. At present, feature selection is extensively and continually studied owing to the usefulness and practicality of this method. However, traditional feature selection approaches have some issues: (1) The manifold structure is not fully considered, which can reflect the internal geometric structure in data [7]; and (2) they only use the statistical strategy, which can affect the accuracy and reliability of the results. Moreover, procedures of traditional feature selection method are performed independently, such as embedding learning and sparse regression [8,9]. However, a better performance can be achieved by combining two of the above independent procedures. Hou et al. came up with a new method of feature selection via Joint Embedding Learning and Sparse Regression (JELSR), according to the above ideas [5]. This method is a good solution to the above-mentioned issues. JELSR has a good effect on feature selection outstrip of these traditional methods. Nevertheless, there is still a problem thereinafter; since the L 2,1 -norm penalty only sparsely constrains the rows of data, the sparsity of the method is far from satisfactory. However, if the sparseness is not enough, taking too many unrelated genes into account can cause serious errors. Hence, an efficient sparse method should be explored to strengthen the previous method.
LASSO (Least Absolute Shrinkage and Selection Operator) was first proposed by Robert Tibshirani [10]. It constructs an L 1 -norm penalty function to obtain a more refined model and compresses some coefficients to get zero coefficients. Both the L 1 -norm constraint and the L 2,1 -norm constraint can produce sparse effects, but the sparsity of the L 1 -norm constraint is decentralized and the L 2,1 -norm constraint can only produce row sparseness, as shown in Figure 1. However, the combination of the L 1 -norm and the L 2,1 -norm constraints can generate internal row sparsity and enhance the correlation between rows and columns of the matrix [11], as shown in Figure 1, so the occurrence of the redundant and irrelevant genes can be reduced. In this paper, to obtain more sparse effects, we propose a new method by adding an L 1 -norm constraint on the sparse regression matrix based on the JELSR (LJELSR). First, the corresponding graph is constructed to depict the inherent structure of the data. Then, the graph matrix is embedded into the following steps: feature ranking and data regression. Owing to the similarity between data points described by the above constructed graph, the clustering effect is improved to some degree. Finally, to get more sparse effects, we combine embedding learning and sparse regression with L 1 -norm through linear weighting to complete the corresponding feature selection and cluster analysis.
The manifold structure is not fully considered, which can reflect the internal geometric structure in data [7]; and (2) they only use the statistical strategy, which can affect the accuracy and reliability of the results. Moreover, procedures of traditional feature selection method are performed independently, such as embedding learning and sparse regression [8,9]. However, a better performance can be achieved by combining two of the above independent procedures. Hou et al. came up with a new method of feature selection via Joint Embedding Learning and Sparse Regression (JELSR), according to the above ideas [5]. This method is a good solution to the above-mentioned issues. JELSR has a good effect on feature selection outstrip of these traditional methods. Nevertheless, there is still a problem thereinafter; since the L 2,1 -norm penalty only sparsely constrains the rows of data, the sparsity of the method is far from satisfactory. However, if the sparseness is not enough, taking too many unrelated genes into account can cause serious errors. Hence, an efficient sparse method should be explored to strengthen the previous method.
LASSO (Least Absolute Shrinkage and Selection Operator) was first proposed by Robert Tibshirani [10]. It constructs an L 1 -norm penalty function to obtain a more refined model and compresses some coefficients to get zero coefficients. Both the L 1 -norm constraint and the L 2,1 -norm constraint can produce sparse effects, but the sparsity of the L 1 -norm constraint is decentralized and the L 2,1 -norm constraint can only produce row sparseness, as shown in Figure 1. However, the combination of the L 1 -norm and the L 2,1 -norm constraints can generate internal row sparsity and enhance the correlation between rows and columns of the matrix [11], as shown in Figure 1, so the occurrence of the redundant and irrelevant genes can be reduced. In this paper, to obtain more sparse effects, we propose a new method by adding an L 1 -norm constraint on the sparse regression matrix based on the JELSR (LJELSR). First, the corresponding graph is constructed to depict the inherent structure of the data. Then, the graph matrix is embedded into the following steps: feature ranking and data regression. Owing to the similarity between data points described by the above constructed graph, the clustering effect is improved to some degree. Finally, to get more sparse effects, we combine embedding learning and sparse regression with L 1 -norm through linear weighting to complete the corresponding feature selection and cluster analysis. The major merits of our work are shown below: 1． More zero values are produced by adding an L 1 -norm constraint on the sparse regression matrix, such that we get more sparse results. 2． The internal geometric structure of data is preserved along with dimensionality reduction by embedding learning to reduce the occurrence of inaccurate results. 3． Although an exact solution cannot be obtained by our method, we provide a convergent iterative algorithm to get the optimal results.
The other parts of this paper are arranged as listed below. Section 2 details some comparative experiments and the analysis of experimental results from different datasets. Section 3 describes some related materials and presents the methodology of LJELSR. A conclusion of this paper is given in Section 4. The major merits of our work are shown below:

1.
More zero values are produced by adding an L 1 -norm constraint on the sparse regression matrix, such that we get more sparse results.

2.
The internal geometric structure of data is preserved along with dimensionality reduction by embedding learning to reduce the occurrence of inaccurate results.

3.
Although an exact solution cannot be obtained by our method, we provide a convergent iterative algorithm to get the optimal results.
The other parts of this paper are arranged as listed below. Section 2 details some comparative experiments and the analysis of experimental results from different datasets. Section 3 describes some related materials and presents the methodology of LJELSR. A conclusion of this paper is given in Section 4.

Results
The LJELSR, JELSR [5], ReDac [12], and SMART [11] are used to select differentially expressed genes and test the performance of the proposed method for different genomic data. Among them, the JELSR, ReDac, and SMART are used as comparison methods.

Datasets
To validate the effectiveness of our method, the LJELSR, JELSR, ReDac, and SMART methods are run on three datasets, including the ALL_AML, the colon cancer, and the ESCA datasets. The ALL_AML dataset includes acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML) [13], and ALL has also been divided into T cell subtypes and B cell subtypes. The colon cancer dataset is obtained by [14], to facilitate clustering, and the samples are organized into two categories, namely, diseased and normal samples. Additionally, the esophageal carcinoma dataset (ESCA) is downloaded from the TCGA (The Cancer Genome Atlas, TCGA). The TCGA is a publicly available dataset, where it is acquired from https://tcgadata.nci.nih.gov/tcga/. Some details of the three datasets are listed in Table 1.

Evaluation Metrics
In this study, there are two metrics employed to assess all algorithms: p-value and clustering accuracy (ACC) [17]. Firstly, the size of the p-value is closely related to the relationship between the selected genes and the disease, and there is a negative correlation between them. Additionally, the p-value is obtained by the ToppFun tool, which is a public gene list functional enrichment analysis tool. The p-value cutoff is set as 0.01 in the whole experiment. Secondly, the level of ACC indicates the degree of excellence of the algorithm, and there is a positive correlation between them. The value of the ACC is obtained by the following formula: where c i means the label of the clustering and s i is the label of the original data x i . In addition, the value of the δ(x, Y) is 1 if x = y and is 0 otherwise, and map (·) is a mapping function.

Experimental Results and Analysis on ALL_AML Dataset
For the sake of fairness, the LJELSR, JELSR, ReDac, and SMART are respectively used to extract 100 differentially expressed genes from the ALL_AML dataset to analyze their performance. To verify the effectiveness of the algorithm, the selected genes are put into ToppFun to get the p-values and the resulting p-values are arrayed in an ascending sort order. Then, we pick out the first ten terms listed in Table 2. The ten portions in bold font represent the best p-values. It can be seen from Table 2 that the p-values obtained by the LJELSR method are lower than the p-values obtained by the other three methods. Hence, the performance of the LJELSR surpasses the other three methods. To further illustrate the relationship between the selected genes and ALL_AML, the selected differentially expressed genes are put in GeneCards for testing. GeneCards is a synthesis database of human genes, providing relationships between disease, gene expression, gene function, and so on. The top five differentially expressed genes associated with ALL_AML obtained by the LJELSR method are listed in Table 3 and the official names of these genes and related diseases are also listed. As can be seen from Table 3, the official name of the "CD7" is the "CD7 Molecule". CD7 is originally found in T cells of acute lymphoblastic leukemia [18]. It is a membrane glycoprotein of human T lymphocytes and thymocytes. Additionally, it plays an essential role in detecting the interactions of T-cells or B-cells during early lymphoid development. Thence, the loss of CD7 can affect the expression of T-cells, which has a great impact on T-cell leukemia [19]. The "MYB" homologous official name is "MYB Proto-Oncogene, Transcription Factor". From the official name of this gene, MYB is not only a pro-oncogene, but also a factor that affects the transcription of genes. Furthermore, its duplication can cause leukemia [20]. Other genetic analyses are similar to the above analysis. In all, these genes are directly or indirectly related to leukemia. This table only simply displays partial functions of some of these genes. In this section, the analytical approach and procedure of this dataset are the same as the previous dataset. The 100 genes extracted by each method are tested for the Gene Ontology (GO) detection tool-ToppFun. The p-values gained by four methods are arranged in ascending order. We single out the p-values of the first ten items and list them in Table 4. The best p-values are indicated in bold typeface. From Table 4, it is obvious that the p-values obtained by the LJELSR method are smaller than the p-values obtained by the other three methods. Therefore, the performance of our method is a big plus over the other three methods. In addition, the selected differentially expressed genes by the LJELSR method are put in GeneCards for testing. We choose the top five genes associated with colon cancer obtained by the LJELSR method to list in Table 5. Additionally, Table 5 also shows how these genes correspond to the official names and related diseases. From Table 5, the official name of the "ACTB" is "Actin Beta". This gene can encode one of six different actin proteins, and change in the protein brought about by changes in the gene. It can affect certain biological processes. Furthermore, ACTB goes hand in hand with many cancers and plays a major role in lung and colorectal cancer, and so on [21]. Andersen et al. found that colon cancer is affected by ACTB [22]. The "WW Domain Containing Oxidoreductase" is abbreviated as the "WWOX". Different from the traditional tumor suppressor genes, its effect is more complicated and extensive on cellular function. Besides, the expression level of WWOX is different in two different colon cancer cell lines, which are the HT29 and SW480 cell lines [23]. Moreover, the expression of WWOX can lead to apoptosis, while defects in this gene are associated with multiple types of cancer. Analysis of the remaining genes is similar to the above-mentioned genes. Table 5 only shows some descriptions of partial genes associated with colon cancer.

Experimental Results and Analysis on ESCA Dataset
In this subsection, the dataset we use is ESCA, which is different from the above two datasets. To further confirm the effectiveness of the algorithm, the experiment is run with this data and the selected differentially expressed genes are placed in ToppFun for GO analysis. We rank all the p-values in ascending order and choose first ten p-values to list in the Table 6, where the best results are highlighted. From Table 6, we can conclude that the p-values obtained by the LJELSR method are mostly smaller than the p-values obtained by the other three methods. Therefore, on the whole, the performance of the LJELSR outperforms the other three methods. Additionally, the selected differentially expressed genes are put in GeneCards for testing. The top five genes associated with ESCA obtained by the LJELSR method, their name, and related diseases are all displayed in Table 7. From Table 7, the "ERBB2" corresponds to the official name "Erb-B2 Receptor Tyrosine Kinase 2". ERBB2 is the membrane receptor of 185kDa encoded by proto-oncogene ERBB-2, and is one of the members of the epidermal growth factor receptor family. Additionally, it has been verified that its amplification is closely related to the occurrence of esophageal cancer [24]. "KRT5" is the abbreviation of "Keratin 5", and it is a member of the keratin family of momentous gene families that encodes the corresponding protein. Additionally, its changes affect the expression of this gene's families, causing some complex diseases. For example, mutations in KRT5 and KRT14 can cause epidermolysis bullosa to a large extent [25]. The analysis of other genes is similar to the above analysis. Table 7 only shows some functions of partial genes, and detailed information on the remaining genes can be obtained from GeneCards.

Differentially Expressed Genes Comparing by Methods
In this subsection, the selected differentially expressed genes are further analyzed. For the three datasets mentioned above, we explore the common differentially expressed genes and the unique differentially expressed genes obtained by different methods for the same dataset. We select 100 genes for each algorithm and pair them with the officially published disease-causing gene pool to obtain the verified genes. This disease-causing gene pool can be downloaded directly from GeneCards. Table 8 shows the common and unique differentially expressed genes obtained by the four methods for the ESCA dataset. The bold italic indicates the common differentially expressed genes excavated by the four methods. The underlined italic indicates the unique differentially expressed genes excavated by LJELSR, which are not found by other methods. The "Number" means the total number of verified genes. As can be seen from the table, LJELSR has selected more proven genes than other methods; ANXA1, MUC6, FN1, PKM, CD24, GLUL, PLEC, PIGR, ACTB, PABPC1, LYZ, and SPRR1B are the common differentially expressed genes; the unique differentially expressed genes of the LJELSR method are FSCN1, ITGB4, LAMC2, HLA-B, LAMB3, HLA-C, SLC7A5, ENO1, and so on. These genes play an important role in studying the relationship between genes and diseases, the details of these genes can be obtained from GeneCards, and some of the genes have been explained above. In summary, LJELSR has an advantage over other methods in that it can mine more and more valuable differentially expressed genes.

Clustering Analysis
Kmeans is one of simple algorithms that solve the clustering problem, and its procedure is a relatively easy and simple way to classify the samples by setting a certain number of clusters in advance [26]. For all methods, different datasets have different numbers of sample clustering. The Kmeans method is used to cluster samples on the ALL_AML, colon cancer, and ESCA datasets, where their number of sample clustering is three, two, and two, respectively. Different methods are run for different data to get the values of ACC, and the details are shown in Table 8, where the largest values are marked in bold typeface. From Table 9, we can conclude that the values of ACC obtained by the LJELSR method for different datasets are almost all higher than the other methods. In this section, based on the above analysis, we can summarize that the efficiency of our method is higher than other methods.

Related Notations and Definitions
In this paper, m i and m j are defined as the i-th row and j-th column of matrix M, respectively. Denote X ∈ R m×n as the original data matrix, where its row represents a gene (feature) and its column represents a sample. For an arbitrary matrix M, L r,s norm is represented as below: where r and s represent a positive number. When r = s = 1, the above formula becomes the expression of the L 1 norm or LASSO, i.e., Similarly, when r = 2, s = 1, and (2) becomes the expression of the L 2,1 norm, i.e., Additionally, the inherent geometric structure of data is fully taken into account to deal with data for lessening the appearance of the inexact results in actual applications. Considering its benefits, it is also added to our work. Therefore, we firstly construct a q -nearest-neighbor graph G with n vertices in the data space, where a vertex corresponds to a data point [15], and q is the nearest neighbor number. w ij represents the correlation between two data point x j and x j . All w ij make up the weight matrix W. There are many strategies to compute W. However, three strategies are frequently employed [15].
(1) 0-1 weighting: where N q (x i ) indicates the set gained by the way of q -nearest neighbors of the data point x i .
(2) Heat kernel weighting: where σ is a proper constant, and its value is obtained by the previous experience.
(3) Dot-product weighting: These three strategies are applied to different occasions. Since the operation of the 0-1 weighting is relatively uncomplicated, it is commonly used for computing the weight matrix. The second is widely applied to image data, and the third is frequently used in the IR community for processing documents [15].
Next, we define a diagonal matrix D, where its diagonal values are given as d ii = ∑ j w ij . The graph Laplacian matrix L is defined as L = D − W [27].

Joint Embedding Learning and Sparse Regression (JELSR)
Traditional feature selection methods are performed independently. To further heighten the performance of the previous algorithm, the JELSR method was proposed by Hou et al. [5]. Firstly, it uses linear approximation weights and the L 2,1 -norm regularization to combine embedding learning and sparse regression to establish a new objective function. Then, the sparse regression matrix is used to finish the corresponding feature selection [5]. The objective function of the JELSR algorithm is written as follows: min where Y ∈ R k×n is a low dimensional embedding matrix; P ∈ R m×k is the sparse regression matrix; α and β are two balance parameters; and k is the dimension of low-dimensional space.

The Proposed Method
Assuming that the expression level of the gene is within the normal range, the higher the sparseness of the sparse regression matrix, the easier it is to find the differentially expressed genes. What is more, the results of identifying the differentially expressed genes are much more accurate.
Compared with the previous methods, the JELSR method proposed by Hou et al. [5] has a good effect on the feature selection. However, it is inevitable that some redundancy values and artificial noise values in the sparse regression matrix have to be taken into account, such that the sparseness of this method is far from satisfactory. Consequently, it is necessary to discover an efficacious sparse method to improve the performance. Therefore, we propose the LJELSR method, which may improve the sparseness of the algorithm and the accuracy of the results.
The objective function of LJELSR is: where Y ∈ R k×n is a low-dimensional embedding matrix; P ∈ R m×k is the sparse regression matrix; and α 1 , α 2 , and β are three balance parameters.

Optimization
Since there are the L 1 -norm and the L 2,1 -norm constraints on the objective formula, it is difficult to optimize and solve the optimal solution directly. With the elicitation of Wang et al. [11] and Nie et al. [28], the iterative strategy is introduced to solve the above problem. Now, we will explain in detail the specific optimization process of our method.
Before solving the optimization problem, we introduce two diagonal matrices, U ∈ R m×m and U ∈ R m×m , whose i-th diagonal values are defined as follows: U jj = 1 2 p j 2 (11) where U i represents i-th diagonal matrix U; (U i ) jj represents the j-th diagonal value of the i-th matrix U; and U jj indicates the j-th diagonal value of matrix U.
To prevent the emergence of spillover, a small constant ε is added to the diagonal matrices U and U, respectively, that is: Owing to the fact that the partial derivatives of P 1 and tr P T UP on P are identical, P 1 can be replaced by tr P T UP . Analogously, we carry out an exchange for P 2,1 and tr(P T UP). Therefore, (9) can be rewritten as: We firstly optimize the matrix P. We denote L(P, U, U) = P T X − Y 2 2 + α 1 tr P T UP + α 2 tr(P T UP) When two diagonal matrices U and U are fixed, we compute the partial derivative of L(P, U, U) on P and make it equal to zero. Therefore, we can get the following equation: namely, To facilitate optimization, we introduce an auxiliary variable A into the objective formula, and denote A = XX T +α 1 U + α 2 U. According to the above analysis, (17) is brought into (14), such that we will get Since Y is subject to the orthogonal constraint YY T = I k×k , the optimization problem of Y becomes arg min When A and L are fixed, we use the strategy of the eigen-decomposition of the matrix G = L + βI n×n −βX T A −1 X Y T to update Y in (19). For Y i (i = 1, 2, · · · , k), we firstly choose the k smallest eigenvalues of the matrix G, and then seek out the corresponding eigenvectors to constitute a new matrix Y =(Y 1 , Y 2 , · · · , Y k ) [29]. In addition, the diagonal matrices U and U are updated by (12) and (13) when P is fixed, respectively. Additionally, we initialize U and U as an identity matrix, respectively.

Feature Selection
According to the above update rules, the sparse regression matrix P is acquired after repeated iterations of LJELSR. Then, to acquire some differentially expressed genes, we conduct a detailed analysis of the matrix P. In the first place, each element of the matrix P is subjected to absolute processing. Secondly, we sum the absolute values by each row of the sparse regression matrix P and get a new vector, as follows. P = P 1 , P 2 , · · · , P m T (20) Then, we rank the vector P i in descending order to get a new vector, as follows: P = ( P 1 , P 2 , · · · , P m ) T Finally, the genes corresponding to the first l values are selected as differentially expressed genes to analyze their property (l < m). By and large, the value of the element is directly proportional to the importance of the corresponding gene. Next, we put the selected genes into ToppFun and GeneCards to analyze them, where they are publicly accessible from https://toppgene.cchmc.org/enrichment.jsp and http://www.genecards.org/, respectively.
In summary, the procedure of LJELSR is shown in Algorithm 1.

Algorithm 1. Procedure of LJELSR.
Input: Data matrix X; Neighborhood size q; Balance parameters α 1 , α 2 , β ; Dimensionality of embedding k; Feature selection number l. Output: Selected feature index set {P 1 , P 2 , · · · , P m } Stage one: Graph construction Construct the weight matrix W; Compute the diagonal matrix D, graph Laplacian matrix L; Stage two: Alternative optimization, Initialize U = U = I m×m ; Loop Update Y and fix A, L by (19), Update P and fix U, U by (17), Update U, U and fix P by (12) and (13). until convergence Stage three: Feature selection

Convergence Analysis
In this study, an alternative algorithm is utilized to finish iteratively updating work of the proposed method. Now, let us analyze its convergence behavior of LJELSR. The lemma given below was proposed by Nie et al. [28].

Lemma 1.
For any non-zero vectors P, P t ∈ R k , the following inequality holds: The convergence result of the proposed method is explained by the following theorem: Theorem 1. The value of the target function for each iteration is monotonically decreasing in the Algorithm 1. Detailed proof of Theorem 1 is given in the Appendix A.

Conclusions
In this paper, we discuss a new feature selection method by adding an L 1 -norm constraint on the sparse regression matrix based on the JELSR method. Firstly, the four methods are executed to select differentially expressed genes and cluster the samples from ALL_AML, colon cancer, and ESCA datasets, respectively. Secondly, some of materials related to this paper are presented, and our methods and the corresponding optimization strategy are given. Finally, the conclusion is drawn that the performance of the proposed method is better than other methods through the experimental results.