Co-differential Gene Selection and Clustering Based on Graph Regularized Multi-View NMF in Cancer Genomic Data

Cancer genomic data contain views from different sources that provide complementary information about genetic activity. This provides a new way for cancer research. Feature selection and multi-view clustering are hot topics in bioinformatics, and they can make full use of complementary information to improve the effect. In this paper, a novel integrated model called Multi-view Non-negative Matrix Factorization (MvNMF) is proposed for the selection of common differential genes (co-differential genes) and multi-view clustering. In order to encode the geometric information in the multi-view genomic data, graph regularized MvNMF (GMvNMF) is further proposed by applying the graph regularization constraint in the objective function. GMvNMF can not only obtain the potential shared feature structure and shared cluster group structure, but also capture the manifold structure of multi-view data. The validity of the proposed GMvNMF method was tested in four multi-view genomic data. Experimental results showed that the GMvNMF method has better performance than other representative methods.


Introduction
With the rapid development of gene sequencing technology, a large number of multi-view data have been generated. In essence, multi-view data are insightful and have multiple levels of genetic activity information. Exploring this information will provide us with an unprecedented opportunity to discover the molecular mechanisms of cancer [1]. The Cancer Genome Atlas (TCGA) is the largest genome-based platform. And it provides a large number of different types of omics data. In this paper, we use gene expression (GE), copy number variation (CNV), and methylation (ME) data of four cancers in the TCGA database. They are mutually dependent on each other [2].
In the field of bioinformatics, feature selection and clustering are two important ways to explore genomic data [3,4]. To some extent, feature selection can reduce computational complexity and also find differentially expressed genes associated with cancer. It promotes cancer research at the molecular level of genes. Multi-view clustering is the division of samples or genes in a multi-view dataset into several subsets based on their potential group structure; the samples or genes in the same subset have similarities.
With the advent of the big data era, data are no longer single-view but multi-view data composed of different sources. The information of multiple views in multi-view data is complementary, and it 1.
In order to effectively cluster and select features for multi-view data at the same time, a novel integrated model called MvNMF is proposed. In the MvNMF framework, the shared basis matrix can reconstruct the potential cluster group structure, which contributed to the improvement of clustering performance. The selection of the co-differential genes can be performed because the shared coefficient matrix can recover the common feature pattern from different views. 2.
The graph regularization was applied to the objective function to form the GMvNMF method, which ensured that GMvNMF can capture the manifold structure of the multi-view data. This makes sense for the performance improvement of the integrated model.

3.
Scientific and rational experiments were designed on the cancer genomic data to illustrate the validity of the GMvNMF method and achieve satisfactory results.
In what follows, jNMF and its representative variants, graph regularization are reviewed in Section 2; a detailed description of the proposed GMvNMF method is also included. The results of multi-view clustering and co-differential gene selection are presented in Section 3. Finally, the paper is concluded in Section 4.

Joint Non-Negative Matrix Factorization and Representative Variants
Joint NMF (jNMF) [2] is a popular matrix decomposition algorithm. In the field of bioinformatics, each type of genomic data can be represented as an original matrix. The row of the matrix represents the sample, and the column represents the expression level of genomic data.
Given d different types of non-negative matrices, the goal of jNMF is to find that the product of a shared basis matrix W n×k and the corresponding coefficient matrix (H I ) k×m I is similar to the input data matrix (X I ) n×m I , i.e., X I ≈ WH I (I = 1, 2, . . . , d). n represents the number of rows of the input data matrix. The value of k means the degree of dimensionality reduction of data. m I represents the number of columns of the i-th input data matrix. Further, the objective function of jNMF can be expressed as: where · F denotes the Frobenius norm of the matrix. The shared basis matrix can reflect the sharing pattern of multi-view data matrices from different sources [11]. It is obvious that jNMF is equivalent to NMF when the value of d is 1. In other words, jNMF is a flexible and clever NMF extension model for the integration of multi-view data. Then, the updated rules are as follows: where W il and H lj refer to specific elements in the matrices W and H I . jNMF is effective in finding homogenous effects of data from different sources, however, it does not consider the effects of heterogeneous noise between multi-view data. Therefore, Yang et al. [8] proposed an integrative NMF (iNMF) model. Specifically, the objective function of iNMF was defined as follows: where λ represents a balance parameter. WH I means taking into account the homogenous effect, and V I H I means considering the heterogeneous effect. In other words, V I H I can be used as an approximation of the heterogeneous effect.
In order to obtain non-overlapping and sparse solutions, Stražar et al. [9] proposed integrative orthogonality-regularized non-negative matrix factorization (iONMF) by applying orthogonal regularization in the jNMF framework. Thus, its cost function can be written as: where I represents a unit matrix, and α is a trade-off parameter. iONMF exhibits better performance in predicting protein-RNA interactions.

Graph Regularization
The maturity of manifold learning theory made people pay more attention to the internal geometric structure in the original data. The basic idea of graph regularization is to reconstruct the low-dimensional manifold structure embedded in high dimensional sample space. That is to say, adjacent sample points in the high dimensional sample space should be as close as possible in the low dimensional space [12]. If each sample point is used as the vertices of the graph to construct a K-nearest neighbor graph, then a symmetric matrix E is obtained [13]. E ij represents the weight of the edge connecting vertex i and vertex j. Thus, the degree of proximity between the vertices can be measured using E ij . The definition of E ij can be shown as follows: where Mathematically, the graph regularization can be formulated as follows: where D is a diagonal matrix and the elements on the diagonal are composed of the sum of the rows or columns of E. s i and s j are the low dimensional representation of x i and x j , respectively. tr(·) is the trace of the matrix. The matrix V represents the coefficient matrix produced by the decomposition of NMF. Finally, L = D − E is a graph Laplacian matrix [14].

Graph Regularized Multi-View Non-Negative Matrix Factorization
It is well known that cancer genomic datasets contain many types of data. In order to effectively utilize the information of multiple views, we proposed the MvNMF model and further improved it to get GMvNMF. The GMvNMF algorithm is introduced in detail; the proposed algorithm is given in Algorithm 1.

Objective Function
jNMF is a good integration model that can fully explore potential shared structures in multiple views [2]. However, its flexibility is not sufficient to explore multi-view clustering and selecting co-differentially expressed genes at the same time. For multi-view clustering, the same sample points in different views are likely to be grouped together. Therefore, we required the basis matrix to exhibit the potential cluster structure that is shared by different views. For the selection of co-differential genes, the expression of the same gene in different views should be considered comprehensively. Therefore, we required that the coefficient matrix reflected the shared feature structure from different views.
In view of the above requirements, a model called MvNMF was designed. It can simultaneously perform multi-view clustering and selection of co-differentially expressed genes. (X I ) n×m I can be approximated by WU I V (I = 1, 2, . . . , d). Specifically, the objective function of MvNMF can be formulated as optimization problem: where W n×k is the shared basis matrix, (U I ) k×r is the subspace transformation matrix, and V r×m is the shared coefficient matrix. n is the number of samples in the dataset. m is the number of features in the dataset. k denotes dimensionality reduction and r denotes the rank of the matrix. From the objective function we can see that the H I of jNMF can be approximated by U I V.
We further considered the low-dimensional manifold structure embedded in the high-dimensional multi-view data space. Thus, the GMvNMF model was obtained by combining graph regularization and MvNMF.
Its objective function can be written as the following minimization problem: where λ I ≥ 0 is the balance parameter that controls the Laplacian regularization. It is worth mentioning that the different values of λ I represent the heterogeneity of multi-view data. If λ I = 0, GMvNMF will be simplified to MvNMF. In other words, MvNMF is a special case of GMvNMF. Therefore, the following section only shows the optimization algorithm of GMvNMF.

Optimization of GMvNMF
The Equation (9) can be rewritten as: The multiplicative iterative method was used to solve the optimization problem in Equation (10). Then the Lagrangian function f was constructed as follows: where ψ = [ψ il ], ϕ I = [ϕ la ] I and µ = µ aj are Lagrange multipliers that constrain W ≥ 0, U I ≥ 0 and V ≥ 0, respectively. i, l, a and j represent the subscripts of the elements in the matrix. We separately derived the partial derivatives of W, U I and V of the Lagrangian function as follows: It is well known that Karush-Kuhn-Tucher (KKT) conditions [15] can be applied to solve an optimization problem with inequality constraints. By using the KKT conditions ψW = 0, ϕ I U I = 0 and µV = 0, we can get the following update rules: Finally, we summarize the iterative process of the proposed GMvNMF model in Algorithm 1.  (17); Until convergence

Results
We performed experiments with multi-view clustering and selection of co-differentially expressed genes to verify the effectiveness of the proposed method. In addition, we used jNMF [2], iNMF [8] and iONMF [9] as comparison methods. Detailed information on the experimental settings and results are shown in the following section.

Datasets
The Cancer Genome Atlas (TCGA) program intends to analyze the genomic variation map of cancer by using high-throughput sequencing technology [16]. As the largest cancer genome database, TCGA contains a lot of valuable and incredible information. An in-depth study of this information can help us understand, prevent, and treat cancer. In this paper, we used four multi-view datasets to analyze the performance of the proposed method. These datasets included pancreatic adenocarcinoma (PAAD), esophageal carcinoma (ESCA), colon adenocarcinoma (COAD), and head and neck squamous cell carcinoma (HNSC). Each cancer dataset contained three different types of data, such as GE, CNV, and ME. All of the above data can be downloaded from the TCGA (https://tcgadata.nci.nih.gov/tcga/). In the experiment, we performed preprocessing on the data. First, principal component analysis (PCA) was used to reduce dimensionality and remove redundant information and noise on the data. Then, the data matrix was normalized such that each row of the matrix was distributed between 0 and 1. More descriptions of multi-view datasets are summarized in Table 1.

Parameter Setting
In the MvNMF and GMvNMF methods, we needed to choose parameters such as k, r and λ I . The values of k and r determined the size of the shared basis matrix, the subspace transformation matrix and the shared coefficient matrix. Thus, choosing a reasonable parameter value will promote the experimental results. Since the value of k means the degree of dimensionality reduction of data, it had a significant impact on the experiment. r is the rank of the matrix. If the value of r is more appropriate, then a better genetic selection result will be obtained. In other words, MvNMF and GMvNMF were sensitive to the choice of k and r. The graph regularization parameters λ I controlled the extent to which the internal geometric structure of the original data was preserved. In addition, λ I reflected the heterogeneity of data from different sources in multi-view data.
In the experiment, we empirically set λ I corresponding to different views in a multi-view data to the same value [17]. For convenience, we used the grid search algorithm to select the optimal value of the parameter. When k, r, and λ I were selected in the interval [1,50], [1, k − 1], and [1, 100,000], respectively, MvNMF and GMvNMF achieved the best performance. The specific conditions of the selected parameters can be seen from the following figures. It should be noted that, as we can see from Figure 1a, when k = 2, MvNMF had a higher accuracy in the HNSC dataset. That is to say, the value of r in MvNMF can only be 1 on the HNSC dataset. Therefore, it is not shown in Figure 1c. In summary, we can select the optimal parameters through Figures 1 and 2. transformation matrix and the shared coefficient matrix. Thus, choosing a reasonable parameter value will promote the experimental results. Since the value of k means the degree of dimensionality reduction of data, it had a significant impact on the experiment. r is the rank of the matrix. If the value of r is more appropriate, then a better genetic selection result will be obtained. In other words, MvNMF and GMvNMF were sensitive to the choice of k and r . The graph regularization parameters I  controlled the extent to which the internal geometric structure of the original data was preserved. In addition, I  reflected the heterogeneity of data from different sources in multi-view data.
In the experiment, we empirically set I  corresponding to different views in a multi-view data to the same value [17]. For convenience, we used the grid search algorithm to select the optimal value of the parameter. When k , r , and I  were selected in the interval [1,50], [1, k − 1], and [1, 100,000], respectively, MvNMF and GMvNMF achieved the best performance. The specific conditions of the selected parameters can be seen from the following figures. It should be noted that, as we can see from Figure 1a, when 2 k  , MvNMF had a higher accuracy in the HNSC dataset. That is to say, the value of r in MvNMF can only be 1 on the HNSC dataset. Therefore, it is not shown in Figure 1c. In summary, we can select the optimal parameters through Figures 1 and 2. is the clustering performance of GMvNMF on PAAD, HNSC, ESCA and COAD about  .

Convergence and Computational Time Analysis
We iterated the updated rules of MvNMF and GMvNMF to approximate the local optimal solution of the objective function. In Figure 3, the convergence curves for the five methods are given

Convergence and Computational Time Analysis
We iterated the updated rules of MvNMF and GMvNMF to approximate the local optimal solution of the objective function. In Figure 3, the convergence curves for the five methods are given (to save space, only the convergence curves on the ESCA dataset are shown). These five methods consisted of jNMF, iNMF, iONMF, MvNMF, and GMvNMF. It can be observed from Figure 3 that these five methods converged in 100 iterations. The error value is the loss function value. Additionally, the convergence criterion was that the error value tended to zero. Since MvNMF and GMvNMF had smaller error values than other methods, the convergence of MvNMF and GMvNMF is better. is the clustering performance of GMvNMF on PAAD, HNSC, ESCA and COAD about  .

Convergence and Computational Time Analysis
We iterated the updated rules of MvNMF and GMvNMF to approximate the local optimal solution of the objective function. In Figure 3, the convergence curves for the five methods are given (to save space, only the convergence curves on the ESCA dataset are shown). These five methods consisted of jNMF, iNMF, iONMF, MvNMF, and GMvNMF. It can be observed from Figure 3 that these five methods converged in 100 iterations. The error value is the loss function value. Additionally, the convergence criterion was that the error value tended to zero. Since MvNMF and GMvNMF had smaller error values than other methods, the convergence of MvNMF and GMvNMF is better. In addition, we compared the execution time of these five algorithms. Experiments were executed on a PC with 3.50 GHz Intel(R) (Santa Clara, CA, USA) Xeon(R) CPU and 16G RAM. In the experiment, each method was repeated 10 times. The mean and variance were calculated. The statistics of the computational time are listed in Table 2. As can be seen from Table 2, all five methods had satisfactory running times. Because iNMF takes into account the heterogeneous effect, its computational time was affected. IONMF imposed orthogonal constraints, thus, it had the longest log(Error Value) In addition, we compared the execution time of these five algorithms. Experiments were executed on a PC with 3.50 GHz Intel(R) (Santa Clara, CA, USA) Xeon(R) CPU and 16G RAM. In the experiment, each method was repeated 10 times. The mean and variance were calculated. The statistics of the computational time are listed in Table 2. As can be seen from Table 2, all five methods had satisfactory running times. Because iNMF takes into account the heterogeneous effect, its computational time was affected. IONMF imposed orthogonal constraints, thus, it had the longest running time. MvNMF and GMvNMF had lower running times. This is because the experimental results showed that the decomposed matrix of our proposed method had better sparsity and lower rank than the matrix after jNMF decomposition.

Clustering Results
We performed clustering experiments on four multi-view datasets to verify the effectiveness of the proposed method. Multi-view clustering was performed on the shared basis matrix using the K-means algorithm.

Evaluation Metrics
In order to strictly analyze the performance of multi-view clustering, we adopted multiple measures, including accuracy (AC), recall, precision, and F-measure [18,19]. The AC is defined as follows: where n is the number of samples contained in the dataset, r i is the clustering label obtained using the clustering algorithm, and s i is the real data label. map(r i ) is a permutation mapping function that maps clustering labels to real data labels. The real clusters refer to the known sample labels. In addition, δ(x, y) is a delta function. Recall, precision, and F-measure are a set of metrics that are widely used in clustering applications. Recall can also be called sensitivity. True-positive (TP) indicates that two data points of the same cluster are divided into the same cluster. True-negative (TN) means that two data points of the same cluster are divided into different clusters. False-positive (FP) indicates that data points of two different clusters are divided into the same cluster. False-negative (FN) refers to two data points in different clusters divided into different clusters. Recall, precision and F-measure are defined as follows: where F-measure is a comprehensive evaluation indicator that takes into account recall and precision.

Multi-View Clustering Results
In the experiment, each algorithm is executed 50 times to reduce the impact of random initialization on multi-view clustering results. The mean and variance of performance on each multi-view data are recorded in Table 3. According to Table 3, we can draw the following conclusions: 1.
The clustering performance of jNMF on PAAD and COAD datasets was better than iNMF, iONMF, and MvNMF. This demonstrates that improvements to the traditional NMF integration model may result in the loss of useful information, which in turn affected the clustering results. However, in the ESCA and HNSC datasets, MvNMF outperformed jNMF, iNMF, and iONMF from the overall perspective of the evaluation metrics. This shows the validity of our proposed MvNMF model, which better preserved the complementary information between multiple views. 2.
From Table 3, we can see that the precision of the GMvNMF method and the precision of the MvNMF method were similar in the four multi-view datasets. However, the GMvNMF method was at least about 18, 32 and 20% higher than the MvNMF method in terms of AC, recall, and F-measure. Therefore, the GMvNMF method had better clustering performance. This shows that it is necessary to consider the manifold structure that exists in multi-view data.

3.
Taking the four multi-view datasets in Table 3 as a whole, the proposed GMvNMF method had the best clustering performance. GMvNMF outperformed other methods by about 23, 39, 0.67, and 25%, with respect to the average values of the metrics AC, recall, precision, and F-measure. Therefore, GMvNMF is an effective integration model that takes into account the latent group structure and intrinsic geometric information between multi-view data.

Co-Differentially Expressed Gene Selection Results
It is well known that genomic alterations and genetic mutations can cause cancer [20,21]. Therefore, research on cancer genomic data is an urgent task. In the feature selection experiment, we used genomic data including GE, CNV, and ME to verify the effectiveness of the proposed method. The co-differential genes were selected on a shared coefficient matrix. Since the differential genes we selected are genes expressed in GE, CNV, and ME, the selected co-differential genes have more important biological significance.
In the experiment, we scored all the genes. These genes were then ranked in descending order of score. The higher the score of a gene, the greater its significance. We chose such a gene as a differentially expressed gene. In practice, we selected the top 500 genes of each method as co-differentially expressed genes for comparison. Then, the selected genes were placed in the GeneGards (http://www.genecards. org/) for analysis. GeneCards is a comprehensive database of human genes that provides a variety of valuable information for studying genes [22]. Table 4 lists the results of five methods for selecting co-differential genes.  23 17.60 Note: N is obtained by matching the co-differential genes selected by each method to the virulence gene pool of PAAD, ESCA, COAD, and HNSC. HRS represents the highest relevant score, and ARS represents the average relevant score. The best experimental results are highlighted in bold.
In Table 4, N is obtained by matching the co-differential genes selected by each method to the virulence gene pool of PAAD, ESCA, COAD, and HNSC. A larger N indicates a higher accuracy in identifying co-differentially expressed genes. HRS represents the highest relevant score, and ARS represents the average relevant score. Relevant scores represent the degree to which a gene is associated with a disease. A higher relevant score for a gene means that the gene is likely to be a pathogenic gene. Although the ARS of MvNMF was slightly higher than GMvNMF in COAD and HNSC, the performance of co-differentially expressed genes selected by GMvNMF was better on the whole. This indicates that our method was reasonable. In order to retain the geometric structure in the data, it was necessary to add graph regularization to the method.
3.5.2. Discussion of Co-Differential Genes Table 5 lists the relevant information of the top 10 co-differential genes selected by the GMvNMF method (to save space, we only listed the top 10 genes selected in the COAD dataset.). From Table 5, we can see that BRCA1 has the highest relevance score. BRCA1 is a protein-coding gene involved in DNA repair. When it is mutated, the tumor suppressor protein does not form normally. This leads to the emergence of cancer. BRCA1 has been confirmed to be related to COAD [23]. BRCA2 is a tumor suppressor gene, which is mainly involved in the repair of DNA damage and regulation of transcription. There is literature that BRCA2 is related to COAD [24]. As we all know, mutations in BRCA1 and BRCA2 increase the risk of breast or ovarian cancer [25]. Therefore, mutations in one gene may be related to the production of multiple cancers. This suggests that biologists can further study the link between COAD and breast or ovarian cancer. The protein encoded by the epidermal growth factor receptor (EGFR) is a transmembrane glycoprotein that is a member of the protein kinase superfamily. In addition, mutation or overexpression of EGFR generally triggers COAD [26]. Note: Gene ID represents the number of the gene. Gene ED represents the gene name. Table 6 lists the co-differentially expressed genes with the highest relevance score selected by GMvNMF on the multi-view dataset of PAAD, ESCA and HNSC. These co-differential genes were highly likely to cause cancer. The relevance score of EGFR is 168.23 in the HNSC. EGFR is a protein-coding gene. Among its related pathways are extracellular regulated protein kinases (ERK) signaling and GE. Moreover, the importance of EGFR in HNSC has been widely recognized [27,28]. EGFR also appears in Table 5, which indicates that EGFR has to do with the occurrence of COAD and HNSC. This provides a new way for biologists to study COAD and HNSC. Table 6. Summary of the co-differential genes selected on PAAD, ESCA and HNSC.

Conclusions
In this paper, we proposed a new integrated NMF model called MvNMF for multi-view clustering and selection of co-differentially expressed genes. Considering the low-dimensional manifold structure existing in the high-dimensional multi-view sample space, the graph regularization constraint was added to the objective function of MvNMF. This new model is called GMvNMF. It effectively encodes the geometric information inherent in the data. Numerous experiments on cancer genomic data showed that our proposed GMvNMF method is more effective.
For future work, we continue to improve the model to enhance its robustness and sparsity.