1. Introduction
Advances in biological technologies, such as RNA-seq, make it possible to generate genome-wide, high-throughput data on various platforms. International consortia, such as The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) and the Encyclopedia of DNA Elements (ENCODE, https://www.encodeproject.org/), have generated large-scale heterogeneous data on, for example, gene expression, DNA methylation, and mutation for various cancers or tissues (cells). The accumulated biological data provide a great opportunity to investigate the mechanisms of cancers.
Among these genomic data, great efforts have been devoted to the analysis of gene expression, because the regulation of gene expression controls the amount and timing of appearance of the functional product of a gene. Control of expression is vital to allow a cell to produce the gene products it needs when it needs them; in turn, this gives cells the flexibility to adapt to a variable environment, external signals, damage to the cell, and other stimuli [1,2,3]. The genes that are differentially expressed between two cohorts shed light on the regulation mechanisms of cells. For example, Li et al. [4] demonstrated that PE1 inhibits stem cell self-renewal in human chronic myelocytic leukemia. To investigate high-order relations among genes, network-based analysis has been applied to gene expression, which extracts many interesting patterns that differ from differentially expressed genes. For instance, Langfelder et al. [5] proposed the weighted gene co-expression network analysis tool (WGCNA) to mine co-expression modules.
Furthermore, biological networks have been proven to be powerful for describing and analyzing profile data, where each vertex represents a gene and each edge corresponds to an interaction between a pair of genes. There are many biological networks, such as gene regulation networks [6], signal transduction networks [7], protein–protein interaction (PPI) networks [8], disease networks [9], and gene regulation networks [10,11,12,13,14,15]. The accumulated biological networks provide an opportunity to explore the mechanisms of cells via mining the graph patterns. Great efforts have been devoted to network analysis, where the graph patterns shed light on the structure–function relations in biology. For example, Taylor et al. [16] analyzed the PPI network and demonstrated that the genes with large degrees (hub genes) play a critical role in the prognosis of breast cancer. Furthermore, Chuang et al. [17] showed that the pathways where genes are differentially expressed between two cohorts of cancer patients serve as biomarkers for predicting cancer metastasis.
However, the vast majority of analyses ignore the dynamics of data. Complex diseases, such as cancers, are dynamic and involve a continuum of molecular events associated with disease progression, from early warning events to catastrophic end-stage events [18]. Extracting modules associated with cancer progression is critical for discovering the mechanisms of cancers because these patterns provide biologists with clues for further research [19,20]. However, detecting dynamic modules associated with cancer progression is nontrivial because it is difficult to characterize and extract the dynamics of modules. Thus, the available algorithms differ greatly in how they define dynamic modules and in the strategies they use to discover the predefined patterns. Ma et al. [21] designed the MModule algorithm to identify the common modules across various stages of breast cancer, and demonstrated that the dynamics of interaction strength is critical for the acceleration of heart failure [22]. Similar efforts have also been devoted to common and specific modules for breast cancer [23,24]. However, these algorithms only focus on extracting the common and specific modules associated with cancer progression. In [25], the authors developed the NMFDM algorithm to investigate how a pathway dynamically recruits genes, for example, in cancer progression.
However, these algorithms are based only on gene expression or DNA methylation data and do not integrate any other data. In fact, integrative analysis of omic data has been extensively studied since it identifies interesting patterns that cannot be obtained by analyzing a single type of data [26]. Compared to the gene co-expression network, the protein interaction network is more reliable since a large co-expression value between a pair of genes does not imply physical interaction. Thus, the protein interaction network should be integrated with gene expression data to extract dynamic modules. Even though many algorithms have been developed to integrate protein interaction and gene expression data, no attempt has been made to identify modules associated with cancer progression. The reason is that the integrative analysis of these data is difficult because it involves both breast cancer progression and the heterogeneity of data.
In this study, we address the integration of gene expression data and a protein interaction network to mine the dynamic modules associated with cancer progression. As in [21,22], the dynamic modules are defined as common modules that are co-expressed across various stages. To analyze cancer gene expression data, we adopt the multiview subspace clustering algorithm with sparsity constraints to obtain a representation matrix for each view and a consensus matrix, as shown in Figure 1 (Supplementary Materials). By integrating the protein interaction network, we expect that the joint representation matrix C not only balances the agreement across various stages but also preserves the topological structure of the protein interaction network. Therefore, the protein interaction network is incorporated into multiview subspace clustering via regularization. In this way, the common module detection problem is transformed into a convex optimization problem, which is solved with the interior point algorithm. The experimental results demonstrate that the proposed algorithm is more accurate than state-of-the-art methods. The modules obtained by our algorithm are more enriched by the known pathways and serve as biomarkers to predict cancer stages.
The rest of the paper is organized as follows: Section 2 presents the mathematical model and algorithm. The related materials are presented in Section 3. The experimental results are provided in Section 4. The conclusion is given in Section 5.
2. Methods
The objective function and optimization procedure of the proposed algorithm, together with the algorithm analysis, are presented in this section. The rMVspc algorithm comprises two major components, as shown in Figure 1.
2.1. Preliminaries
Before describing the procedure of rMVspc in detail, we introduce some terminology that is widely used in the following sections.
The protein interaction network can be modeled by an unweighted and undirected graph $G=(V,E)$, where the vertex set $V=\{{v}_{1},{v}_{2},\dots ,{v}_{n}\}$ contains all the genes (proteins) and the edge set $E=\left\{({v}_{i},{v}_{j})\right\}$ denotes the interactions between pairs of genes. The protein interaction network G can be represented by an $n\times n$ adjacency matrix A, where ${a}_{ij}=1$ if vertices ${v}_{i}$ and ${v}_{j}$ are connected, and 0 otherwise. The degree of vertex ${v}_{i}$ is the number of edges connected to it, i.e., ${d}_{i}={\sum}_{j}{a}_{ij}$. The degree matrix D is the diagonal matrix with the degree sequence of G, i.e., $D=diag({d}_{1},\dots ,{d}_{n})$. The trace of a matrix W is the sum of the diagonal elements of W, i.e., $trace\left(W\right)={\sum}_{i}{w}_{ii}$.
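As a minimal sketch of these definitions, the following Python snippet builds the adjacency matrix A and the degree matrix D from an edge list. The function name `graph_matrices` and the toy edge list are hypothetical and serve only as an illustration; NumPy is assumed to be available.

```python
import numpy as np

def graph_matrices(n, edges):
    """Build adjacency matrix A and degree matrix D of an undirected,
    unweighted graph with n vertices (0-based vertex indices)."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0          # undirected: the matrix is symmetric
    d = A.sum(axis=1)          # d_i = sum_j a_ij
    D = np.diag(d)             # D = diag(d_1, ..., d_n)
    return A, D

# Hypothetical toy network with 4 genes and 4 interactions.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
A, D = graph_matrices(4, edges)
```

Since every edge contributes to two degrees, `np.trace(D)` equals twice the number of edges, which gives a quick sanity check of the construction.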
Let $\{1,2,\dots ,m\}$ be a finite set of cancer clinical stages, and let the subscript s denote the value of a variable at the sth stage. The gene expression data for cancer with various clinical stages are denoted by $\mathcal{X}=\{{X}_{1},{X}_{2},\dots ,{X}_{m}\}$, where each ${X}_{s}$ is the gene expression for stage s. The gene expression data ${X}_{s}$ is an ${n}_{s}\times n$ matrix, where ${n}_{s}$ is the number of samples at stage s, each row corresponds to a sample (patient), each column corresponds to a gene, and element ${x}_{ijs}$ denotes the expression level of the jth gene in the ith patient at stage s.
2.2. Procedure of Algorithm
In single-view clustering, sparse subspace clustering (SSC) [27,28] represents each data point using a small number of data points from its own subspace. Given the data X, it amounts to the minimization problem
$$\underset{C}{\min }\ {\parallel C\parallel}_{1}\quad \mathrm{s.t.}\quad X=XC,\ diag\left(C\right)=0, \qquad (1)$$
where ${\parallel C\parallel}_{1}$ is the ${l}_{1}$ norm, and the constraint $diag\left(C\right)=0$ is used to avoid trivial solutions where a data point is represented as a linear combination of itself. In the case of corrupted data, the above equation can be rewritten as
$$\underset{C,Z}{\min }\ {\parallel C\parallel}_{1}+{\lambda}_{Z}{\parallel Z\parallel}_{F}^{2}\quad \mathrm{s.t.}\quad X=XC+Z,\ diag\left(C\right)=0, \qquad (2)$$
where the ${l}_{1}$ norm promotes sparsity of the columns of C, while the Frobenius norm favors small entries in the columns of Z.
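To make the sparse self-representation concrete, the sketch below solves the noisy SSC problem in its penalized form, $\min_C {\parallel C\parallel}_{1}+\lambda {\parallel X-XC\parallel}_{F}^{2}$ with $diag(C)=0$, which follows from substituting $Z=X-XC$. Note that it uses a simple proximal gradient (ISTA) iteration rather than the interior-point method adopted in this paper; the function names, the value of `lam`, and the iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(M, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def ssc_ista(X, lam=10.0, n_iter=500):
    """Approximately minimize ||C||_1 + lam * ||X - X C||_F^2 with diag(C) = 0.

    Columns of X are the data points; C has shape (n_points, n_points).
    """
    n = X.shape[1]
    G = X.T @ X
    # The gradient of lam * ||X - XC||_F^2 is 2 * lam * (G C - G);
    # its Lipschitz constant is 2 * lam * ||G||_2 (spectral norm).
    step = 1.0 / (2.0 * lam * np.linalg.norm(G, 2))
    C = np.zeros((n, n))
    for _ in range(n_iter):
        grad = 2.0 * lam * (G @ C - G)
        C = soft_threshold(C - step * grad, step)
        np.fill_diagonal(C, 0.0)   # enforce diag(C) = 0
    return C
```

When the data points lie in mutually orthogonal subspaces, the entries of C that link points from different subspaces stay at zero, which is the property that the subsequent spectral clustering step exploits.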
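Given the gene expression associated with cancer progression $\mathcal{X}=\{{X}_{1},{X}_{2},\dots ,{X}_{m}\}$, multiview clustering finds representation matrices ${C}_{1},\dots ,{C}_{m}$ across different stages and a joint representation matrix C that balances the agreement across various stages [29]. According to [30], we use the centroid-based strategy to obtain the consensus matrix C for the subspace clustering. Therefore, Equation (2) becomes
$$\underset{C,\{{C}_{s},{Z}_{s}\}}{\min }\ \sum_{s=1}^{m}\left({\parallel {C}_{s}\parallel}_{1}+{\lambda}_{Z}{\parallel {Z}_{s}\parallel}_{F}^{2}+{\lambda}_{C}{\parallel {C}_{s}-C\parallel}_{F}^{2}\right)\quad \mathrm{s.t.}\quad {X}_{s}={X}_{s}{C}_{s}+{Z}_{s},\ diag\left({C}_{s}\right)=0,\ 1\le s\le m, \qquad (3)$$
where ${\lambda}_{Z}$ and ${\lambda}_{C}$ are tradeoff parameters.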
We present the regularized multiview sparse subspace clustering (rMVspc) algorithm to discover the common modules in multiple views of gene expression for cancers. However, common modules based solely on gene expression data assume that the genes within a module are co-expressed. In fact, protein interactions between genes are more reliable than the co-expression relation. Thus, it is promising to integrate the gene expression and protein interaction network to discover the common modules across cancer stages. However, the protein interaction network is sparse. Therefore, we expect that the joint representation matrix C not only balances the agreement across various stages but also preserves the topological structure of the protein interaction network G. According to [31], the local-structure-preserved embedding can be formulated in the trace form, which is defined as
$$trace\left({C}^{\prime}{L}_{G}C\right), \qquad (4)$$
where ${L}_{G}$ is the Laplacian matrix of graph G, i.e., ${L}_{G}=D-A$. By imposing the topology-preserving constraint, the model in Equation (3) is formulated as
$$\underset{C,\{{C}_{s},{Z}_{s}\}}{\min }\ \sum_{s=1}^{m}\left({\parallel {C}_{s}\parallel}_{1}+{\lambda}_{Z}{\parallel {Z}_{s}\parallel}_{F}^{2}+{\lambda}_{C}{\parallel {C}_{s}-C\parallel}_{F}^{2}\right)+{\lambda}_{G}\,trace\left({C}^{\prime}{L}_{G}C\right)\quad \mathrm{s.t.}\quad {X}_{s}={X}_{s}{C}_{s}+{Z}_{s},\ diag\left({C}_{s}\right)=0,\ 1\le s\le m. \qquad (5)$$
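The trace form of the network regularizer can be read as a smoothness penalty: for the Laplacian ${L}_{G}=D-A$, the identity $trace({C}^{\prime}{L}_{G}C)=\frac{1}{2}{\sum}_{i,j}{a}_{ij}{\parallel {c}_{i}-{c}_{j}\parallel}^{2}$ holds, where ${c}_{i}$ denotes the ith row of C, so interacting genes are encouraged to have similar rows. The following sketch checks this identity numerically; the function names are hypothetical.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A of an undirected graph."""
    return np.diag(A.sum(axis=1)) - A

def trace_regularizer(C, A):
    """Network regularizer in trace form: trace(C' L C)."""
    return np.trace(C.T @ laplacian(A) @ C)

def pairwise_regularizer(C, A):
    """Equivalent pairwise form: 0.5 * sum_ij a_ij * ||c_i - c_j||^2,
    where c_i is the i-th row of C."""
    n = A.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += A[i, j] * np.sum((C[i] - C[j]) ** 2)
    return 0.5 * total
```

The pairwise form makes clear that the penalty is zero only when connected genes share identical rows of C, and that it is never negative because the Laplacian is positive semidefinite.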
To solve the model in Equation (5), we adopt an alternating two-step procedure. Specifically, we update ${C}_{i}(1\le i\le m)$ by fixing C, and then update C by fixing ${C}_{i}(1\le i\le m)$. In each step, the problem in Equation (5) is a convex optimization problem, which can be solved using convex programming algorithms [32,33]; sparse solutions are also preferred [34,35]. In this study, we adopt the interior-point algorithm [32] to obtain matrix C.
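As an illustration of the second step, suppose that the terms involving the consensus matrix take the form ${\lambda}_{C}{\sum}_{s}{\parallel {C}_{s}-C\parallel}_{F}^{2}+{\lambda}_{G}\,trace({C}^{\prime}{L}_{G}C)$ and that C itself is unconstrained. This form is an assumption of the sketch. Under it, setting the gradient to zero gives the linear system $(m{\lambda}_{C}I+{\lambda}_{G}{L}_{G})C={\lambda}_{C}{\sum}_{s}{C}_{s}$, which can be solved directly instead of calling a general interior-point solver. The function name and default parameter values below are hypothetical.

```python
import numpy as np

def update_consensus(C_views, L, lam_C=1.0, lam_G=1.0):
    """Update the consensus matrix C with the per-view matrices C_s fixed.

    Minimizes lam_C * sum_s ||C_s - C||_F^2 + lam_G * trace(C' L C)
    (an assumed form of the consensus terms) by solving
    (m * lam_C * I + lam_G * L) C = lam_C * sum_s C_s.
    """
    m = len(C_views)
    n = L.shape[0]
    lhs = m * lam_C * np.eye(n) + lam_G * L   # positive definite for lam_C > 0
    rhs = lam_C * sum(C_views)
    return np.linalg.solve(lhs, rhs)
```

With `lam_G = 0` the update reduces to the average of the per-view matrices, which matches the intuition behind a centroid-based consensus; a positive `lam_G` additionally smooths the rows of C over the protein interaction network.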
After obtaining the consensus matrix C, we construct the affinity matrix W as
$$W=C+{C}^{\prime}. \qquad (6)$$
The spectral clustering algorithm is then applied to W to obtain the final modules. The procedure is depicted in Algorithm 1.
Algorithm 1 The rMVspc algorithm
Input: $\mathcal{X}$: Gene expression data; $G=(V,E)$: Protein interaction network
Output: ${\left\{{V}_{i}\right\}}_{i=1}^{k}$: Common modules
1: Update ${C}_{s}$ by fixing C and ${C}_{i}(i\ne s)$ based on the interior point algorithm [32];
2: Update C by fixing ${C}_{s}(1\le s\le m)$ based on the interior point algorithm [32];
3: Normalize the columns of consensus matrix C;
4: Construct the affinity matrix $W=C+{C}^{\prime}$;
5: Apply spectral clustering to obtain modules based on matrix W;
6: return common modules.
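Steps 3 to 5 of Algorithm 1 can be sketched as follows. This is an illustrative implementation under several assumptions that are not specified in the text: absolute values of C are used so that the affinities are nonnegative, the normalized Laplacian is used for the spectral embedding, and a small deterministic k-means routine replaces a library implementation. All function names are hypothetical.

```python
import numpy as np

def kmeans(Y, k, n_iter=100):
    """Minimal k-means with deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([Y[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def modules_from_consensus(C, k):
    """Normalize C, build the affinity matrix, and run spectral clustering."""
    norms = np.linalg.norm(C, axis=0)
    Cn = C / np.where(norms > 0, norms, 1.0)          # step 3: unit columns
    W = np.abs(Cn) + np.abs(Cn).T                     # step 4 (abs is an assumption)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1.0))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)                   # eigenvalues in ascending order
    Y = vecs[:, :k]                                   # k smallest eigenvectors
    Y = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return kmeans(Y, k)                               # step 5: module label per gene
```

The function returns one module label per gene; genes sharing a label form one common module ${V}_{i}$.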

4. Results
To validate the performance of the proposed algorithm, three state-of-the-art algorithms are selected for comparison on both artificial data and breast cancer data. The compared algorithms are the MModule algorithm [21], multiview clustering (MVNMF) [39], and spectral clustering [40]. Notice that spectral clustering cannot be applied to multiple networks directly. Thus, we apply spectral clustering to each network and then combine the results from the individual networks based on consensus clustering (CSC).
Two types of datasets, including both the artificial and real breast cancer data, are employed for a comparison between various algorithms. The artificial networks are adopted to test the accuracy of the rMVspc algorithm, and the breast cancer data are used to determine the applicability of the proposed algorithm in discovering common modules in real networks with strong backgrounds.
4.1. Benchmarking Performance on the Artificial Networks
In the artificial networks, we combine three GN networks, where the first two networks are used for multiple views and the remaining one is used for regularization (Materials). To increase the difficulty in discovering the common modules, we increase the parameter ${Z}_{out}$ from 1 to 8 while we fix ${Z}_{out}$ as 6. To quantify the performance of algorithms, the normalized mutual information (NMI) is adopted since the community structure is known in the artificial networks (Materials).
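For reference, a minimal sketch of the NMI computation between a predicted and a known partition is given below. It assumes the commonly used normalization $2I(A;B)/(H(A)+H(B))$; the exact definition used in this study is given in Materials and may differ. The function name is hypothetical.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B))
    between two label vectors of equal length."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    n = a.size
    ua, ia = np.unique(a, return_inverse=True)
    ub, ib = np.unique(b, return_inverse=True)
    cont = np.zeros((ua.size, ub.size))
    np.add.at(cont, (ia, ib), 1)              # contingency table of co-memberships
    pij = cont / n
    pi = pij.sum(axis=1)
    pj = pij.sum(axis=0)
    nz = pij > 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz]))
    ha = -np.sum(pi * np.log(pi))
    hb = -np.sum(pj * np.log(pj))
    if ha + hb == 0:                          # both partitions are a single cluster
        return 1.0
    return 2.0 * mi / (ha + hb)
```

The measure is invariant to relabeling of the modules: it equals 1 when the two partitions coincide and 0 when they are statistically independent.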
Before comparing the algorithms, we first investigate how the parameters affect the performance of the proposed algorithm. Three parameters are involved: ${\lambda}_{Z}$ controls the importance of the regularizer of factorization, ${\lambda}_{C}$ controls the agreement between the individual views and the consensus matrix, and ${\lambda}_{G}$ denotes the importance of the network for regularization. Similar to [41], we assume that these parameters are equal, i.e., ${\lambda}_{Z}={\lambda}_{C}={\lambda}_{G}=\lambda$, since we hypothesize that all regularization terms are equally important. By setting $\lambda \in \{{10}^{-2},{10}^{-1},{10}^{0},{10}^{1},{10}^{2}\}$, we check how the accuracy of the proposed algorithm, in terms of NMI, changes as ${Z}_{out}$ increases from 1 to 8, which is shown in Figure 2A. As $\lambda$ increases from ${10}^{-2}$ to ${10}^{0}$, the accuracy of the rMVspc algorithm increases and achieves the best performance at $\lambda =1$. The reason is that, when $\lambda$ is small, the objective function is dominated by subspace clustering, and the contribution of the regularization terms is subtle. As $\lambda$ increases, the contribution of the regularization terms becomes increasingly important, which improves the accuracy of rMVspc. As $\lambda$ increases from ${10}^{0}$ to ${10}^{2}$, the accuracy of the proposed algorithm decreases dramatically. The reason is that, as $\lambda$ continues to increase, the objective function of rMVspc is dominated by the regularization, resulting in a decrease in performance. Furthermore, the proposed algorithm is robust since its accuracy is stable for a wide range of $\lambda$ values. In all experiments, we set $\lambda =1$.
We compare the MVNMF, CSC, MModule, and rMVspc algorithms on the artificial networks in terms of accuracy, which is shown in Figure 2B. The panel shows that the proposed algorithm achieves the best performance, followed by MModule, MVNMF, and CSC. While MModule is inferior to the rMVspc algorithm, it is much better than the others. There are two possible reasons why the proposed algorithm outperforms the other methods. First, the subspaces are more precise in characterizing the module structure in multiple-view data compared with the data in the original space. Second, the proposed algorithm incorporates both the subspace and topological information, which provides a better way to characterize the structure of common modules. Moreover, the performance of all algorithms decreases dramatically as ${Z}_{out}$ increases from 1 to 8 because the module structure becomes fuzzy as ${Z}_{out}$ increases. For example, the NMI is about 1 when ${Z}_{out}\le 4$; when ${Z}_{out}>4$, the NMI value decreases dramatically.
4.2. Benchmarking Performance on the Breast Cancer Networks
The artificial data are used to test the accuracy of the proposed algorithm in detecting the common modules. To check whether the proposed algorithm can identify common modules across various clinical stages in data with a biological background, we apply it to the breast cancer data.
Because the true modules are unknown, multiple reference pathway annotations, including Gene Ontology [42], KEGG [43], and BioCarta [44], are used to determine the effectiveness of the algorithms by enrichment analysis (Materials). To evaluate the performance, we use specificity and sensitivity to quantify the accuracy, where specificity is defined as the fraction of the predicted modules that significantly overlap with at least one reference pathway, while sensitivity is defined as the fraction of the reference pathways that significantly overlap with at least one predicted module.
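The two measures can be sketched as follows. The snippet assumes that a significant overlap is decided by a one-sided hypergeometric test with a cutoff of 0.05 and without multiple-testing correction; the enrichment procedure actually used is described in Materials and may differ. Function names and arguments are hypothetical, and only the Python standard library is used.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(overlap >= k) when a module of n genes is drawn at random from
    N background genes, K of which belong to the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def specificity_sensitivity(modules, pathways, N, alpha=0.05):
    """modules, pathways: lists of gene sets; N: number of background genes.

    Specificity: fraction of modules enriched in at least one pathway.
    Sensitivity: fraction of pathways enriched in at least one module.
    """
    sig = [[hypergeom_pvalue(N, len(p), len(m), len(m & p)) < alpha
            for p in pathways] for m in modules]
    spec = sum(any(row) for row in sig) / len(modules)
    sens = sum(any(sig[i][j] for i in range(len(modules)))
               for j in range(len(pathways))) / len(pathways)
    return spec, sens
```

In practice, a correction for multiple testing would typically be applied to the p-values before thresholding, since every module is tested against every pathway.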
Figure 3A,B shows that the rMVspc algorithm achieves higher specificity than the other methods while maintaining comparable sensitivity. Specifically, the specificity values of rMVspc are 76.9%, 80.3%, and 81.7% for the GO, KEGG, and BioCarta pathways, respectively, while those of the MModule algorithm are 72.4%, 74.4%, and 76.5%. The results demonstrate that the common modules obtained by the proposed method are more enriched by the known pathways than those obtained by the others. Notice that the rMVspc algorithm is inferior to MModule in terms of sensitivity. We check the significance of the differences between rMVspc and MModule in specificity and sensitivity using the Fisher exact test with a cutoff of 0.05. The results demonstrate that the difference in specificity is significant, while the difference in sensitivity is not.
The proposed algorithm integrates both the gene expression data and the protein interaction network. We therefore ask what difference it makes if the protein interaction network is not integrated. The specificity and sensitivity of the modules are shown in Figure 3C,D. The panels show that the integration of the protein interaction network increases the percentage of modules that are enriched by known pathways. The results demonstrate that the integration is promising for identifying the common modules associated with cancer progression.
4.3. Common Modules Serve as Biomarkers to Predict Breast Cancer Stages
It has been shown that hub genes [16] and modules [17,21] are predictive for breast cancer diagnosis. Thus, we hypothesize that the common modules can also be used to predict the stages of breast cancer. Following [17], we construct module-based features to predict the stages of breast cancer (Materials). For each module, we construct a feature vector that is the average of the expression of the genes within the module. Based on the feature vectors, we use the SVM to predict the stage of cancers.
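A minimal sketch of the module-based feature construction is shown below. It assumes that the expression matrix is arranged with samples in rows and genes in columns and that modules are given as sets of column indices; the function name is hypothetical, and the resulting feature matrix would then be passed to a classifier such as an SVM.

```python
import numpy as np

def module_features(X, modules):
    """Average expression of the genes in each module, per sample.

    X: array of shape (n_samples, n_genes), assumed layout.
    modules: list of collections of gene (column) indices.
    Returns an array of shape (n_samples, n_modules).
    """
    return np.column_stack([X[:, sorted(m)].mean(axis=1) for m in modules])
```

Each column of the output is one module-level feature, so the dimensionality of the classification problem equals the number of modules rather than the number of genes.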
For a baseline comparison, we compare the classification accuracy obtained with the following feature sets: modules generated by other algorithms, size-matched differentially expressed genes, and randomly selected genes. We trained a support vector machine (SVM) classifier to perform multiclass classification and used accuracy (the percentage of patients that are correctly classified) to measure performance. The results on the TCGA breast cancer data using fivefold cross-validation are presented in Figure 4A. The modules obtained by our algorithm are more discriminative than the others. Specifically, the rMVspc algorithm has significantly higher accuracy than MModule (74.5% vs. 71.3%). These results demonstrate that the common modules obtained by rMVspc capture the specificity of pathways as breast cancer progresses.
To further validate the performance of the various algorithms, we evaluated the SVM classifiers by using external data (GSE5874). We trained the SVM classifier on the TCGA data and tested it on an external microarray dataset. Consistent results indicate that the performance is not due to hidden confounding factors in the TCGA dataset (Figure 4B). The accuracy of rMVspc is 51.4%, while the accuracies of the MModule, MVNMF, CSC, and DGis are 49.8%, 44.9%, 41.3%, and 38.7%, respectively. The results show that the proposed algorithm is better than the available approaches in discovering common modules in data integration.