Detecting Interactive Gene Groups for Single-Cell RNA-Seq Data Based on Co-Expression Network Analysis and Subgraph Learning

High-throughput sequencing technologies have enabled the generation of single-cell RNA-seq (scRNA-seq) data, which explore both genetic heterogeneity and phenotypic variation between cells. Some methods have been proposed to detect the related genes causing cell-to-cell variability for understanding tumor heterogeneity. However, most existing methods detect the related genes separately, without considering gene interactions. In this paper, we proposed a novel learning framework to detect the interactive gene groups for scRNA-seq data based on co-expression network analysis and subgraph learning. We first utilized spectral clustering to identify the subpopulations of cells. For each cell subpopulation, the differentially expressed genes were then selected to construct a gene co-expression network. Finally, the interactive gene groups were detected by learning the dense subgraphs embedded in the gene co-expression networks. We applied the proposed learning framework on a real cancer scRNA-seq dataset to detect interactive gene groups of different cancer subtypes. Systematic gene ontology enrichment analysis was performed to examine the detected genes groups by summarizing the key biological processes and pathways. Our analysis shows that different subtypes exhibit distinct gene co-expression networks and interactive gene groups with different functional enrichment. The interactive genes are expected to yield important references for understanding tumor heterogeneity.


Introduction
Recent advances in Next-generation sequencing (NGS) technologies have enabled the generation of high-throughput single-cell gene expression data exploring both genetic heterogeneity and phenotypic variation between cells [1,2]. Single-cell RNA-seq (scRNA-seq) acquires transcriptomic information from individual cells, providing a higher resolution of cellular differences and a better understanding of cell functions at genetic and cellular levels [3]. In contrast with traditional bulk RNA-seq that reveals the average gene expression of a collection of cells, scRNA-seq will allow researchers to uncover new and potentially unexpected biological discoveries [4]. scRNA-seq has been utilized to study cancer, where tumor heterogeneity poses significant challenges in the clinical diagnosis, cancer treatment, and patient survival [5]. The unprecedented ability of measuring gene expression from individual cells holds enormous potential for detecting the clinically important tumor subpopulations and understanding tumor heterogeneity [6].
Many machine learning methods have been applied to analyze scRNA-seq data for determining cell types and predicting diagnoses [7]. scRNA-seq data often comes with high dimensionality,  In the gene filtering step, we filtered out the rare, ubiquitous, and invariable genes to focus on the intrinsic transcriptomic signatures of cells in the scRNA-seq data. Since the rare and ubiquitous genes are usually not useful for identifying different cell subpopulations, the genes that are expressed in less than r% of cells (i.e., rare genes) or expressed in at least (100 − r)% of cells (i.e., ubiquitous genes) were firstly filtered out. We set r% as 6, as that considered in the previous study [35]. Then, the most c% variable gene set across the single-cells was identified by controlling the relationship between mean expression and variability.

Spectral Clustering to Identify Cell Subpopulations
Given a set of n data samples X = {x 1 , x 2 , ..., x n } in R d , the objective of spectral clustering is to divide the data samples into K clusters. Spectral clustering consists of two main steps: (1) Dimensionality reduction based on the eigenvectors of the Laplacian matrix; (2) Finding clusters in the low-dimensional space.
In the first step, a similarity matrix was constructed to calculate the Laplacian matrix. The similarity matrix S has pairwise similarities s ij (i, j = 1, ..., n) as its entries, i.e., S = (s ij ). By using the Gaussian kernel function, the pairwise similarity is calculated as where x i − x j is the Euclidean distance between data samples x i and x j , σ is the kernel parameter. Furthermore, the undirected kNN graph was applied to sparse the similarity matrix, by which s ij is calculated as Equation (1) when x i is one of the k nearest neighbors of x j or x j is one of the k nearest neighbors of x i , otherwise s ij = 0. The normalized Laplacian matrix L was then calculated as where D is a n × n diagonal matrix with d i = ∑ n j=1 s ij on the diagonal. The K smallest eigenvectors corresponding to the K smallest eigenvalues of L were computed to form a low K-dimensional space.
In the second step, the K-means clustering method was performed to divide the data into K clusters in the low-dimensional space. By using spectral clustering, the cells in a cluster were identified as the same subpopulation.

Differentially Expressed Gene Selection
Genes that are more strongly differentially expressed (DE) are more likely to cause separated clusters of cells [36]. We selected the DE genes for the samples in each cluster to construct the gene co-expression networks.
To select the DE genes, we used the Welch t-test [37] to test differences in expression between clusters. Pairwise comparisons between clusters were performed for each gene. The genes differentially expressed in any pairwise comparison between clusters will be given a low p-value. For a cluster, we combined the p-value for each gene by combining the p-values across the pairwise comparisons involving this cluster [38]. For example, considering 3 clusters and combining the p-value of each gene for cluster 1. Pairwise comparisons between clusters 1 and 2, and between clusters 1 and 3 were performed for each gene, respectively. Then the p-value of each gene in the two comparisons were combined. The combined p-value for each gene was calculated as the middle-most value by applying the Holm-Bonferroni correction [39] across its p-values. Thus, in each cluster, a gene will achieve a low combined p-value if it is strongly differentially expressed in all pairwise comparisons to other clusters.
Then, we calculated the false discovery rate (FDR) by the Benjamini-Hochberg method [40] based on the combined p-value. For each cell subpopulation, the genes with 5% FDR were selected as the DE genes.

Gene Co-Expression Network Construction
For each cell subpopulation, we constructed a gene co-expression network based on the selected DE genes. A gene co-expression network is a transcript-transcript association network, generally reported as an undirected graph, in which genes are connected when there is a significant co-expression relationship between them [41].
Firstly, we calculated the adjacency matrix A = (a ij ), in which a ij is the adjacency value between gene i and gene j. a ij is calculated as where w ij is the gene-wise similarity which is calculated as the absolute value of the pairwise Pearson correlation between gene i and gene j, β is the single soft threshold which is chosen by scare free topology criterion. Then, we applied the topological overlap matrix (TOM) to calculate the gene connectivity in the co-expression network. TOM provides the implication of the connected genes and their useful biological function or pathway [42]. The entries of TOM, i.e., t ij , is calculated based on a ij as follows.
We further constructed a binary matrix B by using 1 to represent the strong similarity and using 0 to represent the weak similarity. The top g% values in TOM were set as 1 and the rest were set as 0. The binary matrix B = (b ij ) directly presents the connectivity between genes in the co-expression network. That is, b ij = 1 if there is an edge between gene i and gene j and b ij = 0 otherwise. In this paper, the g% was set by experience.

Subgraph Detection
We detected the dense subgraphs embedded in the gene co-expression network based on eigenvector L 1 norms of a modularity matrix [43,44]. Newman's notion of the modularity matrix [45] associated with an unweighted, undirected graph is given by Here B is the binary matrix used to construct the gene co-expression network. H is the degree vector of the co-expression network, where the ith component of H is the number of edges adjacent to gene i. |E| is the total number of edges in the co-expression network. Since M is real and symmetric, it admits the eigendecomposition M = UΛU T , where U is a matrix with each column being an eigenvector of M, and Λ is a diagonal matrix of eigenvalues.
We detected the dense subgraphs based on L 1 properties of the largest eigenvectors corresponding to the largest eigenvalues of the modularity matrix. The L 1 norms of an eigenvector Here Z is the number of genes in the co-expression network. The L 1 properties of the largest eigenvectors have been exploited in a graph-theoretic setting for finding maximal cliques. If a small set of genes are interactive, i.e., forms a community group in the co-expression network, there will be an eigenvector well aligned with this set, which implies that the L 1 norm of this eigenvector would be smaller than that of an eigenvector with a similar eigenvalue when there is no dense subgraph. The genes involved in dense subgraph are probably interactive since they are tightly connected.

Materials
In this study, we used a single-cell expression dataset from a recent scRNA-seq study, i.e., GSE72056 [46], which was selected from the data repository NCBI Gene Expression Omnibus. In this dataset, 4645 single cells with 23684 genes were isolated from 19 patients with melanoma tumor. There are 1257 malignant melanoma tumor cells and 3388 benign tumor cells. The detailed number of cells in each sample/patient is listed in Table 1.
The proposed framework was applied to the malignant melanoma tumor cells to identify the cancer subtypes and detect the interactive gene group in each cancer subtype. As shown in Table 1, the 1257 malignant cells were derived from 15 patients. The dataset was transformed by logTPM before being processed by the proposed framework.

Identification of Cancer Subtypes
We applied spectral clustering to identify the cancer subtypes due to its superior performance compared to other classic clustering methods. To evaluate the performance of spectral clustering, we first compared spectral clustering with other classic clustering methods, i.e., K-means [18], hierarchical clustering [19], EM [20], on the clustering task: clustering the 4645 cells in GSE72056 into two clusters (malignant and benign tumor cells). We used the adjusted rand index (ARI) [47] to measure the accuracy of clustering results. A larger value of ARI indicates a better clustering result. The comparison result is shown in Figure 2. We can see that spectral clustering outperforms other classic clustering methods. Then, spectral clustering was applied to identify the cancer subtypes of malignant melanoma tumor in GSE72056. Since there is no ground truth of the clusters for these malignant cells, we applied Calinski-Harabaz Index [35] to decide the number of clusters. Spectral clustering identified six clusters, i.e., six cancer subtypes, in the dataset. In spectral clustering, the six smallest eigenvectors corresponding to the six smallest eigenvalues of the Laplacian matrix were computed to form a low six-dimensional space. Since six-dimensional space cannot be visualized directly, we show the three-dimensional spaces constructed by the first three eigenvectors and the last three eigenvectors in Figure 3a,b, respectively. We can see that the three clusters denoted by red, pink, and green colors can be separated by the first three eigenvectors, and the other three clusters denoted by brown, purple, and black colors can be separated by the last three eigenvectors. Thus, by using the six eigenvectors, the six clusters can be separated and identified.  We also visualized the clustering result of spectral clustering by t-SNE and UMAP, the corresponding results are shown in Figure 4a,b, respectively. Spectral clustering displays six clearly recognizable clusters in the two-dimensional space constructed by both t-SNE and UMAP. The two-dimensional space constructed by UMAP shows more clearly recognizable clusters than that by t-SNE.  We listed the 6 cell subpopulations presented in each sample/patient in Table 2. For each sample/patient, the majority of cells belonging to a subtype was highlighted in bold-face type. We can see from Table 2, in most cases, the subpopulations are indeed present in the same patients (100%) or the majority of cells in the same patients (>92%). Some patients may refer to more than one melanoma subtype, such as Melanoma_60 and Melanoma_94.

Detecting Interactive Gene Groups
We detected the interactive gene groups from each cancer subtype identified by spectral clustering. In each cancer subtype, the DE genes were selected to construct a gene co-expression network. The numbers of selected DE genes for subtypes 1 to 6 are 3092, 4679, 5644, 4364, 4538, and 2533, respectively. We used the WGCNA [48] to calculate the TOM matrix. The top g% values to be 1 for subtypes 1 to 6 were set as 3.5, 1.5, 0.4, 0.9, 0.9, and 0.9, respectively. A binary matrix was formed based on the TOM matrix to decide the edges in the gene co-expression network. Then, the interactive gene groups were detected based on the eigenvector L 1 norms of the modularity matrix which was calculated based on the binary matrix.
For each cancer subtype, we computed the largest 100 eigenvectors of the modularity matrix and the L 1 norm of each eigenvector. Comparing each L 1 sequence to a "smoothed" version, we selected the two eigenvectors that deviate the most from this trend [43]. For the six cancer subtypes, the two eigenvectors that deviate most are those with the smallest L 1 norm. Figures 5 and 6 show the plots of the L 1 norms of the largest 100 eigenvectors and the scatterplots in the space of the corresponding two eigenvectors with the smallest L 1 norm. The eigenvectors declared are highlighted by circles.
The dense subgraphs detected by L 1 analysis are presented in Table 3. Two subgraphs are first chosen from each cancer subtype, corresponding to the points highlighted by circles in the scatterplots in Figures 5 and 6. The two subgraphs are denoted as Subg 1 and Subg 2 in Table 3. For each subgraph, we listed the size (number of genes), density (internal edges divided by the maximum number of edges), and the eigenvector that separates it from the co-expression network. e j denotes the jth largest eigenvector. We can see from Table 3, the detected subgraphs are quite dense, all with 100% density. That is, the genes are connected to each other in each detected subgraph.
We then examined the genes in the detected dense subgraph. We found that the genes in a detected subgraph are highly connected, however, they may isolate from other genes outside the subgraph. For example, subgraph 1 detected in cancer subtype 2, as shown in Figure 7. Figure 7 shows the gene co-expression network of cancer subtype 2, constructed by the Cytoscape software [49]. Two detected subgraphs are highlighted by red circles in the gene co-expression network. To see the genes in the subgraphs, we further enlarge the two detected subgraphs and show them in two green squares, respectively. The above green square shows Subg 1, in which the 10 genes, i.e., SNAR-A9, SNAR-A6, SNAR-A14, SNAR-A11, SNAR-A4, SNAR-A8, SNAR-A7, SNAR-A3, SNAR-A5, SNAR-A10, are highly connected. These genes are not connected to other genes outside Subg 1. The genes in Subg 1 have a very close relationship since they have the same prefix name. Note that the scatterplots of these genes in the space of two eigenvectors in Figure 5d, i.e., the points in the left circle, also isolate from other points/genes. Similar subgraphs are detected in subtypes 3 (Subg 1) and 6 (Subg 1), which are corresponding to the points in the left circle in Figure 5f and the points in the right circle in Figure 6f, respectively. The genes in Subg 1 of subtypes 3 are CT47A4, CT47A10, CT47A6, CT47A2, CT47A3, CT47A5, CT47A11, CT47A12, CT47A8, CT47A9, CT47A1, CT47A7, while the genes in Subg 1 of subtypes 6 are SNAR-A6, SNAR-A5, SNAR-A4, SNAR-A7, SNAR-A9, SNAR-A10, SNAR-A3, SNAR-A8. These genes in the detected subgraph all have the same prefix name.   We further performed systematic gene ontology enrichment analysis on the genes in a subgraph by using DAVID tools and summarize the key biological processes and pathways [50]. We have detected two subgraphs for each cancer subtype. Table 4 lists the enrichment analysis of the genes in one subgraph for each cancer subtype, in which the genes in the listed subgraph that are more enriched than those in another. For example, the genes in Subg 1 for cancer subtype 1 are listed since the analysis of genes in Subg 1 are more enriched than those in Subg 2. These modules are enriched for biologically important processes that are relevant to melanoma, such as cell cycle. The abnormal proliferation resulting from alterations in cell cycle regulatory mechanisms will lead to the transformation of melanocytes to melanoma cells [51]. In Table 4, we also summarize the number of genes that are involved in the same term type and write it in brackets, e.g., BP: cell cycle (18)  In subtype 2, the largest number of genes belonging to the same type is 10, which are BUB1, ANLN, BIRC5, CENPF, KIFC1, NCAPH, NUSAP1, TYMS, TOP2A, UBE2C, and they are involved in BP: mitotic cell cycle process. In subtype 3, the most connected genes are also involved in BP: mitotic cell cycle process., which are NDC80, RACGAP1, TTK, CCNB1, CDKN3, CKAP2, FAM64A, FOXM1, KIF14, KIF20A, KIF20B, KIF4A, SKA3. We can see that these genes are different from those involved in BP: mitotic cell cycle process in subtype 2. A similar result can be found in the most highly connected genes in subtype 4 and subtype 6, which are involved in the same term type but the related genes are different. That may be because for different cancer subtypes the informative genes are different. We also can see from the enrichment analysis results in Table 4, the genes in the detected subgraph are closely related. For example, there are 20 genes in Sunb1 of subtype 1 and 18 of them are involved in the same term type. For Subg 2 in subtype 5, all the genes are involved in the same term type, i.e., BP: cellular macromolecule. The interactive genes detected in different cancer types are expected to yield important references for finding new markers and understanding tumor heterogeneity.

Conclusions and Discussion
scRNA-seq brings unprecedented insights into cellular heterogeneity, in which detecting the related genes causing cell-to-cell variability is critical. The related genes are usually detected separately without considering gene interactions. However, considering gene interaction is important since many human diseases are multigenic. In this paper, we proposed a novel learning framework to detect the interactive genes for scRNA-seq data based on co-expression network analysis and subgraph learning. We identified the cell subpopulations using spectral clustering and selected the differentially expressed genes to construct a gene co-expression network for each cell subpopulation.
The interactive gene groups were detected by learning the dense subgraphs embedded in the gene co-expression networks. We applied the proposed learning framework on the real melanoma tumor scRNA-seq dataset. Six cancer subtypes were identified, and we detected the interactive gene groups from each cancer subtype. The genes were highly connected, i.e., connected to each other, in each detected gene group. Systematic gene ontology enrichment analysis was performed to examine the potential functions of the detected interactive genes by summarizing the key biological processes and pathways. Our analysis shows that different subtypes exhibit distinct gene co-expression networks and interactive gene groups with different functional enrichment. The interactive genes are expected to yield important references for understanding tumor heterogeneity.
Although our framework is proposed for scRNA-seq data and the experimental results are from the application on melanoma tumor dataset, the proposed framework is generally applicable to other types of biological data and other types of tumors. For example, the proposed framework can be apply to protein datasets to analyze the signal transduction pathways in protein interaction networks. Other types of biological data with the need for detecting interactive groups can apply the subgraph detection methods in the proposed framework.
Nevertheless, the current learning models in the proposed framework still have some limitations. Firstly, we used a published dataset with a low number of cells. In future work, we will analyze scRNA-seq datasets with a larger number of cells to better demonstrate the power of the proposed framework. Secondly, some known human gene-disease interactions can be integrated to improve the learning models. For example, using the known information to improve the model parameter setting. Thirdly, some ensemble learning and feature selection procedures can be properly integrated into the clustering process to enhance performance. Fourthly, significance of the findings can be much more articulate and interpreted in the light of the up-to-date knowledge of melanoma biology. We will leave these issues for future work.