A Hybrid Clustering Algorithm for Identifying Cell Types from Single-Cell RNA-Seq Data

Single-cell RNA sequencing (scRNA-seq) has recently brought new insight into cell differentiation processes and functional variation in cell subtypes from homogeneous cell populations. A lack of prior knowledge makes unsupervised machine learning methods, such as clustering, suitable for analyzing scRNA-seq data. However, there are several limitations to overcome, including high dimensionality, clustering result instability, and parameter adjustment complexity. In this study, we propose a method that combines structure entropy and the k nearest neighbor algorithm to identify cell subpopulations in scRNA-seq data. In contrast to existing clustering methods for identifying cell subtypes, minimizing structure entropy yields natural communities without specifying the number of clusters. To investigate the performance of our model, we applied it to eight scRNA-seq datasets and compared our method with three existing methods (nonnegative matrix factorization, single-cell interpretation via multikernel learning, and the structural entropy minimization principle). The experimental results showed that our approach achieves, on average, better performance on these datasets compared to the benchmark methods.


Introduction
Gene expression profiles can represent the development stage and the differentiation state of cells. For example, based on gene expression profiles, the classification of colorectal cancer can find subtypes that display resistance to therapy [1][2][3]. Gene expression across tissues has been described, which can be used to build complex networks and understand the heterogeneity of human tissues [4][5][6]. Traditional gene expression profiling of bulk cells is obtained by sequencing a large number of cells that are commonly a mixture of different cell types or tissues [7,8]. Single-cell RNA sequencing (scRNA-seq) [9][10][11] is able to address the limitations of conventional bulk sequencing approaches. For example, bulk sequencing technology measures the mean gene expression of multiple cells and discards the differences between cells [12,13]. Single-cell RNA sequencing has attracted a great amount of attention for the following characteristics: (1) It can sequence more samples than traditional bulk methods and obtain more raw material for downstream analysis [14]; (2) it can be clearly observed that scRNA-seq data are sparse, with an average sparsity that may reach 50% [15]. In recent years, many efforts have been focused on obtaining robust and significant clustering results, and complex similarity measurement methods or clustering algorithms have been designed. However, some methods were unstable across different datasets and obviously depended on parameter adjustment.
To address the aforementioned issue in unsupervised learning methods based on scRNA-seq datasets, we explored an effective and robust clustering method in this study using graph theory and structure entropy theory. Our proposed method included three steps: Firstly, the similarity matrix of cell samples was constructed by learning different weights for multiple kernels to measure cell-to-cell distances. Secondly, the weighted cell network was constructed with the k nearest neighbor algorithm; the weight of edges was determined by the similarity matrix. Thirdly, clustering was performed using the two-dimensional structure entropy minimum principle. On eight public scRNA-seq datasets, the performance of the presented method was investigated in terms of two evaluation metrics: normalized mutual information and adjusted Rand index. From the experimental results, we found that our approach achieved the best average performance on these datasets compared to the other methods.

Materials and Methods
A framework of our proposed method (single-cell structure entropy minimization principle, SSE) is presented in Figure 1. This is a hybrid clustering algorithm based on multikernel learning, k nearest neighbor (KNN), and structure entropy. It is well known that there are various methods to cluster high dimensional data into interpretable subparts, among which we applied and combined multikernel learning, structure entropy, and KNN. Firstly, single-cell interpretation via multikernel learning (SIMLR) is a novel similarity measurement method, which is insensitive to the parameter pair (k,σ) and the number of kernels. Moreover, we tested our method with different values of the parameter k (k = 5, 10, 15, 20, 25; the default value is 10) on two datasets with typically accurate labels and found that our algorithm was also insensitive to the value of k. Multikernel learning can best fit the data structure and enforce block structures in the similarity calculation by integrating multiple kernels [39,40]. Secondly, KNN is a classical and very popular clustering method for its easy-to-understand implementation and significant classification performance [41,42], and it has been voted one of the top ten data mining algorithms. KNN can represent the sample network by constructing a KNN graph and can detect communities quickly [43][44][45][46]; the KNN method has only one parameter, k, to adjust. Thirdly, entropy can be used to measure the complexity of networks and to represent the stability of a system: the lower the entropy, the more stable the system. Thus, the principle of structure entropy minimization can detect the natural communities in networks [47,48]. In this study, we combined the advantages of these techniques to identify cell types; SSE inherits three main advantages over the compared methods.
First, it does not need to decide the parameter k in the KNN algorithm, owing to the combination with multikernel similarity learning. Second, SSE can cluster scRNA-seq data without prior knowledge of the true number of clusters. Third, SSE does not need to adjust model parameters, as it uses the default parameter values from SIMLR.

Figure 1. The mechanism of the SSE (single-cell structure entropy minimization principle) algorithm. The input is a gene expression matrix. The SSE algorithm includes three steps: (1) The similarity is calculated by multikernel learning; (2) the cell network is constructed by KNN (k nearest neighbor); (3) clustering is implemented using the structure entropy minimization principle. Lastly, gene priority ranking results are produced as output.

Cell-to-Cell Similarity Measurement
Cell-to-cell similarity measurement plays an important role in cell sample clustering. The common similarity calculation methods are as follows: Euclidean distance, Spearman correlation, Pearson correlation coefficient, Jaccard similarity, Minkowski distance, and so on. Beyond that, some researchers proposed novel methods for distance or similarity calculation, such as Kiselev et al. [38] and Wang et al. [39]. We calculated the cell-to-cell similarity by a kernel-based learning method, proposed by Wang et al. [39], which overcomes the problem that some distance measures, such as the Minkowski distance, are affected by the data distribution. We chose this similarity measurement method mainly for the following reasons: First, SIMLR has recently been cited and considered an efficient similarity measurement method [49][50][51]. Second, SIMLR had the following main advantages for similarity measurement: (1) It provided a distance metric by combining multiple kernels; (2) it employed a rank constraint to address dropout events, enforcing a block structure and obtaining a more accurate similarity matrix for the downstream steps; (3) the parameters of SIMLR were (k,σ) and the number of kernels, and the empirical results showed that it was insensitive to these parameters.
Here, given a gene expression matrix as an input, rows correspond to cells, while columns correspond to genes. Multikernel learning was used to calculate the distance between the cells and construct a similarity matrix in the following two steps [39]: (1) To compute the distance between a pair of cells, the distance formula was detailed in the literature, in which each weight value described the importance of each kernel. Gaussian kernels were used here, and each kernel was decided by a parameter pair (k,σ). The experiments showed that the method was insensitive to the parameter pair. The parameter pair was set to default values.

(2) To construct a similarity matrix based on an optimization framework over S, L, and w, where S is a similarity matrix, L is an N×C rank-enforcing matrix, and w is the weight of kernels, the optimization algorithm was detailed in the literature.
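As an illustration of the multiple-kernel idea (not SIMLR's full optimization over S, L, and w, which is detailed in [39]), the following minimal numpy sketch combines several Gaussian kernels over pairwise cell distances into one similarity matrix; the function name, the σ values, and the uniform kernel weights are all illustrative assumptions.

```python
import numpy as np

def multikernel_similarity(X, sigmas=(1.0, 2.0, 4.0), weights=None):
    """Toy sketch: cell-to-cell similarity as a weighted sum of Gaussian
    kernels over pairwise Euclidean distances (illustrative only)."""
    # pairwise squared Euclidean distances between cells (rows of X)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d2 = np.maximum(d2, 0.0)
    if weights is None:
        # uniform kernel weights; SIMLR learns these weights instead
        weights = np.full(len(sigmas), 1.0 / len(sigmas))
    S = np.zeros_like(d2)
    for w, sigma in zip(weights, sigmas):
        S += w * np.exp(-d2 / (2 * sigma**2))
    return S

# tiny usage example: 4 "cells" x 3 "genes"
X = np.array([[1., 0., 0.], [1., 0.1, 0.], [0., 1., 1.], [0., 0.9, 1.1]])
S = multikernel_similarity(X)
```

Combining kernels with different bandwidths makes the similarity less sensitive to any single choice of σ, which is one reason multikernel learning is robust to parameter settings.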

Cell Network Construction
KNN is a popular method for its significant ability to present network structure and its simple implementation. Here, we used the popular KNN algorithm [52]. Because the result matrix of multikernel similarity learning was a sparse matrix, which retained the nodes with larger similarity, we did not need to test a special value of k, and kept all the edges to construct a graph. We constructed a weighted undirected cell network G = (V, E). Suppose that c_1, c_2, ..., c_n were n cells, and g_1, g_2, ..., g_m were m genes. We denoted the input gene expression matrix X = [x_ij], with rows representing cells and columns representing genes. Thus, its ith row and jth column corresponded to cell c_i and gene g_j, respectively.
The algorithm for constructing the cell network is as follows: (1) For each i from 1 to n, the vector (x(i, 1), x(i, 2), ..., x(i, m)) represented the gene expression of cell c_i, with the gene number j running from 1 to m; the sample x(i, :) was one node of network G. (2) For each pair of nodes x(i, :) and x(i', :) with a nonzero entry S(i, i') in the sparse similarity matrix, an edge was added with weight S(i, i'). (3) For each i from 1 to n, all edges adjacent to x(i, :) were reserved.
In the traditional KNN method, the choice of the value of k is a challenge. Wang et al. chose k = 3 based on experimental experience. Li et al. [53] used the one-dimensional structure entropy minimization principle to determine the value of k, but this method sometimes failed to find k for some scRNA-seq data. In our method, the value of k did not need to be specified by testing an empirical value, as follows from the above analysis; the details are described later in the article.
The pseudocode of the similarity-construction step (Algorithm 1) is as follows:

Algorithm 1: Construction of the sparse similarity matrix S
Input: gene expression matrix X
1: for each pair of cells x(i,:) and x(i',:) do
2:     calculate the distance between x(i,:) and x(i',:) using the SIMLR algorithm;
3: end for
4: here, we get the similarity matrix for the cells, denoted w(i,i');
5: for s = 1, 2, ..., n do
6:     reserve the values retained by SIMLR's sparsification in w(s,:), and set 0 for the other values;
7: end for
8: here, we get the sparse matrix, denoted S(i,i');
Output: S(i,i'), which will be used to construct the graph in the SSE algorithm
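The sparsification and graph-construction steps can be sketched as follows; this is a simplified numpy stand-in, where the helper name and the explicit top-k rule are assumptions (the paper keeps the edges already retained by SIMLR's sparse output):

```python
import numpy as np

def knn_graph(S, k=10):
    """Sketch of the network-construction step: keep, for each cell,
    the edges to its k most similar cells, then symmetrize so the
    network is undirected (illustrative helper, not the paper's code)."""
    n = S.shape[0]
    A = np.zeros_like(S)
    for i in range(n):
        sims = S[i].copy()
        sims[i] = -np.inf               # exclude self-loops
        nbrs = np.argsort(sims)[-k:]    # indices of the k largest similarities
        A[i, nbrs] = S[i, nbrs]
    # symmetrize: keep an edge if either endpoint selected it
    return np.maximum(A, A.T)

# usage on a toy symmetric similarity matrix
rng = np.random.default_rng(0)
S = rng.random((6, 6))
S = (S + S.T) / 2
np.fill_diagonal(S, 1.0)
A = knn_graph(S, k=2)
```

The resulting weighted adjacency matrix A is exactly the kind of object the structure entropy step below operates on.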

Cell Types Identification
Entropy can be used as a metric of object uncertainty, as well as of the information needed to determine an event. The smaller the entropy is, the more orderly the system is. According to Shannon's entropy function, entropy is defined as follows:

H = −∑_i p_i log2 p_i,

where p_i is the probability that event i occurs, with ∑_i p_i = 1. Here, −log2 p_i bits are needed to represent a variable that can take one of 1/p_i values, if 1/p_i is a power of 2.
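Shannon's entropy as defined above can be computed directly; a small Python sketch:

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits: H = -sum_i p_i * log2(p_i)."""
    assert abs(sum(p) - 1.0) < 1e-9, "probabilities must sum to 1"
    # terms with p_i = 0 contribute nothing (lim p->0 of p*log p is 0)
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# a uniform variable over 4 outcomes needs log2(4) = 2 bits;
# a certain event (p = 1) needs 0 bits
```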
In a cell network, communities can be detected when entropy is minimized. However, entropy does not have enough information to measure the complexity of a network, so additional information needs to be added. In order to address this issue, we employed the structure entropy minimization principle proposed by Li et al. [53]. The principle of graph structure entropy and the criteria used for partitioning the overall network into cell subpopulations are described as following. The detail of structure entropy definition and minimization can be found in [53].
The graph structure entropy can provide a measure of the dynamical complexity of the network. For a graph G, the k-dimensional structural entropy is defined as the fewest bits needed to describe the k-dimensional space information of a node, obtained from a random walk on G. To detect the natural communities, the two-dimensional graph structural entropy is defined as the average number of bits required to determine the code (i, j) of a node.
Suppose that Z = {X_1, X_2, ..., X_L} was a partition of the node set V, and each X_l was defined as a community in graph G. Then, the pair (i, j) encoded a node v, in which i was the code of v in its local community X_l, and j was the code of community X_l in the global V. From the abovementioned, the structure entropy under partition Z was defined as Equation (2):

H^Z(G) = −∑_{l=1}^{L} (Vol_l / Vol(G)) ∑_{i=1}^{n_l} (d_i^l / Vol_l) log2 (d_i^l / Vol_l) − ∑_{l=1}^{L} (e_l / Vol(G)) log2 (Vol_l / Vol(G)), (2)

where L was the number of communities X_l in Z, n_l was the number of nodes in community X_l, d_i^l was the degree of the ith node of X_l, Vol_l was the sum of the degrees of the nodes in community X_l, Vol(G) was the sum of the degrees of all nodes in G, and e_l was the number of edges with just one endpoint in community X_l. The structure entropy of graph G was defined as Equation (3), and minimizing the structure entropy of the graph achieves the natural community structure of the network:

H(G) = min_Z H^Z(G), (3)

where Z runs over the partitions of G.
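The two-dimensional structure entropy of a partitioned graph, as defined in Equation (2), can be sketched as follows (a pure-Python illustration, not the Java implementation from [53]). The toy graph of two triangles joined by a bridge shows that the natural two-community partition has lower structure entropy than an unbalanced split:

```python
import math

def structure_entropy_2d(edges, partition):
    """Two-dimensional structure entropy of an undirected graph under a
    given node partition, following Equation (2) (illustrative sketch)."""
    # degrees and total volume (sum of degrees = 2 * |E|)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    vol_G = sum(deg.values())
    H = 0.0
    for block in partition:
        block = set(block)
        vol_l = sum(deg[u] for u in block)
        # e_l: edges with exactly one endpoint inside the community
        e_l = sum(1 for u, v in edges if (u in block) != (v in block))
        # intra-community uncertainty term
        H -= (vol_l / vol_G) * sum(
            (deg[u] / vol_l) * math.log2(deg[u] / vol_l) for u in block)
        # inter-community (boundary) term
        H -= (e_l / vol_G) * math.log2(vol_l / vol_G)
    return H

# two triangles {0,1,2} and {3,4,5} joined by the bridge (2,3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good = structure_entropy_2d(edges, [{0, 1, 2}, {3, 4, 5}])
bad = structure_entropy_2d(edges, [{0, 1, 2, 3}, {4, 5}])
```

Searching over partitions for the minimizer of this quantity is what yields the natural communities without a preset cluster number.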

Feature Gene Selection
In the gene expression matrix, each gene is an attribute of a cell. The gene expression values contribute to clustering cells and affect the result significantly due to the high dimensionality. Some methods implement dimension reduction, i.e., gene feature extraction, to get better clustering results. Nevertheless, bias can be introduced and relevant genes may be dropped. Technical and biological noise can also lead to poor results; for example, the first few components of principal component analysis (PCA) alone may not distinguish subpopulations unambiguously [54,55]. Our approach differed from those methods in that the feature genes were selected to obtain the marker genes after clustering. We computed the average of each gene's expression values in every community to determine which community a gene belongs to. Then, the genes in a community were sorted in descending order by gene expression value. The top k genes were selected as the marker genes relevant to the subpopulation.
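The feature-gene step described above can be sketched as follows; the function name and the toy matrix are illustrative, and the paper's actual marker selection may differ in detail:

```python
import numpy as np

def top_marker_genes(X, labels, top_k=3):
    """Assign each gene to the community where its mean expression is
    highest, then rank genes within each community by that mean
    (illustrative sketch of the feature-gene selection step)."""
    labels = np.asarray(labels)
    communities = np.unique(labels)
    # mean expression of each gene per community: (n_communities, n_genes)
    means = np.vstack([X[labels == c].mean(axis=0) for c in communities])
    owner = means.argmax(axis=0)  # community index each gene belongs to
    markers = {}
    for ci, c in enumerate(communities):
        genes = np.where(owner == ci)[0]
        order = np.argsort(means[ci, genes])[::-1]  # descending mean
        markers[c] = genes[order][:top_k].tolist()
    return markers

# toy example: 4 cells x 2 genes, two clusters; gene 0 marks cluster 0,
# gene 1 marks cluster 1
X = np.array([[5., 0.], [4., 0.], [0., 5.], [0., 4.]])
markers = top_marker_genes(X, [0, 0, 1, 1], top_k=1)
```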

Time Complexity Analysis
The most time-consuming step of SSE is clustering using two-dimensional structure entropy minimization, which requires O(n^2) time, where n is the number of cells. Since the number of cells is usually far less than the number of genes, this step is still fast. In addition, the time complexity of constructing the cell network is O(n) using a KNN graph. For the optimization framework that solves for S, L, and w iteratively in the similarity measurement step, the time complexity is O(Tkn), where T is the number of iterations and k is the number of neighbors.

Datasets Description
Single-cell RNA-seq data based on cell type differentiation are crucial for understanding cell lineage relationships and predicting the relationship between diseases and treatments. Thus, we executed SSE on eight test datasets, which are summarized in Table 1. These datasets were downloaded from EMBL-EBI (https://www.ebi.ac.uk/) or the NCBI Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo/), among others.

Experiments and Results
To demonstrate the performance of the proposed method SSE, we carefully compared it with three unsupervised learning methods for scRNA-seq data analysis: nonnegative matrix factorization (NMF), SIMLR, and the structural entropy (SE) minimization principle. All these algorithms were run on Windows 7. To perform SSE, we used R code to implement the similarity matrix via multikernel learning algorithms, which are given in detail in [39]. We also used Java code to implement the structural entropy minimization principle algorithms, which are given in detail in [53]. The heat maps were drawn with the matplotlib package in Python (version 2.7.12) [64].

Performance Evaluation
To make the comparison fair, we ran all methods on the commonly used eight datasets, which have been analyzed by other methods. In the same way, we compared these methods based on two evaluation metrics: normalized mutual information (NMI) and adjusted Rand index (ARI). The true number of populations, referred to as the 'gold standard' cluster number, was applied when computing the NMI and ARI values. The number of categories of the datasets was selected on the basis that one could be highly confident in the cell labels, as they represent cells from different conditions or lines, and thus we considered them the 'gold standard'. The 'gold standard' cluster number of each testing dataset is shown in Table 1.
NMI [65] is commonly used to evaluate the consistency between the obtained cluster results and the true labels of the cells. NMI is defined as follows:

NMI(X, Y) = I(X; Y) / sqrt(H(X) H(Y)),

where I(X; Y) is the mutual information between clusterings X and Y, and H(X) is the entropy of clustering X; here, I(X; Y) = ∑_x ∑_y p(x, y) log2 (p(x, y) / (p(x) p(y))), where p(x, y) is the joint probability distribution function of x and y, and p(x_i) is the probability distribution function of x_i.

ARI [37] is commonly used to evaluate the agreement between the predicted clusters and the true categories. The underlying Rand index is

RI = (a + d) / (a + b + c + d),

where a, b, c, and d are calculated as follows: a is the number of pairs of objects placed in the same group in X and in the same group in Y; b is the number of pairs placed in the same group in X and in different groups in Y; c is the number of pairs placed in the same group in Y and in different groups in X; d is the number of pairs placed in different groups in both X and Y; and n is the number of elements (cells). The overlap between X and Y can be formed into a contingency table, where n_ij are the values of the contingency table, a_i is the sum of the ith row, and b_j is the sum of the jth column. ARI is then defined as

ARI = [∑_ij C(n_ij, 2) − (∑_i C(a_i, 2) ∑_j C(b_j, 2)) / C(n, 2)] / [(1/2)(∑_i C(a_i, 2) + ∑_j C(b_j, 2)) − (∑_i C(a_i, 2) ∑_j C(b_j, 2)) / C(n, 2)],

where C(·, 2) denotes the binomial coefficient.
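The contingency-table formulation of ARI described above can be sketched in pure Python; scikit-learn's adjusted_rand_score computes the same quantity, and the function name here is illustrative:

```python
import numpy as np
from math import comb

def ari(x, y):
    """Adjusted Rand index from the contingency table n_ij, with row
    sums a_i and column sums b_j (illustrative sketch)."""
    x, y = np.asarray(x), np.asarray(y)
    xs, ys = np.unique(x), np.unique(y)
    n = len(x)
    # contingency table: n_ij = objects in group xi of X and yj of Y
    table = np.array([[np.sum((x == xi) & (y == yj)) for yj in ys]
                      for xi in xs])
    sum_ij = sum(comb(int(nij), 2) for nij in table.ravel())
    sum_a = sum(comb(int(ai), 2) for ai in table.sum(axis=1))  # rows a_i
    sum_b = sum(comb(int(bj), 2) for bj in table.sum(axis=0))  # cols b_j
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that ARI is invariant to label permutations: a clustering that matches the truth up to relabeling scores 1, while random assignments score near 0 (and can be negative).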
We compared the performance of our method SSE to NMF, SIMLR, and the structural entropy minimization principle (SE) in terms of NMI and ARI. The results of NMI are listed in Table 2, while the results of ARI are listed in Table 3. It is worth mentioning that all of these methods were performed with default parameters, without any parameter optimization. The parameter pair (k, σ) of SIMLR was set to default values. SE also had a parameter σ' (different from that of SIMLR), with σ' defaulting to 1/2n; the number of clusters calculated by SE, denoted as k', depends on σ' through one-dimensional structure entropy minimization. For SE, when k' could not be easily determined at the default value of σ', different σ' values in {1/n, 2/3n, 1/2n, 2/5n, 1/3n} were tested to determine k'. Taken together, the above results indicated that SSE was a robust method with the best average performance, which can be applied in clustering analysis to identify cell types. In particular, these results provided evidence that SSE was a simple and promising tool for clustering analysis that did not require complex parameter adjustment, including the value of k. Meanwhile, we used the Mann-Whitney U test, a commonly used nonparametric test, to assess whether our method significantly outperformed the others. The results showed that the improvement is not statistically significant. However, it should be noted that the improvement varies considerably across datasets. For example, our method achieved much better results on the Biase data, but the improvement was less pronounced on the Chung data; on the Treutlein data, our method performed worse than the others.
To describe the overlap and relationship of the four methods, comparisons of the cluster results between SSE and NMF, SIMLR, and SE in terms of NMI and ARI were calculated, and the results are shown in Supplementary Tables S1 and S2.

Cluster Result Analysis
To represent and analyze the cluster results, the true types and cluster heat maps of the eight datasets are provided, giving a visualization of how these cell samples are clustered, as shown in Figures 2-9, in which (a) is the heat map of the true types with labels and (b) is the heat map of the cluster result using the SSE method. The x-coordinate represents the cell samples, the y-coordinate represents the gene expression values, and the top horizontal line marks the number of categories.
According to the heat maps, we found that our method could cluster the samples unambiguously. The cluster numbers were marked above the top horizontal line. Clear blocks appear in the diagrams; each block was the highly expressed gene set in one cluster, that is, a feature gene set. Moreover, we observed that SSE produced different cluster numbers than the other competing methods. The detailed cluster number results are shown in Table 4. This phenomenon was especially obvious in the Patel and Chung datasets. For the Patel dataset, the gold standard number was 5, while it was 15 in the SSE result; meanwhile, compared to the other methods, SSE achieved the best NMI value of 0.599. For the Chung dataset, the gold standard number was 4, while it was 21 in the SSE result; meanwhile, compared to the other methods, SSE achieved the best NMI value of 0.334.

PCA is a popular tool to identify subgroups from scRNA-seq data, of which the first two components are commonly used for visualization [58]. The first two components capture the highest percentage of variance, which means they carry the most information, so we used them to visualize the eight datasets after binary log-transformation and centering of the scRNA-seq data. The scatter diagram of the eight datasets by PCA is shown in Figure 10. In the experiments, each sample point in the same category was assigned the same color according to its true label. From Figure 10, some remarkable phenomena can be observed: (1) Limited by the different inherent attributes of each dataset, the performance of the PCA method varied greatly over the datasets. Note that the Biase dataset was clustered clearly into three groups, which was in accordance with the true clusters.
However, the PCA method did not work well in the other datasets with higher heterogeneity; (2) SSE had excellent clustering performance in both the Biase and Pollen datasets, i.e., several block structures were revealed in the gene map, which indicated that SSE better discovered the true clusters. More blocks can be observed in the other five datasets from the gene maps; this phenomenon is particularly visible in the Patel and Chung datasets. Because SSE and SE, unlike NMF and SIMLR, do not take the cluster number as input, they found more or fewer clusters based on the scRNA-seq data; this aspect deserves further investigation; (3) the marker genes in each cluster could be specified explicitly via the SSE method, whereas the PCA method could not provide them. Finally, we observed that some datasets were clearly separated, such as the Biase dataset, while most datasets were indistinguishable.
Moreover, to describe the results of dimensionality reduction more fully, we applied another nonlinear dimensionality reduction method, t-SNE (t-distributed stochastic neighbor embedding). The scatter diagram of the eight datasets by t-SNE can be found in Supplementary Figure S1. To better spot possible clustering, we also present the visualization of single cells in 3D space using the first three principal components (Supplementary Figure S2).
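The PCA projection used for the visualizations above (binary log-transformation, centering, then projection onto the leading principal components) can be sketched with numpy alone; the function name and the toy count matrix are illustrative:

```python
import numpy as np

def pca_2d(counts):
    """First two principal components of a log-transformed, centered
    expression matrix (numpy-only sketch of the visualization step)."""
    X = np.log2(counts + 1)            # binary log-transformation
    X = X - X.mean(axis=0)             # center each gene
    # SVD of the centered matrix; rows of Vt are the principal axes,
    # sorted by decreasing singular value
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                # project cells onto the first two PCs

# toy counts: 20 cells x 50 genes
rng = np.random.default_rng(1)
counts = rng.poisson(5, size=(20, 50)).astype(float)
Y = pca_2d(counts)
```

Each row of Y is one cell as a 2D point, which can then be colored by its true label (e.g., with matplotlib's scatter); taking Vt[:3] instead gives the 3D view of Supplementary Figure S2.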

Discussion
Single-cell RNA-seq data pose a challenge to clustering approaches for exploring new cell subtypes and rare cell populations without prior knowledge. Scialdone et al. clustered mouse embryonic stem cells, but suffered from the limitation of depending on known data as a training dataset; in fact, most datasets lack prior knowledge. In addition, as similarity calculation plays an important role in clustering results, complex similarity measurement algorithms have been designed to obtain highly accurate clusters. Here, we explored graph theory and the structure entropy minimization principle for the purpose of subgroup identification in scRNA-seq data. Instead of using conventional hierarchical clustering, we focused on minimizing the structure entropy to find the natural communities in cell networks. We found that SSE correctly clustered cells into biologically meaningful subgroups. Compared to NMF, SIMLR, and SE, SSE could produce cluster results as stable communities that were straightforward to interpret. Remarkably, SSE performed well even without prior dimension reduction, such as feature gene extraction using PCA.
As can be seen from our analysis, in the SSE method, we constructed cell networks using KNN, as Xu et al. did. However, Xu et al. had to adjust a set of parameters k, r, and m to improve cluster performance. In contrast, SSE had only the parameter pair (k,σ) of SIMLR, kept at its default values. Beyond that, there were no other parameters to be adjusted in the network construction and clustering steps.
In addition, SSE proved very robust when applied to scRNA-seq datasets. By analyzing eight datasets, we found that SSE showed the best average performance in terms of NMI and ARI compared to the three competing approaches. In conclusion, our study showed that SSE is an effective and robust clustering method for scRNA-seq datasets.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/10/2/98/s1, Figure S1: The scatter diagram of eight datasets by t-SNE, Figure S2: The three-dimensional scatter diagram of eight datasets by PCA, Table S1: Cluster results comparison between SSE and NMF, SIMLR, SE in terms of NMI, Table S2: Cluster results comparison between SSE and NMF, SIMLR, SE in terms of ARI.