Single Cell Self-Paced Clustering with Transcriptome Sequencing Data

Single cell RNA sequencing (scRNA-seq) allows researchers to explore tissue heterogeneity, distinguish unusual cell identities, and find novel cellular subtypes by providing transcriptome profiling for individual cells. Clustering analysis is usually used to predict cell class assignments and infer cell identities. However, the performance of existing single-cell clustering methods is extremely sensitive to the presence of noisy data and outliers, existing clustering algorithms can easily fall into local optima, and there is still no consensus on the best performing method. To address these issues, we introduce a single cell self-paced clustering (scSPaC) method with F-norm based nonnegative matrix factorization (NMF) and a sparse single cell self-paced clustering (sscSPaC) method with l2,1-norm based nonnegative matrix factorization for scRNA-seq data. We gradually add single cells to our model, from simple to complex, until all cells are selected. In this way, the influence of noisy data and outliers can be significantly reduced. The proposed methods achieved the best performance on both simulated data and real scRNA-seq data. A case study on clustering human clara cells and ependymal cells shows that scSPaC is more advantageous near the dividing line between clusters.


Introduction
Single cell RNA sequencing (scRNA-seq) is a powerful new approach for studying the transcriptomes of cell lines, tissues, tumors and disease states. The use of scRNA-seq has already yielded key biological insights and discoveries, such as a better understanding of cancer tumor heterogeneity [1]. In recent years, advances in scRNA-seq have promoted the study of computational methods for analyzing transcriptome data from single cells. Since prior knowledge about the identities of sequenced cells is only partial, cluster analysis is usually used to discover cell subtypes or to distinguish and better characterize known cell subtypes [2]. However, the analysis methods are typically complex, and the user is often simply given a visual representation of the data with no assessment of the robustness of the groupings.
Unlike bulk RNA-seq data, single cell RNA-seq data are more sparse and have a high dropout rate, which makes clustering very challenging. Recently, several methods and tools have been developed for single cell RNA-seq clustering. K-means is used in several approaches for analyzing scRNA-seq data. Single cell analysis via iterative clustering (SAIC) [3] combines K-means and analysis of variance in rounds of grouping single cells, followed by signature gene identification. Single cell clustering using bifurcation analysis (SCUBA) [4] divides cells into two groups at each time point using K-means, and then utilizes gap statistics to locate bifurcation events. The method in [5] uses non-negative matrix factorization to incorporate information from a larger annotated dataset and then applies transfer learning to perform the clustering. Clustering through imputation and dimensionality reduction (CIDR) uses hierarchical clustering to perform data imputation before clustering a principal component analysis (PCA)-reduced representation [6]. Semisoft clustering with pure cells (SOUP) can handle both pure and transitional cells and computes soft cluster memberships using the expression similarity matrix [7]. Maaten et al. [8] introduced a novel embedding algorithm named t-distributed stochastic neighbor embedding (t-SNE); t-SNE is a dimensionality reduction method that may also be used to classify single cells. The spectral clustering (SC) algorithm finds a low-dimensional embedding of data by calculating the eigenvectors of the constructed Laplacian matrix [9] and is one of the most widely used algorithms for data clustering. Hu et al. [10] proposed a new low-rank matrix factorization model for scRNA-seq data clustering based on sparse optimization. Wang et al.
[11] developed a novel single cell interpretation via multi-kernel learning (SIMLR) method, which constructs the similarity matrix by fusing multiple Gaussian kernel functions and clusters the single cells by applying spectral clustering to the similarity matrix. To characterize the sparsity of scRNA-seq data, Park et al. [12] improved the SIMLR method by integrating doubly stochastic affinity matrices and sparse structure constraints to cluster single cells.
Self-paced learning (SPL) [13] is a novel machine learning framework that has recently gained a lot of interest. The concept is based on the principle that individuals learn better when they begin with simple knowledge and work their way up to more complicated knowledge. Bengio et al. introduced curriculum learning (CL) [14] to formalize this idea in machine learning. Subsequently, Kumar et al. [13] proposed SPL, which automates the curriculum design by including a self-paced regularization term in the objective function. Whether an instance is treated as simple or complex depends on its loss under the current parameter values. The capacity of SPL to avoid undesirable local minima, and thus attain superior generalization ability, has been shown empirically [13,15-18]. The authors of [19] used SPL to solve non-convex problems caused by feature destruction techniques. Traditional clustering algorithms are either easily caught in local optima or susceptible to outliers and noisy data [20-23]. Ren et al. [22] proposed a self-paced multi-task clustering (SPMTC) method to address these issues in multi-task clustering. Yu et al. [23] proposed a self-paced learning based K-means clustering method. To deal with the non-convex problem in multi-view clustering, DSMVC [24] uses self-paced learning. Therefore, SPL is often used to find better solutions for non-convex problems.
Due to the non-convexity of nonnegative matrix factorization (NMF) models for scRNA-seq clustering, these models easily converge to poor local solutions. In this study, we introduce a single cell self-paced clustering (scSPaC) model and a sparse (l2,1-norm based) single cell self-paced clustering (sscSPaC) model. Specifically, single cells are gradually incorporated into the NMF process from simple to complex, which draws on the advantages of SPL and has been shown to help models avoid falling into local minima. In our other model, sscSPaC, the l2,1-norm is used, which reduces the effects of noise and outliers. To verify the effectiveness of the introduced methods, we conducted comparative experiments on simulated data and real scRNA-seq data. The workflow of this study is shown in Figure 1, including data preprocessing, clustering and visualization. The number of clusters is searched within a reasonable range (determined by an existing tool, SCANPY), and we discuss the impact of the cluster number on model performance in Section 3.3.

Figure 1. Workflow for single cell self-paced clustering (scSPaC) and sparse single cell self-paced clustering (sscSPaC), including data preprocessing, clustering and visualization. The pentagram represents the cluster center.

Datasets
To illustrate the efficacy of the two novel scRNA-seq clustering algorithms, we compared their performance with that of baseline methods on simulated and real single cell data. We generated simulated data to evaluate the clustering performance of scSPaC. Splatter [25], a tool commonly used to generate scRNA-seq data, was utilized to generate the experimental data. Simulation data were obtained from two classes with 100 single cells per class, and each cell contains 22,002 genes. The real datasets are the following: baron [26], kolodziejczyk [27], pollen [28], rca [29], goolam [30], zeisel [31], and cell lines [32], which includes a mixture of 1047 cultured human BJ, H1, K562 and GM12878 cells. The statistical information of all datasets used in this study is shown in Table 1. The datasets contain 2-14 cell types, and the number of cells in each dataset ranges from 124 to 3500. The number of genes in each of these datasets exceeds 10,000, with a maximum of 32,316 genes.

Data Preprocessing
Raw scRNA-seq read count data are sparse and high-dimensional, which makes subsequent statistical analysis challenging [33]. Therefore, we needed to preprocess the raw matrix data. The raw data were pre-processed with the Python package Scanpy [34] as follows:
Step 1: Genes with no count in any cell were filtered out.
Step 2: We filtered genes that were not expressed in almost all cells.
Step 3: The top N highly variable genes (HVGs) were selected; by default, 1000 highly variable genes were selected. In Section 3.2, we discuss the influence of different values of N on clustering accuracy.
Step 4: Finally, the read counts were log-transformed and scaled, so that each gene has zero mean and unit variance.
The pre-processed read count matrix was treated as the input for our scSPaC model and the other algorithms.
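The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code: the `min_cells` threshold and the variance-of-log-counts HVG ranking are simplified stand-ins for Scanpy's filters, and the helper name `preprocess` is ours.

```python
import numpy as np

def preprocess(counts, n_hvg=1000, min_cells=3):
    """Minimal sketch of Steps 1-4 on a raw count matrix
    `counts` of shape (n_cells, n_genes)."""
    counts = np.asarray(counts, dtype=float)
    # Steps 1-2: drop genes with no counts, then genes expressed in
    # fewer than `min_cells` cells (illustrative threshold).
    keep = (counts > 0).sum(axis=0) >= max(min_cells, 1)
    counts = counts[:, keep]
    # Step 3: rank genes by the variance of their log counts and keep
    # the top-N highly variable genes.
    logc = np.log1p(counts)
    hvg = np.argsort(logc.var(axis=0))[::-1][:n_hvg]
    logc = logc[:, hvg]
    # Step 4: scale every gene to zero mean and unit variance.
    mu = logc.mean(axis=0)
    sd = logc.std(axis=0)
    return (logc - mu) / np.where(sd > 0, sd, 1.0)
```

In the paper itself these steps are carried out with Scanpy; the sketch only makes the order of operations explicit.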

scSPaC Model
Consider a log-transformed count matrix X = [x_1, x_2, ..., x_n] ∈ R^{m×n}, where n is the number of cells and m is the number of genes. Nonnegative matrix factorization (NMF) [35] aims to find two nonnegative matrices U ∈ R^{m×r} and V ∈ R^{r×n} that minimize the following objective function:

$$\min_{U \ge 0,\, V \ge 0} \|X - UV\|_p, \tag{1}$$

where $\|\cdot\|_p$ is the p-norm. X_{ij} denotes the expression of the i-th gene in the j-th cell, V can be regarded as the new representation of the original data with respect to the new basis U, and r is the number of components of U and V. Lee et al. [35] proposed an algorithm that iteratively updates U and V to optimize the objective in Equation (1). It adopts the Frobenius norm (F-norm) NMF model, which is sensitive to noisy data [36,37]. Recently, the authors of [36] proposed robust NMF methods based on the l2,1-norm. Compared with F-norm NMF, l2,1-norm NMF is robust to noisy data, since the non-squared residuals reduce the effects of outliers [36]. To mitigate the tendency of the NMF model to fall into a local optimum, we introduce an SPL regularization term into the NMF model for scRNA-seq clustering:

$$\min_{U \ge 0,\, V \ge 0,\, w} \|(X - UV)\,\mathrm{diag}(w)\|_p + f(\lambda, w), \tag{2}$$
where diag(w) denotes a diagonal matrix with the i-th diagonal element being w_i, and f(λ, w) is a self-paced regularizer. One of the simplest regularization functions f(λ, w) is shown in Equation (3): Kumar et al. [13] proposed to let w_i ∈ {0, 1} and define

$$f(\lambda, w) = -\lambda \sum_{i=1}^{n} w_i. \tag{3}$$

Then, the optimal w* can be calculated by

$$w_i^* = \begin{cases} 1, & l_i < \lambda, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

where l_i is the loss of the i-th cell. Since each w_i (i = 1, ..., n) is either 1 or 0, the strategy above can be treated as hard weighting. λ > 0 is initially tuned to a small value such that only the single cells with small loss values are selected into the clustering model. As λ increases, more and more cells are selected, until all cells are chosen.
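The hard weighting rule has a one-line closed form. A minimal implementation (the helper name `hard_weights` is ours, not from the paper):

```python
import numpy as np

def hard_weights(losses, lam):
    """Closed-form hard SPL weighting: cell i enters the model
    (w_i = 1) iff its current loss l_i is below the age parameter
    lambda; otherwise it is left out (w_i = 0)."""
    return (np.asarray(losses, dtype=float) < lam).astype(float)
```

Raising `lam` between rounds admits cells with progressively larger losses, which is exactly the "simple to complex" schedule described above.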
In Equation (2), if the p-norm is specified as the F-norm, we name the single cell clustering model scSPaC; this strategy has been successfully applied in the field of face recognition [38]. If the p-norm is specified as the sparse l2,1-norm, the model is named sscSPaC.
The core idea of scSPaC and sscSPaC introduced in this work is to gradually select cells for decomposition from simple to complex.
Reference [39] showed that, with the l2,1-norm, Equation (2) can be rewritten by simple algebra as

$$\min_{U \ge 0,\, V \ge 0} \|(X - UV)W\|_{2,1} + f(\lambda, w), \tag{5}$$

where W is a diagonal matrix with W_ii = w_i.
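As a quick numerical sanity check on this rewriting (a hypothetical random example, not data from the paper), the l2,1-norm of the column-weighted residual equals the weighted sum of per-cell residual norms:

```python
import numpy as np

# Small random instance: right-multiplying by W = diag(w) scales the
# i-th residual column by w_i, so the column-wise l2,1-norm picks up
# exactly the per-cell weights.
rng = np.random.default_rng(0)
m, n, r = 10, 8, 3
X = rng.random((m, n))
U = rng.random((m, r))
V = rng.random((r, n))
w = rng.random(n)

R = (X - U @ V) * w                      # (X - UV) W, W = diag(w)
l21 = np.sum(np.linalg.norm(R, axis=0))  # l2,1: sum of column l2-norms
per_cell = np.sum(w * np.linalg.norm(X - U @ V, axis=0))
assert np.isclose(l21, per_cell)
```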

Optimization
We utilize an iterative updating algorithm to solve the optimization problem of scSPaC and sscSPaC. Specifically, we iteratively optimize each variable in the objective function while fixing the other variables.
Step 1: Fix w, update U and V.
When w is fixed, f(λ, w) in Equation (2) is a constant, and solving Equation (2) is equivalent to solving a weighted version of the original NMF model in Equation (1). Thus, we can update the model parameters U and V iteratively.
(a) Update U and V for the scSPaC model. For Equation (1), Lee et al. [35] proposed multiplicative rules that iteratively update U and V:

$$U \leftarrow U \odot \frac{XV^{\top}}{UVV^{\top}}, \qquad V \leftarrow V \odot \frac{U^{\top}X}{U^{\top}UV},$$

where ⊙ and the fractions denote element-wise multiplication and division.
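These multiplicative updates can be sketched in a few lines of NumPy; the random initialization, iteration count and the small ε added to denominators are illustrative choices:

```python
import numpy as np

def nmf_updates(X, r, n_iter=200, eps=1e-10, seed=0):
    """Minimal sketch of Lee and Seung's multiplicative updates for
    min ||X - UV||_F^2 with U, V >= 0. X: (m genes, n cells)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, r))
    V = rng.random((r, n))
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # U <- U . XV^T / UVV^T
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # V <- V . U^T X / U^T UV
    return U, V
```

With hard weights, the scSPaC inner problem reduces to these updates restricted to the currently selected cells.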
(b) Update U and V for the sscSPaC model. For Equation (5), we adopt the following update rules [39]:

$$U \leftarrow U \odot \frac{XDV^{\top}}{UVDV^{\top}}, \qquad V \leftarrow V \odot \frac{U^{\top}XD}{U^{\top}UVD},$$

where D is a diagonal matrix with $D_{ii} = w_i / \|x_i - Uv_i\|_2$.

Step 2: Fix U and V, update w.
With the parameters U and V fixed, the weight vector w is updated by

$$w^* = \arg\min_{w} \sum_{i=1}^{n} w_i l_i + f(\lambda, w),$$

where the loss $l_i = \|x_i - Uv_i\|^2$ from Equation (2) is a constant. We can observe from Equation (3) that SPL chooses single cells based on their loss values and the parameter λ. We assign weights and gradually choose single cells from simple to complex. For the single cell clustering problem, we define which samples are easy and which are hard as follows: a single cell close to its own cluster center (i.e., far from the other cluster centers) is easy to cluster and is preferentially selected into the clustering model. To this end, we utilize a new SPL regularization term.
The regularization term is defined as

$$f(\lambda, \zeta, w) = -\zeta \sum_{i=1}^{n} \ln\!\left(w_i + \frac{\zeta}{\lambda}\right), \tag{11}$$

and the optimal w* is computed by

$$w_i^* = \begin{cases} 1, & l_i \le \dfrac{\lambda\zeta}{\lambda+\zeta}, \\ 0, & l_i \ge \lambda, \\ \dfrac{\zeta}{l_i} - \dfrac{\zeta}{\lambda}, & \text{otherwise}. \end{cases} \tag{12}$$

Equation (12) is a soft weighting strategy; according to [40], it is also called mixture weighting. We set ζ = 0.5λ for simplicity in our experiments. With all the update rules in place, we optimize the model iteratively; i.e., Steps 1 and 2 are repeated until the model converges. We increase λ to admit more single cells into the factorization process. Specifically, λ is initialized such that more than half of the cells (sixty percent by default) are picked in the first iteration; in each following iteration, λ is increased such that roughly 10% more cells are added. As a consequence, λ is determined automatically. The model repeats this process until all single cells are chosen. Finally, K-means clustering is applied to the resulting matrix V, yielding the clustering of the scRNA-seq data. The clustering results are evaluated and analyzed in the experimental section.
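Putting Steps 1 and 2 together, the outer self-paced loop can be sketched as follows. This is a simplified F-norm sketch: the quantile-based λ schedule, the iteration counts, the ε terms and the function names are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def mixture_weights(losses, lam, zeta):
    """Soft (mixture) weighting: w_i = 1 for easy cells, w_i = 0 for
    hard ones, and a smooth ramp zeta/l_i - zeta/lam in between."""
    l = np.maximum(np.asarray(losses, dtype=float), 1e-12)
    easy = lam * zeta / (lam + zeta)
    return np.where(l >= lam, 0.0,
                    np.where(l <= easy, 1.0, zeta / l - zeta / lam))

def scspac(X, r, frac0=0.6, step=0.1, inner=50, seed=0):
    """End-to-end sketch of the self-paced loop (F-norm variant):
    start with roughly the easiest 60% of cells and admit ~10% more
    per outer round until all cells are included."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, r))
    V = rng.random((r, n))
    frac = frac0
    while True:
        # Step 2: per-cell losses l_i; lambda is set by a quantile so
        # that about `frac` of the cells fall below it
        losses = ((X - U @ V) ** 2).sum(axis=0)
        lam = np.quantile(losses, frac) + 1e-12
        w = mixture_weights(losses, lam, 0.5 * lam)
        # Step 1: weighted multiplicative updates with w fixed; for a
        # fixed w the weight cancels inside each selected column of V,
        # so only columns with w_i > 0 are updated
        for _ in range(inner):
            U *= ((X * w) @ V.T) / (((U @ V) * w) @ V.T + 1e-10)
            sel = w > 0
            V[:, sel] *= (U.T @ X[:, sel]) / ((U.T @ U @ V)[:, sel] + 1e-10)
        if frac >= 1.0:
            break
        frac = min(frac + step, 1.0)
    return U, V  # K-means on the columns of V gives the final clusters
```

With ζ = 0.5λ, the easy-cell threshold λζ/(λ+ζ) in Equation (12) works out to λ/3.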

Evaluation Metrics
All clustering results are measured by the adjusted Rand index (ARI), purity and normalized mutual information (NMI). These evaluation metrics are briefly introduced here.

ARI
The Rand index (RI) [41] is a measure of similarity between two clusterings. We can use it to compare the actual class labels Y and the predicted cluster labels C to evaluate the performance of a clustering algorithm. The adjusted Rand index (ARI), given in Equation (13), is the corrected-for-chance version of the Rand index [42]:

$$ARI = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right]\Big/\binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right]\Big/\binom{N}{2}}. \tag{13}$$
Here, N represents the total number of cells, and n_ij represents the number of cells assigned to cluster i that actually belong to class j. a_i = Σ_j n_ij is the number of cells in cluster i of C, and b_j = Σ_i n_ij is the number of cells in true class j of Y. $\binom{m}{k}$ is the standard m-choose-k notation. ARI ranges from −1 to 1: perfect labeling is scored 1, while bad clusterings receive negative or near-zero scores. A larger value means that the clustering results match the real cell types better.
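A compact, self-contained implementation of the ARI from the contingency table (the helper name `ari` is ours):

```python
import numpy as np
from math import comb

def ari(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table n_ij,
    with a_i / b_j the cluster / class marginal counts."""
    _, y = np.unique(labels_true, return_inverse=True)
    _, c = np.unique(labels_pred, return_inverse=True)
    n = np.zeros((c.max() + 1, y.max() + 1), dtype=int)
    for ci, yi in zip(c, y):
        n[ci, yi] += 1                                    # n_ij
    sum_ij = sum(comb(int(v), 2) for v in n.ravel())
    a = sum(comb(int(v), 2) for v in n.sum(axis=1))       # cluster sizes
    b = sum(comb(int(v), 2) for v in n.sum(axis=0))       # class sizes
    expected = a * b / comb(len(y), 2)                    # chance term
    return (sum_ij - expected) / (0.5 * (a + b) - expected)
```

Note that ARI is invariant to permutations of the cluster labels, so swapping the two predicted labels of a perfect two-cluster partition still yields 1.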

Purity
Purity [43] is quite simple to calculate. It measures the extent to which each cluster contains data instances from primarily one class. The purity of a clustering result is the weighted sum of the per-cluster purity values and can be defined as follows:

$$Purity = \frac{1}{N}\sum_{k=1}^{K} \max_{j} |c_k \cap y_j|,$$

where C = {c_1, c_2, ..., c_K} are the K predicted clusters and Y = {y_1, y_2, ..., y_J} are the J true classes. Purity ∈ [0, 1]; the higher the value, the better the clustering result.
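Purity can be computed directly from the cluster/class overlaps (the helper name `purity` is ours):

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Purity: each predicted cluster contributes the size of its
    largest overlap with any true class, normalized by N."""
    y = np.asarray(labels_true)
    c = np.asarray(labels_pred)
    total = 0
    for k in np.unique(c):
        _, counts = np.unique(y[c == k], return_counts=True)
        total += counts.max()   # majority true class within cluster k
    return total / len(y)
```

For example, with true labels [0, 0, 1, 1] and predicted clusters [0, 1, 1, 1], the two clusters contribute 1 + 2 correctly attributed cells, giving a purity of 0.75.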

NMI
Normalized mutual information (NMI) [44] measures the amount of information obtained about one partition by observing the other partition, ignoring permutations of the labels:

$$NMI(Y, C) = \frac{I(Y, C)}{\sqrt{H(Y)\,H(C)}},$$

where H(·) is the entropy and I(Y, C) is the mutual information between Y and C.
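A direct implementation from the empirical label distributions. The √(H(Y)H(C)) normalization used here is one common convention (others divide by the mean of the entropies), and the helper name `nmi` is ours:

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """NMI(Y, C) = I(Y, C) / sqrt(H(Y) H(C)), computed from the
    empirical joint and marginal label distributions."""
    y = np.asarray(labels_true)
    c = np.asarray(labels_pred)
    n = len(y)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    py = np.unique(y, return_counts=True)[1] / n   # marginal P(Y)
    pc = np.unique(c, return_counts=True)[1] / n   # marginal P(C)
    mi = 0.0
    for yi, pyi in zip(np.unique(y), py):
        for ci, pci in zip(np.unique(c), pc):
            pj = np.mean((y == yi) & (c == ci))    # joint P(Y=yi, C=ci)
            if pj > 0:
                mi += pj * np.log(pj / (pyi * pci))
    return mi / np.sqrt(entropy(py) * entropy(pc))
```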

Experimental Performance on All Datasets
A recently published benchmark article, Qi et al. [45], tested five representative clustering methods (SC3, SNN-Cliq, SINCERA, SEURAT, and pcaReduce) among the most advanced scRNA-seq tools and showed that SC3 had the highest clustering accuracy under default parameters. Seurat performed well in the mixture control experiment reported by another recently published benchmark article [46]. Scanpy is a widely used Python package for single cell analysis [47]. Therefore, we compared our scSPaC and sscSPaC with SC3, Scanpy and Seurat; three basic NMF models; and the K-means method. To ensure that comparisons between algorithms were based on the same criteria, we used the same gene-filtering and normalization steps for all these algorithms. The main steps of data preprocessing are shown in Section 2.2.
To evaluate the performances of the proposed scSPaC and sscSPaC, we compared them with several closely related nonnegative matrix factorization (NMF) methods and scRNA-seq clustering tools, including the following:

Scanpy [34] is a Python-based toolkit for analyzing single cell gene expression data. Scanpy was downloaded from https://github.com/theislab/scanpy (accessed on 3 March 2022). It includes clustering and is used as a comparison algorithm in the experiments. We ran Scanpy with default parameters, for example, n_neighbors = 20 and resolution = 1.0.

Seurat3 [50] is a graph-based clustering tool, downloaded from https://github.com/satijalab/seurat (accessed on 3 March 2022). For all datasets, Seurat was run with default parameters. We set the number of neighbors to 20 and the cluster resolution to 0.8, and used the ScoreJackStraw() function with a p-value bound of 0.05 to determine the number of principal components.
In scSPaC and sscSPaC, there are several parameters to be set, such as the top N HVGs, the number of reduced dimensions r (the number of components in NMF), the number of clusters K and the SPL parameters ζ and λ. In our experiments, we selected the top 1000 highly variable genes by default to conduct the clustering analysis; in Section 3.2, we discuss the impact of the number of highly variable genes on clustering performance in detail. Since HVGs are chosen to reduce the dimensionality of the genes in this study, the effect of the number of NMF components on the results is not discussed here; the number of real cell classes in each dataset was used uniformly as the component dimension r of NMF. In Section 3.3, we discuss the impact of the number of clusters on the results of the proposed scSPaC. We use the adjusted Rand index (ARI), purity and normalized mutual information (NMI), described in Section 2.5, to evaluate the clustering results. All reported results are the means and standard deviations over 20 repetitions.

Table 2 shows the clustering results on the simulated datasets. For the simulation data, our method achieved the highest purity, indicating that the cells can be well clustered into high-purity classes. For ARI and NMI, we also achieved the highest performance. SC3 was a very competitive approach, with the best clustering performance among the baseline algorithms. We then tested our two methods, scSPaC and sscSPaC, against the seven benchmark methods on seven real scRNA-seq datasets. The ARI clustering results on real scRNA-seq data are shown in Table 3 and Figure 2. On most of the test datasets, we achieved a 3-4% improvement in ARI. On the zeisel dataset, our evaluation metrics improved by nearly 15 points, which shows that the proposed algorithm works well on large-scale datasets. Although SC3 was very competitive on the pollen and rca datasets, our sscSPaC model still achieved the second-best clustering performance there.
The results for the other two evaluation metrics, purity and NMI, are shown in Tables 4 and 5. The tables likewise confirm that our method achieved the best or second-best results in most cases compared with the baseline methods.

Table 4. Clustering results for purity on real scRNA-seq data. The highest score for each dataset is shown in bold and the second best score is underlined.

Table 5. Clustering results for NMI on real scRNA-seq data. The highest score for each dataset is shown in bold and the second best score is underlined.

Different Numbers of Variable Genes Were Selected for Comparison
For the clustering analysis, we chose the top 1000 highly variable genes by default in our methods. In fact, highly variable genes capture more biological information than lowly variable genes, with little effect on cell type determination [52]. Furthermore, selecting highly variable genes lowers the model complexity and running time of our clustering methods. To see how this choice affects the outcomes, we varied the number of highly variable genes from 200 to 2500 and ran scSPaC and sscSPaC on the seven real datasets.
The line chart in Figure 3 shows the ARI values on the seven real datasets when selecting 200, 500, 1000, 1500, 2000 or 2500 highly variable genes. Overall, the performance with 200 highly variable genes was somewhat poorer than in the other five settings, and the mean values of the other five sets of results did not differ much. On most datasets, the results of scSPaC decreased when more than two thousand HVGs were selected, so at most 2500 HVGs were tested in this study. Since the average ARI was still highest with 1000 HVGs on most datasets, we recommend using the top 1000 highly variable genes for clustering.

Accuracy in Estimating the Number of Clusters
As the number of cell types in a real scRNA-seq dataset is usually unknown, most similarity-based clustering methods require the number of clusters to be specified, and an accurate estimate of the optimal number of cell types is critical to identifying cell types on a real dataset. In this section, we used Scanpy [34], a community detection-based tool that includes an efficient method for partitioning the cell graph into discrete clusters and that has been shown to be reliable for estimating the number of cell types.
To evaluate the accuracy of our method in estimating the correct number of populations, the proposed scSPaC searches for the optimal number of clusters around K (from K − 3 to K + 3), where K is the number of clusters estimated by Scanpy. Our model remained robust as K increased, so we recommend that users initialize a slightly larger number of clusters. Table 6 shows the details of how the number of clusters was determined in our scSPaC model. It may be more reasonable to incorporate biological information when choosing the number of clusters for scRNA-seq data, and to combine this choice with other downstream analyses, such as marker gene identification.

Case Study on Human scRNA-seq Data

To fully examine the validity of scSPaC on different single cell data, we tested the algorithm on human scRNA-seq data. In this section, we focus on the enhancement that the addition of self-paced learning brings to the original algorithm in the single cell domain. For simplicity and ease of visualization, we selected human data containing only two cell types: 113 clara cells and 58 ependymal cells [53]. We used the provided cell type labels as a benchmark for evaluating the performances of the clustering methods. Figure 4 shows the t-SNE visualization of the clustering results for the clara and ependymal cells in the human scRNA-seq data, with clara cells shown in red and ependymal cells in blue. As can be seen in the figure, our scSPaC is more advantageous near the boundary lines between clusters. SARS-CoV-2 infection of alveolar epithelial type 2 cells (AT2s) is a defining feature of severe COVID-19 pneumonia [54]. Our model does a decent job of discriminating between the clara and ependymal cells, which could help with related drug development.

Conclusions
The advent of single cell sequencing technology provides an opportunity to reveal cellular heterogeneity. In this study, a new sample selection strategy, self-paced learning, was introduced for scRNA-seq data clustering; it mitigates the tendency of existing clustering algorithms to fall into local optima. Cells are incorporated into the clustering process from easy to hard, based on their losses under the current model. To reduce the impact of noise and outliers on the clustering results, two non-negative matrix factorization algorithms based on self-paced learning were introduced in this work. We tested scSPaC and sscSPaC on both simulated and real scRNA-seq data, and they achieved state-of-the-art performance compared with the baseline clustering algorithms. In a case study, our scSPaC was more advantageous near the dividing line between clusters. Deep learning is computationally expensive compared with traditional machine learning, requiring a large amount of memory and processing resources; moreover, deep models are difficult to interpret and not yet fully understood [55,56]. As a result, in this study we only discussed the applicability of the self-paced learning technique to scRNA-seq data within traditional machine learning models.
Although the newly proposed methods scSPaC and sscSPaC performed well in identifying cell types, they still have some shortcomings. For example, the computational complexity is relatively high, requiring a relatively long running time and a large amount of memory, especially for large-scale datasets. Based on the proposed computational framework, some future improvements will be considered, for example, designing a more elegant regularization term or a deep learning framework to characterize the non-linear relationships among single cells, and improving similarity learning by integrating additional multi-omics data.