Visualizing Single-Cell RNA-seq Data with Semisupervised Principal Component Analysis

Single-cell RNA-seq (scRNA-seq) is a powerful tool for analyzing heterogeneous and functionally diverse cell population. Visualizing scRNA-seq data can help us effectively extract meaningful biological information and identify novel cell subtypes. Currently, the most popular methods for scRNA-seq visualization are principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). While PCA is an unsupervised dimension reduction technique, t-SNE incorporates cluster information into pairwise probability, and then maximizes the Kullback–Leibler divergence. Uniform Manifold Approximation and Projection (UMAP) is another recently developed visualization method similar to t-SNE. However, one limitation with UMAP and t-SNE is that they can only capture the local structure of the data, the global structure of the data is not faithfully preserved. In this manuscript, we propose a semisupervised principal component analysis (ssPCA) approach for scRNA-seq visualization. The proposed approach incorporates cluster-labels into dimension reduction and discovers principal components that maximize both data variance and cluster dependence. ssPCA must have cluster-labels as its input. Therefore, it is most useful for visualizing clusters from a scRNA-seq clustering software. Our experiments with simulation and real scRNA-seq data demonstrate that ssPCA is able to preserve both local and global structures of the data, and uncover the transition and progressions in the data, if they exist. In addition, ssPCA is convex and has a global optimal solution. It is also robust and computationally efficient, making it viable for scRNA-seq cluster visualization.


Introduction
Single-cell RNA sequencing (scRNA-seq) technology enables the measurement of cell-to-cell expression variability of thousands to hundreds of thousands of genes simultaneously, and provides a powerful approach for the quantitative characterization of cell types based on high-throughput transcriptome profiles. A full characterization of the transcriptional landscape of individual cells has enormous potential for both biological and clinical applications. However, characterization and identification of cell types require robust and efficient computational methods. Particularly, visualization is crucial for humans interactively processing and interpreting the heterogeneous and high-dimensional scRNA-seq data, because humans rely on their astonishing cognitive abilities to detect visual structures, such as clusters and outliers. Hence, high-dimensional data must be projected (embedded) into a 2D or 3D space with dimension reduction (DR) techniques for visualization.
Among different methods proposed for scRNA-seq visualization, the two popular ones are principal component analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). PCA projects the high-dimensional scRNA-seq data into the linearly orthogonal low-dimensional vector space thorough variance maximization. Due to its efficiency and conceptual simplicity, PCA has

The Methods
We propose a semisupervised principal component analysis (ssPCA) approach for the visualization of scRNA-seq data. The proposed approach is based on nonlinear kernel (or similarity) matrices and incorporates cluster labels into the visualization. This optimization problem is involved in two components, including the unsupervised maximization of total variance and the supervised maximization of cluster dependence with the Hilbert-Schmidt Independence Criterion (HSIC). Overall, the proposed approach is, therefore, a semisupervised learning problem. In addition, because cluster labels are required for ssPCA visualization, ssPCA is most useful for the visualization of clusters detected from other scRNA-seq clustering tools. Note that we can easily construct a kernel matrix for ssPCA if there is no similarity matrix available from other clustering software (which is rare).
Given a n × p scRNA-seq data matrix X with n cells and p genes, and a n × 1 cell cluster vector y, we aim to project the scRNA-seq data onto low-dimensional orthogonal space for cluster visualization. The idea to incorporate the clustering information into principal component analysis based on HSIC. HSIC is a standard method for measuring the dependence between two sets of random variables [16].
With a scRNA-seq data X and cell subtype (cluster) vector y, two kernels K X ∈ R n×n and K Y ∈ R n×n are built from X and y, respectively. Then, the empirical estimate of HSIC is where H is the centering matrix and 1 n is a n × 1 vector of all 1s. Maximizing HSIC will maximize the dependence between cell expression and cluster labels. Supervised PCA with HSIC and related optimization has been explored in different studies [17,18]. Our proposed ssPCA method for cluster visualization aims not only to maximize HSIC, but also to preserve the data variance. Kernel measures the inner-product in a new feature space. In general, a linear or nonlinear function Φ(X) is used to map the data X onto a new space, then a kernel is defined as K X = Φ(X)Φ(X) T . With the kernel trick, we can define the kernel directly without knowing the exact form of Φ. Given the expressions of two cells x i and x j , common kernels include linear (K( ). Kernel method has been used for scRNA-seq clustering [19]. The similarity matrix generated from any scRNA-seq clustering software can be used as a kernel (K X ) for cluster visualization.
The kernel for cluster (subtype) labels y is defined as follows. Suppose there are c cell clusters in y, we first recode y into a binary matrix Y ∈ R n×c with the one-hot coding scheme-i.e., Y ∈ R n×c , where y ij = 1 if the cell i belongs to cluster j, and is 0 otherwise. Then, we define the kernel K Y as With the kernel matrices K X and K Y available, we project the kernel K X onto a low-dimensional Z ∈ R n×k (where k = 2 or 3) for cluster visualization.
The linear kernel in the low-dimensional space is K Z = ZZ T = K X WW T K X . Therefore, after dropping the scaling factor, we have the supervised HSIC maximization in the projected low-dimensional space as tr(K Z HK Y H) = tr(K X WW T K X HK Y H) = tr(W T K X HK Y HK X W).
The second term is the unsupervised total variance maximization in the projected space. The variance covariance matrix S Z S Z = cov(Z) = Z T HZ = W T K X HK X W, where H = I − 1 n 1 n 1 T n , which is the same as we defined previously. Put the two terms for HSIC and total variance (tr(S Z )) together and add an orthogonal constraint, W T W = I; we optimize the following semisupervised PCA problem: where 0 ≤ λ ≤ 1 is a trade-off hyperparameter. When λ = 0, the problem becomes the traditional unsupervised kernel PCA. When λ = 1, the problem is the supervised PCA. With the Lagrangian multiplier method, the solution for W is the eigenvectors of (1 − λ)K X HK X + λK X HK Y HK X corresponding to the k largest eigenvalues. The projection Z = K X W will be used for visualization.
ssPCA is most suitable for visualizing cell clusters detected from other scRNA-seq software, since cluster labels are required for ssPCA visualization. Both the cluster labels and the similarity (distance) matrix used for cell clustering are usually available. In such a case, the similarity matrix is treated as the kernel (K X ) for X, K Y = YY T is constructed from cluster labels Y, and the low-dimensional projection will be discovered and visualized with ssPCA. If only cluster labels are available (which is rare), we can construct our own kernel K X for visualization. The ssPCA algorithm (Algorithm 1) for cluster visualization of scRNA-seq data is as follows: Algorithm 1: The ssPCA algorithm Given the cluster labels y, scRNA-seq data X, similarity (kernel) matrix K X and hyperparameter λ: 1. Recode y into a binary matrix Y, calculate K Y = YY T , and compute K X from X if K X is not available. 2. Find W, the eigenvectors of (1 − λ)K X HK X + λK X HK Y HK X corresponding to the k largest eigenvalues. 3. Project to low-dimension with Z = K X W for cluster visualization.
The hyperparameter λ: We demonstrate that the visualization graphs are very similar across different λs with the simulation data. Thus, the proposed method is quite robust with λ. We set λ = 0.75 in all computational experiments with real data for comparison.

The scRNA-seq Datasets
Four scRNA-seq datasets are used for evaluation. They span a wide range of cell types with known numbers of subpopulations, representing a broad spectrum of single-cell data. The first dataset consists of embryonic stem cells under different cell cycle stages [2], which includes 8989 genes, 182 cells, and 3 known cell subtypes. The second dataset contains pluripotent cells under different environment conditions [20], which has 10,685 genes, 704 cells, and 3 cell subtypes. The third one is composed of eleven cell populations including neural cells and blood cells [21], which contains 14,805 genes, 249 cells, and 11 cell subtypes. The fourth dataset consists of neuronal cells with sensory subtypes [5], which includes 17,772 genes, 622 cells, and 4 cell subtypes. Cell subtypes in each data are known in advance, providing nice data sources for performance evaluation.

Simulation Data
It is challenging to compare the performance of different visualization software. One reliable approach is to perform simulations with ground-truth available. Our simulation is based on the artificial tree data from PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) [22]. PHATE was recently developed with a novel informational distance developed from diffusion processes, and was an efficient tool for visualizing continuous progression and trajectories. The artificial tree included 10 branches, and data for each branch was uniformly sampled in 60 dimensions. The sample size of the tree data was 1440 with branch information available, providing a nice source for ssPCA assessment and software comparisons. The goal for this visualization is not only to detect the clusters (local data structure), but also to recover the branching trajectories of the simulated tree correctly. Note that PHATE is also a visualization software. To evaluate the true power of ssPCA, compared to PCA, t-SNE, and UMAP, we did not use the final similarity matrix learned from PHATE, but constructed a simple Gaussian kernel matrix from the data, and the 10 branches were regarded as the cluster labels. The same kernel was used to evaluate the performance of UMAP, t-SNE, and ssPCA, respectively. With the data dimensions of 60, no PCA is carried out for t-SNE and UMAP with original tree data. However, PCA is used for dimension reduction before we perform UMAP and t-SNE with kernel. The number of PCs used for UMAP and t-SNE is set to 30 in this simulation. The results with different methods are reported in Figure 1.  Figure 1 demonstrates that neither PCA nor t-SNE recover the tree structure of the data correctly. PCA leads to artificially overlapping, while t-SNE with data shatters the tree structure into discrete clusters, and we cannot recognize the tree structure from t-SNE with kernel, indicating that t-SNE does not preserve the global data structure. UMAP with both data and kernel seems to be better at preserving the global tree structure, a few tree branches are connected together. However, tree branches in UMAP are also shattered into discrete pieces. The global tree structure is not completely recovered and the intercluster distances are meaningless with UMAP. PHATE performs the best and represents the ground-truth with this artificial tree data. It correctly visualizes the global and local tree structures. Although ssPCA is mainly designed for visualizing cluster structures, it performs second-best and correctly visualizes the global tree structures. The intercluster distances in the artificial tree data are correctly preserved with ssPCA. ssPCA also discovers most of the clusters (branches) correctly. It only fails to distinguish a couple of tree branches that are close to each other because of the noises. PHATE performs better than ssPCA with this specific dataset, but not without the costs. There are more than 5 hyperparameters in PHATE, and different parameter settings may lead to quite different visualization results. Particularly, PHATE is sensitive to the choice of the noise parameter σ. The visualizations of PHATE are quite different with different noises σs, although the true structure of the artificial tree is the same. Additional simulations are carried out with the parameters set to (i) the number of dimensions of 5000; (ii) number of samples of 1200; (iii) the number of branches of 8; and (iv) the different noise levels of σ = 3, 6, and 12, respectively. PCA is performed for dimension reduction before we carry out UMAP and t-SNE. The number of PCs is also set to 30 with more than 70% of explained variance. The visualization results are reported in Appendix Figures A1-A3. PHATE leads to overfitting the tree-structure when the noise in the data is relatively low (σ = 3 and σ = 6), but performs better with higher noises (σ = 12). ssPCA on the other hand, is robust with different choices of the hyperparameter (λ = 0.25, 0.5, 0.75, and 1) and different noises (σ = 3, 6, and 12), as presented in the top-right and bottom panels of Figures A1-A3. ssPCA also performs better than UMAP. In addition, ssPCA is computationally more efficient than PHATE, t-SNE, and OMAP. These three software utilize a gradient-decent approach to find the optimal solution, which is relatively time-consuming. The computational time for PCA, ssPCA, t-SNE, UMAP, and PHATE are 0.018, 1.96, 41.88, 8.05, and 13.24 s, respectively, with the artificial tree data running on an Intel core i7 laptop with 12-GB memory. In conclusion, ssPCA preserves both global and local structures of the data. When transition and progressions exist in the branches of artificial tree, ssPCA captures the branching trajectories well.

Real Data
Our computational evaluations with four real datasets are based on the cell similarity matrix and clustering labels from three popular scRNA-seq software including SIMLR [19], SoptSC [23], and sinNLRR [24]. t-SNE was implemented in their original packages for visualization. As PHATE is not solely designed for cluster visualization, we only compare the performances of ssPCA with t-SNE and UMAP using the real datasets. Note that PCA is carried out for dimension reduction before we perform UMAP and t-SNE. The number of PCs is set to 20, with more than 70% of explained variance for all 4 datasets.
SIMLR [19] is denoted as single-cell interpretation via multikernel enhanced similarity learning. It learns the cell-to-cell similarities through efficient multikernel optimization. The similarity matrices from SIMLR for the 4 real single-cell datasets are visualized with ssPCA, t-SNE, and UMAP, respectively. The perplexity value for t-SNE and the number of nearest neighbors for UMAP are set to 30. We actually try several perplexity values (10,20,30,40) and choose the one with a better visualization. The clusters with different scRNA-seq data are visualized in Figure 2.  Visualization results with similarity matrices from SIMLR with 4 real single-cell RNA sequencing (scRNA-seq) datasets, where plots in the top row are generated by semisupervised principal component analysis (ssPCA) with λ = 0.75, and plots in the middle row and bottom row are produced with t-distributed stochastic neighbor embedding (t-SNE) and UMAP, respectively. Datasets to draw the the subplots from left to right: Buettner [2]; Kolodziejczyk [20]; Pollen [21]; Usoskin [5].
While ssPCA, t-SNE, and UMAP can separate the clusters in different datasets with the similarity matrices from SIMLR, ssPCA performs better in recovering the global cluster structures relative to one another. For instance, with the Buettner scRNA-seq data, ssPCA demonstrates that cluster 2 (green) is close to cluster 1 (red), and far away from cluster 3 (blue), while the relative locations (distances) of the three clusters with t-SNE and UMAP might not mean anything. In addition, with the Pollen data, t-SNE and UMAP display all clusters uniformly on the plane, while ssPCA shows that cluster 7 (light-blue) is far away from the rest clusters. Finally, with the Usoskin data, cluster 3 (light blue) is displayed as several pieces with t-SNE and UMAP, but it is visualized as one entity and adjacent to cluster 1 (red) and cluster 2 (green) with ssPCA. This is reasonable. As demonstrated with the simulation data, ssPCA preserves both global and local data structures with principal component projection, while t-SNE and UMAP only capture the local data structure, and the global structure is not fully preserved. The main reason is that both t-SNE and UMAP are based on the nearest neighbor graph technique which only optimizes the data points close to each other.
SoptSC [23] is a recently developed software that learns cell-cell similarities through locality-preserving low-rank representation. The similarity matrices from the same 4 scRNA-seq datasets are visualized with ssPCA, t-SNE, and UMAP, respectively, as shown in Figure 3. Figure 3 demonstrates that ssPCA, t-SNE, and UMAP can separate the clusters in 3 out of 4 datasets including Kolodziejczyk, Pollen, and Usoskin with the similarity matrices from SoptSC. However, for the Buettner scRNA-seq data, t-SNE and UMAP fail to distinguish cluster 1 (red) from cluster 2 (green) and cluster 3 (blue), while ssPCA is able to recover these clusters as separate entities, although there are some overlaps among them. With the other 3 datasets, ssPCA provides additional information about the clusters relative to one another. For instance, with the Usoskin dataset, cluster 3 (light blue) is located at the center, and links clusters 1, 2, and 4 together with ssPCA, while it is hard to identify such related information of the clusters with t-SNE and UMAP, because the intercluster distances with t-SNE and UMAP are meaningless. Similar visualization results with sinNLRR [24] are reported in Figure A4 of the Appendix. SoptSC and sinNLRR are based on the same idea of locality-preserving low-rank representation. The only difference is that F-norm is used in SoptSC, while L 21 norm is used in sinNLRR. Thus, the similarity matrices generated by SoptSC and sinNLRR with the 4 scRNA-seq data are comparable, leading to similar visualization results.   Figure 3. Visualization results with similarity matrices from SoptSC with 4 real scRNA-seq datasets, where plots in the top row are generated by ssPCA with λ = 0.75, and plots in the middle and bottom rows are produced with t-SNE and UMAP, respectively. Datasets to draw the the subplots from left to right: Buettner [2]; Kolodziejczyk [20]; Pollen [21]; Usoskin [5].

Discussion
Visualization of the high-dimensional RNA-seq data is critical for detecting cell subpopulations and revealing biological insights. To date, there are only a few tools including PCA, t-SNE, UMAP, and PHATE available for dimension-reduction and scRNA-seq data visualization. ssPCA provides another viable tool for scRNA-seq visualization and has its advantages. More specifically, PCA projects high-dimensional data into a low-dimensional space through eigenvalue decomposition. PCA is a linear projection method, and it mainly captures the global structure of the data, as demonstrated in Figure 1 of the artificial tree visualization. However, scRNA-seq data are usually not linear, and the nonlinear structure in scRNA-seq data cannot be detected by PCA. ssPCA, on the other hand, is a nonlinear kernel extension of PCA. It reduces the nonlinear noises and projects the scRNA-seq data onto a low-dimensional manifold.
t-SNE and UMAP are the two popular local graph algorithms for scRNA-seq data visualization. While t-SNE is the most popular tool currently used in the literature, UMAP produces somewhat similar output with increased speed. However, one key disadvantage with t-SNE and UMAP is that they only preserve the local neighborhood structure in the data. The global structures are not correctly visualized. As demonstrated in Figure 1 and Appendix Figures A1-A3, both t-SNE and UMAP tends to shatter the continuous structures into discrete clusters, and the relative location of clusters in t-SNE and UMAP generally has no meaning. ssPCA, on the other hand, integrates local cluster dependence with global principal components. It is able to maintain both global and local structures, as demonstrated in Figure 1 and Figures A1-A3. Moreover, in practice, both t-SNE and UMAP utilize PCA as a dimension reduction prior, because of the large number of genes in scRNA-sq data. Although UMAP can handle the high-dimensional data efficiently, PCA for dimension reduction is still necessary, due to the curse of dimensionality. The distance between cells in high-dimension tends to be very similar, leading to deteriorating performance in cluster visualization. On the contrary, ssPCA finds the principal components in one-step, which is computationally more efficient.
PHATE [22] is a recently developed software for dimension reduction and visualization. It visualizes the simulated tree structure better when the noise in the artificial tree data is high, but tends to overfit the data and leads to a too-complex visualization when the noise is relatively low, as demonstrated in Figure 1 and Appendix Figures A1-A3. Furthermore, there are too many parameters in PHATE, and different parametric settings may lead to different visualizations. On the contrary, ssPCA only has one hyperparameter λ, and it is robust with different values of λ over different noise levels, as demonstrated in Figures A1-A3.
The only hyperparameter λ (0 ≤ λ ≤ 1) measures the trade-off between the total variance of the data and cluster dependence. When λ = 0, ssPCA becomes a standard kernel PCA that maximizes the total variance. On the other hand, when λ = 1, the projection is solely based on maximizing cluster dependence of the data. As ssPCA is robust with different values of λ in the simulation, we recommend that you pick a λ value between 0.25-0.75 in practice. Since ssPCA is computational efficient, you may run ssPCA multiple times with different values of hyperparameter λ, and get a better sense of how the projection is affected by λ.
One interesting finding with the Pollen data is that ssPCA identifies 3 distinct clusters with different kernel (similarity) matrices from different software packages, while 11 clusters were discovered in their original study. Whether this discrepancy is from the novel finding with ssPCA or from the limitation of ssPCA is not known. More investigations are required in the near future. Finally, although the same 4 scRNA-seq datasets are used for Figures 2 and 3, the visualizations are not exactly the same with different similarity matrices from different software. Thus, the choice of similarity (kernel) matrices is also crucial for scRNA-seq visualization.

Conclusions
We propose a semisupervised principal component method (ssPCA) with HSIC maximization for cell subtype visualization. ssPCA optimizes both local cluster dependence and global principal component projection. It has an analytical solution, and is robust with respect to different values of hyperparameter λ and different noise levels in the data. Thus, ssPCA has its advantages over PCA, t-SNE, UMAP, and PHATE. The key advantages with ssPCA are that it preserves both local and global structures in the data faithfully, and the principal component projection with ssPCA is more interpretable than that from t-SNE and UMAP. However, it is important to remember that no visualization technique is perfect, and ssPCA is no exception. Unlike t-SNE and UMAP, in which cluster labels are only optional, ssPCA must have cluster labels as its input, indicating that ssPCA can only be used when cluster information is available. However, through integration with a clustering software, ssPCA is still a powerful tool for scRNA-seq data visualization. It also provides an alternative to standard visualization methods such as t-SNE and UMAP.
Funding: This research received no external funding.
Acknowledgments: This research is partially supported by Four Diamonds Fund, Pennsylvania State University.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Supplementary Figures A1-A4
We did more simulations with artificial tree structure visualization where the parameters were set to the following: (i) the number of dimensions of 5000; (ii) number of samples of 1200; (iii) the number of branches of 8; and (iv) the different noises of σ = 3, 6, and 12, respectively.     Figure A4. Visualization results from sinNLRR with 4 real scRNA-seq datasets, where plots in the top row are generated by ssPCA with λ = 0.75, and plots in the middle and bottom rows are produced with t-SNE and UMAP, respectively. Datasets to draw the the subplots from left to right: Buettner [2]; Kolodziejczyk [20]; Pollen [21]; Usoskin [5].