Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis

Analysis of single-cell multiomics datasets is a novel topic and is considerably challenging because such datasets contain a large number of features with numerous missing values. In this study, we implemented a recently proposed tensor-decomposition (TD)-based unsupervised feature extraction (FE) technique to address this difficult problem. The technique can successfully integrate single-cell multiomics data composed of gene expression, DNA methylation, and accessibility. Although the last two have large dimensions, as many as ten million, containing only a few percentage of nonzero values, TD-based unsupervised FE can integrate three omics datasets without filling in missing values. Together with UMAP, which is used frequently when embedding single-cell measurements into two-dimensional space, TD-based unsupervised FE can produce two-dimensional embedding coincident with classification when integrating single-cell omics datasets. Genes selected based on TD-based unsupervised FE are also significantly related to reasonable biological roles.


Introduction
Single-cell multiomics data analysis is challenging [1]. There are multiple reasons for this issue. First, it inevitably includes too many missing values. In the usual highthroughput sequencing (HTS), the so-called depth can compensate for this problem. Nevertheless, because of the very limited amount of RNA retrieved from individual cells available, "depth" cannot resolve this missing value problem. Second, too many missing values result in apparent diversity. The primary purpose of single-cell analysis is to identify the diversity of individual cells that cannot be recognized by the tissue-level HTS. Although missing values are random, apparently very variant profiles appear from a single profile, which can be recognized if there is a large enough number of reads available. This compels researchers to distinguish between true biological diversity and apparent diversity caused by missing values [2].
Finally, single-cell analysis is computationally challenging. Because there are not many samples in the standard HTS, even if the number of features is large, the overall required computational resources decided by the product between the number of features and the number of samples are very limited. Nonetheless, since the number of samples that is the same as that of cells can be huge in single-cell analysis, single-cell analysis can be computationally very challenging.
To resolve these difficulties, we employed tensor-decomposition (TD)-based unsupervised feature extraction (FE) [3]. Prior to applying TD to multiomics datasets, singular-value decomposition (SVD) was applied to individual omics profiles such that individual omics profiles have common L singular-value vectors. Then, K omics profiles are formatted as an L × M × K-dimensional tensor, where M is the number of single cells. Then, higher-order singular-value decomposition (HOSVD), which is a type of TD, is applied to the tensor. UMAP applied to singular-value vectors attributed to single cells by HOSVD success-fully generated two-dimensional embedding, coincident with the known classification of single cells.

Gene Expression Profiles
Two single-cell multiomics datasets were downloaded from GEO using the following two GEO IDs.
The multiomics dataset [4] retrieved from GEO ID GSE154762, which is denoted as Dataset 1 in this study, is composed of 899 single cells for which gene expression, DNA methylation, and DNA accessibility were measured. These single cells represent human oocyte maturation (Table 1). For gene expression, the file "GSE154762_hO_scChaRM_count _matix.txt.gz" was downloaded from the supplementary file of GEO and was loaded into R [5] using the read.table function in R. For DNA methylation and DNA accessibility, 899 files with the extensions "WCG.bw" and "GCH.bw" were downloaded from the supplementary files of GEO and were loaded into R using the import function in the rtracklayer [6] package in R. The multiomics dataset [7] retrieved from GEO ID GSE154762, which is denoted as Dataset 2 in this study, is composed of 852 single cells for which DNA methylation and DNA accessibility were measured, as well as 758 single cells for which gene expression was measured. These single cells represent the four time points of the mouse embryo (Table 2). For gene expression, the file "GSE121650_rna_counts.tsv.gz" was downloaded from the supplementary file of GEO and was loaded into R using the read.table function in R. For DNA methylation and DNA accessibility, 852 files with the extensions "met.tsv.gz" and "acc.tsv.gz" were downloaded from the supplementary file of GEO and were loaded into R using the read.table function in R.

Preprocessing of DNA Methylation Profiles
First, we collected genomic positions for which at least one measurement was performed for at least one single cell (i.e., union). Then, for each genomic position, three integers, −1, 0, and 1, were assigned. When the genomic position was measured in a single cell and its state was methylated (nonmethylated), we attributed 1 (−1) to the genomic position of the single cell. Otherwise (i.e., missing observation), we attributed 0 to the genomic transition in a single cell. x ij2 ∈ R N 2 ×M was stored as a sparse matrix object using the Matrix [8] package in R because of the large N 2 .

Preprocessing of DNA Accessibility
First, we divided the whole genome into 200 nucleotide regions, and DNA accessibility was summed up within individual regions. These values, which show the summation of DNA accessibility within individual regions, are regarded as DNA accessibility at the individual 200 nucleotide regions, each of which is supposed to approximately correspond to a single nucleosome that is composed of 140-length DNA that wraps around histones and 80-length linker DNA. In this study, these 200 nucleotide regions are called "nucleosome regions". x ij3 ∈ R N 3 ×M was stored as a sparse matrix object using the Matrix package in R because of the large N 3 . Here, feature denotes gene expression, DNA methylation, or DNA accessibility. Because the features of these three datasets differ from one another, we first applied SVD to these features. Suppose x ijk ∈ R N k ×M×K represents the value of the ith feature (expression of the ith gene, methylation of the ith genomic location, or DNA accessibility of the ith nucleosome region) at the kth single cell of the kth omics data (1 ≤ k ≤ K = 3, k = 1: gene expression, k = 2: DNA methylation, and k = 3: DNA accessibility). Applying SVD to x ijk , we obtain: where λ is the th singular value and u i k and v jk are the ith and jth components of the th left and right singular-value vectors, respectively. Then, x ijk is transformed to x jk ∈ R L×M×K to have the same (common) feature dimension, L, independent of k, as:

Data Normalization
Prior to applying SVD to the individual omics profiles in these two datasets, x ijk , (k = 2, 3), that is DNA methylation and accessibility, of Dataset 1 was normalized such that: whereas x ij1 , i.e., gene expression, was normalized such that: for Datasets 1 and 2. The reason why DNA methylation and the accessibility of Dataset 2 were not normalized is because ∑ i |x ijk |, (k = 2, 3) is very small in some single cells in Dataset 2. Thus, applying normalization adds significant weight to these single cells with fewer observations and drastically skewed outcomes. To avoid this problem, x ijk , (k = 2, 3) of Dataset 2 was not normalized.

TD Applied to Dimension-Reduced Multiomics Datasets
HOSVD [3] was applied to the tensor, x jk , and we obtained: where G ∈ R L×M×K is the core tensor that represents the contribution of u 1 u 2 j u 3 k to x jk . u 1 ∈ R L×L , u 2 j ∈ R M×M , u 3 k ∈ R K×K are singular-value matrices and are orthogonal matrices.

Categorical Regression
For categorical regression to test the coincidence between classification shown in Table 1 or Table 2 and singular-value vectors attributed to the jth single cells, we performed categorical regression: v jk = a ks δ js + b k (7) where s denotes one of the classifications shown in Table 1 or Table 2, a ks , b k , a 2 s , b 2 are regression coefficients, and δ js takes the value of 1 when the jth single cell belongs to the sth classification and 0 otherwise. Categorical regression was performed using the ls function in R. The obtained p-values were corrected using the Benjamini-Hochberg (BH) criterion [3]. s or 2 s associated with adjusted p-values less than 0.01 were regarded to be coincident with classification.

UMAP
Two-dimensional embedding was performed by UMAP [9]. The umap function implemented in R was used.

Gene Selection
After identifying which u 2 j coincided with the classification, we needed to identify which u 1 was associated with the selected u 2 j by investigating |G( 1 2 3 )|; 1 s with a larger |G| with the selected 2 were regarded to be coincident with the classification. Then, the selected u 1 was converted back to u 1 i 1 attributed to genes as: p-values can be attributed to genes, i, assuming u 1 i obeys a multiple Gaussian distribution (null hypothesis) as: where the summation is taken over only the selected 1 s, P χ 2 [> x] is the cumulative χ 2 distribution, where the argument is larger than x, and σ 1 is the standard deviation. P i s were corrected by the BH criterion [3], and is associated with adjusted P i less than 0.01 were selected.

Dataset 1
We obtained x ij1 ∈ R 26500×899 , x ij2 ∈ R 26438807×899 , and x ij3 ∈ R 15478375×899 . SVD was applied to x ijk with L = 10, as in Equation (1). For x ijk , k = 2, 3, SVD was performed using the irlba function in the irlba package [10] in R because of the large N k , k = 2, 3, as many as ten million. Then, HOSVD was applied to x jk , as in Equation (2).
One possible validation to check whether the above procedure works properly is to check whether v jk and u 2 j are coincident with the classification shown in Table 1. Because the above procedure is fully unsupervised, it is unlikely that v jk and u 2 j are accidentally coincident with the classification. To quantitatively validate the coincidence between the classification and v jk or u 2 j , we applied categorical regression (see Section 2.5). Table 3 shows the number of singular-value vectors coincident with the classification shown in Table 1. When SVD was applied to individual omics data, because L = 10, the number of singular-value vectors was 10 as well. When K omics datasets were integrated and HOSVD was applied, the number of singular-value vectors was KL = 10K. Thus, when DNA methylation and accessibility were integrated, 20 singular-value vectors were available. When all three omics data were integrated, 30 singular-value vectors were available. It is obvious that for all five cases, at least one singular-value vector was coincident with the classification. Thus, our strategy was essentially successful. To further validate the successful integration of singular-value vectors, we applied UMAP to 20 or 30 singular-value vectors obtained by HOSVD (Figure 1).
It is obvious that the integration of all three omics datasets (lower) was more coincident with classification than that of the integration of the two omics datasets, DNA methylation, and accessibility (upper). This suggests the usefulness of integrating the three omics datasets. In fact, single omics data cannot provide two-dimensional embedding coinciding with classification ( Figure S1).
We also attempted to validate biological outcomes when all three omics datasets were integrated. We selected 47 genes associated with adjusted P i less than 0.01, as described in Section 2.7 using u 1i because u 1i is associated with the largest: where the summation of 2 is taken over only 18 2 s coincident with the classification ( Table 3). The selected 47 genes (Data S1) were uploaded to Enrichr [11]. Forty-seven genes were enriched by H3K36me3 based on "ENCODE Histone Modifications 2015"; H3K36m3 is known to play critical roles during oocyte maturation [12]. Forty-seven genes were also targeted by MYC based on "ENCODE and ChEA Consensus TFs from ChIP-X"; Myc is known to play critical roles in oogenesis [13]. Forty-seven genes were also targeted by TAF7 based on "ENCODE and ChEA Consensus TFs from ChIP-X" and "ENCODE TF ChIP-seq 2015"; TAF7 is known to play critical roles during oocyte growth [14]. Forty-seven genes were also targeted by ATF2 based on "ENCODE and ChEA Consensus TFs from ChIP-X"; the expression of ATF2 is known to be altered during oocyte development [15]. This suggests that our strategy correctly captures regulation-related parts (full data of the enrichment analysis are available as Data S1).  (Table 3). Upper: u 2 j , 1 ≤ 2 ≤ 20 when only DNA methylation and accessibility (k = 2, 3) are integrated. Lower: u 2 j , 1 ≤ 2 ≤ 30 when all three omics data points (1 ≤ k ≤ 3) are integrated. Default settings other than custom.config$n_neighbors = 100 were used.

Dataset 2
To confirm that the success in the previous section was not accidental, we applied the same procedure to Dataset 2 as well. We obtained x ij1 ∈ R 22084×758 , x ij2 ∈ R 20106507×852 , and x ij3 ∈ R 13627678×852 . SVD was applied to x ijk with L = 10, as in Equation (1). For x ijk , k = 2, 3, SVD was performed using the irlba function in the irlba package [10] in R because of the large N k , k = 2, 3 of as many as ten million. Then, HOSVD was applied to x jk , as in Equation (2). Because N 1 = 758 < N 2 = N 2 = 852, when HOSVD was applied to x jk composed of all three omics datasets, only 758 single cells shared with all x ijk were considered. As described in the previous section, we first validated the coincidence between singular-value vectors attributed to single cells (Table 4), that is v j and u 2 k , and the classification in Table 2. The coincidence between the singular-value vectors and the classification in Table 4 was even better than that in Table 3. Thus, it is unlikely that the superior outcome in Table 3 was purely accidental. To further validate the successful integration of singularvalue vectors, we applied UMAP to 20 or 30 singular-value vectors obtained by HOSVD ( Figure 2).
It is obvious that the integration of all three omics datasets (lower) was more coincident with classification than that of the integration of the two omics datasets, DNA methylation, and DNA accessibility (upper), as can be seen in Figure 2. This again confirms the usefulness of integrating the three omics datasets. In fact, single omics data cannot provide two-dimensional embedding coinciding with classification ( Figure S2).
We also attempted to validate biological outcomes when all three omics datasets were integrated. We selected 175 genes associated with adjusted P i less than 0.01, as described in Section 2.7 using u 1i because u 1i is associated with the largest: where the summation of 2 is taken over only 18 2 s coincident with the classification ( Table 4). The selected 175 genes (Data S2) were converted to gene symbols by the DAVID [16] gene ID converter and were uploaded to Enrichr. One-hundred and seventy-five genes were enriched by H3K36me3 based on "EN-CODE Histone Modifications 2015"; H3K36m3 is known to play critical roles during gastrulation [17]. One-hundred and seventy-five genes were also targeted by MYC based on "ENCODE and ChEA Consensus TFs from ChIP-X"; Myc is also known to play critical roles in gastrulation [18]. One hundred and seventy-five genes were also targeted by TAF7 based on "ENCODE and ChEA Consensus TFs from ChIP-X" and "ENCODE TF ChIP-seq 2015"; TAF7 is known to play critical roles during gastrulation [19]. One-hundred and seventy-five genes were also targeted by ATF2 based on "ENCODE and ChEA Consensus TFs from ChIP-X"; the expression of ATF2 is known to be maintained during gastrulation [20]. This suggests that our strategy correctly captured regulation-related parts (full data of the enrichment analysis are available as Data S2).
These two examples, the application to Datasets 1 and 2, demonstrate the usefulness of the present strategy to integrate single-cell multiomics datasets composed of gene expression, DNA methylation, and accessibility.   (Table 4). Upper: u 2 j , 1 ≤ 2 ≤ 20 when only DNA methylation and accessibility (k = 2, 3) are integrated. Lower: u 2 j , 1 ≤ 2 ≤ 30 when all three omics data points (1 ≤ k ≤ 3) are integrated. Default settings other than custom.config$n_neighbors = 100 were used.

Discussion
In this study, we demonstrated the usefulness of our strategy when it was applied to the integrated analysis of single-cell multiomics datasets composed of gene expression, DNA methylation, and DNA accessibility. One might wonder if other more popular methods can achieve similar performance because our strategy is useless if others can perform comparably. There are several advantages of our method, which other methods do not have.
First, we do not have to fill in the missing values with nonzero values. Single-cell measurements are usually associated with a large number of missing values (Table 5). Although gene expression profiles were associated with a relatively small number of missing components, more than 70 % were missing. For DNA methylation and accessibility, the situation was very difficult to treat. Only a few percentages of components had values, while the rest were missing values. To address this problem, especially for DNA methylation and accessibility, heavy preprocessing is usually required. For example, for Dataset 1, statistical tests were applied and regions associated with significant p-values were selected [4], which reduced the number of features attributed to DNA methylation and accessibility. Because such a statistical test automatically filters out regions filled in with missing values, the ratio of nonzero components was also reduced as a result. For Dataset 2, the authors restricted the features to only the most variable ones (typically ∼10 3 ) and occasionally filled in missing components with Bayesian models [7]. These procedures inevitably introduce arbitrariness to the outcomes, as preprocessing the data might affect the outcome. In contrast to these arbitrary procedures, our method is almost unsupervised. We did not select any features or fill in the missing values. Despite these fully unsupervised strategies, our results were highly coincident with the classification (Tables 3 and 4 and Figures 1 and 2). From this perspective, our strategy is superior to the other methods.
Second, our method can deal with massive datasets. For example, although integrated analysis of multiomics data was performed using multiomics factor analysis (MOFA) [21] in the original studies [4,7] of Datasets 1 and 2, MOFA cannot accept x ijk in this study as inputs because MOFA does not implement sparse matrix architecture. During the computation of MOFA, zero values must be filled in with nonzero values to evaluate the convergence; this results in a dense matrix that cannot be stored in the computer memory because the number of components of DNA methylation and accessibility is too large to store them as they are (Table 5). In our computation, we can apply SVD to these large datasets while keeping them in a sparse matrix format using the irlba package implemented in R. SVD not only reduces the number of features to L, but also fills in missing values. Thus, we can manage a large matrix as in our implementation.
Third, our method is free from the dividing weight between multiomics datasets; how to weigh individual omics data must be decided based on some criteria outside the datasets available. Nevertheless, in our implementation, the weight of individual omics data is represented by u 3 k , which is automatically decided by simply applying HOSVD to a multiomics dataset. Thus, from this perspective, our strategy is outstanding.
Although we showed that the integration of all three omics data was superior to that of the integration of DNA methylation and accessibility (Figures 1 and 2), one might wonder if the integration of gene expression and DNA methylation or DNA accessibility might be comparable to that of all thee omics datasets. In order to deny this possibility, we also considered these combinations of two of the three omics datasets (Figures S3  and S4). Although the integration of gene expression and DNA accessibility in Dataset 1 ( Figure S3B) is comparable to that of all three omics data, neither integration of gene expression and DNA methylation ( Figure S4A) nor that of gene expression and DNA accessibility ( Figure S4B) is comparable to that of all three omics data in Dataset 2. Thus, it is obvious that only the integration of the three omics datasets can give us UMAP embedding coincident with the classification regardless of the dataset considered.
As for the comparisons with other methods, as mentioned above, no methods implemented with a sparse matrix architecture and applicable to multiomics datasets exist to our knowledge. Thus, we could not compare our performance to other methods.
Prospective uses of our methods are as follows. First of all, it can integrate gene expression profiles, DNA methylation, and accessibility in single-cell measurements without applying preprocessing; this enables researchers to obtain reasonable results without struggling to convert raw data into treatable formats. In addition to this, since it can save the memories required for analyzing single-cell multiomics datasets, more researchers who do not have massive computational facilities can analyze massive single-cell measurements.

Conclusions
In this study, we proposed a method for applying TD-based unsupervised FE to single-cell multiomics datasets composed of gene expression, DNA methylation, and DNA accessibility. Together with UMAP, the proposed method successfully integrated a multiomics dataset and generated a two-dimensional embedding of single cells coincident with the classification. The implementation requires neither filling missing values nor massive CPU memory to store multiomics datasets of single cells and can deal with DNA methylation and accessibility with ten million features. The present implementation is very promising and can be a de facto standard method to integrate single-cell multiomics datasets composed of gene expression, DNA methylation, and DNA accessibility.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/genes12091442/s1, Figure S1: UMAP embedding of single omics data for data set 1, Figure S2: UMAP embedding of single omics data for data set 2, Figure S3: UMAP embedding of integration of two omics data for data set 1, Figure S4: UMAP embedding of integration of two omics data for data set 2.

Data Availability Statement:
The data used in this study are available in GEO ID GSE154762 and GSE121708. A sample of the R source can be found at https://github.com/tagtag/scMultiR (accessed on 14 September 2021).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.