scIRT: Imputation and Dimensionality Reduction for Single-Cell RNA-Seq Data by Combining NMF with SMOTE
Abstract
1. Introduction
2. Results
2.1. Overview of scIRT
2.2. Use scIRT for Dropout Imputation
2.3. scIRT Optimizes Cell Clustering in High-Dimensional Space
2.4. Analysis of Co-Expression Structures of Genes
2.5. scIRT Improves Cell Clustering Results in Low-Dimensional Space
2.5.1. Impact of Sample Size on Clustering Results
2.5.2. Comparison of Cell Clustering Before and After Dimensionality Reduction
2.6. Performance Evaluation of Runtime and Memory Consumption
3. Discussion
4. Materials and Methods
4.1. Public scRNA-Seq Datasets
4.2. Aggregate Distance Metrics to Calculate Intercellular Similarity
4.3. Random Cell Imputation Based on SMOTE
4.4. Multi-Layer Random Imputation Approach
4.5. Data Model and Iterative Update Rules of NMF
4.6. Combined Application of NMF and SMOTE
- (a) Cell-to-cell similarity is computed with a voting-based consensus distance that integrates several commonly used distance metrics into a single ranking.
- (b) Random imputation follows the SMOTE oversampling strategy, assigning random weights to cells according to the consensus distance.
- (c) A multi-layer hierarchical framework progressively incorporates additional cells into the random imputation. At each layer, NMF is applied to the random imputation result and iterated to minimize the root mean square residual between the actual and predicted values. The imputation results are computed layer by layer.
- (d) The results are refined by applying NMF once more to the imputation result of the final layer, which yields both the high-dimensional expression matrix and its low-dimensional representation matrix.
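The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three metrics (Euclidean, Manhattan, cosine), the rank-sum voting rule, the `1 / rank` sampling weights, and the names `consensus_ranks` and `smote_cell` are all assumptions made for the example. Only the interpolation `x + gap * (neighbour - x)` is the standard SMOTE construction.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm > 0 else 1.0

def consensus_ranks(cells, target, metrics):
    """Sum each other cell's rank (1 = closest to `target`) over all metrics."""
    others = [i for i in range(len(cells)) if i != target]
    total = {i: 0 for i in others}
    for metric in metrics:
        ordered = sorted(others, key=lambda i: metric(cells[target], cells[i]))
        for rank, i in enumerate(ordered, start=1):
            total[i] += rank
    return total

def smote_cell(cells, target, total_rank, rng):
    """Interpolate between `target` and one neighbour drawn with rank-based weights."""
    neighbours = list(total_rank)
    # Assumed weighting: a smaller summed rank (closer cell) is more likely to be drawn.
    weights = [1.0 / total_rank[i] for i in neighbours]
    pick = rng.choices(neighbours, weights=weights, k=1)[0]
    gap = rng.random()  # SMOTE interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(cells[target], cells[pick])]

# Toy expression matrix: rows are cells, columns are genes.
cells = [
    [5.0, 0.0, 3.0],
    [4.0, 1.0, 3.0],
    [0.0, 6.0, 1.0],
    [5.0, 0.0, 2.0],
]
rng = random.Random(0)
ranks = consensus_ranks(cells, 0, [euclidean, manhattan, cosine_distance])
synthetic = smote_cell(cells, 0, ranks, rng)
```

Here all three metrics agree that cell 3 is closest to cell 0 and cell 2 is farthest, so the summed ranks are 3, 6 and 9 for cells 3, 1 and 2. Summing ranks rather than raw distances keeps metrics with different scales from dominating the vote.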
| Algorithm 1. Application framework of the scIRT algorithm |
|---|
| Input: gene dataset (data), weight (wt), number of categories (nc). |
| Step 1. [total_rank] = distance_consensus(data, wt); |
| Step 2. [m, n] = size(data); |
| Step 3. mean_number = floor(log2(m/nc)); |
| Step 4. times = 2^(1:mean_number); |
| Step 5. expectation of each cluster (eec) = floor(m/nc × 0.9); |
| Step 6. for i ← 1 to length(times) do |
| data_new = SMOTE(data, total_rank, m, eec); |
| [matrix1, matrix2] = NMF(data_new); |
| data_new_NMF = matrix1 × matrix2; |
| end |
| Step 7. [u, v] = NMF(data_new_NMF); |
| Step 8. high-dimensional matrix = u × v, low-dimensional matrix = u. |
| Output: high-dimensional expression matrix, low-dimensional representation matrix. |
| Note: Step 1 calls the distance_consensus function, which implements step a. In Step 2, the size function returns the dimensions of the matrix, where m is the number of rows and n is the number of columns. In Step 3, the floor function rounds toward negative infinity and the log2 function computes the base-2 logarithm. The floor function in Step 5 behaves as in Step 3. In Step 6, the length function returns the length of the vector; Step 6 implements steps b and c. Step 7 implements step d. |
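As a worked illustration of Algorithm 1, the sketch below computes the layer schedule of Steps 3 to 5 and runs one NMF factorization as in Steps 7 and 8. It rests on several assumptions. Step 4 is read as the powers of two `2^1 … 2^mean_number`. The NMF routine uses the classic Lee-Seung multiplicative updates for the Frobenius norm, which may differ from the update rules the paper derives. The names `layer_schedule`, `nmf`, `matmul`, `transpose` and `rmse` are hypothetical helpers. The SMOTE loop of Step 6 is omitted.

```python
import math
import random

def layer_schedule(m, nc):
    """Steps 3-5: number of layers, per-layer multipliers, expected cluster size."""
    mean_number = math.floor(math.log2(m / nc))
    times = [2 ** k for k in range(1, mean_number + 1)]
    eec = math.floor(m / nc * 0.9)
    return times, eec

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates for V ≈ W H (assumed variant)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h

def rmse(a, b):
    """Root mean square residual between two equally shaped matrices."""
    sq = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return math.sqrt(sum(sq) / len(sq))

# 256 cells in 4 categories: m/nc = 64, so six layers are scheduled.
times, eec = layer_schedule(256, 4)  # → ([2, 4, 8, 16, 32, 64], 57)

# Toy non-negative matrix (rows = cells, columns = genes).
data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [1.0, 1.0, 1.0]]
u, v = nmf(data, rank=2)
high_dim = matmul(u, v)  # Step 8: reconstructed high-dimensional matrix
low_dim = u              # Step 8: low-dimensional representation, one row per cell
```

Because every factor stays non-negative under multiplicative updates, `low_dim` can be passed directly to a clustering routine, and `rmse(data, high_dim)` gives the residual that the iterations reduce.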
4.7. Comparative Methodology and Metrics
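The clustering tables in this article report ARI, NMI, DBI and SC. To show how one of these scores is obtained, the following sketch computes ARI, assuming it denotes the adjusted Rand index in its standard pair-counting form. The function name `adjusted_rand_index` and the toy label vectors are hypothetical and not taken from the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # a single cell admits only one partition
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # true cluster sizes
    cols = Counter(labels_pred)                     # predicted cluster sizes
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)     # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)

# Same partition with permuted label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# Every predicted cluster splits every true cluster: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is 1 for identical partitions regardless of how clusters are named, close to 0 for random assignments, and can be negative, as in the second call, which evaluates to −0.5.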
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Data | Method | Xue-em ARI | Xue-em NMI | Patel-gl ARI | Patel-gl NMI | Leng-h1h ARI | Leng-h1h NMI |
|---|---|---|---|---|---|---|---|
| High-dimensional matrix | NMF | 0.3881 | 0.7259 | 0.0190 | 0.1087 | 0.2648 | 0.2792 |
| High-dimensional matrix | CMF | 0.6364 | 0.8259 | 0.0105 | 0.0696 | 0.3275 | 0.3634 |
| High-dimensional matrix | netNMF-sc | 0.1920 | 0.5455 | 0.1555 | 0.1941 | 0.0235 | 0.0288 |
| High-dimensional matrix | scIRT | 0.6229 | 0.8248 | 0.4820 | 0.6725 | 0.4178 | 0.4488 |
| Low-dimensional matrix | NMF | 0.8309 | 0.8338 | 0.0080 | 0.0662 | 0.3650 | 0.3708 |
| Low-dimensional matrix | CMF | 0.2109 | 0.6259 | 0.5516 | 0.6937 | 0.3824 | 0.3995 |
| Low-dimensional matrix | netNMF-sc | 0.2630 | 0.5622 | 0.1149 | 0.1882 | 0.0201 | 0.0306 |
| Low-dimensional matrix | scIRT | 0.8942 | 0.8881 | 0.7636 | 0.7527 | 0.5791 | 0.5415 |

| Data | Method | Xue-em DBI | Xue-em SC | Patel-gl DBI | Patel-gl SC | Leng-h1h DBI | Leng-h1h SC |
|---|---|---|---|---|---|---|---|
| High-dimensional matrix | NMF | 0.5010 | 0.7850 | 0.5736 | 0.9601 | 1.3470 | 0.2935 |
| High-dimensional matrix | CMF | 1.1125 | 0.4250 | 1.1067 | 0.7266 | 1.7318 | 0.2097 |
| High-dimensional matrix | netNMF-sc | 0.9551 | 0.4543 | 1.4327 | 0.3359 | 0.5768 | 0.6799 |
| High-dimensional matrix | scIRT | 0.4902 | 0.7885 | 0.4919 | 0.7690 | 0.5300 | 0.6344 |
| Low-dimensional matrix | NMF | 0.5079 | 0.7834 | 0.3264 | 0.9582 | 0.8560 | 0.5743 |
| Low-dimensional matrix | CMF | 0.4047 | 0.7816 | 3.6242 | 0.0517 | 3.5051 | 0.0283 |
| Low-dimensional matrix | netNMF-sc | 1.0646 | 0.3910 | 1.4147 | 0.3054 | 1.6794 | 0.3093 |
| Low-dimensional matrix | scIRT | 0.4492 | 0.8228 | 0.7340 | 0.6983 | 0.5044 | 0.7716 |

| Tool | Genes = 100 | Genes = 200 | Genes = 500 | Genes = 1000 | Genes = 2000 |
|---|---|---|---|---|---|
| NMF | 0.37 | 0.39 | 0.40 | 0.53 | 0.82 |
| CMF | 6.74 | 8.05 | 12.61 | 24.49 | 65.67 |
| netNMF-sc | 14.66 | 20.03 | 46.41 | 165.13 | 464.49 |
| scHinter | 22.10 | 23.42 | 22.41 | 22.48 | 23.02 |
| MAGIC | 3.93 | 4.15 | 3.80 | 3.83 | 4.56 |
| scIRT | 23.78 | 22.42 | 24.47 | 25.86 | 26.17 |

| Tool | Genes = 100 | Genes = 200 | Genes = 500 | Genes = 1000 | Genes = 2000 |
|---|---|---|---|---|---|
| NMF | 4785 | 4800 | 4807 | 4830 | 4828 |
| CMF | 4647 | 4669 | 4685 | 4667 | 4886 |
| netNMF-sc | 4793 | 4794 | 4806 | 4826 | 4783 |
| scHinter | 4544 | 4710 | 4742 | 4778 | 4706 |
| MAGIC | 4777 | 4836 | 4874 | 4769 | 4887 |
| scIRT | 4579 | 4587 | 4582 | 4591 | 4593 |

| Tool | Genes = 100 | Genes = 200 | Genes = 500 | Genes = 1000 | Genes = 2000 |
|---|---|---|---|---|---|
| NMF | 59.33 | 59.52 | 59.60 | 59.89 | 59.76 |
| CMF | 57.61 | 57.89 | 58.09 | 57.87 | 60.58 |
| netNMF-sc | 59.43 | 59.44 | 59.59 | 59.84 | 59.31 |
| scHinter | 56.34 | 58.40 | 58.80 | 59.24 | 58.35 |
| MAGIC | 59.23 | 59.96 | 60.43 | 59.13 | 60.59 |
| scIRT | 56.78 | 56.88 | 56.81 | 56.92 | 56.95 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Mou, Y.; Li, S.; Ji, G. scIRT: Imputation and Dimensionality Reduction for Single-Cell RNA-Seq Data by Combining NMF with SMOTE. Int. J. Mol. Sci. 2026, 27, 1173. https://doi.org/10.3390/ijms27031173

