Revealing the Best Strategies for Rare Cell Type Detection in Multi-Sample Single-Cell Datasets
Abstract
1. Introduction
2. Materials and Methods
2.1. Methods Under Comparison
2.2. Dataset and Gold Standard
2.3. Data Preprocessing
2.4. Benchmark Workflow
2.4.1. Run Rare Cell Detection Methods and Collect Results
2.4.2. Evaluation
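- Precision quantifies the proportion of correctly predicted rare cells among all cells predicted as rare. It is most informative when false positives must be controlled and is defined as:
  $$\mathrm{Precision} = \frac{TP}{TP + FP}$$
- Recall (Sensitivity) measures the proportion of true rare cells that the method correctly identifies, reflecting its ability to minimize false negatives. It is computed as:
  $$\mathrm{Recall} = \frac{TP}{TP + FN}$$
- F1-score is the harmonic mean of Precision and Recall, indicating the balance between detection accuracy and completeness. It is calculated as:
  $$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
- Matthews correlation coefficient (MCC) provides a balanced measure that considers both true and false predictions across the positive and negative classes. It is particularly robust for imbalanced datasets and is defined as:
  $$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
- Kappa (Cohen’s κ) quantifies the level of agreement between the predicted and true labels, corrected for agreement expected by chance. With $N = TP + FP + TN + FN$, observed agreement $p_o$, and chance agreement $p_e$, it is calculated as:
  $$\kappa = \frac{p_o - p_e}{1 - p_e}, \quad p_o = \frac{TP + TN}{N}, \quad p_e = \frac{(TP + FP)(TP + FN) + (TN + FP)(TN + FN)}{N^2}$$
- G-mean is the geometric mean of Sensitivity and Specificity, reflecting the balance between detecting rare cells and avoiding false positives. It is defined as:
  $$\mathrm{G\text{-}mean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}$$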
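All six metrics listed above can be derived from the four confusion-matrix counts (TP, FP, TN, FN), with rare cells treated as the positive class. The following is a minimal Python sketch using the standard textbook definitions. The function name `rare_cell_metrics`, the toy counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not details taken from the benchmark code.

```python
import math


def rare_cell_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Binary metrics with rare cells as the positive class."""

    def safe_div(num: float, den: float) -> float:
        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        return num / den if den else 0.0

    n = tp + fp + tn + fn
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_o = safe_div(tp + tn, n)
    p_e = safe_div((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn), n * n)
    kappa = safe_div(p_o - p_e, 1 - p_e)
    g_mean = math.sqrt(recall * specificity)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
        "g_mean": g_mean,
    }


# Toy counts, made up for illustration: 1000 cells, 20 of them truly rare.
metrics = rare_cell_metrics(tp=15, fp=5, tn=975, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because rare cells form a small minority, a method that labels every cell as non-rare would still be right for most cells; Recall, MCC, Kappa, and G-mean all fall to zero in that case, which is why they suit this setting.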
3. Results
3.1. Overall Performance Comparison
3.2. Stability Assessment Across Different Rare Cell Proportions
3.3. Stability with Different Numbers of Rare Cell Types
3.4. Computational Efficiency
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| Bfoc | Follicular B cell |
| cDC | Conventional dendritic cell |
| FN | False negatives |
| FP | False positives |
| IF | Isolation Forest |
| PCA | Principal component analysis |
| pDC | Plasmacytoid dendritic cell |
| scRNA-seq | Single-cell RNA sequencing |
| TN | True negatives |
| TP | True positives |
| UMAP | Uniform Manifold Approximation and Projection |




| Method | Key Principle | Requires Clustering | Outlier Score | Platform |
|---|---|---|---|---|
| CellSIUS | Bimodal gene expression in clusters | yes | no | R |
| GapClust | Local density and expression gap | no | no | R |
| GiniClust | High-Gini gene feature + clustering | no | no | Python |
| scCAD | Cluster decomposition + anomaly detection | yes | yes | Python |
| SCISSORS | Silhouette-based subclustering | yes | no | R |
| scGPT + IF | Embedding and isolation forest | no | yes | Python |

| Dataset | Cells | Genes | Cell Types | Rare Cells | Rare Cell Proportion | Samples | Data Source |
|---|---|---|---|---|---|---|---|
| Breast cancer | 57,411 | 22,815 | 12 | 522 | 0.9% | 10 | GEO GSE266919 |
| Human midbrain | 41,435 | 26,737 | 12 | 1265 | 3% | 11 | GEO GSE157783 |
| Mouse hippocampus | 29,519 | 27,998 | 8 | 114 | 0.4% | 13 | www.mousebrain.org |
| Human kidney atlas | 87,647 | 29,394 | 26 | 1689 | 1.9% | 11 | GEO GSE183279 |

| Method | Wall Time (min) | CPU Time (s) | Max RSS (GB) | Notes |
|---|---|---|---|---|
| CellSIUS | 90.8 | 130,099 | 407 | Extremely high memory usage; slowest runtime |
| GapClust | 11.9 | 972 | 31.1 | Fast; lowest memory consumption |
| GiniClust | 11.5 | 2491 | 46.5 | Fastest runtime; moderately high memory usage |
| scCAD | 57.9 | 15,681 | 97.8 | High computational cost and high memory usage |
| SCISSORS | 29.5 | 1768 | 85.7 | Moderate runtime but still requires substantial memory |
| scGPT + IF | 12.1 | 929 | 31.1 | Fast; lowest memory consumption |
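The three resource columns in the table above (wall time, CPU time, and maximum resident set size) can be collected in several ways, and the source does not state which tool was used. As one illustrative possibility, the sketch below measures them for a single in-process call with the Python standard library on a Unix-like system. The helper `profile_call` and the stand-in `toy_workload` are hypothetical names, and the conversion reports binary gigabytes (1024³ bytes), which is an assumption about units.

```python
import resource  # Unix-only standard-library module
import sys
import time


def profile_call(func, *args, **kwargs):
    """Run func once; report wall time, CPU time, and peak RSS of this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    wall_min = (time.perf_counter() - wall_start) / 60.0
    # process_time sums user and system CPU time over all threads of this
    # process, so it can exceed wall time for multi-threaded code.
    cpu_s = time.process_time() - cpu_start
    # ru_maxrss is the peak over the whole process lifetime, not just this
    # call; it is reported in kilobytes on Linux and in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    max_rss_gb = peak * bytes_per_unit / 1024 ** 3
    usage = {"wall_min": wall_min, "cpu_s": cpu_s, "max_rss_gb": max_rss_gb}
    return result, usage


def toy_workload(n: int) -> int:
    # Hypothetical stand-in for a rare cell detection run.
    return sum(i * i for i in range(n))


result, usage = profile_call(toy_workload, 100_000)
print(usage)
```

This in-process approach does not count child processes, so a method launched as an external R or Python script would need to be measured from outside, for example with `resource.RUSAGE_CHILDREN` after the child exits.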
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ye, Z.; Yan, Y.; Yu, Y.; Wu, H. Revealing the Best Strategies for Rare Cell Type Detection in Multi-Sample Single-Cell Datasets. Genes 2026, 17, 31. https://doi.org/10.3390/genes17010031

