# Differential Expression Analysis of Single-Cell RNA-Seq Data: Current Statistical Approaches and Outstanding Challenges


## Abstract


## 1. Background

## 2. Current Statistical Approaches

## 3. Classes of Statistical Approaches for DEA

**Notation:** ${Y}_{ij}$: random variable (rv) representing the observed expression (i.e., read or UMI) count of the ith (i = 1, 2, …, N) gene in the jth (j = 1, 2, …, M) cell; N: total number of genes; M: total number of cells; ${\mu}_{i}$: mean of the ith gene under the NB distribution (count part of the model); ${\phi}_{i}$ and ${\theta}_{i}$ $(={\phi}_{i}^{-1})$: dispersion and size parameters, respectively, for the ith gene; ${\pi}_{i}\left(\in \left[0,1\right]\right)$: mixture probability (zero-inflation probability) of the ith gene; ${s}_{j}$: library size of the jth cell; ${Z}_{ij}$: rv representing the true (unknown) expression count of the ith gene in the jth cell; $\mathit{X}$: design matrix for cell group information, whose jth row is ${X}_{j}=\left[{X}_{j1},{X}_{j2},\dots ,{X}_{jN}\right]$; ${W}_{ij}$: indicator rv for whether the ith gene is expressed in the jth cell, i.e., ${W}_{ij}=0$ if ${Y}_{ij}=0$ and ${W}_{ij}=1$ if ${Y}_{ij}>0$; ${\mathsf{\Omega}}_{i}=\left\{{\mu}_{i},{\theta}_{i},{\pi}_{i}\right\}$: parameter space for the ith gene.
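To make the roles of ${\mu}_{i}$, ${\theta}_{i}$, and ${\pi}_{i}$ concrete, the following minimal sketch evaluates a zero-inflated NB probability mass function for a single gene. It assumes the standard mixture form (an extra zero with probability ${\pi}_{i}$, otherwise an NB count with mean ${\mu}_{i}$ and size ${\theta}_{i}$); the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def nb_log_pmf(y, mu, theta):
    """Log-probability of count y under an NB with mean mu (> 0) and size theta."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def zinb_pmf(y, mu, theta, pi):
    """P(Y = y) under a zero-inflated NB: extra zero w.p. pi, else NB(mu, theta)."""
    nb = math.exp(nb_log_pmf(y, mu, theta))
    return pi + (1.0 - pi) * nb if y == 0 else (1.0 - pi) * nb


# Hypothetical parameter values for one gene (illustration only).
mu, theta, pi = 2.0, 0.5, 0.3

# Truncate the support at 2000; the remaining tail mass is negligible here.
probs = [zinb_pmf(y, mu, theta, pi) for y in range(2000)]
total = sum(probs)                               # should be close to 1
mean = sum(y * p for y, p in enumerate(probs))   # should be close to (1 - pi) * mu
```

Under this form the marginal mean is $(1-{\pi}_{i}){\mu}_{i}$, so zero inflation lowers the expected count relative to the NB component alone.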

#### 3.1. Generalized Linear Model-Based Approaches

**:**M × 1 vector of parameters for ith gene;

**X:** M × G design matrix providing group information (the first column consists of 1s to include an intercept term); G: number of cellular groups (cell clusters are divided into G groups if the group labels are unknown);

**R**: M × N design matrix providing cell cluster information;

**C**: M × C design matrix providing cell-level auxiliary information;

${\mathit{\gamma}}_{i}$ and ${\mathit{\beta}}_{\mathit{i}}$: G × 1 vectors of cellular group effects for the ith gene; ${\mathit{w}}_{i}$ and ${\mathit{u}}_{i}$: N × 1 vectors of cell cluster effects for the ith gene; ${\mathit{s}}_{i}$ and ${\mathit{v}}_{i}$: C × 1 vectors of effects for cell-level covariates, such as cell cycle and cell phase, for the ith gene; C: number of levels of the cell-level auxiliaries; and ${\mathit{O}}_{\mathit{\mu}},{\mathit{O}}_{\mathit{\pi}}$: offsets for ${\mathit{\mu}}_{i}$ and ${\mathit{\pi}}_{i}$, respectively.

**:** offset term. The model in Equation (9) can also be expanded to accommodate other cell-level covariates, including cell type, cell cycle, and cell growth phase [13]. To test whether the ith gene is differentially expressed across the cell groups, GLM-based approaches test the following null hypothesis:

**Limitations**: There are three major limitations of this class of approaches. Firstly, strict model assumptions: the GLM class of approaches requires several distributional assumptions about the expression counts, which may not be satisfied by real single-cell data. For instance, a GLM requires the counts to be generated by an exponential family distribution; the link function must be invertible, continuous, and differentiable; and the link-transformed mean must depend linearly on the cell covariates. These strict assumptions restrict the utility of GLM-based DEA approaches for real data analysis. In most cases, users simply apply these techniques without testing whether the assumptions hold, so violations go undetected and the results can be misleading.
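As an illustration of how a GLM-based test of this kind can be carried out for one gene, the sketch below computes a likelihood ratio test (LRT) between two cell groups under an NB model. It is a simplified stand-in rather than the procedure of any specific tool: it assumes a known, fixed size parameter `theta`, no library-size offsets, and no additional covariates, in which case the maximum likelihood estimate of each group mean is the sample mean. The function names and the example counts are hypothetical.

```python
import math


def nb_loglik(counts, mu, theta):
    """NB log-likelihood of a list of counts with mean mu and size theta."""
    mu = max(mu, 1e-8)  # guard against log(0) when a group has only zeros
    return sum(
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
        for y in counts
    )


def nb_lrt(group_a, group_b, theta):
    """LRT of 'equal means' vs. 'group-specific means' with fixed size theta."""
    group_a, group_b = list(group_a), list(group_b)
    pooled = group_a + group_b

    def mean(xs):
        return sum(xs) / len(xs)

    # Null model: one shared mean. Alternative model: one mean per group.
    ll_null = nb_loglik(pooled, mean(pooled), theta)
    ll_alt = (nb_loglik(group_a, mean(group_a), theta)
              + nb_loglik(group_b, mean(group_b), theta))

    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts for one gene in two cell groups (illustration only).
low = [0, 0, 1, 0, 2, 0, 1, 0]
high = [5, 8, 3, 12, 7, 9, 4, 10]
stat, p_value = nb_lrt(low, high, theta=1.0)
```

In practice the dispersion is estimated from the data and cell-level covariates and offsets enter the linear predictor, so the fit requires an iterative procedure instead of the closed-form means used here.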

#### 3.2. Generalized Additive Model-Based Approaches

**Limitations**: (i) Pseudo-time dependence: Approaches including Monocle depend heavily on the accuracy of the pseudo-time ordering of cells. In single-cell studies, the expression of genes in each cell is treated as a function of time, so cells can be ordered along a time axis. Single-cell analytical tools use existing algorithms, including Wanderlust [68], to order single cells along discrete paths. These paths do not represent real time but rather a pseudo-time variable (due to the short life cycles of cells), which usually represents an intrinsic cellular process. Further, computational experiments indicate that differences in the temporal ordering of single cells produced by different approaches affect the results, and thus their interpretation [69]. Pseudo-temporal ordering combined with expression data has been useful in some studies, but it has also faced criticism. For instance, Moris et al. (2016) [70] questioned the assumption of smooth and continuous cell state transitions, which pseudo-time-ordering algorithms require. Moreover, such data may not be readily available to users, making these approaches difficult to apply in general cases; (ii) Multi-modality: Similar to the GLM class of approaches, this class is also unable to consider the multi-modal nature of single-cell data; and (iii) Computational cost: The GAM class of approaches is computationally intensive because complex statistical models are fitted individually for each gene.

#### 3.3. Mixture Model-Based Approaches

#### 3.4. Hurdle Model-Based Approaches

#### 3.5. Two-Class Comparison (Parametric) Approaches

**Limitations**: (i) Only two groups: This class of approaches cannot be generalized to accommodate multiple cellular groups, even though scRNA-seq data are characterized by the presence of multiple cell types/groups. In contrast, other classes of methods, including GLM, GAM, Hurdle, and MM, model the mean parameter of genes through a GLM, which can accommodate multi-group comparisons; (ii) Cell-level auxiliary data: Incorporating cell-level confounding covariates, such as cell type, cell cycle, and cell growth phase, in the DEA improves the statistical power to detect truly differentially expressed genes in single-cell studies. This class of approaches cannot accommodate such auxiliary data in the analysis, and thus performs poorly compared with other classes of approaches [10,16]; and (iii) Parametric treatment of zeros: Many of the aforementioned approaches handle the inflated zero counts through parametric models (e.g., ZINB), which might not be sufficient to capture the heterogeneity in scRNA-seq data. Further limitations and unique features of this class of approaches are listed in Table 2.

#### 3.6. Non-Parametric Approaches

**Limitations**: (i) Lower statistical power: If the single-cell data meet the assumptions of a parametric approach, and the DEA hypothesis can be tested with that approach, then NP approaches are less suitable because they have lower statistical power. Previous studies indicate that ZIM (Supplementary Documents S3–S5) fit single-cell data well [13,34]. Consequently, DEA approaches based on ZIM usually perform better than NP approaches [10,16]; (ii) NP approaches are not systematic, whereas parametric approaches have been systematized, and different tests are simply variations on a central theme; (iii) Convenience: The tables necessary to implement NP tests are scattered widely and appear in different formats; and (iv) Because NP approaches are distribution-free, their results may or may not provide an accurate answer. Further limitations and special features of this class of approaches are listed in Table 2.
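For readers unfamiliar with this class, a minimal sketch of one common NP test, the Wilcoxon rank-sum (Mann-Whitney) test, is given below for a single gene and two cell groups. It is an illustrative choice and not tied to any particular tool discussed here. It assumes the large-sample normal approximation with a tie correction and no continuity correction, which avoids lookup tables but is only approximate for small groups; the function names and the example counts are hypothetical.

```python
import math


def midranks(values):
    """Return average (mid) ranks of values and the sizes of each tie group."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test; returns (U statistic of group_a, p-value)."""
    group_a, group_b = list(group_a), list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    ranks, ties = midranks(group_a + group_b)

    u = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2.0
    tie_term = sum(t ** 3 - t for t in ties) / (n * (n - 1))
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term)
    if var_u == 0:  # all observations identical: no evidence of a shift
        return u, 1.0

    z = (u - n_a * n_b / 2.0) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts for one gene in two cell groups (illustration only).
low = [0, 0, 1, 0, 2, 0, 1, 0]
high = [5, 8, 3, 12, 7, 9, 4, 10]
u_stat, p_value = rank_sum_test(low, high)
```

The many tied zeros typical of scRNA-seq counts are handled here only through the tie correction of the variance, which is one reason rank-based tests can lose power on sparse genes.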

## 4. Outstanding Challenges

#### 4.1. Biological Challenges

#### 4.1.1. Proper Biological Benchmarking

#### 4.1.2. Annotation

#### 4.2. Methodological Challenges

#### 4.2.1. Gold Standard scRNA-seq Data

#### 4.2.2. Excess Heterogeneity

#### 4.2.3. Dropouts or Excess Zeros of Single-Cell Data

#### 4.2.4. Pre-Processing of scRNA-seq Data

#### 4.2.5. Lack of Biological Relevant Criteria

#### 4.2.6. Statistical Methods for DEA across Individuals

#### 4.2.7. False Discoveries in DEA

#### 4.2.8. Improved Methods for Dispersion Estimation

#### 4.2.9. Random/Mixed Effect Models

#### 4.2.10. Optimal Combination of Algorithms

#### 4.2.11. Integration of Multi-Omics Data

#### 4.2.12. Slow Computational Processing

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Liu, S.; Trapnell, C. Single-cell transcriptome sequencing: Recent advances and remaining challenges. F1000Research **2016**, 5, 182.
2. Kiselev, V.Y.; Andrews, T.S.; Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. **2019**, 20, 273–282.
3. Saliba, A.-E.; Westermann, A.J.; Gorski, S.A.; Vogel, J. Single-cell RNA-seq: Advances and future challenges. Nucleic Acids Res. **2014**, 42, 8845–8860.
4. Macosko, E.Z.; Basu, A.; Satija, R.; Nemesh, J.; Shekhar, K.; Goldman, M.; Tirosh, I.; Bialas, A.R.; Kamitaki, N.; Martersteck, E.M.; et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell **2015**, 161, 1202–1214.
5. Zheng, G.X.Y.; Terry, J.M.; Belgrader, P.; Ryvkin, P.; Bent, Z.W.; Wilson, R.; Ziraldo, S.B.; Wheeler, T.D.; McDermott, G.P.; Zhu, J.; et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. **2017**, 8, 14049.
6. Picelli, S.; Faridani, O.R.; Björklund, Å.K.; Winberg, G.; Sagasser, S.; Sandberg, R. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. **2014**, 9, 171–181.
7. Pollen, A.A.; Nowakowski, T.J.; Shuga, J.; Wang, X.; Leyrat, A.A.; Lui, J.H.; Li, N.; Szpankowski, L.; Fowler, B.; Chen, P.; et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. **2014**, 32, 1053–1058.
8. Jiang, R.; Sun, T.; Song, D.; Li, J.J. Statistics or biology: The zero-inflation controversy about scRNA-seq data. Genome Biol. **2022**, 23, 31.
9. Svensson, V. Reply to: UMI or not UMI, that is the question for scRNA-seq zero-inflation. Nat. Biotechnol. **2021**, 39, 160.
10. Das, S.; Rai, A.; Merchant, M.L.; Cave, M.C.; Rai, S.N. A Comprehensive Survey of Statistical Approaches for Differential Expression Analysis in Single-Cell RNA Sequencing Studies. Genes **2021**, 12, 1947.
11. Mou, T.; Deng, W.; Gu, F.; Pawitan, Y.; Vu, T.N. Reproducibility of Methods to Detect Differentially Expressed Genes from Single-Cell RNA Sequencing. Front. Genet. **2020**, 10, 1331.
12. Vu, T.N.; Wills, Q.F.; Kalari, K.R.; Niu, N.; Wang, L.; Rantalainen, M.; Pawitan, Y. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics **2016**, 32, 2128–2135.
13. Das, S.; Rai, S.N. SwarnSeq: An improved statistical approach for differential expression analysis of single-cell RNA-seq data. Genomics **2021**, 113, 1308–1324.
14. Dal Molin, A.; Baruzzo, G.; Di Camillo, B. Single-cell RNA-sequencing: Assessment of differential expression analysis methods. Front. Genet. **2017**, 8, 62.
15. Wang, T.; Li, B.; Nelson, C.E.; Nabavi, S. Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinform. **2019**, 20, 40.
16. Soneson, C.; Robinson, M.D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods **2018**, 15, 255–261.
17. Jaakkola, M.K.; Seyednasrollah, F.; Mehmood, A.; Elo, L.L. Comparison of methods to detect differentially expressed genes between single-cell populations. Brief. Bioinform. **2016**, 18, 735–743.
18. Miao, Z.; Zhang, X. Differential expression analyses for single-cell RNA-Seq: Old questions on new data. Quant. Biol. **2016**, 4, 243–260.
19. Cui, X.; Churchill, G.A. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. **2003**, 4, 210.
20. Costa-Silva, J.; Domingues, D.; Lopes, F.M. RNA-Seq differential expression analysis: An extended review and a software tool. PLoS ONE **2017**, 12, e0190152.
21. Das, S.; Rai, A.; Mishra, D.C.; Rai, S.N. Statistical approach for selection of biologically informative genes. Gene **2018**, 655, 71–83.
22. Das, S.; Rai, S.N. Statistical approach for biologically relevant gene selection from high-throughput gene expression data. Entropy **2020**, 22, 1205.
23. Pratapa, A.; Jalihal, A.P.; Law, J.N.; Bharadwaj, A.; Murali, T.M. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat. Methods **2020**, 17, 147–154.
24. Ye, C.; Speed, T.P.; Salim, A. DECENT: Differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data. Bioinformatics **2019**, 35, 5155–5162.
25. Vallejos, C.A.; Marioni, J.C.; Richardson, S. BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. PLoS Comput. Biol. **2015**, 11, e1004333.
26. Jia, C.; Hu, Y.; Kelly, D.; Kim, J.; Li, M.; Zhang, N.R. Accounting for technical noise in differential expression analysis of single-cell RNA sequencing data. Nucleic Acids Res. **2017**, 45, 10978–10988.
27. Das, S.; Rai, S.N. Statistical methods for analysis of single-cell RNA-sequencing data. MethodsX **2021**, 8, 101580.
28. Wang, J.; Huang, M.; Torre, E.; Dueck, H.; Shaffer, S.; Murray, J.; Raj, A.; Li, M.; Zhang, N.R. Gene expression distribution deconvolution in single-cell RNA sequencing. Proc. Natl. Acad. Sci. USA **2018**, 115, E6437–E6446.
29. The External RNA Controls Consortium: A progress report. Nat. Methods **2005**, 2, 731–734.
30. Chen, W.; Li, Y.; Easton, J.; Finkelstein, D.; Wu, G.; Chen, X. UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biol. **2018**, 19, 70.
31. Risso, D.; Perraudeau, F.; Gribkova, S.; Dudoit, S.; Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. **2018**, 9, 284.
32. Van den Berge, K.; Soneson, C.; Love, M.I.; Robinson, M.D.; Clement, L. zingeR: Unlocking RNA-seq tools for zero-inflation and single cell applications. bioRxiv **2017**.
33. Van den Berge, K.; Perraudeau, F.; Soneson, C.; Love, M.I.; Risso, D.; Vert, J.-P.; Robinson, M.D.; Dudoit, S.; Clement, L. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol. **2018**, 19, 24.
34. Mallick, H.; Chatterjee, S.; Chowdhury, S.; Chatterjee, S.; Rahnavard, A.; Hicks, S.C. Differential expression of single-cell RNA-seq data using Tweedie models. Stat. Med. **2022**, 41, 3492–3510.
35. He, Z.; Pan, Y.; Shao, F.; Wang, H. Identifying Differentially Expressed Genes of Zero Inflated Single Cell RNA Sequencing Data Using Mixed Model Score Tests. Front. Genet. **2021**, 12, 616686.
36. Shi, Y.; Lee, J.-H.; Kang, H.; Jiang, H. A Two-Part Mixed Model for Differential Expression Analysis in Single-Cell High-Throughput Gene Expression Data. Genes **2022**, 13, 377.
37. Trapnell, C.; Cacchiarelli, D.; Grimsby, J.; Pokharel, P.; Li, S.; Morse, M.; Lennon, N.J.; Livak, K.J.; Mikkelsen, T.S.; Rinn, J.L. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. **2014**, 32, 381–386.
38. Qiu, X.; Mao, Q.; Tang, Y.; Wang, L.; Chawla, R.; Pliner, H.A.; Trapnell, C. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods **2017**, 14, 979–982.
39. Van den Berge, K.; Roux de Bézieux, H.; Street, K.; Saelens, W.; Cannoodt, R.; Saeys, Y.; Dudoit, S.; Clement, L. Trajectory-based differential expression analysis for single-cell sequencing data. Nat. Commun. **2020**, 11, 1201.
40. Finak, G.; McDavid, A.; Yajima, M.; Deng, J.; Gersuk, V.; Shalek, A.K.; Slichter, C.K.; Miller, H.W.; McElrath, M.J.; Prlic, M.; et al. MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. **2015**, 16, 278.
41. Sekula, M.; Gaskins, J.; Datta, S. Detection of differentially expressed genes in discrete single-cell RNA sequencing data using a hurdle model with correlated random effects. Biometrics **2019**, 75, 1051–1062.
42. Kharchenko, P.V.; Silberstein, L.; Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nat. Methods **2014**, 11, 740–742.
43. Delmans, M.; Hemberg, M. Discrete distributional differential expression (D3E)-A tool for gene expression analysis of single-cell RNA-seq data. BMC Bioinform. **2016**, 17, 110.
44. Wu, Z.; Zhang, Y.; Stitzel, M.L.; Wu, H. Two-phase differential expression analysis for single cell RNA-seq. Bioinformatics **2018**, 34, 3340–3348.
45. Zhang, W.; Wei, Y.; Zhang, D.; Xu, E.Y. ZIAQ: A quantile regression method for differential expression analysis of single-cell RNA-seq data. Bioinformatics **2020**, 36, 3124–3130.
46. Niyakan, S.; Hajiramezanali, E.; Boluki, S.; Zamani Dadaneh, S. SimCD: Simultaneous Clustering and Differential expression analysis for single-cell transcriptomic data. arXiv **2021**, arXiv:2104.01512.
47. Ling, W.; Zhang, W.; Cheng, B.; Wei, Y. Zero-inflated quantile rank-score based test (ZIQRank) with application to scRNA-seq differential gene expression analysis. Ann. Appl. Stat. **2021**, 15, 1673–1696.
48. Satija, R.; Farrell, J.A.; Gennert, D.; Schier, A.F.; Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. **2015**, 33, 495–502.
49. Hao, Y.; Hao, S.; Andersen-Nissen, E.; Mauck, W.M.; Zheng, S.; Butler, A.; Lee, M.J.; Wilk, A.J.; Darby, C.; Zager, M.; et al. Integrated analysis of multimodal single-cell data. Cell **2021**, 184, 3573–3587.e29.
50. Korthauer, K.D.; Chu, L.F.; Newton, M.A.; Li, Y.; Thomson, J.; Stewart, R.; Kendziorski, C. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. **2016**, 17, 222.
51. Miao, Z.; Deng, K.; Wang, X.; Zhang, X. DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics **2018**, 34, 3223–3224.
52. Ntranos, V.; Yi, L.; Melsted, P.; Pachter, L. A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat. Methods **2019**, 16, 163–166.
53. Zhang, M.; Liu, S.; Miao, Z.; Han, F.; Gottardo, R.; Sun, W. IDEAS: Individual level differential expression analysis for single-cell RNA-seq data. Genome Biol. **2022**, 23, 33.
54. Katayama, S.; Töhönen, V.; Linnarsson, S.; Kere, J. SAMstrt: Statistical test for differential expression in single-cell transcriptome with spike-in normalization. Bioinformatics **2013**, 29, 2943–2945.
55. Guo, M.; Wang, H.; Potter, S.S.; Whitsett, J.A.; Xu, Y. SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis. PLoS Comput. Biol. **2015**, 11, e1004575.
56. Sengupta, D.; Rayan, N.A.; Lim, M.; Lim, B.; Prabhakar, S. Fast, scalable and accurate differential expression analysis for single cells. bioRxiv **2016**, 049734.
57. Nabavi, S.; Schmolze, D.; Maitituoheti, M.; Malladi, S.; Beck, A.H. EMDomics: A robust and powerful method for the identification of genes differentially expressed between heterogeneous classes. Bioinformatics **2016**, 32, 533–541.
58. Wang, T.; Nabavi, S. SigEMD: A powerful method for differential gene expression analysis in single-cell RNA sequencing data. Methods **2018**, 145, 25–32.
59. Wang, Z.; Jin, S.; Liu, G.; Zhang, X.; Wang, N.; Wu, D.; Hu, Y.; Zhang, C.; Jiang, Q.; Xu, L.; et al. DTWscore: Differential expression and cell clustering analysis for time-series single-cell RNA-seq data. BMC Bioinform. **2017**, 18, 270.
60. Gupta, K.; Lalit, M.; Biswas, A.; Sanada, C.; Greene, C.; Hukari, K.; Maulik, U.; Bandyopadhyay, S.; Ramalingam, N.; Ahuja, G.; et al. Modeling expression ranks for noise-tolerant differential expression analysis of scRNA-seq data. Genome Res. **2021**, 31, 689–697.
61. Li, H.-S.; Ou-Yang, L.; Zhu, Y.; Yan, H.; Zhang, X.-F. scDEA: Differential expression analysis in single-cell RNA-sequencing data via ensemble learning. Brief. Bioinform. **2022**, 23, bbab402.
62. Müller, M. Generalized Linear Models. In XploRe - Learning Guide; Springer: Berlin/Heidelberg, Germany, 2000; pp. 205–228.
63. McCullagh, P.; Nelder, J.A. Generalized Linear Models; Springer: Boston, MA, USA, 1989; ISBN 978-0-412-31760-6.
64. Kærn, M.; Elston, T.C.; Blake, W.J.; Collins, J.J. Stochasticity in gene expression: From theories to phenotypes. Nat. Rev. Genet. **2005**, 6, 451–464.
65. Birtwistle, M.R.; Rauch, J.; Kiyatkin, A.; Aksamitiene, E.; Dobrzyński, M.; Hoek, J.B.; Kolch, W.; Ogunnaike, B.A.; Kholodenko, B.N. Emergence of bimodal cell population responses from the interplay between analog single-cell signaling and protein expression noise. BMC Syst. Biol. **2012**, 6, 109.
66. Singer, Z.S.; Yong, J.; Tischler, J.; Hackett, J.A.; Altinok, A.; Surani, M.A.; Cai, L.; Elowitz, M.B. Dynamic Heterogeneity and DNA Methylation in Embryonic Stem Cells. Mol. Cell **2014**, 55, 319–331.
67. Dobrzyński, M.; Nguyen, L.K.; Birtwistle, M.R.; von Kriegsheim, A.; Blanco Fernández, A.; Cheong, A.; Kolch, W.; Kholodenko, B.N. Nonlinear signalling networks and cell-to-cell variability transform external signals into broadly distributed or bimodal responses. J. R. Soc. Interface **2014**, 11, 20140383.
68. Bendall, S.C.; Davis, K.L.; Amir, E.D.; Tadmor, M.D.; Simonds, E.F.; Chen, T.J.; Shenfeld, D.K.; Nolan, G.P.; Pe’er, D. Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development. Cell **2014**, 157, 714–725.
69. Bacher, R.; Kendziorski, C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. **2016**, 17, 63.
70. Moris, N.; Pina, C.; Arias, A.M. Transition states and cell fate decisions in epigenetic landscapes. Nat. Rev. Genet. **2016**, 17, 693–703.
71. Hafemeister, C.; Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. **2019**, 20, 296.
72. Townes, F.W.; Hicks, S.C.; Aryee, M.J.; Irizarry, R.A. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol. **2019**, 20, 295.
73. Anders, S.; Huber, W. Differential expression analysis for sequence count data. Genome Biol. **2010**, 11, R106.
74. Klein, A.M.; Mazutis, L.; Akartuna, I.; Tallapragada, N.; Veres, A.; Li, V.; Peshkin, L.; Weitz, D.A.; Kirschner, M.W. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell **2015**, 161, 1187–1201.
75. Seyednasrollah, F.; Rantanen, K.; Jaakkola, P.; Elo, L.L. ROTS: Reproducible RNA-seq biomarker detector-Prognostic markers for clear cell renal cell cancer. Nucleic Acids Res. **2016**, 44, e1.
76. Glazko, G.V.; Emmert-Streib, F. Unite and conquer: Univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics **2009**, 25, 2348–2354.
77. Das, S.; McClain, C.J.; Rai, S.N. Fifteen Years of Gene Set Analysis for High-Throughput Genomic Data: A Review of Statistical Approaches and Future Challenges. Entropy **2020**, 22, 427.
78. Das, S.; Rai, A.; Mishra, D.C.; Rai, S.N. Statistical Approach for Gene Set Analysis with Trait Specific Quantitative Trait Loci. Sci. Rep. **2018**, 8, 2391.
79. Squair, J.W.; Gautier, M.; Kathe, C.; Anderson, M.A.; James, N.D.; Hutson, T.H.; Hudelle, R.; Qaiser, T.; Matson, K.J.E.; Barraud, Q.; et al. Confronting false discoveries in single-cell differential expression. Nat. Commun. **2021**, 12, 5692.
80. Mehta, T.; Tanik, M.; Allison, D.B. Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nat. Genet. **2004**, 36, 943–947.
81. Chen, S.; Mar, J.C. Evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data. BMC Bioinform. **2018**, 19, 232.
82. Hou, W.; Ji, Z.; Ji, H.; Hicks, S.C. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. **2020**, 21, 218.
83. Ziegenhain, C.; Vieth, B.; Parekh, S.; Reinius, B.; Guillaumet-Adkins, A.; Smets, M.; Leonhardt, H.; Heyn, H.; Hellmann, I.; Enard, W. Comparative Analysis of Single-Cell RNA Sequencing Methods. Mol. Cell **2017**, 65, 631–643.e4.
84. Robinson, M.D.; Smyth, G.K. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics **2007**, 23, 2881–2887.
85. Sandberg, R. Entering the era of single-cell transcriptomics in biology and medicine. Nat. Methods **2014**, 11, 22–24.
86. Trapnell, C. Defining cell types and states with single-cell genomics. Genome Res. **2015**, 25, 1491–1498.
87. Islam, S.; Kjällquist, U.; Moliner, A.; Zajac, P.; Fan, J.B.; Lönnerberg, P.; Linnarsson, S. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. **2011**, 21, 1160–1167.
88. Luecken, M.D.; Theis, F.J. Current best practices in single-cell RNA-seq analysis: A tutorial. Mol. Syst. Biol. **2019**, 15, e8746.
89. Tung, P.-Y.; Blischak, J.D.; Hsiao, C.J.; Knowles, D.A.; Burnett, J.E.; Pritchard, J.K.; Gilad, Y. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. **2017**, 7, 39921.
90. Kolodziejczyk, A.A.; Kim, J.K.; Svensson, V.; Marioni, J.C.; Teichmann, S.A. The Technology and Biology of Single-Cell RNA Sequencing. Mol. Cell **2015**, 58, 610–620.
91. Stegle, O.; Teichmann, S.A.; Marioni, J.C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. **2015**, 16, 133–145.
92. Ma, Y.; Qiu, F.; Deng, C.; Li, J.; Huang, Y.; Wu, Z.; Zhou, Y.; Zhang, Y.; Xiong, Y.; Yao, Y.; et al. Integrating single-cell sequencing data with GWAS summary statistics reveals CD16+monocytes and memory CD8+T cells involved in severe COVID-19. Genome Med. **2022**, 14, 16.
93. Cui, C.; Shu, W.; Li, P. Fluorescence In situ Hybridization: Cell-Based Genetic Diagnostic and Research Applications. Front. Cell Dev. Biol. **2016**, 4, 89.
94. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. **2014**, 15, 550.
95. Malhotra, A.; Das, S.; Rai, S.N. Analysis of Single-Cell RNA-Sequencing Data: A Step-by-Step Guide. BioMedInformatics **2022**, 2, 43–61.
96. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics **2010**, 26, 139–140.
97. Zeileis, A.; Kleiber, C.; Jackman, S. Regression Models for Count Data in R. J. Stat. Softw. **2008**, 27, 1–25.
98. Kemp, C.D.; Kemp, A.W. Some properties of the “Hermite” distribution. Biometrika **1965**, 52, 381–394.
99. Boon, W.C.; Petkovic-Duran, K.; Zhu, Y.; Manasseh, R.; Horne, M.K.; Aumann, T.D. Increasing cDNA Yields from Single-cell Quantities of mRNA in Standard Laboratory Reverse Transcriptase Reactions using Acoustic Microstreaming. J. Vis. Exp. **2011**, 53, e3144.
100. Macaulay, I.C.; Voet, T. Single Cell Genomics: Advances and Future Perspectives. PLoS Genet. **2014**, 10, e1004126.
101. Marinov, G.K.; Williams, B.A.; McCue, K.; Schroth, G.P.; Gertz, J.; Myers, R.M.; Wold, B.J. From single-cell to cell-pool transcriptomes: Stochasticity in gene expression and RNA splicing. Genome Res. **2013**, 24, 496–510.
102. Pierson, E.; Yau, C. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. **2015**, 16, 241.
103. Wang, Y.; Navin, N.E. Advances and Applications of Single-Cell Sequencing Technologies. Mol. Cell **2015**, 58, 598–609.
104. McElduff, F.; Cortina-Borja, M.; Chan, S.-K.; Wade, A. When t-tests or Wilcoxon-Mann-Whitney tests won’t do. Adv. Physiol. Educ. **2010**, 34, 128–133.
105. Qiu, X.; Hill, A.; Packer, J.; Lin, D.; Ma, Y.-A.; Trapnell, C. Single-cell mRNA quantification and differential analysis with Census. Nat. Methods **2017**, 14, 309–315.

**Figure 1. Operational framework of differential expression analysis of scRNA-seq data.** Various steps in single-cell studies are shown. Pre-processing and various steps of DE analysis are also shown. Potential use and interpretation of obtained results are presented.

**Figure 2. Classification of available statistical approaches and tools used for DEA in single-cell studies.** Classification of the approaches is conducted based on the requirement of input data, data distribution, statistical models, etc. DE analytic tools belonging to each category are presented in pink colored boxes.

**Figure 3.** **Operational outlines of DE analytic GLM and two-class comparison approaches in scRNA-seq studies.** (**A**) Workflow of steps for GLM-based DE approaches. (**B**) Workflow of steps for two-class comparison approaches. In both classes, the framework can be divided into four major parts: (i) input (data provided as input to tools); (ii) pre-processing of data, which involves data cleaning, outlier removal, normalization, etc.; (iii) model fitting and computation of the DE test statistic, in which distributional/model assumptions (e.g., GLM, simple statistical distribution, or distribution-free) are made about the expression data, model parameters are estimated, and DE test statistic(s) for genes and their corresponding p-values are computed; and (iv) assessment and interpretation of DE results.
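The four-part two-class workflow in panel (B) can be illustrated with a minimal sketch. It assumes a genes × cells count matrix, one library size per cell, and a 0/1 group label per cell; all function names are hypothetical and the code is not taken from any tool in this review. It normalizes to log CPM, applies a Wilcoxon rank-sum test (normal approximation with tie correction), and adjusts the p-values with the Benjamini–Hochberg procedure.

```python
import math
from statistics import NormalDist


def log_cpm(counts, library_sizes):
    """Pre-processing: genes x cells counts -> log2(CPM + 1)."""
    return [[math.log2(c / s * 1e6 + 1.0) for c, s in zip(row, library_sizes)]
            for row in counts]


def average_ranks(values):
    """Ranks 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:  # every value tied: no evidence of a difference
        return u, 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


def two_class_de(counts, library_sizes, groups):
    """Input -> pre-processing -> per-gene test -> multiplicity adjustment."""
    p_values = []
    for row in log_cpm(counts, library_sizes):
        x = [v for v, g in zip(row, groups) if g == 0]
        y = [v for v, g in zip(row, groups) if g == 1]
        p_values.append(rank_sum_test(x, y)[1])
    return p_values, benjamini_hochberg(p_values)
```

The normal approximation is only a convenience for a dependency-free sketch; with few cells per group an exact test would be more appropriate.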

**Figure 4.** **Operational outlines of DE analytic GAM, Hurdle, and mixture-model classes of approaches in scRNA-seq studies.** (**A**) Workflow of steps for GAM-based DEA approaches. (**B**) Workflow of steps for Hurdle and mixture-model-based approaches. In both classes, the framework can be divided into four major parts: (i) input (data provided as input to tools); (ii) pre-processing of data, which involves data cleaning, outlier removal, normalization, etc.; (iii) model fitting and computation of the DEA test statistic, in which distributional/model assumptions (e.g., GAM, Hurdle, or mixture model) are made about the expression data, model parameters are estimated, and DEA test statistic(s) for genes and their corresponding p-values are computed; and (iv) assessment and interpretation of DEA results.
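As a rough illustration of step (iii) for the Hurdle class, the sketch below tests one gene between two cell groups with a two-part likelihood ratio test: a Bernoulli part for whether the gene is detected, and a Gaussian part (common variance) for the normalized expression of detected cells. This is a deliberately simplified, hypothetical example without covariates; it is not the implementation of MAST or any other tool, and the function names are assumptions.

```python
import math


def bernoulli_loglik(successes, trials):
    """Maximized Bernoulli log-likelihood for `successes` out of `trials`."""
    if trials == 0 or successes in (0, trials):
        return 0.0
    p = successes / trials
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)


def residual_ss(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def hurdle_lrt(x, y):
    """Two-part LRT for one gene; x and y hold normalized expression per group.

    Zeros are treated as 'not detected'. Returns (statistic, df, p_value).
    """
    pos_x = [v for v in x if v > 0]
    pos_y = [v for v in y if v > 0]

    # Discrete part: does the detection rate differ between groups?
    ll_alt = bernoulli_loglik(len(pos_x), len(x)) + bernoulli_loglik(len(pos_y), len(y))
    ll_null = bernoulli_loglik(len(pos_x) + len(pos_y), len(x) + len(y))
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    df = 1

    # Continuous part: does the mean of detected cells differ?
    # For a Gaussian with common variance, the LRT equals n * log(RSS0 / RSS1).
    n_pos = len(pos_x) + len(pos_y)
    if pos_x and pos_y and n_pos > 2:
        rss_alt = residual_ss(pos_x) + residual_ss(pos_y)
        if rss_alt > 0:
            stat += n_pos * math.log(residual_ss(pos_x + pos_y) / rss_alt)
            df = 2
    # If the continuous part is not estimable, only the discrete part is used.

    # Closed-form chi-square tail probabilities for 1 and 2 degrees of freedom.
    p_value = math.exp(-stat / 2.0) if df == 2 else math.erfc(math.sqrt(stat / 2.0))
    return stat, df, p_value
```

Summing the two statistics relies on the two parts being fitted separately, which is the defining feature of a hurdle model; the chi-square reference distribution is an asymptotic approximation.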

SN. | Methods | Year | Model | Input | DE Test Stat. | Runtime | Platform | Ref. |
---|---|---|---|---|---|---|---|---|
1 | NBID | 2018 | NB (GLM) | Counts | LRT | Medium | R code | [30] |
2 | ZINB–WaVE | 2018 | ZINB (GLM) | Counts | LRT | High | Bioconductor, GitHub | [31] |
3 | zingeR | 2018 | ZINB (GLM) | Counts | LRT | High | GitHub | [32,33] |
4 | DECENT | 2019 | ZINB (GLM) | Counts | LRT | High | GitHub | [24] |
5 | SwarnSeq | 2021 | ZINB (GLM) | Counts | LRT | High | GitHub | [13] |
6 | Tweedieverse | 2021 | ZITweedie (GLM) | Counts | Wald | High | GitHub | [34] |
7 | scMMST | 2021 | GLMM | Counts | Norm. score | High | NA | [35] |
8 | TPMM | 2022 | GLMM | Norm. | Wald/LRT | High | GitHub | [36] |
9 | Monocle2 | 2017 | GAM | Norm. | LRT | Medium | Bioconductor | [37,38] |
10 | tradeSeq | 2020 | GAM | Counts | Wald | Medium | GitHub | [39] |
11 | MAST | 2015 | Hurdle | Norm. | LRT/Wald | Medium | Bioconductor | [40] |
12 | Random-Hurdle | 2019 | Hurdle | Counts | Chi-square test statistic | High | NA | [41] |
13 | SCDE | 2014 | Poisson-NB (MM) | Counts | Bayesian stat. | High | Bioconductor | [42] |
14 | BASiCS | 2015 | Poisson-Gamma (MM) | Norm. | Posterior prob. | High | Bioconductor | [25] |
15 | D3E | 2016 | Poisson-Beta (MM) | Counts | CM/KS test | High | GitHub | [43] |
16 | BPSC | 2016 | Beta-Poisson (MM) | Counts | LRT | Medium | GitHub | [12] |
17 | TASC | 2017 | Logistic, Poisson Models (MM) | UMI | LRT | High | GitHub | [26] |
18 | DESCEND | 2018 | Poisson-Alpha (MM) | Counts | Normalized Gini Score | High | GitHub | [28] |
19 | SC2P | 2018 | ZIP, Poisson-Lognormal (MM) | Counts | Posterior prob. | High | GitHub | [44] |
20 | ZIAQ | 2020 | Logistic and quantile Regression (MM) | Norm. | Fisher’s test | Medium | GitHub | [45] |
21 | SimCD | 2021 | Gamma-NB (MM) | Counts | Bayesian | High | GitHub | [46] |
22 | ZIQRank | 2022 | Zero-inflated model, quantile regression (MM) | Cont. | Rank-score test | High | NA | [47] |
23 | Seurat | 2015 | NB (TCP) | Counts | LRT | Low | CRAN | [48,49] |
24 | scDD | 2016 | Multi-modal Bayesian (TCP) | Norm. | Bayesian stat. | High | Bioconductor | [50] |
25 | DEsingle | 2018 | ZINB (TCP) | Counts | LRT | High | Bioconductor, GitHub | [51] |
26 | NYMP | 2019 | Logistic regression (TCP) | Cont. | | Medium | GitHub | [52] |
27 | t-test | | logCPM (TCP) | Norm. | T stat | Low | CRAN | [10] |
28 | IDEAS | 2022 | NB/ZINB/Kernel Density estimation/ Cumulative distribution function (TCP) | Counts/Cont. | Jensen–Shannon Divergence/ Wasserstein distance | High | GitHub | [53] |
29 | SAMstrt | 2013 | NP | Counts | | Medium | GitHub | [54] |
30 | Wilcox | | NP | Counts/Norm. | Sum ranks | Low | CRAN | [10] |
31 | SINCERA | 2015 | NP | Norm. | Welch (LS)/ Wilcox (SS) | High | GitHub | [55] |
32 | NODES | 2016 | NP | Norm. | Wilcox | Medium | Dropbox | [56] |
33 | EMDomics | 2016 | NP | Norm. | Euclidean distance | High | Bioconductor | [57] |
34 | sigEMD | 2018 | NP | Norm. | Distance measure | High | GitHub | [58] |
35 | DTWscore | 2017 | NP | FPKM | Distance | Medium | GitHub | [59] |
36 | ROSeq | 2021 | NP | Counts/Norm. | Wald | High | Bioconductor, GitHub | [60] |
37 | scDEA ^{1} | 2021 | 12 Models (Hybrid) | Counts | Lancaster’s test (Chi) | High | GitHub | [61] |

^{1}: Integrated approach.
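Many count-based entries in the table above rely on a likelihood ratio test (LRT). The sketch below shows the mechanics on the simplest possible case, assuming a Poisson model with a library-size offset, ${Y}_{ij}\sim \mathrm{Poisson}({s}_{j}{\lambda}_{ig})$, for one gene and two cell groups. The Poisson is the limiting case of the NB as the dispersion ${\phi}_{i}$ tends to zero; the listed tools additionally estimate dispersion and, where applicable, zero inflation, so this is an illustration of the test statistic only. The function name is hypothetical.

```python
import math


def poisson_offset_lrt(counts_a, sizes_a, counts_b, sizes_b):
    """LRT for equal expression rates of one gene in two cell groups.

    Model (simplifying assumption): Y_j ~ Poisson(s_j * rate_g), with s_j the
    library size of cell j. The MLE of each rate is total count / total size.
    Returns (statistic, p_value); the statistic is referred to chi-square(1).
    """
    y_a, y_b = sum(counts_a), sum(counts_b)
    s_a, s_b = sum(sizes_a), sum(sizes_b)
    if y_a + y_b == 0:  # gene never observed: nothing to test
        return 0.0, 1.0

    pooled_rate = (y_a + y_b) / (s_a + s_b)  # null: one common rate
    stat = 0.0
    for y_g, s_g in ((y_a, s_a), (y_b, s_b)):
        if y_g > 0:  # a group with zero total count contributes 0
            stat += 2.0 * y_g * math.log((y_g / s_g) / pooled_rate)
    stat = max(stat, 0.0)  # guard against tiny negative rounding error

    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

Because the offset enters through the totals, a group sequenced twice as deeply with twice the counts yields the same estimated rate and therefore no evidence of differential expression.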

SN. | Class | Features | Limitations | Tools |
---|---|---|---|---|
1 | GLM | - Gene expression can follow any distribution of the exponential family.<br>- Suitable for bimodal data.<br>- Handles categorical predictors (e.g., cell type, cell cycle).<br>- Easy to interpret; shows clearly how each predictor influences the gene parameters.<br>- Generalizes to comparisons across multiple cell groups.<br>- Less susceptible to model over-fitting. | - Makes strict exponential-family distributional assumptions about the data.<br>- Needs relatively large datasets (with more predictors and a large number of cells).<br>- Sensitive to outliers.<br>- Sensitive to dropout events.<br>- Not suitable for lowly expressed genes.<br>- Cannot handle multi-modality of the data.<br>- ZIM–GLM approaches cannot handle zero-deflation at any level of a factor, which yields infinite parameter estimates for the logistic component.<br>- Higher computational cost, especially for large datasets. | NBID, zingeR, ZINB–WaVE, DECENT, SwarnSeq, scMMST, TPMM, Tweedieverse |
2 | GAM | - Predictor functions are derived automatically during model estimation.<br>- The marginal impact of a single variable does not depend on the values of the other variables in the model.<br>- Flexibility in choosing the type of functions helps to find patterns missed by a parametric model.<br>- Allows controlling the smoothness of the predictor functions to prevent model over-fitting.<br>- Controlling the wiggliness of the predictor functions directly addresses the bias/variance tradeoff.<br>- Highly effective in many settings, particularly when the response variable is modeled as a function of both categorical (e.g., cell groups) and continuous predictors (e.g., cell-level auxiliary variables).<br>- Considers both linear and non-linear functions of cell-level predictors to model gene parameters.<br>- Each lineage is represented by a separate cubic smoothing spline, whose flexibility allows adjustment for other covariates or confounders as fixed effects in the model. | - Approaches such as Monocle can only handle a single lineage of cells.<br>- Lack of interpretability when inferring differences in expression between lineages of cells.<br>- Assumes the dropout events to be linear; however, the effect of dropout events is likely to be non-linear, especially for genes with low to moderate expression.<br>- Computationally complex. | Monocle, Monocle2, Monocle3, tradeSeq |
3 | Hurdle Model | - Considers the excess zeros during model building.<br>- Can handle both zero-inflation and zero-deflation present in the data.<br>- Models the bimodality of the gene expression distribution. | - Does not differentiate the generating process for excess zeros from that for sampling zeros.<br>- Fails to consider the multi-modality of the gene expression distribution.<br>- Requires higher runtime. | MAST, Random-Hurdle |
4 | Mixture-Model | - Considers the bi-modal or multi-modal nature of single-cell data.<br>- Can differentiate between major sources of variation in single-cell data. | - Certain approaches, including BPSC and SC2P, cannot consider the zero-inflation in single-cell data.<br>- Mostly uses linear models for DEA, which is cumbersome.<br>- Higher runtime and computationally intensive. | SCDE, D3E, BPSC, BASiCS, DESCEND, SC2P, ZIAQ, ZIQRank, SimCD |
5 | Non-parametric (two-class) | - Distribution-free approaches.<br>- Considers the multi-modality of the data.<br>- Computationally light (lower runtime).<br>- Estimates the parameters without fitting any distribution for genes.<br>- Performs DEA with distance-like metrics across two cell types.<br>- Performs well when the proportion of zeros in the data is low. | - Mostly limited to comparisons between two cell groups.<br>- Computationally complex for multiple groups.<br>- Performance is severely affected by high dropout rates (some methods exclude dropouts).<br>- Cannot separate true/biological zeros from false/dropout zeros.<br>- Sensitive to sparsity.<br>- Methods such as D3E and scDD fail to consider the UMI count nature of the data.<br>- Cannot separate confounding factors from each other. | Wilcox, NODES, ROTS, EMDomics, ROSeq, SINCERA, sigEMD, DTWscore, SAMstrt |
6 | Parametric (two-class) | - Easy to understand and execute.<br>- Lower runtime.<br>- Particularly suitable for larger datasets. | - Makes strict distributional assumptions about the data.<br>- Cannot generalize to multi-group comparisons.<br>- Ignores the multi-modal distributions of scRNA-seq data.<br>- Sensitive to sparsity or dropout events.<br>- Cannot differentiate between the major sources of variability in the data. | scDD, DEsingle, t-test, NYMP, IDEAS |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Das, S.; Rai, A.; Rai, S.N.
Differential Expression Analysis of Single-Cell RNA-Seq Data: Current Statistical Approaches and Outstanding Challenges. *Entropy* **2022**, *24*, 995.
https://doi.org/10.3390/e24070995
