# Recommendations of scRNA-seq Differential Gene Expression Analysis Based on Comprehensive Benchmarking

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Analysis Datasets

#### 2.2. Simulation Methods

#### 2.3. Diagnostic Plot Methods

#### 2.4. DE Benchmarking Methods

#### 2.5. Simulation Performance Methods

#### 2.6. Real Data DE Application Methods

## 3. Results

#### 3.1. Diagnostic Plots

#### 3.2. Type-I Error Rate Control

#### 3.3. FC Bias

#### 3.4. FC Correlations

#### 3.5. FDR Control

#### 3.6. Power

#### 3.7. AUROC and PRAUC

#### 3.8. Computation Time

^{2.5}) seconds for each simulation, implying that the NEBULA methods were three times faster than glmmTMB or MAST.cdr overall. NEBULA-HL consistently took longer to compute than NEBULA-LN. Pseudo-bulk DE methods, except for ANCOVA, were time-efficient relative to cell-level methods (Figure 8). The rankings were identical across all cell types and were not affected by the larger number of lowly expressed genes in the DEG testing set (Figure 8b).

#### 3.9. Heatmap

#### 3.10. Real Data Application Results

## 4. Discussion

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Code Availability Statement

## References

- Svensson, V.; da Veiga Beltrame, E.; Pachter, L. A curated database reveals trends in single-cell transcriptomics. Database **2020**, 2020, 1–7.
- Cao, J.; O’Day, D.R.; Pliner, H.A.; Kingsley, P.D.; Deng, M.; Daza, R.M.; Zager, M.A.; Aldinger, K.A.; Blecher-Gonen, R.; Zhang, F.; et al. A human cell atlas of fetal gene expression. Science **2020**, 370, 7721.
- Jindal, A.; Gupta, P.; Jayadeva; Sengupta, D. Discovery of rare cells from voluminous single cell expression data. Nat. Commun. **2018**, 9, 4719.
- Nguyen, A.; Khoo, W.H.; Moran, I.; Croucher, P.I.; Phan, T.G. Single Cell RNA Sequencing of Rare Immune Cell Populations. Front. Immunol. **2018**, 9, 1553.
- Schirmer, L.; Velmeshev, D.; Holmqvist, S.; Kaufmann, M.; Werneburg, S.; Jung, D.; Vistnes, S.; Stockley, J.H.; Young, A.; Steindel, M.; et al. Neuronal vulnerability and multilineage diversity in multiple sclerosis. Nature **2019**, 573, 75–82.
- Reyfman, P.A.; Walter, J.M.; Joshi, N.; Anekalla, K.R.; McQuattie-Pimentel, A.C.; Chiu, S.; Fernandez, R.; Akbarpour, M.; Chen, C.I.; Ren, Z.; et al. Single-Cell Transcriptomic Analysis of Human Lung Provides Insights into the Pathobiology of Pulmonary Fibrosis. Am. J. Respir. Crit. Care Med. **2019**, 199, 1517–1536.
- Soneson, C.; Robinson, M.D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods **2018**, 15, 255–261.
- Benidt, S.; Nettleton, D. SimSeq: A nonparametric approach to simulation of RNA-sequence datasets. Bioinformatics **2015**, 31, 2131–2140.
- Assefa, A.T.; Vandesompele, J.; Thas, O. SPsimSeq: Semi-parametric simulation of bulk and single-cell RNA-sequencing data. Bioinformatics **2020**, 36, 3276–3278.
- Crowell, H.L.; Soneson, C.; Germain, P.L.; Calini, D.; Collin, L.; Raposo, C.; Malhotra, D.; Robinson, M.D. muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. **2020**, 11, 6077.
- Zappia, L.; Phipson, B.; Oshlack, A. Splatter: Simulation of single-cell RNA sequencing data. Genome Biol. **2017**, 18, 174.
- Li, W.V.; Li, J.J. A statistical simulator scDesign for rational scRNA-seq experimental design. Bioinformatics **2019**, 35, i41–i50.
- Zhang, M.; Liu, S.; Miao, Z.; Han, F.; Gottardo, R.; Sun, W. IDEAS: Individual level differential expression analysis for single-cell RNA-seq data. Genome Biol. **2022**, 23, 33.
- Squair, J.W.; Gautier, M.; Kathe, C.; Anderson, M.A.; James, N.D.; Hutson, T.H.; Hudelle, R.; Qaiser, T.; Matson, K.J.E.; Barraud, Q.; et al. Confronting false discoveries in single-cell differential expression. Nat. Commun. **2021**, 12, 5692.
- Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics **2010**, 26, 139–140.
- Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. **2015**, 43, e47.
- Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. **2014**, 15, 550.
- Brooks, M.E.; Kristensen, K.; Van Benthem, K.J.; Magnusson, A.; Berg, C.W.; Nielsen, A.; Skaug, H.J.; Machler, M.; Bolker, B.M. glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling. R J. **2017**, 9, 378–400.
- He, L.; Davila-Velderrain, J.; Sumida, T.S.; Hafler, D.A.; Kellis, M.; Kulminski, A.M. NEBULA is a fast negative binomial mixed model for differential or co-expression analysis of large-scale multi-subject single-cell data. Commun. Biol. **2021**, 4, 629.
- Finak, G.; McDavid, A.; Yajima, M.; Deng, J.; Gersuk, V.; Shalek, A.K.; Slichter, C.K.; Miller, H.W.; McElrath, M.J.; Prlic, M.; et al. MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. **2015**, 16, 278.
- Miao, Z.; Zhang, X. Differential expression analyses for single-cell RNA-Seq: Old questions on new data. Quant. Biol. **2016**, 4, 243–260.
- Jaakkola, M.K.; Seyednasrollah, F.; Mehmood, A.; Elo, L.L. Comparison of methods to detect differentially expressed genes between single-cell populations. Brief. Bioinform. **2017**, 18, 735–743.
- Dal Molin, A.; Baruzzo, G.; Di Camillo, B. Single-Cell RNA-Sequencing: Assessment of Differential Expression Analysis Methods. Front. Genet. **2017**, 8, 62.
- Reich, D.S.; Lucchinetti, C.F.; Calabresi, P.A. Multiple Sclerosis. N. Engl. J. Med. **2018**, 378, 169–180.
- Lassmann, H. Multiple Sclerosis Pathology. Cold Spring Harb. Perspect. Med. **2018**, 8, a028936.
- Trapp, B.D.; Peterson, J.; Ransohoff, R.M.; Rudick, R.; Mork, S.; Bo, L. Axonal transection in the lesions of multiple sclerosis. N. Engl. J. Med. **1998**, 338, 278–285.
- Schirmer, L.; Antel, J.P.; Bruck, W.; Stadelmann, C. Axonal loss and neurofilament phosphorylation changes accompany lesion development and clinical progression in multiple sclerosis. Brain Pathol. **2011**, 21, 428–440.
- Lederer, D.J.; Martinez, F.J. Idiopathic Pulmonary Fibrosis. N. Engl. J. Med. **2018**, 379, 797–798.
- Wynn, T.A. Fibrotic disease and the T(H)1/T(H)2 paradigm. Nat. Rev. Immunol. **2004**, 4, 583–594.
- Korsunsky, I.; Millard, N.; Fan, J.; Slowikowski, K.; Zhang, F.; Wei, K.; Baglaenko, Y.; Brenner, M.; Loh, P.R.; Raychaudhuri, S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods **2019**, 16, 1289–1296.
- Sing, T.; Sander, O.; Beerenwinkel, N.; Lengauer, T. ROCR: Visualizing classifier performance in R. Bioinformatics **2005**, 21, 3940–3941.
- Grau, J.; Grosse, I.; Keilwagen, J. PRROC: Computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics **2015**, 31, 2595–2597.
- Subramanian, A.; Tamayo, P.; Mootha, V.K.; Mukherjee, S.; Ebert, B.L.; Gillette, M.A.; Paulovich, A.; Pomeroy, S.L.; Golub, T.R.; Lander, E.S.; et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA **2005**, 102, 15545–15550.
- Liberzon, A.; Subramanian, A.; Pinchback, R.; Thorvaldsdottir, H.; Tamayo, P.; Mesirov, J.P. Molecular signatures database (MSigDB) 3.0. Bioinformatics **2011**, 27, 1739–1740.
- Korotkevich, G.; Sukhov, V.; Budin, N.; Shpak, B.; Artyomov, M.N.; Sergushichev, A. Fast gene set enrichment analysis. bioRxiv **2021**.
- Beutel, T.; Dzimiera, J.; Kapell, H.; Engelhardt, M.; Gass, A.; Schirmer, L. Cortical projection neurons as a therapeutic target in multiple sclerosis. Expert Opin. Ther. Targets **2020**, 24, 1211–1224.
- Lauranzano, E.; Pozzi, S.; Pasetto, L.; Stucchi, R.; Massignan, T.; Paolella, K.; Mombrini, M.; Nardo, G.; Lunetta, C.; Corbo, M.; et al. Peptidylprolyl isomerase A governs TARDBP function and assembly in heterogeneous nuclear ribonucleoprotein complexes. Brain **2015**, 138, 974–991.
- Gilgun-Sherki, Y.; Melamed, E.; Offen, D. The role of oxidative stress in the pathogenesis of multiple sclerosis: The need for effective antioxidant therapy. J. Neurol. **2004**, 251, 261–268.
- Gonsette, R.E. Neurodegeneration in multiple sclerosis: The role of oxidative stress and excitotoxicity. J. Neurol. Sci. **2008**, 274, 48–53.
- Ascherio, A.; Munger, K.L. Environmental risk factors for multiple sclerosis. Part I: The role of infection. Ann. Neurol. **2007**, 61, 288–299.
- Homer, R.J.; Elias, J.A.; Lee, C.G.; Herzog, E. Modern concepts on the role of inflammation in pulmonary fibrosis. Arch. Pathol. Lab. Med. **2011**, 135, 780–788.
- Kuwano, K. Involvement of epithelial cell apoptosis in interstitial lung diseases. Intern. Med. **2008**, 47, 345–353.
- Noble, P.W.; Homer, R.J. Idiopathic pulmonary fibrosis: New insights into pathogenesis. Clin. Chest Med. **2004**, 25, 749–758.
- Bouland, G.A.; Mahfouz, A.; Reinders, M.J.T. Differential analysis of binarized single-cell RNA sequencing data captures biological variation. NAR Genom. Bioinform. **2021**, 3, lqab118.
- Alan, E.; Murphy, N.G.S. A balanced measure shows superior performance of pseudobulk methods over mixed models and pseudoreplication approaches in single-cell RNA-sequencing analysis. bioRxiv **2022**.
- Zimmerman, K.D.; Espeland, M.A.; Langefeld, C.D. A practical solution to pseudoreplication bias in single-cell studies. Nat. Commun. **2021**, 12, 738.

**Figure 1.** Diagnostic plots comparing the simulation with real data. (**a**) One control sample from the EN-MIX cell type in Schirmer et al. [5]; (**b**) one control sample from the AT1 cell type in Reyfman et al. [6]. P1: scatterplot of all gene means from real vs. simulated cell-level normalized counts; P2: scatterplot of filtered dispersions from real vs. simulated cell-level normalized counts; P3: boxplot of all library sizes from real vs. simulated cell-level normalized counts; P4: scatterplot of the proportion of zero counts from real vs. simulated cell-level normalized counts; P5: Loess smoother with 95% confidence intervals of the relationship between the filtered means and dispersions from real vs. simulated cell-level normalized counts.

**Figure 2.** Distribution of the empirical false positive rate (FPR) at a nominal type-I error rate of 0.05 (red dotted line). (**a**) Data were simulated based on Schirmer et al. [5], and lowly expressed genes were excluded by the ‘and’ filtering scheme; (**b**) data were simulated based on Reyfman et al. [6], and lowly expressed genes were excluded by the ‘and’ filtering scheme.
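The empirical FPR summarized in Figure 2 can be illustrated with a minimal Python sketch. It assumes the FPR is the fraction of genes simulated without differential expression whose unadjusted p-value falls below the nominal level of 0.05; the function name and the toy p-values are hypothetical and are not taken from the benchmark code.

```python
def empirical_fpr(null_p_values, alpha=0.05):
    """Fraction of null (non-DE) genes with a p-value below alpha.

    Assumption: rejection uses the raw p-value with a strict '<' comparison.
    """
    if not null_p_values:
        raise ValueError("need at least one p-value")
    rejected = sum(1 for p in null_p_values if p < alpha)
    return rejected / len(null_p_values)


# Hypothetical toy input: p-values for 10 genes simulated with no true effect.
toy_null_p = [0.01, 0.20, 0.04, 0.50, 0.73, 0.049, 0.90, 0.33, 0.61, 0.08]
toy_fpr = empirical_fpr(toy_null_p)
print(f"empirical FPR = {toy_fpr:.2f}")
```

A well-calibrated method would give values scattered around 0.05 across simulation replicates, which is what the red dotted line in the figure marks.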

**Figure 6.** Boxplots of power under the ‘and’ filtering scheme based on Schirmer et al. [5]. (**a**) FC = 1.5; (**b**) FC = 1.2.

**Figure 7.** Boxplots of AUROC and PRAUC given FC = 1.2 and the ‘or’ filtering scheme based on Schirmer et al. [5]. (**a**) AUROC; (**b**) PRAUC.

**Figure 8.** Boxplots of elapsed time on a base-10 logarithmic scale based on Schirmer et al. [5] when FC = 1.5. (**a**) Using the ‘and’ filtering scheme; (**b**) using the ‘or’ filtering scheme.

**Figure 10.** Upset plots showing the overlap of real-data DEGs identified by three DE methods (DESeq2, NEBULA-HL, and glmmTMB) at FC 1.5 and FDR 0.05 cutoffs for the Schirmer data [5]. Overlap is shown per cell type. (**a**) EN-L2-3 cells; (**b**) EN-L4 cells; (**c**) OL cells; (**d**) OPC cells.
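The overlap counts that an upset plot such as Figure 10 displays can be sketched as follows. This is an illustration only: it assumes a gene is called a DEG when |log2 FC| ≥ log2(1.5) in either direction and the adjusted p-value is below 0.05, and all gene names and values are hypothetical.

```python
import math


def call_degs(results, fc_cutoff=1.5, fdr_cutoff=0.05):
    """Return genes passing both cutoffs.

    results maps gene -> (log2 fold change, adjusted p-value).
    Assumption: the FC cutoff applies symmetrically to up- and down-regulation.
    """
    min_abs_lfc = math.log2(fc_cutoff)
    return {
        gene
        for gene, (lfc, padj) in results.items()
        if abs(lfc) >= min_abs_lfc and padj < fdr_cutoff
    }


def exclusive_intersections(deg_sets):
    """Map each combination of methods to the genes found by exactly those methods."""
    membership = {}
    for method, genes in deg_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(method)
    groups = {}
    for gene, methods in membership.items():
        groups.setdefault(frozenset(methods), set()).add(gene)
    return groups


# Hypothetical per-method results for one cell type.
toy_results = {
    "DESeq2": {"geneA": (1.0, 0.001), "geneB": (0.3, 0.001), "geneC": (-0.9, 0.01)},
    "NEBULA-HL": {"geneA": (0.8, 0.02), "geneB": (0.7, 0.2), "geneC": (-0.2, 0.01)},
    "glmmTMB": {"geneA": (0.9, 0.01), "geneC": (-1.1, 0.03), "geneD": (0.6, 0.04)},
}
deg_sets = {method: call_degs(res) for method, res in toy_results.items()}
overlaps = exclusive_intersections(deg_sets)
for methods, genes in sorted(overlaps.items(), key=lambda kv: -len(kv[0])):
    print(sorted(methods), sorted(genes))
```

Each bar in an upset plot corresponds to one key of `overlaps`, and its height is the size of the associated gene set.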

**Figure 11.** Volcano plots of EN-L2-3, EN-L4, OL, and OPC cell types for the Schirmer data [5].

**Figure 12.** Heatmap plots of EN-L2-3, EN-L4, OL, and OPC cell types for the Schirmer data [5].

**Figure 13.** Upset plots showing the overlap of real-data DEGs identified by three DE methods (DESeq2, NEBULA-HL, and glmmTMB) at FC 1.5 and FDR 0.05 cutoffs for the Reyfman data [6]. Overlap is shown per cell type. (**a**) Alveolar macrophage cells; (**b**) AT2 cells; (**c**) SMC + Fibroblast cells.

**Figure 14.** Volcano plots of AT2, Alveolar macrophage, and SMC + Fibroblast cell types for the Reyfman data [6].

**Figure 15.** Heatmap of the enriched GO terms (Reyfman fibrosis data [6]). The color bar indicates the normalized enrichment score (NES). A larger absolute NES indicates stronger enrichment; NES = 0 indicates that the pathway could not be enriched. The more rows with NES = 0, the worse the performance of the DEG method.

Metric | Good | Intermediate | Poor |
---|---|---|---|
Power.median | K-means class including the maximum median power | Otherwise | K-means class including the minimum median power |
FDP.median | No more than 75% of FDPs (false discovery proportions) on one side (above or below) of 0.05, and 0.0167 < median FDP < 0.15 | Otherwise | Median FDP ≥ 0.25, or median FDP ≤ 0.01, or at least one FDP is missing |
missFDP | 0 | <0.5 | ≥0.5 |
AUROC.median | ≥0.9 | ≥0.7 and <0.9 | <0.7 |
PRAUC.median | ≥0.8 | ≥0.4 and <0.8 | <0.4 |
FPR.median | $\left\lvert \log_{2}\left(\frac{\text{median FPR}}{0.05}\right) \right\rvert < \log_{2}(1.5)$ | $\log_{2}(1.5) \le \left\lvert \log_{2}\left(\frac{\text{median FPR}}{0.05}\right) \right\rvert < 2$ | $2 \le \left\lvert \log_{2}\left(\frac{\text{median FPR}}{0.05}\right) \right\rvert$ |
Time.median | ≤10 | >10 and ≤500 | >500 |
Abs(FC bias.median) ($FC^{w}$ is 1.2 for Schirmer et al. [5] or 1.4 for Reyfman et al. [6]) | $\le 0.05 \times \frac{FC}{FC^{w}}$ | $> 0.05 \times \frac{FC}{FC^{w}}$ and $\le 0.10 \times \frac{FC}{FC^{w}}$ | $> 0.10 \times \frac{FC}{FC^{w}}$ |
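The threshold-based rows of the criteria table above can be written out as a short Python sketch. It covers only the rows defined by fixed cutoffs (AUROC, PRAUC, FPR, time, missFDP, and FC bias); the function names are hypothetical, the FPR row is read as |log2(median FPR / 0.05)| < log2(1.5) for Good and < 2 for Intermediate, and the time unit is assumed to be seconds, as in Section 3.8.

```python
import math


def rate_auroc(median):
    return "Good" if median >= 0.9 else "Intermediate" if median >= 0.7 else "Poor"


def rate_prauc(median):
    return "Good" if median >= 0.8 else "Intermediate" if median >= 0.4 else "Poor"


def rate_time(median_seconds):
    # Assumption: Time.median is measured in seconds.
    if median_seconds <= 10:
        return "Good"
    return "Intermediate" if median_seconds <= 500 else "Poor"


def rate_miss_fdp(fraction_missing):
    if fraction_missing == 0:
        return "Good"
    return "Intermediate" if fraction_missing < 0.5 else "Poor"


def rate_fpr(median_fpr, alpha=0.05):
    # A median FPR of zero has an infinite log-ratio, so it is rated Poor here.
    if median_fpr <= 0:
        return "Poor"
    deviation = abs(math.log2(median_fpr / alpha))
    if deviation < math.log2(1.5):
        return "Good"
    return "Intermediate" if deviation < 2 else "Poor"


def rate_fc_bias(median_bias, fc, fc_w):
    # fc_w is the reference fold change (1.2 or 1.4 depending on the dataset).
    scale = fc / fc_w
    magnitude = abs(median_bias)
    if magnitude <= 0.05 * scale:
        return "Good"
    return "Intermediate" if magnitude <= 0.10 * scale else "Poor"


# Hypothetical summary values for one method in one simulation setting.
print(rate_auroc(0.93), rate_prauc(0.55), rate_fpr(0.10), rate_time(42.0))
```

Because the FPR criterion is symmetric on the log2 scale, a median FPR of 0.025 and one of 0.10 receive the same rating: both deviate from 0.05 by a factor of two.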

**Table 2.** Comparison of DEG approaches for covariate handling, documentation, and complex design support.

Method | Covariates? | Documentation? | Fixed Effect Matrix? | Random Design Matrix? | Download Link |
---|---|---|---|---|---|
t-test | No | Textbook | No | No | N/A |
u-test | No | Textbook | No | No | N/A |
ANCOVA | Yes | Textbook | No ^{1} | No | N/A |
edgeR | Yes | Vignette, users guide, reference | Yes | No | https://bioconductor.org/packages/release/bioc/html/edgeR.html (last accessed 22 April 2022) |
limma | Yes | Quick start, users guide, reference | Yes | No | https://bioconductor.org/packages/release/bioc/html/limma.html (last accessed 8 February 2022) |
DESeq2 | Yes | Quick start, users guide, reference | Yes | No | https://bioconductor.org/packages/release/bioc/html/DESeq2.html (last accessed 11 February 2022) |
MAST | Yes | Intro, MAST examples, reference | Yes | Yes | https://www.bioconductor.org/packages/release/bioc/html/MAST.html (last accessed 10 February 2022) |
glmmTMB | Yes | Multiple vignettes and reference | Yes | Yes | https://cran.r-project.org/web/packages/glmmTMB/index.html (last accessed 1 April 2022) |
NEBULA | Yes | Vignette and reference | Yes | No | https://cran.r-project.org/web/packages/nebula/index.html (last accessed 2 June 2022) |

^{1} Traditional ANCOVA models a single group effect with one or more covariates. However, more complex designs are possible with ANCOVA.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Gagnon, J.; Pi, L.; Ryals, M.; Wan, Q.; Hu, W.; Ouyang, Z.; Zhang, B.; Li, K.
Recommendations of scRNA-seq Differential Gene Expression Analysis Based on Comprehensive Benchmarking. *Life* **2022**, *12*, 850.
https://doi.org/10.3390/life12060850
