A Two-Part Mixed Model for Differential Expression Analysis in Single-Cell High-Throughput Gene Expression Data

Yang Shi; Ji-Hyun Lee; Huining Kang; Hui Jiang

doi:10.3390/genes13020377

¹ Division of Biostatistics and Data Science, Department of Population Health Sciences and Department of Neuroscience and Regenerative Medicine, Medical College of Georgia, Augusta University, Augusta, GA 30912, USA

² Department of Internal Medicine, University of New Mexico Comprehensive Cancer Center, Albuquerque, NM 87102, USA

³ Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA

⁴ Division of Quantitative Sciences, University of Florida Health Cancer Center and Department of Biostatistics, University of Florida, Gainesville, FL 32610, USA

Genes 2022, 13(2), 377; https://doi.org/10.3390/genes13020377

This article belongs to the Section Technologies and Resources for Genetics

Abstract

The high-throughput gene expression data generated from recent single-cell RNA sequencing (scRNA-seq) and parallel single-cell reverse transcription quantitative real-time PCR (scRT-qPCR) technologies enable biologists to study the function of the transcriptome at the level of individual cells. Compared with bulk RNA-seq and RT-qPCR gene expression data, single-cell data show notably distinct features, including excessive zero expression values, high variability, and clustered design. We propose to model single-cell high-throughput gene expression data using a two-part mixed model, which not only adequately accounts for the aforementioned features of single-cell expression data but also provides the flexibility of adjusting for covariates. An efficient computational algorithm, automatic differentiation, is used for estimating the model parameters. Compared with existing methods, our approach shows improved power for detecting differentially expressed genes in single-cell high-throughput gene expression data.

Keywords:

two-part mixed-model; single-cell RNA-seq; single-cell RT-qPCR; differential expression; automatic differentiation

1. Introduction

Recently, single-cell high-throughput gene expression profiling technologies, including single-cell RNA sequencing (scRNA-seq) and parallel single-cell reverse transcription quantitative real-time PCR (scRT-qPCR), have enabled researchers to examine mRNA expression at the resolution of individual cells, which provides further biological insights into transcriptomes and functional genomics [1,2,3,4]. Compared to bulk RNA-seq and RT-qPCR experiments, which are usually performed on animal tissues (i.e., cell populations) and homogeneous cell lines, single-cell high-throughput gene expression data generated by scRNA-seq and scRT-qPCR have the following distinct features, as seen in recent literature [4,5,6]:

Excessive zero expression values. The proportions of genes with observed zero expression values in single-cell gene expression data are much larger than in bulk RNA-seq or RT-qPCR data [4,5,6]. The reasons for this phenomenon can be biological, in that the mRNA abundance of certain transcripts is genuinely low in individual cells, or technical, in that the total amount of mRNA extracted from a single-cell sample is low [4,6].

High variability of expression levels across samples. It has been observed that scRNA-seq or scRT-qPCR data tend to show higher variability than bulk RNA-seq or RT-qPCR data [4,6]. This can be explained by the differences in the designs between the two: the regular bulk RNA-seq or RT-qPCR experiments are performed on the cell populations, and the gene expression levels from those experiments are averaged across all individual cells in the population, which dilutes the variability of gene expression levels among individual cells [6].

Clustering of single-cell samples within subjects. Another notable feature of single-cell high-throughput gene expression data is that each individual single-cell sample is randomly sampled from a higher-level cluster unit (e.g., patients, animals) [1,2,7]. Therefore, the single-cell samples from the same subject are expected to be more homogeneous than those from different subjects, which has been shown in several single-cell RNA-seq data published recently [1,2,7]. From a statistical perspective, this feature is called the clustering effect, which should be adequately adjusted for in the analysis.

To account for the abovementioned issues, we propose to model single-cell high-throughput gene expression data using a two-part mixed model. This model not only adequately accounts for the above features of single-cell gene expression data but also provides flexibility for adjusting for covariates in the study design. The details of this model and how it can be applied to differential expression analysis of single-cell data are discussed in the rest of this paper, which is organized as follows. First, we describe the formulation of the two-part mixed model with a brief literature review. Then we use an efficient method, named automatic differentiation, to fit the model. We also discuss how to test for differential expression under this model and describe several methods for approximating the null distribution of the test statistics for small sample sizes, followed by simulations for studying the type I error rate and statistical power. Finally, we demonstrate our approach by applying it to two real-world single-cell high-throughput gene expression datasets: one from scRT-qPCR and the other from scRNA-seq.

2. Materials and Methods

2.1. The Two-Part Mixed Model for Single-Cell Gene Expression Data

We first introduce the notation for our approach. Assume there are $m$ subjects and $N$ genes in an scRNA-seq experiment, and $n_{i}$ single-cell samples extracted and sequenced for subject $i$ ($i = 1, \dots, m$). Let $y_{i j k}$ be the normalized expression value (in the unit of RPKM/FPKM, TPM, or CPM) for gene $k$ ($k = 1, \dots, N$) in single-cell sample $j$ ($j = 1, \dots, n_{i}$) in subject $i$. We then model the gene expression value $y_{i j k}$ using the following two-part mixed model:

\begin{array}{l} logit [\Pr (y_{i j k} = 0)] = \log (\frac{π_{i j k}}{1 - π_{i j k}}) = w_{k}^{T} α_{k} + u_{i k}, \\ \log (y_{i j k} + c | y_{i j k} > 0) = x_{k}^{T} β_{k} + v_{i k} + e_{i j k}, \end{array}

(1)

where $π_{i j k}$ is the proportion of single-cell samples with zero expression values for gene $k$ (named “zero-proportion” hereafter). In this two-part model, the zero-proportions are modeled by a logistic regression model (logistic or binomial part), and the log-transformed non-zero expression values are modeled by a linear regression model (Gaussian part), where $w_{k}^{T}$ and $x_{k}^{T}$ are the vectors of covariates for the binomial and Gaussian parts, respectively (e.g., if there are only two biological conditions and no other covariates to be adjusted for, $w_{k}^{T}$ and $x_{k}^{T}$ are simply the vectors of 1/0 indicators for the biological conditions); $α_{k}$ and $β_{k}$ are the corresponding vectors of regression coefficients associated with the covariates $w_{k}^{T}$ and $x_{k}^{T}$; $e_{i j k}$ is the random error, which is assumed to be distributed as $N (0, σ_{e}^{2})$; and $u_{i k}$ and $v_{i k}$ are the random effects for subject $i$ that account for the clustering effects, which are assumed to follow the bivariate normal distribution

\begin{pmatrix} u_{i k} \\ v_{i k} \end{pmatrix} \sim N \left( 0, \begin{pmatrix} σ_{u}^{2} & ρ σ_{u} σ_{v} \\ ρ σ_{u} σ_{v} & σ_{v}^{2} \end{pmatrix} \right)

(2)

with $σ_{u}^{2}$ and $σ_{v}^{2}$ as the variances for the marginal univariate normal distributions of $u_{i k}$ and $v_{i k}$, and $ρ$ as the correlation between them. We note that most scRNA-seq experiments contain only one level of clusters (i.e., single cells are sampled from subjects). If the study design is more complicated, such that it may contain multi-level cluster effects, then more variance components for the random effects can be added into the model. Finally, a small constant $c$ is added to the non-zero expression levels before taking logarithms to avoid the left skewness caused by taking logarithms on small expression values between 0 and 1, which is often seen in RNA-seq data. In the following analysis of scRNA-seq data, $c$ is set to 1.

In an scRT-qPCR experiment, the gene expression levels are usually measured by the expression threshold ($et$) value, which is defined as $et = c_{\max} - ct$, where $c_{\max}$ is the maximum number of amplification cycles used in the scRT-qPCR experiment and $ct$ is the threshold cycle at which the gene is detected by the PCR instrument [5]. The gene expression level $y_{i j k}$ is assumed to have an exponential relationship with $et$, such that $y_{i j k} = 2^{et}$ (for undetected genes, $et$ is reported as a missing value by the PCR machine and can be treated as $- \infty$, which gives a zero expression value) [5]. Therefore, Model (1) can also be used to model gene expression values in scRT-qPCR data, and the definitions of the parameters are exactly the same as those given above for scRNA-seq data. The only difference is that adding the small constant $c$ is not necessary for scRT-qPCR data, as the non-zero gene expression levels in scRT-qPCR experiments do not include many small values between 0 and 1, unlike those in scRNA-seq data.
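The conversion from threshold cycles to expression values described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `ct_to_expression`, the use of `None` to encode an undetected gene, and the example value of 40 amplification cycles are assumptions made here.

```python
from typing import Optional


def ct_to_expression(ct: Optional[float], c_max: float) -> float:
    """Convert a threshold cycle (ct) into an expression value y = 2**et.

    et = c_max - ct. An undetected gene (ct is None, an assumed encoding of
    the missing value reported by the instrument) is treated as et = -inf,
    which gives y = 0.
    """
    if ct is None:
        return 0.0
    et = c_max - ct
    return 2.0 ** et


# Hypothetical example with c_max = 40 amplification cycles.
detected = ct_to_expression(30.0, 40.0)    # et = 10, so y = 2**10
undetected = ct_to_expression(None, 40.0)  # zero expression value
print(detected, undetected)
```

Because the non-zero values produced this way are powers of two with $et \geq 0$, they are never smaller than 1, which is consistent with the remark that the constant $c$ is not needed for scRT-qPCR data.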

Remark on related literature: The two-part model including the binomial part and Gaussian part without random effects was first proposed for modeling medical care data [8,9], where the dependent variable (medical care expenses) can take any non-negative value but has a positive probability at zero (this type of data is also called semicontinuous data) [8,9,10]. This type of model was later extended to longitudinal or clustered semicontinuous data by incorporating random effects for both the binomial part and the Gaussian part [11]. A comprehensive survey of a variety of models, with applications, for data taking non-negative values with a substantial proportion of zero values is given in [10]. Our two-part mixed model essentially follows the model formulation in [10,11], except for the addition of a small constant c to the non-zero expression values in RNA-seq data [Equation (1)]. A similar yet different two-part model without random effects, named MAST, was proposed to model scRNA-seq data in a recent paper [12]. Instead of incorporating clustered random effects from subjects, MAST uses an empirical Bayes method to shrink the gene-specific variance to the global variance of all genes [12].

2.2. Model Fitting

The proposed two-part mixed model (1) will be referred to as TMM hereafter. Since the TMM is fitted for each gene independently, we will drop the subscript $k$ for simplicity if there is no ambiguity within the context. Following [11], the fixed-effect parameters of the TMM, $α$ and $β$, are estimated by maximizing the following marginal likelihood function of the model:

L \propto \prod_{i = 1}^{m} \int L_{B_{i}} L_{G_{i}} p (u_{i}, v_{i}) d u_{i} d v_{i},

(3)

where $L_{B_{i}}$ is the conditional distribution (likelihood) of $y_{i j}$ given the random effect $u_{i}$ from the binomial (logistic) part, which can be written as

L_{B_{i}} = \left[ \prod_{j = 1, y_{i j} = 0}^{n_{i}} \exp (w_{j}^{T} α + u_{i}) \right] \left[ \prod_{j = 1}^{n_{i}} \frac{1}{1 + \exp (w_{j}^{T} α + u_{i})} \right],

(4)

and $L_{G_{i}}$ is the conditional distribution (likelihood) of $y_{i j}$ given the random effect $v_{i}$ from the Gaussian part, which can be written as

L_{G_{i}} = \prod_{j = 1, y_{i j} > 0}^{n_{i}} σ_{e}^{- 1} ϕ \left[ \frac{\log (y_{i j} + 1) - x_{j}^{T} β - v_{i}}{σ_{e}} \right]

(5)

with $ϕ (\cdot)$ as the standard normal PDF [for scRT-qPCR data, $\log (y_{i j} + 1)$ becomes $\log (y_{i j})$]. Here $w_{j}$ and $x_{j}$ denote the covariate vectors of single-cell sample $j$, and $p (u_{i}, v_{i})$ is the joint distribution of the random effects $u_{i}$ and $v_{i}$, which is the bivariate normal given in Equation (2).

As discussed in [10,11], maximizing the marginal likelihood function (3) involves numerical or stochastic approximation of the integrals, followed by maximization of the approximated likelihood. Several computational methods, including Markov chain Monte Carlo, the expectation-maximization (EM) algorithm, the penalized quasi-likelihood (PQL) method, Gauss-Hermite quadrature, and Laplace approximations, are reviewed and discussed in detail in [11]. Here, we use an efficient computational method, automatic differentiation, to maximize the likelihood function (3). The automatic differentiation technique is implemented in the software package automatic differentiation model builder (ADMB, version 11.4) [13,14]. Given the likelihood function written in the form of (3), ADMB calculates the Hessian matrix of the marginal likelihood function using the automatic differentiation technique, and the maximization of the marginal likelihood function is performed by first approximating the integrals using Laplace approximations and then maximizing the approximated likelihood using the quasi-Newton algorithm. Descriptions of the automatic differentiation technique can be found in [13,14], and the details of the implementation of the algorithm can be found at https://www.admb-project.org/ (accessed on 21 December 2021).
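Of the integration approaches listed above, Gauss-Hermite quadrature is the easiest to write out by hand. The sketch below approximates one subject's contribution to the marginal likelihood (3) with a 5-point rule in each dimension. It illustrates the structure of Equations (3)–(5) under stated assumptions and is not the ADMB-based Laplace procedure used in this paper. The function names, the argument layout (`eta_b[j]` for the binomial fixed-effect term of cell $j$, `eta_g[j]` for the Gaussian one), and the choice of five nodes are all assumptions made here; a practical fit would need more nodes or an adaptive rule.

```python
import math

# 5-point Gauss-Hermite nodes and weights for the weight function exp(-x^2).
GH_NODES = [-2.020182870456086, -0.958572464613819, 0.0,
            0.958572464613819, 2.020182870456086]
GH_WEIGHTS = [0.019953242059046, 0.393619323152241, 0.945308720482942,
              0.393619323152241, 0.019953242059046]


def log_lik_binomial(y, eta_b, u):
    """log of L_B in Equation (4) for one subject, given random effect u."""
    total = 0.0
    for y_j, eta_j in zip(y, eta_b):
        lin = eta_j + u
        if y_j == 0:
            total += lin                      # exp(lin) factor for zero cells
        total -= math.log1p(math.exp(lin))    # 1 / (1 + exp(lin)) for all cells
    return total


def log_lik_gaussian(y, eta_g, v, sigma_e, c=1.0):
    """log of L_G in Equation (5) for one subject, given random effect v."""
    total = 0.0
    for y_j, eta_j in zip(y, eta_g):
        if y_j > 0:
            z = (math.log(y_j + c) - eta_j - v) / sigma_e
            total += -math.log(sigma_e) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z
    return total


def marginal_log_lik_subject(y, eta_b, eta_g, sigma_u, sigma_v, rho, sigma_e, c=1.0):
    """Approximate the log of one subject's double integral in Equation (3)."""
    terms = []
    for x1, w1 in zip(GH_NODES, GH_WEIGHTS):
        for x2, w2 in zip(GH_NODES, GH_WEIGHTS):
            # Map the nodes to standard normal draws, then to (u, v) with the
            # covariance matrix of Equation (2).
            z1, z2 = math.sqrt(2.0) * x1, math.sqrt(2.0) * x2
            u = sigma_u * z1
            v = sigma_v * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
            log_f = (log_lik_binomial(y, eta_b, u)
                     + log_lik_gaussian(y, eta_g, v, sigma_e, c))
            terms.append(math.log(w1 * w2 / math.pi) + log_f)
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

Summing `marginal_log_lik_subject` over the $m$ subjects gives an approximate log of (3), which could then be passed to a general-purpose optimizer. When both random-effect standard deviations are set to zero, the function reduces to the likelihood of the two-part model without random effects, which is a convenient sanity check.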

2.3. Testing for Differential Expression

Testing for differential expression of genes across biological conditions under model (1) is done by testing for the fixed effects. More explicitly, (1) can be written as

\begin{array}{l} logit [\Pr (y_{i j} = 0)] = \log (\frac{π_{i j}}{1 - π_{i j}}) = w_{1}^{T} α_{1} + w_{2}^{T} α_{2} + u_{i}, \\ \log (y_{i j} + 1 | y_{i j} > 0) = x_{1}^{T} β_{1} + x_{2}^{T} β_{2} + v_{i} + e_{i j}, \end{array}

(6)

where $w_{1}^{T}$ and $x_{1}^{T}$ are the covariates of interest that we want to test for, and $w_{2}^{T}$ and $x_{2}^{T}$ are the covariates to be adjusted for in the model. Specifically, we are interested in testing for the following two effects across biological conditions: (1) whether the zero-proportions are significantly different across conditions and (2) for genes with non-zero expression levels, whether the mean expression levels are significantly different across conditions. The two problems can be formulated as the following two corresponding hypothesis testing problems:

(1): Testing of the binomial part

$H_{B 0} : α_{1} = 0 \text{ versus } H_{B 1} : α_{1} \neq 0;$

(7)

(2): Testing of the Gaussian part

$H_{G 0} : β_{1} = 0 \text{ versus } H_{G 1} : β_{1} \neq 0;$

(8)

and the two parts can also be tested jointly, which can improve the statistical power:

(3): Joint testing of the binomial and Gaussian parts

$H_{0} : α_{1} = 0 \text{ and } β_{1} = 0 \text{ versus } H_{1} : α_{1} \neq 0 \text{ or } β_{1} \neq 0 .$

(9)

The individual test for the binomial part or the Gaussian part can be performed using the Wald test or the likelihood ratio test, and the joint test for the two parts can be performed using the likelihood ratio test. Under $H_{0}$, the asymptotic distributions of the Wald statistic ($W_{0}$) and the likelihood ratio statistic ($L_{0}$) can be approximated by the $χ^{2}$ distribution with the degrees of freedom equal to the difference in the numbers of parameters between $H_{0}$ and $H_{1}$, which is a widely used approach in practice [15,16]. However, for small sample sizes, the $χ^{2}$ distribution is not a good approximation to the null distributions of the two test statistics, and the resulting tests often show an inflated type I error rate, as noted in the literature [15,17] and as shown by the simulations in the Results section. Therefore, we use the following two methods for reliable estimation of p-values when the sample size is small:

The parametric bootstrap method: this approach estimates the null distribution of the test statistic by simulating data from the model fitted under $H_{0}$, which is performed in the following way [17,18,19]:

(1): Fit model (6) under $H_{0}$ and generate $N$ random samples $y_{1}, \dots, y_{N}$ from this fitted model (here $N$ denotes the number of bootstrap samples rather than the number of genes).
(2): Calculate the corresponding test statistics (i.e., Wald or likelihood ratio statistics) $T (y_{1}), \dots, T (y_{N})$ using the above-simulated samples $y_{1}, \dots, y_{N}$.
(3): Estimate the p-value as $\hat{p} = \frac{1}{N} \sum_{l = 1}^{N} I \{T (y_{l}) \geq γ\}$, where $γ$ is the test statistic (Wald or likelihood ratio) calculated from the observed data. An alternative formula is $\hat{p} = \frac{\sum_{l = 1}^{N} I \{T (y_{l}) \geq γ\} + 1}{N + 1}$. The two formulas give almost the same results provided $N$ is large, so we use the former throughout this paper.
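Step (3) of the bootstrap procedure reduces to counting how many simulated statistics reach the observed one. A minimal sketch is given below; it assumes the null statistics $T (y_{1}), \dots, T (y_{N})$ have already been computed by refitting the model to each simulated sample, and the function name `bootstrap_pvalues` is hypothetical.

```python
from typing import Sequence, Tuple


def bootstrap_pvalues(null_stats: Sequence[float], observed: float) -> Tuple[float, float]:
    """Return both bootstrap p-value estimates from step (3).

    null_stats: test statistics computed on samples simulated under H0.
    observed:   test statistic computed on the observed data (gamma).
    """
    n = len(null_stats)
    if n == 0:
        raise ValueError("at least one bootstrap statistic is required")
    exceed = sum(1 for t in null_stats if t >= observed)
    p_plain = exceed / n                # (1/N) * sum I{T(y_l) >= gamma}
    p_plus_one = (exceed + 1) / (n + 1)  # alternative formula
    return p_plain, p_plus_one


# Toy illustration with four made-up null statistics.
print(bootstrap_pvalues([0.5, 1.2, 3.4, 0.1], 1.0))  # (0.5, 0.6)
```

The first estimate can be exactly zero when no simulated statistic reaches the observed one, so its resolution is limited to $1 / N$; this is why estimating very small p-values requires a large number of bootstrap samples.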

The empirical Satterthwaite method: this method is proposed in [20], and it is a general approach for approximating the null distribution of the test statistics [17,20,21,22]. Following [20,21], this method is performed in the following two steps:

(1): Approximate the null distribution of the test statistic ($W_{0}$ or $L_{0}$) by a scaled $χ^{2}$ distribution $k χ_{v}^{2}$ with $k$ as the scale parameter and $v$ as the degrees of freedom. The parameters $k$ and $v$ can be estimated by matching the first two moments (sample mean and variance) of the test statistic under $H_{0}$ with those of $k χ_{v}^{2}$ [20,21]. The sample mean and variance of the test statistic under $H_{0}$ can be obtained by using the above parametric bootstrap method with a smaller number of random samples.
(2): Fit a two-component normal mixture distribution $π_{1} N (μ_{1}, σ_{1}^{2}) + π_{2} N (μ_{2}, σ_{2}^{2})$ on $Φ^{- 1} (p_{k χ_{v}^{2}}^{(b)})$, where $p_{k χ_{v}^{2}}^{(b)}$ is the p-value obtained from the above-scaled $χ^{2}$ distribution $k χ_{v}^{2}$ for the $b$th random sample and $Φ (\cdot)$ is the standard normal CDF. The final p-values are calculated as

$p = \Pr [Ψ > Φ^{- 1} (p_{k χ_{v}^{2}})],$

where $p_{k χ_{v}^{2}}$ is the p-value obtained from Step (1) and $Ψ$ is the fitted normal mixture distribution ${\hat{π}}_{1} N ({\hat{μ}}_{1}, {\hat{σ}}_{1}^{2}) + {\hat{π}}_{2} N ({\hat{μ}}_{2}, {\hat{σ}}_{2}^{2})$. The Satterthwaite method can estimate p-values using a smaller number of random samples than the parametric bootstrap method [20,21]. However, in our simulations, it also shows an inflated type I error rate when the sample size is small (see simulations in the next section).
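The moment matching in Step (1) has a closed form: since $k χ_{v}^{2}$ has mean $k v$ and variance $2 k^{2} v$, equating these to the sample mean and variance gives $k$ as the variance divided by twice the mean, and $v$ as twice the squared mean divided by the variance. The sketch below implements only this step and leaves out the mixture fit of Step (2); the function name is hypothetical, and using the unbiased sample variance is an assumption made here.

```python
import statistics
from typing import Sequence, Tuple


def satterthwaite_scale_df(null_stats: Sequence[float]) -> Tuple[float, float]:
    """Match the first two moments of null statistics to k * chi2_v.

    E[k * chi2_v] = k * v and Var[k * chi2_v] = 2 * k**2 * v, so
    k = var / (2 * mean) and v = 2 * mean**2 / var.
    """
    mean = statistics.fmean(null_stats)
    var = statistics.variance(null_stats)  # unbiased sample variance (assumption)
    if mean <= 0 or var <= 0:
        raise ValueError("moment matching needs a positive mean and variance")
    k = var / (2.0 * mean)
    v = 2.0 * mean ** 2 / var
    return k, v


# Toy illustration with three made-up bootstrap statistics.
k_hat, v_hat = satterthwaite_scale_df([1.0, 3.0, 5.0])
print(k_hat, v_hat)
```

In general the estimated degrees of freedom are not an integer, so evaluating the tail probability of the scaled distribution requires a $χ^{2}$ (gamma) distribution function that accepts real-valued degrees of freedom.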

3. Results

3.1. Simulation Studies

3.1.1. Evaluation of Type I Error Rates

In this section, we evaluate type I error rates of the three methods for approximating the null distribution of the test statistics under $H_{0}$: the $χ^{2}$ distribution, the Satterthwaite method, and the parametric bootstrap method. The simulations are performed based on the following settings: there are two biological conditions, each with $m / 2$ subjects, and for each subject $i$ there are $n_{i}$ single-cell samples. To evaluate type I error rates, we simulate gene expression levels $y_{i j k}$ from the following model under $H_{0}$ (i.e., there is no difference between the two conditions):

\begin{array}{l} logit [\Pr (y_{i j k} = 0)] = \log (\frac{π_{i j k}}{1 - π_{i j k}}) = α_{1} + u_{i}, \\ \log (y_{i j k} + 1 | y_{i j k} > 0) = β_{1} + v_{i} + e_{i j k}, \end{array}

(10)

with $u_{i} \sim N (0, σ_{u}^{2})$, $v_{i} \sim N (0, σ_{v}^{2})$, and $e_{i j k} \sim N (0, σ_{e}^{2})$.

In this model, the only fixed effect in both the binomial and Gaussian parts is an intercept; therefore, no differences in zero-proportions or mean expression levels are expected between the two conditions. The values of the parameters are set as follows: $σ_{u} = 0.5$, $σ_{v} = 1$, $σ_{e} = 0.5$, $α_{1} \sim N (0.5, {0.25}^{2})$, $β_{1} \sim N (3, {0.5}^{2})$, and $n_{i} = 20$ for all $i$ ($i = 1, \dots, m$). We tune the sample sizes by varying $m$ over three values, 4, 10, and 20, which correspond to a range of increasing sample sizes. The simulations are repeated 1000 times for each value of $m$. For each run, we calculate the following five test statistics: the Wald statistic for the Gaussian part, the Wald statistic for the binomial part, the likelihood ratio statistic for the Gaussian part, the likelihood ratio statistic for the binomial part, and the likelihood ratio statistic for jointly testing the Gaussian and binomial parts. Then, we calculate the p-values from each test using the three methods described in Section 2.3.
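One simulated gene under the null model (10) with the parameter values above can be generated as in the following sketch. It is an illustration under stated assumptions, not the authors' simulation code: the function name and the returned data layout are hypothetical, the random effects $u_{i}$ and $v_{i}$ are drawn independently, and a redraw guard that keeps every non-zero value strictly positive has been added here.

```python
import math
import random


def simulate_null_gene(m, n_cells=20, sigma_u=0.5, sigma_v=1.0, sigma_e=0.5, rng=None):
    """Simulate one gene for m subjects (m / 2 per condition) under model (10)."""
    rng = rng or random.Random()
    alpha1 = rng.gauss(0.5, 0.25)  # gene-level intercept, binomial part
    beta1 = rng.gauss(3.0, 0.5)    # gene-level intercept, Gaussian part
    data = []
    for i in range(m):
        condition = 0 if i < m // 2 else 1  # label only; no effect under H0
        u_i = rng.gauss(0.0, sigma_u)
        v_i = rng.gauss(0.0, sigma_v)
        # logit Pr(y = 0) = alpha1 + u_i
        p_zero = 1.0 / (1.0 + math.exp(-(alpha1 + u_i)))
        cells = []
        for _ in range(n_cells):
            if rng.random() < p_zero:
                cells.append(0.0)
                continue
            z = beta1 + v_i + rng.gauss(0.0, sigma_e)  # log(y + 1)
            while z <= 0.0:
                # Guard added in this sketch (not in the paper): redraw so y > 0.
                z = beta1 + v_i + rng.gauss(0.0, sigma_e)
            cells.append(math.expm1(z))  # y = exp(z) - 1
        data.append({"subject": i, "condition": condition, "y": cells})
    return data


dataset = simulate_null_gene(m=4, rng=random.Random(2022))
print(len(dataset), len(dataset[0]["y"]))
```

Repeating this 1000 times for each $m$, fitting the model to each simulated gene, and collecting the p-values gives the input for the type I error evaluation.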

If the type I error rate is correctly controlled, the p-values from the 1000 repetitions for each $m$ should be uniformly distributed between 0 and 1, so we examine each method using quantile-quantile plots of the p-values calculated from the simulated datasets (observed p-values) against the quantiles of the uniform [0, 1] distribution (expected p-values), which are shown in Appendix A (Figure A1, Figure A2, Figure A3, Figure A4 and Figure A5). As shown in these results, all three methods give well-controlled type I error rates for $m = 20$. However, for small sample sizes ($m = 10$ or $m = 4$), the three methods rank as follows in controlling the type I error rate (from best to worst): parametric bootstrap, Satterthwaite, the $χ^{2}$ distribution. The inflation of the type I error rate is more severe for the $χ^{2}$ distribution with the test for the binomial part (Figure A2 and Figure A4) or the joint test for the two parts (Figure A5). On the other hand, the parametric bootstrap takes the longest computational time, which can be prohibitive if we want to accurately estimate small p-values. As a general rule, if the sample size is large, then the $χ^{2}$ distribution can be used. If the sample size is small, then the parametric bootstrap method should be preferred, even at the cost of longer computational time. The Satterthwaite method can be considered as an alternative for a moderate sample size. Another strategy is to first use the p-values from the $χ^{2}$ distribution or the rankings of the test statistics to identify the top differentially expressed genes and then use the parametric bootstrap to estimate the p-values for those top genes more accurately.
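The points of such a quantile-quantile plot can be computed as in the sketch below. The plotting positions $(i - 0.5) / n$ used for the expected uniform quantiles are an assumption made here (other conventions exist), and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def uniform_qq_points(pvalues: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair sorted observed p-values with expected uniform [0, 1] quantiles.

    The i-th smallest p-value (i = 1, ..., n) is paired with (i - 0.5) / n.
    Points close to the diagonal indicate a well-controlled type I error
    rate; observed p-values falling below the diagonal indicate inflation.
    """
    n = len(pvalues)
    observed = sorted(pvalues)
    expected = [(i + 0.5) / n for i in range(n)]
    return list(zip(expected, observed))


# Toy illustration with four made-up p-values.
for expected_p, observed_p in uniform_qq_points([0.9, 0.1, 0.5, 0.3]):
    print(expected_p, observed_p)
```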

3.1.2. Evaluation of Statistical Power

In this section, we evaluate the statistical power of the TMM model and compare it with an existing method, MAST [12], and the two-part model with binomial and Gaussian parts but without random effects (named TM hereafter). The simulations are performed based on the following settings: there are two biological conditions, each with $m / 2$ subjects, and for each subject $i$ there are $n_{i}$ single-cell samples sequenced. To evaluate the power, we simulate the gene expression levels $y_{i j k}$ from the following model under $H_{1}$:

\begin{array}{l} logit [\Pr (y_{i j k} = 0)] = \log (\frac{π_{i j k}}{1 - π_{i j k}}) = α_{1} + α_{2} w + u_{i}, \\ \log (y_{i j k} + 1 | y_{i j k} > 0) = β_{1} + β_{2} x + v_{i} + e_{i j k}, \end{array}

(11)

with $u_{i} \sim N (0, σ_{u}^{2})$, $v_{i} \sim N (0, σ_{v}^{2})$, and $e_{i j k} \sim N (0, σ_{e}^{2})$. In this model, $w$ and $x$ are 0/1 indicators of the conditions, and the effect sizes are represented by the parameters $α_{2}$ and $β_{2}$, which correspond to the log odds ratio of the zero-proportions and the log fold change of the mean expression values for non-zero genes between the two conditions. The values of the parameters are set as follows: $m = 10$, $n_{i} = 20$ for all $i$ ($i = 1, \dots, m$), $σ_{u} = 0.5$, $σ_{v} = 1$, $σ_{e} = 0.5$, $α_{1} \sim N (0.25, {0.25}^{2})$, and $β_{1} \sim N (3, {0.5}^{2})$. We then tune the effect sizes by varying $(α_{2}, β_{2})$ over the following values: (0, 0), (0.25, 0.25), (0.5, 0.5), …, (1.5, 1.5). The simulations are repeated 1000 times for each pair $(α_{2}, β_{2})$. In each run, we apply our model TMM with the three methods for calculating p-values (the $χ^{2}$ distribution, the Satterthwaite method, and the parametric bootstrap), MAST, and TM, respectively. The estimated power for each method is calculated as the proportion of p-values less than 0.05 among the 1000 repetitions.
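The power estimate described above is simply a rejection proportion, which the following sketch computes for each effect size. The function names and the dictionary layout (effect size mapped to the p-values of the repetitions) are hypothetical.

```python
from typing import Dict, Sequence


def estimated_power(pvalues: Sequence[float], alpha: float = 0.05) -> float:
    """Proportion of repetitions whose p-value is below the level alpha."""
    if not pvalues:
        raise ValueError("at least one p-value is required")
    return sum(1 for p in pvalues if p < alpha) / len(pvalues)


def power_curve(pvalues_by_effect: Dict[float, Sequence[float]],
                alpha: float = 0.05) -> Dict[float, float]:
    """Estimated power for each effect size, ordered by effect size."""
    return {effect: estimated_power(pvals, alpha)
            for effect, pvals in sorted(pvalues_by_effect.items())}


# Toy illustration with made-up p-values for two effect sizes.
print(power_curve({0.5: [0.01, 0.2], 0.0: [0.3, 0.6]}))  # {0.0: 0.0, 0.5: 0.5}
```

At the effect size (0, 0) the data are generated under $H_{0}$, so the same proportion estimates the type I error rate rather than the power.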

Figure 1 shows the power curves for each model with different effect sizes. As expected, the power of each method increases with effect size. The power of TMM is consistently higher than that of the other two models, which is also expected since we include random effects in this simulation setting.

Figure 1. Comparisons of statistical powers of different methods. (A) Tests for the Gaussian part. (B) Tests for the binomial part. (C) Joint tests for the Gaussian and binomial parts. TMM: two-part mixed model. “Chi-square”, “Satterthwaite”, and “bootstrap”: the χ² distribution, the Satterthwaite method, and parametric bootstrap method as described in Section 2.3. TM: the two-part model without random effects. The horizontal red dashed line represents the level of the test, which is α = 0.05.

4. Application to Real-World Single-Cell Gene Expression Data

4.1. Application to an scRT-qPCR Dataset

First, we apply the TMM model to an scRT-qPCR dataset and compare the results with MAST. This dataset is described in [23] and is included in the MAST package [12]; 456 single-cell samples of T cells from 2 patients with human immunodeficiency virus (HIV) are isolated, and the expression levels of 75 genes related to immune system function are measured by scRT-qPCR. The activation of two immune-response proteins, T cell receptor Vβ (TCR-Vβ) and CD154, is used to categorize those T cells, and the 456 single cells are divided into the following 4 groups: TCR-Vβ+/CD154+, TCR-Vβ+/CD154−, TCR-Vβ−/CD154+, and TCR-Vβ−/CD154−, where the TCR-Vβ+/CD154+ group consists of the activated T cells with normal immune functions [23]. The goal of the analysis is to identify differentially expressed genes across the above four groups.

We fit MAST and our TMM model to this dataset. Specifically, the following two covariates are included in MAST:

$X_{1}$: a categorical variable indicating which of the above four groups the sample belongs to, where the TCR-Vβ+/CD154+ group is coded as the reference group. This variable is the one of interest.

$X_{2}$: a categorical variable indicating which of the two subjects the sample is from.

For our TMM model, $X_{1}$ is included as a fixed effect in both the binomial part and the Gaussian part. The two subjects are treated as two clusters, which are included as random effects in TMM. The likelihood ratio test is used to test the individual Gaussian part and binomial part and also to jointly test the two parts, and the $χ^{2}$ distribution approximation is used to calculate p-values to save computational time.
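The $χ^{2}$-based likelihood ratio p-value used here can be sketched with the standard library alone. The helper names below are hypothetical, and the maximized log-likelihoods under $H_{0}$ and $H_{1}$ are assumed to be supplied by whatever software fitted the two models. The degrees of freedom equal the number of coefficients set to zero under $H_{0}$; with a four-level factor such as $X_{1}$ coded against a reference group, that would be 3 for each individual part and 6 for the joint test.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)             # Q(1, x/2)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))  # Q(1/2, x/2)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1),
    # applied until a reaches df / 2.
    while 2.0 * a < df:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q


def lrt_pvalue(loglik_null: float, loglik_alt: float, df: int):
    """Likelihood ratio statistic 2 * (l1 - l0) and its chi-square p-value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(stat, df)


# Toy illustration with made-up log-likelihood values and df = 2,
# for which the p-value equals exp(-stat / 2).
stat, p = lrt_pvalue(-105.0, -102.0, 2)
print(stat, p)
```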

The results from MAST and TMM for the 75 genes are shown in Table A1, and Figure 2 is a graphical comparison of the p-values from the two methods. The results from the two methods generally agree with each other, though some genes show different p-values from the tests for the zero-proportions (binomial part) (Figure 2). This is expected, as there are only two clusters in this dataset, and the clustering effects do not play a significant role in this example. In fact, a mixed-effects model needs a reasonable number of clusters to be useful in practice [15]. Therefore, MAST should be preferred over TMM for this dataset, and TMM is applied here for the purpose of demonstration. On the other hand, these results show that TMM does not perform substantially worse than MAST, even when the clustering effects are not significant.

Figure 2. Comparisons of the p-values from TMM and MAST for the scRT-qPCR dataset. The −log10 of the p-values from both methods are plotted. (B,D,F) are, respectively, the zoomed-in parts of (A,C,E) on the range of 0 to 10.

4.2. Application to scRNA-seq Datasets

A recent study compared various methods for differential expression analysis in scRNA-seq using a number of scRNA-seq datasets with matched bulk RNA-seq data from the same purified cell types as reference [24]. Pseudobulk methods first aggregate reads across the cells within each sample (i.e., biological replicate), transforming a genes-by-cells matrix into a genes-by-samples matrix, and then apply methods for bulk RNA-seq such as DESeq [25], edgeR [26], and limma [27] for the subsequent differential expression analysis. This study showed that pseudobulk methods achieved the highest concordance with matched bulk RNA-seq results when the number of cells obtained from each sample is large (>500), while a negative binomial mixed model (NBMM) performed best when the number of cells per sample is not large (<200). Here we use one of the datasets made publicly available in [24] that contains both scRNA-seq and matched bulk RNA-seq data; it was originally published in [28] to study the gene expression profile changes between five different types of CD4+ T cells stimulated by cytokines and unstimulated CD4+ T cells (control). We use it to compare the performance of three approaches: TMM with p-values evaluated by the empirical Satterthwaite method, an NBMM with the library size as an offset term as implemented in [24], and a pseudobulk method using the likelihood ratio test in edgeR (referred to as edgeR below).

Following [24], we first obtain the lists of differentially expressed genes in the matched bulk data, next apply the three aforementioned approaches to a series of downsampled scRNA-seq datasets containing between 25 and 500 cells per sample from the original scRNA-seq datasets [28], and then calculate the area under the concordance curve (AUCC, which ranges from 0 to 1, with 1 as perfect concordance and 0 as complete discordance). We have to use the downsampled datasets because the running time of NBMM is very long (see [24] and below), which prevents us from comparing these approaches on the full datasets. The results are shown in Figure 3: NBMM and TMM show higher concordance with matched bulk RNA-seq than edgeR when the number of cells per sample is not large (number of cells ≤ 200), while edgeR gives the highest concordance when the number of cells per sample is large (number of cells = 500). Regarding the running time, edgeR is the fastest with an average time of 1.7 min (including the time for aggregating reads); TMM has an average time of 53.2 min; NBMM is the slowest with an average time of 1174.3 min. These comparisons imply that TMM is more suitable for situations where the number of cells per sample is not large. We elaborate on these comparisons and the strengths of different approaches in Section 5.
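The text above does not spell out how the AUCC is computed, so the sketch below rests on an assumed, commonly used definition: for each $t = 1, \dots, K$, count the genes shared by the top $t$ genes of the two ranked lists, sum these counts, and divide by the maximum possible value $K (K + 1) / 2$. The function name, the default $K = 500$, and the assumption that each list contains no repeated genes are all choices made here and may differ from the procedure in [24].

```python
from typing import Sequence


def aucc(ranked_a: Sequence[str], ranked_b: Sequence[str], k: int = 500) -> float:
    """Area under the concordance curve between two ranked gene lists.

    Both lists are ordered from most to least significant. The result is 1.0
    when the top-k rankings are identical and 0.0 when they share no genes.
    """
    k = min(k, len(ranked_a), len(ranked_b))
    if k == 0:
        raise ValueError("both ranked lists must be non-empty")
    area = 0
    for t in range(1, k + 1):
        # Number of genes found in the top t of both lists.
        area += len(set(ranked_a[:t]) & set(ranked_b[:t]))
    return area / (k * (k + 1) / 2)


# Toy illustration with made-up gene names.
print(aucc(["g1", "g2", "g3"], ["g1", "g2", "g3"], k=3))  # 1.0
print(aucc(["g1", "g2", "g3"], ["g4", "g5", "g6"], k=3))  # 0.0
```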

Figure 3. AUCC for TMM, NBMM, and edgeR in samples of between 25 and 500 cells from the CD4+ T cell data. The dots represent the AUCC values, and the boxplots represent their 75%, 50%, and 25% quantiles.

Next, we apply the TMM model to another scRNA-seq dataset and compare it with MAST and TM. This dataset is published in [7], which contains 466 single-cell samples from the human brain tissues of 8 adults (aged from 21 to 63 years) and 4 fetuses (all aged 16 to 18 weeks), and the expression levels of 22,088 genes in these samples are measured by scRNA-seq [7]. The dataset is available in NCBI Gene Expression Omnibus under accession number GSE67835.

The goal of our analysis is to identify differentially expressed genes between the adult and fetal brains. We fit TMM with the following two covariates as fixed effects:

$X_{1}$: a 0/1 indicator of biological conditions (adult versus fetus), which is the variable of interest;

$X_{2}$: the gender of the subjects: male and female for adults. The gender of the fetuses is coded as a third category, “undeveloped”.

The 12 subjects are treated as clusters, which are included as random effects in the model. The likelihood ratio test is used to test the individual Gaussian part and binomial part and also to jointly test the two parts, and the $χ^{2}$ distribution approximation is used to calculate p-values to save computational time. We also fit the MAST and TM models, in which $X_{1}$ and $X_{2}$ are included as covariates. Multiple comparison adjustment is performed using the Benjamini–Hochberg FDR procedure [29].
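The Benjamini–Hochberg adjustment mentioned above can be written compactly. The sketch below is a generic standard-library implementation of the step-up procedure, not the specific software call used for the analysis, and the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    With n tests, the p-value of rank r (1 = smallest) is scaled to
    p * n / r, and a running minimum taken from the largest rank downwards
    keeps the adjusted values monotone.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy illustration with four made-up p-values.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in fdr]
print(fdr, significant)
```

A gene is then declared differentially expressed when its adjusted value falls below the chosen FDR threshold.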

Figure 4 shows the number of differentially expressed genes identified by each method with FDR < 0.01, and Table A2 shows the p-values and FDR for the top 20 differentially expressed genes (ranked by the p-values from the joint test for both the Gaussian and binomial parts under the TMM model). The results from the three models show considerable overlap (Figure 4), and the top differentially expressed genes all show very significant p-values and FDR under all methods. Notably, the total number of differentially expressed genes detected by TMM with FDR < 0.01 is much larger than that detected by the other two methods.

Figure 4. Number of differentially expressed genes identified by each method with FDR < 0.01. (A) Gaussian part. (B) Binomial part. (C) Joint test for the Gaussian and binomial parts.

5. Discussion

In summary, we present a two-part mixed model (TMM) for differential expression analysis with single-cell gene expression data. This model not only adequately accounts for the distinct features of single-cell expression data, including extra zero expression values, high variability, and clustered design, but also provides the flexibility of adjusting for covariates. Since scRNA-seq is still a developing and growing technology, it brings more challenges in data analysis than bulk RNA-seq. These challenges can be technical (e.g., the number of samples in scRNA-seq is large, and the sequencing experiments are performed in different batches [30]) or biological (e.g., the distinct features of single-cell gene expression data, as discussed in the Introduction). More recent studies show that several confounding factors are often present in scRNA-seq experiments and can lead to biased results. These factors can likewise be categorized as technical factors related to the design of experiments, such as batch effects [30], or biological factors, such as the detection rate of genes [12,30], gene lengths, and GC percent [30]. These confounding factors can be adjusted for in TMM; however, planning a good study design for scRNA-seq experiments to reduce the confounding factors is a more fundamental task [30].

More recently, several new models and approaches have been proposed for DE analysis of scRNA-seq data. As studied in [24] and Section 4.2, the pseudobulk method mimics the data format of bulk RNA-seq by aggregating reads across samples and generating a genes-by-samples matrix, which enables the use of well-maintained tools developed for bulk RNA-seq, such as DESeq [25], edgeR [26], and limma [27], for the analysis of scRNA-seq data. Those methods are faster and show higher concordance with the DE results from bulk RNA-seq when the number of cells per sample is large, which can be achieved with current sequencing platforms. In contrast, our approach, TMM, has the strengths of being reliable when the number of cells per sample is not large (e.g., scRT-qPCR data and scRNA-seq data with smaller sample sizes and lower cost) and of providing a test for whether the proportions of zero or lowly expressed genes differ between biological conditions. As future work, the computational speed and p-value estimation of TMM should be further optimized, a need that is common to many mixed-effect models [24]. On a separate note, in [24] and Section 4.2, the DE genes in the matched bulk RNA-seq datasets that were used to check the consistency of those methods for scRNA-seq were also identified using DESeq, edgeR, and limma, which may bias the comparison towards higher concordance for the pseudobulk methods that use the same three packages.
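As a rough illustration of the pseudobulk aggregation mentioned above, the sketch below sums per-cell read counts into one column per sample, giving a genes-by-samples table. The data layout (nested dictionaries), the function name `pseudobulk`, and the toy identifiers are assumptions made for this example; in practice the aggregated matrix would be passed to a bulk RNA-seq tool.

```python
from collections import defaultdict


def pseudobulk(counts, cell_to_sample):
    """Sum per-cell counts into one value per gene and sample.

    counts: dict mapping gene -> dict mapping cell id -> read count
    cell_to_sample: dict mapping cell id -> sample id
    Returns a dict mapping gene -> dict mapping sample id -> summed count.
    """
    result = {}
    for gene, per_cell in counts.items():
        totals = defaultdict(int)
        for cell, n_reads in per_cell.items():
            totals[cell_to_sample[cell]] += n_reads
        result[gene] = dict(totals)
    return result


# Toy data with hypothetical cell and sample identifiers.
cell_to_sample = {"c1": "s1", "c2": "s1", "c3": "s2"}
counts = {
    "G1": {"c1": 2, "c2": 3, "c3": 1},
    "G2": {"c1": 0, "c2": 0, "c3": 4},
}
bulk_like = pseudobulk(counts, cell_to_sample)
```

Note that the aggregation discards the cell-level zero proportions, which is the information the binomial part of a two-part model is designed to test.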

Author Contributions

Conceptualization, Y.S., H.K. and H.J.; methodology, Y.S., H.K. and H.J.; formal analysis, Y.S.; writing—original draft preparation, Y.S., J.-H.L., H.K. and H.J.; writing—review and editing, Y.S., J.-H.L., H.K. and H.J.; supervision, J.-H.L., H.K. and H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by startup funds from Medical College of Georgia, Augusta University.

Data Availability Statement

The computer code for reproducing the results in this paper is available online at: https://github.com/shilab2017/two_part_mixed_model (accessed on 21 December 2021).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Figure A1. Results from 3.1.1 Evaluation of type I error rates. Plots of the observed versus the expected p-values for the Wald test for the Gaussian part under H₀: no significant difference between the two conditions. The p-values are plotted on -log10 scale. The gray areas represent the 95% confidence interval bands of the expected p-values under H₀.

Figure A2. Results from 3.1.1 Evaluation of type I error rates. Plots of the observed versus the expected p-values for the Wald test for the binomial part under H₀: no significant difference between the two conditions. The p-values are plotted on -log10 scale. The gray areas represent the 95% confidence interval bands of the expected p-values under H₀.

Figure A3. Results from 3.1.1 Evaluation of type I error rates. Plots of the observed versus the expected p-values for the likelihood ratio test for the Gaussian part under H₀: no significant difference between the two conditions. The p-values are plotted on -log10 scale. The gray areas represent the 95% confidence interval bands of the expected p-values under H₀.

Figure A4. Results from 3.1.1 Evaluation of type I error rates. Plots of the observed versus the expected p-values for the likelihood ratio test for the binomial part under H₀: no significant difference between the two conditions. The p-values are plotted on -log10 scale. The gray areas represent the 95% confidence interval bands of the expected p-values under H₀.

Figure A5. Results from 3.1.1 Evaluation of type I error rates. Plots of the observed versus the expected p-values for jointly testing the Gaussian and binomial parts under H₀: no significant difference between the two conditions. The p-values are plotted on -log10 scale. The gray areas represent the 95% confidence interval bands of the expected p-values under H₀.

Table A1. Results of the gene differential expression analysis for the HIV scRT-qPCR dataset. The top gene CD40LG that codes CD154 protein is highlighted.

Gene Name	MAST Gaussian	MAST Binomial	MAST Combine	TMM Gaussian	TMM Binomial	TMM Combine
CD40LG	2.33 × 10⁻⁴⁶	9.87 × 10⁻¹⁸	2.82 × 10⁻⁶¹	3.73 × 10⁻⁴⁸	3.53 × 10⁻¹⁷	1.72 × 10⁻⁶²
GAPDH	6.60 × 10⁻²⁷	8.44 × 10⁻¹⁰	4.04 × 10⁻³⁴	1.78 × 10⁻²⁷	2.30 × 10⁻¹⁰	2.82 × 10⁻³⁵
TNF	1.60 × 10⁻⁰³	7.70 × 10⁻²²	1.13 × 10⁻²²	3.48 × 10⁻⁰³	1.89 × 10⁻²²	7.94 × 10⁻²³
TGFB1	6.08 × 10⁻¹⁸	2.73 × 10⁻⁰⁴	9.61 × 10⁻²⁰	1.75 × 10⁻¹⁶	4.31 × 10⁻⁰⁴	5.01 × 10⁻¹⁸
IL2	1.46 × 10⁻⁰³	4.53 × 10⁻¹⁸	3.06 × 10⁻¹⁹	2.39 × 10⁻⁰³	3.90 × 10⁻¹⁸	2.05 × 10⁻¹⁹
IL16	2.21 × 10⁻⁰¹	9.21 × 10⁻¹⁸	2.93 × 10⁻¹⁶	4.90 × 10⁻⁰²	4.74 × 10⁻¹⁸	2.42 × 10⁻¹⁷
IL2Rg	6.80 × 10⁻⁰⁸	1.97 × 10⁻¹⁰	3.72 × 10⁻¹⁶	3.11 × 10⁻⁰⁹	2.08 × 10⁻¹¹	2.06 × 10⁻¹⁸
CXCR4	7.89 × 10⁻⁰⁴	3.94 × 10⁻¹⁴	1.33 × 10⁻¹⁵	5.20 × 10⁻⁰⁴	1.71 × 10⁻¹⁴	3.51 × 10⁻¹⁶
CCR7	3.60 × 10⁻⁰¹	8.61 × 10⁻¹⁷	4.88 × 10⁻¹⁵	4.20 × 10⁻⁰¹	8.38 × 10⁻¹⁷	3.54 × 10⁻¹⁵
CD3d	3.69 × 10⁻⁰⁶	1.67 × 10⁻¹⁰	1.65 × 10⁻¹⁴	7.22 × 10⁻⁰⁷	2.33 × 10⁻¹⁰	3.61 × 10⁻¹⁵
IL2Ra	9.09 × 10⁻⁰³	5.14 × 10⁻¹³	2.08 × 10⁻¹³	1.18 × 10⁻⁰³	1.09 × 10⁻¹²	6.11 × 10⁻¹⁴
CD69	7.99 × 10⁻⁰⁶	2.06 × 10⁻⁰⁹	4.00 × 10⁻¹³	3.34 × 10⁻⁰⁶	1.89 × 10⁻⁰⁹	1.80 × 10⁻¹³
IL10	1.75 × 10⁻⁰¹	3.88 × 10⁻¹⁴	5.74 × 10⁻¹³	3.25 × 10⁻⁰²	5.85 × 10⁻¹⁴	1.44 × 10⁻¹³
FASLG	6.47 × 10⁻⁰²	1.60 × 10⁻¹²	3.73 × 10⁻¹²	3.06 × 10⁻⁰²	1.33 × 10⁻¹²	3.32 × 10⁻¹²
IL7R	1.23 × 10⁻⁰⁶	2.18 × 10⁻⁰⁶	5.23 × 10⁻¹¹	9.05 × 10⁻⁰⁸	1.68 × 10⁻⁰⁶	4.57 × 10⁻¹²
IL6ST	1.13 × 10⁻⁰³	4.49 × 10⁻⁰⁹	1.21 × 10⁻¹⁰	1.14 × 10⁻⁰⁴	4.45 × 10⁻⁰⁸	1.12 × 10⁻¹⁰
SLAMF1	3.45 × 10⁻⁰²	4.78 × 10⁻¹⁰	5.56 × 10⁻¹⁰	1.05 × 10⁻⁰²	2.95 × 10⁻¹⁰	9.47 × 10⁻¹¹
IFNg	2.66 × 10⁻⁰⁵	1.47 × 10⁻⁰⁶	7.06 × 10⁻¹⁰	2.39 × 10⁻⁰⁵	1.85 × 10⁻⁰⁶	5.83 × 10⁻¹⁰
CD109	8.97 × 10⁻⁰¹	8.36 × 10⁻¹⁰	8.06 × 10⁻⁰⁹	7.76 × 10⁻⁰¹	7.10 × 10⁻¹⁰	7.65 × 10⁻⁰⁹
TNFRSF9	3.20 × 10⁻⁰¹	2.38 × 10⁻⁰⁹	1.44 × 10⁻⁰⁸	2.13 × 10⁻⁰¹	1.45 × 10⁻⁰⁹	8.35 × 10⁻⁰⁹
DPP4	2.96 × 10⁻⁰¹	1.35 × 10⁻⁰⁹	1.91 × 10⁻⁰⁸	4.66 × 10⁻⁰²	2.03 × 10⁻⁰⁸	2.38 × 10⁻⁰⁸
ICOS	2.41 × 10⁻⁰⁵	6.80 × 10⁻⁰⁵	2.53 × 10⁻⁰⁸	8.93 × 10⁻⁰⁵	5.14 × 10⁻⁰⁴	4.77 × 10⁻⁰⁷
CD28	2.14 × 10⁻⁰¹	3.34 × 10⁻⁰⁷	1.88 × 10⁻⁰⁶	3.67 × 10⁻⁰¹	1.17 × 10⁻⁰⁶	9.26 × 10⁻⁰⁶
CD4	1.06 × 10⁻⁰¹	1.47 × 10⁻⁰⁵	2.46 × 10⁻⁰⁵	1.04 × 10⁻⁰¹	3.88 × 10⁻⁰⁵	6.66 × 10⁻⁰⁵
CD27	1.94 × 10⁻⁰¹	2.86 × 10⁻⁰⁵	8.73 × 10⁻⁰⁵	1.01 × 10⁻⁰¹	5.68 × 10⁻⁰⁶	1.20 × 10⁻⁰⁵
CD48	4.55 × 10⁻⁰¹	1.69 × 10⁻⁰⁵	1.59 × 10⁻⁰⁴	3.52 × 10⁻⁰¹	1.46 × 10⁻⁰⁶	2.14 × 10⁻⁰⁵
SLAMF5	5.39 × 10⁻⁰²	3.28 × 10⁻⁰⁴	1.89 × 10⁻⁰⁴	3.34 × 10⁻⁰²	4.43 × 10⁻⁰⁴	1.14 × 10⁻⁰⁴
CTSD	3.11 × 10⁻⁰³	7.58 × 10⁻⁰³	2.14 × 10⁻⁰⁴	3.59 × 10⁻⁰⁴	8.01 × 10⁻⁰³	2.68 × 10⁻⁰⁵
CD5	5.47 × 10⁻⁰¹	3.30 × 10⁻⁰⁵	3.65 × 10⁻⁰⁴	3.80 × 10⁻⁰¹	7.13 × 10⁻⁰⁶	4.89 × 10⁻⁰⁵
TBX21	1.71 × 10⁻⁰¹	1.97 × 10⁻⁰⁴	4.06 × 10⁻⁰⁴	1.17 × 10⁻⁰¹	8.15 × 10⁻⁰⁵	1.10 × 10⁻⁰⁴
CSF2	6.55 × 10⁻⁰¹	2.46 × 10⁻⁰⁴	5.40 × 10⁻⁰⁴	1.00	2.09 × 10⁻⁰⁴	5.13 × 10⁻⁰⁴
CD3g	9.95 × 10⁻⁰¹	1.17 × 10⁻⁰⁵	6.03 × 10⁻⁰⁴	8.13 × 10⁻⁰¹	7.87 × 10⁻⁰⁷	3.11 × 10⁻⁰⁵
TIA1	1.70 × 10⁻⁰²	1.29 × 10⁻⁰²	1.64 × 10⁻⁰³	9.89 × 10⁻⁰³	4.89 × 10⁻⁰³	3.84 × 10⁻⁰⁴
CD45	2.94 × 10⁻⁰¹	8.41 × 10⁻⁰⁴	2.56 × 10⁻⁰³	1.72 × 10⁻⁰¹	2.48 × 10⁻⁰³	3.63 × 10⁻⁰³
PECAM1	3.68 × 10⁻⁰²	1.21 × 10⁻⁰²	3.14 × 10⁻⁰³	7.98 × 10⁻⁰³	9.12 × 10⁻⁰³	5.55 × 10⁻⁰⁴
NT5E	5.96 × 10⁻⁰¹	2.21 × 10⁻⁰³	6.24 × 10⁻⁰³	3.77 × 10⁻⁰¹	1.28 × 10⁻⁰³	3.82 × 10⁻⁰³
LIF	7.70 × 10⁻⁰¹	3.60 × 10⁻⁰³	1.17 × 10⁻⁰²	6.83 × 10⁻⁰¹	1.00	7.22 × 10⁻⁰¹
FOXP3	8.77 × 10⁻⁰³	2.03 × 10⁻⁰¹	1.21 × 10⁻⁰²	3.65 × 10⁻⁰³	2.02 × 10⁻⁰¹	6.33 × 10⁻⁰³
TIMP1	2.18 × 10⁻⁰²	1.26 × 10⁻⁰¹	1.62 × 10⁻⁰²	4.23 × 10⁻⁰²	7.55 × 10⁻⁰²	2.13 × 10⁻⁰²
CTLA4	3.26 × 10⁻⁰¹	8.50 × 10⁻⁰³	1.92 × 10⁻⁰²	4.15 × 10⁻⁰²	1.70 × 10⁻⁰²	5.31 × 10⁻⁰³
FAS	7.88 × 10⁻⁰¹	4.45 × 10⁻⁰³	3.49 × 10⁻⁰²	4.21 × 10⁻⁰¹	4.76 × 10⁻⁰³	1.21 × 10⁻⁰²
RORC	8.99 × 10⁻⁰¹	1.13 × 10⁻⁰²	3.61 × 10⁻⁰²	9.52 × 10⁻⁰¹	7.71 × 10⁻⁰³	2.06 × 10⁻⁰²
CCR2	3.11 × 10⁻⁰²	2.07 × 10⁻⁰¹	3.73 × 10⁻⁰²	1.42 × 10⁻⁰⁴	5.63 × 10⁻⁰¹	9.61 × 10⁻⁰⁴
BCL2	2.34 × 10⁻⁰¹	3.77 × 10⁻⁰²	4.15 × 10⁻⁰²	1.56 × 10⁻⁰¹	1.00	1.51 × 10⁻⁰¹
PRDM1	1.36 × 10⁻⁰²	5.71 × 10⁻⁰¹	5.18 × 10⁻⁰²	2.11 × 10⁻⁰³	7.40 × 10⁻⁰¹	1.85 × 10⁻⁰²
CCL3	1.48 × 10⁻⁰¹	1.01 × 10⁻⁰¹	6.68 × 10⁻⁰²	4.16 × 10⁻⁰¹	1.00	3.36 × 10⁻⁰¹
CCL2	8.07 × 10⁻⁰²	1.00	8.07 × 10⁻⁰²	2.14 × 10⁻⁰¹	1.00	3.05 × 10⁻⁰¹
IL8	1.03 × 10⁻⁰¹	2.06 × 10⁻⁰¹	9.36 × 10⁻⁰²	4.31 × 10⁻⁰¹	1.07 × 10⁻⁰¹	1.67 × 10⁻⁰¹
CCL5	8.38 × 10⁻⁰²	2.80 × 10⁻⁰¹	9.98 × 10⁻⁰²	3.18 × 10⁻⁰¹	3.17 × 10⁻⁰¹	2.20 × 10⁻⁰¹
TNFSF10	3.27 × 10⁻⁰¹	1.49 × 10⁻⁰¹	1.76 × 10⁻⁰¹	3.88 × 10⁻⁰¹	2.06 × 10⁻⁰²	5.12 × 10⁻⁰²
CSF1	1.79 × 10⁻⁰¹	4.38 × 10⁻⁰¹	2.57 × 10⁻⁰¹	1.23 × 10⁻⁰¹	2.20 × 10⁻⁰¹	9.77 × 10⁻⁰²
CCR4	4.57 × 10⁻⁰¹	1.79 × 10⁻⁰¹	2.66 × 10⁻⁰¹	4.10 × 10⁻⁰¹	1.59 × 10⁻⁰¹	1.48 × 10⁻⁰¹
HLADRA	1.61 × 10⁻⁰¹	5.11 × 10⁻⁰¹	2.73 × 10⁻⁰¹	4.78 × 10⁻⁰¹	1.21 × 10⁻⁰¹	2.16 × 10⁻⁰¹
BAX	9.26 × 10⁻⁰¹	5.98 × 10⁻⁰²	2.86 × 10⁻⁰¹	9.97 × 10⁻⁰¹	2.79 × 10⁻⁰²	1.94 × 10⁻⁰¹
CD38	4.60 × 10⁻⁰¹	3.35 × 10⁻⁰¹	4.09 × 10⁻⁰¹	7.37 × 10⁻⁰¹	2.11 × 10⁻⁰¹	4.97 × 10⁻⁰¹
SLAMF7	5.01 × 10⁻⁰¹	1.00	5.01 × 10⁻⁰¹	1.00	1.00	1.00
GATA3	1.36 × 10⁻⁰¹	9.75 × 10⁻⁰¹	5.11 × 10⁻⁰¹	1.29 × 10⁻⁰¹	8.29 × 10⁻⁰¹	4.44 × 10⁻⁰¹
PCNA	8.92 × 10⁻⁰¹	3.40 × 10⁻⁰¹	5.52 × 10⁻⁰¹	8.19 × 10⁻⁰¹	1.00	1.00
MMP9	6.35 × 10⁻⁰¹	1.00	6.35 × 10⁻⁰¹	1.00	1.00	1.00
ENTPD1	6.50 × 10⁻⁰¹	1.00	6.50 × 10⁻⁰¹	2.97 × 10⁻⁰²	1.00	3.10 × 10⁻⁰²
CCL4	6.79 × 10⁻⁰¹	6.22 × 10⁻⁰¹	7.55 × 10⁻⁰¹	1.00	1.00	1.00
PRF1	9.51 × 10⁻⁰¹	4.12 × 10⁻⁰¹	7.67 × 10⁻⁰¹	7.96 × 10⁻⁰¹	1.00	1.00
EOMES	7.68 × 10⁻⁰¹	5.98 × 10⁻⁰¹	7.89 × 10⁻⁰¹	5.52 × 10⁻⁰¹	1.00	7.63 × 10⁻⁰¹
IL6R	6.31 × 10⁻⁰¹	7.39 × 10⁻⁰¹	7.98 × 10⁻⁰¹	2.22 × 10⁻⁰¹	5.60 × 10⁻⁰¹	5.96 × 10⁻⁰¹
CCR5	4.52 × 10⁻⁰¹	9.28 × 10⁻⁰¹	8.09 × 10⁻⁰¹	1.33 × 10⁻⁰¹	5.54 × 10⁻⁰¹	3.07 × 10⁻⁰¹
GZMA	4.02 × 10⁻⁰¹	9.95 × 10⁻⁰¹	8.52 × 10⁻⁰¹	2.78 × 10⁻⁰¹	8.78 × 10⁻⁰¹	7.79 × 10⁻⁰¹
CD8a	9.58 × 10⁻⁰¹	1.00	9.58 × 10⁻⁰¹	6.68 × 10⁻⁰¹	1.00	9.00 × 10⁻⁰¹
B3GAT1	1.00	1.00	1.00	1.00	1.00	1.00
CXCL13	1.00	1.00	1.00	1.00	1.00	1.00
IL12RbII	1.00	1.00	1.00	1.00	1.00	1.00
IL13	1.00	1.00	1.00	1.00	1.00	1.00
IL22	1.00	1.00	1.00	1.00	1.00	1.00
IL3	1.00	1.00	1.00	1.00	1.00	1.00
IL4	1.00	1.00	1.00	1.00	1.00	1.00
MKI67	1.00	1.00	1.00	1.00	1.00	1.00

Table A2. p-values and FDR for the top 20 differentially expressed genes. The list of genes is ranked by the p-values from the combined test for both the Gaussian and binomial parts under the TMM model.

Gene Name	TMM Gaussian p-Value	TMM Gaussian FDR	TMM Binomial p-Value	TMM Binomial FDR	TMM Combine p-Value	TMM Combine FDR	MAST Gaussian p-Value	MAST Gaussian FDR	MAST Binomial p-Value	MAST Binomial FDR	MAST Combine p-Value	MAST Combine FDR
TMSB15A	2.14 × 10⁻¹²	5.31 × 10⁻¹⁰	1.60 × 10⁻⁵⁵	3.33 × 10⁻⁵¹	1.57 × 10⁻⁵⁵	3.27 × 10⁻⁵¹	4.50 × 10⁻⁰⁹	4.66 × 10⁻⁰⁷	1.42 × 10⁻⁴⁴	3.01 × 10⁻⁴⁰	1.35 × 10⁻⁴⁴	2.86 × 10⁻⁴⁰
MEX3A	2.01 × 10⁻¹⁰	2.66 × 10⁻⁰⁸	1.55 × 10⁻⁵⁰	1.62 × 10⁻⁴⁶	1.34 × 10⁻⁵⁰	1.39 × 10⁻⁴⁶	3.73 × 10⁻⁰⁸	2.79 × 10⁻⁰⁶	1.57 × 10⁻⁴²	1.67 × 10⁻³⁸	1.57 × 10⁻⁴²	1.67 × 10⁻³⁸
SPARCL1	2.31 × 10⁻¹⁵	1.42 × 10⁻¹²	1.53 × 10⁻⁴⁹	1.07 × 10⁻⁴⁵	1.29 × 10⁻⁴⁹	8.94 × 10⁻⁴⁶	1.22 × 10⁻¹³	6.46 × 10⁻¹¹	7.91 × 10⁻³⁸	5.60 × 10⁻³⁴	7.32 × 10⁻³⁸	5.18 × 10⁻³⁴
CLU	3.86 × 10⁻¹⁴	1.75 × 10⁻¹¹	7.64 × 10⁻⁴³	3.98 × 10⁻³⁹	7.54 × 10⁻⁴³	3.93 × 10⁻³⁹	3.24 × 10⁻¹²	1.11 × 10⁻⁰⁹	1.36 × 10⁻³³	5.78 × 10⁻³⁰	1.26 × 10⁻³³	5.23 × 10⁻³⁰
IL6ST	2.78 × 10⁻⁰⁶	8.37 × 10⁻⁰⁵	5.91 × 10⁻⁴²	2.46 × 10⁻³⁸	5.24 × 10⁻⁴²	2.19 × 10⁻³⁸	5.82 × 10⁻⁰⁴	7.40 × 10⁻⁰³	3.49 × 10⁻³³	1.06 × 10⁻²⁹	3.17 × 10⁻³³	9.61 × 10⁻³⁰
CRYAB	1.90 × 10⁻¹³	6.51 × 10⁻¹¹	4.47 × 10⁻³⁹	1.55 × 10⁻³⁵	4.06 × 10⁻³⁹	1.41 × 10⁻³⁵	2.62 × 10⁻¹¹	6.78 × 10⁻⁰⁹	1.66 × 10⁻³⁴	8.82 × 10⁻³¹	1.43 × 10⁻³⁴	7.60 × 10⁻³¹
ALDOC	1.30 × 10⁻¹⁶	9.72 × 10⁻¹⁴	2.84 × 10⁻³⁶	8.46 × 10⁻³³	2.28 × 10⁻³⁶	6.79 × 10⁻³³	2.50 × 10⁻¹⁴	1.87 × 10⁻¹¹	2.93 × 10⁻²⁹	5.66 × 10⁻²⁶	2.65 × 10⁻²⁹	5.11 × 10⁻²⁶
OSBPL1A	3.47 × 10⁻²⁰	6.03 × 10⁻¹⁷	1.09 × 10⁻³⁵	2.76 × 10⁻³²	9.29 × 10⁻³⁶	2.35 × 10⁻³²	4.44 × 10⁻²⁰	1.35 × 10⁻¹⁶	1.71 × 10⁻³³	6.07 × 10⁻³⁰	1.48 × 10⁻³³	5.23 × 10⁻³⁰
HTRA1	1.77 × 10⁻¹³	6.16 × 10⁻¹¹	1.19 × 10⁻³⁵	2.76 × 10⁻³²	1.01 × 10⁻³⁵	2.35 × 10⁻³²	3.43 × 10⁻¹¹	8.68 × 10⁻⁰⁹	1.73 × 10⁻²⁷	2.45 × 10⁻²⁴	1.46 × 10⁻²⁷	2.07 × 10⁻²⁴
PRNP	6.70 × 10⁻²⁴	3.50 × 10⁻²⁰	2.33 × 10⁻³⁵	4.87 × 10⁻³²	2.24 × 10⁻³⁵	4.66 × 10⁻³²	5.61 × 10⁻¹⁹	1.32 × 10⁻¹⁵	4.92 × 10⁻³⁰	1.16 × 10⁻²⁶	4.04 × 10⁻³⁰	9.53 × 10⁻²⁷
TSPYL2	1.46 × 10⁻¹⁹	2.03 × 10⁻¹⁶	8.68 × 10⁻³⁵	1.65 × 10⁻³¹	8.53 × 10⁻³⁵	1.62 × 10⁻³¹	8.81 × 10⁻¹⁹	1.87 × 10⁻¹⁵	6.39 × 10⁻²⁹	1.13 × 10⁻²⁵	6.17 × 10⁻²⁹	1.09 × 10⁻²⁵
BHLHE41	3.97 × 10⁻¹²	9.01 × 10⁻¹⁰	1.08 × 10⁻³⁴	1.88 × 10⁻³¹	9.52 × 10⁻³⁵	1.66 × 10⁻³¹	3.60 × 10⁻⁰⁸	2.70 × 10⁻⁰⁶	6.76 × 10⁻²⁸	1.03 × 10⁻²⁴	6.45 × 10⁻²⁸	9.79 × 10⁻²⁵
CD24	4.01 × 10⁻¹⁵	2.39 × 10⁻¹²	1.51 × 10⁻³⁴	2.43 × 10⁻³¹	1.37 × 10⁻³⁴	2.19 × 10⁻³¹	9.82 × 10⁻¹⁴	5.35 × 10⁻¹¹	1.09 × 10⁻²⁹	2.32 × 10⁻²⁶	9.27 × 10⁻³⁰	1.97 × 10⁻²⁶
NEUROD6	1.46 × 10⁻¹²	3.80 × 10⁻¹⁰	1.62 × 10⁻³²	2.42 × 10⁻²⁹	1.53 × 10⁻³²	2.28 × 10⁻²⁹	1.41 × 10⁻¹⁰	3.03 × 10⁻⁰⁸	2.14 × 10⁻³⁰	5.69 × 10⁻²⁷	1.73 × 10⁻³⁰	4.59 × 10⁻²⁷
ADD3	7.02 × 10⁻¹⁴	2.87 × 10⁻¹¹	1.18 × 10⁻³¹	1.64 × 10⁻²⁸	9.83 × 10⁻³²	1.37 × 10⁻²⁸	5.81 × 10⁻¹⁰	9.23 × 10⁻⁰⁸	2.65 × 10⁻²²	1.94 × 10⁻¹⁹	2.37 × 10⁻²²	1.68 × 10⁻¹⁹
BCL11A	5.72 × 10⁻¹⁴	2.44 × 10⁻¹¹	2.92 × 10⁻³¹	3.81 × 10⁻²⁸	2.42 × 10⁻³¹	3.16 × 10⁻²⁸	1.32 × 10⁻¹⁰	2.90 × 10⁻⁰⁸	1.44 × 10⁻²⁶	1.80 × 10⁻²³	1.16 × 10⁻²⁶	1.45 × 10⁻²³
SLC6A1	2.85 × 10⁻¹⁷	2.58 × 10⁻¹⁴	1.04 × 10⁻³⁰	1.27 × 10⁻²⁷	8.63 × 10⁻³¹	1.06 × 10⁻²⁷	5.76 × 10⁻¹⁴	3.42 × 10⁻¹¹	1.35 × 10⁻²¹	8.43 × 10⁻¹⁹	1.34 × 10⁻²¹	7.50 × 10⁻¹⁹
NR3C1	5.03 × 10⁻⁰⁷	1.93 × 10⁻⁰⁵	5.15 × 10⁻³⁰	5.66 × 10⁻²⁷	4.30 × 10⁻³⁰	4.89 × 10⁻²⁷	1.20 × 10⁻⁰⁹	1.68 × 10⁻⁰⁷	3.01 × 10⁻²⁷	4.00 × 10⁻²⁴	3.06 × 10⁻²⁷	4.06 × 10⁻²⁴
NEUROD2	2.34 × 10⁻⁰⁶	7.27 × 10⁻⁰⁵	4.56 × 10⁻³⁰	5.29 × 10⁻²⁷	4.45 × 10⁻³⁰	4.89 × 10⁻²⁷	7.12 × 10⁻⁰⁶	2.22 × 10⁻⁰⁴	3.74 × 10⁻²⁸	6.11 × 10⁻²⁵	3.68 × 10⁻²⁸	6.01 × 10⁻²⁵
ALCAM	5.90 × 10⁻¹⁵	3.33 × 10⁻¹²	6.98 × 10⁻³⁰	7.28 × 10⁻²⁷	5.98 × 10⁻³⁰	6.24 × 10⁻²⁷	1.80 × 10⁻¹⁶	2.25 × 10⁻¹³	2.25 × 10⁻²³	1.99 × 10⁻²⁰	2.13 × 10⁻²³	1.81 × 10⁻²⁰

References

1. Ting, D.T.; Wittner, B.S.; Ligorio, M.; Vincent Jordan, N.; Shah, A.M.; Miyamoto, D.T.; Aceto, N.; Bersani, F.; Brannigan, B.W.; Xega, K.; et al. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep. 2014, 8, 1905–1918.
2. Lawson, D.A.; Bhakta, N.R.; Kessenbrock, K.; Prummel, K.D.; Yu, Y.; Takai, K.; Zhou, A.; Eyob, H.; Balakrishnan, S.; Wang, C.Y.; et al. Single-cell analysis reveals a stem-cell program in human metastatic breast cancer cells. Nature 2015, 526, 131–135.
3. Guo, G.; Huss, M.; Tong, G.Q.; Wang, C.; Li Sun, L.; Clarke, N.D.; Robson, P. Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Dev. Cell 2010, 18, 675–685.
4. Bacher, R.; Kendziorski, C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 2016, 17, 63.
5. McDavid, A.; Finak, G.; Chattopadhyay, P.K.; Dominguez, M.; Lamoreaux, L.; Ma, S.S.; Roederer, M.; Gottardo, R. Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments. Bioinformatics 2013, 29, 461–467.
6. Kharchenko, P.V.; Silberstein, L.; Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 2014, 11, 740–742.
7. Darmanis, S.; Sloan, S.A.; Zhang, Y.; Enge, M.; Caneda, C.; Shuer, L.M.; Hayden Gephart, M.G.; Barres, B.A.; Quake, S.R. A survey of human brain transcriptome diversity at the single cell level. Proc. Natl. Acad. Sci. USA 2015, 112, 7285–7290.
8. Duan, N.; Manning, W.G.; Morris, C.N.; Newhouse, J.P. A comparison of alternative models for the demand for medical care. J. Bus. Econ. Stat. 1983, 1, 115–126.
9. Duan, N.; Manning, W.G.; Morris, C.N.; Newhouse, J.P. Choosing between the sample-selection model and the multi-part model. J. Bus. Econ. Stat. 1984, 2, 283–289.
10. Min, Y.; Agresti, A. Modeling nonnegative data with clumping at zero: A survey. J. Iran. Stat. Soc. 2002, 1, 7–33.
11. Olsen, M.K.; Schafer, J.L. A two-part random-effects model for semicontinuous longitudinal data. J. Am. Stat. Assoc. 2001, 96, 730–745.
12. Finak, G.; McDavid, A.; Yajima, M.; Deng, J.; Gersuk, V.; Shalek, A.K.; Slichter, C.K.; Miller, H.W.; McElrath, M.J.; Prlic, M.; et al. MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015, 16, 278.
13. Fournier, D.A.; Skaug, H.J.; Ancheta, J.; Ianelli, J.; Magnusson, A.; Maunder, M.N.; Nielsen, A.; Sibert, J. AD Model Builder: Using automatic differentiation for statistical inference of highly parameterized complex nonlinear models. Optim. Methods Softw. 2012, 27, 233–249.
14. Skaug, H.J.; Fournier, D.A. Automatic approximation of the marginal likelihood in non-Gaussian hierarchical models. Comput. Stat. Data Anal. 2006, 51, 699–709.
15. Pinheiro, J.; Bates, D. Mixed-Effects Models in S and S-PLUS; Springer Science & Business Media: New York, NY, USA, 2006.
16. Liu, L.; Strawderman, R.L.; Cowen, M.E.; Shih, Y.C. A flexible two-part random effects model for correlated medical costs. J. Health Econ. 2010, 29, 110–123.
17. Halekoh, U.; Højsgaard, S. A Kenward-Roger approximation and parametric bootstrap methods for tests in linear mixed models–the R package pbkrtest. J. Stat. Softw. 2014, 59, 1–30.
18. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994.
19. Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997; Volume 1.
20. Cai, T.; Lin, X.; Carroll, R.J. Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test. Biostatistics 2012, 13, 776–790.
21. Huang, Y.T.; Lin, X. Gene set analysis using variance component tests. BMC Bioinform. 2013, 14, 210.
22. Liu, D.; Lin, X.; Ghosh, D. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics 2007, 63, 1079–1088.
23. Dominguez, M.H.; Chattopadhyay, P.K.; Ma, S.; Lamoreaux, L.; McDavid, A.; Finak, G.; Gottardo, R.; Koup, R.A.; Roederer, M. Highly multiplexed quantitation of gene expression on single cells. J. Immunol. Methods 2013, 391, 133–145.
24. Squair, J.W.; Gautier, M.; Kathe, C.; Anderson, M.A.; James, N.D.; Hutson, T.H.; Hudelle, R.; Qaiser, T.; Matson, K.J.E.; Barraud, Q.; et al. Confronting false discoveries in single-cell differential expression. Nat. Commun. 2021, 12, 5692.
25. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
26. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140.
27. McCarthy, D.J.; Chen, Y.; Smyth, G.K. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012, 40, 4288–4297.
28. Cano-Gamez, E.; Soskic, B.; Roumeliotis, T.I.; So, E.; Smyth, D.J.; Baldrighi, M.; Wille, D.; Nakic, N.; Esparza-Gordillo, J.; Larminie, C.G.C.; et al. Single-cell transcriptomics identifies an effectorness gradient shaping the response of CD4(+) T cells to cytokines. Nat. Commun. 2020, 11, 1801.
29. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 1995, 57, 289–300.
30. Hicks, S.C.; Teng, M.; Irizarry, R.A. On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data. bioRxiv 2015, 10, 025528.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
