A Two-Part Mixed Model for Differential Expression Analysis in Single-Cell High-Throughput Gene Expression Data

The high-throughput gene expression data generated by recent single-cell RNA sequencing (scRNA-seq) and parallel single-cell reverse transcription quantitative real-time PCR (scRT-qPCR) technologies enable biologists to study the function of the transcriptome at the level of individual cells. Compared with bulk RNA-seq and RT-qPCR gene expression data, single-cell data show notably distinct features, including excessive zero expression values, high variability, and clustered design. We propose to model single-cell high-throughput gene expression data using a two-part mixed model, which not only adequately accounts for the aforementioned features of single-cell expression data but also provides the flexibility of adjusting for covariates. An efficient computational algorithm, automatic differentiation, is used for estimating the model parameters. Compared with existing methods, our approach shows improved power for detecting differentially expressed genes in single-cell high-throughput gene expression data.


Introduction
Recently, single-cell high-throughput gene expression profiling technologies, including single-cell RNA sequencing (scRNA-seq) and parallel single-cell reverse transcription quantitative real-time PCR (scRT-qPCR), have enabled researchers to examine mRNA expression at the resolution of individual cells, providing further biological insight into transcriptomes and functional genomics [1][2][3][4]. Compared to bulk RNA-seq and RT-qPCR experiments, which are usually performed on animal tissues (i.e., cell populations) and homogeneous cell lines, single-cell high-throughput gene expression data generated by scRNA-seq and scRT-qPCR have the following distinct features, as noted in recent literature [4][5][6]. Excessive zero expression values: the proportions of genes with observed zero expression values in single-cell gene expression data are much larger than in bulk RNA-seq or RT-qPCR data [4][5][6]. The reasons for this phenomenon can be biological, in that the abundance of mRNA for certain transcripts is essentially low in individual cells, or technical, in that the total amount of mRNA extracted from a single-cell sample is low [4,6].

The Two-Part Mixed Model for Single-Cell Gene Expression Data
We first introduce the notation for our approach. Assume there are m subjects and N genes in a scRNA-seq experiment, and n_i single-cell samples are extracted and sequenced for subject i (i = 1, . . . , m). Let y_ijk be the normalized expression value (in units of RPKM/FPKM, TPM, or CPM) for gene k (k = 1, . . . , N) in single-cell sample j (j = 1, . . . , n_i) of subject i. We then model the gene expression value y_ijk using the following two-part mixed model:

logit[Pr(y_ijk = 0)] = log(π_ijk / (1 − π_ijk)) = w_k^T α_k + u_ik,
log(y_ijk + c) = x_k^T β_k + v_ik + e_ijk for y_ijk > 0, (1)

where π_ijk = Pr(y_ijk = 0) is the proportion of single-cell samples with zero expression values for gene k (named the "zero-proportion" hereafter). In this two-part model, the zero-proportions are modeled by a logistic regression model (logistic or binomial part), and the log-transformed nonzero expression values are modeled by a linear regression model (Gaussian part), where w_k^T and x_k^T are the vectors of covariates for the binomial and Gaussian parts, respectively (e.g., if there are only two biological conditions and no other covariates to be adjusted, w_k^T and x_k^T are simply vectors of 1/0 indicators for the biological conditions), α_k and β_k are the corresponding vectors of regression coefficients associated with w_k^T and x_k^T, e_ijk is the random error assumed to be distributed as N(0, σ_e^2), and u_ik and v_ik are the random effects for subject i that account for the clustering effects, which are assumed to follow the bivariate normal distribution

(u_ik, v_ik)^T ~ N(0, Σ), Σ = [σ_u^2, ρσ_uσ_v; ρσ_uσ_v, σ_v^2], (2)

with σ_u^2 and σ_v^2 as the variances of the marginal univariate normal distributions of u_ik and v_ik, and ρ as the correlation between them. We note that most scRNA-seq experiments contain only one level of clustering (i.e., single cells are sampled from subjects). If the study design is more complicated, such that it contains multi-level cluster effects, more variance components for the random effects can be added to the model. Finally, a small constant c is added to the non-zero expression levels before taking logarithms to avoid the left skewness caused by taking logarithms of small expression values between 0 and 1, which are often seen in RNA-seq data. In the following analyses of scRNA-seq data, c is set to 1.
Genes 2022, 13, 377
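To make the data-generating process of Model (1) concrete, the following minimal Python sketch simulates expression values for one gene under the model; the function name and all default parameter values are illustrative choices, not values used in this article:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_gene(m=10, n_per_subject=20, alpha0=0.0, beta0=2.0,
                  sigma_u=0.5, sigma_v=0.5, rho=0.3, sigma_e=1.0, c=1.0):
    """Simulate one gene from the two-part mixed model (1)-(2)."""
    cov = [[sigma_u**2, rho * sigma_u * sigma_v],
           [rho * sigma_u * sigma_v, sigma_v**2]]
    y = []
    for _ in range(m):                               # subjects (clusters)
        u_i, v_i = rng.multivariate_normal([0.0, 0.0], cov)
        pi = 1.0 / (1.0 + np.exp(-(alpha0 + u_i)))   # zero-proportion
        for _ in range(n_per_subject):               # cells within subject
            if rng.random() < pi:
                y.append(0.0)                        # observed zero
            else:
                log_y_plus_c = beta0 + v_i + rng.normal(0.0, sigma_e)
                # invert log(y + c); floor at a tiny positive value because
                # exp(.) - c can dip below 0 for small expression values
                y.append(max(np.exp(log_y_plus_c) - c, 1e-8))
    return np.array(y)

y = simulate_gene()
print(y.shape, (y == 0.0).mean())
```

Note that the zero-proportion is constant within a subject here because the sketch includes only an intercept; with covariates it would vary by cell.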
In an scRT-qPCR experiment, gene expression levels are usually measured by expression threshold (et) values, defined as et = c_max − ct, where c_max is the maximum number of amplification cycles used in the scRT-qPCR experiment and ct is the threshold cycle at which the gene is detected by the PCR instrument [5]. The gene expression level y_ijk is assumed to have an exponential relationship with et, such that y_ijk = 2^et (for undetected genes, et is reported as missing by the PCR machine and can be treated as −∞, which gives zero expression values) [5]. Therefore, Model (1) can also be used to model gene expression values in scRT-qPCR data, and the definitions of the parameters are exactly the same as those given above for scRNA-seq data. The only difference is that adding the small constant c is not necessary for scRT-qPCR data, as the non-zero gene expression levels in scRT-qPCR experiments do not have many small values between 0 and 1, unlike those in scRNA-seq data.
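As a small illustration of this conversion (with c_max = 40 chosen here as an assumed, typical maximum cycle number):

```python
import numpy as np

def et_to_expression(ct, c_max=40.0):
    """Convert threshold-cycle (ct) values to expression values y = 2**et,
    where et = c_max - ct; undetected genes (ct missing/NaN) map to zero."""
    ct = np.asarray(ct, dtype=float)
    et = c_max - ct
    return np.where(np.isnan(ct), 0.0, 2.0 ** et)

print(et_to_expression([30.0, 40.0, np.nan], c_max=40.0))  # [1024. 1. 0.]
```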
Remark on related literature: The two-part model with a binomial part and a Gaussian part but without random effects was first proposed for modeling medical care data [8,9], where the dependent variable (medical care expenses) can take any non-negative value but has a positive probability mass at zero (this type of data is also called semicontinuous data) [8][9][10]. This type of model was later extended to longitudinal or clustered semicontinuous data by incorporating random effects in both the binomial part and the Gaussian part [11]. A comprehensive survey of models for data taking non-negative values with a substantial proportion of zeros, with applications, is given in [10]. Our two-part mixed model essentially follows the model formulation in [10,11], except for the addition of a small constant c to the non-zero expression values in RNA-seq data [Equation (1)]. A similar yet different two-part model without random effects was proposed to model scRNA-seq data in a recent paper, named MAST [12]. Instead of incorporating clustered random effects from subjects, MAST uses an empirical Bayes method to shrink the gene-specific variance toward the global variance of all genes [12].

Model Fitting
The proposed two-part mixed model (1) will be referred to as TMM hereafter. Since the TMM is fitted for each gene independently, we drop the subscript k for simplicity when there is no ambiguity. Following [11], the fixed-effect parameters of the TMM, α and β, are estimated by maximizing the following marginal likelihood function of the model:

L = ∏_{i=1}^{m} ∫∫ L_i^B L_i^G p(u_i, v_i) du_i dv_i, (3)

where L_i^B is the conditional distribution (likelihood) of y_ij given the random effect u_i from the binomial (logistic) part,

L_i^B = ∏_{j=1}^{n_i} π_ij^{I(y_ij = 0)} (1 − π_ij)^{I(y_ij > 0)}, (4)

and L_i^G is the conditional distribution (likelihood) of y_ij given the random effect v_i from the Gaussian part,

L_i^G = ∏_{j: y_ij > 0} (1/σ_e) φ([log(y_ij + 1) − x^T β − v_i]/σ_e), (5)

with φ(·) the standard normal PDF [for scRT-qPCR data, log(y_ij + 1) becomes log(y_ij)], I(·) the indicator function, and p(u_i, v_i) the joint distribution of the random effects u_i and v_i, which is the bivariate normal density given in Equation (2). As discussed in [10,11], maximizing the marginal likelihood function (3) involves numerical or stochastic approximation of the integrals, followed by maximization of the approximated likelihood. Several computational methods, including Markov chain Monte Carlo, the expectation-maximization (EM) algorithm, the penalized quasi-likelihood (PQL) method, Gauss-Hermite quadrature, and Laplace approximation, are reviewed and discussed in detail in [11]. Here, we use an efficient computational method, automatic differentiation, to maximize the likelihood function (3). The automatic differentiation technique is implemented in the software package Automatic Differentiation Model Builder (ADMB, version 11.4) [13,14].
Given the likelihood function written in the form of (3), ADMB calculates the Hessian matrix of the marginal likelihood function using the automatic differentiation technique; the maximization is performed by first approximating the integrals using Laplace approximations and then maximizing the approximated likelihood with a quasi-Newton algorithm. Descriptions of the automatic differentiation technique can be found in [13,14], and details on the implementation of the algorithm can be found at https://www.admb-project.org/ (accessed on 21 December 2021).
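To illustrate the integral-approximation-then-maximization strategy without ADMB, the sketch below maximizes the marginal likelihood of a simplified version of the binomial part only (a single intercept and an independent random intercept, no correlation with the Gaussian part), using Gauss-Hermite quadrature, one of the methods reviewed in [11], in place of the Laplace approximation; all function names and parameter values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def neg_marginal_loglik(params, z_by_subject, nodes, weights):
    """Negative marginal log-likelihood of a random-intercept logistic model
    (a simplified binomial part of the TMM), with the integral over each u_i
    approximated by Gauss-Hermite quadrature."""
    alpha0, log_sigma_u = params
    sigma_u = np.exp(log_sigma_u)            # optimize on the log scale
    ll = 0.0
    for z in z_by_subject:                   # z[j] = 1 if y_ij == 0, else 0
        u = np.sqrt(2.0) * sigma_u * nodes   # quadrature points for u_i
        eta = alpha0 + u[:, None]            # shape (n_nodes, n_cells)
        log_pi = -np.log1p(np.exp(-eta))     # log pi_ij
        log_1mpi = -np.log1p(np.exp(eta))    # log(1 - pi_ij)
        cond_ll = (z * log_pi + (1 - z) * log_1mpi).sum(axis=1)
        ll += np.log((weights / np.sqrt(np.pi)) @ np.exp(cond_ll))
    return -ll

# Simulate 20 subjects with 30 cells each, alpha0 = -0.5, sigma_u = 0.8.
rng = np.random.default_rng(1)
z_by_subject = []
for _ in range(20):
    u_i = rng.normal(0.0, 0.8)
    pi = 1.0 / (1.0 + np.exp(-(-0.5 + u_i)))
    z_by_subject.append(rng.binomial(1, pi, size=30))

nodes, weights = np.polynomial.hermite.hermgauss(20)
fit = minimize(neg_marginal_loglik, x0=[0.0, 0.0],
               args=(z_by_subject, nodes, weights), method="Nelder-Mead")
alpha0_hat, sigma_u_hat = fit.x[0], np.exp(fit.x[1])
print(alpha0_hat, sigma_u_hat)
```

The estimates should land near the simulating values; ADMB's Laplace-plus-quasi-Newton approach plays the same role for the full bivariate model far more efficiently.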

Testing for Differential Expression
Testing for differential expression of genes across biological conditions under Model (1) is done by testing for the fixed effects. More explicitly, (1) can be written as

logit[Pr(y_ijk = 0)] = w_1^T α_1 + w_2^T α_2 + u_ik,
log(y_ijk + c) = x_1^T β_1 + x_2^T β_2 + v_ik + e_ijk for y_ijk > 0, (6)

where w_1^T and x_1^T are the covariates of interest that we want to test for, and w_2^T and x_2^T are the covariates to be adjusted for in the model. Specifically, we are interested in testing for the following two effects across biological conditions: (1) whether the zero-proportions are significantly different across conditions and (2) for genes with non-zero expression levels, whether the mean expression levels are significantly different across conditions. The two problems can be formulated as the following two corresponding hypothesis testing problems:

(1) Testing of the binomial part

H_B0: α_1 = 0 versus H_B1: α_1 ≠ 0; (7)

(2) Testing of the Gaussian part

H_G0: β_1 = 0 versus H_G1: β_1 ≠ 0; (8)

and the two parts can also be tested jointly, which can improve the statistical power:

(3) Joint testing of the binomial and Gaussian parts

H_0: α_1 = 0 and β_1 = 0 versus H_1: α_1 ≠ 0 or β_1 ≠ 0. (9)
The individual test for the binomial part or the Gaussian part can be performed using the Wald test or the likelihood ratio test, and the joint test for the two parts can be performed using the likelihood ratio test. Under H_0, the asymptotic distributions of the Wald statistic (W_0) and the likelihood ratio statistic (L_0) can be approximated by the χ² distribution with degrees of freedom equal to the difference in the numbers of parameters between H_0 and H_1, which is a widely used approach in practice [15,16]. However, for small sample sizes, the χ² distributions are not good approximations to the null distributions of the two test statistics, which, as noted in the literature [15,17] and as shown in the simulations in the Results section, often leads to inflated type I error rates. Therefore, we use the following two methods for reliable estimation of p-values when the sample size is small: The parametric bootstrap method: this approach estimates the null distribution of the test statistic by simulating data from the fitted model under H_0, which is performed in the following way [17][18][19]: (1) Fit Model (6) under H_0 and generate N random samples y_1, . . . , y_N from the fitted model. (2) Calculate the corresponding test statistics (i.e., Wald or likelihood ratio statistics) T(y_1), . . . , T(y_N) from the simulated samples.
(3) The p-value is estimated as

p̂ = (1/N) Σ_{b=1}^{N} I[T(y_b) ≥ γ],

where γ is the test statistic (Wald or likelihood ratio) calculated from the observed data (an alternative formula is

p̂ = {1 + Σ_{b=1}^{N} I[T(y_b) ≥ γ]} / (N + 1).

The two formulas give almost the same results provided N is large, so we use the former throughout this article).
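The two bootstrap p-value formulas can be computed as follows; in this sketch, the null statistics are drawn from a χ² distribution purely as a stand-in for statistics obtained by refitting the model to data simulated under H_0:

```python
import numpy as np

def bootstrap_pvalue(t_obs, t_null):
    """Parametric-bootstrap p-value: proportion of statistics simulated under
    H0 that are at least as extreme as the observed statistic (both formulas)."""
    t_null = np.asarray(t_null, dtype=float)
    n = t_null.size
    p_simple = np.mean(t_null >= t_obs)                   # (1/N) sum I(T >= gamma)
    p_plus_one = (1 + np.sum(t_null >= t_obs)) / (n + 1)  # never exactly zero
    return p_simple, p_plus_one

rng = np.random.default_rng(2)
# Stand-in for statistics from models refitted to data simulated under H0:
t_null = rng.chisquare(df=1, size=999)
p_simple, p_plus = bootstrap_pvalue(3.84, t_null)
print(p_simple, p_plus)
```

The second formula is often preferred when very small p-values must be reported, since it cannot return exactly zero.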
The empirical Satterthwaite method: this method is proposed in [20], and it is a general approach for approximating the null distribution of test statistics [17,20,21,22]. Following [20,21], this method is performed in the following two steps: (1) Approximate the null distribution of the test statistic (W_0 or L_0) by a scaled χ² distribution kχ²_v, with k as the scale parameter and v as the degrees of freedom. The parameters k and v can be estimated by matching the first two moments (sample mean and variance) of the test statistics under H_0 with those of kχ²_v [20,21]. The sample mean and variance of the test statistics under H_0 can be obtained using the above parametric bootstrap method with a smaller number of random samples.
(2) Fit a two-component normal mixture distribution π_1 N(μ_1, σ_1²) + π_2 N(μ_2, σ_2²) to the transformed values Φ^{-1}(p_b), where p_b is the p-value obtained from the above scaled χ² distribution kχ²_v for the bth random sample and Φ(·) is the standard normal CDF. The final p-values are calculated as

p = Ψ[Φ^{-1}(p_{kχ²_v})],

where p_{kχ²_v} is the p-value obtained from Step (1) and Ψ is the CDF of the fitted normal mixture distribution π̂_1 N(μ̂_1, σ̂_1²) + π̂_2 N(μ̂_2, σ̂_2²). The Satterthwaite method can estimate p-values using a smaller number of random samples than the parametric bootstrap method [20,21]. However, in our simulations, it also shows an inflated type I error rate when the sample size is small (see the simulations in the next section).
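Step (1) amounts to simple moment matching: if T ~ kχ²_v, then E(T) = kv and Var(T) = 2k²v, so k = Var/(2E) and v = 2E²/Var. A sketch, with χ² draws standing in for the bootstrap statistics:

```python
import numpy as np
from scipy import stats

def satterthwaite_scaled_chi2(t_null):
    """Match the first two moments of null statistics to k * chi2_v:
    E(T) = k*v and Var(T) = 2*k^2*v, so k = Var/(2E), v = 2E^2/Var."""
    m = np.mean(t_null)
    s2 = np.var(t_null, ddof=1)
    return s2 / (2.0 * m), 2.0 * m**2 / s2

rng = np.random.default_rng(3)
t_null = rng.chisquare(df=2, size=200)   # small bootstrap sample under H0
k, v = satterthwaite_scaled_chi2(t_null)

t_obs = 7.0                              # an observed test statistic
p = stats.chi2.sf(t_obs / k, df=v)       # p-value under the fitted k * chi2_v
print(k, v, p)
```

Because the true null here is χ² with 2 degrees of freedom, the fitted values should be close to k ≈ 1 and v ≈ 2.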

Evaluation of Type I Error Rates
In this section, we evaluate the type I error rates of the three methods for approximating the null distribution of the test statistics under H_0: the χ² distribution, the Satterthwaite method, and the parametric bootstrap method. The simulations are performed under the following settings: there are two biological conditions, each with m/2 subjects, and for each subject i there are n_i single-cell samples. To evaluate type I error rates, we simulate gene expression levels y_ijk from the following model under H_0 (i.e., there is no difference between the two conditions):

logit[Pr(y_ijk = 0)] = α_0 + u_ik,
log(y_ijk + 1) = β_0 + v_ik + e_ijk for y_ijk > 0.

In this model, there is only one intercept for the fixed effect in both the binomial and Gaussian parts; therefore, no differences in zero-proportions or mean expression levels are expected between the two conditions. We tune the sample sizes by varying m over 3 different values, 4, 10, and 20, which correspond to a range of increasing sample sizes. The simulations are repeated 1000 times for each m. For each run, we calculate the following five test statistics: the Wald statistic for the Gaussian part, the Wald statistic for the binomial part, the likelihood ratio statistic for the Gaussian part, the likelihood ratio statistic for the binomial part, and the likelihood ratio statistic for jointly testing the Gaussian and binomial parts. We then calculate the p-values for each test using the 3 methods described in Section 2.3.
If the type I error rate is correctly controlled, the p-values from the 1000 repetitions for each m should be uniformly distributed between 0 and 1, so we examine each method using quantile-quantile plots of the p-values calculated from the simulated datasets (observed p-values) against the quantiles of the uniform [0, 1] distribution (expected p-values), shown in Appendix A Figures A1-A5. As shown in these results, all 3 methods give well-controlled type I error rates for m = 20. However, for small sample sizes (m = 10 or m = 4), the 3 methods rank as follows in controlling the type I error rate (from best to worst): parametric bootstrap, Satterthwaite, and the χ² distribution. The inflation of the type I error rate is more severe for the χ² distribution with the test for the binomial part (Figures A2 and A4) or the joint test for the two parts (Figure A5). On the other hand, the parametric bootstrap takes the longest computational time, which can be overwhelming if we want to accurately estimate small p-values. As a general rule, if the sample size is large, the χ² distribution can be used. If the sample size is small, the parametric bootstrap method should be preferred, even at the cost of longer computational time. The Satterthwaite method can be considered an alternative for moderate sample sizes. Another strategy is to first use the p-values from the χ² distribution, or the rankings of the test statistics, to identify the top differentially expressed genes, and then use the parametric bootstrap to accurately estimate the p-values for those top genes.
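The uniformity check underlying these quantile-quantile plots can also be done numerically, e.g., with the empirical rejection rate at the 0.05 level and a Kolmogorov-Smirnov test; the sketch below contrasts well-calibrated p-values with anti-conservative ones, using Beta(0.5, 1) draws purely as an artificial example of inflation:

```python
import numpy as np
from scipy import stats

def type1_error_and_uniformity(pvals, level=0.05):
    """Empirical type I error rate at a given level, plus a Kolmogorov-Smirnov
    test that the null p-values are uniform on [0, 1]."""
    pvals = np.asarray(pvals, dtype=float)
    rate = np.mean(pvals < level)
    ks_stat, ks_p = stats.kstest(pvals, "uniform")
    return rate, ks_stat, ks_p

rng = np.random.default_rng(4)
well_calibrated = rng.uniform(size=1000)     # p-values from a valid test
inflated = rng.beta(0.5, 1.0, size=1000)     # anti-conservative p-values
print(type1_error_and_uniformity(well_calibrated))
print(type1_error_and_uniformity(inflated))
```

A well-calibrated test gives a rejection rate near 0.05 and a large KS p-value; an inflated test gives a rejection rate well above 0.05.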

Evaluation of Statistical Power
In this section, we evaluate the statistical power of the TMM model and compare it with an existing method, MAST [12], and with the two-part model with binomial and Gaussian parts but without random effects (named TM hereafter). The simulations are performed under the following settings: there are two biological conditions, each with m/2 subjects, and for each subject i there are n_i single-cell samples sequenced. To evaluate power, we simulate the gene expression levels y_ijk from the following model under H_1:

logit[Pr(y_ijk = 0)] = α_1 + α_2 w + u_ik,
log(y_ijk + 1) = β_1 + β_2 x + v_ik + e_ijk for y_ijk > 0.

In this model, w and x are 0/1 indicators of the conditions, and the effect sizes are represented by the parameters α_2 and β_2, which correspond to the log odds ratio of the zero-proportions and the log fold change of the mean expression values for non-zero genes between the two conditions. The values of the parameters are set as follows: m = 10, n_i = 20 for all i (i = 1, . . . , m), and σ_u = 0.5. We then tune the effect sizes by varying (α_2, β_2) over the following values: (0, 0), (0.25, 0.25), (0.5, 0.5), . . . , (1.5, 1.5). The simulations are repeated 1000 times for each pair of (α_2, β_2). In each run, we apply our model TMM with the three methods for calculating p-values (the χ² distribution, Satterthwaite, and parametric bootstrap), MAST, and TM, respectively. The estimated power for each method is calculated as the proportion of p-values less than 0.05 among the 1000 repetitions. Figure 1 shows the power curves for each model at different effect sizes. As expected, the power of each method increases with effect size. The power of TMM is consistently higher than that of the other two models, which is also expected since we include random effects in this simulation setting.
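The power estimate used here, the proportion of p-values below 0.05, can be sketched as follows; noncentral χ² draws stand in for likelihood ratio statistics under H_1, with larger noncentrality playing the role of larger effect sizes (these are illustrative stand-ins, not the TMM simulation itself):

```python
import numpy as np
from scipy import stats

def estimated_power(pvals, level=0.05):
    """Power estimate: proportion of simulation p-values below the level."""
    return float(np.mean(np.asarray(pvals) < level))

rng = np.random.default_rng(5)
powers = {}
# Noncentral chi-square draws stand in for likelihood ratio statistics under
# H1; noncentrality 0 corresponds to H0 (a type I error check).
for ncp in [0.0, 2.0, 8.0]:
    t = (rng.noncentral_chisquare(df=2, nonc=ncp, size=1000)
         if ncp > 0 else rng.chisquare(df=2, size=1000))
    pvals = stats.chi2.sf(t, df=2)
    powers[ncp] = estimated_power(pvals)
    print(ncp, powers[ncp])
```

Power rises monotonically with the noncentrality, mirroring the rising power curves in Figure 1.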

Application to an scRT-qPCR Dataset
First, we apply the TMM model to an scRT-qPCR dataset and compare the results with MAST. This dataset is described in [23] and is distributed with the MAST package [12]: 456 single-cell samples of T cells were isolated from 2 patients with human immunodeficiency virus (HIV), and the expression levels of 75 genes related to immune system function were measured by scRT-qPCR. The activation status of two immune-response proteins, T cell receptor Vβ (TCR-Vβ) and CD154, is used to categorize these T cells, and the 456 single cells are divided into the following 4 groups: TCR-Vβ+/CD154+, TCR-Vβ+/CD154−, TCR-Vβ−/CD154+, and TCR-Vβ−/CD154−, where the TCR-Vβ+/CD154+ group corresponds to activated T cells with normal immune function [23]. The goal of the analysis is to identify differentially expressed genes across these four groups.
We fit MAST and our TMM model to this dataset. Specifically, the following two covariates are included in MAST: X 1 : a categorical variable indicating which of the above four groups the sample belongs to, where the TCR-Vβ+/CD154+ is coded as the reference group. This variable is the one of interest.
X 2 : a categorical variable indicating which of the two subjects the sample is from. For our TMM model, X 1 is included as a fixed effect in both the binomial part and the Gaussian part. The two subjects are treated as two clusters, which are included as random effects in TMM. The likelihood ratio test is used to test the individual Gaussian part and binomial part and also to jointly test the two parts, and the χ 2 distribution approximation is used to calculate p-values for saving the computational time.
The results from MAST and TMM for the 75 genes are shown in Table A1, and Figure 2 is a graphical comparison of the p-values from the two methods. We can see that the results from the two methods agree with each other in general, though some genes show different p-values in the tests for the zero-proportions (binomial part) (Figure 2). This is expected, as there are only two clusters in this dataset, and the clustering effects do not play a significant role in this example. In fact, a reasonable number of clusters should be included in a mixed-effects model for it to be useful in practice [15]. Therefore, MAST should be preferred over TMM for this dataset, and the application of TMM here is for the purpose of demonstration. On the other hand, these results show that TMM is not essentially worse than MAST, even when the clustering effects are not significant.

Application to scRNA-seq Datasets
A recent study compared various methods for differential expression analysis in scRNA-seq using a number of scRNA-seq datasets with matched bulk RNA-seq in the same purified cell types as a reference [24]. This study showed that pseudobulk methods, which first aggregate reads across samples (i.e., biological replicates) to transform a genes-by-cells matrix into a genes-by-samples matrix and then use methods developed for bulk RNA-seq, such as DESeq [25], edgeR [26], and limma [27], for the subsequent differential expression analysis, achieved the highest concordance with matched bulk RNA-seq results when the number of cells obtained from each sample is large (>500), while a negative binomial mixed model (NBMM) won when the number of cells per sample is not large (<200). Here we used one of those datasets containing both scRNA-seq and matched bulk RNA-seq data made publicly available in [24], originally published in [28] to study gene expression profile changes between five types of cytokine-stimulated CD4+ T cells and unstimulated CD4+ T cells (control), to compare the performance of TMM (with p-values evaluated by the empirical Satterthwaite method), an NBMM with the library size as an offset term as implemented in [24], and a pseudobulk method using the likelihood ratio test in edgeR (referred to as edgeR below).
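The aggregation step of the pseudobulk approach, summing reads over the cells of each sample to turn a genes-by-cells matrix into a genes-by-samples matrix, is simple; a minimal sketch:

```python
import numpy as np

def pseudobulk(counts, sample_ids):
    """Aggregate a genes-by-cells count matrix into a genes-by-samples matrix
    by summing reads over the cells of each sample (biological replicate)."""
    samples = sorted(set(sample_ids))
    sample_ids = np.asarray(sample_ids)
    agg = np.column_stack([counts[:, sample_ids == s].sum(axis=1)
                           for s in samples])
    return agg, samples

counts = np.array([[1, 2, 0, 4],      # gene 1, four cells
                   [0, 0, 3, 1]])     # gene 2
agg, samples = pseudobulk(counts, ["s1", "s1", "s2", "s2"])
print(samples)   # ['s1', 's2']
print(agg)       # [[3 4] [0 4]]
```

The resulting genes-by-samples matrix can then be passed directly to a bulk RNA-seq tool such as edgeR.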
Following [24], we first obtain the lists of differentially expressed genes in the matched bulk data, next apply the 3 aforementioned approaches to a series of scRNA-seq datasets downsampled from the original scRNA-seq datasets [28] to between 25 and 500 cells per sample, and then calculate the area under the concordance curve (AUCC; ranging from 0 to 1, with 1 indicating perfect concordance and 0 complete discordance). We have to use the downsampled datasets because the running time of NBMM is very long (see [24] and below), which prevents us from applying these approaches to the full datasets. The results are shown in Figure 3, where we can see that NBMM and TMM show higher concordance with matched bulk RNA-seq than edgeR when the number of cells per sample is not large (number of cells ≤ 200, Figure 3), while edgeR gives the highest concordance when the number of cells per sample is large (number of cells = 500, Figure 3). Regarding the running time: edgeR is the fastest, with an average time of 1.7 min (including the time for aggregating reads across samples); TMM has an average time of 53.2 min; NBMM is the slowest, with an average time of 1174.3 min. These comparisons imply that TMM is more suitable for situations where the number of cells per sample is not large. We elaborate on these comparisons and the strengths of the different approaches in Section 5.
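One common way to define the AUCC for two ranked gene lists is the cumulative overlap of the top-i genes, i = 1, ..., k, normalized by its maximum; the sketch below uses that definition, which may differ in details from the exact computation in [24]:

```python
def aucc(rank_a, rank_b, k=100):
    """Area under the concordance curve between two ranked gene lists:
    cumulative overlap of the top-i genes, i = 1..k, normalized so that
    identical rankings give 1 and disjoint ones give 0."""
    a, b = list(rank_a)[:k], list(rank_b)[:k]
    overlap_sum = sum(len(set(a[:i]) & set(b[:i])) for i in range(1, k + 1))
    return overlap_sum / (k * (k + 1) / 2)

genes = [f"g{i}" for i in range(100)]
print(aucc(genes, genes, k=100))                 # 1.0 for identical rankings
print(aucc(genes, list(reversed(genes)), k=50))  # 0.0: top-50 lists disjoint
```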
Next, we apply the TMM model to another scRNA-seq dataset and compare it with MAST and TM. This dataset is published in [7], which contains 466 single-cell samples from the human brain tissues of 8 adults (aged from 21 to 63 years) and 4 fetuses (all aged 16 to 18 weeks), and the expression levels of 22,088 genes in these samples are measured by scRNA-seq [7]. The dataset is available in NCBI Gene Expression Omnibus under accession number GSE67835.

The goal of our analysis is to identify differentially expressed genes between the adult and fetal brains. We fit TMM with the following two covariates as fixed effects: X_1: a 0/1 indicator of biological conditions (adult versus fetus), which is the variable of interest; X_2: the gender of the subjects: male and female for adults, with the gender of the fetuses coded as a third category, "undeveloped".
The 12 subjects are treated as clusters, which are included as random effects in the model. The likelihood ratio test is used to test the individual Gaussian part and binomial part and also to jointly test the two parts, and the χ 2 distribution approximation is used to calculate p-values for saving the computational time. We also fit the MAST and TM models, where X 1 and X 2 are included as covariates in these two models. Multiple comparison adjustment is performed using the Benjamini-Hochberg FDR procedure [29]. Figure 4 shows the number of differentially expressed genes identified by each method with FDR < 0.01, and Table A2 shows the p-values and FDR for the top 20 differentially expressed genes (ranked by the p-values from the joint test for both the Gaussian and binomial parts under the TMM model). We can see the results from the three models show considerable overlaps (Figure 4), and the top differentially expressed genes all show very significant p-values and FDR from all methods. Notably, the total number of differentially expressed genes detected by TMM with FDR < 0.01 is much larger than the other two methods.
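For reference, the Benjamini-Hochberg adjustment used above can be computed directly from a vector of p-values; a minimal sketch:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)     # p_(i) * n / i
    # enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

p = [0.001, 0.01, 0.03, 0.2, 0.8]
print(benjamini_hochberg(p))  # [0.005 0.025 0.05  0.25  0.8  ]
```

Genes with adjusted p-values below 0.01 correspond to the FDR < 0.01 threshold used in Figure 4.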

Discussion
In summary, we present a two-part mixed model (TMM) for differential expression analysis of single-cell gene expression data. This model not only adequately accounts for the distinct features of single-cell expression data, including excessive zero expression values, high variability, and clustered design, but also provides the flexibility of adjusting for covariates. Since scRNA-seq is still a developing and growing technology, it brings more challenges in data analysis than bulk RNA-seq. These challenges can be technical (e.g., the number of samples in scRNA-seq is large, and the sequencing experiments are performed in different batches [30]) or biological (e.g., the distinct features of single-cell gene expression data discussed in the Introduction). Several recent studies show that confounding factors are often present in scRNA-seq experiments, which can lead to biased results. These factors can be categorized as technical factors related to the design of experiments, such as batch effects [30], or biological factors such as the detection rate of genes [12,30], gene length, and GC content [30]. These confounding factors can be adjusted for in TMM; however, planning a good study design for scRNA-seq experiments to reduce confounding is a more fundamental task [30].

More recently, several new models and approaches have been proposed for DE analysis of scRNA-seq data. As studied in [24] and Section 4.2, the pseudobulk method, which mimics the bulk RNA-seq data format by aggregating reads across samples to generate a genes-by-samples matrix, enables the use of well-maintained tools developed for bulk RNA-seq, such as DESeq [25], edgeR [26], and limma [27], for the analysis of scRNA-seq data; these methods are faster and show higher concordance with the DE results from bulk RNA-seq when the number of cells per sample is large, which can be achieved with current sequencing platforms. Alternatively, our approach, TMM, has the strength of being reliable when the number of cells per sample is not large (e.g., for scRT-qPCR data and scRNA-seq data with smaller sample sizes and lower cost) and of providing a test for whether the proportions of zero or lowly expressed genes differ between biological conditions. As future work, the computational speed and p-value estimation of TMM should be further optimized, a need common to many mixed-effects models [24]. On a separate note, in [24] and Section 4.2, the DE genes in the matched bulk RNA-seq datasets used to check the consistency of the scRNA-seq methods were themselves identified using DESeq, edgeR, and limma, which may bias the comparison toward higher concordance for the pseudobulk methods built on the same three packages.