A Comparison of Methods for RNA-Seq Differential Expression Analysis and a New Empirical Bayes Approach

Transcriptome-based biosensors are expected to have a large impact on the future of biotechnology. However, a central aspect of transcriptomics is differential expression analysis, where, currently, deep RNA sequencing (RNA-seq) has the potential to replace the microarray as the standard assay for RNA quantification. Our contributions here to RNA-seq differential expression analysis are two-fold. First, given the high cost of an RNA-seq run, biological replicates are rare, and therefore, information sharing across genes to obtain variance estimates is crucial. To handle such information sharing in a rigorous manner, we propose an hierarchical, empirical Bayes approach (R-EBSeq) that combines the Cufflinks model for generating relative transcript abundance measurements, known as FPKM (fragments per kilobase of transcript length per million mapped reads) with the EBArrays framework, which was previously developed for empirical Bayes analysis of microarray data. A desirable feature of R-EBSeq is easy-to-implement analysis of more than pairwise comparisons, as we illustrate with experimental data. Secondly, we develop the standard RNA-seq test data set, on the level of reads, where 79 transcripts are artificially differentially expressed and, therefore, explicitly known. This test data set allows us to compare the performance, in terms of the true discovery rate, of R-EBSeq to three other widely used RNAseq data analysis packages: Cuffdiff, DEseq and BaySeq. Our analysis indicates that DESeq identifies the first half of the differentially expressed transcripts well, but then is outperformed by Cuffdiff and R-EBSeq. Cuffdiff and R-EBSeq are the two top performers. Thus, R-EBSeq offers good performance, while allowing flexible and rigorous comparison of multiple biological conditions.

. The RNA-seq pipeline. This schematic illustrates the process of going from cells to RNA-seq data, with potential sources of bias and variability noted along the way. See the Introduction for details.
considerations; and (2) the ability to do simultaneous differential expression comparisons with arbitrary many expression patterns. Unlike another EBArrays-based method for the RNA-seq analysis [12], ours works with FPKM measurements, which seem more appropriate for the across-gene information-sharing framework of EBArrays. To illustrate the advantages of our approach, we compare R-EBSeq method with three other established differential expression methods: FPKM-based Cuffdiff and count-based DESeq and BaySeq [7,9,13,14].

EBarrays
The method developed in [10,11] characterizes the distribution of expression measurements for a single gene (or transcript) in a certain condition. Here, we present a brief overlook of the idea behind the framework and how we adapt it to our problem. Similar empirical-based approach, but for count-based RNA-seq analysis, has been recently developed in [12]. We refer to their software documentation for additional EBArrays model details.
Let j indicate j-th transcript and x j = x j1 , . . . , x jI be the expressions in the particular condition denoted: x, where I is the number of expression measurements we can take in this condition. The method treats x j as a random sample with a transcript-and condition-specific mean. Thus, we have that X ij |µ j are iid with a pdf given as f X ij |µ j (x ij ). To allow information sharing among all transcripts and conditions, a prior distribution, π, is assumed on µ j .
In a simple setting, if we investigate differences in expression between only two conditions, the index set {1, . . . , I} is partitioned into two subsets, s 1 , s 2 , each containing indices for a corresponding condition (x j = (x js 1 , x js 2 )). For each transcript, we aim to investigate a null hypothesis that assumes equivalent expression (EE j ) among conditions, against the alternative that a transcript is differentially expressed (DE j ) in one of conditions. The pdf in each expression pattern ('EE' or 'DE') can be written as follows: EE j : DE j : Let p and 1 − p denote the fraction of all genes that are differentially expressed and equivalently expressed among conditions. Then, the pdf for x j takes the following form: Using Bayes' formula, we can write the pdfs for each hypothesis: The form of both the conditional distribution of x j and its conjugate prior to the mean are usually known. EBarrays uses the EM algorithm to compute the estimates of parameters for both distributions, as well as mixing proportions of genes supporting the investigated hypothesis.
This method is easily generalized for more than two expression patterns. A pattern corresponding to the null hypothesis, where each sample has the same underlying mean, is always distinguished. Consider m DE patterns, which yield a total of m+1 patterns (including the null pattern-Equivalent Expression).
The formula for the marginal distribution of x j , corresponding to Equation (3), is: where f k is density corresponding to the k-th pattern, as in [10], and p k is the fraction of genes supporting the corresponding pattern (hypothesis). Consequently, the posterior probability of expression pattern, k, for gene, j, is:

EBarrays Model for Cufflinks-Processed Data
Usually, Cufflinks-processed data does not have replicates. To use the EBarray framework, replicates need to be generated from the Cufflinks-specified normal distribution [7] for gene-or transcript-specific FPKM values (see below for discussion of normal approximation). They are then fed into EBarrays. The number of replicates to use is a free parameter of R-EBSeq, and its effect of differential expression analysis is explored in the main text. Since expression measurements approximately follow a normal distribution, the conjugate prior distribution for means is also normal. Thus, it is natural to assume a normal-normal model for the EBarrays estimation procedure: In this setting, π ∼ N (µ 0 , τ 0 ) and an n-dimensional input for gene j, after evaluating the integral as detailed in [11], follows a marginal pdf, f , which is the pdf of a normal distribution with mean, µ = (µ 0 , . . . , µ 0 ), and covariance matrix: where I n×n is an identity matrix and M n×n is a matrix of ones. We adapted the EBarrays R code to account for the normal-normal model with modified variances by modifying the previously built-in LNNMV model (available as Supplemental File). This normal-normal model with modified variances code was created using the tools provided in EBarrays R package [10,11].

Justification for a Normal Distribution Model for Transcript Abundances A t
According to Equation (15), the transcript abundance,Â t , is a product of two maximal likelihood estimates and is approximated in the model by a Gaussian variate. Here, we derive a brief justification, which works for any products of asymptotically Gaussian estimators. Consider two parameters, θ and η, with their respective estimates,θ n andη n , which satisfy jointly the multivariate CLT , that is, as n → ∞ in distribution, where BV N (0, σ 1 , σ 2 , ρ) is a bivariate normal random variate with zero marginal means, marginal standard deviations, σ 1 and σ 2 , and the correlation coefficient, ρ. In the case of the abundance model, A t considered in the paper, the vector, (θ n ,η n ) = (X g ,γ t ), is a joined MLE and the convergence above follows from the general theory of the likelihood-based estimation (see, e.g., [15] Chapter 4), where, now, n stands for the sampling depth (or the overall number of transcripts). Sincê andη n → η in probability, therefore, due to Slutsky's theorem (see, e.g., [15] chapter 1), we see that in distribution, where N (0, γ) is a zero mean normal variate with the standard deviation, γ, such that γ 2 = η 2 σ 2 1 + θ 2 σ 2 2 + 2ηθσ 1 σ 2 ρ

Simulating Data Sets for Use with R-EBSeq
In order to evaluate the performance of our method, we investigated its ROC plots for several different parameter settings under the normal-normal model assumption. The control dataset was created by generating a 1, 000 × n measurement matrix, according to the Normal-Normal model-reflecting 1,000 transcripts under one condition with n replicates. The number of replicates is also referred to as the sample size.
The treatment dataset was created in the same way, but expression values for 10% of the transcripts (randomly chosen) were generated with altered parameters in order to simulate the differential expression. Only over-expression was considered.
To make the simulations as close to real data as possible, the parameters in the control setting (the mean and the variance) for the generation from the normal-normal model were taken from real data (we have chosen a fixed sample of 1, 000 transcripts from Cufflinks-processed yeast data [16]).
The following three settings were investigated:

Mean difference in expression:
Fixed: variance-taken from Cufflinks, sample size: 10 Investigated: influence of difference in expression in 10% of transcripts ("treatment" group). Different levels of differential expression were generated according to the following: 5% of transcripts had x mean difference, and the remaining 5% had 2x mean difference in expression from the "control" group. The value of x was varied from 0, 5, 10, . . . , 50, and these different values yield ten ROCs visible in Figure 2(A).
Sample size:

Generating an RNA-Seq Benchmark Dataset for Comparing Various Differential Expression Testing Methods
In order to make comparisons between methods, we needed a benchmark dataset that would provide a common base for all methods. To ensure similarity with experimental conditions, we started with real data published in [13,16]. The control (sam file format) dataset is Tophat processed SRA output [7]. For annotation, we used a GTF file provided by a UCSC database, and to facilitate a direct comparison of various software packages operating on transcript or gene level, we only included in the modified dataset transcripts matching uniquely to particular genes.
The treatment (sam file format) dataset was created by first randomly choosing 100 reads from the control (sam) file. Each of those reads was replicated a number of times corresponding to its order: the first read was copied once, the second twice, and so on, until the 100th was copied 100 times.
To retrieve count datasets for DESeq and BaySeq methods, the created control and treatment files (sam file format) have to be further processed. We used the HTSeq software [17] recommended for DESeq, with default settings. To ensure that all methods will be capable of detecting replicated reads, we have been choosing only reads that had NH:i:1 flag in the sam file, which indicates unique mapping to a particular transcript [18].
The final step of creating the datasets was adding different level of noise to the reads, as described in the following section.

Adding Noise to the RNA-Seq Benchmark Dataset
For the benchmark datasets, to reflect real data, we are adding noise to the RNA-seq data by considering separately reads mapping uniquely to gene, g. For each read in this subset, we replicate this read y g times, where y g is taken from a negative binomial distribution, that is, y g ∼ N B(r g , p g ).
The new noisy number of reads for a particular gene, X g , given that this gene had N original reads, is therefore given as: It is well-known that a sum of independent negative binomial random variables is itself negative binomially distributed, with a new r parameter that is equal to the sum of the individual r parameters. Therefore: Given this, along with the properties of expected values and variances of random variables, we have the following: It can be shown that when V ar(X g ) = E(X g ): Thus, given this negative binomial noise generation model, we can determine the mean and then choose r g < r c for underdispersion or r g > r c for overdispersion. Because RNA-Seq data are typically overdispersed (V ar(X g ) > E(X g )) [13,14,19], to ensure it, for each gene, g, we used the following scheme for obtaining proper p g , r g parameters: (1) Generate p g uniformly from distributions U [(0.9, 1)].

Gene Count as a Negative Binomial Random Variable
Most of the existing models for RNA-seq analysis employ some form of a parametric distribution of gene counts, typically a gamma-Poisson or a negative binomial, like, e.g., Bioconductor software [20] suites: edgeR, DSS, DESeq or BaySeq). Below, we briefly explain the main reasons for this particular model.
We begin by exploring what an appropriate description of the gene count data may be, and we focus first on the capture step, where members of the library are chosen for sequencing. Let the total number of molecules in a library be N, the total number of mapped reads, n, and the number of molecules in the library corresponding to the gene, i, be m i . If the capture process is unbiased (we consider capture bias and its variability later), then the probability of capturing a molecule that corresponds to gene i, which we denote as the probability of success for gene i, is p i = m i /N . Because by experimental construction the library size is very large compared to the total number of mapped fragments (N n) and only in very exceptional circumstances is an expressed biological entity represented very few times in the library (m i 1), then for all practical purposes, the probability of success is constant for all capture events. Thus, from the viewpoint of gene i, one can cast this RNA-seq experiment as a series of n trials, where in each trial, there is some probability of success, p i . Now, if we denote by x i the number of reads that map to gene i, given the properties of the RNA-seq experiment as outlined above, the probability of gene i having x i reads follows a binomial distribution [21].
Given the large number of genes in any organism, in general, p i 1. Since the mean and variance of this aforementioned binomial distribution would be, respectively, µ = N p i and σ 2 = N p i (1 − p i ) [21], the Fano factor (σ 2 /µ = (1 − p i )) would be approximately one for most genes, similar to that predicted by a Poisson distribution (mean equals variance). Therefore, in a situation where the same library is sequenced multiple times with the same total number of mapped reads n, a Poisson distribution should be an adequate representation for most gene counts. This was indeed found to be the case for 99.5% of genes from human tissue samples [19].
This binomial model also suggests that as the mean number of counts, or, analogously, p i , increases, then the Fano factor should decrease, leading to so-called "underdispersion" relative to a Poisson model. However, technical replicate data from sequencing runs using the same library show that the opposite is true; as the mean number of counts increases, so too does the Fano factor ( Figure 3) [19]. This is not evident for the majority of genes with relatively low means (Figure 3(A)), but becomes apparent when higher mean genes are considered (Figure 3(B,C)). Thus, even when one considers only capture variability, a so-called "overdispersion" relative to a Poisson model is evident. Such overdispersion is likely due to bias in the capture process that varies from run to run. A negative binomial model can adequately capture such overdispersion [13,14], and it describes the probability of needing a particular, but random total number of trials to get a fixed number of failures, given a constant probability of success in each trial. Thus, for a negative binomial model, the experimentally-fixed total number of mapped reads, N , has no clear physical interpretation. Therefore, the negative binomial distribution may be thought of as giving an empirical description of RNA-seq count data. Figure 3. Noise properties of RNA-seq technical replicate data. In every panel, the mean vs. the Fano factor for individual genes is plotted. Data were taken from [19]. Black circles correspond to individual genes, and the red line corresponds to a linear regression of the mean vs. the Fano factor for the indicated number of genes. Behavior of the (A) first 15,000 genes; (B) first 19,000 genes; or (C) first 19,600 genes as ranked by increasing mean.

An Empirical Bayes Framework to Detect Differential Expression from RNA-Seq Data
One goal of an RNA-seq experiment is to determine what transcripts (or genes) have significantly different expression levels between various biologically-meaningful conditions. In developing a new method for differential expression analysis, we kept in mind the following three key features of current RNA-seq experiments. First, because count data are not always linearly proportional to the original number of transcripts in the biological sample, converting counts to FPKM measurements for the purposes of differential expression testing is warranted when information sharing across genes is desired, and we wish to remove the size effect. Second, the monetary cost of a single RNA-seq run is high, so having a large number of biological replicates is unlikely. However, differential expression analysis relies on a precise estimate of a transcript's variance, which is problematic with few replicates and impossible with no replicates. Therefore, information sharing across transcripts to improve variance estimates is necessary. Third, often, one would like to compare more than two biological conditions, so it is preferable to have the ability to do more than pairwise comparisons. Here, we integrate two previously developed software packages, Tophat-Cufflinks [7,9] and EBArrays [10,11], to create an empirical Bayes framework for differential expression analysis of RNA-seq data. We call this hybrid software R-EBSeq, as it incorporates these three key features mentioned above. Since the EBarrays software employs a Bayesian hierarchical model, it may be used with limited (possibly a single) biological replicates. We note that most other methods of differential expression analysis for microarray data, like limma or SAM [20], rely on in-sample variance estimation (limma's moderated t-test) or permutation testing (SAM) and are, therefore, more vulnerable to the small sample size effects.
Before formulating an empirical Bayes approach to differential expression analysis, we must first consider how to convert the raw RNA-seq data into FPKM. The Tophat-Cufflinks pipeline is well-suited for this task, taking into account all the forms of bias and variability shown in Figure 1 with a rigorous, model-based formulation [7,9]. Cufflinks returns an estimated expression value,Â t , in FPKM for each transcript, t, which is given as:Â t = C tXgγt (15) where C t is a normalizing constant for transcript, t (which includes the total, genome-wide number of mapped fragments and adjusted transcript length),X g is the number of fragments mapping to gene, g (to which transcript t belongs), andγ t is the fraction ofX g attributable to transcript, t (0 ≤γ t ≤ 1).
Cufflinks also returns an estimated confidence interval forÂ t , from which, based on the approximately normal behavior of A t (see [7] and Methods), the variance of A t , which we denote as V At , can be estimated. Thus, by using the Tophat-Cufflinks pipeline, we obtain estimates of the mean and variance of each transcript's expression level (in FPKM). A concern of some researchers is that transforming discrete count data into continuous FPKM data results in a loss of information related to the magnitude of the number of counts. To illustrate this concern, consider the case when transcript a with length 1 kb has two counts in Condition 1 and four counts in Condition 2, whereas transcript b with length 5 kb has 10 counts in Condition 1 and 20 counts in Condition 2. If Condition 1 and Condition 2 have the same number of total, genome-wide mapped reads, then in terms of FPKM, the expression patterns of transcript a and b are identical. However, on the level of counts, the differential expression of transcript b is clearly more significant than that of a. This is consistent with the well-known fact that there is a length bias for detecting differential expression in RNA-seq experiments [22,23], which could possibly interfere with the information sharing across transcripts in the EBArrays model. Unfortunately, when converting counts to FPKM, the only way to retain the read length information is to propagate the variance of X g into the variance of A t , and Cufflink's estimate of this variance,V At , indeed does this [7].
A previously developed software package called EBArrays uses a hierarchical, empirical Bayes model to analyze differential expression analysis in microarray studies [10,11]. This model is described in detail in [10,11] and also in the Methods. Conveniently, this software and model address the last two key features mentioned above: information sharing to improve variance estimates and the ability to do more than pairwise comparisons between arbitrary many expression patterns. Thus, by combining the Tophat-Cufflinks pipeline for generating FPKM along with the EBArrays software, we can obtain a hybrid package that is suitable for analysis of RNA-Seq data given the three key features listed above.
The output of the Tophat-Cufflinks pipeline for every condition are estimates of the mean and variance of A t for each transcript t (Â t andV At ), which specify an associated normal distribution. However, these parameters cannot be used as a direct input to EBArrays, as it expects expression values for various conditions. Therefore, to couple these two pieces of software, for each transcript, we generate M replicates from a normal distribution with mean,Â t , and variance,V At , and, then, use these replicates as the input to EBArrays (see Methods). As indicated already above, we call this combined pipeline R-EBSeq.
To evaluate how R-EBSeq performs, we generated a test data set according to the underlying empirical Bayes model, imposed differential expression on a subset of transcripts from this data set and, then, calculated the performance of R-EBSeq in terms of true positive and false positive identifications (see Methods). Such plots of the false positive rate vs. the true positive rate are called receiver operator characteristic (ROC) curves. An ROC curve along the x = y line implies a very poor algorithm that performs no better than random choice, whereas an ROC curve that peaks high above the x = y line, at low x values, implies a very good algorithm. We investigated how three characteristics of a transcript affect the ROC curves: the difference of means between two conditions (Figure 4(A)), the variance of the transcript expression level (Figure 4(B)),and the number of replicates, M , used as input to the software (Figure 4(C)). In general, we see that R-EBSeq is capable of very good behavior in terms of the ROC curves. As expected, as we increase the difference of means between two conditions and/or decrease the variance, the ability of R-EBSeq to identify truly differentially expressed genes improves. Increasing the number of replicates, M , also improves the performance of R-EBSeq, likely because R-EBSeq is able to get a better estimate of a transcript's variance. Figure 4. Comparison of true discovery rates for various RNA-Seq differential expression testing methods. The test data sets were generated and various software suites implemented, as described in Methods. Each panel contains two plots; the plot on the right is a zoomed-in version of the plot on the left. On the y-axis is the number of correctly-identified transcripts, and on the x-axis is the number of transcripts selected (in order of increasing p-value). DE stands for differentially expressed. In every plot, the thick black line corresponds to Cufflinks, the thick red line to R-EBSeq, the large-dashed black line to DESeq and the small-dashed black line to BaySeq. These lines are also labeled as indicated. The shaded region surrounding the R-EBSeq curve depicts the range of 20 independent runs. (A) Performance of the various methods with noise-free data; (B) Performance of the various methods with noisy data. Noise was added as described in Methods, and the data are overdispersed, as typical for RNA-seq data.

Comparison of the R-EBSeq Approach to Published Differential Expression Software
There are several packages that currently exist for performing differential expression analysis on RNA-seq data. One method that is also based on FPKM is Cuffdiff, which is part of the Cufflinks pipeline [7]. We also want to consider count-based methods; two highly-used ones are DESeq [13] and BaySeq [14]. To determine how R-EBSeq compares to these three established methods, we need a common benchmark data set where the differentially expressed transcripts are known. However, generation of such a data set is not a trivial task, because using Cuffdiff requires altering a data set on the level of sequencer reads, whereas the input to count-based methods simply require a table of counts. To our knowledge, there is no publicly available benchmark dataset for such purposes. We have therefore modified a previously published RNA-seq dataset on S. cerevisiae [16] at the level of reads so as to create a new dataset in which the differentially expressed transcripts are precisely known (Table 1).
Before comparing the four software packages, we had to select operating parameters for R-EBSeq. Although above idealized data (i.e., those not coming from sequencing reads), we found that increasing the number of "replicate" (model parameter reflecting our confidence in Cufflinks variance, not to be confused with technical or biological replicates of RNASeq data) used for R-EBSeq always resulted in better performance; with this real data, there is an optimum number of "replicates" (Figure S1). Decreasing the number of replicates allows for better performance in detecting strongly differentially expressed transcripts (at the beginning/left part of the curves) at the expense of detecting weakly differentially expressed transcripts (at the end/right part of the curves). This optimum number depends on the level of noise (compare Figure S1(A) to S1(B)); and in our specific example, it is found to be ≈1,000 replicates for noise-free data and ≈100 replicates for noisy data. We reasoned that this behavior is likely due to mismatch between the assumed normal model for FPKM values and the actual distribution, which can be skewed particularly when transcript fractions, γ t 's, are close to 1 (truncation effects). Most of the investigated differentially expressed transcripts have γ t = 1 (data not shown). Thus, a potential direction for future work with R-EBSeq is to improve the normal distribution assumption for FPKM data. However, changing the normal assumption may slow down the software significantly, since using the normal distribution allows for a clean analytic solution of the posterior distribution in the EBArray model and, therefore, fast computation. It is not clear whether appropriate changes to the FPKM distribution assumption would also be solvable analytically. In a practical setting, we recommend that one selects the number of replicates based on some self-consistency test, like e.g., the cross-validation Shao [15].
Another item to consider before comparison is filtering; that is, throwing away lowly and highly expressed transcripts that are likely to "confuse" EBArrays software, as the differential expression model could become unstable for very low or very high values of p and limited biological replicates [11]. Although we could not apply such filtering for Cuffdiff (it is done internally), we could apply filters to R-EBSeq, DESeq and BaySeq. We set these filters to remove 1% of the most lowly expressed transcripts and 0.1% of the most highly expressed transcripts and verified that changing these filter levels only served to degrade performance (data not shown). While these levels are arbitrary and may be dataset-dependent, such a thresholding is a necessary part of differential expression analysis, and future research in this direction is warranted. Table 1. Benchmark dataset for software comparison. The 79 artificially modified transcripts in RNA-seq dataset on S. cerevisiae [16] used to generate two artificial datasets (denoted here as Ctrl and Treat) for our software performance comparison study. For each transcript listed in the first column, the specific number of reads listed in the second column is manually added to force differential expression in the Treat group. All the remaining transcripts are left unchanged. Last two columns provide the observed counts as reported by the software. After determining the number of replicates to use for R-EBSeq and the filtering parameters, we proceeded to compare the performance of these four software packages (Figure 4). We first considered the case of noise-free data (Figure 4(A)). Cuffdiff performs the best and is followed closely by R-EBSeq. For the top 100 selected transcripts (right plot), DESeq performs very well for slightly over half of the differentially expressed transcripts, but struggles finding the others. Additional simulations showed that this inability of DESeq is actually due to a problem with the conversion of the read level data to count level data with the software HTSeq, as opposed to that of DESeq itself, in that HTSeq simply failed to convert a subset of transcripts (data not shown). When we eliminated transcripts based on the ability of HTSeq to process them, DESeq performs markedly better (data not shown); however, eliminating transcripts in such a biased manner is undesirable, and we therefore proceeded without this elimination. Importantly, we also tested these software packages with noisy data, ensuring that such noise is overdispersed, as is typical for RNA-seq data (see Methods). Although the noise clearly had an effect on the total number of transcripts that could be identified within the top 1,000, this noise did not change the relative performance of the four methods. Thus, we conclude that Cuffdiff offers the best performance, and R-EBSeq is a very close second. The count-based methods, DESeq and BaySeq, did not perform as well as the FPKM-based methods.

Application of R-EBSeq to Experimental Data-Illustration of Multiple Conditions Comparison
A convenient feature of R-EBSeq that was inherited from EBArrays is the ability to easily specify patterns for multiple condition comparison. We illustrate this feature here using RNA-seq datasets for a DNA damage response with or without tp53, a gene that is lost in over 50% of human cancers and is a central player in the DNA damage response (reviewed in [24]). RKO colon carcinoma cells, which are wild-type for p53 protein (WT), or RKO cells where both copies of tp53 have been knocked-out (KO) were treated with etoposide, a DNA-damaging agent, or DMSO (control) and then assayed for gene expression via RNA-seq [25]. One biological question of interest is to identify which genes change as a function of etoposide without regard to p53-status and which genes are p53-dependent without regard to etoposide-status. It is difficult to identify such genes with traditional pairwise differential expression analysis, particularly because etoposide is used to upregulate p53 expression [26][27][28][29].
To determine potential p53-and etoposide-specific genes, we first specified "patterns". Pattern specification is the aspect of R-EBSeq (and formerly EBArrays) where one defines how many unique population means we expect and which sample we would expect which population mean to correspond to. In our case, we have four samples and three different conditions we are interested in: p53-specific, etoposide-specific and non-specific (null). To specify these conditions, we consider two different population means, 1 and 2, which are distributed as "patterns" among the conditions according to Table 2. For the pattern etoposide, we require that a gene have a different mean in all etoposide-treated samples. For the pattern, p53, we require that a gene only have a different mean in the WT cells (upon etoposide treatment, which increases p53 levels that are otherwise negligible). For the pattern null, we require that genes have the same mean in all populations. Running R-EBSeq with this setup results in each gene having a posterior probability for each pattern. Thus, the higher the posterior probability, the more likely a gene belongs to a particular pattern.
To analyze the results, we first identified genes having greater than 0.99 posterior probability of belonging to either the etoposide or p53 pattern (Supplemental Table S1). We then compared the high posterior probability gene lists to those in the MSigDB to find those which exhibited significant overlap [30]. The top two hits for the etoposide pattern are PEREZ TP53 TARGETS and PUJANA ATM PCC NETWORK. Although it may seem illogical that a list of p53 targets have significant overlap with a list of etoposide-specific genes, this is actually not that surprising and, in fact, expected. There are many proteins that have functional redundancy with p53, so if p53 is not there, it is likely other proteins serve to provide a compensatory effect in the presence of etoposide. Because many p53 targets have been defined as those which respond to DNA damage and many of these targets are likely regulated by compensatory actions, one would expect a large overlap. For the second hit, ATM is a well known DNA-damage related protein, and therefore, its presence is understandably etoposide-related. Some top hits for the p53 pattern are UV response sets (DACOSTA UV RESPONSE VIA ERCC3 UP and ENK UV RESPONSE KERATINOCYTE UP) and stem cell related sets (NUYTTEN EZH2 TARGETS DN and BENPORATH NANOG TARGETS). The presence of the UV sets imply that there may be some distinct pathways that are utilized in response to etoposide-mediated DNA damage vs. UV-mediated DNA damage and that p53 allows response to both types. The presence of the stem cell sets are in line with current thoughts that p53 plays a large role in stem cell function (reviewed in [31]). Thus, by performing this multiple condition comparison with an appropriately chosen pattern, we were able to find a p53 function related to stem-cells that did not depend on DNA damage. Table 2. Patterns for multiple conditions comparison. Entries correspond to the population mean which sample is assumed to come from, given the pattern indicated. Pattern names are indicated in the row, whereas samples are indicated in the columns. WT: wild-type; KO: knock-out; D: DMSO (control); E: etoposide.

Conclusions
Here we have investigated several aspects of RNA-seq differential expression analysis. In the first part of the paper, we have suggested the empirical Bayes method for RNA-seq data analysis based on FPKM measurements (called R-EBSeq), which combines two previously developed and powerful software suits: Cufflinks and EBarrays. R-EBSeq has two important features. First, it shares information across genes (or transcripts) to get better variances estimates when there are few or no replicates. The use of FPKM makes the sharing straightforward and, unlike in some other empirical Bayes RNA-seq software, does not require additional pre-processing. Second, it is capable of doing multiple conditions comparisons in an easy-to-implement manner, as we demonstrate with experimental data. In the second part of the paper, we created the benchmark dataset for RNA-seq differential expression analysis and used it to compare R-EBSeq to Cuffdiff, DESeq and BaySeq software suits developed for RNA-seq analysis. We find that FPKM-based Cuffdiff gives the best performance, followed closely by R-EBSeq. DESeq and BaySeq, which are count-based methods and do not perform as well. Overall, R-EBSeq seems to offer reasonable performance with the flexibility of multiple comparisons and rigorous treatment of information sharing when there are few replicates.