Article

Copula-Based Bayesian Model for Detecting Differential Gene Expression

by Prasansha Liyanaarachchi 1 and N. Rao Chaganty 2,*
1 Department of Statistics, University of Sri Jayewardenepura, Gangodawila, Nugegoda 10250, Sri Lanka
2 Department of Mathematics and Statistics, Old Dominion University, Norfolk, VA 23529-0077, USA
* Author to whom correspondence should be addressed.
Analytics 2025, 4(2), 11; https://doi.org/10.3390/analytics4020011
Submission received: 22 January 2025 / Revised: 9 March 2025 / Accepted: 26 March 2025 / Published: 3 April 2025

Abstract

Deoxyribonucleic acid, more commonly known as DNA, is a fundamental genetic material in all living organisms, containing thousands of genes, but only a subset exhibit differential expression and play a crucial role in diseases. Microarray technology has revolutionized the study of gene expression, with two primary types available for expression analysis: spotted cDNA arrays and oligonucleotide arrays. This research focuses on the statistical analysis of data from spotted cDNA microarrays. Numerous models have been developed to identify differentially expressed genes based on the red and green fluorescence intensities measured using these arrays. We propose a novel approach using a Gaussian copula model to characterize the joint distribution of red and green intensities, effectively capturing their dependence structure. Given the right-skewed nature of the intensity distributions, we model the marginal distributions using gamma distributions. Differentially expressed genes are identified using the Bayes estimate under our proposed copula framework. To evaluate the performance of our model, we conduct simulation studies to assess parameter estimation accuracy. Our results demonstrate that the proposed approach outperforms existing methods reported in the literature. Finally, we apply our model to Escherichia coli microarray data, illustrating its practical utility in gene expression analysis.

1. Introduction

Microarray technology is a cutting-edge approach to gene expression analysis, widely used in fields such as basic biology and medical science. Over the past two decades, the number of gene expression studies employing microarrays has grown rapidly. Two primary microarray technologies are available for expression analysis: spotted cDNA arrays and oligonucleotide arrays. While our research primarily focuses on the statistical analysis of data from spotted cDNA microarrays, the methods discussed here can be adapted for data generated via Affymetrix chips. In microarray experiments, RNA is first extracted from the subject cells. Some of its molecules are then replaced with counterparts containing a fluorescent dye, producing labeled transcripts known as targets. In cDNA microarrays, both targets and probes are cDNA molecules, whereas, in oligonucleotide arrays, the targets remain cDNA molecules, but the probes consist of carefully selected small cDNA segments known as oligos.
A cDNA microarray is commonly referred to as a two-channel array. In this technology, both an experimental sample and a reference sample are labeled with fluorescent dyes—Cyanine 3 (Cy3, green) and Cyanine 5 (Cy5, red)—on a single chip, with the experimental sample typically labeled with Cy5 [1]. Each chip contains thousands of spots, each corresponding to a specific gene. A laser microscope scanner measures the fluorescence intensity at each spot, providing a quantitative assessment of gene expression. Colored spots indicate genes expressed in one or both samples, while gray areas represent genes not expressed in either sample. Although microarray data capture thousands of genes, only a subset of differentially expressed genes (DEGs) may be biologically significant, particularly in diseases such as cancer. Therefore, identifying these DEGs is a crucial step in microarray analysis. Numerous methods have been proposed in the literature to detect differentially expressed genes.
The fold-change method was one of the earliest techniques used in microarray data analysis [2,3,4]. This simple approach identifies differentially expressed genes (DEGs) by applying a predefined threshold on fold change. However, it is prone to bias, particularly when the data are not properly normalized [5]. Moreover, since it does not account for statistical variation, the method is considered unreliable. To address these limitations, later approaches incorporated probabilistic modeling of the red ($R$) and green ($G$) intensities for DEG detection. These methods established classification rules based on distributional assumptions about $(R, G)$. For example, ref. [6] proposed a data-driven threshold selection for the ratio $R/G$, assuming normality and a constant coefficient of variation. However, a key drawback of this approach is that it overlooks the valuable information contained in the product $RG$.
To avoid this problem, ref. [7] suggested a hierarchical Gamma–Gamma–Bernoulli model that captures DEGs based on the posterior odds of change, where the odds are functions of $R + G$ and $RG$. This method assumes that $R$ and $G$ are independent. Mav and Chaganty [8] showed that $R$ and $G$ are positively correlated and built a bivariate distribution with gamma marginals and a positive correlation between $R$ and $G$ to incorporate this dependence. They also included a latent Bernoulli variable and used the EM algorithm to calculate the posterior probabilities of differential expression. The DEGs are the genes with the higher posterior probabilities.
A Bayesian approach to testing multiple hypotheses in microarray experiments was proposed by Maria et al. (2020) [9], pointing to the use of copula functions to model dependence. Their method demonstrated that incorporating dependency structures significantly improves classification accuracy when identifying differentially expressed genes. The term “copula” was first used by Sklar in 1959 [10]; its meaning is that it “ties” the marginal uniform distributions to create a joint distribution function. An excellent introduction to copulas is the book by Nelsen [11]. Copula functions are useful for constructing bivariate or, in general, multivariate distributions with given marginal distributions. Since their introduction, the literature and applications of copulas have grown rapidly. Some classic books on the topic include [11,12]. More comprehensive coverage of copula models and their applications is given in [13].
Over the past decade, extensive research has explored the application of copulas in genomics, particularly for analyzing gene expression data. For instance, Chaba (2017) [14] proposed a semi-parametric copula-based approach to differential gene expression analysis. Likewise, Ray et al. (2020) [15] introduced a copula-based model that focuses on detecting differential co-expression, rather than differential expression, emphasizing the significance of dependency structures in gene networks.
Although copulas are increasingly used in gene expression studies, existing approaches have not been specifically developed to detect differentially expressed genes in microarray experiments. To address this gap, we introduce a copula-based framework designed specifically for DEG detection, using copulas to model dependencies between gene intensities while offering a probabilistic basis for differential expression analysis. In this article, we extend the work of Mav and Chaganty [8] by replacing their joint probability distribution of red and green intensities with a Gaussian copula-based joint distribution. The DEGs can be identified by calculating the Bayes estimate of the differential expression under this model.
The outline of the article is as follows. Section 2 describes the challenges in analyzing microarray data and the need for sophisticated statistical models to accurately capture the dependence structure between gene expressions. The theoretical foundation of the copula model, including the formulation of the model and the specification of prior distributions for the parameters, is explained in Section 3.
In Section 4, we present the methodology for estimating the parameters of the copula model using maximum likelihood estimation and Bayesian inferential procedures, including the steps involved in maximizing the log-likelihood function and the quasi-Newton optimization routine employed. Section 5 focuses on identifying differentially expressed genes using the estimated copula model parameters and explains the criteria used. In Section 6, we explore the relationship between the copula parameter and the correlation coefficient, providing insights into how the copula model captures dependencies between gene expression levels.
Section 7 presents the results of simulation studies conducted to evaluate the performance of the copula-based Bayesian model in identifying differentially expressed genes and comparing it with other existing methods. Section 8 contains the application of the copula-based Bayesian model to real datasets of gene expression levels in Escherichia coli, demonstrating the practical utility of the model. Finally, by comparing the log-likelihood values, we show that this Gaussian copula model improves over the models given in [7,8].

2. Motivation

Escherichia coli (E. coli) is a bacterium that generally lives in the intestines of people and animals. The motivating data for this article come from an experiment designed to study gene expression levels in E. coli, described in [16]. The E. coli genome consists of approximately 4.6 million base pairs (Mbp), but it is thought to encode only about 4200 genes. To study differential gene expression in E. coli, ref. [16] used two traditional treatments that affect gene expression levels. The first treatment was induction with isopropyl-β-D-thiogalactopyranoside (IPTG), which serves as a test of the methods since only a few gene transcripts are expected to change. The second, heat shock treatment, allows global regulatory effects to be observed. A single colony of E. coli K-12 was divided into five samples for the experiments.
IPTG treatment was performed independently on two samples (IPTG-A and IPTG-B), while one sample (control) was untreated. Heat shock induction was carried out on the remaining two samples (Heat Shock A and Heat Shock B) by placing the culture in a 50 °C shaking water bath for seven minutes. Following the hybridization of the samples on E. coli microarrays, the signal intensities for each spot were determined using the ScanAlyze software (version 1.0.3). The average fluorescence intensity for each spot was measured, and the background was chosen as the median pixel intensity in a square surrounding each spot. The red and green signal intensities were recalculated and normalized after background subtraction.
Newton et al. [7] proposed a Bayesian hierarchical model with a latent variable to identify differentially expressed genes. In their approach, the marginal distributions of red and green intensities were modeled as gamma distributions with common shape parameters but different scale parameters. However, they assumed that the red and green intensities for the same gene were independent. To address this limitation, Mav and Chaganty [8] introduced a bivariate distribution with gamma marginals that incorporated a positive correlation between the red and green intensities. In this article, we extend their approach by replacing the bivariate distribution with a bivariate Gaussian copula, constructing a joint distribution with gamma marginals. The performance of our proposed model is evaluated through log-likelihood analysis using E. coli data.

3. Copula-Based Bayesian Model for Expression Level

The typical objective when analyzing data from microarray experiments is to identify differentially expressed genes. This section proposes a copula-based Bayesian model that can filter these genes. Consider a microarray consisting of $n$ genes. Let $R_{1j}$ and $R_{2j}$ denote the red and green intensities of gene $j$, respectively. Concepts based on the ratio of red and green intensities have been widely used in the literature to identify differentially expressed genes. Some of those were discussed briefly in Section 1. To filter the differentially expressed genes, we will use the ratio of expected expression levels, given by $\eta_j = E(R_{1j})/E(R_{2j})$, where $E$ stands for the expected value.
As explained in Section 2, this study was motivated by E. coli data. The gamma distribution has been widely used in the statistical analysis of microarray experiments [17,18,19], not only for its analytical convenience but also for its interpretability. The family of gamma distributions is supported on the positive line and provides a rich class of distributions as special cases [17]. In addition, Chen et al. (1997) [6] presented an intriguing argument that, while expected expression levels vary from gene to gene in the microarray, the measurements are connected via a constant coefficient of variation ($CV$), and for a gamma distribution $CV = 1/\sqrt{\alpha}$. Having a common shape parameter, $\alpha$, ensures that the variability structure remains consistent across genes and makes it convenient to model the variability in gene expression. Taking all of the above into account, we assume that $R_{1j}$ and $R_{2j}$ follow gamma distributions with a common shape parameter, $\alpha$, but different scale parameters, $1/\theta_{1j}$ and $1/\theta_{2j}$, for $j = 1, 2, \ldots, n$. The probability density function of $R_{ij}$ is given by
$$f_i(r_{ij}; \theta_{ij}, \alpha) = \frac{1}{\Gamma(\alpha)}\, \theta_{ij}^{\alpha}\, r_{ij}^{\alpha-1} \exp(-\theta_{ij} r_{ij}), \quad i = 1, 2;\; j = 1, \ldots, n. \tag{1}$$
Figure 1 and Figure 2 show the histograms of red and green intensities along with nonparametric kernel density plots for the five microarray experiments. The positively skewed shape of the density curves suggests that the assumption of gamma marginals is reasonable. Note that $E(R_{ij}) = \alpha/\theta_{ij}$ for $i = 1, 2$. Therefore, the ratio of expected expression levels is $\eta_j = E(R_{1j})/E(R_{2j}) = \theta_{2j}/\theta_{1j}$ for $j = 1, \ldots, n$. To model the dependence between the two intensities, we assume the joint density of $(R_{1j}, R_{2j})$ is given by the bivariate Gaussian copula and can be written as
$$f(r_{1j}, r_{2j}; \theta_{1j}, \theta_{2j}, \alpha, \gamma) = c(u_{1j}, u_{2j}; \gamma)\, f_1(r_{1j})\, f_2(r_{2j}), \tag{2}$$
where $u_{ij} = F_i(r_{ij})$ and $F_i(\cdot)$ is the cumulative distribution function of a gamma distribution with parameters $(\alpha, 1/\theta_{ij})$. Note that the Gaussian copula density is
$$c(u_{1j}, u_{2j}; \gamma) = \frac{1}{\sqrt{1-\gamma^2}} \exp\left\{ -\frac{1}{2}\, \frac{\gamma^2 (z_{1j}^2 + z_{2j}^2) - 2\gamma z_{1j} z_{2j}}{1-\gamma^2} \right\}, \tag{3}$$
where $z_{ij} = \Phi^{-1}(u_{ij})$, $\Phi$ is the standard normal cumulative distribution function, and $\gamma$ is the parameter of the copula density. To simplify the notation moving forward, we write $c(u_{1j}, u_{2j})$ in place of $c(u_{1j}, u_{2j}; \gamma)$.
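The copula density in (3) is simple to evaluate directly. The following is a minimal Python sketch, not the authors' R code; the function name `gaussian_copula_density` is made up for illustration, and it relies only on `statistics.NormalDist` from the standard library for $\Phi^{-1}$.

```python
import math
from statistics import NormalDist


def gaussian_copula_density(u1, u2, gamma):
    """Bivariate Gaussian copula density c(u1, u2; gamma), as in Equation (3).

    Requires 0 < u1, u2 < 1 and -1 < gamma < 1.
    """
    std_normal = NormalDist()
    z1 = std_normal.inv_cdf(u1)  # z = Phi^{-1}(u)
    z2 = std_normal.inv_cdf(u2)
    one_minus_g2 = 1.0 - gamma * gamma
    exponent = -0.5 * (gamma * gamma * (z1 * z1 + z2 * z2)
                       - 2.0 * gamma * z1 * z2) / one_minus_g2
    return math.exp(exponent) / math.sqrt(one_minus_g2)


# With gamma = 0 the copula is the independence copula, so the density is 1.
print(gaussian_copula_density(0.3, 0.8, 0.0))
# With positive gamma, concordant pairs get density above 1, discordant pairs below 1.
print(gaussian_copula_density(0.9, 0.9, 0.6), gaussian_copula_density(0.9, 0.1, 0.6))
```

Multiplying this value by the two gamma densities in (1) gives the joint density in (2).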
The joint distribution in (2) consists of $2n + 2$ unknown parameters. Since there are too many unknown parameters, we adopt the empirical Bayes approach to make the model parsimonious. This requires specification of prior distributions for the gene-specific parameters $\theta_{1j}$ and $\theta_{2j}$. We assume independent gamma distributions with parameters $\alpha_0$ and $1/\nu$ as the prior distributions for $\theta_{ij}$.
The prior density $\pi(\theta_{ij})$ is
$$\pi(\theta_{ij}; \nu, \alpha_0) = \frac{1}{\Gamma(\alpha_0)}\, \nu^{\alpha_0}\, \theta_{ij}^{\alpha_0-1} \exp(-\nu \theta_{ij}), \quad i = 1, 2;\; j = 1, \ldots, n. \tag{4}$$
Multiplying (2) and (4), we get the joint density of $(R_{1j}, R_{2j})$ and $(\theta_{1j}, \theta_{2j})$ as
$$f(r_{1j}, r_{2j}, \theta_{1j}, \theta_{2j}; \Upsilon) = \left\{ \frac{\nu^{\alpha_0}}{\Gamma(\alpha)\Gamma(\alpha_0)} \right\}^{2} c(u_{1j}, u_{2j}) \prod_{i=1}^{2} r_{ij}^{\alpha-1}\, \theta_{ij}^{\alpha+\alpha_0-1} \exp(-\theta_{ij}(r_{ij}+\nu)), \tag{5}$$
where $\Upsilon = (\alpha, \alpha_0, \nu, \gamma)$ is the vector of model parameters. Recall that this model has gene-specific parameters, $(\theta_{1j}, \theta_{2j})$, for $j = 1, \ldots, n$. The marginal density of $R_j = (R_{1j}, R_{2j})$ is
$$\begin{aligned}
f_m(r_{1j}, r_{2j}; \Upsilon) &= \int_0^\infty \int_0^\infty f(r_{1j}, r_{2j}, \theta_{1j}, \theta_{2j}; \Upsilon)\, d\theta_{1j}\, d\theta_{2j} \\
&= \left\{ \frac{\nu^{\alpha_0}}{\Gamma(\alpha)\Gamma(\alpha_0)} \right\}^{2} \int_0^\infty \int_0^\infty c(F_1(r_{1j}), F_2(r_{2j})) \prod_{i=1}^{2} r_{ij}^{\alpha-1}\, \theta_{ij}^{\alpha+\alpha_0-1} \exp(-\theta_{ij}(r_{ij}+\nu))\, d\theta_{1j}\, d\theta_{2j}.
\end{aligned} \tag{6}$$
Here, $F_i(r_{ij})$ is the cumulative distribution function of the gamma distribution with parameters $\alpha$ and $1/\theta_{ij}$ for $i = 1, 2$ and $j = 1, 2, \ldots, n$.
The double integral in Equation (6) does not simplify because of the presence of the Gaussian copula density $c(u_{1j}, u_{2j})$ in the integrand, and computing it numerically is also challenging. In principle, R packages such as cubature [20] or pracma [21] could be used. However, we were unsuccessful with these packages: their functions produced numerous errors when the double integral was evaluated iteratively. To overcome these computational problems, we developed our own R code to evaluate the double integral and obtain the marginal density of $R_j = (R_{1j}, R_{2j})$.
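One case in which the double integral does have a closed form may serve as a sanity check for any numerical routine. This is a sketch under the assumption $\gamma = 0$, a special case that the article does not analyze separately: the copula density is then identically 1, the two $\theta$ integrals separate, and each one is a gamma integral.

```latex
\begin{aligned}
f_m(r_{1j}, r_{2j}; \Upsilon)\big|_{\gamma = 0}
&= \prod_{i=1}^{2} \frac{\nu^{\alpha_0}}{\Gamma(\alpha)\Gamma(\alpha_0)}\, r_{ij}^{\alpha-1}
   \int_0^\infty \theta_{ij}^{\alpha+\alpha_0-1} \exp(-\theta_{ij}(r_{ij}+\nu))\, d\theta_{ij} \\
&= \prod_{i=1}^{2} \frac{\Gamma(\alpha+\alpha_0)}{\Gamma(\alpha)\Gamma(\alpha_0)}\,
   \frac{\nu^{\alpha_0}\, r_{ij}^{\alpha-1}}{(r_{ij}+\nu)^{\alpha+\alpha_0}} .
\end{aligned}
```

A numerical implementation of (6) evaluated at $\gamma = 0$ should reproduce this product up to integration error.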

4. Parameter Estimation Procedure

The marginal bivariate density of red and green intensities given in (6) has four unknown parameters given by the vector $\Upsilon = (\alpha, \alpha_0, \nu, \gamma)$. Maximum likelihood is the most efficient method for estimating these parameters. This method entails maximizing the likelihood or the log-likelihood, which is the logarithm of the likelihood function. For $n$ genes, the log-likelihood is given by
$$\begin{aligned}
l(\Upsilon) &= \sum_{j=1}^{n} \log f_m(r_{1j}, r_{2j}; \Upsilon) \\
&= \sum_{j=1}^{n} \log \left[ \left\{ \frac{\nu^{\alpha_0}}{\Gamma(\alpha)\Gamma(\alpha_0)} \right\}^{2} \int_0^\infty \int_0^\infty c(F_1(r_{1j}), F_2(r_{2j})) \prod_{i=1}^{2} r_{ij}^{\alpha-1}\, \theta_{ij}^{\alpha+\alpha_0-1} \exp(-\theta_{ij}(r_{ij}+\nu))\, d\theta_{1j}\, d\theta_{2j} \right] \\
&= 2n \left\{ \alpha_0 \log(\nu) - \log(\Gamma(\alpha)\Gamma(\alpha_0)) \right\} + \sum_{j=1}^{n} \log \left[ \int_0^\infty \int_0^\infty c(F_1(r_{1j}), F_2(r_{2j})) \prod_{i=1}^{2} r_{ij}^{\alpha-1}\, \theta_{ij}^{\alpha+\alpha_0-1} \exp(-\theta_{ij}(r_{ij}+\nu))\, d\theta_{1j}\, d\theta_{2j} \right].
\end{aligned} \tag{7}$$
Maximizing (7) will yield the maximum likelihood estimate of the unknown parameter vector $\Upsilon$.

Estimation

A numerical optimization routine is required to obtain the maximum likelihood estimator of $\Upsilon = (\alpha, \alpha_0, \nu, \gamma)$ since the log-likelihood (7) is highly nonlinear. The quasi-Newton (or variable metric) algorithm given in [22] is an ideal choice for this situation. The algorithm can be described as follows:
  • Start with an initial estimate $\hat{\Upsilon}_{int}$ of $\Upsilon$.
  • At the $i$th step, compute $\hat{\Upsilon}_{i+1} = \hat{\Upsilon}_i - c\, B(\hat{\Upsilon}_i)\, g(\hat{\Upsilon}_i)$, where $g(\Upsilon) = \partial l(\Upsilon)/\partial \Upsilon$, $B(\Upsilon)$ is an approximation to the inverse of the Hessian matrix, $[\partial^2 l(\Upsilon)/\partial \Upsilon_j \partial \Upsilon_k]^{-1}$, and $c$ is a constant.
  • Repeat the previous step until $\hat{\Upsilon}_{i+1} \approx \hat{\Upsilon}_i$, and take $\hat{\Upsilon} = \hat{\Upsilon}_{i+1}$ as the MLE of $\Upsilon$.
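The iteration above can be sketched in a few lines of Python. This is an illustrative implementation only, not the routine used in the article: it applies the BFGS update with a finite-difference gradient to a toy normal log-likelihood, because the copula log-likelihood (7) needs a double integral per gene. The data, the function names `num_grad` and `bfgs_maximize`, and the backtracking rule for the step constant $c$ are all assumptions made for the example.

```python
import math


def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        up[k] += h
        down = list(x)
        down[k] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def bfgs_maximize(loglik, start, tol=1e-6, max_iter=200):
    """Maximize loglik by running BFGS on f = -loglik."""
    f = lambda x: -loglik(x)
    n = len(start)
    x = list(start)
    # B approximates the inverse Hessian of f; start from the identity.
    B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = num_grad(f, x)
    for _ in range(max_iter):
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        d = [-sum(B[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(gi * di for gi, di in zip(g, d))
        c, fx = 1.0, f(x)
        # Halve the step constant c until f decreases enough (Armijo rule).
        while c > 1e-12 and f([xi + c * di for xi, di in zip(x, d)]) > fx + 1e-4 * c * slope:
            c *= 0.5
        x_new = [xi + c * di for xi, di in zip(x, d)]
        g_new = num_grad(f, x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # update B only when the curvature condition holds
            By = [sum(B[i][j] * y[j] for j in range(n)) for i in range(n)]
            yBy = sum(yi * Byi for yi, Byi in zip(y, By))
            B = [[B[i][j] + (1.0 + yBy / sy) * s[i] * s[j] / sy
                  - (By[i] * s[j] + s[i] * By[j]) / sy
                  for j in range(n)] for i in range(n)]
        x, g = x_new, g_new
    return x


data = [1.2, 0.7, 2.9, 1.8, 2.4]  # made-up observations


def normal_loglik(params):
    """Normal log-likelihood in (mu, log sigma), so both parameters are unconstrained."""
    mu, log_sigma = params
    sigma2 = math.exp(2.0 * log_sigma)
    return sum(-0.5 * math.log(2.0 * math.pi) - log_sigma
               - (x - mu) ** 2 / (2.0 * sigma2) for x in data)


mu_hat, log_sigma_hat = bfgs_maximize(normal_loglik, [0.0, 0.0])
print(mu_hat, math.exp(log_sigma_hat))
```

For this toy problem the maximizer is known in closed form (the sample mean and the root mean squared deviation), which makes the routine easy to check before pointing it at a harder objective.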
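The function optim in the R package stats provides algorithms for general-purpose optimization. We used the quasi-Newton method "BFGS", which was proposed independently by [23,24,25,26]. The gradient is approximated by finite differences. The Hessian matrix is the square matrix of second-order partial derivatives given by
$$\frac{\partial^2 l(\Upsilon)}{\partial \Upsilon\, \partial \Upsilon'} =
\begin{pmatrix}
\dfrac{\partial^2 l(\Upsilon)}{\partial \alpha^2} & \dfrac{\partial^2 l(\Upsilon)}{\partial \alpha\, \partial \alpha_0} & \dfrac{\partial^2 l(\Upsilon)}{\partial \alpha\, \partial \nu} & \dfrac{\partial^2 l(\Upsilon)}{\partial \alpha\, \partial \gamma} \\
\dfrac{\partial^2 l(\Upsilon)}{\partial \alpha_0\, \partial \alpha} & \dfrac{\partial^2 l(\Upsilon)}{\partial \alpha_0^2} & \dfrac{\partial^2 l(\Upsilon)}{\partial \alpha_0\, \partial \nu} & \dfrac{\partial^2 l(\Upsilon)}{\partial \alpha_0\, \partial \gamma} \\
\dfrac{\partial^2 l(\Upsilon)}{\partial \nu\, \partial \alpha} & \dfrac{\partial^2 l(\Upsilon)}{\partial \nu\, \partial \alpha_0} & \dfrac{\partial^2 l(\Upsilon)}{\partial \nu^2} & \dfrac{\partial^2 l(\Upsilon)}{\partial \nu\, \partial \gamma} \\
\dfrac{\partial^2 l(\Upsilon)}{\partial \gamma\, \partial \alpha} & \dfrac{\partial^2 l(\Upsilon)}{\partial \gamma\, \partial \alpha_0} & \dfrac{\partial^2 l(\Upsilon)}{\partial \gamma\, \partial \nu} & \dfrac{\partial^2 l(\Upsilon)}{\partial \gamma^2}
\end{pmatrix}.$$
This matrix can be calculated numerically at the maximum of the log-likelihood function using the method "Richardson" of the function hessian in the R package numDeriv by Gilbert and Varadhan (2019) [27]. The square roots of the diagonal elements of the inverse of the negative Hessian give the standard errors of the maximum likelihood estimates.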
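The standard-error calculation can be illustrated on a two-parameter toy model. The sketch below is an assumption-laden example rather than the article's code: it uses a normal log-likelihood with made-up data, a plain central-difference Hessian (not the Richardson extrapolation used by numDeriv), and a hand-written 2×2 inverse. The helper name `numerical_hessian` is hypothetical.

```python
import math

data = [1.2, 0.7, 2.9, 1.8, 2.4]  # made-up observations


def loglik(params):
    """Normal log-likelihood in (mu, sigma)."""
    mu, sigma = params
    return sum(-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
               - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in data)


def numerical_hessian(f, point, h=1e-4):
    """Central-difference approximation to the matrix of second partial derivatives."""
    n = len(point)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                p = list(point)
                p[i] += si * h
                p[j] += sj * h
                return f(p)
            hess[i][j] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return hess


# For the normal model the MLE is available in closed form.
n_obs = len(data)
mu_hat = sum(data) / n_obs
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n_obs)

hess = numerical_hessian(loglik, [mu_hat, sigma_hat])
info = [[-hess[i][j] for j in range(2)] for i in range(2)]  # negative Hessian
det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
cov = [[info[1][1] / det, -info[0][1] / det],
       [-info[1][0] / det, info[0][0] / det]]  # inverse of the 2x2 matrix
std_errors = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
print(std_errors)
```

In this toy model the standard errors have the known values $\hat{\sigma}/\sqrt{n}$ and $\hat{\sigma}/\sqrt{2n}$, so the numerical result can be compared against them.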

5. Differentially Expressed Genes

Our ultimate goal in modeling is to identify the differentially expressed genes in the cDNA microarray. Recall that we are interested in estimating $\eta_j = E(R_{1j})/E(R_{2j}) = \theta_{2j}/\theta_{1j}$ for $j = 1, \ldots, n$. Consider the transformation $\mu_j = \theta_{1j}$ and $\eta_j = \theta_{2j}/\theta_{1j}$. The inverse transformation is $\theta_{1j} = \mu_j$ and $\theta_{2j} = \eta_j \mu_j$, and the Jacobian is given by
$$J = \begin{vmatrix} \mu_j & \eta_j \\ 0 & 1 \end{vmatrix} = \mu_j.$$
Using the transformation theorem, the joint density of $R_j = (R_{1j}, R_{2j})$, $\eta_j$ and $\mu_j$ can be expressed as
$$g(r_{1j}, r_{2j}, \eta_j, \mu_j; \Upsilon) = f(r_{1j}, r_{2j}, \mu_j, \eta_j \mu_j; \Upsilon)\, \mu_j, \quad \eta_j, \mu_j > 0.$$
The posterior density of $\eta_j$ and $\mu_j$, given $R_j$, is
$$g(\eta_j, \mu_j \mid r_{1j}, r_{2j}; \Upsilon) = \frac{f(r_{1j}, r_{2j}, \mu_j, \eta_j \mu_j; \Upsilon)\, \mu_j}{f_m(r_{1j}, r_{2j}; \Upsilon)}, \quad \eta_j, \mu_j > 0.$$
The Bayes estimate of the differential expression of the $j$th gene is
$$E(\eta_j \mid r_{1j}, r_{2j}; \Upsilon) = \int_0^\infty \int_0^\infty \eta_j\, \frac{f(r_{1j}, r_{2j}, \mu_j, \eta_j \mu_j; \Upsilon)\, \mu_j}{f_m(r_{1j}, r_{2j}; \Upsilon)}\, d\eta_j\, d\mu_j, \tag{8}$$
which we have calculated numerically. Let $\hat{\eta}_j = E(\eta_j \mid r_{1j}, r_{2j}; \hat{\Upsilon})$, where $\hat{\Upsilon}$ is the maximum likelihood estimate of $\Upsilon$. We say the $j$th gene is up-regulated if $\hat{\eta}_j$ is greater than a specified upper cut-off and down-regulated if it is smaller than a specified lower cut-off; Section 8 uses the values 2 and 0.5.
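For intuition about the size of the Bayes estimate, it can be written in closed form in one special case. The following is a sketch under the assumption $\gamma = 0$ (no dependence), which is not the model fitted in the article: the copula density equals 1, so by (5) the posterior of each $\theta_{ij}$ is a gamma distribution with shape $\alpha + \alpha_0$ and rate $r_{ij} + \nu$, independently for $i = 1, 2$.

```latex
\begin{aligned}
E(\eta_j \mid r_{1j}, r_{2j}; \Upsilon)\big|_{\gamma = 0}
&= E(\theta_{2j} \mid r_{2j})\; E\!\left(\theta_{1j}^{-1} \mid r_{1j}\right) \\
&= \frac{\alpha + \alpha_0}{r_{2j} + \nu} \cdot \frac{r_{1j} + \nu}{\alpha + \alpha_0 - 1}
 = \frac{\alpha + \alpha_0}{\alpha + \alpha_0 - 1} \cdot \frac{r_{1j} + \nu}{r_{2j} + \nu},
 \qquad \alpha + \alpha_0 > 1 .
\end{aligned}
```

In this special case the estimate is a shrunken version of the raw ratio $r_{1j}/r_{2j}$, with $\nu$ pulling both intensities toward a common value; a numerical evaluation of (8) at $\gamma = 0$ should agree with it.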

6. Relation Between Copula Parameter and the Correlation Coefficient

The correlation coefficient of two random variables measures the magnitude and the direction of the linear relationship between them. However, it fails to capture nonlinear dependence, which the copula function captures, specifically the dependence in the tail region for non-normal variables. In this section, we derive the relationship between the linear correlation coefficient $\rho$ between $R_1$ and $R_2$ and the Gaussian copula parameter $\gamma$.
Case 1. Suppose that $R_i$ is distributed as gamma$(\alpha_i, 1/\theta_i)$ for $i = 1, 2$, and the joint distribution is given by the bivariate Gaussian copula with parameter $\gamma$. Note that the marginal mean and variance of $R_i$ are $\alpha_i/\theta_i$ and $\alpha_i/\theta_i^2$, respectively. The joint probability density function of $(R_1, R_2)$ is given by
$$\begin{aligned}
f(r_1, r_2; \theta_1, \theta_2, \alpha_1, \alpha_2, \gamma) &= c(u_1, u_2)\, f_1(r_1)\, f_2(r_2) \\
&= \frac{1}{\sqrt{1-\gamma^2}} \exp\left\{ -\frac{1}{2}\, \frac{\gamma^2 (z_1^2 + z_2^2) - 2\gamma z_1 z_2}{1-\gamma^2} \right\} \prod_{i=1}^{2} \frac{1}{\Gamma(\alpha_i)}\, r_i^{\alpha_i-1}\, \theta_i^{\alpha_i} \exp(-\theta_i r_i),
\end{aligned}$$
where $z_i = \Phi^{-1}(u_i)$ and $u_i = F_i(r_i)$ for $i = 1, 2$, and $F_i$ is the cumulative distribution function of $R_i$. Therefore, the expected value of $R_1 R_2$ is
$$E[R_1 R_2] = \int_0^\infty \int_0^\infty r_1 r_2\, f(r_1, r_2; \theta_1, \theta_2, \alpha_1, \alpha_2, \gamma)\, dr_1\, dr_2.$$
If $\rho$ is the correlation coefficient between $R_1$ and $R_2$, then we have
$$\rho = \frac{\theta_1 \theta_2}{\sqrt{\alpha_1 \alpha_2}} \int_0^\infty \int_0^\infty r_1 r_2\, f(r_1, r_2; \theta_1, \theta_2, \alpha_1, \alpha_2, \gamma)\, dr_1\, dr_2 - \sqrt{\alpha_1 \alpha_2}.$$
This can be numerically computed for different values of $(\alpha_i, \theta_i)$, $i = 1, 2$, and $\gamma$.
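As an illustration of such a numerical computation, the sketch below evaluates $\rho$ for Case 1 under the simplifying assumption $\alpha_1 = \alpha_2 = 1$ (exponential marginals), chosen only because the gamma quantile function then has the closed form $F_i^{-1}(u) = -\log(1-u)/\theta_i$; general shapes would need a gamma quantile routine such as R's qgamma. With this choice $\theta_1$ and $\theta_2$ cancel, and $\rho = E[Q(Z_1) Q(Z_2)] - 1$, where $Q(z) = -\log(1 - \Phi(z))$ and $(Z_1, Z_2)$ is standard bivariate normal with correlation $\gamma$. The function name and the midpoint grid are assumptions of the example.

```python
import math


def rho_exponential_margins(gamma, grid=200, zmax=8.0):
    """Pearson correlation under a Gaussian copula with unit-rate exponential marginals.

    E[R1 R2] is computed by the midpoint rule on the normal scale over [-zmax, zmax]^2.
    """
    h = 2.0 * zmax / grid
    z = [-zmax + (k + 0.5) * h for k in range(grid)]
    # Exponential quantile at u = Phi(z): -log(1 - Phi(z)) = -log(0.5 * erfc(z / sqrt(2))).
    q = [-math.log(0.5 * math.erfc(zk / math.sqrt(2.0))) for zk in z]
    one_minus_g2 = 1.0 - gamma * gamma
    norm_const = 1.0 / (2.0 * math.pi * math.sqrt(one_minus_g2))
    total = 0.0
    for i in range(grid):
        for j in range(grid):
            quad = (z[i] * z[i] - 2.0 * gamma * z[i] * z[j] + z[j] * z[j]) / one_minus_g2
            total += q[i] * q[j] * norm_const * math.exp(-0.5 * quad)
    e_r1r2 = total * h * h
    # Case 1 formula with alpha_1 = alpha_2 = 1 and theta_1 = theta_2 = 1.
    return e_r1r2 - 1.0


rho_by_gamma = {g: rho_exponential_margins(g) for g in (0.0, 0.5, 0.9)}
for g, rho in rho_by_gamma.items():
    print(g, round(rho, 4))
```

The resulting $\rho$ is zero at $\gamma = 0$, increases with $\gamma$, and stays below $\gamma$ for positive $\gamma$, which shows why the copula parameter and the Pearson correlation should not be used interchangeably.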
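Case 2. Suppose that $R_i$ is distributed as gamma$(\alpha, 1/\theta_i)$, and $\theta_i$ is also distributed as gamma$(\alpha_0, 1/\nu)$ for $i = 1, 2$. Then, the joint probability density of $(R_i, \theta_i)$ is
$$f_i(r_i, \theta_i; \alpha, \alpha_0, \nu) = \frac{\nu^{\alpha_0}}{\Gamma(\alpha)\Gamma(\alpha_0)}\, r_i^{\alpha-1}\, \theta_i^{\alpha+\alpha_0-1} \exp[-\theta_i(r_i + \nu)].$$
We can show that the marginal distribution of $R_i$ is the beta distribution of the second type ($Beta_2$) with parameters $(\nu, \alpha, \alpha_0)$:
$$f_i(r_i; \alpha, \alpha_0, \nu) = \int_0^\infty f_i(r_i, \theta_i; \alpha, \alpha_0, \nu)\, d\theta_i = \frac{\Gamma(\alpha+\alpha_0)}{\Gamma(\alpha)\Gamma(\alpha_0)}\, \frac{\nu^{\alpha_0}\, r_i^{\alpha-1}}{(r_i+\nu)^{\alpha+\alpha_0}} \sim Beta_2(\nu, \alpha, \alpha_0).$$
The marginal mean and the variance of $R_i$, which require $\alpha_0 > 2$, are
$$\begin{aligned}
E(R_i) &= \int_0^\infty \int_0^\infty r_i\, f_i(r_i, \theta_i; \alpha, \alpha_0, \nu)\, dr_i\, d\theta_i
= \frac{\alpha\, \nu^{\alpha_0}}{\Gamma(\alpha_0)} \int_0^\infty \theta_i^{\alpha_0-2} \exp(-\theta_i \nu)\, d\theta_i
= \frac{\alpha \nu}{\alpha_0 - 1}, \\
Var(R_i) &= \int_0^\infty \int_0^\infty r_i^2\, f_i(r_i, \theta_i; \alpha, \alpha_0, \nu)\, dr_i\, d\theta_i - \left( \frac{\alpha \nu}{\alpha_0 - 1} \right)^{2} \\
&= \frac{\alpha(\alpha+1)\, \nu^{\alpha_0}}{\Gamma(\alpha_0)} \int_0^\infty \theta_i^{\alpha_0-3} \exp(-\theta_i \nu)\, d\theta_i - \left( \frac{\alpha \nu}{\alpha_0 - 1} \right)^{2} \\
&= \frac{\alpha(\alpha+1)}{(\alpha_0-1)(\alpha_0-2)}\, \nu^2 - \left( \frac{\alpha \nu}{\alpha_0 - 1} \right)^{2}
= \frac{\alpha(\alpha + \alpha_0 - 1)}{(\alpha_0-1)^2 (\alpha_0-2)}\, \nu^2.
\end{aligned}$$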
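The moment formulas above can be checked numerically by evaluating the two one-dimensional $\theta$ integrals in the derivation with a simple quadrature rule. The sketch below uses the midpoint rule and the illustrative values $\alpha = 2$, $\alpha_0 = 10$, $\nu = 25$; these values, the truncation point, and the helper name `midpoint_integral` are assumptions of the example, not quantities from the article.

```python
import math


def midpoint_integral(func, upper, steps=20000):
    """Midpoint-rule approximation of the integral of func over (0, upper)."""
    h = upper / steps
    return h * sum(func((k + 0.5) * h) for k in range(steps))


# Illustrative parameter values only.
alpha, alpha0, nu = 2.0, 10.0, 25.0
const = nu ** alpha0 / math.gamma(alpha0)
upper = 5.0  # exp(-theta * nu) is negligible beyond this point when nu = 25

# E(R_i) and E(R_i^2) through the theta integrals in the derivation.
mean_num = alpha * const * midpoint_integral(
    lambda t: t ** (alpha0 - 2.0) * math.exp(-t * nu), upper)
second_num = alpha * (alpha + 1.0) * const * midpoint_integral(
    lambda t: t ** (alpha0 - 3.0) * math.exp(-t * nu), upper)
var_num = second_num - mean_num ** 2

# Closed forms: alpha*nu/(alpha0-1) and alpha*(alpha+alpha0-1)*nu^2/((alpha0-1)^2*(alpha0-2)).
mean_closed = alpha * nu / (alpha0 - 1.0)
var_closed = (alpha * (alpha + alpha0 - 1.0) * nu ** 2
              / ((alpha0 - 1.0) ** 2 * (alpha0 - 2.0)))
print(mean_num, mean_closed)
print(var_num, var_closed)
```

Agreement between the numerical and closed-form values is a quick guard against algebra slips in the simplification of the variance.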
Note that $E(R_1) = E(R_2)$ and $Var(R_1) = Var(R_2)$ are functions of $(\alpha, \alpha_0, \nu)$. Assuming the joint distribution of $(R_1, R_2)$ is determined by the Gaussian copula with parameter $\gamma$, Equation (6) gives the marginal density of $(R_1, R_2)$. The expected value of the product $R_1 R_2$ is given by
$$\begin{aligned}
E[R_1 R_2] &= \int_0^\infty \int_0^\infty r_1 r_2\, f_m(r_1, r_2; \alpha, \alpha_0, \nu, \gamma)\, dr_1\, dr_2 \\
&= \left\{ \frac{\nu^{\alpha_0}}{\Gamma(\alpha)\Gamma(\alpha_0)} \right\}^{2} \int_0^\infty \int_0^\infty \int_0^\infty \int_0^\infty r_1 r_2\, c(F_1(r_1), F_2(r_2)) \prod_{i=1}^{2} r_i^{\alpha-1}\, \theta_i^{\alpha+\alpha_0-1} \exp(-\theta_i(r_i+\nu))\, d\theta_1\, d\theta_2\, dr_1\, dr_2.
\end{aligned} \tag{12}$$
Equation (12) has a double integral with respect to $\theta_1$ and $\theta_2$, and an additional double integral with respect to $r_1$ and $r_2$. The function adaptIntegrate in the R package cubature is useful for evaluating this multidimensional integral numerically. We have developed an R function that uses adaptIntegrate to calculate (12). The relationship between the copula parameter $\gamma$ and $\rho$, in this case, is given by
$$\rho = \frac{\alpha_0 - 2}{\alpha + \alpha_0 - 1} \left[ \frac{(\alpha_0 - 1)^2}{\alpha\, \nu^2}\, E[R_1 R_2] - \alpha \right]. \tag{13}$$
The above-derived relationship between $\rho$ and $\gamma$ is applied in the simulation study in Section 7, demonstrating its practical relevance. Furthermore, in Table 1, the correlation coefficient $\rho$ and its estimate $\hat{\rho}$ were reasonably close for all sample sizes, which supports the proposed approach.

7. Simulation Studies

In this section, we perform simulations to gauge the performance of our Gaussian copula model. The data are simulated for two sets of values of $\Upsilon = (\alpha, \alpha_0, \nu, \gamma)$ with three sample sizes, $n = 100, 500, 3000$. The data simulation steps are as follows.
  • Fix a value for $\Upsilon = (\alpha, \alpha_0, \nu, \gamma)$.
  • Generate $n$ pairs of bivariate normal random variables $(x_{1j}, x_{2j})$ from a standard bivariate normal distribution (BVN) with the correlation parameter $\gamma$.
  • Calculate $(u_{1j}, u_{2j}) = (\Phi(x_{1j}), \Phi(x_{2j}))$ for $j = 1, \ldots, n$, where $\Phi$ is the cumulative distribution function of the standard normal distribution.
  • Generate $\theta_{ij}$ from a gamma distribution with parameters $(\alpha_0, 1/\nu)$ for $i = 1, 2$ and $j = 1, \ldots, n$.
  • Calculate $(r_{1j}, r_{2j}) = (F_{1j}^{-1}(u_{1j}), F_{2j}^{-1}(u_{2j}))$, where $F_{ij}$ is the cumulative distribution function of a gamma distribution with parameters $(\alpha, 1/\theta_{ij})$.
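The steps above can be sketched in Python using only the standard library. This is an illustration, not the simulation code behind the tables: to avoid needing a gamma quantile function, it fixes the shape at $\alpha = 1$ (an assumption, so that $F_{ij}^{-1}(u) = -\log(1-u)/\theta_{ij}$), whereas the article's simulations use $\alpha = 0.5$ and $\alpha = 2$. The function names and the seed are likewise made up for the example.

```python
import math
import random
from statistics import NormalDist


def simulate_intensities(n, alpha0, nu, gamma, seed=2025):
    """Simulate n (red, green) pairs from the copula model with shape alpha fixed at 1."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    red, green = [], []
    for _ in range(n):
        # Standard bivariate normal pair with correlation gamma.
        x1 = rng.gauss(0.0, 1.0)
        x2 = gamma * x1 + math.sqrt(1.0 - gamma ** 2) * rng.gauss(0.0, 1.0)
        # Map to uniforms; the cap keeps the log below finite.
        u1 = min(phi(x1), 1.0 - 1e-12)
        u2 = min(phi(x2), 1.0 - 1e-12)
        # Gene-specific rates from a gamma distribution with shape alpha0 and scale 1/nu.
        theta1 = rng.gammavariate(alpha0, 1.0 / nu)
        theta2 = rng.gammavariate(alpha0, 1.0 / nu)
        # Inverse CDF of gamma(1, 1/theta), i.e., an exponential with rate theta.
        red.append(-math.log1p(-u1) / theta1)
        green.append(-math.log1p(-u2) / theta2)
    return red, green


def sample_correlation(x, y):
    """Pearson sample correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


red, green = simulate_intensities(20000, alpha0=10.0, nu=25.0, gamma=0.9)
print(sum(red) / len(red), 25.0 / 9.0)  # sample mean vs. alpha * nu / (alpha0 - 1)
print(sample_correlation(red, green))
```

The sample mean can be compared with the marginal mean $\alpha\nu/(\alpha_0 - 1)$ from Section 6, and the sample correlation comes out noticeably smaller than $\gamma$ because the independent gene-specific rates add variability to each channel.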
For our first simulation, we fixed the parameter values as $\alpha = 0.5$, $\alpha_0 = 10$, $\nu = 25$, and $\gamma = 0.9$. We simulated samples of sizes $n = 100$, 500, and 3000 with these parameter values. The parameter estimation results are given in Table 1. In Table 1, $\rho$ is the correlation coefficient calculated from simulated data, and $\hat{\rho}$ is the correlation coefficient calculated after substituting the estimated values of $(\alpha, \alpha_0, \nu, \gamma)$ in Equation (13). The parameter estimates are closer to the actual parameter values for large sample sizes, and the standard errors get smaller as the sample size increases. The correlation coefficient, $\rho$, and its estimate, $\hat{\rho}$, were reasonably close for all sample sizes.
For our second simulation, we fixed the parameter values as $\alpha = 2$, $\alpha_0 = 27$, $\nu = 900$, and $\gamma = 0.8$, and as before, we took three sample sizes, 100, 500, and 3000. Table 2 contains the parameter estimation results for this second simulation.
For $n = 100$ and $500$, the estimate $\hat{\alpha}_0$ overestimates $\alpha_0$, and $\hat{\rho}$ severely underestimates $\rho$. Otherwise, the results are consistent with the first simulation. All the parameter estimates are closer to their true values for the larger sample size, $n = 3000$. This is good news because, in practice, $n$, which represents the number of genes, is in the thousands.

8. Analysis of E. coli Microarray Data

In this section, we apply the Gaussian copula-based Bayesian model developed in Section 3 to data obtained from microarray experiments on E. coli. These data consist of observations from five microarrays. There are two IPTG-treated samples labeled IPTG-A and IPTG-B, two heat shock samples labeled Heat Shock A and Heat Shock B, and a fifth, untreated control sample. We described these data earlier in Section 2. There are 4253, 4083, 4141, 4208, and 4071 genes in the control, IPTG-A, IPTG-B, Heat Shock A, and Heat Shock B samples, respectively.
The scatter plots for the red and green intensities for the five samples are shown in Figure 3, along with the sample correlation coefficients. There is a high positive correlation between red and green intensities in all five samples. The positively skewed shape of the density curves suggests the assumption of gamma marginals is reasonable. Thus, following [7], as a parsimonious model, we assume the marginal distributions of the red and green intensities are gamma with a common shape parameter but different scale parameters.
Table 3 contains the parameter estimates and their standard errors for the Gaussian copula-based Bayesian models for the five microarray samples. The standard errors are small because the sample sizes are large (more than 4000 in all cases), which suggests that the parameter estimates are reasonably precise.
The empirical density plots, along with the fitted density plots, are shown in Figure 4 and Figure 5 for the red and green intensities, respectively. The solid curves in Figure 4 and Figure 5 are the fitted gamma marginals, and the shaded curves are the empirical plots. Note that the fitted marginals are gamma densities with the estimated parameter values in Table 3. These figures show that the fitted marginals are very good for the IPTG and control samples, but there is room for improvement for the heat shock samples, especially for the red intensities.
Figure 6 shows the fitted bivariate density plots obtained using the parameter estimates in Table 3. In these plots, the 45° line indicates equal red and green intensities, and the points that fall on this line correspond to genes that are not differentially expressed. By this criterion, most of the control group's genes are not differentially expressed. For the IPTG samples, a few points lie away from the 45° line, indicating the presence of differentially expressed genes in these samples. Finally, for the two heat shock samples, a large number of points are away from the 45° line, indicating that there is a large number of differentially expressed genes in these samples.
Table 4 displays the observed correlation ($\rho$) and the correlation coefficient ($\hat{\rho}$) calculated from the estimated parameters in Table 3 using Equation (13). Except for Heat Shock B, the values of $\rho$ and $\hat{\rho}$ are very close for the other four samples, indicating that our copula model was fairly successful in quantifying the dependence between the two intensities.
Using Equation (8), we calculated the Bayes estimate $\hat{\eta}_j$ of $\eta_j$, which is a measure of differential expression of the $j$th gene. Table 5 displays the top twenty down-regulated ($\hat{\eta}_j$ is small) genes, and Table 6 lists the top twenty up-regulated ($\hat{\eta}_j$ is large) genes for all five samples.
Plots of ordered $\hat{\eta}_j$ values for the five samples are displayed in Figure 7. These plots are S-shaped; the left tails contain the down-regulated genes, whereas the right tails contain the up-regulated genes. Ref. [16] lists the genes significantly affected by the heat shock and IPTG treatments. According to their findings, the control sample has no differentially expressed genes, the IPTG samples have few, and the heat shock samples have many. Therefore, considering the number of differentially expressed genes and the plots of $\hat{\eta}_j$ for the five microarrays, $\hat{\eta}_j = 2$ is a good candidate cut-off value to filter up-regulated genes, while $\hat{\eta}_j = 0.5$ is a good candidate for down-regulated genes. The horizontal lines in Figure 7 indicate the possible cut-off values to separate the normal genes from the two extremes.
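The filtering rule described above amounts to a simple threshold on the Bayes estimates. The following Python sketch shows it with the candidate cut-offs 2 and 0.5; the function name `classify_gene` and the example values of $\hat{\eta}_j$ are hypothetical and are not taken from Table 5 or Table 6.

```python
def classify_gene(eta_hat, lower=0.5, upper=2.0):
    """Label a gene from its Bayes estimate of the red/green expression ratio."""
    if eta_hat > upper:
        return "up-regulated"
    if eta_hat < lower:
        return "down-regulated"
    return "not differentially expressed"


# Hypothetical Bayes estimates for four genes.
eta_hats = {"gene_a": 3.4, "gene_b": 1.1, "gene_c": 0.31, "gene_d": 2.0}
labels = {name: classify_gene(value) for name, value in eta_hats.items()}
print(labels)
```

The two cut-offs are reciprocal, so a gene whose red intensity is expected to be twice its green intensity is treated symmetrically to one with the ratio reversed; values exactly at a cut-off are left unflagged in this sketch.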
The total number of differentially expressed genes for each microarray is listed in Table 7, along with the total number of differentially expressed genes filtered with the bivariate gamma model proposed by [8].
As expected, none of the genes are identified as differentially expressed in the control sample, and very few in IPTG-A and IPTG-B. Many genes were filtered as up- or down-regulated from both the Heat Shock A and Heat Shock B models. The number of genes filtered from the Gaussian copula model is somewhat smaller than that from the bivariate gamma model.
The better model cannot be determined from these totals alone. Therefore, the log-likelihoods of the two models are compared for each microarray in Table 8. For every microarray, the log-likelihood under our model is higher than that under the bivariate gamma model proposed by [8]. Hence, we conclude that the Gaussian copula-based Bayesian model achieves an improvement over the model given in [8]. Further, the differentially expressed genes that our method filters from the heat-shock samples match well with the genes listed in [16]. Recall that, in [16], the control sample had no differentially expressed genes and the IPTG samples had few, consistent with our findings.
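To make the comparison concrete, the per-microarray gain in log-likelihood can be computed from the values in Table 8. The sketch below is a plain arithmetic check on the published numbers; the variable names are illustrative, and the comparison is of raw log-likelihoods with no penalty for model complexity.

```python
# Log-likelihoods copied from Table 8: (bivariate gamma, Gaussian copula).
table8 = {
    "Control": (-28824, -28350),
    "IPTG-A": (-28320, -27853),
    "IPTG-B": (-28257, -27885),
    "Heat Shock A": (-31936, -30419),
    "Heat Shock B": (-31658, -30282),
}

# Gain of the Gaussian copula model over the bivariate gamma model.
gains = {name: copula - gamma for name, (gamma, copula) in table8.items()}

for name, gain in gains.items():
    print(f"{name:13s} gain in log-likelihood = {gain}")

# True when the copula model has the higher log-likelihood on every array.
copula_always_higher = all(gain > 0 for gain in gains.values())
print("copula higher on all arrays:", copula_always_higher)
```

The gains range from 372 (IPTG-B) to 1517 (Heat Shock A), so the difference is largest on the heat-shock arrays.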

9. Discussion

Various methods have been proposed in the literature to identify differentially expressed genes. In this article, we have developed a Gaussian copula-based Bayesian model for detecting differentially expressed genes in cDNA microarray data. The accuracy of the model's parameter estimates was demonstrated through two simulation studies with three different sample sizes. We further applied our model to five microarray samples from E. coli. Experimentally identified differentially expressed genes in E. coli are documented in [16]. Using Bayes estimates, our model effectively distinguishes between up-regulated and down-regulated genes. Notably, many of the down-regulated genes identified using our model align with those reported in [8], which introduced a bivariate gamma model for the same E. coli data. However, the higher log-likelihood values obtained under our model compared to [8] suggest an improvement over the bivariate gamma approach. A key advantage of our Gaussian copula model is its flexibility: it can accommodate any marginal distribution for the intensities, whereas the bivariate gamma model is restricted to gamma-distributed marginals. Future research will explore the use of other copulas.

Author Contributions

Conceptualization, N.R.C.; computations, P.L.; writing of the draft, P.L. and N.R.C. This article is based on the first author's doctoral dissertation [28]. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data are available from ref. [16].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, H.; Bebu, I.; Li, X. Microarray probes and probe sets. Front. Biosci. 2010, E2, 325–338.
  2. Schena, M.; Shalon, D.; Davis, R.W.; Brown, P.O. Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science 1995, 270, 467–470.
  3. Schena, M.; Shalon, D.; Heller, R.; Chai, A.; Brown, P.O.; Davis, R.W. Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. USA 1996, 93, 10614–10619.
  4. DeRisi, J.; Penland, L.; Brown, P.O.; Bittner, M.L.; Meltzer, P.S.; Ray, M.; Chen, Y.; Su, Y.A.; Trent, J.M. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet. 1996, 14, 457–460.
  5. Sreekumar, J.; Jose, K.K. Statistical tests for identification of differentially expressed genes in cDNA microarray experiments. Nat. Genet. 2008, 7, 423–436.
  6. Chen, Y.; Dougherty, E.R.; Bittner, M.L. Ratio-based decision and the quantitative analysis of cDNA microarray images. Biomed. Opt. 1997, 2, 364–374.
  7. Newton, M.A.; Kendziorski, C.M.; Richmond, C.S.; Blattner, F.R.; Tsui, K.W. On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data. J. Comput. Biol. 2001, 8, 37–52.
  8. Mav, D.; Chaganty, N.R. Bivariate Models for Identifying Differentially Expressed Genes in Microarray Experiments. J. Stat. Theory Appl. 2004, 3, 111–124.
  9. Maria, E.C.J.; Salazar, I.; Sanz, L.; Gómez-Villegas, M.A. Using Copula to Model Dependence When Testing Multiple Hypotheses in DNA Microarray Experiments: A Bayesian Approximation. Mathematics 2020, 8, 1514.
  10. Sklar, A. Fonctions de répartition à n dimensions et leurs marges; Université de Paris: Paris, France, 1959; Volume 8, pp. 229–231.
  11. Nelsen, R.B. An Introduction to Copulas, 2nd ed.; Springer: New York, NY, USA, 2006.
  12. Joe, H. Multivariate Models and Multivariate Dependence Concepts; Chapman and Hall/CRC: Boca Raton, FL, USA, 1997.
  13. Joe, H. Dependence Modeling with Copulas; Chapman and Hall/CRC: Boca Raton, FL, USA, 2015.
  14. Chaba, L.A. A Copula-based Approach to Differential Gene Expression Analysis. Ph.D. Thesis, Strathmore University, Nairobi, Kenya, 2017.
  15. Ray, S.; Lall, S.; Sanz, L.; Bandyopadhyay, S. CODC: A copula based model to identify differential coexpression. npj Syst. Biol. Appl. 2020, 6, 20.
  16. Richmond, C.S.; Glasner, J.D.; Mau, R.; Jin, H.; Blattner, F. Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res. 1999, 27, 3821–3835.
  17. Baek, J.; Son, Y.S.; McLachlan, G.J. Segmentation and intensity estimation of microarray images using a gamma-t mixture model. Bioinformatics 2007, 23, 458–465.
  18. Plancade, S.; Rozenholc, Y.; Lund, E. Generalization of the normal-exponential model: Exploration of a more accurate parametrisation for the signal distribution on Illumina BeadArrays. arXiv 2012, arXiv:1112.4180.
  19. Fajriyah, R. A Study of Convolution Models for Background Correction of BeadArrays. Austrian J. Stat. 2016, 45, 15–33.
  20. Narasimhan, B.; Koller, M.; Johnson, S.G.; Hahn, T.; Bouvier, A.; Kiêu, K.; Gaure, S. Cubature: Adaptive Multivariate Integration over Hypercubes, R Package version 2.1.1; 2024; Available online: https://cran.r-project.org/package=cubature (accessed on 25 March 2025).
  21. Borchers, H.W. Pracma: Practical Numerical Math Functions, R Package version 2.4.4; 2023; Available online: https://cran.r-project.org/package=pracma (accessed on 25 March 2025).
  22. Nash, J.C. Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, 2nd ed.; Adam Hilger: Bristol, UK; Halsted Press: New York, NY, USA, 1979.
  23. Broyden, C.G. The convergence of a class of double-rank minimization algorithms. J. Inst. Math. Appl. 1970, 6, 76–90.
  24. Fletcher, R. A new approach to variable metric algorithms. Comput. J. 1970, 13, 317–322.
  25. Goldfarb, D. A family of variable metric updates derived by variational means. Math. Comput. 1970, 24, 23–26.
  26. Shanno, D.F. Conditioning of quasi-Newton methods for function minimization. Math. Comput. 1970, 24, 647–656.
  27. Gilbert, P.; Varadhan, R. numDeriv: Accurate Numerical Derivatives, R Package version 2016.8-1.1; 2022; Available online: https://cran.r-project.org/package=numDeriv (accessed on 25 March 2025).
  28. Liyanaarachchi, P. A Copula Model Approach to Identify the Differential Gene Expression. Ph.D. Dissertation, Old Dominion University, Norfolk, VA, USA, 2021.
Figure 1. Histogram of red intensities with kernel density plots.
Figure 2. Histogram of green intensities with kernel density plots.
Figure 3. Scatter plots of red and green intensities.
Figure 4. Empirical and fitted density (solid line) plots of red intensities.
Figure 5. Empirical and fitted density (solid line) plots of green intensities.
Figure 6. Estimated bivariate density plots of red and green intensities.
Figure 7. Plots of $\hat{\eta}_j$, the estimates of the expected ratio of the intensities.
Table 1. Parameter estimates (standard errors) for the first simulated data †.

| n | $\hat{\alpha}$ | $\hat{\alpha}_0$ | $\hat{\nu}$ | $\hat{\gamma}$ | $\rho$ | $\hat{\rho}$ |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.541 (0.054) | 8.901 (0.280) | 22.043 (3.829) | 0.835 (0.005) | 0.806 | 0.713 |
| 500 | 0.506 (0.024) | 9.850 (0.102) | 25.922 (1.678) | 0.910 (0.010) | 0.832 | 0.827 |
| 3000 | 0.540 (0.011) | 10.119 (0.016) | 25.999 (0.421) | 0.890 (<0.001) | 0.771 | 0.805 |

† True parameter values are $\alpha = 0.5$, $\alpha_0 = 10$, $\nu = 25$, and $\gamma = 0.9$.
Table 2. Parameter estimates (standard errors) for the second simulated data †.

| n | $\hat{\alpha}$ | $\hat{\alpha}_0$ | $\hat{\nu}$ | $\hat{\gamma}$ | $\rho$ | $\hat{\rho}$ |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | 2.629 (0.005) | 31.999 (0.019) | 899.216 (2.772) | 0.737 (0.003) | 0.718 | 0.485 |
| 500 | 2.018 (0.003) | 30.424 (0.026) | 898.990 (1.982) | 0.706 (0.005) | 0.717 | 0.407 |
| 3000 | 2.187 (0.003) | 26.256 (0.002) | 898.991 (0.753) | 0.863 (0.001) | 0.710 | 0.775 |

† True parameter values are $\alpha = 2$, $\alpha_0 = 27$, $\nu = 900$, and $\gamma = 0.8$.
Table 3. Parameter estimates (standard errors) for the E. coli data.

| Microarray | $\alpha$ | $\alpha_0$ | $\nu$ | $\gamma$ |
| --- | --- | --- | --- | --- |
| Control | 0.796 (0.001) | 55.245 (0.343) | 1529.876 (1.056) | 0.9896 (0.003) |
| IPTG-A | 0.743 (0.001) | 40.048 (0.127) | 1149.604 (1.181) | 0.9838 (0.021) |
| IPTG-B | 0.643 (0.003) | 27.095 (0.059) | 899.997 (0.678) | 0.9839 (0.019) |
| Heat Shock A | 1.777 (0.039) | 4.644 (0.094) | 24.999 (0.056) | 0.8116 (0.013) |
| Heat Shock B | 1.449 (0.024) | 4.613 (0.083) | 29.999 (0.102) | 0.6507 (0.015) |
Table 4. True and estimated correlation coefficients.

| Microarray | $\rho$ | $\hat{\rho}$ |
| --- | --- | --- |
| Control | 0.9712 | 0.9748 |
| IPTG-A | 0.9515 | 0.9163 |
| IPTG-B | 0.9471 | 0.9799 |
| Heat Shock A | 0.4137 | 0.4723 |
| Heat Shock B | 0.5147 | 0.3971 |
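Table 5. Top 20 down-regulated genes.

| # | Control gene id | Control $\hat{\eta}_j$ | IPTG-A gene id | IPTG-A $\hat{\eta}_j$ | IPTG-B gene id | IPTG-B $\hat{\eta}_j$ | Heat Shock A gene id | Heat Shock A $\hat{\eta}_j$ | Heat Shock B gene id | Heat Shock B $\hat{\eta}_j$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | b0233 | 0.52 | b4098 | 0.29 | b4119 | 0.30 | b3686 | 0.00 | b3686 | 0.04 |
| 2 | b1325 | 0.53 | b4119 | 0.29 | b4120 | 0.30 | b3687 | 0.00 | b3687 | 0.05 |
| 3 | b0558 | 0.56 | b4120 | 0.45 | b4149 | 0.35 | b0014 | 0.02 | b4142 | 0.05 |
| 4 | b2843 | 0.58 | b0296 | 0.53 | b0341 | 0.43 | b1306 | 0.03 | b0015 | 0.05 |
| 5 | b2129 | 0.60 | b4291 | 0.54 | b4291 | 0.43 | b1967 | 0.03 | b0014 | 0.05 |
| 6 | b1319 | 0.60 | b1571 | 0.54 | b1785 | 0.46 | b1304 | 0.03 | b3400 | 0.06 |
| 7 | b3818 | 0.61 | b0720 | 0.56 | b1020 | 0.47 | b1380 | 0.03 | b1380 | 0.06 |
| 8 | b1075 | 0.64 | b1020 | 0.58 | b0648 | 0.47 | b2614 | 0.04 | b2592 | 0.07 |
| 9 | b1924 | 0.65 | b3908 | 0.59 | b0558 | 0.50 | b0399 | 0.04 | b1306 | 0.08 |
| 10 | b2742 | 0.66 | b1500 | 0.60 | b2260 | 0.52 | b1305 | 0.04 | b0966 | 0.08 |
| 11 | b1447 | 0.67 | b0326 | 0.60 | b0705 | 0.55 | b1307 | 0.04 | b4143 | 0.08 |
| 12 | b3341 | 0.68 | b3962 | 0.60 | b0720 | 0.55 | b4143 | 0.04 | b1304 | 0.08 |
| 13 | b3751 | 0.68 | b4149 | 0.61 | b3489 | 0.56 | b4140 | 0.05 | b1307 | 0.08 |
| 14 | b2756 | 0.68 | b0702 | 0.62 | b0302 | 0.56 | b3400 | 0.05 | b1305 | 0.08 |
| 15 | b1166 | 0.68 | b1018 | 0.63 | b3342 | 0.57 | b4142 | 0.05 | b2614 | 0.09 |
| 16 | b2051 | 0.68 | b4247 | 0.63 | b1166 | 0.58 | b3401 | 0.05 | b0473 | 0.09 |
| 17 | b2541 | 0.68 | b1685 | 0.64 | b0805 | 0.60 | b1321 | 0.05 | b0016 | 0.09 |
| 18 | b3340 | 0.69 | b2260 | 0.65 | b1681 | 0.60 | b0473 | 0.05 | b3932 | 0.09 |
| 19 | b3966 | 0.70 | b0705 | 0.65 | b2843 | 0.61 | b1829 | 0.06 | b1060 | 0.10 |
| 20 | b2628 | 0.70 | b0726 | 0.66 | b3508 | 0.61 | b4171 | 0.07 | b1829 | 0.10 |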
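Table 6. Top 20 up-regulated genes.

| # | Control gene id | Control $\hat{\eta}_j$ | IPTG-A gene id | IPTG-A $\hat{\eta}_j$ | IPTG-B gene id | IPTG-B $\hat{\eta}_j$ | Heat Shock A gene id | Heat Shock A $\hat{\eta}_j$ | Heat Shock B gene id | Heat Shock B $\hat{\eta}_j$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | b4325 | 1.89 | b2206 | 2.32 | b1256 | 2.56 | b3556 | 20.65 | b3556 | 15.68 |
| 2 | b0657 | 1.79 | b1256 | 2.30 | b2206 | 2.47 | b2094 | 19.03 | b1078 | 14.71 |
| 3 | b2740 | 1.70 | b0043 | 2.22 | b0759 | 2.32 | b1076 | 15.56 | b0907 | 12.44 |
| 4 | b0542 | 1.69 | b1673 | 1.97 | b4307 | 2.11 | b1077 | 14.40 | b1857 | 10.89 |
| 5 | b0679 | 1.50 | b2205 | 1.95 | b1674 | 2.05 | b1075 | 14.24 | b1074 | 10.28 |
| 6 | b4314 | 1.50 | b2204 | 1.95 | b2997 | 2.03 | b1857 | 14.12 | b1076 | 9.91 |
| 7 | b2418 | 1.49 | b2997 | 1.94 | b2203 | 2.03 | b0754 | 14.11 | b0296 | 9.87 |
| 8 | b3616 | 1.48 | b0115 | 1.90 | b2151 | 2.02 | b1074 | 13.85 | b2094 | 9.51 |
| 9 | b4243 | 1.48 | b0759 | 1.89 | b0733 | 2.01 | b2926 | 13.49 | b1588 | 8.81 |
| 10 | b0185 | 1.47 | b2727 | 1.89 | b0857 | 2.00 | b1073 | 12.11 | b1245 | 8.22 |
| 11 | b1084 | 1.45 | b2241 | 1.83 | b2204 | 1.99 | b2935 | 11.27 | b4025 | 8.18 |
| 12 | b3834 | 1.44 | b0283 | 1.81 | b2242 | 1.97 | b3544 | 10.66 | b4328 | 7.97 |
| 13 | b2860 | 1.44 | b2202 | 1.80 | b2205 | 1.95 | b1078 | 10.32 | b1885 | 7.89 |
| 14 | b0729 | 1.44 | b2996 | 1.80 | b2996 | 1.95 | b0907 | 9.60 | b2241 | 7.17 |
| 15 | b1083 | 1.43 | b0347 | 1.78 | b2957 | 1.92 | b2416 | 9.05 | b1244 | 7.12 |
| 16 | b1064 | 1.42 | b0733 | 1.78 | b2727 | 1.91 | b2092 | 8.93 | b1417 | 7.03 |
| 17 | b2283 | 1.42 | b2957 | 1.78 | b2149 | 1.90 | b2286 | 8.84 | b0131 | 6.70 |
| 18 | b3147 | 1.42 | b2203 | 1.76 | b2241 | 1.90 | b3357 | 8.73 | b1938 | 6.63 |
| 19 | b1674 | 1.41 | b2151 | 1.75 | b0598 | 1.88 | b0893 | 8.10 | b1676 | 6.57 |
| 20 | b0698 | 1.41 | b2242 | 1.74 | b0894 | 1.87 | b1244 | 8.06 | b1072 | 6.53 |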
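Table 7. Total number of differentially expressed genes.

| Microarray | # of genes with $\hat{\eta}_j > 2$, Bivariate Gamma | # of genes with $\hat{\eta}_j > 2$, Gaussian Copula | # of genes with $\hat{\eta}_j < 0.5$, Bivariate Gamma | # of genes with $\hat{\eta}_j < 0.5$, Gaussian Copula |
| --- | --- | --- | --- | --- |
| Control | 0 | 0 | 0 | 0 |
| IPTG-A | 10 | 3 | 3 | 3 |
| IPTG-B | 21 | 9 | 7 | 8 |
| Heat Shock A | 553 | 451 | 1007 | 439 |
| Heat Shock B | 856 | 600 | 590 | 169 |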
Table 8. Log-likelihoods for the competitive models.

| Microarray | Bivariate Gamma | Gaussian Copula |
| --- | --- | --- |
| Control | −28,824 | −28,350 |
| IPTG-A | −28,320 | −27,853 |
| IPTG-B | −28,257 | −27,885 |
| Heat Shock A | −31,936 | −30,419 |
| Heat Shock B | −31,658 | −30,282 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liyanaarachchi, P.; Chaganty, N.R. Copula-Based Bayesian Model for Detecting Differential Gene Expression. Analytics 2025, 4, 11. https://doi.org/10.3390/analytics4020011
