Article

Using Copula to Model Dependence When Testing Multiple Hypotheses in DNA Microarray Experiments: A Bayesian Approximation

1 Departamento de Matemática e Estatística, Faculdade de Ciências Naturais, Matemática e Estatística, Universidade Rovuma, 3100 Nampula, Mozambique
2 Departamento de Producción Animal, Facultad de Veterinaria, Universidad Complutense de Madrid, 28040 Madrid, Spain
3 Departamento de Estadística e IO, Facultad de Ciencias Matemáticas, Universidad Complutense de Madrid, 28040 Madrid, Spain
* Author to whom correspondence should be addressed.
Mathematics 2020, 8(9), 1514; https://doi.org/10.3390/math8091514
Submission received: 9 August 2020 / Revised: 26 August 2020 / Accepted: 28 August 2020 / Published: 4 September 2020
(This article belongs to the Special Issue Mathematical Biology: Modeling, Analysis, and Simulations)

Abstract
Many experiments require simultaneously testing many hypotheses. This is particularly relevant in the context of DNA microarray experiments, where it is common to analyze many genes to determine which of them are differentially expressed under two conditions. Another important problem in this context is how to model the dependence at the level of gene expression. In this paper, we propose a Bayesian procedure for simultaneously testing multiple hypotheses, modeling the dependence through copula functions, where all available information, both objective and subjective, can be used. The approach has the advantage that it can be used with different dependency structures. A simulated data analysis was performed to examine the performance of the proposed approach. The results show that our procedure captures the dependence appropriately, adequately classifying a high percentage of true and false null hypotheses when a right-skewed beta prior distribution is chosen for the initial probability of each null hypothesis, resulting in a very powerful procedure. The procedure is also illustrated with real data.

1. Introduction

There are many experiments that require simultaneously testing many hypotheses. In that context, if each hypothesis is tested individually at a given significance level $\alpha$, the probability of erroneously rejecting at least one hypothesis increases rapidly with the number of hypotheses; i.e., a problem arises if the multiplicity of the problem is not taken into account when simultaneously evaluating all of the hypotheses. DNA microarray experiments exhibit this problem, as data analysis often requires simultaneously testing many hypotheses, one for each gene. The first to warn of this problem was [1]. The literature regarding this subject is extensive, especially under the assumption of independence.
From a frequentist point of view, procedures for testing multiple hypotheses are based on controlling a measure related to Type I errors, such as the familywise error rate (FWER). However, this usually leads to especially conservative procedures, in the sense that few false null hypotheses are rejected, thus reducing the power of the test.
When testing multiple hypotheses, the false discovery rate (FDR) was proposed by [2] as a measure of error that results in less conservative procedures than those controlling the FWER. The authors argue that, in some situations, it may be acceptable to tolerate some false positives, provided that there are few in relation to the number of rejected null hypotheses. A review of multiple hypothesis tests is presented in [3], whose authors also proposed methods based on ordered p-values for multiple comparisons between means of normal populations.
Continuing with frequentist approaches, the different error rates were analyzed in [4], which also compared the different procedures in the context of DNA microarrays, and a general statistical framework is proposed in [5] for multiple tests of association between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest.
In the context of DNA microarray experiments, the FDR is the most commonly used error type in the frequentist approach. This is because multiple hypothesis tests are used, in many situations, as a first exploratory step to identify groups of differentially expressed genes on which further research is subsequently performed. Thus, it may be acceptable to tolerate a higher number of false positives. Furthermore, the authors of [2] derived a procedure for controlling the FDR at a certain level $\alpha$ for independent test statistics. There are many published studies in this field. For a detailed review of the multiple testing problem, see [4,5].
In some cases, however, test statistics are dependent. For instance, in the context of DNA microarray experiments, genes usually present high correlation and the number of hypotheses is very high, which poses a challenging and important problem in statistics. The following research considers multiple hypothesis testing through a frequentist approach under dependence. The case in which the test statistics have positive regression dependency on each of the test statistics corresponding to the true null hypotheses is analyzed in [6]. A step-down procedure that controls the FDR under independence of the test statistics, while also controlling the FDR under positive dependence, can be seen in [7]. An application of Archimedean copulas to resampled p-values generated by permutations in the context of multiple testing is shown in [8]. A detailed review of the multiple testing problem aimed at controlling the FDR under Archimedean copulas can be found in [9] and the references therein.
From a Bayesian perspective, the posterior probability of each null hypothesis must be obtained to make decisions. Several publications offer essential insights on this approach under the assumption of independence. For example, a mixture of a discrete and a continuous component as the distribution for the observations is proposed in a Bayesian approach by [10]. Hierarchical Bayesian models can be seen in [11,12]; the former is robust with respect to extreme values and powerful even with a small number of observations, while the latter is based on a mixture of two distributions along with an empirical Bayes approach.
Moreover, the sensitivity regarding the choice of the prior distribution on the probability of each null hypothesis was analyzed in [13]. On the other hand, the multiple hypothesis testing problem was dealt with from the perspective of Bayesian decision theory in [14], where a decision criterion based on an estimate of the number of false null hypotheses is proposed. Finally, procedures that control the Bayes FDR and the Bayes FNR, which are applicable in any situation, are suggested in [15].
Under the assumption of dependence in the field of genomics, an empirical Bayesian method is proposed in [16], which essentially combines multiple testing with a clustering technique. Similarly, a procedure that combines a clustering technique with multiple testing from a Bayesian perspective to deal with the correlation effect in data analysis is presented by [17].
Other recent approaches have been developed to address dependence when testing multiple hypotheses, such as graphical models. The hidden Markov model (HMM), in the context of graphical models, has emerged as a tool to support the structure of data dependence when testing multiple hypotheses, and it has had a considerable impact on the field of genomics. The potential offered by the HMM to support the dependency structure has been explored in [18,19,20,21], among others. The dependency structure was explored through Markov chain models in [18], demonstrating the optimality of an FDR-controlling procedure under certain conditions at an appropriate level, along with the empirical realization of these models. This procedure was extended in [21,22,23] by developing a graphical model based on the multiple test procedure and a Markov-random-field-coupled mixture model. The extended procedure allows for an arbitrary dependency structure ($N \geq 2$) and heterogeneously dependent parameters. The effect of the dependency structure of the finite states of the HMM on the likelihood ratio for optimal multiple tests in hidden states is analyzed in [19].
The main aim of this paper is to provide a Bayesian procedure for testing multiple hypotheses in cases with many hypotheses, under the assumption of dependency, modeling this dependence through copula functions. Copulas are attractive because they can be used to model a wide range of dependency structures. The remainder of this paper is organized as follows. In Section 2, we propose a full Bayesian approach to the problem of testing multiple hypotheses, and we describe the theory regarding copula functions. In Section 3, we present the full Bayesian approach for modeling dependence through an N-dimensional Gaussian copula with normal marginal densities, together with the prior and conditional posterior distributions necessary to apply a Markov chain Monte Carlo (MCMC) algorithm (the Metropolis-Hastings-within-Gibbs algorithm), a summary of this algorithm (the details of which are given in Appendix A), and a simulation study that evaluates the performance of our approach. Section 4 shows the Bayesian approach for modeling the dependence through an N-dimensional Clayton copula with normal marginal densities, together with a simulation study and a model-selection comparison based on the Deviance Information Criterion (DIC). Section 5 applies the proposed methodology to a real data set from DNA microarray experiments. Finally, Section 6 presents the main conclusions.

2. A Bayesian Approach: Model Specification

Suppose $N$ dependent random variables are measured under two different independent treatment conditions. In particular, suppose $X = (X_1, \ldots, X_N)$ is an $N$-dimensional random vector of dependent variables measured under one condition, and $Y = (Y_1, Y_2, \ldots, Y_N)$ is an $N$-dimensional random vector of dependent variables measured under the other condition, where $X$ and $Y$ arise independently from distributions $F_X(X \mid \Theta_X, \lambda_X)$ and $F_Y(Y \mid \Theta_Y, \lambda_Y)$, respectively, and where $\Theta_X = (\theta_{X_1}, \ldots, \theta_{X_N})$ and $\Theta_Y = (\theta_{Y_1}, \ldots, \theta_{Y_N})$ are the parameter vectors of interest and $\Lambda = (\lambda_X, \lambda_Y)$ is the other group of parameter vectors for the model. We consider the problem of simultaneous testing as follows:
$$H_{0i}: \theta_{X_i} = \theta_{Y_i} \quad \text{versus} \quad H_{1i}: \theta_{X_i} \neq \theta_{Y_i}, \qquad i = 1, 2, \ldots, N \tag{1}$$
We decide which null hypotheses to accept through the posterior probability of each null hypothesis. Thus, we build a probability distribution model for $X$ and $Y$ before making inferences about the parameters once the variables $X_i$ and $Y_i$ have been observed.
The joint probability density of $X$ and $Y$ is defined as the product of the joint probability densities $f_X$ and $f_Y$, because we previously assumed independence between the two conditions. Thus,
$$f(X, Y \mid \Theta) = f(X \mid \Theta_X, \lambda_X) \, f(Y \mid \Theta_Y, \lambda_Y) \tag{2}$$
where $\Theta = (\Theta_X, \lambda_X, \Theta_Y, \lambda_Y)$ denotes the model parameters.

2.1. Copula Function

We build a multivariate distribution for each treatment condition using a copula function. According to [24,25], a copula is a joint distribution function defined on the unit cube $[0,1]^n$ with standard uniform univariate margins. This concept was introduced by [26], and other recent works include [27,28,29], among others. Copulas are especially useful because they can fully model the dependence in the data.
According to Sklar's Theorem (1959), given a joint cumulative distribution function $F(x_1, \ldots, x_N)$ for random variables $X_1, X_2, \ldots, X_N$ with marginal cumulative distribution functions (CDFs) $F_1(x_1), F_2(x_2), \ldots, F_N(x_N)$, $F$ can be written as a function of its marginals:
$$F(x_1, x_2, \ldots, x_N) = C(F_1(x_1), F_2(x_2), \ldots, F_N(x_N)) = C(u_1, u_2, \ldots, u_N)$$
where $C(u_1, u_2, \ldots, u_N)$ is a joint distribution function with uniform marginals, $u_i = F_i(x_i)$ for $i = 1, \ldots, N$, and $C$ is called a copula. If $F_1, \ldots, F_N$ are continuous, then the copula $C$ is unique, and if each $F_i$ is discrete, then $C$ is unique on $\mathrm{Ran}(F_1) \times \cdots \times \mathrm{Ran}(F_N)$, where $\mathrm{Ran}(F_i)$ is the range of $F_i$. For a rigorous treatment of copulas, see [29].
As a consequence of Sklar's Theorem (without loss of generality, we treat the absolutely continuous case), the joint probability density can be written as the product of the marginal densities and the copula density. Thus, an $N$-dimensional joint density function is defined as follows:
$$f(x_1, \ldots, x_N) = \frac{\partial^N C(F_1(x_1), F_2(x_2), \ldots, F_N(x_N))}{\partial F_1(x_1) \cdots \partial F_N(x_N)} \prod_{i=1}^{N} \frac{\partial}{\partial x_i} F_i(x_i) = c(u_1, u_2, \ldots, u_N) \prod_{i=1}^{N} f_i(x_i) \tag{3}$$
where
$$u_i = F_i(x_i), \qquad f_i(x_i) = \frac{\partial}{\partial x_i} F_i(x_i), \qquad c(u_1, u_2, \ldots, u_N) = \frac{\partial^N C(u_1, u_2, \ldots, u_N)}{\partial u_1 \cdots \partial u_N} \tag{4}$$
The dependence function $c(u_1, u_2, \ldots, u_N)$ is called the copula density, and it encodes the dependence among the variables $(x_1, x_2, \ldots, x_N)$. For instance, if the random variables $x_1, x_2, \ldots, x_N$ are independent, $c(u_1, u_2, \ldots, u_N) = 1$, so that $f(x_1, \ldots, x_N) = \prod_{i=1}^{N} f_i(x_i)$.
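As a minimal illustration of these definitions (not part of our implementation), the following R sketch uses the copula package, an illustrative package choice on our part, to evaluate a bivariate Gaussian copula density and to verify that, under independence, the copula density is identically 1:

library(copula)                               # illustrative package choice, assumed here
gc <- normalCopula(param = 0.8, dim = 2)      # bivariate Gaussian copula with rho = 0.8
u  <- c(0.3, 0.7)                             # a point (u1, u2) in the unit square
dCopula(u, gc)                                # copula density c(u1, u2) at that point
dCopula(u, normalCopula(param = 0, dim = 2))  # rho = 0: independence, density equals 1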

2.2. Modeling Dependence with N-Dimensional Copulas

From (3) and (4), the N-dimensional joint density function (2) for X and Y can be expressed as follows:
$$f(x_1, \ldots, x_N; y_1, \ldots, y_N \mid \Theta) = c_X(u_{X_1}, u_{X_2}, \ldots, u_{X_N}; \omega_X) \prod_{i=1}^{N} f_i(x_i \mid \theta_{X_i}, \lambda_{X_i}) \times c_Y(u_{Y_1}, u_{Y_2}, \ldots, u_{Y_N}; \omega_Y) \prod_{i=1}^{N} f_i(y_i \mid \theta_{Y_i}, \lambda_{Y_i}) \tag{5}$$
where $u_{X_i} = F_i(x_i)$, $u_{Y_i} = F_i(y_i)$, $i = 1, \ldots, N$, and $\omega_X$ and $\omega_Y$ denote the copula density parameter vectors for conditions $X$ and $Y$, respectively. Next, we update the parameter vector for the model: $\Theta = (\Theta_X, \lambda_X, \Theta_Y, \lambda_Y, \omega_X, \omega_Y)$. To simplify the notation, throughout the following we write $c_X(u_X; \omega_X)$ and $c_Y(u_Y; \omega_Y)$ rather than $c_X(u_{X_1}, u_{X_2}, \ldots, u_{X_N}; \omega_X)$ and $c_Y(u_{Y_1}, u_{Y_2}, \ldots, u_{Y_N}; \omega_Y)$, respectively.
In the Bayesian framework, to proceed with the inference, all unknown quantities Θ must be estimated from the posterior distribution:
$$\pi(\Theta \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y}) \propto \pi(\Theta) \, L(\Theta \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y})$$
Therefore, a prior distribution $\pi(\Theta)$ is needed, as well as the likelihood $L(\Theta \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y})$, the observations $x_{\cdot j} = (x_{1j}, x_{2j}, \ldots, x_{Nj})$, $j = 1, 2, \ldots, n_x$, and $y_{\cdot k} = (y_{1k}, y_{2k}, \ldots, y_{Nk})$, $k = 1, 2, \ldots, n_y$, being samples from $X = (X_1, \ldots, X_N)$ and $Y = (Y_1, \ldots, Y_N)$, where $n_x$ and $n_y$ represent the number of samples of $X$ and $Y$, respectively. Thus, the likelihood is derived as follows:
$$L(\Theta \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y}) = \prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \omega_X) \prod_{i=1}^{N} f_i(x_{ij} \mid \theta_{X_i}, \lambda_{X_i}) \times \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \omega_Y) \prod_{i=1}^{N} f_i(y_{ik} \mid \theta_{Y_i}, \lambda_{Y_i}) \tag{6}$$
As we can see, this likelihood is complex because it depends on $H_{0i}$ and $H_{1i}$, defined in (1). To make it tractable, we introduce $N$ independent latent variables $\tau_i$ [30] following a $\mathrm{Bernoulli}(1 - p_i)$ distribution for all $i = 1, 2, \ldots, N$, where $p_i$ is the initial probability of each null hypothesis:
$$\tau_i = \begin{cases} 0 & \text{if } \theta_{X_i} = \theta_{Y_i} \\ 1 & \text{if } \theta_{X_i} \neq \theta_{Y_i} \end{cases} \tag{7}$$
Then, $\Pr(\tau_i = 0 \mid p_i) = p_i$ and $\Pr(\tau_i = 1 \mid p_i) = 1 - p_i$. Thus, each vector of observations $(x_{i \cdot}, y_{i \cdot})$ comes from a distribution under $H_{0i}$ when $\tau_i = 0$, and under $H_{1i}$ when $\tau_i = 1$, for $i = 1, 2, \ldots, N$, i.e.,
$$\begin{aligned} X_{ij} \mid \tau_i, \Theta_X, \lambda_X &\sim f_i(x_{ij} \mid \theta_{X_i}, \lambda_{X_i}), & i = 1, \ldots, N, \; j = 1, \ldots, n_x \\ Y_{ik} \mid \tau_i = 0, \Theta_X, \lambda_Y &\sim f_i(y_{ik} \mid \theta_{X_i}, \lambda_{Y_i}), & k = 1, \ldots, n_y \\ Y_{ik} \mid \tau_i = 1, \Theta_Y, \lambda_Y &\sim f_i(y_{ik} \mid \theta_{Y_i}, \lambda_{Y_i}) & \end{aligned}$$
In a Bayesian framework, we can consider the latent variables $\tau = (\tau_1, \ldots, \tau_N)$ as an additional group of parameters. As a result, the likelihood (6) is written as follows:
$$L(\Theta, \tau \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y}) = \prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \omega_X) \prod_{i=1}^{N} f_i(x_{ij} \mid \theta_{X_i}, \lambda_{X_i}) \times \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \omega_Y) \prod_{i: \tau_i = 0} f_i(y_{ik} \mid \theta_{X_i}, \lambda_{Y_i}) \prod_{i: \tau_i = 1} f_i(y_{ik} \mid \theta_{Y_i}, \lambda_{Y_i}) \tag{8}$$
Then, the posterior distribution is
$$\pi(\Theta, \tau \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y}) \propto \pi(\Theta) \, \pi(\tau \mid \Theta) \, L(\Theta, \tau \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y})$$
where $\Theta = (\Theta_X, \lambda_X, \Theta_Y, \lambda_Y, \omega_X, \omega_Y, p)$, with $p = (p_1, \ldots, p_N)$.
Given $\pi(\Theta)$, we seek to obtain the posterior probability of each null hypothesis through the corresponding marginal distributions of the $\tau_i$.
In the following sections, we consider normal marginal densities for the model, as is usually done when modeling gene expression data, and the means are the parameters of interest. In this context, assuming that the joint distribution is normal may seem reasonable. Then, in Section 3, the dependency between the variables of each treatment condition is modeled using an N-dimensional Gaussian copula. However, the Gaussian copula is not always the most appropriate for modeling dependency even if the marginals are normal, because normal marginal distributions do not imply a normal joint distribution, as can be seen, for example, in [31,32,33]. For this reason, in Section 4, the dependency is modeled using an N-dimensional Clayton copula.

3. Modeling Dependence Through N-Dimensional Gaussian Copulas with Normal Marginal Densities

The typical objective when analyzing data arising from microarray experiments is to identify genes that are differentially expressed. Normal marginal distributions have been widely used in the field of genomics to model gene expression data [13,14,34,35], among others.
Thus, we may assume a normal distribution for the variables $X_i$ and $Y_i$, $i = 1, 2, \ldots, N$. We consider that the vector of observations $(x_{i \cdot}, y_{i \cdot})$ comes from a distribution under $H_{0i}$ when $\tau_i = 0$, and the marginal density of each observation of this vector is defined by the same law, $N(\mu_{X_i}, \sigma_i^2)$, for both treatment conditions. Likewise, we consider that the vector of observations $(x_{i \cdot}, y_{i \cdot})$ comes from a distribution under $H_{1i}$ when the latent variable $\tau_i = 1$, for each $i = 1, 2, \ldots, N$; in this case, the marginal densities of the random variables $X_i$ and $Y_i$ are $N(\mu_{X_i}, \sigma_i^2)$ and $N(\mu_{Y_i}, \sigma_i^2)$, respectively. Please note that we take the variances of the populations from the two treatment conditions to be equal, $\sigma_i^2 = \sigma_{X_i}^2 = \sigma_{Y_i}^2$, $i = 1, 2, \ldots, N$; however, they could differ across hypotheses. (For simplicity's sake, we have considered $\sigma_{X_i}^2 = \sigma_{Y_i}^2$; the procedure is also applicable when the variances $\sigma_{X_i}^2$ and $\sigma_{Y_i}^2$ are different.)
The main objective of this paper is to identify genes differentially expressed under two experimental conditions. Therefore, we developed multiple hypothesis tests to decide between treatments, which is equivalent to testing the following hypotheses:
$$H_{0i}: \mu_{X_i} = \mu_{Y_i} \quad \text{versus} \quad H_{1i}: \mu_{X_i} \neq \mu_{Y_i}, \qquad i = 1, 2, \ldots, N \tag{9}$$
where $\mu_{X_i}$ and $\mu_{Y_i}$ are the means of $X_i$ and $Y_i$, respectively.
To model the dependence between the variables, we assume that the density for each condition is defined by an $N$-dimensional Gaussian copula, since it uses only the pairwise correlations among variables. This is done in precisely the same way that a multivariate normal distribution encodes the dependence between variables, and it allows for any marginal distribution and any positive-definite correlation matrix [36]. The copula densities are defined as follows:
$$c_X(u_X; \Sigma_X) = \frac{1}{\sqrt{|\Sigma_X|}} \exp\left\{-\frac{1}{2} \, \xi_X' \left(\Sigma_X^{-1} - I_N\right) \xi_X\right\}, \qquad c_Y(u_Y; \Sigma_Y) = \frac{1}{\sqrt{|\Sigma_Y|}} \exp\left\{-\frac{1}{2} \, \psi_Y' \left(\Sigma_Y^{-1} - I_N\right) \psi_Y\right\}$$
where $\xi_X = \left(\xi_{X_1} = \Phi^{-1}(u_{X_1}), \ldots, \xi_{X_N} = \Phi^{-1}(u_{X_N})\right)$ and $\psi_Y = \left(\psi_{Y_1} = \Phi^{-1}(u_{Y_1}), \ldots, \psi_{Y_N} = \Phi^{-1}(u_{Y_N})\right)$, $\Sigma_X$ and $\Sigma_Y$ are the copula correlation matrices for conditions $X$ and $Y$, respectively, and $\Phi$ is the standard normal CDF.
For the sake of simplicity, we considered the same dependency structure for the two treatment conditions when building the model. Consequently, the correlation matrix is denoted by $\Sigma_X = \Sigma_Y = \Sigma$. The normal scores $\xi_{X_i}$ and $\psi_{Y_i}$ are the quantiles of orders $u_{X_i}$ and $u_{Y_i}$, respectively, of the standard normal distribution $N(0, 1)$, $i = 1, 2, \ldots, N$.
Then, the joint density (5) defined through the Gaussian copula with normal marginal densities is
$$f(x_1, \ldots, x_N; y_1, \ldots, y_N \mid \Theta) = c_X(u_X; \Sigma) \prod_{i=1}^{N} f_i(x_i \mid \mu_{X_i}, \sigma_i^2) \times c_Y(u_Y; \Sigma) \prod_{i=1}^{N} f_i(y_i \mid \mu_{Y_i}, \sigma_i^2)$$
where $\Theta = (\mu_X, \mu_Y, \sigma^2, \Sigma)$, $\mu_X = (\mu_{X_1}, \ldots, \mu_{X_N})$, $\mu_Y = (\mu_{Y_1}, \ldots, \mu_{Y_N})$, and $\sigma^2 = (\sigma_1^2, \ldots, \sigma_N^2)$.
Suppose we observe $x_{\cdot j} = (x_{1j}, x_{2j}, \ldots, x_{Nj})$, $j = 1, 2, \ldots, n_x$, and $y_{\cdot k} = (y_{1k}, y_{2k}, \ldots, y_{Nk})$, $k = 1, 2, \ldots, n_y$, two independent random samples from $X = (X_1, \ldots, X_N)$ and $Y = (Y_1, \ldots, Y_N)$, where $n_x$ and $n_y$ represent the number of samples of $X$ and $Y$, respectively.
To obtain the likelihood, we consider the latent variables defined in (7), with $\mu_{X_i}$ and $\mu_{Y_i}$, $i = 1, \ldots, N$, the parameters of interest.
In accordance with the parametric model defined above (in Section 2), the likelihood (8) for the parameters $\Theta = (\mu_X, \mu_Y, \sigma^2, \Sigma)$ is defined as follows:
$$L(\Theta, \tau \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y}) = \prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \Sigma) \prod_{i=1}^{N} f_i(x_{ij} \mid \mu_{X_i}, \sigma_i^2) \times \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \Sigma) \prod_{i: \tau_i = 0} f_i(y_{ik} \mid \mu_{X_i}, \sigma_i^2) \prod_{i: \tau_i = 1} f_i(y_{ik} \mid \mu_{Y_i}, \sigma_i^2) \tag{10}$$
Due to the multiplicity of the problem and the need to estimate a large number of parameters, we used the uniform correlation structure matrix, in accordance with [36], to build a multivariate Gaussian copula that depends on a single correlation parameter $\rho$, defined as follows:
$$c_X(u_{X \cdot j}; \rho) = \left[\frac{1}{(1 + (N-1)\rho)(1-\rho)^{N-1}}\right]^{1/2} \exp\left\{-\frac{\rho}{2(1-\rho)} \cdot \frac{1}{1 + (N-1)\rho} \left[(N-1)\rho \sum_{i=1}^{N} \xi_{ij}^2 - 2 \sum_{i=1}^{N} \sum_{m > i} \xi_{ij} \xi_{mj}\right]\right\}$$
$$c_Y(u_{Y \cdot k}; \rho) = \left[\frac{1}{(1 + (N-1)\rho)(1-\rho)^{N-1}}\right]^{1/2} \exp\left\{-\frac{\rho}{2(1-\rho)} \cdot \frac{1}{1 + (N-1)\rho} \left[(N-1)\rho \sum_{i=1}^{N} \psi_{ik}^2 - 2 \sum_{i=1}^{N} \sum_{m > i} \psi_{ik} \psi_{mk}\right]\right\}$$
where
$$\rho \in \left(-\frac{1}{N-1}, \, 1\right) \tag{11}$$
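As a quick numerical check (a sketch, not our production code), the closed-form exchangeable Gaussian copula density above can be compared against the copula package, whose dispstr = "ex" option implements the same uniform correlation structure:

library(copula)
N   <- 5
rho <- 0.8                                      # must lie in (-1/(N-1), 1)
u   <- runif(N)                                 # a point in the unit hypercube
xi  <- qnorm(u)                                 # normal scores
S1  <- sum(xi^2)
S2  <- sum(outer(xi, xi)[upper.tri(diag(N))])   # sum of xi_i * xi_m over pairs i < m
dens <- ((1 + (N - 1) * rho) * (1 - rho)^(N - 1))^(-1/2) *
  exp(-rho / (2 * (1 - rho) * (1 + (N - 1) * rho)) *
        ((N - 1) * rho * S1 - 2 * S2))
all.equal(dens, dCopula(u, normalCopula(rho, dim = N, dispstr = "ex")))  # TRUE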
We use the copula function as a tool to describe the dependency relation between variables. There can be many ways of quantifying this relation. The most common is the Pearson correlation, although this measure is the most limited, as it reflects only linear dependence. As alternatives, the Kendall rank correlation or Spearman correlation can be used, as they are invariant under monotonic transformations.
Thus, the posterior distribution of the parameters is as follows:
$$\pi(\Theta, \tau \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y}) \propto \pi(\Theta) \, \pi(\tau \mid p) \, L(\Theta, \tau \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y})$$
To develop the Bayesian approach, we need to specify the prior distribution for the model parameters $(\Theta, \tau) = (\mu_X, \mu_Y, \sigma^2, p, \rho, \tau)$.

3.1. Prior and Posterior Distributions

To keep the model as simple as possible, complex as it already is, in this paper we assume independence among the prior distributions of the components of $\Theta$. However, it is possible to consider prior distributions that reflect some kind of dependency between parameters, as, for example, in [12,13]. Then, the joint prior distribution for the model parameters $(\Theta, \tau)$ is
$$\pi(\Theta, \tau) = \pi(\rho) \prod_{i=1}^{N} \pi(\mu_{X_i}) \, \pi(\mu_{Y_i}) \, \pi(\sigma_i^2) \, \pi(p_i) \, \pi(\tau_i \mid p_i) \tag{12}$$
where the prior distributions for the means of both treatment conditions, $\mu_{X_i}$ and $\mu_{Y_i}$, are uniform over the ranges $(a_{X_i}, b_{X_i})$ and $(a_{Y_i}, b_{Y_i})$, respectively, as indicated by [37].
The prior distribution for the variance $\sigma_i^2$ is defined through Jeffreys' prior density function, $\pi(\sigma_i^2) \propto \frac{1}{\sigma_i^2}$, $i = 1, 2, \ldots, N$.
The parameter $\tau_i$ is then introduced into the model, assuming that each latent variable follows a Bernoulli distribution, $\tau_i \mid p_i \sim \mathrm{Bernoulli}(1 - p_i)$.
We assume the prior distribution $p_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$, following [12,35], among others.
As explained in the previous section, we use an $N$-dimensional Gaussian copula function as the tool for modeling the dependence between variables through the Pearson correlation. The same dependency structure is used for the two treatment conditions, $\Sigma_X = \Sigma_Y = \Sigma$; likewise, the same correlation coefficient is used for both treatment conditions, $\rho_X = \rho_Y = \rho$, and it follows a uniform prior distribution over the range $(a, b)$.
Thus, the joint prior distribution for the model parameters (12) is defined as follows:
$$\pi(\Theta, \tau) \propto \frac{1}{b-a} \prod_{i=1}^{N} \frac{p_i^{1-\tau_i} (1-p_i)^{\tau_i} \, p_i^{\alpha_i - 1} (1-p_i)^{\beta_i - 1}}{(b_{X_i} - a_{X_i})(b_{Y_i} - a_{Y_i}) \, \sigma_i^2} \tag{13}$$
Then, given the data, the joint prior densities (13), and the likelihood (10), the posterior distribution of the parameters $(\Theta, \tau) = (\mu_X, \mu_Y, \sigma^2, p, \rho, \tau)$ is defined as follows:
$$\begin{aligned} \pi(\Theta, \tau \mid x_{\cdot 1}, \ldots, x_{\cdot n_x}; y_{\cdot 1}, \ldots, y_{\cdot n_y}) \propto{} & \frac{1}{b-a} \prod_{i=1}^{N} \frac{p_i^{1-\tau_i} (1-p_i)^{\tau_i} \, p_i^{\alpha_i - 1} (1-p_i)^{\beta_i - 1}}{(b_{X_i} - a_{X_i})(b_{Y_i} - a_{Y_i}) \, \sigma_i^2} \\ & \times \prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{i=1}^{N} f_i(x_{ij} \mid \mu_{X_i}, \sigma_i^2) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \\ & \times \prod_{i: \tau_i = 0} f_i(y_{ik} \mid \mu_{X_i}, \sigma_i^2) \prod_{i: \tau_i = 1} f_i(y_{ik} \mid \mu_{Y_i}, \sigma_i^2) \end{aligned} \tag{14}$$
As we can see, this joint posterior distribution is complex and has no known analytical expression. However, Bayesian inference may be performed using the MCMC approach, which can produce a Markov chain $\{(\Theta^{(l)}, \tau^{(l)}): l = 1, \ldots, M\}$ that converges to the joint posterior distribution. Consequently, we can estimate the parameters from the generated sample, for example, from the marginal means of these samples. In particular, we used the Metropolis-Hastings-within-Gibbs algorithm; for details, see [38,39] and the references therein.

3.2. Conditional Posterior Distributions

In this subsection, we derive the conditional posterior distributions of the model parameters to construct the MCMC chain.
Given the data and the remaining parameters, the conditional posterior probability of $\tau_i = 0$, $i = 1, 2, \ldots, N$, is derived as follows:
$$\Pr(\tau_i = 0 \mid \Theta, \tau_{-i}, X, Y) = \frac{p_i \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \exp\left\{-\frac{n_y}{2\sigma_i^2} (\bar{y}_{i \cdot} - \mu_{X_i})^2\right\}}{p_i \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \exp\left\{-\frac{n_y}{2\sigma_i^2} (\bar{y}_{i \cdot} - \mu_{X_i})^2\right\} + (1 - p_i) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \exp\left\{-\frac{n_y}{2\sigma_i^2} (\bar{y}_{i \cdot} - \mu_{Y_i})^2\right\}} \tag{15}$$
Therefore, the conditional posterior probability of $\tau_i = 1$, for $i = 1, 2, \ldots, N$, is
$$\Pr(\tau_i = 1 \mid \Theta, \tau_{-i}, X, Y) = 1 - \Pr(\tau_i = 0 \mid \Theta, \tau_{-i}, X, Y)$$
The conditional posterior distribution of each $p_i$, for $i = 1, 2, \ldots, N$, given the data and the remaining parameters, is
$$p_i \mid X, Y, \Theta_{-p_i} \sim \mathrm{Beta}(\alpha_i + 1 - \tau_i, \, \beta_i + \tau_i) \tag{16}$$
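Since (16) is a standard beta distribution, this step reduces to a direct Gibbs draw. A minimal sketch (the function and variable names are ours, for illustration only):

update_p <- function(tau, alpha, beta) {
  # One Gibbs draw per hypothesis from Beta(alpha + 1 - tau_i, beta + tau_i)
  rbeta(length(tau), shape1 = alpha + 1 - tau, shape2 = beta + tau)
}
update_p(c(0, 1, 0), alpha = 2, beta = 0.5)   # e.g., under a Beta(2, 0.5) prior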
The conditional posterior distributions of each $\mu_{X_i}$, for $i = 1, 2, \ldots, N$, given the data and the remaining parameters, when $\tau_i = 0$ and $\tau_i = 1$, are respectively defined by
$$\pi(\mu_{X_i} \mid X, Y, \tau_i = 0, \tau_{-i}, \Theta_{-\mu_{X_i}}) = \frac{\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \, f_i(\mu_{X_i})}{E_{f_i(\mu_{X_i})}\left[\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho)\right]} \tag{17}$$
where
$$f_i(\mu_{X_i}) \sim N\left(\frac{n_x \bar{x}_{i \cdot} + n_y \bar{y}_{i \cdot}}{n_x + n_y}, \; \frac{\sigma_i}{\sqrt{n_x + n_y}}\right) \tag{18}$$
$$\pi(\mu_{X_i} \mid X, Y, \tau_i = 1, \tau_{-i}, \Theta_{-\mu_{X_i}}) = \frac{\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \, f_i(\mu_{X_i})}{E_{f_i(\mu_{X_i})}\left[\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho)\right]} \tag{19}$$
and where
$$f_i(\mu_{X_i}) \sim N\left(\bar{x}_{i \cdot}, \; \frac{\sigma_i}{\sqrt{n_x}}\right) \tag{20}$$
The conditional posterior distributions for each $\mu_{Y_i}$ of the model, when $\tau_i = 1$, for $i = 1, 2, \ldots, N$, given the data and the remaining model parameters, are defined by
$$\pi(\mu_{Y_i} \mid X, Y, \tau_i = 1, \tau_{-i}, \Theta_{-\mu_{Y_i}}) = \frac{\prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \, f_i(\mu_{Y_i})}{E_{f_i(\mu_{Y_i})}\left[\prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho)\right]} \tag{21}$$
where
$$f_i(\mu_{Y_i}) \sim N\left(\bar{y}_{i \cdot}, \; \frac{\sigma_i}{\sqrt{n_y}}\right) \tag{22}$$
The conditional posterior distributions for each $\sigma_i^2$ of the model, for $i = 1, 2, \ldots, N$, given the data and the remaining model parameters, when $\tau_i = 0$ and $\tau_i = 1$, are respectively defined by
$$\pi(\sigma_i^2 \mid X, Y, \tau_i = v, \tau_{-i}, \Theta_{-\sigma_i^2}) = \frac{\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \, f_i(\sigma_i^2)}{E_{f_i(\sigma_i^2)}\left[\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho)\right]}, \qquad v \in \{0, 1\} \tag{23}$$
where
if $v = 0$: $f_i(\sigma_i^2) \sim \mathrm{InverseGamma}\left(\frac{n_x + n_y}{2}, \frac{A}{2}\right)$, with $A = \sum_{j=1}^{n_x} (x_{ij} - \mu_{X_i})^2 + \sum_{k=1}^{n_y} (y_{ik} - \mu_{X_i})^2 \tag{24}$
if $v = 1$: $f_i(\sigma_i^2) \sim \mathrm{InverseGamma}\left(\frac{n_x + n_y}{2}, \frac{B}{2}\right)$, with $B = \sum_{j=1}^{n_x} (x_{ij} - \mu_{X_i})^2 + \sum_{k=1}^{n_y} (y_{ik} - \mu_{Y_i})^2 \tag{25}$
Finally, the conditional posterior distribution for $\rho$, given the data and the remaining parameters, is
$$\pi(\rho \mid X, Y, \Theta_{-\rho}, \tau) = \frac{\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho)}{\displaystyle\int \prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \, d\rho} \tag{26}$$
Please note that the conditional posterior distributions in Equations (17), (19), (21), (23) and (26) have no analytic form.
For the computational implementation of the algorithm, we used R, as it is free statistical software and provides an easy structure for manipulating complex models.
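For instance, the update of $\rho$ reduces to a single Metropolis-Hastings step with a uniform independence candidate, mirroring Algorithm 8 in Appendix A. In the sketch below, log_copula_prod() is an assumed helper returning the log of the product of all copula densities appearing in (26) at a given $\rho$; it is not a function from our code or from any package:

mh_step_rho <- function(rho_curr, a, b, log_copula_prod) {
  rho_cand <- runif(1, a, b)                # independence candidate from U(a, b)
  log_A4 <- log_copula_prod(rho_cand) -     # log acceptance ratio; the uniform
            log_copula_prod(rho_curr)       # proposal densities cancel, as does
                                            # the normalizing integral in (26)
  if (log(runif(1)) <= min(0, log_A4)) rho_cand else rho_curr
}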

3.3. MCMC Algorithm (Metropolis-Hastings-within-Gibbs Algorithm)

We make use of the Metropolis-Hastings-within-Gibbs algorithm, based on MCMC sampling strategies, to obtain a sample from the joint posterior distribution (14). The structure of the proposed MCMC method is implemented as follows; the detailed algorithm is described in Appendix A.
Algorithm 1 MCMC Algorithm
Require: initial values $(\Theta^{(0)}, \tau^{(0)}) = (\mu_X^{(0)}, \mu_Y^{(0)}, \sigma^{2(0)}, p^{(0)}, \rho^{(0)}, \tau^{(0)})$, where $\tau^{(0)} = (\tau_1^{(0)}, \ldots, \tau_N^{(0)})$, $p^{(0)} = (p_1^{(0)}, \ldots, p_N^{(0)})$, $\mu_X^{(0)} = (\mu_{X_1}^{(0)}, \ldots, \mu_{X_N}^{(0)})$, $\mu_Y^{(0)} = (\mu_{Y_1}^{(0)}, \ldots, \mu_{Y_N}^{(0)})$, $\sigma^{2(0)} = (\sigma_1^{2(0)}, \ldots, \sigma_N^{2(0)})$
Procedure
1: Let the current state of the Markov chain be $(\Theta^{(l)}, \tau^{(l)}) = (\mu_X^{(l)}, \mu_Y^{(l)}, \sigma^{2(l)}, p^{(l)}, \rho^{(l)}, \tau^{(l)})$
2: for $l \leftarrow 1:M$ do
3:  Update $\tau_i^{(l)}$, for $i = 1, \ldots, N$  ▹ by sampling from (15)
4:  Update $p_i^{(l)}$, for $i = 1, \ldots, N$  ▹ by sampling from (16)
5:  Update $\mu_{X_i}^{(l)}$, for $i = 1, \ldots, N$  ▹ by sampling from (17) and (19) when $\tau_i^{(l+1)} = 0$ and $\tau_i^{(l+1)} = 1$, respectively
6:  Update $\mu_{Y_i}^{(l)}$, for $i = 1, \ldots, N$  ▹ by sampling from (21)
7:  Update $\sigma_i^{2(l)}$, for $i = 1, \ldots, N$  ▹ by sampling from (23), with (24) when $\tau_i^{(l+1)} = 0$ and with (25) when $\tau_i^{(l+1)} = 1$
8:  Update $\rho^{(l)}$  ▹ by sampling from (26)
9: end for
End Procedure: Return $\{(\Theta^{(l)}, \tau^{(l)}): l = 1, \ldots, M\}$
Given an MCMC sample $\{(\mu_X^{(l)}, \mu_Y^{(l)}, \sigma^{2(l)}, p^{(l)}, \tau^{(l)}, \rho^{(l)})\}_{l=1}^{M}$, with $\mu_X = (\mu_{X_1}, \ldots, \mu_{X_N})$, $\mu_Y = (\mu_{Y_1}, \ldots, \mu_{Y_N})$, $\sigma^2 = (\sigma_1^2, \ldots, \sigma_N^2)$, $p = (p_1, \ldots, p_N)$ and $\tau = (\tau_1, \ldots, \tau_N)$, obtained from the Metropolis-Hastings-within-Gibbs sampling algorithm, we can obtain estimates of the posterior marginal means as follows:
$$\hat{\mu}_{X_i} = E[\mu_{X_i} \mid X, Y] \approx \frac{1}{M} \sum_{l=1}^{M} \mu_{X_i}^{(l)} \tag{27}$$
$$\hat{\mu}_{Y_i} = E[\mu_{Y_i} \mid X, Y] \approx \frac{1}{M} \sum_{l=1}^{M} \mu_{Y_i}^{(l)} \tag{28}$$
$$\hat{\sigma}_i^2 = E[\sigma_i^2 \mid X, Y] \approx \frac{1}{M} \sum_{l=1}^{M} \sigma_i^{2(l)} \tag{29}$$
$$\hat{p}_i = E[p_i \mid X, Y] \approx \frac{1}{M} \sum_{l=1}^{M} p_i^{(l)}, \qquad \hat{\rho} = E[\rho \mid X, Y] \approx \frac{1}{M} \sum_{l=1}^{M} \rho^{(l)} \tag{30}$$
for each $i = 1, 2, \ldots, N$.
We can also approximate the posterior probability of the alternative hypothesis by
$$\Pr(\tau_i = 1 \mid X, Y) = \Pr(H_{1i} \mid X, Y) \approx \frac{1}{M} \sum_{l=1}^{M} I(\tau_i^{(l)} = 1) \tag{31}$$
for $i = 1, 2, \ldots, N$.
Please note that we can use these posterior probabilities to solve the multiple hypothesis testing problem.
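In practice, if the $\tau$ draws are stored as an $M \times N$ matrix (an assumed layout: one row per MCMC iteration, one column per hypothesis), (31) is simply a column mean:

post_prob_H1 <- colMeans(tau_draws)     # estimates of Pr(H1i | X, Y), i = 1, ..., N
rejected <- which(post_prob_H1 > 0.5)   # e.g., reject H0i when this probability exceeds 1/2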

3.4. Simulation Study

We performed a simulation with $N = 50$ simultaneous hypotheses and $n = 20$ observations per hypothesis/gene, where $n_x = 7$ and $n_y = 13$ observations correspond to Treatment Conditions 1 and 2, respectively, since microarray data typically have a smaller number of samples than genes/hypotheses.
In this context, the datasets for both experimental conditions were generated following multivariate Gaussian distributions $N_{50}(\mu_X, \Sigma)$ and $N_{50}(\mu_Y, \Sigma)$, with the same variance-covariance matrix $\Sigma$ and correlation coefficient $\rho = 0.8$. The vector of means $\mu_X$ for Condition 1 was defined in the range (1150, 1160), and the vector of means $\mu_Y$ for Condition 2 was defined in the range (1165, 1180). The vector of standard deviations was defined in the range (8, 16).
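A sketch of this data-generating step (not the code used for the study) is shown below; mvrnorm() is from the MASS package, and the 80% line illustrates how a dataset with a given proportion of true null hypotheses can be obtained:

library(MASS)
N <- 50; n_x <- 7; n_y <- 13; rho <- 0.8
mu_X <- runif(N, 1150, 1160)                  # Condition 1 means
mu_Y <- runif(N, 1165, 1180)                  # Condition 2 means
sdev <- runif(N, 8, 16)                       # standard deviations
idx0 <- sample(N, 0.8 * N)                    # e.g., 80% true null hypotheses:
mu_Y[idx0] <- mu_X[idx0]                      # equalize those means
R <- matrix(rho, N, N); diag(R) <- 1          # uniform (exchangeable) correlation
Sigma <- diag(sdev) %*% R %*% diag(sdev)      # variance-covariance matrix
X <- mvrnorm(n_x, mu_X, Sigma)                # n_x observations, Condition 1
Y <- mvrnorm(n_y, mu_Y, Sigma)                # n_y observations, Condition 2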
We considered simultaneous updates of the vector $(\Theta, \tau) = (\mu_X, \mu_Y, \sigma^2, p, \tau, \rho)$ of model parameters. The proposed algorithm was run for 15,000 iterations of the Metropolis-Hastings-within-Gibbs sampler, discarding the first 7500 as burn-in.
To implement the proposed algorithm (i.e., the Metropolis-Hastings-within-Gibbs algorithm), we drew candidates using an independence chain, exploiting the form of the conditional posterior distributions in Equations (17), (19), (21), (23), and (26), and setting the candidate-generating densities in Equations (18), (20), (22), (24), and (25) for the parameters $\mu_X$, $\mu_Y$, $\sigma^2$. In addition, we selected the uniform distribution on the range $(0.6, 0.9)$ as the candidate-generating density for the dependence parameter $\rho$.
The simulation compared the performance of our approach on three simulated datasets, assuming 80%, 50%, and 20% true null hypotheses, all generated with $N = 50$. For the prior distribution $p_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$, we assumed the same $\alpha$ and $\beta$ parameters for all $i = 1, \ldots, N$ for the sake of simplicity, i.e., $p_i \sim \mathrm{Beta}(\alpha, \beta)$ for the three simulated datasets. Furthermore, to analyze the sensitivity with respect to the choice of the beta distribution parameters, we selected the following values for $(\alpha, \beta)$: $(0.5, 1)$, $(1, 1)$, $(1, 0.5)$, and $(2, 0.5)$; these parameters yield very different distributions. As in [14,40,41], we obtain the FDR from the expected false discovery rate introduced by [42,43]. The FDR was estimated using an MCMC sample obtained from the Metropolis-Hastings-within-Gibbs algorithm. Table 1, Table 2 and Table 3 present the simulation results.
As shown in Table 1 and Table 2, for both data structures the results were highly sensitive to the choice of the parameters $\alpha$ and $\beta$ of the prior distribution on $p_i$. These parameters have a considerable influence on the results, since we can observe significant differences in the estimates. The parameters $\alpha$ and $\beta$ perform better when the prior distribution is skewed to the right, for instance, when we assume the prior distributions $\mathrm{Beta}(1, 0.5)$ and $\mathrm{Beta}(2, 0.5)$.
To confirm that this is in fact better for our approach, we simulated a dataset assuming 20% true null hypotheses. The simulation results are presented in Table 3. These results are similar to those in Table 1 and Table 2; i.e., the estimates are closer to the truth when the prior distribution is skewed to the right, even with a low percentage of true null hypotheses. Therefore, we conclude that our model yields good results whenever the prior distribution for $p_i$ is skewed to the right, regardless of the number of true null hypotheses.
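For completeness, one standard way to estimate the expected FDR from the posterior probabilities of (31) is sketched below (the explicit formula is our rendering of the usual posterior-expected-FDR estimator, under the assumption that it matches the estimator of [42,43] cited above): among the hypotheses rejected at a threshold t, average the posterior probabilities that the nulls are true.

bayes_fdr <- function(post_prob_H1, t = 0.5) {
  rej <- post_prob_H1 > t                # rejection region at threshold t
  if (!any(rej)) return(0)
  mean(1 - post_prob_H1[rej])            # average posterior null probability among rejections
}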

4. Modeling Dependence Through N-Dimensional Clayton Copulas with Normal Marginal Densities

The Archimedean family includes a large number of copulas with different characteristics, and it allows multivariate distributions to be modeled using a single univariate function, thus simplifying the calculations.
As in the previous section, we consider normal marginal densities, but the dependency is modeled using an N-dimensional Clayton copula of the Archimedean family. This copula has already been used to model dependency in gene expression analysis, as can be seen in [8].
In this case, the likelihood is expressed as in (10), but now $c_X(u_{X \cdot j}; \theta_c)$ and $c_Y(u_{Y \cdot k}; \theta_c)$ are the Clayton copula densities, given by:
$$c_X(u_{X \cdot j}; \theta_c) = \left(1 - N + \sum_{i=1}^{N} u_{X_{ij}}^{-\theta_c}\right)^{-N - (1/\theta_c)} \prod_{l=1}^{N} u_{X_{lj}}^{-\theta_c - 1} \left(\theta_c (l - 1) + 1\right)$$
$$c_Y(u_{Y \cdot k}; \theta_c) = \left(1 - N + \sum_{i=1}^{N} u_{Y_{ik}}^{-\theta_c}\right)^{-N - (1/\theta_c)} \prod_{l=1}^{N} u_{Y_{lk}}^{-\theta_c - 1} \left(\theta_c (l - 1) + 1\right) \tag{32}$$
where $\theta_c$ is the dependency parameter of the Clayton copula, $u_{X \cdot j} = (u_{X_{1j}}, \ldots, u_{X_{Nj}})$ and $u_{Y \cdot k} = (u_{Y_{1k}}, \ldots, u_{Y_{Nk}})$, with $u_{X_{ij}} = F(x_{ij})$ and $u_{Y_{ik}} = F(y_{ik})$, $i = 1, \ldots, N$, $j = 1, 2, \ldots, n_x$ and $k = 1, 2, \ldots, n_y$.
Then, the parameter vector for the model is $\Theta = (\mu_X, \mu_Y, \sigma^2, p, \theta_c)$, where $\mu_X = (\mu_{X_1}, \ldots, \mu_{X_N})$, $\mu_Y = (\mu_{Y_1}, \ldots, \mu_{Y_N})$, $\sigma^2 = (\sigma_1^2, \ldots, \sigma_N^2)$ and $p = (p_1, \ldots, p_N)$, $p_i$ being the initial probability of each null hypothesis.
Thus, given the data, the joint prior densities as in (13), with a uniform distribution for $\theta_c$ over $(a, b)$, and the likelihood, the posterior distribution takes the same form as (14), with $c_X(u_{X \cdot j}; \theta_c)$ and $c_Y(u_{Y \cdot k}; \theta_c)$ the Clayton copulas.
As in Section 3.1, this posterior distribution has no analytic form, but Bayesian inference may be performed using MCMC methods. The conditional posterior distributions of the parameters, which are necessary to build the algorithm, are the same as those described in Section 3.2, replacing the Gaussian copulas by Clayton copulas. Therefore, the same Metropolis-Hastings-within-Gibbs algorithm is used, substituting the Clayton copulas defined in (32) for the Gaussian copulas.
Likewise, given an MCMC sample $\{(\mu_X^{(l)}, \mu_Y^{(l)}, \sigma^{2(l)}, p^{(l)}, \tau^{(l)}, \theta_c^{(l)})\}_{l=1}^{M}$, with $\mu_X = (\mu_{X_1}, \ldots, \mu_{X_N})$, $\mu_Y = (\mu_{Y_1}, \ldots, \mu_{Y_N})$, $\sigma^2 = (\sigma_1^2, \ldots, \sigma_N^2)$, $p = (p_1, \ldots, p_N)$ and $\tau = (\tau_1, \ldots, \tau_N)$, obtained from the Metropolis-Hastings-within-Gibbs sampling algorithm, we can obtain estimates of the posterior marginal means as in (27), (28), (29) and (30). Analogously, we can estimate the parameter of the Clayton copula as follows:
$$\hat{\theta}_c = E[\theta_c \mid X, Y] \approx \frac{1}{M} \sum_{l=1}^{M} \theta_c^{(l)}$$
Finally, we can also approximate the posterior probability of the alternative hypothesis through (31).

4.1. Simulation Study

In this section, the same data sets from Section 3.4 were used. When working with non-elliptical distributions, Kendall's $\tau$ coefficient should be used instead of the Pearson correlation coefficient [36]. Under normality, the two coefficients are related as follows:
$$\tau = \frac{2}{\pi} \arcsin(\rho) \quad \Longleftrightarrow \quad \rho = \sin\left(\frac{\pi}{2} \tau\right) \tag{33}$$
while Kendall's coefficient and the Clayton copula parameter $\theta_c$ are also related:
$$\tau(\theta_c) = \frac{\theta_c}{\theta_c + 2} \tag{34}$$
Therefore, we select the uniform prior distribution over the interval ( 1.3 , 4 ) for θ c as the candidate-generating density for the MCMC algorithm, since, from (33) and (34), this interval corresponds to the interval ( 0.58 , 0.87 ) for Pearson coefficient.
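This correspondence is easy to verify numerically; the small sketch below chains (34) and (33) to map the endpoints of the $\theta_c$ interval to the Pearson scale:

theta_to_rho <- function(theta) {
  tau <- theta / (theta + 2)   # Kendall's tau from theta_c, Equation (34)
  sin(pi * tau / 2)            # Pearson rho from tau, Equation (33)
}
theta_to_rho(c(1.3, 4))        # approximately 0.58 and 0.87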
To carry out the sensitivity analysis, the same values as in Section 3.4 were considered for the parameters $(\alpha, \beta)$ of the beta prior distribution for $p_i$: $(0.5, 1)$, $(1, 1)$, $(1, 0.5)$ and $(2, 0.5)$.
The Metropolis-Hastings-within-Gibbs algorithm described in Section 3.3 was used, replacing the Gaussian copulas by Clayton copulas. The algorithm was run for 15,000 iterations, discarding the first 7500 as burn-in. The FDR was also obtained as in Section 3.4.
As shown in Table 4, Table 5 and Table 6, the results were highly sensitive to the choice of the parameters $\alpha$ and $\beta$ of the prior distribution on $p_i$, as happened for the model with Gaussian copulas. Likewise, the procedure is more accurate when the prior distribution of $p_i$ is skewed to the right, for instance, when the priors $\mathrm{Beta}(1, 0.5)$ and $\mathrm{Beta}(2, 0.5)$ are assumed. However, the FDR obtained is higher than in the Gaussian copula model.

4.2. Model Selection

To compare the models with normal marginal distributions using Gaussian and Clayton copulas, we use the Deviance Information Criterion (DIC). A model with a smaller DIC should be preferred to models with a larger DIC [44]. The DIC value is given by:
$$DIC = -4 \, E_{\Theta, \tau}\left[\log L(\Theta, \tau \mid X, Y) \,\middle|\, X, Y\right] + 2 \log L\left(E_{\Theta, \tau}\left[\Theta, \tau \,\middle|\, X, Y\right] \,\middle|\, X, Y\right) \tag{35}$$
Given an MCMC sample of size $M$ from the posterior distribution, the DIC value (35) can be approximated by
$$DIC \approx -\frac{4}{M} \sum_{l=1}^{M} \log L(\Theta^{(l)}, \tau^{(l)} \mid X, Y) + 2 \log L\left(\frac{1}{M} \sum_{l=1}^{M} \Theta^{(l)}, \frac{1}{M} \sum_{l=1}^{M} \tau^{(l)} \,\middle|\, X, Y\right) \tag{36}$$
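A sketch of this approximation is given below; loglik() is an assumed helper returning $\log L(\Theta, \tau \mid X, Y)$, and the posterior draws are assumed to be stored one per row:

dic_mcmc <- function(theta_draws, tau_draws, loglik) {
  M  <- nrow(theta_draws)
  ll <- sapply(seq_len(M), function(l) loglik(theta_draws[l, ], tau_draws[l, ]))
  # Equation (36): -4 * posterior mean of log L + 2 * log L at the posterior means
  -4 * mean(ll) + 2 * loglik(colMeans(theta_draws), colMeans(tau_draws))
}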
The DIC value was obtained for the model with Gaussian copulas and for the model with Clayton copulas, in both cases using the right-skewed prior distributions for $p_i$, $\mathrm{Beta}(1, 0.5)$ and $\mathrm{Beta}(2, 0.5)$. The results are shown in Table 7 for the data sets with 80%, 50% and 20% true null hypotheses, respectively.
As can be seen in Table 7, the smallest DIC values correspond to the model with Gaussian copulas, as expected given how the data were generated. On the other hand, there were no important differences between the DIC values corresponding to the parameters $(1, 0.5)$ and $(2, 0.5)$ of the beta prior distribution. Furthermore, the FDR values are lower using Gaussian copulas than Clayton copulas. Thus, the model with Gaussian copulas is more suitable for our simulated data.

5. Application to a Real Data Set: DNA Microarrays

The procedure described in the previous sections is applied to a DNA microarray data set. This data set consists of 38 genes obtained from duodenal biopsy tissues of 13 children with celiac disease with a mean age of 5.6 (±0.6) years and 7 control children with a mean age of 8.1 (±2.2) years, belonging to part of the study carried out in [45]. The data are available in the NCBI-GEO datasets (The National Center for Biotechnology Information-Gene Expression Omnibus) under accession number GSE76168.
The aim is to identify differentially expressed genes. Thus, we simultaneously tested whether there are significant differences between celiac patients and controls in the mean expression level of the 38 genes; i.e., we consider the multiple hypothesis tests given in (9), and the main objective is to obtain the posterior probability of each null hypothesis.
The Bayesian procedure described in [12,14] was applied in [45], considering independence between the gene expression levels. However, there is correlation between the expression levels of these genes. To model the data, we consider normal marginal densities, as in [45], and the dependency is modeled first through Gaussian copulas and then through Clayton copulas.

5.1. Modeling Dependency through N-Dimensional Gaussian Copulas

To apply the Metropolis-Hastings-within-Gibbs algorithm described in Section 3.3, we need to consider a candidate-generating density for the dependency parameter $\rho$. Since there are positive and negative correlations, the most natural choice would be the uniform distribution on the range $(-1, 1)$. However, due to the constraint (11), we consider as the candidate-generating distribution for $\rho$ the uniform distribution on the range $(-0.027, 1)$.
For the prior distribution on $p_i$, we consider the $\mathrm{Beta}(2, 0.5)$ distribution because, according to the results obtained in Section 3.4, this is the prior distribution that produces the most accurate results when using Gaussian copulas.
The algorithm was run for 40,000 iterations, discarding the first 20,000 as burn-in. Comparing our results, which use a data dependency structure, with those of [45], which assume independence, different conclusions are obtained: we identified 16 differentially expressed genes, while they found 15; additionally, 4 of the 16 genes we identified do not coincide with those identified by their procedure.

5.2. Modeling Dependence Through N-Dimensional Clayton Copulas

In this subsection, the same Metropolis-Hastings-within-Gibbs algorithm, with the Gaussian copulas replaced by Clayton copulas, is applied to the data of [45].
For $\theta_c$, the uniform distribution over $(0.01, 5)$ was taken as the candidate-generating density. For the parameters $p_i$, we chose a $\mathrm{Beta}(1, 0.5)$ prior distribution because, according to the results obtained in Section 4.1, this is the prior distribution that produces the most accurate results when using Clayton copulas. Likewise, the algorithm was run for 40,000 iterations, discarding the first 20,000 as burn-in.
In this case, we found 22 differentially expressed genes, 6 more than when using Gaussian copulas; comparing with the results of [45], we identified the 15 genes that they had already identified, plus 7 additional ones.
Finally, DIC = 1487.213 was obtained for the model with Gaussian copulas and DIC = 1504.445 for the model with Clayton copulas. Therefore, the model with Gaussian copulas and normal marginal densities turns out to be the most suitable for the celiac disease data of [45], because this model attains the smallest DIC value.

6. Conclusions

The proposed approach is very useful when many hypotheses are tested simultaneously under the assumption of dependence. In the proposed procedure for testing multiple hypotheses, the full data are used directly, rather than test statistics, especially for modeling the dependency structure. Therefore, the modeling process is more complex than when using test statistics, and this presents computational problems when thousands of hypotheses are tested simultaneously with a large sample size. In any case, all available information, both objective and subjective, can be used.
In the field of genomics, the normal distribution is widely used to model gene expression data. Thus, we adopted normal marginal distributions and modeled the dependency structure through the Gaussian copula, which shares the properties of a multivariate normal distribution. We opted to use a uniform correlation matrix to reduce the dimensionality of the parameters. To model the dependency structure, we also considered the Clayton copula of the Archimedean family, which enables modeling of multivariate distributions using a single univariate function, thus simplifying calculations. The proposed approach is flexible insofar as it can be used with other, more realistic correlation matrices, with other copula functions to model the dependence, and with other marginal distributions.
For the model with Gaussian copulas, which had the lower DIC value and was therefore the most appropriate for our simulated data, the results demonstrated that the procedure fits the dependence well: the estimated correlation coefficient was close to the true value with which the data were generated. However, the procedure is not robust with respect to the choice of prior distribution for the initial probability of each null hypothesis. Nevertheless, in all simulated examples, the procedure rejected almost all false null hypotheses when we used a right-skewed beta prior distribution. Therefore, our proposal turns out to be a very powerful procedure for testing multiple hypotheses.
In the cases analyzed using the Gaussian copula model, the highest FDR value obtained was 0.079 when using a right-skewed beta prior distribution. However, this need not be an inconvenience in the context of DNA microarray experiments because, as explained above, the main objective of many of these studies is to obtain the largest possible number of potentially expressed genes, on which more detailed studies can subsequently be carried out. As a result, at this phase of the analysis, we can tolerate more false positives in order to obtain the largest possible number of interesting genes.

Author Contributions

Conceptualization, E.C.J.M., I.S., L.S. and M.A.G.-V.; methodology, E.C.J.M., I.S., L.S. and M.A.G.-V.; software, E.C.J.M. and I.S.; validation, E.C.J.M.; formal analysis, E.C.J.M., I.S., L.S. and M.A.G.-V.; writing—original draft preparation, E.C.J.M. and I.S.; writing—review and editing, I.S., L.S. and M.A.G.-V.; visualization, E.C.J.M., I.S., L.S. and M.A.G.-V.; supervision, I.S. and L.S.; project administration, I.S., L.S. and M.A.G.-V.; funding acquisition, I.S., L.S. and M.A.G.-V. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Universidad Complutense de Madrid, Spain, grant GR29/20 and research group HUMLOG UCM-GR970643.

Conflicts of Interest

The authors declare no conflicts of interest, financial or ethical, of any kind.

Appendix A. MCMC Algorithm

In this appendix, we explain the proposed MCMC algorithm in detail. We constructed a Metropolis-Hastings-within-Gibbs sampling scheme that combines the Gibbs and Metropolis-Hastings sampling strategies. The scheme involves iteratively sampling from the Gibbs algorithm for standard conditional posterior distributions and from a single iteration of the Metropolis-Hastings algorithm for non-standard conditional posterior distributions.
Algorithm 2 MCMC
Require: initial values $(\Theta^{(0)}, \tau^{(0)}) = (\mu_X^{(0)}, \mu_Y^{(0)}, \sigma^{2(0)}, p^{(0)}, \rho^{(0)}, \tau^{(0)})$, where $\tau^{(0)} = (\tau_1^{(0)}, \ldots, \tau_N^{(0)})$, $p^{(0)} = (p_1^{(0)}, \ldots, p_N^{(0)})$, $\mu_X^{(0)} = (\mu_{X_1}^{(0)}, \ldots, \mu_{X_N}^{(0)})$, $\mu_Y^{(0)} = (\mu_{Y_1}^{(0)}, \ldots, \mu_{Y_N}^{(0)})$, $\sigma^{2(0)} = (\sigma_1^{2(0)}, \ldots, \sigma_N^{2(0)})$
Procedure
1: Let the current state of the Markov chain be $(\Theta^{(l)}, \tau^{(l)}) = (\mu_X^{(l)}, \mu_Y^{(l)}, \sigma^{2(l)}, p^{(l)}, \rho^{(l)}, \tau^{(l)})$
2: for $l \leftarrow 1:M$ do
3:  Update $\tau_i$ by sampling $\tau_i^{(l+1)}$, for $i = 1, \ldots, N$  ▹ Algorithm 3
4:  Update $p_i$ by sampling $p_i^{(l+1)}$, for $i = 1, \ldots, N$  ▹ Algorithm 4
5:  Update $\mu_{X_i}$ by sampling $\mu_{X_i}^{(l+1)}$, for $i = 1, \ldots, N$  ▹ Algorithm 5
6:  Update $\mu_{Y_i}$ by sampling $\mu_{Y_i}^{(l+1)}$, for $i = 1, \ldots, N$  ▹ Algorithm 6
7:  Update $\sigma_i^2$ by sampling $\sigma_i^{2(l+1)}$, for $i = 1, \ldots, N$  ▹ Algorithm 7
8:  Update $\rho$ by sampling $\rho^{(l+1)}$  ▹ Algorithm 8
9: end for
End Procedure: Return $\{(\Theta^{(l)}, \tau^{(l)}): l = 1, \ldots, M\}$
Algorithm 3 MCMC for $\tau_i$, $i = 1, \ldots, N$
Require: initial values $(\Theta^{(0)}, \tau^{(0)}) = (\mu_X^{(0)}, \mu_Y^{(0)}, \sigma^{2(0)}, p^{(0)}, \rho^{(0)}, \tau^{(0)})$, where $\tau^{(0)} = (\tau_1^{(0)}, \ldots, \tau_N^{(0)})$, $p^{(0)} = (p_1^{(0)}, \ldots, p_N^{(0)})$, $\mu_X^{(0)} = (\mu_{X_1}^{(0)}, \ldots, \mu_{X_N}^{(0)})$, $\mu_Y^{(0)} = (\mu_{Y_1}^{(0)}, \ldots, \mu_{Y_N}^{(0)})$, $\sigma^{2(0)} = (\sigma_1^{2(0)}, \ldots, \sigma_N^{2(0)})$
Procedure
1: Update $\mu_Y^{(0)} = (\mu_{Y_1}^{(0)}, \ldots, \mu_{Y_N}^{(0)})$  ▹ if $\tau_i^{(0)} = 0$, set $\mu_{Y_i}^{(0)} = \mu_{X_i}^{(0)}$; otherwise keep $\mu_{Y_i}^{(0)}$
2: Calculate the copula $c_Y(u_Y; \rho^{(0)})$
3: Let the current state of the Markov chain be $(\Theta^{(l)}, \tau^{(l)}) = (\mu_X^{(l)}, \mu_Y^{(l)}, \sigma^{2(l)}, p^{(l)}, \rho^{(l)}, \tau^{(l)})$
4: for $i \leftarrow 1:N$ do
5:  Calculate $K_i = \Pr(\tau_i^{(l+1)} = 0 \mid \mu_X^{(l)}, \mu_Y^{(l)}, \sigma^{2(l)}, p_i^{(l)}, \rho^{(l)}, \tau_{j<i}^{(l+1)}, \tau_{j>i}^{(l)})$  ▹ Equation (15)
6:  Generate a random uniform number $U_i \sim U(0, 1)$
7:  if $U_i \le K_i$ then
8:   $\tau_i^{(l+1)} = 0$
9:   Update $\mu_{Y_i}^{(l)} = \mu_{X_i}^{(l)}$
10:  else
11:   $\tau_i^{(l+1)} = 1$
12:  end if
13:  Update $\mu_Y^{(l)} = (\mu_{Y_1}^{(l)}, \ldots, \mu_{Y_N}^{(l)})$  ▹ if $\tau_i^{(l+1)} = 0$, $\mu_{Y_i}^{(l)} = \mu_{X_i}^{(l)}$; otherwise keep $\mu_{Y_i}^{(l)}$
14:  Calculate the copula $c_Y(u_Y; \rho^{(l)})$
15: end for
End Procedure: Return $\tau_i^{(l+1)}$, $i = 1, \ldots, N$
Algorithm 4 MCMC: Gibbs for $p_i$, $i = 1, \ldots, N$
Require: current values $(\Theta^{(l)}, \tau^{(l+1)})$, where $\tau^{(l+1)} = (\tau_1^{(l+1)}, \ldots, \tau_N^{(l+1)})$, $p^{(l)} = (p_1^{(l)}, \ldots, p_N^{(l)})$, $\mu_X^{(l)} = (\mu_{X_1}^{(l)}, \ldots, \mu_{X_N}^{(l)})$, $\mu_Y^{(l)} = (\mu_{Y_1}^{(l)}, \ldots, \mu_{Y_N}^{(l)})$, $\sigma^{2(l)} = (\sigma_1^{2(l)}, \ldots, \sigma_N^{2(l)})$
Procedure
1: Let the current state of the Markov chain be $(\Theta^{(l)}, \tau^{(l+1)}) = (\mu_X^{(l)}, \mu_Y^{(l)}, \sigma^{2(l)}, p^{(l)}, \rho^{(l)}, \tau^{(l+1)})$
2: for $i \leftarrow 1:N$ do
3:  Update $p_i$ by sampling $p_i^{(l+1)} \sim \pi(p_i \mid \tau_i^{(l+1)})$  ▹ Equation (16)
4: end for
End Procedure: Return $p_i^{(l+1)}$, $i = 1, \ldots, N$
Algorithm 5 Single iteration of the Metropolis-Hastings for $\mu_{X_i}$, $i = 1, \ldots, N$
Require: current values $\tau^{(l+1)} = (\tau_1^{(l+1)}, \ldots, \tau_N^{(l+1)})$, $\mu_X^{(l)} = (\mu_{X_1}^{(l)}, \ldots, \mu_{X_N}^{(l)})$, $\mu_Y^{(l)} = (\mu_{Y_1}^{(l)}, \ldots, \mu_{Y_N}^{(l)})$, $\sigma^{2(l)} = (\sigma_1^{2(l)}, \ldots, \sigma_N^{2(l)})$, $\rho^{(l)}$
Procedure
1: for $i \leftarrow 1:N$ do
2:  Sample a candidate $\mu_{X_i}^{(c)}$  ▹ from (18) if $\tau_i^{(l+1)} = 0$; from (20) otherwise
3:  Calculate the copula $c_X(u_X; \rho^{(l)})$  ▹ with $\mu_{X_i}^{(c)}$
4:  Calculate the copula $c_Y(u_Y; \rho^{(l)})$  ▹ if $\tau_i^{(l+1)} = 0$, with $\mu_{Y_i}^{(l)} = \mu_{X_i}^{(c)}$; otherwise with $\mu_{Y_i}^{(l)}$
5:  Sample a random uniform number $U_i \sim U(0, 1)$
6:  if $U_i \le \alpha(\mu_{X_i}^{(l)}, \mu_{X_i}^{(c)})$ then  ▹ $\alpha(\mu_{X_i}^{(l)}, \mu_{X_i}^{(c)}) = \min\{1, A_1\}$
7:   $\mu_{X_i}^{(l+1)} = \mu_{X_i}^{(c)}$
8:  else
9:   $\mu_{X_i}^{(l+1)} = \mu_{X_i}^{(l)}$
10:  end if
11:  Update $\mu_Y^{(l)} = (\mu_{Y_1}^{(l)}, \ldots, \mu_{Y_N}^{(l)})$  ▹ if $\tau_i^{(l+1)} = 0$, $\mu_{Y_i}^{(l)} = \mu_{X_i}^{(l+1)}$; otherwise keep $\mu_{Y_i}^{(l)}$
12:  Calculate the copula $c_X(u_X; \rho^{(l)})$
13:  Calculate the copula $c_Y(u_Y; \rho^{(l)})$
14: end for
End Procedure: Return $\mu_{X_i}^{(l+1)}$, $i = 1, \ldots, N$
with
$$A_1 = \frac{\prod_{j=1}^{n_x} c_X^{c}(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y^{c}(u_{Y \cdot k}; \rho) \, f_i(\mu_{X_i}^{(c)})}{\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \, f_i(\mu_{X_i}^{(l)})}$$
where $c_X^{c}(u_{X \cdot j}; \rho)$ and $c_Y^{c}(u_{Y \cdot k}; \rho)$ are computed with $\mu_{X_i}^{(l)} = \mu_{X_i}^{(c)}$.
Algorithm 6 Single iteration of the Metropolis-Hastings for $\mu_{Y_i}$, $i = 1, \ldots, N$
Require: current values $\tau^{(l+1)} = (\tau_1^{(l+1)}, \ldots, \tau_N^{(l+1)})$, $\mu_X^{(l+1)} = (\mu_{X_1}^{(l+1)}, \ldots, \mu_{X_N}^{(l+1)})$, $\mu_Y^{(l)} = (\mu_{Y_1}^{(l)}, \ldots, \mu_{Y_N}^{(l)})$, $\sigma^{2(l)} = (\sigma_1^{2(l)}, \ldots, \sigma_N^{2(l)})$, $\rho^{(l)}$
Procedure
1: for $i \leftarrow 1:N$ do
2:  if $\tau_i^{(l+1)} = 0$ then
3:   $\mu_{Y_i}^{(l+1)} = \mu_{X_i}^{(l+1)}$
4:  else
5:   Sample a candidate $\mu_{Y_i}^{(c)}$  ▹ candidate-generating density (22)
6:   Calculate the copula $c_Y(u_Y; \rho^{(l)})$  ▹ with $\mu_{Y_i}^{(c)}$
7:   Sample a random uniform number $U_i \sim U(0, 1)$
8:   if $U_i \le \alpha(\mu_{Y_i}^{(l)}, \mu_{Y_i}^{(c)})$ then  ▹ $\alpha(\mu_{Y_i}^{(l)}, \mu_{Y_i}^{(c)}) = \min\{1, A_2\}$
9:    $\mu_{Y_i}^{(l+1)} = \mu_{Y_i}^{(c)}$
10:   else
11:    $\mu_{Y_i}^{(l+1)} = \mu_{Y_i}^{(l)}$
12:   end if
13:   Calculate the copula $c_Y(u_Y; \rho^{(l)})$
14:  end if
15: end for
End Procedure: Return $\mu_{Y_i}^{(l+1)}$, $i = 1, \ldots, N$
with
$$A_2 = \frac{\prod_{k=1}^{n_y} c_Y^{c}(u_{Y \cdot k}; \rho) \, f_i(\mu_{Y_i}^{(c)})}{\prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \, f_i(\mu_{Y_i}^{(l)})}$$
where $c_Y^{c}(u_{Y \cdot k}; \rho)$ is computed with $\mu_{Y_i}^{(l)} = \mu_{Y_i}^{(c)}$.
Algorithm 7 Single iteration of the Metropolis-Hastings for $\sigma_i^2$, $i = 1, \ldots, N$
Require: current values $\tau^{(l+1)} = (\tau_1^{(l+1)}, \ldots, \tau_N^{(l+1)})$, $\mu_X^{(l+1)} = (\mu_{X_1}^{(l+1)}, \ldots, \mu_{X_N}^{(l+1)})$, $\mu_Y^{(l+1)} = (\mu_{Y_1}^{(l+1)}, \ldots, \mu_{Y_N}^{(l+1)})$, $\sigma^{2(l)} = (\sigma_1^{2(l)}, \ldots, \sigma_N^{2(l)})$, $\rho^{(l)}$
Procedure
1: for $i \leftarrow 1:N$ do
2:  Sample a candidate $\sigma_i^{2(c)}$  ▹ from (24) if $\tau_i^{(l+1)} = 0$; from (25) otherwise
3:  Calculate the copulas $c_X(u_X; \rho^{(l)})$ and $c_Y(u_Y; \rho^{(l)})$  ▹ with $\sigma_i^{2(c)}$
4:  Sample a random uniform number $U_i \sim U(0, 1)$
5:  if $U_i \le \alpha(\sigma_i^{2(l)}, \sigma_i^{2(c)})$ then  ▹ $\alpha(\sigma_i^{2(l)}, \sigma_i^{2(c)}) = \min\{1, A_3\}$
6:   $\sigma_i^{2(l+1)} = \sigma_i^{2(c)}$
7:  else
8:   $\sigma_i^{2(l+1)} = \sigma_i^{2(l)}$
9:  end if
10:  Calculate the copula $c_X(u_X; \rho^{(l)})$
11:  Calculate the copula $c_Y(u_Y; \rho^{(l)})$
12: end for
End Procedure: Return $\sigma_i^{2(l+1)}$, $i = 1, \ldots, N$
with
$$A_3 = \frac{\prod_{j=1}^{n_x} c_X^{c}(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y^{c}(u_{Y \cdot k}; \rho) \, f_i(\sigma_i^{2(c)})}{\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho) \, f_i(\sigma_i^{2(l)})}$$
where $c_X^{c}(u_{X \cdot j}; \rho)$ and $c_Y^{c}(u_{Y \cdot k}; \rho)$ are computed with $\sigma_i^{2(l)} = \sigma_i^{2(c)}$.
Algorithm 8 Single iteration of the Metropolis-Hastings for $\rho$
Require: current values $\mu_X^{(l+1)} = (\mu_{X_1}^{(l+1)}, \ldots, \mu_{X_N}^{(l+1)})$, $\mu_Y^{(l+1)} = (\mu_{Y_1}^{(l+1)}, \ldots, \mu_{Y_N}^{(l+1)})$, $\sigma^{2(l+1)} = (\sigma_1^{2(l+1)}, \ldots, \sigma_N^{2(l+1)})$
Procedure
1: Sample a candidate $\rho^{(c)}$  ▹ from $U(a, b)$, $a, b \in (0, 1)$
2: Sample a random uniform number $U \sim U(0, 1)$
3: if $U \le \alpha(\rho^{(l)}, \rho^{(c)})$ then  ▹ $\alpha = \min\{1, A_4\}$
4:  $\rho^{(l+1)} = \rho^{(c)}$
5: else
6:  $\rho^{(l+1)} = \rho^{(l)}$
7: end if
End Procedure: Return $\rho^{(l+1)}$
with
$$A_4 = \frac{\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho^{(c)}) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho^{(c)})}{\prod_{j=1}^{n_x} c_X(u_{X \cdot j}; \rho^{(l)}) \prod_{k=1}^{n_y} c_Y(u_{Y \cdot k}; \rho^{(l)})}$$

References

  1. Fisher, R.A. The Design of Experiments, 9th ed.; Macmillan: New York, NY, USA, 1971 [1935]; ISBN 0-02-844690-9.
  2. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B-Stat. Methodol. 1995, 57, 289–300.
  3. Shaffer, J.P. Multiple hypothesis testing. Annu. Rev. Psychol. 1995, 46, 561–584.
  4. Dudoit, S.; Shaffer, J.P.; Boldrick, J.C. Multiple hypothesis testing in microarray experiments. Stat. Sci. 2003, 18, 71–103.
  5. Dudoit, S.; Keleş, S.; van der Laan, M.J. Multiple tests of association with biological annotation metadata. In Probability and Statistics: Essays in Honor of David A. Freedman; Collections; Institute of Mathematical Statistics: Beachwood, OH, USA, 2008; Volume 2, pp. 153–218.
  6. Benjamini, Y.; Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 2001, 29, 1165–1188.
  7. Gavrilov, Y.; Benjamini, Y.; Sarkar, S.K. An adaptive step-down procedure with proven FDR control under independence. Ann. Stat. 2009, 37, 619–629.
  8. Dickhaus, T.; Gierl, J. Simultaneous test procedures in terms of p-value copulae. In Proceedings of the 2nd Annual International Conference on Computational Mathematics, Computational Geometry & Statistics (CMCGS), Paris, France, 4–5 February 2013.
  9. Bodnar, T.; Dickhaus, T. False discovery rate control under Archimedean copula. Electron. J. Stat. 2014, 8, 2207–2241.
  10. Ibrahim, J.G.; Chen, M.H.; Gray, R.J. Bayesian models for gene expression with DNA microarray data. J. Am. Stat. Assoc. 2002, 97, 88–99.
  11. Gottardo, R.; Raftery, A.E.; Yee Yeung, K.; Bumgarner, R.E. Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 2006, 62, 10–18.
  12. Ausín, M.C.; Gómez-Villegas, M.A.; González-Pérez, B.; Rodríguez-Bernal, M.T.; Salazar, I.; Sanz, L. Bayesian analysis of multiple hypothesis testing with applications to microarray experiments. Commun. Stat. Theory Methods 2011, 40, 2276–2291.
  13. Scott, J.G.; Berger, J.O. An exploration of aspects of Bayesian multiple testing. J. Stat. Plan. Infer. 2006, 136, 2144–2162.
  14. Gómez-Villegas, M.A.; Salazar, I.; Sanz, L. A Bayesian decision procedure for testing multiple hypotheses in DNA microarray experiments. Stat. Appl. Genet. Mol. Biol. 2014, 13, 49–65.
  15. Sarkar, S.K.; Zhou, T.; Ghosh, D. A general decision theoretic formulation of procedures controlling FDR and FNR from a Bayesian perspective. Stat. Sin. 2008, 18, 925–945.
  16. Yuan, M.; Kendziorski, C. A unified approach for simultaneous gene clustering and differential expression identification. Biometrics 2006, 62, 1089–1098.
  17. Marín, J.M.; Rodríguez-Bernal, M.T. Multiple hypothesis testing and clustering with mixtures of non-central t-distributions applied in microarray data analysis. Comput. Stat. Data Anal. 2012, 56, 1898–1907.
  18. Sun, W.; Cai, T.T. Large-scale multiple testing under dependence. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2009, 71, 393–424.
  19. Chi, Z. Effects of statistical dependence on multiple testing under a hidden Markov model. Ann. Stat. 2011, 39, 439–473.
  20. Rayaprolu, S.; Chi, Z. Multiple Testing under Dependence with Approximate Conditional Likelihood. arXiv 2014, arXiv:1412.7778.
  21. Liu, J.; Zhang, C.; Burnside, E.S.; Page, D. Learning Heterogeneous Hidden Markov Random Fields. In Proceedings of the JMLR Workshop and Conference Proceedings, Nha Trang City, Vietnam, 26–28 November 2014; Volume 33, pp. 576–584.
  22. Liu, J.; Peissig, P.; Zhang, C.; Burnside, E.; McCarty, C.; Page, D. Graphical-model based multiple testing under dependence, with applications to genome-wide association studies. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, 14–18 August 2012; pp. 511–522.
  23. Liu, J.; Zhang, C.; Page, D. Multiple testing under dependence via graphical models. Ann. Appl. Stat. 2016, 10, 1699–1724.
  24. Genest, C.; MacKay, J. The joy of copulas: Bivariate distributions with uniform marginals. Am. Stat. 1986, 40, 280–283.
  25. Genest, C.; Ghoudi, K.; Rivest, L.P. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 1995, 82, 543–552.
  26. Sklar, M. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 1959, 8, 229–231.
  27. Joe, H. Multivariate Models and Dependence Concepts; Chapman & Hall/CRC: New York, NY, USA; London, UK; Washington, DC, USA, 1997; ISBN-10: 0412073315.
  28. Cherubini, U.; Luciano, E.; Vecchiato, W. Copula Methods in Finance; John Wiley & Sons: New York, NY, USA, 2004; ISBN 978-0-470-86345-9.
  29. Nelsen, R.B. An Introduction to Copulas; Springer Science & Business Media: New York, NY, USA, 2007; ISBN 0-387-28678-5.
  30. Diebolt, J.; Robert, C.P. Estimation of finite mixture distributions through Bayesian sampling. J. R. Stat. Soc. Ser. B-Stat. Methodol. 1994, 56, 363–375.
  31. Feller, W. An Introduction to Probability Theory and Its Applications; John Wiley & Sons: New York, NY, USA; London, UK; Sydney, Australia, 1966; Volume 2; ISBN-10: 0471257095.
  32. Kowalski, C.J. Non-normal bivariate distributions with normal marginals. Am. Stat. 1973, 27, 103–106.
  33. Gelman, A.; Meng, X.L. A note on bivariate distributions that are conditionally normal. Am. Stat. 1991, 45, 125–126.
  34. Zhao, H.; Chan, K.L.; Cheng, L.M.; Yan, H. Multivariate hierarchical Bayesian model for differential gene expression analysis in microarray experiments. BMC Bioinform. 2008, 9, S9.
  35. Salazar, I. Aproximación Bayesiana a los Contrastes de Hipótesis Múltiples con Aplicaciones a los Microarrays; E-Prints Complutense: Madrid, Spain, 2011; ISBN 978-84-694-6254-6.
  36. Žežula, I. On multivariate Gaussian copulas. J. Stat. Plan. Infer. 2009, 139, 3942–3946.
  37. Broët, P.; Richardson, S.; Radvanyi, F. Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comput. Biol. 2002, 9, 671–683.
  38. Patz, R.J.; Junker, B.W. A straightforward approach to Markov chain Monte Carlo methods for item response models. J. Educ. Behav. Stat. 1999, 24, 146–178.
  39. Robert, C.; Casella, G. Monte Carlo Statistical Methods; Springer Science & Business Media: New York, NY, USA, 2013; ISBN 978-1-4757-3073-9.
  40. Müller, P.; Parmigiani, G.; Robert, C.; Rousseau, J. Optimal Sample Size for Multiple Testing: The Case of Gene Expression Microarrays. J. Am. Stat. Assoc. 2004, 99, 990–1001.
  41. Do, K.A.; Müller, P.; Tang, F. A Bayesian mixture model for differential gene expression. J. R. Stat. Soc. Ser. C-Appl. Stat. 2005, 54, 627–644.
  42. Genovese, C.; Wasserman, L. Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2002, 64, 499–517.
  43. Genovese, C.; Wasserman, L. Bayesian and Frequentist Multiple Testing. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, 2–6 June 2002; Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M., West, M., Eds.; Oxford University Press: Oxford, UK, 2003; ISBN 0-19-852615-6.
  44. Spiegelhalter, D.J.; Best, N.G.; Carlin, B.P.; Van Der Linde, A. Bayesian measures of model complexity and fit. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2002, 64, 583–639.
  45. Pascual, V.; Medrano, L.; López-Palacios, N.; Bodas, A.; Dema, B.; Fernández-Arquero, M.; González-Pérez, B.; Salazar, I.; Núñez, C. Different gene expression signatures in children and adults with celiac disease. PLoS ONE 2016, 11, e0146276.
Table 1. Results for the model with Gaussian copulas and for data corresponding to 80% of true null hypotheses, under different prior distributions of p_i.

(α, β)   (0.5, 1)   (1, 1)   (1, 0.5)   (2, 0.5)
ρ^       0.73       0.77     0.752      0.76
FDR^     0.219      0.298    0.045      0.023

Accepted / Rejected counts:
        (0.5, 1)   (1, 1)    (1, 0.5)   (2, 0.5)   Total
True    0 / 39     14 / 25   37 / 2     38 / 1     39
False   0 / 11     0 / 11    0 / 11     0 / 11     11
Total   0 / 50     14 / 36   37 / 13    38 / 12    50
Table 2. Results for the model with Gaussian copulas and for data corresponding to 50% of true null hypotheses, under different prior distributions of p_i.

(α, β)   (0.5, 1)   (1, 1)   (1, 0.5)   (2, 0.5)
ρ^       0.803      0.813    0.804      0.773
FDR^     0.089      0.17     0.079      0.065

Accepted / Rejected counts:
        (0.5, 1)   (1, 1)    (1, 0.5)   (2, 0.5)   Total
True    0 / 25     1 / 24    19 / 6     22 / 3     25
False   0 / 25     0 / 25    0 / 25     0 / 25     25
Total   0 / 50     1 / 49    19 / 31    22 / 28    50
Table 3. Results for the model with Gaussian copulas and for data corresponding to 20% of true null hypotheses, under different prior distributions of p_i.

(α, β)   (0.5, 1)   (1, 1)   (1, 0.5)   (2, 0.5)
ρ^       0.773      0.78     0.775      0.795
FDR^     0.04       0.085    0.054      0.07

Accepted / Rejected counts:
        (0.5, 1)   (1, 1)    (1, 0.5)   (2, 0.5)   Total
True    0 / 12     0 / 12    8 / 4      11 / 1     12
False   0 / 38     0 / 38    0 / 38     1 / 37     38
Total   0 / 50     0 / 50    8 / 42     12 / 38    50
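Tables 4–6 repeat the experiment with Clayton copulas, whose dependence parameter θ_c replaces ρ. For completeness, here is a sketch of the N-dimensional Clayton log-density, which could stand in for gaussian_copula_logdensity in the sketches above; the formula is the standard one (see, e.g., [29]), while the function itself is our illustration:

def clayton_copula_logdensity(u, theta):
    # Log-density of the N-dimensional Clayton copula, theta > 0:
    # c(u) = prod_{k=1}^{N} ((k - 1) * theta + 1) * (prod_i u_i)^(-(theta + 1))
    #        * (sum_i u_i^(-theta) - N + 1)^(-(N + 1/theta))
    N = len(u)
    log_c = np.sum(np.log(np.arange(N) * theta + 1.0))
    log_c -= (theta + 1.0) * np.sum(np.log(u))
    log_c -= (N + 1.0 / theta) * np.log(np.sum(u ** (-theta)) - N + 1.0)
    return log_c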
Table 4. Results for the model with Clayton copulas and for data corresponding to 80% of true null hypotheses, under different prior distributions of p_i.

(α, β)   (0.5, 1)   (1, 1)   (1, 0.5)   (2, 0.5)
θ_c^     1.94       1.99     1.94       1.99
FDR^     0.28       0.385    0.258      0.27

Accepted / Rejected counts:
        (0.5, 1)   (1, 1)    (1, 0.5)   (2, 0.5)   Total
True    0 / 39     10 / 29   39 / 0     39 / 0     39
False   0 / 11     0 / 11    1 / 10     6 / 5      11
Total   0 / 50     10 / 40   40 / 10    45 / 5     50
Table 5. Results for the model with Clayton copulas and for data corresponding to 50% of true null hypotheses, under different prior distributions of p_i.

(α, β)   (0.5, 1)   (1, 1)   (1, 0.5)   (2, 0.5)
θ_c^     1.89       1.93     2.01       1.97
FDR^     0.226      0.29     0.166      0.177

Accepted / Rejected counts:
        (0.5, 1)   (1, 1)    (1, 0.5)   (2, 0.5)   Total
True    0 / 25     6 / 19    25 / 0     25 / 0     25
False   0 / 25     1 / 24    8 / 17     11 / 14    25
Total   0 / 50     7 / 43    33 / 17    36 / 14    50
Table 6. Results for the model with Clayton copulas and for data corresponding to 20% of true null hypotheses, under different prior distributions of p_i.

(α, β)   (0.5, 1)   (1, 1)   (1, 0.5)   (2, 0.5)
θ_c^     2.59       2.19     2.29       2.27
FDR^     0.203      0.26     0.22       0.194

Accepted / Rejected counts:
        (0.5, 1)   (1, 1)    (1, 0.5)   (2, 0.5)   Total
True    0 / 12     4 / 8     12 / 0     12 / 0     12
False   0 / 38     1 / 37    15 / 23    24 / 14    38
Total   0 / 50     5 / 45    27 / 23    36 / 14    50
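The FDR^ values reported in Tables 1–6 are model-based estimates, which is why they need not coincide with the realized error counts in the same column. One common Bayesian estimator, in the spirit of [40,42], averages the posterior null probabilities over the rejected hypotheses; a minimal sketch under that assumption (the paper's exact estimator may differ):

import numpy as np

def bayes_fdr(post_null, reject):
    # post_null: posterior probabilities P(H0_i | data) from the MCMC output;
    # reject: boolean mask of rejected hypotheses.
    return float(post_null[reject].mean()) if reject.any() else 0.0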
Table 7. DIC values for the different percentages of true null hypotheses and for the prior distributions of p_i skewed to the right.

Model                         Gaussian copulas        Clayton copulas
(α, β)                        (1, 0.5)   (2, 0.5)     (1, 0.5)   (2, 0.5)
80% of true null hypotheses   8548.326   8540.206     8961.517   8973.411
50% of true null hypotheses   8682.981   8696.98      9033.87    9034.77
20% of true null hypotheses   8565.80    8543.045     9114.768   9130.447
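The model comparison in Table 7 uses the deviance information criterion of Spiegelhalter et al. [44]: $DIC = \bar{D} + p_D$ with $p_D = \bar{D} - D(\bar{\theta})$, where $D(\theta) = -2 \log L(\theta)$ and $\bar{\theta}$ is the posterior mean. A minimal sketch from generic MCMC output; log_lik is an assumed user-supplied log-likelihood:

import numpy as np

def dic(theta_draws, log_lik):
    # theta_draws: posterior samples, shape (n_draws, dim);
    # log_lik: maps a parameter vector to its log-likelihood.
    # Assumes a continuous parameter vector for which the posterior mean
    # is a meaningful plug-in; discrete indicators need special handling.
    deviances = np.array([-2.0 * log_lik(t) for t in theta_draws])
    d_bar = deviances.mean()                     # posterior mean deviance
    d_hat = -2.0 * log_lik(theta_draws.mean(axis=0))
    p_d = d_bar - d_hat                          # effective number of parameters
    return d_bar + p_d

Under this convention smaller is better, so the DIC values in Table 7 favor the Gaussian-copula model over the Clayton alternative in every scenario considered.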
