
# Bayesian Dependence Tests for Continuous, Binary and Mixed Continuous-Binary Variables

1 Dalle Molle Institute for Artificial Intelligence, Manno 6928, Switzerland
2 School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK
* Author to whom correspondence should be addressed.
Entropy 2016, 18(9), 326; https://doi.org/10.3390/e18090326
Received: 20 June 2016 / Revised: 25 August 2016 / Accepted: 26 August 2016 / Published: 6 September 2016
(This article belongs to the Special Issue Statistical Significance and the Logic of Hypothesis Testing)

## Abstract

Tests for dependence of continuous, discrete and mixed continuous-discrete variables are ubiquitous in science. The goal of this paper is to derive Bayesian alternatives to frequentist null hypothesis significance tests for dependence. In particular, we will present three Bayesian tests for dependence of binary, continuous and mixed variables. These tests are nonparametric and based on the Dirichlet Process, which allows us to use the same prior model for all of them. Therefore, the tests are “consistent” among each other, in the sense that the probabilities that variables are dependent computed with these tests are commensurable across the different types of variables being tested. By means of simulations with artificial data, we show the effectiveness of the new tests.

## 1. Introduction

Tests for dependence of continuous, discrete and mixed continuous-discrete variables are fundamental in science. The standard way to assess statistically whether two (or more) variables are dependent is to use null-hypothesis significance tests (NHST), such as the $\chi^2$-test or Kendall’s $\tau$. However, these tests suffer from the drawbacks that characterize NHST [1,2,3]. An NHST computes the probability of obtaining the observed (or a larger) value of the test statistic under the assumption that the null hypothesis of independence is true, which is not the same as the probability that the variables are dependent given the observed data. Another common problem is that the claimed statistical significance might have no practical impact. Indeed, the use of NHST often relies on the wrong assumptions that p-values are a reasonable proxy for the probability of the null hypothesis and that statistical significance implies practical significance.
In this paper, we propose a collection of Bayesian dependence tests. The questions we are actually interested in (for example, “Is variable Y dependent on Z?” or “Based on the experiments, how probable is it that Y depends on Z?”) are questions about posterior probabilities. Answers to these questions are naturally provided by Bayesian methods. The core of this paper is thus to derive Bayesian alternatives to frequentist NHST and to discuss their inference and results. In particular, we present three Bayesian tests for dependence of binary, continuous and mixed variables. All of these tests are nonparametric and based on the Dirichlet Process. This allows us to use the same prior model for all the tests we develop. Therefore, they are “consistent” in the sense that the probabilities of dependence we compute are commensurable across the tests. This is another main difference between this approach and the use of p-values, since the latter usually cannot be compared across different types of tests.
To address the issue of how to choose the prior parameters when prior information is lacking, we propose the use of the Imprecise Dirichlet Process (IDP). It consists of a family of Dirichlet processes with fixed prior strength and a prior probability measure that is free to span the set of all distributions. In this way, we obtain as a byproduct a measure of the sensitivity of the inferences to the choice of the prior parameters.
Nonparametric tests based on the Dirichlet Process and on similar ideas to those presented in this paper have also been proposed in  to develop a Bayesian rank test, in  for a Bayesian signed-rank test, in  for a Bayesian Friedman test and in  for a Bayesian test that accounts for censored data.
Several alternative Bayesian methods are available for testing of independence. The test of linear dependence between two continuous univariate random variables can be achieved by fitting a linear model and inspecting the posterior distribution of the correlation coefficient. A more sophisticated test based on a Dirichlet Process Mixture prior is instead presented in  to deal with linear and nonlinear dependences. Other methods were proposed for testing of independence based on a contingency table [9,10,11]. The main difference between these works and the work presented in this paper is that we provide tests for continuous, categorical (binary) and mixed variables using the same approach. This allows us to derive a very general framework to test independence/dependence (these tests could be used for instance for feature selection in machine learning [12,13,14,15]).
By means of simulations on artificial data, we use our test to decide if two variables are dependent. We show that our Bayesian test achieves equal or better results than the frequentist tests. We moreover show that the IDP test is more robust, in the sense that it acknowledges when the decision is prior-dependent. In other words, the IDP test suspends the judgment and becomes indeterminate when the decision becomes prior dependent. Since IDP has all the positive features of a Bayesian test and it is more reliable than the frequentist tests, we propose IDP as a new test for testing dependence.

## 2. Dirichlet Process

The Dirichlet Process was developed by Ferguson as a probability distribution on the space of probability distributions. Let $\mathbb{X}$ be a standard Borel space with Borel σ-field $\mathcal{B}_{\mathbb{X}}$ and let $\mathbb{P}$ be the space of probability measures on $(\mathbb{X}, \mathcal{B}_{\mathbb{X}})$ equipped with the weak topology and the corresponding Borel σ-field $\mathcal{B}_{\mathbb{P}}$. Let $\mathcal{M}$ be the class of all probability measures on $(\mathbb{P}, \mathcal{B}_{\mathbb{P}})$. We call the elements $\mu \in \mathcal{M}$ nonparametric priors.
An element of $\mathcal{M}$ is called a Dirichlet Process distribution $\mathcal{D}(\alpha)$ with base measure $\alpha$ if, for every finite measurable partition $B_1, \dots, B_m$ of $\mathbb{X}$, the vector $(P(B_1), \dots, P(B_m))$ has a Dirichlet distribution with parameters $(\alpha(B_1), \dots, \alpha(B_m))$, where $\alpha(\cdot)$ is a finite positive Borel measure on $\mathbb{X}$. Consider the partition $B_1 = A$ and $B_2 = A^c = \mathbb{X} \setminus A$ for some measurable set $A \in \mathcal{B}_{\mathbb{X}}$. If $P \sim \mathcal{D}(\alpha)$, then from the definition of the DP we have that $(P(A), P(A^c)) \sim \mathrm{Dir}(\alpha(A), \alpha(\mathbb{X}) - \alpha(A))$, which is a Beta distribution. From the moments of the Beta distribution, we can thus derive that:
$\mathcal{E}[P(A)] = \frac{\alpha(A)}{\alpha(\mathbb{X})}, \qquad \mathcal{E}\big[(P(A) - \mathcal{E}[P(A)])^2\big] = \frac{\alpha(A)\,(\alpha(\mathbb{X}) - \alpha(A))}{\alpha(\mathbb{X})^2\,(\alpha(\mathbb{X}) + 1)},$
where we have used the calligraphic letter $\mathcal{E}$ to denote expectation with respect to the Dirichlet process. This shows that the normalized measure $\alpha(\cdot)/\alpha(\mathbb{X})$ of the DP reflects the prior expectation of P, while the scaling parameter $\alpha(\mathbb{X})$ controls how much P is allowed to deviate from its mean $\alpha(\cdot)/\alpha(\mathbb{X})$. Let $s = \alpha(\mathbb{X})$ stand for the total mass of $\alpha(\cdot)$ and $\alpha^*(\cdot) = \alpha(\cdot)/s$ stand for the probability measure obtained by normalizing $\alpha(\cdot)$. If $P \sim \mathcal{D}(\alpha)$, we shall also describe this by saying $P \sim Dp(s, \alpha^*)$ or, if $\mathbb{X} = \mathbb{R}$, $P \sim Dp(s, G_0)$, where $G_0$ stands for the cumulative distribution function of $\alpha^*$.
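As a small numerical illustration of the two moments above, the following Python sketch evaluates them in terms of $s$ and $\alpha^*(A)$ and compares them with draws from the corresponding Beta distribution. The function name `dp_prior_moments` and the values of `s` and `a` are our own illustrative choices, not part of the original derivation.

```python
import numpy as np

def dp_prior_moments(s, a):
    """Prior mean and variance of P(A) when P ~ Dp(s, alpha*) and alpha*(A) = a.

    Since alpha(A) = s * a and alpha(X) = s, P(A) ~ Beta(s * a, s * (1 - a)).
    """
    mean = a                                # alpha(A) / alpha(X)
    var = a * (1.0 - a) / (s + 1.0)         # alpha(A)(alpha(X) - alpha(A)) / (alpha(X)^2 (alpha(X) + 1))
    return mean, var

# Monte Carlo check with illustrative (assumed) values of s and alpha*(A).
rng = np.random.default_rng(0)
s, a = 0.5, 0.3
draws = rng.beta(s * a, s * (1.0 - a), size=200_000)
mean, var = dp_prior_moments(s, a)
print(f"mean: exact {mean:.4f}, sampled {draws.mean():.4f}")
print(f"var : exact {var:.4f}, sampled {draws.var():.4f}")
```

A small $s$ gives a large prior variance, which is the sense in which $s$ controls how far P may deviate from $\alpha^*$.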
Let $P \sim Dp(s, \alpha^*)$ and let f be a real-valued bounded function defined on $(\mathbb{X}, \mathcal{B}_{\mathbb{X}})$. Then the expectation with respect to the Dirichlet Process of $E[f]$ is
$\mathcal{E}\big[E(f)\big] = \mathcal{E}\left[\int f \, dP\right] = \int f \, d\mathcal{E}[P] = \int f \, d\alpha^* .$
One of the most remarkable properties of the DP priors is that the posterior distribution of P is again a DP. Let $X_1, \dots, X_n$ be an independent and identically distributed sample from P and $P \sim Dp(s, \alpha^*)$. Then the posterior distribution of P given the observations, denoted as $P_X \mid X^n$, is
$P_X \mid X^n \sim Dp\left(s + n,\ \frac{s}{s+n}\,\alpha^* + \frac{1}{s+n}\sum_{i=1}^{n} \delta_{X_i}\right),$
where $\delta_{X_i}$ is an atomic probability measure centered at $X_i$ and $X^n = \{X_1, \dots, X_n\}$. This means that the Dirichlet Process satisfies a property of conjugacy, in the sense that the posterior for P is again a Dirichlet Process with updated unnormalized base measure $\alpha + \sum_{i=1}^{n} \delta_{X_i}$. From Equations (1)–(3), we can easily derive the posterior mean and variance of $P(A)$ and the posterior expectation of f. Hereafter we list some useful properties of the DP that will be used in the sequel (see Chapter 3 in ).
(a) In case $\mathbb{X} = \mathbb{R}$, since P is completely defined by its cumulative distribution function F, a priori we say $F \sim Dp(s, G_0)$ and a posteriori we can rewrite (3) as follows:
$F_X \mid X^n \sim Dp\left(s + n,\ \frac{s}{s+n}\, G_0 + \frac{n}{s+n}\, \frac{1}{n}\sum_{i=1}^{n} I_{[X_i, \infty)}\right),$
where I is the indicator function and $\frac{1}{n}\sum_{i=1}^{n} I_{[X_i, \infty)}$ is the empirical cumulative distribution.
(b) Consider an element $\mu \in \mathcal{M}$ which puts all its mass at the probability measure $P = \delta_x$ for some $x \in \mathbb{X}$. This can also be modeled as $Dp(s, \delta_x)$ for each $s > 0$.
(c) Assume that $P_1 \sim Dp(s_1, \alpha_1^*)$, $P_2 \sim Dp(s_2, \alpha_2^*)$, $(\omega_1, \omega_2) \sim \mathrm{Dir}(s_1, s_2)$ and that $P_1$, $P_2$, $(\omega_1, \omega_2)$ are independent. Then (see Section 3.1.1 in ):
$\omega_1 P_1 + \omega_2 P_2 \sim Dp\left(s_1 + s_2,\ \frac{s_1}{s_1 + s_2}\,\alpha_1^* + \frac{s_2}{s_1 + s_2}\,\alpha_2^*\right).$
(d) Let $P_X \mid X^n$ have distribution $Dp\left(s + n,\ \frac{s}{s+n}\,\alpha^* + \frac{1}{s+n}\sum_{i=1}^{n} \delta_{X_i}\right)$. We can write
$P_X \mid X^n = \omega_0 P + \sum_{i=1}^{n} \omega_i \delta_{X_i},$
where $\sum_{i=0}^{n} \omega_i = 1$, $(\omega_0, \omega_1, \dots, \omega_n) \sim \mathrm{Dir}(s, 1, \dots, 1)$ and $P \sim Dp(s, \alpha^*)$. This follows from (b)–(c).
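Property (d) gives a direct way to simulate from the posterior. The sketch below, under our own naming (`sample_posterior_prob` and its arguments are hypothetical), draws the posterior probability $P(A)$ of a set $A$ by sampling the Dirichlet weights and combining the prior part with the atoms at the observations. It uses the fact that, under the prior $Dp(s, \alpha^*)$, $P(A)$ is Beta distributed.

```python
import numpy as np

def sample_posterior_prob(indicator, s, prior_mass, n_draws, rng):
    """Draw P(A) from the DP posterior using representation (d).

    indicator  : boolean array, indicator[i] is True when X_i lies in A
    prior_mass : alpha*(A), assumed to lie strictly between 0 and 1
    """
    indicator = np.asarray(indicator, dtype=float)
    n = indicator.size
    # (omega_0, omega_1, ..., omega_n) ~ Dir(s, 1, ..., 1)
    w = rng.dirichlet(np.concatenate(([s], np.ones(n))), size=n_draws)
    # Under the prior Dp(s, alpha*), P(A) ~ Beta(s * alpha*(A), s * (1 - alpha*(A)))
    prior_part = rng.beta(s * prior_mass, s * (1.0 - prior_mass), size=n_draws)
    # omega_0 * P(A) + sum_i omega_i * delta_{X_i}(A)
    return w[:, 0] * prior_part + w[:, 1:] @ indicator

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    in_A = np.array([True] * 3 + [False] * 7)   # illustrative data: 3 of 10 points in A
    draws = sample_posterior_prob(in_A, s=0.5, prior_mass=0.5, n_draws=100_000, rng=rng)
    print("sampled posterior mean:", draws.mean())
    print("conjugate posterior mean:", (0.5 * 0.5 + 3) / (0.5 + 10))
```

The sampled mean should agree with the posterior mean $(s\,\alpha^*(A) + \sum_i \delta_{X_i}(A))/(s+n)$ implied by (3).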
An issue in the use of the DP as prior measure on P is how to choose the infinite-dimensional parameter $G_0$ when prior information is lacking. There are two avenues that we can follow. The first assumes that prior ignorance can be modelled satisfactorily by a so-called noninformative prior. In the DP setting, the only noninformative prior that has been proposed so far is the limiting DP obtained for $s \to 0$, which has been introduced by  and discussed by . The second approach suggests that lack of prior information should be expressed in terms of a set of probability distributions. This approach, known as Imprecise Probability [19,20,21], is connected to Bayesian robustness [22,23,24] and has been extensively applied to model prior (near-)ignorance in parametric models. In this paper, we implement a prior (near-)ignorance model by considering a set of DPs obtained by fixing s to a strictly positive value and letting $G_0$ span the set of all distributions. This model has been introduced in  with the name of Imprecise Dirichlet Process (IDP).

## 3. Bayesian Independence Tests

Let us denote by X the vector of variables $[Y, Z]^T$, so that the n observations of X can be written as
$X^n = \{X_1, \dots, X_n\},$
that is, a set of n vector-valued i.i.d. observations of X. We also consider an auxiliary variable $X'$ together with X. We assume that $X, X'$ are independent variables from the same unknown distribution and that $X'^n = X^n$, that is, we have the same observations of X and $X'$.
Let P be the unknown distribution of $X, X'$ and assume that the prior distribution of P is $Dp(s, \alpha^*)$. Our goal is to compute the posterior of P. The posterior of P is given in (3) and, by exploiting (6), we know that
$P_X \mid X^n \sim \omega_0 P + \sum_{i=1}^{n} \omega_i \delta_{X_i},$
with $(\omega_0, \omega_1, \dots, \omega_n) \sim \mathrm{Dir}(s, 1, \dots, 1)$ and $P \sim Dp(s, \alpha^*)$. The distribution $P_{X'} \mid X'^n$ of $X'$ is similarly defined.
The questions we pose in a statistical analysis can all be answered by querying this posterior distribution in different ways. We adopt this posterior distribution to devise Bayesian counterparts of the independence hypothesis tests.

#### 3.1. Bayesian Bivariate Independence Test for Binary Variables

Let us assume that the variables $Y, Z \in \{0, 1\}$ (that is, they are binary). Our aim is to devise a Bayesian independence test for binary variables based on the DP. We will also show that our test is a Bayesian generalisation of the frequentist $\chi^2$-test for independence applied to binary variables. We start by defining the following quantities:
$E\big[I_{\{0,0\}}(X)\, I_{\{1,1\}}(X') \mid X^n, X'^n\big] = \int\!\!\int I_{\{0,0\}}(X)\, I_{\{1,1\}}(X')\, dF(X \mid X^n)\, dF(X' \mid X'^n),$
where we have exploited the independence of $X, X'$ and where $F(X \mid X^n)$ denotes the posterior cumulative distribution of $P_X \mid X^n$ defined in (8). From (8), it can easily be verified that
$E\big[I_{\{0,0\}}(X)\, I_{\{1,1\}}(X') \mid X^n, X'^n\big] = \omega_{00}\, \omega_{11},$
where
$\omega_{00} = \omega_0 \int I_{\{0,0\}}(X)\, dF(X) + \sum_{i=1}^{n} \omega_i\, I_{\{0,0\}}(Y_i, Z_i),$
and
$\omega_{11} = \omega_0 \int I_{\{1,1\}}(X)\, dF(X) + \sum_{i=1}^{n} \omega_i\, I_{\{1,1\}}(Y_i, Z_i),$
where in the last equality we have exploited the fact that $X'$ has the same distribution as X and also the same observations. Each of the two quantities $\omega_{00}, \omega_{11}$ consists of two terms. The first term is due to the prior $dF \sim Dp(s, \alpha^*)$ and the second term is due to the observations.
Similarly, we compute
$E\big[I_{\{0,1\}}(X)\, I_{\{1,0\}}(X') \mid X^n, X'^n\big] = \omega_{01}\, \omega_{10},$
where
$\omega_{01} = \omega_0 \int I_{\{0,1\}}(X)\, dF(X) + \sum_{i=1}^{n} \omega_i\, I_{\{0,1\}}(Y_i, Z_i),$
and
$\omega_{10} = \omega_0 \int I_{\{1,0\}}(X)\, dF(X) + \sum_{i=1}^{n} \omega_i\, I_{\{1,0\}}(Y_i, Z_i).$
Summing up, $\omega_{00}, \omega_{10}, \omega_{01}, \omega_{11}$ represent the posterior probabilities of the events $(0, 0)$ (that is, $Y = 0$ and $Z = 0$), $(1, 0)$, $(0, 1)$ and $(1, 1)$, respectively, according to the posterior joint distribution $F(X \mid X^n)$.
Theorem 1.
The variables Y and Z are said to be concordant (dependent) with posterior probability $(1 - \gamma)$ provided that
$\mathcal{P}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big) > (1 - \gamma),$
and they are said to be discordant provided that
$\mathcal{P}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) < 0 \mid X^n\big) > (1 - \gamma),$
where $\mathcal{P}$ is the probability computed with respect to $(\omega_0, \omega_1, \dots, \omega_n) \sim \mathrm{Dir}(s, 1, \dots, 1)$ and $dF \sim Dp(s, \alpha^*)$. Finally, they are said to be simply dependent with posterior probability $(1 - \gamma)$ provided that
$0 \notin (1 - \gamma)\, \mathrm{HDI}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) \mid X^n\big),$
where $\mathrm{HDI}$ denotes the posterior Highest Density Interval of $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$.
Proof.
We derive only the third statement; the other two are analogous. We first consider the indicator functions
$I_{\{0\}}(Y), \quad I_{\{1\}}(Y), \quad I_{\{0\}}(Z), \quad I_{\{1\}}(Z),$
and the same for the auxiliary variables $Y', Z'$. By computing the expectation of these functions, we obtain the marginals of the variables $Y, Z$ with respect to the joint $P_X$:
$\omega_{0\bullet} := E\big(I_{\{0\}}(Y) \mid X^n\big) = \omega_0 \int I_{\{0\}}(Y)\, dF(X) + \sum_{i=1}^{n} \omega_i\, I_{\{0\}}(Y_i),$
$\omega_{1\bullet} := E\big(I_{\{1\}}(Y) \mid X^n\big) = \omega_0 \int I_{\{1\}}(Y)\, dF(X) + \sum_{i=1}^{n} \omega_i\, I_{\{1\}}(Y_i),$
$\omega_{\bullet 0} := E\big(I_{\{0\}}(Z) \mid X^n\big) = \omega_0 \int I_{\{0\}}(Z)\, dF(X) + \sum_{i=1}^{n} \omega_i\, I_{\{0\}}(Z_i),$
$\omega_{\bullet 1} := E\big(I_{\{1\}}(Z) \mid X^n\big) = \omega_0 \int I_{\{1\}}(Z)\, dF(X) + \sum_{i=1}^{n} \omega_i\, I_{\{1\}}(Z_i),$
where $\omega_{0\bullet}$ (resp. $\omega_{1\bullet}$) denotes the marginal with respect to Z when $Y = 0$ (resp. $Y = 1$), while $\omega_{\bullet 0}$ (resp. $\omega_{\bullet 1}$) denotes the marginal with respect to Y when $Z = 0$ (resp. $Z = 1$).
Then, by exploiting independence between X and $X'$, we derive
$E\big(I_{\{0\}}(Y)\, I_{\{0\}}(Z') \mid X^n, X'^n\big) = \omega_{0\bullet}\, \omega_{\bullet 0}, \qquad E\big(I_{\{1\}}(Y)\, I_{\{0\}}(Z') \mid X^n, X'^n\big) = \omega_{1\bullet}\, \omega_{\bullet 0},$
$E\big(I_{\{0\}}(Y)\, I_{\{1\}}(Z') \mid X^n, X'^n\big) = \omega_{0\bullet}\, \omega_{\bullet 1}, \qquad E\big(I_{\{1\}}(Y)\, I_{\{1\}}(Z') \mid X^n, X'^n\big) = \omega_{1\bullet}\, \omega_{\bullet 1}.$
We are now ready to define the independence test. If the two variables $Y, Z$ are independent, then the vector
$v = (\omega_{00}, \omega_{10}, \omega_{01}, \omega_{11}) - (\omega_{0\bullet}\omega_{\bullet 0},\ \omega_{1\bullet}\omega_{\bullet 0},\ \omega_{0\bullet}\omega_{\bullet 1},\ \omega_{1\bullet}\omega_{\bullet 1}),$
has zero mean. Note that the first component of the vector v is $E\big(I_{\{0,0\}}(X) - I_{\{0\}}(Y)\, I_{\{0\}}(Z') \mid X^n, X'^n\big)$ and thus is a well-defined quantity with respect to our probabilistic model (similarly for the other components). Therefore, the independence test reduces to checking whether the $(1 - \gamma)$ highest density credible region (HCR) of v includes the zero vector. It can be easily verified that $|v_l| = |v_m|$ for each pair of components $l, m$ of v. In fact, we have
$\omega_{i\bullet} = \omega_{ij} + \omega_{i\bar{j}}, \qquad \omega_{\bullet j} = \omega_{ij} + \omega_{\bar{i}j},$
for $i, j \in \{0, 1\}$ and $\bar{i} = 1 - i$, $\bar{j} = 1 - j$, and so
$\omega_{ij} - (\omega_{ij} + \omega_{i\bar{j}})(\omega_{ij} + \omega_{\bar{i}j}) = \omega_{ij}(\omega_{ij} + \omega_{\bar{i}j} + \omega_{i\bar{j}} + \omega_{\bar{i}\bar{j}}) - (\omega_{ij} + \omega_{i\bar{j}})(\omega_{ij} + \omega_{\bar{i}j}) = \omega_{ij}\,\omega_{\bar{i}\bar{j}} - \omega_{i\bar{j}}\,\omega_{\bar{i}j},$
where the first equality uses the fact that the four cell probabilities sum to one. Therefore, it is enough to check whether
$0 \notin (1 - \gamma)\, \mathrm{HDI}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})\big).$
If this is the case, then we can declare that the two variables are dependent with probability $(1 - \gamma)$. Here, the multiplier 2 in $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$ is only a scaling factor so that $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$ varies in $[-0.5, 0.5]$. ☐
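The identity used in the proof can be checked numerically. The following Python sketch draws one arbitrary joint distribution over the four cells (the Dirichlet draw is only a convenient way to obtain positive numbers that sum to one, and is our own choice) and verifies that every component of the difference between the joint and the product of the marginals equals $\pm(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$.

```python
import numpy as np

rng = np.random.default_rng(1)

# One random joint distribution over the cells; w[i, j] plays the role of omega_ij.
w = rng.dirichlet(np.ones(4)).reshape(2, 2)

row = w.sum(axis=1)               # omega_{i.}: marginal of Y
col = w.sum(axis=0)               # omega_{.j}: marginal of Z
v = w - np.outer(row, col)        # joint minus product of the marginals
det = w[0, 0] * w[1, 1] - w[0, 1] * w[1, 0]

print("v =\n", v)
print("omega_00 * omega_11 - omega_01 * omega_10 =", det)
```

The diagonal entries of `v` equal `det` and the off-diagonal entries equal `-det`, so a single scalar carries all the information in v.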
From the proof of Theorem 1, the similarity of the test with the frequentist $\chi^2$-test for independence is evident. Both tests use the difference between the joint and the product of the marginals as a measure of dependence. The advantage of the Bayesian approach is that we compute posterior probabilities for the hypothesis in which we are interested and not the probability of getting the observed (or a larger) difference under the assumption that the null hypothesis of independence is true.
The probabilities computed in Theorem 1 depend on the prior information $F \sim Dp(s, \alpha^*)$. In this paper we adopt the IDP as prior model. We can then perform a Bayesian nonparametric test that is based on extremely weak prior assumptions and is easy to elicit, since it requires only the choice of the strength s of the DP instead of its infinite-dimensional parameter $\alpha^*$. The infinite-dimensional parameter $\alpha^*$ is free to vary in the set of all distributions.
Let us consider for instance (13). Each one of these priors gives a posterior probability $\mathcal{P}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big)$. We can characterize this set of posteriors by computing the lower and upper bounds $\underline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big)$ and $\overline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big)$. Inferences with the IDP can be computed by verifying whether
$\underline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big) > (1 - \gamma), \qquad \overline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big) > (1 - \gamma),$
and then by taking the following decisions:
• if both the inequalities are satisfied, then we declare that the two variables are dependent with probability larger than $1 − γ$;
• if only one of the inequalities is satisfied (which has necessarily to be the one for the upper), we are in an indeterminate situation, that is, we cannot decide;
• if both are not satisfied, then we declare that the probability that the two variables are dependent is lower than the desired probability of $1 − γ$.
When IDP returns an indeterminate decision, it means that the evidence from the observations is not enough to declare that the probability of the hypothesis being true is either larger or smaller than the desired value $1 − γ$; more observations are necessary to reach a reliable decision.
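The three-way decision rule above can be written compactly. The following Python sketch is only an illustration; the function name `idp_decision` and the returned labels are our own.

```python
def idp_decision(p_lower, p_upper, gamma=0.05):
    """Three-way IDP decision from the lower and upper posterior probabilities."""
    if p_lower > p_upper:
        raise ValueError("the lower probability cannot exceed the upper probability")
    threshold = 1.0 - gamma
    if p_lower > threshold:
        # Both inequalities hold (the upper one follows from the lower one).
        return "dependent"
    if p_upper > threshold:
        # Only the upper inequality holds: the decision is prior-dependent.
        return "indeterminate"
    # Neither inequality holds.
    return "not dependent"

if __name__ == "__main__":
    print(idp_decision(0.97, 0.99))   # both bounds above 0.95
    print(idp_decision(0.90, 0.97))   # bounds on opposite sides of 0.95
    print(idp_decision(0.50, 0.60))   # both bounds below 0.95
```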
Theorem 2.
The upper probability $\overline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big)$ is obtained by a prior measure $\alpha^* = m\,\delta_{(0,0)} + (1 - m)\,\delta_{(1,1)}$ with
$m = \begin{cases} 0 & \text{if } e_1 + \omega_0 < e_0, \\ 1 & \text{if } e_0 < e_1 - \omega_0, \\ \dfrac{\omega_0 + e_1 - e_0}{2\omega_0} & \text{otherwise}, \end{cases}$
where $e_0 = \sum_{i=1}^{n} \omega_i\, I_{\{0,0\}}(Y_i, Z_i)$ and $e_1 = \sum_{i=1}^{n} \omega_i\, I_{\{1,1\}}(Y_i, Z_i)$. The lower probability $\underline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big)$ is obtained by a prior measure $\alpha^* = m\,\delta_{(1,0)} + (1 - m)\,\delta_{(0,1)}$ with the same m as before but with $e_0 = \sum_{i=1}^{n} \omega_i\, I_{\{1,0\}}(Y_i, Z_i)$ and $e_1 = \sum_{i=1}^{n} \omega_i\, I_{\{0,1\}}(Y_i, Z_i)$.
Proof.
We are interested in the quantity $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$. It is clear that, in order to maximize the probability that $\omega_{00}\omega_{11} - \omega_{01}\omega_{10} > 0$, we must put all the prior mass on the cells $(0,0)$ and $(1,1)$, that is, on the product $\omega_{00}\omega_{11}$. Let us call $m = \int I_{\{0,0\}}(X)\, dF(X)$. Then $\int I_{\{1,1\}}(X)\, dF(X) = 1 - \int I_{\{0,0\}}(X)\, dF(X) = 1 - m$. From (9)–(12), we have that $\omega_{00} = \omega_0 m + e_0$ and $\omega_{11} = \omega_0 (1 - m) + e_1$ with $m \in [0, 1]$. By computing the derivative with respect to m we have
$\frac{d}{dm}\, \omega_{00}\omega_{11} = \omega_0 \big(\omega_0 (1 - m) + e_1\big) + \big(\omega_0 m + e_0\big)(-\omega_0),$
whose zero is $m = \frac{\omega_0 + e_1 - e_0}{2\omega_0}$, which is a maximum because $\omega_{00}\omega_{11}$ is a concave quadratic in m. Hence, the maximum over $[0, 1]$ is attained either at $m = \frac{\omega_0 + e_1 - e_0}{2\omega_0}$ or at one of the extremes $m = 0$ or $m = 1$. The extremes occur when $\omega_0 + e_1 - e_0 < 0$ (so $m = 0$) or $\omega_0 + e_1 - e_0 > 2\omega_0$ (so $m = 1$). The lower probability $\underline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big)$ can be determined using a similar reasoning. ☐
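The choice of m in Theorem 2 amounts to clipping the stationary point of a concave quadratic to $[0, 1]$. The sketch below, with hypothetical function names and illustrative input values of our own, compares the closed form with a brute-force grid search.

```python
import numpy as np

def optimal_m(w0, e0, e1):
    """Maximiser over m in [0, 1] of (w0 * m + e0) * (w0 * (1 - m) + e1), for w0 > 0."""
    return float(np.clip((w0 + e1 - e0) / (2.0 * w0), 0.0, 1.0))

def brute_force_m(w0, e0, e1, grid=10_001):
    """Same maximiser found by evaluating the product on a fine grid (check only)."""
    m = np.linspace(0.0, 1.0, grid)
    return float(m[np.argmax((w0 * m + e0) * (w0 * (1.0 - m) + e1))])

if __name__ == "__main__":
    for w0, e0, e1 in [(0.2, 0.3, 0.35), (0.1, 0.6, 0.1), (0.1, 0.1, 0.6)]:
        print(w0, e0, e1, optimal_m(w0, e0, e1), brute_force_m(w0, e0, e1))
```

The three illustrative cases cover the interior solution and the two boundary cases $m = 0$ and $m = 1$.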
Since $(\omega_0, \omega_1, \dots, \omega_n) \sim \mathrm{Dir}(s, 1, \dots, 1)$, the probabilities $\underline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big)$ and $\overline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big)$ can be computed by Monte Carlo sampling. The following pseudo-code describes how to compute the upper probability (the lower can be computed in a similar way).
• Initialize the counter $P_c$ to 0 and the array V to empty;
• For $i = 1, \dots, N_{mc}$:
(a) sample $(\omega_0, \omega_1, \dots, \omega_n) \sim \mathrm{Dir}(s, 1, \dots, 1)$;
(b) compute $\omega_{00}, \omega_{01}, \omega_{10}, \omega_{11}$ as in (9)–(12) by choosing $dF(X) = m\,\delta_{(0,0)}(X) + (1 - m)\,\delta_{(1,1)}(X)$ with m defined in Theorem 2;
(c) compute $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$ and store the result in V;
(d) if $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0$ then $P_c = P_c + 1$, otherwise leave $P_c$ unchanged.
• Compute the histogram of the elements in V (this gives us the plot of the posterior of $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$);
• Compute the posterior upper probability that $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$ is greater than zero as $\overline{\mathcal{P}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10}) > 0 \mid X^n\big) \approx P_c / N_{mc}$.
The number of Monte Carlo samples $N_{mc}$ is equal to 100,000 in the following examples and figures.
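The pseudo-code above can be sketched in Python as follows. This is a minimal vectorized sketch under our own naming (`idp_binary_test`, `_best_m` and the argument names are hypothetical), not a reference implementation: all Monte Carlo draws are processed at once instead of in a loop, and the lower bound is computed alongside the upper bound using the prior of Theorem 2.

```python
import numpy as np

def _best_m(w0, e0, e1):
    # Clipped stationary point of Theorem 2; guard against w0 == 0 in floating point.
    w0 = np.maximum(w0, np.finfo(float).tiny)
    return np.clip((w0 + e1 - e0) / (2.0 * w0), 0.0, 1.0)

def idp_binary_test(y, z, s=0.5, n_mc=100_000, seed=0):
    """Monte Carlo lower/upper estimates of P(2(w00 w11 - w01 w10) > 0 | data).

    Returns (p_lower, p_upper, v_lower, v_upper); the last two are the sampled
    statistics (the array V of the pseudo-code) under the two extreme priors.
    """
    y, z = np.asarray(y), np.asarray(z)
    rng = np.random.default_rng(seed)
    # Step (a): (omega_0, ..., omega_n) ~ Dir(s, 1, ..., 1), one row per Monte Carlo draw.
    w = rng.dirichlet(np.concatenate(([s], np.ones(y.size))), size=n_mc)
    w0, wi = w[:, 0], w[:, 1:]
    # Data part of each cell probability: e[(a, b)] = sum_i omega_i * 1[Y_i = a, Z_i = b].
    e = {(a, b): wi @ ((y == a) & (z == b)).astype(float) for a in (0, 1) for b in (0, 1)}
    # Steps (b)-(c), upper: prior mass m on (0,0) and 1 - m on (1,1).
    m = _best_m(w0, e[0, 0], e[1, 1])
    v_upper = 2.0 * ((e[0, 0] + w0 * m) * (e[1, 1] + w0 * (1.0 - m)) - e[0, 1] * e[1, 0])
    # Lower: prior mass m on (1,0) and 1 - m on (0,1).
    m = _best_m(w0, e[1, 0], e[0, 1])
    v_lower = 2.0 * (e[0, 0] * e[1, 1] - (e[0, 1] + w0 * (1.0 - m)) * (e[1, 0] + w0 * m))
    # Step (d): fraction of draws in which the statistic is positive.
    return float(np.mean(v_lower > 0)), float(np.mean(v_upper > 0)), v_lower, v_upper

if __name__ == "__main__":
    # Illustrative data: ten paired binary observations.
    y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    z = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
    p_lo, p_up, _, _ = idp_binary_test(y, z)
    print("lower:", p_lo, "upper:", p_up)
```

Using the same Dirichlet draws for both bounds guarantees that the sampled upper statistic is never below the lower one, so the estimated lower probability cannot exceed the estimated upper probability.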
The lower and upper $\mathrm{HDI}$ intervals in Theorem 1 can also be obtained as in Theorem 2 and computed via Monte Carlo sampling ($\mathrm{HDI}$ can be computed using the values stored in V, see pseudo-code). Hereafter we will denote the two intervals corresponding to the lower and upper distributions as $\underline{\mathrm{HDI}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})\big)$ and $\overline{\mathrm{HDI}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})\big)$, respectively.
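One common way to approximate an HDI from Monte Carlo samples is to take the shortest interval that contains the required fraction of the sorted samples. The sketch below follows this approach; it is our own illustration (the function name `hdi` is hypothetical) and it assumes that the posterior is unimodal, so that the highest density region is a single interval.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples.

    For a unimodal posterior this approximates the highest density interval.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(mass * n))              # number of samples the interval must contain
    widths = x[k - 1:] - x[: n - k + 1]     # width of every window of k consecutive samples
    i = int(np.argmin(widths))              # left end of the narrowest window
    return x[i], x[i + k - 1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.beta(8, 3, size=50_000) - 0.5   # stand-in for the values stored in V
    lo, hi = hdi(v, 0.95)
    print("95% HDI:", lo, hi, "| contains zero:", lo <= 0.0 <= hi)
```

The dependence statement of Theorem 1 then reduces to checking whether zero lies outside the returned interval.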
The only prior parameter that must be selected with the IDP is the prior strength s. The value of s determines how quickly the posteriors corresponding to the lower and upper probabilities converge as the number of observations increases. We select $s = 0.5$; this means that we need at least 4 concordant binary observations to take a decision with $1 - \gamma = 0.95$. In other words, for $s = 0.5$ we need two observations of type $Y = 0, Z = 0$ and two of type $Y = 1, Z = 1$ to guarantee that both $95\%$ $\mathrm{HDI}$ intervals, i.e., $\underline{\mathrm{HDI}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})\big)$ and $\overline{\mathrm{HDI}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})\big)$, do not include zero. For any number (and configuration) of observations smaller than four, the test is always indeterminate (i.e., no decision can be taken). Thus, four is the minimum number of observations required to take a decision. This choice is arbitrary and subjective, but it is our measure of cautiousness. We clarify the meaning of determinate and indeterminate in the following example.
Example 1.
Let us consider the following three matrices of 10 paired binary i.i.d. observations
$X_a^{10} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \end{bmatrix}^T, \quad X_b^{10} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \end{bmatrix}^T, \quad X_c^{10} = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}^T.$
They correspond to different degrees of dependence. Figure 1 shows the lower and upper distributions of $2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})$ and the corresponding $95\%$ HDIs, i.e., $\underline{\mathrm{HDI}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})\big)$ and $\overline{\mathrm{HDI}}\big(2(\omega_{00}\omega_{11} - \omega_{01}\omega_{10})\big)$, for the three cases $a, b, c$ (the filled-in areas). In case (a), the two variables are dependent (concordant) with probability greater than $0.95$, since all the mass of the lower and upper distributions is in the interval $[0, 0.5]$. In case (b), we are in an indeterminate situation, that is, the lower and upper distributions disagree, which means that the inference is prior-dependent. In case (c), we can only say that the variables are not dependent at $95\%$, since both $\mathrm{HDI}$ intervals include zero.

#### 3.2. Bayesian Bivariate Independence Test for Continuous Variables

Let us assume that the variables $Y, Z \in \mathbb{R}$, that is, they are real continuous variables. Our aim is to devise a Bayesian independence test for continuous variables based on the DP. We will also show that our test is a Bayesian generalisation of Kendall’s $\tau$ test for independence. This test uses results from  that derived a Bayesian Kendall’s $\tau$ statistic using the DP. As before, we introduce auxiliary variables $Y', Z'$. We start by defining the following quantities:
$T_1 = \{(Y, Z, Y', Z') : (Y - Y')(Z - Z') > 0\}, \qquad T_2 = \{(Y, Z, Y', Z') : (Y - Y')(Z - Z') < 0\}.$
$T_1$ and $T_2$ are the sets of concordant and discordant pairs, respectively. We can then compute
$E[I_{T_1} - I_{T_2}] = \int\!\!\int \big(I_{T_1}(X, X') - I_{T_2}(X, X')\big)\, dF(X \mid X^n)\, dF(X' \mid X'^n),$
where we have exploited the independence of $X, X'$ and where $F(X \mid X^n)$ denotes the posterior cumulative distribution of $P_X \mid X^n$. This quantity is equal to
$E[I_{T_1} - I_{T_2}] = \omega_0^2 \int\!\!\int \big(I_{T_1}(X, X') - I_{T_2}(X, X')\big)\, dF(X)\, dF(X') + 2 \sum_{i=1}^{n} \omega_0 \omega_i \int \big(I_{T_1}(X_i, X') - I_{T_2}(X_i, X')\big)\, dF(X') + \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_i \omega_j \big(I_{T_1}(X_i, X_j) - I_{T_2}(X_i, X_j)\big),$
where we have exploited the fact that $X'$ has the same distribution as X and the same observations. Given $(\omega_0, \dots, \omega_n)$, it can be seen that the first two terms depend on the prior distribution $F \sim Dp(s, \alpha^*)$ and the last term is due only to the observations.
Theorem 3.
The variables Y and Z are said to be concordant (dependent) with posterior probability $(1 - \gamma)$ provided that
$\mathcal{P}\big(E[I_{T_1} - I_{T_2}]/2 > 0 \mid X^n\big) > (1 - \gamma),$
and they are said to be discordant provided that
$\mathcal{P}\big(E[I_{T_1} - I_{T_2}]/2 < 0 \mid X^n\big) > (1 - \gamma),$
where $\mathcal{P}$ is the probability computed with respect to $(\omega_0, \omega_1, \dots, \omega_n) \sim \mathrm{Dir}(s, 1, \dots, 1)$ and $dF \sim Dp(s, \alpha^*)$. Finally, they are said to be simply dependent with posterior probability $(1 - \gamma)$ provided that
$0 \notin (1 - \gamma)\, \mathrm{HDI}\big(E[I_{T_1} - I_{T_2}]/2 \mid X^n\big),$
where $\mathrm{HDI}$ denotes the posterior Highest Density Interval of $E[I_{T_1} - I_{T_2}]/2$.
The divisor 2 in $E[I_{T_1} - I_{T_2}]/2$ is only a scaling factor so that the expectation lies in $[-0.5, 0.5]$. The theorem simply follows from the fact that $E[I_{T_1} - I_{T_2}]$ is the same measure of dependence used in Kendall’s $\tau$ test. In this respect, it is worth highlighting the connection with Kendall’s $\tau$. By exploiting the properties of the DP, we have that the posterior mean of $E[I_{T_1} - I_{T_2}]$ for large n is approximately equal to
$\mathcal{E}\big(E[I_{T_1} - I_{T_2}] \mid X^n\big) \approx \frac{1}{(n + 1)\, n} \sum_{i=1}^{n} \sum_{j=1}^{n} \big(I_{T_1}(X_i, X_j) - I_{T_2}(X_i, X_j)\big),$
and this is, up to the normalizing constant, Kendall’s sample $\tau$ coefficient. In fact, Kendall’s sample $\tau$ coefficient is defined as:
$T = \frac{2 \sum_{1 \le i < j \le n} A_{ij}}{n (n - 1)} = \frac{2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} A_{ij}}{n (n - 1)},$
with
$A_{ij} = \begin{cases} 1, & \text{if } (Y_i - Y_j)(Z_i - Z_j) > 0, \\ -1, & \text{if } (Y_i - Y_j)(Z_i - Z_j) < 0. \end{cases}$
Observe that T can also be rewritten as:
$T = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}}{n (n - 1)},$
in terms of all the $A_{ij}$ pairs, which is proportional to (30) for large n. This clarifies the connection between our Bayesian test of dependence for continuous variables based on $E[I_{T_1} - I_{T_2}]/2$ and Kendall’s $\tau$ test.
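The two expressions for T can be compared numerically. In the Python sketch below, the function names are our own and the treatment of ties ($A_{ij} = 0$ when either difference is zero, including $i = j$) is an assumption, since the definition of $A_{ij}$ above covers only strict inequalities.

```python
import numpy as np

def concordance_matrix(y, z):
    """A[i, j] = sign((Y_i - Y_j)(Z_i - Z_j)); ties and the diagonal give 0 (our assumption)."""
    y, z = np.asarray(y, dtype=float), np.asarray(z, dtype=float)
    return np.sign(np.subtract.outer(y, y) * np.subtract.outer(z, z))

def kendall_tau_pairs(y, z):
    """T computed from the pairs i < j only."""
    a = concordance_matrix(y, z)
    n = a.shape[0]
    return 2.0 * a[np.triu_indices(n, k=1)].sum() / (n * (n - 1))

def kendall_tau_full(y, z):
    """T computed from the full double sum over i and j."""
    a = concordance_matrix(y, z)
    n = a.shape[0]
    return a.sum() / (n * (n - 1))

if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]
    z = [1.0, 3.0, 2.0, 4.0]
    print(kendall_tau_pairs(y, z), kendall_tau_full(y, z))
```

Comparing the normalizing constants shows that the large-n posterior mean above equals $T\,(n-1)/(n+1)$, and this factor tends to one as n grows.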
As for the dependence test for binary variables, we will make inferences using the IDP. Inferences with the IDP can be computed by verifying whether
$\underline{\mathcal{P}}\big(E[I_{T_1} - I_{T_2}]/2 > 0 \mid X^n\big) > (1 - \gamma), \qquad \overline{\mathcal{P}}\big(E[I_{T_1} - I_{T_2}]/2 > 0 \mid X^n\big) > (1 - \gamma).$
Theorem 4.
The upper probability $\overline{\mathcal{P}}\big(E[I_{T_1} - I_{T_2}]/2 > 0 \mid X^n\big)$ is obtained by a prior measure $\alpha^* = 0.5\,\delta_{X_{0a}} + 0.5\,\delta_{X_{0b}}$ with $X_{0a} > X_{0b} > X_i$ (componentwise) for $i = 1, \dots, n$. The lower probability $\underline{\mathcal{P}}\big(E[I_{T_1} - I_{T_2}]/2 > 0 \mid X^n\big)$ is obtained by a prior measure $\alpha^* = 0.5\,\delta_{X_{0a}} + 0.5\,\delta_{X_{0b}}$ with $X_0 = (Y_0, Z_0)$ and $Y_{0a} > Y_{0b} > Y_i$ and $Z_{0a} < Z_{0b} < Z_i$ for $i = 1, \dots, n$.
Proof.
We have that
$E[I_{T_1} - I_{T_2}] = \omega_0^2 \int\!\!\int \big(I_{T_1}(X, X') - I_{T_2}(X, X')\big)\, dF(X)\, dF(X') + 2 \sum_{i=1}^{n} \omega_0 \omega_i \int \big(I_{T_1}(X_i, X') - I_{T_2}(X_i, X')\big)\, dF(X') + \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_i \omega_j \big(I_{T_1}(X_i, X_j) - I_{T_2}(X_i, X_j)\big).$
We want to maximize $I_{T_1}(X, X')$. Since $\int\!\!\int I_{T_1}(X, X')\, \delta_{X_{0a}}(X)\, \delta_{X_{0a}}(X')\, dX\, dX' = 0$, we need at least two Dirac deltas. Hence, we consider the mixture $dF = m\,\delta_{X_{0a}} + (1 - m)\,\delta_{X_{0b}}$ with $X_{0a} > X_{0b} > X_i$ for $i = 1, \dots, n$. The ordered pairs $(X_{0a}, X_{0b})$ and $(X_{0b}, X_{0a})$ are both concordant and each has probability $m(1 - m)$, so that
$E[I_{T_1} - I_{T_2}] = 2 m (1 - m)\, \omega_0^2 + 2 \sum_{i=1}^{n} \omega_0 \omega_i + \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_i \omega_j \big(I_{T_1}(X_i, X_j) - I_{T_2}(X_i, X_j)\big),$
and so we have maximized the second term. For the first term, which depends on $m(1 - m)$, the maximum is obtained at $m = 1/2$. For the lower probability, the proof is similar. ☐
The lower and upper $\mathrm{HDI}$ intervals can also be obtained as in Theorem 4. Again in this case, the value of s determines how quickly the lower and upper posteriors converge as the number of observations increases. We choose $s = 0.5$ as for the binary test.
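A Monte Carlo version of the continuous test can be sketched along the same lines as the binary one. In the Python sketch below (the function name `idp_kendall_test` is hypothetical), the two prior atoms of Theorem 4 are treated as two extra points, each carrying weight $\omega_0/2$, that are concordant with every other point for the upper bound and discordant with every other point for the lower bound. The statistic is then evaluated directly as a double sum over all atoms, so no closed form for the prior terms is needed. Ties among the observations are assigned a contribution of zero, which is our assumption.

```python
import numpy as np

def idp_kendall_test(y, z, s=0.5, n_mc=20_000, seed=0):
    """Monte Carlo lower/upper estimates of P(E[I_T1 - I_T2] / 2 > 0 | data).

    Returns (p_lower, p_upper, v_lower, v_upper).
    """
    y, z = np.asarray(y, dtype=float), np.asarray(z, dtype=float)
    n = y.size
    # Concordance indicator I_T1 - I_T2 for every pair of observations (0 for ties).
    a = np.sign(np.subtract.outer(y, y) * np.subtract.outer(z, z))
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(np.concatenate(([s], np.ones(n))), size=n_mc)
    # Split the prior weight omega_0 equally between the two prior atoms (m = 1/2).
    w_aug = np.column_stack((w[:, 0] / 2.0, w[:, 0] / 2.0, w[:, 1:]))
    stats = []
    for sign in (-1.0, 1.0):                  # -1: lower prior, +1: upper prior
        a_aug = np.full((n + 2, n + 2), sign)  # prior atoms vs. every other atom
        a_aug[2:, 2:] = a                      # observation vs. observation
        np.fill_diagonal(a_aug, 0.0)           # an atom paired with itself contributes 0
        stats.append(0.5 * np.sum((w_aug @ a_aug) * w_aug, axis=1))
    v_lower, v_upper = stats
    return float(np.mean(v_lower > 0)), float(np.mean(v_upper > 0)), v_lower, v_upper

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.normal(size=30)
    z = y + rng.normal(scale=0.5, size=30)     # illustrative positively dependent data
    p_lo, p_up, _, _ = idp_kendall_test(y, z)
    print("lower:", p_lo, "upper:", p_up)
```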
Example 2.
Also in this case we consider three matrices of 10 paired continuous i.i.d. observations
$X_a^{10} = \begin{bmatrix} -0.1 & -0.2 & -0.3 & -0.4 & -0.5 & 0.5 & 0.4 & 0.3 & 0.2 & 0.1 \\ -0.1 & -0.2 & -0.3 & -0.4 & -0.5 & 0.5 & 0.4 & -0.3 & -0.2 & -0.1 \end{bmatrix}^T, \quad X_b^{10} = \begin{bmatrix} -0.1 & -0.2 & -0.3 & -0.4 & -0.5 & 0.5 & 0.4 & 0.3 & 0.2 & 0.1 \\ -0.1 & -0.2 & -0.3 & -0.4 & -0.5 & 0.5 & -0.4 & -0.3 & -0.2 & -0.1 \end{bmatrix}^T, \quad X_c^{10} = \begin{bmatrix} -0.1 & 0.2 & 0.3 & -0.4 & -0.5 & -0.5 & 0.4 & 0.3 & 0.2 & 0.1 \\ 0.1 & -0.2 & -0.3 & -0.4 & 0.5 & 0.5 & 0.4 & -0.3 & -0.2 & -0.1 \end{bmatrix}^T.$
They correspond to different degrees of dependence. Figure 2 shows the lower and upper posteriors for the three cases $a, b, c$ and the corresponding $\mathrm{HDI}$ intervals at $95\%$ probability (the filled-in areas). In case (a), the two variables are dependent (concordant) with probability greater than $0.95$, since all the mass of the lower and upper distributions is in the interval $[0, 0.5]$. In case (b), we are in an indeterminate situation, that is, the lower and upper distributions disagree, which means that the inference is prior-dependent. In case (c), we can only say that the variables are not dependent at $95\%$, since both $\mathrm{HDI}$ intervals include zero.

#### 3.3. Bayesian Bivariate Independence Test for Mixed Continuous-Binary Variables

Let us assume that $Y \in \mathbb{R}$ and $Z \in \{0, 1\}$. Our aim is to devise a Bayesian independence test based on the DP. We introduce the auxiliary variable $X'$ as done before. To derive our test, we start by defining the following indicator:
$I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z').$
This indicator is one if $X = (Y, 0)$ and $X' = (Y', 1)$ with $Y > Y'$, and zero otherwise. We can compute
$E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] = \int\!\!\int I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z') \, dF(X \mid X^n)\, dF(X' \mid {X'}^{n}),$
where we have exploited the independence of $X, X'$. $F(X \mid X^n)$ denotes the posterior cumulative distribution of $P_{X \mid X^n}$. This quantity is equal to
$\begin{aligned} E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] ={} & \omega_0^2 \int\!\!\int I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z') \, dF(X)\, dF(X') \\ & + \sum_{i=1}^{n} \omega_0 \omega_i \int I_{(Y',\infty)}(Y_i)\, I_{\{0\}}(Z_i)\, I_{\{1\}}(Z') \, dF(X') \\ & + \sum_{i=1}^{n} \omega_0 \omega_i \int I_{(Y_i,\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z_i) \, dF(X) \\ & + \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_i \omega_j\, I_{(Y_j,\infty)}(Y_i)\, I_{\{0\}}(Z_i)\, I_{\{1\}}(Z_j). \end{aligned}$
For large n, we have that
$E\left( E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] \mid X^n \right) \approx \frac{1}{(n+1)n} \sum_{i=1}^{n} \sum_{j=1}^{n} I_{(Y_j,\infty)}(Y_i)\, I_{\{0\}}(Z_i)\, I_{\{1\}}(Z_j),$
which is equal to the rank of Y in the observations $(Y, 0)$ with respect to the observations $(Y, 1)$. Therefore, our dependence test is rank-based. It is clear that, in the case of independence of the variables Y and Z, the mean rank is equal to $0.125$. Hence, we can formulate an independence test for mixed variables.
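The large-n approximation above is a double sum over pairs of observations, so it can be sketched directly in code. The snippet below is only an illustration: the function name `mixed_rank_statistic` and the toy data are hypothetical and do not come from the paper.

```python
def mixed_rank_statistic(y, z):
    """Return (count, count / ((n + 1) * n)), where count is the number of
    ordered pairs (i, j) with z_i = 0, z_j = 1 and y_i > y_j."""
    n = len(y)
    if n != len(z):
        raise ValueError("y and z must have the same length")
    count = sum(
        1
        for yi, zi in zip(y, z)
        for yj, zj in zip(y, z)
        if zi == 0 and zj == 1 and yi > yj
    )
    return count, count / ((n + 1) * n)


# Hypothetical toy data (not from the paper): every y with z = 0
# is larger than every y with z = 1.
y = [0.3, -1.2, 0.8, -0.4, 1.5, -0.9]
z = [0, 1, 0, 1, 0, 1]
count, stat = mixed_rank_statistic(y, z)
# Scaled and shifted version of the statistic, as used in the test that follows.
print(count, stat, 4 * stat - 0.5)
```

The quadratic loop is kept for readability; for large n a sort-based rank computation would be the more natural choice.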
Theorem 5.
The variables Y and Z are dependent with posterior probability $(1 - \gamma)$ provided that
$0 \notin (1 - \gamma)\, HDI\left( 4 E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] - 0.5 \mid X^n \right),$
where $HDI$ denotes the posterior Highest Density Interval of $4 E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] - 0.5$.
The theorem follows from the fact that, in case of independence between the variables Y and Z, the mean rank (36) scaled by 4 and shifted by $-0.5$ is equal to 0. Also in this case, we make inferences using the IDP.
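As a quick check of the constants in Theorem 5, the following worked step may help. It assumes that Y is continuous (so ties have probability zero), that $X$ and $X'$ are independent draws from the same distribution, and, as an added assumption that is not stated explicitly in this section, that the two values of Z are equally likely:

$E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] = P(Y > Y')\, P(Z = 0)\, P(Z' = 1) = \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 0.125, \qquad 4 \cdot 0.125 - 0.5 = 0.$

For a general $P(Z = 0) = p$, the same factorization would give $\tfrac{1}{2}\, p (1 - p)$, which equals $0.125$ only at $p = 1/2$.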
Theorem 6.
The upper probability $\overline{P}\left( 4 E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] - 0.5 > 0 \mid X^n \right)$ is obtained by a prior measure $\alpha^* = m\, \delta_{X_0^a} + (1 - m)\, \delta_{X_0^b}$ with $X_0^a$ equal to $(-\infty, 1)$ and $X_0^b$ equal to $(\infty, 0)$ and
$m = \begin{cases} 0 & \text{if } \omega_0 + e_0 < e_1, \\ 1 & \text{if } e_1 < e_0 - \omega_0, \\ \dfrac{\omega_0 + e_0 - e_1}{2 \omega_0} & \text{otherwise}, \end{cases}$
with $e_0 = \sum_{i=1}^{n} \omega_i I_{\{0\}}(Z_i)$ and $e_1 = \sum_{i=1}^{n} \omega_i I_{\{1\}}(Z_i)$. The lower probability $\underline{P}\left( 4 E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] - 0.5 > 0 \mid X^n \right)$ is obtained by a prior measure $\alpha^* = \delta_{X_0}$ with $X_0$ equal to $(-\infty, 0)$.
Proof.
Consider the quantity
$\begin{aligned} E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] ={} & \omega_0^2 \int\!\!\int I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z') \, dF(X)\, dF(X') \\ & + \sum_{i=1}^{n} \omega_0 \omega_i \int I_{(Y',\infty)}(Y_i)\, I_{\{0\}}(Z_i)\, I_{\{1\}}(Z') \, dF(X') \\ & + \sum_{i=1}^{n} \omega_0 \omega_i \int I_{(Y_i,\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z_i) \, dF(X) \\ & + \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_i \omega_j\, I_{(Y_j,\infty)}(Y_i)\, I_{\{0\}}(Z_i)\, I_{\{1\}}(Z_j), \end{aligned}$
and $\alpha^* = m\, \delta_{X_0^a} + (1 - m)\, \delta_{X_0^b}$ with $X_0^a$ equal to $(-\infty, 1)$ and $X_0^b$ equal to $(\infty, 0)$. Thus
$\begin{aligned} E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] ={} & m (1 - m)\, \omega_0^2 + m \sum_{i=1}^{n} \omega_0 \omega_i I_{\{0\}}(Z_i) + (1 - m) \sum_{i=1}^{n} \omega_0 \omega_i I_{\{1\}}(Z_i) \\ & + \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_i \omega_j\, I_{(Y_j,\infty)}(Y_i)\, I_{\{0\}}(Z_i)\, I_{\{1\}}(Z_j). \end{aligned}$
By computing the derivative and setting it to zero,
$\frac{d}{dm} E[I_{(Y',\infty)}(Y)\, I_{\{0\}}(Z)\, I_{\{1\}}(Z')] = \omega_0 \left( \omega_0 - 2 \omega_0 m + e_0 - e_1 \right) = 0,$
we have that $m = \frac{\omega_0 + e_0 - e_1}{2 \omega_0}$. The result is obtained by exploiting the fact that $m \in [0, 1]$. For the lower probability, the computation is straightforward. ☐
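The maximizing weight m in the proof can be checked numerically. The following is a minimal sketch, assuming the m-dependent part of the expectation is $m(1-m)\omega_0^2 + m\,\omega_0 e_0 + (1-m)\,\omega_0 e_1$ as in the proof; the function names and the numeric weights are hypothetical and chosen only for illustration.

```python
def upper_prior_weight(w0, e0, e1):
    """Stationary point (w0 + e0 - e1) / (2 * w0), clipped to [0, 1]."""
    if w0 <= 0:
        raise ValueError("w0 must be positive")
    m = (w0 + e0 - e1) / (2.0 * w0)
    return min(1.0, max(0.0, m))


def prior_dependent_part(m, w0, e0, e1):
    """Terms of the expectation that depend on m; the double sum over the
    observed data does not depend on m and is therefore left out."""
    return m * (1.0 - m) * w0 ** 2 + m * w0 * e0 + (1.0 - m) * w0 * e1


# Hypothetical weights with w0 + e0 + e1 = 1 (not taken from the paper).
w0, e0, e1 = 0.2, 0.45, 0.35
m_closed_form = upper_prior_weight(w0, e0, e1)

# Brute-force maximization over a grid on [0, 1] as a cross-check.
grid = [k / 1000 for k in range(1001)]
m_grid = max(grid, key=lambda m: prior_dependent_part(m, w0, e0, e1))
print(m_closed_form, m_grid)
```

Because the objective is a concave quadratic in m (its second derivative is $-2\omega_0^2$), clipping the stationary point to $[0, 1]$ gives the constrained maximizer, which is what the three cases of Theorem 6 express.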
The lower and upper $H D I$ intervals can also be obtained as in Theorem 4. We choose $s = 0 . 5$ as for the previous tests.
Example 3.
We consider three matrices of 10 paired binary-continuous i.i.d. observations
$X_a^{10} = \begin{bmatrix} -0.1 & -0.2 & -0.3 & -0.4 & -0.5 & 0.5 & 0.4 & 0.3 & 0.2 & 0.1 \\ 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^T,$
$X_b^{10} = \begin{bmatrix} -0.1 & -0.2 & -0.3 & -0.4 & -0.5 & 0.5 & 0.4 & 0.3 & 0.2 & 0.1 \\ 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}^T,$
$X_c^{10} = \begin{bmatrix} -0.1 & -0.2 & -0.3 & -0.4 & -0.5 & 0.5 & 0.4 & 0.3 & 0.2 & 0.1 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}^T.$
Again, they correspond to different degrees of dependence. Figure 3 shows the lower and upper posteriors for the three cases (a), (b) and (c), together with the corresponding $HDI$ intervals at $95\%$ probability (the filled areas). In case (a), the two variables are dependent (concordant) with probability greater than $0.95$, since all the mass of the lower and upper distributions lies in the interval $[0, 0.5]$. In case (b), we are in an indeterminate situation, that is, the lower and upper posteriors disagree. In case (c), we can only say that the variables are not dependent at $95\%$, since both $HDI$ intervals include zero.
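The lower and upper posteriors for cases (a) and (c) of Example 3 can be approximated by Monte Carlo. The sketch below rests on assumptions that are not spelled out in this section: the posterior weights $(\omega_0, \omega_1, \dots, \omega_n)$ are drawn from a Dirichlet distribution with parameters $(s, 1, \dots, 1)$, and the $HDI$ is estimated as the shortest interval covering 95% of the draws rather than through Theorem 4. All function names are hypothetical.

```python
import random


def dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]


def lower_upper_samples(y, z, s=0.5, draws=5000, seed=0):
    """Posterior draws of 4*E[...] - 0.5 under the lower and upper priors."""
    rng = random.Random(seed)
    n = len(y)
    # Ordered pairs (i, j) with z_i = 0, z_j = 1 and y_i > y_j.
    pairs = [(i, j) for i in range(n) for j in range(n)
             if z[i] == 0 and z[j] == 1 and y[i] > y[j]]
    lower, upper = [], []
    for _ in range(draws):
        w = dirichlet([s] + [1.0] * n, rng)  # assumed posterior weights
        w0, wd = w[0], w[1:]
        e0 = sum(wi for wi, zi in zip(wd, z) if zi == 0)
        e1 = sum(wi for wi, zi in zip(wd, z) if zi == 1)
        data_term = sum(wd[i] * wd[j] for i, j in pairs)
        # Weight m of Theorem 6, clipped to [0, 1].
        m = min(1.0, max(0.0, (w0 + e0 - e1) / (2.0 * w0))) if w0 > 0 else 0.0
        prior_term = m * (1 - m) * w0 ** 2 + m * w0 * e0 + (1 - m) * w0 * e1
        lower.append(4 * data_term - 0.5)                 # prior at (-inf, 0)
        upper.append(4 * (data_term + prior_term) - 0.5)  # prior alpha*
    return lower, upper


def hdi(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples."""
    xs = sorted(samples)
    k = max(1, int(round(mass * len(xs))))
    start = min(range(len(xs) - k + 1), key=lambda a: xs[a + k - 1] - xs[a])
    return xs[start], xs[start + k - 1]


# Data of Example 3, cases (a) and (c).
y = [-0.1, -0.2, -0.3, -0.4, -0.5, 0.5, 0.4, 0.3, 0.2, 0.1]
z_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
z_c = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
for name, z in (("a", z_a), ("c", z_c)):
    lo, up = lower_upper_samples(y, z)
    p_lo = sum(v > 0 for v in lo) / len(lo)
    p_up = sum(v > 0 for v in up) / len(up)
    print(name, round(p_lo, 3), round(p_up, 3), hdi(lo), hdi(up))
```

Using the same weight draws for both bounds guarantees that each upper sample is at least as large as the corresponding lower sample, so the estimated lower probability never exceeds the upper one.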

## 4. Experiments

We compare our Bayesian testing approach in the three main scenarios discussed: both variables binary, both continuous, and one binary and the other continuous. The goal is to decide whether the two variables are dependent or independent. We generate n samples ($n = 20$ and 50) using the distributions defined in Table 1. Ten thousand repetitions are used by forcing the variables to be independent (so $\beta = 0$) and thousand repetitions where the variables are dependent, for each value of $\beta > 0$. The value of β is varied as explained in the table. For each n, β and each of these twenty thousand samples (for which we know the correct result of the test), we run the new approach versus the $\chi^2$ test, the Kendall τ test and the Kolmogorov–Smirnov test for the binary-binary, continuous-continuous and binary-continuous cases, respectively. For each run of each frequentist method, we record its p-value, while for the new approach we compute the γ corresponding to the limiting credible region, $1 - \gamma$ wide, at which the decision changes between dependent and independent. This value is related to the p-values of the other tests and can be used for decision making by comparing it against a threshold (just as is done with p-values). However, it should be observed that thresholds different from $0.05$ or $0.01$ are hardly used in practice in null hypothesis significance tests. Conversely, for a Bayesian test $1 - \gamma$ is a probability and, therefore, we can take decisions with probability $0.99$ or $0.95$, but also $0.7$ or even $0.51$, depending on the application (and the loss function). However, instead of fixing a threshold (which is a subjective choice) to decide between the options dependent and non-dependent with probability $1 - \gamma$, we use Receiver Operating Characteristic (ROC) curves, which give the quality of the approaches for all possible thresholds.
The curves are calculated as usual by varying the threshold from 0 to 1 and computing the sensitivity (true positive rate) and the specificity (one minus the false positive rate); this is slightly different from the common approach of drawing ROC curves as a function of the true positive rate and the false positive rate [26,27,28]. ROC curves are always computed considering different degrees of dependence (different values of $\beta \neq 0$) against independence ($\beta = 0$). We apply the same criterion to p-values in order to compare the methods across a wider range of decision criteria. We have used the R package “pROC” [29] to compute the ROC curves.
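The threshold sweep described above can be illustrated with a few lines of code. This is a plain-Python sketch and not the pROC implementation used for the experiments; the function names and the p-values are hypothetical. Scores are oriented so that larger values mean stronger evidence of dependence, which for p-values means using $1 - p$.

```python
def roc_points(scores_dep, scores_indep):
    """Return (threshold, sensitivity, specificity) for every observed score.
    A sample is declared dependent when its score is >= threshold."""
    thresholds = sorted(set(scores_dep) | set(scores_indep))
    points = []
    for t in thresholds:
        sens = sum(s >= t for s in scores_dep) / len(scores_dep)
        spec = sum(s < t for s in scores_indep) / len(scores_indep)
        points.append((t, sens, spec))
    return points


def auc(scores_dep, scores_indep):
    """Area under the ROC curve in its Mann-Whitney form:
    P(score_dep > score_indep) + 0.5 * P(score_dep == score_indep)."""
    wins = 0.0
    for a in scores_dep:
        for b in scores_indep:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_dep) * len(scores_indep))


# Hypothetical p-values (not from the paper) for dependent and independent samples.
p_dep = [0.01, 0.04, 0.30]
p_indep = [0.20, 0.55, 0.90]
s_dep = [1 - p for p in p_dep]
s_indep = [1 - p for p in p_indep]
for t, sens, spec in roc_points(s_dep, s_indep):
    print(round(t, 2), round(sens, 2), round(spec, 2))
print(auc(s_dep, s_indep))
```

The pairwise form of the AUC avoids numerical integration and handles ties explicitly; an AUC close to 0.5 corresponds to a test that does no better than random guessing.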
Figure 4, Figure 5 and Figure 6 present the comparison of the new approach (which we name IBinary, ICont or IMixed to explicitly account for the types of variables being analyzed) using $s \approx 0$ against the appropriate competitor. With this choice of s, the new approach runs without indeterminacy and can be directly compared against the usual methods. As we see in the figures, the new method performs very similarly to each competitor, with the advantage of being compatible among different types of data (the p-values of the other methods cannot be compared to each other across data types). This is useful when one works with multivariate models involving multiple data types. As expected, the quality of the methods increases with β and with the sample size.
Figure 7, Figure 8 and Figure 9 present the ROC curves for the methods $\chi^2$, Kendall τ and Kolmogorov–Smirnov, respectively. These curves are separated according to whether the instance is considered determinate or indeterminate by the new approach. In other words, for each one of the twenty thousand repetitions, we run the corresponding usual test and then check whether the output of the new approach is determinate or indeterminate (applying $s = 0.5$), and we split the instances accordingly. Blue curves show the accuracy over instances that are considered easy (determinate cases), green curves show the accuracy over instances that are hard (indeterminate cases), and red curves show the overall accuracy of the method. As we see, this division is able to identify easy-to-classify and hard-to-classify cases, since the ROC curves for the cases deemed indeterminate by the new approach suggest a performance not better than a random guess (green curves). This means that if we devised another test (called “50/50 when indeterminate”) which returns the same response as IBinary, ICont or IMixed when they are determinate, and issues a random answer (with 50/50 chance) otherwise, then this “50/50 when indeterminate” test would have the same ROC curve as $\chi^2$, Kendall τ and Kolmogorov–Smirnov, respectively.
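The determinate/indeterminate split and the “50/50 when indeterminate” rule can be written down compactly. The sketch below assumes that each instance comes with two $HDI$ intervals, one for the lower and one for the upper posterior, each given as a `(low, high)` pair; the function names are hypothetical.

```python
import random


def idp_decision(lower_hdi, upper_hdi):
    """Three-way outcome from the HDIs of the lower and upper posteriors."""
    lower_excludes_zero = not (lower_hdi[0] <= 0.0 <= lower_hdi[1])
    upper_excludes_zero = not (upper_hdi[0] <= 0.0 <= upper_hdi[1])
    if lower_excludes_zero and upper_excludes_zero:
        return "dependent"
    if not lower_excludes_zero and not upper_excludes_zero:
        return "not dependent"
    # The two posteriors disagree: the conclusion depends on the prior.
    return "indeterminate"


def fifty_fifty_when_indeterminate(lower_hdi, upper_hdi, rng=random):
    """Same answer as the IDP test when determinate, a fair coin otherwise."""
    decision = idp_decision(lower_hdi, upper_hdi)
    if decision == "indeterminate":
        return rng.choice(["dependent", "not dependent"])
    return decision


# Hypothetical intervals, for illustration only.
print(idp_decision((0.10, 0.40), (0.20, 0.50)))
print(idp_decision((-0.10, 0.30), (0.05, 0.40)))
print(idp_decision((-0.20, 0.20), (-0.10, 0.30)))
```

Passing a seeded `random.Random` instance as `rng` makes the coin flips reproducible when the rule is evaluated over many repetitions.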
This suggests that the indeterminacy of IDP-based tests is additional useful information that our approach gives to the analyst. In these cases she/he knows that (i) her/his posterior decisions would depend on the choice of the prior DP measure; and (ii) deciding between the two hypotheses under test is a difficult problem, as shown by the comparison with the DP with $s = 0$, $\chi^2$, Kendall τ and Kolmogorov–Smirnov. Based on this additional information, the analyst can, for example, decide to collect additional measurements to eliminate the indeterminacy (in fact, when the number of observations goes to infinity the indeterminacy goes to zero).
This represents a second advantage of our IDP approach: once we have fixed the value of s (e.g., $s = 0.5$), it can automatically identify the risky cases where a decision must be taken with additional caution. For this reason, we suggest using the IDP-based test for dependence rather than the DP with $s = 0$.
Finally, Table 2, Table 3 and Table 4 present the values of the Area under the curve (AUC; see Chapter 5 in [30]) of the ROC curves discussed previously, as well as for a similar experimental setup but with different values of s: 0.25, 0.5 and 1. Table 2 has results for binary variable versus binary variable, Table 3 for continuous variable versus continuous variable, and Table 4 for continuous variable versus binary variable. Overall, results show that IBinary has similar performance to the $\chi^2$ test, ICont has similar performance to Kendall’s τ test and IMixed is similar to the Kolmogorov–Smirnov (KS) test. The most interesting outcome is the comparison, in each scenario, of the frequentist test over the whole data, over only data samples that were considered determinate by the new test, and over only data samples that were considered indeterminate. We clearly see that the AUC values over the cases considered indeterminate are much lower than the values over cases considered determinate, which indicates that the new test has a good ability to discriminate easy and hard cases. ROC curves for values of s other than 0.5 were omitted for clarity of exposition, but they are very similar to those obtained for $s = 0.5$.

## 5. Conclusions

We have proposed three novel Bayesian methods for performing independence tests for binary, continuous and mixed binary-continuous variables. All of these tests are nonparametric and based on the Dirichlet Process. This has allowed us to use the same prior model for all the tests we have developed. Therefore, all the tests are “consistent”, in the sense that the probabilities of dependence we compute with these tests are commensurable across the tests.
We have presented two versions of these tests: one based on a noninformative prior and one based on a conservative model of prior ignorance (IDP). Experimental results show that the prior ignorance method is more reliable than both the frequentist test and the noninformative Bayesian one, being able to isolate instances in which these tests are almost guessing at random. For future work, we plan to extend this approach in two directions: (1) feature selection in classification; (2) learning the structure (graph) of Bayesian networks and Markov Random Fields. The idea is to use our dependence tests to replace the frequentist tests that are commonly used for that purpose and evaluate the gain in terms of performance. For instance in case (1), we then could compare the accuracy of a classifier whose features are selected using our tests with that of a classifier whose features are selected by using frequentist tests. Our new approach is suitable since it addresses two limitations of currently used tests: they are based on null-hypothesis significance tests, and they cannot be applied to categorical and continuous variables at the same time in a commensurable way.

## Author Contributions

All authors made substantial contributions to conception and design, data analysis and interpretation of data; all authors participated in drafting the article or revising it critically for important intellectual content; all authors gave final approval of the version to be submitted.

## Conflicts of Interest

The authors declare no conflict of interest.

## Abbreviations

The following abbreviations are used in this manuscript:
- DP: Dirichlet Process
- IDP: Imprecise Dirichlet Process

## References

1. Raftery, A.E. Bayesian model selection in social research. Sociol. Methodol. 1995, 25, 111–164.
2. Goodman, S.N. Toward evidence-based medical statistics. 1: The P–value fallacy. Ann. Intern. Med. 1999, 130, 995–1004.
3. Kruschke, J.K. Bayesian data analysis. Wiley Interdiscip. Rev. Cognit. Sci. 2010, 1, 658–676.
4. Benavoli, A.; Mangili, F.; Ruggeri, F.; Zaffalon, M. Imprecise Dirichlet Process With Application to the Hypothesis Test on the Probability that X ≤ Y. J. Stat. Theory Pract. 2015, 9, 658–684.
5. Benavoli, A.; Mangili, F.; Corani, G.; Zaffalon, M.; Ruggeri, F. A Bayesian Wilcoxon Signed-Rank Test Based on the Dirichlet Process. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 21–26 July 2014; pp. 1026–1034.
6. Benavoli, A.; Corani, G.; Mangili, F.; Zaffalon, M. A Bayesian Nonparametric Procedure for Comparing Algorithms. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1–9.
7. Mangili, F.; Benavoli, A.; de Campos, C.P.; Zaffalon, M. Reliable survival analysis based on the Dirichlet Process. Biom. J. 2015, 57, 1002–1019.
8. Kao, Y.; Reich, B.J.; Bondell, H.D. A nonparametric Bayesian test of dependence. 2015; arXiv:1501.07198.
9. Nandram, B.; Choi, J.W. Bayesian analysis of a two-way categorical table incorporating intraclass correlation. J. Stat. Comput. Simul. 2006, 76, 233–249.
10. Nandram, B.; Choi, J.W. Alternative tests of independence in two-way categorical tables. J. Data Sci. 2007, 5, 217–237.
11. Nandram, B.; Bhatta, D.; Sedransk, J.; Bhadra, D. A Bayesian test of independence in a two-way contingency table using surrogate sampling. J. Stat. Plan. Inference 2013, 143, 1392–1408.
12. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163.
13. Blum, A.L.; Langley, P. Selection of relevant features and examples in machine learning. Artif. Intell. 1997, 97, 245–271.
14. Keogh, E.J.; Pazzani, M.J. Learning Augmented Bayesian Classifiers: A Comparison of Distribution-Based and Classification-Based Approaches. Available online: http://www.cs.rutgers.edu/∼pazzani/Publications/EamonnAIStats.pdf (accessed on 31 August 2016).
15. Jiang, L.; Cai, Z.; Wang, D.; Zhang, H. Improving Tree augmented Naive Bayes for class probability estimation. Knowl. Based Syst. 2012, 26, 239–245.
16. Ferguson, T.S. A Bayesian Analysis of Some Nonparametric Problems. Ann. Stat. 1973, 1, 209–230.
17. Ghosh, J.K.; Ramamoorthi, R. Bayesian Nonparametrics; Springer: Berlin/Heidelberg, Germany, 2003.
18. Rubin, D.B. Bayesian Bootstrap. Ann. Stat. 1981, 9, 130–134.
19. Walley, P. Statistical Reasoning with Imprecise Probabilities; Chapman & Hall: New York, NY, USA, 1991.
20. Coolen-Schrijner, P.; Coolen, F.P.; Troffaes, M.C.; Augustin, T. Imprecision in Statistical Theory and Practice. J. Stat. Theory Pract. 2009, 3.
21. Augustin, T.; Coolen, F.P.; de Cooman, G.; Troffaes, M.C. Introduction to Imprecise Probabilities; John Wiley & Sons: Hoboken, NJ, USA, 2014.
22. Berger, J.O.; Rios Insua, D.; Ruggeri, F. Bayesian Robustness. In Robust Bayesian Analysis; Insua, D.R., Ruggeri, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2000; Volume 152, pp. 1–32.
23. Berger, J.O.; Moreno, E.; Pericchi, L.R.; Bayarri, M.J.; Bernardo, J.M.; Cano, J.A.; De la Horra, J.; Martín, J.; Ríos-Insúa, D.; Betrò, B.; et al. An overview of robust Bayesian analysis. Test 1994, 3, 5–124.
24. Pericchi, L.R.; Walley, P. Robust Bayesian credible intervals and prior ignorance. Int. Stat. Rev. 1991, 59.
25. Dalal, S.; Phadia, E. Nonparametric Bayes inference for concordance in bivariate distributions. Commun. Stat. Theory Methods 1983, 12, 947–963.
26. Tan, P.-N.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Pearson Education: New York, NY, USA, 2006.
27. Jiang, L.; Li, C.; Cai, Z. Learning decision tree for ranking. Knowl. Inf. Syst. 2009, 20, 123–135.
28. Jiang, L.; Wang, D.; Zhang, H.; Cai, Z.; Huang, B. Using instance cloning to improve naive Bayes for ranking. Int. J. Pattern Recognit. Artif. Intell. 2008, 22, 1121–1140.
29. Robin, X.; Turck, N.; Hainard, A.; Tiberti, N.; Lisacek, F.; Sanchez, J.C.; Müller, M. pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 2011, 12.
30. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
Figure 1. Three possible results of the independence hypothesis testing with two binary variables. The red and blue filled areas correspond respectively to the lower and upper $H D I$. (a) Dependent at $95 %$; (b) Indeterminate at $95 %$; (c) They are not dependent at $95 %$.
Figure 2. Three possible results of the independence hypothesis testing for continuous variables. The red and blue filled areas correspond respectively to the lower and upper $H D I$. (a) Dependent at $95 %$; (b) Indeterminate at $95 %$; (c) They are not dependent at $95 %$.
Figure 3. Three possible results of the independence hypothesis testing for pairs binary-continuous. The red and blue filled areas correspond respectively to the lower and upper $H D I$. (a) Dependent at $95 %$; (b) Indeterminate at $95 %$; (c) They are not dependent at $95 %$.
Figure 4. Comparison of approaches with binary data. New approach with $s ≈ 0$ (so always determinate) is compared against $χ 2$ test using ROC curves. Curves are built using two thousand repetitions (one thousand where variables are independent ($β = 0$) and one thousand where they are dependent with β as shown in the figures). Data are generated as explained in Table 1. (a) ROC ($n = 20$, $β = 1$); (b) ROC ($n = 20$, $β = 3$); (c) ROC ($n = 50$, $β = 1$); (d) ROC ($n = 50$, $β = 3$).
Figure 5. Comparison of approaches with continuous data. New approach with $s \approx 0$ (so always determinate) is compared against Kendall τ test using ROC curves. Curves are built using two thousand repetitions (one thousand where variables are independent ($\beta = 0$) and one thousand where they are dependent with β as shown in the figures). Data are generated as explained in Table 1. (a) ROC ($n = 20$, $\beta = 1$); (b) ROC ($n = 20$, $\beta = 2$); (c) ROC ($n = 50$, $\beta = 1$); (d) ROC ($n = 50$, $\beta = 2$).
Figure 6. Comparison of approaches with mixed data. New method with $s ≈ 0$ (so always determinate) is compared against Kolmogorov–Smirnov (KS) test using ROC curves. Curves are built using two thousand repetitions (one thousand where variables are independent ($β = 0$) and one thousand where they are dependent with β as shown in the figures). Data are generated as explained in Table 1. (a) ROC ($n = 20$, $β = 1$); (b) ROC ($n = 20$, $β = 2$); (c) ROC ($n = 50$, $β = 1$); (d) ROC ($n = 50$, $β = 2$).
Figure 7. Comparison of approaches with binary data. New approach is used to differentiate instance by instance into hard-to-classify and easy-to-classify, and curves represent the outcome of the $\chi^2$ test under each of these scenarios. Data are generated as explained in Table 1. (a) ROC ($n = 20$, $\beta = 1$); (b) ROC ($n = 20$, $\beta = 3$); (c) ROC ($n = 50$, $\beta = 1$); (d) ROC ($n = 50$, $\beta = 3$).
Figure 8. Comparison of approaches with continuous data. New approach is used to differentiate instance by instance into hard-to-classify and easy-to-classify, and curves represent the outcome of the Kendall τ test under each of these scenarios. Data are generated as explained in Table 1. (a) ROC ($n = 20$, $\beta = 1$); (b) ROC ($n = 20$, $\beta = 2$); (c) ROC ($n = 50$, $\beta = 1$); (d) ROC ($n = 50$, $\beta = 2$).
Figure 9. Comparison of approaches with mixed data. New approach is used to differentiate instance by instance into hard-to-classify and easy-to-classify, and curves represent the outcome of the Kolmogorov–Smirnov (KS) test under each of these scenarios. Data are generated as explained in Table 1. (a) ROC ($n = 20$, $\beta = 1$); (b) ROC ($n = 20$, $\beta = 2$); (c) ROC ($n = 50$, $\beta = 1$); (d) ROC ($n = 50$, $\beta = 2$).
Table 1. Data generation setup. In order to generate independent data, β is set to zero. Larger values of β increase their dependency.

| Variable 1 | Variable 2 | Distribution |
|---|---|---|
| Binary | Binary | Multinomial distr. with $[P(00), P(01), P(10), P(11)] \propto [3, 3 + \beta, 3 + \beta, 3]$. |
| Continuous | Continuous | Bivariate Gaussian with means 0 and covariance matrix $\begin{pmatrix} 10 & \beta \\ \beta & 3 \end{pmatrix}$. |
| Binary | Continuous | Half of the samples have the binary variable set to zero and half to one. When that variable is zero, then for the continuous use $\Gamma(10, 2)$, otherwise $\Gamma(10 + \beta, 2 + \beta)$. |
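The data generation setup of Table 1 could be sketched as follows. This is a rough illustration under stated assumptions, not the code used for the experiments: the covariance matrix is read as having diagonal entries 10 and 3 and off-diagonal entries β, the Gamma distributions are taken to be in shape/rate form, and all function names are hypothetical.

```python
import math
import random


def gen_binary_binary(n, beta, rng):
    """Cells (0,0), (0,1), (1,0), (1,1) with weights [3, 3+beta, 3+beta, 3]."""
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    weights = [3, 3 + beta, 3 + beta, 3]
    return rng.choices(cells, weights=weights, k=n)


def gen_continuous_continuous(n, beta, rng):
    """Zero-mean bivariate Gaussian with assumed covariance [[10, beta], [beta, 3]],
    sampled through its Cholesky factor (requires beta**2 < 30)."""
    l11 = math.sqrt(10.0)
    l21 = beta / l11
    l22 = math.sqrt(3.0 - l21 ** 2)
    out = []
    for _ in range(n):
        a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((l11 * a, l21 * a + l22 * b))
    return out


def gen_binary_continuous(n, beta, rng):
    """First half of the samples has the binary variable at 0, second half at 1.
    Gamma(shape, rate) is an assumed parameterization; random.gammavariate
    expects a scale, hence the 1 / rate."""
    out = []
    for i in range(n):
        zbin = 0 if i < n // 2 else 1
        shape, rate = (10.0, 2.0) if zbin == 0 else (10.0 + beta, 2.0 + beta)
        out.append((zbin, rng.gammavariate(shape, 1.0 / rate)))
    return out


rng = random.Random(42)
print(gen_binary_binary(5, 3, rng))
print(gen_continuous_continuous(2, 2, rng))
print(gen_binary_continuous(4, 1, rng))
```

Setting `beta = 0` in each generator gives independent variables, which matches the role of β described in the table caption.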
Table 2. Area under the ROC curve (AUC) values for all the performed experiments using different values of s, β and n. IBinary shows the AUC for the new test applied to two binary variables and $s \approx 0$. The columns $\chi^2$ test, Det.cases, and Indet.cases show the AUC obtained by the $\chi^2$ test over all samples, only over samples considered determinate by IBinary (with the corresponding s) and finally only over samples considered indeterminate by IBinary.

| s | n | β | IBinary | $\chi^2$ test | Det.cases | Indet.cases |
|---|---|---|---|---|---|---|
| 0.25 | 20 | 1 | 0.5562 | 0.5629 | 0.5653 | 0.4890 |
| 0.5 | 20 | 1 | 0.5544 | 0.5596 | 0.5645 | 0.5233 |
| 1 | 20 | 1 | 0.5491 | 0.5551 | 0.5642 | 0.5153 |
| 0.25 | 20 | 3 | 0.7341 | 0.7502 | 0.7567 | 0.4266 |
| 0.5 | 20 | 3 | 0.7388 | 0.7551 | 0.7686 | 0.4526 |
| 1 | 20 | 3 | 0.7330 | 0.7502 | 0.7717 | 0.4888 |
| 0.25 | 50 | 1 | 0.6372 | 0.6425 | 0.6449 | 0.5125 |
| 0.5 | 50 | 1 | 0.6319 | 0.6353 | 0.6393 | 0.4747 |
| 1 | 50 | 1 | 0.6366 | 0.6407 | 0.6492 | 0.4954 |
| 0.25 | 50 | 3 | 0.9145 | 0.9110 | 0.9127 | 0.5205 |
| 0.5 | 50 | 3 | 0.9130 | 0.9090 | 0.9115 | 0.4473 |
| 1 | 50 | 3 | 0.9134 | 0.9081 | 0.9123 | 0.5642 |
Table 3. Area under the ROC curve (AUC) values for all the performed experiments using different values of s, β and n. ICont shows the AUC for the new test applied to two continuous variables and $s \approx 0$. Kendall, Det.cases, and Indet.cases show the AUC obtained by Kendall’s test over all samples, only over samples considered determinate by ICont (with the corresponding s) and finally only over samples considered indeterminate by ICont.

| s | n | β | ICont | Kendall | Det.cases | Indet.cases |
|---|---|---|---|---|---|---|
| 0.25 | 20 | 1 | 0.5826 | 0.5858 | 0.5898 | 0.5101 |
| 0.5 | 20 | 1 | 0.5708 | 0.5729 | 0.5804 | 0.4987 |
| 1 | 20 | 1 | 0.5744 | 0.5742 | 0.5914 | 0.5004 |
| 0.25 | 20 | 2 | 0.7524 | 0.7506 | 0.7558 | 0.5037 |
| 0.5 | 20 | 2 | 0.7535 | 0.7502 | 0.7574 | 0.5203 |
| 1 | 20 | 2 | 0.7488 | 0.7407 | 0.7596 | 0.5447 |
| 0.25 | 50 | 1 | 0.6825 | 0.6888 | 0.6917 | 0.5051 |
| 0.5 | 50 | 1 | 0.6782 | 0.6869 | 0.6935 | 0.5633 |
| 1 | 50 | 1 | 0.6871 | 0.6960 | 0.7087 | 0.5204 |
| 0.25 | 50 | 2 | 0.9343 | 0.9191 | 0.9197 | 0.4933 |
| 0.5 | 50 | 2 | 0.9339 | 0.9208 | 0.9207 | 0.5487 |
| 1 | 50 | 2 | 0.9361 | 0.9205 | 0.9192 | 0.5499 |
Table 4. Area under the ROC curve (AUC) values for all the performed experiments using different values of s, β and n. IMixed shows the AUC for the new test applied to one binary and one continuous variable and $s \approx 0$. Kolmogorov–Smirnov (KS), Det.cases, and Indet.cases show the AUC obtained by the KS test over all samples, only over samples considered determinate by IMixed (with the corresponding s) and finally only over samples considered indeterminate by IMixed.

| s | n | β | IMixed | KS | Det.cases | Indet.cases |
|---|---|---|---|---|---|---|
| 0.25 | 20 | 1 | 0.6159 | 0.6118 | 0.6139 | 0.5386 |
| 0.5 | 20 | 1 | 0.6150 | 0.5943 | 0.5989 | 0.5594 |
| 1 | 20 | 1 | 0.6132 | 0.6004 | 0.6104 | 0.5532 |
| 0.25 | 20 | 2 | 0.7176 | 0.7358 | 0.7392 | 0.5254 |
| 0.5 | 20 | 2 | 0.7202 | 0.7091 | 0.7159 | 0.4937 |
| 1 | 20 | 2 | 0.7163 | 0.7091 | 0.7233 | 0.4928 |
| 0.25 | 50 | 1 | 0.6997 | 0.7091 | 0.7109 | 0.4447 |
| 0.5 | 50 | 1 | 0.6966 | 0.7106 | 0.7149 | 0.4213 |
| 1 | 50 | 1 | 0.7076 | 0.7135 | 0.7224 | 0.4455 |
| 0.25 | 50 | 2 | 0.8526 | 0.8816 | 0.8832 | 0.3278 |
| 0.5 | 50 | 2 | 0.8497 | 0.8790 | 0.8818 | 0.3044 |
| 1 | 50 | 2 | 0.8562 | 0.8923 | 0.8986 | 0.2934 |

## Share and Cite

MDPI and ACS Style

Benavoli, A.; De Campos, C.P. Bayesian Dependence Tests for Continuous, Binary and Mixed Continuous-Binary Variables. Entropy 2016, 18, 326. https://doi.org/10.3390/e18090326
