Article

Consistency of Learning Bayesian Network Structures with Continuous Variables: An Information Theoretic Approach

Department of Mathematics, Graduate School of Science, Osaka University, Toyonaka-shi 560-0043, Japan
Entropy 2015, 17(8), 5752-5770; https://doi.org/10.3390/e17085752
Received: 30 April 2015 / Revised: 30 April 2015 / Accepted: 5 August 2015 / Published: 10 August 2015
(This article belongs to the Special Issue Dynamical Equations and Causal Structures from Observations)

Abstract

We consider the problem of learning a Bayesian network structure given $n$ examples and the prior probability, based on maximizing the posterior probability. We propose an algorithm that runs in $O(n \log n)$ time and that addresses continuous variables and discrete variables without assuming any class of distribution. We prove that the decision is strongly consistent, i.e., correct with probability one as $n \to \infty$. To date, consistency has only been obtained for discrete variables for this class of problem, and many authors have attempted to prove consistency when continuous variables are present. Furthermore, we prove that the "$\log n$" term that appears in the penalty term of the description length can be replaced by $2(1+\epsilon)\log\log n$ to obtain strong consistency, where $\epsilon > 0$ is arbitrary, which implies that the Hannan–Quinn proposition holds.


1. Introduction

In this paper, we address the problem of learning a Bayesian network structure from examples.
For sets $A$, $B$, $C$ of random variables, we say that $A$ and $B$ are conditionally independent given $C$ if the conditional probability of $A$ and $B$ given $C$ is the product of the conditional probabilities of $A$ given $C$ and of $B$ given $C$. A Bayesian network (BN) is a graphical model that expresses conditional independence (CI) relations among the prepared variables using a directed acyclic graph (DAG). We define a BN by the DAG with vertexes $V = \{1, \dots, N\}$ and directed edges $E = \{(j, i) \mid i \in V,\ j \in \pi(i)\}$, where edge $(j, k) \in V^2$ directs from $j$ to $k$, via minimal parent sets $\pi(i) \subseteq V$, $i \in V$, such that the distribution is factorized by:
$$P(X^{(1)}, \dots, X^{(N)}) = \prod_{i=1}^N P\bigl(X^{(i)} \mid \{X^{(j)}\}_{j \in \pi(i)}\bigr).$$
First, suppose that we wish to know whether two random binary variables $X$ and $Y$ are independent (hereafter, we write $X \perp Y$). If we have $n$ pairs of actually emitted examples $(X = x_1, Y = y_1), \dots, (X = x_n, Y = y_n)$ and know the prior probability $p$ of $X \perp Y$, then it would be reasonable to maximize the posterior probability of $X \perp Y$ given $x^n = (x_1, \dots, x_n)$ and $y^n = (y_1, \dots, y_n)$. If we assume that the probabilities $P(X = x)$, $P(Y = y)$ and $P(X = x, Y = y)$ are parameterized by $p(x \mid \theta_X)$, $p(y \mid \theta_Y)$ and $p(x, y \mid \theta_{XY})$ and that the prior probabilities $W_X$, $W_Y$ and $W_{XY}$ over the parameters $\theta_X$, $\theta_Y$ and $\theta_{XY}$ of $X \in \{0, 1\}$, $Y \in \{0, 1\}$ and $(X, Y) \in \{0, 1\}^2$ are available, respectively, then we can construct the quantities:
$$Q_X^n(x^n) := \int \prod_{i=1}^n p(x_i \mid \theta_X)\, W_X(d\theta_X),\quad Q_Y^n(y^n) := \int \prod_{i=1}^n p(y_i \mid \theta_Y)\, W_Y(d\theta_Y),\quad Q_{XY}^n(x^n, y^n) := \int \prod_{i=1}^n p(x_i, y_i \mid \theta_{XY})\, W_{XY}(d\theta_{XY}).$$
In this setting, maximizing the posterior probability of $X \perp Y$ given examples $x^n, y^n$ w.r.t. the prior probability $p$ is equivalent to deciding $X \perp Y$ if and only if:
$$p\, Q_X^n(x^n)\, Q_Y^n(y^n) \ge (1-p)\, Q_{XY}^n(x^n, y^n). \qquad (1)$$
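Rule (1) can be sketched in code. The following Python fragment (helper names are ours, not the paper's) uses the Dirichlet(1/2) (Krichevsky–Trofimov) mixture, one concrete choice of the quantities $Q^n$ discussed later in Section 2.1, to decide $X \perp Y$ for binary data:

```python
from math import lgamma, log

def log_q(counts):
    """log Q^n under a Dirichlet(1/2) (Krichevsky-Trofimov) prior W,
    one concrete choice of the mixtures Q_X^n, Q_Y^n, Q_XY^n."""
    n, a = sum(counts), len(counts)
    return (lgamma(a / 2) + sum(lgamma(c + 0.5) for c in counts)
            - a * lgamma(0.5) - lgamma(n + a / 2))

def decide_independent(pairs, p=0.5):
    """Declare X _||_ Y iff p * Q_X^n * Q_Y^n >= (1-p) * Q_XY^n, i.e. rule (1)."""
    cx = [sum(1 for x, _ in pairs if x == v) for v in (0, 1)]
    cy = [sum(1 for _, y in pairs if y == v) for v in (0, 1)]
    cxy = [sum(1 for x, y in pairs if (x, y) == (a, b))
           for a in (0, 1) for b in (0, 1)]
    return log(p) + log_q(cx) + log_q(cy) >= log(1 - p) + log_q(cxy)

# deterministic check: an empirically independent pair vs. a deterministic copy
ind = [(i % 2, (i // 2) % 2) for i in range(2000)]   # X and Y vary independently
dep = [(i % 2, i % 2) for i in range(2000)]          # Y = X
assert decide_independent(ind) and not decide_independent(dep)
```

The comparison is done in the log domain, since the $Q^n$ values themselves are exponentially small in $n$.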
The decision based on (1) is strongly consistent, i.e., it is correct with probability one as $n \to \infty$ [1] (see Section 3.1 for the proof). We say that a model selection procedure satisfies weak consistency if the probability of choosing the correct model goes to unity as $n$ grows (convergence in probability) and that it satisfies strong consistency if probability one is assigned to the set of infinite example sequences that choose the correct model, except for at most finitely many times (almost sure convergence). In general, strong consistency implies weak consistency, but the converse is not true [2]. In any model selection, in particular for large $n$, the correct answer is required. If continuous variables are present, BN structure learning is not easy, and strong consistency is hard to obtain.
The same scenario applies to the case in which $X$ and $Y$ take values in finite sets $A$ and $B$ rather than $\{0, 1\}$.
Next, suppose that we wish to know the factorization of three random binary variables $X, Y, Z$:
$$P(X)P(Y)P(Z),\quad P(X)P(Y,Z),\quad P(Y)P(Z,X),\quad P(Z)P(X,Y),\quad \frac{P(X,Y)P(X,Z)}{P(X)},\quad \frac{P(X,Y)P(Y,Z)}{P(Y)},\quad \frac{P(X,Z)P(Y,Z)}{P(Z)},$$
$$\frac{P(Y)P(Z)P(X,Y,Z)}{P(Y,Z)},\quad \frac{P(Z)P(X)P(X,Y,Z)}{P(Z,X)},\quad \frac{P(X)P(Y)P(X,Y,Z)}{P(X,Y)}\quad \text{and}\quad P(X,Y,Z).$$
If we have $n$ triples of actually emitted examples $(X = x_1, Y = y_1, Z = z_1), \dots, (X = x_n, Y = y_n, Z = z_n)$ and know the prior probabilities $p_1, \dots, p_{11}$ over the eleven factorizations, then it would be reasonable to choose the one that maximizes:
$$p_1 Q_X^n(x^n) Q_Y^n(y^n) Q_Z^n(z^n),\quad p_2 Q_X^n(x^n) Q_{YZ}^n(y^n, z^n),\quad p_3 Q_Y^n(y^n) Q_{XZ}^n(x^n, z^n),\quad p_4 Q_Z^n(z^n) Q_{XY}^n(x^n, y^n),$$
$$p_5 \frac{Q_{XY}^n(x^n, y^n)\, Q_{XZ}^n(x^n, z^n)}{Q_X^n(x^n)},\quad p_6 \frac{Q_{XY}^n(x^n, y^n)\, Q_{YZ}^n(y^n, z^n)}{Q_Y^n(y^n)},\quad p_7 \frac{Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n)}{Q_Z^n(z^n)},$$
$$p_8 \frac{Q_Y^n(y^n)\, Q_Z^n(z^n)\, Q_{XYZ}^n(x^n, y^n, z^n)}{Q_{YZ}^n(y^n, z^n)},\quad p_9 \frac{Q_Z^n(z^n)\, Q_X^n(x^n)\, Q_{XYZ}^n(x^n, y^n, z^n)}{Q_{XZ}^n(x^n, z^n)},\quad p_{10} \frac{Q_X^n(x^n)\, Q_Y^n(y^n)\, Q_{XYZ}^n(x^n, y^n, z^n)}{Q_{XY}^n(x^n, y^n)},\quad p_{11}\, Q_{XYZ}^n(x^n, y^n, z^n),$$
to maximize the posterior probability of the factorization given $x^n = (x_1, \dots, x_n)$, $y^n = (y_1, \dots, y_n)$ and $z^n = (z_1, \dots, z_n)$. For example, between the last two distributions, we choose the last if and only if:
$$p_{10}\, Q_X^n(x^n)\, Q_Y^n(y^n) \le p_{11}\, Q_{XY}^n(x^n, y^n)$$
(the common factor $Q_{XYZ}^n(x^n, y^n, z^n)$ cancels on both sides).
In fact, for example, we can check that the factorizations:
$$P(Y)P(X \mid Y)P(Z \mid X),\quad P(X)P(Y \mid X)P(Z \mid X),\quad P(Z)P(X \mid Z)P(Y \mid Z)$$
in Figure 1a–c share the same form $\frac{P(X,Y)P(X,Z)}{P(X)}$, and we say that they share the same Markov-equivalent class. On the other hand, the factorization
$$P(Y)P(Z)P(X \mid Y, Z) = \frac{P(X,Y,Z)P(Y)P(Z)}{P(Y,Z)}$$
in Figure 1d shares its Markov-equivalent class with nothing except itself. In the case of three variables, there are 25 DAGs, but they reduce to the eleven Markov-equivalent classes.
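The count of 25 DAGs can be verified by enumeration: for each of the three vertex pairs there are three choices (no edge, one direction, the other), giving 27 directed graphs, of which only the two directed 3-cycles are cyclic. A small check (our own code, not from the paper):

```python
from itertools import product

def is_acyclic(edges, n=3):
    """Kahn's algorithm: repeatedly strip vertices with no incoming edge."""
    nodes, edges = set(range(n)), set(edges)
    while nodes:
        sources = [v for v in nodes if all(e[1] != v for e in edges)]
        if not sources:
            return False          # a directed cycle remains
        nodes -= set(sources)
        edges = {e for e in edges if e[0] in nodes}
    return True

pairs = [(0, 1), (0, 2), (1, 2)]
dags = 0
for choice in product(('none', 'fwd', 'rev'), repeat=3):
    edges = [(i, j) if c == 'fwd' else (j, i)
             for (i, j), c in zip(pairs, choice) if c != 'none']
    dags += is_acyclic(edges)
assert dags == 25   # 27 digraphs minus the 2 directed 3-cycles
```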
Figure 1. Markov-equivalent classes (a–d).
The method that maximizes the posterior probability is strongly consistent [1] (see Section 3.1 for the proof), and a scenario with two and three variables as above can be extended to cases with N variables in a straightforward manner, if the variables are discrete.
In this paper, we consider the case when continuous variables are present. The idea is to construct measures $g_X^n(x^n)$, $g_Y^n(y^n)$ and $g_{XY}^n(x^n, y^n)$ over $X^n$, $Y^n$ and $X^n \times Y^n$ for continuous ranges $X$ and $Y$ and to decide $X \perp Y$ based on:
$$p\, g_X^n(x^n)\, g_Y^n(y^n) \ge (1-p)\, g_{XY}^n(x^n, y^n). \qquad (2)$$
The main problem is whether the decision is strongly consistent. Many authors have attempted to address continuous variables. For example, Nir Friedman [3] experimentally demonstrated the construction of a genetic network based on expression data using the EM algorithm. However, the variables were assumed to be linearly related and to include Gaussian noise, and the dataset did not fit the model well. Imoto et al. [4] improved the model such that the relation is expressed by B-spline curves rather than lines. However, all of these authors, including Friedman and Imoto, failed to maximize the posterior probability, and thus, their decisions are not consistent. This paper proves that the decision based on (2) and its extension for general $N \ge 2$ are strongly consistent.
In any Bayesian approach to BN structure learning, whether continuous variables are present or not, the procedure consists of two stages:
(1)
Compute the local scores for the nonempty subsets of $\{X^{(1)}, \dots, X^{(N)}\}$; for example, if $N = 3$, the seven quantities $Q_X^n(x^n), \dots, Q_{XYZ}^n(x^n, y^n, z^n)$ are obtained; and
(2)
Find a BN structure that maximizes the global score among the $M(N)$ ($\le 3^{\binom{N}{2}}$) candidate BN structures; there are at most $3^{\binom{N}{2}}$ DAGs in the case of $N$ variables; for example, if $N = 3$, the eleven quantities are computed and a structure with the largest is chosen.
Note that the second stage does not care about whether each variable is continuous or not. In this paper, we mainly discuss the performance of the first stage. The number of local scores to be computed can be reduced, although it is generally exponential in $N$. We consider this problem in Section 3.3.
On the other hand, Zhang, Peters, Janzing and Schölkopf [5] proposed a BN structure learning method using conditional independence (CI) tests based on kernel statistics. However, for a CI test that is close to the Hilbert–Schmidt independence criterion (HSIC), it is very hard to simulate the null distribution. They only proposed to approximate it by a Gamma distribution, and no consistency is obtained because the threshold of the statistical test is not correct in practice. Furthermore, the independence test approach often results in conflicting assertions of independence for finite samples. In particular, for small samples, the obtained DAG sometimes contains a directed loop. The Bayesian approach we consider in this paper does not suffer from this inconvenience, because we seek a structure that maximizes the global score [6].
Another contribution of this paper is identifying the border between consistency and non-consistency in learning Bayesian networks. For discrete $X$, maximizing $Q_X^n(x^n)$ is equivalent to minimizing the description length [1]:
$$-\log Q_X^n(x^n) \asymp H^n(x^n) + \frac{\alpha - 1}{2}\log n,$$
where $H^n(x^n)$ is the empirical entropy of $x^n \in X^n$ (we write $A \asymp B$ when $|A - B|$ is bounded by a constant) and $\alpha$ is the cardinality of the set $X$. The problem at hand is whether the $\log n$ term is the minimum function of $n$ ensuring strong consistency. If $\log n$ is replaced by two (AIC), we cannot obtain consistency. We prove that $2(1+\epsilon)\log\log n$ with $\epsilon > 0$ is the minimum for strong consistency, based on the law of the iterated logarithm. The same property is known as the Hannan–Quinn principle [7], and similar results have been obtained for autoregression, linear regression [8] and classification [9], among others. The derivation in this paper does not depend on these previous results. The Hannan–Quinn principle will also be applied to continuous variables.
This paper is organized as follows. Section 2.1 introduces the general concept of learning Bayesian network structures based on maximizing the posterior probability, and Section 2.2 discusses the concept of density functions developed by Boris Ryabko [10] and extended by Suzuki [11]. Section 3 presents our contributions: Section 3.1 proves the Hannan–Quinn property for the current problem, and Section 3.2 proves consistency when continuous variables are present. Section 4 concludes the paper by summarizing the results and stating the paper's significance in the field of model selection.

2. Preliminaries

2.1. Learning the Bayesian Structure for Discrete Variables and Its Consistency

We choose $w_X$, such that $\int w_X(\theta)\, d\theta = 1$ and $0 \le \theta(x) \le 1$, by $w_X(\theta) \propto \prod_{x \in X} \theta(x)^{-1/2}$, where $X$ is the set from which $X$ takes its values. Let $\alpha = |X|$, and let $c_i(x)$ be the frequency of $x \in X$ in $x^i = (x_1, \dots, x_i) \in X^i$, $i = 1, \dots, n$. It is known that the following quantity satisfies (3) [12]:
$$Q_X^n(x^n) := \prod_{i=1}^n \frac{c_{i-1}(x_i) + 1/2}{i - 1 + |X|/2} = \frac{\Gamma(\alpha/2)\prod_{x \in X}\Gamma(c_n(x) + 1/2)}{\Gamma(1/2)^\alpha\, \Gamma(n + \alpha/2)},$$
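The sequential product form and the Gamma-function form above are the same quantity; the product over $i$ telescopes into the ratios of Gamma functions. A quick numerical check (helper names are ours):

```python
from math import lgamma, log

def log_kt_sequential(xs, alphabet):
    """log of the product form: prod_i (c_{i-1}(x_i) + 1/2) / (i - 1 + alpha/2)."""
    counts = {a: 0 for a in alphabet}
    total = 0.0
    for i, x in enumerate(xs, start=1):
        total += log(counts[x] + 0.5) - log(i - 1 + len(alphabet) / 2)
        counts[x] += 1
    return total

def log_kt_closed(xs, alphabet):
    """log of the closed form with Gamma functions (via lgamma)."""
    a, n = len(alphabet), len(xs)
    return (lgamma(a / 2) + sum(lgamma(xs.count(s) + 0.5) for s in alphabet)
            - a * lgamma(0.5) - lgamma(n + a / 2))

xs = [0, 1, 1, 0, 2, 1, 0, 0]
assert abs(log_kt_sequential(xs, [0, 1, 2]) - log_kt_closed(xs, [0, 1, 2])) < 1e-9
```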
where $\Gamma$ is the Gamma function, and Stirling's formula $\Gamma(z) = \sqrt{2\pi/z}\,(z/e)^z\{1 + O(z^{-1/3})\}$ has been applied. Thus, for $x \in X$, from the law of large numbers, $c_n(x)/n$ converges to $P(X = x)$ with probability one as $n \to \infty$, such that:
$$-\frac{1}{n}\log Q_X^n(x^n) \to H(X) := \sum_{x \in X} -P(X = x)\log P(X = x)$$
with probability one as $n \to \infty$.
Moreover, from the law of large numbers, with probability one as $n \to \infty$,
$$-\frac{1}{n}\log P(X^n = x^n) = \frac{1}{n}\sum_{i=1}^n\{-\log P(X = x_i)\} \to E[-\log P(X)] = H(X)$$
(Shannon–McMillan–Breiman [13]). This proves that there exists a $Q_X^n$ (universal measure) such that for any probability $P$ over the finite set $X$,
$$\frac{1}{n}\log\frac{P^n(x^n)}{Q_X^n(x^n)} \to 0 \qquad (4)$$
with probability one as $n \to \infty$, where we write $P^n(x^n) := P(X^n = x^n)$.
The same property holds for:
$$-\log Q_Y^n(y^n) \asymp H^n(y^n) + \frac{\beta - 1}{2}\log n$$
and:
$$-\log Q_{XY}^n(x^n, y^n) \asymp H^n(x^n, y^n) + \frac{\alpha\beta - 1}{2}\log n,$$
where $\beta = |Y|$, $H^n(y^n) = \sum_{y \in Y} -c_n(y)\log\frac{c_n(y)}{n}$ and $H^n(x^n, y^n) = \sum_{x \in X}\sum_{y \in Y} -c_n(x, y)\log\frac{c_n(x, y)}{n}$ are the empirical entropies of $y^n \in Y^n$ and $(x^n, y^n) \in X^n \times Y^n$, and $c_n(y)$ and $c_n(x, y)$ are the numbers of occurrences of $y \in Y$ and $(x, y) \in X \times Y$ in $y^n = (y_1, \dots, y_n) \in Y^n$ and $(x^n, y^n) \in X^n \times Y^n$, respectively.
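The $\asymp$ relations above can be checked numerically. The following sketch (our own helper names) compares $-\log Q_X^n(x^n)$ with $H^n(x^n) + \frac{\alpha-1}{2}\log n$ for balanced binary counts and confirms that the gap stays bounded as $n$ grows:

```python
from math import lgamma, log

def neg_log_kt(counts):
    """-log Q^n for the Dirichlet(1/2) (Krichevsky-Trofimov) mixture,
    computed from the final symbol counts via the Gamma-function form."""
    n, a = sum(counts), len(counts)
    return -(lgamma(a / 2) + sum(lgamma(c + 0.5) for c in counts)
             - a * lgamma(0.5) - lgamma(n + a / 2))

def empirical_entropy(counts):
    """H^n(x^n) = sum_x -c_n(x) log(c_n(x)/n), in nats (unnormalized)."""
    n = sum(counts)
    return sum(-c * log(c / n) for c in counts if c > 0)

for n in (1000, 4000, 16000):
    counts = [n // 2, n // 2]        # e.g. an alternating 0/1 sequence
    gap = neg_log_kt(counts) - (empirical_entropy(counts)
                                + (2 - 1) / 2 * log(n))
    assert abs(gap) < 2.0            # the O(1) gap hidden in the asymp relation
```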
Thus, we have:
$$J_n(x^n, y^n) := \frac{1}{n}\log\frac{Q_{XY}^n(x^n, y^n)}{Q_X^n(x^n)\, Q_Y^n(y^n)} \to I(X, Y) := E\log\frac{P(X, Y)}{P(X)P(Y)}$$
with probability one as $n \to \infty$. Note that $X \perp Y$ if and only if $I(X, Y) = 0$. Hence, if $X \not\perp Y$, the value of $J_n(x^n, y^n)$ is positive with probability one as $n \to \infty$. However, how can we detect $X \perp Y$ when it holds? $J_n(x^n, y^n)$ cannot be exactly zero with probability one as $n \to \infty$.
Nevertheless, when $X$ and $Y$ are discrete, the estimation based on $J_n(x^n, y^n)$ is consistent: if $X \perp Y$, the value of $J_n(x^n, y^n)$ is not greater than zero with probability one as $n \to \infty$. For example, the decision based on (1) is strongly consistent because the values of $\frac{1}{n}\log p$ and $\frac{1}{n}\log(1-p)$ are negligible for large $n$, and asymptotically, (1) is equivalent to $J_n(x^n, y^n) \le 0$.
In Section 3.1, we provide a stronger result of consistency and a more intuitive and elegant proof.
In general, if $N$ variables exist ($N \ge 2$), we must consider two cases: $D(P^* \| P) > 0$ and $D(P^* \| P) = 0$, where $P^*$ and $P$ are the probabilities based on the correct and estimated factorizations and $D(P^* \| P)$ denotes the Kullback–Leibler divergence between $P^*$ and $P$. If $N = 2$, then:
$$D(P^* \| P) := \sum_x\sum_y P^*(x, y)\log\frac{P^*(x, y)}{P(x, y)} > 0$$
if and only if $X \not\perp Y$ in $P^*$ and $X \perp Y$ in $P$.
The same property holds for three variables $X, Y, Z$ ($N = 3$):
$$J_n(x^n, y^n, z^n) := \frac{1}{n}\log\frac{Q_{XYZ}^n(x^n, y^n, z^n)\, Q_Z^n(z^n)}{Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n)} \to I(X, Y, Z) := E\log\frac{P(X, Y, Z)\, P(Z)}{P(X, Z)\, P(Y, Z)}$$
with probability one as $n \to \infty$, and $X \perp Y \mid Z$ if and only if $I(X, Y, Z) = 0$. Then, we can show $J_n(x^n, y^n, z^n) \le 0$ if and only if $I(X, Y, Z) = 0$, with probability one as $n \to \infty$ (see Section 3.1). For example, between the seventh and eleventh factorizations, if $J_n(x^n, y^n, z^n) \le 0$ and $J_n(x^n, y^n, z^n) > 0$, then we choose the seventh and eleventh, respectively. In fact,
$$p_7\, \frac{Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n)}{Q_Z^n(z^n)} \ge p_{11}\, Q_{XYZ}^n(x^n, y^n, z^n) \iff J_n(x^n, y^n, z^n) \le 0$$
for large $n$, because $\frac{1}{n}\log\frac{p_7}{p_{11}}$ diminishes.
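The choice between the seventh and eleventh factorizations can be sketched numerically. The following fragment (our own helper names, again using the Dirichlet(1/2) mixture as a concrete $Q^n$) computes $J_n(x^n, y^n, z^n)$ for binary triples and checks its sign in both regimes:

```python
from itertools import product
from math import lgamma

def log_q(counts):
    """log Q^n under the Dirichlet(1/2) (KT) prior, from cell counts."""
    n, a = sum(counts), len(counts)
    return (lgamma(a / 2) + sum(lgamma(c + 0.5) for c in counts)
            - a * lgamma(0.5) - lgamma(n + a / 2))

def cell_counts(triples, idx):
    """Joint counts over the selected coordinates (zero cells included)."""
    c = {key: 0 for key in product((0, 1), repeat=len(idx))}
    for t in triples:
        c[tuple(t[i] for i in idx)] += 1
    return list(c.values())

def J_n(triples):
    """J_n(x^n, y^n, z^n) = (1/n) log [Q_XYZ Q_Z / (Q_XZ Q_YZ)]."""
    n = len(triples)
    return (log_q(cell_counts(triples, (0, 1, 2))) + log_q(cell_counts(triples, (2,)))
            - log_q(cell_counts(triples, (0, 2))) - log_q(cell_counts(triples, (1, 2)))) / n

ci = [(i % 2, (i // 2) % 2, (i // 4) % 2) for i in range(2400)]  # X _||_ Y | Z empirically
dp = [(x, x, z) for (x, _, z) in ci]                             # Y = X
assert J_n(ci) <= 0   # pick the seventh factorization
assert J_n(dp) > 0    # pick the eleventh
```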
Then, the decision is correct with probability one as $n \to \infty$. Similarly, we calculate:
$$-\log Q_Z^n(z^n) \asymp H^n(z^n) + \frac{\gamma - 1}{2}\log n,$$
$$-\log Q_{YZ}^n(y^n, z^n) \asymp H^n(y^n, z^n) + \frac{\beta\gamma - 1}{2}\log n,$$
$$-\log Q_{ZX}^n(z^n, x^n) \asymp H^n(z^n, x^n) + \frac{\gamma\alpha - 1}{2}\log n,$$
and:
$$-\log Q_{XYZ}^n(x^n, y^n, z^n) \asymp H^n(x^n, y^n, z^n) + \frac{\alpha\beta\gamma - 1}{2}\log n,$$
where $\gamma = |Z|$. In general, for $N$ variables, given $P$ and $P^*$, we have all of the CI statements for each of them, and $D(P^* \| P) = 0$ if and only if the CI statements in $P$ imply those in $P^*$; in other words, $P$ induces an I-map, which is not necessarily minimal.
Note that for any subsets $a, b, c$ of $\{1, \dots, N\}$, we can construct the estimator $J_n(x^n, y^n, z^n)$ with $X = \{X^{(i)}\}_{i \in a}$, $Y = \{X^{(j)}\}_{j \in b}$, $Z = \{X^{(k)}\}_{k \in c}$, and obtain consistency, i.e., we will have the correct CI statements, where $c$ may be empty.
Table 1 depicts whether $D(P^* \| P) > 0$ or $D(P^* \| P) = 0$ for each $P^*$ and $P$. For example, if the factorizations of $P^*$ and $P$ are the fourth and sixth, then $D(P^* \| P) = 0$ from the table. In general, $D(P^* \| P) = 0$ if and only if $P^*$ can be realized using the factorization and an appropriate parameter set for $P$.
Table 1. Three-variable case: $D(P^* \| P) > 0$ or $D(P^* \| P) = 0$: "+" and "0" denote $D(P^* \| P) > 0$ and $D(P^* \| P) = 0$, respectively.

                        Estimated P
                 1  2  3  4  5  6  7  8  9 10 11
True P*     1    *  0  0  0  0  0  0  0  0  0  0
            2    +  *  +  +  +  0  0  +  +  +  0
            3    +  +  *  +  0  +  0  +  +  +  0
            4    +  +  +  *  0  0  +  +  +  +  0
            5    +  +  +  +  *  +  +  +  +  +  0
            6    +  +  +  +  +  *  +  +  +  +  0
            7    +  +  +  +  +  +  *  +  +  +  0
            8    +  +  +  +  +  +  +  *  +  +  0
            9    +  +  +  +  +  +  +  +  *  +  0
           10    +  +  +  +  +  +  +  +  +  *  0
           11    +  +  +  +  +  +  +  +  +  +  *

2.2. Universal Measures for Continuous Variables

In this section, we primarily address continuous variables.
Let $\{A_j\}$ be such that $A_0 = \{X\}$, and let $A_{j+1}$ be a refinement of $A_j$. For example, suppose that the random variable $X$ takes values in $X = [0, 1]$, and we generate a sequence as follows:
$$A_1 = \left\{\left[0, \tfrac12\right), \left[\tfrac12, 1\right)\right\},\quad A_2 = \left\{\left[0, \tfrac14\right), \left[\tfrac14, \tfrac12\right), \left[\tfrac12, \tfrac34\right), \left[\tfrac34, 1\right)\right\},\quad \dots,\quad A_j = \left\{\left[0, 2^{-j}\right), \left[2^{-j}, 2\cdot 2^{-j}\right), \dots, \left[(2^j - 1)\cdot 2^{-j}, 1\right)\right\}, \dots$$
For each $j$, we quantize each $x \in [0, 1]$ into the $a \in A_j$ such that $x \in a$. For example, for $j = 2$, $x = 0.4$ is quantized into $a = [\tfrac14, \tfrac12) \in A_2$. Let $\lambda$ be the Lebesgue measure (the width of an interval). For example, $\lambda([\tfrac14, \tfrac12)) = \tfrac14$ and $\lambda(\{\tfrac12\}) = 0$.
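A hypothetical quantizer implementing this dyadic scheme (function names are ours):

```python
def quantize(x, j):
    """Return the half-open dyadic interval a in A_j containing x in [0, 1)."""
    width = 2.0 ** (-j)
    i = int(x / width)            # bin index, 0 .. 2^j - 1 for x in [0, 1)
    return (i * width, (i + 1) * width)

def lam(a):
    """Lebesgue measure (width) of an interval a = (lo, hi)."""
    return a[1] - a[0]

assert quantize(0.4, 2) == (0.25, 0.5)   # the example in the text
assert lam(quantize(0.4, 2)) == 0.25
```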
Note that each $A_j$ is a finite set. Therefore, we can construct a universal measure $Q_j^n$ w.r.t. the finite set $A_j$ for each $j$. Given $x^n = (x_1, \dots, x_n) \in [0, 1]^n$, we obtain a quantized sequence $(a_1(j), \dots, a_n(j)) \in A_j^n$ for each $j$ and use it to compute the quantity:
$$g_j^n(x^n) := \frac{Q_j^n(a_1(j), \dots, a_n(j))}{\lambda(a_1(j))\cdots\lambda(a_n(j))}$$
for each $j$. If we prepare a sequence of positive reals $w_1, w_2, \dots$ such that $\sum_j w_j = 1$ and $w_j > 0$, we can compute the quantity:
$$g_X^n(x^n) := \sum_{j=1}^\infty w_j\, g_j^n(x^n).$$
Moreover, let $f_X$ be the true density function, and let $f_j(x) := P(X \in a)/\lambda(a)$ for $a \in A_j$ and $j = 1, 2, \dots$ if $x \in a$. We may consider $f_j$ to be an approximated density function associated with the quantization sequence $\{A_j\}$ (Figure 2). For the given $x^n$, we define $f_X^n(x^n) := f_X(x_1)\cdots f_X(x_n)$ and $f_j^n(x^n) := f_j(x_1)\cdots f_j(x_n)$.
Figure 2. Quantization at level $j$: $x^n = (x_1, \dots, x_n) \mapsto (a_1(j), \dots, a_n(j))$.
Thus, we have the following proposition, which is a continuous version of the universality (4) that was proven in Section 2.1.
Proposition 1 ([10]). For any density function $f_X$ such that $D(f_X \| f_j) \to 0$ as $j \to \infty$,
$$\frac{1}{n}\log\frac{f_X^n(x^n)}{g_X^n(x^n)} \to 0$$
as $n \to \infty$ with probability one, where $D(f_X \| f_j)$ is the Kullback–Leibler divergence between $f_X$ and $f_j$.
The same concept applies to the case where no density function exists [11] in the usual sense (w.r.t. the Lebesgue measure $\lambda$). For example, suppose that we wish to estimate a distribution over the positive integers $\mathbb{N}$. Apparently, $\mathbb{N}$ is not a finite set and has no density function. We consider the quantization sequence $\{B_k\}$: $B_0 = \{\mathbb{N}\}$, $B_1 := \{\{1\}, \{2, 3, \dots\}\}$, $B_2 := \{\{1\}, \{2\}, \{3, 4, \dots\}\}$, ..., $B_k := \{\{1\}, \{2\}, \dots, \{k\}, \{k+1, k+2, \dots\}\}$, ....
For each $k$, we quantize each $y \in \mathbb{N}$ into the $b \in B_k$ such that $y \in b$. For example, for $k = 2$, $y = 4$ is quantized into $b = \{3, 4, \dots\} \in B_2$. Let $\eta$ be a measure such that:
$$\eta(\{k\}) = \frac{1}{k} - \frac{1}{k+1}, \quad k \in \mathbb{N}.$$
The measure $\eta(a)$ for a contiguous block $a$ of integers gives:
$$\eta(a) = \sum_{k \in a}\eta(\{k\}) = \sum_{k \in a}\left(\frac{1}{k} - \frac{1}{k+1}\right) = \frac{1}{k_{min}} - \frac{1}{k_{max}+1}$$
if $k_{min}$ and $k_{max}$ are the minimum and maximum integers in $a$; it evaluates each bin width in a nonstandard way. For example, $\eta(\{2\}) = \frac{1}{6}$ and $\eta(\{3, 4\}) = \frac{2}{15}$. For multiple variables, we compute the measure by the product:
$$\eta(\{j\}, \{k\}) = \left(\frac{1}{j} - \frac{1}{j+1}\right)\left(\frac{1}{k} - \frac{1}{k+1}\right).$$
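The telescoping property, the worked examples, and the fact that $\eta$ puts total mass one on $\mathbb{N}$ can all be checked exactly with rational arithmetic (helper names are ours):

```python
from fractions import Fraction

def eta_point(k):
    """eta({k}) = 1/k - 1/(k+1), exactly."""
    return Fraction(1, k) - Fraction(1, k + 1)

def eta(a):
    """eta of a finite set of positive integers."""
    return sum(eta_point(k) for k in a)

assert eta({2}) == Fraction(1, 6)              # examples from the text
assert eta({3, 4}) == Fraction(2, 15)
# a contiguous block {k_min, ..., k_max} telescopes to 1/k_min - 1/(k_max+1)
assert eta(range(3, 5)) == Fraction(1, 3) - Fraction(1, 5)
# total mass over {1, ..., m} is 1 - 1/(m+1), so eta(N) = 1
assert sum(eta_point(k) for k in range(1, 10000)) == 1 - Fraction(1, 10000)
```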
Note that each $B_k$ is a finite set, and we can construct a universal measure $Q_k^n$ w.r.t. the finite set $B_k$ for each $k$. Given $y^n = (y_1, \dots, y_n) \in \mathbb{N}^n$, we obtain a quantized sequence $(b_1(k), \dots, b_n(k)) \in B_k^n$ for each $k$, such that we can compute the quantity:
$$g_k^n(y^n) := \frac{Q_k^n(b_1(k), \dots, b_n(k))}{\eta(b_1(k))\cdots\eta(b_n(k))}$$
for each $k$. If we prepare a sequence of positive reals $w_1, w_2, \dots$ such that $\sum_k w_k = 1$ and $w_k > 0$, we can compute the quantity $g_Y^n(y^n) := \sum_{k=1}^\infty w_k\, g_k^n(y^n)$. In this case, $f_Y(y) = \frac{P(Y = y)}{\eta(\{y\})}$ for $y \in \mathbb{N}$ ($f_Y(y)$ with $y \notin \mathbb{N}$ may take any arbitrary value) is considered to be a generalized density function (w.r.t. the measure $\eta$).
In general, if $\eta(D) = 0$ implies $P(Y \in D) = 0$ for all Borel sets $D$ (the Borel sets w.r.t. $\mathbb{R}$ being the sets generated via a countable number of unions, intersections and set differences from the closed intervals of $\mathbb{R}$ [2]), we say that $P$ is absolutely continuous w.r.t. $\eta$ and that there exists a density function w.r.t. $\eta$ (Radon–Nikodym [2]).
The following proposition addresses generalized densities and eliminates the condition $D(f_Y \| f_j) \to 0$ as $j \to \infty$ in Proposition 1.
Proposition 2 ([11]). For any generalized density function $f_Y$,
$$\frac{1}{n}\log\frac{f_Y^n(y^n)}{g_Y^n(y^n)} \to 0$$
as $n \to \infty$ with probability one.
Proposition 1 assumes a specific quantization sequence, such as $\{A_j\}$. The universality holds for densities that satisfy $D(f_X \| f_k) \to 0$ as $k \to \infty$ [10]. However, in the proof of Proposition 2, a universal quantization, such that $D(f_X \| f_k) \to 0$ as $k \to \infty$ for any density $f_X$, was constructed [11].

3. Contributions

3.1. The Hannan and Quinn Principle

We know that $H^n(x^n) + H^n(y^n) - H^n(x^n, y^n)$ is at most $\frac{(\alpha-1)(\beta-1)}{2}\log n$ with probability one as $n \to \infty$ when $X \perp Y$, because the decision based on (1) is strongly consistent.
In this section, we prove a stronger result. Let:
$$I_n(x^n, y^n, z^n) := H^n(x^n, z^n) + H^n(y^n, z^n) - H^n(x^n, y^n, z^n) - H^n(z^n).$$
We show that the quantity $I_n(x^n, y^n, z^n)$ is at most $(1+\epsilon)(\alpha-1)(\beta-1)\gamma\log\log n$, rather than $\frac12(\alpha-1)(\beta-1)\gamma\log n$, when $X \perp Y \mid Z$:
Theorem 1. If $X \perp Y \mid Z$, then:
$$I_n(x^n, y^n, z^n) \le (1+\epsilon)(\alpha-1)(\beta-1)\gamma\log\log n$$
with probability one as $n \to \infty$, for any $\epsilon > 0$.
In order to show the claim, we approximate $I_n(x^n, y^n, z^n)$ by $\sum_{z \in Z} I(z)$ with $I(z) = \frac12\sum_{i=1}^{\alpha-1}\sum_{j=1}^{\beta-1} r_{i,j}^2$, where $r_{i,j}$, $i = 1, \dots, \alpha-1$, $j = 1, \dots, \beta-1$, are mutually independent random variables with mean zero and variance $\sigma_{i,j}^2$, such that:
$$\sum_{i=1}^{\alpha-1}\sum_{j=1}^{\beta-1}\sigma_{i,j}^2 = (\alpha-1)(\beta-1).$$
Then, from the law of the iterated logarithm below (Lemma 1) [2], it will be proven that $r_{i,j}^2$ is almost surely upper-bounded by $2(1+\epsilon)\sigma_{i,j}^2\log\log n$ for any $\epsilon > 0$ and each $z \in Z$, which implies Theorem 1 because:
$$I_n(x^n, y^n, z^n) \approx \sum_z I(z) = \gamma\cdot\frac12\sum_i\sum_j r_{i,j}^2 \le \gamma\cdot\frac12\sum_i\sum_j 2(1+\epsilon)\sigma_{i,j}^2\log\log n = (1+\epsilon)(\alpha-1)(\beta-1)\gamma\log\log n$$
(see the Appendix for the details of the derivation).
Lemma 1 ([2]). Let $U_1, U_2, \dots$ be independent random variables that obey an identical distribution with zero mean and unit variance, and let $S_n := \sum_{k=1}^n U_k$. Then, with probability one,
$$\limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log\log n}} = 1.$$
Theorem 1 implies the strong consistency of the decision based on (1). However, a stronger statement can be obtained:
Theorem 2. We define $R_Z^n(z^n)$, $R_{XZ}^n(x^n, z^n)$, $R_{YZ}^n(y^n, z^n)$ and $R_{XYZ}^n(x^n, y^n, z^n)$ by:
$$-\log R_Z^n(z^n) = H^n(z^n) + (1+\epsilon)(\gamma - 1)\log\log n,$$
$$-\log R_{XZ}^n(x^n, z^n) = H^n(x^n, z^n) + (1+\epsilon)(\alpha\gamma - 1)\log\log n,$$
$$-\log R_{YZ}^n(y^n, z^n) = H^n(y^n, z^n) + (1+\epsilon)(\beta\gamma - 1)\log\log n,$$
and:
$$-\log R_{XYZ}^n(x^n, y^n, z^n) = H^n(x^n, y^n, z^n) + (1+\epsilon)(\alpha\beta\gamma - 1)\log\log n.$$
Then, the decision based on:
$$R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n) \ge R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n) \implies X \perp Y \mid Z$$
is strongly consistent.
Proof. We note two properties:
  • $R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n) \ge R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n)$ is equivalent to (7); and
  • $\lim_{n\to\infty}\frac{1}{n}\log\frac{R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n)}{R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n)} = \lim_{n\to\infty}\frac{1}{n}\log\frac{Q_{XYZ}^n(x^n, y^n, z^n)\, Q_Z^n(z^n)}{Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n)} = I(X, Y, Z)$.
If $X \perp Y \mid Z$, then from Theorem 1 and the first property, we have $R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n) \ge R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n)$ almost surely. Conversely, if $R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n) \ge R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n)$ holds almost surely, then the value in the second property is no greater than zero, which means that $X \perp Y \mid Z$. This completes the proof. ☐
Theorem 2 is related to the Hannan and Quinn theorem [7] for model selection. To obtain strong consistency, they proved that $\log\log n$, rather than $\frac12\log n$, is sufficient for the penalty terms of autoregressive model selection. Recently, several authors have proven this in other settings, such as classification [9] and linear regression [8].

3.2. Consistency for Continuous Variables

Suppose that we wish to estimate the distribution over $[0, 1] \times \mathbb{N}$ from Section 2.2. The set $[0, 1] \times \mathbb{N}$ is not a finite set and has no density function.
Because $A_j \times B_k$ is a finite set, we can construct a universal measure $Q_{j,k}^n$ for $A_j \times B_k$:
$$g_{jk}^n(x^n, y^n) := \frac{Q_{j,k}^n(a_1(j), \dots, a_n(j), b_1(k), \dots, b_n(k))}{\lambda(a_1(j))\cdots\lambda(a_n(j))\, \eta(b_1(k))\cdots\eta(b_n(k))}.$$
If we prepare weights such that $\sum_j\sum_k w_{j,k} = 1$, $w_{j,k} > 0$, we obtain the quantity:
$$g_{XY}^n(x^n, y^n) := \sum_{j=1}^\infty\sum_{k=1}^\infty w_{j,k}\, g_{jk}^n(x^n, y^n).$$
In this case, the (generalized) density function is obtained via:
$$f_{XY}(x, y) = \frac{dF_X(x \mid y)}{dx}\cdot\frac{P(Y = y)}{\eta(\{y\})}$$
for $y \in \mathbb{N}$ ($f_{XY}$ takes arbitrary values for $x \notin [0, 1]$ or $y \notin \mathbb{N}$), where $F_X(\cdot \mid y)$ is the conditional distribution function of $X$ given $Y = y$.
In general, we have the following result:
Proposition 3. For any generalized density function $f_{XY}$:
$$\frac{1}{n}\log\frac{f_{XY}^n(x^n, y^n)}{g_{XY}^n(x^n, y^n)} \to 0$$
as $n \to \infty$ with probability one.
The measures $g_X^n(x^n)$ and $g_{XY}^n(x^n, y^n)$ are computed using (A) and (B) of Algorithm 1, where the value of $K$ is the number of quantization levels, and $\hat g_X^n(x^n)$ and $\hat g_{XY}^n(x^n, y^n)$ denote the approximated scores using finite quantization up to level $K$.
Algorithm 1 Calculating $g^n$.
(A) Input: $x^n \in A^n$. Output: $\hat g_X^n(x^n)$.
1. For each $k = 1, \dots, K$: $\log g_k^n(x^n) := 0$.
2. For each $k = 1, \dots, K$ and each $a \in A_k$: $c_k(a) := 0$.
3. For each $i = 1, \dots, n$:
  (a) $A_0 = \{X\}$, $a_i(0) = x_i$;
  (b) for each $k = 1, \dots, K$:
    i. find $a_i(k) \in A_k$ from $a_i(k-1) \in A_{k-1}$;
    ii. $\log g_k^n(x^n) := \log g_k^n(x^n) + \log\dfrac{c_{i,k}(a_i(k)) + 1/2}{i - 1 + |A_k|/2} - \log\eta_X(a_i(k))$;
    iii. $c_{i,k}(a_i(k)) := c_{i,k}(a_i(k)) + 1$.
4. $\hat g_X^n(x^n) := \sum_{k=1}^K \frac{1}{K}\, g_k^n(x^n)$.
(B) Input: $x^n \in A^n$ and $y^n \in B^n$. Output: $\hat g_{XY}^n(x^n, y^n)$.
1. For each $j, k = 1, \dots, K$: $\log g_{j,k}^n(x^n, y^n) := 0$.
2. For each $j, k = 1, \dots, K$ and each $a \in A_j$ and $b \in B_k$: $c_{j,k}(a, b) := 0$.
3. For each $i = 1, \dots, n$:
  (a) $A_0 = \{X\}$, $B_0 = \{Y\}$, $a_i(0) = x_i$, $b_i(0) = y_i$;
  (b) for each $j, k = 1, \dots, K$:
    i. find $a_i(j) \in A_j$ and $b_i(k) \in B_k$ from $a_i(j-1) \in A_{j-1}$ and $b_i(k-1) \in B_{k-1}$;
    ii. $\log g_{j,k}^n(x^n, y^n) := \log g_{j,k}^n(x^n, y^n) + \log\dfrac{c_{i,j,k}(a_i(j), b_i(k)) + 1/2}{i - 1 + |A_j||B_k|/2} - \log\bigl(\eta_X(a_i(j))\, \eta_Y(b_i(k))\bigr)$;
    iii. $c_{i,j,k}(a_i(j), b_i(k)) := c_{i,j,k}(a_i(j), b_i(k)) + 1$.
4. $\hat g_{XY}^n(x^n, y^n) := \sum_{j=1}^K\sum_{k=1}^K \frac{1}{K^2}\, g_{j,k}^n(x^n, y^n)$.
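As a concrete illustration, the following Python sketch implements a simplified version of Algorithm 1(A) for the dyadic quantizers $A_k$ of Section 2.2 on $[0, 1)$ (so the bin measure is $\lambda(a) = 2^{-k}$), with uniform weights $w_k = 1/K$ and the accumulation done in the log domain. All function and variable names are ours, not the paper's:

```python
import math
from math import log

def g_hat(xs, K):
    """Sketch of Algorithm 1(A) for x^n in [0,1)^n with dyadic quantizers A_k
    (|A_k| = 2^k, lambda(a) = 2^-k) and uniform weights w_k = 1/K.
    Returns log of the approximated score g^_X^n(x^n)."""
    log_g = [0.0] * (K + 1)                  # log g_k^n, indices 1..K
    counts = [dict() for _ in range(K + 1)]  # running counts c_{i,k}(a)
    for i, x in enumerate(xs, start=1):
        a = 0                                # bin index at level 0
        for k in range(1, K + 1):
            a = 2 * a + (int(x * 2 ** k) & 1)    # refine a_i(k-1) -> a_i(k)
            c = counts[k].get(a, 0)
            # KT update (c + 1/2)/(i - 1 + |A_k|/2), then divide by lambda(a)
            log_g[k] += log(c + 0.5) - log(i - 1 + 2 ** (k - 1)) + k * log(2)
            counts[k][a] = c + 1
    # log of (1/K) * sum_k g_k^n via log-sum-exp
    m = max(log_g[1:])
    return m + log(sum(math.exp(v - m) for v in log_g[1:])) - log(K)

# for a roughly uniform sample, (1/n) log g^ should be close to 0
xs = [(i * 0.6180339887) % 1.0 for i in range(2000)]
assert abs(g_hat(xs, K=4) / len(xs)) < 0.1
```

For $n$ examples this runs in $O(nK)$ time, matching the complexity discussion that follows.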
Propositions 1–3 are obtained for large $K$. However, we can prepare only a finite number of quantization levels. Furthermore, if $n$ is small, then the number of examples that each bin contains is small, and we cannot estimate the histogram well. Therefore, given $n$, $K$ must be moderately sized, and we recommend setting $K = \frac{1}{m}\log n$, because the number of examples contained in a bin decreases exponentially with increasing depth, where $m$ is the number of variables in the local score. For example, $m = 1$ and $m = 2$ for (A) and (B), respectively. Algorithm 1 (A)(B) does not guarantee, for finite $K$, the theoretical properties assured in Proposition 3 and Theorems 3–5; however, as $K$ grows, consistency holds.
In Step 3(b)i of Algorithm 1 (A)(B), we calculate $a_i(k)$ from $a_i(k-1)$ and not from $x_i$, which means that the computational time required to obtain $(a_i(1), \dots, a_i(K))$ from $x_i$ is $O(K)$. Thus, the total computation time of Algorithm 1 (A)(B) is at most $O(nK)$.
In Step 3(b) of Algorithm 1(A), we compute, for $i = 1, \dots, n$ and $k = 1, \dots, K$:
$$\log\frac{g_k^i(x^i)}{g_k^{i-1}(x^{i-1})} = \log\frac{Q_k^i(a_1(k), \dots, a_i(k))}{Q_k^{i-1}(a_1(k), \dots, a_{i-1}(k))} - \log\eta_X(a_i(k))$$
if $x_i$ is quantized into $a_i(k) \in A_k$, $i = 1, \dots, n$.
As for the memory requirements, we require an order exponential in $K$. However, because we set $K = \frac{1}{m}\log n$, the computational time and memory requirements are at most $O(n\log n)$ and $O(n)$, respectively, for Algorithm 1 (A)(B).
Based on the same notion, we can construct $g_Z^n(z^n)$, $g_{XZ}^n(x^n, z^n)$, $g_{YZ}^n(y^n, z^n)$ and $g_{XYZ}^n(x^n, y^n, z^n)$ from examples $x^n \in X^n$, $y^n \in Y^n$ and $z^n \in Z^n$, and Propositions 2 and 3 hold for three variables.
Theorem 3. With probability one as $n \to \infty$:
$$\frac{1}{n}\log\frac{g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n)}{g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n)} \to I(X, Y, Z).$$
Proof. From Propositions 2 and 3 for two and three variables and the law of large numbers, we have:
$$\lim_{n\to\infty}\frac{1}{n}\log\frac{g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n)}{g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n)} = \lim_{n\to\infty}\frac{1}{n}\log\frac{f_{XYZ}^n(x^n, y^n, z^n)\, f_Z^n(z^n)}{f_{XZ}^n(x^n, z^n)\, f_{YZ}^n(y^n, z^n)} = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n\log\frac{f_{XYZ}(x_i, y_i, z_i)\, f_Z(z_i)}{f_{XZ}(x_i, z_i)\, f_{YZ}(y_i, z_i)} = E\log\frac{f_{XYZ}(X, Y, Z)\, f_Z(Z)}{f_{XZ}(X, Z)\, f_{YZ}(Y, Z)} = I(X, Y, Z)$$
with probability one, which completes the proof. ☐
From the discussion in Section 2.1, even when more than two variables are present, if $D(P^* \| P) > 0$, we can choose $P^*$ rather than $P$ with probability one as $n \to \infty$.
Now, we prove that the continuous counterpart of the decision based on (1) is strongly consistent:
Theorem 4. With probability one as $n \to \infty$:
$$X \perp Y \mid Z \iff p\, g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n) \ge (1-p)\, g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n),$$
where $p$ is the prior probability of $X \perp Y \mid Z$.
Proof. Suppose that $X \not\perp Y \mid Z$. Then, the conditional mutual information between $X$ and $Y$ given $Z$ is positive, and from Theorem 3, the estimator converges to a positive value with probability one as $n \to \infty$; thus, $p\, g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n) < (1-p)\, g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n)$ holds almost surely. Suppose now that $X \perp Y \mid Z$. The discrete variables $X$ and $Y$ are conditionally independent given $Z$ if and only if:
$$c\, Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n) \ge (1-c)\, Q_{XYZ}^n(x^n, y^n, z^n)\, Q_Z^n(z^n)$$
with probability one as $n \to \infty$ for any constant $0 < c < 1$, even if $c$ does not coincide with the prior probability $p$. If $X$, $Y$ and $Z$ are continuous, we quantize $x^n$, $y^n$ and $z^n$ into $(a_1(j), \dots, a_n(j))$, $(b_1(k), \dots, b_n(k))$ and $(c_1(l), \dots, c_n(l))$. Thus, for each $j$, $k$ and $l$, we have:
$$p\, w_{jl}\, w_{kl}\, Q_{jl}^n(a_1(j), \dots, a_n(j), c_1(l), \dots, c_n(l))\, Q_{kl}^n(b_1(k), \dots, b_n(k), c_1(l), \dots, c_n(l)) \ge (1-p)\, w_{jkl}\, w_l\, Q_{jkl}^n(a_1(j), \dots, a_n(j), b_1(k), \dots, b_n(k), c_1(l), \dots, c_n(l))\, Q_l^n(c_1(l), \dots, c_n(l))$$
with probability one as $n \to \infty$. Thus, if we divide both sides by:
$$\eta_X(a_1(j))\cdots\eta_X(a_n(j))\, \eta_Y(b_1(k))\cdots\eta_Y(b_n(k))\, \eta_Z(c_1(l))\cdots\eta_Z(c_n(l))$$
and take summations of both sides over $j, k, l = 1, 2, \dots$, we have:
$$p\, g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n) \ge (1-p)\, g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n)$$
with probability one, where we have assumed $w_{jkl} > 0$ and $w_{jl}, w_{kl} > 0$, which completes the proof. ☐
Note that even if either X or Y is discrete, the same conclusion is obtained, since the generalized density functions cover the discrete distributions as a special case.
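For fully discrete data, the decision in Theorem 4 can be illustrated with Krichevsky–Trofimov (KT) estimators [12] playing the role of the Q^n quantities. The sketch below is ours: the binary alphabet, the helper names and the choice p = 1/2 are assumptions for illustration, not the paper's exact construction.

```python
import math
import random
from collections import Counter

def log_kt(seq, alphabet_size):
    """Log Krichevsky-Trofimov probability of a sequence:
    P(next = a | past) = (count(a) + 1/2) / (t + alphabet_size / 2)."""
    counts, logp = Counter(), 0.0
    for t, a in enumerate(seq):
        logp += math.log((counts[a] + 0.5) / (t + alphabet_size / 2))
        counts[a] += 1
    return logp

def decide_ci(xs, ys, zs, p=0.5):
    """True iff p * Q_XZ^n * Q_YZ^n >= (1 - p) * Q_XYZ^n * Q_Z^n (log domain)."""
    k = 2  # binary variables assumed in this sketch
    lhs = (math.log(p) + log_kt(list(zip(xs, zs)), k * k)
                       + log_kt(list(zip(ys, zs)), k * k))
    rhs = (math.log(1 - p) + log_kt(list(zip(xs, ys, zs)), k ** 3)
                           + log_kt(zs, k))
    return lhs >= rhs

rng = random.Random(0)
n = 10000
zs = [rng.randint(0, 1) for _ in range(n)]
xs = [z ^ (rng.random() < 0.2) for z in zs]  # X depends only on Z
ys = [z ^ (rng.random() < 0.2) for z in zs]  # Y depends only on Z
print(decide_ci(xs, ys, zs))  # favors X ⊥ Y | Z for large n
```

The penalty difference between the two sides is of order log n, which is exactly why the decision favors conditional independence when the empirical conditional mutual information stays of order 1/n.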
From the discussion in Section 2.1, even when more than two variables are present, if D(P*||P) = 0, we can choose P* rather than P with probability one as n → ∞.
Let h_Z^n(z^n), h_{XZ}^n(x^n, z^n), h_{YZ}^n(y^n, z^n) and h_{XYZ}^n(x^n, y^n, z^n) take the same values as g_Z^n(z^n), g_{XZ}^n(x^n, z^n), g_{YZ}^n(y^n, z^n) and g_{XYZ}^n(x^n, y^n, z^n), except that the log n terms in −log Q_Z^n(z^n), −log Q_{XZ}^n(x^n, z^n), −log Q_{YZ}^n(y^n, z^n) and −log Q_{XYZ}^n(x^n, y^n, z^n) are replaced by 2(1 + ϵ) log log n, respectively, where ϵ > 0 is arbitrary. Then, we obtain the final result:
Theorem 5. With probability one as n → ∞:
p h_{XZ}^n(x^n, z^n) h_{YZ}^n(y^n, z^n) ≥ (1 − p) h_{XYZ}^n(x^n, y^n, z^n) h_Z^n(z^n) ⟺ X ⊥ Y | Z.
This paper focuses on the theoretical aspects of BN structure learning, in particular on consistency when continuous variables are present. For details of the practical matters touched on in this section, see the conference paper [14].

3.3. The Number of Local Scores to be Computed

We refer to the left-hand side of (8) as the conditional independence (CI) score w.r.t. X and Y given Z. Suppose that we follow the fastest Bayesian network structure learning procedure, due to [6]: let Pa(X, V) be the optimal parent set of X ∈ V contained in V − {X} for V ⊆ U := {1, …, N}, and let S(X, V) be its local score. Then, we can obtain:
T(V) := max_{X ∈ V} { S(X, V) + T(V − {X}) }
for each V ⊆ U, the sinks:
X_N = argmax_{X ∈ U} T(U), X_{N−1} = argmax_{X ∈ U − {X_N}} T(U − {X_N}), …,
and the parent sets:
Pa(X_N, U), Pa(X_{N−1}, U − {X_N}), …, {}.
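The recursion T(V) and the recovery of sinks and parent sets can be sketched as follows. This is an illustrative reimplementation of the dynamic-programming scheme, not the paper's code; the toy `local_score` is our assumption, standing in for a real local score such as (1/n) log g.

```python
from itertools import combinations

def best_network(U, local_score):
    """Silander-Myllymaki style dynamic programming:
    T(V) = max_{X in V} { S(X, V) + T(V - {X}) }, where
    S(X, V) maximizes local_score(X, P) over parent sets P of X within V - {X}."""
    U = frozenset(U)

    def S(X, V):  # best (score, parent set) for X among subsets of V - {X}
        rest = V - {X}
        return max(((local_score(X, frozenset(P)), frozenset(P))
                    for r in range(len(rest) + 1)
                    for P in combinations(rest, r)),
                   key=lambda t: t[0])

    T, sink = {frozenset(): 0.0}, {}
    for r in range(1, len(U) + 1):
        for V in map(frozenset, combinations(U, r)):
            T[V], sink[V] = max(((S(X, V)[0] + T[V - {X}], X) for X in V),
                                key=lambda t: t[0])

    parents, V = {}, U  # recover sinks X_N, X_{N-1}, ... and their parent sets
    while V:
        X = sink[V]
        parents[X] = S(X, V)[1]
        V = V - {X}
    return T[U], parents

# Toy local score (our assumption): reward the edge 1 -> 2, penalize parents
score = lambda X, P: 1.0 if (X == 2 and P == frozenset({1})) else -float(len(P))
total, pa = best_network({1, 2, 3}, score)
print(total, pa)
```

Removing the best sink X_N and repeating on U − {X_N} guarantees acyclicity of the recovered network, which is the key point of the sink-based decomposition.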
For each fixed pair (X, V), maximizing the local score (1/n) log ( g_{W+{X}} / g_W ) and maximizing the CI score (1/n) log ( g_{V−{X}} g_{W+{X}} / ( g_V g_W ) ) with respect to W ⊆ V − {X} are equivalent. In other words,
(1/n) log ( g_{W+{X}} / g_W ) ≥ (1/n) log ( g_{W′+{X}} / g_{W′} ) ⟺ (1/n) log ( g_{V−{X}} g_{W+{X}} / ( g_V g_W ) ) ≥ (1/n) log ( g_{V−{X}} g_{W′+{X}} / ( g_V g_{W′} ) )
for W, W′ ⊆ V − {X}.
On the other hand, from [15,16], we know that the relationship between the complexity term and the likelihood term gives tight bounds on the maximum number of parents in the optimal BN for any given dataset. In particular, the number of elements in each parent set Pa(X, V) is at most O(log n) for X ∈ V and V ⊆ U. Hence, the number of CI scores to be computed is much less than exponential in N.

4. Concluding Remarks

In this paper, we considered the problem of learning a Bayesian network structure from examples and provided two contributions.
First, we found that the log n terms in the penalty terms of the description length can be replaced by 2(1 + ϵ) log log n to obtain strong consistency, where the derivation is based on the law of the iterated logarithm. We claim that the Hannan and Quinn principle [7] is applicable to this problem.
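The gap between the two penalty growth rates can be checked with plain arithmetic (this is an illustration of ours, not code from the paper):

```python
import math

def mdl_penalty(n):
    return math.log(n)  # the "log n" term in the description length

def hq_penalty(n, eps=0.1):
    return 2 * (1 + eps) * math.log(math.log(n))  # Hannan-Quinn replacement

for n in (10**2, 10**4, 10**6):
    print(n, round(mdl_penalty(n), 2), round(hq_penalty(n), 2))
```

Both penalties diverge, which is what consistency requires, but the Hannan–Quinn term grows far more slowly, so it rejects dependence less aggressively for finite n.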
Second, we constructed an extended version of the score function for finding a Bayesian network structure with the maximum posterior probability and proved that the decision is strongly consistent even when continuous variables are present. Thus far, consistency has been obtained only for discrete variables, and many authors have been seeking consistency when continuous variables are present.
Consistency has been proven in many model selection methods that maximize the posterior probability or, equivalently, minimize the description length [1]. However, almost all such methods assume either that the variables are discrete or that they obey Gaussian distributions. This paper proposed an extended version of the MDL/Bayesian principle without assuming such constraints and proved its strong consistency in a precise manner, which we believe provides a substantial contribution to the statistics and machine learning communities.

Appendix: Proof of Theorem 1

Hereafter, we write P(X = x|Z = z) and P(Y = y|Z = z) simply as P(x|z) and P(y|z), respectively, for x ∈ X, y ∈ Y and z ∈ Z. We find that the empirical mutual information:
Entropy 17 05752 i001
Entropy 17 05752 i002
Entropy 17 05752 i003
is approximated by Entropy 17 05752 i004 with:
Entropy 17 05752 i005
where the difference between them is zero with probability one as n → ∞, and (1 + t) log(1 + t) = t + t²/2 − t³/{6[1 + δ(t)t]²} with 0 < δ(t) < 1 and:
Entropy 17 05752 i006
has been applied for (11), (12) and (13), respectively. Furthermore, we derive:
Entropy 17 05752 i007
where V = (V_{xy})_{x ∈ X, y ∈ Y} with Entropy 17 05752 i022, and u and v are the column vectors Entropy 17 05752 i008 and Entropy 17 05752 i009, respectively. Hereafter, we arbitrarily fix z ∈ Z. Let U = (u[0], u[1], …, u[α − 1]) with u[0] = u, and W = (w[0], w[1], …, w[β − 1]) with w[0] = w, be eigenvectors of Entropy 17 05752 i010 and Entropy 17 05752 i011, where E_m is the identity matrix of dimension m.
Then, ᵗuVw = 0, and for Ũ = (u[1], …, u[α − 1]) and W̃ = (w[1], …, w[β − 1]), we have:
Entropy 17 05752 i012
and:
Entropy 17 05752 i013
If we note that UᵗU = ᵗUU = E_α and WᵗW = ᵗWW = E_β, we obtain:
Entropy 17 05752 i014
and find that (14) becomes:
Entropy 17 05752 i015
with r_{ij} := ᵗu[i] V w[j]. Then, we can see:
Entropy 17 05752 i016
and that the (α − 1) × (β − 1) matrix ᵗŨVW̃ consists of mutually independent elements r_{ij} with i = 1, …, α − 1 and j = 1, …, β − 1: E[r_{ij}] = 0, and:
Entropy 17 05752 i017
where σ_{ij}² is the variance of r_{ij}, i.e., the expectation of r_{ij}², so that (15) implies:
Entropy 17 05752 i018
If we define for each xX and yY and for i = 1, …, n:
Entropy 17 05752 i019
where u[i] = (u[i, x])_{x ∈ X} and w[j] = (w[y, j])_{y ∈ Y}, then we can check that E[Z_{i,j,k}] = 0 and V[Z_{i,j,k}] = 1, where the expectation E and variance V are with respect to the examples X^n = x^n and Y^n = y^n, and I(A) takes one if the event A is true and zero otherwise. We can easily check:
Entropy 17 05752 i020
We consider applying the obtained derivation to Lemma 1. From (17), we obtain:
Entropy 17 05752 i021
which means that (14) is upper bounded by (1 + ϵ)(α − 1)(β − 1) log log n with probability one as n → ∞ for any ϵ > 0, from (16). This completes the proof of Theorem 1.
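The role of the law of the iterated logarithm in this bound can be visualized with a small simulation (an illustration of ours, with an assumed step distribution, not part of the proof):

```python
import numpy as np

# Simulate partial sums S_n of i.i.d. steps with mean 0 and variance 1
rng = np.random.default_rng(1)
s = np.cumsum(rng.choice([-1.0, 1.0], size=10**6))
n = np.arange(1, len(s) + 1)

# Law of the iterated logarithm: limsup_n |S_n| / sqrt(2 n log log n) = 1
# with probability one, so the normalized path stays of order one
ratio = np.abs(s[100:]) / np.sqrt(2 * n[100:] * np.log(np.log(n[100:])))
print(ratio.max())  # of order one, not growing with n
```

This is why a 2(1 + ϵ) log log n penalty, and nothing smaller, is just enough to dominate the fluctuations of the empirical score under independence.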

References

  1. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471. [Google Scholar] [CrossRef]
  2. Billingsley, P. Probability and Measure, 3rd ed.; Wiley: New York, NY, USA, 1995. [Google Scholar]
  3. Friedman, N.; Linial, M.; Nachman, I.; Pe'er, D. Using Bayesian networks to analyze expression data. J. Comput. Biol. 2000, 7, 601–620. [Google Scholar] [CrossRef] [PubMed]
  4. Imoto, S.; Kim, S.; Goto, T.; Aburatani, S.; Tashiro, K.; Kuhara, S.; Miyano, S. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. J. Bioinform. Comput. Biol. 2003, 1, 231–252. [Google Scholar] [CrossRef] [PubMed]
  5. Zhang, K.; Peters, J.; Janzing, D.; Scholkopf, B. Kernel-based Conditional Independence Test and Application in Causal Discovery. In Proceedings of the 2011 Uncertainty in Artificial Intelligence Conference, Barcelona, Spain, 14–17 July 2011; pp. 804–813.
  6. Silander, T.; Myllymaki, P. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, Arlington, Virginia, 13–16 July 2006; pp. 445–452.
  7. Hannan, E.J.; Quinn, B.G. The Determination of the Order of an Autoregression. J. R. Stat. Soc. B 1979, 41, 190–195. [Google Scholar]
  8. Suzuki, J. The Hannan–Quinn Proposition for Linear Regression. Int. J. Stat. Probab. 2012, 1, 2. [Google Scholar]
  9. Suzuki, J. On Strong Consistency of Model Selection in Classification. IEEE Trans. Inf. Theory 2006, 52, 4767–4774. [Google Scholar] [CrossRef]
  10. Ryabko, B. Compression-based Methods for Nonparametric Prediction and Estimation of Some Characteristics of Time Series. IEEE Trans. Inf. Theory 2009, 55, 4309–4315. [Google Scholar] [CrossRef]
  11. Suzuki, J. Universal Bayesian Measures. In Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013; pp. 644–648.
  12. Krichevsky, R.E.; Trofimov, V.K. The Performance of Universal Encoding. IEEE Trans. Inf. Theory 1981, 27, 199–207. [Google Scholar] [CrossRef]
  13. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: Hoboken, NJ, USA, 2006. [Google Scholar]
  14. Suzuki, J. Learning Bayesian Network Structures When Discrete and Continuous Variables Are Present. In Proceedings of the 2014 Workshop on Probabilistic Graphical Models, 17–19 September 2014; Lecture Notes in Artificial Intelligence; Springer. Volume 8754, pp. 471–486.
  15. Suzuki, J. Learning Bayesian belief networks based on the minimum description length principle: An efficient algorithm using the B&B technique. In Proceedings of the 13th International Conference on Machine Learning (ICML'96), Bari, Italy, 3–6 July 1996; pp. 462–470.
  16. De Campos, C.P.; Ji, Q. Efficient Structure Learning of Bayesian Networks using Constraints. JMLR 2011, 12, 663–689. [Google Scholar]
  17. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
  18. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: San Mateo, CA, USA, 1988. [Google Scholar]
