Consistency of Learning Bayesian Network Structures with Continuous Variables : An Information Theoretic Approach

We consider the problem of learning a Bayesian network structure given n examples and the prior probability based on maximizing the posterior probability. We propose an algorithm that runs in O(n log n) time and that addresses continuous variables and discrete variables without assuming any class of distribution. We prove that the decision is strongly consistent, i.e., correct with probability one as n → ∞. To date, consistency has only been obtained for discrete variables for this class of problem, and many authors have attempted to prove consistency when continuous variables are present. Furthermore, we prove that the “log n” term that appears in the penalty term of the description length can be replaced by 2(1+ ) log log n to obtain strong consistency, where > 0 is arbitrary, which implies that the Hannan–Quinn proposition holds.


Introduction
In this paper, we address the problem of learning a Bayesian network structure from examples.For sets A, B, C of random variables, we say that A and B are conditionally independent given C if the conditional probability of A and B given C is the product of the conditional probabilities of A given C and B given C. A Bayesian network (BN) is a graphical model that expresses conditional independence (CI) relations among the prepared variables using a directed acyclic graph (DAG).We define a BN by the DAG with vertexes V = {1, • • • , N } and directed edges E = {(j, i)|i ∈ V, j ∈ π(i)}, where edge (j, k) ∈ V 2 directs from j to k, via minimal parent sets π(i) ⊆ V , i ∈ V , such that the distribution is factorized by: P (X (1) , • • • , X (N ) ) = N i=1 P (X (i) |{X (j) } j∈π(i) ) .
First, suppose that we wish to know whether two random binary variables X and Y are independent (hereafter, we write X ⊥ ⊥ Y ).If we have n pairs of actually emitted examples (X = x 1 , Y = y 1 ), • • • , (X = x n , Y = y n ) and know the prior probability p of X ⊥ ⊥ Y , then it would be reasonable to maximize the posterior probability of X ⊥ ⊥ Y given x n = (x 1 , • • • , x n ) and y n = (y 1 , • • • , y n ).If we assume that the probabilities P (X = x), P (Y = y) and P (X = x, Y = y) are parameterized by p(x|θ X ), p(y|θ Y ), and p(x, y|θ XY ) and that the prior probabilities W X , W Y , and W XY over the probabilities θ X , θ Y , and θ XY of X ∈ {0, 1}, Y ∈ {0, 1} and (X, Y ) ∈ {0, 1} 2 are available, respectively, then we can construct the quantities: In this setting, maximizing the posterior probability of X ⊥ ⊥ Y given examples x n , y n w.r.t. the prior probability p is equivalent to deciding X ⊥ ⊥ Y if and only if: The decision based on (1) is strongly consistent, i.e., it is correct with probability one as n → ∞ [1] (see Section 3.1 for the proof).We say that a model selection procedure satisfies weak consistency if the probability of choosing the correct model goes to unity as n grows (probability convergence) and that it satisfies strong consistency if the probability one is assigned to the set of infinite example sequences that choose the correct model, except for at most finite times (almost sure convergence).In general, strong consistency implies weak consistency, but the converse is not true [2].In any model selection, in particular for large n, the correct answer is required.If continuous variables are present, the BN structure learning is not easy, and strong consistency is hard to obtain.
The same scenario is applied to the case in which X and Y take values from finite sets A and B rather than {0, 1}.
Next, suppose that we wish to know the factorization of three random binary variables X, Y, Z: P (X)P (Y )P (Z), P (X)P (Y, Z), P (Y )P (Z, X), P (Z)P (X, Y ), P (X, Y )P (X, Z) P (X) and know the prior probabilities p 1 , • • • , p 11 over the eleven factorizations, then it would be reasonable to choose the one that maximizes: to maximize the posterior probability of the factorization given For example, between the last two distributions, we choose the last if and only if: In fact, for example, we can check that the factorizations: The method that maximizes the posterior probability is strongly consistent [1] (see Section 3.1 for the proof), and a scenario with two and three variables as above can be extended to cases with N variables in a straightforward manner, if the variables are discrete.
In this paper, we consider the case when continuous variables are present.The idea is to construct measures g n X (x n ), g n Y (y n ) and g n XY (x n , y n ) over X n , Y n and X n × Y n for continuous ranges X and Y to make the decision whether X ⊥ ⊥ Y based on: The main problem is whether the decision is strongly consistent.Many authors have attempted to address continuous variables.For example, Nir Friedman [3] experimentally demonstrated the construction of a genetic network based on expression data using the E-Malgorithm.However, the variables were assumed to be linearly related and included Gaussian noise, and the dataset was not sufficiently fit to the model.
Imoto et al. [4] improved the model such that the relation is expressed by B-spline curves rather than lines.However, all of the authors, including Friedman and Imoto, failed to maximize the posterior probability, and thus, the decision is not consistent.This paper proves that the decision based on (2) and its extension for general N ≥ 2 is strongly consistent.
In any Bayesian approach of BN structure learning, whether continuous variables are present or not, the procedure consists of two stages: (1) Compute the local scores for the nonempty subsets of {X (1) , • • • , X (N ) }; for example, if N = 3, the seven quantities (2) Find a BN structure that maximizes the global scores among the M (N )(≤ 3 N ) candidate BN structures; there are at most 3 N DAGs in the case of N variables; for example, if N = 3, the eleven quantities are computed and a structure with the largest is chosen.
Note that the second stage does not care about whether each variable is continuous or not.In this paper, we mainly discuss about the performance of the first stage.The number of local scores to be computed can be saved, although it is generally exponential with N .We consider the problem in Section 3.3.
On the other hand, Zhang, Peters, Janzing and Scholkopf [5] proposed a BN structure learning method using conditional independence (CI) tests based on kernel statistics.However, for the CI test that is close to the Hilbert-Schmidt information criterion (HSIC), it is very hard to simulate the null distribution.They only proposed to approximate it by a Gamma distribution, but no consistency, is obtained because the threshold of the statistical test is not correct in practice.Furthermore, for the independence test approach, it often results in conflicting assertions of independence for finite samples.In particular, for small samples, the obtained DAG sometimes contain a directed loop.The Bayesian approach we consider in this paper does not suffer from the inconvenience, because we seek a structure that maximizes the global score [6].
Another contribution of this paper is identifying the border between consistency and non-consistency in learning Bayesian networks.For discrete X , maximizing Q n X (x n ) is equivalent to minimizing the description length [1]: where ) and α is the cardinality of set X .The problem at hand is whether the log n term is the minimum function of n for ensuring strong consistency.If log n is replaced by two (AIC), we cannot obtain consistency.We prove that 2(1 + ) log log n with > 0 is the minimum for strong consistency based on the law of iterated logarithms.The same property is known as the Hannan-Quinn principle [7], and similar results have been obtained for autoregression, linear regression [8] and classification [9], among others.The derivation in this paper does not depend on these previous results.The Hannan-Quinn principle will also be applied to continuous variables.This paper is organized as follows.Section 2.1 introduces the general concept of learning Bayesian network structures based on maximizing the posterior probability, and Section 2.2 discusses the concept of density functions developed by Boris Ryabko [10] and extended by Suzuki [11].Section 3 presents our contributions: Section 3.1 proves the Hannan-Quinn property in the current problem, and Section 3.2 proves consistency when continuous variables are present.Section 4 concludes the paper by summarizing the results and states the paper's significance in the field of model selection.

Learning the Bayesian Structure for Discrete Variables and Its Consistency
We choose w X , such that w X (θ)dθ = 1 and 0 ≤ θ(x) ≤ 1 by w X (θ) ∝ x∈X θ(x) −1/2 , where X is the set from which X takes its values.Let α = |X |, and let c i (x) be the frequency of x ∈ X in It is known that the following quantities satisfies (3) [12]: where Γ is the Gamma function, and Stirling's formula )} has been applied.Thus, for x ∈ X , from the law of large numbers, c n (x)/n converges to P (X = x) with probability one as n → ∞, such that: with probability one as n → ∞.Moreover, from the law of large numbers, with probability one as n → ∞, (Shannon-McMillan-Breiman [13]).This proves that there exists a Q n X (universal measure), such that for any probability P over the finite set X , with probability one as n → ∞, where we write P n (x n ) := P (X n = x n ).The same property holds for: and: where Thus, we have: with probability one as n → ∞.However, X ⊥ ⊥ Y if and only if I(X, Y ) = 0. Hence, if X ⊥ ⊥ Y , the value of J n (x n , y n ) is positive with probability one as n → ∞.However, how can we detect X ⊥ ⊥ Y when X ⊥ ⊥ Y ?J n (x n , y n ) cannot be exactly zero with probability one as n → ∞.However, when X and Y are discrete, the estimation based on J n (x n , y n ) is consistent: if X ⊥ ⊥ Y , the value of J n (x n , y n ) is not greater than zero with probability one as n → ∞.For example, the decision based on (1) is strongly consistent because the values of 1 n log p and 1 n log(1 − p) are negligible for large n, and asymptotically, (1) is equivalent to J n (x n , y n , z n ) ≤ 0.
In Section 3.1, we provide a stronger result of consistency and a more intuitive and elegant proof.
In general, if N variables exist (N ≥ 2), we must consider two cases: D(P * ||P ) > 0 and D(P * ||P ) = 0, where P * and P are the probabilities based on the correct and estimated factorizations and D(P * ||P ) denotes the Kullback-Leibler divergence between P * and P .If N = 2, then: The same property holds for three variables X, Y, Z (N = 3): with probability one as n → ∞, and X ⊥ ⊥ Y |Z if and only if I(X, Y, Z) = 0.Then, we can show J n (x n , y n , z n ) ≤ 0 if and only if I(X, Y, Z) = 0, with probability one as n → ∞ (see Section 3.1).For example, between the seventh and eleventh factorizations, if J n (x n , y n , z n ) ≤ 0 and J n (x n , y n , z n ) > 0, then we choose the seventh and eleventh, respectively.In fact, for large n, because 1 n log p 7 p 11 diminishes.Then, the decision is correct with probability one as n → ∞.Similarly, we calculate: and: where γ = |Z|.In general, for N variables, given P and P * , we have all of the CI statements for each of them, and D(P * ||P ) = 0 if and only if the CI statements in P imply those in P * ; in other words, P induces an I-map, which is not necessarily minimal.Note that for any subsets a, b, c of {1, • • • , N }, we can construct the estimation J n (x n , y n , z n ), with X = {X (i) } i∈a , Y = {Y (j) } j∈b , Z = {X (k) } k∈c , and obtain consistency, i.e., we will have the correct CI statements, where c may be empty.
Table 1 depicts whether D(P * ||P ) > 0 or D(P * ||P ) = 0 for each P * and P .For example, if the factorizations of P * and P are the fourth and sixth, then D(P * ||P ) = 0 from the table.In general, D(P * ||P ) = 0 if and only if P * is realized using the factorization and an appropriate parameter set for P .

Universal Measures for Continuous Variables
In this section, we primarily address continuous variables.
Let {A j } be such that A 0 = {X }, and let A j+1 be a refinement of A j .For example, suppose that the random variable X takes values in X = [0, 1], and we generate a sequence as follows: For each j, we quantize each x ∈ [0, 1] into the a ∈ A j , such that x ∈ a.For example, for j = 2, Let λ be the Lebesgue measure (width of the interval).For example, λ([ 1 4 , 1 2 )) = 1 4 and λ({ 1 2 }) = 0. Note that each A j is a finite set.Therefore, we can construct a universal measure Q n j w.r.t. a finite set A j for each j.Given for each j and use it to compute the quantity: for each j.If we prepare a sequence of positive reals w 1 , w 2 , • • • , such that j w j = 1 and w j > 0, we can compute the quantity: Moreover, let f X be the true density function and f j (x) := P (X ∈ a)/λ(a) for a ∈ A j and j = 1, 2, • • • if x ∈ a.We may consider f j to be an approximated density function assuming the quantization sequence {A j } (Figure 2).For the given x n , we define n ) Thus, we have the following proposition, which is a continuous version of the universality (4) that was proven in Section 2.1.

Proposition 1 ([10]). For any density function
as n → ∞ with probability one, where D(f X ||f j ) is the Kullback-Leibler divergence between f X and f j .
The same concept is applied to the case where no density function exists [11] in the usual sense (w.r.t. the Lebesgue measure λ).For example, suppose that we wish to estimate a distribution over the positive integers N. Apparently, N is not a finite set and has no density function.We consider the quantization sequence {B k }: Let η be a measure, such that: The measure η(a) for closed interval a gives: if k min and k max are the minimum and maximum integers in a, and evaluates each bin width in a nonstandard way.For example, η({2}) = 1  6 and η({3, 4}) = 2 15 .For multiple variables, we compute the measure by: ) .
Note that each B k is a finite set, and we construct a universal measure k for each k, such that we can compute the quantity: with y ∈ N may take any arbitrary value) is considered to be a generalized density function (w.r.t. the measure η).
In general, if η(D) = 0 implies P (Y ∈ D) = 0 for the Borel sets (the Borel sets w.r.t.R being the set consisting of the sets generated via a countable number of union, intersection and set difference from the closed intervals of R [2]), we state that P is absolutely continuous w.r.t.η and that there exists a density function w.r.t.η (Radon-Nikodym [2]).
The following proposition addresses generalized densities and eliminates the condition
Proposition 1 assumes a specific quantization sequence, such as {A n }.The universality holds for the densities that satisfy D(f X ||f k ) → ∞ as k → ∞ [10].However, in the proof of Proposition 2, a universal quantization, such that D(f X ||f k ) → 0 as k → ∞ for any density f X , was constructed [11].

The Hannan and Quinn Principle
We know that

2
log n with probability one as n → ∞ when X ⊥ ⊥ Y because the decision based on ( 1) is strongly consistent.
In this section, we prove a stronger result: let: We show that the quantity with probability one as n → ∞ for any > 0.
In order to show the claim, we approximate are mutually independent random variables with mean zero and variance σ 2 i,j , such that: Then, from the law of iterated logarithms below (Lemma 1) [2], it will be proven that r 2 i,j is almost surely upper-bounded by 2(1 + )σ 2 i,j log log n for any > 0 and each z ∈ Z, which implies Theorem 1 because: (see the Appendix for the details of the derivation).

Lemma 1 ([2]
).Let {U k } n k=1 be random variables that obey an identical distribution with zero mean and unit variance, and S n := n k=1 U k .Then, with probability one, Theorem 1 implies the strong consistency of the decision based on (1).However, a stronger statement can be obtained: where y ∈ N (f XY takes arbitrary values for x ∈ [0, 1] and y ∈ N), where F X (•|y) is the conditional distribution function of X given Y = y.
In general, we have the following result: Proposition 3.For any generalized density function f : as n → ∞ with probability one.
The measures g n X (x n ) and g n XY (x n , y n ) are computed using (A) and (B) of Algorithm 1, where the value of K is the number of quantizations, and ĝn X (x n ) and ĝn XY (x n , y n ) denote the approximated scores using finite quantization of level K.
Propositions 1-3 are obtained for large K.However, we can prepare only a finite number of quantizations.Furthermore, if n is small, then the number of examples that each bin contains is small, and we cannot estimate the histogram well.Therefore, given n, K must be moderately sized, and we recommend to set K = 1 m log n because the number of examples contained in a bin decreases exponentially with increasing depth, where m is the number of variables in the local score.For example, m = 1 and m = 2 for (A) and (B), respectively.Algorithm 1 (A)(B) of do not guarantee anything for the theoretical property assured in Proposition 3 and Theorems 3-5 for finite K, however, as K grows, consistency holds.In Step 3(a) of Algorithm 1(A)(B), we calculate a and not from x i , which means that the computational time required to obtain (a For the memory requirements, we require exponential orders of K.However, because we set K = 1 m log n, the computational time and memory requirements are at most O(n log n) and O(n) for Algorithm 1(A)(B).
Based on the same notion, we can construct and Propositions 2 and 3 hold for three variables.Theorem 3.With probability one as n → ∞: Proof.From Propositions 2 and 3 for two and three variables and the law of large numbers, we have: with probability one, which completes the proof.
From the discussion in Section 2.1, even when more than two variables are present, if D(P * ||P ) > 0, we can choose P * rather than P with probability one as n → ∞.Now, we prove that the continuous counterpart of the decision based on (1) is strongly consistent: Theorem 4. With probability one as n → ∞: where p is the prior probability of X ⊥ ⊥ Y |Z.
Proof: Suppose that X ⊥ ⊥ Y |Z.Then, the conditional mutual information between X and Y given Z is positive, and from Theorem 3, the estimator converges to a positive value with probability one as n → ∞; thus, The discrete variables X and Y are conditionally independent given Z if and only if: and the parent sets: For each fixed pair (X, V ), maximizing the local score 1 n log g W +{X} g W and maximizing the CF score V − {X} and W + {X}, given W are equivalent.In other words, On the other hand, from [15,16], we know that the relationship between the complexity term and the likelihood term gives tight bounds on the maximum number of parents in the optimal BN for any given dataset.In particular, the number of elements in each parent set P a(X, V ) is at most O(log n) for X ∈ V and V ⊆ U .Hence, the number for computing the CF scores is much less than exponential with N .

Concluding Remarks
In this paper, we considered the problem of learning a Bayesian network structure from examples and provided two contributions.
First, we found that the log n terms in the penalty terms of the description length can be replaced by 2(1 + ) log log n to obtain strong consistency, where the derivation is based on the law of iterated logarithms.We claim that the Hannan and Quinn principle [7] is applicable to this problem.
Second, we constructed an extended version of the score function for finding a Bayesian network structure with the maximum posterior probability and proved that the decision is strongly consistent even when continuous variables are present.Thus far, consistency has been obtained only for discrete variables, and many authors have been seeking consistency when continuous variables are present.
Consistency has been proven in many model selection methods that maximize the posterior probability or, equivalently, minimize the description length [1].However, almost all such methods assume that the variables are either discrete or that the variables obey Gaussian distributions.This paper proposed an extended version of the MDL/Bayesian principle without assuming such constraints and proved its strong consistency in a precise manner, which we believe provides a substantial contribution to the statistics and machine learning communities.
, and c n (y) and c n (x, y) are the numbers of occurrences of y ∈ Y and (x, y) ∈ X × Y in y

I
n (x n , y n , z n ) = x∈X y∈Y z∈Z c n (x, y, z) log c n (x, y, z)c n (z) c n (x, z)c n (y, z) = x∈X y∈Y z∈Z c n (z)P (x|z)P (y|z) • c n (x, y, z) c n (z)P (x|z)P (y|z) log c n (x, y, z) c n (z)P (x|z)P (y|z)− x∈X y∈Y z∈Z c n (z)P (x|z) • c n (x, y, z) c n (z)P (x|z) log c n (x, z) c n (z)P (x|z) − x∈X y∈Y z∈Z c n (z)P (y|z) • c n (x, y, z) c n (z)P (y|z) log c n (y, z) c n (z)P (y|z) = z∈Z { x y c n (z)P (x|z)P (y|z) • c n (x, y, z) c n (z)P (x|z)P (y|z) log c n (x, y, z) c n (z)P (x|z)P (y|z)(11)− x c n (z)P (x|z) • c n (x, z) c n (z)P (x|z) log c n (x, z) c n (z)P (x|z)(12)− y c n (z)P (y|z) • c n (y, z) c n (z)P (y|z) log c n (y, z) c n (z)P (y|z) (13) } is approximated by z∈Z I(z) with: I(z) := x y {c n (x, y, z) − c n (z)P (x|z)P (y|z)} 2 2c n (z)P (x|z)P (y|z) − x {c n (x, z) − c n (z)P (x|z)} 2 2c n (z)P (x|z) − y {c n (y, z) − c n (z)P (y|z)} 2 2c n (z)P (y|z) t = c n (x, y, z) c n (z)P (x|z)P (y|z) − 1, c n (x, z) c n (z)P (x|z) − 1, c n (y, z) c n (z)P (y|z) − 1 where V = (V xy ) x∈X ,y∈Y with V xy = c n (x, y, z) − c n (z)P (x|z)P (y|z) 2c n (z)P (x|z)P (y|z) , and u and v are the column vectors [ P (x|z)] x∈X and [ P (y|z)] y∈Y , respectively.Hereafter, we arbitrarily fix z ∈ Z.Let U =