Entropy and the Kullback–Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation

Scutari, Marco

doi:10.3390/a17010024

Open AccessArticle

Entropy and the Kullback–Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation

by

Marco Scutari

Istituto Dalle Molle di Studi Sull’Intelligenza Artificiale (IDSIA), 6900 Lugano, Switzerland

Algorithms 2024, 17(1), 24; https://doi.org/10.3390/a17010024

Submission received: 29 November 2023 / Revised: 31 December 2023 / Accepted: 4 January 2024 / Published: 6 January 2024

(This article belongs to the Special Issue Bayesian Networks and Causal Reasoning)

Download

Browse Figures

Versions Notes

Abstract

Bayesian networks (BNs) are a foundational model in machine learning and causal inference. Their graphical structure can handle high-dimensional problems, divide them into a sparse collection of smaller ones, underlies Judea Pearl’s causality, and determines their explainability and interpretability. Despite their popularity, there are almost no resources in the literature on how to compute Shannon’s entropy and the Kullback–Leibler (KL) divergence for BNs under their most common distributional assumptions. In this paper, we provide computationally efficient algorithms for both by leveraging BNs’ graphical structure, and we illustrate them with a complete set of numerical examples. In the process, we show it is possible to reduce the computational complexity of KL from cubic to quadratic for Gaussian BNs.

Keywords:

Bayesian networks; Shannon entropy; Kullback–Leibler divergence

1. Introduction

Bayesian networks [1] (BNs) have played a central role in machine learning research since the early days of the field as expert systems [2,3], graphical models [4,5], dynamic and latent variables models [6], and as the foundation of causal discovery [7] and causal inference [8]. They have also found applications as diverse as comorbidities in clinical psychology [9], the genetics of COVID-19 [10], the Sustainable Development Goals of the United Nations [11], railway disruptions [12] and industry 4.0 [13].

Machine learning, however, has evolved to include a variety of other models and reformulated them into a very general information-theoretic framework. The central quantities of this framework are Shannon’s entropy and the Kullback–Leibler divergence. Learning models from data relies crucially on the former to measure the amount of information captured by the model (or its complement, the amount of information lost in the residuals) and on the latter as the loss function we want to minimise. For instance, we can construct variational inference [14], the Expectation-Maximisation algorithm [15], Expectation Propagation [16] and various dimensionality reduction approaches such as t-SNE [17] and UMAP [18] using only these two quantities. We can also reformulate classical maximum-likelihood and Bayesian approaches to the same effect, from logistic regression to kernel methods to boosting [19,20].

Therefore, the lack of literature on how to compute the entropy of a BN and the Kullback–Leibler divergence between two BNs is surprising. While both are mentioned in Koller and Friedman [5] and discussed at a theoretical level in Moral et al. [21] for discrete BNs, no resources are available on any other type of BN. Furthermore, no numerical examples of how to compute them are available even for discrete BNs. We fill this gap in the literature by:

Deriving efficient formulations of Shannon’s entropy and the Kullback–Leibler divergence for Gaussian BNs and conditional linear Gaussian BNs.
Exploring the computational complexity of both for all common types of BNs.
Providing step-by-step numeric examples for all computations and all common types of BNs.

Our aim is to make apparent how both quantities are computed in their closed-form exact expressions and what is the associated computational cost.

The common alternative is to estimate both Shannon’s entropy and the Kullback–Leibler divergence empirically using Monte Carlo sampling. Admittedly, this approach is simple to implement for all types of BNs. However, it has two crucial drawbacks:

Using asymptotic estimates voids the theoretical properties of many machine learning algorithms: Expectation-Maximisation is not guaranteed to converge [5], for instance.
The number of samples required to estimate the Kullback–Leibler divergence accurately on the tails of the global distribution of both BNs is also an issue [22], especially when we need to evaluate it repeatedly as part of some machine learning algorithm. The same is true, although to a lesser extent, for Shannon’s entropy as well. In general, the rate of convergence to the true posterior in Monte Carlo particle filters is proportional to the number of variables squared [23].

Therefore, efficiently computing the exact value of Shannon’s entropy and the Kullback–Leibler divergence is a valuable research endeavour with a practical impact on BN use in machine learning. To help its development, we implemented the methods proposed in the paper in our bnlearn R package [24].

The remainder of the paper is structured as follows. In Section 2, we provide the basic definitions, properties and notation of BNs. In Section 3, we revisit the most common distributional assumptions in the BN literature: discrete BNs (Section 3.1), Gaussian BNs (Section 3.2) and conditional linear Gaussian BNs (Section 3.3). We also briefly discuss exact and approximate inferences for these types of BNs in Section 3.4 to introduce some key concepts for later use. In Section 4, we discuss how we can compute Shannon’s entropy and the Kullback–Leibler divergence for each type of BN. We conclude the paper by summarising and discussing the relevance of these foundational results in Section 5. Appendix A summarises all the computational complexity results from earlier sections, and Appendix B contains additional examples we omitted from the main text for brevity.

2. Bayesian Networks

Bayesian networks (BNs) are a class of probabilistic graphical models defined over a set of random variables

X

= {X_{1}, \dots, X_{N}}

, each describing some quantity of interest, that are associated with the nodes of a directed acyclic graph (DAG)

G

. Arcs in

G

express direct dependence relationships between the variables in

X

, with graphical separation in

G

implying conditional independence in probability. As a result,

G

induces the factorisation

P (X ∣ G, Θ) = \prod_{i = 1}^{N} P (X_{i} ∣ Π_{X_{i}}, Θ_{X_{i}}),

(1)

in which the global distribution (of

X

, with parameters

Θ

) decomposes into one local distribution for each

X_{i}

(with parameters

Θ_{X_{i}}

,

⋃_{X} Θ_{X_{i}} = Θ

) conditional on its parents

Π_{X_{i}}

.

This factorisation is as effective at reducing the computational burden of working with BNs as the DAG underlying the BN is sparse, meaning that each node

X_{i}

has a small number of parents (

| Π_{X_{i}} | < c

, usually with

c \in [2, 5]

). For instance, learning BNs from data is only feasible in practice if this holds. The task of learning a BN

B = (G, Θ)

from a data set

D

containing n observations comprises two steps:

\underset{learning}{\underset{︸}{P (G, Θ ∣ D)}} = \underset{structure learning}{\underset{︸}{P (G ∣ D)}} \cdot \underset{parameter learning}{\underset{︸}{P (Θ ∣ G, D)}} .

If we assume that parameters in different local distributions are independent [25], we can perform parameter learning independently for each node. Each

X_{i} ∣ Π_{X_{i}}

will have a low-dimensional parameter space

Θ_{X_{i}}

, making parameter learning computationally efficient. On the other hand, structure learning is well known to be both NP-hard [26] and NP-complete [27], even under unrealistically favourable conditions such as the availability of an independence and inference oracle [28]. However, if

G

is sparse, heuristic learning algorithms have been shown to run in quadratic time [29]. Exact learning algorithms, which have optimality guarantees that heuristic algorithms lack, retain their exponential complexity but become feasible for small problems because sparsity allows for tight bounds on goodness-of-fit scores and the efficient pruning of the space of the DAGs [30,31,32].

3. Common Distributional Assumptions for Bayesian Networks

While there are many possible choices for the distribution of

X

in principle, the literature has focused on three cases.

3.1. Discrete BNs

Discrete BNs [25] assume that both

X

and the

X_{i}

are multinomial random variables (The literature sometimes denotes discrete BNs as “dBNs” or “DBNs”; we do not do that in this paper to avoid confusion with dynamic BNs, which are also commonly denoted as “dBNs”). Local distributions take the form

\begin{matrix} X_{i} ∣ Π_{X_{i}} \sim Mul (π_{i k ∣ j}), & π_{i k ∣ j} = P (X_{i} = k ∣ Π_{X_{i}} = j); \end{matrix}

their parameters are the conditional probabilities of

X_{i}

given each configuration of the values of its parents, usually represented as a conditional probability table (CPT) for each

X_{i}

. The

π_{i k ∣ j}

can be estimated from data via the sufficient statistic

{n_{i j k},

i = 1, \dots N;

j = 1, \dots, q_{i};

k = 1, \dots, r_{i}}

, the corresponding counts tallied from

{X_{i}, Π_{X_{i}}}

using maximum likelihood, Bayesian or shrinkage estimators as described in Koller and Friedman [5] and Hausser and Strimmer [33].

The global distribution takes the form of an N-dimensional probability table with one dimension for each variable. Assuming that each

X_{i}

takes at most l values, the table will contain

| Val (X) | = O (l^{N})

cells, where

Val (\cdot)

denotes the possible (configurations of the) values of its argument. As a result, it is impractical to use for medium and large BNs. Following standard practices from categorical data analysis [34], we can produce the CPT for each

X_{i}

from the global distribution by marginalising (that is, summing over) all the variables other than

{X_{i}, Π_{X_{i}}}

and then normalising over each configuration of

Π_{X_{i}}

. Conversely, we can compose the global distribution from the local distributions of the

X_{i}

by multiplying the appropriate set of conditional probabilities. The computational complexity of the composition is

O (N l^{N})

because applying (1) for each of the

l^{N}

cells yields

P (X = x) = \prod_{i = 1}^{N} P (X_{i} = x_{i} ∣ Π_{X_{i}} = x_{Π_{X_{i}}}),

which involves N multiplications. As for the decomposition, for each node, we:

Sum over $N - | Π_{X_{i}} | - 1$ variables to produce the joint probability table for ${X_{i}, Π_{X_{i}}}$ , which contains $O (l^{| Π_{X_{i}} | + 1})$ cells. The value of each cell is the sum of $O (l^{N - | Π_{X_{i}} | - 1})$ probabilities.
Normalise the columns of the joint probability table for ${X_{i}, Π_{X_{i}}}$ over each of the $O (l^{| Π_{X_{i}} |})$ configurations of values of $Π_{X_{i}}$ , which involves summing O(l) probabilities and dividing them by their total.

The resulting computational complexity is

\underset{marginalisation}{\underset{︸}{O (l^{| Π_{X_{i}} | + 1} \cdot l^{N - | Π_{X_{i}} | - 1})}} + \underset{normalisation}{\underset{︸}{O (l \cdot l^{| Π_{X_{i}} |})}} = O (l^{N} + l^{| Π_{X_{i}} | + 1})

(2)

for each node and

O (N l^{N} + l \sum_{i = 1}^{N} l^{| Π_{X_{i}} |})

for the whole BN.

Example 1

(Composing and decomposing a discrete BN). For reasons of space, this example is presented as Example A1 in Appendix B.

3.2. Gaussian BNs

Gaussian BNs [35] (GBNs) model

X

with a multivariate normal random variable N(

μ_{B}

,

Σ_{B})

and assume that the

X_{i}

are univariate normals linked by linear dependencies,

X_{i} ∣ Π_{X_{i}} \sim N (μ_{X_{i}} + Π_{X_{i}} β_{X_{i}}, σ_{X_{i}}^{2}),

(3)

which can be equivalently written as linear regression models of the form

\begin{matrix} X_{i} = μ_{X_{i}} + Π_{X_{i}} β_{X_{i}} + ε_{X_{i}}, & ε_{X_{i}} \sim N (0, σ_{X_{i}}^{2}) . \end{matrix}

(4)

The parameters in (3) and (4) are the regression coefficients

β_{X_{i}}

associated with the parents

Π_{X_{i}}

, an intercept term

μ_{X_{i}}

and the variance

σ_{X_{i}}^{2}

. They are usually estimated by maximum likelihood, but Bayesian and regularised estimators are available as well [1].

The link between the parameterisation of the global distribution of a GBN and that of its local distributions is detailed in Pourahmadi [36]. We summarise it here for later use.

Composing the global distribution. We can create an $N \times N$ lower triangular matrix $C_{B}$ from the regression coefficients in the local distributions such that $C_{B} C_{B}^{T}$ gives $Σ_{B}$ after rearranging rows and columns. In particular, we:
- Arrange the nodes of $B$ in the (partial) topological ordering induced by $G$ , denoted $X_{(i)}, i = 1, \dots, N$ .
- The ith row of $C_{B}$ (denoted $C_{B} [i; •$ ], i = 1, …, N) is associated with X_(i). We compute its elements from the parameters of X_(i) | $Π_{X_{i}}$ as
  $\begin{matrix} C_{B} [i; i] = \sqrt{σ_{X_{(i)}}^{2}} & and & C_{B} [i; • \end{matrix}] = β_{X_{i}} C_{B} [Π_{X_{(i)}}; •],$ where $C_{B} [Π_{X_{(i)}}; •$ ] are the rows of $C_{B}$ that correspond to the parents of X_(i). The rows of $C_{B}$ are filled following the topological ordering of the BN.
- Compute ${\sum^{\sim}}_{B} = C_{B} C_{B}^{T}$ .
- Rearrange the rows and columns of ${\sum^{\sim}}_{B}$ to obtain $\sum_{B}$ .
Intuitively, we construct $C_{B}$ by propagating the node variances along the paths in $G$ while combining them with the regression coefficients, which are functions of the correlations between adjacent nodes. As a result, $C_{B} C_{B}^{T}$ gives $Σ_{B}$ after rearranging the rows and columns to follow the original ordering of the nodes.
The elements of the mean vector $μ_{B}$ are similarly computed as $E (X_{(i)}) = Π_{X_{(i)}} β_{X_{i}}$ iterating over the variables in topological order.
Decomposing the global distribution. Conversely, we can derive the matrix $C_{B}$ from $Σ_{B}$ by reordering its rows and columns to follow the topological ordering of the variables in $G$ and computing its Cholesky decomposition. Then

$R = I_{N} - diag (C_{B}) C_{B}^{- 1},$

contains the regression coefficients $β_{X_{i}}$ in the elements corresponding to $X_{(i)}, Π_{X_{(i)}}$ (Here $diag (C_{B})$ is a diagonal matrix with the same diagonal elements as $C_{B}$ and $I_{N}$ is the identity matrix.) Finally, we compute the intercepts $μ_{X_{i}}$ as $μ_{B} - R μ_{B}$ by reversing the equations we used to construct $μ_{B}$ above.

The computational complexity of composing the global distribution is bound by the matrix multiplication

C_{B} C_{B}^{T}

, which is

O (N^{3})

; if we assume that

G

is sparse as in Scutari et al. [29], the number of arcs is bound by some

c N

, computing the

μ_{B}

takes O(N) operations. The complexity of decomposing the global distribution is also

O (N^{3})

because both inverting

C_{B}

and multiplying the result by

diag (C_{B})

are

O (N^{3})

.

Example 2

(Composing and decomposing a GBN). Consider the GBN

B

from Figure 1 top. The topological ordering of the variables defined by

B

is

{{X_{1}, X_{2}}, X_{4}, X_{3}}

, so

C_{B} = \begin{matrix} X_{1} \\ X_{2} \\ X_{4} \\ X_{3} \end{matrix} \begin{matrix} X_{1} & X_{2} & X_{4} & X_{3} \\ ( & \begin{matrix} 0.894 \\ 0 \\ 1.341 \\ 1.610 \end{matrix} & \begin{matrix} 0 \\ 0.774 \\ 2.014 \\ 2.416 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 1.049 \\ 1.258 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 0 \\ 0.948 \end{matrix} & ) \end{matrix}

where the diagonal elements are

\begin{matrix} C_{B} [X_{1}; X_{1}] = \sqrt{0.8}, & C_{B} [X_{2}; X_{2}] = \sqrt{0.6}, & C_{B} [X_{4}; X_{4}] = \sqrt{1.1}, & C_{B} [X_{3}; X_{3}] = \sqrt{0.9}; \end{matrix}

and the elements below the diagonal are taken from the corresponding cells of

C_{B} [X_{4}; •] = (\begin{matrix} 1.5 & 2.6 \end{matrix}) (\begin{matrix} 0.894 & 0 & 0 & 0 \\ 0 & 0.774 & 0 & 0 \end{matrix}),

C_{B} [X_{3}; •] = (1.2) (1.341 2.014 1.049 0) .

Computing

C_{B} C_{B}^{T}

gives

{\tilde{Σ}}_{B} = \begin{matrix} X_{1} \\ X_{2} \\ X_{4} \\ X_{3} \end{matrix} \begin{matrix} X_{1} & X_{2} & X_{4} & X_{3} \\ ( & \begin{matrix} 0.800 \\ 0 \\ 1.200 \\ 1.440 \end{matrix} & \begin{matrix} 0 \\ 0.600 \\ 1.560 \\ 1.872 \end{matrix} & \begin{matrix} 1.200 \\ 1.560 \\ 6.956 \\ 8.347 \end{matrix} & \begin{matrix} 1.440 \\ 1.872 \\ 8.347 \\ 10.916 \end{matrix} & ) \end{matrix}

and reordering the rows and columns of

{\tilde{Σ}}_{B}

gives

Σ_{B} = \begin{matrix} X_{1} \\ X_{2} \\ X_{3} \\ X_{4} \end{matrix} \begin{matrix} X_{1} & X_{2} & X_{3} & X_{4} \\ ( & \begin{matrix} 0.800 \\ 0 \\ 1.440 \\ 1.200 \end{matrix} & \begin{matrix} 0 \\ 0.600 \\ 1.872 \\ 1.560 \end{matrix} & \begin{matrix} 1.440 \\ 1.872 \\ 10.916 \\ 8.347 \end{matrix} & \begin{matrix} 1.200 \\ 1.560 \\ 8.347 \\ 6.956 \end{matrix} & ) \end{matrix}

The elements of the corresponding expectation vector

μ_{B}

are then

\begin{matrix} E (X_{1}) & = 2.400, \\ E (X_{2}) & = 1.800, \\ E (X_{4}) & = 0.2 + 1.5 E (X_{1}) + 2.6 E (X_{2}) = 8.480, \\ E (X_{3}) & = 2.1 + 1.2 E (X_{4}) = 12.276 . \end{matrix}

Starting from

Σ_{B}

, we can reorder its rows and columns to obtain

{\tilde{Σ}}_{B}

. The Cholesky decomposition of

{\tilde{Σ}}_{B}

is

C_{B}

. Then

\begin{matrix} σ_{X_{1}}^{2} = C_{B} {[X_{1}; X_{1}]}^{2} = 0.8, & σ_{X_{2}}^{2} = C_{B} {[X_{2}; X_{2}]}^{2} = 0.6, \\ σ_{X_{3}}^{2} = C_{B} {[X_{3}; X_{3}]}^{2} = 0.9, & σ_{X_{4}}^{2} = C_{B} {[X_{4}; X_{4}]}^{2} = 0.11 . \end{matrix}

The coefficients

β_{X_{i}}

of the local distributions are available from

\begin{matrix} R & = I_{N} - \underset{diag (C_{B})}{\underset{︸}{[\begin{matrix} 0.894 & 0 & 0 & 0 \\ 0 & 0.774 & 0 & 0 \\ 0 & 0 & 1.049 & 0 \\ 0 & 0 & 0 & 0.948 \end{matrix}]}} \underset{C_{B}^{- 1}}{\underset{︸}{[\begin{matrix} 1.118 & 0 & 0 & 0 \\ 0 & 1.291 & 0 & 0 \\ - 1.430 & - 2.479 & 0.953 & 0 \\ 0 & 0 & - 1.265 & 1.054 \end{matrix}]}} \\ = \begin{matrix} X_{1} \\ X_{2} \\ X_{4} \\ X_{3} \end{matrix} \begin{matrix} X_{1} & X_{2} & X_{4} & X_{3} \\ ( & \begin{matrix} 0 \\ 0 \\ 1.500 \\ 0 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 2.600 \\ 0 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 0 \\ 1.200 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 0 \\ 0 \end{matrix} & ) \end{matrix} \end{matrix}

where we can read

R_{X_{4}, X_{1}} = 1.5 = β_{X_{4}, X_{1}}

,

R_{X_{4}, X_{2}} = 2.6 = β_{X_{4}, X_{2}}

,

R_{X_{3}, X_{4}} = 1.2 = β_{X_{3}, X_{4}}

.

We can read the standard errors of

X_{1}

,

X_{2}

,

X_{3}

and

X_{4}

directly from the diagonal elements of

C_{B}

, and we can compute the intercepts from

μ_{B} - R μ_{B}

which amounts to

\begin{matrix} μ_{X_{1}} & = E (X_{1}) = 2.400, \\ μ_{X_{2}} & = E (X_{2}) = 1.800, \\ μ_{X_{4}} & = E (X_{4}) - E (X_{1}) β_{X_{4}, X_{1}} - E (X_{2}) β_{X_{4}, X_{2}} = 0.200, \\ μ_{X_{3}} & = E (X_{3}) - E (X_{4}) β_{X_{3}, X_{4}} = 2.100 . \end{matrix}

3.3. Conditional Linear Gaussian BNs

Finally, conditional linear Gaussian BNs [37] (CLGBNs) subsume discrete BNs and GBNs as particular cases by combining discrete and continuous random variables in a mixture model. If we denote the former with

X_{D}

and the latter with

X_{G}

, so that

X = X_{D} \cup X_{G}

, then:

Discrete $X_{i} \in X_{D}$ are only allowed to have discrete parents (denoted $Δ_{X_{i}}$ ), and are assumed to follow a multinomial distribution parameterised with CPTs. We can estimate their parameters in the same way as those in a discrete BN.
Continuous $X_{i} \in X_{G}$ are allowed to have both discrete and continuous parents (denoted $Γ_{X_{i}}$ , $Δ_{X_{i}} \cup Γ_{X_{i}} = Π_{X_{i}}$ ). Their local distributions are

$X_{i} ∣ Π_{X_{i}} \sim N (μ_{X_{i}, δ_{X_{i}}} + Γ_{X_{i}} β_{X_{i}, δ_{X_{i}}}, σ_{X_{i}, δ_{X_{i}}}^{2}),$

which is equivalent to a mixture of linear regressions against the continuous parents with one component for each configuration $δ_{X_{i}} \in Val (Δ_{X_{i}})$ of the discrete parents:

$\begin{matrix} X_{i} = μ_{X_{i}, δ_{X_{i}}} + Γ_{X_{i}} β_{X_{i}, δ_{X_{i}}} + ε_{X_{i}, δ_{X_{i}}}, & ε_{X_{i}, δ_{X_{i}}} \sim N (0, σ_{X_{i}, δ_{X_{i}}}^{2}) . \end{matrix}$

If $X_{i}$ has no discrete parents, the mixture reverts to a single linear regression like that in (4). The parameters of these local distributions are usually estimated by maximum likelihood like those in a GBN; we have used hierarchical regressions with random effects in our recent work [38] for this purpose as well. Bayesian and regularised estimators are also an option [5].

If the CLGBN comprises

| X_{D} | = M

discrete nodes and

| X_{G} | = N - M

continuous nodes, these distributional assumptions imply the partial topological ordering

\underset{discrete nodes}{\underset{︸}{\{X_{(1)}, \dots, X_{(M)}\}}}, \underset{continuous nodes}{\underset{︸}{\{X_{(M + 1)}, \dots, X_{(N)}\}}} .

(5)

The discrete nodes jointly follow a multinomial distribution, effectively forming a discrete BN. The continuous nodes jointly follow a multivariate normal distribution, parameterised as a GBN, for each configuration of the discrete nodes. Therefore, the global distribution is a Gaussian mixture in which the discrete nodes identify the components, and the continuous nodes determine their distribution. The practical link between the global and local distributions follows directly from Section 3.1 and Section 3.2.

Example 3

(Composing and decomposing a CLGBN). For reasons of space, this example is presented as Example A2 in Appendix B.

The complexity of composing and decomposing the global distribution is then

\underset{convert between CPTs and component probabilities}{\underset{︸}{O (M l^{M}) .}} + \underset{(de) compose the distinct component distributions}{\underset{︸}{O ({(N - M)}^{3} l^{Val (Δ^{})})}}

where

Δ^{} = ⋃_{X_{i} \in X_{G}} Δ_{X_{i}}

are the discrete parents of the continuous nodes.

3.4. Inference

For BNs, inference broadly denotes obtaining the conditional distribution of a subset of variables conditional on a second subset of variables. Following older terminology from expert systems [2], this is called formulating a query in which we ask the BN about the probability of an event of interest after observing some evidence. In conditional probability queries, the event of interest is the probability of one or more events in (or the whole distribution of) some variables of interest conditional on the values assumed by the evidence variables. In maximum a posteriori (“most probable explanation”) queries, we condition the values of the evidence variables to predict those of the event variables.

All inference computations on BNs are completely automated by exact and approximate algorithms, which we will briefly describe here. We refer the interested reader to the more detailed treatment in Castillo et al. [2] and Koller and Friedman [5].

Exact inference algorithms use local computations to compute the value of the query. The seminal works of Lauritzen and Spiegelhalter [39], Lauritzen and Wermuth [37] and Lauritzen and Jensen [40] describe how to transform a discrete BN or a (CL)GBN into a junction tree as a preliminary step before using belief propagation. Cowell [41] uses elimination trees for the same purpose in CLGBNs. (A junction tree is an undirected tree whose nodes are the cliques in the moral graph constructed from the BN and their intersections. A clique is the maximal subset of nodes such that every two nodes in the subset are adjacent).

Namasivayam et al. [42] give the computational complexity of constructing the junction tree from a discrete BN as

O (N w + w l^{w} N)

where w is the maximum number of nodes in a clique and, as before, l is the maximum number of values that a variable can take. We take the complexity of belief propagation to be

O (N w l^{w} + | Θ |)

, as stated in Lauritzen and Spiegelhalter [39] (“The global propagation is no worse than the initialisation [of the junction tree]”). This is confirmed by Pennock [43] and Namasivayam and Prasanna [44].

As for GBNs, we can also perform exact inference through their global distribution because the latter has only

O (N^{2} + N)

parameters. The computational complexity of this approach is

O (N^{3})

because of the cost of composing the global distribution, which we derived in Section 3.2. However, all the operations involved are linear, making it possible to leverage specialised hardware such as GPUs and TPUs to the best effect. Koller and Friedman [5] (Section 14.2.1) note that “inference in linear Gaussian networks is linear in the number of cliques, and at most cubic in the size of the largest clique” when using junction trees and belief propagation. Therefore, junction trees may be significantly faster for GBNs when

w ≪ N

. However, the correctness and convergence of belief propagation in GBNs require a set of sufficient conditions that have been studied comprehensively by Malioutov et al. [45]. Using the global distribution directly always produces correct results.

Approximate inference algorithms use Monte Carlo simulations to sample from the global distribution of

X

through the local distributions and estimate the answer queries by computing the appropriate summary statistics on the particles they generate. Therefore, they mirror the Monte Carlo and Markov chain Monte Carlo approaches in the literature: rejection sampling, importance sampling, and sequential Monte Carlo among others. Two state-of-the-art examples are the adaptive importance sampling (AIS-BN) scheme [46] and the evidence pre-propagation importance sampling (EPIS-BN) [47].

4. Shannon Entropy and Kullback–Leibler Divergence

The general definition of Shannon entropy for the probability distribution P of

X

is

H (P) = E_{P} (- log P (X)) = - \int_{Val (X)} P (x) log P (x) d x .

(6)

The Kullback–Leibler divergence between two distributions P and Q for the same random variables

X

is defined as

KL (P ∥ Q) = E_{P (X)} (- log \frac{P (X)}{Q (X)}) = - \int_{Val (X)} P (x) log \frac{P (x)}{Q (x)} d x .

(7)

They are linked as follows:

\underset{KL (P (X) ∥ Q (X))}{\underset{︸}{E_{P (X)} (- log \frac{P (X)}{Q (X)})}} = \underset{H (P (X))}{\underset{︸}{E_{P (X)} (- log P (X))}} + \underset{H (P (X), Q (X))}{\underset{︸}{E_{P (X)} (log Q (X))}}

(8)

where

H (P (X), Q (X))

is the cross-entropy between

P (X)

and

Q (X)

. For the many properties of these quantities, we refer the reader to Cover and Thomas [48] and Csiszár and Shields [49]. Their use and interpretation are covered in depth (and breadth!) in Murphy [19,20] for general machine learning and in Koller and Friedman [5] for BNs.

For a BN

B

encoding the probability distribution of

X

, (6) decomposes into

H (B) = \sum_{i = 1}^{N} H (X_{i} ∣ Π_{X_{i}}^{B})

where

Π_{X_{i}}^{B}

are the parents of

X_{i}

in

B

. While this decomposition looks similar to (1), we see that its terms are not necessarily orthogonal, unlike the local distributions.

As for (7), we cannot simply write

KL (B ∥ B^{'}) = \sum_{i = 1}^{N} KL (X_{i} ∣ Π_{X_{i}}^{B} ∥ X_{i} ∣ Π_{X_{i}}^{B^{'}})

because, in the general case, the nodes

X_{i}

have different parents in

B

and

B^{'}

. This issue impacts the complexity of computing Kullback–Leibler divergences in different ways depending on the type of BN.

4.1. Discrete BNs

For discrete BNs,

H (B)

does not decompose into orthogonal components. As pointed out in Koller and Friedman [5] (Section 8.4.12),

\begin{matrix} H (X_{i} ∣ Π_{X_{i}}^{B}) = \sum_{j = 1}^{q_{i}} P (Π_{X_{i}}^{B} = j) H (X_{i} ∣ Π_{X_{i}}^{B} = j) where \\ H (X_{i} ∣ Π_{X_{i}}^{B} = j) = - \sum_{k = 1}^{r_{i}} π_{i k ∣ j} (B) log π_{i k ∣ j} (B) . \end{matrix}

(9)

If we estimated the conditional probabilities

π_{i k ∣ j} (B)

from data, the

P (Π_{X_{i}}^{B} = j)

are already available as the normalising constants of the individual conditional distributions

{π_{i k ∣ j} (B), j = 1, \dots, q_{i}}

in the local distribution of

X_{i}

. In this case, the complexity of computing

H (X_{i} ∣ Π_{X_{i}}^{B})

is linear in the number of parameters:

O (| Θ |) = \sum_{i = 1}^{N} O (| Θ_{X_{i}} |)

.

In the general case, we need exact inference to compute the probabilities

P (Π_{X_{i}}^{B} = j)

. Fortunately, they can be readily extracted from the junction tree derived from

B

as follows:

Identify a clique containing both $X_{i}$ and $Π_{X_{i}}^{B}$ . Such a clique is guaranteed to exist by the family preservation property [5] (Definition 10.1).
Compute the marginal distribution of $Π_{X_{i}}^{B}$ by summing over the remaining variables in the clique.

Combining the computational complexity of constructing the junction tree from Section 3.4 and that of marginalisation, which is at most

O (l^{w - 1})

for each node as in (2), we have

\begin{matrix} \underset{create the junction tree}{\underset{︸}{O (N w + w l^{w} N)}} + \underset{compute the P (Π_{X_{i}}^{B} = j)}{\underset{︸}{O (N l^{w - 1})}} + \underset{compute H (B)}{\underset{︸}{O (| Θ |)}} = \\ O (N (w (1 + l^{w}) + l^{w - 1}) + | Θ |), \end{matrix}

which is exponential in the maximum clique size w. (The maximum clique size in a junction tree is proportional to the treewidth of the BN the junction tree is created from, which is also used in the literature to characterise computational complexity in BNs.) Interestingly, we do not need to perform belief propagation, so computing

H (B)

is more efficient than other inference tasks.

Example 4

(Entropy of a discrete BN). For reasons of space, this example is presented as Example A3 in Appendix B.

The Kullback–Leibler divergence has a similar issue, as noted in Koller and Friedman [5] (Section 8.4.2). The best and most complete explanation of how to compute it for discrete BNs is in Moral et al. [21]. After decomposing

KL (B ∥ B^{'})

following (8) to separate

H (B)

and

H (B, B^{'})

, Moral et al. [21] show that the latter takes the form

H (B, B^{'}) = \sum_{i = 1}^{N} \sum_{j \in Val (Π_{X_{i}}^{B^{'}})} [\sum_{k = 1}^{r_{i}} π_{i k j} (B) log π_{i k ∣ j} (B^{'})]

(10)

where:

$π_{i k j} (B) = P (X_{i} = k, Π_{X_{i}} (B^{'}) = j)$ is the probability assigned by $B$ to $X_{i} = k$ given that the variables that are parents of $X_{i}$ in $B^{'}$ take value j;
$π_{i k ∣ j} (B^{'}) = P (X_{i} = k ∣ Π_{X_{i}} (B^{'}) = j)$ is the $(k, j)$ element of the CPT of $X_{i}$ in $B^{'}$ .

In order to compute the

π_{i k j} (B)

, we need to transform

B

into its junction tree and use belief propagation to compute the joint distribution of

X_{i} \cup Π_{X_{i}}^{B^{'}}

. As a result,

H (B, B^{'})

does not decompose at all: each

π_{i k j} (B)

can potentially depend on the whole BN

B

.

Algorithmically, to compute

KL (B ∥ B^{'})

we:

Transform $B$ into its junction tree.
Compute the entropy $H (B)$ .
For each node $X_{i}$ :
(a)
Identify $Π_{X_{i}}^{B^{'}}$ , the parents of $X_{i}$ in $B^{'}$ .
(b)
Obtain the distribution of the variables ${X_{i}, Π_{X_{i}}^{B^{'}}}$ from the junction tree of $B$ , consisting of the probabilities $π_{i k j} (B)$ .
(c)
Read the $π_{i k ∣ j} (B^{'})$ from the local distribution of $X_{i}$ in $B^{'}$ .
Use the $π_{i k j} (B)$ and the $π_{i k ∣ j} (B^{'})$ to compute (10).

The computational complexity of this procedure is as follows:

\begin{matrix} \underset{create the junction tree of B and computing H (B)}{\underset{︸}{O (N (w (1 + l^{w}) + l^{w - 1}) + | Θ |)}} + \underset{produce the π_{i k j} (B)}{\underset{︸}{O (N l^{c} (N w l^{w} + | Θ |))}} + \underset{compute H (B, B^{'})}{\underset{︸}{O (| Θ |)}} = \\ O (N^{2} w l^{w + c} + N (w + w l^{w} + l^{w - 1}) + (N l^{c} + 2) | Θ |) . \end{matrix}

(11)

As noted in Moral et al. [21], computing the

π_{i k j} (B)

requires a separate run of belief propagation for each configuration of the

Π_{X_{i}}^{B^{'}}

, for a total of

\sum_{i = 1}^{N} l^{| Π_{X_{i}}^{B^{'}} |}

times. If we assume that the DAG underlying

B^{'}

is sparse, we have that

| Π_{X_{i}}^{B^{'}} | ⩽ c

and the overall complexity of this step becomes

O (N l^{c} \cdot (N w l^{w} + | Θ |))

, N times that listed in Section 3.4. The caching scheme devised by Moral et al. [21] is very effective in limiting the use of belief propagation, but it does not alter its exponential complexity.

Example 5

(KL between two discrete BNs). Consider the discrete BN

B

from Figure 2 top. Furthermore, consider the BN

B^{'}

from Figure 2 bottom. We constructed the global distribution of

B

in Example A1; we can similarly compose the global distribution of

B^{'}

, shown below.

X_{1} = a

X_{1} = b

X_{2} = c

X_{2} = d

X_{2} = c

X_{2} = d

X_{3}

X_{3}

X_{3}

X_{3}

X_{4}

e

f

X_{4}

e

f

X_{4}

e

f

X_{4}

e

f

g

0.013

0.033

g

0.022

0.054

g

0.016

0.144

g

0.016

0.139

h

0.072

0.062

h

0.029

0.025

h

0.013

0.040

h

0.079

0.243

Since both global distributions are limited in size, we can then compute the Kullback–Leibler divergence between

B

and

B^{'}

using (7).

\begin{matrix} KL (B ∥ B^{'}) = - 0.013 log 0.013 - 0.016 log 0.016 - 0.022 log 0.022 - 0.016 log 0.016 - \\ 0.072 log 0.072 - 0.013 log 0.013 - 0.029 log 0.029 - 0.079 log 0.079 - \\ 0.033 log 0.033 - 0.144 log 0.144 - 0.054 log 0.054 - 0.139 log 0.139 - \\ - 0.062 log 0.062 - 0.04 log 0.04 - 0.025 log 0.025 - 0.243 log 0.243 = 0.687 \end{matrix}

In the general case, when we cannot use the global distributions, we follow the approach described in Section 4.1. Firstly, we apply (8) to write

KL (B ∥ B^{'}) = H (B) - H (B, B^{'});

we have from Example A3 that

H (B) = 2.440

. As for the cross-entropy

H (B, B^{'})

, we apply (10):

1.

We identify the parents of each node in

B^{'}

:

\begin{matrix} Π_{X_{1}}^{B^{'}} = {⌀}, & Π_{X_{2}}^{B^{'}} = {X_{1}, X_{4}}, & Π_{X_{3}}^{B^{'}} = {X_{1}}, & Π_{X_{4}}^{B^{'}} = {X_{3}} . \end{matrix}

2.

We construct a junction tree from

B

and we use it to compute the distributions

P (X_{1})

,

P (X_{2}, X_{1}, X_{4})

,

P (X_{3}, X_{1})

and

P (X_{4}, X_{3})

.

$X_{1}$
a	b
$0.53$	$0.47$

		${X_{1}, X_{4}}$
		${a, g}$	${a, h}$	${b, g}$	${b, h}$
$X_{2}$	c	$0.070$	$0.110$	$0.053$	$0.107$
	d	$0.089$	$0.261$	$0.076$	$0.235$

		$X_{1}$
		a	b
$X_{3}$	e	$0.289$	$0.312$
	f	$0.241$	$0.158$

		$X_{3}$
		e	f
$X_{4}$	g	$0.120$	$0.167$
	h	$0.481$	$0.231$

3.

We compute the cross-entropy terms for the individual variables in

B

and

B^{'}

:

\begin{matrix} H (X_{1}^{B}, X_{1}^{B^{'}}) & = 0.53 log 0.31 + 0.47 log 0.69 = - 0.795; \\ H (X_{2}^{B}, X_{2}^{B^{'}}) & = 0.070 log 0.38 + 0.089 log 0.62 + 0.110 log 0.71 + 0.261 log 0.29 + \\ 0.053 log 0.51 + 0.076 log 0.49 + 0.107 log 0.14 + 0.235 log 0.86 \\ = - 0.807; \\ H (X_{3}^{B}, X_{3}^{B^{'}}) & = 0.289 log 0.44 + 0.241 log 0.56 + 0.312 log 0.18 + 0.158 log 0.82 \\ = - 0.943; \\ H (X_{4}^{B}, X_{4}^{B^{'}}) & = 0.120 log 0.26 + 0.481 log 0.74 + 0.167 log 0.50 + 0.231 log 0.50 \\ = - 0.582; \end{matrix}

which sum up to

H (B, B^{'}) = \sum_{i = 1}^{N} H (X_{i}^{B}, X_{i}^{B^{'}}) = - 3.127

.

4.

We compute

KL (B ∥ B^{'}) = 2.440 - 3.127 = 0.687

, which matches the value we previously computed from the global distributions.

4.2. Gaussian BNs

H (B)

decomposes along with the local distributions

X_{i} ∣ Π_{X_{i}}

in the case of GBNs: from (3), each

X_{i} ∣ Π_{X_{i}}

is a univariate normal with variance

σ_{X_{i}}^{2} (B)

and therefore

H (X_{i} ∣ Π_{X_{i}}^{B}) = \frac{1}{2} log (2 π σ_{X_{i}}^{2} (B)) + \frac{1}{2}

(12)

which has a computational complexity of O(1) for each node, O(N) overall. Equivalently, we can start from the global distribution of

B

from Section 3.2 and consider that

det (Σ) = det (C_{B^{'}} C_{B^{'}}^{T}) = det {(C_{B})}^{2} = {(\prod_{i = 1}^{N} C_{B} [i; i])}^{2} = \prod_{i = 1}^{N} σ_{X_{i}}^{2} (B)

(13)

because

C_{B}

is lower triangular. The (multivariate normal) entropy of

X

then becomes

\begin{matrix} H (B) = \frac{N}{2} + \frac{N}{2} log 2 π + \frac{1}{2} log det (Σ) = \frac{N}{2} + \frac{N}{2} log 2 π + \frac{1}{2} \sum_{i = 1}^{N} log σ_{X_{i}}^{2} (B) \\ = \sum_{i = 1}^{N} \frac{1}{2} + \frac{1}{2} log (2 π σ_{X_{i}}^{2} (B)) = \sum_{i = 1}^{N} H (X_{i} ∣ Π_{X_{i}}^{B}) \end{matrix}

in agreement with (12).

Example 6

(Entropy of a GBN). For reasons of space, this example is presented as Example A4 in Appendix B.

In the literature, the Kullback–Leibler divergence between two GBNs

B

and

B^{'}

is usually computed using the respective global distributions

N (μ_{B}, Σ_{B})

and

N (μ_{B^{'}}, Σ_{B^{'}})

[50,51,52]. The general expression is

KL (B ∥ B^{'}) = \frac{1}{2} [tr (Σ_{B^{'}}^{- 1} Σ_{B}) + {(μ_{B^{'}} - μ_{B})}^{T} Σ_{B^{'}}^{- 1} (μ_{B^{'}} - μ_{B}) - N + log \frac{det (Σ_{B^{'}})}{det (Σ_{B})}],

(14)

which has computational complexity

\begin{matrix} \underset{compute μ_{B}, μ_{B^{'}} Σ_{B}, Σ_{B^{'}}}{\underset{︸}{O (2 N^{3} + 2 N)}} + \underset{invert Σ_{B^{'}}}{\underset{︸}{O (N^{3})}} + \underset{multiply Σ_{B}^{- 1} and Σ_{B^{'}}}{\underset{︸}{O (N^{3})}} + \underset{trace of Σ_{B}^{- 1} Σ_{B^{'}}}{\underset{︸}{O (N)}} + \\ \underset{compute {(μ_{B^{'}} - μ_{B})}^{T} Σ_{B^{'}}^{- 1} (μ_{B^{'}} - μ_{B})}{\underset{︸}{O (N^{2} + 2 N)}} + \underset{determinant of Σ_{B^{'}}}{\underset{︸}{O (N^{3})}} + \underset{determinant of Σ_{B}}{\underset{︸}{O (N^{3})}} = \\ O (6 N^{3} + N^{2} + 5 N) . \end{matrix}

(15)

The spectral decomposition

Σ_{B^{'}} = U Λ_{B^{'}} U^{T}

gives the eigenvalues

diag (Λ_{B^{'}}) = {λ_{1} (B^{'}), \dots,

λ_{N} (B^{'})}

to compute

Σ_{B^{'}}^{- 1}

and

det (Σ_{B^{'}})

efficiently as illustrated in the example below. (Further computing the spectral decomposition of

Σ_{B}

to compute

det (Σ_{B})

from the eigenvalues

{λ_{1} (B), \dots, λ_{N} (B)}

does not improve complexity because it just replaces a single

O (N^{3})

operation with another one.) We thus somewhat improve the overall complexity of

KL (B ∥ B^{'})

to

O (5 N^{3} + N^{2} + 6 N)

.

Example 7

(General-case KL between two GBNs). Consider the GBN

B

Figure 1 top, which we know has global distribution

\begin{matrix} [\begin{matrix} X_{1} \\ X_{2} \\ X_{3} \\ X_{4} \end{matrix}] \sim N ([\begin{matrix} 2.400 \\ 1.800 \\ 12.276 \\ 8.848 \end{matrix}], [\begin{matrix} 0.800 & 0 & 1.440 & 1.200 \\ 0 & 0.600 & 1.872 & 1.560 \\ 1.440 & 1.872 & 10.916 & 8.347 \\ 1.200 & 1.560 & 8.347 & 6.956 \end{matrix}]) \end{matrix}

from Example 2. Furthermore, consider the GBN

B^{'}

from Figure 1 bottom, which has global distribution

\begin{matrix} [\begin{matrix} X_{1} \\ X_{2} \\ X_{3} \\ X_{4} \end{matrix}] \sim N ([\begin{matrix} 2.400 \\ 11.324 \\ 6.220 \\ 4.620 \end{matrix}], [\begin{matrix} 0.800 & 2.368 & 1.040 & 0.640 \\ 2.368 & 8.541 & 3.438 & 1.894 \\ 1.040 & 3.438 & 1.652 & 0.832 \\ 0.640 & 1.894 & 0.832 & 1.012 \end{matrix}]) . \end{matrix}

In order to compute

KL (B ∥ B^{'})

, we first invert

Σ_{B^{'}}

to obtain

Σ_{B^{'}}^{- 1} = [\begin{matrix} 9.945 & - 1.272 & - 2.806 & - 1.600 \\ - 1.272 & 0.909 & - 1.091 & 0 \\ - 2.806 & - 1.091 & 4.642 & 0 \\ - 1.600 & 0 & 0 & 2.000 \end{matrix}],

which we then multiply by

Σ_{B}

to compute the trace

tr (Σ_{B^{'}}^{- 1} Σ_{B}) = 57.087

. We also use

Σ_{B^{'}}^{- 1}

to compute

{(μ_{B^{'}} - μ_{B})}^{T} Σ_{B^{'}}^{- 1} (μ_{B^{'}} - μ_{B}) = 408.362

. Finally,

det (Σ_{B^{'}}) = 0.475

,

det (Σ_{B}) = 0.132

and therefore

KL (B ∥ B^{'}) = \frac{1}{2} [57.087 + 408.362 - 4 + log (\frac{0.475}{0.132})] = 230.0846 .

(16)

As an alternative, we can compute the spectral decompositions

Σ_{B} = U_{B} Λ_{B} U_{B}^{T}

and

Σ_{B^{'}} = U_{B^{'}} Λ_{B^{'}} U_{B^{'}}^{T}

as an intermediate step. Multiplying the sets of eigenvalues

\begin{matrix} Λ_{B} = diag ({18.058, 0.741, 0.379, 0.093}) & and & Λ_{B^{'}} = diag ({11.106, 0.574, 0.236, 0.087}) \end{matrix}

gives the corresponding determinants; and it allows us to easily compute

\begin{matrix} Σ_{B^{'}}^{- 1} = U_{B^{'}} Λ_{B^{'}}^{- 1} U_{B^{'}}^{T}, & where & Λ_{B^{'}}^{- 1} = diag (\{\frac{1}{11.106}, \frac{1}{0.574}, \frac{1}{0.236}, \frac{1}{0.087}\}) \end{matrix}

for use in both the quadratic form and in the trace.

However, computing

KL (B ∥ B^{'})

from the global distributions

N (μ_{B}, Σ_{B})

and

N (μ_{B^{'}}, Σ_{B^{'}})

disregards the fact that BNs are sparse models that can be characterised more compactly by

(μ_{B}, C_{B})

and

(μ_{B^{'}}, C_{B^{'}})

as shown in Section 3.2. In particular, we can revisit several operations that are in the high-order terms of (15):

Composing the global distribution from the local ones. We avoid computing $Σ_{B}$ and $Σ_{B^{'}}$ , thus reducing this step to $O (2 N)$ complexity.
Computing the trace $tr (Σ_{B^{'}}^{- 1} Σ_{B})$ . We can reduce the computation of the trace as follows.
- We can replace $Σ_{B}$ and $Σ_{B^{'}}$ in the trace with any reordered matrix [53] (Result 8.17): we choose to use ${\tilde{Σ}}_{B^{'}}$ and ${\tilde{Σ}}_{B}^{*}$ where ${\tilde{Σ}}_{B^{'}}$ is defined as before and ${\tilde{Σ}}_{B}^{*}$ is $Σ_{B}$ with the rows and columns reordered to match ${\tilde{Σ}}_{B^{'}}$ . Formally, this is equivalent to ${\tilde{Σ}}_{B}^{*} = P {\tilde{Σ}}_{B} P^{T}$ where P is a permutation matrix that imposes the desired node ordering: since both the rows and the columns are permuted in the same way, the diagonal elements of ${\tilde{Σ}}_{B}$ are the same as those of ${\tilde{Σ}}_{B}^{*}$ and the trace is unaffected.
- We have ${\tilde{Σ}}_{B^{'}} = C_{B^{'}} C_{B^{'}}^{T}$ .
- As for ${\tilde{Σ}}_{B}^{*}$ , we can write ${\tilde{Σ}}_{B}^{*} = P {\tilde{Σ}}_{B} P = (P C_{B}) {(P C_{B})}^{T} = C_{B}^{*} {(C_{B}^{*})}^{T}$ where $C_{B}^{*} = P C_{B}$ is the lower triangular matrix $C_{B}$ with the rows re-ordered to match ${\tilde{Σ}}_{B^{'}}$ . Note that $C_{B}^{*}$ is not lower triangular unless $G$ and $G^{'}$ have the same partial node ordering, which implies $P = I_{N}$ .
Therefore

$\begin{matrix} tr (Σ_{B^{'}}^{- 1} Σ_{B}) = tr ({(C_{B^{'}}^{- 1} C_{B}^{*})}^{T} (C_{B^{'}}^{- 1} C_{B}^{*})) = {∥ C_{B^{'}}^{- 1} C_{B}^{*} ∥}_{F}^{2} \end{matrix}$

(17)

where the last step rests on Seber [53] (Result 4.15). We can invert $C_{B^{'}}$ in $O (N^{2})$ time following Stewart [54] (Algorithm 2.3). Multiplying $C_{B^{'}}^{- 1}$ and $C_{B}^{*}$ is still $O (N^{3})$ . The Frobenius norm ${∥ \cdot ∥}_{F}$ is $O (N^{2})$ since it is the sum of the squared elements of $C_{B^{'}}^{- 1} C_{B}^{*}$ .
Computing the determinants $det (Σ_{B^{'}})$ and $det (Σ_{B})$ . From (13), each determinant can be computed in $O (N)$ .
Computing the quadratic term ${(μ_{B^{'}} - μ_{B})}^{T} Σ_{B^{'}}^{- 1} (μ_{B^{'}} - μ_{B})$ . Decomposing $Σ_{B^{'}}^{- 1}$ leads to

${(μ_{B^{'}} - μ_{B})}^{T} Σ_{B^{'}}^{- 1} (μ_{B^{'}} - μ_{B}) = {(C_{B^{'}}^{- 1} (μ_{B^{'}}^{*} - μ_{B}^{*}))}^{T} C_{B^{'}}^{- 1} (μ_{B^{'}}^{*} - μ_{B}^{*}),$

(18)

where $μ_{B^{'}}^{*}$ and $μ_{B}^{*}$ are the mean vectors re-ordered to match $C_{B^{'}}^{- 1}$ . The computational complexity is still $O (N^{2} + 2 N)$ because $C_{B^{'}}^{- 1}$ is available from previous computations.

Combining (17), (13) and (18), the expression in (14) becomes

\begin{matrix} KL (B ∥ B^{'}) = \\ \frac{1}{2} [∥ C_{B^{'}}^{- 1} C_{B}^{*} ∥_{F}^{2} + {(C_{B^{'}}^{- 1} (μ_{B^{'}}^{*} - μ_{B}^{*}))}^{T} C_{B^{'}}^{- 1} (μ_{B^{'}}^{*} - μ_{B}^{*}) - N + 2 log \frac{\prod_{i = 1}^{N} C_{B^{'}} [i; i]}{\prod_{i = 1}^{N} C_{B} [i; i]}] . \end{matrix}

(19)

The overall complexity of (19) KL is

\begin{matrix} \underset{compute μ_{B}, μ_{B^{'}} C_{B}, C_{B^{'}}}{\underset{︸}{O (2 N^{2} + 2 N)}} + \underset{compute ∥ C_{B^{'}}^{- 1} C_{B} ∥_{F}^{2}}{\underset{︸}{O (2 N^{2} + N^{3})}} + \underset{compute the quadratic form}{\underset{︸}{O (N^{2} + 2 N)}} + \\ \underset{compute det (Σ_{B}), det (Σ_{B^{'}})}{\underset{︸}{O (2 N)}} = O (N^{3} + 5 N^{2} + 6 N); \end{matrix}

(20)

while still cubic, the leading coefficient suggests that it should be about 5 times faster than the variant of (15) using the spectral decomposition.

Example 8

(Sparse KL between two GBNs). Consider again the two GBNs from Example 7. The corresponding matrices

\begin{matrix} C_{B} = \begin{matrix} X_{1} \\ X_{2} \\ X_{4} \\ X_{3} \end{matrix} \begin{matrix} X_{1} & X_{2} & X_{4} & X_{3} \\ ( & \begin{matrix} 0.894 \\ 0 \\ 1.341 \\ 1.610 \end{matrix} & \begin{matrix} 0 \\ 0.774 \\ 2.014 \\ 2.416 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 1.049 \\ 1.258 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 0 \\ 0.948 \end{matrix} & ) \end{matrix} & , C_{B'} = \begin{matrix} X_{1} \\ X_{3} \\ X_{4} \\ X_{2} \end{matrix} \begin{matrix} X_{1} & X_{3} & X_{4} & X_{2} \\ ( & \begin{matrix} 0.894 \\ 1.163 \\ 0.715 \\ 2.647 \end{matrix} & \begin{matrix} 0 \\ 0.548 \\ 0 \\ 0.657 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 0.707 \\ 0 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 0 \\ 1.049 \end{matrix} & ) \end{matrix} \end{matrix}

readily give the determinants of

Σ_{B}

and

Σ_{B^{'}}

following (13):

\begin{matrix} det (C_{B}) & = {(0.894 \cdot 0.774 \cdot 1.049 \cdot 0.948)}^{2} = 0.475, \\ det (C_{B^{'}}) & = {(0.894 \cdot 0.548 \cdot 0.707 \cdot 1.049)}^{2} = 0.132 . \end{matrix}

As for the Frobenius norm in (17), we first invert

C_{B^{'}}

to obtain

C_{B^{'}}^{- 1} = \begin{matrix} X_{1} \\ X_{3} \\ X_{4} \\ X_{2} \end{matrix} \begin{matrix} X_{1} & X_{3} & X_{4} & X_{2} \\ ( & \begin{matrix} 1.118 \\ −2.373 \\ −1.131 \\ −1.334 \end{matrix} & \begin{matrix} 0 \\ 1.825 \\ 0 \\ −1.144 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 1.414 \\ 0 \end{matrix} & \begin{matrix} 0 \\ 0 \\ 0 \\ 0.953 \end{matrix} & ) \end{matrix};

then we reorder the rows and columns of

C_{B}

to follow the same node ordering as

C_{B^{'}}

and compute

{∥(\begin{matrix} 1.118 & 0 & 0 & 0 \\ - 2.373 & 1.825 & 0 & 0 \\ - 1.131 & 0 & 1.414 & 0 \\ - 1.334 & - 1.144 & 0 & 0.953 \end{matrix}) (\begin{matrix} 0.894 & 0 & 0 & 0 \\ 1.610 & 0.948 & 1.258 & 2.416 \\ 1.341 & 0 & 1.049 & 2.014 \\ 0 & 0 & 0 & 0.774 \end{matrix})∥}_{F}^{2} = 57.087

which, as expected, matches the value of

tr (Σ_{B^{'}}^{- 1} Σ_{B})

we computed in Example 7. Finally,

C_{B^{'}}^{- 1} (μ_{B^{'}}^{*} - μ_{B}^{*})

in (18) is

(\begin{matrix} 1.118 & 0 & 0 & 0 \\ - 2.373 & 1.825 & 0 & 0 \\ - 1.131 & 0 & 1.414 & 0 \\ - 1.334 & - 1.144 & 0 & 0.953 \end{matrix}) [(\begin{matrix} 2.400 \\ 6.220 \\ 4.620 \\ 11.324 \end{matrix}) - (\begin{matrix} 2.400 \\ 12.1276 \\ 8.848 \\ 1.800 \end{matrix})] = (\begin{matrix} 0 \\ - 11.056 \\ - 5.459 \\ 16.010 \end{matrix}) .

The quadratic form is then equal to

408.362

, which matches the value of

{(μ_{B^{'}} - μ_{B})}^{T} Σ_{B^{'}}^{- 1} (μ_{B^{'}} - μ_{B})

in Example 7. As a result, the expression for

KL (B ∥ B^{'})

is the same as in (16).

We can further reduce the complexity (20) of (19) when an approximate value of KL is suitable for our purposes. The only term with cubic complexity is

tr (Σ_{B^{'}}^{- 1} Σ_{B}) = {∥ C_{B^{'}}^{- 1} C_{B}^{*} ∥}_{F}^{2}

: reducing it to quadratic complexity or lower will eliminate the leading term of (20), making it quadratic in complexity. One way to do this is to compute a lower and an upper bound for

tr (Σ_{B^{'}}^{- 1} Σ_{B})

, which can serve as an interval estimate, and take their geometric mean as an approximate point estimate.

A lower bound is given by Seber [53] (Result 10.39):

tr (Σ_{B^{'}}^{- 1} Σ_{B}) ⩾ log det (Σ_{B^{'}}^{- 1} Σ_{B}) + N = - log det (Σ_{B^{'}}) + log det (Σ_{B}) + N,

(21)

which conveniently reuses the values of

det (Σ_{B})

and

det (Σ_{B^{'}})

we have from (13). For an upper bound, Seber [53] (Result 10.59) combined with Seber [53] (Result 4.15) gives

tr (Σ_{B^{'}}^{- 1} Σ_{B}) ⩽ tr (Σ_{B^{'}}^{- 1}) tr (Σ_{B}) = tr ({(C_{B^{'}} C_{B^{'}}^{T})}^{- 1}) tr (C_{B} C_{B}^{T}) = ∥ C_{B^{'}}^{- 1} ∥_{F}^{2} {∥ C_{B} ∥}_{F}^{2},

(22)

a function of

C_{B}

and

C_{B^{'}}

that can be computed in

O (2 N^{2})

time. Note that, as far as the point estimate is concerned, we do not care about how wide the interval is: we only need its geometric mean to be an acceptable approximation of

tr (Σ_{B^{'}}^{- 1} Σ_{B})

.

Example 9

(Approximate KL). From Example 7, we have that

tr (Σ_{B^{'}}^{- 1} Σ_{B}) = 57.087

,

det (Σ_{B^{'}}) = 0.475

and

det (Σ_{B}) = 0.132

. The lower bound in (21) is then

- log det (Σ_{B^{'}}) + log det (Σ_{B}) + 4 = 5.281

and the upper bound in (22) is

∥ C_{B^{'}}^{- 1} ∥_{F}^{2} {∥ C_{B} ∥}_{F}^{2} = 17.496 \cdot 19.272 = 337.207 .

Their geometric mean is

42.199

, which can serve as an approximate value for

KL (B ∥ B^{'})

.

If we are comparing two GBNs whose parameters (but not necessarily network structures) have been learned from the same data, we can sometimes approximate

KL (B ∥ B^{'})

using the local distributions

X_{i} ∣ Π_{X_{i}}^{B}

and

X_{i} ∣ Π_{X_{i}}^{B^{'}}

directly. If

B

and

B^{'}

have compatible partial orderings, we can define a common total node ordering for both such that

\begin{matrix} KL (B ∥ B^{'}) & = KL (X_{(1)} ∣ {X_{(2)}, \dots, X_{(N)}} \dots X_{N} ∥ X_{(1)} ∣ {X_{(2)}, \dots, X_{(N)}} \dots X_{N}) \\ = KL (X_{(1)} ∣ Π_{X_{(1)}}^{B} \cdot \dots \cdot X_{(N)} ∣ Π_{X_{(N)}}^{B} ∥ X_{(1)} ∣ Π_{X_{(1)}}^{B^{'}} \cdot \dots \cdot X_{(N)} ∣ Π_{X_{(N)}}^{B^{'}}) . \end{matrix}

By “compatible partial orderings”, we mean two partial orderings that can be sorted into at least one shared total node ordering that is compatible with both. The product of the local distributions in the second step is obtained from the chain decomposition in the first step by considering the nodes in the conditioning other than the parents to have associated regression coefficients equal to zero. Then, following the derivations in Cavanaugh [55] for a general linear regression model, we can write the empirical approximation

KL (X_{i} ∣ Π_{X_{i}}^{B} ∥ X_{i} ∣ Π_{X_{i}}^{B^{'}}) \approx \frac{1}{2} (log \frac{{\hat{σ}}_{X_{i}}^{2} (B^{'})}{{\hat{σ}}_{X_{i}}^{2} (B)} + \frac{{\hat{σ}}_{X_{i}}^{2} (B)}{{\hat{σ}}_{X_{i}}^{2} (B^{'})} - 1) + \frac{1}{2 n} (\frac{∥ {\hat{x}}_{i} (B) - {\hat{x}}_{i} (B^{'}) ∥_{2}^{2}}{{\hat{σ}}_{X_{i}}^{2} (B^{'})})

(23)

where, following a similar notation to (4):

${\hat{μ}}_{X_{i}} (B)$ , ${\hat{β}}_{X_{i}} (B)$ , ${\hat{μ}}_{X_{i}} (B^{'})$ , ${\hat{β}}_{X_{i}} (B^{'})$ are the estimated intercepts and regression coefficients;
${\hat{x}}_{i} (B)$ and ${\hat{x}}_{i} (B^{'})$ are the $n \times 1$ vectors

${\hat{x}}_{i} (B) = {\hat{μ}}_{X_{i}} (B) + x [•; \prod_{X_{i}} (B)] {\hat{β}}_{X_{i}} (B), {\hat{x}}_{i} (B') = {\hat{μ}}_{X_{i}} (B^{'}) + x [•; \prod_{X_{i}} (B')] {\hat{β}}_{X_{i}} (B'),$

the fitted values computed from the data observed for $X_{i}$ , $Π_{X_{i}} (B)$ , $Π_{X_{i}} (B^{'})$ ;
$σ_{X_{i}}^{2} (B)$ and $σ_{X_{i}}^{2} (B^{'})$ are the residual variances in $B$ and $B^{'}$ .

We can compute the expression in (23) for each node in

\begin{matrix} \underset{compute {\hat{x}}_{i} (B) and {\hat{x}}_{i} (B^{'})}{\underset{︸}{O (n (| Π_{X_{i}} (B) | + | Π_{X_{i}} (B^{'}) | + 2))}} + \underset{compute the norm ∥ {\hat{x}}_{i} (B) - {\hat{x}}_{i} (B^{'}) ∥_{2}^{2}}{\underset{︸}{O (n)}} = \\ O (n (| Π_{X_{i}} (B) | + | Π_{X_{i}} (B^{'}) | + \frac{5}{2})), \end{matrix}

which is linear in the sample size if both

G

and

G^{'}

are sparse because

| Π_{X_{i}} (B) | ⩽ c

,

| Π_{X_{i}} (B^{'}) | ⩽ c

. In this case, the overall computational complexity simplifies to

O (n N (2 c + \frac{5}{2}))

. Furthermore, as we pointed out in Scutari et al. [29], the fitted values

{\hat{x}}_{i} (B)

,

{\hat{x}}_{i} (B^{'})

are computed as a by-product of parameter learning: if we consider them to be already available, the above computational complexity is reduced to just O(n) for a single node and

O (n N)

overall. We can also replace the fitted values

{\hat{x}}_{i} (B)

,

{\hat{x}}_{i} (B^{'})

in (23) with the corresponding residuals

{\hat{ε}}_{i} (B)

,

{\hat{ε}}_{i} (B^{'})

because

∥ {\hat{x}}_{i} (B) - {\hat{x}}_{i} (B^{'}) ∥_{2}^{2} = ∥ (x [•; X_{i}] - {\hat{x}}_{i} (B)) - (x [•; X_{i}] - {\hat{x}}_{i} (B^{'})) ∥_{2}^{2} = ∥ {\hat{ε}}_{i} (B) - {\hat{ε}}_{i} (B^{'}) ∥_{2}^{2}

if the latter are available but the former are not.

Example 10

(KL between GBNs with parameters estimated from data). For reasons of space, this example is presented as Example A5 in Appendix B.

4.3. Conditional Gaussian BNs

The entropy

H (B)

decomposes into a separate

H (X_{i} ∣ Π_{X_{i}}^{B})

for each node, of the form (9) for discrete nodes and (12) for continuous nodes with no discrete parents. For continuous nodes with both discrete and continuous parents,

H (X_{i} ∣ Π_{X_{i}}^{B}) = \frac{1}{2} \sum_{δ_{X_{i}} \in Val (Δ_{X_{i}})} π_{δ_{X_{i}}} log (2 π σ_{X_{i}, δ_{X_{i}}}^{2} (B)) + \frac{1}{2},

(24)

where

π_{δ_{X_{i}}}

represents the probability associated with the configuration

δ_{X_{i}}

of the discrete parents

Δ_{X_{i}}

. This last expression can be computed in

| Val (Δ_{X_{i}}) |

time for each node. Overall, the complexity of computing

H (B)

is

O (\sum_{X_{i} \in X_{D}} | Θ_{X_{i}} | + \sum_{X_{i} \in X_{G}} max \{1, | Val (Δ_{X_{i}}) |\}) .

where the max accounts for the fact that

| Val (Δ_{X_{i}}) | = 0

when

Δ_{X_{i}} = ⌀

but the computational complexity is O(1) for such nodes.

Example 11

(Entropy of a CLGBN). For reasons of space, this example is presented as Example A6 in Appendix B.

As for

KL (B ∥ B^{'})

, we could not find any literature illustrating how to compute it. The partition of the nodes in (5) implies that

KL (B ∥ B^{'}) = \underset{discrete nodes}{\underset{︸}{KL (X_{D}^{B} ∥ X_{D}^{B^{'}})}} + \underset{continuous nodes}{\underset{︸}{KL (X_{G}^{B} ∣ X_{D}^{B} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}})}} .

(25)

We can compute the first term following Section 4.1:

X_{D}^{B}

and

X_{D}^{B^{'}}

form two discrete BNs whose DAGs are the spanning subgraphs of

B

and

B^{'}

and whose local distributions are the corresponding ones in

B

and

B^{'}

, respectively. The second term decomposes into

KL (X_{G}^{B} ∣ X_{D}^{B} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}}) = \sum_{x_{D} \in Val (X_{D})} P (X_{D}^{B} = x_{D}) KL (X_{G}^{B} ∣ X_{D}^{B} = x_{D} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}} = x_{D})

(26)

similarly to (10) and (24). We can compute it using the multivariate normal distributions associated with the

X_{D}^{B} = x_{D}

and the

X_{D}^{B^{'}} = x_{D}

in the global distributions of

B

and

B^{'}

.

Example 12

(General-case KL between two CLGBNs). Consider the CLGBNs

B

from Figure 3 top, which we already used in Examples 3 and 11, and

B^{'}

from Figure 3 bottom. The variables

X_{D}^{B^{'}}

identify the following mixture components in the global distribution of

B^{'}

:

\begin{matrix} {a, c, e}, {b, c, e}, {a, d, e}, {b, d, e} & \mapsto {e}, \\ {a, c, f}, {b, c, f}, {a, d, f}, {b, d, f} & \mapsto {f} . \end{matrix}

Therefore,

B^{'}

only encodes two different multivariate normal distributions.

Firstly, we construct two discrete BNs using the subgraphs spanning

X_{D}^{B} = X_{D}^{B^{'}} = {X_{1}, X_{2}, X_{3}}

in

B

and

B^{'}

, which have arcs

{X_{1} \to X_{2}}

and

{X_{1} \to X_{2}, X_{2} \to X_{3}}

, respectively. The CPTs for

X_{1}

,

X_{2}

and

X_{3}

are the same as in

B

and in

B^{'}

. We then compute

KL (X_{D}^{B} ∥ X_{D}^{B^{'}}) = 0.577

following Example 5.

Secondly, we construct the multivariate normal distributions associated with the components of

B^{'}

following Example 3 (in which we computed those of

B

). For

{e}

, we have

\begin{matrix} [\begin{matrix} X_{4} \\ X_{5} \\ X_{6} \end{matrix}] \sim N ([\begin{matrix} 0.300 \\ 1.400 \\ 1.140 \end{matrix}], [\begin{matrix} 0.160 & 0.000 & 0.032 \\ 0.000 & 1.690 & 1.183 \\ 0.032 & 1.183 & 2.274 \end{matrix}]); \end{matrix}

for

{f}

, we have

\begin{matrix} [\begin{matrix} X_{4} \\ X_{5} \\ X_{6} \end{matrix}] \sim N ([\begin{matrix} 1.000 \\ 0.500 \\ 0.650 \end{matrix}], [\begin{matrix} 0.090 & 0.000 & 0.018 \\ 0.000 & 2.250 & 1.575 \\ 0.018 & 1.575 & 2.546 \end{matrix}]) . \end{matrix}

Then,

\begin{matrix} KL (X_{G}^{B} ∣ X_{D}^{B} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}}) \\ = \sum_{x_{1} \in {a, b}} \sum_{x_{2} \in {c, d}} \sum_{x_{3} \in {e, f}} P (X_{D}^{B} = {x_{1}, x_{2}, x_{3}}) \cdot \\ KL (X_{G}^{B} ∣ X_{D}^{B} = {x_{1}, x_{2}, x_{3}} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}} = {x_{1}, x_{2}, x_{3}}) \\ = \underset{{a, c, e}}{\underset{︸}{0.040 \times 1.721}} + \underset{{b, c, e}}{\underset{︸}{0.036 \times 1.721}} + \underset{{a, d, e}}{\underset{︸}{0.040 \times 2.504}} + \underset{{b, d, e}}{\underset{︸}{0.084 \times 2.504}} + \\ \underset{{a, c, f}}{\underset{︸}{0.16 \times 4.303}} + \underset{{b, c, f}}{\underset{︸}{0.144 \times 4.303}} + \underset{{a, d, f}}{\underset{︸}{0.16 \times 6.31}} + \underset{{b, d, f}}{\underset{︸}{0.336 \times 6.31}} \\ = 4.879 \end{matrix}

and

KL (B ∥ B^{'}) = KL (X_{D}^{B} ∥ X_{D}^{B^{'}}) + KL (X_{G}^{B} ∣ X_{D}^{B} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}}) = 0.577 + 4.879 = 5.456

.

The computational complexity of this basic approach to computing

KL (B ∥ B^{'})

is

\begin{matrix} \underset{compute KL (X_{D}^{B} ∥ X_{D}^{B^{'}})}{\underset{︸}{O (M w l^{w + c} + M (w + w l^{w} + l^{w - 1}) + (M l^{c} + 2) | Θ_{X_{D}} |)}} + \\ \underset{compute all the KL (X_{G}^{B} ∣ X_{D}^{B} = x_{D} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}} = x_{D})}{\underset{︸}{O (l^{M} \cdot (6 {(N - M)}^{3} + {(N - M)}^{2} + 5 (N - M)))}}, \end{matrix}

(27)

which we obtain by adapting (11) and (15) to follow the notation

| X_{D} | = M

and

| X_{G} | = N - M

we established in Section 3.3. The first term implicitly covers the cost of computing the

P (X_{D}^{B} = x_{D})

, which relies on exact inference like the computation of

KL (X_{D}^{B} ∥ X_{D}^{B^{'}})

. The second term is exponential in M, which would lead us to conclude that it is computationally unfeasible to compute

KL (B ∥ B^{'})

whenever we have more than a few discrete variables in

B

and

B^{'}

. Certainly, this would agree with Hershey and Olsen [22], who reviewed various scalable approximations of the KL divergence between two Gaussian mixtures.

However, we would again disregard the fact that BNs are sparse models. Two properties of CLGBNs that are apparent from Examples 3 and 12 allow us to compute (26) efficiently:

We can reduce $X_{G}^{B} ∣ X_{D}^{B}$ to $X_{G}^{B} ∣ Δ^{B}$ where $Δ^{B} = ⋃_{X_{i} \in X_{G}} Δ_{X_{i}}^{B} \subseteq X_{D}^{B}$ . In other words, the continuous nodes are conditionally independent on the discrete nodes that are not their parents ( $X_{D}^{B} ∖ Δ^{B}$ ) given their parents ( $Δ^{B}$ ). The same is true for $X_{G}^{B^{'}} ∣ X_{D}^{B^{'}}$ . The number of distinct terms in the summation in (26) is then given by $| Val (Δ^{B} \cup Δ^{B^{'}}) |$ which will be smaller than $| Val (X_{D}^{B}) |$ in sparse networks.
The conditional distributions $X_{G}^{B} ∣ X_{D}^{B} = δ$ and $X_{G}^{B^{'}} ∣ X_{D}^{B^{'}} = δ$ are multivariate normals (not mixtures). They are also faithful to the subgraphs spanning the continuous nodes $X_{G}$ , and we can represent them as GBNs whose parameters can be extracted directly from $B$ and $B^{'}$ . Therefore, we can use the results from Section 4.2 to compute their Kullback–Leibler divergences efficiently.

As a result, (26) simplifies to

\begin{matrix} KL (X_{G}^{B} ∣ X_{D}^{B} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}}) = \\ \sum_{δ \in Val (Δ^{B} \cup Δ^{B^{'}})} P ({Δ^{B} \cup Δ^{B^{'}}} = δ) KL (X_{G}^{B} ∣ {Δ^{B} \cup Δ^{B^{'}}} = δ ∥ X_{G}^{B^{'}} ∣ {Δ^{B} \cup Δ^{B^{'}}} = δ) . \end{matrix}

where

P ({Δ^{B} \cup Δ^{B^{'}}} = δ)

is the probability that the nodes

Δ^{B} \cup Δ^{B^{'}}

take value

δ

as computed in

B

. In turn, (27) reduces to

\begin{matrix} \underset{compute KL (X_{D}^{B} ∥ X_{D}^{B^{'}})}{\underset{︸}{O (M w l^{w + c} + M (w + w l^{w} + l^{w - 1}) + (M l^{c} + 2) | Θ_{X_{D}} |)}} + \\ \underset{compute all the KL (X_{G}^{B} ∣ {Δ^{B} \cup Δ^{B^{'}}} = δ ∥ X_{G}^{B^{'}} ∣ {Δ^{B} \cup Δ^{B^{'}}} = δ)}{\underset{︸}{O (l^{| Val (Δ^{B} \cup Δ^{B^{'}}) |} \cdot ({(N - M)}^{3} + 5 {(N - M)}^{2} + 6 (N - M)))}} . \end{matrix}

because we can replace

l^{M}

with

l^{| Val (Δ^{B} \cup Δ^{B^{'}}) |}

, which is an upper bound to the unique components in the mixture, and because we replace the complexity in (15) with that (20). We can also further reduce the second term to quadratic complexity as we discussed in Section 4.2. The remaining drivers of the computational complexity are:

the maximum clique size w in the subgraph spanning $X_{D}^{B}$ ;
the number of arcs from discrete nodes to continuous nodes in both $B$ and $B^{'}$ and the overlap between $Δ^{B}$ and $Δ^{B^{'}}$ .

Example 13

(Sparse KL between two CLGBNs). Consider again the CLGBNs

B

and

B^{'}

from Example 12. The node sets

Δ^{B} = {X_{2}, X_{3}}

and

Δ^{B^{'}} = {X_{3}}

identify four KL divergences to compute:

Val (Δ^{B} \cup Δ^{B^{'}}) = \{{c, e}, {c, f}, {d, e}, {d, f}\}

.

\begin{matrix} KL (X_{G}^{B} ∣ X_{D}^{B} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}}) = \\ P ({Δ^{B} \cup Δ^{B^{'}}} = {c, e}) KL (X_{G}^{B} ∣ {Δ^{B} \cup Δ^{B^{'}}} = {c, e} ∥ X_{G}^{B^{'}} ∣ {Δ^{B} \cup Δ^{B^{'}}} = {c, e}) + \\ P ({Δ^{B} \cup Δ^{B^{'}}} = {c, f}) KL (X_{G}^{B} ∣ {Δ^{B} \cup Δ^{B^{'}}} = {c, f} ∥ X_{G}^{B^{'}} ∣ {Δ^{B} \cup Δ^{B^{'}}} = {c, f}) + \\ P ({Δ^{B} \cup Δ^{B^{'}}} = {d, e}) KL (X_{G}^{B} ∣ {Δ^{B} \cup Δ^{B^{'}}} = {d, e} ∥ X_{G}^{B^{'}} ∣ {Δ^{B} \cup Δ^{B^{'}}} = {d, e}) + \\ P ({Δ^{B} \cup Δ^{B^{'}}} = {d, f}) KL (X_{G}^{B} ∣ {Δ^{B} \cup Δ^{B^{'}}} = {d, f} ∥ X_{G}^{B^{'}} ∣ {Δ^{B} \cup Δ^{B^{'}}} = {d, f}) \end{matrix}

All the BNs in the Kullback–Leibler divergences are GBNs whose structure and local distributions can be read from

B

and

B^{'}

. The four GBNs associated with

X_{G}^{B} ∣ {Δ^{B} \cup Δ^{B^{'}}}

have nodes

X_{G}^{B} = {X_{4}, X_{5}, X_{6}}

, arcs

{X_{5} \to X_{4}, X_{4} \to X_{6}}

and the local distributions listed in Figure 3. The corresponding GBNs associated with

X_{G}^{B^{'}} ∣ {Δ^{B} \cup Δ^{B^{'}}}

are, in fact, only two distinct GBNs associated with

{e}

and

{f}

. They have arcs

{X_{4} \to X_{6}, X_{5} \to X_{6}}

and local distributions: for

{e}

,

\begin{matrix} X_{4} = 0.3 + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.16), \\ X_{5} = 1.4 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 1.69), \\ X_{6} = 0.1 + 0.2 X_{4} + 0.7 X_{5} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1.44); \end{matrix}

for

{f}

,

\begin{matrix} X_{4} = 1.0 + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.09), \\ X_{5} = 0.5 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 2.25), \\ X_{6} = 0.1 + 0.2 X_{4} + 0.7 X_{5} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1.44) . \end{matrix}

Plugging in the numbers,

\begin{matrix} KL (X_{G}^{B} ∣ X_{D}^{B} ∥ X_{G}^{B^{'}} ∣ X_{D}^{B^{'}}) = \underset{{c, e}}{\underset{︸}{0.076 \times 1.721}} + \underset{{c, f}}{\underset{︸}{0.304 \times 4.303}} + \\ \underset{{d, e}}{\underset{︸}{0.124 \times 2.504}} + \underset{{d, f}}{\underset{︸}{0.496 \times 6.310}} = 4.879 \end{matrix}

which matches the value we computed in Example 12.

5. Conclusions

We started this paper by reviewing the three most common distributional assumptions for BNs: discrete BNs, Gaussian BNs (GBNs) and conditional linear Gaussian BNs (CLGBNs). Firstly, we reviewed the link between the respective global and local distributions, and we formalised the computational complexity of decomposing the former into the latter (and vice versa).

We then leveraged these results to study the complexity of computing Shannon’s entropy. We can, of course, compute the entropy of a BN from its global distribution using standard results from the literature. (In the case of discrete BNs and CLGBNS, only for small networks because

| Θ |

grows combinatorially.) However, this is not computationally efficient because we incur the cost of composing the global distribution. While the entropy does not decompose along with the local distributions for either discrete BNs or CLGBNS, we show that it is nevertheless efficient to compute it from them.

Computing the Kullback–Leibler divergence between two BNs following the little material found in the literature is more demanding. The discrete case has been thoroughly investigated by Moral et al. [21]. However, the literature typically relies on composing the global distributions for GBNs and CGBNs. Using the local distributions, thus leveraging the intrinsic sparsity of BNs, we showed how to compute the Kullback–Leibler divergence exactly with greater efficiency. For GBNs, we showed how to compute the Kullback–Leibler divergence approximately with quadratic complexity (instead of cubic). If the two GBNs have compatible node orderings and their parameters are estimated from the same data, we can also approximate their Kullback–Leibler divergence with complexity that scales with the number of parents of each node. All these results are summarised in Table A1 in Appendix A.

Finally, we provided step-by-step numeric examples of how to compute Shannon’s entropy and the Kullback–Leibler divergence for discrete BNs, GBNs and CLGBNs. (See also Appendix B). Considering this is a highly technical topic, and no such examples are available anywhere in the literature, we feel that they are helpful in demystifying this topic and in integrating BNs into many general machine learning approaches.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Computational Complexity Results

For ease of reference, we summarise here all the computational complexity results in this paper, including the type of BN and the page where they have been derived.

Table A1. Summary of all the computational complexity results in this paper, including the type of BN and the page where they have been derived.

Composing and decomposing the global distributions
$O (N l^{N} + l \sum_{i = 1}^{N} l^{\| Π_{X_{i}} \|})$	discrete BNs	Section 3.1
$O (N^{3})$	GBNs	Section 3.2
$O (M l^{M} + {(N - M)}^{3} l^{Δ^{}})$	CLGBNs	Section 3.3
Computing Shannon’s entropy
$O (N (w (1 + l^{w}) + l^{w - 1}) + \| Θ \|)$	discrete BNs	Section 4.1
O(N)	GBNs	Section 4.2
$O (\sum_{X_{i} \in X_{D}} \| Θ_{X_{i}} \| + \sum_{X_{i} \in X_{G}} max \{1, \| Val (Δ_{X_{i}}) \|\})$	CLGBNs	Section 4.3
Computing the Kullback–Leibler divergence
$O (N^{2} w l^{w + c} + N (w + w l^{w} + l^{w - 1}) + (N l^{c} + 2) \| Θ \|)$	discrete BNs	Section 4.1
$O (6 N^{3} + N^{2} + 5 N)$	GBNs	Section 4.2
$O (M w l^{w + c} + M (w + w l^{w} + l^{w - 1}) + (M l^{c} + 2) \| Θ_{X_{D}} \|) +$
$O (l^{M} \cdot (6 {(N - M)}^{3} + {(N - M)}^{2} + 5 (N - M)))$	CLGBNs	Section 4.3
Sparse Kullback–Leibler divergence
$O (N^{3} + 5 N^{2} + 6 N)$	GBNs	Section 4.2
$O (M w l^{w + c} + M (w + w l^{w} + l^{w - 1}) + (M l^{c} + 2) \| Θ_{X_{D}} \|) +$
$O (l^{\| Val (Δ^{B} \cup Δ^{B^{'}}) \|} \cdot ({(N - M)}^{3} + 5 {(N - M)}^{2} + 6 (N - M)))$	CLGBNs	Section 4.3
Approximate Kullback–Leibler divergence
$O (7 N^{2} + 6 N)$	GBNs	Section 4.2
Efficient empirical Kullback–Leibler divergence
$O (n N (2 c + \frac{5}{2}))$	GBNs	Section 4.2

Appendix B. Additional Examples

Example A1

(Composing and decomposing a discrete BN). Consider the discrete BN

B

shown in Figure 2 (top). Composing its global distribution entails computing the joint probabilities of all possible states of all variables,

{a, b} \times {c, d} \times {e, f} \times {g, h},

and arranging them in the following four-dimensional probability table in which each dimension is associated with one of the variables.

X_{1} = a

X_{1} = b

X_{2} = c

X_{2} = d

X_{2} = c

X_{2} = d

X_{3}

X_{3}

X_{3}

X_{3}

X_{4}

e

f

X_{4}

e

f

X_{4}

e

f

X_{4}

e

f

g

0.005

0.064

g

0.052

0.037

g

0.013

0.040

g

0.050

0.026

h

0.022

0.089

h

0.210

0.051

h

0.051

0.056

h

0.199

0.036

The joint probabilities are computed by multiplying the appropriate cells of the CPTs, for instance

\begin{matrix} P (X = {a, d, f, h}) = \\ P (X_{1} = a) P (X_{2} = d) P (X_{3} = f ∣ X_{1} = a, X_{2} = d) P (X_{4} = h ∣ X_{3} = f) = \\ 0.53 \cdot 0.66 \cdot 0.25 \cdot 0.58 = 0.051 . \end{matrix}

Conversely, we can decompose the global distribution into the local distributions by summing over all variables other than the nodes and their parents. For

X_{1}

, this means

\begin{matrix} P (X_{1} = a) & = \sum_{x_{2} \in {c, d}} \sum_{x_{3} \in {e, f}} \sum_{x_{4} \in {g, h}} P (X_{1} = a, X_{2} = x_{2}, X_{3} = x_{3}, X_{4} = x_{4}) \\ = 0.005 + 0.064 + 0.022 + 0.089 + 0.052 + 0.037 + 0.210 + 0.051 = 0.53, \\ P (X_{1} = b) & = \sum_{x_{2} \in {c, d}} \sum_{x_{3} \in {e, f}} \sum_{x_{4} \in {g, h}} P (X_{1} = b, X_{2} = x_{2}, X_{3} = x_{3}, X_{4} = x_{4}) \\ = 0.013 + 0.040 + 0.051 + 0.056 + 0.050 + 0.026 + 0.199 + 0.036 = 0.47 . \end{matrix}

Similarly, for

X_{2}

we obtain

\begin{matrix} P (X_{2} = c) & = \sum_{x_{1} \in {a, b}} \sum_{x_{3} \in {e, f}} \sum_{x_{4} \in {g, h}} P (X_{1} = x_{1}, X_{2} = c, X_{3} = x_{3}, X_{4} = x_{4}) \\ = 0.005 + 0.064 + 0.022 + 0.089 + 0.013 + 0.040 + 0.051 + 0.056 = 0.34, \\ P (X_{2} = d) & = \sum_{x_{1} \in {a, b}} \sum_{x_{3} \in {e, f}} \sum_{x_{4} \in {g, h}} P (X_{1} = x_{1}, X_{2} = d, X_{3} = x_{3}, X_{4} = x_{4}) \\ = 0.052 + 0.037 + 0.210 + 0.051 + 0.050 + 0.026 + 0.199 + 0.036 = 0.66 . \end{matrix}

For

X_{4}

, we first compute the joint distribution of

X_{4}

and

X_{3}

by marginalising over

X_{1}

and

X_{2}

,

\begin{matrix} \underset{{a, c}}{\underset{︸}{\begin{matrix} g \\ h \end{matrix} \begin{matrix} e & f \\ ( & \begin{matrix} 0.005 \\ 0.022 \end{matrix} & \begin{matrix} 0.064 \\ 0.089 \end{matrix} & ) \end{matrix}}} + \underset{{a, d}}{\underset{︸}{\begin{matrix} g \\ h \end{matrix} \begin{matrix} e & f \\ ( & \begin{matrix} 0.052 \\ 0.210 \end{matrix} & \begin{matrix} 0.037 \\ 0.051 \end{matrix} & ) \end{matrix}}} + \underset{{b, c}}{\underset{︸}{\begin{matrix} g \\ h \end{matrix} \begin{matrix} e & f \\ ( & \begin{matrix} 0.013 \\ 0.051 \end{matrix} & \begin{matrix} 0.040 \\ 0.056 \end{matrix} & ) \end{matrix}}} + \\ \underset{{b, d}}{\underset{︸}{\begin{matrix} g \\ h \end{matrix} \begin{matrix} e & f \\ ( & \begin{matrix} 0.050 \\ 0.199 \end{matrix} & \begin{matrix} 0.026 \\ 0.036 \end{matrix} & ) \end{matrix}}} = \begin{matrix} g \\ h \end{matrix} \begin{matrix} e & f \\ ( & \begin{matrix} 0.120 \\ 0.481 \end{matrix} & \begin{matrix} 0.167 \\ 0.232 \end{matrix} & ) \end{matrix}; \end{matrix}

from which we obtain the CPT for

X_{4} ∣ X_{3}

by normalising its columns.

As for

X_{3}

, we marginalise over

X_{2}

to obtain the joint distribution of

X_{3}

,

X_{1}

and

X_{2}

\begin{matrix} e \\ f \end{matrix} \begin{matrix} {a, c} & {a, d} & {b, c} & {b, d} \\ ( & \begin{matrix} 0.005 + 0.022 = 0.027 \\ 0.064 + 0.089 = 0.153 \end{matrix} & \begin{matrix} 0.052 + 0.210 = 0.262 \\ 0.037 + 0.051 = 0.087 \end{matrix} & \begin{matrix} 0.013 + 0.051 = 0.051 \\ 0.040 + 0.056 = 0.096 \end{matrix} & \begin{matrix} 0.050 + 0.199 = 0.248 \\ 0.026 + 0.036 = 0.062 \end{matrix} & ) \end{matrix}

and we obtain the CPT for

X_{3} ∣ X_{1}, X_{2}

by normalising its columns as we did earlier with

X_{4}

.

Example A2

(Composing and decomposing a CLGBN). Consider the CLGBN

B

from Figure 3 top. The

M = 3

discrete variables at the top of the network have the joint distribution below:

${X_{1}, X_{2}, X_{3}}$
${a, c, e}$	${b, c, e}$	${a, d, e}$	${b, d, e}$	${a, c, f}$	${b, c, f}$	${a, d, f}$	${b, d, f}$
$0.040$	$0.036$	$0.040$	$0.084$	$0.160$	$0.144$	$0.160$	$0.336$

Its elements identify the components of the mixture that make up the global distribution of

B

, and the associated probabilities are the probabilities of those components.

We can then identify which parts of the local distributions of the

N - M = 3

continuous variables (

X_{4}

,

X_{5}

and

X_{6}

) we need to compute

P (X_{4}, X_{5}, X_{6} ∣ X_{1}, X_{2}, X_{3})

for each element of the mixture. The graphical structure of

B

implies that

P (X_{4}, X_{5}, X_{6} ∣ X_{1}, X_{2}, X_{3}) =

P (X_{4}, X_{5}, X_{6} ∣ X_{2}, X_{3})

because the continuous nodes are d-separated from

X_{1}

by their parents. As a result, the following mixture components will share identical distributions which only depend on the configurations of

X_{2}

and

X_{3}

:

\begin{matrix} {a, c, e}, {b, c, e} \mapsto {c, e}, & {a, d, e}, {b, d, e} \mapsto {d, e}, \\ {a, c, f}, {b, c, f} \mapsto {c, f}, & {a, d, f}, {b, d, f} \mapsto {d, f} . \end{matrix}

For the mixture components with a distribution identified by

{c, e}

, the relevant parts of the distributions of

X_{4}

,

X_{5}

and

X_{6}

are:

\begin{matrix} X_{4} = 0.1 + 0.2 X_{5} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.09); \\ X_{5} = 0.1 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 0.09); \\ X_{6} = 0.1 + 0.2 X_{4} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1) . \end{matrix}

We can treat them as the local distributions in a GBN over

{X_{4}, X_{5}, X_{6}}

with a DAG equal to the subgraph of

B

spanning only these nodes. If we follow the steps outlined in Section 3.2 and illustrated in Example 2, we obtain

\begin{matrix} [\begin{matrix} X_{4} \\ X_{5} \\ X_{6} \end{matrix}] \sim N ([\begin{matrix} 0.120 \\ 0.100 \\ 0.124 \end{matrix}], Σ_{{c, e}} (B) = [\begin{matrix} 0.094 & 0.018 & 0.019 \\ 0.018 & 0.090 & 0.004 \\ 0.019 & 0.004 & 1.004 \end{matrix}]) \end{matrix}

which is the multivariate normal distribution associated with the components

{a, c, e}

and

{b, c, e}

in the mixture. Similarly, the relevant parts of the distributions of

X_{4}

,

X_{5}

and

X_{6}

for

{d, e}

are

\begin{matrix} X_{4} = 0.6 + 0.8 X_{5} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.36); \\ X_{5} = 0.2 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 0.36); \\ X_{6} = 0.1 + 0.2 X_{4} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1); \end{matrix}

and jointly

\begin{matrix} [\begin{matrix} X_{4} \\ X_{5} \\ X_{6} \end{matrix}] \sim N ([\begin{matrix} 0.760 \\ 0.200 \\ 0.252 \end{matrix}], Σ_{{d, e}} (B) = [\begin{matrix} 0.590 & 0.288 & 0.118 \\ 0.288 & 0.360 & 0.058 \\ 0.118 & 0.058 & 1.024 \end{matrix}]) \end{matrix}

for the components

{a, d, e}

and

{b, d, e}

. For the components

{a, c, f}

and

{b, c, f}

, the local distributions identified by

{c, f}

are

\begin{matrix} X_{4} = 0.1 + 0.2 X_{5} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.09); \\ X_{5} = 0.4 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 0.81); \\ X_{6} = 0.1 + 0.2 X_{4} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1); \end{matrix}

and the joint distribution of

X_{4}

,

X_{5}

and

X_{6}

is

\begin{matrix} [\begin{matrix} X_{4} \\ X_{5} \\ X_{6} \end{matrix}] \sim N ([\begin{matrix} 0.180 \\ 0.400 \\ 0.136 \end{matrix}], Σ_{{c, f}} (B) = [\begin{matrix} 0.122 & 0.162 & 0.024 \\ 0.162 & 0.810 & 0.032 \\ 0.024 & 0.032 & 1.005 \end{matrix}]) . \end{matrix}

Finally, the local distributions identified by

{d, f}

are

\begin{matrix} X_{4} = 0.6 + 0.8 X_{5} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.36); \\ X_{5} = 0.4 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 1.44); \\ X_{6} = 0.1 + 0.2 X_{4} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1); \end{matrix}

and the joint distribution of

X_{4}

,

X_{5}

and

X_{6}

for the components

{a, d, f}

,

{b, d, f}

is

\begin{matrix} [\begin{matrix} X_{4} \\ X_{5} \\ X_{6} \end{matrix}] \sim N ([\begin{matrix} 0.920 \\ 0.400 \\ 0.284 \end{matrix}], Σ_{{d, f}} (B) = [\begin{matrix} 1.282 & 1.152 & 0.256 \\ 1.152 & 1.440 & 0.230 \\ 0.256 & 0.230 & 1.051 \end{matrix}]) . \end{matrix}

We follow the same steps in reverse to decompose the global distribution into the local distributions. The joint distribution of

X

is a mixture with multivariate normal components and the associated probabilities. The latter are a function of the discrete variables

X_{1}

,

X_{2}

,

X_{3}

: rearranging them as the three-dimensional table

$X_{1} = a$
		$X_{2}$
		c	d
$X_{3}$	e	$0.040$	$0.040$
	f	$0.160$	$0.160$

$X_{1} = b$
		$X_{2}$
		c	d
$X_{3}$	e	$0.036$	$0.084$
	f	$0.144$	$0.336$

gives us the typical representation of

P (X_{1}, X_{2}, X_{3})

, which we can work with by operating over the different dimensions. We can then compute the conditional probability tables in the local distributions of

X_{1}

and

X_{3}

by marginalising over the remaining variables:

\begin{matrix} P (X_{1}) & = \sum_{X_{2} \in {c, d}} \sum_{X_{3} \in {e, f}} P (X_{1}, X_{2}, X_{3}) \\ = \begin{matrix} a & b \\ ( & 0.040 + 0.160 + 0.040 + 0.160 & 0.036 + 0.144 + 0.084 + 0.336 & ) \end{matrix} \\ = \begin{matrix} a & b \\ ( & 0.4 & 0.6 & ) \end{matrix}, \\ P (X_{3}) & = \sum_{X_{1} \in {a, b}} \sum_{X_{2} \in {c, d}} P (X_{1}, X_{2}, X_{3}) \\ = \begin{matrix} e & f \\ ( & 0.040 + 0.040 + 0.036 + 0.084 & 0.160 + 0.160 + 0.144 + 0.336 & ) \end{matrix} \\ = \begin{matrix} e & f \\ ( & 0.2 & 0.8 & ) \end{matrix} . \end{matrix}

As for

X_{2}

, we marginalise over

X_{3}

and normalise over

X_{1}

to obtain

P (X_{2} ∣ X_{1}) = \sum_{X_{3} \in {e, f}} \frac{P (X_{1}, X_{2}, X_{3})}{P (X_{1})} = \begin{matrix} c \\ d \end{matrix} \begin{matrix} a & b \\ ( & \begin{matrix} \frac{0.040 + 0.160}{0.4} \\ \frac{0.040 + 0.160}{0.4} \end{matrix} & \begin{matrix} \frac{0.036 + 0.144}{0.6} \\ \frac{0.084 + 0.336}{0.6} \end{matrix} & ) \end{matrix} = \begin{matrix} c \\ d \end{matrix} \begin{matrix} a & b \\ ( & \begin{matrix} 0.5 \\ 0.5 \end{matrix} & \begin{matrix} 0.3 \\ 0.7 \end{matrix} & ) \end{matrix} .

The multivariate normal distributions associated with the mixture components are a function of the continuous variables

X_{4}

,

X_{5}

,

X_{6}

.

X_{4}

has only one discrete parent (

X_{2}

),

X_{5}

has two (

X_{2}

and

X_{3}

) and

X_{6}

has none. Therefore, we only need to examine four mixture components to obtain the parameters of the local distributions of all three variables: one for which

{X_{2} = c, X_{3} = e}

, one for which

{X_{2} = d, X_{3} = e}

, one for which

{X_{2} = c, X_{3} = f}

and one for which

{X_{2} = d, X_{3} = f}

.

If we consider the first mixture component

{a, c, e}

, we can apply the steps described Section 3.2 to decompose it into the local distributions of

X_{4}

,

X_{5}

,

X_{6}

and obtain

\begin{matrix} X_{4} = 0.1 + 0.2 X_{5} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.09); \\ X_{5} = 0.1 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 0.09); \\ X_{6} = 0.1 + 0.2 X_{4} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1) . \end{matrix}

Similarly, the third mixture component

{a, d, e}

yields

\begin{matrix} X_{4} = 0.6 + 0.8 X_{5} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.36); \\ X_{5} = 0.2 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 0.36); \\ X_{6} = 0.1 + 0.2 X_{4} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1) . \end{matrix}

The fifth mixture component

{a, c, f}

yields

\begin{matrix} X_{4} = 0.1 + 0.2 X_{5} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.09); \\ X_{5} = 0.4 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 0.81); \\ X_{6} = 0.1 + 0.2 X_{4} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1) . \end{matrix}

The seventh mixture component

{a, d, f}

yields

\begin{matrix} X_{4} = 0.6 + 0.8 X_{5} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 0.36); \\ X_{5} = 0.4 + ε_{X_{5}}, & ε_{X_{5}} \sim N (0, 1.44); \\ X_{6} = 0.1 + 0.2 X_{4} + ε_{X_{6}}, & ε_{X_{6}} \sim N (0, 1) . \end{matrix}

Reorganising these distributions by variables we obtain the local distributions of

B

shown in Figure 3 top.

Example A3

(Entropy of a discrete BN). Consider again the discrete BN from Example A1. In this simple example, we can use its global distribution and (6) to compute

\begin{matrix} H (B) = - 0.005 log 0.005 - 0.013 log 0.013 - 0.052 log 0.052 - 0.050 log 0.050 - \\ 0.064 log 0.064 - 0.040 log 0.040 - 0.037 log 0.037 - 0.026 log 0.026 - \\ 0.022 log 0.022 - 0.051 log 0.051 - 0.210 log 0.210 - 0.199 log 0.199 - \\ 0.089 log 0.089 - 0.056 log 0.056 - 0.051 log 0.051 - 0.036 log 0.036 = 2.440 . \end{matrix}

In the general case, we compute

H (B)

from the local distributions using (9). Since

X_{1}

and

X_{2}

have no parents, their entropy components simply sum over their marginal distributions:

\begin{matrix} H (X_{1}) & = - 0.53 log 0.53 - 0.47 log 0.47 = 0.691, \\ H (X_{2}) & = - 0.34 log 0.34 - 0.66 log 0.66 = 0.641 . \end{matrix}

For

X_{3}

,

H (X_{3} ∣ X_{1}, X_{2}) = \sum_{x_{1} \in {a, b}} \sum_{x_{2} \in {c, d}} P (X_{1} = x_{1}, X_{2} = x_{2}) H (X_{3} ∣ X_{1} = x_{1}, X_{2} = x_{2})

where

\begin{matrix} H (X_{3} ∣ X_{1} = a, X_{2} = c) & = - 0.15 log 0.15 - 0.85 log 0.85 = 0.423, \\ H (X_{3} ∣ X_{1} = a, X_{2} = d) & = - 0.75 log 0.75 - 0.25 log 0.25 = 0.562, \\ H (X_{3} ∣ X_{1} = b, X_{2} = c) & = - 0.40 log 0.40 - 0.60 log 0.60 = 0.673, \\ H (X_{3} ∣ X_{1} = b, X_{2} = d) & = - 0.80 log 0.80 - 0.20 log 0.20 = 0.500; \end{matrix}

and where (multiplying the marginal probabilities for

X_{1}

and

X_{2}

, which are marginally independent)

\begin{matrix} P (X_{1} = a, X_{2} = c) = 0.180, & P (X_{1} = a, X_{2} = d) = 0.350, \\ P (X_{1} = b, X_{2} = c) = 0.160, & P (X_{1} = b, X_{2} = d) = 0.310; \end{matrix}

giving

H (X_{3} ∣ X_{1}, X_{2}) = (0.180 \cdot 0.423 + 0.350 \cdot 0.562 + 0.160 \cdot 0.673 + 0.310 \cdot 0.500) = 0.536 .

Finally, for

X_{4}

H (X_{4} ∣ X_{3}) = \sum_{x_{3} \in {e, f}} P (X_{3} = x_{3}) H (X_{4} ∣ X_{3} = x_{3})

where

\begin{matrix} H (X_{4} ∣ X_{3} = e) & = - 0.20 log 0.20 - 0.80 log 0.80 = 0.500, \\ H (X_{4} ∣ X_{3} = f) & = - 0.42 log 0.42 - 0.58 log 0.58 = 0.680; \end{matrix}

and

P (X_{3} = e) = 0.601

,

P (X_{3} = f) = 0.399

, giving

H (X_{4} ∣ X_{3}) = 0.601 \cdot 0.500 + 0.399 \cdot 0.680 = 0.572 .

Combining all these figures, we obtain

H (B)

as

H (X_{1}) + H (X_{2}) + H (X_{3} ∣ X_{1}, X_{2}) + H (X_{4} ∣ X_{3}) = 0.691 + 0.641 + 0.536 + 0.572 = 2.440

as before.

In general, we would have to compute the probabilities of the parent configurations of each node using a junction tree as follows:

1.: We construct the moral graph of $B$ , which contains the same arcs (but undirected) as its DAG plus $X_{1} — X_{2}$ .
2.: We identify two cliques $C_{1} = {X_{1}, X_{2}, X_{3}}$ and $C_{2} = {X_{3}, X_{4}}$ and a separator $S_{12} = {X_{3}}$ .
3.: We connect them to create the junction tree $C_{1} — S_{12} — C_{2}$ .
4.: We initialise the cliques with the respective distributions $P (C_{1}) = P (X_{1}, X_{2}, X_{3})$ , $P (C_{2}) = P (X_{3}, X_{4})$ and $P (S_{12}) = P (X_{3})$ .
5.: We compute $P (X_{1}, X_{2}) = \sum_{x_{3} \in {e, f}} P (C_{1})$ and $P (X_{3}) = P (S_{12})$ .

Example A4

(Entropy of a GBN). Consider the GBN

B

from Figure 1 top, whose global distribution we derived in Example 2. If we plug its covariance matrix

Σ_{B}

into the entropy formula for the multivariate normal distribution we obtain

H (B) = \frac{4}{2} + \frac{4}{2} log 2 π + \frac{1}{2} log det (Σ_{B}) = 2 + 3.676 + 0.5 log 0.475 = 5.304 .

Equivalently, plugging the

σ_{X_{i}}^{2} (B)

into (12) we have

\begin{matrix} H (B) = \sum_{i = 1}^{N} H (X_{i} ∣ Π_{X_{i}}^{B}) = \\ \frac{1}{2} [log (2 π \cdot 0.8) + log (2 π \cdot 0.6) + log (2 π \cdot 0.9) + log (2 π \cdot 1.1)] + \frac{4}{2} = 5.304 . \end{matrix}

Example A5

(KL between GBNs with parameters estimated from data). Consider the DAGs for the BNs

B

and

B^{'}

and the 10 observations shown in Figure A1. The partial topological ordering of the nodes in

B

is

{{X_{1}, X_{2}}, X_{4}, X_{3}}

and that in

B^{'}

is

{X_{1}, X_{2}, {X_{3}, X_{4}}}

: the total ordering that is compatible with both is

{X_{1}, X_{2}, X_{4}, X_{3}}

.

If we estimate the parameters of the local distributions of

B

by maximum likelihood we obtain

\begin{matrix} X_{1} = 2.889 + ε_{X_{1}}, & ε_{X_{1}} \sim N (0, 0.558), \\ X_{2} = 1.673 + ε_{X_{2}}, & ε_{X_{2}} \sim N (0, 1.595), \\ X_{3} = 0.896 + 1.299 X_{4} + ε_{X_{3}}, & ε_{X_{3}} \sim N (0, 1.142), \\ X_{4} = - 2.095 + 2.222 X_{1} + 2.613 X_{2} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 1.523), \end{matrix}

and the associated fitted values are

\begin{matrix} {\hat{x}}_{1} (B) & = (2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889), \\ {\hat{x}}_{2} (B) & = (1.673, 1.673, 1.673, 1.673, 1.673, 1.673, 1.673, 1.673, 1.673, 1.673), \\ {\hat{x}}_{3} (B) & = (17.293, 14.480, 8.675, 13.937, 14.846, 12.801, 13.449, 2.394, 9.670, 14.381), \\ {\hat{x}}_{4} (B) & = (13.307, 11.447, 5.852, 8.635, 8.475, 9.018, 10.370, 2.376, 7.014, 10.489) . \end{matrix}

Similarly, for

B^{'}

we obtain

\begin{matrix} X_{1} = 2.889 + ε_{X_{1}}, & ε_{X_{1}} \sim N (0, 0.558), \\ X_{2} = 3.505 - 0.634 X_{1} + ε_{X_{2}}, & ε_{X_{2}} \sim N (0, 1.542), \\ X_{3} = 7.284 + 2.933 X_{2} + ε_{X_{3}}, & ε_{X_{3}} \sim N (0, 6.051), \\ X_{4} = 5.151 + 2.120 X_{2} + ε_{X_{4}}, & ε_{X_{4}} \sim N (0, 3.999), \end{matrix}

and the associated fitted values are

\begin{matrix} {\hat{x}}_{1} (B) & = (2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889), \\ {\hat{x}}_{2} (B) & = (1.207, 2.304, 1.778, 1.625, 1.754, 2.044, 1.127, 2.037, 0.840, 2.019), \\ {\hat{x}}_{3} (B) & = (15.529, 17.760, 9.408, 11.931, 12.261, 14.009, 11.918, 6.528, 7.019, 15.564), \\ {\hat{x}}_{4} (B) & = (11.110, 12.722, 6.686, 8.509, 8.748, 10.011, 8.500, 4.604, 4.959, 11.135) . \end{matrix}

Therefore,

\begin{matrix} ∥ {\hat{x}}_{1} (B) - {\hat{x}}_{1} (B^{'}) ∥_{2}^{2} = 0, & ∥ {\hat{x}}_{2} (B) - {\hat{x}}_{2} (B^{'}) ∥_{2}^{2} = 2.018, \\ ∥ {\hat{x}}_{3} (B) - {\hat{x}}_{3} (B^{'}) ∥_{2}^{2} = 54.434, & ∥ {\hat{x}}_{4} (B) - {\hat{x}}_{4} (B^{'}) ∥_{2}^{2} = 21.329; \end{matrix}

and the values of the Kullback–Leibler divergence for the individual nodes are

\begin{matrix} KL (X_{1} ∣ Π_{X_{1}}^{B} ∥ X_{1} ∣ Π_{X_{1}}^{B^{'}}) & \approx \frac{1}{2} (log \frac{0.558}{0.558} + \frac{0.558}{0.558} - 1) + \frac{1}{20} (\frac{0}{0.558}) = 0, \\ KL (X_{2} ∣ Π_{X_{2}}^{B} ∥ X_{2} ∣ Π_{X_{2}}^{B^{'}}) & \approx \frac{1}{2} (log \frac{1.542}{1.595} + \frac{1.595}{1.542} - 1) + \frac{1}{20} (\frac{2.018}{1.542}) = 0.066, \\ KL (X_{3} ∣ Π_{X_{3}}^{B} ∥ X_{3} ∣ Π_{X_{3}}^{B^{'}}) & \approx \frac{1}{2} (log \frac{6.051}{1.142} + \frac{1.142}{6.051} - 1) + \frac{1}{20} (\frac{54.434}{6.051}) = 0.878, \\ KL (X_{4} ∣ Π_{X_{4}}^{B} ∥ X_{4} ∣ Π_{X_{4}}^{B^{'}}) & \approx \frac{1}{2} (log \frac{3.999}{1.523} + \frac{1.523}{3.999} - 1) + \frac{1}{20} (\frac{21.329}{3.999}) = 0.440, \end{matrix}

which sum up to

KL (B ∥ B^{'}) \approx 1.383

. The exact value, which we can compute as shown in Section 4.2, is

1.692

.

The quality of the empirical approximation improves with the number of observations. For reference, we generated the data in Figure A1 from the GBN in Example 2. With a sample of size

n = 100

from the same network,

KL (B ∥ B^{'}) \approx 1.362

with

KL (B ∥ B^{'}) = 1.373

; with

n = 1000

,

KL (B ∥ B^{'}) \approx 1.343

with

KL (B ∥ B^{'}) = 1.345

.

Figure A1. The DAGs for the GBNs

B

(top left) and

B^{'}

(bottom left) and the data (right) used in Example A5.

Figure A1. The DAGs for the GBNs

B

(top left) and

B^{'}

(bottom left) and the data (right) used in Example A5.

Example A6

(Entropy of a CLGBN). Consider again the CLGBN

B

from from Figure 3 (top). For such a simple BN, we can use its global distribution (which we derived in Example A2) directly to compute the entropies of the multivariate normal distributions associated with the mixture components

\begin{matrix} H (X_{4}, X_{5}, X_{6} ∣ {c, e}) & = \frac{3}{2} + \frac{3}{2} log (2 π) + \frac{1}{2} log det (Σ_{{c, e}} (B)) = 1.849, \\ H (X_{4}, X_{5}, X_{6} ∣ {d, e}) & = \frac{3}{2} + \frac{3}{2} log (2 π) + \frac{1}{2} log det (Σ_{{d, e}} (B)) = 3.235, \\ H (X_{4}, X_{5}, X_{6} ∣ {c, f}) & = \frac{3}{2} + \frac{3}{2} log (2 π) + \frac{1}{2} log det (Σ_{{c, f}} (B)) = 2.947, \\ H (X_{4}, X_{5}, X_{6} ∣ {d, f}) & = \frac{3}{2} + \frac{3}{2} log (2 π) + \frac{1}{2} log det (Σ_{{d, f}} (B)) = 3.928; \end{matrix}

and to combine them by weighting with the component probabilities

\begin{matrix} H (X_{4}, X_{5}, X_{6} ∣ X_{1}, X_{2}, X_{3}) = \underset{{a, c, e}}{\underset{︸}{0.040 \cdot 1.849}} + \underset{{b, c, e}}{\underset{︸}{0.036 \cdot 1.849}} + \underset{{a, d, e}}{\underset{︸}{0.040 \cdot 3.235}} + \underset{{b, d, e}}{\underset{︸}{0.084 \cdot 3.235}} + \\ \underset{{a, c, f}}{\underset{︸}{0.160 \cdot 2.947}} + \underset{{b, c, f}}{\underset{︸}{0.144 \cdot 2.947}} + \underset{{a, d, f}}{\underset{︸}{0.160 \cdot 3.928}} + \underset{{b, d, f}}{\underset{︸}{0.336 \cdot 3.928}} = 3.386 . \end{matrix}

The entropy of the discrete variables is

\begin{matrix} H (X_{1}, X_{2}, X_{3}) = - 0.040 log 0.040 - 0.036 log 0.036 - 0.040 log 0.040 - 0.084 log 0.084 - \\ 0.160 log 0.160 - 0.144 log 0.144 - 0.160 log 0.160 - 0.336 log 0.336 = 1.817 \end{matrix}

and then

H (B) = H (X_{1}, X_{2}, X_{3}) + H (X_{4}, X_{5}, X_{6} ∣ X_{1}, X_{2}, X_{3}) = 5.203

.

If we use the local distributions instead, we can compute the entropy of the discrete variables using (9) from Section 4.1:

\begin{matrix} H (X_{1}) & = - 0.4 log 0.4 - 0.6 log 0.6 = 0.673, \\ H (X_{2} ∣ X_{1}) & = 0.4 (- 0.5 log 0.5 - 0.5 log 0.5) + 0.6 (- 0.3 log 0.3 - 0.7 log 0.7) = 0.644, \\ H (X_{3}) & = - 0.2 log 0.2 - 0.8 log 0.8 = 0.500 . \end{matrix}

We can compute the entropy of the continuous variables with no discrete parents using (12) from Section 4.2:

H (X_{6} ∣ X_{4}) = \frac{1}{2} log (2 π \cdot 1) + \frac{1}{2} = 1.419 .

Finally, we can compute the entropy of the continuous variables with discrete parents using (24) from Section 4.3:

\begin{matrix} H (X_{4} ∣ X_{2}, X_{5}) & = 0.38 (\frac{1}{2} log (2 π \cdot 0.09) + \frac{1}{2}) + 0.62 (\frac{1}{2} log (2 π \cdot 0.36) + \frac{1}{2}) \\ = 0.645, \\ H (X_{5} ∣ X_{2}, X_{3}) & = 0.076 (\frac{1}{2} log (2 π \cdot 0.09) + \frac{1}{2}) + 0.124 (\frac{1}{2} log (2 π \cdot 0.36) + \frac{1}{2}) + \\ 0.304 (\frac{1}{2} log (2 π \cdot 0.81) + \frac{1}{2}) + 0.496 (\frac{1}{2} log (2 π \cdot 1.44) + \frac{1}{2}) \\ = 1.322 . \end{matrix}

As before, we confirm that overall

\begin{matrix} H (B) = H (X_{1}) + H (X_{2} ∣ X_{1}) + H (X_{3}) + H (X_{4} ∣ X_{2}, X_{5}) + H (X_{5} ∣ X_{2}, X_{3}) + \\ H (X_{6} ∣ X_{4}) = 0.673 + 0.644 + 0.500 + 0.645 + 1.322 + 1.419 = 5.203 . \end{matrix}

References

Scutari, M.; Denis, J.B. Bayesian Networks with Examples in R, 2nd ed.; Chapman & Hall: Boca Raton, FL, USA, 2021. [Google Scholar]
Castillo, E.; Gutiérrez, J.M.; Hadi, A.S. Expert Systems and Probabilistic Network Models; Springer: Berlin/Heidelberg, Germany, 1997. [Google Scholar]
Cowell, R.G.; Dawid, A.P.; Lauritzen, S.L.; Spiegelhalter, D.J. Probabilistic Networks and Expert Systems; Springer: Berlin/Heidelberg, Germany, 1999. [Google Scholar]
Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: Burlington, MA, USA, 1988. [Google Scholar]
Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
Murphy, K.P. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, Computer Science Division, UC Berkeley, Berkeley, CA, USA, 2002. [Google Scholar]
Spirtes, P.; Glymour, C.; Scheines, R. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA, 2000. [Google Scholar]
Pearl, J. Causality: Models, Reasoning and Inference, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
Borsboom, D.; Deserno, M.K.; Rhemtulla, M.; Epskamp, S.; Fried, E.I.; McNally, R.J.; Robinaugh, D.J.; Perugini, M.; Dalege, J.; Costantini, G.; et al. Network Analysis of Multivariate Data in Psychological Science. Nat. Rev. Methods Prim. 2021, 1, 58. [Google Scholar]
Carapito, R.; Li, R.; Helms, J.; Carapito, C.; Gujja, S.; Rolli, V.; Guimaraes, R.; Malagon-Lopez, J.; Spinnhirny, P.; Lederle, A.; et al. Identification of Driver Genes for Critical Forms of COVID-19 in a Deeply Phenotyped Young Patient Cohort. Sci. Transl. Med. 2021, 14, 1–20. [Google Scholar]
Requejo-Castro, D.; Giné-Garriga, R.; Pérez-Foguet, A. Data-driven Bayesian Network Modelling to Explore the Relationships Between SDG 6 and the 2030 Agenda. Sci. Total Environ. 2020, 710, 136014. [Google Scholar] [PubMed]
Zilko, A.A.; Kurowicka, D.; Goverde, R.M.P. Modeling Railway Disruption Lengths with Copula Bayesian Networks. Transp. Res. Part C Emerg. Technol. 2016, 68, 350–368. [Google Scholar]
Gao, R.X.; Wang, L.; Helu, M.; Teti, R. Big Data Analytics for Smart Factories of the Future. CIRP Ann. 2020, 69, 668–692. [Google Scholar]
Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar]
Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood From Incomplete Data via the EM Algorithm. J. R. Stat. Soc. (Ser. B) 1977, 39, 1–22. [Google Scholar] [CrossRef]
Minka, T.P. Expectation Propagation for Approximate Bayesian Inference. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), Seattle, WA, USA, 2–5 August 2001; pp. 362–369. [Google Scholar]
van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–3605. [Google Scholar]
Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.A.; Kwok, I.W.H.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality Reduction for Visualizing Single-Cell Data Using UMAP. Nat. Biotechnol. 2019, 37, 38–44. [Google Scholar] [CrossRef]
Murphy, K.P. Probabilistic Machine Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2022. [Google Scholar]
Murphy, K.P. Probabilistic Machine Learning: Advanced Topics; MIT Press: Cambridge, MA, USA, 2023. [Google Scholar]
Moral, S.; Cano, A.; Gómez-Olmedo, M. Computation of Kullback–Leibler Divergence in Bayesian Networks. Entropy 2021, 23, 1122. [Google Scholar] [CrossRef]
Hershey, J.R.; Olsen, P.A. Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models. In Proceedings of the 32nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, HI, USA, 15–20 April 2007; Volume IV, pp. 317–320. [Google Scholar]
Beskos, A.; Crisan, D.; Jasra, A. On the Stability of Sequential Monte Carlo Methods in High Dimensions. Ann. Appl. Probab. 2014, 24, 1396–1445. [Google Scholar] [CrossRef]
Scutari, M. Learning Bayesian Networks with the bnlearn R Package. J. Stat. Softw. 2010, 35, 1–22. [Google Scholar] [CrossRef]
Heckerman, D.; Geiger, D.; Chickering, D.M. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Mach. Learn. 1995, 20, 197–243. [Google Scholar] [CrossRef]
Chickering, D.M.; Heckerman, D. Learning Bayesian Networks is NP-Hard; Technical Report MSR-TR-94-17; Microsoft Corporation: Redmond, WA, USA, 1994. [Google Scholar]
Chickering, D.M. Learning Bayesian Networks is NP-Complete. In Learning from Data: Artificial Intelligence and Statistics V; Fisher, D., Lenz, H., Eds.; Springer: Berlin/Heidelberg, Germany, 1996; pp. 121–130. [Google Scholar]
Chickering, D.M.; Heckerman, D.; Meek, C. Large-sample Learning of Bayesian Networks is NP-hard. J. Mach. Learn. Res. 2004, 5, 1287–1330. [Google Scholar]
Scutari, M.; Vitolo, C.; Tucker, A. Learning Bayesian Networks from Big Data with Greedy Search: Computational Complexity and Efficient Implementation. Stat. Comput. 2019, 25, 1095–1108. [Google Scholar] [CrossRef]
Cussens, J. Bayesian Network Learning with Cutting Planes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), Barcelona, Spain, 14–17 July 2011; pp. 153–160. [Google Scholar]
Suzuki, J. An Efficient Bayesian Network Structure Learning Strategy. New Gener. Comput. 2017, 35, 105–124. [Google Scholar] [CrossRef]
Scanagatta, M.; de Campos, C.P.; Corani, G.; Zaffalon, M. Learning Bayesian Networks with Thousands of Variables. Adv. Neural Inf. Process. Syst. (Nips) 2015, 28, 1864–1872. [Google Scholar]
Hausser, J.; Strimmer, K. Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks. J. Mach. Learn. Res. 2009, 10, 1469–1484. [Google Scholar]
Agresti, A. Categorical Data Analysis, 3rd ed.; Wiley: Hoboken, NJ, USA, 2012. [Google Scholar]
Geiger, D.; Heckerman, D. Learning Gaussian Networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence (UAI), Seattle, WA, USA, 29–31 July 1994; pp. 235–243. [Google Scholar]
Pourahmadi, M. Covariance Estimation: The GLM and Regularization Perspectives. Stat. Sci. 2011, 26, 369–387. [Google Scholar]
Lauritzen, S.L.; Wermuth, N. Graphical Models for Associations between Variables, Some of which are Qualitative and Some Quantitative. Ann. Stat. 1989, 17, 31–57. [Google Scholar] [CrossRef]
Scutari, M.; Marquis, C.; Azzimonti, L. Using Mixed-Effect Models to Learn Bayesian Networks from Related Data Sets. In Proceedings of the International Conference on Probabilistic Graphical Models, Almería, Spain, 5–7 October 2022; Volume 186, pp. 73–84. [Google Scholar]
Lauritzen, S.L.; Spiegelhalter, D.J. Local Computation with Probabilities on Graphical Structures and their Application to Expert Systems (with discussion). J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1988, 50, 157–224. [Google Scholar]
Lauritzen, S.L.; Jensen, F. Stable Local Computation with Conditional Gaussian Distributions. Stat. Comput. 2001, 11, 191–203. [Google Scholar] [CrossRef]
Cowell, R.G. Local Propagation in Conditional Gaussian Bayesian Networks. J. Mach. Learn. Res. 2005, 6, 1517–1550. [Google Scholar]
Namasivayam, V.K.; Pathak, A.; Prasanna, V.K. Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference. In Proceedings of the 18th International Symposium on Computer Architecture and High Performance Computing, Ouro Preto, Brazil, 17–20 October 2006; pp. 167–176. [Google Scholar]
Pennock, D.M. Logarithmic Time Parallel Bayesian Inference. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI), Pittsburgh, PA, USA, 31 July–4 August 2023; pp. 431–438. [Google Scholar]
Namasivayam, V.K.; Prasanna, V.K. Scalable Parallel Implementation of Exact Inference in Bayesian Networks. In Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS), Minneapolis, MN, USA, 12–15 July 2006; pp. 1–8. [Google Scholar]
Malioutov, D.M.; Johnson, J.K.; Willsky, A.S. Walk-Sums and Belief Propagation in Gaussian Graphical Models. J. Mach. Learn. Res. 2006, 7, 2031–2064. [Google Scholar]
Cheng, J.; Druzdzel, M.J. AIS-BN: An Adaptive Importance Sampling Algorithm for Evidential Reasoning in Large Bayesian Networks. J. Artif. Intell. Res. 2000, 13, 155–188. [Google Scholar] [CrossRef]
Yuan, C.; Druzdzel, M.J. An Importance Sampling Algorithm Based on Evidence Pre-Propagation. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI), Acapulco, Mexico, 7–10 August 2003; pp. 624–631. [Google Scholar]
Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: Hoboken, NJ, USA, 2006. [Google Scholar]
Csiszár, I.; Shields, P. Information Theory and Statistics: A Tutorial; Now Publishers Inc.: Delft, The Netherlands, 2004. [Google Scholar]
Gómez-Villegas, M.A.; Main, P.; Susi, R. Sensitivity of Gaussian Bayesian Networks to Inaccuracies in Their Parameters. In Proceedings of the 4th European Workshop on Probabilistic Graphical Models (PGM), Cuenca, Spain, 17–19 September 2008; pp. 265–272. [Google Scholar]
Gómez-Villegas, M.A.; Main, P.; Susi, R. The Effect of Block Parameter Perturbations in Gaussian Bayesian Networks: Sensitivity and Robustness. Inf. Sci. 2013, 222, 439–458. [Google Scholar] [CrossRef]
Görgen, C.; Leonelli, M. Model-Preserving Sensitivity Analysis for Families of Gaussian Distributions. J. Mach. Learn. Res. 2020, 21, 1–32. [Google Scholar]
Seber, G.A.F. A Matrix Handbook for Stasticians; Wiley: Hoboken, NJ, USA, 2008. [Google Scholar]
Stewart, G.W. Matrix Algorithms, Volume I: Basic Decompositions; SIAM: Philadelphia, PA, USA, 1998. [Google Scholar]
Cavanaugh, J.E. Criteria for Linear Model Selection Based on Kullback’s Symmetric Divergence. Aust. N. Z. J. Stat. 2004, 46, 197–323. [Google Scholar] [CrossRef]

Figure 1. DAGs and local distributions for the GBNs

B

(top) and

B^{'}

(bottom) used in Examples 2 and 6–9.

Figure 1. DAGs and local distributions for the GBNs

B

(top) and

B^{'}

(bottom) used in Examples 2 and 6–9.

Figure 2. DAGs and local distributions for the discrete BNs

B

(top) and

B^{'}

(bottom) used in Examples 1, 4 and 5.

Figure 2. DAGs and local distributions for the discrete BNs

B

(top) and

B^{'}

(bottom) used in Examples 1, 4 and 5.

Figure 3. DAGs and local distributions for the CLGBNs

B

(top) and

B^{'}

(bottom) used in Examples 3 and 11–13.

Figure 3. DAGs and local distributions for the CLGBNs

B

(top) and

B^{'}

(bottom) used in Examples 3 and 11–13.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Scutari, M. Entropy and the Kullback–Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation. Algorithms 2024, 17, 24. https://doi.org/10.3390/a17010024

AMA Style

Scutari M. Entropy and the Kullback–Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation. Algorithms. 2024; 17(1):24. https://doi.org/10.3390/a17010024

Chicago/Turabian Style

Scutari, Marco. 2024. "Entropy and the Kullback–Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation" Algorithms 17, no. 1: 24. https://doi.org/10.3390/a17010024

APA Style

Scutari, M. (2024). Entropy and the Kullback–Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation. Algorithms, 17(1), 24. https://doi.org/10.3390/a17010024

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Entropy and the Kullback–Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation

Abstract

1. Introduction

2. Bayesian Networks

3. Common Distributional Assumptions for Bayesian Networks

3.1. Discrete BNs

3.2. Gaussian BNs

3.3. Conditional Linear Gaussian BNs

3.4. Inference

4. Shannon Entropy and Kullback–Leibler Divergence

4.1. Discrete BNs

4.2. Gaussian BNs

4.3. Conditional Gaussian BNs

5. Conclusions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Computational Complexity Results

Appendix B. Additional Examples

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI