Entropy and the Kullback-Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation

Bayesian networks (BNs) are a foundational model in machine learning and causal inference. Their graphical structure makes high-dimensional problems tractable by dividing them into a sparse collection of smaller ones; it also underlies Judea Pearl's causality and determines their explainability and interpretability. Despite their popularity, there are almost no resources in the literature on how to compute Shannon's entropy and the Kullback-Leibler (KL) divergence for BNs under their most common distributional assumptions. In this paper, we provide computationally efficient algorithms for both by leveraging BNs' graphical structure, and we illustrate them with a complete set of numerical examples. In the process, we show it is possible to reduce the computational complexity of KL from cubic to quadratic for Gaussian BNs.


Introduction
Bayesian networks [BNs; 1] have played a central role in machine learning research since the early days of the field as expert systems [2,3], graphical models [4,5], dynamic and latent variables models [6], and as the foundation of causal discovery [7] and causal inference [8]. They have also found applications as diverse as comorbidities in clinical psychology [9], the genetics of COVID-19 [10], the Sustainable Development Goals of the United Nations [11], railway disruptions [12] and Industry 4.0 [13].
Machine learning, however, has evolved to include a variety of other models and reformulated them into a very general information-theoretic framework. The central quantities of this framework are Shannon's entropy and the Kullback-Leibler divergence. Learning models from data relies crucially on the former to measure the amount of information captured by the model (or its complement, the amount of information lost in the residuals) and on the latter as the loss function we want to minimise. For instance, we can construct variational inference [14], the Expectation-Maximisation algorithm [15], Expectation Propagation [16] and various dimensionality reduction approaches such as t-SNE [17] and UMAP [18] using only these two quantities. We can also reformulate classical maximum-likelihood and Bayesian approaches to the same effect, from logistic regression to kernel methods to boosting [19,20].
Therefore, the lack of literature on how to compute the entropy of a BN and the Kullback-Leibler divergence between two BNs is surprising. While both are mentioned in Koller and Friedman [5] and discussed at a theoretical level in Moral et al. [21] for discrete BNs, no resources are available on any other type of BN. Furthermore, no numerical examples of how to compute them are available even for discrete BNs. We fill this gap in the literature by:

• Deriving efficient formulations of Shannon's entropy and the Kullback-Leibler divergence for Gaussian BNs and conditional linear Gaussian BNs.
• Exploring the computational complexity of both for all common types of BNs.
• Providing step-by-step numeric examples for all computations and all common types of BNs.
Our aim is to make apparent how both quantities are computed in their closed-form exact expressions and what the associated computational cost is. The common alternative is to estimate both Shannon's entropy and the Kullback-Leibler divergence empirically using Monte Carlo sampling. Admittedly, this approach is simple to implement for all types of BNs. However, it has two crucial drawbacks:

1. Using asymptotic estimates voids the theoretical properties of many machine learning algorithms: Expectation-Maximisation is not guaranteed to converge [5], for instance.

2. The number of samples required to estimate the Kullback-Leibler divergence accurately on the tails of the global distribution of both BNs is also an issue [22], especially when we need to evaluate it repeatedly as part of some machine learning algorithm. The same is true, although to a lesser extent, for Shannon's entropy as well. In general, the rate of convergence to the true posterior in Monte Carlo particle filters is proportional to the number of variables squared [23].
Therefore, efficiently computing the exact value of Shannon's entropy and the Kullback-Leibler divergence is a valuable research endeavour with a practical impact on BN use in machine learning. To help its development, we implemented the methods proposed in the paper in our bnlearn R package [24]. The remainder of the paper is structured as follows. In Section 2, we provide the basic definitions, properties and notation of BNs. In Section 3, we revisit the most common distributional assumptions in the BN literature: discrete BNs (Section 3.1), Gaussian BNs (Section 3.2) and conditional linear Gaussian BNs (Section 3.3). We also briefly discuss exact and approximate inference for these types of BNs in Section 3.4 to introduce some key concepts for later use. In Section 4, we discuss how we can compute Shannon's entropy and the Kullback-Leibler divergence for each type of BN. We conclude the paper by summarising and discussing the relevance of these foundational results in Section 5. Appendix A summarises all the computational complexity results from earlier sections, and Appendix B contains additional examples we omitted from the main text for brevity.

Bayesian Networks
Bayesian networks (BNs) are a class of probabilistic graphical models defined over a set of random variables X = {X_1, ..., X_N}, each describing some quantity of interest, that are associated with the nodes of a directed acyclic graph (DAG) G. Arcs in G express direct dependence relationships between the variables in X, with graphical separation in G implying conditional independence in probability. As a result, G induces the factorisation in which the global distribution of X (with parameters Θ) decomposes into one local distribution for each X_i (with parameters Θ_{X_i}, ⋃_i Θ_{X_i} = Θ) conditional on its parents Π_{X_i}. This factorisation is as effective at reducing the computational burden of working with BNs as the DAG underlying the BN is sparse, meaning that each node X_i has a small number of parents (|Π_{X_i}| < c, usually with c ∈ [2, 5]). For instance, learning BNs from data is only feasible in practice if this holds. The task of learning a BN B = (G, Θ) from a data set D containing n observations comprises two steps: structure learning, in which we learn the DAG G from D, and parameter learning, in which we estimate the parameters Θ given G. If we assume that parameters in different local distributions are independent [25], we can perform parameter learning independently for each node. Each X_i | Π_{X_i} will have a low-dimensional parameter space Θ_{X_i}, making parameter learning computationally efficient. On the other hand, structure learning is well known to be both NP-hard [26] and NP-complete [27], even under unrealistically favourable conditions such as the availability of an independence and inference oracle [28]. However, if G is sparse, heuristic learning algorithms have been shown to run in quadratic time [29]. Exact learning algorithms, which have optimality guarantees that heuristic algorithms lack, retain their exponential complexity but become feasible for small problems because sparsity allows for tight bounds on goodness-of-fit scores and the efficient pruning of the space of the DAGs [30-32].

Common Distributional Assumptions for Bayesian Networks
While there are many possible choices for the distribution of X in principle, the literature has focused on three cases.

Discrete BNs
Discrete BNs [25] assume that both X and the X_i are multinomial random variables. The local distributions take the form X_i | Π_{X_i} ∼ Mul(π_{ik|j}), π_{ik|j} = P(X_i = k | Π_{X_i} = j), (1) and their parameters are the conditional probabilities of X_i given each configuration of the values of its parents, usually represented as a conditional probability table (CPT) for each X_i.
The global distribution takes the form of an N-dimensional probability table with one dimension for each variable. Assuming that each X_i takes at most l values, the table will contain |Val(X)| = O(l^N) cells, where Val(·) denotes the possible (configurations of the) values of its argument. As a result, it is impractical to use for medium and large BNs. Following standard practices from categorical data analysis [34], we can produce the CPT for each X_i from the global distribution by marginalising (that is, summing over) all the variables other than {X_i, Π_{X_i}} and then normalising over each configuration of Π_{X_i}. Conversely, we can compose the global distribution from the local distributions of the X_i by multiplying the appropriate set of conditional probabilities. The computational complexity of the composition is O(N l^N) because applying (1) for each of the l^N cells involves N multiplications. As for the decomposition, for each node, we:

1. Marginalise the global distribution over all the variables other than {X_i, Π_{X_i}}, which involves summing over O(l^N) probabilities. (2)

2. Normalise the columns of the joint probability table for {X_i, Π_{X_i}} over each of the O(l^{|Π_{X_i}|}) configurations of values of Π_{X_i}, which involves summing O(l) probabilities and dividing them by their total.
The resulting computational complexity is O(l^N + l^{|Π_{X_i}| + 1}) for each node and O(N l^N + l ∑_{i=1}^N l^{|Π_{X_i}|}) for the whole BN.
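The composition and decomposition steps above can be sketched numerically. The following is a minimal Python sketch (not the bnlearn implementation) for a hypothetical two-node BN X1 → X2 with binary variables; all CPT values are made up for illustration.

```python
import numpy as np

# Hypothetical two-node discrete BN X1 -> X2, both binary (values made up).
p_x1 = np.array([0.3, 0.7])                 # P(X1)
p_x2_given_x1 = np.array([[0.9, 0.1],       # P(X2 | X1 = 0)
                          [0.2, 0.8]])      # P(X2 | X1 = 1)

# Composition: joint[i, j] = P(X1 = i) * P(X2 = j | X1 = i).
joint = p_x1[:, None] * p_x2_given_x1

# Decomposition: marginalise out X2 to recover P(X1), then normalise the
# joint table over each parent configuration to recover the CPT of X2.
p_x1_back = joint.sum(axis=1)
cpt_back = joint / joint.sum(axis=1, keepdims=True)
```

The round trip recovers the original local distributions, mirroring the marginalise-then-normalise procedure described in the text.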
Example 1 (Composing and decomposing a discrete BN). For reasons of space, this example is presented as Example B.1 in Appendix B.

Gaussian BNs
Gaussian BNs [GBNs; 35] model X with a multivariate normal random variable N(μ_B, Σ_B) (3) and assume that the X_i are univariate normals linked by linear dependencies, which can be equivalently written as linear regression models of the form X_i = μ_{X_i} + Π_{X_i} β_{X_i} + ε_{X_i}, ε_{X_i} ∼ N(0, σ²_{X_i}). (4) The parameters in (3) and (4) are the regression coefficients β_{X_i} associated with the parents Π_{X_i}, an intercept term μ_{X_i} and the variance σ²_{X_i}. They are usually estimated by maximum likelihood, but Bayesian and regularised estimators are available as well [1].
The link between the parameterisation of the global distribution of a GBN and that of its local distributions is detailed in Pourahmadi [36]. We summarise it here for later use.

• Composing the global distribution. We can create an N × N lower triangular matrix C_B from the regression coefficients in the local distributions such that C_B C_B^T gives Σ_B after rearranging rows and columns. In particular, we:

1. Arrange the nodes of B in the (partial) topological ordering induced by G, denoted X_(1), ..., X_(N).

2. Fill the rows of C_B following the topological ordering of the BN. We compute the elements of the row for X_(i) from the parameters of its local distribution: the diagonal element is the conditional standard deviation σ_{X_(i)}, and the off-diagonal elements are given by β_{X_(i)}^T C_B[Π_{X_(i)}; ·], where C_B[Π_{X_(i)}; ·] are the rows of C_B that correspond to the parents of X_(i).

3. Compute Σ̃_B = C_B C_B^T and rearrange the rows and columns of Σ̃_B to obtain Σ_B.

Intuitively, we are constructing C_B by propagating the node variances along the paths in G while combining them with the regression coefficients, which are functions of the correlations between adjacent nodes. As a result, C_B C_B^T gives Σ_B after rearranging the rows and columns to follow the original ordering of the nodes. The elements of the mean vector μ_B are similarly computed as E(X_(i)) = μ_{X_(i)} + E(Π_{X_(i)}) β_{X_(i)}, iterating over the variables in topological order.

• Decomposing the global distribution. Conversely, we can derive the matrix C_B from Σ_B by reordering its rows and columns to follow the topological ordering of the variables in G and computing its Cholesky decomposition. The matrix R obtained by solving the triangular systems defined by the rows of C_B then contains the regression coefficients β_{X_(i)} in the elements corresponding to X_(i), Π_{X_(i)}.² Finally, we compute the intercepts μ_{X_i} as μ_B − Rμ_B, reversing the equations we used to construct μ_B above.
The computational complexity of composing the global distribution is bounded by the matrix multiplication C_B C_B^T, which is O(N³); if we assume that G is sparse as in Scutari et al. [29], constructing C_B itself is comparatively cheap because each row only combines the rows of at most c parents.

Example 2 (Composing and decomposing a GBN). The topological ordering of the variables defined by B is {{X_1, X_2}, X_4, X_3}. Following the steps above, we fill C_B so that the diagonal elements are the conditional standard deviations of the nodes and the elements below the diagonal combine the regression coefficients with the corresponding parent rows. The elements of the corresponding expectation vector μ_B are then computed from the intercepts and regression coefficients in topological order. Conversely, starting from Σ_B, we can reorder its rows and columns to obtain Σ̃_B, whose Cholesky decomposition is C_B. The coefficients β_{X_i} of the local distributions are then available from C_B; we can read the standard errors of X_1, X_2, X_3 and X_4 directly from the diagonal elements of C_B, and we can compute the intercepts from μ_B − Rμ_B.
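To make the construction of C_B concrete, here is a minimal Python sketch for a hypothetical two-node GBN X1 → X2 (all parameter values are made up): the row of C_B for X2 combines the regression coefficient with the parent's row, C_B C_B^T reproduces the covariance matrix, and the Cholesky decomposition recovers C_B and the coefficient.

```python
import numpy as np

# Hypothetical GBN: X1 ~ N(mu1, s1^2); X2 = mu2 + b*X1 + eps, eps ~ N(0, s2^2).
mu1, s1 = 1.0, 2.0
mu2, b, s2 = 0.5, 1.5, 1.0

# Compose C row by row in topological order: the diagonal holds the node's
# conditional standard deviation, the off-diagonal entries combine the
# regression coefficient with the parent's row.
C = np.array([[s1,     0.0],
              [b * s1, s2 ]])
Sigma = C @ C.T                      # global covariance matrix
mu = np.array([mu1, mu2 + b * mu1])  # global mean, in topological order

# Decompose: the Cholesky factor of Sigma recovers C, and the regression
# coefficient is read back by rescaling the off-diagonal entry.
C_back = np.linalg.cholesky(Sigma)
b_back = C_back[1, 0] / C_back[0, 0]
```

The identities Var(X2) = b²s1² + s2² and Cov(X1, X2) = b·s1² follow directly from the matrix product C C^T.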

Conditional Linear Gaussian BNs
Finally, conditional linear Gaussian BNs [CLGBNs; 37] subsume discrete BNs and GBNs as particular cases by combining discrete and continuous random variables in a mixture model. If we denote the former with X_D and the latter with X_G, so that X = X_D ∪ X_G, then: • Discrete X_i ∈ X_D are only allowed to have discrete parents (denoted ∆_{X_i}), and are assumed to follow a multinomial distribution parameterised with CPTs. We can estimate their parameters in the same way as those in a discrete BN.
• Continuous X_i ∈ X_G are allowed to have both discrete and continuous parents (denoted ∆_{X_i} and Γ_{X_i}, respectively), and are assumed to follow a normal distribution conditional on each configuration of their discrete parents, which is equivalent to a mixture of linear regressions against the continuous parents with one component for each configuration δ_{X_i} ∈ Val(∆_{X_i}) of the discrete parents. If X_i has no discrete parents, the mixture reverts to a single linear regression like that in (4). The parameters of these local distributions are usually estimated by maximum likelihood like those in a GBN; we have used hierarchical regressions with random effects in our recent work [38] for this purpose as well. Bayesian and regularised estimators are also an option [5].
If the CLGBN comprises |X_D| = M discrete nodes and |X_G| = N − M continuous nodes, these distributional assumptions imply the partial topological ordering X_D ≺ X_G, (5) in which discrete nodes precede continuous ones. The discrete nodes jointly follow a multinomial distribution, effectively forming a discrete BN. The continuous nodes jointly follow a multivariate normal distribution, parameterised as a GBN, for each configuration of the discrete nodes. Therefore, the global distribution is a Gaussian mixture in which the discrete nodes identify the components, and the continuous nodes determine their distribution. The practical link between the global and the local distributions follows directly from Sections 3.1 and 3.2.
Example 3 (Composing and decomposing a CLGBN). For reasons of space, this example is presented as Example B.2 in Appendix B.
The complexity of composing and decomposing the global distribution is then the sum of the cost of converting between the CPTs and the mixture component probabilities and the cost of the GBN computations within each component, where ∆ = ⋃_{X_i ∈ X_G} ∆_{X_i} are the discrete parents of the continuous nodes.

Inference
For BNs, inference broadly denotes obtaining the conditional distribution of a subset of variables conditional on a second subset of variables. Following older terminology from expert systems [2], this is called formulating a query in which we ask the BN about the probability of an event of interest after observing some evidence. In conditional probability queries, the event of interest is the probability of one or more events in (or the whole distribution of) some variables of interest conditional on the values assumed by the evidence variables. In maximum a posteriori ("most probable explanation") queries, we condition on the values of the evidence variables to predict those of the event variables.
All inference computations on BNs are completely automated by exact and approximate algorithms, which we will briefly describe here.We refer the interested reader to the more detailed treatment in Castillo et al. [2] and Koller and Friedman [5].
Exact inference algorithms use local computations to compute the value of the query. The seminal works of Lauritzen and Spiegelhalter [39], Lauritzen and Wermuth [37] and Lauritzen and Jensen [40] describe how to transform a discrete BN or a (CL)GBN into a junction tree³ as a preliminary step before using belief propagation. Cowell [41] uses elimination trees for the same purpose in CLGBNs.
Namasivayam et al. [42] give the computational complexity of constructing the junction tree from a discrete BN as O(Nw + w l^w N), where w is the maximum number of nodes in a clique and, as before, l is the maximum number of values that a variable can take. We take the complexity of belief propagation to be O(N w l^w + |Θ|), as stated in Lauritzen and Spiegelhalter [39] ("The global propagation is no worse than the initialisation [of the junction tree]"). This is confirmed by Pennock [43] and Namasivayam and Prasanna [44].
As for GBNs, we can also perform exact inference through their global distribution because the latter has only O(N² + N) parameters. The computational complexity of this approach is O(N³) because of the cost of composing the global distribution, which we derived in Section 3.2. However, all the operations involved are linear algebra, making it possible to leverage specialised hardware such as GPUs and TPUs to the best effect. Koller and Friedman [5, Section 14.2.1] note that "inference in linear Gaussian networks is linear in the number of cliques, and at most cubic in the size of the largest clique" when using junction trees and belief propagation. Therefore, junction trees may be significantly faster for GBNs when w ≪ N. However, the correctness and convergence of belief propagation in GBNs require a set of sufficient conditions that have been studied comprehensively by Malioutov et al. [45]. Using the global distribution directly always produces correct results.
Approximate inference algorithms use Monte Carlo simulations to sample from the global distribution of X through the local distributions and estimate the answers to queries by computing the appropriate summary statistics on the particles they generate. Therefore, they mirror the Monte Carlo and Markov chain Monte Carlo approaches in the literature: rejection sampling, importance sampling and sequential Monte Carlo, among others. Two state-of-the-art examples are the adaptive importance sampling (AIS-BN) scheme [46] and the evidence pre-propagation importance sampling (EPIS-BN) [47].

Shannon Entropy and Kullback-Leibler Divergence
The general definition of Shannon entropy for the probability distribution P of X is H(P(X)) = −E_P[log P(X)]. (6) The Kullback-Leibler divergence between two distributions P and Q for the same random variables X is defined as KL(P(X) ∥ Q(X)) = E_P[log P(X) / Q(X)]. (7) They are linked as follows: KL(P(X) ∥ Q(X)) = H(P(X), Q(X)) − H(P(X)), (8) where H(P(X), Q(X)) = −E_P[log Q(X)] is the cross-entropy between P(X) and Q(X). For the many properties of these quantities, we refer the reader to Cover and Thomas [48] and Csiszár and Shields [49]. Their use and interpretation are covered in depth (and breadth!) in Murphy [19,20] for general machine learning and in Koller and Friedman [5] for BNs.
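The link between entropy, cross-entropy and the Kullback-Leibler divergence can be checked numerically. A minimal Python sketch for two made-up discrete distributions:

```python
import numpy as np

# Two made-up distributions P and Q over the same three states.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

H_p = -np.sum(p * np.log(p))    # Shannon entropy H(P)
H_pq = -np.sum(p * np.log(q))   # cross-entropy H(P, Q)
kl = np.sum(p * np.log(p / q))  # KL(P || Q)
```

By construction, KL(P ∥ Q) equals the cross-entropy minus the entropy, and is non-negative.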
For a BN B encoding the probability distribution of X, (6) decomposes into H(B) = ∑_{i=1}^N H(X_i | Π^B_{X_i}), where Π^B_{X_i} are the parents of X_i in B. While this decomposition looks similar to (1), we will see that its terms are not necessarily orthogonal, unlike the local distributions.
As for (7), we cannot simply write KL(B ∥ B′) = ∑_{i=1}^N KL(X_i | Π^B_{X_i} ∥ X_i | Π^{B′}_{X_i}) because, in the general case, the nodes X_i will have different parents in B and B′. This issue impacts the complexity of computing Kullback-Leibler divergences in different ways depending on the type of BN.

Discrete BNs
For discrete BNs, H(B) does not decompose into orthogonal components. As pointed out in Koller and Friedman [5, Section 8.4.12], H(X_i | Π^B_{X_i}) = −∑_j P(Π^B_{X_i} = j) ∑_k π_{ik|j}(B) log π_{ik|j}(B). (9) If we estimated the conditional probabilities π_{ik|j}(B) from data, the P(Π^B_{X_i} = j) are already available as the normalising constants of the individual conditional distributions {π_{ik|j}(B), j = 1, ..., q_i} in the local distribution of X_i. In this case, the complexity of computing H(X_i | Π^B_{X_i}) is linear in the number of parameters. In the general case, we need exact inference to compute the probabilities P(Π^B_{X_i} = j). Fortunately, they can be readily extracted from the junction tree derived from B as follows:

1. Identify a clique containing both X_i and Π^B_{X_i}. Such a clique is guaranteed to exist by the family preservation property [5, Definition 10.1].

2. Compute the marginal distribution of Π^B_{X_i} by summing over the remaining variables in the clique.
Combining the computational complexity of constructing the junction tree from Section 3.4 and that of marginalisation, which is at most O(l^{w−1}) for each node as in (2), we obtain a complexity that is exponential in the maximum clique size w.⁴ Interestingly, we do not need to perform belief propagation, so computing H(B) is more efficient than other inference tasks.

Example 4 (Entropy of a discrete BN). For reasons of space, this example is presented as Example B.3 in Appendix B.

The Kullback-Leibler divergence has a similar issue, as noted in Koller and Friedman [5, Section 8.4.2]. The best and most complete explanation of how to compute it for discrete BNs is in Moral et al. [21]. After decomposing KL(B ∥ B′) following (8) to separate H(B) and H(B, B′), Moral et al. [21] show that the latter takes the form H(B, B′) = −∑_{i=1}^N ∑_j ∑_k π_{ikj}(B) log π_{ik|j}(B′), (10) where π_{ikj}(B) is the probability assigned by B to X_i = k jointly with the variables that are parents of X_i in B′ taking value j, and π_{ik|j}(B′) is the corresponding conditional probability from the local distribution of X_i in B′. In order to compute the π_{ikj}(B), we need to transform B into its junction tree and use belief propagation to compute the joint distribution of X_i ∪ Π^{B′}_{X_i}. As a result, H(B, B′) does not decompose at all: each π_{ikj}(B) can potentially depend on the whole BN B. Operationally, we:

1. Identify the parents of each node in B′.

2. Transform B into its junction tree.

3. For each node X_i:

(a) Obtain the distribution of the variables {X_i, Π^{B′}_{X_i}} from the junction tree of B, consisting of the probabilities π_{ikj}(B).

(b) Read the π_{ik|j}(B′) from the local distribution of X_i in B′.
The computational complexity of this procedure is as follows: O(N(w(1 + l^w) + l^{w−1})) + |Θ| to create the junction tree of B and compute H(B). As noted in Moral et al. [21], computing the π_{ikj}(B) requires a separate run of belief propagation for each configuration of the Π^{B′}_{X_i}, for a total of ∑_{i=1}^N l^{|Π^{B′}_{X_i}|} runs. If we assume that the DAG underlying B′ is sparse, we have that |Π^{B′}_{X_i}| ⩽ c, and the overall complexity of this step becomes O(N l^c · (N w l^w + |Θ|)), N times that listed in Section 3.4. The caching scheme devised by Moral et al. [21] is very effective in limiting the use of belief propagation, but it does not alter its exponential complexity.
Example 5 (KL between two discrete BNs). Consider the discrete BN B from Figure 2 (top). Furthermore, consider the BN B′ from Figure 2 (bottom). We constructed the global distribution of B in Example B.1; we can similarly compose the global distribution of B′. We identify the parents of each node in B′. We then construct a junction tree from B and use it to compute the distributions P(X_1), P(X_2, X_1, X_4), P(X_3, X_1) and P(X_4, X_3).

Gaussian BNs

For GBNs, H(B) decomposes along with the local distributions as H(B) = ∑_{i=1}^N H(X_i | Π^B_{X_i}) = ∑_{i=1}^N ½ log(2πe σ²_{X_i}), (12) which has a computational complexity of O(1) for each node, O(N) overall. Equivalently, we can start from the global distribution of B from Section 3.2 and consider that det(Σ_B) = det(C_B C_B^T) = ∏_{i=1}^N C_B[i; i]² because C_B is lower triangular. The (multivariate normal) entropy of X then becomes H(B) = N/2 log(2πe) + ½ log det(Σ_B) = N/2 log(2πe) + ∑_{i=1}^N log C_B[i; i], in agreement with (12).
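The agreement between the local decomposition and the global multivariate normal entropy can be verified numerically. A minimal Python sketch with a made-up positive definite Σ_B (already in topological order, so that its Cholesky factor plays the role of C_B):

```python
import numpy as np

# Made-up covariance matrix of a 3-node GBN (positive definite).
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
N = Sigma.shape[0]

# Global route: multivariate normal entropy via log det(Sigma).
H_global = 0.5 * (N * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])

# Local route: because C is lower triangular, log det(Sigma) is twice the
# sum of the logs of the conditional standard deviations on its diagonal.
C = np.linalg.cholesky(Sigma)
H_local = 0.5 * N * np.log(2 * np.pi * np.e) + np.sum(np.log(np.diag(C)))
```

Both routes return the same value, but the local one never needs the full covariance matrix.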
Example 6 (Entropy of a GBN). For reasons of space, this example is presented as Example B.4 in Appendix B.
In the literature, the Kullback-Leibler divergence between two GBNs B and B′ is usually computed using the respective global distributions N(μ_B, Σ_B) and N(μ_{B′}, Σ_{B′}) [50-52]. The general expression is KL(B ∥ B′) = ½ [log(det Σ_{B′} / det Σ_B) − N + tr(Σ_{B′}^{-1} Σ_B) + (μ_{B′} − μ_B)^T Σ_{B′}^{-1} (μ_{B′} − μ_B)], (14) which has cubic computational complexity (15). In order to compute KL(B ∥ B′), we first invert Σ_{B′} and then evaluate the log-determinants, the trace and the quadratic form. As an alternative, we can compute the spectral decompositions of Σ_B and Σ_{B′}: the product of the eigenvalues gives the corresponding determinants, and the eigenvectors and eigenvalues allow us to easily compute Σ_{B′}^{-1} for use in both the quadratic form and in the trace.
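The general expression above translates directly into code; a minimal Python sketch with made-up parameters (a dense implementation, before any of the sparsity-based savings discussed next):

```python
import numpy as np

def gaussian_kl(mu_p, Sigma_p, mu_q, Sigma_q):
    """KL( N(mu_p, Sigma_p) || N(mu_q, Sigma_q) ) via the standard closed form."""
    N = len(mu_p)
    Sq_inv = np.linalg.inv(Sigma_q)
    diff = mu_q - mu_p
    # Log-determinant ratio, computed stably through slogdet.
    log_det_ratio = np.linalg.slogdet(Sigma_q)[1] - np.linalg.slogdet(Sigma_p)[1]
    return 0.5 * (log_det_ratio - N
                  + np.trace(Sq_inv @ Sigma_p)
                  + diff @ Sq_inv @ diff)

# Made-up parameters for two bivariate normals.
mu_p = np.array([0.0, 1.0]); Sp = np.array([[1.0, 0.2], [0.2, 2.0]])
mu_q = np.array([0.5, 0.0]); Sq = np.array([[1.5, 0.0], [0.0, 1.0]])
kl = gaussian_kl(mu_p, Sp, mu_q, Sq)
```

The divergence of a distribution from itself is zero, and it is strictly positive for distinct distributions.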
However, computing KL(B ∥ B′) from the global distributions N(μ_B, Σ_B) and N(μ_{B′}, Σ_{B′}) disregards the fact that BNs are sparse models that can be characterised more compactly by (μ_B, C_B) and (μ_{B′}, C_{B′}), as shown in Section 3.2. In particular, we can revisit several operations that are in the high-order terms of (15):

•
Composing the global distribution from the local ones. We avoid computing Σ_B and Σ_{B′}, thus reducing this step to O(2N) complexity.

•
Computing the trace tr(Σ_{B′}^{-1} Σ_B). We can reduce the computation of the trace as follows.

1. We can replace Σ_B and Σ_{B′} in the trace with any consistently reordered matrices [53, Result 8.17]: we choose to use Σ_{B′} and Σ*_B, where Σ_{B′} is defined as before and Σ*_B is Σ_B with the rows and columns reordered to match Σ_{B′}. Formally, this is equivalent to Σ*_B = P Σ_B P^T, where P is a permutation matrix that imposes the desired node ordering: since both the rows and the columns are permuted in the same way, the diagonal elements of Σ_B are the same as those of Σ*_B and the trace is unaffected.

2. We have Σ_{B′} = C_{B′} C_{B′}^T. As for Σ*_B, we can write Σ*_B = C*_B (C*_B)^T, where C*_B = P C_B is the lower triangular matrix C_B with the rows reordered to match Σ_{B′}. Note that C*_B is not lower triangular unless G and G′ have the same partial node ordering, which implies P = I_N. Therefore tr(Σ_{B′}^{-1} Σ_B) = ∥C_{B′}^{-1} C*_B∥²_F, (17) where the last step rests on Seber [53]. The quadratic form in (14) can similarly be computed as ∥C_{B′}^{-1} (μ*_{B′} − μ*_B)∥²_2, (18) where μ*_{B′} and μ*_B are the mean vectors reordered to match B′, and C_{B′}^{-1} is available from previous computations. Combining (17), (13) and (18), the expression in (14) becomes (19). The overall complexity (20) of (19) is, while still cubic, smaller in its leading coefficient, suggesting that it should be about 5 times faster than the variant of (15) using the spectral decomposition.
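The identity tr(Σ_{B′}^{-1} Σ_B) = ∥C_{B′}^{-1} C*_B∥²_F can be checked numerically. A minimal Python sketch with random positive definite matrices, assuming for simplicity that both networks share the same node ordering (so no permutation is needed):

```python
import numpy as np

# Two made-up positive definite covariance matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); Sigma_p = A @ A.T + 4 * np.eye(4)
B = rng.standard_normal((4, 4)); Sigma_q = B @ B.T + 4 * np.eye(4)

C_p = np.linalg.cholesky(Sigma_p)
C_q = np.linalg.cholesky(Sigma_q)

# Direct trace, which requires inverting Sigma_q explicitly.
trace_direct = np.trace(np.linalg.inv(Sigma_q) @ Sigma_p)

# Sparse route: solve one triangular system and take a Frobenius norm.
M = np.linalg.solve(C_q, C_p)  # M = C_q^{-1} C_p by forward substitution
trace_frob = np.sum(M * M)     # ||M||_F^2
```

Solving the triangular system avoids forming Σ_{B′}^{-1} altogether, which is where the savings in the leading coefficient come from.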
Example 8 (Sparse KL between two GBNs). Consider again the two GBNs from Example 7.
The corresponding C matrices readily give the determinants of Σ_B and Σ_{B′} following (13). As for the Frobenius norm in (17), we first invert C_{B′} and compute ∥C_{B′}^{-1} C*_B∥²_F. The quadratic form is then equal to 408.362, which matches the value of (μ_{B′} − μ_B)^T Σ_{B′}^{-1} (μ_{B′} − μ_B). As a result, the expression for KL(B ∥ B′) is the same as in (16).
We can further reduce the complexity (20) of (19) when an approximate value of KL is suitable for our purposes. The only term with cubic complexity is tr(Σ_{B′}^{-1} Σ_B): reducing it to quadratic complexity or lower will eliminate the leading term of (20), making KL quadratic in complexity. One way to do this is to compute a lower and an upper bound for tr(Σ_{B′}^{-1} Σ_B), which can serve as an interval estimate, and take their geometric mean as an approximate point estimate.
A lower bound is given by Seber [53, Result 10.39], which conveniently reuses the values of det(Σ_B) and det(Σ_{B′}) we have from (13). For an upper bound, Seber [53, Result 10.59] combined with Seber [53, Result 4.15] gives a complementary result. Their geometric mean is 42.199, which can serve as an approximate value for KL(B ∥ B′).
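The precise bounds used here come from Seber [53]; a pair of simple bounds in the same spirit follows from the arithmetic-geometric mean inequality on the (positive) eigenvalues of Σ_{B′}^{-1} Σ_B (lower bound) and from tr(AB) ⩽ tr(A) tr(B) for positive semi-definite matrices (upper bound). A minimal Python sketch with made-up matrices:

```python
import numpy as np

# Two made-up positive definite covariance matrices.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5)); Sigma_p = A @ A.T + 5 * np.eye(5)
B = rng.standard_normal((5, 5)); Sigma_q = B @ B.T + 5 * np.eye(5)
N = 5

true_trace = np.trace(np.linalg.inv(Sigma_q) @ Sigma_p)

# Lower bound: AM-GM on the eigenvalues of Sigma_q^{-1} Sigma_p gives
# tr >= N * (det(Sigma_p) / det(Sigma_q))^(1/N), reusing the determinants.
log_det_p = np.linalg.slogdet(Sigma_p)[1]
log_det_q = np.linalg.slogdet(Sigma_q)[1]
lower = N * np.exp((log_det_p - log_det_q) / N)

# Upper bound: tr(AB) <= tr(A) tr(B) for positive semi-definite A, B.
upper = np.trace(np.linalg.inv(Sigma_q)) * np.trace(Sigma_p)

# Geometric mean of the bounds as an approximate point estimate.
approx = np.sqrt(lower * upper)
```

The bounds always bracket the true trace, so the geometric mean is a defensible interval-based point estimate.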
If we are comparing two GBNs whose parameters (but not necessarily network structures) have been learned from the same data, we can sometimes approximate KL(B ∥ B′) using the local distributions X_i | Π^B_{X_i} and X_i | Π^{B′}_{X_i} directly. If B and B′ have compatible partial orderings,⁵ we can define a common total node ordering for both and write the global distributions as chain decompositions over it. The product of the local distributions in the second step is obtained from the chain decomposition in the first step by considering the nodes in the conditioning other than the parents to have associated regression coefficients equal to zero. Then, following the derivations in Cavanaugh [55] for a general linear regression model, we can write the empirical approximation (23), where, following a similar notation to (4):

• μ̂_{X_i}(B), β̂_{X_i}(B) and μ̂_{X_i}(B′), β̂_{X_i}(B′) are the estimated intercepts and regression coefficients;
• x̂_i(B) and x̂_i(B′) are the n × 1 vectors of fitted values computed from the data observed for X_i given Π_{X_i}(B) and Π_{X_i}(B′);
• σ̂²_{X_i}(B) and σ̂²_{X_i}(B′) are the residual variances in B and B′.

We can compute the expression in (23) for each node in time linear in the sample size if both G and G′ are sparse. In this case, the overall computational complexity simplifies to O(nN(2c + 5/2)). Furthermore, as we pointed out in Scutari et al. [29], the fitted values x̂_i(B), x̂_i(B′) are computed as a by-product of parameter learning: if we consider them to be already available, the above computational complexity reduces to just O(n) for a single node and O(nN) overall. We can also replace the fitted values x̂_i(B), x̂_i(B′) in (23) with the corresponding residuals ε̂_i(B), ε̂_i(B′) if the latter are available but the former are not.
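The per-node term in this empirical approximation is the KL divergence between two fixed-design Gaussian regressions, which only needs the fitted values and the residual variances. A minimal Python sketch for a single node with made-up fitted values, cross-checked against the dense multivariate normal KL on the same pair of n-dimensional distributions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
fitted_b = rng.standard_normal(n)   # fitted values under B (made up)
fitted_b2 = fitted_b + 0.3          # fitted values under B' (made up)
s2_b, s2_b2 = 1.0, 1.4              # residual variances in B and B'

# Node-wise term: KL( N(fitted_b, s2_b * I) || N(fitted_b2, s2_b2 * I) ).
kl_node = (0.5 * n * np.log(s2_b2 / s2_b)
           + (n * s2_b + np.sum((fitted_b - fitted_b2) ** 2)) / (2 * s2_b2)
           - 0.5 * n)

# Cross-check through the dense multivariate normal KL formula.
diff = fitted_b2 - fitted_b
Sp, Sq = s2_b * np.eye(n), s2_b2 * np.eye(n)
kl_dense = 0.5 * (np.linalg.slogdet(Sq)[1] - np.linalg.slogdet(Sp)[1] - n
                  + np.trace(np.linalg.inv(Sq) @ Sp)
                  + diff @ np.linalg.inv(Sq) @ diff)
```

The node-wise formula avoids any matrix algebra, which is why its cost is linear in the sample size.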
Example 10 (KL between GBNs with parameters estimated from data). For reasons of space, this example is presented as Example B.5 in Appendix B.

Conditional Gaussian BNs
The entropy H(B) decomposes into a separate H(X_i | Π^B_{X_i}) for each node, of the form (9) for discrete nodes and (12) for continuous nodes with no discrete parents. For continuous nodes with both discrete and continuous parents, H(X_i | Π^B_{X_i}) = ∑_{δ_{X_i} ∈ Val(∆_{X_i})} π_{δ_{X_i}} · ½ log(2πe σ²_{X_i, δ_{X_i}}), (24) where π_{δ_{X_i}} represents the probability associated with the configuration δ_{X_i} of the discrete parents ∆_{X_i}. This last expression can be computed in O(|Val(∆_{X_i})|) time for each node.
Overall, the complexity of computing H(B) is O(∑_{i=1}^N max{1, |Val(∆_{X_i})|}), where the max accounts for the fact that |Val(∆_{X_i})| = 0 when ∆_{X_i} = ∅ but the computational complexity is O(1) for such nodes.
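The mixture form of the conditional entropy is cheap to evaluate. A minimal Python sketch for a hypothetical continuous node with one binary discrete parent (all values made up): the entropy of each Gaussian component is weighted by the probability of the corresponding discrete-parent configuration.

```python
import numpy as np

# Hypothetical continuous node with one binary discrete parent (made up):
pi_delta = np.array([0.4, 0.6])     # probability of each parent configuration
sigma_delta = np.array([0.8, 1.5])  # conditional std. dev. per configuration

# Entropy of each Gaussian component...
component_entropy = 0.5 * np.log(2 * np.pi * np.e * sigma_delta ** 2)
# ...weighted by the probability of the discrete-parent configuration.
H = np.sum(pi_delta * component_entropy)
```

The result is a convex combination of the component entropies, so it always lies between the smallest and the largest of them.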
Example 11 (Entropy of a CLGBN). For reasons of space, this example is presented as Example B.6 in Appendix B.
As for KL(B ∥ B′), we could not find any literature illustrating how to compute it. The partition of the nodes in (5) implies that KL(B ∥ B′) = KL(X^B_D ∥ X^{B′}_D) + KL(X^B_G | X^B_D ∥ X^{B′}_G | X^{B′}_D). (26) We can compute the first term following Section 4.1: X^B_D and X^{B′}_D form two discrete BNs whose DAGs are the spanning subgraphs of B and B′ and whose local distributions are the corresponding ones in B and B′, respectively. The second term decomposes over the configurations of the discrete nodes similarly to (10) and (24). We can compute it using the multivariate normal distributions associated with the components of the Gaussian mixture.

Example 12 (KL between two CLGBNs). Consider the CLGBN B from Example 3 and a second CLGBN B′ over the same nodes, whose continuous part only distinguishes the discrete parent configurations {e} and {f}. Therefore, B′ only encodes two different multivariate normal distributions.
Firstly, we construct two discrete BNs using the subgraphs spanning X^B_D and X^{B′}_D. The CPTs for X_1, X_2 and X_3 are the same as in B and in B′. We then compute KL(X^B_D ∥ X^{B′}_D) = 0.577 following Example 5.
Secondly, we construct the multivariate normal distributions associated with the components of B′ following Example 3 (in which we computed those of B). For {e}, the covariance matrix is

0.160 0.000 0.032
0.000 1.690 1.183
0.032 1.183 2.274

The computational complexity of this basic approach to computing KL(B ∥ B′) is given in (27), which we obtain by adapting (11) and (15). The second term is exponential in M, which would lead us to conclude that it is computationally unfeasible to compute KL(B ∥ B′) whenever we have more than a few discrete variables in B and B′. Certainly, this would agree with Hershey and Olsen [22], who reviewed various scalable approximations of the KL divergence between two Gaussian mixtures.
However, we would again disregard the fact that BNs are sparse models. Two properties of CLGBNs that are apparent from Examples 3 and 12 allow us to compute (26) efficiently. Firstly, the continuous nodes are conditionally independent of the discrete nodes that are not their parents (X^B_D \ ∆^B) given their parents (∆^B); the same is true for X^{B′}. The number of distinct terms in the summation in (26) is then given by |Val(∆^B ∪ ∆^{B′})|, which will be smaller than |Val(X_D)|. Secondly, the component distributions X^B_G | {∆^B ∪ ∆^{B′}} and X^{B′}_G | {∆^B ∪ ∆^{B′}} are multivariate normals (not mixtures). They are also faithful to the subgraphs spanning the continuous nodes X_G, and we can represent them as GBNs whose parameters can be extracted directly from B and B′. Therefore, we can use the results from Section 4.2 to compute their Kullback-Leibler divergences efficiently.
As a result, (26) becomes a weighted sum of Kullback-Leibler divergences between GBNs, where P({∆^B ∪ ∆^{B′}} = δ) is the probability that the nodes ∆^B ∪ ∆^{B′} take value δ as computed in B. In turn, (27) reduces because we can replace l^M with l^{|Val(∆^B ∪ ∆^{B′})|}, which is an upper bound on the number of unique components in the mixture, and because we replace the complexity in (15) with that in (20). We can also further reduce the second term to quadratic complexity as we discussed in Section 4.2. The remaining drivers of the computational complexity are:

• the maximum clique size w in the subgraph spanning X^B_D;
• the number of arcs from discrete nodes to continuous nodes in both B and B′ and the overlap between ∆^B and ∆^{B′}.
Example 13 (Sparse KL between two CLGBNs). Consider again the CLGBNs B and B′ from Example 12. All the BNs in the Kullback-Leibler divergences are GBNs whose structure and local distributions can be read from B and B′. The four GBNs associated with X^B_G | {∆^B ∪ ∆^{B′}} have the local distributions listed in Figure 3. The corresponding GBNs associated with X^{B′}_G | {∆^B ∪ ∆^{B′}} are, in fact, only two distinct GBNs, associated with {e} and {f}. They have arcs {X_4 → X_6, X_5 → X_6}. Plugging in the numbers gives the value of KL(B ∥ B′).

Conclusions
We started this paper by reviewing the three most common distributional assumptions for BNs: discrete BNs, Gaussian BNs (GBNs) and conditional linear Gaussian BNs (CLGBNs). First, we reviewed the link between the respective global and local distributions, and we formalised the computational complexity of decomposing the former into the latter (and vice versa).
We then leveraged these results to study the complexity of computing Shannon's entropy. We can, of course, compute the entropy of a BN from its global distribution using standard results from the literature (in the case of discrete BNs and CLGBNs, only for small networks, because |Θ| grows combinatorially). However, this is not computationally efficient because we incur the cost of composing the global distribution. While the entropy does not decompose along with the local distributions for either discrete BNs or CLGBNs, we show that it is nevertheless efficient to compute it from them.
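As a minimal sketch of this idea for discrete BNs, the entropy can be accumulated node by node from the conditional probability tables and the parent-configuration probabilities, without ever materialising the joint table. The data structures below are hypothetical, not the paper's notation:

```python
import numpy as np

def entropy_discrete_bn(cpts, parent_probs):
    """Entropy of a discrete BN from its local distributions:
    H(B) = sum_i sum_{pa} P(Pa_i = pa) * H(X_i | Pa_i = pa).
    cpts[i] is an array of shape (n_parent_configs, n_states);
    parent_probs[i] is the matching vector of parent-configuration
    probabilities (both assumed, illustrative layouts)."""
    total = 0.0
    for cpt, p_pa in zip(cpts, parent_probs):
        # Conditional entropy of the node for each parent configuration;
        # np.where guards against log(0) for structural zeros.
        cond_h = -(cpt * np.log(np.where(cpt > 0, cpt, 1.0))).sum(axis=1)
        total += p_pa @ cond_h
    return total
```

The cost is linear in the total number of CPT cells rather than exponential in the number of nodes.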
Computing the Kullback-Leibler divergence between two BNs is more demanding, and little material on it can be found in the literature. The discrete case has been thoroughly investigated by Moral et al. [21]; for GBNs and CLGBNs, however, the literature typically relies on composing the global distributions. Using the local distributions, and thus leveraging the intrinsic sparsity of BNs, we showed how to compute the Kullback-Leibler divergence exactly with greater efficiency. For GBNs, we also showed how to compute it approximately with quadratic (instead of cubic) complexity. If the two GBNs have compatible node orderings and their parameters are estimated from the same data, we can approximate their Kullback-Leibler divergence with complexity that scales with the number of parents of each node. All these results are summarised in Appendix A.
Finally, we provided step-by-step numerical examples of how to compute Shannon's entropy and the Kullback-Leibler divergence for discrete BNs, GBNs and CLGBNs (see also Appendix B). Considering that this is a highly technical topic, and that no such examples are available in the literature, we feel they are helpful in demystifying it and in integrating BNs into many general machine learning approaches.

Appendix A Computational Complexity Results
For ease of reference, we summarise here all the computational complexity results in this paper, including the type of BN and the page where they have been derived.

The joint probabilities are computed by multiplying the appropriate cells of the CPTs, for instance,
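As an illustration of multiplying CPT cells, consider a toy chain X_1 → X_2 with made-up CPTs (these numbers are not taken from the paper's examples):

```python
import numpy as np

# P(X1) and P(X2 | X1) as CPT arrays (illustrative numbers).
p_x1 = np.array([0.5, 0.5])
cpt_x2 = np.array([[0.9, 0.1],   # row: X1 state, column: X2 state
                   [0.2, 0.8]])

# A joint probability is the product of the matching CPT cells:
# P(X1 = 0, X2 = 1) = P(X1 = 0) * P(X2 = 1 | X1 = 0).
p_01 = p_x1[0] * cpt_x2[0, 1]
```

Each joint probability costs one multiplication per node, so composing the full joint table costs a multiplication per node per joint configuration.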
For X_4, we first compute the joint distribution of X_4 and X_3 by marginalising over X_1 and X_2. The configurations of the discrete variables identify the components of the mixture that make up the global distribution of B, and the associated probabilities are the probabilities of those components. We can then identify which parts of the local distributions of the N − M = 3 continuous variables (X_4, X_5 and X_6) we need to compute P(X_4, X_5, X_6 | X_1, X_2, X_3) for each element of the mixture. The graphical structure of B implies that P(X_4, X_5, X_6 | X_1, X_2, X_3) = P(X_4, X_5, X_6 | X_2, X_3) because the continuous nodes are d-separated from X_1 by their parents. As a result, mixture components will share identical distributions that depend only on the configurations of X_2 and X_3. For the mixture components with a distribution identified by {c, e}, we can treat the relevant parts of the distributions of X_4, X_5 and X_6 as the local distributions in a GBN over {X_4, X_5, X_6} with a DAG equal to the subgraph of B spanning only these nodes. If we follow the steps outlined in Section 3, we obtain
the multivariate normal distribution associated with the components {a, c, e} and {b, c, e} in the mixture. Similarly, we obtain the relevant parts of the distributions of X_4, X_5 and X_6 for {d, e}, the local distributions identified by {d, f}, and the joint distributions of X_4, X_5 and X_6 for the corresponding components.

We follow the same steps in reverse to decompose the global distribution into the local distributions. The joint distribution of X is a mixture with multivariate normal components and the associated probabilities. The latter are a function of the discrete variables X_1, X_2, X_3: rearranging them as a three-dimensional table,

X_3 = e:  0.040  0.040  |  0.036  0.084
X_3 = f:  0.160  0.160  |  0.144  0.336

gives us the typical representation of P(X_1, X_2, X_3), which we can work with by operating over the different dimensions. We can then compute the conditional probability tables in the local distributions of X_1 and X_3 by marginalising over the remaining variables. As for X_2, we marginalise over X_3 and normalise over X_1, where (multiplying the marginal probabilities of X_1 and X_2, which are marginally independent) P(X_1 = a, X_2 = c) = 0.180, P(X_1 = a, X_2 = d) = 0.350, P(X_1 = b, X_2 = c) = 0.160, P(X_1 = b, X_2 = d) = 0.310. Combining all these figures, we obtain H(B) as H(X_1) + H(X_2) + H(X_3 | X_1, X_2) + H(X_4 | X_3) = 0.691 + 0.641 + 0.536 + 0.572 = 2.440, as before.
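The rearranging-and-marginalising steps above can be sketched with a three-dimensional NumPy array holding the eight component probabilities. The axis assignment below (axis 0 for X_1, axis 1 for X_2, axis 2 for X_3) is an assumption for illustration and may not match the example's exact layout:

```python
import numpy as np

# Illustrative 2x2x2 joint table P(X1, X2, X3) holding the eight
# mixture-component probabilities (an assumed axis layout).
joint = np.array([[[0.040, 0.160], [0.036, 0.144]],
                  [[0.040, 0.160], [0.084, 0.336]]])

# Marginal of X1: sum out X2 and X3 (operate over the other dimensions).
p_x1 = joint.sum(axis=(1, 2))

# CPT of X3 given (X1, X2): normalise the table over the X3 axis.
p_x12 = joint.sum(axis=2, keepdims=True)
cpt_x3 = joint / p_x12
```

Summing over axes implements the marginalisation, and dividing by a `keepdims` sum implements the normalisation that turns a slice of the joint table into a CPT.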
In general, we would have to compute the probabilities of the parent configurations of each node using a junction tree, as follows:
1. We construct the moral graph of B, which contains the same arcs (but undirected) as its DAG, plus the edge X_1 − X_2 introduced by moralisation.
2. We compute P(X_1, X_2) = ∑_{x_3 ∈ {e, f}} P(C_1) and P(X_3) = P(S_12).

Equivalently, we can plug the σ²_{X_i}(B) into (12). The quality of the empirical approximation improves with the number of observations; for reference, we generated the data shown in Figure A1
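The empirical approximation can be sketched as a Monte Carlo average of the log-density ratio over samples drawn from the first distribution. The univariate-normal version below is a simplified stand-in for the BN setting, intended only to show why accuracy improves with the sample size:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_kl_gaussian(mu0, s0, mu1, s1, n):
    """Monte Carlo estimate of KL(P0 || P1) for two univariate normals:
    the average log-density ratio over n draws from P0.  The constant
    -log(2*pi)/2 cancels between the two log-densities, so it is
    omitted.  The estimate converges to the exact KL as n grows."""
    x = rng.normal(mu0, s0, size=n)
    log_p0 = -0.5 * ((x - mu0) / s0) ** 2 - np.log(s0)
    log_p1 = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1)
    return float(np.mean(log_p0 - log_p1))
```

For two normals N(0, 1) and N(1, 4), the exact divergence is log 2 + 2/8 − 1/2 ≈ 0.443, which the estimator approaches as n increases.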

Figure 2. DAGs and local distributions for the discrete BNs B (top) and B ′ (bottom) used in Examples 1, 4 and 5.
and det(Σ_B′) efficiently, as illustrated in the example below. (Further computing the spectral decomposition of Σ_B to obtain det(Σ_B) from the eigenvalues {λ_1(B), ..., λ_N(B)} does not improve complexity, because it just replaces one O(N³) operation with another.) We thus somewhat improve the overall complexity of KL(B ∥ B′) to O(5N³ + N² + 6N).

Example 7 (General-case KL between two GBNs). Consider the GBN B from Figure 1 (top), which we know has global distribution
and the X_D^B′ = x_D in the global distributions of B and B′.

Example 12 (General-case KL between two CLGBNs). Consider the CLGBNs B from Figure 3 (top), which we already used in Examples 3 and 11, and B′ from Figure 3 (bottom). The variables X_D^B′ identify the corresponding mixture components in the global distribution of B′. Here we follow the notation |X_D| = M and |X_G| = N − M established in Section 3.3. The first term implicitly covers the cost of computing the P(X_D^B = x_D), which relies on exact inference, like the computation of KL(X_D^B ∥ X_D^B′).

Example B.4 (Entropy of a GBN). Consider the GBN B from Figure 1 (top), whose global distribution we derived in Example 2. If we plug its covariance matrix Σ_B into the entropy formula for the multivariate normal distribution, we obtain

H(B) = N/2 + (N/2) log(2π) + (1/2) log det(Σ_B) = 2 + 3.676 + 0.5 log 0.475 = 5.304.
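This computation can be reproduced numerically with a sketch of the multivariate-normal entropy formula. The diagonal covariance below is chosen only so that its determinant equals det(Σ_B) = 0.475 from the example; it is not the actual Σ_B:

```python
import numpy as np

def mvn_entropy(sigma):
    """Entropy of an N-dimensional multivariate normal:
    H = N/2 + (N/2) log(2*pi) + (1/2) log det(Sigma)."""
    n = sigma.shape[0]
    return 0.5 * n + 0.5 * n * np.log(2.0 * np.pi) \
        + 0.5 * np.log(np.linalg.det(sigma))

# Any 4x4 covariance with determinant 0.475 reproduces the example's
# value 2 + 3.676 + 0.5 log 0.475 = 5.304.
h = mvn_entropy(np.diag([0.475, 1.0, 1.0, 1.0]))
```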

Figure A1. The DAGs for the GBNs B (top left) and B′ (bottom left) and the data (right) used in Example B.5.
The data in Figure A1 were generated from the GBN in Example 2. With a sample of size n = 100 from the same network, the empirical estimate gives KL(B ∥ B′) ≈ 1.362 against the exact value KL(B ∥ B′) = 1.373; with n = 1000, KL(B ∥ B′) ≈ 1.343 against KL(B ∥ B′) = 1.345.

Example B.6 (Entropy of a CLGBN). Consider again the CLGBN B from Figure 3 (top). For such a simple BN, we can use its global distribution (which we derived in Example B.2) directly to compute the entropies H(X_4, X_5, X_6 | ·) of the multivariate normal distributions associated with the mixture components.

We compute the cross-entropy terms for the individual variables in B and B′ as an intermediate step:

0.53 log 0.31 + 0.47 log 0.69 = −0.795;

0.070 log 0.38 + 0.089 log 0.62 + 0.110 log 0.71 + 0.261 log 0.29 + 0.053 log 0.51 + 0.076 log 0.49 + 0.107 log 0.14 + 0.235 log 0.86 = −0.807.

Working with the sets of eigenvalues Λ_B = diag({18.058, 0.741, 0.379, 0.093}) and Λ_B′ = diag({11.106, 0.574, 0.236, 0.087}) is still O(N³), because computing them requires a spectral decomposition. The Frobenius norm ∥·∥_F is O(N²), since it is the sum of the squared elements of C_B^{-1} [53, Result 4.15]. We can invert C_B′ in O(N²) time following Stewart [54, Algorithm 2.3], and then multiply C_B′^{-1} and C_B.
• Computing the determinants det(Σ_B′) and det(Σ_B): from (13), each determinant can be computed in O(N).
• Computing the quadratic term, a function of C_B and C_B′ that can be computed in O(2N²) time.

Note that, as far as the point estimate is concerned, we do not care about how wide the interval is: we only need its geometric mean to be an acceptable approximation of tr(Σ_B′^{-1} Σ_B). From Example 7, we have that tr(Σ_B′^{-1} Σ_B) = 57.087, det(Σ_B′) = 0.475 and det(Σ_B) = 0.132. The lower bound in (21) is then log det(Σ_B′) − log det(Σ_B) + 4 = 5.281, and the upper bound in (22) is derived from the same quantities.

Example 9 (Approximate KL).
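Two ingredients of this computation can be sketched concretely: the O(N) determinant, which exploits the fact that det(Σ_B) factorises into the product of the conditional variances σ²_{X_i}(B) of the local distributions, and an eigenvalue-based interval for tr(Σ_B′^{-1} Σ_B) whose geometric mean serves as a point estimate. The Ruhe-type trace inequalities below are an illustrative stand-in, not necessarily the paper's exact bounds (21)-(22):

```python
import numpy as np

def logdet_from_locals(cond_vars):
    """log det(Sigma_B) from the conditional variances sigma^2_{X_i}(B)
    of a GBN's local distributions: the determinant factorises into
    their product, so the log-determinant is an O(N) sum."""
    return float(np.sum(np.log(cond_vars)))

def trace_bracket(sigma_b, sigma_bp):
    """Bracket tr(Sigma_B'^{-1} Sigma_B) using only the eigenvalues of
    the two covariance matrices (Ruhe-type trace inequalities), and
    take the geometric mean of the interval as a point estimate."""
    lam_b = np.sort(np.linalg.eigvalsh(sigma_b))     # ascending
    lam_bp = np.sort(np.linalg.eigvalsh(sigma_bp))   # ascending
    lower = float(np.sum(lam_b / lam_bp))            # matched orders
    upper = float(np.sum(lam_b / lam_bp[::-1]))      # opposite orders
    return lower, upper, float(np.sqrt(lower * upper))
```

For any pair of positive-definite covariance matrices, the true trace always lies inside the returned interval, so the geometric mean is a cheap point estimate that avoids forming Σ_B′^{-1} Σ_B explicitly.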