Next Article in Journal
Particle Swarm Optimization-Based Unconstrained Polygonal Fitting of 2D Shapes
Next Article in Special Issue
What Is a Causal Graph?
Previous Article in Journal
A New Approach to Identifying Sorghum Hybrids Using UAV Imagery Using Multispectral Signature and Machine Learning
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Entropy and the Kullback–Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation

Istituto Dalle Molle di Studi Sull’Intelligenza Artificiale (IDSIA), 6900 Lugano, Switzerland
Algorithms 2024, 17(1), 24; https://doi.org/10.3390/a17010024
Submission received: 29 November 2023 / Revised: 31 December 2023 / Accepted: 4 January 2024 / Published: 6 January 2024
(This article belongs to the Special Issue Bayesian Networks and Causal Reasoning)

Abstract

:
Bayesian networks (BNs) are a foundational model in machine learning and causal inference. Their graphical structure can handle high-dimensional problems, divide them into a sparse collection of smaller ones, underlies Judea Pearl’s causality, and determines their explainability and interpretability. Despite their popularity, there are almost no resources in the literature on how to compute Shannon’s entropy and the Kullback–Leibler (KL) divergence for BNs under their most common distributional assumptions. In this paper, we provide computationally efficient algorithms for both by leveraging BNs’ graphical structure, and we illustrate them with a complete set of numerical examples. In the process, we show it is possible to reduce the computational complexity of KL from cubic to quadratic for Gaussian BNs.

1. Introduction

Bayesian networks [1] (BNs) have played a central role in machine learning research since the early days of the field as expert systems [2,3], graphical models [4,5], dynamic and latent variables models [6], and as the foundation of causal discovery [7] and causal inference [8]. They have also found applications as diverse as comorbidities in clinical psychology [9], the genetics of COVID-19 [10], the Sustainable Development Goals of the United Nations [11], railway disruptions [12] and industry 4.0 [13].
Machine learning, however, has evolved to include a variety of other models and reformulated them into a very general information-theoretic framework. The central quantities of this framework are Shannon’s entropy and the Kullback–Leibler divergence. Learning models from data relies crucially on the former to measure the amount of information captured by the model (or its complement, the amount of information lost in the residuals) and on the latter as the loss function we want to minimise. For instance, we can construct variational inference [14], the Expectation-Maximisation algorithm [15], Expectation Propagation [16] and various dimensionality reduction approaches such as t-SNE [17] and UMAP [18] using only these two quantities. We can also reformulate classical maximum-likelihood and Bayesian approaches to the same effect, from logistic regression to kernel methods to boosting [19,20].
Therefore, the lack of literature on how to compute the entropy of a BN and the Kullback–Leibler divergence between two BNs is surprising. While both are mentioned in Koller and Friedman [5] and discussed at a theoretical level in Moral et al. [21] for discrete BNs, no resources are available on any other type of BN. Furthermore, no numerical examples of how to compute them are available even for discrete BNs. We fill this gap in the literature by:
  • Deriving efficient formulations of Shannon’s entropy and the Kullback–Leibler divergence for Gaussian BNs and conditional linear Gaussian BNs.
  • Exploring the computational complexity of both for all common types of BNs.
  • Providing step-by-step numeric examples for all computations and all common types of BNs.
Our aim is to make apparent how both quantities are computed in their closed-form exact expressions and what is the associated computational cost.
The common alternative is to estimate both Shannon’s entropy and the Kullback–Leibler divergence empirically using Monte Carlo sampling. Admittedly, this approach is simple to implement for all types of BNs. However, it has two crucial drawbacks:
  • Using asymptotic estimates voids the theoretical properties of many machine learning algorithms: Expectation-Maximisation is not guaranteed to converge [5], for instance.
  • The number of samples required to estimate the Kullback–Leibler divergence accurately on the tails of the global distribution of both BNs is also an issue [22], especially when we need to evaluate it repeatedly as part of some machine learning algorithm. The same is true, although to a lesser extent, for Shannon’s entropy as well. In general, the rate of convergence to the true posterior in Monte Carlo particle filters is proportional to the number of variables squared [23].
Therefore, efficiently computing the exact value of Shannon’s entropy and the Kullback–Leibler divergence is a valuable research endeavour with a practical impact on BN use in machine learning. To help its development, we implemented the methods proposed in the paper in our bnlearn R package [24].
The remainder of the paper is structured as follows. In Section 2, we provide the basic definitions, properties and notation of BNs. In Section 3, we revisit the most common distributional assumptions in the BN literature: discrete BNs (Section 3.1), Gaussian BNs (Section 3.2) and conditional linear Gaussian BNs (Section 3.3). We also briefly discuss exact and approximate inferences for these types of BNs in Section 3.4 to introduce some key concepts for later use. In Section 4, we discuss how we can compute Shannon’s entropy and the Kullback–Leibler divergence for each type of BN. We conclude the paper by summarising and discussing the relevance of these foundational results in Section 5. Appendix A summarises all the computational complexity results from earlier sections, and Appendix B contains additional examples we omitted from the main text for brevity.

2. Bayesian Networks

Bayesian networks (BNs) are a class of probabilistic graphical models defined over a set of random variables X = { X 1 , , X N } , each describing some quantity of interest, that are associated with the nodes of a directed acyclic graph (DAG) G . Arcs in G express direct dependence relationships between the variables in X , with graphical separation in G implying conditional independence in probability. As a result, G induces the factorisation
P X G , Θ = i = 1 N P X i Π X i , Θ X i ,
in which the global distribution (of X , with parameters Θ ) decomposes into one local distribution for each X i (with parameters Θ X i , X Θ X i = Θ ) conditional on its parents Π X i .
This factorisation is as effective at reducing the computational burden of working with BNs as the DAG underlying the BN is sparse, meaning that each node X i has a small number of parents ( | Π X i | < c , usually with c [ 2 , 5 ] ). For instance, learning BNs from data is only feasible in practice if this holds. The task of learning a BN B = ( G , Θ ) from a data set D containing n observations comprises two steps:
P G , Θ D learning = P G D structure learning · P Θ G , D parameter learning .
If we assume that parameters in different local distributions are independent [25], we can perform parameter learning independently for each node. Each X i Π X i will have a low-dimensional parameter space Θ X i , making parameter learning computationally efficient. On the other hand, structure learning is well known to be both NP-hard [26] and NP-complete [27], even under unrealistically favourable conditions such as the availability of an independence and inference oracle [28]. However, if G is sparse, heuristic learning algorithms have been shown to run in quadratic time [29]. Exact learning algorithms, which have optimality guarantees that heuristic algorithms lack, retain their exponential complexity but become feasible for small problems because sparsity allows for tight bounds on goodness-of-fit scores and the efficient pruning of the space of the DAGs [30,31,32].

3. Common Distributional Assumptions for Bayesian Networks

While there are many possible choices for the distribution of X in principle, the literature has focused on three cases.

3.1. Discrete BNs

Discrete BNs [25] assume that both X and the X i are multinomial random variables (The literature sometimes denotes discrete BNs as “dBNs” or “DBNs”; we do not do that in this paper to avoid confusion with dynamic BNs, which are also commonly denoted as “dBNs”). Local distributions take the form
X i Π X i Mul ( π i k j ) , π i k j = P X i = k Π X i = j ;
their parameters are the conditional probabilities of X i given each configuration of the values of its parents, usually represented as a conditional probability table (CPT) for each X i . The π i k j can be estimated from data via the sufficient statistic { n i j k , i = 1 , N ; j = 1 , , q i ; k = 1 , , r i } , the corresponding counts tallied from { X i , Π X i } using maximum likelihood, Bayesian or shrinkage estimators as described in Koller and Friedman [5] and Hausser and Strimmer [33].
The global distribution takes the form of an N-dimensional probability table with one dimension for each variable. Assuming that each X i takes at most l values, the table will contain | Val ( X ) | = O ( l N ) cells, where Val ( · ) denotes the possible (configurations of the) values of its argument. As a result, it is impractical to use for medium and large BNs. Following standard practices from categorical data analysis [34], we can produce the CPT for each X i from the global distribution by marginalising (that is, summing over) all the variables other than { X i , Π X i } and then normalising over each configuration of Π X i . Conversely, we can compose the global distribution from the local distributions of the X i by multiplying the appropriate set of conditional probabilities. The computational complexity of the composition is O ( N l N ) because applying (1) for each of the l N cells yields
P X = x = i = 1 N P X i = x i Π X i = x Π X i ,
which involves N multiplications. As for the decomposition, for each node, we:
  • Sum over N | Π X i | 1 variables to produce the joint probability table for { X i , Π X i } , which contains O ( l | Π X i | + 1 ) cells. The value of each cell is the sum of O ( l N | Π X i | 1 ) probabilities.
  • Normalise the columns of the joint probability table for { X i , Π X i } over each of the O ( l | Π X i | ) configurations of values of Π X i , which involves summing O(l) probabilities and dividing them by their total.
The resulting computational complexity is
O ( l | Π X i | + 1 · l N | Π X i | 1 ) marginalisation + O ( l · l | Π X i | ) normalisation = O ( l N + l | Π X i | + 1 )
for each node and O ( N l N + l i = 1 N l | Π X i | ) for the whole BN.
Example 1 
(Composing and decomposing a discrete BN). For reasons of space, this example is presented as Example A1 in Appendix B.

3.2. Gaussian BNs

Gaussian BNs [35] (GBNs) model X with a multivariate normal random variable N( μ B , Σ B ) and assume that the X i are univariate normals linked by linear dependencies,
X i Π X i N ( μ X i + Π X i β X i , σ X i 2 ) ,
which can be equivalently written as linear regression models of the form
X i = μ X i + Π X i β X i + ε X i , ε X i N ( 0 , σ X i 2 ) .
The parameters in (3) and (4) are the regression coefficients β X i associated with the parents Π X i , an intercept term μ X i and the variance σ X i 2 . They are usually estimated by maximum likelihood, but Bayesian and regularised estimators are available as well [1].
The link between the parameterisation of the global distribution of a GBN and that of its local distributions is detailed in Pourahmadi [36]. We summarise it here for later use.
  • Composing the global distribution. We can create an N × N lower triangular matrix C B from the regression coefficients in the local distributions such that C B C B T gives Σ B after rearranging rows and columns. In particular, we:
    • Arrange the nodes of B in the (partial) topological ordering induced by G , denoted X ( i ) , i = 1 , , N .
    • The ith row of C B (denoted C B [ i ; ], i = 1, …, N) is associated with X(i). We compute its elements from the parameters of X(i) | Π X i as
      C B [ i ; i ] = σ X ( i ) 2 and C B [ i ; ] = β X i C B [ Π X ( i ) ; ] , where C B [ Π X ( i ) ; ] are the rows of C B that correspond to the parents of X(i). The rows of C B are filled following the topological ordering of the BN.
    • Compute B = C B C B T .
    • Rearrange the rows and columns of B to obtain B .
    Intuitively, we construct C B by propagating the node variances along the paths in G while combining them with the regression coefficients, which are functions of the correlations between adjacent nodes. As a result, C B C B T gives Σ B after rearranging the rows and columns to follow the original ordering of the nodes.
    The elements of the mean vector μ B are similarly computed as E ( X ( i ) ) = Π X ( i ) β X i iterating over the variables in topological order.
  • Decomposing the global distribution. Conversely, we can derive the matrix C B from Σ B by reordering its rows and columns to follow the topological ordering of the variables in G and computing its Cholesky decomposition. Then
    R = I N diag ( C B ) C B 1 ,
    contains the regression coefficients β X i in the elements corresponding to X ( i ) , Π X ( i ) (Here diag ( C B ) is a diagonal matrix with the same diagonal elements as C B and I N is the identity matrix.) Finally, we compute the intercepts μ X i as μ B R μ B by reversing the equations we used to construct μ B above.
The computational complexity of composing the global distribution is bound by the matrix multiplication C B C B T , which is O ( N 3 ) ; if we assume that G is sparse as in Scutari et al. [29], the number of arcs is bound by some c N , computing the μ B takes O(N) operations. The complexity of decomposing the global distribution is also O ( N 3 ) because both inverting C B and multiplying the result by diag ( C B ) are O ( N 3 ) .
Example 2 
(Composing and decomposing a GBN). Consider the GBN B from Figure 1 top. The topological ordering of the variables defined by B is { { X 1 , X 2 } , X 4 , X 3 } , so
C B = X 1 X 2 X 4 X 3 X 1 X 2 X 4 X 3 ( 0.894 0 1.341 1.610 0 0.774 2.014 2.416 0 0 1.049 1.258 0 0 0 0.948 )
where the diagonal elements are
C B [ X 1 ; X 1 ] = 0.8 , C B [ X 2 ; X 2 ] = 0.6 , C B [ X 4 ; X 4 ] = 1.1 , C B [ X 3 ; X 3 ] = 0.9 ;
and the elements below the diagonal are taken from the corresponding cells of
C B [ X 4 ; ] = 1.5 2.6 0.894 0 0 0 0 0.774 0 0 ,
C B [ X 3 ; ] = ( 1.2 ) ( 1.341   2.014   1.049   0 ) .
Computing C B C B T gives
Σ ˜ B = X 1 X 2 X 4 X 3 X 1 X 2 X 4 X 3 ( 0.800 0 1.200 1.440 0 0.600 1.560 1.872 1.200 1.560 6.956 8.347 1.440 1.872 8.347 10.916 )
and reordering the rows and columns of Σ ˜ B gives
Σ B = X 1 X 2 X 3 X 4 X 1 X 2 X 3 X 4 ( 0.800 0 1.440 1.200 0 0.600 1.872 1.560 1.440 1.872 10.916 8.347 1.200 1.560 8.347 6.956 )
The elements of the corresponding expectation vector μ B are then
E ( X 1 ) = 2.400 , E ( X 2 ) = 1.800 , E ( X 4 ) = 0.2 + 1.5 E ( X 1 ) + 2.6 E ( X 2 ) = 8.480 , E ( X 3 ) = 2.1 + 1.2 E ( X 4 ) = 12.276 .
Starting from Σ B , we can reorder its rows and columns to obtain Σ ˜ B . The Cholesky decomposition of Σ ˜ B is C B . Then
σ X 1 2 = C B [ X 1 ; X 1 ] 2 = 0.8 , σ X 2 2 = C B [ X 2 ; X 2 ] 2 = 0.6 , σ X 3 2 = C B [ X 3 ; X 3 ] 2 = 0.9 , σ X 4 2 = C B [ X 4 ; X 4 ] 2 = 0.11 .
The coefficients β X i of the local distributions are available from
R = I N 0.894 0 0 0 0 0.774 0 0 0 0 1.049 0 0 0 0 0.948 diag ( C B ) 1.118 0 0 0 0 1.291 0 0 1.430 2.479 0.953 0 0 0 1.265 1.054 C B 1 = X 1 X 2 X 4 X 3 X 1 X 2 X 4 X 3 ( 0 0 1.500 0 0 0 2.600 0 0 0 0 1.200 0 0 0 0 )
where we can read R X 4 , X 1 = 1.5 = β X 4 , X 1 , R X 4 , X 2 = 2.6 = β X 4 , X 2 , R X 3 , X 4 = 1.2 = β X 3 , X 4 .
We can read the standard errors of X 1 , X 2 , X 3 and X 4 directly from the diagonal elements of C B , and we can compute the intercepts from μ B R μ B which amounts to
μ X 1 = E ( X 1 ) = 2.400 , μ X 2 = E ( X 2 ) = 1.800 , μ X 4 = E ( X 4 ) E ( X 1 ) β X 4 , X 1 E ( X 2 ) β X 4 , X 2 = 0.200 , μ X 3 = E ( X 3 ) E ( X 4 ) β X 3 , X 4 = 2.100 .

3.3. Conditional Linear Gaussian BNs

Finally, conditional linear Gaussian BNs [37] (CLGBNs) subsume discrete BNs and GBNs as particular cases by combining discrete and continuous random variables in a mixture model. If we denote the former with X D and the latter with X G , so that X = X D X G , then:
  • Discrete X i X D are only allowed to have discrete parents (denoted Δ X i ), and are assumed to follow a multinomial distribution parameterised with CPTs. We can estimate their parameters in the same way as those in a discrete BN.
  • Continuous X i X G are allowed to have both discrete and continuous parents (denoted Γ X i , Δ X i Γ X i = Π X i ). Their local distributions are
    X i Π X i N μ X i , δ X i + Γ X i β X i , δ X i , σ X i , δ X i 2 ,
    which is equivalent to a mixture of linear regressions against the continuous parents with one component for each configuration δ X i Val ( Δ X i ) of the discrete parents:
    X i = μ X i , δ X i + Γ X i β X i , δ X i + ε X i , δ X i , ε X i , δ X i N 0 , σ X i , δ X i 2 .
    If X i has no discrete parents, the mixture reverts to a single linear regression like that in (4). The parameters of these local distributions are usually estimated by maximum likelihood like those in a GBN; we have used hierarchical regressions with random effects in our recent work [38] for this purpose as well. Bayesian and regularised estimators are also an option [5].
If the CLGBN comprises | X D | = M discrete nodes and | X G | = N M continuous nodes, these distributional assumptions imply the partial topological ordering
X ( 1 ) , , X ( M ) discrete nodes , X ( M + 1 ) , , X ( N ) continuous nodes .
The discrete nodes jointly follow a multinomial distribution, effectively forming a discrete BN. The continuous nodes jointly follow a multivariate normal distribution, parameterised as a GBN, for each configuration of the discrete nodes. Therefore, the global distribution is a Gaussian mixture in which the discrete nodes identify the components, and the continuous nodes determine their distribution. The practical link between the global and local distributions follows directly from Section 3.1 and Section 3.2.
Example 3 
(Composing and decomposing a CLGBN). For reasons of space, this example is presented as Example A2 in Appendix B.
The complexity of composing and decomposing the global distribution is then
O ( M l M ) . convert between CPTs and component probabilities + O ( ( N M ) 3 l Val ( Δ ) ) ( de ) compose the distinct component distributions
where Δ = X i X G Δ X i are the discrete parents of the continuous nodes.

3.4. Inference

For BNs, inference broadly denotes obtaining the conditional distribution of a subset of variables conditional on a second subset of variables. Following older terminology from expert systems [2], this is called formulating a query in which we ask the BN about the probability of an event of interest after observing some evidence. In conditional probability queries, the event of interest is the probability of one or more events in (or the whole distribution of) some variables of interest conditional on the values assumed by the evidence variables. In maximum a posteriori (“most probable explanation”) queries, we condition the values of the evidence variables to predict those of the event variables.
All inference computations on BNs are completely automated by exact and approximate algorithms, which we will briefly describe here. We refer the interested reader to the more detailed treatment in Castillo et al. [2] and Koller and Friedman [5].
Exact inference algorithms use local computations to compute the value of the query. The seminal works of Lauritzen and Spiegelhalter [39], Lauritzen and Wermuth [37] and Lauritzen and Jensen [40] describe how to transform a discrete BN or a (CL)GBN into a junction tree as a preliminary step before using belief propagation. Cowell [41] uses elimination trees for the same purpose in CLGBNs. (A junction tree is an undirected tree whose nodes are the cliques in the moral graph constructed from the BN and their intersections. A clique is the maximal subset of nodes such that every two nodes in the subset are adjacent).
Namasivayam et al. [42] give the computational complexity of constructing the junction tree from a discrete BN as O ( N w + w l w N ) where w is the maximum number of nodes in a clique and, as before, l is the maximum number of values that a variable can take. We take the complexity of belief propagation to be O ( N w l w + | Θ | ) , as stated in Lauritzen and Spiegelhalter [39] (“The global propagation is no worse than the initialisation [of the junction tree]”). This is confirmed by Pennock [43] and Namasivayam and Prasanna [44].
As for GBNs, we can also perform exact inference through their global distribution because the latter has only O ( N 2 + N ) parameters. The computational complexity of this approach is O ( N 3 ) because of the cost of composing the global distribution, which we derived in Section 3.2. However, all the operations involved are linear, making it possible to leverage specialised hardware such as GPUs and TPUs to the best effect. Koller and Friedman [5] (Section 14.2.1) note that “inference in linear Gaussian networks is linear in the number of cliques, and at most cubic in the size of the largest clique” when using junction trees and belief propagation. Therefore, junction trees may be significantly faster for GBNs when w N . However, the correctness and convergence of belief propagation in GBNs require a set of sufficient conditions that have been studied comprehensively by Malioutov et al. [45]. Using the global distribution directly always produces correct results.
Approximate inference algorithms use Monte Carlo simulations to sample from the global distribution of X through the local distributions and estimate the answer queries by computing the appropriate summary statistics on the particles they generate. Therefore, they mirror the Monte Carlo and Markov chain Monte Carlo approaches in the literature: rejection sampling, importance sampling, and sequential Monte Carlo among others. Two state-of-the-art examples are the adaptive importance sampling (AIS-BN) scheme [46] and the evidence pre-propagation importance sampling (EPIS-BN) [47].

4. Shannon Entropy and Kullback–Leibler Divergence

The general definition of Shannon entropy for the probability distribution P of X is
H P = E P ( log P ( X ) ) = Val ( X ) P ( x ) log P ( x ) d x .
The Kullback–Leibler divergence between two distributions P and Q for the same random variables X is defined as
KL P Q = E P ( X ) log P ( X ) Q ( X ) = Val ( X ) P x log P ( x ) Q ( x ) d x .
They are linked as follows:
E P ( X ) log P ( X ) Q ( X ) KL P ( X ) Q ( X ) = E P ( X ) ( log P ( X ) ) H P ( X ) + E P ( X ) log Q ( X ) H P ( X ) , Q ( X )
where H P ( X ) , Q ( X ) is the cross-entropy between P ( X ) and Q ( X ) . For the many properties of these quantities, we refer the reader to Cover and Thomas [48] and Csiszár and Shields [49]. Their use and interpretation are covered in depth (and breadth!) in Murphy [19,20] for general machine learning and in Koller and Friedman [5] for BNs.
For a BN B encoding the probability distribution of X , (6) decomposes into
H B = i = 1 N H X i Π X i B
where Π X i B are the parents of X i in B . While this decomposition looks similar to (1), we see that its terms are not necessarily orthogonal, unlike the local distributions.
As for (7), we cannot simply write
KL B B = i = 1 N KL X i Π X i B X i Π X i B
because, in the general case, the nodes X i have different parents in B and B . This issue impacts the complexity of computing Kullback–Leibler divergences in different ways depending on the type of BN.

4.1. Discrete BNs

For discrete BNs, H B does not decompose into orthogonal components. As pointed out in Koller and Friedman [5] (Section 8.4.12),
H X i Π X i B = j = 1 q i P Π X i B = j H X i Π X i B = j where H X i Π X i B = j = k = 1 r i π i k j ( B ) log π i k j ( B ) .
If we estimated the conditional probabilities π i k j ( B ) from data, the P Π X i B = j are already available as the normalising constants of the individual conditional distributions { π i k j ( B ) , j = 1 , , q i } in the local distribution of X i . In this case, the complexity of computing H X i Π X i B is linear in the number of parameters: O ( | Θ | ) = i = 1 N O ( | Θ X i | ) .
In the general case, we need exact inference to compute the probabilities P Π X i B = j . Fortunately, they can be readily extracted from the junction tree derived from B as follows:
  • Identify a clique containing both X i and Π X i B . Such a clique is guaranteed to exist by the family preservation property [5] (Definition 10.1).
  • Compute the marginal distribution of Π X i B by summing over the remaining variables in the clique.
Combining the computational complexity of constructing the junction tree from Section 3.4 and that of marginalisation, which is at most O ( l w 1 ) for each node as in (2), we have
O ( N w + w l w N ) create the junction tree + O ( N l w 1 ) compute the P Π X i B = j + O ( | Θ | ) compute H B = O ( N ( w ( 1 + l w ) + l w 1 ) + | Θ | ) ,
which is exponential in the maximum clique size w. (The maximum clique size in a junction tree is proportional to the treewidth of the BN the junction tree is created from, which is also used in the literature to characterise computational complexity in BNs.) Interestingly, we do not need to perform belief propagation, so computing H B is more efficient than other inference tasks.
Example 4 
(Entropy of a discrete BN). For reasons of space, this example is presented as Example A3 in Appendix B.
The Kullback–Leibler divergence has a similar issue, as noted in Koller and Friedman [5] (Section 8.4.2). The best and most complete explanation of how to compute it for discrete BNs is in Moral et al. [21]. After decomposing KL B B following (8) to separate H B and H B , B , Moral et al. [21] show that the latter takes the form
H B , B = i = 1 N j Val Π X i B k = 1 r i π i k j ( B ) log π i k j ( B )
where:
  • π i k j ( B ) = P X i = k , Π X i ( B ) = j is the probability assigned by B to X i = k given that the variables that are parents of X i in B take value j;
  • π i k j ( B ) = P X i = k Π X i ( B ) = j is the ( k , j ) element of the CPT of X i in B .
In order to compute the π i k j ( B ) , we need to transform B into its junction tree and use belief propagation to compute the joint distribution of X i Π X i B . As a result, H B , B does not decompose at all: each π i k j ( B ) can potentially depend on the whole BN B .
Algorithmically, to compute KL B B we:
  • Transform B into its junction tree.
  • Compute the entropy H B .
  • For each node X i :
    (a)
    Identify Π X i B , the parents of X i in B .
    (b)
    Obtain the distribution of the variables { X i , Π X i B } from the junction tree of B , consisting of the probabilities π i k j ( B ) .
    (c)
    Read the π i k j ( B ) from the local distribution of X i in B .
  • Use the π i k j ( B ) and the π i k j ( B ) to compute (10).
The computational complexity of this procedure is as follows:
O ( N ( w ( 1 + l w ) + l w 1 ) + | Θ | ) create the junction tree of B and computing H B + O ( N l c ( N w l w + | Θ | ) ) produce the π i k j ( B ) + O ( | Θ | ) compute H B , B = O ( N 2 w l w + c + N ( w + w l w + l w 1 ) + ( N l c + 2 ) | Θ | ) .
As noted in Moral et al. [21], computing the π i k j ( B ) requires a separate run of belief propagation for each configuration of the Π X i B , for a total of i = 1 N l | Π X i B | times. If we assume that the DAG underlying B is sparse, we have that | Π X i B | c and the overall complexity of this step becomes O ( N l c · ( N w l w + | Θ | ) ) , N times that listed in Section 3.4. The caching scheme devised by Moral et al. [21] is very effective in limiting the use of belief propagation, but it does not alter its exponential complexity.
Example 5 
(KL between two discrete BNs). Consider the discrete BN B from Figure 2 top. Furthermore, consider the BN B from Figure 2 bottom. We constructed the global distribution of B in Example A1; we can similarly compose the global distribution of B , shown below.
X 1 = a X 1 = b
X 2 = c X 2 = d X 2 = c X 2 = d
X 3 X 3 X 3 X 3
X 4 ef X 4 ef X 4 ef X 4 ef
g 0.013 0.033 g 0.022 0.054 g 0.016 0.144 g 0.016 0.139
h 0.072 0.062 h 0.029 0.025 h 0.013 0.040 h 0.079 0.243
Since both global distributions are limited in size, we can then compute the Kullback–Leibler divergence between B and B using (7).
KL B B = 0.013 log 0.013 0.016 log 0.016 0.022 log 0.022 0.016 log 0.016 0.072 log 0.072 0.013 log 0.013 0.029 log 0.029 0.079 log 0.079 0.033 log 0.033 0.144 log 0.144 0.054 log 0.054 0.139 log 0.139 0.062 log 0.062 0.04 log 0.04 0.025 log 0.025 0.243 log 0.243 = 0.687
In the general case, when we cannot use the global distributions, we follow the approach described in Section 4.1. Firstly, we apply (8) to write
KL B B = H B H B , B ;
we have from Example A3 that H B = 2.440 . As for the cross-entropy H B , B , we apply (10):
1. 
We identify the parents of each node in B :
Π X 1 B = { } , Π X 2 B = { X 1 , X 4 } , Π X 3 B = { X 1 } , Π X 4 B = { X 3 } .
2. 
We construct a junction tree from B and we use it to compute the distributions P X 1 , P X 2 , X 1 , X 4 , P X 3 , X 1 and P X 4 , X 3 .
X 1
ab
0.53 0.47
{ X 1 , X 4 }
{ a , g } { a , h } { b , g } { b , h }
X 2 c 0.070 0.110 0.053 0.107
d 0.089 0.261 0.076 0.235
X 1
ab
X 3 e 0.289 0.312
f 0.241 0.158
X 3
ef
X 4 g 0.120 0.167
h 0.481 0.231
3. 
We compute the cross-entropy terms for the individual variables in B and B :
H X 1 B , X 1 B = 0.53 log 0.31 + 0.47 log 0.69 = 0.795 ; H X 2 B , X 2 B = 0.070 log 0.38 + 0.089 log 0.62 + 0.110 log 0.71 + 0.261 log 0.29 + 0.053 log 0.51 + 0.076 log 0.49 + 0.107 log 0.14 + 0.235 log 0.86 = 0.807 ; H X 3 B , X 3 B = 0.289 log 0.44 + 0.241 log 0.56 + 0.312 log 0.18 + 0.158 log 0.82 = 0.943 ; H X 4 B , X 4 B = 0.120 log 0.26 + 0.481 log 0.74 + 0.167 log 0.50 + 0.231 log 0.50 = 0.582 ;
which sum up to H B , B = i = 1 N H X i B , X i B = 3.127 .
4. 
We compute KL B B = 2.440 3.127 = 0.687 , which matches the value we previously computed from the global distributions.

4.2. Gaussian BNs

H B decomposes along with the local distributions X i Π X i in the case of GBNs: from (3), each X i Π X i is a univariate normal with variance σ X i 2 ( B ) and therefore
H X i Π X i B = 1 2 log 2 π σ X i 2 ( B ) + 1 2
which has a computational complexity of O(1) for each node, O(N) overall. Equivalently, we can start from the global distribution of B from Section 3.2 and consider that
det ( Σ ) = det ( C B C B T ) = det ( C B ) 2 = i = 1 N C B [ i ; i ] 2 = i = 1 N σ X i 2 ( B )
because C B is lower triangular. The (multivariate normal) entropy of X then becomes
H B = N 2 + N 2 log 2 π + 1 2 log det ( Σ ) = N 2 + N 2 log 2 π + 1 2 i = 1 N log σ X i 2 ( B ) = i = 1 N 1 2 + 1 2 log 2 π σ X i 2 ( B ) = i = 1 N H X i Π X i B
in agreement with (12).
Example 6 
(Entropy of a GBN). For reasons of space, this example is presented as Example A4 in Appendix B.
In the literature, the Kullback–Leibler divergence between two GBNs B and B is usually computed using the respective global distributions N ( μ B , Σ B ) and N ( μ B , Σ B ) [50,51,52]. The general expression is
KL B B = 1 2 tr ( Σ B 1 Σ B ) + ( μ B μ B ) T Σ B 1 ( μ B μ B ) N + log det ( Σ B ) det ( Σ B ) ,
which has computational complexity
O ( 2 N 3 + 2 N ) compute μ B , μ B Σ B , Σ B + O ( N 3 ) invert Σ B + O ( N 3 ) multiply Σ B 1 and Σ B + O ( N ) trace of Σ B 1 Σ B + O ( N 2 + 2 N ) compute ( μ B μ B ) T Σ B 1 ( μ B μ B ) + O ( N 3 ) determinant of Σ B + O ( N 3 ) determinant of Σ B = O ( 6 N 3 + N 2 + 5 N ) .
The spectral decomposition Σ B = U Λ B U T gives the eigenvalues diag ( Λ B ) = { λ 1 ( B ) , , λ N ( B ) } to compute Σ B 1 and det ( Σ B ) efficiently as illustrated in the example below. (Further computing the spectral decomposition of Σ B to compute det ( Σ B ) from the eigenvalues { λ 1 ( B ) , , λ N ( B ) } does not improve complexity because it just replaces a single O ( N 3 ) operation with another one.) We thus somewhat improve the overall complexity of KL B B to O ( 5 N 3 + N 2 + 6 N ) .
Example 7 
(General-case KL between two GBNs). Consider the GBN B Figure 1 top, which we know has global distribution
X 1 X 2 X 3 X 4 N 2.400 1.800 12.276 8.848 , 0.800 0 1.440 1.200 0 0.600 1.872 1.560 1.440 1.872 10.916 8.347 1.200 1.560 8.347 6.956
from Example 2. Furthermore, consider the GBN B from Figure 1 bottom, which has global distribution
X 1 X 2 X 3 X 4 N 2.400 11.324 6.220 4.620 , 0.800 2.368 1.040 0.640 2.368 8.541 3.438 1.894 1.040 3.438 1.652 0.832 0.640 1.894 0.832 1.012 .
In order to compute KL B B , we first invert Σ B to obtain
Σ B 1 = 9.945 1.272 2.806 1.600 1.272 0.909 1.091 0 2.806 1.091 4.642 0 1.600 0 0 2.000 ,
which we then multiply by Σ B to compute the trace tr ( Σ B 1 Σ B ) = 57.087 . We also use Σ B 1 to compute ( μ B μ B ) T Σ B 1 ( μ B μ B ) = 408.362 . Finally, det ( Σ B ) = 0.475 , det ( Σ B ) = 0.132 and therefore
KL B B = 1 2 57.087 + 408.362 4 + log 0.475 0.132 = 230.0846 .
As an alternative, we can compute the spectral decompositions Σ B = U B Λ B U B T and Σ B = U B Λ B U B T as an intermediate step. Multiplying the sets of eigenvalues
Λ B = diag ( { 18.058 , 0.741 , 0.379 , 0.093 } ) and Λ B = diag ( { 11.106 , 0.574 , 0.236 , 0.087 } )
gives the corresponding determinants; and it allows us to easily compute
Σ B 1 = U B Λ B 1 U B T , where Λ B 1 = diag 1 11.106 , 1 0.574 , 1 0.236 , 1 0.087
for use in both the quadratic form and in the trace.
However, computing KL B B from the global distributions N ( μ B , Σ B ) and N ( μ B , Σ B ) disregards the fact that BNs are sparse models that can be characterised more compactly by ( μ B , C B ) and ( μ B , C B ) as shown in Section 3.2. In particular, we can revisit several operations that are in the high-order terms of (15):
  • Composing the global distribution from the local ones. We avoid computing Σ B and Σ B , thus reducing this step to O ( 2 N ) complexity.
  • Computing the trace tr ( Σ B 1 Σ B ) . We can reduce the computation of the trace as follows.
    • We can replace Σ B and Σ B in the trace with any reordered matrix [53] (Result 8.17): we choose to use Σ ˜ B and Σ ˜ B where Σ ˜ B is defined as before and Σ ˜ B is Σ B with the rows and columns reordered to match Σ ˜ B . Formally, this is equivalent to Σ ˜ B = P Σ ˜ B P T where P is a permutation matrix that imposes the desired node ordering: since both the rows and the columns are permuted in the same way, the diagonal elements of Σ ˜ B are the same as those of Σ ˜ B and the trace is unaffected.
    • We have Σ ˜ B = C B C B T .
    • As for Σ ˜ B , we can write Σ ˜ B = P Σ ˜ B P = ( P C B ) ( P C B ) T = C B ( C B ) T where C B = P C B is the lower triangular matrix C B with the rows re-ordered to match Σ ˜ B . Note that C B is not lower triangular unless G and G have the same partial node ordering, which implies P = I N .
    Therefore
    tr ( Σ B 1 Σ B ) = tr ( C B 1 C B ) T ( C B 1 C B ) = C B 1 C B F 2
    where the last step rests on Seber [53] (Result 4.15). We can invert C B in O ( N 2 ) time following Stewart [54] (Algorithm 2.3). Multiplying C B 1 and C B is still O ( N 3 ) . The Frobenius norm · F is O ( N 2 ) since it is the sum of the squared elements of C B 1 C B .
  • Computing the determinants det ( Σ B ) and det ( Σ B ) . From (13), each determinant can be computed in O ( N ) .
  • Computing the quadratic term ( μ B μ B ) T Σ B 1 ( μ B μ B ) . Decomposing Σ B 1 leads to
    ( μ B μ B ) T Σ B 1 ( μ B μ B ) = ( C B 1 ( μ B μ B ) ) T C B 1 ( μ B μ B ) ,
    where μ B and μ B are the mean vectors re-ordered to match C B 1 . The computational complexity is still O ( N 2 + 2 N ) because C B 1 is available from previous computations.
Combining (17), (13) and (18), the expression in (14) becomes
KL B B = 1 2 C B 1 C B F 2 + ( C B 1 ( μ B μ B ) ) T C B 1 ( μ B μ B ) N + 2 log i = 1 N C B [ i ; i ] i = 1 N C B [ i ; i ] .
The overall complexity of (19) KL is
O ( 2 N 2 + 2 N ) compute μ B , μ B C B , C B + O ( 2 N 2 + N 3 ) compute C B 1 C B F 2 + O ( N 2 + 2 N ) compute the quadratic form + O ( 2 N ) compute det ( Σ B ) , det ( Σ B ) = O ( N 3 + 5 N 2 + 6 N ) ;
while still cubic, the leading coefficient suggests that it should be about 5 times faster than the variant of (15) using the spectral decomposition.
Example 8 
(Sparse KL between two GBNs). Consider again the two GBNs from Example 7. The corresponding matrices
C B = X 1 X 2 X 4 X 3 X 1 X 2 X 4 X 3 ( 0.894 0 1.341 1.610 0 0.774 2.014 2.416 0 0 1.049 1.258 0 0 0 0.948 ) , C B = X 1 X 3 X 4 X 2 X 1 X 3 X 4 X 2 ( 0.894 1.163 0.715 2.647 0 0.548 0 0.657 0 0 0.707 0 0 0 0 1.049 )
readily give the determinants of Σ B and Σ B following (13):
det ( C B ) = ( 0.894 · 0.774 · 1.049 · 0.948 ) 2 = 0.475 , det ( C B ) = ( 0.894 · 0.548 · 0.707 · 1.049 ) 2 = 0.132 .
As for the Frobenius norm in (17), we first invert C B to obtain
C B 1 = X 1 X 3 X 4 X 2 X 1 X 3 X 4 X 2 ( 1.118 −2.373 −1.131 −1.334 0 1.825 0 −1.144 0 0 1.414 0 0 0 0 0.953 ) ;
then we reorder the rows and columns of C B to follow the same node ordering as C B and compute
1.118 0 0 0 2.373 1.825 0 0 1.131 0 1.414 0 1.334 1.144 0 0.953 0.894 0 0 0 1.610 0.948 1.258 2.416 1.341 0 1.049 2.014 0 0 0 0.774 F 2 = 57.087
which, as expected, matches the value of tr ( Σ B 1 Σ B ) we computed in Example 7. Finally, C B 1 ( μ B μ B ) in (18) is
1.118 0 0 0 2.373 1.825 0 0 1.131 0 1.414 0 1.334 1.144 0 0.953 2.400 6.220 4.620 11.324 2.400 12.1276 8.848 1.800 = 0 11.056 5.459 16.010 .
The quadratic form is then equal to 408.362 , which matches the value of ( μ B μ B ) T Σ B 1 ( μ B μ B ) in Example 7. As a result, the expression for KL B B is the same as in (16).
We can further reduce the complexity (20) of (19) when an approximate value of KL is suitable for our purposes. The only term with cubic complexity is tr ( Σ B 1 Σ B ) = C B 1 C B F 2 : reducing it to quadratic complexity or lower will eliminate the leading term of (20), making it quadratic in complexity. One way to do this is to compute a lower and an upper bound for tr ( Σ B 1 Σ B ) , which can serve as an interval estimate, and take their geometric mean as an approximate point estimate.
A lower bound is given by Seber [53] (Result 10.39):
tr ( Σ B 1 Σ B ) log det ( Σ B 1 Σ B ) + N = log det ( Σ B ) + log det ( Σ B ) + N ,
which conveniently reuses the values of det ( Σ B ) and det ( Σ B ) we have from (13). For an upper bound, Seber [53] (Result 10.59) combined with Seber [53] (Result 4.15) gives
tr ( Σ B 1 Σ B ) tr ( Σ B 1 ) tr ( Σ B ) = tr ( C B C B T ) 1 tr ( C B C B T ) = C B 1 F 2 C B F 2 ,
a function of C B and C B that can be computed in O ( 2 N 2 ) time. Note that, as far as the point estimate is concerned, we do not care about how wide the interval is: we only need its geometric mean to be an acceptable approximation of tr ( Σ B 1 Σ B ) .
Example 9 
(Approximate KL). From Example 7, we have that tr ( Σ B 1 Σ B ) = 57.087 , det ( Σ B ) = 0.475 and det ( Σ B ) = 0.132 . The lower bound in (21) is then
log det ( Σ B ) + log det ( Σ B ) + 4 = 5.281
and the upper bound in (22) is
C B 1 F 2 C B F 2 = 17.496 · 19.272 = 337.207 .
Their geometric mean is 42.199 , which can serve as an approximate value for KL B B .
If we are comparing two GBNs whose parameters (but not necessarily network structures) have been learned from the same data, we can sometimes approximate KL B B using the local distributions X i Π X i B and X i Π X i B directly. If B and B have compatible partial orderings, we can define a common total node ordering for both such that
KL B B = KL X ( 1 ) { X ( 2 ) , , X ( N ) } X N X ( 1 ) { X ( 2 ) , , X ( N ) } X N = KL X ( 1 ) Π X ( 1 ) B · · X ( N ) Π X ( N ) B X ( 1 ) Π X ( 1 ) B · · X ( N ) Π X ( N ) B .
By “compatible partial orderings”, we mean two partial orderings that can be sorted into at least one shared total node ordering that is compatible with both. The product of the local distributions in the second step is obtained from the chain decomposition in the first step by considering the nodes in the conditioning other than the parents to have associated regression coefficients equal to zero. Then, following the derivations in Cavanaugh [55] for a general linear regression model, we can write the empirical approximation
KL X i Π X i B X i Π X i B 1 2 log σ ^ X i 2 ( B ) σ ^ X i 2 ( B ) + σ ^ X i 2 ( B ) σ ^ X i 2 ( B ) 1 + 1 2 n x ^ i ( B ) x ^ i ( B ) 2 2 σ ^ X i 2 ( B )
where, following a similar notation to (4):
  • μ ^ X i ( B ) , β ^ X i ( B ) , μ ^ X i ( B ) , β ^ X i ( B ) are the estimated intercepts and regression coefficients;
  • x ^ i ( B ) and x ^ i ( B ) are the n × 1 vectors
    x ^ i ( B ) = μ ^ X i ( B ) + x [ ; X i ( B ) ] β ^ X i ( B ) , x ^ i ( B ) = μ ^ X i ( B ) + x [ ; X i ( B ) ] β ^ X i ( B ) ,
    the fitted values computed from the data observed for X i , Π X i ( B ) , Π X i ( B ) ;
  • σ X i 2 ( B ) and σ X i 2 ( B ) are the residual variances in B and B .
We can compute the expression in (23) for each node in
O ( n ( | Π X i ( B ) | + | Π X i ( B ) | + 2 ) ) compute x ^ i ( B ) and x ^ i ( B ) + O ( n ) compute the norm x ^ i ( B ) x ^ i ( B ) 2 2 = O ( n ( | Π X i ( B ) | + | Π X i ( B ) | + 5 2 ) ) ,
which is linear in the sample size if both G and G are sparse because | Π X i ( B ) | c , | Π X i ( B ) | c . In this case, the overall computational complexity simplifies to O ( n N ( 2 c + 5 2 ) ) . Furthermore, as we pointed out in Scutari et al. [29], the fitted values x ^ i ( B ) , x ^ i ( B ) are computed as a by-product of parameter learning: if we consider them to be already available, the above computational complexity is reduced to just O(n) for a single node and O ( n N ) overall. We can also replace the fitted values x ^ i ( B ) , x ^ i ( B ) in (23) with the corresponding residuals ε ^ i ( B ) , ε ^ i ( B ) because
x ^ i ( B ) x ^ i ( B ) 2 2 = ( x [ ; X i ] x ^ i ( B ) ) ( x [ ; X i ] x ^ i ( B ) ) 2 2 = ε ^ i ( B ) ε ^ i ( B ) 2 2
if the latter are available but the former are not.
Example 10 
(KL between GBNs with parameters estimated from data). For reasons of space, this example is presented as Example A5 in Appendix B.

4.3. Conditional Gaussian BNs

The entropy H B decomposes into a separate H X i Π X i B for each node, of the form (9) for discrete nodes and (12) for continuous nodes with no discrete parents. For continuous nodes with both discrete and continuous parents,
H X i Π X i B = 1 2 δ X i Val ( Δ X i ) π δ X i log 2 π σ X i , δ X i 2 ( B ) + 1 2 ,
where π δ X i represents the probability associated with the configuration δ X i of the discrete parents Δ X i . This last expression can be computed in | Val ( Δ X i ) | time for each node. Overall, the complexity of computing H B is
O ( X i X D | Θ X i | + X i X G max 1 , | Val ( Δ X i ) | ) .
where the max accounts for the fact that | Val ( Δ X i ) | = 0 when Δ X i = but the computational complexity is O(1) for such nodes.
Example 11 
(Entropy of a CLGBN). For reasons of space, this example is presented as Example A6 in Appendix B.
As for KL B B , we could not find any literature illustrating how to compute it. The partition of the nodes in (5) implies that
KL B B = KL X D B X D B discrete nodes + KL X G B X D B X G B X D B continuous nodes .
We can compute the first term following Section 4.1: X D B and X D B form two discrete BNs whose DAGs are the spanning subgraphs of B and B and whose local distributions are the corresponding ones in B and B , respectively. The second term decomposes into
KL X G B X D B X G B X D B = x D Val ( X D ) P X D B = x D KL X G B X D B = x D X G B X D B = x D
similarly to (10) and (24). We can compute it using the multivariate normal distributions associated with the X D B = x D and the X D B = x D in the global distributions of B and B .
Example 12 
(General-case KL between two CLGBNs). Consider the CLGBNs B from Figure 3 top, which we already used in Examples 3 and 11, and B from Figure 3 bottom. The variables X D B identify the following mixture components in the global distribution of B :
{ a , c , e } , { b , c , e } , { a , d , e } , { b , d , e } { e } , { a , c , f } , { b , c , f } , { a , d , f } , { b , d , f } { f } .
Therefore, B only encodes two different multivariate normal distributions.
Firstly, we construct two discrete BNs using the subgraphs spanning X D B = X D B = { X 1 , X 2 , X 3 } in B and B , which have arcs { X 1 X 2 } and { X 1 X 2 , X 2 X 3 } , respectively. The CPTs for X 1 , X 2 and X 3 are the same as in B and in B . We then compute KL X D B X D B = 0.577 following Example 5.
Secondly, we construct the multivariate normal distributions associated with the components of B following Example 3 (in which we computed those of B ). For { e } , we have
X 4 X 5 X 6 N 0.300 1.400 1.140 , 0.160 0.000 0.032 0.000 1.690 1.183 0.032 1.183 2.274 ;
for { f } , we have
X 4 X 5 X 6 N 1.000 0.500 0.650 , 0.090 0.000 0.018 0.000 2.250 1.575 0.018 1.575 2.546 .
Then,
KL X G B X D B X G B X D B = x 1 { a , b } x 2 { c , d } x 3 { e , f } P X D B = { x 1 , x 2 , x 3 } · KL X G B X D B = { x 1 , x 2 , x 3 } X G B X D B = { x 1 , x 2 , x 3 } = 0.040 × 1.721 { a , c , e } + 0.036 × 1.721 { b , c , e } + 0.040 × 2.504 { a , d , e } + 0.084 × 2.504 { b , d , e } + 0.16 × 4.303 { a , c , f } + 0.144 × 4.303 { b , c , f } + 0.16 × 6.31 { a , d , f } + 0.336 × 6.31 { b , d , f } = 4.879
and KL B B = KL X D B X D B + KL X G B X D B X G B X D B = 0.577 + 4.879 = 5.456 .
The computational complexity of this basic approach to computing KL B B is
O ( M w l w + c + M ( w + w l w + l w 1 ) + ( M l c + 2 ) | Θ X D | ) compute KL X D B X D B + O ( l M · 6 ( N M ) 3 + ( N M ) 2 + 5 ( N M ) ) compute all the KL X G B X D B = x D X G B X D B = x D ,
which we obtain by adapting (11) and (15) to follow the notation | X D | = M and | X G | = N M we established in Section 3.3. The first term implicitly covers the cost of computing the P X D B = x D , which relies on exact inference like the computation of KL X D B X D B . The second term is exponential in M, which would lead us to conclude that it is computationally unfeasible to compute KL B B whenever we have more than a few discrete variables in B and B . Certainly, this would agree with Hershey and Olsen [22], who reviewed various scalable approximations of the KL divergence between two Gaussian mixtures.
However, we would again disregard the fact that BNs are sparse models. Two properties of CLGBNs that are apparent from Examples 3 and 12 allow us to compute (26) efficiently:
  • We can reduce X G B X D B to X G B Δ B where Δ B = X i X G Δ X i B X D B . In other words, the continuous nodes are conditionally independent on the discrete nodes that are not their parents ( X D B Δ B ) given their parents ( Δ B ). The same is true for X G B X D B . The number of distinct terms in the summation in (26) is then given by | Val ( Δ B Δ B ) | which will be smaller than | Val ( X D B ) | in sparse networks.
  • The conditional distributions X G B X D B = δ and X G B X D B = δ are multivariate normals (not mixtures). They are also faithful to the subgraphs spanning the continuous nodes X G , and we can represent them as GBNs whose parameters can be extracted directly from B and B . Therefore, we can use the results from Section 4.2 to compute their Kullback–Leibler divergences efficiently.
As a result, (26) simplifies to
KL X G B X D B X G B X D B = δ Val ( Δ B Δ B ) P { Δ B Δ B } = δ KL X G B { Δ B Δ B } = δ X G B { Δ B Δ B } = δ .
where P { Δ B Δ B } = δ is the probability that the nodes Δ B Δ B take value δ as computed in B . In turn, (27) reduces to
O ( M w l w + c + M ( w + w l w + l w 1 ) + ( M l c + 2 ) | Θ X D | ) compute KL X D B X D B + O ( l | Val ( Δ B Δ B ) | · ( N M ) 3 + 5 ( N M ) 2 + 6 ( N M ) ) compute all the KL X G B { Δ B Δ B } = δ X G B { Δ B Δ B } = δ .
because we can replace l M with l | Val ( Δ B Δ B ) | , which is an upper bound to the unique components in the mixture, and because we replace the complexity in (15) with that (20). We can also further reduce the second term to quadratic complexity as we discussed in Section 4.2. The remaining drivers of the computational complexity are:
  • the maximum clique size w in the subgraph spanning X D B ;
  • the number of arcs from discrete nodes to continuous nodes in both B and B and the overlap between Δ B and Δ B .
Example 13 
(Sparse KL between two CLGBNs). Consider again the CLGBNs B and B from Example 12. The node sets Δ B = { X 2 , X 3 } and Δ B = { X 3 } identify four KL divergences to compute: Val ( Δ B Δ B ) = { c , e } , { c , f } , { d , e } , { d , f } .
KL X G B X D B X G B X D B = P { Δ B Δ B } = { c , e } KL X G B { Δ B Δ B } = { c , e } X G B { Δ B Δ B } = { c , e } + P { Δ B Δ B } = { c , f } KL X G B { Δ B Δ B } = { c , f } X G B { Δ B Δ B } = { c , f } + P { Δ B Δ B } = { d , e } KL X G B { Δ B Δ B } = { d , e } X G B { Δ B Δ B } = { d , e } + P { Δ B Δ B } = { d , f } KL X G B { Δ B Δ B } = { d , f } X G B { Δ B Δ B } = { d , f }
All the BNs in the Kullback–Leibler divergences are GBNs whose structure and local distributions can be read from B and B . The four GBNs associated with X G B { Δ B Δ B } have nodes X G B = { X 4 , X 5 , X 6 } , arcs { X 5 X 4 , X 4 X 6 } and the local distributions listed in Figure 3. The corresponding GBNs associated with X G B { Δ B Δ B } are, in fact, only two distinct GBNs associated with { e } and { f } . They have arcs { X 4 X 6 , X 5 X 6 } and local distributions: for { e } ,
X 4 = 0.3 + ε X 4 , ε X 4 N ( 0 , 0.16 ) , X 5 = 1.4 + ε X 5 , ε X 5 N ( 0 , 1.69 ) , X 6 = 0.1 + 0.2 X 4 + 0.7 X 5 + ε X 6 , ε X 6 N ( 0 , 1.44 ) ;
for { f } ,
X 4 = 1.0 + ε X 4 , ε X 4 N ( 0 , 0.09 ) , X 5 = 0.5 + ε X 5 , ε X 5 N ( 0 , 2.25 ) , X 6 = 0.1 + 0.2 X 4 + 0.7 X 5 + ε X 6 , ε X 6 N ( 0 , 1.44 ) .
Plugging in the numbers,
KL X G B X D B X G B X D B = 0.076 × 1.721 { c , e } + 0.304 × 4.303 { c , f } + 0.124 × 2.504 { d , e } + 0.496 × 6.310 { d , f } = 4.879
which matches the value we computed in Example 12.

5. Conclusions

We started this paper by reviewing the three most common distributional assumptions for BNs: discrete BNs, Gaussian BNs (GBNs) and conditional linear Gaussian BNs (CLGBNs). Firstly, we reviewed the link between the respective global and local distributions, and we formalised the computational complexity of decomposing the former into the latter (and vice versa).
We then leveraged these results to study the complexity of computing Shannon’s entropy. We can, of course, compute the entropy of a BN from its global distribution using standard results from the literature. (In the case of discrete BNs and CLGBNS, only for small networks because | Θ | grows combinatorially.) However, this is not computationally efficient because we incur the cost of composing the global distribution. While the entropy does not decompose along with the local distributions for either discrete BNs or CLGBNS, we show that it is nevertheless efficient to compute it from them.
Computing the Kullback–Leibler divergence between two BNs following the little material found in the literature is more demanding. The discrete case has been thoroughly investigated by Moral et al. [21]. However, the literature typically relies on composing the global distributions for GBNs and CGBNs. Using the local distributions, thus leveraging the intrinsic sparsity of BNs, we showed how to compute the Kullback–Leibler divergence exactly with greater efficiency. For GBNs, we showed how to compute the Kullback–Leibler divergence approximately with quadratic complexity (instead of cubic). If the two GBNs have compatible node orderings and their parameters are estimated from the same data, we can also approximate their Kullback–Leibler divergence with complexity that scales with the number of parents of each node. All these results are summarised in Table A1 in Appendix A.
Finally, we provided step-by-step numeric examples of how to compute Shannon’s entropy and the Kullback–Leibler divergence for discrete BNs, GBNs and CLGBNs. (See also Appendix B). Considering this is a highly technical topic, and no such examples are available anywhere in the literature, we feel that they are helpful in demystifying this topic and in integrating BNs into many general machine learning approaches.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Computational Complexity Results

For ease of reference, we summarise here all the computational complexity results in this paper, including the type of BN and the page where they have been derived.
Table A1. Summary of all the computational complexity results in this paper, including the type of BN and the page where they have been derived.
Table A1. Summary of all the computational complexity results in this paper, including the type of BN and the page where they have been derived.
Composing and decomposing the global distributions
O ( N l N + l i = 1 N l | Π X i | ) discrete BNsSection 3.1
O ( N 3 ) GBNsSection 3.2
O ( M l M + ( N M ) 3 l Δ ) CLGBNsSection 3.3
Computing Shannon’s entropy
O ( N ( w ( 1 + l w ) + l w 1 ) + | Θ | ) discrete BNsSection 4.1
O(N) GBNsSection 4.2
O ( X i X D | Θ X i | + X i X G max 1 , | Val ( Δ X i ) | ) CLGBNsSection 4.3
Computing the Kullback–Leibler divergence
O ( N 2 w l w + c + N ( w + w l w + l w 1 ) + ( N l c + 2 ) | Θ | ) discrete BNsSection 4.1
O ( 6 N 3 + N 2 + 5 N ) GBNsSection 4.2
O ( M w l w + c + M ( w + w l w + l w 1 ) + ( M l c + 2 ) | Θ X D | ) +
O ( l M · 6 ( N M ) 3 + ( N M ) 2 + 5 ( N M ) ) CLGBNsSection 4.3
Sparse Kullback–Leibler divergence
O ( N 3 + 5 N 2 + 6 N ) GBNsSection 4.2
O ( M w l w + c + M ( w + w l w + l w 1 ) + ( M l c + 2 ) | Θ X D | ) +
O ( l | Val ( Δ B Δ B ) | · ( N M ) 3 + 5 ( N M ) 2 + 6 ( N M ) ) CLGBNsSection 4.3
Approximate Kullback–Leibler divergence
O ( 7 N 2 + 6 N ) GBNsSection 4.2
Efficient empirical Kullback–Leibler divergence
O ( n N ( 2 c + 5 2 ) ) GBNsSection 4.2

Appendix B. Additional Examples

Example A1 
(Composing and decomposing a discrete BN). Consider the discrete BN B shown in Figure 2 (top). Composing its global distribution entails computing the joint probabilities of all possible states of all variables,
{ a , b } × { c , d } × { e , f } × { g , h } ,
and arranging them in the following four-dimensional probability table in which each dimension is associated with one of the variables.
X 1 = a X 1 = b
X 2 = c X 2 = d X 2 = c X 2 = d
X 3 X 3 X 3 X 3
X 4 ef X 4 ef X 4 ef X 4 ef
g 0.005 0.064 g 0.052 0.037 g 0.013 0.040 g 0.050 0.026
h 0.022 0.089 h 0.210 0.051 h 0.051 0.056 h 0.199 0.036
The joint probabilities are computed by multiplying the appropriate cells of the CPTs, for instance
P X = { a , d , f , h } = P X 1 = a P X 2 = d P X 3 = f X 1 = a , X 2 = d P X 4 = h X 3 = f = 0.53 · 0.66 · 0.25 · 0.58 = 0.051 .
Conversely, we can decompose the global distribution into the local distributions by summing over all variables other than the nodes and their parents. For X 1 , this means
P X 1 = a = x 2 { c , d } x 3 { e , f } x 4 { g , h } P X 1 = a , X 2 = x 2 , X 3 = x 3 , X 4 = x 4 = 0.005 + 0.064 + 0.022 + 0.089 + 0.052 + 0.037 + 0.210 + 0.051 = 0.53 , P X 1 = b = x 2 { c , d } x 3 { e , f } x 4 { g , h } P X 1 = b , X 2 = x 2 , X 3 = x 3 , X 4 = x 4 = 0.013 + 0.040 + 0.051 + 0.056 + 0.050 + 0.026 + 0.199 + 0.036 = 0.47 .
Similarly, for X 2 we obtain
P X 2 = c = x 1 { a , b } x 3 { e , f } x 4 { g , h } P X 1 = x 1 , X 2 = c , X 3 = x 3 , X 4 = x 4 = 0.005 + 0.064 + 0.022 + 0.089 + 0.013 + 0.040 + 0.051 + 0.056 = 0.34 , P X 2 = d = x 1 { a , b } x 3 { e , f } x 4 { g , h } P X 1 = x 1 , X 2 = d , X 3 = x 3 , X 4 = x 4 = 0.052 + 0.037 + 0.210 + 0.051 + 0.050 + 0.026 + 0.199 + 0.036 = 0.66 .
For X 4 , we first compute the joint distribution of X 4 and X 3 by marginalising over X 1 and X 2 ,
g h e f ( 0.005 0.022 0.064 0.089 ) { a , c } + g h e f ( 0.052 0.210 0.037 0.051 ) { a , d } + g h e f ( 0.013 0.051 0.040 0.056 ) { b , c } + g h e f ( 0.050 0.199 0.026 0.036 ) { b , d } = g h e f ( 0.120 0.481 0.167 0.232 ) ;
from which we obtain the CPT for X 4 X 3 by normalising its columns.
As for X3, we marginalise over X4 to obtain the joint distribution of X3, X1 and X2 (rows correspond to $X_3 \in \{e, f\}$, columns to the configurations of $\{X_1, X_2\}$),
$\begin{array}{c|cccc} & \{a,c\} & \{a,d\} & \{b,c\} & \{b,d\} \\ \hline e & 0.005 + 0.022 = 0.027 & 0.052 + 0.210 = 0.262 & 0.013 + 0.051 = 0.064 & 0.050 + 0.199 = 0.248 \\ f & 0.064 + 0.089 = 0.153 & 0.037 + 0.051 = 0.087 & 0.040 + 0.056 = 0.096 & 0.026 + 0.036 = 0.062 \end{array}$
and we obtain the CPT for $X_3 \mid X_1, X_2$ by normalising its columns as we did earlier with X4.
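These operations are easy to check numerically. The following sketch (illustrative Python/NumPy code, not part of the original example; the array layout and variable names are my own, with CPT values taken from Examples A1 and A3) composes the joint table from the local distributions and decomposes it again by marginalisation:

```python
import numpy as np

# CPTs of the discrete BN in Figure 2 (top), with states ordered
# X1: (a, b), X2: (c, d), X3: (e, f), X4: (g, h).
p_x1 = np.array([0.53, 0.47])                      # P(X1)
p_x2 = np.array([0.34, 0.66])                      # P(X2)
p_x3 = np.array([[[0.15, 0.85], [0.75, 0.25]],     # P(X3 | X1, X2), indexed [x1, x2, x3]
                 [[0.40, 0.60], [0.80, 0.20]]])
p_x4 = np.array([[0.20, 0.80], [0.42, 0.58]])      # P(X4 | X3), indexed [x3, x4]

# Composition: multiply the local distributions into P(X1, X2, X3, X4).
joint = np.einsum("i,j,ijk,kl->ijkl", p_x1, p_x2, p_x3, p_x4)
print(round(joint[0, 1, 1, 1], 3))                 # P(a, d, f, h) = 0.051

# Decomposition: marginalise over everything except each node and its parents.
print(joint.sum(axis=(1, 2, 3)))                   # P(X1) = [0.53 0.47]
p_x3_x4 = joint.sum(axis=(0, 1))                   # P(X3, X4), indexed [x3, x4]
print(p_x3_x4 / p_x3_x4.sum(axis=1, keepdims=True))  # CPT P(X4 | X3)
```

The einsum specification mirrors the factorisation of the global distribution into the local distributions, with one index per variable.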
Example A2 
(Composing and decomposing a CLGBN). Consider the CLGBN B from Figure 3 top. The M = 3 discrete variables at the top of the network have the joint distribution below:
{X1, X2, X3}:   {a,c,e}   {b,c,e}   {a,d,e}   {b,d,e}   {a,c,f}   {b,c,f}   {a,d,f}   {b,d,f}
probability:     0.040     0.036     0.040     0.084     0.160     0.144     0.160     0.336
Its elements identify the components of the mixture that make up the global distribution of B , and the associated probabilities are the probabilities of those components.
We can then identify which parts of the local distributions of the N − M = 3 continuous variables (X4, X5 and X6) we need to compute $\mathrm{P}(X_4, X_5, X_6 \mid X_1, X_2, X_3)$ for each element of the mixture. The graphical structure of B implies that $\mathrm{P}(X_4, X_5, X_6 \mid X_1, X_2, X_3) = \mathrm{P}(X_4, X_5, X_6 \mid X_2, X_3)$ because the continuous nodes are d-separated from X1 by their parents. As a result, the following mixture components will share identical distributions which only depend on the configurations of X2 and X3:
$\{a,c,e\}, \{b,c,e\} \rightarrow \{c,e\}; \quad \{a,d,e\}, \{b,d,e\} \rightarrow \{d,e\}; \quad \{a,c,f\}, \{b,c,f\} \rightarrow \{c,f\}; \quad \{a,d,f\}, \{b,d,f\} \rightarrow \{d,f\}.$
For the mixture components with a distribution identified by { c , e } , the relevant parts of the distributions of X 4 , X 5 and X 6 are:
$X_4 = 0.1 + 0.2 X_5 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 0.09); \qquad X_5 = 0.1 + \varepsilon_{X_5}, \quad \varepsilon_{X_5} \sim N(0, 0.09); \qquad X_6 = 0.1 + 0.2 X_4 + \varepsilon_{X_6}, \quad \varepsilon_{X_6} \sim N(0, 1).$
We can treat them as the local distributions in a GBN over { X 4 , X 5 , X 6 } with a DAG equal to the subgraph of B spanning only these nodes. If we follow the steps outlined in Section 3.2 and illustrated in Example 2, we obtain
$\begin{pmatrix} X_4 \\ X_5 \\ X_6 \end{pmatrix} \sim N\left( \begin{pmatrix} 0.120 \\ 0.100 \\ 0.124 \end{pmatrix}, \; \Sigma_{\{c,e\}}(\mathcal{B}) = \begin{pmatrix} 0.094 & 0.018 & 0.019 \\ 0.018 & 0.090 & 0.004 \\ 0.019 & 0.004 & 1.004 \end{pmatrix} \right)$
which is the multivariate normal distribution associated with the components { a , c , e } and { b , c , e } in the mixture. Similarly, the relevant parts of the distributions of X 4 , X 5 and X 6 for { d , e } are
$X_4 = 0.6 + 0.8 X_5 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 0.36); \qquad X_5 = 0.2 + \varepsilon_{X_5}, \quad \varepsilon_{X_5} \sim N(0, 0.36); \qquad X_6 = 0.1 + 0.2 X_4 + \varepsilon_{X_6}, \quad \varepsilon_{X_6} \sim N(0, 1);$
and jointly
$\begin{pmatrix} X_4 \\ X_5 \\ X_6 \end{pmatrix} \sim N\left( \begin{pmatrix} 0.760 \\ 0.200 \\ 0.252 \end{pmatrix}, \; \Sigma_{\{d,e\}}(\mathcal{B}) = \begin{pmatrix} 0.590 & 0.288 & 0.118 \\ 0.288 & 0.360 & 0.058 \\ 0.118 & 0.058 & 1.024 \end{pmatrix} \right)$
for the components { a , d , e } and { b , d , e } . For the components { a , c , f } and { b , c , f } , the local distributions identified by { c , f } are
$X_4 = 0.1 + 0.2 X_5 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 0.09); \qquad X_5 = 0.4 + \varepsilon_{X_5}, \quad \varepsilon_{X_5} \sim N(0, 0.81); \qquad X_6 = 0.1 + 0.2 X_4 + \varepsilon_{X_6}, \quad \varepsilon_{X_6} \sim N(0, 1);$
and the joint distribution of X4, X5 and X6 is
$\begin{pmatrix} X_4 \\ X_5 \\ X_6 \end{pmatrix} \sim N\left( \begin{pmatrix} 0.180 \\ 0.400 \\ 0.136 \end{pmatrix}, \; \Sigma_{\{c,f\}}(\mathcal{B}) = \begin{pmatrix} 0.122 & 0.162 & 0.024 \\ 0.162 & 0.810 & 0.032 \\ 0.024 & 0.032 & 1.005 \end{pmatrix} \right).$
Finally, the local distributions identified by { d , f } are
$X_4 = 0.6 + 0.8 X_5 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 0.36); \qquad X_5 = 0.4 + \varepsilon_{X_5}, \quad \varepsilon_{X_5} \sim N(0, 1.44); \qquad X_6 = 0.1 + 0.2 X_4 + \varepsilon_{X_6}, \quad \varepsilon_{X_6} \sim N(0, 1);$
and the joint distribution of X4, X5 and X6 for the components {a, d, f}, {b, d, f} is
$\begin{pmatrix} X_4 \\ X_5 \\ X_6 \end{pmatrix} \sim N\left( \begin{pmatrix} 0.920 \\ 0.400 \\ 0.284 \end{pmatrix}, \; \Sigma_{\{d,f\}}(\mathcal{B}) = \begin{pmatrix} 1.282 & 1.152 & 0.256 \\ 1.152 & 1.440 & 0.230 \\ 0.256 & 0.230 & 1.051 \end{pmatrix} \right).$
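In code, the forward pass that produces each of these mixture components can be sketched as follows (an illustrative NumPy snippet, not from the paper; it hard-codes the {c, e} local parameters and a topological order, and the recursion it uses is one standard way of composing a GBN from its local distributions):

```python
import numpy as np

# Local distributions of the {c, e} mixture component, node order (X4, X5, X6):
# B[i, j] is the coefficient of X_j in the equation of X_i.
intercepts = np.array([0.1, 0.1, 0.1])
B = np.array([[0.0, 0.2, 0.0],    # X4 = 0.1 + 0.2 X5 + eps,  Var(eps) = 0.09
              [0.0, 0.0, 0.0],    # X5 = 0.1 + eps,           Var(eps) = 0.09
              [0.2, 0.0, 0.0]])   # X6 = 0.1 + 0.2 X4 + eps,  Var(eps) = 1
resid_var = np.array([0.09, 0.09, 1.0])
topo = [1, 0, 2]                  # X5, then X4, then X6

mu = np.zeros(3)
Sigma = np.zeros((3, 3))
for i in topo:
    mu[i] = intercepts[i] + B[i] @ mu
    cov_row = B[i] @ Sigma        # covariances of X_i with the nodes processed so far
    Sigma[i, :] = Sigma[:, i] = cov_row
    Sigma[i, i] = B[i] @ cov_row + resid_var[i]

print(np.round(mu, 3))            # [0.12  0.1   0.124]
print(np.round(Sigma, 3))         # matches Sigma_{c,e}(B) above
```

Swapping in the parameters of the other components reproduces the remaining three mean vectors and covariance matrices.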
We follow the same steps in reverse to decompose the global distribution into the local distributions. The joint distribution of X is a mixture with multivariate normal components and the associated probabilities. The latter are a function of the discrete variables X 1 , X 2 , X 3 : rearranging them as the three-dimensional table
X1 = a:            X2 = c    X2 = d
        X3 = e      0.040     0.040
        X3 = f      0.160     0.160

X1 = b:            X2 = c    X2 = d
        X3 = e      0.036     0.084
        X3 = f      0.144     0.336
gives us the typical representation of P X 1 , X 2 , X 3 , which we can work with by operating over the different dimensions. We can then compute the conditional probability tables in the local distributions of X 1 and X 3 by marginalising over the remaining variables:
$\mathrm{P}(X_1) = \sum_{X_2 \in \{c,d\}} \sum_{X_3 \in \{e,f\}} \mathrm{P}(X_1, X_2, X_3): \quad \mathrm{P}(X_1 = a) = 0.040 + 0.160 + 0.040 + 0.160 = 0.4, \quad \mathrm{P}(X_1 = b) = 0.036 + 0.144 + 0.084 + 0.336 = 0.6;$
$\mathrm{P}(X_3) = \sum_{X_1 \in \{a,b\}} \sum_{X_2 \in \{c,d\}} \mathrm{P}(X_1, X_2, X_3): \quad \mathrm{P}(X_3 = e) = 0.040 + 0.040 + 0.036 + 0.084 = 0.2, \quad \mathrm{P}(X_3 = f) = 0.160 + 0.160 + 0.144 + 0.336 = 0.8.$
As for X 2 , we marginalise over X 3 and normalise over X 1 to obtain
$\mathrm{P}(X_2 \mid X_1) = \frac{\sum_{X_3 \in \{e,f\}} \mathrm{P}(X_1, X_2, X_3)}{\mathrm{P}(X_1)}: \quad \begin{array}{c|cc} & c & d \\ \hline a & \frac{0.040 + 0.160}{0.4} = 0.5 & \frac{0.040 + 0.160}{0.4} = 0.5 \\ b & \frac{0.036 + 0.144}{0.6} = 0.3 & \frac{0.084 + 0.336}{0.6} = 0.7 \end{array}$
The multivariate normal distributions associated with the mixture components are a function of the continuous variables X 4 , X 5 , X 6 . X 4 has only one discrete parent ( X 2 ), X 5 has two ( X 2 and X 3 ) and X 6 has none. Therefore, we only need to examine four mixture components to obtain the parameters of the local distributions of all three variables: one for which { X 2 = c , X 3 = e } , one for which { X 2 = d , X 3 = e } , one for which { X 2 = c , X 3 = f } and one for which { X 2 = d , X 3 = f } .
If we consider the first mixture component {a, c, e}, we can apply the steps described in Section 3.2 to decompose it into the local distributions of X4, X5, X6 and obtain
$X_4 = 0.1 + 0.2 X_5 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 0.09); \qquad X_5 = 0.1 + \varepsilon_{X_5}, \quad \varepsilon_{X_5} \sim N(0, 0.09); \qquad X_6 = 0.1 + 0.2 X_4 + \varepsilon_{X_6}, \quad \varepsilon_{X_6} \sim N(0, 1).$
Similarly, the third mixture component {a, d, e} yields
$X_4 = 0.6 + 0.8 X_5 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 0.36); \qquad X_5 = 0.2 + \varepsilon_{X_5}, \quad \varepsilon_{X_5} \sim N(0, 0.36); \qquad X_6 = 0.1 + 0.2 X_4 + \varepsilon_{X_6}, \quad \varepsilon_{X_6} \sim N(0, 1).$
The fifth mixture component {a, c, f} yields
$X_4 = 0.1 + 0.2 X_5 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 0.09); \qquad X_5 = 0.4 + \varepsilon_{X_5}, \quad \varepsilon_{X_5} \sim N(0, 0.81); \qquad X_6 = 0.1 + 0.2 X_4 + \varepsilon_{X_6}, \quad \varepsilon_{X_6} \sim N(0, 1).$
The seventh mixture component {a, d, f} yields
$X_4 = 0.6 + 0.8 X_5 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 0.36); \qquad X_5 = 0.4 + \varepsilon_{X_5}, \quad \varepsilon_{X_5} \sim N(0, 1.44); \qquad X_6 = 0.1 + 0.2 X_4 + \varepsilon_{X_6}, \quad \varepsilon_{X_6} \sim N(0, 1).$
Reorganising these distributions by variables we obtain the local distributions of B shown in Figure 3 top.
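Going in the opposite direction can also be scripted: decomposing a mixture component amounts to regressing each continuous node on its parents within that component's joint distribution. The sketch below (illustrative NumPy code with the parent sets hard-coded from the DAG of Figure 3 (top); treating the decomposition as sequential regressions is my reading of the steps summarised above) recovers the {c, e} local distributions from the joint parameters derived earlier:

```python
import numpy as np

# Joint parameters of the {c, e} component (Example A2), node order (X4, X5, X6).
mu = np.array([0.120, 0.100, 0.124])
Sigma = np.array([[0.094, 0.018, 0.019],
                  [0.018, 0.090, 0.004],
                  [0.019, 0.004, 1.004]])
parents = {0: [1], 1: [], 2: [0]}        # X4 <- X5, X5 <- {}, X6 <- X4
names = ["X4", "X5", "X6"]

for i, pa in parents.items():
    if pa:
        # regression coefficients, intercept and residual variance of X_i on its parents
        beta = np.linalg.solve(Sigma[np.ix_(pa, pa)], Sigma[pa, i])
        intercept = mu[i] - beta @ mu[pa]
        resid_var = Sigma[i, i] - beta @ Sigma[pa, i]
    else:
        beta, intercept, resid_var = np.array([]), mu[i], Sigma[i, i]
    print(names[i], round(float(intercept), 2), np.round(beta, 2), round(float(resid_var), 2))
# X4 0.1 [0.2] 0.09    X5 0.1 [] 0.09    X6 0.1 [0.2] 1.0
```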
Example A3 
(Entropy of a discrete BN). Consider again the discrete BN from Example A1. In this simple example, we can use its global distribution and (6) to compute
$\mathrm{H}(\mathcal{B}) = -0.005 \log 0.005 - 0.013 \log 0.013 - 0.052 \log 0.052 - 0.050 \log 0.050 - 0.064 \log 0.064 - 0.040 \log 0.040 - 0.037 \log 0.037 - 0.026 \log 0.026 - 0.022 \log 0.022 - 0.051 \log 0.051 - 0.210 \log 0.210 - 0.199 \log 0.199 - 0.089 \log 0.089 - 0.056 \log 0.056 - 0.051 \log 0.051 - 0.036 \log 0.036 = 2.440.$
In the general case, we compute H B from the local distributions using (9). Since X 1 and X 2 have no parents, their entropy components simply sum over their marginal distributions:
$\mathrm{H}(X_1) = -0.53 \log 0.53 - 0.47 \log 0.47 = 0.691, \qquad \mathrm{H}(X_2) = -0.34 \log 0.34 - 0.66 \log 0.66 = 0.641.$
For X 3 ,
$\mathrm{H}(X_3 \mid X_1, X_2) = \sum_{x_1 \in \{a,b\}} \sum_{x_2 \in \{c,d\}} \mathrm{P}(X_1 = x_1, X_2 = x_2)\, \mathrm{H}(X_3 \mid X_1 = x_1, X_2 = x_2)$
where
$\mathrm{H}(X_3 \mid X_1 = a, X_2 = c) = -0.15 \log 0.15 - 0.85 \log 0.85 = 0.423,$
$\mathrm{H}(X_3 \mid X_1 = a, X_2 = d) = -0.75 \log 0.75 - 0.25 \log 0.25 = 0.562,$
$\mathrm{H}(X_3 \mid X_1 = b, X_2 = c) = -0.40 \log 0.40 - 0.60 \log 0.60 = 0.673,$
$\mathrm{H}(X_3 \mid X_1 = b, X_2 = d) = -0.80 \log 0.80 - 0.20 \log 0.20 = 0.500;$
and where (multiplying the marginal probabilities for X1 and X2, which are marginally independent)
$\mathrm{P}(X_1 = a, X_2 = c) = 0.180, \quad \mathrm{P}(X_1 = a, X_2 = d) = 0.350, \quad \mathrm{P}(X_1 = b, X_2 = c) = 0.160, \quad \mathrm{P}(X_1 = b, X_2 = d) = 0.310;$
giving
$\mathrm{H}(X_3 \mid X_1, X_2) = 0.180 \cdot 0.423 + 0.350 \cdot 0.562 + 0.160 \cdot 0.673 + 0.310 \cdot 0.500 = 0.536.$
Finally, for X 4
$\mathrm{H}(X_4 \mid X_3) = \sum_{x_3 \in \{e,f\}} \mathrm{P}(X_3 = x_3)\, \mathrm{H}(X_4 \mid X_3 = x_3)$
where
$\mathrm{H}(X_4 \mid X_3 = e) = -0.20 \log 0.20 - 0.80 \log 0.80 = 0.500, \qquad \mathrm{H}(X_4 \mid X_3 = f) = -0.42 \log 0.42 - 0.58 \log 0.58 = 0.680;$
and $\mathrm{P}(X_3 = e) = 0.601$, $\mathrm{P}(X_3 = f) = 0.399$, giving
$\mathrm{H}(X_4 \mid X_3) = 0.601 \cdot 0.500 + 0.399 \cdot 0.680 = 0.572.$
Combining all these figures, we obtain $\mathrm{H}(\mathcal{B})$ as
$\mathrm{H}(X_1) + \mathrm{H}(X_2) + \mathrm{H}(X_3 \mid X_1, X_2) + \mathrm{H}(X_4 \mid X_3) = 0.691 + 0.641 + 0.536 + 0.572 = 2.440$
as before.
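Both routes can be verified with a few lines of code (an illustrative Python/NumPy sketch, reusing the CPTs quoted above; the helper h() and the variable names are mine):

```python
import numpy as np

def h(p):
    """Entropy (in nats) of a probability table, flattened."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_x1 = np.array([0.53, 0.47])
p_x2 = np.array([0.34, 0.66])
p_x3 = np.array([[[0.15, 0.85], [0.75, 0.25]],
                 [[0.40, 0.60], [0.80, 0.20]]])   # P(X3 | X1, X2)
p_x4 = np.array([[0.20, 0.80], [0.42, 0.58]])     # P(X4 | X3)

# (6): entropy of the global distribution.
joint = np.einsum("i,j,ijk,kl->ijkl", p_x1, p_x2, p_x3, p_x4)
print(f"{h(joint):.3f}")                          # 2.440

# (9): sum of the (weighted) entropies of the local distributions.
p_x1x2 = np.outer(p_x1, p_x2)                     # P(X1, X2), X1 and X2 being independent
h_x3 = np.sum(p_x1x2 * np.apply_along_axis(h, 2, p_x3))
p_x3_marginal = joint.sum(axis=(0, 1, 3))         # P(X3) = (0.601, 0.399)
h_x4 = np.sum(p_x3_marginal * np.apply_along_axis(h, 1, p_x4))
print(f"{h(p_x1) + h(p_x2) + h_x3 + h_x4:.3f}")   # 2.440
```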
In general, we would have to compute the probabilities of the parent configurations of each node using a junction tree as follows:
1. We construct the moral graph of B, which contains the same arcs (but undirected) as its DAG plus the undirected edge between X1 and X2.
2. We identify two cliques C1 = {X1, X2, X3} and C2 = {X3, X4} and a separator S12 = {X3}.
3. We connect them to create the junction tree C1 - S12 - C2.
4. We initialise the cliques with the respective distributions P(C1) = P(X1, X2, X3), P(C2) = P(X3, X4) and P(S12) = P(X3).
5. We compute $\mathrm{P}(X_1, X_2) = \sum_{x_3 \in \{e,f\}} \mathrm{P}(\mathbf{C}_1)$ and $\mathrm{P}(X_3) = \mathrm{P}(\mathbf{S}_{12})$.
Example A4 
(Entropy of a GBN). Consider the GBN B from Figure 1 top, whose global distribution we derived in Example 2. If we plug its covariance matrix Σ B into the entropy formula for the multivariate normal distribution we obtain
$\mathrm{H}(\mathcal{B}) = \frac{4}{2} + \frac{4}{2} \log 2\pi + \frac{1}{2} \log \det(\Sigma(\mathcal{B})) = 2 + 3.676 + 0.5 \log 0.475 = 5.304.$
Equivalently, plugging the $\sigma^2_{X_i}(\mathcal{B})$ into (12) we have
$\mathrm{H}(\mathcal{B}) = \sum_{i=1}^{N} \mathrm{H}(X_i \mid \Pi_{X_i}^{\mathcal{B}}) = \frac{1}{2} \left[ \log(2\pi \cdot 0.8) + \log(2\pi \cdot 0.6) + \log(2\pi \cdot 0.9) + \log(2\pi \cdot 1.1) \right] + \frac{4}{2} = 5.304.$
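Both expressions are immediate to evaluate given the conditional variances (a short sketch; the values 0.8, 0.6, 0.9 and 1.1 are those quoted above, and the second line relies on the fact that $\det \Sigma(\mathcal{B})$ factorises into the product of the conditional variances):

```python
import numpy as np

cond_var = np.array([0.8, 0.6, 0.9, 1.1])   # conditional variances of the local distributions
N = cond_var.size

# Entropy from the local distributions, as in (12).
print(f"{0.5 * np.sum(np.log(2 * np.pi * cond_var)) + N / 2:.3f}")                   # 5.304

# Entropy from the global distribution: det(Sigma(B)) = product of conditional variances.
print(f"{N / 2 + (N / 2) * np.log(2 * np.pi) + 0.5 * np.log(np.prod(cond_var)):.3f}")  # 5.304
```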
Example A5 
(KL between GBNs with parameters estimated from data). Consider the DAGs for the BNs B and B′ and the 10 observations shown in Figure A1. The partial topological ordering of the nodes in B is {{X1, X2}, X4, X3} and that in B′ is {X1, X2, {X3, X4}}: the total ordering that is compatible with both is {X1, X2, X4, X3}.
If we estimate the parameters of the local distributions of B by maximum likelihood we obtain
$X_1 = 2.889 + \varepsilon_{X_1}, \quad \varepsilon_{X_1} \sim N(0, 0.558),$
$X_2 = 1.673 + \varepsilon_{X_2}, \quad \varepsilon_{X_2} \sim N(0, 1.595),$
$X_3 = 0.896 + 1.299 X_4 + \varepsilon_{X_3}, \quad \varepsilon_{X_3} \sim N(0, 1.142),$
$X_4 = -2.095 + 2.222 X_1 + 2.613 X_2 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 1.523),$
and the associated fitted values are
$\hat{x}_1(\mathcal{B}) = (2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889),$
$\hat{x}_2(\mathcal{B}) = (1.673, 1.673, 1.673, 1.673, 1.673, 1.673, 1.673, 1.673, 1.673, 1.673),$
$\hat{x}_3(\mathcal{B}) = (17.293, 14.480, 8.675, 13.937, 14.846, 12.801, 13.449, 2.394, 9.670, 14.381),$
$\hat{x}_4(\mathcal{B}) = (13.307, 11.447, 5.852, 8.635, 8.475, 9.018, 10.370, 2.376, 7.014, 10.489).$
Similarly, for B′ we obtain
$X_1 = 2.889 + \varepsilon_{X_1}, \quad \varepsilon_{X_1} \sim N(0, 0.558),$
$X_2 = 3.505 - 0.634 X_1 + \varepsilon_{X_2}, \quad \varepsilon_{X_2} \sim N(0, 1.542),$
$X_3 = 7.284 + 2.933 X_2 + \varepsilon_{X_3}, \quad \varepsilon_{X_3} \sim N(0, 6.051),$
$X_4 = 5.151 + 2.120 X_2 + \varepsilon_{X_4}, \quad \varepsilon_{X_4} \sim N(0, 3.999),$
and the associated fitted values are
$\hat{x}_1(\mathcal{B}') = (2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889, 2.889),$
$\hat{x}_2(\mathcal{B}') = (1.207, 2.304, 1.778, 1.625, 1.754, 2.044, 1.127, 2.037, 0.840, 2.019),$
$\hat{x}_3(\mathcal{B}') = (15.529, 17.760, 9.408, 11.931, 12.261, 14.009, 11.918, 6.528, 7.019, 15.564),$
$\hat{x}_4(\mathcal{B}') = (11.110, 12.722, 6.686, 8.509, 8.748, 10.011, 8.500, 4.604, 4.959, 11.135).$
Therefore,
$\|\hat{x}_1(\mathcal{B}) - \hat{x}_1(\mathcal{B}')\|_2^2 = 0, \quad \|\hat{x}_2(\mathcal{B}) - \hat{x}_2(\mathcal{B}')\|_2^2 = 2.018, \quad \|\hat{x}_3(\mathcal{B}) - \hat{x}_3(\mathcal{B}')\|_2^2 = 54.434, \quad \|\hat{x}_4(\mathcal{B}) - \hat{x}_4(\mathcal{B}')\|_2^2 = 21.329;$
and the values of the Kullback–Leibler divergence for the individual nodes are
$\mathrm{KL}\big(X_1 \mid \Pi_{X_1}^{\mathcal{B}} \,\big\|\, X_1 \mid \Pi_{X_1}^{\mathcal{B}'}\big) \approx \frac{1}{2}\left[\log\frac{0.558}{0.558} + \frac{0.558}{0.558} - 1\right] + \frac{1}{20} \cdot \frac{0}{0.558} = 0,$
$\mathrm{KL}\big(X_2 \mid \Pi_{X_2}^{\mathcal{B}} \,\big\|\, X_2 \mid \Pi_{X_2}^{\mathcal{B}'}\big) \approx \frac{1}{2}\left[\log\frac{1.542}{1.595} + \frac{1.595}{1.542} - 1\right] + \frac{1}{20} \cdot \frac{2.018}{1.542} = 0.066,$
$\mathrm{KL}\big(X_3 \mid \Pi_{X_3}^{\mathcal{B}} \,\big\|\, X_3 \mid \Pi_{X_3}^{\mathcal{B}'}\big) \approx \frac{1}{2}\left[\log\frac{6.051}{1.142} + \frac{1.142}{6.051} - 1\right] + \frac{1}{20} \cdot \frac{54.434}{6.051} = 0.878,$
$\mathrm{KL}\big(X_4 \mid \Pi_{X_4}^{\mathcal{B}} \,\big\|\, X_4 \mid \Pi_{X_4}^{\mathcal{B}'}\big) \approx \frac{1}{2}\left[\log\frac{3.999}{1.523} + \frac{1.523}{3.999} - 1\right] + \frac{1}{20} \cdot \frac{21.329}{3.999} = 0.440,$
which sum up to $\mathrm{KL}(\mathcal{B} \,\|\, \mathcal{B}') \approx 1.383$. The exact value, which we can compute as shown in Section 4.2, is 1.692.
The quality of the empirical approximation improves with the number of observations. For reference, we generated the data in Figure A1 from the GBN in Example 2. With a sample of size n = 100 from the same network, the approximation gives $\mathrm{KL}(\mathcal{B} \,\|\, \mathcal{B}') \approx 1.362$ against an exact value of 1.373; with n = 1000, it gives $\approx 1.343$ against an exact value of 1.345.
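The node-wise terms above are straightforward to reproduce from the summary quantities reported in this example (a sketch in NumPy; it takes the residual variances and the squared differences between fitted values as given instead of recomputing them from the data in Figure A1, with n = 10):

```python
import numpy as np

n = 10
var_b      = np.array([0.558, 1.595, 1.142, 1.523])   # residual variances under B
var_bprime = np.array([0.558, 1.542, 6.051, 3.999])   # residual variances under B'
sq_diff    = np.array([0.000, 2.018, 54.434, 21.329]) # ||xhat_i(B) - xhat_i(B')||^2

kl_nodes = 0.5 * (np.log(var_bprime / var_b) + var_b / var_bprime - 1) \
           + sq_diff / (2 * n * var_bprime)
print(np.round(kl_nodes, 3))       # [0.    0.066 0.878 0.44 ]
print(f"{kl_nodes.sum():.3f}")     # 1.383
```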
Figure A1. The DAGs for the GBNs B (top left) and B′ (bottom left) and the data (right) used in Example A5.
Example A6 
(Entropy of a CLGBN). Consider again the CLGBN B from Figure 3 (top). For such a simple BN, we can use its global distribution (which we derived in Example A2) directly to compute the entropies of the multivariate normal distributions associated with the mixture components
$\mathrm{H}(X_4, X_5, X_6 \mid \{c,e\}) = \frac{3}{2} + \frac{3}{2} \log(2\pi) + \frac{1}{2} \log \det \Sigma_{\{c,e\}}(\mathcal{B}) = 1.849,$
$\mathrm{H}(X_4, X_5, X_6 \mid \{d,e\}) = \frac{3}{2} + \frac{3}{2} \log(2\pi) + \frac{1}{2} \log \det \Sigma_{\{d,e\}}(\mathcal{B}) = 3.235,$
$\mathrm{H}(X_4, X_5, X_6 \mid \{c,f\}) = \frac{3}{2} + \frac{3}{2} \log(2\pi) + \frac{1}{2} \log \det \Sigma_{\{c,f\}}(\mathcal{B}) = 2.947,$
$\mathrm{H}(X_4, X_5, X_6 \mid \{d,f\}) = \frac{3}{2} + \frac{3}{2} \log(2\pi) + \frac{1}{2} \log \det \Sigma_{\{d,f\}}(\mathcal{B}) = 3.928;$
and to combine them by weighting with the component probabilities
$\mathrm{H}(X_4, X_5, X_6 \mid X_1, X_2, X_3) = \underbrace{0.040 \cdot 1.849}_{\{a,c,e\}} + \underbrace{0.036 \cdot 1.849}_{\{b,c,e\}} + \underbrace{0.040 \cdot 3.235}_{\{a,d,e\}} + \underbrace{0.084 \cdot 3.235}_{\{b,d,e\}} + \underbrace{0.160 \cdot 2.947}_{\{a,c,f\}} + \underbrace{0.144 \cdot 2.947}_{\{b,c,f\}} + \underbrace{0.160 \cdot 3.928}_{\{a,d,f\}} + \underbrace{0.336 \cdot 3.928}_{\{b,d,f\}} = 3.386.$
The entropy of the discrete variables is
$\mathrm{H}(X_1, X_2, X_3) = -0.040 \log 0.040 - 0.036 \log 0.036 - 0.040 \log 0.040 - 0.084 \log 0.084 - 0.160 \log 0.160 - 0.144 \log 0.144 - 0.160 \log 0.160 - 0.336 \log 0.336 = 1.817$
and then $\mathrm{H}(\mathcal{B}) = \mathrm{H}(X_1, X_2, X_3) + \mathrm{H}(X_4, X_5, X_6 \mid X_1, X_2, X_3) = 5.203.$
If we use the local distributions instead, we can compute the entropy of the discrete variables using (9) from Section 4.1:
$\mathrm{H}(X_1) = -0.4 \log 0.4 - 0.6 \log 0.6 = 0.673,$
$\mathrm{H}(X_2 \mid X_1) = 0.4 \,(-0.5 \log 0.5 - 0.5 \log 0.5) + 0.6 \,(-0.3 \log 0.3 - 0.7 \log 0.7) = 0.644,$
$\mathrm{H}(X_3) = -0.2 \log 0.2 - 0.8 \log 0.8 = 0.500.$
We can compute the entropy of the continuous variables with no discrete parents using (12) from Section 4.2:
$\mathrm{H}(X_6 \mid X_4) = \frac{1}{2} \log(2\pi \cdot 1) + \frac{1}{2} = 1.419.$
Finally, we can compute the entropy of the continuous variables with discrete parents using (24) from Section 4.3:
$\mathrm{H}(X_4 \mid X_2, X_5) = 0.38 \left[ \frac{1}{2} \log(2\pi \cdot 0.09) + \frac{1}{2} \right] + 0.62 \left[ \frac{1}{2} \log(2\pi \cdot 0.36) + \frac{1}{2} \right] = 0.645,$
$\mathrm{H}(X_5 \mid X_2, X_3) = 0.076 \left[ \frac{1}{2} \log(2\pi \cdot 0.09) + \frac{1}{2} \right] + 0.124 \left[ \frac{1}{2} \log(2\pi \cdot 0.36) + \frac{1}{2} \right] + 0.304 \left[ \frac{1}{2} \log(2\pi \cdot 0.81) + \frac{1}{2} \right] + 0.496 \left[ \frac{1}{2} \log(2\pi \cdot 1.44) + \frac{1}{2} \right] = 1.322.$
As before, we confirm that overall
$\mathrm{H}(\mathcal{B}) = \mathrm{H}(X_1) + \mathrm{H}(X_2 \mid X_1) + \mathrm{H}(X_3) + \mathrm{H}(X_4 \mid X_2, X_5) + \mathrm{H}(X_5 \mid X_2, X_3) + \mathrm{H}(X_6 \mid X_4) = 0.673 + 0.644 + 0.500 + 0.645 + 1.322 + 1.419 = 5.203.$
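The same bookkeeping can be written out in a few lines (an illustrative sketch; the parent-configuration probabilities 0.38, 0.62 and 0.076, 0.124, 0.304, 0.496 are taken from the example rather than derived from the network, and the helper functions are mine):

```python
import numpy as np

def h_discrete(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def h_gaussian(var):
    return 0.5 * np.log(2 * np.pi * var) + 0.5

# Discrete nodes, as in (9): entropies of the CPT columns, weighted by the
# probabilities of the parent configurations.
h_x1 = h_discrete([0.4, 0.6])
h_x2 = 0.4 * h_discrete([0.5, 0.5]) + 0.6 * h_discrete([0.3, 0.7])
h_x3 = h_discrete([0.2, 0.8])

# Continuous nodes: conditional Gaussian entropies, weighted by the probabilities
# of the discrete parent configurations quoted in Example A6.
h_x6 = h_gaussian(1.0)
h_x4 = 0.38 * h_gaussian(0.09) + 0.62 * h_gaussian(0.36)
weights_x5   = np.array([0.076, 0.124, 0.304, 0.496])
variances_x5 = np.array([0.09, 0.36, 0.81, 1.44])
h_x5 = np.sum(weights_x5 * h_gaussian(variances_x5))

print(f"{h_x1 + h_x2 + h_x3 + h_x4 + h_x5 + h_x6:.3f}")   # 5.203
```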

Figure 1. DAGs and local distributions for the GBNs B (top) and B′ (bottom) used in Examples 2 and 6–9.
Figure 2. DAGs and local distributions for the discrete BNs B (top) and B′ (bottom) used in Examples 1, 4 and 5.
Figure 3. DAGs and local distributions for the CLGBNs B (top) and B′ (bottom) used in Examples 3 and 11–13.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
