Gaussian Mixture Reduction for Time-Constrained Approximate Inference in Hybrid Bayesian Networks

Hybrid Bayesian Networks (HBNs), which contain both discrete and continuous variables, arise naturally in many application areas (e.g., image understanding, data fusion, medical diagnosis, fraud detection). This paper concerns inference in an important subclass of HBNs, the conditional Gaussian (CG) networks, in which all continuous random variables have Gaussian distributions and all children of continuous random variables must be continuous. Inference in CG networks can be NP-hard even for special-case structures, such as poly-trees, where inference in discrete Bayesian networks can be performed in polynomial time. Therefore, approximate inference is required. In approximate inference, it is often necessary to trade off accuracy against solution time. This paper presents an extension to the Hybrid Message Passing inference algorithm for general CG networks and an algorithm for optimizing its accuracy given a bound on computation time. The extended algorithm uses Gaussian mixture reduction to prevent an exponential increase in the number of Gaussian mixture components. The trade-off algorithm performs pre-processing to find optimal run-time settings for the extended algorithm. Experimental results for four CG networks compare performance of the extended algorithm with existing algorithms and show the optimal settings for these CG networks.


INTRODUCTION
A Bayesian Network (BN) [Pearl, 1988] is a probabilistic graphical model that represents a joint distribution on a set of random variables in a compact form that exploits conditional independence relationships among the random variables. The random variables (RVs) are represented as nodes in a directed acyclic graph (DAG) in which a directed edge represents a direct dependency between two nodes and no directed cycles are allowed in the graph. Bayesian Networks have become a powerful tool for representing uncertain knowledge and performing inference under uncertainty. They have been applied in many domains, such as Image Understanding, Data fusion, Medical diagnosis, and Fraud detection, and have become a powerful tool in inference for the real world.
Hybrid Bayesian Network (HBNs) can contain both discrete and continuous RVs. An important subclass, the conditional Gaussian (CLG) networks, consists of networks in which all discrete random variables have only discrete parents, all continuous random variables have Gaussian distributions, and the conditional distribution of any Gaussian RV is linear in its Gaussian parents. Exact inference methods exist for CLG networks [Lauritzen, 1992] [Lauritzen & Jensen, 2001]. However, even in special cases for which exact inference in discrete Bayesian Networks (BNs) is tractable, exact inference in CLG networks can be NP-hard [Lerner & Parr, 2001]. In particular, the posterior marginal distribution for each individual Gaussian random variable is a mixture of Gaussian distributions, and the number of components needed to compute the exact distribution for a given random variable may be exponential in the number of discrete variables in the network. Furthermore, no exact algorithms exist for general CG networks. Therefore, approximate inference for CG networks is an important area of research.
SP algorithms draw random samples to use for inference and can handle BNs of arbitrary structure. Henrion [1988] presented a basic sampling approach, called logic sampling, for approximate inference in discrete Bayesian networks. Logic sampling generates samples beginning at root nodes and following links to descendant nodes, terminating at leaf nodes of the graph. If a sampled realization contains an evidence node whose value does not match the observed value, it is rejected. The result is a sample from the conditional distribution of the sampled nodes given the observed values of the evidence nodes. This rejection strategy may require a very large number of samples to converge to an acceptable inference result. Further, this strategy cannot be applied when there are continuous evidence nodes, because in this case all samples would be rejected. Fung & Chang [1989] suggested a method that sets all evidence variables to their observed values, samples only the non-evidence variables, and weights each sample by the likelihood of the evidence variables given the sampled non-evidence variables. This likelihood weighting algorithm, which can be applied when evidence nodes are continuous, has become very popular. However, when the evidence configuration is highly unlikely, this method can result in very poor accuracy. Pearl [1988] proposed a Gibbs sampling approach for Bayesian networks. His algorithm is a specific case of the more general class of Markov Chain Monte Carlo algorithms [Gilks et al, 1996]. Efficiency of Gibbs sampling can be dramatically improved by sampling only a subset (called a cutset) of random variables that breaks all loops in the graph, and performing exact inference on the remaining singly connected network [Bidyuk & Dechter, 2007]. Nevertheless, for any SP algorithm very large numbers of samples may be required for challenging BNs, such as those with complex topologies, very unlikely evidence configurations, and/or deterministic or near-deterministic relationships.
DS algorithms change a hybrid BN to a discrete BN by discretizing all continuous RVs in the hybrid BN. This approach changes a continuous variable to a set of intervals, called a bin. After the change, the discretized BN is handled by a discrete inference algorithm (e.g., [Pearl, 1988]). Kozlov & Koller [1997] provided an improved discretization by efficiently adjusting the shape of a continuous RV. However, DS algorithms start with approximation for discretization and this approximation can cause inaccurate posterior distributions. Accuracy can be improved with finer discretization, but at the cost of possibly major additional cost in time and space. Furthermore, there is a time cost for discretization, and a need for methods to choose the granularity of the distribution to balance accuracy against computation cost.
SA algorithms change an intractable hybrid BN (e.g., conditional nonlinear Gaussian network) to a tractable hybrid BN (e.g., conditional linear Gaussian network). After changing to a tractable hybrid BN, a hybrid inference algorithm which can handle the tractable hybrid BN is used for inference. Shenoy [2006] proposed a SA algorithm in which any type of a continuous RV can be approximated by a mixture of Gaussian distributions, thus converting an arbitrary hybrid BN to a CG BN. He showed how various hybrid BNs (e.g., non-Gaussian HBN, nonlinear HBN, a HBN with a continuous parent and a discrete child, and a HBN with non-constant variance) can be converted to a CG BN. Although SA algorithms can treat various types of HBNs, they require an appropriate CG inference algorithm for the converted HBN. CL algorithms handle loops by converting the original BN to a graph of clusters in which each node corresponds to a cluster of nodes from the original BN, such that the graph of clusters contains no loops. A conversion step is required to form clusters from the original BN. Among CL approaches, the popular Junction Tree (JT) algorithm has been adapted for inference in CG networks [Lauritzen, 1992] [Lauritzen & Jensen, 2001]. However, constraints required by the Lauritzen algorithm [Lauritzen, 1992] [Lauritzen & Jensen, 2001] on the form of the junction tree tend to result in cliques containing very many discrete nodes. Because inference is exponential in the number of discrete nodes in a cluster, the algorithm is often intractable even when a tractable clustering approach exists for a discrete network of the same structure [Lerner & Parr, 2001]. For this reason, it is typically necessary to resort to approximate inference. Gaussian mixture reduction (GMR) has been suggested as an approximation approach [Lerner, 2002]. GMR approximates a Gaussian mixture model (GMM) with a GMM having fewer components.
In MP algorithms, each node in the BN sends messages to relevant nodes along paths between the relevant nodes. The messages contain information to update the distributions of the relevant nodes. After updating, each of the nodes computes its marginal distribution. If the BN has loops, message passing may not converge. MP algorithms are also subject to the problem of uncontrolled growth in the number of mixture components. A GMR approach has been proposed to address this issue  . However, they provided no general algorithm for applying GMR within the MP algorithm. Park et al. [2015] introduced a general algorithm for MP using GMR 1 , but included no guidance on how to trade off between accuracy and computational resources in hybrid MP using GMR. This paper presents a complete solution to the hybrid inference problem by providing two algorithms: Hybrid Message Passing (HMP) with Gaussian Mixture Reduction (GMR) and Optimal Gaussian Mixture Reduction (Optimal GMR).
The HMP-GMR algorithm prevents exponential growth of Gaussian mixture components in MP algorithms for inference in CG Bayesian networks. We present an extension of the algorithm of  ] that incorporates GMR to control complexity, and examine its performance relative to competing algorithms.
Each inference algorithm has its own characteristics. For example, some algorithms are faster and some are more accurate. Further, accuracy and speed can depend on the Bayesian network and the specific pattern of evidence. These characteristics can be used as guidance for choosing an inference method for a given problem. Metrics for evaluating an inference algorithm include speed, accuracy, and resource usage (e.g., memory or CPU usage). In some situations, algorithm speed is the most important factor. In other cases, accuracy may be more important. For example, early stage missile tracking may require a high speed algorithm for estimating the missile trajectory, while matching faces in a security video against a no-fly database may prioritize accuracy over speed. The HMP-GMR algorithm requires a maximum number of Gaussian components as an input parameter. This maximum number of components influences both accuracy and execution time of the HMP-GMR algorithm. We introduce a preprocessing algorithm called HMP-GMR with Optimal Settings (HMP-GMR-OS), which optimizes the initial settings for HMP-GMR to provide the best accuracy on a given HBN under a bound on computation time. The HMP-GMR-OS algorithm is intended for cases in which a given HBN will be used repeatedly in a time-limited situation, and a pre-processing step is desired to balance accuracy against speed of inference. Sampling approaches have been used for such situations, because of their anytime property. That is, sampling always provides an answer even if it runs out of time. In some cases, our algorithm can result in better accuracy than a sampling approach for the same execution time.
The layout of this paper is as follows. Section 2 introduces Hybrid Message Passing Inference and Gaussian Mixture Reduction. Section 3 presents the HMP-GMR algorithm, which combines the two methods introduced in Section 2. Section 4 proposes the OCB algorithm to find the optimal number of allowable components in any given mixture distribution. Section 5 presents experimental results on the advantages and disadvantages of the new algorithm. Section 6 draws conclusions.

II. PRELIMINARIES
In this section, we introduce message passing inference for CG BNs and component reduction for Gaussian mixture models.

Structure of Hybrid Bayesian Network
A general hybrid BN can contain both discrete and continuous nodes. A node in a hybrid BN can be categorized according to its type (i.e., discrete or continuous), its parent node type(s) (i.e., discrete, continuous, or hybrid with at least one discrete and one continuous node), and its child node type(s) (i.e., discrete, continuous, or hybrid  [Sun, 2007]. That is, a conditional hybrid BN may contain Types 1, 2, 3, 11, 14, and 17 from Table 1. These six cases are shown in Fig. 1 below. In the figure, a rectangle indicates a discrete node and a circle indicates a continuous node. For example, Type 1 has a discrete node B with its discrete parent node A and discrete child node C, while Type 3 has a discrete node B with its discrete parent node A and hybrid child nodes C and Y. A general hybrid BN places no restriction on the type of probability distribution for a continuous node. If all continuous nodes in a hybrid BN have Gaussian probability distributions, the BN is called Gaussian hybrid BN. A BN that is both a conditional hybrid BN and a Gaussian hybrid BN is called a conditional Gaussian (CG) BN. CG BNs can be further classified into two sub-categories: conditional linear Gaussian (CLG) BNs and conditional nonlinear Gaussian (CNG) BNs. For the CLG BNs, the Gaussian conditional distributions are always linear functions of the Gaussian parents. That is, if X is a continuous node X with n continuous parents U 1 , …, U n and m discrete parents A 1 , …, A m , then the conditional distribution p( | , ) given parent states U=u and A=a has the following form: p( | , ) = �L ( ) ( ), ( ) �, (1) where L (a) (u) = ( ) + 1 ( ) is a linear function of the continuous parents, with intercept ( ) , coefficients ( ) , and standard deviation ( ) that depend on the state a of the discrete parents.
A Gaussian conditional distribution for a continuous node in a CNG BN can be any function of the Gaussian and discrete parents. The form is similar to Equation 1 except that L (a) (u) can be a nonlinear function. Note that the types of Fig. 1 cover any number of parent and child nodes. Thus, the discrete node B can have a set of discrete parent nodes (i.e., A = {A 1 , A 2 , ..., A l }), a set of discrete child nodes (i.e., C = {C 1 , C 2 , ..., C m }), and/or a set of continuous child nodes (i.e., Y = {Y 1 , Y 2 , ..., Y n }). The continuous node X can have a set of discrete parent nodes (i.e., A = {A 1 , A 2 , ..., A l }), a set of continuous parent nodes (i.e., U = {U 1 , U 2 , ..., U m }), and a set of continuous child nodes (i.e., Y = {Y 1 , Y 2 , ..., Y n }). We use this notation to introduce message passing inference for a discrete BN, a continuous BN, and a hybrid BN.

Message Passing Inference for Discrete BN
Message passing inference for a discrete BN was introduced in [Pearl, 1988]. A discrete BN contains only discrete nodes (i.e., Type 1 in Fig. 1). The objective of inference is to compute the function BEL( ) = P( | ) (2) for each node in the Bayesian network. Here, B denotes a node with its associated RV, is a set of evidence events consisting of state assignments for nodes in the network, and BEL( ) is the conditional distribution of B given the values of the evidence RVs. If the BN is a polytree (has no undirected cycles), the evidence e can be split into two components, + and − , where + relates only to non-descendants of B and − relates only to descendants of B. Equation 2 can be decomposed as follows: where α denotes a normalizing constant. The second line is valid because in a polytree, + and − are independent given B. The factors relating to non-descendants and descendants are denoted π( ) and λ( ) , as shown in the third line of the equation. These factors are called the Pi and Lambda functions, respectively.
The Pi function π( ) and Lambda function λ( ) can be written as follows: and λ( ) = � λ ( ) (4) where π B ( ) and λ ( ) denote a Pi message from the parent to B and Lambda message from the child C j to B, respectively. These messages can be written as follows: Note that the Lambda function λ(B) is similar to the Pi message π ( ) except that the Pi message includes a factor of π( ), includes Lambda messages all children of the parent except the target child node j, and includes a normalizing constant α . A similar relationship exists between the Pi function π( ) and the Lambda message λ ( ). The Lambda message multiplies by λ( ) and includes Pi messages only from parent nodes other than the target parent node i.
The MP algorithm is given as follows.
• Initialization: For any evidence node B=b, set π( ) = λ( ) to 1 for B=b and 0 for B≠b. For any non-evidence node with no parents, set π( ) to the prior distribution for B. For any non-evidence node B with no children, set λ( ) uniformly equal to 1.
This algorithm finds exact values of all π( ) if the network is a polytree. The algorithm can also be applied to BNs containing undirected cycles. It is not guaranteed to converge, but when it converges, it often results in a good approximation to the correct posterior probabilities [Murphy et al, 1999].

Message Passing Inference for Continuous BNs
Type 14 in Fig. 1 is a BN in which all nodes are continuous. The MP algorithm can be extended to this case by defining Pi/Lambda messages as follows [Sun, 2007]. These messages can be computed exactly for linear Gaussian networks. For general Gaussian networks, the messages can be approximated using the unscented transformation [Uhlmann, 1995] to project mean and covariance estimates through nonlinear transformations.
The Pi and Lambda functions for a continuous node X in a Gaussian network can be written as follows: π( ) = � P( | ) � π ( ) d (7) and λ( ) = � λ ( ) (8) where d is the m-dimensional differential with = {U 1 , U 2 , ..., U m }, π X ( ) denotes the Pi message from the continuous parent U i to X, and λ j ( ) denotes the Lambda message from the continuous child Y j to X. These Pi and Lambda messages can be written as follows: and (10) where d � is the (m -1)-dimensional differential with � = {s | s ∈ U and s ≠ U j }.
The Pi function for a discrete node B (i.e., Types 1, 2, and 3) is given by Equation 3. The Pi function for the continuous node X of Type 17 is given as follows: (11) where d is the m-dimensional differential with = {U 1 , U 2 , ..., U m }, π ( ) denotes the Pi message from the discrete parent to X and π ( ) denotes the Pi message from the continuous parent U j to X. Derivation of Pi functions for Types 11 and 14 is straightforward by use of Equation 11.
The Lambda function for Types 11, 14, and 17 is given by Equation 8. The Lambda function for the discrete node B of Type 3 is given as follows: (12) where λ ( )denotes the Lambda message from the discrete child node C i to B and λ ( ) denotes the Lambda message from the continuous child node Y j to B. Derivation of Lambda functions for Types 1 and 2 is straightforward by use of Equation 12.
The Pi message for Types 11, 14, and 17 is given by Equation 9. The Pi message for Type 3 is given as follows: The Lambda message for Types 1, 2, and 3 is given by Equation 6. The Lambda message for the continuous node X of Type 17 can be written as follows: (14) and and denotes a state of . The first equation is for the message to the continuous parent U j and the second equation is for the message to the discrete parent A i . These apply respectively to Types 11 and 14.

B. Gaussian Mixture Reduction
Gaussian mixture reduction (GMR) approximates an Mcomponent GMM with a reduced number N<M of components. Several methods for GMR have been proposed, e.g., [Salmond, 1990] [West, 1993] A straightforward method for performing GMR is the following: (1) Find the two closest components in a GMM according to a distance criterion.
(2) Merge the two selected components into one component.
(3) Update to a GMM with one fewer component.
where i and j denote the i-th and j-th component of a GMM, respectively, and , , are the weight, mean, and covariance of the k-th component, respectively.
Recently, a more efficient algorithm using constraint optimization was proposed  .

III. EXTENDED HYBRID MESSAGE PASSING ALGORITHM
The previous sections introduced message passing inference and Gaussian mixture reduction. This section combines these methods into an extended hybrid message passing algorithm for CG BNs.
The GMR operation is denoted as a function, τ (gmm, max_nc) that applies Equation 16, where gmm is a Gaussian mixture model and max_nc is the maximum number of allowable mixture components.
To specify the algorithm, we need to define where in the inference process GMR will be applied. For this, the Pi/Lambda functions for X and Pi/Lambda messages U→X in Type 14 and {A, U}→X in Type 17 are chosen. Hence, the function τ(gmm, max_nc) is applied to Equations 7,8,9,10,11,14,and 15. For the extended algorithm, these equations become: where M = max_nc denotes the maximum allowable number of components.
The above equations are implemented in the following algorithm, called Hybrid Message Passing Algorithm with Gaussian Mixture Reduction (HMP-GMR). The following algorithm is an extension of a Hybrid Message Passing algorithm (HMP) from [Sun, 2007] to which we apply GMR. of their anytime property. An initial version of this algorithm was introduced in [Park et al., 2015]. In contrast with the initial version, this is an anytime algorithm that can provide a solution even if it is interrupted before completion. SendLambdaMsg (j, max_nc, max_time) 7 for j = 1, … until the number of nodes in net 8 belij ← compute belief function using λj and πj (2)  9 diffij ← compare distribution difference between belij and bel(i-1)j 10 max_diff ← get maximum difference between diffij and max_diff 11 if max_diff < max_prcs then break 12 return a set of belij if parent of j is continuous then do (17)  if max_time < exe_time do not update the Pi message from one of (5) and (20) Procedure SendLambdaMsg ( j, // a current node max_time, // a maximum execution time max_nc // a maximum allowable number of components ) 1 for k = 1, … until number of parents of j 2 if j is discrete then do (6) for k 3 else if j is continuous then 4 if parent of j is discrete then do (15) 5 if parent of j is continuous then do (21) with M = max_nc  6 if parent of j is hybrid then do (22), (23) with M = max_nc 7 exe_time ← get a current execution time 8 if max_time < exe_time do not update the Lambda message from one of (6), (15), (21), (22), and (23) HMP-GMR has five inputs. The first input, net, is the Hybrid BN with specified evidence nodes and their values. The second input, max_time, is the maximum execution time used to control how long this algorithm runs by comparing to a current execution time, exe_time, representing a period from the algorithm starting time to the current time. The third input, max_iteration, is the maximum number of iterations allowed, where an iteration is one round in which all nodes in the BN perform their operations. The fourth input, max_nc, is the maximum number of Gaussian components that may be output by the GMR function τ. The fifth input, max_prcs, is a threshold on the distance between posterior distributions of nodes in the current and previous iterations. The algorithm terminates when the distance is lower than the threshold. HMP-GMR outputs approximate posterior distributions of all nodes.
Given these inputs, the algorithm proceeds as follows: (1) The algorithm iterates message passing from 1 to the maximum number of iterations or until it is interrupted due to exceeding the time limit. There are three exit points from the iteration: (1) when the iteration reaches the maximum number of allowable iterations, (2) when the maximum difference is less than the maximum precision, and (3) when the current execution time for the algorithm exceeds the maximum execution time.

IV. OPTIMIZING THE SETTINGS OF HMP-GMR
In some situations, the Hybrid Message Passing with Gaussian Mixture Reduction (HMP-GMR) algorithm performs better than other algorithms. For example, although in theory a sampling algorithm can be made as accurate as desired, for a given problem, HMP-GMR may have higher accuracy for a given limit on computation time. However, HMP-GMR requires initial settings before it executes. The performance of HMP-GMR depends on these initial settings. More specifically, HMP-GMR requires that the maximum allowable number of components max_nc and the maximum number of allowable iterations max_iteration are specified as inputs. If the maximum allowable number of components is too small, accuracy may be too low; but if it is too large, execution time may be unacceptably long. Also, the maximum number of allowable iterations can influence accuracy and execution time. The number of components required to achieve a given accuracy depends on the network topology, the placement of continuous and discrete nodes, and the conditional distributions. As noted above, when the BN contains loops, the HMP-GMR algorithm may not converge. Thus, in some problems, HMP-GMR may spend many iterations without a significant improvement in accuracy.
Therefore, there is a need to trade off accuracy against execution time depending on the maximum allowable number of components and the maximum number of iterations. Different applications pose different requirements on execution time. It is assumed that the maximum allowable execution time for inference is an input parameter that is specified before the inference algorithm runs. Therefore, the optimization problem is defined as attaining the best achievable accuracy for a given constraint on execution time, by varying the maximum allowable number of components and the maximum allowable number of iterations.
Finding an exact optimum would be infeasible in the general case. Therefore, this section presents a Monte Carlo method to find approximately optimal values for a specific conditional Gaussian Bayesian network. The algorithm is appropriate for problems in which a given HBN is specified a priori, and inference on the HBN will be performed repeatedly in a time-restricted setting with limits on execution time for inference. The optimization can be performed offline as a preprocessing step to find good initial settings for performing HMP-GMR inference at run time. For example, a real-time threat detection system might use a CG HBN to process sensor inputs automatically and issue alarms when suspicious combinations of sensor readings occur. Because the system runs in real time, fast execution is essential. At design time, an offline optimization can be run using the algorithm presented here to determine the best settings for the maximum number of components and the maximum number of iterations.
An optimization problem for this situation can be formulated as shown Equation 24. In this setting, we assume that a specific HBN is given.
subject to: t < max_time (24) where means a set of the candidate maximum allowable numbers of components { 1 , 2 , ..., n }, means a set of the candidate maximum iteration numbers { 1 , 2 , ..., m }, f(.) is an objective function measuring error of HMP-GMR, t means the current execution time for inference of HMP-GMR, and max_time means the maximum execution time. We call this as HMP-GMR with Optimal Settings (HMP-GMR-OS), which finds the values (nc and it) that achieve the best accuracy under a given time restriction. Equation 25 shows the objective function f(.) of HMP-GMR-OS.
where means a set of the candidate evidences { 1 , 2 , ..., l }, which are randomly selected, for a Bayesian network net, err(.) means a function resulting in an error between a near-correct inference result from sampling and an inference result from HMP-GMR, s(.) means a sampling inference algorithm used for exact inference, h(.) means the HMP-GMR algorithm with a maximum execution time max_time, a candidate maximum allowable number of components nc, and a candidate maximum iteration number it.
The Monte Carlo method for HMP-GMR-OS is called an HMP-GMR-OS algorithm (Algorithm 2), which finds the best values for the two decision variables given a Hybrid BN, a maximum execution time, a number of samples, an upper limit on the maximum number of iterations, and an upper limit on the maximum allowable number of components. for i = 1, … until num_samples 2 ei ← generate randomly a set of evidence values from net 3 si ← inference using a sampling algorithm with net and ei 4 for j = 1, … until ul_max_nc 5 for k = 1, … until ul_max_it 6 hjk ← inference using HMP-GMR with net, ei, max_time, j, and k 7 rijk ← calculate an inference error value between si and hjk 8 rjk ← add rijk to a set of inference error values rjk for j and k 9 for j = 1, … until ul_max_nc 10 for k = 1, … until ul_max_it 11 avg_rjk ← calculate an average inference error for rjk 12 [nc, it] ← select best values for nc and it in (25) using avg_rjk 13 return [nc, it] The HMP-GMR-OS algorithm has five inputs. The first input net is a Hybrid BN. The second input max_time is the maximum execution time for inference of HMP-GMR. The third input num_samples is the number indicating how many times the simulation should be repeated. The fourth input ul_max_it is the number of maximum iterations used for inference of HMP-GMR. The fifth input ul_max_nc is the number indicating how many the number of maximum allowable number of components will be investigated.
Given these inputs, the algorithm proceeds as follows: (1) The algorithm simulates the given number of samples. (2) The algorithm randomly selects some evidence nodes from the Hybrid BN net. Also, it randomly selects a reasonable evidence value for each evidence node (i.e., a highly unlikely value is not used for the evidence value) and provides a i-th set of evidence values e i . (3) The set of the evidence values are used for inference of a sampling algorithm by which nearly correct results of inference s i (i.e., posterior distributions) are found. (4) The maximum allowable number of components, denoted by j, is varied from 1 to the upper limit on the maximum allowable number of components ul_max_nc. (5) The maximum number of iterations, denoted by k, is varied from 1 to the upper limit on the maximum number of iterations ul_max_it. (6) This algorithm uses the HMP-GMR algorithm with the Hybrid BN net, the set of the evidence vlaues e i , the maximum execution time max_time, the maximum allowable number of components j, and the maximum number of iterations k. Then, the HMP-GMR algorithm provides the results h jk (i.e., posterior distributions). (7) An inference error value r ijk between the nearly correct results s i and the HMP-GMR's result h jk is computed by using a distance function (e.g., KLdivergence [Kullback & Leibler, 1951]). (8) The inference error value r ijk at i-th sample for j and k is stored at a set of inference error values r jk for j and k. After simulating all samples, for all j (9) and k (10), (11) an average inference error avg_r jk is calculated using the set of the inference error values r jk . (12) A best maximum allowable number of components nc and a best maximum number of iterations it are selected by finding a minimum average inference error from avg_r jk . (13) The algorithm outputs the best values nc and it.
In summary, the HMP-GMR-OS algorithm is a preprocessing algorithm finding the optimal settings for HMP-GMR given a HBN to improve accuracy before HMP-GMR for the HBN executes for a practical situation.

V. EXPERIMENT
This section presents experiments to evaluate the performance of the HMP-GMR algorithm and the Optimal GMR algorithm. For evaluation of the HMP-GMR algorithm, Park et al. [2015] presented simple experiments to demonstrate scalability and efficiency using two hybrid BNs. Here, more extensive experiments are performed on four BNs. These four BNs would be representative BNs in terms of various numbers of discrete parent nodes and various numbers of loopy structures in a given network. Fig. 2 shows two illustrative Conditional Gaussian (CG) BNs (i.e., 1 and 2) containing a discrete node A with 4 states, a continuous node X j , and another continuous node Y j . These two BNs differ in the links between Y j and Y j+1 . The first, shown on the left in Fig. 2, has no undirected cycles involving only continuous nodes, while the second, shown on the right in Fig.  2, has undirected cycles among. For example, between X 1 and Y 2 , there are two paths: X 1 → X 2 →Y 2 and X 1 →Y 1 →Y 2 .  Fig. 3 shows two additional cases in which the BNs contain a large number of discrete nodes. Each discrete node A i has four states and the continuous nodes X i and Y j are conditional Gaussians. Again, the difference between the two BNs is that the third has no undirected cycles involving only continuous nodes, while the fourth has loopy structure for the continuous nodes. The experiments examined both CLG and CNG BNs with these four CG BNs. The size of the four CG BNs was varied by adjusting n. Some of the leaf continuous nodes {Y 1 , ... , Y n } were randomly selected as evidence nodes. All other nodes were unobserved. Table 2 shows characteristics of these four CG BNs. In these BNs, the discrete nodes have four states. Therefore, in CG BNs 1 and 2, there are four discrete states for discrete node A, while in CG BNs 3 and 4, there are 4 n total configurations of the discrete states. This is 4 7 = 16384 configurations when n = 7. When n = 7, CG BNs 1 and 4 contain 21 cycles, while CG BN 2 contains 501 cycles 2 . CG BN 3 contains no cycles. CG   The following factors were varied in the experiment: (1) Type of hybrid BN (i.e., BN 1, 2, 3, or 4; CLG or CNG BN), (2) type of inference algorithm (i.e., Hybrid Junction Tree (Hybrid-JT) [Lauritzen, 1992] [Lauritzen & Jensen, 2001], original Hybrid MP (HMP) [Sun, 2007], Hybrid MP with Gaussian Mixture Reduction (HMP-GMR) [Park et al., 2015], or Likelihood Weighting (LW) sampling [Fung & Chang, 1989]), (3) number of repeated structures n, (4) algorithm characteristics (i.e., the number of GMM components allowed, the number of allowable message passing iterations, and the maximum precision for the message passing algorithms). The dependent variables were accuracy of result and execution time. 2 Cycles were derived by a cycle finding algorithm [Johnson, 1975].
For all experiments, the convergence criterion for HMP and HMP-GMR was max_prs = 10 -3 .
Using these settings, we conducted three experiments: (1) A comparison between HMP, HMP-GMR, and Hybrid-JT investigated scalability of HMP-GMR to complex networks (i.e., larger values of n), (2) the HMP-GMR itself was evaluated for posterior distribution accuracy and execution time, and (3) optimal settings derived from the HMP-GMR-OS algorithm were evaluated by inference accuracy. The experiments were run on a 3.40GHz Intel Core i7-3770 processor. The algorithms were implemented in the Java programming language. In a Java code for HMP-GMR, there was another exit point from the iteration of the HMP-GMR algorithm in Section 3. In some cases, a computation between Lambda values and Pi values in the HMP-GMR algorithm could not be computed due to numeric underflow. This happened when the HMP-GMR algorithm diverged. When this occurred, the HMP-GMR algorithm stopped and provided its current solution.

A. Scalability of HMP-GMR
The first experiment examined improvement in scalability of HMP-GMR over HMP and Hybrid-JT.
The initial setting of this experiment consists of (1) maximum of 4 components output by GMR and (2) 100 iteration limit for each of HMP and HMP-GMR. Eight CG BNs (i.e., conditional linear/nonlinear cases for CG BNs 1, 2, 3, and 4) were run with HMP, HMP-GMR, and Hybrid-JT using the following inputs and outputs. The input value of n for both BNs was varied from 1 to 10. The output value is the execution time.  Results for CG BNs 1 and 2 show a similar pattern. The HMP algorithm with no GMR exceeded the time limit at n = 7 and n = 4, respectively. Execution times for HMP-GMR were higher than those for Hybrid-JT in both cases. The increase in execution time for both HMP-GMR and Hybrid-JT was linear in n. In Fig. 5, results from HMP and Hybrid-JT show exponential growth in execution time, while execution time of HMP-GMR increased linearly. For HMP, the execution time limit for CG BNs 3 and 4 was exceeded at n = 7 and n = 4, respectively. For Hybrid-JT, the execution time limit for CG BNs 3 and 4 was exceeded at n = 9.
Results for the CNG networks showed similar patterns, and are not shown here for brevity. These experiments showed that HMP-GMR is scalable to large BNs for both linear and nonlinear CG networks. However, scalability alone is not sufficient. Accuracy and good operational performance are also essential.

B. Accuracy and Efficiency of HMP-GMR
In this experiment, we investigated the accuracy and convergence of HMP-GMR for the four CLG BNs. To evaluate accuracy of HMP-GMR, exact inference results using Hybrid-JT inference were used. Some of the runs using Hybrid-JT stopped because of the exponential growth of components, so Hybrid-JT produced posterior distributions only for n ≤ 7. For this reason, this experiment used n = 7 for the four CLG BNs. Accuracy was measured by KL-divergence [Kullback & Leibler, 1951] (lower values mean better accuracy). We calculated the KL-divergence between exact and approximate results for each unobserved node, and summed them (henceforth, we use KL-divergences to mean the sum of KLdivergences over unobserved nodes). The number of runs in the experiment was 100. The maximum allowable number of components was nc = 2. The maximum number of iterations was it = 10000. The maximum execution time was max_time = 200000 millisecond (ms). In the experiment, there were three exit points: (1) When the algorithm converged, (2) when the time limit was exceeded, and (3) when the algorithm diverged. When the algorithm did not converge, the algorithm halted and provided its current solution. Fig. 6 shows percentages for each CLG BN for which the algorithm converged, diverged, and ran out of time. For CLG BN 1, the algorithm converged in 97% of the runs and ran out of time in 3% of the cases (execution time > 200000 ms). For CLG BN 2, the algorithm converged in 52% of the runs, ran out of time in 3% of the runs, and diverged in 45% of the runs. For CLG BN 3, the algorithm converged in 100% of the runs. For CLG BN 4, the algorithm converged in only 31% of the runs and ran out of time in 69% of the runs. We observed that the large number of cycles (CLG BN 2) could cause many situations in which the algorithm did not converge (i.e., diverged or ran out of time). Also, the algorithm for the cases with no cycles (CLG BN 3) always converged. For the cases in which the algorithm ran out of time, if more time had been allowed it might have converged, diverged, or failed to either converge or diverge (i.e., oscillated). In some cases for CLG BNs 1, 2, and 4, the algorithm oscillated until reaching the maximum execution time and halting. When there were many cycles and many discrete states (i.e., for CLG BN 4), the algorithm often ran out of time.  Table 3. Average accuracies and average execution times for the four CLG BNs Table 3 shows averages (avg.) for KL-divergence over runs and average execution times for three cases (converged, diverged, and ran out of time) on the four CLG BNs (numbers in parentheses are standard deviations). Fig. 7 shows the accuracy results from this experiment, when the algorithm converged. In Fig. 7, the four lanes denote the four CLG BNs. The vertical axis on the upper chart denotes KL-divergence values. For case of convergence, the averages for KLdivergence over runs of the experiment were 0.0001, 1.04, 2.25, and 3.64 for CLG BNs 1, 2, 3, and 4, respectively. Again for the case of convergence, the average execution times were 1514, 16547, 2164, and 9584 for CLG BNs 1, 2, 3, and 4, respectively. For the case of convergence, because of the large number (501) of cycles in CLG BN 2, the inference algorithm required more time (avg. 16547 ms) to converge than others. Also, the algorithm for CLG BN 4 spent more time (avg. 9584 ms) in comparison with the case (avg. 2164 ms) of CLG BN 3, because of the large number (21) of cycles in CLG BN 4. In the case of divergence, the algorithm for CLG BN 2 halted with numeric underflow in the Lambda or Pi value computation. In this case, the algorithm performed with very poor accuracy (avg. 106.35) and had a long execution time (avg. 9678 ms).
Some cases for CLG BNs 1, 2, and 4 stopped because the maximum execution time was exceeded. This never happened in CLG BN 3, which contained no cycles. In addition, accuracies for CLG BNs 1 and 2 were better than those for CLG BN 4. The four sets of results depicted in Table 3 illustrated how network topology influences accuracy and execution time. In this experiment, we used arbitrary settings (i.e., nc = 2 and it = 10000) for HMP-GMR. In the next section, we investigate whether the performance of HMP-GMR can be improved by optimizing these settings.

C. Optimal Settings for HMP-GMR
HMP-GMR requires initial settings (i.e., nc and it), which influence accuracy and execution time. To find good initial settings, the HMP-GMR-OS algorithm was introduced in Section 4. With this algorithm, we use the following experiment setting: (1) the Hybrid BNs (i.e., CLG BNs 1, 2, 3, and 4 with n = 7), (2) the maximum execution time (i.e., max_time = 3000 ms), (3) the number of samples (i.e., num_samples = 50), (4) the upper limit on the maximum allowable number of components (i.e., ul_max_nc = 10), (5) the upper limit on the maximum number of iterations (i.e., ul_num_it = 10), and (6) the Hybrid-JT algorithm to obtain correct inference results.   Table 4 shows minimum averages for KL-divergences in Fig. 8   For CLG BNs 1 and 3, the optimal number of iterations was the upper limit of 10. A better value might have been found if a larger number of iterations had been investigated. For CLG BNs 2 and 4, the best value for the maximum number of iterations was smaller than 10. Note that better results might be obtained, if the number of samples was increased and/or the ranges of nc and/or it were expanded.
From this experiment, we observed that good accuracy can be achieved with a small number of components. Although the best values found by our experiment for nc were 5, 4, 6, and 1 for CLG BNs 1, 2, 3, and 4, respectively, the accuracy was not much better than using a single component. For example, the average of KL-divergence for CLG BN 1 was 0.78 at nc = 5 and it = 10, while the average of KL-divergence for CLG BN 1 was 1.09 at nc = 1 and it = 10. To check whether the difference was statistically significant, paired t-tests were performed at the 5% significance level. Table 5 shows confidence intervals from the paired t-tests. From these tests, we observed that the difference between the setting found by HMP-GMR-OS and the case with nc = 1 and it = 10 was not statistically significant for any of the CLG BNs. CLG  This result suggests choosing nc = 1 as a default setting for HMP-GMR. This default setting for HMP-GMR was evaluated for accuracy against the Likelihood Weighting (LW) algorithm. Fig. 9 shows accuracy comparison between HMP-GMR with the default setting nc = 1 and LW sampling for the four CLG BNs. We use the following Experiment setting: (1) type of the hybrid BNs (i.e., CLG BNs 1, 2, 3, and 4 with n = 7), (2) type of inference algorithm (i.e., HMP-GMR at nc = 1 and it = 10, and LW sampling), (3) 100 samples, (4) the maximum execution time (i.e., max_time = 3000 ms), and (5) the Hybrid-JT algorithm to obtain correct inference results. Fig. 9 shows results from this experiment. When HMP-GMR didn't converge, it stopped at the maximum execution time and provided its current solution. The vertical axis denotes KL-divergence values. The chart contains eight lanes for four groups (CLG BNs 1, 2, 3, and 4). In the two adjoined lanes for each group, the left lane denotes the HMP-GMR case, while the right lane denotes the LW case. For example, the first lane in Fig. 9 denotes the HMP-GMR case for CLG BN 1 (i.e., HG 1), while the second lane in Fig. 9 denotes the LW case for CLG BN 1 (i.e., LW 1). The execution times for the two cases in each group were set to similar values. That is, the number of samples for LW was controlled to achieve similar execution times as HMP-GMR.   Table 6. Comparison between three algorithms on averages of KLdivergences Fig. 10 shows accuracy comparison between LW and HMP-GMR for the four CLG BNs. The X axis denotes KLdivergence for HMP-GMR. The Y axis denotes KL-divergence for LW. For CLG BN 1, HMP-GMR provided much better accuracy than LW. For CLG BN 2, LW provided better accuracy than HMP-GMR. For CLG BN 3, HMP-GMR provided better accuracy than LW. For CLG BN 4, LW and HMP-GMR performed similarly, but as can be seen in Fig.9, the results for LW were more variable. Accuracy from HMP-GMR was lower in comparison with LW, for the CLG BN containing many loops. CLG BN 2 contained 501 cycles, while CLG BNs 1 and 4 contained 21 cycles and CLG BN 3 contained no cycles. More extensive investigations would be needed to determine whether this superiority of LW over HMP-GMR generalizes to arbitrary BNs with many loops. Accuracies from HMP-GMR did not depend much on the number of discrete states. CLG BN 1 contained four discrete states, while the CLG BN 3 contained 16384 configurations of the discrete states. For these networks, HMP-GMR provided better accuracy than LW regardless of how many discrete states a CLG BN contains.
We can consider whether to use HMP-GMR or LW according to the features of the CLG BN. The following list shows suggestions in which HMP-GMR can be chosen or not in terms of the number of configurations of the discrete states and the number of cycles.

Small number of configurations of the discrete states and small number of cycles
In this case, our experiment showed better accuracy from HMP-GMR in comparison with LW under a given time restriction. LW requires many samples to improve accuracy, while HMP-GMR uses message passing approach, which can provide exact results for a polytree network [Pearl, 1988]. In a simple-topology network (i.e., small number of configurations of the discrete states and small number of cycles), when HMP-GMR converges in time, it can provide high accuracy.

Small number of configurations of the discrete states and large number of cycles
When there are many cycles, HMP-GMR can diverge. This behavior depends on the network topology, the placement of continuous and discrete nodes, the conditional distributions, and the pattern of evidence. When divergence occurs, HMP-GMR halts during message passing because of numeric underflow. A feature to detect when the pi and lambda messages are going out of bounds and stop the algorithm may be useful, but intermediate results from before the algorithm diverges are of doubtful usefulness. In the case of a large number of cycles, LW can perform better than HMP-GMR.

Large number of configurations of the discrete states and small number of cycles
HMP-GMR reduces the maximum allowable number of components, which is influenced by the number of configurations of the discrete states. So, HMP-GMR can tolerate many numbers of configurations of the discrete states, while LW requires many samples for many numbers of configurations of the discrete states.
In this case, we observed that for the four CLG BNs, HMP-GMR could converge and provide good accuracy.

Large number of configurations of the discrete states and large number of cycles
Although LW can tolerate many cycles, it can perform poorly when the necessary number of samples to achieve good accuracy is too large for the available time limit. Also, HMP-GMR can perform poorly because of many cycles. In this case, the better choice between LW and HMP-GMR may vary depending on specific features of the problem. For example, when the number of configurations of the discrete states is relatively smaller than the number of cycles, LW can be used. When the number of configurations of the discrete states is relatively larger than the number of cycles, HMP-GMR can be used. However, in any case, accuracies for both approaches may be low.

VI. CONCLUSION
We have developed an extended message passing algorithm for CG hybrid Bayesian networks to overcome the exponential growth in components of the Gaussian mixture model. Our experiments demonstrated scalability, accuracy, and optimal settings for the complex hybrid BNs. In our experiments, both the original hybrid message passing inference and the hybrid junction tree inference showed exponential growth in execution time. The Gaussian mixture reduction method presented in this paper addressed this problem. Another issue we should address was to define ways to choose the optimal settings to achieve desired accuracy and execution time. For this, we presented a preprocessing algorithm to optimize the maximum allowable number of components and the optimal maximum iteration.
The algorithm enables HMP-GMR to provide better accuracy within a predefined time. For the four CLG BNs we investigated, we observed that the accuracy from using a single Gaussian component was nearly as good as the setting found by our optimization method, and the difference in accuracy was not statistically significant. Note that other networks with complex topologies, very unlikely evidence configurations, and/or deterministic or near-deterministic relationships might be different. We observed that the accuracy results for a loopy CG BN were less than for a poly CG BN. To address this, we can use a clustering inference (e.g., Hybrid-JT) with Gaussian mixture reduction. These issues will be addressed in future work.