Hidden Node Detection between Observable Nodes Based on Bayesian Clustering

Structure learning is one of the main concerns in studies of Bayesian networks. In the present paper, we consider networks consisting of both observable and hidden nodes, and propose a method to investigate the existence of a hidden node between observable nodes, where all nodes are discrete. This corresponds to the model selection problem between the networks with and without the middle hidden node. When the network includes a hidden node, it is known that there are singularities in the parameter space and that the Fisher information matrix is not positive definite. Consequently, many conventional criteria for structure learning based on the Laplace approximation do not work. The proposed method is based on Bayesian clustering, and its asymptotic property justifies the result: the redundant labels are eliminated and the simplest structure is detected even if there are singularities.


Introduction
In learning Bayesian networks, one of the main concerns is structure learning. Many criteria to detect the network structure have been proposed, such as the minimum description length (MDL) [1], the Bayesian information criterion (BIC) [2], the Akaike information criterion (AIC) [3], and the marginal likelihood [4]. Most of these criteria assume statistical regularity, which means that the parameter of the network is identifiable and, consequently, that all nodes are observable.
The nodes of the network are not always observable in practical situations; there will be some underlying factors, which are difficult to observe and do not appear in the given data. In such cases, the criteria for the structure learning must be designed by taking account of the existence of the hidden nodes. However, the statistical regularity does not hold when the network contains hidden nodes [5,6].
Probabilistic models fall into two types: regular and singular. If the mapping from the parameter to the probability function it expresses is one-to-one, the model has statistical regularity and is referred to as regular. Otherwise, there are singularities in the parameter space and the model is referred to as singular. Due to the singularities, the Fisher information matrix is not positive definite, which means that the conventional analysis based on the Laplace approximation or the asymptotic normality does not work in singular models. Many probabilistic models such as mixture models, hidden Markov models, and neural networks are singular. To cope with the problem of the singularities, an analysis method based on algebraic geometry has been proposed [7], and asymptotic properties of the generalization performance and of the marginal likelihood have been investigated in mixture models [8], hidden Markov models [9], neural networks [7,10], etc.
It is known that the Bayesian network with hidden nodes is singular since the parametrization changes compared with the network without hidden nodes. Even in a simple structure such as the naive Bayesian network, the parameter space has singularities [5,11]. A method to select the
The left and the right panels are networks without and with the hidden node, respectively, where $X = (X^{(1)}, \ldots, X^{(L)})$ with the domain $X^{(l)} \in \{1, \ldots, N_X^{(l)}\}$ and $Y \in \{1, \ldots, N_Y\}$ are observable and $Z \in \{1, \ldots, N_Z\}$ is hidden.
Since the evidence data on $X$ and $Y$ are given and there is no information on $Z$, we need to consider whether the hidden node exists and what its range $N_Z$ is. We propose a method to examine whether the middle hidden node should exist or not using Bayesian clustering. One way to obtain the simplest structure is to use the regularization technique [16], but it is not straightforward to prove that the selected structure is theoretically optimal. Our method is justified by a property of the entropy term in the asymptotic form of the marginal likelihood, which plays an essential role in the clustering. The result of clustering shows the labels necessary to express the relation between the observable nodes $X$ and $Y$. By counting the number of used labels, we can determine the existence of the hidden node. Note that, to reduce the computational complexity, we do not consider all possible structures of the network; in the present paper, we optimize the network over limited structures, in which, for example, there are no multiple inserted hidden nodes or connections between hidden nodes.
The remainder of this paper is organized as follows. Section 2 presents a formal definition of the network. Section 3 summarizes Bayesian clustering. Section 4 proposes the method to select the structure based on Bayesian clustering and derives its asymptotic behavior. Section 5 shows results of the numerical experiments validating the behavior. Finally, we present a discussion and our conclusions in Sections 6 and 7, respectively.

Model Settings
In this section, the network structure and its parameterization are formalized. The naive structure has been applied to classification and clustering tasks, and its mathematical properties have been studied [5] since it is expressed as a mixture model. As mentioned in the previous section, we consider the hidden node with both parent and child observable nodes. One of the simplest networks is shown in the right panel of Figure 2. Let the probabilities of $X = (X^{(1)}, \ldots, X^{(L)})$, $Z$, and $Y$ be defined by

$$p(X^{(l)} = i^{(l)}) = a^{(l)}_{i^{(l)}}, \qquad p(Z = j \mid X = i) = b_{ij}, \qquad p(Y = k \mid Z = j) = c_{jk}$$

for $i \in I = \{(i^{(1)}, \ldots, i^{(L)})\}$, $i^{(l)} \in \{1, \ldots, N_X^{(l)}\}$, $j = 1, \ldots, N_Z$, and $k = 1, \ldots, N_Y$. Since they are probabilities, we assume that

$$a^{(l)}_{N_X^{(l)}} = 1 - \sum_{i^{(l)}=1}^{N_X^{(l)}-1} a^{(l)}_{i^{(l)}}, \qquad b_{iN_Z} = 1 - \sum_{j=1}^{N_Z-1} b_{ij}, \qquad c_{jN_Y} = 1 - \sum_{k=1}^{N_Y-1} c_{jk}.$$

It is easy to find that $b_{ij}$ is the element of the CPT for $Z$ and $c_{jk}$ is that for $Y$. Let $w$ be the parameter consisting of $a^{(l)}_{i^{(l)}}$, $b_{ij}$, and $c_{jk}$; its dimension is

$$\dim w = \sum_{l=1}^{L} (N_X^{(l)} - 1) + (N_Z - 1) \prod_{l=1}^{L} N_X^{(l)} + (N_Y - 1) N_Z.$$

We also define the probabilities of the network shown in the left panel of Figure 2:

$$p(X^{(l)} = i^{(l)}) = d^{(l)}_{i^{(l)}}, \qquad p(Y = k \mid X = i) = e_{ik}.$$

The parameter $u$ consisting of $d^{(l)}_{i^{(l)}}$ and $e_{ik}$ has the dimension

$$\dim u = \sum_{l=1}^{L} (N_X^{(l)} - 1) + (N_Y - 1) \prod_{l=1}^{L} N_X^{(l)}.$$

If the relation between $X$ and $Y$ can be simplified, the full degree of freedom $\dim u$ is not necessary and is reduced to $\dim w$, as in the case shown in Figure 1. This is similar to the dimension reduction of data with sandglass-type neural networks or non-negative matrix factorization, which have a smaller number of nodes in the middle layer than in the input and output layers. The relation between the necessary dimension of the parameter and the probability of the output is not always trivial [17]. The present paper focuses on the sufficient case in terms of the dimension reduction, where $\dim w < \dim u$, which is rewritten as

$$N_Z < \frac{N_Y \prod_{l=1}^{L} N_X^{(l)}}{\prod_{l=1}^{L} N_X^{(l)} + N_Y - 1}. \tag{11}$$

Recall that $X$ and $Y$ are observable and $Z$ is hidden, where $N_X^{(l)}$ and $N_Y$ are given and $N_Z$ is unknown. When the minimum $N_Z$ detected from the given evidence pairs of $X$ and $Y$ satisfies Equation (11), the network structure with the hidden node expresses the pairs with a smaller parameter dimension. We use the Bayesian clustering technique to detect the minimum $N_Z$.
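The dimension comparison in this section can be checked numerically. The following is a minimal sketch, assuming that every probability vector with K states contributes K - 1 free parameters; the function names are illustrative and do not come from the paper.

```python
from math import prod


def dim_with_hidden(n_x, n_z, n_y):
    """Free parameters of X -> Z -> Y; n_x lists the sizes of the parent nodes."""
    p = prod(n_x)  # number of joint parent states
    return sum(s - 1 for s in n_x) + (n_z - 1) * p + (n_y - 1) * n_z


def dim_without_hidden(n_x, n_y):
    """Free parameters of the direct network X -> Y."""
    return sum(s - 1 for s in n_x) + (n_y - 1) * prod(n_x)


def max_reducing_size(n_x, n_y):
    """Largest N_Z with dim w < dim u (0 if no size qualifies)."""
    n_z = 0
    # dim w grows with N_Z, so the loop stops at the first violation.
    while dim_with_hidden(n_x, n_z + 1, n_y) < dim_without_hidden(n_x, n_y):
        n_z += 1
    return n_z


# One parent with six states and a child with six states.
print(dim_with_hidden([6], 3, 6), dim_without_hidden([6], 6))  # 32 35
print(max_reducing_size([6], 6))  # 3
```

With one six-state parent and a six-state child, a hidden node with up to three states reduces the parameter dimension, while four states already exceed the dimension of the direct network.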

Bayesian Clustering
In this section, let us formally introduce Bayesian clustering. Let the evidence be described by $(x_i, y_i)$; there are $n$ pairs, which are denoted by $(X^n, Y^n) = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. The corresponding value of the hidden node is $z_i$, and the set of the $n$ values is denoted by $Z^n$. We can estimate $z_i$ based on the probability $p(Z^n \mid X^n, Y^n)$. In Bayesian clustering, it is defined by

$$p(Z^n \mid X^n, Y^n) = \frac{p(X^n, Z^n, Y^n)}{\sum_{Z^n} p(X^n, Z^n, Y^n)}, \qquad p(X^n, Z^n, Y^n) = \int \prod_{i=1}^{n} p(x_i, z_i, y_i \mid w)\, \varphi(w \mid \alpha)\, dw,$$

where $\varphi(w \mid \alpha)$ is a prior distribution and $\alpha$ is the hyperparameter. In the network with the hidden node,

$$p(x, z, y \mid w) = \prod_{l=1}^{L} a^{(l)}_{x^{(l)}}\, b_{xz}\, c_{zy}.$$

If the prior distribution is expressed as the Dirichlet distribution for $a^{(l)}_{i^{(l)}}$, $b_{ij}$, and $c_{jk}$, the numerator $p(X^n, Z^n, Y^n)$ is analytically computable. Based on the relation $p(Z^n \mid X^n, Y^n) \propto p(X^n, Z^n, Y^n)$, the Markov chain Monte Carlo (MCMC) method provides samples of $Z^n$ from $p(Z^n \mid X^n, Y^n)$. This is a common method to estimate hidden variables in machine learning; for example, the underlying topics are estimated with the Gibbs sampler in topic models such as latent Dirichlet allocation [18].
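One common way to sample $Z^n$ when the priors are Dirichlet is a collapsed Gibbs sampler, sketched below. This is an illustration under stated assumptions, not necessarily the sampler used in the paper: a single parent node (L = 1), symmetric Dirichlet hyperparameters `alpha_b` and `alpha_c`, and 0-based state labels. The name `gibbs_cluster` and its arguments are hypothetical. The factor for the parent node does not depend on the labels, so it is omitted from the update.

```python
import random


def gibbs_cluster(xs, ys, n_x, n_z, n_y, alpha_b=1.0, alpha_c=1.0,
                  sweeps=200, seed=0):
    """Return one label assignment after `sweeps` passes of collapsed Gibbs sampling."""
    rng = random.Random(seed)
    n = len(xs)
    z = [rng.randrange(n_z) for _ in range(n)]  # random initial assignment

    # Sufficient statistics: n_ij counts (x, z) pairs, m_jk counts (z, y) pairs.
    n_ij = [[0] * n_z for _ in range(n_x)]
    m_jk = [[0] * n_y for _ in range(n_z)]
    m_j = [0] * n_z
    for x, label, y in zip(xs, z, ys):
        n_ij[x][label] += 1
        m_jk[label][y] += 1
        m_j[label] += 1

    for _ in range(sweeps):
        for t in range(n):
            x, y, old = xs[t], ys[t], z[t]
            # Remove sample t from the counts.
            n_ij[x][old] -= 1
            m_jk[old][y] -= 1
            m_j[old] -= 1
            # Conditional of z_t given all other labels, up to a constant.
            weights = [
                (n_ij[x][j] + alpha_b)
                * (m_jk[j][y] + alpha_c) / (m_j[j] + n_y * alpha_c)
                for j in range(n_z)
            ]
            new = rng.choices(range(n_z), weights=weights)[0]
            # Put sample t back with its new label.
            z[t] = new
            n_ij[x][new] += 1
            m_jk[new][y] += 1
            m_j[new] += 1
    return z
```

The sketch returns only the final state of the chain. A fuller implementation would keep the visited assignments and report the one with the largest $p(X^n, Z^n, Y^n)$.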

Hidden Node Detection
In this section, the algorithm to detect the hidden node is introduced and its asymptotic property reducing the number of the used labels is revealed.

The Proposed Algorithm
When the size of the middle node is too large, there is no reason to have the node $Z$; the middle node should reduce the degree of freedom from $X$. If only $N_Z = 1$ satisfies Equation (11), the middle node is not necessary. Note that $N_Z = 1$ means that there is no edge between $X$ and $Y$, which is already excluded in structure learning. In other words, when no $N_Z > 1$ satisfies Equation (11), there is no hidden node between $X$ and $Y$.
The present paper proposes the following algorithm to determine the existence of $Z$.
Algorithm 2. Assume that there is $N_Z > 1$ satisfying Equation (11) for the given $N_X^{(l)}$ and $N_Y$. Apply the Bayesian clustering method to the given evidence $(X^n, Y^n)$ and estimate $Z^n$ based on the MCMC sampling. Let the number of used labels be denoted by $\hat{N}_Z$. If the following inequality holds, the hidden node $Z \in \{1, \ldots, \hat{N}_Z\}$ reduces the parameter:

$$\hat{N}_Z < \frac{N_Y \prod_{l=1}^{L} N_X^{(l)}}{\prod_{l=1}^{L} N_X^{(l)} + N_Y - 1}.$$
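The decision step of the algorithm can be sketched as follows. The function names are illustrative assumptions, and the inequality is the condition that the network with $\hat{N}_Z$ hidden states has fewer free parameters than the direct network, written in integer form to avoid floating-point division.

```python
from math import prod


def used_labels(z):
    """Number of distinct labels that appear in the assignment z."""
    return len(set(z))


def hidden_node_reduces_parameters(z, n_x, n_y):
    """True if the used labels give a smaller parameter dimension than X -> Y.

    n_x lists the sizes of the parent nodes; n_y is the size of the child node.
    """
    p = prod(n_x)
    # Integer form of  N_hat < N_Y * P / (P + N_Y - 1).
    return used_labels(z) * (p + n_y - 1) < n_y * p


# Hypothetical assignment that uses three of six available labels.
labels = [0, 2, 2, 5, 0, 5, 2]
print(used_labels(labels), hidden_node_reduces_parameters(labels, [6], 6))
```

A result with a single used label would also pass the inequality, but, as noted above, that case means there is no edge between the observable nodes and should be treated separately.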

Asymptotic Properties of the Algorithm
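The MCMC method in Bayesian clustering is based on the probability $p(X^n, Z^n, Y^n)$ as shown in Section 3. Since the proposed method depends on this clustering method, let us consider the properties of $p(X^n, Z^n, Y^n)$. The negative logarithm of the probability is expressed as follows:

$$F_\alpha(X^n, Z^n, Y^n) = -\ln p(X^n, Z^n, Y^n) = -\sum_{l=1}^{L} \ln \frac{\Gamma(N_X^{(l)} \alpha_a) \prod_{i^{(l)}} \Gamma(n^{(l)}_{i^{(l)}} + \alpha_a)}{\Gamma(\alpha_a)^{N_X^{(l)}}\, \Gamma(n + N_X^{(l)} \alpha_a)} - \sum_{i \in I} \ln \frac{\Gamma(N_Z \alpha_b) \prod_{j} \Gamma(n_{ij} + \alpha_b)}{\Gamma(\alpha_b)^{N_Z}\, \Gamma(n_i + N_Z \alpha_b)} - \sum_{j=1}^{N_Z} \ln \frac{\Gamma(N_Y \alpha_c) \prod_{k} \Gamma(m_{jk} + \alpha_c)}{\Gamma(\alpha_c)^{N_Y}\, \Gamma(m_j + N_Y \alpha_c)},$$

where $n^{(l)}_{i^{(l)}}$, $n_{ij}$, and $m_{jk}$ are given as

$$n^{(l)}_{i^{(l)}} = \sum_{t=1}^{n} \delta_{x_t^{(l)} i^{(l)}}, \qquad n_{ij} = \sum_{t=1}^{n} \delta_{x_t i}\, \delta_{z_t j}, \qquad m_{jk} = \sum_{t=1}^{n} \delta_{z_t j}\, \delta_{y_t k},$$

respectively, with $n_i = \sum_j n_{ij}$ and $m_j = \sum_k m_{jk}$, and the prior distribution $\varphi(w \mid \alpha)$ consists of the Dirichlet distributions:

$$\varphi(w \mid \alpha) = \prod_{l=1}^{L} \frac{\Gamma(N_X^{(l)} \alpha_a)}{\Gamma(\alpha_a)^{N_X^{(l)}}} \prod_{i^{(l)}} \bigl(a^{(l)}_{i^{(l)}}\bigr)^{\alpha_a - 1} \prod_{i \in I} \frac{\Gamma(N_Z \alpha_b)}{\Gamma(\alpha_b)^{N_Z}} \prod_{j} b_{ij}^{\alpha_b - 1} \prod_{j=1}^{N_Z} \frac{\Gamma(N_Y \alpha_c)}{\Gamma(\alpha_c)^{N_Y}} \prod_{k} c_{jk}^{\alpha_c - 1}.$$

The functions $\delta_{ij}$ and $\Gamma(\cdot)$ are the Kronecker delta and the gamma function, respectively. The hyperparameter $\alpha$ consists of $\alpha_a$, $\alpha_b$, and $\alpha_c$. The samples of $Z^n$ are predominantly taken from the region where $p(X^n, Z^n, Y^n)$ is large. We therefore investigate which $Z^n$ minimizes $F_\alpha(X^n, Z^n, Y^n)$ for given $(X^n, Y^n)$.
Theorem 3. When the number of the given data $n$ is sufficiently large, $F_\alpha(X^n, Z^n, Y^n)$ is written as

$$F_\alpha(X^n, Z^n, Y^n) = -nS + C \ln n + O_p(1), \qquad S = \sum_{l=1}^{L} \sum_{i^{(l)}} \frac{n^{(l)}_{i^{(l)}}}{n} \ln \frac{n^{(l)}_{i^{(l)}}}{n} + \sum_{i \in I} \sum_{j} \frac{n_{ij}}{n} \ln \frac{n_{ij}}{n_i} + \sum_{j} \sum_{k} \frac{m_{jk}}{n} \ln \frac{m_{jk}}{m_j},$$

where the coefficient $C$ depends on $\tilde{N}_Z$, the number of $m_j$ such that $m_j / n = O(1)$.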
The proof is shown in Appendix A. The first term $-nS$ is the dominant factor, and its coefficient $S$ is maximized in the clustering. This coefficient determines $\tilde{N}_Z$, which is the number of used labels in the clustering result.
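The relation between the exact objective and its dominant term can be inspected numerically. The sketch below assumes a single parent node, one symmetric hyperparameter `alpha` shared by all Dirichlet priors, and 0-based labels; `free_energy` and `entropy_coefficient` are hypothetical helper names. The first function evaluates the negative log marginal likelihood of the complete data with log-gamma functions, and the second evaluates the sum of the three negative empirical entropies.

```python
from collections import Counter
from math import lgamma, log


def _neg_log_dirmult(counts, size, alpha):
    """-ln of a Dirichlet-multinomial probability.

    `counts` holds the non-zero cell counts; `size` is the total number of cells.
    """
    total = sum(counts)
    value = -(lgamma(size * alpha) - size * lgamma(alpha))
    value -= sum(lgamma(c + alpha) for c in counts)
    value -= (size - len(counts)) * lgamma(alpha)  # empty cells
    value += lgamma(total + size * alpha)
    return value


def free_energy(xs, zs, ys, n_x, n_z, n_y, alpha=1.0):
    """Exact -ln p(X^n, Z^n, Y^n) for the chain X -> Z -> Y."""
    cx = Counter(xs)
    cxz = Counter(zip(xs, zs))
    czy = Counter(zip(zs, ys))
    f = _neg_log_dirmult(list(cx.values()), n_x, alpha)
    for i in range(n_x):  # one Dirichlet block per parent state
        f += _neg_log_dirmult([c for (x, _), c in cxz.items() if x == i], n_z, alpha)
    for j in range(n_z):  # one Dirichlet block per hidden state
        f += _neg_log_dirmult([c for (z, _), c in czy.items() if z == j], n_y, alpha)
    return f


def entropy_coefficient(xs, zs, ys):
    """Sum of the negative empirical entropies of X, Z given X, and Y given Z."""
    n = len(xs)
    cx, cz = Counter(xs), Counter(zs)
    cxz, czy = Counter(zip(xs, zs)), Counter(zip(zs, ys))
    s = sum(c / n * log(c / n) for c in cx.values())
    s += sum(c / n * log(c / cx[x]) for (x, _), c in cxz.items())
    s += sum(c / n * log(c / cz[z]) for (z, _), c in czy.items())
    return s


# Toy data in which the hidden label simply copies the parent state.
xs = [0, 1] * 500
f = free_energy(xs, xs, xs, 2, 2, 2)
s = entropy_coefficient(xs, xs, xs)
print(f, -len(xs) * s)
```

On this toy data the two printed values differ by a quantity that grows like the logarithm of the sample size, which is small compared with the dominant term.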
Assume that the true structure with the hidden node has the minimal expression, where the range of $Z$ is $z = 1, \ldots, N_Z^*$, and that the estimated size is not smaller than the true one: $N_Z^* \le \tilde{N}_Z$. We can confirm that Bayesian clustering chooses the minimum structure $\tilde{N}_Z = N_Z^*$ as follows. The three terms in the coefficient $S$ correspond to the negative entropy functions of the parameters $a^{(l)}_{i^{(l)}}$, $b_{ij}$, and $c_{jk}$, respectively. The minimum $\tilde{N}_Z$ therefore maximizes the coefficient $S$, since the number of elements of the parameter must be minimized to make the entropy small. When the hidden node has a redundant state, which means that two values of $Z$ have exactly the same output distribution of $Y$, the entropy in the second term of $S$ is larger than in the non-redundant situation $\hat{N}_Z = N_Z^*$, so $S$ is smaller. Based on the assumption that the true structure is minimal, the estimation therefore attains the minimum structure, $\hat{N}_Z = N_Z^*$. According to this property, the number of used labels $\hat{N}_Z$ asymptotically goes to $N_Z^*$. The proposed algorithm compares the essential number of values of $Z$ and serves as a criterion to select the proper structure when $n$ is large. So far, this property has been found only in Bayesian clustering; the effect of eliminating redundant labels has not been found in other clustering methods such as maximum-likelihood clustering based on the expectation-maximization algorithm.
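The effect of a redundant label on $S$ can be made explicit with a short calculation. The sketch below assumes that the second and third terms of $S$ are the negative conditional entropies $\sum_{i,j} (n_{ij}/n) \ln (n_{ij}/n_i)$ and $\sum_{j,k} (m_{jk}/n) \ln (m_{jk}/m_j)$, and it compares two labels $j_1$, $j_2$ with the single merged label obtained by adding their counts. The symbol $r_k$ is introduced here only for illustration.

```latex
% Merging labels j_1 and j_2 into one label: counts are added.
\begin{align*}
% Second term: for every parent state i (with 0 \ln 0 = 0), merging never decreases it,
(n_{ij_1} + n_{ij_2}) \ln \frac{n_{ij_1} + n_{ij_2}}{n_i}
  &\ge n_{ij_1} \ln \frac{n_{ij_1}}{n_i} + n_{ij_2} \ln \frac{n_{ij_2}}{n_i}, \\
% Third term: if m_{j_1 k}/m_{j_1} = m_{j_2 k}/m_{j_2} = r_k for all k, merging leaves it unchanged,
(m_{j_1 k} + m_{j_2 k}) \ln \frac{m_{j_1 k} + m_{j_2 k}}{m_{j_1} + m_{j_2}}
  &= m_{j_1 k} \ln r_k + m_{j_2 k} \ln r_k .
\end{align*}
```

The first inequality is strict whenever both counts are positive, so, under these assumptions, an assignment that splits one state of $Z$ into two labels with the same output distribution has a strictly smaller $S$ than the merged assignment.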

Numerical Experiments
In this section, we validate the asymptotic property in numerical experiments. We set the data-generating model shown in Figure 3 and prepared ten evidence data sets. There was a single parent node, $L = 1$. The sizes of the nodes were $N_X^{(1)} = 6$, $N_Y = 6$, and $N_Z^* = 3$. The CPTs are described on the right side of the figure, where the true parameter consists of these probabilities. There were 2000 pairs of $(x, y)$ in each data set. Since the condition

$$N_Z^* = 3 < \frac{N_Y N_X^{(1)}}{N_X^{(1)} + N_Y - 1} = \frac{36}{11}$$

is satisfied, the structure of the data-generating model with the hidden node had a smaller parameter dimension than the one without a hidden node.
We applied Bayesian clustering to each data set, where the model had the hidden node size $N_Z = 6$. According to the asymptotic property in Theorem 3, the MCMC method should take the label assignment from the region in which the number of used labels is reduced to three. The estimated model size was determined by the assignment that minimized the function $F_\alpha(X^n, Z^n, Y^n)$. Since the sampling of the MCMC method depended on the initial assignment, we conducted ten trials for each data set and took the minimum size over the trials as the estimated size. The number of iterations in the MCMC method was 1000.
Table 1 shows the results of the experiments. In all data sets, the size of the hidden node $Z$ was reduced, and the correct size was estimated in more than half of the sets, which confirms the effect of eliminating the redundant labels. Since the result of the MCMC method depends on the given data, the minimum size is not always found; the estimated size is four in some data sets instead of three. Even in such cases, however, we could estimate the correct size after setting the initial size of the model to $N_Z = 4$. By repeating this procedure, we will be able to avoid a locally optimal size and find the global one. Figure 4 shows this estimation procedure in practical cases. The initial model size starts from six.
The left panel is the case where the proper size is directly found and the estimated size does not change at size four. The right panel is the case where the estimated size is first four and then the next result is three, which is the fixed point. To investigate the properties of the estimated size, we tried different numbers of pairs, $n = 100$ and $n = 500$, a skewed distribution of the parent node (Figure 5), and a nearly uniform distribution of the child node (Figure 6). Table 2 shows the results for $n = 100$ and $n = 500$.

Table 2. The results of the estimated size for n = 100 and n = 500.

Data-set ID               1  2  3  4  5  6  7  8  9  10
Estimated size (n = 100)  3  3  3  3  4  3  4  3  3  3
Estimated size (n = 500)  3  3  3  3  3  3  3  3  3  4

Since these CPTs of $X$, $Z$, and $Y$ are a straightforward case in which the role of the hidden node is easy to distinguish, the smaller number of pairs does not adversely affect the estimation. Table 3 shows the results for the different CPTs in the parent and the child nodes.

Table 3. The results of the estimated size for the different conditional probability tables (CPTs).

Data-set ID                                 1  2  3  4  5  6  7  8  9  10
Estimated size (skewed parent node)         3  3  3  3  3  4  3  3  3  3
Estimated size (nearly uniform child node)  1  1  1  1  2  1  1  2  1  1

The number of pairs was $n = 100$. Due to the CPT of $Z$, the skewed distribution of the parent node still keeps sufficient variation of $Z$ to estimate the size $N_Z$, which provides the same accuracy as the uniform distribution. On the other hand, the nearly uniform distribution of the child node makes the estimation difficult because each value of $Z$ has a similar output distribution. The Dirichlet prior of $Z$ has a strong effect of eliminating the redundancy, so the estimated sizes tend to be smaller than the true one.
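The re-estimation procedure described in this section, in which the clustering is run again with the previously estimated size as the model size until the estimate stops changing, can be sketched as follows. The argument `cluster` is a hypothetical callable that takes a model size and returns a label assignment; it stands in for the MCMC-based clustering and is not an interface from the paper.

```python
def estimate_size(cluster, initial_size, max_rounds=20):
    """Shrink the model size until the number of used labels is a fixed point."""
    size = initial_size
    for _ in range(max_rounds):
        used = len(set(cluster(size)))  # labels that actually appear
        if used == size:
            return size  # the estimate no longer changes
        size = used
    return size


def toy_cluster(size):
    """Stand-in for a clustering run: gets stuck at four labels from large models."""
    return list(range(4)) if size > 4 else list(range(min(size, 3)))


print(estimate_size(toy_cluster, 6))  # 3
```

The toy stand-in mimics a run that first stops at four labels and then reaches three when restarted from size four.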

Discussion
In this section, we discuss the difference between the proposed method and other conventional criteria for model selection. In the proposed method, the label assignment $Z^n$ is obtained from the MCMC method, which takes the samples according to $p(X^n, Z^n, Y^n)$. The probability $p(X^n, Z^n, Y^n)$ is the marginal likelihood on the complete data $(X^n, Z^n, Y^n)$; recall the definition,

$$p(X^n, Z^n, Y^n) = \int \prod_{i=1}^{n} p(x_i, z_i, y_i \mid w)\, \varphi(w \mid \alpha)\, dw.$$

This looks similar to the criteria based on the marginal likelihood such as BDe(u) [19,20] and its asymptotic forms such as BIC [2] and MDL [1]. Since these criteria assume that the network has statistical regularity, that is, that all nodes are observable, many of them do not work on the network with hidden nodes.
The widely applicable Bayesian information criterion (WBIC) has been proposed for singular models. The main difference is that it is based on the marginal likelihood of the incomplete data $(X^n, Y^n)$:

$$p(X^n, Y^n) = \sum_{Z^n} p(X^n, Z^n, Y^n).$$

Due to the marginalization over $Z^n$, it requires the calculation of values for all candidate structures. For example, assume that we have candidate structures $N_Z = 1, 2, 3$ whose marginal likelihoods are denoted by $p_1(X^n, Y^n)$, $p_2(X^n, Y^n)$, and $p_3(X^n, Y^n)$, respectively. In WBIC, we calculate all values and select the optimal structure:

$$\hat{m} = \arg\max_{m \in \{1, 2, 3\}} p_m(X^n, Y^n).$$

On the other hand, in the proposed method, we calculate the label assignment with the structure $N_Z = 3$ and obtain $\hat{N}_Z$, which shows the necessity of the node $Z$.
Another difference from the conventional criteria is the dominant order of the objective function, which determines the optimal structure. As shown in Corollary 6.1 of [6], the negative logarithm of the marginal likelihood of the incomplete data has the following asymptotic form:

$$F_\alpha(X^n, Y^n) = -\ln p(X^n, Y^n) = n S_{XY} + C_{XY} \ln n + o_p(\ln n),$$

where the coefficient $S_{XY}$ is the empirical entropy of the observation $(X^n, Y^n)$ and $C_{XY}$ depends on the data-generating distribution, the model, and the prior distribution. Since $S_{XY}$ is determined by the data alone, this form means that the optimal model is selected by the term of order $\ln n$ with the coefficient $C_{XY}$, while in the proposed method it is selected by the term of order $n$ with the coefficient $S$ of Theorem 3. The largest terms are of order $n$ in both $F_\alpha(X^n, Y^n)$ and $F_\alpha(X^n, Z^n, Y^n)$, but only in the latter does this term depend on the structure, so the proposed method will have a stronger effect in distinguishing the structures.
The asymptotic accuracy of Bayesian clustering has been studied [21], which considers the error function between the true distribution of the label assignment and the estimated one measured by the Kullback-Leibler divergence:

$$E_{X^n, Y^n}\left[ \sum_{Z^n} q(Z^n \mid X^n, Y^n) \ln \frac{q(Z^n \mid X^n, Y^n)}{p(Z^n \mid X^n, Y^n)} \right],$$

where $E_{X^n, Y^n}[\cdot]$ is the expectation over all evidence data and

$$q(Z^n \mid X^n, Y^n) = \frac{\prod_{i=1}^{n} q(x_i, z_i, y_i)}{\sum_{Z^n} \prod_{i=1}^{n} q(x_i, z_i, y_i)}.$$

The true network is denoted by $q(x, z, y)$. The proposed method minimizes this error function, which means that the label assignment $Z^n$ is optimized in the sense of density estimation. Even though the optimized function is not designed directly for model selection, the asymptotic property of Bayesian clustering that simplifies the label use makes the proposed method computationally efficient for determining the existence of the hidden node, and its result asymptotically coincides with the minimal structure.

Conclusions
In this paper, we have proposed a method to detect a hidden node between observable nodes based on Bayesian clustering. The asymptotic behavior of the clustering has been revealed and it shows that the redundant labels are eliminated and the essential structure will be detected. Evaluation of the proposed method with numerical experiments is one of our future studies.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Collecting the constant terms and uniting them into $O_p(1)$, we rewrite $F_\alpha(X^n, Z^n, Y^n)$ and, focusing on the terms of order $n$ and $\ln n$, we obtain the asymptotic form in the theorem.
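The kind of expansion involved can be illustrated on a single Dirichlet-multinomial factor. The following sketch is not the paper's proof; it assumes one factor with $K$ cells, a symmetric hyperparameter $\alpha$, counts $n_1, \ldots, n_K$ summing to $n$, and that $\tilde{K}$ cells have counts of order $n$ while the remaining cells are empty. It uses Stirling's formula for the log-gamma function.

```latex
\begin{align*}
% Exact negative log-probability of the factor
-\ln p &= -\ln \Gamma(K\alpha) + K \ln \Gamma(\alpha)
          - \sum_{k=1}^{K} \ln \Gamma(n_k + \alpha) + \ln \Gamma(n + K\alpha), \\
% Stirling's formula for large m
\ln \Gamma(m + \alpha) &= m \ln m - m + \left(\alpha - \tfrac{1}{2}\right) \ln m + O(1), \\
% Terms of order n and \ln n (using \ln n_k = \ln n + O(1) for the used cells)
-\ln p &= -\sum_{k:\, n_k > 0} n_k \ln \frac{n_k}{n}
          + \left[ (K - \tilde{K})\,\alpha + \frac{\tilde{K} - 1}{2} \right] \ln n + O(1).
\end{align*}
```

Under these assumptions, the term of order $n$ is an empirical entropy, and the coefficient of $\ln n$ depends on how many cells are actually used, which is how the number of used labels enters an asymptotic form of this type.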