Abstract
The Bayesian network is a widely used network model, and the loop cutset is a crucial structure for Bayesian inference. In a Bayesian network and its inference, measuring the relationship between nodes is important, because the relationship between different nodes significantly influences the probability that a node belongs to the loop cutset (the node-probability of the loop cutset). To analyse the relationship between two nodes in a graph, we define the shared node, prove upper and lower bounds on the number of shared nodes, and confirm, through theorems and experiments, that shared nodes influence the node-probability of the loop cutset. These results explain the phenomena that we found while studying the statistical node-probability of the loop cutset. Shared nodes serve not only to improve the theoretical analysis of the loop cutset, but also the loop cutset solving algorithms, especially the heuristic algorithms, whose heuristic strategies can be optimized using shared nodes. Our results provide a new tool to gauge the relationship between different nodes and a new perspective for estimating the loop cutset, and they are helpful for loop cutset algorithms and network analysis.
1. Introduction
The network model is a valuable tool, and network analysis has been widely applied to a diverse range of social, biological, and information systems over the last few decades. In a network model, variables and their interactions are represented as nodes and edges; the Bayesian network is one such model. A Bayesian network is a directed acyclic graph in which the nodes represent chance variables and the edges represent probabilistic influences. An edge implies that the relationship between the connected nodes can be characterized by a conditional probability matrix. Pearl defined the loop cutset in the conditioning method in order to solve the inference problem for multiply-connected networks [1]. The complexity of the conditioning method grows exponentially with the size of the loop cutset. However, the minimum loop cutset problem has been proven to be NP-hard [2].
To solve the loop cutset problem, researchers have devoted much effort to three classes of algorithms: heuristic algorithms, random algorithms, and exact algorithms [3,4,5,6,7]. The heuristic algorithms select the node with the highest degree as the first member of the loop cutset by default. The random algorithm WRA picks node v with probability d(v)/∑_{u∈V} d(u), where v and u are nodes of the graph G = (V, E), and d(v) is the degree of node v.
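The degree-weighted selection step can be sketched in a few lines. This is a minimal illustration of picking a node with probability d(v)/∑_u d(u), not a full implementation of WRA; the adjacency-dict representation is an assumption for the example.

```python
import random

def pick_node_by_degree(adj, rng=random):
    """Pick one node with probability proportional to its degree,
    i.e. deg(v) / sum_u deg(u).  `adj` maps each node to its
    list of neighbours; the degree total equals 2|E|."""
    nodes = list(adj)
    degrees = [len(adj[v]) for v in nodes]  # selection weights deg(v)
    return rng.choices(nodes, weights=degrees, k=1)[0]
```

An isolated node has weight 0 and is never selected, consistent with the fact that a degree-0 node cannot lie on any loop.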
At the same time, however, theoretical research on the characteristics of the loop cutset is lacking, and the above algorithms lack theoretical support. The results in reference [8] proved a positive correlation between the node degree and the node-probability of the loop cutset, which explains the first selection in the heuristic algorithms and the picking probability in WRA. Nevertheless, many questions about the loop cutset problem remain. Because the problem is NP-hard, is it possible to perform some pre-analysis on the loop cutset before solving? Such pre-analysis can reduce unnecessary calculations in some cases.
According to the theorems in [8], the probability of a node belonging to the loop cutset is related to its degree. Subsequently, if one node is known to belong to the loop cutset, will the node-probability of the loop cutset change for another given node? Answering this question requires a way to gauge the relationship between different nodes.
Researchers widely use various node centrality measures to gauge the importance of individual nodes in a network model, which is a principal application of network theory. A number of measures, commonly referred to as centrality measures, have been proposed to quantify the relative importance of nodes in a network [9,10,11]. This paper defines the shared node for two different nodes in a network in order to analyse the relationship between the two nodes.
Reference [8] introduced a new parameter to gauge the relative degree of nodes in a graph, and it left open a question: the statistical probability of the next-largest-degree nodes belonging to the loop cutset drops significantly, while the probability for the largest-degree node is maximal. In this paper, we use shared nodes to discuss the reasons for this phenomenon and introduce theorems to justify this discussion.
The paper is organized as follows. Some preliminary knowledge is provided in Section 2. In Section 3, we define the shared node, predict the theoretical number of shared nodes in a general undirected graph, a regular graph, and a bipartite graph, and perform some theoretical analyses of the loop cutset. In Section 4, data experiments are performed with four existing algorithms, and the results verify the relevant theories. Section 5 gives applications of shared nodes in loop cutset algorithms and network analysis.
2. Preliminaries
In this section, we review notions from graph theory, the loop cutset problem, the Bayesian network, and two of its characteristics: conditional independence and d-separation.
Definition 1
(Graph Concepts). A simple graph G is defined by a node set V and a set E of two-element subsets of V; the ends of an edge {u, v} are precisely the nodes u and v. A directed graph is a pair D = (V, E), where V is the set of nodes and E is the set of directed edges. Given that (u, v) ∈ E, u is called a parent of v, and v is called a child of u. The moral graph of a directed graph D is the undirected graph obtained by connecting the parents of every node in D and removing the arrows. A loop in a directed graph D is a subgraph whose underlying graph is a cycle. A directed graph is acyclic if it has no directed loops. A directed graph is singly-connected if its underlying undirected graph has no cycles; otherwise, it is multiply-connected [11]. A graph G is k-regular if deg(v) = k for all v ∈ V. A graph is bipartite if its vertex set can be partitioned into two subsets X and Y, so that every edge has one end in X and one end in Y [12].
Definition 2
(Bayesian Networks). Let X = {X_1, …, X_n} be a set of random variables over multivalued domains D_1, …, D_n. A Bayesian network (Pearl, 1988), also named a belief network, is a pair (G, P), where G is a directed acyclic graph whose nodes are the variables X, and P = {P(X_i | pa(X_i))} is the set of conditional probability tables associated with each X_i, with pa(X_i) denoting the parents of X_i in G. The Bayesian network represents a joint probability distribution having the product form:

P(x_1, …, x_n) = ∏_{i=1}^{n} P(x_i | pa(X_i)).
Evidence E is an instantiated subset of variables [13].
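The product form above can be evaluated directly once each variable's conditional probability table is available. A minimal sketch follows, under the assumption that each CPT is stored as a dict keyed by (value, parent-values); the data-structure choice is illustrative, not taken from the paper.

```python
def joint_probability(assignment, parents, cpts):
    """Evaluate P(x_1,...,x_n) = prod_i P(x_i | pa(X_i)) for a full
    assignment of all variables.
    assignment: {variable: value}
    parents:    {variable: tuple of its parent variables}
    cpts:       {variable: {(value, parent_values): probability}}"""
    p = 1.0
    for var, value in assignment.items():
        pa_vals = tuple(assignment[u] for u in parents[var])
        p *= cpts[var][(value, pa_vals)]  # one CPT entry per factor
    return p
```

For a two-node network A → B with P(A=1) = 0.4 and P(B=1 | A=1) = 0.8, the joint P(A=1, B=1) evaluates to 0.4 × 0.8 = 0.32.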
Definition 3
(Loop Cutset).A vertex v is a sink with respect to a loop L if the two edges that are adjacent to v in L are directed into v. A vertex that is not a sink with respect to a loop L is called an allowed vertex with respect to L. A loop cutset of a directed graph D is a set of vertices that contains at least one allowed vertex with respect to each loop in D [13].
The loop cutset problem, for a directed acyclic graph D = (V, E) and an integer k, is to find a set F ⊆ V such that |F| ≤ k and the graph obtained by deleting F from the underlying undirected graph of D is a forest. The problem can be transformed into a feedback vertex set (FVS) problem as follows: convert the directed acyclic graph to its underlying undirected graph; then iteratively delete the nodes whose degrees are less than 2, together with their adjacent edges. Denote the obtained undirected graph as G′; the problem then involves finding a feedback vertex set for G′.
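The iterative deletion step of this reduction can be sketched as follows; this is an illustrative implementation of the degree-< 2 pruning only, not of the full FVS solver, and the adjacency-dict representation is an assumption.

```python
from collections import deque

def prune_to_core(adj):
    """Iteratively delete nodes of degree < 2 (with their edges) from an
    undirected graph, leaving the subgraph on which every remaining node
    lies on a cycle.  `adj` maps each node to a set of neighbours."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    queue = deque(v for v in adj if len(adj[v]) < 2)
    while queue:
        v = queue.popleft()
        if v not in adj:              # already deleted
            continue
        for u in adj[v]:
            adj[u].discard(v)         # remove the incident edge
            if len(adj[u]) < 2:       # neighbour may now be prunable
                queue.append(u)
        del adj[v]
    return adj
```

A tree prunes away entirely, while a triangle with a pendant path keeps only the triangle, matching the intuition that acyclic parts contribute nothing to the loop cutset.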
Definition 4
(Conditional Independence). Let V = {V_1, V_2, …, V_n} be a finite set of variables. Let P(·) be a joint probability function over the variables in V, and let X, Y, Z stand for any three subsets of variables in V. The sets X and Y are said to be conditionally independent given Z if

P(x | y, z) = P(x | z) whenever P(y, z) > 0.
In words, learning the value of Y does not provide additional information about X, once we know Z [12]. (Metaphorically, Z “screens off” X from Y.)
Definition 5
(d-Separation). A path p is said to be d-separated (or blocked) by a set of nodes Z if and only if (1) p contains a chain i → m → j or a fork i ← m → j such that the middle node m is in Z, or (2) p contains an inverted fork (or collider) i → m ← j such that the middle node m is not in Z and no descendant of m is in Z. A set Z is said to d-separate X from Y if and only if Z blocks every path from a node in X to a node in Y [12].
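The two blocking conditions of Definition 5 can be checked path by path. The sketch below is a minimal illustration for a single path given as a node sequence; it assumes the descendant sets have been precomputed elsewhere, and is not a full d-separation test over all paths.

```python
def path_blocked(path, directed_edges, descendants, Z):
    """Return True iff the path (a node sequence) is blocked by Z:
    (1) a chain/fork middle node in Z blocks the path;
    (2) a collider a -> m <- b blocks unless m or a descendant of m
        is in Z.
    `descendants` maps a node to the set of its descendants."""
    E = set(directed_edges)
    for i in range(1, len(path) - 1):
        a, m, b = path[i - 1], path[i], path[i + 1]
        is_collider = (a, m) in E and (b, m) in E  # a -> m <- b
        if is_collider:
            if m not in Z and not (descendants.get(m, set()) & set(Z)):
                return True   # collider not activated: blocked
        else:
            if m in Z:
                return True   # chain or fork node in Z: blocked
    return False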
4. Experiments
This section implements four existing algorithms: the greedy algorithm A1 [3,4], the improved algorithm A2 [14] for A1, MGA [5], and WRA [6]. A large number of random Bayesian networks are generated using the algorithm in reference [4], and the above algorithms are applied to them. WRA requires two constants, c and Max; here we take c = 1 and Max = 300. The constant c influences the probability that the result is correct, and the constant Max influences the solution speed. Four groups of Bayesian networks were randomly generated; each group contains 500 networks with 25 nodes, and the numbers of edges are 50, 100, 150, and 200, respectively. We applied the four algorithms to these networks.
We introduce the parameter defined in reference [8] in order to measure the relative degree of a node in the graph. Denoting the largest node degree in graph G as Δ(G) and the degree of node v as deg(v), the parameter is the ratio deg(v)/Δ(G). In Figure 5, this ratio is used as the independent variable. When deg(v)/Δ(G) = 1, node v is a largest-degree node in G.
Figure 5.
Statistical frequency of the relative degree deg(v)/Δ(G) of nodes in the loop cutset (here, n and m are the numbers of nodes and edges, respectively).
Figure 5 shows the relationship between the relative degree deg(v)/Δ(G) of a node and the corresponding statistical frequency for the four groups of networks. The statistical frequency increases as the relative degree increases, and the trend is more obvious when the graph is more complicated. However, there is a significant decline in the statistical frequency near the relative-degree value of 0.9 in each graph, which verifies the conclusion of Theorem 5.
The results are consistent with the conclusion of Theorem 5. In the algorithms A1, A2, and MGA, the first choice is the node with the largest degree in the graph, whose relative degree is 1; thus, the statistical frequency at the value 1 is maximal. Once the largest-degree node has been selected into the loop cutset, the nodes in the loop cutset will be instantiated, and the messages from the largest-degree node spread along its neighboring nodes. Another node receives information from the largest-degree node through the shared nodes between them: the more shared nodes, the more information received, and thus the smaller the node-probability of the loop cutset. According to Theorem 1, the number of shared nodes between the largest-degree and next-largest-degree nodes is greater than the constant Δ1 + Δ2 − n, where Δ1 is the largest degree, Δ2 is the next-largest degree, and n is the number of nodes in the graph. Thus, the node-probability of the next-largest-degree node is reduced, which explains the sudden drop in statistical frequency in Figure 5.
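The shared-node count and the lower bound Δ1 + Δ2 − n can be computed directly. The sketch below assumes the Section 3 definition (not reproduced in this excerpt) amounts to counting common neighbors; under that reading, the bound follows from inclusion-exclusion, since |N(u) ∩ N(v)| = |N(u)| + |N(v)| − |N(u) ∪ N(v)| ≥ deg(u) + deg(v) − n.

```python
def shared_nodes(adj, u, v):
    """Shared nodes of u and v, read here as their common neighbours.
    `adj` maps each node to an iterable of its neighbours."""
    return set(adj[u]) & set(adj[v])

def shared_lower_bound(adj, u, v):
    """Inclusion-exclusion lower bound deg(u) + deg(v) - n on the
    number of shared nodes (may be negative in sparse graphs)."""
    n = len(adj)
    return len(adj[u]) + len(adj[v]) - n
```

In the complete graph K_5, any two nodes share the other three nodes, and the bound 4 + 4 − 5 = 3 is tight.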
Next, we study the number of shared nodes between two nodes in the loop cutset, to examine the relationship between the number of shared nodes and the probability that both nodes belong to the loop cutset. Figure 6 shows the statistical probability of two nodes belonging to the loop cutset for different numbers of shared nodes, where the number of nodes is 25 and the numbers of edges are 50, 100, 150, and 200, respectively. It can be seen that the larger the number of shared nodes, the smaller the frequency of two nodes belonging to the loop cutset, which corresponds exactly to Theorem 5.
Figure 6.
Statistical probability of the shared nodes number in the loop cutset (here, n and m are the numbers of nodes and edges, respectively).
We introduce a parameter defined in reference [8] in order to describe the degree of edge-saturation in the graph. Assuming a simple graph G = (V, E) with p nodes and q edges, p and q satisfy the relationship 0 ≤ q ≤ p(p − 1)/2. The parameter is defined as 2q/(p(p − 1)). Its value range is [0, 1]: when its value is 0, G has no edges; when its value is 1, G is a complete graph. It can be seen from this definition that the parameter describes the degree of edge-saturation in the graph and measures the complexity of the graph from the perspective of edge-saturation.
When the number of nodes is 25 and the numbers of edges are 50, 100, 150, and 200, the values of the parameter are 1/6, 1/3, 1/2, and 2/3, respectively. As can be seen from Figure 6, as the edge-saturation parameter increases, the graph becomes more complex, and the numbers of shared nodes between the nodes in the loop cutset become larger. This shows that, with the increase of edge-saturation, the node-to-node ties become closer.
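The edge-saturation parameter is a one-line computation, shown here only to make the values above reproducible:

```python
def edge_saturation(p, q):
    """Edge-saturation parameter 2q / (p(p-1)) of a simple graph with
    p nodes and q edges: 0 for an edgeless graph, 1 for the complete
    graph, since the maximum edge count is p(p-1)/2."""
    return 2 * q / (p * (p - 1))
```

For the experimental settings, edge_saturation(25, 50) gives 1/6 and edge_saturation(25, 200) gives 2/3, matching the values listed above.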
6. Conclusions
We have provided a new tool to gauge the relationship between different nodes, the shared node; proven upper and lower bounds on the number of shared nodes; and revealed the relationship between the number of shared nodes and the node-probability of the loop cutset. The experimental results support the correctness of the theorems in this paper. Moreover, we present an optimization of the existing algorithms for solving the loop cutset and discuss extended applications of shared nodes in social networks.
The shared node is proposed as an attempt to measure the relationship between two nodes, which is necessary in many practical applications. The shared node’s measure of the relationship between two nodes is related to measures of node centrality: both involve the concepts of degree and path, so the measure is reasonable. More generally, with the development of computers and algorithms, many algorithms select their elements iteratively. In each iteration, the existing results should influence the next calculation, so that the available resources are fully utilized; the relationship between an already-selected node and a candidate node is then exactly the quantity we want to know.
Nevertheless, our theoretical analyses point to future research on shared nodes: the proposition that shared nodes are related to conditional independence and graph measurement still lacks exact proof and data support, and the applications of shared nodes in algorithms and network analysis can be implemented further.
Author Contributions
Conceptualization, J.W. and Y.N.; methodology, J.W. and Y.N.; software, J.W.; validation, J.W.; formal analysis, J.W. and W.X.; investigation, J.W. and Y.N.; data curation, J.W. and W.X.; writing—original draft preparation, J.W.; writing—review and editing, Y.N. and W.X.; visualization, J.W. and W.X.; supervision, Y.N. and W.X.; project administration, Y.N.; funding acquisition, Y.N. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China, 11971386.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Pearl, J. Fusion, propagation, and structuring in belief networks. Artif. Intell. 1986, 29, 241–288.
- Cooper, G.F. The computational complexity of probabilistic inference using Bayesian belief networks. Artif. Intell. 1990, 42, 393–405.
- Suermondt, H.J.; Cooper, G.F. Updating probabilities in multiply-connected belief networks. arXiv 2013, arXiv:1304.2377.
- Suermondt, H.J.; Cooper, G.F. Probabilistic inference in multiply connected belief networks using loop cutsets. Int. J. Approx. Reason. 1990, 4, 283–306.
- Becker, A.; Geiger, D. Optimization of Pearl’s method of conditioning and greedy-like approximation algorithms for the vertex feedback set problem. Artif. Intell. 1996, 83, 167–188.
- Becker, A.; Bar-Yehuda, R.; Geiger, D. Randomized algorithms for the loop cutset problem. J. Artif. Intell. Res. 2000, 12, 219–234.
- Razgon, I. Exact computation of maximum induced forest. In Proceedings of the Scandinavian Workshop on Algorithm Theory; Springer: Berlin/Heidelberg, Germany, 2006.
- Wei, J.; Nie, Y.; Xie, W. The Study of the Theoretical Size and Node Probability of the Loop Cutset in Bayesian Networks. Mathematics 2020, 8, 1079.
- Freeman, L.C. Centrality in social networks conceptual clarification. Soc. Netw. 1978, 1, 215–239.
- Opsahl, T.; Agneessens, F.; Skvoretz, J. Node centrality in weighted networks: Generalizing degree and shortest paths. Soc. Netw. 2010, 32, 245–251.
- Sporns, O.; Honey, C.J.; Kötter, R. Identification and classification of hubs in brain networks. PLoS ONE 2007, 2, e1049.
- Pearl, J. Causality: Models, Reasoning and Inference; Cambridge University Press: Cambridge, UK, 2000.
- Bidyuk, B.; Dechter, R. Cutset sampling for Bayesian networks. J. Artif. Intell. Res. 2007, 28, 1–48.
- Stillman, J. On heuristics for finding loop cutsets in multiply-connected belief networks. arXiv 2013, arXiv:1304.1113.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).