Abstract
Learning the conditional probability table (CPT) parameters of Bayesian networks (BNs) is a key challenge in real-world decision support applications, especially when limited data are available. The traditional approach to this challenge is to introduce domain knowledge or expert judgments encoded as qualitative parameter constraints. In this paper, we focus on multiplicative synergistic constraints; the negative and positive multiplicative synergy constraints considered here are symmetric. To integrate multiplicative synergistic constraints into the parameter learning process of Bayesian networks, we propose four methods for handling these constraints based on the idea of the classical isotonic regression algorithm. The four methods are evaluated in simulations on the lawn moist model and the Asia network, and compared with the maximum likelihood estimation (MLE) algorithm. Simulation results show that the proposed methods are superior to the MLE algorithm in parameter learning accuracy and can refine the MLE results to obtain more accurate estimators of the parameters. The proposed methods can reduce the dependence of parameter learning on expert experience, and combining them with Bayesian estimation can improve the accuracy of parameter learning under small-sample conditions.
1. Introduction
Bayesian networks (BNs) can model probabilistic dependence relationships among variables in many real-world problems; therefore, they have become very popular in the artificial intelligence (AI) field over the last two decades. BNs have become a powerful tool with many applications, such as medical diagnosis, financial analysis, bioinformatics and industrial applications [1], target tracking [2], robot control [3], gene analysis [4], ecosystem modeling [5], signal processing [6], and educational measurement [7]. A BN model consists of a network structure and a set of conditional probability tables (CPTs). This paper focuses on the parameter learning of discrete BNs when the structure is known.
In practice, parameter learning requires sufficient samples. If abundant data are available, BNs can easily be constructed using traditional methods such as the maximum likelihood (ML) method [8]. However, when the data are insufficient, the ML method has difficulty obtaining accurate parameters, and thus it is difficult to make the right decisions [9]. It is very difficult to collect abundant data under certain circumstances, such as earthquake prediction [10], parole assessment [11], and rare disease diagnosis [12]. Thus, the data are insufficient in many cases, which may lead to an inaccurate structure and parameters of a BN. Therefore, many scholars have turned their attention to the parameter learning of BNs under small data sets and have proposed algorithms to solve these problems.
Altendorf et al. [13] converted qualitative influence constraints into a penalty function, combined it with the ML function into an objective function, and applied a gradient method to solve it. Feelders and van der Gaag [14] converted qualitative influence constraints into order constraints and applied the isotonic regression algorithm to adjust the parameters so that they satisfy the order constraints. de Campos and Ji [15] converted monotonic constraints into a penalty function and applied a convex optimization algorithm to solve the objective function. Ren et al. [16] transformed interval constraints into beta distributions to constrain the prior parameters, and combined Bayesian estimation to obtain the parameters. Niculescu et al. [17] studied equality constraints such as normative and proportional constraints by introducing Lagrange multipliers. Chang [18] used a Monte Carlo sampling method to extract virtual data from the non-monotonic constraint space, used the virtual data to construct a prior distribution, and finally applied Bayesian estimation to combine the prior distribution with real data to obtain the parameters of BNs. Etminani et al. [19] proposed a multi-expert parameter learning framework to fuse the knowledge of multiple experts. In addition to these constraint-based methods, some scholars proposed BN parameter learning methods based on the minimum free energy (see [20]) and Noisy-OR gates (see [21]). Zhou et al. [22] studied a class of constraints naturally encoded in the edges of BNs with monotonic influences. Gao et al. [23] developed the "MiniMax Fitness" algorithm to address the problem that imposing prior distributions can reduce the fit between parameters and data. For more studies on BNs, one can refer to [24,25,26,27].
A review of the existing literature shows that the main idea of these algorithms is to introduce expert experience or domain knowledge into the parameter learning process of BNs through constraints. However, the constraints involved in the existing literature mainly concern network parameters under a single parent node; constraints under the synergy of multiple parent nodes are rarely studied, and the methods involved are relatively complex. There are many constraints under the condition of multiple parent nodes, such as additive synergy and multiplicative synergy (see [28]). In this paper, isotonic regression is used to study the parameter learning of BNs under multiplicative synergistic constraints. The proposed methods can reduce the dependence of parameter learning on expert experience, and combining them with Bayesian estimation can improve the accuracy of parameter learning under small-sample conditions.
The remainder of this paper is organized as follows. In Section 2, we briefly review some basic theory of BNs and parameter learning. In Section 3, the classical isotonic regression algorithm is introduced. In Section 4, we present the parameter learning algorithms of this paper under multiplicative synergistic constraints, referring to the idea of the pool adjacent violators (PAV) algorithm. In Section 5, the effectiveness and performance of the four algorithms are verified by simulations. In Section 6, some conclusions are given.
2. Preliminaries
In this section, some concepts of Bayesian networks are briefly reviewed so that the rest of the paper can be followed easily.
2.1. Bayesian Networks
Bayesian networks are represented as a directed acyclic graph that contains nodes and edges. Nodes represent random variables, while edges represent the probabilistic relationships between them. For each variable node $X_i$, a conditional probability table is specified as $P(X_i \mid \pi(X_i))$, which describes the probability over the possible values of $X_i$ for each possible configuration of its parent variables $\pi(X_i)$. In a BN over variables $X_1, X_2, \dots, X_n$, the joint probability can be written as follows:

$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \pi(X_i)).$$
To illustrate BNs more clearly, the lawn moist BN depicted in Figure 1 is employed, where "1" stands for "Cloudy", "2" stands for "Rain", "3" stands for "Sprinkle", and "4" stands for "Wet". This model will be used as an experimental model later. The variables in the network are binary, taking the value 0 or 1; if the event occurs or is present, the variable has the state 1. Our purpose is to learn estimators of the parameters of this BN.
Figure 1.
Lawn moist model.
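As a concrete illustration of the chain-rule factorization above, the following sketch evaluates the joint probability of the lawn moist model in Figure 1. The CPT values used here are assumptions for demonstration only; the real parameters appear later in Table 1.

```python
# P(C, R, S, W) = P(C) * P(R | C) * P(S | C) * P(W | R, S)
# Illustrative (assumed) CPTs for the lawn moist model; states are 0/1.

p_c = {1: 0.5, 0: 0.5}                      # P(Cloudy)
p_r = {(1, 1): 0.8, (1, 0): 0.2,            # P(Rain=r | Cloudy=c), keyed (r, c)
       (0, 1): 0.2, (0, 0): 0.8}
p_s = {(1, 1): 0.1, (1, 0): 0.5,            # P(Sprinkle=s | Cloudy=c), keyed (s, c)
       (0, 1): 0.9, (0, 0): 0.5}
p_w = {(1, 1, 1): 0.99, (1, 1, 0): 0.9,     # P(Wet=w | Rain=r, Sprinkle=s), keyed (w, r, s)
       (1, 0, 1): 0.9,  (1, 0, 0): 0.0,
       (0, 1, 1): 0.01, (0, 1, 0): 0.1,
       (0, 0, 1): 0.1,  (0, 0, 0): 1.0}

def joint(c, r, s, w):
    """Joint probability via the BN factorization."""
    return p_c[c] * p_r[(r, c)] * p_s[(s, c)] * p_w[(w, r, s)]

print(joint(1, 1, 0, 1))  # P(Cloudy=1, Rain=1, Sprinkle=0, Wet=1) = 0.5*0.8*0.9*0.9 = 0.324
```

Summing `joint` over all sixteen state combinations returns 1, as the factorization requires.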
2.2. Maximum Likelihood Estimation
Maximum likelihood estimation is an important method for the parameter learning of BNs. The MLE of the parameters is

$$\hat{\theta}_{ijk} = \frac{N_{ijk}}{\sum_{k=1}^{r_i} N_{ijk}},$$

where $i$ is the index of node $X_i$, $j$ is the index of the parent nodes' configuration, $k$ is the state index, $r_i$ is the number of states of $X_i$, and $N_{ijk}$ is the number of cases that satisfy $X_i = k$ and $\pi(X_i) = j$ in the data set (see [1]).
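As a minimal sketch of the estimator above, the following code computes CPT entries as relative frequencies of a child state under each parent configuration. The toy data set and node names are assumptions for illustration only.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE of P(child | parents): theta = N_ijk / sum_k N_ijk.

    data: list of complete records, each a dict node -> state."""
    counts = Counter()
    for rec in data:
        cfg = tuple(rec[p] for p in parents)   # parent configuration j
        counts[(cfg, rec[child])] += 1         # N_ijk
    cpt = {}
    for (cfg, k), n in counts.items():
        total = sum(v for (c, _), v in counts.items() if c == cfg)
        cpt[(cfg, k)] = n / total              # relative frequency
    return cpt

# Four assumed records over Rain (R), Sprinkle (S), Wet (W):
data = [{"R": 1, "S": 0, "W": 1}, {"R": 1, "S": 0, "W": 1},
        {"R": 1, "S": 0, "W": 0}, {"R": 0, "S": 1, "W": 1}]
cpt = mle_cpt(data, "W", ["R", "S"])
print(cpt[((1, 0), 1)])  # 2 of 3 records with (R,S)=(1,0) have W=1 -> 0.666...
```

With so few records, some parent configurations never occur and their CPT rows are simply missing, which is exactly the small-sample weakness of MLE that motivates the constraints in this paper.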
3. Isotonic Regression
Suppose we know that the variables satisfy an order relation, but the estimates obtained by observation or counting do not satisfy this known order; then, the data are adjusted by a weighted average method. This is the problem solved by isotonic regression. Let $x_1 \le x_2 \le \dots \le x_n$ be a set of ordered variables with weights $w_1, w_2, \dots, w_n$, and let $\hat{g}(x_1), \hat{g}(x_2), \dots, \hat{g}(x_n)$ be their estimators. The pool adjacent violators (PAV) algorithm performs isotonic regression as follows:
- Step 1: Start with $\hat{g}(x_1)$ and compare adjacent estimators in pairs. If $\hat{g}(x_i) \le \hat{g}(x_{i+1})$, no adjustment is made; if $\hat{g}(x_i) > \hat{g}(x_{i+1})$, pool the pair and replace both values by their weighted average
$$\hat{g}^*(x_i) = \hat{g}^*(x_{i+1}) = \frac{w_i \hat{g}(x_i) + w_{i+1} \hat{g}(x_{i+1})}{w_i + w_{i+1}},$$
and so on.
- Step 2: If the adjusted values still do not satisfy the order relation of Step 1, repeat the process of Step 1 from the beginning. In this way, the values adjusted by the PAV algorithm can be obtained.
The uniqueness of the solution of isotonic regression has been verified (see [29]): the solution obtained by the above process that satisfies the order relation is unique.
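The two steps above can be sketched as a compact weighted PAV implementation. Instead of the explicit restart in Step 2, pooling steps back one block after each merge, which has the same effect:

```python
def pav(values, weights=None):
    """Weighted isotonic regression by pool adjacent violators.

    Returns the nondecreasing sequence closest to `values` in weighted
    least squares; pooled blocks share their weighted average."""
    if weights is None:
        weights = [1.0] * len(values)
    # each block: [sum of w*v, sum of w, number of pooled elements]
    blocks = [[v * w, w, 1] for v, w in zip(values, weights)]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] / blocks[i][1] > blocks[i + 1][0] / blocks[i + 1][1]:
            # violation: pool the two blocks into their weighted average
            blocks[i][0] += blocks[i + 1][0]
            blocks[i][1] += blocks[i + 1][1]
            blocks[i][2] += blocks[i + 1][2]
            del blocks[i + 1]
            i = max(i - 1, 0)   # pooling may create a new violation to the left
        else:
            i += 1
    out = []
    for s, w, n in blocks:
        out.extend([s / w] * n)
    return out

print(pav([1.0, 3.0, 2.0, 4.0]))  # -> [1.0, 2.5, 2.5, 4.0]
```

The violating pair (3, 2) is pooled to its average 2.5, after which the whole sequence is nondecreasing, matching the uniqueness property cited above.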
4. Parameter Learning of BNs under Multiplicative Synergistic Constraints
4.1. The Model of Multiplicative Synergistic
Multiplicative synergy constraints describe the synergistic size relationship of parameters among three node variables. Let $X_1$, $X_2$, and $Y$ be three variables in a BN, all of which are discrete binary variables. Suppose that $X_1$ and $X_2$ are the parent nodes of $Y$; then, the negative multiplicative synergy constraint of $X_1$ and $X_2$ on $Y$ can be expressed as:

$$P(Y=1 \mid X_1=1, X_2=1)\, P(Y=1 \mid X_1=0, X_2=0) \le P(Y=1 \mid X_1=1, X_2=0)\, P(Y=1 \mid X_1=0, X_2=1).$$

Similarly, the positive multiplicative synergy constraint can be expressed as:

$$P(Y=1 \mid X_1=1, X_2=1)\, P(Y=1 \mid X_1=0, X_2=0) \ge P(Y=1 \mid X_1=1, X_2=0)\, P(Y=1 \mid X_1=0, X_2=1).$$
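A minimal check of the negative multiplicative synergy constraint for binary parents $X_1$, $X_2$ of $Y$ can be written directly from the inequality; the CPT values below are illustrative assumptions:

```python
def satisfies_negative_synergy(p, tol=1e-12):
    """p maps (x1, x2) -> P(Y=1 | X1=x1, X2=x2).

    Negative multiplicative synergy: p(1,1)*p(0,0) <= p(1,0)*p(0,1)."""
    return p[(1, 1)] * p[(0, 0)] <= p[(1, 0)] * p[(0, 1)] + tol

p = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.5, (0, 0): 0.2}
print(satisfies_negative_synergy(p))  # 0.9*0.2 = 0.18 <= 0.6*0.5 = 0.30 -> True
```

Reversing the comparison gives the corresponding check for positive multiplicative synergy, reflecting the symmetry noted in the abstract.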
4.2. Description of Algorithm
Under small data sets, if the variables $X_1$, $X_2$, and $Y$ in the network conform to the above multiplicative synergy constraint, it is very important to determine how to apply this constraint, combined with the small data set, to restrict Bayesian network parameter learning and obtain more accurate results.
The algorithm steps adopted in this paper are as follows:
- Step 1: Use the existing small data set and the maximum likelihood estimation algorithm to obtain the relevant parameters.
- Step 2: For the structural parts with multiple parent nodes, judge whether the corresponding parameters satisfy the multiplicative synergy constraint. If they do, go to Step 5; otherwise, go to Step 3.
- Step 3: Take the left and right sides of the multiplicative synergy constraint as wholes, and use the "averaging" idea of the PAV algorithm to modify each of them.
- Step 4: After adjusting the wholes (products of parameters) in Step 3, modify each individual parameter.
- Step 5: Obtain the final parameter learning result.
Steps 3 and 4 are the focus of the algorithm and are described in detail below. Referring to the idea of the PAV algorithm, this paper proposes four different methods to complete the order preservation of the parameters described in Steps 3 and 4. Taking the negative multiplicative synergy constraint of Section 4.1 as an example, the specific calculation methods are given below in turn; parameter order preservation under the positive multiplicative synergy constraint can be obtained in a similar way. The proposed four methods are as follows:
Method 1
Method 2
Method 3
Method 4
where $N_{11}$ stands for the sample size when the state of the parent nodes is $(X_1=1, X_2=1)$, $N_{10}$ stands for the sample size when the state of the parent nodes is $(X_1=1, X_2=0)$, $N_{01}$ stands for the sample size when the state of the parent nodes is $(X_1=0, X_2=1)$, and $N_{00}$ stands for the sample size when the state of the parent nodes is $(X_1=0, X_2=0)$.
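The exact update formulas of Methods 1 to 4 are given above; as a hedged illustration of the general idea of Steps 3 and 4 only, the following sketch implements one plausible unweighted variant: the two sides of the constraint are treated as products, a violating pair of products is pooled to a common value in the spirit of PAV, and each factor is then rescaled proportionally. The sample-size weighting suggested by the $N$'s above is deliberately omitted for simplicity, so this is not any of the four methods verbatim.

```python
import math

def enforce_negative_synergy(p):
    """p maps (x1, x2) -> P(Y=1 | X1=x1, X2=x2); returns an adjusted copy
    satisfying p(1,1)*p(0,0) <= p(1,0)*p(0,1).  Illustrative sketch only."""
    left = p[(1, 1)] * p[(0, 0)]      # product that should be the smaller side
    right = p[(1, 0)] * p[(0, 1)]
    if left <= right:
        return dict(p)                # constraint already satisfied
    target = (left + right) / 2.0     # pool the two products to a common value
    q = dict(p)
    for pair, prod in (((1, 1), left), ((0, 0), left),
                       ((1, 0), right), ((0, 1), right)):
        # rescale each factor so its side's product becomes `target`
        q[pair] = min(1.0, p[pair] * math.sqrt(target / prod))
    return q

q = enforce_negative_synergy({(1, 1): 0.9, (1, 0): 0.3, (0, 1): 0.4, (0, 0): 0.8})
print(q[(1, 1)] * q[(0, 0)] <= q[(1, 0)] * q[(0, 1)] + 1e-9)  # -> True
```

Here the violating products 0.72 and 0.12 are pooled to 0.42 and both sides end up equal, after which the constraint holds.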
The algorithms obtained from Methods 1 to 4 are convergent, and we obtain the following theorem:
Theorem 1.
Let, then there existssuch that.
Proof of Theorem 1.
Without loss of generality, we only prove the case of Method 1, as the proofs of the other three cases are analogous.
Proving is to show that
then
where ( is a positive constant).
Therefore, we can find that
Since , we have
Hence,
Therefore,
By , we have
When ,, then
Therefore, Theorem 1 holds in this case.
When ,
thus
Let , we can find that
Since , we have
Hence, there exists such that
From the above, we know that there exists such that
This completes the proof of Theorem 1. □
5. Experiments
In this section, we verify the effectiveness and performance of the four algorithms proposed in this paper through two simulations.
5.1. Experiment 1
5.1.1. Simulation Model
This simulation adopts the lawn moist model, as shown in Figure 1.
In the model, Rain, Sprinkle, and Wet meet the multiplicative synergy constraint, which can be expressed as follows:
In the above inequality, a variable value of 1 means that the event occurs, and 0 means that it does not occur. Table 1 shows the real parameters of the network. In order to quantitatively analyze the performance of the methods in this paper, the KL divergence from the real parameters is introduced as an index to measure the accuracy of the algorithms. The expression of the KL divergence (see [30]) is as follows:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)},$$

where $p$ denotes the real parameter distribution and $q$ the learned one.
Table 1.
Real parameters of the simulation network.
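The accuracy index above can be computed directly; the two distributions below are illustrative stand-ins for a real CPT row and its learned estimate:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x)=0
    contribute nothing by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.8, 0.2]      # assumed real parameters of one CPT row
q_learned = [0.7, 0.3]   # assumed learned estimate
print(kl_divergence(p_true, q_learned))  # small positive value (~0.0257)
```

The divergence is zero only when the learned parameters equal the real ones, so smaller values in Table 7 indicate more accurate learning.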
5.1.2. Simulation Analysis
Taking the sample size as 20, the simulation results obtained by the algorithms are shown in Table 2, Table 3, Table 4, Table 5 and Table 6, and the KL divergences between the learning results and the real parameters are shown in Table 7.
Table 2.
Learning parameters of MLE.
Table 3.
Learning parameters of Method 1.
Table 4.
Learning parameters of Method 2.
Table 5.
Learning parameters of Method 3.
Table 6.
Learning parameters of Method 4.
Table 7.
KL divergences between the learning results and the real parameter.
Table 2, Table 3, Table 4, Table 5 and Table 6 show the network parameters learned by the MLE algorithm and the four algorithms proposed in this paper when the sample size is 20. Table 7 shows the KL divergences between the learning results and the real parameters. The experimental results show that, for this small sample size, the KL divergence between the parameters learned by each of the four proposed methods and the real parameters is smaller than that between the MLE parameters and the real parameters. This indicates that each proposed method is superior to the MLE algorithm in parameter learning accuracy. In addition, Table 7 shows that, among the four proposed algorithms, Method 2 achieves the highest learning accuracy and Method 1 the lowest.
5.2. Experiment 2
5.2.1. Simulation Model
This simulation adopts the Asia network, as shown in Figure 2, where ‘1’ stands for Visit To Asia, ‘2’ stands for Tuberculosis, ‘3’ stands for Smoking, ‘4’ stands for Lung Cancer, ‘5’ stands for Tuberculosis or Lung Cancer, ‘6’ stands for Xray Result, ‘7’ stands for Bronchitis, and ‘8’ stands for Dyspnoea.
Figure 2.
Asia network.
In the model, three of the node variables meet the multiplicative synergy constraint, which can be expressed as follows:
In the above inequality, a variable value of 1 means that the event occurs, and 0 means that it does not occur. Table 8 shows the real parameters of the network.
Table 8.
Real parameters of the simulation network.
The explanations of the nodes are as follows:
Node 1—Visit To Asia (2): Visit, No_Visit;
Node 2—Tuberculosis (2): Present, Absent;
Node 3—Smoking (2): Smoker, Nonsmoker;
Node 4—Lung Cancer (2): Present, Absent;
Node 5—Tuberculosis or Lung Cancer (2): True, False;
Node 6—Xray Result (2): Abnormal, Normal;
Node 7—Bronchitis (2): Present, Absent;
Node 8—Dyspnoea (2): Present, Absent.
5.2.2. Simulation Analysis
Taking the sample size as 20, the simulation results obtained by the algorithms are shown in Table 9, Table 10, Table 11, Table 12 and Table 13, and the KL divergences between the learning results and the real parameters are shown in Table 14.
Table 9.
Learning parameters of MLE.
Table 10.
Learning parameters of Method 1.
Table 11.
Learning parameters of Method 2.
Table 12.
Learning parameters of Method 3.
Table 13.
Learning parameters of Method 4.
Table 14.
KL divergences between the learning results and the real parameter.
Table 9, Table 10, Table 11, Table 12 and Table 13 show the network parameters learned by the MLE algorithm and the four algorithms proposed in this paper when the sample size is 20. Table 14 shows the KL divergences between the learning results and the real parameters. The experimental results show that, for this small sample size, the KL divergence between the parameters learned by each of the four proposed methods and the real parameters is smaller than that between the MLE parameters and the real parameters. This indicates that each proposed method is superior to the MLE algorithm in parameter learning accuracy. In addition, Table 14 shows that, among the four proposed algorithms, Method 1 achieves the highest learning accuracy and Method 3 the lowest.
6. Conclusions
Referring to the idea of the PAV algorithm, this paper proposes four methods to deal with multiplicative synergy constraints, and we analyze and compare the algorithms in terms of accuracy. The simulation results show that the four algorithms are superior to the MLE algorithm in parameter learning accuracy and can refine the MLE results to obtain more accurate estimators of the parameters.
The methods proposed in this paper can reduce the dependence of parameter learning on expert experience, and combining these constraint methods with Bayesian estimation can improve the accuracy of parameter learning under small-sample conditions. However, the algorithms in this paper also have limitations: when there are many parent nodes, it is difficult to specify the size relationships among the network parameters. In future research, the constraints presented in this paper can be combined with other existing constraints to further reduce the dependence on expert experience and improve the accuracy of parameter learning.
Author Contributions
Methodology, software, writing—original draft, writing—review and editing, Y.Z.; funding acquisition, supervision, project administration, Z.H.; validation, Y.Z. and Z.H. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by Bigdata Modeling and Intelligent Computing Research Institute, Hubei University of Education, Scientific Research Project of Education Department of Zhejiang Province (Y202147034), Zhejiang College of Shanghai University of Finance and Economics for Scientific Research Projects at the Provincial and Above Levels, and the National Statistical Science Research Project of China (2021LY100).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
The authors would like to thank everyone who helped with this work.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Zhang, L.; Guo, H. Introduction to Bayesian Networks; Science Press: Beijing, China, 2006. [Google Scholar]
- Mascaro, S.; Nicholso, A.E.; Korb, K.B. Anomaly detection in vessel tracks using Bayesian networks. Int. J. Approx. Reason. 2014, 55, 84–98. [Google Scholar] [CrossRef]
- Infantes, G.; Ghallab, M.; Ingrand, F. Learning the behavior model of a robot. Auton. Robot. 2010, 30, 157–177. [Google Scholar] [CrossRef][Green Version]
- Tamada, Y.; Imoto, S.; Araki, H.; Nagasaki, M.; Print, C.; Charnock-Jones, D.S.; Miyano, S. Estimating Genome-Wide Gene Networks Using Nonparametric Bayesian Network Models on Massively Parallel Computers. IEEE/ACM Trans. Comput. Biol. Bioinform. 2010, 8, 683–697. [Google Scholar] [CrossRef] [PubMed]
- Landuyt, D.; Broekx, S.; D’hondt, R.; Engelen, G.; Aertsens, J.; Goethals, P.L. A review of Bayesian belief networks in ecosystem service modelling. Environ. Model. Softw. 2013, 46, 1–11. [Google Scholar] [CrossRef]
- Wachowski, N.; Azimi-Sadjadi, M.R. Detection and Classification of Nonstationary Transient Signals Using Sparse Approximations and Bayesian Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1750–1764. [Google Scholar] [CrossRef]
- Almond, R.G.; Mislevy, R.J.; Steinberg, L.S.; Yan, D.; Williamson, D.M. Bayesian Networks in Educational Assessment; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
- Redner, R.A.; Walker, H.F. Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Rev. 1984, 26, 195–239. [Google Scholar] [CrossRef]
- Isozaki, T.; Kato, N.; Ueno, M. “Data Temperature” in minimum free energies for parameter learning of bayesian networks. Int. J. Artif. Intell. Tools 2009, 18, 653–671. [Google Scholar] [CrossRef]
- Hu, J.; Tang, X.; Qiu, J. A Bayesian network approach for predicting seismic liquefaction based on interpretive structural mod-eling. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2015, 9, 200–217. [Google Scholar] [CrossRef]
- Constantinou, A.C.; Freestone, M.; Marsh, W.; Fenton, N.; Coid, J. Risk assessment and risk management of violent reoffending among prisoners. Expert Syst. Appl. 2015, 42, 7511–7529. [Google Scholar] [CrossRef]
- Seixas, F.L.; Zadrozny, B.; Laks, J.; Conci, A.; Saade, D.C.M. A Bayesian network decision model for supporting the diagnosis of dementia, Alzheimer’s disease and mild cognitive impairment. Comput. Biol. Med. 2014, 51, 140–158. [Google Scholar] [CrossRef]
- Altendorf, E.; Restificar, A.; Dietterich, T. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of the Twenty First Conference on Uncertainty in Artificial Intelligence (UAI 2005), Edinburgh, UK, 26–29 July 2005; pp. 18–26. [Google Scholar]
- Feelders, A.; van der Gaag, L.C. Learning Bayesian network parameters under order constraints. Int. J. Approx. Reason. 2006, 42, 37–53. [Google Scholar]
- de Campos, C.P.; Ji, Q. Improving Bayesian network parameter learning using constraints. In Proceedings of the Nineteenth International Conference on Pattern Recognition (ICPR 2008), Tampa, FL, USA, 8–11 December 2008; pp. 1–4. [Google Scholar]
- Ren, J.; Gao, X.; Ru, W. Parameters learning of BN in small sample base on data missing. Syst. Eng. Theory Pract. 2011, 31, 172–177. [Google Scholar]
- Niculescu, R.S.; Mitchell, T.M.; Rao, R.B. Bayesian network learning with parameter constraints. J. Mach. Learn. Res. 2006, 7, 1357–1383. [Google Scholar]
- Chang, R. Advanced Algorithms of Bayesian Network Learning and Probabilistic Inference from Inconsistent Prior Knowledge and Sparse Data with Applications in Computational Biology and Computer Vision. In Bayesian Network; Intechopen: London, UK, 2010. [Google Scholar] [CrossRef]
- Etminani, K.; Naghibzadeh, M.; Peña, J.M. DemocraticOP: A Democratic way of aggregating Bayesian network parameters. Int. J. Approx. Reason. 2013, 54, 602–614. [Google Scholar] [CrossRef]
- Isozaki, T.; Kato, N.; Ueno, M. Minimum Free Energies with “Data Temperature” for Parameter Learning of Bayesian Networks. In Proceedings of the 2008 20th IEEE International Conference on Tools with Artificial Intelligence, Dayton, OH, USA, 3–5 November 2008. [Google Scholar] [CrossRef]
- Oniśko, A.; Druzdzel, M.J.; Wasyluk, H. Learning Bayesian network parameters from small data sets: Application of Noisy-OR gates. Int. J. Approx. Reason. 2001, 27, 165–182. [Google Scholar] [CrossRef]
- Zhou, Y.; Fenton, N.; Zhu, C. An empirical study of Bayesian network parameter learning with monotonic influence constraints. Decis. Support Syst. 2016, 87, 69–79. [Google Scholar] [CrossRef]
- Gao, X.; Guo, Z.; Ren, H.; Yang, Y.; Chen, D.; He, C. Learning Bayesian network parameters via minimax algorithm. Int. J. Approx. Reason. 2019, 108, 62–75. [Google Scholar] [CrossRef]
- Tang, K.; Parsons, D.J.; Jude, S. Comparison of automatic and guided learning for Bayesian networks to analyse pipe failures in the water distribution system. Reliab. Eng. Syst. Saf. 2019, 186, 24–36. [Google Scholar] [CrossRef]
- Imani, M.; Ghoreishi, S.F. Graph-Based Bayesian Optimization for Large-Scale Objective-Based Experimental Design. IEEE Trans. Neural Networks Learn. Syst. 2021, 1–13. [Google Scholar] [CrossRef]
- Scutari, M.; Vitolo, C.; Tucker, A. Learning Bayesian networks from big data with greedy search: Computational complexity and efficient implementation. Stat. Comput. 2019, 29, 1095–1108. [Google Scholar] [CrossRef]
- Imani, M.; Imani, M.; Ghoreishi, S.F. Bayesian Optimization for Expensive Smooth-Varying Functions. IEEE Intell. Syst. 2022. [Google Scholar] [CrossRef]
- Wellman, M. Fundamental concepts of qualitative probabilistic networks. Artif. Intell. 1990, 44, 257–303. [Google Scholar] [CrossRef]
- Di, R.; Gao, X.; Guo, Z. Discrete Bayesian network parameter learning based on monotonic constraint. Syst. Eng. Electron. 2014, 36, 272–277. [Google Scholar]
- Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

