Constrained Adjusted Maximum a Posteriori Estimation of Bayesian Network Parameters

Maximum a posteriori (MAP) estimation with a Dirichlet prior has been shown to be effective in improving the parameter learning of Bayesian networks when the available data are insufficient. Given no extra domain knowledge, a uniform prior is often used for regularization. However, when the underlying parameter distribution is non-uniform or skewed, a uniform prior does not work well, and a more informative prior is required. In reality, unless the domain experts are extremely unfamiliar with the network, they can provide some reliable knowledge about it. With that knowledge, we can automatically derive informative priors and select a reasonable equivalent sample size (ESS). In this paper, building on the parameter constraints transformed from such domain knowledge, we propose a Constrained adjusted Maximum a Posteriori (CaMAP) estimation method that features two novel techniques. First, to obtain an informative prior distribution (or prior shape), we present a novel sampling method that constructs the prior distribution from the constraints. Second, to find the optimal ESS (or prior strength), we derive constraints on the ESS from the parameter constraints and select the optimal ESS by cross-validation. Numerical experiments show that the proposed method is superior to other learning algorithms.


Introduction
A Bayesian network (BN) is a type of graphical model that combines probability and causality theory. Under suitable causal assumptions, a BN becomes a causal model that enables reasoning about interventions [1][2][3]. BNs have been shown to be powerful tools for statistical prediction and classification problems, and they have been widely applied in many fields, such as geological hazard prediction [4], reliability analysis [5,6], medical diagnosis [7,8], gene analysis [9], fault diagnosis [10], and language recognition [11]. A BN B = (G, Θ) includes two components: a graph structure G and a set of parameters Θ. The structure G is a Directed Acyclic Graph (DAG) consisting of nodes (also called vertices) representing random variables $(X_1, \ldots, X_n)$, where n is the number of variables, and directed edges (also called arcs) corresponding to the conditional dependence relationships among the variables. Note that there must be no directed cycles in the graph. When sufficient data are available, the parameters of a BN can be precisely and efficiently learnt by statistical approaches such as Maximum Likelihood (ML) estimation. When the sample data set is small, however, ML estimation often overfits the data and fails to approximate the underlying parameter distribution. To address this problem, Maximum a Posteriori (MAP) estimation has been introduced and shown to be effective in improving parameter learning, because of two useful properties: (I) the hyper-parameters of the BN model can be interpreted as equivalent sample observations, and (II) experts find it convenient to specify a uniform prior when no further knowledge is available.

Bayesian Network
A BN is a probabilistic graphical model representing a set of variables and their conditional dependencies via a DAG. Learning a BN includes two parts: structure learning and parameter learning. Structure learning consists of finding the optimal DAG G that identifies the dependencies between variables from the observational data. Parameter learning entails estimating the optimal parameters θ that quantitatively specify the conditional dependencies between variables. Given the structure, the parameter estimation of a network can be factorized into independent parameter estimations for the individual variables, which means

$$l(D \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}, \quad (1)$$

where $l(D \mid \theta)$ is the likelihood of the parameters θ given the observational data D and $N_{ijk}$ is the number of observations in which node i takes its k-th state under its j-th parent configuration. The ML estimate of $\theta_{ijk}$ is

$$\hat{\theta}_{ijk} = \frac{N_{ijk}}{N_{ij}}, \quad (2)$$

where $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$. When the observed data are sufficient, the ML estimate often fits the underlying distributions well. However, when the data are insufficient, additional information such as domain knowledge is required to prevent over-fitting.
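The ML estimate of Equation (2) can be sketched as a short routine (a hedged illustration; the uniform fallback for an unobserved parent configuration is our own convention, not part of the paper):

```python
import numpy as np

def ml_estimate(counts):
    """Maximum likelihood estimate of one conditional distribution.

    counts[k] = N_ijk, the number of observations in which child node i
    takes state k under parent configuration j.  Returns
    theta_ijk = N_ijk / N_ij, with a uniform fallback when N_ij = 0
    (no data for this parent configuration; an assumed convention).
    """
    counts = np.asarray(counts, dtype=float)
    n_ij = counts.sum()
    if n_ij == 0:
        return np.full_like(counts, 1.0 / len(counts))
    return counts / n_ij
```

For example, `ml_estimate([3, 1])` yields the distribution `[0.75, 0.25]`.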

Parameter Constraints
Domain knowledge can be transformed into qualitative parameter constraints. In practice, there are three common types of parameter constraints [22,27], all of which are convex (i.e., the constraints form a convex feasible parameter set whose geometric center is easy to compute; see Section 3.1). The constraints are: (1) Range constraint: this constraint defines the upper and lower bounds of a parameter and is the one most commonly considered in practice.
(2) Intra-distribution constraint: this constraint describes the comparative relationship between two parameters that refer to the same parent configuration state but different child node states.
(3) Cross-distribution constraint: this constraint has also been called an "order constraint" [23] or a "monotonic influence constraint" [24]. It defines the comparative relationship between two parameters that share the same child node state but different parent configuration states.
The third type of constraint might be harder to grasp. As an example, smoking (S = 1) and polluted air (PA = 1) are two causes of lung cancer (LC = 1), and medical experts agree that smoking is more likely to cause lung cancer. This medical knowledge can be expressed as the cross-distribution constraint P(LC = 1 | S = 1, PA = 0) > P(LC = 1 | S = 0, PA = 1).
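The three constraint types can be written down as simple predicates over a parameter table (an illustrative sketch; the nested-dict layout `theta[node][parent_config][state]` is our own choice, not the paper's notation):

```python
def range_ok(theta, i, j, k, lo, hi):
    """Range constraint: lo <= theta[i][j][k] <= hi."""
    return lo <= theta[i][j][k] <= hi

def intra_ok(theta, i, j, k1, k2):
    """Intra-distribution constraint: same parent configuration j,
    different child states k1, k2; requires theta_ijk1 >= theta_ijk2."""
    return theta[i][j][k1] >= theta[i][j][k2]

def cross_ok(theta, i, j1, j2, k):
    """Cross-distribution constraint: same child state k, different
    parent configurations j1, j2; requires theta_ij1k >= theta_ij2k."""
    return theta[i][j1][k] >= theta[i][j2][k]
```

The lung-cancer example then reads `cross_ok(theta, "LC", (1, 0), (0, 1), 1)`, i.e., P(LC=1 | S=1, PA=0) ≥ P(LC=1 | S=0, PA=1).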

Problem Formulation
With observational data and domain knowledge, the parameter learning problem of a discrete BN can be formally defined as follows. Input: n: the number of nodes in the network; G: the structure with unknown parameters; D: a set of complete observations of the variables; Ω: a set of parameter constraints transformed from reliable domain knowledge, Ω = {Ω_1, Ω_2, . . . , Ω_n}, where Ω_i denotes all the constraints on node i.

Task:
Find the optimal parameters that approximate the underlying parameter distribution, $\hat{\theta} = \{\hat{\theta}_1, \ldots, \hat{\theta}_n\}$, $\hat{\theta}_i = \{\hat{\theta}_{i1}, \ldots, \hat{\theta}_{iq_i}\}$, $\hat{\theta}_{ij} = \{\hat{\theta}_{ij1}, \ldots, \hat{\theta}_{ijr_i}\}$. Here, $q_i$ is the number of configuration states of the parents of variable $X_i$, and $r_i$ is the number of states of variable $X_i$.

Sample Complexity of BN Parameter Learning
Basically, the ML estimation method learns accurate parameters when the acquired data are sufficient; when the data are insufficient, ML estimation is often inaccurate. Thus, a definition of the sample complexity of BN parameter learning helps to determine whether ML meets the accuracy requirement. With regard to this problem, Dasgupta [28] derived a lower bound on the sample size needed to learn the parameters of a BN with known structure. Given a network of n binary variables in which no node has more than k parents, the sample complexity with confidence 1 − δ is lower bounded by a function of n, k, ε, and δ (see [28]), where ε is the error rate, often computed as ε = nσ for a small constant σ.

The Method
Among all the parameter learning algorithms, MAP estimation is one that conveniently combines prior knowledge and observed data. For node i, the posterior over the parameters $\theta_{ij}$ can be written as

$$P(\theta_{ij} \mid D) \propto P(D \mid \theta_{ij})\,P(\theta_{ij}),$$

where $P(\theta_{ij})$ denotes the prior distribution and $P(D \mid \theta_{ij})$ equals the likelihood $l(D \mid \theta_{ij})$. Thus, the MAP estimate of $\theta_{ij}$ can be further defined as

$$\hat{\theta}_{ij} = \arg\max_{\theta_{ij}} P(D \mid \theta_{ij})\,P(\theta_{ij}).$$

Since the parameters $\theta_{ij}$ studied in this paper follow a multinomial distribution, and the conjugate prior of the multinomial distribution is the Dirichlet distribution, the prior of $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijr_i})$ is set to be Dirichlet, i.e., $\theta_{ij} \sim \mathrm{Dir}(\alpha_{ij1}, \ldots, \alpha_{ijr_i})$, where $\alpha_{ij1}, \ldots, \alpha_{ijr_i}$ act as hypothetical observations equivalent to $N_{ij1}, \ldots, N_{ijr_i}$. As a result, the approximate MAP estimate (see Appendix A) of $\theta_{ijk}$ has the form

$$\hat{\theta}_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{N_{ij} + \alpha_{ij}},$$

where $\alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}$ is the equivalent (or hypothetical) sample size. Generally, domain experts find it difficult to provide a specific Dirichlet prior but are more comfortable making qualitative statements about the unknown parameters. From such qualitative statements, or parameter constraints, the hyper-parameters of the prior $\mathrm{Dir}(\alpha_{ij1}, \ldots, \alpha_{ijr_i})$ can be further defined as

$$\alpha_{ijk} = \alpha_{ij}\,\theta^{prior}_{ijk},$$

where $\theta^{prior}_{ij} = (\theta^{prior}_{ij1}, \ldots, \theta^{prior}_{ijr_i})$ is the prior parameter vector that represents the domain knowledge and can be sampled from the parameter constraints. Finally, the MAP estimate of $\theta_{ijk}$ can be expressed as

$$\hat{\theta}_{ijk} = \frac{N_{ijk} + \alpha_{ij}\,\theta^{prior}_{ijk}}{N_{ij} + \alpha_{ij}}. \quad (11)$$

As the parameter constraints are incorporated into the MAP estimation, we call the above estimator the Constrained adjusted Maximum a Posteriori (CaMAP) estimation. In the following sections, we introduce the elicitation of the prior parameter $\theta^{prior}_{ij}$ and the selection of the optimal ESS $\alpha_{ij}$.
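The CaMAP estimate amounts to a one-line smoothing rule combining counts, prior shape, and ESS; a minimal sketch:

```python
import numpy as np

def camap_estimate(counts, theta_prior, ess):
    """CaMAP point estimate for one conditional distribution:
        theta_ijk = (N_ijk + ess * theta_prior_ijk) / (N_ij + ess),
    i.e. Dirichlet hyper-parameters alpha_ijk = ess * theta_prior_ijk.
    """
    counts = np.asarray(counts, dtype=float)
    theta_prior = np.asarray(theta_prior, dtype=float)
    return (counts + ess * theta_prior) / (counts.sum() + ess)
```

For example, with counts `[1, 6]`, prior shape `[0.8, 0.2]`, and ESS 5, the estimate of the first parameter is (1 + 5 × 0.8) / (7 + 5) = 5/12.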

Prior Elicitation
Before defining the optimal ESS $\alpha_{ij}$, the prior parameter $\theta^{prior}_{ij}$ is required; it can be elicited from the parameter constraints in a sampling manner. In this paper, we design a sampling method that applies to all types of convex constraints. Specifically: (1) First, we search for the optimal parameters of the following model:

$$\max_{\theta_i} \; C \quad \text{s.t.} \quad \theta_i \in \Omega(\theta_i),$$

where C is a random constant and $\Omega(\theta_i)$ represents all the parameter constraints on node i. The constrained model is simple and can be solved efficiently. Note that even though the objective function is a constant, the solution of the constrained model can vary from run to run; in fact, any parameter vector satisfying the given constraints is a solution. Therefore, by iteratively solving the constrained model, we collect parameters that cover the feasible region defined by the constraints.
(2) Then, the first step is repeated (in this paper, we set the number of repetitions to 100; the sampling code is available at https://uk.mathworks.com/matlabcentral/fileexchange/34208-uniform-distribution-over-a-convex-polytope, accessed on 26 September 2021) to collect enough sampled parameters to cover the constrained parameter space. To make sure that the sampled parameters are spread over the constrained parameter space, at each sampling step we add the extra constraint

$$\left\|\theta^{t+1}_i - \theta^{t}_i\right\| \ge \tau,$$

where τ is a small value (e.g., 0.1), $\theta^t_i$ is the parameter vector sampled at step t, and $\theta^{t+1}_i$ is the parameter vector sampled at step t + 1.
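The paper's sampler repeatedly solves a constant-objective constrained program in MATLAB; as a lightweight stand-in, the centre of the same feasible region can be approximated by rejection sampling on the probability simplex (an assumption-laden sketch, not the authors' implementation):

```python
import numpy as np

def elicit_prior(constraints, dim, n_keep=100, seed=0):
    """Rejection-sampling stand-in for constraint-based prior elicitation:
    draw points uniformly on the probability simplex, keep those that
    satisfy all constraints, and average them to approximate the centre
    of the feasible region (the prior shape theta_prior).

    `constraints` is a list of predicates theta -> bool.
    """
    rng = np.random.default_rng(seed)
    kept = []
    while len(kept) < n_keep:
        theta = rng.dirichlet(np.ones(dim))  # uniform on the simplex
        if all(c(theta) for c in constraints):
            kept.append(theta)
    return np.mean(kept, axis=0)
```

For a range constraint such as θ₁ ≥ 0.6, the elicited prior is the mean of accepted samples and therefore also satisfies the constraint (since the constraint set is convex).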

ESS Value Selection
Although the sampled prior $\theta^{prior}_i$ is guaranteed to satisfy all the parameter constraints, the overall estimate (Equation (11)) may violate the constraints if the ESS $\alpha_{ij}$ is not reasonably defined. For example, for the binary variables {LC = Lung Cancer, S = Smoking, PA = Polluted Air}, smoking and polluted air are known to cause lung cancer. The parameter $\theta_{142}$ represents the probability that LC is true given that S and PA are both true; that is, $\theta_{142}$ is the probability of having lung cancer (LC = 1) given that the patient consistently smokes (S = 1) and works in polluted air (PA = 1). The medical experts assert that $\theta_{142}$ lies in the interval [0.6, 1.0], which forms the parameter constraint. Now, suppose the elicited prior $\theta^{prior}_{142}$ is 0.80, which satisfies the parameter constraint, and the purely data-driven (ML) estimate is $N_{142}/N_{14} = 1/7$. Then, with a small ESS, such as 5, the estimate (Equation (11)) is computed as

$$\hat{\theta}_{142} = \frac{N_{142} + \alpha_{14}\,\theta^{prior}_{142}}{N_{14} + \alpha_{14}} = \frac{1 + 5 \times 0.80}{7 + 5} = \frac{5}{12} \approx 0.417.$$

Obviously, this estimate does not satisfy the constraint $\theta_{142} \in [0.6, 1.0]$. In fact, to make sure that the estimate does not violate the constraint, the ESS should not be less than 16, which follows from solving $(1 + 0.8\,\alpha_{14})/(7 + \alpha_{14}) \ge 0.6$. Therefore, given the elicited prior and the observation counts, to guarantee that the overall CaMAP estimate satisfies all the parameter constraints, the optimal ESS should satisfy certain constraints.
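The arithmetic of this example can be checked directly (a worked verification of the α ≥ 16 bound; the helper name is ours):

```python
def camap(n_k, n_total, prior_k, ess):
    # Equation (11) for a single parameter.
    return (n_k + ess * prior_k) / (n_total + ess)

# ESS = 5 violates the range constraint theta_142 >= 0.6:
#   (1 + 5 * 0.80) / (7 + 5) = 5/12, which is about 0.417.
# Solving (1 + 0.8 * a) / (7 + a) >= 0.6 for a gives a >= 16:
#   at a = 16 the estimate is (1 + 12.8) / 23 = 0.6 exactly.
too_small = camap(1, 7, 0.80, 5)
boundary = camap(1, 7, 0.80, 16)
```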
From each type of constraint imposed on the parameters, ESS constraints can be derived as follows: (1) To satisfy a range constraint $\theta_{ijk} \in [\theta^{lower}_{ijk}, \theta^{upper}_{ijk}]$, the CaMAP estimate in Equation (11) should satisfy

$$\theta^{lower}_{ijk} \le \frac{N_{ijk} + \alpha_{ij}\,\theta^{prior}_{ijk}}{N_{ij} + \alpha_{ij}} \le \theta^{upper}_{ijk}.$$

(2) To satisfy an intra-distribution constraint $\theta_{ijk_1} \ge \theta_{ijk_2}$, the CaMAP estimate should satisfy

$$\frac{N_{ijk_1} + \alpha_{ij}\,\theta^{prior}_{ijk_1}}{N_{ij} + \alpha_{ij}} \ge \frac{N_{ijk_2} + \alpha_{ij}\,\theta^{prior}_{ijk_2}}{N_{ij} + \alpha_{ij}},$$

which implies

$$N_{ijk_1} + \alpha_{ij}\,\theta^{prior}_{ijk_1} \ge N_{ijk_2} + \alpha_{ij}\,\theta^{prior}_{ijk_2}.$$

(3) To satisfy a cross-distribution constraint $\theta_{ij_1k} \ge \theta_{ij_2k}$, the CaMAP estimate should satisfy

$$\frac{N_{ij_1k} + \alpha_{ij_1}\,\theta^{prior}_{ij_1k}}{N_{ij_1} + \alpha_{ij_1}} \ge \frac{N_{ij_2k} + \alpha_{ij_2}\,\theta^{prior}_{ij_2k}}{N_{ij_2} + \alpha_{ij_2}},$$

where $\alpha_{ij_1}$ and $\alpha_{ij_2}$ represent the ESS values of the two distributions under the cross-distribution constraint. In this paper, we set $\alpha_{ij_1} = \alpha_{ij_2} = \alpha$, and thus we have

$$\left(N_{ij_1k} + \alpha\,\theta^{prior}_{ij_1k}\right)\left(N_{ij_2} + \alpha\right) \ge \left(N_{ij_2k} + \alpha\,\theta^{prior}_{ij_2k}\right)\left(N_{ij_1} + \alpha\right).$$

From the above inequality, constraints on the ESS values $\alpha_{ij_1}$ and $\alpha_{ij_2}$ can be derived. Furthermore, in this paper, for each node, we define two classes of ESS: "global" and "local". A "global" ESS is an equivalent sample size imposed on all parameter distributions of a given node i, while a "local" ESS applies only to the parameter distribution of one specific parent configuration state. For example, in Figure 1, for node i, $\alpha_i$ is the "global" ESS, while $(\alpha_{i1}, \ldots, \alpha_{iq_i})$ are the "local" ESSs.
In general, with the elicited prior, the observational data, and the parameter constraints, the optimal ESSs for node i can be determined by the following procedure: (1) First, from the elicited prior and the observational data, the optimal "global" ESS $\alpha_i$ is determined by cross-validation [29]. In the cross-validation, each candidate ESS (in this paper, the candidate ESS varies from 1 to 50) is evaluated by the likelihood of the posterior estimate in Equation (11).
(2) Then, based on the parameter constraints, we can derive the constraints on each "local" ESS α ij .
(3) Finally, for each "local" ESS $\alpha_{ij}$: (I) if no constraint is imposed on $\alpha_{ij}$, we set $\alpha_{ij} = \alpha_i$; (II) if constraints are imposed on $\alpha_{ij}$ and the "global" ESS $\alpha_i$ satisfies them, we also set $\alpha_{ij} = \alpha_i$; otherwise, $\alpha_{ij}$ is determined by further cross-validation using the data, the prior, and the ESS constraints. Note that, in this further validation, the initial candidate value of $\alpha_{ij}$ is set to the lower bound of the range defined by its constraints.
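The cross-validation step for the "global" ESS can be sketched as follows (a simplified, single-distribution illustration; the per-node scheme and the constrained "local" search are omitted):

```python
import numpy as np

def select_global_ess(data_folds, theta_prior, candidates=range(1, 51)):
    """Pick the 'global' ESS by cross-validation: for each candidate
    ESS, fit the CaMAP estimate on the training folds and score the
    log-likelihood of the held-out fold; return the best-scoring ESS.

    data_folds: list of count vectors (one per fold) for a single
    conditional distribution; a simplification of the per-node scheme.
    """
    def camap(counts, ess):
        counts = np.asarray(counts, dtype=float)
        return (counts + ess * theta_prior) / (counts.sum() + ess)

    best_ess, best_score = None, -np.inf
    for ess in candidates:
        score = 0.0
        for held in range(len(data_folds)):
            train = sum(np.asarray(f, float)
                        for f in (data_folds[:held] + data_folds[held + 1:]))
            theta = camap(train, ess)
            # Held-out multinomial log-likelihood of the counts.
            score += float(np.asarray(data_folds[held]) @ np.log(theta))
        if score > best_score:
            best_ess, best_score = ess, score
    return best_ess
```

The constrained "local" search would restrict `candidates` to the range derived from the ESS constraints, starting at its lower bound.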
The pseudo-code of the proposed CaMAP algorithm can be summarized as Algorithm 1:

Numerical Illustration of CaMAP Method
To illustrate the principle of the proposed method, we demonstrate parameter learning for the BN shown in Figure 2, which is extracted from the brain tumor BN [23]. The meanings of the nodes in the network are given below. Specifically, the network indicates that the presence of a brain tumor and an increased level of serum calcium may cause a coma.

(1) First, we assume that a small data set of 20 patients is available, from which the observation counts are obtained. Furthermore, we acquire the following medical knowledge from the medical experts: a brain tumor as well as an increased level of serum calcium is likely to cause the patient to fall into a coma in due course. From this medical knowledge, we generate the corresponding parameter constraints. (2) Then, based on the parameter constraints, we elicit the priors using the proposed prior elicitation algorithm (Section 3.1). (3) From the elicited priors and the observation counts, constraints on the ESS values are derived; one of the derived constraints requires the ESS to be no less than 9.01. (4) Next, for node C, the optimal "global" ESS is cross-validated to be 3. As the "global" ESS does not satisfy any of the ESS constraints, the "local" ESSs would not be equal to the "global" ESS and should be further validated. Based on the prior, the data, and the ESS constraints, the optimal "local" ESSs are cross-validated.

The Experiments
We conducted experiments to investigate the performance of the proposed CaMAP method in terms of learning accuracy, under different sample sizes and constraint sizes. In the experiments, we used the networks from [16,17], shown in Figures 3-7. The true parameter distributions in these networks show different uniformities, varying from strongly skewed to strongly uniform distributions. As the true parameters were known in advance, the learnt parameters were evaluated by the Kullback-Leibler (KL) divergence [30], which indicates the divergence between the learnt parameters (estimated distribution) and the true parameters (underlying distribution). The proposed method was evaluated against the following learning algorithms: ME [31], ML [32], MAP [13], CME [26,33], and CML [24,34].
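Per conditional distribution, the evaluation score is the standard KL divergence (a minimal sketch; the network-level score aggregates this over all CPT columns):

```python
import numpy as np

def kl_divergence(p_true, p_est, eps=1e-12):
    """KL(p_true || p_est) between two discrete distributions; the
    estimate is clipped away from zero to keep the log finite."""
    p = np.asarray(p_true, dtype=float)
    q = np.clip(np.asarray(p_est, dtype=float), eps, None)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

The divergence is zero exactly when the learnt distribution matches the true one, and grows as the estimate drifts away.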

The code of all six tested algorithms can be found at https://github.com/ZHIGAO-GUO/CaMAP (accessed on 29 September 2021). Notice that, (I) in the MAP method, we used a uniform (or flat) prior, i.e., $\theta^{prior}_{ijk}$ in Equation (11) was set to $1/r_i$ and the ESS value to 1, and (II) in the CaMAP method, we set the maximum candidate ESS to 50, which is a sufficient number for all the tested networks.

Learning with Different Sample Sizes
First, we examined the learning performance of all the algorithms under different sample sizes. Our experiments were carried out under the following settings: (1) The sample sizes were set to 10, 20, 30, 40, and 50, respectively. (2) The parameter constraints were randomly generated from the true parameters of the tested networks, with the maximum number of constraints per node set to 3. Specifically, the parameter constraints were generated using the following rules: (1) Range constraints were generated as $[\theta^{lower}_{ijk}, \theta^{upper}_{ijk}]$, where $\theta^{lower}_{ijk} = \max(0, \theta^{*}_{ijk} - \tau_1)$ and $\theta^{upper}_{ijk} = \min(1, \theta^{*}_{ijk} + \tau_2)$, where $\theta^{*}_{ijk}$ represents the true parameter and $\tau_1$ and $\tau_2$ are two random values around 0.2. (2) Inequality constraints were generated as $\theta_{ij_1k_1} \ge \theta_{ij_2k_2}$ if $\theta^{*}_{ij_1k_1} - \theta^{*}_{ij_2k_2} \ge 0.2$. Therefore, when $j_1 = j_2$ and $k_1 \ne k_2$, the constraint is an intra-distribution constraint, while it is a cross-distribution constraint when $j_1 \ne j_2$ and $k_1 = k_2$.
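The constraint-generation rules can be sketched as below (hedged: the exact distribution of τ₁, τ₂ "around 0.2" is not specified in the text, so a uniform draw on [0.1, 0.3] is our assumption):

```python
import numpy as np

def make_range_constraint(theta_true, rng):
    """Range constraint [max(0, t - tau1), min(1, t + tau2)] around the
    true parameter t; tau1, tau2 ~ Uniform(0.1, 0.3) (assumed)."""
    tau1, tau2 = rng.uniform(0.1, 0.3, size=2)
    return max(0.0, theta_true - tau1), min(1.0, theta_true + tau2)

def make_inequality_constraint(t1, t2, margin=0.2):
    """Emit theta_1 >= theta_2 only when the true gap is at least
    `margin`; returns True if a constraint is generated, else False."""
    return (t1 - t2) >= margin
```

Whether a generated inequality is intra- or cross-distribution then depends only on whether the two parameters share the parent configuration index or the child state index.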
We performed 100 repeated experiments. The average KL divergence values of different algorithms on different networks under different sample sizes are summarized in Table 1 with the best results highlighted in bold.
From the experimental results, we draw the following conclusions: (1) With increasing data, the performance of all the algorithms improved to different degrees. (2) In almost all cases, CaMAP outperformed the other learning algorithms. However, when the available data were extremely insufficient, e.g., a sample size of 10, CaMAP was inferior to the MAP method. The explanation might be that the insufficiency of data impacts the cross-validation of the ESS values: the selected ESS turns out to be extreme, either too small or too large, and fails to balance the data and the prior (see the second future study in the Discussion and Conclusions section).

Learning with Different Constraint Sizes
Next, we further explored the learning performance of different learning algorithms under different constraint sizes. The experiments were conducted under the following settings: (1) The data set size for all the tested networks was set to be 20, which is a small number for all networks. (2) Parameter constraints were generated from the true parameters of the networks and the maximum number of constraints for each node was set to be 3. The parameters were learnt from a fixed data set but an increasing number of parameter constraints that were randomly chosen from all generated constraints. The constraint sparsity varied from 0% to 100%. For each setting, we performed 100 repeated experiments.
The average KL divergence values of the different algorithms on the different networks under different constraint sizes are summarized in Table 2. From the experimental results, we draw the following conclusions: (1) For the algorithms that do not use constraints, such as ML, ME, and MAP, changing the constraint size did not impact performance. However, for the algorithms that incorporate constraints, such as CML, CME, and CaMAP, an increase in the number of constraints affected their performance to a degree that depends on the number of incorporated constraints.
(2) In most cases, CaMAP outperformed the other parameter learning algorithms, except for MAP when no parameter constraints were incorporated into the learning. In fact, when no parameter constraints were available, the CaMAP method was slightly inferior to MAP estimation with a uniform prior. The explanation might be as follows: when parameter constraints are not available, constraints on the ESS values cannot be deduced, so the ESS values in the CaMAP estimation are the same as those in the MAP estimation. The difference between the CaMAP and MAP estimates then lies in the prior, $\theta^{prior}_i$. Unlike the uniform prior in MAP estimation, the prior in the CaMAP method is elicited by a sampling method, and it is hard for sampling methods to achieve completely uniform coverage unless the sample size is very large (see the first future study in the Discussion and Conclusions section).

Discussion and Conclusions
For MAP estimation in BN parameter learning, an informative prior distribution and reasonable ESS values are two crucial factors that impact the learning performance. Empirically, a uniform prior is preferred and the ESS is then cross-validated against that uniform prior. However, when the underlying parameter distribution is non-uniform or skewed, MAP estimation with a uniform prior does not fit the underlying distribution well, and in that case an informative prior is required. In fact, reliable qualitative domain knowledge has been shown to be useful for eliciting informative priors and selecting a reasonable ESS. In this paper, we proposed the CaMAP estimation method. The proposed method automatically elicits the prior distribution from the parameter constraints that are transformed from the domain knowledge. In addition, constraints on the ESS values are derived from the parameter constraints, and the optimal ESSs, both "global" and "local", are then chosen by cross-validation from the ranges defined by those ESS constraints. Our experiments demonstrated that the proposed method outperformed most mainstream parameter learning algorithms. In future studies: (1) A more effective prior elicitation approach is desired. Compared to sampling-based methods, geometric constraint-solving methods would be more robust and could elicit more informative priors.
(2) A more reasonable ESS selection method is preferred. With cross-validation, when the available data are extremely insufficient or uninformative, the selected ESS tends to maximize the likelihood of the data and makes the CaMAP estimate fail to approach the underlying parameter distribution. Data bootstrapping guided by the parameter constraints may extend the data, make them more informative, and thus improve the ESS selection.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The approximate MAP estimate of $\theta_{ijk}$ has the form

$$\hat{\theta}_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{N_{ij} + \alpha_{ij}}.$$

Proof. The posterior of the parameters $\theta_{ij}$, where $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijr_i})$ and $r_i$ is the number of states of node i, is

$$P(\theta_{ij} \mid D) = \frac{P(D \mid \theta_{ij})\,P(\theta_{ij})}{P(D)} \propto P(D \mid \theta_{ij})\,P(\theta_{ij}), \quad (24)$$

where $P(\theta_{ij})$ is the prior and $P(D \mid \theta_{ij})$ is the likelihood. Thus, the maximum a posteriori estimate of $\theta_{ij}$ is

$$\hat{\theta}_{ij} = \arg\max_{\theta_{ij}} P(\theta_{ij} \mid D) = \arg\max_{\theta_{ij}} P(D \mid \theta_{ij})\,P(\theta_{ij}).$$

As it is more convenient to work with logarithms, the MAP estimate of $\theta_{ij}$ can be expressed as

$$\hat{\theta}_{ij} = \arg\max_{\theta_{ij}} \log P(\theta_{ij} \mid D) = \arg\max_{\theta_{ij}} \left[\log P(D \mid \theta_{ij}) + \log P(\theta_{ij})\right].$$

Since the parameters $\theta_{ij}$ studied in this paper follow a multinomial distribution, and the conjugate prior of the multinomial distribution is the Dirichlet distribution $\mathrm{Dir}(\alpha_{ij1}, \alpha_{ij2}, \ldots, \alpha_{ijr_i})$, the above equation can be further written as

$$\hat{\theta}_{ij} = \arg\max_{\theta_{ij}} \sum_{k=1}^{r_i} \left(N_{ijk} + \alpha_{ijk} - 1\right) \log \theta_{ijk}.$$

Then, the maximum a posteriori estimate of $\theta_{ijk}$ is the mode of the Dirichlet posterior,

$$\hat{\theta}_{ijk} = \frac{N_{ijk} + \alpha_{ijk} - 1}{N_{ij} + \alpha_{ij} - r_i}.$$

However, this estimate only holds for $\alpha_{ijk} > 1$, and it is only one choice of point estimate since the true $\theta_{ijk}$ is unknown. Instead of the exact MAP estimate, the approximate estimate

$$\hat{\theta}_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{N_{ij} + \alpha_{ij}},$$

which coincides with the posterior mean, is adopted in this paper.