Teaching Probabilistic Graphical Models with OpenMarkov

Abstract: OpenMarkov is an open-source software tool for probabilistic graphical models. It has been developed especially for medicine, but has also been used to build applications in other fields and for teaching, in more than 30 countries. In this paper we explain how to use it as a pedagogical tool to teach the main concepts of Bayesian networks and influence diagrams, such as conditional dependence and independence, d-separation, Markov blankets, explaining away, optimal policies, and expected utilities, as well as some inference algorithms: logic sampling, likelihood weighting, and arc reversal. The facilities for learning Bayesian networks interactively can be used to illustrate step by step the performance of the two basic algorithms: search-and-score and PC.


Introduction
Bayesian networks (BNs) [1] and influence diagrams (IDs) [2,3] are two types of probabilistic graphical models (PGMs) [4][5][6] widely used in artificial intelligence. Unfortunately, the mathematical theory that supports them may be tough for beginners. Our computer science students, despite having a relatively strong mathematical background, find it hard to intuitively grasp some of the fundamental concepts, such as conditional independence and d-separation. Additionally, we have been teaching PGMs to health professionals, most of them physicians, for more than 25 years, and although we avoid the more complex aspects (for instance, we do not mention d-separation and only teach them the variable elimination algorithm), some of the basic notions that are important for them, such as conditional independence, are difficult to convey. In this paper we show how OpenMarkov, an open-source tool with an advanced graphical user interface (GUI), has allowed us to explain more intuitively some concepts that we found very difficult to explain before we had it.
This article is an extended version of the paper "Teaching Bayesian networks with OpenMarkov", presented at the 9th International Conference on Probabilistic Graphical Models, Prague, 2018 [7].
The rest of this paper is structured as follows: Section 2 introduces the background (notation, definitions, and an overview of OpenMarkov); Sections 3 and 4, the core of the paper, explain how to teach BNs and IDs respectively; Section 5 contains a brief discussion and the conclusion.

Basic Definitions about Probability and Graphs
In the literature about PGMs it is usual to represent variables with capital letters (X) and their values with lower-case letters (x). A bold upper-case letter (X) denotes a set of variables and a bold lower-case letter (x) denotes a configuration of them, i.e., the assignment of a value to each variable in X. In this paper we assume that all the variables are discrete, i.e., each variable has a finite set of values, called states. When a variable X is Boolean, we denote by +x the state "true", "present", or "positive", and by ¬x the state "false", "absent", or "negative".

Definition 1 (Conditional independence). Given a probability distribution P, two variables X and Y, and a set of variables Z containing neither X nor Y, we define I_P(X, Y | Z) as follows:

$$I_P(X, Y \mid \mathbf{Z}) \iff \forall x, \forall y, \forall \mathbf{z}, \quad P(x, y \mid \mathbf{z}) = P(x \mid \mathbf{z}) \cdot P(y \mid \mathbf{z}). \tag{1}$$

When I_P(X, Y | Z) holds, we say that X and Y are conditionally independent given Z. In this case, if P(y | z) ≠ 0 for a particular value y of Y, then

$$\forall x, \quad P(x \mid \mathbf{z}, y) = P(x \mid \mathbf{z}), \tag{2}$$

i.e., if we already know Z = z, knowing later that Y = y does not alter the probability of X.
In these expressions Z can be the empty set, which has only one configuration, the empty configuration. Conditioning on it has no effect, so P(x | z) reduces to P(x) and P(x | z, y) reduces to P(x | y). If I_P(X, Y | ∅) holds, we say that X and Y are a priori independent.
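Definition 1 can be checked numerically on a small joint distribution. The following Python sketch is only an illustration: the common-cause model Z → X, Z → Y and all its numbers are hypothetical, and the helper names (`is_cond_independent`, `pr`) are not part of OpenMarkov.

```python
from itertools import product

def is_cond_independent(joint, tol=1e-9):
    """Check Equation (1) for a full joint table mapping (x, y, z) to P(x, y, z)."""
    xs = {k[0] for k in joint}
    ys = {k[1] for k in joint}
    zs = {k[2] for k in joint}
    for z in zs:
        p_z = sum(joint[(x, y, z)] for x in xs for y in ys)
        if p_z == 0:
            continue  # P(. | z) is undefined, nothing to check
        for x, y in product(xs, ys):
            p_xy = joint[(x, y, z)] / p_z
            p_x = sum(joint[(x, y2, z)] for y2 in ys) / p_z
            p_y = sum(joint[(x2, y, z)] for x2 in xs) / p_z
            if abs(p_xy - p_x * p_y) > tol:
                return False
    return True

# Hypothetical common-cause model Z -> X, Z -> Y with Boolean variables.
P_Z = {True: 0.3, False: 0.7}
P_X_GIVEN_Z = {True: 0.8, False: 0.1}   # P(+x | z)
P_Y_GIVEN_Z = {True: 0.9, False: 0.2}   # P(+y | z)

def pr(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true

BOOL = (True, False)
joint = {(x, y, z): P_Z[z] * pr(P_X_GIVEN_Z[z], x) * pr(P_Y_GIVEN_Z[z], y)
         for x, y, z in product(BOOL, repeat=3)}

# Marginalize Z out; the dummy value None plays the role of the empty configuration.
marginal = {(x, y, None): joint[(x, y, True)] + joint[(x, y, False)]
            for x, y in product(BOOL, repeat=2)}

given_z = is_cond_independent(joint)       # X and Y given Z
a_priori = is_cond_independent(marginal)   # X and Y given the empty set
print(given_z, a_priori)
```

In this hypothetical model X and Y are conditionally independent given Z but not a priori independent, because Z is a common cause of both.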
The graphs considered in this paper can have at most one link between each pair of nodes. A graph is directed when all its links are directed. A directed path is a sequence of nodes {X_1, ..., X_n} such that there is a link X_i → X_{i+1} between each pair of consecutive nodes. A cycle is a sequence of nodes {X_1, ..., X_n} such that there is a link X_i → X_{i+1} between each pair of consecutive nodes and a link X_n → X_1. Directed graphs containing no cycles are said to be acyclic.
When there is a link X → Y, we say that X is a parent of Y and Y is a child of X. The set of parents of a node X is denoted by Pa(X). When there is a directed path from X to Y, we say that X is an ancestor of Y and Y is a descendant of X.
A pair of consecutive links in a path is called a trail [5]. A trail can be convergent (X → Z ← Y), divergent (Y ← X → Z), or sequential (X → Y → Z); these three types are sometimes called head-to-head, tail-to-tail, and head-to-tail, respectively [4].
The following two definitions for acyclic directed graphs are relevant to PGMs.

Definition 2 (d-separation).
A path consisting of one link is always active. Let X, Y, and Z be three nodes and E a set of nodes in an acyclic directed graph G, such that E contains neither X nor Y. A sequential trail X → Z → Y or a divergent trail X ← Z → Y is active if Z is not in E. A convergent trail X → Z ← Y is active if Z or at least one of its descendants is in E.
A path consisting of more than one link is active when all its trails are active. Given two nodes, X and Y, and a set E containing neither X nor Y, if there is at least one active path connecting them, we say that they are connected given E; otherwise, we say that they are separated given E and denote it as S_G(X, Y | E).

Proposition 1. Let X and Y be two nodes and E a set of nodes in an acyclic directed graph G, such that E contains neither X nor Y. A path (not necessarily a directed path) between X and Y is active if and only if it consists of a single link or every node W between X and Y in the path satisfies this property:
1. if the arrows that connect W with its two neighbors in the path converge in it (head-to-head trail), then W or at least one of its descendants is in E;
2. else (i.e., if W is the middle of a divergent or a sequential trail), then W is not in E.
This proposition is a consequence of Definition 2, but it could alternatively be taken as the definition of d-separation.
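The criterion of Proposition 1 translates almost literally into code. The Python sketch below is an illustration under stated assumptions: the graph is a hypothetical fragment (two causes of a disease and two of its effects), the node names are plain strings, and `path_is_active` assumes that consecutive nodes of the given path are actually linked.

```python
def descendants(children, node):
    """All nodes reachable from `node` through directed links."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def path_is_active(links, path, evidence):
    """Proposition 1: links is a set of (parent, child) pairs, path a list of nodes."""
    children = {}
    for parent, child in links:
        children.setdefault(parent, set()).add(child)
    # Inspect every intermediate node W together with its two neighbors in the path.
    for prev, w, nxt in zip(path, path[1:], path[2:]):
        convergent = (prev, w) in links and (nxt, w) in links
        if convergent:
            # Head-to-head: W or one of its descendants must be in the evidence.
            if not ({w} | descendants(children, w)) & evidence:
                return False
        elif w in evidence:
            # Divergent or sequential trail: W must not be in the evidence.
            return False
    return True

# Hypothetical fragment: VA -> D1 <- VB, D1 -> S, D1 -> F.
links = {("VA", "D1"), ("VB", "D1"), ("D1", "S"), ("D1", "F")}

print(path_is_active(links, ["VA", "D1", "VB"], set()))    # convergent, no evidence
print(path_is_active(links, ["VA", "D1", "VB"], {"S"}))    # activated by a descendant
print(path_is_active(links, ["S", "D1", "F"], {"D1"}))     # divergent, blocked by D1
```

A path with a single link has no intermediate node, so the loop body never runs and the function returns True, in agreement with Definition 2.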

Probabilistic Graphical Models
A PGM [4,5] consists of a set of variables V, a probability distribution P(v), and a graph G such that each node in the graph represents a variable in V; for this reason, it is usual to speak indifferently of nodes and variables. The relation between G and P depends on the type of PGM. In this paper we focus on two types of PGMs whose graphs are directed and acyclic, namely, BNs and IDs.

Bayesian Networks
In a BN every node has a conditional probability distribution for each configuration of its parents, P(x | pa(X)). Since we assume that all the variables are discrete, the set of distributions for a node can be encoded as a conditional probability table (CPT).
The relation between G and P is given by the following properties; we can take any one of them as the definition of a BN and then prove that the other two follow from it [4,5,8]:

1. Factorization of the probability. The joint probability is the product of the probability of each node conditioned on its parents, i.e.,

$$P(\mathbf{v}) = \prod_{X \in \mathbf{V}} P(x \mid pa(X)). \tag{3}$$

(In this equation, the value x and the configuration pa(X) in the right-hand side are given by the projection of v onto X and Pa(X) respectively. The same holds for the equations in the next section.)

2. Markov property. Each node is independent of its non-descendants given its parents, i.e., if Y is a set of nodes such that none of them is a descendant of X, then

$$P(x \mid pa(X), \mathbf{y}) = P(x \mid pa(X)). \tag{4}$$

3. d-separation. If two nodes X and Y are d-separated in the graph given E (cf. Definition 2), then they are probabilistically independent given E:

$$S_G(X, Y \mid \mathbf{E}) \implies I_P(X, Y \mid \mathbf{E}). \tag{5}$$
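The factorization property can be made concrete with a few lines of Python. This is a minimal sketch, not OpenMarkov code: the four-node structure and every number in the tables are illustrative assumptions, chosen so that the resulting marginal of D1 is close to the value 0.0268 quoted later for Figure 1.

```python
from itertools import product

# Assumed structure: VA -> D1 <- VB, D1 -> S (all variables Boolean).
parents = {"VA": (), "VB": (), "D1": ("VA", "VB"), "S": ("D1",)}

# cpt[node][parent values] = P(node = True | parents); hypothetical numbers.
cpt = {
    "VA": {(): 0.02},
    "VB": {(): 0.01},
    "D1": {(True, True): 0.99, (True, False): 0.9,
           (False, True): 0.9, (False, False): 0.0},
    "S": {(True,): 0.7, (False,): 0.05},
}

def joint_probability(v):
    """P(v) as the product of P(x | pa(X)) over all the nodes."""
    p = 1.0
    for node, pa in parents.items():
        p_true = cpt[node][tuple(v[q] for q in pa)]
        p *= p_true if v[node] else 1.0 - p_true
    return p

nodes = list(parents)
configs = [dict(zip(nodes, values))
           for values in product((True, False), repeat=len(nodes))]

total = sum(joint_probability(v) for v in configs)             # must be 1
p_d1 = sum(joint_probability(v) for v in configs if v["D1"])   # marginal P(+d1)
print(round(total, 6), round(p_d1, 6))
```

Enumerating all the configurations is exponential in the number of nodes, so this approach only serves to illustrate the definition on very small networks.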

Influence Diagrams
IDs [2,3] have three different types of nodes: chance (V_C), decision (V_D), and utility (V_U). In this paper we assume that utility nodes have no children; for a more general presentation, see [9]. Every chance node C has a CPT and each utility node U has a table, ψ, that represents the decision maker's values for each configuration of the parents of U, ψ_U(pa(U)).
The meaning of a link (sometimes called arc) depends on the type of nodes it connects. A link from a decision D_i to a decision D_j means that D_i is made before D_j. A link from a chance node C to a decision node D_j means that the value of variable C is known when making the decision D_j. Links into utility nodes represent functional dependency.
IDs require that there is a directed path connecting all the decisions; it induces a total ordering {D_0, ..., D_{n−1}}, the order in which they are made. It is usual to assume the non-forgetting hypothesis, which means that a variable C known for a decision D_j is also known for any subsequent decision D_k. The set of chance variables, V_C, can be partitioned into {C_0, C_1, ..., C_n}, where C_i is the subset of variables unknown for D_{i−1} and known for D_i. The set of variables known to the decision maker when deciding on D_i is called the informational predecessors of D_i and denoted by InfPred(D_i) [10].
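A stochastic policy for a decision D is a probability distribution P_D(d | infPred(D)). If P_D is degenerate (i.e., consisting of ones and zeros only) then the policy is deterministic. A strategy Δ for an ID is a set of policies, one for each decision, {P_D | D ∈ V_D}. A strategy Δ induces a joint probability distribution over V_C ∪ V_D defined as follows:

$$P_\Delta(\mathbf{v}_C, \mathbf{v}_D) = \prod_{C \in \mathbf{V}_C} P(c \mid pa(C)) \cdot \prod_{D \in \mathbf{V}_D} P_D(d \mid infPred(D)),$$

so that the expected utility under the strategy Δ is

$$EU(\Delta) = \sum_{\mathbf{v}_C} \sum_{\mathbf{v}_D} P_\Delta(\mathbf{v}_C, \mathbf{v}_D) \sum_{U \in \mathbf{V}_U} \psi_U(pa(U)),$$

where infPred(D) and pa(U) are the projections of the configuration (v_C, v_D) onto InfPred(D) and Pa(U) respectively. The maximum expected utility (MEU) is

$$MEU = \max_{\Delta \in \Delta^*} EU(\Delta),$$

where Δ* is the set of all the strategies. A strategy Δ_opt is optimal if it maximizes the expected utility, i.e., if EU(Δ_opt) = MEU. Each policy in an optimal strategy is an optimal policy. The evaluation of an ID consists in finding the MEU and an optimal strategy. It can be proved [11] that

$$MEU = \sum_{\mathbf{c}_0} \max_{d_0} \sum_{\mathbf{c}_1} \cdots \max_{d_{n-1}} \sum_{\mathbf{c}_n} \prod_{C \in \mathbf{V}_C} P(c \mid pa(C)) \sum_{U \in \mathbf{V}_U} \psi_U(pa(U)).$$

When the information available for D_i is infPred(D_i), the best choice for this decision is

$$d_i^{opt} = \arg\max_{d_i} \sum_{\mathbf{c}_{i+1}} \max_{d_{i+1}} \cdots \sum_{\mathbf{c}_n} \prod_{C \in \mathbf{V}_C} P(c \mid pa(C)) \sum_{U \in \mathbf{V}_U} \psi_U(pa(U)),$$

which implies that, in the optimal strategy, the policy for D_i is

$$P_{D_i}(d_i \mid infPred(D_i)) = \begin{cases} 1 & \text{if } d_i = d_i^{opt}, \\ 0 & \text{otherwise.} \end{cases}$$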
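The alternation of sums and maximizations can be illustrated with a tiny ID. The Python sketch below rests on assumptions that are not taken from the paper: a disease X, a test result Y that is known before the single decision D (treat or not), and one utility table ψ(x, d), all with hypothetical numbers. With this ordering, C_0 = {Y}, D_0 = D, and C_1 = {X}, so the MEU is the sum over y of the maximum over d of the sum over x.

```python
BOOL = (True, False)

P_X = {True: 0.1, False: 0.9}              # hypothetical prevalence P(x)
P_POSITIVE = {True: 0.9, False: 0.05}      # hypothetical P(+y | x)
UTILITY = {(True, True): 8.0, (True, False): 3.0,      # psi(x, d), hypothetical
           (False, True): 9.0, (False, False): 10.0}

def p_y_given_x(y, x):
    return P_POSITIVE[x] if y else 1.0 - P_POSITIVE[x]

def evaluate():
    """Return (MEU, policy), where policy maps each test result to the best decision."""
    meu, policy = 0.0, {}
    for y in BOOL:                          # sum over C_0 = {Y}
        best_d, best_value = None, float("-inf")
        for d in BOOL:                      # max over D_0
            value = sum(P_X[x] * p_y_given_x(y, x) * UTILITY[(x, d)]
                        for x in BOOL)      # sum over C_1 = {X}
            if value > best_value:
                best_d, best_value = d, value
        policy[y] = best_d
        meu += best_value
    return meu, policy

meu, policy = evaluate()
print(round(meu, 3), policy)
```

With these numbers the optimal policy is to treat when the test is positive and not to treat otherwise; any fixed decision that ignores the test yields a lower expected utility.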

Arc Reversal Algorithm
An arc (a link) X → Y in a BN can be inverted without modifying the joint probabilities or the expected utilities of the network, as long as there is no other directed path from X to Y [2]. It proceeds as follows. Let A = Pa(X) ∩ Pa(Y), B = Pa(X) \ A, and C = Pa(Y) \ (A ∪ {X}), which are disjoint sets. The CPTs for the two nodes in the original network are P(x | a, b) and P(y | x, a, c), respectively. In the new BN, this link is replaced with Y → X and the new CPTs are P(y | a, b, c) and P(x | y, a, b, c):

$$P(y \mid \mathbf{a}, \mathbf{b}, \mathbf{c}) = \sum_x P(x \mid \mathbf{a}, \mathbf{b}) \cdot P(y \mid x, \mathbf{a}, \mathbf{c}),$$

$$P(x \mid y, \mathbf{a}, \mathbf{b}, \mathbf{c}) = \frac{P(x \mid \mathbf{a}, \mathbf{b}) \cdot P(y \mid x, \mathbf{a}, \mathbf{c})}{P(y \mid \mathbf{a}, \mathbf{b}, \mathbf{c})}.$$

In order to maintain the consistency of the BN, it is then necessary to share the parents by drawing a link from each node in C to X and from each node in B to Y.
Arc reversal can be applied to compute the posterior probability P(v | e) of a variable of interest, V, in a BN. A node is said to be barren when it is not V, is not in E, and is not an ancestor of V or of any evidence variable. Due to the Markov property, barren nodes can be removed from the network without altering the posterior probability of V. If necessary, some links can be inverted one by one to create new barren nodes until the evidence variables are the parents of V and all the other nodes have been removed; the probability of interest can then be read from the CPT for V (we will see an example in Section 3.3.1).
Similarly, links X → Y between chance nodes in an ID can be inverted in order to remove all the chance and decision nodes one by one, and some utility nodes can be fused, until only one utility node remains [12,13] (there is an example in Section 4.1.2). A chance node X whose only descendants are utility nodes can be absorbed as follows. If the only child of X is U, the algorithm adds a link from each node in Pa(X) \ Pa(U) to U and removes X. The new utility potential is

$$\psi'_U(\mathbf{v}) = \sum_x P(x \mid pa(X)) \cdot \psi_U(pa(U)),$$

where V = (Pa(X) ∪ Pa(U)) \ {X}. If X has more than one utility child, U_X = {U_1, ..., U_n}, they must be fused into a new utility node U, with Pa(U) = ∪_{U_i ∈ U_X} Pa(U_i) and ψ_U = Σ_i ψ_{U_i}. Similarly, a decision D whose only descendant is U can be absorbed, so that the new utility is

$$\psi'_U(\mathbf{v}) = \max_d \psi_U(pa(U)),$$

where V = Pa(U) \ {D}. If D has more than one utility child, they must be fused, as above.
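The reversal of a link X → Y in a BN can be verified numerically. The following Python sketch covers a hypothetical case in which A is empty, B contains one Boolean parent of X, and C contains one Boolean parent of Y; the CPT values and function names are illustrative assumptions, not OpenMarkov's implementation. It computes the two new CPTs and lets us check that the product of the two tables is the same before and after the reversal.

```python
from itertools import product

BOOL = (True, False)

# Hypothetical CPTs of the original network: B -> X -> Y <- C.
P_X_GIVEN_B = {True: 0.8, False: 0.1}                       # P(+x | b)
P_Y_GIVEN_XC = {(True, True): 0.95, (True, False): 0.6,
                (False, True): 0.3, (False, False): 0.05}   # P(+y | x, c)

def pr(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true

def old_product(x, y, b, c):
    """P(x | b) * P(y | x, c) in the original network."""
    return pr(P_X_GIVEN_B[b], x) * pr(P_Y_GIVEN_XC[(x, c)], y)

def reverse_arc():
    """Return the new tables P(y | b, c) and P(x | y, b, c) after reversing X -> Y."""
    new_y, new_x = {}, {}
    for y, b, c in product(BOOL, repeat=3):
        new_y[(y, b, c)] = sum(old_product(x, y, b, c) for x in BOOL)
        for x in BOOL:
            new_x[(x, y, b, c)] = old_product(x, y, b, c) / new_y[(y, b, c)]
    return new_y, new_x

new_y, new_x = reverse_arc()
max_error = max(abs(old_product(x, y, b, c) - new_y[(y, b, c)] * new_x[(x, y, b, c)])
                for x, y, b, c in product(BOOL, repeat=4))
print(max_error)
```

After the reversal Y has the parents B and C, and X has the parents Y, B, and C, which corresponds to the sharing of parents described above. The division is safe here only because every probability in the example is strictly between 0 and 1.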
OpenMarkov has been designed primarily for medicine and for teaching. With this tool and its predecessor, Elvira [21], our research group has built complex models for several real-world health problems. (Some of those networks and other examples are available at http://www.probmodelxml.org/networks; accessed on 20 September 2022.) Other groups have used it to build PGMs in other fields, such as planning and robotics [22]. Both Elvira and OpenMarkov have paid special attention to the explanation of reasoning [23,24], a topic whose importance has been acknowledged in the area of expert systems since the 1980s [25], and is now an issue of utmost relevance in modern AI (see [24] and references therein).
To our knowledge, OpenMarkov has been used for research and teaching in more than 30 countries, from top universities, large companies, and centers of the Government of the United States to students in low-income countries who cannot afford to pay for commercial software for PGMs.

Evidence Propagation in BNs with OpenMarkov
In a diagnostic problem, the assignment of a value to a variable as a result of an observation is called a finding. The set of findings is called evidence. The propagation of evidence consists in computing the posterior probability of some variables given the evidence.
In OpenMarkov chance variables are drawn as rounded rectangles and colored in cream, as shown in Figure 1. When a finding is entered (usually by double-clicking on the value/state of the variable), OpenMarkov propagates it and shows the posterior probability by means of a horizontal bar. It is possible to have several sets of findings, each called an evidence case, and display several bars for each state. Figure 1 shows three evidence cases: in the first one, corresponding to the red bars, there is no finding (E = ∅); in the second one, shown in blue, the presence of virus A is confirmed, so E = {V_A} and e = (+v_A); in the third one, shown in green, this virus is known to be absent, i.e., E = {V_A} and e = (¬v_A). This allows the user to see how the probabilities of the variables change when new findings are entered.

Figure 1. A Bayesian network for the differential diagnosis of two hypothetical diseases. The horizontal bars represent the probability of each state for each evidence case. We can check that V_A and V_B are a priori independent by introducing evidence about V_A and observing that the probability of V_B does not change. The same holds for the 5 variables at the right of F. In contrast, the 4 descendants of V_A do depend on the evidence for this variable.

Conditional Independence
Even though the concepts of probabilistic dependence (correlation) and independence are mathematically very simple (cf. Equation (1)), many students have difficulty understanding them intuitively, especially in the case of conditional independence. In our teaching, we use the network in Figure 1, which has a clear causal interpretation: all the variables are Boolean, and for each link X → Y, the finding +x increases the probability of +y, except in the case of vaccination, +v, which decreases the probability of D_2 being present.
In order to illustrate a priori independence, we point out that in this model there is no link between the two viruses, V_A and V_B, and they have no common ancestors. Therefore, they are d-separated in the graph, and because of Property (5) (with E = ∅), they are a priori independent. We can check it by introducing a finding for V_A and observing that the probability of V_B does not change, or vice versa; for example, P(+v_B | +v_A) = P(+v_B | ¬v_A) = P(+v_B) = 0.01, as shown in Figure 1, which confirms that Equation (2) holds. In contrast, we can see that the variables V_A and D_1 are correlated by introducing evidence about the one and observing that the probability of the other changes; for example, in Figure 1 we observe that P(+d_1 | +v_A) = 0.9009 > P(+d_1) = 0.0268 > P(+d_1 | ¬v_A) = 0.009.
We can also see that in this graph each node at the left of F is separated from each variable at its right when E = ∅, which implies that they are pairwise a priori independent (see again Figure 1). We can verify it by introducing evidence for one variable on one side and observing that the probabilities on the other side do not change.
To illustrate conditional independence, we first show that S (a sign) and F (fever) are a priori correlated by introducing evidence on one of them and seeing that the probability of the other changes. However, if we first introduce evidence about D_1, which plays the role of the conditioning set, and then introduce a finding for S (by generating a new evidence case in OpenMarkov), the probability of F does not change, as we can observe in Figure 2. This shows that F and S, despite being correlated a priori, are conditionally independent given D_1 (it is an instance of Equation (1) with Z = {D_1}). Our students easily understand that the correlation between fever and the sign is due to a common cause, and when we know with certainty whether this cause is present or absent, the correlation disappears. OpenMarkov confirms that our intuitive understanding of causation leads to the numerical results we expected.

d-Separation
Section 2.2.1 introduced the definition of d-separation. If we just left our students with it (or with its equivalent definition in Proposition 1), they would be absolutely unable to understand the rationale behind it, and so would we! In particular, it is difficult to understand why some trails are active if and only if the intermediate node is in E, while for other trails the opposite is true (see Definition 2). Additionally, a convergent path X → Z ← Y, which is a priori inactive, can be activated not only by Z but also by any of its descendants, while a divergent path Y ← X → Z, which is a priori active, can be blocked by X but not by its ancestors. It sounds arbitrary, if not esoteric.
To make d-separation intuitive, we explain that this property is a consequence of the factorization of the probability (cf. Equation (3)) and that it agrees with our notions of causality, provided that E in Definition 2 is interpreted as the evidence, i.e., a set of findings for the observed variables. We first consider that a path containing just one link is always active, by definition, whatever the evidence. We can observe for the network in Figure 1 that introducing a finding for one variable affects the posterior probabilities of all its neighbors, even if there is evidence for other nodes. The correlation in this case is explained by a direct cause-effect relation.
We then consider a divergent trail, such as S ← D_1 → F. Definition 2 says that this path is active a priori, i.e., when there is no evidence. We can verify it by introducing a finding for S or for F, as explained above. This correlation is intuitive because S and F are effects of a common cause. However, when the presence of this disease is confirmed or ruled out by direct observation, the trail becomes inactive. We can check it by first entering evidence for D_1 and then, in a new evidence case, adding a finding for S, either +s or ¬s; the probability of F does not change, as shown in Figure 2. This also agrees with our notion of causal influence.
The behavior of a sequential trail is similar; for example, the path V_A → D_1 → S means that the causal influence of V_A on S is mediated by D_1. This path is active a priori because detecting virus A increases the probability of the disease and, consequently, that of the symptom (deductive reasoning). Similarly, the symptom makes us suspect the presence of the virus (abductive reasoning). However, the finding +d_1 blocks this path, because once we know that the disease is present (or absent), the information about the virus does not affect the probability of the symptom, and vice versa. We can check it with OpenMarkov. Again, d-separation agrees with our intuitive notion of causality.
Let us now consider the convergent trail V_A → D_1 ← V_B. When E = ∅, it is inactive and V_A and V_B are separated. (At this point, it may be worth warning our students that, contrary to intuition, "being connected" is not a transitive property: V_A is connected with D_1 and D_1 is connected with V_B, but V_A is not connected with V_B. The analysis of whether a path is active cannot be done link by link, because an individual link is always active; it is necessary to consider every pair of consecutive links, i.e., every trail, as in Definition 2 and Proposition 1. The lesson is that, even though intuition is very useful in mathematics, it must be properly trained and supported by formal reasoning.) We can check that V_A and V_B are separated (and, because of Property (5), a priori independent) by introducing evidence for V_A and observing that the probability of V_B does not change, as shown in Figure 1. This is intuitive, because there is no common cause for these variables. In contrast, when there is evidence about D_1, the trail V_A → D_1 ← V_B becomes active and, consequently, V_A and V_B are no longer separated. We can verify it by first introducing evidence about D_1 (for example, +d_1), generating a new evidence case, introducing evidence about V_A, and observing that now the probability of V_B changes: P(+v_B | +d_1, +v_A) < P(+v_B | +d_1) < P(+v_B | +d_1, ¬v_A). This is consistent with the causal interpretation of the BN, because when a patient has the first disease, we suspect that the cause is virus A or virus B; if additional evidence (for example, the result of a test) leads us to rule out virus A, we then suspect that the cause of the disease is virus B; conversely, if the presence of A is confirmed, our suspicion of B decreases. Put another way, the finding +d_1 creates a negative correlation between V_A and V_B, which are a priori independent.
This phenomenon, called explaining away [4], is the most typical case of intercausal reasoning; in particular, it is a property of the noisy-OR model [4,26]. (In this network there is another noisy OR at F.) It also follows from the definition of d-separation that the convergent trail V_A → D_1 ← V_B is not only activated by D_1 itself, but also by any of its descendants. We can verify it with OpenMarkov by introducing the findings +s or +f. This is another instance of explaining away because any of these findings makes us suspect the presence of at least one of the viruses, with a negative correlation between V_A and V_B.
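Explaining away can be reproduced with a few lines of Python. The sketch below is an illustration under assumptions: it models D_1 as a noisy OR without leak, with link probabilities of 0.9 for both viruses and a prior of 0.02 for virus A. These parameters are not taken from the network file; they were chosen because they reproduce the values P(+d_1 | +v_A) = 0.9009 and P(+d_1 | ¬v_A) = 0.009 quoted above, together with the prior P(+v_B) = 0.01.

```python
from itertools import product

P_VA, P_VB = 0.02, 0.01    # P(+v_B) is quoted in the text; P(+v_A) is an assumption
C_A, C_B = 0.9, 0.9        # assumed noisy-OR link probabilities, no leak

def p_d1(va, vb):
    """Noisy OR: P(+d1 | va, vb) = 1 - product of (1 - c_i) over the causes present."""
    q = 1.0
    if va:
        q *= 1.0 - C_A
    if vb:
        q *= 1.0 - C_B
    return 1.0 - q

def posterior_vb(va=None):
    """P(+vb | +d1) if va is None, else P(+vb | +d1, VA = va), by enumeration."""
    num = den = 0.0
    for a, b in product((True, False), repeat=2):
        if va is not None and a != va:
            continue
        p = (P_VA if a else 1.0 - P_VA) * (P_VB if b else 1.0 - P_VB) * p_d1(a, b)
        den += p
        if b:
            num += p
    return num / den

with_a = posterior_vb(True)       # virus A confirmed
only_d1 = posterior_vb()          # only the disease is known
without_a = posterior_vb(False)   # virus A ruled out
print(with_a, only_d1, without_a)
```

Under these assumptions the three posteriors are ordered as in the text: confirming virus A lowers the probability of virus B almost to its prior, whereas ruling out virus A makes virus B the only possible cause of the disease.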
We may now ask ourselves: if the middle node in a divergent trail can block it, and the descendants of the middle node can activate a convergent trail, why cannot the ancestors of the middle node in a divergent trail block it? We can check with OpenMarkov that this is the case; for example, given the trail S ← D_1 → F, if we first introduce the finding +v_A and then add in a new evidence case the finding +s, we observe that the probability of F increases, which proves that S and F are not conditionally independent given V_A. The reason for this correlation is that the presence of a virus increases the probability of the disease, but, unlike the case in which the disease is confirmed by direct observation, the probability is not yet 100%, so it can be further increased by +s. We can also try other combinations of findings to check that no ancestor of D_1 blocks this trail.

Markov Property and Markov Blankets
As we saw in Section 2.2.1 (cf. Equation (4)), the Markov property means that every node is conditionally independent of its non-descendants given its parents. Again, this definition may be difficult to understand for students when stated in the abstract, but it is intuitive when explained with examples. In particular, when a node has no parents, it is a priori independent of its non-descendants. We can illustrate it with the two-diseases network (Figure 1) by showing, for example, that the probability of Disease 1 does not change when introducing evidence for the nodes at the right of Fever, which are not descendants of that disease; and vice versa. Similarly, we can introduce evidence for the parents of a node (for example, Fever) and then see that adding evidence about other nodes that are not its descendants does not alter its probability.
We can illustrate in the same manner the concept of Markov blanket, which denotes a set of nodes that surround a node, making it conditionally independent of the other variables in the network [4]. One might think that the set of parents and children of a node D_1 constitutes a Markov blanket for it. However, this is not the case: if we introduce evidence for the parents and children of D_1, i.e., for V_A, V_B, S, and F, we can see that D_1 is not yet separated from all the other nodes in the network; in fact, every node in {V, D_2, A, X, E} is correlated with D_1 because F has activated the trail D_1 → F ← D_2. Therefore, the Markov blanket of a node must include not only its parents and children, but also the parents of its children.
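The construction of a Markov blanket (parents, children, and the other parents of the children) can be sketched in Python as follows. The graph is a partial reconstruction of the network of Figure 1 from the links mentioned in the text, so it should be read as an assumption rather than as the complete model; `markov_blanket` is a hypothetical helper.

```python
def markov_blanket(parents, node):
    """Parents, children, and the other parents of the children of `node`."""
    children = {child for child, pa in parents.items() if node in pa}
    co_parents = {p for child in children for p in parents[child]}
    return (set(parents[node]) | children | co_parents) - {node}

# Partial structure assumed from the text: VA -> D1 <- VB, D1 -> S, D1 -> F <- D2, V -> D2.
parents = {
    "VA": (), "VB": (), "V": (),
    "D1": ("VA", "VB"),
    "D2": ("V",),
    "S": ("D1",),
    "F": ("D1", "D2"),
}

blanket_d1 = markov_blanket(parents, "D1")
print(sorted(blanket_d1))
```

The result contains D2 in addition to the parents and children of D1, which is the node needed to block the trail D_1 → F ← D_2 once F has been observed.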

Inference Algorithms for Bayesian Networks
We have seen that OpenMarkov is able to propagate evidence, but so far we have not discussed inference algorithms. In this section, we explain how this tool can help illustrate some of the basic algorithms, namely, arc reversal, logic sampling, and likelihood weighting.

Arc Reversal for Bayesian Networks
Arc reversal was initially designed to transform IDs into decision trees [2], but in our opinion, students understand it better if it is first introduced for BNs.
Let us use again the two-diseases network (Figure 1) as an example. When processing the query P(+d_1 | +f, +s, ¬v), D_1 is the variable of interest and F, S, and V are the evidence variables. Then X and U can be removed in OpenMarkov's GUI because they are barren nodes. Then A becomes a barren node, which can also be removed. The user can invert the link V_B → D_1 by right-clicking on it; OpenMarkov replaces it with D_1 → V_B, adds a new link V_A → V_B (because V_A was a parent of D_1), and computes the new probability table P(v_B | v_A, d_1); the probability P(v_A) does not change because V_A has received no new parent. Now V_B is barren and can be removed. The user can then invert V_A → D_1 to remove V_A. After inverting D_2 → S, which adds the links D_1 → D_2 and V → S, it is possible to remove D_2. In each step the user can inspect the new CPTs and check that they have been computed correctly. Finally, after inverting D_1 → S and D_1 → F, the parents of D_1 (the variable of interest) are the three evidence nodes, and the user can retrieve the probability P(+d_1 | +f, +s, ¬v) by opening the CPT for D_1.
When there is a link X → Y and another directed path from X to Y, the option for inverting the link is disabled in its contextual menu (it appears in gray) because it would create a cycle. In the future, we might add a dialog that would suggest to the user the node deletions and arc reversals that will lead to calculating the probability of the variable of interest.

Stochastic Algorithms
OpenMarkov currently implements two stochastic algorithms: logic sampling [27] and likelihood weighting [28]. Both start by sampling a value for each node without parents, using its prior distribution, and then proceed in topological order (i.e., downwards), sampling each other node in accordance with the probability distribution for the configuration of its parents. This way, every iteration of the algorithm obtains a sample, i.e., a configuration of all the nodes. OpenMarkov is able to store these configurations in a spreadsheet and compute some statistics, including the posterior probability of each variable, as shown in Figure 3.
The left side of this figure displays the output of the logic sampling algorithm. The 10,000 configurations obtained are stored in the "Samples" sheet (not visible in the figure), with a sample per row and a variable per column; those compatible with the evidence are colored in green and those incompatible in red. The "General stats" tab shows that only 37 samples are compatible (see cell B6), a clear indication of the inefficiency of this algorithm. For each variable, the spreadsheet displays the number of samples in which each state has appeared; for example, D_1 has taken the state "absent" in 9714 samples (cell B23) and "present" in 286 (cell C23). It also shows the posterior probability for each state, which is not proportional to the number of occurrences because the samples incompatible with the evidence do not count.
The right side of Figure 3 shows the output of likelihood weighting. One difference from the previous algorithm is that it only samples the variables that are not part of the evidence; therefore, the evidence variables are not shown in the sheet. Another difference is that now the number of non-null samples (cell B6) equals the number of samples, because all of them are valid. However, each sample has a weight between 0 and 1 (in logic sampling it was either 0 or 1), as shown in the "Samples" sheet. As a consequence, the total weight for this simulation is 188.15 (cell B6), much higher than the value of 37 obtained for logic sampling, and this usually leads to more accurate estimates of the posterior probabilities, as we can see by comparing the approximate probabilities with their exact probabilities for both algorithms.
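The difference between the two algorithms can be seen in a minimal Python sketch. It is not OpenMarkov's implementation: it uses a hypothetical three-node network V_A → D_1 ← V_B with assumed noisy-OR parameters, the evidence +d_1, and the query P(+v_A | +d_1); the function names and the seed are arbitrary.

```python
import random

P_VA, P_VB = 0.02, 0.01    # assumed priors
C_A, C_B = 0.9, 0.9        # assumed noisy-OR link probabilities, no leak

def p_d1(va, vb):
    """P(+d1 | va, vb) under the assumed noisy-OR model."""
    return 1.0 - (1.0 - C_A * va) * (1.0 - C_B * vb)

def logic_sampling(n, rng):
    """Sample every node; keep only the samples compatible with the evidence +d1."""
    compatible = hits = 0
    for _ in range(n):
        va = rng.random() < P_VA
        vb = rng.random() < P_VB
        d1 = rng.random() < p_d1(va, vb)
        if d1:
            compatible += 1
            hits += va
    return hits / compatible, compatible

def likelihood_weighting(n, rng):
    """Sample only the non-evidence nodes; weight each sample by P(+d1 | va, vb)."""
    total_weight = hit_weight = 0.0
    for _ in range(n):
        va = rng.random() < P_VA
        vb = rng.random() < P_VB
        weight = p_d1(va, vb)
        total_weight += weight
        hit_weight += weight * va
    return hit_weight / total_weight, total_weight

def exact_posterior():
    """P(+va | +d1) by enumeration, for comparison."""
    num = den = 0.0
    for va in (True, False):
        for vb in (True, False):
            p = (P_VA if va else 1 - P_VA) * (P_VB if vb else 1 - P_VB) * p_d1(va, vb)
            den += p
            num += p * va
    return num / den

rng = random.Random(0)
n = 50_000
ls_estimate, n_compatible = logic_sampling(n, rng)
lw_estimate, total_weight = likelihood_weighting(n, rng)
exact = exact_posterior()
print(exact, ls_estimate, n_compatible, lw_estimate, total_weight)
```

In logic sampling the weight of a sample is implicitly 1 when it agrees with the evidence and 0 otherwise, so most of the work is discarded when the evidence is unlikely. The sketch assumes that at least one sample is compatible with the evidence; otherwise the estimate is undefined.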

Learning Bayesian Networks
BNs can be built from human knowledge, data, or a combination of both. OpenMarkov implements the two basic algorithms for learning BNs from data: search-and-score [29] and PC [30]. Other tools offer many more algorithms, but the advantage of OpenMarkov is the possibility of interactive learning [31]: in every step, the GUI displays a list of the edits (operations) that it is ready to perform and a motivation for each edit, as shown in Figures 4 and 5. This way, the user can monitor how the algorithm proceeds, step by step, and either accept the next edit proposed by the algorithm, select another one from the list, or perform a different edit in the GUI.
The search-and-score algorithm, also called "hill climbing", starts from a network with a node for each variable in the data, and no link (cf. Figure 4). The possible edits are: adding a directed link (the most common edit), deleting one of the existing links, or inverting a link. This process is guided by a metric chosen by the user. Currently OpenMarkov offers six well-known metrics: BD, Bayesian, K2, entropy, AIC, and MDLM. When learning the network, it selects the edits compatible with the restrictions of the network (for example, a BN cannot have cycles) and ranks them according to their scores. This way, a student can see, for example, that when the network has no link yet, the K2 metric usually assigns different scores to the links X → Y and Y → X, although the resulting networks represent exactly the same probability distribution, which is an unsatisfactory property of this metric. It is also possible to see that every edit (for example, adding a link) usually changes the scores of future edits.
In contrast, the PC algorithm starts from a fully connected undirected graph (Figure 5) and removes the links one by one depending on the conditional independencies found in the database. For each undirected link X-Y, OpenMarkov performs a statistical test that returns the p-value for the "null hypothesis" that X and Y are a priori independent; if p is below a certain threshold, α (called the significance level and set by the user), the null hypothesis is rejected and the link is kept; otherwise, it is removed. Links with higher p-values, which correspond to correlations that can be explained by chance, are proposed to be removed first. Then the PC algorithm tests, for each pair of variables, whether they are independent given a third variable, and then given a pair of other variables, and so on. In each step, the GUI shows the user a list of the links that might be removed, and for each link, the conditioning variables and the p-value. This way, the user can not only see the removals that the algorithm is considering, but also the certainty for each one. Finally, the algorithm assigns a direction to each link.
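The first step of PC, the test of a priori independence for a pair of variables, can be sketched in Python. The text does not say which statistical test OpenMarkov applies, so the example below assumes Pearson's chi-square test on a 2×2 contingency table of two Boolean variables; the counts are hypothetical. For one degree of freedom the p-value can be obtained with the standard library as erfc(√(χ²/2)).

```python
import math

def chi2_independence_2x2(table):
    """Pearson's chi-square test for a 2x2 table; returns (statistic, p_value)."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n   # count expected under independence
            statistic += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

def keep_link(table, alpha=0.05):
    """The link X-Y is kept when the null hypothesis of independence is rejected."""
    return chi2_independence_2x2(table)[1] < alpha

independent_counts = [[50, 50], [50, 50]]   # hypothetical data with no association
dependent_counts = [[90, 10], [10, 90]]     # hypothetical data with strong association

stat_ind, p_ind = chi2_independence_2x2(independent_counts)
stat_dep, p_dep = chi2_independence_2x2(dependent_counts)
print(p_ind, keep_link(independent_counts), p_dep, keep_link(dependent_counts))
```

The sketch assumes that no row or column of the table is empty; conditional tests given other variables would repeat the same computation within each configuration of the conditioning set.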
The tutorial of OpenMarkov, available at www.openmarkov.org/docs/tutorial; accessed on 20 September 2022, explains in detail the options it offers for learning BNs, either automatically or interactively.

Expected Utility and Optimal Policies
In Section 2.2.2 we mentioned that the evaluation of an ID consists in finding the maximum expected utility (MEU) and an optimal strategy. When OpenMarkov evaluates an ID in the GUI, it presents to the user the posterior probability of each chance and decision node and the expected utility of each utility node, as shown in Figure 6.
One way to evaluate an ID, the original method proposed by Howard and Matheson [2] when introducing this formalism, is to convert it into an equivalent decision tree (DT). For example, the ID in Figure 6 can be expanded into the DT in Figure 7, where each branch is labeled with its expected utility, obtained when evaluating the tree from the leaves to the root; for every decision node, one of its branches is marked with a small red rectangle to indicate the optimal choice in that scenario (in the case of a tie, more than one branch would have this mark). This evaluation method is very inefficient, because the size of the tree grows exponentially with the number of nodes in the ID. In fact, in our group we have built IDs for some medical problems [32,33], having fewer than 30 nodes, whose equivalent DTs contain tens of thousands of leaves. However, when teaching PGMs it is very useful to compare IDs with DTs for small problems because, in our opinion, an ID can only be understood as a compact representation of a DT, and all the algorithms for evaluating IDs take the DT as a reference. For these reasons, we implemented in OpenMarkov the automatic conversion of IDs into DTs.
Figure 7. Decision tree equivalent to the ID in Figure 6. A red rectangle denotes the optimal choice for each decision. Some branches have been collapsed to make the figure more compact.
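The evaluation of a DT from the leaves to the root can be sketched in a few lines of Python. The tree below is a hypothetical two-level example with invented probabilities and utilities (it is not the tree of Figure 7), and the dictionary-based representation is our own choice.

```python
def leaf(utility):
    return {"kind": "leaf", "utility": utility}

def chance(branches):
    """`branches` maps each outcome to a pair (probability, subtree)."""
    return {"kind": "chance", "branches": branches}

def decision(branches):
    """`branches` maps each option to a subtree."""
    return {"kind": "decision", "branches": branches}

def evaluate(node):
    """Return the expected utility of a node, evaluating the tree bottom-up."""
    if node["kind"] == "leaf":
        return node["utility"]
    if node["kind"] == "chance":
        # Expectation over the outcomes of the chance node.
        return sum(p * evaluate(child) for p, child in node["branches"].values())
    # Decision node: take the maximum and mark the optimal branch(es).
    values = {label: evaluate(child) for label, child in node["branches"].items()}
    best = max(values.values())
    node["optimal"] = [label for label, value in values.items() if value == best]
    return best

# Invented numbers: P(disease) = 0.1 and a utility for each (therapy, disease) pair.
tree = decision({
    "therapy": chance({"present": (0.1, leaf(8.0)), "absent": (0.9, leaf(9.0))}),
    "no therapy": chance({"present": (0.1, leaf(2.0)), "absent": (0.9, leaf(10.0))}),
})
meu = evaluate(tree)
print(meu, tree["optimal"])
```

The list stored in `optimal` plays the role of the red rectangles: it contains one option, or several in the case of a tie.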
However, OpenMarkov can also evaluate IDs with more efficient algorithms; by default, the GUI uses variable elimination [9,34]. After the evaluation, in addition to showing the posterior probability of each chance variable, with a bar for each evidence case, as in the case of BNs, it also displays a bar for every state (option) of every decision and for the expected utility of every utility node. For example, Figure 6 displays the probabilities and the expected utility for three evidence cases: when the test is not yet done (red bars), when it is positive (blue bars), and when it is negative (green bars).
The optimal strategy calculated by OpenMarkov can be examined in different ways. One of them is to open, for each decision D, the probability table P_D(d | iPred(D)) for the optimal policy (cf. Section 2.2.2). Since the optimal policies are deterministic, except in the case of ties, these tables usually contain only 0's and 1's, as in Figure 8, where the only informational predecessor of the decision (the only variable known when making it) is the result of the test. More insight into this policy is provided by Figure 9, which shows the expected utilities obtained when calculating the optimal policy for Therapy (cf. Equation (10)).
Figure 9. Expected utility for Therapy. When the test is positive, "therapy 2" is chosen because it has the highest expected utility. When the test is negative, the highest expected utility is obtained for "no therapy".
An alternative way to see the optimal strategy in OpenMarkov is to display the strategy tree [35], which summarizes all the policies in one figure. It is more compact than the DT (compare Figures 7 and 10) because it prunes the suboptimal branches (which implies that only one branch goes out from each decision node, except in the case of a tie) as well as the branches with null probability (for example, when we decide not to do a test, its result is neither "positive" nor "negative"). The strategy tree is very useful for large IDs; for example, the optimal-policy table for the last decision in Mediastinet, an ID for lung cancer [32], contained more than 15,000 columns, but only 5 were relevant because the others corresponded to impossible or suboptimal scenarios. In contrast, the strategy tree for that ID has only 5 leaves, one for each relevant column [35].
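The relation between a table of expected utilities and a policy table of 0's and 1's can be sketched as follows. The utilities are hypothetical (they are not the values of Figure 9, although they lead to the same choices), and spreading the probability uniformly over tied options is our assumption, not necessarily what OpenMarkov does.

```python
# Hypothetical expected utilities U(result, therapy).
expected_utility = {
    "positive": {"no therapy": 6.0, "therapy 1": 8.0, "therapy 2": 8.5},
    "negative": {"no therapy": 9.8, "therapy 1": 9.0, "therapy 2": 9.3},
}

def optimal_policy(eu_table):
    """Build P_D(d | result) by putting all the mass on the maximizing option(s)."""
    policy = {}
    for result, row in eu_table.items():
        best = max(row.values())
        winners = [option for option, value in row.items() if value == best]
        policy[result] = {
            option: (1 / len(winners) if option in winners else 0.0) for option in row
        }
    return policy

policy = optimal_policy(expected_utility)
for result, column in policy.items():
    print(result, column)
```

Each column of the resulting table contains a single 1, placed on "therapy 2" for a positive result and on "no therapy" for a negative one.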

Arc Reversal for Influence Diagrams
As mentioned above, arc reversal was introduced by Howard and Matheson [2] to transform IDs and convert them into DTs. Later, Olmsted [12] designed an algorithm that iteratively removes the nodes from the graph, one by one, until only the utility node remains; this is much more efficient than expanding a DT (see also [13]).
Again, we can illustrate this algorithm with OpenMarkov. For example, when evaluating the ID in Figure 6, we should first remove the node Disease because it is not an informational predecessor of any decision; but this node has a descendant that is not a utility node. We can then invert the link Disease → Test by right-clicking on it, as in BNs. Now Effectiveness, a utility node, is the only descendant of Disease, so the user can click "Absorb node" on the contextual menu of this node, which adds a link from Test to Effectiveness and computes the new utility table with Equation (11). If Disease were the parent of more than one utility node, OpenMarkov would fuse them into a single node, as explained in Section 2.2.3. Then the only descendant of Therapy is the utility node, and this decision can be absorbed by applying Equation (12); the optimal policy is obtained from Equation (10). Now the only descendant of Test is the utility node, so this chance node can be absorbed by applying Equation (11) again. At the end, only one utility node remains in the ID; its potential contains a single numerical parameter, which is the MEU.
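The sequence of operations just described can be reproduced numerically. The sketch below follows the same order (reverse Disease → Test, absorb Disease, absorb Therapy, absorb Test) on a hypothetical ID with the same structure; all probabilities and utilities are invented, Therapy has only two options for brevity, and we assume the usual forms of the absorption formulas: expectation for chance nodes and maximization for decision nodes.

```python
results = ["positive", "negative"]
therapies = ["therapy", "no therapy"]

# Invented parameters of the hypothetical ID.
p_d = {"present": 0.2, "absent": 0.8}
p_t_given_d = {
    "present": {"positive": 0.9, "negative": 0.1},
    "absent": {"positive": 0.1, "negative": 0.9},
}
utility = {
    ("present", "therapy"): 8.0, ("present", "no therapy"): 2.0,
    ("absent", "therapy"): 9.0, ("absent", "no therapy"): 10.0,
}

# 1. Reverse the link Disease -> Test: compute P(test) and P(disease | test).
p_t = {t: sum(p_d[d] * p_t_given_d[d][t] for d in p_d) for t in results}
p_d_given_t = {
    t: {d: p_d[d] * p_t_given_d[d][t] / p_t[t] for d in p_d} for t in results
}

# 2. Absorb the chance node Disease: U(test, therapy) = sum_d P(d | test) U(d, therapy).
u_t_th = {
    (t, th): sum(p_d_given_t[t][d] * utility[(d, th)] for d in p_d)
    for t in results for th in therapies
}

# 3. Absorb the decision Therapy: maximize for each test result and keep the policy.
policy = {t: max(therapies, key=lambda th: u_t_th[(t, th)]) for t in results}
u_t = {t: u_t_th[(t, policy[t])] for t in results}

# 4. Absorb the chance node Test: the remaining number is the MEU.
meu = sum(p_t[t] * u_t[t] for t in results)
print(policy, meu)
```

With these numbers the optimal policy is to treat only when the test is positive, and the single parameter left in the utility node is 9.4.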

Expected Value of Perfect Information (EVPI)
A relevant concept in decision analysis is the EVPI [36], which measures the advantage we would obtain from a certain piece of information, such as knowing the exact value of a parameter or the value taken by a variable. For example, given the ID in Figure 6, we may ask ourselves: "What would be the value of knowing for sure whether the patient has the disease (or not) before deciding about the therapy?" In OpenMarkov this question can be easily answered by drawing an information link from Disease to Therapy. Students can observe that the expected utility (effectiveness) increases from 9.3937 (see Figure 6) to 9.5100, which means that the EVPI is 0.1163. This example illustrates the advantages of IDs, because if the original problem had been modeled with a decision tree, computing the EVPI would require building a new decision tree almost from scratch.
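In the example above, the EVPI is simply 9.5100 − 9.3937 = 0.1163. The same computation can be sketched for a hypothetical ID with the same structure; the parameters below are invented (so the result differs from that of Figure 6) and the function names are ours.

```python
# Invented parameters; Therapy has only two options for brevity.
p_d = {"present": 0.2, "absent": 0.8}
p_t_given_d = {
    "present": {"positive": 0.9, "negative": 0.1},
    "absent": {"positive": 0.1, "negative": 0.9},
}
utility = {
    ("present", "therapy"): 8.0, ("present", "no therapy"): 2.0,
    ("absent", "therapy"): 9.0, ("absent", "no therapy"): 10.0,
}
therapies = ["therapy", "no therapy"]

def meu_observing_test():
    """MEU when Therapy is decided knowing only the result of the test."""
    total = 0.0
    for t in ("positive", "negative"):
        # For each result, pick the therapy maximizing sum_d P(d) P(t | d) U(d, th).
        total += max(
            sum(p_d[d] * p_t_given_d[d][t] * utility[(d, th)] for d in p_d)
            for th in therapies
        )
    return total

def meu_observing_disease():
    """MEU after adding an information link from Disease to Therapy."""
    return sum(p_d[d] * max(utility[(d, th)] for th in therapies) for d in p_d)

evpi = meu_observing_disease() - meu_observing_test()
print(meu_observing_test(), meu_observing_disease(), evpi)
```

Here the MEU increases from 9.4 to 9.6, so the EVPI of Disease is 0.2.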

Explanation of Reasoning
OpenMarkov offers several options for explaining the conclusions achieved by an ID, most of them developed for its predecessor, Elvira [24]. These options have been useful for our research group when building BNs and IDs for medicine [23] and also for teaching PGMs to our students.

Imposing Policies for What-If Reasoning
OpenMarkov allows the user to impose policies on some decision nodes, in which case the evaluation algorithm only calculates optimal policies for the other decisions, which may differ from those obtained without imposed policies. This functionality allows the user to analyze decision scenarios that can never occur if the decision maker applies the optimal strategy, thus performing what-if reasoning [24].
For example, the optimal strategy for the ID in Figure 6 is: "if the result of the test is negative, then do not apply any therapy; otherwise, apply therapy 2". However, the user might wish to investigate other policies, such as applying therapy 1 instead of therapy 2 when the test is positive, or applying therapy 1 in all cases, and calculate the expected utility for different results of the test, as shown in Figure 11.
Figure 11. What-if reasoning: OpenMarkov allows the user to analyze what would happen if the decision maker applied a non-optimal policy. The node Therapy is colored in dark blue to indicate that a policy was imposed, instead of allowing the evaluation algorithm to find the optimal strategy. The colors of the bars have the same meaning as in Figure 6.
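What-if reasoning amounts to computing the expected utility of a policy chosen by the user instead of maximizing. The following sketch compares an optimal and an imposed policy on a hypothetical ID; the numbers are invented and do not correspond to Figure 11.

```python
# Invented parameters; Therapy has only two options for brevity.
p_d = {"present": 0.2, "absent": 0.8}
p_t_given_d = {
    "present": {"positive": 0.9, "negative": 0.1},
    "absent": {"positive": 0.1, "negative": 0.9},
}
utility = {
    ("present", "therapy"): 8.0, ("present", "no therapy"): 2.0,
    ("absent", "therapy"): 9.0, ("absent", "no therapy"): 10.0,
}

def expected_utility(policy):
    """Expected utility when Therapy follows `policy` (test result -> option)."""
    return sum(
        p_d[d] * p_t_given_d[d][t] * utility[(d, policy[t])]
        for d in p_d for t in policy
    )

optimal = {"positive": "therapy", "negative": "no therapy"}
imposed = {"positive": "therapy", "negative": "therapy"}  # treat in all cases

print(expected_utility(optimal), expected_utility(imposed))
```

For these parameters, treating every patient lowers the expected utility from 9.4 to 8.8, which quantifies the cost of departing from the optimal policy.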

Introducing Evidence
Lacave et al. [24] distinguished two types of evidence in the context of IDs: pre-resolution and post-resolution. Post-resolution evidence is introduced when every decision has been assigned a policy, either by the user or by the evaluation algorithm. The goal is to see how some findings affect the posterior probabilities (as in BNs) and the expected utilities. We have already seen two examples in Figures 6 and 11: OpenMarkov displays the utility expected before doing the test (which is the same as the utility for the general population, because some people test positive and others test negative), but we can also compute the expected utility and the posterior probability of the disease for those people having a positive test result and for those having a negative result.
In contrast, pre-resolution evidence corresponds to the classical definition of evidence in IDs [37]. In this case the question is: "What would the optimal strategy and the expected utilities be if we had that information when making the decisions?"

Example: Justifying a Policy
The usefulness of these two explanation facilities can be illustrated with the following example, adapted from a situation we encountered during the construction of an ID for lung cancer [32], when the pneumologist did not understand why the model built so far advised against doing a test that, in his opinion, would be useful.
The ID in Figure 12 presents a similar situation, in which it is better not to do the test, which might be counterintuitive. The reason seems to be that the result of the test does not modify the optimal policy. To confirm it, we first perform the sensitivity analysis shown in Figure 13, which shows that when the probability of disease is below 3.45% no therapy should be applied. Then we try to find out the posterior probability of disease after a positive test, but when we try to introduce the finding "Result of test = positive", OpenMarkov throws an error message saying that this finding is incompatible with the optimal policy, which precludes doing the test. A workaround consists in imposing the policy "Do test? = yes", which allows us to introduce that finding and observe that the posterior probability of disease increases only to 2.51% (see Figure 14), still below the 3.45% threshold.
Figure 13. Sensitivity analysis for the ID in Figure 12, showing that when the probability of disease is below 0.0345, the best option is no therapy.
Figure 14. OpenMarkov allows the user to impose the suboptimal policy "Do test? = yes" and observe that a positive test result is unable to raise the posterior probability of disease above the 0.0345 threshold. This explains why it is not worth doing the test. The colors of the bars have the same meaning as in Figures 6 and 11.
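The argument can be reproduced with Bayes' theorem. In the sketch below only the 3.45% threshold comes from the example; the prevalence, sensitivity, and specificity are hypothetical values chosen for illustration, so the posterior probability differs from the 2.51% obtained with the ID in Figure 12.

```python
def posterior(prior, p_finding_if_disease, p_finding_if_healthy):
    """P(disease | finding) by Bayes' theorem."""
    joint_disease = prior * p_finding_if_disease
    joint_healthy = (1 - prior) * p_finding_if_healthy
    return joint_disease / (joint_disease + joint_healthy)

threshold = 0.0345   # below this probability, no therapy is the best option
prevalence = 0.01    # hypothetical
sensitivity = 0.80   # hypothetical
specificity = 0.70   # hypothetical

after_positive = posterior(prevalence, sensitivity, 1 - specificity)
after_negative = posterior(prevalence, 1 - sensitivity, specificity)

# The test is worthless for this decision if no result crosses the threshold.
test_changes_decision = (after_positive >= threshold) != (after_negative >= threshold)
print(after_positive, after_negative, test_changes_decision)
```

With these values a positive result raises the probability of disease only to about 2.6%, so the decision is "no therapy" whatever the result, and doing the test cannot improve the expected utility.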

Discussion and Conclusions
OpenMarkov is an open-source tool for building and evaluating several types of PGMs. It has been especially designed for medical applications and for teaching. It has been used for research and tuition in more than 30 countries.
In this paper we have illustrated with several examples how to teach several aspects of PGMs with OpenMarkov. Some of them (for example, the properties of d-separation, which are far from intuitive for beginners) might be illustrated with any other tool having a graphical user interface (GUI) able to show on a screen the graph of the model and a probability bar for each node, such as those mentioned in Section 2.3, but the explanation is much clearer if there is a probability bar for each evidence case, a feature that is only available in OpenMarkov. This tool also allows building networks in which some conditional probability tables (CPTs) are encoded as canonical models based on the independence of causal interactions, such as the noisy and leaky versions of OR, AND, MAX, MIN, etc. [4,26], and implements efficient algorithms for evaluating them [38]. Students can learn how to apply these models and how they behave when propagating evidence.
Additionally, OpenMarkov is useful to illustrate the execution of several iterative algorithms for inference and learning. In particular, it is able to display on a spreadsheet the samples generated by stochastic algorithms, as well as the number of valid samples, the accumulated weights, and the posterior probabilities. It also allows the user to apply arc reversal iteratively for both BNs and IDs, showing how the probabilities and the utility tables are updated in each step. Similarly, it can learn BNs from a database using several variations of the two basic algorithms, search-and-score (hill climbing) and PC; in this case, the GUI offers several edits (such as adding, removing, or inverting a link), along with a qualitative score for each one, so that the user can understand what the algorithm intends to do and why. Students can compare the performance of different algorithms by observing not only the differences in the networks learned from the same database, but also how the algorithms differ step by step.
Our tool offers several possibilities for evaluating IDs. One of them is the conversion into decision trees, which can only be done for very small problems, but is very useful to intuitively understand the relation between the two formalisms. OpenMarkov can also apply efficient algorithms, such as variable elimination and arc reversal, and show the optimal policy (a table) for each decision, as well as the expected utility and the posterior probability of each chance and decision node, with the possibility of entering pre- and post-resolution evidence to observe how those utilities and probabilities vary. It can also show the optimal strategy in the form of a tree (cf. Figure 10), which is much more compact than the decision tree and the policy tables. Most of these features are not available in any other tool, whether commercial, free, or open-source.
Finally, OpenMarkov has novel types of PGMs, such as Markov influence diagrams [14] and decision analysis networks [15], developed by our research group, as well as new algorithms for cost-effectiveness analysis with these models [14,39,40]. They can be very useful for teaching health technology assessment (HTA), but that topic is out of the scope of this paper.
A limitation of OpenMarkov is that, although it implements several algorithms for exact inference, it only offers the most basic algorithms for stochastic inference and for learning. We implemented them just for pedagogical purposes, because these topics fall outside the priorities of our research group.
However, being open-source is an important advantage of OpenMarkov because it allows students with some knowledge of Java to inspect the implementation of the algorithms. For example, in the abstract class StochasticPropagation.java the students can find the data structures and methods common to the two algorithms discussed in this paper, while the classes that extend it, namely, LogicSampling.java and LikelihoodWeighting.java, implement the aspects in which the algorithms differ. Furthermore, advanced students can add new features-see [41] as an example. In fact, a significant part of OpenMarkov's code has been written by undergraduate, master, and Ph.D. students. In the future, other researchers and students, not necessarily from our university, might contribute new algorithms for inference and learning. This tool can be especially useful as a workbench for new learning algorithms because it has been carefully designed to allow implementing other algorithms, integrating them in the GUI, and executing them interactively.
Given that nowadays PGMs are part of the computer science curriculum in all universities around the world, we hope that many teachers and students may consider using OpenMarkov as a pedagogical tool, and that some of them will later use it for building real-world applications.