# Hard and Soft EM in Bayesian Network Learning from Incomplete Data


## Abstract


## 1. Introduction

## 2. Methods

#### 2.1. Bayesian Networks

#### 2.2. Missing Data

- Missing completely at random (MCAR): missingness does not depend on the values of the data, missing or observed.
- Missing at random (MAR): missingness depends on the variables in $\mathbf{X}$ only through the observed values in the data.
- Missing not at random (MNAR): the missingness depends on both the observed and the missing values in the data.
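The three mechanisms can be illustrated by how the probability of hiding a value is allowed to depend on the data. The following is a minimal Python sketch on a hypothetical pair of binary variables (not the paper's simulation setting, which uses the `ampute` function of the **mice** R package): the probability of hiding $X_2$ is constant under MCAR, depends only on the always-observed $X_1$ under MAR, and depends on the hidden $X_2$ itself under MNAR.

```python
import random

random.seed(42)

# Toy data: X1 ~ Bernoulli(0.5), X2 depends on X1
# (hypothetical example, not the paper's simulation study).
data = []
for _ in range(1000):
    x1 = random.random() < 0.5
    x2 = random.random() < (0.8 if x1 else 0.2)
    data.append((x1, x2))

def amputate(rows, mechanism):
    """Hide X2 with a probability chosen according to the mechanism."""
    out = []
    for x1, x2 in rows:
        if mechanism == "MCAR":      # independent of any value in the data
            p = 0.3
        elif mechanism == "MAR":     # depends only on the observed X1
            p = 0.5 if x1 else 0.1
        else:                        # MNAR: depends on the hidden X2 itself
            p = 0.5 if x2 else 0.1
        out.append((x1, None if random.random() < p else x2))
    return out

for mech in ("MCAR", "MAR", "MNAR"):
    frac = sum(x2 is None for _, x2 in amputate(data, mech)) / len(data)
    print(mech, round(frac, 2))
```

Under MAR and MNAR the missingness rate is the same on average, but only MNAR missingness remains informative after conditioning on the observed values.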

#### 2.3. Missing Data Imputation

#### 2.4. The Expectation-Maximisation (EM) Algorithm

- the Expectation step (E-step) consists of computing the expected values of the sufficient statistics $s\left(\mathcal{D}\right)$ for the parameters ${\Theta}_{j}$ using the previous parameter estimates ${\widehat{\Theta}}_{j-1}$;
- the Maximisation step (M-step) takes the sufficient statistics ${\widehat{s}}_{j}$ from the E-step and uses them to update the parameter estimates.

Algorithm 1: The (Soft) Expectation-Maximisation Algorithm (Soft EM).

Algorithm 2: The Hard EM Algorithm.
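To make the difference between the two algorithms concrete, consider a deliberately minimal one-parameter sketch in Python (a hypothetical Bernoulli variable with MCAR missing cells, not the paper's multinomial BN setting): soft EM carries the *expected* count for the missing cells into the M-step, while hard EM imputes each missing cell with its single most probable value.

```python
def soft_em(n1_obs, n_obs, n_miss, theta=0.5, iters=100):
    """Soft EM for a Bernoulli parameter: the E-step adds the expected
    count theta * n_miss contributed by the missing cells."""
    n = n_obs + n_miss
    for _ in range(iters):
        expected_ones = n1_obs + theta * n_miss        # E-step
        theta = expected_ones / n                      # M-step
    return theta

def hard_em(n1_obs, n_obs, n_miss, theta=0.5, iters=100):
    """Hard EM: the E-step imputes every missing cell with its single
    most probable value under the current estimate."""
    n = n_obs + n_miss
    for _ in range(iters):
        imputed_ones = n_miss if theta >= 0.5 else 0   # E-step (hard)
        theta = (n1_obs + imputed_ones) / n            # M-step
    return theta

# 60 ones among 100 observed cells, plus 50 missing cells:
print(soft_em(60, 100, 50))  # converges to 0.6, the observed frequency
print(hard_em(60, 100, 50))  # sticks at (60 + 50) / 150, a biased value
```

Soft EM converges to the observed frequency, while hard EM's winner-takes-all imputation pulls the estimate towards the dominant value; in richer models this determinism can also help hard EM converge faster.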

#### 2.5. The EM Algorithm and Bayesian Networks

- the Expectation (E) step consists of computing the expected values of the sufficient statistics (the counts $\left\{{n}_{ijk}\right\}$) using exact inference along the lines described above to make use of incomplete as well as complete samples;
- the Maximisation (M) step takes the sufficient statistics from the E-step and estimates the parameters of the BN.

- in the E-step, we complete the data by computing the expected sufficient statistics using the current network structure;
- in the M-step, we find the structure that maximises the expected score function for the completed data.

## 3. Materials

- Network size: small (from 2 to 20 nodes), medium (from 21 to 50 nodes) and large (more than 50 nodes).
- Missingness balancing: whether the distribution of the missing values over the possible values taken by a node is balanced or unbalanced (that is, some values are missing more often than others).
- Missingness severity: low (⩽1% missing values), medium (1% to 5% missing values) and high (5% to 20% missing values).
- Missingness pattern: whether missing values appear only in root nodes (labelled “root”), only in leaf nodes (“leaf”), in nodes with large number of neighbours (“high degree”) or uniformly on all node types (“fair”). We also consider specific target nodes that represent the variables of interest in the BN (“target”).
- Missing data mechanism: the ampute function of the **mice** R package [27] has been applied to the generated data sets to simulate MCAR, MAR and MNAR missing data mechanisms as described in Section 2.2.

- We generate a complete data set from the BN.
- We introduce missing values in the data from step 1 by hiding a random selection of observed values in a pattern that satisfies the relevant experimental factors (missingness balancing, missingness severity, missingness pattern and missing data mechanism). We perform this step 10 times for each complete data set.
- We check that the proportion of missing values in each incomplete data set from step 2 is within 0.01 of the nominal missingness severity.
- We perform parameter learning with each EM algorithm and each incomplete data set to estimate the ${\widehat{\Theta}}_{i}$ for each node ${X}_{i}$, which we then use to impute the missing values in those same data sets. As for the network structure, we consider both the DAG of the reference BN and a set of network structures with high $P(\mathcal{G} \mid \mathcal{D})$.

- The proportion of correct replacements (PCR), defined as the proportion of missing values that are correctly replaced. Higher values are better.
- The absolute probability difference: $$APD = \sum_{m=1}^{M} \left| p_m - q_m \right|,$$
- The Kullback–Leibler divergence: $$KLD\left[\Theta \,\|\, \widehat{\Theta}\right] = \sum_{m=1}^{M} KLD\left[\Theta_{(m)} \,\|\, \widehat{\Theta}_{(m)}\right] \quad \text{where} \quad KLD\left[\Theta_{(m)} \,\|\, \widehat{\Theta}_{(m)}\right] = \sum \Theta_{(m)} \log \frac{\Theta_{(m)}}{\widehat{\Theta}_{(m)}},$$
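The overall APD and KLD metrics sum the per-distribution quantities over all $M$ conditional distributions in the BN. The per-distribution computations can be sketched in Python as follows (the paper's experiments use R; the example values are hypothetical):

```python
import math

def pcr(true_values, imputed_values):
    """Proportion of correct replacements: the share of missing values
    whose imputed value matches the true one (higher is better)."""
    matches = sum(t == i for t, i in zip(true_values, imputed_values))
    return matches / len(true_values)

def apd(p, q):
    """Absolute probability difference between two probability vectors."""
    return sum(abs(pm - qm) for pm, qm in zip(p, q))

def kld(p, q):
    """KLD[p || q] for one discrete distribution (0 log 0 taken as 0)."""
    return sum(pm * math.log(pm / qm) for pm, qm in zip(p, q) if pm > 0)

# Hypothetical true vs estimated conditional distribution of one node:
true_cpt = [0.7, 0.2, 0.1]
est_cpt = [0.6, 0.3, 0.1]
print(pcr(["a", "b", "a", "c"], ["a", "b", "b", "c"]))  # 0.75
print(round(apd(true_cpt, est_cpt), 3))                 # 0.2
print(round(kld(true_cpt, est_cpt), 4))
```

Note that KLD is asymmetric and diverges when the estimate assigns zero probability to a value the true distribution supports, which is one reason degenerate hard EM estimates can be penalised heavily.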

- we choose to perturb 15% of nodes in small BNs and 10% of nodes in medium and large BNs, to ensure a fair amount of perturbation across BNs of different size;
- we sample the nodes to perturb;
- and then we apply, to each node, a perturbation chosen at random among single arc removal, single arc addition and single arc reversal.
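The perturbation procedure above can be sketched as follows. This is a simplified Python illustration on a toy graph (the node names and the 15% figure for small BNs come from the text; the helper function is hypothetical, and acyclicity of additions and reversals is not enforced here, unlike in a full implementation):

```python
import random

random.seed(1)

def perturb_node(arcs, nodes, node):
    """Apply one random perturbation (single arc removal, addition or
    reversal) touching `node`. `arcs` is a set of (parent, child) pairs.
    NOTE: this sketch does not check that the result remains acyclic."""
    incident = [a for a in arcs if node in a]
    candidates = [(node, o) for o in nodes
                  if o != node and (node, o) not in arcs
                  and (o, node) not in arcs]
    ops = (["remove", "reverse"] if incident else []) \
        + (["add"] if candidates else [])
    op = random.choice(ops)
    new_arcs = set(arcs)
    if op == "remove":
        new_arcs.discard(random.choice(incident))
    elif op == "reverse":
        u, v = random.choice(incident)
        new_arcs.discard((u, v))
        new_arcs.add((v, u))
    else:
        new_arcs.add(random.choice(candidates))
    return new_arcs

nodes = ["A", "B", "C", "D"]
arcs = {("A", "B"), ("B", "C")}
n_perturb = max(1, round(0.15 * len(nodes)))  # 15% of nodes, as for small BNs
for node in random.sample(nodes, n_perturb):  # sample the nodes to perturb
    arcs = perturb_node(arcs, nodes, node)
print(arcs)
```

Each perturbed structure differs from the reference DAG by one local change per sampled node, which is what makes the $\Delta KLD$ comparison between reference and perturbed structures meaningful.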

## 4. Results

**D** and **F**. Leaf **D** covers small and medium BNs with balanced missingness distributions and medium or high missingness severity; leaf **F** covers only large BNs and unbalanced missingness. Finally, leaf **C** recommends soft and soft-forced EM for small and medium BNs with balanced missingness distributions, low missingness severity and a pattern of missingness that is not fair.

- Hard EM has the lowest $\Delta KLD$ in 44/67 scenarios, compared to 16/67 (soft EM) and 7/67 (soft-forced EM). Soft EM has the highest $\Delta KLD$ in 30/67 triplets, compared to 24/67 (soft-forced) and 13/67 (hard EM). Hence, hard EM often outperforms soft EM in the quality of the estimated ${\widehat{\Theta}}_{i}$, and it is the worst performer only in a minority of simulations. The opposite seems to be true for soft EM, possibly because it converges very slowly or fails to converge at all in some simulations. Soft-forced EM does not perform as well as hard EM, but it is the worst performer less often than soft EM.
- We observe some negative $\Delta KLD$ values for all EM algorithms: 7/67 (hard EM), 8/67 (soft EM), 5/67 (soft-forced). They highlight how all EM algorithms can sometimes fail to converge and produce good parameter estimates for the network structure of the reference BN, but not for the perturbed network structures.
- Hard EM has the lowest $\Delta KLD$ 13/30 times in small networks, 9/14 in medium networks and 21/23 in large networks in a monotonically increasing trend. At the same time, hard EM has the highest $\Delta KLD$ in 8/30 times in small networks, 4/14 in medium networks and 0/23 in large networks, in a monotonically decreasing trend. This suggests that the performance of hard EM improves as the BNs increase in size: it provides the best ${\widehat{\Theta}}_{i}$ more and more frequently, and it is never the worst performer in large networks.
- Soft EM has the lowest $\Delta KLD$ in 12/30 scenarios in small networks, 5/14 in medium networks and 0/23 in large networks, in a monotonically decreasing trend. At the same time, soft EM has the highest $\Delta KLD$ in 7/30 scenarios in small networks, 6/14 in medium networks and 17/23 in large networks, in a monotonically increasing trend. Hence, soft EM is increasingly likely to be the worst performer as the size of the BN increases, and it is also increasingly likely to be outperformed by hard EM.
- Soft-forced EM never has the lowest $\Delta KLD$ in medium and large networks. It has the highest $\Delta KLD$ 15/30 times in small networks, 4/14 in medium networks and 4/23 in large networks, in a monotonically decreasing trend (with a large step between small and medium networks, and comparable values for medium and large networks). Again, this suggests that the behaviour of soft-forced EM is an average of that of hard EM and soft EM, occupying the middle ground for medium and large networks.

## 5. Discussion and Conclusions

- Hard EM performs well across BNs of different sizes when the missing pattern is fair, that is, when missing data occur independently of the structure of the BN.
- Soft EM should be preferred to hard EM, across BNs of different sizes, when the missing pattern is not fair (missing data occur at nodes of the BN with specific graphical characteristics: root, leaf or high-degree nodes) and when the missingness distribution of the nodes is balanced.
- Hard and soft EM perform similarly for medium-size BNs when missing data are unbalanced.

- Hard EM achieves the lowest value of $\Delta KLD$ in most simulation scenarios (44/67), outperforming the other EM algorithms more often than not.
- In terms of robustness, we find no marked difference between soft EM and hard EM for small to medium BNs. On the other hand, hard EM consistently outperforms soft EM for large BNs: it achieves the lowest value of $\Delta KLD$ in nearly all large-network simulations (21/23), and it never achieves the highest value of $\Delta KLD$.
- Sometimes all EM algorithms fail to converge and to provide good parameter estimates for the network structure of the true BN, but not for the corresponding perturbed networks.

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Complete List of the Simulation Scenarios

**Table A1.** Complete description of all the combinations of experimental factors covered in the simulation study.

Network | Description | Proportion of Missing Values | Replicates | Sample Size |
---|---|---|---|---|
Asia | Random patterns MNAR and MCAR | 0.05 | 10 | 100, 200, 300, 400, 500, 1000, 1500, 2000 |
 | | 0.1 | 10 | 100, 200, 300, 400, 500, 1000, 1500, 2000 |
 | | 0.2 | 10 | 100, 200, 300, 400, 500, 1000, 1500, 2000 |
Sports | Random patterns MNAR and MCAR | 0.05 | 10 | 100, 200, 400, 800, 1200, 1600, 5000 |
 | | 0.1 | 10 | 100, 200, 400, 800, 1200, 1600 |
 | Most central nodes | 0.05 | 10 | 100, 200, 400, 800, 1200, 1600, 2000 |
 | | 0.1 | 10 | 100, 200, 400, 800, 1200, 1600 |
Alarm | Random patterns MNAR and MCAR | 0.01 | 8 | 200, 400, 600, 1000, 1500 |
 | | 0.05 | 8 | 200, 400, 600, 1000, 1500 |
 | Most central nodes | 0.01 | 8 | 200, 400, 600, 1000, 1500 |
 | | 0.05 | 8 | 200, 400, 600, 1000, 1500 |
Property | Random patterns MNAR and MCAR | 0.01 | 8 | 200, 400, 800, 1100 |
 | | 0.05 | 8 | 400, 800, 1100 |
 | Most central nodes | 0.01 | 8 | 200, 400, 800, 1100 |
 | Leaves | 0.01 | 8 | 200, 400, 800, 1100 |
ForMed | Random patterns MNAR | 0.005 | 8 | 300, 600, 1000, 1400 |
 | | 0.01 | 8 | 300, 600, 1000, 1400 |
 | Roots | 0.003 | 8 | 300, 600, 1000, 1400 |
 | With high degree | 0.003 | 8 | 300, 600, 1000, 1400 |
 | Leaves | 0.006 | 8 | 300, 600, 1000, 1400 |
 | Random patterns MCAR | 0.006 | 8 | 300, 600, 1000, 1400 |
 | Most central nodes | 0.006 | 8 | 300, 600, 1000, 1400 |
Pathfinder | Random patterns MNAR | 0.005 | 8 | 300, 600, 1000, 1400 |
 | | 0.01 | 8 | 1000 |
 | Most central nodes | 0.005 | 8 | 300, 600, 1000, 1400 |
 | With high degree | 0.005 | 8 | 300, 600, 1000 |
 | Leaves | 0.005 | 8 | 300, 600, 1000 |
 | Random patterns MCAR | 0.005 | 8 | 300, 600, 1000 |
Hailfinder | Random patterns MNAR | 0.03 | 8 | 300, 600, 900, 1200 |
 | | 0.005 | 8 | 300, 600, 900, 1200 |
 | Random patterns MCAR | 0.005 | 8 | 300, 600, 900, 1200 |
 | Most central nodes | 0.005 | 8 | 300, 600, 900, 1200 |
 | Leaves | 0.005 | 8 | 300, 600, 900, 1200 |

## References

- Kalton, G.; Kasprzyk, D. The Treatment of Missing Survey Data. Surv. Methodol. 1986, 12, 1–16.
- Raghunathan, T.E. What Do We Do with Missing Data? Some Options for Analysis of Incomplete Data. Annu. Rev. Public Health 2004, 25, 99–117.
- Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data; Wiley: Hoboken, NJ, USA, 1987.
- Rubin, D.B. Inference and Missing Data. Biometrika 1976, 63, 581–592.
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. (Ser. B) 1977, 39, 1–22.
- Beal, M.J.; Ghahramani, Z. The Variational Bayesian EM Algorithm for Incomplete Data: With Application to Scoring Graphical Model Structures. In Proceedings of the 7th Valencia International Meeting, New York, NY, USA, 3 July 2003; pp. 453–464.
- Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
- Scutari, M. Bayesian Network Models for Incomplete and Dynamic Data. Stat. Neerl. 2020, 74, 397–419.
- Friedman, N. Learning Belief Networks in the Presence of Missing Values and Hidden Variables. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, 8–12 July 1997; pp. 125–133.
- Friedman, N. The Bayesian Structural EM Algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 24–26 July 1998; pp. 129–138.
- Franzin, A.; Sambo, F.; di Camillo, B. bnstruct: An R Package for Bayesian Network Structure Learning in the Presence of Missing Data. Bioinformatics 2017, 33, 1250–1252.
- Scanagatta, M.; Corani, G.; Zaffalon, M.; Yoo, J.; Kang, U. Efficient Learning of Bounded-Treewidth Bayesian Networks from Complete and Incomplete Data Sets. Int. J. Approx. Reason. 2018, 95, 152–166.
- Schafer, J.L. Multiple Imputation: A Primer. Stat. Methods Med. Res. 1999, 8, 3–15.
- R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020; ISBN 3-900051-07-0.
- Heckerman, D.; Geiger, D.; Chickering, D.M. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Mach. Learn. 1995, 20, 197–243.
- Geiger, D.; Heckerman, D. Learning Gaussian Networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, Seattle, WA, USA, 29–31 July 1994; pp. 235–243.
- Lauritzen, S.L.; Wermuth, N. Graphical Models for Associations Between Variables, Some of Which Are Qualitative and Some Quantitative. Ann. Stat. 1989, 17, 31–57.
- Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464.
- Scutari, M.; Denis, J.B. Bayesian Networks with Examples in R; Chapman & Hall: London, UK, 2014.
- Jadhav, A.; Pramod, D.; Ramanathan, K. Comparison of Performance of Data Imputation Methods for Numeric Dataset. Appl. Artif. Intell. 2019, 33, 913–933.
- Beretta, L.; Santaniello, A. Nearest Neighbor Imputation Algorithms: A Critical Evaluation. BMC Med. Inform. Decis. Mak. 2016, 16, 74.
- Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics 2001, 17, 520–525.
- White, I.R.; Royston, P.; Wood, A.M. Multiple Imputation Using Chained Equations: Issues and Guidance for Practice. Stat. Med. 2011, 30, 377–399.
- Bennett, D.A. How Can I Deal with Missing Data in My Study? Aust. N. Z. J. Public Health 2001, 25, 464–469.
- Watanabe, M.; Yamaguchi, K. The EM Algorithm and Related Statistical Models; Marcel Dekker: New York, NY, USA, 2004.
- McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions; Wiley: Hoboken, NJ, USA, 2008.
- Schouten, R.M.; Lugtig, P.; Vink, G. Generating Missing Values for Simulation Purposes: A Multivariate Amputation Procedure. J. Stat. Comput. Simul. 2018, 88, 2909–2930.
- Constantinou, A.C.; Liu, Y.; Chobtham, K.; Guo, Z.; Kitson, N.K. The Bayesys Data and Bayesian Network Repository; Queen Mary University of London: London, UK, 2020.

**Figure 2.** Leaf A. No EM algorithm proves to be more effective than the others (data sets with 5% missing data generated from the Alarm BN).

**Figure 3.** Leaf B. Hard EM achieves a KLD value significantly smaller than those of the other EM algorithms (data sets with 5% missing data generated from the Property BN).

**Figure 4.** Leaf E. Hard EM achieves a KLD value significantly smaller than those of the other EM algorithms (data sets with 1% missing data generated from the ForMed BN).

**Figure 5.** Leaf G. Hard EM achieves a KLD value significantly greater than those of the other EM algorithms (data sets with 1% missing data generated from the Pathfinder BN).

Network Size | Bayesian Network | Number of Nodes |
---|---|---|
small (from 2 to 20 nodes) | Asia | 8 |
 | Sports | 9 |
medium (from 21 to 50 nodes) | Alarm | 31 |
 | Property | 27 |
large (more than 50 nodes) | Hailfinder | 56 |
 | ForMed | 88 |
 | Pathfinder | 109 |

Leaf | Recommended Algorithm | Bayesian Networks |
---|---|---|
A | Hard, Soft, Soft-Forced | Asia, Alarm |
B | Hard | Sports, Property |
C | Soft, Soft-Forced | Sports, Property |
D | Hard | Sports, Property |
E | Hard | ForMed, Pathfinder, Hailfinder |
F | Hard | ForMed, Pathfinder, Hailfinder |
G | Soft, Soft-Forced | ForMed, Pathfinder |


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ruggieri, A.; Stranieri, F.; Stella, F.; Scutari, M.
Hard and Soft EM in Bayesian Network Learning from Incomplete Data. *Algorithms* **2020**, *13*, 329.
https://doi.org/10.3390/a13120329
