# PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data

^{1}

^{2}

^{3}

^{*}

## Abstract

**:**

## 1. Introduction

- A greedy data size calculation algorithm is proposed for adaptively partitioning a big dataset into data slices of appropriate size for distributed BN learning.
- A distributed three-layered ensemble approach called PenBayes is proposed to achieve stable and accurate Bayesian network learning from big datasets at both data and algorithm levels.

## 2. Background

#### 2.1. Distributed Data-Parallel Patterns and Supporting Systems for Scalable Big Data Application

#### 2.2. Scientific Workflow System

#### 2.3. Bayesian Network

**Definition**

**1.**

**Definition**

**2.**

#### 2.4. Bayesian Network Learning Algorithms

## 3. Related Work

## 4. Problem Formulation

**Definition**

**3.**

**Definition**

**4.**

**Theorem**

**1.**

**Proof.**

Algorithm 1 CalculateALS. | |

Input: | |

D: Dataset; | |

ϵ_{1}, ϵ_{2}: Thresholds; | |

mstep: Maximum loop steps. | |

Output: | |

AMBS: Average Markov blanket size; | |

ALS: Appropriate Learning Size. | |

1: | bestAMBS = 1; bestES = −1; step = 0; |

2: | sliceSize = InitialSize * number of attributes in D;// Initial data slice size |

3: | D_{sliced} = read sliceSize rows from D; |

4: | BN_{DS} = LearnBNStructure(D_{sliced}); |

5: | currentAMBS = average Markov Blanket size of BN_{DS}; |

6: | currentES = Edge Strength of BN_{DS}; |

7: | while (step ≤ mstep) AND ((|currentAMBS − bestAMBS| > bestAMBS * ϵ_{1}) OR (|currentES −
bestES| > bestES * ϵ_{2})) do |

8: | sliceSize = sliceSize * 2; |

9: | bestAMBS = currentAMBS |

10: | bestES = currentES; |

11: | D_{sliced} = readData(D, nrows = sliceSize); |

12: | BD_{DS} = learnBNStructure(D_{sliced}); |

13: | currentAMBS = AMBS of BD_{DS}; |

14: | currentES = Edge Strength of BD_{DS}; |

15: | step = step + 1; |

16: | end while |

17: | ALS = number of records in D_{sliced}; |

18: | returnALS. |

**Definition**

**5.**

**Definition**

**6.**

**Definition**

**7.**

## 5. The Proposed Approach

#### 5.1. Overview of PEnBayes

#### 5.1.1. Adaptive Two-Stage Data Slicing

#### 5.1.2. Local Learner

#### 5.1.3. Global Ensemble

#### 5.2. Structure Ensemble Method

Algorithm 2 StructureEnsemble. | |

Input: | |

BN: BN Structures; | |

D: Data set; | |

T: Threshold factor. | |

Output: | |

BN_{E}: Ensembled BN Structure. | |

1: | Obtain $\mathbf{AM}\left[i\right]$ from $B{N}_{i}$; |

2: | $ES\left[i\right]$ = $ES(B{N}_{i},D)$; |

3: | $W\left(B{N}_{i}\right)$ = $ES\left[i\right]/\sum ES\left[k\right]$ |

4: | ${\mathbf{WAM}}_{B{N}_{i}}$ = $\mathbf{AM}\left[i\right]*W\left(B{N}_{i}\right)$; |

5: | $\mathbf{FWAM}=\sum {\mathbf{WAM}}_{B{N}_{i}}$; |

6: | $\gamma =T\ast min\left(W\left(B{N}_{i}\right)\right)$, |

7: | if$\mathbf{FWAM}[i,j]>\gamma $ and i->j does not form a circle in $B{N}_{E}$ then |

8: | $B{N}_{E}[i,j]$ = 1; |

9: | end if |

10: | return$B{N}_{E}$. |

#### 5.3. Data Slice Learner

Algorithm 3 DataSliceLearner. | |

Input: | |

DS: Data slice. | |

Output: | |

BN_{DS}: Merged network structure in matrix. | |

1: | $B{N}_{MMHC}$ = $MMHC$($DS$); |

2: | $B{N}_{HC}$ = $HC$($DS$); |

3: | $B{N}_{Tabu}$ = $Tabu$($DS$); |

4: | $BNs=[B{N}_{MMHC},B{N}_{HC},B{N}_{Tabu}]$ |

5: | $T=2;$ |

6: | $B{N}_{DS}=StructureEnsemble(BNs,DS,T)$; |

7: | return$B{N}_{DS}$ |

#### 5.4. Local Learner

Algorithm 4 LocalLearner. | |

Input: | |

DS: Data slices; | |

N_{d}: number of data slices. | |

Output: | |

BN_{Local}: Local network structure. | |

1: | For each$D{S}_{k}$ |

2: | $B{N}_{DS}\left[k\right]=\mathrm{DataSliceLearner}\left(D{S}_{k}\right)$; |

3: | $D{S}_{B}$ = the data slice with the best Edge Strength; |

4: | End For |

5: | $B{N}_{local}=Structureensemble(B{N}_{DS},D{S}_{B},{N}_{d}/2)$ |

6: | return$B{N}_{local}$. |

#### 5.5. Global Ensemble

Algorithm 5 GlobalEnsemble. | |

Input: | |

LS: Local Structures; | |

DS_{BG}: Data slice with the best global Edge Strength; | |

K: Number of Local Learners. | |

Output: | |

BN_{final}: Local network structure. | |

1: | $B{N}_{final}=StructureEnsemble(LS,D{S}_{BG},2\ast K/3)$ |

2: | return$B{N}_{final}$. |

#### 5.6. The Time Complexity of PenBayes

## 6. PEnBayes Workflow in Kepler

#### 6.1. Overall Workflow

#### 6.2. ALS Calculation Sub-Workflow

#### 6.3. Local Learner Sub-Workflow

#### 6.4. Global Ensemble Sub-Workflow

## 7. Experimental Setup

#### 7.1. Hardware Specification

#### 7.2. Datasets

#### 7.3. PEnBayes Experimental Setup

#### 7.4. Baseline Experimental Setup

## 8. Experimental Results

#### 8.1. ALS Calculation Results

#### 8.2. PEnBayes and Baseline Experimental Result Comparison

#### 8.3. PEnBayes Scalability Experiments

## 9. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

BN | Bayesian Networks |

PEnBayes | Parallel Ensemble based Bayesian network learning |

MB | Markov Blanket |

ALS | Appropriate Learning Size |

DAG | directed acyclic graph |

DDP | Distributed Data-Parallel |

UDF | User Defined Functions |

GUI | Graphical User Interface |

BDeu | Bayesian Dirichlet equivalence with uniform prior |

UDF | User Defined Functions |

HC | Hill Climbing |

TPDA | Three Phase Dependency Analysis |

MMHC | Max-Min Hill-Climbing |

AMBS | Average Markov Blanket Size |

ES | Edge Strength |

FWAM | Final Weighted Adjacent Matrix |

AMBS | Average Markov Blanket Size |

ALARM | A Logical Alarm Reduction Mechanism |

SHD | Structural Hamming distance |

## References

- Yoo, C.; Ramirez, L.; Liuzzi, J. Big Data Analysis Using Modern Statistical and Machine Learning Methods in Medicine. Int. Neurourol J.
**2014**, 18, 50–57. [Google Scholar] [CrossRef] [PubMed] - Hasna, N.; Jamoussi, S. Weighted ensemble learning of Bayesian network for gene regulatory networks. Neurocomputing
**2015**, 150, 404–416. [Google Scholar] - Yue, K.; Wu, H.; Fu, X.; Xu, J.; Yin, Z.; Liu, W. A data-intensive approach for discovering user similarities in social behavioral interactions based on the bayesian network. Neurocomputing
**2017**, 219, 364–375. [Google Scholar] [CrossRef] - Yang, J.; Tong, Y.; Liu, X.; Tan, S. Causal inference from financial factors: Continuous variable based local structure learning algorithm. In Proceedings of the 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), London, UK, 27–28 March 2014; pp. 278–285. [Google Scholar]
- He, L.; Liu, B.; Hu, D.; Ying, W.; Meng, W.; Long, J. Motor imagery EEG signals analysis based on Bayesian network with Gaussian distribution. Neurocomputing
**2016**, 188, 217–224. [Google Scholar] [CrossRef] - Chickering, D.; Heckerman, D.; Meek, C. Large-Sample Learning of Bayesian Networks is NP-Hard. J. Mach. Learn. Res.
**2004**, 5, 1287–1330. [Google Scholar] - Fang, Q.; Yue, K.; Fu, X.; Wu, H.; Liu, W. A MapReduce-Based Method for Learning Bayesian Network from Massive Data. In Proceedings of the 15th Asia-Pacific Web Conference (APWeb 2013), Sydney, Australia, 4–6 April 2013; pp. 697–708. [Google Scholar]
- Wang, J.; Tang, Y.; Nguyen, M.; Altintas, I. A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning. In Proceedings of the 2014 IEEE/ACM International Symposium on Big Data Computing (BDC 2014), London, UK, 8–11 December 2014; pp. 16–25. [Google Scholar]
- Apache Spark Project. Available online: http://spark.apache.org (accessed on 2 October 2019).
- The Kepler Project. Available online: https://kepler-project.org (accessed on 2 October 2019).
- Khan, Z.A.; Anjum, A.; Soomro, K.; Tahir, M.A. Towards cloud based big data analytics for smart future cities. J. Cloud Comput.
**2015**, 4, 2. [Google Scholar] [CrossRef] - Talia, D. A view of programming scalable data analysis: From clouds to exascale. J. Cloud Comput.
**2019**, 8, 4. [Google Scholar] [CrossRef] - Wang, J.; Crawl, D.; Altintas, I.; Li, W. Big data applications using workflows for data parallel computing. Comput. Sci. Eng.
**2014**, 16, 11–21. [Google Scholar] [CrossRef] - Apache Hadoop Project. Available online: http://hadoop.apache.org (accessed on 2 October 2019).
- The Stratosphere Project. Available online: http://stratosphere.eu/ (accessed on 2 October 2019).
- Apache Flink Project. Available online: http://flink.apache.org (accessed on 2 October 2019).
- Gonzalez, N.M.; de Brito Carvalho, T.C.M.; Miers, C.C. Cloud resource management: towards efficient execution of large-scale scientific applications and workflows on complex infrastructures. J. Cloud Comput.
**2017**, 6, 13. [Google Scholar] [CrossRef] - Altintas, I.; Berkley, C.; Jaeger, E.; Jones, M.; Ludascher, B.; Mock, S. Kepler: An extensible system for design and execution of scientific workflows. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management, Santorini Island, Greece, 23 June 2004; pp. 423–424. [Google Scholar]
- Ludäscher, B.; Altintas, I.; Berkley, C.; Higgins, D.; Jaeger, E.; Jones, M.; Lee, E.A.; Tao, J.; Zhao, Y. Scientific workflow management and the Kepler system. Concurr. Comput. Pract. Exp.
**2006**, 18, 1039–1065. [Google Scholar] [CrossRef] - Goderis, A.; Brooks, C.; Altintas, I.; Lee, E.A.; Goble, C. Heterogeneous composition of models of computation. Future Gener. Comput. Syst.
**2009**, 25, 552–560. [Google Scholar] [CrossRef] [Green Version] - Ben-Gal, I. Bayesian Networks. Available online: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470061572.eqr089 (accessed on 2 October 2019).
- Korb, K.B.; Nicholson, A.E. Bayesian Artificial Intelligence; CRC Press: Boca Raton, FL, USA, 2010. [Google Scholar]
- Cheng, J.; Greiner, R.; Kelly, J.; Bell, D.; Liu, W. Learning Bayesian networks from data: An information-theory based approach. Artif. Intell.
**2002**, 137, 43–90. [Google Scholar] [CrossRef] [Green Version] - Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1988. [Google Scholar]
- Spirtes, P.; Glymour, C.N.; Scheines, R.; Heckerman, D.; Meek, C.; Cooper, G.; Richardson, T. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA; London, UK, 2001. [Google Scholar]
- Meek, C. Strong completeness and faithfulness in Bayesian networks. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–20 August 1995; pp. 411–418. [Google Scholar]
- Tsamardinos, I.; Brown, L.E.; Aliferis, C.F. The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn.
**2006**, 65, 31–78. [Google Scholar] [CrossRef] [Green Version] - Tsamardinos, I.; Aliferis, C.F.; Statnikov, A.R.; Statnikov, E. Algorithms for Large Scale Markov Blanket Discovery. In Proceedings of the FLAIRS 2003, St. Augustine, FL, USA, 12–14 May 2003; Volume 2, pp. 376–380. [Google Scholar]
- Heckerman, D.; Geiger, D.; Chickering, D. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn.
**1995**, 20, 197–243. [Google Scholar] [CrossRef] [Green Version] - Yaramakala, S.; Margaritis, D. Speculative Markov blanket discovery for optimal feature selection. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005; pp. 809–812. [Google Scholar]
- Khanteymoori, A.R.; Olyaee, M.; Abbaszadeh, O.; Valian, M. A novel method for Bayesian networks structure learning based on Breeding Swarm algorithm. Soft Comput.
**2018**, 22, 3049–3060. [Google Scholar] [CrossRef] - Zhang, Y.; Zhang, W.; Xie, Y. Improved heuristic equivalent search algorithm based on Maximal Information Coefficient for Bayesian Network Structure Learning. Neurocomputing
**2013**, 117, 186–195. [Google Scholar] [CrossRef] - Yuan, J.; Wang, Z.; Sun, Y.; Wei, Z.; Jiang, J. An effective pattern-based Bayesian classifier for evolving data stream. Neurocomputing
**2018**, 295, 17–28. [Google Scholar] [CrossRef] - Nikravesh, A.Y.; Ajila, S.A.; Lung, C. Using genetic algorithms to find optimal solution in a search space for a cloud predictive cost-driven decision maker. J. Cloud Comput.
**2018**, 7, 20. [Google Scholar] [CrossRef] - Zhu, X.; Yuan, C. An Exact Algorithm for Solving Most Relevant Explanation in Bayesian Networks. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 3649–3656. [Google Scholar]
- Ordyniak, S.; Szeider, S. Parameterized Complexity Results for Exact Bayesian Network Structure Learning. J. Artif. Intell. Res.
**2013**, 46, 263–302. [Google Scholar] [CrossRef] [Green Version] - Wilczynski, B.; Dojer, N. BNFinder: Exact and efficient method for learning Bayesian networks. Bioinformatics
**2009**, 25, 286–287. [Google Scholar] [CrossRef] [PubMed] - Yue, K.; Fang, Q.; Wang, X.; Li, J.; Weiyi, W. A Parallel and Incremental Approach for Data-Intensive Learning of Bayesian Networks. IEEE Trans. Cybern.
**2017**, 45, 2890–2904. [Google Scholar] [CrossRef] [PubMed] - Martínez, A.M.; Webb, G.I.; Chen, S.; Zaidi, N.A. Scalable learning of Bayesian network classifiers. J. Mach. Learn. Res.
**2016**, 17, 1515–1549. [Google Scholar] - Tang, Y.; Wang, Y.; Cooper, K.M.; Li, L. Towards Big Data Bayesian Network Learning - An Ensemble Learning Based Approach. In Proceedings of the 2014 IEEE International Congress on Big Data, Anchorage, AK, USA, 27 June–2 July 2014; pp. 355–357. [Google Scholar]
- Rokach, L. Ensemble-based classifiers. Artif. Intell. Rev.
**2010**, 33, 1–39. [Google Scholar] [CrossRef] - Abusitta, A.; Bellaïche, M.; Dagenais, M. An SVM-based framework for detecting DoS attacks in virtualized clouds under changing environment. J. Cloud Comput.
**2018**, 7, 9. [Google Scholar] [CrossRef] [Green Version] - Wu, G.; Li, H.; Hu, X.; Bi, Y.; Zhang, J.; Wu, X. MReC4. 5: C4. 5 ensemble classification with MapReduce. In Proceedings of the 2009 Fourth ChinaGrid Annual Conference, Yantai, China, 21–22 August 2009; pp. 249–255. [Google Scholar]
- Breiman, L. Bagging predictors. Mach. Learn.
**1996**, 24, 123–140. [Google Scholar] [CrossRef] [Green Version] - Lin, J.; Kolcz, A. Large-scale machine learning at twitter. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, 20–24 May 2012; pp. 793–804. [Google Scholar]
- Madsen, A.L.; Jensen, F.; Salmerón, A.; Langseth, H.; Nielsen, T.D. A parallel algorithm for Bayesian network structure learning from large data sets. Knowl.-Based Syst.
**2017**, 117, 46–55. [Google Scholar] [CrossRef] [Green Version] - MLlib: Apache Spark’s Scalable Machine Learning Library. Available online: https://spark.apache.org/mllib/ (accessed on 2 October 2019).
- The Mahout Project. Available online: http://mahout.apache.org/ (accessed on 2 October 2019).
- H2O.ai: Brings AI to Enterprise. Available online: http://www.h2o.ai/ (accessed on 2 October 2019).
- FlinkML: Machine Learning for Flink. Available online: https://github.com/FlinkML (accessed on 2 October 2019).
- Beinlich, I.; Suermondt, H.; Chavez, R.; Cooper, G. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, London, UK, 29–31 August 1989; pp. 247–256. [Google Scholar]
- Cowell, R.G.; Dawid, P.; Lauritzen, S.L.; Spiegelhalter, D.J. Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks; Springer Science & Business Media: New York, NY, USA, 2007. [Google Scholar]
- Binder, J.; Koller, D.; Russell, S.; Kanazawa, K. Adaptive probabilistic networks with hidden variables. Mach. Learn.
**1997**, 29, 213–244. [Google Scholar] [CrossRef] - SamIam Tool for Modeling and Reasoning with Bayesian Networks. Available online: http://reasoning.cs.ucla.edu/samiam/ (accessed on 2 October 2019).
- Scutari, M. Learning Bayesian Networks with the bnlearn R Package. J. Stat. Softw.
**2010**, 35, 1–22. [Google Scholar] [CrossRef] - Zaharia, M.; Konwinski, A.; Joseph, A.D.; Katz, R.H.; Stoica, I. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, San Diego, CA, USA, 8–10 December 2008; Volume 8, pp. 7–18. [Google Scholar]
- Saranti, A.; Taraghi, B.; Ebner, M.; Holzinger, A. Insights into Learning Competence Through Probabilistic Graphical Models. In Proceedings of the International Cross-Domain Conference, CD-MAKE 2019, Canterbury, UK, 26–29 August 2019; pp. 250–271. [Google Scholar]
- Goebel, R.; Chander, A.; Holzinger, K.; Lecue, F.; Akata, Z.; Stumpf, S.; Kieseberg, P.; Holzinger, A. Explainable AI: The New 42? In Proceedings of the International Cross-Domain Conference, CD-MAKE 2018, Hamburg, Germany, 27–30 August 2018; pp. 295–303. [Google Scholar]

**Figure 7.**Structural Hamming distance (SHD) of different data set sizes using calculated $ALS$ (Table 3) as reference value (red circle), Child Dataset.

**Figure 8.**Structural Hamming distance (SHD) of different data set sizes using calculated $ALS$ (Table 3) as reference value (red circle), Insurance Dataset.

**Figure 9.**Structural Hamming distance (SHD) of different data set sizes using calculated $ALS$ (Table 3) as reference value (red circle), Alarm Dataset.

**Figure 13.**Alarm Dataset Accuracy Results. Negative values indicate that the algorithm was unsuccessful in learning a network for the dataset.

Name | Nodes | Edges | AMBS | Edge Strength |
---|---|---|---|---|

Alarm | 37 | 46 | 3.51 | 0.23 |

Child | 20 | 25 | 3.0 | 0.49 |

Insurance | 27 | 52 | 5.19 | 0.25 |

Dataset | 10 M | 20 M | 50 M | 100 M | 150 M | 200 M |
---|---|---|---|---|---|---|

Alarm | 1.828 | 3.656 | 9.140 | 18.280 | 27.421 | 36.561 |

Child | 1.073 | 2.146 | 5.364 | 10.728 | 16.093 | 21.457 |

Insurance | 1.829 | 3.658 | 9.144 | 18.288 | 27.432 | 36.576 |

Network | Calculated ALS | Calculated AMBS | Actual AMBS |
---|---|---|---|

Alarm | 14,800 | 3.656 | 3.51 |

Child | 4000 | 3.00 | 3.00 |

Insurance | 43,200 | 4.66 | 5.19 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tang, Y.; Wang, J.; Nguyen, M.; Altintas, I.
PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data. *Sensors* **2019**, *19*, 4400.
https://doi.org/10.3390/s19204400

**AMA Style**

Tang Y, Wang J, Nguyen M, Altintas I.
PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data. *Sensors*. 2019; 19(20):4400.
https://doi.org/10.3390/s19204400

**Chicago/Turabian Style**

Tang, Yan, Jianwu Wang, Mai Nguyen, and Ilkay Altintas.
2019. "PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data" *Sensors* 19, no. 20: 4400.
https://doi.org/10.3390/s19204400