# PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data

## Abstract

## 1. Introduction

- A greedy data size calculation algorithm is proposed for adaptively partitioning a big dataset into data slices of appropriate size for distributed BN learning.
- A distributed three-layered ensemble approach called PEnBayes is proposed to achieve stable and accurate Bayesian network learning from big datasets at both the data and algorithm levels.

## 2. Background

#### 2.1. Distributed Data-Parallel Patterns and Supporting Systems for Scalable Big Data Application

#### 2.2. Scientific Workflow System

#### 2.3. Bayesian Network

#### 2.4. Bayesian Network Learning Algorithms

## 3. Related Work

## 4. Problem Formulation

Algorithm 1 CalculateALS. | |

Input: | |

D: Dataset; | |

ϵ_{1}, ϵ_{2}: Thresholds; | |

mstep: Maximum loop steps. | |

Output: | |

AMBS: Average Markov blanket size; | |

ALS: Appropriate Learning Size. | |

1: | bestAMBS = 1; bestES = −1; step = 0; |

2: | sliceSize = InitialSize * number of attributes in D; // Initial data slice size |

3: | D_{sliced} = read sliceSize rows from D; |

4: | BN_{DS} = LearnBNStructure(D_{sliced}); |

5: | currentAMBS = average Markov Blanket size of BN_{DS}; |

6: | currentES = Edge Strength of BN_{DS}; |

7: | while (step ≤ mstep) AND ((|currentAMBS − bestAMBS| > bestAMBS * ϵ_{1}) OR (|currentES − bestES| > bestES * ϵ_{2})) do |

8: | sliceSize = sliceSize * 2; |

9: | bestAMBS = currentAMBS; |

10: | bestES = currentES; |

11: | D_{sliced} = readData(D, nrows = sliceSize); |

12: | BN_{DS} = LearnBNStructure(D_{sliced}); |

13: | currentAMBS = average Markov Blanket size of BN_{DS}; |

14: | currentES = Edge Strength of BN_{DS}; |

15: | step = step + 1; |

16: | end while |

17: | ALS = number of records in D_{sliced}; |

18: | return ALS. |
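The doubling loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the dataset is a list-like object that supports slicing, and that a caller-supplied function `learn_stats` (a hypothetical stand-in for LearnBNStructure plus the AMBS and Edge Strength computations) returns the pair (AMBS, ES) for a slice. The value of InitialSize is not stated here, so `initial_size=10` is only a placeholder.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: maps a data slice to (AMBS, edge strength).
StatsFn = Callable[[Sequence], Tuple[float, float]]


def calculate_als(
    data: Sequence,
    n_attributes: int,
    learn_stats: StatsFn,
    eps1: float,
    eps2: float,
    mstep: int,
    initial_size: int = 10,  # placeholder; InitialSize is not given in the text
) -> Tuple[int, float]:
    """Double the slice size until AMBS and edge strength stop changing."""
    best_ambs, best_es, step = 1.0, -1.0, 0
    slice_size = initial_size * n_attributes
    sliced = data[:slice_size]
    current_ambs, current_es = learn_stats(sliced)

    # Continue while either statistic still moves by more than its
    # relative threshold and the step budget is not exhausted.
    while step <= mstep and (
        abs(current_ambs - best_ambs) > best_ambs * eps1
        or abs(current_es - best_es) > best_es * eps2
    ):
        slice_size *= 2
        best_ambs, best_es = current_ambs, current_es
        sliced = data[:slice_size]
        current_ambs, current_es = learn_stats(sliced)
        step += 1

    # ALS is the number of records actually read into the last slice.
    return len(sliced), current_ambs
```

Because bestES starts at −1, the edge-strength test is always satisfied on the first check, so the loop runs at least once whenever mstep ≥ 0.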

## 5. The Proposed Approach

#### 5.1. Overview of PEnBayes

#### 5.1.1. Adaptive Two-Stage Data Slicing

#### 5.1.2. Local Learner

#### 5.1.3. Global Ensemble

#### 5.2. Structure Ensemble Method

Algorithm 2 StructureEnsemble. | |

Input: | |

BN: BN Structures; | |

D: Data set; | |

T: Threshold factor. | |

Output: | |

BN_{E}: Ensembled BN Structure. | |

1: | Obtain $\mathbf{AM}[i]$ from $BN_{i}$; |

2: | $ES[i] = ES(BN_{i}, D)$; |

3: | $W(BN_{i}) = ES[i] / \sum_{k} ES[k]$; |

4: | $\mathbf{WAM}_{BN_{i}} = \mathbf{AM}[i] * W(BN_{i})$; |

5: | $\mathbf{FWAM} = \sum_{i} \mathbf{WAM}_{BN_{i}}$; |

6: | $\gamma = T * \min_{i}(W(BN_{i}))$; |

7: | if $\mathbf{FWAM}[i,j] > \gamma$ and $i \to j$ does not form a cycle in $BN_{E}$ then |

8: | $BN_{E}[i,j] = 1$; |

9: | end if |

10: | return $BN_{E}$. |
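The weighted voting in Algorithm 2 can be illustrated with a short Python sketch. Several points are assumptions made for the example only: adjacency matrices are plain nested lists, the Edge Strength of each input structure is computed beforehand and passed in as `edge_strengths`, and candidate edges are visited in row-major order (the order in which edges are tested for cycles is not specified above).

```python
from typing import List

Matrix = List[List[float]]


def has_path(adj: Matrix, src: int, dst: int) -> bool:
    """Return True if dst is reachable from src along directed edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(k for k, edge in enumerate(adj[node]) if edge)
    return False


def structure_ensemble(
    adj_matrices: List[Matrix], edge_strengths: List[float], t: float
) -> Matrix:
    """Merge structures by edge-strength-weighted voting, keeping a DAG."""
    n = len(adj_matrices[0])
    total = sum(edge_strengths)
    weights = [es / total for es in edge_strengths]

    # FWAM: sum of the adjacency matrices, each scaled by its weight.
    fwam = [
        [sum(w * am[i][j] for w, am in zip(weights, adj_matrices))
         for j in range(n)]
        for i in range(n)
    ]
    gamma = t * min(weights)

    bn_e = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Adding i -> j closes a cycle iff j already reaches i.
            if fwam[i][j] > gamma and not has_path(bn_e, j, i):
                bn_e[i][j] = 1
    return bn_e
```

A different visiting order (for example, by descending FWAM value) could keep a different edge when two candidates conflict, so the row-major choice here should be read as one possible convention.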

#### 5.3. Data Slice Learner

Algorithm 3 DataSliceLearner. | |

Input: | |

DS: Data slice. | |

Output: | |

BN_{DS}: Merged network structure in matrix. | |

1: | $BN_{MMHC} = MMHC(DS)$; |

2: | $BN_{HC} = HC(DS)$; |

3: | $BN_{Tabu} = Tabu(DS)$; |

4: | $BNs = [BN_{MMHC}, BN_{HC}, BN_{Tabu}]$; |

5: | $T = 2$; |

6: | $BN_{DS} = StructureEnsemble(BNs, DS, T)$; |

7: | return $BN_{DS}$. |

#### 5.4. Local Learner

Algorithm 4 LocalLearner. | |

Input: | |

DS: Data slices; | |

N_{d}: number of data slices. | |

Output: | |

BN_{Local}: Local network structure. | |

1: | for each $DS_{k}$ do |

2: | $BN_{DS}[k] = \mathrm{DataSliceLearner}(DS_{k})$; |

3: | $DS_{B}$ = the data slice with the best Edge Strength; |

4: | end for |

5: | $BN_{Local} = StructureEnsemble(BN_{DS}, DS_{B}, N_{d}/2)$; |

6: | return $BN_{Local}$. |
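The control flow of Algorithm 4 can be sketched as below. The three callables are hypothetical placeholders for DataSliceLearner, the Edge Strength computation, and StructureEnsemble. The sketch also assumes that "best" means the highest Edge Strength, and that each slice is scored using the network learned from that same slice; neither detail is stated explicitly above.

```python
from typing import Any, Callable, List, Sequence


def local_learner(
    data_slices: Sequence[Any],
    slice_learner: Callable[[Any], Any],          # stand-in for DataSliceLearner
    edge_strength: Callable[[Any, Any], float],   # stand-in for ES(BN, DS)
    ensemble: Callable[[List[Any], Any, float], Any],  # stand-in for StructureEnsemble
) -> Any:
    """Learn one structure per slice, then merge them with T = N_d / 2."""
    structures: List[Any] = []
    best_slice, best_es = None, float("-inf")

    for ds in data_slices:
        bn = slice_learner(ds)
        structures.append(bn)
        # Track the slice whose learned network scores highest (assumed "best").
        es = edge_strength(bn, ds)
        if es > best_es:
            best_slice, best_es = ds, es

    n_d = len(data_slices)
    return ensemble(structures, best_slice, n_d / 2)
```

Passing the learners in as arguments keeps the sketch independent of any particular BN library.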

#### 5.5. Global Ensemble

Algorithm 5 GlobalEnsemble. | |

Input: | |

LS: Local Structures; | |

DS_{BG}: Data slice with the best global Edge Strength; | |

K: Number of Local Learners. | |

Output: | |

BN_{final}: Final network structure. | |

1: | $BN_{final} = StructureEnsemble(LS, DS_{BG}, 2 * K / 3)$; |

2: | return $BN_{final}$. |
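To see what the three threshold factors imply, consider the simplified case in which all $n$ input structures $BN_{k}$ receive equal weight. This is an assumption made only for illustration; in general the weights depend on Edge Strength. Writing $c_{ij}$ for the number of input structures that contain the edge $i \to j$, the quantities in Algorithm 2 become:

$$
W(BN_{k}) = \frac{1}{n}, \qquad \gamma = T \cdot \frac{1}{n} = \frac{T}{n}, \qquad \mathbf{FWAM}[i,j] = \frac{c_{ij}}{n}
$$

$$
\mathbf{FWAM}[i,j] > \gamma \iff c_{ij} > T
$$

Under this assumption, the data slice learner ($T = 2$, $n = 3$) keeps an edge only when all three algorithms propose it, the local learner ($T = N_{d}/2$, $n = N_{d}$) keeps an edge proposed by a strict majority of data slices, and the global ensemble ($T = 2K/3$, $n = K$) keeps an edge proposed by more than two thirds of the local learners. In each case the edge is added only if it does not create a cycle.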

#### 5.6. The Time Complexity of PEnBayes

## 6. PEnBayes Workflow in Kepler

#### 6.1. Overall Workflow

#### 6.2. ALS Calculation Sub-Workflow

#### 6.3. Local Learner Sub-Workflow

#### 6.4. Global Ensemble Sub-Workflow

## 7. Experimental Setup

#### 7.1. Hardware Specification

#### 7.2. Datasets

#### 7.3. PEnBayes Experimental Setup

#### 7.4. Baseline Experimental Setup

## 8. Experimental Results

#### 8.1. ALS Calculation Results

#### 8.2. PEnBayes and Baseline Experimental Result Comparison

#### 8.3. PEnBayes Scalability Experiments

## 9. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
BN | Bayesian Network |
PEnBayes | Parallel Ensemble-based Bayesian network learning |
MB | Markov Blanket |
ALS | Appropriate Learning Size |
DAG | Directed Acyclic Graph |
DDP | Distributed Data-Parallel |
UDF | User-Defined Function |
GUI | Graphical User Interface |
BDeu | Bayesian Dirichlet equivalence with uniform prior |
HC | Hill Climbing |
TPDA | Three Phase Dependency Analysis |
MMHC | Max-Min Hill-Climbing |
AMBS | Average Markov Blanket Size |
ES | Edge Strength |
FWAM | Final Weighted Adjacent Matrix |
ALARM | A Logical Alarm Reduction Mechanism |
SHD | Structural Hamming Distance |

## References

**Figure 7.** Structural Hamming distance (SHD) of different data set sizes using calculated $ALS$ (Table 3) as reference value (red circle), Child Dataset.

**Figure 8.** Structural Hamming distance (SHD) of different data set sizes using calculated $ALS$ (Table 3) as reference value (red circle), Insurance Dataset.

**Figure 9.** Structural Hamming distance (SHD) of different data set sizes using calculated $ALS$ (Table 3) as reference value (red circle), Alarm Dataset.

**Figure 13.** Alarm Dataset Accuracy Results. Negative values indicate that the algorithm was unsuccessful in learning a network for the dataset.

Name | Nodes | Edges | AMBS | Edge Strength |
---|---|---|---|---|
Alarm | 37 | 46 | 3.51 | 0.23 |
Child | 20 | 25 | 3.0 | 0.49 |
Insurance | 27 | 52 | 5.19 | 0.25 |

Dataset | 10 M | 20 M | 50 M | 100 M | 150 M | 200 M |
---|---|---|---|---|---|---|
Alarm | 1.828 | 3.656 | 9.140 | 18.280 | 27.421 | 36.561 |
Child | 1.073 | 2.146 | 5.364 | 10.728 | 16.093 | 21.457 |
Insurance | 1.829 | 3.658 | 9.144 | 18.288 | 27.432 | 36.576 |

Network | Calculated ALS | Calculated AMBS | Actual AMBS |
---|---|---|---|
Alarm | 14,800 | 3.656 | 3.51 |
Child | 4,000 | 3.00 | 3.00 |
Insurance | 43,200 | 4.66 | 5.19 |

