# Probabilistic Ensemble of Deep Information Networks

## Abstract

## 1. Introduction

- extreme flexibility and high modularity: all the nodes are functionally equivalent and have a small number of inputs and outputs, which opens good opportunities for a possible hardware implementation;
- high parallelizability: each tree can be trained in parallel with the others;
- low memory usage: the network needs to be fed with data only at the first layer, and simple incremental counters can be used to estimate the initial probability mass distribution; and
- short training time and low training complexity: the locality of the computed cost function allows a nodewise training that does not require any information from other parts of the tree apart from its feeding nodes (which are usually very few, e.g., 2–3).

## 2. The DIN Architecture and Its Training

#### 2.1. The Input Node

- ${\mathbf{x}}_{in}$ of size ${N}_{train}$, whose elements take values in a set of cardinality ${N}_{in}$; ${\mathbf{x}}_{in}$ corresponds to one of the D features of the dataset (typically one column)
- $\mathbf{y}$ of size ${N}_{train}$, whose elements take values in a set of cardinality ${N}_{class}$; $\mathbf{y}$ corresponds to the known classes of the ${N}_{train}$ points
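As a concrete sketch, the input-node matrices can be estimated from the two training vectors with the simple incremental counters mentioned in the introduction; the function name and the small numerical guard below are illustrative, not part of the paper:

```python
import numpy as np

def input_node(x_in, y, N_in, N_class):
    """Estimate the input-node probability matrices from the training
    vectors x_in (feature values in {0..N_in-1}) and y (class labels in
    {0..N_class-1}) using simple incremental counters."""
    counts = np.zeros((N_in, N_class))
    for xi, yi in zip(x_in, y):
        counts[xi, yi] += 1          # joint counter for (X_in, Y)
    P_xy = counts / len(x_in)        # empirical joint pmf
    P_x = P_xy.sum(axis=1)           # P(X_in = i)
    P_y = P_xy.sum(axis=0)           # P(Y = m)
    # Conditional matrices; the guard avoids division by zero for
    # feature values or classes never seen in the training set.
    P_y_given_x = P_xy / np.maximum(P_x[:, None], 1e-12)
    P_x_given_y = P_xy / np.maximum(P_y[None, :], 1e-12)
    return P_x, P_y, P_y_given_x, P_x_given_y
```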

#### 2.2. The Information Node

- The input probability matrices ${\mathbf{P}}_{{X}_{in}},{\mathbf{P}}_{{X}_{in}|Y},{\mathbf{P}}_{Y|{X}_{in}},{\mathbf{P}}_{Y}$ describe the input random variable ${X}_{in}$, with ${N}_{in}$ possible values, and its relationship with class Y.
- The output matrices ${\mathbf{P}}_{{X}_{out}},{\mathbf{P}}_{{X}_{out}|Y},{\mathbf{P}}_{Y|{X}_{out}},{\mathbf{P}}_{Y}$ describe the output random variable ${X}_{out}$, with ${N}_{out}$ possible values, and its relationship with Y.

- $P({X}_{out}=j)$ is the probability mass function of the output random variable ${X}_{out}$:$$P({X}_{out}=j)=\sum _{i=0}^{{N}_{in}-1}P({X}_{in}=i)P({X}_{out}=j|{X}_{in}=i),\phantom{\rule{1.em}{0ex}}j=0,\dots ,{N}_{out}-1$$
- $d(i,j)$ is the Kullback–Leibler divergence:$$\begin{array}{cc}\hfill d(i,j)& =\sum _{m=0}^{{N}_{class}-1}P(Y=m|{X}_{in}=i){log}_{2}\frac{P(Y=m|{X}_{in}=i)}{P(Y=m|{X}_{out}=j)}\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =\mathbb{KL}(P(Y|{X}_{in}=i)||P(Y|{X}_{out}=j))\hfill \end{array}$$where$$\begin{array}{cc}\hfill P(Y=m|{X}_{out}=j)=\sum _{i=0}^{{N}_{in}-1}& P(Y=m|{X}_{in}=i)P({X}_{in}=i|{X}_{out}=j),\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& m=0,\dots ,{N}_{class}-1,j=0,\dots ,{N}_{out}-1\hfill \end{array}$$
- $\beta $ is a real positive parameter.
- $Z(i;\beta )$ is a normalizing coefficient that ensures$$\sum _{j=0}^{{N}_{out}-1}P({X}_{out}=j|{X}_{in}=i)=1.$$
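A minimal sketch of the resulting node training, assuming the standard information-bottleneck update $P(X_{out}=j|X_{in}=i)=\frac{P(X_{out}=j)}{Z(i;\beta)}\,2^{-\beta d(i,j)}$ iterated in Blahut–Arimoto fashion (the base-2 exponent matches the $\log_2$ in $d(i,j)$; the function name, initialization, and numerical guards are illustrative):

```python
import numpy as np

def info_node_update(P_x_in, P_y_given_xin, N_out, beta, n_iter=20, seed=None):
    """Iterate the information-bottleneck update of P(X_out | X_in)
    for one information node."""
    rng = np.random.default_rng(seed)
    N_in = len(P_x_in)
    # Random row-stochastic initialization of P(X_out=j | X_in=i).
    P_out_in = rng.random((N_in, N_out))
    P_out_in /= P_out_in.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        P_out = P_x_in @ P_out_in                    # P(X_out = j)
        # Bayes rule: P(X_in=i | X_out=j), guarding unused outputs.
        P_in_out = (P_out_in * P_x_in[:, None]) / np.maximum(P_out, 1e-12)
        P_y_given_xout = P_in_out.T @ P_y_given_xin  # P(Y=m | X_out=j)
        # KL divergence d(i, j) in bits, as defined above.
        ratio = P_y_given_xin[:, None, :] / np.maximum(
            P_y_given_xout[None, :, :], 1e-12)
        d = np.sum(P_y_given_xin[:, None, :]
                   * np.log2(np.maximum(ratio, 1e-12)), axis=2)
        P_out_in = P_out[None, :] * 2.0 ** (-beta * d)
        P_out_in /= P_out_in.sum(axis=1, keepdims=True)  # Z(i; beta)
    return P_out_in
```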

#### 2.3. The Combiner

**assuming** that ${X}_{out,a}$ and ${X}_{out,b}$ are conditionally independent given Y (note that in the implementation of [16] this assumption was not needed):
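Under this assumption, the conditional pmf of the combined variable given Y factorizes, so each column of its conditional matrix is the Kronecker product of the corresponding columns of the two input matrices; a toy sketch with hypothetical $2\times 2$ matrices:

```python
import numpy as np

# Hypothetical 2x2 conditional matrices P(X_out,a | Y) and P(X_out,b | Y),
# one column per class value m (columns are pmfs).
P_a_given_y = np.array([[0.8, 0.3],
                        [0.2, 0.7]])
P_b_given_y = np.array([[0.6, 0.1],
                        [0.4, 0.9]])

# Conditional independence given Y: for each class m, the joint pmf of
# the pair (X_out,a, X_out,b) is the Kronecker product of the columns.
P_c_given_y = np.column_stack([
    np.kron(P_a_given_y[:, m], P_b_given_y[:, m])
    for m in range(P_a_given_y.shape[1])
])
# Each column is still a valid pmf over the N1*N1 combined values.
```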

#### 2.4. The Tree Architecture

#### 2.5. A Note on Computational Complexity and Memory Requirements

## 3. The Running Phase

- (a) input node $I(2k)$ passes value i to info node $(2k,0)$;
- (b) input node $I(2k+1)$ passes value j to info node $(2k+1,0)$;

- (a) info node $(2k,0)$ passes the probability vector ${\mathbf{p}}_{a}={\mathbf{P}}_{{X}_{out}(2k,0)|{X}_{in}(2k,0)}(i,:)$ (ith row) to the combiner; ${\mathbf{p}}_{a}$ stores the conditional probabilities $P({X}_{out}(2k,0)=g|\mathbf{X}(n,2k)=i)$ for $g=0,\dots ,{N}_{out}^{(0)}-1$;
- (b) info node $(2k+1,0)$ passes the probability vector ${\mathbf{p}}_{b}={\mathbf{P}}_{{X}_{out}(2k+1,0)|{X}_{in}(2k+1,0)}(j,:)$ (jth row) to the combiner; ${\mathbf{p}}_{b}$ stores the conditional probabilities $P({X}_{out}(2k+1,0)=h|\mathbf{X}(n,2k+1)=j)$ for $h=0,\dots ,{N}_{out}^{(0)}-1$;

- the combiner generates vector$${\mathbf{p}}_{c}={\mathbf{p}}_{a}\otimes {\mathbf{p}}_{b},$$
- info node $(k,1)$ generates the probability vector$${\mathbf{p}}_{c}{\mathbf{P}}_{{X}_{out}(k,1)|{X}_{in}(k,1)},$$
- in the following layer, each combiner performs the Kronecker product of its two input vectors and each info node performs the product between the input vector and its conditional probability matrix ${\mathbf{P}}_{{X}_{out}|{X}_{in}}$;
- the root information node at Layer 3, having the input vector $\mathbf{p}$, outputs$${\mathbf{p}}_{out}(n)=\mathbf{p}{\mathbf{P}}_{{X}_{out}(0,3)|{X}_{in}(0,3)}{\mathbf{P}}_{Y|{X}_{out}(0,3)}.$$According to the MAP criterion, the estimated class of the input point $\mathbf{X}(n,:)$ is$$\widehat{Y}(n)=\mathrm{arg}\mathrm{max}{\mathbf{p}}_{out}(n).$$
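The steps above can be condensed into a single forward pass per input point. The function below is a sketch under the assumption that the trained matrices ${\mathbf{P}}_{{X}_{out}|{X}_{in}}$ are stored per layer in left-to-right order; the data layout and names are illustrative:

```python
import numpy as np

def din_forward(x_row, layer_mats, P_y_given_root):
    """Running phase for one input point x_row (the D quantized feature
    values), following the steps above: layer-0 rows selected by the
    input values, Kronecker products at the combiners, vector-matrix
    products at the info nodes, and a MAP decision at the root.
    layer_mats[l] lists the P_{X_out|X_in} matrices of the info nodes at
    layer l, left to right (hypothetical storage layout)."""
    # Layer 0: each info node outputs the row of its matrix indexed by x.
    vecs = [layer_mats[0][k][x_row[k], :] for k in range(len(x_row))]
    # Upper layers: combiner (Kronecker product) then info node (product
    # with the conditional probability matrix).
    for l in range(1, len(layer_mats)):
        vecs = [np.kron(vecs[2 * k], vecs[2 * k + 1]) @ layer_mats[l][k]
                for k in range(len(layer_mats[l]))]
    p_out = vecs[0] @ P_y_given_root          # a posteriori pmf of Y
    return int(np.argmax(p_out)), p_out       # MAP class estimate
```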

#### 3.1. The DIN Ensemble

## 4. The Probabilistic Point of View

#### 4.1. Assumption of Conditionally Independent Features

#### 4.2. The Overall Probability Matrix

## 5. Experiments

#### 5.1. UCI Congressional Voting Records Dataset

#### 5.2. UCI Kidney Disease Dataset

#### 5.3. UCI Mushroom Dataset

#### 5.4. Misclassification Probability Analysis

The competing classifiers were trained with the MATLAB® Classification Learner. All datasets were randomly split 100 times into training and testing subsets, thus generating 100 different experiments. The proposed method shows competitive results in the considered cases, as can be observed in Table 1. It is interesting to compare the performance of the proposed algorithm with that of the Naive Bayes classifier, i.e., Equation (34), and of the Bagged Tree algorithm, which is conceptually the closest to the one we propose. In general, the two variants of the DINs perform similarly to the Bagged Trees, while outperforming Naive Bayes. For Bagged Trees and KNN-Ensemble, the same number of learners as in the DIN ensembles was used.

#### 5.5. The Impact of Number of Iterations of Blahut–Arimoto on The Performance

#### 5.6. The Role of $\beta $: Underfitting, Optimality, and Overfitting

#### 5.7. A Synthetic Multiclass Experiment

## 6. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Quantization

- Age (Years) $\{<10,<18,<45,<70,<120\}$
- Blood (mm/Hg) $\{<80,<84,<89,<99,<109,\ge 110\}$
- Blood Glucose Random (mg/dl) $\{<79,<160,<200,\ge 200\}$
- Blood Urea (mg/dl) $\{<6,<20,\ge 20\}$
- Serum Creatinine (mg/dl) $\{<0.5,<1.2,<2,\ge 2\}$
- Sodium (mEq/l) $\{<136,<145,\ge 145\}$
- Potassium (mEq/l) $\{<3.5,<5,\ge 5\}$
- Haemoglobin (gm) $\{<12,<17,\ge 17\}$
- Packed Cell Volume $\{<27,<52,\ge 52\}$
- White Blood Cell Count (cells/mm${}^{3}$) $\{<3500,<10500,\ge 10500\}$
- Red Blood Cell (millions/mm${}^{3}$) $\{<2.5,<6,\ge 6\}$
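These right-open intervals map directly onto simple binning; below is a sketch using NumPy's `digitize` for two of the fields (the variable names are illustrative):

```python
import numpy as np

# Upper bin edges taken from the list above (right-open intervals; the
# last listed interval, e.g. ">= 2" for serum creatinine, is open-ended).
age_edges = [10, 18, 45, 70, 120]
creatinine_edges = [0.5, 1.2, 2]

def quantize(values, edges):
    """Map continuous values to integer bin indices: index k means
    value < edges[k]; index len(edges) means value >= edges[-1]."""
    return np.digitize(values, edges, right=False)

ages = np.array([7, 30, 80])
# quantize(ages, age_edges) -> bins 0 (<10), 2 (<45), 4 (<120)
```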

## References

- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2001.
- Murphy, K. Machine Learning: A Probabilistic Perspective; The MIT Press: Cambridge, MA, USA, 2012.
- Bergman, M.K. A Knowledge Representation Practionary; Springer: Basel, Switzerland, 2018.
- Rokach, L.; Maimon, O.Z. Data Mining with Decision Trees: Theory and Applications; World Scientific: Singapore, 2008; Volume 69.
- Quinlan, J.R. Induction of decision trees. Mach. Learn. **1986**, 1, 81–106.
- Quinlan, J. C4.5: Programs for Machine Learning; Morgan Kaufmann: Burlington, MA, USA, 1993.
- Quinlan, J. Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. **1996**, 4, 77–90.
- Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Elsevier: Burlington, MA, USA, 2014.
- Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press: Cambridge, UK, 2012.
- Jensen, F.V. Introduction to Bayesian Networks; UCL Press: London, UK, 1996; Volume 210.
- Norouzi, M.; Collins, M.; Johnson, M.A.; Fleet, D.J.; Kohli, P. Efficient Non-greedy Optimization of Decision Trees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1729–1737.
- Breiman, L. Bagging Predictors. Mach. Learn. **1996**, 24, 123–140.
- Breiman, L. Random Forests. Mach. Learn. **2001**, 45, 5–32.
- Tishby, N.; Zaslavsky, N. Deep Learning and the Information Bottleneck Principle. arXiv **2015**, arXiv:1503.02406v1.
- Tishby, N.; Pereira, F.; Bialek, W. The Information Bottleneck Method. arXiv **2000**, arXiv:physics/0004057v1.
- Franzese, G.; Visintin, M. Deep Information Networks. arXiv **2018**, arXiv:1803.02251v1.
- Slonim, N.; Tishby, N. Agglomerative Information Bottleneck. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; pp. 617–623.
- Still, S. Information bottleneck approach to predictive inference. Entropy **2014**, 16, 968–989.
- Still, S. Thermodynamic cost and benefit of data representations. arXiv **2017**, arXiv:1705.00612.
- Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. **2005**, 6, 165–188.
- Gedeon, T.; Parker, A.E.; Dimitrov, A.G. The mathematical structure of information bottleneck methods. Entropy **2012**, 14, 456–479.
- Freund, Y.; Schapire, R. A short introduction to boosting. Jpn. Soc. Artif. Intell. **1999**, 14, 1612.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory **1972**, 18, 14–20.
- Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory **1972**, 18, 460–473.
- Hand, D.J.; Yu, K. Idiot’s Bayes—Not so stupid after all? Int. Stat. Rev. **2001**, 69, 385–398.
- UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences. Available online: http://archive.ics.uci.edu/ml (accessed on 30 September 2010).
- Salekin, A.; Stankovic, J. Detection of chronic kidney disease and selecting important predictive attributes. In Proceedings of the IEEE International Conference on Healthcare Informatics (ICHI), Chicago, IL, USA, 4–7 October 2016; pp. 262–270.
- Duch, W.; Adamczak, R.; Grąbczewski, K. Extraction of logical rules from neural networks. Neural Process. Lett. **1998**, 7, 211–219.

**Figure 1.**Schematic representation of an input node: the inputs are two vectors and the outputs are matrices that statistically describe the random variables ${X}_{in}$ and Y.

**Figure 3.**Sub-network: ${X}_{in,a}$, ${X}_{out,a}$, ${X}_{in,b}$, ${X}_{out,b}$, ${X}_{in,c}$, and ${X}_{out,c}$ are all random variables; ${N}_{0}$ is the number of values taken by ${X}_{in,a}$ and ${X}_{in,b}$; ${N}_{1}$ is the number of values taken by ${X}_{out,a}$ and ${X}_{out,b}$; and ${N}_{2}$ is the number of values taken by ${X}_{out,c}$.

**Figure 4.**Example of a DIN for $D=8$: the input nodes are represented as rectangles, the info nodes as circles, and the combiners as triangles. The numbers inside each circle identify the node (position inside the layer and layer number), ${N}_{in}^{(k)}$ is the number of values taken by the input of the info node at layer k, and ${N}_{out}^{(k)}$ is the number of values taken by the output of the info node at layer k. In this example, the info nodes at a given layer all have the same input and output cardinalities.

**Figure 5.**Misclassification probability versus number of iterations (average over 10 different trials) for the considered UCI datasets.

**Figure 6.**Misclassification probability versus $\beta $ (average over 20 different trials) for the considered UCI datasets.

**Table 1.**Mean misclassification probability (over 100 random experiments) for the three datasets with the considered classifiers.

Classifier | Congressional Voting Records | Kidney Disease | Mushroom |
---|---|---|---|
Naive Bayes | 0.10894 | 0.051 | 0.20641 |
Decision Tree | 0.050691 | 0.062314 | 0.05505 |
Bagged Trees | 0.043641 | 0.0268 | 0.038305 |
DIN Prob | 0.050138 | 0.037229 | 0.020796 |
DIN Gen | 0.049447 | 0.026286 | 0.022182 |
Linear Discriminant Classifier | 0.059724 | 0.091029 | 0.069923 |
Logistic Regression | 0.075161 | 0.096429 | 0.07074 |
Linear SVM | 0.063226 | 0.049914 | 0.04513 |
KNN | 0.08682 | 0.11369 | 0.037018 |
KNN-Ensemble | 0.062811 | 0.036057 | 0.043967 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Franzese, G.; Visintin, M.
Probabilistic Ensemble of Deep Information Networks. *Entropy* **2020**, *22*, 100.
https://doi.org/10.3390/e22010100
