Article

Probabilistic Ensemble of Deep Information Networks

Department of Electronics and Telecommunications, Politecnico di Torino, 10100 Torino, Italy
* Author to whom correspondence should be addressed.
Entropy 2020, 22(1), 100; https://doi.org/10.3390/e22010100
Submission received: 22 November 2019 / Revised: 10 January 2020 / Accepted: 13 January 2020 / Published: 14 January 2020
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)

Abstract

We describe a classifier made of an ensemble of decision trees, designed using information theory concepts. In contrast to algorithms such as C4.5 and ID3, the tree is built from the leaves instead of the root. Each tree is made of nodes that are trained independently of one another to minimize a local cost function (the information bottleneck). The trained tree outputs the estimated probabilities of the classes given the input datum, and the outputs of many trees are combined to decide the class. We show that the system provides accuracy comparable to that of standard tree classifiers, while offering many advantages in terms of modularity, reduced complexity, and memory requirements.

1. Introduction

Supervised classification is at the core of many modern applications of machine learning. The history of classifiers is rich and many variants have been proposed, such as decision trees, logistic regression, Bayesian networks, and neural networks (for an overview of general methods, see [1,2,3]). Despite the power of modern deep learning, for many problems involving categorical structured datasets, decision trees [4,5,6,7] or Bayesian networks [8,9,10] usually outperform neural network based approaches.
Decision trees are particularly interesting because they can be easily interpreted. Various types of tree classifiers can be distinguished according to the metric used for the iterative construction and selection of features [4]: popular tree classifiers, such as ID3 and C4.5 [6,7], are based on information-theoretic metrics. However, it is known that the greedy splitting procedure at each node can be sub-optimal [11], and that decision trees are prone to overfitting when dealing with small datasets. When a classifier is not strong enough, there are, roughly speaking, two possibilities: choosing a more sophisticated classifier or ensembling multiple “weak” classifiers [12,13]. This second approach is usually called the ensemble method. The tradeoff is that, by using multiple classifiers simultaneously, we improve classification performance at the cost of interpretability.
The so-called “information bottleneck”, described by Tishby and Zaslavsky [14] and Tishby et al. [15], was used in [16] to build a classifier (Deep Information Network, DIN) with a tree topology that compresses the input data and generates the estimated class. DINs [16] are based on the so-called information node that, using the input samples of a feature $X_{in}$, generates samples of a new feature $X_{out}$, according to the conditional probabilities $P(X_{out}=j|X_{in}=i)$ obtained by minimizing the mutual information $I(X_{in};X_{out})$ under the constraint of a given mutual information $I(X_{out};Y)$ between $X_{out}$ and the target/class $Y$ (information bottleneck [14]). The outputs of two or more nodes are combined, without information loss, to generate samples of a new feature passed to a subsequent information node. The final node (root) directly outputs the class of each input datum. The tree structure of the network is thus built from the leaves, whereas C4.5 and ID3 build it from the root.
We here propose an improved implementation of the DIN scheme in [16] that only requires the propagation through the tree of small matrices containing conditional probabilities. Notice that the previous version of the DIN was stochastic, while the one we propose here is deterministic. Moreover, we use an ensemble (e.g., [12,13]) of trees with randomly permuted features and weigh their outputs to improve classification accuracy.
The proposed architecture has several advantages in terms of:
  • extreme flexibility and high modularity: all the nodes are functionally equivalent and with a reduced number of inputs and outputs, which gives good opportunities for a possible hardware implementation;
  • high parallelizability: each tree can be trained in parallel with the others;
  • memory usage: we need to feed the network with data only at the first layer and simple incremental counters can be used to estimate the initial probability mass distribution; and
  • training time and training complexity: the locality of the computed cost function allows a nodewise training that does not require any kind of information from other points of the tree apart from its feeding nodes (that are usually a very small number, e.g., 2–3).
With respect to the DINs in [16], the main difference is that samples of the random variables in the inner layers of the tree are never generated, which is an advantage in the case of large datasets. However, an assumption of statistical independence (see Section 2.3) is necessary to build the probability matrices, and this might be seen as a limitation of the newly proposed method. Nevertheless, experimental results (see Section 5) show that this approximation does not compromise the performance.
We underline similarities and differences of the proposed classifier with respect to the methods described in [6,7], since they are among the best performing ones. When using decision trees, as well as DINs, categorical and missing data are easily managed, but continuous random variables are not: quantization of these input features is necessary in a pre-processing phase, and it can be performed as in C4.5 [6], using other heuristics, or manually. Concerning the differences, the first one is that a hierarchical decision tree is normally built starting from the root and splitting at each node, whereas we here propose a way to build a tree starting from the leaves. The topology of our network implies that, once the initial ordering of the features has been set, there is no need, after each node is trained, to search for the best possible next node. The second important difference is that we do not directly use mutual information as the metric for building the tree, but base our algorithm on the Information Bottleneck principle [14,15,17,18,19,20,21]. This allows us to extract all the relevant information (the sufficient statistic) while removing the redundant one, which helps avoid overfitting. As in [12,13], we use an ensemble method. We choose the simplest possible form of ensemble combination: we train independently many structurally equivalent networks, using the same single dataset but permuting the order of the features, and produce a weighted average of the outputs based on a simple rule described in Section 3.1. Notice that we use a one-shot procedure, i.e., we do not iterate more than once over the entire dataset, nor do we exploit boosting-like techniques as in [22,23]. We leave the study of more sophisticated techniques to future works.
Section 2 and Section 3 describe more precisely the structure of the DIN and how it works, Section 4 gives some insight into its theoretical properties, and Section 5 comments on the results obtained with standard datasets. Conclusions are finally drawn in Section 6.

2. The DIN Architecture and Its Training

The information network is made of input nodes (Section 2.1), information nodes (Section 2.2), and combiners joined together through a tree network described in Section 2.3. Moreover, an ensemble of $N_{mach}$ trees is built, based on which the final estimated class is produced (Section 3.1). In [16], the input nodes are not present, the information node has a slightly different role, the combiners are much simpler than those described here, and just one tree was considered. As already stated, the new version of the DIN is more efficient when a large dataset with relatively few features is analyzed.
In the following, it is assumed that all the features take a finite number of discrete values; a case of continuous random variables is discussed in Section 5.2.
It is also assumed that $N_{train}$ points are used in the training phase, $N_{test}$ points in the testing phase, and that $D$ features are present. The $n$th training point corresponds to one of $N_{class}$ possible classes.

2.1. The Input Node

Each input node (see Figure 1) has two input vectors:
  • $\mathbf{x}_{in}$ of size $N_{train}$, whose elements take values in a set of cardinality $N_{in}$; $\mathbf{x}_{in}$ corresponds to one of the $D$ features of the dataset (typically one column);
  • $\mathbf{y}$ of size $N_{train}$, whose elements take values in a set of cardinality $N_{class}$; $\mathbf{y}$ corresponds to the known classes of the $N_{train}$ points.
The notation we use in the equations below is the following: $Y$ and $X_{in}$ represent random variables; $y(n)$ and $x_{in}(n)$ are the $n$th elements of vectors $\mathbf{y}$ and $\mathbf{x}_{in}$, respectively; and $\mathbf{1}(c)$ is equal to 1 if $c$ is true, and is otherwise equal to 0. Using Laplace smoothing [2], the input node estimates the following probabilities (the probability mass function of $Y$ in Equation (1) is common to all the input nodes: it can be evaluated only by the first one and passed to the others):
$$\hat{P}(Y=m) = \frac{1 + \sum_{n=0}^{N_{train}-1} \mathbf{1}(y(n)=m)}{N_{train} + N_{class}}, \quad m = 0,\ldots,N_{class}-1 \qquad (1)$$
$$\hat{P}(X_{in}=i) = \frac{1 + \sum_{n=0}^{N_{train}-1} \mathbf{1}(x_{in}(n)=i)}{N_{train} + N_{in}}, \quad i = 0,\ldots,N_{in}-1 \qquad (2)$$
$$\hat{P}(Y=m, X_{in}=i) = \frac{1 + \sum_{n=0}^{N_{train}-1} \mathbf{1}(y(n)=m)\,\mathbf{1}(x_{in}(n)=i)}{N_{train} + N_{class} N_{in}} \qquad (3)$$
From basic application of probability rules, $\hat{P}(Y=m|X_{in}=i)$ and $\hat{P}(X_{in}=i|Y=m)$ are then computed. From now on, for simplicity, we denote all the estimated probabilities $\hat{P}$ simply as $P$.
All the above probabilities can be organized in matrices defined as follows:
$$\mathbf{P}_{Y} \in \mathbb{R}^{1 \times N_{class}}, \qquad \mathbf{P}_{Y}(m) = P(Y=m) \qquad (4)$$
$$\mathbf{P}_{X_{in}} \in \mathbb{R}^{1 \times N_{in}}, \qquad \mathbf{P}_{X_{in}}(i) = P(X_{in}=i) \qquad (5)$$
$$\mathbf{P}_{X_{in}|Y} \in \mathbb{R}^{N_{class} \times N_{in}}, \qquad \mathbf{P}_{X_{in}|Y}(m,i) = P(X_{in}=i|Y=m) \qquad (6)$$
$$\mathbf{P}_{Y|X_{in}} \in \mathbb{R}^{N_{in} \times N_{class}}, \qquad \mathbf{P}_{Y|X_{in}}(i,m) = P(Y=m|X_{in}=i) \qquad (7)$$
Note that vectors $\mathbf{x}_{in}$ and $\mathbf{y}$ are not needed by the subsequent elements in the tree; only the input nodes have access to them.
Notice also that the following equalities hold:
$$\mathbf{P}_{X_{in}} = \mathbf{P}_{Y}\, \mathbf{P}_{X_{in}|Y} \qquad (8)$$
$$\mathbf{P}_{Y} = \mathbf{P}_{X_{in}}\, \mathbf{P}_{Y|X_{in}} \qquad (9)$$
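As an illustration, a minimal NumPy sketch of the input-node estimates of Equations (1)–(7) could look as follows (the function and variable names are ours, not part of the paper; integer-coded features and labels are assumed, and the conditionals are obtained from the smoothed joint and marginals, which is one possible reading of the text above):

```python
import numpy as np

def input_node(x_in, y, N_in, N_class):
    """Estimate the smoothed probability matrices of Equations (1)-(7)."""
    N_train = len(y)
    # Laplace-smoothed joint counts of (Y, X_in), Eq. (3)
    joint = np.ones((N_class, N_in))
    for xi, yi in zip(x_in, y):
        joint[yi, xi] += 1
    P_joint = joint / (N_train + N_class * N_in)
    P_Y = (1 + np.bincount(y, minlength=N_class)) / (N_train + N_class)   # Eq. (1)
    P_Xin = (1 + np.bincount(x_in, minlength=N_in)) / (N_train + N_in)    # Eq. (2)
    # Conditional matrices, Eqs. (6)-(7): normalize the smoothed joint
    P_Xin_given_Y = P_joint / P_Y[:, None]        # shape N_class x N_in
    P_Y_given_Xin = (P_joint / P_Xin[None, :]).T  # shape N_in x N_class
    return P_Y, P_Xin, P_Xin_given_Y, P_Y_given_Xin
```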

2.2. The Information Node

The information node is schematically shown in Figure 2: the input discrete random variable $X_{in}$ is stochastically mapped into another discrete random variable $X_{out}$ (see [16] for further details) through probability matrices:
  • The input probability matrices $\mathbf{P}_{X_{in}}$, $\mathbf{P}_{X_{in}|Y}$, $\mathbf{P}_{Y|X_{in}}$, $\mathbf{P}_{Y}$ describe the input random variable $X_{in}$, with $N_{in}$ possible values, and its relationship with the class $Y$.
  • The output matrices $\mathbf{P}_{X_{out}}$, $\mathbf{P}_{X_{out}|Y}$, $\mathbf{P}_{Y|X_{out}}$, $\mathbf{P}_{Y}$ describe the output random variable $X_{out}$, with $N_{out}$ possible values, and its relationship with $Y$.
Compression (source encoding) is obtained by setting $N_{out} < N_{in}$.
In the training phase, the information node generates the conditional probability mass function that satisfies the following equation (see [14]):
$$P(X_{out}=j|X_{in}=i) = \frac{1}{Z(i;\beta)}\, P(X_{out}=j)\, e^{-\beta d(i,j)}, \quad i = 0,\ldots,N_{in}-1,\; j = 0,\ldots,N_{out}-1 \qquad (10)$$
where
  • $P(X_{out}=j)$ is the probability mass function of the output random variable $X_{out}$:
    $$P(X_{out}=j) = \sum_{i=0}^{N_{in}-1} P(X_{in}=i)\, P(X_{out}=j|X_{in}=i), \quad j = 0,\ldots,N_{out}-1 \qquad (11)$$
  • $d(i,j)$ is the Kullback–Leibler divergence
    $$d(i,j) = \sum_{m=0}^{N_{class}-1} P(Y=m|X_{in}=i)\, \log_2 \frac{P(Y=m|X_{in}=i)}{P(Y=m|X_{out}=j)} = \mathrm{KL}\big(P(Y|X_{in}=i)\,\|\,P(Y|X_{out}=j)\big) \qquad (12)$$
    where
    $$P(Y=m|X_{out}=j) = \sum_{i=0}^{N_{in}-1} P(Y=m|X_{in}=i)\, P(X_{in}=i|X_{out}=j), \quad m = 0,\ldots,N_{class}-1,\; j = 0,\ldots,N_{out}-1 \qquad (13)$$
  • $\beta$ is a real positive parameter.
  • $Z(i;\beta)$ is a normalizing coefficient that ensures
    $$\sum_{j=0}^{N_{out}-1} P(X_{out}=j|X_{in}=i) = 1. \qquad (14)$$
The probabilities $P(X_{out}=j|X_{in}=i)$ can be iteratively found using the Blahut–Arimoto algorithm [14,24,25].
Equation (10) solves the information bottleneck: it minimizes the mutual information $I(X_{in};X_{out})$ under the constraint of a given mutual information $I(Y;X_{out})$. In particular, Equation (10) is the solution of the minimization of the Lagrangian
$$\mathcal{L} = I(X_{in};X_{out}) - \beta\, I(Y;X_{out}). \qquad (15)$$
If the Lagrangian multiplier $\beta$ is increased, then the constraint is privileged and the information node tends to maximize the mutual information between its output $X_{out}$ and the class $Y$; if $\beta$ is reduced, then the minimization of $I(X_{in};X_{out})$ dominates (compression). The information node must therefore balance compression from $X_{in}$ to $X_{out}$ against the propagation of the information about $Y$. In our implementation, compression is also imposed by the fact that the cardinality $N_{out}$ of the output alphabet is smaller than the cardinality $N_{in}$ of the input alphabet.
The role of the information node is thus that of finding the conditional probability matrices
$$\mathbf{P}_{X_{out}|X_{in}} \in \mathbb{R}^{N_{in} \times N_{out}}, \qquad \mathbf{P}_{X_{out}|X_{in}}(i,j) = P(X_{out}=j|X_{in}=i) \qquad (16)$$
$$\mathbf{P}_{Y|X_{out}} \in \mathbb{R}^{N_{out} \times N_{class}}, \qquad \mathbf{P}_{Y|X_{out}}(j,m) = P(Y=m|X_{out}=j) \qquad (17)$$
$$\mathbf{P}_{X_{out}} \in \mathbb{R}^{1 \times N_{out}}, \qquad \mathbf{P}_{X_{out}}(j) = P(X_{out}=j) \qquad (18)$$
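The self-consistent Equations (10), (11) and (13) can be iterated directly; below is a small Blahut–Arimoto-style sketch written by us under the matrix conventions of Section 2.1 (it is not the authors' code, and the random initialization and numerical safeguards are our own choices):

```python
import numpy as np

def train_info_node(P_Xin, P_Y_given_Xin, N_out, beta, n_iter=6, rng=None):
    """Iterate Equations (10)-(13) to find P_{Xout|Xin}, P_{Y|Xout}, P_{Xout}."""
    rng = np.random.default_rng() if rng is None else rng
    p_x = np.asarray(P_Xin).reshape(-1)                  # P(X_in = i), length N_in
    N_in = p_x.size
    # Random row-stochastic initialization of P(X_out | X_in)
    P_out_in = rng.random((N_in, N_out))
    P_out_in /= P_out_in.sum(axis=1, keepdims=True)
    eps = 1e-12
    for _ in range(n_iter):
        P_Xout = p_x @ P_out_in                                          # Eq. (11)
        P_in_out = P_out_in * p_x[:, None] / (P_Xout[None, :] + eps)     # Bayes: P(X_in | X_out)
        P_Y_given_Xout = P_in_out.T @ P_Y_given_Xin                      # Eq. (13), N_out x N_class
        # KL divergences d(i, j) in bits, Eq. (12)
        log_ratio = np.log2(P_Y_given_Xin[:, None, :] + eps) \
                  - np.log2(P_Y_given_Xout[None, :, :] + eps)
        d = np.sum(P_Y_given_Xin[:, None, :] * log_ratio, axis=2)        # N_in x N_out
        # Update rule, Eq. (10); row normalization plays the role of Z(i; beta)
        P_out_in = P_Xout[None, :] * np.exp(-beta * d)
        P_out_in /= P_out_in.sum(axis=1, keepdims=True)
    return P_out_in, P_Y_given_Xout, P_Xout
```

With 5–6 iterations (the value found sufficient in Section 5.5), the whole training of an information node reduces to a handful of small matrix products.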

2.3. The Combiner

Consider the case depicted in Figure 3, where the two information nodes $a$ and $b$ feed a combiner (shown as a triangle) that generates the input of the information node $c$. The random variables $X_{out,a}$ and $X_{out,b}$, both having alphabets of cardinality $N_1$, are combined together as
$$X_{in,c} = X_{out,a} + N_1 X_{out,b} \qquad (19)$$
which has an alphabet of cardinality $N_1 \times N_1$.
The combiner actually does not generate $X_{in,c}$; it simply evaluates the probability matrices that describe $X_{in,c}$ and $Y$. In particular, the information node $c$ needs $\mathbf{P}_{X_{in,c}|Y}$, which can be evaluated by assuming that $X_{out,a}$ and $X_{out,b}$ are conditionally independent given $Y$ (notice that in the implementation of [16] this assumption was not necessary):
$$P(X_{in,c}=k|Y=m) = P(X_{out,a}=k_a, X_{out,b}=k_b|Y=m) = P(X_{out,a}=k_a|Y=m)\, P(X_{out,b}=k_b|Y=m) \qquad (20)$$
where $k = k_a + N_1 k_b$. In particular, the $m$th row of $\mathbf{P}_{X_{in,c}|Y}$ is the Kronecker product of the $m$th rows of $\mathbf{P}_{X_{out,a}|Y}$ and $\mathbf{P}_{X_{out,b}|Y}$:
$$\mathbf{P}_{X_{in,c}|Y}(m,:) = \mathbf{P}_{X_{out,a}|Y}(m,:) \otimes \mathbf{P}_{X_{out,b}|Y}(m,:), \quad m = 0,\ldots,N_{class}-1 \qquad (21)$$
(here $\mathbf{A}(m,:)$ identifies the $m$th row of matrix $\mathbf{A}$). The probability vector $\mathbf{P}_{X_{in,c}}$ can be evaluated considering that
$$P(X_{in,c}=k) = \sum_{m=0}^{N_{class}-1} P(X_{in,c}=k, Y=m) = \sum_{m=0}^{N_{class}-1} P(X_{in,c}=k|Y=m)\, P(Y=m) \qquad (22)$$
so that
$$\mathbf{P}_{X_{in,c}} = \mathbf{P}_{Y}\, \mathbf{P}_{X_{in,c}|Y} \qquad (23)$$
At this point, matrix $\mathbf{P}_{Y|X_{in,c}}$ can be evaluated element by element, since
$$P(Y=m|X_{in,c}=k) = \frac{P(X_{in,c}=k|Y=m)\, P(Y=m)}{P(X_{in,c}=k)}, \quad m = 0,\ldots,N_{class}-1,\; k = 0,\ldots,N_1 \times N_1 - 1 \qquad (24)$$
It is straightforward to extend these equations to the case in which $X_{in,a}$ and $X_{in,b}$ have different cardinalities.
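In code, the combiner of Equations (20)–(24) amounts to a few row-wise Kronecker products; a possible sketch, with our own naming and the matrix shapes defined above:

```python
import numpy as np

def combiner(P_Xa_given_Y, P_Xb_given_Y, P_Y):
    """Combine two info-node outputs assuming conditional independence given Y."""
    P_Y = np.asarray(P_Y).reshape(-1)
    # Eq. (21): the m-th row is the Kronecker product of the m-th rows
    P_Xc_given_Y = np.stack([np.kron(P_Xa_given_Y[m], P_Xb_given_Y[m])
                             for m in range(P_Y.size)])
    P_Xc = P_Y @ P_Xc_given_Y                    # Eq. (23), length N1*N1
    # Eq. (24): Bayes rule element by element; result has shape (N1*N1) x N_class
    P_Y_given_Xc = (P_Xc_given_Y * P_Y[:, None] / P_Xc[None, :]).T
    return P_Xc_given_Y, P_Xc, P_Y_given_Xc
```

The index ordering of the combined alphabet simply follows that of np.kron; what matters for the rest of the tree is only that it is used consistently.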

2.4. The Tree Architecture

Figure 4 shows an example of a DIN, where we assume that the dataset has $D = 8$ features and that training is thus performed using a matrix $\mathbf{X}_{train}$ with $N_{train}$ rows and $D = 8$ columns, with a corresponding class vector $\mathbf{y}$. The $k$th column $\mathbf{x}(k)$ of matrix $\mathbf{X}_{train}$ feeds, together with vector $\mathbf{y}$, the input node $I(k)$, $k = 0,\ldots,D-1$.
Information node $(k,0)$ at layer 0 processes the probability matrices generated by the input node $I(k)$, with $N_{in}(0)$ possible values of $X_{in}(k,0)$, and evaluates the conditional probability matrices with $N_{out}(0)$ possible values of $X_{out}(k,0)$, using the algorithm described in Section 2.2. The outputs of info nodes $(2k,0)$ and $(2k+1,0)$ are given to a combiner that outputs the probability matrices for $X_{in}(k,1)$, whose alphabet has cardinality $N_{in}(1) = N_{out}(0) \times N_{out}(0)$, using the equations described in Section 2.3. The sequence of combiners and information nodes is iterated, decreasing the number of information nodes from layer to layer, until the final root node is reached. In the previous implementation of the DINs in [16], the root information node outputs the estimated class of the input, and it is therefore necessary that the output cardinality of the root info node equals $N_{class}$. In the current implementation, this cardinality can be larger than $N_{class}$, since classification is based on the output probability matrix $\mathbf{P}_{Y|X_{out}}$.
For a number of features $D = 2^d$, the number of layers is $d$. If $D$ is not a power of 2, then it is possible to use combiners with 3 or more inputs (the changes in the equations of Section 2.3 are straightforward, since a combiner with three inputs can be seen as two cascaded combiners with two inputs each).
The overall binary topology proposed in Figure 4 requires a number of information nodes equal to
$$N_{nodes} = D + \frac{D}{2} + \frac{D}{4} + \cdots + 2 + 1 = 2D - 1 \qquad (25)$$
and a number of combiners equal to
$$N_{comb} = \frac{D}{2} + \frac{D}{4} + \cdots + 2 + 1 = D - 1 \qquad (26)$$
All the info nodes run exactly the same algorithm and all the combiners are equal, apart from the input/output alphabet cardinalities. If the cardinalities of the alphabets are all equal, i.e., $N_{in}(i)$ and $N_{out}(i)$ do not depend on the layer $i$, then all the nodes and all the combiners are exactly equal, which might help in a possible hardware implementation; in this case, the number of parameters of the network is $(N_{out}-1) \times N_{in} \times N_{nodes}$.
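As a quick check of these counts (a toy calculation, with example cardinalities chosen by us):

```python
D, N_in, N_out = 8, 4, 2                 # example values, not from the paper
N_nodes = 2 * D - 1                      # Eq. (25)
N_comb = D - 1                           # Eq. (26)
n_params = (N_out - 1) * N_in * N_nodes  # free parameters of the whole tree
print(N_nodes, N_comb, n_params)         # -> 15 7 60
```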
Actually, the network performance depends on how the features are coupled in subsequent layers, and a random shuffling of the columns of matrix $\mathbf{X}_{train}$ provides results that might be significantly different. This property is used in Section 3.1 for building the ensemble of networks.

2.5. A Note on Computational Complexity and Memory Requirements

The modular structure of the proposed method has several advantages in terms of both memory footprint and computational cost. The topology considered in this explanation is binary, similar to the one depicted in Figure 4. For simplicity, we furthermore assume that the cardinalities of the $D$ input features are all equal to $N_{in}$ and that the input/output cardinalities of the information nodes of the subsequent layers are also fixed constants $N_{in}$ and $N_{out} = \sqrt{N_{in}}$, respectively. As we show in the experiments (Section 5), small values of $N_{in}$ and $N_{out}$ such as 2, 3, or 4 are sufficient in the considered cases. Straightforward generalizations are possible when considering inhomogeneous cases.
At the first layer (the input node layer), each of the $D$ input nodes stores the joint probabilities of the target variable $Y$ and its input feature. Each node thus includes a simple counter that fills a probability matrix of dimension $N_{in} \times N_{class}$. Both the computational cost and the memory requirements of this first stage are the same as those of the Naive Bayes algorithm. Notice that, from the memory requirements point of view, it is not necessary to store all the training data, but just counters with the number of joint occurrences of features/classes. If new data are observed after training, it is in fact sufficient to update the counters and properly renormalize the values to obtain the updated probability matrices. In this paper, we do not cover online learning, nor possible strategies to reduce the computational complexity in such a scenario.
At the second layer (the first information node layer), each node receives as input the joint probability matrix of its feature and the target variable and runs the Blahut–Arimoto algorithm. The internal memory requirement of each node is the space needed to store two probability matrices, of dimensions $N_{in} \times N_{class}$ and $N_{in} \times N_{out}$, respectively. The cost per iteration of Blahut–Arimoto is dominated by multiplications of matrices of sizes $N_{in} \times N_{out}$ and $N_{in} \times N_{class}$, so the complexity obviously scales with the number of classes of the considered classification problem. To the best of our knowledge, the convergence rate of the Blahut–Arimoto algorithm applied to information bottleneck problems is unknown. In this study, however, we found empirically that, for the considered datasets, 5–6 iterations per node are sufficient, as discussed in Section 5.5.
Each combiner processes the matrices generated by two information nodes: the memory requirement is zero and the computational cost is roughly that of $N_{class}$ Kronecker products between rows of probability matrices. Since, for ease of explanation, we chose $N_{out} = \sqrt{N_{in}}$, the output probability matrix again has dimensions $N_{in} \times N_{class}$.
The overall memory requirement and computational complexity (for a single DIN) thus scale as $D$ times the requirements of an input node, plus $2D-1$ times the requirements of an information node, plus $D-1$ times the requirements of a combiner. To complete the discussion, we have to remember that a further multiplication factor of $N_{mach}$ is required to take into account that we are considering an ensemble of networks (actually, at the first layer, the set of input nodes can be shared by the different architectures, since only the relative position of the input nodes changes; see Section 3.1).

3. The Running Phase

During the running phase, the columns of a matrix $\mathbf{X}$ with $N$ rows and $D$ columns are used as inputs. Assume again that the network architecture is the one depicted in Figure 4 with $D = 8$, and consider the $n$th input row $\mathbf{X}(n,:)$.
In particular, assume that $\mathbf{X}(n,2k) = i$ and $\mathbf{X}(n,2k+1) = j$. Then:
  • (a) input node $I(2k)$ passes value $i$ to info node $(2k,0)$; (b) input node $I(2k+1)$ passes value $j$ to info node $(2k+1,0)$;
  • (a) info node $(2k,0)$ passes the probability vector $\mathbf{p}_a = \mathbf{P}_{X_{out}(2k,0)|X_{in}(2k,0)}(i,:)$ ($i$th row) to the combiner; $\mathbf{p}_a$ stores the conditional probabilities $P(X_{out}(2k,0)=g|\mathbf{X}(n,2k)=i)$ for $g = 0,\ldots,N_{out}(0)-1$; (b) info node $(2k+1,0)$ passes the probability vector $\mathbf{p}_b = \mathbf{P}_{X_{out}(2k+1,0)|X_{in}(2k+1,0)}(j,:)$ ($j$th row) to the combiner; $\mathbf{p}_b$ stores the conditional probabilities $P(X_{out}(2k+1,0)=h|\mathbf{X}(n,2k+1)=j)$ for $h = 0,\ldots,N_{out}(0)-1$;
  • the combiner generates the vector
    $$\mathbf{p}_c = \mathbf{p}_a \otimes \mathbf{p}_b, \qquad (27)$$
    which stores the conditional probabilities $P(X_{in}(k,1)=s|\mathbf{X}(n,2k)=i, \mathbf{X}(n,2k+1)=j)$ for $s = 0,\ldots,N_{in}(1)-1$, where $N_{in}(1) = N_{out}(0) \times N_{out}(0)$;
  • info node $(k,1)$ generates the probability vector
    $$\mathbf{p}_c\, \mathbf{P}_{X_{out}(k,1)|X_{in}(k,1)}, \qquad (28)$$
    which stores the conditional probabilities $P(X_{out}(k,1)=r|\mathbf{X}(n,2k)=i, \mathbf{X}(n,2k+1)=j)$ for $r = 0,\ldots,N_{out}(1)-1$;
  • in the following layer, each combiner performs the Kronecker product of its two input vectors and each info node performs the product between its input vector and its conditional probability matrix $\mathbf{P}_{X_{out}|X_{in}}$;
  • the root information node at layer 3, having the input vector $\mathbf{p}$, outputs
    $$\mathbf{p}_{out}(n) = \mathbf{p}\, \mathbf{P}_{X_{out}(0,3)|X_{in}(0,3)}\, \mathbf{P}_{Y|X_{out}(0,3)}, \qquad (29)$$
    which stores the estimated probabilities $P(Y=m|\mathbf{X}(n,:))$ for $m = 0,\ldots,N_{class}-1$. According to the MAP criterion, the estimated class of the input point $\mathbf{X}(n,:)$ would be
    $$\hat{Y}(n) = \arg\max \mathbf{p}_{out}(n), \qquad (30)$$
    but we propose to use an improved method, as described in Section 3.1; a compact code sketch of this forward pass is given right after this list.
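The whole running phase therefore alternates Kronecker products (combiners) and vector–matrix products (info nodes). Below is a compact sketch for one datum and a binary tree; the data layout (a list of per-layer lists of $\mathbf{P}_{X_{out}|X_{in}}$ matrices) is our own choice, not prescribed by the paper:

```python
import numpy as np

def run_tree(x_row, P_out_in_layers, P_Y_given_Xout_root):
    """Propagate one datum through a binary DIN and return the estimated P(Y | x).

    P_out_in_layers[l] is the list of P_{Xout|Xin} matrices of layer l
    (layer 0 has D matrices, the last layer has a single root matrix);
    P_Y_given_Xout_root is the root's P_{Y|Xout} matrix.
    """
    # Layer 0: the k-th input value selects one row of the k-th node's matrix
    probs = [P[x_row[k]] for k, P in enumerate(P_out_in_layers[0])]
    # Deeper layers: Kronecker product of each pair, then product with P_{Xout|Xin}
    for layer in P_out_in_layers[1:]:
        probs = [np.kron(probs[2 * k], probs[2 * k + 1]) @ P
                 for k, P in enumerate(layer)]
    # Eq. (29): multiply the root output by P_{Y|Xout} of the root node
    return probs[0] @ P_Y_given_Xout_root
```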

3.1. The DIN Ensemble

At the end of the training phase, when all the conditional matrices have been generated in each information node and combiner, the network is run with input matrix $\mathbf{X}_{train}$ ($N_{train}$ rows and $D$ columns), and the probability vector $\mathbf{p}_{out}$ is obtained for each input point $\mathbf{X}_{train}(n,:)$. As anticipated at the end of Section 2.4, the DIN classification accuracy depends on how the input features are combined together. By permuting the columns of $\mathbf{X}_{train}$, a different probability vector $\mathbf{p}_{out}$ is typically obtained. We thus propose to generate an ensemble of DINs by randomly permuting the columns of $\mathbf{X}_{train}$, and then to combine their outputs.
Since in the training phase $y(n)$ is known, it is possible to get, for each DIN $v$, the probability vector $\mathbf{p}_{out}^{v}(n)$; ideally, $\mathbf{p}_{out}^{v}(n, y(n))$, the estimated probability corresponding to the true class $y(n)$, should be equal to one. The weights
$$w_v = \frac{\sum_{n=0}^{N_{train}-1} \mathbf{p}_{out}^{v}(n, y(n))}{\sum_{n=0}^{N_{train}-1} \sum_{j=0}^{N_{mach}-1} \mathbf{p}_{out}^{j}(n, y(n))} \qquad (31)$$
thus represent the reliability of the $v$th DIN.
In the running phase, feeding the $N_{mach}$ machines each with the correctly permuted vector $\mathbf{X}(n,:)$, the final estimated probability vector is determined as
$$\hat{\mathbf{p}}_{ens}(n) = \sum_{v=0}^{N_{mach}-1} w_v\, \hat{\mathbf{p}}_{out}^{v}(n) \qquad (32)$$
and the estimated class is
$$\hat{Y}(n) = \arg\max \hat{\mathbf{p}}_{ens}(n). \qquad (33)$$
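A sketch of the weighting rule of Equations (31)–(33), assuming the outputs of the $N_{mach}$ DINs are stacked in a single array (shapes and names are ours):

```python
import numpy as np

def ensemble_weights(p_out_train, y_train):
    """Eq. (31): reliability weight of each DIN from its training-set outputs.

    p_out_train has shape (N_mach, N_train, N_class); y_train holds integer labels.
    """
    scores = p_out_train[:, np.arange(len(y_train)), y_train].sum(axis=1)
    return scores / scores.sum()

def ensemble_predict(p_out_test, w):
    """Eqs. (32)-(33): weighted average of the DIN outputs, then MAP decision."""
    p_ens = np.tensordot(w, p_out_test, axes=1)   # shape (N_test, N_class)
    return p_ens.argmax(axis=1)
```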

4. The Probabilistic Point of View

This section is intended to underline the difference, in terms of probability formulation, between the Naive Bayes classifier [2,26] and the proposed scheme, since both use the assumption of conditional independence of the input features. Both classifiers build, in a simplified way, the probability matrix $\mathbf{P}_{Y|X_0,\ldots,X_{D-1}}$ with $N_{class}$ rows and $\prod_{i=0}^{D-1} N_{in}(i)$ columns, where $N_{in}(i)$ is the cardinality of the input feature $X_i$. In the next subsections, we show the different structure of these two probability matrices.

4.1. Assumption of Conditionally Independent Features

The Naive Bayes assumption allows writing the output estimated probability of the Naive Bayes classifier as follows:
$$P(Y=m|\mathbf{x}=\mathbf{x}^0) = \frac{P(\mathbf{x}=\mathbf{x}^0|Y=m)\, P(Y=m)}{P(\mathbf{x}=\mathbf{x}^0)} = \frac{\prod_{k=0}^{D-1} P(X_k=x_k^0|Y=m)\, P(Y=m)}{\sum_{s=0}^{N_{class}-1} \prod_{k=0}^{D-1} P(X_k=x_k^0|Y=s)\, P(Y=s)} \qquad (34)$$
which is very easily implemented, without the need of generating the tree network. We rewrite this output probability in a fairly complex way to show the difference between the Naive Bayes probability matrix and the DIN one. Consider the $n$th feature $x(n)$, which can take values in the set $\{c_n^0, \ldots, c_n^{D_n-1}\}$. Define $\mathbf{p}_{x(n)|y=m} = [P(x(n)=c_n^0|Y=m), \ldots, P(x(n)=c_n^{D_n-1}|Y=m)]$; then,
$$\mathbf{P}_{X_{in}|Y}(m,:) = \bigotimes_{k=0}^{D-1} \mathbf{p}_{x(k)|y=m} \qquad (35)$$
and thus, obviously,
$$\mathbf{P}_{X_{in}|Y} = \begin{bmatrix} \bigotimes_{k=0}^{D-1} \mathbf{p}_{x(k)|y=0} \\ \bigotimes_{k=0}^{D-1} \mathbf{p}_{x(k)|y=1} \\ \vdots \\ \bigotimes_{k=0}^{D-1} \mathbf{p}_{x(k)|y=N_{class}-1} \end{bmatrix} \qquad (36)$$
We can write the joint probability matrix as
$$\mathbf{P}_{X_{in},Y} = \mathrm{diag}(\mathbf{P}_Y)\, \mathbf{P}_{X_{in}|Y} \qquad (37)$$
and the probability matrix of the target class given the observation as
$$\mathbf{P}_{Y|X_{in}} = \big(\mathbf{P}_{X_{in},Y}\, \mathrm{diag}(\mathbf{P}_{X_{in}})^{-1}\big)^T \qquad (38)$$
The hypothesis of conditional statistical independence of the features is not always correct, and an obvious performance degradation can thus occur.
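For comparison, the Naive Bayes matrix of Equations (36)–(38) is a single chain of Kronecker products of the per-feature conditionals; a sketch under the same conventions (our naming, and feasible only for very small cardinalities, since the matrix grows exponentially with $D$):

```python
import numpy as np
from functools import reduce

def naive_bayes_matrix(per_feature_cond, P_Y):
    """Build P_{Y|Xin} of Eq. (38); per_feature_cond[k][m] = P(x(k)=. | Y=m)."""
    P_Y = np.asarray(P_Y).reshape(-1)
    rows = [reduce(np.kron, [P_k[m] for P_k in per_feature_cond])
            for m in range(P_Y.size)]
    P_Xin_given_Y = np.stack(rows)                   # Eq. (36): N_class x prod(N_in(k))
    P_XY = np.diag(P_Y) @ P_Xin_given_Y              # Eq. (37): joint probabilities
    P_Xin = P_XY.sum(axis=0)                         # marginal of the whole feature vector
    return (P_XY @ np.diag(1.0 / P_Xin)).T           # Eq. (38)
```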

4.2. The Overall Probability Matrix

We now instead compute the output estimated probability for the DIN classifier. Consider again the sub-network in Figure 3, made of info nodes $a$, $b$, and $c$. Info node $a$ is characterized by matrix $\mathbf{P}_a$, whose element $\mathbf{P}_a(i,j)$ is $P(X_{out,a}=j|X_{in,a}=i)$; similar definitions hold for $\mathbf{P}_b$ and $\mathbf{P}_c$. Note that $\mathbf{P}_a$ and $\mathbf{P}_b$ have $N_0$ rows and $N_1$ columns, whereas $\mathbf{P}_c$ has $N_1 \times N_1$ rows and $N_2$ columns; the overall probability matrix between the inputs $X_{in,a}$, $X_{in,b}$ and the output $X_{out,c}$ is $\tilde{\mathbf{P}}$, with $N_0 \times N_0$ rows and $N_2$ columns. Then,
$$P(X_{out,c}=i|X_{in,a}=j, X_{in,b}=k) = \sum_{r=0}^{N_1-1} \sum_{s=0}^{N_1-1} P(X_{out,c}=i, X_{out,a}=r, X_{out,b}=s|X_{in,a}=j, X_{in,b}=k) = \sum_{r=0}^{N_1-1} \sum_{s=0}^{N_1-1} P(X_{out,c}=i|X_{out,a}=r, X_{out,b}=s)\, P(X_{out,a}=r|X_{in,a}=j)\, P(X_{out,b}=s|X_{in,b}=k) = \sum_{r=0}^{N_1-1} \sum_{s=0}^{N_1-1} P(X_{out,c}=i|X_{out,a}=r, X_{out,b}=s)\, \mathbf{P}_a(j,r)\, \mathbf{P}_b(k,s). \qquad (39)$$
It can be shown that
$$\tilde{\mathbf{P}} = (\mathbf{P}_a \otimes \mathbf{P}_b)\, \mathbf{P}_c \qquad (40)$$
where $\otimes$ identifies the Kronecker product; note that $\mathbf{P}_a \otimes \mathbf{P}_b$ has $N_0 \times N_0$ rows and $N_1 \times N_1$ columns. By iteratively applying the above rule, we can get the expression of the overall matrix $\tilde{\mathbf{P}}$ for the exact topology of Figure 4, with eight input nodes and four layers:
$$\tilde{\mathbf{P}} = \Big\{ \big[ (\mathbf{P}_{0,0} \otimes \mathbf{P}_{1,0}) \mathbf{P}_{0,1} \otimes (\mathbf{P}_{2,0} \otimes \mathbf{P}_{3,0}) \mathbf{P}_{1,1} \big] \mathbf{P}_{0,2} \otimes \big[ (\mathbf{P}_{4,0} \otimes \mathbf{P}_{5,0}) \mathbf{P}_{2,1} \otimes (\mathbf{P}_{6,0} \otimes \mathbf{P}_{7,0}) \mathbf{P}_{3,1} \big] \mathbf{P}_{1,2} \Big\} \mathbf{P}_{0,3}. \qquad (41)$$
The overall output probability matrix $\mathbf{P}_{Y|X_{in}}$ can finally be computed as
$$\mathbf{P}_{Y|X_{in}} = \tilde{\mathbf{P}}\, \mathbf{P}_{Y|X_{out}(0,3)}. \qquad (42)$$
The DIN thus behaves as a one-layer system that generates the output according to matrix $\mathbf{P}_{Y|X_{in}}$, whose size might be impractically large. It is also possible to interpret the system as a sophisticated way of factorizing and approximating the exponentially large true probability matrix. In fact, the proposed layered structure needs much smaller probability matrices, which makes the system computationally efficient. The equivalent probability matrix is thus different in the DIN (Equation (42)) and Naive Bayes (Equation (38)) cases.
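A toy numerical check of Equation (40), composing the overall matrix of the sub-network of Figure 3 from random row-stochastic matrices (the cardinalities below are arbitrary, chosen by us only to make the shapes explicit):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    """Random row-stochastic matrix, standing in for a trained P_{Xout|Xin}."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

N0, N1, N2 = 3, 2, 2
P_a, P_b = random_stochastic(N0, N1), random_stochastic(N0, N1)
P_c = random_stochastic(N1 * N1, N2)
P_tilde = np.kron(P_a, P_b) @ P_c                # Eq. (40): shape (N0*N0) x N2
assert np.allclose(P_tilde.sum(axis=1), 1.0)     # the composition is still row-stochastic
```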

5. Experiments

In this section, we analyze the results obtained on benchmark datasets. In particular, we consider the DIN ensemble when: (a) each DIN is based on the probability matrices (the scheme described in this paper); and (b) each information node of the DIN randomly generates the symbols, as described in the previous work [16]. We refer to these two variants in captions and labels as DIN(Prob) and DIN(Gen), respectively. The reason for this comparison is that conditional statistical independence is not required in the DIN(Gen) case, and the classification accuracy could differ in the two cases. Note that Franzese and Visintin [16] considered just one DIN, not an ensemble of DINs. In the following, we introduce the three datasets on which we tested the method (Section 5.1, Section 5.2 and Section 5.3) and propose some examples of DIN architectures. A complete analysis of the numerical results is given in Section 5.4. Section 5.5 and Section 5.6 analyze the impact of changing the maximum number of iterations of the Blahut–Arimoto algorithm and the Lagrangian coefficient β, respectively. Finally, a synthetic multiclass experiment is described in Section 5.7. In all experiments, the value of β was optimized on the training set, similarly to what is described in Section 5.6.

5.1. UCI Congressional Voting Records Dataset

The first experiment on real data was conducted on the UCI Congressional Voting Records dataset [27], which collects the votes given by each U.S. House of Representatives Congressman on 16 key laws (in 1985). Each vote can take three values corresponding (roughly, see [27] for more details) to yes, no, and missing value; each datum belongs to one of two classes (Democrat or Republican). The aim of the network is, given the list of 16 votes, to decide whether the voter is Republican or Democrat. In this dataset, we thus have $D = 16$ features and 435 data points, split into $N_{train}$ data for training and $N_{test} = 435 - N_{train}$ data for testing. The architecture of the network is the same as the one described in Section 2.4, except for the fact that there are 16 input features instead of 8 (the network thus has one more layer). The input cardinality in the first layer is $N_{in}(0) = 3$ (yes/no/missing) and the output cardinality is set to $N_{out}(0) = 2$. From the second layer on, the input cardinality of each information node is $N_{in} = 4$ and the output cardinality is $N_{out} = 2$. In the majority of the cases, the size of the probability matrices is therefore $4 \times 2$ or $2 \times 2$. In this example, we used $N_{mach} = 30$ and $N_{train} = 218$ (roughly 50% of the data). The value of β was set to 2.2.

5.2. UCI Kidney Disease Dataset

The second considered dataset was the UCI Kidney Disease dataset [28]. The dataset has a total of 24 medical features, consisting of mixed categorical, integer, and real values, with missing values. Quantization of the non-categorical features of the dataset was performed according to the thresholds in Appendix A, agreed upon by a medical doctor.
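Quantization of this kind reduces to per-feature thresholding; a small sketch (the helper name is ours, and the thresholds shown are the Appendix A ones for serum creatinine):

```python
import numpy as np

def quantize(values, thresholds):
    """Map continuous values to bin indices 0..len(thresholds) given upper thresholds."""
    return np.digitize(values, thresholds)

creatinine_thresholds = [0.5, 1.2, 2.0]          # bins: <0.5, <1.2, <2, >=2 (Appendix A)
print(quantize(np.array([0.4, 1.0, 1.5, 3.2]), creatinine_thresholds))   # [0 1 2 3]
```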
The aim of the experiment is to correctly classify patients affected by chronic kidney disease. We performed 100 different trials, training the algorithms using only $N_{train} = 50$ out of 400 samples. Layer zero has 24 input nodes; the outputs of layer zero are then mixed two at a time to get 12 information nodes at layer 1, 6 at layer 2, and 3 at layer 3; the last three nodes are combined into a unique final node. The output cardinality of all nodes is equal to $N_{out} = 2$. The value of β was set equal to 5.6. Also in this case, we used an ensemble of $N_{mach} = 30$ DINs.

5.3. UCI Mushroom Dataset

The last considered dataset was the UCI Mushroom dataset [29]. This dataset comprises 22 categorical features with different cardinalities, which describe some properties of mushrooms, and one target variable that defines whether the considered mushroom is edible or poisonous/unsafe. There are 8124 entries in the dataset. We padded the dataset with two null features to reach a cardinality of 24 and used exactly the same architecture as in the kidney disease experiment. We selected $N_{train} = 50$, $\beta = 2.7$, and a number of DINs equal to $N_{mach} = 15$.

5.4. Misclassification Probability Analysis

We hereafter report results in terms of misclassification probability for the proposed method and several classification methods implemented using the MATLAB® Classification Learner. All datasets were randomly split 100 times into training and testing subsets, thus generating 100 different experiments. The proposed method shows competitive results in the considered cases, as can be observed in Table 1. It is interesting to compare the performance of the proposed algorithm with that of the Naive Bayes classifier, i.e., Equation (34), and with the Bagged Trees algorithm, which is (conceptually) the closest algorithm to the one we propose. In general, the two variants of the DINs perform similarly to the Bagged Trees, while outperforming Naive Bayes. For Bagged Trees and KNN-Ensemble, the same number of learners as in the DIN ensembles was used.

5.5. The Impact of Number of Iterations of Blahut–Arimoto on The Performance

As anticipated in Section 2.5, the computational complexity of a single node scales with the number of iterations of the Blahut–Arimoto algorithm. To the best of our knowledge, a provable convergence rate for the Blahut–Arimoto algorithm in the information bottleneck setting does not exist. We hereafter (Figure 5) present empirical results on the impact of limiting the number of iterations of the Blahut–Arimoto algorithm (for simplicity, the same bound is applied to all nodes in the networks). When the number of iterations is too small, there is a drastic decrease in performance because the probability matrices in the information nodes have not yet converged; 5–6 iterations are sufficient, and a further increase in the number of iterations brings no further performance improvement.

5.6. The Role of β : Underfitting, Optimality, and Overfitting

As with almost all machine learning algorithms, the choice of hyperparameters is of fundamental importance. For simplicity, in all experiments described in the previous sections, we kept the value of β constant through the network. To gain some intuition, Figure 6 shows the misclassification probability for different values of β for the three considered datasets (each time keeping β constant through the network). While the three curves are quantitatively different, we can notice the same qualitative trend: when β is too small, not enough information about the target variable is propagated; then, by increasing β above a certain threshold, the misclassification probability drops. Increasing β too much, however, induces overfitting, as expected, and the classification error (slowly) increases again. Remember (from Equation (15)) that the Lagrangian we are minimizing is
$$\mathcal{L} = I(X_{in};X_{out}) - \beta\, I(Y;X_{out}). \qquad (43)$$
Information theory tells us that at every information node we should propagate only the sufficient statistic of the target variable $Y$. In practice, this is reflected in the role of β: when it is too small, we neglect the term $I(Y;X_{out})$ and just minimize $I(X_{in};X_{out})$ (which corresponds to underfitting), while increasing β allows passing more information about the target variable through the bottleneck. It is important to remember, however, that we do not have direct access to the true mutual information values but only to an empirical estimate based on a finite dataset. Especially when the cardinalities of inputs and outputs are high, this translates into an increased probability of spotting spurious correlations that, if learned by the nodes, induce overfitting. The overall message is that β plays an extremely important role in the proposed method, and its value should be chosen to modulate between underfitting and overfitting.

5.7. A Synthetic Multiclass Experiment

In this section, we present results on a multiclass synthetic dataset. We generated 64-dimensional feature vectors $\mathbf{z}$ drawn from multivariate Gaussian distributions with mean and covariance depending on a target class $y$ and a control parameter $\rho$:
$$p(\mathbf{z}|y=l) = |2\pi\Sigma_l|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(\mathbf{z}-\boldsymbol{\mu}_l)^T (\rho\Sigma_l)^{-1} (\mathbf{z}-\boldsymbol{\mu}_l)\right), \quad l = 1,\ldots,N_{class} \qquad (44)$$
where, for the considered experiment, $N_{class} = 8$. The mean $\boldsymbol{\mu}_l$ is sampled from a normal 64-dimensional random vector and $\Sigma_l$ is randomly generated as $\Sigma_l = \mathbf{A}\mathbf{A}^T$ (where $\mathbf{A}$ is sampled from a matrix normal distribution) and normalized to have unit norm. The parameter $\rho$ is inserted to modulate the signal-to-noise ratio of the generated samples: a smaller value of $\rho$ corresponds to smaller feature variances, more distinct (less overlapping) pdfs $p(\mathbf{z}|y=l)$, and an easier classification task. We then quantize the result using 1 bit, i.e., the input of the ensemble of DINs is the random vector
$$\mathbf{x} = U(\mathbf{z}) \qquad (45)$$
where $U(\cdot)$ is the Heaviside step operator (applied element-wise). The designed architecture has 64 input nodes at the first layer, followed by layers of 32, 16, 4, 2, and 1 information nodes. The output cardinalities are equal to 2 for the first three layers, 4 for the fourth and fifth layers, and 8 at the last layer. We selected $N_{train} = 1000$, $\beta = 7$ (constant through the network), and a number of DINs equal to $N_{mach} = 10$. Figure 7 shows the classification accuracy (on a test set of 1000 samples) for different values of $\rho$. As expected, when the value of $\rho$ is small, we can reach almost perfect classification accuracy, whereas, by increasing it, the performance drops to the point where the useful signal is completely buried in noise and the classification accuracy reaches the asymptotic level of 1/8 (which corresponds to random guessing when the number of classes is equal to 8).
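The synthetic data can be reproduced in a few lines; the following sketch follows the description above (the exact normalizations, seeds, and function names are our own assumptions):

```python
import numpy as np

def make_dataset(n_per_class, dim=64, n_class=8, rho=1.0, seed=0):
    """Sample class-conditional Gaussian data, Eq. (44), and 1-bit quantize it, Eq. (45)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for l in range(n_class):
        mu = rng.standard_normal(dim)                    # class mean
        A = rng.standard_normal((dim, dim))
        Sigma = A @ A.T
        Sigma /= np.linalg.norm(Sigma)                   # unit (Frobenius) norm
        z = rng.multivariate_normal(mu, rho * Sigma, size=n_per_class)
        X.append((z > 0).astype(int))                    # Heaviside quantization
        y.append(np.full(n_per_class, l))
    return np.vstack(X), np.concatenate(y)
```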

6. Conclusions

The proposed ensemble Deep Information Network (DIN) shows good results in terms of accuracy and represents a new simple, flexible, and modular structure. The required hyperparameters are the cardinality of the alphabet at the output of each information node, the value of the Lagrangian multiplier β , and the structure of the tree itself (number of input information nodes of each combiner).
The simplistic architecture choices made for the experiments (such as equal cardinality of all node outputs, β constant through the network, etc.) performed comparably to finely tuned networks. However, we expect that, similarly to what happened in neural network applications, a domain-specific design of the architectures will allow for consistent improvements in terms of performance on complex datasets.
Despite the local assumption of conditionally independent features, the proposed method always outperforms Naive Bayes. As discussed in Section 4, the induced equivalent probability matrix is different in the two cases. Intuitively, we can understand the difference in performance from the point of view of probability matrix factorization. On one side, we have the true, exponentially large, joint probability matrix of all features and target class. On the other side, we have the Naive Bayes one, which is extremely simple in terms of complexity but clearly less accurate. In between, we have the proposed method, where the complexity is still reasonable but the quality of the approximation is much better. The DIN(Gen) algorithm does not require the assumption of statistical independence, yet its classification accuracy is very close to that of DIN(Prob), which further suggests that the assumption is acceptable from a practical point of view.
The proposed method leaves open the possibility of devising a custom hardware implementation. In fact, differently from classical decision trees, the execution time of all branches as well as the precise number of operations is fixed per datum and known a priori, which helps in various system design choices. With classical trees, where a node's utilization depends on the datum, we are forced to design the system for the worst case, even if in the vast majority of cases not all nodes are used. With DINs, instead, there is no such problem.
Finally, a clearly open point is related to the quantization procedure for continuous random variables. One possible self-consistent approach could be to devise an information-bottleneck-based method (similar to the method for continuous random variables [20]).
Further studies on extremely large datasets will help understand principled ways of tuning hyperparameters and architecture choices, and their relationship to performance.

Author Contributions

Conceptualization, G.F. and M.V.; methodology, G.F. and M.V.; software, G.F. and M.V.; validation, G.F. and M.V.; formal analysis, G.F. and M.V.; investigation, G.F. and M.V.; resources, G.F. and M.V.; data curation, G.F. and M.V.; writing—original draft preparation, G.F. and M.V.; writing—review and editing, G.F. and M.V.; visualization, G.F. and M.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

A special thanks to MD Gabriella Olmo, who suggested a quantization of the continuous values of the features in the experiment of Section 5.2 that is correct from a medical point of view.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Quantization

Hereafter, we present the quantization scheme used for the numerical features of the chronic kidney disease dataset.
  • Age (years): { <10, <18, <45, <70, <120 }
  • Blood Pressure (mmHg): { <80, <84, <89, <99, <109, ≥110 }
  • Blood Glucose Random (mg/dl): { <79, <160, <200, ≥200 }
  • Blood Urea (mg/dl): { <6, <20, ≥20 }
  • Serum Creatinine (mg/dl): { <0.5, <1.2, <2, ≥2 }
  • Sodium (mEq/l): { <136, <145, ≥145 }
  • Potassium (mEq/l): { <3.5, <5, ≥5 }
  • Haemoglobin (gm): { <12, <17, ≥17 }
  • Packed Cell Volume: { <27, <52, ≥52 }
  • White Blood Cell Count (cells/mm³): { <3500, <10500, ≥10500 }
  • Red Blood Cell Count (millions/mm³): { <2.5, <6, ≥6 }

References

  1. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2001.
  2. Murphy, K. Machine Learning: A Probabilistic Perspective; The MIT Press: Cambridge, MA, USA, 2012.
  3. Bergman, M.K. A Knowledge Representation Practionary; Springer: Basel, Switzerland, 2018.
  4. Rokach, L.; Maimon, O.Z. Data Mining with Decision Trees: Theory and Applications; World Scientific: Singapore, 2008; Volume 69.
  5. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
  6. Quinlan, J. C4.5: Programs for Machine Learning; Morgan Kaufmann: Burlington, MA, USA, 1993.
  7. Quinlan, J. Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 1996, 4, 77–90.
  8. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Elsevier: Burlington, MA, USA, 2014.
  9. Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press: Cambridge, UK, 2012.
  10. Jensen, F.V. Introduction to Bayesian Networks; UCL Press: London, UK, 1996; Volume 210.
  11. Norouzi, M.; Collins, M.; Johnson, M.A.; Fleet, D.J.; Kohli, P. Efficient Non-greedy Optimization of Decision Trees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1729–1737.
  12. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
  13. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  14. Tishby, N.; Zaslavsky, N. Deep Learning and the Information Bottleneck Principle. arXiv 2015, arXiv:1503.02406v1.
  15. Tishby, N.; Pereira, F.; Bialek, W. The Information Bottleneck Method. arXiv 2000, arXiv:physics/0004057v1.
  16. Franzese, G.; Visintin, M. Deep Information Networks. arXiv 2018, arXiv:1803.02251v1.
  17. Slonim, N.; Tishby, N. Agglomerative information bottleneck. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; pp. 617–623.
  18. Still, S. Information bottleneck approach to predictive inference. Entropy 2014, 16, 968–989.
  19. Still, S. Thermodynamic cost and benefit of data representations. arXiv 2017, arXiv:1705.00612.
  20. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005, 6, 165–188.
  21. Gedeon, T.; Parker, A.E.; Dimitrov, A.G. The mathematical structure of information bottleneck methods. Entropy 2012, 14, 456–479.
  22. Freund, Y.; Schapire, R. A short introduction to boosting. Jpn. Soc. Artif. Intell. 1999, 14, 1612.
  23. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  24. Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20.
  25. Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473.
  26. Hand, D.J.; Yu, K. Idiot’s Bayes—Not so stupid after all? Int. Stat. Rev. 2001, 69, 385–398.
  27. UCI Machine Learning Repository; University of California, Irvine, School of Information and Computer Sciences. Available online: http://archive.ics.uci.edu/ml (accessed on 30 September 2010).
  28. Salekin, A.; Stankovic, J. Detection of chronic kidney disease and selecting important predictive attributes. In Proceedings of the IEEE International Conference on Healthcare Informatics (ICHI), Chicago, IL, USA, 4–7 October 2016; pp. 262–270.
  29. Duch, W.; Adamczak, R.; Grąbczewski, K. Extraction of logical rules from neural networks. Neural Process. Lett. 1998, 7, 211–219.
Figure 1. Schematic representation of an input node: the inputs are two vectors and the outputs are matrices that statistically describe the random variables $X_{in}$ and $Y$.
Figure 2. Schematic representation of an information node, showing the input and output matrices.
Figure 3. Sub-network: $X_{in,a}$, $X_{out,a}$, $X_{in,b}$, $X_{out,b}$, $X_{in,c}$, and $X_{out,c}$ are all random variables; $N_0$ is the number of values taken by $X_{in,a}$ and $X_{in,b}$; $N_1$ is the number of values taken by $X_{out,a}$ and $X_{out,b}$; and $N_2$ is the number of values taken by $X_{out,c}$.
Figure 4. Example of a DIN for $D = 8$: the input nodes are represented as rectangles, the info nodes as circles, and the combiners as triangles. The numbers inside each circle identify the node (position inside the layer and layer number), $N_{in}(k)$ is the number of values taken by the input of the info node at layer $k$, and $N_{out}(k)$ is the number of values taken by the output of the info node at layer $k$. In this example, the info nodes at a given layer all have the same input and output cardinalities.
Figure 5. Misclassification probability versus number of iterations (average over 10 different trials) for the considered UCI datasets.
Figure 6. Misclassification probability versus β (average over 20 different trials) for the considered UCI datasets.
Figure 7. Classification accuracy for different values of the control parameter $\rho$.
Table 1. Mean misclassification probability (over 100 random experiments) for the three datasets with the considered classifiers.
Classifier                       | Congressional Voting Records | Kidney Disease | Mushroom
Naive Bayes                      | 0.10894                      | 0.051          | 0.20641
Decision Tree                    | 0.050691                     | 0.062314       | 0.05505
Bagged Trees                     | 0.043641                     | 0.0268         | 0.038305
DIN(Prob)                        | 0.050138                     | 0.037229       | 0.020796
DIN(Gen)                         | 0.049447                     | 0.026286       | 0.022182
Linear Discriminant Classifier   | 0.059724                     | 0.091029       | 0.069923
Logistic Regression              | 0.075161                     | 0.096429       | 0.07074
Linear SVM                       | 0.063226                     | 0.049914       | 0.04513
KNN                              | 0.08682                      | 0.11369        | 0.037018
KNN-Ensemble                     | 0.062811                     | 0.036057       | 0.043967
