Probabilistic Ensemble of Deep Information Networks

We describe a classifier made of an ensemble of decision trees, designed using information theory concepts. In contrast to algorithms C4.5 or ID3, the tree is built from the leaves instead of the root. Each tree is made of nodes trained independently of the others, to minimize a local cost function (information bottleneck). The trained tree outputs the estimated probabilities of the classes given the input datum, and the outputs of many trees are combined to decide the class. We show that the system is able to provide results comparable to those of the tree classifier in terms of accuracy, while it shows many advantages in terms of modularity, reduced complexity, and memory requirements.

previous version of the DIN was stochastic, while the one we propose here is deterministic. Moreover, we use an ensemble (e.g., [12,13]) of trees with randomly permuted features and weigh their outputs to improve classification accuracy.
The proposed architecture has several advantages in terms of: • extreme flexibility and high modularity: all the nodes are functionally equivalent and with a reduced number of inputs and outputs, which gives good opportunities for a possible hardware implementation; • high parallelizability: each tree can be trained in parallel with the others; • memory usage: we need to feed the network with data only at the first layer and simple incremental counters can be used to estimate the initial probability mass distribution; and • training time and training complexity: the locality of the computed cost function allows a nodewise training that does not require any kind of information from other points of the tree apart from its feeding nodes (that are usually a very small number, e.g., [2][3]. With respect to the DINs in [16], the main difference is that samples of the random variables in the inner layers of the tree are never generated, which is an advantage in the case of large datasets. However, an assumption of statistical independence (see Section 2.3) is necessary to build the probability matrices and this might be seen as a limitation of the newly proposed method. However, experimental results (see Section 5) show that this approximation does not compromise the performance.
We underline similarities and differences of the proposed classifier with respect to the methods described in [6,7] since they are among the best performing ones. When using decision trees, as well as DINs, categorical and missing data are easily managed, but continuous random variables are not: quantization of these input features is necessary in a pre-processing phase, and it can be performed as in C4.5 [6], using other heuristics, or manually. Concerning differences, instead, the first one is that normally a hierarchical decision tree is built starting from the root and splitting at each node, whereas we here propose a way to build a tree starting from the leaves. The topology of our network implies that, once the initial ordering of the features has been set, there is no need, after each node is trained, to perform a search of the best possible next node. The second important difference is that we do not use directly mutual information as a metric for building the tree but we base our algorithm on the Information Bottleneck principle [14,15,[17][18][19][20][21]. This allows us to extract all the relevant information (the sufficient statistic) while removing the redundant one, which is helpful in avoiding overfitting. As in [12,13], we use an ensemble method. We choose the simplest possible form of ensemble combination: we train independently many structurally equivalent networks, using the same single dataset but permuting the order of the features, and produce a weighted average of the outputs based on a simple rule described in Section 3.1. Notice that we use a one-shot procedure, i.e., we do not iterate more than once over the entire dataset and exploit techniques similarly to [22,23]. We leave the study of more sophisticated techniques to future works. Sections 2 and 3 more precisely describe the structure of the DIN and how it works, Section 4 gives some insight on the theoretical properties, Section 5 comments the results obtained with standard datasets. Conclusions are finally drawn in Section 6.

The DIN Architecture and Its Training
The information network is made of input nodes (Section 2.1), information nodes (Section 2.2), and combiners joined together through a tree network described in Section 2.3. Moreover, an ensemble of N mach trees is built, based on which the final estimated class is produced (Section 3.1). In [16], the input nodes are not present, the information node has a slightly different role, the combiners are much simpler than those described here, and just one tree was considered. As already stated, the new version of the DIN is more efficient when a large dataset with relatively few features is analyzed.
In the following, it is assumed that all the features take a finite number of discrete values; a case of continuous random variables is discussed in Section 5.2.
It is also assumed that N train points are used in the training phase, N test points in the testing phase, and that D features are present. The nth training point corresponds to one of N class possible classes.

The Input Node
Each input node (see Figure 1) has two input vectors: 1. x in of size N train , whose elements take values in a set of cardinality N in ; x in corresponds to one of the D features of the dataset (typically one column) 2. y of size N train , whose elements take values in a set of cardinality N class ; y corresponds to the known classes of the N train points Input node x in X in y Y Figure 1. Schematic representation of an input node: the inputs are two vectors and the outputs are matrices that statistically describe the random variables X in and Y.
The notation we use in the equations below is the following: Y, X in represent random variables; y(n) and x in (n) are the nth elements of vectors y and x in , respectively; and 1(c) is equal to 1 if c is true, and is otherwise equal to 0. Using Laplace smoothing [2], the input node estimates the following probabilities (the probability mass function of Y in Equation (1) is common to all the input nodes: it can be evaluated only by the first one and passed to the others): From basic application of probability rules,P(Y = m|X in = i) andP(X in = i|Y = m) are then computed. From now on, for simplicity, we denote all the estimated probabilitiesP simply as P.
All the above probabilities can be organized in matrices defined as follows: Note that vectors x in and y are not needed by the subsequent elements in the tree; only the input nodes have access to them.
Notice also that the following equalities hold:

The Information Node
The information node is schematically shown in Figure 2: the input discrete random variable X in is stochastically mapped into another discrete random variable X out (see [16] for further details) through probability matrices: • The input probability matrices P X in , P X in |Y , P Y|X in , P Y describe the input random variable X in , with N in possible values, and its relationship with class Y.

•
The output matrices P X out , P X out |Y , P Y|X out , P Y describe the output random variable X out , with N out possible values, and its relationship with Y.
Compression (source encoding) is obtained by setting N out < N in .
In the training phase, the information node generates the conditional probability mass function that satisfies the following equation (see [14]): where • P(X out = j) is the probability mass function of the output random variable X out and • β is a real positive parameter. • Z(i; β) is a normalizing coefficient to get The probabilities P(X out = j|X in = i) can be iteratively found using the Blahut-Arimoto algorithm [14,24,25]. Equation (10) solves the information bottleneck: it minimizes the mutual information I(X in ; X out ) under the constraint of a given mutual information I(Y; X out ). In particular, Equation (10) is the solution of the minimization of the Lagrangian If the Lagrangian multiplier β is increased, then the constraint is privileged and the information node tends to maximize the mutual information between its output X out and the class Y; if β is reduced, then minimization of I(X in ; X out ) is obtained (compression). The information node must actually balance compression from X in to X out and propagation of the information about Y. In our implementation, the compression is also imposed by the fact that the cardinality of the output alphabet N out is smaller than that of the input alphabet N in . The role of the information node is thus that of finding the conditional probability matrices

The Combiner
Consider the case depicted in Figure 3, where the two information nodes a and b feed a combiner (shown as a triangle) that generates the input of the information node c. The random variables X out,a and X out,b , both having alphabet with cardinality N 1 , are combined together as that has an alphabet with cardinality N 1 × N 1 . The combiner actually does not generate X in,c ; it simply evaluates the probability matrices that describe X in,c and Y. In particular, the information node c needs P X in,c |Y , which can be evaluated assuming that X out,a and X out,b are conditionally independent given Y ( notice that in implementation [16] this assumption was not necessary): where k = k a + N 1 k b . In particular, the mth row of P X in,c |Y is the Kronecker product of the mth rows of P X out,a |Y and P X out,b |Y P X in,c |Y (m, :) = P X out,a |Y (m, :) ⊗ P X out,b |Y (m, :) m = 0, . . . , N class − 1 (21) (here A(m, :) identifies the mth row of matrix A). The probability vector P X in,c can be evaluated considering that At this point, matrix P Y|X in,c can be evaluated element by element since It is straightforward to extend the equations to the case in which X in,a and X in,b have different cardinalities. Figure 3. Sub-network: X in,a , X out,a , X in,b , X out,b , X in,c , and X out,c are all random variables; N 0 is the number of values taken by X in,a and X in,b ; N 1 is the number of values taken by X out,a and X out,b ; and N 2 is the number of values taken by X out,c . Figure 4 shows an example of a DIN, where we assume that the dataset has D = 8 features and that training is thus obtained using a matrix X train with N train rows and D = 8 columns, with a corresponding class vector y. The kth column x(k) of matrix X train feeds, together with vector y, the input node I(k), k = 0, . . . , D − 1.

The Tree Architecture
Layer 1 x(4) in is the number of values taken by the input of the info node at layer k, and N (k) out is the number of values taken by the output of the info node at layer k. In this example, the info nodes at a given layer all have the same input and output cardinalities.
Information node (k, 0) at layer 0 processes the probability matrices generated by the input node in possible values of X in (k, 0), and evaluates the conditional probability matrices with N (0) out possible values of X out (k, 0), using the algorithm described in Section 2.2. The outputs of info nodes (2k, 0) and (2k + 1, 0) are given to a combiner that outputs the probability matrices for X in (k, 1), having out , using the equations described in Section 2.3. The sequence of combiners and information nodes is iterated, decreasing the number of information nodes from layer to layer, until the final root node is obtained. In the previous implementation of the DINs in [16], the root information node outputs the estimated class of the input and it is therefore necessary that the output cardinality of the root info node is equal to N class . In the current implementation, this cardinality can be larger than N class , since classification is based on the output probability matrix P Y|X out .
For a number of features D = 2 d , the number of layers is d. If D is not a power of 2, then it is possible to use combiners with 3 or more inputs (the changes in the equations in Section 2.3 are straightforward, since a combiner with three inputs can be seen as two cascaded combiners with two inputs each).
The overall binary topology proposed in Figure 4 requires a number of information nodes equal to and a number of combiners equal to All the info nodes run exactly the same algorithm and all the combiners are equal, apart from the input/output alphabet cardinalities. If the cardinalities of the alphabets are all equal, i.e., N do not depend on the layer i, then all the nodes and all the combiners are exactly equal, which might help in a possible hardware implementation; in this case, the number of parameters of the network is Actually, the network performance depends on how the features are coupled in subsequent layers and a random shuffling of the columns of matrix X train provides results that might be significantly different. This property is used in Section 3.1 for building the ensemble of networks.

A Note on Computational Complexity and Memory Requirements
The modular structure of the proposed method has several advantages in terms of both memory footprint and computational cost. The considered topology in this explanation is binary, similarly to what is depicted in Figure 4. We furthermore consider for simplicity cardinalities of the D input features all equal to N in and input/output cardinalities of subsequent layers information node to also be fixed constants N * in and N * out = N * in 2 , respectively. As we show in the experiment (Section 5), small values for N * in and N * out such as 2, 3, or 4 are sufficient in the considered cases. Straightforward generalizations are possible when considering inhomogeneous cases.
At the first layer (the input node layer), each of the D input nodes stores the joint probabilities of the target variable Y and its input feature. Each node thus includes a simple counter that fills the probability matrix of dimension N in × N class . Both the computational cost and the memory requirements for this first stage are the same as the Naive Bayes algorithm. Notice that, from the memory requirements point of view, it is not necessary to store all the training data but just counters with number of joint occurrences of features/classes. If after training, new data are observed, it is in fact sufficient to update the counters and properly renormalize the values to obtain the updated probability matrices. In this paper, we do not cover the topic of online learning as well as possible strategies to reduce the computational complexity in such a scenario.
At the second layer (the first information node layer), each node receives as input the joint probability matrix of feature and target variable and performs the Blahut-Arimoto algorithm. The internal memory requirement of this node is the space needed to store two probability matrices of dimensions N * in × N class and N * in × N * out , respectively. The cost per iteration of Blahut-Aritmoto depends on matrix multiplication of sizes N * in × N * out and N * in × N class , and thus obviously the complexity scales with the number of classes of the considered classification problem. To the best of our knowledge, the convergence rate for the Blahut-Arimoto algorithm applied to information bottleneck problems is unknown. In this study, however, we found empirically that, for the considered datasets, 5-6 iterations per node are sufficient, as discussed in Section 5.5.
Each combiner process the matrices generated by two information nodes: the memory requirement is zero and the computational cost is roughly N class Kronecker products between rows of probability matrices. Since for ease of explanation we chose N * out = N * in 2 the output probability matrix have again dimensions N * in × N class . The overall memory requirement and computational complexity (for a single DIN) is thus going to scale as D times the requirements for an input node, 2D − 1 times the requirements for an information node, and D − 1 times the requirements for a combiner. To complete the discussion, we have to remember that a further multiplication factor of N mach is required to take into account that we are considering an ensemble of networks (actually, at the first layer, the set of input nodes can be shared by the different architectures since only the relative position of the input nodes changes , see Section 3.1).

The Running Phase
During the running phase, the columns of matrix X with N rows and D columns are used as inputs. Assume again that the network architecture is that depicted in Figure 4 with D = 8, and consider the nth input row X(n, :).
According to the MAP criterion, the estimated class of the input point X(n, :) iŝ but we propose to use an improved method, as described in Section 3.1.

The DIN Ensemble
At the end of the training phase, when all the conditional matrices have been generated in each information node and combiner, the network is run with input matrix X train (N train rows and D columns) and the probability vector p out is obtained for each input point X train (n, :). As anticipated at the end of Section 2.4, the DIN classification accuracy depends on how the input features are combined together. By permuting the columns of X train , a different probability vector p out is typically obtained. We thus propose to generate an ensemble of DINs by randomly permuting the columns of X train , and then combine their outputs.
Since in the training phase y(n) is known, it is possible to get for each DIN v the probability p v out (n), and ideally p v out (n, y(n)), the estimated probability corresponding to the true class y(n), should be equal to one. The weights thus represent the reliability of the vth DIN.
In the running phase, feeding the N mach machines each with the correctly permuted vector X(n, :), the final estimated probability vector is determined aŝ and the estimated class isŶ (n) = arg maxp ens (n).

The Probabilistic Point of View
This section is intended to underline the difference in probability terms formulation between the Naive Bayes classifier [2,26] and the proposed scheme, since both use the assumption of conditional independence of the input features. Both classifiers build in a simplified way the probability matrix in is the cardinality for the input feature X i . In the next sections, we show the different structure of these two probability matrices.

Assumption of Conditionally Independent Features
The Naive Bayes assumption allows writing the output estimated probability of the Naive Bayes classifier as follows: which is very easily implemented, without the need of generating the tree network. We rewrite this output probability in a fairly complex way to show the difference between the naive Bayes probability matrix and the DIN one. Consider the nth feature x(n), which can take values in the set {c 0 n , . . . , c D n −1 and thus obviously We can write the joint probability matrix as and the probability matrix of target class given observation as The hypothesis of conditional statistical independence of the features is not always correct and thus we can incur obvious performance degradation.

The Overall Probability Matrix
We now instead compute the output estimated probability for the DIN classifier. Consider again the sub-network in Figure 3 made of info nodes a, b, and c. Info node a is characterized by matrix P a , whose element P a (i, j) is P(X out,a = j|X in,a = i); similar definitions hold for P b and P c . Note that P a and P b have N 0 rows and N 1 columns, whereas P c has N 1 × N 1 rows and N 2 columns; the overall probability matrix between the inputs X in,a , X in,b and the output X out,c isP with N 0 × N 0 rows and N 2 columns. Then, It can be shown thatP where ⊗ identifies the Kronecker matrix multiplication; note that P a ⊗ P b has N 0 × N 0 rows and N 1 × N 1 columns. By iteratively applying the above rule, we can get the expression of the overall matrixP for the exact topology of Figure 4, with eight input nodes and four layers: P = (P 0,0 ⊗ P 1,0 )P 0,1 ⊗ (P 2,0 ⊗ P 3,0 )P 1,1 P 0,2 ⊗ (P 4,0 ⊗ P 5,0 )P 2,1 ⊗ (P 6,0 ⊗ P 7,0 )P 3,1 P 1,2 P 0,3 .
The overall output probability matrix P Y|X can finally be computed as The DIN then behaves as a one-layer system that generates the output according to matrix P Y|X in , whose size might be impractically large. It is also possible to interpret the system as a sophisticated way of factorizing and approximating the exponentially large true probability matrix. In fact, the proposed layered structure needs smaller probability matrices, which makes the system computationally efficient. The equivalent probability matrix is thus different in the DIN (Equation (42)) and Naive Bayes (Equation (38)) cases.

Experiments
In this section, we analyze the results obtained with benchmark datasets. In particular, we consider the DIN ensemble when: (a) each DIN is based on the probability matrices (the scheme described in this paper); and (b) each information node of the DIN randomly generates the symbols, as described in the previous work [16]. We refer to these two variants in captions and labels as DIN(Prob) and DIN(Gen), respectively. The reason for this comparison is that conditional statistical independence is not required in the case DIN(Gen), and the classification accuracy could be different in the two cases. Note that Franzese and Visintin [16] considered just one DIN, not an ensemble of DINs. In the following, we introduce three datasets on which we tested the method (Sections 5.1-5.3) and propose some examples of DINs architectures. Complete analysis of numerical results is described in Section 5.4. Sections 5.5 and 5.6 analyze the impact of changing the maximum number of iterations of Blahut-Arimoto algorithm and Lagrangian coefficient β, respectively. Finally, a synthetic multiclass experiment is described in Section 5.7. In all experiments, the value of β was optimized similarly to what is described in Section 5.6 using the training set.

UCI Congressional Voting Records Dataset
The first experiment on real data was conducted on the UCI Congressional Voting Records dataset [27], which collects the votes given by each of the U.S. House of Representatives Congressmen on 16 key laws (in 1985). Each vote can take three values corresponding to (roughly, see [27] for more details) yes, no, and missing value; each datum belongs to one of two classes (Democrats or Republican). The aim of the network is, given the list of 16 votes, decide if the voter is Republican or Democratic. In this dataset, we thus have D = 16 features and 435 data split into N train data for training and N test = 435 − N train data for testing. The architecture of the used network is the same as the one described in Section 2.4, except for the fact that there are 16 input features instead of 8 (the network has thus one more layer). The input cardinality in the first layer is N (0) in = 3 (yes/no/missing) and the output cardinality is set to N (0) out = 2. From the second layer on, the input cardinality for each information node is equal to N * in = 4 and N * out = 2. In the majority of the cases, the size of the probability matrices is therefore 4 × 2 or 2 × 2. In this example, we used N mach = 30 and N train = 218 (roughly 50% of the data). The value of β was set to 2.2.

UCI Kidney Disease Dataset
The second considered dataset was the UCI Kidney Disease dataset [28]. The dataset has a total of 24 medical features, consisting of mixed categorical, integer, and real values, with missing values. Quantization of non-categorical features of the dataset was performed according to the thresholds in Appendix A, agreed upon by a medical doctor.
The aim of the experiment is to correctly classify patients affected by chronic kidney disease. We performed 100 different trials training the algorithms using only N train = 50 out of 400 samples for the training. Layer zero has 24 input nodes, and then the outputs of layer zero are mixed two at a time to get 12 information nodes at Layer 1, 6 at Layer 2, and 3 at Layer 3; the last three nodes are combined into a unique final node. The output cardinality of all nodes is equal to N * out = 2. The value of β was set equal to 5.6. In addition, in this case, we used an ensemble of N mach = 30 DINs.

UCI Mushroom dataset
The last considered dataset was the UCI Mushroom dataset [29]. This dataset is comprised of 22 categorical features with different cardinalities, which describe some properties of mushrooms, and one target variable that defines whether the considered mushroom is edible or poisonous/unsafe. There are 8124 entries in the dataset. We padded the dataset with two null features to reach the cardinality of 24 and used exactly the same architecture as the kidney disease experiment. We selected N train = 50, β = 2.7, and number of DINs equal to N mach = 15.

Misclassification Probability Analysis
We hereafter report results in terms of misclassification probability between the proposed method and several classification methods implemented using MATLAB R Classification Learner. All datasets were randomly split 100 times into training and testing subsets, thus generating 100 different experiments. The proposed method shows competitive results in the considered cases, as can be observed in Table 1. It is interesting to compare in terms of performance the proposed algorithm with respect to the Naive Bayes classifier, i.e., Equation (34), and the Bagged Tree algorithm, which is the closest algorithm (conceptually) to the one we propose. In general, the two variants of the DINs perform similarly to the Bagged Trees, while outperforming Naive Bayes. For Bagged Trees and KNN-Ensemble, the same number of learners as DIN ensembles were used.

The Impact of Number of Iterations of Blahut-Arimoto on The Performance
As anticipated in Section 2.5, the computational complexity of a single node scales with the number of iterations of Blahut-Arimoto algorithm. To the best of our knowledge, a provable convergence rate for the Blahut-Arimoto algorithm in the information nottleneck setting does not exist. We hereafter ( Figure 5) present empirical results on the impact of limiting the number of iterations of Blahut-Arimoto algorithm (for simplicity, the same bound is applied to all nodes in the networks). When the number of iterations is too small, there is a drastic decrease in performance because the probability matrices in the information nodes have not yet converged, while 5-6 iterations are sufficient and a further increase in the number of iterations is not necessary in terms of performance improvements. 5.6. The Role of β: Underfitting, Optimality, and Overfitting As usual with almost all machine learning algorithms, the choice of hyperparameters is of fundamental importance. For simplicity, in all experiments described in the previous sections, we kept the value of β constant through the network. To gain some intuition, Figure 6 shows the misclassification probability for different β for the three considered datasets (each time keeping β constant through the network). While the three curves are quantitatively different, we can notice the same qualitative trend: when β is too small, not enough information about the target variable is propagated, and then by increasing β above a certain threshold, the misclassification probability drops. Increasing β too much however induces overfitting, as expected, and the classification error (slowly) increases again. Remember (from Equation (15)) that the Lagrangian we are minimizing is L = I(X in ; X out ) − βI(Y; X out ).
Information theory tells us that at every information node we should propagate only the sufficient statistic about the target variable Y. In practice, this is reflected in the role of β: when it is too small, we neglect the term I(Y; X out ) and just minimize I(X in ; X out ) (that corresponds to underfitting), while increasing β allows passing more information about the target variable through the bottleneck. It is important to remember, however, that we do not have direct access to the true mutual information values but just to an empirical estimate based on a finite dataset. Especially when the cardinalities of inputs and outputs are high, this translates into an increased probability of spotting spurious correlations that, if learned by the nodes, induce overfitting. The overall message is that β has an extremely important role in the proposed method, and its value should be chosen to modulate between underfitting and overfitting.

A Synthetic Multiclass Experiment
In this section we present results on a multiclass synthetic dataset. We generated 64-dimensional feature vectors z drawn from multivariate Gaussian distributions with mean and covariance depending on a target class y and a control parameter ρ: where for the considered experiment N class = 8. The mean µ l is sampled from a normal 64-dimensional random vector and Σ l is randomly generated as Σ l = AA T (where A is sampled from a matrix normal distribution) and normalized to have unit norm. The other parameter ρ is inserted to modulate the signal to noise ratio of the generated samples: a smaller value of ρ corresponds to smaller feature variances and more distinct, less overlapping, pdfs p(z|y = l), and an easier classification task. We then perform quantization of the result using 1 bit, i.e., the input of the ensemble of DINs is the following random vector: where U(·) is the Heaviside step operator. The designed architecture has at the first layer 64 input nodes, followed by 32, 16, 4, 2, and 1. The output cardinalities are equal to 2 for the first three layers, 4 for the fourth and fifth layer, and 8 at the last layer. We selected N train = 1000, β = 7 (constant trough the network), and number of DINs equal to N mach = 10. Figure 7 shows the classification accuracy (on a test set of 1000 samples) for different values of ρ. As expected, when the value of ρ is small, we can reach almost perfect classification accuracy, whereas, by increasing it, the performance drops to the point where the useful signal is completely buried in noise and the classification accuracy reaches the asymptotic level of 1 8 (that corresponds to random guessing when the number of classes is equal to 8).

Conclusions
The proposed ensemble Deep Information Network (DIN) shows good results in terms of accuracy and represents a new simple, flexible, and modular structure. The required hyperparameters are the cardinality of the alphabet at the output of each information node, the value of the Lagrangian multiplier β, and the structure of the tree itself (number of input information nodes of each combiner).
Simplistic architecture choices made for the experiments (such as equal cardinality of all node outputs, β constant through the network, etc.) performed comparably to finely tuned networks. However, we expect that, similar to what happened in neural network applications, a domain specific design of the architectures will allow for consistent improvements in terms of performance on complex datasets.
Despite the local assumption of conditionally independent features, the proposed method always outperforms Naive Bayes. As discussed in Section 4, the induced equivalent probability matrix is different in the two cases. Intuitively, we can understand the difference in performance under the point of view of probability matrix factorization. On the one side, we have the true, exponentially large, joint probability matrix of all features and target class. On the other side, we have the Naive Bayes one, which is extremely simple in terms of complexity but obviously less performing. In between, we have the proposed method, where the complexity is still reasonable but the quality of the approximation is much better. The DIN(Gen) algorithm does not require the assumption of statistical independence, but the classification accuracy is very close to that of DIN(Prob), which further suggests that the assumption can be accepted from a practical point of view.
The proposed method leaves open the possibility of devising a custom hardware implementation. Differently from classical decision trees, in fact, the execution times of all branches as well as the precise number of operations is fixed per datum and known a priori, helping in various system design choices. In fact, with classical trees, where a node's utilization depends on the datum, we are forced to design the system for the worst case, even if in the vast majority of time not all nodes are used. Instead, with DIN, there is no such a problem.
Finally, a clearly open point is related to the quantization procedure of continuous random variables. One possible self-consistent approach could be devising an information bottleneck based method (similar to the method for continuous random variables [20]).
Further studies on extremely large datasets will help understand principled ways of tuning hyperparameters and architecture choices and their relationship on performance.