A Geometric Perspective on Information Plane Analysis

Information plane analysis, describing the mutual information between the input and a hidden layer and between a hidden layer and the target over time, has recently been proposed to analyze the training of neural networks. Since the activations of a hidden layer are typically continuous-valued, this mutual information cannot be computed analytically and must thus be estimated, resulting in apparently inconsistent or even contradicting results in the literature. The goal of this paper is to demonstrate how information plane analysis can still be a valuable tool for analyzing neural network training. To this end, we complement the prevailing binning estimator for mutual information with a geometric interpretation. With this geometric interpretation in mind, we evaluate the impact of regularization and interpret phenomena such as underfitting and overfitting. In addition, we investigate neural network learning in the presence of noisy data and noisy labels.


Introduction
Deep Learning (e.g., [1][2][3][4]) has shown promising performance for many applications, including image analysis, speech analysis, and robotics. This progress, however, is mainly the result of more and more sophisticated neural network (NN) architectures with ever-increasing complexity. This makes it increasingly difficult to understand how such NNs work and to interpret or explain their predictions correctly, in particular as the number of parameters or the amount of training data grows. Thus, neither a simple validation of the input-output mapping nor focusing on salient features rather than all possible parameters [5][6][7][8][9][10][11][12][13] is sufficient in practice.
Thus, research has focused on understanding the inner workings of NNs and investigating, for instance, the learning behavior over time. One prominent example is information plane (IP) analysis [14], which is based on the information bottleneck principle [15]. The key idea is to analyze the plane described by the mutual information I(X; T) between the input X and the activation values of a hidden layer T and by the mutual information I(Y; T) between T and the target variable Y, and how these values change from epoch to epoch. Illustrative examples showing the trajectory of mutual information values over time are shown in Figure 1. Even though IPs appear to be an appealing way to analyze learning behaviors of NNs, we face the problem that the literature on IP analysis reports conflicting results, cf. [14,16,17].
This apparent conflict results from the fact that the mutual information often cannot be computed analytically. Thus, contradicting results stem from different ways to estimate the mutual information terms I(X; T) and I(Y; T). For instance, mutual information has been approximated via binning, i.e., via discretizing the continuous activation values; prior work has proposed fixed uniform binning [14,18,19], adaptive uniform binning [16], and adaptive nonuniform binning [20]. However, more elaborate estimation schemes have also been proposed. Our main contributions are as follows:
1. We introduce an interpretation for the estimates of mutual information from a geometric perspective, for both fixed and adaptive uniform binning. We support the interpretation by visualizing the data distribution in the latent space.
2. We show that the effects of regularization and phenomena such as overfitting and underfitting can be well described and interpreted via an IP analysis based on this geometric perspective.
3. Based on the geometric interpretation of IP analyses, we investigate robust classifier learning; in particular, we provide an interpretation of the learning behavior in the presence of noisy data and noisy labels.
The rest of the paper is organized as follows: First, in Section 2, we review and discuss the main ideas of IP analysis and introduce and discuss our approach. Then, in Section 3, we apply these findings to provide a thorough evaluation of deep NN learning for image classification tasks. Finally, in Section 4 we summarize and conclude our work.

Information Planes and Their Geometric Interpretation
Given a labeled training set D = {(x_1, y_1), . . . , (x_N, y_N)}, where x_i are data points and y_i are the corresponding class labels, the goal is to train a deep fully connected feed-forward NN with L layers. Assuming that D contains independent samples of a joint distribution P_{X,Y}, the data points x_i and the class labels y_i can be interpreted in the following as random variables X and Y, respectively.
Let t_{ℓ,i} denote the vector of activation values of the ℓth layer for a data point x_i. During training, the activation values change from epoch to epoch even for the same data point, i.e., t_{ℓ,i}(n) is a function of the epoch index n. In the following, we suppress this index for the sake of readability. Since the NNs we consider are deterministic, there exists a function f_ℓ for mapping a feature to this activation vector: t_{ℓ,i} = f_ℓ(x_i). The activation vectors {t_{ℓ,1}, . . . , t_{ℓ,N}} can be assumed to be independent realizations of a random variable T_ℓ = f_ℓ(X). For the sake of readability, we write T instead of T_ℓ, as the layer index is clear from the context.
In the following, we first discuss the estimation of mutual information for NNs via binning, which is a common approach to construct the IP. Second, we show that binning inherently introduces a geometric perspective that helps to interpret the IP correctly.

Mutual Information Estimation for Neural Network Training
For a pair of discrete random variables U and V with joint probability mass function p_{U,V}, the mutual information can be readily computed via ([26], Equation (2.28))

I(U; V) = Σ_u Σ_v p_{U,V}(u, v) log ( p_{U,V}(u, v) / (p_U(u) p_V(v)) ),

where p_U and p_V are the marginal distributions of U and V, respectively. In general, we can compute the mutual information if U and V are both continuous and if the joint probability density function (PDF) f_{U,V} exists and is known ([26], Equation (9.47)). For NNs, the distributions of the data points X and activations T are often assumed to be continuous, but the PDFs are not readily available. Thus, even assuming that the joint PDF exists, we require estimators for mutual information that are based on the dataset D, such as kernel density estimators [16] or binning estimators [14,[18][19][20]. If the joint PDF exists and if the true value of I(X; T) is finite, then there exist estimators Î(X; T) that can be parameterized such that Î(X; T) ≈ I(X; T), at least if D is sufficiently large. However, in [25], it was shown that the joint PDF of X and T does not exist for deterministic NNs. Indeed, since T = f(X), we have I(X; T) = ∞ for continuous input distributions and many practically relevant activation functions ([25], Theorem 1). Estimates of I(X; T) based on a finite dataset are thus inadequate.
To circumvent this problem, we rather focus on the mutual information between X or Y and a discretized version T̂ of the activation T. This discretization, which we obtain via binning, ensures that the mutual information terms are finite and, thus, can be estimated reliably. Specifically, rather than estimating I(X; T) and I(Y; T) directly, we estimate I(X; T̂) and I(Y; T̂), where T̂ is obtained by uniformly quantizing (binning) T:

T̂ = ⌈T/b⌉,

where b is the size of the bin and ⌈·⌉ is the ceiling operator applied to each element of the scaled activation vector. Specifically, we introduce two binning schemes: (a) binning with a fixed bin size of b = 0.5 and (b) binning with an adaptive bin size, where, for each coordinate of T, b is one-tenth of the range of activation values of this coordinate over the dataset. In other words, if t^{(j)}_{ℓ,i}(n) is the activation value of the jth neuron in the ℓth layer for data point i at epoch n, then

b^{(j)}_ℓ(n) = (max_i t^{(j)}_{ℓ,i}(n) − min_i t^{(j)}_{ℓ,i}(n)) / 10

and T̂^{(j)}(n) = ⌈T^{(j)}(n)/b^{(j)}_ℓ(n)⌉ and T̂(n) = (T̂^{(1)}(n), T̂^{(2)}(n), . . . ).
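To make the two schemes concrete, the following is a minimal NumPy sketch of both binning variants. The function names and the guard for constant coordinates are our own additions; the paper itself does not provide code.

```python
import numpy as np

def bin_fixed(T, b=0.5):
    """Fixed binning: quantize all activations with one absolute bin size b."""
    return np.ceil(T / b).astype(np.int64)

def bin_adaptive(T, fraction=0.1):
    """Adaptive binning: per coordinate, the bin size is one-tenth of the range
    of activation values of that coordinate over the dataset.
    T has shape (N, d): N data points, d neurons of one layer."""
    b = fraction * (T.max(axis=0) - T.min(axis=0))
    b = np.where(b > 0.0, b, 1.0)  # guard: constant coordinates get bin size 1
    return np.ceil(T / b).astype(np.int64)
```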
Since T (and thus T̂) is a deterministic function of X, we have I(X; T̂) = H(T̂) ([26], Equations (2.41) and (2.167)). Moreover, both Y and T̂ are discrete random variables, and both H(T̂) and I(Y; T̂) can be estimated using the plug-in estimators for entropy and mutual information. Specifically, with t̂_{ℓ,i} = ⌈t_{ℓ,i}/b_ℓ⌉, we have

Ĥ(T̂) = − Σ_t (|{i : t̂_{ℓ,i} = t}| / N) log (|{i : t̂_{ℓ,i} = t}| / N)

and

Î(Y; T̂) = Σ_{t,y} (|{i : t̂_{ℓ,i} = t, y_i = y}| / N) log ( N |{i : t̂_{ℓ,i} = t, y_i = y}| / (|{i : t̂_{ℓ,i} = t}| |{i : y_i = y}|) ).
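The plug-in estimators can be sketched as follows; this assumes the binned activations from above and uses the equivalent identity Î(Y; T̂) = Ĥ(T̂) + Ĥ(Y) − Ĥ(T̂, Y) instead of evaluating the double sum directly.

```python
import numpy as np

def plugin_entropy(symbols):
    """Plug-in entropy in bits; each row of `symbols` is one discrete symbol."""
    _, counts = np.unique(symbols, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def plugin_mi(T_binned, y):
    """Plug-in estimate of I(Y; T_hat) via H(T_hat) + H(Y) - H(T_hat, Y)."""
    y = np.asarray(y).reshape(-1, 1)
    joint = np.hstack([T_binned, y])
    return plugin_entropy(T_binned) + plugin_entropy(y) - plugin_entropy(joint)
```

Since T̂ is a deterministic function of X, the horizontal IP coordinate I(X; T̂) reduces to plugin_entropy applied to the binned activations.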
These estimators are reasonable if the number of data points for each combination of t and y in the sums is sufficiently large. However, for many applications, this is rarely the case. Indeed, it has been observed that, especially in convolutional NNs, the vector of activations T is so large that Ĥ(T̂) ≈ log |D|, i.e., every data point in D falls into a different bin, even if the bin size is large (see, for example, Figure 7 in [18]).

Information Plane Analysis
Assuming that the data allows us to estimate information-theoretic quantities involving the random variables over images, class labels, and activations, we can calculate the quantities defined in Equation (4). The authors of [14] proposed to plot these values in a Cartesian coordinate system, yielding the so-called information plane (IP), and to analyze how they change throughout training. This is illustrated in Figure 1 for two examples. From Figure 1a, two phases can be observed, cf. [14]: first, a phase in which both Ĥ(T̂) (expansion) and Î(Y; T̂) (fitting) increase and, second, a compression phase during which Ĥ(T̂) decreases (Î(Y; T̂) increases only slightly). The compression phase was interpreted as the hidden layer T discarding irrelevant information about the input X and was causally connected to generalization. In contrast, Figure 1b shows only fitting, i.e., an increase in Î(Y; T̂).

Interpretation of IPs Based on Binning Estimators
In Section 2.1, we showed that I(X; T) is infinite in deterministic NNs with continuous inputs and thus escapes estimation. To show that IP analyses as introduced in Section 2.2 are still useful, we build on the observation that the horizontal axis, labeled with Ĥ(T̂) in our case, does not describe an information-theoretic compression in the sense of a reduction of I(X; T). Such a reduction would indicate that irrelevant features of X are discarded when creating the latent representation of a hidden layer with activations T; T would become conceptually close to a minimal sufficient statistic. Rather, the current consensus is that Ĥ(T̂) is a measure of geometric size and that, thus, compression observed in the IP using such estimators is geometric [17,18]: the quantity Ĥ(T̂) is small if the image of the dataset D under the NN function f occupies only a few bins, or many bins but with a heavily skewed distribution. In such cases, f(D) has either a small diameter (relative to a fixed bin size) or is strongly clustered (if the bin size is adapted to the range of activation values). To improve the intuitive understanding, we consider three cases: (1) All data points are mapped to a small region in feature space that is covered by a single bin. Then, T̂ is constant over D and Ĥ(T̂) = 0 (see Figure 6a). (2) All data points are clustered, i.e., data points belonging to one class are mapped to a small region in the feature space, and regions corresponding to different classes are far apart. Furthermore, data points belonging to one class all fall within the same bin, but different classes occupy different bins. Then, Ĥ(T̂) is related to the logarithm of the number of classes. (3) The data points are spread over the feature space so that every bin contains at most one data point. Then, we have Ĥ(T̂) = log |D|. This can occur either if the latent space is very high-dimensional, as in convolutional NNs, or if the bin size b chosen is too small.
In all three cases, Ĥ(T̂) is a measure of the geometric "size" of the image f(D) in the latent space, where "size" has to be interpreted probabilistically and is measured relative to the bin size b: f(D) is "small" in this sense if the majority of its elements is covered by only a few bins. While fixed binning measures the geometric size on an absolute scale, adaptive binning measures the geometric size on a scale relative to the image of the dataset D under f, i.e., relative to the absolute scale of the latent space. Thus, a simple scaling of T, for instance by scaling all weights in a NN with ReLU activation functions, affects Ĥ(T̂) when T̂ is obtained by fixed binning, but not for adaptive binning.
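The three cases can be illustrated numerically. The following self-contained sketch is ours: it builds three synthetic two-dimensional "latent spaces" and evaluates the fixed-binning entropy (b = 0.5); the printed values are approximately 0, log_2(10) ≈ 3.32, and log_2(1000) ≈ 9.97.

```python
import numpy as np

def binned_entropy(T, b=0.5):
    """Plug-in entropy (in bits) of the fixed-binned activations."""
    _, counts = np.unique(np.ceil(T / b), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
N = 1000

# Case (1): all points inside a single bin -> entropy ~ 0
T1 = 0.4 * rng.random((N, 2))
# Case (2): ten tight, well-separated class clusters -> entropy ~ log2(10)
centers = 20.0 * rng.random((10, 2))
T2 = centers[rng.integers(0, 10, N)] + 0.01 * rng.random((N, 2))
# Case (3): points spread so far apart that each bin holds at most one point
#           -> entropy ~ log2(N)
T3 = 1000.0 * rng.random((N, 2))

for name, T in [("single bin", T1), ("class clusters", T2), ("scattered", T3)]:
    print(f"{name:>14}: H = {binned_entropy(T):.2f} bits")
```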

Analyzing NN Training via Information Plane Analysis
The goal of our experiments is to demonstrate that IP analysis, if interpreted correctly, can be a useful tool to analyze and interpret NN learning. To this end, we address different problems and tasks: (a) illustrating the impact of regularization; (b) analyzing phenomena such as overfitting and underfitting; and (c) demonstrating the generality of the approach by applying it to analyze robust learning. For the first two tasks, we run experiments on the well-known MNIST dataset [27]. For the third task, we run experiments on two different benchmark datasets, namely Brightness MNIST [28] (noisy data) and Noisy MNIST [29] (noisy labels).
For our analysis, we show the mutual information trajectories for both binning approaches, fixed binning (FB) and adaptive binning (AB). For this purpose, after each training epoch, the activation values of the hidden layers evaluated on the test set are saved and the mutual information is computed. For training, we applied the Adam optimizer and used ReLU as the activation function, unless noted otherwise. To reduce the influence of random initialization and the inherent randomness of the Adam optimizer, each experiment was run three times for 4000 epochs.
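As a sketch of this procedure, the per-epoch IP evaluation can look as follows. This assumes PyTorch and the estimator sketches from Section 2; a model whose forward pass returns the per-layer activations is our assumption (see the architecture sketch in the next subsection).

```python
import torch

def ip_points_for_epoch(model, X_test, y_test, binning=bin_fixed):
    """After one training epoch: forward the test set through the model and
    compute one IP point (H(T_hat), I(Y; T_hat)) per hidden layer.
    Assumes `model(x)` returns (logits, list_of_layer_activations) and relies
    on bin_fixed/bin_adaptive and plugin_entropy/plugin_mi sketched above."""
    model.eval()
    with torch.no_grad():
        _, activations = model(X_test)
    points = []
    for T in activations:
        T_hat = binning(T.cpu().numpy())  # or bin_adaptive for the AB variant
        points.append((plugin_entropy(T_hat), plugin_mi(T_hat, y_test.numpy())))
    return points
```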
To visually validate the claim that the IP displays geometric effects, we decided to use a bottleneck architecture with a two-dimensional layer. This allows us to visualize the data set in latent space without having to resort to projection or dimensionality reduction methods such as t-SNE. Even though we show the IP trajectories for all layers, our discussion mainly focuses on the trajectory corresponding to this two-dimensional layer. The findings, however, are more general and hold for different layer sizes and architectures.

Impact of Regularization
First, we analyze the impact of regularization when training a NN. For this purpose, we trained a bottleneck network (100-100-2-100) with and without l_2 regularization (λ = 0.0003, found by grid search). The resulting IPs for both binning approaches are shown in Figure 2. To make the temporal character of the trajectories more apparent, the first and the last epoch are highlighted by a black point and a large circle, respectively.
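For illustration, a possible realization of this setup is sketched below; the paper does not name a framework, so apart from the layer sizes, ReLU, Adam, and λ = 0.0003, everything here (PyTorch, input dimension, class count) is an assumption. The forward pass returns the per-layer activations, matching the evaluation sketch above.

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """100-100-2-100 bottleneck network on flattened 28x28 MNIST images."""
    def __init__(self, in_dim=784, n_classes=10):
        super().__init__()
        self.hidden = nn.ModuleList([
            nn.Linear(in_dim, 100),
            nn.Linear(100, 100),
            nn.Linear(100, 2),     # two-dimensional bottleneck layer
            nn.Linear(2, 100),
        ])
        self.head = nn.Linear(100, n_classes)

    def forward(self, x):
        activations = []
        for layer in self.hidden:
            x = torch.relu(layer(x))
            activations.append(x)  # per-layer activations for the IP
        return self.head(x), activations

model = BottleneckMLP()
# l_2 regularization realized via the optimizer's weight-decay term;
# torch's Adam adds weight_decay * w to the gradient, i.e., an l_2 penalty.
optimizer = torch.optim.Adam(model.parameters(), weight_decay=3e-4)
```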
Using adaptive binning (see Figure 2b), we recognize a fitting phase, i.e., Î(Y; T̂) increases over time, indicating a growth in class separability. In addition, using fixed binning (see Figure 2a), we can recognize a geometric compression on an absolute scale for Ĥ(T̂) from the first to the last epoch for the last two layers. Indeed, using l_2 regularization (weight decay) reduces the overfitting tendency by keeping the values of the weights small. Consequently, the small weights reduce the absolute scale of the data in latent space. As can be seen in Figure 3a,b, where we plot the two-dimensional latent space, the absolute scale reduces from approximately 47 × 69 to approximately 7 × 7 during training. In contrast, as we show in Figure 2c,d, these effects appear not to be present without l_2 regularization. Moreover, in this case, the picture conveyed by the IP is slightly less consistent. For instance, Run 1 and Run 2 show neither a compression nor a fitting phase for fixed binning, as can be seen in Figure 2c. Rather, the latent representation seems to expand throughout training, which is caused by increasing NN weights. This is also illustrated in Figure 3c,d, from which we can see that the absolute scale increases from approximately 45 × 36 to approximately 88 × 116. In contrast, for Run 3, we can see mainly an upward trend (only fitting) for Î(Y; T̂).
We additionally ran the same experiment using a convolutional NN (CNN). The CNN consists of four convolutional, two max-pooling, and four fully connected (100-100-2-100) layers. The results of the IP analysis on the fully connected layers are shown in Figure 4. In Figure 4a,c, we can see an expansion phase in fixed binning both with and without regularization. Moreover, the regularization results in a consistent fitting phase and slightly larger values of Î(Y; T̂) for adaptive binning at the last epoch, cf. Figure 4b. This can be traced back to better class separability, as seen in Figure 5b. Due to the simplicity of the task, the effect of overfitting on classification accuracy is mild: the fully connected NN achieves 96.62% with and 96.42% without regularization, while the CNN achieves 98.90% with and 98.60% without regularization. The corresponding IPs, however, display a qualitatively different behavior, indicating that similar accuracies were achieved along different training paths.
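Analogously, a hedged sketch of the CNN variant follows; the paper specifies only the counts of layers (four convolutional, two max-pooling, four fully connected), so the channel numbers and kernel sizes below are pure assumptions.

```python
import torch.nn as nn

# Hedged sketch: four convolutional layers, two max-pooling layers, and the
# same 100-100-2-100 fully connected stack; channels/kernels are assumed.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 2), nn.ReLU(),          # two-dimensional bottleneck
    nn.Linear(2, 100), nn.ReLU(),
    nn.Linear(100, 10),
)
```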

Underfitting Models
The next scenario we consider is underfitting, which prevents the model from learning sufficient information from the training data. In this section, we induce underfitting by using (a) too strong a regularization (λ = 0.2) and (b) a suboptimal network architecture (two layers with three hidden neurons each). The resulting IPs are displayed in Figure 6. In the first case, using strong regularization, we achieve an accuracy of approximately 11%. Here, all data points are mapped to a small region in feature space that is covered by a single bin. In this case, for fixed binning, T̂ is constant over D and Ĥ(T̂) = 0, which is reflected in the IP (see Figure 6a). In addition, the adaptive binning in Figure 6b shows the same behavior for the last two hidden layers, i.e., Ĥ(T̂) = 0. In the second case, using a model that is too narrow, we finally obtain an accuracy of approximately 62%, resulting from a slightly different learning behavior. As can be seen from Figure 6c, Ĥ(T̂) is small, especially for Layer 1, which means that a few bins are overpopulated, whereas the others are empty.
Indeed, in both cases, the NN cannot extract the relevant and required information from the input to fit the target outputs Y. Therefore, we have Î(Y; T̂) ≤ 2 (see Figure 6), which is clearly below log_2(10) ≈ 3.32, the value attained if all ten MNIST classes fall into different bins, indicating weak class separability.

Overfitting Models
The next scenario we consider is overfitting, which can be described as learning a model that fits the training data very well but that does not generalize to unseen data. To demonstrate this in terms of IP analysis, we train a network on MNIST with two hidden layers with 10 units in each layer; in this case, using tanh as an activation function. To encourage overfitting, we did not use any regularization.
In contrast to underfitting, which affects the entire IP, overfitting on clean labels can mainly be seen on the vertical axis of the IP. In fact, for an overfitting model, Î(Y; T̂) increases at the beginning of the training but decreases again later on. This can be seen in Figure 7. To further illustrate this effect, in Figure 8a, we plot Î(Y; T̂) for Layer 2 over time for fixed binning, first increasing to 3.26 and then decreasing to 3.14 (averaged over three runs). Indeed, this trend (having a peak around epoch 75) is directly related to the learning behavior and the accuracy of the model. To make this more apparent, we compare the plot of Î(Y; T̂) to the mean test loss and the mean test accuracy in Figure 8b and Figure 8c, respectively. It can be seen that the mean accuracy initially increases up to 93% and then drops (starting around epoch 75) to 90% at the end of the training. The mirrored trend, an initial reduction and subsequent growth, can be recognized in the loss curve. This indicates that the information captured by IP analysis is directly related to well-known learning characteristics.

Learning from Noisy Data
We next investigate the effect of corrupted input data on the IP. To this end, we run experiments on the Brightness MNIST dataset [28], a modified version of MNIST in which the illumination of the images is increased. In this way, the contrast of the images is decreased and, thus, the classes are pushed closer together in the image space. For evaluation purposes, we train both a 100-100-2-100 bottleneck model and a convolutional model as described before and analyze the learning behavior using both binning approaches.
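The exact corruption is defined by the Brightness MNIST benchmark [28]; as a hypothetical stand-in, the described effect (brighter images, lower contrast) can be mimicked as follows.

```python
import numpy as np

def brighten(images, shift=0.5):
    """Hypothetical stand-in for the brightness corruption: add a constant
    offset and clip to [0, 1]. Raising the floor of the pixel values lowers
    the usable contrast, pushing the classes closer together in image space.
    `images` holds pixel values in [0, 1]; the benchmark transform may differ."""
    return np.clip(images + shift, 0.0, 1.0)
```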
Indeed, we finally obtain accuracies of 95.25% and 98.51% for the bottleneck model and the convolutional model, respectively, which is comparable to the results on MNIST using the same models. However, the IP analysis shown in Figures 9 and 10 reveals that the learning behavior is different. As can be seen from Figures 11a and 12a, due to the reduced contrast in the images, the classes are mapped to highly overlapping regions; this is not the case for the original MNIST dataset (cf. Figures 3c and 5c).
To learn successfully, the data points in the latent space have to be pushed apart during NN training according to their class label. Accordingly, we can recognize a fitting phase (increasing Î(Y; T̂)) for adaptive binning (see Figures 9b and 10b) and an expansion phase for fixed binning (see Figures 9a and 10a). Simultaneously, the data points are pushed apart and occupy a larger volume in latent space (increasing from 10 × 8 to 31 × 27 and from 30 × 28 to 769 × 434), as can be seen in Figures 11 and 12.
For the bottleneck model, after epoch 30 (transition point, see Figure 11b), a compression phase emerges for fixed binning, and the clusters are tightened and separated from each other. For adaptive binning (see Figures 9b and 10b), both models share the same trend: a fitting phase along with a compression phase for the bottleneck layer. Moreover, for the last layer (Layer 4), an expansion phase and subsequently a compression phase can be recognized. Since the IPs for MNIST show a slightly different qualitative behavior, this indicates that the IP displays effects caused both by architectural choices and by the selected dataset.

Learning from Noisy Labels
For many practical applications, we face the problem of noisy and ambiguous labels in the training data (see, e.g., [30]). Thus, there has been a huge interest in studying the dynamics of NN learning from noisy labels [31][32][33][34][35][36], reaching the consensus that NNs first learn the training data for clean labels and subsequently memorize data for the noisy labels.
We investigate this scenario in the IP using a 100-100-2-100 bottleneck model. In addition, we evaluate activation functions from the rectifier family, namely ReLU and Leaky ReLU (with two different slopes: α = 0.01 and α = 0.3), as well as double saturated activation functions, e.g., Tanh, for noisy labels. For this purpose, similar to [29,37], we apply the idea of symmetric label noise and replace the true label with a label from another class for 40% of the training samples of MNIST. The results thus obtained for clean and noisy labels are summarized in Table 1.

At the beginning of the training process, the weights are randomly initialized close to zero. Therefore, the activation values of the rectified activation functions are small. When training starts, they deviate from these small values and start to increase. Thus, functions of the rectified unit family show an expansion in fixed binning, in which Ĥ(T̂) increases over time, as can be seen from Figures 13a and 14a,c. In contrast, the saturation regions of double saturated activation functions restrict the activation values, and we cannot see an expansion using fixed binning (see Figure 15a). This behavior is also reflected in the 2D visualization of the bottleneck layer for rectifying activation functions (see Figures 16c and 17c) by an increasing absolute scale. For Tanh, however, the absolute scale is bounded in the range [−1, 1] (see Figure 18).

In general, when training from noisy labels, the model first fits the clean labels and then starts to memorize the noisy labels (overfitting). In the fitting phase, Î(Y; T̂) increases, which can be seen in the adaptive binning illustrated in Figures 13b, 14b,d and 15b. However, in the memorization phase, Î(Y; T̂) decreases. At the same time, the accuracy also decreases and the loss increases, as can be seen in Figure 19a and Figure 19b, respectively. These two phases indicate first a growth (see Figures 16b, 17b and 18b) and then a reduction of class separability in the 2D visualization (see Figures 16c, 17c and 18c).

Figure 19. Mean accuracy (a) and mean loss (b) over time for Noisy MNIST training for all activation functions (averaged over three runs).
In particular, at Epoch 13 (see Figure 17b) and Epoch 16 (see Figures 16b and 18b), we can see the transition between fitting clean labels and fitting noisy ones [29]. This seems to be independent of the activation function used. On the other hand, since ReLU maps all negative activation values to zero, we can see a geometric compression (tighter clustering) in this case, especially for the last layer, along with a decrease in Î(X; T̂) in adaptive binning.
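For reference, the symmetric label noise described above can be generated as in the following sketch (our implementation, assuming integer labels 0-9; the flipping scheme of [29,37] may differ in detail).

```python
import numpy as np

def symmetric_label_noise(labels, noise_rate=0.4, n_classes=10, seed=0):
    """Replace the true label with a uniformly drawn *different* class label
    for a fraction `noise_rate` of the samples (here: 40% of MNIST)."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(len(noisy)) < noise_rate
    # offsets in {1, ..., n_classes-1} guarantee that the new label differs
    offsets = rng.integers(1, n_classes, size=flip.sum())
    noisy[flip] = (noisy[flip] + offsets) % n_classes
    return noisy
```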

Discussion and Conclusions
The main idea of IP analysis is to analyze the plane described by the mutual information I(X; T) between the input X and the activation values of a hidden layer T and by the mutual information I(Y; T) between T and the output variable Y over time. However, as the mutual information cannot be computed analytically, different estimation approaches are used, which leads to inconsistent results and contradicting interpretations of the IPs.
To overcome these issues, as a first contribution, we demonstrated that the IP represents geometric rather than information-theoretic effects. To this end, we took advantage of two different binning estimators based on fixed and adaptive binning, requiring different geometric interpretations and thus giving us different views on the geometric compression of the activations T. For our experimental results, we used a bottleneck architecture (with a two-dimensional layer), which allows us to directly relate the information conveyed by IPs to the geometric structure of the latent space. Additionally, showing the two-dimensional latent space supports our findings; however, the application of IP analysis is not limited to this type of architecture. To this end, we also showed results using different architectures, demonstrating that, if interpreted correctly, IP analysis can be a valuable tool to analyze neural network training.
Based on these findings, as a second contribution, we analyzed different scenarios for NN training. First, we evaluated and interpreted the impact of regularization and phenomena such as underfitting and overfitting using the well-known MNIST dataset. We showed that the effects of l_2 regularization, which aims to minimize the magnitude of the weights, can be seen both in the IP and in the two-dimensional visualization of the latent space. Furthermore, we were able to visualize and interpret over- and underfitting problems for specific setups using IPs. In addition, we also considered practically relevant problems, namely learning from noisy samples and noisy labels. For the first problem, we showed that, despite achieving similar classification performance, the learning behavior is different. For the second problem, we evaluated different activation functions and provided evidence that rectifying activations show an expansion phase corresponding to the memorization of noisy labels. Such an expansion phase is missing for double saturated activation functions, despite them memorizing the noisy labels as well.
In this way, we demonstrated that IPs can be a valuable tool to analyze NN training. However, the mutual information estimators must be adequately designed, and their estimates must be interpreted correctly. In particular, we showed that, at least for binning estimators, such an interpretation must take geometric aspects into account. Building on these findings, we will further investigate learning from noisy data and noisy labels. In particular, we are interested in the impact of using different non-linearities in this type of application scenario. The goal would be to improve the architectural design of deep neural networks when dealing with ambiguous or unreliable data.