# Analysis of Deep Convolutional Neural Networks Using Tensor Kernels and Matrix-Based Entropy


## Abstract


## 1. Introduction

- We propose a kernel tensor-based approach to the matrix-based entropy functional that is designed for measuring MI in large-scale convolutional neural networks (CNNs).
- We provide new insights on the matrix-based entropy functional by showing its connection to well-known quantities in the kernel literature, such as the kernel mean embedding and maximum mean discrepancy. Furthermore, we show that the matrix-based entropy functional is closely linked with the von Neumann entropy from quantum information theory.
- Our results indicate that the compression phase is apparent mostly for the training data and less so for the test data, particularly for more challenging datasets. When using a technique such as early stopping to avoid overfitting, training tends to stop before the compression phase occurs (see Figure 1).

## 2. Related Work

## 3. Materials and Methods

#### 3.1. Preliminaries on Matrix-Based Information Measures

#### 3.1.1. Matrix-Based Entropy and Mutual Information

**Definition 1.** Let ${\mathbf{x}}_{i}\in \mathcal{X},\; i=1,2,\dots ,N$ denote data points and let $\kappa :\mathcal{X}\times \mathcal{X}\mapsto \mathbb{R}$ be an infinitely divisible positive definite kernel [22]. Given the kernel matrix $\mathbf{K}\in {\mathbb{R}}^{N\times N}$ with elements ${\left(\mathbf{K}\right)}_{ij}=\kappa ({\mathbf{x}}_{i},{\mathbf{x}}_{j})$ and the matrix $\mathbf{A}$ with ${\left(\mathbf{A}\right)}_{ij}=\frac{1}{N}\frac{{\left(\mathbf{K}\right)}_{ij}}{\sqrt{{\left(\mathbf{K}\right)}_{ii}{\left(\mathbf{K}\right)}_{jj}}}$, the matrix-based Rényi α-order entropy is given by

$$S_{\alpha}(\mathbf{A})=\frac{1}{1-\alpha}\log_{2}\left[\operatorname{tr}\left({\mathbf{A}}^{\alpha}\right)\right]=\frac{1}{1-\alpha}\log_{2}\left[\sum_{i=1}^{N}\lambda_{i}{(\mathbf{A})}^{\alpha}\right],$$

where $\lambda_{i}(\mathbf{A})$ denotes the $i$-th eigenvalue of $\mathbf{A}$.
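Definition 1 can be sketched numerically as follows. This is a minimal illustration assuming a Gaussian kernel; the kernel choice and width are illustrative, not prescribed by the definition:

```python
import numpy as np

def matrix_based_entropy(X, alpha=1.01, sigma=1.0):
    """Matrix-based Renyi alpha-order entropy of the samples in X (N x d)."""
    # Gaussian kernel matrix: (K)_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    K = np.exp(-np.maximum(dist2, 0.0) / (2.0 * sigma**2))
    # (A)_ij = K_ij / (N sqrt(K_ii K_jj)), so that tr(A) = 1
    d = np.sqrt(np.diag(K))
    A = K / np.outer(d, d) / len(X)
    lam = np.linalg.eigvalsh(A)
    lam = lam[lam > 1e-12]          # discard numerically zero eigenvalues
    return np.log2(np.sum(lam**alpha)) / (1.0 - alpha)
```

For $N$ identical points, $\mathbf{A}$ has a single unit eigenvalue and the entropy is zero; for $N$ points that are well separated relative to $\sigma$, the eigenvalues all equal $1/N$ and the entropy approaches its maximum value $\log_2 N$.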

#### 3.1.2. Bound on Matrix-Based Entropy Measure

**Proposition 2.**

#### 3.2. Analysis of Matrix-Based Information Measures

#### 3.2.1. A New Special-Case Interpretation of the Matrix-Based Rényi Entropy Definition

**Proof.**

#### 3.2.2. Link to Measures in Kernel Literature and Validation on High-Dimensional Synthetic Data

#### 3.3. Novel Tensor-Based Matrix-Based Rényi Information Measures

#### 3.4. Tensor Kernels for Measuring Mutual Information

#### 3.4.1. Choosing the Kernel Width

## 4. Results

**Comparison to previous approaches.** First, we study the IP of an MLP similar to those examined in previous works on DNN analysis using information theory [4,5]. We utilize stochastic gradient descent and a cross-entropy loss function, and repeat the experiment 5 times. Figure 1 displays the IP of the MLP with a ReLU activation function in each hidden layer. MI was measured using the training data of the MNIST dataset. A similar experiment was performed with the tanh activation function, obtaining similar results. The interested reader can find these results in Appendix E.

**Increasing DNN size.** We analyze the IP of the VGG16 network on the CIFAR-10 dataset with the same experimental setup as in the previous experiments. To our knowledge, this is the first time that the full IP has been modeled for such a large-scale network. Figure 5 and Figure 6 show the IP when measuring the MI for the training dataset and the test dataset, respectively. For the training dataset, we can clearly observe the same trend as for the smaller networks, where layers experience a fitting phase during the early stages of training and a compression phase in the later stage. Note that the compression phase is less prominent for the testing dataset. Note also the difference between the final values of $I(Y;T)$ for the output layer measured using the training and test data, which is a result of the different accuracy achieved on the training data (≈100%) and test data (≈90%). Ref. [3] claims that $I(T;X)\approx H\left(X\right)$ and $I(Y;T)\approx H\left(Y\right)$ for high-dimensional data, and highlights particular difficulties with measuring the MI between convolutional layers and the input/output. However, this statement depends on their particular measure of MI, and the results presented in Figure 5 and Figure 6 demonstrate that neither $I(T;X)$ nor $I(Y;T)$ is deterministic for our proposed measure. Furthermore, other measures of MI have also demonstrated that both $I(T;X)$ and $I(Y;T)$ evolve during training [4,18].

**Effect of early stopping.** We also investigate the effect of using early stopping on the IP described above. Early stopping is a regularization technique where the validation accuracy is monitored and training is stopped if the validation accuracy does not increase for a set number of iterations, often referred to as the patience hyperparameter. Figure 1 displays the iterations at which training would stop if the early stopping procedure were applied for different values of patience. For a patience of five iterations, training would stop before the compression phase takes place for several of the layers. For larger patience values, the effects of the compression phase can be observed before training is stopped. Early stopping is a procedure intended to prevent the network from overfitting, which may imply that the compression phase observed in the IP of DNNs can be related to overfitting. However, recent research on the so-called double descent phenomenon has shown that longer training might be necessary for good performance for overparameterized DNNs [38,39]. In such settings, early stopping might not be as applicable. We describe the double descent phenomenon and investigate its possible connection with the IP in Appendix H.
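The patience mechanism described above can be sketched with a small helper. This is a hypothetical simplification for illustration, not the training code used in the experiments:

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return the epoch at which training would stop: the first epoch after
    the validation accuracy has failed to improve for `patience` consecutive
    epochs, or the final epoch if that never happens."""
    best, since_best = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, since_best = acc, 0   # progress: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch            # patience exhausted, stop here
    return len(val_accuracies) - 1
```

With a small patience, training stops soon after the validation accuracy plateaus, which in the experiments above is before the compression phase begins for several layers.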

**Data processing inequality.** The data processing inequality (DPI) is a concept in information theory which states that the amount of information cannot increase in a chain of transformations, and a good information-theoretic estimator should tend to uphold it. A DNN consists of a chain of mappings from the input through the hidden layers and to the output. One can therefore interpret a DNN with layers ${T}_{1},{T}_{2},\dots ,{T}_{L}$ as a Markov chain [1,7] that defines an information path [1], which should satisfy the DPI [40]:

$$I(X;{T}_{1})\ge I(X;{T}_{2})\ge \dots \ge I(X;{T}_{L}).$$
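A DPI check can be sketched under the assumptions of the matrix-based framework (Gaussian kernels, joint entropy via the Hadamard product of kernel matrices). The two-layer "network" below is a toy stand-in; the estimator tends to respect the DPI ordering in practice, although this is not guaranteed on every sample:

```python
import numpy as np

def rbf(X, sigma=1.0):
    # Gaussian kernel matrix with unit diagonal
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def entropy(K, alpha=1.01):
    # Matrix-based Renyi entropy of a kernel matrix with unit diagonal
    A = K / len(K)
    lam = np.linalg.eigvalsh(A)
    lam = lam[lam > 1e-12]
    return np.log2(np.sum(lam**alpha)) / (1.0 - alpha)

def mutual_information(X, T, sigma=1.0):
    # I(X;T) = S(X) + S(T) - S(X,T); the joint uses the Hadamard product
    Kx, Kt = rbf(X, sigma), rbf(T, sigma)
    return entropy(Kx) + entropy(Kt) - entropy(Kx * Kt)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))                  # "input"
T1 = np.tanh(X @ rng.normal(size=(10, 8)))    # "layer 1"
T2 = np.tanh(T1 @ rng.normal(size=(8, 4)))    # "layer 2", a function of T1 only
mi_1, mi_2 = mutual_information(X, T1), mutual_information(X, T2)
# The DPI suggests I(X;T1) >= I(X;T2); compare the two estimates.
```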

## 5. Kernel Width Sigma

## 6. Discussion and Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Preliminaries on Multivariate Matrix-Based Renyi’s Alpha-Entropy Functionals

**Figure A1.** Different approaches for calculating kernels based on tensor data. The first row shows results when using the multivariate approach of [20], the second row depicts the tensor kernel approach used in this paper, and the third row displays the kernel obtained using matricization-based tensor kernels [32] that preserve the structure between channels. In all the kernel matrices, bright colors indicate high values and dark colors indicate low values.

## Appendix B. Tensor-Based Approach Contains Multivariate Approach as Special Case

**Figure A2.**IP of an MLP consisting of four fully connected layers with 1024, 20, 20, and 20 neurons and a tanh activation function in each hidden layer. MI was measured using the training data of the MNIST dataset and averaged over 5 runs.

**Figure A3.**IP of a CNN consisting of three convolutional layers with 4, 8 and 12 filters and one fully connected layer with 256 neurons and a tanh activation function in each hidden layer. MI was measured using the training data of the MNIST dataset and averaged over 5 runs.

## Appendix C. Structure Preserving Tensor Kernels and Numerical Instability of Multivariate Approach

## Appendix D. Detailed Description of Networks from Section 4

#### Appendix D.1. Multilayer Perceptron Used in Section 4

- Fully connected layer with 784 inputs and 1024 outputs.
- Activation function.
- Batch normalization layer.
- Fully connected layer with 1024 inputs and 20 outputs.
- Activation function.
- Batch normalization layer.
- Fully connected layer with 20 inputs and 20 outputs.
- Activation function.
- Batch normalization layer.
- Fully connected layer with 20 inputs and 20 outputs.
- Activation function.
- Batch normalization layer.
- Fully connected layer with 20 inputs and 10 outputs.
- Softmax activation function.
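Assuming standard PyTorch layers and a ReLU activation (the paper also uses tanh), the list above corresponds to a sketch like the following; note that the final fully connected layer here takes 20 inputs, matching the 20-neuron layer that precedes it:

```python
import torch
import torch.nn as nn

def fc_block(n_in, n_out):
    # fully connected layer -> activation -> batch normalization
    return [nn.Linear(n_in, n_out), nn.ReLU(), nn.BatchNorm1d(n_out)]

mlp = nn.Sequential(
    *fc_block(784, 1024),
    *fc_block(1024, 20),
    *fc_block(20, 20),
    *fc_block(20, 20),
    nn.Linear(20, 10),
    nn.Softmax(dim=1),
)

mlp.eval()                          # use running BN statistics so one sample works
probs = mlp(torch.zeros(1, 784))    # an MNIST image flattened to 784 values
```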

#### Appendix D.2. Convolutional Neural Network Used in Section 4

- Convolutional layer with 1 input channel and 4 filters, filter size $3\times 3$, stride of 1 and no padding.
- Activation function.
- Batch normalization layer.
- Convolutional layer with 4 input channels and 8 filters, filter size $3\times 3$, stride of 1 and no padding.
- Activation function.
- Batch normalization layer.
- Max pooling layer with filter size $2\times 2$, stride of 2 and no padding.
- Convolutional layer with 8 input channels and 16 filters, filter size $3\times 3$, stride of 1 and no padding.
- Activation function.
- Batch normalization layer.
- Max pooling layer with filter size $2\times 2$, stride of 2 and no padding.
- Fully connected layer with 400 inputs and 256 outputs.
- Activation function.
- Batch normalization layer.
- Fully connected layer with 256 inputs and 10 outputs.
- Softmax activation function.
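A PyTorch sketch of the CNN above, assuming $28\times 28$ MNIST input, makes the shape arithmetic explicit: after the third convolution and second pooling layer the feature map is $16\times 5\times 5$, which gives the 400 inputs of the first fully connected layer.

```python
import torch
import torch.nn as nn

# Spatial sizes for a 28x28 MNIST input are noted on the right.
cnn = nn.Sequential(
    nn.Conv2d(1, 4, 3), nn.ReLU(), nn.BatchNorm2d(4),     # 28x28 -> 26x26
    nn.Conv2d(4, 8, 3), nn.ReLU(), nn.BatchNorm2d(8),     # 26x26 -> 24x24
    nn.MaxPool2d(2, 2),                                   # 24x24 -> 12x12
    nn.Conv2d(8, 16, 3), nn.ReLU(), nn.BatchNorm2d(16),   # 12x12 -> 10x10
    nn.MaxPool2d(2, 2),                                   # 10x10 -> 5x5
    nn.Flatten(),                                         # 16 * 5 * 5 = 400 features
    nn.Linear(400, 256), nn.ReLU(), nn.BatchNorm1d(256),
    nn.Linear(256, 10),
    nn.Softmax(dim=1),
)

cnn.eval()
probs = cnn(torch.zeros(1, 1, 28, 28))
```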

## Appendix E. IP of MLP with Tanh Activation Function from Section 4

## Appendix F. IP of CNN with Tanh Activation Function from Section 4

## Appendix G. Data Processing Inequality with EDGE MI Estimator

**Figure A4.** Mean difference in MI of subsequent layers $\ell$ and $\ell+1$. Positive numbers indicate compliance with the DPI. MI was measured on the MNIST training set using the EDGE MI estimator on a simple MLP [4].

## Appendix H. Connection with Epoch-Wise Double Descent

**Figure A5.**Training and test loss of neural network with one hidden layer with 50,000 neurons on a subset of the MNIST dataset. The figure also shows the MI between the input/labels and the hidden/output layer. The epoch-wise double descent phenomenon is visible in the test loss, and it seems to coincide with the start of the compression of MI between the input and output layer. Notice the different labels on the left and right y-axis. Curves represent the average over 3 training runs.

## References

- Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv **2017**, arXiv:1703.00810.
- Geiger, B.C. On Information Plane Analyses of Neural Network Classifiers—A Review. IEEE Trans. Neural Netw. Learn. Syst. **2021**, 33, 7039–7051.
- Cheng, H.; Lian, D.; Gao, S.; Geng, Y. Evaluating Capability of Deep Neural Networks for Image Classification via Information Plane. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 168–182.
- Noshad, M.; Zeng, Y.; Hero, A.O. Scalable Mutual Information Estimation Using Dependence Graphs. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 2962–2966.
- Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. **2019**, 2019, 124020.
- Yu, S.; Wickstrøm, K.; Jenssen, R.; Príncipe, J.C. Understanding Convolutional Neural Network Training with Information Theory. IEEE Trans. Neural Netw. Learn. Syst. **2020**, 32, 435–442.
- Yu, S.; Principe, J.C. Understanding autoencoders with information theoretic concepts. Neural Netw. **2019**, 117, 104–123.
- Lorenzen, S.S.; Igel, C.; Nielsen, M. Information Bottleneck: Exact Analysis of (Quantized) Neural Networks. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
- Goldfeld, Z.; Van Den Berg, E.; Greenewald, K.; Melnyk, I.; Nguyen, N.; Kingsbury, B.; Polyanskiy, Y. Estimating Information Flow in Deep Neural Networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2299–2308.
- Chelombiev, I.; Houghton, C.; O’Donnell, C. Adaptive Estimators Show Information Compression in Deep Neural Networks. arXiv **2019**, arXiv:1902.09037.
- Zhouyin, Z.; Liu, D. Understanding Neural Networks with Logarithm Determinant Entropy Estimator. arXiv **2021**, arXiv:2105.03705.
- Sanchez Giraldo, L.G.; Rao, M.; Principe, J.C. Measures of Entropy From Data Using Infinitely Divisible Kernels. IEEE Trans. Inf. Theory **2015**, 61, 535–548.
- Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015.
- Kolchinsky, A.; Tracey, B.D.; Kuyk, S.V. Caveats for information bottleneck in deterministic scenarios. arXiv **2019**, arXiv:1808.07593.
- Amjad, R.A.; Geiger, B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 42, 2225–2239.
- Kornblith, S.; Norouzi, M.; Lee, H.; Hinton, G. Similarity of Neural Network Representations Revisited. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3519–3529.
- Jónsson, H.; Cherubini, G.; Eleftheriou, E. Convergence Behavior of DNNs with Mutual-Information-Based Regularization. Entropy **2020**, 22, 727.
- Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; Volume 80, pp. 531–540.
- Geiger, B.C.; Kubin, G. Information Bottleneck: Theory and Applications in Deep Learning. Entropy **2020**, 22, 1408.
- Yu, S.; Sanchez Giraldo, L.G.; Jenssen, R.; Principe, J.C. Multivariate Extension of Matrix-based Rényi’s α-order Entropy Functional. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 42, 2960–2966.
- Landsverk, M.C.; Riemer-Sørensen, S. Mutual information estimation for graph convolutional neural networks. In Proceedings of the 3rd Northern Lights Deep Learning Workshop, Tromso, Norway, 10–11 January 2022; Volume 3.
- Bhatia, R. Infinitely Divisible Matrices. Am. Math. Mon. **2006**, 113, 221–235.
- Rényi, A. On Measures of Entropy and Information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
- Nielsen, M.A.; Chuang, I.L. Quantum Computation and Quantum Information, 10th ed.; Cambridge University Press: Cambridge, UK, 2011.
- Mosonyi, M.; Hiai, F. On the Quantum Rényi Relative Entropies and Related Capacity Formulas. IEEE Trans. Inf. Theory **2011**, 57, 2474–2487.
- Kwak, N.; Choi, C.H. Input feature selection by mutual information based on Parzen window. IEEE Trans. Pattern Anal. Mach. Intell. **2002**, 24, 1667–1671.
- Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A Kernel Two-sample Test. J. Mach. Learn. Res. **2012**, 13, 723–773.
- Muandet, K.; Fukumizu, K.; Sriperumbudur, B.; Schölkopf, B. Kernel Mean Embedding of Distributions: A Review and Beyond. In Foundations and Trends® in Machine Learning; Now Foundations and Trends: Boston, MA, USA, 2017; pp. 1–141.
- Fukumizu, K.; Gretton, A.; Sun, X.; Schölkopf, B. Kernel Measures of Conditional Dependence. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2008; pp. 489–496.
- Smola, A.; Gretton, A.; Song, L.; Schölkopf, B. A Hilbert Space Embedding for Distributions. In Proceedings of Algorithmic Learning Theory, Sendai, Japan, 1–4 October 2007; Hutter, M., Servedio, R.A., Takimoto, E., Eds.; pp. 13–31.
- Evans, D. A Computationally Efficient Estimator for Mutual Information. Proc. Math. Phys. Eng. Sci. **2008**, 464, 1203–1215.
- Signoretto, M.; De Lathauwer, L.; Suykens, J.A. A kernel-based framework to tensorial data analysis. Neural Netw. **2011**, 34, 861–874.
- Shi, J.; Malik, J. Normalized Cuts and Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. **2000**, 22, 888–905.
- Shi, T.; Belkin, M.; Yu, B. Data spectroscopy: Eigenspaces of convolution operators and clustering. Ann. Stat. **2009**, 37, 3960–3984.
- Silverman, B.W. Density Estimation for Statistics and Data Analysis; CRC Press: Boca Raton, FL, USA, 1986; Volume 26.
- Cristianini, N.; Shawe-Taylor, J.; Elisseeff, A.; Kandola, J.S. On kernel-target alignment. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2002; pp. 367–373.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Belkin, M.; Hsu, D.; Ma, S.; Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. USA **2019**, 116, 15849–15854.
- Nakkiran, P.; Kaplun, G.; Bansal, Y.; Yang, T.; Barak, B.; Sutskever, I. Deep Double Descent: Where Bigger Models and More Data Hurt. arXiv **2020**, arXiv:1912.02292.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley Series in Telecommunications and Signal Processing; Wiley-Interscience: New York, NY, USA, 2006.
- Yu, X.; Yu, S.; Príncipe, J.C. Deep Deterministic Information Bottleneck with Matrix-Based Entropy Functional. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 3160–3164.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. 2017. Available online: https://pytorch.org (accessed on 1 January 2023).
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456.

**Figure 1.**IP obtained using our proposed measure for a small DNN averaged over 5 training runs. The solid black line illustrates the fitting phase while the dotted black line illustrates the compression phase. The iterations at which early stopping would be performed assuming a given patience parameter are highlighted. Patience denotes the number of iterations that need to pass without progress on a validation set before training is stopped to avoid overfitting. For low patience values, training will stop before the compression phase. For the benefit of the reader, a magnified version of the first four layers is also displayed.

**Figure 2.** The leftmost plot shows the entropy calculated using Equation (14) of a 100-dimensional normal distribution with zero mean and an isotropic covariance matrix for different variances. The variances are given along the x-axis. The rightmost plot shows the entropy estimated using Equation (1) for the same distribution. The plots illustrate that the analytically computed entropy and the estimated quantity follow the same trend.

**Figure 3.** The leftmost plot shows the mutual information calculated using Equation (16) between a standard 100-dimensional normal distribution and a normal distribution with a mean vector of all ones and an isotropic covariance matrix with different variances. The variances are given along the x-axis. The rightmost plot shows the mutual information estimated using Equation (3) for the same distributions. The plots illustrate that the analytically computed mutual information and the estimated quantity follow the same trend.

**Figure 4.**IP of a CNN consisting of three convolutional layers with 4, 8 and 12 filters and one fully connected layer with 256 neurons and a ReLU activation function in each hidden layer. MI was measured using the training data of the MNIST dataset and averaged over 5 runs.

**Figure 5.** IP of the VGG16 on the CIFAR-10 dataset. MI was measured using the training data and averaged over 2 runs. Color saturation increases as training progresses. Both the fitting phase and the compression phase are clearly visible for several layers.

**Figure 6.**IP of the VGG16 on the CIFAR-10 dataset. MI was measured using the test data and averaged over 2 runs. Color saturation increases as training progresses. The fitting phase is clearly visible, while the compression phase can only be seen in the output layer.

**Figure 7.** Mean difference in MI of subsequent layers $\ell$ and $\ell+1$. Positive numbers indicate compliance with the DPI. MI was measured on the MNIST training set for the MLP and on the CIFAR-10 training set for the VGG16.

**Figure 8.** Evolution of the kernel width as a function of iteration for the three networks considered in this work. From left to right, the plots show the kernel width for the MLP, CNN, and VGG16. The plots demonstrate that the optimal kernel width quickly stabilizes and remains relatively constant throughout training.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Wickstrøm, K.K.; Løkse, S.; Kampffmeyer, M.C.; Yu, S.; Príncipe, J.C.; Jenssen, R.
Analysis of Deep Convolutional Neural Networks Using Tensor Kernels and Matrix-Based Entropy. *Entropy* **2023**, *25*, 899.
https://doi.org/10.3390/e25060899
