# Gaussian Perturbations in ReLU Networks and the Arrangement of Activation Regions

^{1}

^{2}

## Abstract

**:**

## 1. Introduction

## 2. Related Results

- We suggest a measure, tangent sensitivity, which characterises, in a way, both the geometrical properties of the function and the original data distribution without the target, meanwhile capturing how the model handled injecting noise at each layer per sample. In comparison to [34], our measure operates on directional derivatives.
- We derive several easily computable bounds and measures for feed-forward ReLU multi-layer perceptrons based either only on the state of the network or on the data as well. Throughout these measures we connect tangent sensitivity to the structure of the network and particularly to the input–output paths inside the network, the norm of the parameters and the distribution of the linear regions in the input space. The bounds are closely related to path-sgd [35], the margin distribution [14,25,36] and the narrowness estimation of linear regions [37] albeit primarily to the distribution of non zero volume activation patterns.
- Finally, we experiment on the CIFAR-10 [38] dataset and observe that even simple upper bounds of tangent sensitivity are connected to the empirical generalisation gap, the performance difference between the training set and test set.

## 3. Methods

#### 3.1. Preliminaries

#### 3.2. Tangent Space Sensitivity

**Definition**

**1.**

**Theorem**

**1.**

**Theorem**

**2.**

#### 3.3. Distribution of Linear Regions

**Lemma**

**1.**

**Proof.**

#### 3.4. Tangent Space Sensitivity and Generalisation

- Layer-wise norm sensitivity, $Sen{s}_{tan}^{1}$ (Equation (1)): the bound does not depend on input, however the layer-wise ${l}_{\infty}$ norm of the parameters changes throughout learning therefore we may estimate the loss at time t (learning step t) based on the inverse change in maximal sensitivity (Equation (1)) and loss measured on the training set at time t:$${\widehat{l}}^{1}({X}_{te};{\theta}^{\left(t\right)}):=\frac{Sen{s}_{tan}^{1}\left({\theta}^{(t-1)}\right))}{Sen{s}_{tan}^{1}\left({\theta}^{\left(t\right)}\right)}l\left({X}_{tr},{Y}_{tr};{\theta}^{\left(t\right)}\right)=\frac{{\left({\Pi}_{i=1}^{k}{w}_{ma{x}_{i}}^{(t-1)}\right)}^{2}}{{\left({\Pi}_{i=1}^{k}{w}_{ma{x}_{i}}^{\left(t\right)}\right)}^{2}}l\left({X}_{tr},{Y}_{tr};{\theta}^{\left(t\right)}\right),$$
- Maximal sensitivity, $Sen{s}_{tan}^{2}$ (Equation (2)): similarly, we may estimate test loss based on the distribution of the number of active nodes:$${\widehat{l}}^{2}({X}_{te};\theta ):=\frac{{\psi}^{*}(k,{\mu}_{te},{\sigma}_{te})}{{\psi}^{*}(k,{\mu}_{tr},{\sigma}_{tr})}l({X}_{tr},{Y}_{tr};\theta ),$$
- Empirical sensitivity, $Sen{s}_{tan}^{3}$ (Equation (3)): assuming the empirical estimation of $p(A;\theta )$ in (Equations (2) and (5)) we define empirical tangent sensitivity as:$$\widehat{S}(X;\theta )=\sum _{{A}_{i}\in {\mathit{A}}^{+}}{p}_{x\in X}({A}_{i};\theta )Sen{s}_{tan}({A}_{i};\theta ).$$The corresponding estimation of test loss:$${\widehat{l}}^{3}({X}_{te};\theta ):=\frac{\widehat{S}({X}_{te};\theta )}{\widehat{S}({X}_{tr};\theta )}l({X}_{tr},{Y}_{tr};\theta ).$$

## 4. Experiments

- Input-output sensitivity [1]: In case of binary classification and for an input set X, input sensitivity is defined as$${S}_{Jac}(X;\theta )={E}_{x\in X}[\parallel \frac{\partial f(x;\theta )}{\partial x}{\parallel}_{2}]$$$${l}^{4}({X}_{te};\theta ):=\frac{{S}_{Jac}({X}_{te};\theta )}{{S}_{Jac}({X}_{tr};\theta )}l({X}_{tr},{Y}_{tr};\theta ).$$
- Fisher-Rao norm [23]: The authors proposed a measure (Theorem 3.1) in case of smooth loss with known labeling as$${S}_{FR}(X,Y;\theta )={(k+1)}^{2}{E}_{x,y}[<\frac{\partial l\left(f\right(x;\theta ),y)}{\partial f(x;\theta )},f(x;\theta ){>}^{2}].$$The measure depends on the input labels therefore we need a slight modification with replacing the loss with the sum of loss over all ten classes:$${S}_{FR}(X;\theta )=\sum _{j=1}^{10}{E}_{x}[<\frac{\partial l\left(f\right(x;\theta ),j)}{\partial f(x;\theta )},f(x;\theta ){>}^{2}].$$We are only interested in the change of the measure therefore we also remove the constant part. We may estimate the loss on the test set by$${l}^{5}({X}_{te};\theta ):=\frac{{S}_{FR}({X}_{te};\theta )}{{S}_{FR}({X}_{tr};\theta )}l({X}_{tr},{Y}_{tr};\theta ).$$
- Spectral norm [14]: the authors suggest spectrally normalised margin complexity to measure generalisation in case of multiclass classification as$${S}_{Spect}(X,Y;\theta )=\frac{f{(x;\theta )}_{y}-{max}_{i\ne y}f{(x;\theta )}_{j}}{{S}_{Spect}\left(\theta \right){\parallel X\parallel}_{2}/n}$$$${S}_{Spect}\left(\theta \right)={\Pi}_{i=1}^{k}{\parallel {W}_{i}\parallel}_{\sigma}.$$As the measure depends on the input label we modified the measure by removing the margin motivated by the fact that on the training set the margin can be misleading as the models may reach high accuracy fast. Additionally, we are only interested in the change of the measure therefore we also remove the norm of the input as it is constant throughout our experiments. The final estimation is similarly to the layer-wise ${l}_{\infty}$ norm:$${l}^{6}({X}_{te};{\theta}^{\left(t\right)}):=\frac{{S}_{Spect}\left({\theta}^{(t-1)}\right))}{{S}_{Spect}\left({\theta}^{\left(t\right)}\right)}l\left({X}_{tr},{Y}_{tr};{\theta}^{\left(t\right)}\right).$$

## 5. Conclusions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A

**Theorem**

**A1.**

**Proof.**

## Appendix B

**Theorem**

**A2.**

**Proof.**

## Appendix C

Layer | #Nodes | #Parameters | Variants |
---|---|---|---|

Input layer | 3072 | 0 | |

Hidden layer 1 | 100 | $100\times 3072+100$ | -/BN |

Hidden layer 1 | 100 | $100\times 100+100$ | -/BN/DO |

Hidden layer 1 | 100 | $100\times 100+100$ | -/BN/DO |

Hidden layer 1 | 100 | $100\times 100+100$ | -/BN/DO |

Output layer | 10 | $10\times 100+10$ |

## References

- Novak, R.; Bahri, Y.; Abolafia, D.A.; Pennington, J.; Sohl-Dickstein, J. Sensitivity and generalisation in neural networks: An empirical study. In Proceedings of the ICLR’18, Vancouver, BC, Canada, 20 April–3 May 2018. [Google Scholar]
- Sontag, E.D. VC dimension of neural networks. NATO ASI Ser. F Comput. Syst. Sci.
**1998**, 168, 69–96. [Google Scholar] - Bartlett, P.L.; Maass, W. Vapnik-Chervonenkis dimension of neural nets. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 2003; pp. 1188–1192. [Google Scholar]
- Liu, T.; Lugosi, G.; Neu, G.; Tao, D. Algorithmic stability and hypothesis complexity. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 2159–2167. [Google Scholar]
- Rolnick, D.; Tegmark, M. The power of deeper networks for expressing natural functions. arXiv
**2017**, arXiv:1705.05502. [Google Scholar] - Lin, H.W.; Tegmark, M. Why does deep and cheap learning work so well? arXiv
**2016**, arXiv:1608.08225. [Google Scholar] [CrossRef] [Green Version] - Stoudenmire, E.; Schwab, D.J. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, Proceedings of the NIPS 2016, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: North Adams, MA, USA, 2016; pp. 4799–4807. [Google Scholar]
- Amari, S.I. Neural learning in structured parameter spaces-natural Riemannian gradient. In Advances in Neural Information Processing Systems, Proceedings of the NIPS 1996, Denver, CO, USA, 3–5 December 1996; MIT Press: Cambridge, MA, USA, 1996; pp. 127–133. [Google Scholar]
- Kanwal, M.; Grochow, J.; Ay, N. Comparing information-theoretic measures of complexity in Boltzmann machines. Entropy
**2017**, 19, 310. [Google Scholar] [CrossRef] - Jacot, A.; Gabriel, F.; Hongler, C. Neural tangent kernel: Convergence and generalisation in neural networks. In Advances in Neural Information Processing Systems, Proceedings of the NIPS 2018, Montreal, QC, USA, 3–8 December 2018; MIT Press: Cambridge, MA, USA, 2018; pp. 8571–8580. [Google Scholar]
- Ay, N.; Jost, J.; Vân Lê, H.; Schwachhöfer, L. Information Geometry; Springer: Berlin/Heidelberg, Germany, 2017; Volume 64. [Google Scholar]
- Neyshabur, B.; Li, Z.; Bhojanapalli, S.; LeCun, Y.; Srebro, N. Towards understanding the role of over-parametrisation in generalisation of neural networks. In Proceedings of the ICLR 2018, Vancouver, BC, USA, 30 April–3 May 2018. [Google Scholar]
- Du, S.S.; Zhai, X.; Poczos, B.; Singh, A. Gradient descent provably optimises over-parameterised neural networks. arXiv
**2018**, arXiv:1810.02054. [Google Scholar] - Bartlett, P.L.; Foster, D.J.; Telgarsky, M.J. Spectrally-normalised margin bounds for neural networks. In Advances in Neural Information Processing Systems 30, Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: North Adams, MA, USA, 2017; pp. 6240–6249. [Google Scholar]
- Krogh, A.; Hertz, J.A. A simple weight decay can improve generalisation. In Advances in Neural Information Processing Systems, Proceedings of the NIPS 1992, Denver, CO, USA, 2–5 December 1992; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1992; pp. 950–957. [Google Scholar]
- Neyshabur, B.; Tomioka, R.; Srebro, N. In search of the real inductive bias: On the role of implicit regularisation in deep learning. In Proceedings of the ICLR’14, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning requires rethinking generalisation. arXiv
**2016**, arXiv:1611.03530. [Google Scholar] - Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.
**2014**, 15, 1929–1958. [Google Scholar] - Cao, Y.; Gu, Q. Generalisation bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems 32, Proceedings of the NIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: North Adams, MA, USA, 2019; pp. 10835–10845. [Google Scholar]
- Perez, L.; Wang, J. The effectiveness of data augmentation in image classification using deep learning. arXiv
**2017**, arXiv:1712.04621. [Google Scholar] - Wilson, A.C.; Roelofs, R.; Stern, M.; Srebro, N.; Recht, B. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems 30, Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: North Adams, MA, USA, 2017; pp. 4148–4158. [Google Scholar]
- Keskar, N.S.; Socher, R. Improving generalisation performance by switching from adam to sgd. arXiv
**2017**, arXiv:1712.07628. [Google Scholar] - Liang, T.; Poggio, T.; Rakhlin, A.; Stokes, J. Fisher-rao metric, geometry, and complexity of neural networks. arXiv
**2017**, arXiv:1711.01530. [Google Scholar] - Dinh, L.; Pascanu, R.; Bengio, S.; Bengio, Y. Sharp minima can generalise for deep nets. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1019–1028. [Google Scholar]
- Jiang, Y.; Krishnan, D.; Mobahi, H.; Bengio, S. Predicting the generalisation gap in deep networks with margin distributions. In Proceedings of the ICLR’19, New Orleasns, LA, USA, 6–9 May 2019. [Google Scholar]
- Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv
**2014**, arXiv:1412.6572. [Google Scholar] - Nagarajan, V.; Kolter, J.Z. Uniform convergence may be unable to explain generalisation in deep learning. In Advances in Neural Information Processing Systems 32, Proceedings of the NIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: North Adams, MA, USA, 2019; pp. 11615–11626. [Google Scholar]
- Werpachowski, R.; György, A.; Szepesvári, C. Detecting overfitting via adversarial examples. In Advances in Neural Information Processing Systems 32, Proceedings of the NIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: North Adams, MA, USA, 2019; pp. 7856–7866. [Google Scholar]
- Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning (still) requires rethinking generalisation. Commun. ACM
**2021**, 64, 107–115. [Google Scholar] [CrossRef] - Beritelli, F.; Capizzi, G.; Sciuto, G.L.; Napoli, C.; Scaglione, F. Rainfall estimation based on the intensity of the received signal in a LTE/4G mobile terminal by using a probabilistic neural network. IEEE Access
**2018**, 6, 30865–30873. [Google Scholar] [CrossRef] - Sciuto, G.L.; Napoli, C.; Capizzi, G.; Shikler, R. Organic solar cells defects detection by means of an elliptical basis neural network and a new feature extraction technique. Optik
**2019**, 194, 163038. [Google Scholar] [CrossRef] - Hanin, B.; Rolnick, D. Deep relu networks have surprisingly few activation patterns. In Advances in Neural Information Processing Systems 32, Proceedings of the NIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: North Adams, MA, USA, 2019; pp. 359–368. [Google Scholar]
- Raghu, M.; Poole, B.; Kleinberg, J.; Ganguli, S.; Dickstein, J.S. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 2847–2854. [Google Scholar]
- Arora, S.; Ge, R.; Neyshabur, B.; Zhang, Y. Stronger generalisation bounds for deep nets via a compression approach. arXiv
**2018**, arXiv:1802.05296. [Google Scholar] - Neyshabur, B.; Salakhutdinov, R.R.; Srebro, N. Path-SGD: Path-Normalised Optimisation in Deep Neural Networks. In Advances in Neural Information Processing Systems, Proceedings of the NIPS 2015, Montreal, QC, Canada, 7–12 December 2015; Curran Associates, Inc.: North Adams, MA, USA, 2015. [Google Scholar]
- Neyshabur, B.; Bhojanapalli, S.; McAllester, D.; Srebro, N. Exploring generalisation in deep learning. In Advances in Neural Information Processing Systems 30, Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: North Adams, MA, USA, 2017; pp. 5947–5956. [Google Scholar]
- Zhang, X.; Wu, D. Empirical Studies on the Properties of Linear Regions in Deep Neural Networks. arXiv
**2020**, arXiv:2001.01072. [Google Scholar] - Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; MIT Press: Cambridge, MA, USA; NYU: New York, NY, USA, 2009. [Google Scholar]
- Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 31. [Google Scholar]
- Carlini, N.; Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Dallas, TX, USA, 3 November 2017; pp. 3–14. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE
**1998**, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version] - Montufar, G.F.; Pascanu, R.; Cho, K.; Bengio, Y. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, Proceedings of the NIPS 2014, Montreal, QA, Canada, 8–13 December 2014; Curran Associates Inc.: North Adams, MA, USA, 2014. [Google Scholar]
- Lovász, L.; Vempala, S. Simulated annealing in convex bodies and an O*(n4) volume algorithm. J. Comput. Syst. Sci.
**2006**, 72, 392–417. [Google Scholar] [CrossRef] [Green Version] - Cousins, B.; Vempala, S. A practical volume algorithm. Math. Program. Comput.
**2016**, 8, 133–160. [Google Scholar] [CrossRef] - Chakrabarti, S.; Childs, A.M.; Hung, S.H.; Li, T.; Wang, C.; Wu, X. Quantum algorithm for estimating volumes of convex bodies. arXiv
**2019**, arXiv:1908.03903. [Google Scholar] - Hoeffding, W. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding; Springer: Berlin/Heidelberg, Germany, 1994; pp. 409–426. [Google Scholar]
- Russo, D.; Zou, J. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; PMLR: Westminster, UK, 2016; pp. 1232–1240. [Google Scholar]
- Russo, D.; Zou, J. How much does your data exploration overfit? Controlling bias via information usage. IEEE Trans. Inf. Theory
**2019**, 66, 302–323. [Google Scholar] [CrossRef] [Green Version] - Xu, A.; Raginsky, M. Information-theoretic analysis of generalisation capability of learning algorithms. In Advances in Neural Information Processing Systems 30, Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: North Adams, MA, USA, 2017. [Google Scholar]
- Neu, G.; Lugosi, G. Generalisation Bounds via Convex Analysis. arXiv
**2022**, arXiv:2202.04985. [Google Scholar] - Winkelbauer, A. Moments and absolute moments of the normal distribution. arXiv
**2012**, arXiv:1209.4340. [Google Scholar] - Ioffe, S.; Szegedy, C. Batch normalisation: Accelerating deep network training by reducing internal covariate shift. arXiv
**2015**, arXiv:1502.03167. [Google Scholar] - Kingma, D.; Ba, J. Adam: A method for stochastic optimisation. arXiv
**2014**, arXiv:1412.6980. [Google Scholar]

**Figure 2.**Change in upper bound (Equation (2)) induced by mean (

**a**), variance (

**b**), maximal norm (

**c**) and width of layers (

**d**).

**Figure 3.**Estimated test cross-entropy loss (

**a**) and test accuracy (

**b**) on the CIFAR-10 dataset with the ReLU network on different states (trained with stochastic gradient descent with batch size of 64, learning rate with $\alpha =0.05$ and weight decay with $\beta =0.0005$) based on layer-wise log-norm (${l}_{\infty})$ (Equation (6)), maximal (Equation (7)), empirical tangent sensitivity (Equation (8)), input–output sensitivity (Equation (9)), Fisher–Rao norm (Equation (10)) and spectral norm (Equation (11)).

**Table 1.**Mean Absolute Error (MAE) of estimated test cross-entropy loss and test accuracy of ReLU networks on different states on the CIFAR-10 dataset based on layer-wise log-norm (${l}_{\infty})$ (Equation (6)), maximal (Equation (7)), empirical tangent sensitivity (Equation (8)), change in input–output sensitivity (Equation (9)), in Fisher-Rao norm (Equation (10)) and in spectral norm (Equation (11)).

Estimation | Cross-Entropy | Accuracy |
---|---|---|

Layer-wise log-norm (p = inf) sens. (Equation (6)) | $5.3\times {10}^{-4}$ | $9.2\times {10}^{-3}$ |

Maximal sens. (Equation (7)) | $7.7\times {10}^{-4}$ | $9.8\times {10}^{-3}$ |

Empirical sens. (Equation (8)) | $4.2\times {10}^{-4}$ | $1.1\times {10}^{-2}$ |

Change in sens. (Equation (9)) | $2.7\times {10}^{-3}$ | $7.3\times {10}^{-2}$ |

Change in spectral norm (Equation (11)) | $1.2\times {10}^{-3}$ | $3.3\times {10}^{-2}$ |

Change in Fisher-Rao norm (Equation (10)) | $6.3\times {10}^{-4}$ | $1.5\times {10}^{-2}$ |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Daróczy, B.
Gaussian Perturbations in ReLU Networks and the Arrangement of Activation Regions. *Mathematics* **2022**, *10*, 1123.
https://doi.org/10.3390/math10071123

**AMA Style**

Daróczy B.
Gaussian Perturbations in ReLU Networks and the Arrangement of Activation Regions. *Mathematics*. 2022; 10(7):1123.
https://doi.org/10.3390/math10071123

**Chicago/Turabian Style**

Daróczy, Bálint.
2022. "Gaussian Perturbations in ReLU Networks and the Arrangement of Activation Regions" *Mathematics* 10, no. 7: 1123.
https://doi.org/10.3390/math10071123