# A Geometric Interpretation of Stochastic Gradient Descent Using Diffusion Metrics

Rita Fioresi, Pratik Chaudhari and Stefano Soatto

## Abstract


## 1. Introduction

## 2. Continuous-Time SGD and the Diffusion Matrix

## 3. Diffusion Metrics and General Relativity

_{ij} are the coefficients of D, and $\delta_{wz}$ is the Kronecker delta.

_{D}f so that Equation (11) becomes

## 4. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Riemannian Geometry

- $\nabla_{fX} Y = f\,\nabla_{X} Y$ for all functions $f$ on $M$;
- $\nabla_{X}(fY) = df(X)\,Y + f\,\nabla_{X} Y$.
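The two properties above can be checked numerically in the simplest setting. The following is a minimal sketch, assuming the flat connection on $\mathbb{R}^2$, where $\nabla_X Y$ is the directional derivative of $Y$ along $X$. The function $f$, the vector fields $X$ and $Y$, the step size `H`, and all helper names are hypothetical choices made for illustration and are not taken from the paper.

```python
# Numerical check of the two connection properties, assuming the flat
# connection on R^2 (an illustrative choice, not the paper's setting).

H = 1e-5  # finite-difference step (assumed)

def add_scaled(p, t, v):
    """Return the point p + t * v."""
    return (p[0] + t * v[0], p[1] + t * v[1])

def nabla(X, Y):
    """Flat covariant derivative: (nabla_X Y)(p) is the derivative of Y along X(p)."""
    def field(p):
        v = X(p)
        plus, minus = Y(add_scaled(p, H, v)), Y(add_scaled(p, -H, v))
        return tuple((a - b) / (2 * H) for a, b in zip(plus, minus))
    return field

def df(f, X):
    """The function p -> df(X)(p), the derivative of f along X(p)."""
    def func(p):
        v = X(p)
        return (f(add_scaled(p, H, v)) - f(add_scaled(p, -H, v))) / (2 * H)
    return func

def times(f, X):
    """The vector field fX."""
    return lambda p: tuple(f(p) * c for c in X(p))

# Hypothetical test function and vector fields.
f = lambda p: p[0] ** 2 + 3.0 * p[1]
X = lambda p: (p[1], -p[0])
Y = lambda p: (p[0] * p[1], p[0] + p[1] ** 2)

def residuals(p):
    """Largest componentwise violation of each property at the point p."""
    # Property 1: nabla_{fX} Y = f nabla_X Y
    lhs1 = nabla(times(f, X), Y)(p)
    rhs1 = times(f, nabla(X, Y))(p)
    # Property 2: nabla_X (fY) = df(X) Y + f nabla_X Y
    lhs2 = nabla(X, times(f, Y))(p)
    rhs2 = tuple(df(f, X)(p) * y + f(p) * d
                 for y, d in zip(Y(p), nabla(X, Y)(p)))
    err = lambda a, b: max(abs(u - v) for u, v in zip(a, b))
    return err(lhs1, rhs1), err(lhs2, rhs2)

r1, r2 = residuals((0.7, -1.2))
print(r1, r2)  # both residuals should be at finite-difference noise level
```

On a curved manifold the same two identities hold, but `nabla` would also need the Christoffel-symbol correction term, which this flat-space sketch omits.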


Architecture | $d = \lvert\mathrm{Weights}\rvert$ | $N = \lvert\mathrm{Data}\rvert$, CIFAR | $N = \lvert\mathrm{Data}\rvert$, SVHN |
---|---|---|---|
ResNet | 1.7 M | 60 K | 600 K |
Wide ResNet | 11 M | 60 K | 600 K |
DenseNet (k = 12) | 1 M | 60 K | 600 K |
DenseNet (k = 24) | 27.2 M | 60 K | 600 K |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Fioresi, R.; Chaudhari, P.; Soatto, S.
A Geometric Interpretation of Stochastic Gradient Descent Using Diffusion Metrics. *Entropy* **2020**, *22*, 101.
https://doi.org/10.3390/e22010101
