# Representing Deep Neural Networks Latent Space Geometries with Graphs


## Abstract


## 1. Introduction

## 2. Related Work

**Enforcing properties on latent spaces:** A core goal of our work is to enforce desirable properties on the latent spaces of DL architectures, more precisely (i) consistency with a teacher network, (ii) class disentangling, and (iii) smooth variation of geometries over the architecture. In the literature, one can find two types of approaches to enforce properties on latent spaces: (i) directly designing specific modules or architectures [15,16] and (ii) modifying the training procedure [11,13]. The main advantage of the latter approach is that one can draw from the vast literature on DL architecture design [17,18] and use an existing architecture instead of having to design a new one.

**Latent space graphs:** In the past few years, there has been a growing interest in proposing deep neural network layers able to process graph-based inputs, also known as graph neural networks. For example, works such as References [20,21,22,23] show how one can use convolutions defined in graph domains to improve the performance of DL methods dealing with graph signals as inputs. The proposed methodology differs from these works in that it does not require inputs to be defined on an explicit graph. The graphs we consider here (LGGs) are proxies for the latent data geometry of the intermediate representations. Contrary to classical graph neural networks, the purpose of the proposed methodology is to study latent representations using graphs, instead of processing graph-supported inputs. Some recent work can be viewed as following ideas similar to those introduced in this paper, with applications in areas such as knowledge distillation [24,25], robustness [15], interpretability [26], and generalization [27]. Despite sharing a common methodology, these works are not explicitly linked. This can be explained by the fact that they were introduced independently around the same time and have different aims. We provide more details about how they are connected with our proposed methodology in the following paragraphs.

**Knowledge distillation:** Knowledge distillation is a DL compression method, where the goal is to use the knowledge acquired by a pre-trained architecture, called the teacher, to train a smaller one, called the student. Initial works on knowledge distillation considered each input independently from the others, an approach known as Individual Knowledge Distillation (IKD) [11,12,28]. As such, the student architecture mimics the intermediate representations of the teacher for each input used for training. The main drawback of IKD lies in the fact that it forces intermediate representations of the student to be of the same dimensions as those of the teacher. To deploy IKD in broader contexts, authors have proposed to disregard some of these intermediate representations [12] or to perform some kind of dimensionality reduction [28].

**Latent embeddings:** In the context of classification, the most common DL setting is to train the architecture end-to-end with an objective function that directly generates a decision at the output. Instead, it can be beneficial to output representations well suited to be processed by a simple classifier (e.g., logistic regression). This framework is called feature extraction or latent embeddings, as the goal is to generate representations that are easy to classify, without directly enforcing the way they should be used for classification. Such a framework is very interesting if the DL architecture is not going to be used solely for classification but also for related tasks, such as person re-identification [13], transfer learning [30], and multi-task learning [31].

**Robustness of DL architectures:** In this work, we are interested in improving the robustness of DL architectures. We define robustness as the ability of the network to correctly classify inputs even if they are subject to small perturbations. These perturbations may be adversarial (designed exactly to force misclassification) [34] or incidental (due to external factors, such as hardware defects or weather artifacts) [7]. The method we present in Section 4.3 is able to increase the robustness of the architecture in both cases. Multiple works in the literature aim to improve the robustness of DL architectures, following two main approaches: (i) training set augmentation [35] and (ii) improved training procedures. Our contribution can be seen as an example of the latter approach, but it can be combined with augmentation-based methods, leading to an increase in performance compared to using the techniques separately [8].

## 3. Methodology

#### 3.1. Deep Learning

**Definition 1.**

**Definition 2.**

**Definition 3.**

Only a subset of $\mathbb{D}$, called the “**training set**” (${\mathbb{D}}_{\mathrm{train}}$), is used to train the DL architecture. The reason to select a subset of $\mathbb{D}$ is that it is hard to predict the generalization ability of the trained function f. Generalization usually refers to the ability of f to predict the correct output for inputs x not in ${\mathbb{D}}_{\mathrm{train}}$. A simple way to evaluate generalization consists of counting the proportion of elements in $\mathbb{D}-{\mathbb{D}}_{\mathrm{train}}$ that are correctly classified using f. Obviously, this measure of generalization is not ideal, in the sense that it only checks generalization inside $\mathbb{D}$. This is why a network that seems to generalize well may still have trouble classifying inputs that are subject to deviations. In this case, the DL architecture is said not to be robust. We delve into more details on robustness in Section 4.3.
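This measure of generalization amounts to a simple held-out accuracy. A minimal sketch (the function name and the toy classifier are ours, purely for illustration):

```python
import numpy as np

def generalization_accuracy(f, X_test, y_test):
    """Proportion of held-out inputs (elements of D minus D_train)
    that the trained function f classifies correctly."""
    predictions = np.array([f(x) for x in X_test])
    return float(np.mean(predictions == np.array(y_test)))

# Toy example: a (hypothetical) classifier that always predicts class 0
# is right on exactly half of this held-out set.
f = lambda x: 0
X_test = [np.zeros(3), np.ones(3), 2 * np.ones(3), 3 * np.ones(3)]
y_test = [0, 0, 1, 1]
print(generalization_accuracy(f, X_test, y_test))  # 0.5
```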

#### 3.2. Graph Signal Processing

**Definition 4.**

1. The finite set $\mathbb{V}$ is composed of vertices ${v}_{1},{v}_{2},\cdots$.
2. The set $\mathbb{E}$ is composed of pairs of vertices of the form $({v}_{i},{v}_{j})$, called edges.

We define the **degree matrix D** of the graph as the diagonal matrix of vertex degrees: ${D}_{ii}={\sum }_{j}{w}_{ij}$.

**Definition 5.**

#### 3.3. Proposed Methodology

1. Generate a symmetric square matrix $\mathcal{A}\in {\mathbb{R}}^{\left|\mathbb{V}\right|\times \left|\mathbb{V}\right|}$ using a similarity measure between intermediate representations, at a given depth ℓ, of data points in X. In this work, we choose the cosine similarity when data is non-negative, and an RBF similarity kernel based on the L2 distance otherwise.
2. Threshold $\mathcal{A}$ so that each vertex is connected only to its k nearest neighbors.
3. Symmetrize the resulting thresholded matrix: two vertices i and j are connected with edge weights ${w}_{ij}={w}_{ji}$ as long as one of the vertices was among the k nearest neighbors of the other.
4. (Optional) Normalize $\mathcal{A}$ using its diagonal degree matrix D: $\hat{\mathcal{A}}={D}^{-\frac{1}{2}}\mathcal{A}{D}^{-\frac{1}{2}}$.
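The four steps above can be sketched in NumPy as follows (a minimal sketch: the function name `build_lgg` and the default k are our own choices, and we only show the cosine-similarity variant of step 1):

```python
import numpy as np

def build_lgg(H, k=3, normalize=True):
    """Build a Latent Geometry Graph (LGG) adjacency matrix from a batch of
    intermediate representations H (one row per data point)."""
    n = H.shape[0]
    # Step 1: similarity matrix (cosine similarity, assuming non-negative data).
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    A = Hn @ Hn.T
    np.fill_diagonal(A, 0.0)  # no self-loops
    # Step 2: keep, for each vertex, only its k most similar neighbors.
    kept = np.zeros_like(A)
    for i in range(n):
        neighbors = np.argsort(A[i])[-k:]
        kept[i, neighbors] = A[i, neighbors]
    # Step 3: symmetrize -- keep an edge if either endpoint selected the other.
    A = np.maximum(kept, kept.T)
    # Step 4 (optional): symmetric normalization D^{-1/2} A D^{-1/2}.
    if normalize:
        d = A.sum(axis=1)
        d_inv_sqrt = np.zeros_like(d)
        d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
        A = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A
```

For general (possibly negative) data, step 1 would instead use an RBF kernel on pairwise L2 distances, e.g., `np.exp(-gamma * squared_distances)` for some bandwidth `gamma` (a choice the paper leaves open here).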

**Definition 6.**

**Remark 1.**

#### 3.3.1. Toy Example

#### 3.3.2. Dimensionality and LGGs

## 4. Applications

#### 4.1. Knowledge Distillation

**Proposed approach (GKD):** Let us consider a layer in the teacher architecture, and the corresponding one in the student architecture. Considering a batch of inputs, we propose to build the corresponding graphs ${\mathcal{G}}_{T}$ and ${\mathcal{G}}_{S}$ capturing their geometries, as described in Section 3.3.
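One plausible way to compare the two graphs, assuming both LGGs are built over the same batch, is a Frobenius distance between their adjacency matrices (an illustrative sketch; the exact distance used in Reference [10] may differ):

```python
import numpy as np

def gkd_loss(A_teacher, A_student):
    """Illustrative graph knowledge distillation objective: squared Frobenius
    distance between teacher and student LGG adjacency matrices built from
    the same batch. A small value means the student reproduces the teacher's
    latent geometry, even if the two layers have different dimensions."""
    return float(np.sum((A_teacher - A_student) ** 2))
```

Note that, because the loss compares batch-sized graphs rather than the representations themselves, it places no constraint on the student layer's dimensionality, unlike IKD.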

**Experiments:** To illustrate the gains we can achieve using GKD, we ran the following experiment. Starting from a WideResNet28-1 [39] teacher architecture with many parameters, for which an error rate of 7.27% is achieved on CIFAR-10, we first train a student without KD, called the baseline, containing roughly 4 times fewer parameters. The resulting error rate is 10.34%. We then compared RKD and GKD. Results in Table 1 show that GKD roughly doubles the gains of RKD over the baseline.

#### 4.2. Latent Embeddings

**Methodology:** Let us consider the representations obtained at the output of a DL architecture. We build the corresponding LGG $\mathcal{G}$ as described in Section 3.3. Then, we propose to use the label variation on this LGG as the objective function to train the network. By definition, minimizing the label variation leads to maximizing the distances between outputs of different classes. Compared to the classic cross-entropy loss, label variation as an objective function does not suffer from the same drawbacks: (i) it does not force the output dimension to match the number of classes, (ii) it can result in distinct clusters in the output domain for a same class (as it only deals with distances between examples from different classes, which can be seen as a form of negative sampling), and (iii) it can leverage the initial distribution of representations at the output of the network function.
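A minimal sketch of label variation as a training objective (for simplicity we use a dense RBF similarity over the batch instead of a thresholded LGG, and the helper name is ours; see Reference [14] for the exact formulation):

```python
import numpy as np

def label_variation(outputs, labels):
    """Sum of pairwise output similarities restricted to pairs of examples
    from *different* classes. Minimizing it pushes outputs of different
    classes away from each other, without forcing same-class outputs to
    collapse into a single cluster."""
    diffs = outputs[:, None, :] - outputs[None, :, :]
    W = np.exp(-np.sum(diffs ** 2, axis=-1))        # RBF similarities
    different_class = labels[:, None] != labels[None, :]
    return float(np.sum(W * different_class) / 2)   # each pair counted once
```

As expected, a batch whose classes are well separated in the output space incurs a much smaller loss than one whose classes overlap.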

**Experiments:** To evaluate the performance of label variation as an objective function, we perform experiments with the CIFAR-10 dataset [40], using ResNet18 [17] as our DL architecture. In Table 2, we report the performance of the deep architectures trained with the proposed loss compared with cross-entropy. We also report the relative Mean Corruption Error (MCE), which is a standard measure of robustness towards corruptions of the inputs over the CIFAR-10 corruption benchmark [7], where smaller values of MCE are better. We observe that label variation is a viable alternative to cross-entropy in terms of raw test accuracy, and that it leads to significantly better robustness. More details and experiments can be found in Reference [14], where we particularly show how the initial distribution of data points is preserved throughout the learning process (we also make this result available in Appendix E).

#### 4.3. Improving DL Robustness

**Methodology:** Formally, let ℓ denote the depth of an intermediate representation in the architecture. Let us consider a batch of inputs and build the corresponding LGG ${\mathcal{G}}_{\ell}$ as described in Section 3.3. The proposed regularizer penalizes abrupt changes of the label signal's smoothness between LGGs at consecutive depths; its exact expression is given in Reference [8].
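The idea can be sketched as follows (our own simplification: we measure smoothness as $\mathrm{tr}({Y}^{\top}LY)$ for one-hot labels Y and penalize its absolute change across depths; the exact form in Reference [8] may differ):

```python
import numpy as np

def smoothness(A, Y):
    """Label variation (smoothness) of one-hot labels Y on an LGG with
    adjacency A, computed as tr(Y^T L Y) with Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return float(np.trace(Y.T @ L @ Y))

def smoothness_variation_regularizer(adjacencies, Y):
    """Sketch of the regularizer's spirit: sum of absolute changes of
    label-signal smoothness between LGGs at consecutive depths, so that
    latent geometries vary smoothly across the architecture."""
    s = [smoothness(A, Y) for A in adjacencies]
    return sum(abs(s[i + 1] - s[i]) for i in range(len(s) - 1))
```

When the geometry is identical at every depth, the penalty vanishes; large jumps in how well the graph separates classes between consecutive layers are what get penalized.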

**Experiments:** To stress the ability of the proposed regularizer to improve robustness, we consider a ResNet18 that we trained on CIFAR-10. We consider multiple settings. In the first one, we add adversarial noise to the inputs [34] and compare the obtained accuracy. In the second one, we consider agnostic corruptions (i.e., corruptions that do not depend on the network function) and report the relative MCE [7]. Results are presented in Table 3. The proposed regularizer performs better than the raw baseline and existing alternatives in the literature [6]. More details can be found in Reference [8].

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Glossary

- **DL**: Deep Learning
- **GSP**: Graph Signal Processing
- **LGG**: Latent Geometry Graph
- **KD**: Knowledge Distillation
- **IKD**: Individual Knowledge Distillation
- **RKD**: Relational Knowledge Distillation
- **GKD**: Graph Knowledge Distillation
- **MCE**: Mean Corruption Error

## Appendix B. CIFAR-10 Dataset

## Appendix C. Details on the Creation of the Illustrative Example

## Appendix D. Complexity of Graph Similarity Computation

## Appendix E. Comparison of Embedding Evolution between Label Variation and Cross Entropy

**Figure A1.** Two-dimensional embeddings of the CIFAR-10 training set on a DNN learned using the label variation loss (**top row**) and the cross-entropy loss (**bottom row**). Both networks have the same architecture and hyperparameters.

## References

1. Tan, M.; Le, Q.V. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946.
2. Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding back-translation at scale. arXiv 2018, arXiv:1808.09381.
3. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
5. LeCun, Y. The Power and Limits of Deep Learning. Res.-Technol. Manag. 2018, 61, 22–27.
6. Cisse, M.; Bojanowski, P.; Grave, E.; Dauphin, Y.; Usunier, N. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
7. Hendrycks, D.; Dietterich, T. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
8. Lassance, C.; Gripon, V.; Ortega, A. Laplacian Networks: Bounding Indicator Function Smoothness for Neural Networks Robustness. APSIPA Transactions on Signal and Information Processing, 8 January 2021, to appear.
9. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational Knowledge Distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3967–3976.
10. Lassance, C.; Bontonou, M.; Hacene, G.B.; Gripon, V.; Tang, J.; Ortega, A. Deep geometric knowledge distillation with graphs. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 8484–8488.
11. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. In Proceedings of the Neural Information Processing Systems 2014 Deep Learning Workshop, Montreal, QC, Canada, 8–13 December 2014.
12. Koratana, A.; Kang, D.; Bailis, P.; Zaharia, M. LIT: Learned intermediate representation training for model compression. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3509–3518.
13. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737.
14. Bontonou, M.; Lassance, C.; Hacene, G.B.; Gripon, V.; Tang, J.; Ortega, A. Introducing Graph Smoothness Loss for Training Deep Learning Architectures. In Proceedings of the 2019 IEEE Data Science Workshop (DSW), Minneapolis, MN, USA, 2–5 June 2019; pp. 160–164.
15. Svoboda, J.; Masci, J.; Monti, F.; Bronstein, M.; Guibas, L. PeerNets: Exploiting Peer Wisdom Against Adversarial Attacks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
16. Qian, H.; Wegman, M.N. L2-Nonexpansive Neural Networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
18. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
19. Shuman, D.I.; Narang, S.K.; Frossard, P.; Ortega, A.; Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 2013, 30, 83–98.
20. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
21. Vialatte, J.C. On Convolution of Graph Signals and Deep Learning on Graph Domains. Ph.D. Thesis, IMT Atlantique, Nantes, France, 2018.
22. Gama, F.; Isufi, E.; Leus, G.; Ribeiro, A. Graphs, Convolutions, and Neural Networks: From Graph Filters to Graph Neural Networks. IEEE Signal Process. Mag. 2020, 37, 128–138.
23. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020.
24. Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; Duan, Y. Knowledge Distillation via Instance Relationship Graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7096–7104.
25. Lee, S.; Song, B. Graph-based knowledge distillation by multi-head attention network. arXiv 2019, arXiv:1907.02226.
26. Anirudh, R.; Bremer, P.; Sridhar, R.; Thiagarajan, J. Influential Sample Selection: A Graph Signal Processing Approach; Technical Report; Lawrence Livermore National Lab. (LLNL): Livermore, CA, USA, 2017.
27. Gripon, V.; Ortega, A.; Girault, B. An Inside Look at Deep Neural Networks using Graph Signal Processing. In Proceedings of the ITA, San Diego, CA, USA, 11–16 February 2018.
28. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
29. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
30. Hu, Y.; Gripon, V.; Pateux, S. Exploiting Unsupervised Inputs for Accurate Few-Shot Classification. arXiv 2020, arXiv:2001.09849.
31. Ruder, S. An overview of multi-task learning in deep neural networks. arXiv 2017, arXiv:1706.05098.
32. Dietterich, T.G.; Bakiri, G. Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 1994, 2, 263–286.
33. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
34. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.
35. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
36. Hacene, G.B. Processing and Learning Deep Neural Networks on Chip. Ph.D. Thesis, Ecole Nationale Supérieure Mines-Télécom Atlantique, Nantes, France, 2019.
37. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
38. Belkin, M.; Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003, 15, 1373–1396.
39. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146.
40. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 15 January 2021).
41. Kalofolias, V.; Perraudin, N. Large Scale Graph Learning From Smooth Signals. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
42. Shekkizhar, S.; Ortega, A. Graph Construction from Data by Non-Negative Kernel Regression. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3892–3896.
43. Torralba, A.; Fergus, R.; Freeman, W.T. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1958–1970.

**Figure 2.** Graph representation example of 20 examples from CIFAR-10, from the input space (left) to the penultimate layer of the network (right). The different vertex colors represent the classes of the data points. To help the visualization, we only depict the edges that are important for the variation measure (i.e., edges between elements of distinct classes). Note how there are many more edges at the input (**a**) and how the number of edges decreases as we go deeper in the architecture (**b**,**c**).

**Table 1.** Knowledge distillation results on CIFAR-10. Best student results are presented in bold font.

| Method | Error | Gain | Relative Size |
|---|---|---|---|
| Teacher | 7.27% | — | 100% |
| Baseline (student without KD) | 10.34% | — | 27% |
| RKD-D [9] | 10.05% | 0.29% | 27% |
| GKD (Ours) [10] | **9.71%** | **0.63%** | 27% |

**Table 2.** Comparison between the cross-entropy and label variation functions. Best results are presented in bold font.

| Cost Function | Clean Test Error | Relative MCE |
|---|---|---|
| Cross-entropy | **5.06%** | 100 |
| Label Variation (ours) [14] | 5.63% | **90.33** |

**Table 3.** Comparison of different methods on their clean error rate and robustness. Best results are presented in bold font.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lassance, C.; Gripon, V.; Ortega, A. Representing Deep Neural Networks Latent Space Geometries with Graphs. *Algorithms* **2021**, *14*, 39.
https://doi.org/10.3390/a14020039
