Representing Deep Neural Networks Latent Space Geometries with Graphs

Deep Learning (DL) has attracted a lot of attention for its ability to reach state-of-the-art performance in many machine learning tasks. The core principle of DL methods consists in training composite architectures in an end-to-end fashion, where inputs are associated with outputs trained to optimize an objective function. Because of their compositional nature, DL architectures naturally exhibit several intermediate representations of the inputs, which belong to so-called latent spaces. When treated individually, these intermediate representations are most of the time unconstrained during the learning process, as it is unclear which properties should be favored. However, when processing a batch of inputs concurrently, the corresponding set of intermediate representations exhibits relations (what we call a geometry) on which desired properties can be sought. In this work, we show that it is possible to introduce constraints on these latent geometries to address various problems. In more detail, we propose to represent geometries by constructing similarity graphs from the intermediate representations obtained when processing a batch of inputs. By constraining these Latent Geometry Graphs (LGGs), we address the three following problems: i) Reproducing the behavior of a teacher architecture is achieved by mimicking its geometry, ii) Designing efficient embeddings for classification is achieved by targeting specific geometries, and iii) Robustness to deviations on inputs is achieved by enforcing smooth variation of geometry between consecutive latent spaces. Using standard vision benchmarks, we demonstrate the ability of the proposed geometry-based methods to solve the considered problems.


Introduction
In recent years, Deep Learning (DL) methods have achieved state of the art performance in a vast range of machine learning tasks, including image classification [1] and multilingual automatic text translation [2]. A DL architecture is built by assembling elementary operators called layers [3], some of which contain trainable parameters. Due to their compositional nature, DL architectures exhibit intermediate representations when they process a given input. These intermediate representations lie in so-called latent spaces.
DL architectures are typically trained to minimize a loss function computed at their output. This is performed using a variant of the stochastic gradient descent algorithm that is backpropagated through the multiple layers to update the corresponding parameters. To accelerate the training procedure, it is very common to process batches of inputs concurrently. In such a case, a global criterion over the corresponding batch (e.g. the average loss) is backpropagated.
The training procedure of DL architectures is thus performed in an end-to-end fashion. This end-to-end characteristic of DL refers to the fact that intermediate representations are unconstrained during training. This property has often been considered an asset in the literature [4], which presents deep learning as a way to replace "hand-crafted" features by automatic differentiation. As a matter of fact, using these hand-crafted features as intermediate representations can lead to sub-optimal solutions [5]. On the other hand, completely removing all constraints on the intermediate representations can cause the learning procedure to exhibit unwanted behavior, such as susceptibility to deviations of the inputs [6][7][8], or redundant features [9,10].
In this work we propose a new methodology that aims at enforcing desirable properties on intermediate representations. Since training is organized into batches, we achieve this goal by constraining what we call the latent geometry of data points within a batch. This geometry refers to the relative position of data points within a specific batch, based on their representation in a given layer. While there are many problems for which specific intermediate layer properties are beneficial, in this work we consider three examples. First, we explore compression via knowledge distillation (KD) [9][10][11][12], where the goal is to supervise the training procedure of a small DL architecture (called the student) with a larger one (called the teacher). Second, we study the design of efficient embeddings for classification [13,14], in which the aim is to train the DL architecture to extract features that are useful for classification (and could be used by a different classifier) rather than using classification accuracy as the sole performance metric. Finally, we develop techniques to increase the robustness of DL architectures to deviations of their inputs [6][7][8].
To address the three above-mentioned problems, we introduce a common methodology that exploits the latent geometries of a DL architecture. More precisely, we propose to formalize latent geometries by defining similarity graphs. In these graphs, vertices are data points in a batch and the edge weight between two vertices is a function of the relative similarity between the corresponding intermediate representations at a given layer. We call such a graph a latent geometry graph (LGG). In this paper we show that intermediate representations with desirable properties can be obtained by imposing constraints on their corresponding LGGs. In the context of KD, similarity between teacher and student is favored by minimizing the discrepancy between their respective LGGs. For efficient embedding designs, we propose an LGG-based objective function that favors disentanglement of the classes. Lastly, to improve robustness, we enforce smooth variations between LGGs corresponding to pairs of consecutive layers at any stage, from input to output, in a given architecture. Enforcing smooth variations between the LGGs of consecutive layers provides some protection against noisy inputs, since small changes in the input are less likely to lead to a sharp transition of the network's decision. This paper is structured as follows. We first discuss related work in Section 2 and introduce the proposed methodology in Section 3. We then present the three applications, namely knowledge distillation, design of classification feature vectors, and robustness improvements, in Section 4. Finally, we present a summary and a discussion of future work in Section 5.

Related work
As previously mentioned, in this work we are interested in using graphs to ensure that latent spaces of DL architectures have some desirable properties. The various approaches we introduce in this paper are based on our previous contributions [8,10,14]. However, in this paper they are presented for the first time using a unified methodology and formalism. While we deployed these ideas already in a few applications, by presenting them in a unified form our goal is to provide a broader perspective of these tools, and to encourage their use for other problems.
In what follows, we introduce related work found in the literature. We start by comparing our approach with others that also aim at enforcing properties on latent spaces. Then we discuss approaches that mix graphs and intermediate (or latent) representations in DL architectures. Finally we discuss methods related to the applications highlighted in this work: i) knowledge distillation, ii) latent embeddings, and iii) robustness.

Latent embeddings:
In the context of classification, the most common DL setting is to train the architecture end-to-end with an objective function that directly generates a decision at the output. Instead, it can be beneficial to output representations well suited to be processed by a simple classifier (e.g. logistic regression). This framework is called feature extraction or latent embeddings, as the goal is to generate representations that are easy to classify, but without directly enforcing the way they should be used for classification. Such a framework is very interesting if the DL architecture is not going to be used solely for classification, but also for related tasks such as person re-identification [13], transfer learning [28] and multi-task learning [29].
Many authors have proposed ways to train deep feature extractors. One influential example is [13], where the authors use triplets to perform Deep Metric Learning. In each triplet, the first element is the example being trained on, the second is a positive example (e.g., same class) and the last is a negative one (e.g., different class). The aim is to learn representations in which the first element is closer to the positive example than to the negative one. In contrast, our method considers all connections between examples of different classes, and can focus solely on separation (pushing all negatives far away) instead of clustering (pulling all positives close), which, as we argue in Section 4.2, should lead to more robust embeddings.
Other solutions for generating latent embeddings propose alternatives to the classical arg max operator used to perform the decision at the output of a DL architecture. This can be done either by changing the output so that it is based on error correcting codes [30] or is smoothed, either explicitly [31] or by using the prior knowledge of another network [11].
Robustness of DL architectures: In this work, we are interested in improving the robustness of DL architectures. We define robustness as the ability of the network to correctly classify inputs even if they are subject to small perturbations. These perturbations may be adversarial (designed exactly to force misclassification) [32] or incidental (due to external factors such as hardware defects or weather artifacts) [7]. The method we present in Section 4.3 is able to increase the robustness of the architecture in both cases. Multiple works in the literature aim to improve the robustness of DL architectures following two main approaches: i) training set augmentation [33], and ii) improved training procedures. Our contribution can be seen as an example of the latter approach, but can be combined with augmentation-based methods, leading to an increase of performance compared to using the techniques separately [8].
A similar idea was proposed in [15], where the authors exploit graph convolutional layers in order to improve robustness of DL architectures applied to non-graph domains. Their approach can be described as denoising the (test) input by using the training data. This differs from the method we propose in Section 4.3, which focuses on generating a smooth network function. As such, the proposed method is more general as it is less dependent on the training set.

Methodology
In this section we first introduce basic concepts from deep learning and graph signal processing (Sections 3.1 and 3.2) and then our proposed methodology (Section 3.3).

Deep learning
We start by introducing basic deep learning (DL) concepts, referring the reader to [3] for a more in-depth overview. A DL architecture is an assembly of layers that can be mathematically described as a function f, often referred to as the "network function" in the literature, that associates an input tensor x with an output tensor ŷ = f(x). This function is characterized by a large number of trainable parameters θ. In the literature, many different approaches have been proposed to assemble layers into such network functions [17]. While layers are the basic unit, it is also common to describe architectures in terms of a series of blocks, where a block is typically a small set of connected layers. This block representation allows us to encapsulate non-sequential behaviors, such as the residual connections of residual networks (ResNets) [17]: even though layers are connected in a more complex way, the blocks remain sequential and the network function can be represented as a series of cascading operations f = f_L ∘ f_{L-1} ∘ … ∘ f_1, where each subfunction f_ℓ can represent a layer, or a block comprising several layers, depending on the underlying DL architecture. For example, in the context of ResNets [17], the architecture is composed of blocks as depicted in Figure 1.
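As a toy illustration of this cascade view, the snippet below composes hypothetical affine-plus-ReLU "blocks" into a network function. This is not a ResNet, only the compositional skeleton f = f_L ∘ … ∘ f_1 described above; the helper names are ours.

```python
import numpy as np

def make_block(W):
    # A toy "block": affine map followed by a ReLU non-linearity.
    return lambda x: np.maximum(W @ x, 0.0)

def compose(blocks):
    # Chain the blocks so the output of one feeds the next,
    # i.e., f = f_L o ... o f_1.
    def f(x):
        for block in blocks:
            x = block(x)
        return x
    return f

rng = np.random.default_rng(0)
blocks = [make_block(rng.standard_normal((4, 4))) for _ in range(3)]
f = compose(blocks)
y = f(rng.standard_normal(4))
```

The composed `f` behaves like any single function: it maps an input tensor to an output tensor, while every intermediate `x` inside the loop is one of the latent representations discussed in this paper.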
Initially, the parameters θ of f are typically drawn at random. They are then optimized during the training phase so that f achieves a desirable performance for the problem under consideration. The dimension of the output of f depends on the task. In the context of classification, it is common to design the network function such that the output dimension equals the number of classes in the classification problem. In this case, for a given input, each coordinate of the final layer output is used as an estimate of the likelihood that the input belongs to the corresponding class. A network function correctly maps an input to its class if the output of the network function, ŷ, is close to the target vector y of the correct class.
Definition 2 (target vector). Each sample of the training set is associated with a target vector of dimension C, where C is the total number of classes. Thus, the target vector of a sample of class c is the binary vector containing 1 at coordinate c and 0 at all other coordinates.
It is important to distinguish the target vector from the label indicator vector. The latter is defined over a batch of data points, instead of individually for each sample, as follows: the label indicator vector v_c of class c is the binary vector with one coordinate per sample in the batch, containing 1 at coordinate i if the i-th sample of the batch belongs to class c, and 0 otherwise. The purpose of a classification problem is to obtain a network function f that outputs the correct class decision for any valid input x. In practice, it is often the case that the set of valid inputs is not finite, and yet we are only given a "small" number of pairs (x, y), where y is the output associated with x. The set of these pairs is called the dataset D. During the training phase, the parameters are tuned using D and an objective function L that measures the discrepancy between the outputs of the network function and the expected target vectors, i.e., the discrepancy between ŷ = f(x) and y. It is common to decompose the function f into a feature extractor F and a classifier C as follows: f = C ∘ F. In a classification task, the objective function is computed over the outputs of the classifier and the gradients are backpropagated to generate a good feature extractor. Alternatively, to ensure that good latent embeddings are produced, one can first optimize the feature extractor part of the architecture and then train a classifier on the resulting features (which may remain fixed or be fine-tuned) [13,14]. We introduce an objective function designed for efficient latent embedding training in Section 4.2.
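The two objects above are easy to confuse, so here is a minimal NumPy sketch of both; the helper names are ours.

```python
import numpy as np

def target_vector(c, num_classes):
    # One-hot target vector for a single sample of class c (Definition 2).
    t = np.zeros(num_classes)
    t[c] = 1.0
    return t

def label_indicator_matrix(labels, num_classes):
    # V[i, c] = 1 iff the i-th sample of the batch belongs to class c.
    # Each column of V is the label indicator vector v_c of one class.
    V = np.zeros((len(labels), num_classes))
    V[np.arange(len(labels)), labels] = 1.0
    return V

labels = [0, 2, 1, 2]               # a batch of 4 samples, 3 classes
V = label_indicator_matrix(labels, 3)
```

Note the difference in scope: `target_vector` is attached to one sample, while the columns of `V` are signals defined over the whole batch, which is what the graph-based definitions of Section 3 operate on.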
Usually, the objective function is a loss function. It is minimized over a subset of the dataset that we call the "training set" (D_train). The reason to select a subset of D to train the DL architecture is that it is hard to predict the generalization ability of the trained function f. Generalization usually refers to the ability of f to predict the correct output for inputs x not in D_train. A simple way to evaluate generalization consists in counting the proportion of elements in D − D_train that are correctly classified by f. Obviously, this measure of generalization is not ideal, in the sense that it only checks generalization inside D. This is why it is possible for a network that seems to generalize well to have trouble classifying inputs that are subject to deviations. In this case, the DL architecture is said not to be robust. We discuss robustness in more detail in Section 4.3. In summary, a network function is initialized at random, its parameters are then tuned using a variant of the stochastic gradient descent algorithm on a dataset D_train, and finally, training performance is evaluated on a validation set. Problematically, the best performance of deep learning architectures strongly depends on the total number of parameters they contain [34]. In particular, it has been hypothesized that this dependence comes from the difficulty of finding a good gradient trajectory when the parameter space dimension is small [35]. A common way to circumvent this problem is to rely on knowledge distillation, where a network with a large number of parameters is used to supervise the training of a smaller one. We introduce a graph-based method for knowledge distillation in Section 4.1.

Graph Signal Processing
As mentioned in the introduction, graphs are ubiquitous objects used to represent relationships (called edges) between elements of a countable set (called vertices). In this section, we introduce the framework of Graph Signal Processing (GSP), which is central to our proposed methodology. Let us first formally define graphs: Definition 4 (graph). A graph G is a tuple of sets ⟨V, E⟩, where V is the set of vertices and E ⊆ V × V is the set of edges, i.e., pairs of vertices of the form (v_i, v_j).
It is common to represent the set E using an edge-indicator symmetric adjacency matrix A ∈ R^{|V|×|V|}. Note that in this work we consider only undirected graphs, corresponding to symmetric adjacency matrices. In some cases, it is useful to consider (edge-)weighted graphs, in which case the adjacency matrix can take values other than 0 or 1.
We can use A to define the diagonal degree matrix D of the graph as: D_ii = Σ_j A_ij and D_ij = 0 for i ≠ j. In the context of GSP, we consider not only graphs, but also graph signals. A graph signal is typically defined as a vector s ∈ R^{|V|}, assigning one value to each vertex. In this work we often consider a set of signals S jointly. We group the signals in a matrix S ∈ R^{|V|×|S|}, where each column is an individual graph signal s. An important notion in the remainder of this work is that of graph signal variation.
Definition 5 (Graph signal variation). The total variation σ of a set of graph signals represented by S is: σ(S) = tr(S⊤LS), where L = D − A is the combinatorial Laplacian of the graph G that supports S and tr is the trace operator. We can also rewrite σ as: σ(S) = ½ Σ_{i,j} A_ij ‖s_i − s_j‖², where s_i denotes the values of the signals on vertex v_i (the i-th row of S). As such, the variation of a signal increases when vertices connected by edges with large weights have very different values.
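The equality between the trace form and the edge-sum form of the total variation can be checked numerically; the snippet below does so on a random weighted graph.

```python
import numpy as np

# Build a random symmetric weighted adjacency matrix with empty diagonal.
rng = np.random.default_rng(0)
n, d = 6, 3
A = rng.random((n, n))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)

D = np.diag(A.sum(axis=1))
L = D - A                          # combinatorial Laplacian L = D - A
S = rng.standard_normal((n, d))    # d graph signals on n vertices

# Form 1: sigma = tr(S^T L S).
sigma_trace = np.trace(S.T @ L @ S)

# Form 2: sigma = 1/2 * sum_{i,j} A_ij * ||s_i - s_j||^2
# (the 1/2 compensates for counting each undirected edge twice).
sigma_edges = 0.5 * sum(A[i, j] * np.sum((S[i] - S[j]) ** 2)
                        for i in range(n) for j in range(n))
```

Both expressions agree up to floating-point rounding, and the trace form makes it clear that σ is non-negative, since L is positive semi-definite.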

Proposed methodology
In this section we describe how to construct and exploit latent geometry graphs (LGGs) and illustrate the key ideas with a toy example. Given a batch X, each LGG vertex corresponds to a sample in X and each edge weight measures the similarity between the corresponding data points. More specifically, LGGs are constructed as follows:
1. Generate a symmetric square matrix A ∈ R^{|V|×|V|} using a similarity measure between the intermediate representations, at a given depth ℓ, of the data points in X. In this work we choose the cosine similarity when data is non-negative and an RBF similarity kernel based on the L2 distance otherwise;
2. Threshold A so that each vertex is connected only to its k nearest neighbors;
3. Symmetrize the resulting thresholded matrix: two vertices i and j are connected with edge weights w_ij = w_ji as long as one of the vertices was among the k nearest neighbors of the other;
4. (Optional) Normalize A using its diagonal degree matrix D: A ← D^{-1/2} A D^{-1/2}.
Given the LGG associated with some intermediate representation, we can quantify how well this representation matches the classification task under consideration by using the concept of label variation, a measure of graph signal variation for the signal formed by concatenating all label indicator vectors: Definition 6 (Label variation). Consider a similarity graph for a given batch X (obtained from some intermediate layer), represented by an adjacency matrix A, and define a label indicator matrix V obtained by concatenating the label indicator vectors v_c of each class. Label variation is defined as: σ_label = tr(V⊤LV), where L is the Laplacian of the graph. If the graph is well suited to the classification task then most vertices will have their immediate neighbors in the same class. Indeed, label variation is 0 if and only if data points that belong to distinct classes are not connected in the graph. Therefore, a smaller label variation is indicative of an easier classification task (well separated classes).
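The four construction steps and Definition 6 can be sketched end-to-end in a few lines of NumPy. This is a minimal version: the choice of k, the small epsilon terms, and the use of cosine similarity (appropriate for non-negative features) are implementation details of this sketch, not prescriptions.

```python
import numpy as np

def build_lgg(X, k=3, normalize=True):
    # Step 1: cosine similarity between rows of X (assumed non-negative).
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    A = Xn @ Xn.T
    np.fill_diagonal(A, 0.0)
    # Step 2: keep only each vertex's k strongest neighbours.
    keep = np.zeros_like(A, dtype=bool)
    idx = np.argsort(-A, axis=1)[:, :k]
    keep[np.arange(len(A))[:, None], idx] = True
    # Step 3: symmetrize (i~j if either selected the other).
    keep = keep | keep.T
    A = np.where(keep, A, 0.0)
    if normalize:
        # Step 4: A <- D^{-1/2} A D^{-1/2}.
        d = A.sum(axis=1)
        Dinv = np.diag(1.0 / np.sqrt(d + 1e-12))
        A = Dinv @ A @ Dinv
    return A

def label_variation(A, labels, num_classes):
    # sigma_label = tr(V^T L V) with L = D - A (Definition 6).
    V = np.zeros((len(labels), num_classes))
    V[np.arange(len(labels)), labels] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    return np.trace(V.T @ L @ V)

# Two well-separated synthetic clusters: all k-NN edges stay within a
# class, so label variation is (numerically) zero.
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0, 0.1, (4, 3)) + np.array([1.0, 0, 0]),
               rng.uniform(0, 0.1, (4, 3)) + np.array([0, 1.0, 0])])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
A = build_lgg(X, k=3)
lv = label_variation(A, labels, 2)
```

Relabeling the same batch so that classes are mixed across the two clusters yields a strictly larger label variation, matching the claim that σ_label = 0 exactly when no edge connects points of distinct classes.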

Toy example
In this example we visualize the relation between the classification task and the geometries represented by the graphs. To do so, we construct three similarity graphs for a very small subset of the CIFAR-10 D_train: one defined on the image space (i.e., computing the similarity between the 3072 dimensions of the raw input images) and two using the latent space representations of an architecture trained on the dataset. These representations come from an intermediate layer (32,768 dimensions) and from the penultimate layer (512 dimensions). What we expect to see qualitatively is that the classes become easier to separate as we go deeper in the considered architecture, which should be reflected by the label variation score: the penultimate layer should yield the smallest label variation. We depict this example in Figure 2. Note that data points are placed in the 2D space using Laplacian eigenmaps [36]. As expected, we can qualitatively see the difference in separation from the image space to the latent spaces. We are also able to measure quantitatively how difficult it is to separate the classes using the label variation, which is lowest for the penultimate layer. For more details on how this example was generated, we refer the reader to the appendix.

Dimensionality and LGGs
A key asset of the proposed methodology is that the number of vertices in the graph is independent of the dimension of the intermediate representations it was built from. As such, it is possible to compare graphs built from latent spaces with various dimensions, as illustrated in Figure 2. Being agnostic to dimension will be a key ingredient in the applications described in the following section. It is important to note that, while the number of vertices is independent of the dimension of intermediate representations, edge weights are a function of a similarity in the considered latent space, which can have very different meanings depending on the underlying dimensions.
In the context of DL architectures, a common choice of similarity measure is the cosine similarity. Interestingly, the cosine similarity is well suited to non-negative data (as typically produced by a ReLU activation), for which it is bounded between 0 and 1. When data can be negative, we use a Gaussian kernel applied to the Euclidean distance instead. The problem remains that cosine or Euclidean similarities suffer from the curse of dimensionality. In an effort to reduce the influence of dimension when comparing LGGs obtained from latent spaces with distinct dimensions, in our experiments we make use of graph normalization, as defined in step 4 of the LGG construction. A more in-depth analysis and understanding of the influence of dimension on graph construction is a promising direction for future work, as improving the graph construction could benefit all applications covered in this work.

Applications
We now show how LGGs can be used in three specific applications: i) knowledge distillation, ii) latent embeddings, and iii) robustness. Details on the dataset used can be found in the appendix.

Knowledge distillation
First, we consider the case of knowledge distillation (KD). The goal of KD is to use the knowledge acquired by a pre-trained DL architecture that we call the teacher T to train a second architecture called the student S. KD is normally performed in compression scenarios, where the goal is to obtain an architecture S that is less computationally expensive than T while maintaining good enough generalization. In order to do so, KD approaches aim at making both networks consistent in their decisions. Consistency is usually achieved by minimizing a measure of discrepancy between the networks' intermediate and/or final representations.
More formally, we can define the objective function of a student network trained with knowledge distillation as: L = L_task + λ_KD L_KD, where L_task is typically the same loss that was used to train the teacher (e.g., cross-entropy), L_KD is the distillation loss and λ_KD is a scaling parameter controlling the importance of the distillation term relative to that of the task. Note that Individual Knowledge Distillation (IKD), in which intermediate representations are matched sample by sample, requires the intermediate representations of T and S to have the same dimensions. In order to avoid this drawback, Relational Knowledge Distillation (RKD) has recently been proposed [9,22,23]. Indeed, the method we introduce in this section is inspired by [9], where the authors propose to compare the distance between the intermediate representations of a pair of data points in the teacher with the corresponding distance in the student. The goal then becomes to minimize the discrepancy between these two distances. Interestingly, distances can be compared even if the corresponding intermediate representations do not have the same dimension. However, we point out that forcing (absolute) distances to be similar is not necessarily desirable. As a matter of fact, it would be sufficient to consider distances relative to other pairs of data points. For example, consider a case where, in the teacher latent space, the distance between points A and B is 0.5 and the distance between points A and C is 0.25. Instead of forcing the student to reproduce these exact distances (0.5 and 0.25), we could simply ensure that the AC distance is half of the AB distance.
In this section we introduce a method that focuses on relative distances. We do so using normalized LGGs. The framework we consider, which we named Graph Knowledge Distillation (GKD) in [10], consists in reducing the discrepancy between the LGGs constructed from T and S.
Proposed approach (GKD): Let us consider a layer in the teacher architecture, and the corresponding one in the student architecture. Considering a batch of inputs, we propose to build the corresponding graphs G T and G S capturing their geometries as described in Section 3.3.
During training, we propose to use the following distillation loss in Equation 6: L_KD = L_d(A_T, A_S) = ‖A_T − A_S‖_F, where A_T and A_S are the adjacency matrices of G_T and G_S, and L_d is the Frobenius norm of their difference. In practice, many such additive terms can be used, one per pair of layers to match in the teacher and student architectures. Let us point out that the dimensions of latent spaces in T and S are likely to be very different. As such, the LGGs can be hard to compare directly. This is why we make use of graph normalization (as described in step 4 of the LGG construction), where similarities are considered relative to each other. Despite not being ideal, graph normalization allows us to obtain considerable gains in accuracy, as illustrated in the following experiments. The GKD loss measures the discrepancy between the adjacency matrices of the teacher and student LGGs. In this way, the geometry of the intermediate representations of the student is encouraged to converge to that of the teacher (which is already fixed). Our intuition is that, since the teacher network is expected to generalize well to the test set, mimicking its latent geometry should allow for better generalization of the student network as well. Moreover, since we use normalized LGGs, the similarities are considered relative to each other (so that each vertex of the graph has the same "connection strength"), contrary to initial works in RKD [9], where each distance is taken in absolute value and thus one sample can eclipse all the others (e.g., by being too far away from the others).
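A minimal sketch of the GKD discrepancy is shown below. For brevity it uses dense, symmetrically normalized cosine-similarity graphs instead of the full k-NN construction of Section 3.3; `normalized_adjacency` is a hypothetical helper standing in for that construction, not the exact code of [10].

```python
import numpy as np

def normalized_adjacency(X):
    # Dense cosine-similarity graph, symmetrically normalized
    # (a simplified stand-in for the LGG construction of Section 3.3).
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    A = Xn @ Xn.T
    np.fill_diagonal(A, 0.0)
    d = np.sqrt(A.sum(axis=1) + 1e-12)
    return A / np.outer(d, d)          # A <- D^{-1/2} A D^{-1/2}

def gkd_loss(A_teacher, A_student):
    # Frobenius-norm discrepancy between teacher and student LGGs.
    return np.linalg.norm(A_teacher - A_student, ord="fro")

# The graph size depends only on the batch size (8 here), so the
# teacher and student latent dimensions (64 vs. 16) may differ freely.
rng = np.random.default_rng(0)
teacher_feats = rng.random((8, 64))
student_feats = rng.random((8, 16))
loss = gkd_loss(normalized_adjacency(teacher_feats),
                normalized_adjacency(student_feats))
```

The loss is zero exactly when the two geometries coincide, which is what gradient descent on the student pushes toward while the teacher graph stays fixed.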
Experiments: To illustrate the gains that can be achieved using GKD, we ran the following experiment. Starting from a WideResNet28-1 [37] teacher architecture with many parameters, which achieves an error rate of 7.27% on CIFAR-10, we first train a student without KD, called the baseline, containing roughly 4 times fewer parameters. The resulting error rate is 10.37%. We then compared RKD and GKD. Results in Table 1 show that GKD doubles the gains of RKD over the baseline.

Table 1. Knowledge distillation results on CIFAR-10.
Method           Error rate   Gain over baseline   Params (vs. teacher)
RKD-D [9]        10.05%       0.29%                27%
GKD (ours) [10]  9.71%        0.63%                27%

More details and experiments can be found in [10], where it is shown that the gains can be explained by the fact that the GKD student produces decisions that are more consistent with the teacher than the RKD student. Other experiments in [10] also suggest that simple modifications to graph construction (e.g., connecting only data points of distinct classes) can further improve the gains reported in Table 1.

Latent embeddings
We now present an objective function that consists in minimizing the label variation at the output of the considered deep learning architecture. The goal of this objective function is to train the DL architecture to be a good feature extractor for classification, so that the LGGs generated from the features have a very small label variation. This idea was originally proposed in [14].
Methodology: Let us consider the representations obtained at the output of a deep learning architecture. We build the corresponding LGG G as described in Section 3.3. We then propose to use the label variation on this LGG as the objective function to train the network. By definition, minimizing the label variation leads to maximizing the distances between outputs of different classes. Compared to the classic cross-entropy loss, label variation as an objective function does not suffer from the same drawbacks, notably: the proposed criterion does not need to force the output dimension to match the number of classes, it can result in distinct clusters in the output domain for the same class (as it only deals with distances between examples from different classes), and it can leverage the initial distribution of representations at the output of the network function.
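To see label variation acting as a training objective, the toy example below runs plain gradient descent on free 2-D embeddings and minimizes a dense RBF-graph label variation directly. It illustrates the separation behavior described above (differently-labeled points are pushed apart); it is not the training setup used in [14].

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 2)) * 0.1       # near-collapsed embeddings
labels = np.array([0] * 5 + [1] * 5)
diff_class = (labels[:, None] != labels[None, :]).astype(float)

def label_var(Z):
    # Dense RBF similarity graph on the embeddings. For binary one-hot
    # labels, tr(V^T L V) reduces to the total cross-class edge weight.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2)
    return (A * diff_class).sum()

def grad(Z, eps=1e-5):
    # Central-difference numerical gradient; fine for a toy example.
    g = np.zeros_like(Z)
    for idx in np.ndindex(Z.shape):
        E = np.zeros_like(Z)
        E[idx] = eps
        g[idx] = (label_var(Z + E) - label_var(Z - E)) / (2 * eps)
    return g

before = label_var(Z)
for _ in range(100):
    Z = Z - 0.05 * grad(Z)                   # gradient descent step
after = label_var(Z)
```

After optimization the cross-class similarities have shrunk, i.e., the two classes occupy separated regions of the embedding space, without any term forcing same-class points into a single cluster.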
Experiments: To evaluate the performance of label variation as an objective function, we perform experiments with the CIFAR-10 dataset [38] and using ResNet-18 [17] as our DL architecture. In Table 2 we report the performance of the deep architectures trained with the proposed loss compared with cross-entropy. We also report the relative Mean Corruption Error (MCE), which is a standard measure of robustness towards corruptions of the inputs over the CIFAR-10 corruption benchmark [7], where smaller values of MCE are better. We observe that label variation is a viable alternative to cross-entropy in terms of raw test accuracy, and that it leads to significantly better robustness. More details and experiments can be found in [14], where in particular we show how the initial distribution of data points is preserved throughout the learning process. Table 2. Comparison between the cross-entropy and label variation functions.

Improving DL robustness
In this section we propose to use label variation as a regularizer applied at each layer of the considered architecture during training. We initially introduced this idea in [8]. As it is not desirable to enforce a small label variation at early layers of the architecture, the core idea is to ensure a smooth evolution of label variation from one intermediate representation to the next in the processing flow.
Recall that networks are typically trained with the objective of yielding zero error on the training set. If the error on the training set is (approximately) zero, then any two examples with different labels can be separated by the network, even if these examples are very close to each other in the original domain. This means that the network function can create significant deformations of the space (i.e., small distances in the original domain map to larger distances in the final layers) and explains how an adversarial attack with small changes to the input can lead to changing the output decision given by the network. When we enforce a smooth evolution of label variation, we precisely prevent such sudden deformations of the space.
Methodology: Formally, let ℓ denote the depth of an intermediate representation in the architecture. Let us consider a batch of inputs, and let us build the corresponding LGG G_ℓ at each depth as described in Section 3.3. The proposed regularizer can be expressed as: δ = Σ_ℓ |σ_{ℓ+1} − σ_ℓ|, where σ_ℓ is the label variation on G_ℓ. This regularizer is then added to the objective function (loss) with a scaling hyperparameter γ.
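The layer-wise smoothness idea can be sketched as follows. The absolute-difference form of the penalty is one plausible reading of "smooth evolution of label variation"; [8] should be consulted for the exact regularizer used in the experiments.

```python
import numpy as np

def label_variation(A, labels, num_classes):
    # sigma = tr(V^T L V) with L = D - A (Definition 6).
    V = np.zeros((len(labels), num_classes))
    V[np.arange(len(labels)), labels] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    return np.trace(V.T @ L @ V)

def smoothness_penalty(adjacencies, labels, num_classes):
    # Penalize abrupt changes of label variation between the LGGs of
    # consecutive layers: delta = sum_l |sigma_{l+1} - sigma_l|.
    sigmas = [label_variation(A, labels, num_classes) for A in adjacencies]
    return sum(abs(s1 - s0) for s0, s1 in zip(sigmas, sigmas[1:]))

# One (random, for illustration) LGG adjacency per layer of a 3-layer net.
rng = np.random.default_rng(0)
labels = np.array([0, 1, 0, 1])
adjs = []
for _ in range(3):
    A = rng.random((4, 4))
    A = (A + A.T) / 2
    np.fill_diagonal(A, 0.0)
    adjs.append(A)
penalty = smoothness_penalty(adjs, labels, 2)
```

During training, `penalty` would be scaled by γ and added to the task loss; it vanishes when consecutive layers induce identical label variations, i.e., when the geometry evolves smoothly through the network.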
Experiments: In order to stress the ability of the proposed regularizer to improve robustness, we consider a ResNet18 trained on CIFAR-10 under multiple settings. In the first, we add adversarial noise to the inputs [32] and compare the obtained accuracies. In the second, we consider agnostic corruptions (i.e., corruptions that do not depend on the network function) and report the relative MCE [7]. Results are presented in Table 3. The proposed regularizer performs better than the raw baseline and existing alternatives in the literature [6]. More details can be found in [8].

Conclusion
In this work, we have introduced a methodology to represent latent space geometries using similarity graphs (i.e., LGGs). We demonstrated the interest of this formalism for three different problems: i) knowledge distillation, ii) latent embeddings, and iii) robustness. With the ubiquity of graphs in representing relations between data elements, and the growing literature on Graph Signal Processing, we believe that the proposed formalism could be applied to many more problems and domains, including predicting generalization, improving performance in data-scarce settings, and helping to understand how decisions are taken in a deep learning architecture.
Note that the proposed methodologies use straightforward techniques to build LGGs, and thus could be enriched with more principled approaches [39,40]. Another area of interest would be to build upon [15] and see what improvements may arise from the use of graph convolutional networks in domains that are not typically supported by graphs.
Author Contributions: Initial ideas were proposed jointly by AO, CL and VG. Initial investigation of the ideas was performed by CL. CL performed all simulations, prepared the figures, wrote the first draft and participated in the edition process. VG and AO supervised the project, and worked on the editing process from the original draft to the final version.
Funding: Carlos Lassance was partially funded by the Brittany region in France.