1. Introduction
In recent years, Deep Learning (DL) methods have achieved state-of-the-art performance in a vast range of machine learning tasks, including image classification [
1] and multilingual automatic text translation [
2]. A DL architecture is built by assembling elementary operators called
layers [
3], some of which contain trainable parameters. Due to their compositional nature, DL architectures exhibit intermediate representations when they process a given input. These intermediate representations lie in so-called latent spaces.
DL architectures are typically trained to minimize a loss function computed at their output. This is done with a variant of the stochastic gradient descent algorithm, in which the gradient of the loss is backpropagated through the multiple layers to update the corresponding parameters. To accelerate the training procedure, it is very common to process batches of inputs concurrently. In such a case, a global criterion over the corresponding batch (e.g., the average loss) is backpropagated.
The training procedure of DL architectures is thus performed in an
end-to-end fashion. This end-to-end characteristic of DL refers to the fact that intermediate representations are unconstrained during training. This property has often been considered as an asset in the literature [
4] which presents DL as a way to replace “hand-crafted” features by automatic differentiation. As a matter of fact, using these hand-crafted features as intermediate representations can cause sub-optimal solutions [
5]. On the other hand, completely removing all constraints on the intermediate representations can cause the learning procedure to exhibit unwanted behavior, such as susceptibility to deviations of the inputs [
6,
7,
8], or redundant features [
9,
10].
In this work, we propose a new methodology aiming at enforcing desirable properties on intermediate representations. Since training is organized into batches, we achieve this goal by constraining what we call the
latent geometry of data points within a batch. This geometry refers to the relative position of data points within a specific batch, based on their representation in a given layer. While there are many problems for which specific intermediate layer properties are beneficial, in this work, we consider three examples. First, we explore compression via
knowledge distillation (KD) [
9,
10,
11,
12], where the goal is to supervise the training procedure of a small DL architecture (called the student) with a larger one (called the teacher). Second, we study the design of efficient embeddings for classification [
13,
14], in which the aim is to train the DL architecture to extract features that are useful for classification (and could be used by different classifiers) rather than using classification accuracy as the sole performance metric. Finally, we develop techniques to increase the robustness of DL architectures to deviations of their inputs [
6,
7,
8].
To address the three above-mentioned problems, we introduce a common methodology that exploits the latent geometries of a DL architecture. More precisely, we propose to formalize latent geometries by defining similarity graphs. In these graphs, vertices are data points in a batch and an edge weight between two vertices is a function of the relative similarity between the corresponding intermediate representations at a given layer. We call such a graph a latent geometry graph (LGG). In this paper, we show that intermediate representations with desirable properties can be obtained by imposing constraints on their corresponding LGGs. In the context of KD, similarity between teacher and student is favored by minimizing the discrepancy between their respective LGGs. For efficient embedding designs, we propose an LGG-based objective function that favors disentanglement of the classes. Lastly, to improve robustness, we enforce smooth variations between LGGs corresponding to pairs of consecutive layers at any stage, from input to output, in a given architecture. Enforcing smooth variations between the LGGs of consecutive layers provides some protection against noisy inputs, since small changes in the input are less likely to lead to a sharp transition of the network’s decision.
This paper is structured as follows. We first discuss related work in
Section 2. We then introduce the proposed methodology in
Section 3. Then, we present the three applications, namely knowledge distillation, design of classification feature vectors, and robustness improvements, in
Section 4. Finally, we present a summary and a discussion on future work in
Section 5. A term glossary is available in
Appendix A.
2. Related Work
As previously mentioned, in this work, we are interested in using graphs to ensure that latent spaces of DL architectures have some desirable properties. The various approaches we introduce in this paper are based on our previous contributions [
8,
10,
14]. However, in this paper, they are presented for the first time using a unified methodology and formalism. While we deployed these ideas already in a few applications, by presenting them in a unified form, our goal is to provide a broader perspective of these tools, as well as to encourage their use for other problems.
In what follows, we introduce related work found in the literature. We start by comparing our approach with others that also aim at enforcing properties on latent spaces. Then, we discuss approaches that mix graphs and intermediate (or latent) representations in DL architectures. Finally, we discuss methods related to the applications highlighted in this work: (i) knowledge distillation, (ii) latent embeddings, and (iii) robustness.
Enforcing properties on latent spaces: A core goal of our work is to enforce desirable properties on the latent spaces of DL architectures, more precisely (i) consistency with a teacher network, (ii) class disentangling, and (iii) smooth variation of geometries over the architecture. In the literature, one can find two types of approaches to enforce properties on latent spaces: (i) directly designing specific modules or architectures [
15,
16] and (ii) modifying the training procedures [
11,
13]. The main advantage of the latter approaches is that one is able to draw from the vast literature in DL architecture design [
17,
18] and use an existing architecture instead of having to design a new one.
Our proposed unified methodology can be seen as an example of the second type of approaches, with two main advantages over competing techniques. First, by using relational information between the examples, instead of treating each one separately, we extend the range of proposed solutions. For example, relational knowledge distillation methods can be applied to any pair of teacher-student networks [
9] as relational metrics are dependent on the number of examples and not on the dimension of individual layers (see more details in the next paragraphs). Second, by using graphs to represent the relational information, we are able to exploit the rich literature in graph signal processing [
19] and use it to reason about the properties we aim at enforcing on latent spaces. We discuss this in more detail in
Section 4.1 and
Section 4.3.
Latent space graphs: In the past few years, there has been a growing interest in proposing deep neural network layers able to process graph-based inputs, also known as graph neural networks. For example, works, such as References [
20,
21,
22,
23], show how one can use convolutions defined in graph domains to improve performance of DL methods dealing with graph signals as inputs. The proposed methodology differs from these works in that it does not require inputs to be defined on an explicit graph. The graphs we consider here (LGGs) are proxies to the latent data geometry of the intermediate representations. Contrary to classical graph neural networks, the purpose of the proposed methodology is to study latent representations using graphs, instead of processing graph supported inputs. Some recent work can be viewed as following ideas similar to those introduced in this paper, with applications in areas, such as knowledge distillation [
24,
25], robustness [
15], interpretability [
26], and generalization [
27]. Despite sharing a common methodology, these works are not explicitly linked. This can be explained by the fact that they were introduced independently around the same time and have different aims. We provide more details about how they are connected with our proposed methodology in the following paragraphs.
Knowledge distillation: Knowledge distillation is a DL compression method, where the goal is to use the knowledge acquired by a pre-trained architecture, called the teacher, to train a smaller one, called the student. Initial works on knowledge distillation considered each input independently from the others, an approach known as Individual Knowledge Distillation (IKD) [
11,
12,
28]. As such, the student architecture mimics the intermediate representations of the teacher for each input used for training. The main drawback of IKD lies in the fact that it forces the intermediate representations of the student to have the same dimensions as those of the teacher. To deploy IKD in broader contexts, authors have proposed to disregard some of these intermediate representations [
12] or to perform some kind of dimensionality reduction [
28].
On the other hand, the method we propose in
Section 4.1 is based on a recent paradigm named Relational Knowledge Distillation (RKD) [
9], which differs from IKD as it focuses on the relationship between examples instead of their exact positions in latent spaces. RKD has the advantage of leading to dimension-agnostic methods, such as the one described in this work. By defining graphs, our method has the further advantage that relationships between elements are considered relative to each other.
Concurrently, other authors [
24,
25] have proposed methods similar to the one we present here [
10]. In Reference [
24], unlike in our approach, dimensionality reduction transformations are added to the intermediate representations, in an attempt to improve the knowledge distillation. In Reference [
25], LGGs are built using attention (similar to Reference [
29]). Among other differences, we show in
Section 4.1 that constructing graphs that only connect data points from distinct classes can significantly improve accuracy.
Latent embeddings: In the context of classification, the most common DL setting is to train the architecture end-to-end with an objective function that directly generates a decision at the output. Instead, it can be beneficial to output representations well suited to be processed by a simple classifier (e.g., logistic regression). This framework is called feature extraction or latent embeddings, as the goal is to generate representations that are easy to classify, but without directly enforcing the way they should be used for classification. Such a framework is very interesting if the DL architecture is not going to be used solely for classification but also for related tasks, such as person re-identification [
13], transfer learning [
30], and multi-task learning [
31].
Many authors have proposed ways to train deep feature extractors. One influential example is Reference [
13], where the authors use triplets to perform Deep Metric Learning. In each triplet, the first element is the example to train, the second is a positive example (e.g., same class) and the last is a negative one (e.g., different class). The aim is to obtain triplets in which the first element is closer to the second than to the last. In contrast, our method considers all connections between examples of different classes and can focus solely on separation (making all the negatives far) instead of clustering (making all the positives close), which we posit should lead to more robust embeddings (see Section 4.2).
Other solutions for generating latent embeddings propose alternatives to the classical arg max operator used to perform the decision at the output of a DL architecture. This can be done either by changing the output so that it is based on error correcting codes [
32] or is smoothed, either explicitly [
33] or by using the prior knowledge of another network [
11].
Robustness of DL architectures: In this work, we are interested in improving the robustness of DL architectures. We define robustness as the ability of the network to correctly classify inputs even if they are subject to small perturbations. These perturbations may be adversarial (designed exactly to force misclassification) [
34] or incidental (due to external factors, such as hardware defects or weather artifacts) [
7]. The method we present in
Section 4.3 is able to increase the robustness of the architecture in both cases. Multiple works in the literature aim to improve the robustness of DL architectures following two main approaches: (i) training set augmentation [
35] and (ii) improved training procedure. Our contribution can be seen as an example of the latter approaches, but can be combined with augmentation-based methods, leading to an increase of performance compared to using the techniques separately [
8].
A similar idea was proposed in Reference [
15], where the authors exploit graph convolutional layers in order to improve robustness of DL architectures applied to non-graph domains. Their approach can be described as denoising the (test) input by using the training data. This differs from the method we propose in
Section 4.3, which focuses on generating a smooth network function. As such, the proposed method is more general as it is less dependent on the training set.
3. Methodology
In this section, we first introduce basic concepts from DL and graph signal processing (
Section 3.1 and
Section 3.2) and then our proposed methodology (
Section 3.3).
3.1. Deep Learning
We start by introducing basic deep learning (DL) concepts, referring the reader to Reference [3] for a more in-depth overview. A DL architecture is an assembly of layers that can be mathematically described as a function $f$, often referred to as the “network function” in the literature, that associates an input tensor $x$ with an output tensor $\hat{y} = f(x)$. This function is characterized by a large number of trainable parameters. In the literature, many different approaches have been proposed to assemble layers to obtain such network functions [17]. While layers are the basic unit, it is also common to describe architectures in terms of a series of blocks, where a block is typically a small set of connected layers. This block representation allows us to encapsulate non-sequential behaviors, such as the residual connections of residual networks (Resnets) [17], so that, even though layers are connected in a more complex way, the blocks remain sequential, and the network function can be represented as a series of cascading operations:

$$f = f^{\ell_{\max}} \circ f^{\ell_{\max}-1} \circ \cdots \circ f^{1}, \tag{1}$$

where each function $f^{\ell}$ can represent a layer, or a block comprising several layers, depending on the underlying DL architecture. Thus, each block is associated with a subfunction $f^{\ell}$. For example, in the context of Resnets [17], the architecture is composed of blocks as depicted in
Figure 1.
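As a minimal sketch of this cascade (not the implementation used in this work), the following NumPy snippet chains toy blocks and keeps the output of each one, which is what the intermediate representations defined below refer to. The helper names `make_block` and `forward_with_intermediates`, the layer sizes, and the choice of a linear map followed by a ReLU for each block are assumptions made purely for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def make_block(weight):
    """Return a toy block: a linear map followed by a ReLU (illustrative only)."""
    return lambda h: relu(h @ weight)

def forward_with_intermediates(blocks, x):
    """Apply the blocks in order and keep the output of every block.

    Entry l-1 of the returned list is the output after the first l blocks.
    """
    outputs = []
    h = x
    for block in blocks:
        h = block(h)
        outputs.append(h)
    return outputs

rng = np.random.default_rng(0)
dims = [8, 16, 4, 3]  # hypothetical input, hidden and output sizes
blocks = [make_block(rng.normal(size=(dims[i], dims[i + 1])))
          for i in range(len(dims) - 1)]

x = rng.normal(size=(5, dims[0]))  # a batch of 5 inputs
reps = forward_with_intermediates(blocks, x)
print([r.shape for r in reps])  # one array per depth, last one is the network output
```

The last entry of `reps` is the output of the whole network function, while earlier entries live in latent spaces whose dimensions generally differ from one depth to the next.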
A very important concept for the remainder of this work is that of intermediate representations, which are the basis for the LGGs (defined in
Section 3.3) and corresponding applications (
Section 4).
Definition 1 (Intermediate representation).
We call intermediate representation of an input $x$ the output it generates at an intermediate layer or block. Starting from Equation (1), and denoting $F^{\ell} = f^{\ell} \circ \cdots \circ f^{1}$, we define the intermediate representation at depth $\ell$ for $x$ as $x^{\ell} = F^{\ell}(x)$. Or, said otherwise, $x^{\ell}$ is the representation of $x$ in the latent space at depth $\ell$.

Initially, the parameters of $f$ are typically drawn at random. They are then optimized during the training phase so that $f$ achieves a desirable performance for the problem under consideration. The dimension of the output of $f$ depends on the task. In the context of classification, it is common to design the network function such that the output has a dimension equal to the number of classes in the classification problem. In this case, for a given input, each coordinate of this final layer output is used as an estimate of the likelihood that the input belongs to the corresponding class. A network function correctly maps an input to its class if the output of the network function, $f(x)$, is close to the target vector $y$ of the correct class.
Definition 2 (Target vector). Each sample of the training set is associated with a target vector of dimension C, where C is the total number of classes. Thus, the target vector of a sample of class c is the binary vector containing 1 at coordinate c and 0 at all other coordinates.
In this work, we also introduce the notion of label indicator vector, which should be distinguished from that of target vector. The label indicator vector is defined on a batch of data points, instead of individually for each sample, as follows:
Definition 3 (Label indicator vector). Consider a batch of B data points. The label indicator vector of class c for this batch is the binary vector containing 1 at coordinate i if and only if the i-th element of the batch is of class c, and 0 otherwise.
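To make the difference between Definitions 2 and 3 concrete, here is a small NumPy sketch. The function names and the example batch are hypothetical, and classes are indexed from 0 for convenience.

```python
import numpy as np

def target_vectors(labels, num_classes):
    """One row per sample: 1 at the sample's class, 0 elsewhere (Definition 2)."""
    labels = np.asarray(labels)
    targets = np.zeros((labels.size, num_classes))
    targets[np.arange(labels.size), labels] = 1.0
    return targets

def label_indicator_vector(labels, c):
    """Length-B vector: 1 where the batch element is of class c (Definition 3)."""
    return (np.asarray(labels) == c).astype(float)

labels = [2, 0, 2, 1]  # hypothetical batch of B = 4 samples with C = 3 classes
T = target_vectors(labels, num_classes=3)
v2 = label_indicator_vector(labels, c=2)
print(T)   # shape (B, C): one target vector per row
print(v2)  # shape (B,): membership of each batch element in class 2
```

When the target vectors of a batch are stacked as rows, as in `T`, column c of the resulting matrix is exactly the label indicator vector of class c: the two notions slice the same binary matrix along different axes.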
The purpose of a classification problem is to obtain a network function $f$ that outputs the correct class decision for any valid input $x$. In practice, it is often the case that the set of valid inputs is not finite, and yet we are only given a “small” number of pairs $(x, y)$, where $y$ is the output associated with $x$. The set of these pairs is called the dataset $\mathcal{D}$. During the training phase, the parameters are tuned using $\mathcal{D}$ and an objective function that measures the discrepancy between the outputs of the network function and the expected target vectors, i.e., the discrepancy between $f(x)$ and $y$. It is common to decompose the function $f$ into a feature extractor $\mathcal{F}$ and a classifier $\mathcal{C}$ as follows: $f = \mathcal{C} \circ \mathcal{F}$. In a classification task, the objective function is calculated over the outputs of the classifier and the gradients are backpropagated to generate a good feature extractor. Alternatively, to ensure that good latent embeddings are produced, one can first train the feature extractor part of the architecture to optimize the features, and then train a classifier on the resulting features (which may or may not remain fixed) [13,14]. We introduce an objective function designed for efficient latent embedding training in
Section 4.2.
Usually, the objective function is a loss function. It is minimized over a subset of the dataset $\mathcal{D}$ that we call the “training set” ($\mathcal{D}_{\text{train}}$). The reason to select a subset of $\mathcal{D}$ to train the DL architecture is that it is hard to predict the generalization ability of the trained function $f$. Generalization usually refers to the ability of $f$ to predict the correct output for inputs $x$ not in $\mathcal{D}_{\text{train}}$. A simple way to evaluate generalization consists of counting the proportion of elements in $\mathcal{D} \setminus \mathcal{D}_{\text{train}}$ that are correctly classified using $f$. Obviously, this measure of generalization is not ideal, in the sense that it only checks generalization inside $\mathcal{D}$. This is why it is possible for a network that seems to generalize well to have trouble classifying inputs that are subject to deviations. In this case, it is said that the DL architecture is not robust. We delve into more details on robustness in Section 4.3.

In summary, a network function is initialized at random. Parameters are then tuned using a variant of the stochastic gradient descent algorithm on a dataset $\mathcal{D}_{\text{train}}$, and finally, training performance is evaluated on a validation set. Problematically, the best performance of DL architectures strongly depends on the total number of parameters they contain [36]. In particular, it has been hypothesized that this dependence comes from the difficulty of finding a good gradient trajectory when the parameter space dimension is small [37]. A common way to circumvent this problem is to rely on knowledge distillation, where a network with a large number of parameters is used to supervise the training of a smaller one. We introduce a graph-based method for knowledge distillation in
Section 4.1.
3.2. Graph Signal Processing
As mentioned in the introduction, graphs are ubiquitous objects to represent relationships (called edges) between elements in a countable set (called vertices). In this section, we introduce the framework of Graph Signal Processing (GSP) which is central to our proposed methodology. Let us first formally define graphs:
Definition 4 (graph). A graph $\mathcal{G}$ is a tuple of sets $\langle \mathcal{V}, \mathcal{E} \rangle$, such that:
- 1. The finite set $\mathcal{V}$ is composed of vertices $v_1, v_2, \dots$.
- 2. The set $\mathcal{E}$ is composed of pairs of vertices of the form $(v_i, v_j)$ called edges.

It is common to represent the set $\mathcal{E}$ using an edge-indicator symmetric adjacency matrix $A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$. Note that, in this work, we consider only undirected graphs corresponding to symmetric $A$ (i.e., $A = A^{\top}$). In some cases, it is useful to consider (edge-)weighted graphs. In that case, the adjacency matrix can take values other than 0 or 1.

We can use $A$ to define the diagonal degree matrix $D$ of the graph as:

$$D_{i,j} = \begin{cases} \sum_{j'} A_{i,j'} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

In the context of GSP, we consider not only graphs but also graph signals. A graph signal is typically defined as a vector $s \in \mathbb{R}^{|\mathcal{V}|}$. In this work, we often consider a set of signals jointly. We group the signals in a matrix $S$, where each of the columns is an individual graph signal $s$. An important notion in the remainder of this work is that of graph signal variation.

Definition 5 (Graph signal variation).
The total variation $\sigma$ of a set of graph signals represented by $S$ is:

$$\sigma = \mathrm{tr}(S^{\top} L S),$$

where $L = D - A$ is the combinatorial Laplacian of the graph $\mathcal{G}$ that supports $S$, and tr is the trace function. We can also rewrite $\sigma$ as:

$$\sigma = \sum_{s} \sum_{i < j} A_{i,j} \, (s_i - s_j)^2,$$

where the outer sum runs over the columns $s$ of $S$ and $s_i$ represents the signal $s$ defined on vertex $v_i$. As such, the variation of a signal increases when vertices connected by edges with large weights have very different values.

3.3. Proposed Methodology
In this section, we describe how to construct and exploit latent geometry graphs (LGGs) and illustrate the key ideas with a toy example. Given a batch X, each LGG vertex corresponds to a sample in X, and each edge weight measures similarity between the corresponding data points. More specifically, LGGs are constructed as follows:
- (1) Generate a symmetric square matrix $A$ using a similarity measure between intermediate representations, at a given depth $\ell$, of data points in X. In this work, we choose the cosine similarity when data is non-negative, and an RBF similarity kernel based on the L2 distance otherwise.
- (2) Threshold $A$ so that each vertex is connected only to its k-nearest neighbors.
- (3) Symmetrize the resulting thresholded matrix: two vertices i and j are connected with edge weights $A_{i,j} = A_{j,i}$ as long as one of the nodes was a k-nearest neighbor of the other.
- (4) (Optional) Normalize $A$ using its degree diagonal matrix $D$: $\hat{A} = D^{-1/2} A D^{-1/2}$.
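The four steps above can be sketched in NumPy as follows. This is an illustrative reading of the construction rather than the code used in this work: the function name `build_lgg`, the removal of self-loops, and the symmetric form of the degree normalization are assumptions, and only the cosine-similarity case (non-negative representations) is covered.

```python
import numpy as np

def build_lgg(reps, k, normalize=True):
    """Build a latent geometry graph from a batch of intermediate representations.

    reps: array of shape (B, d), one row per data point, assumed non-negative
          and nonzero (e.g., post-ReLU) so that cosine similarity lies in [0, 1].
    k:    number of nearest neighbors kept per vertex.
    """
    B = reps.shape[0]

    # Step 1: cosine similarity between all pairs of rows.
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    A = unit @ unit.T
    np.fill_diagonal(A, 0.0)  # assumption: no self-loops

    # Step 2: for each vertex, mark its k most similar neighbors.
    keep = np.zeros_like(A, dtype=bool)
    nearest = np.argsort(-A, axis=1)[:, :k]
    keep[np.arange(B)[:, None], nearest] = True

    # Step 3: symmetrize; an edge survives if either endpoint selected the other.
    keep = keep | keep.T
    A = np.where(keep, A, 0.0)

    # Step 4 (optional): degree normalization, assumed here to be D^{-1/2} A D^{-1/2}.
    if normalize:
        deg = A.sum(axis=1)
        safe = np.where(deg > 0, deg, 1.0)
        inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(safe), 0.0)
        A = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return A

rng = np.random.default_rng(0)
reps = np.abs(rng.normal(size=(6, 10)))  # stand-in for post-ReLU features
A = build_lgg(reps, k=2)
print(A.shape)
```

Because of the symmetrization in step (3), a vertex can end up with more than k neighbors: it keeps its own k choices plus every vertex that selected it.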
Given the LGG associated with some intermediate representation, we are able to quantify how well this representation matches the classification task under consideration by using the concept of label variation, a measure of graph signal variation for a signal formed as a concatenation of all label indicator vectors:
Definition 6 (Label variation).
Consider a similarity graph for a given batch X (obtained from some intermediate layer), represented by an adjacency matrix $A$, and define a label indicator matrix V obtained by concatenating label indicator vectors of each class. Label variation is defined as:

$$\sigma = \mathrm{tr}(V^{\top} L V) = \sum_{i,j \,:\, c(i) \neq c(j)} A_{i,j},$$

where $L = D - A$ is the combinatorial Laplacian of the graph and $c(i)$ denotes the class of the i-th element of the batch.

Remark 1. Label variation has the advantage of being independent of the choice of labels for each class, which can be verified by noticing that for any permutation matrix P, it holds that:

$$\mathrm{tr}(P^{\top} V^{\top} L V P) = \mathrm{tr}(V^{\top} L V P P^{\top}) = \mathrm{tr}(V^{\top} L V).$$

If the graph is well suited for classification, then most nodes will have immediate neighbors in the same class. Indeed, label variation is 0 if and only if data points that belong to distinct classes are not connected in the graph. Therefore, smaller label variation is indicative of an easier classification task (well separated classes).
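As a hedged illustration of label variation, the sketch below evaluates the trace of $V^{\top} L V$ on a hand-made graph, with $L$ the combinatorial Laplacian (degree matrix minus adjacency matrix). The helper names and the 4-vertex graph are hypothetical; classes are indexed from 0.

```python
import numpy as np

def laplacian(A):
    """Combinatorial Laplacian L = D - A of a symmetric adjacency matrix."""
    return np.diag(A.sum(axis=1)) - A

def total_variation(A, S):
    """tr(S^T L S) for graph signals stored as the columns of S."""
    return np.trace(S.T @ laplacian(A) @ S)

def label_variation(A, labels, num_classes):
    """Total variation of the label indicator matrix V (one column per class)."""
    V = np.eye(num_classes)[np.asarray(labels)]  # shape (B, C)
    return total_variation(A, V)

# Hypothetical graph: vertices 0-1 in class 0, vertices 2-3 in class 1,
# with a single cross-class edge (1, 2) of weight 0.5.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
labels = [0, 0, 1, 1]

print(label_variation(A, labels, num_classes=2))
print(label_variation(A, [1, 1, 0, 0], num_classes=2))  # class names swapped
```

In this toy graph the only cross-class edge has weight 0.5 and is counted once for each of the two classes it separates, so the label variation is 1.0. Swapping the class names leaves the value unchanged, and removing the cross-class edge brings it to 0.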
3.3.1. Toy Example
In this example, we visualize the relation between the classification task and the geometries represented by the graphs. To do so, we construct three similarity graphs for a very small subset of the CIFAR-10 dataset (Appendix B), consisting of 20 images from 4 classes (i.e., 5 images per class): one defined on the image space (i.e., computing the similarity between the 3072 dimensions of the raw input images) and two using the latent space representations of an architecture trained on the dataset. Such representations come from an intermediate layer (32,768 dimensions) and the penultimate layer (512 dimensions). What we expect to see qualitatively is that the classes will be easier to separate as we go deeper in the considered architecture, which should be reflected by the label variation score: the penultimate layer should yield the smallest label variation. We depict this example in
Figure 2. Note that data points are placed in the 2D space using Laplacian eigenmaps [
38]. As expected, we can qualitatively see the difference in separation from the image space to the latent spaces. We are also able to measure quantitatively how difficult it is to separate the classes using the label variation, which is lowest for the penultimate layer. For more details on how this example was generated, we refer the reader to
Appendix C.
3.3.2. Dimensionality and LGGs
A key asset of the proposed methodology is that the number of vertices in the graph is independent of the dimension of the intermediate representations it was built from. As such, it is possible to compare graphs built from latent spaces with various dimensions, as illustrated in
Figure 2. Being agnostic to dimension will be a key ingredient in the applications described in the following section. It is important to note that, while the number of vertices is independent of the dimension of intermediate representations, edge weights are a function of a similarity in the considered latent space, which can have very different meanings depending on the underlying dimensions.
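The dimension-agnostic property can be checked with a short sketch: two latent spaces of different dimensions produce adjacency matrices of the same size for the same batch, so they can be compared entry by entry. The random features, the dense (unthresholded) cosine graph, and the use of a Frobenius norm as a discrepancy measure are assumptions made for illustration only.

```python
import numpy as np

def cosine_graph(reps):
    """Dense cosine-similarity graph (no thresholding) for non-negative reps."""
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    A = unit @ unit.T
    np.fill_diagonal(A, 0.0)  # assumption: no self-loops
    return A

rng = np.random.default_rng(0)
batch_size = 8
wide = np.abs(rng.normal(size=(batch_size, 512)))   # e.g., a 512-dimensional layer
narrow = np.abs(rng.normal(size=(batch_size, 32)))  # e.g., a 32-dimensional layer

A_wide = cosine_graph(wide)
A_narrow = cosine_graph(narrow)

# Both graphs have one vertex per data point, whatever the latent dimension.
assert A_wide.shape == A_narrow.shape == (batch_size, batch_size)

# Hypothetical discrepancy measure: Frobenius norm of the difference.
gap = np.linalg.norm(A_wide - A_narrow)
print(gap)
```

The shapes match by construction, but the edge weights themselves are still computed in spaces of different dimensions, which is why the caveat about their meaning applies.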
In the context of DL architectures, a common choice of similarity measure is the cosine similarity. Note that cosine similarity is bounded between 0 and 1, and can thus be used directly as an edge weight, only for nonnegative data (as typically obtained after a ReLU function). When data can be negative, we use a Gaussian kernel applied to the Euclidean distance instead. The problem remains that both cosine and Euclidean similarities suffer from the curse of dimensionality. In an effort to reduce the influence of dimension when comparing LGGs obtained from latent spaces with distinct dimensions, in our experiments, we make use of graph normalization, as defined in step (4) of the LGG construction. We also provide a discussion on the complexity of graph similarity computation in
Appendix D. Note that a more in-depth analysis and understanding of the influence of dimension on graph construction is a promising direction for future work, as improving the graph construction could benefit all applications covered in this work.