1. Introduction
Metric learning and few-shot classification are two problem settings that test a model’s ability to classify data from classes that were unseen during training. Such problems are also commonly interpreted as tests of meta-learning ability, since the process of constructing a classifier from examples of new classes can itself be seen as learning. Many recent works [
1,
2,
3,
4] tackled this problem by learning a continuous embedding (
$\tilde{\mathbf{x}}\in {\mathbb{R}}^{n}$) of datapoints. Such models compare pairs of embeddings using, e.g., Euclidean distance to perform nearest neighbor classification. However, it remains unclear whether such models effectively utilize the entire space of
${\mathbb{R}}^{n}$.
Information theory provides a principled framework for asking such questions about representation schemes. In particular, the information bottleneck principle [
5,
6] characterizes the optimality of a representation. This principle states that the optimal representation
$\tilde{X}$ is one that maximally compresses the input
X while also being predictive of labels
Y. From this viewpoint, we see that the previous methods which map data to
${\mathbb{R}}^{n}$ focus on being predictive of labels
Y without considering the compression of
X.
The degree of compression of an embedding is the number of bits it retains about the original data. Note that for continuous embeddings, each of the n numbers in an n-dimensional embedding requires 32 bits. It is unlikely that unconstrained optimization of such embeddings uses all of these $32n$ bits effectively. We propose to resolve this limitation by instead using discrete embeddings and controlling the number of bits in each dimension via hyperparameters. To this end, we propose a model that produces discrete infomax codes (DIMCO) via an end-to-end learnable neural network encoder.
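To make the storage gap concrete, the comparison above can be sketched in a few lines. This is an illustrative calculation only; the function names below are ours, not part of DIMCO.

```python
import math

def continuous_embedding_bits(n: int) -> int:
    """Bits needed to store an n-dimensional float32 embedding."""
    return 32 * n

def discrete_code_capacity_bits(k: int, d: int) -> float:
    """Maximum information content of a k-way d-dimensional code: d * log2(k) bits."""
    return d * math.log2(k)

# A 64-dimensional float32 embedding occupies 32 * 64 = 2048 bits,
# while a 16-way 4-dimensional code carries at most 4 * log2(16) = 16 bits.
```

Here k and d directly control the code's capacity, which is exactly the handle on compression that unconstrained continuous embeddings lack.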
This work’s primary contributions are as follows. We use mutual information as an objective for learning embeddings, and propose an efficient method of estimating it in the discrete case. We experimentally demonstrate that learned discrete embeddings are more memory- and time-efficient than continuous embeddings. Our experiments also show that using discrete embeddings helps meta-generalization by acting as an information bottleneck. We also provide theoretical support for this connection through an information-theoretic probably approximately correct (PAC) bound that characterizes the generalization properties of learned discrete codes.
This paper is organized as follows. We propose our model for learning discrete codes in
Section 2. We justify our loss function and also provide a generalization bound for our setup in
Section 3. We compare our method to related work in
Section 4, and present experimental results in
Section 5. Finally, we conclude our paper in
Section 6.
2. Discrete Infomax Codes (DIMCO)
We present our model, which produces discrete infomax codes (DIMCO). A deep neural network is trained end-to-end to learn k-way d-dimensional discrete codes that maximally preserve information about the labels. We outline the training procedure in Algorithm 1, and illustrate the overall structure in the case of 4-way 3-dimensional codes (
$k=4,d=3$) in
Figure 1.
Algorithm 1 DIMCO training procedure. 

2.1. Learnable Discrete Codes
Suppose that we are given a set of labeled examples, which are realizations of random variables $(X,Y)\sim p(\mathbf{x},y)$, where X is the continuous input and Y is its corresponding discrete label. Realizations of X and Y are denoted by $\mathbf{x}\in {\mathbb{R}}^{D}$ and $y\in \{1,\dots ,c\}$, respectively. The codebook $\tilde{X}$ serves as a compressed representation of X.
We construct a probabilistic encoder $p(\tilde{\mathbf{x}}\mid \mathbf{x})$, implemented by a deep neural network, that maps an input $\mathbf{x}$ to a k-way d-dimensional code $\tilde{\mathbf{x}}\in {\{1,2,\dots ,k\}}^{d}$. That is, each entry of $\tilde{\mathbf{x}}$ takes one of k possible values, and the cardinality of $\tilde{X}$ is $|\tilde{X}|={k}^{d}$. Special cases of this coding scheme include k-way class labels ($d=1$), d-dimensional binary codes ($k=2$), and even fixed-length decimal integers ($k=10$).
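The special cases of this coding scheme can be checked by enumerating the code space. A small illustrative sketch (the helper below is ours, and full enumeration is feasible only for tiny k and d):

```python
from itertools import product

def all_codewords(k: int, d: int):
    """Enumerate the code space {1, ..., k}^d, which has cardinality k**d."""
    return list(product(range(1, k + 1), repeat=d))

# Special cases of the k-way d-dimensional coding scheme:
# d = 1  -> k-way class labels
# k = 2  -> d-dimensional binary codes
# k = 10 -> fixed-length decimal integers
```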
We now describe the encoder in more detail. A neural network encoder
$f(\mathbf{x};\theta )$ outputs
a k-way categorical distribution for each of the d dimensions,
$\mathrm{Cat}({p}_{i,1},\dots ,{p}_{i,k})$. Here,
${p}_{i,j}\left(\mathbf{x}\right)$ represents the probability that output variable
i takes on value
j, given the input $\mathbf{x}$, for
$i=1,\dots ,d$ and
$j=1,\dots ,k$. The encoder takes
$\mathbf{x}$ as an input to produce logits
${l}_{i,j}=f{\left(\mathbf{x}\right)}_{i,j}$, which form a
$d\times k$ matrix:

$\left[\begin{array}{ccc}{l}_{1,1}& \cdots & {l}_{1,k}\\ \vdots & \ddots & \vdots \\ {l}_{d,1}& \cdots & {l}_{d,k}\end{array}\right].$ (1)
These logits pass through a row-wise softmax to yield

${p}_{i,j}=\frac{\mathrm{exp}({l}_{i,j})}{{\sum}_{{j}^{\prime}=1}^{k}\mathrm{exp}({l}_{i,{j}^{\prime}})},\quad i=1,\dots ,d,\ j=1,\dots ,k.$ (2)
Each example
$\mathbf{x}$ in the training set is assigned a codeword
$\tilde{\mathbf{x}}={[{\tilde{x}}_{1},\dots ,{\tilde{x}}_{d}]}^{\top}$, each entry of which is determined by one of
k events that is most probable; i.e.,

${\tilde{x}}_{i}={\mathrm{argmax}}_{j\in \{1,\dots ,k\}}\ {p}_{i,j}\left(\mathbf{x}\right),\quad i=1,\dots ,d.$ (3)
While the stochastic encoder
$p(\tilde{\mathbf{x}}\mid \mathbf{x})$ induces a soft partitioning of input data, codewords assigned by the rule in (
3) yield a hard partitioning of
X.
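The output layer described above can be sketched as follows. This is a minimal NumPy illustration under our own naming; in DIMCO the logit matrix would be produced by the learned encoder $f(\mathbf{x};\theta )$, which we simply take as given here.

```python
import numpy as np

def encode(logits: np.ndarray):
    """Map a d x k logit matrix to d categorical distributions and a hard codeword.

    Row i of the result is the k-way distribution Cat(p_{i,1}, ..., p_{i,k});
    the codeword picks the most probable symbol in each dimension.
    Indices are 0-based here, whereas the text writes symbols as 1, ..., k.
    """
    z = logits - logits.max(axis=1, keepdims=True)        # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # row-wise softmax
    codeword = p.argmax(axis=1)                           # hard assignment per dimension
    return p, codeword

# A 4-way 3-dimensional example (k = 4, d = 3):
p, codeword = encode(np.array([[2.0, 0.0, 0.0, 0.0],
                               [0.0, 3.0, 0.0, 0.0],
                               [0.0, 0.0, 0.0, 1.0]]))
```

Each row of `p` sums to one, and `codeword` holds the per-dimension argmax, matching the soft and hard partitionings described above.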
2.2. Loss Function
The ith symbol is modeled as a sample from the corresponding categorical distribution $\mathrm{Cat}({p}_{i,1},\dots ,{p}_{i,k})$. We denote the resulting distribution over codes as $\tilde{X}$ and a code as $\tilde{\mathbf{x}}$. Instead of sampling $\tilde{\mathbf{x}}\sim \tilde{X}$ during training, we use a loss function that optimizes the expected performance of the entire distribution $\tilde{X}$.
We train the encoder by maximizing the mutual information between the distributions of codes
$\tilde{X}$ and labels
Y. The mutual information is a symmetric quantity that measures the amount of information shared between two random variables. It is defined as

$I(\tilde{X};Y)=H(\tilde{X})-H(\tilde{X}\mid Y).$ (4)
Since
$\tilde{X}$ and
Y are discrete, their mutual information is bounded from both above and below as
$0\le I(\tilde{X};Y)\le \mathrm{log}|\tilde{X}|=d\,\mathrm{log}\,k$. To optimize the mutual information, the encoder directly computes empirical estimates of the two terms on the right-hand side of (
4). Note that both terms consist of entropies of categorical distributions, which have the general closed-form formula:

$H(\mathrm{Cat}({p}_{1},\dots ,{p}_{k}))=-{\sum}_{j=1}^{k}{p}_{j}\,\mathrm{log}\,{p}_{j}.$ (5)
Let
${\overline{p}}_{ij}$ be the empirical average of
${p}_{ij}$ calculated using data points in a batch. Then,
${\overline{p}}_{ij}$ is an empirical estimate of the marginal distribution
$p\left(\tilde{x}\right)$. We compute the empirical estimate of
$H\left(\tilde{X}\right)$ by summing the entropy estimates over the d dimensions:

$H(\tilde{X})\approx -{\sum}_{i=1}^{d}{\sum}_{j=1}^{k}{\overline{p}}_{ij}\,\mathrm{log}\,{\overline{p}}_{ij}.$ (6)
We can also compute

$H(\tilde{X}\mid Y)={\sum}_{y=1}^{c}p(Y=y)\,H(\tilde{X}\mid Y=y),$ (7)
where
c is the number of classes. The marginal probability
$p(Y=y)$ is the frequency of class
y in the minibatch, and
$H(\tilde{X}\mid Y=y)$ can be computed by evaluating (
6) using only datapoints which belong to class
y. We emphasize that such a closed-form estimation of
$I(\tilde{X};Y)$ is only possible because we are using discrete codes. If
$\tilde{X}$ were instead a continuous variable, we would only be able to maximize an approximation of
$I(\tilde{X};Y)$ (e.g., Belghazi et al. [
7]).
We briefly examine the loss function (
4) to see why maximizing it results in discriminative
$\tilde{X}$. Maximizing
$H\left(\tilde{X}\right)$ encourages the distribution of all codes to be as dispersed as possible, and minimizing
$H(\tilde{X}\mid Y)$ encourages the average embedding of each class to be as concentrated as possible. Thus, the overall loss
$I(\tilde{X};Y)$ imposes a partitioning problem on the model: it learns to split the entire probability space into regions with minimal overlap between different classes. As this problem is intractable for the large models considered in this work, we instead seek a local optimum via stochastic gradient descent (SGD). We provide a further analysis of this loss function in
Section 3.1.
2.3. Similarity Measure
Suppose that all data points in the training set are assigned their codewords according to the rule (
3). We now describe how to compute the similarity between a query datapoint
${\mathbf{x}}^{\left(q\right)}$ and a support datapoint
${\mathbf{x}}^{\left(s\right)}$ for information retrieval or fewshot classification, where the superscripts
$\left(q\right),\left(s\right)$ stand for query and support, respectively. Denote by
${\tilde{\mathbf{x}}}^{\left(s\right)}$ the codeword associated with
${\mathbf{x}}^{\left(s\right)}$, constructed by (
3). For the test data
${\mathbf{x}}^{\left(q\right)}$, the encoder yields
${p}_{i,j}\left({\mathbf{x}}^{\left(q\right)}\right)$ for
$i=1,\dots ,d$ and
$j=1,\dots ,k$. As a similarity measure between
${\mathbf{x}}^{\left(q\right)}$ and
${\mathbf{x}}^{\left(s\right)}$, we calculate the following log probability:

$\mathrm{log}\,p\left({\tilde{\mathbf{x}}}^{\left(s\right)}\mid {\mathbf{x}}^{\left(q\right)}\right)={\sum}_{i=1}^{d}\mathrm{log}\,{p}_{i,{\tilde{x}}_{i}^{\left(s\right)}}\left({\mathbf{x}}^{\left(q\right)}\right).$ (8)
The probabilistic quantity (
8) indicates that
${\mathbf{x}}^{\left(q\right)}$ and
${\mathbf{x}}^{\left(s\right)}$ become more similar when the encoder’s output for ${\mathbf{x}}^{\left(q\right)}$ is well aligned with
${\tilde{\mathbf{x}}}^{\left(s\right)}$.
We can view our similarity measure (
8) as a probabilistic generalization of the Hamming distance [
8]. The Hamming distance quantifies the dissimilarity between two strings of equal length as the number of positions at which the corresponding symbols differ. As we have access to a distribution over codes, we use (
8) to directly compute the log probability of having the same symbol at each position.
We use (
8) as a similarity metric for both fewshot classification and image retrieval. We perform fewshot classification by computing a codeword for each class via (
3) and classifying each test image by choosing the class that has the highest value of (
8). We similarly perform image retrieval by mapping each support image to its most likely code (
3) and for each query image retrieving the support image that has the highest (
8).
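The retrieval and classification procedures above can be sketched in a few lines of NumPy. The function names are ours; `p_query` stands in for the encoder's output ${p}_{i,j}({\mathbf{x}}^{(q)})$, and each class codeword would come from the hard-assignment rule.

```python
import numpy as np

def log_prob_similarity(p_query, codeword):
    """Similarity of a query to a codeword: the log-probability that the
    query's code matches the codeword, summed over the d dimensions."""
    d = len(codeword)
    matched = p_query[np.arange(d), codeword]  # p_{i, x~_i} for each dimension i
    return float(np.log(np.clip(matched, 1e-12, None)).sum())

def classify(p_query, class_codewords):
    """Few-shot classification: choose the class whose codeword is most
    probable under the query's predicted code distribution."""
    scores = [log_prob_similarity(p_query, cw) for cw in class_codewords]
    return int(np.argmax(scores))

# d = 2, k = 3: the query strongly predicts symbols (0, 2), so it is
# assigned to the class whose codeword is (0, 2) rather than (1, 1).
p_query = np.array([[0.80, 0.10, 0.10],
                    [0.05, 0.05, 0.90]])
label = classify(p_query, [np.array([0, 2]), np.array([1, 1])])
```

Image retrieval follows the same pattern, with support codewords in place of class codewords and an argmax over the support set.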
While we have described the operations in (
3) and (
8) for a single pair
$({\mathbf{x}}^{\left(q\right)},{\mathbf{x}}^{\left(s\right)})$, one can easily parallelize our evaluation procedure, since it is an argmax followed by a sum. Furthermore,
$\tilde{\mathbf{x}}$ typically requires little memory, as it consists of discrete values, allowing us to compare against large support sets in parallel. Experiments in
Section 5.4 investigate DIMCO’s efficiency in terms of both time and memory.
2.4. Regularizing by Enforcing Independence
One way of interpreting the code distribution
$\tilde{X}$ is as a group of
d separate code distributions
${\tilde{x}}_{1},\dots ,{\tilde{x}}_{d}$. Note that the similarity measure described in (
8) can be seen as an ensemble of the similarity measures of these
d models. A classic result in ensemble learning is that using more diverse learners increases ensemble performance [
9]. In a similar spirit, we used an optional regularizer which promotes pairwise independence between each pair in these
d codes. Using this regularizer stabilized training, especially on larger-scale problems.
Specifically, we randomly sample pairs of indices
${i}_{1},{i}_{2}$ from
$1,\dots ,d$ during each forward pass. Note that
${\tilde{x}}_{{i}_{1}}\otimes {\tilde{x}}_{{i}_{2}}$ and
$({\tilde{x}}_{{i}_{1}},{\tilde{x}}_{{i}_{2}})$ are both categorical distributions with support size
${k}^{2}$, and that we can estimate both distributions within each batch. We minimize the KL divergence between them to promote independence:

$\mathrm{KL}\left(({\tilde{x}}_{{i}_{1}},{\tilde{x}}_{{i}_{2}})\,\|\,{\tilde{x}}_{{i}_{1}}\otimes {\tilde{x}}_{{i}_{2}}\right).$ (9)
We compute (
9) for a fixed number of random index pairs in each batch. The cost of computing this regularization term is minuscule compared to that of other components, such as feeding data through the encoder.
Using this regularizer in conjunction with the learning objective (
4) yields the following regularized loss:

$\mathcal{L}=-I(\tilde{X};Y)+\lambda \,{\mathbb{E}}_{{i}_{1},{i}_{2}}\left[\mathrm{KL}\left(({\tilde{x}}_{{i}_{1}},{\tilde{x}}_{{i}_{2}})\,\|\,{\tilde{x}}_{{i}_{1}}\otimes {\tilde{x}}_{{i}_{2}}\right)\right].$ (10)
We fix
$\lambda =1$ in all experiments, as we found that DIMCO’s performance was not particularly sensitive to this hyperparameter. We emphasize that while this optional regularizer stabilizes training, our learning objective is the mutual information
$I(\tilde{X};Y)$ in (
4).
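The pairwise penalty can be estimated within a batch as follows. This is a sketch under our own naming, in which both the joint over symbol pairs and the product of marginals are batch estimates:

```python
import numpy as np

def pairwise_independence_kl(p, i1, i2):
    """KL between the empirical joint of (x~_{i1}, x~_{i2}) and the product
    of their empirical marginals, both estimated from one batch.

    p: (batch, d, k) per-example categorical probabilities.
    """
    # Joint over the k^2 symbol pairs: batch average of per-example outer products.
    joint = np.einsum('bj,bl->jl', p[:, i1], p[:, i2]) / p.shape[0]
    # Product of the two batch-averaged marginals.
    product = np.outer(p[:, i1].mean(axis=0), p[:, i2].mean(axis=0))
    joint = np.clip(joint, 1e-12, None)
    product = np.clip(product, 1e-12, None)
    return float((joint * (np.log(joint) - np.log(product))).sum())
```

When every example in the batch produces the same distributions, the joint factorizes and the penalty vanishes; it grows as the two code dimensions co-vary across the batch.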
2.5. Visualization of Codes
In
Figure 2, we show images retrieved using our similarity measure (
8). We trained a DIMCO model (
$k=16$,
$d=4$) on the CIFAR100 dataset. We selected specific code locations and plotted the top 10 test images according to our similarity measure. For example, the top (leftmost) image for code
$(\,\cdot\,,{j}_{2},\,\cdot\,,{j}_{4})$ would be computed as

${\mathrm{argmax}}_{n\in \{1,\dots ,N\}}\left[\mathrm{log}\,{p}_{2,{j}_{2}}({\mathbf{x}}_{n})+\mathrm{log}\,{p}_{4,{j}_{4}}({\mathbf{x}}_{n})\right],$
where
N is the number of test images.
We visualize two different combinations of codes in
Figure 2. The two examples show that using codewords together results in their respective semantic concepts being combined: (man + fish = man holding fish), (round + warm color = orange). While we visualized combinations of 2 codewords for clarity, DIMCO itself uses a combination of
d such codewords. The regularizer described in
Section 2.4 further encourages each of these
d codewords to represent different concepts. The combinatorially many (
${k}^{d}$) combinations in which DIMCO can assemble such codewords give it sufficient expressive power to solve challenging tasks.
4. Related Work
Information bottleneck. DIMCO and Theorem 1 are both close in spirit to the information bottleneck (IB) principle [
5,
6,
16]. IB finds a set of compact representatives
$\tilde{X}$ while maintaining sufficient information about
Y, minimizing the following objective function:

$\underset{p(\tilde{\mathbf{x}}\mid \mathbf{x})}{\mathrm{min}}\ I(\tilde{X};X)-\beta \,I(\tilde{X};Y),$ (18)
subject to
${\sum}_{\tilde{\mathbf{x}}}p(\tilde{\mathbf{x}}\mid \mathbf{x})=1$. Equivalently, one maximizes
$I(\tilde{X};Y)$ while simultaneously minimizing
$I(\tilde{X};X)$. Similarly, our objective (
15) is information maximization
$I(\tilde{X};Y)$, and our bound (
A11) suggests that the representation capacity
$|\tilde{X}|$ should be low for generalization. In the deterministic information bottleneck [
17],
$I(\tilde{X};X)$ is replaced by
$H\left(\tilde{X}\right)$. These three approaches to generalization are related via the chain of inequalities
$I(\tilde{X};X)\le H(\tilde{X})\le \mathrm{log}|\tilde{X}|$, which is tight in the limit of $\tilde{X}$ being incompressible. For any finite representation, i.e., $|\tilde{X}|=N$, the limit
$\beta \to \infty $ in (
18) yields a hard partitioning of
X into
N disjoint sets. DIMCO uses the infomax principle to learn
$N={k}^{d}$ such representatives, arranged as k-way d-dimensional discrete codes for a compact representation with sufficient information about
Y.
Regularizing meta-learning. Previous meta-learning methods have restricted task-specific learning by learning only a subset of the network [18], learning on a low-dimensional latent space [19], learning on a meta-learned prior distribution of parameters [20], and learning context vectors instead of model parameters [21]. Our analysis in Theorem 1 suggests that reducing the expressive power of the task-specific learner has a meta-regularizing effect, indirectly providing theoretical support for previous works that benefited from restricting task-specific learners.
Discrete representations. Discrete representations have been thoroughly studied in information theory [
22]. Recent deep learning methods directly learn discrete representations by learning generative models with discrete latent variables [
23,
24,
25] or maximizing the mutual information between representation and data [
26]. DIMCO is related to but differs from these works, as it assumes a supervised metalearning setting and performs infomax using
labels instead of data.
A standard approach to learning label-aware discrete codes is to first learn continuous embeddings and then quantize them using an objective that maximally preserves their information [
27,
28,
29]. DIMCO can be seen as an endtoend alternative to quantization which directly learns discrete codes. Jeong and Song [
30] similarly learn a sparse binary code in an end-to-end fashion by solving a minimum cost flow problem with respect to labels. Their method differs from DIMCO, which learns a dense discrete code by optimizing $I(\tilde{X};Y)$, estimated with a closed-form formula.
Metric learning. The structure and loss function of DIMCO are closely related to those of metric learning methods [
1,
11,
12,
31]. We show that the loss functions of these methods can be seen as approximations of the mutual information (
$I(\tilde{X};Y)$) in
Section 2.2, and provide more indepth exposition in
Appendix A. While all of these previous methods require a support/query split within each batch, DIMCO simply optimizes an informationtheoretic quantity of each batch, removing the need for such structured batch construction.
Information theory and representation learning. Many works have applied information-theoretic principles to unsupervised representation learning: to derive an objective for GANs to learn disentangled features [
32], to analyze the evidence lower bound (ELBO) [
33,
34], and to directly learn representations [
35,
36,
37,
38,
39]. Related also are previous methods that enforce independence within an embedding [
40,
41]. DIMCO is also an information-theoretic representation learning method, but we instead assume a supervised setup in which the representation must reflect ground-truth labels. We also use previous results from information theory to prove a generalization bound for our representation learning method.
6. Discussion
We introduced DIMCO, a model that learns a discrete representation of data by directly optimizing the mutual information with the label. To evaluate our initial intuition that shorter representations generalize better between tasks, we provided generalization bounds that become tighter as the representation gets shorter. Our experiments demonstrated that DIMCO is effective both at compressing a continuous embedding and at learning a discrete embedding from scratch in an end-to-end manner. The discrete embeddings of DIMCO outperformed recent continuous feature extraction methods while being more efficient in terms of both memory and time. We believe the tradeoff between discrete and continuous embeddings is an exciting area for future research.
DIMCO was motivated by concepts such as the minimum description length (MDL) principle and the information bottleneck: compact task representations should have less room to overfit. Interestingly, Yin et al. [
55] report that doing the opposite, regularizing the task-general parameters, prevents meta-overfitting by discouraging the meta-learning model from memorizing the given set of tasks. In future work, we will investigate the common principle underlying these seemingly contradictory approaches for a fuller understanding of meta-generalization.