Discrete Infomax Codes for Supervised Representation Learning

For high-dimensional data such as images, learning an encoder that outputs a compact yet informative representation is a key task in its own right, in addition to facilitating subsequent processing. We present a model that produces discrete infomax codes (DIMCO): we train a probabilistic encoder that yields k-way d-dimensional codes associated with input data. Our model maximizes the mutual information between codes and ground-truth class labels, with a regularizer that encourages the entries of a codeword to be statistically independent. In this context, we show that the infomax principle also justifies existing loss functions, such as cross-entropy, as special cases. Our analysis further shows that using shorter codes reduces overfitting in few-shot classification, and our experiments confirm this implicit task-level regularization effect of DIMCO. Furthermore, we show that the codes learned by DIMCO are more efficient in terms of both memory and retrieval time than those of prior methods.


Introduction
Metric learning and few-shot classification are two problem settings that test a model's ability to classify data from classes that were unseen during training. Such problems are also commonly interpreted as testing meta-learning ability, since the process of constructing a classifier with examples from new classes can be seen as learning. Many recent works [1][2][3][4] tackled this problem by learning a continuous embedding ( x ∈ R n ) of datapoints. Such models compare pairs of embeddings using, e.g., Euclidean distance to perform nearest neighbor classification. However, it remains unclear whether such models effectively utilize the entire space of R n .
Information theory provides a framework for asking such questions about representation schemes. In particular, the information bottleneck principle [5,6] characterizes the optimality of a representation: the optimal representation X̄ is one that maximally compresses the input X while remaining predictive of the labels Y. From this viewpoint, we see that previous methods which map data to R^n focus on being predictive of Y without considering the compression of X.
The degree of compression of an embedding is the number of bits it retains about the original data. Note that for continuous embeddings, each of the n numbers in an n-dimensional embedding requires 32 bits. It is unlikely that unconstrained optimization of such embeddings uses all of these 32n bits effectively. We propose to resolve this limitation by instead using discrete embeddings and controlling the number of bits in each dimension via hyperparameters. To this end, we propose a model that produces discrete infomax codes (DIMCO) via an end-to-end learnable neural network encoder.
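As a back-of-the-envelope illustration of this gap (the helper names below are ours, not part of the paper), the bit cost of each representation can be computed directly:

```python
import math

def float_embedding_bits(n_dims: int, bits_per_float: int = 32) -> int:
    """Bits needed to store a continuous embedding in R^n as float32."""
    return n_dims * bits_per_float

def discrete_code_bits(k: int, d: int) -> float:
    """Bits needed to identify one k-way d-dimensional discrete code."""
    return d * math.log2(k)

# A 64-dimensional float32 embedding occupies 2048 bits, while a
# 16-way 4-dimensional code is identified by only 16 bits.
```

The hyperparameters k and d thus give direct control over the information capacity of the representation.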
This work's primary contributions are as follows. We use mutual information as an objective for learning embeddings and propose an efficient method of estimating it in the discrete case. We experimentally demonstrate that the learned discrete embeddings are more memory- and time-efficient than continuous embeddings. Our experiments also show that using discrete embeddings helps meta-generalization by acting as an information bottleneck. We also provide theoretical support for this connection through an information-theoretic probably approximately correct (PAC) bound that characterizes the generalization properties of learned discrete codes.
This paper is organized as follows. We propose our model for learning discrete codes in Section 2. We justify our loss function and also provide a generalization bound for our setup in Section 3. We compare our method to related work in Section 4, and present experimental results in Section 5. Finally, we conclude our paper in Section 6.

Discrete Infomax Codes (DIMCO)
We present our model, which produces discrete infomax codes (DIMCO). A deep neural network is trained end-to-end to learn k-way d-dimensional discrete codes that maximally preserve information about the labels. We outline the training procedure in Algorithm 1 and illustrate the overall structure for 4-way 3-dimensional codes (k = 4, d = 3) in Figure 1: (a) discrete codes are produced by a probabilistic encoder that maps each datapoint to a distribution over k-way d-dimensional discrete codes, trained to maximize the mutual information between the code distribution and the label distribution; (b) given a query image, we compare it against a support set of discrete codes and corresponding labels, using the query's log probability for each discrete code as the similarity metric.

Learnable Discrete Codes
Suppose that we are given a set of labeled examples, which are realizations of random variables (X, Y) ∼ p(x, y), where X is the continuous input and Y is its corresponding discrete label. Realizations of X and Y are denoted by x ∈ R^D and y ∈ {1, . . . , c}, respectively. The codebook X̄ serves as a compressed representation of X.
We construct a probabilistic encoder p(x̄|x), implemented by a deep neural network, that maps an input x to a k-way d-dimensional code x̄ ∈ {1, 2, . . . , k}^d. That is, each entry of x̄ takes one of k possible values, and the cardinality of the code space is |X̄| = k^d. Special cases of this coding scheme include k-way class labels (d = 1), d-dimensional binary codes (k = 2), and even fixed-length decimal integers (k = 10).
We now describe our model, which produces discrete infomax codes. A neural network encoder f(x; θ) outputs d k-dimensional categorical distributions Cat(p_{i,1}(x), . . . , p_{i,k}(x)), where p_{i,j}(x) represents the probability that output variable i takes on value j given the input x, for i = 1, . . . , d and j = 1, . . . , k. The encoder takes x as an input to produce logits

    l_{i,j}(x) = [f(x; θ)]_{i,j}.    (1)

These logits undergo a softmax over j to yield

    p_{i,j}(x) = exp(l_{i,j}(x)) / Σ_{j'=1}^{k} exp(l_{i,j'}(x)).    (2)

Each example x in the training set is assigned a codeword x̄ = [x̄_1, . . . , x̄_d], each entry of which is determined by the most probable of the k events:

    x̄_i = argmax_{j ∈ {1,...,k}} p_{i,j}(x).    (3)

While the stochastic encoder p(x̄|x) induces a soft partitioning of the input data, codewords assigned by the rule in (3) yield a hard partitioning of X.
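A minimal sketch of this encoding step, assuming the backbone has already produced a (d, k) matrix of logits (the function name and shapes are our illustrative choices):

```python
import numpy as np

def encode(logits: np.ndarray):
    """Turn backbone logits of shape (d, k) into d categorical
    distributions and the hard codeword given by rule (3)."""
    # softmax over the k symbols at each of the d code positions
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    codeword = probs.argmax(axis=1)  # rule (3): most probable symbol
    return probs, codeword

rng = np.random.default_rng(0)
probs, codeword = encode(rng.normal(size=(3, 4)))  # d = 3, k = 4
```

During training, the soft distributions `probs` are used; the hard `codeword` is only materialized for retrieval and few-shot evaluation.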

Loss Function
The i-th symbol is assumed to be sampled from the resulting categorical distribution Cat(p_{i,1}, . . . , p_{i,k}). We denote the resulting distribution over codes as X̄ and a sampled code as x̄. Instead of sampling x̄ ∼ X̄ during training, we use a loss function that optimizes the expected performance of the entire distribution X̄.
We train the encoder by maximizing the mutual information between the distribution of codes X̄ and the distribution of labels Y. The mutual information is a symmetric quantity that measures the amount of information shared between two random variables. It is defined as

    I(X̄; Y) = H(X̄) − H(X̄|Y).    (4)

Since X̄ and Y are discrete, their mutual information is bounded from both above and below: 0 ≤ I(X̄; Y) ≤ log |X̄| = d log k. To optimize the mutual information, the encoder directly computes empirical estimates of the two terms on the right-hand side of (4). Note that both terms consist of entropies of categorical distributions, which have the general closed-form formula

    H(Cat(p_1, . . . , p_k)) = −Σ_{j=1}^{k} p_j log p_j.    (5)

Let p̄_{i,j} be the empirical average of p_{i,j} calculated using the data points in a batch; p̄_{i,j} is then an empirical estimate of the marginal distribution p(x̄_i = j). We compute the empirical estimate of H(X̄) by summing the entropy estimates over the d dimensions:

    Ĥ(X̄) = Σ_{i=1}^{d} H(Cat(p̄_{i,1}, . . . , p̄_{i,k})).    (6)
We can also compute

    Ĥ(X̄|Y) = Σ_{y=1}^{c} p(Y = y) Ĥ(X̄|Y = y),    (7)

where c is the number of classes. The marginal probability p(Y = y) is the frequency of class y in the minibatch, and Ĥ(X̄|Y = y) can be obtained by computing (6) using only the datapoints that belong to class y. We emphasize that such a closed-form estimation of I(X̄; Y) is only possible because we are using discrete codes. If X̄ were instead a continuous variable, we would only be able to maximize an approximation of I(X̄; Y) (e.g., Belghazi et al. [7]). We briefly examine the loss function (4) to see why maximizing it results in a discriminative X̄. Maximizing H(X̄) encourages the distribution of all codes to be as dispersed as possible, and minimizing H(X̄|Y) encourages the code distribution of each class to be as concentrated as possible. Thus, the overall loss I(X̄; Y) poses a partitioning problem: the model learns to split the probability space into regions with minimal overlap between different classes. As this problem is intractable for the large models considered in this work, we instead seek a local optimum via stochastic gradient descent (SGD). We provide further analysis of this loss function in Section 3.1.
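The batch estimate described above can be sketched as follows for a single code position; the helper names are ours, and a full implementation would sum the estimate over the d positions:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Closed-form entropy of a categorical distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def batch_mutual_information(probs: np.ndarray, labels: np.ndarray) -> float:
    """Batch estimate of I(X̄;Y) = H(X̄) − H(X̄|Y) for one code position.
    probs: (batch, k) categorical outputs; labels: (batch,) integer classes."""
    h_marginal = entropy(probs.mean(axis=0))  # entropy of the batch-average p̄
    h_conditional = 0.0
    for y in np.unique(labels):
        mask = labels == y
        # weight each class term by its batch frequency p(Y = y)
        h_conditional += mask.mean() * entropy(probs[mask].mean(axis=0))
    return h_marginal - h_conditional
```

When the code is fully determined by the label, the estimate reaches its maximum H(Y); when the code is independent of the label, it is zero.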

Similarity Measure
Suppose that all data points in the training set are assigned codewords according to the rule (3). We now introduce how to compute a similarity between a query datapoint x^(q) and a support datapoint x^(s) for information retrieval or few-shot classification, where the superscripts (q) and (s) stand for query and support, respectively. Denote by x̄^(s) the codeword associated with x^(s), constructed by (3). For the query x^(q), the encoder yields p_{i,j}(x^(q)) for i = 1, . . . , d and j = 1, . . . , k. As a similarity measure between x^(q) and x^(s), we calculate the log probability

    s(x^(q), x̄^(s)) = Σ_{i=1}^{d} log p_{i, x̄_i^(s)}(x^(q)).    (8)
The probabilistic quantity (8) indicates that x^(q) and x^(s) are more similar when the encoder's output for x^(q) is well aligned with x̄^(s). We can view our similarity measure (8) as a probabilistic generalization of the Hamming distance [8], which quantifies the dissimilarity between two strings of equal length as the number of positions at which the corresponding symbols differ. As we have access to a distribution over codes, we use (8) to directly compute the log probability of observing the same symbol at each position.
We use (8) as a similarity metric for both few-shot classification and image retrieval. We perform few-shot classification by computing a codeword for each class via (3) and classifying each test image into the class with the highest value of (8). We similarly perform image retrieval by mapping each support image to its most likely code via (3) and, for each query image, retrieving the support image with the highest value of (8).
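A minimal sketch of this scoring rule (names and array layout are our assumptions):

```python
import numpy as np

def log_prob_similarity(query_log_probs: np.ndarray,
                        support_code: np.ndarray) -> float:
    """Similarity (8): log probability that the query's code distribution
    emits the support codeword, summed over the d code positions.
    query_log_probs: (d, k) array of log p_{i,j}(x_q); support_code: (d,)."""
    d = support_code.shape[0]
    # pick the log probability of the support symbol at each position
    return float(query_log_probs[np.arange(d), support_code].sum())
```

Few-shot classification then reduces to an argmax of this score over the per-class codewords.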
While we have described the operations in (3) and (8) for a single pair (x (q) , x (s) ), one can easily parallelize our evaluation procedure, since it is an argmax followed by a sum. Furthermore, x typically requires little memory, as it consists of discrete values, allowing us to compare against large support sets in parallel. Experiments in Section 5.4 investigate the degree of DIMCO's efficiency in terms of both time and memory.

Regularizing by Enforcing Independence
One way of interpreting the code distribution X̄ is as a group of d separate code distributions x̄_1, . . . , x̄_d. Note that the similarity measure described in (8) can then be seen as an ensemble of the similarity measures of these d models. A classic result in ensemble learning is that more diverse learners yield better ensemble performance [9]. In a similar spirit, we use an optional regularizer which promotes pairwise independence among these d codes. Using this regularizer stabilized training, especially in larger-scale problems.
Specifically, we randomly sample pairs of indices i_1, i_2 from {1, . . . , d} during each forward pass. Note that the product of marginals x̄_{i_1} ⊗ x̄_{i_2} and the joint (x̄_{i_1}, x̄_{i_2}) are both categorical distributions with support size k², and both can be estimated within each batch. We minimize their KL divergence to promote independence between the two code positions:

    R(i_1, i_2) = KL( p(x̄_{i_1}, x̄_{i_2}) || p(x̄_{i_1}) ⊗ p(x̄_{i_2}) ).    (9)

We compute (9) for a fixed number of random index pairs per batch. The cost of computing this regularization term is minuscule compared to that of other components, such as feeding data through the encoder. Using this regularizer in conjunction with the learning objective (4) yields the regularized loss

    L = −I(X̄; Y) + λ Σ_{(i_1, i_2)} R(i_1, i_2).    (10)

We fix λ = 1 in all experiments, as we found that DIMCO's performance is not particularly sensitive to this hyperparameter. We emphasize that while this optional regularizer stabilizes training, our learning objective remains the mutual information I(X̄; Y) in (4).
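One possible realization of this regularizer, assuming the batch-estimated joint distribution of a pair of code positions is compared against the product of its marginals (the KL direction shown is our choice; the text leaves it implicit):

```python
import numpy as np

def pairwise_independence_kl(joint: np.ndarray) -> float:
    """KL divergence between the empirical joint distribution of two code
    positions (shape (k, k)) and the product of its marginals; this vanishes
    exactly when the two positions are independent."""
    p1 = joint.sum(axis=1)   # marginal of the first position
    p2 = joint.sum(axis=0)   # marginal of the second position
    product = np.outer(p1, p2)
    mask = joint > 0         # 0 log 0 is taken to be 0
    return float((joint[mask] * np.log(joint[mask] / product[mask])).sum())
```

Note that in this direction the quantity equals the mutual information between the two code positions, so driving it to zero decorrelates the ensemble members.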

Visualization of Codes
In Figure 2, we show images retrieved using our similarity measure (8). We trained a DIMCO model (k = 16, d = 4) on the CIFAR100 dataset, selected specific code locations, and plotted the top 10 test images according to our similarity measure. For example, the top (leftmost) image for the code (·, j_2, ·, j_4) is computed as

    argmax_{n ∈ {1,...,N}} [ log p_{2,j_2}(x_n) + log p_{4,j_4}(x_n) ],    (11)

where N is the number of test images. We visualize two different combinations of codes in Figure 2.
The two examples show that using codewords together results in their respective semantic concepts being combined: (man + fish = man holding fish), (round + warm color = orange). While we visualized combinations of 2 codewords for clarity, DIMCO itself uses a combination of d such codewords. The regularizer described in Section 2.4 further encourages each of these d codewords to represent different concepts. The combinatorially many (k d ) combinations in which DIMCO can assemble such codewords gives DIMCO sufficient expressive power to solve challenging tasks.

Is Mutual Information a Good Objective?
Our learning objective for DIMCO (4) is the mutual information between codes and labels. In this subsection, we justify this choice by showing that many previous objectives are closely related to mutual information. Due to space constraints, we only show high-level connections here and provide a more detailed exposition in Appendix A.

Cross-Entropy
The de facto loss for classification is the cross-entropy loss, defined as

    L_CE = E_{(x,y)} [ −log q(y | x̄) ],    (12)

where q(y | x̄) is the model's prediction of Y. Using the observation that the final layer q(·) acts as a parameterized approximation of the true conditional distribution p(Y|X̄), we can write this as

    L_CE ≈ H(Y | X̄) = H(Y) − I(X̄; Y).    (13)

The H(Y) term can be ignored since it is not affected by model parameters. Therefore, minimizing cross-entropy is approximately equivalent to maximizing mutual information.
The two objectives become exactly equivalent when the final linear layer q(·) perfectly represents the conditional distribution p(y|x̄). Note that for a discrete x̄, we cannot use a linear layer to parameterize q(y|x̄) and therefore cannot directly optimize the cross-entropy loss. We can thus view our loss as a necessary modification of the cross-entropy loss for our setting of discrete embeddings.
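A small numeric sanity check of the entropy identity underlying this argument, on an arbitrary toy joint distribution (numbers are illustrative only):

```python
import numpy as np

def H(p: np.ndarray) -> float:
    """Shannon entropy in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy joint distribution p(x̄, y) over 2 codes and 2 labels.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p_code = joint.sum(axis=1)
p_label = joint.sum(axis=0)

# Cross-entropy with a perfect head q(y|x̄) equals H(Y|X̄) = H(X̄,Y) − H(X̄),
# so minimizing it maximizes I(X̄;Y) = H(Y) − H(Y|X̄) up to the constant H(Y).
h_y_given_code = H(joint.ravel()) - H(p_code)
mi = H(p_label) - h_y_given_code
```

The computed `mi` agrees with the symmetric form H(X̄) + H(Y) − H(X̄, Y), confirming that lowering the conditional entropy is the same as raising the mutual information.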

Contrastive Losses
Many metric learning methods [1,2,10-12] use a contrastive learning objective to learn a continuous embedding X̄. Such contrastive losses consist of (1) a positive term that encourages an embedding to move closer to relevant embeddings and (2) a negative term that encourages it to move away from irrelevant embeddings. The positive term approximately maximizes log p(x̄|y), and the negative term approximately minimizes log p(x̄). Together, these terms have the combined effect of maximizing

    E [ log p(x̄|y) − log p(x̄) ] = I(X̄; Y).    (14)

We show such equivalences in detail in Appendix A.
In addition to these direct connections to previous loss functions, we show empirically in Section 5.1 that the mutual information strongly correlates with both the top-1 accuracy metric for classification and the Recall@1 metric for retrieval.

Does Using Discrete Codes Help Generalization?
In Section 1, we have provided motivation for the use of discrete codes through the regularization effect of an information bottleneck. In this subsection, we theoretically analyze whether learning discrete codes by maximizing mutual information leads to better generalization. In particular, we study how the mutual information on the test set is affected by the choice of input dataset structure and code hyperparameters k and d through a PAC learning bound.
We analyze DIMCO's characteristics at the level of minibatches. Following related meta-learning works [13,14], we call each batch a "task". We note that this is only a difference in naming convention, and our analysis applies equally well to the metric learning setup: we can view each batch consisting of support and query points as a task.
Define a task T to be a distribution over Z = X × Y, and let tasks T_1, . . . , T_n be sampled i.i.d. from a distribution of tasks τ. Each task T provides a fixed-size dataset D_T = {(x_j, y_j)}_{j=1}^{m} of m i.i.d. samples from T. Let θ be the parameters of DIMCO, and let X, Y, X̄ be the random variables for data, labels, and codes, respectively. Recall that our objective is the expected mutual information between labels and codes:

    L(θ) = E_{T∼τ} [ I_T(X̄; Y) ].    (15)

The loss that we actually optimize (Equations (6) and (7)) is the empirical loss

    L̂(θ) = (1/n) Σ_{i=1}^{n} Î_{T_i}(X̄; Y),    (16)

where each Î_{T_i} is estimated from the m samples of task T_i. The following theorem bounds the difference between the expected loss L and the empirical loss L̂.

Theorem 1. Let d_Θ be the VC dimension of the encoder x̄(·). The following inequality holds with high probability:

    L − L̂ ≤ O( √(d_Θ / n) ) + O( |X̄| log m / √m ) + O( |X̄| |Y| / m ).    (17)

Proof. We use VC dimension bounds and a finite-sample bound for mutual information [15]. We defer a detailed statement and proof to Appendix B.
First note that all three terms in our generalization gap (A11) converge to zero as n, m → ∞. This shows that training a model by maximizing empirical mutual information, as in Equations (6) and (7), generalizes perfectly in the limit of infinite data.
Theorem 1 also shows how the generalization gap is affected differently by the dataset size m and the number of datasets n. A large n directly compensates for using a large backbone (d_Θ), and a large m compensates for using a large final representation (|X̄|). Put differently, to learn effectively from small datasets (small m), one should use a small representation (small |X̄|). The number of datasets n is typically less of a problem because the number of different ways to sample datasets is combinatorially large (e.g., n > 10^10 for miniImagenet 5-way 1-shot tasks). Recall that DIMCO has log |X̄| = d log k, meaning that we can control the latter two terms using our hyperparameters d and k. We have explained the use of discrete codes through the information bottleneck effect of small codes X̄, and Theorem 1 confirms this intuition.

Related Work
Information bottleneck. DIMCO and Theorem 1 are both close in spirit to the information bottleneck (IB) principle [5,6,16]. IB finds a set of compact representatives X̄ while maintaining sufficient information about Y by minimizing

    L_IB = I(X̄; X) − β I(X̄; Y),    (18)

subject to Σ_{x̄} p(x̄|x) = 1. Equivalently, one maximizes I(X̄; Y) while simultaneously minimizing I(X̄; X). Similarly, our objective (15) is information maximization I(X̄; Y), and our bound (A11) suggests that the representation capacity |X̄| should be small for generalization. In the deterministic information bottleneck [17], I(X̄; X) is replaced by H(X̄). These three approaches to regularization are related via the chain of inequalities I(X̄; X) ≤ H(X̄) ≤ log |X̄|, which is tight in the limit of X̄ being incompressible. For any finite representation, i.e., |X̄| = N, the limit β → ∞ in (18) yields a hard partitioning of X into N disjoint sets. DIMCO uses the infomax principle to learn N = k^d such representatives, arranged as k-way d-dimensional discrete codes, for a compact representation with sufficient information about Y.
Regularizing meta-learning. Previous meta-learning methods have restricted task-specific learning by learning only a subset of the network [18], learning on a low-dimensional latent space [19], learning on a meta-learned prior distribution of parameters [20], and learning context vectors instead of model parameters [21]. Our analysis in Theorem 1 suggests that reducing the expressive power of the task-specific learner has a meta-regularizing effect, indirectly giving theoretical support to previous works that benefited from reducing the expressive power of task-specific learners.
Discrete representations. Discrete representations have been thoroughly studied in information theory [22]. Recent deep learning methods directly learn discrete representations by learning generative models with discrete latent variables [23][24][25] or maximizing the mutual information between representation and data [26]. DIMCO is related to but differs from these works, as it assumes a supervised meta-learning setting and performs infomax using labels instead of data.
A standard approach to learning label-aware discrete codes is to first learn continuous embeddings and then quantize them using an objective that maximally preserves their information [27][28][29]. DIMCO can be seen as an end-to-end alternative to quantization that directly learns discrete codes. Jeong and Song [30] similarly learn a sparse binary code in an end-to-end fashion by solving a minimum cost flow problem with respect to labels. Their method differs from DIMCO, which learns a dense discrete code by optimizing I(X̄; Y), estimated with a closed-form formula.
Metric learning. The structure and loss function of DIMCO are closely related to those of metric learning methods [1,11,12,31]. We show in Section 3.1 that the loss functions of these methods can be seen as approximations of the mutual information I(X̄; Y), and we provide a more in-depth exposition in Appendix A. While all of these previous methods require a support/query split within each batch, DIMCO simply optimizes an information-theoretic quantity of each batch, removing the need for such structured batch construction.

Information theory and representation learning.
Many works have applied information-theoretic principles to unsupervised representation learning: to derive an objective for GANs to learn disentangled features [32], to analyze the evidence lower bound (ELBO) [33,34], and to directly learn representations [35][36][37][38][39]. Related also are previous methods that enforce independence within an embedding [40,41]. DIMCO is also an information-theoretic representation learning method, but we instead assume a supervised learning setup where the representation must reflect ground-truth labels. We also used previous results from information theory to prove a generalization bound for our representation learning method.

Correlation of Metrics
We have shown in Section 3.1 that the mutual information I(X̄; Y) is strongly connected to previous loss functions for classification and retrieval. In this subsection, we verify experimentally whether I(X̄; Y) is a metric that quantitatively reflects the quality of the representation X̄. We trained DIMCO on the miniImageNet dataset with k = d = 64 for 20 epochs and plot the pairwise correlations between five metrics: (5, 10, 20)-way 1-shot accuracy, Recall@1, and I(X̄; Y). The results in Figure 3 show that all five metrics are strongly correlated. We observed similar trends when training with loss functions other than I(X̄; Y); we show these experiments in Appendix C due to space constraints.

Label-Aware Compression
We applied DIMCO to compressing the feature vectors of trained classifier networks. We obtained penultimate embeddings of ResNet20 networks trained on CIFAR10 and CIFAR100, with top-1 accuracies of 91.65 and 66.61, respectively. We trained on the embeddings of the train set of each dataset and measured top-1 accuracy on the test set using the training set as support. We compare DIMCO to product quantization (PQ, Jegou et al. [28]), which similarly compresses a given embedding to a k-way d-dimensional code; Table 1 compares the two methods over the same range of k, d hyperparameters. We performed the same experiment on the larger ImageNet dataset with a ResNet50 network, which had a top-1 accuracy of 76.00, comparing DIMCO to both adaptive scalar quantization (SQ) and PQ in Table 2. We show extended experiments for all three datasets in Appendix C.
The results in Tables 1 and 2 demonstrate that DIMCO consistently outperforms PQ, and is especially efficient when d is low. Furthermore, the ImageNet experiment (Table 2) shows that DIMCO even outperforms SQ, which has a much lower compression rate compared to the embedding sizes we consider for DIMCO. These results are likely due to DIMCO performing label-aware compression, where it compresses the embedding while taking the label into account, whereas PQ and SQ only compress the embeddings themselves.

Few-Shot Classification
We evaluated DIMCO's few-shot classification performance on the miniImageNet dataset. We compare our method against the following previous works: Snell et al. [3], Vinyals et al. [31], Liu et al. [48], Ye et al. [49], Ravi and Larochelle [50], Sung et al. [51], Bertinetto et al. [52], and Lee et al. [53]. All methods use the standard four-layer convnet backbone; while some prior works used more filters per layer, we used 64 for a fair comparison. We used the data augmentation scheme proposed by Lee et al. [53] and balanced batches of 100 images from 10 different classes. We evaluated both 5-way 1-shot and 5-way 5-shot learning, and report 95% confidence intervals over 1000 random episodes on the test split.
Results are shown in Table 3, and we provide an extended table with an alternative backbone in Appendix C. Table 3 shows that DIMCO outperforms previous works on the 5-way 1-shot benchmark. DIMCO's 5-way 5-shot performance is relatively low, likely because the similarity metric (Section 2.3) handles support datapoints individually instead of aggregating them, similarly to Matching Nets [31]. Additionally, other methods are explicitly trained to optimize 5-shot performance, whereas DIMCO's training procedure is the same regardless of task structure.

Image Retrieval
We conducted image retrieval experiments using two standard benchmark datasets: CUB-200-2011 and Cars-196. As baselines, we used three widely adopted metric learning methods: Binomial Deviance [54], Triplet loss [1], and Proxy-NCA [2]. The backbone for all methods was a ResNet-50 network pretrained on the ImageNet dataset. We trained DIMCO on various combinations of (k, d) and set the embedding dimension of the baseline methods to 128. We measured the time per query for each method on a Xeon E5-2650 CPU without any parallelization. We note that computing the retrieval time with a parallel implementation would skew the results even more in favor of DIMCO, since DIMCO's evaluation is simply one memory access followed by a sum.
The results presented in Table 4 show that DIMCO outperforms all three baselines, and that DIMCO's compact code takes roughly an order of magnitude less memory and requires less query time as well. This experiment also demonstrates that discrete representations can outperform modern methods that use continuous embeddings, even on this relatively large-scale task. Additionally, it shows that DIMCO can train with large backbones without significant overfitting.

Discussion
We introduced DIMCO, a model that learns a discrete representation of data by directly optimizing the mutual information with the label. To evaluate our initial intuition that shorter representations generalize better between tasks, we provided generalization bounds that get tighter as the representation gets shorter. Our experiments demonstrated that DIMCO is effective at both compressing a continuous embedding, and also at learning a discrete embedding from scratch in an end-to-end manner. The discrete embeddings of DIMCO outperformed recent continuous feature extraction methods while also being more efficient in terms of both memory and time. We believe the tradeoff between discrete and continuous embeddings is an exciting area for future research.
DIMCO was motivated by concepts such as the minimum description length (MDL) principle and the information bottleneck: compact task representations should have less room to overfit. Interestingly, Yin et al. [55] report that doing the opposite, regularizing the task-general parameters, prevents meta-overfitting by discouraging the meta-learning model from memorizing the given set of tasks. In future work, we will investigate the common principle underlying these seemingly contradictory approaches for a fuller understanding of meta-generalization.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Previous Loss Functions Are Approximations of Mutual Information
Appendix A.1. Cross-Entropy Loss

The cross-entropy loss has been used directly for few-shot classification [3,31]. Let q(y|x̄; φ) be a parameterized prediction of y given x̄, which approximates the true conditional distribution p(y|x̄). Typically, in a classification network, φ is the parameters of a learned projection matrix and q(·) is the final linear layer. The expected cross-entropy loss can be written as

    L_CE = E_{(x,y)} [ −log q(y|x̄; φ) ].    (A1)

Assuming that the approximate distribution q(·) is sufficiently close to p(y|x̄), minimizing (A1) can be seen as

    min_θ H(Y|X̄) = min_θ [ H(Y) − I(X̄; Y) ] = max_θ I(X̄; Y),    (A2)

where the last equality uses the fact that H(Y) does not depend on model parameters. Therefore, cross-entropy minimization is approximate maximization of the mutual information between the representation X̄ and labels Y.
The approximation lies in parameterizing q(y|x̄; φ) as a linear projection. This structure cannot generalize to new classes because the parameters φ are specific to the labels y seen during training. For a model to generalize to unseen classes, one must amortize the learning of this approximate conditional distribution. Snell et al. [3] and Vinyals et al. [31] sidestep this issue by using the embeddings of each class as φ.

Appendix A.2. Triplet Loss
The triplet loss [1] is defined as

    L_triplet = max( 0, ||x̄_q − x̄_p||² − ||x̄_q − x̄_n||² + α ),    (A3)

where x̄_q, x̄_p, x̄_n ∈ R^d are the embedding vectors of the query, positive, and negative images, and α is a margin. Let y_q denote the label of the query data. Recall that the log-density of a unit Gaussian is

    log N(x̄; μ, I) = −(1/2) ||x̄ − μ||² + c_1,    (A4)

where c_1 and c_2 are constants. Let p_p(x̄) = N(x̄; x̄_p, I) and p_n(x̄) = N(x̄; x̄_n, I) be unit Gaussian distributions centered at x̄_p and x̄_n, respectively. We then have

    ||x̄_q − x̄_p||² − ||x̄_q − x̄_n||² = 2 [ log p_n(x̄_q) − log p_p(x̄_q) ] + c_2 ≈ −2 [ log p(x̄_q | y_q) − log p(x̄_q) ],    (A5)

so minimizing the triplet loss approximately maximizes I(X̄; Y). Two approximations were made in this process. We first assumed that the embedding distribution of images not labeled y_q equals the distribution of all embeddings, which is reasonable when each class represents only a small fraction of the full data. We also approximated the embedding distributions p(x̄|y) and p(x̄) with unit Gaussians centered at single samples from each.
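A minimal executable form of the triplet loss discussed above, writing the margin α as `margin` (our naming):

```python
import numpy as np

def triplet_loss(x_q, x_p, x_n, margin: float = 1.0) -> float:
    """Hinge-form triplet loss with squared Euclidean distances."""
    d_pos = float(np.sum((np.asarray(x_q) - np.asarray(x_p)) ** 2))
    d_neg = float(np.sum((np.asarray(x_q) - np.asarray(x_n)) ** 2))
    # loss is zero once the negative is farther than the positive by the margin
    return max(0.0, d_pos - d_neg + margin)
```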

Appendix A.3. N-Pair Loss
Multiclass N-pair loss [11] was proposed as an alternative to the triplet loss. This loss requires one positive embedding x̄⁺ and multiple negative embeddings x̄_1, . . . , x̄_{N−1}, and takes the form

    L_N-pair = log( 1 + Σ_{i=1}^{N−1} exp( x̄ᵀ x̄_i − x̄ᵀ x̄⁺ ) ).    (A6)

This can be seen as the cross-entropy loss applied to softmax(x̄ᵀx̄⁺, x̄ᵀx̄_1, . . . , x̄ᵀx̄_{N−1}). Following the same logic as for the cross-entropy loss, this is also an approximation of I(X̄; Y). This objective should have lower variance than the triplet loss since it approximates p(x̄) using more examples.
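A minimal executable form of this loss (our naming; embeddings are plain NumPy vectors):

```python
import numpy as np

def n_pair_loss(x, x_pos, x_negs) -> float:
    """Multiclass N-pair loss as softmax cross-entropy over inner products
    of the query embedding with one positive and N−1 negatives."""
    x, x_pos, x_negs = map(np.asarray, (x, x_pos, x_negs))
    logits = np.concatenate([[x @ x_pos], x_negs @ x])
    logits = logits - logits.max()              # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return float(-log_softmax[0])               # positive sits at index 0
```

When the positive inner product dominates, the loss approaches zero, mirroring the cross-entropy behavior described above.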
Appendix B. Proof of Theorem 1

We simplify this bound and plug in our specific quantities of interest (x̄(X_T; θ), Y_T). We similarly bound the error caused by estimating L with a finite number of tasks sampled from τ. Denote the finite-sample estimate of L as L̂. Let the mapping x → x̄ be parameterized by θ ∈ Θ, and let this model have VC dimension d_Θ. Using d_Θ, we can state that, with high probability,

    L − L̂ ≤ O( √(d_Θ / n) ) + O( |X̄| log m / √m ) + O( |X̄| |Y| / m ),    (A11)

where d_Θ is the VC dimension of the hypothesis class Θ.

Appendix C. Experiments
Appendix C.1. Parameterizing the Code Layer

Recall that each discrete code is parameterized by a R^{k×d} matrix. A problem with a naive implementation of DIMCO is that a single linear layer mapping R^D to R^{k×d} requires Dkd parameters, which can be prohibitively expensive for large embeddings, e.g., d = 4096. We therefore parameterize this code layer as the product of two matrices, which sequentially map R^D → R^n → R^{k×d}; the total number of parameters is then nD + nkd. We fix n = 128 in all experiments. While more complicated tricks could reduce the parameter count even further, we found that this simple structure was sufficient to produce the results in this paper.
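A quick sketch of the resulting parameter counts (D = 512 is an assumed example value, not from the paper):

```python
def code_layer_params(D: int, n: int, k: int, d: int):
    """Weight counts (biases omitted) for the factorized code layer
    R^D → R^n → R^{k×d} versus a direct linear map R^D → R^{k×d}."""
    factorized = n * D + n * k * d
    direct = D * k * d
    return factorized, direct

factorized, direct = code_layer_params(D=512, n=128, k=64, d=64)
# With these values, the factorization shrinks the layer from
# 2,097,152 weights to 589,824 weights.
```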

Appendix C.2. Correlation of Metrics
We collected statistics from 8 independent runs and report the averages of 500 batches of 1-shot accuracies, Recall@1, and mutual information. I(X̄; Y) was computed using balanced batches of 16 images each from 5 different classes. In addition to the experiment in the paper, we measured the correlations between 1-shot accuracies, Recall@1, and NMI using three previously proposed losses (triplet, N-pair, ProtoNet). Figure A1 shows that even for methods whose objective is not the mutual information, mutual information strongly correlates with all other previous metrics.

Figure A1. Correlation between few-shot accuracy and retrieval metrics.

Appendix C.3. Code Visualizations
We provide additional visualizations of codes in Figure A2. These examples consistently show that each code encodes a semantic concept, and that such concepts can be but are not necessarily tied to a particular class.

Appendix C.4. Label-Aware Bit Compression
We computed top-1 accuracies using a kNN classifier on each type of embedding with k = 200. We present extended results comparing PQ and DIMCO on ImageNet embeddings in Table A2. For the CIFAR-10 and CIFAR-100 pretrained ResNet20, we used the pretrained weights of an open-source repository (https://github.com/chenyaofo/CIFAR-pretrained-models, accessed on 15 March 2021), and for the ImageNet pretrained ResNet50, we used torchvision. We optimized the probabilistic encoder with the Adam optimizer [56] with a learning rate of 1 × 10−2 for CIFAR-100 and ImageNet, and 3 × 10−3 for CIFAR-10.

Appendix C.5. Training with Few Examples per Class

We performed an experiment to see how well DIMCO generalizes to new datasets after training with a small number of examples. We trained each model using {1, 4, 16, 64} samples from each class of the miniImageNet dataset. For example, 4 samples means that we trained on (64 classes × 4 images) instead of the full (64 classes × 600 images). We compare our method against three methods which use continuous embeddings for each datapoint: Triplet Nets [1], multiclass N-pair loss [11], and ProtoNets [3]. Figure A3 shows that DIMCO learns much more effectively than previous methods when the number of examples per class is low. We attribute this to DIMCO's tight generalization gap: since DIMCO uses fewer bits to describe each datapoint, the codes act as an implicit regularizer that helps generalization to unseen datasets. We additionally note that DIMCO is the only method in Figure A3 that can train using a dataset consisting of 1 example per class. DIMCO has this capability because, unlike the other methods, it requires no support/query (also called train/test) split and maximizes the mutual information within a given batch; the other methods require at least one support and one query example per class within each batch.
For this experiment, we used the Adam optimizer and performed a log-uniform hyperparameter sweep over the learning rate ∈ [1 × 10−7, 1 × 10−3]. For DIMCO, we swept k ∈ [32, 128] and d ∈ [16, 32]; for the other methods, we swept the embedding dimension ∈ [16, 32]. For each combination of loss and number of training examples per class, we ran the experiment 64 times and report the mean and standard deviation of the top 5.

Appendix C.6. Few-Shot Classification

For this experiment, we built on the code released by Lee et al. [53] (https://github.com/kjunelee/MetaOptNet, accessed on 15 March 2021) with minimal adjustments. We used the repository's default datasets, augmentation, optimizer, and backbones; the only difference was our added module for outputting discrete codes. We show an extended table with citations in Table A3.

Table A3. Few-shot classification accuracies on the miniImageNet benchmark, with the best results for each setting in bold. Grouped according to backbone architecture. † denotes transductive methods, which are more expressive by taking unlabeled examples into account.