Here we describe a few more variants of the CEB objective.

#### Appendix C.1. Hierarchical CEB

Thus far, we have focused on learning a single latent representation (possibly composed of multiple latent variables at the same level). Here, we consider one way to learn a hierarchical model with CEB.

Consider the graphical model ${Z}_{2}\leftarrow {Z}_{1}\leftarrow X\leftrightarrow Y$. This is the simplest hierarchical supervised representation learning model. The general form of its information diagram is given in Figure A1.

**Figure A1.**
Information diagram for the basic hierarchical CEB model, ${Z}_{2}\leftarrow {Z}_{1}\leftarrow X\leftrightarrow Y$.


The key observation for generalizing CEB to hierarchical models is that the target mutual information doesn’t change. By this, we mean that all of the ${Z}_{i}$ in the hierarchy should cover $I(X;Y)$ at convergence, which means maximizing $I(Y;{Z}_{i})$. It is reasonable to ask why we would want to train such a model, given that the final representations are presumably all effectively identical in terms of information content. Doing so allows us to train deep models in a principled manner such that all layers of the network are consistent with each other and with the data. We need to be more careful when considering the residual information terms, though: we do not want to minimize $I(X;{Z}_{i}|Y)$, which is not consistent with the graphical model. Instead, we want to minimize $I({Z}_{i-1};{Z}_{i}|Y)$, defining ${Z}_{0}=X$.

This gives the following simple Hierarchical CEB objective:
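Written out from the two requirements just stated (minimize $I({Z}_{i-1};{Z}_{i}|Y)$, maximize $I(Y;{Z}_{i})$ for every level), one form consistent with the text is sketched below. The per-level trade-off coefficients ${\beta }_{i}$ and the number of levels $n$ are notational assumptions made here for illustration.

```latex
\mathrm{CEB}_{\mathrm{hier}} \equiv \min \sum_{i=1}^{n} \Big[ \beta_i \, I(Z_{i-1}; Z_i \mid Y) - I(Y; Z_i) \Big],
\qquad Z_0 = X
```

With $n=1$ this reduces to the single-level form $\min \beta I(X;Z|Y)-I(Y;Z)$.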

Because all of the ${Z}_{i}$ are targeting Y, this objective is as stable as regular CEB.

#### Appendix C.2. Sequence Learning

Many of the richest problems in machine learning vary over time. In Bialek et al. [53], the authors define the Predictive Information:
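In the notation used in the rest of this section (past ${X}_{<t}$, future ${X}_{\ge t}$), the quantity can be written as the standard mutual information between the two. The angle brackets denote an expectation over the joint distribution; the exact symbols are an assumption chosen to match the surrounding text.

```latex
I(X_{<t}; X_{\ge t}) = \left\langle \log \frac{p(x_{<t}, x_{\ge t})}{p(x_{<t}) \, p(x_{\ge t})} \right\rangle_{x_{<t}, x_{\ge t} \sim p(x_{<t}, x_{\ge t})}
```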

This is of course just the mutual information between the past and the future. However, under an assumption of temporal invariance (any time window of fixed length is expected to have the same entropy), they are able to characterize the predictive information and show that it is a subextensive quantity: $\lim_{T\to \infty}I\left(T\right)/T=0$, where $I\left(T\right)$ is the predictive information over a time window of length $2T$ ($T$ steps of the past predicting $T$ steps into the future). This concise statement tells us that, as the time window increases, past observations contain a vanishingly small amount of information about the future relative to the length of the window.
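As a small illustration of what subextensivity means, compare two hypothetical growth laws for $I(T)$. The logarithmic law and the constant $c>0$ are assumptions chosen only to make the limit concrete.

```latex
\text{subextensive: } I(T) = c \log T \;\Rightarrow\; \frac{I(T)}{T} = \frac{c \log T}{T} \to 0 \quad (T \to \infty)
\qquad
\text{extensive: } I(T) = c\,T \;\Rightarrow\; \frac{I(T)}{T} = c
```

In the first case $I(T)$ itself may keep growing, but more slowly than the window length.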

The application of CEB to extracting the predictive information is straightforward. Given the Markov chain ${X}_{<t}\to {X}_{\ge t}$, we learn a representation ${Z}_{t}$ that optimally covers $I({X}_{<t};{X}_{\ge t})$ in Predictive CEB:
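Substituting the past ${X}_{<t}$ for the input and the future ${X}_{\ge t}$ for the target in the single-level CEB form $\min \beta I(X;Z|Y)-I(Y;Z)$ gives a sketch consistent with this description. The single coefficient $\beta$ is assumed to play its usual CEB role.

```latex
\mathrm{CEB}_{\mathrm{pred}} \equiv \min \; \beta \, I(X_{<t}; Z_t \mid X_{\ge t}) - I(X_{\ge t}; Z_t)
```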

Given a dataset of sequences, CEB${}_{\mathrm{pred}}$ may be extended to a bidirectional model. In this case, two representations are learned, ${Z}_{<t}$ and ${Z}_{\ge t}$. Both representations are for timestep t, the first representing the observations before t, and the second representing the observations from t onwards. As in the normal bidirectional model, using the same encoder and backwards encoder for both parts of the bidirectional CEB objective ties the two representations together.

#### Appendix C.2.1. Modeling and Architectural Choices

As with all of the variants of CEB, whatever entropy remains in the data after capturing the entropy of the mutual information in the representation must be modeled by the decoder. In this case, a natural modeling choice would be a probabilistic RNN with powerful decoders per time-step to be predicted. However, it is worth noting that such a decoder would need to sample at each future step to decode the subsequent step. An alternative, if the prediction horizon is short or the predicted data are small, is to decode the entire sequence from ${Z}_{t}$ in a single, feed-forward network (possibly as a single autoregression over all outputs in some natural sequence). Given the subextensivity of the predictive information, that may be a reasonable choice in stochastic environments, as the useful prediction window may be small.

Likely a better alternative, however, is to use the CatGen decoder, as no generation of the long future sequences is required in that case.

#### Appendix C.2.2. Multi-Scale Sequence Learning

As in WaveNet [54], it is natural to consider sequence learning at multiple different temporal scales. Combining an architecture like time-dilated WaveNet with CEB is as simple as combining CEB${}_{\mathrm{pred}}$ with CEB${}_{\mathrm{hier}}$ (Appendix C.1). In this case, each of the ${Z}_{i}$ would represent a wider time dilation conditioned on the aggregate ${Z}_{i-1}$.

#### Appendix C.3. Unsupervised CEB

For unsupervised learning, it seems challenging to put the decision about what information should be kept into objective function hyperparameters, as in the $\beta $ VAE and penalty VAE [32] objectives. That work showed that it is possible to constrain the amount of information in the learned representation, but it is unclear how those objective functions keep only the “correct” bits of information for the downstream tasks you might care about. This is in contrast to supervised learning while targeting the MNI point, where the task clearly defines both the correct amount of information and which bits are likely to be important.

Our perspective on the importance of defining a task in order to constrain the information in the representation suggests that we can turn the problem into a data modeling problem in which the practitioner who selects the dataset also “models” the likely form of the useful bits in the dataset for the downstream task of interest.

In particular, given a dataset X, we propose selecting a function $f\left(X\right)\to {X}^{\prime}$ that transforms X into a new random variable ${X}^{\prime}$. This defines a paired dataset, $P(X,{X}^{\prime})$, on which we can use CEB as normal. Note that choosing the identity function for f results in maximal mutual information between X and ${X}^{\prime}$ ($H\left(X\right)$ nats), which will result in a representation that is far from the MNI for normal downstream tasks.

It may seem that we have not proposed anything useful, as the selection of $f(\cdot )$ is unconstrained, and seems much more daunting than selecting $\beta $ in a $\beta $ VAE or $\sigma $ in a penalty VAE. However, there is a very powerful class of functions that makes this problem much simpler, and that also makes it clear that using CEB will only select bits from X that are useful. That class of functions is the noise functions.

#### Appendix C.3.1. Denoising CEB Autoencoder

Given a dataset X without labels or other targets, and some set of tasks in mind to be solved by a learned representation, we may select a random noise variable U, and function ${X}^{\prime}=f(X,U)$ that we believe will destroy the irrelevant information in X. We may then add representation variables ${Z}_{X},{Z}_{{X}^{\prime}}$ to the model, giving the joint distribution $p(x,{x}^{\prime},u,{z}_{X},{z}_{{X}^{\prime}})\equiv p(x)\,p(u)\,p({x}^{\prime}|f(x,u))\,e({z}_{X}|x)\,b({z}_{{X}^{\prime}}|{x}^{\prime})$. This joint distribution is represented in Figure A2.

**Figure A2.**
Graphical model for the Denoising CEB Autoencoder.


Denoising Autoencoders were originally proposed in Vincent et al. [55]. In that work, the authors argue informally that reconstruction of corrupted inputs is a desirable property of learned representations. In this paper’s notation, we could describe their proposed objective as $\min H(X|{Z}_{{X}^{\prime}})$, or equivalently $\min -{\langle \log d(x|{z}_{{X}^{\prime}}=f(x,\eta ))\rangle }_{x,\eta \sim p(x)p(\eta )}$.

We also note that, practically speaking, we would like to learn a representation that is consistent with uncorrupted inputs as well. Consequently, we are going to use a bidirectional model.

This requires two encoders and two decoders, which may seem expensive, but it permits a consistent learned representation that can be used cleanly for downstream tasks. Using a single encoder/decoder pair would result in either an encoder that does not work well with uncorrupted inputs, or a decoder that only generates noisy outputs.

If you are only interested in the learned representation and not in generating good reconstructions, the objective simplifies to the first three terms. In that case, the objective is properly called a Noising CEB Autoencoder, as the model predicts the noisy ${X}^{\prime}$ from X:
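Placing ${X}^{\prime}$ in the role of the target in the single-level CEB form $\min \beta I(X;Z|Y)-I(Y;Z)$ gives a sketch consistent with this description. The name $\mathrm{CEB}_{\mathrm{noise}}$ and the single coefficient $\beta$ are assumptions; the second line uses the Markov chain ${Z}_{X}\leftarrow X\to {X}^{\prime}$ and drops the constant $H({X}^{\prime})$, leaving three entropy terms.

```latex
\mathrm{CEB}_{\mathrm{noise}} \equiv \min \; \beta \, I(X; Z_X \mid X') - I(X'; Z_X)
\;\Rightarrow\; \min \; \beta \big( -H(Z_X \mid X) + H(Z_X \mid X') \big) + H(X' \mid Z_X)
```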

In these models, the noise function, ${X}^{\prime}=f(X,U)$, must encode the practitioner’s assumptions about the structure of information in the data. This obviously will vary per type of data, and even per desired downstream task.

However, we don’t need to work too hard to find the perfect noise function initially. A reasonable choice for f is:

In other words, add uniform noise scaled to the domain of X and by a hyperparameter $\lambda $, and clip the result to the domain of X. When $\lambda =1$, ${X}^{\prime}$ is indistinguishable from uniform noise. As $\lambda \to 0$, this maintains more and more of the original information from X in ${X}^{\prime}$. For some value of $\lambda >0$, most of the irrelevant information is destroyed and most of the relevant information is maintained, if we assume that higher frequency content in the domain of X is less likely to contain the desired information. That information is what will be retained in the learned representation.
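The verbal description above can be sketched in code. This is a minimal illustration, not a definitive implementation: the function name `uniform_clip_noise`, the default domain `[0, 1]`, and the choice of drawing noise from the interval `[-(hi - lo), hi - lo]` are all assumptions made here.

```python
import random


def uniform_clip_noise(x, lam, lo=0.0, hi=1.0, rng=None):
    """Return a noisy copy of x: add scaled uniform noise, then clip to [lo, hi].

    x   : list of floats, each assumed to lie in [lo, hi] (the domain of X)
    lam : noise scale (the hyperparameter lambda); lam = 0 leaves x unchanged
    rng : optional random.Random instance, for reproducibility
    """
    if rng is None:
        rng = random
    width = hi - lo  # size of the domain (assumed noise range)
    noisy = []
    for value in x:
        u = rng.uniform(-width, width)            # uniform noise scaled to the domain
        corrupted = value + lam * u               # scale by lambda and add
        noisy.append(min(hi, max(lo, corrupted))) # clip back into the domain
    return noisy


# Example: corrupt a small "image" of pixel intensities in [0, 1].
pixels = [0.0, 0.25, 0.5, 1.0]
noisy_pixels = uniform_clip_noise(pixels, lam=0.3, rng=random.Random(0))
```

Sweeping `lam` over a few values between 0 and 1 is one simple way to look for the regime in which irrelevant detail is destroyed but task-relevant structure survives.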

#### Theoretical Optimality of Noise Functions

Above we claimed that this learning procedure will only select bits that are useful for the downstream task, given that we select the proper noise function. Here we prove that claim constructively. Imagine an oracle that knows which bits of information should be destroyed, and which retained in order to solve the future task of interest. Further imagine, for simplicity, that the task of interest is classification. What noise function must that oracle implement in order to ensure that CEB${}_{\mathrm{denoise}}$ can only learn exactly the bits needed for classification? The answer is simple: for every $X={x}_{i}$, select ${X}^{\prime}={x}_{i}^{\prime}$ uniformly at random from among all of the $X={x}_{j}$ that should have the same class label as $X={x}_{i}$. Now, the only way for CEB to maximize $I(X;{Z}_{{X}^{\prime}})$ and minimize $I({X}^{\prime};{Z}_{{X}^{\prime}})$ is by learning a representation that is isomorphic to classification, and that encodes exactly $I(X;Y)$ nats of information, even though it was only trained “unsupervisedly” on $X,{X}^{\prime}$ pairs. Thus, if we can choose the correct noise function that destroys only the bits we don’t care about, CEB${}_{\mathrm{denoise}}$ will learn the desired representation and nothing else (caveated by model, architecture, and optimizer selection, as usual).
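The oracle's noise function can be sketched directly. In this hypothetical illustration, the labels are visible only to the oracle that builds the $(x,{x}^{\prime})$ pairs; the learner would see the pairs alone. The function name `oracle_noise_pairs` and the toy data are assumptions made here.

```python
import random
from collections import defaultdict


def oracle_noise_pairs(xs, labels, rng=None):
    """For each x_i, draw x'_i uniformly from all examples sharing x_i's label.

    xs     : list of examples (any hashable or unhashable objects)
    labels : list of class labels, known only to the hypothetical oracle
    rng    : optional random.Random instance, for reproducibility
    """
    if rng is None:
        rng = random
    # Group the examples by the label the oracle knows they should have.
    by_label = defaultdict(list)
    for x, y in zip(xs, labels):
        by_label[y].append(x)
    # Pair every example with a uniformly chosen member of its own class
    # (possibly itself), which destroys all information except class identity.
    return [(x, rng.choice(by_label[y])) for x, y in zip(xs, labels)]


# Toy example: two classes; the learner would receive only `pairs`.
examples = ["a1", "a2", "b1", "b2", "b3"]
classes = [0, 0, 1, 1, 1]
pairs = oracle_noise_pairs(examples, classes, rng=random.Random(0))
```

Because ${x}^{\prime}$ is drawn independently of $x$ once the class is fixed, the only information the two share is the class identity, which is the point of the construction.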