# CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction


## Abstract

This work proposes learning a **C**losed-loop **Tr**anscription between a multi-class, multi-dimensional data distribution and a **L**inear discriminative representation (CTRL) in the feature space that consists of multiple independent multi-dimensional linear subspaces. In particular, we argue that the optimal encoding and decoding mappings sought can be formulated as a two-player minimax game between the encoder and decoder for the learned representation. A natural utility function for this game is the so-called rate reduction, a simple information-theoretic measure for distances between mixtures of subspace-like Gaussians in the feature space. Our formulation draws inspiration from closed-loop error feedback from control systems and avoids the expensive evaluation and minimization of approximated distances between arbitrary distributions in either the data space or the feature space. To a large extent, this new formulation unifies the concepts and benefits of Auto-Encoding and GAN and naturally extends them to the setting of learning a representation that is both discriminative and generative for multi-class and multi-dimensional real-world data. Our extensive experiments on many benchmark imagery datasets demonstrate the tremendous potential of this new closed-loop formulation: under fair comparison, the visual quality of the learned decoder and the classification performance of the encoder are competitive with, and arguably better than, existing methods based on GAN, VAE, or a combination of both. Unlike existing generative models, the so-learned features of the multiple classes are structured instead of hidden: different classes are explicitly mapped onto corresponding independent principal subspaces in the feature space, and diverse visual attributes within each class are modeled by the independent principal components within each subspace.

## 1. Introduction

**Data embedding versus data transcription.** Note that the support of the distribution of $\mathit{x}$ (and that of $\mathit{z}$) is typically extremely low-dimensional compared to the ambient space. For instance, the well-known CIFAR-10 dataset consists of RGB images with a resolution of $32\times 32$. Although the images live in ${\mathbb{R}}^{3072}$, our experiments will show that the intrinsic dimension of each class is less than a dozen, even after they are mapped into a feature space of ${\mathbb{R}}^{128}$. Hence, the above mapping(s) may not be uniquely defined based on the support in the space ${\mathbb{R}}^{D}$ (or ${\mathbb{R}}^{d}$). In addition, the data $\mathit{x}$ may contain multiple components (e.g., modes, classes), and the intrinsic dimensions of these components are not necessarily the same. Hence, without loss of generality, we may assume the data $\mathit{x}$ to be distributed over a union of low-dimensional nonlinear submanifolds ${\cup}_{j=1}^{k}{\mathcal{M}}_{j}\subset {\mathbb{R}}^{D}$, where each submanifold ${\mathcal{M}}_{j}$ is of dimension ${d}_{j}\ll D$. Regardless, we hope the learned mappings f and g are (locally dimension-preserving) embedding maps [1], when restricted to each of the components ${\mathcal{M}}_{j}$. In general, the dimension of the feature space d needs to be significantly higher than all of these intrinsic dimensions of the data: $d>{d}_{j}$. In fact, it should preferably be higher than the sum of all the intrinsic dimensions: $d\ge {d}_{1}+\cdots +{d}_{k}$, since we normally expect that the features of different components/classes can be made fully independent or orthogonal in ${\mathbb{R}}^{d}$. Hence, without any explicit control of the mapping process, the actual features associated with images of the data under the embedding could still lie on some arbitrary nonlinear low-dimensional submanifolds inside the feature space ${\mathbb{R}}^{d}$. The distribution of the learned features remains “latent” or “hidden” in the feature space.

**Paper Outline.** This work shows how such transcription can be achieved for real-world visual data with one important family of models: the linear discriminative representation (LDR) introduced by [2]. Before we formally introduce our approach in Section 2, for the remainder of this section, we first discuss two existing approaches, namely auto-encoding and GAN, that are closely related to ours. As these approaches are popular and well known to readers, we will mainly point out some of their main conceptual and practical limitations that have motivated this work. Although our objective and framework will be mathematically formulated, the main purpose of this work is to verify the effectiveness of this new approach empirically through extensive experimentation, organized and presented in Section 3 and Appendix A. Our work presents compelling evidence that the closed-loop data transcription problem and our rate-reduction-based formulation deserve serious attention from the information-theoretical and mathematical communities. It also raises many exciting and open theoretical problems or hypotheses about learning, representing, and generating distributions or manifolds of high-dimensional real-world data. We discuss some open problems in Section 4 and new directions in Section 5. Source code can be found at https://github.com/Delay-Xili/LDR (accessed on 9 February 2022).

#### 1.1. Learning Generative Models via Auto-Encoding or GAN

**Auto-Encoding and its variants.** In the machine-learning literature, roughly speaking, there have been two representative approaches to such a distribution-learning task. One is the classic “Auto-Encoding” (AE) approach [3,4] that aims to simultaneously learn an encoding mapping f from $\mathit{x}$ to $\mathit{z}$ and an (inverse) decoding mapping g from $\mathit{z}$ back to $\mathit{x}$.

**GAN and its variants.** Compared to measuring distribution distance in the (often controlled) feature space $\mathit{z}$, a much more challenging issue with the above auto-encoding approach is how to effectively measure the distance between the decoded samples $\widehat{\mathit{X}}$ and the original $\mathit{X}$ in the data space $\mathit{x}$. For instance, for visual data such as images, their distributions $p\left(\mathit{X}\right)$ or generative models $p\left(\mathit{X}\right|\mathit{z})$ are often not known. Despite extensive studies in the computer vision and image processing literature [7], it remains elusive to find a good measure for similarity of real images that is both efficient to compute and effective in capturing the visual quality and semantic information of the images equally well. Precisely due to such difficulties, it was suggested early on by [8] that one may have to take a discriminative approach to learn the distribution or a generative model for visual data. More recently, Generative Adversarial Nets (GAN) [9] offered an ingenious idea to alleviate this difficulty by utilizing a powerful discriminator d, usually modeled and learned by a deep network, to discern differences between the generated samples $\widehat{\mathit{X}}$ and the real ones $\mathit{X}$.

**Combination of AE and GAN.** Although AE (VAE) and GAN originated with somewhat different motivations, they have evolved into popular and effective frameworks for learning and modeling complex distributions of many real-world data such as images. (In fact, in some idealistic settings, it can be shown that AE and GAN are actually equivalent: for instance, in the LQG settings, authors in [33] have shown that GAN coincides with the classic PCA, which is precisely the solution to auto-encoding in the linear case). Many recent efforts tend to combine both auto-encoding and GAN to create more powerful generative frameworks for more diverse data sets, such as [15,34,35,36,37,38,39,40,41,42]. As we will see, in our framework, AE and GAN can be naturally interpreted as two different segments of a closed-loop data transcription process. However, unlike GAN or AE (VAE), the “origin” or “target” distribution of the feature $\mathit{z}$ will no longer be specified a priori, and is instead learned from the data $\mathit{x}$. In addition, this intrinsically low-dimensional distribution of $\mathit{z}$ (with all of its low-dimensional supports) is explicitly modeled as a mixture of orthogonal subspaces (or independent Gaussians) within the feature space ${\mathbb{R}}^{d}$, sometimes known as the principal subspaces.

**Universality of Representations.** Note that GANs (and most VAEs) are typically designed without explicit modeling assumptions on the distribution of the data or of the features. Many even believe that it is this “universal” distribution learning capability (assuming minimizing distances between arbitrary distributions in high-dimensional space can be solved efficiently, which unfortunately has many caveats and often is impractical) that is responsible for their empirical success in learning distributions of complicated data such as images. In this work, we will provide empirical evidence that such an “arbitrary distribution learning machine” might not be necessary. (In fact, it may be computationally intractable in general). A controlled and deformed family of low-dimensional linear subspaces (Gaussians) can be more than powerful and expressive enough to model real-world visual data. (In fact, a Gaussian mixture model is already a universal approximator of almost arbitrary densities [43]. Hence, we do not lose any generality at all). As we will also see, once we can place a proper and precise metric on such models, the associated learning problems can become much better conditioned and more amenable to rigorous analysis and performance guarantees in the future.

#### 1.2. Learning Linear Discriminative Representation via Rate Reduction

**LDR via MCR${}^{2}$.** More precisely, consider a set of data samples $\mathit{X}=[{\mathit{x}}^{1},\dots ,{\mathit{x}}^{n}]\in {\mathbb{R}}^{D\times n}$ from k different classes. That is, we have $\mathit{X}={\cup}_{j=1}^{k}{\mathit{X}}_{j}$ with each subset of samples ${\mathit{X}}_{j}$ belonging to one of the low-dimensional submanifolds: ${\mathit{X}}_{j}\subset {\mathcal{M}}_{j},j=1,\dots ,k$. Following the notation in [2], we use a diagonal matrix ${\mathbf{\Pi}}^{j}$ with ${\mathbf{\Pi}}^{j}(i,i)=1$ to denote the membership of sample i in class j (and ${\mathbf{\Pi}}^{j}(i,i)=0$ otherwise). One seeks a continuous mapping $f(\cdot ,\theta ):\mathit{x}\mapsto \mathit{z}$ from $\mathit{X}$ to an optimal representation $\mathit{Z}=[{\mathit{z}}^{1},\dots ,{\mathit{z}}^{n}]\subset {\mathbb{R}}^{d\times n}$.
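
The rate reduction of [2] that underlies MCR${}^{2}$ can be sketched numerically as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `coding_rate` and `rate_reduction` are hypothetical, class memberships ${\mathbf{\Pi}}^{j}$ are passed as an integer label vector, and the default ${\epsilon}^{2}=0.5$ is taken from the training details in Appendix A.1.

```python
import numpy as np

def coding_rate(Z, eps2=0.5):
    """R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T) for features Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps2)) * (Z @ Z.T))
    return 0.5 * logdet

def rate_reduction(Z, labels, eps2=0.5):
    """Delta R = R(Z) - sum_j (n_j / n) * R(Z_j); `labels` plays the role of the matrices Pi^j."""
    n = Z.shape[1]
    class_rate = 0.0
    for j in np.unique(labels):
        Zj = Z[:, labels == j]
        class_rate += (Zj.shape[1] / n) * coding_rate(Zj, eps2)
    return coding_rate(Z, eps2) - class_rate

# Toy check (assumed data): two classes on orthogonal axes versus both on the same axis.
e1, e2 = np.eye(4)[:, [0]], np.eye(4)[:, [1]]
labels = np.array([0] * 4 + [1] * 4)
Z_orth = np.hstack([np.tile(e1, 4), np.tile(e2, 4)])
Z_same = np.tile(e1, 8)
dr_orth = rate_reduction(Z_orth, labels)
dr_same = rate_reduction(Z_same, labels)
```

In this toy case the orthogonal arrangement yields a strictly larger rate reduction than the collapsed one, which is the behavior the objective is meant to reward.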

## 2. Data Transcription via Rate Reduction

#### 2.1. Closed-Loop Transcription to an LDR (CTRL)

- **Injectivity:** the generated $\widehat{\mathit{x}}=g(f(\mathit{x},\theta ),\eta )\in \widehat{\mathit{X}}$ should be as close as possible to (ideally the same as) the original data $\mathit{x}\in \mathit{X}$, in terms of certain measures of similarity or distance.
- **Surjectivity:** for all mapped images $\mathit{z}=f\left(\mathit{x}\right)\in \mathit{Z}$ of the training data $\mathit{x}\in \mathit{X}$, there are decoded samples $\widehat{\mathit{z}}=f(g(\mathit{z},\eta ),\theta )\in \widehat{\mathit{Z}}$ close to (ideally the same as) $\mathit{z}$.

#### 2.2. Measuring Distances in the Feature Space and Data Space

**Contractive measure for the decoder.** For the second item in the above wishlist, as the representations in the feature space $\mathit{z}$ are by design linear subspaces or (degenerate) Gaussians, we have geometrically or statistically meaningful metrics for both samples and distributions in the feature space $\mathit{z}$. For example, we care about the distance between the distribution of the features of the original data $\mathit{Z}$ and that of the transcribed $\widehat{\mathit{Z}}$. Since the features of each class, ${\mathit{Z}}_{j}$ and ${\widehat{\mathit{Z}}}_{j}$, are similar to subspaces/Gaussians, their “distance” can be measured by the rate reduction, with (5) restricted to two sets of equal size.
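
Restricting the rate reduction to two sets of equal size gives a quantity of the form $R({\mathit{Z}}_{j}\cup {\widehat{\mathit{Z}}}_{j})-\frac{1}{2}\left(R({\mathit{Z}}_{j})+R({\widehat{\mathit{Z}}}_{j})\right)$. The sketch below illustrates this under the same assumptions as the coding rate of [2]; the names `coding_rate` and `rate_distance` and the synthetic features are hypothetical and only serve the illustration.

```python
import numpy as np

def coding_rate(Z, eps2=0.5):
    """R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T) for features Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps2)) * (Z @ Z.T))
    return 0.5 * logdet

def rate_distance(Z_a, Z_b, eps2=0.5):
    """Rate reduction between two equally sized feature sets, each of shape (d, n)."""
    if Z_a.shape != Z_b.shape:
        raise ValueError("both feature sets must have the same shape")
    union = np.hstack([Z_a, Z_b])
    return coding_rate(union, eps2) - 0.5 * (coding_rate(Z_a, eps2) + coding_rate(Z_b, eps2))

# Synthetic unit-norm features supported on two different coordinate subspaces of R^8.
rng = np.random.default_rng(0)
Z_a = np.vstack([rng.standard_normal((4, 16)), np.zeros((4, 16))])
Z_b = np.vstack([np.zeros((4, 16)), rng.standard_normal((4, 16))])
Z_a /= np.linalg.norm(Z_a, axis=0, keepdims=True)
Z_b /= np.linalg.norm(Z_b, axis=0, keepdims=True)

dist_same = rate_distance(Z_a, Z_a)   # identical sets
dist_diff = rate_distance(Z_a, Z_b)   # sets on orthogonal subspaces
```

The distance vanishes when the two sets coincide and grows when they occupy different subspaces, which is why the decoder can treat it as a quantity to minimize.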

**Contrastive measure for the encoder.** For the first item in our wishlist, however, we normally do not have a natural metric or “distance” for similarity of samples or distributions in the original data space $\mathit{x}$ for data such as images. As mentioned before, finding proper metrics or distance functions on natural images has always been an elusive and challenging task [7]. To alleviate this difficulty, we can measure the similarity or difference between $\widehat{\mathit{X}}$ and $\mathit{X}$ through their mapped features $\widehat{\mathit{Z}}$ and $\mathit{Z}$ in the feature space (again assuming f is structure-preserving). If we are interested in discerning any differences in the distributions of the original and transcribed samples, we may view the MCR${}^{2}$ feature encoder $f(\cdot ,\theta )$ as a “discriminator” to magnify any difference between all pairs of ${\mathit{X}}_{j}$ and ${\widehat{\mathit{X}}}_{j}$, by simply maximizing, instead of minimizing, the same quantity in (8).

**Remark: representing the encoding and decoding mappings.** Some practical questions arise immediately: how rich should the families of functions for the encoder f and decoder g be so that they can optimize the above rate-reduction-type objectives? In fact, similar questions exist for the formulation of GAN, regarding the realizability of the data distribution by the generator, see [50]. Conceptually, here we know that the encoder f needs to be rich enough to discriminate (small) deviations from the true data support ${\mathcal{M}}_{j}$, while the decoder g needs to be expressive enough to generate the data distribution from the learned mixture of subspace-Gaussians. How should we represent or parameterize them, hence making our objectives computable and optimizable? For the most general cases, these remain widely open and challenging mathematical and computational problems. As we mentioned earlier, in this work, we will take a more pragmatic approach by simply representing these mappings with popular neural networks that have empirically proven to be good at approximating distributions of practical (visual) datasets or at achieving the maximum of the rate-reduction-type objectives [13]. Nevertheless, our experiments indicate that our formulation and objectives are not so sensitive to particular choices in network structures or many of the tricks used to train them. In addition, in the special cases when the real data distribution is benignly deformed from an LDR, the work of [2] has shown that one can explicitly construct these mappings from the rate-reduction objectives in the form of a deep network known as ReduNet. However, it remains unclear how such constructions could be generalized to closed-loop settings. Regardless, answers to these questions are beyond the scope of this work, as our purposes here are mainly to empirically verify the validity of the proposed closed-loop data transcription framework.

#### 2.3. Encoding and Decoding as a Two-Player MiniMax Game

Comparing the two measures above, the roles of the encoder and decoder naturally form “**a two-player game**”: while the encoder f tries to magnify the difference between the original data and their transcribed data, the decoder g aims to minimize the difference. Now for convenience, let us define the “closed-loop encoding” function.
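
As a sketch consistent with the injectivity and surjectivity conditions of Section 2.1, where $\widehat{\mathit{z}}=f(g(\mathit{z},\eta ),\theta )$ and $\mathit{z}=f(\mathit{x},\theta )$, the closed-loop encoding can be written as the composition of encoder, decoder, and encoder. The symbol $h$ is a name assumed here for illustration.

```latex
h(\mathit{x},\theta ,\eta )\doteq f\big(g(f(\mathit{x},\theta ),\eta ),\theta \big),
\qquad
\widehat{\mathit{Z}}=h(\mathit{X},\theta ,\eta ).
```

With this shorthand, both $\mathit{Z}$ and $\widehat{\mathit{Z}}$ depend only on the parameters $\theta$ and $\eta$, so the game between the two players can be stated entirely in the feature space.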

**Remark: closed-loop error correction.** One may notice that our framework (see Figure 1) draws inspiration from closed-loop error correction widely practiced in feedback control systems. In the machine-learning and deep-learning literature, the idea of closed-loop error correction and closed-loop fixed point has been explored before to interpret the recursive error-correcting mechanism and explain stability in a forward (predictive) deep neural network, for example the deep equilibrium networks [54] and the deep implicit networks [55], again drawing inspiration from feedback control. Here, in our framework, the closed-loop mechanism is not used to interpret the encoding or decoding (forward) networks f and g. Instead, it is used to form an overall feedback system between the two encoding and decoding networks for correcting the “error” in the distributions between the data $\mathit{x}$ and the decoded $\widehat{\mathit{x}}$. Using terminology from control theory, one may view the encoding network f as a “sensor” for error feedback and the decoding network g as a “controller” for error correction. However, notice that here the “target” for control is not a scalar nor a finite-dimensional vector, but a continuous mapping, chosen so that the distribution of $\widehat{\mathit{x}}$ matches that of the data $\mathit{x}$. This is in general a control problem in an infinite-dimensional space, since the space of diffeomorphisms of submanifolds is infinite-dimensional [1]. Ideally, we hope that when the sensor f and the controller g are optimal, the distribution of $\mathit{x}$ becomes a “fixed point” for the closed loop while the distribution of $\mathit{z}$ reaches a compact LDR. Hence, the minimax programs (12) and (13) can also be interpreted as games between an error-feedback sensor and an error-reducing controller.

**Remark: relation to bi-directional or cycle consistency.** The notion of “bi-directional” and “cycle” consistency between encoding and decoding has been exploited in the works of BiGAN [38] and ALI [39] for mappings between the data and features and in the work of CycleGAN [56] for mappings between two different data distributions. In our context, the goal is similar: to promote $g\circ f$ and $f\circ g$ to be close to identity mappings (either for the distributions or for the samples). Interestingly, our new closed-loop formulation actually “decouples” the data $\mathit{X}$, say, observed from the external world, from their internally represented features $\mathit{Z}$. The objectives (12) and (13) are functions of only the internal features $\mathit{Z}\left(\theta \right)$ and $\widehat{\mathit{Z}}(\theta ,\eta )$, which can be learned and optimized by adjusting the neural networks $f(\cdot ,\theta )$ and $g(\cdot ,\eta )$ alone. There is no need for any additional external metrics or heuristics to promote how “close” the decoded images $\widehat{\mathit{X}}$ are to $\mathit{X}$. This is very different from most VAE/GAN-type methods such as BiGAN and ALI that require additional discriminators (networks) for the images and the features. Some experimental comparisons are given in Appendix A.2. In addition, in Appendix A.8.1, we provide an ablation study to illustrate the importance and benefit of a closed loop for enforcing the consistency between the encoder and decoder.

**Remark: transparent versus hidden distribution of the learned features.** Notice that in our framework, there is no need to explicitly specify a prior distribution either as a target distribution to map to for AE (2) or as an initial distribution to sample from for GAN (3). The common practice in AEs or GANs is to specify the prior distribution as a generic Gaussian. This is however particularly problematic when the data distribution is multi-modal and has multiple low-dimensional structures, which is commonplace for multi-class data. In this case, the common practice in AEs or GANs is to train a conditional GAN for different classes or different attributes. However, here we only need to assume the desired target distribution belonging to the family of LDRs. The specific optimal distribution of the features within this family is then learned from the data directly, and then can be represented explicitly as a mixture of independent subspace Gaussians (or equivalently, a mixture of PCAs on independent subspaces). We will give more details in the experimental Section 3 as well as more examples in Appendix A.2, Appendix A.3 and Appendix A.4. Although many GAN + VAE-type methods can learn bidirectional encoding and decoding mappings, the distribution of the learned features inside the feature space remains hidden or even entangled. This makes it difficult to sample the feature space for generative purposes or to use the features for discriminative tasks. (For instance, typically one can only use so-learned features for nearest-neighbor-type classifiers [38], instead of nearest subspace as in this work, see Section 3.3).

## 3. Empirical Verification on Real-World Imagery Datasets

**Datasets.** We provide extensive qualitative and quantitative experimental results on the following datasets: MNIST [57], CIFAR-10 [58], STL-10 [59], CelebA [60], LSUN bedroom [61], and ImageNet ILSVRC 2012 [62]. The network architectures and implementation details can be found in Appendix A.1 and the corresponding part of Appendix A for each dataset.

#### 3.1. Empirical Justification of CTRL Transcription

**Comparison (IS and FID) with other formulations.** First, we conduct five experiments to fairly compare our formulation with GAN [63] and VAE(-GAN) [64] on MNIST and CIFAR-10. Except for the objective function, everything else is exactly the same for all methods (e.g., networks, training data, optimization method). These experiments are: (1) GAN; (2) GAN with its objective replaced by that of the CTRL-Binary (13); (3) VAE-GAN; (4) Binary CTRL (13); and (5) Multi-class CTRL (12). Some visual comparison is given in Figure 3. IS [65] and FID [66] scores are summarized in Table 1. Here, for simplicity, we have chosen a uniform feature dimension $d=128$ for all datasets. If we choose a higher feature dimension, say $d=512$, for the more complex CIFAR-10 dataset, the visual quality can be further improved, see Table A14 in Appendix A.11.

**Visualizing auto-encoding of the data $\mathit{X}$ and the decoded $\widehat{\mathit{X}}$.** We compare some representative $\mathit{X}$ and $\widehat{\mathit{X}}$ on MNIST, CIFAR-10 and ImageNet (10 classes) to verify how close $\widehat{\mathit{x}}=g\circ f\left(\mathit{x}\right)$ is to $\mathit{x}$. The results are shown in Figure 5, and visualizations are created from training samples. Visually, the auto-encoded $\widehat{\mathit{x}}$ faithfully captures major visual features from its respective training sample $\mathit{x}$, especially the pose, shape, and layout. For simpler datasets such as MNIST, auto-encoded images are almost identical to the original. The visual quality is clearly better than other GAN+VAE-type methods, such as VAE-GAN [34] and BiGAN [38]. We refer the reader to Appendix A.2, Appendix A.4 and Appendix A.7 for more visualization of results on these datasets, including similar results on transformed MNIST digits. More visualization results for learned models on real-life image datasets such as STL-10, CelebA, and LSUN can be found in Appendix A.5 and Appendix A.6.

#### 3.2. Comparison to Existing Generative Methods

#### 3.3. Benefits of the Learned LDR Transcription Model

**Principal subspaces and principal components for the feature.** To be more specific, given the learned k-class features ${\cup}_{j=1}^{k}{\mathit{Z}}_{j}$ for the training data, we have observed that the leading singular subspaces for different classes are all approximately orthogonal to each other: ${\mathit{Z}}_{i}\perp {\mathit{Z}}_{j}$ (see Figure 4). This corroborates our above discussion about the theoretical properties of the rate-reduction objective. They essentially span k independent principal subspaces. We can further calculate the mean ${\overline{\mathit{z}}}_{j}$ and the singular vectors ${\{{\mathit{v}}_{j}^{i}\}}_{i=1}^{{r}_{j}}$ (or principal components) of the learned features ${\mathit{Z}}_{j}$ for each class. Although we conceptually view the support of each class as a subspace, the actual support of the features is close to being on the sphere due to feature (scale) normalization. Hence, it is more precise to find its mean and its support centered around the mean. Here, ${r}_{j}$ is a rank we may choose to model the dimension of each principal subspace (say, based on a common threshold on the singular values). Hence, we obtain an explicit model for how the feature $\mathit{z}$ is distributed in each of the k principal subspaces in the feature space ${\mathbb{R}}^{d}$.
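
The per-class model described above (a mean plus a few principal directions) can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names `fit_class_subspace` and `sample_class_features` are hypothetical, the rank is passed in directly instead of being chosen by a threshold on singular values, and the coefficients along each direction are assumed to be independent Gaussians scaled by the corresponding singular value.

```python
import numpy as np

def fit_class_subspace(Zj, rank):
    """Zj: features of one class, shape (d, n_j). Returns mean, directions (d, rank), per-direction std."""
    mean = Zj.mean(axis=1)
    U, S, _ = np.linalg.svd(Zj - mean[:, None], full_matrices=False)
    std = S[:rank] / np.sqrt(Zj.shape[1])   # spread of the data along each principal direction
    return mean, U[:, :rank], std

def sample_class_features(mean, V, std, num, rng):
    """Draw `num` features from the mean plus Gaussian coefficients along the principal directions."""
    coeffs = rng.standard_normal((V.shape[1], num)) * std[:, None]
    return mean[:, None] + V @ coeffs

# Synthetic class (assumed data): a 2-dimensional subspace of R^5 shifted by a constant offset.
rng = np.random.default_rng(0)
basis = np.linalg.qr(rng.standard_normal((5, 2)))[0]
Zj = basis @ rng.standard_normal((2, 50)) + 1.0

mean, V, std = fit_class_subspace(Zj, rank=2)
samples = sample_class_features(mean, V, std, num=10, rng=rng)
```

Features sampled this way stay on the fitted affine subspace of the class, so passing them through the decoder g would generate images conditioned on that class.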

**Decoding samples from the feature distribution.** Using the CIFAR-10 and CelebA datasets, we visualize images decoded from samples of the learned feature subspaces. For the CIFAR-10 dataset, for each class j, we first compute the top four principal components of the learned features ${\mathit{Z}}_{j}$ (via SVD). For each class j, we then compute $|\langle {\mathit{z}}_{j}^{i},{\mathit{v}}_{j}^{l}\rangle |$, the cosine similarity between the l-th principal direction ${\mathit{v}}_{j}^{l}$ and feature sample ${\mathit{z}}_{j}^{i}$. After finding the top five ${\mathit{z}}_{j}^{i}$ according to $|\langle {\mathit{z}}_{j}^{i},{\mathit{v}}_{j}^{l}\rangle |$ for each class j, we reconstruct images ${\widehat{\mathit{x}}}_{j}^{i}=g({\mathit{z}}_{j}^{i})$. Each row of Figure 6 is for one principal component. We observe that images in the same row share the same visual attributes; images in different rows differ significantly in visual characteristics such as shape, background, and style. See Figure A7 of Appendix A.4 for more visualization of principal components learned for all 10 classes of CIFAR-10. These results clearly demonstrate that the principal components in each subspace Gaussian disentangle different visual attributes. In addition, we do not observe any mode dropping for any of the classes, although the dimensions of the classes were not known a priori.

**Disentangled visual attributes as principal components.** For the CelebA dataset, we calculate the principal components of all learned features in the latent space. Figure 7a shows some decoded images along these principal directions. Again, these principal components seem to clearly disentangle visual attributes/factors such as wearing a hat, changing hair color, and wearing glasses. More examples can be found in Appendix A.6. The results are consistent with the property of MCR${}^{2}$ that promotes diversity of the learned features.

**Linear interpolation between features of two distinct samples.** Figure 7b shows interpolated features between pairs of training image samples of the CelebA dataset, where for two training images ${\mathit{x}}_{1}$ and ${\mathit{x}}_{2}$, we reconstruct based on their linearly interpolated feature representations by $\widehat{\mathit{x}}=g(\alpha f\left({\mathit{x}}_{1}\right)+(1-\alpha )f\left({\mathit{x}}_{2}\right)),\alpha \in [0,1]$. The decoded images show continuous morphing from one sample to another in terms of visual characteristics, as opposed to merely a superposition of the two images. Similar interpolation results between two digits in the MNIST dataset can be found in Figure A3 of Appendix A.2.

**Encoded features for classification.** Notice that not only is the learned decoder good for generative purposes, but the encoder is also good for discriminative tasks. In this experiment, we evaluate the discriminativeness of the learned CTRL model by testing how well the encoded features can help classify the images. We use features of the training images to compute the learned subspaces for all classes, then classify features of the test images based on a simple nearest subspace classifier. Many other encoding methods train a classifier (say, with an additional layer) after the learned features. Results in Table 3 show that our model gives competitive classification accuracy on MNIST compared to some of the best VAE-based methods. We also tested the classification on CIFAR-10, and the accuracy is currently about $80.7\%$. As expected, the representation learned with the multi-class objective is very discriminative and good for classification tasks. Note that none of the generative models (GANs, VAEs, and ours) is specifically engineered for classification tasks. Hence, one should not expect the classification accuracy to compete with supervised-trained classifiers yet. This demonstrates that the learned CTRL model is not only generative but also discriminative.
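
A nearest subspace classifier of the kind mentioned above can be sketched in a few lines. The sketch makes several assumptions that are not taken from the paper: the function names are hypothetical, each class subspace passes through the origin and is spanned by the top singular vectors of that class's training features, and the data are synthetic unit-norm features with small noise.

```python
import numpy as np

def fit_subspaces(Z_train, y_train, rank):
    """Return {class: orthonormal basis (d, rank)} from the top singular vectors of each class."""
    bases = {}
    for j in np.unique(y_train):
        U, _, _ = np.linalg.svd(Z_train[:, y_train == j], full_matrices=False)
        bases[int(j)] = U[:, :rank]
    return bases

def nearest_subspace_predict(Z_test, bases):
    """Assign each test feature to the class whose subspace leaves the smallest projection residual."""
    classes = sorted(bases)
    residuals = np.stack([
        np.linalg.norm(Z_test - bases[j] @ (bases[j].T @ Z_test), axis=0)
        for j in classes
    ])
    return np.asarray(classes)[residuals.argmin(axis=0)]

def make_class(rng, rows, n, d=6, noise=0.01):
    """Synthetic unit-norm features supported on the given coordinates, plus small noise."""
    Z = np.zeros((d, n))
    Z[rows] = rng.standard_normal((len(rows), n))
    Z /= np.linalg.norm(Z, axis=0, keepdims=True)
    return Z + noise * rng.standard_normal((d, n))

rng = np.random.default_rng(0)
Z_train = np.hstack([make_class(rng, [0, 1], 40), make_class(rng, [2, 3], 40)])
y_train = np.repeat([0, 1], 40)
Z_test = np.hstack([make_class(rng, [0, 1], 10), make_class(rng, [2, 3], 10)])
y_test = np.repeat([0, 1], 10)

pred = nearest_subspace_predict(Z_test, fit_subspaces(Z_train, y_train, rank=2))
accuracy = float((pred == y_test).mean())
```

When the class subspaces are close to orthogonal, as the learned LDR is designed to make them, the residual to the wrong subspace is large and this simple rule needs no additional trained layer.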

## 4. Open Theoretical Problems

- Learn a simple linear discriminative representation $f\left(\mathit{X}\right)$ of the data $\mathit{X}$, which we can reliably use to classify the data.
- Learn a reconstruction $g\circ f\left(\mathit{X}\right)\sim \mathit{X}$ of the so-learned representation $f\left(\mathit{X}\right)$, to ensure consistency in the representation.
- Learn both representation and reconstruction in a closed-loop manner, using feedback from the encoder f and decoder g to jointly solve the above two tasks.

#### 4.1. Distributions of the LDR Representation

#### 4.2. Self-Consistency in the Learned Reconstruction

#### 4.3. Properties of the Closed-Loop Minimax Game

## 5. Conclusions and Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

#### Appendix A.1. Experiment Settings and Implementation Details

**Network backbones.** For MNIST, we use the standard CNN models in Table A1 and Table A2, following the DCGAN architecture [63]. We resize the MNIST image resolution from 28 × 28 to 32 × 32 to fit the DCGAN architecture. All $\alpha $ in lReLU (lReLU is short for Leaky-ReLU) of the encoder are set to 0.2.

$\mathit{z}\in {\mathbb{R}}^{1\times 1\times 128}$ |
---|
4 × 4, stride = 1, pad = 0 deconv. BN 256 ReLU |
4 × 4, stride = 2, pad = 1 deconv. BN 128 ReLU |
4 × 4, stride = 2, pad = 1 deconv. BN 64 ReLU |
4 × 4, stride = 2, pad = 1 deconv. 1 Tanh |

Gray image $\mathit{x}\in {\mathbb{R}}^{32\times 32\times 1}$ |
---|
4 × 4, stride = 2, pad = 1 conv 64 lReLU |
4 × 4, stride = 2, pad = 1 conv. BN 128 lReLU |
4 × 4, stride = 2, pad = 1 conv. BN 256 lReLU |
4 × 4, stride = 1, pad = 0 conv 128 |
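
As a quick sanity check on the two MNIST tables above, the spatial sizes can be traced layer by layer with the standard output-size formulas for convolutions and transposed convolutions. The sketch assumes dilation 1 and no output padding, which the tables do not state; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution (dilation 1 assumed)."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride, pad):
    """Spatial output size of a transposed convolution (no output padding assumed)."""
    return (size - 1) * stride - 2 * pad + kernel

def trace(size, layers, step):
    """Apply `step` for each (kernel, stride, pad) layer and record every intermediate size."""
    sizes = [size]
    for kernel, stride, pad in layers:
        size = step(size, kernel, stride, pad)
        sizes.append(size)
    return sizes

# (kernel, stride, pad) per row, read off the decoder and encoder tables.
decoder_layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
encoder_layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]

decoder_sizes = trace(1, decoder_layers, deconv_out)   # 1 -> 4 -> 8 -> 16 -> 32
encoder_sizes = trace(32, encoder_layers, conv_out)    # 32 -> 16 -> 8 -> 4 -> 1
```

The decoder expands a 1 × 1 feature map to a 32 × 32 image and the encoder reverses the path, which matches the resizing of MNIST to 32 × 32 mentioned above.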

$\mathit{z}\in {\mathbb{R}}^{128}$ |
---|
dense $\to$ 4 × 4 × 256 |
ResBlock up 256 |
ResBlock up 256 |
ResBlock up 256 |
BN, ReLU, 3 × 3 conv, 3 Tanh |

RGB image $\mathit{x}\in {\mathbb{R}}^{32\times 32\times 3}$ |
---|
ResBlock down 128 |
ResBlock down 128 |
ResBlock 128 |
ResBlock 128 |
ReLU |
Global sum pooling |
dense $\to$ 128 |

$\mathit{z}\in {\mathbb{R}}^{128}$ |
---|
dense $\to$ 6 × 6 × 512 |
ResBlock up 256 |
ResBlock up 128 |
ResBlock up 64 |
BN, ReLU, 3 × 3 conv, 3 Tanh |

RGB image $\mathit{x}\in {\mathbb{R}}^{48\times 48\times 3}$ |
---|
ResBlock down 64 |
ResBlock down 128 |
ResBlock down 256 |
ResBlock down 512 |
ResBlock 1024 |
ReLU |
Global sum pooling |
dense $\to$ 128 |

$\mathit{z}\in {\mathbb{R}}^{128}$ |
---|
dense $\to$ 4 × 4 × 1024 |
ResBlock up 1024 |
ResBlock up 512 |
ResBlock up 256 |
ResBlock up 128 |
ResBlock up 64 |
BN, ReLU, 3 × 3 conv, 3 Tanh |

RGB image $\mathit{x}\in {\mathbb{R}}^{128\times 128\times 3}$ |
---|
ResBlock down 64 |
ResBlock down 128 |
ResBlock down 256 |
ResBlock down 512 |
ResBlock down 1024 |
ResBlock 1024 |
ReLU |
Global sum pooling |
dense $\to$ 128 |

**Optimization and training details.** Across all of our experiments, we use Adam [74] as our optimizer, with hyperparameters ${\beta}_{1}=0.5,{\beta}_{2}=0.999$. We adopt the simple gradient descent–ascent algorithm, alternating between minimizing and maximizing the objectives. The initial learning rate is set to 0.00015 and is scheduled with linear decay. We choose ${\epsilon}^{2}=0.5$ for both Equations (12) and (13) in all CTRL experiments. For all CTRL-Multi experiments on ImageNet, we only choose 10 classes. The details of the 10 classes are shown in Table A9. Most experiments are trained on RTX 3090 GPUs.
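
To illustrate what alternating gradient descent–ascent means in the simplest case, the toy sketch below plays the game on a scalar objective that is concave in the maximizing variable and convex in the minimizing one. Everything here is an assumption made for illustration: the objective is a stand-in for the rate-reduction objectives, plain gradient steps replace Adam, and the step size and iteration count are arbitrary.

```python
def toy_objective(theta, eta):
    """Stand-in utility: concave in theta (maximizer), convex in eta (minimizer), saddle at (0, 0)."""
    return -0.5 * theta ** 2 + theta * eta + 0.5 * eta ** 2

def alternating_gda(theta, eta, lr=0.1, steps=200):
    """Alternate one ascent step on theta with one descent step on eta."""
    for _ in range(steps):
        theta += lr * (-theta + eta)   # ascent: d/dtheta of the objective is -theta + eta
        eta -= lr * (theta + eta)      # descent: d/deta of the objective is theta + eta
    return theta, eta

theta_final, eta_final = alternating_gda(theta=1.0, eta=1.0)
```

In this toy problem the iterates spiral into the saddle point at the origin; in the actual experiments the encoder parameters play the role of the maximizing variable and the decoder parameters that of the minimizing one.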

ID | Category |
---|---|
n02930766 | cab, hack, taxi, taxicab |
n04596742 | wok |
n02974003 | car wheel |
n01491361 | tiger shark, Galeocerdo cuvieri |
n01514859 | hen |
n09472597 | volcano |
n07749582 | lemon |
n09428293 | seashore, coast, seacoast, sea-coast |
n02504458 | African elephant, Loxodonta africana |
n04285008 | sports car, sport car |

#### Appendix A.2. MNIST

**Settings.** On the MNIST dataset, we train our model using the DCGAN [63] architecture with our proposed objectives CTRL-Multi (12) and CTRL-Binary (13). The learning rate is set to ${10}^{-4}$ and the batch size is set to 2048. We train our model for 15,000 iterations.

**More results illustrating auto-encoding.** Here, we give more reconstruction results $\widehat{\mathit{X}}$ from the CTRL-Multi and CTRL-Binary objectives, compared to their corresponding original inputs $\mathit{X}$. As shown in Figure A1, the CTRL-Binary objective can generate clean digit-like images, but the decoded $\widehat{\mathit{X}}$ might resemble digits from classes that are similar to, but different from, that of the input $\mathit{X}$, since CTRL-Binary tends to align only the overall distribution of all digits.

**Images decoded from random samples on the learned multi-class LDR.** Since our CTRL-Multi objective function maps input data of each class into a different (orthogonal) subspace in the feature space, we can generate images conditioned on each class by randomly sampling $\mathit{z}$ in the subspace of that class and then decoding the samples back to the input space as $\widehat{\mathit{x}}$.
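The class-conditional sampling described above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code: the sampling rule (class mean plus Gaussian coefficients along the top principal directions, scaled by the standard deviation along each direction) is an assumption, and the name `sample_class_features` and the parameter `top_r` are hypothetical.

```python
import numpy as np


def sample_class_features(Z_j, num_samples, top_r=8, rng=None):
    """Draw features near the principal subspace of one class.

    Z_j: array of shape (n_j, d), one encoded feature per row.
    Assumed sampling rule (not quoted from the source):
        z = mean + sum_i n_i * sigma_i * v_i,  n_i ~ N(0, 1),
    where v_i are the top-r principal directions and sigma_i is the
    standard deviation of the class features along v_i.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = Z_j.mean(axis=0)
    # Rows of Vt are the principal directions of the centered features.
    _, s, Vt = np.linalg.svd(Z_j - mean, full_matrices=False)
    r = min(top_r, Vt.shape[0])
    sigma = s[:r] / np.sqrt(max(Z_j.shape[0] - 1, 1))
    coeffs = rng.standard_normal((num_samples, r)) * sigma
    return mean + coeffs @ Vt[:r]
```

The returned rows would then be passed through the learned decoder g to obtain images for that class.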

**Interpolation between samples in different classes.** We randomly sample some images from each class. For each image ${\mathit{x}}_{1}$, we randomly sample another image ${\mathit{x}}_{2}$ from a different class. For such a pair of images ${\mathit{x}}_{1}$ and ${\mathit{x}}_{2}$, we reconstruct them based on their linearly interpolated feature representations by $\widehat{\mathit{x}}=g(\alpha f\left({\mathit{x}}_{1}\right)+(1-\alpha )f\left({\mathit{x}}_{2}\right)),\alpha \in [0,1]$, the results of which are shown in Figure A3. In each row of the figure, from left to right, the reconstructed images continuously morph from one digit to a different digit with a natural transition in shape rather than a simple superposition of the two images. This also confirms that the space between the subspaces for the digits does not represent valid digits but only shapes with digit-like strokes. Hence, for generative purposes, knowing the supports of valid digits is extremely important.
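The interpolation formula above can be sketched as a small helper. The encoder `f` and decoder `g` are passed in as arbitrary callables; the helper name `interpolate_decodings` and the evenly spaced grid of $\alpha$ values are assumptions made for illustration.

```python
import numpy as np


def interpolate_decodings(f, g, x1, x2, num_steps=8):
    """Decode g(alpha * f(x1) + (1 - alpha) * f(x2)) for alpha on a grid in [0, 1].

    f, g: callables standing in for the learned encoder and decoder.
    Returns a list of decoded outputs; alpha = 0 reproduces g(f(x2)) and
    alpha = 1 reproduces g(f(x1)).
    """
    z1, z2 = f(x1), f(x2)
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [g(a * z1 + (1.0 - a) * z2) for a in alphas]
```

With trained networks, each returned element would correspond to one image in a row of the interpolation figure.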

#### Appendix A.3. Transformed MNIST

**Settings.** In this experiment, we verify that the CTRL-Multi objective can preserve diverse data modes in the learned feature embeddings. We construct a transformed MNIST dataset with five modes: normal, large ($1.5\times $), small ($0.5\times $), rotate ${45}^{\circ}$ left, and rotate ${45}^{\circ}$ right. Each image data point will be randomly transformed to one of the modes. Representative examples of such training data can be found in Figure A4a. We train the model with learning rate $1\times {10}^{-4}$ and batch size 2048 for 15,000 iterations.

**Auto-encoding results.** Figure A4b gives the decoded results of the training data with different modes. Even though the data are now much more diverse for each class, the decoder learned from the CTRL-Multi objective can still achieve high sample-wise similarity to the original images.

**Figure A4.**Original (training) data $\mathit{X}$ and their decoded version $\widehat{\mathit{X}}$ on the transformed MNIST.

**Identifying different modes.** Similar to the earlier experiments of Figure 6 for CIFAR-10 in the main paper, we find the top principal components of the features of each class ${\mathit{Z}}_{j}$ (via SVD) and generate new images using the learned decoder g from the features of the training images that align best with these components.

**Figure A5.**The reconstructed images $\widehat{\mathit{X}}$ from the features $\mathit{Z}$ best aligned along top-8 principal components on the transformed MNIST dataset. Each row represents a different principal component.

#### Appendix A.4. CIFAR-10

**Settings.**For all experiments on CIFAR-10, we follow the common training hyper-parameters in Appendix A.1. Beyond that, for each experiment, we run 450,000 iterations with batch size 1600.

**Images decoded from random samples on the CTRL-Multi.**We sample $\mathit{z}$ in the feature space randomly along the principal components and around the mean feature of each class ${\mathit{Z}}_{j}$ as in the MNIST case, according to Equation (A1). The generated images from the sampled features are illustrated in Figure A6, one row per class. As we see, the generator learned from the CTRL-Multi objective is capable of generating diverse images for each class.

**Generating images along different PCA components for each class.** For each class, we first compute the top 10 principal components (singular vectors of the SVD) of $\mathit{Z}$. Then, for each of the top singular vectors, we display in each row the top 10 reconstructed images $\widehat{\mathit{X}}$ whose $\mathit{Z}$ are closest to the singular vector, using the method described in Section 3.3 of the main body of the paper. The results are given in Figure A7. Notice that images in each row are very similar as they are sampled along the same principal component, whereas images in different rows are very different as they are orthogonal in the feature space. These results indicate that the features learned by our method can not only disentangle different classes as orthogonal subspaces but can also disentangle different visual attributes within each class as (orthogonal) principal components within each subspace.
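The selection step above can be sketched in NumPy. This is an illustrative reading rather than the procedure of Section 3.3 itself: it assumes that "closest" means largest absolute cosine similarity with the singular vector and that the features are not mean-centered before the SVD. The function name `closest_to_principal_components` is hypothetical.

```python
import numpy as np


def closest_to_principal_components(Z_j, num_components=10, num_images=10):
    """Indices of the features most aligned with each top singular vector.

    Z_j: array of shape (n_j, d), one feature per row.
    Returns an integer array of shape (k, m); row i lists the features with
    the largest |cos(z, v_i)|, best match first, where k and m are capped by
    the number of available components and features.
    """
    _, _, Vt = np.linalg.svd(Z_j, full_matrices=False)
    k = min(num_components, Vt.shape[0])
    # Normalize rows so that the score is a cosine similarity.
    norms = np.linalg.norm(Z_j, axis=1, keepdims=True)
    Z_unit = Z_j / np.maximum(norms, 1e-12)
    scores = np.abs(Z_unit @ Vt[:k].T)       # shape (n_j, k)
    order = np.argsort(-scores, axis=0)      # descending per component
    return order[:num_images].T
```

Decoding the selected features with g, one row per component, would give a grid laid out like the one described for Figure A7.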

**Figure A7.**Reconstructed images $\widehat{\mathit{X}}$ from features $\mathit{Z}$ close to the principal components learned for the 10 classes of CIFAR-10.

#### Appendix A.5. STL-10

**Settings.** For all experiments on STL-10, we follow the common training hyper-parameters in Appendix A.1. For the CTRL-Binary setting, we train for 150,000 iterations. For the CTRL-Multi setting, we initialize the weights from the CTRL-Binary checkpoint at the 20,000th iteration and train for another 80,000 iterations (with the CTRL-Multi objective). The IS and FID scores on the STL-10 dataset are reported in Table A10; they are on par with or even better than existing methods such as SNGAN [31] or DC-VAE [42].

**Visualizing auto-encoding property for the CTRL-Binary.**We visualize the original images $\mathit{x}$ and their decoded $\widehat{\mathit{x}}$ generated by the LDR model learned from the CTRL-Binary objective. The results are shown in Figure A8 for STL-10.

**Figure A8.**Visualizing the original $\mathit{x}$ and corresponding decoded $\widehat{\mathit{x}}$ results on STL-10 dataset. Note the model is trained from the CTRL-Binary objective hence sample- or class-wise correspondence is relatively poor, but the decoded image quality is very good.

#### Appendix A.6. Celeb-A and LSUN

**Settings.** For all experiments on these datasets, we follow the common training hyper-parameters in Appendix A.1. We choose a batch size of 300 for Celeb-A and LSUN. Both models are trained with the CTRL-Binary objective for 450,000 iterations.

**Generating images along different PCA components.** We calculate the principal components of the learned features $\mathit{Z}$ in the latent subspace. We manually choose three principal components that are related to hat, hair color, and glasses (see Figure A9). They are the 9th, 19th, and 23rd of the 128 principal components. These principal directions seem to clearly disentangle visual attributes/factors such as wearing a hat, changing hair color, and wearing glasses.

**Images generated from random sampling of the feature space.**We sample $\mathit{z}$ randomly according to the following Gaussian model:

**Visualizing auto-encoding property for CTRL-Binary.** We visualize the original images $\mathit{x}$ and their decoded $\widehat{\mathit{x}}$ using the LDR model learned from the CTRL-Binary objective. The results are shown in Figure A11 and Figure A12 for the Celeb-A dataset and the LSUN dataset, respectively. The CTRL-Binary objective can give very good visual quality for $\widehat{\mathit{x}}$ but cannot ensure sample-to-sample alignment. Nevertheless, the decoded $\widehat{\mathit{x}}$ seems to be very similar to the original $\mathit{x}$ in some main visual attributes. We believe the binary objective manages to align only the dominant principal component(s) associated with the most salient visual attributes, say, pose of the face for Celeb-A or layout of the room for LSUN, between features of $\mathit{X}$ and $\widehat{\mathit{X}}$.

**Figure A9.**Sampling along the 9th, 19th, and 23rd principal components of the learned features $\mathit{Z}$ seems to manipulate the visual attributes for generated images on the CelebA dataset.

**Figure A10.**Images decoded from randomly sampled features, as a learned Gaussian distribution (A2), for the CelebA dataset.

**Figure A11.**Visualizing the original $\mathit{x}$ and corresponding decoded $\widehat{\mathit{x}}$ results on Celeb-A dataset. The LDR model is trained from the CTRL-Binary objective.

**Figure A12.**Visualizing the original $\mathit{x}$ and corresponding decoded $\widehat{\mathit{x}}$ results on LSUN-bedroom dataset. The LDR model is trained from the CTRL-Binary objective.

#### Appendix A.7. ImageNet

**Settings.** To verify that CTRL works on large-scale datasets, we train it on ImageNet. For all experiments on ImageNet, we follow the common training hyper-parameters in Appendix A.1.

**Visualizing feature similarity for CTRL-Multi.** We visualize the cosine similarity among features $\mathit{Z}$ of different classes learned from the CTRL-Multi objective in Figure A13. In addition, we visualize the alignment between the features $\mathit{Z}$ and the decoded features $\widehat{\mathit{Z}}$. These results demonstrate that not only has the encoder learned to discriminate between classes, but the learned $\mathit{Z}$ and $\widehat{\mathit{Z}}$ are also clearly aligned within each class.
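The similarity matrices $|{\mathit{Z}}^{\top}\mathit{Z}|$ and $|{\mathit{Z}}^{\top}\widehat{\mathit{Z}}|$ discussed above can be computed as in the following NumPy sketch. It assumes features are stored one per column and sorted by class, so that class structure shows up as diagonal blocks; the function name `abs_cosine_similarity` is hypothetical.

```python
import numpy as np


def abs_cosine_similarity(Z, Z_hat=None):
    """Return |Z^T Z_hat| after normalizing every column to unit length.

    Z, Z_hat: arrays of shape (d, n), one feature per column. If Z_hat is
    omitted, the similarity of Z with itself is returned.
    """
    Z_hat = Z if Z_hat is None else Z_hat

    def normalize(M):
        # Guard against zero columns to avoid division by zero.
        return M / np.maximum(np.linalg.norm(M, axis=0, keepdims=True), 1e-12)

    return np.abs(normalize(Z).T @ normalize(Z_hat))
```

Rendering the returned matrix as a heatmap would give a plot of the kind shown in Figure A13; features from orthogonal class subspaces produce near-zero off-diagonal blocks.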

**Table A10.** Comparison on CIFAR-10, STL-10, and ImageNet. ↑ means higher is better. ↓ means lower is better.

Method | CIFAR-10 IS↑ | CIFAR-10 FID↓ | STL-10 IS↑ | STL-10 FID↓ | ImageNet IS↑ | ImageNet FID↓ |
---|---|---|---|---|---|---|
GAN based methods | | | | | | |
DCGAN [63] | 6.6 | - | 7.8 | - | - | - |
SNGAN [31] | 7.4 | 29.3 | 9.1 | 40.1 | - | 48.73 |
CSGAN [28] | 8.1 | 19.6 | - | - | - | - |
LOGAN [29] | 8.7 | 17.7 | - | - | - | - |
VAE/GAN based methods | | | | | | |
VAE [5] | 3.8 | 115.8 | - | - | - | - |
VAE/GAN [64] | 7.4 | 39.8 | - | - | - | - |
NVAE [41] | - | 50.8 | - | - | - | - |
DC-VAE [42] | 8.2 | 17.9 | 8.1 | 41.9 | - | - |
CTRL-Binary (ours) | 8.1 | 19.6 | 8.4 | 38.6 | 7.74 | 46.95 |
CTRL-Multi (ours) | 7.1 | 23.9 | 7.7 | 45.7 | 6.44 | 55.51 |

**Figure A13.** Visualizing feature alignment: (**a**) among features $|{\mathit{Z}}^{\top}\mathit{Z}|$, (**b**) between features and decoded features $|{\mathit{Z}}^{\top}\widehat{\mathit{Z}}|$. These results are obtained after 200,000 iterations.

**Visualizing auto-encoding property for CTRL-Multi.**We visualize the original images $\mathit{X}$ and their decoded $\widehat{\mathit{X}}$ using the LDR model fine-tuned with the CTRL-Multi objective. The results are shown in Figure A14 for the selected 10 classes in ImageNet. The CTRL-Multi objective can give good visual quality for $\widehat{\mathit{X}}$ as well as sample-to-sample alignment.

**Figure A14.**Visualizing the original $\mathit{X}$ and corresponding decoded $\widehat{\mathit{X}}$ results on ImageNet (10 classes). The LDR model is fine-tuned using the CTRL-Multi objective. These visualizations are obtained after 35,000 iterations.

#### Appendix A.8. Ablation Study on Closed-Loop Transcription and Objective Functions

#### Appendix A.8.1. The Importance of the Closed-Loop

**Figure A15.**Qualitative results for ablation study with alternative architectures to the proposed CTRL.

#### Appendix A.8.2. The Importance of Rate Reduction

**Results on MNIST.** The training hyper-parameters of CTRL-Multi and Closed-loop-CE on MNIST follow Appendix A.1. Comparisons between CTRL-Multi and Closed-loop-CE are shown in Figure A16, Figure A17 and Figure A18.

**Figure A16.**The comparison of sample-wise reconstruction between the Closed-loop-CE objective and the CTRL-Multi objective.

**Figure A17.**Training samples along different principal components of the learned features of digit ‘2’.

**Figure A18.** Comparison of Closed-loop-CE and CTRL-Multi on $|{\mathit{Z}}^{\top}\widehat{\mathit{Z}}|$ and PCA singular values. (**a**) $|{\mathit{Z}}^{\top}\widehat{\mathit{Z}}|$ from Closed-loop-CE. (**b**) $|{\mathit{Z}}^{\top}\widehat{\mathit{Z}}|$ from CTRL-Multi. (**c**) PCA of learned features by the Closed-loop-CE objective for each class. (**d**) PCA of learned features by the CTRL-Multi objective for each class.

**Failed Attempts on CIFAR-10 with Cross Entropy.** The training hyper-parameters of Closed-loop-CE on CIFAR-10 follow Appendix A.1. We perform a grid search over three hyper-parameters: learning rate $\{1.5\times {10}^{-2},1.5\times {10}^{-3},1.5\times {10}^{-4}\}$, batch size (800 or 1600), and inner loop (1, 2, 3, or 4), conducting 24 experiments in total. In all cases, Closed-loop-CE fails to converge or experiences model collapse on the CIFAR-10 dataset.
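The size of this grid can be checked by enumerating it. The sketch below only builds the list of configurations from the values quoted above; the dictionary keys are hypothetical names, and no training is performed.

```python
from itertools import product

# Values quoted in the text for the Closed-loop-CE grid search.
learning_rates = [1.5e-2, 1.5e-3, 1.5e-4]
batch_sizes = [800, 1600]
inner_loops = [1, 2, 3, 4]

# One dictionary per configuration; key names are illustrative only.
grid = [
    {"lr": lr, "batch_size": bs, "inner_loop": k}
    for lr, bs, k in product(learning_rates, batch_sizes, inner_loops)
]

print(len(grid))  # 3 * 2 * 4 = 24
```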

#### Appendix A.8.3. Ablation Study on the CTRL-Multi Objectives

Objective I: | $\min_{\eta}\max_{\theta}{\mathcal{T}}_{\mathit{X}}(\theta ,\eta )=\Delta R\left(\mathit{Z}\left(\theta \right)\right)+\Delta R\left(\widehat{\mathit{Z}}(\theta ,\eta )\right)+{\sum}_{j=1}^{k}\Delta R\left({\mathit{Z}}_{j}\left(\theta \right),{\widehat{\mathit{Z}}}_{j}(\theta ,\eta )\right).$ |

Objective II: | $\min_{\eta}\max_{\theta}{\mathcal{T}}_{\mathit{X}}(\theta ,\eta )=\Delta R\left(\mathit{Z}\left(\theta \right)\right)+{\sum}_{j=1}^{k}\Delta R\left({\mathit{Z}}_{j}\left(\theta \right),{\widehat{\mathit{Z}}}_{j}(\theta ,\eta )\right).$ |

Objective III: | $\min_{\eta}\max_{\theta}{\mathcal{T}}_{\mathit{X}}(\theta ,\eta )={\sum}_{j=1}^{k}\Delta R\left({\mathit{Z}}_{j}\left(\theta \right),{\widehat{\mathit{Z}}}_{j}(\theta ,\eta )\right).$ |
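To make the difference between the three objectives concrete, the sketch below evaluates them with NumPy for given feature matrices. It assumes the standard coding-rate definitions from the rate reduction literature [13], namely $R(\mathit{Z})=\frac{1}{2}\log\det(\mathit{I}+\frac{d}{n{\epsilon}^{2}}\mathit{Z}{\mathit{Z}}^{\top})$, with ${\epsilon}^{2}=0.5$ as stated in Appendix A.1. The exact normalization used in the main paper is not reproduced in this appendix, so the formulas and all function names here are assumptions.

```python
import numpy as np


def coding_rate(Z, eps2=0.5):
    """R(Z) = 1/2 logdet(I + d / (n * eps^2) * Z Z^T) for Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps2)) * (Z @ Z.T))
    return 0.5 * logdet


def rate_reduction(Z, labels, eps2=0.5):
    """Delta R(Z): rate of the whole set minus size-weighted per-class rates."""
    n = Z.shape[1]
    within = sum(
        (np.sum(labels == j) / n) * coding_rate(Z[:, labels == j], eps2)
        for j in np.unique(labels)
    )
    return coding_rate(Z, eps2) - within


def rate_distance(Z1, Z2, eps2=0.5):
    """Delta R(Z1, Z2): rate of the union minus size-weighted individual rates."""
    n1, n2 = Z1.shape[1], Z2.shape[1]
    union = np.concatenate([Z1, Z2], axis=1)
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    return coding_rate(union, eps2) - w1 * coding_rate(Z1, eps2) - w2 * coding_rate(Z2, eps2)


def ctrl_multi_objective(Z, Z_hat, labels, variant="I", eps2=0.5):
    """Value of Objective I, II, or III for features Z and decoded features Z_hat."""
    pair_term = sum(
        rate_distance(Z[:, labels == j], Z_hat[:, labels == j], eps2)
        for j in np.unique(labels)
    )
    if variant == "III":
        return pair_term
    value = rate_reduction(Z, labels, eps2) + pair_term
    if variant == "I":
        value += rate_reduction(Z_hat, labels, eps2)
    return value
```

Under these definitions, the class-wise distance term vanishes when $\widehat{\mathit{Z}}=\mathit{Z}$, so Objective III evaluates to zero at a perfect transcription while Objectives I and II still reward discriminative features.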

**Figure A19.**The influence of the choice of objective functions on the reconstruction: decoded images $\widehat{\mathit{X}}$ from the objective I, II, or III.

**Figure A20.**Correlation $|{\mathit{Z}}^{\top}\widehat{\mathit{Z}}|$ between features $\mathit{Z}$ and $\widehat{\mathit{Z}}$ learned with Objective I, II, or III.

#### Appendix A.9. Ablation Study on Sensitivity to Spectral Normalization

**Table A12.** Ablation study on the influence of spectral normalization (SN). ↑ means higher is better. ↓ means lower is better.

Backbone = SNGAN | Metric | CTRL-Binary, SN = True | CTRL-Binary, SN = False | CTRL-Multi, SN = True | CTRL-Multi, SN = False |
---|---|---|---|---|---|
CIFAR-10 | IS ↑ | 8.1 | 6.6 | 7.1 | 5.8 |
CIFAR-10 | FID ↓ | 19.6 | 27.8 | 23.9 | 41.5 |

#### Appendix A.10. Ablation Study on Trade-Off between Network Width and Batch Size

**Table A13.** Ablation study on ImageNet on the trade-off between batch size (BS) and network width (Channel #).

Batch size | Channel# = 1024 | Channel# = 512 | Channel# = 256 |
---|---|---|---|
BS = 1800 | success | success | success |
BS = 1600 | success | success | success |
BS = 1024 | failure | success | success |
BS = 800 | failure | failure | success |
BS = 400 | failure | failure | failure |

#### Appendix A.11. Ablation Study on Feature Dimension

**Table A14.**IS and FID scores of images reconstructed by LDR models learned with different feature dimensions. ↑ means higher is better. ↓ means lower is better.

Dataset | Metric | CTRL-Binary, dim = 128 | CTRL-Multi, dim = 128 | CTRL-Binary, dim = 512 | CTRL-Multi, dim = 512 |
---|---|---|---|---|---|
CIFAR-10 | IS ↑ | 8.1 | 7.1 | 8.4 | 8.2 |
CIFAR-10 | FID ↓ | 19.6 | 23.6 | 18.7 | 20.5 |

## References

- Lee, J.M. Introduction to Smooth Manifolds; Springer: Berlin/Heidelberg, Germany, 2002.
- Chan, K.H.R.; Yu, Y.; You, C.; Qi, H.; Wright, J.; Ma, Y. ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction. arXiv **2021**, arXiv:2105.10446.
- Kramer, M.A. Nonlinear principal component analysis using autoassociative neural networks. AICHE J. **1991**, 37, 233–243.
- Hinton, G.E.; Zemel, R.S. Autoencoders, Minimum Description Length and Helmholtz Free Energy. In Proceedings of the 6th International Conference on Neural Information Processing Systems (NIPS’93), Siem Reap, Cambodia, 13–16 December 1993; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993; pp. 3–10.
- Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv **2013**, arXiv:1312.6114.
- Zhao, S.; Song, J.; Ermon, S. InfoVAE: Information maximizing variational autoencoders. arXiv **2017**, arXiv:1706.02262.
- Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. **2004**, 13, 600–612.
- Tu, Z. Learning Generative Models via Discriminative Approaches. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; pp. 2672–2680.
- Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 214–223.
- Salmona, A.; Delon, J.; Desolneux, A. Gromov-Wasserstein Distances between Gaussian Distributions. arXiv **2021**, arXiv:2104.07970.
- Wright, J.; Ma, Y. High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications; Cambridge University Press: Cambridge, UK, 2021.
- Yu, Y.; Chan, K.H.R.; You, C.; Song, C.; Ma, Y. Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2020.
- Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 35, 1798–1828.
- Srivastava, A.; Valkov, L.; Russell, C.; Gutmann, M.U.; Sutton, C. VeeGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 3310–3320.
- Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv **2014**, arXiv:1411.1784.
- Sohn, K.; Lee, H.; Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 3483–3491.
- Mathieu, M.F.; Zhao, J.J.; Zhao, J.; Ramesh, A.; Sprechmann, P.; LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 5040–5048.
- Van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 4790–4798.
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8798–8807.
- Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 2172–2180.
- Tang, S.; Zhou, X.; He, X.; Ma, Y. Disentangled Representation Learning for Controllable Image Synthesis: An Information-Theoretic Perspective. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 10042–10049.
- Li, K.; Malik, J. Implicit Maximum Likelihood Estimation. arXiv **2018**, arXiv:1809.09087.
- Li, K.; Peng, S.; Zhang, T.; Malik, J. Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood Estimation. Int. J. Comput. Vis. **2020**, 128, 2607–2628.
- Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2642–2651.
- Dumoulin, V.; Shlens, J.; Kudlur, M. A learned representation for artistic style. arXiv **2016**, arXiv:1610.07629.
- Brock, A.; Donahue, J.; Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv **2018**, arXiv:1809.11096.
- Wu, Y.; Rosca, M.; Lillicrap, T. Deep compressed sensing. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6850–6860.
- Wu, Y.; Donahue, J.; Balduzzi, D.; Simonyan, K.; Lillicrap, T. Logan: Latent optimisation for generative adversarial networks. arXiv **2019**, arXiv:1912.00953.
- Papyan, V.; Han, X.; Donoho, D.L. Prevalence of Neural Collapse during the terminal phase of deep learning training. arXiv **2020**, arXiv:2008.08186.
- Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Lin, Z.; Khetan, A.; Fanti, G.; Oh, S. Pacgan: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; pp. 1498–1507.
- Feizi, S.; Farnia, F.; Ginart, T.; Tse, D. Understanding GANs in the LQG Setting: Formulation, Generalization and Stability. IEEE J. Sel. Areas Inf. Theory **2020**, 1, 304–311.
- Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned similarity metric. arXiv **2015**, arXiv:1512.09300.
- Rosca, M.; Lakshminarayanan, B.; Warde-Farley, D.; Mohamed, S. Variational Approaches for Auto-Encoding Generative Adversarial Networks. arXiv **2017**, arXiv:1706.04987.
- Bao, J.; Chen, D.; Wen, F.; Li, H.; Hua, G. CVAE-GAN: Fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2745–2754.
- Huang, H.; He, R.; Sun, Z.; Tan, T.; Li, Z. IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; Volume 31.
- Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial feature learning. arXiv **2016**, arXiv:1605.09782.
- Dumoulin, V.; Belghazi, I.; Poole, B.; Mastropietro, O.; Lamb, A.; Arjovsky, M.; Courville, A. Adversarially learned inference. arXiv **2016**, arXiv:1606.00704.
- Ulyanov, D.; Vedaldi, A.; Lempitsky, V. It takes (only) two: Adversarial generator-encoder networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Vahdat, A.; Kautz, J. Nvae: A deep hierarchical variational autoencoder. arXiv **2020**, arXiv:2007.03898.
- Parmar, G.; Li, D.; Lee, K.; Tu, Z. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021; pp. 823–832.
- Bacharoglou, A. Approximation of probability distributions by convex mixtures of Gaussian measures. Proc. Am. Math. Soc. **2010**, 138, 2619.
- Hastie, T. Principal Curves and Surfaces; Technical Report; Stanford University: Stanford, CA, USA, 1984.
- Hastie, T.; Stuetzle, W. Principal Curves. J. Am. Stat. Assoc. **1987**, 84, 502–516.
- Vidal, R.; Ma, Y.; Sastry, S. Generalized Principal Component Analysis; Springer: Berlin/Heidelberg, Germany, 2016.
- Ma, Y.; Derksen, H.; Hong, W.; Wright, J. Segmentation of multivariate mixed data via lossy data coding and compression. PAMI **2007**, 29, 9.
- Jolliffe, I. Principal Component Analysis; Springer: New York, NY, USA, 1986.
- Hong, D.; Sheng, Y.; Dobriban, E. Selecting the number of components in PCA via random signflips. arXiv **2020**, arXiv:2012.02985.
- Farnia, F.; Ozdaglar, A.E. GANs May Have No Nash Equilibria. arXiv **2020**, arXiv:2002.09124.
- Dai, Y.H.; Zhang, L. Optimality Conditions for Constrained Minimax Optimization. arXiv **2020**, arXiv:2004.09730.
- Korpelevich, G.M. The extragradient method for finding saddle points and other problems. Matecon **1976**, 12, 747–756.
- Fiez, T.; Ratliff, L.J. Gradient Descent-Ascent Provably Converges to Strict Local Minmax Equilibria with a Finite Timescale Separation. arXiv **2020**, arXiv:2009.14820.
- Bai, S.; Kolter, J.Z.; Koltun, V. Deep Equilibrium Models. arXiv **2019**, arXiv:1909.01377.
- Ghaoui, L.E.; Gu, F.; Travacca, B.; Askari, A. Implicit Deep Learning. arXiv **2019**, arXiv:1908.06315.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2324.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 9 February 2022).
- Coates, A.; Ng, A.; Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 215–223.
- Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
- Yu, F.; Seff, A.; Zhang, Y.; Song, S.; Funkhouser, T.; Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv **2015**, arXiv:1506.03365.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. **2015**, 115, 211–252.
- Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv **2015**, arXiv:1511.06434.
- Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the International Conference on Machine Learning, PMLR, New York City, NY, USA, 19–24 June 2016; pp. 1558–1566.
- Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 2234–2242.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 6626–6637.
- Bennett, J.; Carbery, A.; Christ, M.; Tao, T. The Brascamp-Lieb Inequalities: Finiteness, Structure and Extremals. Geom. Funct. Anal. **2007**, 17, 1343–1415.
- Ditria, L.; Meyer, B.J.; Drummond, T. OpenGAN: Open Set Generative Adversarial Networks. arXiv **2020**, arXiv:2003.08074.
- Fiez, T.; Ratliff, L.J. Local Convergence Analysis of Gradient Descent Ascent with Finite Timescale Separation. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
- Härkönen, E.; Hertzmann, A.; Lehtinen, J.; Paris, S. Ganspace: Discovering interpretable GAN controls. arXiv **2020**, arXiv:2004.02546.
- Wu, Z.; Baek, C.; You, C.; Ma, Y. Incremental Learning via Rate Reduction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
- Tong, S.; Dai, X.; Wu, Z.; Li, M.; Yi, B.; Ma, Y. Incremental Learning of Structured Memory via Closed-Loop Transcription. arXiv **2022**, arXiv:2202.05411.
- Lee, K.S.; Town, C. Mimicry: Towards the Reproducibility of GAN Research. arXiv **2020**, arXiv:2005.02494.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv **2014**, arXiv:1412.6980.

**Figure 1.** **CTRL: A Closed-loop Transcription to an LDR.** The encoder f has dual roles: it learns an LDR $\mathit{z}$ for the data $\mathit{x}$ via maximizing the rate reduction of $\mathit{z}$ and it is also a “feedback sensor” for any discrepancy between the data $\mathit{x}$ and the decoded $\widehat{\mathit{x}}$. The decoder g also has dual roles: it is a “controller” that corrects the discrepancy between $\mathit{x}$ and $\widehat{\mathit{x}}$ and it also aims to minimize the overall coding rate for the learned LDR.

**Figure 2.** **Embeddings of Low-Dimensional Submanifolds in High-Dimensional Spaces.** ${S}_{\mathit{x}}$ (blue) is the submanifold for the original data $\mathit{x}$; ${S}_{\mathit{z}}$ (red) is the image of ${S}_{\mathit{x}}$ under the mapping f, representing the learned feature $\mathit{z}$; and the green curve is the image of the feature $\mathit{z}$ under the decoding mapping g.

**Figure 3.** Qualitative comparison on (**a**) MNIST, (**b**) CIFAR-10 and (**c**) ImageNet. First row: original $\mathit{X}$; other rows: reconstructed $\widehat{\mathit{X}}$ for different methods.

**Figure 4.** Visualizing the alignment between $\mathit{Z}$ and $\widehat{\mathit{Z}}$: $|{\mathit{Z}}^{\top}\widehat{\mathit{Z}}|$ in the feature space for (**a**) MNIST, (**b**) CIFAR-10, and (**c**) ImageNet-10-Class.

**Figure 5.**Visualizing the auto-encoding property of the learned closed-loop transcription ($\mathit{x}\approx \widehat{\mathit{x}}=g\circ f\left(\mathit{x}\right)$) on MNIST, CIFAR-10, and ImageNet (zoom in for better visualization).

**Figure 6.** **CIFAR-10 dataset.** Visualization of the top 5 reconstructions $\widehat{\mathit{x}}=g\left(\mathit{z}\right)$ whose features $\mathit{z}$ are closest to each of the top 4 principal components (one per row) of the data representations for class 7 (‘Horse’) and class 8 (‘Ship’).
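The selection described in the Figure 6 caption could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function name `closest_to_principal_components` is hypothetical, principal components are taken from an uncentered SVD of the class features, and "closest" is measured by absolute cosine similarity. The source does not specify these details.

```python
import numpy as np

def closest_to_principal_components(Z_class, n_components=4, top_k=5):
    """Indices of the samples closest to each leading principal component.

    Z_class: array of shape (d, n), features of one class, one per column
             (assumed layout).
    Returns an integer array of shape (n_components, top_k).
    """
    # Left singular vectors of the (uncentered) features act as principal directions.
    U, _, _ = np.linalg.svd(Z_class, full_matrices=False)
    U = U[:, :n_components]
    # Assumption: closeness is the absolute cosine similarity to each direction.
    Zn = Z_class / np.linalg.norm(Z_class, axis=0, keepdims=True)
    similarity = np.abs(U.T @ Zn)  # shape (n_components, n)
    # Sort each row in descending similarity and keep the first top_k indices.
    return np.argsort(-similarity, axis=1)[:, :top_k]

# Hypothetical toy data: five samples along axis 0, five weaker samples along axis 1.
Z_toy = np.zeros((3, 10))
Z_toy[0, :5] = 5.0
Z_toy[1, 5:] = 1.0
idx = closest_to_principal_components(Z_toy, n_components=2, top_k=5)
```

Each row of `idx` lists the samples that would be decoded with g and displayed in the corresponding row of the figure.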

**Figure 7.** **CelebA dataset.** (**a**) Sampling along three principal components that seem to correspond to different visual attributes; (**b**) samples decoded by interpolating along the line between features of two distinct samples.
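The two feature-space operations named in the Figure 7 caption can be sketched as below. The helper names `interpolate_features` and `sample_along_component`, the optional renormalization to unit length, and the toy vectors are assumptions for illustration only; the trained decoder g that turns each feature into an image is not reproduced here.

```python
import numpy as np

def interpolate_features(z_a, z_b, steps=8, renormalize=True):
    """Points on the line segment between two features, one per row."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    path = (1.0 - t) * z_a[None, :] + t * z_b[None, :]
    if renormalize:
        # Assumption: features are kept at unit norm.
        # This breaks down if z_a == -z_b, because the midpoint is then zero.
        path = path / np.linalg.norm(path, axis=1, keepdims=True)
    return path

def sample_along_component(z_center, u, scales):
    """Move from z_center along direction u by each value in scales."""
    scales = np.asarray(scales, dtype=float)[:, None]
    return z_center[None, :] + scales * u[None, :]

# Hypothetical toy features.
z_a = np.array([1.0, 0.0, 0.0])
z_b = np.array([0.0, 1.0, 0.0])
path = interpolate_features(z_a, z_b, steps=3)
sweep = sample_along_component(z_a, z_b, scales=[-1.0, 0.0, 1.0])
```

Each row of `path` or `sweep` would then be passed through the decoder g to obtain the images shown in panels (**b**) and (**a**), respectively.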

Dataset | Metric | GAN | GAN (CTRL-Binary) | VAE-GAN | CTRL-Binary | CTRL-Multi |
---|---|---|---|---|---|---|
MNIST | IS ↑ | 2.08 | 1.95 | 2.21 | 2.02 | 2.07 |
MNIST | FID ↓ | 24.78 | 20.15 | 33.65 | 16.43 | 16.47 |
CIFAR-10 | IS ↑ | 7.32 | 7.23 | 7.11 | 8.11 | 7.13 |
CIFAR-10 | FID ↓ | 26.06 | 22.16 | 43.25 | 19.63 | 23.91 |

**Table 2.** Comparison on CIFAR-10 and STL-10. Comparison with more existing methods and on ImageNet can be found in Table A10 in Appendix A. ↑ means higher is better. ↓ means lower is better.

Dataset | Metric | SNGAN | CSGAN | LOGAN | VAE-GAN | NVAE | DC-VAE | CTRL-Binary | CTRL-Multi |
---|---|---|---|---|---|---|---|---|---|
CIFAR-10 | IS ↑ | 7.4 | 8.1 | 8.7 | 7.4 | - | 8.2 | 8.1 | 7.1 |
CIFAR-10 | FID ↓ | 29.3 | 19.6 | 17.7 | 39.8 | 50.8 | 17.9 | 19.6 | 23.9 |
STL-10 | IS ↑ | 9.1 | - | - | - | - | 8.1 | 8.4 | 7.7 |
STL-10 | FID ↓ | 40.1 | - | - | - | - | 41.9 | 38.6 | 45.7 |

SNGAN, CSGAN, and LOGAN are GAN-based methods; VAE-GAN, NVAE, DC-VAE, CTRL-Binary, and CTRL-Multi are VAE/GAN-based methods.

**Table 3.** Classification accuracy on MNIST compared to classifier-based VAE methods [42]. Most of these VAE-based methods require auxiliary classifiers to boost classification performance.

Method | VAE | Factor VAE | Guide-VAE | DC-VAE | CTRL-Binary | CTRL-Multi |
---|---|---|---|---|---|---|
MNIST | 97.12% | 93.65% | 98.51% | 98.71% | 89.12% | 98.30% |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dai, X.; Tong, S.; Li, M.; Wu, Z.; Psenka, M.; Chan, K.H.R.; Zhai, P.; Yu, Y.; Yuan, X.; Shum, H.-Y.; Ma, Y. CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction. *Entropy* **2022**, *24*, 456.
https://doi.org/10.3390/e24040456
