CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction

This work proposes a new computational framework for learning a structured generative model for real-world datasets. In particular, we propose to learn a Closed-loop Transcription between a multi-class, multi-dimensional data distribution and a Linear discriminative representation (CTRL) in the feature space that consists of multiple independent multi-dimensional linear subspaces. We argue that the optimal encoding and decoding mappings sought can be formulated as a two-player minimax game between the encoder and decoder for the learned representation. A natural utility function for this game is the so-called rate reduction, a simple information-theoretic measure for distances between mixtures of subspace-like Gaussians in the feature space. Our formulation draws inspiration from closed-loop error feedback in control systems and avoids the expensive evaluation and minimization of approximate distances between arbitrary distributions in either the data space or the feature space. To a large extent, this new formulation unifies the concepts and benefits of Auto-Encoding and GAN and naturally extends them to the setting of learning a representation that is both discriminative and generative for multi-class and multi-dimensional real-world data. Our extensive experiments on many benchmark imagery datasets demonstrate the tremendous potential of this new closed-loop formulation: under fair comparison, the visual quality of the learned decoder and the classification performance of the encoder are competitive with, and arguably better than, existing methods based on GAN, VAE, or a combination of both. Unlike existing generative models, the so-learned features of the multiple classes are structured instead of hidden: different classes are explicitly mapped onto corresponding independent principal subspaces in the feature space, and diverse visual attributes within each class are modeled by the independent principal components within each subspace.


Introduction
One of the most fundamental tasks in modern data science and machine learning is to learn and model complex distributions (or structures) of real-world data, such as images or texts, from a set of observed samples. By "to learn and model", one typically means that we want to establish a (parametric) mapping between the distribution of the real data, say $x \in \mathbb{R}^D$, and a more compact random variable, say $z \in \mathbb{R}^d$:
$$f(\cdot, \theta): x \in \mathbb{R}^D \mapsto z \in \mathbb{R}^d \quad \text{or the inverse} \quad g(\cdot, \eta): z \in \mathbb{R}^d \mapsto x \in \mathbb{R}^D, \tag{1}$$
where $z$ has a certain standard structure or distribution (e.g., normal distributions). The so-learned representation or feature $z$ would be much easier to use for either generative (e.g., decoding or replaying) or discriminative (e.g., classification) purposes, or both.
Data embedding versus data transcription. Be aware that the support of the distribution of $x$ (and that of $z$) is typically extremely low-dimensional compared to that of the ambient space. (For instance, the well-known CIFAR-10 dataset consists of RGB images with a resolution of 32 × 32. Although the images lie in a space of $\mathbb{R}^{3072}$, our experiments will show that the intrinsic dimension of each class is less than a dozen, even after they are mapped into a feature space of $\mathbb{R}^{128}$.) Hence, the above mapping(s) may not be uniquely defined based on the support in the space $\mathbb{R}^D$ (or $\mathbb{R}^d$). In addition, the data $x$ may contain multiple components (e.g., modes, classes), and the intrinsic dimensions of these components are not necessarily the same. Hence, without loss of generality, we may assume the data $x$ to be distributed over a union of low-dimensional nonlinear submanifolds $\cup_{j=1}^{k} M_j \subset \mathbb{R}^D$, where each submanifold $M_j$ is of dimension $d_j \ll D$. Regardless, we hope the learned mappings $f$ and $g$ are (locally dimension-preserving) embedding maps [1] when restricted to each of the components $M_j$. In general, the dimension of the feature space $d$ needs to be significantly higher than all of these intrinsic dimensions of the data: $d > d_j$.
In fact, it should preferably be higher than the sum of all the intrinsic dimensions: $d \geq d_1 + \cdots + d_k$, since we normally expect that the features of different components/classes can be made fully independent or orthogonal in $\mathbb{R}^d$. Hence, without any explicit control of the mapping process, the actual features associated with images of the data under the embedding could still lie on some arbitrary nonlinear low-dimensional submanifolds inside the feature space $\mathbb{R}^d$. The distribution of the learned features remains "latent" or "hidden" in the feature space.
So, for features of the learned mappings (1) to be truly convenient to use for purposes such as data classification and generation, the goal of learning such mappings should be not only to reduce the dimension of the data $x$ from $D$ to $d$ but also to determine explicitly and precisely how the mapped feature $z = f(x)$ is distributed within the feature space $\mathbb{R}^d$, in terms of both its support and density. Moreover, we want to establish an explicit map $g(\cdot)$ from this distribution of the feature $z$ back to the data space such that the distribution of its image $\hat{x} = g(z)$ (closely) matches that of $x$. To differentiate this from finding arbitrary feature embeddings (as most existing methods do), we refer to embeddings of data onto an explicit family of models (structures or distributions) in the feature space as data transcription.
Paper Outline. This work is to show how such transcription can be achieved for real-world visual data with one important family of models: the linear discriminative representation (LDR) introduced by [2]. Before we formally introduce our approach in Section 2, for the remainder of this section, we first discuss two existing approaches, namely autoencoding and GAN, that are closely related to ours. As these approaches are rather popular and known to the readers, we will mainly point out some of their main conceptual and practical limitations that have motivated this work. Although our objective and framework will be mathematically formulated, the main purpose of this work is to verify the effectiveness of this new approach empirically through extensive experimentation, organized and presented in Section 3 and Appendix A. Our work presents compelling evidence that the closed-loop data transcription problem and our rate-reduction-based formulation deserve serious attention from the information-theoretical and mathematical communities. This has raised many exciting and open theoretical problems or hypotheses about learning, representing, and generating distributions or manifolds of high-dimensional real-world data. We discuss some open problems in Section 4 and new directions in Section 5. Source code can be found at https://github.com/Delay-Xili/LDR (accessed on 9 February 2022).

Learning Generative Models via Auto-Encoding or GAN
Auto-Encoding and its variants. In the machine-learning literature, roughly speaking, there have been two representative approaches to such a distribution-learning task. One is the classic "Auto Encoding" (AE) approach [3,4] that aims to simultaneously learn an encoding mapping $f$ from $x$ to $z$ and an (inverse) decoding mapping $g$ from $z$ back to $x$:
$$X \xrightarrow{\; f(x,\theta) \;} Z \xrightarrow{\; g(z,\eta) \;} \hat{X}. \tag{2}$$
Here, we use bold capital letters to indicate a matrix of finite samples $X = [x_1, \ldots, x_n] \in \mathbb{R}^{D \times n}$ of $x$ and their mapped features $Z = [z_1, \ldots, z_n] \in \mathbb{R}^{d \times n}$, respectively. Typically, one wishes for two properties: firstly, the decoded samples $\hat{X}$ are "similar" or close to the original $X$, say in terms of maximum likelihood $p(X)$; and secondly, the (empirical) distribution of the mapped samples $Z$, denoted as $\hat{p}(z|X)$, is close to a certain desired prior distribution $p(z)$, say some much lower-dimensional multivariate Gaussian. (The classical PCA can be viewed as a special case of this task. In fact, the original auto-encoding is precisely cast as nonlinear PCA [3], assuming the data lie on only one nonlinear submanifold $M$.)
However, it is typically very difficult, and often computationally intractable, to maximize the likelihood function $p(X)$ or to minimize a certain "distance", say the KL-divergence $D_{KL}(\hat{p}, p)$, between $\hat{p}(z|X)$ and $p(z)$. Except for simple distributions such as Gaussians, the KL-divergence usually does not have a closed form, even for a mixture of Gaussians. The likelihood and the KL-divergence become ill-conditioned when the supports of the distributions are low-dimensional (i.e., degenerate) and not overlapping (which is almost always the case in practice when dealing with distributions of high-dimensional data in high-dimensional spaces). So in practice, one typically chooses to minimize instead certain approximate bounds or surrogates derived with various simplifying assumptions on the distributions involved, as is the case in variational auto-encoding (VAE) [5,6]. As a result, even after learning, the precise posterior distribution $\hat{p}(z|X)$ remains unclear or hidden inside the feature space.
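The ill-conditioning mentioned above can be seen already in the one case where the KL-divergence does have a closed form. The following is a minimal sketch, not part of the original experiments: it evaluates the standard closed-form KL-divergence between two Gaussians in $\mathbb{R}^2$ that share the same covariance but whose means differ by a small offset across the thin direction. The offset 0.1, the list of widths, and the helper name `gaussian_kl` are illustrative assumptions.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) ) for non-degenerate covariances."""
    k = mu0.shape[0]
    diff = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))   # tr(cov1^{-1} cov0)
    maha_term = diff @ np.linalg.solve(cov1, diff)        # Mahalanobis distance of the means
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - k + logdet1 - logdet0)

# Two Gaussians concentrated near parallel lines that are 0.1 apart (assumed toy setup).
offset = np.array([0.0, 0.1])
kl_values = []
for s in [1.0, 0.1, 0.01, 0.001]:          # width of the support in the thin direction
    cov = np.diag([1.0, s**2])
    kl_values.append(gaussian_kl(np.zeros(2), cov, offset, cov))

print(kl_values)  # grows like 0.005 / s**2 as the supports become degenerate
```

In this toy setting the divergence scales as $0.005/s^2$, so it diverges as the two supports collapse onto nearly disjoint lines, even though the two distributions differ only by a small shift.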
In this work, we will show that if we impose specific requirements on the (distribution of) learned feature z to be a mixture of subspace-like Gaussians, a natural closed-form distance can be introduced for such distributions based on rate distortion from the information theory. In addition, the optimal solution to the feature representation within this family can be learned directly from the data without specifying any target p(z) in advance, which is particularly difficult in practice when the distribution of a mixed dataset is multi-modal and each component may have a different dimension.
GAN and its variants. Compared to measuring distribution distance in the (often controlled) feature space $z$, a much more challenging issue with the above auto-encoding approach is how to effectively measure the distance between the decoded samples $\hat{X}$ and the original $X$ in the data space $x$. For instance, for visual data such as images, their distributions $p(X)$ or generative models $p(X|z)$ are often not known. Despite extensive studies in the computer vision and image processing literature [7], it remains elusive to find a good measure for similarity of real images that is both efficient to compute and effective in capturing the visual quality and semantic information of the images equally well. Precisely due to such difficulties, it was suggested early on by [8] that one may have to take a discriminative approach to learn the distribution or a generative model for visual data. More recently, Generative Adversarial Nets (GAN) [9] offered an ingenious idea to alleviate this difficulty by utilizing a powerful discriminator $d$, usually modeled and learned by a deep network, to discern differences between the generated samples $\hat{X}$ and the real ones $X$:
$$Z \xrightarrow{\; g(z,\eta) \;} \hat{X}, \quad X \xrightarrow{\; d(x,\theta) \;} \{0, 1\}. \tag{3}$$
To a large extent, such a discriminator plays the role of minimizing a certain distributional distance, e.g., the Jensen-Shannon divergence, between the data $X$ and $\hat{X}$. Compared to the KL-divergence, the JS-divergence is well-defined even if the supports of the two distributions are non-overlapping. (However, the JS-divergence does not have a closed-form expression even between two Gaussians, whereas the KL-divergence does.) Nevertheless, as shown in [10], since the data distributions are low-dimensional, the JS-divergence can be highly ill-conditioned to optimize. (This may explain why many additional heuristics are typically used in many subsequent variants of GAN.) So, instead, one may choose to replace the JS-divergence with the earth mover's distance or the Wasserstein distance.
However, both the JS-divergence and the W-distance can only be approximately computed between two general distributions. (For instance, the W-distance requires one to compute the maximal difference between expectations of the two distributions over all 1-Lipschitz functions.) Furthermore, neither the JS-divergence nor the W-distance has a closed-form formula, even for Gaussian distributions. (The ($\ell_1$-norm) W-distance can be bounded by the ($\ell_2$-norm) W2-distance, which has a closed form [11]. However, as is well known in high-dimensional geometry, the $\ell_1$-norm and the $\ell_2$-norm deviate significantly in terms of their geometric and statistical properties as the dimension becomes high [12], so the bound can become very loose.) Yet, from a data representation perspective, subspace-like Gaussians (e.g., PCA) or a mixture of them are the most desirable family of distributions that we wish our features to follow. This would make all subsequent tasks (generative or discriminative) much easier. In this work, we will show how to achieve this with a different fundamental metric, known as the rate reduction, introduced by [13].
The original GAN aims to directly learn a mapping $g(\cdot)$, called a generator, from a standard distribution (say, a low-dimensional Gaussian random field) to the real (visual) data distribution in a high-dimensional space. However, distributions of real-world data can be rather sophisticated and often contain multiple classes and multiple factors in each class [14]. This makes learning the mapping $g$ rather challenging in practice, suffering difficulties such as mode-collapse [15]. As a result, many variants of GAN have been subsequently developed in order to improve the stability and performance in learning multiple modes and disentangling different factors in the data distribution, such as Conditional GAN [16–20], InfoGAN [21,22], or Implicit Maximum Likelihood Estimation (IMLE) [23,24]. In particular, to learn a generator for multi-class data, the prevalent conditional GAN literature requires label information as conditional inputs [16,25–27]. Recently, [28,29] have proposed training a k-class GAN by generalizing the two-class cross entropy to a (k + 1)-class cross entropy. In this work, we will introduce a more refined 2k-class measure for the k real and k generated classes. In addition, to avoid features for each class collapsing to a singleton [30], instead of cross entropy, we will use the so-called rate-reduction measure that promotes multi-mode and multi-dimension in the learned features [13]. One may view the rate reduction as a metric distance that has closed-form formulae for a mixture of (subspace-like) Gaussians, whereas neither the JS-divergence nor the W-distance can be computed in closed form (even between two Gaussians).
Another line of research is about how to stabilize the training of GAN. SN-GAN [31] has shown that spectral normalization on the discriminator is rather effective, which we will adopt in our work, although our formulation is not so sensitive to such choice designed for GAN (see ablation study in Appendix A.9). PacGAN [32] shows that the training stability can be significantly improved by packing a pair of real and generated images together for the discriminator. Inspired by this work, we show how to generalize such an idea to discriminating an arbitrary number of pairs of real and decoded samples without concatenating the samples. Our results in this work will even suggest that the larger the batch size discriminated, the merrier (see ablation study in Appendix A.10). In addition, ref. [29] has shown that optimizing the latent features leads to state-of-the-art visual quality. Their method is based on the deep compressed sensing GAN [28]. Hence, there are strong reasons to believe that their method essentially utilizes the compressed sensing principle [12] to implicitly exploit the low-dimensionality of the feature distribution. Our framework will explicitly expose and exploit such low-dimensional structures on the learned feature distribution.
Combination of AE and GAN. Although AE (VAE) and GAN originated with somewhat different motivations, they have evolved into popular and effective frameworks for learning and modeling complex distributions of many real-world data such as images. (In fact, in some idealistic settings, it can be shown that AE and GAN are actually equivalent: for instance, in the LQG setting, the authors of [33] have shown that GAN coincides with the classic PCA, which is precisely the solution to auto-encoding in the linear case.) Many recent efforts tend to combine both auto-encoding and GAN to produce more powerful generative frameworks for more diverse data sets, such as [15,34–42]. As we will see, in our framework, AE and GAN can be naturally interpreted as two different segments of a closed-loop data transcription process. However, unlike GAN or AE (VAE), the "origin" or "target" distribution of the feature $z$ will no longer be specified a priori, and is instead learned from the data $x$. In addition, this intrinsically low-dimensional distribution of $z$ (with all of its low-dimensional supports) is explicitly modeled as a mixture of orthogonal subspaces (or independent Gaussians) within the feature space $\mathbb{R}^d$, sometimes known as the principal subspaces.
Universality of Representations. Note that GANs (and most VAEs) are typically designed without explicit modeling assumptions on the distribution of the data or of the features. Many even believe that it is this "universal" distribution learning capability (assuming that minimizing distances between arbitrary distributions in high-dimensional space can be solved efficiently, which unfortunately has many caveats and is often impractical) that is responsible for their empirical success in learning distributions of complicated data such as images. In this work, we will provide empirical evidence that such an "arbitrary distribution learning machine" might not be necessary. (In fact, it may be computationally intractable in general.) A controlled and deformed family of low-dimensional linear subspaces (Gaussians) can be more than powerful and expressive enough to model real-world visual data. (In fact, a Gaussian mixture model is already a universal approximator of almost arbitrary densities [43]. Hence, we do not lose any generality at all.) As we will also see, once we can place a proper and precise metric on such models, the associated learning problems can become much better conditioned and more amenable to rigorous analysis and performance guarantees in the future.

Learning Linear Discriminative Representation via Rate Reduction
Recently, the authors in [2] proposed a new objective for deep learning that aims to learn a linear discriminative representation (LDR) for multi-class data. The basic idea is to map distributions of real data, potentially on multiple nonlinear submanifolds $\cup_{j=1}^{k} M_j \subset \mathbb{R}^D$, to a family of canonical models consisting of multiple independent (or orthogonal) linear subspaces, denoted as $\cup_{j=1}^{k} S_j \subset \mathbb{R}^d$. (In classical statistical settings, such nonlinear structures of the data were also referred to as principal curves or surfaces [44,45]. There has been a long quest to extend PCA to handle potential nonlinear low-dimensional structures in data distributions; see [46] for a thorough survey.) To some extent, this generalizes the classic nonlinear PCA [3] to more general/realistic settings where we simultaneously apply multiple nonlinear PCAs to data on multiple nonlinear submanifolds. Or equivalently, the problem can also be viewed as a nonlinear extension to the classic Generalized PCA (GPCA) [46]. (Conventionally, "generalized PCA" refers to generalizing the setting of PCA to multiple linear subspaces. Here, we need to further generalize to multiple nonlinear submanifolds.) Unlike conventional discriminative methods that only aim to predict class labels as one-hot vectors, the LDR aims to learn the likely multi-dimensional distribution of the data; hence, it is suitable for both discriminative and generative purposes. It has been shown that this can be achieved via maximizing the so-called "rate reduction" objective based on the rate distortion of subspace-like Gaussians [47].
LDR via MCR$^2$. More precisely, consider a set of data samples $X = [x_1, \ldots, x_n] \in \mathbb{R}^{D \times n}$ from $k$ different classes. That is, we have $X = \cup_{j=1}^{k} X_j$ with each subset of samples $X_j$ belonging to one of the low-dimensional submanifolds: $X_j \subset M_j$, $j = 1, \ldots, k$.
Following the notation in [2], we use a diagonal matrix $\Pi^j$ with $\Pi^j(i,i) = 1$ to denote the membership of sample $i$ belonging to class $j$ (and $\Pi^j(i,i) = 0$ otherwise). One seeks a continuous mapping $f(\cdot, \theta): x \mapsto z$ from $X$ to an optimal representation $Z = [z_1, \ldots, z_n] \in \mathbb{R}^{d \times n}$:
$$X \xrightarrow{\; f(x,\theta) \;} Z, \tag{4}$$
which maximizes the following coding rate-reduction objective, known as the MCR$^2$ principle [13]:
$$\Delta R(Z \mid \Pi, \epsilon) \doteq \underbrace{\frac{1}{2} \log\det\big(I + \alpha Z Z^{*}\big)}_{R(Z \mid \epsilon)} - \underbrace{\sum_{j=1}^{k} \frac{\gamma_j}{2} \log\det\big(I + \alpha_j Z \Pi^j Z^{*}\big)}_{R_c(Z \mid \Pi, \epsilon)}, \tag{5}$$
where $\alpha = \frac{d}{n\epsilon^2}$, $\alpha_j = \frac{d}{\operatorname{tr}(\Pi^j)\epsilon^2}$, and $\gamma_j = \frac{\operatorname{tr}(\Pi^j)}{n}$ for $j = 1, \ldots, k$. In this paper, for simplicity, we denote $\Delta R(Z \mid \Pi, \epsilon)$ as $\Delta R(Z)$, assuming $\Pi$ and $\epsilon$ are known and fixed. The first term $R(Z \mid \epsilon)$, or $R(Z)$ for short, is the coding rate of the whole feature set $Z$ (coded as a Gaussian source) with a prescribed precision $\epsilon$; the second term $R_c(Z \mid \Pi, \epsilon)$, or simply $R_c(Z)$, is the average coding rate of the $k$ subsets of features $Z_j = f(X_j)$ (each coded as a Gaussian). As has been shown by [13], maximizing the difference between the two terms will expand the whole feature set while compressing and linearizing the features of each of the $k$ classes. If the mapping $f$ maximizes the rate reduction, it maps the features of different classes into independent (orthogonal) subspaces in $\mathbb{R}^d$. Figure 1 illustrates a simple example of data with $k = 2$ classes (on two submanifolds) mapped to two incoherent subspaces (solid black lines). Notice that, compared to AE (2) and GAN (3), the above mapping (4) is only one-sided: from the data $X$ to the feature $Z$. In this work, we will see how to use the rate-reduction metric to establish the inverse mapping from the feature $Z$ back to the data $X$, while still preserving the subspace structures in the feature space.
The encoder $f$ has dual roles: it learns an LDR $z$ for the data $x$ via maximizing the rate reduction of $z$, and it is also a "feedback sensor" for any discrepancy between the data $x$ and the decoded $\hat{x}$. The decoder $g$ also has dual roles: it is a "controller" that corrects the discrepancy between $x$ and $\hat{x}$, and it also aims to minimize the overall coding rate for the learned LDR.
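To make the rate-reduction objective concrete, the following is a minimal NumPy sketch of the coding rate and the rate reduction for a labeled feature matrix, written under the standard MCR$^2$ definitions (coding rate $\frac{1}{2}\log\det(I + \frac{d}{n\epsilon^2} Z Z^{*})$, class terms weighted by class frequency). The function names, the choice $\epsilon = 0.5$, and the toy data are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z | eps) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for features Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Delta R(Z) = R(Z) - sum_j (n_j / n) R(Z_j), with Z_j the columns of class j."""
    n = Z.shape[1]
    within = 0.0
    for j in np.unique(labels):
        Zj = Z[:, labels == j]
        within += (Zj.shape[1] / n) * coding_rate(Zj, eps)
    return coding_rate(Z, eps) - within

def unit_columns(M):
    """Scale every feature vector (column) to unit norm."""
    return M / np.linalg.norm(M, axis=0, keepdims=True)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)

# Case A: the two classes occupy orthogonal 2-D subspaces of R^4.
A = np.zeros((4, 100))
A[:2, :50] = rng.standard_normal((2, 50))
A[2:, 50:] = rng.standard_normal((2, 50))

# Case B: both classes are drawn from the same 2-D subspace.
B = np.zeros((4, 100))
B[:2, :] = rng.standard_normal((2, 100))

dr_orth = rate_reduction(unit_columns(A), labels)
dr_same = rate_reduction(unit_columns(B), labels)
print(dr_orth, dr_same)  # the orthogonal arrangement yields the larger rate reduction
```

In this toy comparison, features arranged on orthogonal subspaces attain a clearly larger rate reduction than features of both classes sharing one subspace, which is the behavior the objective is designed to reward.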

Closed-Loop Transcription to an LDR (CTRL)
One issue with this one-sided LDR learning (4) is that maximizing the above objective (5) tends to expand the dimension of the learned subspace for features in each class. (If the dimension of the feature space $d$ is too high, maximizing the rate reduction may overestimate the dimension of each class. Hence, to learn a good representation, one needs to pre-select a proper dimension for the feature space, as was done in the experiments in [13]. In fact, the same "model selection" problem persists even in the simplest single-subspace case, which is the classic PCA [48]. Selecting the correct number of principal components in a heterogeneous noisy situation remains an active research topic [49].) To verify that the learned features neither over-estimate nor under-estimate the data structure, we may consider learning a decoder $g(\cdot, \eta): z \mapsto x$ from the representation $Z = f(X, \theta)$ back to the data space $x$: $\hat{X} = g(Z, \eta)$, and check how close $X$ and $\hat{X}$ are or how close their features $Z$ and $\hat{Z} = f(\hat{X}, \theta)$ are. In principle, the decoder $g$ should examine whether all the features learned by the encoder $f$ are both necessary and sufficient for achieving this task. The overall pipeline can be illustrated by the following "closed-loop" diagram:
$$X \xrightarrow{\; f(x,\theta) \;} Z \xrightarrow{\; g(z,\eta) \;} \hat{X} \xrightarrow{\; f(x,\theta) \;} \hat{Z}, \tag{6}$$
where the overall model has parameters $\Theta = \{\theta, \eta\}$.
Notice that in the above process, the segment from $X$ to $\hat{X}$ resembles a typical Auto-Encoding process, although, as we will soon see, our MCR$^2$-based encoder $f$ plays an additional role as a discriminator. The segment from $Z$ to $\hat{Z}$ resembles the typical GAN process, although, in our context, the distribution of the latent variable $z$ will be learned from the data $x$. Despite these connections, as we will soon see, this new closed-loop formulation will allow us to utilize the error feedback mechanism (widely practiced in control systems) and directly enforce loop consistency between encoding and decoding (networks) without using any additional discriminator(s) that are typically needed in existing VAE/GAN architectures.
Here, in the specific context of rate reduction, we name this special auto-encoding process "Transcription to an LDR" since the maximal rate-reduction principle explicitly transcribes the data $X$, via $f$, to features $Z$ on a linear discriminative representation (LDR), which can be subsequently decoded back to the data space $\hat{X}$, via $g$. (Our extensive experiments on diverse real-world visual datasets suggest that one does not lose any generality or expressiveness by restricting to this special but rich class of models. On the contrary, the restriction significantly simplifies and improves the learning process.) Hence, the encoding and decoding maps $f$ and $g$ together form a "closed-loop" process, as illustrated in Figure 1. We hope that this closed-loop transcription to an LDR (CTRL) has the following good properties:
• Injectivity: the generated $\hat{x} = g(f(x, \theta), \eta) \in \hat{X}$ should be as close as possible to (ideally the same as) the original data $x \in X$, in terms of certain measures of similarity or distance.
• Surjectivity: for all mapped images $z = f(x) \in Z$ of the training data $x \in X$, there are decoded samples $\hat{z} = f(g(z, \eta), \theta) \in \hat{Z}$ close to (ideally the same as) $z$.
Mathematically, we seek an embedding of the data $x$ supported on certain nonlinear submanifolds $\cup_{j=1}^{k} M_j$ in the space $\mathbb{R}^D$ to features $z$ on a set of (discriminative) linear subspaces $\cup_{j=1}^{k} S_j$ in the feature space $\mathbb{R}^d$. Ideally, both $f$ and $g$ should be embeddings [1] when restricted to the support of the data distribution or that of the features. (That is, we hope $f|_{M_j}$ and $g|_{S_j}$ are embeddings for all $j = 1, \ldots, k$.) In addition, more ideally, we hope $f$ and $g$ are mutually inverse embeddings: $g \circ f = \mathrm{Id}$ (when restricted to the submanifolds). Nevertheless, if we are only interested in learning the distribution, embeddings of the support would often suffice for the purposes at hand (e.g., classification or generation). Notice that the above goals are similar to those of many VAE+GAN-related methods in the machine-learning literature, such as BiGAN [38] and ALI [39]. We will discuss the differences between our approach and these existing methods in Section 2.3 (as well as providing some experimental comparisons in Appendix A).
At first sight, this is a rather daunting task, since we are trying to learn over a (seemingly infinite-dimensional) functional space of all embeddings and distributions from finite samples. In this work, we will take a more pragmatic approach and show how one can learn a good encoding, decoding, and representation tuple: ( f , g, z) from X via tractable computational means. In particular, we will convert the above goals to certain feasible programs that optimize a sensible measure of goodness for the learned representations Z.

Measuring Distances in the Feature Space and Data Space
Contractive measure for the decoder. For the second item in the above wishlist, as the representations in the feature space $z$ are by design linear subspaces or (degenerate) Gaussians, we have geometrically or statistically meaningful metrics for both samples and distributions in the feature space $z$. For example, we care about the distance between the distributions of the features of the original data $Z$ and of the transcribed $\hat{Z}$. Since the features of each class, $Z_j$ and $\hat{Z}_j$, are similar to subspaces/Gaussians, their "distance" can be measured by the rate reduction, with (5) restricted to two sets of equal size:
$$\Delta R(Z_j, \hat{Z}_j) \doteq R(Z_j \cup \hat{Z}_j) - \frac{1}{2}\big(R(Z_j) + R(\hat{Z}_j)\big). \tag{7}$$
According to the interpretation of the rate reduction given in [13], the above quantity precisely measures the volume of the space between $Z_j$ and $\hat{Z}_j$, illustrated as a pair of black and blue lines in Figure 1. Then, for the "distance" of all, say $k$, classes, we simply sum the rate reduction for all pairs:
$$d(Z, \hat{Z}) \doteq \min_{\eta} \sum_{j=1}^{k} \Delta R(Z_j, \hat{Z}_j) = \min_{\eta} \sum_{j=1}^{k} \Delta R\big(Z_j, f(g(Z_j, \eta), \theta)\big), \tag{8}$$
where $Z_j = f(X_j, \theta)$ and $\hat{Z}_j = f(\hat{X}_j, \theta)$ with $\hat{X}_j = g(Z_j, \eta)$. Obviously, a main goal of the learned decoder $g(\cdot, \eta)$ is to minimize the distance between these distributions. Notice that if the encoder $f$ preserves (i.e., is injective on) the intrinsic structures of the original data $X$ (this is typically the case for MCR$^2$-based feature representations [13]), this criterion essentially aims to ensure that there will be some decoded sample $\hat{x}$ close to every data sample $x$; hence the decoder $g$ should be "surjective". According to the ideas of IMLE [23], such a requirement could effectively help to avoid mode-collapsing or mode-dropping.
Contrastive measure for the encoder. For the first item in our wishlist, however, we normally do not have a natural metric or "distance" for similarity of samples or distributions in the original data space $x$ for data such as images. As mentioned before, finding proper metrics or distance functions on natural images has always been an elusive and challenging task [7].
To alleviate this difficulty, we can measure the similarity or difference between $\hat{X}$ and $X$ through their mapped features $\hat{Z}$ and $Z$ in the feature space (again assuming $f$ is structure-preserving). If we are interested in discerning any differences in the distributions of the original and transcribed samples, we may view the MCR$^2$ feature encoder $f(\cdot, \theta)$ as a "discriminator" to magnify any difference between all pairs of $X_j$ and $\hat{X}_j$, by simply maximizing, instead of minimizing, the same quantity in (8):
$$d(X, \hat{X}) \doteq \max_{\theta} \sum_{j=1}^{k} \Delta R(Z_j, \hat{Z}_j) = \max_{\theta} \sum_{j=1}^{k} \Delta R\big(f(X_j, \theta), f(\hat{X}_j, \theta)\big). \tag{9}$$
That is, a "distance" between $X$ and $\hat{X}$ can be measured as the maximally achievable rate reduction between all pairs of classes in these two sets. In a way, this measures how well or badly the decoded $\hat{X}$ aligns with the original data $X$, hence measuring the goodness of "injectivity" of the encoder $f$. Notice that such a discriminative measure is consistent with the idea of GAN [9] that tries to separate $X$ and $\hat{X}$ into two classes, measured by the cross entropy. Nevertheless, here the MCR$^2$-based discriminator $f$ naturally generalizes to cases when the data distributions are multi-class and multi-modal, and the discriminativeness is measured with a more refined measure, the rate reduction, instead of the typical two-class loss (e.g., cross entropy) used in GANs. See Appendix A.8 for comparisons with some ablation studies.
One may wonder why we need the mapping $f(\cdot, \theta)$ to function as a discriminator between $X$ and $\hat{X}$ by maximizing $\max_{\theta} \Delta R\big(f(X, \theta), f(\hat{X}, \theta)\big)$. Figure 2 gives a simple illustration: there might be many decoders $g$ such that $f \circ g$ is an identity (Id) mapping, i.e., $f \circ g(z) = z$ for all $z$ in the subspace $S_z$ in the feature space. (Here, we use the notion of "identity mapping" in a loose sense: depending on the context, it could simply mean an embedding from $S_z$ to $S_z$.) However, $g \circ f$ is not necessarily an auto-encoding map for $x$ in the original distribution $S_x$ (here for simplicity drawn as a subspace). That is, $g \circ f(S_x) \not\subset S_x$ or even $g \circ f(S_x) \cap S_x = \emptyset$. One should expect that, without careful control of the image of $g$, this would be the case with high probability, especially when the support of the distribution of $x$ is extremely low-dimensional in the original high-dimensional data space. For example, as we will see in the experiments, the intrinsic dimension of the submanifold associated with each image category is about a dozen, whereas images are embedded in a (pixel) space of thousands or tens of thousands of dimensions.
Remark: representing the encoding and decoding mappings. Some practical questions arise immediately: how rich should the families of functions be that we consider for the encoder $f$ and decoder $g$ in order to optimize the above rate-reduction-type objectives? In fact, similar questions exist for the formulation of GAN, regarding the realizability of the data distribution by the generator; see [50]. Conceptually, here we know that the encoder $f$ needs to be rich enough to discriminate (small) deviations from the true data support $M_j$, while the decoder $g$ needs to be expressive enough to generate the data distribution from the learned mixture of subspace-Gaussians. How should we represent or parameterize them so that our objectives become computable and optimizable? For the most general cases, these remain widely open and challenging mathematical and computational problems.
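The pairwise rate-reduction "distance" between a class of features and its transcribed counterpart can be sketched numerically as follows. This is a minimal illustration assuming two feature sets of equal size and the usual coding-rate formula; the function names, $\epsilon = 0.5$, and the toy features are assumptions introduced here for illustration.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z | eps) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for features Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def pair_distance(Zj, Zj_hat, eps=0.5):
    """Delta R(Z_j, Zhat_j) = R(Z_j U Zhat_j) - 1/2 (R(Z_j) + R(Zhat_j)), equal-size sets."""
    union = np.concatenate([Zj, Zj_hat], axis=1)
    return coding_rate(union, eps) - 0.5 * (coding_rate(Zj, eps) + coding_rate(Zj_hat, eps))

def closed_loop_distance(Z, Z_hat, labels, eps=0.5):
    """Sum of the pairwise distances over all classes."""
    return sum(pair_distance(Z[:, labels == j], Z_hat[:, labels == j], eps)
               for j in np.unique(labels))

rng = np.random.default_rng(0)
Zj = np.zeros((4, 50))
Zj[:2] = rng.standard_normal((2, 50))          # class features in span(e1, e2)
Zj /= np.linalg.norm(Zj, axis=0, keepdims=True)

Zj_off = np.roll(Zj, 2, axis=0)                # same features moved to span(e3, e4)

d_same = pair_distance(Zj, Zj)                 # transcription reproduces the features
d_off = pair_distance(Zj, Zj_off)              # transcription lands in another subspace
print(d_same, d_off)
```

When the transcribed features coincide with the originals, the measure vanishes (up to floating-point error), and it becomes strictly positive once the transcribed features drift into a different subspace. The decoder seeks to shrink this quantity, while the encoder seeks to enlarge it.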
As we mentioned earlier, in this work, we will take a more pragmatic approach by simply representing these mappings with popular neural networks that have empirically proven to be good at approximating distributions of practical (visual) datasets or for achieving the maximum of the rate-reduction-type objectives [13]. Nevertheless, our experiments indicate that our formulation and objectives are not so sensitive to particular choices in network structures or many of the tricks used to train them. In addition, in the special cases when the real data distribution is benignly deformed from an LDR, the work of [2] has shown that one can explicitly construct these mappings from the rate-reduction objectives in the form of a deep network known as ReduNet. However, it remains unclear how such constructions could be generalized to closed-loop settings. Regardless, answers to these questions are beyond the scope of this work, as our purposes here are mainly to empirically verify the validity of the proposed closed-loop data transcription framework.
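The earlier observation that $f \circ g$ can be an identity on the feature subspace while $g \circ f$ fails to reproduce the data can be checked on a hand-built linear example. The matrices `F` and `G` below are hypothetical maps chosen purely for illustration (a data line $S_x = \mathrm{span}\{e_1\}$ in $\mathbb{R}^2$ and a one-dimensional feature space); they are not learned networks.

```python
import numpy as np

F = np.array([[1.0, 0.0]])        # encoder f: R^2 -> R^1, reads the first coordinate
G = np.array([[1.0], [1.0]])      # decoder g: R^1 -> R^2, maps z to (z, z)

# Samples on the data support S_x = span{e1}.
X = np.array([[-1.0, 0.5, 2.0],
              [ 0.0, 0.0, 0.0]])

Z = F @ X            # features of the data
X_hat = G @ Z        # decoded samples, lying on the diagonal line instead of S_x
Z_hat = F @ X_hat    # features of the decoded samples

feature_gap = np.abs(Z_hat - Z).max()   # f(g(z)) = z, so the features agree
data_gap = np.abs(X_hat - X).max()      # yet g(f(x)) differs from x on S_x
print(feature_gap, data_gap)
```

With this fixed encoder, the decoded samples leave the data support although their features match exactly, so a static $f$ cannot detect the discrepancy. This is the situation in which the encoder has to keep adapting as a discriminator so that deviations off the support become visible in the feature space.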

Encoding and Decoding as a Two-Player MiniMax Game
Comparing the contractive and contrastive nature of (8) and (9) on the same utility, we see that the encoder f(·, θ) and the decoder g(·, η) naturally take the roles of players in "a two-player game": while the encoder f tries to magnify the difference between the original data and their transcribed data, the decoder g aims to minimize the difference. Now for convenience, let us define the "closed-loop encoding" function:
$$h(x, \theta, \eta) \doteq f\big(g(f(x, \theta), \eta), \theta\big) : x \mapsto \hat{z}. \qquad (10)$$
Ideally, we want this function to be very close to f(x, θ), or at least the distributions of their images should be close. With this notation, combining (8) and (9), a closed-loop notion of "distance" between X and X̂ can be computed as an equilibrium point to the following Min-Max (or Max-Min) program for the same utility in terms of rate reduction (theoretically, there might be a significant difference between formulating and seeking the desired solution as the equilibrium point of a min-max or a max-min game; in practice, we do not see major differences, as we optimize the program by simply alternating between minimization and maximization, and we leave a more careful investigation to future work):
$$\mathcal{D}(X, \hat{X}) \doteq \min_{\eta} \max_{\theta} \; \Delta R\big(f(X, \theta), h(X, \theta, \eta)\big). \qquad (11)$$
Notice that this only measures the difference between (features of) the original data and its transcribed version. It does not measure how good the representation Z (or Ẑ) is for the multiple classes within X (or X̂). To this end, we may combine the above distance with the original MCR²-type objectives (5): namely, the rate reduction ΔR(Z) and ΔR(Ẑ) for the learned LDR Z for X and Ẑ for the decoded X̂. Notice that although the encoder f tries to maximize the multi-class rate reduction of the features Z of the data X, the decoder g should minimize the rate reduction of the multi-class features Ẑ of the decoded X̂. That is, the decoder g tries to use a minimal coding rate needed to achieve a good decoding quality.
Hence, the overall "multi-class" Min-Max program for learning the Closed-loop Transcription to an LDR, named CTRL-Multi, is
$$\min_{\eta} \max_{\theta} \; \Delta R\big(Z(\theta)\big) + \Delta R\big(\hat{Z}(\theta, \eta)\big) + \sum_{j=1}^{k} \Delta R\big(Z_j(\theta), \hat{Z}_j(\theta, \eta)\big). \qquad (12)$$
In this work, we only consider the simple case by adding these rate-reduction quantities together. Of course, in the future, one may consider other more delicate formulations. For instance, we may consider a Min-Max game on the third term (11) subject to certain constraints (upper or lower bounds) on the first term and the second term. Such constrained minimax games have also started to draw attention lately [51].
Empirically, we have evaluated the necessity of these terms in an ablation study (see Appendix A.8.3). Notice that, without the terms associated with the generative part h or with all such terms fixed as constant, the above objective is precisely the original MCR² objective proposed by [13]. In an unsupervised setting, if we view each sample (and its augmentations) as its own class, the above formulation remains exactly the same. The number of classes k is simply the number of independent samples. In addition, notice that the minimax objective function depends only on (features of) the data X, hence one can learn the encoder and decoder (parameters) without the need for sampling or matching any additional distribution (as typically needed in GANs or VAEs).
As a special case, if X only has one class, the above Min-Max program reduces (as the first two rate reduction terms automatically become zero) to a special "two-class" or "binary" form, named CTRL-Binary, between X and the decoded X̂ by viewing X and X̂ as two classes {0, 1}. Notice that this binary case resembles the formulation of the original GAN (3). Nevertheless, instead of using cross entropy, our formulation adopts a more refined rate-reduction measure, which has been shown to promote diversity in the learned representation [13].

CTRL-Binary:
$$\min_{\eta} \max_{\theta} \; \Delta R\big(Z(\theta), \hat{Z}(\theta, \eta)\big). \qquad (13)$$
Sometimes, even when X contains multiple classes/modes, one could still view all classes together as one class. Then, the above binary objective is to align the union distribution of all classes with that of their decoded X̂. This is typically a simpler task to achieve than the multi-class one (12), since it does not require learning a more refined multi-class CTRL for the data, as we will later see in experiments. Notice that one good characteristic of the above formulation is that all quantities in the objectives are measured in terms of rate reduction for the learned features (assuming features eventually become subspace Gaussians).
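To make the rate-reduction "distance" between two feature sets concrete, the following is a minimal NumPy sketch. It assumes the coding-rate definition commonly used in the MCR² literature, R(Z) = ½ log det(I + d/(nε²) ZZᵀ) for features Z of shape (d, n), which is not restated in this section. The function names, the value ε² = 0.5, and the toy data are illustrative choices, not the authors' implementation.

```python
import numpy as np

def coding_rate(Z, eps_sq=0.5):
    """R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T) for Z of shape (d, n)."""
    d, n = Z.shape
    scaled_cov = (d / (n * eps_sq)) * (Z @ Z.T)
    # slogdet avoids the overflow/underflow that log(det(.)) can suffer from.
    _, logdet = np.linalg.slogdet(np.eye(d) + scaled_cov)
    return 0.5 * logdet

def rate_reduction_distance(Z, Z_hat, eps_sq=0.5):
    """Delta R(Z, Z_hat): rate of the union minus the size-weighted rates of the parts."""
    n, n_hat = Z.shape[1], Z_hat.shape[1]
    union = np.concatenate([Z, Z_hat], axis=1)
    return (coding_rate(union, eps_sq)
            - n / (n + n_hat) * coding_rate(Z, eps_sq)
            - n_hat / (n + n_hat) * coding_rate(Z_hat, eps_sq))

def unit_columns(M):
    """Scale every column (feature vector) to unit Euclidean norm."""
    return M / np.linalg.norm(M, axis=0, keepdims=True)

rng = np.random.default_rng(0)
d, n = 8, 200
# Hypothetical features spread over all d dimensions.
Z = unit_columns(rng.standard_normal((d, n)))
# Hypothetical features confined to a 2-dimensional coordinate subspace.
Z_other = np.zeros((d, n))
Z_other[:2] = rng.standard_normal((2, n))
Z_other = unit_columns(Z_other)

dist_same = rate_reduction_distance(Z, Z.copy())
dist_other = rate_reduction_distance(Z, Z_other)
print(f"identical features:       {dist_same:.6f}")
print(f"different second moments: {dist_other:.6f}")
```

In this toy setting the distance vanishes (up to floating-point error) when the two feature sets coincide and is strictly positive when their second moments differ, which is the behavior the binary objective relies on.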
In all of our subsequent experiments, we solve the above minimax programs using the most basic gradient descent-ascent (GDA) algorithm [52] that alternates between the minimization and maximization, with the same learning rate and without any timescale separation (as typically needed for training GANs [53]). Although more refined optimization schemes can likely further improve the efficiency and performance, we leave these for future investigations.
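As a rough illustration of alternating gradient descent-ascent with a single shared learning rate, consider the toy sketch below. The quadratic utility is a hypothetical stand-in for the rate-reduction objective (convex in the minimizer's scalar parameter eta, concave in the maximizer's scalar parameter theta); the update order and step size are assumptions made for the example, not details taken from the paper.

```python
def utility(theta, eta):
    """Toy utility: concave in theta (maximizer), convex in eta (minimizer)."""
    return 0.5 * eta**2 + 2.0 * eta * theta - 0.5 * theta**2

def grad_eta(theta, eta):
    """Partial derivative of the toy utility with respect to eta."""
    return eta + 2.0 * theta

def grad_theta(theta, eta):
    """Partial derivative of the toy utility with respect to theta."""
    return 2.0 * eta - theta

def alternating_gda(theta, eta, lr=0.1, steps=500):
    """Alternate one descent step (eta) and one ascent step (theta), same learning rate."""
    for _ in range(steps):
        eta = eta - lr * grad_eta(theta, eta)        # the minimizer moves downhill
        theta = theta + lr * grad_theta(theta, eta)  # the maximizer moves uphill
    return theta, eta

theta_star, eta_star = alternating_gda(theta=1.0, eta=-2.0)
print(theta_star, eta_star, utility(theta_star, eta_star))
```

For this strongly convex-concave toy problem the iterates contract toward the unique saddle point at the origin. Nothing comparable is guaranteed for the nonconvex-nonconcave objectives (12) and (13), where the stability reported here is an empirical observation.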
Remark: closed-loop error correction. One may notice that our framework (see Figure 1) draws inspiration from closed-loop error correction widely practiced in feedback control systems. In the machine-learning and deep-learning literature, the idea of closed-loop error correction and closed-loop fixed point has been explored before to interpret the recursive error-correcting mechanism and explain stability in a forward (predictive) deep neural network, for example the deep equilibrium networks [54] and the deep implicit networks [55], again drawing inspiration from feedback control. Here, in our framework, the closed-loop mechanism is not used to interpret the encoding or decoding (forward) networks f and g. Instead, it is used to form an overall feedback system between the encoding and decoding networks for correcting the "error" in the distributions between the data x and the decoded x̂. Using terminology from control theory, one may view the encoding network f as a "sensor" for error feedback and the decoding network g as a "controller" for error correction. However, notice that here the "target" for control is neither a scalar nor a finite-dimensional vector, but a continuous mapping, in order for the distribution of x̂ to match that of the data x. This is in general a control problem in an infinite-dimensional space, since the space of diffeomorphisms of submanifolds is infinite-dimensional [1]. Ideally, we hope that when the sensor f and the controller g are optimal, the distribution of x becomes a "fixed point" for the closed loop while the distribution of z reaches a compact LDR. Hence, the minimax programs (12) and (13) can also be interpreted as games between an error-feedback sensor and an error-reducing controller.
Remark: relation to bi-directional or cycle consistency. The notion of "bi-directional" and "cycle" consistency between encoding and decoding has been exploited in the works of BiGAN [38] and ALI [39] for mappings between the data and features and in the work of CycleGAN [56] for mappings between two different data distributions. In our context, the goal is similar: to promote g ∘ f and f ∘ g to be close to identity mappings (either for the distributions or for the samples). Interestingly, our new closed-loop formulation actually "decouples" the data X, say, observed from the external world, from their internally represented features Z. The objectives (12) and (13) are functions of only the internal features Z(θ) and Ẑ(θ, η), which can be learned and optimized by adjusting the neural networks f(·, θ) and g(·, η) alone. There is no need for any additional external metrics or heuristics to promote how "close" the decoded images X̂ are to X. This is very different from most VAE/GAN-type methods such as BiGAN and ALI that require additional discriminators (networks) for the images and the features. Some experimental comparisons are given in Appendix A.2. In addition, in Appendix A.8.1, we provide an ablation study to illustrate the importance and benefit of a closed loop for enforcing the consistency between the encoder and decoder.
Remark: transparent versus hidden distribution of the learned features. Notice that in our framework, there is no need to explicitly specify a prior distribution either as a target distribution to map to for AE (2) or as an initial distribution to sample from for GAN (3). The common practice in AEs or GANs is to specify the prior distribution as a generic Gaussian. This is however particularly problematic when the data distribution is multi-modal and has multiple low-dimensional structures, which is commonplace for multi-class data. In this case, the common practice in AEs or GANs is to train a conditional GAN for different classes or different attributes. However, here we only need to assume the desired target distribution belonging to the family of LDRs. The specific optimal distribution of the features within this family is then learned from the data directly, and then can be represented explicitly as a mixture of independent subspace Gaussians (or equivalently, a mixture of PCAs on independent subspaces). We will give more details in the experimental Section 3 as well as more examples in Appendices A.2-A.4. Although many GAN + VAE-type methods can learn bidirectional encoding and decoding mappings, the distribution of the learned features inside the feature space remains hidden or even entangled. This makes it difficult to sample the feature space for generative purposes or to use the features for discriminative tasks. (For instance, typically one can only use so-learned features for nearest-neighbor-type classifiers [38], instead of nearest subspace as in this work, see Section 3.3).

Empirical Verification on Real-World Imagery Datasets
This experiment section serves three purposes: First, we empirically justify the proposed formulation for data transcription by demonstrating good properties of the learned encoder, decoder, and representation tuple ( f , g, z) from X. Second, we compare our method with several representative methods from the GAN family and VAE family. The purpose of the comparison is not to compete for any state-of-the-art performance. Instead, we want to convincingly verify the validity of the proposed framework and its potential in going beyond. Finally, we evaluate the so-learned CTRL through both generative tasks (controlled visualization) and discriminative (classification) tasks. More extensive experimental results, evaluations, and ablation studies can be found in the Appendix A.

Empirical Justification of CTRL Transcription
To empirically validate our new framework, we conduct experiments from a small low-variety dataset (MNIST), to a small dataset of diverse real-world objects (CIFAR-10), to higher resolution images (STL-10, CelebA, LSUN-bedroom), to a large-scale diverse image set (ImageNet). The results are evaluated both quantitatively and qualitatively. Implementation details, more experimental results, and ablation studies are given in Appendix A.
Comparison (IS and FID) with other formulations. First, we conduct five experiments to fairly compare our formulation with GAN [63] and VAE(-GAN) [64] on MNIST and CIFAR-10. Except for the objective function, everything else is exactly the same for all methods (e.g., networks, training data, optimization method). These experiments are: (1) GAN; (2) GAN with its objective replaced by that of the CTRL-Binary (13); (3) VAE-GAN; (4) Binary CTRL (13); and (5) Multi-class CTRL (12). Some visual comparison is given in Figure 3. IS [65] and FID [66] scores are summarized in Table 1. Here, for simplicity, we have chosen a uniform feature dimension d = 128 for all datasets. If we choose a higher feature dimension, say d = 512, for the more complex CIFAR-10 dataset, the visual quality can be further improved; see Table A14 in Appendix A.11.
As we see from Table 1, replacing the cross-entropy objective with the rate-reduction objective (13) can improve the generative quality. The two CTRL formulations are clearly on par with the others in terms of IS and significantly better in FID. Finally, with the same training datasets, the quality of CTRL-Multi is lower than that of CTRL-Binary. This is expected, as the multi-class task is more challenging. Nevertheless, as we will see soon, images decoded by CTRL-Multi align much better with their classes than those decoded by CTRL-Binary.
Visualizing correlation of features Z and decoded features Ẑ. We visualize the cosine similarity between Z and Ẑ learned from the multi-class objective (12) on MNIST, CIFAR-10 and ImageNet (10 classes), which indicates how close ẑ = f ∘ g(z) is to z. Results in Figure 4 show that Z and Ẑ are aligned very well within each class. The block-diagonal patterns for MNIST are sharper than those for CIFAR-10 and ImageNet, as images in CIFAR-10 and ImageNet have more diverse visual appearances.
Visualizing auto-encoding of the data X and the decoded X̂. We compare some representative X and X̂ on MNIST, CIFAR-10 and ImageNet (10 classes) to verify how close x̂ = g ∘ f(x) is to x. The results are shown in Figure 5, and visualizations are created from training samples. Visually, the auto-encoded x̂ faithfully captures major visual features from its respective training sample x, especially the pose, shape, and layout. For a simpler dataset such as MNIST, auto-encoded images are almost identical to the original. The visual quality is clearly better than other GAN+VAE-type methods, such as VAE-GAN [34] and BiGAN [38]. We refer the reader to Appendices A.2, A.4 and A.7 for more visualization of results on these datasets, including similar results on transformed MNIST digits. More visualization results for learned models on real-life image datasets such as STL-10, CelebA, and LSUN can be found in Appendices A.5 and A.6.
Table 2 gives a quantitative comparison of visual quality of our method with others on CIFAR-10, STL-10, and ImageNet. In general, there is a large difference in terms of FID and IS scores between the GAN family and the VAE family of models. SNGAN [31] is commonly used in most generative applications, while LOGAN [29] is the state-of-the-art method on ImageNet in terms of FID and IS. More comparisons with existing methods, including results on the higher-resolution ImageNet dataset, can be found in Table A10 of Appendix A.7.
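The correlation visualization between Z and Ẑ described in this subsection can be sketched as follows. This is a toy NumPy example under stated assumptions: the two-class features are synthetic, Ẑ is simulated as Z plus small noise rather than produced by a trained f ∘ g, and taking the absolute value of the cosine similarity is a choice made here for illustration.

```python
import numpy as np

def abs_cosine_similarity(Z, Z_hat):
    """Entry (a, b) is |cos| between column a of Z and column b of Z_hat."""
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    Zh = Z_hat / np.linalg.norm(Z_hat, axis=0, keepdims=True)
    return np.abs(Zn.T @ Zh)

rng = np.random.default_rng(0)
d, per_class = 6, 50
# Two synthetic classes living in orthogonal 3-dimensional coordinate subspaces.
Z = np.zeros((d, 2 * per_class))
Z[:3, :per_class] = rng.standard_normal((3, per_class))
Z[3:, per_class:] = rng.standard_normal((3, per_class))
# Stand-in for the closed-loop features: a small perturbation of Z.
Z_hat = Z + 0.01 * rng.standard_normal(Z.shape)

S = abs_cosine_similarity(Z, Z_hat)
within = S[:per_class, :per_class].mean()   # same-class block
across = S[:per_class, per_class:].mean()   # cross-class block
print(f"mean |cos| within class: {within:.3f}, across classes: {across:.3f}")
```

Plotting S as a heat map (samples sorted by class) would show the block-diagonal pattern: large values within each class block and values near zero across classes.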

Comparison to Existing Generative Methods
As we see, even if the rate-reduction objectives (12) and (13) are not specifically designed nor engineered for visual quality and the networks and hyper-parameters adopted in our experiments are rather basic compared to many of the state-of-the-art generative methods, our method is still rather competitive in terms of these metrics. In our current implementation, the original objectives are used without any other heuristics or regularization. The simplicity of our framework and formulation suggests that there is significant room for further improvement. For instance, in all experiments on all datasets, we have chosen a feature dimension of d = 128 for simplicity and uniformity. In the last Appendix A.11, we have conducted an ablation study on using a higher feature dimension d = 512. The visual quality of the learned model can be significantly improved (as shown in Figure A22 and Table A14 of Appendix A.11).
In fact, compared to these methods, our method has learned not just any generative model. It has learned a structured generative model that has many additional beneficial properties that we now present.
Table 2. Comparison on CIFAR-10 and STL-10. Comparison with more existing methods and on ImageNet can be found in Table A10 in Appendix A. ↑ means higher is better. ↓ means lower is better.

Benefits of the Learned LDR Transcription Model
As we have argued before, the learned LDR transcription model (including the feature z, the encoder f , and the decoder g) can be used for both generative and discriminative purposes. In particular, unlike almost all existing generative methods, the internal structures or distribution of the learned z are no longer "hidden" as they have clear subspace structures. Hence, we can easily derive an explicit (parametrizable) model for the distribution of the learned features as a mixture of independent subspace-like Gaussians. This gives us full control in sampling the learned distribution for generative purposes.
Principal subspaces and principal components for the feature. To be more specific, given the learned k-class features ∪_{j=1}^{k} Z_j for the training data, we have observed that the leading singular subspaces for different classes are all approximately orthogonal to each other: Z_i ⊥ Z_j for i ≠ j (see Figure 4). This corroborates our above discussion about the theoretical properties of the rate-reduction objective. They essentially span k independent principal subspaces. We can further calculate the mean z̄_j and the singular vectors {v_j^i}_{i=1}^{r_j} (or principal components) of the learned features Z_j for each class. Although we conceptually view the support of each class as a subspace, the actual support of the features is close to being on the sphere due to feature (scale) normalization. Hence, it is more precise to find its mean and its support centered around the mean. Here, r_j is a rank we may choose to model the dimension of each principal subspace (say, based on a common threshold on the singular values). Hence, we obtain an explicit model for how the feature z is distributed in each of the k principal subspaces in the feature space R^d:
$$z_j \sim \bar{z}_j + \sum_{i=1}^{r_j} n_j^i \sigma_j^i v_j^i, \quad \text{where } n_j^i \sim \mathcal{N}(0, 1), \qquad (14)$$
with σ_j^i the singular value associated with v_j^i. This essentially gives an explicit mixture of subspace-like Gaussians as the model for the learned features: statistical differences between different classes are modeled as k independent principal subspaces; statistical differences within each class j are modeled as r_j independent principal components in the jth subspace.
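The per-class model described above (a mean plus Gaussian coefficients along a few principal directions) can be sketched in NumPy as follows. The function names, the relative singular-value threshold of 0.05, the scaling of singular values by √n to obtain per-direction standard deviations, and the synthetic class features are all assumptions made for illustration.

```python
import numpy as np

def fit_subspace_gaussian(Zj, rel_threshold=0.05):
    """Fit mean, principal directions, and per-direction std for one class (Zj is d x n)."""
    mean = Zj.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(Zj - mean, full_matrices=False)
    # Rank r_j: keep directions whose singular value exceeds a common relative threshold.
    r = int(np.sum(s > rel_threshold * s[0]))
    std = s[:r] / np.sqrt(Zj.shape[1])  # singular values rescaled to standard deviations
    return mean, U[:, :r], std

def sample_features(mean, V, std, num, rng):
    """Draw features as mean + sum_i n_i * std_i * v_i with n_i ~ N(0, 1)."""
    coeffs = std[:, None] * rng.standard_normal((len(std), num))
    return mean + V @ coeffs

rng = np.random.default_rng(0)
d, n = 10, 500
# Synthetic class features: a 3-dimensional affine subspace with std 3, 2, 1 per direction.
basis, _ = np.linalg.qr(rng.standard_normal((d, 3)))
center = rng.standard_normal((d, 1))
Zj = center + basis @ (np.array([[3.0], [2.0], [1.0]]) * rng.standard_normal((3, n)))

mean, V, std = fit_subspace_gaussian(Zj)
samples = sample_features(mean, V, std, num=20, rng=rng)
# Sampled features stay inside the fitted affine subspace (zero residual off the subspace).
residual = (samples - mean) - V @ (V.T @ (samples - mean))
print(V.shape, std, np.abs(residual).max())
```

Decoding such samples with a trained decoder g would then give class-conditional generation; the decoder itself is outside the scope of this sketch.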
Decoding samples from the feature distribution. Using the CIFAR-10 and CelebA datasets, we visualize images decoded from samples of the learned feature subspace. For the CIFAR-10 dataset, for each class j, we first compute the top four principal components of the learned features Z_j (via SVD). For each class j, we then compute |⟨z_j^i, v_j^l⟩|, the cosine similarity between the l-th principal direction v_j^l and the feature sample z_j^i. After finding the top five z_j^i according to |⟨z_j^i, v_j^l⟩| for each class j, we reconstruct images x̂_j^i = g(z_j^i). Each row of Figure 6 is for one principal component. We observe that images in the same row share the same visual attributes; images in different rows differ significantly in visual characteristics such as shape, background, and style. See Figure A7 of Appendix A.4 for more visualization of principal components learned for all 10 classes of CIFAR-10. These results clearly demonstrate that the principal components in each subspace of the Gaussian disentangle different visual attributes. In addition, we do not observe any mode dropping for any of the classes, although the dimensions of the classes were not known a priori.

Disentangled visual attributes as principal components.
For the CelebA dataset, we calculate the principal components of all learned features in the latent space. Figure 7a shows some decoded images along these principal directions. Again, these principal components seem to clearly disentangle visual attributes/factors such as wearing a hat, changing hair color, and wearing glasses. More examples can be found in Appendix A.6. The results are consistent with the property of MCR² that promotes diversity of the learned features.
Figure 7b shows interpolating features between pairs of training image samples of the CelebA dataset, where for two training images x_1 and x_2, we reconstruct based on their linearly interpolated feature representations by x̂ = g(α f(x_1) + (1 − α) f(x_2)), α ∈ [0, 1]. The decoded images show continuous morphing from one sample to another in terms of visual characteristics, as opposed to merely a superposition of the two images. Similar interpolation results between two digits in the MNIST dataset can be found in Figure A3 of Appendix A.2.

Linear interpolation between features of two distinct samples.
Encoded features for classification. Notice that not only is the learned decoder good for generative purposes, but the encoder is also good for discriminative tasks. In this experiment, we evaluate the discriminativeness of the learned CTRL model by testing how well the encoded features can help classify the images. We use features of the training images to compute the learned subspaces for all classes, then classify features of the test images based on a simple nearest subspace classifier. Many other encoding methods train a classifier (say, with an additional layer) after the learned features. Results in Table 3 show that our model gives competitive classification accuracy on MNIST compared to some of the best VAE-based methods. We also tested the classification on CIFAR-10, and the accuracy is currently about 80.7%. As expected, the representation learned with the multi-class objective is very discriminative and good for classification tasks. This demonstrates that the learned CTRL model is not only generative but also discriminative. Be aware that generative models (GANs, VAEs, and ours) are not specifically engineered for classification tasks. Hence, one should not expect the classification accuracy to compete with supervised-trained classifiers yet.
Table 3. Classification accuracy on MNIST compared to classifier-based VAE methods [42]. Most of these VAE-based methods require auxiliary classifiers to boost classification performance.
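A nearest subspace classifier of the kind mentioned above can be sketched in a few lines of NumPy. The sketch assumes each class subspace is spanned by the top singular vectors of that class's (uncentered) training features and that a test feature is assigned to the class with the smallest projection residual; the rank, the helper names, and the synthetic features are illustrative and not taken from the paper.

```python
import numpy as np

def fit_class_subspaces(Z, labels, rank):
    """Top-`rank` left singular vectors of the training features of each class."""
    subspaces = {}
    for c in np.unique(labels):
        U, _, _ = np.linalg.svd(Z[:, labels == c], full_matrices=False)
        subspaces[int(c)] = U[:, :rank]
    return subspaces

def nearest_subspace_predict(Z, subspaces):
    """Assign each column of Z to the class whose subspace leaves the smallest residual."""
    classes = sorted(subspaces)
    residuals = np.stack([
        np.linalg.norm(Z - subspaces[c] @ (subspaces[c].T @ Z), axis=0)
        for c in classes
    ])
    return np.array(classes)[np.argmin(residuals, axis=0)]

def make_toy_features(per_class, rng, k=3, block=3, noise=0.05):
    """Synthetic unit-norm features: class c occupies its own coordinate block, plus noise."""
    Z = np.zeros((k * block, k * per_class))
    labels = np.repeat(np.arange(k), per_class)
    for c in range(k):
        cols = slice(c * per_class, (c + 1) * per_class)
        Z[c * block:(c + 1) * block, cols] = rng.standard_normal((block, per_class))
    Z /= np.linalg.norm(Z, axis=0, keepdims=True)
    return Z + noise * rng.standard_normal(Z.shape), labels

rng = np.random.default_rng(0)
Z_train, y_train = make_toy_features(100, rng)
Z_test, y_test = make_toy_features(50, rng)

subspaces = fit_class_subspaces(Z_train, y_train, rank=3)
accuracy = np.mean(nearest_subspace_predict(Z_test, subspaces) == y_test)
print(f"toy nearest-subspace accuracy: {accuracy:.3f}")
```

No additional classifier is trained here: the decision rule uses only the subspaces estimated from the training features, which is what makes the approach attractive when the classes already occupy nearly orthogonal subspaces.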

Open Theoretical Problems
So far, we have given theoretical intuition and derivation for the formulation of closed-loop transcription, as well as empirical evidence to showcase both the performance and potential of this formulation. In this section, we take a step back to explore the theoretical underpinnings of the closed-loop LDR transcription. We organize this section by discussing three primary objectives associated with learning an LDR representation:

1. Learn a simple linear discriminative representation f(X) of the data X, which we can reliably use to classify the data.
2. Learn a reconstruction g ∘ f(X) ∼ X of the so-learned representation f(X), to ensure consistency in the representation.
3. Learn both representation and reconstruction in a closed-loop manner, using feedback from the encoder f and decoder g to jointly solve the above two tasks.
These three objectives encompass the overarching principle of CTRL transcription, and indeed each of these objectives is tied to a wide array of mathematical and theoretical problems. We now outline some of the most important theoretical questions or hypotheses implicated by our results, which we leave for future work to study and to answer, likely by a broader range of research communities.

Distributions of the LDR Representation
Our primary mode of optimizing for a "simple representation" is through the LDR framework proposed in [2]. One important open theoretical problem is finding the right energy function to optimize in order to promote LDR. It was shown in [2] that an LDR can be learned for the multi-class data by maximizing the MCR² objective ΔR(Z) given in (5). This motivates the first two terms in our objective function (12): maximizing ΔR(Z) and ΔR(Ẑ) promotes their representations to be LDRs.
Although the authors in [2] have shown that the MCR² objective can promote the learned features to be in orthogonal subspaces and have characterized the optimal second moments of the distributions, there remain open questions regarding the optimal distributions within the subspaces. A standing hypothesis is that the optimal distributions should be Gaussian. There is indeed already theoretical work on similar energy functions: the Brascamp-Lieb inequalities [67], where the authors study a functional similar to the rate-reduction objective which, in certain contexts, is maximized uniquely by Gaussians. Hence, an important future theoretical direction for the CTRL transcription is to exactly characterize distributional properties of the extremals (both minima and maxima) of the MCR² objective or its variants. Such results can further justify the use of Gaussian models (14) to characterize the learned features within the subspaces.
We also notice that the so-learned LDR features have additional striking properties, as shown by examples in Figure 7. Distinctive visual attributes of the imagery data seem to be clearly disentangled by different principal components of the distribution, and along each principal direction, one can linearly interpolate the features, whereas the original data are nonlinear and cannot be directly interpolated. These results go beyond the guarantees given by [2], and an open theoretical problem is that of studying just how the CTRL transcription learns to disentangle and linearize such visual attributes. This understanding is crucial to extend the CTRL transcription framework beyond the 2D vision domain.

Self-Consistency in the Learned Reconstruction
If the learned encoder Z = f(X) is an embedding of the data submanifolds to the subspaces, it should admit an inverse (decoding) mapping X̂ = g(Z). As a distributional distance in the data space is hard to come by, the rate reduction ΔR(Z, Ẑ) gives a well-defined distributional distance between Z and Ẑ, which is used to enforce similarity between X and X̂ in our formulation. Notice that, unlike the KL-divergence or the JS-divergence, the rate reduction is well-defined for degenerate distributions and easily computable in closed form between mixtures of (degenerate) Gaussians. The third term of Equation (12), ∑_{j=1}^{k} ΔR(Z_j(θ), Ẑ_j(θ, η)), is exactly this distributional distance, which is minimized only when the estimated second moments of Z_j and Ẑ_j are the same. While this distributional distance seems weaker than the sample-wise ℓ²-distance, we observe strong reconstruction performance nevertheless.
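A short derivation may help explain why this distance is minimized only when the second moments agree. The sketch below assumes the coding-rate definition commonly used in the MCR² literature, $R(Z) = \tfrac{1}{2}\log\det\big(I + \tfrac{d}{n\epsilon^2} Z Z^\top\big)$ for $Z \in \mathbb{R}^{d \times n}$, and equal sample counts $n$ for $Z_j$ and $\hat{Z}_j$; neither assumption is restated in this section, and the symbols $A$ and $B$ are introduced here only for the derivation.

```latex
\begin{align*}
A &\doteq \tfrac{d}{n\epsilon^2}\, Z_j Z_j^\top, \qquad
B \doteq \tfrac{d}{n\epsilon^2}\, \hat{Z}_j \hat{Z}_j^\top, \\
\Delta R(Z_j, \hat{Z}_j)
  &= R(Z_j \cup \hat{Z}_j) - \tfrac{1}{2} R(Z_j) - \tfrac{1}{2} R(\hat{Z}_j) \\
  &= \tfrac{1}{2} \log\det\!\big(I + \tfrac{1}{2}(A + B)\big)
     - \tfrac{1}{4} \log\det(I + A) - \tfrac{1}{4} \log\det(I + B) \\
  &\ge 0 .
\end{align*}
```

The inequality follows from the concavity of $\log\det$ on positive definite matrices, applied to the midpoint of $I + A$ and $I + B$. Since $\log\det$ is strictly concave there, equality holds if and only if $A = B$, that is, when $Z_j Z_j^\top = \hat{Z}_j \hat{Z}_j^\top$.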
Notice that the current objectives (12) or (13) do not impose any constraints on the mappings of individual samples. That is, they do not explicitly specify how an individual sample x should be related to its decoded version x̂ = g(f(x)), or how their corresponding features z and ẑ are related. Hence, theoretically, nothing is known about relationships between individual samples and their features. However, somewhat surprisingly, experimental results with the multi-class objective (12) in the previous section suggest that they actually can be rather close, at least for the given training samples X. For example, see Figure 5. Of course, one could consider explicitly imposing certain sample-wise requirements in the objectives, such as enforcing x_i to be close to x̂_i = g(f(x_i)). It has been observed empirically in GANs or VAEs that imposing such sample-wise similarity or dissimilarity would improve visual quality around samples of interest, as in the DC-VAE [42] and the OpenGAN [68]. However, theoretically, how such sample-wise distances or constraints may affect the difficulty or accuracy of learning the correct support and density of the distributions remains an open problem.

Properties of the Closed-Loop Minimax Game
Above are the two primary objectives for CTRL transcription: while the encoder f tries to maximize the expressiveness and discriminativeness of the learned LDR representation, the decoder g tries to minimize the reconstruction error and coding rates. The competing objectives of the encoder f and the decoder g naturally lead to a two-player game. In this paper, we have formulated this game as a zero-sum game, namely Equation (12). Likewise, we have also implemented the most straightforward algorithm for solving this zero-sum game: gradient descent-ascent (GDA) [52], where the minimizer and maximizer take alternating gradient steps. These simplifications into a GDA-optimized zero-sum game were made in order to create a concrete algorithm for our experimentation. However, simplifying to a zero-sum game and GDA is certainly not the only way to solve the more general game described above. This game-theoretic formulation puts CTRL transcription outside of the theoretical realm of [2], since we are no longer finding pure maximizers of ∆R(Z), but rather stable minimax equilibria.
As is the case with GANs, these equilibria may not necessarily be Nash equilibria [50], but rather equilibria in the more general sense of Stackelberg [69]. So, the problem of studying minimax equilibria of (12) is likely, in its most general form, quite challenging. Nevertheless, our experiments suggest such equilibria tend to be well-behaved, e.g., having a large range of attraction. Our extensive empirical experiments and ablation studies indicate that, in general, the minimax objective converges rather stably to good equilibria for all the real datasets without any special optimization tricks or particular requirements on the networks. The only important factor for the stability of the optimization seems to be a large enough batch size (see Appendix A.10). These observations can be further corroborated with analysis on simpler models: our ongoing work suggests that if we restrict our attention to simplified data structures (e.g., X distributed on a linear subspace), then one can provide theoretical guarantees that the equilibria become efficiently and correctly solvable by the minimax formulation. Extending such analysis to more sophisticated data structures (multiple subspaces, nonlinear submanifolds) remains an exciting new direction for future research.
Despite many possible pathological solutions to the minimax game, empirically, as we have presented in the previous section (alongside many examples in the Appendix A), the solution found by the simple GDA algorithm generally strikes a good trade-off between expressiveness and parsimony of the learned model. The solution automatically determines the proper dimensions for different classes. Ablation studies in Appendix A.10 on the large ImageNet dataset further suggest that this formulation is insensitive to overparameterization by increasing network width, as long as the batch size grows accordingly. However, a rigorous justification for such good model-selection properties remains widely open.

Conclusions and Future Work
This work provides a novel formulation for learning a representation that is both generative and discriminative for a multi-class, multi-dimensional, possibly nonlinear, distribution of real-world data. We have provided compelling empirical evidence that the distribution of most datasets can be effectively mapped to an LDR, a union of independent principal subspaces and principal components. The objective function is entirely based on an intrinsic information-theoretic measure, the rate reduction, without any other heuristics or regularizing terms. The objective can be achieved with a closed-loop minimax game between the encoder and decoder networks without any additional network(s).
The main purpose of this paper is to demonstrate the conceptual simplicity and practical potential of this new framework for distribution/representation learning, instead of striving for state-of-the-art performance with heavy engineering. Nevertheless, with our preliminary implementation, a more informative LDR of the data can be effectively learned with a simple closed-loop transcription for a variety of real-world, multi-class, multi-modal visual datasets, from small to large, from low-resolution to higher-resolution, from domain-specific to diverse categories. The so-learned encoder f already enjoys the benefits of AE/VAEs for their discriminative property, and the decoder g enjoys the benefits of GANs for their good generative visual quality. However, probably more importantly, the internal structures of the learned feature representation have now become transparent, hence fully interpretable and controllable (for generative purposes): visual differences between classes are naturally "disentangled" as independent subspaces, while diverse visual attributes within each class are "disentangled" as principal components within each subspace. From extensive ablation studies given in Appendix A, we see that the rate-reduction-based objective can be stably optimized across a wide range of datasets and network architectures without any additional regularizations or engineering tricks. Both the feedback closed loop and the rate-reduction measure play indispensable roles in fostering the ease and success of finding the CTRL transcription.
One may notice that there are many ways this simple formulation can be significantly improved or extended. Firstly, in this work, we have simply adopted networks that were designed for GANs, but they may not be optimal for the rate-reduction-type objectives. For example, our ablation study already suggests that some of the components of such networks, such as spectral normalization, are not quite essential. Characteristics from the white-box ReduNet [2] derived from optimizing rate reduction can be explored in the future. Secondly, notice that our rate-reduction objectives do not impose any requirements on how individual samples should be encoded or decoded, although the results from the multi-class objective indicate a certain level of alignment on the individual samples. Recent studies such as DC-VAE [42] or OpenGAN [68] suggest that imposing additional regularization on individual samples may further improve decoded visual quality. Such regularization can certainly be incorporated into this new framework. Last but not least, compared to GANs and VAEs, our method leads to an explicit structured model for the feature distribution: a mixture of incoherent subspace Gaussians. Such an explicit model has the potential of making many subsequent tasks easier and better: better control of feature sampling for decoding and synthesis [70], designing generators and classifiers that are more robust to noise and corruptions based on the low-dimensional structures identified, or even extending to the settings of incremental and online learning [71,72]. We leave all these new directions, together with all the open theoretical problems posed in Section 4, for future investigation.
Acknowledgments: Earliest ideas of this work were germinated during a hiking event of Ma's group on Berkeley hills during the summer of 2020.
Former group members Chong You (now at Google) and Yichao Zhou (now at Apple) were part of a stimulating discussion on possible extensions or applications of a new rate-reduction framework being developed then. During the preparation of this work, we consulted several experts on some of the related topics. The authors would like to thank Jiantao Jiao of UC Berkeley for discussions about the theoretical conditions for learning distributions via GANs. We thank Benjamin Haeffele of Johns Hopkins University for sharing thoughts on how to learn subspaces correctly and on how to optimize the rate-reduction objectives efficiently. We would also like to thank Shankar Sastry and Manxi Wu of UC Berkeley and Chaobing Song of Univ. of Wisconsin-Madison for informative discussions on how to solve minimax games correctly and efficiently, as well as Chih-Yuan Chiu and Druv Pai of UC Berkeley for engaging discussions on theoretical directions for the CTRL transcription. Last but not the least, we would like to thank Stefano Soatto of UCLA for stimulating discussions and sometimes heated debates on how information can be efficiently and effectively encoded in deep networks.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A
Appendix A.1. Experiment Settings and Implementation Details
Network backbones. For MNIST, we use the standard CNN models in Tables A1 and A2, following the DCGAN architecture [63]. We resize the MNIST images from 28 × 28 to 32 × 32 to fit the DCGAN architecture. All α in the lReLU (short for Leaky-ReLU) layers of the encoder are set to 0.2.
We adopt ResNet architectures for CIFAR-10, shown in Tables A3 and A4, and for STL-10, shown in Tables A5 and A6. Each ResBlock up is the same as in ResNet, but with an up-sampler added after the first conv layer. All batch normalization layers of the ResBlocks in the encoder are replaced with spectral normalization layers.
Finally, we use the same architecture for CelebA, LSUN-bedroom, and ImageNet-128 (see Tables A7 and A8), as all three datasets have the same 128 × 128 resolution. Again, each ResBlock up is the same as in ResNet, but with an up-sampler added after the first conv layer. All batch normalization layers in the encoder are replaced with spectral normalization layers. All experiments use the lightweight PyTorch library "mimicry" [73], which provides implementations of some popular state-of-the-art GANs and evaluation metrics.
Optimization and training details. Across all of our experiments, we use Adam [74] as our optimizer, with hyperparameters $\beta_1 = 0.5$, $\beta_2 = 0.999$. We adopt the simple gradient descent-ascent algorithm for alternately minimizing and maximizing the objectives. The initial learning rate is set to 0.00015 and is scheduled with linear decay. We choose $\epsilon^2 = 0.5$ for both Equations (12) and (13) in all CTRL experiments. For all CTRL-Multi experiments on ImageNet, we only choose 10 classes. The details of the 10 classes are shown in Table A9. Most experiments are trained on RTX 3090 GPUs.
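To make the role of the distortion parameter in the rate-reduction terms concrete, the following is a minimal NumPy sketch. It assumes the standard coding-rate definition from the rate-reduction literature, $R(Z) = \frac{1}{2}\log\det\big(I + \frac{d}{n\epsilon^2} Z Z^\top\big)$, with features stored as columns; the function names are hypothetical and this is not the training code used in the experiments.

```python
import numpy as np

def coding_rate(Z, eps2=0.5):
    """R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T) for features Z of shape (d, n)."""
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(gram)  # stable log-determinant
    return 0.5 * logdet

def rate_reduction(Z, labels, eps2=0.5):
    """Delta R(Z): rate of the whole set minus the size-weighted rates of each class."""
    n = Z.shape[1]
    within = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        within += (Zc.shape[1] / n) * coding_rate(Zc, eps2)
    return coding_rate(Z, eps2) - within

def rate_reduction_between(Z1, Z2, eps2=0.5):
    """Delta R(Z1, Z2): rate of the union minus the average rate of the two sets
    (assumes the two sets have the same number of samples)."""
    union = np.concatenate([Z1, Z2], axis=1)
    return coding_rate(union, eps2) - 0.5 * (coding_rate(Z1, eps2) + coding_rate(Z2, eps2))
```

In this sketch, `rate_reduction_between(Z, Z)` evaluates to zero, and classes supported on orthogonal subspaces give a strictly positive `rate_reduction`, which is the structure the objectives reward.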

Settings. On the MNIST dataset, we train our model using the DCGAN [63] architecture with our proposed objectives CTRL-Multi (12) and CTRL-Binary (13). The learning rate is set to $10^{-4}$ and the batch size is set to 2048. We train our model for 15,000 iterations.
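As a small illustration of the learning-rate schedule under the MNIST settings above, the sketch below implements a linear decay. The end point of the decay is not stated in the text, so decaying to zero at the final iteration is an assumption of this sketch, and the function name is hypothetical.

```python
def linear_decay_lr(step, total_steps=15_000, base_lr=1e-4):
    """Learning rate at `step`, decaying linearly from base_lr to 0.

    Assumption: the rate reaches exactly 0 at `total_steps`.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining
```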
More results illustrating auto-encoding. Here, we give more reconstruction results, or X̂, from the CTRL-Multi and CTRL-Binary objectives, compared with their corresponding original inputs X. As shown in Figure A1, the CTRL-Binary objective can generate clean digit-like images, but the decoded X̂ might resemble digits from classes that are similar to but different from that of the input data X, since CTRL-Binary tends to align only the distribution of all digits.
In contrast, with the CTRL-Multi objective, the decoded X̂ not only are consistent with the correct class of the input data X, but also show a very clear one-to-one mapping between individual samples x and x̂, although the objective (12) does not enforce that. Compared with the results from VAE-GAN [34] and BiGAN [38], our decoded images make fewer errors in reconstruction and preserve the individual characteristics of the original samples much better.
Images decoded from random samples on the learned multi-class LDR. Since our CTRL-Multi objective function maps input data of each class into a different (orthogonal) subspace in the feature space, we can generate images conditioned on each class by randomly sampling z in the subspace of each class and then decoding them back to the input space as x̂.
To perform random sampling in the learned subspace, we first calculate the mean feature $\bar{z}_j$ and the singular vectors $v_j^i$ from the SVD (or principal components) of the learned features $Z_j$ of the training data in class $j$, where the index $i$ denotes the $i$-th principal component. We only use the top r = 8 principal components of each class on the MNIST dataset. These statistics of the subspace can be used to guide the random sampling. Then, we sample z randomly along the principal components and around the mean feature as

$$z_{\text{random}_j} = \bar{z}_j + \alpha \sum_{i=1}^{r} n_i \, \sigma_j^i \, v_j^i, \tag{A1}$$

where $\bar{z}_j$ is the mean feature of class $j$, $\sigma_j^i$ and $v_j^i$ are the $i$-th singular value and principal component of class $j$, and $n_i$ are i.i.d. Gaussian $\mathcal{N}(0, 1)$ random variables. That is, the feature in each subspace/class is modeled by an r-dimensional multivariate Gaussian, with variances $\sigma_j^i$ which characterize the variances of the training data in the feature space. Here, α is a hyperparameter that controls the sampling range. For the visualization of randomly generated images $g(z_{\text{random}_j})$ conditioned on the given class, we compare our method with some other conditional generation methods such as ACGAN [25] and InfoGAN [21] (for ACGAN and InfoGAN, we generate images conditioned on class labels with randomly sampled latent z according to the procedures described in their respective works). Our model can give realistic and correct conditional generation results with high diversity in each class, while other methods may make mistakes in the generation between some similar classes, such as classes 3 and 5 for InfoGAN.
Interpolation between samples in different classes. We randomly sample some images from each class. For each image $x_1$, we randomly sample another image $x_2$ from a different class. For such a pair of images $x_1$ and $x_2$, we reconstruct them based on their linearly interpolated feature representations by $\hat{x} = g(\alpha f(x_1) + (1 - \alpha) f(x_2))$, $\alpha \in [0, 1]$, the results of which are shown in Figure A3.
For each row in the figure, from left to right, the reconstructed images continuously morph from one digit to a different digit with a natural transition in shape rather than a simple superposition of the two images. This also confirms that the space between the subspaces for the digits does not represent valid digits but only shapes with digit-like strokes. Hence, for generative purposes, knowing the supports of valid digits is extremely important.
Figure A3. Images generated from the interpolation between samples in different classes.
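The class-conditional sampling and the feature interpolation described in this subsection can be sketched in NumPy as follows. This is an illustrative sketch only: centering the features before the SVD and dividing the singular values by $\sqrt{n_j}$ (so that they act as standard deviations) are assumptions not stated in the text, the function names are hypothetical, and the decoder g is left out, so the functions return features that would then be passed to g.

```python
import numpy as np

def fit_class_subspace(Zj, r=8):
    """Mean, top-r scaled singular values, and top-r directions of class features Zj (d, n_j)."""
    mean = Zj.mean(axis=1, keepdims=True)
    # Assumption: center before the SVD and rescale singular values by sqrt(n_j).
    U, S, _ = np.linalg.svd(Zj - mean, full_matrices=False)
    return mean[:, 0], S[:r] / np.sqrt(Zj.shape[1]), U[:, :r]

def sample_class_features(mean, sigmas, V, alpha=1.0, num=16, rng=None):
    """Draw z = mean + alpha * sum_i n_i * sigma_i * v_i with n_i ~ N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((num, sigmas.shape[0]))     # (num, r)
    return mean[None, :] + alpha * (noise * sigmas) @ V.T   # (num, d)

def interpolate_features(z1, z2, steps=8):
    """Return alpha * z1 + (1 - alpha) * z2 for `steps` values of alpha in [0, 1]."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return alphas * z1[None, :] + (1.0 - alphas) * z2[None, :]
```

Setting `alpha=0.0` in `sample_class_features` returns the class mean, and larger values widen the sampling range along the retained principal directions only.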

Appendix A.3. Transformed MNIST
Settings. In this experiment, we verify that the CTRL-Multi objective can preserve diverse data modes in the learned feature embeddings. We construct a transformed MNIST dataset with five modes: normal, large (1.5×), small (0.5×), rotated 45° left, and rotated 45° right. Each image is randomly transformed to one of the modes. Representative examples of such training data can be found in Figure A4a. We train the model with learning rate $1 \times 10^{-4}$ and batch size 2048 for 15,000 iterations.
Auto-encoding results. Figure A4b gives the decoded results of the training data with different modes. Even though the data are now much more diverse for each class, the decoder learned from the CTRL-Multi objective can still achieve high sample-wise similarity to the original images.
Identifying different modes. Similar to the earlier experiments of Figure 6 for CIFAR-10 in the main paper, we find the top principal components of the features of each class $Z_j$ (via SVD) and generate new images using the learned decoder g from the features of the training images that are best aligned with these components.
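The selection of training samples that are best aligned with each principal component can be sketched as below. The alignment score is assumed here to be the absolute inner product between a feature and a principal direction; that choice, and the function name, are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def top_aligned_samples(Zj, r=8, top=10):
    """For each of the top-r principal directions of Zj (d, n_j), return the
    indices of the `top` samples with the largest |<z, v>| (assumed score)."""
    U, _, _ = np.linalg.svd(Zj, full_matrices=False)
    scores = np.abs(U[:, :r].T @ Zj)               # (r, n_j) alignment scores
    return np.argsort(-scores, axis=1)[:, :top]    # (r, top) sample indices
```

Each row of the returned index array corresponds to one principal direction; decoding the selected features with g would give one row of images per direction, as in the figures.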
In Figure A5, we select three classes, 0, 1, and 2, and visualize samples from the top r = 8 principal components for each class. Each row represents one principal component direction. As can be seen in the figure, the decoded images along each principal component show a similar mode, and the modes along different component directions are rather incoherent. All major modes of the original data can be identified as one of these principal component directions. This clearly shows that the CTRL-Multi objective can keep the different modes within each class of the data $X_j$ as the principal component directions of $Z_j$, and these modes can also be retained in the decoded images $\hat{X}_j$.
Settings. For all experiments on CIFAR-10, we follow the common training hyperparameters in Appendix A.1. Beyond that, for each experiment, we run 450,000 iterations with batch size 1600.
Images decoded from random samples on the CTRL-Multi. We sample z in the feature space randomly along the principal components and around the mean feature of each class Z j as in the MNIST case, according to Equation (A1). The generated images from the sampled features are illustrated in Figure A6, one row per class. As we see, the generator learned from the CTRL-Multi objective is capable of generating diverse images for each class.
Further, for visualization of randomly generated images $g(z_{\text{random}_j})$ conditioned on the given class, we compare our method with some other conditional generation methods such as ACGAN [25] and InfoGAN [21]. For all three experiments, we randomly sample 8 images per class in CIFAR-10. For more complex datasets such as CIFAR-10, our model can give more realistic conditional generation results for different classes, with high diversity within each class.
Generating images along different PCA components for each class. For each class, we first compute the top 10 principal components (singular vectors of the SVD) of Z. Then, for each of the top singular vectors, we display in each row the top 10 reconstructed images X̂ whose Z are closest to the singular vector, using the method described in Section 3.3 of the main body of the paper. The results are given in Figure A7. Notice that images in each row are very similar, as they are sampled along the same principal component, whereas images in different rows are very different, as they are orthogonal in the feature space. These results indicate that the features learned by our method can not only disentangle different classes as orthogonal subspaces but can also disentangle different visual attributes within each class as (orthogonal) principal components within each subspace.
Table A10, on par or even better than existing methods such as SNGAN [31] or DC-VAE [42].
Visualizing auto-encoding property for the CTRL-Binary. We visualize the original images x and their decoded x̂ generated by the LDR model learned from the CTRL-Binary objective. The results are shown in Figure A8 for STL-10.

Appendix A.6. Celeb-A and LSUN
To verify that our formulation works on images of higher resolution, we conduct experiments on the Celeb-A and LSUN datasets, which have a resolution of 128 × 128.
Settings. For all experiments on these datasets, we follow the common training hyperparameters in Appendix A.1. We choose a batch size of 300 for Celeb-A and LSUN. Both models are trained with the CTRL-Binary objective for 450,000 iterations.
Generating images along different PCA components. We calculate the principal components of the learned features Z in the latent subspace. We manually choose three principal components that are related to hat, hair color, and glasses (see Figure A9). These are the 9th, 19th, and 23rd of the 128 principal components, respectively. These principal directions seem to clearly disentangle visual attributes/factors such as wearing a hat, changing hair color, and wearing glasses.
Images generated from random sampling of the feature space. We sample z randomly according to the following Gaussian model:

$$z = \bar{z} + \alpha \sum_{i=1}^{r} n_i \, \sigma^i \, v^i,$$

where $\bar{z}$ is the mean feature, $\sigma^i$ and $v^i$ are the $i$-th singular value and singular vector, respectively, and $n_i$ are i.i.d. Gaussian $\mathcal{N}(0, 1)$ random variables. As before, α is a hyperparameter that controls the sampling range. We use the top r = 100 principal components for random sampling. The randomly generated images are realistic and diverse (see Figure A10).
Visualizing auto-encoding property for CTRL-Binary. We visualize the original images x and their decoded x̂ using the LDR model learned from the CTRL-Binary objective. The results are shown in Figures A11 and A12 for the Celeb-A dataset and the LSUN dataset, respectively. The CTRL-Binary objective can give very good visual quality for the decoded x̂ but cannot ensure sample-to-sample alignment. Nevertheless, the decoded x̂ seems to be very similar to the original x in some main visual attributes. We believe the binary objective manages to align only the dominant principal component(s) associated with the most salient visual attributes, say, the pose of the face for Celeb-A or the layout of the room for LSUN, between the features of X and X̂.

Appendix A.7. ImageNet

Settings. To verify that CTRL works on large-scale datasets, we train it on ImageNet. For all experiments on ImageNet, we follow the common training hyperparameters in Appendix A.1.
We first train our model with the CTRL-Binary objective with batch size of 1800 on the whole ImageNet ILSVRC 2012 dataset. The number of training iterations is 450,000.
After that, we fine-tune the pretrained model with the CTRL-Multi objective, on 10 selected classes. Information about the 10 classes can be found in Table A9. The fine-tune batch size is 1024, and we train another 35,000 iterations for it. This experiment takes 120 GPU hours on 8 A100-SXM4 GPUs. Note that our choice of batch size is substantially larger than those commonly adopted in other works while training on the ImageNet (e.g., 128 in [31]). We empirically observe that training with a larger batch size generates images of better quality and clearer class alignment. This is consistent with the proposed CTRL-Multi objective as it explicitly encourages alignment of class distributions, therefore benefiting from a larger batch that better captures overall data distributions. We leave a more rigorous study of the effect of batch size for future work.
Due to the heavy computation required by such a large batch size, we present the intermediate result obtained at the early iteration 35,000, whereas most existing methods run for a significantly larger number of iterations. Nevertheless, the intermediate result already verifies the efficacy of our framework. In addition, we present the full version of the comparison with existing generative methods in Table A10. We see that the IS and FID scores for CTRL-Multi degrade a little after the fine-tuning. This is expected, as learning a more refined separation and alignment of 10 classes is a more challenging task than for 2 classes. This is consistently observed in experiments on other datasets too.
Visualizing feature similarity for CTRL-Multi. We visualize the cosine similarity among features Z of different classes learned from the CTRL-Multi objective in Figure A13. In addition, we provide a visualization of the alignment between the features Z and the decoded features Ẑ. These results demonstrate that not only has the encoder already learned to discriminate between classes, but the learned Z and Ẑ are also clearly aligned within each class.
Visualizing auto-encoding property for CTRL-Multi. We visualize the original images X and their decoded X̂ using the LDR model fine-tuned with the CTRL-Multi objective. The results are shown in Figure A14 for the selected 10 classes in ImageNet. The CTRL-Multi objective can give good visual quality for X̂ as well as sample-to-sample alignment.
We run experiments on MNIST with the three different architectures, choosing the network from Table A1 for the encoder and Table A2 for the decoder; the training hyperparameters follow Appendix A.1. The qualitative results are shown in Figure A15. Both architectures (A4) and (A5) failed to generate meaningful images. These experiments show that directly applying rate-reduction objectives without the closed loop, or with architectures that only loosely enforce cycle consistency, fails to work. In contrast, the closed-loop formulation allows us to use only two networks, without the need for any extra network.
To replace the rate reduction (∆R) terms in the objective function (12) with cross-entropy, we introduce a linear mapping $W \in \mathbb{R}^{d \times k}$ to map $Z \in \mathbb{R}^{d \times n}$ from the feature space to logits γ = Z W. We then calculate the softmax cross-entropy function on the logits γ and the one-hot label matrix Y.
Here, H(γ, Y) denotes the softmax cross-entropy function evaluated on the logits $\gamma_{ij}$, and $Y \in \mathbb{R}^{n \times k}$ is the one-hot label matrix. Then, we can replace the first two terms of (12) ($\Delta R(Z)$ and $\Delta R(\hat{Z})$) with $H(Z W, Y)$ and $H(\hat{Z} W, Y)$. For the third term of (12), we extract the one-hot features of the $j$-th class, $\gamma_j = Z_j W$ and $\hat{\gamma}_j = \hat{Z}_j W$, from Z and Ẑ, and define their distance as $D(\gamma_j, \hat{\gamma}_j) = \frac{e^{\gamma_j}}{e^{\gamma_j} + e^{\hat{\gamma}_j}}$. For the third term of (12), we further introduce k linear layers as discriminators $\{D_j\}_{j=1}^{k}$, one for each class. Then, we replace the third term with the GAN objective function $\sum_{j=1}^{k} \big( \mathbb{E}[\log D_j(Z_j)] + \mathbb{E}[\log(1 - D_j(\hat{Z}_j))] \big)$, where $\mathbb{E}[X]$ denotes the expectation of X. Now, we have the cross-entropy version of the objective function (A6) for the closed-loop framework. We denote the closed-loop framework with cross-entropy as Closed-loop-CE.
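As an illustration of the two cross-entropy terms that replace $\Delta R(Z)$ and $\Delta R(\hat{Z})$, the following NumPy sketch computes a softmax cross-entropy on linearly mapped features. Two points are assumptions of this sketch: since Z has shape d × n and W has shape d × k, the logits are formed as $Z^\top W$ so that they match the n × k label matrix, and the loss is averaged over the n samples. The function names are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, Y):
    """Mean softmax cross-entropy for logits (n, k) and one-hot labels Y (n, k)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(Y * log_probs).sum(axis=1).mean()

def closed_loop_ce_terms(Z, Z_hat, W, Y):
    """Cross-entropy terms for features Z, Z_hat of shape (d, n) and W of shape (d, k).

    Assumption: logits are computed as Z^T W, giving an (n, k) matrix.
    """
    return softmax_cross_entropy(Z.T @ W, Y), softmax_cross_entropy(Z_hat.T @ W, Y)
```

With all-zero logits, every class receives probability 1/k, so the loss equals log k; this is a convenient sanity check for any implementation of this term.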
We run the experiments on MNIST and CIFAR-10. The architectures for MNIST and CIFAR-10 are given in Tables A1-A4 (in the context of this section, we use the terms Decoder and Generator interchangeably; similarly for Encoder and Discriminator).
Results on MNIST. The training hyper-parameters of CTRL-Multi and Closed-loop-CE on MNIST follow Appendix A.1. Comparisons between CTRL-Multi and Closed-loop-CE are shown in Figures A16-A18. Figure A16b,c show the reconstructed images X̂ from Closed-loop-CE and CTRL-Multi. Both methods can give sample-wise reconstruction results due to the closed-loop transcription framework. However, comparing training images whose features are best aligned with the principal components of class '2' in Figure A17, we see that the principal components of the CE features do not correspond to consistent visual attributes of the images, whereas ours do.
From the heatmaps in Figure A18a,b, we see that the features learned by rate reduction possess clear orthogonal subspace structures, whereas those learned by Closed-loop-CE do not. Moreover, Figure A18c,d shows that the learned features of CTRL-Multi have higher singular values for the top principal components of each class, corresponding to a more linearized and diverse feature distribution, whereas those learned by Closed-loop-CE do not.
Failed Attempts on CIFAR-10 with Cross Entropy. The training hyper-parameters of Closed-loop-CE on CIFAR-10 follow Appendix A.1. We perform a grid search over three hyper-parameters: learning rate ($1.5 \times 10^{-2}$, $1.5 \times 10^{-3}$, or $1.5 \times 10^{-4}$), batch size (800 or 1600), and inner loop (1, 2, 3, or 4), conducting 24 experiments in total. All cases of Closed-loop-CE fail to converge or experience model collapse on the CIFAR-10 dataset.
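The count of 24 experiments follows from the Cartesian product of the three hyper-parameter ranges (3 × 2 × 4). A short sketch of how such a grid could be enumerated is given below; the dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

learning_rates = [1.5e-2, 1.5e-3, 1.5e-4]
batch_sizes = [800, 1600]
inner_loops = [1, 2, 3, 4]

# One configuration per combination of the three hyper-parameters.
grid = [
    {"lr": lr, "batch_size": bs, "inner_loop": k}
    for lr, bs, k in product(learning_rates, batch_sizes, inner_loops)
]
print(len(grid))  # 3 * 2 * 4 = 24
```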

.3. Ablation Study on the CTRL-Multi Objectives
In this section, we investigate the influence of each term of the objective function (12) and see how it affects the learned features Z, Ẑ and the sample-wise reconstruction. We follow the same experimental setting as CTRL-Multi on MNIST (Appendix A.1) and conduct three experiments, each with a different version of the objective. Objective I is the original objective with all three terms, Objective II removes the second term ∆R(Ẑ), and Objective III keeps only the third term ∆R(Z, Ẑ). The results in Figure A19 show that with Objective II we can still maintain the sample-wise reconstruction property, but the image quality is lower than that of the images reconstructed with Objective I (Figure A19b vs. Figure A19c). Objective III loses the sample-wise reconstruction property (Figure A19a vs. Figure A19d). Finally, the results from Figures A20 and A21 show that without the first two terms, the learned features Z and Ẑ have poor class-to-class alignment, and their principal components do not show clear subspace structure with higher singular values within each class.
Table A11. Three different objective functions for CTRL.
Objective III: $\min_{\eta} \max_{\theta} \mathcal{T}_X(\theta, \eta) = \sum_{j=1}^{k} \Delta R\big(Z_j(\theta), \hat{Z}_j(\theta, \eta)\big)$.

Appendix A.9. Ablation Study on Sensitivity to Spectral Normalization
It is known that spectral normalization is important for improving the stability of training GANs. Here, we test our formulation with and without spectral normalization. We follow the setting from Appendix A.1 and test on CIFAR-10, using the network architecture from Tables A3 and A4. All settings of the two experiments are exactly the same, except for the presence or absence of spectral normalization. We see that our formulation is stable in both settings and generates similar images. The only difference is that the quantitative scores in terms of IS and FID are higher with spectral normalization.
Empirically, we observed that for our formulation, the larger the batch size, the better the results. To justify our use of a batch size that is larger than those adopted in previous works such as [31], we conduct the following experiment, which studies the training behavior of our proposed CTRL-Multi objective. Specifically, we train on the selected 10 classes of ImageNet while varying the number of widest channels in our chosen architecture (specified in Appendix A.1) and the batch size. We train both the encoder and decoder from scratch without fine-tuning. The other hyper-parameter settings detailed in Appendix A.7 are fixed. We present the results in Table A13. In the table, we denote training sessions that do not produce meaningful images as "failure" and those that do as "success". In the "failure" scenario, we noticed that the second term in the CTRL-Multi objective (12) would collapse to near 0 and could not be recovered, implying that the decoder has essentially lost the minimax game. In the "success" scenario, the first two terms of (12) stay close to each other and neither collapses to near 0. The results present an interesting diagonal pattern that captures the relationship between batch size and network width.
With a wider network and more channels, the network has greater capacity but requires a larger batch to stabilize training. This experiment justifies our use of a larger batch in our experiment in Appendix A.7 and also presents an interesting trade-off between network capacity and batch size for training.
In this paper so far, for simplicity and uniformity, we have chosen the feature dimension d = nz to be 128 for all experiments. In practice, however, the choice of feature dimension may affect the performance of the learned features: common practice suggests that the larger the model, the better the performance could be. Hence, in this last section, we conduct experiments to show how the feature dimension affects the performance. It is not our intention to find the best feature dimension (nor the best network) with this work. We only want to show that there is room to improve the results presented in this paper.
The baseline experiment is conducted on CIFAR-10 with architectures from Table A2 and Table A1, and the training hyper-parameters follow the setting in Appendix A.1. Here, we change the feature dimension nz, batch size, and learning rate to 512, 8196, and $0.5 \times 10^{-4}$, respectively. Figure A22 shows the comparison of (randomly selected, not cherry-picked) reconstructed images with the original ones. We observe a significant improvement in visual quality over the results with a lower feature dimension. The IS and FID scores reported in Table A14 also confirm the improvement.