An Information Theoretic Interpretation to Deep Neural Networks

It is commonly believed that the hidden layers of deep neural networks (DNNs) attempt to extract informative features for learning tasks. In this paper, we formalize this intuition by showing that the features extracted by a DNN coincide, in a local analysis regime, with the solution of an optimization problem that we call the "universal feature selection" problem. We interpret the training of the weights in a DNN as projecting feature functions between feature spaces specified by the network structure. Our formulation has direct operational meaning in terms of the performance of inference tasks, and gives interpretations to the internal computation results of DNNs. Results of numerical experiments are provided to support the analysis.


Introduction
Due to the striking performance of deep learning in various application fields, deep neural networks (DNNs) have gained great attention in modern computer science. While it is a common understanding that the features extracted from the hidden layers of a DNN are "informative" for learning tasks, the mathematical meaning of informative features in DNNs is generally not clear. From the practical perspective, DNN models have obtained unprecedented performance in a wide variety of tasks, such as image recognition [1], language processing [2,3], and games [4,5]. However, the understanding of the feature extraction behind these models is relatively lacking, which poses challenges for their application in security-sensitive tasks such as autonomous driving.
To address this problem, there have been numerous research efforts, including both experimental and theoretical studies [6]. Experimental studies usually focus on empirical properties of the features extracted by DNNs, by visualizing the features [7] or testing their performance in specific training settings [8] or learning tasks [9]. Though such empirical methods have provided some intuitive interpretations, their effectiveness can highly depend on the data and network architecture used. For example, while feature visualization works well on convolutional neural networks, its application to other networks is typically less effective [10].
In contrast, theoretical studies focus on the analytical properties of the extracted features or of the learning process in DNNs. Due to the complicated structure of DNNs, existing studies are often restricted to networks of specific structures, e.g., networks with infinite width [11] or two-layer networks [12,13], to characterize the theoretical behaviors. However, the interpretation of the optimal feature remains unclear, which limits their applicability.


Local Information Geometry

Definition 1 (ε-Neighborhood). Let P^X denote the space of distributions on some finite alphabet X, and let relint(P^X) denote the subset of strictly positive distributions. For a given ε > 0, the ε-neighborhood of a distribution P_X ∈ relint(P^X) is defined via the χ²-divergence as N_ε^X(P_X) ≜ {P ∈ P^X : Σ_{x∈X} (P(x) − P_X(x))²/P_X(x) ≤ ε²}.

Definition 2 (ε-Dependence). The random variables X, Y are called ε-dependent if P_XY ∈ N_ε^{X×Y}(P_X P_Y).

Definition 3 (ε-Attribute).
A random variable U is called an ε-attribute of X if P_X|U(·|u) ∈ N_ε^X(P_X), for all u ∈ U.
We will focus on the small-ε regime, which we refer to as the local analysis regime. In addition, for any P ∈ P^X, we define the information vector φ and feature function L(x) corresponding to P, with respect to a reference distribution P_X ∈ relint(P^X), as φ(x) ≜ (P(x) − P_X(x))/√(P_X(x)) and L(x) ≜ φ(x)/√(P_X(x)). This gives a three-way correspondence P ↔ φ ↔ L for all distributions in N_ε^X(P_X), which will be useful in our derivations.
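The χ²-neighborhood and the P ↔ φ correspondence can be sketched numerically; a minimal NumPy example with an illustrative reference distribution, where `eps` plays the role of the local-analysis scale:

```python
import numpy as np

def chi2_divergence(p, q):
    """chi^2(p || q) = sum_x (p(x) - q(x))^2 / q(x)."""
    return np.sum((p - q) ** 2 / q)

def information_vector(p, p_ref):
    """phi(x) = (p(x) - p_ref(x)) / sqrt(p_ref(x))."""
    return (p - p_ref) / np.sqrt(p_ref)

p_ref = np.array([0.3, 0.2, 0.5])          # reference distribution P_X
p = np.array([0.32, 0.18, 0.50])           # a small perturbation of P_X
eps = 0.1

phi = information_vector(p, p_ref)
# ||phi||^2 equals the chi^2-divergence, so p lies in the eps-neighborhood
# of p_ref exactly when ||phi|| <= eps.
assert np.isclose(phi @ phi, chi2_divergence(p, p_ref))
assert chi2_divergence(p, p_ref) <= eps ** 2
```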

Modal Decomposition
Given a pair of discrete random variables X, Y with joint distribution P_XY(x, y), the |Y| × |X| matrix B̃ is defined by B̃(y, x) ≜ (P_XY(x, y) − P_X(x)P_Y(y)) / (√(P_X(x)) √(P_Y(y))), where B̃(y, x) is the (y, x)th entry of B̃. The matrix B̃ is referred to as the canonical dependence matrix (CDM) [24]. The SVD of B̃ is referred to as the modal decomposition [24] of the joint distribution P_XY, which has the following property [18].

Lemma 1. The SVD of B̃ can be written as B̃ = Σ_{i=1}^{K} σ_i ψ_i^Y (ψ_i^X)^T, where K ≜ min{|X|, |Y|}, σ_i denotes the ith singular value with the ordering 1 ≥ σ_1 ≥ ··· ≥ σ_K = 0, and ψ_i^Y and ψ_i^X are the corresponding left and right singular vectors, with ψ_K^X(x) = √(P_X(x)) and ψ_K^Y(y) = √(P_Y(y)).
This SVD decomposes the feature spaces of X and Y into maximally correlated features. To see this, consider the generalized canonical correlation analysis (CCA) problem: for 1 ≤ k ≤ K − 1, maximize Σ_{i=1}^{k} E[f_i(X) g_i(Y)] over feature functions satisfying E[f_i(X)] = E[g_i(Y)] = 0 and E[f_i(X) f_j(X)] = E[g_i(Y) g_j(Y)] = δ_ij, where δ_ij denotes the Kronecker delta function. It can be shown that for any 1 ≤ k ≤ K − 1, the optimal features are f_i(x) = ψ_i^X(x)/√(P_X(x)) and g_i(y) = ψ_i^Y(y)/√(P_Y(y)), for i = 1, ..., k, where ψ_i^X(x) and ψ_i^Y(y) are the xth and yth entries of ψ_i^X and ψ_i^Y, respectively [18]. The special case k = 1 corresponds to the HGR maximal correlation [25-27], and the optimal features can be computed by the ACE (Alternating Conditional Expectation) algorithm [36].
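These objects are easy to compute numerically. The sketch below (NumPy; the joint distribution is randomly generated purely for illustration) builds the CDM, takes its SVD, and checks that the top singular pair yields zero-mean features whose correlation equals σ₁:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(12)).reshape(3, 4)   # joint pmf P_XY; rows index y, columns x
P_x, P_y = P.sum(axis=0), P.sum(axis=1)

# Canonical dependence matrix: B(y, x) = (P(x, y) - P_X(x) P_Y(y)) / sqrt(P_X(x) P_Y(y)).
B = (P - np.outer(P_y, P_x)) / np.sqrt(np.outer(P_y, P_x))

psi_y, sigma, psi_xT = np.linalg.svd(B)

# Top maximally correlated (HGR) features: f(x) = psi_1^X(x)/sqrt(P_X(x)), similarly g.
f = psi_xT[0] / np.sqrt(P_x)
g = psi_y[:, 0] / np.sqrt(P_y)

corr = np.einsum('yx,x,y->', P, f, g)          # E[f(X) g(Y)]
assert np.isclose(abs(corr), sigma[0])         # equals the top singular value
assert abs(P_x @ f) < 1e-8 and abs(P_y @ g) < 1e-8   # zero-mean under the marginals
```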

Deep Neural Networks
The architecture of deep neural networks (under log-loss) is depicted in Figure 1, where X is the input data, e.g., images, audio, or natural language. Moreover, Y is the objective to predict, which can represent a discrete label in classification tasks, or target natural language in machine translation [37]. Specifically, for given data X, the network produces a (trainable) feature mapping to generate the k-dimensional feature s(x) = (s_1, ..., s_k)^T. In practice, the feature mapping block (depicted as the gray block in Figure 1) is typically composed of hundreds or thousands of functional components (e.g., residual blocks [1]) with different types of layers, and may contain recurrent structures, e.g., LSTM (Long Short-Term Memory) [38]. In general, the internal structure of the feature mapping can have various designs, depending on the learning task.
Figure 1. A deep neural network that uses data X to predict Y. All hidden layers together map the input data X to the k-dimensional feature s(x) = (s_1, ..., s_k)^T. Then, the probabilistic prediction P̃_Y|X of Y is computed from s(x), v(y), and b(y), where v and b are the weights and biases in the last layer.
After obtaining the feature s(X), the value of Y is predicted by the probability distribution P̃_Y|X(y|x) = exp(v^T(y)s(x) + b(y)) / Σ_{y′∈Y} exp(v^T(y′)s(x) + b(y′)), which is obtained by applying the softmax function [39] to v^T(y)s(x) + b(y), where v(·) and b(·) are the weights and biases in the last layer, respectively (this is equivalent to the common practice of denoting the weights and biases by the matrix [v(1), ..., v(|Y|)]^T and the vector [b(1), ..., b(|Y|)]^T, respectively; however, as we will show later, expressing the weights v and biases b as mappings of y better illustrates their roles in feature selection). We will use P̃_Y|X to refer to P̃ when there is no ambiguity. Then, for a given training set of labeled samples (x_i, y_i), for i = 1, ..., N, all the parameters in the network, including v, b, and those in the feature mapping block, are chosen to maximize the log-likelihood function (5): (1/N) Σ_{i=1}^{N} log P̃_Y|X(y_i|x_i) (or, equivalently, to minimize the log-loss). The procedure of choosing such parameters is called the training of the network, which can be performed by stochastic gradient descent (SGD) or its variants [21]. With a trained network, the label ŷ for a new data sample x can be predicted by the maximum a posteriori (MAP) estimate, i.e., ŷ = arg max_{y∈Y} P̃_Y|X(y|x). Specifically, when we make predictions for samples in a test dataset, the proportion of samples with correct predictions (i.e., ŷ = y) over all samples is called the test accuracy.
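A minimal sketch of the last-layer computation described above (NumPy; the feature values, weights, and biases are arbitrary illustrative numbers):

```python
import numpy as np

def predict_proba(s_x, V, b):
    """Softmax of v(y)^T s(x) + b(y) over y; V stores one row v(y) per label."""
    logits = V @ s_x + b
    z = np.exp(logits - logits.max())      # subtract max for numerical stability
    return z / z.sum()

rng = np.random.default_rng(1)
k, num_labels = 3, 5
V = rng.normal(size=(num_labels, k))       # last-layer weights v(y)
b = rng.normal(size=num_labels)            # last-layer biases b(y)
s_x = rng.normal(size=k)                   # feature s(x) produced by the hidden layers

p = predict_proba(s_x, V, b)               # probabilistic prediction of Y given x
y_hat = int(np.argmax(p))                  # MAP prediction
assert np.isclose(p.sum(), 1.0) and 0 <= y_hat < num_labels
```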

Information-Theoretic Feature Selection
Suppose that, given random variables X, Y with joint distribution P_XY, we want to infer an attribute V of Y from observed i.i.d. samples x_1, ..., x_n of X. When the statistical model P_X|V is known, the optimal decision rule is the log-likelihood ratio test, and the log-likelihood function can be viewed as the optimal feature for inference. However, in many practical situations [18], it is hard to identify the model of the targeted attribute, and it is necessary to select low-dimensional informative features of X for inference tasks before knowing the model. An information-theoretic formulation of this feature selection problem is the universal feature selection problem [24], which we formalize as follows.
To begin, for an attribute V, we refer to the collection C_Y ≜ {φ_v^{Y|V} : v ∈ V} as the configuration of V, where φ_v^{Y|V} is the information vector specifying the corresponding conditional distribution P_Y|V(·|v). The configuration of V models the statistical correlation between V and Y. In the sequel, we focus on the local analysis regime, for which we assume that all the attributes V of interest are ε-attributes of Y. As a result, the corresponding configuration satisfies ‖φ_v^{Y|V}‖ ≤ ε, for all v ∈ V. We refer to such configurations as ε-configurations. The configuration of V is unknown in advance, but assumed to be generated from a rotationally invariant ensemble (RIE).
Moreover, a probability measure defined on a set of configurations is called an RIE, if all rotationally equivalent configurations have the same measure.
The RIE can be interpreted as assigning a uniform measure to attributes with the same level of distinguishability. To infer the attribute V, we construct a k-dimensional feature vector h^k = (h_1, ..., h_k), for some 1 ≤ k ≤ K − 1, of the form h_i = (1/n) Σ_{l=1}^{n} f_i(x_l), for some choices of feature functions f_i. Our goal is to determine the f_i such that the optimal decision rule based on h^k achieves the smallest possible error probability, where the performance is averaged over the possible C_Y generated from an RIE. In turn, we denote by ξ_i^X ↔ f_i the corresponding information vectors, and define the matrix Ξ^X ≜ [ξ_1^X ··· ξ_k^X].
Theorem 1. Let E_{v,v′}(h^k) be the error exponent associated with the pairwise error probability of distinguishing v and v′ based on h^k. Then, the expected error exponent over a given RIE defined on the set of ε-configurations is given by (7): Ē = C_0 ‖B̃ Ξ^X (Ξ^{XT} Ξ^X)^{−1/2}‖_F², where C_0 > 0 is a constant that depends only on the RIE. Proof. See Appendix A.
As a result of (7), designing the ξ_i^X as the singular vectors ψ_i^X of B̃, for i = 1, ..., k, optimizes (7) for all RIEs, pairs (v, v′), and ε-configurations. Thus, the feature functions corresponding to the ψ_i^X are universally optimal for inferring the unknown attribute V. Moreover, (7) naturally leads to an information metric ‖B̃ Ξ^X (Ξ^{XT} Ξ^X)^{−1/2}‖_F² for any feature Ξ^X of X, obtained by projecting the normalized Ξ^X through the linear map B̃. This information metric quantifies how informative a feature of X is for solving inference problems with respect to Y, and it is maximized by designing features from the singular vectors of B̃. Thus, we can interpret universal feature selection as finding the most informative features for data inference via the SVD of B̃, which also coincides with the maximally correlated features in (3). Later, we will show that feature selection in DNNs shares the same information metric as universal feature selection in the local analysis regime.
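The information metric can also be checked numerically; in the sketch below (NumPy, with a randomly generated joint distribution for illustration), the top-k right singular vectors of B̃ attain the maximum value Σᵢ σᵢ² of the metric:

```python
import numpy as np

def info_metric(B, Xi):
    """||B Xi (Xi^T Xi)^{-1/2}||_F^2; QR gives an equivalent orthonormalization."""
    Q, _ = np.linalg.qr(Xi)
    return np.linalg.norm(B @ Q, 'fro') ** 2

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(24)).reshape(4, 6)   # joint pmf; rows y, columns x
P_x, P_y = P.sum(axis=0), P.sum(axis=1)
B = (P - np.outer(P_y, P_x)) / np.sqrt(np.outer(P_y, P_x))

k = 2
_, sigma, psi_xT = np.linalg.svd(B)
Xi_opt = psi_xT[:k].T                          # top-k right singular vectors of B
Xi_rnd = rng.normal(size=(6, k))               # an arbitrary k-dimensional feature

assert np.isclose(info_metric(B, Xi_opt), np.sum(sigma[:k] ** 2))
assert info_metric(B, Xi_rnd) <= np.sum(sigma[:k] ** 2) + 1e-9
```

The Frobenius norm is invariant under right rotations, so orthonormalizing Ξ with QR computes the same value as the (ΞᵀΞ)^{−1/2} normalization.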

Network with Ideal Expressive Power
For convenience of analysis, we first consider the ideal case where the neural network can express any feature mapping s(·) as desired. While this assumption can be rather strong, the existence of such ideal networks is guaranteed by the universal approximation theorem [40]. In addition, one goal of practical network designs is to approximate the ideal networks and obtain sufficient expressive power. For such networks, we will show that when X, Y are -dependent, the extracted feature s(x) and weights v(y) coincide with the solutions of the universal feature selection.
To begin, we use P_XY to denote the joint empirical distribution of the labeled samples (x_i, y_i), i = 1, ..., N, and P_X, P_Y to denote the corresponding marginal distributions. Then, the objective function of (5) is the empirical average of the log-likelihood, E_{P_XY}[log P̃_Y|X(Y|X)]. Therefore, maximizing this empirical average is equivalent to minimizing the KL divergence (8): D(P_XY ‖ P_X P̃_Y|X). This can be interpreted as finding the best fit to the empirical joint distribution P_XY among distributions of the form P_X P̃_Y|X. In our development, it is more convenient to denote the bias by d(y) ≜ b(y) − log P_Y(y), for y ∈ Y. Then, the following lemma gives the explicit constraint for problem (8) in the local analysis regime. Lemma 2. If X, Y are ε-dependent, then the optimal v, d for (8) satisfy the local constraint (9). Proof. See Appendix B.
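The equivalence between maximizing the expected log-likelihood and minimizing the KL divergence can be verified numerically; a sketch (NumPy) with an arbitrary model P̃_{Y|X}:

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(12)).reshape(3, 4)   # empirical joint pmf; rows y, columns x
P_x = P.sum(axis=0)

# An arbitrary conditional model P~(y|x): each column is a pmf over y.
Q = rng.dirichlet(np.ones(3), size=4).T        # shape (|Y|, |X|)

avg_loglik = np.sum(P * np.log(Q))             # E_P[log P~(Y|X)]
H_cond = -np.sum(P * np.log(P / P_x))          # conditional entropy H(Y|X)
kl = np.sum(P * np.log(P / (P_x * Q)))         # D(P_XY || P_X P~_{Y|X})

# E_P[log P~] = -H(Y|X) - D(P_XY || P_X P~), so maximizing the log-likelihood
# is the same as minimizing the KL divergence.
assert np.isclose(avg_loglik, -H_cond - kl)
```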
In turn, we take (9) as the constraint when solving problem (8) in the local analysis regime. Moreover, we define the information vectors for the zero-mean vectors s̃, ṽ as ξ^X(x) ≜ √(P_X(x)) s̃(x), ξ^Y(y) ≜ √(P_Y(y)) ṽ(y), and define the matrices Ξ^X ≜ [ξ^X(1) ··· ξ^X(|X|)]^T and Ξ^Y ≜ [ξ^Y(1) ··· ξ^Y(|Y|)]^T. Lemma 3. The KL divergence (8) in the local analysis regime (9) can be expressed in the form (10). Proof. See Appendix C.
Lemma 3 reveals key insights for feature selection in neural networks. To see this, we consider the following two learning problems: learning the optimal weight v for given s and learning the optimal feature s for given v.
For the case that s is fixed, we can optimize (10) with Ξ^X fixed and obtain the following optimal weights. Theorem 2. For fixed Ξ^X and µ_s, the optimal Ξ^{Y*} minimizing (10) is given by (11), and the optimal weights ṽ* and bias d̃* are given by (12), where Λ_s̃(X) denotes the covariance matrix of s̃(X).

Proof. See Appendix D.
Specifically, when s(x) = x, Theorem 2 gives the optimal weights for softmax regression. Note that Equation (11) can be viewed as a projection of the input feature s̃(x) to a feature ṽ(y), computable from the value of y, that is most correlated with s̃(x). The solution is given by left-multiplication by the matrix B̃, which we refer to as the forward feature projection.

Remark 1.
While we assume that the continuous input s(x) is a function of a discrete variable X, we only need the labeled samples of (s, Y) to compute the weights and bias from the conditional expectation (12); the correlation between X and s is irrelevant. Thus, our analysis of the weights and bias applies to networks with continuous inputs by simply ignoring X and taking s as the real input to the network.
We then consider the "backward feature projection" problem, which seeks an informative feature s*(X) that minimizes the loss (10) for given weights and bias. In particular, we can show that the solution of this backward feature projection is precisely symmetric to the forward one.
Theorem 3. For fixed Ξ^Y and d̃, the optimal Ξ^{X*} minimizing (10) is given by (13), and the optimal feature function s*, decomposed into s̃* and µ_s*, is given by (14), where Λ_ṽ(Y) denotes the covariance matrix of ṽ(Y).

Proof. See Appendix D.
Finally, when both s and (v, b) (and hence Ξ^X, Ξ^Y, d) can be designed, the optimal (Ξ^Y, Ξ^X) corresponds to a low-rank factorization of B̃, and the solutions coincide with universal feature selection. Theorem 4. The optimal solutions for the weights and bias minimizing (10) are given by d̃(y) = −µ_s^T ṽ(y), and (Ξ^Y, Ξ^X)* chosen as the k left and right singular vectors of B̃ associated with the largest singular values.

Proof. See Appendix E.
Therefore, we conclude that when both s and (v, b) are designable, the learning of a neural network extracts the most correlated aspects of the input data X and the label Y, which, by universal feature selection, are the most informative features for data inference.
In the practical learning process of a DNN, back-propagation updates the weights of the softmax layer and those of the previous layer(s) in an iterative manner. As illustrated by Lemma 3, such iterative updates converge to the same solution as alternating between the forward feature projection (11) and the backward feature projection (13), which is precisely the power method for computing the SVD of B̃ [41], also known as the ACE (Alternating Conditional Expectation) algorithm [36].
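This alternation is easy to simulate. The sketch below (NumPy, with a random joint distribution for illustration) normalizes after each forward and backward projection; it is exactly the power method and converges to the top singular pair of B̃:

```python
import numpy as np

rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(20)).reshape(4, 5)   # joint pmf; rows y, columns x
P_x, P_y = P.sum(axis=0), P.sum(axis=1)
B = (P - np.outer(P_y, P_x)) / np.sqrt(np.outer(P_y, P_x))

xi_x = rng.normal(size=5)                      # random initial feature direction
for _ in range(200):
    xi_y = B @ xi_x                            # forward feature projection
    xi_y /= np.linalg.norm(xi_y)
    xi_x = B.T @ xi_y                          # backward feature projection
    xi_x /= np.linalg.norm(xi_x)

sigma_top = np.linalg.svd(B, compute_uv=False)[0]
assert np.isclose(xi_y @ B @ xi_x, sigma_top, atol=1e-6)
```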

Remark 2.
From Theorem 4, for a neural network with sufficient expressive power, the trained feature depends only on the distribution of input data rather than the training process. It is worth mentioning that this result does not contradict the practice that trained weights in hidden layers can be different during each training run. In fact, due to the over-parameterized nature of practical network designs, there exist multiple choices of weights in hidden layers to express the same optimal feature s(x).

Network with Restricted Expressive Power
The analysis of the previous section considered neural networks with ideal expressive power, where the feature s(X) can be selected as any desired function. In general, however, the form of the feature functions that can be generated is often limited by the network structure. In the following, we consider networks with restricted expressive power to characterize the impact of the network structure on the extracted feature.
For illustration, we consider a neural network with a hidden layer of k nodes and a zero-mean continuous input t = [t_1 ··· t_m]^T ∈ R^m to this hidden layer, where t is assumed to be a function t(x) of some discrete variable X. Our goal is to analyze the weights and bias in this layer from labeled samples (t(x_i), y_i). Assuming the activation function of the hidden layer is a smooth function σ(·), the output s_z(X) of the zth hidden node is s_z(X) = σ(w^T(z)t(X) + c(z)), (15) where w(z) ∈ R^m and c(z) ∈ R are the weights and bias from the input layer to the hidden layer, as shown in Figure 2. We denote s = [s_1 ··· s_k]^T as the input vector to the output classification layer.

Figure 2. A multi-layer neural network, where the expressive power of the feature mapping s(·) is restricted by the hidden representation t. All hidden layers previous to t are fixed, represented by the "pre-processing" module.
To interpret the feature selection in hidden layers, we fix (v(y), b(y)) at the output layer and consider the problem of designing (w(z), c(z)) to minimize the loss function (8) at the output layer. Ideally, we should have picked w(z) and c(z) to generate s(x) to match s * (x) from (14), which minimizes the loss. However, here we have the constraint that s(x) must take the form of (15) and, intuitively, the network should select w(z), c(z) so that s(x) is close to s * (x). Our goal is to quantify the notion of such closeness.
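As a concrete instance of (15), the sketch below evaluates one hidden node with a sigmoid activation (an illustrative choice) and checks its first-order expansion when the weights are small, which is the situation exploited in the local analysis:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(7)
m = 5
t = rng.normal(size=m)         # hidden representation t(x) for one sample
w = rng.normal(size=m)         # direction of the hidden-node weights w(z)
c = 0.3                        # bias c(z)
eps = 1e-3                     # small scale of the weights in the local regime

exact = sigmoid(eps * (w @ t) + c)
# First-order expansion: sigma(c) + sigma'(c) * eps * w^T t, with
# sigma'(c) = sigma(c) (1 - sigma(c)) for the sigmoid.
approx = sigmoid(c) + sigmoid(c) * (1 - sigmoid(c)) * eps * (w @ t)
assert abs(exact - approx) < 1e-5
```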
To develop insights into feature selection in hidden layers, we again focus on the local analysis regime, where the weights and bias are assumed to satisfy the local constraint (16). Then, since t is zero-mean, we can express (15) in the linearized form (17). Moreover, we define a matrix B̃_1 with (z, x)th entry B̃_1(z, x) ≜ √(P_X(x)) σ′(c(z)) s̃_z*(x), which can be interpreted as a generalized CDM for the hidden layer. Furthermore, we denote by ξ_1^X(x) ≜ √(P_X(x)) t(x) the information vector of t(x), with the matrix Ξ_1^X defined as Ξ_1^X ≜ [ξ_1^X(1) ··· ξ_1^X(|X|)]^T. The following theorem characterizes the loss (8).
Theorem 5. Given the weights and bias (v, b) at the output layer, and for any input feature s, denote by L(s) the loss (8) evaluated with respect to (v, b) and s. Then, under the constraints (16), L(s) can be expressed in the form (20), which quantifies the closeness between s and s* in terms of the loss (8). Our goal is then to minimize (20), which can be separated into two optimization problems, (21) and (22). Note that the optimization problem (21) is similar to the one appearing in Lemma 3, and its optimal solution is given by (23). Therefore, solving for the optimal weights in the hidden layer can be interpreted as projecting s̃*(x) onto the subspace of feature functions spanned by t(x), to find the closest expressible function. In addition, problem (22) chooses µ_s (and hence the bias c(z)) to minimize a quadratic term similar to η_(v,b)(s) in (10). As with the analysis of the parameters in the last layer, we can obtain analytical solutions for the hidden-layer parameters, e.g., µ_s* and w*, with detailed discussions provided in Appendix G.
Overall, we observe the correspondence between (11), (14) and (21), (22), and interpret both operations as feature projections. Our argument generalizes to any intermediate layer of a multi-layer network, with all previous layers viewed as fixed pre-processing that specifies t(x), and all subsequent layers determining s*. Then, the iterative procedure of back-propagation can be viewed as an alternating projection seeking a fixed-point solution over the entire network. This final fixed point, even under the local assumption, might not be the SVD solution of Theorem 4, because the limited expressive power of the network often makes it impossible to generate the desired feature function. In such cases, the concept of feature projection can be used to quantify this gap, and thus to measure the quality of the selected features.

Scoring Neural Networks
Given a learning problem, it is useful to tell whether or not some extracted features are informative [42]. Our previous development naturally gives rise to a performance metric.
Definition 5. Given a feature s(x) ∈ R^k and weights v(y) ∈ R^k with corresponding information matrices Ξ^X and Ξ^Y, the H-score H(s, v) is defined by (24). In addition, for a given s(x), we define the one-sided H-score H(s) by (26). The H-score can be used to measure the quality of the features generated at any intermediate layer of the network. It is related to (20) when choosing the optimal bias and Θ as the identity matrix. This can be understood as taking the output s(x) of this layer and feeding it directly to a softmax output layer with v(y) used as the weights; H(s, v) then measures the resulting performance. Note that v(y) here can be an arbitrary function of Y, not necessarily the weights on the next layer computed by the network. When the optimal v*(y) as defined in (12) is used, the resulting performance becomes the one-sided H-score H(s), which measures the quality of s(x). In addition, comparing (26) with (7), the performance measure H(s) also coincides with the information metric (7), up to a scale factor.
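As an illustration of how such a score can be estimated from samples, the sketch below assumes the empirical form H(s) = ½ tr(cov(s)⁻¹ cov(E[s(X)|Y])), which is our reading of the sample version of (26); the data here are synthetic:

```python
import numpy as np

def h_score(s, y, num_labels):
    """Estimate (1/2) tr( cov(s)^{-1} cov(E[s|Y]) ) from samples (s_i, y_i)."""
    s_c = s - s.mean(axis=0)                             # center the features
    cov_s = s_c.T @ s_c / len(s)
    means = np.stack([s_c[y == c].mean(axis=0) for c in range(num_labels)])
    p_y = np.bincount(y, minlength=num_labels) / len(y)
    cov_cond = (means * p_y[:, None]).T @ means          # cov of conditional means
    return 0.5 * np.trace(np.linalg.solve(cov_s, cov_cond))

rng = np.random.default_rng(5)
y = rng.integers(0, 3, size=2000)
s_good = np.eye(3)[y] + 0.1 * rng.normal(size=(2000, 3))  # feature informative about y
s_bad = rng.normal(size=(2000, 3))                        # feature independent of y

assert h_score(s_good, y, 3) > h_score(s_bad, y, 3)
assert h_score(s_bad, y, 3) < 0.05
```

An informative feature concentrates the conditional means away from the global mean, driving the score up, while a feature independent of Y scores near zero.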
Specifically, for a given dataset and a feature extractor that generates s(·), the H-score H(s) can be efficiently computed from the second equality of (26). In addition, when the H-score is used to compare the performance of different feature extractors (models), the model complexity has to be taken into account to mitigate overfitting. To this end, we adopt the Akaike information criterion (AIC) and define the AIC-corrected H-score (27) for comparing different models, where n_p and n_s denote the number of parameters in the model and the training sample size, respectively.
In current practice, the cross-entropy −E_{P_XY}[log P̃_Y|X] is often used as the performance metric. One can, in principle, also use the log-loss to measure the effectiveness of the selected feature at the output of an intermediate layer [42]. However, one problem with this metric is that, for a given problem, it is not clear what value of log-loss one should expect, as the log-loss is generally unbounded. In contrast, the H-score can be directly computed from the data samples and has a clear upper bound. Indeed, it follows from Lemma 1 that, for a k-dimensional feature s and weights v, we have the sequence of inequalities H(s, v) ≤ H(s) ≤ (1/2) Σ_{i=1}^{k} σ_i² ≤ k/2, where σ_i denotes the ith singular value of B̃. In particular, the first "≤" follows from the definition (24), and the gap between H(s, v) and H(s) measures the optimality of the weights v; the second "≤" follows from the first equality of (26), and the gap between the two sides characterizes the difference between the chosen feature and the optimal solution, which is a useful measure of how restrictive (lacking in expressive power) the network structure is; the last "≤" follows from the fact that σ_i ≤ 1 (cf. Lemma 1) and bounds the dependency between the data variable and the label for the given dataset. In Section 3.4.3, we validate this metric on real data.

Experimental Validation of Theorem 4
We first validate Theorem 4, i.e., the optimal feature extracted by a network with ideal expressive power. Here, we consider discrete data with alphabet sizes |X| = 8 and |Y| = 6, and construct the network shown in Figure 3. Specifically, the network input is the one-hot encoding of X, i.e., [1_X(1), ..., 1_X(|X|)]^T, where 1_X(x) equals one if and only if X = x, and zero otherwise. Then, the feature s(X) is generated by a linear layer, with the sigmoid function used as the activation function. For ease of comparison and presentation, we set the feature dimension to k = 1, since otherwise the optimal feature (cf. Theorem 4) lies in a subspace and is non-unique. It can be verified that this network has ideal expressive power, i.e., with proper weights in the first layer, s(X) can express any desired function up to scaling and shifting.
To compare the result trained by the neural network with that of Theorem 4, we first randomly generate a distribution P_XY, and then draw n = 100,000 pairs of (X, Y) samples independently. We then train the network using batch gradient descent, where we apply Nesterov momentum [43] with the momentum hyperparameter set to 0.9. In addition, we set the learning rate to 4 with a decay factor of 0.01 and clip gradients whose norms exceed 0.5. After training, the learned values of s(x), v(y), and b(y) are shown in Figure 4 and compared with the theoretical results. From the figure, we observe that the training results match our theoretical analyses.

Experimental Validation of Theorem 5
In addition, we validate Theorem 5 with the neural network depicted in Figure 5, using the same settings of X, Y. Specifically, the numbers of neurons in the hidden layers are set to m = 4 and k = 3, where t(X) is randomly generated from X, and we choose the sigmoid function as the activation σ(·) to generate s(x). We then fix the weights and bias at the output layer and train the weights w(1), w(2), w(3) and bias c in the hidden layer to optimize the log-loss. Specifically, we use batch gradient descent with the Nesterov momentum hyperparameter set to 0.9. In addition, we set the learning rate to 4 with a decay factor of 10^−6 and clip gradients whose norms exceed 0.1. After training, Figure 6 shows the match between the learned results and the corresponding theoretical values.

Experimental Validation of H-Score
To validate the H-score as a performance measure for extracted features, we compare the H-score and the classification accuracy of DNNs on image classification tasks. Specifically, we use the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset [22] and extract features using several deep neural networks with representative architecture designs [44-49]. After training the feature extractors on the ILSVRC2012 training set, we compute the H-score of the feature in the last hidden layer, as well as the classification accuracy on the ILSVRC2012 validation set (here, we use the validation set for testing, as the labels of the ILSVRC2012 test set have not been publicly released). The results are summarized in Table 1, where H_AIC(s) is the AIC-corrected H-score defined in (27), with n_p the number of model parameters and n_s = 1,300,000 the number of training samples in ImageNet. The AIC-corrected H-score is consistent with the classification accuracy, which validates the effectiveness of the H-score as a measure for neural networks.

Discussion
Our characterization gives an information-theoretic interpretation of the feature extraction process in DNNs, which also provides a practical performance measure for scoring neural networks. Different from empirical studies focusing on specific datasets [7], our development is based on the probability distribution space, which is more general and can also provide theoretical insights. Moreover, the information-theoretic framework allows us to obtain direct operational meaning and better interpretations of the solutions, compared with optimization-based theoretical characterizations, e.g., [11,13].
As a first step in establishing a rigorous framework for DNN analysis, the present work can be extended in both theoretical and practical directions. From the theoretical perspective, one extension is to investigate the analytical properties of general DNNs, using the theoretical insights obtained from the local analysis regime. For example, it was shown in [50] that the symmetry between features and weights in DNNs established in the local analysis regime (cf. Section 3.2.1) also holds for general probability distributions. Another extension is to apply the framework to investigate the optimal features for structured data or networks, e.g., data with a sparsity structure [51].
From the practical perspective, in addition to the demonstrated example of evaluating existing DNN models (cf. Section 3.4.3), the H-score can also be used as an objective function in designing learning algorithms. In particular, such usages have been illustrated in multi-modal learning [52] and transfer learning [53] tasks.

Conclusions
In this paper, we apply local information geometric analysis and provide an information-theoretic interpretation of the feature extraction scheme in DNNs. We first establish an information metric for features in inference tasks by formalizing the information-theoretic feature selection problem. In addition, we demonstrate that the features extracted by DNNs coincide with the information-theoretically optimal features, with the same metric, called the H-score, measuring the performance of features. Furthermore, we discuss the usage of the H-score for measuring the effectiveness of DNNs. Our framework demonstrates a connection between practical deep learning implementations and information-theoretic characterizations, which can provide theoretical insights for DNN analysis and the design of learning algorithms.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Proof of Theorem 1
We commence with the characterization of the error exponent.
Lemma A1. Given a reference distribution P_X ∈ relint(P^X), a constant ε > 0, and integers n and k, let x_1, ..., x_n denote i.i.d. samples from one of P_1 or P_2, where P_1, P_2 ∈ N_ε^X(P_X). To decide whether P_1 or P_2 is the generating distribution, a sequence of k-dimensional statistics h^k = (h_1, ..., h_k) is constructed as h_i = (1/n) Σ_{l=1}^{n} f_i(x_l), where (f_1(X), ..., f_k(X)) are zero-mean, unit-variance, and uncorrelated with respect to P_X, i.e., E[f_i(X)] = 0 and E[f_i(X) f_j(X)] = δ_ij. Then, the error probability of the decision based on h^k decays exponentially in n as n → ∞, with (Chernoff) exponent given by (A5), where ξ_i ↔ f_i, i ∈ {1, ..., k}, are the corresponding information vectors.
Proof of Lemma A1. Since the rule is to decide by comparing the projection to a threshold, via Cramér's theorem [54], the error exponent under P_j (j = 1, 2) is given by (A6). Now, since (A2) holds, we obtain a constraint that we express compactly in terms of ξ^k ≜ (ξ_1, ..., ξ_k); hence, the constraint (A7) is expressed in information vectors. In turn, the optimal P in (A6), which we denote by P*, lies in the exponential family through P_j with natural statistic f^k(x), i.e., the k-dimensional family whose members are of the form (A10), for which the associated information vector follows, where we have used that this holds for all Q_X ∈ N_ε^X(P_X) with information vector φ ↔ Q_X. As a result, using (A3) and (A9), we obtain that the intersection with the linear family (A7) is at P* = P_{θ*^k}, and thus E_j(λ) = D(P* ‖ P_j), where to obtain (A11) we exploit the local approximation of the KL divergence [18], to obtain (A12) we exploit (A10), to obtain (A13) we again exploit (A3), and to obtain (A14) we use the preceding identity. Finally, E_1(λ) = E_2(λ) when λ = 1/2, so the overall error probability has exponent (A5).
Then, the following lemma demonstrates a property of information vectors in a Markov chain.
Lemma A2. Given the Markov relation X ↔ Y ↔ V and any v ∈ V, let φ_v^{X|V} and φ_v^{Y|V} denote the information vectors associated with P_{X|V}(·|v) and P_{Y|V}(·|v), respectively. Then we have

φ_v^{X|V} = B̃^T φ_v^{Y|V}. (A15)

Proof of Lemma A2. From the Markov relation we have P_{X|Y,V}(x|y, v) = P_{X|Y}(x|y), and thus

P_{X|V}(x|v) = ∑_{y∈Y} P_{X|Y,V}(x|y, v) P_{Y|V}(y|v) = ∑_{y∈Y} P_{X|Y}(x|y) P_{Y|V}(y|v).

As a result,

P_{X|V}(x|v) − P_X(x) = ∑_{y∈Y} P_{X|Y}(x|y) [P_{Y|V}(y|v) − P_Y(y)],

from which we obtain the corresponding information vector

φ_v^{X|V}(x) = ∑_{y∈Y} B̃(y, x) φ_v^{Y|V}(y), (A16)

where the last equality follows from the fact that ∑_{y∈Y} P_{X|Y}(x|y) P_Y(y) = P_X(x). Finally, rewriting (A16) in matrix form, we obtain (A15).
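Lemma A2 can be checked numerically. The sketch below (our own illustration, not code from the paper) builds a random joint distribution P_{X,Y} and a channel P_{V|Y}, forms the information vectors as normalized perturbations (Q(·) − P(·))/(ε√P(·)), and verifies the identity φ_v^{X|V} = B̃^T φ_v^{Y|V}. Here B̃ is taken to be the centered matrix with entries (P_{X,Y}(x, y) − P_X(x)P_Y(y))/√(P_X(x)P_Y(y)); the centering term does not affect the identity, since φ_v^{Y|V} is orthogonal to √P_Y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random joint distribution P_{X,Y} (rows x, cols y) and a channel P_{V|Y},
# so that X <-> Y <-> V forms a Markov chain.
P_XY = rng.random((4, 3)); P_XY /= P_XY.sum()
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)
P_X_given_Y = P_XY / P_Y                       # columns are P_{X|Y}(.|y)

P_V_given_Y = rng.random((3, 2))
P_V_given_Y /= P_V_given_Y.sum(axis=1, keepdims=True)

P_YV = P_Y[:, None] * P_V_given_Y
P_V = P_YV.sum(axis=0)
P_Y_given_V = P_YV / P_V                       # columns are P_{Y|V}(.|v)
P_X_given_V = P_X_given_Y @ P_Y_given_V        # Markov: P_{X|V} = P_{X|Y} P_{Y|V}

# Information vectors of P_{X|V}(.|v) and P_{Y|V}(.|v) around the marginals
eps, v = 1.0, 0                                # eps rescales both sides identically
phi_X = (P_X_given_V[:, v] - P_X) / (eps * np.sqrt(P_X))
phi_Y = (P_Y_given_V[:, v] - P_Y) / (eps * np.sqrt(P_Y))

# Centered divergence transition matrix, rows indexed by y, columns by x
B = (P_XY.T - np.outer(P_Y, P_X)) / np.sqrt(np.outer(P_Y, P_X))
```

Running the check for each v confirms that the information vector of P_{X|V}(·|v) is the image of that of P_{Y|V}(·|v) under B̃^T, exactly as (A15) states.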
In addition, the following lemma is useful for dealing with the expectation over an RIE.

Lemma A3. Let z be a spherically symmetric random vector of dimension M, i.e., Qz and z have the same distribution for any orthogonal matrix Q. Then Λ_z ≜ E[z z^T] = (E[‖z‖²]/M) I.

Proof of Lemma A3. By definition we have Λ_z = Q Λ_z Q^T for any orthogonal Q; hence, Λ_z is diagonal with equal diagonal entries, i.e., Λ_z = λI for some λ ≥ 0. Then, from tr(Λ_z) = E[‖z‖²] = λM, we obtain λ = E[‖z‖²]/M.

Proceeding to our proof of Theorem 1, by the definition of feature functions, we have E_{P_X}[f_i(X)] = 0, i = 1, …, k.
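Lemma A3 is easy to sanity-check by Monte Carlo. The snippet below (illustrative; the variable names are ours) draws a spherically symmetric vector as an isotropic direction times an independent random radius, and confirms that its second-moment matrix is approximately (E[‖z‖²]/M)·I.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 4, 200_000

# Spherically symmetric z: a uniform direction on the unit sphere times an
# independent radius with an arbitrary law.
g = rng.standard_normal((n, M))
u = g / np.linalg.norm(g, axis=1, keepdims=True)
r = rng.uniform(0.5, 2.0, size=(n, 1))
z = r * u

Lambda_z = (z[:, :, None] * z[:, None, :]).mean(axis=0)   # empirical E[z z^T]
lam = (np.linalg.norm(z, axis=1) ** 2).mean() / M         # empirical E[||z||^2]/M
```

Any radial law works here; only the isotropy of the direction matters, which is exactly what the proof exploits.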
Then, from Lemma A1, the error exponent of distinguishing v and v′ based on h̃^k is expressed in terms of the information vectors, where φ_v^{X|V} denotes the information vector associated with P_{X|V}(·|v), ξ̃_i^X denotes the information vector of f̃_i, and Ξ̃^X ≜ [ξ̃_1^X, …, ξ̃_k^X]. Since the optimal decision rule is linear, the error exponent is invariant under linear transformations of the statistics, i.e., (A19), where the last equality follows from Lemma A2. As a result, taking the expectation of (A19) over a given RIE yields the averaged exponent, where we have exploited Lemma A3. Finally, the error exponent (7) can be obtained by noting the definition of f̃^k.

Appendix B. Proof of Lemma 2
We first prove two useful lemmas.
Lemma A4. For distributions P ∈ relint(P^X), Q, R ∈ P^X, and sufficiently small ε, if D(P‖Q) ≤ ε² and D(P‖R) ≤ ε², then there exists a constant C > 0 independent of ε, such that D(Q‖R) ≤ Cε².
Proof of Lemma A4. Denote by ‖·‖_1 the ℓ_1-distance between distributions, i.e., ‖P − Q‖_1 ≜ ∑_{x∈X} |P(x) − Q(x)|. Then, Pinsker's inequality [14] gives ‖P − Q‖_1 ≤ √(2 D(P‖Q)) ≤ √2 · ε, and similarly for R, which implies (A21) and (A22). In addition, with p_min ≜ min_{x∈X} P(x), for all x ∈ X we can bound R(x) from below, where to obtain (A24) we have used (A21). Note that since P ∈ relint(P^X) we have p_min > 0, and thus R(x) > p_min/2 for sufficiently small ε. As a result, D(Q‖R) ≤ Cε², where to obtain (A26) we have used the fact that the KL divergence is upper bounded by the corresponding χ²-divergence [55], and to obtain (A29) we have used (A22).
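Lemma A4 can be illustrated numerically. The sketch below (our own illustration, with hypothetical perturbation directions) fixes a reference P in the interior of the simplex, perturbs it in two zero-sum directions scaled so that D(P‖Q) and D(P‖R) stay below ε², and checks that D(Q‖R)/ε² remains bounded as ε shrinks.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p||q) in nats for strictly positive pmfs."""
    return float(np.sum(p * np.log(p / q)))

P = np.array([0.3, 0.2, 0.4, 0.1])

# Two fixed zero-sum perturbation directions (hypothetical choices),
# scaled so that D(P||Q) and D(P||R) stay below eps^2.
d1 = 0.7 * np.array([0.5, -0.25, -0.5, 0.25])
d2 = 0.7 * np.array([-0.5, 0.5, 0.25, -0.25])

ratios = []
for eps in [0.1, 0.05, 0.02, 0.01]:
    Q, R = P + eps * d1, P + eps * d2
    assert kl(P, Q) <= eps ** 2 and kl(P, R) <= eps ** 2
    ratios.append(kl(Q, R) / eps ** 2)
```

The ratio D(Q‖R)/ε² stabilizes near a constant as ε → 0, which is the behavior the lemma guarantees.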
Lemma A5. For all (x, y) ∈ X × Y, we have a bound relating τ(x, y) to the divergence D(P_X P_Y ‖ P_X P̃_{Y|X}^{(s,v,b)}), where P̃_{Y|X}^{(s,v,b)} is as defined in (4), and where we have defined τ(x, y) ≜ ṽ^T(y) s̃(x) + d̃(y).
Proof of Lemma A5. First, we can rewrite the conditional distribution P̃_{Y|X}^{(s,v,b)} in exponential form. Then, the KL divergence D(P_X P_Y ‖ P_X P̃_{Y|X}^{(s,v,b)}) can be expanded accordingly, where to obtain the last equality we have used the fact that E_{P_X P_Y}[τ(X, Y)] = 0. As a result, we obtain the claimed bound, where the last inequality follows from Jensen's inequality.

Proceeding to our proof of Lemma 2, first note that when v = d = 0, we have P̃_{Y|X}^{(s,v,b)} = P_Y. As a result, the optimal v, d for (8) satisfy the corresponding inequalities, where to obtain the second inequality we have again exploited the χ²-divergence as an upper bound on the KL divergence [55], and to obtain the last inequality we have used the definition of ε-dependency.
As P_{X,Y} ∈ relint(P^{X×Y}), from Lemma A4, there exist C > 0 and ε_1 > 0 such that D(P_X P_Y ‖ P_X P̃_{Y|X}^{(s,v,b)}) < Cε² for all ε < ε_1. Furthermore, from Lemma A5, for all (x, y) ∈ X × Y and ε ∈ (0, ε_1), we have (A35). Note that the right-hand side of (A35) satisfies (A36). Therefore, there exists δ > 0 independent of ε, such that for all |τ(x, y)| ≤ δ, we have (A37). In addition, if |τ(x, y)| > δ, we have a chain of inequalities, in which the second follows from the monotonicity of the function t ↦ P_Y(y) e^t + (1 − P_Y(y)) e^{−P_Y(y) t/(1−P_Y(y))}, and the third from (A36). As a result, we obtain a lower bound on log(P_Y(y) e^{τ(x,y)} + (1 − P_Y(y)) e^{−P_Y(y) τ(x,y)/(1−P_Y(y))}). Hence, (A35) becomes (A38), from which we can obtain τ(x, y) = O(ε). To see this, let ε_0 be sufficiently small. Then, for all ε < ε_0, (A38) implies |τ(x, y)| < C′ε, where the constant C′ > 0 depends only on C, P_X(x), and P_Y(y).
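Both this proof and that of Lemma A4 use the bound D(P‖Q) ≤ χ²(P‖Q) from [55]; since D(P‖Q) ≤ log(1 + χ²(P‖Q)) ≤ χ²(P‖Q) by Jensen's inequality, it holds for arbitrary pairs of distributions with common support. A quick randomized check (illustrative code, not from the paper):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p||q) in nats."""
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):
    """chi-squared divergence between pmfs p and q."""
    return float(np.sum((p - q) ** 2 / q))

rng = np.random.default_rng(3)
gaps = []
for _ in range(1000):
    p = rng.random(5); p /= p.sum()
    q = rng.random(5); q /= q.sum()
    gaps.append(chi2(p, q) - kl(p, q))
```

Every gap is strictly positive, confirming that the χ²-divergence dominates the KL divergence for all the random pairs drawn.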

Appendix C. Proof of Lemma 3
Proof. From Lemma 2, there exists C > 0 such that for all (x, y) ∈ X × Y, we have |ṽ^T(y) s̃(x) + d̃(y)| < Cε, which implies |μ_{s̃}^T ṽ(y) + d̃(y)| < Cε, (A40) and we can take μ_{s̃}^T ṽ(y) + d̃(y) = 0 without loss of generality. Then, (4) can be rewritten accordingly, and the numerator can be expanded, where we have used (A39). The remaining factor can be handled similarly. As a result, (A42) can be written in a form which implies P_X P̃_{Y|X}^{(s,v,b)} ∈ N_{Cε}^{X×Y}(P_X P_Y) for sufficiently small ε. In addition, the local assumption on the distributions implies that P_{X,Y} ∈ N_ε^{X×Y}(P_X P_Y) ⊂ N_{Cε}^{X×Y}(P_X P_Y). Again, from the local approximation of the KL divergence [18], we obtain the desired expansion, where to obtain (∗) we have used (A40) and (A41), together with the fact that |B̃(y, x)| < ε for all (x, y) and the corresponding summation identity for B̃ over x ∈ X, y ∈ Y.

Appendix D. Proofs of Theorems 2 and 3
Theorems 2 and 3 can be proved based on Lemma 3.

Proofs of Theorems 2 and 3.
Note that the value of d(·) only affects the second term of the KL divergence; hence, we can always choose d(·) such that d̃(y) + μ_{s̃}^T ṽ(y) = 0. Then, the (Ξ^Y, Ξ^X) pair should be chosen to minimize ‖B̃ − Ξ^Y Ξ^{X T}‖_F². Setting the derivative to zero (we use the denominator-layout notation of matrix calculus, where the scalar-by-matrix derivative has the same dimension as the matrix), the optimal Ξ^Y for fixed Ξ^X is

Ξ^{Y*} = B̃ Ξ^X (Ξ^{X T} Ξ^X)^{-1}. (A47)

(Here, we assume the matrix Ξ^{X T} Ξ^X = Λ_{s̃(X)} is invertible; when Ξ^{X T} Ξ^X is singular, a similar result holds with the ordinary matrix inverse replaced by the Moore–Penrose inverse.) Since √P_Y^T B̃ = 0^T, we have √P_Y^T Ξ^{Y*} = 0^T, which demonstrates that Ξ^{Y*} is a valid matrix for a zero-mean feature vector.
To express Ξ^{Y*} of (A47) in terms of s and v, we can make use of the correspondence between feature functions and information vectors. We can show that, for a zero-mean feature function f(X) with corresponding information vector φ, we have the correspondence E_{P_{X|Y}}[f(X)|Y] ↔ B̃φ. To see this, note that the y-th element of the information vector B̃φ is given by the corresponding weighted sum over x ∈ X. Using similar methods, we can verify that Λ_{s̃(X)} = Ξ^{X T} Ξ^X. As a result, (A47) is equivalent to an expression for ṽ. By a symmetry argument, we can also obtain the first two equations of Theorem 3. To obtain the third equations of these two theorems, we need to minimize the remaining term. For given ṽ and μ_{s̃}, the optimal d̃ is obtained in closed form, and the corresponding η^{(v,b)}(s) = 0. In addition, for given d̃ and ṽ, we set ∂η^{(v,b)}(s)/∂μ_{s̃} = 0 to obtain the optimal μ_{s̃}.
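The correspondence E_{P_{X|Y}}[f(X)|Y] ↔ B̃φ can also be verified numerically. In the sketch below (our illustration), the information vector of a zero-mean feature f is taken as φ(x) = √P_X(x) f(x), the information vector of g(Y) = E[f(X)|Y] as √P_Y(y) g(y), and B̃ as the centered matrix with entries (P_{X,Y}(x, y) − P_X(x)P_Y(y))/√(P_X(x)P_Y(y)); for zero-mean f, the centering term has no effect. The check also confirms √P_Y^T B̃ = 0^T, the fact used above to show that Ξ^{Y*} corresponds to a zero-mean feature vector.

```python
import numpy as np

rng = np.random.default_rng(4)
P_XY = rng.random((4, 3)); P_XY /= P_XY.sum()   # joint distribution, rows x, cols y
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)

# Zero-mean feature f(X) and its information vector phi
f = rng.standard_normal(4)
f -= f @ P_X                                    # enforce E_{P_X}[f(X)] = 0
phi = np.sqrt(P_X) * f

# Centered divergence transition matrix, rows indexed by y, columns by x
B = (P_XY.T - np.outer(P_Y, P_X)) / np.sqrt(np.outer(P_Y, P_X))

# g(y) = E[f(X) | Y = y], and its information vector sqrt(P_Y) * g
g = f @ (P_XY / P_Y)                            # columns of P_XY/P_Y are P_{X|Y}(.|y)
lhs = np.sqrt(P_Y) * g
rhs = B @ phi
```

The two vectors agree, so applying B̃ to information vectors implements conditional expectation of the corresponding features.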

Appendix E. Proof of Theorem 4
Proof. From Lemma 3, choosing the optimal (Ξ^Y, Ξ^X) is equivalent to solving the matrix factorization problem for B̃. Since both Ξ^Y and Ξ^X have rank no greater than k, from the Eckart–Young–Mirsky theorem [56], the optimal choice of Ξ^Y Ξ^{X T} is the truncated singular value decomposition of B̃ with the top k singular values. As a result, (Ξ^Y, Ξ^X)* are given by the left and right singular vectors of B̃ corresponding to the largest k singular values. The optimality of the bias d̃(y) = −μ_{s̃}^T ṽ(y) has already been shown in Appendix D.
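The Eckart–Young–Mirsky step can be illustrated concretely. The snippet below (illustrative; B here is just a random matrix standing in for B̃) computes the rank-k truncated SVD, checks that its squared Frobenius error equals the tail energy ∑_{i>k} σ_i², and verifies that alternating least-squares refinements of random rank-k factorizations never beat it.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((6, 5))     # stand-in for the matrix to be factorized
k = 2

U, s, Vt = np.linalg.svd(B, full_matrices=False)
B_k = U[:, :k] * s[:k] @ Vt[:k, :]  # rank-k truncated SVD
best_err2 = np.linalg.norm(B - B_k) ** 2

# Alternating least squares from random starts never beats the truncated SVD
for _ in range(100):
    L = rng.standard_normal((6, k))
    R = np.linalg.lstsq(L, B, rcond=None)[0].T          # optimal right factor
    L = np.linalg.lstsq(R, B.T, rcond=None)[0].T        # optimal left factor
    assert np.linalg.norm(B - L @ R.T) ** 2 >= best_err2 - 1e-9
```

This is exactly the sense in which the top-k singular vectors of B̃ give the optimal (Ξ^Y, Ξ^X) pair.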

Appendix F. Proof of Theorem 5
The following lemma is useful to prove Theorem 5.
Lemma A6 (Pythagorean theorem). Let Ξ^{X*} be the optimal matrix for given Ξ^Y as defined in (13). Then, for any Ξ^X,

‖B̃ − Ξ^Y Ξ^{X T}‖_F² = ‖B̃ − Ξ^Y Ξ^{X* T}‖_F² + ‖Ξ^Y (Ξ^X − Ξ^{X*})^T‖_F².

Proof of Lemma A6. Denote by ⟨U, V⟩ ≜ tr(U^T V) the Frobenius inner product of matrices U and V. Then, ⟨B̃ − Ξ^Y Ξ^{X* T}, Ξ^Y (Ξ^{X*} − Ξ^X)^T⟩ = 0, since the residual B̃ − Ξ^Y Ξ^{X* T} is orthogonal to the column space of Ξ^Y. As a result, expanding ‖B̃ − Ξ^Y Ξ^{X T}‖_F², we obtain the claimed identity, which finishes the proof.
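Lemma A6 is the standard Pythagorean identity for linear least squares. A numerical sketch (illustrative names, assuming, as in (13), that Ξ^{X*} is the least-squares optimum for the given Ξ^Y, computed here via the pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((6, 5))        # stand-in for B-tilde
Xi_Y = rng.standard_normal((6, 2))

# Least-squares optimum for the given Xi_Y
Xi_X_star = (np.linalg.pinv(Xi_Y) @ B).T

# Arbitrary competitor
Xi_X = rng.standard_normal((5, 2))

fro2 = lambda M: np.linalg.norm(M) ** 2
lhs = fro2(B - Xi_Y @ Xi_X.T)
rhs = fro2(B - Xi_Y @ Xi_X_star.T) + fro2(Xi_Y @ (Xi_X - Xi_X_star).T)
```

The identity holds exactly because the residual at the optimum is orthogonal to the column space of Ξ^Y, which is the cross term computed in the proof.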
For the first term, we need to express Ξ^X in terms of W and Ξ^X_1. From (17), we obtain

E[s̃_z(X)] = σ(c(z)) + o(ε), (A53)
s̃_z(x) = w^T(z) t̃(x) · σ′(c(z)) + o(ε),

which can be expressed in terms of information vectors. From Theorem 3, we have the corresponding expression for Ξ^{X*}. As a result, we obtain the desired expansion, where the third equality follows from the fact that [cf. (A41)] s̃(x) = O(ε) and ṽ(y) = O(1), and the last equality follows from the definitions B̃_1 ≜ J^{-1} Ξ^{X* T} and Θ ≜ (Ξ^{Y T} Ξ^Y)^{1/2} J.
To obtain μ_{s̃}^*, let us define σ_min ≜ inf_x σ(x) and σ_max ≜ sup_x σ(x). Then, the optimal μ_{s̃} is the solution of

minimize_{μ_{s̃}} (μ_{s̃} − μ_{s̃}^*)^T Λ_{ṽ(Y)} (μ_{s̃} − μ_{s̃}^*) subject to σ_min ⪯ μ_{s̃} ⪯ σ_max. (A59)

If μ_{s̃}^* satisfies the constraint of (A59), then it is the optimal solution. Otherwise, some elements of μ_{s̃}^* are clipped to either σ_min or σ_max, a behavior known as the saturation phenomenon [21]. To obtain W*, note that the optimal W is the corresponding least-squares solution; hence, W* is given by an expression in which the term B̃ Ξ^X_1 (Ξ^{X T}_1 Ξ^X_1)^{-1} corresponds to a feature projection of t̃(X). As a consequence, this multi-layer neural network conducts a generalized feature projection between features extracted from different layers. Note that the projected feature Λ_{t̃}^{-1} E_{P_{t̃|Y}}[t̃ | Y] depends only on the distribution P_{t̃|Y} and does not depend on the distribution P_{X|Y}. Therefore, the above computations can be accomplished without knowing the hidden random variable X, and can be applied to general cases.