Discriminant Analysis under f-Divergence Measures

In statistical inference, the information-theoretic performance limits can often be expressed in terms of a statistical divergence between the underlying statistical models (e.g., in binary hypothesis testing, the error probability is related to the total variation distance between the statistical models). As the data dimension grows, computing the statistics involved in decision-making and the attendant performance limits (divergence measures) faces complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising the performance (the divergence reduces by the data-processing inequality). This paper considers linear dimensionality reduction such that the divergence between the models is maximally preserved. Specifically, this paper focuses on Gaussian models where we investigate discriminant analysis under five f-divergence measures (Kullback–Leibler, symmetrized Kullback–Leibler, Hellinger, total variation, and χ²). We characterize the optimal design of the linear transformation of the data onto a lower-dimensional subspace for zero-mean Gaussian models and employ numerical algorithms to find the design for general Gaussian models with non-zero means. There are two key observations for zero-mean Gaussian models. First, projections are not necessarily along the largest modes of the covariance matrix of the data, and, in some situations, they can even be along the smallest modes. Secondly, under specific regimes, the optimal design of the subspace projection is identical under all the f-divergence measures considered, rendering a degree of universality to the design, independent of the inference problem of interest.


Motivation
Consider a simple binary hypothesis testing problem in which we observe an n-dimensional sample X and aim to discern the underlying model according to

H_0 : X ∼ P versus H_1 : X ∼ Q. (1)

The optimal decision rule (in the Neyman-Pearson sense) involves computing the likelihood ratio dP/dQ(X), and the performance limit (the sum of type I and type II errors) is related to the total variation distance between P and Q. We emphasize that our focus is on settings in which the n elements of X are not statistically independent, in which case the likelihood ratio dP/dQ(X) cannot be decomposed into the product of coordinate-level likelihood ratios. One of the key practical obstacles to solving such problems pertains to the computational cost of finding and performing the statistical tests. This renders a gap between the performance that is information-theoretically viable (unbounded complexity) and the performance possible under bounded computational complexity [1,2].
Dimensionality reduction techniques have become an integral part of statistical analysis in high dimensions [3-6]. In particular, linear dimensionality reduction methods have been developed and used for over a century for various reasons, such as their low computational complexity and simple geometric interpretation, as well as for a multitude of applications, such as data compression, storage, and visualization, to name only a few.

Contributions
The contribution of this paper has two main distinctions from the existing literature on discriminant analysis (DA). First, DA generally focuses on the classification problem of determining the underlying model of the data. Secondly, motivated by the complexity of finding optimal decision rules for classification (e.g., density estimation), the existing criteria used for separation are selected heuristically. In this paper, we study this problem by referring to the family of f-divergences as measures of the distinction between a pair of probability distributions. Such a choice has three main features: (i) it enables designing linear mappings for a wider range of inference problems (beyond classification); (ii) it provides designs that are optimal for the inference problem at hand; and (iii) it enables characterizing the information-theoretic performance limits after the linear mapping. Our analyses are focused on Gaussian models. Even though the design of the linear mapping differs under different f-divergence measures, we have two main observations in the case of zero-mean Gaussian models: (i) the optimal design of the linear mapping is not necessarily along the most dominant components of the data covariance matrix; and (ii) in certain regimes, irrespective of the choice of the f-divergence measure, the design of the linear map that retains the maximal divergence between the two models is robust. In such cases, the optimal design of the linear map is independent of the inference problem at hand, rendering a degree of universality (within the considered space of Gaussian probability measures).
The remainder of the paper is organized as follows. Section 2 provides the linear dimensionality reduction model and an overview of the f-divergence measures considered in this paper. Section 3 formulates the problem and establishes the reductions that facilitate the mathematical analysis in subsequent sections. In Section 4, we provide a motivating operational interpretation for each f-divergence measure and then characterize an optimal design of the linear mapping for zero-mean Gaussian models. Section 5 considers numerical simulations for the inference problems associated with each f-divergence measure of interest for zero-mean Gaussian models. Section 6 generalizes the theory to non-zero-mean Gaussian models and discusses numerical algorithms that help characterize the design of the linear map, and Section 7 concludes the paper. A list of abbreviations used in this paper is provided at the end of the paper.

Preliminaries
Consider a pair of n-dimensional Gaussian models

P : N(μ_P, Σ_P) and Q : N(μ_Q, Σ_Q), (2)

where μ_P, μ_Q and Σ_P, Σ_Q are two distinct mean vectors and covariance matrices, respectively, and P and Q denote their associated probability measures. Nature selects one model and generates a random variable X ∈ R^n. We perform linear dimensionality reduction on X via a matrix A ∈ R^{r×n}, where r < n, rendering

Y = A · X ∈ R^r. (3)

After the linear mapping, the two possible distributions of Y induced by matrix A are denoted by P_A and Q_A, where

P_A : N(A·μ_P, A·Σ_P·A^T) and Q_A : N(A·μ_Q, A·Σ_Q·A^T). (4)

Motivated by the inference problems that we discuss in Section 3, our objective is to design the linear mapping parameterized by matrix A such that the two possible distributions of Y, i.e., P_A and Q_A, are maximally distinguishable. That is, we design A as a function of the statistical models (i.e., μ_P, μ_Q, Σ_P, and Σ_Q) such that relevant notions of f-divergence between P_A and Q_A are maximized. We use a number of f-divergence measures for capturing the distinction between P_A and Q_A, each with a distinct operational meaning under specific inference problems. For this purpose, we denote the f-divergence of Q_A from P_A by

D_f(A) = E_{P_A}[ f( dQ_A/dP_A ) ]. (5)

We use the shorthand D_f(A) for the canonical notation D_f(Q_A ‖ P_A) to emphasize the dependence on A and for simplicity of notation. E_{P_A} denotes the expectation with respect to P_A, and f : (0, +∞) → R is a convex function that is strictly convex at 1 with f(1) = 0. Strict convexity at 1 ensures that the f-divergence between a pair of probability measures is zero if and only if the probability measures are identical. Given the linear dimensionality reduction model in (3), the objective is to solve

max_A D_f(A) (6)

for the following choices of the f-divergence measures.
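As a concrete sanity check of the model above, the parameters of P_A and Q_A follow directly from the fact that a linear map of a Gaussian is Gaussian; a minimal sketch (function name is our own):

```python
import numpy as np

def project_models(A, mu_P, Sigma_P, mu_Q, Sigma_Q):
    """Parameters of P_A and Q_A after the linear map Y = A @ X.
    A linear map of a Gaussian is Gaussian: mean A mu, covariance A Sigma A^T."""
    P_A = (A @ mu_P, A @ Sigma_P @ A.T)
    Q_A = (A @ mu_Q, A @ Sigma_Q @ A.T)
    return P_A, Q_A
```

For instance, with a semi-orthogonal A (rows of the identity), Σ_P = I_n maps to I_r, as used throughout the paper.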

1. Kullback-Leibler (KL) divergence for f(t) = t log t:

D_KL(Q_A ‖ P_A) = E_{P_A}[ (dQ_A/dP_A) · log(dQ_A/dP_A) ]. (7)

We also denote the KL divergence from P_A to Q_A by D_KL(P_A ‖ Q_A).
2. Symmetric KL divergence:

D_SKL(A) = D_KL(Q_A ‖ P_A) + D_KL(P_A ‖ Q_A). (8)

3. Squared Hellinger distance for f(t) = (√t − 1)²:

H²(A) = E_{P_A}[ ( √(dQ_A/dP_A) − 1 )² ]. (9)

4. Total variation distance for f(t) = |t − 1|/2:

d_TV(A) = (1/2) · E_{P_A}[ | dQ_A/dP_A − 1 | ]. (10)

5. χ²-divergence for f(t) = (t − 1)²:

χ²(Q_A ‖ P_A) = E_{P_A}[ ( dQ_A/dP_A − 1 )² ]. (11)

We also denote the χ²-divergence from P_A to Q_A by χ²(P_A ‖ Q_A).

Problem Formulation
In this section, without loss of generality, we focus on the setting in which one of the covariance matrices is the identity matrix and the other is a covariance matrix Σ, in order to avoid cumbersome representations. One key observation is that the design of A under the different measures has strong similarities. We first note that, by defining

μ = Σ_P^{-1/2} · (μ_Q − μ_P), Σ = Σ_P^{-1/2} · Σ_Q · Σ_P^{-1/2}, and Ā = A · Σ_P^{1/2}, (12)

designing A for maximally distinguishing P_A and Q_A is equivalent to designing Ā for maximally distinguishing

N(0, Ā·Ā^T) and N(Ā·μ, Ā·Σ·Ā^T). (13)
Hence, without loss of generality, we focus on the setting where µ P = 0, Σ P = I n , and Σ Q = Σ. Next, we show that determining an optimal design for A can be confined to the class of semi-orthogonal matrices.
Theorem 1. For every A, there exists a semi-orthogonal matrix Ā such that D_f(Ā) = D_f(A).

Proof. See Appendix A.
This observation indicates that we can reduce the unconstrained problem in (6) to the following constrained problem:

Q : max_{A ∈ R^{r×n}} D_f(A) subject to A·A^T = I_r. (14)

We show that the design of A in the case of μ = 0, under the considered f-divergence measures, directly relates to analyzing the eigenspace of the matrix Σ.
Finally, we define the function h_1(A) = A·Σ·A^T, whose eigenvalues, denoted by {γ_i : i ∈ [r]}, we will refer to frequently throughout the paper. In the next sections, we analyze the design of A under the different f-divergence measures. In particular, in Sections 4 and 5, we focus on zero-mean Gaussian models for P and Q, where we provide an operational interpretation of each measure in the dichotomous model in (4). Subsequently, we discuss the generalization to non-zero-mean Gaussian models in Section 6.

Main Results for Zero-Mean Gaussian Models
In this section, we analyze problem Q defined in (14) for each of the f-divergence measures separately. Specifically, for each case, we briefly provide an inference problem as a motivating example, in the context of which we relate the optimal performance limit of that inference problem to the f-divergence of interest. These analyses are provided in Sections 4.1-4.5. Subsequently, we provide the main results on the optimal design of the linear mapping matrix A in Section 4.6.

Motivation
The KL divergence, being the expected value of the log-likelihood ratio, captures, at least partially, the performance of a wide range of inference problems. One specific problem whose performance is completely captured by D_KL(A) is quickest change-point detection. Consider an observation process (time-series) {X_t : t ∈ N} in which the observations X_t ∈ R^n are generated by a distribution with probability measure P specified in (2). This distribution changes to Q at an unknown (random or deterministic) time κ, i.e.,

X_t ∼ P for t < κ, and X_t ∼ Q for t ≥ κ. (19)

Change-point detection algorithms sample the observation process sequentially and aim to detect the change point with minimal delay after it occurs, subject to a false alarm constraint. Hence, the two key figures of merit capturing the performance of a sequential change-point detection algorithm are the average detection delay (ADD) and the rate of false alarms. Whether the change-point κ is random or deterministic gives rise to two broad classes of quickest change-point detection problems, namely the Bayesian setting (κ is random) and the minimax setting (κ is deterministic). Irrespective of their discrepancies in settings and the nature of the performance guarantees, the ADDs of the (asymptotically) optimal algorithms scale inversely with the KL divergence between the post- and pre-change distributions [39]. Hence, after the linear mapping induced by matrix A, for the ADD, we have

ADD_Bayes ≈ c_1 / D_KL(A) and ADD_minimax ≈ c_2 / D_KL(A), (21)

where c_1 and c_2 are constants specified by the false alarm constraints.
Clearly, the design of A that minimizes the ADD is the one that maximizes the disparity between the pre- and post-change distributions P_A and Q_A, respectively.

Connection between D_KL and A
By noting that A is a semi-orthogonal matrix and recalling that the eigenvalues of h_1(A) = A·Σ·A^T are denoted by {γ_i : i ∈ [r]}, simple algebraic manipulations simplify D_KL(Q_A ‖ P_A) to

D_KL(Q_A ‖ P_A) = (1/2) ∑_{i=1}^{r} ( γ_i − log γ_i − 1 ) + (1/2) ‖A·μ‖².

By setting μ = 0 and leveraging Theorem 2, the problem of finding an optimal design for A that solves (14) reduces to selecting r eigenvalues of Σ that solve

max ∑_{i=1}^{r} g_KL(γ_i), where we have defined g_KL(γ) ≜ (γ − log γ − 1)/2. (22)

Likewise, the optimal design for A that optimizes D_KL(P_A ‖ Q_A) when μ = 0 can be found by replacing g_KL(γ_i) with ḡ_KL(γ_i), where

ḡ_KL(γ) ≜ (log γ + 1/γ − 1)/2. (23)

In either case, the optimal design of A is constructed by choosing r eigenvectors of Σ as the rows of A. The results and observations are formalized in Section 4.6.
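The eigenvalue form above can be cross-checked against the general zero-mean Gaussian KL formula; a sketch (helper names are our own, assuming the per-eigenvalue contribution g_KL stated above):

```python
import numpy as np

def kl_gauss_zero_mean(S_q, S_p):
    # D_KL(N(0, S_q) || N(0, S_p)) for full-rank covariance matrices.
    r = S_p.shape[0]
    S_p_inv = np.linalg.inv(S_p)
    return 0.5 * (np.trace(S_p_inv @ S_q) - r
                  + np.log(np.linalg.det(S_p) / np.linalg.det(S_q)))

def kl_via_eigenvalues(gammas):
    # Eigenvalue form when P_A = N(0, I_r) and Q_A = N(0, diag(gammas)):
    # each eigenvalue contributes g_KL(gamma) = (gamma - log gamma - 1) / 2.
    g = np.asarray(gammas, dtype=float)
    return 0.5 * np.sum(g - np.log(g) - 1.0)
```

Both routes agree, e.g., for γ = (0.5, 2.0) they give 0.25 nats.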

Motivation
The KL divergence discussed in Section 4.1 is an asymmetric measure of separation between two probability measures. It is symmetrized by summing the two directed divergences taken in opposite directions. The symmetric KL divergence has applications in model selection problems in which the model selection criterion is based on a measure of disparity between the true model and the approximating models. As shown in Reference [40], using the symmetric KL divergence outperforms the individual directed KL divergences since it better reflects the risks associated with underfitting and overfitting of the models, respectively.

Connection between D_SKL and A
For a given A, the symmetric KL divergence of interest specified in (8) is given by

D_SKL(A) = D_KL(Q_A ‖ P_A) + D_KL(P_A ‖ Q_A).

By setting μ = 0 and leveraging Theorem 2, the problem of finding an optimal design for A that solves (14) reduces to selecting r eigenvalues of Σ that solve

max ∑_{i=1}^{r} g_SKL(γ_i), where we have defined g_SKL(γ) ≜ (γ + 1/γ − 2)/2.

Motivation
The squared Hellinger distance facilitates analysis in high dimensions, especially when other measures fail to admit closed-form expressions. We will discuss an important instance of this in the next subsection, in the analysis of d_TV. The squared Hellinger distance is symmetric, and it is confined to the range [0, 2].

Connection between H² and A
For a given matrix A, we have a closed-form expression for H²(A); in particular, by setting μ = 0, it simplifies to

H²(A) = 2 ( 1 − ∏_{i=1}^{r} ( 2·√γ_i / (1 + γ_i) )^{1/2} ).

Leveraging Theorem 2, the problem of finding an optimal design for A that solves (14) reduces to selecting r eigenvalues of Σ that solve

max ∏_{i=1}^{r} g_H(γ_i), where we have defined g_H(γ) ≜ (1 + γ) / (2·√γ).

Motivation
The total variation distance appears as the key performance metric in binary hypothesis testing and in high-dimensional inference, e.g., Le Cam's method for the binary quantization and testing of the individual dimensions (which is, in essence, binary hypothesis testing). In particular, for the simple binary hypothesis testing model in (65), the minimum total probability of error (the sum of the type-I and type-II error probabilities) is related to the total variation d_TV(A). Specifically, minimizing over all decision rules d : X → {H_0, H_1}, the following holds:

min_d [ P_A(d = H_1) + Q_A(d = H_0) ] = 1 − d_TV(A).

The total variation between two Gaussian distributions does not have a closed-form expression. Hence, unlike the other settings, an optimal solution to (6) in this context cannot be obtained analytically. Alternatively, in order to gain intuition into the structure of a near-optimal matrix A, we design A such that it optimizes known bounds on d_TV(A). In particular, we use two sets of bounds on d_TV(A). One set is due to bounding it via the Hellinger distance, and the other is due to a recent study that established upper and lower bounds that are identical up to a constant factor [41].

Connection between d_TV and A
(1) Bounding by the Hellinger Distance: The total variation distance can be bounded by the Hellinger distance according to

H²(A)/2 ≤ d_TV(A) ≤ √(H²(A)) · √(1 − H²(A)/4).

It can be readily verified that these bounds are monotonically increasing with H²(A) in the interval [0, 2]. Hence, they are maximized simultaneously by maximizing the squared Hellinger distance, as discussed in Section 4.3. We refer to this bound as the Hellinger bound.
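The standard Hellinger-TV sandwich is easy to evaluate and check numerically; a sketch (function name is our own):

```python
import numpy as np

def tv_bounds_from_hellinger(H2):
    """Lower/upper bounds on d_TV from the squared Hellinger distance H2 in [0, 2]:
    H2/2 <= d_TV <= sqrt(H2) * sqrt(1 - H2/4)."""
    H2 = np.asarray(H2, dtype=float)
    lower = H2 / 2.0
    upper = np.sqrt(H2) * np.sqrt(1.0 - H2 / 4.0)
    return lower, upper
```

Both bounds grow monotonically on [0, 2] and meet at 1 when H² = 2, matching the monotonicity claim above.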
(2) Matching Bounds up to a Constant: The second set of bounds that we use is provided in Reference [41]. These bounds relate the total variation between two Gaussian models to the Frobenius norm (FB) of a matrix formed from their covariance matrices. Specifically, for our zero-mean setting, the FB-based lower and upper bounds on the total variation d_TV(A) are constant multiples of

min{ 1, ‖ h_1(A) − I_r ‖_F },

where h_1(A) = A·Σ·A^T. Since the lower and upper bounds on d_TV(A) are identical up to a constant factor, they are maximized by the same design of A.
χ²-Divergence

Motivation
The χ²-divergence appears in a wide range of statistical estimation problems for the purpose of lower-bounding the estimation noise variance. For instance, consider the canonical problem of estimating a latent variable θ from the observed data X, and denote two candidate estimates by p(X) and q(X). Define P and Q as the probability measures of p(X) and q(X), respectively. According to the Hammersley-Chapman-Robbins (HCR) bound on the quadratic loss function, for any estimator θ̂, we have

var_P(θ̂) ≥ ( E_Q[θ̂] − E_P[θ̂] )² / χ²(Q ‖ P),

which, for unbiased estimators p and q, simplifies to a Cramér-Rao-type lower bound that depends on P and Q through their χ²-divergence. Besides its applications to estimation problems, χ² is easier to compute than some of the other f-divergence measures (e.g., total variation). Specifically, for product distributions, χ² tensorizes and can be expressed in terms of one-dimensional components that are easier to compute than the KL divergence and the total variation distance. Hence, a combination of bounding other measures by χ² and then analyzing χ² appears in a wide range of inference problems.

Connection between χ² and A
By setting μ = 0, for a given matrix A, from (11), we have the following closed-form expression:

χ²(Q_A ‖ P_A) = ∏_{i=1}^{r} g_{χ_1}(γ_i) − 1, where we have defined

g_{χ_1}(γ) ≜ ( γ·(2 − γ) )^{−1/2}. (38)

As we show in Appendix C, for χ²(A) to exist (i.e., be finite), all the eigenvalues {γ_i : i ∈ [r]} should fall in the interval (0, 2). Subsequently, the optimal design for A that optimizes χ²(P_A ‖ Q_A) when μ = 0 can be found by replacing g_{χ_1} in (38) by g_{χ_2}, which is given by

g_{χ_2}(γ) ≜ γ / √(2γ − 1).

Based on this, and by following a similar line of argument as in the case of the KL divergence, designing an optimal A reduces to identifying a subset of the eigenvalues of Σ and assigning their associated eigenvectors as the rows of matrix A. These observations are formalized in Section 4.6.
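The finiteness condition is easy to verify numerically; a sketch of the closed form (assuming the per-eigenvalue factor (γ(2−γ))^{-1/2} stated above, which follows from a direct Gaussian integral):

```python
import numpy as np

def chi2_zero_mean(gammas):
    """chi^2(Q_A || P_A) for Q_A = N(0, diag(gammas)) and P_A = N(0, I_r).
    Finite only when every eigenvalue lies in the open interval (0, 2)."""
    g = np.asarray(gammas, dtype=float)
    if np.any(g <= 0.0) or np.any(g >= 2.0):
        return np.inf
    return float(np.prod(1.0 / np.sqrt(g * (2.0 - g))) - 1.0)
```

The one-dimensional case can be checked against a numerical evaluation of the defining integral of q²/p.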

Main Results
In this section, we provide analytical closed-form solutions for designing optimal matrices A under the following f-divergence measures: D_KL, D_SKL, H², and χ². The total variation measure d_TV does not admit a closed form for Gaussian models; in this case, we provide a design for A that optimizes the bounds on d_TV provided in Section 4.4. Due to the structural similarities of the results, we group and treat D_KL, D_SKL, and d_TV in Theorem 3. Similarly, we group and treat H² and χ² in Theorem 4.
Theorem 3 (D_KL, D_SKL, d_TV). For a given function g : R → R, define the permutation π_g of [n] that orders the eigenvalues of Σ in decreasing value of g, i.e., g(λ_{π_g(1)}) ≥ ··· ≥ g(λ_{π_g(n)}). Then:
1. For maximizing D_f, set g = g_f and select the eigenvalues of A·Σ·A^T as γ_i = λ_{π_g(i)} for i ∈ [r].
2. Row i ∈ [r] of matrix A is the eigenvector of Σ associated with the eigenvalue γ_i.

Proof. See Appendix B.
By further leveraging the structures of the functions g_KL, g_SKL, and g_TV, we can simplify the approaches for designing the matrix A. Specifically, note that the functions g_KL, g_SKL, and g_TV are all strictly convex and take their global minima at x = 1. Based on this, we have the following observations.
Corollary 1. If all the eigenvalues of Σ are at least 1 (i.e., λ_n ≥ 1), then γ_i = λ_i for i ∈ [r], and the rows of A are eigenvectors of Σ associated with its r largest eigenvalues, i.e., {λ_i : i ∈ [r]}.
Corollary 2. If all the eigenvalues of Σ are at most 1 (i.e., λ_1 ≤ 1), the rows of A are eigenvectors of Σ associated with its r smallest eigenvalues.

Remark 1. When the eigenvalues of Σ lie on both sides of 1, finding the best permutation of eigenvalues involves sorting all n eigenvalues λ_i and subsequently performing r comparisons, as illustrated in Algorithm 1. This amounts to O(n·log(n)) time complexity, instead of the O(n·log(r)) time complexity involved in determining the design for A in the cases of Corollaries 1 and 2, which only require finding the r extreme eigenvalues to determine the design for π*.

Remark 2. The optimal design of A often does not align with the largest eigenvalues of the covariance matrix Σ, in contrast to some of the key approaches to linear dimensionality reduction, which generally perform the linear mapping along the eigenvectors associated with the largest eigenvalues of the covariance matrix. In particular, when the eigenvalues of Σ are all smaller than 1, A is designed by choosing the eigenvectors associated with the smallest eigenvalues of Σ in order to preserve maximal separability.
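Remark 2 is easy to reproduce numerically; a sketch (helper names are our own) that builds A by ranking the eigenvalues of Σ by a per-eigenvalue objective g:

```python
import numpy as np

def design_A(Sigma, r, g):
    """Rows of A = eigenvectors of Sigma whose eigenvalues maximize the sum of g."""
    lam, V = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    pick = np.argsort(g(lam))[::-1][:r]     # r eigenvalues with the largest g-value
    return V[:, pick].T                     # semi-orthogonal: A A^T = I_r

def g_kl(lam):
    # Per-eigenvalue contribution to D_KL(Q_A || P_A); convex with minimum at 1.
    return 0.5 * (lam - np.log(lam) - 1.0)
```

With Σ = diag(0.1, 0.5, 0.9), all eigenvalues below 1, the r = 1 design selects the smallest mode rather than the largest.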
Next, we provide the counterpart results for the H²- and χ²-divergence measures. Their major distinction from the previous three measures is that, for these two, D_f(A) decomposes into a product (rather than a sum) of individual functions of the eigenvalues {γ_i : i ∈ [r]}. The counterparts of Theorem 3 and Corollaries 1 and 2 for H² and χ² are as follows.
Theorem 4 (H², χ²). For a given function g : R → R, define the permutation π_g of [n] as in Theorem 3. Then:
1. For maximizing D_f, set g = g_f and select the eigenvalues of A·Σ·A^T as γ_i = λ_{π_g(i)} for i ∈ [r].
2. Row i ∈ [r] of matrix A is the eigenvector of Σ associated with the eigenvalue γ_i.
Proof. See Appendix C.
Next, note that g_H is a strictly convex function taking its global minimum at x = 1. Furthermore, g_{χ_i} for i ∈ [2] are strictly convex over (0, 2) and take their global minima at x = 1.
Corollary 3. If all the eigenvalues of Σ are at least 1 (i.e., λ_n ≥ 1), then γ_i = λ_i for i ∈ [r], and the rows of A are eigenvectors of Σ associated with its r largest eigenvalues, i.e., {λ_i : i ∈ [r]}.
Finally, we remark that, unlike the other measures, the total variation does not admit a closed form, and we used two sets of tractable bounds to analyze it. By comparing the designs of A based on the different bounds, we have the following observation.
Remark 3. We note that both sets of bounds lead to the same design of A when either λ_1 ≤ 1 or λ_n ≥ 1. Otherwise, each selects a different set of eigenvectors of Σ to construct A, according to its respective function g.

KL Divergence
In this section, we show the gains of the above analysis for the KL divergence measure D_KL(A) through simulations on a change-point detection problem. We focus on the minimax setting, in which the change-point κ is deterministic. The objective is to detect a change in the stochastic process X_t with minimal delay after the change in the probability measure occurs at κ, and we define τ ∈ N as the time at which we can form a confident decision. A canonical model to quantify the decision delay is the conditional average detection delay (CADD) due to Pollak [42]:

CADD(τ) = sup_κ E_κ[ τ − κ | τ ≥ κ ], (46)

where E_κ is the expectation with respect to the probability distribution when the change happens at time κ. The objective of this formulation is to optimize the decision delay for the worst-case value of the change-point κ (that is, the change-point that leads to the maximum decision delay), while the constraints on the false alarm rate are satisfied. In this formulation, the worst case is κ = 1, in which case all the data points are generated from the post-change distribution. In the minimax setting, a reasonable measure of false alarms is the mean time to false alarm, or its reciprocal, which is the false alarm rate (FAR), defined as

FAR(τ) = 1 / E_∞[τ], (47)

where E_∞ is the expectation with respect to the distribution when a change never occurs, i.e., κ = ∞. A standard approach to balance the trade-off between the decision delay and false alarm rates involves solving [42]

min_τ CADD(τ) subject to FAR(τ) ≤ α, (48)

where α ∈ R_+ controls the rate of false alarms. For the quickest change-point detection formulation in (48), the popular cumulative sum (CuSum) test generates the optimal solutions, involving computing the following test statistic:

W[t] = max{ 0, max_{1 ≤ k ≤ t} ∑_{i=k}^{t} log( dQ_A/dP_A (Y_i) ) }. (49)

Computing W[t] follows a convenient recursion given by

W[t] = max{ 0, W[t−1] + log( dQ_A/dP_A (Y_t) ) }, (50)

where W[0] = 0. The CuSum statistic declares a change at a stopping time τ given by

τ = inf{ t : W[t] ≥ C }, (51)

where C is chosen such that the constraint on FAR(τ) in (48) is satisfied.
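The CuSum recursion W[t] = max(0, W[t−1] + log-likelihood ratio) with W[0] = 0 and its threshold stopping rule take only a few lines; a minimal sketch (function name is our own):

```python
def cusum_stopping_time(log_lr, C):
    """Run W[t] = max(0, W[t-1] + log_lr[t]) with W[0] = 0 and return the
    first time t at which W[t] >= C, or None if the threshold is never crossed."""
    W = 0.0
    for t, z in enumerate(log_lr, start=1):
        W = max(0.0, W + z)
        if W >= C:
            return t
    return None
```

Note that negative drift before the change keeps resetting the statistic to zero, which is what makes the test robust to the unknown change time.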
In this setting, we consider two zero-mean Gaussian models with the following pre- and post-linear dimensionality reduction structures:

P : N(0, I_n) and Q : N(0, Σ), P_A : N(0, I_r) and Q_A : N(0, A·Σ·A^T), (52)

where the covariance matrix Σ is generated randomly, and its eigenvalues are sampled from a uniform distribution. In particular, for the original data dimension n, 0.9n eigenvalues are sampled as {λ_i ∼ U(0.064, 1)}, and the remaining eigenvalues are sampled as {λ_i ∼ U(1, 4.24)}. We note that this is done since the objective function lies in the same range for eigenvalues within the intervals [0.064, 1] and [1, 4.24]. In order to consider the worst-case detection delay, we set κ = 1 and generate stochastic observations according to the model described in (52), following the change-point detection model in (19). For every random realization of the covariance matrix Σ, we run the CuSum statistic (50), where we generate A according to the following two schemes:
(1) Largest eigenmodes: In this scheme, the linear map A is designed such that its rows are eigenvectors associated with the r largest eigenvalues of Σ.
(2) Optimal design: In this scheme, the linear map A is designed such that its rows are eigenvectors associated with r eigenvalues of Σ that maximize D KL (A) according to Theorem 3.
In order to evaluate and compare the performance of the two schemes, we compute the ADD by running a Monte Carlo simulation over 5000 random realizations of the stochastic process X_t, following the change-point detection model in (19), for every random realization of Σ and for each reduced dimension 1 ≤ r ≤ 9. The detection delays obtained are then averaged over 100 random realizations of the covariance matrix Σ for each reduced dimension r. Figure 1 shows the plot of ADD versus r for multiple initial data dimensions n and for a fixed FAR = 1/5000. Owing to the inverse dependence on D_KL(A) given in (21), the optimal linear mapping of Theorem 3 achieves a smaller detection delay.

Symmetric KL Divergence
In this section, we show the gains of the analysis by numerically computing D_SKL(A). We follow the pre- and post-linear dimensionality reduction structures given in (52), where the covariance matrix Σ is randomly generated following the setup used in Section 5.1. As plotted in Figure 2, the design scheme for D_SKL(A) chosen according to Theorem 3 outperforms the other schemes.

Squared Hellinger Distance
We consider a Bayesian hypothesis testing problem with class prior probabilities p_{P_A} and p_{Q_A} and Gaussian class-conditional densities for the linear dimensionality reduction model in (52). Without loss of generality, we assume a 0-1 loss function associated with misclassification for the hypothesis test. In order to quantify the performance of the Bayes decision rule, it is imperative to compute the associated probability of error, also known as the Bayes error, which we denote by P_e. Since, in general, computing P_e for the optimal decision rule for multivariate Gaussian conditional densities is intractable, numerous techniques have been devised to bound P_e. Owing to its simplicity, one of the most commonly employed metrics is the Bhattacharyya coefficient, given by

BC(A) = ∫ √( f_{P_A}(y) · f_{Q_A}(y) ) dy, (53)

where f_{P_A} and f_{Q_A} denote the densities of P_A and Q_A, respectively. The metric in (53) facilitates upper-bounding the error probability as

P_e ≤ √( p_{P_A} · p_{Q_A} ) · BC(A), (54)

which is widely referred to as the Bhattacharyya bound. Relevant to this study is that the squared Hellinger distance is related to the Bhattacharyya coefficient in (53) through

H²(A) = 2 ( 1 − BC(A) ). (55)

Hence, maximizing the Hellinger distance H²(A) results in a tighter bound on P_e from (54). To show the performance numerically, we compute BC(A) via (55). For the pre- and post-linear dimensionality reduction structures given in (52), the covariance matrix Σ is randomly generated following the setup used in Section 5.1. As plotted in Figure 3, by employing the design scheme according to Theorem 4, the optimal design results in a smaller BC(A) and, hence, a tighter upper bound on P_e in comparison to the other schemes.
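For zero-mean models, BC(A) has a product form over the eigenvalues γ_i of A·Σ·A^T; a sketch (assuming that product form, which follows from the Gaussian Bhattacharyya integral; function names are our own):

```python
import numpy as np

def bhattacharyya_zero_mean(gammas):
    # BC between N(0, I_r) and N(0, diag(gammas)): product of sqrt(2 sqrt(g)/(1+g)).
    g = np.asarray(gammas, dtype=float)
    return float(np.prod(np.sqrt(2.0 * np.sqrt(g) / (1.0 + g))))

def bhattacharyya_error_bound(bc, p1=0.5, p2=0.5):
    # Bhattacharyya bound on the Bayes error: P_e <= sqrt(p1 * p2) * BC.
    return float(np.sqrt(p1 * p2) * bc)
```

Identical models (all γ_i = 1) give BC = 1 and the trivial bound P_e ≤ 1/2; any separation drives BC, and with it the bound, below that.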

Total Variation Distance
Consider a binary hypothesis test with Gaussian class-conditional densities following the model in (52) and equal class prior probabilities, i.e., p_{P_A} = p_{Q_A}. We define c_ij as the cost associated with deciding in favor of H_i when the true hypothesis is H_j, for 0 ≤ i, j ≤ 1, and denote the densities associated with the measures P_A and Q_A by f_{P_A} and f_{Q_A}, respectively. Without loss of generality, we assume a 0-1 loss function such that c_ij = 1 for all i ≠ j and c_ii = 0 for all i. The optimal Bayes decision rule that minimizes the error probability is the likelihood ratio test that decides in favor of the hypothesis with the larger class-conditional density. Since the total variation distance cannot be computed in closed form, we numerically compute the error probability P_e under the two bounds (Hellinger-based and FB-based) introduced in Section 4.4 to quantify the performance of the design of matrix A for the underlying inference problem. The covariance matrix Σ is randomly generated following the setup used in Section 5.1. As plotted in Figure 4, by optimizing the Hellinger-based bound according to Theorem 4 and the FB-based bound according to Theorem 3, the two design schemes achieve a smaller P_e. We further observe that the FB-based bounds are loose in comparison to the Hellinger-based bounds. Therefore, we do not plot the lower bound on P_e for the FB-based bounds in Figure 4.

χ 2 -Divergence
In this section, we show the gains of the proposed analysis by numerically computing χ²(A), which is used to find a lower bound (up to a constant) on the estimation variance var_θ(θ̂). Following the pre- and post-linear dimensionality reduction structures given in (52), the covariance matrix Σ is randomly generated following the setup used in Section 5.1. As shown in Figure 5, constructing the optimal design according to Theorem 4 achieves a tighter lower bound in comparison to the other scheme.

General Gaussian Models
In the previous sections, we focused on μ = 0. When μ ≠ 0, optimizing each f-divergence measure under the semi-orthogonality constraint does not render closed-form expressions. Nevertheless, to provide some intuition, we present a numerical approach to the design of A, which might also enjoy some local optimality guarantees. To start, note that the feasible set M_n^r = {A ∈ R^{r×n} : A·A^T = I_r}, induced by the orthogonality constraint in Q, is often referred to as the Stiefel manifold. Therefore, solving Q requires designing algorithms that optimize the objective while preserving the manifold constraint during the iterations.
We employ the method of Lagrange multipliers to formulate the Lagrangian function. Denoting the matrix of Lagrange multipliers by L ∈ R^{r×r}, the Lagrangian function of problem (14) is given by

L(A, L) = D_f(A) − tr( L^T · (A·A^T − I_r) ). (57)

From the first-order optimality condition, for any local maximizer A* of (14), there exists a Lagrange multiplier L* such that

∇_A L(A*, L*) = 0, (58)

where ∇_A denotes the partial derivative with respect to A. In what follows, we iterate the design mapping A using a gradient ascent algorithm in order to find a solution for A.
As discussed in the next subsection, this solution is guaranteed to be at least locally optimal.

Optimizing via Gradient Ascent
We use an iterative gradient ascent-based algorithm to find a local maximizer of D_f(A) such that A ∈ M_n^r. The gradient ascent update at any given iteration k ∈ N is given by

A_{k+1} = A_k + α · ∇_A L(A_k, L), (59)

where α > 0 is the step size. Note that, following this update, the new point A_{k+1} in (59) may not satisfy semi-orthogonality, i.e., A_{k+1} ∉ M_n^r; hence, it is imperative to establish a relation between the multipliers L and A_k in every iteration k to ensure a constraint-preserving update scheme. In particular, to enforce the semi-orthogonality constraint on A_{k+1}, a relationship between the multipliers and the gradients is derived in every iteration k. Following a similar line of analysis for gradient descent in Reference [43], this relationship is provided in Appendix E. More details on the analysis of the update scheme can be found in Reference [43], and a detailed discussion of the convergence guarantees of classical steepest descent update schemes adapted to semi-orthogonality constraints can be found in Reference [44].
In order to simplify ∇_A L(A, L) and state the relationships, we define Λ = L + L^T and subsequently find a relationship between Λ and A_k in every iteration k. This is obtained by right-multiplying (59) by A_{k+1}^T and solving for the Λ that enforces the semi-orthogonality constraint on A_{k+1}. To simplify the analysis, we take a finite Taylor series expansion of Λ around α = 0 and choose α such that the error in enforcing the constraint remains small while following the gradient of the objective subject to A·A^T = I_r. By simple algebraic manipulations, the matrices Λ_0, Λ_1, and Λ_2, for which the finite Taylor series expansion Λ ≈ Λ_0 + α·Λ_1 + α²·Λ_2 is a good approximation of the constraint, can be obtained in closed form, as derived in Appendix E. Additionally, we note that, since finding the global maximum is not guaranteed, it is imperative to initialize A_0 close to the estimated maximum. In this regard, we leverage the structure of the objective function for each f-divergence measure, as given in Appendix D.
In particular, we observe that the objective of each f-divergence measure can be decomposed into two terms: the first not involving µ (which makes that part a convex problem, as shown in Section 4), and the second a function of µ. Hence, leveraging the structure of the solution from Section 4, we initialize A_0 such that it maximizes the objective in the case of zero-mean Gaussian models. We further note that, while there exist more sophisticated orthogonality constraint-preserving algorithms [45], we find that the method adopted from Reference [43] is sufficient for our purpose, as we show next through numerical simulations.
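The constraint-preserving ascent loop can be sketched numerically. The snippet below is a minimal illustration, not the paper's scheme: instead of the multiplier-based correction of Appendix E, it re-orthonormalizes each iterate with a QR retraction, a standard alternative for keeping iterates semi-orthogonal; `stiefel_ascent` and `grad_f` are hypothetical names.

```python
import numpy as np

def stiefel_ascent(grad_f, A0, alpha=1e-2, iters=500):
    """Gradient ascent over r x n semi-orthogonal matrices (A A^T = I_r).

    Hedged sketch: rather than the Lagrange-multiplier correction derived
    in Appendix E, each unconstrained ascent step is followed by a QR
    retraction back onto the constraint set.
    """
    A = A0.copy()
    for _ in range(iters):
        A = A + alpha * grad_f(A)        # unconstrained ascent step
        Q, R = np.linalg.qr(A.T)         # reduced QR of the n x r transpose
        A = (Q * np.sign(np.diag(R))).T  # re-orthonormalize; fix QR sign ambiguity
    return A
```

As a sanity check, ascending on the concave-over-the-manifold objective tr(A M Aᵀ) drives the rows of A toward the dominant eigenvectors of M while A Aᵀ = I_r is preserved at every iterate.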

Results and Discussion
The design of A when µ ≠ 0 is not characterized analytically. Therefore, we resort to numerical simulations to show the gains of optimizing f-divergence measures when µ ≠ 0. In particular, we consider the linear discriminant analysis (LDA) problem, where the goal is to design a mapping A and perform classification in the lower-dimensional space (of dimension r). Without loss of generality, we assume n = 10 and consider Gaussian densities with the following pre- and post-linear dimensionality reduction structures:

P : N(0, I_n) and Q : N(µ, Σ),
P_A : N(0, I_r) and Q_A : N(Aµ, AΣAᵀ),    (63)

where the covariance matrix Σ is generated randomly, with its eigenvalues sampled from a uniform distribution, λ_i ~ U(0, 1) for i ∈ {1, ..., 10}. For the model in (63), we consider two kinds of performance metrics that have information-theoretic interpretations: (i) the total probability of error, related to d_TV(A), and (ii) the exponential decay of the error probability, related to D_KL(P_A ‖ Q_A). In what follows, we demonstrate that optimizing appropriate f-divergence measures between P_A and Q_A leads to better performance than that of the popular Fisher's quadratic discriminant analysis (QDA) classifier [20]. In particular, Fisher's approach sets r = 1 and designs A by solving (64). In contrast, we design A such that the information-theoretic objective functions associated with the total probability of error (captured by d_TV(A)) and the exponential decay of the error probability (captured by D_KL(P_A ‖ Q_A)) are maximized. The structure of these objective functions is discussed under the schemes "Total Probability of Error" and "Type-II Error Subjected to Type-I Error Constraints" below. After projecting the data onto a lower dimension, both our methods and Fisher's method deploy optimal detectors to discern the true model. It is noteworthy that, in all methods, the data in the lower dimension has a Gaussian model, and the conventional QDA classifier [20] is the optimal detector.
Hence, we emphasize that our approach aims to design A to maximize the distance between the probability measures after reducing the dimensions, i.e., the distance between P_A and Q_A. Since this distance captures the quality of the decisions, our design of A outperforms Fisher's. For each comparison, we consider various values of µ and compare the appropriate performance metrics with those of Fisher's QDA. In all cases, the data is synthetically generated, i.e., sampled from a Gaussian distribution, with 2000 data points associated with each of the measures P and Q.
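The post-projection models in (63) follow from the stability of Gaussians under linear maps: if X ~ N(µ, Σ), then AX ~ N(Aµ, AΣAᵀ), and semi-orthogonality of A turns I_n into I_r. A minimal sketch (`project_models` is a hypothetical helper name):

```python
import numpy as np

def project_models(A, mu, Sigma):
    """Push the Gaussian pair P = N(0, I_n), Q = N(mu, Sigma) through a
    semi-orthogonal map A (r x n, A A^T = I_r).

    Returns (mean, covariance) pairs for P_A and Q_A, since a linear image
    of a Gaussian is Gaussian: A X ~ N(A mu, A Sigma A^T).
    """
    r = A.shape[0]
    P_A = (np.zeros(r), np.eye(r))          # A I_n A^T = I_r by semi-orthogonality
    Q_A = (A @ mu, A @ Sigma @ A.T)
    return P_A, Q_A
```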

Schemes for Linear Map
(1) Total Probability of Error: In this scheme, the linear map A is designed such that d_TV(A) is optimized via gradient ascent iterations until convergence. As discussed in Section 4.4.1, since the total probability of error is the key performance metric that arises when optimizing d_TV(A), it is expected that optimizing d_TV(A) will result in a smaller total error than schemes that optimize other objective functions (e.g., Fisher's QDA). We note that, since there is no closed-form expression for the total variation distance, we instead maximize a bound on d_TV(A), via the Hellinger bound in (33), as a proxy for minimizing the total probability of error. The corresponding gradient expression for optimizing H²(A) (used to perform iterative updates as in (59)) is derived in closed form and given in Appendix D.
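What makes H²(A) a workable surrogate is that, unlike the total variation distance, it has a closed form for Gaussian models via the Bhattacharyya coefficient, and the standard Le Cam relation H² ≤ d_TV ≤ H·√(2 − H²) ties it to the quantity of interest (the specific bound (33) is not reproduced here). A hedged sketch with the hypothetical name `hellinger_sq_gauss`:

```python
import numpy as np

def hellinger_sq_gauss(mu1, S1, mu2, S2):
    """Squared Hellinger distance H^2 = 1 - BC between N(mu1, S1) and
    N(mu2, S2), where BC is the Bhattacharyya coefficient.

    Closed form: BC = |S1|^{1/4} |S2|^{1/4} / |Sbar|^{1/2}
                      * exp(-(1/8) (mu1-mu2)^T Sbar^{-1} (mu1-mu2)),
    with Sbar = (S1 + S2) / 2.
    """
    Sbar = 0.5 * (S1 + S2)
    dmu = mu1 - mu2
    bc = (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
          / np.sqrt(np.linalg.det(Sbar))
          * np.exp(-0.125 * dmu @ np.linalg.solve(Sbar, dmu)))
    return 1.0 - bc
```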
(2) Type-II Error Subjected to Type-I Error Constraints: In this scheme, the linear map A is designed such that D_KL(P_A ‖ Q_A) is optimized via gradient ascent iterations until convergence. In order to establish a relation, consider the binary hypothesis test in (65). When minimizing the probability of type-II error subject to a type-I error constraint, the optimal test guarantees that the probability of type-II error decays exponentially with exponent D_KL(P_A ‖ Q_A), where we have defined d : X → {H_0, H_1} as the decision rule for the hypothesis test, and s denotes the sample size. As a result, D_KL(P_A ‖ Q_A) appears as the error exponent for the hypothesis test in (65). Hence, it is expected that optimizing D_KL(P_A ‖ Q_A) will result in a smaller type-II error at the same type-I error than a method that optimizes other objectives (e.g., Fisher's QDA). The corresponding gradient expression for optimizing D_KL(P_A ‖ Q_A) is derived in closed form and given in Appendix D.
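For reference, the error exponent D_KL(P_A ‖ Q_A) between the projected Gaussian models can be evaluated with the standard closed form for the KL divergence between multivariate Gaussians (`kl_gauss` is a hypothetical name; the paper's gradient expressions in Appendix D are not reproduced):

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """D_KL(N(mu1, S1) || N(mu2, S2)) in nats, via the standard closed form:

        0.5 * [ tr(S2^{-1} S1) + (mu2-mu1)^T S2^{-1} (mu2-mu1)
                - r + ln(det S2 / det S1) ].
    """
    r = len(mu1)
    S2inv = np.linalg.inv(S2)
    dmu = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + dmu @ S2inv @ dmu - r
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
```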
For the sake of comparison and reference, we also consider schemes in which A is designed to optimize the objectives D_KL(A), the largest eigen modes (LEM), and the smallest eigen modes (SEM), which carry no specific operational significance in the context of the binary classification problem. In the LEM and SEM schemes, the linear map A is designed such that the rows of A are the eigenvectors associated with the largest and smallest modes of the matrix Σ, respectively. Furthermore, we define 1 as the vector of all ones of appropriate dimension.
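The LEM and SEM baselines can be constructed directly from the eigendecomposition of Σ; a minimal sketch (`lem_sem_maps` is a hypothetical name):

```python
import numpy as np

def lem_sem_maps(Sigma, r):
    """Baseline linear maps: rows of A are the eigenvectors of Sigma
    associated with the r largest (LEM) or r smallest (SEM) eigenvalues."""
    w, V = np.linalg.eigh(Sigma)     # eigh returns eigenvalues in ascending order
    A_lem = V[:, ::-1][:, :r].T      # top-r modes as rows
    A_sem = V[:, :r].T               # bottom-r modes as rows
    return A_lem, A_sem
```

Since the rows are orthonormal eigenvectors, both maps automatically satisfy the semi-orthogonality constraint A Aᵀ = I_r.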

Performance Comparison
After learning the linear map A for each scheme described in Section 6.2.1, we perform classification in the lower-dimensional space of dimension r to find the type-I, type-II, and total probability of error for each scheme. Tables 1-4 tabulate the results for various choices of the mean parameter µ. We have the following important observations: (i) optimizing H²(A) results in a smaller total probability of error than optimizing Fisher's objective; it is important to note that this superior performance is observed despite maximizing a bound on d_TV(A) (which is suboptimal) rather than the distance itself; and (ii) except for the case of µ = 0.8 · 1, optimizing D_KL(P_A ‖ Q_A) results in a smaller type-II error than optimizing Fisher's objective, indicating a gain over the Fisher objective in (64). It is important to note that convergence of the gradient ascent algorithm only guarantees a locally optimal solution. While the reported results consider a maximum separation of µ = 0.8 · 1, we have performed additional simulations for larger separations between the models (entries of µ greater than 0.8). We have the following observations: (i) the solution for the linear map A obtained through gradient ascent becomes highly sensitive to the initialization A_0; specifically, optimizing Fisher's objective outperforms optimizing H²(A) for some initializations A_0, and vice versa for other random initializations; and (ii) the gradient ascent solver becomes more prone to getting stuck at local maxima for larger separations between the models. We conjecture that this sensitivity also explains the anomalous observation in the case of µ = 0.8 · 1, where optimizing Fisher's objective outperforms optimizing D_KL(P_A ‖ Q_A).
Furthermore, we note that, since the problem is convex for µ = 0, a deviation from this assumption moves the problem further from convexity, making the solver prone to getting stuck at locally optimal solutions for larger separations between the Gaussian models.

Subspace Representation
In order to gain more intuition about the learned representations, we illustrate the 2-dimensional projections of the original 10-dimensional data obtained after optimizing the corresponding f-divergence measures. For brevity, we only show the plots for D_KL(P_A ‖ Q_A) and H²(A). Figures 6 and 7 plot the two-dimensional projections of the synthetic dataset that optimize D_KL(P_A ‖ Q_A) and H²(A), respectively. As expected, the total probability of error is smaller when optimizing H²(A). Figure 8 shows the variation of the objective functions over the gradient ascent iterations; as the iterations grow, the objective functions eventually converge to locally optimal solutions.

Conclusions
In this paper, we have considered the problem of discriminant analysis such that the separation between the classes is maximized under f-divergence measures. This approach is motivated by dimensionality reduction for inference problems, where we have investigated discriminant analysis under the Kullback–Leibler, symmetrized Kullback–Leibler, Hellinger, χ², and total variation measures. We have characterized the optimal design of the linear transformation of the data onto a lower-dimensional subspace for each measure in the case of zero-mean Gaussian models and adopted numerical algorithms to find the design of the linear transformation in the case of general Gaussian models with non-zero means. We have shown that, in the case of zero-mean Gaussian models, the row space of the mapping matrix lies in the eigenspace of a matrix associated with the covariance matrix of the Gaussian models involved. While each f-divergence measure favors specific eigenvector components, we have shown that all the designs become identical in certain regimes, making the design of the linear mapping independent of the inference problem of interest.

Appendix A. Invariance of f-Divergence Measures under Invertible Transformations

Consider two pairs of probability measures (P_A, Q_A) and (P_Ā, Q_Ā) associated with the mapping A in space X and Ā in space Y, respectively. Let g : X → Y denote any invertible transformation. Under the invertible map, we have dQ_Ā = dQ_A · |T|⁻¹ and dP_Ā = dP_A · |T|⁻¹, where |T| denotes the determinant of the Jacobian matrix associated with g. Leveraging (A1), the f-divergence measure D_f(Ā) simplifies as follows.
Therefore, f-divergence measures are invariant under invertible transformations (both linear and non-linear), ensuring the existence of an Ā for every A, with linear transformations as a special case.
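Spelled out with the change of variables above (and writing the densities with respect to the respective base measures), the chain of equalities reads:

```latex
D_f(\bar{A})
= \int_{\mathcal{Y}} f\!\left(\frac{dP_{\bar{A}}}{dQ_{\bar{A}}}\right) dQ_{\bar{A}}
= \int_{\mathcal{Y}} f\!\left(\frac{dP_{A}\,|T|^{-1}}{dQ_{A}\,|T|^{-1}}\right) dQ_{\bar{A}}
= \int_{\mathcal{X}} f\!\left(\frac{dP_{A}}{dQ_{A}}\right) dQ_{A}
= D_f(A),
```

where the Jacobian factors cancel inside the likelihood ratio, and the last step changes the integration variable back from Y to X.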

Appendix B. Proof of Theorem 3
We observe that D_KL(A), D_SKL(A), and the objective to be optimized through the matching bound in Section 4.4.2 (Matching Bounds up to a Constant on d_TV(A)) can each be decomposed as a summation of strictly convex functions involving g_KL(x), g_SKL(x), and g_TV(x), respectively. Since a summation of strictly convex functions is strictly convex, the goal becomes choosing {γ_i}, i = 1, ..., r, such that each objective D_f ∈ {D_KL(A), D_SKL(A), d_TV(A)} is maximized subject to the spectral constraints λ_{n−(r−i)} ≤ γ_i ≤ λ_i. In order to choose appropriate γ_i's, we first note that the global minimizer of each function g_f ∈ {g_KL, g_SKL, g_TV} is attained at x = 1. By noting that each g_f is strictly convex, it can be readily verified that g_f(x) is monotonically increasing for x > 1 and monotonically decreasing for x < 1. This guides the selection of {γ_i}, i = 1, ..., r, as explained next. In the case of λ_n ≥ 1, i.e., when all the eigenvalues are larger than or equal to 1, maximizing each D_f ∈ {D_KL(A), D_SKL(A), d_TV(A)} boils down to maximizing a monotonically increasing function (over the feasible domain). This is trivially done by choosing γ_i = λ_i for i ∈ [r], proving Corollary 1. On the other hand, when λ_1 ≤ 1, i.e., when all the eigenvalues are smaller than or equal to 1, following the same line of argument, the objective boils down to maximizing a monotonically decreasing function (over the feasible domain). This is trivially done by choosing γ_i = λ_{n−r+i} for i ∈ [r].
When λ_n ≤ 1 ≤ λ_1, the selection is not trivial. Rather, an iterative algorithm can be followed, where we start from the eigenvalues farthest away from 1 on both sides and, in every iteration, choose the one that achieves a higher objective value. This procedure is repeated recursively until r eigenvalues are chosen; it is also described in Algorithm 1 in Section 4.6.
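Assuming a scalar per-eigenvalue objective g_f (convex with its minimum at x = 1, as established above, so that decomposable objectives reduce to per-eigenvalue comparisons), the greedy selection can be sketched as follows; `greedy_select` is a hypothetical name, and Algorithm 1 itself is not reproduced here:

```python
import math

def greedy_select(eigs, g_f, r):
    """Greedy eigenvalue selection for the case lambda_n <= 1 <= lambda_1.

    eigs must be sorted in descending order. In each iteration, the two
    extreme remaining candidates (the largest and the smallest eigenvalue,
    i.e., the ones farthest from 1 on either side) are compared, and the
    one with the larger objective value g_f is kept.
    """
    lo, hi = 0, len(eigs) - 1
    chosen = []
    for _ in range(r):
        if g_f(eigs[lo]) >= g_f(eigs[hi]):
            chosen.append(eigs[lo]); lo += 1   # take from the large end
        else:
            chosen.append(eigs[hi]); hi -= 1   # take from the small end
    return chosen
```

As a usage example, one may take g_f(x) = x − 1 − ln x, a KL-flavored strictly convex function minimized at x = 1 (an illustrative stand-in for the paper's g_KL, whose exact form is in Appendix B).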
Finally, constructing the optimal matrix A that maximizes D_f for any covariance matrix Σ becomes equivalent to choosing, as the rows of A, the eigenvectors associated with the chosen eigenvalues in each of the aforementioned cases.

Appendix C. Proof for Theorem 4
We first find closed-form expressions for χ²(A) and χ²(P_A ‖ Q_A). From the definition, we obtain an expression involving K_1 = 2 · h_1(A)⁻¹ − I_r. We note that K_1 is a real symmetric matrix since h_1(A) is a real symmetric matrix. We denote the eigen decomposition of K_1 by K_1 = U · Θ · Uᵀ, where Θ is a diagonal matrix with the eigenvalues {θ_i}, i = 1, ..., r, as its elements. Based on this decomposition, and defining W = U · Y, we note that, in order for χ²(A) to be finite, the eigenvalues {θ_i} must be non-negative. Hence, based on the definition of K_1, all the eigenvalues λ_i should fall in the interval (0, 2), and we obtain (A13). Recalling that the eigenvalues of h_1(A) are given by {γ_i}, i = 1, ..., r, in descending order, (A13) simplifies to (A14). Hence, from (A14), maximizing χ²(A) is equivalent to choosing the eigenvalues {γ_i} such that they maximize g_χ1(x). Similarly, the closed-form expression for χ²(P_A ‖ Q_A) can be derived by defining K_2 = 2 · I_r − h_1(A)⁻¹. We note that K_2 is a real symmetric matrix due to h_1(A) being a real symmetric matrix. Hence, following a similar line of argument as in the case of χ²(A), and as a consequence of Theorem 2, we conclude that all the eigenvalues λ_i should fall in the interval (0.5, ∞) to ensure a finite value for χ²(P_A ‖ Q_A). Following this requirement, since the integrands are bounded, we obtain the closed-form expression in (A16). Recalling again that the eigenvalues of h_1(A) are given by {γ_i}, i = 1, ..., r, (A16) simplifies to (A17). Hence, from (A17), maximizing χ²(P_A ‖ Q_A) is equivalent to choosing the eigenvalues {γ_i} such that they maximize g_χ2(x). We observe that H²(A), χ²(A), and χ²(P_A ‖ Q_A) can each be decomposed as a product of r non-negative, identical convex functions involving g_H(x), g_χ1(x), and g_χ2(x), respectively.
Hence, the goal is to choose {γ_i}, i = 1, ..., r, such that D_f ∈ {H²(A), χ²(A), χ²(P_A ‖ Q_A)} is maximized subject to the spectral constraints λ_{n−(r−i)} ≤ γ_i ≤ λ_i. In order to choose appropriate γ_i's, we first note that the global minimizer of each g_f ∈ {g_H, g_χ1, g_χ2} is attained at x = 1. Leveraging this observation, along with the convexity of each g_f, it is easy to infer that each g_f(x) is monotonically increasing for x > 1 and monotonically decreasing for x < 1. By the exact same argument as in Appendix B, we obtain Corollaries 3 and 4.
Therefore, similar to Appendix B, constructing the linear map A that maximizes D_f ∈ {H²(A), χ²(A), χ²(P_A ‖ Q_A)} for any covariance matrix Σ boils down to choosing, as the rows of A, the eigenvectors associated with the chosen eigenvalues in each of the aforementioned cases.

Appendix D. Gradient Expressions for f-Divergence Measures
For clarity in analysis, we define the following functions: Based on these definitions, we have the following representations for the divergence measures and their associated gradients: