The Geometry of the Generalized Likelihood Ratio Test

The generalized likelihood ratio test (GLRT) for composite hypothesis testing problems is studied from a geometric perspective. An information-geometric interpretation of the GLRT is proposed based on the geometry of curved exponential families. Two geometric pictures of the GLRT are presented for the cases where the unknown parameters are and are not the same under the null and alternative hypotheses, respectively. A demonstration based on a one-dimensional curved Gaussian distribution is introduced to elucidate the geometric realization of the GLRT. The asymptotic performance of the GLRT is discussed based on the proposed geometric representation. The study provides an alternative perspective for understanding problems of statistical inference in the theoretical sense.


Introduction
The problem of hypothesis testing under statistical uncertainty arises naturally in many practical contexts. In these cases, the probability density functions (PDFs) under either or both hypotheses need not be completely specified, resulting in the inclusion of unknown parameters in the PDFs to express the statistical uncertainty in the model. The class of hypothesis testing problems with unknown parameters in the PDFs is commonly referred to as composite hypothesis testing [1]. The generalized likelihood ratio test (GLRT) is one of the most widely used approaches in composite hypothesis testing [2]. It involves estimating the unknown parameters via maximum likelihood estimation (MLE) to implement a likelihood ratio test. In practice, the GLRT appears to be asymptotically optimal in the sense of the Neyman-Pearson criterion and usually gives satisfactory results [3]. As the GLRT combines both estimation and detection to deal with the composite hypothesis testing problem, its performance, in general, will depend on the statistical inference performance of these two aspects. However, in the literature, there is no general analytical result associated with the performance of the GLRT [1].
In recent years, the development of new theories in statistical inference has been characterized by the emerging trend of geometric approaches and their powerful capabilities, which allows one to analyze statistical problems in a unified perspective. It is important to link the GLRT to the geometrical nature of estimation and detection, which provides a new viewpoint on the GLRT. The general problem of composite hypothesis testing involves a decision between two hypotheses where the PDFs are themselves functions of unknown parameters. One approach to the understanding of performance limitations of statistical inference is via the theory of information geometry. In this context, the family of probability distributions with a natural geometrical structure is defined as a statistical manifold [4]. Information geometry studies the intrinsic properties of statistical manifolds which are endowed with a Riemannian metric and a family of affine connections derived from the log-likelihood functions of probability distributions [5]. It provides a way of analyzing the geometrical properties of statistical models by regarding them as geometric objects.
The geometric theory of statistics was first introduced in the 1940s by Rao [6], where the Fisher information matrix was regarded as a Riemannian metric on the manifold of probability distributions. Then, in 1972, a one-parameter family of affine connections was introduced by Chentsov [7]. Meanwhile, Efron [8] defined the concept of statistical curvature and discussed its basic role in the higher-order asymptotic theory of statistical inference. In 1982, Amari [5,9] developed a duality structure theory that unified all of these theories in a differential-geometric framework, leading to a large number of applications.
In the area of hypothesis testing, geometric perspectives have acquired relevance in the analysis and development of new approaches to various testing and detection contexts. For example, Kass and Vos [10] provided a detailed introduction to the geometric foundations of asymptotic inference for curved exponential families. Garderen [11] presented a global analysis of the effects of curvature on hypothesis testing. Dabak [12] induced a geometric structure on the manifold of probability distributions and enforced a detection-theoretic geometry on it, while Westover [13] discussed the asymptotic limits of multiple hypothesis testing from the geometric perspective. For the development of new approaches to hypothesis testing, Hoeffding [14] proposed an asymptotically optimal test for multinomial distributions in which the test statistic can be written in terms of the Kullback-Leibler divergence (KLD) between the empirical distribution of the measurements and the null hypothesis, with the alternative distribution unrestricted. In the area of signal detection, Barbaresco et al. [15-17] studied the geometry of the Bruhat-Tits complete metric space and the upper-half Siegel space and introduced a matrix constant false alarm rate (CFAR) detector that improves on the detection performance of classical CFAR detection.
As more and more new analyses and approaches have benefited from geometric and information-theoretic perspectives on statistics, it becomes important to clarify the geometry of existing problems, which promises new ways of dealing with them. In this paper, a geometric interpretation of the GLRT is sought from the perspective of information geometry. Two pictures of the GLRT are presented for the cases where the unknown parameters are and are not the same under each hypothesis, respectively. Under such an interpretation, both the detection and the estimation associated with the GLRT are regarded as geometric operations on the statistical manifold. As a general consideration, curved exponential families [9], which include a large number of the most commonly used distributions, are taken as the statistical model of the hypothesis testing problems. A demonstration based on a one-dimensional curved Gaussian distribution is introduced to elucidate the geometric realization of the GLRT. The geometric structure of the curved exponential families developed by Efron [8] in 1975 and Amari [9] in 1982 provides a theoretical foundation for the analysis. The geometric formulation of the GLRT presented in this paper makes it possible for several advanced notions and conclusions of information geometry theory to be transferred and applied to the performance analysis of the GLRT.
The main contributions of this paper are summarized as follows:
• A geometric interpretation of the GLRT is proposed based on the differential geometry of curved exponential families and the duality structure theory developed by Amari [9]. Two geometric pictures of the GLRT are presented in the theoretical sense, which provides an alternative perspective for understanding the problems of statistical inference.
• The asymptotic performance of the GLRT is discussed based on the proposed geometric representation of the GLRT. The information loss incurred when performing the MLE with a finite number of samples is related to the flatness of the submanifolds determined by the GLRT model.
In the next section, alternative viewpoints on the likelihood ratio test and maximum likelihood estimation are introduced from the perspective of information theory. The equivalences between the Kullback-Leibler divergence, the likelihood ratio test, and the MLE are highlighted. The principles of information geometry are briefly introduced in Section 3. In Section 4, the geometric interpretation of the GLRT is presented in consideration of the geometry of curved exponential families. We present an example of the GLRT where a curved Gaussian distribution with one unknown parameter is involved, followed by a further discussion on the geometry of the GLRT. Finally, conclusions are drawn in Section 5.

Information-Theoretic Viewpoints on Likelihood Ratio Test and Maximum Likelihood Estimation
In statistics, the likelihood ratio test and maximum likelihood estimation are two fundamental concepts related to the GLRT. The likelihood ratio test is a very general form of testing model assumptions, while the maximum likelihood estimation is one of the most common approaches to parameter estimation. Both of them are associated with the Kullback-Leibler divergence [18], which is equivalent to the relative entropy [19] in information theory.
For a sequence of observations x = (x_1, x_2, . . . , x_N)^T ∈ R^N which is independently and identically distributed (i.i.d.), the binary hypothesis testing problem is to decide whether this sequence x originates from the null hypothesis H_0 or the alternative hypothesis H_1, with probability distributions p_0(x) and p_1(x), respectively. The likelihood ratio is given by

L(x) = ∏_{i=1}^N p_1(x_i)/p_0(x_i). (1)

Assume q(x) is the empirical distribution (frequency histogram acquired via Monte Carlo tests) of the observed data. For large N, in accordance with the strong law of large numbers [20], the log-likelihood ratio test in the Neyman-Pearson formulation becomes

(1/N) ln L(x) = (1/N) ∑_{i=1}^N ln [p_1(x_i)/p_0(x_i)] → E_q{ln [p_1(x)/p_0(x)]} = D(q‖p_0) − D(q‖p_1) ≷ γ, (2)

where the test decides H_1 if ">" holds and H_0 otherwise. The quantity

D(q‖p) = ∫ q(x) ln [q(x)/p(x)] dx (3)

is the KLD from q(x) to p(x). Note that x is dropped from the notation D(·‖·) to simplify the KLD expression without confusion. Equation (2) indicates that the likelihood ratio test is equivalent to choosing the hypothesis that is "closer" to the empirical distribution in the sense of the KLD. In a geometric viewpoint, the test can be regarded as a generalized minimum dissimilarity detector. Now, consider another, slightly different problem where the observations x are drawn from a statistical model p(x|θ) with unknown parameters θ, and the task is to estimate θ based on x. The likelihood function for the underlying estimation problem is

L(θ) = ∏_{i=1}^N p(x_i|θ). (4)

In a similar way, for large N, maximizing the likelihood (4) to find the maximum likelihood estimate of θ is equivalent to finding the θ that minimizes the KLD D(q‖p_θ), i.e.,

θ̂ = arg max_θ (1/N) ∑_{i=1}^N ln p(x_i|θ) = arg min_θ D(q‖p_θ), (5)

where p_θ is used as a surrogate for p(x|θ).
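As a numerical sanity check of this equivalence, the following sketch (an illustration added here, not part of the original analysis; the Gaussian pair and sample size are arbitrary choices) draws data from H_1 and compares the normalized log-likelihood ratio with the closed-form difference of Gaussian KLDs:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simple hypotheses H0: N(0, 1) and H1: N(1, 1); the data come from H1,
# so the empirical distribution q approaches p1 as N grows.
N = 100_000
x = rng.normal(1.0, 1.0, N)

# Normalized log-likelihood ratio (1/N) ln L(x).
llr = np.mean(norm.logpdf(x, loc=1.0) - norm.logpdf(x, loc=0.0))

# Closed-form KLD between unit-variance Gaussians: D = (mean difference)^2 / 2.
D_q_p0 = (1.0 - 0.0) ** 2 / 2   # D(q || p0) with q = N(1, 1)
D_q_p1 = 0.0                    # D(q || p1) = 0 since q = p1

print(llr, D_q_p0 - D_q_p1)     # the two values nearly coincide
```

With q = p_1, the difference D(q‖p_0) − D(q‖p_1) is exactly 1/2, and the empirical log-likelihood ratio converges to the same value.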
The above results provide an information-theoretic view of the problems of hypothesis testing and maximum likelihood estimation in statistics. From the perspective of information difference, these results have profound geometric meanings and can be geometrically analyzed and viewed in the framework of information geometry theory, from which additional insights into the analysis of these statistical problems, as well as their geometric interpretations, are obtained.

Statistical Manifold
Information geometry studies the natural geometric structure of a parameterized family of probability distributions S = {p(x|θ)} specified by a parameter vector θ := [θ_1, . . . , θ_n], where x denotes samples of a random variable X. When the probability measure on the sample space is continuous and differentiable and the mapping θ → p(x|θ) is injective [5], the family S is regarded as a statistical manifold with θ as its coordinate system [4]. Figure 1 shows the diagram of a statistical manifold. For a given parameter vector θ ∈ Θ ⊂ R^n, the measurement x in the sample space X is an instantiation of the probability distribution p(x|θ). Each p(x|θ) in the family of distributions is specified by a point s(θ) on the manifold S. The n-dimensional statistical manifold is composed of the parameterized family of probability distributions S = {p(x|θ)} with θ as a coordinate system of S.
Figure 1. Diagram of a statistical manifold. θ and θ′ denote parameters of the family of distributions from different samples of X. The connection on the statistical manifold S gives rise to a geodesic (the shortest line) between the points s(θ) and s(θ′). The length of the geodesic serves as a distance measure between the two points on the manifold. The arrow on the geodesic starting from the point s(θ) denotes the tangent vector, which gives the direction of the geodesic.
Various families of probability distributions correspond to specific structures of the statistical manifold. Information geometry takes the statistical properties of samples as the geometric structure of a statistical manifold, and utilizes differential geometry methods to measure the variation of information contained in the samples.

Fisher Information Metric and Affine Connections
The metric and connections associated with a manifold are two important concepts in information geometry. For a statistical manifold consisting of a parameterized family of probability distributions, the Fisher information matrix (FIM) is usually adopted as a Riemannian metric tensor of the manifold [6], defined by the inner product between tangent vectors at a point on the manifold:

g_ij(θ) = E_θ[ ∂ log p(x|θ)/∂θ_i · ∂ log p(x|θ)/∂θ_j ], (6)

where {∂ log p(x|θ)/∂θ_i} is considered as a basis for the vector space of the random variable X.
The tangent space of S at θ, denoted by T_θ(S), is identified with this vector space. Based on the above definition, the FIM metric determines how the information distance is measured on the statistical manifold. When considering the relationship between two tangent spaces T_θ and T_{θ+dθ} at two neighboring points θ and θ + dθ (d is the differential operator), an affine connection is defined, by which the two tangent spaces become feasible for comparison. When the connection coefficients are all identically 0, S is a flat manifold that "locally looks like" a Euclidean space, with zero curvature everywhere. The most commonly used connections are the α-connections [9], whose coefficients Γ^(α)_{ij,m}, with i, j, m = 1, . . . , n, are given by

Γ^(α)_{ij,m}(θ) = E_θ[ ( ∂_i ∂_j l(θ, x) + ((1 − α)/2) ∂_i l(θ, x) ∂_j l(θ, x) ) ∂_m l(θ, x) ], (7)

where l(θ, x) = log p(x|θ) and ∂_i = ∂/∂θ_i. In (7), α = 0 corresponds to the Levi-Civita connection, while α = 1 defines the e-connection and α = −1 defines the m-connection. Under the e-connection and the m-connection, an exponential family with the natural parameter θ as coordinates and a mixture family with the expectation parameter η as coordinates are both flat manifolds [9]. Statistical inference with respect to the exponential family greatly benefits from the geometric properties of the flat manifold. By using the methods of differential geometry, many additional insights into the intrinsic structure of probability distributions can be obtained, opening a new perspective on the analysis of statistical problems. In the next section, a geometric interpretation of the GLRT and further discussions are developed based on the principles of information geometry.
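As a concrete instance of the Fisher metric, the information of a unit-variance Gaussian with unknown mean can be estimated as the second moment of the score; the following minimal Monte Carlo sketch (the parameter value and sample size are illustrative, not from the paper) recovers the analytic value g(θ) = 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# For p(x|theta) = N(theta, 1) the score is d/dtheta log p = x - theta,
# and the Fisher information g(theta) = E[score^2] equals 1 for every theta.
theta = 0.7
x = rng.normal(theta, 1.0, 200_000)

score = x - theta
g_hat = np.mean(score ** 2)   # Monte Carlo estimate of the metric g(theta)
print(g_hat)                  # close to the analytic value 1
```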

Geometry of the Generalized Likelihood Ratio Test
As a general treatment, the curved exponential families, which encapsulate many important distributions for real-world problems, are considered as the statistical model for the hypothesis testing problems discussed in this paper. In this section, the MLE solution to parameter estimation for curved exponential families is derived. We then present two pictures of the GLRT, which are sketched based on the geometric structure of the curved exponential families developed by Efron [8] in 1975 and Amari [9] in 1982, to illustrate the information geometry of the GLRT. An example of the GLRT for a curved Gaussian distribution with a single unknown parameter is given, which is followed by a further discussion on the geometric formulation of the GLRT.

The MLE Solution to Statistical Estimation for Curved Exponential Families
Exponential families contain many of the most commonly used distributions, including the normal, exponential, Gamma, Beta, Poisson, Bernoulli, and so on [21]. Curved exponential families are distributions whose natural parameters are nonlinear functions of "local" parameters. The canonical form of an exponential family [9] is expressed as

p(x|θ) = exp{ θ^T F(x) + C(x) − ϕ(θ) }, (8)

where x ∈ X is a vector of samples and θ := [θ_1, . . . , θ_n] are the natural parameters. A curved exponential family is obtained by restricting θ to a nonlinear embedding θ = θ(u), i.e.,

p(x|u) = exp{ θ(u)^T F(x) + C(x) − ϕ(θ(u)) }, (9)

where u ∈ R^m (m < n) are local parameters standing for the parameters of interest to be estimated, and F(x) := [F_1(x), · · · , F_n(x)]^T denote the sufficient statistics with respect to θ = (θ_1, · · · , θ_n), which take values in the sample space X. ϕ is the potential function of the family and is found from the normalization condition ∫_X p(x|θ)dx = 1, i.e.,

ϕ(θ) = ln ∫_X exp{ θ^T F(x) + C(x) } dx. (10)

The term "curved" refers to the fact that the curved exponential family in (9) is a submanifold of the canonical exponential family p(x|θ) under the embedding u → θ(u).
Let l(θ, x) = log p(x|θ) be the log-likelihood and ∇_u θ be the Jacobian matrix of the natural parameter θ with respect to u. According to (9),

∂l(θ(u), x)/∂u = ∇_u θ(u)^T [ F(x) − ∇_θ ϕ(θ(u)) ], (11)

where the expectation of the sufficient statistics F(x), i.e.,

η(u) := E_{p(·|u)}{F(x)}, (12)

is called the expectation parameter, which defines a distribution of the mixture family [4]. The natural parameter θ(u) and the expectation parameter η(u) are connected by the Legendre transformation [9], as

ϕ(θ) + ψ(η) = θ^T η, (13)

where ψ is the dual potential function, so that

η(u) = ∇_θ ϕ(θ(u)), θ(u) = ∇_η ψ(η(u)). (14)

Therefore,

∂l(θ(u), x)/∂u = ∇_u θ(u)^T [ F(x) − η(u) ]. (15)

Thus, the maximum likelihood estimator û of the local parameter in (9) can be obtained from the following likelihood equation:

∇_u θ(û)^T [ F(x) − η(û) ] = 0. (16)

Equation (16) indicates that the solution to the MLE can be found by mapping the data F(x) onto F_B := {η(u) : u ∈ R^m} orthogonally to the tangent of F_A := {θ(u) : u ∈ R^m}. As θ(u) and η(u) live in two different spaces F_A and F_B, the inner product between the dual spaces is defined as ⟨θ(u), η(u)⟩_Γ := θ(u)^T · Γ · η(u) with a metric Γ; for the flat manifold, the identity matrix serves as the metric Γ. By analogy with the MLE for a general distribution given by (5), Hoeffding [14] presented another interpretation of the MLE for the curved exponential family: η(û) represents the point of F_B located closest to the data point in the sense of the Kullback-Leibler divergence, i.e.,

η(û) = arg min_{η(u) ∈ F_B} D(F(x)‖η(u)), (17)

where D(F(x)‖η(u)) denotes the Kullback-Leibler divergence from the multivariate joint distribution of F(x) to that specified by η(u). Based on the above analysis, there are two important spaces related to a curved exponential family. One is the natural parameter space, denoted by {θ} ⊂ A^n, which is the enveloping space including all the distributions of the exponential family; the other is the expectation parameter space, denoted by {η} ⊂ B^n, the dual space of A^n. The two spaces are dual to each other and flat under the e-connection and the m-connection, respectively.
The curved exponential family (9) is regarded as submanifolds embedded in the two spaces, and the data can also be immersed in these spaces in the form of sufficient statistics F(x). Consequently, the estimators, such as the MLE given by (16), associated with the curved exponential families can be geometrically performed in the two spaces.
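The projection picture can be checked numerically for a simple curved Gaussian N(u, u²), a one-parameter family embedded in the two-parameter Gaussian manifold. In the sketch below (a hypothetical illustration; the distribution, sample size, and search interval are arbitrary choices), maximizing the likelihood over u and minimizing the Gaussian KLD from the moment-matched data distribution yield the same estimate, since the two objectives differ only by a constant in u:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# Curved Gaussian N(u, u^2): one local parameter u inside the 2-D Gaussian family.
u_true = 2.0
x = rng.normal(u_true, abs(u_true), 500)
m, s2 = x.mean(), x.var()     # the data enter only through these moments

def neg_avg_loglik(u):
    # average negative log-likelihood of N(u, u^2)
    return np.log(np.sqrt(2 * np.pi) * abs(u)) + (s2 + (m - u) ** 2) / (2 * u ** 2)

def kld_to_model(u):
    # D( N(m, s2) || N(u, u^2) ): KLD from the moment-matched data distribution
    return np.log(abs(u) / np.sqrt(s2)) + (s2 + (m - u) ** 2) / (2 * u ** 2) - 0.5

u_mle = minimize_scalar(neg_avg_loglik, bounds=(0.1, 10.0), method="bounded").x
u_kld = minimize_scalar(kld_to_model, bounds=(0.1, 10.0), method="bounded").x
print(u_mle, u_kld)           # the two solutions coincide
```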

Geometric Demonstration of the Generalized Likelihood Ratio Test
As mentioned earlier, the GLRT is one of the most widely used approaches to composite hypothesis testing problems with unknown parameters in the PDFs. The data x have the PDF p(x|u_0; H_0) under hypothesis H_0 and p(x|u_1; H_1) under hypothesis H_1, where u_0 and u_1 are unknown parameters under each hypothesis. The GLRT enables a decision by replacing the unknown parameters with their maximum likelihood estimates (MLEs) to implement a likelihood ratio test. The GLRT decides H_1 if

L_G(x) = p(x|û_1; H_1) / p(x|û_0; H_0) > γ, (18)

where û_i is the MLE of u_i (obtained by maximizing p(x|u_i; H_i)). As the parameters u_0 and u_1, as well as their dimensionalities, may or may not be the same under the null and alternative hypotheses, two pictures of the GLRT are presented for the two cases: one with the same unknown parameters under each hypothesis, and the other with different parameters or different dimensionalities. The picture for the first case is illustrated in Figure 2a. In this case, the distributions under the two hypotheses share the same form and the same unknown parameter u; however, the parameter takes different value sets under different hypotheses. The family S = {p(x|u)} can be smoothly embedded as a surface F_B specified by {η(u) : u ∈ R^m} in the space B^n. The hypotheses p(x|u_i; H_i) with unknown u_i define two "uncertainty volumes", Ω_0 and Ω_1, on the submanifold F_B. These volumes are the collections of probability distributions specified by the value sets of the unknown parameter u_i. The measurements x are immersed in B^n in the form of the sufficient statistics F(x). Consequently, the MLE can be found by "mapping" the samples F(x) onto the uncertainty volumes Ω_0 and Ω_1 on F_B. The points p_0 and p_1 in Figure 2 are the corresponding projections, i.e., the MLEs of the unknown parameter under the two hypotheses.
As indicated in (17), the MLEs can also be obtained by finding the points on Ω_0 and Ω_1 that are located closest to the data point in the sense of the KLD, i.e.,

p_i = arg min_{η ∈ Ω_i} D(F(x)‖η), i = 0, 1, (19)

and the corresponding minimum KLDs can be represented by

D_i = min_{η ∈ Ω_i} D(F(x)‖η) = D(F(x)‖p_i), i = 0, 1, (20)

respectively. It should be emphasized that the above "mapping" is a general concept. When the parameters to be estimated are not restricted to a given "value set", the MLE is simply obtained by maximizing the likelihood, and the projections fall onto the submanifold F_B. However, if the parameters to be estimated are restricted to a given "value set", the MLE must be operated by maximizing the likelihood with respect to the given parameter space. In the case where the projections fall outside the "uncertainty volumes", the MLE solutions are given by those points closest to the data point, as described by (19).
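The boundary behavior described above can be illustrated with the simplest possible case, a Gaussian mean restricted to a one-sided value set (the numbers below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)

# x ~ N(u, 1) with the value set u >= 1 under the hypothesis of interest.
x = rng.normal(0.3, 1.0, 50)       # data generated with a mean below the set

u_unrestricted = x.mean()          # unrestricted MLE: projection onto F_B
# The KLD from N(x_bar, 1) to N(u, 1) is (x_bar - u)^2 / 2, so the restricted
# MLE is the point of the value set closest to the projection, i.e. the boundary.
u_restricted = max(u_unrestricted, 1.0)

print(u_unrestricted, u_restricted)
```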
Let R(η_0, ρ) be a divergence sphere centered at η_0 with radius ρ, that is, the submanifold of the enveloping space B^n consisting of the points η for which the KLD D(η_0‖η) equals ρ. Denote this divergence sphere by

R(η_0, ρ) = {η ∈ B^n | D(η_0‖η) = ρ}. (21)

Then, the closest point in (19) may be more easily found via the divergence sphere with center F(x) and radius D_i tangent to Ω_i at p_i, as illustrated in Figure 3. Consequently, according to (2), the GLRT can be geometrically performed by comparing the difference between the minimum KLDs D_0 and D_1 with a threshold γ′, i.e.,

D_0 − D_1 ≷ γ′, (22)

deciding H_1 if ">" holds and H_0 otherwise. In practice, the Neyman-Pearson criterion is commonly employed to determine the threshold γ′ in (22), and the detector maximizes the probability of detection P_D under a given probability of false alarm P_F. As a commonly used performance index, the miss probability P_M usually decays exponentially as the sample size increases. The rate of exponential decay can be represented by [22]

E = lim_{N→∞} −(1/N) ln P_M. (23)

Based on Stein's lemma, for a constant false-alarm constraint, the best error exponent is related to the Kullback-Leibler divergence D(p_0‖p_1) from p_0 to p_1 [23], i.e.,

P_M .= e^{−N D(p_0‖p_1)}, (24)

and

lim_{N→∞} −(1/N) ln P_M = D(p_0‖p_1), (25)

where .= denotes first-order equivalence in the exponent; for example, a_n .= b_n means lim_{n→∞} (1/n) ln (a_n/b_n) = 0.
In the above sense, the KLD from p 0 to p 1 is equivalent to the signal-to-noise ratio (SNR) of the underlying detection problem. Therefore, information geometry offers an insightful geometrical explanation for the detection performance of a Neyman-Pearson detector.
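Stein's lemma can be checked in closed form for a Gaussian mean-shift problem, where the Neyman-Pearson test thresholds the sample mean and the miss probability is available analytically. The following sketch (hypotheses and false-alarm level are illustrative choices) shows the error exponent approaching D(p_0‖p_1) = 1/2:

```python
import numpy as np
from scipy.stats import norm

# H0: N(0, 1) vs H1: N(1, 1), so D(p0 || p1) = 1/2.
# The NP test at false-alarm rate 0.05 decides H1 when the sample mean
# exceeds tau_N = z_0.95 / sqrt(N); then P_M = Phi(z_0.95 - sqrt(N)).
z = norm.ppf(0.95)

rates = []
for N in (100, 1_000, 10_000, 100_000):
    log_pm = norm.logcdf(z - np.sqrt(N))   # ln P_M in closed form
    rates.append(-log_pm / N)              # empirical error exponent
    print(N, rates[-1])                    # approaches 0.5 as N grows
```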
In the second case, the dimensionalities of the unknown parameters u_0 and u_1 are different, while the dimensionality of the enveloping space is common to both hypotheses, owing to the same measurements x. The hypotheses may then correspond to two separate submanifolds, Ω_0 and Ω_1, embedded in B^n, as a result of the different dimensionalities of the unknown parameters. As illustrated in Figure 2b, a surface and a curve are used to denote the submanifolds Ω_0 and Ω_1 corresponding to the two hypotheses, respectively. Similar to the first case, the GLRT with different unknown parameters may also be interpreted geometrically.

A Demonstration of One-Dimensional Curved Gaussian Distribution
Consider the following detection problem: the measurement originates from a curved Gaussian distribution

x ∼ N(u, u²/a²), (26)

where a is a positive constant and u is an unknown parameter. The probability density function of the measurement is

p(x|u) = (a/(√(2π)|u|)) exp{ −a²(x − u)²/(2u²) }. (27)

By reparameterization, the probability density function can be represented in the general form of a curved exponential family as

p(x|u) = exp{ θ(u)^T F(x) + C(x) − ϕ(θ(u)) }, (28)

where C(x) = ln a and the potential function ϕ is

ϕ(θ(u)) = a²/2 + ln(√(2π)|u|). (29)

The above distributions with local parameter u correspond to a one-dimensional curved exponential family embedded in the natural parameter space A. The natural coordinates are

θ(u) = [θ_1, θ_2]^T = [a²/u, −a²/(2u²)]^T, (30)

which define a parabola (denoted by F_A) in A, since θ_2 = −θ_1²/(2a²). The underlying distribution (28) can also be represented in the expectation parameter space B with the expectation coordinates

η(u) = [η_1, η_2]^T = [u, (1 + 1/a²)u²]^T, (31)

which also define a parabola (denoted by F_B). The sufficient statistics F(x) obtained from the samples x can be represented by

F(x) = (1/N) ∑_{i=1}^N [x_i, x_i²]^T. (32)

Figure 4 shows the expectation parameter space and illustrates the geometric interpretation of the underlying GLRT, where the blue parabola denotes the embedding of the curved Gaussian distribution with parameter u. The submanifolds associated with the two hypotheses are geometrically represented by the two branches of the blue parabola (specified by η_1 < 0 and η_1 > 0, respectively). Without loss of generality, assume that a = 1 and u = 2. The blue dots signify N = 100 observations (measurements) in the expectation parameter space with the coordinates (x, x²). The statistical mean of the measurements is used to calculate the sufficient statistics F(x), denoted by a red asterisk. The MLEs of the parameter u under the two hypotheses are obtained by finding the points on the two submanifolds that are closest to the data point in the sense of the KLD. According to (22), the GLRT can be geometrically performed by comparing the difference between the minimum KLDs D_0 and D_1 with a threshold γ′.
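The demonstration can be reproduced numerically. The sketch below is a hypothetical re-implementation (assuming the variance form u²/a², consistent with the constant a in the density, and using the closed-form Gaussian KLD as the divergence): it projects the averaged sufficient statistics onto the two parabola branches and forms the GLRT statistic as the difference of the minimum divergences:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)

# Curved Gaussian x ~ N(u, (u/a)^2) with a = 1 and true parameter u = 2;
# the hypotheses are the two branches of the parabola: u < 0 vs u > 0.
a, u_true, N = 1.0, 2.0, 100
x = rng.normal(u_true, abs(u_true) / a, N)
m, s2 = x.mean(), x.var()     # data point: mean of the sufficient statistics

def kld(u):
    # D( N(m, s2) || N(u, (u/a)^2) ), closed form for two Gaussians
    v = (u / a) ** 2
    return 0.5 * (np.log(v / s2) + (s2 + (m - u) ** 2) / v - 1.0)

D1 = minimize_scalar(kld, bounds=(1e-3, 50.0), method="bounded").fun    # branch u > 0
D0 = minimize_scalar(kld, bounds=(-50.0, -1e-3), method="bounded").fun  # branch u < 0

print(D0, D1, D0 - D1)   # D1 is small, so the statistic D0 - D1 favors H1
```

Because the data were generated on the u > 0 branch, the projection onto that branch lies close to the data point while the projection onto the u < 0 branch does not, and the difference of divergences correctly decides H_1.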

Discussions
The geometric formulation of the GLRT presented above provides additional insights into the GLRT. To the best of our knowledge, there is no general analytical result associated with the performance of the GLRT in the literature [1]. The asymptotic analysis is only valid under the conditions that (1) the data sample size is large; and (2) the MLE asymptotically attains the Cramér-Rao lower bound (CRLB) of the underlying estimation problems.
It is known that the MLE with natural parameters is a sufficient statistic for an exponential family and achieves the CRLB if a suitable measurement function is chosen for the estimation [8]. For curved exponential families, the MLE is not, in general, an efficient estimator, which means that the variance of the MLE may not achieve the CRLB with a finite number of samples. This indicates that, when a finite number of samples is used, there will be a deterioration in performance for both the MLE and the GLRT when the underlying statistical model is a curved exponential family. There will be an inherent information loss (compared with the Fisher information) when implementing an estimation process if the statistical model is nonlinear. Roughly speaking, if the embedded submanifold F_B in Figure 2a and Ω_0, Ω_1 in Figure 2b are curved, the MLEs will not achieve the CRLB, owing to the inherent information loss caused by the non-flatness of the statistical model. The information loss may be quantitatively calculated using the e-curvature of the statistical model [9].
Consequently, if the statistical model associated with a GLRT is not flat, i.e., the submanifolds shown in Figure 2 are curved, there will be a deterioration in performance for the GLRT with a finite number of samples. As the sample size N increases, the sufficient statistics F(x) become better matched to the statistical model and thus closer to the submanifolds (see Figure 2), and the divergence from the data to the submanifold associated with the true hypothesis H_i becomes smaller. Asymptotically, as N → ∞, the sufficient statistics fall onto the submanifold associated with the true hypothesis H_i, so that the corresponding divergence D_i reduces to zero. In this limit, the GLRT achieves perfect performance.

Conclusions
In this paper, the generalized likelihood ratio test is addressed from a geometric viewpoint. Two pictures of the GLRT are presented in the spirit of information geometry theory. Both the detection and the estimation associated with a GLRT are regarded as geometric operations on the manifold of a parameterized family of probability distributions. As demonstrated in this work, the geometric interpretation of the GLRT provides additional insights into the analysis of the GLRT.
Potentially, more constructive analysis can be developed based on the information geometry of the GLRT. For example, the error exponent defined by (24) and (25) provides a useful performance index for the detection process associated with the GLRT. When p_0 and p_1 in (24) are the MLEs (rather than the true values) of the unknown parameters under each hypothesis, there may be a deterioration in performance in the estimation process. Determining how to incorporate such an "estimation loss" into the error exponent remains an open issue. Another open issue is the GLRT with PDFs of different forms under each hypothesis, which leads to a different distribution embedding associated with each hypothesis.